Data Engineer (Spark) Resume

SUMMARY

  • 6 years of professional IT experience, with over 4 years of Hadoop/Spark experience in the ingestion, storage, querying, processing, and analysis of big data.
  • Extensive experience with the Big Data ecosystem and its components: Spark, MapReduce, Spark SQL, SQL, HDFS, Hive, HBase, Pig, Sqoop, ZooKeeper, Oozie, and Airflow.
  • Strong knowledge of Hadoop architectural components such as the Hadoop Distributed File System, NameNode, DataNode, TaskTracker, JobTracker, MRv2, and MapReduce programming.
  • Experience in cleansing and analyzing data using HiveQL, Pig Latin, and custom MapReduce programs.
  • Implemented Spark SQL and the DataFrame API to read data from Hive and process it in a distributed, highly scalable fashion (see the sketch after this list).
  • Worked with Spark Core transformations and actions along with the RDD and Dataset APIs.
  • Experience in writing MapReduce-based data processing jobs using Sqoop, Pig, and Hive.
  • Exposure to NoSQL databases and hands-on experience writing applications against NoSQL stores such as HBase, DynamoDB, and Cassandra.
  • Experience in importing and exporting data between RDBMS sources (Netezza, Oracle) and the Hadoop data lake using Sqoop jobs.
  • Imported incremental transactional data in both append and last-modified modes.
  • Implemented SCD Type 2 using Sqoop and Hive.
  • Experience in handling different file formats: Parquet, Apache Avro, JSON, ORC, spreadsheets, and flat files.
  • Designed and developed Hive data transformation scripts to work against structured data from various data sources and created a baseline.
  • Handled performance optimization of PL/SQL code and shell scripts.
  • Experience with pushdown optimization in Informatica.
  • Experience in real-time streaming applications and large-scale batch distributed computing applications using tools such as Spark Streaming.
  • Implemented data recovery mechanisms and disaster recovery in a distributed environment.
  • Experience in working with distributed scheduling using Oozie and Airflow.
  • Optimized performance of HBase/Pig jobs.
  • Strong knowledge of data transformation and data migration design and implementation principles.
  • Experience in deploying Hadoop clusters on Amazon Web Services (AWS) EC2 instances.
  • Implemented data pipelines using the AWS services S3, EC2, and EMR.
  • Optimized Spark workflows both on-premises (CDH) and in the cloud (AWS).
  • Designed and created data models for customer data accessed through HBase query APIs.
  • Goal-oriented individual with strong analytical skills.
  • Flexible, enthusiastic, and project-oriented team player with solid communication and leadership skills to develop creative solutions for challenging client needs.
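
As a minimal sketch of the Spark SQL / DataFrame work against Hive described above (assuming a Hive-enabled SparkSession; the database, table, and column names are hypothetical placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive-enabled session; assumes hive-site.xml is available to Spark
    spark = (SparkSession.builder
             .appName("hive-read-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a Hive table through Spark SQL (hypothetical database/table)
    txns = spark.sql(
        "SELECT customer_id, amount FROM sales_db.transactions WHERE ds = '2020-01-01'"
    )

    # Distributed processing with the DataFrame API
    daily_totals = (txns.groupBy("customer_id")
                        .agg(F.sum("amount").alias("total_amount")))
    daily_totals.show()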

TECHNICAL SKILLS

Big Data Eco-system: Hadoop, HDFS, Spark, PySpark, Hive, Java, Sqoop, MapReduce, Oozie, Airflow, Pig, YARN, Kafka, Spark Streaming

NoSQL Databases: HBase (also comfortable with other NoSQL stores such as DynamoDB and Cassandra)

Databases: Netezza, Teradata, Oracle.

CI Tools: Jenkins, GitHub, Jira

Programming/Scripting Languages: Python, Scala, Shell Scripting.

Data Warehousing: ETL, Informatica PowerExchange, OLAP, OLTP, Redshift, Snowflake, Workflow Manager, and Workflow Monitor.

Hadoop Distribution: Apache, EMR, Cloudera

Operating Systems: Linux, RHEL, Windows.

PROFESSIONAL EXPERIENCE

Data Engineer (Spark)

Confidential

Responsibilities:

  • Designed and developed data integration/engineering workflows on big data technologies (Hadoop, Spark, Sqoop, Hive, HBase).
  • Performed requirement gathering and analysis of the requirement documents.
  • Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
  • Developed Spark programs using the Python API to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
  • Created PySpark data frames to bring data from RDBMS sources to Amazon S3 (see the sketch after this list).
  • Involved in importing data into HDFS and Hive using Sqoop; created Hive tables, loaded them with data, and wrote Hive queries.
  • Tuned Hive and Spark with partitioning/bucketing of Parquet data and with executor/driver memory settings.
  • Developed optimized Hive queries to move data from the RDBMS into the Hadoop staging area.
  • Designed and implemented an archiving process from HDFS to S3.
  • Handled large datasets during the ingestion process itself using partitions, Spark in-memory capabilities, broadcasts, and effective, efficient joins and transformations.
  • Processed S3 data, created external tables using Hive, and developed reusable scripts to ingest and repair tables across the project.
  • Designed and developed the data lake; analyzed and evaluated multiple solutions, considering cost factors across the business as well as the operational impact on customer data.
  • Involved in the iteration planning process under the Agile Scrum methodology.
  • Worked on Hive Metastore backup, partitioning, and bucketing techniques in Hive to improve performance; tuned Spark and Scala jobs.
  • Worked on data transformation, data migration, and processing using Spark (Python, Spark SQL) and Hive.
  • Worked closely with the data science team to understand requirements clearly and created Hive tables on HDFS.
  • Developed Spark scripts using Python as per the requirements.
  • Scheduled Spark/Scala jobs using Airflow workflows on the Hadoop cluster (see the DAG sketch after the environment line below) and generated detailed design documentation for the source-to-target transformations.
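
A minimal PySpark sketch of the RDBMS-to-S3 ingestion mentioned above; the JDBC URL, credentials, table, partition column, and bucket are hypothetical placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-s3-sketch").getOrCreate()

    # Pull a table over JDBC (hypothetical Oracle host, credentials, and table)
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
              .option("dbtable", "SALES.ORDERS")
              .option("user", "etl_user")
              .option("password", "********")
              .option("fetchsize", "10000")
              .load())

    # Land the data on S3 as partitioned Parquet (hypothetical bucket/prefix);
    # s3:// paths work on EMR, while plain Spark clusters typically use s3a://
    (orders.write
           .mode("overwrite")
           .partitionBy("ORDER_DATE")
           .parquet("s3://example-data-lake/raw/orders/"))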

Environment: Cloudera Distribution, Spark, Scala, HDFS, Hive, Sqoop, Python, AWS (EMR, S3), and Parquet data files
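
A minimal Airflow DAG sketch for the Spark job scheduling mentioned in the responsibilities above (Airflow 2.x import paths; the DAG id, schedule, and spark-submit arguments are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    default_args = {
        "owner": "data-eng",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    # Daily schedule for a (hypothetical) Spark ingestion job on the cluster
    with DAG(
        dag_id="daily_spark_ingest",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 2 * * *",  # 02:00 every day
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_ingest",
            bash_command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "--executor-memory 4g --num-executors 10 "
                "/opt/jobs/ingest_orders.py --run-date {{ ds }}"
            ),
        )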

Hadoop Developer

Confidential

Responsibilities:

  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Worked with different datasets by importing and exporting data from various sources into Spark using the Data Source API and performed computations to generate the output.
  • As part of upgrading the existing data processing systems, converted major logical units of Hive queries to DataFrames in Spark SQL for better performance (see the sketch after this list).
  • Involved in a complete remodeling of the data processing pipeline by redesigning the data flow.
  • Performed deep data analytics to gain insights using query engines such as Hive and Impala.
  • Optimized Hive queries using various file formats such as Parquet, JSON, and Avro.
  • Partitioned the data to keep each incremental load in a separate folder.
  • Developed Sqoop scripts to import data from the Oracle database.
  • Created external Hive tables on top of parsed data, prepared Hive DDLs and queries, and was involved in the data ingestion process (see the external-table sketch after the environment line below).
  • Worked with various HDFS file formats such as Avro and Parquet, and compression formats such as Snappy.
  • Implemented development activities in a fully agile model using JIRA and Git.
  • Helped build an automated test suite to validate the output data without any manual intervention.
  • Involved in the creation of production deployment forms and script reviews.
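
A minimal sketch of the Hive-query-to-DataFrame conversion mentioned above; the databases, tables, and columns are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL (hypothetical tables):
    #   SELECT c.region, SUM(o.amount) AS revenue
    #   FROM sales_db.orders o JOIN sales_db.customers c ON o.customer_id = c.id
    #   GROUP BY c.region
    orders = spark.table("sales_db.orders")
    customers = spark.table("sales_db.customers")

    # Equivalent DataFrame version; Catalyst optimizes the whole plan
    revenue_by_region = (orders.join(customers, orders.customer_id == customers.id)
                               .groupBy("region")
                               .agg(F.sum("amount").alias("revenue")))
    revenue_by_region.write.mode("overwrite").saveAsTable("sales_db.revenue_by_region")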

Environment: Hadoop, Spark, Hive, Sqoop, SQL Server, Python, Hue, Git, Eclipse, Control-M, UNIX, Shell.
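
A minimal sketch of creating an external Hive table over parsed Parquet data and repairing its partitions, as mentioned in the responsibilities above (database, schema, and HDFS location are hypothetical placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("external-table-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # External table over already-parsed Parquet files (hypothetical schema/location)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS staging_db.parsed_events (
            event_id   STRING,
            event_type STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (ds STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/staging/parsed_events'
    """)

    # Register partition folders that were written directly to the table location
    spark.sql("MSCK REPAIR TABLE staging_db.parsed_events")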

ETL Developer

Confidential

Responsibilities:

  • Created SSIS packages using SSIS Designer to export heterogeneous data from OLE DB sources (Oracle) and Excel spreadsheets to SQL Server 2005/2008.
  • Worked on SSIS packages and DTS Import/Export for transferring data from databases (Oracle and text-format data) to SQL Server.
  • Worked on problem tickets for SSIS requirements; took the lead on change tickets and deployed SSIS packages.
  • Developed SSIS packages based on the provided mapping specifications and logic using the required transformations.
  • Created agent job schedules and promoted them through all environments; provided post-deployment support to fix any issues.
  • Found the root cause of SSIS package issues and provided fixes.
  • Responsible for DBA activities such as database backups and restores, troubleshooting security, performance, and connectivity issues, monitoring disk space utilization, and addressing temp log issues.
