Data Engineer (Spark) Resume

SUMMARY

  • 6 years of professional IT experience, with over 4 years of Hadoop/Spark experience in the ingestion, storage, querying, processing, and analysis of big data.
  • Extensive experience with the Big Data ecosystem and its components: Spark, MapReduce, Spark SQL, SQL, HDFS, Hive, HBase, Pig, Sqoop, ZooKeeper, Oozie, and Airflow.
  • Strong knowledge of Hadoop architectural components such as the Hadoop Distributed File System, NameNode, DataNode, TaskTracker, JobTracker, MRv2, and MapReduce programming.
  • Experience in cleansing and analyzing data using HiveQL, Pig Latin, and custom MapReduce programs.
  • Implemented Spark SQL and the DataFrame API to read data from Hive and process it in a distributed, highly scalable fashion (see the sketch after this list).
  • Worked with Spark Core transformations and actions along with the RDD and Dataset APIs.
  • Experience in writing MapReduce-based data processing jobs using Sqoop, Pig, and Hive.
  • Exposure to NoSQL databases and hands-on experience writing applications against NoSQL stores such as HBase, DynamoDB, and Cassandra.
  • Experience in importing and exporting data between RDBMS sources (Netezza, Oracle) and the Hadoop data lake using Sqoop jobs.
  • Imported incremental transactional data in both append and last-modified modes.
  • Implemented SCD Type 2 using Sqoop and Hive.
  • Experience in handling different file formats: Parquet, Apache Avro, JSON, ORC, spreadsheets, and flat files.
  • Designed and developed Hive data transformation scripts to work against structured data from various data sources and created a baseline.
  • Handled performance optimization of PL/SQL code and shell scripts.
  • Experience with pushdown optimization in Informatica.
  • Experience in real-time streaming applications and large-scale batch distributed computing applications using tools such as Spark Streaming.
  • Implemented data recovery mechanisms and disaster recovery in a distributed environment.
  • Experience in working with distributed scheduling using Oozie and Airflow.
  • Optimized performance of HBase/Pig jobs.
  • Strong knowledge of data transformation and data migration design and implementation principles.
  • Experience in deploying Hadoop clusters on Amazon Web Services (AWS) EC2 instances.
  • Implemented data pipelines using the AWS services S3, EC2, and EMR.
  • Optimized Spark workflows both on-premises (CDH) and in the cloud (AWS).
  • Designed and created data models for customer data accessed through HBase query APIs.
  • Goal-oriented individual with strong analytical skills.
  • Flexible, enthusiastic, and project-oriented team player with solid communication and leadership skills to develop creative solutions for challenging client needs.
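
As a minimal sketch of the Spark SQL / DataFrame work against Hive described above (assuming a Hive-enabled SparkSession; the database, table, and column names are hypothetical placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hive-enabled session; assumes hive-site.xml is available to Spark
    spark = (SparkSession.builder
             .appName("hive-read-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a Hive table through Spark SQL (hypothetical database/table)
    txns = spark.sql(
        "SELECT customer_id, amount FROM sales_db.transactions WHERE ds = '2020-01-01'"
    )

    # Distributed processing with the DataFrame API
    daily_totals = (txns.groupBy("customer_id")
                        .agg(F.sum("amount").alias("total_amount")))
    daily_totals.show()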

TECHNICAL SKILLS

Big Data Eco-system: Hadoop, HDFS, Spark, PySpark, Hive, Java, Sqoop, MapReduce, Oozie, Airflow, Pig, YARN, Kafka, Spark Streaming

NoSQL Databases: HBase (also comfortable with other NoSQL stores such as DynamoDB and Cassandra)

Databases: Netezza, Teradata, Oracle.

CI Tools: Jenkins, GitHub, Jira

Programming/Scripting Languages: Python, Scala, Shell Scripting.

Data Warehousing: ETL, Informatica PowerExchange, OLAP, OLTP, Redshift, Snowflake, Workflow Manager, and Workflow Monitor.

Hadoop Distribution: Apache, EMR, Cloudera

Operating Systems: Linux, RHEL, Windows.

PROFESSIONAL EXPERIENCE

Data Engineer (Spark)

Confidential

Responsibilities:

  • Designed and developed data integration/engineering workflows on big data technologies (Hadoop, Spark, Sqoop, Hive, HBase).
  • Performed requirement gathering and analysis of the requirement documents.
  • Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
  • Developed Spark programs using the Python API to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
  • Created PySpark data frames to bring data from RDBMS sources to Amazon S3 (see the sketch after this list).
  • Involved in importing data into HDFS and Hive using Sqoop; created Hive tables, loaded them with data, and wrote Hive queries.
  • Tuned Hive and Spark with partitioning/bucketing of Parquet data and with executor/driver memory settings.
  • Developed optimized Hive queries to move data from the RDBMS into the Hadoop staging area.
  • Designed and implemented an archiving process from HDFS to S3.
  • Handled large datasets during the ingestion process itself using partitions, Spark in-memory capabilities, broadcasts, and effective, efficient joins and transformations.
  • Processed S3 data, created external tables using Hive, and developed reusable scripts to ingest and repair tables across the project.
  • Designed and developed the data lake; analyzed and evaluated multiple solutions, considering cost factors across the business as well as the operational impact on customer data.
  • Involved in the iteration planning process under the Agile Scrum methodology.
  • Worked on Hive Metastore backup, partitioning, and bucketing techniques in Hive to improve performance; tuned Spark and Scala jobs.
  • Worked on data transformation, data migration, and processing using Spark (Python, Spark SQL) and Hive.
  • Worked closely with the data science team to understand requirements clearly and created Hive tables on HDFS.
  • Developed Spark scripts using Python as per the requirements.
  • Scheduled Spark/Scala jobs using Airflow workflows on the Hadoop cluster (see the DAG sketch after the environment line below) and generated detailed design documentation for the source-to-target transformations.
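
A minimal PySpark sketch of the RDBMS-to-S3 ingestion mentioned above; the JDBC URL, credentials, table, partition column, and bucket are hypothetical placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdbms-to-s3-sketch").getOrCreate()

    # Pull a table over JDBC (hypothetical Oracle host, credentials, and table)
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
              .option("dbtable", "SALES.ORDERS")
              .option("user", "etl_user")
              .option("password", "********")
              .option("fetchsize", "10000")
              .load())

    # Land the data on S3 as partitioned Parquet (hypothetical bucket/prefix);
    # s3:// paths work on EMR, while plain Spark clusters typically use s3a://
    (orders.write
           .mode("overwrite")
           .partitionBy("ORDER_DATE")
           .parquet("s3://example-data-lake/raw/orders/"))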

Environment: Cloudera Distribution, Spark, Scala, HDFS, Hive, Sqoop, Python, AWS (EMR, S3), and Parquet data files
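
A minimal Airflow DAG sketch for the Spark job scheduling mentioned in the responsibilities above (Airflow 2.x import paths; the DAG id, schedule, and spark-submit arguments are hypothetical):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    default_args = {
        "owner": "data-eng",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    # Daily schedule for a (hypothetical) Spark ingestion job on the cluster
    with DAG(
        dag_id="daily_spark_ingest",
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="0 2 * * *",  # 02:00 every day
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="spark_ingest",
            bash_command=(
                "spark-submit --master yarn --deploy-mode cluster "
                "--executor-memory 4g --num-executors 10 "
                "/opt/jobs/ingest_orders.py --run-date {{ ds }}"
            ),
        )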

Hadoop Developer

Confidential

Responsibilities:

  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Worked with different datasets by importing and exporting data from various sources into Spark using the Data Source API and performed computations to generate the output.
  • As part of upgrading the existing data processing systems, converted major logical units of Hive queries to DataFrames in Spark SQL for better performance (see the sketch after this list).
  • Involved in a complete remodeling of the data processing pipeline by redesigning the data flow.
  • Performed deep data analytics to gain insights using query engines such as Hive and Impala.
  • Optimized Hive queries using various file formats such as Parquet, JSON, and Avro.
  • Partitioned the data to keep each incremental load in a separate folder.
  • Developed Sqoop scripts to import data from the Oracle database.
  • Created external Hive tables on top of parsed data, prepared Hive DDLs and queries, and was involved in the data ingestion process (see the external-table sketch after the environment line below).
  • Worked with various HDFS file formats such as Avro and Parquet, and compression formats such as Snappy.
  • Implemented development activities in a fully agile model using JIRA and Git.
  • Helped build an automated test suite to validate the output data without any manual intervention.
  • Involved in the creation of production deployment forms and script reviews.
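
A minimal sketch of the Hive-query-to-DataFrame conversion mentioned above; the databases, tables, and columns are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-dataframe-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL (hypothetical tables):
    #   SELECT c.region, SUM(o.amount) AS revenue
    #   FROM sales_db.orders o JOIN sales_db.customers c ON o.customer_id = c.id
    #   GROUP BY c.region
    orders = spark.table("sales_db.orders")
    customers = spark.table("sales_db.customers")

    # Equivalent DataFrame version; Catalyst optimizes the whole plan
    revenue_by_region = (orders.join(customers, orders.customer_id == customers.id)
                               .groupBy("region")
                               .agg(F.sum("amount").alias("revenue")))
    revenue_by_region.write.mode("overwrite").saveAsTable("sales_db.revenue_by_region")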

Environment: Hadoop, Spark, Hive, Sqoop, SQL Server, Python, Hue, Git, Eclipse, Control-M, UNIX, Shell.
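
A minimal sketch of creating an external Hive table over parsed Parquet data and repairing its partitions, as mentioned in the responsibilities above (database, schema, and HDFS location are hypothetical placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("external-table-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # External table over already-parsed Parquet files (hypothetical schema/location)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS staging_db.parsed_events (
            event_id   STRING,
            event_type STRING,
            amount     DOUBLE
        )
        PARTITIONED BY (ds STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/staging/parsed_events'
    """)

    # Register partition folders that were written directly to the table location
    spark.sql("MSCK REPAIR TABLE staging_db.parsed_events")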

ETL Developer

Confidential

Responsibilities:

  • Created SSIS packages using SSIS Designer to export heterogeneous data from OLE DB sources (Oracle) and Excel spreadsheets to SQL Server 2005/2008.
  • Worked on SSIS packages and DTS Import/Export for transferring data from databases (Oracle and text-format data) to SQL Server.
  • Worked on problem tickets for SSIS requirements; took the lead on change tickets and deployed SSIS packages.
  • Developed SSIS packages based on the provided mapping specifications and logic using the required transformations.
  • Created agent job schedules and promoted them through all environments; provided post-deployment support to fix any issues.
  • Found the root cause of SSIS package issues and provided fixes.
  • Responsible for DBA activities such as database backups and restores, troubleshooting security, performance, and connectivity issues, monitoring disk space utilization, and addressing temp log issues.
