
Data Engineer Resume


Detroit, MI

SUMMARY

  • 8 years of IT experience in the analysis, design, and development of ETL and data pipelines using Hadoop ecosystem tools and IBM InfoSphere DataStage.
  • Proficient in analyzing business process requirements and translating them into technical requirements, and in creating design overviews and technical design documents.
  • Good understanding of distributed computing, cloud computing and parallel processing frameworks.
  • Hands-on experience developing efficient solutions using Hadoop ecosystem tools such as Hive, Pig, Oozie, Flume, PySpark, HDFS, and Storm.
  • Developed a large-scale data ingestion framework using Sqoop to ingest data from Oracle, SQL Server, and Teradata RDBMS systems.
  • Experience writing user-defined functions (UDFs) for custom functionality in Hive and Spark.
  • Worked with data sources such as flat files, XML, JSON, and RDBMS tables, and stored them in HDFS as Parquet, ORC, and Avro.
  • Extensive experience transforming data with Spark DataFrame API functions and RDD operations such as filter, join, map, and flatMap (see the DataFrame sketch after this list).
  • Experience building structured streaming applications using Spark streaming modules.
  • Knowledge of setting up Kafka topics and using Kafka producers and consumers to land data from Kafka streams into HDFS (see the streaming sketch after this list).
  • Implemented Slowly Changing Dimensions, star schemas, and 3NF data models using IBM DataStage.
  • Experience working with Amazon Web Services such as Elastic MapReduce (EMR), EC2, S3, and Athena.
  • Proficient in writing error handling, reconciliation, and logging frameworks.
  • Experience doing data development under both Waterfall and Agile methodologies; worked in both Scrum and Kanban practices.
  • Proficient in building streaming applications using Spark Streaming and Flume.
  • Proficient with Git and Subversion version control tools such as Bitbucket, GitHub, and SVN.
  • Worked with IDEs such as PyCharm, VS Code, Jupyter, and Cloudera Data Science Workbench.
  • Experience using CI/CD pipelines to simplify the code migration process.
  • Experience writing shell scripts for file watchers and spark-submit wrappers.
  • Hands-on experience scheduling jobs in Control-M, Oozie, and Autosys.
  • Coordinated with vendors and BAs on support and maintenance of various applications.
  • Excellent communication and interpersonal skills, ability to learn quickly, good analytical reasoning, and high adaptability to new technologies and tools.
  • Strong teamwork and relationship management skills.
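
To illustrate the DataFrame work referenced above, a minimal PySpark sketch (the paths, table and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders_transform").getOrCreate()

    # Hypothetical sources landed in HDFS as Parquet after RDBMS ingestion
    orders = spark.read.parquet("/data/raw/orders")
    customers = spark.read.parquet("/data/raw/customers")

    # Filter, join, and aggregate with the DataFrame API
    daily_totals = (
        orders.filter(F.col("order_status") == "COMPLETE")
              .join(customers, on="customer_id", how="inner")
              .groupBy("order_date", "customer_region")
              .agg(F.sum("order_amount").alias("total_amount"))
    )

    # Store the curated dataset back to HDFS in Parquet
    daily_totals.write.mode("overwrite").parquet("/data/curated/daily_totals")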
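
And for the Kafka-to-HDFS landing pattern referenced above, a minimal Structured Streaming sketch (assumes the spark-sql-kafka connector is available on the classpath; broker, topic, and paths are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

    # Hypothetical Kafka broker list and topic
    events = (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "events_topic")
             .option("startingOffsets", "latest")
             .load()
    )

    # Kafka delivers key/value as binary; cast the payload to string before landing it
    payload = events.selectExpr("CAST(value AS STRING) AS json_payload", "timestamp")

    # Write the stream to HDFS with a checkpoint directory for recoverable file output
    query = (
        payload.writeStream
               .format("parquet")
               .option("path", "/data/streaming/events")
               .option("checkpointLocation", "/data/checkpoints/events")
               .outputMode("append")
               .start()
    )
    query.awaitTermination()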

TECHNICAL SKILLS

Hadoop Distributions: Cloudera, Hortonworks and AWS EMR

Big Data Ecosystem Tools: Hive, Spark, Oozie, Impala, Spark SQL, PySpark, Pig, Spark Streaming, Kafka

Languages: Python, Shell Scripting, SQL

ETL: IBM DataStage 8.x

AWS Services: S3, EC2, EMR, Athena

Misc Tools: PyCharm, Jupyter, Bitbucket, Jira, PuTTY, Control-M, Quality Center, Cloudera Data Science Workbench

PROFESSIONAL EXPERIENCE

Confidential, Detroit, MI

Data Engineer

Responsibilities:

  • Hands-on with major Hadoop ecosystem components such as Spark, HDFS, Hive, HBase, ZooKeeper, Sqoop, and Oozie.
  • Developing Sqoop jobs to ingest data from various systems of record into the enterprise data lake.
  • Developing Spark jobs in PySpark and Spark SQL that run on top of Hive tables and create transformed datasets for downstream consumption (see the Spark SQL sketch after this list).
  • Working with business analysts to convert functional requirements into technical requirements and build appropriate data pipelines.
  • Ingesting data from various source systems such as Oracle, SQL Server, flat files, and JSON.
  • Conducting exploratory data analysis in Jupyter notebooks using Python libraries and sharing the findings.
  • Performance-tuning Spark and Hive jobs by reading execution plans, DAGs, and YARN logs.
  • Creating generic shell scripts to submit Hadoop and Spark jobs on EMR and on-prem edge nodes.
  • Worked on migrating on-prem Hadoop cluster data and data pipelines to the AWS cloud.
  • Writing complex Spark SQL code to clean, join, transform, and aggregate datasets and publish them for the Power BI team to produce operational scorecards.
  • Writing custom Python modules for reusable code.
  • Designing appropriate partitioning and bucketing schemes and making sure correct load policies are employed so data is stored as per requirements (see the partitioned-write sketch after this list).
  • Creating Oozie workflows and coordinators and scheduling handshake jobs in Control-M.
  • Working with production support and administration teams to ensure correct access controls are set up on each Hive database.
  • Working with governance teams to ensure metadata management, data lineage, and technical metadata are correctly updated for each data asset.
  • Working in a master/feature-branch model and committing code with appropriate comments.
  • Attending sprint planning and other Agile ceremonies and demoing work products on a bi-weekly basis.
  • Documenting data flow diagrams and technical logic in Confluence.
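
A minimal sketch of the Spark SQL-over-Hive pattern referenced in the list above (database, table, and column names are hypothetical):

    from pyspark.sql import SparkSession

    # enableHiveSupport lets Spark SQL read tables registered in the cluster's Hive metastore
    spark = (
        SparkSession.builder
            .appName("claims_scorecard")
            .enableHiveSupport()
            .getOrCreate()
    )

    # Hypothetical Hive tables; the join/aggregation mirrors the scorecard-style transforms described above
    scorecard = spark.sql("""
        SELECT c.claim_month,
               p.provider_region,
               COUNT(*)           AS claim_count,
               SUM(c.paid_amount) AS total_paid
        FROM   edl.claims c
        JOIN   edl.providers p
          ON   c.provider_id = p.provider_id
        WHERE  c.claim_status = 'PAID'
        GROUP BY c.claim_month, p.provider_region
    """)

    # Publish the transformed dataset as a Hive table for downstream (e.g. Power BI) consumption
    scorecard.write.mode("overwrite").saveAsTable("curated.claims_scorecard")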
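
And a minimal sketch of the partitioned, bucketed write referenced above (paths, table, and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned_load").enableHiveSupport().getOrCreate()

    # Hypothetical staging dataset
    transactions = spark.read.parquet("/data/staging/transactions")

    # Partition by load date so each daily load only touches one directory,
    # and bucket by account_id to speed up joins on that key
    (
        transactions.write
            .mode("append")
            .partitionBy("load_date")
            .bucketBy(32, "account_id")
            .sortBy("account_id")
            .format("parquet")
            .saveAsTable("curated.transactions")
    )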

Environment: Cloudera Hadoop distribution, AWS EMR, S3, Athena, Hive, Impala, PySpark, Spark SQL, Oracle 11g/12c, Jira, Bitbucket, Power BI, Control-M

Confidential, Detroit, MI

Hadoop Developer

Responsibilities:

  • Developed Spark scripts in Scala as per requirements.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
  • Performed different types of transformations and actions on RDDs to meet business requirements (see the RDD sketch after this list).
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Also worked on analyzing the Hadoop cluster and different big data analytic tools, including HBase and Sqoop.
  • Involved in loading data from the UNIX file system to HDFS.
  • Responsible for managing data coming from various sources.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Involved in managing and reviewing Hadoop log files.
  • Imported data using Sqoop to load data from SQL Server to HDFS on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Responsible for writing Hive queries for data analysis to meet the business requirements.
  • Responsible for creating Hive tables and working on them using HiveQL.
  • Responsible for importing and exporting data into HDFS and Hive using Sqoop.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Extended Hive core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregate Functions (UDAFs) (see the UDF sketch after this list).
  • Used the Spark framework for both batch and real-time data processing.
  • Hands-on processing of data using Spark Streaming API.
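
The RDD work above was done in Scala; purely for illustration, an equivalent minimal sketch in PySpark (the path and record layout are hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd_example")

    # Hypothetical pipe-delimited log lines: user_id|action|bytes
    lines = sc.textFile("/data/raw/activity_logs")

    # Transformations: parse, filter out bad records, map to (user, bytes) pairs, aggregate
    bytes_by_user = (
        lines.map(lambda line: line.split("|"))
             .filter(lambda fields: len(fields) == 3 and fields[2].isdigit())
             .map(lambda fields: (fields[0], int(fields[2])))
             .reduceByKey(lambda a, b: a + b)
    )

    # Action: materialize a small sample of the in-memory computation
    for user, total in bytes_by_user.take(10):
        print(user, total)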
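
Hive UDFs, UDTFs, and UDAFs themselves are written in Java; on the Spark side of the same pipeline, equivalent custom logic can be registered as a Python UDF. A minimal sketch with hypothetical names:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("custom_udf").enableHiveSupport().getOrCreate()

    # Hypothetical custom logic: normalize free-text member IDs
    def normalize_member_id(raw):
        return raw.strip().upper().replace("-", "") if raw else None

    # Register the function so it can be called from Spark SQL
    spark.udf.register("normalize_member_id", normalize_member_id, StringType())

    # Use the registered function in a query over a (hypothetical) Hive table
    cleaned = spark.sql(
        "SELECT normalize_member_id(member_id) AS member_id, claim_amount FROM edl.claims"
    )
    cleaned.show(5)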

Environment: Hortonworks Ambari, Tez, Hive, VersionOne, GitHub, Control-M, Shell scripting, Spark Streaming, ServiceNow, SQL Server.

Confidential, Charlotte, NC

Hadoop Developer

Responsibilities:

  • Involved in developing the roadmap for migrating enterprise data from multiple sources, such as SQL Server and provider databases, into HDFS, which serves as a centralized data hub across the organization.
  • Loaded and transformed large sets of structured and semi-structured data from various downstream systems.
  • Developed ETL pipelines using Spark and Hive to perform various business-specific transformations.
  • Built applications and automated Spark pipelines for bulk loads as well as incremental loads of various datasets (see the incremental-load sketch after this list).
  • Worked closely with our team's data analysts and consumers to shape the datasets per the requirements.
  • Automated the data pipeline to ETL all datasets, supporting both full and incremental loads.
  • Built input adapters for data dumps from FTP servers using Apache Spark.
  • Wrote Spark applications to inspect, clean, load, and transform large sets of structured and semi-structured data.
  • Developed Spark jobs with Python and Spark SQL for testing and processing data.
  • Reported Spark job statistics and made monitoring and data quality checks available for each dataset.
  • Used SQL programming skills to work with relational SQL databases.
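
A minimal sketch of the incremental-load-with-quality-check pattern referenced above (tables, columns, and the watermark logic are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("incremental_load").enableHiveSupport().getOrCreate()

    # Watermark: the latest load timestamp already present in the (hypothetical) target table
    last_loaded = spark.sql(
        "SELECT CAST(MAX(updated_ts) AS STRING) AS max_ts FROM curated.members"
    ).collect()[0]["max_ts"]

    source = spark.table("staging.members")

    # Full load if the target is empty, otherwise only rows changed since the watermark
    increment = source if last_loaded is None else source.filter(F.col("updated_ts") > last_loaded)

    # Basic data quality check before publishing: no null business keys
    assert increment.filter(F.col("member_id").isNull()).count() == 0, "null member_id found"

    increment.write.mode("append").saveAsTable("curated.members")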

Environment: Cloudera Hadoop distribution, Hive, Impala, Cognos, IBM DataStage, Shell scripting, SQL, PL/SQL and Autosys.

Confidential

Datastage Developer

Responsibilities:

  • Involved in requirement gathering with the business team and in creating the ETL design document and technical specifications document for the project.
  • Optimized the performance of the Informatica mappings by analyzing job logs and identifying bottlenecks (source/target/stages).
  • Created UNIX shell scripts to invoke the Informatica workflows and Oracle stored procedures.
  • Implemented Slowly Changing Dimension Type 1 and Type 2 logic for inserting and updating target tables to maintain history (see the Type 2 sketch after this list).
  • Prompt in responding to business user queries and changes. Designed and developed DataStage jobs to load data from flat file, Oracle, and MS SQL Server sources.
  • Developed custom ETL objects to load data in a generic fashion.
  • Responsible for developing and redefining several complex jobs and job sequencers to process various feeds using different DataStage stages and properties.
  • Troubleshot issues and created automatic script/SQL generators.
  • Designed unit test documents after DataStage development and verified results before moving code to QA.
  • Supported the test environments.
  • Prepared production transition documentation and provided warranty support for production support teams.
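
DataStage implements Type 2 dimensions with lookup and change-capture stages; purely to illustrate the Type 2 pattern itself, a small standalone Python sketch with hypothetical columns:

    from datetime import date

    # Hypothetical customer dimension rows: business key, tracked attribute, effective dates, current flag
    dim = [
        {"cust_id": 1, "city": "Detroit", "eff_from": date(2010, 1, 1), "eff_to": None, "is_current": True},
    ]
    incoming = [
        {"cust_id": 1, "city": "Charlotte"},   # changed attribute -> expire old row, insert new version
        {"cust_id": 2, "city": "Chicago"},     # new key -> plain insert
    ]

    today = date.today()
    current = {row["cust_id"]: row for row in dim if row["is_current"]}

    for src in incoming:
        existing = current.get(src["cust_id"])
        if existing is None or existing["city"] != src["city"]:
            if existing is not None:
                # Type 2: close out the old version instead of updating it in place
                existing["eff_to"] = today
                existing["is_current"] = False
            dim.append({"cust_id": src["cust_id"], "city": src["city"],
                        "eff_from": today, "eff_to": None, "is_current": True})

    for row in dim:
        print(row)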

Environment: DataStage 8.1, Oracle 10g, SQL Server 2005, TOAD, SQL, Unix.
