Data Engineer Resume
Detroit, MI
SUMMARY
- 8 years of IT experience in the analysis, design and development of ETL and data pipelines using Hadoop ecosystem tools and IBM InfoSphere DataStage.
- Proficient in analyzing business process requirements and translating them into technical requirements, and in creating design overviews and technical design documents.
- Good understanding of distributed computing, cloud computing and parallel processing frameworks.
- Hands-on experience developing efficient solutions using Hadoop ecosystem tools such as Hive, Pig, Oozie, Flume, PySpark, HDFS and Storm.
- Developed a large-scale data ingestion framework using Sqoop to ingest data from Oracle, SQL Server and Teradata RDBMS systems.
- Experience writing user-defined functions for custom functionality in Hive and Spark.
- Worked with different data sources such as flat files, XML, JSON and RDBMS, and stored the data in HDFS in Parquet, ORC and Avro formats.
- Extensive experience transforming data using Spark DataFrame API functions and RDD functions such as filter, join, map and flatMap (see the PySpark sketch at the end of this summary).
- Experience working with Structured Streaming using the Spark streaming modules.
- Knowledge of setting up Kafka topics and using Kafka producers and consumers to store data from Kafka streams into HDFS.
- Implemented Slowly Changing Dimensions, star schemas and 3NF data models using IBM DataStage.
- Experience working with Amazon Web Services such as Elastic Map Reduce, EC2, S3 and Athena.
- Proficient in writing error handling, reconciliation and logging frameworks.
- Experience doing data development in Waterfall and Agile methodologies; worked with both Scrum and Kanban practices.
- Proficient in working with structured streaming applications using Spark Streaming and Flume.
- Proficient with Git and Subversion version control tools such as Bitbucket, GitHub and SVN.
- Worked with different IDEs such as PyCharm, VS Code, Jupyter and Cloudera Data Science Workbench.
- Experience utilizing CI/CD pipelines to simplify the code migration process.
- Experience writing shell scripts for file watchers and spark-submit wrappers.
- Hands-on experience scheduling jobs in Control-M, Oozie and Autosys.
- Coordinated with vendors and BAs on the support and maintenance of various applications.
- Excellent communication and interpersonal skills, ability to learn quickly, good analytical reasoning and high adaptability to new technologies and tools.
- Strong team spirit and relationship management skills.
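A minimal PySpark sketch of the DataFrame and RDD transformations referenced above; the database, table, path and column names are hypothetical and for illustration only:

    # Illustrative only: database, table, path and column names are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-example").enableHiveSupport().getOrCreate()

    # DataFrame API: filter, join and aggregate two source tables.
    sales = spark.table("raw_db.sales")
    customers = spark.table("raw_db.customers")
    result = (
        sales.filter(F.col("order_status") == "COMPLETE")
             .join(customers, on="customer_id", how="inner")
             .groupBy("region")
             .agg(F.sum("order_amount").alias("total_amount"))
    )

    # RDD API: flatMap and map over raw text lines.
    lines = spark.sparkContext.textFile("hdfs:///data/raw/orders.txt")
    tokens = lines.flatMap(lambda line: line.split("|")).map(lambda t: t.strip())

    # Store the curated output in HDFS as Parquet.
    result.write.mode("overwrite").parquet("hdfs:///data/curated/sales_by_region")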
TECHNICAL SKILLS
Hadoop Distributions: Cloudera, Hortonworks and AWS EMR
Big Data Eco System Tools: Hive, Spark, Oozie, Impala, SparkSQL, PySpark, Pig, Spark Streaming, Kafka
Languages: Python, Shell Scripting, SQL
ETL: IBM DataStage 8.x
AWS Services: S3, EC2, EMR, Athena
Misc Tools: PyCharm, Jupyter, Bitbucket, Jira, Putty, Control-m, Quality Center, Cloudera Data Science Workbench
PROFESSIONAL EXPERIENCE
Confidential, Detroit, MI
Data Engineer
Responsibilities:
- Hands-on with major Hadoop ecosystem components such as Spark, HDFS, Hive, HBase, ZooKeeper, Sqoop and Oozie.
- Developing Sqoop jobs to ingest data from various systems of record into the enterprise data lake.
- Developing Spark jobs in PySpark and SparkSQL that run on top of Hive tables and create transformed data sets for downstream consumption (see the sketch following this list).
- Working with business analysts to convert functional requirements into technical requirements and build appropriate data pipelines.
- Ingesting data from various source systems such as Oracle, SQL Server, flat files and JSON.
- Conducting exploratory data analysis in Jupyter notebooks using Python libraries and sharing the findings.
- Performance-tuning Spark and Hive jobs by reading execution plans, DAGs and YARN logs.
- Creating generic shell scripts to submit Hadoop and Spark jobs on EMR and on-prem edge nodes.
- Worked on migrating on-prem Hadoop cluster data and data pipelines to AWS cloud.
- Writing complex SparkSQL code to clean, join, transform and aggregate datasets and publish them for the Power BI team to produce operational scorecards.
- Writing custom Python modules for reusable code.
- Designing appropriate partitioning and bucketing schemes and ensuring correct load policies are employed so data is stored as per requirements.
- Creating Oozie workflows and coordinators and scheduling handshake jobs in Control-M.
- Working with production support teams and administration teams to ensure correct access controls are setup on each hive database.
- Working with governance teams to ensure metadata management, data lineage and technical metadata are correctly updated for each data asset.
- Working with a master/feature branch model and committing code with appropriate comments.
- Attending sprint planning and Agile ceremonies and demoing work products on a bi-weekly basis.
- Documenting data flow diagrams and technical logic in confluence.
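A minimal sketch of the kind of PySpark/SparkSQL job described above, reading Hive tables and publishing a partitioned, transformed dataset; the database, table and column names are hypothetical:

    # Hypothetical SparkSQL transformation over Hive tables; all names are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("curated-dataset-build")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Clean, join and aggregate source Hive tables with SparkSQL.
    curated = spark.sql("""
        SELECT o.order_id,
               o.customer_id,
               c.region,
               CAST(o.order_amount AS DECIMAL(18, 2)) AS order_amount,
               o.order_date
        FROM   raw_db.orders o
        JOIN   raw_db.customers c
               ON o.customer_id = c.customer_id
        WHERE  o.order_status = 'COMPLETE'
    """)

    # Publish as a partitioned Parquet-backed Hive table for downstream (e.g. Power BI) use.
    (curated.write
            .mode("overwrite")
            .format("parquet")
            .partitionBy("order_date")
            .saveAsTable("curated_db.completed_orders"))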
Environment: Cloudera Hadoop distribution, AWS EMR, S3, Athena, Hive, Impala, PySpark, SparkSQL, Oracle 11g/12c, Jira, Bitbucket, Power BI, Control-M
Confidential, Detroit, MI
Hadoop Developer
Responsibilities:
- Developed Spark scripts in Scala as per requirements.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output responses.
- Performed different types of transformations and actions on RDDs to meet business requirements.
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze data (a streaming sketch follows this list).
- Also worked on analyzing the Hadoop cluster and different big data analytic tools, including HBase and Sqoop.
- Involved in loading data from the UNIX file system into HDFS.
- Responsible for managing data coming from various sources.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Involved in managing and reviewing Hadoop log files.
- Imported data using Sqoop from SQL Server to HDFS on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Responsible for writing Hive queries for data analysis to meet the business requirements.
- Responsible for creating Hive tables and working on them using HiveQL.
- Responsible for importing and exporting data into HDFS and Hive using Sqoop.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Extended Hive core functionality with custom user-defined functions (UDFs), user-defined table-generating functions (UDTFs) and user-defined aggregate functions (UDAFs).
- Used the Spark framework for both batch and real-time data processing.
- Hands-on processing of data using the Spark Streaming API.
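A minimal PySpark Structured Streaming sketch of a Kafka-to-HDFS/Hive style pipeline like the one described above; the original work was in Scala, and the broker, topic, schema and paths here are hypothetical:

    # Hypothetical Kafka-to-HDFS streaming sketch (the original pipeline was written in Scala).
    # Requires the spark-sql-kafka connector package on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read JSON events from a Kafka topic (broker and topic names are assumptions).
    events = (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker1:9092")
             .option("subscribe", "events-topic")
             .load()
             .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*")
    )

    # Land the parsed stream in HDFS as Parquet for Hive consumption.
    query = (
        events.writeStream
              .format("parquet")
              .option("path", "hdfs:///data/streams/events")
              .option("checkpointLocation", "hdfs:///checkpoints/events")
              .outputMode("append")
              .start()
    )
    query.awaitTermination()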
Environment: Hortonworks Ambari, Tez, Hive, VersionOne, GitHub, Control-M, shell scripting, Spark Streaming, ServiceNow, SQL Server.
Confidential, Charlotte, NC
Hadoop Developer
Responsibilities:
- Involved in developing the roadmap for migrating enterprise data from multiple sources, such as SQL Server and provider databases, into HDFS, which serves as a centralized data hub across the organization.
- Loaded and transformed large sets of structured and semi-structured data from various systems.
- Developed ETL pipelines using Spark and Hive for performing various business specific transformations.
- Building applications and automating Spark pipelines for bulk loads as well as incremental loads of various datasets (see the sketch after this list).
- Worked closely with our team’s data analysts and consumers to shape the datasets as per the requirements.
- Automated the data pipeline to ETL all datasets, covering both full and incremental loads.
- Worked on building input adapters for data dumps from FTP servers using Apache Spark.
- Wrote Spark applications to inspect, clean, load and transform large sets of structured and semi-structured data.
- Developed Spark applications with Python and Spark SQL for testing and processing data.
- Made Spark job statistics, monitoring and data quality checks available for each dataset.
- Used SQL programming skills to work with relational SQL databases.
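An illustrative PySpark sketch of the bulk vs. incremental load pattern described above; the database, table, watermark column and path names are assumptions:

    # Hypothetical incremental-load sketch; names and the watermark rule are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("incremental-load")
        .enableHiveSupport()
        .getOrCreate()
    )

    target_db, target_tbl = "curated_db", "members"          # hypothetical target table
    source = spark.read.parquet("hdfs:///landing/members")   # hypothetical landing zone

    existing = [t.name for t in spark.catalog.listTables(target_db)]
    if target_tbl not in existing:
        # Bulk (initial) load: write the full dataset.
        source.write.mode("overwrite").saveAsTable(f"{target_db}.{target_tbl}")
    else:
        # Incremental load: append only rows newer than the current high-watermark.
        high_watermark = (
            spark.table(f"{target_db}.{target_tbl}")
                 .agg(F.max("updated_at").alias("wm"))
                 .collect()[0]["wm"]
        )
        delta = source.filter(F.col("updated_at") > F.lit(high_watermark))
        delta.write.mode("append").saveAsTable(f"{target_db}.{target_tbl}")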
Environment: Cloudera Hadoop distribution, Hive, Impala, Cognos, IBM Datastage, Shell scripting, SQL, PL/SQL and Autosys.
Confidential
Datastage Developer
Responsibilities:
- Involved in Requirement Gathering with the business team and in creating ETL design document and technical specifications document for the project.
- Optimized the performance of DataStage jobs by analyzing job logs and identifying bottlenecks at the source, target and stage level.
- Created UNIX shell scripts to invoke the DataStage job sequences and Oracle stored procedures.
- Implemented Slowly Changing Dimension Type 1 and Type 2 logic for inserting and updating target tables to maintain history (an illustrative sketch follows this list).
- Responded promptly to business user queries and change requests. Designed and developed DataStage jobs to load data from flat files, Oracle and MS SQL Server sources.
- Developed custom ETL objects to load the data in generic fashion.
- Responsible for developing and redefining several complex jobs and job sequencers to process various feeds using different DataStage stages and properties.
- Troubleshot issues and created automatic script/SQL generators.
- Designed unit test documents after DataStage development and verified results before moving the code to QA.
- Supported the test environments.
- Prepared production transition documentation and provided warranty support to production support teams.
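The Slowly Changing Dimension logic above was built in DataStage; the following PySpark sketch is only meant to illustrate the Type 2 expire-and-insert pattern, with hypothetical table and column names (brand-new customers and surrogate keys are omitted for brevity):

    # Illustrative SCD Type 2 sketch; the original implementation used DataStage jobs.
    # Assumes dw.customer_dim has columns: customer_id, address, start_date, end_date, current_flag.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scd2-example").enableHiveSupport().getOrCreate()

    dim = spark.table("dw.customer_dim")          # existing dimension (hypothetical)
    stg = spark.table("stg.customer_updates")     # incoming changes (hypothetical)

    current = dim.filter(F.col("current_flag") == "Y")
    history = dim.filter(F.col("current_flag") == "N")

    # Rows whose tracked attribute changed: expire the current version ...
    changed = (
        current.alias("c")
               .join(stg.alias("s"), F.col("c.customer_id") == F.col("s.customer_id"))
               .filter(F.col("c.address") != F.col("s.address"))
    )
    expired = (
        changed.select("c.*")
               .withColumn("end_date", F.current_date())
               .withColumn("current_flag", F.lit("N"))
    )

    # ... and insert a fresh current version carrying the new attribute values.
    new_versions = (
        changed.select(F.col("s.customer_id").alias("customer_id"),
                       F.col("s.address").alias("address"))
               .withColumn("start_date", F.current_date())
               .withColumn("end_date", F.lit(None).cast("date"))
               .withColumn("current_flag", F.lit("Y"))
    )

    # Current rows with no change pass through untouched.
    changed_ids = changed.select(F.col("c.customer_id").alias("customer_id"))
    unchanged = current.join(changed_ids, "customer_id", "left_anti")

    result = history.unionByName(expired).unionByName(unchanged).unionByName(new_versions)
    result.write.mode("overwrite").saveAsTable("dw.customer_dim_new")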
Environment: DataStage 8.1, Oracle 10g, SQL Server 2005, TOAD, SQL, Unix.