
Senior Data Engineer Resume


Urbandale, IA

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementation across a variety of data platforms. Self-motivated, with strong adherence to personal accountability in both individual and team scenarios.
  • Over 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design, Specification, and Testing across the full cycle in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for data analysis.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience in developing MapReduce programs on Apache Hadoop to analyze big data according to requirements.
  • Hands-on with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources on the edge nodes onto HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Expertise working with AWS cloud services such as EMR, S3, Redshift, and CloudWatch for big data development.
  • Good working knowledge of the Amazon Web Services (AWS) Cloud Platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (see the sketch after this list).
  • Expertise in designing complex mappings, with strength in performance tuning and in building Slowly Changing Dimension tables and Fact tables.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python for validating ETL processes across multiple databases, including Oracle, SQL Server, Hive, and MongoDB.
  • Proficiency in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Expert in building Enterprise Data Warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Well experienced in Normalization and De-Normalization techniques for optimum performance in relational and dimensional database environments.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Modeling for Fact and Dimension tables) using Analysis Services.
  • Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
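
The Python UDF work for Hive noted above typically runs through Hive's TRANSFORM clause, which streams tab-delimited rows through a script over stdin/stdout. A minimal sketch, assuming a two-column input and an illustrative email-cleaning rule (script name and columns are hypothetical):

  #!/usr/bin/env python
  # clean_email.py - illustrative Hive streaming UDF: Hive pipes tab-delimited
  # rows to stdin and reads the transformed rows back from stdout.
  import sys

  for line in sys.stdin:
      user_id, email = line.rstrip("\n").split("\t")      # assumed two-column input
      print("\t".join([user_id, email.strip().lower()]))  # trim and lower-case email

In Hive, such a script would be registered with ADD FILE clean_email.py and invoked via SELECT TRANSFORM(user_id, email) USING 'python clean_email.py' AS (user_id, email) FROM raw_users.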

TECHNICAL SKILLS

Big Data Tools: Hadoop Ecosystem (MapReduce), Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Azure, Google Cloud.

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena

Databases: Oracle 12c/11g, Teradata R15/R14.

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, Urbandale, IA

Senior Data Engineer

Responsibilities:

  • Transforming business problems into Big Data solutions and defining the Big Data strategy and roadmap.
  • Installing, configuring and maintaining Data Pipelines
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authoring Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (see the sketch after this list).
  • Writing Pig scripts to generate MapReduce jobs and performing ETL procedures on the data in HDFS.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
  • Used Sqoop to channel data between HDFS and RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
  • Boosted the performance of regression models by applying polynomial transformations and feature selection, and used those methods to select stocks.
  • Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
  • Developed automated regression scripts in Python for validating ETL processes across multiple databases, including AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server.
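
A minimal PySpark sketch of the cleaning/conforming UDF pattern referenced above; the S3 paths, column names, and state-code mapping are illustrative assumptions rather than the production code:

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F
  from pyspark.sql.types import StringType

  spark = SparkSession.builder.appName("conform-example").getOrCreate()

  # Hypothetical input extracted from Hadoop and landed in S3 (path is illustrative).
  df = spark.read.parquet("s3a://example-bucket/landing/customers/")

  # Custom UDF: normalize free-text state values ("iowa ", "IA") to a two-letter code.
  STATE_MAP = {"iowa": "IA", "ia": "IA", "new jersey": "NJ", "nj": "NJ"}

  @F.udf(returnType=StringType())
  def conform_state(raw):
      if raw is None:
          return None
      return STATE_MAP.get(raw.strip().lower(), raw.strip().upper())

  cleaned = (
      df.withColumn("state", conform_state(F.col("state")))
        .withColumn("load_label", F.lit("daily"))   # simple data-labeling column
        .dropDuplicates(["customer_id"])            # basic conforming step
  )

  cleaned.write.mode("overwrite").parquet("s3a://example-bucket/conformed/customers/")

Column-level UDFs like this keep the row-level cleaning rules in plain Python while Spark handles distribution of the work.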

Environment: Cloudera Manager (CDH5), PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Snowflake, PyCharm, Scrum, Git.

Confidential, Weehawken, NJ

AWS Data Engineer

Responsibilities:

  • Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into an AWS S3 data lake.
  • Utilized AWS services with a focus on big data analytics, enterprise data warehouse, and business intelligence solutions to ensure optimal architecture, scalability, and flexibility.
  • Designed the AWS architecture and cloud migration, covering AWS EMR, DynamoDB, Redshift, and event processing using Lambda functions.
  • Gathered data from Google AdWords, Apple Search Ads, Facebook Ads, Bing Ads, Snapchat Ads, Omniture, and CSG using their APIs.
  • Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Used AWS Systems Manager to automate operational tasks across AWS resources.
  • Wrote Lambda function code and set a CloudWatch Events rule as the trigger with a cron expression (see the sketch after this list).
  • Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
  • Set up the connection between S3 and AWS SageMaker (machine learning platform) for predictive analytics and uploaded inference results to Redshift.
  • Developed data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Integrated HDP clusters with Active Directory and enabled Kerberos for Authentication.
  • Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Set up alerting and monitoring using Stackdriver in GCP.
  • Designed and implemented large-scale distributed solutions in the AWS and GCP clouds.
  • Monitored Hadoop cluster health through MCS and worked on NoSQL databases including HBase.
  • Created Hive tables, loaded data, and wrote Hive UDFs; worked with the Linux server admin team in administering the server hardware and operating system.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores, for data access and analysis.
  • Created Jenkins jobs for CI/CD using Git, Maven, and Bash scripting.
  • Built a regression test suite in the CI/CD pipeline covering data setup, test case execution, and teardown using Cucumber/Gherkin, Java, Spring DAO, and PostgreSQL.
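
A minimal sketch of the Lambda-plus-CloudWatch-Events pattern described above; the bucket name, key prefix, and cron schedule are illustrative assumptions:

  import json
  import boto3
  from datetime import datetime, timezone

  s3 = boto3.client("s3")

  # Deployed behind a CloudWatch Events (EventBridge) rule with a schedule such as
  # cron(0 6 * * ? *), i.e. once a day at 06:00 UTC (expression is illustrative).
  def lambda_handler(event, context):
      # In the real pipeline this step would pull from a marketing API and land the
      # payload in the S3 data lake; here we just write a small status record.
      now = datetime.now(timezone.utc).isoformat()
      body = json.dumps({"run_started_at": now, "source": "marketing-api"})
      s3.put_object(
          Bucket="example-marketing-data-lake",   # hypothetical bucket
          Key="ingest-status/" + now + ".json",
          Body=body.encode("utf-8"),
      )
      return {"statusCode": 200, "body": body}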

Environment: Redshift, PySpark, EC2, EMR, Glue, S3, Kafka, IAM, PostgreSQL, Jenkins, Maven, AWS CLI, Git.

Confidential, Topeka, KS

Data Engineer

Responsibilities:

  • Worked on development of data ingestion pipelines using the Talend ETL tool and Bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
  • Experience in developing scalable and secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources (see the sketch after this list).
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • S3 data lake management: responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Working knowledge of cluster security components such as Kerberos, Sentry, and SSL/TLS.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define use-case workflows, process data files, and automate jobs.
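
As a rough illustration of the Kafka streaming ingestion above (shown here with Spark Structured Streaming rather than Talend; the broker address, topic, and output paths are assumptions):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  # Requires the Spark-Kafka integration package on the classpath, e.g.
  #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 ...
  spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

  # Hypothetical Kafka source: broker and topic names are illustrative.
  raw = (
      spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load()
  )

  # Kafka delivers key/value as binary; cast the value to string for downstream parsing.
  events = raw.select(
      F.col("value").cast("string").alias("payload"),
      F.col("timestamp"),
  )

  # Land the stream as Parquet with a checkpoint so the job can restart safely.
  query = (
      events.writeStream.format("parquet")
            .option("path", "/data/lake/events/")               # illustrative path
            .option("checkpointLocation", "/data/checkpoints/events/")
            .outputMode("append")
            .start()
  )
  query.awaitTermination()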

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.

Confidential 

Software Engineer/Hadoop Engineer

Responsibilities:

  • Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Developed Spark jobs and Hive jobs to summarize and transform data (see the sketch at the end of this list).
  • Involved in development of applications with Java and J2EE technologies.
  • Developed and maintained an elaborate services-based architecture utilizing open-source technologies such as Hibernate ORM and the Spring Framework.
  • Developed server-side services using core Java multithreading, Struts MVC, EJB, Spring, and web services (SOAP, WSDL, Axis).
  • Responsible for developing the DAO layer using Spring MVC and configuration XMLs for Hibernate, and for managing CRUD operations (insert, update, and delete).
  • Design, development, and implementation of JSPs in the presentation layer for the Submission, Application, and Reference implementations.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis.
  • Experience in managing and reviewing Hadoop Log files.
  • Used Sqoop to transfer data between relational databases and Hadoop.
  • Worked on HDFS to store and access huge datasets within Hadoop.
  • Good hands-on experience with GitHub.
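
A minimal PySpark sketch of the kind of summarization job mentioned above; the HDFS paths, column names, and aggregation logic are illustrative assumptions:

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("log-summary").getOrCreate()

  # Hypothetical log data staged in HDFS (path and schema are assumptions).
  logs = spark.read.json("hdfs:///staging/app_logs/")

  # Summarize: request counts and error counts per service per day.
  summary = (
      logs.withColumn("event_date", F.to_date("event_time"))
          .groupBy("service", "event_date")
          .agg(
              F.count("*").alias("requests"),
              F.sum(F.when(F.col("status") >= 500, 1).otherwise(0)).alias("errors"),
          )
  )

  # Write the summarized table back to HDFS for Hive/Sqoop consumption downstream.
  summary.write.mode("overwrite").parquet("hdfs:///warehouse/log_summary/")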

Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Oozie, Kafka, flume, Java, Git.
