
Data Engineer Resume


MI

SUMMARY

  • Data Engineering professional with solid foundational skills and a proven track record of implementations across a variety of data platforms. Self-motivated, with a strong sense of personal accountability in both individual and team settings.
  • 8+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience across the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts using the Python, PySpark, and Spark APIs for data analysis.
  • Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Experience with Google Cloud components, Google Container Builder, GCP client libraries, and Cloud SDKs.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and Spark SQL to manipulate Data Frames in Scala.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig in Python.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Expert in developing SSIS/DTS packages to extract, transform, and load (ETL) data into data warehouses/data marts from heterogeneous sources.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (see the sketch after this summary).
  • Expertise in designing complex mappings and in performance tuning of Slowly Changing Dimension and Fact tables.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Experienced in building automated regression scripts in Python for validation of ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Expert in building Enterprise Data Warehouses and data warehouse appliances from scratch using both the Kimball and Inmon approaches.
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Good knowledge of Data Marts, OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake modeling for Fact and Dimension tables) using Analysis Services.
  • Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
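
As referenced above, a minimal sketch of a Hive "UDF" implemented as a Python TRANSFORM script: it reads tab-separated rows from stdin and emits an extra derived column. The column names, bucketing rule, and script name are hypothetical, not taken from any specific project described here.

```python
#!/usr/bin/env python
# price_bucket.py - hypothetical Hive TRANSFORM script.
# Reads tab-separated (symbol, price) rows from stdin and appends a derived
# price_bucket column; malformed rows are skipped or marked "unknown".
import sys

for line in sys.stdin:
    fields = line.strip().split('\t')
    if len(fields) < 2:
        continue
    symbol, price = fields[0], fields[1]
    try:
        price_bucket = 'high' if float(price) >= 100.0 else 'low'  # illustrative rule
    except ValueError:
        price_bucket = 'unknown'
    print('\t'.join([symbol, price, price_bucket]))
```

It would be wired into Hive with something like `ADD FILE price_bucket.py;` followed by `SELECT TRANSFORM(symbol, price) USING 'python price_bucket.py' AS (symbol, price, price_bucket) FROM trades;`, where `trades` is a hypothetical table.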

TECHNICAL SKILLS

Languages: Python 3.7.0+, Java 1.8, Scala 2.11.8+, SQL, T-SQL, R 3.5.0+, C++, C, MATLAB

Cluster Management & Monitoring: Cloudera Manager 6.0.0+, Hortonworks Ambari 2.6.0+, CloudxLab

Hadoop Ecosystem: Hadoop 2.8.4+, Spark 2.0.0+, MapReduce, HDFS, Kafka 0.11.0.1+, Hive 2.1.0+, HBase 1.4.4+, Sqoop 1.99.7+, Pig 0.17, Flume 1.6.0+, Keras 2.2.4

Database: MySQL 5.X, SQL Server, Oracle 11g, HBase 1.2.3+, Cassandra 3.11

Visualization: PowerBI, Oracle BI, Tableau 10.0+

Virtualization: VMware Workstation, AWS

Operating Systems: Linux, Windows, Ubuntu

Markup Languages: HTML5, CSS3, JavaScript

IDE: Jupyter Notebook, Eclipse, PyCharm, IntelliJ, RStudio, Visual Studio

Other Tools: KNIME, MS SSMS, Putty, GitHub, Maven, WinSCP, MS Office 365, SageMath, SEED Ubuntu, TensorFlow, NumPy

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, MI

Responsibilities:

  • Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
  • Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes, using Pig.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Performed statistical profiling of the data, including the cancel rate, variance, skewness, and kurtosis of trades and runs for each stock, aggregated daily by 1-, 5-, and 15-minute intervals.
  • Used PySpark and Pandas to calculate the moving average and RSI score of each stock and loaded the results into the data warehouse (see the sketch after this role).
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
  • Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.

Environment: Spark (PySpark, Spark SQL, Spark MLlib), Python 3.x (Scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, and Pig.
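
A minimal sketch of the moving-average and RSI calculation mentioned above, using pandas after collecting a Spark DataFrame; the column names ("symbol", "ts", "close") and window sizes are hypothetical.

```python
# Sketch: simple moving average and RSI per stock with pandas.
import pandas as pd

def add_moving_average_and_rsi(df: pd.DataFrame, ma_window: int = 20, rsi_window: int = 14) -> pd.DataFrame:
    df = df.sort_values("ts").copy()
    # Simple moving average of the closing price.
    df["ma"] = df["close"].rolling(ma_window).mean()
    # RSI: ratio of average gain to average loss over the lookback window.
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(rsi_window).mean()
    loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    df["rsi"] = 100 - (100 / (1 + gain / loss))
    return df

# Hypothetical usage after pulling a Spark DataFrame down to pandas:
# scores = spark_df.toPandas().groupby("symbol", group_keys=False).apply(add_moving_average_and_rsi)
```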

Data Engineer

Confidential, Rensselaer, NY

Responsibilities:

  • Migrated data from on-premises systems to AWS storage buckets.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Developed a Python script to call REST APIs and extract data to AWS S3.
  • Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this role).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs using crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.

Environment: Python 3.6, AWS (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix/Linux shell scripting, PyCharm, Informatica PowerCenter.
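
A minimal sketch of an S3-triggered Lambda handler of the kind referenced above: it reacts to S3 put events and copies each incoming object into a processed location with boto3. The bucket name and key prefix are hypothetical.

```python
# Sketch: Lambda handler for S3 event notifications.
import urllib.parse

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "example-processed-bucket"  # hypothetical destination bucket

def lambda_handler(event, context):
    # Each S3 notification can carry multiple records.
    for record in event.get("Records", []):
        src_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the incoming object under a hypothetical "raw/" prefix.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=f"raw/{key}",
            CopySource={"Bucket": src_bucket, "Key": key},
        )
    return {"status": "ok"}
```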

Data Engineer

Confidential, Columbia, SC

Responsibilities:

  • Created and executed Hadoop Ecosystem installation and document configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into data frames using PySpark.
  • Researched and downloaded jars for Spark-Avro programming.
  • Developed a PySpark program that writes data frames to HDFS as Avro files.
  • Utilized Spark's parallel processing capabilities to ingest data.
  • Created and executed HQL scripts that create external tables in a raw layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data service layer internal tables in ORC format.
  • In charge of PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
  • Installed Airflow and created a database in PostgreSQL to store metadata from Airflow.
  • Configured files that allow Airflow to communicate with its PostgreSQL database.
  • Developed Airflow DAGs in Python by importing the Airflow libraries (see the sketch after this list).
  • Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.
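
A minimal sketch of an Airflow DAG like the ones referenced above, assuming Airflow 2.x; the dag_id, schedule, and spark-submit script paths are hypothetical, not the actual pipeline described in this role.

```python
# Sketch: Airflow 2.x DAG that chains two spark-submit steps for the ingestion pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="raw_to_service_layer_ingestion",   # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: land source tables in the raw layer as Avro.
    ingest_raw = BashOperator(
        task_id="ingest_raw_avro",
        bash_command="spark-submit /opt/jobs/ingest_raw_avro.py",        # hypothetical path
    )
    # Step 2: publish to data service layer tables as ORC.
    load_service_layer = BashOperator(
        task_id="load_service_layer_orc",
        bash_command="spark-submit /opt/jobs/load_service_layer_orc.py",  # hypothetical path
    )
    ingest_raw >> load_service_layer
```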

Data Engineer

Confidential

Responsibilities:

  • Worked on the development of data ingestion pipelines using Talend (ETL) and bash scripting with big data technologies including Hive, Impala, Spark, and Kafka (see the streaming sketch after this role).
  • Developed scalable and secure data pipelines for large datasets.
  • Gathered requirements for ingestion of new data sources including life cycle, data quality check, transformations, and metadata enrichment.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Delivered data engineering services such as data exploration, ad-hoc ingestion, and subject-matter expertise to data scientists using big data technologies.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
  • Managed the S3 data lake; responsible for handling inbound and outbound data requests through the big data platform.
  • Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
  • Worked on analyzing and resolving the production job failures in several scenarios.
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
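
A minimal sketch of the Kafka-based streaming ingestion referenced in this role, using Spark Structured Streaming in PySpark rather than Talend; the broker address, topic name, and output paths are hypothetical, and the spark-sql-kafka package must be supplied at submit time.

```python
# Sketch: read a Kafka topic with Spark Structured Streaming and land Parquet files.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka_to_parquet").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
    .option("subscribe", "trades")                       # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to string before writing out.
events = raw.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
    col("timestamp"),
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/landing/trades")            # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/trades")
    .start()
)
query.awaitTermination()
```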

Data Engineer

Confidential

Responsibilities:

  • Migrated data from FS to Snowflake within the organization.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported data into Snowflake by creating staging tables to load data files of different types from Amazon S3 (see the sketch after this role).
  • Compared data at the leaf level across various databases whenever data transformation or loading took place, analyzing data quality to check for any data loss or corruption.
  • As part of the data migration, wrote many SQL scripts to identify data mismatches and worked on loading history data from Teradata SQL to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
  • Retrieved data from FS to S3 using Spark commands.
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Using Nebula metadata, registered business and technical datasets for the corresponding SQL scripts.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
  • Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.

Environment: Snowflake, AWS S3, GitHub, Service Now, HP Service Manager, EMR, Nebula, Teradata, SQL Server, Apache Spark, Sqoop
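
A minimal sketch of loading staged S3 data into Snowflake as referenced above, using the snowflake-connector-python package and assuming an existing external stage over the S3 bucket; all account details, stage, table, and schema names are hypothetical.

```python
# Sketch: copy staged S3 files into a Snowflake staging table, then publish to a target table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",     # hypothetical credentials and identifiers
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Load files from the external S3 stage into the staging table.
    cur.execute("""
        COPY INTO STAGING.TRADES_STG
        FROM @S3_LEGACY_STAGE/trades/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
    # Publish the staged rows to the consumption table.
    cur.execute("""
        INSERT INTO ANALYTICS.PUBLIC.TRADES
        SELECT * FROM STAGING.TRADES_STG
    """)
finally:
    conn.close()
```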
