
Data Engineer Resume


SUMMARY:

  • Data Engineering professional with solid foundational skills and a proven track record of implementations across various data platforms. Self-motivated, with strong personal accountability in both individual and team settings.
  • 7+ years of experience in Data Engineering, Data Pipeline Design, Development, and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience with the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Strong experience writing data analysis scripts using the Python, PySpark, and Spark APIs.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg, embedPy, NumPy, and Beautiful Soup.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala (see the illustrative PySpark sketch after this list).
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
  • Experience developing MapReduce programs on Apache Hadoop for analyzing big data per requirements.
  • Hands-on experience with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources on the edge nodes onto HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Expert in developing SSIS/DTS packages to extract, transform, and load (ETL) data from heterogeneous sources into data warehouses and data marts.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, VPC, ELB, IAM, and DynamoDB.
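
A minimal PySpark sketch of the DataFrame and Spark SQL work referenced in the list above; the source path, view, and column names are hypothetical assumptions, not details from any specific engagement.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

    # Hypothetical source; any dataset with order_date and amount columns would do.
    orders = spark.read.parquet("s3://example-bucket/orders/")

    # DataFrame API aggregation.
    daily_totals = (
        orders.groupBy("order_date")
              .agg(F.sum("amount").alias("total_amount"),
                   F.count("*").alias("order_count"))
    )

    # Equivalent Spark SQL against a temporary view.
    orders.createOrReplaceTempView("orders")
    daily_totals_sql = spark.sql(
        "SELECT order_date, SUM(amount) AS total_amount, COUNT(*) AS order_count "
        "FROM orders GROUP BY order_date"
    )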

PROFESSIONAL EXPERIENCE:

Confidential

Data Engineer

Responsibilities:

  • Collaborated with Business Analysts and SMEs across departments to gather business requirements and identify workable items for further development.
  • Partnered with ETL developers to ensure data was well cleaned and the data warehouse stayed up to date for reporting purposes using Pig.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Performed simple statistical data-profiling analyses, such as cancel rate, variance, skewness, and kurtosis of trades and runs for each stock, daily and grouped by 1-, 5-, and 15-minute intervals.
  • Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse (illustrative sketch below).
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for predictive analytics using Python Pandas.
  • Developed and validated machine learning models, including Ridge and Lasso regression, for predicting the total trade amount.
  • Boosted the performance of the regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
  • Generated reports on predictive analytics using Python and Tableau, including visualizations of model performance and prediction results.
  • Utilized Agile and Scrum methodologies for team and project management; used Git for version control with colleagues.

Environment: Spark (PySpark, Spark SQL, Spark MLlib), Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, GitHub, AWS EMR/EC2/S3/Redshift, and Pig.
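
A minimal sketch of the moving-average and RSI calculation described above, assuming a per-ticker price dataset with ticker, trade_ts, and close_price columns; these names, the 14-row window, and the S3 paths are illustrative assumptions rather than project details.

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.appName("stock-indicators").getOrCreate()

    # Hypothetical price feed landed in S3 as CSV.
    prices = spark.read.csv("s3://example-bucket/trades/", header=True, inferSchema=True)

    w = Window.partitionBy("ticker").orderBy("trade_ts")
    w14 = w.rowsBetween(-13, 0)  # assumed 14-period lookback

    enriched = (
        prices
        .withColumn("moving_avg", F.avg("close_price").over(w14))
        .withColumn("delta", F.col("close_price") - F.lag("close_price").over(w))
        .withColumn("gain", F.when(F.col("delta") > 0, F.col("delta")).otherwise(F.lit(0.0)))
        .withColumn("loss", F.when(F.col("delta") < 0, -F.col("delta")).otherwise(F.lit(0.0)))
        .withColumn("avg_gain", F.avg("gain").over(w14))
        .withColumn("avg_loss", F.avg("loss").over(w14))
        # Standard RSI formula; avg_loss of zero yields null, acceptable for a sketch.
        .withColumn("rsi", 100 - 100 / (1 + F.col("avg_gain") / F.col("avg_loss")))
    )

    # Results could then be staged in S3 for a Redshift COPY into the warehouse.
    enriched.write.mode("overwrite").parquet("s3://example-bucket/indicators/")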

Confidential

Data Engineer

Responsibilities:

  • Migrated data from on-premises systems to AWS storage buckets.
  • Developed a Python script to transfer data from on-premises systems to AWS S3.
  • Developed a Python script to call REST APIs and extract data to AWS S3.
  • Worked on ingesting data through cleansing and transformations, leveraging AWS Lambda, AWS Glue, and Step Functions.
  • Created YAML files for each data source, including Glue table stack creation.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions, assigned IAM roles to run Python scripts, and configured various triggers (SQS, EventBridge, SNS) (illustrative sketch below).
  • Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer.
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.

Environment: Python 3.6, AWS (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix/Linux shell scripting, PyCharm, Informatica PowerCenter.
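
A minimal sketch of a Lambda function like those described above, assuming an S3 event trigger; the bucket names, prefix, and handler wiring are hypothetical, and a real job would apply cleansing and transformation before writing.

    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    CURATED_BUCKET = "example-curated-bucket"  # hypothetical target bucket

    def lambda_handler(event, context):
        # Invoked by an S3 event notification (or via SQS/EventBridge, as in the pipeline above).
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            # Copy the newly landed object into an assumed curated prefix.
            s3.copy_object(
                Bucket=CURATED_BUCKET,
                Key=f"curated/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
        return {"status": "ok"}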

Confidential

Data Engineer

Responsibilities:

  • Created, executed, and documented Hadoop ecosystem installation and configuration scripts on Google Cloud Platform.
  • Transformed batch data from several tables containing tens of thousands of records from SQL Server, MySQL, PostgreSQL, and CSV file datasets into DataFrames using PySpark.
  • Researched and downloaded JARs for Spark-Avro programming.
  • Developed a PySpark program that writes DataFrames to HDFS as Avro files, utilizing Spark's parallel processing capabilities to ingest data.
  • Created and executed HQL scripts that create external tables in a raw-layer database in Hive.
  • Developed a script that copies Avro-formatted data from HDFS to the external tables in the raw layer.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and writes them to data-service-layer internal tables in ORC format.
  • Owned the PySpark code that creates DataFrames from tables in the data service layer and writes them to a Hive data warehouse.
  • Installed Airflow, created a PostgreSQL database to store Airflow metadata, and configured Airflow to communicate with its PostgreSQL database.
  • Developed Airflow DAGs in Python by importing the Airflow libraries (illustrative sketch below).
  • Utilized Airflow to schedule, automatically trigger, and execute the data ingestion pipeline.
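
A minimal Airflow DAG sketch of the kind described above, assuming Airflow 2.x with BashOperator steps invoking the PySpark ingestion and HQL load; the DAG id, schedule, and script paths are hypothetical assumptions.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_ingestion_pipeline",   # hypothetical DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Write source tables to HDFS as Avro via an assumed PySpark job.
        ingest_to_hdfs = BashOperator(
            task_id="ingest_to_hdfs",
            bash_command="spark-submit /opt/jobs/ingest_to_avro.py",
        )
        # Load the raw-layer external tables in Hive via an assumed HQL script.
        load_raw_layer = BashOperator(
            task_id="load_raw_layer",
            bash_command="hive -f /opt/jobs/load_raw_external_tables.hql",
        )
        ingest_to_hdfs >> load_raw_layer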

Confidential

Data Engineer

Responsibilities:

  • Worked on the development of data ingestion pipelines using Talend (ETL) and bash scripting with big data technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
  • Experienced in developing scalable and secure data pipelines for large datasets.
  • Gathered requirements for ingesting new data sources, including life cycle, data quality checks, transformations, and metadata enrichment.
  • Supported data quality management by implementing proper data quality checks in data pipelines.
  • Delivered data engineering services such as data exploration, ad-hoc ingestions, and subject-matter expertise to data scientists using big data technologies.
  • Built machine learning models to showcase big data capabilities using PySpark and MLlib.
  • Enhanced the data ingestion framework by creating more robust and secure data pipelines.
  • Implemented data streaming capability using Kafka and Talend for multiple data sources (illustrative sketch below).
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu); managed the S3 data lake.
  • Responsible for maintaining and handling data inbound and outbound requests through big data platforms.
  • Applied working knowledge of cluster security components such as Kerberos, Sentry, and SSL/TLS.
  • Involved in the development of agile, iterative, and proven data modeling patterns that provide flexibility.
  • Knowledge of implementing JILs to automate jobs in the production cluster.
  • Troubleshot users' analysis bugs (JIRA and IRIS tickets).
  • Worked with the Scrum team to deliver agreed user stories on time every sprint.
  • Analyzed and resolved production job failures in several scenarios.
  • Implemented UNIX scripts to define the use-case workflow, process the data files, and automate the jobs.

Environment: Spark, Redshift, Python, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie, Talend, Agile Methodology.
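A minimal PySpark Structured Streaming sketch for the Kafka ingestion described above (the original pipelines also used Talend); the broker address, topic, event schema, and data-lake paths are hypothetical assumptions.

    from pyspark.sql import SparkSession, functions as F, types as T

    # Requires the spark-sql-kafka package on the Spark classpath.
    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    # Assumed shape of the incoming JSON events.
    event_schema = T.StructType([
        T.StructField("event_id", T.StringType()),
        T.StructField("event_ts", T.TimestampType()),
        T.StructField("payload", T.StringType()),
    ])

    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
        .option("subscribe", "ingest-topic")                 # hypothetical topic
        .load()
    )

    events = (
        raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
           .select("e.*")
    )

    # Land the stream as Parquet in an assumed S3 data-lake path.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-data-lake/events/")
        .option("checkpointLocation", "s3a://example-data-lake/checkpoints/events/")
        .start()
    )
    query.awaitTermination()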

Confidential

Data Engineer

Responsibilities:

  • Migrated data from FS to Snowflake within the organization; imported legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time of complex queries.
  • Exported data into Snowflake by creating staging tables to load data files of different types from Amazon S3 (illustrative sketch below).
  • Compared data in a leaf-level process across the various databases whenever data transformation or data loading took place, analyzing data quality after these loads to check for any data loss or corruption.
  • As part of the data migration, wrote many SQL scripts to identify data mismatches and loaded the history data from Teradata SQL to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
  • Worked on retrieving data from FS to S3 using Spark commands.
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Registered business and technical datasets for the corresponding SQL scripts using Nebula metadata.
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
  • Monitored the daily, weekly, and monthly jobs and provided support in case of failures or issues.

Environment: Snowflake, AWS S3, GitHub, Service Now, HP Service Manager, EMR, Nebula, Teradata, SQL Server, Apache Spark, Sqoop
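A minimal sketch of loading staged S3 files into Snowflake with COPY INTO via the Snowflake Python connector, as described above; the account, credentials, warehouse, stage, table, and file format are hypothetical assumptions.

    import snowflake.connector

    # Hypothetical connection details; real jobs would pull these from a secrets store.
    conn = snowflake.connector.connect(
        account="example_account",
        user="example_user",
        password="example_password",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    try:
        cur = conn.cursor()
        # The external stage is assumed to point at the S3 landing bucket.
        cur.execute("""
            COPY INTO STAGING.LEGACY_ORDERS
            FROM @LEGACY_S3_STAGE/orders/
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
        # Row counts returned by COPY can be compared against the source as a data-quality check.
        for row in cur.fetchall():
            print(row)
    finally:
        conn.close()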
