
Data Engineer Resume


Nashville, TN

SUMMARY

  • Experienced and skilled in software development and analysis, with 5 years as a Data Engineer and strong expertise in Spark, Hive, data warehousing, and data modeling.
  • Proficient in Python/Scala, Azure, Amazon Web Services (AWS), and Snowflake; experienced in designing and deploying data visualizations using Tableau and Power BI.
  • Hands on experience in developing and deploying enterprise applications using major components of the Hadoop ecosystem such as Hadoop 2.x, YARN, Sqoop, Spark, Hive, Pig, MapReduce, HBase, Flume, Kafka, Oozie, and ZooKeeper.
  • Automated scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Experienced in building advanced models using data mining, data classification, and data science techniques.
  • Extensive experience working with the Hadoop architecture and its components: MapReduce, HDFS, JobTracker, TaskTracker, NameNode, and DataNode.
  • Experienced in using Sqoop to import data from RDBMS into HDFS and Hive, and to export data from HDFS back to RDBMS.
  • Experienced in working with real-time streaming applications using tools such as Flume, Kafka, and Spark Streaming.
  • Hands on experience in using Amazon Web Services such as EC2, EMR, and S3.
  • Strong experience working with big data services in the cloud, especially Cloud SQL and Cloud Deployment Manager.
  • Installed and configured Apache Airflow to load data from an S3 bucket into a Snowflake data warehouse and created DAGs to orchestrate the runs (see the Airflow sketch after this list).
  • Hands on experience in using BigQuery and Dataflow to deliver business insights with agility.
  • Good knowledge of Apache Spark and Spark SQL for faster processing of batch and real-time data.
  • Hands on Bash scripting experience and experience building data pipelines on Unix/Linux systems.
  • Experience in designing and developing Spark batch and streaming jobs to validate, extract, transform, and load 3.5 to 4 million records per day.
  • Experience in developing Scala scripts to run on Spark clusters.
  • Experience with Kafka and Spark integration for real-time data processing.
  • Analyzed large, structured datasets using Hive's data warehousing infrastructure.
  • Highly experienced in writing HiveQL queries for both managed and external tables and good at Hive partitioning, bucketing, and performing different types of joins on Hive tables.
  • Good understanding of NoSQL databases like HBase, Cassandra, and MongoDB, and hands on work experience in writing applications on NoSQL databases like HBase and Cassandra.
  • Used ZooKeeper to provide coordination services to the cluster.
  • Experience in using the Hadoop distributions Cloudera and Hortonworks.
  • Used Talend for ETL processing based on business needs and extensively used the Oozie workflow engine to run multiple Hive and Pig jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Extensively used Microsoft SSIS and worked on the end-to-end implementation of ETL projects for retail banking, consumer goods, and healthcare.
  • Experience in migrating data from SQL Server 2008 to SQL Server 2012.
  • Worked on Apache Flink to implement transformations on data streams such as filtering, aggregating, and updating state.
  • Proficient in creating T-SQL stored procedures, triggers, constraints, and indexes; developed more than 350 scripts.
  • Responsible for performance tuning and optimization of stored procedures.
  • Adept in process management and the software development life cycle, with experience from analysis and design through implementation.
  • Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis services, Application Insights, Azure Monitoring, Key Vault, Azure Data Lake.
  • Extensive hands-on experience in Snowflake Query and Performance tuning.
  • Experienced on the Azure platform with services such as ADLS, Data Factory, Synapse Analytics, and Azure Databricks.
  • Strong interpersonal and problem-solving abilities, with excellent communication and coordination skills.
  • Experienced working on projects that followed Scrum and Agile methodologies.
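
A minimal sketch of the S3-to-Snowflake Airflow orchestration referenced in the list above. The DAG, connection, stage, and table names are hypothetical placeholders, and the apache-airflow-providers-snowflake package is assumed to be installed; actual pipelines will differ.

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Hypothetical daily DAG: copies files staged in S3 into a Snowflake table.
with DAG(
    dag_id="s3_to_snowflake_daily",             # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = SnowflakeOperator(
        task_id="copy_orders_from_s3",
        snowflake_conn_id="snowflake_default",  # connection configured in Airflow
        sql="""
            COPY INTO analytics.orders              -- placeholder target table
            FROM @analytics.s3_orders_stage         -- external stage pointing at the S3 bucket
            FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
        """,
    )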

PROFESSIONAL EXPERIENCE

Confidential, Nashville, TN

Data Engineer

Responsibilities:

  • Worked extensively on importing metadata into Hive using Python and migrated existing tables and applications to work on Azure Blob Storage.
  • Involved in complete Big Data flow of the application starting from data ingestion from upstream to HDFS, processing the data in HDFS and analyzing the data.
  • Developed scripts in Hive to perform transformations on the data and loaded it into target systems for reports.
  • Worked with several Python packages such as NumPy, SciPy, and PyTables.
  • Hands-on experience with Azure Synapse, Azure Stream Analytics, Azure Event Hubs, Azure Event Grid, Databricks, ADLS Gen 2, & Logic Apps.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Used AWS SageMaker to quickly build, train, and deploy machine learning models.
  • Created a React client web app backed by serverless AWS Lambda functions to interact with an AWS SageMaker endpoint.
  • Used AWS SageMaker, BlazingText, and association rule mining to analyze and enhance the Chart Assist ontology.
  • Utilized the Boto3 library to connect Python to SageMaker to deploy and host the model (see the Boto3 sketch after this list).
  • Implemented an automated workflow for the training and deployment of AWS SageMaker deep learning models using AWS Step Functions, CloudWatch, API Gateway, Lambda, and SES.
  • Developed MapReduce flows in the Microsoft HDInsight Hadoop environment using Python.
  • Automated the process for the extraction of data from warehouses and weblogs by developing workflows and coordinator jobs in Oozie.
  • Wrote parsers in Python to extract useful data from the design database.
  • Used PySpark scripts to load data from databases into RDDs, converted the RDDs to DataFrames, and performed transformations on them (see the PySpark sketch after this list).
  • Worked with Kafka for building robust and fault-tolerant data Ingestion pipeline for transporting streaming data into HDFS and implemented Kafka Custom encoders for custom input format to load data into Kafka Partitions.
  • Worked on the Apache Flink framework, a distributed processing engine for stateful computations over unbounded and bounded data streams.
  • Utilized Flink's ability to run in common cluster environments and perform computations at in-memory speed and at any scale.
  • Worked on performance tuning of Apache Kafka workflow to optimize the data ingestion speeds.
  • Involved in the full life cycle implementation of ETL using Informatica and Oracle; helped design the data warehouse by defining facts, dimensions, and the relationships between them, and applied corporate standards to naming conventions.
  • Designed and developed ETL processes in Azure Data Factory to migrate campaign data from external sources.
  • Performed data extraction, aggregation, and consolidation of Adobe data within Data Factory using PySpark.
  • Experience in automatically deploying applications with AWS Lambda and Elastic Beanstalk.
  • Worked on migration of an existing feed from Hive to Spark to reduce the latency of feeds in existing HiveQL for OLAP and Operational data store (ODS) applications.
  • Optimized and moved large scale pipeline applications from on-premises clusters to Azure Cloud.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model which gets the data from Kafka in near real-time and persists it to MongoDB.
  • Built and supported several Azure, multi-server environments and deployed the Big Data Hadoop application on the Azure cloud.
  • Strong SQL knowledge, with the ability to write and understand complex SQL queries.
  • Troubleshot and identified slow-running Snowflake queries and optimized them.
  • Ensured improvements in performance, scalability, and efficiency when querying tables with billions of rows.
  • Brought in best practices/recommendations for query optimization.
  • Tuned enterprise-level Snowflake environments.
  • Extensive experience working in Oracle, DB2, SQL Server, and MySQL database.
  • Developed various Stored Procedures and Triggers to migrate data from various databases to CKB servers for specific markets on demand.
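
A minimal Boto3 sketch of the SageMaker model-hosting step mentioned in the list above. The model name, IAM role, container image, artifact location, and instance type are hypothetical placeholders.

import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")  # region is an assumption

# 1. Register the trained model artifact with a serving container (placeholder values).
sm.create_model(
    ModelName="chart-assist-model",
    ExecutionRoleArn="arn:aws:iam::<account-id>:role/<SageMakerRole>",  # hypothetical role
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",
        "ModelDataUrl": "s3://<bucket>/path/to/model.tar.gz",
    },
)

# 2. Define the hosting configuration and create the real-time endpoint.
sm.create_endpoint_config(
    EndpointConfigName="chart-assist-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "chart-assist-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(
    EndpointName="chart-assist-endpoint",
    EndpointConfigName="chart-assist-config",
)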
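
A minimal PySpark sketch of the database-to-DataFrame loading and transformation pattern described in the list above, shown directly with the DataFrame API. The JDBC URL, table, column names, and output path are hypothetical placeholders, and the matching JDBC driver is assumed to be on the Spark classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("db_to_dataframe").getOrCreate()

# Load a source table over JDBC (placeholder URL, table, and credentials).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<host>:3306/sales")   # hypothetical source database
    .option("dbtable", "orders")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# Example transformations: filter completed orders, derive a date column,
# and aggregate the daily amount per customer.
daily_totals = (
    orders.filter(F.col("status") == "COMPLETE")
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("customer_id", "order_date")
    .agg(F.sum("amount").alias("daily_amount"))
)

# Persist the result for downstream reporting (placeholder ADLS path).
daily_totals.write.mode("overwrite").parquet(
    "abfss://<container>@<account>.dfs.core.windows.net/curated/daily_totals"
)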

Environment: Hadoop MapReduce, HDFS, Hive, Spark, Python, PySpark, Sqoop, Kafka, Java, SQL Server, MySQL, Oracle, DB2, AWS EMR, S3, CloudWatch, EC2, Elasticsearch

Confidential, Oak Brook, IL

Big Data Engineer

Responsibilities:

  • Developed data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Built and maintained SQL scripts, indexes, and complex queries for data analysis and extraction.
  • Performed fundamental tasks related to the design, construction, monitoring, and maintenance of Microsoft SQL databases.
  • Extensively used star and snowflake schema methodologies in building and designing logical data models into dimensional models.
  • Involved in creating Hive tables, loading with data, and writing Hive queries.
  • Used Pig for transformations, event joins, filtering bot traffic, and some pre-aggregations before storing the data in HDFS.
  • Extensively used Joins and sub-queries for complex queries involving multiple tables from different databases and performance tuning of SQL queries and stored procedures.
  • Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, RDDs, and Spark on YARN (see the Spark SQL sketch after this list).
  • Designed, developed, and implemented ETL objects by extracting data from source systems to the Hadoop Distributed File System (HDFS) using Sqoop.
  • Designed and developed ETL code using Informatica mappings to load data from heterogeneous source systems (flat files, XML, MS Access files, Oracle) to a target Oracle staging area, then to the data warehouse, and then to data mart tables for reporting.
  • Developed scripts in Hive to perform transformations on the data and loaded it into target systems for reports.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis; ran and debugged the Python parsers in the Linux environment.
  • Worked on performance tuning of Apache Kafka workflow to optimize the data ingestion speeds.
  • Ran batch jobs to load database tables from flat files using SQL*Loader.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Built a cloud data lake on the Snowflake, Redshift, and Azure SQL Data Warehouse platforms with Talend.
  • Extensively used Snowflake and Redshift to build the data lake and data warehousing architecture and used most of their tuning features.
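
A minimal PySpark sketch of the Spark SQL / DataFrame optimization approach referenced in the list above. The Hive database, table, and column names are hypothetical, and Hive support is assumed to be configured on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

spark = (
    SparkSession.builder.appName("spark_sql_aggregation")
    .enableHiveSupport()
    .getOrCreate()
)

# Read one partition of a Hive fact table and a small dimension table (placeholder names).
events = spark.sql("SELECT * FROM warehouse.web_events WHERE event_date = '2020-01-01'")
customers = spark.table("warehouse.customers")

# Broadcast the small dimension to avoid an expensive shuffle join,
# then pre-aggregate before writing the result back to Hive.
joined = events.join(broadcast(customers), "customer_id")
summary = joined.groupBy("customer_segment").agg(
    F.countDistinct("session_id").alias("sessions")
)

summary.write.mode("overwrite").saveAsTable("warehouse.daily_session_summary")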

Environment: Hadoop MapReduce, HDFS, Pig, Hive, Flume, Sqoop, Kafka, Spark, YARN, Scala, Azure (Azure Data Lake, Azure Storage, Azure SQL, Azure Databricks), Git, PL/SQL, SQL*Loader, SSIS, TOAD, Oracle 9i, Snowflake, Windows, UNIX.

Confidential

Data Engineer

Responsibilities:

  • Migrated existing data from SQL Server and Teradata to Hadoop and performed ETL operations.
  • Designed and implemented Sqoop incremental imports from relational databases to HDFS (S3).
  • Created data pipelines to load transformed data into Redshift from data lake.
  • Wrote extensive Spark/Scala programs using DataFrames, Datasets, and RDDs to transform transactional database data and load it into Redshift tables.
  • Implemented AWS Redshift design best practices for query optimization, such as choosing the distribution style of fact tables.
  • Reduced the latency of Spark jobs by tweaking Spark configurations and applying other performance optimization techniques such as memory tuning, serializing RDD data structures, broadcasting large variables, and improving data locality (see the tuning sketch after this list).
  • Implemented Delta Lake features to overcome the challenges of backfill and re-ingestion into the data lake.
  • Used the AWS auto scaling feature to scale clusters up and down based on the intensity of the computation.
  • Used a serverless computing platform (AWS Lambda) for running the Spark jobs.
  • Developed serverless workflows using AWS Step Function service and automated the workflow using AWS CloudWatch.
  • Designed the data warehouse per the business requirements and implemented fact tables and dimension tables using a star schema.
  • Optimized the existing ETL pipelines by tuning existing SQL queries and data partitioning techniques.
  • Verified the data flow using ETL tools like KNIME.
  • Contributed to the design and evolution of the cloud data infrastructure platform and overall data engineering tooling.
  • Synchronized both structured and unstructured data using Hive to support business needs.
  • Used the Hive query engine to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Bash scripts to automate the above extraction, transformation, and loading processes.
  • Delivered visualizations that distill clear, actionable insights from large, complex datasets.
  • Collaborated with product and engineering teams across multiple project areas to build innovative data solutions that improve features and deliver results in a data-driven way.
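
A minimal PySpark sketch of the kind of Spark configuration tuning described in the list above. The specific property values, input path, and output path are illustrative assumptions rather than the exact settings used.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Illustrative tuning knobs: Kryo serialization, shuffle parallelism, and broadcast threshold.
spark = (
    SparkSession.builder.appName("tuned_batch_job")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.shuffle.partitions", "400")                          # sized to the cluster (assumption)
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))  # broadcast tables up to 64 MB
    .getOrCreate()
)

transactions = spark.read.parquet("s3://<bucket>/transactions/")  # placeholder input path

# Cache the DataFrame (memory and disk) because it feeds several downstream aggregations.
transactions.persist(StorageLevel.MEMORY_AND_DISK)

# One of the reused aggregations, written to a staging area bound for Redshift (placeholder path).
transactions.groupBy("account_id").count().write.mode("overwrite").parquet(
    "s3://<bucket>/staging/account_counts/"
)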

Environment: Apache Spark, Scala, Amazon EMR, Amazon Redshift, AWS Step Functions, AWS CloudWatch

Confidential

Data Analyst

Responsibilities:

  • Experience building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation on AWS.
  • Devised simple and complex SQL scripts to check and validate data flows in various applications.
  • Used Hive to implement a data warehouse and stored data in HDFS; stored data in Hadoop clusters set up on AWS EMR.
  • Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Performed data engineering functions: data extraction, transformation, loading, and integration in support of enterprise data infrastructures such as the data warehouse, operational data stores, and master data management.
  • Implemented partitioning, dynamic partitions, and buckets in Hive (see the partitioning sketch after this list).
  • Developed database management systems for easy access, storage, and retrieval of data.
  • Performed DB activities such as indexing, performance tuning, and backup and restore.
  • Skilled in data visualization with libraries such as Matplotlib and seaborn.
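
A minimal PySpark sketch of the partitioning and bucketing pattern referenced in the list above, expressed through Spark's DataFrameWriter for consistency with the other examples (the equivalent HiveQL uses PARTITIONED BY and CLUSTERED BY ... INTO n BUCKETS). The database, table, and column names are hypothetical.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive_partitioning")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a raw staging table (placeholder name).
raw = spark.table("staging.raw_page_views")

# Write a managed table partitioned by date and bucketed by user_id,
# so date filters prune partitions and user_id joins avoid full shuffles.
(
    raw.write
    .partitionBy("view_date")
    .bucketBy(32, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("analytics.page_views")
)

# Downstream queries can now prune to a single partition.
spark.sql(
    "SELECT COUNT(*) FROM analytics.page_views WHERE view_date = '2020-01-01'"
).show()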

Environment: Cloud Shell, AWS, Apache Airflow, Python, Pandas, Matplotlib, Hive, Scala, Spark.
