
Data Engineer Resume

Bloomington, IL

SUMMARY

  • Around 6 years of IT experience in Big Data, with an excellent understanding of Hadoop architecture and a complete understanding of Hadoop daemons and components such as HDFS, YARN, Resource Manager, Node Manager, Name Node, Data Node, and the MapReduce programming paradigm.
  • Extensive experience in the Big Data ecosystem using the Hadoop framework and related technologies such as HDFS, MapReduce, Hive, HBase, Storm, YARN, Oozie, Sqoop, Airflow, and Zookeeper, including working experience with Spark Core, Spark SQL, Spark Streaming, Scala, and Kafka.
  • Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python (see the first sketch after this summary).
  • Experienced in facilitating the entire lifecycle of a data science project: Data Cleaning, Data Extraction, Data Pre-Processing, Dimensionality Reduction, Algorithm implementation, Back Testing and Validation.
  • Experienced in working with Datasets, Spark SQL, DataFrames, and RDDs, handling large DataFrames using partitioning, Spark in-memory capabilities, efficient joins, broadcast variables, user-defined functions (UDFs), user-defined aggregate functions (UDAFs), and actions and transformations applied during the ingestion process itself.
  • Experience in converting Hive/SQL queries into DataFrame transformations in a Spark environment using Scala and Python (see the second sketch after this summary).
  • Well versed in dealing with structured and unstructured data, time-series data, and statistical methodologies such as hypothesis testing, ANOVA, multivariate statistics, modeling, decision theory, and time-series analysis.
  • Proficient in data transformations using log, square root, reciprocal, cube root, square, and Box-Cox transformations, depending on the dataset.
  • Experience with relational and non-relational databases such as MySQL, MS SQL Server, Oracle, MongoDB, Cassandra, and PostgreSQL.
  • Adroit at employing various Data Visualization tools like Tableau, Matplotlib, Seaborn, ggplot2, and Plotly.
  • Hands-on experience with various AWS services such as Redshift clusters and Route 53 domain configuration.
  • Experience in practical implementation of cloud-specific AWS technologies including IAM, Elastic Compute Cloud (EC2), Simple Storage Service (S3), Virtual Private Cloud (VPC), Lambda, EBS, and EMR.
  • Proficient with container systems like Docker and container orchestration platforms like EC2 Container Service and Kubernetes; worked with Terraform.
  • Managed Docker containerization and used Kubernetes to orchestrate the deployment, scaling, and management of Docker containers.
  • Expertise in building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau.
  • Experience with complex data processing pipelines, including ETL and data ingestion, dealing with unstructured and semi-structured data.
  • Good communication and presentation skills, willing to learn, adapt to new technologies and third-party products.
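
As a minimal illustration of the Airflow workflows mentioned above, the sketch below defines a small DAG in Python with two dependent tasks. The DAG ID, schedule, and scripts are hypothetical, and the import path assumes Airflow 2.x.

```python
# Minimal sketch of an Airflow DAG (hypothetical DAG/task names and scripts).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # Airflow 2.x import path

with DAG(
    dag_id="daily_ingest",                 # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python /opt/jobs/extract.py",  # hypothetical script
    )
    load = BashOperator(
        task_id="load",
        bash_command="python /opt/jobs/load.py",     # hypothetical script
    )

    extract >> load  # run the load task only after extract succeeds
```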
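
To illustrate the Hive/SQL-to-DataFrame conversion noted above, here is a minimal PySpark sketch showing the same aggregation written first as a SQL query and then as equivalent DataFrame transformations; the table and column names are hypothetical.

```python
# Minimal sketch: the same aggregation as Spark SQL and as DataFrame transformations.
# The "sales" table and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-to-dataframe").getOrCreate()

# Hive/SQL form
sql_result = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM sales
    WHERE sale_date >= '2020-01-01'
    GROUP BY customer_id
""")

# Equivalent DataFrame form
df_result = (
    spark.table("sales")
    .filter(F.col("sale_date") >= "2020-01-01")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)
```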

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Hive, Sqoop, Oozie, Scala, Kafka, Ambari, Hue

Hadoop/Spark Ecosystem: Hadoop, HDFS, MapReduce, Hive, HBase, Spark, Impala, Cloudera, Hortonworks HDP, Spark Core, Spark SQL, NiFi, Sqoop, Kafka, Spark Streaming

Data Warehouses: Snowflake, Teradata

Programming Languages: Python, Scala, Java, PL/SQL, SQL, Linux Shell Scripts

Databases: Oracle, MS SQL Server, MySQL, PostgreSQL

Cloud: AWS, Azure

AWS: S3, EMR, EC2, Glue, ELB

Tools: Jenkins, Maven, ANT

PROFESSIONAL EXPERIENCE

Confidential, Bloomington, IL

Data Engineer

Responsibilities:

  • Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, DataFrames, and Spark on YARN.
  • Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that consumes data from Kafka in near real time and persists it to Cassandra (see the sketch after this list).
  • Developed Kafka consumer APIs in Scala for consuming data from Kafka topics.
  • Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
  • Loaded DStream data into Spark DataFrames and performed in-memory computation to generate the output response.
  • Involved in writing live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Worked on AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small data set processing and storage, and was involved in maintaining the Hadoop cluster on AWS EMR.
  • Imported data from AWS S3 into Spark DataFrames and performed transformations and actions on them.
  • Implemented Elasticsearch on the Hive data warehouse platform.
  • Designed column families in Cassandra, ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra per the business requirements.
  • Wrote Python scripts to process semi-structured data in formats like JSON.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Created PySpark code that uses Spark SQL to generate DataFrames from the Avro-formatted raw layer and write them to data service layer internal tables in Parquet format.
  • Created MapReduce jobs using Python scripts to perform ETL tasks.
  • After running ETL queries, performed validation checks and reported to the client at every stage of the project.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Created external tables with partitions using Hive, AWS Athena, and Redshift.
  • Migrated existing MapReduce programs to the Spark model using Python.
  • Used the Spark DataStax Cassandra Connector to load data to and from Cassandra.
  • Tested cluster performance using the cassandra-stress tool to measure and improve read/write throughput.
  • Used HiveQL to analyze partitioned and bucketed data and executed Hive queries on Parquet tables stored in Hive to perform data analysis that meets the business specification logic.
  • Used Kafka capabilities such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering use cases.
  • Used Amazon DynamoDB to gather and track the event-based metrics.
  • Strong experience in migrating other databases to Snowflake.
  • Experience with Snowflake multi-cluster warehouses.
  • Built ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, and wrote SQL queries against Snowflake.
  • Responsible for generating actionable insights from complex data to drive real business results for various application teams and worked in Agile Methodology projects extensively.
  • Worked with SCRUM team in delivering agreed user stories on time for every Sprint.
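
As a hedged illustration of the Kafka-to-Cassandra pipeline described above, the sketch below uses PySpark Structured Streaming with the Spark Cassandra Connector; the broker, topic, schema, and keyspace/table names are hypothetical.

```python
# Minimal sketch: consume JSON events from Kafka with Spark Structured Streaming
# and persist each micro-batch to Cassandra via the Spark Cassandra Connector.
# Broker, topic, schema, and keyspace/table names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("learner-events").getOrCreate()

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "learner-events")             # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_to_cassandra(batch_df, batch_id):
    # Requires the spark-cassandra-connector package on the classpath.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="learner_events")  # hypothetical names
     .mode("append")
     .save())

query = events.writeStream.foreachBatch(write_to_cassandra).start()
query.awaitTermination()
```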

Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, EC2, S3, Redshift, Glue, MapR, HDFS, Hive, Apache Kafka, Sqoop, Java, Python, Scala, shell scripting, Linux, MySQL, Oracle Enterprise DB, Solr, Jenkins, Eclipse, Git, Oozie, Tableau, Power BI, SOAP, NiFi, Cassandra, Agile methodologies; used SDLC and RUP for PLM management.

Confidential, Irving, TX

Data Engineer

Responsibilities:

  • Worked on a scalable distributed data system using the Hadoop ecosystem on AWS EMR and the MapR data platform.
  • Developed simple to complex MapReduce streaming jobs using Python and Hive.
  • Used various compression mechanisms to optimize MapReduce jobs and use HDFS efficiently.
  • Used Sqoop as an ETL component to extract data from MySQL and load it into HDFS.
  • Performed ETL on the business data and created a Spark pipeline that efficiently carries out the ETL process.
  • Used broadcast variables in PySpark, efficient joins, transformations, and other capabilities for data processing.
  • Created PySpark jobs that perform the entire ETL process.
  • Wrote Hive queries and scripts to study customer behavior by analyzing the data.
  • Strong exposure to Unix scripting and good hands-on shell scripting experience.
  • Wrote Python scripts to process semi-structured data in formats like JSON.
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Troubleshot and found bugs in the Hadoop applications, working with the testing team to clear all the bugs.
  • Involved in file movement between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Loaded data into Amazon Redshift and used AWS CloudWatch to collect metrics and monitor AWS RDS instances within Confidential.
  • Developed Kafka producers and consumers with the Python API for writing Avro schemas.
  • Developed and executed a migration strategy to move the data warehouse from an Oracle platform to AWS Redshift.
  • Developed PySpark code for AWS Glue jobs and for EMR (see the sketch after this list).
  • Responsible for data analysis and cleaning using Spark SQL queries.
  • Handled importing data from various data sources, performed transformations using Spark, and loaded the data into Hive.
  • Worked with the Spark Core, Spark Streaming, and Spark SQL modules of Spark.
  • Used Scala to write the code for all Spark use cases, gained extensive experience with Scala for data analytics on the Spark cluster, and performed map-side joins on RDDs.
  • Explored various PySpark modules, working with DataFrames, RDDs, and Spark Context.
  • Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and Lambda functions.
  • Used CloudWatch Logs to move application logs to S3 and created alarms based on exceptions raised by applications.
  • Migrated data from the DR region to S3 buckets and developed a connection linking AWS and Snowflake using Python.
  • Developed a data pipeline using Kafka, Spark, and Hive to ingest, transform, and analyze data.
  • Determined the viability of business problems for a Big Data solution with PySpark.
  • Involved in time series data representation using HBase.
  • Strong working experience with Splunk for real-time log data monitoring.
  • Built clusters in the AWS environment using EMR with S3, EC2, and Redshift.
  • Worked with Databricks to connect different sources and transform data for storage in the cloud platform.
  • Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Worked with BI (Tableau) teams on dataset requirements and have good working experience with data visualization.
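
As a hedged sketch of the AWS Glue PySpark work referenced above, the job below reads a Glue Data Catalog table backed by S3, applies a simple column mapping, and writes to Redshift; the database, table, connection, and bucket names are hypothetical.

```python
# Minimal AWS Glue PySpark job sketch: read from the Glue Data Catalog (S3-backed),
# apply a column mapping, and load the result into Redshift.
# Database, table, connection, and bucket names below are hypothetical.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: catalog table pointing at raw campaign files on S3.
source = glue_context.create_dynamic_frame.from_catalog(
    database="campaign_raw", table_name="adobe_events")

# Keep and rename only the columns needed downstream.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("campaign", "string", "campaign", "string"),
        ("event_ts", "timestamp", "event_ts", "timestamp"),
    ])

# Sink: Redshift via a Glue connection, staged through a temporary S3 prefix.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.campaign_events", "database": "dev"},
    redshift_tmp_dir="s3://example-temp-bucket/redshift-staging/")

job.commit()
```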

Environment: MapReduce, AWS, S3, EC2, EMR, Redshift, Glue, Java, HDFS, Hive, Tez, Oozie, HBase, Spark, Scala, Spark SQL, Kafka, Python, PuTTY, PySpark, Cassandra, shell scripting, ETL, YARN, Splunk, Sqoop, Linux, Cloudera, Ganglia, SQL Server.

Confidential

Hadoop developer

Responsibilities:

  • In-depth understanding of Hadoop architecture and various components such as HDFS, Application Master, Node Manager, Resource Manager, Name Node, Data Node, and MapReduce concepts.
  • Imported required tables from RDBMS to HDFS using Sqoop, and used Storm and Kafka for real-time streaming of data into HBase.
  • Used the NoSQL database HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Wrote Hive queries and scripts as an ETL tool to perform transformations, event joins, traffic filtering, and pre-aggregations before storing the data in HDFS.
  • Developed data pipeline using Flume, Sqoop, and MapReduce to ingest customer behavioral data.
  • Developed Spark code using Scala and Spark-SQL for faster testing and processing of data.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Developed Java code to generate, compare & merge AVRO schema files.
  • Developed complex MapReduce streaming jobs in Java implemented through Hive, and wrote MapReduce programs in Python to perform various ETL, cleaning, and scrubbing tasks.
  • Prepared the validation report queries, executed them after every ETL run, and shared the resulting values with business users in different phases of the project.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting, applying Hive optimization techniques for joins and best practices for writing Hive scripts in HiveQL.
  • Imported and exported data into HDFS and Hive using Sqoop, and wrote Hive queries to extract the processed data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Experienced in data ingestion from different data sources and creating an optimal strategy for further analysis.
  • Experienced in creating ETL pipelines for batch and streaming data.
  • Developed and ran MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports per user needs.
  • Imported data from AWS S3 into Spark DataFrames and performed transformations and actions on the DataFrames.
  • Teamed up with Architects to design Spark model for the existing MapReduce model and Migrated MapReduce models to Spark Models using Scala.
  • Implemented Spark using Scala, utilizing the Spark Core, Spark Streaming, and Spark SQL APIs for faster processing of data instead of MapReduce in Java.
  • Used Spark SQL to load JSON data, create a schema DataFrame, and load it into Hive tables, and handled structured data using Spark SQL (see the sketch after this list).
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Integrated Apache Storm with Kafka to perform web analytics and move clickstream data from Kafka to HDFS.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and automate several types of Hadoop jobs such as Java MapReduce, Hive, and Sqoop.
  • Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Sqoop, Spark and Zookeeper.
  • Expert knowledge on MongoDB NoSQL data modeling, tuning, disaster recovery and backup.
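
As a minimal illustration of loading JSON with Spark SQL into Hive tables as described above, the sketch below reads raw JSON, derives a date partition column, and writes a partitioned Hive table; the HDFS path, columns, and table name are hypothetical.

```python
# Minimal sketch: load raw JSON with Spark SQL, derive a partition column,
# and persist the result as a partitioned Hive table.
# The HDFS path, columns, and table name below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = (
    SparkSession.builder
    .appName("json-to-hive")
    .enableHiveSupport()   # so saveAsTable targets the Hive metastore
    .getOrCreate()
)

raw = spark.read.json("hdfs:///data/raw/events/")  # hypothetical HDFS path

cleaned = (
    raw
    .withColumn("event_date", to_date(col("event_ts")))
    .filter(col("event_id").isNotNull())
)

(cleaned.write
 .mode("overwrite")
 .partitionBy("event_date")
 .format("parquet")
 .saveAsTable("analytics.events"))  # hypothetical Hive database.table
```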

Environment: Apache Hadoop, AWS, Python, HDFS, MapReduce, HBase, Hive, Yarn, Sqoop, Flume, Zookeeper, Kafka, Impala, SparkSQL, Spark Core, Spark Streaming, NoSQL, MySQL, Cloudera, Java, JDBC, Spring, ETL, WebLogic, Web Analytics, Avro, Cassandra, Oracle, Shell Scripting, Ubuntu.

Confidential

Python Developer

Responsibilities:

  • Assessed the infrastructure needs for each application and deployed it on the Azure platform.
  • Built and deployed code artifacts into the respective environments in the Confidential Azure cloud.
  • Deployed and published a Django web app as a platform-as-a-service (PaaS) offering in Azure App Service.
  • Created Non-Prod and Prod Environments in Azure from scratch.
  • Worked on various Azure services such as Compute (Web Roles, Worker Roles), Azure Websites, Caching, SQL Azure, NoSQL, U-SQL, Storage, Network services, Data Factory, Azure Active Directory, API Management, Scheduling, and Auto Scaling.
  • Developed U-SQL Scripts for schematizing the data in Azure Data Lake Analytics.
  • Experience processing and transforming data by running U-SQL scripts on Azure.
  • Designed the user interface and client-side scripting using AngularJS framework, Bootstrap and JavaScript.
  • Created user interface designs using HTML5, CSS3, JavaScript, jQuery, JSON, REST, AngularJS, and Bootstrap (see the sketch after this list).
  • Developed GUI using JavaScript, HTML/HTML5, DOM, AJAX, CSS3, CQ5 and AngularJS in ongoing projects.
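
As a hedged sketch of the Django side of the JSON/REST integration with the AngularJS front end mentioned above, the snippet below exposes a small JSON endpoint; the view, URL, and payload are hypothetical.

```python
# Minimal Django sketch: a view returning a JSON payload over a REST-style URL,
# as consumed by an AngularJS front end. View name, URL, and payload are hypothetical.
from django.http import JsonResponse
from django.urls import path


def health(request):
    # A real view would query models and serialize the results.
    return JsonResponse({"status": "ok", "service": "campaign-api"})


urlpatterns = [
    path("api/health/", health),  # hypothetical route
]
```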

Environment: Azure Kubernetes services, Container Services, Model management, Terraform, Docker, Python, Django, HTML5, CSS3, JavaScript, jQuery, Ajax, Bootstrap, GitHub, VSTS.
