Data Engineer Resume
Chicago, IL
SUMMARY:
- 7+ years of experience in software analysis, design, development, testing and implementation of cloud and big data solutions using Hadoop, Spark, Scala, BigQuery and Google Cloud Platform.
- Hands-on experience in designing and implementing data engineering pipelines and analyzing data using Hadoop ecosystem tools such as HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Flume, Kafka, Impala, PySpark, Oozie and HBase.
- Hands-on programming experience in Java, Python, Scala and SQL, including BigQuery.
- Sound knowledge and experience in the architecture of distributed systems and parallel processing frameworks.
- Designed and implemented end-to-end data pipelines to extract, cleanse, process and analyze large volumes of behavioral and log data.
- Good experience working with data analytics and big data services in the AWS cloud, such as EMR, Redshift, S3, Athena and Glue.
- Good knowledge of and experience with Google Cloud Platform (GCP).
- Experienced in developing production-ready Spark applications using the Spark RDD API, DataFrames, Spark SQL and the Spark Streaming API.
- Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting Spark application failures.
- Strong experience with Spark Streaming, Spark SQL and other Spark features such as accumulators, broadcast variables, caching levels and job optimization techniques.
- Proficient in importing/exporting data between RDBMS and HDFS using Sqoop.
- Used Hive extensively to perform data analytics required by business teams.
- Solid experience working with various data formats such as Parquet, ORC, Avro and JSON.
- Experience automating end-to-end data pipelines with strong resilience and recoverability.
- Worked on Spark Streaming and Spark Structured Streaming with Kafka for real-time data processing (see the streaming sketch after this section).
- Experience with scheduling and monitoring workflows using Apache Airflow.
- Responsible for developing multiple Kafka Producers and Consumers from scratch as per the software requirement specifications.
- Experience in creating Impala views on Hive tables for fast access to data.
- Hands-on experience with various Hadoop distributions such as Cloudera, Hortonworks and AWS EMR.
- Experienced in using Waterfall, Agile and Scrum software development process frameworks.
- Experience with ETL and data warehouse tools such as Informatica, DataStage and Snowflake.
- Good experience in Core Java, JEE technologies, JDBC, Servlets and JSP.
- Good knowledge in Oracle PL/SQL and shell scripting.
- Experience in designing and developing web forms using Spring MVC, JavaScript, JSON and jqPlot.
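
A minimal, illustrative sketch of the Spark Structured Streaming pattern referenced in the streaming bullet above, written in PySpark. The broker list, topic name and output paths are hypothetical placeholders, not project specifics.

    # Kafka -> Parquet-on-HDFS Structured Streaming sketch (placeholder names throughout).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka_to_parquet").getOrCreate()

    # Read the raw Kafka feed as a streaming DataFrame.
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
           .option("subscribe", "events")                      # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
    parsed = raw.select(col("key").cast("string"),
                        col("value").cast("string"),
                        col("timestamp"))

    # Persist the stream as Parquet with checkpointing for recoverability.
    query = (parsed.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events/parquet")              # placeholder path
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()
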
TECHNICAL SKILLS:
- AWS
- GCP
- Hadoop
- Kafka
- Scala
- EMR
- Spark
- Sqoop
- PySpark
- Cloudera
- Hortonworks
- Airflow
- Redshift
- S3
PROFESSIONAL EXPERIENCE:
Confidential | Chicago, IL
Data Engineer
Responsibilities:
- Developed generic Spark ingestion code that reads data from multiple sources (Oracle, Teradata, SAP, RDS) and writes it to AWS S3 (see the ingestion sketch after this role).
- Provided end-to-end solutions for enterprise data management, including data architecture, data quality and governance, metadata management, data strategy, master data management, and conceptual, logical and physical data modelling.
- Worked in an Agile data modeling methodology, creating data models in sprints within an SOA architecture, and was involved in delivering complex enterprise data solutions with a comprehensive understanding of architecture, security, performance, scalability and reliability.
- Developed Databricks notebooks to generate Hive CREATE TABLE statements from the data and load the data into the tables.
- Developed Python code to orchestrate all batch and file-based processing.
- Developed HQL scripts to transfer data between source and target tables.
- Responsible for full data loads from production to the AWS Redshift staging environment and worked on migrating the EDW to AWS using EMR and other technologies.
- Converted existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation.
- Worked on SQL Server components: SSIS (SQL Server Integration Services), SSAS (Analysis Services) and SSRS (Reporting Services).
- Developed Airflow DAGs to orchestrate sequential and parallel ETL jobs (see the Airflow sketch after this role).
- Developed PySpark code to compare data between HDFS and S3 (see the comparison sketch after this role).
- Analyzed the Hadoop cluster and various big data analytics tools, including MapReduce, Hive, Spark and Scala.
- Extensively used Hive and Spark optimization techniques such as partitioning, bucketing, map joins, parallel execution, broadcast joins and repartitioning.
- Strong understanding of data modeling (relational, dimensional, star and snowflake schemas), data analysis and data warehousing implementations on Windows and UNIX.
- Created data quality scripts using SQL and Hive to validate successful data loads and data quality. Created various data visualizations using Python and Tableau.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop and Falcon to import and export data to and from Oracle.
- Used various Spark transformations and actions to cleanse input data; implemented Spark in Scala using DataFrames and the Spark SQL API for faster data processing.
- Developed Spark scripts using the Scala shell as per requirements.
- Created partitioned, bucketed Hive tables and loaded data into the respective partitions at runtime for quick downstream access.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
- Used Spark and Spark SQL to read Parquet data and create Hive tables using the Scala API.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
- Worked on a POC recommending the Spark optimization techniques applicable to the workload (broadcast joins, repartition vs. coalesce, partitioning vs. bucketing).
Environment: Erwin 9.x, SQL, Oracle 10g, Teradata, SAP, RDS, SQL Server, TOAD, PL/SQL, Flat Files, T-SQL, MDM, Informatica PowerCenter, DB2, SSRS, SAS, SSIS, SSAS, Tableau.
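
The ingestion sketch referenced above: a minimal, illustrative PySpark pattern for reading a table over JDBC and landing it in S3 as Parquet. The JDBC URL, credentials, table and bucket names are hypothetical placeholders; the production code was parameterized per source system.

    # Generic JDBC -> S3 ingestion sketch (placeholder connection details).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("generic_ingestion").getOrCreate()

    def ingest(jdbc_url, table, user, password, s3_path):
        """Read one table over JDBC and write it to S3 as Parquet."""
        df = (spark.read
              .format("jdbc")
              .option("url", jdbc_url)
              .option("dbtable", table)
              .option("user", user)
              .option("password", password)
              .option("fetchsize", "10000")
              .load())
        df.write.mode("overwrite").parquet(s3_path)

    # Example invocation for an Oracle source (placeholder values).
    ingest("jdbc:oracle:thin:@//dbhost:1521/ORCL", "SALES.ORDERS",
           "etl_user", "etl_password", "s3a://landing-bucket/sales/orders/")
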
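The Airflow sketch referenced above: an illustrative DAG (Airflow 2.x style) in which an extract task fans out to two parallel transforms before a final load. The DAG id, schedule and commands are hypothetical placeholders.

    # Sequential + parallel ETL orchestration sketch (placeholder task commands).
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform_a = BashOperator(task_id="transform_a", bash_command="echo transform_a")
        transform_b = BashOperator(task_id="transform_b", bash_command="echo transform_b")
        load = BashOperator(task_id="load", bash_command="echo load")

        # extract runs first, the two transforms run in parallel, load runs last.
        extract >> [transform_a, transform_b] >> load
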
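The comparison sketch referenced above: an illustrative PySpark check of an HDFS dataset against its S3 copy. Paths are hypothetical placeholders.

    # HDFS vs. S3 data comparison sketch (placeholder paths).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs_s3_compare").getOrCreate()

    hdfs_df = spark.read.parquet("hdfs:///data/orders/")
    s3_df = spark.read.parquet("s3a://data-bucket/orders/")

    # Rows present on one side but not the other (exact row-level comparison).
    only_in_hdfs = hdfs_df.exceptAll(s3_df)
    only_in_s3 = s3_df.exceptAll(hdfs_df)

    print("row counts:", hdfs_df.count(), s3_df.count())
    print("mismatches:", only_in_hdfs.count(), only_in_s3.count())
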
Confidential | Minneapolis, MN
Data Engineer
Responsibilities:
- Involved in all phases of the SDLC, including requirement gathering, design, analysis, testing of customer specifications, development and deployment of the application, and designing a reliable and scalable data pipeline.
- Worked with various complex queries, subqueries and joins to check the validity of loaded and imported data.
- Worked with PowerShell and Unix scripts for file transfer, emailing and other file related tasks.
- Designed and implemented ETL pipelines from various relational databases to the data warehouse using Apache Airflow.
- Worked on Data Extraction, aggregations and consolidation of Adobe data within AWS Glue using PySpark.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Worked on data transformation and retrieval from mainframes to Oracle using SQL*Loader and control files.
- Created Tableau visualizations by connecting to AWS Elastic MapReduce (EMR).
- Developed custom ETL solutions, batch processing and real-time data ingestion pipelines to move data in and out of Hadoop using Python and shell scripts.
- Developed PySpark and Spark SQL code to process data in Apache Spark on Amazon EMR and perform the necessary transformations based on the STMs (source-to-target mappings) developed.
- Developed data integration strategies for data flow between disparate source systems and Big Data enabled Enterprise Data Lake.
- Built a serverless ETL process in AWS Lambda so that new files landing in the S3 bucket are cataloged immediately (see the Lambda sketch after this role).
- Worked with AWS SQS to consume data from S3 buckets.
- Worked with relational SQL and NoSQL databases, including PostgreSQL and Hadoop.
- The objective of this project was to build a cloud-based data lake in AWS using Apache Spark and to provide visualization of the ETL orchestration using the CDAP tool.
- Worked extensively with Amazon Web Services (AWS) cloud services such as EC2, S3 and EMR.
- Worked on data cleaning and reshaping and generated segmented subsets using NumPy and Pandas in Python (see the segmentation sketch after this role).
- Developed and deployed multiple projects to production through the CI/CD pipeline for real-time data distribution, storage and analytics, with persistence to S3, HDFS and Postgres.
- Real-time data from the source was ingested as file streams into the Spark Streaming platform and saved in HDFS and Hive.
- Configured CloudWatch, Lambda, SQS and SNS to send alert notifications.
- Created S3 buckets, managed S3 bucket policies and utilized S3 and Glacier for storage and backup on AWS.
- Designed a data flow to pull data from a third-party vendor REST API using OAuth authentication (see the REST API sketch after this role).
- Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake data warehouse.
- Worked with Amazon EMR to process data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2).
- Experience in server infrastructure development on Gateway, ELB, Auto Scaling, DynamoDB, Elasticsearch and Virtual Private Cloud (VPC).
- Created architecture stack blueprint for data access with NoSQL Database Cassandra.
- Deployed the big data Hadoop application using Talend on the AWS cloud (Amazon Web Services).
- Worked in the Snowflake environment to remove redundancy and loaded real-time data from various sources into HDFS using Kafka.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Designed and Developed ETL jobs to extract data from Salesforce replica and load it in data mart in Redshift.
Environment: AWS, Hadoop, Hive, HBase, Spark, Oozie, Kafka, MySQL, Jenkins, API, Snowflake, PowerShell, GitHub, Oracle Database 12c/11g, DataStage, SQL Server.
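
The Lambda sketch referenced above: an illustrative handler that reacts to S3 "object created" events and starts a Glue crawler so new files are cataloged right away. The crawler name is a hypothetical placeholder.

    # S3-triggered cataloging sketch (placeholder crawler name).
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # S3 event notifications carry one record per new object.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new object: s3://{bucket}/{key}")

        # Kick off the crawler that catalogs the landing prefix.
        glue.start_crawler(Name="landing-zone-crawler")  # placeholder crawler name
        return {"status": "crawler started"}
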
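The segmentation sketch referenced above: an illustrative bit of Pandas/NumPy cleaning and segmentation. The input file, column names and bin edges are hypothetical placeholders.

    # Cleaning and segmentation sketch (placeholder columns and bins).
    import numpy as np
    import pandas as pd

    df = pd.read_csv("events.csv")                    # placeholder input file
    df = df.dropna(subset=["user_id", "amount"])      # basic cleaning
    df["amount"] = df["amount"].astype(float)

    # Segment rows into spend bands, then split into per-segment subsets.
    bins = [0, 50, 200, np.inf]
    df["segment"] = pd.cut(df["amount"], bins=bins, labels=["low", "mid", "high"])
    segments = {name: part for name, part in df.groupby("segment")}
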
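The REST API sketch referenced above: an illustrative OAuth 2.0 client-credentials pull with the requests library. The token endpoint, data endpoint and credentials are hypothetical placeholders; the vendor's actual flow may differ.

    # OAuth-protected REST pull sketch (placeholder URLs and credentials).
    import requests

    TOKEN_URL = "https://vendor.example.com/oauth/token"   # placeholder
    DATA_URL = "https://vendor.example.com/api/v1/events"  # placeholder

    def fetch_events(client_id, client_secret):
        # Exchange client credentials for a bearer token.
        token_resp = requests.post(TOKEN_URL, data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        })
        token_resp.raise_for_status()
        token = token_resp.json()["access_token"]

        # Call the data endpoint with the bearer token.
        resp = requests.get(DATA_URL, headers={"Authorization": f"Bearer {token}"})
        resp.raise_for_status()
        return resp.json()
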
Confidential | San Diego, CA
Data Engineer
Responsibilities:
- Developed Data pipeline using Spark, Hive and HBase to ingest data into Hadoop cluster for analysis.
- Collected data from an AWS S3 bucket in batch and real time using Spark Streaming, performed the necessary transformations and aggregations to build the common learner data model, and persisted the data in HDFS (see the aggregation sketch after this role).
- Hands-on experience in designing, developing and maintaining software solutions in a Hadoop cluster.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and Spark on YARN.
- Experienced with Spark Streaming to ingest data into an ingestion platform (an in-built application).
- Designed the ETL run performance tracking sheet for different phases of the project and shared it with the production team.
- Performed quality checks on existing code to improve performance.
- Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Involved in developing Hive DDLs to create, alter and drop Hive tables (see the Hive DDL sketch after this role).
- Involved in loading data from Linux file system to HDFS.
- Involved in data warehousing and Business Intelligence systems.
- Involved in identifying and designing the most efficient and cost-effective solutions through research and evaluation of alternatives.
- Demonstrated Hadoop best practices and knowledge of technical solutions, design patterns and code for medium to complex applications deployed in Hadoop production.
Environment/Tools: Spark, Hive, Pig, Spark SQL, Spark Streaming, HBase, Sqoop, Cloudera, PySpark, HDFS
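
The aggregation sketch referenced above: an illustrative PySpark batch job that reads raw events from S3, aggregates them per learner and persists the result to HDFS. Paths and column names are hypothetical placeholders.

    # S3 -> aggregate -> HDFS sketch (placeholder paths and columns).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("learner_model").getOrCreate()

    events = spark.read.json("s3a://raw-bucket/learner-events/")  # placeholder path

    # Aggregate per learner to build a simple version of the common learner data model.
    learner_model = (events
                     .groupBy("learner_id")
                     .agg(F.count("*").alias("event_count"),
                          F.max("event_time").alias("last_seen")))

    learner_model.write.mode("overwrite").parquet("hdfs:///models/common_learner/")
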
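The Hive DDL sketch referenced above: illustrative DDL for a partitioned, bucketed table, issued through Spark SQL with Hive support, plus a sample reporting query. Database, table and column names are hypothetical placeholders.

    # Partitioned/bucketed Hive table sketch (placeholder schema).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_ddl")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.user_events (
            user_id    STRING,
            event_type STRING,
            event_time TIMESTAMP
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (user_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Example reporting metric computed over a single partition.
    spark.sql("""
        SELECT event_type, COUNT(*) AS events
        FROM analytics.user_events
        WHERE event_date = '2020-01-01'
        GROUP BY event_type
    """).show()
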
Confidential
Data Engineer
Responsibilities:
- Involved in the analysis, design, implementation, and testing of the project.
- Exposed to various phases of the Software Development Life Cycle using the Agile/Scrum software development methodology.
- Developed views and templates with Python and Django's view controller and templating language to create a user-friendly website interface (see the Django sketch after this role).
- Developed the customer complaints application using Django Framework, which includes Python code.
- Strong understanding and practical experience in developing Spark applications with Python.
- Developed Scala scripts and UDFs using both DataFrames and SQL in Spark for data aggregations.
- Designed, developed, tested, deployed and maintained the website.
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Developed Python scripts to update content in the database and manipulate files.
- Rewrote an existing Java application as a Python module to deliver data in a specific format.
- Designed and developed the UI of the website using HTML, XHTML, AJAX, CSS and JavaScript.
- Wrote Python scripts to parse XML documents and load the data into the database (see the XML sketch after this role).
- Generated property list for every application dynamically using Python.
- Performed testing using Django’s Test Module.
- Designed and developed data management system using MySQL.
- Created unit test and regression test frameworks for existing and new code.
- Responsible for search engine optimization to improve the visibility of the website.
- Responsible for debugging and troubleshooting the web application.
Environment: Python, Django, Java, MySQL, Linux, HTML, XHTML, CSS, AJAX, JavaScript, Apache Web Server.
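
The Django sketch referenced above: an illustrative view and URL wiring for a complaints-style page. The app, model and template names are hypothetical placeholders.

    # views.py -- render a list page (placeholder model and template names).
    from django.shortcuts import render
    from .models import Complaint          # hypothetical model

    def complaint_list(request):
        """Render the list of customer complaints."""
        complaints = Complaint.objects.order_by("-created_at")
        return render(request, "complaints/list.html", {"complaints": complaints})

    # urls.py -- route the view.
    from django.urls import path
    from . import views

    urlpatterns = [
        path("complaints/", views.complaint_list, name="complaint_list"),
    ]
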
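The XML sketch referenced above: an illustrative script that parses an XML document with the standard library and loads rows through the Django ORM. The XML layout, file path and model are hypothetical placeholders; in practice this would run as a Django management command (or after django.setup()).

    # XML -> database loader sketch (placeholder XML layout and model).
    import xml.etree.ElementTree as ET

    from complaints.models import Complaint   # hypothetical app/model

    def load_complaints(xml_path):
        """Parse <complaint> elements and insert one row per element."""
        tree = ET.parse(xml_path)
        for node in tree.getroot().findall("complaint"):
            Complaint.objects.create(
                customer=node.findtext("customer", default=""),
                category=node.findtext("category", default=""),
                description=node.findtext("description", default=""),
            )

    load_complaints("complaints.xml")   # placeholder input file
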