Data Engineer Resume
Newark, NJ
SUMMARY
- Over 10 years of IT industry experience, with hands-on work in Data Engineering.
- In-depth knowledge of Big Data processing with Spark Core, Spark SQL, and the DataFrame/Dataset/RDD APIs.
- Developed data processing applications using Spark with PySpark and Scala.
- Experience writing Spark jobs for data cleansing and transformation.
- Good knowledge of Spark architecture and real-time streaming with Spark.
- Experience implementing batch and real-time data pipelines using AWS services such as S3, Lambda, DynamoDB, Redshift, EMR, and Kinesis.
- Experience in working with NoSQL databases like MongoDB and DynamoDB
- Strong SQL skills for querying data for validation, reporting, and dashboards.
- Excellent working experience with Scrum/Agile and Waterfall project execution methodologies.
- Good understanding of and exposure to Python programming.
- Experience in working with MapReduce programs using Apache Hadoop to analyze large data sets efficiently.
- Strong experience working with core Hadoop components: HDFS, YARN, and MapReduce.
- Strong experience designing big data pipelines covering data ingestion, data processing (transformations, enrichment, and aggregations), and reporting.
- Strong experience submitting Spark applications to different cluster managers such as Spark Standalone and Hadoop YARN.
- Experience integrating Kafka with Spark Streaming for high-speed data processing.
- Proficient in installing, configuring, and using Apache Hadoop ecosystem components such as MapReduce, Hive, Pig, Flume, YARN, HBase, Sqoop, Spark, Storm, Kafka, Oozie, and ZooKeeper.
- Strong comprehension of Hadoop daemons and MapReduce concepts.
- Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), Redshift, Kinesis, SQS, SNS, and SES.
- Familiarity with implementing and orchestrating data pipelines using Airflow
- Experience in branching, monitoring, and tagging versions across environments using tools such as Git and SVN.
- Strong analytical and problem-solving skills and the ability to follow through with projects from inception to completion.
- Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills
- Acquired the AWS Certified Solutions Architect - Associate certification.
TECHNICAL SKILLS
Big Data: Hadoop 3.0, Spark 2.3, Hive 2.3, Cassandra 3.11, MongoDB 3.6, MapReduce, Sqoop.
Programming Languages: Python, Scala, SQL, PySpark
Big Data Technologies: Spark, Hadoop, HDFS, Hive, Yarn
Databases: PostgreSQL, MySQL, Oracle, MongoDB, DynamoDB
Other Tools: Eclipse, PyCharm, GitHub, Jira
Cloud: AWS (S3, EMR, EC2, Glue, Redshift, Athena)
Methodologies: Agile, RAD, JAD, RUP, UML, System Development
PROFESSIONAL EXPERIENCE
Confidential - Newark, NJ
Data Engineer
Responsibilities:
- As a Data Engineer, participated in Agile Scrum meetings to help manage and organize a team of developers, with regular code review sessions.
- Participated in code reviews, enhancement discussions, maintenance of existing pipelines and systems, testing, and bug-fix activities on an ongoing basis.
- Migrated existing data warehouses to Snowflake.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Worked closely with business analysts to convert business requirements into technical requirements and prepared low- and high-level documentation.
- Improved the performance of existing Hadoop algorithms by rewriting them in Spark, using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases handling large data volumes.
- Involved in daily Scrum meetings to discuss development progress and actively helped make the meetings more productive.
- Worked in Python to build data pipelines after data was loaded from Kafka.
- Configured Spark Streaming to consume data from Kafka topics and store it in HDFS (see the sketch after this section).
- Loaded data into Spark RDDs and performed advanced procedures such as text analytics using Spark's in-memory computation capabilities to generate output responses.
- Used Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) for storage.
- Created AWS Lambda functions with assigned IAM roles and scheduled Python scripts using CloudWatch triggers to support infrastructure needs (SQS, EventBridge, SNS).
- Converted MapReduce programs into Spark transformations using Spark RDDs in Scala and Python.
- Integrated Kafka with Spark Streaming for high-throughput, reliable data processing.
- Developed a Python script to call REST APIs and extract data to AWS S3.
- Conducted ETL data integration, cleansing, and transformation using AWS Glue Spark scripts.
- Worked on Lambda functions that aggregate data from incoming events and store the results in Amazon DynamoDB.
- Deployed the project on Amazon EMR with S3 connectivity for backup storage.
- Designed and developed ETL jobs to extract data from Oracle and load it into a data mart in Redshift.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used JSON schemas to define table and column mappings from S3 data to Redshift.
- Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
- Used Jira to track issues and manage changes.
- Created Jenkins jobs for CI/CD using Git, Maven, and Bash scripting.
Environment: Spark, AWS (S3, Redshift, Glue, EMR, IAM, EC2), Tableau, Jenkins, Jira, Python, Kafka, Agile.
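Illustrative sketch of the Kafka-to-HDFS streaming flow referenced above (a minimal PySpark Structured Streaming job, not the production code): the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be supplied at spark-submit time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Build a Spark session; on EMR/YARN the master is supplied by spark-submit.
spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to a Kafka topic (broker and topic names are placeholders).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
payload = events.select(col("value").cast("string").alias("raw_event"))

# Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (
    payload.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events/raw")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```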
Confidential - San Antonio, TX
Sr. Data Engineer
Responsibilities:
- As a Sr. Data Engineer, reviewed business requirements and developed Big Data solutions focused on pattern matching and predictive modeling.
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Worked in an Agile development environment and participated in daily scrums and other design-related meetings.
- Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
- Responsible for developing data pipelines with AWS to extract data from weblogs and store it in HDFS.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Performed data transformations in Hive and used partitions and buckets for performance improvements.
- Ingested data into HDFS using Sqoop and scheduled incremental loads to HDFS.
- Created S3 buckets, managed bucket policies, and used S3 and Glacier for storage and backup on AWS.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.
- Worked with the cloud provisioning team on capacity planning and sizing of master and worker nodes for an AWS EMR cluster.
- Developed PySpark scripts for data transfer from S3 to Redshift.
- Responsible for creating instances on Amazon EC2 (AWS) and deploying the application on them.
- Worked with Amazon EMR to process data directly in S3 and to copy data from S3 into HDFS on the EMR cluster, setting up Spark Core for analysis work.
- Gained hands-on exposure to Spark architecture and how RDDs work internally by processing data from local files, HDFS, and RDBMS sources, creating RDDs and optimizing them for performance.
- Developed an AWS Lambda function to invoke a Glue job as soon as a new file lands in the inbound S3 bucket (see the sketch after this section).
- Created Spark jobs to apply data cleansing and validation rules to new source files in the inbound bucket and route rejected records to a reject-data S3 bucket.
- Transferred data from AWS S3 to AWS Redshift using Informatica.
- Extensively involved in writing SQL Scripts, functions and packages.
- Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and wrote Hive queries for analysis.
- Estimated and planned development work using Agile software development practices.
- Developed all mappings according to the design document and mapping specifications and performed unit testing.
- Used Sqoop to import data from an Oracle database into HDFS and Hive.
- Transformed legacy data and loaded it into staging using stored procedures.
- Acted as technical liaison between the customer and the team on all AWS technical aspects.
- Pulled data from the data lake (HDFS) and shaped it with various RDD transformations.
- Used AWS cloud with infrastructure provisioning and configuration.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for dashboard reporting.
- Maintained Tableau functional reports based on user requirements.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
Environment: Spark, AWS (EC2, S3, Redshift, EMR), Hive, Tableau, HDFS, PySpark, Agile/Scrum.
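A hedged sketch of the S3-triggered Lambda that starts a Glue job, as referenced above; the Glue job name, argument keys, and bucket layout are illustrative placeholders rather than the actual project configuration.

```python
import json
import boto3

glue = boto3.client("glue")

# Hypothetical Glue job name; in practice this would come from configuration.
GLUE_JOB_NAME = "inbound-file-etl"

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event on the inbound bucket;
    starts the Glue job and passes the new object's location as job arguments."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={
                "--source_bucket": bucket,
                "--source_key": key,
            },
        )
        runs.append(response["JobRunId"])
    return {"statusCode": 200, "body": json.dumps({"job_runs": runs})}
```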
Confidential - Atlanta, GA
Sr. Data Engineer
Responsibilities:
- As a Sr. Data Engineer, designed and deployed scalable, highly available, and fault-tolerant systems on Azure.
- Led the estimation effort, reviewed estimates, identified complexities, and communicated them to all stakeholders.
- Involved in the complete SDLC of a big data project, including requirement analysis, design, coding, testing, and production.
- Defined the business objectives comprehensively through discussions with business stakeholders, functional analysts and participating in requirement collection sessions.
- Implemented end-to-end systems for Data Analytics, Data Automation and integrated with custom visualization tools.
- Migrated the on-premises environment to the cloud using MS Azure.
- Designed the business requirement collection approach based on the project scope and SDLC (Agile) methodology.
- Moved data from Azure Data Lake to Azure SQL Data Warehouse using PolyBase.
- Created external tables in Azure SQL Data Warehouse with 4 compute nodes and scheduled the loads.
- Extensively used Agile Method for daily scrum to discuss the project related information.
- Worked on data ingestion from multiple sources into the Azure SQL Data Warehouse.
- Transformed and loaded data into Azure SQL Database.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Developed and maintained data pipelines on the Azure analytics platform using Azure Databricks.
- Developed a data pipeline using Kafka to store data into HDFS.
- Implemented Kafka producers, created custom partitions, configured brokers, and implemented high-level consumers to build out the data platform.
- Created Airflow scheduling scripts in Python (see the DAG sketch after this section).
- Maintained a NoSQL database for unstructured data; cleaned the data by removing invalid records, unifying formats, and rearranging the structure, then loaded it for downstream steps.
- Wrote Python scripts to parse XML documents and load the data in database.
- Wrote DDL and DML statements for creating and altering tables and for converting characters into numeric values.
- Performed data cleaning and data manipulation activities using NoSQL utilities.
- Worked on data loads using Azure Data Factory with an external table approach.
- Automated recurring reports using SQL and Python.
- Developed purging scripts and routines to purge data on Azure SQL Server and Azure Blob storage.
- Resolved data type inconsistencies between the source systems and the target system using the mapping documents.
- Developed Python scripts for automation purposes and performed component unit testing using the Azure Emulator.
- Wrote and optimized T-SQL queries in Azure SQL Server.
- Maintaining data storage in Azure Data Lake.
- Wrote and executed customized SQL code for ad hoc reporting and used other tools for routine report generation.
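An illustrative Airflow 2-style DAG of the kind referenced in the scheduling bullet above; the DAG id, task names, schedule, and callables are placeholders standing in for the production pipeline, not values from it.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    """Placeholder extract step (e.g., pull files from a landing zone)."""
    pass

def transform():
    """Placeholder transform step (e.g., clean and conform the data)."""
    pass

def load():
    """Placeholder load step (e.g., write to the warehouse)."""
    pass

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_ingest_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run the steps strictly in sequence.
    t_extract >> t_transform >> t_load
```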
Confidential - Bellevue, WA
Data Engineer
Responsibilities:
- As a Data Engineer, I was responsible for building a data lake as a cloud-based solution in AWS using Apache Spark and Hadoop.
- Involved in Agile methodologies, daily Scrum meetings, and sprint planning.
- Installed and configured Hadoop; responsible for maintaining the cluster and managing and reviewing Hadoop log files.
- Worked in AWS cloud and on-premises environments with infrastructure provisioning and configuration.
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop.
- Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Developed code for importing and exporting data to HDFS and Hive using Sqoop.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data.
- Developed Spark jobs and Hive jobs to summarize and transform data (see the sketch after this section).
- Developed a reconciliation process to ensure the Elasticsearch index document count matches the source records.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Implemented Sqoop to transfer data from Oracle to Hadoop and load it back in Parquet format.
- Developed incremental and full-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Pulled data from the data lake (HDFS) and shaped it with various RDD transformations.
- Loaded data through HBase into Spark RDDs and implemented in-memory computation to generate output responses.
- Continuously tuned Hive queries and UDFs for faster performance by employing partitioning and bucketing.
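A minimal sketch of the kind of Spark-with-Hive summarization job mentioned above; the database, table, and column names are hypothetical, and the aggregation is only illustrative of the pattern (read a Hive table, summarize, write back partitioned).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Enable Hive support so Spark can read and write Hive-managed tables.
spark = (
    SparkSession.builder
    .appName("customer-behavior-summary")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical source table of raw behavioral events.
events = spark.table("behavior_db.customer_events")

# Summarize events per customer and day (illustrative aggregation).
daily_summary = (
    events
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("session_id").alias("session_count"),
    )
)

# Write the summary back to Hive, partitioned by date for efficient queries.
(
    daily_summary.write
    .mode("overwrite")
    .partitionBy("event_date")
    .saveAsTable("behavior_db.customer_daily_summary")
)
```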
Confidential
Data Analyst
Responsibilities:
- As a Data Analyst I was responsible for gathering data migration requirements.
- Identified problematic areas and conducted research to determine the best course of action to correct the data.
- Analyzed problems and resolved issues with current and planned systems as they relate to the integration and management of order data.
- Involved in Data Mapping activities for the data warehouse.
- Analyzed reports of data duplicates and other errors to provide ongoing inter-departmental communication and monthly or daily data reports.
- Monitored select data elements for timely and accurate completion.
- Collected, analyzed, and interpreted complex data for reporting and performance trend analysis.
- Monitored data dictionary statistics.
- Analyzed and adopted new Oracle 10g features such as DBMS_SCHEDULER, CREATE DIRECTORY, Data Pump, and CONNECT BY ROOT in the existing Oracle 10g application.
- Archived old data by converting it into SAS data sets and flat files.
- Extensively used the Erwin tool for forward and reverse engineering, following corporate naming-convention standards and using conformed dimensions whenever possible.
- Enabled a smooth transition from the legacy system to the newer system through the change management process.
- Planned project activities for the team based on project timelines using Work Breakdown Structure.
- Compared data with original source documents and validated data accuracy.
- Used reverse engineering to create graphical representations (E-R diagrams) and to connect to the existing database.
- Generated weekly and monthly asset inventory reports.
- Created Technical Design Documents, Unit Test Cases.
- Wrote SQL and PL/SQL scripts to extract data from the database to meet business requirements and for testing purposes.
- Wrote complex SQL queries to validate data against different kinds of reports generated by Business Objects XI R2.
- Involved in test case and test data preparation, execution, and verification of test results.
- Created user guidance documentation.
- Created a reconciliation report for validating migrated data (see the sketch below).
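A simple sketch of a migration reconciliation check of the sort described in the last bullet, assuming standard DB-API connections to the source and target databases; the table list and the get_source_connection / get_target_connection helpers are hypothetical placeholders.

```python
# Hypothetical reconciliation script: compare row counts table by table
# between the legacy source database and the migrated target database.

TABLES_TO_CHECK = ["orders", "customers", "order_items"]  # illustrative list

def row_count(connection, table_name):
    """Return the row count of a table via a simple aggregate query."""
    cursor = connection.cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
    (count,) = cursor.fetchone()
    cursor.close()
    return count

def reconcile(source_conn, target_conn):
    """Compare source and target row counts and report any mismatches."""
    report = []
    for table in TABLES_TO_CHECK:
        src = row_count(source_conn, table)
        tgt = row_count(target_conn, table)
        status = "MATCH" if src == tgt else "MISMATCH"
        report.append((table, src, tgt, status))
    return report

if __name__ == "__main__":
    source_conn = get_source_connection()  # hypothetical connection helper
    target_conn = get_target_connection()  # hypothetical connection helper
    for table, src, tgt, status in reconcile(source_conn, target_conn):
        print(f"{table}: source={src} target={tgt} -> {status}")
```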