Data Engineer Resume
Bentonville, AR
SUMMARY
- Having 6+ years of experience as a Data Engineer in the design, development, and implementation of big data applications using Spark, Python, Kafka, Flume, NiFi, Impala, Oozie, ZooKeeper, Airflow, etc.
- Expertise in developing applications using Python, Scala, and Java.
- Good experience in software development with Python, including PySpark and PostgreSQL database-connectivity libraries.
- Hands-on experience with AWS (Amazon Web Services): Elastic MapReduce (EMR), Redshift, Glue, S3 storage, EC2 instances, and data warehousing (a brief PySpark sketch of this kind of S3-based pipeline follows this summary).
- Experience in AWS cloud infrastructure database migrations, converting existing Oracle and MS SQL Server databases to PostgreSQL, MySQL, and Aurora.
- Experience in creating S3 buckets and managing bucket policies, and in using S3 and Glacier for storage and backup on AWS.
- Deep understanding of big data analytics and algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
- Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
- Experience in cloud computing platforms such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP).
- Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
- Experienced in dimensional data modeling using ER/Studio, Erwin, and Sybase PowerDesigner, including relational data modeling, star schema/snowflake modeling, fact and dimension tables, and conceptual, logical, and physical data models.
- Procedural knowledge in cleansing and analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Experienced in writing custom UDFs and UDAFs for extending Hive and Pig core functionalities.
- Experience in importing and exporting data with Sqoop between HDFS and relational database systems (RDBMS) such as Teradata.
- Experience with Snowflake Virtual Warehouses.
- Currently helping my organization migrate terabytes of medical data from on-prem Mainframe, Teradata, and SQL databases to cloud Azure Data Lake with the help of Azure Data Factory pipelines.
- Experienced in building ETL and ELT processes using Azure Data Factory pipelines
- Played a key role in Migrating Teradata objects into the Snowflake environment.
- Expertise in full life cycle application development and good experience in unit testing, Test-Driven Development (TDD), and Behavior-Driven Development (BDD).
- Proficient in writing SQL queries, stored procedures, functions, packages, tables, views, and triggers in relational databases such as PostgreSQL.
- Good experience in shell scripting, SQL Server, UNIX/Linux, and visualization tools such as Power BI and Tableau.
- Experience in using XML, SOAP, and REST web services for interoperable software applications.
- Experience in Agile development processes ensuring rapid and high-quality software delivery.
- Well versed in Agile, Scrum, and test-driven development methodologies.
- Experience in handling errors/exceptions and debugging issues in large-scale applications.
- Highly motivated, dedicated quick learner with a proven ability to work both individually and as part of a team.
- Excellent written and oral communication skills with results-oriented attitude.
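Example (illustrative only): a minimal PySpark sketch of the kind of S3-based batch ETL summarized above. The bucket names, paths, and column names are hypothetical placeholders, not details from any specific engagement.

```python
# Hypothetical PySpark batch ETL: read raw JSON from S3, aggregate, write partitioned Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("claims-batch-etl").getOrCreate()

# Raw JSON landed in S3 by an upstream ingestion job (placeholder path).
raw = spark.read.json("s3://example-raw-bucket/claims/")

# Basic cleansing and aggregation with Spark SQL functions.
daily = (
    raw.withColumn("claim_date", F.to_date("claim_ts"))
       .filter(F.col("amount").isNotNull())
       .groupBy("claim_date", "state")
       .agg(F.sum("amount").alias("total_amount"),
            F.count("*").alias("claim_count"))
)

# Curated output written as partitioned Parquet for downstream Redshift/Glue or Hive consumers.
daily.write.mode("overwrite").partitionBy("claim_date").parquet(
    "s3://example-curated-bucket/claims_daily/"
)
```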
TECHNICAL SKILLS
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, YARN, Oozie, ZooKeeper, Hue, Ambari Server
Languages: Python, R, SQL
Operating Systems: Windows Vista/XP/7/8/10, Linux, Unix, OS X
Deployment Tools: AWS (EC2, S3, ELB, RDS, Glue), Heroku, Jenkins, Azure
Web Development: CSS, HTML, DHTML, XML, JavaScript, AngularJS, jQuery, and AJAX
Web Servers: WebSphere, WebLogic, Apache, Gunicorn
Programming and Scripting: Scala, Java, SQL, JavaScript, Shell Scripting, Python, Pig Latin, HiveQL
Bug Tracking & Testing Tools: Jira, Bugzilla, JUnit, gdb
Databases: Oracle 11g/10g/9i, Cassandra 2.0, MySQL, SQL Server 2008 R2, Data Warehousing
Cloud Computing: Amazon EC2/S3, Heroku, Google App Engine, Google Data Studio, Microsoft Azure Data Factory (ETL)
Methodologies: Agile, Scrum and Waterfall
IDEs: Jupyter Notebook, Visual Studio Code, RStudio, SSIS, Excel (VLOOKUP & XLOOKUP)
Analytics Tools: Tableau, Power BI, Microsoft SSIS, SSAS and SSRS
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer
Responsibilities:
- Worked closely with Business Analysts and the Product Owner to understand requirements.
- Developed applications using Spark to implement various aggregation and transformation functions with Spark RDDs and Spark SQL.
- Used joins in Spark to combine smaller datasets with large datasets without shuffling data across nodes.
- Developed Spark Streaming jobs using Python to read messages from Kafka.
- Downloaded JSON files from AWS S3 buckets.
- Implemented ETL using AWS Redshift and Glue.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Python (a short sketch of this pattern appears after the Environment line below).
- Prototyped analysis and joining of customer data using Spark in Scala and processed it to HDFS.
- Implemented Spark on EMR for processing big data across our OneLake in the AWS system.
- Consumed and processed data from DB2.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience with Snowflake
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Used Scala components to implement the credit line policy based on conditions applied to Spark DataFrames.
- Extracted, transformed, and loaded data sources to generate CSV data files using Python and SQL queries.
- Used pandas UDFs and Spark SQL array functions (array_contains, array_distinct, flatten, sort_array, split, arrays_overlap) for filtering data.
- Designed and implemented an incremental Sqoop job to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
- Built NiFi flows for data ingestion, ingesting data from Kafka, microservices, and CSV files on edge nodes.
- Implemented and orchestrated data pipelines using Oozie and Airflow.
- Built automated pipelines using Jenkins and Groovy scripts.
- Currently helping my organization migrate terabytes of medical data from on-prem Mainframe, Teradata, and SQL databases to cloud Azure Data Lake with the help of Azure Data Factory pipelines.
- Experienced in building ETL and ELT processes using Azure Data Factory pipelines
- Used shell commands to push environment and test files to AWS through Jenkins automated pipelines.
Environment: Spark, Scala, AWS, EMR, Redshift, EC2, Python, PostgreSQL, NiFi, Airflow, Jupyter, Kafka
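Example (illustrative only): a hedged sketch of the Kafka-to-HDFS streaming pattern described above, shown here with Spark Structured Streaming in Python. Broker addresses, topic, schema, and paths are hypothetical placeholders.

```python
# Requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Hypothetical schema of the JSON messages on the topic.
event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# Small reference dataset joined via broadcast so the large stream is not shuffled.
customers = spark.read.parquet("hdfs:///reference/customers/")

events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events-topic")
         .load()
         .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
         .select("e.*")
         .join(F.broadcast(customers), "customer_id", "left")
)

# Land the enriched stream on HDFS as Parquet, with checkpointing for fault tolerance.
query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/events/")
          .option("checkpointLocation", "hdfs:///checkpoints/events/")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```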
Confidential, Bentonville, AR
Data Engineer
Responsibilities:
- Involved in requirements gathering, interacting with business analysts and business teams to understand the requirements.
- Created Spark Streaming jobs using Python to read messages from Kafka and download JSON files from AWS S3 buckets.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS and NoSQL databases such as HBase and Cassandra using Python.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Developed data processing applications in Scala using Spark RDDs as well as DataFrames with Spark SQL APIs.
- Developed Spark programs and Python functions that perform transformations and actions on datasets.
- Developed Python scripts to clean the raw data.
- Acquired data from transactional source systems into the Redshift data warehouse using Spark and AWS EMR.
- Implemented ETL using AWS Glue and Redshift.
- Created event-driven and scheduled AWS Lambda functions to trigger various AWS resources (a short Lambda sketch appears after the Environment line below).
- Created S3 buckets and managed bucket policies; utilized S3 and Glacier for storage and backup on AWS.
- Configured multiple AWS services like EMR and EC2 to maintain compliance with organization standards.
- Created Hive external and internal tables and implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access.
- Used the Oozie scheduler to automate pipeline workflows and orchestrate Spark jobs.
- Worked on NiFi data pipelines built for consuming data into the data lake.
- Worked on various types of Databases like Oracle, Teradata and SQL Server.
- Built NiFi flows for data ingestion, ingesting data from Kafka, microservices, and CSV files on edge nodes.
- Designed and implemented an ETL framework using Sqoop, Pig, and Hive to automate frequent ingestion of data from source systems and make it available for consumption.
- Developed UNIX shell scripts to automate extraction jobs that create reports from Hive data.
- Imported files to HDFS from edge nodes after cleansing data and created Hive tables on top of them.
- Worked with different file formats such as SequenceFiles, XML, JSON, and MapFiles using Spark programs.
- Worked in Agile methodology.
Environment: Spark, Python, Kafka, Scala, AWS, Oozie, Apache Hadoop, Hive, SQL, Sqoop, ZooKeeper, Teradata, MySQL, Hortonworks, Apache NiFi, JIRA, Agile/Scrum Methodology.
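Example (illustrative only): a hedged sketch of an event-driven AWS Lambda of the kind described above, written in Python with boto3; it reacts to an S3 object-created event and starts a Glue job. The Glue job name and argument names are hypothetical placeholders.

```python
import json
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    """Triggered by an S3 put event; starts a Glue ETL run for each new object."""
    runs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Pass the newly arrived object to the (hypothetical) Glue job as arguments.
        response = glue.start_job_run(
            JobName="curate-claims-job",
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        runs.append(response["JobRunId"])
    return {"statusCode": 200, "body": json.dumps({"job_runs": runs})}
```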
Confidential, Plano, TX
Data Engineer
Responsibilities:
- Involved in requirements gathering, interacting with business analysts and business teams to understand the requirements.
- Developed Spark programs using Scala to compare the performance of Spark with Hive and Spark SQL.
- Developed a Spark Streaming application to consume JSON messages from Kafka and perform transformations.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Created Spark Streaming jobs using Python to read messages from Kafka and download JSON files from AWS S3 buckets.
- Developed Spark programs and Python functions that perform transformations and actions on datasets.
- Configured Spark Streaming in Python to receive real-time data from Kafka and store it in HDFS.
- Developed Python code to gather data from HBase and designed the solution to be implemented using PySpark.
- Involved in creating Hive tables, loading them with data, and writing HQL queries, which invoke and run MapReduce jobs in the backend.
- Acquired data from transactional source systems into the Redshift data warehouse using Spark and AWS EMR.
- Created event-driven and scheduled AWS Lambda functions to trigger various AWS resources.
- Created S3 buckets and managed bucket policies; utilized S3 and Glacier for storage and backup on AWS.
- Configured multiple AWS services like EMR and EC2 to maintain compliance with organization standards.
- Created Hive external and internal tables and implemented partitioning, dynamic partitions, and bucketing in Hive for efficient data access (a dynamic-partitioning sketch appears after the Environment line below).
- Used the Oozie scheduler to automate pipeline workflows and orchestrate Spark jobs.
- Worked on NiFi data pipelines built for consuming data into the data lake.
- Built NiFi flows for data ingestion, ingesting data from Kafka, microservices, and CSV files on edge nodes.
- Designed and implemented an ETL framework using Sqoop, Pig, and Hive to automate frequent ingestion of data from source systems and make it available for consumption.
- Developed UNIX shell scripts to automate extraction jobs that create reports from Hive data.
- Imported files to HDFS from edge nodes after cleansing data and created Hive tables on top of them.
- Worked with different file formats such as SequenceFiles, XML, JSON, and MapFiles using Spark programs.
- Worked in Agile methodology.
Environment: Spark, Python, Kafka, Scala, AWS, Oozie, Apache Hadoop, Hive, SQL, Sqoop, ZooKeeper, MySQL, Hortonworks, Apache NiFi, JIRA, Agile/Scrum Methodology.
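Example (illustrative only): a hedged sketch of the Hive dynamic-partitioning pattern described above, driven from PySpark with Hive support. The database, table, and column names are hypothetical placeholders; bucketing is declared in Hive DDL in practice and is omitted here for brevity.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dynamic-partitions")
    .enableHiveSupport()
    .getOrCreate()
)

# Allow Hive to create partitions on the fly during the insert.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# External table, so dropping it does not delete the underlying HDFS data (hypothetical names).
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.transactions (
        txn_id STRING,
        customer_id STRING,
        amount DOUBLE
    )
    PARTITIONED BY (txn_date STRING)
    STORED AS ORC
    LOCATION 'hdfs:///warehouse/analytics/transactions'
""")

# Dynamic-partition insert: the partition column is selected last, and Hive
# routes each row into its txn_date partition automatically.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.transactions PARTITION (txn_date)
    SELECT txn_id, customer_id, amount, txn_date
    FROM analytics.transactions_staging
""")
```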
Confidential, Plano TX
Data Engineer/Data Analyst
Responsibilities:
- Designed and created data marts in the data warehouse database.
- Used MS SQL Server Management Studio 2008 to create complex stored procedures and views in T-SQL.
- Collected data from many sources, converted it into comma-delimited flat text files, and imported the data into SQL Server for data manipulation.
- Responsible for deploying reports to Report Manager and troubleshooting any errors during execution.
- Scheduled reports to run on a daily and weekly basis in Report Manager and emailed them to the director and analysts for review in Excel.
- Created several claims-handling reports that were exported to PDF format.
- Analyzed business requirements and provided excellent and efficient solutions.
Environment: SQL Server 2008, Microsoft Visual Studio 2008, MS Office, SAS, SSRS