
Data Engineer Resume


Bentonville, AR

SUMMARY

  • Data Engineer with eight years of experience and a strong background in end-to-end enterprise data warehousing and big data projects.
  • Proficient in big data tools such as Hive and Spark, as well as the relational data warehouse platform Teradata.
  • Extensive hands-on experience in business requirement analysis and in designing, developing, testing, and maintaining complete data management and processing systems, process documentation, and ETL technical and design documents.
  • Expertise in resolving production issues and hands-on experience across all phases of the Software Development Life Cycle (SDLC).
  • Solid experience in designing and operationalizing large-scale data and analytics solutions on the Snowflake data warehouse.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and SnowSQL (see the sketch after this list).
  • Substantial experience integrating Spark 3.0 with Kafka 2.4.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Sustained BigQuery, PySpark, and Hive code by fixing bugs and delivering the enhancements required by business users.
  • Worked with GCP services (Dataproc, Dataflow, BigQuery) and AWS services (EMR, S3, Glacier, and EC2 instances with EMR clusters).
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Experienced in building proofs of concept (PoCs) and performing gap analysis; gathered data for analysis from different sources and prepared it for data exploration using data munging and Teradata.
  • Experience developing custom UDFs in Python to extend Hive and Pig Latin functionality.
  • Experienced in building automated regression scripts in Python to validate ETL processes across databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Experience developing MapReduce programs on Apache Hadoop to analyze big data as per requirements.
  • Hands-on with Spark MLlib utilities such as classification, clustering, collaborative filtering, and dimensionality reduction.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Worked with Cloudera and Hortonworks distributions.
  • Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
  • Expertise in designing complex mappings, performance tuning, and building slowly changing dimension tables and fact tables.
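
A minimal sketch of the Python-plus-SnowSQL loading pattern referenced above, assuming files have already been staged in Snowflake; the connection parameters, stage, and table names are placeholders, not objects from the original projects.

    # Sketch only: load staged files into Snowflake with Python + SnowSQL-style SQL.
    # Account, credentials, stage, and table names below are placeholders.
    import snowflake.connector

    def load_orders(conn_params: dict) -> None:
        conn = snowflake.connector.connect(**conn_params)
        try:
            cur = conn.cursor()
            # COPY INTO reads files already staged in @orders_stage (placeholder stage).
            cur.execute("""
                COPY INTO analytics.public.orders
                FROM @orders_stage
                FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
                ON_ERROR = 'ABORT_STATEMENT'
            """)
            print(cur.fetchall())  # per-file load status returned by Snowflake
        finally:
            conn.close()

    if __name__ == "__main__":
        load_orders({
            "account": "my_account",   # placeholder
            "user": "etl_user",        # placeholder
            "password": "********",    # pull from a secrets manager in practice
            "warehouse": "ETL_WH",
            "database": "ANALYTICS",
            "schema": "PUBLIC",
        })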

TECHNICAL SKILLS

Languages: Python, Scala, Java, SnowSQL, SQL, UNIX, Shell Scripting

Big Data Technologies: Hadoop, HDFS, Sqoop, NiFi, Hive, Pig, HBase, PySpark, Cassandra, Flume, Spark SQL, Kafka

Cloud Platform: AWS, Azure, Google Cloud, CloudStack/OpenStack

BI Tools: SSIS, SSRS, SSAS.

Modeling Tools: SQL Power Architect, Oracle Designer, Erwin, ER/Studio

Database: Oracle, Microsoft SQL Server, Teradata, PostgreSQL, Impala

Methodologies: JAD, System Development Life Cycle (SDLC), Agile, Waterfall Model.

Operating System: Windows, UNIX

ETL Tools: Informatica, SAP Business Objects XIR3.1/XIR2, Web Intelligence.

Reporting Tools: Power BI, Tableau.

Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.

Other Tools: JIRA, SQL*Plus, SQL*Loader, MS Project, MS Office; additional experience with C++, UNIX, and PL/SQL.

PROFESSIONAL EXPERIENCE

Confidential, Bentonville, AR

Data Engineer

Responsibilities:

  • Created Hive tables on HDFS to store data processed by Apache Spark on the Cloudera Hadoop cluster in Parquet format.
  • Executed quantitative analysis on chemical products to recommend effective combinations.
  • Performed statistical analysis using SQL, Python, R programming, and Excel.
  • Imported, cleaned, filtered, and analyzed data using tools such as SQL, Hive, and Pig.
  • Handled administration activities using Cloudera Manager.
  • Involved in developing Impala scripts for extraction, transformation, loading of data into data warehouse.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Created an architectural solution leveraging the most suitable Azure analytics tools to address the specific needs of the Chevron use case.
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and publish it to a Kafka broker.
  • Implemented Spark streaming from Kafka to consume the data and feed it into the Spark pipeline (see the sketch after this list).
  • Worked with different join patterns and implemented both map-side and reduce-side joins.
  • Wrote Flume configuration files for importing streaming log data into HBase with Flume.
  • Imported several transactional logs from web servers with Flume to ingest the data into HDFS.
  • Installed and configured Pig and wrote Pig Latin scripts to convert data from text files to Avro format.
  • Created Partitioned Hive tables and worked on them using HiveQL.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads on a variety of sources such as web servers, RDBMS, and data APIs.
  • Built the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources using SQL and big data technologies such as Hadoop Hive and Azure Data Lake Storage.
  • Created Azure SQL database, performed monitoring and restoring of Azure SQL database. Performed migration of Microsoft SQL server to Azure SQL database.
  • Used Git for version control with Data Engineer team and Data Scientists colleagues.
  • Loaded data into HBase using both bulk and non-bulk loads.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Involved in migrating tables from RDBMS into Hive tables using Sqoop, later generating visualizations using Tableau.
  • Built pipelines to move hashed and un-hashed data from XML files to Data lake.
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Extensively worked with Spark-SQL context to create data frames and datasets to pre-process the model data.
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to implement complex business rules.
  • Used Kafka features such as partitioning and the replicated commit log for messaging by maintaining feeds, and created Kafka applications that monitor consumer lag within Apache Kafka clusters.
  • Used Python & SAS to extract, transform & load source data from transaction systems, generated reports, insights, and key conclusions.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Effectively communicated plans, project status, risks, and metrics to the project team, and planned test strategies in accordance with project scope.
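
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS step referenced above; the broker address, topic name, and HDFS paths are placeholder assumptions rather than the project's actual values.

    # Sketch: consume the Kafka topic fed by NiFi and land records as Parquet on HDFS.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("kafka-to-parquet")
             .getOrCreate())

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
           .option("subscribe", "ingest_topic")                # placeholder topic
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as binary; cast the payload to string for downstream parsing.
    events = raw.select(col("value").cast("string").alias("payload"),
                        col("timestamp"))

    query = (events.writeStream
             .format("parquet")
             .option("path", "/data/landing/events")            # placeholder HDFS path
             .option("checkpointLocation", "/data/checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()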

Environment: Cloudera, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, Azure, HDFS, Sqoop, Impala, SQL, Azure Data Lake, Data Factory, Tableau, Python, SAS, Flume, Oozie, Linux.

Confidential, Palo Alto, CA

Data Engineer

Responsibilities:

  • Transformed business problems into big data solutions and defined the big data strategy and roadmap; installed, configured, and maintained data pipelines.
  • Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
  • Designed the business requirement collection approach based on the project scope and SDLC methodology.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis; worked with data governance and data quality teams to design various models and processes.
  • Involved in all steps of the project's reference data approach to MDM; created a data dictionary and source-to-target mappings for the MDM data model.
  • Created AWS Lambda functions and assigned roles to run Python scripts, and built Java-based Lambda functions for event-driven processing; created Lambda jobs and configured roles using the AWS CLI.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands; accumulated partitioned data in storage formats such as text, JSON, and Parquet, and loaded data from the Linux file system into HDFS.
  • Worked with AWS for storage and handling of terabytes of data feeding customer BI reporting tools.
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this list).
  • Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab.
  • Built real time pipeline for streaming data using Kafka and Spark Streaming.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed
  • Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, designing and developing PoCs using Scala, Spark SQL, and the MLlib libraries.
  • Worked on data that combined unstructured and structured sources and automated the cleaning using Python scripts.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Optimized algorithms with stochastic gradient descent and fine-tuned parameters with both manual tuning and automated tuning such as Bayesian optimization.
  • Performed day-to-day Git support for different projects and was responsible for the design and maintenance of Git repositories and access control strategies.
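
A minimal sketch of a Python Lambda handler receiving S3 object-created events, as referenced above; the downstream processing call is a hypothetical placeholder, and bucket/key values come from the standard S3 event payload.

    # Sketch: Lambda handler wired to S3 ObjectCreated notifications.
    import json
    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            # Fetch the newly landed object and hand it off for processing.
            obj = s3.get_object(Bucket=bucket, Key=key)
            body = obj["Body"].read()
            print(f"received s3://{bucket}/{key} ({len(body)} bytes)")
            # process(body)  # hypothetical downstream step
        return {"statusCode": 200, "body": json.dumps("ok")}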

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Spark SQL, TDD, Spark-Streaming, Hive, Scala, pig, NoSQL, Impala, Oozie, HBase, PySpark, Data Lake, AWS (Glue, Lambda), Code Build, Code Pipeline, Unix/Linux Shell Scripting, PyCharm, Informatica.

Confidential - Columbus, IN

Data Engineer / Big Data Developer

Responsibilities:

  • Led the team in designing, developing, testing, and delivering end-to-end deliverables.
  • Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
  • Conducted root cause analysis and resolved production problems and data issues.
  • Partnered with ETL developers to ensure data was well cleaned and the data warehouse was kept up to date for reporting purposes using Pig.
  • Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
  • Selected and generated data into CSV files, stored them in AWS S3 using EC2, and then structured and loaded the data into AWS Redshift.
  • Performed performance tuning, code promotion, and testing of application changes, and documented business requirements during the requirement gathering and analysis phase of the project.
  • Used PySpark and Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse (see the sketch after this list).
  • Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and publish it to a Kafka broker.
  • Implemented Spark streaming from Kafka to consume the data and feed it into the Spark pipeline.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
  • Performed scoring and financial forecasting for collection priorities using Python, and SAS.
  • Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
  • Utilized Agile and Scrum methodology for team and project management.
  • Used Git for version control with colleagues.
  • Used Docker to run and deploy applications in multiple containers with orchestration tools such as Docker Swarm.
  • Designed ETL pipelines to move datasets from MySQL and MongoDB into AWS S3 buckets, and managed bucket and object access permissions.
  • Developed complex Talend ETL jobs to migrate data from flat files to the database; pulled files from the mainframe into the Talend execution server using multiple FTP components.
  • Developed Talend ESB services and deployed them on ESB servers on different instances.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source
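
A minimal PySpark sketch of the rolling moving-average calculation referenced above; the source table, column names, and 20-day window length are placeholder assumptions (the RSI step would follow the same per-ticker windowing pattern).

    # Sketch: 20-day trailing moving average per ticker with a Spark window.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("moving-average").getOrCreate()

    prices = spark.table("quant.daily_prices")   # placeholder source table

    # Trailing 20-row window per ticker, ordered by trade date.
    w = Window.partitionBy("ticker").orderBy("trade_date").rowsBetween(-19, 0)

    with_ma = prices.withColumn("close_ma_20", F.avg("close").over(w))

    (with_ma.write
     .mode("overwrite")
     .saveAsTable("quant.daily_prices_ma"))      # placeholder target table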

Environment: Hive, Spark (PySpark, Spark SQL, Spark MLlib), Kafka, Linux, Python 3.x (scikit-learn, NumPy, Pandas), Tableau 10.1, Hadoop, GitHub, AWS EMR/EC2/S3/Redshift, Pig, JSON and Parquet file formats, MapReduce, Spring Boot, Snowflake, SAS, Cassandra, Swamp, Data Lake, Oozie, MySQL, MongoDB.

Confidential, Santa Clara

Data Engineer

Responsibilities:

  • Worked on AWS Redshift and RDS to implement models and data on RDS and Redshift.
  • Gathered business requirements and handled definition and design of the data sourcing; worked with the data warehouse architect on the development of logical data models.
  • Followed organization standards to implement the data models; used a microservice architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
  • Designed and developed Azure Data Factory (ADF) pipelines extensively to ingest data from relational and non-relational source systems to meet business functional requirements.
  • Performed Regression testing for Golden Test Cases from State (end to end test cases) and automated the process using python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
  • Generated graphs and reports using the ggplot package in RStudio for analytical models; developed and implemented an R Shiny application showcasing machine learning for business forecasting.
  • Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
  • Designed and developed a new solution to process the NRT data by using Azure stream analytics, Azure Event Hub and Service Bus Queue
  • Performed K-means clustering, Regression and Decision Trees in R. Worked on data cleaning and reshaping, generated segmented subsets using NumPy and Pandas in Python.
  • Performed feature engineering using the Pandas and NumPy packages in Python and built models using deep learning frameworks.
  • Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, sentiment analysis, Naive Bayes, logistic regression, and linear regression in Python to determine the accuracy of each model (see the sketch after this list).
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
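
A minimal scikit-learn sketch of the model-accuracy comparison referenced above; the feature matrix and labels are assumed to be prepared elsewhere with Pandas/NumPy, and the hyperparameters shown are illustrative placeholders.

    # Sketch: compare classifier accuracy on a held-out test split.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score

    def compare_models(X: pd.DataFrame, y: pd.Series) -> dict:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42, stratify=y)
        models = {
            "logistic_regression": LogisticRegression(max_iter=1000),
            "decision_tree": DecisionTreeClassifier(max_depth=5),
            "naive_bayes": GaussianNB(),
        }
        scores = {}
        for name, model in models.items():
            model.fit(X_train, y_train)
            scores[name] = accuracy_score(y_test, model.predict(X_test))
        return scores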

Environment: Spark, Python, Azure, HDFS, Hive, Scala, Azure Data Factory, Stream Analytics, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Azure functions Apps, Azure Data Lake, Talend, Agile Methodology.

Confidential

Data Analyst

Responsibilities:

  • Involved in designing physical and logical data model using ERwin Data modeling tool.
  • Designed the relational data model for operational data store and staging areas, Designed Dimension & Fact tables for data marts.
  • Extensively used ERwin data modeler to design Logical/Physical Data Models, relational database design.
  • Created Stored Procedures, Database Triggers, Functions and Packages to manipulate the database and to apply the business logic according to the user's specifications.
  • Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.
  • Created database links to connect to other servers and access the required information.
  • Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.
  • Used Advanced Querying for exchanging messages and communicating between different modules.
  • Performed system analysis and design for enhancements; tested forms, reports, and user interaction.

Environment: Oracle 9i, SQL* Plus, PL/SQL, ERwin, TOAD, Stored Procedures.

Confidential

Programmer Analyst

Responsibilities:

  • Implemented several user stories using core Java, HTML, and CSS.
  • Solid understanding of Object Oriented Design and analysis with extensive experience in the full life cycle of the software design process including requirement definition, prototyping, Proof of Concept, Design, Implementation and Testing.
  • Worked on the backend using collections, structs, and maps, and with DHTML and JavaScript, using IDEs and tools such as Eclipse and Notepad++.
  • Involved in the Software Development Life Cycle phases like Requirement Analysis, Implementation and estimating the time-lines for the project.
  • Assisted in resolving data level issues dealing with input & output streams.
  • Developed custom directives (elements, Attributes and classes).
  • Developed single-page applications using AngularJS.
  • Extensively involved in redesigning the entire site with CSS styles for consistent look and feel across all browsers and all pages.
  • Used Angular MVC and two-way data binding.
  • Created Images, Logos and Icons that are used across the web pages using Adobe Flash and Photoshop.
  • Development of the interactive UI's for the front-end users using the front end technologies like HTML, CSS, JavaScript and jQuery.
  • Designed dynamic client-side JavaScript code to build web forms and simulate processes for web applications, including page navigation and form validation.

Environment: Core Java, JavaScript, UI/UX, Linux, Shell Scripting, Web Browsers, Instrumentation, Oracle SQL Server, SQL queries, Relational Databases.
