Data Engineer Resume

Charlotte, NC

SUMMARY

  • Over 8 years of IT experience in Analysis, Design, Development and Big Data in Scala, Spark, Hadoop, Pig and HDFS environments, plus experience in Python and Java.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Strong experience in building fully automated Continuous Integration & Continuous Delivery pipelines and DevOps processes for agile store-based applications in the Retail and Transportation domains.
  • Firm understanding of Hadoop architecture and various components including HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce programming.
  • Experience in data analytics, designing reports with visualization solutions using Tableau Desktop and publishing them to Tableau Server.
  • Good knowledge of Amazon Web Services (AWS) concepts like EC2, S3, EMR, ElastiCache, DynamoDB, Redshift, Aurora.
  • Experience in developing Python and shell scripts to Extract, Load and Transform data, with working knowledge of AWS Redshift.
  • Involved in Software development, Data warehousing and Analytics and Data engineering projects using Hadoop, MapReduce, Pig, Hive and other open-source tools/technologies.
  • Experience in analyzing, designing, and developing ETL Strategies and processes, writing ETL specifications.
  • Excellent understanding of NoSQL databases like HBase, Cassandra, MongoDB.
  • Proficient knowledge and hands-on experience in writing shell scripts in Linux.
  • Expertise in Hadoop ecosystem tools including HDFS, YARN, MapReduce, Pig, Hive, Sqoop, Flume, Spark, Zookeeper and Oozie.
  • Developed multiple POCs using PySpark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
  • Adequate knowledge and working experience in Agile and Waterfall Methodologies.
  • Defined user stories and drove the agile board in JIRA during project execution; participated in sprint demos and retrospectives.
  • Good interpersonal, communication and problem-solving skills; explore and adapt to new technologies with ease and work well as a team member.

TECHNICAL SKILLS

Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0

BI Tools: SSIS, SSRS, SSAS.

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, and UNIX shell scripting.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Azure, Google Cloud.

Cloud Management: Amazon Web Services (AWS): EC2, EMR, S3, Redshift, Lambda, Athena

Databases: Oracle, Teradata R15/R14.

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, Charlotte, NC

Responsibilities:

  • Designed robust, reusable and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing and delivery of both structured and unstructured data, in batch and real-time streaming modes.
  • Built data warehouse structures, creating fact, dimension and aggregate tables through dimensional modeling with Star and Snowflake schemas.
  • Applied transformations to the data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
  • Developed a Scala framework for processing spreadsheets and joining them with other sources.
  • Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Scheduled jobs using Airflow and used Airflow hooks to connect to traditional databases such as DB2, Oracle and Teradata.
  • Used Python with SQLAlchemy to connect to databases and query the sources to fetch data.
  • Hands-on experience developing UDFs, DataFrames and SQL queries in Spark SQL.
  • Created and modified data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
  • Finalized the naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
  • Developed ETL workflows in Python on the data obtained, processing it in HDFS and HBase using Flume.
  • Hands-on experience working with Continuous Integration and Deployment (CI/CD) using Jenkins and Docker.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake SnowSQL. Wrote SQL queries against Snowflake.

Environment: Python, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Docker, Tableau, Splunk, BO Reports, Netezza, UDB, MySQL, Snowflake, IBM Datastage.

Big Data Engineer/Hadoop Developer

Confidential, Indianapolis, IN

Responsibilities:

  • Responsible for the design, implementation and architecture of very large-scale data intelligence solutions around big data platforms.
  • Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
  • Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
  • Used AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR and Amazon RDS resources.
  • Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
  • Troubleshot and resolved data processing issues and proactively engaged in data modelling discussions.
  • Worked with the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
  • Wrote Spark programs in Python using the PySpark and Pandas packages for performance tuning, optimization and data quality validations.
  • Developed Kafka producers and consumers for streaming millions of events per second.
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka.
  • Hands-on experience fetching live stream data from UDB into HBase tables using PySpark streaming and Apache Kafka.
  • Used Tableau to build customized interactive reports, worksheets, and dashboards.

Environment: HDFS, Python, SQL, Web Services, MapReduce, Spark, Kafka, Hive, Yarn, Pig, Flume, Zookeeper, Sqoop, UDB, Tableau, AWS, GitHub, Shell Scripting.

Big Data Engineer

Confidential, Dallas, TX

Responsibilities:

  • Responsible for building scalable distributed data solutions in a Hadoop cluster environment with the Hortonworks distribution.
  • Converted raw data into sequence data formats, such as Avro and Parquet, to reduce data processing time and increase data transfer efficiency across the network.
  • Worked on building end to end data pipelines on Hadoop Data Platforms.
  • Worked on Normalization and De-normalization techniques for optimum performance in relational and dimensional database environments.
  • Designed, developed and tested Extract Transform Load (ETL) applications with different types of sources.
  • Created files and tuned SQL queries in Hive using HUE. Implemented MapReduce jobs in Hive by querying the available data.
  • Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Used PySpark to access Spark libraries from Python scripts for data analysis.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala.
  • Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
  • Supported the cluster and topics through Kafka Manager. Handled CloudFormation scripting, security and resource automation.

Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Linux RedHat, Unix.

Hadoop Engineer/Developer

Confidential

Responsibilities:

  • Designed and developed applications on the data lake to transform the data according to business users' needs for analytics.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Worked with different file types, such as CSV, TXT and fixed-width, to load data from various sources into raw tables.
  • Conducted data model reviews with team members and captured technical metadata through modelling tools.
  • Implemented ETL processes; wrote and optimized SQL queries to perform data extraction and merging from a SQL Server database.
  • Experience in loading logs from multiple sources into HDFS using Flume.
  • Worked with NoSQL databases like HBase in creating HBase tables to store large sets of semi-structured data coming from various data sources.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Data cleaning, pre-processing and modelling using Spark and Python.
  • Strong Experience in writing SQL queries.
  • Responsible for triggering jobs using Control-M.

Environment: Python, SQL, ETL, Hadoop, HDFS, Spark, Scala, Kafka, HBase, MySQL, Netezza, Web Services, Shell Script, Control-M.

Data Analyst and ETL tester

Confidential

Responsibilities:

  • Analyzed solution architecture and design documents, including Source-to-Target and detailed design documents needed for each track, and planned test design activities.
  • Used SQL Server to develop and execute test scripts in SQL to validate the test cases.
  • Involved in reviewing test scenarios, test cases and test results for data warehouse/ETL testing.
  • Prepared Requirements Traceability Matrix (RTM), positive and negative test scenarios, detail-oriented Test Scripts, Test Kickoff documents, Test Scorecard for test progress status, Test Results, Release Checklist, Lessons Learned documents and Regression Test Suite for future use.
  • Managed datasets using Pandas DataFrames and MySQL; queried the MySQL database from Python using the Python-MySQL connector and MySQLdb package to retrieve information. Developed web applications in the Django framework using model-view-controller (MVC) architecture.
  • Responsible for testing Initial/Reconcile and Incremental/daily loads of ETL jobs.
  • Interacted with design/development/DBA team to decide on the various dimensions and facts to test the application.
  • Planned ahead of time to test the mapping parameters and variables by discussing them with BAs.
  • Used ALM extensively to track and manage defects.
  • Validated that the data flow and control flow transformations in SSIS packages work according to functionality.
  • Tested various KPIs for Tableau and SF dashboard reports.
  • Extensively tested several Business Objects reports for data quality, fonts, headers, footers, and cosmetics.
  • Validated the cardinality of joins and data integrity on the Business Objects universe.
  • Conducted user acceptance testing (UAT) to validate that the developed application meets the business requirements.
  • Performed ETL using Microsoft SSIS to extract, transform and load test data into the test environment, and tested it against the database.
  • Extensively involved in testing the ETL process from different data sources (SalesForce, UI, SQL Server, Oracle, flat files) into the target database as per the data models.
  • Wrote complex SQL queries for data validation to verify the SSIS packages and business rules.
  • Wrote test cases for ETL to compare Source and Target database systems.
  • Mocked test data to test all the scenarios and test cases planned.
  • Used UNIX commands for file management; placing inbound files for ETL and retrieving outbound files and log files from UNIX environment.
  • Wrote several complex SQL queries for data verification and data quality checks.
  • Analyzed the testing progress by conducting walkthrough meetings with internal quality assurance groups and with development groups.
  • Responsible for documenting the process, issues and lessons learned for future references.
  • Reviewed the test activities through daily Agile Software development stand-up meetings.

Environment: SalesForce, SQL, Microsoft SSIS, HP Quality Center, Tableau, Agile, SharePoint, SoapUI, Oracle 10g, Data Flux, SQL Server 2008 R2, UNIX, Putty, Flat files, Session Logs, Windows.
