Senior Data Engineer Resume

Eden Prairie, Minnesota

SUMMARY

  • 8+ years of experience specializing in Big Data applications. Enjoys building relationships with stakeholders, understanding business issues, gathering and analyzing requirements, inventing creative solutions, documenting functional and technical design specifications, and working with Development, Testing, Production Support, and Quality Assurance teams.
  • 6+ years of comprehensive experience in Big Data & Analytics (Hadoop) and 2 years as a Java programmer.
  • Experienced in installation, configuration, management and deployment of Hadoop Cluster, HDFS, Map Reduce, Pig, Hive, Sqoop, Flume, Oozie, Nifi, HBase, and Zookeeper.
  • Experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
  • Expertise in importing data from various data sources and performing transformations, with hands-on experience developing and debugging YARN (MR2) jobs to process large data sets.
  • Very good knowledge of and experience with Amazon Web Services (AWS) concepts such as EMR and EC2, which provide fast and efficient processing for Teradata Big Data Analytics.
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the Spark SQL sketch after this list).
  • Experienced in extending Pig and Hive core functionality by writing custom UDFs for data analysis, data transformation, file processing, and identifying user behavior by running Pig Latin scripts; expertise in creating Hive internal/external tables and views using a shared metastore and writing HiveQL scripts and queries that help visualize business requirements.
  • Excellent experience importing and exporting Teradata data with Sqoop between HDFS and RDBMS/mainframe systems; also worked on incremental imports by creating Sqoop metastore jobs.
  • Involved in developing Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse, Azure Data Factory, ADLS, Databricks, SQL DW.
  • Good understanding and usage of Teradata OLAP functions. Proficient in Teradata SQL, stored procedures, macros, views, and indexes (Primary, Secondary, PPI, Join indexes, etc.).
  • Experienced in using Apache Flume for collecting, aggregating, and moving large amounts of data from application servers, and in handling a variety of data arriving at streaming velocity.
  • Expertise in data development on the Hortonworks HDP platform and Hadoop ecosystem tools such as Hadoop, HDFS, Spark, Zeppelin, Spark MLlib, Hive, HBase, Sqoop, Flume, Atlas, Solr, Pig, Falcon, Oozie, Hue, Tez, Apache NiFi, and Kafka.
  • In depth understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames and Spark Streaming for developing Spark Programs for Batch and Real-Time Processing.
  • Working experience with Docker, Kubernetes, Vagrant, Chef, Jenkins, and Python.
  • Experienced in Extraction, Transformation, and Loading (ETL) of data from multiple sources like Flat files, XML files, and Databases. Used Informatica for ETL processing based on business need and extensively used Oozie workflow engine to run multiple Hive and Pig jobs.
  • Expertise in JavaScript, JavaScript MVC patterns, object-oriented JavaScript design patterns, and AJAX.
  • Excellent understanding of Zookeeper and Kafka for monitoring and managing Hadoop jobs; used Cloudera CDH 4.x and CDH 5.x for monitoring and managing Hadoop clusters.
  • Expertise in developing presentation-layer components with HTML, CSS, JavaScript, jQuery, XML, JSON, AJAX, and D3.
  • Experience with Python and SQL on the AWS cloud platform, with a good understanding of data warehouses such as Snowflake and the Databricks platform.
  • Experienced in working with HCatalog to share schemas across distributed applications, and in batch processing and writing Apache Spark programs for real-time analytics and streaming data.
  • Experienced with NoSQL technologies such as HBase and Cassandra for data extraction and storing huge volumes of data; also experienced in the Data Warehouse life cycle, its methodologies, and its tools for reporting and data analysis.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
  • Expertise in creating action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Experienced in creating use case models and use case, class, and sequence diagrams using Microsoft Visio and Rational Rose. Experience in the design and development of systems based on object-oriented analysis and design (OOAD) using Rational Rose.
  • Experienced with RDBMS concepts; worked with Oracle 10g/11g and SQL Server, with solid experience writing stored procedures, functions, and triggers using PL/SQL.
  • Major strengths include familiarity with multiple software systems and the ability to quickly learn modern technologies and adapt to new environments; a self-motivated, focused, adaptive team player and quick learner with excellent interpersonal, technical, and communication skills.
  • Strong oral and written communication, initiation, interpersonal, learning and organizing skills matched with the ability to manage time and people effectively.
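
A minimal PySpark sketch of the Spark SQL work referenced above (Databricks-style multi-format extraction and aggregation); file paths, schemas, and column names are illustrative placeholders rather than actual project values:

    # Hedged sketch: read two file formats, join them, and aggregate customer usage patterns.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Read raw data in two different formats (placeholder mount paths).
    events = spark.read.json("/mnt/raw/usage_events/")
    customers = spark.read.option("header", True).csv("/mnt/raw/customers/")

    # Join and aggregate to summarize usage per customer segment.
    usage_by_segment = (
        events.join(customers, "customer_id")
              .groupBy("segment")
              .agg(F.count(F.lit(1)).alias("event_count"),
                   F.countDistinct("customer_id").alias("active_customers"))
    )

    # Persist the aggregate for downstream reporting.
    usage_by_segment.write.mode("overwrite").parquet("/mnt/curated/usage_by_segment/")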

TECHNICAL SKILLS

Hadoop/Big Data Technologies: HDFS, Hive, Pig, Sqoop, MapReduce, Flume, Oozie, Spark, Spark SQL and Zookeeper.

Languages: Core Java, XML, HTML, HiveQL, SQL, Python, Scala, PL/SQL, Azure PowerShell

J2EE Technologies: Servlets, JSP, AJAX, POJO and JSON.

Frameworks: Spring 3 and Hibernate 3

Cloud: AWS, Azure, GCP

Application & Web Servers: JBoss 4.x and Tomcat 8.x, 9.x.

NoSQL Databases: HBase, Cassandra, MongoDB

Database (SQL/NoSQL): Oracle 9i, MySQL, DB2, HBase and MongoDB

IDE: Eclipse, IntelliJ and EditPlus

Tools: Git, Bitbucket, Jenkins, Apache Maven and JUnit.

ETL Tool: Talend Open Studio 5.6

Bug tracking/ Ticketing: JIRA, Mercury Quality Center and ServiceNow

Operating System: Windows 98/2000 and Linux/Unix

PROFESSIONAL EXPERIENCE

Confidential - Eden Prairie, Minnesota

Senior Data Engineer

Responsibilities:

  • Developed data pipelines using Sqoop scripts to import data from various data sources such as MySQL, Teradata, SQL Server, and Oracle into HDFS and Hive tables.
  • Responsible for managing data coming from different sources in different formats, including SharePoint sites, third-party files, and Kafka messages.
  • Responsible for high-performance data architecture and design, including star schemas, snowflake schemas, and dimensional modeling.
  • Wrote MapReduce programs and Pig scripts to specify the conditions that separate fraudulent claims.
  • Developed Spark code using Scala and Java for data cleaning and pre-processing.
  • Developed Python scripts to perform metadata validation of delimited files.
  • Processed large volumes of data using different big data analytic tools, including Spark, Hive, Flume, and Sqoop.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL DW) and processed the data in Azure Databricks (see the Databricks sketch after this list).
  • Created Spark clusters and configured high-concurrency clusters in Azure Databricks to speed up the preparation of high-quality data.
  • Developed HiveQL for sorting, joining, filtering, and grouping structured data, and for metadata management.
  • Created procedures in Azure SQL Data Warehouse and built the final aggregate tables for dashboards.
  • Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
  • Loaded the data into Spark RDDs and performed in-memory data computation to generate the output.
  • Explored the use of Spark to improve the performance and optimization of data transformations in Hadoop, using SparkContext and Spark SQL.
  • Provided and implemented data pipeline solutions and proofs of concept involving Microsoft Azure cloud services such as Databricks, Data Factory, and PySpark.
  • Created and provisioned the Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Designed, supported, and continuously enhanced the project code base, continuous integration pipeline, etc.
  • Involved in requirement gathering and business analysis; translated business requirements into technical designs on Google Cloud/BigQuery, Hadoop, and Big Data.
  • Worked with cutting-edge Google Cloud (GCP) BigQuery technologies to deliver next-generation cloud solutions.
  • Defined BI standards, guidelines, and best practices related to the Google Cloud (GCP) BigQuery platform for clients, client services, and technical teams.
  • Extensively used Spark to read data from S3, preprocess it, and store it back in S3 for creating tables with Athena (see the S3/Athena sketch after this list).
  • Developed Oozie workflows for daily incremental loads, which pull data from Teradata and import it into Hive tables using Sqoop.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Maintained accurate and complete technical documents in Confluence and Jira for the data migration from the legacy system to the Google Cloud (GCP) BigQuery platform.
  • Experienced in implementing Spark RDD transformations, actions to implement business analysis.
  • Used Pig as an ETL tool for transformations, event joins, filtering bot traffic, and aggregations before storing the output data in HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the Cosmos activity.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
  • Involved in developing the Confidential Data Lake and building the Confidential Data Cube on a Microsoft Azure HDInsight cluster.
  • Created pipelines to move data from on-premises servers to Azure Data Lake.
  • Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
  • Responsible for creating Hive external tables, loading the data into the tables, and querying the data using HQL.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer behavioral data and financial histories into the Hadoop cluster for analysis.
  • Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
  • Developed Spark jobs written in Python and Scala to perform operations like data aggregation, data processing and data analysis.
  • Developed Spark programs using Python APIs to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
  • Responsible for building scalable distributed data solutions using Hadoop, and responsible for cluster maintenance, commissioning and decommissioning cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
  • Worked in functional, System, and regression testing activities with agile methodology.
  • Worked in an Agile environment, maintaining story points in the Scrum model.
  • Developed Spark scripts using Scala shell commands as per the requirements. Wrote ad hoc HiveQL queries to process data and generate reports.
  • Prepared shell scripts for cleaning up data on Hadoop cluster.
  • Configured and scheduled Control-M jobs.
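
As a minimal illustration of the Databricks processing step referenced above (a hedged sketch, not the actual project code): a PySpark job that reads curated data from ADLS Gen2, builds an aggregate, and writes it to Azure SQL Data Warehouse over JDBC. All storage paths, table names, and credentials are placeholder assumptions.

    # Hedged sketch of the ADF/Databricks aggregation step; every name below is a placeholder.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adf-databricks-aggregate").getOrCreate()

    # Read curated claim data from an ADLS Gen2 container (placeholder account/container).
    claims = spark.read.parquet("abfss://curated@examplestorage.dfs.core.windows.net/claims/")

    # Aggregate daily totals per region for the dashboard tables.
    daily_totals = (
        claims.groupBy("claim_date", "region")
              .agg(F.sum("claim_amount").alias("total_amount"),
                   F.count(F.lit(1)).alias("claim_count"))
    )

    # Write the aggregate to Azure SQL DW via JDBC (credentials would normally come from a secret scope).
    (daily_totals.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://example-server.database.windows.net:1433;database=exampledw")
        .option("dbtable", "dbo.daily_claim_totals")
        .option("user", "etl_user")
        .option("password", "<secret>")
        .mode("overwrite")
        .save())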
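
The S3/Athena preprocessing flow referenced above, sketched the same way with placeholder bucket names and columns: raw JSON is read from S3, cleaned, and written back as partitioned Parquet so that an Athena external table (defined separately) can query it.

    # Hedged sketch: clean raw S3 data and land it as partitioned Parquet for Athena.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-preprocess-for-athena").getOrCreate()

    raw = spark.read.json("s3a://example-raw-bucket/events/")

    cleaned = (
        raw.dropDuplicates(["event_id"])
           .filter(F.col("event_type").isNotNull())
           .withColumn("dt", F.to_date("event_timestamp"))   # partition column
    )

    (cleaned.write
        .mode("overwrite")
        .partitionBy("dt")
        .parquet("s3a://example-curated-bucket/events/"))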

Environment: Hadoop, HDFS, Scala, Spark, Spark SQL, GCP (Google Cloud), MapReduce, Azure, Azure Databricks, Apache Pig, Hive, Sqoop, Python, Java, IntelliJ, HP Vertica, MySQL, DB2, UNIX and Oozie.

Confidential, Pleasanton CA

Senior Data Engineer

Responsibilities:

  • Designed and documented the new development process to convert existing ETL pipelines into Hadoop-based systems.
  • Understood data needs from data scientists and business leads by reviewing metadata documents for source data information.
  • Designed and developed data integration/engineering workflows on big data technologies and platforms (Hadoop, Spark, MapReduce, Hive, HBase). Gathered requirements and prepared the design.
  • Imported data using Sqoop, into HDFS and created Hive tables, loaded data and developed Hive queries.
  • Performed Hive and Spark tuning with partitioning and bucketing of Parquet data and tuning of executor/driver memory.
  • Developed optimized Hive queries for better processing of data.
  • Handled importing of data from various data sources, performed transformations using Hive, and loaded data into S3 data lakes and Snowflake DB.
  • Worked with structured, semi-structured, and unstructured data by implementing complex PySpark scripts.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (see the streaming sketch after this list).
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, and transformations during the ingestion process.
  • Worked with and learned a great deal about Amazon Web Services (AWS) cloud services such as EC2, S3, and VPC.
  • Developed Python scripts to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark.
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Processed Oracle data and created external tables using Hive, and developed scripts to ingest and repair tables that can be reused across the project.
  • Ran data-formatting scripts in Python and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
  • Developed Python code using version control tools such as GitHub and SVN on Vagrant machines.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store in HDFS.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
  • Developed dataflows and processes for data processing using SQL (Spark SQL & DataFrames).
  • Performed Kafka stream analysis, feature selection, and feature extraction using Apache Spark machine learning and streaming libraries in Python.
  • Designed and developed data pipelines to analyze and evaluate multiple solutions, considering multiple cost factors across the business as well as the operational impact on historic sales data.
  • Worked on Hive metastore backups and on partitioning and bucketing techniques in Hive to improve performance; tuned Spark jobs.
  • Worked closely with the data science team to clearly understand requirements and create Hive tables on HDFS.
  • Scheduled Spark jobs using the Airflow scheduler.
  • Generated detailed design documentation for the source-to-target transformations.
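
A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS flow referenced above; broker addresses, topic, and paths are placeholders, and the spark-sql-kafka connector is assumed to be on the cluster's classpath.

    # Hedged sketch: consume a Kafka topic and append the messages to HDFS as Parquet.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
        .option("subscribe", "example_topic")
        .option("startingOffsets", "latest")
        .load())

    # Kafka delivers key/value as binary; cast the payload to string before persisting.
    messages = stream.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    query = (messages.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streams/example_topic/")
        .option("checkpointLocation", "hdfs:///checkpoints/example_topic/")
        .outputMode("append")
        .start())

    query.awaitTermination()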

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, Oozie, Sqoop, Pig, Java, Tableau, REST API, Maven, Storm, Kafka, SQL, ETL, AWS, MapR, Python, PySpark, JavaScript, Shell Scripting.

Confidential, Malvern, PA

Hadoop / Big Data Developer

Responsibilities:

  • Developed data pipelines using Flume, Sqoop, and Java MapReduce to ingest customer behavioral data and healthcare historical data into HDFS for analysis.
  • Developed Sqoop scripts to import history data from various data sources such as MySQL and mainframe DB2 into HDFS and Hive tables.
  • Responsible for managing data coming from different sources in different data formats.
  • Wrote MapReduce programs and Pig scripts to specify the conditions that separate fraudulent claims.
  • Developed Spark code using Scala for data cleaning and pre-processing.
  • Loaded the data into Spark RDDs and performed in-memory data computation to generate the output (see the RDD sketch after this list).
  • Explored the use of Spark to improve the performance and optimization of data transformations in Hadoop, using SparkContext and Spark SQL.
  • Experienced in implementing Spark RDD transformations, actions to implement business analysis.
  • Used Pig as an ETL tool for transformations, event joins, filtering bot traffic, and aggregations before storing the output data in HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Optimized Map Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Responsible for creating Hive external tables, loading the data into the tables, and querying the data using HQL.
  • Responsible for running Hadoop streaming jobs to process terabytes of XML data; utilized cluster coordination services through Zookeeper.
  • Responsible for building scalable distributed data solutions using Hadoop, and responsible for cluster maintenance, commissioning and decommissioning cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
  • Developed workflows and coordinator jobs in Oozie.
  • Worked in functional, System, and regression testing activities with agile methodology.
  • Worked in an Agile environment, maintaining story points in the Scrum model.
  • Analyzed the data using Hive queries and PySpark.
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs, and used Sqoop to import and export data between HDFS and RDBMS for visualization and to generate reports.
  • Wrote ad hoc HiveQL queries to process data and generate reports.
  • Prepared shell scripts for cleaning up data on Hadoop cluster.
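
A minimal PySpark RDD sketch of the in-memory cleaning and aggregation referenced above, using placeholder paths and a made-up delimited record layout (member_id|claim_amount|status):

    # Hedged sketch: load raw claim records as an RDD, clean them, and total amounts per member.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("claims-rdd-cleaning").getOrCreate()
    sc = spark.sparkContext

    raw = sc.textFile("hdfs:///data/raw/claims/")

    # Split the delimited records and drop malformed rows.
    parsed = (raw.map(lambda line: line.split("|"))
                 .filter(lambda f: len(f) == 3 and f[1].replace(".", "", 1).isdigit()))

    # Keep only approved claims and total the amounts per member, entirely in memory.
    totals = (parsed.filter(lambda f: f[2] == "APPROVED")
                    .map(lambda f: (f[0], float(f[1])))
                    .reduceByKey(lambda a, b: a + b))

    totals.saveAsTextFile("hdfs:///data/output/claim_totals/")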

Environment: Hadoop (CDH 5.x), HDFS, Scala, Spark, Spark SQL, MapReduce, Apache Pig, Hive, Sqoop, Java, IntelliJ, HP Vertica, MySQL, DB2, UNIX and Oozie.

Confidential, Deerfield, IL

Hadoop/ETL Developer

Responsibilities:

  • Involved in understanding business requirements, preparing design documents, coding, testing, and go-live installation in the production environment.
  • Gained knowledge of the existing Spark framework and Hadoop ecosystem, including Hadoop MapReduce, HDFS, Sqoop, HBase, Hive, Flume, and Kafka.
  • Involved in creating Hive tables, loading structured data, and writing Hive queries that run internally as MapReduce jobs.
  • Monitored running MapReduce programs on the cluster.
  • Implemented the workflows using Apache Oozie framework to automate tasks.
  • Reviewed the HDFS usage and system design for future scalability and fault-tolerance.
  • Created a RESTful web service using Spring to receive data from devices and store it in MongoDB and HDFS.
  • Implemented Spark with Scala and Java to work with HDFS and Database.
  • Used MongoDB to store big data in JSON format and applied $match, $sort, and $group aggregation operations in MongoDB (see the aggregation sketch after this list).
  • Used Oracle as the relational database to maintain user roles, user details, device configuration, and user complaints.
  • Knowledge of syncing meter data between HDFS and MongoDB via the Scala API.
  • Developed a web application using RESTful web services in Spring MVC with Spring Security and Spark SQL.
  • Developed SQL scripts for Referential Integrity check, which checks validity of incoming data with master tables in database.
  • Involved in data migration from SQL Server to Oracle as well as Oracle to Oracle in different environments.
  • Developed SQL and PL/SQL scripts to transfer tables across the schemas and databases.
  • Loaded data from legacy systems (ETL operations) using SQL*Loader.
  • Developed Data Classification Algorithms to identify sales trends and improve sales growth at region level and globally.
  • Utilized Python to identify trends and relationships between different pieces of data, draw appropriate conclusions, and translate analytical findings into marketing strategies that drive value.
  • Good experience writing pom.xml files to update revision and dependent JAR information.
  • Monitored the Hadoop cluster using Cloudera Manager and Apache Ambari.
  • Wrote Hive UDFs to extract data from staging tables and analyzed the data in HDFS using HiveQL.
  • Worked extensively with Sqoop for importing data from Oracle to HDFS.
  • Used Eclipse as an IDE for code development and Subversion (SVN) for source code management.
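
A minimal pymongo sketch of the MongoDB aggregation pattern referenced above ($match, $group, $sort), with placeholder connection details, database, collection, and field names:

    # Hedged sketch: aggregate device readings stored as JSON documents in MongoDB.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    readings = client["example_db"]["device_readings"]

    pipeline = [
        {"$match": {"status": "ACTIVE"}},                 # filter to active devices
        {"$group": {"_id": "$device_id",                  # group readings per device
                    "avg_value": {"$avg": "$value"},
                    "reading_count": {"$sum": 1}}},
        {"$sort": {"reading_count": -1}},                 # most active devices first
    ]

    for doc in readings.aggregate(pipeline):
        print(doc["_id"], doc["avg_value"], doc["reading_count"])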

Environment: Scala, Spark, Cloudera, Apache Hadoop, HDFS, MapReduce, Eclipse, MySQL, Pig, Hive, Sqoop, Oozie, Kafka, HBase, Flume, Hue, MongoDB, Jenkins, Linux, Shell Scripting, Apache Tomcat 7.0, Oracle 11g.

Confidential, Chantilly, VA

SQL/ETL Developer

Responsibilities:

  • Created stored procedures to transform the data and handled the various transformation needs while loading the data.
  • Designed and created SQL tables, indexes, and sequences.
  • Conducted logical and physical database design, including data modeling, maintenance, and problem diagnosis.
  • Developed SQL scripts, packages, and procedures to implement business rule checks.
  • Developed SQL and PL/SQL scripts to transfer tables across the schemas and databases.
  • Developed SQL scripts for referential integrity checks, which validate incoming data against master tables in the database (see the check sketch after this list).
  • Loaded data from legacy systems (ETL operations) using PL/SQL and SQL*Loader.
  • Involved in creating and modifying several UNIX shell scripts according to the changing needs of the project and client requirements.
  • Involved in writing SQL queries, joins, DDL, DML and user defined functions (UDF) to implement business logic.
  • Responsible for performance tuning and query optimization.
  • Analyzed end-user database needs and provided efficient solutions.
  • Performed backup/restore of database objects such as tables, procedures, constraints, indexes, and views.
  • Developed Stored Procedures, Functions, Packages and SQL Scripts using PL/SQL.
  • Involved in the creation of conceptual models covering all the business requirements.
  • Loaded the data into database tables using SQL*loader from text and excel files.
  • Developed data model, SQL Queries, SQL Query tuning process and Schemas.
  • Created SQL*Plus reports per the client's various needs and developed business objects.
  • Developed custom forms and reports per client requirements and made them web-enabled using Oracle Reports Builder 10g and Oracle Forms Builder 10g, respectively.
  • Developed master-detail and detail reports using tabular and group-above report layouts.
  • Developed procedures for an efficient error-handling process by capturing errors into user-managed tables.
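
A small Python sketch of the referential-integrity check idea referenced above; the driver (cx_Oracle), connection details, and table/column names are all assumptions, shown only to illustrate flagging incoming rows whose keys have no match in the master table:

    # Hedged sketch: report staging rows whose customer_id is missing from the master table.
    import cx_Oracle

    REFERENTIAL_CHECK_SQL = """
        SELECT s.order_id, s.customer_id
          FROM staging_orders s
          LEFT JOIN master_customers m
            ON s.customer_id = m.customer_id
         WHERE m.customer_id IS NULL
    """

    connection = cx_Oracle.connect("etl_user", "<password>", "example-host/ORCLPDB1")
    try:
        cursor = connection.cursor()
        cursor.execute(REFERENTIAL_CHECK_SQL)
        for order_id, customer_id in cursor.fetchall():
            print(f"Orphan row: order {order_id} references unknown customer {customer_id}")
    finally:
        connection.close()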

Environment: SQL, PL/SQL, SQL Loader, SQL Plus, Toad, WinSCP, Putty, Windows Batch Scripts, Shell Scripting, Linux, Unix, Windows, DB2.
