
Data Engineer Resume


Ashburn, VA

SUMMARY

  • Over 7 years of professional IT experience, specializing in Big Data and web architecture solutions using Scala 2.11, Python, Hive, Spark, Kafka, and Storm.
  • Managing the Hadoop distribution with Cloudera Manager, Cloudera Navigator, and Hue.
  • Setting up High Availability for Hadoop cluster components and edge nodes.
  • Experience in developing Shell scripts and Python Scripts for system management.
  • Keen interest in the evolving technology stack offered by Google Cloud Platform (GCP).
  • Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
  • Extensive experience in developing applications using JSP, Servlets, Spring, Hibernate, JavaScript, Angular, AJAX, CSS, jQuery, HTML, JDBC, JNDI, JMS, XML, and SQL across platforms like Windows, Linux, and UNIX.
  • Experience working with Java/J2EE related technologies
  • Expertise in the JVM (Java Virtual Machine) and Java-based middleware, and platforms like Cloudera, Hortonworks, and MapR.
  • In-depth understanding of data structures and algorithms, and extensive experience working with MS Excel, SQL Server, and RDBMS databases.
  • Experience in developing deliverable documentation, including data flows, use cases, and business rules.
  • Hands on experience in development, installation, configuring, and using Hadoop & ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Sqoop, Pig, Flume, Kafka and Spark
  • Involved in setting up standards and processes for Hadoop-based application design and implementation.
  • Involved in creating Hive tables, loading, and analyzing data using Hive scripts.
  • Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL (see the sketch after this list).
  • Involved in building applications using SBT and integrating with continuous integration servers like Jenkins to build jobs.
  • Experience in managing and reviewing Hadoop log files, worked on Hadoop Cluster architecture and monitoring the cluster.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Responsible for developing efficient MapReduce programs on the AWS cloud for more than 20 years' worth of claim data to detect and separate fraudulent claims.
  • Uploaded and processed more than 30 terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
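A minimal PySpark sketch of the Hive dynamic-partitioning work described above. The table and column names (claims_part, claims_staging, claim_id, claim_date) are hypothetical, and a CLUSTERED BY ... INTO N BUCKETS clause would be added to the DDL when bucketed sampling is required.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-dynamic-partition-demo")
             .enableHiveSupport()
             .getOrCreate())

    # Allow dynamic-partition inserts (HiveQL settings issued through Spark SQL).
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # Hypothetical partitioned claims table stored as ORC.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS claims_part (
            claim_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (claim_date STRING)
        STORED AS ORC
    """)

    # Small in-line staging view so the sketch is self-contained.
    staging = spark.createDataFrame(
        [(1, 120.5, "2020-01-01"), (2, 80.0, "2020-01-02")],
        ["claim_id", "amount", "claim_date"])
    staging.createOrReplaceTempView("claims_staging")

    # Dynamic-partition insert: Hive derives the claim_date partitions from the data.
    spark.sql("""
        INSERT OVERWRITE TABLE claims_part PARTITION (claim_date)
        SELECT claim_id, amount, claim_date FROM claims_staging
    """)

    spark.sql("SELECT claim_date, COUNT(*) FROM claims_part GROUP BY claim_date").show()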

TECHNICAL SKILLS

Programming & Scripting Languages: Python, PySpark, Scala, R, Java, Shell script, Perl script, SQL

Big Data Ecosystem: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, YARN, Apache Spark, Mahout, Sparklib

Libraries: Python (NumPy, Pandas, Scikit-learn, SciPy), Matplotlib, Spark ML, Spark MLlib

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos

BI and Visualization: SAS, Tableau, Power BI

IDE: Jupyter, Zeppelin, PyCharm, Eclipse

Cloud Based Tools: Microsoft Azure, Google Cloud Platform, AWS, S3, EC2, Glue, Redshift, EMR


PROFESSIONAL EXPERIENCE

Confidential, Ashburn, VA

Data Engineer

Responsibilities:

  • Experience in dimensional data modeling, ETL development, and Data Warehousing.
  • Handled importing of data from various data sources, performed transformations using Spark, and loaded the data into Hive.
  • Involved in performance tuning of Hive (ORC table) for design, storage, and query perspectives.
  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark using Python and Scala.
  • Worked on implementing data lake and responsible for data management in Data Lake.
  • Developed Kafka consumer to consume data from Kafka topics.
  • Developed shell scripts for running Hive scripts in Hive and Impala.
  • Responsible for optimization of data-ingestion, data-processing, and data-analytics.
  • Expertise in developing PySpark applications that build connections between HDFS and HBase and allow data transfer between them.
  • Developed workflows for the complete end-to-end ETL process: ingesting data into HDFS, validating it and applying business logic, storing clean data in Hive external tables, exporting data from Hive to RDBMS sources for reporting, and escalating data quality issues.
  • Designed, implemented, and tested major subsystems of the AWS cloud platform and core service offerings.
  • Involved in working with Amazon Web Services (AWS), using AWS Glue, Redshift, Kinesis, EC2, and PySpark for computing and S3 as the storage mechanism.
  • Designed and built Big Data pipelines to process Big Data using AWS.
  • Experienced with API Gateway & Rest services in collecting the data and publishing to downstream applications.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations.
  • Implemented AWS EC2, Key Pairs, Security Groups, Auto Scaling, ELB, SQS, and SNS
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in maintaining the Hadoop cluster on AWS EMR.
  • Developed a Lambda script using Python and the boto3 library to update a DynamoDB table (see the Lambda sketch after this list).
  • Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Experience in using AWS Glue crawlers to create tables on raw data in AWS S3.
  • Skilled in developing applications in Python for multiple platforms, with good experience in handling data manipulation using Python scripts.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used JSON schema to define table and column mapping from S3 data to Redshift.
  • Designed Data Pipeline to migrate the data from on-prem/traditional sources to Cloud Platform
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the PySpark sketch after this list).
  • Translating complex functional and technical requirements into detailed design.
  • Fetching the data from multiple data sources and processing in PySpark using Glue.
  • Provide technical leadership in the Big Data space (Vertica, Hadoop, SQL Server, Oozie, Hive, Avro, Spark, Scala, Java, Python, Redshift).
  • Select and integrate Big Data tools and frameworks required to provide capabilities required by business functions, while keeping in mind hardware, software & financial constraints.
  • Build data pipelines that are scalable, repeatable, and secure, and can serve multiple purposes.
  • Leading and mentoring as a senior member of the Engineering team.
  • Developing and following best practices relative to design, implementation, and testing
  • Producing professional, documented designs.
  • Enhanced presentation and communication methods used for presenting project capabilities to prospective clients.
  • Developed scalable databases capable of ETL processes using SQL and Spark.
  • Utilized MongoDB to create NoSQL databases that collect data from a variety of sources.
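Lambda sketch: a minimal Python handler using boto3 to update a DynamoDB item, as referenced in the bullets above; the table name (claims_table), key (claim_id), and status attribute are hypothetical, not taken from the original project.

    import boto3

    # Hypothetical table and attribute names; boto3 picks up AWS credentials
    # from the Lambda execution role.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("claims_table")

    def lambda_handler(event, context):
        # Expect the claim id and the new status in the triggering event.
        claim_id = event["claim_id"]
        new_status = event.get("status", "PROCESSED")

        # Update (or create) the status attribute on the matching item.
        table.update_item(
            Key={"claim_id": claim_id},
            UpdateExpression="SET #s = :s",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": new_status},
        )
        return {"statusCode": 200, "body": f"claim {claim_id} set to {new_status}"}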
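PySpark sketch: one way such a multi-format extraction and aggregation job could look; the S3 paths, column names, and the customer_id/segment join key are assumptions for illustration only.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Extract from multiple file formats (hypothetical S3 paths).
    clicks = spark.read.json("s3://bucket/raw/clicks/")
    profiles = spark.read.parquet("s3://bucket/curated/profiles/")

    # Transform: derive an event date and enrich clicks with customer segments.
    usage = (clicks
             .withColumn("event_date", F.to_date("event_ts"))
             .join(profiles, "customer_id", "left"))

    # Aggregate with Spark SQL: daily event counts per customer segment.
    usage.createOrReplaceTempView("usage")
    daily = spark.sql("""
        SELECT segment, event_date, COUNT(*) AS events
        FROM usage
        GROUP BY segment, event_date
    """)

    daily.write.mode("overwrite").parquet("s3://bucket/analytics/daily_usage/")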

Environment: Hadoop, Python, HDFS, Spark, AWS Redshift, AWS Glue, MapReduce, Hive, Sqoop, Kafka, HBase, Oozie, Flume, Scala, Java, SQL scripting, PySpark, Linux shell, Cassandra, Zookeeper, MongoDB, Cloudera Manager, EC2, EMR, S3, Oracle, Kinesis.

Confidential, Plano, TX

Software Engineer

Responsibilities:

  • Enhanced presentation and communication methods used for presenting project capabilities to prospective clients.
  • Developed scalable databases capable of ETL processes using SQL and Spark.
  • Designed and Implemented Big Data Analytics architecture, transferring data from Oracle.
  • Created reusable SSIS packages to extract data from multi-formatted flat files and Excel files into the SQL database.
  • Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data. Understood the current production state of the application and determined the impact of new implementations on existing business processes.
  • Documented all Extract, Transform, and Load processes; designed, developed, validated, and deployed the Talend ETL processes for the data warehouse team using Pig and Hive.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, and to write data back using the write-back tool.
  • Developed business intelligence solutions using SQL Server Data Tools (2015 and 2017) and loaded data into SQL and Azure cloud databases.
  • Involved in creating fact and dimension tables in the OLAP database and created cubes using MS SQL Server Analysis Services (SSAS).
  • Exposure to Lambda functions and Lambda Architecture.
  • Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads.
  • Implemented logical and physical relational databases and maintained database objects in the data model using Erwin.
  • Exporting the analyzed and processed data to the RDBMS using Sqoop for visualization and for generation of reports for the BI team.
  • Proficient knowledge of Apache Spark and programming in Scala to analyze large datasets, using Spark to process real-time data.
  • Created SSIS Packages to perform filtering operations and to import the data on daily basis from the OLTP system to SQL server.
  • Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL
  • Built MDX queries for Analysis Services (SSAS) & Reporting Services (SSRS).
  • Experienced in querying data using Spark SQL on top of the Spark engine, implementing Spark RDDs in Scala.
  • Worked on designing, building, deploying, and maintaining MongoDB.
  • Developed an ETL framework using Spark and Hive (including daily runs, error handling, and logging) to turn raw data into useful data (see the sketch after this list).
  • Coordinated with the team and developed a framework to generate daily ad-hoc reports and extracts from enterprise data, automated using Oozie.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and Pair RDDs.
  • Designed and developed data models for the database (OLTP), the Operational Data Store (ODS), the data warehouse (OLAP), and federated databases to support the client's enterprise information management strategy, with excellent knowledge of the Ralph Kimball and Bill Inmon approaches to data warehousing.
  • Responsible for maintaining and tuning existing cubes using SSAS and Power BI.
  • Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
  • Extensive experience in creating pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory (V2), and using Key Vault to store credentials.
  • Designed and developed a daily process to incrementally import raw data from DB2 into Hive tables using Sqoop.
  • Involved in debugging MapReduce jobs using the MRUnit framework and optimizing MapReduce.
  • Extensively used HiveQL queries to query data in Hive tables and to load data into Hive tables.
  • Worked on creating tabular models on Azure Analysis Services to meet business reporting requirements.
  • Educated developers on how to commit their work and how they can make use of the CI/CD pipelines that are in place.
  • Set up full CI/CD pipelines so that each commit a developer makes goes through the standard software lifecycle process and is tested thoroughly before it reaches production.
  • Responsible for managing and supporting Continuous Integration (CI) using Jenkins
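A minimal sketch of the kind of daily Spark/Hive ETL run with error handling and logging referenced above; the paths, the target table (analytics.daily_events), and the dedup rule are hypothetical, and the Hive table is assumed to already exist.

    import logging
    import sys
    from datetime import date

    from pyspark.sql import SparkSession

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger("daily_etl")

    def run(run_date: str) -> None:
        spark = (SparkSession.builder
                 .appName(f"daily-etl-{run_date}")
                 .enableHiveSupport()
                 .getOrCreate())
        try:
            log.info("Reading raw data for %s", run_date)
            raw = spark.read.parquet(f"/data/raw/events/dt={run_date}")

            log.info("Applying business rules and writing to Hive")
            cleaned = raw.dropDuplicates(["event_id"]).filter("event_id IS NOT NULL")
            cleaned.write.mode("overwrite").insertInto("analytics.daily_events")

            log.info("Run for %s finished with %d rows", run_date, cleaned.count())
        except Exception:
            log.exception("Daily ETL failed for %s", run_date)
            sys.exit(1)  # non-zero exit so the scheduler (e.g. Oozie) flags the run
        finally:
            spark.stop()

    if __name__ == "__main__":
        run(sys.argv[1] if len(sys.argv) > 1 else date.today().isoformat())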

Environment: Hadoop, Python, HDFS, Spark, Hive, Sqoop, Kafka, HBase, Oozie, Flume, Scala, Java, SQL scripting, Talend, PySpark, Linux shell scripting, Cassandra, Zookeeper, MongoDB, Cloudera Manager, EC2, EMR, S3, Oracle, MySQL.

Confidential, Seattle, WA

Data Engineer

Responsibilities:

  • Worked on developing POCs in Spark using Python to compare the performance of Spark with Hive and SQL/Oracle.
  • Involved in creating UDFs in Spark using the Scala and Python programming languages.
  • Monitored continuously and managed the Hadoop cluster using Cloudera manager.
  • Created Hive-HBase tables, with Hive serving as the metastore and HBase providing data storage in row-key format.
  • Used GitHub as repository for committing code and retrieving it and Jenkins for continuous integration.
  • Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage mechanism.
  • Worked on AWS Relational Database Service, AWS Security Groups and their rules, and implemented reporting and notification services using the AWS API. Also worked on API Gateway.
  • Developed pipeline for POC to compare performance/efficiency while running pipeline using the AWS EMR Spark cluster and Cloud Dataflow
  • Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Responsible for developing efficient MapReduce on AWS cloud programs for more than 20 years' worth of claim data to detect and separate fraudulent claims.
  • Implemented AWS EC2, Key Pairs, Security Groups, Auto Scaling, ELB, SQS, and SNS
  • Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in maintaining the Hadoop cluster on AWS EMR.
  • Developed a lambda script using Python and boto3 library to update the DynamoDB table.
  • Experience in using AWS Glue crawlers to create tables on raw data in AWS S3 (see the Glue sketch after this list).
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Worked on Apache Spark for an incremental merge process by converting the data to key-value pairs (see the merge sketch after this list).
  • Good experience in testing different data pipelines and ensured to have the highest data quality.
  • Skilled in developing applications in Python for multiple platforms, with good experience in handling data manipulation using Python scripts.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift. Worked on PostgreSQL as well.
  • Used JSON schema to define table and column mapping from S3 data to Redshift.
  • Created packages in SSIS with error handling, as well as complex SSIS packages using various data transformations such as Conditional Split, Cache, For Each Loop, Multicast, Derived Column, Data Conversion, Merge, OLE DB Command, and Script Task components.
  • Experience with data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
  • Worked on creating correlated and non-correlated sub-queries to resolve complex business queries involving multiple tables from different databases.
  • Integrated Apache Storm with Kafka to perform web analytics, and uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Strong experience in Normalization and De-normalization techniques for effective and optimum performance in OLTP and OLAP environments and experience with Kimball Methodology and Data Vault Modeling
  • Designed Hive external tables using a shared metastore instead of Derby, with dynamic partitioning and buckets.
  • Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, NiFi, and webMethods.
  • Designed and implemented ETL processes using Ab Initio to load data. Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes, and for loading data into HDFS.
  • Working knowledge on QlikView server, QlikView publisher and QlikView enterprise version.
  • Maintained data in the form of Qlik Sense data files, pulling the data from the relevant database.
  • Created concurrent access for Hive tables with shared/exclusive locks enabled by implementing Zookeeper in the cluster.
  • Strongly recommended bringing in Elasticsearch and was responsible for installing, configuring, and administering it.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala
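Glue sketch: a minimal boto3 example of creating and starting a crawler over raw S3 data, as referenced above; the crawler name, database, IAM role ARN, and bucket path are placeholders, not real values.

    import boto3

    glue = boto3.client("glue")

    # Hypothetical names: crawler, target database, IAM role, and S3 path.
    glue.create_crawler(
        Name="raw-claims-crawler",
        Role="arn:aws:iam::123456789012:role/GlueServiceRole",
        DatabaseName="raw_db",
        Targets={"S3Targets": [{"Path": "s3://my-raw-bucket/claims/"}]},
        SchemaChangePolicy={"UpdateBehavior": "UPDATE_IN_DATABASE",
                            "DeleteBehavior": "LOG"},
    )

    # Run the crawler; the resulting tables land in the Glue Data Catalog and are
    # queryable from Athena, Redshift Spectrum, or Glue ETL jobs.
    glue.start_crawler(Name="raw-claims-crawler")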
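Merge sketch: one common way to implement an incremental merge in PySpark, keeping the latest record per business key; the paths, the customer_id key, and the updated_at ordering column are assumptions for illustration.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("incremental-merge").getOrCreate()

    base = spark.read.parquet("/data/curated/customers/")          # full snapshot
    delta = spark.read.parquet("/data/incoming/customers_delta/")  # new changes

    # Union the snapshot with the delta, then keep only the most recent row per key.
    merged = base.unionByName(delta)
    latest = (merged
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("customer_id")
                        .orderBy(F.col("updated_at").desc())))
              .filter("rn = 1")
              .drop("rn"))

    # Write to a separate path first, since the source snapshot is still being read.
    latest.write.mode("overwrite").parquet("/data/curated/customers_merged/")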

Environment: Hadoop, Hive, Impala, Oracle, Spark, Sqoop, Oozie, PostgreSQL, MapReduce, SQL

Confidential, Pasadena, CA

Data Engineer

Responsibilities:

  • Involved in complete project life cycle starting from design discussion to production deployment.
  • Used the Cloud Shell SDK to configure the Dataproc, Storage, and BigQuery services, and implemented and configured a High Availability Hadoop cluster.
  • Extensive experience in writing Pig scripts to transform raw data into baseline data, and in developing UDFs in Java as needed for use in Pig and Hive queries.
  • Worked closely with the business team to gather their requirements and new support features.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Installed and configured Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, Zookeeper).
  • Developed Hive scripts to analyze data; PHI was categorized into different segments, and promotions were offered to customers based on those segments.
  • Created DDLs for tables and executed them to create tables in the warehouse for ETL data loads, and used Pig as an ETL tool to do transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
  • Developed, maintained, monitored, and performance-tuned the data mart databases and SSAS OLAP cubes.
  • Set up, configured, and maintained a large-scale Hadoop-based distributed computing cluster for analysis of petabytes of event data with associated metadata.
  • Created Hive tables, partitions and loaded the data to analyze using HiveQL queries.
  • Experience in retrieving data from Oracle using PHP and Java programming.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Good understanding of ETL tools and how they can be applied in a Big Data environment.
  • Implemented solutions using PySpark and MySQL for faster testing and processing of data, with real-time data streaming using Kafka (see the streaming sketch after this list).
  • Used Oozie operational services for batch processing and scheduling workflows dynamically; worked on creating end-to-end data pipeline orchestration using Oozie.
  • Populated HDFS and Cassandra with massive amounts of data using Apache Kafka.
  • Worked on major components in Hadoop Ecosystem including Hive, PIG, HBase, HBase-Hive Integration, Pyspark, Sqoop and Flume.
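Streaming sketch: a minimal PySpark Structured Streaming job reading from Kafka, as referenced above; the broker addresses, topic name, and checkpoint path are placeholders, and the job assumes the spark-sql-kafka connector package is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

    # Read the Kafka topic as a streaming DataFrame (hypothetical brokers/topic).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
              .option("subscribe", "events")
              .option("startingOffsets", "latest")
              .load()
              .selectExpr("CAST(key AS STRING) AS key",
                          "CAST(value AS STRING) AS value",
                          "timestamp"))

    # Count events per 1-minute window; in practice a foreachBatch sink would write
    # the results to HDFS, Hive, or MySQL instead of the console.
    counts = events.groupBy(F.window("timestamp", "1 minute")).count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoints/kafka-stream")
             .start())
    query.awaitTermination()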

Environment: Hadoop, MapReduce, HDFS, Sqoop, flume, Kafka, Hive, Pig, HBase, MySQL, Shell Scripting, Eclipse, DBeaver, Datagrip, SQL Developer, IntelliJ, Git, SVN, JIRA, Unix, SSIS, SSAS

Confidential

Java Developer

Responsibilities:

  • Worked on designing the content and delivering the solutions based on understanding the requirements.
  • Wrote a web service client for tracking operations on orders, which accesses the web services API and is utilized in our web application.
  • Developed the user interface using JavaScript, jQuery, and HTML.
  • Used the AJAX API for intensive user operations and client-side validations.
  • Worked with Java, J2EE, SQL, JDBC, XML, JavaScript, web servers.
  • Utilized Servlets for the controller layer, and JSP and JSP tags for the interface.
  • Worked on Model View Controller Pattern and various design patterns.
  • Worked with designers, architects, developers for translating data requirements into the physical schema definitions for SQL sub-programs and modified the existing SQL program units.
  • Designed and Developed SQL functions and stored procedures.
  • Involved in debugging and bug fixing of application modules.
  • Efficiently dealt with exceptions and flow control.
  • Worked on Object Oriented Programming concepts.
  • Added Log4j to log the errors.
  • Used Eclipse for writing code and SVN for version control.
  • Installed and used MS SQL Server 2008 database.
  • Spearheaded coding for site management which included change of requests for enhancing and fixing bugs pertaining to all parts of the website.

Environment: Java, JavaScript, JSP, Rest API, JDBC, Servlets, MS SQL, XML, Windows XP, Ant, SQL Server database, Eclipse Luna, SVN
