Sr. Data Engineer Resume

TX

SUMMARY

  • Over 8 years of experience as an accomplished IT professional focused on Big Data ecosystems, Hadoop architecture, and data warehousing.
  • Data architecture expertise, including data ingestion, pipeline design, Hadoop information architecture, data modeling, data mining, advanced data processing, and ETL workflow optimization.
  • Proficiency with Scala, Apache HBase, Hive, Pig, Mahout, Oozie, Flume, Sqoop, Zookeeper, Spark, Spark SQL, Spark Streaming, Kinesis, Airflow, Yarn, and Hadoop (HDFS, MapReduce).
  • Solid experience designing Spark applications for highly scalable data transformations using the RDD, DataFrame, and Spark SQL APIs.
  • Worked extensively with NoSQL databases and their integration, including DynamoDB, Cosmos DB, MongoDB, Cassandra, and HBase.
  • Proficiency with Cloudera, Hortonworks, Amazon EMR, Redshift, EC2, and Azure HDInsight for project creation, implementation, deployment, and maintenance using Java/J2EE, Hadoop, and Spark.
  • Hands-on Experience with AWS cloud (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, ECS).
  • Skillful with MapReduce, Apache Crunch, Hive, Pig, and Splunk for Hadoop tasks.
  • Proficient at developing sophisticated MapReduce systems that operate on a variety of file types, including Text, Sequence, XML, and JSON.
  • Proficient in creating Spark-Scala and PySpark applications for interactive analysis, batch processing, and stream processing, with a solid understanding of Spark's architecture and components.
  • Used Spark DataFrame operations to perform critical data validations and used the DataFrame API extensively on Cloudera infrastructure to run analytics on Hive data.
  • Proficient in Python scripting; experience with NumPy for numerical computation, Matplotlib for visualization, and Pandas for data manipulation.
  • Used Spark SQL and the DataFrame API to load structured and semi-structured data into Spark clusters.
  • Strong background in building shell scripts and dealing with UNIX/LINUX systems.
  • Extensive expertise troubleshooting Spark application issues and optimizing system performance by fine-tuning Spark applications and Hive queries.
  • Developed advanced HiveQL queries to extract required data from Hive tables and built Hive User Defined Functions (UDFs) as needed.
  • Excellent understanding of Hive partitioning and bucketing concepts, as well as the design of both managed and external tables to improve query efficiency (see the sketch after this list).
  • Expertise in building Kinesis producers and consumers to meet business needs; landed the streamed data in HDFS and processed it with Spark.
  • Experience using the Teradata Connector for Hadoop (TDCH) to load data from several sources, such as Teradata, into HDFS and partitioned Hive tables.
  • Experience transferring data from HDFS to Relational Database System and vice versa using SQOOP according to client requirements.
  • Used Git, SVN, Bamboo, and Bitbucket version control systems efficiently.
  • Strong knowledge of ETL techniques for data warehousing utilizing Informatica Power Center, OLAP, and OLTP.
  • Strong expertise developing complex Oracle queries and database architecture using PL/SQL to build stored procedures, functions, and triggers.
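
As an illustration of the DataFrame validations and Hive partitioning work listed above, below is a minimal PySpark sketch; the database, table, column names, and HDFS paths are hypothetical placeholders rather than references to an actual project.

    # Minimal sketch: validate raw records with DataFrame operations, then expose
    # the curated output through a partitioned external Hive table.
    # Database, table, column names, and paths are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("orders-validation")
             .enableHiveSupport()
             .getOrCreate())

    # Assumes the JSON records carry order_id, customer_id, amount, and order_date fields.
    raw = spark.read.json("hdfs:///landing/orders/")

    # Basic validations: required keys present, amounts numeric and non-negative.
    validated = (raw
                 .filter(F.col("order_id").isNotNull())
                 .withColumn("amount", F.col("amount").cast("double"))
                 .filter(F.col("amount") >= 0))

    spark.sql("CREATE DATABASE IF NOT EXISTS edw")
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS edw.orders (
            order_id STRING, customer_id STRING, amount DOUBLE)
        PARTITIONED BY (order_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/edw/orders'
    """)

    # Write partitioned Parquet to the external location and register the partitions.
    (validated.write.mode("overwrite")
     .partitionBy("order_date")
     .parquet("hdfs:///warehouse/edw/orders"))
    spark.sql("MSCK REPAIR TABLE edw.orders")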

TECHNICAL SKILLS

AWS Services: S3, EC2, EMR, Redshift, RDS, Lambda, Kinesis, SNS, SQS, AMI, IAM, CloudFormation

Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Snowflake

Overall Cloud Services: Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), AWS (Amazon Web Services), EMR (Elastic MapReduce), S3 (Simple Storage Service), Lambda (serverless), ECS (Elastic Container Service), SNS (Simple Notification Service), SQS (Simple Queue Service)

Spark Components: Apache Spark, DataFrames, Spark SQL, YARN, Pair RDDs

Databases: Oracle, Microsoft SQL Server, MySQL, DB2, Teradata

Programming Languages: Java, Scala, Impala, Python.

NoSQL Databases: HBase, Cassandra, MongoDB

Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC.

Cloud Services: AWS, Azure

ETL Tools: Talend Open Studio & Talend Enterprise Platform

Reporting and ETL Tools: Tableau, Power BI, AWS Glue, SSIS, SSRS, Informatica, DataStage

PROFESSIONAL EXPERIENCE

Sr. Data Engineer

Confidential, TX

Responsibilities:

  • Developed Spark programs to process raw data (JSON, XML, CSV, etc.), populate staging tables, and store the refined data in partitioned tables in the enterprise data warehouse.
  • Developed streaming applications using Spark and Kinesis that consume messages from Amazon Kinesis streams and publish data to AWS S3 buckets (a minimal sketch follows this list).
  • Used AWS EFS to provide scalable file storage with AWS EC2.
  • Built a data pipeline to move data from on-premises systems to the cloud using Spark with Scala.
  • Integrated data from data warehouses and data marts into cloud-based data structures using T-SQL.
  • Developed DDL and DML scripts in SQL and HQL for analytics applications in RDBMS and Hive.
  • Wrote shell scripts to parameterize Hive actions in Oozie workflows and to schedule tasks.
  • Used Kinesis to populate HDFS and Cassandra with massive volumes of data.
  • Used Amazon EKS to run, scale, and deploy applications in the cloud or on-premises.
  • Analyzed existing SQL scripts and developed PySpark code to replicate the transformations performed in the on-premises environment.
  • Used Sqoop widely for importing and exporting data from HDFS to Relational Database Systems/Mainframes, as well as loading data into HDFS.
  • Developed and maintained data warehouse objects; optimized PySpark jobs to run on a Kubernetes cluster for faster data processing by deploying them through Jenkins integrated with Git version control.
  • Used SSIS Designer to create SSIS packages for exporting heterogeneous data from OLE DB sources and Excel spreadsheets to SQL Server.
  • Migrated data into the RV data pipeline using Databricks, Spark SQL, and Scala.
  • Monitored applications on YARN and troubleshot and resolved cluster-specific system issues.
  • Worked as a key member of the team that designed an initial prototype of a NiFi big data pipeline demonstrating an end-to-end scenario of data ingestion and processing.
  • Used NiFi to determine whether a message was delivered to the destination system; a custom NiFi processor was created for this.
  • Worked with NoSQL databases such as HBase and integrated with Spark for real-time data processing.
  • Customizing logic around error handling and logging of Ansible/Jenkins job results.
  • Used the Oozie scheduler to automate the pipeline process and coordinate the MapReduce jobs that extracted data, with Zookeeper providing cluster coordination services.
  • Created Hive queries to assist data analysts in identifying developing patterns by comparing new data to EDW (enterprise data warehouse) reference tables and previous measures.
  • Involved in specification design, design documents, data modeling, and data warehouse design; evaluated existing EDW (enterprise data warehouse) technologies and processes to ensure that the EDW/BI design meets the demands of the business and organization while allowing for future expansion.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns and developed predictive analytics using the Apache Spark Scala APIs.
  • Worked on Hadoop, SOLR, Spark, and Kinesis-based Big Data Integration and Analytics.
  • Built big data jobs to load large amounts of data into the S3 data lake and ultimately into AWS Redshift, and created a pipeline to allow continuous data loads.
  • Optimized long-running Hive queries using join optimizations, vectorization, partitioning, bucketing, and indexing.
  • Designed, developed, and implemented ETL pipelines using the Python API of Apache Spark (PySpark) on AWS EMR.
  • Extensive experience with Apache Hudi datasets using insert and bulk-insert operations.
  • Developed Spark programs using the Scala and Java APIs and performed transformations and actions on RDDs.
  • Developed Spark jobs on Databricks to perform tasks like data cleaning, data validation, standardization and then applied transformations as per the use cases.
  • Used Kibana and Elasticsearch to identify Kafka message failure scenarios.
  • Tuned Spark applications by adjusting memory and resource allocation settings, determining the best batch interval, and scaling the number of executors to match rising demand over time; deployed Spark and Hadoop jobs on the EMR cluster.
  • Scheduled weekly and monthly data refreshes on Tableau Server in response to business changes to ensure that views and dashboards display updated data accurately.
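
A minimal sketch of the Kinesis-to-S3 streaming pattern referenced above, using the Spark 2.x DStream API; the stream, application, and bucket names are hypothetical, and the spark-streaming-kinesis-asl package is assumed to be on the classpath.

    # Minimal sketch: consume a Kinesis stream with Spark Streaming and write each
    # micro-batch to S3. Stream, app, and bucket names are hypothetical.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    sc = SparkContext(appName="kinesis-to-s3")
    ssc = StreamingContext(sc, batchDuration=60)  # 60-second micro-batches

    records = KinesisUtils.createStream(
        ssc,
        kinesisAppName="clickstream-consumer",    # checkpoint table name in DynamoDB
        streamName="clickstream-events",
        endpointUrl="https://kinesis.us-east-1.amazonaws.com",
        regionName="us-east-1",
        initialPositionInStream=InitialPositionInStream.LATEST,
        checkpointInterval=60)

    def save_batch(batch_time, rdd):
        # Each non-empty batch lands under a time-stamped S3 prefix for later refinement.
        if not rdd.isEmpty():
            rdd.saveAsTextFile(
                "s3a://raw-events-bucket/clickstream/" +
                batch_time.strftime("%Y%m%d%H%M%S"))

    records.foreachRDD(save_batch)

    ssc.start()
    ssc.awaitTermination()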

Technologies: Hadoop, HDFS, Java 8, Hive, Sqoop, HBase, Oozie, Storm, YARN, NiFi, Cassandra, Zookeeper, Spark, Kinesis, MySQL, Shell Script, AWS, EC2, Source Control, Git, Teradata SQL Assistant.

Sr. Data Engineer

Confidential, New Jersey

Responsibilities:

  • Installed and configured Apache Hadoop big data components such as HDFS, MapReduce, YARN, Hive, HBase, Sqoop, Pig, Ambari, and NiFi.
  • Used Zookeeper to manage synchronization, serialization, and coordination across the cluster after migrating from JMS Solace to Kinesis.
  • Designed and developed Azure Data Factory (ADF) pipelines to ingest data from various relational and non-relational source systems to fulfill business functional needs.
  • Used a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics) to extract, transform, and load data from source systems to Azure data storage services.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Created pipelines, data flows, and complex data transformations and manipulations using Databricks and ADF.
  • Created and provisioned multiple Databricks clusters and deployed the required libraries for batch and continuous streaming data processing.
  • Ingested a large volume and variety of data from diverse source systems into Azure Data Lake Storage Gen2 using Azure Data Factory V2 and Azure cluster services.
  • Designed and maintained multiple applications on EC2 to ingest and transmit data from S3 to EMR and Redshift.
  • Ingested data from numerous sources into S3 using AWS Kinesis Data Streams and Firehose.
  • Used Elastic MapReduce (EMR) with AWS Redshift to process many terabytes of data stored in AWS.
  • Performed full data loads from S3 to Azure Data Lake Storage Gen2 and SQL Server using Azure Data Factory V2.
  • Involved in database migration methodologies and integration conversion solutions to convert legacy ETL processes into Azure Synapse compatible architecture.
  • Implemented Apache Spark data processing project to handle data from multiple RDBMS and Streaming sources and developed Spark applications using Scala and Java.
  • Created a Spark Scala notebook to clean and manipulate data across several tables.
  • Built a complete data pipeline using an FTP adapter, Spark, Hive, and Impala.
  • Implemented Spark jobs in Scala and used Spark SQL heavily for faster data processing.
  • Developed ETL solutions using Spark SQL in Azure Databricks for data extraction, transformation, and aggregation from multiple file formats and data sources.
  • Created scripts for data modeling and mining to give PMs and EMs better access to Azure logs.
  • Delivered end-to-end PySpark ETL pipelines on Azure Databricks to transform data, orchestrated via Azure Data Factory (ADF), scheduled through Azure Automation accounts, and triggered with Tidal Scheduler.
  • Respond to client requests for SQL objects, schedules, business logic updates, and ad hoc queries, as well as analyze and resolve data sync issues.
  • Created customized reports in Power BI and Tableau for Business Intelligence.
  • Worked with Sqoop to import additional corporate data from various data sources into HDFS, conduct transformations with Hive, Map Reduce, and finally load data into HBase tables.
  • Worked on several speed improvements, including leveraging a distributed cache for small datasets, partitioning, bucketing in Hive, and Map Side Joins.
  • Created a linked service to move data from SFTP to Azure Data Lake.
  • Created numerous Databricks Spark jobs using PySpark to perform table-to-table transformations (a minimal sketch follows this list).
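
A minimal PySpark sketch of the Databricks table-to-table pattern mentioned above, reading from ADLS Gen2, aggregating, and writing a curated Delta output; the storage account, containers, and column names are hypothetical, and a Databricks cluster with Delta Lake and ADLS credentials configured is assumed.

    # Minimal sketch: read a raw table from ADLS Gen2, aggregate it, and publish a
    # curated Delta table. Storage account, containers, and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("adls-table-to-table").getOrCreate()

    raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/transactions/"
    curated_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales/daily_sales/"

    transactions = spark.read.parquet(raw_path)

    # Standardize and aggregate before publishing to the curated zone.
    daily_sales = (transactions
                   .withColumn("sale_date", F.to_date("sale_timestamp"))
                   .groupBy("sale_date", "store_id")
                   .agg(F.sum("amount").alias("total_amount"),
                        F.count("*").alias("txn_count")))

    daily_sales.write.format("delta").mode("overwrite").save(curated_path)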

Technologies: Azure Data Factory (ADF v2), Azure Databricks (PySpark), Azure Data Lake, Spark (Python/Scala), Hive, Apache NiFi 1.8.0, Jenkins, Kinesis, Spark Streaming, Docker containers, PostgreSQL, RabbitMQ, Celery, Flask, ELK Stack, AWS, MS Azure, Azure SQL Database, Azure Function Apps, Azure Synapse, Blob Storage, SQL Server, Windows Remote Desktop, UNIX shell scripting, Azure PowerShell, ADLS Gen 2, Azure Cosmos DB, Azure Event Hub, Sqoop, Flume

AZURE/Snowflake Engineer

Confidential, NJ

Responsibilities:

  • Analyzed, designed, and developed modern data solutions using Azure PaaS services to enable data visualization.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Created a Databricks Delta Lake process for real-time data loads from various sources (databases, Adobe, and SAP) to the AWS S3 data lake using Python/PySpark code (a minimal sketch follows this list).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from a variety of sources, including Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
  • Experienced in writing Hive queries to analyze massive datasets of structured, unstructured, and semi-structured data.
  • Developed and deployed the solution using Spark and Scala code on a Hadoop cluster running on GCP.
  • Used advanced Hive techniques such as bucketing, partitioning, and optimizing self joins to boost performance on structured data.
  • Designed, tested, and deployed the CI/CD framework using Kubernetes and Docker as the runtime environment.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Owned several end-to-end transformations of customer business analytics problems, breaking them down into a mix of appropriate hardware (IaaS/PaaS/Hybrid) and software (MapReduce) paradigms, and then applying machine learning algorithms to extract useful information from data lakes.
  • On both Cloud and On-Prem hardware, sized and engineered scalable Big Data landscapes with central Hadoop processing platforms and associated technologies including ETL tools and NoSQL databases to support end-to-end business use cases.
  • Developed several technology demonstrators using the Confidential Edison Arduino shield, Azure EventHub, and Stream Analytics, and integrated them with PowerBI and Azure ML to demonstrate the capabilities of Azure Stream Analytics.
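
A minimal sketch of the Delta Lake load pattern referenced in this list, merging incremental records into a Delta table on S3 with PySpark; the paths, join key, and schema are hypothetical, and Delta Lake is assumed to be available on the cluster (Databricks or the delta-spark package).

    # Minimal sketch: upsert incremental records into a Delta table on the S3 data lake.
    # Paths, join key, and schema are hypothetical.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.appName("delta-upsert").getOrCreate()

    updates = spark.read.json("s3a://landing-bucket/customers/incremental/")
    target = DeltaTable.forPath(spark, "s3a://datalake-bucket/curated/customers/")

    # Merge: update existing customer rows, insert new ones.
    (target.alias("t")
     .merge(updates.alias("s"), "t.customer_id = s.customer_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())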

Technologies: Azure Data Factory (V2), Azure Databricks, Python 2.0, SSIS, Azure SQL, Azure Data Lake, Azure Blob Storage, Spark 2.0, Hive.

Big Data Engineer

Confidential, Waterbury CT

Responsibilities:

  • Ran Spark SQL operations on JSON data, converted it into a tabular structure with DataFrames, and stored and wrote the data to Hive and HDFS (a minimal sketch follows this list).
  • Tuned performance of Informatica mappings and sessions for improving the process and making it efficient after eliminating bottlenecks.
  • Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
  • Created risk-based machine learning models (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to be delinquent based on historical performance data and to rank-order them.
  • Evaluated model output using a confusion matrix (precision, recall) and worked with Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump).
  • Ingested and processed Comcast set-top box clickstream events in real time with Spark 2.x, Spark Streaming, Databricks, Apache Storm, Kafka, and the Apache Ignite in-memory data grid (distributed cache).
  • Used various DML and DDL commands for data retrieval and manipulation, such as Select, Insert, Update, Sub Queries, Inner Joins, Outer Joins, Union, Advanced SQL, and so on.
  • Extracted, transformed, and loaded data into the Netezza data warehouse from various sources such as Oracle and flat files using Informatica PowerCenter 9.6.1.
  • Participated in migrating mappings from IDQ to PowerCenter.
  • Data was ingested from a variety of sources, including Kafka, Flume, and TCP sockets.
  • Data was processed using advanced algorithms expressed via high-level functions such as map, reduce, join, and window.
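
A minimal PySpark sketch of the JSON-to-Hive flow described in this list; the paths, view, and table names are hypothetical.

    # Minimal sketch: load JSON into a DataFrame, flatten it with Spark SQL, and
    # persist the result to a Hive table and an HDFS Parquet copy.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")
             .enableHiveSupport()
             .getOrCreate())

    events = spark.read.json("hdfs:///data/raw/events/")
    events.createOrReplaceTempView("events")

    # Give the semi-structured input a tabular shape.
    flat = spark.sql("""
        SELECT device_id,
               event_type,
               CAST(event_ts AS TIMESTAMP) AS event_ts
        FROM events
        WHERE event_type IS NOT NULL
    """)

    flat.write.mode("append").saveAsTable("events_flat")                  # Hive table
    flat.write.mode("overwrite").parquet("hdfs:///data/refined/events/")  # HDFS copy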

Technologies: Scala 2.12.8, Python 3.7.2, PySpark, Spark 2.4, Spark MLlib, Spark SQL, TensorFlow 1.9, NumPy 1.15.2, Keras 2.2.4, Power BI, Spark Streaming, Hive, Kafka, ORC, Avro, Parquet, HBase, HDFS.

Big Data Developer

Confidential

Responsibilities:

  • Develop, improve, and scale processes, structures, workflows, and best practices for data management and analytics.
  • Experienced in working with big data ingestion, storage, processing, and analysis.
  • Collaborated with product owners to develop an experiment design and a measuring method for the efficacy of product changes.
  • Hands-on experience with tools such as Pig and Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling, and Zookeeper for cluster resource coordination.
  • Worked on the Apache Spark Scala code base, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts (a minimal sketch follows this list).
  • Transferred data between HDFS and relational database systems using Sqoop; handled upkeep and troubleshooting.
  • Used the Spring MVC framework to enable interactions with the JSP/view layer, and implemented various design patterns using J2EE and XML technologies.
  • Investigated the use of Spark and Spark-based algorithms to improve the efficiency and optimization of existing Hadoop algorithms.
  • Worked on analyzing Hadoop clusters with various big data analytic tools such as Pig, HBase database, and Sqoop.
  • Worked on NoSQL enterprise development and data loading into HBase with Impala and Sqoop.
  • Executed several MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Evaluated Hadoop and its ecosystem's suitability for the aforementioned project and implemented / validated with various proof of concept (POC) applications to ultimately adopt them to benefit from the Big Data Hadoop initiative.
  • Worked closely with malware research and data science teams to enhance malicious site detection and the machine learning/data mining based big data system.
  • Participated in the entire development life cycle, including requirements review, design, development, implementation, and operations support.
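
A minimal PySpark sketch of the RDD transformation-and-action pattern mentioned above (the original work was in Scala; this Python version shows the same idea, and the input path and record layout are hypothetical).

    # Minimal sketch: lazy RDD transformations followed by an action that triggers
    # execution. Input path and log format are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-transformations").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/logs/access.log")

    # Transformations (filter, map, reduceByKey) build a lineage; nothing runs yet.
    error_counts = (lines
                    .filter(lambda line: " ERROR " in line)
                    .map(lambda line: (line.split(" ")[0], 1))  # key by first field, e.g. host
                    .reduceByKey(lambda a, b: a + b))

    # The action (take) triggers the computation.
    for host, count in error_counts.take(10):
        print(host, count)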

Technologies: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery, JSP, JavaScript, AJAX, Oracle 10g/11g, MySQL, SQL Server, Teradata, Cassandra
