
Data Engineer Resume


IL

SUMMARY

  • Around 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve large-scale data processing requirements.
  • Around 5 years of experience in big data using the Hadoop framework and related technologies such as HDFS, HBase, MapReduce, Spark, Hive, Pig, Flume, Oozie, Sqoop, and ZooKeeper.
  • 4+ years of working experience with Cloudera Distribution CDH 6.x versions (6.0.0 and 6.0.1).
  • Installed packages and set up a CDH cluster coordinating ZooKeeper, Spark, Kafka, and HDFS.
  • Experience in data analysis using Hive, Pig Latin, HBase, and custom MapReduce programs in Java.
  • Experience in writing custom UDFs in Java and Scala to extend Hive and Pig functionality (see the sketch following this summary).
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting aggregated data back to the RDBMS servers using Sqoop for other ETL operations.
  • Experience with Cloudera and Hortonworks distributions.
  • Over 4 years of experience with Spark, Scala, HBase, and Kafka.
  • Developed analytical components using Kafka, Scala, Spark, HBase, and Spark Streaming.
  • Experience in working with Flume to load log data from multiple sources directly into HDFS.
  • Good knowledge of Hortonworks administration and security features such as Apache Ranger, Knox Gateway, and High Availability.
  • Implemented a Hadoop backup strategy to back up Hive, HDFS, HBase, Oozie, etc.
  • Experience in importing and exporting data using Sqoop between HDFS and relational database systems (RDBMS).
  • Involved in creating an HDInsight cluster in the Microsoft Azure portal, and created Event Hubs and Azure SQL databases.
  • Worked on a clustered Hadoop environment for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
  • Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue and Spark Streaming.
  • Loaded aggregated data into HBase for reporting purposes.
  • Read data from HBase into Spark to perform joins across different tables.
  • Created HBase tables for validation, audit, and offset management.
  • Created logical views instead of tables to improve the performance of Hive queries.
  • Involved in developing Hive DDLs to create, alter, and drop Hive tables.
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a Cassandra cluster.
  • Good knowledge of Hive optimization techniques such as vectorization and column-based optimization.
  • Generated the BI visualization layer on top of the ETL layer and produced reports with Spotfire.
  • Developed Oozie workflows to automate the ETL process by scheduling multiple Sqoop, Hive, and Spark jobs.
  • Wrote Oozie workflows to invoke jobs at predefined intervals.
  • Worked with different file formats such as JSON, Avro, and Parquet, and compression techniques such as Snappy, within the NiFi ecosystem.
  • Expert in scheduling Oozie coordinators based on input data events, so that the Oozie workflow starts when input data becomes available.
  • Worked on a POC with Kafka and NiFi to pull real-time events into the Hadoop cluster.
  • Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Experienced in managing Hadoop clusters using Hortonworks Ambari.
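
A minimal sketch of the kind of custom Hive UDF mentioned above, written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API; the class name and column semantics are illustrative assumptions, not taken from a specific project:

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // Illustrative UDF: trims and upper-cases a free-text code column before aggregation.
    class NormalizeCode extends UDF {
      def evaluate(input: Text): Text = {
        if (input == null) null
        else new Text(input.toString.trim.toUpperCase)
      }
    }

    // After packaging into a jar, the function would typically be registered in Hive with:
    //   ADD JAR <path-to-jar>;
    //   CREATE TEMPORARY FUNCTION normalize_code AS 'NormalizeCode';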

PROFESSIONAL EXPERIENCE

DATA ENGINEER

Confidential, IL

Responsibilities:

  • New development and enhancements to the ETL processes using Apache NiFi.
  • Wrote complex SQL to generate reports and extracts.
  • Involved in managing and reviewing Hadoop log files.
  • Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
  • Developed scripts and batch jobs to schedule various Hadoop programs.
  • Wrote Hive queries for data analysis to meet business requirements (see the sketch following this list).
  • Created Hive tables and worked with them using HiveQL.
  • Queried structured tables in sources such as MySQL.
  • Imported and exported data into HDFS and Hive using Sqoop; experienced in defining job flows.
  • Troubleshot production issues.
  • Helped prepare test case designs, testing, and detailed enhancement documentation.
  • Worked on NiFi data pipelines to process large data sets and configured lookups for data validation and integrity.
  • Worked with different file formats such as JSON, Avro, and Parquet, and compression techniques such as Snappy, within the NiFi ecosystem.
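
A minimal sketch of the Hive table creation and analysis queries described above, expressed here through Spark's Hive support in Scala for illustration; the database, table, column names, and HDFS location are hypothetical:

    import org.apache.spark.sql.SparkSession

    object HiveAnalysisSketch {
      def main(args: Array[String]): Unit = {
        // Hive-enabled session; assumes the cluster's Hive metastore is reachable.
        val spark = SparkSession.builder()
          .appName("hive-analysis-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Hypothetical external table over data landed in HDFS by the regular Sqoop import.
        spark.sql(
          """CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
            |  order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_date STRING)
            |STORED AS PARQUET
            |LOCATION '/data/landing/orders'""".stripMargin)

        // Example analysis query of the kind written to meet reporting requirements.
        spark.sql(
          """SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
            |FROM sales_db.orders
            |GROUP BY order_date
            |ORDER BY order_date""".stripMargin)
          .show(20, truncate = false)

        spark.stop()
      }
    }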

Environment: Unix shell, Hive, HiveQL, SQL, Apache NiFi, HDFS, Linux, Sqoop, Python, MySQL, Ambari/YARN, Cloudera

BIG DATA ENGINEER / PROD SUPPORT

Confidential, Princeton, NJ

Responsibilities:

  • Developed the code base for the back end of the Prep Orch microservices.
  • Processed datasets in CSV, Parquet, JSON, text file, and other formats.
  • Built microservices and loaded log metadata into a NoSQL database (MongoDB).
  • Worked with external applications that perform Spark transformations and actions.
  • Experience in building actions and test cases in Scala.
  • Knowledge of resolving production tickets using Spark Job Server (SJS) logs for running jobs.
  • Experience with the Kubernetes deployment orchestration system and with analyzing pod/container logs for Spark-side deployment issues.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework.
  • Good exposure to Azure cloud, ADF, ADLS, Azure DevOps (VSTS), and portal services.
  • Experience developing and deploying shell scripts for automation, notification, and monitoring.
  • Worked on performance tuning of Spark applications.
  • Handled importing data from various sources, performed transformations using Spark, and loaded data into ADLS.
  • Worked with Apache Spark SQL and DataFrame functions to perform transformations and aggregations on complex semi-structured data (see the sketch following this list).
  • Hands-on experience creating RDDs and DataFrame transformations and actions while implementing Spark applications.
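
A minimal sketch of the kind of Spark SQL/DataFrame transformation and load into ADLS described above; the input path, column names, and ADLS Gen2 URIs are illustrative assumptions, and storage credentials are assumed to be configured on the cluster:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SemiStructuredAggSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("semi-structured-agg-sketch")
          .getOrCreate()

        // Hypothetical semi-structured JSON events landed by an upstream service.
        val events = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/events/")

        // Flatten a nested field and aggregate by day; column names are illustrative.
        val daily = events
          .withColumn("event_date", to_date(col("eventTime")))
          .withColumn("category", col("payload.category"))
          .groupBy("event_date", "category")
          .agg(count(lit(1)).as("events"), sum(col("payload.amount")).as("total_amount"))

        // Write the curated aggregate back to ADLS in Parquet, partitioned by date.
        daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("abfss://curated@examplelake.dfs.core.windows.net/daily_category_totals/")

        spark.stop()
      }
    }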

Environment: Cloudera CDH cluster, Azure, Spark, Hive, Spark SQL, Scala, SBT, Kubernetes, IntelliJ IDEA, Unix shell scripting, MongoDB, Ambari/YARN, Spark Job Server

DATA ENGINEER

Confidential, SF, CA

Responsibilities:

  • Designed, built, and launched efficient, reliable data pipelines to move data across several platforms including the data warehouse, online caches, and real-time systems.
  • Experience working with Azure Function Apps and App Services.
  • Experience with Databricks and scripting in languages such as Scala, shell, and PowerShell.
  • Worked with Azure Databricks notebooks to validate inbound/outbound data from external sources such as Amperity.
  • Experience in creating reliable data pipelines to move data across several platforms including Snowflake, Azure Delta Lake, Blob storage, and external dashboards (see the sketch following this list).
  • Experience in creating Spark/Scala applications for ETL operations.
  • Performed source-to-target mapping and streaming of different data transfers via APIs or Azure Data Factory (ADF) pipelines, and troubleshot or implemented different logic based on requirements.
  • Wrote complex SQL queries to drive analysis and insights.
  • Built pipelines with Spark JAR and notebook activities.
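
A minimal sketch of one leg of such a pipeline, moving a curated Delta Lake table into Snowflake via the Spark-Snowflake connector; the paths, account, credentials, and table names are placeholders, and both the Delta Lake and Snowflake connector libraries are assumed to be available on the cluster:

    import org.apache.spark.sql.SparkSession

    object DeltaToSnowflakeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("delta-to-snowflake-sketch")
          .getOrCreate()

        // Curated Delta table in the lake; the path is a placeholder.
        val customers = spark.read
          .format("delta")
          .load("abfss://curated@examplelake.dfs.core.windows.net/customers/")

        // Snowflake connection options; all values are placeholders and would normally
        // come from a secret scope rather than being hard-coded.
        val sfOptions = Map(
          "sfURL"       -> "example_account.snowflakecomputing.com",
          "sfUser"      -> "etl_user",
          "sfPassword"  -> "********",
          "sfDatabase"  -> "ANALYTICS",
          "sfSchema"    -> "PUBLIC",
          "sfWarehouse" -> "LOAD_WH"
        )

        // Write through the Spark-Snowflake connector.
        customers.write
          .format("net.snowflake.spark.snowflake")
          .options(sfOptions)
          .option("dbtable", "CUSTOMERS")
          .mode("overwrite")
          .save()

        spark.stop()
      }
    }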

Environment: Azure Data Lake, Azure Blob, Azure Data Factory, Apache Spark (Spark SQL), SQL Server, Teradata, Scala, Hadoop, Snowflake, Azure Databricks, Hive, GitHub, IntelliJ IDEA, Jira.

SPARK DEVELOPER

Confidential, Dallas TX

Responsibilities:

  • Developed a framework to encrypt sensitive data (SSN, account number, etc.) in all kinds of datasets and to move datasets from one S3 bucket to another.
  • Processed datasets such as text, Parquet, Avro, fixed-width, zip, JSON, and XML.
  • Developed a framework to check dataset data quality against schemas defined in the cloud; worked on Amazon Web Services (AWS) to integrate EMR with Spark 2, S3 storage, and Snowflake.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in AWS S3 using Scala (see the sketch following this list).
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework.
  • Used Spark Streaming APIs to perform required transformations and actions on the learner data model, which receives data from Kafka in near real time.
  • Worked on StreamSets, reading and writing continuously from the Kafka cluster.
  • Developed a data CI/CD pipeline using the StreamSets Data Collector UI and Control Hub.
  • Performed required transformations by configuring evaluator processors; worked on the installed StreamSets Transformer, configuring Spark streaming to perform the required transformations.
  • Worked on migrating MapReduce programs into Spark transformations, and used File Broker to schedule workflows that run Spark jobs to transform data on a recurring schedule.
  • Experience developing and deploying shell scripts for automation, notification, and monitoring.
  • Extensively used Apache Kafka, Apache Spark, HDFS, and Apache Impala to build near real-time data pipelines that ingest, transform, store, and analyze clickstream data to provide a better personalized user experience.
  • Worked on performance tuning of Spark applications.
  • Handled importing data from various sources, performed transformations using Spark, and loaded data into Cassandra.
  • Queried and analyzed data from Cassandra for quick searching, sorting, and grouping.
  • Used Spark Streaming to collect data from Kafka in near real time, perform necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a Cassandra cluster.
  • Worked with Apache Spark SQL and DataFrame functions to perform transformations and aggregations on complex semi-structured data.
  • Hands-on experience creating RDDs, transformations, and actions while implementing Spark applications.
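
A minimal sketch of the Kafka-to-S3 flow described above, written with Spark Structured Streaming in Scala; the broker list, topic, bucket, and checkpoint paths are placeholders, and the EMR cluster is assumed to have S3 credentials configured:

    import org.apache.spark.sql.SparkSession

    object KafkaToS3Sketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-s3-sketch")
          .getOrCreate()

        // Read the raw event stream from Kafka.
        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "clickstream-events")
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value", "timestamp")

        // Persist the stream to S3 as Parquet for downstream batch processing.
        val query = stream.writeStream
          .format("parquet")
          .option("path", "s3a://example-bucket/raw/clickstream/")
          .option("checkpointLocation", "s3a://example-bucket/checkpoints/clickstream/")
          .outputMode("append")
          .start()

        query.awaitTermination()
      }
    }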

Environment: Cloudera CDH cluster, AWS, Spark, Hive, StreamSets, Spark SQL, Kafka, EMR, Snowflake, Nebula, Python, Scala, Maven, Jupyter Notebook, Visual Studio, Unix shell scripting, Cassandra.

HADOOP DEVELOPER

Confidential, Plano Texas

Responsibilities:

  • Developed a data pipeline using Event Hubs, Spark, Hive, Pig, and Azure SQL Database to ingest customer behavioral data and financial histories into an HDInsight cluster for analysis.
  • Involved in creating an HDInsight cluster in the Microsoft Azure portal, and created Event Hubs and Azure SQL databases.
  • Worked on a clustered Hadoop environment for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
  • Used Pig to perform transformations, event joins, bot-traffic filtering, and some pre-aggregations before storing the data in the Azure database.
  • Expertise with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, YARN, Oozie, and ZooKeeper, as well as Hadoop architecture and its components.
  • Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
  • Extensively worked on Informatica performance tuning involving source-level, target-level, and mapping-level bottlenecks.
  • Extensively worked on developing ETL programs to support data extraction, transformation, and loading using Informatica PowerCenter.
  • Created UNIX shell scripts to run Informatica workflows and control the ETL flow.
  • Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
  • Experienced with Spark Streaming to ingest data into the Spark engine.
  • Imported data from different sources such as Event Hubs and Cosmos into Spark RDDs.
  • Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
  • Worked extensively on the Spark SQL and Spark Streaming modules and used Scala to write code for all Spark use cases.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns.
  • Involved in converting JSON data into DataFrames and storing it in Hive tables (see the sketch following this list).
  • Experienced with AzCopy, Livy, Windows PowerShell, and cURL to submit Spark jobs on the HDInsight cluster.
  • Analyzed the SQL scripts and designed the solution to implement them using Scala.
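
A minimal sketch of the JSON-to-DataFrame-to-Hive step described above; the storage path, table name, and columns are illustrative assumptions, and the HDInsight cluster is assumed to expose a Hive metastore:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object JsonToHiveSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("json-to-hive-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // JSON landed from an upstream source such as Event Hubs; the path is a placeholder.
        val raw = spark.read.json("wasbs://raw@exampleaccount.blob.core.windows.net/transactions/")

        // Light typing/cleanup before persisting; column names are illustrative.
        val cleaned = raw
          .withColumn("txn_date", to_date(col("txnTimestamp")))
          .withColumn("amount", col("amount").cast("double"))

        // Store as a managed Hive table so downstream Hive/Spark SQL queries can use it.
        cleaned.write
          .mode("overwrite")
          .format("parquet")
          .saveAsTable("finance_db.transactions")

        spark.stop()
      }
    }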

Environment: Cloudera, Azure, Spark, Hive, Spark SQL, Kafka, Hortonworks, Pig, Oozie, HBase, Python, Scala, Maven, Jupyter Notebook, Visual Studio, Unix shell scripting, Java.

HADOOP DEVELOPER

Confidential

Responsibilities:

  • Imported and exported data into HDFS and Hive using Sqoop.
  • Used Bash shell scripting, Sqoop, Avro, Hive, Pig, Java, and MapReduce daily to develop ETL, batch processing, and data storage functionality.
  • Used Pig to perform data transformations, event joins, and some pre-aggregations before storing the data in HDFS.
  • Used the Hadoop MySQL connector to store MapReduce results in the RDBMS.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Worked on loading all tables from the reference source database schema through Sqoop.
  • Designed, coded, and configured server-side J2EE components such as JSP, AWS, and Java.
  • Collected data from different databases (e.g., Oracle, MySQL) into Hadoop.
  • Used Oozie and ZooKeeper for workflow scheduling and monitoring.
  • Worked on designing and developing ETL workflows in Java for processing data in HDFS/HBase, orchestrated with Oozie.
  • Experienced in managing and reviewing Hadoop log files.
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data from relational databases into HDFS using Sqoop imports.
  • Extracted files from MySQL through Sqoop, placed them in HDFS, and processed them.
  • Supported MapReduce programs running on the cluster.
  • Provided cluster coordination services through ZooKeeper.
  • Involved in loading data from the UNIX file system into HDFS.
  • Created several Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
