
Data Engineer/Big Data Resume


Chicago, IL

SUMMARY

  • 8+ years of experience in Big Data development, covering analysis, design, development, and implementation of large-scale applications with a focus on Big Data technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python, and Scala.
  • Experience in Analysis, Development, Testing, Implementation, Maintenance and Enhancements on various IT Data Warehousing Projects.
  • Strong experience working with HDFS, MapReduce, Spark, AWS, Hive, Impala, Pig, Sqoop, Flume, Kafka, NIFI, Oozie, HBase, MSSQL and Oracle.
  • Excellent knowledge and working experience in Agile & Waterfall methodologies.
  • Configured and deployed Azure Automation scripts for applications that use the Azure stack, such as Blobs, Azure Data Lake, Azure Data Factory, Azure SQL, and related utilities, with an emphasis on automating the conversion of Hive/SQL queries into Spark transformations (a minimal sketch follows this list).
  • Excellent experience with the Amazon EMR, Cloudera, and Hortonworks Hadoop distributions, and in maintaining and optimizing AWS infrastructure (EMR, EC2, S3, EBS, CloudFormation, Redshift, and DynamoDB).
  • Experienced in writing database objects like Stored Procedures, Functions, Triggers, PL/SQL packages and Cursors for Oracle, SQL Server, and MySQL & Sybase databases.
  • Strong understanding of and working knowledge in NoSQL databases such as HBase, MongoDB, and Cassandra.
  • Hands on experience with Hadoop, HDFS, Map Reduce and Hadoop Ecosystem (Pig, Hive, Oozie, Flume and HBase).
  • In-depth knowledge of Hadoop architecture and its components such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, ResourceManager, TaskTracker, and the MapReduce programming paradigm.
  • Expertise in installation, configuration, support, and management of Hadoop clusters using the Apache, Cloudera (CDH3, CDH4), and Hortonworks distributions, and on Amazon Web Services (AWS).
  • Expertise in developing Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Using Azure Data Factory (ADF V1/V2), migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS).
  • Experience using version control tools such as Bitbucket, Git, and SVN.
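
The Hive/SQL-to-Spark conversion work referenced above can be illustrated with a minimal PySpark sketch. The table and column names (web_logs, visitor_id, region, log_date) are hypothetical placeholders, not project code.

```python
# Minimal sketch: the same aggregation expressed once as Hive/SQL and once as
# Spark DataFrame transformations. All table/column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hive/SQL form of the query.
sql_df = spark.sql("""
    SELECT region, COUNT(DISTINCT visitor_id) AS unique_visitors
    FROM web_logs
    WHERE log_date = '2020-01-01'
    GROUP BY region
""")

# Equivalent Spark transformations on a DataFrame.
transformed_df = (
    spark.table("web_logs")
         .filter(F.col("log_date") == "2020-01-01")
         .groupBy("region")
         .agg(F.countDistinct("visitor_id").alias("unique_visitors"))
)
transformed_df.write.mode("overwrite").saveAsTable("web_logs_summary")
```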

PROFESSIONAL EXPERIENCE

Confidential - Chicago, IL

Data Engineer/Big Data

Responsibilities:

  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Developed MapReduce programs to parse the raw data, populate tables and store the refined data in partitioned tables.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Implemented Amazon AWS EC2, RDS, S3, Redshift, CloudTrail, Route 53, etc., and worked with various Hadoop tools such as Hive, Pig, Sqoop, Oozie, HBase, Flume, and PySpark.
  • Responsible for building and configuring distributed data solution using MapR distribution of Hadoop.
  • Automated the generation of HQL, creation of Hive tables, and loading of data into Hive tables using Apache NiFi and Oozie.
  • Wrote scripts to distribute queries for performance test jobs in the Amazon data lake.
  • Created and ran Sqoop jobs with incremental load to populate Hive external tables.
  • Set up end-to-end ETL orchestration of this framework in AWS, using Spark GraphFrames, Spark SQL, AWS hd, S3, EMR, Data Pipeline, SNS, EC2, Redshift, IAM, and VPC.
  • Wrote efficient serverless AWS Lambda functions in Python using the Boto3 API to dynamically activate AWS Data Pipelines.
  • Developed optimal strategies for distributing the web log data over the cluster, and for importing and exporting the stored web log data into HDFS and Hive using Sqoop.
  • Installed and configured multi-node Apache Hadoop clusters on AWS EC2.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience using Spark Streaming.
  • Developed Pig Latin scripts to replace the existing legacy process with Hadoop, with the resulting data fed to AWS S3.
  • Analyzed the SQL scripts and designed the solution to be implemented using PySpark.
  • Implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS.
  • Worked on Apache NiFi as an ETL tool for batch and real-time processing.
  • Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Worked and learned a great deal from Amazon Web Services (AWS) Cloud services like EC2, S3, and EMR.
  • Used Apache Spark and Scala language to find patients with similar symptoms in the past and medications used for them to achieve results.
  • Worked on various MapReduce framework architectures (MRv1 and YARN).
  • Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most visited pages on the website.
  • Implemented KBB's Big Data ETL processes in AWS, using Hive, Spark, AWS Lambda, S3, EMR, Data Pipeline, EC2, Redshift, Athena, SNS, IAM, and VPC.
  • Implemented budget cuts on AWS by writing Lambda functions to automatically spin up and shut down Redshift clusters (an illustrative sketch follows this list).
  • Used Git for Source Code Management.
  • Integrated Kafka with PySpark Streaming for real-time data processing.
  • Involved in building applications using Maven and integrating with CI servers such as Jenkins to build jobs.
  • Exported the analyzed data to the RDBMS using Sqoop to generate reports for the BI team.
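
The Redshift cost-control item above can be sketched as a small Boto3 Lambda. The cluster identifier, the event shape, and the use of the pause/resume API are assumptions for illustration only.

```python
# Minimal sketch of a scheduled cost-control Lambda. Assumes an EventBridge rule
# passes {"action": "pause"} or {"action": "resume"}; the cluster identifier is a
# hypothetical placeholder, and pause/resume stands in for "shut down"/"spin up".
import boto3

CLUSTER_ID = "analytics-cluster"  # hypothetical cluster identifier
redshift = boto3.client("redshift")

def lambda_handler(event, context):
    action = event.get("action", "pause")
    if action == "pause":
        # Stops compute billing while keeping the cluster's data intact.
        redshift.pause_cluster(ClusterIdentifier=CLUSTER_ID)
    elif action == "resume":
        redshift.resume_cluster(ClusterIdentifier=CLUSTER_ID)
    return {"cluster": CLUSTER_ID, "action": action}
```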

Environment: Hadoop, HDFS, Cloudera, Teradata r15, Sqoop, Linux, Yarn, MapReduce, AWS (EC2, RDS, S3, Redshift, Lambda, CloudTrail, Route 53), Python, Kafka, Pig, SQL, Hive, HBase, Oozie, RDBMS, Spark, Spark Streaming, Scala, Zookeeper, Java.

Confidential - NYC, NY

Azure Data Developer

Responsibilities:

  • Developed dataset processes for data mining and data modeling, and recommended ways to improve data quality, efficiency, and reliability.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Responsible for writing Hive Queries to analyze the data in Hive warehouse using Hive Query Language (HQL). Involved in developing Hive DDLs to create, drop and alter tables.
  • Extracted data and loaded it into HDFS using Sqoop import from various sources such as Oracle, Teradata, and SQL Server.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a minimal DAG sketch follows this list).
  • Created Hive staging tables and external tables and joined them as required.
  • Implemented dynamic partitioning, static partitioning, and bucketing.
  • Installed and configured Hadoop MapReduce, Hive, HDFS, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Worked on Microsoft Azure services such as HDInsight clusters, Blob storage, ADLS, Data Factory, and Logic Apps, and completed a POC on Azure Databricks.
  • Implemented Sqoop jobs for data ingestion from Oracle to Hive.
  • Worked with various file formats such as delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files.
  • Proficient in using different columnar file formats such as RC, ORC, and Parquet.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write data back to those sources.
  • Migrated MapReduce jobs to Spark jobs to achieve better performance.
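
The Airflow work mentioned above can be illustrated with a minimal Airflow 2-style DAG. The DAG name, schedule, and task commands are hypothetical placeholders, not project code.

```python
# Minimal Airflow DAG sketch: a daily ingest step followed by a transform step.
# All names, paths, and commands below are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_ingest_pipeline",      # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_to_adls",
        bash_command="python /opt/jobs/ingest_to_adls.py",       # placeholder script
    )
    transform = BashOperator(
        task_id="transform_in_databricks",
        bash_command="python /opt/jobs/run_databricks_job.py",   # placeholder script
    )
    ingest >> transform  # transform runs only after ingestion succeeds
```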

Environment: Hadoop 2.7, HDFS, Microsoft Azure services (HDInsight, Blob storage, ADLS, Logic Apps, etc.), Hive 2.2, Sqoop 1.4.6, Snowflake, Apache Spark 2.3, Airflow, Spark-SQL, ETL, Maven, Oozie, Java 8, Python 3, Unix shell scripting.

Confidential - Jacksonville, FL

Big Data/Hadoop Developer

Responsibilities:

  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Developed MapReduce programs to parse the raw data, populate tables and store the refined data in partitioned tables.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Implemented Amazon AWS EC2, RDS, S3, Redshift, CloudTrail, Route 53, etc., and worked with various Hadoop tools such as Hive, Pig, Sqoop, Oozie, HBase, Flume, and PySpark.
  • Responsible for building and configuring distributed data solution using MapR distribution of Hadoop.
  • Automated the generation of HQL, creation of Hive tables, and loading of data into Hive tables using Apache NiFi and Oozie.
  • Wrote scripts to distribute queries for performance test jobs in the Amazon data lake.
  • Created Hive Tables, loaded transactional data from Teradata using Sqoop.
  • Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data.
  • Created and ran Sqoop jobs with incremental load to populate Hive external tables (a DDL sketch follows this list).
  • Worked on cluster coordination services through Zookeeper.
  • Created feature, develop, and release branches in Git for different applications to support releases and CI builds.
  • Involved in building applications using Maven and integrating with CI servers such as Jenkins to build jobs.
  • Exported the analyzed data to the RDBMS using Sqoop to generate reports for the BI team.
  • Worked collaboratively with all levels of business stakeholders to implement and test Big Data-based analytical solutions from disparate sources.
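
The Hive external tables populated by the incremental Sqoop loads above can be sketched with a DDL like the one below, submitted here through PySpark's SQL interface. The database, table, columns, and HDFS path are hypothetical placeholders.

```python
# Minimal sketch of a partitioned Hive external table targeted by incremental loads.
# Database, table, columns, and location are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-ddl-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.transactions (
        txn_id      BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12, 2),
        txn_ts      TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
    LOCATION '/data/staging/transactions'
""")

# After each incremental load lands a new partition directory, register it with Hive.
spark.sql("MSCK REPAIR TABLE staging.transactions")
```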

Environment: Hadoop, HDFS, Cloudera, Teradata r15, Sqoop, Linux, Yarn, MapReduce, AWS (EC2, RDS, S3, Redshift, Lambda, CloudTrail, Route 53), Python, Kafka, Pig, SQL, Hive, HBase, Oozie, RDBMS, Spark, Spark Streaming, Scala, Zookeeper, Java.

Confidential - Irving, TX

Data Engineer

Responsibilities:

  • Developed PySpark applications using Python and implemented an Apache PySpark data processing project to handle data from various RDBMS and streaming sources.
  • Handled importing of data from various data sources, performed data control checks using PySpark and loaded data into HDFS.
  • Involved in converting Hive/SQL queries into PySpark transformations using Spark RDDs and Python.
  • Used PySpark SQL to load JSON data, create a schema RDD, and load it into Hive tables, and handled structured data using Spark SQL.
  • Developed PySpark programs using Python and performed transformations and actions on RDDs.
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
  • Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
  • Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Experienced in handling large datasets using partitions, PySpark in-memory capabilities, broadcasts in PySpark, and effective and efficient joins and transformations during the ingestion process itself.
  • Processed schema-oriented and non-schema-oriented data using Python and Spark.
  • Involved in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
  • Worked on a streaming pipeline that uses PySpark to read data from Kafka, transform it, and write it to HDFS (a minimal sketch follows this list).
  • Worked on Snowflake database queries and wrote stored procedures for normalization.
  • Worked with Snowflake stored procedures, pairing procedures with the corresponding DDL statements and using the JavaScript API to wrap and execute numerous SQL queries.
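
The Kafka-to-HDFS streaming pipeline above can be illustrated with a minimal PySpark Structured Streaming sketch. The broker address, topic name, and output/checkpoint paths are hypothetical, and the original job may have used the older DStream API instead.

```python
# Minimal Structured Streaming sketch: read from Kafka, cast the payload, write to HDFS.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

raw = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
         .option("subscribe", "events")                      # placeholder topic
         .load()
)

# Kafka delivers key/value as binary; cast the payload to a string before writing.
events = raw.select(F.col("value").cast("string").alias("payload"), F.col("timestamp"))

query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/events")                    # placeholder output path
          .option("checkpointLocation", "hdfs:///checkpoints/events")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()
```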

Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.8, Linux CentOS, MapReduce, Python 2, Eclipse 4.6.

Confidential

ETL Developer

Responsibilities:

  • Involved in migrating historical as-built data from the Link Tracker Oracle database to Teradata using Ab Initio.
  • Implemented the historical purge process for Clickstream, Order Broker, and Link Tracker to Teradata using Ab Initio.
  • Implemented the centralized graphs concept.
  • Extensively used Ab Initio components such as Reformat, Rollup, Lookup, Join, and Redefine Format, and developed many subgraphs.
  • Involved in loading the transformed data files into Teradata staging tables through Teradata load utilities (FastLoad and MultiLoad scripts), and created Teradata macros for loading data from staging to target tables.
  • Responsible as an E-R (Extract-Replicate) consultant, using the GoldenGate Extract-Replicate tool to deliver real-time data to the warehouse without directly hitting the source database.

Environment: Ab Initio, Oracle Database, Clickstream, Reformat, Rollup, Lookup, UNIX, and Extract-Replicate.
