
Senior Data Engineer Resume


New York, NY

SUMMARY

  • 9+ years of professional IT experience in Big Data using the Hadoop framework, covering analysis, design, development, documentation, deployment, and integration with SQL and Big Data technologies as well as Java/J2EE technologies on AWS and Azure.
  • Experience in Hadoop Ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka, Pig.
  • Experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters on the MapR, Hortonworks, and Cloudera distributions.
  • Good understanding of Hadoop architecture and hands-on experience with components such as ResourceManager, NodeManager, NameNode, and DataNode, along with MapReduce concepts and the HDFS framework.
  • Expertise in data migration, profiling, ingestion, cleansing, transformation, import, and export using ETL tools such as Informatica PowerCenter.
  • Working knowledge of the Spark RDD, DataFrame, Dataset, and Data Source APIs, Spark SQL, and Spark Streaming.
  • Experience in importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
  • Worked on HQL for data extraction and join operations as required, with good experience in optimizing Hive queries.
  • Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (a brief sketch follows at the end of this summary).
  • Developed Spark code using Scala, Python and Spark-SQL/Streaming for faster processing of data.
  • Implemented Spark Streaming jobs in Scala using RDDs (Resilient Distributed Datasets), and used PySpark and spark-shell as needed.
  • Strong experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka, and Flume.
  • Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
  • Good experience in handling messaging services using Apache Kafka.
  • Knowledge of data mining and data warehousing using ETL tools, and proficient in building reports and dashboards in Tableau (BI tool).
  • Excellent knowledge of job workflow scheduling and coordination tools/services like Oozie and ZooKeeper.
  • Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
  • Good understanding of Amazon Web Services (AWS) such as EC2 for compute, S3 for storage, EMR, Step Functions, Lambda, Redshift, and DynamoDB.
  • Good understanding and knowledge of Microsoft Azure services like HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps.
  • Worked with various file formats such as delimited text, JSON, and XML files. Skilled in using columnar file formats like RCFile, ORC, and Parquet, with a good understanding of compression techniques used in Hadoop processing such as gzip, Snappy, and LZO.
  • Hands-on experience building enterprise applications using Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS, JavaScript, SQL, and PL/SQL.
  • Experienced in the Software Development Life Cycle (SDLC) using Scrum and Agile methodologies.
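
A brief PySpark sketch of the partitioned and bucketed Hive table pattern mentioned above; it assumes a SparkSession with Hive support, and all table, column, and path names are hypothetical placeholders rather than details from the projects:

    from pyspark.sql import SparkSession

    # Hypothetical sketch: partitioned external staging table plus managed
    # partitioned/bucketed tables, created and loaded through Spark SQL.
    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # External table over files already landed in HDFS (e.g. by a Sqoop import)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS orders_staging (
            order_id BIGINT, customer_id BIGINT, amount DOUBLE)
        PARTITIONED BY (order_date STRING)
        STORED AS PARQUET
        LOCATION '/data/landing/orders'
    """)
    spark.sql("MSCK REPAIR TABLE orders_staging")  # register existing partitions

    # Managed, date-partitioned table that downstream queries read
    spark.sql("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id BIGINT, customer_id BIGINT, amount DOUBLE)
        PARTITIONED BY (order_date STRING)
        STORED AS ORC
    """)

    # Bucketing example: cluster on the join key to speed up joins and sampling
    spark.sql("""
        CREATE TABLE IF NOT EXISTS customers_bucketed (
            customer_id BIGINT, name STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert from the external staging table
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE orders PARTITION (order_date)
        SELECT order_id, customer_id, amount, order_date FROM orders_staging
    """)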

TECHNICAL SKILLS

Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, ZooKeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark, Airflow, Snowflake

IDE Tools: Eclipse, IntelliJ, PyCharm

Cloud platform: AWS, Azure

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS GLUE.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, DataBricks, Kafka, Cloudera

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Software Methodologies: Agile, Scrum, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, New York, NY

Senior Data Engineer

Responsibilities:

  • Develop and add features to existing data analytics applications built with Spark and Hadoop on a Scala, Java, and Python development platform on top of AWS services.
  • Programmed in Python and Scala with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, ZooKeeper, etc.).
  • Involved in developing Spark applications in Scala and Python for data transformation, cleansing, and validation using the Spark API.
  • Worked on all the Spark APIs, like RDD, Dataframe, Data source and Dataset, to transform the data.
  • Worked on both batch and streaming data sources; used Spark Streaming and Kafka for streaming data processing.
  • Developed a Spark Streaming script that consumes topics from the distributed messaging source Kafka and periodically pushes batches of data to Spark for real-time processing (a brief sketch follows this list).
  • Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
  • Worked on Apache Nifi to automate the data movement between RDBMS and HDFS.
  • Created shell scripts to handle various jobs such as MapReduce, Hive, Pig, and Spark, based on requirements.
  • Used Hive techniques like Bucketing, Partitioning to create the tables.
  • Experience with Spark SQL for processing large amounts of structured data.
  • Experienced in working with source formats including CSV, JSON, Avro, and Parquet.
  • Worked on AWS to aggregate cleaned files in Amazon S3, and on Amazon EC2 clusters to deploy files into buckets.
  • Designed and architected solutions to load multipart files that cannot rely on a scheduled run and must be event-driven, leveraging AWS SNS.
  • Involved in Data Modeling using Star Schema, Snowflake Schema.
  • Used AWS EMR to create Hadoop and Spark clusters, which are used for submitting and executing Scala and Python applications in production.
  • Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
  • Migrated the data from AWS S3 to HDFS using Kafka.
  • Integrated Kubernetes with networking, storage, and security to provide a comprehensive infrastructure, and orchestrated Kubernetes containers across multiple hosts.
  • Implemented Jenkins and built pipelines to drive all microservice builds out to the Docker registry and deploy them to Kubernetes.
  • Experienced in loading and transforming large sets of structured and semi-structured data using the ingestion tool Talend.
  • Worked with NoSQL databases like HBase and Cassandra to retrieve and load data for real-time processing using REST APIs.
  • Worked on creating data models for Cassandra from the existing Oracle data model.
  • Responsible for transforming and loading large sets of structured, semi-structured, and unstructured data.
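
A brief PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker addresses, topic name, message schema, and output paths are hypothetical placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    # Hypothetical schema for the JSON events on the topic
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Consume the Kafka topic as a streaming DataFrame
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
           .option("subscribe", "weblogs")
           .option("startingOffsets", "latest")
           .load())

    # Parse the JSON payload and apply a simple cleansing/validation step
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), event_schema).alias("e"))
              .select("e.*")
              .filter(col("event_id").isNotNull()))

    # Write micro-batches to HDFS as Parquet
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())
    query.awaitTermination()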

Environment: Hadoop 2.7.7, HDFS 2.7.7, Apache Hive 2.3, Apache Kafka 0.8.2.x, Apache Spark 2.3, Spark SQL, Spark Streaming, ZooKeeper, Pig, Oozie, Java 8, Python 3, S3, EMR, EC2, Redshift, Cassandra, NiFi, Talend, HBase, Cloudera (CDH 5.x).

Confidential, MN

Sr Data Engineer

Responsibilities:

  • Developed dataset processes for data mining and data modeling, and recommended ways to improve data quality, efficiency, and reliability.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Responsible for writing Hive queries to analyze data in the Hive warehouse using Hive Query Language (HQL); involved in developing Hive DDLs to create, drop, and alter tables.
  • Extracted data from various sources such as Oracle, Teradata, and SQL Server and loaded it into HDFS using Sqoop import.
  • Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines (a brief sketch follows this list).
  • Created Hive staging tables and external tables, and joined them as required.
  • Implemented dynamic partitioning, static partitioning, and bucketing.
  • Installed and configured Hadoop Map Reduce, Hive, HDFS, Pig, Sqoop, Flume and Oozie on Hadoop cluster.
  • Worked on Microsoft Azure services like HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps, and did a POC on Azure Databricks.
  • Implemented Sqoop jobs for data ingestion from the Oracle to Hive.
  • Worked with various file formats such as delimited text, clickstream logs, Apache logs, Avro, JSON, and XML files. Skilled in using columnar file formats like RCFile, ORC, and Parquet.
  • Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after the configuration of the NameNode and DataNodes respectively.
  • Developed job workflows in Oozie for automating the tasks of loading the data into HDFS.
  • Implemented compact and efficient file storage of big data by using various file formats like Avro, Parquet, JSON and using compression methods like GZip, Snappy on top of the files.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Worked on Spark using Python as well as Scala and Spark SQL for faster testing and processing of data.
  • Worked on various data modelling concepts like star schema, snowflake schema in the project.
  • Extensively used Stash, Bitbucket, and GitHub for source control.
  • Migrated MapReduce jobs to Spark jobs to achieve better performance.
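
A brief Apache Airflow sketch of the pipeline authoring and scheduling described above; the DAG id, schedule, and task commands are hypothetical placeholders:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

    with DAG(
        dag_id="daily_ingest_sketch",  # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Land data from the source RDBMS into HDFS (e.g. a Sqoop import)
        ingest = BashOperator(
            task_id="sqoop_import",
            bash_command=(
                "sqoop import --connect jdbc:oracle:thin:@//dbhost:1521/orcl "
                "--table ORDERS --target-dir /data/landing/orders -m 4"
            ),
        )
        # Transform and load into the Hive warehouse with a Spark job
        transform = BashOperator(
            task_id="spark_transform",
            bash_command="spark-submit --master yarn transform_orders.py",
        )
        ingest >> transform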

Environment: Hadoop 2.7, HDFS, Microsoft Azure services (HDInsight, Blob Storage, ADLS, Logic Apps, etc.), Hive 2.2, Sqoop 1.4.6, Snowflake, Apache Spark 2.3, Airflow, Spark SQL, ETL, Maven, Oozie, Java 8, Python 3, Unix shell scripting.

Confidential

Data Engineer

Responsibilities:

  • Developed PySpark applications in Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Handled importing of data from various data sources, performed data control checks using PySpark and loaded data into HDFS.
  • Involved in converting Hive/SQL queries into PySpark transformations using Spark RDDs and Python.
  • Used PySpark SQL to load JSON data, create schema RDDs, load them into Hive tables, and handle structured data using Spark SQL.
  • Developed PySpark programs in Python and performed transformations and actions on RDDs.
  • Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python (a brief sketch follows this list).
  • Used PySpark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
  • Migrated the on-premises environment to GCP (Google Cloud Platform).
  • Involved in porting the existing on-premises Hive code to GCP (Google Cloud Platform) BigQuery.
  • Built reports for monitoring data loads into GCP and driving reliability at the site level.
  • Implemented PySpark applications in Python, utilizing DataFrames and the PySpark SQL API for faster data processing.
  • Developed Python scripts and UDFs using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in handling large datasets using partitions, PySpark's in-memory capabilities, broadcasts, and effective and efficient joins and transformations during the ingestion process itself.
  • Designed star schemas in BigQuery.
  • Processed schema-oriented and non-schema-oriented data using Python and Spark.
  • Involved in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
  • Worked on streaming pipeline that uses PySpark to read data from Kafka, transform it and write it to HDFS.
  • Built a Scala- and Spark-based configurable framework to connect to common data sources like MySQL, Oracle, Postgres, SQL Server, Salesforce, and BigQuery and load the data into BigQuery.
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data Modeling tools.
  • Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver across all environments.
  • Worked on the Snowflake database, writing queries and stored procedures for normalization.
  • Worked with Snowflake stored procedures, pairing them with the corresponding DDL statements and using the JavaScript API to wrap and execute numerous SQL queries.
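
A brief Apache Beam (Python) sketch of the Pub/Sub-to-BigQuery Cloud Dataflow pattern described above; the project, subscription, bucket, and table names are hypothetical placeholders:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Streaming pipeline options for the Dataflow runner (values are placeholders)
    options = PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadPubSub" >> beam.io.ReadFromPubSub(
               subscription="projects/my-gcp-project/subscriptions/events-sub")
         | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteBigQuery" >> beam.io.WriteToBigQuery(
               table="my-gcp-project:analytics.events",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))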

Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, shell scripting, Spark 1.8, Linux (CentOS), MapReduce, Python 2, Eclipse 4.6.

Confidential

Big Data/Hadoop Developer

Responsibilities:

  • When working with the open-source Apache distribution, Hadoop admins have to manually set up all the configuration files: core-site, hdfs-site, yarn-site, and mapred-site. However, when working with a popular Hadoop distribution like Hortonworks, Cloudera, or MapR, the configuration files are set up at startup and the Hadoop admin need not configure them manually.
  • Used Sqoop to import data from Relational Databases like MySQL, Oracle.
  • Involved in importing structured and unstructured data into HDFS.
  • Responsible for fetching real-time data using Kafka and processing using Spark and Scala.
  • Worked on Kafka to import real-time weblogs and ingested the data to Spark Streaming.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Worked on building and implementing a real-time streaming ETL pipeline using the Kafka Streams API.
  • Worked on Hive to implement Web Interfacing and stored the data in Hive tables.
  • Migrated Map Reduce programs into Spark transformations using Spark and Scala.
  • Experienced with Spark Context, Spark-SQL, Spark YARN.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
  • Implemented data quality checks using Spark Streaming and flagged records as passable or bad.
  • Implemented Hive Partitioning and Bucketing on the collected data in HDFS.
  • Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
  • Extensively used ZooKeeper as a backup server and job scheduler for Spark jobs.
  • Developed Spark scripts using Scala shell commands as per the business requirement.
  • Worked on Cloudera distribution and deployed on AWS EC2 Instances.
  • Experienced in loading the real-time data to a NoSQL database like Cassandra.
  • Experience in retrieving data from the Cassandra cluster by running CQL (Cassandra Query Language) queries (a brief sketch follows this list).
  • Worked on connecting the Cassandra database to the Amazon EMR file system to store the data in S3.
  • Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Deployed the project on Amazon EMR with S3 connectivity to set up backup storage.
  • Well versed in using Elastic Load Balancer with Auto Scaling on EC2 servers.
  • Coordinated with the SCRUM team in delivering agreed user stories on time for every sprint.
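
A brief sketch of querying the Cassandra cluster with CQL, here using the DataStax Python driver; the contact points, keyspace, table, and column names are hypothetical placeholders:

    from cassandra.cluster import Cluster

    # Connect to the cluster and keyspace (node names are placeholders)
    cluster = Cluster(["cassandra-node1", "cassandra-node2"], port=9042)
    session = cluster.connect("weblogs")

    # Prepared statement keeps the partition-key lookup efficient
    stmt = session.prepare(
        "SELECT event_id, event_ts, payload FROM events WHERE user_id = ?")
    for row in session.execute(stmt, ["user-123"]):
        print(row.event_id, row.event_ts)

    cluster.shutdown()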

Environment: Hadoop, MapReduce, Hive, Spark, Oracle, GitHub, Tableau, UNIX, Cloudera, Kafka, Sqoop, Scala, NiFi, HBase, Amazon EC2, S3.

Confidential

ETL Developer

Responsibilities:

  • Involved in migrating historical as-built data from the Link Tracker Oracle database to Teradata (TD) using Ab Initio.
  • Implemented the historical purge process for Clickstream, Order Broker, and Link Tracker to TD using Ab Initio.
  • Implemented the centralized graphs concept.
  • Extensively used Ab Initio components like Reformat, Rollup, Lookup, Join, and Redefine Format, and developed many subgraphs.
  • Involved in loading transformed data files into TD staging tables through TD load utilities (FastLoad and MultiLoad scripts), and created TD macros for loading data from staging to target tables.
  • Responsible as an E-R consultant for the ER (Extract-Replicate) GoldenGate tool, which is used to extract real-time data to the warehouse without directly hitting the source database.

Environment: Ab Initio, Oracle Database, Clickstream, Reformat, Rollup, Lookup, UNIX, and Extract-Replicate.
