Senior Big Data Engineer Resume
Weehawken, New Jersey
SUMMARY
- Over 8 years of extensive hands-on Big Data experience with the Hadoop ecosystem across internal and cloud-based platforms.
- Expertise in cloud computing and in Hadoop architecture and its components: Hadoop Distributed File System (HDFS), MapReduce, Spark, NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.
- Experience working with Google Cloud Platform technologies such as BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
- Design and Development of Ingestion Framework over Google Cloud and Hadoop cluster.
- Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Experienced in working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
- Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
- Experience and strong knowledge in the areas of data Confidential, Modeling, Data Integration (ETL), Data Quality, Data Governance, and Data Security in EDW and Data Mart environments for data analytics and reporting applications.
- Expertise in working with Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
- Replaced existing MR jobs and Hive scripts with Spark SQL & Spark data transformations for efficient data processing.
- Experience developing Kafka producers and consumers that stream millions of events per second.
- Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions. Used MySQL, MS SQL Server, DB2, and Oracle.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
- Experience with software development tools such as Jira, Play, and Git.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of the Data modelling (Dimensional & Relational) concepts like Star-Schema Modelling, a Schema Modelling, Fact and Dimension tables.
- Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
- Strong ETL experience in Informatica PowerCenter: Mapping Designer, Repository Manager, Workflow Manager/Monitor.
- Design and development of ETL mappings in Ab Initio GDE to process data ingestion into HDFS (Big Data Hadoop / Hive).
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, cloud migration, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
- Strong understanding of Java Virtual Machines and multi-threading process.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Proficient in using Unix based Command Line Interface.
- Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
- Experience setting up AWS Data Platform - AWS CloudFormation, Development Endpoints, AWS Glue, EMR and Jupyter/Sagemaker Notebooks, Redshift, S3, and EC2 instances
- Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
- Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into the target database.
TECHNICAL SKILLS
Operating Systems: Windows, Unix
Databases: MS SQL Server, Oracle 12c, MySQL, MS Access, DB2, Netezza, MongoDB (NoSQL), Azure SQL Data Warehouse
Big Data Tools: MapReduce, Spark, Airflow, NiFi, HBase, Hive, Pig, Sqoop, Kafka, Oozie, Hadoop
ETL Tools: SQL Server Integration Services(SSIS), IBM DataStage, Azure Data Factory
Database Tools: SQL Profiler, Management Studio, Index Analyzer, SQL Agents, SQL Alerts, Visual SourceSafe, Microsoft SQL Server CDC, IBM CDC, AWS EC2, AWS RDS, MapReduce
Languages: R, T-SQL, Java, HTML, PL/SQL, VBA, Python, Hadoop, Spark
Reporting Tools: SQL Server Reporting Services (SSRS), Tableau, MS Excel
DB Modeling Tools: Erwin, Embarcadero
PROFESSIONAL EXPERIENCE
Senior Big Data Engineer
Confidential, Weehawken, New Jersey
Responsibilities:
- Involved in migrating objects using the custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
- Planning and design of data warehouse in STAR schema. Designing structure of tables and documenting it.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS in both directions using Sqoop.
- Designed and implemented end to end big data platform on Teradata Appliance
- Performing ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
- Worked on Apache Spark, using its Spark SQL and Streaming components to support intraday and real-time data processing.
- Sharing sample data with customers for UAT/BAT by granting access.
- Developed Python and Bash scripts for automation and control flow.
- Data ingestion to one or more Azure cloud services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and cloud migration, processing the data in Azure Databricks.
- Moving data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
- Develop Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
- Worked extensively on AWS Components such as Elastic Map Reduce (EMR)
- Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
- Create Source to Target data (DEM) mapping document with transformation logic for ETL build and data validation.
- Perform data analysis and data mining on various source systems using SQL, capture metadata details, create data cleansing rules, and unwind the logic from stored procedures, ETL mappings, SAS datasets, etc.
- Created data pipelines for ingestion and aggregation events, loading consumer response data from AWS S3 buckets into Hive external tables in HDFS to serve as a feed for Tableau dashboards.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
- Building data pipeline ETLs for data movement to S3, then to Redshift.
- Scheduled different Snowflake jobs using NiFi.
- Experience with Snowflake Multi-Cluster Warehouses
- Analyze and reverse engineer the logic details of the DB Stored procedures, ETL mapping logic implemented in the source systems to capture the source objects for the application build.
- Manage performance tuning methodology for optimizing SQL, ETL mappings, and Hive managed / ORC tables.
- Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Write UDFs in PySpark on Hadoop to perform transformations and loads.
- Use NiFi to load data into HDFS as ORC files.
- Writing TDCH scripts and Apache NiFi flows to load data from mainframe DB2 to the Hadoop cluster.
- Working with ORC, Avro, JSON, and Parquet file formats; creating external tables and querying on top of these files using BigQuery.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Source analysis: tracing the data back to its sources and finding its roots through Teradata, DB2, etc.
- Identifying the jobs that load the source tables and documenting them.
- Implement Continuous Integration and Continuous Delivery processes using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
- Design and development of ETL mappings in Informatica / Ab Initio GDE to process data ingestion in Hadoop HDFS / Hive and Teradata / Oracle database.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework in an Agile development methodology.
- Deployed the Big Data Hadoop application using Talend on cloud AWS (Amazon Web Services) and also on Microsoft Azure
- Created Snowpipe for continuous data load from staged data residing on cloud gateway servers.
- Developing automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
- Installing and configuring Docker and Kubernetes for orchestration.
- Developed automation system using PowerShell scripts and JSON templates to remediate the Azure services.
Environment: Hadoop, PySpark, HDFS, NiFi, Pig, Hive, S3, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Azure, Databricks, Tableau, HBase, Cassandra, SQL Server, AWS, Python, Bash, Shell Scripting, Teradata, DB2, Jenkins, Maven, JSON, XML, Unix
Big Data Engineer/Analytics
Confidential, Atlanta, Georgia
Responsibilities:
- Build ETL/ELT pipelines in data technologies such as PySpark, Hive, Presto, and Databricks
- Extract data from sources, transform it according to business requirements, and load it into the target
- Build a validation module to summarize the data between source and target using PySpark
- Orchestrate the dimension module using Airflow and build dependencies according to the business document
- Build data pipelines using PySpark to transform data from Amazon S3 and stage it in Presto views
- Read files from S3 on an incremental basis, transform the data, and load it into an S3 location
- Analyze and reverse engineer the logic details of the DB Stored procedures, ETL mapping logic implemented in the source systems to capture the source objects for the application build.
- Build external tables using HQL to match the data schema loaded in the target
- Analyze and validate the data in Presto using SQL queries
- Create and enhance tools to analyze and process large data sets
- Develop PySpark scripts to process huge data sets from source to target
- Created many SSIS packages using Import/Export Wizard. Designed many Packages using SSIS Designer by using Control Flow and Data Flow Tasks with ETL Tool.
- Execute complex joins and transformations to process data adhering to the business use case
- Develop automated scripts to validate the data processed between source and target
- Provide technical design based on business requirements to create fact and dimension tables
- Design the workflow to orchestrate the dimension and fact module using Airflow
- Develop Airflow DAGs using Python to schedule jobs on an incremental basis
- Establish dependencies using Airflow methods and track data processing
- Communicate and coordinate with cross-functional teams to ensure business objectives are met
- Document the workflow and brainstorm proactive solutions with the team
- Present solution to the orchestration process by effectively capturing data stats
- Design DAGs using Airflow to process the dimension tables and stage them in Presto views
- Design and develop Airflow DAGs to retrieve data from Amazon S3, and build ETL pipelines using PySpark to process that data into dimensions
- Analyzing and Developing Complex SQL queries, Stored Procedures, ETL Mapping for application development
- Develop Python scripts to schedule each dimension process as a task and set dependencies between them
- Develop, test, and deploy Python scripts to create Airflow DAGs
- Integrate with Databricks using Airflow operators to run notebooks on a scheduled basis
- Interface with the Business Intelligence team and help them build data pipelines to support the existing BI platform and data products
- Build an aggregated layer over fact and dimension tables to help the BI team build dashboards on top of it
- Build robust ETL pipelines to process ad data, capturing revenue and impressions so the business can make informed decisions
- Develop the process workflow and write PySpark code adhering to the design
- Optimize data structures for efficient querying of Hive and Presto views
- Define and design data models to develop dimension and fact tables
- Collaborate with internal and external data sources to ensure integrations are accurate and scalable.
Environment: Hadoop, Hive, AWS, BigQuery, HBase, Scala, Flume, Apache Tez, Cloud Shell, Docker, Jira, MySQL, Postgres, SQL Server, Python, Spark, Spark SQL
Big Data Engineer
Confidential, Minneapolis, Minnesota
Responsibilities:
- Used Hive Queries in Spark-SQL for analysis and processing the data.
- Hands on experience in installation, configuration, supporting and managing Hadoop Clusters
- Wrote shell scripts that run multiple Hive jobs to incrementally update different Hive tables, which are used to generate Tableau reports for business use
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing
- Involved in business analysis and technical design sessions with business and technical staff to develop requirements document and ETL design specifications.
- Wrote complex SQL scripts to avoid Informatica Look-ups to improve the performance as the volume of the data was heavy.
- Responsible for design, development, and data modelling of Spark SQL scripts based on functional specifications
- Analyzing and Developing Stored Procedures, SQL Scripts, ETL Mapping for application development.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Creating ETL mappings and enhancing existing mappings to facilitate the data load in system.
- Developing job schedulers and shell scripts for ETL job automation in a UNIX environment
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
- Worked closely with Quality Assurance, Operations and Production support group to devise the test plans, answer questions and solve any data or processing issues
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, GCP, Sqoop, Hive, and NoSQL databases
- Wrote Spark SQL scripts to optimize query performance
- Well-informed on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming
- Implemented Hive UDFs and performed performance tuning for better results
- Developed and tuned SQL on HiveQL, Drill, and Spark SQL
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
- Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
- Implemented Partitioning, Data Modelling, Dynamic Partitions and Buckets in HIVE for efficient data access.
Environment: Cloudera CDH, Hadoop, Pig, Hive, Informatica, HBase, MapReduce, HDFS, Sqoop, Impala, SQL, Tableau, Python, SAS, Flume, JavaScript, Oozie, Linux, NoSQL, MongoDB, Talend, Git.
Data Engineer/Hadoop Engineer
Confidential
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and migrating legacy applications to Hadoop.
- Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
- Extracted data from different databases and copied it into HDFS using Sqoop, applying compression techniques to optimize data storage.
- Implemented Kafka producers with custom partitioning, configured brokers, and implemented high-level consumers for the data platform.
- Delivered real-time experience and analyzed massive amounts of data from multiple sources to calculate real-time ETA using Confluent Kafka event streaming.
- Developed the technical strategy of using Apache Spark on Apache Mesos as a next generation, Big Data and "Fast Data" (Streaming) platform.
- Implemented Flume, Spark framework for real time data processing.
- Developed Spark jobs in Java to perform ETL from SQL Server to Hadoop.
- Developed simple to complex MapReduce jobs using Hive and Pig for analyzing the data.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored the output in efficient formats such as Parquet, and loaded it into Amazon S3 using the Spark Scala API.
- Worked on cloud computing infrastructure (e.g. Amazon Web Services EC2) and considerations for scalable, distributed systems
- Creating functional and technical ETL mapping specification document for data mappings.
- Develop/support ETL transformation mapping and enhancing existing mappings to facilitate data load into DWH.
- Created the Spark Streaming code to take the source files as input.
- Used Oozie workflow to automate all the jobs.
- Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
- Built analytics for structured and unstructured data and managed large data ingestion using Avro, Flume, Thrift, Kafka, and Sqoop.
- Developed Pig UDFs to analyze customer behavior and Pig Latin scripts for processing the data in Hadoop.
- Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
- Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop, Apache Spark and Apache Storm etc.
- Ingested streaming data into Hadoop using Spark, Storm Framework and Scala.
- Copied the data from HDFS to MongoDB using Pig/Hive/MapReduce scripts and visualized the streaming processed data in Tableau dashboards.
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager.
Environment: MapReduce, Spark, Hive, Scala, Pig, Sqoop, HBase, Oozie, Impala, Kafka, JSON, XML, PL/SQL, SQL, HDFS, Unix, Python, SAS, PySpark, Redshift, Shell Scripting, MongoDB
Hadoop-Spark Developer
Confidential
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
- Experience loading data into Spark RDDs and performing advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW
- Experience writing scripts using Python (or Golang) and familiarity with the following tools: AWS Lambda, AWS S3, AWS EC2, AWS Redshift, AWS Postgres
- In-depth understanding of Snowflake cloud technology.
- Developed ETL Specifications and Mappings using Informatica PowerCenter tool for Data loading.
- Developed Stored Procedures using PL/SQL for ETL processes.
- In-depth understanding of Snowflake multi-cluster sizing and credit usage.
- Setting up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and other operations during the ingestion process itself.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and for writing data back into the OLTP system through Sqoop.
- Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
- Optimizing existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Developing scripts in BigQuery and connecting them to reporting tools.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Developed Python code for different tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
- Created Hive tables for loading and analyzing data, Implemented Partitions, Buckets and developed Hive queries to process the data and generate the data cubes for visualizing.
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Developed Hive scripts in Hive QL to de-normalize and aggregate the data.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Worked on a POC to compare processing time of Impala with Apache Hive for batch applications to implement the former in project.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
- Worked with BI team to create various kinds of reports using Tableau based on the client's needs.
- Experience in querying Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
- Experience in troubleshooting problems that arise during batch data processing jobs.
- Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.
- Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storing small data sets; maintained the Hadoop cluster on AWS EMR.
Environment: Hadoop, MapReduce, AWS, Snowflake, AWS EC2, S3, GitHub, Spark SQL, Hive, Jira, EMR, Teradata, SQL Server, Apache Spark, Sqoop.
