Senior Big Data Engineer Resume

Baltimore, MD

PROFESSIONAL SUMMARY:

  • 9+ years of extensive hands-on Big Data experience with the Hadoop ecosystem across internal and cloud-based platforms.
  • Expertise in cloud computing and Hadoop architecture and its components: HDFS, MapReduce, Spark, NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.
  • Experience working with Google Cloud Platform technologies such as BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
  • Design and Development of Ingestion Framework over Google Cloud and Hadoop cluster.
  • Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
  • Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
  • Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.
  • Experience in moving data between HDFS and relational database systems (RDBMS) using Apache Sqoop.
  • Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and developing and tuning HQL queries.
  • Replaced existing MR jobs and Hive scripts with Spark SQL & Spark data transformations for efficient data processing.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions with MySQL, MS SQL Server, DB2, and Oracle.
  • Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
  • Experience with software development tools such as JIRA, Play, and Git.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Good understanding of data modelling (dimensional and relational) concepts such as star schema modelling, snowflake schema modelling, and fact and dimension tables.
  • Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
  • Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, cloud migration, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
  • Strong understanding of Java Virtual Machines and multi-threading process.
  • Experience in writing complex SQL queries, creating reports and dashboards.
  • Proficient in using Unix based Command Line Interface.
  • Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
  • Experience setting up AWS Data Platform - AWS CloudFormation, Development EndPoints, AWS Glue, EMR and Jupyter/Sagemaker Notebooks, Redshift, S3, and EC2 instances
  • Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
  • Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into the target database.

PROFESSIONAL EXPERIENCE:

Confidential, Baltimore, MD

Senior Big Data Engineer

Responsibilities:

  • Migrated objects using the custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
  • Planned and designed the data warehouse in a star schema; designed and documented the table structures.
  • Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and moved data between MySQL and HDFS using Sqoop.
  • Designed and implemented end to end big data platform on Teradata Appliance
  • Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
  • Worked on Apache Spark, using the Spark SQL and Streaming components to support intraday and real-time data processing.
  • Shared sample data with customers for UAT/BAT by granting access.
  • Developed Python and Bash scripts to automate jobs and provide control flow.
  • Ingested data into one or more Azure cloud services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks as part of the cloud migration.
  • Moved data from Teradata to a Hadoop cluster using TDCH/FastExport and Apache NiFi.
  • Developed Python, PySpark, and Bash scripts to transform and load data across on-premises and cloud platforms.
  • Worked extensively on AWS Components such as Elastic Map Reduce (EMR)
  • Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
  • Created data pipelines for ingestion, aggregation, and loading of consumer response data from an AWS S3 bucket into Hive external tables in HDFS, serving as a feed for Tableau dashboards.
  • Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
  • Built ETL data pipelines to move data to S3 and then to Redshift.
  • Scheduled different Snowflake jobs using NiFi.
  • Experience with Snowflake Multi-Cluster Warehouses
  • Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Wrote UDFs in PySpark on Hadoop to perform transformations and loads.
  • Used NiFi to load data into HDFS as ORC files.
  • Wrote TDCH scripts and Apache NiFi flows to load data from mainframe DB2 to the Hadoop cluster.
  • Worked with ORC, Avro, JSON, and Parquet file formats, creating external tables and querying on top of these files using BigQuery.
  • Performed source analysis, tracing data back to its origins through Teradata, DB2, etc.
  • Identified the jobs that load the source tables and documented them.
  • Implemented a Continuous Integration and Continuous Delivery process using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
  • Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
  • Implemented a batch process for heavy-volume data loading using the Apache NiFi dataflow framework, following Agile development methodology.
  • Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
  • Created Snowpipe for continuous data load from staged data residing on cloud gateway servers.
  • Developed an automated process for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
  • Installed and configured applications such as Docker and Kubernetes for orchestration.
  • Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.

Confidential, Jersey City, NJ

Big Data Engineer/Analytics

Responsibilities:

  • Built ETL/ELT pipelines using data technologies such as PySpark, Hive, Presto, and Databricks.
  • Extracted data from source systems, transformed it according to business requirements, and loaded it into the target.
  • Built a validation module using PySpark to summarize the data between source and target.
  • Orchestrated the dimension module using Airflow and built dependencies according to the business document.
  • Built data pipelines using PySpark to transform data from Amazon S3 and stage it in Presto views.
  • Read files from S3 on an incremental basis, transformed the data, and loaded it into an S3 location.
  • Built external tables using HQL to match the schema of the data loaded in the target.
  • Analysed and validated the data in Presto using SQL queries.
  • Created and enhanced tools to analyse and process large data sets.
  • Developed PySpark scripts to process huge data sets from source to target.
  • Executed complex joins and transformations to process data according to the business use case.
  • Developed automated scripts to validate the data processed between source and target.
  • Provided technical designs based on business requirements to create fact and dimension tables.
  • Designed the workflow to orchestrate the dimension and fact modules using Airflow.
  • Developed Airflow DAGs in Python to schedule jobs on an incremental basis.
  • Established dependencies using Airflow methods and tracked data processing.
  • Communicated and coordinated with cross-functional teams to ensure business objectives were met.
  • Documented the workflow and brainstormed proactive solutions with the team.
  • Presented solutions for the orchestration process by effectively capturing data statistics.
  • Designed DAGs using Airflow to process the dimension tables and stage them in Presto views.
  • Designed and developed Airflow DAGs to retrieve data from Amazon S3 and built ETL pipelines using PySpark to process it and build the dimensions.
  • Developed Python scripts to schedule each dimension process as a task and set its dependencies.
  • Developed, tested, and deployed Python scripts to create Airflow DAGs.
  • Integrated with Databricks using Airflow operators to run notebooks on a scheduled basis.
  • Interfaced with the business intelligence team and helped them build data pipelines to support the existing BI platform and data products.
  • Built an aggregated layer over fact and dimension tables to help the BI team build dashboards on top of it.
  • Built robust ETL pipelines to process ad data, capturing revenue and impressions so the business could make informed decisions.
  • Developed the process workflow and wrote PySpark code adhering to the design.
  • Optimized data structures for efficient querying of Hive and Presto views.
  • Defined and designed data models to develop dimension and fact tables.
  • Worked with internal and external data sources to ensure integrations are accurate and scalable.

Confidential, Jersey City, NJ

Big Data Engineer

Responsibilities:

  • Used Hive queries in Spark SQL for analyzing and processing the data.
  • Hands-on experience in installing, configuring, supporting, and managing Hadoop clusters.
  • Wrote shell scripts that run multiple Hive jobs to incrementally update different Hive tables, which are used to generate reports in Tableau for business use.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
  • Involved in business analysis and technical design sessions with business and technical staff to develop requirements documents and ETL design specifications.
  • Wrote complex SQL scripts to avoid Informatica lookups and improve performance, as the data volume was heavy.
  • Responsible for the design, development, and data modelling of Spark SQL scripts based on functional specifications.
  • Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle.
  • Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Worked closely with the Quality Assurance, Operations, and Production Support groups to devise test plans, answer questions, and solve data or processing issues.
  • Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, GCP, Sqoop, Hive, and NoSQL databases.
  • Wrote Spark SQL scripts to optimize query performance.
  • Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
  • Implemented Hive UDFs and performed performance tuning for better results.
  • Developed and tuned SQL in HiveQL, Drill, and Spark SQL.
  • Experience in using Sqoop to import and export data between Oracle DB and HDFS/Hive.
  • Developed Spark code using Spark RDD and Spark SQL/Streaming for faster data processing.
  • Implemented partitioning, data modelling, dynamic partitions, and buckets in Hive for efficient data access.

Confidential, Franklin, TN

Big Data/ Hadoop Engineer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop and migrating legacy applications to Hadoop.
  • Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
  • Extracted data from different databases and copied it into HDFS using Sqoop, applying compression techniques to optimize data storage.
  • Implemented Kafka producers with custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
  • Delivered a real-time experience and analyzed massive amounts of data from multiple sources to calculate real-time ETA using Confluent Kafka event streaming.
  • Developed the technical strategy of using Apache Spark on Apache Mesos as a next-generation Big Data and "Fast Data" (streaming) platform.
  • Implemented Flume and the Spark framework for real-time data processing.
  • Developed simple to complex MapReduce jobs using Hive and Pig for analyzing the data.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored the output in efficient formats such as Parquet, and loaded it into Amazon S3 using the Spark Scala API.
  • Worked on cloud computing infrastructure (e.g. Amazon Web Services EC2) and considerations for scalable, distributed systems
  • Created the Spark Streaming code to take the source files as input.
  • Used Oozie workflows to automate all the jobs.
  • Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
  • Built analytics for structured and unstructured data and managed large data ingestion using Avro, Flume, Thrift, Kafka, and Sqoop.
  • Developed Pig UDFs to analyze customer behavior and Pig Latin scripts to process the data in Hadoop.
  • Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
  • Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop, Apache Spark, Apache Storm, etc.
  • Ingested streaming data into Hadoop using Spark, the Storm framework, and Scala.
  • Copied the data from HDFS to MongoDB using Pig/Hive/MapReduce scripts and visualized the streaming processed data in Tableau dashboards.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.

Confidential, Franklin, TN

Hadoop-Spark Developer

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
  • Experience loading data into Spark RDDs and performing advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response.
  • Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
  • Experience writing scripts in Python (or Go) and familiarity with the following tools: AWS Lambda, AWS S3, AWS EC2, AWS Redshift, and AWS Postgres.
  • In-depth understanding of Snowflake cloud technology.
  • In-depth understanding of Snowflake multi-cluster sizing and credit usage.
  • Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, and effective and efficient joins and transformations during the ingestion process itself.
  • Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Worked extensively on AWS components such as Elastic MapReduce (EMR) and Athena, along with Airflow and Snowflake.
  • Developed scripts in BigQuery and connected them to reporting tools.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting, and grouping.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
  • Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
  • Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
  • Created Hive tables for loading and analyzing data, implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization.
  • Implemented schema extraction for Parquet and Avro file Formats in Hive.
  • Developed Hive scripts in HiveQL to de-normalize and aggregate the data.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, in order to implement the former in the project.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Worked with the BI team to create various kinds of reports using Tableau based on the client's needs.
  • Experience querying Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
  • Experience in troubleshooting problems that arise during batch data processing jobs.
  • Extracted data from Teradata into HDFS/dashboards using Spark Streaming.
  • Migrated an existing on-premises application to AWS, using AWS services such as EC2 and S3 for processing and storing small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.