Senior Big Data Engineer Resume
Oldsmar, FL
SUMMARY
- 8+ years of extensive hands-on Big Data experience with the Hadoop ecosystem across on-premise and cloud-based platforms.
- Expertise in cloud computing and Hadoop architecture and its various components: HDFS, MapReduce, Spark, NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.
- Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
- Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Experience moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
- Expertise working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
- Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark DataFrame transformations for more efficient data processing (see the illustrative sketch at the end of this summary).
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions with MySQL, MS SQL Server, DB2, and Oracle.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
- Experience with software development tools such as JIRA, Play, and Git.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of data modelling (dimensional and relational) concepts such as star and snowflake schema modelling and fact and dimension tables.
- Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
- Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.
- Experience working with Google Cloud Platform technologies such as BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
- Designed and developed an ingestion framework over Google Cloud and the Hadoop cluster.
- Strong understanding of Java Virtual Machines and multi-threading process.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Proficient in using Unix based Command Line Interface.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, cloud migration, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
- Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
- Experienced in using Agile methodologies including Extreme Programming, Scrum, and Test-Driven Development (TDD).
- Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into target databases.
- Experienced in writing, testing, and running code and algorithms against data to verify they work as expected.
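Illustrative sketch only (not project code): a minimal PySpark example of the Hive-to-Spark SQL migration pattern summarized above, assuming a configured Hive metastore; the table and column names (sales_db.orders, store_id, order_date, amount) are hypothetical.

```python
# Minimal sketch: replacing a Hive/MapReduce aggregation with Spark SQL and
# DataFrame transformations. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()   # assumes a configured Hive metastore
         .getOrCreate())

# The same aggregation a legacy HQL script would run, pushed through Spark SQL...
daily_sales_sql = spark.sql("""
    SELECT store_id, order_date, SUM(amount) AS total_amount
    FROM   sales_db.orders
    GROUP  BY store_id, order_date
""")
daily_sales_sql.show(5)

# ...and the equivalent DataFrame form, which is easier to tune (caching,
# broadcast joins, repartitioning) and to unit test.
daily_sales_df = (spark.table("sales_db.orders")
                  .groupBy("store_id", "order_date")
                  .agg(F.sum("amount").alias("total_amount")))
daily_sales_df.show(5)
```

The DataFrame form is generally the easier one to tune and unit test than the equivalent standalone HQL script.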
TECHNICAL SKILLS
Programming languages: Python, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash
Big Data: Hadoop, Sqoop, Apache Spark (Spark SQL, PySpark), NiFi, Kafka, Snowflake, Cloudera, Hortonworks
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Operating Systems: UNIX, LINUX, MacOS, Solaris, Mainframes
Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala
Cloud Technologies: AWS, Azure, GCP
IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, vi editor
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data Warehouse Tools: Informatica 9.6/9.1 and Tableau
Others: AutoSys, Crontab, ArcGIS, Clarity, Informatica, Business Objects, IBM MQ, Splunk
PROFESSIONAL EXPERIENCE
Confidential, Oldsmar, FL
Senior Big Data Engineer
Responsibilities:
- Migrated objects using a custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
- Planned and designed the data warehouse in a star schema; designed table structures and documented them.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS (and vice versa) using Sqoop.
- Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
- Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.
- Developed Python and Bash scripts to automate and provide control flow.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks as part of cloud migration.
- Moved data from Teradata to the Hadoop cluster using TDCH/FastExport and Apache NiFi.
- Developed Python, PySpark, and Bash scripts to transform and load data across on-premise and cloud platforms.
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
- Shared sample data by granting customer access for UAT/BAT.
- Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
- Built ETL data pipelines for data movement to S3 and then to Redshift.
- Scheduled different Snowflake jobs using NiFi.
- Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to the Kafka broker.
- Designed and implemented an end-to-end big data platform on the Teradata appliance.
- Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (see the illustrative sketch at the end of this role).
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Wrote UDFs in PySpark to perform transformations and loads.
- Used NiFi to load data into HDFS as ORC files.
- Experience with Snowflake Multi-Cluster Warehouses
- Wrote TDCH scripts and Apache NiFi flows to load data from mainframe DB2 to the Hadoop cluster.
- Worked with ORC, Avro, JSON, and Parquet file formats, creating external tables and querying on top of these files using BigQuery.
- Performed source analysis, tracing data back to its sources and finding its roots through Teradata, DB2, etc.
- Identified the jobs that load the source tables and documented them.
- Implemented Continuous Integration and Continuous Delivery processes using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Implemented a batch process for heavy-volume data loading using an Apache NiFi dataflow framework within an Agile development methodology.
- Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
- Created Snowpipe for continuous data loads from staged data residing on cloud gateway servers.
- Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
- Installed and configured applications such as Docker and Kubernetes for orchestration.
- Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
Environment: Apache Spark, Hadoop, PySpark, HDFS, Cloudera, AWS, Azure, Kafka, Snowflake, Docker, Jenkins, Ant, Maven, Kubernetes, Nifi, JSON, Teradata, DB2, SQL Server, MongoDB, Shell Scripting.
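Illustrative sketch only: a minimal AWS Lambda handler, in Python, for the nested-JSON conversion/sorting work mentioned above; the bucket layout, field names, and the processed/ output prefix are hypothetical assumptions.

```python
# Hedged sketch of a Lambda handler that flattens and sorts nested JSON records
# landing in S3. Bucket/key layout and field names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")


def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into dotted keys."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items


def lambda_handler(event, context):
    # Each invocation receives S3 object notifications.
    for notification in event.get("Records", []):
        bucket = notification["s3"]["bucket"]["name"]
        key = notification["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        records = json.loads(body)  # assumes a JSON array of (nested) records

        flattened = [flatten(r) for r in records]
        # Sort by a hypothetical event timestamp for downstream comparison.
        flattened.sort(key=lambda r: r.get("event.timestamp", ""))

        s3.put_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            Body=json.dumps(flattened).encode("utf-8"),
        )
    return {"status": "ok"}
```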
Confidential, Malvern, PA
Big Data Engineer
Responsibilities:
- Built ETL/ELT pipelines using data technologies such as PySpark, Hive, Presto, and Databricks.
- Extracted data from source systems, transformed it according to business requirements, and loaded it into the target.
- Orchestrated the dimension module using Airflow and built dependencies according to the business document.
- Built data pipelines using PySpark to transform data from Amazon S3 and stage it in Presto views.
- Read files from S3 on an incremental basis, transformed the data, and loaded it to an S3 location.
- Built external tables using HQL to match the data schema loaded in the target.
- Analysed and validated the data in Presto using SQL queries.
- Built a validation module using PySpark to summarize the data between source and target.
- Created and enhanced tools to analyse and process large data sets.
- Developed PySpark scripts to process huge data sets from source to target.
- Executed complex joins and transformations to process data adhering to the business use case.
- Provided technical designs based on business requirements to create fact and dimension tables.
- Designed the workflow to orchestrate the dimension and fact modules using Airflow.
- Developed Airflow DAGs in Python to schedule jobs on an incremental basis (see the illustrative sketch at the end of this role).
- Communicated and coordinated with cross-functional teams to ensure business objectives were met.
- Documented the workflow and brainstormed proactive solutions with the team.
- Presented solutions for the orchestration process by effectively capturing data statistics.
- Designed Airflow DAGs to process the dimension tables and staged them in Presto views.
- Developed automated scripts to validate the data processed between source and target.
- Designed and developed Airflow DAGs to retrieve data from Amazon S3 and built ETL pipelines using PySpark to process the data and build the dimensions.
- Implemented Spark Kafka streaming to pick up data from Kafka and feed it to the Spark pipeline.
- Developed Python scripts to schedule each dimension process as a task and to set its dependencies.
- Developed, tested, and deployed Python scripts to create Airflow DAGs.
- Integrated with Databricks using Airflow operators to run notebooks on a scheduled basis.
- Interfaced with the Business Intelligence team and helped them build data pipelines to support the existing BI platform and data products.
- Established dependencies using Airflow methods and tracked data processing.
- Built an aggregated layer over fact and dimension tables to help BI build dashboards on top of it.
- Leveraged data and built robust ETL pipelines to process ad data, capturing revenue and impressions so the business could make informed decisions.
- Developed the process workflow and the PySpark code adhering to the design.
- Optimized data structures for efficient querying of Hive and Presto views.
- Defined and designed data models to develop dimension and fact tables.
- Collaborated with internal and external data sources to ensure integrations were accurate and scalable.
Environment: PySpark, Hive, Presto, Databricks, HQL, AWS, S3, MapReduce, Hadoop, Flume, Kafka, Scala, Python, Tableau, Oracle, MySQL, MongoDB, Cassandra, Git, Shell Scripting.
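Illustrative sketch only: a minimal Airflow 2.x DAG for the incremental PySpark scheduling described above; the DAG id, schedule, spark-submit commands, and script paths are hypothetical.

```python
# Hedged sketch of an Airflow 2.x DAG that runs incremental PySpark loads with
# explicit task dependencies. DAG id, schedule, and script paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="dimension_incremental_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    # Pull the latest incremental slice from S3 and stage it.
    extract = BashOperator(
        task_id="extract_from_s3",
        bash_command="spark-submit /opt/jobs/extract_s3.py --ds {{ ds }}",
    )

    # Transform the staged data and build the dimension table.
    build_dimension = BashOperator(
        task_id="build_dimension",
        bash_command="spark-submit /opt/jobs/build_dimension.py --ds {{ ds }}",
    )

    # Validate row counts between source and target.
    validate = BashOperator(
        task_id="validate_counts",
        bash_command="spark-submit /opt/jobs/validate.py --ds {{ ds }}",
    )

    extract >> build_dimension >> validate
```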
Confidential, Fountain Valley, CA
Data Engineer
Responsibilities:
- Used Hive queries in Spark SQL for analysing and processing the data.
- Hands-on experience in installation, configuration, supporting, and managing Hadoop clusters.
- Wrote shell scripts that run multiple Hive jobs to incrementally automate different Hive tables, which are used to generate reports in Tableau for business use.
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data
- Involved in business analysis and technical design sessions with business and technical staff to develop requirements document and ETL design specifications.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
- Responsible for the design, development, and data modelling of Spark SQL scripts based on functional specifications.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Developed AutoSys scripts to schedule the Kafka streaming and batch jobs.
- Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
- Worked closely with Quality Assurance, Operations and Production support group to devise the test plans, answer questions and solve any data or processing issues
- Used PySpark to create a batch job that merges multiple small files (Kafka stream files) into single larger files in Parquet format (see the illustrative sketch at the end of this role).
- Developed scripts in BigQuery and connected it to reporting tools.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, Sqoop, Hive, and NoSQL databases.
- Worked in writing Spark SQL scripts for optimizing the query performance
- Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Wrote complex SQL scripts to avoid Informatica Look-ups to improve the performance as the volume of the data was heavy.
- Implemented Hive UDFs and performed tuning for better results.
- Tuned and developed SQL in HiveQL, Drill, and Spark SQL.
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
- Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
- Implemented partitioning, data modelling, dynamic partitions, and buckets in Hive for efficient data access.
Environment: Spark SQL, Hive, Hadoop YARN, MapReduce, HiveQL, GCP, Sqoop, SQL Server, NoSQL databases, Kafka, Python, Scala, Shell, Bash Scripting, Git.
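Illustrative sketch only: a minimal PySpark batch job for the small-file compaction described above (merging many small Kafka-landed files into larger Parquet files); the HDFS paths and target file count are hypothetical.

```python
# Hedged sketch of the small-file compaction idea: read many small JSON files
# produced by a Kafka-landing process and rewrite them as fewer, larger Parquet
# files. Paths and the target file count are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-kafka-landing-files").getOrCreate()

input_path = "hdfs:///data/landing/events/2023-10-01/*.json"   # many small files
output_path = "hdfs:///data/curated/events/dt=2023-10-01"

events = spark.read.json(input_path)

# coalesce() reduces the number of output files without a full shuffle;
# repartition() could be used instead if the data is badly skewed.
(events
 .coalesce(8)
 .write
 .mode("overwrite")
 .parquet(output_path))

spark.stop()
```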
Confidential
Hadoop-Spark Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and migrate legacy applications to Hadoop.
- Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
- Extracted data from different databases and copied it into HDFS using Sqoop, with expertise in using compression techniques to optimize data storage.
- Implemented Kafka producers that create custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
- Delivered a real-time experience and analyzed massive amounts of data from multiple sources to calculate real-time ETA using Confluent Kafka event streaming (see the illustrative sketch at the end of this role).
- Implemented Flume and the Spark framework for real-time data processing.
- Optimized Map/Reduce jobs to use HDFS efficiently by using various compression mechanisms.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored in efficient formats such as Parquet and loaded into Amazon S3 using the Spark Scala API.
- Worked on cloud computing infrastructure (e.g. Amazon Web Services EC2) and considerations for scalable, distributed systems
- Created the Spark Streaming code to take the source files as input.
- Used Oozie workflow to automate all the jobs.
- Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
- Developed the technical strategy of using Apache Spark on Apache Mesos as a next generation, Big Data and "Fast Data" (Streaming) platform.
- Developed simple to complex Map Reduce jobs using Hive and Pig for analysing the data.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
- Built analytics for structured and unstructured data and managing large data ingestion by using Avro, Flume, Thrift, Kafka and Sqoop.
- Developed Pig UDFs to understand customer behavior and Pig Latin scripts for processing the data in Hadoop.
- Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop, Apache Spark, and Apache Storm.
- Ingested streaming data into Hadoop using Spark, Storm Framework and Scala.
- Copied data from HDFS to MongoDB using Pig/Hive/MapReduce scripts and visualized the streaming processed data in Tableau dashboards.
- Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager.
Environment: Spark, Kafka, Hadoop, AWS, Sqoop, HDFS, Oracle, SQL Server, MongoDB, Python, Scala, Shell Scripting, Tableau, Map Reduce, Oozie, Pig, Hive.
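Illustrative sketch only: a PySpark (Structured Streaming) rendering of the Kafka-to-Spark streaming pattern used in this role (the project code itself was written in Scala); broker addresses, topic, and checkpoint paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Hedged PySpark rendering of the Kafka -> Spark streaming pattern.
# Brokers, topic, and paths are hypothetical assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "trip-events")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to string for parsing.
events = raw.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

# Persist the stream as Parquet micro-batches for downstream analytics.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/trip_events")
         .option("checkpointLocation", "hdfs:///checkpoints/trip_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```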
Confidential
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time and Persists into Cassandra.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW
- Experience writing scripts using Python (or Go) and familiarity with the following tools: AWS Lambda, AWS S3, AWS EC2, AWS Redshift, and AWS Postgres.
- In-Depth understanding of Snowflake Multi-Cluster Size and Credit Usage
- Experience loading data into Spark RDDs and performing advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities in Scala to generate the output response.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other techniques during the ingestion process itself.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
- Created Hive tables for loading and analyzing data, implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization (see the illustrative sketch at the end of this role).
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Developed HiveQL scripts to de-normalize and aggregate the data.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, to implement the former in the project.
- In - depth understanding of Snowflake cloud technology.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
- Worked with BI team to create various kinds of reports using Tableau based on the client's needs.
- Experience querying Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
- Experience in troubleshooting problems that arise during batch data processing jobs.
- Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in Maintaining the Hadoop cluster on AWS EMR.
Environment: Hadoop, Spark, AWS, Hive, MapReduce, Snowflake, Sqoop, HDFS, Cloudera, Oracle, MySQL, Python, Shell Scripting.
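Illustrative sketch only: a minimal PySpark example of the partitioned/bucketed Hive table pattern described above, using Spark's native bucketing via bucketBy rather than Hive-managed bucketing; database, table, and column names (staging.events_raw, analytics.events, load_date, customer_id, event_type, amount) are hypothetical.

```python
# Hedged sketch of writing a partitioned, bucketed table for fast pruned scans
# and joins, then running a cube-style aggregation for dashboards.
# Names are hypothetical; bucketing here is Spark-native (bucketBy).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioned-tables")
         .enableHiveSupport()   # assumes a configured Hive metastore
         .getOrCreate())

staged = spark.table("staging.events_raw")

# Partition by load date and bucket by customer id to speed up joins and
# partition-pruned scans.
(staged.write
 .mode("overwrite")
 .partitionBy("load_date")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("analytics.events"))

# Typical cube-style aggregation used to feed visualization tools.
spark.sql("""
    SELECT load_date, event_type, COUNT(*) AS events, SUM(amount) AS total
    FROM   analytics.events
    GROUP  BY load_date, event_type
""").show()
```

Bucketing on the join key keeps related rows co-located, which generally speeds up repeated joins and sampling on that column.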