Senior Big Data Engineer Resume
Oldsmar, FL
SUMMARY
- 8+ years of extensive hands-on Big Data experience with the Hadoop ecosystem across on-premise and cloud-based platforms.
- Expertise in cloud computing and Hadoop architecture and its various components: HDFS, MapReduce, Spark, NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode.
- Strong experience using HDFS, MapReduce, Hive, Spark, Sqoop, Oozie, and HBase.
- Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Experience moving data into and out of HDFS and relational database systems (RDBMS) using Apache Sqoop.
- Expertise working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
- Replaced existing MapReduce jobs and Hive scripts with Spark SQL and Spark DataFrame transformations for more efficient data processing (see the illustrative sketch at the end of this summary).
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Experience developing Kafka producers and consumers for streaming millions of events per second.
- Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions with MySQL, MS SQL Server, DB2, and Oracle.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra and HBase.
- Experience with software development tools such as JIRA, Play, and Git.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premise databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of data modelling (dimensional and relational) concepts such as star and snowflake schema modelling and fact and dimension tables.
- Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
- Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.
- Experience working with Google Cloud Platform technologies such as BigQuery, Dataflow, Dataproc, Pub/Sub, and Airflow.
- Designed and developed an ingestion framework over Google Cloud and the Hadoop cluster.
- Strong understanding of Java Virtual Machines and multi-threading process.
- Experience in writing complex SQL queries, creating reports and dashboards.
- Proficient in using Unix based Command Line Interface.
- Hands-on experience with GCP: BigQuery, GCS buckets, Cloud Functions, cloud migration, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Strong experience with ETL and/or orchestration tools (e.g. Talend, Oozie, Airflow)
- Experience setting up the AWS data platform: AWS CloudFormation, development endpoints, AWS Glue, EMR, Jupyter/SageMaker notebooks, Redshift, S3, and EC2 instances.
- Experienced in using Agile methodologies including Extreme Programming, Scrum, and Test-Driven Development (TDD).
- Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into target databases.
- Experienced in writing, testing, and running code and algorithms against data to verify they work as expected.
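Illustrative sketch only (not project code): a minimal PySpark example of the Hive-to-Spark SQL migration pattern summarized above, assuming a configured Hive metastore; the table and column names (sales_db.orders, store_id, order_date, amount) are hypothetical.

```python
# Minimal sketch: replacing a Hive/MapReduce aggregation with Spark SQL and
# DataFrame transformations. Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hive-to-spark")
         .enableHiveSupport()   # assumes a configured Hive metastore
         .getOrCreate())

# The same aggregation a legacy HQL script would run, pushed through Spark SQL...
daily_sales_sql = spark.sql("""
    SELECT store_id, order_date, SUM(amount) AS total_amount
    FROM   sales_db.orders
    GROUP  BY store_id, order_date
""")
daily_sales_sql.show(5)

# ...and the equivalent DataFrame form, which is easier to tune (caching,
# broadcast joins, repartitioning) and to unit test.
daily_sales_df = (spark.table("sales_db.orders")
                  .groupBy("store_id", "order_date")
                  .agg(F.sum("amount").alias("total_amount")))
daily_sales_df.show(5)
```

The DataFrame form is generally the easier one to tune and unit test than the equivalent standalone HQL script.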
TECHNICAL SKILLS
Programming languages: Python, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash
Big Data: Hadoop, Sqoop, Apache Spark (Spark SQL, PySpark), NiFi, Kafka, Snowflake, Cloudera, Hortonworks
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Operating Systems: UNIX, LINUX, MacOS, Solaris, Mainframes
Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala
Cloud Technologies: AWS, Azure, GCP
IDE Tools: Aginity for Hadoop, PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, vi editor
OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
ETL/Data Warehouse Tools: Informatica 9.6/9.1 and Tableau
Others: AutoSys, Crontab, ArcGIS, Clarity, Informatica, Business Objects, IBM MQ, Splunk
PROFESSIONAL EXPERIENCE
Confidential, Oldsmar, FL
Senior Big Data Engineer
Responsibilities:
- Migrated objects using a custom ingestion framework from a variety of sources such as Oracle, SAP HANA, MongoDB, and Teradata.
- Planned and designed the data warehouse in a star schema; designed table structures and documented them.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS (and vice versa) using Sqoop.
- Performed ETL from multiple sources such as Kafka, NiFi, Teradata, and DB2 using Spark on Hadoop.
- Worked on Apache Spark, utilizing the Spark SQL and Spark Streaming components to support intraday and real-time data processing.
- Developed Python and Bash scripts to automate and provide control flow.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks as part of cloud migration.
- Moved data from Teradata to the Hadoop cluster using TDCH/FastExport and Apache NiFi.
- Developed Python, PySpark, and Bash scripts to transform and load data across on-premise and cloud platforms.
- Worked extensively on AWS components such as Elastic MapReduce (EMR).
- Experience in data ingestion techniques for batch and stream processing using AWS Batch, AWS Kinesis, and AWS Data Pipeline.
- Shared sample data by granting customer access for UAT/BAT.
- Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as the feed for Tableau dashboards.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch, CloudTrail, and SNS.
- Built ETL data pipelines for data movement to S3 and then to Redshift.
- Scheduled different Snowflake jobs using NiFi.
- Developed NiFi workflows to pick up data from REST API servers, the data lake, and SFTP servers and send it to the Kafka broker.
- Designed and implemented an end-to-end big data platform on the Teradata appliance.
- Wrote AWS Lambda code in Python for nested JSON files: converting, comparing, sorting, etc. (see the illustrative sketch at the end of this role).
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Wrote UDFs in PySpark to perform transformations and loads.
- Used NiFi to load data into HDFS as ORC files.
- Experience with Snowflake Multi-Cluster Warehouses
- Wrote TDCH scripts and Apache NiFi flows to load data from mainframe DB2 to the Hadoop cluster.
- Worked with ORC, Avro, JSON, and Parquet file formats, creating external tables and querying on top of these files using BigQuery.
- Performed source analysis, tracing data back to its sources and finding its roots through Teradata, DB2, etc.
- Identified the jobs that load the source tables and documented them.
- Implemented Continuous Integration and Continuous Delivery processes using GitLab along with Python and shell scripts to automate routine jobs, including synchronizing installers, configuration modules, packages, and requirements for the applications.
- Worked on Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Implemented a batch process for heavy-volume data loading using an Apache NiFi dataflow framework within an Agile development methodology.
- Deployed the Big Data Hadoop application using Talend on AWS (Amazon Web Services) and on Microsoft Azure.
- Created Snowpipe for continuous data loads from staged data residing on cloud gateway servers.
- Developed automated processes for code builds and deployments using Jenkins, Ant, Maven, Sonatype, and shell scripts.
- Installed and configured applications such as Docker and Kubernetes for orchestration.
- Developed an automation system using PowerShell scripts and JSON templates to remediate Azure services.
Environment: Apache Spark, Hadoop, PySpark, HDFS, Cloudera, AWS, Azure, Kafka, Snowflake, Docker, Jenkins, Ant, Maven, Kubernetes, Nifi, JSON, Teradata, DB2, SQL Server, MongoDB, Shell Scripting.
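Illustrative sketch only: a minimal AWS Lambda handler, in Python, for the nested-JSON conversion/sorting work mentioned above; the bucket layout, field names, and the processed/ output prefix are hypothetical assumptions.

```python
# Hedged sketch of a Lambda handler that flattens and sorts nested JSON records
# landing in S3. Bucket/key layout and field names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")


def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into dotted keys."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items


def lambda_handler(event, context):
    # Each invocation receives S3 object notifications.
    for notification in event.get("Records", []):
        bucket = notification["s3"]["bucket"]["name"]
        key = notification["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        records = json.loads(body)  # assumes a JSON array of (nested) records

        flattened = [flatten(r) for r in records]
        # Sort by a hypothetical event timestamp for downstream comparison.
        flattened.sort(key=lambda r: r.get("event.timestamp", ""))

        s3.put_object(
            Bucket=bucket,
            Key=f"processed/{key}",
            Body=json.dumps(flattened).encode("utf-8"),
        )
    return {"status": "ok"}
```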
Confidential, Malvern, PA
Big Data Engineer
Responsibilities:
- Built ETL/ELT pipelines using data technologies such as PySpark, Hive, Presto, and Databricks.
- Extracted data from source systems, transformed it according to business requirements, and loaded it into the target.
- Orchestrated the dimension module using Airflow and built dependencies according to the business document.
- Built data pipelines using PySpark to transform data from Amazon S3 and stage it in Presto views.
- Read files from S3 on an incremental basis, transformed the data, and loaded it to an S3 location.
- Built external tables using HQL to match the data schema loaded in the target.
- Analysed and validated the data in Presto using SQL queries.
- Built a validation module using PySpark to summarize the data between source and target.
- Created and enhanced tools to analyse and process large data sets.
- Developed PySpark scripts to process huge data sets from source to target.
- Executed complex joins and transformations to process data adhering to the business use case.
- Provided technical designs based on business requirements to create fact and dimension tables.
- Designed the workflow to orchestrate the dimension and fact modules using Airflow.
- Developed Airflow DAGs in Python to schedule jobs on an incremental basis (see the illustrative sketch at the end of this role).
- Communicated and coordinated with cross-functional teams to ensure business objectives were met.
- Documented the workflow and brainstormed proactive solutions with the team.
- Presented solutions for the orchestration process by effectively capturing data statistics.
- Designed Airflow DAGs to process the dimension tables and staged them in Presto views.
- Developed automated scripts to validate the data processed between source and target.
- Designed and developed Airflow DAGs to retrieve data from Amazon S3 and built ETL pipelines using PySpark to process the data and build the dimensions.
- Implemented Spark Kafka streaming to pick up data from Kafka and feed it to the Spark pipeline.
- Developed Python scripts to schedule each dimension process as a task and to set its dependencies.
- Developed, tested, and deployed Python scripts to create Airflow DAGs.
- Integrated with Databricks using Airflow operators to run notebooks on a scheduled basis.
- Interfaced with the Business Intelligence team and helped them build data pipelines to support the existing BI platform and data products.
- Established dependencies using Airflow methods and tracked data processing.
- Built an aggregated layer over fact and dimension tables to help BI build dashboards on top of it.
- Leveraged data and built robust ETL pipelines to process ad data, capturing revenue and impressions so the business could make informed decisions.
- Developed the process workflow and the PySpark code adhering to the design.
- Optimized data structures for efficient querying of Hive and Presto views.
- Defined and designed data models to develop dimension and fact tables.
- Collaborated with internal and external data sources to ensure integrations were accurate and scalable.
Environment: PySpark, Hive, Presto, Databricks, HQL, AWS, S3, MapReduce, Hadoop, Flume, Kafka, Scala, Python, Tableau, Oracle, MySQL, MongoDB, Cassandra, Git, Shell Scripting.
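Illustrative sketch only: a minimal Airflow 2.x DAG for the incremental PySpark scheduling described above; the DAG id, schedule, spark-submit commands, and script paths are hypothetical.

```python
# Hedged sketch of an Airflow 2.x DAG that runs incremental PySpark loads with
# explicit task dependencies. DAG id, schedule, and script paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="dimension_incremental_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:

    # Pull the latest incremental slice from S3 and stage it.
    extract = BashOperator(
        task_id="extract_from_s3",
        bash_command="spark-submit /opt/jobs/extract_s3.py --ds {{ ds }}",
    )

    # Transform the staged data and build the dimension table.
    build_dimension = BashOperator(
        task_id="build_dimension",
        bash_command="spark-submit /opt/jobs/build_dimension.py --ds {{ ds }}",
    )

    # Validate row counts between source and target.
    validate = BashOperator(
        task_id="validate_counts",
        bash_command="spark-submit /opt/jobs/validate.py --ds {{ ds }}",
    )

    extract >> build_dimension >> validate
```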
Confidential, Fountain Valley, CA
Data Engineer
Responsibilities:
- Used Hive queries in Spark SQL for analysing and processing the data.
- Hands-on experience in installation, configuration, supporting, and managing Hadoop clusters.
- Wrote shell scripts that run multiple Hive jobs to incrementally automate different Hive tables, which are used to generate reports in Tableau for business use.
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework, and handled JSON data.
- Implemented Spark Scripts using Scala, Spark SQL to access hive tables into Spark for faster processing of data
- Involved in business analysis and technical design sessions with business and technical staff to develop requirements document and ETL design specifications.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive
- Responsible for the design, development, and data modelling of Spark SQL scripts based on functional specifications.
- Designed and developed extract, transform, and load (ETL) mappings, procedures, and schedules, following the standard development lifecycle
- Developed AutoSys scripts to schedule the Kafka streaming and batch jobs.
- Set up a data lake in Google Cloud using Google Cloud Storage, BigQuery, and Bigtable.
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
- Worked closely with Quality Assurance, Operations and Production support group to devise the test plans, answer questions and solve any data or processing issues
- Used PySpark to create a batch job that merges multiple small files (Kafka stream files) into single larger files in Parquet format (see the illustrative sketch at the end of this role).
- Developed scripts in BigQuery and connected it to reporting tools.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, Sqoop, Hive, and NoSQL databases.
- Worked in writing Spark SQL scripts for optimizing the query performance
- Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Wrote complex SQL scripts to avoid Informatica Look-ups to improve the performance as the volume of the data was heavy.
- Implemented Hive UDFs and performed tuning for better results.
- Tuned and developed SQL in HiveQL, Drill, and Spark SQL.
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE
- Developed Spark code using Spark RDD and Spark-SQL/Streaming for faster processing of data
- Implemented partitioning, data modelling, dynamic partitions, and buckets in Hive for efficient data access.
Environment: Spark SQL, Hive, Hadoop YARN, MapReduce, HiveQL, GCP, Sqoop, SQL Server, NoSQL databases, Kafka, Python, Scala, Shell, Bash Scripting, Git.
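Illustrative sketch only: a minimal PySpark batch job for the small-file compaction described above (merging many small Kafka-landed files into larger Parquet files); the HDFS paths and target file count are hypothetical.

```python
# Hedged sketch of the small-file compaction idea: read many small JSON files
# produced by a Kafka-landing process and rewrite them as fewer, larger Parquet
# files. Paths and the target file count are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-kafka-landing-files").getOrCreate()

input_path = "hdfs:///data/landing/events/2023-10-01/*.json"   # many small files
output_path = "hdfs:///data/curated/events/dt=2023-10-01"

events = spark.read.json(input_path)

# coalesce() reduces the number of output files without a full shuffle;
# repartition() could be used instead if the data is badly skewed.
(events
 .coalesce(8)
 .write
 .mode("overwrite")
 .parquet(output_path))

spark.stop()
```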
Confidential
Hadoop-Spark Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and migrate legacy applications to Hadoop.
- Wrote Spark code in Scala to connect to HBase and read/write data to HBase tables.
- Extracted data from different databases and copied it into HDFS using Sqoop, with expertise in using compression techniques to optimize data storage.
- Implemented Kafka producers that create custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
- Delivered a real-time experience and analyzed massive amounts of data from multiple sources to calculate real-time ETA using Confluent Kafka event streaming (see the illustrative sketch at the end of this role).
- Implemented Flume and the Spark framework for real-time data processing.
- Optimized Map/Reduce jobs to use HDFS efficiently by using various compression mechanisms.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored in efficient formats such as Parquet and loaded into Amazon S3 using the Spark Scala API.
- Worked on cloud computing infrastructure (e.g. Amazon Web Services EC2) and considerations for scalable, distributed systems
- Created the Spark Streaming code to take the source files as input.
- Used Oozie workflow to automate all the jobs.
- Exported the analyzed data into relational databases using Sqoop for visualization and to generate reports for the BI team.
- Developed the technical strategy of using Apache Spark on Apache Mesos as a next generation, Big Data and "Fast Data" (Streaming) platform.
- Developed simple to complex Map Reduce jobs using Hive and Pig for analysing the data.
- Developed Spark programs using Scala, created Spark SQL queries, and developed Oozie workflows for Spark jobs.
- Built analytics for structured and unstructured data and managing large data ingestion by using Avro, Flume, Thrift, Kafka and Sqoop.
- Developed Pig UDFs to understand customer behavior and Pig Latin scripts for processing the data in Hadoop.
- Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop, Apache Spark, and Apache Storm.
- Ingested streaming data into Hadoop using Spark, Storm Framework and Scala.
- Copied data from HDFS to MongoDB using Pig/Hive/MapReduce scripts and visualized the streaming processed data in Tableau dashboards.
- Scheduled automated tasks with Oozie for loading data into HDFS through Sqoop and pre-processing the data with Pig and Hive.
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager.
Environment: Spark, Kafka, Hadoop, AWS, Sqoop, HDFS, Oracle, SQL Server, MongoDB, Python, Scala, Shell Scripting, Tableau, Map Reduce, Oozie, Pig, Hive.
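Illustrative sketch only: a PySpark (Structured Streaming) rendering of the Kafka-to-Spark streaming pattern used in this role (the project code itself was written in Scala); broker addresses, topic, and checkpoint paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Hedged PySpark rendering of the Kafka -> Spark streaming pattern.
# Brokers, topic, and paths are hypothetical assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream-ingest").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "trip-events")
       .option("startingOffsets", "latest")
       .load())

# Kafka delivers key/value as binary; cast the value to string for parsing.
events = raw.select(
    F.col("key").cast("string").alias("event_key"),
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp").alias("event_time"),
)

# Persist the stream as Parquet micro-batches for downstream analytics.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/trip_events")
         .option("checkpointLocation", "hdfs:///checkpoints/trip_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```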
Confidential
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kafka in near real time and Persists into Cassandra.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW
- Experience writing scripts using Python (or Go) and familiarity with the following tools: AWS Lambda, AWS S3, AWS EC2, AWS Redshift, and AWS Postgres.
- In-Depth understanding of Snowflake Multi-Cluster Size and Credit Usage
- Experience loading data into Spark RDDs and performing advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities in Scala to generate the output response.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts, effective and efficient joins, transformations, and other techniques during the ingestion process itself.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
- Experience in writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using Airflow.
- Created Hive tables for loading and analyzing data, implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization (see the illustrative sketch at the end of this role).
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Developed HiveQL scripts to de-normalize and aggregate the data.
- Used Spark API over Cloudera Hadoop Yarn to perform analytics on data in Hive.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, to implement the former in the project.
- In - depth understanding of Snowflake cloud technology.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Scala.
- Worked with BI team to create various kinds of reports using Tableau based on the client's needs.
- Experience querying Parquet files by loading them into Spark DataFrames using Zeppelin notebooks.
- Experience in troubleshooting problems that arise during batch data processing jobs.
- Extracted the data from Teradata into HDFS/Dashboards using Spark Streaming.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data sets processing and storage, experienced in Maintaining the Hadoop cluster on AWS EMR.
Environment: Hadoop, Spark, AWS, Hive, MapReduce, Snowflake, Sqoop, HDFS, Cloudera, Oracle, MySQL, Python, Shell Scripting.
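Illustrative sketch only: a minimal PySpark example of the partitioned/bucketed Hive table pattern described above, using Spark's native bucketing via bucketBy rather than Hive-managed bucketing; database, table, and column names (staging.events_raw, analytics.events, load_date, customer_id, event_type, amount) are hypothetical.

```python
# Hedged sketch of writing a partitioned, bucketed table for fast pruned scans
# and joins, then running a cube-style aggregation for dashboards.
# Names are hypothetical; bucketing here is Spark-native (bucketBy).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioned-tables")
         .enableHiveSupport()   # assumes a configured Hive metastore
         .getOrCreate())

staged = spark.table("staging.events_raw")

# Partition by load date and bucket by customer id to speed up joins and
# partition-pruned scans.
(staged.write
 .mode("overwrite")
 .partitionBy("load_date")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("analytics.events"))

# Typical cube-style aggregation used to feed visualization tools.
spark.sql("""
    SELECT load_date, event_type, COUNT(*) AS events, SUM(amount) AS total
    FROM   analytics.events
    GROUP  BY load_date, event_type
""").show()
```

Bucketing on the join key keeps related rows co-located, which generally speeds up repeated joins and sampling on that column.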