Senior Big Data Engineer Resume

OBJECTIVE:

8+ years of professional experience in Information Technology, with expertise in Big Data using the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
Expertise in major components of the Hadoop ecosystem, including HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, ZooKeeper, and Hue.
Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
Develop data set processes for data modeling and data mining; recommend ways to improve data reliability, efficiency, and quality.
Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
Experience developing MapReduce programs with Apache Hadoop to analyze big data as required.
Experience designing star and snowflake schemas for data warehouse and ODS architectures.
Expertise in the Amazon Web Services cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data

TECHNICAL SKILLS

Amazon DynamoDB
DynamoDB
Apache Hadoop HDFS
HDFS
Apache Hadoop Impala
Impala
Apache Hadoop Mahout
Mahout
Apache Hadoop MapReduce
Hadoop MapReduce
MapReduce
Apache Hadoop Oozie
Oozie
Apache Hadoop Sqoop
Sqoop
Cassandra
Clustering
Data Cleansing
Data Governance
Data Integration
Data Management
Data Migration
Data Mining
Data Model
Data Profiling
Data Visualization
Distributed Systems
ETL
Flume
Hadoop
Hadoop Cluster
Hadoop Distributed File System
HBase
Informatica
Kafka
Machine Learning
Master Data Management
MDM
Metadata
Microsoft SQL Server Analysis Services
SQL Server Analysis Services
MongoDB
NoSQL
Online Analytical Processing
OLAP
Operational Data Store
Power BI
Predictive Analytics
Reference Data
Semi-Structured Data
Snowflake Schema
Star Schema
Teradata
Text Analytics
Data Analysis
Data Cleaning
Data Manipulation
Database Modeling
MS SQL Server
SQL Server
MySQL
OLTP
Oracle
Oracle 10g
PL/SQL
PostgreSQL
Relational Database
SQL Queries
Stored Procedures
Analysis Services
Apache Spark
API
Application Server
Avro
C++
Coding
Exchange
Git
Hive
HTML
JavaScript
JSON
Pig
Python
ggplot2
Matplotlib
NumPy
Pandas
PySpark
R Language
R Programming
Real Time
Scripting
Subversion
SVN
VBA
Web Scraping
XML
Zookeeper
Amazon Elastic Beanstalk
Elastic Beanstalk
Amazon Elastic Block Storage
EBS
Amazon Elastic Compute Cloud
Amazon EC2
EC2
Amazon Kinesis
Apache
Linux
Shell Scripting
Shell Scripts
Unix/Linux
Microsoft SQL Server Reporting Services
SQL Server Reporting Services
Microsoft SSRS
SSRS
SAS
Tableau Software
Tableau
T-SQL
Boosting
CSS
Security
Streaming
Web Services
WebLogic
WebSphere
Eclipse
Java
Spring
jQuery
JSP
Struts
EMR
Gap Analysis
Gather Business Requirements
Project Manager
SCRUM
Version Control
Data Quality
Integration Testing
JIRA
Unit Testing
DevOps
Scala
Deployment
Real-Time
VMS
Data Structures
HDInsight
Large-Scale
HTML5
Model View Controller
Model-View-Controller
User Interface
UI
Front-End
Front End Design
Prototypes
Optimization
Statistical Analysis
Algorithms
TOPO
DSL
Reverse Engineering
Serial Attached SCSI
ECS
Pipeline
Pipeline Engineering
Business Intelligence
BI
Scraping
GCP
Scheduling
Neural
Documentation
Exploration
Building Automation

PROFESSIONAL EXPERIENCE

Confidential

Senior Big Data Engineer

Responsibilities:

Developed a data pipeline with Kafka and Spark.
Contributed to the design of the data pipeline using the Lambda architecture.
Performed advanced procedures such as text analytics and processing, using Spark's in-memory computing capabilities with Scala.
Involved in installing, configuring, supporting, and administering Hadoop clusters.
Created tables and stored procedures, and extracted data using PL/SQL for business users as required.
Worked on the Confidential data pipeline to configure data loads from S3 into Redshift.
Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
Worked on data governance and data quality to design various models and processes.
Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mappings for the MDM data model.
Developed automated regression scripts in Python to validate the ETL process across multiple databases, including Confidential Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
Automated data processing with Oozie to load data into the Hadoop Distributed File System.
Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing.
Developed a Kafka consumer API in Scala for consuming data from Kafka topics.
Wrote live real-time processing and core jobs using Spark Streaming, with Kafka as the data pipeline system.
Migrated an existing on-premises application to Confidential.
Used Confidential services such as EC2 and S3 for processing and storing small data sets.
Maintained the Hadoop cluster on Confidential EMR.
Imported data from Confidential S3 into Spark RDDs and performed transformations and actions on them.
Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms.
Managed data coming from different sources through Kafka.
Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
Implemented data quality in the Talend ETL tool; good knowledge of data warehousing.
Developed Apache Spark applications to process data from various streaming sources.
Strong knowledge of the architecture and components of Tealeaf; proficient with Spark Core and Spark SQL.
Designed and developed RDD seeds using Scala and Cascading; streamed data to Spark Streaming using Kafka.
Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, including creating and handling DataFrames in Spark with Scala.
Good exposure to MapReduce.

Confidential

Senior Data Engineer

Responsibilities:

Used Agile methodology extensively, as the organization's standard, to implement the data models.
Created several types of data visualizations using Python and Tableau.
Extracted Mega Data from Confidential using SQL Queries to create reports.
Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
Developed a data pipeline using Kafka to store data into HDFS.
Developed Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats, uncovering insights into customer usage patterns.
Analyzed the Hadoop cluster using big data analytic tools including Pig and Hive.
Worked on data streaming processes with Kafka, Apache Spark, and Hive.
Worked with NiFi and with various HDFS file formats such as Avro, SequenceFile, and JSON, and compression formats such as Snappy and bzip2.
Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Analyzed the SQL scripts and designed a solution to implement them in Scala.
Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
Used SQL Server Integration Services (SSIS) to extract, transform, and load data from multiple sources into the target system.
Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server.
Extensively used Tableau for customer marketing data visualization.
Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Navigator.
Designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for a client.
Generated various reports using SQL Server Reporting Services (SSRS) for business analysts and the management team.
Created HBase tables to store variable data formats of PII data coming from different portfolios.
Designed data models to industry standards up to 3NF (OLTP) and denormalized (OLAP) data marts with star and snowflake schemas.
Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.

Confidential

Big Data Hadoop Developer

Responsibilities:

Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
Strong understanding of Confidential components such as EC2 and S3.
Implemented a continuous delivery pipeline with Docker and GitHub.
Used Google Cloud Functions with Python to load CSV files into BigQuery on arrival in a GCS bucket.
Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
Devised simple and complex SQL scripts to check and validate data flow in various applications.
Performed data analysis, data migration, data cleansing, transformation, integration, data import, and data export through Python.
Developed and deployed data pipelines in clouds such as Confidential and GCP.
Performed data engineering functions (data extraction, transformation, loading, and integration) in support of enterprise data infrastructure: data warehouses, operational data stores, and master data management.
Responsible for data services and data movement infrastructure.
Good experience with ETL concepts, building ETL solutions, and data modeling.
Architected several DAGs (directed acyclic graphs) to automate ETL pipelines.
Hands-on experience architecting ETL transformation layers and writing Spark jobs for processing.
Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications).
Experience in fact/dimensional modeling (star schema, snowflake schema), transactional modeling, and SCD (slowly changing dimensions).
Devised PL/SQL stored procedures, functions, triggers, views, and packages.
Used indexing, aggregation, and materialized views to optimize query performance.
Developed logistic regression models (Python) to predict subscription response rate based on customer variables such as past transactions, response to prior mailings, promotions, demographics, interests, and hobbies.
Developed a near-real-time data pipeline using Spark.
Hands-on experience with GCP, BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
Proficient in machine learning techniques (decision trees, linear/logistic regression) and statistical modeling.
Worked with Confluence and Jira.
Skilled in data visualization with libraries such as Matplotlib and seaborn.
Hands-on experience with big data tools such as Hadoop, Spark, and Hive.
Experience implementing a machine learning back-end pipeline with Pandas and NumPy.

Environment: GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, Confidential, Apache Airflow, Python, Pandas, Matp

Confidential

Hadoop Developer

Responsibilities:

Installed Hadoop, MySQL, PostgreSQL, SQL Server, Sqoop, Hive, and HBase.
Created .bashrc files and other XML configuration files to automate the deployment of Hadoop VMs on Confidential EMR.
Created and organized HDFS over a staging area.
Troubleshot RSA SSH keys in Linux for authorization purposes.
Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
Used Sqoop to import structured data from MySQL, SQL Server, and PostgreSQL, and a semi-structured CSV file dataset, into the HDFS data lake.
Collaborated with business analysts and SMEs across departments to gather business requirements and identify workable items for further development.
Partnered with ETL developers, using Pig, to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes.
Selected data and generated CSV files, stored them in Confidential S3 using Confidential EC2, and then structured and stored the data in Confidential Redshift.
Performed simple statistical analysis for data profiling, such as cancel rate, variance, skewness, and kurtosis of trades, and runs of each stock per day, grouped by 1-minute, 5-minute, and 15-minute intervals.
Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse.
Explored Spark to improve the performance and optimization of existing Hadoop algorithms, using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
Developed a raw layer of external tables within S3 containing data copied from HDFS.
Created a data service layer of internal tables in Hive for data manipulation and organization.
Inserted data into DSL internal tables from RAW external tables.
Achieved business intelligence by creating and analyzing an application service layer in Hive containing internal tables of the data, which are also integrated with HBase.
Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
Utilized Agile and Scrum methodology for team and project management.
Used Git for version control with colleagues.

Environment: Hadoop, Hive, HBase, Spark, Python, Pandas, SQL, PL/SQL, PostgreSQL, Confidential, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.

Confidential

Data and Reporting Analyst

Responsibilities:

Involved in review of functional and non-functional requirements.
Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
Installed and configured Pig and wrote Pig Latin scripts.
Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
Imported data from Oracle to HDFS using Sqoop on a regular basis.
Developed scripts and batch jobs to schedule various Hadoop programs.
Wrote Hive queries for data analysis to meet the business requirements.
Created Hive tables and worked with them using HiveQL. Experienced in defining job flows.
Utilized various utilities like Struts Tag Libraries, JSP, JavaScript, HTML, & CSS.
Built and deployed WAR files on the WebSphere application server.
Implemented design patterns such as Singleton, Factory, Facade, Prototype, Decorator, Business Delegate, and MVC.
Met frequently with clients to gather business requirements and convert them into technical specifications for the development team.
Adopted agile methodology with pair programming technique and addressed issues during system testing.
Involved in the bug fixing and enhancement phase; used the FindBugs tool.
Managed version control using SVN.
Developed applications in the Eclipse IDE. Experience developing Spring Boot applications for transformations.
Primarily involved in front-end UI using HTML5, CSS3, JavaScript, jQuery, and AJAX.
Used the Struts framework to build an MVC architecture and separate presentation from business logic.
Involved in rewriting the middle tier on the WebLogic application server.
Actively involved in code reviews, coding standards, unit testing, and integration testing.
Imported and exported data between Oracle Database and HDFS using Sqoop.
Created Hive tables, loaded data, and wrote Hive queries that run internally as MapReduce jobs.
Developed a custom File System plugin for Hadoop so it can access files on Data Platform.
The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
Set up and benchmarked Hadoop/HBase clusters for internal use.

Environment: Hadoop, MapReduce, HDFS, Hive, Java, Cloudera distribution of Hadoop, Pig, HBase, Linux, XML, Java 6, Eclipse, Oracle 10g, PL/SQL, MongoDB, Toad
