Senior Big Data Engineer Resume
OBJECTIVE:
- 8+ years of professional experience in Information Technology, with expertise in Big Data using the Hadoop framework and in analysis, design, development, testing, documentation, deployment, and integration using SQL and Big Data technologies.
- Expertise in the major components of the Hadoop ecosystem, including HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Good understanding of distributed systems, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Develop data set processes for data modeling and data mining; recommend ways to improve data reliability, efficiency, and quality.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Experience developing MapReduce programs using Apache Hadoop to analyze big data as required.
- Experience designing star and snowflake schemas for data warehouse and ODS architectures.
- Expertise in the Amazon Web Services cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Elastic Block Store (EBS), Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data
TECHNICAL SKILLS
- Amazon DynamoDB
- DynamoDB
- Apache Hadoop HDFS
- HDFS
- Apache Hadoop Impala
- Impala
- Apache Hadoop Mahout
- Mahout
- Apache Hadoop MapReduce
- Hadoop MapReduce
- MapReduce
- Apache Hadoop Oozie
- Oozie
- Apache Hadoop Sqoop
- Sqoop
- Cassandra
- Clustering
- Data Cleansing
- Data Governance
- Data Integration
- Data Management
- Data Migration
- Data Mining
- Data Model
- Data Profiling
- Data Visualization
- Distributed Systems
- ETL
- Flume
- Hadoop
- Hadoop Cluster
- Hadoop Distributed File System
- HBase
- Informatica
- Kafka
- Machine Learning
- Master Data Management
- MDM
- Metadata
- Microsoft SQL Server Analysis Services
- SQL Server Analysis Services
- MongoDB
- NoSQL
- Online Analytical Processing
- OLAP
- Operational Data Store
- Power BI
- Predictive Analytics
- Reference Data
- Semi-Structured Data
- Snowflake Schema
- Star Schema
- Teradata
- Text Analytics
- Data Analysis
- Data Cleaning
- Data Manipulation
- Database Modeling
- MS SQL Server
- SQL Server
- MySQL
- OLTP
- Oracle
- Oracle 10g
- PL/SQL
- PostgreSQL
- Relational Database
- SQL Queries
- Stored Procedures
- Analysis Services
- Apache Spark
- API
- Application Server
- Avro
- C++
- Coding
- Exchange
- Git
- Hive
- HTML
- JavaScript
- JSON
- Pig
- Python
- ggplot2
- Matplotlib
- NumPy
- Pandas
- PySpark
- R Language
- R Programming
- Real Time
- Scripting
- Subversion
- SVN
- VBA
- Web Scraping
- XML
- Zookeeper
- Amazon Elastic Beanstalk
- Elastic Beanstalk
- Amazon Elastic Block Store
- EBS
- Amazon Elastic Compute Cloud
- Amazon EC2
- EC2
- Amazon Kinesis
- Apache
- Linux
- Shell Scripting
- Shell Scripts
- Unix/Linux
- Microsoft SQL Server Reporting Services
- SQL Server Reporting Services
- Microsoft SSRS
- SSRS
- SAS
- Tableau Software
- Tableau
- T-SQL
- Boosting
- CSS
- Security
- Streaming
- Web Services
- WebLogic
- WebSphere
- Eclipse
- Java
- Spring
- jQuery
- JSP
- Struts
- EMR
- GAP Analysis
- Gather Business Requirements
- Project Manager
- SCRUM
- Version Control
- Data Quality
- Integration Testing
- JIRA
- Unit Testing
- DevOps
- Devo
- Scala
- Deployment
- Real-Time
- VMS
- Data Structures
- HDInsight
- Large-Scale
- HTML5
- Model View Controller
- Model-View-Controller
- User Interface
- UI
- Front-End
- Front End Design
- Prototypes
- Optimization
- Statistical Analysis
- Algorithms
- TOPO
- DSL
- Reverse Engineering
- Serial Attached Scsi
- ECS
- Pipeline
- Pipeline Engineering
- Business Intelligence
- BI
- Scraping
- GCP
- Scheduling
- Neural
- Documentation
- Exploration
- Building Automation
PROFESSIONAL EXPERIENCE
Confidential
Senior Big Data Engineer
Responsibilities:
- Developed a data pipeline with Kafka and Spark.
- Contributed to the design of the data pipeline using the Lambda Architecture.
- Performed advanced procedures such as text analytics and processing, using Spark's in-memory computing capabilities with Scala.
- Involved in installing, configuring, supporting, managing, and administering Hadoop clusters.
- Created Tables, Stored Procedures, and extracted data using PL/SQL for business users whenever required.
- Worked on Confidential Data Pipeline to configure data loads from S3 into Redshift.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
- Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Worked with data governance and data quality to design various models and processes.
- Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mapping in the MDM data model.
- Developed automated regression scripts in Python to validate the ETL process across multiple databases, including Confidential Redshift, Oracle, MongoDB, T-SQL, and SQL Server.
- Automated data processing with Oozie to load data into the Hadoop Distributed File System.
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversion, and data cleansing.
- Developed a Kafka consumer in Scala to consume data from Kafka topics.
- Wrote real-time processing and core jobs using Spark Streaming, with Kafka as the data pipeline.
- Migrated an existing on-premises application to Confidential.
- Used Confidential services such as EC2 and S3 for processing and storing small data sets.
- Maintained the Hadoop cluster on Confidential EMR.
- Imported data from Confidential S3 into Spark RDDs and performed transformations and actions on them.
- Used IAM to detect and stop risky identity behaviors using rules, machine learning, and other statistical algorithms
- Managed data coming from different sources through Kafka.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Implemented data quality in the Talend ETL tool; good knowledge of data warehousing.
- Developed Apache Spark applications to process data from various streaming sources.
- Strong knowledge of the architecture and components of Tealeaf; efficient in working with Spark Core and Spark SQL.
- Designed and developed RDD Seeds using Scala and Cascading. Streamed data to Spark Streaming using Kafka.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala, and to creating DataFrames in Spark with Scala.
- Good exposure to MapReduce.
Confidential
Senior Data Engineer
Responsibilities:
- Extensively used Agile methodology as the organization standard to implement the data models.
- Created several types of data visualizations using Python and Tableau.
- Extracted Mega Data from Confidential using SQL Queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
- Developed a data pipeline using Kafka to store data into HDFS.
- Developed Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats, uncovering insights into customer usage patterns.
- Analyzed the Hadoop cluster and various big data analytics tools, including Pig and Hive.
- Worked with data streaming processes using Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats such as Avro, SequenceFile, NiFi, and JSON, and with compression formats such as Snappy and bzip2.
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Performed regression testing for Golden Test Cases from State (end-to-end test cases) and automated the process using Python scripts.
- Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
- Used SQL Server Integration Services (SSIS) to extract, transform, and load data into the target system from multiple sources.
- Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake schemas.
- Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server.
- Extensively used Tableau for customer marketing data visualization.
- Developed advanced PL/SQL packages, procedures, triggers, functions, indexes, and collections to implement business logic using SQL Navigator.
- Designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for a client.
- Generated various reports using SQL Server Reporting Services (SSRS) for business analysts and the management team.
- Created HBase tables to store variable data formats of PII data coming from different portfolios.
- Designed data models to industry standards up to third normal form (OLTP) and denormalized (OLAP) data marts with star and snowflake schemas.
- Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
Confidential
Big Data Hadoop Developer
Responsibilities:
- Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP.
- Strong understanding of Confidential components such as EC2 and S3
- Implemented a continuous delivery pipeline with Docker and GitHub.
- Used Google Cloud Functions with Python to load CSV files into BigQuery on arrival in a GCS bucket.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics into BigQuery using Cloud Dataflow with Python.
- Devised simple and complex SQL scripts to check and validate data flow in various applications.
- Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and Data Export through Python.
- Developed and deployed data pipelines on cloud platforms such as Confidential and GCP.
- Performed data engineering functions: data extract, transformation, loading, and integration in support of enterprise data infrastructures - data warehouse, operational data stores and master data management
- Responsible for data services and data movement infrastructures
- Good experience with ETL concepts, building ETL solutions and Data modeling
- Architected several DAGs (Directed Acyclic Graph) for automating ETL pipelines
- Hands-on experience architecting ETL transformation layers and writing Spark jobs for processing.
- Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing SQL queries, and writing applications).
- Experience in fact and dimensional modeling (star schema, snowflake schema), transactional modeling, and slowly changing dimensions (SCD).
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages.
- Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Developed logistic regression models (Python) to predict subscription response rate based on customer's variables like past transactions, response to prior mailings, promotions, demographics, interests, and hobbies, etc.
- Developed a near-real-time data pipeline using Spark.
- Hands-on experience with GCP, BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil, bq command-line utilities, Dataproc, and Stackdriver.
- Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
- Proficient in Machine Learning techniques (Decision Trees, Linear/Logistic Regressors) and Statistical Modeling
- Worked with Confluence and Jira.
- Skilled in data visualization with libraries such as Matplotlib and Seaborn.
- Hands on experience with big data tools like Hadoop, Spark, Hive
- Experience implementing a machine learning back-end pipeline with Pandas and NumPy.
Environment: GCP, BigQuery, GCS Bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, Docker, Kubernetes, Confidential, Apache Airflow, Python, Pandas, Matplotlib
Confidential
Hadoop Developer
Responsibilities:
- Installed Hadoop, MySQL, PostgreSQL, SQL Server, Sqoop, Hive, and HBase.
- Created bashrc files and all other XML configurations to automate the deployment of Hadoop VMs over Confidential EMR.
- Created and organized HDFS over a staging area.
- Troubleshot RSA SSH keys in Linux for authorization purposes.
- Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
- Utilized Sqoop to import structured data from MySQL, SQL Server, and PostgreSQL, and a semi-structured CSV file dataset, into the HDFS data lake.
- Collaborated with Business Analysts, SMEs across departments to gather business requirements, and identify workable items for further development.
- Partnered with ETL developers, using Pig, to ensure that data is well cleaned and the data warehouse is up to date for reporting purposes.
- Selected data, generated CSV files, and stored them in Confidential S3 using Confidential EC2, then structured and stored the data in Confidential Redshift.
- Performed simple statistical analysis for data profiling, such as cancel rate, variance, skewness, and kurtosis of trades, and runs of each stock per day, grouped by 1-minute, 5-minute, and 15-minute intervals.
- Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded the results into the data warehouse.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Developed a raw layer of external tables within S3 containing data copied from HDFS.
- Created a data service layer of internal tables in Hive for data manipulation and organization.
- Inserted data into data service layer (DSL) internal tables from raw external tables.
- Achieved business intelligence by creating and analyzing an application service layer in Hive, containing internal tables of the data that are also integrated with HBase.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Developed and validated machine learning models including Ridge and Lasso regression for predicting total amount of trade.
- Boosted the performance of regression models by applying polynomial transformation and feature selection and used those methods to select stocks.
- Generated report on predictive analytics using Python and Tableau including visualizing model performance and prediction results.
- Utilized Agile and Scrum methodology for team and project management.
- Used Git for version control with colleagues.
Environment: Hadoop, Hive, HBase, Spark, Python, Pandas, SQL, PL/SQL, PostgreSQL, Confidential, T-SQL, XML, Informatica, Tableau, OLAP, SSIS, SSRS, Excel, OLTP, Git.
Confidential
Data and Reporting Analyst
Responsibilities:
- Involved in review of functional and non-functional requirements.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Installed and configured Pig and wrote Pig Latin scripts.
- Wrote MapReduce jobs using Pig Latin. Involved in ETL, data integration, and migration.
- Imported data from Oracle into HDFS using Sqoop on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote Hive queries for data analysis to meet the business requirements.
- Created Hive tables and worked with them using HiveQL. Experienced in defining job flows.
- Utilized various utilities like Struts Tag Libraries, JSP, JavaScript, HTML, & CSS.
- Built and deployed WAR files to the WebSphere application server.
- Implemented Patterns such as Singleton, Factory, Facade, Prototype, Decorator, Business Delegate and MVC.
- Involved in frequent meetings with clients to gather business requirements and convert them into technical specifications for the development team.
- Adopted agile methodology with pair programming technique and addressed issues during system testing.
- Involved in the bug fixing and enhancement phase; used the FindBugs tool.
- Used SVN for version control.
- Developed applications in the Eclipse IDE. Experience developing Spring Boot applications for transformations.
- Primarily involved in front-end UI using HTML5, CSS3, JavaScript, jQuery, and AJAX.
- Used the Struts framework to build an MVC architecture and separate presentation from business logic.
- Involved in rewriting middle-tier on WebLogic application server.
- Actively involved in Code-Reviews & Coding Standards, Unit testing & Integration Testing.
- Imported and exported data between HDFS and Oracle Database using Sqoop.
- Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs.
- Developed a custom File System plugin for Hadoop so it can access files on Data Platform.
- The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
- Designed and implemented a MapReduce-based, large-scale parallel relation-learning system.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Java 6, Eclipse, Oracle 10g, PL/SQL, MongoDB, Toad
