Senior Data Engineer Resume
SUMMARY:
- 8+ years of overall experience as a Data Engineer, including designing, developing, and implementing data models for enterprise-level applications and systems.
- Full life-cycle implementation experience with Big Data pipelines.
- Excellent Software Development Life Cycle (SDLC) experience, with good working knowledge of testing methodologies, disciplines, tasks, resources, and scheduling.
- Extensive knowledge of Big Data, Hadoop, MapReduce, Hive, NoSQL databases, and other emerging technologies.
- Good experience in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
- Good working experience using Sqoop to import data into HDFS from RDBMS and vice versa.
- Experience in creating tables, constraints, views, and materialized views using ERwin, ER Studio, and SQL Modeler.
- Extensive experience in Text Analytics, generating data visualizations using Python and creating dashboards using tools like Tableau.
- Streamed data from various sources, including cloud (AWS, Azure) and on-premises systems, using Spark.
- Hands-on experience in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
- Good experience in AGILE delivery process of software using SCRUM.
- Excellent SQL programming skills; developed stored procedures, triggers, functions, and packages using SQL and PL/SQL.
- Excellent knowledge of Ralph Kimball's and Bill Inmon's approaches to Data Warehousing.
- Experience in working on Distributed storage for analysis and processing of large data sets using Apache Hadoop.
- Experience working with Teradata and batch-processing data using distributed computing.
- Excellent knowledge of and extensive experience using NoSQL databases (HBase).
- Experience designing and implementing data structures and commonly used data buses.
SKILLS:
Big Data & Hadoop Ecosystem: Apache Hadoop (HDFS, MapReduce, Oozie, Sqoop), Hadoop Cluster, Hive, Pig, Apache Spark, PySpark, Apache Kafka, Big Data, Real Time, Unstructured Data, Distributed Systems
NoSQL & Search: NoSQL, HBase, Cassandra, MongoDB, Elasticsearch
Data Warehousing & Integration: ETL, Informatica, Microsoft SQL Server Integration Services (SSIS), Star Schema, Snowflake Schema, OLAP (Online Analytical Processing), OLTP, Data Governance, Master Data Management (MDM), Reference Data, Replication, Data Sources, Data Mapping
Data Modeling & Analysis: Data Modeling, Database Modeling, Data Models, Data Analysis, Data Cleaning, Data Mining, Text Analytics, Predictive Analytics, Machine Learning, Clustering, Data Visualization
Databases: Oracle, Oracle 10g, MS SQL Server, SQL Server 2005, MySQL, DB2, Teradata, Microsoft Access (MS Access), Relational Databases, SQL, SQL Queries, PL/SQL, Stored Procedures
Languages & Tools: Python, Pandas, NumPy, Matplotlib, Shell Scripting, Scripting, Regex, JSON, Git, Cucumber, API, Splunk, Reporting Tools, ERP (Enterprise Resource Planning), Structured Software
EXPERIENCE:
Confidential
Senior Data Engineer
Responsibilities:
- Installed, configured, and maintained data pipelines; designed the business requirement collection approach based on the project scope and SDLC methodology.
- Designed distributed systems that manage Big Data using Hadoop and related technologies.
- Developed Spark programs to compare the performance of Spark with Hive and SQL.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Provided guidance to the development team working on PySpark as the ETL platform.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Transformed and analyzed data using PySpark and Hive based on ETL mappings.
- Created a Spark application to load data into dynamic-partition-enabled Hive tables (see the code sketch at the end of this role).
- Created Oozie Jobs for workflow of Spark, Sqoop and Shell scripts.
- Worked on transforming the queries written in Hive to Spark Application.
- Involved in creating Hive tables, loading and analyzing data using Hive Queries.
- Installed and configured Hive and Oozie on the Hadoop Cluster.
- Supported MapReduce Programs running on the cluster.
- Developed multiple MapReduce jobs for data cleaning and pre-processing.
- Involved in developing Hive DDLs to create, alter and drop tables.
- Designed, developed and implemented multi-tiered Splunk log collection solutions.
- Used Hunk to pull unstructured data from HDFS into the Splunk environment.
- Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
- Involved in creating Datasets/DataFrames/RDDs and transferring warehouse data into the HDFS file system.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Designed, constructed, and managed the Amazon Web Services Data Lake environment including the data ingestion, staging, data quality monitoring, and business modeling.
- Experience in designing high availability, scalable, fault-tolerant AWS Cloud platform.
- Handled operations and maintenance support for AWS cloud resources which includes launching, maintaining, and troubleshooting EC2 instances, S3 buckets, Auto Scaling, Dynamo DB, AWS IAM, Elastic Load Balancers (ELB) and Relational Database Services (RDS).
- Created data snapshots stored in AWS S3; performed export and import of data into Amazon S3 from multiple data sources.
- Experience writing and retrieving files to and from AWS S3 buckets so the UI can render data faster for requests involving complex, time-consuming server-side logic.
- Developed an API using AWS Lambda to manage servers and run code in AWS.
- Implemented AWS solutions using EC2, S3, RDS, EBS, Auto-scaling groups, optimized volumes and EC2 instances.
- Worked with JSON-based REST web services and Amazon Web Services (AWS).
Environment: AWS, Redshift, Map
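The following is a minimal, hypothetical PySpark sketch of the dynamic-partition Hive load mentioned above; the table and column names (staging_events, events_by_day, event_date) are placeholders, not the actual project objects.

```python
# Minimal sketch: load a DataFrame into a dynamic-partition-enabled Hive table.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-dynamic-partition-load")
    .enableHiveSupport()          # needed so Spark talks to the Hive metastore
    .getOrCreate()
)

# Enable non-strict dynamic partitioning so partition values come from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Source data prepared upstream; the partition column (event_date) must come last
df = spark.table("staging_events").select("user_id", "event_type", "amount", "event_date")

# Append into the partitioned Hive table; Hive resolves the partition per row
df.write.mode("append").insertInto("events_by_day")
```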
Confidential
Big Data Engineer
Responsibilities:
- Involved in the complete Big Data flow of the application, from upstream data ingestion into HDFS to processing and analyzing the data in HDFS.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Working with Data governance and Data quality to design various models and processes.
- Experience installing Apache Kafka.
- Involved in all steps and scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mapping for the MDM data model.
- Configured a ZooKeeper cluster.
- In charge of Kafka topic creation and management; worked on topic partitioning and replication.
- Developed features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
- Worked as L1 support on Jira requests for Kafka.
- Decommissioned and added nodes in the clusters for maintenance.
- Monitored cluster health by setting up alerts using Nagios and Ganglia.
- Designed a PoC for Confluent Kafka.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Experience setting up the whole app stack, and setting up and debugging Logstash to send Apache logs to AWS Elasticsearch.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Analyzed the SQL scripts and designed the solution for implementation in Scala; prepared configuration documentation for operating Kafka effectively.
- Created a producer application that sends API messages over Kafka (see the code sketch at the end of this role).
- Defined API security key and other necessary credentials to run Kafka architecture.
- Wrote Python code that tracks Kafka message delivery.
- Implemented the API key and credentials in the Python program.
- Designed both 3NF data models for OLTP systems and dimensional data models using Star and Snowflake schemas.
- Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server.
- Extensively used Tableau for customer marketing data visualization.
Environment: Apache Kafka, Hadoop, HDFS, MapReduce, Hive, PySpark, Scala, AWS, Zookeeper, Python, Cucumber, Lambda, GCP, Star Schema, Snowflake Schema, Elasticsearch, Oracle, SQL Server, NoSQL, Jupyter, OLTP, Unix, Shell Scripting, SSIS, Git.
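Below is a minimal, hypothetical sketch of the Python Kafka producer and delivery-tracking code described above, using the kafka-python client; the broker address, topic name, and message shape are placeholders.

```python
# Minimal sketch of a Kafka producer that publishes API messages as JSON
# and tracks delivery. Broker, topic, and message fields are hypothetical.
import json
from kafka import KafkaProducer   # kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
    acks="all",                   # wait for full acknowledgement from the brokers
)

message = {"endpoint": "/orders", "status": 200, "payload_id": "abc-123"}

# send() returns a future; get() blocks until the broker acknowledges delivery,
# which is one simple way to track whether a message actually landed
future = producer.send("api-messages", value=message)
meta = future.get(timeout=10)
print(f"delivered to {meta.topic} partition {meta.partition} offset {meta.offset}")

producer.flush()
```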
Confidential
Big Data Engineer
Responsibilities:
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap.
- Developed a data pipeline with Kafka and Spark.
- Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labelling, and all cleaning and conforming tasks.
- Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Developed solutions leveraging ETL tools and identified opportunities for process improvement using Informatica and Python.
- Contributed to designing the data pipeline with a Lambda architecture.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
- Involved in installation, configuration, supporting and managing Hadoop clusters, Hadoop cluster administration.
- Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 to generate interactive reports.
- Used Sqoop to channel data between HDFS and various RDBMS sources.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
- Created Tables, Stored Procedures, and extracted data using PL/SQL for business users whenever required.
- Used SSIS to build automated multi-dimensional cubes.
- Used Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS using Python and NoSQL databases such as HBase and Cassandra (see the code sketch at the end of this role).
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Prepared and uploaded SSRS reports; managed database and SSRS permissions.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
- Extensively worked with partitions, dynamic partitioning, and bucketing of tables in Hive; designed both managed and external tables and worked on optimization of Hive queries.
- Developed Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Used SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model.
Environment: AWS, Redshift, MapReduce, Cloudera, Kafka, Spark, Lambda, Hadoop, HBase, Scala, Sqoop, Tableau, Informatica, Python, Hive, PL/SQL, Oracle, T-SQL, SQL Server, NoSQL, Cassandra.
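The following is a minimal, hypothetical PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described above; the broker address, topic name, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Minimal sketch: read a Kafka topic with Spark Structured Streaming and persist to HDFS.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the topic; Kafka delivers key/value as binary, so cast value to string
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "learner-events")
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string").alias("raw_event"))
)

# Append each micro-batch to Parquet files on HDFS, with a checkpoint for fault tolerance
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/learner/raw")
    .option("checkpointLocation", "hdfs:///checkpoints/learner_raw")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```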
2/2016 - 8/2017
Confidential, Pennsylvania
Data Engineer
Responsibilities:
- Implemented reporting Data Warehouse with online transaction system data.
- Developed and maintained data warehouse.
- Provided reports and publications to Third Parties for Royalty payments.
- Managed user account, groups and workspace creation for different users in PowerCenter.
- Worked with PL/SQL procedures and used them in Stored Procedure Transformations.
- Extensively worked on Oracle and SQL Server.
- Wrote complex SQL queries against the ERP system for data analysis purposes.
- Worked on the most critical Finance projects and was the go-to person for any data-related issues for team members.
- Documented the code.
- Tuned ETL jobs in the new environment after fully understanding the existing code.
- Maintained Talend admin console and provided quick assistance on production jobs.
- Involved in designing Business Objects universes and creating reports.
- Involved in creating and modifying new and existing Web Intelligence reports.
- Created publications that split into various reports based on the specific vendor.
- Wrote Custom SQL for some complex reports.
- Worked with internal and external business partners during requirement gathering.
- Worked closely with Business Analyst and report developers in writing the source to target specifications for Data warehouse tables based on the business requirement needs.
- Exported data into Excel for business meetings, which made discussions easier while looking at the data (see the code sketch at the end of this role).
- Performed analysis after requirements gathering and walked team through major impacts.
- Provided and debugged crucial reports for finance teams during month end period.
- Addressed issue reported by Business Users in standard reports by identifying the root cause.
- Resolved reporting issues by identifying whether each was a report-related or source-related issue.
- Created ad hoc reports per users' needs; investigated, analyzed, and resolved any discrepancies found in the data.
Environment: Informatica Power Center 9.1/9.0, Talend 4.x & Integration suite, Business Objects XI, Oracle 10g/11g, Oracle ERP, EDI, SQL Server 2005, UNIX, Windows Scripting, JIRA
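Below is a minimal, hypothetical Python sketch of the ad hoc export step mentioned above: pulling a reporting query from Oracle into pandas and writing it to Excel. The connection details, query, table, and file names are placeholders, and the python-oracledb and openpyxl packages are assumed to be installed.

```python
# Minimal sketch: run a reporting query against Oracle and export the result to Excel.
# Connection string, SQL, and file name are hypothetical placeholders.
import pandas as pd
import oracledb  # python-oracledb, the successor to cx_Oracle

conn = oracledb.connect(user="report_user", password="***", dsn="erp-db:1521/ERPPDB")

royalty_sql = """
    SELECT vendor_id, vendor_name, SUM(royalty_amt) AS total_royalty
    FROM royalty_payments
    WHERE pay_period = :period
    GROUP BY vendor_id, vendor_name
"""

# Fetch rows with a plain DBAPI cursor and build a DataFrame from them
with conn.cursor() as cur:
    cur.execute(royalty_sql, {"period": "2017-06"})
    columns = [d[0] for d in cur.description]
    df = pd.DataFrame(cur.fetchall(), columns=columns)

# Write the result to a spreadsheet for the business review meeting
df.to_excel("royalty_summary_2017_06.xlsx", index=False, sheet_name="Royalties")

conn.close()
```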
Confidential
Hadoop/Data Engineer
Responsibilities:
- Worked with the Hadoop ecosystem covering HDFS, HBase, YARN, and MapReduce.
- Worked with the Oozie Workflow Engine to run workflow jobs with actions that launch Hadoop MapReduce, Hive, and Spark jobs.
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
- Responsible for design and development of advanced Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
- Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data.
- Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources.
- Automated the monthly data validation process to check the data for nulls and duplicates, and created reports and metrics to share with business teams.
- Used clustering techniques like K-means to identify outliers and classify unlabeled data.
- Performed data gathering, data cleaning, and data wrangling using Python.
- Transformed raw data into actionable insights by incorporating various statistical techniques, data mining, data cleaning, and data quality and integrity checks using Python (scikit-learn, NumPy, Pandas, and Matplotlib) and SQL.
- Calculated errors using various machine learning algorithms such as Linear Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, KNN, Decision Tree Regressor, SVM, Bagging Decision Trees, Random Forest, AdaBoost, and XGBoost, and chose the best model based on MAE (see the code sketch at the end of this role).
- Experimented with ensemble methods, using different bagging and boosting techniques to increase the accuracy of the trained model.
- Conducted model optimization and comparison using stepwise selection based on AIC values.
- Used Kibana, an open-source visualization layer for Elasticsearch, for analytics and data visualization.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Implemented a Python-based distributed random forest via Python streaming.
- Built models using Python and PySpark to predict the probability of attendance for various campaigns and events.
- Performed data visualization, designed dashboards, and generated complex reports, including charts, summaries, and graphs, to interpret findings for the team and stakeholders.
Environment: Hadoop, HDFS, HBase, Oozie, Spark, Machine Learning, Big Data, Python, PySpark, DB2, MongoDB, Elastic Search, Web Services.
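The following is a minimal scikit-learn sketch of the model-selection step described above: fit several regressors and keep the one with the lowest mean absolute error on a held-out split. The synthetic dataset and the particular candidate models shown are illustrative stand-ins, not the actual project data or full model list.

```python
# Minimal sketch: compare several regressors by MAE and keep the best one.
# The synthetic dataset is a placeholder for the real features.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# Fit each candidate and score it by mean absolute error; lower is better
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

best = min(scores, key=scores.get)
print({name: round(mae, 2) for name, mae in scores.items()})
print(f"best model by MAE: {best}")
```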
