Data Engineer Resume
SUMMARY
- Over 5 years of IT experience, currently working in a big data capacity with the Hadoop ecosystem across on-premises and cloud-based platforms.
- Excellent working knowledge of Hadoop, Hive, Sqoop, Pig, HBase, and Oozie in real-time environments; worked on many modules for performance improvement and architecture design.
- Experience in machine learning and data mining with large structured and unstructured datasets, including data acquisition, data validation, and predictive modeling.
- Experience implementing big data engineering, cloud data engineering, data warehouse, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experience in data transformation, source-to-target data mapping across database schemas, and data cleansing procedures.
- Adept in programming languages such as Python and in big data technologies such as Hadoop and Hive.
- Experienced in normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal performance in OLTP and OLAP environments.
- Excellent knowledge of relational database design and data warehouse/OLAP concepts and methodologies.
- Experience designing star and snowflake schemas for data warehouse and ODS architectures.
- Expertise in OLTP/OLAP system study, analysis, and E-R modeling, developing database schemas such as star and snowflake schemas used in relational, dimensional, and multidimensional modeling.
- Experience designing, building, and implementing a complete Hadoop ecosystem comprising MapReduce, HDFS, Hive, Impala, Pig, Sqoop, Oozie, HBase, MongoDB, and Spark.
- Experience coding SQL for developing procedures, triggers, and packages.
- Experience writing Spark Streaming and Spark batch jobs and using Spark MLlib for analytics.
- Defined real-time data streaming solutions across the cluster using Spark Streaming, Apache Storm, Kafka, NiFi, and Flume, with strong experience in real-time streaming analytics.
- Experience developing data pipelines that use Kafka to store data in HDFS.
- Strong work experience with Kafka streaming to ingest data in real time or near real time.
- Experienced with Spark Core, RDDs, pair RDDs, and Spark deployment architectures.
- Extracted DStreams from near-real-time queue data using Kafka and Spark Streaming with Scala; development experience with Presto on EMR, Spark Core, SparkContext, the DataFrame API, RDDs, and Spark SQL, using build tools such as Maven and SBT.
- Wrote a Kafka-Spark Streaming module that acts as a Kafka consumer and executes business logic on trades using Spark DStreams and RDD operations (a sketch of this pattern follows this summary).
- Loaded IoT data from Hive tables into MemSQL using Kafka/MapR Streams.
- Used Airflow to schedule Hive, Spark, and MapReduce jobs.
- Scheduled Airflow DAGs to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Experience developing Airflow workflows for scheduling and orchestrating the ETL process
- Experience importing and exporting data with Sqoop between HDFS and relational database systems (Oracle, DB2, and SQL Server).
- Experienced in data analysis, design, development, implementation, and testing using data conversion, extraction, transformation, and loading (ETL) with SQL Server, Oracle, and other relational and non-relational databases.
- Experience building automated data pipelines using Apache Spark and Apache Hadoop.
- Proficient in writing complex Spark (PySpark) user-defined functions (UDFs), Spark SQL, and HiveQL.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats and uncover insights into customer usage patterns.
- Good working knowledge of big data on AWS cloud services (EC2, S3, EMR, DynamoDB, and Redshift) with Apache Spark.
- Hands-on experience in machine learning, big data, data visualization, Python development, Java, Linux, SQL, and Git/GitHub.
- Experience with data visualization using tools such as ggplot, Matplotlib, Seaborn, and Tableau, including publishing and presenting dashboards and storylines on web and desktop platforms with Tableau.
- Experienced in Python data manipulation for loading and extraction, and with Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computation.
- Experienced working on NoSQL databases like MongoDB and HBase.
- Technical proficiency in designing and data modeling for online applications, data warehouses, and business intelligence applications.
- Worked with and extracted data from various database sources such as Oracle, SQL Server, and DB2.
- Extensive working experience with Python, including scikit-learn, SciPy, Pandas, and NumPy, for developing machine learning models and manipulating and handling data.
- Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using Python.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Expertise in complex data design and development, master data, and metadata, with hands-on experience in data analysis and in planning, coordinating, and executing work on records and databases.
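A minimal sketch of the Kafka-to-Spark consumer pattern referenced above, shown here with the Structured Streaming API rather than DStreams for brevity; the topic, schema, business rule, and HDFS paths are hypothetical placeholders, not a record of a specific production job.

```python
# Minimal sketch: consume trade events from Kafka with Spark Structured Streaming,
# apply a simple business rule, and write the result to HDFS as Parquet.
# Topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("trade-stream").getOrCreate()

trade_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("quantity", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "trades")
       .load())

trades = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", trade_schema).alias("t"))
          .select("t.*")
          .withColumn("notional", F.col("price") * F.col("quantity")))  # example business rule

query = (trades.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/trades/parquet")
         .option("checkpointLocation", "hdfs:///checkpoints/trades")
         .outputMode("append")
         .start())
```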
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Worked as a Data Engineer to review business requirements and compose source-to-target data mapping documents.
- Followed Agile development methodology as an active member of scrum meetings.
- Performed data profiling and merged data from multiple data sources.
- Involved in big data requirements analysis and the design and development of solutions for ETL and business intelligence platforms.
- Designed 3NF data models for ODS and OLTP systems and dimensional data models using star and snowflake schemas.
- Worked in a Snowflake environment to remove redundancy and loaded real-time data from various sources into HDFS using Kafka.
- Developed a data warehouse model in Snowflake for over 100 datasets.
- Designed and implemented a fully operational, production-grade, large-scale data solution on the Snowflake data warehouse.
- Worked with structured and semi-structured data ingestion and processing on AWS using S3 and Python; migrated on-premises big data workloads to AWS.
- Designed data aggregations in Hive for ETL processing on Amazon EMR to process data per business requirements.
- Migrated data from existing RDBMSs to Hadoop using Sqoop for processing and evaluated the performance of various algorithms, models, and strategies on real-world datasets.
- Implemented data validation using MapReduce programs to remove unnecessary records before moving data into Hive tables.
- Created Hive tables for loading and analyzing data and developed Hive queries to process data and generate data cubes for visualization.
- Extracted data from HDFS using Hive and Presto, performed data analysis using Spark with Scala and PySpark, and performed feature selection and built nonparametric models in Spark.
- Handled importing data from various sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Worked on an enterprise messaging bus with a Kafka-TIBCO connector; abstracted published queues using Spark DStreams and parsed XML and JSON data in Hive.
- Designed and configured a Kafka cluster to accommodate a heavy throughput of one million messages per second; used the Kafka 0.6.3 producer APIs to produce messages.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
- Used Kafka brokers, initialized the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
- Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
- Developed Spark Structured Streaming jobs to read data from Kafka in real-time and batch modes, apply different change data capture (CDC) modes, and load the data into Hive.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis.
- Integrated AWS Kinesis with an on-premises Kafka cluster.
- Implemented data ingestion and cluster handling for real-time processing using Kafka.
- Developed Sqoop and Kafka jobs to load data from RDBMSs and external systems into HDFS and Hive.
- Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Integrated Apache Storm with Kafka to perform web analytics and uploaded clickstream data from Kafka to HDFS, HBase, and Hive through Storm.
- Created StreamSets pipelines for event logs using Kafka, StreamSets Data Collector, and Spark Streaming in cluster mode, customized with mask plugins and filters, and distributed existing Kafka topics across applications using StreamSets Control Hub.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Developed Python scripts to automate the ETL process using Apache Airflow, as well as cron scripts on Unix (see the DAG sketch after this list).
- Captured otherwise unused unstructured data and stored it in HDFS and HBase/MongoDB; scraped data using Beautiful Soup and saved it into MongoDB as JSON (see the scraping sketch after this list).
- Worked with Apache Airflow and Genie to automate jobs on EMR.
- Worked on AWS S3 buckets and secure intra-cluster file transfer between PNDA and S3.
- Used the Amazon EC2 command line interface along with Python to automate repetitive work.
- Designed and implemented data marts, coordinated with DBAs, and generated and used DDL and DML.
- Provided data architecture support for enterprise data management efforts, such as development of the enterprise data model and master and reference data, as well as project support, such as development of physical data models, data warehouses, and data marts.
- Created Databricks notebooks using SQL and Python and automated notebooks using jobs.
- Created Spark clusters and configured high-concurrency clusters using Azure Databricks to speed up the preparation of high-quality data.
- Worked extensively on the migration of different data products from Oracle to Azure
- Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Worked with data governance, data quality, data lineage, and data architecture teams to design various models and processes.
- Independently coded new programs and designed tables to load and test programs effectively for given POCs using big data/Hadoop.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
- Worked on Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib.
- Built and analyzed datasets using SAS and Python, and designed data models and data flow diagrams using Erwin and MS Visio.
- Used Kibana, an open-source plugin for Elasticsearch, for analytics and data visualization.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms for predictive modeling in R and Python.
- Implemented a Python-based distributed random forest via Python streaming
- Utilized machine learning algorithms such as linear regression, multivariate regression, naive Bayes, random forests, k-means, and KNN for data analysis.
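A minimal sketch of the kind of Airflow DAG used to schedule the Hive and Spark ETL steps described above, assuming Airflow 2.x import paths; the DAG id, schedule, script paths, and commands are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch: a daily Hive load followed by a Spark aggregation.
# DAG id, schedule, and the commands/paths are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    load_hive = BashOperator(
        task_id="load_hive_tables",
        bash_command="hive -f /opt/etl/load_staging.hql",
    )
    spark_aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command="spark-submit --master yarn /opt/etl/aggregate.py {{ ds }}",
    )
    # Run the Spark aggregation only after the Hive load succeeds.
    load_hive >> spark_aggregate
```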
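A minimal sketch of the scraping-to-MongoDB step: fetch a page, parse it with Beautiful Soup, and insert the records into MongoDB as JSON-like documents. The URL, CSS selectors, and database/collection names are hypothetical placeholders.

```python
# Minimal sketch: scrape a page with Beautiful Soup and store records in MongoDB.
# URL, selectors, and database/collection names are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

resp = requests.get("https://example.com/articles", timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Each scraped article becomes one JSON-like document.
docs = [
    {"title": a.get_text(strip=True), "url": a["href"]}
    for a in soup.select("article a.title")
]

client = MongoClient("mongodb://localhost:27017")
collection = client["scraped"]["articles"]
if docs:
    collection.insert_many(docs)
```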
Environment: Python, SQL, CSV/XML files, Oracle, JSON, Cassandra, MongoDB, AWS, Azure, Databricks, Snowflake, Hadoop, Hive, MapReduce, Scala, Spark, J2EE, Agile, Apache Avro, Apache Maven, Airflow, Kafka, MLlib, regression, Docker, Tableau, Git, Jenkins.
Data Engineer
Confidential
Responsibilities:
- As a Data Engineer, provided technical expertise in Hadoop technologies as they relate to the development of analytics.
- Responsible for the planning and execution of big data analytics, predictive analytics, and machine learning initiatives.
- Gained very good hands-on experience in advanced big data technologies such as the Spark ecosystem (Spark SQL, MLlib, SparkR, and Spark Streaming), Kafka, and predictive analytics (MLlib and R ML packages, including the H2O ML library).
- Designed and developed Spark jobs to perform ETL on large volumes of medical membership and claims data.
- Created Airflow scheduling scripts in Python.
- Imported real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Developed machine learning, statistical analysis, and data visualization applications for challenging data processing problems.
- Compiled data from various sources, public and private databases, to perform complex analysis and data manipulation for actionable results.
- Designed and developed Natural Language Processing models for sentiment analysis.
- Worked on natural language processing with the Python NLTK module to develop an automated customer response application.
- Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and big data environments.
- Performed predictive modeling with tools in SAS, SPSS, and Python.
- Applied concepts of probability, distributions, and statistical inference to the given datasets to unearth interesting findings through the use of comparisons, t-tests, F-tests, R-squared, p-values, etc.
- Applied linear regression, multiple regression, ordinary least squares, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes and naive Bayes methods, fitting functions, etc. to data with the help of Python's scikit-learn, SciPy, NumPy, and Pandas modules.
- Applied clustering algorithms (hierarchical and k-means) with the help of scikit-learn and SciPy.
- Developed visualizations and dashboards using ggplot2 and Tableau.
- Worked on the development of data warehouse, data lake, and ETL systems using relational and non-relational (SQL and NoSQL) tools.
- Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
- Applied linear regression in Python and SAS to understand the relationships between different dataset attributes and the causal relationships between them.
- Performed complex pattern recognition on financial time series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
- Used Cloudera Hadoop YARN to perform analytics on data in Hive.
- Wrote Hive queries for data analysis to meet the business requirements.
- Expertise in Business Intelligence and data visualization using Tableau.
- Expert in Agile and Scrum processes.
- Validated macroeconomic data (e.g., BlackRock, Moody's) and performed predictive analysis of world markets using key indicators in Python and machine learning concepts such as regression, bootstrap aggregation, and random forests.
- Worked on setting up AWS EMR clusters to process monthly workloads
- Wrote PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process (see the sketch after this list).
- Wrote Spark SQL and PySpark scripts in the Databricks environment to validate the monthly account-level customer data stored in S3.
- Worked in large-scale database environments such as Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
- Interfaced with large-scale database systems through an ETL server for data extraction and preparation.
- Identified patterns, data quality issues, and opportunities, and leveraged insights by communicating them to business partners.
- Performed source system analysis, database design, and data modeling for the warehouse layer using MLDM concepts and for the package layer using dimensional modeling.
- Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem)
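A minimal sketch of the PySpark UDF and Spark SQL validation pattern referenced above: a UDF applying one business rule, plus a Spark SQL check over monthly account-level data read from S3. The bucket path, column names, and the rule itself are hypothetical placeholders.

```python
# Minimal sketch: a PySpark UDF applying a business rule, plus a Spark SQL check
# used to validate monthly account-level data read from S3. Paths, column names,
# and the rule are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("monthly-validation").getOrCreate()

accounts = spark.read.parquet("s3://example-bucket/accounts/month=2021-06/")

@F.udf(returnType=StringType())
def account_tier(balance):
    """Example business rule: bucket accounts by balance."""
    if balance is None:
        return "unknown"
    return "premium" if balance >= 100000 else "standard"

enriched = accounts.withColumn("tier", account_tier(F.col("balance")))
enriched.createOrReplaceTempView("accounts_enriched")

# Spark SQL validation: flag duplicate account ids and null balances.
issues = spark.sql("""
    SELECT account_id,
           COUNT(*) AS row_count,
           SUM(CASE WHEN balance IS NULL THEN 1 ELSE 0 END) AS null_balances
    FROM accounts_enriched
    GROUP BY account_id
    HAVING COUNT(*) > 1 OR SUM(CASE WHEN balance IS NULL THEN 1 ELSE 0 END) > 0
""")
issues.show(truncate=False)
```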
Environment: Spark, Airflow, machine learning, AWS, MS Azure, Cassandra, Avro, HDFS, GitHub, Hive, Pig, Linux, Python (scikit-learn/SciPy/NumPy/Pandas), SAS, SPSS, MySQL, Bitbucket, Eclipse, XML, PL/SQL, SQL connector, JSON, Tableau, Jenkins.
Software Engineer - Data Analyst/Data Engineer
Confidential
Responsibilities:
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce
- Worked with the Oozie workflow engine to run workflow jobs with actions that execute Hadoop MapReduce, Hive, and Spark jobs.
- Performed data mapping and data design (data modeling) to integrate data from multiple databases into the EDW.
- Responsible for the design and development of advanced Python programs to prepare, transform, and harmonize datasets in preparation for modeling.
- Gained hands-on experience with Hadoop/big data technologies for the storage, querying, processing, and analysis of data.
- Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment for big data resources.
- Automated the monthly data validation process to check the data for nulls and duplicates, and created reports and metrics to share with business teams.
- Used clustering techniques such as k-means to identify outliers and classify unlabeled data.
- Performed data gathering, data cleaning, and data wrangling using Python.
- Transformed raw data into actionable insights by applying statistical techniques, data mining, data cleaning, and data quality and integrity checks using Python (scikit-learn, NumPy, Pandas, and Matplotlib) and SQL.
- Calculated errors using various machine learning algorithms such as linear regression, ridge regression, lasso regression, elastic net regression, KNN, decision tree regressors, SVM, bagged decision trees, random forest, AdaBoost, and XGBoost, and chose the best model based on MAE (see the sketch after this list).
- Experimented with ensemble methods, using different bagging and boosting techniques to increase model accuracy.
- Identified target groups by conducting segmentation analysis using clustering techniques such as k-means.
- Conducted model optimization and comparison using stepwise selection based on AIC values.
- Used cross-validation to test models with different batches of data, optimizing the models and preventing overfitting.
- Collaborated with various business teams (operations, commercial, innovation, HR, logistics, safety, environmental, and accounting) to analyze and understand changes in key financial metrics and provide ad-hoc analysis that could be leveraged to build long-term points of view on where value can be captured.
- Explored and analyzed customer-specific features using Matplotlib and Seaborn in Python and dashboards in Tableau.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Used Kibana, an open-source plugin for Elasticsearch, for analytics and data visualization.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS; implemented a Python-based distributed random forest via Python streaming.
- Experimented with multiple classification algorithms, such as logistic regression, support vector machines (SVM), random forest, AdaBoost, and gradient boosting, using Python scikit-learn, and evaluated performance on customer discount optimization across millions of customers.
- Built models using Python and PySpark to predict the probability of attendance for various campaigns and events.
- Implemented classification algorithms such as logistic regression, k-nearest neighbors, and random forests to predict customer churn and customer interface.
- Performed data visualization, designed dashboards, and generated complex reports, including charts, summaries, and graphs, to interpret findings for the team and stakeholders.
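A minimal sketch of the model-selection step referenced above: compare a few regressors by cross-validated MAE and keep the best one. The feature matrix and target would come from the prepared dataset; the synthetic data and candidate list here are only illustrative.

```python
# Minimal sketch: compare regressors by 5-fold cross-validated MAE and pick the best.
# The synthetic X, y and the candidate list are placeholders for illustration only.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
}

# scikit-learn reports negated MAE for scoring, so flip the sign.
scores = {
    name: -cross_val_score(model, X, y,
                           scoring="neg_mean_absolute_error", cv=5).mean()
    for name, model in candidates.items()
}

best_name = min(scores, key=scores.get)
print(scores)
print("best model by MAE:", best_name)
```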
Environment: Hadoop, HDFS, HBase, Oozie, Spark, machine learning, big data, Python, PySpark, DB2, MongoDB, Elasticsearch, web services.
Data Engineer
Confidential
Responsibilities:
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized algorithms such as linear regression, multivariate regression, naive Bayes, random forests, k-means, and KNN for data analysis.
- Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and big data environments.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Created and modified several database objects, such as tables, views, indexes, constraints, stored procedures, packages, functions, and triggers, using SQL and PL/SQL.
- Created large datasets by combining individual datasets with various inner and outer joins in SQL, and performed dataset sorting and merging.
- Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem)
- Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into a Netezza database.
- Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for the client.
- Tested complex ETL mappings and sessions, based on business user requirements and business rules, that load data from source flat files and RDBMS tables into target tables.
- Hands-on experience with database design, relational integrity constraints, OLAP, OLTP, cubes, and database normalization (3NF) and denormalization.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
- Worked on customer segmentation using an unsupervised learning technique, clustering (see the sketch after this list).
- Worked with various Teradata 15 tools and utilities, such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
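A minimal sketch of the customer segmentation step referenced above: scale a few behavioral features and cluster customers with k-means. The column names, tiny example frame, and number of clusters are hypothetical placeholders.

```python
# Minimal sketch: k-means customer segmentation on scaled behavioral features.
# Column names, the tiny example frame, and k are hypothetical placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# In practice the customer table would be loaded from the warehouse;
# this small frame is only for illustration.
customers = pd.DataFrame({
    "annual_spend": [1200, 300, 8000, 450, 9500, 700],
    "visits_per_month": [4, 1, 12, 2, 15, 3],
    "avg_basket_size": [30, 25, 66, 22, 63, 35],
})

features = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(features)

# Inspect the segment profiles.
print(customers.groupby("segment").mean())
```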
Environment: Erwin, Python, SQL, SQL Server, Informatica, SSRS, PL/SQL, T-SQL, Tableau, MLlib, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, OLAP, Azure, MariaDB, SAP CRM, HDFS, SVM, JSON, Tableau, XML, AWS.