Big Data Engineer Resume
Chicago, Illinois
SUMMARY
- Senior Hadoop developer with 7+ years of professional IT experience, including 4+ years as a Big Data consultant working with Hadoop ecosystem components for ingestion, data modeling, querying, processing, storage, analysis, data integration and implementation of enterprise-level Big Data systems.
- A skilled developer with strong problem solving, debugging and analytical capabilities, who actively engages in understanding customer requirements.
- Ability to work independently and collaboratively and to communicate effectively with non-technical coworkers.
- Experience in installing, configuring and using Apache Hadoop ecosystem components like Hadoop Distributed File System (HDFS), MapReduce, YARN, Spark, NiFi, Pig, Hive, Flume, HBase, Oozie, Zookeeper, Sqoop, Scala.
- Hands-on experience in creating real-time data streaming solutions using Apache Spark core, Spark SQL and DataFrames, Kafka, Spark Streaming and Apache Storm (a minimal ingestion sketch follows this summary).
- Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including NameNode, DataNode, ResourceManager, NodeManager and JobHistoryServer.
- Expertise in administering the Hadoop Cluster using Hadoop Distributions like Apache Hadoop & Cloudera.
- Proficient working on NoSQL technologies like HBase, Cassandra and MongoDB.
- Extensive experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS)
- Experience in data warehousing concepts such as Star, Galaxy and Snowflake schemas, data marts and the Kimball methodology used in relational and multidimensional data modeling.
- Worked on extensive migration of Hadoop and Spark Clusters to GCP, AWS and Azure.
- Used Kafka and Spark Streaming for data ingestion and cluster handling in real time processing.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Extensive experience in the implementation of Continuous Integration (CI), Continuous Delivery and Continuous Deployment (CD) on various Java based Applications using Jenkins, TeamCity, Azure DevOps, Maven, Git, Nexus, Docker and Kubernetes.
- Proficient at using Spark APIs to explore, cleanse, aggregate, transform and store machine sensor data.
- Experience in creating Data frames using PySpark and performing operation on the Data frames using Python.
- Developed ETL/Hadoop-related Java code, created RESTful APIs using the Spring Boot framework, developed web apps using Spring MVC and JavaScript, and developed a coding framework.
- Experience in developing POCs using Scala, Spark SQL and MLlib, then deploying them on a YARN cluster.
- Excellent Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Good understanding and experience in Machine Learning algorithms and techniques like Classification, Clustering, Regression, Decision Trees, Random Forest, NLP, ANOVA, SVMs, Artificial Neural Networks.
- Expertise in SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS) tools and in development of T-SQL, Oracle PL/SQL Scripts, Stored Procedures and Triggers for business logic implementation.
- Experience in creating interactive Dashboards and Creative Visualizations using tools like Tableau, Power BI
- Hands-on experience with Microsoft Azure components like HDInsight, Data Factory, Data Lake Storage, Blob, Cosmos DB.
- Extensive skills in Linux and UNIX shell commands.
- Excellent working experience in Scrum/Agile and Waterfall project execution methodologies.
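A minimal, illustrative PySpark Structured Streaming sketch of the Kafka-based real-time ingestion described in this summary; the broker address, topic, schema and output paths are hypothetical placeholders, not details from any specific engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

# Hypothetical schema for the incoming JSON messages.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read a stream from Kafka (broker and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor-events")
       .load())

# Kafka values arrive as bytes; parse the JSON payload into columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write the parsed stream to Parquet on HDFS with checkpointing.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/sensor_events")
         .option("checkpointLocation", "hdfs:///checkpoints/sensor_events")
         .outputMode("append")
         .start())
```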
TECHNICAL SKILLS
Big Data Eco-system: HDFS, MapReduce, Spark, YARN, Hive, Pig, HBase, Sqoop, Flume, Kafka, Oozie, ZooKeeper, Impala
Hadoop Technologies: Apache Hadoop 1.x, Apache Hadoop 2.x, Cloudera CDH4/CDH5, Hortonworks
Programming Languages: Python, Scala, Shell Scripting, HiveQL
Machine Learning: Regression, Decision Tree, Clustering, Random Forest, Classification, SVM, NLP
Operating Systems: Windows (XP/7/8/10), Linux (Ubuntu, Centos)
NoSQL Databases: HBase, Cassandra, MongoDB
Databases (RDBMS): MySQL, Teradata, DB2, Oracle
Container/Cluster Managers: Docker, Kubernetes
BI Tool: Tableau, Power BI
Cloud: AWS, Azure
Web Development: HTML, XML, CSS
IDE Tools: Eclipse, Jupyter, Anaconda, PyCharm
Development Methodologies: Agile, Waterfall
PROFESSIONAL EXPERIENCE
Confidential - Chicago, Illinois
Big Data Engineer
Responsibilities:
- Designed and developed Extract, Transform, and Load (ETL) code using Informatica mappings to load data from heterogeneous source systems (flat files, XML, MS Access files, Oracle) into Oracle staging tables, then into the data warehouse and finally into data mart tables for reporting.
- Created Data mappings, Tech Design, loading strategies for ETL to load newly created or existing tables.
- Worked with Kafka for building robust and fault tolerant data Ingestion pipeline for transporting streaming data into HDFS and implemented Kafka Custom encoders for custom input format to load data into Kafka Partitions.
- Set up a Kafka broker for Structured Streaming so that structured data could be consumed against a defined schema.
- Developed an Elasticsearch connector using the Kafka Connect API, with Kafka as the source and Elasticsearch as the sink.
- Worked on performance tuning of Apache NiFi workflows to optimize data ingestion speeds.
- Integrated Map Reduce with HBase to import bulk amount of data into HBase using Map Reduce Programs.
- Developed numerous MapReduce jobs for Data Cleansing and Analyzing Data in Impala.
- Designed an appropriate partitioning/bucketing schema in Hive for efficient data access during analysis, designed a data warehouse using Hive external tables and created Hive queries for analysis (see the partitioning sketch at the end of this role).
- Configured Hive meta store with MySQL to store the metadata for Hive tables and used Hive to analyze data ingested into HBase using Hive-HBase integration.
- Worked on migration of an existing feed from Hive to Spark to reduce latency of feeds in existing HiveQL.
- Developed Oozie Workflows for daily incremental loads to get data from Teradata and import into Hive tables.
- Developed a pipeline using Hive (HQL) and SQL to retrieve data from the Hadoop cluster and the Oracle database, and used Extract, Transform, and Load (ETL) processes for data transformation.
- Worked with Flume for building fault tolerant data Ingestion pipeline for transporting streaming data into HDFS.
- Installed and configured Hive, wrote Hive UDFs and used Piggybank, a repository of UDFs for Pig Latin.
- Applied advanced Spark techniques such as text analytics and in-memory processing.
- Used the Spring Framework for dependency injection and integrated it with Hibernate.
- Developed data models and data migration strategies utilizing concepts of snowflake schema.
- Worked on data pre-processing and cleaning the data to perform feature engineering and performed data imputation techniques for the missing values in the dataset using Python.
- Implemented a data interface to retrieve customer information using REST APIs, pre-processed the data using MapReduce 2.0 and stored it in HDFS (Hortonworks).
- Set up Docker to automate container deployment through Jenkins, worked with Docker Hub, built Docker images and maintained various images, primarily for middleware installations.
- Used Jenkins pipelines to drive all microservices builds out to the Docker registry and then deployed to Kubernetes.
- Used Spark SQL to load Parquet data, created Datasets defined by case classes and handled structured data with Spark SQL, finally storing the results into Hive tables for downstream consumption.
- Developed and deployed a Spark application using PySpark to compute a popularity score for all content and loaded the results into Elasticsearch for the app content management team to consume.
- Used a microservices architecture, with Spring Boot based services interacting through REST APIs.
- Used Tableau to convey the results by using dashboards to communicate with team members and with other data science teams, marketing and engineering teams.
- Generated data cubes using Hive, Pig and Java MapReduce on a provisioned Hadoop cluster in AWS.
- Tuned the performance of Tableau dashboards and reports built on large data sources.
- Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3).
- Performed AWS data migrations between database platforms, such as local SQL Server to Amazon RDS and EMR Hive, and managed and reviewed Hadoop log files in AWS S3.
- Provided support on AWS Cloud infrastructure automation with multiple tools including Gradle, Chef, Nexus, Docker and monitoring tools such as Splunk and CloudWatch.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Worked extensively with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).
- Worked on AWS EC2, IAM, S3, Lambda, EBS, Elastic Load Balancer (ELB) and Auto Scaling group services.
- Involved in Agile methodologies, daily Scrum meetings and Sprint planning, with strong experience in the SDLC.
Environment: Hadoop 2.x, HDFS, MapReduce, Apache Spark, Spark SQL, Spark Streaming, Scala, Java, Spring, Pig, Hive, Oozie, Sqoop, Kafka, Flume, Nifi, Zookeeper, Informatica, Databricks, MongoDB, AWS, Python, Linux, Snowflake, Tableau.
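A minimal PySpark sketch of the Hive partitioning pattern referenced in this role; the database, table, columns and HDFS paths are assumed for illustration only, and bucketing is omitted to keep the example simple.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts from Spark into the Hive table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hypothetical external table partitioned by load date.
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.transactions (
        customer_id STRING,
        amount      DOUBLE,
        txn_ts      TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/analytics/transactions'
""")

# Load one day's staged data and append it into the matching partition.
# insertInto resolves columns by position, so the partition column goes last.
staged = (spark.read.parquet("hdfs:///staging/transactions/2020-01-01")
          .select("customer_id", "amount", "txn_ts")
          .withColumn("load_date", lit("2020-01-01")))
staged.write.mode("append").insertInto("analytics.transactions")
```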
Confidential - Chicago, Illinois
Big Data Engineer
Responsibilities:
- Worked on the Talend ETL tool, using features like context variables and database components such as tOracleInput, tOracleOutput, tFileCompare, tFileCopy and tOracleClose.
- Extracted data from the legacy system and loaded/integrated into another database through the ETL process.
- Transferred data from different data sources into HDFS systems using Kafka producers, consumers, Kafka brokers and used Zookeeper as built coordinator between different brokers in Kafka.
- Used Kafka and Spark Streaming for data ingestion and cluster handling in real time processing.
- Developed flow XML files using Apache NiFi, a workflow automation tool, to ingest data into HDFS.
- Developed integration checks around the PySpark framework for processing of large datasets.
- Worked on migration of the PySpark framework into AWS Glue for enhanced processing.
- Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse and write-back tools.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Managed Hadoop metadata by extracting and maintaining metadata from Hive tables with HiveQL.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, RDDs and Spark on YARN.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Spark and Sqoop jobs.
- Implemented workflows using Apache Oozie framework to automate tasks.
- Imported and exported data into HDFS and Hive/Impala tables from Relational Database Systems using Sqoop.
- Developed data pipeline using Flume, Sqoop, and Pig to extract the data from weblogs and store in HDFS.
- Collected and aggregated large amounts of log data using Flume and tagging data in HDFS for further analysis.
- Developed Java Map Reduce programs for the analysis of sample log file stored in cluster.
- Developed Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
- Developed CI/CD system with Jenkins on Kubernetes container environment, utilizing Kubernetes and Docker for the CI/CD system to build, test and deploy.
- Configured Azure Container Registry for building and publishing Docker container images and deployed them into Azure Kubernetes Service (AKS).
- Developed a Spark job which indexes data into ElasticSearch from external Hive tables which are in HDFS.
- Worked on migrating application into REST based Microservices to provide all CRUD capabilities using Spring Boot. Wrote Microservices to export/import data and task scheduling using Spring Boot and Hibernate.
- Utilized machine learning algorithms such as linear regression, Naive Bayes, random forests and KNN for data analysis.
- Performed sentiment analysis in Python by implementing NLP techniques (web scraping, text vectorization, data wrangling, bag of words, TF-IDF scoring) to compute sentiment scores and analyze reviews (see the TF-IDF sketch at the end of this role).
- Performed data wrangling, data imputation and EDA using pandas, NumPy, scikit-learn and Matplotlib in Python.
- Extensively used Agile methodology as the organization standard to implement the data models.
Environment: Hadoop 2.x, HDFS, MapReduce, Pyspark, Spark SQL, ETL, Hive, Pig, Oozie, Databricks, Java, Spring, Sqoop, Azure, Star Schema, Python, Nifi, Cassandra, Scala, Power BI, Machine Learning.
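An illustrative scikit-learn sketch of the TF-IDF based review scoring mentioned in this role; the sample reviews and labels are made up, and the classifier choice is an assumption for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy review corpus with made-up labels (1 = positive, 0 = negative).
reviews = [
    "great product, works exactly as described",
    "terrible experience, arrived broken",
    "decent value for the price",
    "would not recommend, very disappointed",
]
labels = [1, 0, 1, 0]

# TF-IDF vectorization followed by a simple linear classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(reviews, labels)

# Sentiment score = predicted probability of the positive class.
new_reviews = ["fast shipping and easy to use"]
scores = model.predict_proba(new_reviews)[:, 1]
print(dict(zip(new_reviews, scores.round(3))))
```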
Confidential
Data Engineer
Responsibilities:
- Loaded and transformed large sets of structured, semi-structured and unstructured data using the Hadoop/Big Data ecosystem.
- Integrated Oozie with Pig, Hive, Sqoop and developed Oozie workflow for scheduling and orchestrating the Extract, Transform, and Load (ETL) process within the Cloudera Hadoop.
- Implemented Kafka High level consumers to get data from Kafka partitions and move into HDFS.
- Managed workflow and scheduling for complex MapReduce jobs using Apache Oozie.
- Designed physical and logical data models based on Relational (OLTP), Dimensional (OLAP) on snowflake schema using Erwin modeler to build an integrated enterprise data warehouse.
- Created Hive tables and wrote Hive queries for data analysis to meet business requirements, and used Sqoop to import and export data from Oracle and MySQL.
- Extended Spark, Hive and Pig functionality by writing custom UDFs and hooking them into larger Spark applications to be used as in-line functions.
- Worked with Apache Spark, which provides a fast engine for large-scale data processing, integrated with Scala.
- Developed quality check modules in PySpark and SQL to validate data in the data lake and automated the process to trigger the modules before data ingestion (see the data-quality sketch at the end of this section).
- Developed multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats, including XML, JSON and CSV.
- Built pipelines using Azure Data Factory and moved data into Azure Data Lake Store.
- Worked on development of the Confidential data lake and on building the Confidential data cube on a Microsoft Azure HDInsight cluster.
- Used HiveQL to analyze data and create summarized datasets for consumption in Power BI.
- Performed data cleaning and handled missing values in Python using backward-forward filling methods and applied Feature engineering, Feature normalize, & Label encoding techniques using scikit-learn preprocessing.
- Worked on object-oriented Python code for quality, logging, monitoring, debugging and code optimization.
- Developed Batch processing solutions with Azure Databricks and Azure Event.
- Analyzed, designed and built modern data solutions using Azure PaaS services to support visualization of data.
- Addressed overfitting and underfitting by tuning the hyperparameters of machine learning algorithms using Lasso and Ridge regularization, and used Git to coordinate team development.
- Experience in designing Azure cloud architecture and implementation plans for hosting complex application workloads on MS Azure.
- Applied various machine learning algorithms and statistical modeling techniques (decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, clustering) to identify volume, using the scikit-learn package in Python and MATLAB.
- Working knowledge of Amazon's Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Supported various reporting teams and worked with the data visualization tools Tableau and Power BI.
- Performed Exploratory Data Analysis (EDA) to maximize insights from the dataset, detect outliers and extract important variables through graphical and numerical visualizations.
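A short PySpark sketch of the kind of data-quality validation module described in this section; the staging path, key column and required columns are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks-sketch").getOrCreate()

def run_quality_checks(df: DataFrame, key_col: str, required_cols: list) -> dict:
    """Compute simple data-quality metrics before the data is ingested."""
    total = df.count()
    results = {
        "row_count": total,
        "duplicate_keys": total - df.select(key_col).distinct().count(),
    }
    # Null counts for each required column.
    for c in required_cols:
        results[f"nulls_{c}"] = df.filter(F.col(c).isNull()).count()
    return results

# Hypothetical staged dataset; fail the load if any check does not pass.
staged = spark.read.parquet("hdfs:///datalake/staging/orders")
metrics = run_quality_checks(staged, key_col="order_id",
                             required_cols=["order_id", "order_date", "amount"])
null_failures = any(v > 0 for k, v in metrics.items() if k.startswith("nulls_"))
if metrics["duplicate_keys"] > 0 or null_failures:
    raise ValueError(f"Data quality checks failed: {metrics}")
```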
Confidential
Data Analyst
Responsibilities:
- Analyzed and reported on customer transactional and analytical data to meet business objectives.
- Worked through the entire CRISP-DM life cycle and was actively involved in all phases of the project life cycle, including data acquisition, data cleaning and data engineering.
- Developed weekly, monthly reports related to the marketing and financial departments using Teradata SQL.
- Designed the high-level ETL architecture for overall data transfer from the OLTP to the OLAP environment with the help of SSIS.
- Extracted data from SQL Server using Talend to load it into a single data warehouse repository. Designed tabular, matrix reports, drilldown, drill through, Parameterized and linked reports in SSRS.
- Wrote SQL queries using joins, grouping, nested sub-queries, and aggregation depending on data needed from various relational customer databases.
- Created VLOOKUP functions in MS Excel for searching data in large spreadsheets.
- Developed ad-hoc reports with VLOOKUPs, pivot tables and macros in Excel and recommended solutions to drive business decision making (see the pandas sketch at the end of this section).
- Designed data models and data flow diagrams using Erwin and MS Visio.
- Involved in troubleshooting, performance tuning of reports and resolving issues within Tableau Server and reports.
- Wrote complex SQL and PL/SQL procedures, functions and packages to validate data and support the testing process.
- Performed SAS programming using PROC SQL (joins/unions), PROC APPEND, PROC DATASETS and PROC TRANSPOSE.
- Created worksheet reports, converted them into interactive dashboards using Tableau Desktop and provided them to business users, project managers and end users.
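A small pandas sketch of the join-and-pivot style of ad-hoc reporting described above, using made-up data; the merge stands in for a VLOOKUP and pivot_table for an Excel pivot table.

```python
import pandas as pd

# Made-up transactional and reference data standing in for the Excel workbooks.
orders = pd.DataFrame({
    "customer_id": [101, 102, 101, 103],
    "month": ["2019-01", "2019-01", "2019-02", "2019-02"],
    "revenue": [1200.0, 450.0, 900.0, 300.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "region": ["Midwest", "East", "West"],
})

# VLOOKUP equivalent: join the region onto each order.
enriched = orders.merge(customers, on="customer_id", how="left")

# Pivot-table equivalent: monthly revenue by region.
report = enriched.pivot_table(index="region", columns="month",
                              values="revenue", aggfunc="sum", fill_value=0)
print(report)
```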
Confidential
SQL Developer
Responsibilities:
- Responsible for creating complex stored procedures, SSIS packages, triggers, cursors, tables, views and other SQL joins and statements for applications.
- Responsible for developing processes, automation of maintenance jobs, tuning SQL Server, locks and indexes configurations, administering SQL Server security, SQL Server automatic e-mail notification and SQL Server backup strategy and automation.
- Configured SSIS packages using Package configuration wizard to allow Packages run on different environments.
- Developed advanced correlated and uncorrelated sub-queries in T-SQL to build complex reports.
- Developed multi-dimensional cubes and dimensions using SQL Server Analysis Services (SSAS).
- Improved the performance of T-SQL queries and stored procedures by using SQL Profiler, execution plans, SQL Performance Monitor and the Index Tuning Advisor.
- Developed many tabular reports, matrix reports, cascading parameterized drill-down and drop-down reports, and charts using SQL Server Reporting Services (SSRS 2012).