Big Data Engineer Resume
Richardson, TX
SUMMARY
- Big Data and Machine Learning Engineer with 8 years of experience in big data analytics using technologies including Hive, Pig, HDFS, HBase, Sqoop, Flume, Spark, Kafka, YARN, Oozie, Spark integration with Cassandra, Avro, and ZooKeeper.
- Good understanding of NoSQL databases and hands-on experience writing applications for NoSQL databases such as HBase, Cassandra, and MongoDB.
- Hands-on experience with sequence files, dynamic partitioning, and bucketing in Hive to improve query performance.
- Experience in data cleaning and exploratory data analysis (EDA) using Python and Spark; proficient with libraries such as NumPy, pandas, and scikit-learn.
- Experience creating UDFs in Python and using them regularly.
- Experience developing Kafka consumers and producers by extending the low-level and high-level consumer and producer APIs (see the sketch after this summary).
- Expertise in writing Spark Streaming applications using Scala and Python.
- Performed numeric data analysis using Pandas, NumPy and matplotlib.
- Hands-on experience applying ML techniques to draw meaningful insights from data.
- Supported various data science teams and ML Engineering teams in end-to-end model development.
- Proficient in writing complex SQL queries, shell scripts, and stored procedures for automation.
- Work experience with cloud infrastructure such as Amazon Web Services (AWS) and Azure.
- Expertise with AWS cloud services such as EMR, S3, Redshift, AWS Glue, and CloudWatch for big data development.
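A minimal sketch of the Kafka producer/consumer pattern referenced in this summary, using the kafka-python client. The broker address, topic, and consumer group are placeholders, not values from any specific engagement.

```python
from kafka import KafkaProducer, KafkaConsumer
import json

# Hypothetical broker and topic names, used only for illustration.
BROKERS = ["localhost:9092"]
TOPIC = "clickstream-events"

# Producer: serialize dicts to JSON and publish them to the topic.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user_id": 42, "action": "page_view"})
producer.flush()

# Consumer: join a consumer group and read messages from the beginning.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="analytics-group",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
```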
TECHNICAL SKILLS
Big Data/Hadoop Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Spark, Kafka, AWS EMR.
Open-source Libraries: scikit-learn, NumPy, SciPy, OpenCV, Keras, Matplotlib (visualization); deep learning and NLP.
Machine Learning Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, K-Means Clustering, Support Vector Machines, Gradient Boosting Machines & XGBoost
NoSQL Databases: MongoDB, Cassandra, HBase
Data Analysis Skills: Data Cleaning, Data Visualization, Feature Selection, Pandas
Programming Languages: Python, Scala, SQL, PL/SQL, Linux shell scripting.
Database: Oracle 11g/10g, DB2, Microsoft SQL, MySQL, Teradata
Cloud Ecosystem: Amazon Web Services (EC2, EMR, and S3), Azure (Databricks).
Automation and Scheduling: Oozie
IDEs and Tools: Eclipse, IntelliJ, PyCharm, Jupyter, Spyder, Anaconda, Git, PuTTY, WinSCP, Tableau
Monitoring and Reporting Tools: Tableau, Power BI.
Operating System: Ubuntu (Linux), Windows.
PROFESSIONAL EXPERIENCE
Confidential, Richardson, TX
Big Data Engineer
Responsibilities:
- Experience with a variety of ETL techniques.
- Hands-on experience with Terraform infrastructure as code to manage AWS public cloud resources.
- Created ETL pipelines using Python/pandas-based AWS Lambda functions that process objects in AWS S3 buckets and return results, logs, and errors to an Amazon API (see the Lambda sketch after this section).
- Created an ETL script with AWS Glue that reads an AWS Glue Catalog table containing JSON data and outputs Parquet (see the Glue sketch after this section).
- Experience in building CI/CD pipelines.
- Hands-on with the Behave framework for behaviour-driven development (BDD) in Python, where testers, developers, and business analysts can all contribute.
- Implemented partitioning and bucketing in Hive to improve query efficiency and join performance, respectively.
- Worked in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business.
- Optimized SQL and Python scripts to improve performance.
- Designed and developed a data pipeline to move data from ServiceNow to AWS Redshift, with real-time Tableau refreshes using AWS Lambda functions.
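A minimal sketch of the pandas-based Lambda ETL described above. The event structure assumes an S3 object-created trigger; the bucket, key, and response shape are hypothetical.

```python
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical S3 event; the real pipeline read whatever object the event referenced.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Pull the object from S3 and load it into a DataFrame.
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(obj["Body"])

    # Example transformation: drop incomplete rows and summarise.
    df = df.dropna()
    summary = {"rows": len(df), "columns": list(df.columns)}

    # Return results (and any errors) to the caller via the API response.
    return {"statusCode": 200, "body": json.dumps(summary)}
```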
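A sketch of a Glue job along the lines of the JSON-to-Parquet script mentioned above. The catalog database, table name, and S3 output path are placeholders.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a Glue Data Catalog table backed by JSON files (names are hypothetical).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="events_json"
)

# Sink: write the same records out to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/events/"},
    format="parquet",
)
job.commit()
```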
Confidential, Alpharetta, GA
Big Data Engineer / Hadoop Developer
Responsibilities:
- Designed, developed, and automated data pipelines using HDFS, Hive, Spark, and Oozie workflows for various use cases.
- Created Spark applications using the Spark DataFrame and Spark SQL APIs extensively, and used the Spark Scala API to implement batch processing jobs.
- Involved in developing Tableau dashboards on the Hive tables created, which are used by business users.
- Designed and developed a data pipeline to move data from ServiceNow to AWS Redshift.
- Created real-time data movement using Spark Structured Streaming with Amazon Kinesis.
- Implemented real-time Tableau refreshes using AWS Lambda functions.
- Created AWS Glue jobs to load data into Redshift.
- Created AWS CloudFormation scripts for creating environments.
- Developed an encryption algorithm to hash customer-sensitive information such as contact details, MTN, name, and customer account number in data used by vendors (see the hashing sketch after this section).
- Used Java to develop the encryption JAR and created Apache Spark scripts that use the algorithm to encrypt and decrypt the sensitive data.
- Responsible for design and development of advanced PySpark programs to prepare, transform and harmonize data sets in preparation for modeling.
- Performed Data Profiling to learn about customer behavior and merge data from multiple data sources.
- Utilized Sqoop to import and export structured data to/from RDBMS such as MySQL and Oracle.
- Collected raw data from various sources such as REST APIs and the enterprise data warehouse.
- Processed, cleaned, and transformed raw JSON files using PySpark, loaded the data into Hive tables, and automated the process by scheduling batch jobs (see the JSON-to-Hive sketch after this section).
- Extensively worked on Cloudera Hadoop.
- Hands-on experience with Hadoop on the GCP and AWS stacks. Experience with Spark APIs, AWS Glue, and AWS CI/CD pipelines.
- Deployed most applications and data pipelines using GitLab CI/CD; also good exposure to Jenkins and Terraform.
- Experience in building CI/CD pipelines.
- Extensive experience in Core Java 8, Spring Boot, Spring, Hibernate, web services, Kubernetes, Swagger, and Docker, and in integrating databases such as MongoDB and MySQL with HTML, PHP, and CSS web pages to update, insert, delete, and retrieve data with simple ad-hoc queries.
- Created Hive tables and loaded and analysed data using Hive scripts.
- Implemented partitioning and bucketing in Hive to improve query efficiency and join performance, respectively (see the partitioning sketch after this section).
- Developed complex SQL and Hive queries to extract the required data and created various dashboards in Tableau for the end users.
- Performed data deduplication to remove duplicate records.
- Frequently met with the business teams and conveyed insights that informed changes to the customer approach strategy.
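The vendor-facing masking described above was implemented as a Java encryption JAR invoked from Spark; that code is not reproduced here. The sketch below only illustrates the same idea in PySpark, hashing hypothetical sensitive columns with salted SHA-256.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, concat, col, lit

spark = SparkSession.builder.appName("pii-hashing-sketch").getOrCreate()

# Hypothetical customer extract; column names are placeholders.
customers = spark.read.parquet("s3://example-bucket/customers/")

# Salted SHA-256 hashes replace the sensitive columns before the data
# is shared with vendors. The salt would normally come from a secret store.
salt = lit("example-salt")
masked = (
    customers
    .withColumn("mtn", sha2(concat(col("mtn"), salt), 256))
    .withColumn("name", sha2(concat(col("name"), salt), 256))
    .withColumn("customer_account_number",
                sha2(concat(col("customer_account_number"), salt), 256))
)
masked.write.mode("overwrite").parquet("s3://example-bucket/customers_masked/")
```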
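A minimal PySpark sketch of the JSON-to-Hive loading described above; the landing path, schema fields, and table name are assumptions for illustration. In practice a scheduler (e.g. Oozie) would run this as a recurring batch job.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

# enableHiveSupport lets the job write managed Hive tables directly.
spark = (SparkSession.builder
         .appName("json-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical landing path for the raw JSON files.
raw = spark.read.json("hdfs:///data/landing/orders/*.json")

# Basic cleaning: drop duplicates and null keys, derive a partition column.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .filter(col("order_id").isNotNull())
       .withColumn("order_date", to_date(col("order_ts")))
)

# Load into a partitioned Hive table.
(cleaned.write
        .mode("append")
        .format("parquet")
        .partitionBy("order_date")
        .saveAsTable("analytics.orders"))
```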
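The partitioned and bucketed tables referenced above were defined with Hive DDL (PARTITIONED BY / CLUSTERED BY ... INTO N BUCKETS); the sketch below expresses the same layout with Spark's DataFrame writer, using hypothetical paths and names.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical daily transaction extract.
txns = spark.read.parquet("hdfs:///data/staging/transactions/")

# Partition by load date for partition pruning; bucket by customer_id so
# joins and aggregations on that key can avoid a full shuffle.
(txns.write
     .mode("overwrite")
     .partitionBy("load_date")
     .bucketBy(32, "customer_id")
     .sortBy("customer_id")
     .format("parquet")
     .saveAsTable("analytics.transactions_bucketed"))
```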
Confidential, Nashville, TN
Big Data Engineer / Hadoop developer
Responsibilities:
- Developed and scheduled Hive scripts and Spark jobs using the Oozie workflow scheduler, and migrated existing ETL scripts to Hive.
- Worked in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business.
- Used Spark SQL to load complex nested JSON into Hive tables.
- Performed masking and unmasking of the data as per requests.
- Used Hive performance-tuning techniques such as vectorization, appropriate file formats, compression, partitioning, and the Tez execution engine.
- Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into HBase (see the streaming sketch after this section).
- Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing (see the broadcast-join sketch after this section).
- Built a pipeline to process data from JSON files and other data sources into Hive tables.
- Translated business propositions into quantitative queries and collected/cleaned the necessary data.
- Built scalable databases capable of supporting ETL processes using SQL and Spark.
- Evaluated workflows and improved the efficiency of data pipelines that process over 50 TB of data daily.
- Involved in feature engineering, data cleansing and pre-processing.
- Evaluated model performance using metrics such as precision, recall, and F1 score (see the evaluation sketch after this section).
- Worked closely with a team of software engineers on serving the model at scale.
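A sketch in the spirit of the Kafka-to-HBase streaming work above, written here with Structured Streaming and the happybase client rather than the original DStream code. It assumes the spark-sql-kafka connector and an HBase Thrift gateway are available; broker, topic, table, and host names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType
import happybase

spark = SparkSession.builder.appName("kafka-to-hbase-sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("action", StringType()),
])

# Read the Kafka topic as a stream and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_to_hbase(batch_df, batch_id):
    # Write each micro-batch into an HBase table via the Thrift gateway.
    # collect() is fine for an illustration with small batches.
    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("user_events")
    for row in batch_df.collect():
        table.put(
            row["event_id"].encode("utf-8"),
            {b"cf:user_id": row["user_id"].encode("utf-8"),
             b"cf:action": row["action"].encode("utf-8")},
        )
    conn.close()

query = (events.writeStream
         .foreachBatch(write_to_hbase)
         .option("checkpointLocation", "/tmp/checkpoints/user-events")
         .start())
query.awaitTermination()
```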
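A short sketch of the broadcast-join pattern mentioned above; the datasets and join key are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical datasets: a large fact table and a small dimension table.
transactions = spark.read.parquet("s3://example-bucket/transactions/")
stores = spark.read.parquet("s3://example-bucket/stores/")

# Broadcasting the small side avoids shuffling the large table across the cluster.
enriched = transactions.join(broadcast(stores), on="store_id", how="left")
enriched.explain()  # the plan should show a BroadcastHashJoin
```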
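A minimal sketch of the model evaluation described above, using scikit-learn; the labels shown are dummy values standing in for held-out labels and model predictions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Dummy ground-truth labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```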
Confidential
Data Analyst
Responsibilities:
- Manipulated, cleansed, and processed data using Excel, Access, and SQL.
- Responsible for loading, extracting, and validating client data.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Analysed the data by running Hive queries (HiveQL) to study customer behavior.
- Helped DevOps engineers deploy code and debug issues.
- Used Hive to analyse the partitioned and bucketed data and compute various metrics for reporting, and developed HiveQL scripts to de-normalize and aggregate the data.
- Scheduled and executed workflows in Oozie to run various jobs.
- Implemented business logic in Hive and wrote UDFs to process the data for analysis (see the UDF sketch after this section).
- Addressed issues occurring due to the huge volume of data and transitions.
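A minimal example of a Python "UDF" of the kind mentioned above, written as a streaming script for Hive's TRANSFORM clause; the script name, columns, and logic are illustrative only.

```python
#!/usr/bin/env python
# A streaming script used with Hive's TRANSFORM clause: reads tab-separated
# rows on stdin, normalises the phone number column, writes rows to stdout.
# Registered and invoked from Hive with, e.g.:
#   ADD FILE normalize_phone.py;
#   SELECT TRANSFORM (customer_id, phone)
#     USING 'python normalize_phone.py'
#     AS (customer_id, phone_normalized)
#   FROM customers;
import re
import sys

for line in sys.stdin:
    customer_id, phone = line.rstrip("\n").split("\t")
    digits = re.sub(r"\D", "", phone)  # keep digits only
    print("\t".join([customer_id, digits]))
```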