Big Data Engineer Resume
SUMMARY
- Extensive experience as a Big Data Engineer with a Data Analytics background, working across domains including Finance, Insurance, Tech, and Media services
- Solid technical skills using Apache Hadoop ecosystem tools including HDFS, MapReduce, Hive, Pig, Spark (Spark Core, Spark SQL, PySpark), Yarn, Kafka, Sqoop, HBase, Zookeeper, Tez and Ambari
- Deep knowledge of designing and implementing distributed data-processing ETL pipelines using Python, SQL, HiveQL, Spark, and PySpark
- Proficient working with traditional RDBMS including Oracle, MySQL, and PostgreSQL, as well as NoSQL databases like HBase, MongoDB and Cassandra
- Hands-on experience writing MapReduce programs in Python and PySpark for high-volume data processing in Hadoop
- Implemented data cleansing and data transformation using Spark RDD transformations and actions, and familiar with RDD task optimization and DAG concepts
- Involved in creating Hive tables, writing HiveQL queries to load data using partitioning and bucketing techniques, and writing UDFs for complex data manipulation and analysis (see the illustrative sketch at the end of this summary)
- Experience writing Pig Latin scripts for preprocessing and analyzing large volumes of data
- Familiar with importing and exporting data between HDFS/Hive/HBase and RDBMSs using Sqoop
- Highly involved in all phases of Data Warehouse life cycle involving requirement analysis, design, coding, testing, and deployment
- Knowledgeable in OLTP/OLAP database systems and Star/Snowflake data modeling schemas
- Administered Kafka pipelines for collecting and processing large amounts of data in real time
- Strong understanding of workload management, schedulers such as Cron and Apache Airflow, scalability, distributed platform architectures, and cluster coordination services
- Extensive knowledge of Amazon Web Services (AWS) components for collection (Kinesis, DMS), storage (S3, Glacier, DynamoDB), processing (EMR, Lambda, Glue, Data Pipeline, ML, SageMaker), analysis (ElasticSearch, Athena, Redshift), and visualization (QuickSight)
- Skilled in statistical analysis and ML algorithms such as Linear/Logistic Regression, Decision Trees, Random Forest, Boosting, K-Means/KNN, SVM, and Neural Networks, using Python and R libraries to explore hidden patterns in data
- Proficient in Python packages for data manipulation, analysis, and visualization, such as Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn, and PyChart
- Demonstrated ability to create visually appealing and interactive reports/dashboards using tools such as PowerBI, Tableau, Jupyter Notebook, and Zeppelin
- Experienced with commercial Hadoop distributions, including Hortonworks HDP
- Skilled in programmatic problem solving using various Algorithms, Data Structures, and OOP applications
- Knowledgeable in data serialization across various formats, including SequenceFile, Avro, Parquet, flat files, XML, CSV, and JSON
- Experienced with version control tools like Git and Gitlab, project management tools such as JIRA, and various software development methodologies like Agile and Scrum
- Self-driven goal-getter and trustworthy team player with excellent communication skills in collaborative teams and the motivation to take on independent responsibility
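For illustration only, a minimal PySpark sketch of the RDD-based cleansing and partitioned Hive load described above; the HDFS path, schema, and table name (analytics.clean_events) are hypothetical placeholders.

```python
# Minimal PySpark sketch: cleanse raw records with RDD transformations,
# then load the result into a partitioned Hive table.
# All paths, columns, and table names below are hypothetical placeholders.
from pyspark.sql import SparkSession, Row

spark = (SparkSession.builder
         .appName("cleansing-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read raw text records, then cleanse with RDD transformations (map/filter);
# count() is an action that forces evaluation of the DAG built so far.
raw_rdd = spark.sparkContext.textFile("hdfs:///data/raw/events/*.csv")
clean_rdd = (raw_rdd
             .map(lambda line: line.split(","))
             .filter(lambda cols: len(cols) == 3 and cols[2].isdigit())
             .map(lambda cols: Row(user_id=cols[0],
                                   event_date=cols[1],
                                   amount=int(cols[2]))))
print("clean rows:", clean_rdd.count())

# Convert to a DataFrame and write into a Hive table partitioned by event_date.
df = spark.createDataFrame(clean_rdd)
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .saveAsTable("analytics.clean_events"))
```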
TECHNICAL SKILLS
- Big Data Ecosystem: Apache Hadoop 2.x/3.x, HDFS, YARN, MapReduce, Spark 2.x, Hive 2.1, Pig 0.17, Sqoop 1.4.7, Kafka 2.1, Zookeeper 3.4, Tez 0.9.2, Ambari, Airflow
- Cloud Services: AWS (EMR, EC2, S3, RDS, Glacier, DMS, DynamoDB, ElasticSearch, ML, Redshift, Lambda, Glue, Athena, Kinesis, Data Pipeline, QuickSight), Google Cloud Platform
- Databases: Oracle 12c/18c, MySQL 5.x, PostgreSQL, Microsoft SQL Server 13.0, HBase 1.2, MongoDB, Cassandra
- Programming Languages: Python 2.7/3.7/3.8, SQL, R 3.6, Java 8, Scala 2.11, PySpark, SAS 9.4, Unix/Bash shell
- Python Libraries: Pandas, NumPy, Scikit-learn, SciPy, Seaborn, Matplotlib, PyChart
- Machine Learning: Linear/Logistic Regression, KNN, K-Means, Decision Trees, Random Forest, Gradient Boosting, SVM, Neural Networks
- IDEs/Editors: IntelliJ IDEA, Jupyter Notebook, RStudio, Microsoft Visual Studio, Spring Tool Suite, PyCharm, Sublime, Eclipse, Spyder
- Tools and Methodologies: Git/GitLab, GitHub, Cron, Jira, Google Colab, IBM Skills Network Lab, Hortonworks, Microsoft Office (Word, Excel, PowerPoint), Tableau 10/2018, PowerBI, Agile, Scrum
- Operating Systems: Mac OS, Windows, Linux
PROFESSIONAL EXPERIENCE
Confidential
Big Data Engineer
Responsibilities:
- Performed data storage on AWS S3, and defined MapReduce jobs in multi-node Hadoop clusters on AWS EMR and EC2
- Developed ETL pipelines to ingest data to AWS RDS for data processing, data analysis using AWS Glue and AWS Athena, and data warehousing and visualization using AWS Redshift and AWS QuickSight
- Maintained AWS Kinesis pipelines to process near-real-time server logs, analyze data with AWS ElasticSearch, and populate data in AWS DynamoDB using AWS Lambda
- Developed and implemented ETL pipelines using Python, SQL, Spark, and PySpark to ingest data and push updates to the relevant databases
- Wrote HiveQL queries on Hadoop to filter, transform, and analyze data, wrote custom Hive UDFs, and stored data in Hive tables (an illustrative sketch follows this role)
- Stored column-oriented, semi-structured data using HBase and MongoDB
- Performed unit testing on scripts, maintained and edited existing ETL pipelines, performed audits to check data consistency between local databases and AWS data storage/warehouses
- Authored workflows using Apache Airflow, defining DAG objects to schedule tasks and manage dependencies (an Airflow sketch follows this role)
- Utilized Git for version control and JIRA for task tracking, and was actively involved in Agile development
Environment: AWS EMR, EC2, S3, RDS, Redshift, Glue, Athena, Kinesis, ElasticSearch, DynamoDB, Quicksight, Hadoop 3.0, HDFS, MapReduce, Spark 2.2/2.4.7, PySpark, Python 2.7/3.6/3.8, Scala 2.11, SQL, Hive 2.3/3.1, HBase 1.4, MongoDB, Apache Airflow, Git, Jira, Agile
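For illustration, a minimal sketch of the kind of PySpark/Spark SQL ETL step described above; the table names (staging.orders, analytics.orders_clean) and the normalization UDF are hypothetical, and production Hive UDFs may instead be written in Java.

```python
# Illustrative sketch only: register a Python UDF with Spark SQL and use a
# HiveQL-style statement to filter/transform data into a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("etl-udf-sketch")
         .enableHiveSupport()
         .getOrCreate())

def normalize_region(raw):
    """Trim and upper-case a free-text region code (hypothetical rule)."""
    return raw.strip().upper() if raw else "UNKNOWN"

# Make the Python function callable from SQL as normalize_region(...).
spark.udf.register("normalize_region", normalize_region, StringType())

spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders_clean
    SELECT order_id,
           normalize_region(region) AS region,
           CAST(amount AS DOUBLE)   AS amount
    FROM staging.orders
    WHERE amount IS NOT NULL
""")
```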
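A hypothetical Airflow DAG sketch showing how such workflows might define tasks and dependencies; the DAG id, schedule, and spark-submit commands are placeholders.

```python
# Hypothetical Airflow DAG: two tasks with an explicit downstream dependency.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="spark-submit /opt/jobs/ingest.py",
    )
    transform = BashOperator(
        task_id="transform_and_load",
        bash_command="spark-submit /opt/jobs/transform.py",
    )
    # Downstream dependency: transform runs only after ingest succeeds.
    ingest >> transform
```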
Confidential, New York, NY
BI Engineer
Responsibilities:
- Generated DDL/DML scripts using PostgreSQL to ingest data onto local MPP databases and AWS EC2/EMR Hadoop clusters utilized by the BI analytical teams for data analysis
- Utilized AWS Glue and AWS Athena to perform querying and analysis, and used AWS Redshift for data warehousing
- Performed ETL procedures on unstructured data using AWS Lambda and AWS DynamoDB
- Wrote Python, Spark, and PySpark scripts to build ETL pipelines that automated data ingestion and updated the relevant databases and tables
- Performed data cleaning and transformation to provide accurate and accessible data in Hadoop clusters using Python, HiveQL, and Pig Latin scripts
- Performed MapReduce jobs on Hadoop clusters, utilizing partitioning and bucketing techniques to speed up and optimize analytical performance on large volumes of data
- Stored semi-structured data using columnar-formatted Parquet files on HBase for scalable storage and fast querying
- Conducted Exploratory Data Analysis (EDA) using Python and R packages such as matplotlib, seaborn, ggplot2, and plotly to discover data patterns, and displayed graphs and visualizations in Jupyter Notebook and R Markdown (an EDA sketch follows this role)
- Utilized Sqoop to transfer data between local databases and Hadoop HDFS
- Maintained Kafka pipelines to efficiently ingest real-time data across various topics
- Scheduled jobs using Cron and Apache Airflow, defining DAG objects and job dependencies
- Performed routine auditing on data in Hadoop clusters and local data warehouses, updating and correcting written scripts whenever necessary
- Used Git/Gitlab for version control, while managing project tasks and issues through Jira.
- Supported BI analysts during weekly meetings by suggesting optimized data mining methods, planning data analytical models and supplying necessary data based on various business needs
- Provided necessary Excel spreadsheets and extracted data into Tableau to generate visualizations and reports
- Practiced Agile methodology, actively participating and providing constructive feedback during daily stand-up meetings and weekly iterative review meetings
Environment: AWS EMR, EC2, S3, Glue, Athena, Redshift, Lambda, DynamoDB, Hadoop 2.7/3.0, HDFS, MapReduce, Python 2.7/3.6, PostgreSQL, Spark 2.1/2.2, PySpark, Hive 2.3/3.1, Pig 0.17, HBase 1.0/1.4, Kafka 1.1, Sqoop 1.4.7, Apache Airflow, Cron, EDA, Jupyter Notebook, R Markdown, Tableau 10/2018, Microsoft Office 2016 (Word, Excel, Powerpoint), Git/Gitlab, Jira, Agile
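A small illustrative EDA sketch in the spirit of the analysis described above; the CSV file and column names are hypothetical placeholders.

```python
# Quick exploratory data analysis with pandas/seaborn on a hypothetical dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sample_transactions.csv")

# Profile the data: shape, dtypes, missing values, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Distribution of a numeric column, then a correlation heatmap of numeric fields.
sns.histplot(df["amount"], bins=30)
plt.title("Transaction amount distribution")
plt.show()

sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()
```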
Confidential
Digital Marketing Analyst
Responsibilities:
- Extracted data from various sources using Python scripts, stored data in Microsoft SQL Server databases, and performed data analysis procedures using SQL queries
- Utilized Google Analytics to track acquisitions, customer behavior, and conversion rates on websites, and used Google Tag Manager to update code fragments
- Participated in collecting end-user product experience data, and collaborated with the Software Development team to suggest necessary improvements to increase customer traction
- Routinely organized customer and product data on the company’s CRM system
- Generated campaign performance reports and data insights using visualization tools such as Excel tables, PowerPoint presentations, and Tableau dashboards, highlighting key insights that supported the overall objective of maximizing marketing ROI
- Conducted internal and external competitor analysis based on market feedback and helped develop actionable strategies to improve product performance
- Involved with product management projects, assisted in assessing and prioritizing clients’ immediate business needs, and facilitated dealings with more than 40 companies
Environment: Python 2.7/3.5, SQL, Google Analytics, Google Tag Manager, Microsoft Office (Word, Excel, PowerPoint), Tableau 10