Big Data Engineer Resume
SUMMARY
- Extensive experience as a Big Data Engineer with a Data Analytics background, working across domains including Finance, Insurance, Tech, and Media services
- Solid technical skills using Apache Hadoop ecosystem tools including HDFS, MapReduce, Hive, Pig, Spark (Spark Core, Spark SQL, PySpark), Yarn, Kafka, Sqoop, HBase, Zookeeper, Tez and Ambari
- Deep knowledge of designing and implementing distributed data-processing ETL pipelines using Python, SQL, HiveQL, Spark, and PySpark
- Proficient working with traditional RDBMS including Oracle, MySQL, and PostgreSQL, as well as NoSQL databases like HBase, MongoDB and Cassandra
- Hands-on experience writing MapReduce programs in Python and PySpark for high-volume data processing in Hadoop
- Implemented data cleansing and data transformation using Spark RDD transformations and actions, and familiar with RDD task optimization and DAG concepts
- Involved in creating Hive tables, writing HiveQL queries to load data using partitioning and bucketing techniques, and writing UDFs for complex data manipulation and analysis (see the illustrative sketch at the end of this summary)
- Experience writing Pig Latin scripts for preprocessing and analyzing large volumes of data
- Familiar with importing and exporting data between HDFS/Hive/HBase and RDBMSs using Sqoop
- Highly involved in all phases of Data Warehouse life cycle involving requirement analysis, design, coding, testing, and deployment
- Knowledgeable in OLTP/OLAP database systems and Star/Snowflake data modeling schemas
- Administered Kafka pipelines for collecting and processing large amounts of data in real time
- Strong understanding of workload management, schedulers such as Cron and Apache Airflow, scalability, distributed platform architectures, and cluster coordination services
- Extensive knowledge of Amazon Web Services (AWS) components for collection (Kinesis, DMS), storage (S3, Glacier, DynamoDB), processing (EMR, Lambda, Glue, Data Pipeline, ML, SageMaker), analysis (ElasticSearch, Athena, Redshift), and visualization (QuickSight)
- Skilled in statistical analysis and ML algorithms such as Linear/Logistic Regression, Decision Trees, Random Forest, Boosting, K-Means/KNN, SVM, and Neural Networks, using Python and R libraries to explore hidden patterns in data
- Proficient in Python packages for data manipulation, analysis, and visualization, such as Pandas, NumPy, SciPy, Scikit-learn, Matplotlib, Seaborn, and PyChart
- Demonstrated ability to create visually appealing and interactive reports/dashboards using tools such as PowerBI, Tableau, Jupyter Notebook, and Zeppelin
- Experienced with commercial Hadoop distributions, including Hortonworks HDP
- Skilled in programmatic problem solving using various Algorithms, Data Structures, and OOP applications
- Knowledgeable in data serialization across various formats, including SequenceFile, Avro, Parquet, flat files, XML, CSV, and JSON
- Experienced with version control tools like Git and Gitlab, project management tools such as JIRA, and various software development methodologies like Agile and Scrum
- Self-driven goal-getter and trustworthy team player with excellent communication skills in collaborative teams and the motivation to take on independent responsibility
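For illustration only, a minimal PySpark sketch of the RDD-based cleansing and partitioned Hive load described above; the HDFS path, schema, and table name (analytics.clean_events) are hypothetical placeholders.

```python
# Minimal PySpark sketch: cleanse raw records with RDD transformations,
# then load the result into a partitioned Hive table.
# All paths, columns, and table names below are hypothetical placeholders.
from pyspark.sql import SparkSession, Row

spark = (SparkSession.builder
         .appName("cleansing-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read raw text records, then cleanse with RDD transformations (map/filter);
# count() is an action that forces evaluation of the DAG built so far.
raw_rdd = spark.sparkContext.textFile("hdfs:///data/raw/events/*.csv")
clean_rdd = (raw_rdd
             .map(lambda line: line.split(","))
             .filter(lambda cols: len(cols) == 3 and cols[2].isdigit())
             .map(lambda cols: Row(user_id=cols[0],
                                   event_date=cols[1],
                                   amount=int(cols[2]))))
print("clean rows:", clean_rdd.count())

# Convert to a DataFrame and write into a Hive table partitioned by event_date.
df = spark.createDataFrame(clean_rdd)
(df.write
   .mode("overwrite")
   .partitionBy("event_date")
   .saveAsTable("analytics.clean_events"))
```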
TECHNICAL SKILLS
- Big Data Ecosystem: Apache Hadoop 2.x/3.x, HDFS, YARN, MapReduce, Spark 2.x, Hive 2.1, Pig 0.17, Sqoop 1.4.7, Kafka 2.1, Zookeeper 3.4, Tez 0.9.2, Ambari, Airflow
- Cloud Services: AWS (EMR, EC2, S3, RDS, Glacier, DMS, DynamoDB, ElasticSearch, ML, Redshift, Lambda, Glue, Athena, Kinesis, Data Pipeline, QuickSight), Google Cloud Platform
- Databases: Oracle 12c/18c, MySQL 5.x, PostgreSQL, Microsoft SQL Server 13.0, HBase 1.2, MongoDB, Cassandra
- Programming Languages: Python 2.7/3.7/3.8, SQL, R 3.6, Java 8, Scala 2.11, PySpark, SAS 9.4, Unix/Bash shell
- Python Libraries: Pandas, NumPy, Scikit-learn, SciPy, Seaborn, Matplotlib, PyChart
- Machine Learning: Linear/Logistic Regression, KNN, K-Means, Decision Trees, Random Forest, Gradient Boosting, SVM, Neural Networks
- IDEs/Editors: IntelliJ IDEA, Jupyter Notebook, RStudio, Microsoft Visual Studio, Spring Tool Suite, PyCharm, Sublime, Eclipse, Spyder
- Tools and Methodologies: Git/GitLab, GitHub, Cron, Jira, Google Colab, IBM Skills Network Lab, Hortonworks, Microsoft Office (Word, Excel, PowerPoint), Tableau 10/2018, PowerBI, Agile, Scrum
- Operating Systems: Mac OS, Windows, Linux
PROFESSIONAL EXPERIENCE
Confidential
Big Data Engineer
Responsibilities:
- Performed data storage on AWS S3, and defined MapReduce jobs in multi-node Hadoop clusters on AWS EMR and EC2
- Developed ETL pipelines to ingest data to AWS RDS for data processing, data analysis using AWS Glue and AWS Athena, and data warehousing and visualization using AWS Redshift and AWS QuickSight
- Maintained AWS Kinesis pipelines to process near-real-time server logs, analyze data with AWS ElasticSearch, and populate data in AWS DynamoDB using AWS Lambda
- Developed and implemented ETL pipelines using Python, SQL, Spark, and PySpark to ingest data and push updates to the relevant databases
- Wrote HiveQL queries on Hadoop to filter, transform, and analyze data, wrote custom Hive UDFs, and stored data in Hive tables (an illustrative sketch follows this role)
- Stored column-oriented, semi-structured data using HBase and MongoDB
- Performed unit testing on scripts, maintained and edited existing ETL pipelines, performed audits to check data consistency between local databases and AWS data storage/warehouses
- Authored workflows using Apache Airflow, defining DAG objects to schedule tasks and manage dependencies (an Airflow sketch follows this role)
- Utilized Git for version control and JIRA for task tracking, and was actively involved in Agile development
Environment: AWS EMR, EC2, S3, RDS, Redshift, Glue, Athena, Kinesis, ElasticSearch, DynamoDB, Quicksight, Hadoop 3.0, HDFS, MapReduce, Spark 2.2/2.4.7, PySpark, Python 2.7/3.6/3.8, Scala 2.11, SQL, Hive 2.3/3.1, HBase 1.4, MongoDB, Apache Airflow, Git, Jira, Agile
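For illustration, a minimal sketch of the kind of PySpark/Spark SQL ETL step described above; the table names (staging.orders, analytics.orders_clean) and the normalization UDF are hypothetical, and production Hive UDFs may instead be written in Java.

```python
# Illustrative sketch only: register a Python UDF with Spark SQL and use a
# HiveQL-style statement to filter/transform data into a Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("etl-udf-sketch")
         .enableHiveSupport()
         .getOrCreate())

def normalize_region(raw):
    """Trim and upper-case a free-text region code (hypothetical rule)."""
    return raw.strip().upper() if raw else "UNKNOWN"

# Make the Python function callable from SQL as normalize_region(...).
spark.udf.register("normalize_region", normalize_region, StringType())

spark.sql("""
    INSERT OVERWRITE TABLE analytics.orders_clean
    SELECT order_id,
           normalize_region(region) AS region,
           CAST(amount AS DOUBLE)   AS amount
    FROM staging.orders
    WHERE amount IS NOT NULL
""")
```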
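A hypothetical Airflow DAG sketch showing how such workflows might define tasks and dependencies; the DAG id, schedule, and spark-submit commands are placeholders.

```python
# Hypothetical Airflow DAG: two tasks with an explicit downstream dependency.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_data",
        bash_command="spark-submit /opt/jobs/ingest.py",
    )
    transform = BashOperator(
        task_id="transform_and_load",
        bash_command="spark-submit /opt/jobs/transform.py",
    )
    # Downstream dependency: transform runs only after ingest succeeds.
    ingest >> transform
```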
Confidential, New York, NY
BI Engineer
Responsibilities:
- Generated DDL/DML scripts using PostgreSQL to ingest data onto local MPP databases and AWS EC2/EMR Hadoop clusters utilized by the BI analytical teams for data analysis
- Utilized AWS Glue and AWS Athena to perform querying and analysis, and used AWS Redshift for data warehousing
- Performed ETL procedures on unstructured data using AWS Lambda and AWS DynamoDB
- Wrote Python, Spark, and PySpark scripts to build ETL pipelines that automated data ingestion and updated the relevant databases and tables
- Performed data cleaning and transformation to provide accurate and accessible data in Hadoop clusters using Python, HiveQL, and Pig Latin scripts
- Performed MapReduce jobs on Hadoop clusters, utilizing partitioning and bucketing techniques to speed up and optimize analytical performance on large volumes of data
- Stored semi-structured data using columnar-formatted Parquet files on HBase for scalable storage and fast querying
- Conducted Exploratory Data Analysis (EDA) using Python and R packages such as matplotlib, seaborn, ggplot2, and plotly to discover data patterns, and displayed graphs and visualizations in Jupyter Notebook and R Markdown (an EDA sketch follows this role)
- Utilized Sqoop to transfer data between local databases and Hadoop HDFS
- Maintained Kafka pipelines to efficiently ingest real-time data across various topics
- Scheduled jobs using Cron and Apache Airflow, defining DAG objects and job dependencies
- Performed routine auditing on data in Hadoop clusters and local data warehouses, updating and correcting written scripts whenever necessary
- Used Git/Gitlab for version control, while managing project tasks and issues through Jira.
- Supported BI analysts during weekly meetings by suggesting optimized data mining methods, planning data analytical models and supplying necessary data based on various business needs
- Provided necessary Excel spreadsheets and extracted data into Tableau to generate visualizations and reports
- Practiced Agile methodology, actively participating and providing constructive feedback during daily stand-up meetings and weekly iterative review meetings
Environment: AWS EMR, EC2, S3, Glue, Athena, Redshift, Lambda, DynamoDB, Hadoop 2.7/3.0, HDFS, MapReduce, Python 2.7/3.6, PostgreSQL, Spark 2.1/2.2, PySpark, Hive 2.3/3.1, Pig 0.17, HBase 1.0/1.4, Kafka 1.1, Sqoop 1.4.7, Apache Airflow, Cron, EDA, Jupyter Notebook, R Markdown, Tableau 10/2018, Microsoft Office 2016 (Word, Excel, Powerpoint), Git/Gitlab, Jira, Agile
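A small illustrative EDA sketch in the spirit of the analysis described above; the CSV file and column names are hypothetical placeholders.

```python
# Quick exploratory data analysis with pandas/seaborn on a hypothetical dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("sample_transactions.csv")

# Profile the data: shape, dtypes, missing values, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Distribution of a numeric column, then a correlation heatmap of numeric fields.
sns.histplot(df["amount"], bins=30)
plt.title("Transaction amount distribution")
plt.show()

sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.title("Correlation matrix")
plt.show()
```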
Confidential
Digital Marketing Analyst
Responsibilities:
- Extracted data from various sources using Python scripts, stored data in Microsoft SQL Server databases, and performed data analysis procedures using SQL queries
- Utilized Google Analytics to track acquisitions, customer behavior, and conversion rates on websites, and used Google Tag Manager to update code fragments
- Participated in collecting end-user product experience data, and collaborated with the Software Development team to suggest necessary improvements to increase customer traction
- Routinely organized customer and product data on the company’s CRM system
- Generated campaign performance reports and data insights using visualization tools such as Excel tables, PowerPoint presentations, and Tableau dashboards, highlighting key insights that supported the overall objective of maximizing marketing ROI
- Conducted internal and external competitor analysis based on market feedback and helped develop actionable strategies to improve product performance
- Involved with product management projects, assisted in assessing and prioritizing clients’ immediate business needs, and facilitated dealings with more than 40 companies
Environment: Python 2.7/3.5, SQL, Google Analytics, Google Tag Manager, Microsoft Office (Word, Excel, PowerPoint), Tableau 10