Data Engineer/Big Data/Spark/Hadoop Resume

PROFESSIONAL SUMMARY:

4+ years of strong experience in Data Engineering, Data Analysis, and Data mining with large data sets of Structured and Unstructured data using Big Data/Hadoop, Spark, Predictive modeling, Statistical modeling, Data modeling, and Data Visualization.
Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools (MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Kafka, Flume, Sqoop, Oozie, Avro) and Spark integration with Cassandra, Avro, Solr, and ZooKeeper.
Good understanding of Spark and Hadoop Cluster Architecture.
Experienced in developing machine learning models for real-world problems using R and Python.
Expertise in designing scalable Big Data solutions, data warehouse models on large-scale distributed data, performing a wide range of analytics to measure service performance.
Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
Experienced in Amazon Web Services (AWS) and Microsoft Azure services such as AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Storage, and Azure Data Lake.
Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing.
Experienced in processing large datasets with Spark using Python.
Experience with machine learning tools and libraries such as Scikit-learn, R, Spark, and Weka.
Experienced in using various Python libraries (Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, Pandas, and MySQLdb for database connectivity).
Strong understanding of dimensional modeling: Star Schema and Snowflake Schema.
Good understanding and working experience on Hadoop Distributions like Cloudera and Hortonworks.
Experience in Conceptual, Logical, and Physical data modeling, Enterprise Data Models, Enterprise ETL architecture.
Hands-on expertise in designing row keys and schemas for NoSQL databases such as MongoDB, HBase, and Cassandra.
Experience in importing and exporting multiple terabytes of data using Sqoop from relational database management systems (RDBMS) to HDFS and vice versa.
Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data, and in querying both managed and external tables created in Hive using Impala.
Extensive experience with ETL and big data query tools such as HiveQL and Pig Latin, and with loading logs from multiple sources into HDFS using Flume.
Experience using Tableau for database joins, nested sorting, integration, and visualization, creating diverse charts, maps, trend lines, and predictive analyses.
Hands-on experience in data mining, cleaning, warehousing, and ETL processes using Talend, Informatica, AWS Glue, and Hive with big data frameworks.
Performed A/B testing and hypothesis testing to deliver business insights.
Built statistical models and performed exploratory analysis with RStudio.
Experience with Deep Learning models for NLP and image recognition by using Python.
Reported and presented data patterns and analytical results by building charts, graphs, and dashboards via MS PowerPoint and Excel.
Experience with Agile project management using Jira.
Used Git repositories for version control.
Highly self-motivated quick learner who adapts readily to new tools and team environments.

TECHNICAL SKILLS:

Languages: Java 8, Python, R

Packages: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, Reshape2, rjson, plyr, Beautiful Soup, Rpy2

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

ETL Tools: SSIS

Version Control Tools: SVN, GitHub

Cloud: Amazon Web Services (AWS), Azure

Data Modeling Tools: Erwin, Rational Rose, ER/Studio, MS Visio

BI Tools: Tableau, Power BI

Databases: SQL, Hive, Impala, Pig, SQL Server, MySQL, HBase, MongoDB

NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch

Reporting Tools: Crystal Reports XI, SSRS

Operating System: Windows, Linux, UNIX

PROFESSIONAL EXPERIENCE:

Confidential

Data Engineer/Big Data/Spark/Hadoop

Responsibilities:

Participated in all phases of data mining, including data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib.
Developed various Tableau data models by extracting and using data from various source files, DB2, MongoDB, Excel, flat files, and Big Data.
Worked with Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
Responsible for building scalable distributed data pipelines in Hadoop.
Wrote Pig scripts to set up Kafka hourly data and perform daily roll-ups.
Migrated data from existing RDBMS systems to HDFS and built derived datasets.
Developed Shell scripts to automate Hive dynamic table and partition creation.
Optimized MapReduce jobs to use HDFS efficiently by applying compression mechanisms such as gzip and Snappy on top of data formats such as Parquet.
Designed ETL workflows on Tableau and deployed data from various sources to HDFS.
Imported/Exported analyzed and aggregated data to Oracle & MySQL databases using Sqoop.
Wrote Pig & Hive scripts to analyze customer data and detect user patterns.
Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
Developed ETL pipelines to supply data to business intelligence teams for building visualizations.
Designed both 3NF data models for ODS and OLTP systems and dimensional data models using Star and Snowflake schemas.
Enabled productivity improvements and data sharing by using Erwin for effective model management.
Implemented end-to-end systems for Data Analytics, Data Automation, and integrated with custom visualization tools using Python, R, Mahout, Hadoop, and MongoDB.
Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, Space-Time.
Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for developing various machine learning algorithms.
Involved in the design and implementation of statistical models, predictive models, an enterprise data model, a metadata solution, and data life cycle management in both RDBMS and Big Data environments.
Implemented classification using supervised algorithms such as Logistic Regression, Decision Trees, KNN, and Naive Bayes.
Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that each document could be assigned a response label for further classification.

Environment: AWS, R, Tableau, Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, ODS, OLTP, Oracle, Hive, OLAP, Metadata, MS Excel, MapReduce, Rational Rose, SQL.

Confidential, Fremont, CA

Data Engineer

Responsibilities:

Built a customer repository data pipeline using Hadoop, Spark, MapReduce, Java, Scala, Dozer, Hibernate, and Postgres.
This pipeline receives data from several source systems in different formats and ensures that only high-quality data flows through before it is loaded into the customer data storage repository.
Worked on batch data ingestion by creating a data pipeline using Sqoop and Spark.
Implemented Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
Developed structured, efficient, and error-free code for Big Data requirements, drawing on knowledge of Hadoop and its ecosystem.
Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
Involved in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
Implemented machine learning methods, optimization, and visualization, applying statistical methods such as regression models, decision trees, Naïve Bayes, ensemble classifiers, hierarchical clustering, and semi-supervised learning to different datasets using Python.
Researched and implemented various machine learning algorithms using the R language, then wrote Scala scripts using the Spark machine learning module.
Designed and modeled several kinds of tables like Associative tables, transactional tables, Base tables, Delta tables, Junk Dimensions, Reference tables.
Developed MapReduce jobs to convert data files into the Parquet file format.
Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
Extracted, transformed, and loaded (ETL) data from source systems to generate CSV data files using Python and SQL queries.
Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
Used Scala scripts to run Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms.
Worked on migrating an on-premises virtual machine to the Amazon Web Services (AWS) cloud.
Worked on different data formats such as JSON and XML, and applied machine learning algorithms in Python.
Reported results through Tableau dashboards with different charts and presented them with MS PowerPoint.
Used Git for version control and Jira for team-wide project management.
Summarized information and reports to deliver insights to the team and client.
Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions.
Engaged constructively with project teams to support project objectives through the application principle
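
The mapper, combiner, partitioner, and reducer roles mentioned above can be illustrated with a small in-memory word count. This is a teaching sketch in plain Python, not the project's code and not the Hadoop API: the function names, the `run_job` driver, and the byte-sum partitioning rule are all hypothetical choices made for the example.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate counts per input split to shrink shuffle traffic."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return local.items()


def partitioner(key, num_reducers):
    """Route a key to a reducer (hypothetical rule: byte sum modulo reducers)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum the partial counts received for one key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    """Run map -> combine -> partition -> reduce over in-memory splits."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        mapped = (pair for line in split for pair in mapper(line))
        for key, value in combiner(mapped):
            partitions[partitioner(key, num_reducers)][key].append(value)
    result = {}
    for partition in partitions:
        for key, values in partition.items():
            word, total = reducer(key, values)
            result[word] = total
    return result


counts = run_job([["spark hadoop spark"], ["hadoop hive"]])
```

The combiner is where most of the tuning benefit usually comes from in a job like this, since each split sends one pair per distinct word to the shuffle instead of one pair per occurrence.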

Environment: AWS, Hadoop, Spark, MapReduce, Java, Scala, Dozer, Hibernate, Postgres, R, Big Data, Azure, Python, ETL, JavaScript, DB2, CRUD, PL/SQL, JDBC, Coherence, MongoDB, Apache CXF, SOAP, Web Services, Eclipse

Confidential

Data Engineer

Responsibilities:

Designed and developed the real-time matching solution for customer data ingestion.
Worked on converting multiple SQL Server and Oracle stored procedures to Hadoop using Spark SQL, Hive, Scala, and Java.
Created a production data lake that can handle transactional processing operations using the Hadoop ecosystem.
Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations.
Validated and cleansed data using Pig statements and developed Pig macros.
Analyzed a dataset of 14M records and reduced it to 1.3M by filtering out rows with duplicate customer IDs and removing outliers using boxplots and univariate algorithms.
Worked on Hadoop Big Data integration with ETL, performing data extraction, transformation, and loading for ERP data.
Performed extensive exploratory data analysis using Teradata to improve the quality of the dataset and created Data Visualizations using Tableau.
Worked with Python libraries such as Pandas and NumPy, including one-dimensional and two-dimensional NumPy arrays.
Experienced in using the PyTorch library and implementing natural language processing.
Developed data visualizations in Tableau to display the day-to-day accuracy of the model on newly incoming data.
Used R for statistical modeling, including Bayesian analysis and hypothesis testing with the dplyr and BAS packages, and visualized test results in R to deliver business insights.
Validated models using confusion matrices, ROC, and AUC, and developed diagnostic tables and graphs that demonstrated how a model can be used to improve the efficiency of the selection process.
Presented and reported business insights through SSRS and Tableau dashboards combined with different diagrams.
Used Jira for project management and Git for version control to build the program.
Reported and displayed the analysis results in the web browser with HTML and JavaScript.
Worked constructively with project teams, supported the project's goals, and delivered insights to the team and client.
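
The de-duplication and boxplot-based outlier removal described above can be sketched as follows. This is a minimal illustration on toy data, not the code used on the project: the field names `customer_id` and `amount`, the keep-first rule for duplicates, and the conventional 1.5 x IQR whisker rule are assumptions made for the example.

```python
from statistics import quantiles


def drop_duplicate_ids(rows, key="customer_id"):
    """Keep only the first row seen for each customer ID (assumed rule)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) whisker limits of a standard boxplot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def remove_outliers(rows, field):
    """Drop rows whose value in `field` falls outside the whisker limits."""
    low, high = iqr_bounds([row[field] for row in rows])
    return [row for row in rows if low <= row[field] <= high]


# Toy records; the field names are hypothetical.
records = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 1, "amount": 12},   # duplicate ID, dropped
    {"customer_id": 2, "amount": 11},
    {"customer_id": 3, "amount": 12},
    {"customer_id": 4, "amount": 13},
    {"customer_id": 5, "amount": 500},  # far outside the whiskers
]
deduped = drop_duplicate_ids(records)
cleaned = remove_outliers(deduped, "amount")
```

At the scale of millions of rows the same two steps would more likely be expressed in Spark or Pandas; the logic is unchanged.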

Environment: Hadoop, Spark SQL, Hive, Scala, Java, MS Access, SQL Server, Pig, PySpark, Tableau, Excel

Confidential

Data Analyst

Responsibilities:

Provided technical solutions to analyze marketing campaigns and business performance.
Worked with the engineering team and manager to maintain the database via MySQL and Microsoft Excel, and further explored data analysis with R, Python, and Tableau.
Designed relational databases and optimized the database performance in MySQL by using different syntax, stored procedures, defined functions and implemented batch
Connected MySQL with Excel for importing or exporting data, cleaning data, and analysis via sorting, filtering, conditional formatting, charts, and pivot tables.
Worked with Python for exploratory data analysis to determine the patterns and features of the data, including time series and correlation.
Loaded data into R, reformatted data, checked consistency, and performed statistical analysis and testing in R, such as Naïve Bayes modeling and hypothesis tests.
Visualized and reported data by creating graphs in Microsoft Excel, R, and Tableau with different charts and dashboards.
Engaged with project teams, delivering insights to the team and client with Microsoft PowerPoint.
Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions.

Environment: Database, SQL Server, Linux, Unix, Excel, Oracle, Tableau.
