
Data Engineer/Big Data/Spark/Hadoop Resume


  • 4+ years of strong experience in Data Engineering, Data Analysis, and Data mining with large data sets of Structured and Unstructured data using Big Data/Hadoop, Spark, Predictive modeling, Statistical modeling, Data modeling, and Data Visualization.
  • Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools (MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Kafka, Flume, Sqoop, Oozie, and Avro) and Spark integration with Cassandra, Avro, Solr, and ZooKeeper.
  • Good understanding of Spark and Hadoop Cluster Architecture.
  • Experienced in developing machine learning models for real-world problems using R and Python.
  • Expertise in designing scalable Big Data solutions, data warehouse models on large-scale distributed data, performing a wide range of analytics to measure service performance.
  • Experienced in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
  • Experienced in Amazon Web Services (AWS) and Microsoft Azure services such as AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Storage, and Azure Data Lake.
  • Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Experienced in processing large datasets with Spark using Python.
  • Experience with machine learning tools and libraries such as Scikit-learn, R, Spark, and Weka.
  • Experienced in using various Python libraries (Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, Pandas, and MySQLdb for database connectivity).
  • Strong understanding of dimensional modeling: Star Schema and Snowflake Schema.
  • Good understanding of and working experience with Hadoop distributions such as Cloudera and Hortonworks.
  • Experience in Conceptual, Logical, and Physical data modeling, Enterprise Data Models, Enterprise ETL architecture.
  • Hands-on expertise in row key and schema design with NoSQL databases such as MongoDB, HBase, and Cassandra.
  • Experience in importing and exporting multi terabytes of data using Sqoop from Relational Database Management System to HDFS and vice versa.
  • Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data, and in querying both managed and external tables created in Hive using Impala.
  • Extensive experience with the big data ETL and query tools HiveQL and Pig Latin, and with loading logs from multiple sources into HDFS using Flume.
  • Experience using Tableau with database join, nested sorting, integration, visualization by creating diverse charts, maps, trend lines, and predictive analysis
  • Hands-on experience in data mining, cleaning, warehousing, and ETL process by using Talend, Informatica, AWS Glue, Hive with big data frameworks
  • Performed A/B testing and hypothesis testing to deliver business insights.
  • Built statistical models and exploratory analysis with R Studio
  • Experience with Deep Learning models for NLP and image recognition by using Python.
  • Reported and presented data patterns and analytical results by building charts, graphs, and dashboards via MS PowerPoint and Excel.
  • Experience with project management Agile methodology with Jira
  • Used Git repository as version control
  • Highly self-motivated quick learner who adapts readily to new tools and team environments


Languages: Java 8, Python, R

Libraries: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, reshape2, rjson, plyr, Beautiful Soup, rpy2

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka


Version Control Tools: SVN, GitHub

Cloud: Amazon Web Services (AWS), Azure

Data Modeling Tools: Erwin, Rational Rose, ER/Studio, MS Visio

BI Tools: Tableau, Power BI

Databases: SQL, Hive, Impala, Pig, SQL Server, MySQL, HBase, MongoDB

NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch

Reporting Tools: Crystal Reports XI, SSRS

Operating System: Windows, Linux, UNIX



Data Engineer/Big Data/Spark/Hadoop


  • Participated in all phases of data mining, data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib.
  • Developed various Tableau data models by extracting and using data from various source files, DB2, MongoDB, Excel, flat files, and Big Data.
  • Worked with Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
  • Responsible for building scalable distributed data pipelines in Hadoop.
  • Wrote Pig scripts to set up Kafka hourly data and perform daily roll-ups.
  • Migrated data from existing RDBMS systems to HDFS and built derived datasets.
  • Developed Shell scripts to automate Hive dynamic table and partition creation.
  • Optimized MapReduce jobs to use HDFS efficiently by applying compression mechanisms such as gzip and Snappy on top of data formats such as Parquet.
  • Designed ETL workflows on Tableau and deployed data from various sources to HDFS.
  • Imported/Exported analyzed and aggregated data to Oracle & MySQL databases using Sqoop.
  • Wrote Pig & Hive scripts to analyze customer data and detect user patterns.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Developed ETL pipelines to supply data to business intelligence teams for building visualizations.
  • Designed both 3NF data models for ODS and OLTP systems and dimensional data models using Star and Snowflake schemas.
  • Enabled productivity improvements and data sharing by using Erwin for effective model management.
  • Implemented end-to-end systems for Data Analytics, Data Automation, and integrated with custom visualization tools using Python, R, Mahout, Hadoop, and MongoDB.
  • Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, Space-Time.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python for developing various machine learning algorithms.
  • Involved in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
  • Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
  • Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.

Environment: AWS, R, Tableau, Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, ODS, OLTP, Oracle, Hive, OLAP, Metadata, MS Excel, MapReduce, Rational Rose, SQL.

Confidential, Fremont, CA

Data Engineer


  • Built a customer repository data pipeline using Hadoop, Spark, MapReduce, Java, Scala, Dozer, Hibernate, and Postgres.
  • The pipeline receives data from several source systems in different formats and ensures high-quality data flows through it before being loaded into the customer data storage repository.
  • Worked on batch data ingestion by creating a data pipeline using Sqoop and Spark.
  • Implemented Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
  • Developed structured, efficient, and error-free code for Big Data requirements using knowledge of Hadoop and its ecosystem.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Involved in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
  • Implemented machine learning methods, optimization, and visualization, applying statistical methods such as regression models, decision trees, Naïve Bayes, ensemble classifiers, hierarchical clustering, and semi-supervised learning to different datasets using Python.
  • Researched and implemented various machine learning algorithms using the R language, then wrote Scala scripts using the Spark machine learning module.
  • Designed and modeled several kinds of tables like Associative tables, transactional tables, Base tables, Delta tables, Junk Dimensions, Reference tables.
  • Developed MapReduce jobs to convert data files into the Parquet file format.
  • Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
  • Extracted, transformed, and loaded (ETL) data sources to generate CSV data files using Python and SQL queries.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Used Scala scripts to run Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms.
  • Worked on migrating an on-premises virtual machine to the Amazon Web Services (AWS) cloud.
  • Worked on different data formats such as JSON and XML, and ran machine learning algorithms in Python.
  • Reported Tableau dashboards with different charts and presented them with MS PowerPoint
  • Used Git for version control and Jira for team-wide project management
  • Summarized the information and reports to deliver the insights for team and client
  • Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions
  • Engaged constructively with project teams to support project objectives through the application principle

Environment: AWS, Hadoop, Spark, MapReduce, Java, Scala, Dozer, Hibernate, Postgres, R, Big Data, Azure, Python, ETL, JavaScript, DB2, CRUD, PL/SQL, JDBC, Coherence, MongoDB, Apache CXF, SOAP, Web Services, Eclipse


Data Engineer


  • Designed and developed the real-time matching solution for customer data ingestion
  • Converted multiple SQL Server and Oracle stored procedures to Hadoop using Spark SQL, Hive, Scala, and Java.
  • Created a production data lake that can handle transactional processing operations using the Hadoop ecosystem.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations.
  • Validated and cleansed data using Pig statements and developed Pig macros.
  • Analyzed a dataset of 14M records and reduced it to 1.3M by filtering out rows with duplicate customer IDs and removing outliers using boxplots and univariate algorithms.
  • Worked on Hadoop Big Data integration with ETL, performing data extraction, loading, and transformation for ERP data.
  • Performed extensive exploratory data analysis using Teradata to improve the quality of the dataset and created Data Visualizations using Tableau.
  • Experienced in Python libraries such as Pandas and NumPy, including one-dimensional and two-dimensional NumPy arrays.
  • Experienced in using PyTorch library and implementing natural language processing.
  • Developed data visualizations in Tableau to display day to day accuracy of the model with newly incoming Data.
  • Worked with R for statistical modeling such as Bayesian analysis and hypothesis testing with the dplyr and BAS packages, and visualized test results in R to deliver business insights
  • Validated models using confusion matrix, ROC, and AUC, and developed diagnostic tables and graphs that demonstrated how a model can be used to improve the efficiency of the selection process
  • Presented and reported business insights using SSRS and Tableau dashboards combined with different diagrams
  • Used Jira for project management and Git for version control while building the program
  • Reported and displayed the analysis result in the web browser with HTML and JavaScript
  • Engaged constructively with project teams, supported project goals, and delivered insights to the team and client
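
For illustration only, the duplicate-ID filtering and boxplot-style outlier removal described in the bullets above could be sketched in Python roughly as follows. The function names, the `customer_id` and `amount` field names, the keep-first policy for duplicates, and the 1.5 × IQR whisker rule are all assumptions made for this sketch, not details of the project code.

```python
from statistics import quantiles


def drop_duplicate_ids(rows, key="customer_id"):
    """Keep the first row seen for each ID (keep-first policy is an assumption)."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def iqr_bounds(values, k=1.5):
    """Return the (low, high) whisker limits a boxplot would draw for `values`."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles: Q1, median, Q3
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, field, k=1.5):
    """Drop rows whose `field` value falls outside the boxplot whiskers."""
    low, high = iqr_bounds([row[field] for row in rows], k)
    return [row for row in rows if low <= row[field] <= high]
```

A typical call would chain the two steps, for example `drop_outliers(drop_duplicate_ids(rows), "amount")`, where `rows` is a list of dictionaries.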

Environment: Hadoop, Spark SQL, Hive, Scala, Java, MS Access, SQL Server, Pig, PySpark, Tableau, Excel


Data Analyst


  • Provided technical solutions to analyze marketing campaigns and business performance
  • Worked with the engineering team and manager to maintain the database via MySQL and Microsoft Excel, and further explored data analysis with R, Python, and Tableau
  • Designed relational databases and optimized the database performance in MySQL by using different syntax, stored procedures, defined functions and implemented batch
  • Connected MySQL with Excel for importing and exporting data, cleaning data, and analysis via sorting, filtering, conditional formatting, charts, and pivot tables
  • Worked with Python for exploratory data analysis to determine the patterns and features of the data, including time series and correlation.
  • Loaded data into R, reformatted data, checked consistency, and performed statistical analysis and testing such as a Naïve Bayes model and hypothesis tests in R
  • Visualized and reported data by creating graphs in Microsoft Excel, R, and Tableau with different charts and dashboards
  • Engaged with project teams, delivering the insights for team and client with Microsoft PowerPoint
  • Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions

Environment: Database, SQL Server, Linux, Unix, Excel, Oracle, Tableau.
