- 4+ years of strong experience in Data Engineering, Data Analysis, and Data mining with large data sets of Structured and Unstructured data using Big Data/Hadoop, Spark, Predictive modeling, Statistical modeling, Data modeling, and Data Visualization.
- Experience in Big Data analytics and data manipulation using Hadoop ecosystem tools: MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Kafka, Flume, Sqoop, and Oozie, and Spark integration with Cassandra, Avro, Solr, and ZooKeeper.
- Good understanding of Spark and Hadoop Cluster Architecture.
- Experienced in developing machine learning models for real-world problems using R and Python.
- Expertise in designing scalable Big Data solutions, data warehouse models on large-scale distributed data, performing a wide range of analytics to measure service performance.
- Experienced in Agile methodologies, including Scrum stories and sprints, in a Python-based environment, along with data analytics and data wrangling.
- Experienced in Amazon Web Services (AWS) and Microsoft Azure services such as AWS EC2, S3, RDS, Azure HDInsight, Machine Learning Studio, Azure Storage, and Azure Data Lake.
- Experienced in data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling, data mining, machine learning, and advanced data processing.
- Experienced in processing large datasets with Spark using Python.
- Experience with machine learning tools and libraries such as Scikit-learn, R, Spark, and Weka
- Experienced in using various Python libraries (Beautiful Soup, NumPy, SciPy, matplotlib, python-twitter, Pandas, and MySQLdb for database connectivity).
- Strong understanding of dimensional modeling (Star Schema, Snowflake Schema).
- Good understanding of and working experience with Hadoop distributions such as Cloudera and Hortonworks.
- Experience in Conceptual, Logical, and Physical data modeling, Enterprise Data Models, Enterprise ETL architecture.
- Hands-on expertise in row key and schema design with NoSQL databases such as MongoDB, HBase, and Cassandra.
- Experience in importing and exporting multiple terabytes of data using Sqoop from relational database management systems (RDBMS) to HDFS and vice versa.
- Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data, and in querying both managed and external tables created in Hive using Impala.
- Extensive experience with the big data ETL and query tools HiveQL and Pig Latin, and with loading logs from multiple sources into HDFS using Flume.
- Experience using Tableau with database joins, nested sorting, integration, and visualization by creating diverse charts, maps, trend lines, and predictive analyses.
- Hands-on experience in data mining, cleaning, warehousing, and ETL processes using Talend, Informatica, AWS Glue, and Hive with big data frameworks.
- Performed A/B testing and hypothesis testing to deliver business insights.
- Built statistical models and performed exploratory analysis with RStudio.
- Experience with Deep Learning models for NLP and image recognition by using Python.
- Reported and presented data patterns and analytical results by building charts, graphs, and dashboards via MS PowerPoint and Excel.
- Experience with Agile project management using Jira.
- Used Git for version control.
- Highly self-motivated quick learner who adapts readily to new tools and team environments.
Languages: Java 8, Python, R
Libraries: NumPy, SciPy, Pandas, Scikit-learn, Matplotlib, Seaborn, ggplot2, caret, dplyr, purrr, readxl, tidyr, RWeka, gmodels, RCurl, C50, twitter, NLP, Reshape2, rjson, plyr, Beautiful Soup, Rpy2
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka
ETL Tools: SSIS
Version Control Tools: SVN, GitHub
Cloud: Amazon Web Services (AWS), Azure
Data Modeling Tools: Erwin, Rational Rose, ER/Studio, MS Visio
BI Tools: Tableau, Power BI
Databases: SQL, Hive, Impala, Pig, SQL Server, MySQL, HBase, MongoDB
NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch
Reporting Tools: Crystal Reports XI, SSRS
Operating System: Windows, Linux, UNIX
Data Engineer/Big Data/Spark/Hadoop
- Participated in all phases of data mining, including data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, and a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib.
- Developed various Tableau data models by extracting and using data from various sources, including DB2, MongoDB, Excel, flat files, and Big Data.
- Worked with Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
- Responsible for building scalable distributed data pipelines in Hadoop.
- Wrote Pig scripts to set up Kafka hourly data and perform daily roll-ups.
- Migrated data from existing RDBMS systems to HDFS and built derived datasets.
- Developed Shell scripts to automate Hive dynamic table and partition creation.
- Optimized MapReduce jobs to use HDFS efficiently by applying compression codecs such as gzip and Snappy on top of data formats such as Parquet.
- Designed ETL workflows on Tableau and deployed data from various sources to HDFS.
- Imported/Exported analyzed and aggregated data to Oracle & MySQL databases using Sqoop.
- Wrote Pig & Hive scripts to analyze customer data and detect user patterns.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Developed ETL pipelines to supply data to business intelligence teams for building visualizations.
- Designed both 3NF data models for ODS and OLTP systems and dimensional data models using Star and Snowflake schemas.
- Enabled productivity improvements and data sharing by using Erwin for effective model management.
- Implemented end-to-end systems for Data Analytics, Data Automation, and integrated with custom visualization tools using Python, R, Mahout, Hadoop, and MongoDB.
- Worked with several R packages including knitr, dplyr, SparkR, Causal Infer, Space-Time.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
- Involved in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
- Updated Python scripts to match training data with our database stored in AWS Cloud Search, so that we would be able to assign each document a response label for further classification.
Environment: AWS, R, Tableau, Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, Python, Data Mining, Data Collection, Data Cleaning, Validation, HDFS, ODS, OLTP, Oracle, Hive, OLAP, Metadata, MS Excel, MapReduce, Rational Rose, SQL.
Confidential, Fremont, CA
- Built a customer repository data pipeline using Hadoop, Spark, MapReduce, Java, Scala, Dozer, Hibernate, and Postgres.
- The pipeline receives data from several source systems in different formats and ensures high data quality before the data is loaded into the customer data storage repository.
- Worked on batch data ingestion by creating a data pipeline using Sqoop and Spark.
- Implemented Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
- Developed structured, efficient, and error-free code for Big Data requirements using knowledge of Hadoop and its ecosystem.
- Developed Spark/Scala and Python code for a regular expression (regex) project in a Hadoop/Hive environment on Linux/Windows for big data resources.
- Involved in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
- Implemented machine learning, optimization, and visualization methods, along with statistical methods such as regression models, decision trees, Naïve Bayes, ensemble classifiers, hierarchical clustering, and semi-supervised learning, on different datasets using Python.
- Researched and implemented various machine learning algorithms in R, then wrote Scala scripts using the Spark machine learning module.
- Designed and modeled several kinds of tables like Associative tables, transactional tables, Base tables, Delta tables, Junk Dimensions, Reference tables.
- Developed MapReduce jobs to convert data files into the Parquet file format.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Extracted, transformed, and loaded (ETL) data from source systems to generate CSV data files using Python and SQL queries.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Used Scala scripts to run Spark machine learning library APIs for decision trees, ALS, and logistic and linear regression algorithms.
- Worked on migrating an on-premises virtual machine to the Amazon Web Services (AWS) cloud.
- Worked with data formats such as JSON and XML, and applied machine learning algorithms in Python.
- Reported results through Tableau dashboards with different charts and presented them with MS PowerPoint.
- Used Git for version control and Jira for team-wide project management.
- Summarized information and reports to deliver insights to the team and client.
- Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions
- Engaged constructively with project teams to support project objectives.
- Designed and developed the real-time matching solution for customer data ingestion
- Worked on converting multiple SQL Server and Oracle stored procedures to Hadoop using Spark SQL, Hive, Scala, and Java.
- Created a production data lake that can handle transactional processing operations using the Hadoop ecosystem.
- Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations.
- Validated and cleansed data using Pig statements and developed Pig macros.
- Analyzed a dataset of 14M records and reduced it to 1.3M by filtering out rows with duplicate customer IDs and removing outliers identified with boxplots and univariate algorithms.
- Worked on Hadoop Big Data integration with ETL, performing data extraction, loading, and transformation for ERP data.
- Performed extensive exploratory data analysis using Teradata to improve the quality of the dataset and created Data Visualizations using Tableau.
- Experienced in Python libraries such as Pandas and NumPy (one- and two-dimensional arrays).
- Experienced in using the PyTorch library and implementing natural language processing.
- Developed data visualizations in Tableau to display the day-to-day accuracy of the model on newly incoming data.
- Worked with R for statistical modeling such as Bayesian modeling and hypothesis testing with the dplyr and BAS packages, and visualized test results in R to deliver business insights.
- Validated models using confusion matrices, ROC curves, and AUC, and developed diagnostic tables and graphs that demonstrated how a model can be used to improve the efficiency of the selection process.
- Presented and reported business insights through SSRS and Tableau dashboards combined with different diagrams.
- Used Jira for project management and Git for version control to build the program.
- Engaged constructively with project teams, supported the project's goals, and delivered insights to the team and client.
Environment: Hadoop, Spark SQL, Hive, Scala, Java, MS Access, SQL Server, Pig, PySpark, Tableau, Excel
- Provided technical solutions to analyze marketing campaigns and business performance
- Worked with the engineering team and manager to maintain the database via MySQL and Microsoft Excel, and further explored data analysis with R, Python, and Tableau.
- Designed relational databases and optimized database performance in MySQL by using different syntax, stored procedures, and defined functions, and implemented batch
- Connected MySQL with Excel for importing and exporting data, and for cleaning and analyzing data via sorting, filtering, conditional formatting, charts, and pivot tables.
- Worked with Python for exploratory data analysis to determine the patterns and features of the data, including time series and correlations.
- Loaded data into R, reformatted data, checked consistency, and performed statistical analysis and testing in R, such as Naïve Bayesian modeling and hypothesis tests.
- Visualized and reported data by creating graphs in Microsoft Excel, R, and Tableau with different charts and dashboards.
- Engaged with project teams, delivering insights to the team and client with Microsoft PowerPoint.
- Implemented the design, analysis, and interpretation of a variety of reports and analytical solutions
Environment: Database, SQL Server, Linux, Unix, Excel, Oracle, Tableau.