- 7 years of hands-on experience in Data Science and Analytics, including BigQuery with SQL, Data Mining, Machine Learning, and Statistical Analysis on large structured and unstructured datasets, Data Validation, Data Visualization, data warehousing, and Predictive Modelling in Python.
- In-depth knowledge of Machine Learning algorithms such as Linear, Non-linear, and Logistic Regression, SVR, Random Forests, Ensemble Methods, Decision Trees, Gradient Boosting, K-NN, SVM, Naïve Bayes, and Clustering (K-means), and of deep learning models such as DNN, CNN, RNN, LSTM, GAN, and Transfer Learning.
- Hands-on implementation experience in NLP, Document Representation, Text Categorization, Sentiment Analysis, Topic Modelling, and Text Visualization.
- Expertise in the programming languages Python and R.
- Solid software development experience in Python (libraries: Beautiful Soup, PySpark, NumPy, SciPy, Matplotlib, asyncio, python-twitter, pandas, network, urllib2, MySQL for database connectivity) with IDEs and tools such as Sublime Text, Spyder, PyCharm, and pytest.
- Proficient with data visualization tools such as Matplotlib, Seaborn, and Plotly.
- Performed data collection, pre-processing, feature engineering, data visualization, data warehousing, and analysis on large volumes of unstructured data using Python and R (scikit-learn, Matplotlib, pandas, NumPy, Seaborn, ggplot2, dplyr).
- Queried data in BigQuery using SQL and performed data collection, data warehousing, data cleaning, featurization, feature engineering, and feature scaling on customers' historical data.
- Experienced in Big Data Ecosystem with Hadoop, HDFS, MapReduce, Pig, Hive, HBase, Impala, Sqoop, Flume, Kafka, Oozie, Spark, PySpark and Spark Streaming.
- Strong experience with Pig, Hive, Impala, and MapReduce in the Hadoop ecosystem.
- Experience in setting up and maintaining Hadoop clusters running HDFS and MapReduce on YARN.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (SQL/MySQL) and vice-versa.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Experience with commercial Hadoop distributions, including Hortonworks Data Platform (HDP) in production and Cloudera CDH.
- Expertise in statistical methods such as Hypothesis Testing, ANOVA, Regression, and A/B Testing.
- Strong Database Experience on RDBMS (Oracle, MySQL) with PL/SQL programming skills in creating Packages, Stored Procedures, Functions, Triggers & Cursors.
- Knowledge of NoSQL databases including HBase, Cassandra, and MongoDB.
- Extensive familiarity with SQL, Oracle and MySQL database management.
- Experience with cloud databases and data warehouses (SQL Azure and Amazon Redshift/RDS).
- Experience with Amazon Web Services (AWS), including EC2, S3, Lambda, and EMR; used Redshift for data migration.
- Implemented automated local user provisioning on instances created in AWS and Google Cloud.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Implemented ad hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Experience with container-based deployments using Docker, working with Docker images, Docker Hub, Docker registries, and Kubernetes.
- Experience in using design patterns such as MVC and Singleton and frameworks such as Django. Experienced in developing web services in Python.
- Proficient in documenting business processes, gathering requirements, and identifying gaps.
- Work effectively in fast-paced, multitasking environments, both independently and as part of a collaborative team. Comfortable with challenging projects and with working through ambiguity to solve complex problems.
Languages: Python, R, SQL, Shell scripting, Java, Scala
IDE: RStudio, Jupyter Notebook, Zeppelin, Eclipse, NetBeans, Atom
Databases: Oracle 11g, SQL Server, MS Access, MySQL, MongoDB, Cassandra, PL/SQL, T-SQL, ETL
Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Impala, Spark, MLlib, PySpark
Operating Systems: Windows XP/7/8/10, Ubuntu, Linux, Unix
Packages: Ggplot, caret, dplyr, R Weka, gmodels, RCurl, tm, C50, Wordcloud, Kernlab, Neuralnet, twitter, NLP, Reshape2, rjson, plyr, pandas, Numpy, seaborn, scipy, Matplotlib, Scikit-learn, Beautiful Soup, Rpy2, Tensorflow, Pytorch, CNN, RNN
Data Analytics Tools: R Console, Python (Numpy, pandas, Scikit-learn, scipy), SPSS
BI and Visualization: Tableau, SSAS, SSRS, Informatica, QlikView
Version Control: Git, SVN
Confidential, Atlanta, GA
Data Scientist/Data Engineer
- Developed and implemented various complex models using machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN, PCA, regularization and imputation.
- Performed statistical analysis and built high-quality prediction systems using data mining and machine learning techniques. Improved prediction accuracy using modeling techniques such as SVM, Naive Bayes, Decision Trees, Gradient Boosting (GBM, XGBoost), AdaNet, Random Forest, classification, Linear/Logistic Regression, K-Means, and K-NN, along with deep learning libraries such as TensorFlow and Keras, with the objective of achieving the lowest test error.
- Performed exploratory data analysis and used K-Means clustering to identify outliers. Handled imbalanced data with bootstrap resampling.
- Performed Natural Language Processing in Python to mine unstructured data using document clustering, topic analysis, named entity recognition, document classification and sentiment analysis.
- Analyzed transaction data and developed analytical insights from statistical models built in Azure Machine Learning.
- Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Used principal component analysis and factor analysis for dimensionality reduction of the data.
- Worked on ensemble methods such as bagging (Random Forest), boosting (AdaBoost), and stacking; feature engineering with Pearson correlation and F-score; and dimensionality reduction with PCA and LDA.
- Evaluated models using cross-validation, log loss, and ROC curves, and used AUC for feature selection.
- Ensured that the model had a low false positive rate.
- Addressed overfitting by applying regularization methods such as L1 and L2.
- Designed and created reports that use gathered metrics to draw logical conclusions about past and future behavior.
- Developed MapReduce pipeline for feature extraction using Hive.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data.
- Created various types of data visualizations using R, Python, and Tableau.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
Environment: Python, R, HDFS, Hadoop, Hive, Impala, MLlib, Linux, Spark, Tableau Desktop, SQL Server, Matlab, Spark SQL, PySpark, Azure.
Confidential, Atlanta, GA
Big Data Engineer
- Performed Logistic Regression, Classification, Random Forests and Clustering in Python.
- Worked on Linear Discriminant Analysis, greedy forward and backward selection, and feature selection and reduction algorithms such as Principal Component Analysis and Factor Analysis.
- Skilled in using Python to parse, manipulate, and convert data to and from a wide range of formats (CSV, JSON, XML, HTML, etc.).
- Familiarity with the AWS ecosystem, including tradeoffs between services that meet overlapping needs.
- Used Informatica PowerCenter Designer to analyze source data and to extract and transform it from various source systems (Oracle 10g, DB2, SQL Server, and flat files), incorporating business rules using the objects and functions the tool supports.
- Migrated on-premises database structures to an Amazon Redshift data warehouse.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into Amazon Redshift.
- Maintained existing ETL workflows, data management, and data query components.
- Collected, aggregated, and moved data from servers to HDFS using Apache Flume.
- Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
- Migrated ETL jobs to Pig scripts to perform transformations, including joins and some pre-aggregations, before storing the data in HDFS.
- Involved in installation and configuration of Cloudera Distribution Hadoop platform.
- Extracted, transformed, and loaded (ETL) data from multiple federated data sources (JSON, relational databases, etc.) with DataFrames in Spark.
- Utilized SparkSQL to extract and process data by parsing using Datasets or RDDs in HiveContext, with transformations and actions (map, flatMap, filter, reduce, reduceByKey).
- Interacted with the Spark shell through the Python API (PySpark).
- Developed Spark programs using the Scala API to compare the performance of Spark with Hive and SQL.
- Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and bucketing.
- Used Python libraries like NumPy & Pandas in conjunction with Spark in dealing with DataFrames.
- Worked with Flume for collecting, aggregating and moving large amounts of log data as well as for streaming log data.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie Operational Services for batch processing and dynamic workflow scheduling.
- Used Kafka to build real-time data pipelines and streaming applications: publishing and subscribing to message queues (topics), storing streams of records in a fault-tolerant, durable way, and processing streams of records as they occur.
- Worked with Spark Streaming for streaming real time data using DStreams.
- Used Amazon EMR to create and configure a cluster of Amazon EC2 instances running Hadoop.
- Developed, trained, and tested machine learning models using SageMaker.
- Performed Exploratory Data Analysis and Data Visualizations using R, Python and Tableau.
Environment: Hadoop 2.7.7, HDFS 2.7.7, Spark 2.1, MapReduce 2.9.1, Hive 2.3, Sqoop 1.4.7, Kafka 0.8.2.X, HBase, Oozie, Flume 1.8.0, Scala 2.12.8, PySpark, AWS, Python 3.7, Java 8, JSON, SQL Scripting and Linux Shell Scripting, Avro, Parquet, Hortonworks & Cloudera.
Confidential, San Francisco, CA
- Responsible for building scalable distributed data solutions in an EMR cluster environment with Amazon EMR 5.6.1.
- Worked with the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted the results to HDFS.
- Developed Spark scripts in Scala, writing custom RDDs for data transformations and performing actions on RDDs.
- Worked on creating Spring-Boot services for Oozie orchestration.
- Deployed Spring-Boot entity services for Audit Framework of the loaded data.
- Worked with Avro, Parquet, and ORC file formats and compression techniques such as LZO.
- Used Hive to form an abstraction on top of structured data that resides in HDFS, and implemented partitions, dynamic partitions, and buckets on Hive tables.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics using Hive.
- Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
- Worked on migrating MapReduce programs into Spark transformations using Scala.
- Designed, developed data integration programs in a Hadoop environment with NoSQL data store Cassandra for data access and analysis.
- Used the Apache Oozie job scheduler to execute workflows.
- Used Ambari to monitor node health and the status of jobs in Hadoop clusters.
- Designed and implemented data warehouses and data marts using components of the Kimball methodology, such as the Data Warehouse Bus, Conformed Facts and Dimensions, Slowly Changing Dimensions, Surrogate Keys, Star Schema, and Snowflake Schema.
- Worked on Tableau to build customized interactive reports, worksheets and dashboards.
- Implemented Kerberos for strong authentication to provide data security.
- Implemented LDAP and Active Directory for Hadoop clusters.
- Tuned Spark job performance using caching and by taking full advantage of the cluster environment.
Environment: AWS S3, EMR, Lambda, CloudWatch, Amazon Redshift, Spark-Java, Spark-Scala, Athena, Hive, HDFS, Spark, Scala, Oozie, Bitbucket, GitHub, Snowflake
Confidential, Dallas, TX
- Performed data extraction, aggregation, and log analysis on real-time data using Spark Streaming.
- Worked on projects involving data visualization, R and Python development, Unix, and SQL.
- Performed exploratory data analysis using NumPy, Matplotlib, and pandas.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Implemented Principal Component Analysis and Linear Discriminant Analysis.
- Used lambda functions with filter(), map(), and reduce() on pandas DataFrames to perform various operations.
- Used the pandas API for analyzing time series. Created a regression test framework for new code.
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy. Gathered business requirements and translated them into clear and concise specifications and queries.
- Prepared high-level analysis reports with Excel and Tableau. Provided feedback on data quality, including identification of billing patterns and outliers.
- Created models for time-series forecasting, multivariate analysis, optimizer design, and simulation using EViews and R.
- Eliminated incomplete or unusable data.
- Worked on Snowflake Schema, Data Modeling, Data Elements, Issue/Question Resolution Logs, and Source to Target Mappings, Interface Matrix and Design elements.
- Performed Exploratory Data Analysis and Data Visualizations using R and Tableau.
Environment: Python, SciPy, pandas, RStudio, Tableau, SQL, scikit-learn, Matplotlib, NumPy, Snowflake.
- Involved in architecture, flow, and the database model of the application.
- Worked on requirements gathering, analysis, design, change management and deployment.
- Worked with the developers to understand the Entity Relationship Diagram (ERD), created comprehensive test plans for the applications following the RUP model, and gathered test data for the data-driven test cases.
- Developed ETL jobs per the requirements to load data into the staging database (Postgres) from various data sources and REST APIs.
- Worked with the Informatica PowerCenter tools: Designer, Repository Manager, Workflow Manager, and Workflow Monitor.
- Used various transformations like Filter, Expression, Sequence Generator, Update Strategy, Joiner, Stored Procedure, and Union to develop robust mappings in the Informatica Designer.
- Developed analytical queries in Teradata, SQL-Server, and Oracle.
- Developed a web service on the Postgres database using the Python Flask framework, which served as the backend for the real-time dashboard.
- Worked on optimization and memory management of the ETL services.
- Created integrated test environments for the ETL applications using Docker and Python APIs.
- Installed data sources such as SQL Server, Cassandra, and remote servers in Docker containers to provide an integrated testing environment for the ETL applications.
- Designed SSIS packages to transfer data between servers, load data into databases in a SQL Server 2005 environment, and deploy the data.
- Designed high level ETL architecture for overall data transfer from the source server to the Enterprise Services Warehouse which encompasses server name, database name, accounts, tables and direction of data flow, Column Mapping and Metadata.
- Created data profiling stored procedures using dynamic SQL and wrote complex scripts to schedule various maintenance tasks.
- Developed logical and physical E-R diagrams using Erwin, mapping the data into the database.
- Worked in dimensional modeling to design the data warehouse.
- Performed query optimization, execution plan analysis, and performance tuning of queries.
- Created Tables, Indexes, Table Spaces and integrity constraints.
- Followed 2NF/3NF normalization standards when designing OLTP databases.
- Responsible for transforming data from OLTP to OLAP data sources using SSIS.