- Professionally qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Machine Learning, Data Mining and Statistical Analysis.
- Involved in all phases of the data science project life cycle, including data extraction, data cleaning, statistical modeling and data visualization with large sets of structured and unstructured data.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression and k-means.
- Implemented Bagging and Boosting to enhance the model performance.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing and ANOVA.
- Worked extensively with Python (NumPy, Pandas, Matplotlib, NLTK and Scikit-learn).
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.X, R 3.0 (ggplot2, caret, dplyr) and Excel.
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMS like SQL Server 2008 and NoSQL databases like MongoDB.
- Strong experience in Big Data technologies like Spark, Spark SQL, PySpark, Hadoop 2.X, HDFS and Hive 1.X.
- Experience with visualization tools like Tableau 9.X and 10.X for creating dashboards.
- Excellent understanding of Agile and Scrum development methodologies.
- Used version control tools like Git 2.X.
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) under Agile and Scrum methodologies.
- Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Proficient in Predictive Modeling, Data Mining Methods, Factor Analysis, ANOVA, hypothesis testing, normal distribution and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Tree, Random Forest, Naive Bayes, Logistic Regression, Cluster Analysis and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python Scikit-learn.
- Experienced in using Python to manipulate data for loading and extraction, and worked with Python libraries like Matplotlib, NumPy, SciPy and Pandas for data analysis.
- Worked with complex applications such as R, SAS, MATLAB and SPSS to develop neural network and cluster analysis models.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in data parsing, data manipulation and data preparation, including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt and reshape.
- Strong SQL programming skills, with experience in working with functions, packages and triggers.
- Experienced in developing applications with Visual Basic for Applications (VBA) and VB.
- Worked with NoSQL databases including HBase, Cassandra and MongoDB.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Experienced in Data Integration, Validation and Data Quality controls for ETL processes and Data Warehousing using MS Visual Studio SSIS, SSAS and SSRS.
- Proficient in Tableau and R Shiny data visualization tools to analyze and obtain insights into large datasets and to create visually powerful, actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms like Tableau.
- Worked in development environments using Git and virtual machines (VMs).
- Excellent communication skills; works successfully in fast-paced, multitasking environments both independently and in collaborative teams; a self-motivated, enthusiastic learner.
Confidential, Chicago, IL
- Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees and Support Vector Machines for estimating risks.
- Developed OLAP cubes for the branding analysis and developed OLAP and Excel Reports for SEC reporting.
- Handled data collection, feature creation, model building (Linear Regression, SVM, Logistic Regression, Decision Tree, Random Forest, GBM), evaluation metrics and model serving using R, Scikit-learn, Spark SQL, Spark ML, Flask, Redshift and AWS S3.
- Used the Cassandra nodetool utility to manage the Cassandra cluster.
- Involved in designing the Cassandra architecture.
- Worked on real-time as well as batch data and built a lambda architecture to process the data using Kafka, Spark Streaming, Spark Core and Spark SQL.
- Designed and provisioned the platform architecture to execute Hadoop and Machine Learning use cases on cloud infrastructure (AWS, EMR and S3).
- Involved in creating a Data Lake by extracting the customer's Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata and Netezza, as well as log data from servers.
- Developed Python code for data analysis and curve fitting (using NumPy and SciPy).
- Performed extensive Data Validation, Data Verification against Data Warehouse and performed debugging of the SQL-Statements and stored procedures for business scenarios.
- Used Spark DataFrames, Spark SQL and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL and MLlib libraries.
- Created a recommendation system using k-means clustering, NLP and Flask to generate vehicle lists for potential users, and worked on an NLP algorithm consisting of TF-IDF and LSI applied to user reviews.
- Worked on predictive and what-if analysis in R using data from HDFS; loaded files from Teradata into HDFS and from HDFS into Hive.
- Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Developed, tested, corrected and enhanced ETL mappings, resolved data integrity issues, and coordinated multiple OLAP and ETL projects for data lineage and reconciliation.
- Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
- Predicted job function and industry from job text analytics using R, Scikit-learn, NLTK, TF-IDF, a Bayesian classifier and Gensim.
- Performed transformations of data using Spark and Hive according to business requirements for generating various analytical datasets.
- Designed and developed ETL processes and created UNIX shell scripts to execute Teradata SQL and BTEQ jobs.
- Analyzed the bug reports in BO reports by running similar SQL queries against the source system(s) to perform root-cause analysis.
- Used NLTK, Stanford NLP and RAKE for data preprocessing, entity extraction and keyword extraction.
- Created dimension and fact tables in Redshift, built ETL to load data from different sources into Redshift, and used Tableau for reporting with Redshift as the data source.
- Coded using Teradata analytical functions and BTEQ SQL, and wrote UNIX scripts to validate, format and execute SQL in a UNIX environment.
- Analyzed the data statistically and prepared statistical reports using SAS.
- Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
- Created various types of data visualizations using R and Tableau.
- Created numerous dashboards in Tableau Desktop based on the data collected from zonal and compass, blending data from MS Excel and CSV files with MS SQL Server databases.
- Developed an SPSS macro that reduced syntax programming time and increased productivity across all data processing steps.
- Participated in big data architecture for both batch and real-time analytics and mapped data using a scoring system over large datasets on HDFS.
Environment: Hortonworks - Hadoop MapReduce, PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, Scala NLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files and XML.
Confidential, Houston, TX
- Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications, and executed machine learning use cases with Spark ML and MLlib.
- Served as solutions architect for transforming business problems into Big Data and Data Science solutions, and defined the Big Data strategy and roadmap.
- Identified areas of improvement in the existing business by analyzing vast amounts of data with machine learning techniques, using TensorFlow, Scala, Spark, MLlib, Python and other tools and languages as needed.
- Created and validated machine learning models with Azure Machine Learning.
- Designed a machine learning pipeline using Microsoft Azure Machine Learning to predict and prescribe, and implemented a machine learning scenario for a given data problem.
- Used Scala for coding the components in Play and Akka.
- Worked on machine learning models such as Logistic Regression, Multilayer Perceptron Classifier and K-means clustering, packaged with Scala SBT and run in the Spark shell (Scala), and built an auto-encoder model in R.
- Set up and configured AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Created detailed AWS Security Groups, which behaved as virtual firewalls controlling the traffic allowed to reach one or more AWS EC2 instances.
- Wrote scripts and an indexing strategy for a migration to Redshift from Postgres 9.2 and MySQL databases.
- Wrote Kinesis agents to pipe data from a streaming app into S3.
- Good knowledge of Azure Cloud Services, Azure Storage, Azure Active Directory and Azure Service Bus. Created and managed Azure AD tenants, configured application integration with Azure AD, and integrated on-premises Windows AD identity with Azure AD.
- Working knowledge of Azure Fabric, microservices, IoT and Docker containers in Azure. Azure infrastructure management and PaaS solution architecture (Azure AD, licenses, Office 365, DR on cloud using Azure Recovery Vault, Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib and Python with a broad variety of machine learning methods, including classification, regression and dimensionality reduction, and used the engine to increase user lifetime by 45% and triple user conversations for target categories.
- Designed and developed NLP models for sentiment analysis.
- Expert in Business Intelligence and Data Visualization tools: Tableau, MicroStrategy.
- Applied Multinomial Logistic Regression, Random Forest, Decision Tree and SVM to classify whether a package would be delivered on time on a new route, and performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Worked on machine learning on large datasets using Spark and MapReduce.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, and utilized optimization techniques, linear regression, K-means clustering, Naive Bayes and other approaches.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
- Extracted, transformed and loaded data sources to generate CSV data files using Python and SQL queries.
- Stored and retrieved data from data warehouses using Amazon Redshift.
- Worked on Teradata SQL queries, Teradata indexes, and utilities such as MLoad, TPump, FastLoad and FastExport.
- Applied various machine learning algorithms and statistical models, such as decision trees, regression models, neural networks, SVM and clustering, to identify volume using the scikit-learn package in Python and MATLAB.
- Used Data Warehousing concepts such as the Ralph Kimball and Bill Inmon methodologies, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Tables and Dimension Tables.
- Refined time-series data and validated mathematical models using analytical tools like R and SPSS to reduce forecasting errors.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
Environment: Python, Azure, ER Studio, Hadoop, MapReduce, EC2, S3, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Netezza, SAS, MATLAB, AWS, Scala NLP, SPSS, Cassandra, Oracle, Amazon Redshift, MongoDB, SQL Server 2012, Teradata, DB2, T-SQL, PL/SQL, Flat Files, XML.
Confidential, New York, NY
- Involved in Design, Development and Support phases of the Software Development Life Cycle (SDLC).
- Performed data ETL by collecting, exporting, merging and massaging data from multiple sources and platforms including SSIS (SQL Server Integration Services) in SQL Server.
- Worked with cross-functional teams (including the data engineering team) to rapidly extract data from MongoDB through the MongoDB Connector for Hadoop.
- Performed data cleaning and feature selection using MLlib package in PySpark.
- Partitioned hotels into 100 clusters with k-means using the Scikit-learn package in Python, so that similar hotels for a search are grouped together.
- Used Python to perform ANOVA tests to analyze the differences among hotel clusters.
- Implemented various machine learning algorithms and statistical models, such as Decision Tree, Naive Bayes, Logistic Regression and Linear Regression, using Python to determine the accuracy rate of each model.
- Determined the most accurate prediction model based on the accuracy rate.
- Used text mining of reviews to determine customers' concentrations.
- Delivered analysis support for hotel recommendations and provided an online A/B test.
- Designed Tableau bar graphs, scatter plots and geographical maps to create detailed summary reports and dashboards.
- Developed a hybrid model to improve the accuracy rate.
- Delivered the results to the operations team for better decisions and feedback.
Environment: Python, PySpark, Tableau, MongoDB, Hadoop, SQL Server, SDLC, ETL, SSIS, recommendation systems, Machine Learning Algorithms, text-mining process, A/B test
Confidential, Columbus, OH
- Participated in all phases of research including data collection, data cleaning, data mining, developing models and visualizations.
- Collaborated with data engineers and the operations team to collect data from internal systems to fit the analytical requirements.
- Redefined many attributes and relationships and cleansed unwanted tables/columns using SQL queries.
- Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Performed data imputation using the Scikit-learn package in Python.
- Performed data processing using Python libraries like NumPy and Pandas.
- Used the ggplot2 library in R to create data visualizations for a better understanding of customer behavior.
- Plotted data using Tableau for dashboards and reports.
- Implemented statistical modeling with the XGBoost package in R to determine the predicted probabilities of each model.
- Shared the results with the operations team for better decisions.
Environment: Python, R, SQL, Tableau, Spark, Machine Learning Software Package, recommendation systems.
R & SAS Programmer
- Analyzed high volume, high dimensional client and survey data from different sources using SAS and R
- Manipulated large financial datasets, primarily in SQL and R
- Used R for large matrix computation
- Developed algorithms (data mining queries) to extract data from the data warehouse and databases to build rules for the Analyst and Models teams.
- Used R to import high volumes of data
- Demonstrated a high level of programming efficiency with statistical modeling tools such as SAS, SPSS and R.
- Developed predictive models using R to predict customer churn and classify customers
- Worked on a Shiny and R application showcasing machine learning to improve business forecasting.
- Developed, reviewed, tested & documented SAS programs/macros.
- Created templates using SAS macros for existing reports to reduce manual intervention.
- Created self-service tools for the onshore/offshore team for data retrieval.
- Worked on daily reports and used them for further analysis.
- Developed/Designed templates for new data extraction requests.
- Executed weekly reports for Commercial Data Analytics Team.
- Communicated progress to key Business partners and Analysts through status reports and tracked issues until resolution.
- Created predictive and other analytically derived models for assessing sales.
- Provided support in the design and implementation of ad hoc requests for Sales-Related Portfolio Data.
- Responsible for preparing test case documents and Technical specification documents.
Environment: R, SQL, Tableau, SPSS, SAS, Oracle, T-SQL, UNIX Shell Scripting, DB2.
- Used SQL to retrieve data from the Oracle database for data analysis and visualization and performed Inventory Analysis with Statistical and Data Visualization Tools.
- Followed the RUP based methods using Rational Rose to create Use Cases, Activity Diagrams / State Chart Diagrams, Sequence Diagrams.
- Designed different types of star schemas for detailed data marts and plan data marts in the OLAP environment.
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
- Involved in data analysis, primarily identifying data sets, source data, source metadata, data definitions and data formats.
- Performed Decision Tree and Random Forest analysis for strategic planning and forecasting, and manipulated and cleaned data using the dplyr and tidyr packages in R.
- Wrote, executed and performance-tuned SQL queries for data analysis and profiling, including complex queries using joins, subqueries and correlated subqueries.
- Involved in development and implementation of SSIS, SSRS and SSAS application solutions for various business units across the organization.
- Developed mappings to load fact and dimension tables, SCD Type 1 and SCD Type 2 dimensions, and incremental loads, and unit tested the mappings.
- Wrote test cases, developed Test scripts using SQL and PL/SQL for UAT.
- Transferred data from various OLTP data sources, such as Oracle, MS Access, MS Excel, Flat files, CSV files into SQL Server.
- Performed data testing, tested ETL mappings (Transformation logic), tested stored procedures, and tested the XML messages.
- Created Use cases, activity report, logical components to extract business process flows and workflows involved in the project using Rational Rose, UML and Microsoft Visio.
Environment: R, SQL, Tableau, SSRS, Oracle, T-SQL, UNIX Shell Scripting, DB2.