- Highly efficient Data Scientist with 8 years of experience in machine learning, data mining with large sets of structured and unstructured data, data acquisition, data validation, predictive modeling, data visualization, web crawling, and web scraping. Adept in statistical programming languages such as R and Python, and in big data technologies such as Hadoop and Hive.
- Proficient in managing the entire data science project life cycle and actively involved in all its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and k-fold cross-validation, and data visualization.
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, and model comparison and validation.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in data parsing, data manipulation, and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Experience in using various packages in R and Python like ggplot2, caret, dplyr, Rweka, gmodels, RCurl, tm, C50, twitteR, NLP, Reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, Rpy2.
- Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
- Hands-on experience with big data tools such as Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL.
- Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, clustering, neural networks, and Principal Component Analysis.
- Good knowledge of proofs of concept (PoCs) and gap analysis; gathered data for analysis from different sources and prepared it for exploration using data munging.
- Good industry knowledge, analytical and problem-solving skills, and the ability to work well both within a team and as an individual.
- Highly creative, innovative, committed, intellectually curious, and business savvy, with good communication and interpersonal skills.
- Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of the big data ecosystem, including Hadoop 2.0 (HDFS, Hive, Pig, Impala) and Spark (Spark SQL, Spark MLlib, Spark Streaming).
- Excellent performance in building and publishing customized interactive reports and dashboards with customized parameters and user filters using Tableau 9.4/9.2.
- Excellent understanding of SDLC, Agile, and Scrum.
- Experience with the version control tool Git.
- Effective team player with strong communication and interpersonal skills and a strong ability to adapt to and learn new technologies and business lines rapidly.
- Extensive experience in data visualization, including producing tables, graphs, and listings using various procedures and tools such as Tableau.
Operating systems: Windows, Ubuntu, Mac.
Languages: R, Python, Java.
Markup and web languages: HTML, XML, JavaScript.
Database Languages: SQL, Hive, Impala, Pig, Spark SQL.
Databases: SQL Server, MySQL, MS Access, HDFS, HBase.
Packages: ggplot2, caret, dplyr, Rweka, gmodels, RCurl, tm, C50, twitteR, NLP, Reshape2, rjson, plyr, pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, Beautiful Soup, Rpy2.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Gephi, ggplot2.
Version control: SVN, GitHub.
Topical Expertise: Statistical Modeling, Data Analytics, Machine Learning, Text Mining and Optimization, Web scraping, NLTK.
Techniques: Regression, HMM, GLM, Trees (Decision trees, Oblique decision trees, CHAID), Random Forest, Clustering (K-means, Hierarchical, SOM), Association Rules, K-Nearest Neighbors, Neural Nets, XGBoost, SVM, Bayesian, Linear Programming, Quadratic Programming, Genetic Algorithm, Collaborative filtering.
Confidential - Chicago, IL
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, and date and time.
- Applied various machine learning algorithms and statistical models (decision trees, regression models, neural networks, SVM, clustering) to identify volume, using the scikit-learn package in Python and MATLAB.
- Used the K-Means clustering technique to identify outliers and to classify unlabeled data.
- Evaluated models using cross-validation, the log loss function, and ROC curves, and used AUC for feature selection.
- Analyzed traffic patterns by calculating autocorrelation with different time lags.
- Ensured that the model has a low false positive rate.
- Addressed overfitting by applying regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
- Created and designed reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package will be delivered on time on a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from an Oracle database.
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed a MapReduce pipeline for feature extraction using Hive.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated the results to the operations team to support decision making.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2012, Microsoft Excel, MATLAB, Spark SQL, PySpark.
Confidential - Pleasanton, CA
- Collected business requirements and data analysis needs from other departments
- Performed data parsing and data profiling from large volumes of varied data to learn about behavior with various features based on transactional data, call center history data and customer personal profile, etc.
- Processed the primary quantitative and qualitative market research and loaded the survey responses into the database in preparation for data exploration
- Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for duplication, completeness, accuracy, and validity
- Worked on data cleaning and ensured data quality, consistency, and integrity using NumPy and SFrame in Python
- Used k-means clustering technique to identify outliers and to classify unlabeled data
- Applied the Principal Component Analysis method in feature engineering to analyze high-dimensional data
- Applied various machine learning algorithms and statistical models (decision tree, lasso regression, multivariate regression) to identify key features using the scikit-learn package in Python
- Evaluated models using k-fold cross validation, log loss function
- Ensured that the model has a low false positive rate and validated the model by interpreting the ROC plot
- Experimented with text mining of customer complaints using NLTK in Python
- Built repeatable processes in support of implementation of new features and other initiatives
- Created various types of data visualizations using Tableau
- Communicated and presented the results to the product development team to drive decisions
Environment: Python 3.3, Hadoop 2, HiveQL, HBase, MapReduce, Tableau 9.4, NumPy, SFrame, scikit-learn, NLTK
Confidential - Hartford, CT
- Implemented and delivered all requirements outlined in the contractual agreement between the company and the university
- Prepared and executed complex Spark SQL queries involving multiple joins and advanced analytical functions to validate the ETL-processed data in the target database
- Searched for and collected data from external sources and integrated it with the primary database. Created a Spark SQLContext to load data from JSON files and ran SQL queries
- Extracted and compiled data, and conducted data manipulation to ensure data quality, consistency, and integrity using SFrame in Python
- Built a time series model (ARIMA) to capture data patterns and traffic trends, and forecast the occupancy rate of different parking lots
- Communicated effectively with the business development team and ensured the implementation and completion of initiatives that may increase opportunities
- Delivered data interpretation efficiently by creating interactive analysis reports with the data visualization tool Tableau, to identify business solutions and support business decisions on marketing and operations
Environment: Hadoop 2, Spark, Spark SQL, MS Office (Excel), Tableau 9.2, Python, SFrame.
Confidential - Addison, IL
- Implemented a job that loads electronic medical records, extracts the data into an Oracle database, and generates an output. Analyzed the data and provided insights about customers using Tableau.
- Designed, implemented and automated modeling and analysis procedures on existing and experimentally created data.
- Created dynamic linear models to perform trend analysis on customer transactional data in Python.
- Increased the pace and confidence of the learning algorithm by combining state-of-the-art technology and statistical methods.
- Parsed data, producing concise conclusions from raw data in a clean, well-structured and easily maintainable format. Developed clustering models for customer segmentation using Python.
Environment: Python 2.x, Tableau, Oracle.
- Developed entire frontend and backend modules using Python on Django Web Framework.
- Involved in writing stored procedures using Oracle.
- Optimized the database queries to improve the performance.
- Designed and developed data management system using Oracle.
- Communicated effectively with stakeholders to gather requirements for different projects.
- Used the MySQLdb package and the Python-MySQL connector to write and execute MySQL database queries from Python.
- Created functions, triggers, views, and stored procedures using MySQL.
- Worked closely with the back-end developer to find ways to push the limits of existing web technology.
- Participated in code review meetings.
Environment: Java, Spring Framework, Hibernate, EJB, WebLogic, MySQL.