- 5+ years of professional experience as a Data Scientist/Data Analyst/Developer in data science and analytics, including machine learning, data mining, and statistical analysis.
- Developed predictive models using Decision Tree, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Implemented Bagging and Boosting to enhance the model performance.
- Worked extensively with Python 3.5 (NumPy, Pandas, Matplotlib, NLTK, and scikit-learn).
- Experience implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook, R 3.0 (ggplot2, caret, dplyr), and Excel.
- Involved in all phases of the data science project life cycle, including data extraction, data cleaning, statistical modeling, and data visualization with large sets of structured and unstructured data.
- Worked with NoSQL databases, including HBase, Cassandra, and MongoDB.
- Extensive knowledge of mathematical modeling, statistical analysis, predictive analytics, machine learning techniques, and time series forecasting.
- Proficient in machine learning algorithms and predictive modeling, including Linear Regression, Logistic Regression, Naive Bayes, Decision Tree, Random Forest, Gradient Boosting, SVM, KNN, and K-means clustering.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Experience in applied statistics, exploratory data analysis, and visualization using Matplotlib, Tableau, and Report Builder.
- Performed data cleaning to ensure data quality, consistency, and integrity using NumPy and Pandas.
- Participated in feature engineering, including feature intersection generation, feature normalization, and label encoding with scikit-learn preprocessing.
- Implemented logistic regression, random forest classification, and gradient boosting classification and compared their performance.
- Worked with data formats such as JSON and XML and applied machine learning algorithms in Python.
- Data warehousing experience using IBM DataStage/QualityStage, with involvement in all stages of the Software Development Life Cycle (SDLC).
- Working knowledge of managing the entire data science project life cycle, with active involvement in all phases, including data collection, data preprocessing, EDA and statistical methods, model building, and model validation.
- Experienced in creating and implementing interactive charts, graphs, and other user interface elements, and in developing analytical reports for monthly risk management presentations to senior management.
- Thorough understanding of data warehousing principles (fact tables, dimension tables, dimensional data modeling with Star Schema and Snowflake Schema).
- Expert in data warehousing techniques for data cleansing and Slowly Changing Dimensions (SCD).
- Highly creative, innovative, committed, intellectually curious, passionate, and business-savvy, with effective communication and interpersonal skills.
Languages: C, Python, SQL, PL/SQL, SQL*Plus, R
Databases: Oracle 10g/11g/12c, DB2, Pig, Hive
Operating systems: UNIX, Microsoft Windows
Tools: Oracle SQL*Plus, TOAD, IBM InfoSphere DataStage 11.5/9.1, RStudio, IPython Notebook, Spyder
General Tools: PowerPoint, Word, Excel
Version Control: SharePoint, ClearCase, Git
Big Data Technologies: Spark, Hadoop
Statistical software/libraries: scikit-learn, R
Visualization Tools: Matplotlib, Seaborn, ggplot, Tableau
Data Science Libraries: NumPy, Pandas, scikit-learn, NLTK, Deep Learning
Data Scientist/ETL Developer, Lombard, IL
- Collaborated with database engineers to implement the ETL process; wrote and optimized SQL queries to extract and merge data from a SQL Server database.
- Analyzed customer behavior to assess customer value and applied customer segmentation using a clustering algorithm.
- Performed data integrity checks, data cleansing, exploratory analysis, and feature engineering using Python and data visualization packages such as Matplotlib and Seaborn.
- Used Python to develop a variety of models and algorithms for analytic purposes.
- Developed logistic regression models to predict subscription response rate based on customer variables such as past transactions, promotions, response to prior mailings, demographics, and interests and hobbies.
- Used Python to implement different machine learning algorithms, including Generalized Linear Model, Random Forest and Gradient Boosting.
- Evaluated parameters with k-fold cross-validation and optimized model performance.
- Recommended and evaluated marketing approaches based on analytics of customer consumption behavior.
- Collected and analyzed customer feedback using streaming data from social networks stored in a Hadoop system with Hive.
- Performed data visualization and designed dashboards with Tableau, and provided complex reports, including charts, summaries, and graphs, to interpret the findings for the team and stakeholders.
- Identified process improvements that significantly reduced workloads or improved quality.
Environment: Python (scikit-learn/SciPy/NumPy/Pandas), Linux, Tableau, Hadoop, MapReduce, Hive, ETL DataStage, Oracle, Windows 10/XP, JIRA
Data Scientist/Analyst, Richmond, VA
- Built data pipelines from multiple data sources by performing necessary ETL tasks. Performed Exploratory Data Analysis using R and Apache Spark.
- Performed data cleaning, feature scaling, and feature engineering. Performed natural language processing to extract features from text data.
- Performed text analysis, including TF-IDF analysis. Visualized bigram networks to investigate the importance of individual terms.
- Spearheaded efforts to develop deep learning algorithms for analyzing text, improving on the existing dictionary-based approaches.
- Developed deep learning algorithms to analyze 300k consumer comments for sentiment polarity and topic presence with 80% accuracy.
- Developed a novel encoding technique to convert text data into input for machine learning models.
- Built a forecasting model to predict future sales of anti-diabetes vaccines in the global market.
- Built multiple time-series models such as ARIMA.
- Evaluated model performance using multiple metrics such as confusion matrix, RMSE, precision, and recall.
Environment: Python (scikit-learn/SciPy/NumPy/Pandas), Linux, Tableau, Hive, ETL DataStage, Pig, Oracle, Windows 10/XP, JIRA
Data Scientist, Richmond, VA
- Reviewed and determined risk profiles of data based on metadata and underlying data elements.
- Used the K-means algorithm with different numbers of clusters to find meaningful customer segments, and calculated the accuracy of the model.
- Visualized data using Matplotlib charts such as bar charts, heat maps, and histograms.
- Manipulated and cleaned data using missing-value treatment in Pandas and performed standardization.
- Implemented supervised learning algorithms such as Linear Regression, Logistic Regression, Decision Trees, KNN, and Naive Bayes for prediction and classification.
- Worked on customer segmentation using an unsupervised learning technique, clustering.
- Performed exploratory data analysis and data visualization using Python.
- Strong skills in data visualization with the Matplotlib and Seaborn libraries. Created different charts such as heat maps, bar charts, and line charts.
- Validated and selected models using k-fold cross-validation and confusion matrices, and optimized models for high recall.
- Implemented ensemble models such as Boosting and Bagging.
- Used cross-validation and grid search to improve model results.
Environment: Python, Logistic Regression, clustering, NumPy, scikit-learn, MapReduce, Pig, Hive, Pandas, Seaborn, Random Forest and Tableau
Data Analyst/Developer
- Involved in all phases of the software development life cycle, including requirement analysis, design, coding, testing, support, and documentation.
- Analyzed business requirements.
- Extensively used DataStage Designer to develop processes for extracting, transforming, integrating, and loading data from various sources into the data warehouse database.
- Involved in designing data models such as star schema to incorporate the fact and dimension tables.
- Used different types of stages such as Transformer, CDC, Remove Duplicates, Aggregator, ODBC, Join, Funnel, Data Set, and Merge to develop different ETL jobs.
- Involved in unit testing and performance tuning of ETL DataStage jobs.
- Created environment variables and set up their default values in DataStage Administrator.
- Worked extensively with databases such as Oracle and DB2 to extract and load data from source to target.
- Used JIRA to update and keep track of day-to-day tasks.