- Data Scientist with over 6 years of experience in Data Extraction, Data Modeling, Data Wrangling, Statistical Modeling, Data Mining, Machine Learning and Data Visualization.
- Domain knowledge and experience in the Retail, Banking and Manufacturing industries.
- Expertise in transforming business resources and requirements into manageable data formats and analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling, testing and validation, and data visualization.
- Proficient in Machine Learning algorithms and Predictive Modeling, including Regression Models, Decision Trees, Random Forests, Sentiment Analysis, Naïve Bayes Classifier, SVM and Ensemble Models.
- Proficient in Statistical Methodologies including Hypothesis Testing, ANOVA, Time Series, Principal Component Analysis, Factor Analysis, Cluster Analysis and Discriminant Analysis.
- Knowledge of time series analysis using AR, MA, ARIMA, ARCH and GARCH models.
- Knowledge of Natural Language Processing (NLP) algorithms and Text Mining.
- Worked in large-scale data environments such as Hadoop and MapReduce, with an understanding of the working mechanisms of Hadoop clusters, nodes and the Hadoop Distributed File System (HDFS).
- Strong experience using Python (2.x/3.x) to develop analytic models and solutions.
- Proficient in Python 2.x/3.x with SciPy Stack packages including NumPy, Pandas, SciPy, Matplotlib and IPython.
- Working experience with the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce, HiveQL, Spark SQL and PySpark.
- Experience provisioning virtual clusters on the AWS cloud using services such as EC2, S3 and EMR.
- Proficient in data visualization tools such as Tableau, Python Matplotlib, R Shiny to create visually powerful and actionable interactive reports and dashboards.
- Skilled Tableau developer with expertise in building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau (9.x/10.x).
- Experienced in Agile methodology and SCRUM process.
- Strong business sense and the ability to communicate data insights to both technical and non-technical clients.
Databases: MySQL, PostgreSQL, Oracle, HBase, Amazon Redshift, MS SQL Server 2016/2014/2012/2008 R2/2008, Teradata
Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Autocorrelation
Machine Learning: Regression Analysis, Bayesian Methods, Decision Trees, Random Forests, Support Vector Machines, Neural Networks, Sentiment Analysis, K-Means Clustering, KNN, Ensemble Methods, Natural Language Processing (NLP)
Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, MapReduce, Hive, HDFS, Sqoop, Flume
Reporting Tools: Tableau Suite of Tools 10.x, 9.x, 8.x (Desktop, Server and Online), SQL Server Reporting Services (SSRS)
Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
Languages: Python (2.x/3.x), R, SAS, SQL, T-SQL
Operating Systems: PowerShell, UNIX/UNIX Shell Scripting (via PuTTY client), Linux and Windows
Confidential, New York City, NY
- Collaborated with data engineers and the operations team to implement the ETL process; wrote and optimized SQL queries to extract data that fit the analytical requirements.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from Redshift.
- Explored and analyzed customer-specific features using Spark SQL.
- Performed univariate and multivariate analysis on the data to identify underlying patterns and associations between variables.
- Performed data imputation using the Scikit-learn package in Python.
- Participated in feature engineering such as feature intersection generation, feature normalization and label encoding with Scikit-learn preprocessing.
- Used Python 3.x (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Developed and implemented predictive models using machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN, PCA and regularization for data analysis.
- Analyzed customer consumption behavior and assessed customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering.
- Built regression models, including Lasso, Ridge, SVR and XGBoost, to predict Customer Lifetime Value.
- Built classification models, including Logistic Regression, SVM, Decision Tree and Random Forest, to predict Customer Churn Rate.
- Used F-Score, AUC/ROC, Confusion Matrix, MAE and RMSE to evaluate the performance of different models.
- Designed and implemented recommender systems that used collaborative filtering techniques to recommend courses to different customers, and deployed them to an AWS EMR cluster.
- Utilized natural language processing (NLP) techniques to optimize customer satisfaction.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
Environment: AWS Redshift, EC2, EMR, Hadoop Framework, S3, HDFS, Spark (PySpark, MLlib, Spark SQL), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas/NLTK/Matplotlib/Seaborn), Tableau Desktop (9.x/10.x), Tableau Server (9.x/10.x), Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGBoost, LightGBM, Collaborative Filtering, Ensemble), NLP, Teradata, Git 2.x, Agile/SCRUM
Confidential, New York City, NY
- Tackled a highly imbalanced fraud dataset using undersampling, oversampling with SMOTE and cost-sensitive algorithms with Python Scikit-learn.
- Wrote complex Spark SQL queries for data analysis to meet business requirements.
- Developed MapReduce/Spark Python modules for predictive analytics & machine learning in Hadoop on AWS.
- Worked on data cleaning and ensured data quality, consistency and integrity using Pandas and NumPy.
- Participated in feature engineering such as feature intersection generation, feature normalization and label encoding with Scikit-learn preprocessing.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
- Applied Naïve Bayes, KNN, Logistic Regression, Random Forest, SVM and XGBoost to predict whether a loan would default.
- Implemented an ensemble of Ridge, Lasso Regression and XGBoost to predict the potential loss from loan defaults.
- Used various metrics (RMSE, MAE, F-Score, ROC and AUC) to evaluate the performance of each model.
- Used Spark big data tools (PySpark, Spark SQL, MLlib) to conduct real-time analysis of loan defaults on AWS.
- Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards. The queries retrieved data from multiple tables using various join conditions, enabling efficient, optimized data extracts for Tableau workbooks.
Environment: MS SQL Server 2014, Teradata, ETL, SSIS, Alteryx, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), AWS Redshift, Spark (PySpark, MLlib, Spark SQL), Hadoop 2.x, MapReduce, HDFS, SharePoint
Confidential, Troy, MI
Data Analyst/Data Scientist
- Gathered, analyzed, documented and translated application requirements into data models, and supported the standardization of documentation and the adoption of standards and practices related to data and applications.
- Participated in data acquisition with the Data Engineering team to extract historical and real-time data using Sqoop, Pig, Flume, Hive, MapReduce and HDFS.
- Wrote user defined functions (UDFs) in Hive to manipulate strings, dates and other data.
- Performed data cleaning, feature scaling and feature engineering using the Pandas and NumPy packages in Python.
- Applied clustering algorithms such as Hierarchical and K-means clustering using Scikit-learn and SciPy.
- Performed complex pattern recognition on automotive time series data and forecasted demand using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
- Delivered and communicated research results, recommendations, opportunities to the managerial and executive teams, and implemented the techniques for priority projects.
- Designed, developed and maintained daily and monthly summary, trending and benchmark reports repository in Tableau Desktop.
- Generated complex calculated fields and parameters, toggled and global filters, dynamic sets, groups, actions, custom color palettes, statistical analysis to meet business requirements.
- Implemented visualizations and views such as combo charts, stacked bar charts, Pareto charts, donut charts, geographic maps, sparklines and crosstabs.
- Published workbooks and extract data sources to Tableau Server, implemented row-level security and scheduled automatic extract refresh.
Environment: Machine Learning (KNN, Clustering, Regressions, Random Forest, SVM, Ensemble), Linux, Python 2.x (Scikit-Learn/SciPy/NumPy/Pandas), R, Tableau (Desktop 8.x/Server 8.x), Hadoop, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Flume, Oracle 11g, SQL Server 2012
BI Developer/Data Analyst
- Used SSIS to create ETL packages to Validate, Extract, Transform and Load data into Data Warehouse and Data Mart.
- Maintained and developed complex SQL queries, stored procedures, views, functions and reports that meet customer requirements using Microsoft SQL Server 2008 R2.
- Created Views and Table-valued Functions, Common Table Expression (CTE), joins, complex subqueries to provide the reporting solutions.
- Optimized query performance by modifying T-SQL queries, removing unnecessary columns and redundant data, normalizing tables, establishing joins and creating indexes.
- Created SSIS packages using Pivot Transformation, Fuzzy Lookup, Derived Column, Conditional Split, Aggregate, Execute SQL Task, Data Flow Task and Execute Package Task.
- Migrated data from the SAS environment to SQL Server 2008 via SQL Server Integration Services (SSIS).
- Developed and implemented several types of financial reports (Income Statement, Profit & Loss Statement, EBIT and ROIC reports) using SSRS.
- Developed parameterized dynamic performance reports (Gross Margin, Revenue based on geographic regions, Profitability based on web sales and smartphone app sales), ran them monthly and distributed them to the respective departments through mailing server subscriptions and SharePoint server.
- Designed and developed new reports and maintained existing reports using Microsoft SQL Server Reporting Services (SSRS) and Microsoft Excel to support the firm's strategy and management.
- Created sub-reports, drill down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS.
- Used SAS/SQL to pull data out from databases and aggregate to provide detailed reporting based on the user requirements.
- Used SAS for pre-processing data, SQL queries, data analysis, generating reports, graphics, and statistical analyses.
- Provided statistical research analyses and data modeling support for mortgage products.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis and cluster analysis using SAS programming.
Environment: SQL Server 2008 R2, DB2, Oracle, SQL Server Management Studio, SAS/BASE, SAS/SQL, SAS/Enterprise Guide, MS BI Suite (SSIS/SSRS), T-SQL, SharePoint 2010, Visual Studio 2010, Agile/SCRUM