- Highly motivated Data Scientist with 7+ years of IT experience crafting and delivering business solutions in the service delivery sector.
- For the last 3+ years, leveraged a varied skill set to acquire, clean, process, and interpret large data sets, solving specific business cases and improving the company's overall bottom line through statistical modeling.
Databases: NoSQL (Cassandra, MongoDB, HBase), MySQL, Hadoop, Spark
Development Tools: Eclipse IDE, Jupyter Notebook, RStudio, Spyder, Anaconda 3.5/3.6, Zeppelin, Matplotlib, ggplot, TensorBoard, Plotly, Cufflinks, deep learning (ANN, CNN, RNN), Tableau, Git, PowerShell, Atom
Statistical Skills: Machine Learning
- Experience working in Data Requirement analysis for transforming data according to business requirements.
- Worked closely with data compliance teams such as Data Analysts and Data Engineers to gather required raw data and define source fields in Hadoop.
- Applied Forward Elimination and Backward Elimination to data sets to identify the most statistically significant variables for data analysis; used Tukey's HSD when variables were statistically significant. Implemented Bagging and Boosting to enhance model performance on various datasets.
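The two techniques in this bullet can be sketched in Python; a minimal illustration, assuming scikit-learn and SciPy are available and using synthetic stand-in data (the real datasets and variable names are not in the source). Note that scikit-learn's backward selector drops features by cross-validated score rather than by p-value, so it is a stand-in for the p-value-based elimination described here:

```python
import numpy as np
from sklearn.datasets import make_regression          # hypothetical stand-in data
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from scipy.stats import tukey_hsd

# Backward elimination: start with all 8 features, greedily drop the least useful.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="backward"
)
selector.fit(X, y)
print("kept feature indices:", np.flatnonzero(selector.get_support()))

# Tukey's HSD: pairwise comparison of group means for significant variables.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 1.0, 30)
group_b = rng.normal(0.2, 1.0, 30)
group_c = rng.normal(2.0, 1.0, 30)
result = tukey_hsd(group_a, group_b, group_c)
print(result.pvalue)  # matrix of pairwise p-values between the three groups
```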
- Applied functions from dplyr and tidyr for data manipulation and cleaning in R.
- Utilized Label Encoders in Python to create dummy variables for geographic locations and identified their impact on pre- and post-acquisition performance using a paired two-sample t-test.
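A minimal sketch of this workflow, assuming pandas and SciPy; the region names and pre/post figures below are hypothetical, not from the actual engagement:

```python
import pandas as pd
from scipy.stats import ttest_rel

# Hypothetical per-site metrics before and after an acquisition.
df = pd.DataFrame({
    "region": ["east", "west", "west", "south", "east", "south"],
    "pre":  [10.1, 12.3, 11.8, 9.4, 10.7, 9.9],
    "post": [11.0, 12.9, 12.5, 9.6, 11.2, 10.4],
})

# Dummy-encode the geographic field (one indicator column per region).
dummies = pd.get_dummies(df["region"], prefix="region")
print(dummies.columns.tolist())

# Paired t-test: the same sites are measured pre- and post-acquisition.
t_stat, p_value = ttest_rel(df["pre"], df["post"])
print(round(t_stat, 3), round(p_value, 4))
```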
- Hands-on experience with R and Python packages and libraries such as ggplot2, C5.0, h2o, dplyr, reshape2, plotly, RMarkdown, caret, caTools, sklearn, and scipy.
- Expertise in enhancing model performance through k-fold cross-validation and hyperparameter tuning with grid search to increase the accuracy of different machine learning algorithms.
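The k-fold cross-validation plus grid-search combination can be sketched with scikit-learn's `GridSearchCV`; the estimator, grid values, and dataset here are illustrative placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated search over a small hyperparameter grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Each parameter combination is scored as the mean accuracy across the 5 folds, so the reported best score is less optimistic than a single train/test split.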
- Built models using Pipeline stages, VectorAssembler, StringIndexer, and feature importance; handled imbalanced datasets with up-sampling of the minority class, down-sampling of the majority class, and SMOTE.
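The up-/down-sampling idea for imbalanced classes can be sketched with scikit-learn's `resample` (SMOTE itself lives in the separate `imbalanced-learn` package and is not shown); the 90/10 class split below is hypothetical:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Hypothetical imbalanced dataset: 90 majority-class vs 10 minority-class rows.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

X_min, X_maj = X[y == 1], X[y == 0]

# Up-sample the minority class with replacement to match the majority count.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

# Down-sample the majority class without replacement to match the minority count.
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=0)

print(X_min_up.shape, X_maj_down.shape)  # (90, 3) (10, 3)
```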
- Expertise in transforming business requirements into algorithm designs, analytical models, data mining workflows, and reporting solutions that scale across massive volumes of structured and unstructured data.
- Created data presentations that reduce bias and tell an accurate story, pulling millions of rows of data with SQL and performing Exploratory Data Analysis.
- Applied the Wilcoxon signed-rank test in R to patient and treatment data, pre- and post-acquisition, to find statistical significance across different sectors.
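The bullet describes an R workflow; the same test is available in Python via `scipy.stats.wilcoxon`, sketched here with hypothetical paired patient scores (the signed-rank test is the non-parametric alternative to the paired t-test when differences may not be normal):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired scores per patient, before and after acquisition.
pre  = np.array([42, 55, 38, 61, 47, 52, 44, 58, 49, 53])
post = np.array([45, 57, 41, 60, 50, 56, 47, 61, 52, 55])

# Ranks the absolute pairwise differences and compares the signed rank sums.
stat, p_value = wilcoxon(pre, post)
print(stat, round(p_value, 4))
```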
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
- Created a SparkSession and worked with VectorAssembler, StringIndexer, and OneHotEncoder to build different models using Spark SQL.
Associate Data Scientist /Analyst
- Implemented public segmentation with unsupervised machine learning, applying the k-means algorithm in PySpark.
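The bullet names PySpark, which needs a Spark runtime; the k-means segmentation idea itself can be sketched with scikit-learn on hypothetical two-dimensional feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Hypothetical feature vectors for two well-separated population segments.
segment_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
segment_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([segment_a, segment_b])

# k-means partitions the points into k clusters around learned centroids.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))
```

In PySpark the equivalent class is `pyspark.ml.clustering.KMeans`, fitted on a DataFrame column assembled with `VectorAssembler`.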
- Explored and extracted data from source XML in HDFS, preparing it for exploratory analysis through data munging.
- Used R and Python for Exploratory Data Analysis, A/B testing, ANOVA, and hypothesis tests to compare and identify the effectiveness of creative campaigns.
- Used Spark MLlib for test data analytics and analyzed performance to identify bottlenecks.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in R.
- Worked on Linux shell scripts for business processes and for loading data from different interfaces into HDFS.
- Addressed overfitting by implementing regularization methods such as L1, L2, and dropout.
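The contrast between L1 and L2 regularization can be sketched with scikit-learn's `Lasso` and `Ridge` on synthetic data (dropout, the third method named, applies to neural networks and is not shown); the alpha values below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

# L2 (ridge) shrinks all coefficients smoothly toward zero;
# L1 (lasso) drives many coefficients to exactly zero, selecting features.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge zero coefs:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefs:", int(np.sum(lasso.coef_ == 0)))
```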
- Used Python, R, and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forests, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.
- Identified and targeted high-risk welfare groups with machine learning algorithms.
Confidential, Houston, TX
- Involved in implementing a Hadoop cluster and Hive for the development and test environments.
- Analyzed the data as per the business requirements using Hive queries.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Installed and configured Hadoop MapReduce and HDFS, and developed MapReduce jobs in Java for data preprocessing.
- Collected and aggregated large amounts of log data using Apache Flume, staging the data in HDFS for further analysis.
- Created Hive tables, loaded values, and generated ad-hoc reports from the table data, demonstrating a strong understanding of Hadoop architecture including HDFS, MapReduce, Hive, Pig, Sqoop, and Oozie.
- Used Spark with YARN and compared performance results against MapReduce.
- Loaded existing data warehouse data from Oracle database to Hadoop Distributed File System (HDFS)
- Loaded data into Hive Tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.