- Experienced data scientist with 6+ years of hands-on experience in Machine Learning, Deep Learning, Data Mining, ETL, Data Visualization, and Cloud Computing (AWS, Azure, Google).
- Extensive experience with state-of-the-art Machine Learning and Deep Learning algorithms, including Linear/Logistic Regression, Naive Bayes, Random Forest, Gradient Boosting Machines, SVM, K-means clustering, Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN); hands-on experience designing Anomaly Detection, Named Entity Recognition, Collaborative Filtering/Content-Based Recommendation Engines, Image Recognition, and Speech Recognition; implemented most of these ML and DL algorithms from scratch with NumPy.
- Expertise in Natural Language Processing (NLP), including Bag of Words, TF-IDF, N-grams, Word2Vec, GloVe, FastText, LSTMs, GRU, RNN, Cosine Similarity, POS Tagging, Text Mining, LDA, and Sentiment Analysis.
- Involved in various research and industry projects across the full data science life cycle, including Data Acquisition/Crawling, Data Cleaning, Data Manipulation, Feature Engineering, Feature Selection, Data Visualization, Predictive Modeling, Model Optimization, Testing, and Deployment.
- Strong grounding in statistical methodologies such as Time Series Analysis, Hypothesis Testing (A/B testing), Principal Component Analysis (PCA), Singular Value Decomposition (SVD), ANOVA, Cluster Analysis, and Factor Analysis.
- Proficient in Python 3, including NumPy, Pandas, Scikit-learn, NLTK, spaCy, Gensim, CoreNLP, TensorFlow, Keras, Matplotlib, Plotly, Seaborn, and PySpark.
- Proficient with data visualization tools such as Tableau, Matplotlib/Seaborn in Python, and ggplot2/R Shiny in R to create interactive, dynamic reports and dashboards.
- Hands-on experience with the Hadoop ecosystem and Apache Spark frameworks such as MapReduce, HDFS, HiveQL, Pig, SparkSQL, and PySpark ML.
- Adept at developing and debugging Stored Procedures, User-Defined Functions (UDFs), Triggers, Indexes, Constraints, Transactions, and Queries using Transact-SQL (T-SQL).
- Experience building business intelligence, analytics, or reporting solutions, either front-end consumption mechanisms (e.g., Microsoft and Tableau) or supply of data for these purposes.
- Comprehensive understanding of Systems Development Life Cycle (SDLC) methodologies such as Agile, Waterfall, and Scrum.
MS SQL Server (Less than 1 year), SQL Server (Less than 1 year), MySQL (Less than 1 year), Oracle (Less than 1 year), PostgreSQL (Less than 1 year), SQL (Less than 1 year), HDFS (Less than 1 year), Hadoop (Less than 1 year), Machine Learning (Less than 1 year), MongoDB (Less than 1 year), Teradata (Less than 1 year), Amazon Web Services (Less than 1 year), Hive (Less than 1 year), Pig (Less than 1 year), Python (Less than 1 year), SVN (Less than 1 year), Boosting (Less than 1 year), Deep Learning (Less than 1 year), K-means (Less than 1 year), Business Intelligence (Less than 1 year), Excel (Less than 1 year)
Confidential, Piscataway, NJ
- Developed medical image segmentation models using multiple deep learning algorithms, including SSD, Mask R-CNN, and U-Net, in MXNet, TensorFlow, and Keras.
- Managed and cleaned thousands of patients' CT-scan DICOM files on AWS S3 and EC2.
- Trained and deployed deep learning models on AWS EC2, offering clients real-time access to the prediction services.
- Evaluated models with mAP and ROC/AUC, and optimized them by fine-tuning different layers.
- Engaged and communicated with clients across both business and technology functions, presenting data visualization reports.
Environment: AWS (EC2, S3), Python, Deep Learning (TensorFlow, Keras, autoencoder), JIRA, GitHub, ETL, and Linux.
Confidential, New York, NY
- Engaged with stakeholders across both business and technology functions and presented data visualization reports built with Tableau and R Shiny.
- Collaborated with database engineers to manipulate ETL data; wrote and optimized SQL queries; migrated data from SQL Server into AWS Redshift; and handled outliers and missing data to prepare machine-learning-oriented tables.
- Built an anomaly detection algorithm in Python to monitor stock transaction activities, and performed data integrity checks, data cleaning, exploratory data analysis (EDA), and feature engineering with Pandas, Seaborn, and Matplotlib.
- Implemented machine learning and deep learning algorithms including Random Forest, Gradient Boosting Decision Trees, RNN, and LSTMs for optimal portfolio construction, interaction, and recommendations.
- Used AWS EC2 to stream real-time text data from Twitter and news channels for analysis; deployed NLP machine learning and deep learning models in TensorFlow, including Hierarchical Attention Networks (HAN) with bidirectional LSTMs and GloVe embeddings, to analyze the sentiment of specific topics in real time.
- Evaluated models (RMSE, MAE, F1-score, AUC) and optimized them by applying feature importance analysis, correlation matrices, and Bayesian hyperparameter optimization.
- Developed rich data visualizations to transform data into human-readable form with Seaborn, Tableau, and Plotly.
Environment: SQL Server, AWS (EC2, S3, Redshift), Python, Machine Learning (Random Forest, Gradient Boosting), Deep Learning (TensorFlow, Keras, LSTMs, Word Embeddings), Natural Language Processing (NLP), Tableau, JIRA, GitHub, ETL, and Linux.
Confidential, New York, NY
- Developed and updated SQL queries, stored procedures, clustered and non-clustered indexes, and functions to meet business requirements using SQL Server 2017.
- Used SSIS to create ETL processes to validate, extract, transform, and load data into the Data Warehouse and Data Mart.
- Improved Anti-Money Laundering prediction by developing machine learning algorithms such as random forest (RF) and gradient boosting machines for feature selection with Python Scikit-learn.
- Developed KNN, Logistic Regression, SVM, and Deep Neural Networks for rare event cases and suspicious activities.
- Tackled a highly imbalanced fraud dataset using undersampling, oversampling with SMOTE, and cost-sensitive algorithms in Python with Scikit-learn.
- Explored optimized sampling methodologies for different types of datasets.
- Explored and analyzed the features of suspicious transactions using SparkSQL.
- Used big data tools in Spark (PySpark, SparkSQL, MLlib) to conduct real-time analysis of transaction default on AWS.
- Designed and implemented a recommendation system that used collaborative filtering techniques to recommend courses to different customers, and deployed it to an AWS EMR cluster.
- Designed rich data visualizations to present data in human-readable form with Tableau, Shiny App, and Matplotlib.
Environment: SQL Server, SSIS, AWS (EC2, S3, Redshift, EMR), Python (Scikit-learn), Machine Learning (KNN, Logistic Regression, SVM), Tableau, JIRA, GitHub, ETL, Spark, Hadoop, R, Shiny App, and Linux.
Confidential, King of Prussia, PA
- Led the development and maintenance of a scalable data pipeline.
- Participated in all stages of the project life cycle, including data collection, data mining, data cleansing, predictive modeling, model optimization, and report generation.
- Designed SSIS packages to extract, transform, and load (ETL) data across different platforms, validate the data, and archive the data from databases.
- Preprocessed billion-row datasets of messy and incomplete data in Python, including normalization, language processing, scaling, PCA, and feature engineering with Scikit-learn, spaCy, Gensim, and NLTK.
- Conducted Exploratory Data Analysis (EDA) using Pandas, NumPy, Matplotlib, and Seaborn to visualize underlying patterns, correlations, and potential multicollinearity between features.
- Developed practical machine learning classification models using Logistic Regression, CatBoost, LightGBM, and ensemble methods to help clinical researchers predict clinical trial publication results.
- Performed complex pattern recognition on patients' time series data and forecasted visit demand using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
- Used evaluation metrics such as F1-score, ROC/AUC, and the confusion matrix to evaluate model performance, and performed stratified K-fold cross-validation to guard the model against overfitting.
- Performed data visualization and designed dashboards with Tableau, and generated complex reports including charts, summaries, and graphs to interpret the findings for the team and stakeholders.
Environment: SQL Server, SSIS, AWS (EC2, S3, Redshift), Python, Machine Learning (PCA, Logistic Regression, CatBoost, LightGBM), Tableau, JIRA, GitHub, ETL, and Time Series.
Confidential, Johnstown, PA
- Designed SSIS packages to ETL existing data into SQL Server, using Pivot Transformation, Fuzzy Lookup, Derived Columns, Conditional Split, Aggregation, Execute SQL Task, Data Flow Task, and Execute Package Task.
- Monitored existing metrics, analyzed data, and partnered with other internal teams to solve difficult business problems, creating a better customer and internal experience.
- Maintained and developed complex SQL queries, stored procedures, views, functions, and reports that meet customer requirements using Microsoft SQL Server.
- Created views, table-valued functions, Common Table Expressions (CTEs), joins, and complex subqueries to provide reporting solutions.
- Improved the performance of existing queries by modifying T-SQL, removing unnecessary columns and duplicated data, normalizing tables, establishing joins, and creating indexes.
- Migrated data from a SAS environment to SQL Server 2008 via SQL Server Integration Services (SSIS).
- Analyzed customer behavior for different loan products, produced portfolio reports, and provided ad hoc report analysis to help senior members make data-driven business decisions.
- Developed and implemented various types of Financial Reports (Income Statement, EBITA, and ROIC Reports) using SSRS.
Environment: SQL Server, SSIS, Tableau, JIRA, GitHub, and ETL.