Data Scientist Resume
New York City, NY
SUMMARY
- 6+ years of experience in Machine Learning, Data Mining, Predictive Modeling, and Visualization with large structured and unstructured data sets in the IT and Banking domains.
- Deep understanding of Python 3.x with NumPy, Pandas, SciPy, scikit-learn, Matplotlib, and NLTK.
- Proficient with SQL and NoSQL databases such as MySQL 5.x, MongoDB 3.x, Cassandra 3.x, and HBase 1.2.x.
- Experience with Big Data technologies such as the Hadoop ecosystem, Spark 2.x, and MapR Streams.
- Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, and Lasso/Ridge Regression; testing and validation using ROC curves and K-fold cross-validation.
- Worked with machine learning algorithms such as AdaBoost, GBDT, XGBoost, Gaussian mixture models, structural equation models, and Kalman filters.
- Strong skills in statistical methodologies such as Hypothesis Testing, Correspondence Analysis, Principal Component Analysis, ARIMA/GARCH time series analysis, and A/B testing.
- Proficient in building and publishing customized interactive reports and dashboards using Tableau 9.4 and D3.js.
- Good knowledge of Recommender Systems, Natural Language Processing, and data visualization.
- Skilled in data parsing, manipulation, and preparation, including describing data contents, computing descriptive statistics, regex, split/combine, remap, merge, subset, re-index, melt, and reshape operations.
- Working experience with Cloud Computing technologies such as AWS EC2 and Google Cloud.
- Hands-on experience with a large parallel and integrated GPU computation platform using PyCUDA 1.2 and OpenCL R3.
- Experience with Agile methodologies, the Scrum process, and Git for version control.
- Expertise in handling multiple tasks with a proactive approach to meeting deadlines and creating deliverables in fast-paced environments; comfortable interacting with business stakeholders and end users.
TECHNICAL SKILLS
Programming Languages: Python 2.x/3.x (NumPy, Pandas, NLTK, scikit-learn, Matplotlib), SQL, JavaScript, R 3.x, SAS 9.x
Statistical Methods: Time Series, ANOVA, Bayes' Law, PCA, A/B testing
Regression: Linear/Non-Linear, Logistic, SVM, Regression Trees
Classification: KNN, Naive Bayes, SVM, Decision Trees, Random Forests, Boosting
Clustering: K-means, Hierarchical Clustering
Others: Collaborative Filtering, Neural Networks, NLP, Deep Learning
Database Techniques: MySQL 5.x, SQL Server 2010+, MongoDB 3.x, Cassandra 3.x, HBase 0.98
Big Data Techniques: Hadoop 2.x, Spark 2.x, HDFS 2.x, Hive 2.x, HBase 1.x, MapR Streams
Cloud Platforms/GPU: AWS, Google Cloud, PyCUDA 1.2, OpenCL R3
Data Visualization: Tableau 9.4/9.2, D3.js, Matplotlib
Operating Systems: macOS, Windows, Linux (Ubuntu)
IDEs: PyCharm 2017, Spyder 2.1, Jupyter Notebook 4.1, Sublime Text 2.0
Other Skills: XML, CSS, HTML 5.2, AngularJS 1.x, Django 1.11
PROFESSIONAL EXPERIENCE
Confidential, New York City, NY
Data Scientist
Responsibilities:
- Deployed AdaBoost, GBDT, XGBoost, and other machine learning algorithms to analyze the behavior of millions of customers.
- Parsed data, producing concise conclusions from raw data in a clean, well-structured, and easily maintainable format.
- Used Pandas, NumPy, PyCUDA, OpenCL, and scikit-learn in Python to develop on Confidential's parallel and integrated GPU computation platform.
- Worked with driver profiles and historical data to improve both the driver and user experience, and developed data-driven approaches to understanding user profiles.
- Performed Linear/Nonlinear and Logistic Regression (SVM, Random Forest) to tag and classify users.
- Performed K-means clustering and multivariate analysis in Python; developed clustering and KNN algorithms that improved customer segmentation and market expansion.
- Worked on regional fragmentation analysis based on geo-location to optimize the distribution of drivers on the map.
- Extended hexagonal regional fragmentation analysis to irregular regional fragmentation analysis.
- Reduced long-term prediction error of key Uber metrics from 35% to 10%; conducted experimentation and optimization on the lifetime valuation of Uber users.
- Designed and managed A/B experiments and derived business insights from post-hoc analysis.
- Built high-performance MySQL/Hive/MongoDB queries and intuitive dashboards for management, engineering, and internal collaborators.
- Provided data science support for data-driven decision making in product development cycles.
Environment: Python 3.3, scikit-learn, PyCUDA 1.2, OpenCL 2.1, MySQL 5.7, HDFS 2.7, Hive 2.1, Spark 2.1, MongoDB 3.4.
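The segmentation work above can be sketched in a few lines; this is an illustrative example on synthetic data (feature names and cluster counts are hypothetical, not the original model):

```python
# Illustrative sketch: customer segmentation with K-means on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features: trips per week, average fare (two synthetic segments).
X = np.vstack([
    rng.normal([2, 10], 1.0, size=(100, 2)),   # occasional riders
    rng.normal([15, 25], 2.0, size=(100, 2)),  # frequent riders
])

# Standardize before clustering so both features carry equal weight.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_  # one segment label per customer
```

New customers can then be assigned to the nearest segment with `kmeans.predict` on their standardized features.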
Confidential, New York City, NY
Data Scientist
Responsibilities:
- Designed and implemented a data-driven debit/credit card fraud risk model in Python and developed fraud risk rules and strategies in SQL Server 2016, achieving an Account Takeover scenario loss reduction of 10% ($3.4MM) per year.
- Obtained and transformed principal component features with PCA on a highly unbalanced dataset, measuring accuracy using AUPRC (area under the precision-recall curve).
- Built real-time fraud prediction using Spark Streaming and batch processing; modularized Spark functions written for offline machine learning so they could be reused for real-time machine learning.
- Used MapR Streams, MapR-DB (HBase API), and MapR-FS.
- Performed Market-Basket Analysis and implemented Decision Trees, Random Forests, and K-fold cross-validation.
- Devised a credit card fraud classification system using SVM in Python, TACL, and a relational database on HP NonStop systems to identify the risk of payment transactions and classify normal versus fraudulent transactions, improving the F-score of the existing system from 0.65 to 0.94.
- Responsible for data identification, collection, exploration, and cleaning for modeling; participated in model development.
- Effectively prevented fraud activity in large compromise events through timely ad-hoc analysis of event data and cooperation with vendors, using efficient SQL/Python programs.
- Modeled probability distributions of various business activities in terms of parameters or distributions, and performed time-series analysis of time-dependent data.
- Designed rich data visualizations to present data in human-readable form: ROC curves, heat maps, D3 visualizations, Tableau, etc.
- Performed ARIMA and GARCH time series analysis and fit Gaussian mixture models.
Environment: Python 3.2, Hadoop 2, Spark 1.6, Spark Streaming, HBase, HDFS, Hive, Cassandra 3.9, D3.js, Matplotlib, Tableau 9.4, SQL Server 2016.
Confidential
Data Scientist
Responsibilities:
- Performed Logistic Regression, Classification, Random Forests and Clustering in Python.
- Developed the first hybrid recommender combining content-based and collaborative filtering algorithms.
- Web-scraped over 310,000 reviews and over 19,000 users' ratings using Python, including Requests, BeautifulSoup, lxml, CSS/XPath selectors, and anti-scraping countermeasures.
- Built a text processing pipeline comprising Tokenization, Lemmatization, TF-IDF, Sentiment Analysis, Latent Semantic Analysis, and Singular Value Decomposition.
- Designed and built hands-on anti-bot and spam-targeting systems.
- Used MySQL/MongoDB to store user preferences and information, and deployed the application to Confidential's cloud computing platform for better performance.
- Experienced in all aspects of analytics/data warehousing solutions: database issues, data modeling, data mapping, ETL development, metadata management, data migration, and reporting solutions.
Environment: Python 2.7, HTML5, CSS3, JavaScript, scikit-learn, MongoDB, Cloud Computing
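The TF-IDF and SVD stages of the text pipeline above can be sketched briefly; this is an illustrative example with toy documents (the corpus and component count are hypothetical):

```python
# Illustrative sketch: TF-IDF followed by truncated SVD
# (Latent Semantic Analysis).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "great movie, loved the plot",
    "terrible movie, boring plot",
    "the soundtrack was great",
    "boring and terrible soundtrack",
]

# Sparse term-document matrix of TF-IDF weights.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Truncated SVD projects documents into a low-rank latent-topic space.
lsa = TruncatedSVD(n_components=2, random_state=0)
topics = lsa.fit_transform(X)  # one dense 2-d embedding per document
```

The dense `topics` embeddings can then feed a content-based recommender or a downstream sentiment classifier.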
Confidential
Data Analyst
Responsibilities:
- Acquired data from Taobao (Chinese eBay) reviews using a Python web crawler and performed sentiment analysis.
- Performed text analysis using signal-processing methods to find patterns in customer behavior, along with Weibo (Chinese Twitter) analytics.
- Developed the required XML Schema documents and implemented the framework for parsing XML documents.
- Built a sentiment analysis model to classify and predict reviews using NLTK.
- Created dashboards (Tableau/PPT) for stakeholders to monitor KPIs.
- Analyzed trends and rankings with linear and multivariable regression to support more effective prediction and product development decisions.
Environment: Python 2.7, NLTK, Tableau 9.1, PowerPoint, MySQL 4.1
Confidential
Business Analyst
Responsibilities:
- Collected, understood, and transmitted the business requirements for the project, translating them into functional specifications along with customization of telecom software products.
- Gathered business requirements through interviews, surveys, and observation of account managers; conducted controlled brainstorming sessions with project focus groups and documented the results in the Business Requirements Document.
- Created Use Case Diagrams, Activity Diagrams, and Sequence Diagrams using MS Visio/Excel.
- Coordinated with QA team to create the test approach and determine test needs, test environment, test data, resources and limitations.
- Assisted the QA in performing simple SQL queries for QA testing and data validation.
Environment: MS Visio, MS Office(Excel/PowerPoint/Word), SQL-Server 2010