
Data Scientist Resume

Philadelphia, PA

SUMMARY:

  • Have 6 years of extensive IT experience, including 4+ years in data science applying machine learning algorithms to statistical data. Performed advanced analytics, predictive modeling, and data science to solve business problems and enable fact-based decision-making.
  • Significant expertise in data acquisition, storage, analysis, integration, machine learning, Predictive modeling, logistic regression, decision trees, data mining methods, forecasting, factor analysis, Ad hoc analysis, A/B testing, multivariate testing, time series analysis, cluster analysis, ANOVA, neural networks and other advanced statistical and econometric techniques.
  • Expertise includes abstracting and quantifying the computational aspects of problems, designing and applying new statistical algorithms, and systems-level software design and implementation on platforms such as R, SAS, Python, and Spark. Experience in applying machine learning and statistical modeling techniques to solve business problems.
  • Expert in distilling large volumes of data into meaningful discoveries. Able to analyze highly complex projects at multiple levels.
  • Experience in building big-data, data-intensive applications and products using Hadoop ecosystem components such as Hadoop, Pig, Hive, Sqoop, Apache Spark, and Apache Kafka.
  • Experience working on text understanding, classification, pattern recognition, recommendation systems, targeting systems, and ranking systems using Python.
  • Deep understanding of statistical modeling, multivariate analysis, big data analytics, and standard procedures. Highly proficient in dimensionality reduction methods such as PCA (Principal Component Analysis).
  • Experienced in job workflow scheduling and monitoring tools like Oozie and ESP. Experience using various Hadoop distributions (Pivotal, Hortonworks, MapR, etc.) to fully implement and leverage new Hadoop features.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC).
  • Visualization and dashboarding using Tableau, Python's Matplotlib, and graphing in R.

TECHNICAL SKILLS:

Machine Learning: Regression analysis, Ridge and Lasso regression, K-NN, Decision Tree, Support Vector Machine (SVM), Artificial Neural Network (ANN), CNN, RNN, ensemble methods such as Bagging, Boosting, and Stacking, K-Means clustering, and Hierarchical clustering.

Python Libraries: NumPy, Pandas, SciPy, Scikit-learn, Matplotlib, Seaborn, Keras, XGBoost, LightGBM, Beautiful Soup.

Databases: MySQL, SQL Server 2008/2012/2014, MongoDB, AWS DynamoDB.

Hadoop Ecosystem: HDFS, Hive, Pig, Sqoop, Spark, Kafka, Oozie, HBase.

Cloud Services: AWS (EC2, S3, Lambda, EMR, DynamoDB), Azure.

Reporting & Visualization Tools: Tableau, SSRS, Seaborn, Matplotlib, ggplot2.

Languages: Python, R, SQL, Scala.

Operating Systems: Linux (Ubuntu 14.x - 16.x), Windows 7 - 10, Mac OS.

PROFESSIONAL EXPERIENCE:

Confidential, Philadelphia, PA

Data Scientist

Responsibilities:

  • Developed computational and data science solutions for the storage, management, analysis, and visualization of genomic data.
  • Leveraged existing tools and publicly available genomics data to develop, test, or implement bioinformatics pipelines.
  • Extracted patent text and numerical features with the Python library Beautiful Soup, and built a Decision Tree model to predict patent classification by disease.
  • Detected near-duplicate news articles by applying NLP methods (e.g., word2vec) and developing machine learning models such as label spreading and clustering.
  • Provided expertise in statistical methods or machine learning with the goal of applying these techniques to health data.
  • Extracted data from the database, copied it into HDFS, and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
  • Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Tackled a highly imbalanced fraud dataset using sampling techniques such as down-sampling, up-sampling, and SMOTE (Synthetic Minority Over-Sampling Technique) with Python Scikit-learn; a resampling sketch follows this list.
  • Used clustering technique K-Means to identify outliers and to classify unlabelled data.
  • Cleaned, analyzed and selected data to gauge customer experience.
  • Used algorithms and programming to efficiently go through large datasets and apply treatments, filters, and conditions as needed.
  • Used PCA and other feature engineering techniques to reduce the high-dimensional data, together with feature normalization and label encoding via the Scikit-learn library in Python; a preprocessing sketch follows this list.
  • Used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python for developing various machine learning models such as Logistic regression, Gradient Boost Decision Tree and Neural Network.
  • Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
  • Experimented with Ensemble methods to increase the accuracy of the training model with different Bagging and Boosting methods.
  • Implemented a Python-based distributed random forest via PySpark and MLlib; a sketch follows this list.
  • Used AWS S3, DynamoDB, AWS Lambda, and AWS EC2 for data storage and model deployment.
  • Created and maintained Tableau reports to display the status and performance of the deployed models and algorithms.
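
A minimal sketch of the resampling and cross-validation steps above, assuming the imbalanced-learn package alongside Scikit-learn; the data, class ratio, and model choice are placeholders rather than the actual fraud dataset or production model:

    # Resampling an imbalanced dataset inside a cross-validated pipeline.
    # All data here is synthetic; only the technique is illustrated.
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline as ImbPipeline
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.normal(size=(1000, 20))                 # placeholder feature matrix
    y = (rng.rand(1000) < 0.05).astype(int)         # ~5% positive (fraud) class

    # Keeping resampling inside the pipeline applies it only to training folds
    pipeline = ImbPipeline(steps=[
        ("smote", SMOTE(random_state=0)),                  # up-sample the minority class
        ("under", RandomUnderSampler(random_state=0)),     # down-sample the majority class
        ("model", GradientBoostingClassifier(random_state=0)),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
    print("Mean ROC AUC across folds:", scores.mean())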
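
A minimal sketch of the preprocessing approach above: label-encode categorical columns, normalize features, and reduce dimensionality with PCA using Scikit-learn; the DataFrame and column names are illustrative placeholders:

    # Label encoding, feature normalization, and PCA on a toy DataFrame.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    df = pd.DataFrame({
        "category": ["A", "B", "A", "C", "B", "C"],
        "value_1": [1.0, 2.5, 3.1, 0.7, 1.9, 2.2],
        "value_2": [10.0, 8.2, 9.9, 11.3, 10.5, 9.1],
    })

    # Encode the categorical column as integers
    df["category"] = LabelEncoder().fit_transform(df["category"])

    # Normalize all features to zero mean and unit variance
    scaled = StandardScaler().fit_transform(df)

    # Keep enough principal components to explain ~95% of the variance
    pca = PCA(n_components=0.95, svd_solver="full")
    reduced = pca.fit_transform(scaled)
    print(reduced.shape, pca.explained_variance_ratio_)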
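
A minimal sketch of a distributed random forest with PySpark and its ML library, as referenced above; the input path, feature columns, and parameters are placeholders:

    # Training and evaluating a distributed random forest on a Spark DataFrame.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("rf-sketch").getOrCreate()

    # Assume a table with numeric feature columns and a binary "label" column
    df = spark.read.parquet("hdfs:///data/training_set")          # placeholder path
    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    data = assembler.transform(df)

    train, test = data.randomSplit([0.8, 0.2], seed=42)
    rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
    model = rf.fit(train)

    # Area under the ROC curve on the held-out split
    predictions = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
    print("Test AUC:", auc)

    spark.stop()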

Technology Stack: Oracle 11g, Hadoop 2.x, HDFS, Hive, Pig Latin, Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, AWS, GitHub, Linux, machine learning algorithms, Tableau.

Confidential - Morristown, NJ

Data Scientist

Responsibilities:

  • Communicated and coordinated with the end client to collect data and performed ETL to define a uniform standard format.
  • Queried and retrieved data from SQL Server database to get the sample dataset.
  • In the pre-processing phase, used Pandas to handle missing data, cast datatypes, and merge or group tables for the EDA process.
  • Used PCA and other feature engineering techniques, along with feature normalization and label encoding from Scikit-learn pre-processing, to reduce the high-dimensional data (>150 features).
  • In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights about the patient admission and discharge data.
  • Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC), and Random Forest from Scikit-learn, XGBoost, LightGBM, and a neural network in Keras to predict show probability and visit counts; a model-comparison sketch follows this list.
  • Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold, and hold-out schemes, to test and verify the models' significance.
  • Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
  • Participated in feature engineering such as feature intersection generation, feature normalization, and label encoding with Scikit-learn pre-processing.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn; a feature-selection sketch follows this list.
  • Used Python (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
  • Implemented, tuned and tested the model on AWS Lambda with the best performing algorithm and parameters.
  • Implemented a hypothesis-testing kit for sparse sample data by writing R packages.
  • Collected feedback after deployment and retrained the model to improve performance.
  • Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.
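
A minimal sketch of comparing the candidate models above with stratified k-fold cross-validation, assuming the xgboost and lightgbm packages are available; the data is synthetic and stands in for the actual patient admission features:

    # Stratified k-fold comparison of several classifiers on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier

    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "xgboost": XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=0),
        "lightgbm": LGBMClassifier(n_estimators=200, random_state=0),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        print(f"{name}: mean ROC AUC = {scores.mean():.3f}")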
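
A minimal sketch of tree-based feature selection as described above: rank features by random-forest importance, keep the most informative ones, and fit the downstream model on the reduced set; the data and threshold are illustrative:

    # Feature selection with random-forest importances via SelectFromModel.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                               random_state=0)

    # Keep features whose importance exceeds the median importance
    selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                               threshold="median")
    X_selected = selector.fit_transform(X, y)
    print("Features kept:", X_selected.shape[1])

    # The reduced feature set then feeds the downstream model
    clf = GradientBoostingClassifier(random_state=0).fit(X_selected, y)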

Technology Stack: SQL Server 2012/2014, AWS EC2, AWS Lambda, AWS S3, AWS EMR, Linux, Python 3.x (Scikit-learn, NumPy, Pandas, Matplotlib), R, machine learning algorithms, Tableau.

Confidential - Indianapolis, IN

Intern Data Scientist

Responsibilities:

  • Independently delivered 4 projects, from data orchestration and workflow design to production code for online software releases.
  • Developed custom intent classification techniques for intent creation and testing by modifying the Word Mover's Distance algorithm; a sketch follows this list.
  • Diagnosed performance issues that occurred only on the server and not locally, using JProfiler to monitor memory utilization.
  • Analyzed incoming new data, and identified possible problems with intent design.
  • Diagnosed problems that were rooted in bad SQL schema design.
  • Used local and Azure cloud multiprocessing to produce time-series forecasts for 50+ million search terms.
  • Optimized key features for ad campaigns to generate the best ROI across ad bids, ad budgets, and sales margins.
  • Used feature importance to find the top search terms that generated the most revenue for the top 20+ million products.
  • Applied computer vision and split testing to optimize product pictures for the best sales conversion.
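
A minimal sketch of scoring intent similarity with Word Mover's Distance via gensim, in the spirit of the modified-WMD approach above; the custom modifications are not reproduced, the embedding file path is a placeholder, and gensim's wmdistance requires the POT package to be installed:

    # Pick the intent whose example phrasing is closest (lowest WMD) to an utterance.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)  # placeholder

    utterance = "i want to check my order status".split()
    intent_examples = {
        "track_order": "where is my package".split(),
        "cancel_order": "please cancel my purchase".split(),
    }

    # Lower distance means the utterance is closer to that intent's example
    distances = {intent: vectors.wmdistance(utterance, example)
                 for intent, example in intent_examples.items()}
    predicted_intent = min(distances, key=distances.get)
    print(distances, "->", predicted_intent)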

Confidential

Data Analyst

Responsibilities:

  • Performed data profiling in the source systems that are required for New Customer Engagement (NCE) Data mart.
  • Documented the complete process flow describing program development, logic, testing, implementation, application integration, and coding.
  • Manipulated, cleansed, and processed data using Excel, Access, and SQL.
  • Responsible for loading, extracting, and validating client data.
  • Liaised with end users and third-party suppliers; analyzed raw data, drew conclusions, and developed recommendations, writing SQL scripts to manipulate data for loads and extracts.
  • Developed analytical databases from complex financial source data; performed daily system checks; handled data entry, data auditing, report creation, and accuracy monitoring; designed, developed, and implemented new functionality.
  • Monitored the automated loading processes; advised on the suitability of methodologies and suggested improvements.
  • Involved in defining the source to target data mappings, business rules, and business and data definitions. Responsible for defining the key identifiers for each mapping/interface.
  • Responsible for defining the functional requirement documents for each source to target interface.
  • Coordinated with business users to design new reporting needs appropriately, effectively, and efficiently on top of the existing functionality.
  • Documented data quality and traceability for each source interface.
  • Designed and implemented data integration modules for Extract/Transform/Load (ETL) functions; an illustrative sketch follows this list.
  • Involved in Data warehouse and DataMart design.
  • Worked with internal architects, assisting in the development of current- and target-state data architectures.
  • Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
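
The ETL modules above were built with SQL and Informatica tooling; purely as an illustration of the extract-transform-load pattern, a minimal Python sketch might look like the following, where connection strings, table names, and business rules are placeholders:

    # Extract from a source system, apply simple transforms, load into a data mart.
    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("oracle+cx_oracle://user:pass@source-db")   # placeholder
    target = create_engine("postgresql://user:pass@target-mart")       # placeholder

    # Extract: pull the day's customer records from the source system
    customers = pd.read_sql("SELECT * FROM customers WHERE load_date = CURRENT_DATE", source)

    # Transform: apply basic cleansing rules and standardize formats
    customers["email"] = customers["email"].str.lower().str.strip()
    customers = customers.dropna(subset=["customer_id"])

    # Load: append the conformed records into the data mart dimension table
    customers.to_sql("dim_customer", target, if_exists="append", index=False)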

Technology Stack: SQL Server, Oracle, MS Office, Teradata, Enterprise Architect, Informatica Data Quality, ER/Studio, TOAD, Business Objects, Greenplum Database, PL/SQL
