Data Scientist Resume, Philadelphia, PA

SUMMARY:

6 years of extensive IT experience, including 4+ years in data science, with strong experience applying machine learning algorithms to statistical data. Performed advanced analytics, predictive modeling and data science to solve business issues and enable fact-based decision-making.
Significant expertise in data acquisition, storage, analysis, integration, machine learning, predictive modeling, logistic regression, decision trees, data mining methods, forecasting, factor analysis, ad hoc analysis, A/B testing, multivariate testing, time series analysis, cluster analysis, ANOVA, neural networks and other advanced statistical and econometric techniques.
Expertise includes abstracting and quantifying the computational aspects of problems, designing and applying new statistical algorithms, and systems-level software design and implementation on different platforms, e.g. R, SAS, Python, Spark. Experience in applying machine learning and statistical modeling techniques to solve business problems.
Expert in distilling vast amounts of data into meaningful discoveries at the requisite depth. Able to analyze highly complex projects at various levels.
Experience in building data-intensive big data applications and products using Hadoop ecosystem components such as Hadoop, Pig, Hive, Sqoop, Apache Spark and Apache Kafka.
Experience working in text understanding, classification, pattern recognition, recommendation systems, targeting systems and ranking systems using Python.
A deep understanding of statistical modeling, multivariate analysis, big data analytics and standard procedures. Highly efficient in dimensionality reduction methods such as PCA (Principal Component Analysis).
Experienced in job workflow scheduling and monitoring tools like Oozie and ESP. Experience using various Hadoop distributions (Pivotal, Hortonworks, MapR, etc.) to fully implement and leverage new Hadoop features.
Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC).
Visualization and dashboarding using Tableau, Python's Matplotlib, and graphing in R.

TECHNICAL SKILLS:

Machine Learning: Regression analysis, Ridge, Lasso Regression, K-NN, Decision Tree, Support Vector Machine (SVM), Artificial Neural Network (ANN), CNN, RNN, Ensemble methods like Bagging, Boosting, Stacking, K-Means clustering and Hierarchical clustering.

Python Libraries: Statistics

Databases: MySQL, SQL Server 2008/2012/2014, MongoDB, AWS DynamoDB.

Hadoop Ecosystem: Cloud Services

Reporting & Visualization Tools: Tableau, SSRS, Seaborn, Matplotlib, ggplot2.

Operating Systems: Linux (Ubuntu 14.x - 16.x), Windows 7 - 10, Mac OS.

PROFESSIONAL EXPERIENCE:

Confidential, Philadelphia, PA

Data Scientist

Responsibilities:

Developed computational and data science solutions for the storage, management, analysis, and visualization of genomic data.
Leveraged existing tools and publicly available genomics data to develop, test, or implement bioinformatics pipelines.
Extracted text and numerical features from patents with the Python library Beautiful Soup and built a Decision Tree model to predict each patent's disease classification.
Detected near-duplicate news articles by applying NLP methods (e.g. word2vec) and developing machine learning models such as label spreading and clustering.
Provided expertise in statistical methods or machine learning with the goal of applying these techniques to health data.
Extracted data from the database, copied it into HDFS and used Hadoop tools such as Hive and Pig Latin to retrieve the data required for building models.
Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
Tackled a highly imbalanced fraud dataset using sampling techniques like down-sampling, up-sampling and SMOTE (Synthetic Minority Over-Sampling Technique) using Python Scikit-learn.
Used K-Means clustering to identify outliers and to group unlabelled data.
Cleaned, analyzed and selected data to gauge customer experience.
Used algorithms and programming to efficiently go through large datasets and apply treatments, filters, and conditions as needed.
Used PCA and other feature engineering techniques to reduce high-dimensional data, and applied feature normalization and label encoding with the Scikit-learn library in Python.
Used Pandas, NumPy, Seaborn, Matplotlib and Scikit-learn in Python to develop various machine learning models such as Logistic Regression, Gradient Boosted Decision Trees and Neural Networks.
Used cross-validation to test the models with different batches of data to optimize the models and prevent overfitting.
Experimented with ensemble methods, including different Bagging and Boosting approaches, to increase model accuracy.
Implemented a Python-based distributed random forest via PySpark and MLlib.
Used AWS S3, DynamoDB, AWS Lambda and AWS EC2 for data storage and model deployment.
Created and maintained Tableau reports to display the status and performance of deployed models and algorithms.

Technology Stack: Oracle 11g, Hadoop 2.x, HDFS, Hive, Pig Latin, Spark/PySpark/MLlib, Python 3.x (NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn), Jupyter Notebook, AWS, GitHub, Linux, Machine learning algorithms, Tableau.

Confidential - Morristown, NJ

Data Scientist

Responsibilities:

Communicated and coordinated with the end client to collect data, and performed ETL to bring it into a uniform standard format.
Queried and retrieved data from SQL Server database to get the sample dataset.
In the pre-processing phase, used Pandas to handle missing data, cast data types, and merge or group tables for the EDA process.
Used PCA along with other feature engineering, feature normalization and label encoding techniques from Scikit-learn pre-processing to reduce high-dimensional data (>150 features).
In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC) and Random Forest provided by Scikit-learn, XGBoost, LightGBM and a Neural Network built with Keras to predict show-up probability and visit counts.
Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold and hold-out schemes, to test and verify the models' significance.
Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
Participated in feature engineering, including generating feature intersections, feature normalization and label encoding with Scikit-learn pre-processing.
Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
Used Python (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib and Python to apply a broad variety of machine learning methods, including classification, regression and dimensionality reduction.
Implemented, tuned and tested the model on AWS Lambda with the best performing algorithm and parameters.
Implemented a hypothesis testing kit for sparse sample data by writing R packages.
Collected feedback after deployment and retrained the model to improve performance.
Designed, developed and maintained daily and monthly summary, trending and benchmark reports in Tableau Desktop.

Technology Stack: SQL Server 2012/2014, AWS EC2, AWS Lambda, AWS S3, AWS EMR, Linux, Python 3.x (Scikit-learn, NumPy, Pandas, Matplotlib), R, Machine Learning algorithms, Tableau.

Confidential - Indianapolis, IN

Intern Data Scientist

Responsibilities:

Independently delivered 4 projects, from data orchestration and workflow design to production code for online software release.
Developed custom intent classification techniques for use during intent creation and testing by modifying the Word Mover's Distance algorithm.
Diagnosed performance issues that occurred only on the server and not locally, using JProfiler to monitor memory utilization.
Analyzed incoming new data, and identified possible problems with intent design.
Diagnosed problems that were rooted in bad SQL schema design.
Used local and Azure cloud multiprocessing to generate time series forecasts for 50+ million search terms.
Optimized key features for ad campaigns to generate the best ROI for ad bid, ad budget and sales margins.
Used feature importance to find the top search terms that generated the most revenue for the top 20+ million products.
Applied computer vision and split testing to optimize product pictures for the best sales conversion.

Confidential

Data Analyst

Responsibilities:

Performed data profiling in the source systems required for the New Customer Engagement (NCE) data mart.
Documented the complete process flow to describe program development, logic, testing, implementation, application integration and coding.
Manipulated, cleansed and processed data using Excel, Access and SQL.
Responsible for loading, extracting and validating client data.
Liaised with end users and third-party suppliers. Analyzed raw data, drew conclusions and developed recommendations. Wrote SQL scripts to manipulate data for data loads and extracts.
Developed analytical databases from complex financial source data. Performed daily system checks. Handled data entry and data auditing, created data reports and monitored all data for accuracy. Designed, developed and implemented new functionality.
Monitored the automated loading processes. Advised on the suitability of methodologies and suggested improvements.
Involved in defining the source-to-target data mappings, business rules, and business and data definitions. Responsible for defining the key identifiers for each mapping/interface.
Responsible for defining the functional requirement documents for each source-to-target interface.
Coordinated with business users to design new reports that meet user needs effectively and efficiently within the existing functionality.
Prepared data quality and traceability documents for each source interface.
Designed and implemented data integration modules for Extract/Transform/Load (ETL) functions.
Involved in Data warehouse and DataMart design.
Worked with internal architects, assisting in the development of current and target state data architectures.
Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.

Technology Stack: SQL Server, Oracle, MS Office, Teradata, Enterprise Architect, Informatica Data Quality, ER/Studio, TOAD, Business Objects, Greenplum Database, PL/SQL
