
Lead Data Scientist / Senior Statistician Resume


Suitland, MD

TECHNICAL SKILLS:

TOOLS: R / RStudio, Python, SAS, SPSS (IBM Modeler), KNIME Analytics Platform.

TECHNIQUES: CRISP-DM (data mining model), SEMMA (data mining model), PMML, Big Data, AWS

ALGORITHMS: Decision Trees, Random Forest, Support Vector Machines, Nearest Neighbor, k-Means, Naïve Bayes, Linear / Logistic Regression, Deep Learning, Neural Nets

NLP Expertise: Social Media Data Analytics, Key-phrase Extraction, Named Entity Recognition, Sentiment Analysis, Document Classification, Topic Modeling, Document Term Matrix, SVD.

Stats Knowledge: Sampling, Descriptive Statistics, Hypothesis Testing, ANOVA, Factor Analysis, Principal Components Analysis, Singular Value Decomposition, Multidimensional Scaling, Stochastic Processes, Markov Models, Queuing Theory, Monte Carlo Simulation.

Domain Knowledge: Litigation - Predictive Coding for Responsive / Privilege document review, Electronic Discovery (E-Discovery), Recruiting / Staffing / Human Resources, Banking / Insurance, Regulatory and Risk Modeling (fraud detection, credit scoring, customer churn), Healthcare (Electronic Medical Records & e-Prescriptions), Marketing (recommendation systems and customer profiling), Cybersecurity Breaches & Cyber Attacks Detection.

EXPERIENCE:

Lead Data Scientist / Senior Statistician

Confidential, Suitland, MD

Responsibilities:

  • Collaborating with the Infrastructure and IT departments in setting up a statistical / predictive modeling development environment on Amazon Web Services, including EC2 and S3 buckets, installing open source tools, and configuring user accounts and privileges for RStudio, Python, TensorFlow, Git / GitHub, PostgreSQL RDBMS, KNIME Analytics Platform, and SAS Enterprise Miner.
  • Writing modeling strategy document, recommending and defending the CRISP-DM model to upper management and providing mentorship to junior data scientists and managers with less familiarity with machine learning, NLP, and CRISP-DM concepts.
  • Simulating census fraud data by collaborating with census experts in identifying data which violate census response rules.
  • Procuring social media data (Twitter, Facebook, Reddit, Instagram) through web APIs as well as third-party tools including Sysomos & IBM Watson.
  • Writing code in R / RStudio and Python (NumPy, scikit-learn) to train and evaluate machine learning models for detecting fraud. Comparing different algorithms and model parameters for best performance based on recall, precision, and AUC / ROC curves. Experimenting with Decision Trees, Support Vector Machines, Logistic Regression, Neural Networks, Deep Learning, Nearest Neighbor, and k-Means.
  • Performing natural language processing of social media data for potential census fraud detection using Latent Dirichlet Allocation (LDA) for topic modeling, bag-of-words and tf-idf document term matrix representations followed by document clustering (k-Means) and document classification (SVM and Neural Nets).
  • Generating word occurrence / word cloud and network link analysis visualizations using R (ggplot2 package), matplotlib (Python library), and Tableau; and web-based reporting and dashboards using D3.js (with HTML5, CSS3, JavaScript, jQuery) and R packages (RMarkdown, Shiny).
  • Developing and testing machine learning algorithms for mobile apps using jQuery Mobile, PhoneGap, and APIs.
  • Computing, analyzing, and modeling internet census questionnaire response time and question answering sequences and patterns associated with human respondents, compared with robots and machines.
  • Converting predictive models from R, Python, SAS Enterprise Miner into PMML, and writing a run / code book detailing modeling workflow, scoring new instances of input data, using PMML to deploy models.
  • Teaching the basics of data science, machine learning, and statistics to less experienced team members (group of 8), as well as crash courses on open source tools and the mathematics of deep learning, including matrix computations, linear algebra, Restricted Boltzmann Machines, Convolutional Neural Networks, and Recurrent Neural Networks.
  • Researching online articles and white papers relating to state-of-the-art methods in anomaly detection.
  • Participating in data science conferences to elicit constructive criticism of in-house methods, as well as keep abreast of current thinking and practices.
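
As a minimal illustration of the bag-of-words / tf-idf document representation described above (a toy corpus and stdlib-only code, not the project's actual implementation), tf-idf vectors can be built and documents compared by cosine similarity:

```python
# Minimal tf-idf sketch (illustration only): build a document-term
# representation, then compare documents by cosine similarity -- the same
# vector form used before clustering (k-Means) or classification (SVM).
import math
from collections import Counter

docs = [
    "census fraud detected in online responses",   # toy documents
    "social media posts about the census",
    "fraud patterns in social media data",
]

def tf_idf(corpus):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [d.split() for d in corpus]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))  # document freq
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vectors = tf_idf(docs)
# Documents 0 and 2 share the discriminative terms "fraud" and "in".
sims = [cosine(vectors[0], v) for v in vectors]
```

The same vectors feed directly into clustering or a classifier; terms occurring in every document get idf zero and drop out of the comparison.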

Data Scientist (Principal)

Confidential, Albany, NY

Responsibilities:

  • Collaborating with senior attorneys and SMEs in planning automated document review scope and strategies, including defining optimized lists of keywords and key phrases that best split documents into categories.
  • Performing sample size estimates and cost-benefit analysis on seed sets of documents (that must be reviewed by senior attorneys intimate with a particular case) to be used as input for training and testing document categorization algorithms. Planning model defensibility and EDRM workflow and compliance.
  • Performing feature extraction and selection using Latent Semantic Analysis, Principal Components Analysis, Factor Analysis, and Singular Value Decomposition, using KNIME version 2.12.2 and RapidMiner with R and Python programming tools.
  • Executing statistical methods for boosting model training samples and establishing confidence level and margin of error for model performance metrics (recall, precision, and F-score) with Monte Carlo simulation using Oracle Crystal Ball.
  • Performing correlation analysis and hypothesis tests on email metadata in order to establish any statistically significant relationships between individual (or combinations) email metadata attributes and document categorization (responsiveness and privilege) using MINITAB, R, and Python programming.
  • Comparing model costs and performances by varying modeling configurations such as document vector representation (binary, term frequency, term frequency-inverse document frequency), machine learning algorithm (Naïve Bayes, SVM, k-NN, Decision Trees), and document similarity / distance method (cosine / scalar product, LDA, Euclidean, Manhattan, etc.).
  • Estimating minimum sample size of documents required to be drawn from the document population (these documents would then be reviewed by human reviewers and compared with predicted codes) in order to be able to project model performance to the document population; and put a dollar value in the expected savings to be derived from the predictive coding methodology.
  • Estimating potential cost savings (up to millions of dollars in many cases) of applying Machine Learning / Predictive Coding to predict document probabilities for ranking and culling.
  • Setting up and administering Big Data environments and cluster / cloud computing using Amazon Web Services (AWS). Specifically: managing Spark clusters on Amazon Elastic Compute Cloud (EC2), as well as distributed datasets on Amazon Simple Storage Service (S3) data buckets.
  • Developing purely open source reusable predictive coding software for ranking Responsive and Privileged documents based on KNIME, R, and Python to guarantee no vendor lock-in at deployment.
  • Working with software engineers in implementing predictive models through a platform-independent methodology with industry standard predictive model markup language (PMML).
  • Representing the business / organization at Data Science conferences and seminars, and on blogs.
  • Conducting on-going literature review and independent research on Big Data, Clustered-Sampling for Model improvement.
  • Acting as general consulting Statistician and advising data professionals and decision makers on basic statistics, visualization tools (Qlikview, Spotfire, etc), as well as hypothesis testing and data-driven thinking.
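
The minimum-sample-size estimates above follow the standard formula for a proportion (such as recall measured on a reviewed sample of documents); a stdlib-only sketch with illustrative numbers, not actual case figures:

```python
# Sketch of the classic sample-size estimate for a proportion:
# n = z^2 * p(1-p) / e^2, optionally with a finite-population correction
# for small document collections. All numbers below are illustrative.
import math

def sample_size(confidence_z, expected_p, margin_of_error, population=None):
    n = (confidence_z ** 2) * expected_p * (1 - expected_p) / margin_of_error ** 2
    if population:  # finite-population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# 95% confidence (z ~= 1.96), worst-case p = 0.5, +/-5% margin of error:
n95 = sample_size(1.96, 0.5, 0.05)
# The same estimate for a collection of only 10,000 documents:
n95_small = sample_size(1.96, 0.5, 0.05, population=10_000)
```

The reviewed sample of this size lets model performance metrics (recall, precision, F-score) be projected to the full document population at the stated confidence level and margin of error.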

Data Scientist

Confidential, Hanover, MD

Responsibilities:

  • Collaborating with account managers, recruiters, and hiring managers to review and convert job descriptions into textual format suitable for natural language processing (NLTK python and TM package in R).
  • Developing text analytics algorithms using Python / R, including tokenization, tagging, stemming, lemmatization, term frequency, and inverse document frequency computations. Eventually converted unstructured documents into vectors for document length and distance calculations. Ranked and sorted documents. Compared various document representations: binary, term frequency, inverse document frequency, TF-IDF, and LDA, with cosine distances.
  • Compiled Python scripts into Windows executables, using py2exe, capable of writing output into Microsoft Excel and delivering on a scheduled basis to recruiters and account managers via email.
  • Working with Java programmers to integrate the Windows Python programs into an existing system written in Java, as well as planning alternative approaches such as using the Jython language for Python and Java to communicate.
  • As principal data scientist on this project, my role included:
  • Collaborating with account managers, recruiters, solution architects, database administrators and hiring managers in building training samples of past job orders categorized into filled vs. unfilled buckets.
  • Performing descriptive statistics, correlation and hypothesis testing to understand the underlying distributions of numeric variables, and F-test and Chi-square for categorical variables in order to identify predictive variables (features).
  • Using data reduction methods such as factor analysis and principal components analysis in R.
  • Collaborating with account managers, recruiters, and hiring managers to brainstorm and discover or derive predictive variables.
  • Converting unstructured documents into document vectors, and employing keyphrase algorithm in Python.
  • Building binary classifiers in Python / R for executing Decision Trees and Support Vector Machines, testing and deploying through the PMML standards, creating model documentation, and training end users.
  • As principal data scientist on this project, my role included:
  • Researching into various speech-to-text open source tools such as Microsoft TAPI / speech, Dragon, etc.
  • Collaborating with hiring managers to formulate audio interviewing and recording logistics to capture specific responses to standard questions in suitable audio formats for fast conversion to text documents.
  • Converting unstructured documents into document vectors, and employing key-phrase algorithm in Python.
  • Writing Windows program to automate speech-to-text conversion as a batch process.
  • Developing text analytics algorithms using Python / R, including tokenization, tagging, stemming, lemmatization, term frequency, and inverse document frequency computations. Eventually converted unstructured documents (speech interviews converted to text documents) into vectors for document length and distance calculations. Ranked and sorted documents (speech interview text documents).
  • Compared binary representation of document vectors with term frequency representation.
  • Researching into various speech-to-text NLP tools such as Microsoft SAPI, TAPI / speech, Dragon, etc.
  • Collaborating with call center managers to formalize and standardize mandatory call opening and closing Scripts that must be adhered to by call agents.
  • Converting unstructured documents into document vectors, and employing NLP, NLTK in Python.
  • Developing a VB.NET / Silverlight program to call web services and generate statistical analysis of words, as well as agent / call center compliance reports.
  • Writing Windows program to automate speech-to-text conversion as a batch process.
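
The comparison of binary versus term-frequency document vectors mentioned above can be sketched as follows (toy documents, not the recruiting or interview data): repeated terms add weight under term frequency but not under the binary representation, so the same pair of documents can score differently.

```python
# Illustration (not the original code) of binary vs. term-frequency
# document vectors compared under cosine similarity.
import math
from collections import Counter

def vectorize(tokens, binary=False):
    counts = Counter(tokens)
    return {t: 1 if binary else c for t, c in counts.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

d1 = "python java python python sql".split()   # "python" repeated
d2 = "python sql".split()

tf_sim = cosine(vectorize(d1), vectorize(d2))
bin_sim = cosine(vectorize(d1, binary=True), vectorize(d2, binary=True))
```

Here the repeated term pulls the term-frequency similarity above the binary one; on other document pairs the ordering can reverse, which is why both representations were compared.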

Confidential, Wilmington, DE

Data Scientist

Responsibilities:

  • Knowledge elicitation in collaboration with business analysts and subject matter experts to gather business and data requirements, and work with database professionals through the data preparation phase.
  • Determine and configure tools (including R code and packages with RStudio, and SAS Enterprise Guide) and methodologies that support CRISP and PMML modeling and deployment approaches.
  • Determine data reduction methodologies for dealing with noisy data (correlation matrix, principal components analysis, factor analysis, clustering, and multi - dimensional scaling), missing values, and outliers.
  • Research and make recommendations in terms of the appropriateness of employing Big Data tools in the data engineering aspects of the conversion to TSYS, as well as the modification or re-development of models (SAS + R source code) that failed to convert within the tolerance limits in the new environment.
  • Determine if the bank’s data qualifies as “Big Data” in terms of volume, variety, velocity, and veracity; and perform a cost-benefit analysis of the competing big data tools for implementing MapReduce (Apache Hadoop vs. the open source RHadoop package), SQL vs. NoSQL; as well as give brief presentations to management on current big data tools at a basic level, such as Pig, Hive, and Mahout.
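
The MapReduce pattern referenced above can be illustrated in plain Python; a real job would run distributed across a Hadoop cluster, so this single-machine sketch only shows the three phases of the word-count pattern on toy records:

```python
# Conceptual MapReduce word-count sketch: map, shuffle, reduce on one machine.
from collections import defaultdict

records = ["credit card fraud", "card not present fraud", "credit limit"]

# Map phase: emit (key, 1) pairs from each input record
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group emitted values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each key's values
counts = {key: sum(values) for key, values in grouped.items()}
```

In a distributed setting the map and reduce functions stay the same; the framework handles partitioning the records and shuffling intermediate pairs between nodes.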

Data Scientist

Confidential, Jersey City, NJ

Responsibilities:

  • Leading as Data Scientist to define, develop, deploy, and coordinate data acquisition and preparation, statistical analytics, interactive data visualization, and predictive models to build automated machine learning and scoring engines in support of business and capital markets projects, including the following major efforts:
  • Developing models to detect and investigate illegal money transactions. This uses both traditional predictive analytics and semantic analytics text mining.
  • The Securities and Exchange Commission generates daily alerts of suspicious financial instrument transactions (stocks, bonds, mutual funds, etc.) and watch lists subject to investigations. The main goal of this project is to employ machine learning in discovering the (unpublished) rules of the alert engines.
  • Developing predictive models to learn cardholder spending behavior and anticipate / detect potential credit card frauds.
  • Developing predictive models to understand the attributes of investors who are likely to withdraw or switch over to competitors.
  • My specific roles as principal statistician, software architect, and data scientist included (but were not limited to) the following:
  • Knowledge elicitation in collaboration with business analysts and subject matter experts to gather business and data requirements, and work with database professionals through the data preparation phase.
  • Determine and configure tools (including R programming, SAS, SPSS (IBM Modeler), and Spotfire for interactive visualization and its extension to call R and Python functions through an API) and methodologies that support CRISP and PMML modeling and deployment approaches.
  • Determine data reduction methodologies for dealing with noisy data (correlation matrix, principal components analysis, factor analysis, clustering, and multi-dimensional scaling), missing values, and outliers.
  • Determine appropriate models and default parameters including decision trees, Bayesian classification, support vector machines, random forests, logistic regression, K-Means, and Nearest Neighbor, etc.
  • Maintain code documentation, run book, report writing, and presentations to stakeholders.
  • Train non-technical business personnel to run / score models and generate reports.
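
As a hedged illustration of the kind of baseline classifier listed above (a hypothetical single feature and made-up data, not the production fraud or churn models), a one-feature logistic regression can be fit by gradient descent:

```python
# Toy logistic-regression sketch: one scaled feature (e.g. transaction
# amount), binary label (1 = flagged, 0 = normal), fit by stochastic
# gradient descent. Illustrative only.
import math

xs = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]   # hypothetical scaled feature values
ys = [0,   0,   0,   1,   1,   1]     # hypothetical labels

w, b = 0.0, 0.0
lr = 0.5
for _ in range(5000):
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(w * x + b)))   # sigmoid prediction
        w -= lr * (p - y) * x                  # gradient step on weight
        b -= lr * (p - y)                      # gradient step on bias

score = lambda x: 1 / (1 + math.exp(-(w * x + b)))
```

After training, low feature values score below 0.5 and high values above it; in practice the score would be thresholded or used to rank cases for investigation.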

Confidential, Washington, DC

Senior Statistician and SAS Programmer

Responsibilities:

  • Reviewed and tested SAS programs for speed under different data retrieval methods: DATA steps vs. PROC SQL.
  • Performed data validation through PROC UNIVARIATE, PROC FREQ, PROC TABULATE, and PROC REPORT. Tested several pieces of code logic within the COMPUTE block of PROC REPORT.
  • Reduced very large datasets to smaller but equally meaningful datasets using multivariate statistics including Factor Analysis and Principal Components Analysis and the corresponding SAS procs: PROC FACTOR and PROC PRINCOMP. Reduced 97 variables to 27 variables with similar regression results.
  • Created Macros and Macro Variables, and combined with PROC SQL to conditionally create datasets and launch C-Shell scripts to generate and send out email messages to designated recipients on a scheduled basis.
  • Reviewed SAS function libraries.
  • Worked with data from various sources combining PROC SQL, PROC IMPORT, and PROC EXPORT.
  • Created professional quality reports in different formats (PDF / HTML) using the SAS ODS feature.

Confidential, Arlington, VA

Statistical Consultant

Responsibilities:

  • Senior Data Mining Analyst, Electronic Prescription Benefits, Prescription Drug Abuse, and Fraud Detection System. Responsible for developing statistical data mining algorithms for pattern recognition and fraud detection in PHYSICIAN / PATIENT electronic prescribing patterns. Edited SAS MACROS, SAS DATA and PROC steps. Explored Base SAS, SAS/STAT, and SAS/GRAPH modules.
  • Providing data mining support and back up to the SURESCRIPTS data warehouse networks of E-PRESCRIBING data from PHYSICIANS, CLAIMS DATA from PHARMACIES, and MEDICATION HISTORY and ADVERSE EFFECTS data from PHARMACY BENEFITS MANAGERS. This includes, but is not limited to: client and cross-team interaction, knowledge elicitation from the client, participating in requirements, design, specification, fraud detection research and development, performance testing, statistical verification, and support in data analysis, data mining, and knowledge discovery for ELECTRONIC HEALTHCARE SYSTEMS and ELECTRONIC PRESCRIPTION SOFTWARE systems and vendors.
  • Senior Data Mining Analyst. Responsible for collaborating with Oracle database administrators regarding data field specifications, database design, and PL/SQL query execution and optimization.
  • Responsible for statistical and data mining technical support services including the use of various methods to quantify the effects and benefits of E-PRESCRIBING over FAX, PHONE, and PAPER methods of drug prescriptions
  • Responsible for exploring and testing PASW (Clementine 12) algorithms: K-Means Clustering, Decision Trees C5.0, Support Vector Machines, Kohonen Clustering, and PIM and PAR (program implementation and parameter files) modules and deployment on server machines running on UNIX platforms, including writing UNIX scripts combining VI, EMACS, and PICO editors and shell scripting (C-SHELL).
  • Responsible for managing and configuring system files and environmental variables with VI editor, and monitoring program and server performance, and executing emergency steps and programs in cases of anticipated or actual program failures due to data overload (up to 140 Million records at times) and potential server time out.

Confidential, Clarksville, MD

Senior Statistician / SAS Programmer Analyst

Responsibilities:

  • Working with physicians in identifying variables for inclusion in field screening of patients. Created databases and data capture front-end applications (VB.NET, and Joomla + PHP) with a MySQL database back-end. Developed SAS DATA step programs to integrate ETL data into the data warehouse.
  • Providing support as statistician in performing statistical analysis of demographical variables in conjunction with regression analysis for understanding the effectiveness of therapies and drugs based on ethnicity, gender, age, pregnancy status, and the concurrent administration of other drugs.
  • Working with medical diagnostics laboratories, prescription pharmacies, hospitals, and medical insurance organizations in monitoring data formatting requirements of their electronic medical records (EMR) software systems in order to develop in-house data transfer software for GVF compatible with client systems.
  • Performing comparison analysis SAS DATA and PROC steps: TTEST, REG, ANOVA, SQL, missing data estimation, graphs and charts, and reporting with SAS / GRAPH procs and MACRO language.
  • Developing billing and patient record automation software for scheduling periodic customized letters with values pulled from databases through executed stored procedures. Front-end of application is being created using VB.NET (2008), business rules and logic implemented through Windows API functions, custom DLLs, and WEB SERVICES; and back-end / output reports generated from Crystal Reports 11 (incorporated within VB.NET as a plug-in COM object).

Confidential, Fairfax, VA

Senior Statistician, Fraud Detection, Data Mining Analyst / Engineer and Software Developer

Responsibilities:

  • Senior Data Mining Analyst, IRS Tax, Electronic Fraud Detection System.
  • Responsible for developing statistical data mining algorithms for pattern recognition and fraud detection in fund disbursement applications.
  • Maintained SPSS PASW (formerly Clementine 12.0) data mining software programs.
  • Providing data mining support and back up to the IRS EFDS project. This includes, but is not limited to: client and cross-team interaction, knowledge elicitation from the client, participating in requirements, design, specification, fraud detection research and development, performance testing, statistical verification, and support in data analysis, data mining, and knowledge discovery for IRS tax fraud detection.
  • Responsible for collaborating with Oracle database administrators regarding data field specifications, database design, and PL/SQL query execution and optimization. Responsible for statistical and data mining technical support services including forecasting anticipated annual fraud volumes and the implicit human resource workload requirements for processing electronic filings and paper based forms on the basis of statistical predictions.
  • Responsible for exploring and testing PASW (Clementine 12) algorithms: K-Means Clustering, Decision Trees C5.0, Support Vector Machines, Kohonen Clustering, and PIM and PAR (program implementation and parameter files) modules and deployment on server machines running on UNIX platforms.
  • Responsible for managing and configuring system files and environmental variables with VI editor, and monitoring program and server performance, and executing emergency steps and programs in cases of anticipated or actual program failures due to data overload (up to 140 Million records at times) and potential server time out.
  • Participating in meetings, seminars, and conferences relating to tax laws. Evaluating the impact of changes in tax laws on operational software and hardware, and implementing approved changes to data mining software code as well as configuring database and server machines for process optimization. Performing software and hardware quality assurance to ensure standards are not compromised following changes in tax laws, standard operating procedures (SOPs), and modifications to database and computer infrastructure.

Confidential, Abingdon, Maryland

Senior Statistician, Data Mining Analyst / Engineer and Software Developer

Responsibilities:

  • Worked with chemical engineers using laboratory equipment to study the mathematics and geometry of the shapes and speeds of positive and negative ion waveforms as they travel through the ionization chamber in order to understand the unique features of each compound for later use in pattern recognition. Established patterns involving geometrical areas, gradients, amplitude, frequency, number of peaks, ratios, etc.
  • Collaborated with engineers and other computer programmers in designing the overall architecture of the intended software including its front-end controls, mode of operation in terms of Windows API functions, DLLs, databases operations, display of output information, and hardware configurations.
  • Performed extensive online research of various machine learning and predictive modeling algorithms and available commercial and open source software packages. Demonstrated worked examples to Management using sampled data obtained from supporting engineers to show the strengths and weaknesses of selected algorithms including Naïve Bayes, Support Vector Machines, Decision Trees, etc.; made recommendations on software purchases; and delivered regular training and project status updates to the whole team and Management.
  • Installed, developed, and maintained Visual Basic (VB 2008 .NET), Excel VBA, and StatGraphics programs.
  • Performed sample size calculations, hypothesis testing, regression analysis, probability distributions, graphs & charts.
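
The hypothesis-testing work above can be illustrated with a two-proportion z-test (made-up counts and stdlib-only code, not the laboratory data):

```python
# Two-proportion z-test sketch: does group A's success rate differ from
# group B's? Uses the pooled standard error and the normal CDF via erf.
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: p_a == p_b."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal distribution
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical detection rates: 60/200 (30%) vs. 30/200 (15%)
z, p = two_proportion_z(60, 200, 30, 200)
```

With these illustrative counts the difference is significant at the usual 5% level, so the null hypothesis of equal rates would be rejected.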

Confidential, Arlington, VA

Senior Consultant

Responsibilities:

  • Pattern recognition algorithms are being used to detect and quantify the severity of contamination by offending companies, for prosecution by EPA.
  • Performed Cluster Analysis to separate database into populations using Nearest-Neighbor and K-Means Algorithms, followed by standard SQL queries to identify and eliminate null values and incomplete records.
  • Performed pattern recognition analysis of WEB-LOG data for mode of entry, click fraud, conversion rate, and likelihood of return visit for online marketing, advertisement, and profitability.
  • Implementing Multivariate Analysis (including Factor Analysis and Principal Components Analysis) to compute correlation matrices, factors, factor loadings, and rotation (using VARIMAX) to reduce data dimensions to a manageable number.
  • Used statistical software (SAS Enterprise Miner and MINITAB) to perform data analysis, including step-wise linear regression (with analysis of the attending R-square and p-values) and correlation, in order to establish whether or not the presence of certain toxic chemical compounds provides evidence for the presence of other suspected material. (SPSS, StatGraphics, and SAS PROCs: GLM, FREQ, ANOVA).
  • Developing a Visual Basic (VB.NET) application to invoke PERL’s Regular Expressions for the purpose of discovering matching patterns (various equivalent chemical names and symbols) within field values that cannot be retrieved through classical SQL queries.
  • Acting as Windows programming expert for executing and monitoring batch programs remotely. Used VB 2008 and COM methodology to program Microsoft Excel and generate output data for SAS GRAPH procs needed for management reporting.
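
The k-Means separation step described above can be sketched in one dimension (toy readings and code, not the EPA database): the same assign-then-update loop separates the records into populations.

```python
# Minimal one-dimensional k-Means sketch: repeatedly assign each point to
# its nearest center, then move each center to the mean of its cluster.
def kmeans_1d(points, centers, iterations=20):
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:  # assignment step: nearest center by distance
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: each center becomes its cluster mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical contamination readings with two obvious groups
readings = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(readings, centers=[0.0, 5.0])
```

Once the populations are separated, standard SQL queries can run per cluster, as in the null-value and incomplete-record cleanup described above.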

Confidential, Alexandria, VA

Contractor: Senior Consultant

Responsibilities:

  • Participated in FAA (Federal Aviation Administration) statistical modeling for estimating daily staffing requirements of on-position controllers at U.S. airports.
  • Performed statistical analysis and consulting services on air traffic data using a range of statistical techniques including linear regression for identifying variable dependencies and correlations, principal component analysis and factor analysis for data reduction and variable selection into models, and analysis of ANOVA tables for selecting best models among alternatives.
  • Applied various mathematical and stochastic processes to study how air space configurations, number of runways, arrivals and departures, and radar communication profile can predict staffing requirements for air space controllers stationed in airport towers and TRACONs in fifty-two (52) U.S. airports: Linear Programming, Integer Programming, Queuing Models, Clustering, Decision Support Systems, Markov Models (Markov Chain Monte Carlo), Data Mining.
  • Installed, employed, deployed, and administered statistical software packages including SAS, STATGRAPHICS, and EXCEL VBA statistical add-ins.
  • Performed research in current statistical tools and techniques for discussion and use in projects, including attending forums, conferences, and seminars.
  • Acted as subject matter expert and trainer in the area of statistical education, as well as information technology including software development and computer programming and training to team members.
  • Developed Visual Basic programs for running batch data mining algorithms for converting large MS Access databases into text documents, simulating missing values, and searching for keywords using multinomial Bayesian statistics in order to route documents into predefined categories.
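
The multinomial Bayesian routing described above can be sketched with a tiny Naive Bayes classifier (hypothetical training documents and category names, Laplace smoothing, stdlib only):

```python
# Multinomial Naive Bayes document-routing sketch: count words per class,
# then score a new document by log prior + log word likelihoods.
import math
from collections import Counter, defaultdict

train = [  # hypothetical labeled documents
    ("invoice payment overdue balance", "finance"),
    ("payment received thank you", "finance"),
    ("server outage network failure", "it"),
    ("password reset network account", "it"),
]

word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def route(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_p = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out a class
            log_p += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = log_p
    return max(scores, key=scores.get)

category = route("network password failure")
```

New documents are routed to whichever predefined category maximizes the posterior score; the Visual Basic batch programs above played the surrounding role of converting databases to text and feeding documents through such a classifier.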

Confidential, Chevy Chase, MD

Senior Programmer Analyst Statistician

Responsibilities:

  • Performed auto insurance policy holder profiling for deciding risk factor using statistical learning algorithms for making binary classification including decision trees, support vector machines, and Bayesian methods.
  • Employed multivariate statistical analysis for filtering large datasets to optimum size while attempting to maximize variance (principal component analysis, factor analysis, clustering).
  • Provided training in basic statistical data analysis, data visualization methods, and software packages to Management and SAS Programmers with limited statistical knowledge.
  • Made and documented software enhancements to insurance policy underwriting management system using Crystal Reports (11) Developer, Visual Basic.NET & MS SQL Server 2003. Mostly front-end GUI, some back-end stored procedures SQL.
