
Data Scientist/Data Analyst Resume


Columbus, OH

SUMMARY

  • 7 years of experience in Machine Learning, Data Mining, Data Architecture, Data Modeling, Data Analysis, and NLP with large sets of structured and unstructured data, covering data acquisition, data validation, predictive modeling, data visualization, web crawling, and web scraping. Adept in statistical programming languages like R and Python, as well as Big Data technologies like Hadoop, Hive, HDFS, MapReduce, and NoSQL databases.
  • Proficient in managing the entire data science project life cycle and actively involved in all phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization (a minimal sketch of this validation workflow appears after this list).
  • Very good experience and knowledge in provisioning virtual clusters in the AWS cloud, including services like EC2, S3, and EMR.
  • Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
  • Excellent knowledge of Machine Learning, Mathematical Modeling, and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of, and exposure to, the Big Data ecosystem.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Experienced in Data Modeling techniques employing Data warehousing concepts like star/snowflake schema and Extended Star.
  • Excellent working experience and knowledge in the Hadoop ecosystem and related stores: HDFS, MapReduce, Hive, Pig, MongoDB, Cassandra, HBase.
  • Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
  • Excellent knowledge and experience in OLTP/OLAP System Study with focus on Oracle Hyperion Suite of technology, developing Database Schemas like Star schema and Snowflake schema (Fact Tables, Dimension Tables) used in relational, dimensional and multidimensional modeling, physical and logical Data modeling using Erwin tool.
  • Expertise in performing data parsing, data manipulation, and data preparation with methods including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape (see the pandas sketch after this list).
  • Experienced in mining, loading, and analyzing unstructured data (XML, JSON, flat file formats) in Hadoop.
  • Experienced in using various packages in R and Python such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, and Rpy2.
  • Extensive experience in text analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
  • Analyzed data and performed data preparation by applying historical models to the dataset in Azure ML.
  • Excellent hands-on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, and Spark SQL.
  • Experienced in Teradata RDBMS using the FastLoad, FastExport, MultiLoad, TPump, Teradata SQL Assistant, and BTEQ Teradata utilities.
  • Expertise in Excel macros, pivot tables, VLOOKUPs, and other advanced functions, and experience working in Agile/Scrum software environments.
  • Hands-on experience implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, and Principal Component Analysis.
  • Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
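A minimal sketch of the validation workflow mentioned above (feature scaling, PCA-based dimensionality reduction, K-fold cross-validation scored with ROC AUC); the bundled dataset and all hyperparameters are illustrative placeholders, not taken from any project described here:

    # Minimal sketch: scaling + PCA + logistic regression validated with
    # 5-fold cross-validation and ROC AUC. The bundled dataset and all
    # hyperparameters are illustrative placeholders.
    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    pipeline = Pipeline([
        ("scale", StandardScaler()),           # feature scaling
        ("pca", PCA(n_components=10)),         # dimensionality reduction
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    # 5-fold cross-validation scored by area under the ROC curve
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
    print("Mean ROC AUC: %.3f" % scores.mean())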
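Likewise, a minimal pandas sketch of the data-preparation methods listed above (describe, merge, melt/reshape); the frames and column names are hypothetical:

    # Minimal sketch of the pandas preparation methods named above (describe,
    # merge, melt/reshape); the frames and column names are hypothetical.
    import pandas as pd

    sales = pd.DataFrame({"store": ["A", "A", "B"],
                          "q1": [100, 120, 90],
                          "q2": [110, 130, 95]})
    stores = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

    print(sales.describe())                    # descriptive statistics

    merged = sales.merge(stores, on="store")   # combine two sources
    tidy = merged.melt(id_vars=["store", "region"],
                       value_vars=["q1", "q2"],
                       var_name="quarter", value_name="revenue")  # wide -> long
    print(tidy.head())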

TECHNICAL SKILLS

Languages: Java 8, Python, R

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, Rpy2.

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL

Data Modelling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka

Databases: SQL Server, MySQL, MS Access, Teradata, Netezza, MongoDB, Cassandra, HBase; data stores and query engines: HDFS, Hive, Impala, Pig, Spark SQL.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Operating System: Windows, Linux, Unix, macOS, Red Hat

PROFESSIONAL EXPERIENCE

Confidential - Columbus, OH

Data Scientist/Data Analyst

Responsibilities:

  • Reviewed suspicious activity and complex fraud cases to help identify and resolve fraud risk trends and issues.
  • Clearly and thoroughly documented investigation findings and conclusions.
  • Performed offline analysis of customer data to tune rules, expose patterns, research anomalies, reduce false positives, and build executive and project-level reports.
  • Identified meaningful insights from chargeback data; interpreted and communicated findings from analysis to engineers, product teams, and stakeholders.
  • Analyzed high-volume data to investigate, identify, and report trends linked to fraudulent transactions.
  • Utilized Sqoop to ingest data; used the analytics libraries scikit-learn, MLlib, and MLxtend.
  • Extensively used Python data science packages such as Pandas, NumPy, Matplotlib, Seaborn, SciPy, scikit-learn, and NLTK.
  • Performed Exploratory Data Analysis, trying to find trends and clusters.
  • Built models using techniques like Regression, Tree based ensemble methods, Time Series forecasting, KNN, Clustering and Isolation Forest methods.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning with Python scripts.
  • Extensively performed large data reads/writes to and from CSV and Excel files using Pandas.
  • Tasked with maintaining RDDs using Spark SQL.
  • Communicated and coordinated with other departments to collect business requirements.
  • Tackled the highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms (a minimal sketch appears after this list).
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
  • Implemented machine learning models (logistic regression, XGBoost) with Python scikit-learn.
  • Optimized algorithms with stochastic gradient descent and fine-tuned parameters with both manual tuning and automated tuning such as Bayesian optimization.
  • Developed a technical brief based on the business brief; this contained the detailed steps and stages of developing and delivering the project, including timelines.
  • After client sign-off on the technical brief, started developing the SAS code.
  • Wrote the data validation SAS code with the help of the UNIVARIATE and FREQ procedures.
  • Summarized the data at the customer level by joining customer transaction and dimension datasets with third-party sources.
  • Separately calculated the KPIs for Target and Mass campaigns across pre-, promo-, and post-periods with respect to their transactions, spend, and visits.
  • Also measured the KPIs MoM (month over month), QoQ (quarter over quarter), and YoY (year over year) with respect to the pre-promo-post periods.
  • Measured ROI based on the differences in pre-, promo-, and post-period KPIs.
  • Extensively used SAS procedures such as IMPORT, EXPORT, SORT, FREQ, MEANS, FORMAT, APPEND, UNIVARIATE, DATASETS, and REPORT.
  • Standardized the data with the help of PROC STANDARD.
  • Implemented cluster analysis (PROC CLUSTER and PROC FASTCLUS) iteratively.
  • Worked extensively with the data governance team to maintain data models, metadata, and dictionaries.
  • Used Python to preprocess data and find insights.
  • Iteratively rebuilt models to deal with changes in the data, refining them over time.
  • Created and published multiple dashboards and reports using Tableau Server.
  • Extensively used SQL queries for legacy data retrieval jobs.
  • Tasked with migrating the Django database from MySQL to PostgreSQL.
  • Gained expertise in data visualization using Matplotlib, Bokeh, and Plotly.
  • Responsible for maintaining and analyzing large datasets used by domain experts to analyze risk.
  • Developed Hive queries that compared new incoming data against historic data and built tables in Hive to store large volumes of data.
  • Used big data tools such as Spark (Spark SQL, MLlib) to conduct real-time analysis of credit card fraud on AWS.
  • Performed data audits, QA of SAS code/projects, and sense checks of results.
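A minimal sketch of the imbalanced-fraud modeling approach described in the bullets above, pairing random-forest-driven feature selection with a cost-sensitive logistic regression in scikit-learn; the synthetic dataset and every parameter are illustrative placeholders rather than actual project settings:

    # Minimal sketch: cost-sensitive logistic regression with random-forest
    # feature selection on a synthetic imbalanced "fraud" dataset.
    # All data and parameters are illustrative placeholders.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    # Roughly 1% positive class, mimicking a highly imbalanced fraud label.
    X, y = make_classification(n_samples=20000, n_features=30,
                               weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    model = Pipeline([
        # Random-forest importances drive the feature selection step.
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0))),
        # class_weight="balanced" is the cost-sensitive step.
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    model.fit(X_tr, y_tr)
    print(classification_report(y_te, model.predict(X_te)))

A production version would additionally tune the class weights or resampling ratio (for example, via the Bayesian optimization mentioned above) rather than relying on the "balanced" default.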

Environment: Spark, Hadoop, AWS, SAS Enterprise Guide, SAS/MACROS, SAS/ACCESS, SAS/STAT, SAS/SQL, Oracle, MS Office, Python (scikit-learn, Pandas, NumPy), Machine Learning (logistic regression, XGBoost), gradient descent, Bayesian optimization, Tableau.

Confidential - CT

Data Scientist/Data Analyst

Responsibilities:

  • Applied various machine learning algorithms and statistical models such as decision trees, regression models, neural networks, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications, and executed machine learning use cases under Spark ML and MLlib (a minimal pipeline sketch appears after this list).
  • Led the technical implementation of advanced analytics projects: defined the mathematical approaches, developed new and effective analytics algorithms, and wrote key pieces of mission-critical source code implementing advanced machine learning algorithms using Caffe, TensorFlow, Scala, Spark MLlib, R, and other tools and languages as needed.
  • Built analytical data pipelines to port data in and out of Hadoop/HDFS from structured and unstructured sources and designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
  • Performed K-means clustering, Multivariate analysis and Support Vector Machines in Python and R.
  • Professional Tableau user (Desktop, Online, and Server), Experience with Keras and Tensor Flow.
  • Created MapReduce jobs running over HDFS for data mining and analysis using R, loaded and stored data via Pig scripts and R for MapReduce operations, and created various data visualizations using R and Tableau.
  • Worked on machine learning over large-scale data using Spark and MapReduce.
  • Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for the new route.
  • Stored and retrieved data from data-warehouses using Amazon Redshift.
  • Responsible for planning & scheduling new product releases and promotional offers.
  • Used pandas, numpy, seaborn, scipy, matplotlib, scikit-learn, NLTK in Python for developing various machine learning algorithms.
  • Worked on NoSQL databases such as MongoDB and HBase.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data; created various data visualizations using Python and Tableau.
  • Worked on data pre-processing and cleaning to perform feature engineering, and performed data imputation techniques for missing values in the dataset using Python.
  • Extracted data from HDFS and prepared data for exploratory analysis using data munging.
  • Worked on Text Analytics, Naive Bayes, Sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
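A minimal sketch of a Spark ML pipeline of the kind referenced above; the toy DataFrame, column names, and choice of logistic regression are illustrative assumptions:

    # Minimal sketch: Spark ML logistic-regression pipeline. The toy
    # DataFrame, column names, and model choice are illustrative assumptions.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

    # Hypothetical feature columns f1..f3 with a binary label.
    df = spark.createDataFrame(
        [(0.0, 1.2, 3.4, 0.0), (1.5, 0.3, 2.2, 1.0), (0.7, 2.1, 0.9, 0.0)],
        ["f1", "f2", "f3", "label"])

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"],
                                outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(df)
    model.transform(df).select("label", "prediction").show()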

Environment: Python, MongoDB, JavaScript, SQL Server, HDFS, Pig, Hive, Oracle, DB2, Tableau, ETL (Informatica), SQL, T-SQL, EC2, EMR, Teradata, Hadoop Framework, AWS, Spark SQL, Scala, Spark MLlib, NLP, MATLAB, HBase, Cassandra, R, PySpark, Tableau Desktop, Excel, Linux, CDH5

Confidential - Framingham, MA

Data Scientist/Data Analyst

Responsibilities:

  • Built statistical analysis and machine learning models using rich, multi-brand, longitudinal data sets covering over 65 million customers globally.
  • Communicated with cross-functional business partners to find the right questions, the right data, and the right approaches needed to reach project goals.
  • Worked at all levels of a multi-billion-dollar fashion retailer (design, marketing, customer retention and acquisition, pricing, inventory, logistics, ecommerce; wherever we could make an impact).
  • Buyer Propensity Model: Designed a Bayesian model to estimate buyers’ propensity for different deal categories; gives statistically significant (+7%) lift in clicks, purchases and revenue across North America and European markets.
  • Statistical/Mining tools used: R, Gaussian Modeling, Hive, Tableau.
  • Co-Click Model: Augmented TJX deal recommendation using deal co-click similarity; provided statistically significant lifts (+5%) in clicks, purchases, and revenue across the majority of the countries. Statistical/mining tools used: R, Tableau, Hive, Matrix Factorization.
  • Ads Keyword Optimization: Designed keyword generation, using a combination of statistical and linguistic signals, from merchant pages, deal description etc.; obtained higher CTRs than non-linguistic approaches.
  • Statistical/Mining tools used: Statistical Parsing, StanfordNLP.
  • General Data Analysis: General and exploratory data analysis and statistical tests over a variety of TJX datasets/experiments (a sketch of a lift significance test appears after this list). Statistical/mining tools used: various statistical tests, R, Tableau, Shiny, Hive, Teradata.
  • Managing the end-to-end pipeline: As a senior member of the data science team, helped manage the end-to-end pipeline from ETL (extract, transform, load) to EDA (exploratory data analysis) and algorithm development, through to the final business decision and product design.
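A minimal sketch of how a click or purchase lift like those cited above could be checked for statistical significance; the two-proportion z-test here stands in for the various tests actually used, and the counts are illustrative placeholders, not campaign data:

    # Minimal sketch: two-proportion z-test for click-through lift between a
    # treatment and a control group; all counts are illustrative placeholders.
    from statsmodels.stats.proportion import proportions_ztest

    clicks = [5300, 5000]            # treatment, control conversions
    impressions = [100000, 100000]   # exposures per group

    stat, p_value = proportions_ztest(clicks, impressions)
    lift = clicks[0] / impressions[0] / (clicks[1] / impressions[1]) - 1
    print("lift: %.1f%%, z = %.2f, p = %.4f" % (lift * 100, stat, p_value))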

Environment: Exploratory Data Analysis (EDA), Statistical Tests (A/B, A/A, etc.), Stochastic Optimization, Bayesian Modeling, Time Series Analysis, Gaussian Processes, Recommendation Engines, Latent Space Models for Text, Large Scale Distributed Learning, Generative & Discriminative Models (clustering, regression, etc.).

Confidential

Data Scientist

Responsibilities:

  • Developed predictive models using appropriate statistical and machine-learning methodology to support strategic business decisions and front-line operations.
  • Designed and managed experiments and pilots to test hypotheses or generate observational data.
  • Proactively sought continuous improvement opportunities and supported cross-functional analytic projects with business stakeholders through implementation.
  • Identified data needs and assisted in designing the billing & payment operations data roadmap.
  • Collaborated with our underwriters, product managers, software engineers, designers, and data/business intelligence teams.
  • Identified growth opportunities through data (ROI evaluation, LTV, allowables and other analyses) to significantly improve business performance across channels, programs and purchasing modes.
  • Under minimal supervision, ascertained and executed plans to build, implement, and maintain predictive models (stepwise regression, Markov chains, time series forecasting, multiple and logistic regression models, etc.) using available tools and programming languages; monitored and provided feedback on model performance and recalibrated models as necessary.
  • Performed analysis to identify areas of improvement for conversion rate, user experience, media mix, cross-sell, and upsell; led recommendation implementation and adoption across areas.
  • Independently gathered requirements from appropriate business partners, including the data necessary for analysis; developed project plans, executed deliverables within agreed-upon timeframes, managed deadlines, communicated progress, and made recommendations to address issues.
  • Translated analytical findings and statistical models into measures of business impact and actionable recommendations for the marketing team.
  • Built decision trees in Python to represent segmentations of the data and identify key variables for predictive modeling; also used regression analysis, ANOVA, and Z-tests.
  • Helped business and engineering teams make data-driven decisions in the Policy and Claims systems by using data visualization techniques to transform unstructured data into structured form for business decisions.
  • Worked on Insurance Healthcare Claim System.
  • Used data to develop new statistical models to extract insights from large volumes of data using cluster analysis, neural networks, random forests, and ARIMA modelling for identifying patterns in time series analysis.
  • Developed and automated various modeling steps to make the project process faster and more accurate using machine learning techniques such as linear regression, non-linear regression, logistic regression, Naïve Bayes classification, Support Vector Machines, KNN, etc.
  • Created innovative algorithms behind a variety of services ranging from insurance underwriting risk to deciding the right insurance structure for all claims.
  • Performed ad hoc statistical, data mining, and machine learning analyses; developed and designed advanced predictive analytics models using Python.
  • Performed goodness-of-fit tests across various distributions to find the distribution that best models the claims data (a minimal sketch appears after this list).
  • Developed a logistic regression model in R to predict whether a policy will produce a claim when hit by a hailstorm (weather event).
  • Extracted data from the database using ETL concepts.
  • Validated the models on out-of-sample and out-of-time data.
  • Performed spatial analysis to determine possible relationships between claims and policies in force.
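A minimal sketch of the goodness-of-fit exercise described above, using SciPy to fit candidate distributions to a synthetic claims-severity sample and compare them with a Kolmogorov-Smirnov test; the data and the candidate list are illustrative assumptions:

    # Minimal sketch: maximum-likelihood fits of candidate distributions to a
    # synthetic claims-severity sample, compared via the Kolmogorov-Smirnov
    # test. The data and candidates are illustrative; note that p-values are
    # optimistic when parameters are fit on the same sample being tested.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    claims = rng.lognormal(mean=8.0, sigma=1.2, size=5000)  # fake severities

    for dist in (stats.lognorm, stats.gamma, stats.weibull_min):
        params = dist.fit(claims)                 # maximum-likelihood fit
        ks_stat, p = stats.kstest(claims, dist.name, args=params)
        print("%-12s KS = %.4f  p = %.4f" % (dist.name, ks_stat, p))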

Environment: Python, Regression analysis, ANOVA and Z test, Linear Regression, Non- Linear Regression, Logistic regression, Naïve Bayes Classification, Support Vector Machine, KNN, ETL, Cluster Analysis, Neural Network, Random Forest and ARIMA modelling.

Confidential

Jr. Data Analyst

Responsibilities:

  • Developed and implemented predictive models using Natural Language Processing techniques and machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN, PCA, and regularization for data analysis.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Applied clustering algorithms, i.e., hierarchical and K-means, with the help of scikit-learn and SciPy (a minimal sketch appears after this list).
  • Developed visualizations and dashboards using ggplot, Tableau.
  • Worked on the development of data warehouse, data lake, and ETL systems using relational and non-relational tools such as SQL and NoSQL databases.
  • Built and analyzed datasets using R, SAS, Matlab and Python (in decreasing order of usage).
  • Participated in all phases of data mining: data collection, data cleaning, developing models, validation, and visualization, and performed gap analysis.
  • Performed data manipulation and aggregation from different sources using Nexus, Toad, Business Objects, Power BI, and Smart View.
  • Implemented Agile Methodology for building an internal application.
  • Good knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
  • As an architect, delivered various complex OLAP databases/cubes, scorecards, dashboards, and reports.
  • Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
  • Implemented classification using supervised algorithms like Logistic Regression, Decision Trees, KNN, and Naive Bayes.
  • Used Teradata 15 utilities such as FastExport and MultiLoad for various data migration/ETL tasks from OLTP source systems to OLAP target systems.
  • Supported the testing team with system testing, integration testing, and UAT.
  • Involved in the preparation and design of technical documents like the Bus Matrix Document, PPDM Model, and LDM & PDM.
  • Understood client business problems and analyzed the data using appropriate statistical models to generate insights.
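A minimal sketch of the K-means and hierarchical clustering mentioned above, using scikit-learn and SciPy on a synthetic dataset; the data and the choice of four clusters are illustrative assumptions:

    # Minimal sketch: K-means (scikit-learn) and Ward hierarchical clustering
    # (SciPy) on a synthetic dataset; the data and the choice of four clusters
    # are illustrative assumptions.
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
    X = StandardScaler().fit_transform(X)

    # K-means; in practice k would be chosen via the elbow method or silhouette.
    km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Agglomerative (Ward) clustering cut at the same number of clusters.
    hc_labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
    print(km_labels[:10], hc_labels[:10])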

Environment: R 3.0, Erwin 9.5, Tableau 8.0, MDM, QlikView, MLlib, PL/SQL, HDFS, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, RStudio, Mahout, Java, Hive, AWS.
