
Data Scientist Resume

Union, NJ


  • Highly efficient Data Scientist/Data Engineer with 8+ years of experience in Data Analysis, Statistical Analysis, Machine Learning, predictive modeling, and data mining on large structured and unstructured data sets in the banking, automobile, food, and market research sectors.
  • Involved in the entire data science project life cycle, including data extraction, data cleansing, transformation, modeling, data visualization, and documentation.
  • Developed predictive models using Regression, Multiple Linear Regression, Logistic Regression, Decision Trees, Random Forests, Naive Bayes, Cluster Analysis, Association Rules/Market Basket Analysis, and Neural Networks.
  • Experience using various R and Python packages such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, SciPy, scikit-learn, BeautifulSoup, and rpy2.
  • Extensive experience with statistical programming languages such as R and Python.
  • Proficient in Predictive Modeling, Data Mining Methods, Factor Analysis, ANOVA, hypothesis testing, normal distribution, and other advanced statistical and econometric techniques.
  • Worked extensively on data analysis using RStudio, SQL, Tableau, and other BI tools.
  • Expertise in Exploratory Data Analysis (EDA), combining numerical summaries with relevant visualizations to drive feature engineering and assess feature importance.
  • Skilled in using dplyr in R and pandas in Python for exploratory data analysis.
  • Skilled in using Principal Component Analysis for dimensionality reduction.
  • Extensive hands-on experience with structured, semi-structured, and unstructured data using R, Python, Spark MLlib, SQL, and scikit-learn.
  • Strong with ETL, data warehousing, data store concepts, and data mining.
  • Extensive experience in Text Analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
  • Knowledge of Twitter text analytics using R functions such as sapply, Corpus, tm_map, and searchTwitter, and packages such as twitteR, RCurl, tm, and wordcloud.
  • Proficient in writing complex SQL queries, including stored procedures, triggers, joins, and subqueries.
  • Extensive working experience with Python, including scikit-learn, pandas, and NumPy.
  • Experienced in Python data manipulation for loading and extraction, as well as Python libraries such as NumPy, SciPy, and pandas for data analysis and numerical computation.
  • Skilled in data wrangling, correlation analysis, and handling multicollinearity, missing values, and imbalanced data.
  • Proficient in statistical modeling and machine learning techniques (Linear and Logistic Regression, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, XGBoost) for forecasting/predictive analytics, segmentation methodologies, regression-based models, hypothesis testing, Factor Analysis/PCA, and ensembles.
  • Experience in designing visualizations using Tableau and in publishing and presenting dashboards and storylines on web and desktop platforms.
  • Experience in using the Git version control system.
  • Knowledge of time series analysis using AR, MA, ARIMA, ARCH, and GARCH models.
  • Good knowledge of Apache Hive, Sqoop, Flume, Hue, and Oozie.
  • Knowledge of Big Data with Hadoop 2, HDFS, MapReduce, and Spark.
  • Knowledge of star schema and snowflake schema for Data Warehouse and ODS architecture.
  • Good knowledge of Amazon Web Services (AWS), including Amazon SageMaker and Amazon S3, for machine learning.
  • Collaborated with data warehouse developers to meet business user needs, promote data security, and maintain data integrity.
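
As a rough illustration of the correlation analysis, missing-value handling, and multicollinearity checks listed above, the following Python sketch imputes missing values with the column mean and flags feature pairs whose absolute Pearson correlation exceeds a threshold. The feature names, the sample values, and the 0.9 threshold are hypothetical and chosen only for illustration; they are not taken from any project described here.

```python
import math
from itertools import combinations

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(features, threshold=0.9):
    """Return feature-name pairs whose |r| exceeds the threshold."""
    cleaned = {name: impute_mean(col) for name, col in features.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(cleaned), 2)
        if abs(pearson(cleaned[a], cleaned[b])) > threshold
    ]

# Hypothetical toy data: "income" and "spend" move together, "age" does not.
features = {
    "income": [30.0, 40.0, None, 60.0, 70.0],
    "spend": [3.1, 4.0, 5.2, 5.9, 7.1],
    "age": [25.0, 52.0, 31.0, 47.0, 29.0],
}
flagged = collinear_pairs(features)
print(flagged)
```

In practice one of the two flagged features would typically be dropped or combined before fitting a regression model; which one is a modeling decision this sketch does not make.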


Data Analytics Tools/Programming: Python (NumPy, SciPy, pandas, Gensim, Keras), R (caret, Weka, ggplot2), MATLAB, Microsoft SQL Server, Oracle PL/SQL.

Analysis & Modelling Tools: Erwin, Sybase PowerDesigner, Oracle Designer, Rational Rose, ER/Studio, TOAD, MS Visio, SAS.

Data Visualization: Tableau, Visualization packages, Microsoft Excel.

Big Data Tools: Hadoop, MapReduce, Sqoop, Pig, Hive, NoSQL, Cassandra, MongoDB, Spark, Scala.

ETL Tools: Informatica PowerCenter, DataStage 7.5, Ab Initio, Talend.

OLAP Tools: MS SQL Analysis Manager, DB2 OLAP, Cognos PowerPlay.

Languages: SQL, PL/SQL, T-SQL, XML, HTML, UNIX Shell Scripting, C, C++, AWK, JavaScript.

Databases: Oracle 12c/11g/10g/9i/8i/8.0/7.x, Teradata 14.0, DB2 UDB 8.1, MS SQL Server 2008/2005, Netezza 4.0, Sybase ASE 12.5.3/15, Informix 9, AWS RDS.

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.

Methodologies: Ralph Kimball, COBOL.

Reporting Tools: Business Objects XI R2/6.5/5.0/5.1, Cognos Impromptu 7.0/6.0/5.0, Informatica Analytics Delivery Platform, MicroStrategy, SSRS, Tableau.

Tools: MS-Office suite (Word, Excel, MS Project and Outlook), VSS.

Programming Languages: SQL, T-SQL, Base SAS and SAS/SQL, HTML, XML.

Operating Systems: Windows 7/8, UNIX (Sun Solaris, HP-UX), Windows NT/XP/Vista, MS-DOS.


Confidential, Union, NJ

Data Scientist


  • Worked closely with business, data governance, SMEs, and vendors to define data requirements.
  • Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases on cloud infrastructure using AWS, EMR, and S3.
  • Selected statistical algorithms (Two-Class Logistic Regression, Boosted Decision Tree, Decision Forest classifiers, etc.).
  • Actively participated in data modeling, data warehousing and complex database designing.
  • Designed and developed NLP models for sentiment analysis.
  • Developed models using NLP to enhance the performance of Media Service Encoders.
  • Used MLlib, Spark's machine learning library, to build and evaluate different models.
  • Worked with Teradata 14 tools such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT), and BTEQ.
  • Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
  • Interpreted business problems and provided solutions using data analysis, data mining, optimization tools, machine learning techniques, and statistics.
  • Involved in creating a Data Lake by extracting customer Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
  • Used Spark DataFrames, Spark SQL, and Spark MLlib extensively, and designed and developed POCs using Scala, Spark SQL, and MLlib libraries.
  • Created a high-level ETL design document and assisted ETL developers in the detailed design and development of ETL maps using Informatica.
  • Adept at using the SAS Enterprise suite, Python, and Big Data technologies including Hadoop, Hive, Sqoop, Oozie, Flume, and MapReduce.
  • Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees, and Support Vector Machines for estimating the risks of welfare dependency.
  • Helped migrate and convert data from Sybase to Oracle, preparing mapping documents and developing partial SQL scripts as required.
  • Generated ad-hoc SQL queries using joins, database connections, and transformation rules to fetch data from legacy Oracle and SQL Server database systems.
  • Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
  • Strong SQL Server and Python programming skills, with experience working with functions.
  • Worked on predictive and what-if analysis using R on data from HDFS; successfully loaded files from Teradata to HDFS and from HDFS to Hive.
  • Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
  • Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
  • Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
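
The SQL-based data profiling in the last bullet can be sketched roughly as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for the Oracle and SQL Server systems named above; the `customers` table, its columns, its rows, and the `profile_column` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, state TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "NJ", 120.0), (2, "NJ", None), (3, "PA", 75.5), (4, None, 300.0)],
)

def profile_column(conn, table, column):
    """Return row count, NULL count, and distinct non-NULL count for a column.

    Table and column names are interpolated directly into the SQL text,
    so they must come from trusted code, never from user input.
    """
    sql = (
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {column}) FROM {table}"
    )
    rows, nulls, distinct = conn.execute(sql).fetchone()
    return {"rows": rows, "nulls": nulls, "distinct": distinct}

state_profile = profile_column(conn, "customers", "state")
balance_profile = profile_column(conn, "customers", "balance")
print(state_profile)
print(balance_profile)
```

Counts like these (null rates, cardinality) are the kind of profile that usually informs key selection and nullability decisions in a data model.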

Environment: R, Python, Machine Learning, Teradata 14, Hadoop MapReduce, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Informatica MDM, Cognos, SQL Server 2012, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.

Confidential, Parsippany, NJ

Data Scientist


  • Created MapReduce jobs running over HDFS for data mining and analysis using R, and loaded and stored data with Pig scripts and R for MapReduce operations.
  • Created deep learning models to detect various objects.
  • Designed the prototype of the data mart and documented its possible outcomes for end users.
  • Analyzed various aspects of the data to understand user behavior.
  • Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
  • Developed various QlikView data models by extracting and using data from various sources, including files, DB2, Excel, flat files, and Big Data.
  • Designed the procedures for getting data from all systems into the data warehousing system.
  • Applied various machine learning algorithms and statistical modeling techniques, including decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python and MATLAB.
  • Translated marketing and sales objectives into SQL scripts and data mining initiatives.
  • Collected data from various sources and collaborated on EDA and visualization.
  • Built prediction models using Linear and Ridge Regression to predict future customers based on historical data. Developed the model with 3 million historical data points and evaluated it with F-score and adjusted R-squared measures.
  • Built customer profiling models using K-means and K-means++ clustering algorithms to enable targeted marketing. Developed the model with 1.4 million data points and used the elbow method to find the optimal value of K, with sum of squared errors as the error measure.
  • Designed and implemented a probabilistic churn prediction model on data from 80k customers to predict the probability of customer churn using Logistic Regression in Python. The client used the results to finalize the list of customers to offer a discount.
  • Implemented dimensionality reduction using Principal Component Analysis and k-fold cross-validation as part of model improvement.
  • Implemented Pearson's correlation and maximum variance techniques to find the key predictors for the regression models.
  • Performed data analysis using Exploratory Data Analysis techniques in Python and R, including generating univariate and multivariate graphical plots.
  • Performed correlation analysis (chi-square and Pearson correlation tests).
  • Coordinated with onsite actuaries, senior management, and the client to interpret and report the results, assisting the incorporation of results into business scenarios.
  • Implemented various Big Data pipelines to build machine learning models.
  • Analyzed various customer behaviors on the product to find the root cause of problems.
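
The elbow method mentioned in the customer profiling bullet can be sketched as below: run K-means for several values of K, record the sum of squared errors (SSE), and look for the K after which the SSE stops dropping sharply. This is a small pure-Python illustration with K-means++ seeding, not the project's actual code; the toy points, the restart count, and the function name `kmeans_sse` are assumptions.

```python
import random

def kmeans_sse(points, k, n_init=10, max_iter=100, seed=0):
    """Run k-means with k-means++ seeding; return the lowest SSE over n_init restarts."""
    rng = random.Random(seed)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best = float("inf")
    for _ in range(n_init):
        # k-means++ seeding: later centers are drawn with probability
        # proportional to the squared distance to the nearest chosen center.
        centers = [rng.choice(points)]
        while len(centers) < k:
            weights = [min(dist2(p, c) for c in centers) for p in points]
            centers.append(rng.choices(points, weights=weights, k=1)[0])
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
                clusters[nearest].append(p)
            # Move each center to its cluster mean; keep it if the cluster is empty.
            new_centers = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                for i, cl in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        sse = sum(min(dist2(p, c) for c in centers) for p in points)
        best = min(best, sse)
    return best

# Hypothetical toy data: three tight, well-separated groups of nine points each.
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 10), (20, 0)]
    for dx in (-0.5, 0, 0.5)
    for dy in (-0.5, 0, 0.5)
]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

For this toy data the SSE falls steeply up to K = 3 and only marginally afterwards, which is the "elbow". On millions of points a library implementation such as scikit-learn's `KMeans` would be the practical choice.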

Environment: Oracle 12c, SQL*Plus, Erwin 9.6, MS Visio, SAS, Source Offsite (SOS), Python, Pandas, NumPy, Tableau, Hive, Pig, Windows XP, AWS, QC Explorer, SharePoint Workspace, Teradata, Agile, PostgreSQL, DataStage, MDM, Netezza, IBM InfoSphere, SQL, PL/SQL, IBM DB2, SSIS, Power BI, AWS Redshift, Business Objects XI 3.5, COBOL, SSRS, QuickData, Hadoop, MongoDB, HBase, Cassandra, JavaScript.

Confidential, Farmington, Connecticut

Data Architect/Data Modeler


  • Architected and designed solutions for complex business requirements, including data processing, analytics, ETL, and reporting processes, to improve the performance of data loads and processes.
  • Developed a high-performance, scalable data architecture solution that incorporates a matrix of technologies to relate architectural decisions to business needs.
  • Conducted strategy and architecture sessions and delivered artifacts such as the MDM strategy (Current State, Interim State, and Target State) and MDM architecture (Conceptual, Logical, and Physical) at a detailed level.
  • Conducted studies and rapid plots, and used advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
  • Currently implementing a POC on a chatbot using OpenNLP, machine learning, and deep learning.
  • Owned and managed all changes to the data models; created data models, solution designs, and data architecture documentation for complex information systems.
  • Designed and developed a dimensional data model on Redshift to provide an advanced selection analytics platform, and developed simple to complex MapReduce jobs using Hive and Pig.
  • Worked on AWS Redshift and RDS for implementing models and data on RDS and Redshift.
  • Worked with SMEs and other stakeholders to determine requirements and identify entities and attributes to build conceptual, logical, and physical data models.
  • Worked with data warehousing methodologies and dimensional data modeling techniques such as star/snowflake schema using Erwin 9.1.
  • Designed and implemented near-real-time ETL and analytics using the Redshift database.
  • Extensively used Netezza utilities such as NZLOAD and NZSQL and loaded data directly from Oracle to Netezza without any intermediate files.
  • Created a logical design and physical design in Erwin.
  • Implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Created Hive tables and worked on them using HiveQL.
  • Developed data mapping, data governance, and transformation and cleansing rules for the Master Data Management architecture involving OLTP and ODS, and generated ad-hoc reports using OBIEE.
  • Worked on the process that loads data from the CMS to the EMS library database, and was involved in data modeling and providing Teradata-related technical solutions to the team.
  • Built a real-time event analytics system using a dynamic Amazon Redshift schema.
  • Wrote SQL queries, PL/SQL procedures/packages, triggers, and cursors to extract and process data from various source tables of the database.
  • Determined customer satisfaction and helped enhance customer experience using NLP.
  • Created Hive tables, loaded transactional data from Teradata using Sqoop, and created and ran Sqoop jobs with incremental load to populate Hive external tables.
  • Worked with cloud-based technologies such as Redshift, S3, and EC2 on AWS, and extracted data from Oracle Financials and the Redshift database.
  • Designed and customized data models for a data warehouse supporting data from multiple sources in real time. Performed requirements elicitation and data analysis, and implemented ETL best practices.
  • Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
  • Created data models for AWS Redshift and Hive from dimensional data models.
  • Developed complex SQL scripts on the Teradata database to create a BI layer on the DW for Tableau reporting.
  • Extensively used ETL methodology to support data extraction, transformation, and loading in a complex EDW using Informatica.
  • Created ActiveBatch jobs to load data from distribution servers to a PostgreSQL DB using *.bat files, and worked on a CDC schema to keep track of all transactions.
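
The incremental Sqoop loads mentioned above rely on a high-water mark: each run imports only rows whose check column exceeds the last value already imported, similar in spirit to Sqoop's `--incremental append` mode with `--check-column` and `--last-value`. The Python sketch below mimics that logic in memory as a rough illustration; it does not call Sqoop or Hive, and the row layout, the `txn_id` check column, and the function name are hypothetical.

```python
def incremental_append(source_rows, target_rows, check_column, last_value):
    """Append rows whose check column exceeds last_value; return the new high-water mark."""
    new_rows = [r for r in source_rows if r[check_column] > last_value]
    new_rows.sort(key=lambda r: r[check_column])
    target_rows.extend(new_rows)
    # If nothing new arrived, the high-water mark stays where it was.
    return max((r[check_column] for r in new_rows), default=last_value)

# Hypothetical source table contents at two points in time.
source = [{"txn_id": 1, "amount": 10.0}, {"txn_id": 2, "amount": 25.0}]
target = []

mark = incremental_append(source, target, "txn_id", last_value=0)     # first run loads both rows
source.append({"txn_id": 3, "amount": 7.5})
mark = incremental_append(source, target, "txn_id", last_value=mark)  # second run loads only txn 3
print(mark, [r["txn_id"] for r in target])
```

Append-style loading like this assumes the check column only ever increases and existing rows are never updated; updated rows need a last-modified timestamp or a CDC feed instead.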

Environment: Erwin 9.5, MS Visio, Oracle 11g, Oracle Designer, MDM, Power BI, SAS, SSIS, Tableau, Tivoli Job Scheduler, SQL Server 2012, DataFlux 6.1, JavaScript, AWS Redshift, PL/SQL, SSRS, PostgreSQL, DataStage, SQL Navigator, Crystal Reports 9, Hive, Netezza, Teradata, T-SQL, Informatica.

Confidential, Colmar, Pennsylvania

Data Architect/Data Analyst/Data Modeler


  • Designed and developed data warehouse architecture, data modeling/conversion solutions, and ETL mapping solutions within structured data warehouse environments.
  • Reconciled data and ensured data integrity and consistency across various organizational operating platforms for business impact.
  • Optimized Python code for a variety of data mining and machine learning purposes.
  • Used Erwin for effective model management, sharing, dividing, and reusing model information and designs to improve productivity.
  • Involved in preparing logical and physical data models.
  • Worked extensively in both forward engineering and reverse engineering using data modeling tools.
  • Provided and applied quality assurance best practices for data mining/analysis services.
  • Involved in the creation and maintenance of the data warehouse and repositories containing metadata.
  • Used the ETL tool Informatica to populate the database and transform data from the old database to the new database using Oracle and SQL Server.
  • Identified inconsistencies or issues in incoming HL7 messages, documented them, and worked with clients to resolve the data inconsistencies.
  • Resolved the data type inconsistencies between the source systems and the target system using the mapping documents and by analyzing the database with SQL queries.
  • Extensively used both star schema and snowflake schema methodologies in building and designing the logical data model in both Type 1 and Type 2 dimensional models.
  • Worked with the DBA group to create a best-fit physical data model from the logical data model using forward engineering.
  • Worked with the Data Steward team on designing, documenting, and configuring Informatica Data Director to support management of MDM data.
  • Conducted HL7 integration testing with client systems, that is, testing of business scenarios to ensure that information flows correctly between applications.
  • Worked extensively on MySQL and Redshift performance tuning, reducing ETL job load time by 31% and DW space usage by 50%.
  • Used Teradata SQL Assistant, Teradata Administrator, PMON, and data load/export utilities such as BTEQ, FastLoad, MultiLoad, FastExport, and TPump on UNIX/Windows environments, and ran the batch process for Teradata.
  • Created dimensional models based on star schemas and designed them using Erwin.
  • Carried out HL7 interface unit testing to confirm that HL7 messages sent or received from each application conform to the HL7 interface specification.
  • Used tools such as SAS/ACCESS and SAS/SQL to create and extract Oracle tables.
  • Enabled SSIS package configuration to allow connection strings to be passed to connection managers, and values to package variables, explicitly per environment.
  • Responsible for implementing HL7 to build Orders, Results, ADT, and DFT interfaces for client hospitals.
  • Connected to Amazon Redshift through Tableau to extract live data for real-time analysis.
  • Developed SQL queries to fetch complex data from different tables in remote databases using joins, database links, and bulk collects.
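
The HL7 message checks described in this role can be illustrated with a small parser for pipe-delimited HL7 v2 messages. This is only a sketch under stated assumptions: the sample messages are invented, the handful of required-field checks are illustrative rather than any client's interface specification, and the function names `parse_hl7` and `validate_adt` are hypothetical.

```python
def parse_hl7(message):
    """Split an HL7 v2 message into {segment_id: fields}, keeping the first of each segment."""
    segments = {}
    for line in message.strip().split("\r"):
        fields = line.split("|")
        segments.setdefault(fields[0], fields)
    return segments

def validate_adt(message):
    """Return a list of problems found; an empty list means these checks passed."""
    seg = parse_hl7(message)
    if "MSH" not in seg:
        return ["missing MSH segment"]
    problems = []
    msh = seg["MSH"]
    # MSH-1 is the field separator itself, so MSH-n sits at index n-1 after splitting.
    msg_type = msh[8] if len(msh) > 8 else ""
    if not msg_type.startswith("ADT"):
        problems.append(f"unexpected message type: {msg_type!r}")
    if len(msh) <= 9 or not msh[9]:
        problems.append("missing message control ID (MSH-10)")
    pid = seg.get("PID")
    if pid is None:
        problems.append("missing PID segment")
    elif len(pid) <= 3 or not pid[3]:
        problems.append("missing patient identifier (PID-3)")
    return problems

# Invented sample messages; segments are separated by carriage returns.
good = "MSH|^~\\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20240101120000||ADT^A01|MSG00001|P|2.3\rPID|1||12345^^^HOSP^MR||DOE^JOHN\r"
bad = "MSH|^~\\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20240101120000||ADT^A01||P|2.3\rPID|1||||DOE^JOHN\r"
print(validate_adt(good))
print(validate_adt(bad))
```

A production interface engine would also handle repeated segments, escape sequences, and custom delimiters declared in MSH-2, which this sketch ignores.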

Environment: Erwin, Oracle, SQL Server 2008, Power BI, MS Excel, Netezza, Agile, MS Visio, Rational Rose, RequisitePro, SAS, SSIS, SSRS, Windows 7, PL/SQL, MDM, Teradata, MS Office, MS Access, SQL, Tableau, Informatica, Amazon Redshift.


Data Modeler/Data Analyst


  • Designed logical and physical data models for multiple OLTP and analytic applications.
  • Analyzed business requirements, kept track of data available from various data sources, and transformed and loaded the data into target tables using Informatica PowerCenter.
  • Extensively used the Erwin design tool and Erwin Model Manager to create and maintain the data mart.
  • Extensively used star schema methodologies in building and designing the logical data model into dimensional models.
  • Performed data mining using very complex SQL queries to discover patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
  • Created stored procedures using PL/SQL and tuned the databases and backend processes.
  • Involved in data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
  • Performed database performance tuning, including indexing, optimizing SQL statements, and monitoring the server.
  • Developed Informatica mappings, sessions, and workflows, and wrote PL/SQL code for effective and optimized data flow.
  • Wrote SQL queries, dynamic queries, subqueries, and complex joins for generating complex stored procedures, triggers, user-defined functions, views, and cursors.
  • Created a new HL7 interface based on the requirements using XML and XSLT technology.
  • Experienced in creating UNIX scripts for file transfer and file manipulation, and utilized SDLC and Agile methodologies such as Scrum.
  • Scheduled and monitored DataStage jobs, analyzed the performance of individual stages, and ran multiple instances of a job using DataStage Director.
  • Led successful integration of HL7 lab interfaces, used SQL expertise to integrate HL7 interfaces, and carried out detailed test cases on the newly built HL7 interface.
  • Wrote simple and advanced SQL queries and scripts to create standard and ad hoc reports for senior managers.
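
As a rough illustration of the star schema modeling and join-heavy reporting queries described in this role, the sketch below builds a tiny fact table with two dimensions and runs a typical aggregate query. It uses Python's built-in sqlite3 module rather than the Oracle or SQL Server databases named here, and every table name, column, and value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Lab'), (2, 'Pharmacy');
    INSERT INTO dim_date VALUES (20230101, 2023), (20240101, 2024);
    INSERT INTO fact_sales VALUES
        (1, 20230101, 100.0), (1, 20240101, 150.0), (2, 20240101, 80.0);
""")

# Typical star-schema query: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(rows)
```

The fact table holds only keys and measures, while descriptive attributes live in the dimensions; a snowflake schema would further normalize those dimensions into sub-tables.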

Environment: SQL Server, UML, Business Objects 5, Teradata, Windows XP, SSIS, SSRS, Embarcadero ER/Studio, Erwin, DB2, Informatica, HL7, Oracle, Query Management Facility (QMF), DataStage, ClearCase forms, SAS, Agile, UNIX and Shell Scripting.


Data Analyst/Data Modeler


  • Developed data mapping, data governance, and transformation and cleansing rules for the Master Data Management architecture involving OLTP and ODS.
  • Created new conceptual, logical, and physical data models using Erwin and reviewed these models with the application team and modeling team.
  • Handled numerous data pull requests using SQL for analysis, and created databases for OLAP metadata catalog tables using forward engineering of models in Erwin.
  • Enforced referential integrity in the OLTP data model for consistent relationships between tables and efficient database design.
  • Proficient in importing/exporting large amounts of data from files to Teradata and vice versa.
  • Identified and tracked the slowly changing dimensions, heterogeneous sources and determined the hierarchies in dimensions.
  • Utilized ODBC for connectivity to Teradata and MS Excel to automate reports and graphical representation of data for the business and operational analysts.
  • Extracted data from existing data sources, and developed and executed departmental reports for performance and response purposes using Oracle SQL and MS Excel.
  • Extracted data from existing data sources, performed ad-hoc queries, and used BTEQ to run Teradata SQL scripts to create the physical data model.
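
Tracking slowly changing dimensions, as mentioned above, is commonly done with the Type 2 pattern: when an attribute changes, the current row is expired and a new versioned row is appended, so history is preserved. The sketch below shows that pattern in plain Python as an illustration only; the row layout, the customer key, and the function name `apply_scd2` are hypothetical and not drawn from the project.

```python
from datetime import date

def apply_scd2(dimension, key, new_attrs, effective):
    """Apply a Type 2 change: expire the current row for `key` and append a new version.

    Rows are dicts with 'key', 'attrs', 'valid_from', 'valid_to', 'is_current'.
    If the attributes are unchanged, the dimension is left as it is.
    """
    current = next((r for r in dimension if r["key"] == key and r["is_current"]), None)
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append({
        "key": key, "attrs": dict(new_attrs),
        "valid_from": effective, "valid_to": None, "is_current": True,
    })
    return dimension

# Hypothetical customer dimension: the customer moves from NJ to PA.
dim = []
apply_scd2(dim, "C001", {"state": "NJ"}, date(2020, 1, 1))
apply_scd2(dim, "C001", {"state": "PA"}, date(2021, 6, 1))
apply_scd2(dim, "C001", {"state": "PA"}, date(2022, 1, 1))  # no change, so no new row
for row in dim:
    print(row)
```

A Type 1 change, by contrast, would simply overwrite the attribute in place and keep no history.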

Environment: UNIX scripting, Oracle SQL Developer, SSRS, SSIS, Teradata, Windows XP, SAS datasets.
