
Data Scientist/Big Data Engineer Resume

Basking Ridge, NJ


  • Over 8 years of IT industry experience encompassing machine learning, data mining with large datasets of structured and unstructured data, data acquisition, data validation, predictive modeling, and data visualization.
  • Proficient in the integration of various data sources with multiple relational databases like Oracle, MS SQL Server, DB2, Teradata, and flat files into the staging area, ODS, Data Warehouse, and Data Mart.
  • Ensure accurate loading of 110, 210, 310, 214, and other transportation data via BizTalk
  • Experience extracting data to create value-added datasets using Python, R, SAS, Azure, and SQL, analyzing behavior to target specific sets of customers and uncover hidden insights in the data in support of project objectives.
  • Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
  • Extensively worked on statistical analysis tools and adept at writing code in Advanced Excel, R, MATLAB, and Python.
  • Extensive experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
  • Experience with advanced SAS programming techniques, such as PROC SQL (JOIN/UNION), PROC APPEND, PROC DATASETS, and PROC TRANSPOSE.
  • Integration Architect and Data Scientist experience in analytics, Big Data, BPM, SOA, ETL, and cloud technologies.
  • Good experience using the Google Cloud Vision API to integrate vision detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.
  • Good knowledge of Gazebo and Point Cloud Library.
  • Highly skilled in using visualization tools like Tableau, ggplot2, and d3.js for creating dashboards.
  • Experience in foundational machine learning models and concepts: regression, random forest, boosting, GBM, NNs, HMMs, CRFs, MRFs, deep learning.
  • Proficient in statistical and other tools/languages: R, Python, C, C++, Java, SQL, UNIX, the QlikView data visualization tool, and the Anaplan forecasting tool.
  • Good knowledge of TensorFlow.
  • Implemented deep learning models and numerical computation using data flow graphs with TensorFlow.
  • Good experience in text mining, transposing words and phrases in unstructured data into numerical values.
  • Lightmapping of customized data forms
  • Worked with complex applications such as R, Stata, Scala, Perl, Linear, SAS, and SPSS to develop neural networks and cluster analyses.
  • Experienced in the full software lifecycle under SDLC, Agile, and Scrum methodologies.
  • Experience designing visualizations in Tableau and publishing and presenting dashboards and Storylines on web and desktop platforms.
  • Designed physical data architecture for new system engines.
  • Hands-on experience implementing LDA and Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, neural networks, and Principal Component Analysis, with good knowledge of Recommender Systems.
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
  • Developed logical data architecture in adherence to Enterprise Architecture.
  • Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.
  • Adept in statistical programming languages like R and Python, and in Big Data technologies like Hadoop 2, Hive, HDFS, MapReduce, and Spark.
  • Experienced in Spark 2.1, Spark SQL and PySpark.
  • Skilled in using dplyr in R and pandas in Python for exploratory data analysis.
  • Experience working with data modeling tools like Erwin, PowerDesigner, and ER/Studio.
  • Good understanding of Teradata SQL Assistant, Teradata Administrator, and data load/export utilities like BTEQ, FastLoad, MultiLoad, and FastExport.
  • Experience with data analytics, data reporting, ad hoc reporting, graphs, scales, PivotTables, and OLAP reporting.
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Worked with and extracted data from various database sources like Oracle, SQL Server, DB2, and Teradata.
  • Proficient knowledge of statistics, mathematics, machine learning, recommendation algorithms, and analytics, with an excellent understanding of business operations and analytics tools for effective analysis of data.


Languages: Java 8, Python, R

Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, Twitter, NLP, Reshape2, rjson, pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, Beautiful Soup, Rpy2.

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL.

Data Modelling Tools: Erwin r9.6, 9.5, 9.1, 8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner, Text Mining, and Google Cloud Vision.

Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.

Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.

ETL Tools: Informatica PowerCenter, SSIS.

Version Control Tools: SVN, GitHub.

Project Execution Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).

BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.

Operating Systems: Windows, Linux, Unix, Macintosh, Red Hat.


Confidential, Basking Ridge, NJ

Data Scientist/Big Data Engineer


  • Developed machine learning, statistical analysis, and data visualization applications for challenging data processing problems in the sustainability and biomedical domains.
  • Compiled data from various public and private databases to perform complex analysis and data manipulation for actionable results.
  • Designed and developed Natural Language Processing models for sentiment analysis.
  • Worked on Natural Language Processing with the NLTK module of Python to develop an application for automated customer response.
  • Performed predictive modeling with SAS, SPSS, R, and Python.
  • Applied concepts of probability, distributions, and statistical inference to the given dataset to unearth interesting findings through comparisons, t-tests, F-tests, R-squared, p-values, etc.
  • Applied linear regression, multiple regression, the ordinary least squares method, mean-variance, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, function fitting, etc. to data using the Scikit-learn, SciPy, NumPy, and pandas modules of Python.
  • Applied clustering algorithms such as hierarchical and K-means clustering using Scikit-learn and SciPy.
  • Developed visualizations and dashboards using ggplot and Tableau.
  • Worked on development of data warehouse, data lake, and ETL systems using relational and non-relational tools like SQL and NoSQL.
  • Built and analyzed datasets using R, SAS, MATLAB, and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationships between different attributes of the dataset and the causal relationships between them.
  • Performed complex pattern recognition on financial time series data and forecast returns using ARMA and ARIMA models and exponential smoothing for multivariate time series data.
  • Pipelined (ingest/clean/munge/transform) data for feature extraction toward downstream classification.
  • Used Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Wrote Hive queries for data analysis to meet the business requirements.
  • Expertise in Business Intelligence and data visualization using R and Tableau.
  • Expert in Agile and Scrum processes.
  • Validated macroeconomic data (e.g., BlackRock, Moody's) and performed predictive analysis of world markets using key indicators in Python and machine learning concepts like regression, bootstrap aggregation, and random forest.
  • Worked in large-scale database environments like Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
  • Interfaced with a large-scale database system through an ETL server for data extraction and preparation.
  • Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.

Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (Scikit-Learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.

Confidential, GA

Data Scientist/Big Data Engineer


  • Performed data profiling to learn about user behavior and merged data from multiple data sources.
  • Implemented big data processing applications to collect, clean, and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive, and HBase.
  • Designed and developed various machine learning frameworks using Python, R, and MATLAB.
  • Integrated R into MicroStrategy to expose metrics determined by more sophisticated and detailed models than natively available in the tool.
  • Independently coded new programs and designed tables to load and test the programs effectively for the given POCs using Big Data/Hadoop.
  • Architected Big Data solutions for projects and proposals using Hadoop, Spark, ELK Stack, Kafka, and TensorFlow.
  • Corrected minor data errors that prevented loading of EDI files.
  • Worked on clustering and classification of data using machine learning algorithms. Used TensorFlow to build sentiment and time series analyses.
  • Developed documents and dashboards of predictions in MicroStrategy and presented them to the Business Intelligence team.
  • Used the Cloud Vision API to integrate vision detection features within applications, including image labeling, face and landmark detection, optical character recognition (OCR), and tagging of explicit content.
  • Implemented text mining to transpose words and phrases in unstructured data into numerical values.
  • Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files, and Big Data.
  • Good knowledge of Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, SecondaryNameNode, and MapReduce concepts.
  • As Architect, delivered various complex OLAP databases/cubes, scorecards, dashboards, and reports.
  • Tracked and enabled communication across multiple departments to keep all parties informed about potential issues.
  • Utilized OpenCV for human face recognition and tackled the challenge of long running time on a personal computer for face
  • Programmed a utility in Python that used multiple packages (SciPy, NumPy, pandas).
  • Implemented classification using supervised algorithms like Logistic Regression, Decision Trees, KNN, and Naive Bayes.
  • Gained knowledge of OpenCV and applied it to identify red objects with a drone's camera.
  • Used Teradata 15 utilities such as FastExport and MLOAD for data migration/ETL tasks from OLTP source systems to OLAP target systems.
  • Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Collaborated with data engineers to implement the ETL process; wrote and optimized SQL queries to extract data from the cloud and merge it with data from Oracle 12c.
  • Collected unstructured data from MongoDB 3.3 and completed data aggregation.
  • Performed data integrity checks, data cleaning, exploratory analysis, and feature engineering using R 3.4.0.
  • Worked with freight carriers to correct EDI issues as they arose.
  • Analyzed customer consumption behavior and assessed customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering.
  • Worked on outlier identification with box plots and K-means clustering using pandas and NumPy.
  • Participated in feature engineering such as feature intersection generation, feature normalization, and label encoding with Scikit-learn preprocessing.
  • Used Python 3.0 (NumPy, SciPy, pandas, Scikit-learn, Seaborn, NLTK) and Spark 1.6/2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
  • Analyzed data and performed data preparation by applying the historical model to the dataset in Azure ML.
  • Experienced in delivery, portfolio, team/career, vendor, and program management. Competent in solution architecture, implementation, and delivery of Big Data, data science analytics, and DWH projects on Greenplum, Spark, Keras, Python, and TensorFlow.
  • Coordinated the execution of A/B tests to measure the effectiveness of a personalized recommendation system.
  • Performed data visualization with Tableau 10 and generated dashboards to present the findings.
  • Recommended and evaluated marketing approaches based on quality analytics of customer consumption behavior.
  • Determined customer satisfaction and helped enhance customer experience using NLP.
  • Worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
  • Used Git 2.6 for version control; tracked changes in files and coordinated work on them among multiple team members.

Environment: R, MATLAB, MongoDB, exploratory analysis, feature engineering, K-Means Clustering, Hierarchical Clustering, Machine Learning, Python, Spark (MLlib, PySpark), Tableau, MicroStrategy, Git, Unix, SAS, TensorFlow, regression, logistic regression, Hadoop 2.7, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML, MapReduce, OpenCV.

Confidential, St. Louis

Data Scientist


  • Used pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
  • Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and Big Data environments.
  • Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large business technology programs.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Performed Source System Analysis, database design, data modeling for the warehouse layer using MLDM concepts and package layer using Dimensional modeling.
  • Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem)
  • Developed Linux Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to Netezza database.
  • Designed and implemented the system architecture for an Amazon EC2-based cloud-hosted solution for the client.
  • Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
  • Hands-on experience with database design, relational integrity constraints, OLAP, OLTP, cubes, normalization (3NF), and denormalization of the database.
  • Worked on customer segmentation using an unsupervised learning technique - clustering.
  • Worked with various Teradata 15 tools and utilities like Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.

Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, Cluster analysis, Scala NLP, Cassandra, MapReduce, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, AWS.

Confidential, Norfolk, VA

Data Scientist/Data Modeler


  • Developed propensity models for retail liability products to drive proactive campaigns.
  • Extraction and tabulation of data from multiple data sources using R and SAS.
  • Data cleansing, transformation, and creation of new variables using R.
  • Built predictive scorecards for cross-selling Car Loan, Life Insurance, TD, and RD.
  • Scoring predictive models as per regulatory requirements & ensuring deliverables with PSI.
  • Data modeling and formulation of statistical equations using advanced statistical forecasting techniques.
  • Provide guidance and mentoring to team members.
  • Arrange and chair Data Workshops with SMEs and related stakeholders to understand the requirements data catalog.
  • Responsible for defining the key identifiers for each mapping/interface.
  • Responsible for defining the functional requirement documents for each source to target interface.
  • Document, clarify, and communicate change requests with the requestor, and coordinate with the development and testing teams.
  • Work with users to identify the most appropriate source of record and profile the data required for sales and service.
  • Document the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.
  • Design a Logical Data Model that fits and adopts the Teradata Financial Services Logical Data Model (FSLDM11) using the Erwin data modeler tool.
  • Present the designed Logical Data Model to the Data Model Governance Committee (DMGC) for approval.
  • Identified the customer and account attributes required for MDM implementation from disparate sources and prepared detailed documentation.
  • Validated the machine learning classifiers using ROC Curves and Lift Charts.
  • Extracted data from HDFS and prepared data for exploratory analysis using data munging.

Environment: Erwin 8, Teradata 13, SQL Server 2008, Oracle 9i, SQL*Loader, PL/SQL, ODS, OLAP, OLTP, SSAS, Informatica PowerCenter 8.1.

Confidential, Dallas, TX

Data Analyst


  • Used Microsoft Visio and Rational Rose to design use case diagrams, class models, sequence diagrams, and activity diagrams for the SDLC process of the application.
  • Worked with other teams to analyze customers and marketing parameters.
  • Conducted Design reviews and Technical reviews with other project stakeholders.
  • Took part in the complete life cycle of the project, from requirements to production support.
  • Created test plan documents for all back-end database modules.
  • Used MS Excel, MS Access, and SQL to write and run various queries.
  • Used a traceability matrix to trace the requirements of the organization.
  • Recommended structural changes and enhancements to systems and databases.
  • Supported the testing team in system testing, integration testing, and UAT.
  • Ensured the quality of the deliverables.

Environment: UNIX, SQL, Oracle 10g, MS Office, MS Visio.

Confidential, Boston, MA

Data Analyst


  • Developed ETL processes for data conversions and construction of data warehouse using IBM InfoSphere DataStage.
  • Used Star Schema and designed Mappings between sources to operational staging targets.
  • Involved in defining the business/transformation rules applied to sales and service data.
  • Define the list codes and code conversions between the source systems and the data mart.
  • Provided On-call Support for the project and gave a knowledge transfer for the clients.
  • Used Rational Application Developer (RAD) for version control.
  • Developed transformations using jobs like Filter, Join, Lookup, Merge, Hashed file, Aggregator, Transformer and Dataset.
  • Worked with internal architects, assisting in the development of current and target state data architectures.
  • Coordinate with business users to provide an appropriate, effective, and efficient way to design new reports based on user needs and the existing functionality.
  • Remain knowledgeable in all areas of business operations to identify systems needs and requirements.
  • Document the complete process flow to describe program development, logic, testing, implementation, application integration, and coding.

Environment: IBM Rational ClearCase & ClearQuest, IBM InfoSphere Metadata Workbench 8.7, IBM InfoSphere DataStage and QualityStage, IBM InfoSphere CDC version 6.5.1, XML files.
