
Data Scientist Resume


Denver, Colorado

PROFESSIONAL SUMMARY:

  • Around 6 years of experience in Machine Learning and Data Mining with large datasets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
  • Experience in coding SQL/PL SQL using Procedures, Triggers, and Packages.
  • Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems and generating data visualizations using R and Python.
  • Data-driven and highly analytical, with working knowledge of statistical modeling approaches and methodologies (Clustering, Regression analysis, Hypothesis testing, Decision trees, Machine learning).
  • Professional working experience in Machine Learning algorithms such as Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, K - Means Clustering and Association Rules.
  • Well experienced in Normalization, De-Normalization and Standardization techniques for optimal performance in relational and dimensional database environments.
  • Worked on Text Mining and Sentiment Analysis for extracting unstructured data from various social media platforms like Facebook, Twitter, and Reddit (a minimal illustrative sketch follows this summary).
  • Good Knowledge of NoSQL databases like Mongo DB and HBase.
  • Extensive hands-on experience and high proficiency with structured, semi-structured and unstructured data, using a broad range of data science programming languages and big data tools including R, Python, Spark, SQL, scikit-learn, Hadoop MapReduce.
  • Technical proficiency in designing and data modeling online applications; Solution Lead for architecting Data Warehouse/Business Intelligence applications.
  • Experienced in Cluster Analysis, Principal Component Analysis (PCA), Association Rules, and Recommender Systems.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • Hands-on experience with R Studio for data pre-processing and building machine learning algorithms on different datasets.
  • Collaborated with the lead Data Architect to model the Data warehouse in accordance with FSLDM subject areas, 3NF format, and Snow-flake schema.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of data.
  • Experience with data visualization using tools like ggplot2, Matplotlib, Seaborn, and Tableau, and with using Tableau to publish and present dashboards and storylines on web and desktop platforms.
  • Experienced in python data manipulation for loading and extraction as well as with python libraries such as NumPy, SciPy and Pandas for data analysis and numerical computations.
  • Worked and extracted data from various database sources like Oracle, SQL Server, and DB2.
  • Predictive Modelling Algorithms: Logistic Regression, Linear Regression, Decision Trees, K-Nearest Neighbors, Bootstrap Aggregation, Naive Bayes Classifier, Random Forests, Boosting, SVM.
  • Flexible with Unix/Linux and Windows environments, working with operating systems like CentOS 5/6, Ubuntu 13/14, Cosmos.
  • Excellent oral and written communication skills. Ability to explain complex technical information to technical and non-technical contacts.
  • Excellent interpersonal skills. Ability to effectively build relationships, promote a collaborative and team environment, and influence others.
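
To illustrate the kind of sentiment scoring referenced above, the following is a minimal, hypothetical sketch using NLTK's VADER analyzer; the sample posts and the compound-score thresholds are illustrative assumptions, not project data.

```python
# Minimal sentiment-analysis sketch (illustrative only; sample texts are made up).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download

posts = [
    "Loving the new release, setup took five minutes!",
    "Support never answered my ticket. Very disappointed.",
    "It works, I guess.",
]

sia = SentimentIntensityAnalyzer()
for text in posts:
    scores = sia.polarity_scores(text)  # neg/neu/pos/compound scores
    label = ("positive" if scores["compound"] >= 0.05
             else "negative" if scores["compound"] <= -0.05
             else "neutral")
    print(f"{label:8s} {scores['compound']:+.3f}  {text}")
```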

TECHNICAL SKILLS:

Big Data/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, Zookeeper and Oozie.

Languages: HTML5, CSS3, XML, C, C++, DHTML, WSDL, R/R Studio, SAS Enterprise Guide, SAS, R, Perl, MATLAB, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), SQL, PL/SQL, HiveQL, JavaScript, Shell Scripting.

Java & J2EE Technologies: Core Java, JSP, Servlets, JDBC, JAAS, JNDI, Hibernate, Spring, Struts, JMS, EJB, RESTful

Application Servers: WebLogic, WebSphere, JBoss, Tomcat.

Databases: Microsoft SQL Server, MySQL, Oracle, DB2, Teradata, Netezza

NoSQL Databases: HBase, Cassandra, MongoDB, MariaDB

Build Tools: Jenkins, Maven, ANT, Toad, SQL Loader, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Business Intelligence Tools: Tableau, Tableau Server, Tableau Reader, Splunk, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse

Development and Cloud Computing Tools: Microsoft SQL Studio, Eclipse, NetBeans, IntelliJ, Amazon AWS, Azure

Development Methodologies: Agile/Scrum, Waterfall, UML, Design Patterns

Version Control Tools and Testing: Git, GitHub, SVN, and JUnit

ETL Tools: Informatica PowerCenter, SSIS

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Data Modeling Tools: Erwin, Rational Rose, ER/Studio, MS Visio, Oracle Designer, SAP PowerDesigner, Enterprise Architect.

Operating Systems: UNIX (all versions), Linux, Windows, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE:

Confidential, Denver, Colorado

Data Scientist

Responsibilities:

  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark and MLlib (see the illustrative sketch after these responsibilities).
  • Set up storage and data analysis tools in the Amazon Web Services cloud computing infrastructure.
  • Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python for developing various machine learning algorithms.
  • Worked on different data formats such as JSON, XML and performed machine learning algorithms.
  • Worked with Data Architects and IT Architects to understand the movement of data and its storage.
  • Participated in all phases of data mining: data collection, data cleaning, developing models, validation, and visualization, and performed gap analysis.
  • Performed data manipulation and aggregation from different sources using Nexus, Toad, BusinessObjects, Power BI, and Smart View.
  • Implemented Agile Methodology for building an internal application.
  • Good knowledge of Hadoop architecture and various components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
  • As Architect delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
  • Programmed a utility in Python that used multiple packages (SciPy, NumPy, Pandas).
  • Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN.
  • Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
  • Updated Python scripts to match data with our database stored in AWS CloudSearch, so that we could assign each document a response label for further classification.
  • Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
  • Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Interacted with Business Analysts, SMEs, and other Data Architects to understand business needs and functionality for various project solutions.
  • Researched, evaluated, architected, and deployed new tools, frameworks, and patterns to build sustainable Big Data platforms for the clients.
  • Identified and executed process improvements; hands-on in various technologies such as Oracle, Informatica, and BusinessObjects.
  • Designed both 3NF data models for ODS, OLTP systems and dimensional data models using Star and Snowflake Schemas.
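
As referenced above, a minimal sketch of a Spark MLlib classification use case follows; the input path, the feature column names (f1..f3), and the binary label column are assumed placeholders rather than actual project details.

```python
# Hedged sketch: binary classification with Spark MLlib (paths and column names are assumed).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mllib-classification-sketch").getOrCreate()

# Assumed input: a table with numeric feature columns f1..f3 and a 0/1 "label" column.
df = spark.read.parquet("hdfs:///data/example_features.parquet")  # hypothetical path

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")
```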

Environment: R, ODS, OLTP, Big Data, Oracle 12c, Hive, OLAP, DB2, Metadata, Python, MS Excel, Mainframes, MS Visio, Rational Rose, Teradata, SPSS, T-SQL, PL/SQL, Flat Files, XML, and Tableau.

Confidential, Norfolk, Virginia

Data Scientist

Responsibilities:

  • Performed data profiling to learn about behavior with various features such as traffic pattern, location, date, and time.
  • Applied various machine learning algorithms and statistical modeling techniques like decision trees, regression models, neural networks, SVM, and clustering to identify volume using the scikit-learn package.
  • Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction; utilized the engine to increase user lifetime by 45% and triple user conversations for target categories.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources; used the K-Means clustering technique to identify outliers and to classify unlabeled data.
  • Analyzed traffic patterns by calculating autocorrelation with different time lags.
  • Developed entire frontend and backend modules using Python on Django Web Framework.
  • Implemented the presentation layer with HTML, CSS, and JavaScript.
  • Involved in writing stored procedures using Oracle.
  • Addressed overfitting by implementing regularization methods such as L2 and L1 (a brief illustrative sketch follows this list).
  • Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
  • Identified and targeted welfare high-risk groups with Machine learning algorithms.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Created clusters to classify Control and test groups and conducted group campaigns.
  • Developed Linux shell scripts using NZSQL/NZLOAD utilities to load data from flat files to Netezza.
  • Developed triggers, stored procedures, functions, and packages using cursors and ref cursor concepts associated with the project using PL/SQL.
  • Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for the new route.
  • Performed data analysis by using Hive to retrieve the data from the Hadoop cluster and SQL to retrieve data.
  • Used MLlib, Spark's Machine learning library to build and evaluate different models.
  • Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
  • Performed data cleaning, feature scaling, and feature engineering using the Pandas and NumPy packages.
  • Developed a MapReduce pipeline for feature extraction using Hive.
  • Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau .
  • Communicated the results to the operations team to support better decision making.
  • Collected data needs and requirements by interacting with other departments.
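
A brief illustrative sketch of addressing overfitting with L1/L2 regularization and evaluating with ROC AUC in scikit-learn, as mentioned above; it uses synthetic data rather than any project dataset, and the regularization strength C=1.0 is an arbitrary example value.

```python
# Hedged sketch: L1/L2-regularized logistic regression with ROC AUC (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for penalty, solver in [("l2", "lbfgs"), ("l1", "liblinear")]:
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty=penalty, C=1.0, solver=solver))
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{penalty} regularization: test AUC = {auc:.3f}")
```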

Environment: Python, CDH5, HDFS, Hadoop, Hive, Impala, Linux, Spark, Tableau Desktop, SQL Server 2014, Microsoft Excel, MATLAB, Spark SQL, PySpark.

Confidential

Data Analyst

Responsibilities:

  • Developed applications of machine learning, statistical analysis, and data visualization for challenging data processing problems in the sustainability and biomedical domains.
  • Worked on Natural Language Processing with the NLTK module of Python to develop an application for automated customer response.
  • Used predictive modeling with tools in SAS, SPSS, R, Python.
  • Responsible for design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
  • Identifying and executing process improvements, hands-on in various technologies such as Oracle, Informatica, and BusinessObjects.
  • Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
  • Interaction with Business Analyst, SMEs and other Data Architects to understand Business needs and functionality for various project solutions.
  • Created SQL tables with referential integrity and developed queries using SQL, SQL*Plus, and PL/SQL.
  • Involved with data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
  • Created PL/SQL packages and Database Triggers and developed user procedures and prepared user manuals for the new programs.
  • Prepared the ETL architecture and design document covering the ETL architecture, SSIS design, and the extraction, transformation, and loading of Duck Creek data into the dimensional model.
  • Applied linear regression, multiple regression, ordinary least squares, mean-variance analysis, the law of large numbers, logistic regression, dummy variables, residuals, the Poisson distribution, Bayes, Naive Bayes, fitting functions, etc. to data with the help of scikit-learn, SciPy, NumPy, and Pandas.
  • Applied clustering algorithms, i.e. Hierarchical and K-Means, with the help of scikit-learn and SciPy (a minimal sketch follows this list).
  • Developed visualizations and dashboards using ggplot and Tableau.
  • Worked on development of data warehouse, Data Lake, and ETL systems using relational and non-relational tools like SQL and NoSQL.
  • Built and analyzed datasets using R, SAS, MATLAB and Python (in decreasing order of usage).
  • Applied linear regression in Python and SAS to understand the relationship between different attributes of the dataset and the causal relationships between them.
  • Expertise in Business Intelligence and data visualization using R and Tableau.
  • Validated the Macro-Economic data and predictive analysis of world markets using key indicators in Python and machine learning concepts like regression, Bootstrap Aggregation and Random Forest.
  • Worked in large scale database environment like Hadoop and MapReduce, with working mechanism of Hadoop clusters, nodes and Hadoop Distributed File System (HDFS).
  • Interfaced with large scale database system through an ETL server for data extraction.
  • Identified patterns, data quality issues, and opportunities and leveraged insights by communicating opportunities with business partners.
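
A minimal sketch of the K-Means and hierarchical clustering approach mentioned above, using scikit-learn and SciPy; the synthetic blob data and the choice of three clusters are illustrative assumptions.

```python
# Hedged sketch: K-Means (scikit-learn) and hierarchical clustering (SciPy) on synthetic data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: assign each observation to one of k centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-Means cluster sizes:", np.bincount(kmeans.labels_))

# Agglomerative (hierarchical) clustering via SciPy's linkage matrix and a tree cut.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print("Hierarchical cluster sizes:", np.bincount(labels)[1:])  # labels start at 1
```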

Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, HDFS, Hive, Pig, Linux, Python (scikit-learn/SciPy/NumPy/Pandas), R, SAS, SPSS, MySQL, Eclipse, PL/SQL, SQL connector, Tableau.

Confidential

Software Engineer

Responsibilities:

  • Identified the edge node(s), an IoT gateway or cloud aggregator, and a back-end data analytics engine operating on the aggregated data for trend analysis, anomaly detection, etc.
  • Explored and classified all the IoT devices of the healthcare infrastructure and performed data profiling through Informatica to gain a better understanding of them.
  • Used Kafka interfacing software with the Apache Hadoop framework to get the data from all the IoT devices of the healthcare network into the Apache Spark system.
  • Processed the stream data in Apache Spark Streaming by breaking the stream into micro-batches that are then processed by the Spark core, resulting in lower latency (a minimal Kafka-to-Spark sketch follows this list).
  • Stored the processed data in HDFS and generated reports through Spark SQL queries.
  • Performed data analytics on the output data obtained from the Spark core by using Spark MLlib, Apache Spark's machine learning library.
  • Performed descriptive analysis on the data, such as correlations and scatter plots, to understand the current performance of the healthcare IoT devices and to improve efficiency and optimize their usage.
  • Partitioned the data set into training, testing, and validation sets for use in the supervised learning processes.
  • Performed predictive analysis through popular machine learning algorithms like linear regression, logistic regression, and artificial neural networks.
  • Visualized the model performance through the ROC curve (Receiver Operating Characteristic curve) by plotting sensitivity against 1 - specificity at different thresholds.
  • Measured the predictive ability of the classifier by the Area Under the Curve (AUC); an AUC greater than 0.75 was the model acceptance criterion.
  • Performed data reporting, created dashboards, and shared them with all the major stakeholders.
  • Performed statistical analysis to improve system processes and deliver better quality healthcare; performed process modeling with the Bizagi BPMN Modeler to remove inefficient tasks and processes from the systems.
  • Documented all the analytical results and findings from Apache Spark using Apache Zeppelin.
  • Ensured the proper process was followed to demonstrate to the monitoring government entity that the data provided to them had gone through a stringent data governance process.
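
A minimal Kafka-to-Spark sketch of the streaming ingestion described above, written with the Structured Streaming API; the broker address, topic name, and HDFS paths are hypothetical placeholders, and it assumes the spark-sql-kafka connector is available on the classpath.

```python
# Hedged sketch: ingesting IoT device messages from Kafka into Spark (Structured Streaming).
# Broker, topic, and output/checkpoint paths below are assumed placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iot-kafka-ingest-sketch").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # assumed broker
       .option("subscribe", "iot-device-events")           # assumed topic
       .load())

# Kafka delivers key/value as binary; cast the payload to a string for downstream parsing.
events = raw.select(col("value").cast("string").alias("payload"), col("timestamp"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/iot/events")        # assumed landing zone
         .option("checkpointLocation", "hdfs:///chk/iot")   # required for fault tolerance
         .start())
query.awaitTermination()
```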

Environment: Apache Hadoop, HBase, Apache Spark, Kafka, Apache Zeppelin, Informatica, Bizagi BPMN Modeler, Spark MLlib, Tableau, R, SAS/STAT, Spark SQL, Predictive Analysis, Machine Learning, MS Office suite
