Data Scientist Resume
Houston, TX
SUMMARY:
- Professionally qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Machine Learning, Data Mining, Data Blending, and Statistical Analysis.
- Involved in the entire data science project life cycle, including data extraction, data cleaning, statistical modeling, and data visualization with large sets of structured and unstructured data.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
- Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
- Implemented bagging and boosting to enhance model performance.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing, and ANOVA.
- Extensively worked with Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK, and scikit-learn); see the illustrative sketch at the end of this summary.
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, caret, dplyr), and Excel.
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMS such as SQL Server 2008 and NoSQL databases such as MongoDB 3.2.
- Strong experience in Big Data technologies such as Spark 1.6, Spark SQL, PySpark, Hadoop 2.x, HDFS, and Hive 1.x.
- Experience in visualization tools such as Tableau 9.x/10.x and data blending for creating dashboards.
- Excellent understanding of Agile and Scrum development methodologies.
- Used version control tools such as Git 2.x.
- Proficient in Machine Learning, Data/Text Mining, Statistical Analysis & Predictive Modeling.
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full Software Development Life Cycle (SDLC) using Agile and Scrum methodologies, with strong knowledge of all phases from analysis and design through development, testing, implementation, and maintenance.
- Proficient in Predictive Modeling, Data Mining methods, Factor Analysis, ANOVA, hypothesis testing, normal distributions, and other advanced statistical and econometric techniques.
- Developed predictive models using Decision Trees, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python scikit-learn.
- Experience in using various packages in R and Python, such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, and rpy2.
- Excellent knowledge of Machine Learning, Mathematical Modeling, and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
- Experienced in using Python to manipulate data for loading and extraction, and worked with Python libraries such as Matplotlib, NumPy, SciPy, and Pandas for data analysis.
- 5+ years of experience in requirements gathering, analysis, design, development, implementation, and testing of software applications using Java/J2EE technologies.
- Hands-on experience implementing Model-View-Controller (MVC) architecture using Spring, JDK, Core Java (Collections, OOP concepts), JSP, Servlets, Struts, Hibernate, and JDBC.
- Strong knowledge of the Software Development Life Cycle (SDLC), including Waterfall and Agile development.
- Strong experience in application development using Java/J2EE technologies, including implementing Model-View-Controller (MVC) architecture with Spring, JDK 1.6, Core Java (Collections, OOP concepts), JSP, Servlets, Struts, Hibernate, Web Services, AJAX, JDBC, HTML, and JavaScript.
- Worked with analytical applications such as R, SAS, MATLAB, and SPSS to develop neural network and cluster analysis models.
- Skilled in data parsing, data manipulation, and data preparation, including describing data contents, computing descriptive statistics, regex, split and combine, remap, merge, subset, reindex, melt, and reshape operations.
- Experienced in statistical data analysis such as chi-square and t-tests, dimensionality reduction methods such as PCA and LDA, and feature selection methods.
- Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
- Experienced in Big Data with Hadoop, HDFS, MapReduce, and Spark.
- Experienced in Data Integration Validation and Data Quality controls for ETL process and Data Warehousing using MS Visual Studio SSIS, SSAS, SSRS.
- Proficient in Tableau and R-Shiny data visualization tools to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Strong experience with Oracle/SQL Server programming skills, with experience in working with functions, packages and triggers.
- Excellent knowledge and understanding of data mining techniques such as classification, clustering, regression, and random forests.
- Automated recurring reports using SQL and Python and visualized them on BI platforms such as Tableau.
- Worked in development environments using Git and VMs.
- Excellent communication skills; work successfully in fast-paced, multitasking environments, both independently and in collaborative teams; a self-motivated, enthusiastic learner.
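Illustrative sketch (not tied to a specific project): a minimal Python/scikit-learn workflow of the kind listed above, assuming a pandas DataFrame with numeric features and a binary target column named "label" (the DataFrame and column name are hypothetical).

    # Minimal sketch: logistic regression evaluated with 5-fold cross-validated ROC AUC.
    # Assumes a pandas DataFrame `df` with numeric feature columns and a binary "label" column.
    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def evaluate(df: pd.DataFrame) -> float:
        X = df.drop(columns=["label"])
        y = df["label"]
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        return scores.mean()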
TECHNICAL SKILLS:
Machine Learning: Simple/Multiple Linear Regression, Polynomial Regression, Logistic Regression, Decision Trees, Random Forest, Kernel SVM, K-Nearest Neighbors (K-NN), Classification, Clustering, Association.
Languages: C, C++, Java 8, Python, R
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, seaborn, SciPy, matplotlib, scikit-learn, Beautiful Soup, rpy2, SQLAlchemy.
Web Technologies: JDBC, HTML5, DHTML, XML, CSS3, Web Services, WSDL.
Data Modeling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Big Data Technologies: Apache Hadoop 3.0, MapReduce, Sqoop 1.4, Pig, Hive 2.3, NoSQL, Cassandra 3.11, MongoDB 3.6, Spark 2.2, Scala 2.12, Apache Storm, Elasticsearch, R Programming, and Kafka.
Databases: MS SQL Server, Spark SQL, Netezza 4.0, Sybase ASE, AWS RDS, MS Access, HDFS, HBase, Teradata, MongoDB, Cassandra.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.
OLAP Tools: MS SQL Server 2017, Analysis Manager, DB2 OLAP, Cognos PowerPlay.
ETL Tools: Informatica PowerCenter, SSIS.
Version Control Tools: SVN, GitHub.
Methodologies: Ralph Kimball and Bill Inmon data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP BusinessObjects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.
Operating Systems: Windows, Linux, UNIX, macOS, Red Hat.
PROFESSIONAL EXPERIENCE:
Confidential, Houston, TX.
Data Scientist
Responsibilities:
- Performed data profiling to learn about behavior across various features such as traffic pattern, location, date, and time.
- Applied various machine learning algorithms and statistical models such as decision trees, regression models, neural networks, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction, and used the engine to increase user lifetime by 45% and triple user conversions for target categories.
- Performed extensive data validation and data verification against the data warehouse, and performed debugging of SQL statements and stored procedures for business scenarios.
- Performed data collection, feature creation, model building (Linear Regression, SVM, Logistic Regression, Decision Tree, Random Forest, GBM), evaluation metrics, and model serving using R, scikit-learn, Spark SQL, Spark ML, Flask, Redshift, and AWS S3.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Used the K-means clustering technique to identify outliers and classify unlabeled data.
- Created a recommendation system using k-means clustering, NLP, and Flask to generate vehicle lists for potential users, and worked on an NLP pipeline consisting of TF-IDF and LSI over user reviews (see the illustrative sketch at the end of this section).
- Used NLTK, Stanford NLP, and RAKE for data preprocessing, entity extraction, and keyword extraction.
- Worked on real-time as well as batch data and built a lambda architecture to process the data using Kafka, Spark Streaming, Spark Core, and Spark SQL.
- Evaluated models using cross-validation, log loss, and ROC curves, and used AUC for feature selection.
- Used data modeling concepts such as Star schema/Snowflake modeling, fact and dimension tables, and logical and physical data modeling.
- Analyzed traffic patterns by calculating autocorrelation at different time lags.
- Ensured that the model had a low false positive rate.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high-dimensional data.
- Designed and created reports that use gathered metrics to draw logical conclusions about past and future behavior.
- Performed multinomial logistic regression, random forest, decision tree, and SVM modeling to classify whether a package will be delivered on time for a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Created numerous dashboards in Tableau Desktop based on data collected from zonal and compass, blending data from MS Excel and CSV files with MS SQL Server databases.
- Used MLlib, Spark's Machine learning library to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Developed MapReduce pipeline for feature extraction using Hive.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data, and created various types of data visualizations using Python and Tableau.
- Communicated results to the operations team to support the best decisions.
- Collected data needs and requirements by interacting with other departments.
Environment: Hortonworks Hadoop, MapReduce, PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, Scala NLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.
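Illustrative sketch of the TF-IDF/LSI approach mentioned in the recommendation-system bullet above; the review texts, vehicle identifiers, and component count are placeholders, not project data.

    # Sketch: TF-IDF + LSI (truncated SVD) over user reviews, then cosine similarity
    # to surface the most similar vehicle. All data below is placeholder data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    reviews = ["smooth ride and great mileage",
               "spacious interior but a weak engine",
               "great mileage but a noisy cabin"]
    vehicle_ids = ["veh_a", "veh_b", "veh_c"]

    tfidf = TfidfVectorizer(stop_words="english")
    lsi = TruncatedSVD(n_components=2, random_state=0)   # LSI = SVD on the TF-IDF matrix
    topics = lsi.fit_transform(tfidf.fit_transform(reviews))

    sim = cosine_similarity(topics)                      # vehicle-to-vehicle similarity
    best = max((j for j in range(len(vehicle_ids)) if j != 0), key=lambda j: sim[0, j])
    print("Most similar to", vehicle_ids[0], "is", vehicle_ids[best])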
Confidential, Chicago, Illinois.
Data Scientist
Responsibilities:
- Provided configuration management and build support for more than 5 different applications, building and deploying them to production and lower environments.
- Implemented public segmentation with unsupervised machine learning by applying the k-means algorithm in PySpark (see the illustrative sketch at the end of this section).
- Explored and extracted data from source XML in HDFS, preparing the data for exploratory analysis through data munging.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Worked on different machine learning models such as logistic regression, multilayer perceptron classifiers, and k-means clustering by creating Scala SBT packages and running them in the Spark shell (Scala), and built an autoencoder model using R.
- Led discussions with users to gather business process and data requirements to develop a variety of conceptual, logical, and physical data models. Expert in business intelligence and data visualization tools: Tableau, MicroStrategy.
- Worked on setting up and configuring AWS EMR clusters and used Amazon IAM to grant fine-grained access to AWS resources.
- Created detailed AWS Security Groups, which act as virtual firewalls controlling the traffic allowed to reach one or more EC2 instances.
- Wrote scripts and an indexing strategy for a migration from Postgres 9.2 and MySQL databases to Redshift.
- Responsible for data mapping activities from source systems to Teradata.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Good knowledge of Azure cloud services, Azure Storage, Azure Active Directory, and Azure Service Bus; created and managed Azure AD tenants, configured application integration with Azure AD, and integrated on-premises Windows AD identity with Azure Active Directory.
- Working knowledge of Azure Fabric, microservices, IoT, and Docker containers in Azure; Azure infrastructure management and PaaS solution architecture (Azure AD, licenses, Office 365, cloud DR using Azure Recovery Vault, Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).
- Used R and Python for exploratory data analysis, A/B testing, ANOVA, and hypothesis testing to compare and identify the effectiveness of creative campaigns.
- Created clusters to classify control and test groups and conducted group campaigns.
- Analyzed and calculated the lifetime cost of everyone in the welfare system using 20 years of historical data.
- Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
- Developed triggers, stored procedures, functions, and packages using cursor and ref cursor concepts associated with the project in PL/SQL.
- Created various types of data visualizations using R, Python, and Tableau.
- Used Python, R, and SQL to create statistical models involving multivariate regression, linear regression, logistic regression, PCA, random forests, decision trees, and support vector machines for estimating the risk of welfare dependency.
- Identified and targeted high-risk welfare groups with machine learning algorithms.
- Conducted campaigns and ran real-time trials to quickly determine what works and to track the impact of different initiatives.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create new database designs via an easy-to-use graphical interface.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards.
- Performed analyses such as regression, logistic regression, discriminant analysis, and cluster analysis using SAS programming.
- Used a metadata tool for importing metadata from the repository, creating new job categories, and creating new data elements.
- Scheduled tasks for weekly updates and model runs in the workflow, and automated the entire process flow for generating analyses and reports.
Environment: Python, Azure, ER Studio, Hadoop, MapReduce, EC2, S3, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Netezza, SAS, MATLAB, AWS, Scala NLP, SPSS, Cassandra, Oracle, Amazon Redshift, MongoDB, SQL Server 2012, Teradata, DB2, T-SQL, PL/SQL, Flat Files, and XML.
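Illustrative sketch of the PySpark k-means segmentation referenced above; the input path and feature column names are hypothetical placeholders.

    # Sketch: k-means segmentation in PySpark using an assemble -> scale -> cluster pipeline.
    # The CSV path and feature columns ("age", "income", "visits") are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.appName("segmentation-sketch").getOrCreate()
    df = spark.read.csv("s3://example-bucket/population.csv", header=True, inferSchema=True)

    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["age", "income", "visits"], outputCol="raw_features"),
        StandardScaler(inputCol="raw_features", outputCol="features"),
        KMeans(k=5, seed=42, featuresCol="features"),
    ])
    model = pipeline.fit(df)
    segmented = model.transform(df)   # adds a "prediction" column holding the cluster id
    segmented.groupBy("prediction").count().show()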
Confidential, New York, NY
Data Scientist
Responsibilities:
- Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
- Used Pandas, NumPy, seaborn, SciPy, Matplotlib, scikit-learn, and NLTK in Python to develop various machine learning models, and utilized algorithms such as linear regression, multivariate regression, naive Bayes, random forests, k-means, and KNN for data analysis.
- Conducted studies and rapid plotting, using advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
- Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data life cycle management in both RDBMS and Big Data environments.
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
- Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of database.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked on customer segmentation using an unsupervised learning technique, clustering (see the illustrative sketch at the end of this section).
- Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Developed Linux shell scripts using the NZSQL/NZLOAD utilities to load data from flat files into the Netezza database.
- Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
- Tested Complex ETL Mappings and Sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables to target tables.
Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS.
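Illustrative sketch of the clustering-based customer segmentation mentioned above, with a silhouette check to pick the number of clusters; the DataFrame and its RFM-style columns are hypothetical.

    # Sketch: customer segmentation with scikit-learn KMeans, choosing k by silhouette score.
    # The input DataFrame and its columns ("recency", "frequency", "monetary") are placeholders.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def segment_customers(df: pd.DataFrame, k_values=(3, 4, 5, 6)) -> pd.DataFrame:
        X = StandardScaler().fit_transform(df[["recency", "frequency", "monetary"]])
        best_k, best_score = k_values[0], -1.0
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            score = silhouette_score(X, labels)  # higher means better-separated clusters
            if score > best_score:
                best_k, best_score = k, score
        final = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
        return df.assign(segment=final)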
Confidential, Wilmington, DE
Data Analyst/ Data Modeler
Responsibilities:
- Participated in JAD sessions, gathered information from Business Analysts, end users and other stakeholders to determine the requirements.
- Worked with data warehousing methodologies and dimensional data modeling techniques such as Star/Snowflake schemas using Erwin 9.1.
- Extensively used Aginity Netezza Workbench to perform various DDL and DML operations on the Netezza database.
- Designed the Data Warehouse and MDM hub Conceptual, Logical and Physical data models.
- Performed daily monitoring of Oracle instances using Oracle Enterprise Manager, ADDM, and TOAD, monitoring users, tablespaces, memory structures, rollback segments, logs, and alerts.
- Used ER/Studio Data Modeler for data modeling (data requirements analysis, database design, etc.) of custom-developed information systems, including databases of transactional systems and data marts.
- Involved in Teradata SQL development, unit testing, and performance tuning, and ensured testing issues were resolved using defect reports.
- Customized reports using SAS/MACRO facility, PROC REPORT, PROC TABULATE and PROC.
- Used Normalization methods up to 3NF and De-normalization techniques for effective performance in OLTP and OLAP systems.
- Generated DDL scripts using Forward Engineering technique to create objects and deploy them into the databases.
- Worked on database testing and wrote complex SQL queries to verify transactions and business logic, such as identifying duplicate rows, using SQL Developer and PL/SQL Developer.
- Used Teradata SQL Assistant, Teradata Administrator, PMON, and data load/export utilities such as BTEQ, FastLoad, MultiLoad, FastExport, and TPump in UNIX/Windows environments, and ran batch processes for Teradata.
- Worked on data profiling and data validation to ensure the accuracy of the data between the warehouse and source systems (see the illustrative sketch at the end of this section).
- Worked with data warehouse concepts such as data warehouse architecture, star schema, snowflake schema, data marts, and dimension and fact tables.
- Developed SQL Queries to fetch complex data from different tables in remote databases using joins, database links and Bulk collects.
- Migrated database from legacy systems, SQL server to Oracle and Netezza.
- Used SSIS to create ETL packages to validate, extract, transform, and load data from source servers to a staging database and then to Netezza and DB2 databases.
- Worked with SQL Server components: SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services).
Environment: ER Studio, Teradata 13.1, SQL, PL/SQL, BTEQ, DB2, Oracle, MDM, Netezza, ETL, RTF, UNIX, SQL Server 2010, Informatica, SSRS, SSIS, SSAS, SAS, Aginity.
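Illustrative sketch of a simple source-vs-warehouse reconciliation check of the kind described in the data profiling/validation bullet above; the connection URLs, table name, and amount column are placeholders.

    # Sketch: compare row counts and a column checksum between a source system and the warehouse.
    # Connection strings, the table name, and the amount column are placeholder values.
    import pandas as pd
    from sqlalchemy import create_engine

    def reconcile(source_url: str, warehouse_url: str, table: str, amount_col: str = "amount") -> dict:
        query = f"SELECT COUNT(*) AS n, SUM({amount_col}) AS total FROM {table}"
        src = pd.read_sql(query, create_engine(source_url)).iloc[0]
        dwh = pd.read_sql(query, create_engine(warehouse_url)).iloc[0]
        return {
            "row_count_match": int(src["n"]) == int(dwh["n"]),
            "sum_match": abs(float(src["total"]) - float(dwh["total"])) < 1e-6,
        }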
Confidential
Python Developer
Responsibilities:
- Developed a portal to manage entities in a content management system using Flask.
- Designed the database schema for the content management system.
- Designed email marketing campaigns and created responsive web forms that saved data into a database using the Python/Django framework.
- Worked on Hadoop single-node, Apache Spark, and Hive installations.
- Installation, Configuration, Integration, Tuning, Backup, Crash recovery, Upgrades, Patching, Monitoring System Performance, System and Network Security and Troubleshooting of Linux/Unix Servers.
- Developed views and templates in Django to create a user-friendly website interface.
- Configured Django to manage URLs and application parameters. Handled implementation, testing, integration, and production support of the enterprise web application using Java/J2EE technologies and frameworks.
- Worked with web technologies such as JSP, HTML, AngularJS, CSS, Servlets, JavaScript (JSON), jQuery, MVC frameworks (Struts, Spring MVC, IoC/DI, AOP), and other frameworks (Hibernate, EJB, JUnit).
- Expertise in analysis, design, development, testing, and implementation of Java/J2EE applications using Java, Spring, Hibernate, SOAP (JAX-WS), WSDL, SOA, RESTful Web Services (JAX-RS), the Jersey framework, Servlets, JAXB, JSON, JavaScript, XML, XSD, and SQL, and tools such as Axis 2.0.
- Designed various application/product modules applying design patterns and OOP concepts.
- Supported MapReduce programs running on the cluster.
- Worked with CSV files when taking input from the MySQL database.
- Wrote programs for performance calculations using NumPy and SQLAlchemy.
- Administered and monitored multi Data center Cassandra cluster based on the understanding of the Cassandra Architecture.
- Extensively worked with Informatica in designing and developing ETL processes to load data from XML sources to the target database.
- Configured Ansible to manage AWS environments and automate the build process for core AMIs used by all application deployments, including Auto Scaling and CloudFormation scripts.
- Designed and automated the installation and configuration of secure DataStax Enterprise Cassandra using Chef.
- Wrote Python scripts to parse XML documents and load the data into the database (see the illustrative sketch at the end of this section).
- Worked in stages such as analysis and design, development, testing and debugging.
- Built more user-interactive web pages using jQuery plugins for drag-and-drop and auto-complete, JSON, AngularJS, and JavaScript.
Environment: Python 2.7, Windows, MySQL, ETL, Ansible, Flask, Python libraries such as NumPy and SQLAlchemy, AngularJS, MySQLdb, Java.
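Illustrative sketch of the XML-parsing-and-load script mentioned above; the XML layout, file names, and table schema are placeholders, and sqlite3 stands in for the real target database.

    # Sketch: parse an XML document with ElementTree and bulk-insert the records into a table.
    # The XML structure (<record id="..."><name/><value/></record>) is an assumed placeholder.
    import sqlite3
    import xml.etree.ElementTree as ET

    def load_xml(xml_path: str = "records.xml", db_path: str = "records.db") -> None:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT, value REAL)")
        root = ET.parse(xml_path).getroot()
        rows = [(r.get("id"), r.findtext("name"), float(r.findtext("value", "0")))
                for r in root.iter("record")]
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()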
Confidential
Programmer Analyst
Responsibilities:
- Effectively communicated with the stakeholders to gather requirements for different projects
- Used the MySQLdb package and the Python MySQL connector to write and execute MySQL database queries from Python (see the illustrative sketch at the end of this section).
- Created functions, triggers, views, and stored procedures using MySQL.
- Worked closely with back-end developer to find ways to push the limits of existing Web technology.
- Involved in the code review meetings.
- Experienced Java/J2EE professional with an extensive background in the software development and testing life cycle.
- Designed and developed enterprise-level multi-tier and single-page web applications.
- Experience in developing web applications using Java/J2EE, JSP, Servlets, EJB, JDBC, Spring and XML.
- Thorough knowledge of web technologies: XML, HTML, CSS, and JavaScript.
Environment: Java, Python, MySQL
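Illustrative sketch of running a parameterized MySQL query from Python with mysql-connector, as in the bullet above; connection details and the table are placeholders.

    # Sketch: execute a parameterized query against MySQL using mysql-connector-python.
    # Host, credentials, database, and table are placeholder values.
    import mysql.connector

    def fetch_orders(customer_id: int):
        conn = mysql.connector.connect(host="localhost", user="app_user",
                                       password="app_password", database="shop")
        try:
            cursor = conn.cursor(dictionary=True)
            cursor.execute("SELECT id, total FROM orders WHERE customer_id = %s", (customer_id,))
            return cursor.fetchall()
        finally:
            conn.close()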