Data Scientist/Data Modeler Resume
St Louis, MO
SUMMARY
- Professionally qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Data Mining, Deep Learning/Machine Learning and Statistical Analysis.
- Involved in all phases of the data science project life cycle, including data extraction, data cleaning and data visualization with large sets of structured and unstructured data; created ER diagrams and schemas.
- Experienced with machine learning algorithms such as logistic regression, KNN, SVM, random forest, neural networks, linear regression, lasso regression and k-means.
- Implemented Bagging and Boosting to enhance the model performance.
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, dplyr, caret) and Excel.
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMSs such as SQL Server 2008/2010/2012 and NoSQL databases such as MongoDB 3.2.
- Excellent understanding of Agile and Scrum development methodologies.
- Used version control tools such as Git 2.x and build tools such as Apache Maven and Ant.
- Passionate about gleaning insightful information from massive data assets and developing a culture of sound, data-driven decision making
- Ability to maintain a fun, casual, professional and productive team atmosphere
- Experienced in the full software development life cycle (SDLC) under Agile, DevOps and Scrum methodologies, including creating requirements and test plans.
- Skilled in Advanced Regression Modeling, Correlation, Multivariate Analysis, Model Building, Business Intelligence tools and application of Statistical Concepts.
- Developed predictive models using Decision Tree, Naive Bayes, Logistic Regression, Random Forest, Social Network Analysis, Cluster Analysis, and Neural Networks.
- Experienced in Machine Learning and Statistical Analysis with Python Scikit-Learn.
- Experienced in using Python to manipulate data for loading and extraction; worked with Python libraries such as Matplotlib, SciPy, NumPy and pandas for data analysis.
- Worked with applications such as R, R Shiny, SAS, Plotly, ArcGIS, MATLAB and SPSS to develop neural networks and cluster analyses.
- Strong SQL programming skills, with experience in working with functions, packages and triggers.
- Expertise in transforming business requirements into algorithms and analytical models, and in developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Skilled in data parsing, data manipulation, data architecture, data ingestion and data preparation, with methods including describing data contents, computing descriptive statistics, regex, split and combine, merge, remap, subset, reindex, melt and reshape.
- Worked with NoSQL databases including HBase, Cassandra and MongoDB.
- Experienced in Big Data with Hadoop, MapReduce, HDFS and Spark.
- Experienced in Data Integration Validation and Data Quality controls for ETL process and Data Warehousing using MS Visual Studio, SSAS, SSIS and SSRS.
- Proficient in Tableau and R-Shiny data visualization tools to analyze and obtain insights into large datasets, create visually powerful and actionable interactive reports and dashboards.
- Automated recurring reports using SQL and Python and visualized them on BI platforms such as Tableau.
- Worked in development environments using Git and virtual machines.
- Excellent communication skills. Self-motivated, enthusiastic learner who works successfully in fast-paced, multitasking environments, both independently and in collaborative teams.
TECHNICAL SKILLS
Big Data/Hadoop Technologies: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Flume, Spark, Kafka, Storm, Drill, ZooKeeper and Oozie
Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/RStudio (caret, Weka, ggplot), SAS, SAS Enterprise Guide, Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, pandas, Gensim, Keras), JavaScript, Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Business Intelligence Tools: Tableau Server, Tableau Reader, Tableau, Splunk, SAP Business Objects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse
Development Tools: Microsoft SQL Studio, IntelliJ, Eclipse, NetBeans
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Build Tools: Jenkins, Toad, SQL Loader, Maven, Ant, RTC, RSA, Control-M, Oozie, Hue, SoapUI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: All versions of Windows, UNIX, Linux, Mac OS, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, St. Louis, MO
Data Scientist/Data Modeler
Responsibilities:
- Extracted data from HDFS and prepared it for exploratory analysis using data munging.
- Built models using statistical techniques such as Bayesian HMM and machine learning classification models such as XGBoost, SVM and Random Forest.
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation and visualization and performed Gap analysis.
- Completed a highly immersive data science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, MongoDB and Hadoop.
- Set up storage and data analysis tools in AWS cloud computing infrastructure.
- Installed and used the Caffe deep learning framework.
- Worked on different data formats such as JSON and XML and applied machine learning algorithms in Python.
- Worked with Data Architects and IT Architects to understand the movement of data and its storage, using ER Studio 9.7.
- Used pandas, numpy, seaborn, matplotlib, scikit-learn, Scipy, NLTK in Python for developing various machine learning algorithms.
- Performed data manipulation and aggregation from different sources using Nexus, Business Objects, Toad, Power BI and Smart View.
- Implemented Agile methodology for building an internal application.
- Focused on integration overlap and Informatica's newer commitment to MDM following its acquisition of Identity Systems.
- Coded proprietary packages to analyze and visualize SPC file data to identify bad spectra and samples to reduce unnecessary procedures and costs.
- Programmed a utility in Python that used multiple packages (numpy, scipy, pandas)
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, Naive Bayes, KNN.
- As architect, delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
- Updated Python scripts to match training data with the database stored in AWS CloudSearch, so that each document could be assigned a response label for further classification.
- Used Teradata utilities such as FastExport and MultiLoad (MLOAD) for data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
- Validated the machine learning classifiers using ROC Curves and Lift Charts.
Environment: Unix, Python 3.5.2, MLlib, SAS, regression, logistic regression, Hadoop 2.7.4, NoSQL, Teradata, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML and MapReduce.
Confidential, Washington
Data Scientist
Responsibilities:
- Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS and Python, along with a broad variety of machine learning methods including classification, regression and dimensionality reduction.
- Utilized the engine to increase user lifetime by 45% and triple user conversations for target categories.
- Applied various machine learning algorithms and statistical modeling techniques, including decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM and clustering, to identify volume using the scikit-learn package in Python and MATLAB.
- Analyzed data from Google Analytics, AdWords, Facebook and similar sources.
- Evaluated models using cross-validation, log loss and ROC curves; used AUC for feature selection; and worked with Elastic technologies such as Elasticsearch and Kibana.
- Performed Data Profiling to learn about behavior with various features such as traffic pattern, location, Date and Time etc.
- Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
- Applied multinomial logistic regression, decision trees, random forest and SVM to classify whether a package would be delivered on time on a new route.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Performed data cleaning, feature scaling and feature engineering using the pandas and NumPy packages in Python.
- Explored DAGs, their dependencies and logs using Airflow pipelines for automation.
- Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, R and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Used clustering technique K-Means to identify outliers and to classify unlabeled data.
- Tracked operations using Airflow sensors until certain criteria were met.
- Responsible for data mapping activities from source systems to Teradata using utilities such as TPump, FEXP, BTEQ, MLOAD and FLOAD.
- Analyzed traffic patterns by calculating autocorrelation at different time lags.
- Ensured that the model had a low false positive rate, and performed text classification and sentiment analysis on unstructured and semi-structured data.
- Addressed overfitting by implementing regularization methods such as L1 and L2.
- Used Principal Component Analysis in feature engineering to analyze high dimensional data.
- Used MLlib, Spark's Machine learning library to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Created and designed reports that will use gathered metrics to infer and draw logical conclusions of past and future behavior.
- Developed a MapReduce pipeline for feature extraction using Hive and Pig.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated the results to the operations team to support decision making.
- Collected data needs and requirements by interacting with other departments.
Environment: Python 2.x, CDH5, HDFS, Hadoop 2.3, Hive, Impala, AWS, Linux, Spark, Tableau Desktop, SQL Server 2014, Microsoft Excel, MATLAB, Spark SQL, PySpark.
Confidential
Data Analyst
Responsibilities:
- Worked with the BI team to gather report requirements and used Sqoop to load data into HDFS and Hive.
- Involved in the analytics phases described below, using R, Python and Jupyter Notebook.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Assisted with data capacity planning and node forecasting.
- Installed, Configured and managed Flume Infrastructure.
- Administered Pig, Hive and HBase, installing updates, patches and upgrades.
- Worked closely with the claims processing team to obtain patterns in filing of fraudulent claims.
- Performed a major cluster upgrade from CDH3u6 to CDH4.4.0.
- Developed MapReduce programs to extract and transform the data sets and exported the results back to RDBMS using Sqoop.
- Identified patterns in fraudulent claims using text mining in R and Hive.
- Exported the required information to RDBMS using Sqoop to make the data available to the claims processing team for processing claims.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
- Created tables in Hive and loaded the structured data produced by MapReduce jobs.
- Developed many HiveQL queries to extract the required information.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Imported data (mostly log files) from various sources into HDFS using Flume.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data.
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
- Managed and reviewed Hadoop log files.
- Tested raw data and executed performance scripts.
Environment: HDFS, Pig, Hive, MapReduce, Linux, HBase, Flume, Sqoop, R, VMware, Eclipse, Cloudera, Python.
Confidential - Atlanta, GA
Data Modeler/Data Analyst
Responsibilities:
- Created and maintained Logical and Physical models for the data mart. Created partitions and indexes for the tables in the data mart.
- Performed data profiling and analysis, applied various data cleansing rules, designed data standards and architecture, and designed the relational models.
- Maintained metadata (data definitions of table structures) and version controlling for the data model.
- Developed SQL scripts for creating tables, sequences, triggers, views and materialized views.
- Worked on query optimization and performance tuning using SQL Profiler and performance monitoring.
- Developed mappings to load Fact and Dimension tables, SCD Type 1 and SCD Type 2 dimensions and Incremental loading and unit tested the mappings.
- Utilized Erwin's forward / reverse engineering tools and target database schema conversion process.
- Created an enterprise-wide data model (EDM) for products and services in the Teradata environment based on data from the PDM. Conceived, designed, developed and implemented this model from scratch.
- Built and published customized interactive reports and dashboards and scheduled reports using Tableau Server.
- Wrote SQL scripts to test the mappings and developed a Traceability Matrix mapping Business Requirements to Test Scripts, so that any change control in requirements leads to a test case update.
- Responsible for development and testing of conversion programs for importing data from text files into the Oracle database using Perl shell scripts and SQL*Loader.
- Involved in extensive data validation by writing several complex SQL queries; involved in back-end testing and worked on data quality issues.
- Developed and executed load scripts using Teradata client utilities MULTILOAD, FASTLOAD and BTEQ.
- Exported and imported data between different platforms such as SAS and MS Excel.
- Generated periodic reports based on the statistical analysis of the data using SQL Server Reporting Services (SSRS).
- Worked with the ETL team to document the Transformation Rules for Data Migration from OLTP to the Warehouse environment for reporting purposes.
- Created SQL scripts to find data quality issues and to identify keys, data anomalies, and data validation issues.
- Formatted the data sets read into SAS using the FORMAT statement in the DATA step as well as PROC FORMAT.
- Applied Business Objects best practices during development with a strong focus on reusability and better performance.
- Developed Tableau visualizations and dashboards using Tableau Desktop.
- Used graphical entity-relationship diagramming to create new database designs via an easy-to-use graphical interface.
- Designed different types of star schemas for detailed data marts and plan data marts in the OLAP environment.
Environment: Erwin, MS SQL Server 2008, DB2, Oracle SQL Developer, PL/SQL, Business Objects, MS Office suite, Windows XP, TOAD, SQL*Plus, SQL*Loader, Teradata, Netezza, SAS, Tableau, SSRS, SQL Assistant, Informatica, XML.
Confidential
Python Developer
Responsibilities:
- Involved in the design, development and testing phases of the application using Agile methodology.
- Designed and maintained databases using Python and developed a Python-based API (RESTful web service) using Flask, SQLAlchemy and PostgreSQL.
- Designed and developed the UI of the website using HTML, XHTML, AJAX, CSS and JavaScript.
- Participated in requirement gathering and worked closely with the architect in designing and modeling.
- Worked on RESTful web services that enforced a stateless client-server model and supported JSON, including a few changes from SOAP to RESTful technology. Involved in detailed analysis based on the requirement documents.
- Involved in writing SQL queries implementing functions, triggers, cursors, object types, sequences, indexes etc.
- Created and managed all hosted and local repositories through the SourceTree Git client, and collaborated using the Git command line and Stash.
- Responsible for setting up the Python REST API framework using Django and the Spring framework.
- Developed consumer-based features and applications using Python, Django and HTML, following behavior-driven development (BDD) and pair programming.
- Designed and developed components using Python with the Django framework. Implemented code in Python to retrieve and manipulate data.
- Involved in development of the enterprise social network application using Python, Twisted, and Cassandra.
- Used Python and Django for creating graphics, XML processing of documents, data exchange and business logic implementation between servers.
- Worked closely with the back-end developer to find ways to push the limits of existing web technology.
- Designed and developed the UI for the website with HTML, XHTML, CSS, JavaScript and AJAX.
- Used AJAX and JSON communication for accessing RESTful web services data payloads.
- Designed dynamic client-side JavaScript code to build web forms and performed simulations for web application pages.
- Created and implemented SQL Queries, Stored procedures, Functions, Packages and Triggers in SQL Server.
- Successfully implemented Auto Complete/Auto Suggest functionality using jQuery, Ajax, Web Service and JSON.
Environment: Python, Java/J2EE, Django, HTML, CSS, Linux, Shell Scripting, JavaScript, Ajax, jQuery, JSON, XML, PostgreSQL, Jenkins, Ant, Maven, Subversion
Aspect Software
Data Analyst/Data Modeler
Responsibilities:
- Analyzed data sources and requirements and business rules to perform logical and physical data modeling.
- Analyzed and designed best fit logical and physical data models and relational database definitions using DB2. Generated reports of data definitions.
- Involved in Normalization/De-normalization, Normal Form and database design methodology.
- Maintained existing ETL procedures, fixed bugs and restored software to production environment.
- Developed the code as per the client's requirements using SQL, PL/SQL and Data Warehousing concepts.
- Involved in Dimensional modeling (Star Schema) of the Data warehouse and used Erwin to design the business process, dimensions and measured facts.
- Worked with Data Warehouse Extract and load developers to design mappings for Data Capture, Staging, Cleansing, Loading, and Auditing.
- Developed enterprise data model management process to manage multiple data models developed by different groups
- Designed and created Data Marts as part of a data warehouse.
- Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects XIR2.
- Used the Erwin modeling tool to publish a data dictionary, review the model and dictionary with subject matter experts, and generate data definition language (DDL).
- Coordinated with the DBA in implementing database changes and updating data models with changes implemented in development, QA and production. Worked extensively with the DBA and reporting team to improve report performance through appropriate indexes and partitioning.
- Developed Data Mapping, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
- Tuned and optimized code using techniques such as dynamic SQL, dynamic cursors, SQL query tuning, and writing generic procedures, functions and packages.
- Experienced in GUI, Relational Database Management Systems (RDBMS), design of OLAP system environments, and report development.
- Extensively used SQL, T-SQL and PL/SQL to write stored procedures, functions, packages and triggers.
Environment: ER Studio, Informatica PowerCenter 8.1/9.1, PowerConnect/PowerExchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, XML, Windows NT 4.0, Tableau, Workday, SPSS, SAS, Business Objects, Unix Shell Scripting, Teradata, Netezza, Aginity
