Data Scientist Resume
Irvine, CA
PROFESSIONAL SUMMARY:
- Qualified Data Scientist/Data Analyst with 8+ years of experience in Data Science and Analytics, including Data Mining and Statistical Analysis.
- Involved in the entire data science project life cycle, including data cleaning, data extraction, and data visualization with large data sets of structured and unstructured data; created ER diagrams and schemas.
- Used version control tools like Git 2.x and build tools like Apache Maven/Ant.
- Experienced in the full software development lifecycle (SDLC) with Agile, DevOps, and Scrum methodologies, including creating requirements and test plans.
- Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
- Implemented Bagging and Boosting to enhance the model performance.
- Strong skills in statistical methodologies such as A/B testing, experiment design, hypothesis testing, and ANOVA (a minimal A/B test sketch follows this summary).
- Extensively worked on Python 3.5/2.7 (NumPy, Pandas, Matplotlib, NLTK, and Scikit-learn)
- Experience in implementing data analysis with various analytic tools, such as Anaconda 4.0, Jupyter Notebook 4.x, R 3.0 (ggplot2, Caret, dplyr), and Excel
- Solid ability to write and optimize diverse SQL queries; working knowledge of RDBMSs like SQL Server 2008 and NoSQL databases like MongoDB 3.2
- Strong experience in Big Data technologies like Spark 1.6, Spark SQL, PySpark, Hadoop 2.X, HDFS, Hive 1.X.
- Experience in visualization tools like Tableau 9.x/10.x for creating dashboards
- Expertise and vast knowledge of Enterprise Data Warehousing including Data Modeling, Data Architecture, Data Integration (ETL/ELT) and Business Intelligence.
- Skilled in implementing SQL tuning techniques such as Join Indexes (JI), Aggregate Join Indexes (AJI), statistics collection, and table changes including indexes.
- Experienced in using various Teradata utilities like Teradata Parallel Transporter (TPT), MultiLoad, BTEQ, FastExport, and FastLoad.
- Extensive experience in the design and development of ETL methodology for supporting data transformations and processing in a corporate-wide environment using Teradata, Mainframes, and UNIX Shell Scripting.
- Experienced in dimensional and relational data modeling using ER/Studio, Erwin, and Sybase PowerDesigner, including Star/Snowflake schema modeling, fact and dimension tables, and conceptual, logical, and physical data modeling.
- Good experience in Production Support, identifying root causes, Troubleshooting and Submitting Change Controls.
- Experienced in handling all the domain and technical interaction with application users, analyzing client business processes, documenting business requirements.
- Possess strong analytical and problem-solving skills and have a quick learning curve. Committed team player and capable of working on tight project delivery schedules and deadlines.
- Experienced in writing Design Documents, System Administration Documents, Test Plans & Test Scenarios/Test Cases and documentation of test results.
- Extensive experience in development of T-SQL, OLAP, PL/SQL, Stored Procedures, Triggers, Functions, Packages, performance tuning and optimization for business logic implementation.
- Proficient in handling complex processes using SAS/Base, SAS/SQL, SAS/STAT, SAS/GRAPH, SAS/ODS, and Merge, Join, and Set statements.
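Illustrative sketch (not tied to any specific engagement): the A/B testing and hypothesis testing work summarized above generally follows the pattern below; the conversion counts and significance level are hypothetical placeholders.

    # Minimal A/B test sketch: compare two conversion rates with a chi-squared test.
    # The counts below are hypothetical placeholders, not real experiment data.
    import numpy as np
    from scipy import stats

    control = np.array([120, 880])   # [conversions, non-conversions] for variant A
    variant = np.array([150, 850])   # [conversions, non-conversions] for variant B

    chi2, p_value, dof, expected = stats.chi2_contingency(np.array([control, variant]))
    alpha = 0.05                     # pre-chosen significance level
    print(f"p-value = {p_value:.4f}; reject H0 at alpha={alpha}: {p_value < alpha}")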
TECHNICAL SKILLS:
Skills: SQL, Pandas, Informatica, Tableau, Excel.
Languages: C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), DTD, Schemas, Scala, Python (NumPy, SciPy, Pandas, Keras), Shell Scripting
NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB
Statistics: Hypothesis Testing, ANOVA, Confidence Intervals, Bayes' Law, MLE, Fisher Information, Principal Component Analysis (PCA), Cross-Validation, Correlation.
BI Tools: Tableau Server, Tableau Reader, Tableau, Splunk, SAP BusinessObjects, OBIEE, SAP Business Intelligence, QlikView, Amazon Redshift, Azure Data Warehouse
Algorithms: Logistic regression, random forest, XGBoost, KNN, SVM, neural network, linear regression, lasso regression, k-means.
Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall
Reporting Tools: MS Office (Word/Excel/PowerPoint/ Visio/Outlook), Crystal Reports XI, SSRS, Cognos 7.0/6.0.
BI Tools: Microsoft Power BI, Tableau, SSIS, SSRS, SSAS, Business Intelligence Development Studio (BIDS), Visual Studio, Crystal Reports, Informatica 6.1.
Database Design Tools and Data Modeling: MS Visio, ERWIN 4.5/4.0, Star Schema/Snowflake Schema modeling, Fact & Dimensions tables, physical & logical data modeling, Normalization and De-normalization techniques, Kimball & Inmon Methodologies
PROFESSIONAL EXPERIENCE:
Data Scientist
Confidential - Irvine, CA
Responsibilities:
- Communicated and coordinated with end client for collecting data and performed ETL to define the uniform standard format.
- Queried and retrieved data from SQL Server database to get the sample dataset.
- In the preprocessing phase, used Pandas to clean missing data, cast data types, and merge or group tables for the EDA process.
- Used PCA and other Scikit-learn preprocessing techniques such as feature engineering, feature normalization, and label encoding to reduce the high-dimensional data (>150 features)
- In the data exploration stage, used correlation analysis and graphical techniques in Matplotlib and Seaborn to gain insights into the patient admission and discharge data.
- Experimented with predictive models including Logistic Regression, Support Vector Machine (SVC), and Random Forest provided by Scikit-learn, as well as XGBoost, LightGBM, and neural networks in Keras, to predict show probability and visit counts.
- Designed and implemented cross-validation and statistical tests, including k-fold, stratified k-fold, and hold-out schemes, to test and verify the models' significance (see the sketch below).
- Implemented, tuned and tested the model on AWS Lambda with the best performing algorithm and parameters.
- Implemented a hypothesis testing kit for sparse sample data by writing R packages.
- Collected the feedback after deployment, retrained the model to improve the performance.
Environment: SQL Server 2012/2014, AWS EC2, AWS Lambda, AWS S3, AWS EMR, Linux, Python 3.x (Scikit-learn, NumPy, Pandas, Matplotlib), R, Machine Learning algorithms, Tableau.
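A minimal sketch of the preprocessing and cross-validation approach described above; the feature matrix, target, and parameter choices are hypothetical placeholders, not the actual client data or model.

    # Sketch: scaling + PCA dimensionality reduction + stratified k-fold cross-validation.
    # X and y are hypothetical stand-ins for the >150-feature patient dataset.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 150))       # placeholder high-dimensional features
    y = rng.integers(0, 2, size=1000)      # placeholder binary target (show / no-show)

    model = Pipeline([
        ("scale", StandardScaler()),       # feature normalization
        ("pca", PCA(n_components=30)),     # dimensionality reduction
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")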
Data Scientist
Confidential - Hartford, CT
Responsibilities:
- Work with users to identify the most appropriate source of record required to define the asset data for financing
- Performed data profiling in Target DWH
- Experience in using OLAP functions like COUNT, SUM, and CSUM
- Performed data analysis and data profiling using complex SQL on various source systems including Oracle and Teradata.
- Hands-on experience with Sqoop.
- Developed normalized Logical and Physical database models for designing an OLTP application.
- Developed new scripts for gathering network and storage inventory data and making the data available for Splunk ingestion.
- Imported the customer data into Python using the Pandas library and performed various data analyses; found patterns in the data that informed key decisions for the company (see the sketch below)
- Created tables in Hive and loaded the structured data resulting from MapReduce jobs
- Developed many HiveQL queries and extracted the required information.
- Exported the required information to an RDBMS using Sqoop to make the data available to the claims processing team to assist in processing claims.
- Designed and deployed rich graphic visualizations with drill-down and drop-down menu options and parameters using Tableau.
- Extracted data from the database using SAS/Access and SAS SQL procedures and created SAS datasets.
- Created Teradata SQL scripts using OLAP functions like RANK () to improve the query performance while pulling the data from large tables.
- Worked on MongoDB database concepts such as locking, transactions, indexes, Sharding, replication, schema design, etc.
- Performed Data analysis using Python Pandas.
- Good experience with Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and Excel data extracts.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Involved in defining the source to target data mappings, business rules, business and data definitions
- Responsible for defining the key identifiers for each mapping/interface
- Responsible for defining the functional requirement documents for each source to target interface.
- Hands-on experience with advanced MS Excel features such as pivot tables and charts for generating graphs.
- Designed and developed weekly and monthly reports using MS Excel techniques (charts, graphs, pivot tables) and PowerPoint presentations.
Environment: Teradata, SAS/Access, SAS SQL, MS Excel, Python (Pandas), RDBMS, HiveQL.
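A minimal sketch of the Pandas-based customer analysis mentioned above; the file name and column names are hypothetical placeholders.

    # Sketch: load customer data with Pandas and surface simple patterns.
    # "customers.csv" and its columns are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("customers.csv", parse_dates=["claim_date"])

    # Basic profiling: missing values and summary statistics
    print(df.isna().sum())
    print(df.describe(include="all"))

    # Example pattern: claim counts and average claim amount by region and month
    monthly = (
        df.assign(month=df["claim_date"].dt.to_period("M"))
          .groupby(["region", "month"])["claim_amount"]
          .agg(["count", "mean"])
          .sort_values("mean", ascending=False)
    )
    print(monthly.head(10))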
Data Scientist
Confidential - Dearborn, MI
Responsibilities:
- Perform Data Profiling to learn about user behavior and merge data from multiple data sources.
- Implemented big data processing applications to collect, clean, and normalize large volumes of open data using Hadoop ecosystem tools such as Pig, Hive, and HBase.
- Designing and developing various machine learning frameworks using Python, R, and MATLAB.
- Integrate R into MicroStrategy to expose metrics determined by more sophisticated and detailed models than natively available in the tool.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Worked with Data Architects and IT Architects to understand the movement of data and its storage, using ER Studio 9.7.
- Processed huge datasets (over a billion data points, over 1 TB of data) for data association pairing and provided insights into meaningful data associations and trends
- Developed cross-validation pipelines for testing the accuracy of predictions
- Enhanced statistical models (linear mixed models) for predicting the best products for commercialization using Machine Learning Linear regression models, KNN and K-means clustering algorithms
- Participated in all phases of data mining: data collection, data cleaning, model development, validation, and visualization; also performed gap analysis.
- Performed data manipulation and aggregation from different sources using Nexus, Toad, Business Objects, Power BI, and Smart View.
- Independently coded new programs and designed tables to load and test the programs effectively for the given POCs using Big Data/Hadoop.
- Develop documents and dashboards of predictions in MicroStrategy and present them to the business intelligence team.
- Developed various QlikView data models by extracting and using data from various source files, DB2, Excel, flat files, and big data.
- Good knowledge of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, Secondary Name Node, and MapReduce concepts.
- As Architect delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, KNN, Naive Bayes.
- Used Teradata 15 utilities such as FastExport and MLOAD for handling various data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Handled importing data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Collaborate with data engineers to implement the ETL process; write and optimize SQL queries to perform data extraction from the cloud and merging from Oracle 12c.
- Collect unstructured data from MongoDB 3.3 and complete data aggregation.
- Perform data integrity checks, data cleaning, exploratory analysis, and feature engineering using R 3.4.0.
- Conducted analysis assessing customer purchasing behaviors and discovered customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering (see the sketch below).
- Work on outlier identification with box plots and K-means clustering using Pandas and NumPy.
- Participate in feature engineering such as feature intersection generation, feature normalization, and label encoding with Scikit-learn preprocessing.
- Use Python 3 (NumPy, SciPy, Pandas, Scikit-learn, Seaborn, NLTK) and Spark 1.6/2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Analyze data and perform data preparation by applying the historical model to the dataset in Azure ML.
- Perform data visualization with Tableau 10 and generate dashboards to present the findings.
- Determine customer satisfaction and help enhance customer experience using NLP.
- Work on Text Analytics, Naïve Bayes, Sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
Environment: ER Studio 9.7, Tableau 9.03, AWS, Teradata 15, MDM, GIT, Unix, Python 3.5.2, MLLib, SAS, regression, logistic regression, QlikView.
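A minimal sketch of the RFM-based customer segmentation with K-Means described above; the transactions file, its columns, and the cluster count are hypothetical placeholders.

    # Sketch: derive RFM features from transactions and segment customers with K-Means.
    # "transactions.csv" and its columns are hypothetical placeholders.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    tx = pd.read_csv("transactions.csv", parse_dates=["order_date"])
    snapshot = tx["order_date"].max() + pd.Timedelta(days=1)

    # Recency, frequency, and monetary value per customer
    rfm = tx.groupby("customer_id").agg(
        recency=("order_date", lambda d: (snapshot - d.max()).days),
        frequency=("order_id", "nunique"),
        monetary=("amount", "sum"),
    )

    X = StandardScaler().fit_transform(rfm)
    rfm["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(rfm.groupby("segment")[["recency", "frequency", "monetary"]].mean())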
Data Analyst
Confidential - Columbus, OH
Responsibilities:
- Participated in requirement gathering sessions with business stakeholders to understand the project goals and documented the business requirement documents (BRD).
- Studied the Requirements Specifications, Use Cases and analyzed the data needs of the Business users.
- Redesigned some of the previous models by adding some new entities and attributes as per the business requirements.
- Converted the Logical data models to Physical data models to generate DDL scripts.
- Reverse engineered existing data models for analyzing and comparing the business processes.
- Expertise in the Forward Engineering of logical models to generate the physical model using Erwin.
- Created the Logical data models using Erwin 7.2 and ensured that they followed normalization rules and had the best possible traceability pattern.
- Migrated several models from Erwin 4.1/7.1 to ERWIN 7.2 and updated the previous naming standards.
- Created complex mappings and mapplets using Lookup, Expression, Aggregator, Sequence Generator, Union, Normalizer, and Router transformations.
- Implemented Slowly Changing Dimension (SCD) Type 1, Type 2, and Type 3 methodologies for accessing the full history of accounts and transaction information (see the sketch below).
- Involved in writing the validation scripts to identify the data inconsistencies in the sources.
- Documented Source Target Mappings documents, Design documents and Sign-off documents.
- Documented designs and Transformation Rules engine for use of all the designers across the project.
- Worked with Business Analysts to design weekly and monthly reports using Cognos.
- Worked very closely with developers, business analyst, and end users to generate various reports
- Worked on the Enterprise Metadata repositories for updating the metadata and was involved in Master Data Management (MDM).
- Responsible for detailed verification, validation, and review of the design specifications.
- Conducted a JAD session to review the data models involving SMEs, developers, testers, and analysts.
Environment: Erwin 4.1, Informatica 8.0, Cognos, Oracle 9i, SQL Server 2003, SQL, MS Office, Windows 2003.
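The SCD work above was implemented in Informatica; purely as an illustration of the Type 2 pattern (expire the changed row, insert a new current row), a short pandas sketch with hypothetical tables and columns might look like this.

    # Sketch of SCD Type 2 logic: expire changed rows and append new current versions.
    # The dimension (dim) and staging (stg) tables below are hypothetical placeholders.
    import pandas as pd

    today = pd.Timestamp("2024-01-15")
    dim = pd.DataFrame({
        "customer_id": [1, 2],
        "address": ["12 Oak St", "9 Elm Ave"],
        "eff_date": pd.to_datetime(["2023-01-01", "2023-01-01"]),
        "end_date": pd.NaT,
        "is_current": True,
    })
    stg = pd.DataFrame({"customer_id": [1, 2], "address": ["12 Oak St", "77 Pine Rd"]})

    # Find current rows whose tracked attribute changed in the new snapshot
    merged = dim[dim["is_current"]].merge(stg, on="customer_id", suffixes=("_old", "_new"))
    changed = merged[merged["address_old"] != merged["address_new"]]

    # Type 2: expire the old version of each changed customer...
    expire_mask = dim["customer_id"].isin(changed["customer_id"]) & dim["is_current"]
    dim.loc[expire_mask, "end_date"] = today
    dim.loc[expire_mask, "is_current"] = False

    # ...and insert the new current version
    new_rows = changed.rename(columns={"address_new": "address"})[["customer_id", "address"]]
    new_rows = new_rows.assign(eff_date=today, end_date=pd.NaT, is_current=True)
    dim = pd.concat([dim, new_rows], ignore_index=True)
    print(dim)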
Data Analyst
Confidential
Responsibilities:
- Wrote custom procedures and triggers to improve performance and maintain referential integrity.
- Optimized queries with modifications in SQL code, removed unnecessary columns and eliminated data discrepancies.
- Normalized tables, established joins, and created indexes wherever necessary utilizing profiler and execution plans.
- Hands-on experience in analyzing the data and writing Custom MySQL queries for better performance, joining the tables and selecting the required data to run the reports.
- Restricted data for users using row-level security and filters.
- Utilized technology such as MySQL and Excel Power Pivot to query test data and customize end-user requests.
- Utilized dimensional data modeling techniques and storyboarded ETL processes.
- Developed ETL procedures to ensure conformity, compliance with standards and lack of redundancy, translated business rules and functionality requirements into ETL procedures using Informatica Power Center.
- Performed Unit Testing and tuned the Informatica mappings for better performance.
- Re-engineered existing Informatica ETL processes to improve performance and maintainability.
- Exceeded expectations by collaborating with a testing team in creating and executing the test cases for a giant pharmaceutical company.
- Organized and delegated tasks effectively while working with multiple customers at the same time.
Environment: Informatica, Oracle 10g, DB2, UNIX, Toad, Putty, HP Quality Center, SSIS, SSAS, SSRS.