Data Scientist Resume
San Jose, CA
SUMMARY:
- Highly efficient Data Scientist with 8 years of experience in Machine Learning, Data Mining with large data sets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Data Visualization, Web Crawling, and Web Scraping. Adept in statistical programming languages like R and Python, as well as Big Data technologies like Hadoop and Hive.
- Proficient in managing the entire data science project life cycle and actively involved in all of its phases, including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, neural networks, SVM, clustering), dimensionality reduction using Principal Component Analysis and Factor Analysis, testing and validation using ROC plots and K-fold cross-validation, and data visualization (see the illustrative sketch after this summary).
- Deep understanding of statistical modeling, multivariate analysis, model testing, problem analysis, model comparison, and validation.
- Advanced experience with Python and its libraries such as NumPy, Pandas, Scikit-learn, XGBoost, PyTorch, Matplotlib, TensorFlow, and SciPy.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Skilled in performing data parsing, data manipulation, and data preparation, including describing data contents, computing descriptive statistics, regex operations, split and combine, remap, merge, subset, reindex, melt, and reshape.
- Experience in using various packages in R and Python such as ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, Beautiful Soup, and rpy2.
- Extensive experience in Text Analytics, generating data visualizations using R and Python, and creating dashboards using tools like Tableau.
- Proficient in machine learning algorithms like Linear Regression, Logistic Regression, Decision Trees, and Random Forests, and more advanced techniques like CNNs, RNNs, ensemble methods (Bagging, Boosting), and Reinforcement Learning using PyTorch.
- Hands on experience with big data tools like Hadoop, Spark, Hive, Pig, Impala, PySpark, Spark SQL.
- Hands-on experience in implementing LDA and Naive Bayes, and skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis.
- Good knowledge of Proofs of Concept (PoCs) and gap analysis; gathered necessary data for analysis from different sources and prepared it for exploration using data munging.
- Good industry knowledge, analytical and problem-solving skills, and the ability to work well within a team as well as individually.
- Highly creative, innovative, committed, intellectually curious, business savvy with good communication and interpersonal skills.
- Experience working with data modeling tools like Erwin, PowerDesigner, and ER/Studio.
- Experience in designing star and snowflake schemas for Data Warehouse and ODS architectures.
- Experience in designing stunning visualizations using Tableau software and publishing and presenting dashboards and storylines on web and desktop platforms.
- Experience and technical proficiency in designing and data modeling for online applications; solution lead for architecting Data Warehouse/Business Intelligence applications.
- Good understanding of Teradata SQL Assistant, Teradata Administrator, and data load/export utilities like BTEQ, FastLoad, MultiLoad, and FastExport.
- Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
- Extensive experience in Data Visualization including producing tables, graphs, listings using various procedures and tools such as Tableau.
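Illustrative sketch (not project code): a minimal scikit-learn pipeline showing the PCA-based dimensionality reduction and K-fold cross-validation with ROC scoring described in this summary. The synthetic data, component count, and classifier choice are assumptions made only for illustration.

```python
# Minimal sketch: PCA dimensionality reduction + K-fold cross-validation with ROC AUC.
# Synthetic data and parameter choices are illustrative assumptions, not project values.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real feature matrix and binary labels
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=42)

# Scale features, reduce to 10 principal components, then fit a classifier
model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))

# 5-fold cross-validation, scored with ROC AUC (matching the ROC-based validation above)
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42),
                         scoring="roc_auc")
print("ROC AUC per fold:", scores.round(3), "mean:", scores.mean().round(3))
```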
TECHNICAL SKILLS:
Languages: Java 8, Python, R
Packages: ggplot2, caret, dplyr, RWeka, gmodels, twitteR, NLP, reshape2, rjson, pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, Beautiful Soup, rpy2.
NLP/Machine Learning/Deep Learning: LDA (Latent Dirichlet Allocation), NLTK, Apache OpenNLP, Stanford NLP, Sentiment Analysis, SVMs, ANN, RNN, CNN, TensorFlow, MXNet, Caffe, H2O, Keras, PyTorch, Theano, Azure ML
Data Modelling Tools: Erwin r9.6/9.5/9.1/8.x, Rational Rose, ER/Studio, MS Visio, SAP PowerDesigner.
Big Data Technologies: Hadoop, Hive, HDFS, MapReduce, Pig, Kafka.
Databases: SQL, Hive, Impala, Pig, Spark SQL, SQL Server, MySQL, MS Access, HDFS, HBase, Teradata, Netezza, MongoDB, Cassandra.
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio), Tableau, Crystal Reports XI, Business Intelligence, SSRS, Business Objects 5.x/6.x, Cognos 7.0/6.0.
ETL Tools: Informatica PowerCenter, SSIS.
Version Control Tools: SVN, GitHub.
Methodologies: Ralph Kimball and Bill Inmon data warehousing methodology, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD).
BI Tools: Tableau, Tableau Server, Tableau Reader, SAP Business Objects, OBIEE, QlikView, SAP Business Intelligence, Amazon Redshift, Azure Data Warehouse.
Operating Systems: Windows, Linux, UNIX, macOS, Red Hat.
PROFESSIONAL EXPERIENCE:
Confidential, SAN JOSE, CA
Data Scientist
Responsibilities:
- Built models using statistical techniques like Bayesian HMM and machine learning classification models like XGBoost, SVM, and Random Forest.
- Worked with data compliance and data governance teams to maintain data models, metadata, and data dictionaries, and to define source fields and their definitions.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Set up storage and data analysis tools in the Amazon Web Services cloud computing infrastructure.
- Completed a highly immersive Data Science program involving data manipulation & visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB, and Hadoop.
- Transformed the Logical Data Model into a Physical Data Model in Erwin, ensuring primary key and foreign key relationships in the PDM, consistency of data attribute definitions, and primary index considerations.
- Developed Oracle 11g stored packages, procedures, functions, and database triggers using PL/SQL for the ETL process, data handling, logging, archiving, and Oracle back-end validations for batch processes.
- Documented logical, physical, relational and dimensional data models. Designed the Data Marts in dimensional data modeling using star and snowflake schemas.
- Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
- Leveraged cutting-edge NLP and machine learning technologies/frameworks like Keras, DyNet, and PyTorch.
- Worked with BTEQ to submit SQL statements, import and export data, and generate reports in Teradata.
- Designed and documented Use Cases, Activity Diagrams, Sequence Diagrams, OOD (Object Oriented Design) using UML and Visio.
- Created Hive queries that helped analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics, and processed the data using HiveQL (a SQL-like language) on top of MapReduce.
- Hands-on development and maintenance using Oracle SQL, PL/SQL, SQL*Loader, and Informatica PowerCenter 9.1.
- Collaborated with data engineers and operation team to implement the ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis by using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from Redshift.
- Designed the ETL process to extract, transform, and load data from the OLTP Oracle database system into the Teradata data warehouse.
- Created tables, sequences, synonyms, join functions and operators in the Netezza database.
- Built and published customized interactive reports and dashboards, report scheduling using Tableau server.
- Hands-on use of the Oracle External Tables feature to read data from flat files into Oracle staging tables.
- Analyzed weblog data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased products on the website (see the sketch after this list), and managed and reviewed Hadoop log files.
- Designed and developed user interfaces and customized reports using Power BI, and designed cubes for data visualization and mobile/web presentation with parameterization and cascading.
- Performed Data Analysis and Data Profiling and worked on data transformations and data quality rules.
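Illustrative sketch (not project code): a minimal PySpark example of the HiveQL-style weblog analysis described above. The Hive table name (weblogs) and its columns (log_date, visitor_id, product, event_type) are hypothetical placeholders, not the actual schema.

```python
# Minimal PySpark sketch of a weblog analysis over a Hive table.
# Table and column names below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("weblog-analysis")
         .enableHiveSupport()   # read Hive tables registered in the metastore
         .getOrCreate())

# Unique visitors and page views per day
daily = spark.sql("""
    SELECT log_date,
           COUNT(DISTINCT visitor_id) AS unique_visitors,
           COUNT(*)                   AS page_views
    FROM weblogs
    GROUP BY log_date
    ORDER BY log_date
""")
daily.show()

# Most purchased products on the website
top_products = spark.sql("""
    SELECT product, COUNT(*) AS purchases
    FROM weblogs
    WHERE event_type = 'purchase'
    GROUP BY product
    ORDER BY purchases DESC
    LIMIT 10
""")
top_products.show()
```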
Environment: Erwin 9.x, Teradata, Oracle 10g, Hadoop, HDFS, Pig, Hive, MapReduce, PL/SQL, UNIX, Informatica PowerCenter, Azure, Machine Learning, NLP, MDM, SQL Server, Netezza, DB2, Tableau, SAS/Graph, SAS/SQL, Jupyter, SAS/Connect, and SAS/Access.
Confidential - Santa Clara, CA.
Data Scientist
Responsibilities:
- Provided the architectural leadership in shaping strategic, business technology projects, with an emphasis on application architecture.
- Utilized domain knowledge and application portfolio knowledge to play a key role in defining the future state of large, business technology programs.
- Participated in all phases of data mining, data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Created ecosystem models (e.g. conceptual, logical, physical, canonical) that are required for supporting services within the enterprise data architecture (conceptual data model for defining the major subject areas used, ecosystem logical model for defining standard business meaning for entities and fields, and an ecosystem canonical model for defining the standard messages and formats to be used in data integration services throughout the ecosystem).
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and utilized algorithms such as linear regression, multivariate regression, Naive Bayes, Random Forests, K-means, and KNN for data analysis.
- Conducted studies and rapid plots, using advanced data mining and statistical modeling techniques to build solutions that optimize the quality and performance of data.
- Demonstrated experience in the design and implementation of statistical models, predictive models, enterprise data models, metadata solutions, and data lifecycle management in both RDBMS and Big Data environments.
- Analyzed large data sets, applied machine learning techniques, and developed and enhanced predictive and statistical models by leveraging best-in-class modeling techniques.
- Worked on database design, relational integrity constraints, OLAP, OLTP, Cubes and Normalization (3NF) and De-normalization of the database.
- Developed MapReduce/SparkPython modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked on customer segmentation using an unsupervised learning technique, clustering (a minimal sketch follows this list).
- Worked with various Teradata 15 tools and utilities like Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ, and other Teradata utilities.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
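Illustrative sketch (not project code): customer segmentation with K-means clustering, shown here with scikit-learn for brevity even though the project stack included Spark/MLlib. The feature names, synthetic data, and cluster count are assumptions.

```python
# Minimal sketch of customer segmentation via K-means clustering.
# Features and data are synthetic stand-ins, not project values.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: annual spend, visit frequency, average basket size
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * [1000.0, 5.0, 20.0] + [5000.0, 12.0, 60.0]

# Standardize so each feature contributes comparably to the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means and assign each customer to a segment
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
segments = kmeans.fit_predict(X_scaled)
print("Customers per segment:", np.bincount(segments))
```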
Environment: Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala, NLP, Spark, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, AWS, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce.
Confidential - Wichita, KS.
Data Scientist
Responsibilities:
- Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, NumPy.
- Explored and analyzed the customer specific features by using Matplotlib, Seaborn in Python and dashboards in Tableau.
- Performed data imputation using the Scikit-learn package in Python.
- Participated in feature engineering such as feature generation, PCA, feature normalization, and label encoding with Scikit-learn preprocessing (see the sketch after this list).
- Used Python 2.x/3.x (NumPy, SciPy, Pandas, Scikit-learn, Seaborn) to develop a variety of models and algorithms for analytic purposes.
- Connected the database with Jupyter Notebook for modeling and with Tableau for visualization and reporting.
- Experimented with and built predictive models, including ensemble methods such as gradient boosting trees and a neural network in Keras, to predict sales amount.
- Analyzed patterns in customers' shopping habits across different locations, categories, and months using time series modeling techniques.
- Used RMSE/MSE to evaluate different models' performance.
- Designed rich data visualizations to model data into human-readable form with Tableau and Matplotlib.
- Provided level-of-effort estimates and activity status; identified dependencies impacting deliverables, risks and mitigation plans, and issues and impacts.
- Designed database solution for applications, including all required database design components and artifacts.
- Developed and maintained a Data Dictionary to create metadata reports for technical and business purposes using the Erwin report designer.
- Generated ad-hoc reports using Crystal Reports 9 and SQL Server Reporting Services (SSRS).
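Illustrative sketch (not project code): the imputation, normalization, and label-encoding preprocessing steps mentioned above, using scikit-learn. The sample DataFrame and column names are assumptions for illustration.

```python
# Minimal sketch: imputation, feature normalization, and label encoding with scikit-learn.
# The DataFrame below is synthetic; column choices are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 35, np.nan],
    "income": [48000.0, 52000.0, np.nan, 61000.0, 58000.0],
    "region": ["west", "east", "east", "south", "west"],
})

# Impute missing numeric values with the column median
num_cols = ["age", "income"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Normalize numeric features to zero mean and unit variance
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# Encode the categorical column as integer labels
df["region"] = LabelEncoder().fit_transform(df["region"])
print(df)
```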
Environment: Python 2.x/3.x (Scikit-learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Jupyter, Tableau, Machine Learning algorithms (Random Forest, Gradient Boosting trees, neural network in Keras), GitHub.
Confidential - Jacksonville, FL.
Jr. Data Scientist
Responsibilities:
- Participated in all phases of data acquisition, data cleaning, developing models, validation, and visualization to deliver data science solutions.
- Retrieved data from the SQL Server database by writing SQL queries using stored procedures, temp tables, and views.
- Worked on fraud detection analysis of loan applications using loan history with supervised learning methods.
- Used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn in Python for developing various machine learning models and utilized algorithms such as Logistic regression, Random Forest, Gradient Boost Decision Tree and Neural Network.
- Performed feature engineering such as PCA for high-dimensional datasets and important-feature selection using tree-based models.
- Performed model tuning and selection using cross-validation and hyperparameter tuning to prevent overfitting (see the sketch after this list).
- Used ensemble methods, with different bagging and boosting techniques, to increase the accuracy of the training model.
- Managed and tracked project issues, debugging them as needed; developed SQL queries in SQL Server Management Studio and Toad, and generated complex reports for the end users.
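Illustrative sketch (not project code): cross-validated hyperparameter tuning with scikit-learn's GridSearchCV, as a minimal stand-in for the model tuning and selection described above. The parameter grid and synthetic data are assumptions.

```python
# Minimal sketch: cross-validated grid search to tune hyperparameters and limit overfitting.
# Synthetic data and the parameter grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 5-fold cross-validated grid search over a small hyperparameter grid
param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5,
                      scoring="roc_auc")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated ROC AUC:", round(search.best_score_, 3))
print("Held-out test ROC AUC:", round(search.score(X_test, y_test), 3))
```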
Environment: SQL Server 2008, Python 2.x (NumPy/Pandas/Scikit-learn), Machine Learning algorithms (Logistic Regression/Random Forests/KNN/K-Means Clustering/DBSCAN/Ensemble methods), GitHub.
Confidential
Data Analyst.
Responsibilities:
- Handled a globally distributed team working on Model N product management, specifically the 'Government Pricing' module, and drove the team in resolving various production issues, enhancing the system, and proposing solutions while mapping them to the product roadmap.
- Wrote custom procedures and triggers to improve performance and maintain referential integrity.
- Optimized queries with modifications in SQL code, removed unnecessary columns and eliminated data discrepancies.
- Normalized tables, established joins, and created indexes wherever necessary, utilizing the profiler and execution plans.
- Performed storage management: managing space, tablespaces, segments, extents, rollback segments, and the data dictionary.
- Used SQL*Loader to move data from flat files into an Oracle database.
- Utilized dimensional data modeling techniques and storyboarded ETL processes.
- Developed ETL procedures to ensure conformity, compliance with standards, and lack of redundancy, translating business rules and functionality requirements into ETL procedures using Informatica PowerCenter.
- Performed Unit Testing and tuned the Informatica mappings for better performance.
- Re-engineered the existing Informatica ETL process to improve performance and maintainability.
- Interacted directly with the business teams in arriving at the workarounds or resolutions while involving appropriate groups within Model N.
- Exceeded expectations by collaborating with a testing team in creating and executing test cases for a major pharmaceutical company.
- Organized and delegated tasks effectively while working with multiple customers at the same time.
- Demonstrated leadership by guiding newcomers with the best practices of the company.
Environment: Model N Application, Informatica, Oracle 11g, DB2, UNIX, Toad, PuTTY, HP Quality Center, SSIS, SSAS, SSRS.