Data Scientist Resume
Madison, WI
SUMMARY:
- Data Scientist with over 6 years of experience in Big Data analytics, data manipulation, data mining, and implementing machine learning models to solve business problems.
- Experience handling large data sets (structured, semi-structured, and unstructured), cleaning and organizing the data for analysis.
- Actively involved in all stages of the data science project lifecycle, from data understanding to deployment.
- Expert-level knowledge of Python libraries used for data analysis, such as NumPy, SciPy, pandas, Seaborn, and TensorFlow.
- Worked with visualization tools such as Tableau, R, Plotly, and Power BI to surface insights.
- Experienced in implementing machine learning algorithms to develop predictive models for business problems.
- Implemented machine learning algorithms such as Random Forests, Decision Trees, K-Nearest Neighbors (KNN), and recommendation systems.
- Developed classification models using Support Vector Machines (SVM), Random Forests, Naive Bayes, and Logistic Regression.
- Used the Synthetic Minority Over-sampling Technique (SMOTE) to reduce class imbalance by synthesizing new minority-class instances, producing less biased models (a minimal sketch appears after this summary).
- Experienced in applying K-Means, DBSCAN, and Euclidean distance measures to detect outliers in the data.
- Implemented Principal Component Analysis (PCA) and Factor Analysis for dimensionality reduction.
- Applied data structures and object-oriented programming concepts.
- Good knowledge of using deep learning for image and video analysis.
- Hands-on experience with Spark SQL, Spark Streaming, and Spark machine learning (MLlib).
- Working experience with SQL Server, MySQL, Spark SQL, PostgreSQL, MongoDB, Cassandra, and DynamoDB.
- Implemented forward selection, backward elimination, and stepwise approaches to select significant independent variables for statistical analysis.
- Experience performing sentiment analysis on customer reviews using NLP techniques such as TF-IDF.
- Experience evaluating model performance using confusion matrices, RMSE, cross-validation, ROC curves, and A/B testing.
- Experience performing statistical analysis (distance measures, hypothesis testing, descriptive statistics, chi-square tests, Analysis of Variance).
- Experience developing MapReduce programs for data preprocessing, covering data cleaning and wrangling tasks such as handling null values.
- Experienced in designing, developing, implementing, and maintaining Big Data analytics solutions on the Hadoop ecosystem, with components such as HDFS, Hive, Pig, and MapReduce.
- Performed time series analysis to inform key strategies and business decisions.
- Good understanding of deep neural networks, Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN).
- Knowledge of building neural networks with the Keras and TensorFlow libraries in Python.
- Experience working on Hadoop clusters using Amazon Web Services (AWS).
- Knowledge of cloud platforms such as Azure and Google Cloud Platform (GCP).
- Created interactive dashboards and reports using Tableau and QlikView to make data easier to understand and analyze.
- Strong knowledge of the Software Development Life Cycle (SDLC) and expertise in detailed design documentation.
- Experience working in Waterfall and Agile environments with project management tools such as JIRA.
- Expert-level knowledge of and work experience with RDBMSs such as MS SQL Server, Oracle, MySQL, and PostgreSQL, and non-relational databases such as Cassandra and MongoDB.
- Excellent leadership qualities with strong written and oral communication skills.
- Proven ability to adopt new technologies.
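Illustrative sketch (Python) for the SMOTE bullet above: a minimal example assuming the imbalanced-learn (imblearn) library, with a synthetic data set standing in for real project data.

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data (assumption: stands in for a real business data set)
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    # Oversample only the training split so no synthetic points leak into the test set
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
    print(clf.score(X_test, y_test))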
TECHNICAL SKILLS:
Data Analytics Tools: Python (NumPy, pandas, scikit-learn, SciPy), SPSS
Databases: Oracle 11g, SQL Server, MS Access, MySQL, MongoDB, Cassandra, PostgreSQL; PL/SQL, T-SQL, ETL
Big Data ecosystem: Hadoop, Hive, Pig, Sqoop, Kafka, MLlib, HDFS, Spark, HBase, NiFi, Flume, Impala, PySpark, MapReduce
Languages: R, SQL, Python, Java, C++, Shell scripting, Scala
BI and Visualization: Tableau, R, SSIS, SSRS, SSAS, Informatica, QlikView.
Version Control: GIT, SVN
IDE: RStudio, Atom, Brackets, Eclipse, Jupyter Notebook, Zeppelin, NetBeans
Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, wordcloud, kernlab, neuralnet, twitteR, NLP, reshape2, rjson, plyr, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, spaCy, Beautiful Soup, rpy2, TensorFlow, PyTorch
Web Technologies: HTML, CSS, PHP, JavaScript
Machine Learning Algorithms: Linear Regression, Logistic Regression, Linear Discriminant Analysis, Time Series Analysis, Decision Trees, Random Forests, Naïve Bayes, K-Nearest Neighbors, Support Vector Machines, Boosting, Bagging, ARIMA, CNN, RNN, LSTM, XGBoost, AdaBoost, MLP, Radial Basis Function Networks
Operating Systems: Mac OS X, Ubuntu, Unix, Windows XP/7/8/10, Linux
PROFESSIONAL EXPERIENCE:
Confidential, Madison, WI.
Data Scientist
Responsibilities:
- Gathered data from multiple web and central data marts, using SQL queries to join and aggregate data views for analysis.
- Developed scripts to automate cleaning of structured metadata and unstructured data, and loaded the results into AWS S3 and Redshift.
- Prepared the data for analysis by applying dimensionality reduction techniques (t-SNE, PCA) to reduce the number of features.
- Used Kafka to process live streaming data and perform analytics on it.
- Processed the data with Python pandas, converting the data sets into data frames for analysis.
- Identified outliers using DBSCAN, KNN, boxplots, and Euclidean distance, then eliminated them or reduced their impact on the model.
- Performed exploratory data analysis using the NumPy package in Python and used Seaborn to gain insight into the data.
- Implemented various machine learning algorithms in Python, including Logistic Regression, multivariate regression, Random Forests, Support Vector Machines, clustering, LDA, Naïve Bayes, and neural networks.
- Developed and optimized classifiers using machine learning algorithms such as Support Vector Machines and Random Forests.
- Built decision trees using entropy, information gain, and the Gini index as split criteria (see the tree-pruning sketch after this job entry).
- Performed post-pruning, using minimum-error pruning and error-complexity pruning to reduce model complexity and mitigate overfitting.
- Built predictive models using Random Forests, Decision Trees, Support Vector Machines, and K-Means.
- Used the Minkowski error to make the training process less sensitive to outliers, reducing their impact on the model.
- Used Amazon Elastic MapReduce (EMR) to process large volumes of data and to run scripts developed in PySpark.
- Implemented a Multilayer Perceptron (MLP) and Radial Basis Function networks, improving prediction accuracy to 83%.
- Optimized the model using the conjugate gradient method to accelerate convergence of the error curve.
- Read data in various formats, including JSON, XML, HTML (.htm, .html), plain text (.txt), Parquet, Avro, and Rich Text Format (.rtf).
- Produced forecasts for multivariate time series data using exponential smoothing, ARIMA models, statistical algorithms, and transfer function models (see the ARIMA sketch after this job entry).
- Performed data visualization and built interactive Tableau dashboards to improve understanding of the data and communication of results.
- Generated graphs, charts, and reports in Tableau to communicate insights to stakeholders.
- Validated the developed models using cross-validation (including k-fold), AUC, and ROC curves (see the cross-validation sketch after this job entry).
- Deployed the model to the AWS cloud using Elastic Beanstalk.
- Improved model accuracy with boosting and bagging techniques that reduce variance.
- Used Python (NumPy, SciPy, pandas, scikit-learn, Matplotlib, Seaborn) and Spark (PySpark, MLlib) for model development and analysis.
- Conducted studies and rapid plotting, applying advanced data mining and statistical modeling techniques to build solutions that improve data quality and performance.
- Performed feature engineering, model building, performance evaluation, and online testing on data sets ranging from terabytes to petabytes.
- Participated in all phases of the project: data mining, data collection, data cleaning, model development, validation, and visualization, and performed gap analysis.
- Created user stories and subtasks and tracked issues in JIRA during Agile project development.
Environment: Scikit-learn, PySpark, MLlib, AWS, Python, AWS RDS, AWS DynamoDB, Sqoop, ETL, Tableau, Amazon S3, Random Forests, Support Vector Machines, K-Means clustering, Neural networks, EMR, Amazon Redshift, Glue, Athena.
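A minimal sketch (Python) for the decision-tree bullets in this job entry, assuming scikit-learn: trees are grown with Gini or entropy as the split criterion, then post-pruned. Scikit-learn exposes cost-complexity pruning (ccp_alpha), used here as a stand-in for the minimum-error and error-complexity pruning named above.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Public demo data set; stands in for the project's data
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for criterion in ("gini", "entropy"):
        # ccp_alpha > 0 turns on cost-complexity post-pruning to curb overfitting
        tree = DecisionTreeClassifier(criterion=criterion, ccp_alpha=0.01, random_state=0)
        tree.fit(X_train, y_train)
        print(criterion, round(tree.score(X_test, y_test), 3))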
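A minimal forecasting sketch (Python) for the time series bullet, assuming statsmodels. The series is synthetic and the (p, d, q) order is an illustrative choice, not the order used on the project.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA

    # Synthetic monthly series with trend plus noise; stands in for the real data
    idx = pd.date_range("2020-01-01", periods=48, freq="MS")
    values = 100 + 0.5 * np.arange(48) + np.random.default_rng(0).normal(0, 2, 48)
    series = pd.Series(values, index=idx)

    model = ARIMA(series, order=(1, 1, 1)).fit()
    print(model.forecast(steps=6))  # six-month-ahead forecast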
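A minimal validation sketch (Python) for the cross-validation bullet, assuming scikit-learn: stratified k-fold cross-validation scored by ROC AUC.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    # Per-fold ROC AUC scores; the spread indicates the variance of the estimate
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                             cv=cv, scoring="roc_auc")
    print(scores.mean(), scores.std())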
Confidential, Kansas City, MO.
Machine Learning/Data Engineer
Responsibilities:
- Studied uncertainty in the collected data through statistical analysis, characterizing distributions by their mean, median, and variance.
- Estimated the underlying structure of the data by computing correlations and performing cluster analysis.
- Created ETL pipelines for data preprocessing, data migration, and analytics.
- Optimized ETL workflows for better data-migration performance and applied the transformations required by the project.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Worked on migrating MapReduce programs to Spark transformations using Spark and Scala.
- Enhanced and optimized Spark code to aggregate, group, and run data mining tasks on the Spark framework.
- Involved in migrating MapReduce jobs to Spark jobs, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Used Spark Streaming to process data from static sources (HBase, PostgreSQL, Cassandra, MySQL) and streaming sources (Kafka, Flume); see the streaming sketch after this job entry.
- Developed Spark workflows in Scala to pull data from cloud-based systems (AWS) and apply transformations to it.
- Applied analysis methods such as hypothesis testing and Analysis of Variance (ANOVA) to validate existing models against the observed data.
- Identified an evaluation strategy (cross-validation, train-test split) and error measures for choosing an optimal solution.
- Implemented machine learning algorithms with scikit-learn and Spark MLlib, choosing an appropriate model (Support Vector Machines, Random Forests, K-Nearest Neighbors).
- Improved prediction accuracy to 81% by implementing ensemble methods such as Bootstrap Aggregation (bagging) and boosting (AdaBoost, XGBoost); see the ensemble sketch after this job entry.
- Used regularization, cross-validation, and early stopping to reduce overfitting.
- To improve customer satisfaction, performed text analytics on user reviews, classifying them as positive or negative.
- Used Natural Language Processing (NLP) to perform sentiment analysis on customer reviews; performed tokenization and n-gram extraction to convert text into lists of words, then filtered out stop words.
- Performed stemming and lemmatization to reduce tokens to root words, handled negations, and applied TF-IDF to the preprocessed data for sentiment analysis (see the TF-IDF sketch after this job entry).
- Implemented proof-of-concept approaches using deep learning (LSTM, CNN) for machine learning feature engineering on AWS nodes.
- Used Spark MLlib to train and test the developed model on live streaming data.
- Evaluated whether the fitted model was overfitting or underfitting and reduced those problems using tree pruning and ensemble methods.
- Developed various QlikView data models, extracting data from source files, DB2, Excel, flat files, and big data sources.
- Optimized and adapted existing algorithms to handle continually growing data.
- Created visualizations using Python libraries (Matplotlib, Seaborn), QlikView, and Tableau.
- Collaborated with other teams to deploy the developed prediction model in the AWS cloud environment.
Environment: R, Python, Tableau, Scikit-learn, Apache Spark, MongoDB, Agile Methodology, ETL, Oozie, K-Nearest Neighbors, Random Forests, Support Vector Machines, K-Means, MLlib, Spark SQL, Spark Streaming, SQLite, QlikView.
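A minimal sketch (Python) for the Spark Streaming bullet, using PySpark Structured Streaming to read from Kafka. The broker address, topic name, and checkpoint path are placeholders, and the Spark Kafka connector package must be on the classpath; none of this reflects the project's actual configuration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    # Read a Kafka topic as a streaming DataFrame (hypothetical broker and topic)
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "reviews")
              .load())

    # Kafka values arrive as bytes; cast to string before downstream parsing
    parsed = stream.select(col("value").cast("string").alias("raw"))

    query = (parsed.writeStream.format("console")
             .option("checkpointLocation", "/tmp/ckpt")
             .start())
    query.awaitTermination()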
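A minimal sketch (Python) for the ensemble bullet, assuming scikit-learn and xgboost; the 81% figure above came from the project's data, not this toy data set.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "bagging": BaggingClassifier(random_state=0),    # bootstrap aggregation
        "adaboost": AdaBoostClassifier(random_state=0),  # boosting
        "xgboost": XGBClassifier(eval_metric="logloss"), # gradient boosting
    }
    for name, model in models.items():
        print(name, cross_val_score(model, X, y, cv=5).mean())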
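A minimal sketch (Python) of the TF-IDF sentiment pipeline described above, assuming scikit-learn: tokenization, stop-word removal, n-grams, TF-IDF weighting, then a linear classifier. The sample reviews are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labeled reviews (assumption: real data came from customer feedback)
    reviews = ["great product, works well", "terrible, broke in a day",
               "love it", "worst purchase ever", "good value", "not worth the money"]
    labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

    # TfidfVectorizer handles tokenization, stop-word removal, and n-grams in one step
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
        LogisticRegression(),
    )
    model.fit(reviews, labels)
    print(model.predict(["really great value", "broke immediately"]))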
Confidential, Chester, VA
Big Data Analyst
Responsibilities:
- Involved in the analysis, design, and implementation of user requirements.
- Worked on cleaning and organizing large sets of structured and unstructured data, identifying missing values, invalid values, and outliers.
- Imported and exported data to and from HDFS and Hive using Sqoop and Kafka.
- Ingested data from relational databases into HDFS using Sqoop and Spark, loaded it into Hive tables, and transformed and analyzed large data sets with Hive queries and Spark.
- Imported data into Hive and HBase from SQL Server using various ETL tools.
- Analyzed large data sets to inform strategy and support company decision-making.
- Actively involved in designing, developing, and optimizing Hive scripts based on project requirements.
- Developed various MapReduce jobs for data cleaning and preprocessing.
- Involved in developing automated Pig scripts to validate data ingested from multiple sources.
- Performed PCA (Principal Component Analysis) and t-SNE on the cleaned data set to identify the most informative parameters for analysis.
- Visualized the data to identify trends and patterns.
- Determined trends and relationships in the data by applying advanced statistical methods such as t-tests, hypothesis testing, ANOVA, chi-square tests, and correlation analysis (see the SciPy sketch after this job entry).
- Used k-nearest neighbors, Euclidean distance, and boxplot techniques to identify and handle outliers.
- Generated graphs and charts with the ggplot2 package in R and the Matplotlib package in Python for a better understanding of the data.
- Coordinated with the Data Science and BA teams on building a predictive model that met the requirements, using various machine learning algorithms.
Environment: Kafka, Hive, SQL Server, HDFS, Pig, MapReduce, Sqoop, Hadoop, R, PCA.
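A minimal sketch (Python) for the statistical-methods bullet above, assuming SciPy; the samples are synthetic stand-ins for the project's data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a, b, c = rng.normal(0, 1, 100), rng.normal(0.3, 1, 100), rng.normal(0.1, 1, 100)

    print(stats.ttest_ind(a, b))              # two-sample t-test
    print(stats.f_oneway(a, b, c))            # one-way ANOVA
    print(stats.chisquare([18, 22, 20, 40]))  # chi-square goodness of fit
    print(stats.pearsonr(a, b))               # correlation analysis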
Confidential
Data Analyst
Responsibilities:
- Analyzed data at large scale using distributed computing tools like Hadoop.
- Used Sqoop to import and export the data from HDFS to SQL Server.
- Identified new and innovative approaches to draw useful conclusions that aided in the company’s growth.
- Cleaned the fuzzy data sets and preprocessed them for analysis.
- Developed ETL procedures for data migration from SQL server to Hadoop.
- Formulated several hypotheses and evaluated existing candidate models.
- Analyzed the data using R packages such as dplyr, sqldf, ggplot2, and plotly.
- Wrote complex SQL queries to extract required data from the data sets.
- Performed statistical analysis (regression analysis, distributions, statistical significance) to identify KPIs.
- Interpreted and communicated the insights to internal teams.
- Translated the insights into Tableau visuals to make the data easier to interpret.
- Worked with the Quality Assurance team to provide quality assurance on the imported data.
Environment: Python, Tableau, SQL Server, R, Hadoop, Sqoop, HDFS
Confidential
SQL Developer
Responsibilities:
- As a SQL Developer, was actively involved in all SDLC phases, from requirements gathering to deployment.
- Performed Data Modeling to analyze the requirements and the structure of the database.
- Designed database tables as per the client’s requirements to store the application Data.
- Transformed data using SQL Server Integration Services (SSIS) transformations such as Aggregate, Multicast, Term Extraction, Merge, and Data Conversion.
- Created stored procedures, complex functions, and scripts in T-SQL to support application development.
- Wrote T-SQL procedures to generate DML scripts that modified database objects based on the requirements.
- Worked with the application development team to create optimized queries.
- Reviewed the query performance frequently for code optimization and improved performance of the system.
- Created database triggers to automate routine tasks.
- Responsible for database testing, troubleshooting, and bug fixes.
- Performed regular database backups to protect against failures.
- Recommended Recovery and Backup strategies setup for SQL Servers.
- Provided corrective measures and fixed any issues related to performance of the database.
- Managed memory and updated indexes to improve performance.
- Used SSRS to build and format complex SQL reports.
Environment: Microsoft SQL Server 2008 R2, SSIS, SSRS, Business Intelligence Development Studio (BIDS), MS Excel, Erwin