
Data Scientist Resume

Nashville, TN

SUMMARY:

  • 5+ years of work experience in designing, building, and implementing analytical and enterprise applications using Machine Learning, Python, R, Spark, and Scala.
  • Experience with a focus on Deep Learning, Machine Learning, Image Processing, AI, and Big Data.
  • Very good hands-on experience in Spark Core, Spark SQL, Spark Streaming, and Spark machine learning using the Scala and Python programming languages.
  • Very good experience implementing and handling end-to-end data science products.
  • Good experience in periodic model validation and optimization workflows for the data science products developed.
  • Good experience in extracting and analyzing very large volumes of data covering a wide range of information, from user profiles to transaction history, using machine learning tools.
  • Collaborated with engineers to deploy successful models and algorithms into production environments.
  • Good understanding of model validation processes and optimizations.
  • An excellent understanding of both traditional statistical modeling and Machine Learning techniques and algorithms such as regression, clustering, ensemble methods (random forest, gradient boosting), and deep learning (neural networks).
  • Proficient in understanding and analyzing business requirements, building predictive models, designing experiments, testing hypotheses, and interpreting statistical results into actionable insights and recommendations.
  • Fluency in Python with working knowledge of ML and statistical libraries (e.g., Scikit-learn, Pandas).
  • Experience in processing real-time data and building end-to-end ML pipelines.
  • Very strong in Python, statistical analysis, tools, and modeling.
  • Very good hands-on experience working with large datasets and Deep Learning algorithms using Apache Spark and TensorFlow.
  • Good knowledge of recurrent neural networks, LSTM networks, and word2vec.
  • Good experience in refining and improving image recognition pipelines.
  • Experienced with data modeling, Hadoop MapReduce architecture, and distributed systems.
  • Knowledgeable in installation, configuration, and monitoring of Hadoop clusters, as well as performance tuning of the cluster.
  • Developed data-analysis implementations in Pig and Hive. Executed workflows using Oozie.
  • Deep interest in learning both the theoretical and practical aspects of working with and deriving insights from data.
  • Developed highly scalable classifiers and tools by leveraging machine learning, Apache Spark, and deep learning.
  • Worked under the direction of the CSO to develop an effective solution to a predictive analytics problem, testing a number of potential machine learning algorithms in Apache Spark.
  • Built state-of-the-art statistical procedures, algorithms and models to solve a range of problems in diverse domains.
  • Proficient code-writing capability in major programming languages such as Python, R, Java, and Scala.
  • Good experience with deep learning frameworks like Caffe and TensorFlow.
  • Experience using Deep Learning to solve problems in image and video analysis.
  • Good understanding of Apache Spark features and advantages over MapReduce and traditional systems.
  • Solid understanding of RDD operations, i.e., transformations and actions, persistence (caching), accumulators, and broadcast variables (see the sketch after this list).
  • In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.
  • Highly organized and detail-oriented, with a strong ability to coordinate and track multiple deliverables, tasks, and dependencies.
  • Experience in exposing Apache Spark as web services.
  • Experience in real-time processing using Apache Spark and Kafka.
  • Good working experience with NoSQL databases such as Cassandra and MongoDB.
  • Delivered multiple end-to-end Big Data analytical solutions on distributed systems such as Apache Spark.
  • Experience leveraging DevOps techniques and practices such as Continuous Integration, Continuous Deployment, Test Automation, and Build Automation.
  • Hands-on experience leading delivery through Agile methodologies.
  • Experience in managing code on GitHub.
  • Hands-on experience with JIRA.
  • Expertise in the bug tracking process; familiar with bug reporting and tracking using tools such as HP QC, Redmine, and Jira.
  • Good hands-on experience with the Spring and Hibernate frameworks.
  • Solid understanding of object-oriented programming (OOP).
  • Familiarity with concepts of MVC, JDBC, and RESTful services.
  • Familiarity with build tools such as SBT.
  • Knowledge of Information Extraction and NLP algorithms coupled with Deep Learning.
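A minimal illustrative sketch of the RDD concepts noted above (transformations vs. actions, caching, accumulators, broadcast variables); the input path and data are hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics-sketch")

# Broadcast variable: a read-only lookup shared with every executor.
stopwords = sc.broadcast({"the", "a", "an"})
# Accumulator: a write-only counter aggregated back on the driver.
blank_lines = sc.accumulator(0)

def tokenize(line):
    if not line.strip():
        blank_lines.add(1)
    return [w for w in line.lower().split() if w not in stopwords.value]

lines = sc.textFile("hdfs:///data/sample.txt")   # hypothetical path
words = lines.flatMap(tokenize)                  # transformation (lazy)
counts = words.map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)   # transformations (lazy)
counts.cache()                                   # persistence: keep across actions

print(counts.take(10))     # action: triggers execution of the DAG
print(counts.count())      # action: reuses the cached RDD
print(blank_lines.value)   # read accumulators only after an action has run
```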

TECHNICAL SKILLS:

Languages: Python, R, Scala and Java

ML Libraries: Spark ML, Spark MLlib, Scikit-learn, NLTK and Stanford NLP

Deep learning framework: TensorFlow

Big Data Frameworks: Apache Spark, Apache Hadoop, Kafka, MongoDB, Cassandra.

Machine learning: Linear Regression, Logistic Regression, Naive Bayes, SVM, Decision Trees, Random Forest, Boosting, K-means, Bagging, etc.

Big data Distribution: Cloudera & Amazon EMR (cloud)

Web Technologies: Flask, Django and Spring MVC

Front End Technologies: JSP, HTML5, Ajax, jQuery and XML

Web servers: Apache2, Nginx, WebSphere and Tomcat

Visualization Tools: Apache Zeppelin, Matplotlib and Tableau.

Databases: Oracle, MySQL and PostgreSQL.

NoSQL: MongoDB and Cassandra

Operating Systems: Linux and Windows

Scheduling Tools: Airflow & Oozie

Testing: Agile practices, UFT - Unified Functional Testing, RFT - Rational Functional Testing, Python-Selenium, Katalon Studio, VB scripts, RQM - Rational Quality Manager, QC - Quality Center, RTC - Rational Team Concert (change management); Investment Banking domain

Environment: R Studio, AWS S3, EC2, Amazon EMR, Machine Learning, Neural Networks, SVM, Decision Trees, MLbase, ad hoc analysis, NLP, NoSQL, PL/SQL, MLlib, Git, Python, Scikit-learn, Pandas, NumPy, Matplotlib, Spark Core, Spark SQL, TensorFlow, Kafka, Flask, MongoDB, Hive, REST, Airflow, MapReduce, Mahout, Apache Spark, Eclipse, Jupyter Notebook, Hibernate, HTML5, CSS/SCSS, JavaScript, UNIX, Windows

WORK EXPERIENCE:

Confidential, Nashville, TN

Data Scientist

Responsibilities:

  • Converted data from PDF to XML using Python scripts in two stages, i.e., from raw XML to processed XML and from processed XML to CSV files.
  • Developed a generic script for the regulatory documents.
  • Used Python's ElementTree (ET) to parse the XML derived from the PDF files (see the parsing sketch after this list).
  • Accessed data stored in SQLite3 data files (DB) using Python, extracted the metadata, tables, and table data, and converted the tables to their respective CSV files.
  • Used the XML tags and attributes to isolate headings, side-headings, and subheadings into individual rows of the CSV file.
  • Used text mining and NLP techniques to determine the sentiment about the organization.
  • Deployed a spam detection model and performed sentiment analysis of customer product reviews using NLP techniques.
  • Developed and implemented predictive models of user behavior data on websites, URL categorization, social network analysis, social mining, and search content based on large-scale Machine Learning.
  • Developed predictive models on large-scale datasets to address various business problems by leveraging advanced statistical modeling, machine learning, and deep learning.
  • Worked on ingesting data streams from Kafka sourced from various source systems, applying data validations on the stream, and loading the data into HDFS.
  • Introduced new technologies such as Spring Boot, Apache Flink, Apache Ignite, Apache Kafka, AWS, and AWS Lambda.
  • Extensively used Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn, SciPy, and NLTK in Python for developing various machine learning algorithms.
  • Used the R programming language for graphically critiquing the datasets and gaining insights into the nature of the data.
  • Researched Deep Learning approaches to implementing NLP.
  • Applied clustering, NLP, and neural networks; visualized and presented the results using interactive dashboards.
  • Involved in the transformation of files from GitHub to DSX.
  • Involved in the execution of CSV files in Data Science Experience (DSX).
  • A major part of the project involved importing the converted CSV files into Confidential's internal API, the InfoSphere Information Governance Catalog.
  • Used Beautiful Soup for web scraping (parsing the data).
  • Developed the code to capture the descriptions under the headings of the index section into the description column of each CSV row.
  • Used other Python libraries such as PDFMiner, PyPDF2, PDFQuery, and sqlite3.
  • Converted Unicode to the nearest possible ASCII string using the Unidecode module.
  • Added a column to each CSV row giving the parent index number of that row.
  • Responsible for applying machine learning techniques (regression/classification) to predict outcomes.
  • Responsible for the design and development of advanced R/Python programs to prepare, transform, and harmonize data sets in preparation for modeling.
  • Designed and automated the process of score cuts that achieve increased close and good rates using advanced R programming.
  • Managed datasets using Pandas data frames and MySQL; queried the MySQL relational database (RDBMS) from Python using the Python-MySQL connector (MySQLdb) package to retrieve information (see the query sketch after this list).
  • Utilized standard Python modules such as csv, itertools, and pickle for development.
  • Tech stack: Python 2.7, PyCharm, Anaconda, Pandas, NumPy, unittest, R, and Oracle.
  • Developed large data sets from structured and unstructured data. Performed data mining.
  • Partnered with modelers to develop data frame requirements for projects.
  • Utilized Convolutional Neural Networks to implement a machine learning image recognition component.
  • Performed ad-hoc reporting, customer profiling, and segmentation using R/Python.
  • Tracked various campaigns, generating customer profiling analyses and performing data manipulation.
  • Provided Python programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
  • Analyzed large datasets to answer business questions by generating reports and outcomes.
  • Worked with a team of programmers and data analysts to develop insightful deliverables that support data-driven marketing strategies.
  • Executed SQL queries from R/Python on complex table configurations.
  • Retrieved data from the database through SQL as per business requirements.
  • Used the Oozie workflow scheduler to schedule different MapReduce jobs.
  • Involved in managing and reviewing Hadoop log files.
  • Created, maintained, modified, and optimized SQL Server databases.
  • Manipulated data using Python programming.
  • Adhered to best practices for project support and documentation.
  • Understood the business problem, built hypotheses, and validated them using the data.
  • Managed the reporting/dashboarding for the key metrics of the business.
  • Involved in data analysis using different analytic and modeling techniques.
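A minimal sketch of the ElementTree-to-CSV parsing step referenced above; the tag names, attributes, and file paths are hypothetical placeholders, not the actual document schema:

```python
import csv
import xml.etree.ElementTree as ET

# Hypothetical structure: <document><section heading="..."><p>...</p></section></document>
tree = ET.parse("processed_document.xml")   # XML previously derived from a PDF
root = tree.getroot()

with open("document.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading", "description"])
    for section in root.iter("section"):
        heading = section.get("heading", "")
        # Join the text of all paragraph children into one description cell.
        description = " ".join(p.text or "" for p in section.findall("p"))
        writer.writerow([heading, description])
```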
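And a minimal sketch of the MySQL query step referenced above, pulling a result set into a Pandas data frame via the MySQLdb connector; the connection parameters, table, and columns are hypothetical:

```python
import MySQLdb            # the Python-MySQL (MySQLdb) connector package
import pandas as pd

# Hypothetical connection parameters.
conn = MySQLdb.connect(host="localhost", user="analyst",
                       passwd="secret", db="campaigns")
try:
    # Read the query results straight into a Pandas data frame for analysis.
    df = pd.read_sql("SELECT customer_id, segment, revenue FROM profiles", conn)
    print(df.groupby("segment")["revenue"].mean())
finally:
    conn.close()
```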

Confidential, New Brunswick, NJ

Data Scientist

Responsibilities:

  • Performed exploratory data analysis, data visualization, and feature selection using Python and Apache Spark.
  • Scaled Scikit-learn machine learning algorithms using Apache Spark.
  • Used techniques such as Fast Fourier Transforms, Convolutional Neural Networks, and deep learning.
  • Developed deep convolutional and recurrent neural networks with TensorFlow; significant Risk Management and Quantitative Finance experience.
  • Used multiple machine learning algorithms, including random forests and boosted trees, SVM, SGD, neural networks, and deep learning using TensorFlow.
  • Used Python, Convolutional Neural Networks (CNN), Deep Belief Networks (DBN), Theano, Caffe, etc.
  • Applied unsupervised and supervised learning methods in analyzing high-dimensional data. Proficient use of the Python Scikit-learn, Pandas, and NumPy packages.
  • Performed data modeling operations using Power BI, Pandas, and SQL.
  • Utilized the Python libraries wxPython, NumPy, Twisted, and Matplotlib.
  • Used Python libraries such as Beautiful Soup and Matplotlib.
  • Developed and implemented predictive models of user behavior data on websites, URL categorization, social network analysis, social mining, and search content based on large-scale Machine Learning.
  • Wrote scripts in Python using Apache Spark and the ElasticSearch engine for use in creating dashboards visualized in Grafana.
  • Led development for Natural Language Processing (NLP) initiatives with chatbots and virtual assistants.
  • Converted Pandas data frames to Apache Spark data frames.
  • Observed the setup and monitoring of a scalable distributed system based on HDFS, and worked closely with the team to understand the business requirements and add new support features.
  • Gathered business requirements to determine feasibility and converted them into technical tasks in the design document.
  • Installed and configured Hadoop MapReduce jobs and HDFS, developed multiple MapReduce jobs in Java, and used different UDFs for data cleaning and processing.
  • Involved in loading data from the Linux file system to HDFS.
  • Collaborated with engineers to deploy successful models and algorithms into production environments.
  • Collaborated with a diverse team that includes statisticians, the Chief Science Officer, and engineers to build data science project pipelines and algorithms to derive valuable insights from current and new datasets.
  • Used PySpark data frames to read text, CSV, and image data from HDFS, S3, and Hive (see the pipeline sketch after this list).
  • Cleaned input text data using the PySpark machine learning feature extraction API.
  • Created features to train algorithms.
  • Used various algorithms of the PySpark ML API.
  • Trained the model using historical data stored in HDFS and Amazon S3.
  • Used Spark Streaming to load the trained model and predict on real-time data from Kafka (see the streaming sketch after this list).
  • Stored the result in MongoDB.
  • Utilized various new supervised and unsupervised machine learning algorithms/software to perform NLP tasks and compare performance.
  • The web application picks up the data stored in MongoDB.
  • Used Apache Zeppelin for visualization of Big Data.
  • Fully automated job scheduling, monitoring, and cluster management without human intervention using Airflow.
  • Exposed Apache Spark as a web service using Flask. Worked with input file formats such as ORC, Parquet, JSON, and Avro.
  • Developed highly scalable classifiers and tools by leveraging machine learning, Apache Spark, and deep learning.
  • Wrote Spark SQL UDFs and Hive UDFs.
  • Optimized Spark code using Apache Spark performance tuning.
  • Optimized machine learning algorithms based on need.
  • Used Amazon Elastic MapReduce (EMR) to process a huge number of datasets using Apache Spark and TensorFlow.
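A minimal sketch of the PySpark ML training pipeline referenced above, assuming a hypothetical dataset with 'text' and 'label' columns and hypothetical HDFS/S3 paths:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("text-pipeline-sketch").getOrCreate()

# Historical training data with 'text' and binary 'label' columns (hypothetical path).
train = spark.read.csv("hdfs:///data/history.csv", header=True, inferSchema=True)

# Feature extraction stages from the PySpark ML API, chained into one pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[tokenizer, tf, idf, lr]).fit(train)
model.write().overwrite().save("s3a://models/text-lr")   # hypothetical S3 path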
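And a sketch of the real-time Kafka scoring step, shown here as a Structured Streaming variant rather than the DStream API; the broker, topic, and model path are hypothetical, the console sink stands in for the MongoDB sink, and the spark-sql-kafka package is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("kafka-scoring-sketch").getOrCreate()

# Load the previously trained pipeline model (hypothetical path).
model = PipelineModel.load("s3a://models/text-lr")

# Read the real-time stream from Kafka (hypothetical broker and topic).
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS text"))

# Score each micro-batch with the trained pipeline and write predictions out.
scored = model.transform(stream).select("text", "prediction")
query = scored.writeStream.format("console").start()
query.awaitTermination()
```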

Confidential

Data Scientist

Responsibilities:

  • Performed data collection (turning raw data into structured data), data cleaning (parsing, data transformation, duplicate elimination, dealing with data leakage, statistical modeling), and feature engineering (feature creation, feature selection via feature importance, and feature extraction/dimensionality reduction: PCA, t-SNE).
  • Used Excel, MATLAB, and Python packages (Pandas, Matplotlib, Seaborn, Plotly, Plotnine) for data visualization; used t-SNE for visualizing high-dimensional data.
  • Used the NumPy, Pandas, Matplotlib, and Scikit-learn libraries in Python at various stages for developing machine learning models such as Linear Regression, Logistic Regression, Support Vector Machines (SVM), K-NN, Naive Bayes, Decision Trees, Ensembles (bagging, boosting), and Random Forest. Evaluated the performance of different models using loss (MAE, MSE, RMSE), accuracy (ACC), precision/recall/F1 score, and AUC/ROC.
  • Built prediction systems using time series data with Moving Average (MA), Autoregressive (AR), and Autoregressive Integrated Moving Average (ARIMA) models, nonlinear (polynomial) regression, and K-Nearest Neighbors (KNN).
  • Performed text analysis on the reviews using NLP models such as bag-of-words, TF-IDF, word2vec (embeddings), Latent Dirichlet Allocation (LDA), and (Hidden) Markov Models.
  • Performed statistical analysis (hypothesis testing, t-tests, ANOVA) using MATLAB/SPSS/R.
  • Experienced in data scraping using Beautiful Soup (with lxml), Scrapy, and Requests.
  • Used TensorFlow (Keras) and PyTorch to build Convolutional Neural Networks (CNNs) for image classification.
  • Used data augmentation transformations (rotation, flip, shift, scale) to deal with small image datasets. Used pre-trained Keras models (Inception V3, MobileNet) for transfer learning to achieve better results. Ran the deep learning models in the cloud (Colab/Kaggle for small datasets, AWS/GCP for large datasets).
  • Implemented machine learning algorithms (XGBoost, LightGBM, CatBoost, deep neural networks) for classification. Dealt with imbalanced data by changing the performance metric, changing the algorithm, resampling (over-sampling the minority class, under-sampling the majority class), and the Synthetic Minority Oversampling Technique (SMOTE); see the sketch after this list.
  • Experienced in implementing customer segmentation systems using unsupervised learning techniques such as K-means, DBSCAN, and EM.
  • Built an anomaly detection system for fraud using XGBoost, deep autoencoders (unsupervised), Generative Adversarial Networks (GANs), and cluster analysis/density-based techniques.
  • Performed sentiment analysis on reviews using 1D convolutional layers (Conv1D) and recurrent neural networks (LSTM/GRU); also worked on implementing state-of-the-art algorithms such as Google's BERT to improve performance.
  • Built recommendation systems for customers, such as content-based filtering and collaborative filtering (using neural networks with word/ID embeddings and Matrix Factorization).
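A minimal sketch of the SMOTE workflow referenced above, with synthetic data from make_classification standing in for the real dataset and an illustrative XGBoost model:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic imbalanced data (95% / 5%) stands in for the real dataset.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class on the training split only, never the test split.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_res, y_res)

# With imbalanced classes, judge the model by F1 and AUC rather than raw accuracy.
pred = clf.predict(X_test)
print("F1 :", f1_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```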

Confidential

Data Analyst

Responsibilities:

  • Collaborated with internal stakeholders to understand business challenges and develop analytical solutions to optimize business processes.
  • Performed analysis using industry leading text mining, data mining, and analytical tools and open source software.
  • Built and trained a deep learning network using TensorFlow on the data, reducing wafer scrap by 15% by predicting the likelihood of wafer damage. A combination of the z-plot features, image features (pigmentation), and probe features was used.
  • Experienced in Artificial Neural Networks (ANNs) and Deep Learning models using the Theano, TensorFlow, and Keras packages in Python (see the sketch after this list).
  • Used Natural Language Processing (NLP) to pre-process the data, determine the number of words and topics in the emails, and form clusters of words.
  • Used MATLAB and C/C++ with OpenCV, with SVM, Neural Networks, and Random Forest as classifiers.
  • Generated graphical reports using the Python packages NumPy and Matplotlib.
  • Built various graphs for business decision making using the Python Matplotlib library.
  • Knowledge of Information Extraction and NLP algorithms coupled with Deep Learning (ANN and CNN), Theano, Keras, and TensorFlow.
  • Cleaned input text data using the PySpark machine learning feature extraction API.
  • Used Pandas data frames for exploratory data analysis on a sample dataset.
  • Wrote Scikit-learn based machine learning algorithms for building POCs on a sample dataset.
  • Analyzed structured, semi-structured, and unstructured datasets using MapReduce and Apache Spark.
  • Implemented an end-to-end Lambda Architecture to analyze streaming and batch datasets.
  • Used Apache Mahout's scalable machine learning algorithms for building a recommendation engine and for building classification and regression models.
  • Converted Mahout's machine learning algorithms to RDD-based Apache Spark MLlib to improve performance.
  • Optimized machine learning algorithms based on need.
  • Built automatic music/news/POI recommendation inside the vehicle using GPS location, passenger conversation, behavior, and mood, based on machine learning and natural language processing.
  • Built a smart state-of-charge monitor for electric vehicles based on Recurrent Neural Networks and Seq2Seq forecasting.
  • Built multiple machine learning features using Python, Scala, and Java based on need.
  • Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Migrated single-machine machine learning algorithms to parallel processing algorithms.
  • Developed Hive queries for ad-hoc analysis.
  • Used Amazon Elastic MapReduce (EMR) to process a huge number of datasets using Apache Spark and TensorFlow.
  • Lead Data Scientist for the development of Machine Learning and NLP engines utilizing population health data.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and weblogs into HDFS using Sqoop and Flume.
  • Involved in building complex streaming data pipelines using Kafka and Apache Spark.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Optimized Hive queries.
  • Optimized MapReduce and Apache Spark jobs.
  • Wrote custom input formats in MapReduce to analyze an image dataset.
  • Wrote Hive UDFs based on need.
  • Developed end-to-end enterprise applications using the Spring MVC, REST, and JDBC Template modules.
  • Wrote well-designed, testable, efficient Java code.
  • Understood and analyzed complex issues and addressed challenges arising during the software development process, both conceptually and technically.
  • Implemented best practices of Automated Build, Test and Deployment.
  • Developed design patterns, data structures and algorithms based on project need.
  • Worked on multiple tools such as Toad, Eclipse, SVN, Apache and Tomcat.
  • Deployed models via APIs into applications or workflows, and deployed JARs into the application server.
  • Worked on user interface technologies such as HTML5 and CSS/SCSS.
  • Wrote stored procedures and SQL queries based on project need.
  • Built and deployed unit tests using flexible/open-source frameworks.
  • Developed multi-threaded and transaction-handling code (JMS, database).
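A minimal sketch of the kind of Keras ANN referenced above; the synthetic features and damage label are hypothetical stand-ins for the confidential wafer data:

```python
import numpy as np
from tensorflow import keras

# Synthetic tabular data stands in for the confidential wafer features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")   # hypothetical damage label

# A small feed-forward ANN: dense layers, dropout, sigmoid output.
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)

print(model.evaluate(X, y, verbose=0))   # [loss, accuracy]
```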
