Data Scientist / Machine Learning Engineer Resume
SUMMARY
- 6+ years of extensive experience in the IT industry as a Data Scientist/Machine Learning Engineer and Data Analyst, with proficiency in Machine Learning, NLP, Data Analysis and Visualization, Deep Learning, Big Data, Text Mining, Data Engineering, and Business Intelligence/ETL.
- Experience in text understanding, classification, pattern recognition, and recommendation systems using Python's NLTK library.
- Deep understanding of Statistical Modelling and Multivariate Analysis; highly proficient in dimensionality-reduction methods such as PCA (Principal Component Analysis).
- Knowledge of Hadoop core components (HDFS, MapReduce) and the Hadoop ecosystem (Sqoop, Hive, Pig).
- Good knowledge of creating and monitoring Hadoop clusters on Amazon EC2, VMs, Hortonworks, and Cloudera.
- Worked with NoSQL databases including HBase, Cassandra, and MongoDB.
- Experience in foundational machine learning models and concepts: regression, random forests, boosting, GBMs, neural networks, HMMs, and deep learning.
- Comfortable with statistical concepts such as hypothesis testing, ANOVA, t-tests, correlation, A/B testing, experimental design, and time series analysis.
- Implemented deep learning models and numerical computation with data flow graphs using TensorFlow.
- Experience identifying and interpreting trends in datasets and developing multiple reports/dashboards (line charts, bar charts, donut charts, box plots, geo-maps, bubble charts, tree maps, etc.) to visualize data.
- Expertise in transforming business requirements into analytical models, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
- Experience in data preprocessing: handling class imbalance, missing-value treatment, outlier treatment, and feature scaling (a minimal sketch follows this section).
- Experience with object-oriented programming (OOP) in Python; extensive SQL experience in querying, data extraction, and data transformation.
- Extracted and migrated data from various database sources such as Oracle, SQL Server, MySQL, and Teradata.
- Experienced with ETL process management, data modeling, data wrangling, and data warehouse architecture.
- Applied advanced statistical and predictive modeling techniques to build, maintain, and improve multiple real-time decision systems.
- Identified available and relevant data, including internal and external data sources, and leveraged new data collection processes.
- Worked closely with product managers, service development managers, and product development teams to productize the algorithms developed.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Proven ability to work simultaneously on multiple projects, as a team player and as an individual contributor, with strong adaptability to new technologies.
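A minimal sketch of the preprocessing steps listed above (missing-value treatment, outlier-robust scaling, and class-imbalance handling), assuming scikit-learn; the dataset is synthetic and all names are illustrative:

```python
# Minimal sketch of the preprocessing steps above; the data is synthetic
# and stands in for a real tabular dataset.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[rng.random(X.shape) < 0.05] = np.nan   # inject missing values
y = (rng.random(500) < 0.1).astype(int)  # imbalanced labels, ~10% positive

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # missing-value treatment
    ("scale", RobustScaler()),                     # scaling robust to outliers
    # class_weight="balanced" reweights classes to counter the imbalance
    ("clf", LogisticRegression(class_weight="balanced")),
])
pipe.fit(X, y)
```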
TECHNICAL SKILLS
Coding: Python (NumPy, Pandas, scikit-learn, NLTK), MATLAB, SAS, MySQL, R, Minitab, etc.
Visualization: Tableau, Python (Matplotlib, Seaborn, Plotly, Cufflinks), R (ggplot), MS Excel, Microsoft Power BI.
Microsoft Office Suite: MS Excel (VBA, macros, pie charts, bar charts, pivot tables), MS Word, MS PowerPoint.
IDE: Anaconda, RStudio, Visual Studio Code, Jupyter Notebook, Azure Databricks, Amazon SageMaker.
Big Data: Spark (PySpark, Spark Streaming, Spark MLlib, Spark SQL), HDFS, Pig, Hive, HBase, Sqoop.
Graduate Coursework: Data Mining, Business Analytics, Project Management, Time Series Analysis, Applied Data Science, Game Theory.
Soft Skills: Leadership, teamwork, analytical thinking, attention to detail, and problem-solving.
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist / Machine Learning Engineer
Responsibilities:
- Built customer segmentation and recommender system models using collaborative filtering.
- Used K-Means clustering and LightGBM models for segmentation problems.
- Used Spark DataFrames and Big Data technologies such as PySpark, Spark SQL, and Spark MLlib extensively, and developed ML algorithms using MLlib.
- Utilized the Amazon EMR Big Data platform to analyze large volumes of customer data, develop clusters, and find possible customer segments.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Databricks and Amazon SageMaker on AWS; implemented Python-based distributed clustering via PySpark Streaming.
- Used Amazon EMR with Hive, Pig, Spark, and MapReduce for batch analytics, scheduling daily/weekly/monthly jobs.
- Worked with Amazon Redshift, Athena, and Amazon EMR with Presto and Spark for interactive analytics.
- Worked on feature engineering and data preprocessing using PySpark functions
- Performed cross-validation and grid search on the model, which achieved 90% accuracy between recommendations and actual sales.
- Used a CI/CD pipeline with Git to deploy machine learning models.
- Enhanced data collection procedures to include information relevant for building analytic systems; processed, cleansed, and verified the integrity of data used for analysis.
- Performed thorough EDA, including univariate and bivariate analysis, to understand the intrinsic and combined effects of the features.
- Performed dimensionality reduction using near-zero-variance and correlation techniques.
- Used Tableau for data visualization, creating reports and dashboards for insights and business process improvement.
- Worked with technical and development teams to deploy models
- Built model performance reports and modeling technical documentation to support each of the models for the product line.
- Built a recommender system that utilizes previous transaction data and can be used in online/offline mode (a minimal sketch follows this list).
- Used RMSE, F-score, precision, recall, and A/B testing to evaluate the recommender's performance in both simulated and real-world environments.
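A minimal sketch of how such a collaborative-filtering recommender could be trained and evaluated with Spark MLlib's ALS (ALS is one common choice, not necessarily the exact model used here); the input path and column names are hypothetical:

```python
# Minimal sketch: collaborative filtering with Spark MLlib's ALS.
# The S3 path and the user_id/item_id/rating columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("recommender").getOrCreate()

# Assumed schema: one row per (user, item, rating) transaction.
ratings = spark.read.parquet("s3://bucket/transactions.parquet")
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(
    userCol="user_id",
    itemCol="item_id",
    ratingCol="rating",
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    rank=10,
    regParam=0.1,
)
model = als.fit(train)

# Evaluate with RMSE, as in the bullet above.
predictions = model.transform(test)
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print(f"RMSE: {rmse:.3f}")

# Top-10 item recommendations per user, for offline serving.
user_recs = model.recommendForAllUsers(10)
```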
Confidential, Springfield, MA
Data Science/ Data Engineer
Responsibilities:
- Worked collaboratively with senior management to identify potential machine learning use cases and to set up a server-side development environment.
- Improved the performance and optimization of existing algorithms in Hadoop using Spark, including SparkContext, Spark SQL, and pair RDDs.
- Worked on batch processing of data sources using Apache Spark and Elasticsearch.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Worked on migrating MapReduce programs to the Spark DataFrame API and Spark SQL to improve performance.
- Mastered key facets of data investigation, including data wrangling, cleaning, sampling, management, exploratory analysis, regression and classification, prediction, and data communication.
- Performed text analytics and text mining; developed the entire application as a service with a REST API using Flask.
- Extensively used Python's and Spark's data science packages, including Pandas, NumPy, Matplotlib, SciPy, scikit-learn, and NLTK.
- Used similarity measures such as Jaro distance, Euclidean distance, and Manhattan distance.
- Performed entity tagging with the Stanford NER Tagger and used named-entity recognition packages such as spaCy.
- Used Principal Component Analysis for dimensionality reduction of features.
- Performed nested cross-validation to compare the performance of different models and used the results for model optimization (sketched after this list).
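A minimal sketch of the nested cross-validation described above, assuming scikit-learn; the dataset, model, and hyperparameter grid are illustrative stand-ins:

```python
# Minimal sketch of nested cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: grid search selects hyperparameters.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=inner_cv,
)

# Outer loop: estimates the generalization error of the whole tuning
# procedure, so the reported score is not biased by the search itself.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(grid, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```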
Confidential
Data Analyst
Responsibilities:
- Analyzed data and turned it into actionable business insights and strategies.
- Developed complex SQL queries to analyze and understand data and to bring data together from various systems.
- Used joins such as inner joins and outer joins while creating tables from multiple tables (see the join sketch after this list).
- Worked with SQL and PL/SQL procedures, functions, stored procedures, and packages.
- Implemented indexes, collected statistics, and applied constraints while creating tables.
- Developed SQL queries for retrieving data, updating data, database testing and data analysis.
- Enhanced data collection procedures to include information relevant for building analytic systems; processed, cleansed, and verified the integrity of data used for analysis.
- Used advanced Microsoft Excel features to create pivot tables and used VLOOKUP and other Excel functions.
- Built and maintained on-demand custom (ad-hoc) reports and scheduled reports in response to internal and external users.
- Performed various data transformations, extracting, transforming, and loading (ETL) data using Informatica.
- Applied data wrangling techniques to convert unstructured data into structured formats.
- Developed reusable transformations and mapplets to avoid redundant logic.
- Worked with various transformations, including Router, Joiner, Update Strategy, Lookup, Rank, Expression, Aggregator, Sequence Generator, and Sorter transformations.
- Created ETL mappings using Informatica PowerCenter to move data from multiple sources, such as flat files and Oracle, into common target areas such as Data Marts and the Data Warehouse.
- Developed source-to-target data mappings for the ETL team, with physical naming standards, data types, volumetrics, domain definitions, and corporate metadata definitions.
- Developed SQL scripts to validate the data loaded into Data Warehouse and Data Mart tables by Informatica ETL.
- Performed daily data analysis and prepared reports on a daily, weekly, monthly, and quarterly basis.
- Guided new team members, explaining the process flow of the analysis and the standards and structural layout followed.
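A minimal sketch of the kind of multi-table join used in this work, with SQLite standing in for the production database; the tables and columns are hypothetical:

```python
# Minimal sketch of inner/outer joins using Python's built-in sqlite3;
# the tables and columns are hypothetical stand-ins for the real sources.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 75.5);
""")

# LEFT OUTER JOIN keeps customers with no orders (Globex shows 0 orders).
rows = conn.execute("""
    SELECT c.name, COUNT(o.id), COALESCE(SUM(o.amount), 0)
    FROM customers c
    LEFT OUTER JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2, 325.5), ('Globex', 0, 0)]
```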
Confidential
Python Developer
Responsibilities:
- Developed frontend and backend modules using Python on the Django web framework.
- Worked on designing, coding, and developing the application in Python using Django's MVT pattern.
- Wrote functional API test cases for testing REST APIs with Postman and integrated them with a Jenkins server through build scripts.
- Used the Python library Beautiful Soup for web scraping to extract data for building graphs (see the scraping sketch after this list).
- Performed troubleshooting and deployed many Python bug fixes for the two main applications that served as the main source of data for both customers and the internal customer service team.
- Created RESTful web services for Catalog and Pricing with Django MVT, MySQL, and MongoDB.
- Developed Python APIs to dump the array structures in the processor at the failure point for debugging.
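A minimal sketch of the Beautiful Soup scraping flow, with a placeholder URL and selector; it assumes a simple HTML table whose second column holds numeric values:

```python
# Minimal sketch of web scraping with requests + Beautiful Soup;
# the URL and the table layout are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/metrics", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# Extract (label, value) pairs from table rows for later charting;
# assumes each row has exactly two cells and a numeric second cell.
data = []
for row in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 2:
        data.append((cells[0], float(cells[1])))

print(data)
```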