Data Scientist Resume
Lansdale, PA
SUMMARY
- 8+ years of experience in Machine Learning and Data Mining with large sets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, and Data Visualization.
- Extensive experience in Text Analytics and in developing Statistical Machine Learning and Data Mining solutions to various business problems, generating data visualizations using R, Python, and Tableau.
- Technical proficiency in designing and data modeling online applications; Solution Lead for architecting Data Warehouse/Business Intelligence applications.
- Proficient in Machine Learning techniques (Decision Trees, Linear Regression, Logistic Regression, Random Forest, SVM, Bayesian, XGBoost, K-Nearest Neighbors) and Statistical Modeling in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor Analysis/PCA, and Ensembles.
- Strong experience in the Software Development Life Cycle (SDLC), including Requirements Analysis, Design Specification, and Testing, in both Waterfall and Agile methodologies.
- Expertise in Python programming with various packages including NumPy, Pandas, SciPy, and scikit-learn.
- Worked with several Python packages such as NumPy, matplotlib, Beautiful Soup, pickle, PySide, SciPy, wxPython, and PyTables.
- Worked on deployment tools such as Azure Machine Learning Studio, Oozie, and AWS Lambda.
- Worked with and extracted data from various database sources such as Oracle, SQL Server, and Teradata.
- Experience in foundational machine learning models and concepts (Regression, Boosting, GBM, NNs, HMMs, CRFs, MRFs, Deep Learning).
- Regularly used JIRA and other internal issue trackers for project development.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design, and implementing RDBMS-specific features.
- Facilitated and helped translate complex quantitative methods into simplified solutions for users.
- Experience with Proofs of Concept and gap analysis; gathered the necessary data for analysis from different sources and prepared data for exploration using data munging.
- Worked with different data formats such as JSON and XML, and applied machine learning algorithms in Python.
TECHNICAL SKILLS
Exploratory Data Analysis: Univariate/Multivariate Outlier detection, Missing value imputation, Histograms/Density estimation, EDA in Tableau
Supervised Learning: Linear/Logistic Regression, Lasso, Ridge, Elastic Nets, Decision Trees, Ensemble Methods, Random Forests, Support Vector Machines, Gradient Boosting, XGB, Deep Neural Networks, Bayesian Learning
Unsupervised Learning: Principal Component Analysis, Association Rules, Factor Analysis, K-Means, Hierarchical Clustering, Gaussian Mixture Models, Market Basket Analysis, Collaborative Filtering and Low Rank Matrix Factorization
Feature Selection: Stepwise, Recursive Feature Elimination, Relative Importance, Filter Methods, Wrapper Methods and Embedded Methods
Statistical Tests: T-Test, Chi-Square tests, Stationarity tests, Auto Correlation tests, Normality tests, Residual diagnostics, Partial dependence plots and ANOVA
Sampling Methods: Bootstrap sampling methods and Stratified sampling
Model Tuning/Selection: Cross Validation, Walk Forward Estimation, AIC/BIC criteria, Grid Search and Regularization (illustrated in the sketch after this skills list)
Time Series: ARIMA, Holt winters, Exponential smoothing, Bayesian structural time series
R: caret, glmnet, forecast, xgboost, rpart, survival, arules, sqldf, dplyr, nloptr, lpSolve, ggplot.
SAS: Forecast server, SAS Procedures and Data Steps.
Spark: MLlib, GraphX.
SQL: Subqueries, joins, DDL/DML statements.
Databases/ETL/Query: Teradata, SQL Server, Redshift, Postgres and Hadoop (MapReduce); SQL, Hive, Pig and Alteryx
Visualization: Tableau, ggplot2 and RShiny
Prototyping: PowerPoint, RShiny and Tableau
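The cross-validated grid-search workflow listed under Model Tuning/Selection can be pictured with the following minimal scikit-learn sketch; the synthetic dataset, model choice, and parameter grid are assumptions made purely for demonstration, not project specifics.

```python
# Illustrative sketch: cross-validated grid search over a gradient-boosting classifier.
# The synthetic data and hyperparameter grid are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross validation
    scoring="roc_auc",
    n_jobs=-1,
)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Held-out score:", search.score(X_test, y_test))
```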
PROFESSIONAL EXPERIENCE
Confidential, Lansdale, PA
Data Scientist
Responsibilities:
- Utilized Apache Spark with Python to develop and execute Big Data Analytics and Machine Learning applications; executed machine learning use cases under Spark ML and MLlib (a minimal pipeline sketch follows this list).
- Served as solutions architect for transforming business problems into Big Data and Data Science solutions and for defining the Big Data strategy and roadmap.
- Identified areas of improvement in existing business by unearthing insights from vast amounts of data using machine learning techniques with TensorFlow, Scala, Spark, MLlib, Python, and other tools and languages as needed.
- Created and validated machine learning models with Azure Machine Learning.
- Designed a machine learning pipeline using Microsoft Azure Machine Learning for predictive and prescriptive analytics, and implemented machine learning scenarios for given data problems.
- Used Scala for coding the components in Play and Akka.
- Worked on machine learning models such as Logistic Regression, Multilayer Perceptron classifiers, and K-means clustering by creating Scala SBT packages and running them in the Spark shell (Scala), and built an autoencoder model using R.
- Set up and configured AWS EMR clusters and used Amazon IAM to grant users fine-grained access to AWS resources.
- Created detailed AWS Security Groups, which behaved as virtual firewalls controlling the traffic allowed to reach one or more AWS EC2 instances.
- Wrote scripts and an indexing strategy for a migration to Redshift from Postgres 9.2 and MySQL databases.
- Wrote Kinesis agents to pipe data from streaming app into S3.
- Good knowledge of Azure cloud services, Azure Storage, Azure Active Directory, and Azure Service Bus; created and managed Azure AD tenants, configured application integration with Azure AD, and integrated on-premises Windows AD identity with Azure Active Directory.
- Working knowledge of Azure Fabric, microservices, IoT, and Docker containers in Azure; Azure infrastructure management and PaaS solution architecture (Azure AD, licenses, Office 365, cloud DR using Azure Recovery Vault, Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).
- Utilized Spark, Scala, Hadoop, HBase, Cassandra, MongoDB, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction, and used the resulting engine to increase user lifetime by 45% and triple user conversions for target categories.
- Designed and developed NLP models for sentiment analysis.
- Led discussions with users to gather business process and data requirements to develop a variety of Conceptual, Logical, and Physical Data Models; expert in Business Intelligence and Data Visualization tools: Tableau and MicroStrategy.
- Performed Multinomial Logistic Regression, Random Forest, Decision Tree, and SVM to classify whether a package would be delivered on time for a new route; performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database.
- Worked on machine learning over large-scale data using Spark and MapReduce.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
- Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Developed Data Mapping, Data Governance, Transformation and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS and OLAP.
- Extracted, transformed, and loaded data sources to generate CSV data files with Python programming and SQL queries.
- Stored and retrieved data from data warehouses using Amazon Redshift.
- Worked on Teradata SQL queries, Teradata indexes, and utilities such as MLoad, TPump, FastLoad, and FastExport.
- Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, regression models, neural networks, SVM, and clustering, to identify volume, using the scikit-learn package in Python and MATLAB.
- Used Data Warehousing concepts such as the Ralph Kimball and Bill Inmon methodologies, OLAP, OLTP, Star Schema, Snowflake Schema, Fact Tables, and Dimension Tables.
- Refined time-series data and validated mathematical models using analytical tools such as R and SPSS to reduce forecasting errors.
- Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python (an imputation sketch follows this list).
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
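The Spark ML/MLlib work above can be pictured with the following minimal PySpark sketch of a classification pipeline (logistic regression for an on-time-delivery label); the toy DataFrame, column names, and label are assumptions for demonstration only, not the actual project data.

```python
# Illustrative Spark ML sketch: assemble features and fit a logistic regression.
# Column names and the tiny in-memory DataFrame are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("on-time-delivery").getOrCreate()

df = spark.createDataFrame(
    [(12.0, 3.0, 1.0, 1), (45.0, 8.0, 0.0, 0), (30.0, 5.0, 1.0, 1)],
    ["distance_km", "stops", "weekend", "on_time"],
)

assembler = VectorAssembler(
    inputCols=["distance_km", "stops", "weekend"], outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="on_time")
model = Pipeline(stages=[assembler, lr]).fit(df)

auc = BinaryClassificationEvaluator(labelCol="on_time").evaluate(model.transform(df))
print(f"Training AUC: {auc:.2f}")
```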
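The missing-value imputation mentioned above is illustrated by the following minimal pandas sketch; the toy DataFrame and the median/most-frequent strategies are assumptions for demonstration.

```python
# Illustrative imputation sketch: median for a numeric gap, mode for a categorical gap.
# The DataFrame is hypothetical; only the general approach reflects the work described.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "distance_km": [12.0, np.nan, 30.0, 45.0],
    "carrier": ["A", "B", np.nan, "A"],
})

# Fill numeric gaps with the column median, categorical gaps with the most frequent value.
df["distance_km"] = df["distance_km"].fillna(df["distance_km"].median())
df["carrier"] = df["carrier"].fillna(df["carrier"].mode()[0])
print(df)
```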
Environment: Hortonworks Hadoop (MapReduce), PySpark, Spark, R, Spark MLlib, Tableau, Informatica, SQL, Excel, VBA, BO, CSV, Erwin, SAS, AWS Redshift, ScalaNLP, Cassandra, Oracle, MongoDB, Cognos, SQL Server 2012, Teradata, DB2, SPSS, T-SQL, PL/SQL, Flat Files, and XML.
Confidential, Raleigh, NC
Data Scientist
Responsibilities:
- Delivered machine learning projects based on Python, SQL, Spark, and advanced SAS programming; performed exploratory data analysis, data visualization, and feature selection.
- Applied machine learning algorithms, including random forests, boosted trees, SVM, SGD, neural networks, and deep learning, using CNTK and TensorFlow.
- Big data analytics with Hadoop, HiveQL, Spark RDD, and Spark SQL.
- Tested Python/SAS on AWS cloud service and CNTK modeling on MS-Azure cloud service.
- Created UI using JavaScript and HTML5/CSS.
- Developed and tested many features for dashboard using Python, Bootstrap, CSS, and JavaScript.
- Interacted with the ETL and BI teams to understand and support various ongoing projects.
- Used MS Excel extensively for data validation.
- Extensively used the open-source tools RStudio (R) and Spyder (Python) for statistical analysis and building machine learning models.
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Explored DAGs, their dependencies, and logs using Airflow pipelines for automation (a minimal DAG sketch follows this list).
- Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, R, and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Tracked operations with Airflow sensors until certain criteria were met.
- Responsible for data mapping activities from source systems to Teradata using utilities such as TPump, FEXP, BTEQ, MLOAD, and FLOAD.
- Developed PL/SQL procedures and functions to automate billing operations, customer barring, and number generation.
- Redesigned the workflows of Service Request and Bulk Service orders using UNIX cron jobs and PL/SQL procedures, thereby reducing order processing time and dropping average slippages per month by 40%.
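The Airflow automation described above can be pictured with the following minimal DAG sketch (assuming Airflow 2.x); the DAG id, schedule, task names, and callables are hypothetical placeholders rather than the actual pipelines.

```python
# Illustrative Airflow 2.x sketch: a two-task daily DAG with a simple dependency.
# DAG id, tasks, and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extract step")

def load():
    print("load step")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load  # load runs only after extract succeeds
```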
Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel (Pivot, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, Data Quality, Tableau and Reference Data Management.
Confidential - Providence, RI
Data Analyst /R Developer
Responsibilities:
- Responsible for applying machine learning techniques (regression/classification) to predict outcomes.
- Performed Ad-hoc reporting/customer profiling, segmentation using R/Python.
- Tracked various campaigns, generating customer profiling analysis and data manipulation.
- Provided R/SQL programming, with detailed direction, in the execution of data analysis that contributed to the final project deliverables. Responsible for data mining.
- Utilized label encoders in Python to convert non-numerical significant variables to numerical ones and identified their impact on pre- and post-acquisition metrics using a two-sample paired t-test (a minimal sketch follows this list).
- Worked with ETL SQL Server Integration Services (SSIS) for data investigation and mapping to extract data, applied fast parsing, and enhanced efficiency by 17%.
- Developed Data Science content involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, and ETL for data extraction.
- Applied breadth of knowledge in programming (Python, R), Descriptive, Inferential, and Experimental Design statistics, advanced mathematics, and database functionality (SQL, Hadoop).
- Involved in defining source-to-target business rules, data mappings, and data definitions.
- Performed data validation and data reconciliation between disparate source and target systems for various projects.
- Utilized a diverse array of technologies and tools as needed to deliver insights, such as R, SAS, MATLAB, Tableau, and more.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python with a broad variety of machine learning methods, including classification, regression, and dimensionality reduction.
- Used T-SQL queries to pull the data from disparate systems and Data warehouse in different environments.
- Worked closely with the Data Governance Office team in assessing the source systems for project deliverables.
- Extracted data from different databases as per business requirements using SQL Server Management Studio.
- Interacted with the ETL and BI teams to understand and support various ongoing projects.
- Used MS Excel extensively for data validation.
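The label-encoding and paired t-test bullet above is illustrated by the following minimal sketch; the toy data and column names are assumptions for demonstration only.

```python
# Illustrative sketch: encode a categorical variable and run a paired t-test
# on pre- vs. post-acquisition measurements. All data here is hypothetical.
import pandas as pd
from scipy import stats
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "segment": ["retail", "wholesale", "retail", "online"],
    "pre_acquisition": [10.2, 8.5, 9.1, 7.8],
    "post_acquisition": [11.0, 8.9, 9.6, 8.4],
})

# Convert the non-numeric variable to numeric codes for downstream modeling.
df["segment_code"] = LabelEncoder().fit_transform(df["segment"])

# Paired t-test: did the metric change significantly after the acquisition?
t_stat, p_value = stats.ttest_rel(df["pre_acquisition"], df["post_acquisition"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```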
Environment: Data Governance, SQL Server, Python, ETL, MS Office Suite - Excel(Pivot, VLOOKUP), DB2, R, Visio, HP ALM, Agile, Azure, MDM, Share point, Data Quality, Tableau and Reference Data Management.
Confidential - Memphis, TN
Data Analyst/Data Engineer
Responsibilities:
- Implemented user interface guidelines and standards throughout the development and maintenance of the website using HTML, CSS, JavaScript, and jQuery.
- Used Django to interface with jQuery UI and manage the storage and deletion of content.
- Used Hive queries for data analysis to meet business requirements.
- Involved with advanced CSS concepts and building table-free layouts.
- Used packages such as mock/patch and Beautiful Soup (bs4) to perform unit testing (a minimal test sketch follows this list).
- Used the pandas library for statistical analysis.
- Used NumPy for numerical analysis of insurance premiums.
- Worked on rebranding existing web pages for clients according to the type of deployment.
- Created UI using JavaScript and HTML5/CSS.
- Developed and tested many features for dashboard using Bootstrap, CSS, and JavaScript.
- Managed a small team of programmers using a modified version of agile development.
- Worked with the Jenkins continuous integration tool for project deployment.
- Updated the existing clipboard with new features as per client requirements.
- Performed Unit testing, Integration Testing, GUI and web application testing using Selenium.
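The mock/Beautiful Soup unit-testing bullet above can be pictured with the following minimal sketch; the module, function, URL, and HTML are hypothetical placeholders rather than the project's actual code.

```python
# Illustrative unit-test sketch: patch out the HTTP call and parse canned HTML
# with Beautiful Soup. Function, URL, and HTML content are hypothetical.
import unittest
from unittest.mock import Mock, patch

import requests
from bs4 import BeautifulSoup

def fetch_title(url):
    """Fetch a page and return its <title> text."""
    html = requests.get(url).text
    return BeautifulSoup(html, "html.parser").title.string

class FetchTitleTest(unittest.TestCase):
    @patch("requests.get")
    def test_title_is_parsed(self, mock_get):
        # The network call is replaced by a canned response object.
        mock_get.return_value = Mock(text="<html><title>Premium Quote</title></html>")
        self.assertEqual(fetch_title("http://example.com"), "Premium Quote")

if __name__ == "__main__":
    unittest.main()
```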
Environment: Django, HTML5, CSS, XML, Kafka, MySQL, JavaScript, Angular JS, Backbone JS, Nginx server, Amazon S3, Jenkins, Beautiful Soup, Eclipse, Git, GitHub, Linux, and Mac OS X.
Confidential
Data Analyst/Data Engineer
Responsibilities:
- Analyzed data sources and requirements and business rules to perform logical and physical data modeling.
- Analyzed and designed best fit logical and physical data models and relational database definitions using DB2. Generated reports of data definitions.
- Involved in Normalization/De-normalization, Normal Form and database design methodology.
- Maintained existing ETL procedures, fixed bugs and restored software to production environment.
- Developed the code as per the client's requirements using SQL, PL/SQL and Data Warehousing concepts.
- Involved in Dimensional modeling (Star Schema) of the Data warehouse and used Erwin to design the business process, dimensions and measured facts.
- Worked with Data Warehouse Extract and load developers to design mappings for Data Capture, Staging, Cleansing, Loading, and Auditing.
- Developed an enterprise data model management process to manage multiple data models developed by different groups.
- Designed and created Data Marts as part of a data warehouse.
- Wrote complex SQL queries for validating the data against different kinds of reports generated by Business Objects XIR2.
- Used the Erwin modeling tool for publishing a data dictionary, reviewing the model and dictionary with subject matter experts, and generating data definition language.
- Coordinated with the DBA on implementing database changes and updating data models with changes implemented in development, QA, and production; worked extensively with the DBA and reporting teams to improve report performance through appropriate indexes and partitioning.
- Developed Data Mapping, Transformation, and Cleansing rules for the Master Data Management Architecture involving OLTP, ODS, and OLAP.
- Performed code tuning and optimization using techniques such as dynamic SQL, dynamic cursors, and SQL query tuning, and wrote generic procedures, functions, and packages.
- Experienced in GUI, Relational Database Management System (RDBMS), designing of OLAP system environment as well as Report Development.
- Extensively used SQL, T-SQL and PL/SQL to write stored procedures, functions, packages and triggers.
- Prepared analytical data reports weekly, biweekly, and monthly using MS Excel, SQL, and UNIX.
Environment: ER Studio, Informatica PowerCenter 8.1/9.1, Power Connect/Power Exchange, Oracle 11g, Mainframes, DB2, MS SQL Server 2008, SQL, PL/SQL, SAS, Business Objects, XML, Tableau, Unix Shell Scripting, Teradata, Netezza, Aginity.
Confidential
SQL Developer
Responsibilities:
- Worked with internal architects, assisting in the development of current and target state data architectures.
- Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
- Involved in defining the business/transformation rules applied for sales and service data.
- Implemented the Metadata Repository, transformations, data quality maintenance, data standards, the Data Governance program, scripts, stored procedures, and triggers, and executed test plans.
- Defined the list codes and code conversions between the source systems and the data mart.
- Involved in defining source-to-target business rules, data mappings, and data definitions.
- Responsible for defining the key identifiers for each mapping/interface.
- Remained knowledgeable in all areas of business operations in order to identify systems needs and requirements.
- Performed data quality in Talend Open Studio.
- Maintained the Enterprise Metadata Library with any changes or updates.
- Documented data quality and traceability for each source interface.
- Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.
Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.