Data Scientist Resume
Mclean, VA
SUMMARY:
- 7 years of IT experience in the field of Data Analysis.
- Over 5 years of experience as a Data Scientist, with extensive experience in Data Mining, Statistical Data Analysis, Exploratory Data Analysis and Machine Learning on various forms of data.
- Proven expertise in Supervised and Unsupervised Learning techniques (Clustering, Classification, PCA, Decision Trees, KNN, SVM), Predictive Analytics, Optimization methods, Natural Language Processing (NLP) and Time Series Analysis.
- Experienced in Machine Learning regression algorithms such as Simple, Multiple and Polynomial Regression, SVR (Support Vector Regression), Decision Tree Regression and Random Forest Regression.
- Experienced in advanced statistical analysis and predictive modeling in structured and unstructured data environments.
- Strong expertise in Business and Data Analysis, Data Profiling, Data Migration, Data Conversion, Data Quality, Data Governance, Data Lineage, Data Integration, Master Data Management (MDM), Metadata Management Services and Reference Data Management (RDM).
- Hands-on experience with Data Science libraries in Python such as Pandas, NumPy, SciPy, scikit-learn, Matplotlib, Seaborn, BeautifulSoup, Orange, rpy2, LibSVM, neurolab and NLTK.
- Experienced in Machine Learning Classification Algorithms like Logistic Regression, K-NN, SVM, Kernel SVM, Naive Bayes, Decision Tree & Random Forest classification.
- Hands-on experience with R packages and libraries such as ggplot2, Shiny, h2o, dplyr, reshape2, plotly, R Markdown, ElemStatLearn and caTools.
- Efficiently accessed data via multiple vectors (e.g. NFS, FTP, SSH, SQL, Sqoop, Flume, Spark).
- Experience in various phases of the Software Development Life Cycle (Analysis, Requirements gathering, Designing) with expertise in writing/documenting Technical Design Documents (TDD), Functional Specification Documents (FSD), Test Plans, GAP Analysis and Source-to-Target mapping documents.
- Experienced in Artificial Neural Networks (ANN) and Deep Learning models using the Theano, TensorFlow and Keras packages in Python.
- Excellent understanding of Hadoop architecture, MapReduce concepts and the HDFS framework.
- Strong understanding of project life cycle and SDLC methodologies including RUP, RAD, Waterfall and Agile.
- Experience working on BI visualization tools (Tableau, Shiny & QlikView).
- Excellent team player and self-starter with strong communication skills.
TECHNICAL SKILLS:
Databases: SQL Server, MS Access, Teradata, Oracle
NoSQL Databases: HBase
Programming Languages: C, C++, MATLAB, R, Python, Java, JavaScript, Scala, Pig
Markup Languages: XML, HTML, DHTML, XSLT, XPath, XQuery and UML
ETL Tools: Informatica PowerCenter, SSIS
Data Modeling Tools: MS Visio, Rational Rose, Erwin
Testing Tools: HP Quality Center ALM
Big Data Tools: Hadoop, Hive, Apache Spark, Pig
Operating Systems: UNIX, Linux, Windows
Reporting & Visualization: Tableau, Matplotlib, Seaborn, ggplot, SAP Business Objects, Crystal Reports, SSRS, Cognos, Shiny
PROFESSIONAL EXPERIENCE:
Confidential - McLean, VA
Data Scientist
Responsibilities:
- Worked closely with business, data governance, SMEs and vendors to define data requirements.
- Designed and provisioned the platform architecture to execute Hadoop and machine learning use cases on AWS cloud infrastructure (EMR and S3).
- Selected statistical algorithms (Two-Class Logistic Regression, Boosted Decision Tree, Decision Forest classifiers, etc.).
- Used MLlib, Spark's machine learning library, to build and evaluate different models.
- Worked with Teradata 14 tools such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT) and BTEQ.
- Involved in creating a Data Lake by extracting customers' Big Data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata and Netezza, as well as log data from servers.
- Used Spark DataFrames, Spark SQL and Spark MLlib extensively, designing and developing POCs in Scala with Spark SQL and the MLlib libraries.
- Created the high-level ETL design document and assisted ETL developers in the detailed design and development of ETL mappings using Informatica.
- Used R and SQL to create statistical algorithms involving Multivariate Regression, Linear Regression, Logistic Regression, PCA, Random Forest models, Decision Trees and Support Vector Machines for estimating the risks of welfare dependency.
- Helped in the migration and conversion of data from a Sybase database into an Oracle database, preparing mapping documents and developing partial SQL scripts as required.
- Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy Oracle and SQL Server database systems.
- Executed ad-hoc data analysis for customer insights using SQL on an Amazon AWS Hadoop cluster.
- Worked on predictive and what-if analysis using R on data in HDFS; successfully loaded files into HDFS from Teradata and from HDFS into Hive.
- Designed the schema, configured and deployed AWS Redshift for optimal storage and fast retrieval of data.
- Analyzed data and predicted end customer behaviors and product performance by applying machine learning algorithms using Spark MLlib.
- Performed data mining using very complex SQL queries, discovered patterns, and used extensive SQL for data profiling/analysis to provide guidance in building the data model.
Environment: R, Machine Learning, Teradata 14, Hadoop MapReduce, PySpark, Spark, Spark MLlib, Tableau, Informatica, SQL, Excel, AWS Redshift, Scala, NLP, Cassandra, Oracle, MongoDB, Informatica MDM, Cognos, SQL Server 2012, DB2, SPSS, T-SQL, PL/SQL, Flat Files, XML.
Confidential, Phoenix, AZ
Data Scientist, Lead Business Data Analyst
Responsibilities:
- Helped Amex build a decision system as a Data Scientist by creating algorithms based on business data, using the Simulated Annealing optimization technique and the Decision Tree ML concept. Widely applied statistical concepts such as the Central Limit Theorem, probability, and probability distributions (Binomial, Poisson and Exponential).
- Built Decision Trees in R to represent segmentation of data and identify key variables to be used in predictive modeling.
- Helped the business in Amex credit card analysis by building instant-decision rules with the help of hypothesis testing and Chi-Square tests in the Merchant Finance application.
- Managed project planning and deliverables for several projects across the Advanced Analytics, Big Data and Digital Analytics streams.
- Project consultant for insurance clients across the globe in their digital transformation journey.
- Handled resource hiring (lateral and campus), compensation fitting, training, coaching and performance reviews across different analytical streams.
- Managed offshore and onshore team workloads for project deliverables.
- Provided ad-hoc support to senior management on project and technical matters.
- Provided support in solution development for Data Science, Advanced Analytics and Digital Analytics projects.
- Guided a team of data scientists in developing statistical models and algorithms to answer complex business problems.
- Implemented machine learning techniques and interpreted statistical results, making them ready for consumption by senior management and clients.
- Provided pre-sales support to the Syntel team on RFPs, RFIs and client presentations.
- Played a vital role in Credit Risk Analysis, helping the business make loan-funding decisions for different sets of merchants and generating decline reports using the Tableau reporting tool.
Environment: R, Decision Tree, Naïve Bayes Classification, Confusion Matrix, Tableau.
Confidential
Associate Data Scientist
Responsibilities:
- Designed applications of machine learning, statistical analysis and data visualization for challenging large-scale data processing problems, resulting in savings of more than $1.2M a year.
- Responsible for all aspects of management, administration, and support of Microsoft Azure cloud-based infrastructure as the premier hosting provider.
- Worked with various databases such as Oracle and SQL Server, and performed computations, log transformations, feature engineering and data exploration to identify insights and conclusions from complex data using R programming in RStudio.
- Worked with Redshift to analyze the data using standard SQL and Business Intelligence tools to run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
- Implemented predictive models using machine learning algorithms such as linear regression and linear boosting algorithms, performed in-depth analysis of the models' structure, compared the performance of all the models, and found that tree boosting was the best for the prediction task.
- Applied concepts of R-squared, RMSE and p-value in the evaluation stage to extract interesting findings through comparisons.
- Performed in-depth statistical analysis and data mining in R, including cluster analysis, logistic regression and boosting models, reducing variance by 45%.
- Proficient in the entire CRISP-DM life cycle and actively involved in all phases of the project life cycle, including data acquisition, data cleaning and data engineering.
- Extensively used Azure Machine Learning to set up experiments and create web services for predictive analytics.
- Developed deep learning models that predict text sequences using NLP, converting words into numeric vectors using Word2Vec.
- Performed data manipulation in Python using the NumPy, Pandas and SciPy libraries.
- Wrote complex SQL queries for data analysis using window functions and joins, and improved performance by creating partitioned tables.
- Prepared multiple dashboards using Tableau to reflect data behavior over time. Analyzed and worked with all aspects of regression models (OLS, etc.).
- Responsible for working with stakeholders to troubleshoot issues, communicate to team members, leadership and stakeholders on findings to ensure models are well understood and optimized.
Confidential
Data Analyst
Responsibilities:
- Worked with internal architects, assisting in the development of current and target state data architectures.
- Worked with project team representatives to ensure that logical and physical ER/Studio data models were developed in line with corporate standards and guidelines.
- Involved in defining the business/transformation rules applied for sales and service data.
- Implemented the Metadata Repository, transformations, Data Quality maintenance, Data Standards, the Data Governance program, scripts, stored procedures and triggers, and executed test plans.
- Defined list codes and code conversions between the source systems and the data mart.
- Involved in defining business rules, source-to-target data mappings and data definitions.
- Responsible for defining the key identifiers for each mapping/interface.
- Remained knowledgeable in all areas of business operations in order to identify system needs and requirements.
- Performed data quality in Talend Open Studio.
- Maintained the Enterprise Metadata Library with any changes or updates.
- Documented data quality and traceability for each source interface.
- Established standard operating procedures.
- Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.
Environment: Windows Enterprise Server 2000, SSRS, SSIS, Crystal Reports, DTS, SQL Profiler, and Query Analyzer.