Data Scientist Resume
Chantilly, VA
SUMMARY
- 8+ years of hands-on experience and comprehensive industry knowledge of Machine Learning, Statistical Modeling, Data Analytics, Data Modeling, Data Architecture, Data Analysis, Data Mining, Text Mining & Natural Language Processing (NLP), Artificial Intelligence algorithms, Business Intelligence, and analytics models (such as Decision Trees and Linear & Logistic Regression), using Hadoop (Hive, Pig), R, Python, Spark, Scala, MS Excel, SQL, PostgreSQL, and Erwin.
- Strong knowledge of all phases of the SDLC (Software Development Life Cycle): analysis, design, development, testing, implementation, and maintenance.
- Experienced in Data Modeling techniques employing Data warehousing concepts like star/snowflake schema and Extended Star.
- Expertise in applying data mining and optimization techniques in B2B and B2C industries.
- Expertise in writing functional specifications, translating business requirements into technical specifications, and creating/maintaining/modifying database design documents with detailed descriptions of logical entities and physical tables.
- Excellent knowledge of Machine Learning, Mathematical Modeling and Operations Research. Comfortable with R, Python, SAS, Weka, MATLAB, and relational databases. Deep understanding of and exposure to the Big Data ecosystem.
- Expertise in Data Analysis, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools such as Informatica PowerCenter.
- Proficient in Machine Learning, Data/Text Mining, Statistical Analysis & Predictive Modeling.
- Expertise in data acquisition, storage, analysis, integration, predictive modeling, logistic regression, decision trees, data mining methods, forecasting, factor analysis, cluster analysis, ANOVA and other advanced statistical techniques.
- Excellent knowledge and experience in OLTP/OLAP system study with a focus on the Oracle Hyperion suite of technologies; developing database schemas such as star and snowflake schemas (fact tables, dimension tables) used in relational, dimensional, and multidimensional modeling; and physical and logical data modeling using the Erwin tool.
- Experienced in building data models using machine learning techniques for Classification, Regression, Clustering and Associative mining.
- Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
- Working experience with the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce, HiveQL, Spark SQL, and PySpark.
- Very good experience and knowledge in provisioning virtual clusters in the AWS cloud using services such as EC2, S3, and EMR.
- Proficient in data visualization tools such as Tableau, Python Matplotlib, and R Shiny for creating visually powerful and actionable interactive reports and dashboards.
- Experienced Tableau developer with expertise in building and publishing customized interactive reports and dashboards with custom parameters and user filters using Tableau (9.x/10.x).
- Experienced in Agile methodology and SCRUM process.
- Strong business sense and the ability to communicate data insights to both technical and non-technical clients.
TECHNICAL SKILLS
Databases: MySQL, PostgreSQL, Oracle, HBase, Amazon Redshift, MS SQL Server 2016/2014/2012/2008 R2/2008, Teradata
Statistical Methods: Hypothesis Testing, ANOVA, Time Series, Confidence Intervals, Bayes' Law, Principal Component Analysis (PCA), Dimensionality Reduction, Cross-Validation, Auto-correlation
Machine Learning: Regression analysis, Bayesian Method, Decision Tree, Random Forests, Support Vector Machine, Neural Network, Sentiment Analysis, K-Means Clustering, KNN and Ensemble Method
Hadoop Ecosystem: Hadoop 2.x, Spark 2.x, Map Reduce, Hive, HDFS, Sqoop, Flume
Reporting Tools: Tableau suite of tools 10.x/9.x/8.x (Desktop, Server, and Online), SQL Server Reporting Services (SSRS)
Data Visualization: Tableau, Matplotlib, Seaborn, ggplot2
Languages: Python (2.x/3.x), R, SAS, SQL, T-SQL
Operating Systems: PowerShell, UNIX/UNIX shell scripting (via PuTTY client), Linux, and Windows
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, R, and a broad variety of machine learning methods including classification, regression, and dimensionality reduction; used the resulting engine to increase user lifetime by 45% and triple user conversions for target categories.
- Participated in feature engineering such as feature-intersection generation, feature normalization, and label encoding with scikit-learn preprocessing.
- Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, Seaborn) and Spark 2.0 (PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Developed and implemented predictive models using machine learning algorithms such as linear regression, classification, multivariate regression, Naive Bayes, Random Forests, K-means clustering, KNN, PCA and regularization for data analysis.
- Ensured solution and technical architectures were documented and maintained, while setting standards and offering consultative advice to technical and management teams; involved in recommending the roadmap and approach for implementing the data integration architecture (with cost, schedule, and effort estimates).
- Designed and developed NLP models for sentiment analysis.
- Led discussions with users to gather business-process requirements and data requirements to develop a variety of conceptual, logical, and physical data models. Expert in Business Intelligence and data visualization tools: Tableau, MicroStrategy.
- Developed and evangelized best practices for statistical analysis of Big Data.
- Designed and implemented system architecture for Amazon EC2 based cloud-hosted solution for client.
- Designed the enterprise conceptual, logical, and physical data model for the 'Bulk Data Storage System' using Embarcadero ER/Studio; the data models were designed in 3NF.
- Worked on machine learning on large-sized data using Spark and MapReduce.
- Collaborated with data engineers and operation team to implement ETL process, wrote and optimized SQL queries to perform data extraction to fit the analytical requirements.
- Performed data analysis by using Hive to retrieve the data from Hadoop cluster, SQL to retrieve data from RedShift.
- Explored and analyzed the customer specific features by using SparkSQL.
- Performed data imputation using Scikit-learn package in Python.
- Led the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches.
- Developed Spark/Scala, SAS, and R programs for a regular expression (regex) project in the Hadoop/Hive environment with Linux/Windows for big data resources.
- Conducted analysis assessing customer consumption behaviours and discovered customer value with RFM analysis; applied customer segmentation with clustering algorithms such as K-Means Clustering and Hierarchical Clustering (see the segmentation sketch after this list).
- Built regression models, including Lasso, Ridge, SVR, and XGBoost, to predict Customer Lifetime Value.
- Built classification models, including Logistic Regression, SVM, Decision Tree, and Random Forest, to predict Customer Churn Rate.
- Used F-score, AUC/ROC, confusion matrix, MAE, and RMSE to evaluate the performance of the different models.
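A minimal illustrative sketch of the RFM-based segmentation described above, assuming pandas and scikit-learn; the synthetic order data, column names, and the two-cluster choice are placeholders rather than details from the actual project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny synthetic order history standing in for the real transaction data
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime(["2017-01-05", "2017-03-20", "2017-02-11",
                                  "2017-01-02", "2017-02-15", "2017-03-30",
                                  "2016-12-01"]),
    "order_value": [120.0, 80.0, 40.0, 200.0, 150.0, 90.0, 60.0],
})
snapshot = orders["order_date"].max()

# Recency, Frequency, Monetary value per customer
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (snapshot - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("order_value", "sum"),
)

# Scale so no single RFM dimension dominates the distance metric, then cluster
scaled = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(scaled)

print(rfm)
```

In practice the cluster count would be chosen from the data (for example with silhouette scores), not fixed at two as in this toy example.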
Environment: AWS Redshift, EC2, EMR, Hadoop Framework, S3, HDFS, Spark (PySpark, MLlib, Spark SQL), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Tableau Desktop (9.x/10.x), Tableau Server (9.x/10.x), Machine Learning (Regressions, KNN, SVM, Decision Tree, Random Forest, XGBoost, LightGBM, Collaborative Filtering, Ensemble), Teradata, Git 2.x, Agile/SCRUM
Confidential
Data Scientist
Responsibilities:
- Tackled a highly imbalanced fraud dataset using undersampling, oversampling with SMOTE, and cost-sensitive algorithms with Python scikit-learn (see the sketch after this list).
- Wrote complex Spark SQL queries for data analysis to meet business requirement.
- Developed Map Reduce/Spark Python modules for predictive analytics & machine learning in Hadoop on AWS.
- Worked on data cleaning and ensured data quality, consistency, integrity using Pandas, Numpy.
- Participated in feature engineering such as feature-intersection generation, feature normalization, and label encoding with scikit-learn preprocessing.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python Scikit-learn.
- Applied Naïve Bayes, KNN, Logistic Regression, Random Forest, SVM, and XGBoost to identify whether a loan would default.
- Implemented an ensemble of Ridge, Lasso regression, and XGBoost to predict the potential loan default loss.
- Used various metrics (RMSE, MAE, F-score, ROC, and AUC) to evaluate the performance of each model.
- Used big data tools in Spark (PySpark, Spark SQL, MLlib) to conduct real-time analysis of loan defaults on AWS.
- Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
- Created multiple custom SQL queries in Teradata SQL Workbench to prepare the right data sets for Tableau dashboards. Queries involved retrieving data from multiple tables using various join conditions, which made it possible to use efficiently optimized data extracts for Tableau workbooks.
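A minimal illustrative sketch of the imbalanced-fraud workflow described above: SMOTE oversampling followed by a tree-ensemble classifier and the listed evaluation metrics. It assumes scikit-learn plus the imbalanced-learn package, and the synthetic dataset stands in for the real fraud data.

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: roughly 3% positive (fraud) class
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.97, 0.03], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample only the training split so the test set keeps the true class balance
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_res, y_res)

proba = clf.predict_proba(X_test)[:, 1]
pred = clf.predict(X_test)
print("AUC:", roc_auc_score(y_test, proba))
print("F1 :", f1_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```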
Environment: MS SQL Server 2014, Teradata, ETL, SSIS, Alteryx, Tableau (Desktop 9.x/Server 9.x), Python 3.x (Scikit-Learn/SciPy/NumPy/Pandas), Machine Learning (Naïve Bayes, KNN, Regressions, Random Forest, SVM, XGBoost, Ensemble), AWS Redshift, Spark (PySpark, MLlib, Spark SQL), Hadoop 2.x, MapReduce, HDFS, SharePoint
Confidential, Chantilly, VA
Data Analyst/Data Scientist
Responsibilities:
- Gathered, analyzed, documented, and translated application requirements into data models, and supported standardization of documentation and the adoption of standards and practices related to data and applications.
- Participated in Data Acquisition with Data Engineer team to extract historical and real-time data by using Sqoop, Pig, Flume, Hive, Map Reduce and HDFS.
- Wrote user defined functions (UDFs) in Hive to manipulate strings, dates and other data.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Applied clustering algorithms, i.e. Hierarchical and K-Means, using scikit-learn and SciPy (see the sketch after this list).
- Created the logical data model from the conceptual model and converted it into the physical database design using ERWIN.
- Mapped business needs/requirements to subject area model and to logical enterprise model.
- Worked with DBAs to create a best-fit physical data model from the logical data model.
- Redefined many attributes and relationships in the reverse engineered model and cleansed unwanted tables/ columns as part of data analysis responsibilities.
- Enforced referential integrity in the OLTP data model for consistent relationship between tables and efficient database design.
- Developed the data warehouse model (star schema) for the proposed central model for the project.
- Created 3NF business-area data models with de-normalized physical implementations, along with data and information requirements analysis, using the ERWIN tool.
- Worked on snowflaking the dimensions to remove redundancy.
- Worked with Teradata 14 tools such as FastLoad, MultiLoad, TPump, FastExport, Teradata Parallel Transporter (TPT), and BTEQ.
- Helped in migration and conversion of data from the Sybase database into Oracle database, preparing mapping documents and developing partial SQL scripts as required.
- Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy Oracle and SQL Server database systems
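A minimal illustrative sketch of the two clustering approaches named above, hierarchical clustering via SciPy and K-Means via scikit-learn; the random feature matrix and the three-cluster choice are placeholders for the project's cleaned, scaled data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random placeholder features standing in for the cleaned project data
X = StandardScaler().fit_transform(np.random.default_rng(0).random((200, 5)))

# Hierarchical clustering: build a Ward linkage tree, then cut it into 3 flat clusters
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# K-Means on the same features for comparison
km_labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

print("hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("k-means cluster sizes:     ", np.bincount(km_labels))
```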
Environment: Machine Learning (KNN, Clustering, Regressions, Random Forest, SVM, Ensemble), Linux, Python 2.x (Scikit-Learn/SciPy/NumPy/Pandas), R, Tableau (Desktop 8.x/Server 8.x), Hadoop, MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Flume, Oracle 11g, SQL Server 2012
Confidential, Atlanta, GA
BI Developer/Data Analyst
Responsibilities:
- Used SSIS to create ETL packages to Validate, Extract, Transform and Load data into Data Warehouse and Data Mart.
- Maintained and developed complex SQL queries, stored procedures, views, functions and reports that meet customer requirements using Microsoft SQL Server 2008 R2.
- Created Views and Table-valued Functions, Common Table Expression (CTE), joins, complex sub queries to provide the reporting solutions.
- Optimized query performance by modifying T-SQL queries, removing unnecessary columns and redundant data, normalizing tables, establishing joins, and creating indexes.
- Created SSIS packages using Pivot Transformation, Fuzzy Lookup, Derived Columns, Condition Split, Aggregate, Execute SQL Task, Data Flow Task and Execute Package Task.
- Migrated data from the SAS environment to SQL Server 2008 via SQL Server Integration Services (SSIS).
- Developed and implemented several types of financial reports (Income Statement, Profit & Loss Statement, EBIT, ROIC reports) using SSRS.
- Developed parameterized dynamic performance Reports (Gross Margin, Revenue base on geographic regions, Profitability based on web sales and smart phone app sales) and ran the reports every month and distributed them to respective departments through mailing server subscriptions and SharePoint server.
- Designed and developed new reports and maintained existing reports using Microsoft SQL Reporting Services (SSRS) and Microsoft Excel to support the firm's strategy and management.
- Created sub-reports, drill down reports, summary reports, parameterized reports, and ad-hoc reports using SSRS.
- Used SAS/SQL to pull data out from databases and aggregate to provide detailed reporting based on the user requirements.
- Used SAS for pre-processing data, SQL queries, data analysis, generating reports, graphics, and statistical analyses.
- Provided statistical research analyses and data modeling support for mortgage product.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis, and cluster analysis using SAS programming (an illustrative Python equivalent is sketched after this list).
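The analyses above were performed in SAS; purely for illustration, a roughly equivalent logistic-regression sketch in Python with statsmodels is shown below. The synthetic data and the predictor/response column names (ltv_ratio, credit_score, dti_ratio, delinquent) are hypothetical, not drawn from the mortgage project.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for a mortgage sample; columns are hypothetical
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "ltv_ratio": rng.uniform(0.4, 1.1, n),
    "credit_score": rng.integers(550, 820, n),
    "dti_ratio": rng.uniform(0.1, 0.6, n),
})
linpred = 3 * data["ltv_ratio"] - 0.01 * data["credit_score"] + 2 * data["dti_ratio"]
p = 1 / (1 + np.exp(-(linpred - linpred.mean())))
data["delinquent"] = rng.binomial(1, p)

# Logistic regression: probability of delinquency from the predictors
X = sm.add_constant(data[["ltv_ratio", "credit_score", "dti_ratio"]])
model = sm.Logit(data["delinquent"], X).fit()

print(model.summary())       # coefficients, standard errors, p-values
print(np.exp(model.params))  # odds ratios for easier interpretation
```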
Environment: SQL Server 2008 R2, DB2, Oracle, SQL Server Management Studio, SAS/BASE, SAS/SQL, SAS/Enterprise Guide, MS BI Suite (SSIS/SSRS), T-SQL, SharePoint 2010, Visual Studio 2010, Agile/SCRUM
Confidential
Data Analyst
Responsibilities:
- Wrote SQL queries for data validation on the backend systems and used various tools such as TOAD and DbVisualizer for the DBMS (Oracle).
- Performed data analysis, backend database testing, and data modeling, and developed SQL queries to solve problems and meet users' needs for database management in the data warehouse.
- Utilized object-oriented languages and concepts, database design, star schemas, and databases.
- Created algorithms as needed to manage and implement proposed solutions.
- Participated in test planning and test execution for functional, system, integration, regression, UAT (User Acceptance Testing), load, and performance testing.
- Worked with test automation tools for recording/coding against the database, and executed them in regression testing cycles.
- Transferred data from various OLTP data sources, such as Oracle, MS Access, MS Excel, Flat files, CSV files into SQL Server.
- Worked with DB2, Oracle DM, and SQL Server databases for database testing and maintenance.
- Involved in writing and executing User Acceptance Testing (UAT) with end users.
- Involved in post-implementation validations after changes had been made to the data marts.
- Charted graphs and reports in QC to show the percentage of test cases passed, and thereby the percentage of quality achieved, and uploaded the status daily to ART reports, an in-house tool.
- Performed extensive data validation and data verification against the data warehouse.
- Used UNIX to check the Data marts, Tables and Updates made to the tables.
- Wrote advanced SQL queries against the data marts and landing areas to verify that the changes had been made.
- Involved in client requirement gathering; participated in discussion and brainstorming sessions and documented requirements.
- Validated and profiled flat-file data into Teradata tables using UNIX shell scripts.
- Actively participated Functional, System and User Acceptance testing on all builds and supervised releases to ensure system / functionality integrity.
- Closely interacted with designers and software developers to understand application functionality and navigational flow and keep them updated about Business user sentiments.
- Interacted with developers to resolve different Quality Related Issues.
- Wrote and executed manual test cases for functional, GUI, and regression testing of the application to make sure that new enhancements do not break working features
- Wrote and executed manual test cases in HP Quality Center.
- Wrote test plans for positive and negative scenarios for GUI and functional testing
- Involved in writing SQL queries and stored procedures using Query Analyzer and matched the results retrieved from the batch log files
- Created Project Charter documents & Detailed Requirement document and reviewed with Development & other stake holders.
Environment: Subversion, TortoiseSVN, Jira, Agile-Scrum, Web Services, Mainframe, Oracle, Perl, UNIX, Linux, Shell Scripts, UML, Quality Center, RequisitePro, SQL, MS Visio, MS Project, Excel, PowerPoint, Word, SharePoint, Win XP/7 Enterprise.
Confidential
Data Analyst/Data Modeler
Responsibilities:
- Performed data analysis and reporting using MySQL, MS PowerPoint, MS Access, and SQL Assistant.
- Involved in MySQL, MS PowerPoint, and MS Access database design, and designed a new database on Netezza with an optimized outcome.
- Used DB2 adapters to integrate the Oracle database with the Microsoft SQL database in order to transfer data.
- Designed the data marts using Ralph Kimball's dimensional data mart modeling methodology with ER Studio.
- Involved in writing T-SQL, working on SSIS, SSRS, SSAS, Data Cleansing, Data Scrubbing and Data Migration.
- Used Normalization methods up to 3NF and De-normalization techniques for effective performance in OLTP systems.
- Initiated and conducted JAD sessions inviting various teams to finalize the required data fields and their formats.
- Involved in designing and implementing the data extraction (XML DATA stream) procedures.
- Created base tables, views, and indexes. Built a complex Oracle procedure in PL/SQL for extracting, loading, and transforming the data into the warehouse via DBMS Scheduler from internal data.
- Involved in writing scripts for loading data to the target data warehouse using BTEQ, FastLoad, and MultiLoad.
- Created ETL scripts using regular expressions and custom tools (Informatica, Pentaho, and SyncSort) to ETL data.
- Developed SQL Service Broker to flow and sync data from MS-I to Microsoft's master data management (MDM) system.
- Extensively involved in Recovery process for capturing the incremental changes in the source systems for updating in the staging area and data warehouse respectively
- Strong knowledge of Entity-Relationship concepts, fact and dimension tables, slowly changing dimensions, and Dimensional Modeling (Star Schema and Snowflake Schema).
- Involved in loading data between Netezza tables using NZSQL utility.
- Worked on data modeling using dimensional data modeling, Star Schema/Snowflake Schema, fact and dimension tables, and physical and logical data modeling.
- Generated Statspack/AWR reports from the Oracle database and analyzed them for Oracle 8.x wait events, time-consuming SQL queries, tablespace growth, and database growth.
Environment: ER Studio, MySQL, MS PowerPoint, MS Access, Netezza, DB2, T-SQL, DTS, Informatica MDM, SSIS, SSRS, SSAS, ETL, MDM, 3NF and De-normalization, Teradata, Oracle 8.x, Star Schema and Snowflake Schema, etc.