Sr. Data Scientist/Machine Learning Engineer Resume
Minneapolis, MN
PROFESSIONAL SUMMARY:
- Around 8 years of experience and comprehensive industry knowledge of Machine Learning, Statistical Modeling, Data Analytics, Data Modeling, Data Architecture, Data Analysis, Data Mining, Text Mining & Natural Language Processing (NLP), Artificial Intelligence algorithms, Business Intelligence, and analytics models (such as Decision Trees, Linear & Logistic Regression), using Hadoop (Hive, Pig), R, Python, Java, Spark, Scala, MS Excel, SQL, PostgreSQL and Erwin.
- Good experience in Big Data, Hadoop, NoSQL databases (MongoDB, HBase), Data Warehousing, Business Intelligence, Data Analytics and ETL concepts.
- Excellent knowledge of Machine Learning, Mathematical Modeling and Operations Research. Comfortable with R, Python, Java, SAS, Weka, MATLAB and relational databases. Deep understanding of, and exposure to, the Big Data ecosystem.
- Expert in creating PL/SQL Schema objects like Packages, Procedures, Functions, Subprograms, Triggers, Views, Materialized Views, Indexes, Constraints, Sequences, Exception Handling, Dynamic SQL/Cursors, Native Compilation, Collection Types, Record Type, Object Type using SQL Developer.
- Hands-on experience in Hadoop, Hive, HBase, MapReduce, Pig, Oozie, R, Sqoop, Flume, ZooKeeper, Ambari, YARN, Tez and SAP HANA.
- Strong experience and knowledge in data visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
- Strong experience in Big Data technologies like Spark, SparkSQL, PySpark, Hadoop, HDFS, Hive.
- Proficient at building robust Machine Learning and Deep Learning models, including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and LSTMs, using TensorFlow and Keras (a minimal sketch appears at the end of this summary).
- Adept in analyzing large datasets using Apache Spark, PySpark, Spark ML and Amazon Web Services (AWS).
- Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization; proficient in HiveQL, SparkSQL and PySpark.
- In-depth knowledge of Spark's machine learning library, MLlib.
- Hands-on experience integrating R with the Hadoop ecosystem using the rhdfs, rhive and rhbase packages.
- Hands-on experience in SAP HANA, HDFS and R integration.
- Hands-on experience building web applications in R using the Shiny package.
- Hands-on experience in PHP, Python, MySQL, PostgreSQL and MongoDB.
- Knowledge of Scala, Spark and Jaql.
- Experience in Data Quality Management: acquiring, cleaning, processing and cross-verifying data across multiple sources.
- Domain knowledge of E-Commerce, E-Learning, Travel, Health Care and Gaming.
- Active team player and quick learner with an organized and committed personality.
- Experience in Amazon EMR, EC2 and S3 cloud services.
- Involved in installing/configuring Hadoop 1.0 and its ecosystem tools on CentOS 6.x.
- Worked with the admin team to upgrade Hadoop 1.0 to 2.0 using Apache Ambari 2.0.1 and configured it with Hue.
- Worked on clusters of up to 20 nodes, with dedicated nodes for the NameNode, JobTracker and Secondary NameNode.
- Handled data loads of up to 20 TB.
- Extracted data from log files into HDFS using Flume.
- Developed Oozie workflow for scheduling and orchestrating the ETL process.
- Extracted data from SAP HANA, MS SQL and MySQL into HDFS using Sqoop.
- Experienced with the Teradata RDBMS using FastLoad, FastExport, MultiLoad, TPump, Teradata SQL Assistant and BTEQ utilities.
- Expert R user with knowledge of the SAS statistical programming language.
- Created and worked on Sqoop (version 1.4.3) jobs with incremental load to populate Hive External tables.
- Developed Hive (version 0.10) scripts for end user / analyst requirements to perform ad hoc analysis.
- Very good understanding of Partitions, Bucketing concepts in Hive and designed both Managed and External tables in Hive to optimize performance.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping and aggregation and how they translate to MapReduce jobs.
- Used Tez execution to speed up the query execution time in Hive.
- Good experience with both MapReduce 1 (Job Tracker) and MapReduce 2 (YARN) setups.
- Good experience in monitoring and managing clusters using Ambari together with Nagios and Ganglia.
- Experience with working in Agile/SCRUM software environments.
- Highly motivated team player with excellent interpersonal skills, effective communication, analytical and presentation skills.
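A minimal, illustrative sketch of the kind of Keras LSTM model referenced above; the layer sizes, input shape and data are placeholder assumptions, not taken from any project listed in this resume.
```python
# Hypothetical example: none of the shapes, sizes or data below come from a real project.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features = 30, 8  # placeholder sequence length and feature count

# A small LSTM-based binary classifier
model = keras.Sequential([
    layers.Input(shape=(timesteps, n_features)),
    layers.LSTM(64),                        # recurrent layer over the sequence
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),  # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data just to show the training call
X = np.random.rand(256, timesteps, n_features).astype("float32")
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2)
```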
TECHNICAL SKILLS:
Data Modeling Tools: Erwin r9.6/9.5, ER/Studio 9.7, Star-Schema Modeling, Snowflake-Schema Modeling, FACT and dimension tables, Pivot Tables.
Databases: Oracle 11g/12c, MS Access, SQL Server 2012/2014, Sybase, DB2, Teradata 14/15, Hive.
Big Data Tools: Hadoop, Hive, Spark, Pig, HBase, Sqoop, Flume.
BI Tools: Tableau 7.0/8.2, Tableau Server 8.2, Tableau Reader 8.1, SAP Business Objects, Crystal Reports.
Packages: Microsoft Office 2010, Microsoft Project 2010, SAP, Microsoft Visio, SharePoint Portal Server.
Operating Systems: Microsoft Windows 8/7/XP, Linux and UNIX.
Languages: SQL, PL/SQL, T-SQL, ASP, Visual Basic, XML, Python, C, C++, Java, HTML, UNIX shell scripting, Perl, R.
Applications: Toad for Oracle, Oracle SQL Developer, MS Word, MS Excel, MS PowerPoint, Teradata, Designer 6i.
Methodologies: RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.
PROFESSIONAL EXPERIENCE:
Confidential, Minneapolis, MN
Sr. Data Scientist/Machine Learning Engineer
Responsibilities:
- Heavily involved in a Data Architect role, reviewing business requirements and composing source-to-target data mapping documents.
- Expertise and experience in domains such as Retail Solutions, Finance, Healthcare, Banking, Digital Advertising and e-commerce.
- Responsible for data architecture design delivery, data model development, review, approval and data warehouse implementation.
- Set strategy and oversee design for significant data modelling work, such as Enterprise Logical Models, Conformed Dimensions, and Enterprise Hierarchy.
- Analyzed existing conceptual and physical data models and altered them using Erwin to support enhancements.
- Applied feature engineering guided by feature importance from Random Forest and by L1 and L2 regularization across features (see the sketch at the end of this role).
- Designed the Logical Data Model using Erwin with the entities and attributes for each subject area.
- Provided idea-driven architectural design for Big Data and Hadoop projects.
- Skilled in Data chunking, Data profiling, Data Cleansing, Data mapping, creating workflows and Data Validation using data integration tools like Informatica during the ETL and ELT processes.
- Used Python 3.x (NumPy, SciPy, pandas, scikit-learn, seaborn) and Spark 2.0 (Scala, PySpark, MLlib) to develop a variety of models and algorithms for analytic purposes.
- Developed MapReduce jobs written in Java and used Hive for data cleaning and pre-processing.
- Used big data tools in Spark (PySpark, SparkSQL, MLlib) to conduct real-time analysis of loan defaults on AWS.
- Developed and configured the Informatica MDM hub to support the Master Data Management (MDM), Business Intelligence (BI) and data warehousing platforms to meet business needs.
- Loaded data into Hive tables from the Hadoop Distributed File System (HDFS) to provide SQL access to Hadoop data.
- Used Agile methodology for data warehouse development.
- Designed and implemented data ingestion techniques for real-time and batch processing of structured and unstructured data sources into Hadoop ecosystems and HDFS clusters.
- Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL and Big Data technologies.
- Responsible for data identification, collection, exploration and cleaning for modeling, and participated in model development.
- Implemented multi-datacenter and multi-rack Cassandra cluster.
- Involved in data model reviews as a data architect with business analysts and business users, explaining the data model to make sure it is in line with business requirements.
- Created entity relationship diagrams and data flow diagrams and enforced all referential integrity constraints using Rational Rose.
- Worked with the ETL team to document the SSIS packages for data extraction to Warehouse environment for reporting purposes.
- Developed a data mart for the base data in Star and Snowflake Schemas; involved in developing the data warehouse for the database.
- Involved in data loading using PL/SQL scripts and SQL Server Integration Services packages.
- Established data governance, data quality monitoring and clear documentation for easy implementation.
- Involved in OLAP validation, and unit and system testing of the OLAP report functionality and the data displayed in the reports.
- Generated ad hoc SQL queries using joins, database connections and transformation rules to fetch data from the Teradata database.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Worked on Amazon Redshift and AWS, architecting a solution to load data, create data models and run BI on it.
- Created UNIX scripts for file transfer and file manipulation.
- Directed the creation of dashboards based on business requirements using SSRS/Cognos and helped the development team understand the requirements.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
- Worked with various Teradata 15 tools and utilities such as Teradata Viewpoint, MultiLoad, ARC, Teradata Administrator, BTEQ and other Teradata utilities.
- Involved in several facets of MDM implementations including data profiling, metadata acquisition and data migration.
- Developed predictive models using Decision Trees, Random Forest, Naïve Bayes, Logistic Regression, Cluster Analysis and Neural Networks to build an online advertising pricing model that maximizes the client's net revenue, produce accurate Revenue-per-Click estimates, and detect fraudulent traffic by flagging potential bot sessions that cause inflated billings to the client's customers.
- Extensively used the Aginity Netezza workbench to perform various DML and DDL operations on the Netezza database.
- Created DDL scripts using Erwin and source-to-target mappings to bring the data from the source to the warehouse.
- Led database-level tuning and optimization in support of application development teams on an ad hoc basis.
Environment: Erwin r9.6, Python, SQL, Oracle 12c, Netezza, SQL Server, Informatica, Java, SSRS, PL/SQL, T-SQL, Tableau, MLlib, regression, cluster analysis, Scala, NLP, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, Teradata, random forest, OLAP, Azure, MariaDB, SAP CRM, HDFS, ODS, NLTK, SVM, JSON, XML, Cassandra, MapReduce, AWS, Teradata SQL Assistant 15.0, Flat Files.
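A minimal sketch of the feature-selection idea described in this role (Random Forest feature importance combined with L1 regularization); the synthetic dataset, column names and hyperparameters are assumptions made up for illustration.
```python
# Hypothetical sketch: synthetic data and placeholder column names, for illustration only.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(20)])

# Rank features by Random Forest importance
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))

# L1-regularized logistic regression as a sparse feature selector
l1_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5)).fit(X, y)
print("kept features:", list(X.columns[l1_selector.get_support()]))
```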
Confidential, Boston, MA
Sr. Data Scientist/Data Analyst
Responsibilities:
- Evaluated data analytics opportunities, such as fraud detection, to improve the efficiency of the claims handling process.
- Utilized various data analysis and data visualization tools to accomplish data analysis, report design and report delivery.
- Trained multiple ML models, including Logistic Regression, tree-based models, SVM, KNN and GBM, iterating over repeated evaluations with confusion matrices and cross-validation to find the optimal parameters and hyperparameters and ensure prediction accuracy (see the sketch at the end of this role).
- Created statistical models based on researched information to provide conclusions that guide the company and the industry into the future.
- Implemented PySpark jobs for batch processing to handle massive volumes of data from various data sources (Bloomberg, government publications, unstructured news articles, etc.) persisted in HDFS. Configured a CI/CD pipeline in Kubernetes and Docker Swarm.
- Performed data cleaning and feature selection using the MLlib package in PySpark and worked with deep learning frameworks such as Caffe and Neon.
- Handled missing data after import and encoded categorical data when needed.
- Split the data into training and test sets and scaled both when necessary.
- Creatively communicated and presented models to business customers and executives, utilizing a variety of formats and visualization methodologies.
- Measured the impact of marketing tactics on sales and then forecast the impact of future sets of tactics.
- Developed Scala and SQL code to extract data from various databases.
- Used R and Python for exploratory data analysis and hypothesis testing to compare and identify the effectiveness of creative campaigns.
- Used Scala, Python, R and SQL to create Statistical algorithms involving Linear Regression, Logistic Regression, Random forest, Decision trees, Support Vector Machine for estimating the risks.
- Developed statistical models to forecast inventory and procurement cycles.
- Created and designed reports that will use gathered metrics to infer and draw logical conclusions of past and future behaviour.
- Created pipelines for data ingestion from various channels through scripts written in Hive and Java.
- Worked with a range of proprietary, industry-standard and open-source data stores to assemble, organize and analyze data.
- Mapped customers to revenue to predict the revenue (if any) from a new prospective customer.
- Created visualizations, summary reports and presentations using R and Tableau.
- Uploaded data to Hadoop Hive and combined new tables with existing databases.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed PySpark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Supported MapReduce programs running on the cluster.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data.
- Scheduled jobs with a workflow scheduler to manage Hadoop jobs.
- Loaded the aggregated data into Data Mart for reporting, dash boarding and ad-hoc analysis using Tableau and developed a self-service BI solution for quicker turnaround of insights.
- Maintained SQL scripts to create and populate tables in data warehouse for daily reporting across departments.
Environment: R 3.x, Python 2.x, Tableau 9, SQL Server 2012, Spark/Scala, SBT, Hive, Sqoop, Spark ML.
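A minimal sketch of the tuning-and-evaluation loop described above (cross-validated hyperparameter search followed by a confusion matrix on held-out data); the GBM estimator, grid values and synthetic data are assumptions for illustration only.
```python
# Hypothetical sketch: the estimator, parameter grid and data are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validated hyperparameter search (GBM shown; the same loop applies to other estimators)
param_grid = {"n_estimators": [100, 300], "max_depth": [2, 3], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)

# Confusion matrix on the held-out split for the tuned model
y_pred = search.best_estimator_.predict(X_test)
print("best params:", search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```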
Confidential - Downers Grove, IL
Sr. Data Scientist
Responsibilities:
- Participated in all phases of the project life cycle including data collection, data mining, data cleaning, model building and validation, and report creation.
- Utilized MapReduce and PySpark programs to process data for analysis reports.
- Worked on data cleaning to ensure data quality, consistency, and integrity using Pandas/Numpy.
- Performed data pre-processing on messy data including imputation, normalization, scaling, feature engineering etc. using Scikit-Learn.
- Worked on different data formats such as JSON, XML and performed ML algorithms in Python.
- Conducted exploratory data analysis using Python Matplotlib and Seaborn to identify underlying patterns and correlations between features.
- Built classification models based on Logistic Regression, Decision Trees, Random Forest, Support Vector Machine and ensemble algorithms to predict the probability of patient absence.
- Applied various metrics like recall, precision, F-Score, ROC, and AUC to evaluate the performance of each model and k-fold cross-validation to test the models with different batches of data to optimize the models.
- Involved in creating data frames in Hadoop and Spark using PySpark and converting Hive/SQL queries into Spark transformations using Spark RDDs and Python libraries (see the sketch at the end of this role).
- Utilized PySpark, Spark Streaming and MLlib in the Spark ecosystem with a broad variety of machine learning methods including classification, regression and dimensionality reduction.
- Implemented and tested the model on AWS EC2 and collaborated with the development team to select the best algorithm and parameters.
- Worked on Naïve Bayes algorithms for Agent Fraud Detection using R.
- Performed data visualization and designed dashboards with Tableau, generating complex reports including charts, summaries and graphs to interpret the findings for the team and stakeholders.
Environment: Python (Scikit-Learn/Keras/SciPy/NumPy/Pandas/Matplotlib/Seaborn), Machine Learning (linear and non-linear regression, Deep Learning, SVM, Decision Tree, Random Forest, XGBoost, Ensemble and KNN), MS SQL Server 2017, AWS Redshift, S3, Hadoop Framework, HDFS, Spark (PySpark, MLlib, Spark SQL), Tableau Desktop and Tableau Server.
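A minimal sketch of the PySpark flow described in this role: a DataFrame produced from a Hive/SQL query, fed into an MLlib pipeline and evaluated on a held-out split; the table name, columns and label are hypothetical.
```python
# Hypothetical sketch: the table, columns and label below are placeholders, not real project assets.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("sketch").enableHiveSupport().getOrCreate()

# A Hive/SQL query expressed as a Spark DataFrame transformation
df = spark.sql("SELECT age, visits, prior_no_shows, label FROM patient_features")
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Assemble features and fit an MLlib classifier in a Pipeline
assembler = VectorAssembler(inputCols=["age", "visits", "prior_no_shows"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

# Evaluate on the held-out split
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(model.transform(test))
print("test AUC:", auc)
```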
Confidential - San Diego, CA
Data Analyst/Data Modeler
Responsibilities:
- Analyzed and reviewed functional specifications and requirements to determine best data design approach and translate business requirements into data models.
- Created models for various schemas and created the metadata in order to deploy the models into MicroStrategy, making the definitions reusable enterprise-wide.
- Performed data Ingestion for the incoming web feeds into the Data lake store which includes both structured and unstructured data.
- Implemented predictive analytics and machine learning algorithms to forecast key metrics, delivered as designed dashboards on AWS (S3/EC2) and the Django platform for the company's core business.
- Created the architectural artifacts for the Enterprise Data Warehouse and the Operational Dashboard, such as Entity Relationship Diagrams (ERD), the DDL scripts, the Conceptual Data Model, and technical as well as business documents.
- Conducted data profiling to ensure that the available data could support business needs; worked with the developers on resolving reported bugs and various technical issues (see the sketch at the end of this role).
- Involved in requirements gathering activities to analyze and document business processes, fundamentals and strategic data needs.
- Created data source views from MySQL and Hadoop data sources.
- Migrated retired systems to new systems and customized them according to business requirements.
- Enforced database naming standards and maintained user domains.
- Supported data conversion activities and coordinated the resolution of conversion and data migration issues.
- Created and maintained the data dictionary and worked to reach consensus on it.
- Created data lineages and mappings for Data Lake schemas.
- Ensured Error logs and audit tables are generated and populated properly.
- Involved in troubleshooting, resolving and escalating data related issues and validating data to improve data quality.
- Tracked and reported issues to the project team and management.
- Created mapping for horizontal data lineages for various systems.
- Contributed to the development of knowledge transfer documentation.
- Used Python, R and SQL to create Statistical algorithms involving Linear Regression, Logistic Regression, Random forest, Decision trees for estimating the risks.
- Managed change requests by following change request management process for the project.
- Involved in preparing a simple and detailed user guide and training manual for the application, intended for novice users.
Environment: Erwin 9.64, MS Access, MicroStrategy, MySQL, Oracle 10g, HeidiSQL, Hadoop, Toad 12.5, MS Visio, SVN.
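A minimal sketch of the kind of data-profiling pass mentioned in this role, used to check whether incoming data can support business needs; the file path, columns and sparsity threshold are assumptions for illustration.
```python
# Hypothetical sketch: the file path, columns and thresholds are placeholders.
import pandas as pd

df = pd.read_csv("incoming_feed.csv")  # stand-in for an extract from the data lake

# Column-level profile: types, null rates and distinct counts
profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": df.isna().mean().round(3),
    "n_unique": df.nunique(),
})
print(profile)

# Basic range checks on numeric columns
print(df.describe().T[["min", "max", "mean"]])

# Flag columns too sparse to support downstream modeling or reporting
too_sparse = profile.index[profile["null_pct"] > 0.5].tolist()
print("columns with >50% nulls:", too_sparse)
```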
Confidential - Visakhapatnam, Andhra Pradesh
Data Analyst
Responsibilities:
- Extensively worked on Informatica PowerCenter Transformations such as Source Qualifier, Lookup, Filter, Expression, Router, Joiner, Update Strategy, Rank, Aggregator, Sequence Generator etc.
- Proficient in using the Informatica PowerCenter tool to design data conversions from a wide variety of sources.
- Expertise in transforming business requirements into analytical models, designing algorithms, building models, developing data mining and reporting solutions that scale across a massive volume of structured and unstructured data.
- Proficient in using Informatica workflow manager, Workflow monitor to create, schedule and control workflows, tasks, and sessions.
- Created pivot tables and ran VLOOKUPs in Excel as part of data validation.
- Used Informatica PowerCenter for extraction, loading and transformation (ETL) of data in the data warehouse.
- Worked on data analysis, data discrepancy reduction in the source and target schemas.
- Designed and developed complex mappings, from varied transformation logic like Unconnected and Connected lookups, Router, Filter, Expression, Aggregator, Joiner, Update Strategy and more.
- Prepared system requirements specifications (SRS), database specifications (DBS) and software design documents (SDD).
- Responsible for the maintenance of a few applications in PowerBuilder 10.2.
- Used SQL Server 2005 to fix production issues in the background.
- Coordinated delivery and quality activities.
- Involved in testing and validation of all fields, functions, programs and agents, with front-end and back-end code reviews across the application.
- Involved in preparing program specifications, unit tests, test cases and user manual documents.
Environment: Informatica 8.x, PowerBuilder 10.2, SQL Server 2005.
Confidential
Java Developer
Responsibilities:
- Involved in each phase of Software Development Life Cycle (SDLC) models including requirement gathering and analysis, design, implementation, testing, deployment and maintenance.
- Developed Login, Policy and Claims Screens for customers using HTML 5, CSS3, JavaScript, AJAX, JSP, and jQuery.
- Used Core Java to develop Business Logic.
- ML models developed: Customer Survival Analysis for better targeting, Member Engagement call center optimization, Financial Forecasting for product realization.
- Involved in the development of business module applications using J2EE technologies like Servlets, JSP.
- Designed and developed the web-tier using JSP's, Servlets framework.
- Used various Core Java concepts such as Multi-Threading, Exception Handling, Collection APIs to implement various features and enhancements.
- Strong experience in design & development of applications using Java/J2EE components such as Java Server Pages (JSP).
- Developed EJB MDB's and message Queue's using JMS technology.
- EJB Session Beans were used to process requests from the user interface and CMP entity beans were used to interact with the persistence layer.
- Developed stored procedures, triggers, and queries using PLSQL in SQL Server.
- Used Spring MVC as the framework and JavaScript for the client-side view; used frameworks for client-side data validation, creating dynamic web pages with Ajax and jQuery. Developed model classes based on the forms to be displayed on the UI.
- ML algorithms used: Logistic/Linear Regression, Random Forest, XGBoost, K-Means Clustering, etc.
- Implemented various design patterns in the project such as Business Delegate, Data Transfer Object, Data Access Object, Service Locator and Singleton.
- Used SQL statements and procedures to fetch the data from the database.
- Developed test cases and performed unit test using JUnit Framework.
- Used CVS as version control and ANT scripts to fetch, build, and deploy application to development environment.
Environment: Java, HTML, CSS, JavaScript, MySQL, Struts, EJB, Spring MVC.