Data Scientist / Machine Learning Engineer Resume
Brenham, Texas
PROFESSIONAL SUMMARY:
- 8+ years of experience in Data Architecture, Design, Development, and Testing of business application systems, Data Analysis, and developing conceptual, logical, and physical database designs for Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) systems.
- Experienced with Machine Learning algorithms such as Logistic Regression, KNN, SVM, Random Forest, Neural Networks, Linear Regression, Lasso Regression, and K-Means.
- Experienced working with data modeling tools like Erwin, Power Designer and ER Studio.
- Experienced in designing Star and Snowflake schemas for data warehouse and ODS architectures.
- Experienced in Data Modeling and Data Analysis using Dimensional and Relational Data Modeling, Star/Snowflake schemas, fact and dimension tables, and physical and logical data models.
- Experience with DevOps tools to automate and pipeline software development and infrastructure via the CI/CD model.
- Over 4 years of automation experience across the full release-management cycle, architecting, configuring, and migrating CI/CD tools with containerization, cloud, and API integration in Agile environments.
- Experienced in big data analysis and data model development using Hive, Pig, MapReduce, and SQL, with strong data architecture skills for designing data-centric solutions.
- Strong knowledge of and experience with AWS, including Redshift, S3, and EMR.
- Excellent development experience with SQL and procedural SQL (PL/SQL) on databases such as Oracle, Teradata, Netezza, and DB2.
- Good experience in NLP with Apache Hadoop and Python.
- Strong working knowledge of and experience with big data tools such as Hadoop, Azure Data Lake, and AWS Redshift.
- Expertise in synthesizing Machine Learning, Predictive Analytics and Big data technologies into integrated solutions.
- Working experience with the Hadoop ecosystem and the Apache Spark framework, including HDFS, MapReduce, HiveQL, Spark SQL, and PySpark.
- Built Machine Learning and NLP solutions for big data from scratch on top of Spark using Scala.
- Extensive experience working with structured data using HiveQL, performing join operations, writing custom UDFs, and optimizing Hive queries.
- Extensive experience in development of T-SQL, DTS, OLAP, and PL/SQL: stored procedures, triggers, functions, and packages, with performance tuning and optimization for business logic implementation.
- Experienced with query tools such as SQL Developer, PL/SQL Developer, and Teradata SQL Assistant.
- Excellent at performing data transfer between SAS and various databases and data file formats such as XLS, CSV, DBF, and MDB.
- Extensively worked with the Teradata utilities BTEQ, FastExport, and MultiLoad to export and load data to/from different source systems, including flat files.
- Expertise in extracting, transforming, and loading data between homogeneous and heterogeneous systems such as SQL Server, Oracle, DB2, MS Access, Excel, and flat files using SSIS packages.
- Experience in UNIX shell scripting, Perl scripting, and automation of ETL Processes.
- Strong experience in data visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
- Excellent understanding and working experience of industry-standard methodologies such as the System Development Life Cycle (SDLC), Rational Unified Process (RUP), and Agile methodologies.
- Experience in source-system analysis and data extraction from various sources such as flat files, Oracle … IBM DB2 UDB, and XML files.
- Experienced in developing Entity-Relationship diagrams and modeling transactional databases and data warehouses using tools such as Erwin, ER/Studio, and Power Designer, including both forward and reverse engineering with Erwin.
TECHNICAL SKILLS:
Client-Side Technologies: HTML5, Perl, Processing, Python, R, Hive, C/C++, C#, Java, Bash.
Machine Learning Models: Basic Statistics, Supervised and Unsupervised learning.
Programming Languages: C#, VB.NET, VB6, VBScript; OOP, data structures, algorithms.
Frameworks: Shogun, Accord Framework/AForge.NET, Scala, Spark, Cassandra, DL4J, ND4J, Scikit-learn, Mahout, MLlib, H2O, Cloudera Oryx, GoLearn, Apache Singa.
BI Tools: C, HBase, Bash, Spark, Elasticsearch.
Version Control: TFS, Microsoft Visual SourceSafe, Git; Unit Testing: NUnit, MSUnit.
MS Office: 2003/2007/2010/2013, MS Access, Messaging Architectures.
Other Technologies: PHP, Scala, Shark, Awk, Cascading, Cassandra, Clojure, Fortran, JavaScript, JMP, Mahout, Objective-C, QlikView, Redis, Redshift.
Web Technologies: Windows API, Web Services, Web API (RESTful), HTML5, XHTML, CSS3, AJAX, XML, XAML, MSMQ, Silverlight, Kendo UI.
Web Servers: IIS 5.0, IIS 6.0, IIS 7.5, IIS ADMIN.
Operating Systems: Windows 8/XP/NT/95/98/2000/2008/2012, Android SDK.
Databases: SQL Server 2014/2012/2008/2005/2000, MS Access, Oracle 11g/10g/9i, Teradata; big data: Hadoop, Mahout, MLlib, H2O, Cloudera Oryx, GoLearn.
PROFESSIONAL EXPERIENCE:
Confidential, Brenham, Texas
Data Scientist/Machine Learning Engineer
Responsibilities:
- Extracted data from HDFS and prepared it for exploratory analysis using data munging.
- Built models using Statistical techniques like Bayesian HMM and Machine Learning classification models like XGBoost, SVM, and Random Forest.
- Designed & developed Recommendation Engine for personalized marketing and real-time personalization. This product's pipeline included Apache Spark's MLlib for Collaborative Filtering, Spark streaming & Kafka as orchestration layers, Elasticsearch for real-time recommendation, Kibana for visualization.
- Participated in all phases of data mining, data cleaning, data collection, developing models, validation, visualization, and performed Gap analysis.
- Used Spark with Apache Parquet for table partitioning and read these partitioned tables into Spark DataFrames for distributed processing; performed partition pruning to reduce I/O (see the sketch after this list).
- Completed a highly immersive Data Science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, MongoDB, and Hadoop.
- Set up storage and data analysis tools in the AWS cloud computing infrastructure.
- Installed and used the Caffe deep learning framework.
- Developed models in Scala and Spark for users, prediction models, and sequential algorithms.
- Worked on different data formats such as JSON, XML and performed machine learning algorithms in Python.
- Worked with Data Architects and IT Architects to understand the movement and storage of data, using ER Studio 9.7.
- Used pandas, numpy, seaborn, matplotlib, scikit-learn, scipy, NLTK in Python for developing various machine learning algorithms.
- Developed programs in Spark using Python to compare the performance of Spark with Hive and SQL/Oracle.
- Performed the ongoing delivery, migrating client mini-data warehouses or functional data-marts from different environments to MS SQL server.
- Developed SSIS packages to export data from Excel (Spreadsheets) to SQL Server, automated all the SSIS packages and monitored errors using SQL Job on a daily basis.
- Developed Hive queries and UDFs to analyze and transform the data in HDFS.
- Served as Continuous Integration and Continuous Delivery (CI/CD) specialist and developer for complete release pipeline automation.
- Planned and created the CI/CD workflow model for Java applications.
- Performed data manipulation and aggregation from different sources using Nexus, Business Objects, Toad, Power BI, and Smart View.
- Handled multiple relational databases, including SQL Server and Oracle.
- Implemented Agile Methodology for building an internal application.
- Focused on integration overlap and Informatica's newer commitment to MDM following its acquisition of Identity Systems.
- Good knowledge of Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
- Implemented a novel algorithm for the test and control team using Spark/Scala, Oozie, HDFS, and Python on the P&G YARN cluster.
- Worked on Python OpenStack APIs.
- Developed Python application for Google Analytics aggregation and reporting.
- Skilled in using dplyr (R) and pandas (Python) for exploratory data analysis.
- Developed scalable models using Spark (RDDs, MLlib, ML, DataFrames) in Scala.
- Integrated Tesseract and Ghostscript with Spark to access data in HDFS and save results to Hive tables.
- Coded proprietary packages to analyze and visualize SPC file data, identifying bad spectra and samples to reduce unnecessary procedures and costs.
- Programmed a utility in Python that used multiple packages (NumPy, SciPy, pandas).
- Implemented Classification using supervised algorithms like Logistic Regression, Decision trees, Naive Bayes, KNN.
- Imported and exported data into HDFS and Hive using Sqoop.
- As Architect delivered various complex OLAP databases/cubes, scorecards, dashboards and reports.
- Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that each document could be assigned a response label for further classification.
- Promoted improvements in programming practices such as acceptance test-driven development, continuous integration, and automated testing.
- Used Teradata utilities such as FastExport and MultiLoad (MLOAD) to handle various data migration/ETL tasks from OLTP source systems to OLAP target systems.
- Performed data transformation from various sources, data organization, and feature extraction from raw and stored data.
- Validated the machine learning classifiers using ROC Curves and Lift Charts.
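A minimal PySpark sketch of the Parquet partitioning and partition-pruning pattern referenced above; the HDFS paths and column names are illustrative assumptions, not the production layout.

```python
# Minimal sketch (PySpark); paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

# Write a DataFrame to Parquet, partitioned by a date column so Spark can
# prune partitions at read time.
events = spark.read.json("hdfs:///data/raw/events")          # hypothetical path
events.write.mode("overwrite") \
      .partitionBy("event_date") \
      .parquet("hdfs:///data/curated/events")

# Filtering on the partition column lets Spark skip irrelevant directories
# (partition pruning), reducing I/O before any rows are scanned.
recent = (spark.read.parquet("hdfs:///data/curated/events")
          .filter("event_date >= '2017-01-01'"))
recent.groupBy("event_type").count().show()
```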
Environment: R 9.0, R Studio, Machine Learning, Informatica 9.0, Scala, Spark, Cassandra, DL4J, ND4J, Scikit-learn, Shogun, Accord Framework/AForge.net, Mahout, MLlib, H2O, Cloudera Oryx, GoLearn, Apache, Unix, Python 3.5.2, MLLib, SAS, regression, logistic regression, Hadoop 2.7.4, NoSQL, Teradata, OLTP, random forest, OLAP, HDFS, ODS, NLTK, SVM, JSON, XML and MapReduce.
Confidential, Frisco, Texas
Data Scientist/Machine Learning Engineer
Responsibilities:
- Utilized Spark, Scala, Hadoop, HQL, VQL, Oozie, PySpark, Data Lake, TensorFlow, HBase, Cassandra, Redshift, MongoDB, Kafka, Kinesis, Spark Streaming, Edward, CUDA, MLlib, AWS, and Python, along with a broad variety of machine learning methods including classification, regression, dimensionality reduction, etc.
- Utilized the engine to increase user lifetime by 45% and triple user conversions for target categories.
- Applied various machine learning algorithms and statistical modeling techniques such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python and MATLAB.
- Used version control tools such as Git 2.x and build tools such as Apache Maven and Ant.
- Worked on analyzing data from Google Analytics, AdWords, Facebook, etc.
- Evaluated models using cross-validation, log loss, and ROC curves, used AUC for feature selection, and worked with Elastic technologies such as Elasticsearch and Kibana.
- Performed data profiling to learn about user behavior with various features such as traffic pattern, location, date, and time.
- Categorized comments from different social networking sites into positive and negative clusters using sentiment analysis and text analytics.
- Used Python scripts to update content in the database and manipulate files
- Skilled in using dplyr (R) and pandas (Python) for exploratory data analysis.
- Applied multinomial logistic regression, decision trees, random forest, and SVM to classify whether a package would be delivered on time for a new route.
- Used Jenkins for Continuous Integration Builds and deployments (CI/CD).
- Performed data analysis using Hive to retrieve data from the Hadoop cluster and SQL to retrieve data from the Oracle database, and used ETL for data transformation.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Explored DAGs, their dependencies, and logs using Airflow pipelines for automation.
- Performed data cleaning and feature selection using the MLlib package in PySpark, and worked with deep learning frameworks such as Caffe and Neon.
- Developed Spark/Scala, R, and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
- Used clustering technique K-Means to identify outliers and to classify unlabeled data.
- Tracked operations using sensors until certain criteria were met, using Airflow.
- Responsible for data mapping activities from source systems to Teradata using utilities such as TPump, FastExport (FEXP), BTEQ, MultiLoad (MLOAD), and FastLoad (FLOAD).
- Implemented CI/CD pipelines for Java applications.
- Implemented CI/CD on the Azure cloud platform.
- Analyzed traffic patterns by calculating autocorrelation at different time lags.
- Ensured that the model had a low false positive rate; performed text classification and sentiment analysis for unstructured and semi-structured data.
- Addressed overfitting by applying regularization methods such as L1 and L2 (a sketch of a regularized classifier evaluated with cross-validated AUC follows this list).
- Used Principal Component Analysis in feature engineering to analyze high dimensional data.
- Used MLlib, Spark's Machine learning library to build and evaluate different models.
- Implemented a rule-based expert system from the results of exploratory analysis and information gathered from people in different departments.
- Designed and created reports that use gathered metrics to infer and draw logical conclusions about past and future behavior.
- Developed MapReduce pipeline for feature extraction using Hive and Pig.
- Created Data Quality Scripts using SQL and Hive to validate successful data load and quality of the data. Created various types of data visualizations using Python and Tableau.
- Communicated results to the operations team to support decision-making.
- Collected data needs and requirements by interacting with other departments.
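A minimal scikit-learn sketch of the regularization, dimensionality-reduction, and cross-validated AUC workflow referenced above; the feature matrix, labels, and parameter values are placeholders rather than the production model.

```python
# Minimal sketch (scikit-learn); X and y are synthetic stand-ins.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 40))                 # stand-in for engineered features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # stand-in for binary labels

# Scale, reduce dimensionality with PCA, then fit an L2-regularized logistic
# regression; C controls regularization strength (smaller = stronger penalty).
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("clf", LogisticRegression(penalty="l2", C=1.0, solver="liblinear")),
])

# Cross-validated AUC guards against overfitting to a single train/test split.
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("mean AUC: %.3f" % auc.mean())
```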
Environment: Apache, Spark MLlib, TensorFlow, Oryx 2, Accord.NET, Amazon Machine Learning (AML), Python, Django, Flask, ORM, Jinja2, Mako, Naive Bayes, SVM, K-Means, ANN, Regression.
Confidential, MI
Data Engineer/Data Analyst
Responsibilities:
- Architected and designed solutions for complex business requirements, including data processing, analytics, ETL, and reporting processes, to improve the performance of data loads and processes.
- Developed a high-performance, scalable data architecture solution incorporating a matrix of technologies to relate architectural decisions to business needs.
- Conducted strategy and architecture sessions and delivered artifacts such as MDM strategy (current state, interim state, and target state) and MDM architecture (conceptual, logical, and physical) at the detail level.
- Implemented continuous integration using Jenkins and involved in the deployment of application with Ansible automation engine.
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib (see the sketch after this list).
- Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them to the server.
- Created SSIS packages that load data from the CMS to the EMS library database, and was involved in data modeling and providing Teradata-related technical solutions to the team.
- Designed the physical model for implementation in an Oracle 11g database and developed SQL queries to retrieve complex data from different tables in Hemisphere using joins and database links.
- Worked with the admin team to upgrade Hadoop 1.0 to 2.0 using Apache Ambari 2.0.1 and configured it with Hue.
- Wrote SQL queries, PL/SQL procedures/packages, triggers, and cursors to extract and process data from various source tables of the database.
- Created Hive Tables, loaded transactional data from Teradata using Sqoop and created and worked Sqoop jobs with the incremental load to populate Hive External tables.
- Developed LINUX Shell scripts by using NZSQL/NZLOAD utilities to load data from flat files to Netezza database.
- Used Erwin to create logical and physical data models for an enterprise-wide OLAP system, and was involved in mapping data elements from the user interface to the database to help identify gaps.
- Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
- Developed complex SQL scripts for Teradata database for creating BI layer on DW for Tableau reporting.
- Extensively used ETL methodology for supporting data extraction, transformations and loading processing, in a complex EDW using Informatica.
- Created Active Batch jobs to load data from distribution servers to PostgreSQL DB using *.bat files and worked on CDC schema to keep track of all transactions.
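A minimal PySpark ML sketch of the kind of Spark ML use case referenced above; the feature table path, column names, and model settings are illustrative assumptions.

```python
# Minimal sketch (PySpark ML pipeline); names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark-ml-usecase").getOrCreate()

df = spark.read.parquet("hdfs:///data/features")   # hypothetical feature table

# Assemble raw columns into a feature vector, then fit a random forest.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
pipeline = Pipeline(stages=[assembler, rf])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
preds = model.transform(test)

# Evaluate with area under the ROC curve on the held-out split.
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
print(evaluator.evaluate(preds))
```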
Environment: Erwin 9.5, MS Visio, Oracle 11g, Oracle Designer, MDM, Power BI, SAS, SSIS, Tableau, Tivoli Job Scheduler, SQL Server 2012, DataFlux 6.1, JavaScript, AWS Redshift, PL/SQL, SSRS, PostgreSQL, DataStage, SQL Navigator, Crystal Reports 9, Hive, Netezza, Teradata, T-SQL, Informatica.
Confidential, Mission Viejo, CA
Machine Learning
Responsibilities:
- Designed and developed data warehouse architecture, data modeling/conversion solutions, and ETL mapping solutions within structured data warehouse environments.
- Reconcile data and ensure data integrity and consistency across various organizational operating platforms for business impact.
- Worked on Test-driven development, continuous integration systems, and Agile software development.
- Define best practices for data loading and extraction and ensure architectural alignment of the designs and development.
- Involved in the creation and maintenance of the data warehouse and repositories containing metadata.
- Used the ETL tool Informatica to populate the database and transform data from the old database to the new database using Oracle and SQL Server.
- Resolved the data type inconsistencies between the source systems and the target system using the Mapping Documents and analyzing the database using SQL queries.
- Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data (see the sketch after this list).
- Developed Data Migration and Cleansing rules for the Integration Architecture (OLTP, ODS, DW).
- Created Dashboards on Tableau from different sources using data blending from Oracle, SQL Server, MS Access and CSV at a single instance.
- Documented logical, physical, relational and dimensional data models. Designed the data marts in dimensional data modeling using star and snowflake schemas.
- Created dimensional model based on star schemas and designed them using ERwin.
- Performed data modeling and design of the data warehouse and data marts in star schema methodology with conformed and granular dimensions and fact tables.
- Developed SQL Queries to fetch complex data from different tables in remote databases using joins, database links and Bulk collects.
- Responsible for implementing HL7 to build Orders, Results, ADT, and DFT interfaces for client hospitals.
- Connected to Amazon Redshift through Tableau to extract live data for real-time analysis.
- Worked on data modeling and produced data mapping and data definition specification documentation.
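A minimal sketch of consuming live Kafka data with Spark Structured Streaming, as referenced above; the broker address and topic name are placeholders, and running it requires the spark-sql-kafka connector package.

```python
# Minimal sketch (Spark Structured Streaming with a Kafka source);
# broker address and topic name are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic; records arrive with binary key/value columns.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events")                      # placeholder topic
          .load())

# Decode the payload and count records per key in each micro-batch.
counts = (stream
          .select(col("key").cast("string").alias("key"),
                  col("value").cast("string").alias("value"))
          .groupBy("key")
          .count())

# Write the running counts to the console for inspection.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```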
Environment: Erwin, Oracle, SQL Server 2008, Power BI, MS Excel, Netezza, Agile, MS Visio, Rational Rose, Requisite Pro, SAS, SSIS, SSRS, Windows 7.
Confidential
Data Scientist
Responsibilities:
- Designed logical and physical data models for multiple OLTP and Analytic applications.
- Involved in analyzing business requirements, keeping track of data available from various data sources, and transforming and loading the data into target tables using Informatica PowerCenter.
- Extensively used the Erwin design tool & Erwin model manager to create and maintain the Data Mart.
- Extensively used Star Schema methodologies in building and designing the logical data model into dimensional models.
- Involved with data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
- Performed database performance tuning, including indexing, SQL statement optimization, and server monitoring.
- Wrote SQL Queries, Dynamic-queries, sub-queries and complex joins for generating Complex Stored Procedures, Triggers, User-defined Functions, Views and Cursors.
- Experienced in creating UNIX scripts for file transfer and file manipulation and utilized SDLC and Agile methodologies such as SCRUM.
- Wrote simple and advanced SQL queries and scripts to create standard and ad hoc reports for senior managers.
- Collaborated with ETL/Informatica teams to source data and performed data analysis to identify gaps.
- Applied expert-level understanding of different databases in combination for data extraction and loading, joining data extracted from different databases and loading it into a specific database.
Environment: SQL Server, UML, Business Objects 5, Teradata, Windows XP, SSIS, SSRS, Embarcadero ER/Studio, Erwin, DB2, Informatica, Oracle, Query Management Facility (QMF), DataStage, ClearCase forms.
Confidential
Data Architect
Responsibilities:
- Clients include eBay, Click Forensics, Cars.com, Turn.com, Microsoft, and Looksmart.
- Designed the architecture for one of the first Analytics 3.0 online platforms: all-purpose scoring, with on-demand, SaaS, and API services; currently under implementation.
- Applied web crawling and text mining techniques to score referral domains, generate keyword taxonomies, and assess the commercial value of bid keywords.
- Used RAD as Development IDE for web applications.
- Implemented the Metadata Repository and maintained data quality, data cleanup procedures, transformations, data standards, the data governance program, scripts, stored procedures, triggers, and execution of test plans.
- Developed Internet traffic scoring platform for ad networks, advertisers and publishers (rule engine, site scoring, keyword scoring, lift measurement, linkage analysis).
- Automated bidding for advertiser campaigns based either on keyword or category (run-of-site) bidding.
- Created multimillion-entry bid keyword lists using extensive web crawling, and identified metrics to measure the quality of each list (yield or coverage, volume, and average keyword financial value); a simplified crawling sketch follows this list.
- Maintained the Enterprise Metadata Library with any changes or updates.
- Documented data quality and traceability for each source interface.
- Established standard operating procedures.
- Generated weekly and monthly asset inventory reports.
- Responsible for communication and negotiation on project-related aspects, including project loans, construction budget, design alterations, and unexpected events on the project.
- Responsible for defining the key identifiers for each mapping/interface.
- Developed a new hybrid statistical and data mining technique known as hidden decision trees and hidden forests.
- Reverse-engineered keyword pricing algorithms in the context of pay-per-click arbitrage.
- Performed data quality checks in Talend Open Studio.
- Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.
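A simplified, illustrative Python sketch of crawling pages and ranking candidate keywords by frequency, as referenced above; the URL, regular expressions, and scoring are hypothetical stand-ins for the production crawler and keyword-valuation logic.

```python
# Minimal sketch; URLs and keyword heuristics are hypothetical.
import re
from collections import Counter
from urllib.request import urlopen

def keyword_counts(url):
    """Fetch one page and count candidate keywords from its visible text."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", html)            # crude tag stripping
    words = re.findall(r"[a-z]{4,}", text.lower())  # candidate keywords
    return Counter(words)

def score_keywords(urls, top_n=50):
    """Aggregate counts across seed pages and rank keywords by frequency."""
    totals = Counter()
    for url in urls:
        totals.update(keyword_counts(url))
    return totals.most_common(top_n)

if __name__ == "__main__":
    print(score_keywords(["https://example.com"]))
```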
Environment: Erwin r7.0, SQL Server 2000/2005, Windows XP/NT/2000, Oracle 8i/9i, MS-DTS, UML, UAT, SQL Loader, OOD, OLTP, PL/SQL, MS Visio, Informatica.