Data Scientist Resume

Boston, MA

SUMMARY:

  • 9 years of experience in IT as a Data Scientist, developing technologies and algorithms for internet applications. Over the years, I have worked on large-scale social recommendation systems, link prediction in social networks, and search technologies.
  • Experience in categorization of documents/queries/tweets, information retrieval, search relevance, information extraction, query expansion, search spam detection, web mining, machine learning algorithms, data clustering, and classification algorithms.
  • Extensive experience in Text Analytics, developing Statistical Machine Learning and Data Mining solutions to various business problems, and generating data visualizations using R, Python and Tableau.
  • Designing of Physical Data Architecture of New system engines.
  • Hands-on experience in implementing LDA and Naïve Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis, with good knowledge of Recommender Systems.
  • Proficient in Statistical Modeling and Machine Learning techniques (Linear, Logistic, Decision Trees, Random Forest, SVM, K-Nearest Neighbors, Bayesian, XGBoost) in Forecasting/Predictive Analytics, Segmentation methodologies, Regression-based models, Hypothesis testing, Factor analysis/PCA, Ensembles.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Developing Logical Data Architecture with adherence to Enterprise Architecture.
  • Strong experience in Software Development Life Cycle (SDLC) including Requirements Analysis, Design Specification and Testing as per Cycle in both Waterfall and Agile methodologies.
  • I have also worked on building high performance distributed systems for web crawling and caching.
  • I have also developed and published articles on novel algorithms for distributed caching, network monitoring, multi-processor scheduling, web mining, etc.
  • Result-oriented, hands-on freelance professional, with a successful record of accomplishments in ICT project management, as well as in academic research.
  • Experience and technical proficiency in designing and data modeling online applications, and as Solution Lead for architecting Data Warehouse/Business Intelligence applications.
  • Good understanding of Teradata SQL Assistant, Teradata Administrator and data load/export utilities like BTEQ, FastLoad, MultiLoad, FastExport.
  • Performed statistical and graphical analytics using pandas and BI tools such as Tableau.
  • Experience with Data Analytics, Data Reporting, Ad-hoc Reporting, Graphs, Scales, PivotTables and OLAP reporting.
  • Highly skilled in using Hadoop (Pig and Hive) for basic analysis and extraction of data in the infrastructure to provide data summarization.
  • Highly skilled in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
  • Implemented and programmed the Google AdWords API to automatically find millions of new high value/high volume keywords for advertising campaigns (Perl, SOAP, and XML). Taxonomy improvement.
  • Creation of multimillion bid keyword lists using extensive web crawling.
  • Identification of metrics to measure the quality of each list (yield or coverage, volume, and keyword average financial value).
  • Understanding and implementation of text mining concepts, graph processing, and semi-structured and unstructured data processing.
  • Statistical Modelling with ML to bring Insights in Data under guidance of Principal Data Scientist.
  • Big Data Hub coordination technically with applications to visualize the insights.
  • Design, build, and deploy Machine Learning applications to solve real-world problems empirically.
  • Experience with varied forms of practical data, including Image, Speech, Text, Video, Motion-capture & other high-dimensional data.

TECHNICAL SKILLS:

Languages: R, Python

Big Data: MapReduce, HDFS, Spark

NoSQL: MongoDB

Analysis: Feature Selection Methods, Principal Component Analysis, Supervised and Unsupervised Learning, Classification Techniques, Topic modeling, Model building, Time Series

Relational Databases: Oracle, MySQL, SQL Server, PostgreSQL

Tools: Alteryx (User Interface), Tableau, QlikView, MS Excel, MS Access, PyTorch

R Packages: ggplot2, caret, dplyr, RWeka, gmodels, RCurl, tm, C50, twitteR, NLP, reshape2, rjson, plyr

Python Packages: NumPy, Pandas, Matplotlib, Scikit-learn, XGBoost, LightGBM, Seaborn, Beautiful Soup, OS, Statsmodels, NLTK and scikit-image

PROFESSIONAL EXPERIENCE:

Confidential - Boston, MA

Data Scientist

Responsibilities:

  • Applied Natural Language Processing and Recurrent Neural Networks (LSTM RNNs), trained using deep learning techniques, to a fraud detection system.
  • Clean data was migrated from warehouses and streamed into Kafka feeding the Spark engine.
  • Performed transformations on loaded datasets using Python over the Spark engine, for both batch and streaming data.
  • Assimilated real-time credit data from monitoring agencies and deployed Docker containers, allowing the generated insights/reports to be deployed into the in-house database management tool.
  • Improved the existing fraud detection using Digital Links from TransUnion and Experian.
  • Scaled Machine Learning pipelines to 4,600 processors and 35,000 GB of memory, achieving simultaneous runs.
  • Deployed GUI pages by using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript, and AJAX.
  • Configured the project on WebSphere 6.1 application servers.
  • Designed a new Machine Learning pipeline to replace the existing process, increasing efficiency from 81% to 90%.
  • Handled 2+ TB of data with graphs up to 130 GB (50M nodes, 100M edges) using single-node in-disk scaling.
  • Developed a Machine Learning test-bed with 24 different model learning and feature learning algorithms.
  • By thorough systematic search, demonstrated performance surpassing the state-of-the-art (deep learning).
  • Achieved up to 10 times more accurate predictions than existing state-of-the-art algorithms.
  • Developed in-disk, huge (100GB+), highly complex Machine Learning models.
  • Used SAX and DOM parsers to parse the raw XML documents
  • Used RAD as Development IDE for web applications.
  • Demonstrated performances comparable to other state-of-the-art deep learning models.
  • Devised and implemented a credit monitoring application using Databricks.
  • Deployed the application on Spark, using data from the predictive model created earlier.
  • Tested the insights using Naïve Bayes Algorithm and simulated a model to predict credit frauds.
  • Warehoused data into Amazon Redshift servers; the model is now being tested by TransUnion.
  • Devised a novel machine learning algorithm for classification of Credit Score prediction models.
  • Created visualization dashboards and hosted APIs using Tableau.

Environment: Hadoop, Spark, OLAP, DB2, Metadata, Scala, Python, Amazon S3, Kafka, CoreML, Automated Logistic regression models, Informatica 9.0, MongoDB

Confidential, Times Square, NY

Responsibilities:

  • Manipulating, cleansing & processing data using Excel, Access and SQL.
  • Responsible for loading, extracting and validation of client data.
  • Modelled clean data into the Kafka servers for use over the Spark engine.
  • Used ZooKeeper along with Kafka for data streaming and end-to-end client communication.
  • Performed transformations over the warehoused data using Scala & Python and modelled the data back into the servers for iterative transformations into Kafka.
  • Modelled data using machine learning libraries (scikit-learn), along with SVM and KNN based classification, to create a training dataset for use in a predictive model.
  • Assimilated data from Blogs, Feedback systems for NLP based processing models.
  • Liaising with end-users and 3rd party suppliers.
  • Analyzing raw data, drawing conclusions & developing recommendations
  • Writing T-SQL scripts to manipulate data for data loads and extracts.
  • Developing data analytical databases from complex financial source data.
  • Performing daily system checks.
  • Data entry, data auditing, creating data reports & monitoring all data for accuracy.
  • Designing, developing and implementing new functionality.
  • Monitoring the automated loading processes.
  • Advising on the suitability of methodologies and suggesting improvements.
  • Carrying out specified data processing and statistical techniques.
  • Supplying qualitative and quantitative data to colleagues & clients.
  • Using Informatica to extract, transform & load source data from transaction systems.
  • Loaded packages and stored procedures using Base SAS and integrated functional and business requirements using the EBI suite.
  • Creating data pipelines using big data technologies like Hadoop, Spark, etc.
  • Creating statistical models using distributed and standalone models to build various diagnostic, predictive and prescriptive solutions.
  • Utilized a broad variety of statistical packages and tools such as R, MLlib, CoreML, Graphs, Hadoop, Spark, MapReduce, and Pig.
  • Created a UI dashboard for end users and performed prototype testing using Tableau.
  • Refine and train models based on domain knowledge and customer business objectives
  • Deliver or collaborate on delivering effective visualizations to support the client business objectives
  • Communicated with peers and managers promptly as and when required.
  • Produce solid and effective strategies based on accurate and meaningful data reports and analysis and/or keen observations.
  • Establish and maintain communication with clients and/or team members; understand needs, resolve issues, and meet expectations
  • Developed web applications using .NET technologies; worked on bug fixes/issues arising in the production environment and resolved them promptly.

Environment: SQL Server 2008R2/2005 Enterprise, Kafka, Scala, Python, Spark, Hadoop, Crystal Reports, Windows Enterprise Server 2000, MongoDB, SQL Profiler, Query Analyzer, TensorFlow.

Confidential - San Jose, CA

Data Scientist

Responsibilities:

  • Data mining using state-of-the-art methods
  • Extending company's data with third party sources of information when needed
  • Enhancing data collection procedures to include information that is relevant for building analytic systems
  • Processing, cleansing, and verifying the integrity of data used for analysis
  • Doing ad-hoc analysis and presenting results in a clear manner
  • Creating automated anomaly detection systems and constant tracking of its performance
  • Strong command of data architecture and data modelling techniques.
  • Hands-on experience with data mining and big data tools such as Splunk, R, MapReduce, YARN, Pig, Hive, Flume, Oozie, Scala, HBase, HDFS, Sqoop, and Spark.
  • Developed scalable machine learning solutions within a distributed computation framework (e.g. Hadoop, Spark, Storm etc.).
  • Utilizing NLP applications such as topic models and sentiment analysis to identify trends and patterns within massive data sets.
  • Knowledge of ML libraries such as TensorFlow and statistical libraries (e.g. Scikit-learn, Pandas).
  • Knowledge of building predictive models to forecast risks for product launches and operations and to help predict workflow and capacity requirements for TRMS operations.
  • Experience with visualization technologies such as Tableau.
  • Drew inferences and conclusions, created dashboards and visualizations of processed data, and identified trends and anomalies.
  • Generation of TLFs and summary reports, etc. ensuring on-time quality delivery.
  • Participated in client meetings, teleconferences and video conferences to keep track of project requirements, commitments made and the delivery thereof.
  • Solved analytical problems and effectively communicated methodologies and results.
  • Worked closely with internal stakeholders such as business teams, product managers, engineering teams, and partner teams.
  • Created automated metrics using complex databases.
  • Fostered a culture of continuous engineering improvement through mentoring, feedback, and metrics.

Environment: Erwin r9.0, Informatica 9.0, ODS, OLTP, Oracle 10g, Hive, OLAP, DB2, Teradata, MS Excel, Hadoop with Python, PL/SQL, Spark with Python, TensorFlow, Deep learning Libraries, Tableau.

Confidential - Princeton, NJ

Data Scientist

Responsibilities:

  • Statistical Modelling with ML to bring Insights in Data under guidance of Principal Data Scientist
  • Data modeling with Pig, Hive, Impala.
  • Ingestion with Sqoop, Flume.
  • Used SVN to commit the Changes into the main EMM application trunk.
  • Understanding and implementation of text mining concepts, graph processing, and semi-structured and unstructured data processing.
  • Worked with Ajax API calls to communicate with Hadoop through Impala Connection and SQL to render the required data through it. These API calls are similar to Microsoft Cognitive API calls.
  • Good grip on Cloudera and HDP ecosystem components.
  • Used Elasticsearch (Big Data) to retrieve data into the application as required.
  • Ran MapReduce programs on nodes running in the cluster.
  • Developed multiple MapReduce jobs in Scala for data cleaning and preprocessing.
  • Analyzed the partitioned and bucketed data and computed various metrics for reporting.
  • Involved in loading data from RDBMS and web logs into HDFS using Sqoop and Flume.
  • Worked on loading the data from MySQL to HBase where necessary using Sqoop.
  • Developed Hive queries for Analysis across different banners.
  • Extracted data from Twitter using Java and the Twitter API. Parsed JSON-formatted Twitter data and uploaded it to the database.
  • Launching Amazon EC2 Cloud Instances using Amazon Images (Linux/ Ubuntu) and Configuring launched instances with respect to specific applications.
  • Exported the result set from Hive to MySQL using Sqoop after processing the data.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Have hands on experience working on Sequence files, AVRO, HAR file formats and compression.
  • Used Hive to partition and bucket data.
  • Experience in writing MapReduce programs with the Java API to cleanse structured and unstructured data.
  • Wrote Pig Scripts to perform ETL procedures on the data in HDFS.
  • Created HBase tables to store various data formats of data coming from different portfolios.
  • Worked on improving performance of existing Pig and Hive Queries.

Environment: SQL Server, Oracle 9i, MS Office, Teradata 14.1, Informatica, ER Studio, XML, Business Objects, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, RStudio, Mahout, Java, Hive, AWS.

Confidential

Data Architect/Data Modeler

Responsibilities:

  • Worked with large amounts of structured and unstructured data.
  • Knowledge in Machine Learning concepts (Generalized Linear models, Regularization, Random Forest, Time Series models, etc.)
  • Worked in Business Intelligence tools and visualization tools such as Business Objects, Tableau, ChartIO, etc.
  • Deployed GUI pages by using JSP, JSTL, HTML, DHTML, XHTML, CSS, JavaScript, and AJAX.
  • Configured the project on WebSphere 6.1 application servers
  • Implemented the online application using Core Java, JDBC, JSP, Servlets and EJB 1.1, Web Services, SOAP, WSDL.
  • Handled end-to-end project from data discovery to model deployment.
  • Monitoring the automated loading processes.
  • Communicated with other health care information systems using Web Services with SOAP, WSDL, and JAX-RPC.
  • Used Singleton, Factory, and DAO design patterns based on the application requirements.
  • Used SAX and DOM parsers to parse the raw XML documents
  • Used RAD as Development IDE for web applications.
  • Preparing and executing Unit test cases
  • Used Log4J logging framework to write Log messages with various levels.
  • Involved in fixing bugs and minor enhancements for the front-end modules.
  • Used Microsoft Visio and Rational Rose to design the use case diagrams, class models, sequence diagrams, and activity diagrams for the SDLC process of the application.
  • Doing functional and technical reviews
  • Supported the testing team for System testing/Integration/UAT.
  • Guaranteeing quality in the deliverables.
  • Conducted Design reviews and Technical reviews with other project stakeholders.
  • Was a part of the complete life cycle of the project from the requirements to the production support
  • Created test plan documents for all back-end database modules
  • Implemented the project in Linux environment.

Environment: R 3.0, Erwin 9.5, Tableau 8.0, MDM, QlikView, MLlib, PL/SQL, Teradata 14.1, JSON, Hadoop (HDFS), MapReduce, Pig, Spark, RStudio, Mahout, Java, Hive, AWS.

Confidential

Data Analyst/Data Modeler

Responsibilities:

  • Developed Internet traffic scoring platform for ad networks, advertisers and publishers (rule engine, site scoring, keyword scoring, lift measurement, linkage analysis).
  • Responsible for defining the key identifiers for each mapping/interface.
  • Clients include eBay, Click Forensics, Cars.com, Turn.com, Microsoft, and Looksmart.
  • Designed the architecture for one of the first Analytics 3.0 online platforms: all-purpose scoring, with on-demand, SaaS, and API services. Currently under implementation.
  • Web crawling and text mining techniques to score referral domains, generate keyword taxonomies, and assess commercial value of bid keywords.
  • Developed new hybrid statistical and data mining technique known as hidden decision trees and hidden forests.
  • Reverse engineering of keyword pricing algorithms in the context of pay-per-click arbitrage.
  • Implementation of Metadata Repository, Maintaining Data Quality, Data Cleanup procedures, Transformations, Data Standards, Data Governance program, Scripts, Stored Procedures, triggers and execution of test plans
  • Performed data quality checks in Talend Open Studio.
  • Coordinated meetings with vendors to define requirements and system interaction agreement documentation between client and vendor system.
  • Automated bidding for advertiser campaigns based either on keyword or category (run-of-site) bidding.
  • Creation of multimillion bid keyword lists using extensive web crawling. Identification of metrics to measure the quality of each list (yield or coverage, volume, and keyword average financial value).
  • Kept the Enterprise Metadata Library current with any changes or updates.
  • Document data quality and traceability documents for each source interface.
  • Establish standards of procedures.
  • Generate weekly and monthly asset inventory reports.

Environment: Erwin r7.0, SQL Server 2000/2005, Windows XP/NT/2000, Oracle 8i/9i, MS-DTS, UML, UAT, SQL Loader, OOD, OLTP, PL/SQL, MS Visio, Informatica.
