Data Scientist Resume
Austin, TX
SUMMARY
- Data Scientist, Machine Learning and Natural Language Processing specialist, and Software Engineer with a combination of solid algorithm design skills and research acumen
- Experience includes PhD-level research in AI, applying neural networks to NLP and implementing deep learning nets for face recognition
- Machine learning methods for classification using "Big Data" technologies such as Apache Spark with Scala, Hadoop, and Cascading; expert programming in Python (using Jupyter notebooks), C/C++, and Java; staying abreast of the latest developments in ML and NLP
- Graph mining in social network analysis
- Performance analysis and optimization
PROFESSIONAL EXPERIENCE
Confidential
Data Scientist
Responsibilities:
- Working on a new sentiment analysis algorithm for a world leader in the Oil & Gas industry, as well as an algorithm to extract information from email threads related to customer care (the cause and resolution of customer complaints) for another Fortune 10 company (a generic baseline sketch follows).
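The sentiment algorithm itself is proprietary and not described here; below is only a generic supervised baseline of the kind such work typically starts from, using scikit-learn (which appears elsewhere in this resume). The example texts and labels are invented placeholders.

    # Generic sentiment-classification baseline; NOT the proprietary
    # algorithm referenced above. Texts and labels are invented.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["pump failure resolved quickly, great support",
             "still waiting on a response, very frustrating"]
    labels = [1, 0]  # 1 = positive, 0 = negative (hypothetical)

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["no resolution after two weeks"]))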
Machine Learning Consultant
Responsibilities:
- Implementing a cascaded convolutional network for use in a face detection system. The network is intended to reproduce results from a recent research paper, using the Torch deep learning framework.
- Other frameworks, such as Theano/Lasagne, were also evaluated. Studied a sequence of papers on other deep neural network architectures as well as OpenFace (CMU) and FaceNet (Google). A sketch of the cascade idea follows below.
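A minimal sketch of the two-stage cascade idea (the original implementation used the Lua Torch framework; PyTorch stands in here purely for illustration, and all layer sizes are invented rather than taken from the paper):

    # Sketch of a two-stage cascade for face detection (sizes invented).
    import torch
    import torch.nn as nn

    class ProposalNet(nn.Module):
        """Small, fast net that scores 12x12 windows as face / non-face."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 10, kernel_size=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(10, 16, kernel_size=3), nn.ReLU(),
            )
            # Fully convolutional head, so the net can slide over whole images.
            self.classifier = nn.Conv2d(16, 2, kernel_size=3)

        def forward(self, x):
            return self.classifier(self.features(x))

    class RefineNet(nn.Module):
        """Larger net that re-scores the 24x24 crops surviving stage one."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 28, kernel_size=3), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(28, 48, kernel_size=3), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.fc = nn.Linear(48 * 4 * 4, 2)

        def forward(self, x):
            return self.fc(self.features(x).flatten(1))

    # Stage one scans cheaply; stage two re-examines only the survivors.
    windows = torch.randn(8, 3, 12, 12)
    scores = ProposalNet()(windows)      # shape: (8, 2, 1, 1)
    crops = torch.randn(3, 3, 24, 24)    # the 3 windows that passed
    refined = RefineNet()(crops)         # shape: (3, 2)

At inference time the proposal net runs fully convolutionally over an image pyramid; only the windows it scores highly are cropped, resized, and re-scored by the refinement net, so most of the image is rejected cheaply.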
Confidential, Austin, TX
Data Scientist
Responsibilities:
- Used word2vec, a neural embedding algorithm, on millions of job descriptions, along with graph clustering algorithms, to assign “signatures” to them for job retrieval (see the gensim sketch after this list). Developed heuristics for word-sense disambiguation and for automatically determining term specificity.
- Tried a community detection approach to document clustering.
- Applied probabilistic topic modeling techniques such as LDA (Latent Dirichlet Allocation) and HDP (Hierarchical Dirichlet Processes), available in the gensim package, to find the major themes in a large job description corpus, and used the model for information retrieval. Also experimented with methods that combine LDA with word2vec (Topical Word Embeddings, lda2vec).
- The above computations were done with Spark/Scala and PySpark in Databricks notebooks, backed by AWS S3. Gained experience with Spark DataFrames, RDDs, and Spark SQL (a minimal PySpark sketch follows this entry).
- Extensively used the Python machine learning and NLP stacks (scikit-learn, nltk, scipy, numpy, as well as newer libraries like spaCy and chainer, a Python neural network library with CUDA and GPU computation support), plus open-source Java libraries like OpenNLP, Stanford CoreNLP, and GATE.
- Developed a gold standard of responses to a carefully engineered set of queries and a random sample of job descriptions to evaluate search engine versions rapidly and without expensive and time-consuming A/B testing.
- Tried developing folksonomy-style tagging methods for documents. In this context, experimented with keyword extraction techniques (Kea, Maui-indexer and KP-Miner).
- Correlated click-through data with presented jobs and combined this with clustering of word neighborhood graphs to find jobs likely to be clicked on.
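A minimal gensim sketch of the word2vec and LDA modeling mentioned above, assuming a pre-tokenized corpus; the loader helper and all hyperparameters are illustrative placeholders, not the production settings:

    # Sketch of the word2vec + LDA modeling in gensim; corpus loader and
    # hyperparameters are illustrative, not the production values.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, Word2Vec

    # docs: pre-tokenized job descriptions, e.g. [["java", "developer", ...], ...]
    docs = load_tokenized_job_descriptions()  # hypothetical helper

    # Neural word embeddings used to build retrieval "signatures".
    w2v = Word2Vec(sentences=docs, vector_size=200, window=5, min_count=5, workers=4)
    print(w2v.wv.most_similar("nurse", topn=5))  # nearby terms in embedding space

    # Probabilistic topic model over the same corpus.
    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=50, passes=5)
    print(lda.print_topics(num_topics=5))  # top terms per discovered theme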
Environment: research aptitude, machine learning algorithms, document clustering, text classification, graph clustering, neural networks, word2vec, lda2vec, spaCy, chainer, pyLDAvis, NP-MSSG, statistical NLP, Python, nltk, scikit-learn, numpy, scipy, Spark, Scala, Spark MLlib, Databricks, Spark DataFrames and Datasets, SQL, MySQL, AWS, Parquet files, gensim, WordNet, Stanford CoreNLP, OpenNLP, Solr, LDA, HDP, Information Retrieval (IR)
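A minimal PySpark sketch of the DataFrame and Spark SQL usage described in this entry; the S3 path and the "title"/"category" columns are invented placeholders:

    # Sketch of DataFrame / Spark SQL usage; path and columns invented.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("job-descriptions").getOrCreate()
    df = spark.read.parquet("s3://example-bucket/job_descriptions/")  # hypothetical

    # DataFrame API: term frequencies over lowercased job titles.
    terms = (df.select(F.explode(F.split(F.lower("title"), r"\s+")).alias("term"))
               .groupBy("term").count()
               .orderBy(F.desc("count")))
    terms.show(20)

    # The same data queried via Spark SQL.
    df.createOrReplaceTempView("jobs")
    spark.sql("SELECT category, COUNT(*) AS n FROM jobs "
              "GROUP BY category ORDER BY n DESC").show(10)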
Confidential, Atlanta, GA
Data Scientist
Responsibilities:
- Applied recent research on neural-net-generated distributed, dense vector representations of words and phrases in experiments to understand the context and intent of a user query, by mining a hitherto unexploited corpus of descriptions of ~1M products sold online by Confidential.
- Used word2vec to overcome vocabulary mismatch by suggesting related search terms with the objective of improving online customer experience on homedepot.com and increasing conversion rates by an order of magnitude.
- Devised and selected algorithms that scale to millions of product descriptions.
- Categorized and provided insight into the reasons for “No Results Found” pages by mining query logs containing tens of millions of unique queries. Assessed the potential impact of better spellchecking, model number recognition, and automatic rephrasing of queries on the customer’s experience and conversion rate.
- Evaluated spell checkers like aspell (with Metaphone 3), hunspell, LingPipe (based on the noisy channel model), and homegrown hybrids thereof to correct spelling errors taking phonetics and context into account and using custom dictionaries.
- Discovered a way to use word2vec for correcting spelling errors in O(1) lookup time (a sketch follows this list).
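A hedged sketch of both word2vec ideas above, assuming a gensim model trained on product descriptions and query logs. The model path, example terms, and dictionary loader are invented, and the O(1) scheme shown (precompute corrections once offline, then correct by constant-time table lookup) is one plausible reading, not the exact production method:

    # Sketch assuming a gensim word2vec model over product text and query
    # logs; the path, example terms, and dictionary loader are invented.
    from gensim.models import Word2Vec

    model = Word2Vec.load("product_descriptions.w2v")  # hypothetical path

    # Related-term suggestion: embedding neighbors surface vocabulary the
    # customer did not use (e.g. "sheetrock" -> "drywall").
    suggestions = [w for w, _ in model.wv.most_similar("sheetrock", topn=5)]

    # Plausible O(1) correction scheme: misspellings frequent enough to
    # appear in the corpus land near their correct forms in embedding
    # space, so map each out-of-dictionary vocabulary word to its nearest
    # in-dictionary neighbor once, offline.
    dictionary = load_custom_dictionary()  # hypothetical in-domain word set
    corrections = {}
    for word in model.wv.index_to_key:
        if word not in dictionary:
            for neighbor, _ in model.wv.most_similar(word, topn=10):
                if neighbor in dictionary:
                    corrections[word] = neighbor
                    break

    def correct(token):
        # Constant-time table lookup at query time.
        return corrections.get(token, token)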
Environment: Python, Java, C/C++, bash, Linux, cygwin, awk, sed, Maven, Ant, ontologies, OWL, RDF, Protégé, OpenRDF, WordNet, neural networks, word2vec, clustering, k-means, kNN, R, Dragon Toolkit, aspell, Hunspell, Jazzy, LingPipe, ARK TurboParser dependency parser, Stanford NLP, GATE, OpenNLP, statistical NLP, TF-IDF, Jaro-Winkler, Levenshtein distance, fuzzy search algorithms, recommendation systems, collaborative filtering, Named Entity Recognition (NER), POS tagging
Confidential, Austin, TX
Senior Software Engineer/Social Network Analytics
Responsibilities:
- Developed a highly scalable and fast technique for analyzing and characterizing the roles of individuals within large social networks, by importing ideas from the analysis of protein interaction networks in bioinformatics. This novel application of graphlets to social networks with ~10^5 edges can, in a matter of seconds, precisely identify individuals who play roles similar to a single exemplar (a sketch of the idea follows this list). It made a US Navy project for identifying potential terrorist threats in a large social network enormously successful and is now part of the core IP of 21CT.
- Employed R packages for principal components analysis, k-means clustering and decision trees to analyze results of using graphlet methods on Facebook100, a complete set of Facebook friendship data from 100 American Universities in 2005.
- Implemented the graphlet application in both C++ and Java for incorporation into the company codebase as a Maven project.
- Participated in a project to study collective entity resolution by fusing network data coming from sources in different modalities. The system is aimed at coalescing multiple monikers belonging to the same individual.
- Gained experience working on DoD SBIR research projects with tight deadlines.
- Created a small OWL ontology with RDF N-Triples using Protégé and Sesame. Experimented with Rya, a distributed RDF repository on top of the Accumulo key-value store. Generated and ran SPARQL queries against the repository (an illustrative sketch follows this entry).
- Converted a group detection algorithm to MapReduce, using the Cascading abstraction layer on top of Hadoop.
- Worked with several Python scripts and libraries as well as R packages for classification, clustering, principal components analysis and visualization.
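A crude illustration of the role-similarity idea, using only 2- and 3-node graphlet statistics; networkx stands in for the production C++/Java code, which counted richer graphlet orbits:

    # Toy role similarity from small graphlet counts (illustrative only).
    import networkx as nx
    import numpy as np

    def orbit_signature(g, node):
        """Per-node signature from 2- and 3-node graphlet statistics."""
        degree = g.degree(node)
        triangles = nx.triangles(g, node)
        wedges = degree * (degree - 1) // 2 - triangles  # open 2-paths centered here
        return np.array([degree, triangles, wedges], dtype=float)

    def most_similar_roles(g, exemplar, k=5):
        """Rank nodes by cosine similarity of their signatures to an exemplar."""
        target = orbit_signature(g, exemplar)
        scores = []
        for n in g.nodes:
            if n == exemplar:
                continue
            sig = orbit_signature(g, n)
            denom = np.linalg.norm(sig) * np.linalg.norm(target)
            scores.append((n, float(sig @ target / denom) if denom else 0.0))
        return sorted(scores, key=lambda t: -t[1])[:k]

Because the signatures are computed once per node and compared with cheap vector operations, a single exemplar can be matched against a network with ~10^5 edges in seconds.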
Environment: Terrorism Intelligence Analytics, DoD contracts, Social Network Analysis, Java, Maven, C++, NoSQL, Accumulo, R, principal components analysis (PCA), machine learning, Python, iPython, Scipy, Numpy, Eclipse, Netbeans, Cytoscape, graphlets, graph mining, RDF, SPARQL, OpenRDF, ontologies, OWL, Protégé, Sesame, Hadoop, Cascading, MapReduce, Big Data, Cloud, Linux, cygwin, bash, sed, awk, svn, Agile, SCRUM, software integration
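An illustrative RDF/SPARQL round trip using Python's rdflib; the project itself used Protégé, Sesame, and Rya on Accumulo, so rdflib is only a stand-in, and the triples below are invented:

    # Build a tiny RDF graph and query it with SPARQL (triples invented).
    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, RDF.type, EX.Person))
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.alice, EX.name, Literal("Alice")))

    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?name WHERE {
            ?p a ex:Person ;
               ex:knows ex:bob ;
               ex:name ?name .
        }
    """)
    for row in results:
        print(row.name)  # -> "Alice"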
Confidential
Senior Software Engineer
Responsibilities:
- Researched and implemented some of the latest IR techniques for query suggestion, relevance feedback and ranked retrieval to modernize and differentiate the company’s two main products in the eDiscovery marketplace.
- Experimented with Latent Semantic Indexing (LSI), as implemented in the “semantic vectors” package, to create models which represent collections of documents in terms of underlying concepts (an illustrative sketch follows this list).
- Enhanced components which are written in Java, Ruby, and C#, use MongoDB, MySQL, and SQL Server databases, and communicate via SOAP/REST web services. Technologies employed include Apache Lucene and Solr (for free-text search), JBoss, Spring, and Maven.
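An illustrative stand-in for the LSI workflow using gensim's LsiModel (the system itself used the Java “semantic vectors” package); the documents here are invented:

    # LSI: project documents into a latent concept space, then rank by
    # cosine similarity to a query (data invented, gensim as stand-in).
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel
    from gensim.similarities import MatrixSimilarity

    docs = [["contract", "breach", "damages"],
            ["email", "privilege", "attorney"],
            ["attorney", "client", "privilege"]]
    dictionary = Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]

    lsi = LsiModel(bow, id2word=dictionary, num_topics=2)  # latent concepts
    index = MatrixSimilarity(lsi[bow])

    query = lsi[dictionary.doc2bow(["privilege", "email"])]
    print(list(index[query]))  # similarity of the query to each document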
Environment: C++, Boost, g++, cygwin, Visual C++, Java, JBoss, Maven, Spring, Svn, QuickBuild, web services, SOAP, REST, XML, Big Data, SQL, NoSQL, MongoDB, Agile, SCRUM, Rally, Applied Research in Information retrieval (IR), TF-IDF, machine learning, algorithm design and implementation, universal hash functions, Bloom filters, performance analysis and optimization, Eclipse, Mockito, Junit, document classification.
Confidential
Senior Staff Software Engineer
Responsibilities:
- Developed RESTful web services in the Java Restlet framework on the Android platform to expose functionalities of an embedded videoconferencing system with Java and C++ components communicating via Google protobuf.
Environment: Java, REST, Restlet framework, web services, JSON, XML, C++, Google protobuf, Android, Agile, SCRUM, Jira, svn
Confidential, Austin, TX
Compiler Engineer
Responsibilities:
- Initiated and led a project to integrate the Confidential retargetable compiler with the Static Single Assignment (SSA) based compiler from HiWare of Switzerland, to modernize it and demonstrate how new, more powerful optimizations enabled by SSA form can improve code quality without degrading performance.
- Re-implemented Global Common Subexpression Elimination and other major dataflow optimizations in the Confidential Intermediate Representation Optimizer to remove flaws and enhance code quality (a toy sketch of the idea follows below).
- Measured compilation speed and code quality using Intel VTune and the EEMBC, gcc, and SPEC92 benchmarks.
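A toy illustration of the idea behind common-subexpression elimination, done here via local value numbering over a made-up SSA-style IR; the actual work was performed in C inside the production compiler's intermediate representation:

    # Value numbering over a toy SSA-style IR (made up for illustration):
    # an expression already computed is reused instead of recomputed.
    def eliminate_common_subexpressions(instructions):
        """instructions: list of (dest, op, arg1, arg2) tuples in SSA form."""
        value_table = {}   # (op, arg1, arg2) -> dest that first computed it
        canonical = {}     # dest -> earlier name it is equivalent to
        out = []
        for dest, op, a1, a2 in instructions:
            a1 = canonical.get(a1, a1)
            a2 = canonical.get(a2, a2)
            key = (op, a1, a2)
            if key in value_table:             # seen before: reuse the result
                canonical[dest] = value_table[key]
            else:
                value_table[key] = dest
                out.append((dest, op, a1, a2))
        return out

    # t5 recomputes a + b, so it folds away and t6 uses t3 instead.
    code = [("t3", "+", "a", "b"), ("t4", "*", "t3", "c"),
            ("t5", "+", "a", "b"), ("t6", "*", "t5", "d")]
    print(eliminate_common_subexpressions(code))
    # [('t3', '+', 'a', 'b'), ('t4', '*', 't3', 'c'), ('t6', '*', 't3', 'd')]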
Environment: C, compiler design, CodeWarrior IDE, CVS, SSA Form, performance analysis, Intel VTune, collaboration