
Data Scientist Resume


SUMMARY:

  • Big Data architect and data scientist with extensive experience architecting, designing, and developing software applications. Highly proficient DBA and developer on MPP databases such as Greenplum, and proficient in Oracle, PostgreSQL, MySQL, and Microsoft SQL Server. Experienced in Java, Python, Scala, and R, with a good working knowledge of core data science methodologies, including machine learning.
  • Expertise in developing web applications using Tomcat, JavaScript (jQuery), and Java. Strong working knowledge of Python, Perl, R, and scripting languages such as bash. Experience building and developing Hadoop-based ecosystems. Good working knowledge of Linux administration, networking, security, and encryption.
  • Core Competencies:
  • Data Architecture
  • Data Science
  • Big Data (Greenplum, Hadoop, Spark)
  • Design, Development and Migration of software applications.
  • Data science techniques such as regression and classification (random forests) using R and Python (NumPy, SciPy, scikit-learn, and Matplotlib).
  • Design and build complex ETL, auditing, and database maintenance code using Java, Python, and SQL.
  • Analyze data using SQL, R, Java, Scala, Python, and Spark, and present analytical reports to management and technical teams.
  • Build web-based applications using Java, Tomcat, and JavaScript (jQuery).

TECHNICAL SKILLS:

Languages & Development Tools: Java, Eclipse, Tomcat, JavaScript, jQuery, Python, Perl, R, SQL, SSIS, Hadoop, MapReduce, Pig, Hive, C

Operating Systems, Utilities & Virtualization Tools: Linux, Solaris, Microsoft Windows, VMware, HDFS, EMC Data Domain

Network Systems & Technologies: OpenSSL

Database Management Systems: Greenplum, Oracle, PostgreSQL, MySQL, Microsoft SQL Server

Commercial Software: Tableau, Oracle Forms/Reports/Designer/Discoverer

Methodologies & Standards: Agile, XP

PROFESSIONAL EXPERIENCE:

Confidential

Data Scientist

Responsibilities:

  • Developed a Python application that consumes PDF documents of varying formats and parses them into structured comma-delimited output for loading into databases and Hive. Used regular expressions and machine learning algorithms (decision trees and random forests) to standardize content headers, and Python's multiprocessing module to process thousands of documents quickly (a sketch of the pipeline follows this list).
  • Authored an operations manual for the Hadoop ecosystem. Created a multi-node Cloudera cluster on AWS to explore operational needs and procedures for every ecosystem component.
  • Completed a POC project building and deploying Greenplum on AWS in a multi-server architecture, and presented the requirements, benefits, and challenges of moving from an appliance-based solution to AWS and other clouds.
  • Completed a structured-data load ETL to Greenplum using Java. The complexities included automated discovery of column names and data types from data files, differing column orders across files, and reading tables directly from an MS Access database (the second sketch after this list illustrates the type-discovery idea).
  • Participated in discussions on migrating applications and infrastructure to the cloud.
  • Presented Apache Spark, its benefits, and its nuances to team members, using the Databricks Community Edition cloud to demonstrate data processing and visualizations.
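
As illustration only, a minimal sketch of the PDF-to-CSV pipeline described above, assuming the pdfminer.six library for text extraction. The header patterns, directory name, and tab-splitting logic are hypothetical placeholders, and the decision-tree/random-forest header standardization from the original application is reduced here to regex rules.

    import csv
    import re
    from multiprocessing import Pool
    from pathlib import Path

    from pdfminer.high_level import extract_text  # assumes pdfminer.six is installed

    # Hypothetical patterns mapping varying header spellings to canonical names.
    HEADER_PATTERNS = {
        re.compile(r"invoice\s*(no|number|#)", re.I): "invoice_number",
        re.compile(r"(total|amount)\s*due", re.I): "amount_due",
    }

    def standardize(header):
        """Map a raw header to a canonical column name via regex rules."""
        for pattern, canonical in HEADER_PATTERNS.items():
            if pattern.search(header):
                return canonical
        return header.strip().lower().replace(" ", "_")

    def pdf_to_rows(path):
        """Extract text from one PDF and split it into delimited rows."""
        text = extract_text(str(path))
        rows = [line.split("\t") for line in text.splitlines() if line.strip()]
        if rows:
            rows[0] = [standardize(h) for h in rows[0]]  # normalize the header row
        return rows

    def convert(path):
        """Write one PDF's rows out as CSV for database/Hive loading."""
        with open(path.with_suffix(".csv"), "w", newline="") as out:
            csv.writer(out).writerows(pdf_to_rows(path))

    if __name__ == "__main__":
        # multiprocessing.Pool fans the conversion out across CPU cores,
        # which is how thousands of documents can be processed quickly.
        with Pool() as pool:
            pool.map(convert, Path("pdfs").glob("*.pdf"))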
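
A companion sketch of the automatic column-name and data-type discovery from the ETL bullet; the original was implemented in Java, so this Python version only illustrates the idea, and the sample file and table names are hypothetical.

    import csv

    def infer_type(values):
        """Pick the narrowest SQL type that fits every sampled value."""
        def fits(cast):
            for v in values:
                if v == "":
                    continue  # treat empty strings as NULLs
                try:
                    cast(v)
                except ValueError:
                    return False
            return True
        if fits(int):
            return "bigint"
        if fits(float):
            return "double precision"
        return "text"

    def ddl_from_csv(path, table):
        """Read the header plus a sample of rows and emit a CREATE TABLE."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            sample = [row for _, row in zip(range(1000), reader)]
        cols = []
        for i, name in enumerate(header):
            col_type = infer_type([row[i] for row in sample if i < len(row)])
            cols.append(f"    {name} {col_type}")
        return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

    print(ddl_from_csv("sample.csv", "staging.my_table"))  # hypothetical file/table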

Confidential

Solutions / Data Architect

Responsibilities:

  • Helped multiple application development teams and the DBA team with important design guidelines and solved situation-specific problems.
  • Contributed to a POC application with PostgreSQL.
  • Coded a Java application to auto-convert DDL and migrate data.
  • Helped set up logging and demonstrated encryption using pgcrypto.
  • Converted 170+ view definitions from Oracle to PostgreSQL.
  • Coded a Java application to demonstrate unloading data to CSV and loading from CSV.
  • Designed and coded data generation scripts to simulate tax form data for the Affordable Care Act. The scripts create millions of simulated tax form records using random distributions of several variables while maintaining business rules and referential integrity (a sketch follows this list).
  • Implemented an in-database negative TIN check and obfuscation in Greenplum using web service calls from external web tables and in-database views/functions.
  • Developed a fully functional command-line equivalent of the web-based Greenplum Command Center utility, coded in Java, to address the Section 508 non-compliance of the existing web-based tool.
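
A minimal sketch of the tax-form data generation described above; the field names, distributions, volumes, and business rules shown are hypothetical stand-ins for the actual ACA rules.

    import csv
    import random

    random.seed(42)  # reproducible runs

    def generate(n_filers=1000, forms_per_filer=(1, 3)):
        """Emit filers and their forms so every form references a real filer
        (referential integrity) while amounts follow a random distribution."""
        with open("filers.csv", "w", newline="") as f1, \
             open("forms.csv", "w", newline="") as f2:
            filers, forms = csv.writer(f1), csv.writer(f2)
            filers.writerow(["filer_id", "state"])
            forms.writerow(["form_id", "filer_id", "coverage_months", "premium"])
            form_id = 0
            for filer_id in range(1, n_filers + 1):
                filers.writerow([filer_id, random.choice(["VA", "MD", "DC"])])
                for _ in range(random.randint(*forms_per_filer)):
                    form_id += 1
                    months = random.randint(1, 12)  # hypothetical rule: 1-12 months
                    premium = round(random.gauss(400, 120), 2)
                    forms.writerow([form_id, filer_id, months, max(premium, 0)])

    generate()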

Confidential

Solutions / Data Architect

Responsibilities:

  • Provided guidelines and best practices to follow for Greenplum database management.
  • Coded an ETL module using Java and Greenplum load/SQL utilities to transform and load large volumes of SAS data into Greenplum. The scripts perform automatic data type discovery (coded in Java) to create the corresponding DDL, execute it, and load the data. Complexities such as 3,200+ columns per table and multiple files feeding a single table with non-identical column positions are handled automatically (a sketch of the column-alignment idea follows this list).
  • Coded a web portal for ad-hoc queries in Greenplum that enables non-SQL experts to build complex SQL queries through a dynamic web interface. Other features include the ability to submit large data extracts; data preview in an Excel-like grid with multi-column sorting, column resizing, and paging; execution-plan checks; LDAP authentication; and access to public data sets without a database login id. Built on a Tomcat server using Java, JSP, JavaScript, jQuery, jQuery UI, and Ajax, and designed as a product with its own metadata management system so it can easily migrate or expand to other systems and databases. Integrated MADlib statistical functionality into the product.
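
A minimal sketch of the column-alignment idea from the SAS-load bullet above; the original loader was written in Java, and the file name and canonical column order here are hypothetical.

    import csv

    def align_rows(path, target_columns):
        """Yield rows from one file reordered to a canonical column order,
        so files with differing header positions load into the same table."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Map each canonical column to its position in this file's header;
            # a missing column raises ValueError, flagging a bad input file.
            positions = [header.index(col) for col in target_columns]
            for row in reader:
                yield [row[p] if p < len(row) else "" for p in positions]

    # Hypothetical canonical order shared by every file feeding one table.
    for row in align_rows("part_07.csv", ["id", "name", "amount"]):
        print(row)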

Confidential

Solutions / Data Architect

Responsibilities:

  • Coded a Java-based application to automate both DDL and data migration from SQL Server 2008 to Greenplum. Used the database metadata repository to make the utility completely dynamic and usable for any SQL Server to Greenplum migration (a sketch of the type-mapping idea follows this list).
  • Helped with encryption of data using pgcrypto, including key management and performance tuning. Also coded an alternative PL/Java function to demonstrate how effective custom encryption solutions can be built with custom or external libraries.
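
A minimal sketch of the metadata-driven DDL conversion described above; the original utility was Java and read SQL Server's metadata repository directly, while this Python version works from an already-fetched column list with a deliberately abbreviated type map.

    # Abbreviated SQL Server -> Greenplum/PostgreSQL type map; the real
    # utility covered the full type catalog read from database metadata.
    TYPE_MAP = {
        "int": "integer",
        "bigint": "bigint",
        "bit": "boolean",
        "datetime": "timestamp",
        "nvarchar": "varchar",
        "varchar": "varchar",
        "decimal": "numeric",
    }

    def convert_column(name, ss_type, length=None, precision=None, scale=None):
        """Translate one SQL Server column definition to Greenplum syntax."""
        gp_type = TYPE_MAP.get(ss_type.lower(), "text")
        if gp_type == "varchar" and length and length > 0:
            gp_type += f"({length})"
        elif gp_type == "numeric" and precision:
            gp_type += f"({precision},{scale or 0})"
        return f"{name} {gp_type}"

    def create_table(schema, table, columns):
        """columns: rows shaped like INFORMATION_SCHEMA.COLUMNS output."""
        body = ",\n    ".join(convert_column(*c) for c in columns)
        return f"CREATE TABLE {schema}.{table} (\n    {body}\n);"

    # Hypothetical metadata rows: (name, type, length, precision, scale)
    cols = [("id", "int", None, None, None),
            ("name", "nvarchar", 100, None, None),
            ("balance", "decimal", None, 12, 2)]
    print(create_table("dbo", "accounts", cols))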

Confidential

Solutions / Data Architect

Responsibilities:

  • Developed multiple log parsers in Java for Cisco and other network security logs for ingestion into Greenplum. The parser can process over 1 terabyte of log files on a single Linux server in 2 to 2.5 hours. The log files range from complex to unstructured, requiring extensive regular expression processing and multithreading to reach the desired performance. The parser accommodates 10+ log file types such as Cisco ACS, Blue Coat, ASA, and DHCP (a sketch of the parser's shape follows).
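
A minimal sketch of the parser's shape in Python (the original was multithreaded Java); the single ASA-style pattern and directory layout are hypothetical, where the real parser kept a registry of patterns for each of the 10+ log types.

    import csv
    import re
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    # Hypothetical pattern for one ASA-style line; the real parser kept a
    # registry of patterns per log type (ACS, Blue Coat, DHCP, ...).
    ASA_LINE = re.compile(
        r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8}).*%ASA-(?P<sev>\d)-(?P<msg_id>\d+):\s(?P<msg>.*)$"
    )

    def parse_file(path):
        """Parse one log file into a pipe-delimited file ready for gpload/COPY."""
        with open(path, errors="replace") as src, \
             open(path.with_suffix(".csv"), "w", newline="") as dst:
            writer = csv.writer(dst, delimiter="|")
            for line in src:
                m = ASA_LINE.match(line)
                if m:
                    writer.writerow([m["ts"], m["sev"], m["msg_id"], m["msg"]])

    if __name__ == "__main__":
        # One worker per core keeps the regex-heavy work parallel despite
        # it being CPU-bound.
        with ProcessPoolExecutor() as pool:
            pool.map(parse_file, Path("logs").glob("*.log"))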

Confidential

Solutions / Data Architect

Responsibilities:

  • Provided Greenplum expertise to a development team converting a complex Oracle-based data warehouse (2,000+ tables, 5 TB of data, Informatica, SAS, Tableau, and OBIEE) to Greenplum.
  • Coded functions for partition management.
  • Coded functions for quick application of backfill data to large fact tables.
  • Implemented database-link functionality to read data from Greenplum to Greenplum and from Greenplum to Oracle using external web tables plus Python and Java code. This was very helpful for quick data migration and cross-database data validation and comparison.
  • Coded a Jaro-Winkler string comparison function in Greenplum using Python (a sketch follows this list).
  • Coded scripts for database maintenance, including vacuuming, analyzing, and rebuilding tables.
  • Coded a data replication system to replicate the full production database to a remote server (DCA) using EMC's Data Domain backup servers and Python/shell scripts.
  • Provided workarounds for complex Oracle SQL and for functions not then implemented in Greenplum, such as percentile_cont and percentile_disc.
  • Conducted several sessions with the IT and Development team to explain and discuss Greenplum / PostgreSQL concepts.
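
A minimal sketch of the Jaro-Winkler comparison referenced above; inside Greenplum the same body would be registered via CREATE FUNCTION ... LANGUAGE plpythonu, which is omitted here.

    def jaro(s1, s2):
        """Jaro similarity: fraction of matched characters, penalized for
        transpositions, matching only within a sliding window."""
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if not len1 or not len2:
            return 0.0
        window = max(len1, len2) // 2 - 1
        match1, match2 = [False] * len1, [False] * len2
        matches = 0
        for i, ch in enumerate(s1):
            lo, hi = max(0, i - window), min(len2, i + window + 1)
            for j in range(lo, hi):
                if not match2[j] and s2[j] == ch:
                    match1[i] = match2[j] = True
                    matches += 1
                    break
        if not matches:
            return 0.0
        # Count matched characters that appear in a different order.
        transpositions, j = 0, 0
        for i in range(len1):
            if match1[i]:
                while not match2[j]:
                    j += 1
                if s1[i] != s2[j]:
                    transpositions += 1
                j += 1
        t, m = transpositions / 2, matches
        return (m / len1 + m / len2 + (m - t) / m) / 3

    def jaro_winkler(s1, s2, p=0.1):
        """Boost the Jaro score for a common prefix of up to 4 characters."""
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961, the classic textbook pair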

Confidential

Solutions / Data Architect

Responsibilities:

  • Successfully converted a complex data warehousing environment from Sun/GP3 to DCA/Greenplum 4. Provided technical guidance to the EMC PS team (1-2 people) and to the client's IT team.
  • Coded data transfer processes using Greenplum backup/recovery mechanisms as well as external web tables and custom shell and Python scripts to perform the following:
  • Full database migration
  • Schema migration
  • Migration of specific table(s)
  • Any combination of schema and specific-table migration, including exclusion of tables
  • Used the process described above almost 20 times to perform the ad-hoc data refreshes required for a smooth transition of multiple applications and for effective testing.
  • Converted 7,000+ lines of shell scripts (with support from the EMC PS team) written for Sun Solaris to Linux; these scripts are used by the DBA team for administrative functions.
  • Coded a full backup/recovery and DR system using Data Domain. The goal of replicating an 18 TB database from the production DCA to a standby DCA was met, and the replication process now runs in 8-10 hours every day.
  • Used SNMP extensions to implement a custom web-based monitoring tool, since the request to install SysEdge on the DCA had been denied by EMC support.
  • Helped convert 1,000+ Informatica workflows, 100+ Cognos reports, and all other custom applications. Provided an innovative solution to handle decimal precision in Cognos 8.4, which helped defer migration to the most recent version of Cognos.
