
Data Scientist Resume


SUMMARY:

  • Big Data architect and data scientist with extensive experience architecting, designing, and developing software applications. Highly proficient DBA and developer on MPP databases such as Greenplum, and proficient in Oracle, PostgreSQL, MySQL, and Microsoft SQL Server. Experienced in Java, Python, Scala, and R, with a good working knowledge of core data science methodologies, including machine learning.
  • Expertise in developing web applications using Tomcat, JavaScript (jQuery), and Java. Strong working knowledge of Python, Perl, R, and scripting languages such as bash. Experience building and developing Hadoop-based ecosystems. Good working knowledge of Linux administration, networking, security, and encryption.
  • Core Competencies:
  • Data Architecture
  • Data Science
  • Big Data (Greenplum, Hadoop, Spark)
  • Design, Development and Migration of software applications.
  • Data science techniques such as regression and classification (random forests) using R and Python (NumPy, SciPy, scikit-learn, and Matplotlib).
  • Design and build complex ETL, auditing, and database maintenance code using Java, Python, and SQL.
  • Analyze data using SQL, R, Java, Scala, Python, and Spark, and present analytical reports to management and technical teams.
  • Build web-based applications using Java, Tomcat, and JavaScript (jQuery).

TECHNICAL SKILLS:

Languages & Development Tools: Java, Eclipse, Tomcat, JavaScript, jQuery, Python, Perl, R, SQL, SSIS, Hadoop, MapReduce, Pig, Hive, C

Operating Systems, Utilities & Virtualization Tools: Linux, Solaris, Microsoft Windows, VMware, HDFS, EMC Data Domain

Network Systems & Technologies: OpenSSL

Database Management Systems: Greenplum, Oracle, PostgreSQL, MySQL, Microsoft SQL Server

Commercial Software: Tableau, Oracle Forms/Reports/Designer/Discoverer

Methodologies & Standards: Agile, XP

PROFESSIONAL EXPERIENCE:

Confidential

Data Scientist

Responsibilities:

  • Developed a Python application that consumes PDF documents of varying formats and parses them into structured comma-delimited output for loading into databases and Hive. Used regular expressions and machine learning algorithms (decision trees and random forests) to standardize content headers, and Python's multiprocessing module to process thousands of documents quickly (a sketch of the pipeline follows this list).
  • Authored an operations manual for the Hadoop ecosystem. Created a multi-node Cloudera cluster on AWS to explore operational needs and procedures for every ecosystem component.
  • Completed a POC project building and deploying Greenplum on AWS in a multi-server architecture, and presented the requirements, benefits, and challenges of moving from an appliance-based solution to AWS and other clouds.
  • Completed a structured-data load ETL to Greenplum using Java. The complexities included automated discovery of column names and data types from data files, differing column orders across files, and reading tables directly from an MS Access database (the second sketch after this list illustrates the type-discovery idea).
  • Participated in discussions on migrating applications and infrastructure to the cloud.
  • Presented Apache Spark, its benefits, and its nuances to team members, using the Databricks Community Edition cloud to demonstrate data processing and visualizations.
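
As illustration only, a minimal sketch of the PDF-to-CSV pipeline described above, assuming the pdfminer.six library for text extraction. The header patterns, directory name, and tab-splitting logic are hypothetical placeholders, and the decision-tree/random-forest header standardization from the original application is reduced here to regex rules.

    import csv
    import re
    from multiprocessing import Pool
    from pathlib import Path

    from pdfminer.high_level import extract_text  # assumes pdfminer.six is installed

    # Hypothetical patterns mapping varying header spellings to canonical names.
    HEADER_PATTERNS = {
        re.compile(r"invoice\s*(no|number|#)", re.I): "invoice_number",
        re.compile(r"(total|amount)\s*due", re.I): "amount_due",
    }

    def standardize(header):
        """Map a raw header to a canonical column name via regex rules."""
        for pattern, canonical in HEADER_PATTERNS.items():
            if pattern.search(header):
                return canonical
        return header.strip().lower().replace(" ", "_")

    def pdf_to_rows(path):
        """Extract text from one PDF and split it into delimited rows."""
        text = extract_text(str(path))
        rows = [line.split("\t") for line in text.splitlines() if line.strip()]
        if rows:
            rows[0] = [standardize(h) for h in rows[0]]  # normalize the header row
        return rows

    def convert(path):
        """Write one PDF's rows out as CSV for database/Hive loading."""
        with open(path.with_suffix(".csv"), "w", newline="") as out:
            csv.writer(out).writerows(pdf_to_rows(path))

    if __name__ == "__main__":
        # multiprocessing.Pool fans the conversion out across CPU cores,
        # which is how thousands of documents can be processed quickly.
        with Pool() as pool:
            pool.map(convert, Path("pdfs").glob("*.pdf"))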
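
A companion sketch of the automatic column-name and data-type discovery from the ETL bullet; the original was implemented in Java, so this Python version only illustrates the idea, and the sample file and table names are hypothetical.

    import csv

    def infer_type(values):
        """Pick the narrowest SQL type that fits every sampled value."""
        def fits(cast):
            for v in values:
                if v == "":
                    continue  # treat empty strings as NULLs
                try:
                    cast(v)
                except ValueError:
                    return False
            return True
        if fits(int):
            return "bigint"
        if fits(float):
            return "double precision"
        return "text"

    def ddl_from_csv(path, table):
        """Read the header plus a sample of rows and emit a CREATE TABLE."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            sample = [row for _, row in zip(range(1000), reader)]
        cols = []
        for i, name in enumerate(header):
            col_type = infer_type([row[i] for row in sample if i < len(row)])
            cols.append(f"    {name} {col_type}")
        return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"

    print(ddl_from_csv("sample.csv", "staging.my_table"))  # hypothetical file/table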

Confidential

Solutions / Data Architect

Responsibilities:

  • Helped multiple application development teams and the DBA team with important design guidelines and solved situation-specific problems.
  • Contributed to a POC application with PostgreSQL.
  • Coded a Java application to auto-convert DDL and migrate data.
  • Helped set up logging and demonstrated encryption using pgcrypto.
  • Converted 170+ view definitions from Oracle to PostgreSQL.
  • Coded a Java application to demonstrate unloading data to CSV and loading from CSV.
  • Designed and coded data generation scripts to simulate tax form data for the Affordable Care Act. The scripts create millions of simulated tax form records using random distributions of several variables while maintaining business rules and referential integrity (a sketch follows this list).
  • Implemented an in-database negative TIN check and obfuscation in Greenplum using web service calls from external web tables and in-database views/functions.
  • Developed a fully functional command-line equivalent of the web-based Greenplum Command Center utility, coded in Java, to address the Section 508 non-compliance of the existing web-based tool.
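
A minimal sketch of the tax-form data generation described above; the field names, distributions, volumes, and business rules shown are hypothetical stand-ins for the actual ACA rules.

    import csv
    import random

    random.seed(42)  # reproducible runs

    def generate(n_filers=1000, forms_per_filer=(1, 3)):
        """Emit filers and their forms so every form references a real filer
        (referential integrity) while amounts follow a random distribution."""
        with open("filers.csv", "w", newline="") as f1, \
             open("forms.csv", "w", newline="") as f2:
            filers, forms = csv.writer(f1), csv.writer(f2)
            filers.writerow(["filer_id", "state"])
            forms.writerow(["form_id", "filer_id", "coverage_months", "premium"])
            form_id = 0
            for filer_id in range(1, n_filers + 1):
                filers.writerow([filer_id, random.choice(["VA", "MD", "DC"])])
                for _ in range(random.randint(*forms_per_filer)):
                    form_id += 1
                    months = random.randint(1, 12)  # hypothetical rule: 1-12 months
                    premium = round(random.gauss(400, 120), 2)
                    forms.writerow([form_id, filer_id, months, max(premium, 0)])

    generate()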

Confidential

Solutions / Data Architect

Responsibilities:

  • Provided guidelines and best practices to follow for Greenplum database management.
  • Coded an ETL module using Java and Greenplum load/SQL utilities to transform and load large volumes of SAS data into Greenplum. The scripts perform automatic data type discovery (coded in Java) to create the corresponding DDL, execute it, and load the data. Complexities such as 3,200+ columns per table and multiple files feeding a single table with non-identical column positions are handled automatically (a sketch of the column-alignment idea follows this list).
  • Coded a web portal for ad-hoc queries in Greenplum that enables non-SQL experts to build complex SQL queries through a dynamic web interface. Other features include the ability to submit large data extracts; data preview in an Excel-like grid with multi-column sorting, column resizing, and paging; execution-plan checks; LDAP authentication; and access to public data sets without a database login id. Built on a Tomcat server using Java, JSP, JavaScript, jQuery, jQuery UI, and Ajax, and designed as a product with its own metadata management system so it can easily migrate or expand to other systems and databases. Integrated MADlib statistical functionality into the product.
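
A minimal sketch of the column-alignment idea from the SAS-load bullet above; the original loader was written in Java, and the file name and canonical column order here are hypothetical.

    import csv

    def align_rows(path, target_columns):
        """Yield rows from one file reordered to a canonical column order,
        so files with differing header positions load into the same table."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            # Map each canonical column to its position in this file's header;
            # a missing column raises ValueError, flagging a bad input file.
            positions = [header.index(col) for col in target_columns]
            for row in reader:
                yield [row[p] if p < len(row) else "" for p in positions]

    # Hypothetical canonical order shared by every file feeding one table.
    for row in align_rows("part_07.csv", ["id", "name", "amount"]):
        print(row)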

Confidential

Solutions / Data Architect

Responsibilities:

  • Coded a Java-based application to automate both DDL and data migration from SQL Server 2008 to Greenplum. Used the database metadata repository to make the utility completely dynamic and usable for any SQL Server to Greenplum migration (a sketch of the type-mapping idea follows this list).
  • Helped with encryption of data using pgcrypto, including key management and performance tuning. Also coded an alternative PL/Java function to demonstrate how effective custom encryption solutions can be built with custom or external libraries.
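
A minimal sketch of the metadata-driven DDL conversion described above; the original utility was Java and read SQL Server's metadata repository directly, while this Python version works from an already-fetched column list with a deliberately abbreviated type map.

    # Abbreviated SQL Server -> Greenplum/PostgreSQL type map; the real
    # utility covered the full type catalog read from database metadata.
    TYPE_MAP = {
        "int": "integer",
        "bigint": "bigint",
        "bit": "boolean",
        "datetime": "timestamp",
        "nvarchar": "varchar",
        "varchar": "varchar",
        "decimal": "numeric",
    }

    def convert_column(name, ss_type, length=None, precision=None, scale=None):
        """Translate one SQL Server column definition to Greenplum syntax."""
        gp_type = TYPE_MAP.get(ss_type.lower(), "text")
        if gp_type == "varchar" and length and length > 0:
            gp_type += f"({length})"
        elif gp_type == "numeric" and precision:
            gp_type += f"({precision},{scale or 0})"
        return f"{name} {gp_type}"

    def create_table(schema, table, columns):
        """columns: rows shaped like INFORMATION_SCHEMA.COLUMNS output."""
        body = ",\n    ".join(convert_column(*c) for c in columns)
        return f"CREATE TABLE {schema}.{table} (\n    {body}\n);"

    # Hypothetical metadata rows: (name, type, length, precision, scale)
    cols = [("id", "int", None, None, None),
            ("name", "nvarchar", 100, None, None),
            ("balance", "decimal", None, 12, 2)]
    print(create_table("dbo", "accounts", cols))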

Confidential

Solutions / Data Architect

Responsibilities:

  • Developed multiple log parsers in Java for Cisco and other network security logs for ingestion into Greenplum. The parser can process over 1 terabyte of log files on a single Linux server in 2 to 2.5 hours. The log files range from complex to unstructured, requiring extensive regular expression processing and multithreading to reach the desired performance. The parser accommodates 10+ log file types such as Cisco ACS, Blue Coat, ASA, and DHCP (a sketch of the parser's shape follows).
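
A minimal sketch of the parser's shape in Python (the original was multithreaded Java); the single ASA-style pattern and directory layout are hypothetical, where the real parser kept a registry of patterns for each of the 10+ log types.

    import csv
    import re
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    # Hypothetical pattern for one ASA-style line; the real parser kept a
    # registry of patterns per log type (ACS, Blue Coat, DHCP, ...).
    ASA_LINE = re.compile(
        r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8}).*%ASA-(?P<sev>\d)-(?P<msg_id>\d+):\s(?P<msg>.*)$"
    )

    def parse_file(path):
        """Parse one log file into a pipe-delimited file ready for gpload/COPY."""
        with open(path, errors="replace") as src, \
             open(path.with_suffix(".csv"), "w", newline="") as dst:
            writer = csv.writer(dst, delimiter="|")
            for line in src:
                m = ASA_LINE.match(line)
                if m:
                    writer.writerow([m["ts"], m["sev"], m["msg_id"], m["msg"]])

    if __name__ == "__main__":
        # One worker per core keeps the regex-heavy work parallel despite
        # it being CPU-bound.
        with ProcessPoolExecutor() as pool:
            pool.map(parse_file, Path("logs").glob("*.log"))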

Confidential

Solutions / Data Architect

Responsibilities:

  • Provided Greenplum expertise to a development team converting a complex Oracle-based data warehouse (2,000+ tables, 5 TB of data, Informatica, SAS, Tableau, and OBIEE) to Greenplum.
  • Coded functions for partition management.
  • Coded functions for quick application of backfill data to large fact tables.
  • Implemented database-link functionality to read data from Greenplum to Greenplum and from Greenplum to Oracle using external web tables plus Python and Java code. This was very helpful for quick data migration and cross-database data validation and comparison.
  • Coded a Jaro-Winkler string comparison function in Greenplum using Python (a sketch follows this list).
  • Coded scripts for database maintenance, including vacuuming, analyzing, and rebuilding tables.
  • Coded a data replication system to replicate the full production database to a remote server (DCA) using EMC's Data Domain backup servers and Python/shell scripts.
  • Provided workarounds for complex Oracle SQL and for functions not then implemented in Greenplum, such as percentile_cont and percentile_disc.
  • Conducted several sessions with the IT and Development team to explain and discuss Greenplum / PostgreSQL concepts.
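
A minimal sketch of the Jaro-Winkler comparison referenced above; inside Greenplum the same body would be registered via CREATE FUNCTION ... LANGUAGE plpythonu, which is omitted here.

    def jaro(s1, s2):
        """Jaro similarity: fraction of matched characters, penalized for
        transpositions, matching only within a sliding window."""
        if s1 == s2:
            return 1.0
        len1, len2 = len(s1), len(s2)
        if not len1 or not len2:
            return 0.0
        window = max(len1, len2) // 2 - 1
        match1, match2 = [False] * len1, [False] * len2
        matches = 0
        for i, ch in enumerate(s1):
            lo, hi = max(0, i - window), min(len2, i + window + 1)
            for j in range(lo, hi):
                if not match2[j] and s2[j] == ch:
                    match1[i] = match2[j] = True
                    matches += 1
                    break
        if not matches:
            return 0.0
        # Count matched characters that appear in a different order.
        transpositions, j = 0, 0
        for i in range(len1):
            if match1[i]:
                while not match2[j]:
                    j += 1
                if s1[i] != s2[j]:
                    transpositions += 1
                j += 1
        t, m = transpositions / 2, matches
        return (m / len1 + m / len2 + (m - t) / m) / 3

    def jaro_winkler(s1, s2, p=0.1):
        """Boost the Jaro score for a common prefix of up to 4 characters."""
        j = jaro(s1, s2)
        prefix = 0
        for a, b in zip(s1[:4], s2[:4]):
            if a != b:
                break
            prefix += 1
        return j + prefix * p * (1 - j)

    print(jaro_winkler("MARTHA", "MARHTA"))  # ~0.961, the classic textbook pair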

Confidential

Solutions / Data Architect

Responsibilities:

  • Successfully converted a complex data warehousing environment from Sun/GP3 to DCA/Greenplum 4. Provided technical guidance to the EMC PS team (1-2 people) and to the client's IT team.
  • Coded data transfer processes using Greenplum backup/recovery mechanisms as well as external web tables and custom shell and Python scripts to perform the following:
  • Full database migration
  • Schema migration
  • Migration of specific table(s)
  • Any combination of schema and specific-table migration, including exclusion of tables
  • Used the process described above almost 20 times to perform the ad-hoc data refreshes required for a smooth transition of multiple applications and for effective testing.
  • Converted 7,000+ lines of shell scripts (with support from the EMC PS team) written for Sun Solaris to Linux; these scripts are used by the DBA team for administrative functions.
  • Coded a full backup/recovery and DR system using Data Domain. The goal of replicating an 18 TB database from the production DCA to a standby DCA was met, and the replication process now runs in 8-10 hours every day.
  • Used SNMP extensions to implement a custom web-based monitoring tool, since the request to install SysEdge on the DCA had been denied by EMC support.
  • Helped convert 1,000+ Informatica workflows, 100+ Cognos reports, and all other custom applications. Provided an innovative solution to handle decimal precision in Cognos 8.4, which helped defer migration to the most recent version of Cognos.
