- Data Architect with extensive experience in relational and dimensional data modeling: conceptual, logical, and physical modeling; star and snowflake schemas; dimensions and hierarchies.
- ETL Specialist with hands-on proficiency in Informatica, Sqoop, SQL, and PL/SQL.
- Big Data experience with Hadoop, HDFS, Hive, Pig, PostgreSQL, Spark, and Python.
- Machine learning implemented with Python, R, and scikit-learn.
Database Development: PostgreSQL, Oracle SQL, PL/SQL
Data Modeling and Data Warehouse Design: ERwin 9.5, Oracle Data Modeler
ETL: Informatica 9.5, Pentaho, Python
Machine Learning: Python, R, NumPy, Matplotlib, Pandas
Big Data: Hive, Pig, Greenplum, Spark, Scala
- Developed data and process models for the US Pharmacopeia (USP) Reference Standard testing and production business. The modeling tool used was BiZZdesign Horizzon.
- Designed a workflow pipeline for the Confidential IT Category Management project, which collected and enhanced government-wide transaction data using NLP text mining.
- Designed, created, and loaded the Confidential Enterprise Acquisition and Spend Database (EASD) on PostgreSQL and Redshift.
- Developed ETL for the EASD using Python and SQL on PostgreSQL, Redshift, and Excel.
- Developed a Python GUI (PyQt5) for pipeline and ETL management.
- Developed ETL for the 2020 Census Fraud Detection System involving Hive, PostgreSQL, Oracle, and Spark.
- Tuned Hive for performance by refactoring existing code.
- Designed and coded Informatica workflows, mappings, and stored procedures on the Greenplum (PostgreSQL-based) big data platform for the CADE2 to ODS Refresh project.
- Designed and coded automated SQL generation from requirements spreadsheets to replace legacy assembler code using Greenplum PL/SQL.
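A minimal sketch of this kind of requirements-driven SQL generation (the table, column, and filter names here are invented for illustration; the real project read requirements spreadsheets and targeted Greenplum PL/SQL):

```python
# Hypothetical sketch: build a SELECT statement from one requirements row.
# Field names ("table", "columns", "filters") are illustrative assumptions.

def generate_select(requirement):
    """Render a requirements row (a dict) as a SQL SELECT statement."""
    cols = ", ".join(requirement["columns"])
    sql = f"SELECT {cols} FROM {requirement['table']}"
    filters = requirement.get("filters", {})
    if filters:
        # Sort predicates so the generated SQL is deterministic.
        preds = " AND ".join(f"{c} = '{v}'" for c, v in sorted(filters.items()))
        sql += f" WHERE {preds}"
    return sql + ";"

row = {"table": "ods_refresh", "columns": ["acct_id", "balance"],
       "filters": {"tax_year": "2015"}}
print(generate_select(row))
# SELECT acct_id, balance FROM ods_refresh WHERE tax_year = '2015';
```

Generating SQL from a structured requirements source like this keeps the mapping logic in one reviewable place instead of scattered across hand-written legacy code.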
- Applied scikit-learn machine learning algorithms and advanced statistics to the new Greenplum platform, including regression, classification, clustering, and dimension reduction. Used K-Folds cross-validation and Grid Search for model selection. Used XGBoost to improve performance of base algorithms.
- Built a test harness in Python to extract sample data from the mainframe and run several algorithms for direct comparison of effectiveness, including K-NN, K-means, LDA, Naïve Bayes, Decision Trees, Random Forest, PCA, SVM, and Linear and Logistic Regression.
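The K-Folds comparison loop at the heart of such a harness can be sketched in pure Python (the actual work used scikit-learn's KFold/GridSearchCV; the baseline model and data here are invented for illustration):

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_score(model_fn, X, y, k=5):
    """Average accuracy of model_fn across k folds.
    model_fn(X_train, y_train) must return a predict(x) callable."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = model_fn([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / k

def majority_class(X_train, y_train):
    """Trivial baseline: always predict the most common training label."""
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label
```

Running each candidate algorithm through the same `cross_val_score` call gives a like-for-like accuracy comparison, which is what made the direct K-NN vs. Random Forest vs. SVM ranking possible.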
- Used Tableau to visualize data for stakeholders.
- Designed and implemented automated test procedures using R that produced thousands of test cases and automatically analyzed results as a full regression test.
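A generator of that style can be sketched as the cross product of input dimensions (illustrative Python; the original harness was written in R, and the dimension names here are invented):

```python
import itertools

# Hypothetical test dimensions; the real harness derived these from
# requirements, producing thousands of combinations.
dimensions = {
    "account_type": ["checking", "savings"],
    "amount": [0, 1, 9999],
    "currency": ["USD", "EUR"],
}

def generate_cases(dims):
    """Yield one dict per combination of dimension values."""
    keys = sorted(dims)
    for values in itertools.product(*(dims[k] for k in keys)):
        yield dict(zip(keys, values))

cases = list(generate_cases(dimensions))
print(len(cases))  # 2 * 3 * 2 = 12 combinations
```

Each generated case is then executed and its result checked against an expected-result rule, turning the full cross product into an automated regression suite.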
- Developed and implemented a B2B website using PHP and MySQL.
- Rewrote numerous SQL queries, reducing runtime from 2.5 hours to 3 minutes.
- Reverse-engineered the OLTP Oracle database into a 3NF model in order to analyze production reporting and uncover its primary bottleneck.
- Re-architected the reporting mart to de-normalize report data in one pass and pre-compute the most used views.
- Reduced runtime from hours to seconds and made available several reports that previously could not be run at all.
- Designed Hive tables to answer business transaction questions.
- Coded HiveQL using Cloudera Hue.
- Extracted text files from the enterprise data warehouse and loaded them into HDFS.
- Coded UDFs in Java using the NetBeans IDE for use in HiveQL.
- Developed extensive Oracle SQL extracts of statistical data.
- Designed and coded Oracle PL/SQL load, extract, and QA processes based on SDTM.
- Implemented automated data cleansing procedures.
- Designed and implemented Clinical Trials Repository in Oracle for the Confidential with a team of six.
- Developed multidimensional star schema models. Created logical and physical models.
- Created a 3NF data model with 107 entities based on BRIDG model content.
- Designed and created an Oracle object-relational database of ISO21090 datatypes.
- Designed a star schema DB2 data warehouse for reservations, ticket issuance, and ticket collection.
- Analyzed disparate data sources for data quality improvement and common dimensions.
- Created conceptual, logical and physical data models. Designed detailed Informatica ETL processes.
- Coded Informatica workflows, sessions, mappings, and transformations.
- Designed and implemented data mapping for Confidential, the bi-directional health information exchange between DoD and VA systems.
- Mentored a group of four new developers in Informatica coding, with an emphasis on performance.
- This was done on an Oracle platform using ERwin and TOAD.
- Developed Web Services with Informatica to deliver XML content.
Data Architect/ETL Specialist
- Designed and developed transformation processes to load data from source systems into Oracle.
- Developed and tuned Informatica mappings and sessions.
- Coded and documented scripts and stored procedures for data warehousing processes.
- Performance-tuned Informatica for parallel loads and cache optimization.
- Documented ETL processing systems.