
AWS for the Cloud and Cloudera Administration Resume


SUMMARY

  • AWS offerings: cloud computing has three main types, commonly referred to as Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). Selecting the right type of cloud computing for your needs helps strike the right balance between control and the avoidance of undifferentiated heavy lifting. I have been primarily involved with PaaS and SaaS in my data lake architectural design work.
  • My primary Big Data platform is AWS, along with extensive Cloudera and Azure Big Data architecture: not only data lake design and development but also a total solution-architecture approach to Big Data implementation. 10+ years' experience in Hadoop, both development and architecture, with 6+ years as an architect. My initial work included work at Confidential in 2003, then on the initial release of Hadoop, and next with AWS (Amazon Web Services), Cloudera Navigator 2.9, Cloudera CDH 5.7 - 5.10, Impala/Kudu, Cloudera Director, and Hortonworks/Azure. Talend MDM 4 years; ETL processes including scripting 14+ years; ZooKeeper, HMaster, HBase database, HFile; Apache Flume (log files) 2 years; Oozie (scheduled workflows) 1+ year; Sqoop (data transfers) 3 years; Python (2.7 & 3.6 with SPSS Statistics 23) 5 years; dev tools such as Spark (with performance tuning and caching) 2 years; HBase 5 years; Pig 4 years; analysis with Drill (SQL) 2 years, Hive (HQL) 4 years, Mahout (clustering, classification, collaborative filtering) 6 mos.; additionally C, C++, and shell. I have extensive experience with MDM tools and Erwin, and additionally PowerDesigner and Confidential's ER tool. I have extensive work on Apache Hadoop, a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes operating in parallel; Hadoop provides a cost-effective storage solution on commodity hardware for large data volumes with no format requirements. Additionally, extensive work with MapReduce, the programming paradigm that allows for this massive scalability and is the heart of Hadoop. Note that the term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform, and that Hadoop has two main components: HDFS and YARN. (A minimal MapReduce sketch follows this list.)
  • I utilized Ansible (Red Hat Ansible Tower) to scale automation, manage complex deployments, and speed up productivity at the client site for CAS. I further used workflows to streamline jobs and simple tools to share solutions with the CAS team. With Ansible we were able to automate away the drudgery of daily tasks, freeing admins to focus on efforts that deliver more value to the business by speeding time to application delivery and building a culture of success. I was able to give teams the one thing they can never get enough of: time, allowing smart people to focus on smart things.
  • I used StreamSets Data Collector (SDC), an open-source, lightweight engine that streams data in real time. It allowed us to configure data flows as pipelines through a web UI in minutes. Among its many features, it makes it possible to view real-time statistics and inspect data as it passes through the pipeline.
  • I have noticed that some companies delay data opportunities because of organizational constraints, others are not sure which distribution to choose, and still others simply can't find time to mature their big data delivery under the pressure of day-to-day business needs. With my architecture skills and Hadoop, I ensure that corporations adopting Hadoop and its full spectrum of tools don't leave the opportunity to harness their data on the table; my past clients have been able to pursue new revenue opportunities, beat their competition, and delight their customers with better, faster analytics and data applications. The smartest Hadoop strategies start with choosing a recommended distribution, then maturing the environment with modernized hybrid architectures and adopting a DATA LAKE strategy based on Hadoop technology.
  • 25+ years of experience in IT systems or applications development
  • 15+ years of experience architecting or delivering large scale systems on multiple platforms, with a focus on Big Data Hadoop
  • Talend (4 yrs) utilized on several projects to simplify and automate big data integration with graphical tools and wizards that generate native code, allowing teams to start working with Apache Hadoop, Apache Spark, Spark Streaming, and NoSQL databases right away. The Talend Big Data Integration platform was utilized to deliver high-scale, in-memory fast data processing as part of the Talend Data Fabric solution, bringing more data into real-time decisions for the project's enterprise systems. It provided blazing speed and scale with Spark and Hadoop, allowed anyone to access and cleanse big data while governing its use, and enabled optimization of big data performance in the cloud on several projects.
  • Graphics and statistics implementations in R, using RStudio (a free, open-source integrated development environment, IDE) for development.
  • Ruby programming experience, with Agile development experience as a development team leader.
  • Experience working as a network operations center (NOC) administrator, supervising, monitoring, and maintaining a telecommunications network.
  • Extensive data warehousing (Teradata, DB2, SQL Server, MySQL & Oracle), including building and implementing.
  • Apache: Flume (log files), Oozie (scheduled workflows), Sqoop (transfers data for relational DBs), Python (lang.), Scala (lang.), Java (lang.)
  • Dev Tools: Spark (Perf, w/Caching), HBase, Pig, Shell, MongoDB
  • Analysis with: Drill (SQL), Hive (HQL), Mahout (Clustering, Classification, Collaborative filtering)
  • Tableau (dashboards) & Talend (MDM, mapping & data lineage)
  • MongoDB: One of the most popular document stores; a document-oriented database. All data in MongoDB is handled in JSON/BSON format. It is a schemaless database that scales to terabytes of data, and it supports master-slave replication for making multiple copies of data across servers, which makes data integration easier and faster for certain types of applications. MongoDB combines the best of relational databases with the innovations of NoSQL technologies, enabling engineers to build modern applications. It maintains the most valuable features of relational databases (strong consistency, an expressive query language, and secondary indexes), so developers can build highly functional applications faster than with other NoSQL databases, while still providing the data-model flexibility, elastic scalability, and high performance of NoSQL. As a result, engineers can continuously enhance applications and deliver them at almost unlimited scale on commodity hardware, with full index support for high performance.
  • Integration & Migration
  • Collaboration as a consultant with Teradata Professional Services
  • Advanced Analytical solutions - Confidential, Teradata, HCL
  • PhD in Business Psychology/Computer Science, with excellent verbal and written communication and persuasion skills; able to collaborate and engage effectively with technical and non-technical resources, speaking the language of the business
  • Have proven experience solving complex problems in a multi-platform systems environment
  • Cloud/XaaS solutions, including on-site
  • Demonstrated comprehensive expert knowledge and exceptional insight into the information technology industry
  • Expertise in application and information architecture / design artifacts and mechanisms
  • TOGAF and Zachman, with practical experience using these common architecture frameworks
  • Experience with high-level conceptual models and the development, implementation, and management of Enterprise Data Models, Data Architecture Strategies, Delivery Roadmaps, Information Lifecycle Management, and Data Governance capabilities
  • PhD Psychology with minor in Computer Science
  • Encryption tools such as Protegrity, and an in-depth understanding of security legislation that affects our businesses, including but not limited to Sarbanes-Oxley, Payment Card Industry regulations, customer data protection regulations, and contemporary security legislation activities that may impact future plans
  • Significant experience with three or more of the following technologies: Teradata, Tableau, Cognos, Oracle, SAS, Hadoop, Hive, SQL Server, DB2, SSIS, Essbase, Microsoft Analysis Services
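
As a hedged illustration of the MapReduce model described above, a minimal word-count pair for Hadoop Streaming in Python follows; the script names, input/output paths, and the streaming jar location are illustrative assumptions, not a specific client deliverable.

    #!/usr/bin/env python3
    # mapper.py -- the "map" task: emit a (word, 1) pair per token.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- the "reduce" task: sum counts per word. Hadoop
    # Streaming sorts mapper output by key, so equal words arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

A run would look roughly like: hadoop jar /path/to/hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the jar path varies by distribution).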

TECHNICAL SKILLS

Languages: shell scripting, Python, JavaScript, Java, C++, and many others.

Operating Systems: Linux, BSD Unix variants, Macintosh, OpenVMS

Sign-on: LDAP/OpenLDAP

Linux: Debian, Linux Mint, Ubuntu, Red Hat Enterprise Linux (RHEL), Fedora, CentOS

Desktop GUI design: Java/Swing, GTK+/GNOME, Qt/KDE

Custom-tailored Linux kernels for: Alpha, PowerPC, Intel

OS configuration: filesystem layouts, packaging systems (Debian)

Version control and build: CVS, Subversion, Git

Parallel APIs: MPI, PVM (from C, FORTRAN, Python)

Threading: Pthreads from C and compiled languages, Python and Ruby threads.

Network protocols: TCP/IP suite (e.g., UDP, ARP), MIDI.

Primary Databases: MySQL, SQLObject, SQLAlchemy, PostgreSQL, Teradata, DB2, Oracle

Web Frameworks: Express (ExpressJS), Django, Flask

GUI Toolkits: Java/Swing, TK, GTK, GTK+, GLADE, GNOME, PyQt, QT/KDE, Wx

Amazon Web Services (AWS): EC2, S3, Lambda

PROFESSIONAL EXPERIENCE

Confidential

AWS for the CLOUD and CLOUDERA ADMINISTRATION

Responsibilities:

  • Responsible for implementation and ongoing administration of Hadoop infrastructure.
  • Aligning with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.
  • Working with data delivery teams to set up new Hadoop users. This includes setting up Linux users, setting up Kerberos principals, and testing HDFS, Hive, Pig, and MapReduce access for the new users (a provisioning sketch follows this list).
  • Cluster maintenance as well as creation and removal of nodes using tools like Ganglia, Nagios, Cloudera Manager Enterprise, Dell Open Manage and other tools.
  • Performance tuning of Hadoop clusters and Hadoop MapReduce routines.
  • Screening Hadoop cluster job performance and capacity planning.
  • Monitoring Hadoop cluster connectivity and security.
  • Managing and reviewing Hadoop log files.
  • File system management and monitoring.
  • HDFS support and maintenance.
  • Diligently teaming with the infrastructure, network, database, application and business intelligence teams to guarantee high data quality and availability.
  • Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades when required.
  • Point of Contact for Vendor escalation
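
A minimal sketch of the new-user provisioning flow described above, assuming an MIT Kerberos KDC reachable via kadmin.local; the realm, username, and use of a Python wrapper are illustrative assumptions rather than the exact site procedure.

    #!/usr/bin/env python3
    # Illustrative sketch: provision a new Hadoop user -- Linux account,
    # Kerberos principal, and HDFS home directory. Realm/user are hypothetical.
    import subprocess

    def run(cmd):
        # Echo and execute a command, raising on any failure.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def provision_user(user, realm="EXAMPLE.COM"):
        run(["useradd", "-m", user])                                      # Linux account
        run(["kadmin.local", "-q", f"addprinc -randkey {user}@{realm}"])  # Kerberos principal
        run(["hdfs", "dfs", "-mkdir", "-p", f"/user/{user}"])             # HDFS home dir
        run(["hdfs", "dfs", "-chown", f"{user}:{user}", f"/user/{user}"])

    if __name__ == "__main__":
        provision_user("analyst1")

Access would then be smoke-tested with small Hive, Pig, and MapReduce jobs run under the new principal.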

Confidential

SENIOR BIG DATA AWS & CLOUDERA ADMIN & SECURITY

Responsibilities:

  • Design and architecture of the CAS’s cloud data lake solution in AWS
  • Use of AWS Lambda and Amazon S3 (a Lambda sketch follows this list).
  • Cloudera administrator (versions 4 through 5.10) and Kerberos 2.0 security administrator at CAS, working with a small team of Hadoop administrators.
  • Talend MDM for data lineage and master data management.
  • Extensive SPSS
  • Impala/KUDU administration.
  • Java coding with some Python.
  • I mentored and assisted the team with Cloudera administration and Cloudera Navigator.
  • Cloud installation utilizing Cloudera Director with AWS provider.
  • Performance and Tuning:
  • Assisted with establishment of queue architecture through the Fair Scheduler.
  • Tuning MapReduce jobs for enhanced throughput (Java Heap Adjustments)
  • Block Size Adjustments
  • Spark Performance Adjustments
  • I worked on the standards and implementation of FSImage backups, establishing backup standards and procedures managed through crontab jobs.
  • I assisted with the setup and administration of Kerberos to allow secure communication between trusted entities. Hadoop security with Kerberos and Sentry together: for Hadoop operators in finance, government, healthcare, and other highly regulated industries to enable access to sensitive data under proper compliance, each of four functional requirements must be achieved:
  • Perimeter Security: Guarding access to the cluster through network security, firewalls, and, ultimately, authentication to confirm user identities
  • Data Security: Protecting the data in the cluster from unauthorized visibility through masking and encryption, both at rest and in transit
  • Access Security: Defining what authenticated users and applications can do with the data in the cluster through filesystem ACLs and fine-grained authorization
  • Visibility: Reporting on the origins of data and on data usage through centralized auditing and lineage capabilities
  • Requirements 1 and 2 are now addressed through Kerberos authentication, encryption, and masking. Cloudera Navigator supports requirement 4 via centralized auditing for files, records, and metadata. But requirement 3, access security, had been largely unaddressed until Sentry.
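
To make the AWS Lambda and Amazon S3 bullet above concrete, a minimal Lambda handler for an S3 object-created event follows; the destination bucket name and the processing step are hypothetical placeholders.

    # Illustrative AWS Lambda handler triggered by an S3 "ObjectCreated" event.
    # The destination bucket and transformation step are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # Placeholder: validate/transform the object, then land it in the lake.
            s3.put_object(Bucket="example-data-lake-raw", Key=key, Body=body)
        return {"processed": len(event["Records"])}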

Confidential, Foster City, CA

Senior AWS/AZURE Data Lake Architect

Responsibilities:

  • Lead architect for the AWS cloud architecture and data model at Confidential, which provided a framework to assess the Hadoop solution architecture upgrade as a staging-area repository for unstructured, semi-structured, and structured data. Installation and a 5-node cluster developed on BOTH AWS and Hortonworks (Microsoft Azure). SPSS data-science calculations for multivariate linear regression analysis (a regression sketch follows this list). This work was done with the medical model in mind, specifically the issue of signal-refinement "positive results," defined as when an association is detected between a medical product and an adverse outcome that exceeds a pre-specified threshold in the direction of increased risk. For the specific Common Data Model (CDM), the principal goal was to determine whether the excess risk could be explained by something other than a cause-and-effect relationship, such as information or selection bias, confounding, or any errors associated with the signal-refinement results. The generated reports of interest addressed assessing signal refinement, the second of a three-stage process (signal generation, signal refinement, signal evaluation) in medical product post-market safety surveillance.
  • Talend utilized on several projects to simplify and automate big data integration with graphical tools and wizards that generate native code, allowing the teams to start working with Apache Hadoop, Apache Spark, Spark Streaming, and NoSQL databases right away. The Talend Big Data Integration platform was utilized to deliver high-scale, in-memory fast data processing as part of the Talend Data Fabric solution, bringing more data into real-time decisions for the project's enterprise systems. Descriptions and general recommendations regarding sources of systematic error are presented, beginning with an assessment of data validity. This should be a regular activity before a safety signal occurs, with the emphasis on ruling out errors in the data that contributed to the signal. Sources of systematic error in medical product safety surveillance are information bias, selection bias, and confounding. Information bias is an error in measuring exposure, covariate, or outcome variables that results in different quality (accuracy) of information between comparison groups; misclassification of categorical variables is one form of information bias. For the Sentinel System, medical product exposures are measured within electronic healthcare databases. Such data are an imperfect surrogate for actual biological exposures within individuals, since the data sources that populate the electronic health databases were not originally created for research or public health purposes. The sensitivity and positive predictive values of electronic diagnostic codes for outcome assessment can be quite low, and it can be particularly challenging to identify outcomes when there are no specific diagnosis codes, the incident rate is very low, and/or misdiagnosis occurs.
  • Selection bias is a distortion in an effect estimate due to the manner in which the study sample is selected from the source population. To avoid case selection bias, the cases (outcomes) that contributed to a safety signal must represent cases in the source population.
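
As a hedged illustration of the multivariate regression work behind signal refinement, a minimal sketch using statsmodels follows; the cohort file and the exposure, covariate, and outcome column names are hypothetical, not the actual CDM schema.

    # Illustrative multivariate linear regression for signal refinement:
    # does excess risk persist after adjusting for covariates?
    # File and column names are hypothetical.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("cohort.csv")  # hypothetical CDM extract
    X = sm.add_constant(df[["exposure", "age", "comorbidity_score"]])
    model = sm.OLS(df["adverse_outcome_rate"], X).fit()
    print(model.summary())  # inspect the exposure coefficient and its CI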

Confidential, Remote

JULY 2015 to JAN 2016

AWS BIG DATA DATA LAKE Analysis/Architecture

Responsibilities:

  • Data lake work included development with AWS, completed on a 9-node clustered data lake architecture: primarily unstructured and semi-structured data, utilizing Sqoop, MongoDB, Spark (Hive, Python & Java), Flume, Cloudera Search, Talend as the MDM repository, and Apache Sentry for authorization of Impala and Hive access (a MongoDB sketch follows below). Lead Hadoop architect for the de-normalization project at Optum Corporation, which involved simplifying 3rd-normal-form tables to enhance performance and usability for the end-user business community. Extensive consideration was given to Hadoop as the staging-area repository for ingesting source data, the thought being that this data could then be identified and used for marketing analysis. Also of interest was logging information that might be mined for better monitoring of issues related to anomalies in the data. Erwin was a primary tool for the de-normalization/simplification project; both Logical Data Models (LDM) and Physical Data Models (PDM) were generated on all platforms, through development, to UAT, and finally to production. Use of the ALM (Application Lifecycle Management) tool greatly assisted in reporting and tracking project fixes as required, and the Rally tool allowed tracking and timely reporting of deliverables to the business. Involved the business users at all points of decision making and signoff; projects were delivered on time and on budget. MongoDB, described in the Summary above, served as the document store, holding JSON/BSON data with master-slave replication across servers.
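
A minimal PyMongo sketch of the MongoDB document-store usage referenced above; the connection string, database, collection, and document contents are hypothetical.

    # Illustrative MongoDB access with PyMongo: schemaless JSON/BSON documents
    # plus a secondary index. Connection details are hypothetical.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    events = client["datalake"]["events"]  # hypothetical db/collection

    events.insert_one({"source": "weblog", "ip": "10.0.0.5",
                       "tags": ["login", "mobile"]})
    events.create_index([("source", ASCENDING)])  # secondary index
    for doc in events.find({"source": "weblog"}):
        print(doc)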

Confidential, Space & Security Systems, Newport Beach, CA

FEB 2015 to JULY 2015

Big Data Architect, Senior Modeler

Responsibilities:

  • AWS at Boeing Space and Security. They needed skilled modelers, data architects, and Hadoop/Oracle implementers to transition systems from Oracle and other System of Record (SOR) data to a data lake. Work included development with Cloudera CDH, completed on a 6-node clustered data lake architecture. Ingested unstructured and semi-structured data utilizing Sqoop, HBase, Spark (Hive, Python & Java), Flume, and the Talend platform in the cloud (a Spark ingestion sketch follows this list). Security for the data lake was via Apache Sentry. This implementation required interfacing with end users and business units to migrate data and data attributes to the newly modeled enterprise architecture. The effort involved extensive user interaction to determine the correct mappings for attributes and their datatypes, with metadata information passed to the new staging areas and on to the base 3rd-normal-form table architecture. A big part of my work here included identification and mapping of objects at both the attribute and column level.
  • This activity consisted of interfaces, email, and formal meetings to establish the correct lineage of data through its initial attribute-discovery level and on through the Agile development process to ensure data integrity. As funding went south at Boeing, the work was concluded and final turnover meetings took place.
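
A minimal PySpark sketch of the Oracle-to-data-lake ingestion pattern used on this engagement; the JDBC URL, schema/table, credentials, and target path are hypothetical, and the Oracle JDBC driver is assumed to be on the Spark classpath.

    # Illustrative Spark ingestion from an Oracle system of record into HDFS.
    # Connection details and paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-to-lake").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
          .option("dbtable", "SOR.PARTS")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

    # Land the extract as Parquet in the staging zone of the data lake.
    df.write.mode("overwrite").parquet("hdfs:///lake/staging/parts")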

Confidential, Phoenix

Senior Teradata & Big Data / Data Lake Architect

Responsibilities:

  • This project utilized AWS and was an extensive evaluation of Hadoop systems and infrastructure at Freescale (FSL), providing detailed evaluation of and recommendations for the modeling environment, the current modeling architecture at FSL, and the EDW. Of concern on this project were the scalability and reliability of daily operations; these issues were among the most significant requirements, along with data quality (directly from the originating source) and capability for very high performance, accomplished with the MapR distribution for Hadoop.
  • Additionally we investigated Cloudera CDH 5.6 with a POC completed on a 2 node clustered Data Lake architecture as a POC for Freescale. Ingested unstructured and semi-structured data with utilization of Sqoop, Spark (Hive, Python & Java), Flume and Talend platform in the cloud.
  • TALEND administration for big data Data Lake.
  • Rather than employ HBase, the authentication system uses MapR-DB, a NoSQL database that supports the HBase API and is part of MapR; it meets strict availability requirements, provides robustness in the face of machine failure, operates across multiple datacenters, and delivers sub-second performance (an HBase-API sketch follows this list). I performed extensive evaluation, was involved with performance analysis using PDCR, identified BI semantic-layer issues for reporting processes, and provided documented analysis of specific areas for improvement, along with executive-level reporting on all aspects of modeling and DBA efforts in an "as is" analysis of current procedures, documentation, and process at FSL.
  • Provided extensive executive-level reporting regarding findings and recommendations. Implemented and evaluated additional tools such as PDCR, MDS, MDM, AntanaSuite, Appfluent, and Hadoop 14.10 functions and features, and migrated from Erwin & ModelMart v7.2 to 8.2 and finally to v9.5. I functioned as the lead consultant for the 6-month effort at FSL, assuming responsibility for delivery and for executive status meetings regarding all aspects of the project.
  • Provided numerous PowerPoint presentations, including delivery of the "score card" evaluation of the "as is" ongoing modeling, DBA, and support activities at FSL. Identified areas to improve upon, especially in the modeling area, and rendered assistance with the BI semantic-layer performance tuning effort and the MDS glossary deliverable for metadata. Designed and assisted with development of the executive dashboard reporting process.
  • Recommended and provided information regarding three new primary tools at FSL: Appfluent, AntanaSuite, and MDS/MDM (Hadoop). These tools were recommended as part of the agile improvement process to increase productivity, with ROI estimated at a 73% overall realized benefit.
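
Because MapR-DB exposes the HBase API, client code reads like ordinary HBase access; a minimal sketch with the happybase Python library follows, assuming a running Thrift gateway, with host, table, row, and column names as hypothetical placeholders.

    # Illustrative HBase-API access (applies to MapR-DB via its HBase
    # compatibility); assumes a Thrift gateway. Names are hypothetical.
    import happybase

    connection = happybase.Connection("thrift-gateway-host")
    table = connection.table("auth_events")

    table.put(b"user123", {b"cf:last_login": b"2016-01-15T10:22:00Z"})
    row = table.row(b"user123")
    print(row[b"cf:last_login"])
    connection.close()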

Confidential, Raritan, NJ

Senior Teradata Architect

Responsibilities:

  • Was the primary DBA involved in the design and implementation of J&J's new Teradata platform upgrade from V13.10 to V14.10, working primarily with the Janssen pharmaceutical division and also with the Ethicon surgical division. Additionally, I performed evaluation and feasibility work regarding Hadoop, specifically viewing it as a staging-area alternative to the current Teradata staging area. Here I served as the primary DBA/architect for the hierarchical design and layout of the databases, user definitions and creation, role creation and assignments, and Hadoop access authority. Additionally, I provided extensive support in performance tuning and problem resolution.
  • TALEND administration for big data - Data Lake.
  • Provided close support to the application DBAs and the hardware engineering group; excellent references are available. Installed and made operational Teradata's PDCR tool for Performance Data Collection & Reporting, with portlets enabled in Viewpoint to provide performance data reporting.

Confidential, Raleigh, NC

Senior Hadoop & Data Lake Architect

Responsibilities:

  • Was the senior (and only) Teradata/Hadoop architect on this project for Confidential at a major (the biggest) sea transportation client. I provided support for day-to-day architect activities, including DDL creation to support the warehouse and the BO semantic layer.
  • Cloudera CDH work completed on a 4-node clustered data lake architecture as a POC. Ingested unstructured and semi-structured data utilizing Sqoop, Spark (Hive, Python & Java), Flume, and the Talend platform in the cloud. Additionally, a Hadoop solution architecture upgrade as a staging-area repository for unstructured, semi-structured, and structured data, with installation and a 5-node cluster developed on Hortonworks (Microsoft Azure). (A Hive DDL sketch follows this list.)
  • Have extensive experience designing and developing the semantic layer, with proven success in performance tuning. The primary BI tool utilized was Confidential COGNOS. I have experience on the Confidential projects that used Cognos 8 and Cognos 10 as the end-user reporting tool, with performance and troubleshooting experience involving Cognos and Teradata/Hadoop (evaluation), and worked on numerous performance problems including SQL code with high IO and high SpoolSpace utilization.
  • Completed recommendations regarding monitoring tools and evaluation of space requirements. Was responsible for security administration and setups for new users and new applications.
  • Was available and on call for any and all problems supporting their Hadoop development, testing, integration, and production. The warehouse was established in 3rd normal form, with the semantic layer structured as a star-schema fact/dimension architecture. I created numerous new tables and views at the semantic layer and supported Business Objects access to the semantic layer.
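
A minimal sketch of the kind of semantic-layer DDL described above, issued from Python via PyHive; the HiveServer2 host, database, and table definition are hypothetical placeholders, not the client's actual schema.

    # Illustrative semantic-layer DDL through PyHive; connection details
    # and the table definition are hypothetical.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2-host", port=10000, database="semantic")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_shipments (
            shipment_id BIGINT,
            vessel_id   INT,
            teu_count   INT,
            depart_date DATE
        )
        STORED AS PARQUET
    """)
    cur.close()
    conn.close()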
