Senior Big Data Architect Resume
San Francisco, CA
SUMMARY:
- Technology leader with hands-on skills in the architecture, design, and development of high-performance, scalable cloud applications in distributed big data management. Seeking a Senior Architect or equivalent position in the Hadoop/big data ecosystem.
- Specialized in big data architectural patterns, data ingestion, analytics, and data science utilizing scalable Hadoop cluster storage, data ingestion from a variety of sources, cluster provisioning, cloud-oriented application design, big data systems management, performance and scale-up, Spark/Scala streaming, Spark SQL, Spark distributed processing, MapReduce parallel programming, Hive and Impala analytical queries, NoSQL data management strategies such as HBase and Cassandra, parallel query processing and optimization, data replication, data caching, columnar storage, etc.
- Designed and implemented system management, resource provisioning, scale-up, and elastic data processing for big data solutions utilizing OpenStack and OpenStack Sahara infrastructure services for cloud-oriented applications. Worked on a number of performance and scale-up features such as YARN task scheduling optimizations, QoS, multi-tier caching, Spark/Scala in-memory analytical processing, and machine learning in the Hadoop big data stack.
- Practiced and employed service-oriented architectures to design and implement end-to-end, user-friendly big data applications: ingestion, ETL, NoSQL stores, analytics, data science, machine learning, complex event processing, search engines, visualization, etc.
- Extensive hands-on skills in database internals, database kernels, access methods, cache and disk I/O management, parallel and cluster query optimization, massively parallel processing (MPP), backup, restore, high availability, and disaster recovery (DR). Implemented data management solutions using HBase, Oracle, SQL Server, and MySQL.
- Fluent programming and design skills in Java, Scala, C++, SQL, and Python, with extensive knowledge of operating systems: Linux, Unix, Windows, and Mac OS. Worked with Cloudera, Hortonworks, and Apache Hadoop distributions and with Google and Amazon cloud computing.
- Trained and certified in Amazon Web Services Big data and Cloud solutions.
TECHNICAL SKILLS:
Web and Big data: Apache web services, REST web services, Apache modules, Tomcat application server, Jetty application server, Apache Lucene search, Solr search server, in-memory data management, OLAP, continuous queries, user-defined types, functions, and aggregates, data modeling, Hadoop cluster storage, MapReduce parallel programming, Apache Crunch pipelines and Storm real-time computation, Hive analytical queries, HBase, Cassandra, Spark/Scala streaming, Spark/Scala distributed computing, and StreamSets data transformations and transport. Multiple years of experience with the Hadoop distributions and Hadoop managers of Apache, Cloudera, and Hortonworks.
DBMS: Specialized in database architectures, parallel query processing, query optimization, scalability, high availability, connectivity, and standards. Extensive knowledge of Confidential Adaptive Server Enterprise and Replication Server, MySQL, Oracle, Microsoft SQL Server, and Teradata. Developed applications on the RDBMSs of Confidential, Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.
Tools and systems: Familiar with a variety of build tools such as Jenkins, Maven, SCM, Ant, XML scripts, and Makefiles; source code control systems such as GitHub, SVN, Perforce, ClearCase, etc.; and interactive development tools such as Eclipse, IntelliJ, MS Visual Studio, etc.
PROFESSIONAL EXPERIENCE:
Senior Big data Architect
Confidential, San Francisco, CA
Responsibilities:
- Compiled Big data Architecture Patterns applicable to the organization’s digital solution requirements and prototyped some of them.
- Evaluated HBase as an operational data store (ODS) and compared it to Cassandra and Impala; designed a prototype and demonstrated the strengths and advantages of HBase as an ODS for its handling of multi-structured data, schema evolution, versioning, scalability, ACID properties, and integration with other Hadoop services.
- Researched Kafka usage patterns and prototyped use cases to exercise the Kafka APIs: producer, consumer, streaming, and connector (a minimal producer/consumer sketch follows this list).
- Defined the architecture and design for a scalable custom analytics framework: onboarding new analytics with a computation over the entire data set, then updating incrementally as new data arrives. The analytics computation proceeds in two steps, first materializing intermediate data and then deriving the final measures from it; the intermediate data makes incrementally updating the analytics efficient, low latency, scalable, and optimal.
- Spark/Scala distributed processing was used; the final analytic measures are published into Impala (see the incremental-update sketch after this list).
- As the clusters, solutions, and users grew, monitoring performance, resource utilization, and operational readiness across the Hadoop clusters and services became a critical need. Initiated development of utilities and services to monitor performance, resource utilization, and regressions in the Hadoop clusters across the enterprise.
- Faults are detected and corrected before they develop into failures. The results of standard benchmark tests, complemented with Cloudera metrics, are analyzed for proactive event/action monitoring and to establish historic trends. The monitoring results are also used to track cluster utilization by business line.
- Global Data Registry (GDR) is a metadata and catalog service for the objects and entities in key services of the Hadoop clusters.
- Metadata on the objects is continuously collected over many channels from many different sources, transformed, and ingested into HBase. The metadata service offers a REST API providing access to the metadata in the store. Besides providing metadata about entities/objects, the Global Data Registry offers reporting (through Impala), searching (Solr), and analysis (analytics framework). The relationships across entities/objects, along with their properties, ownership, and lineage, can be visualized through the UI of the graph database.
- The creation, evolution, and utilization of objects by business lines is monitored to keep track of space-shared resources (disk, network, and memory) and time-shared resources (CPU).
- Reorganized the Hadoop infrastructure: designed and laid out multiple purpose-driven Cloudera Hadoop clusters with the objectives of quality of service (QoS), scalability, high availability (HA), and data loss prevention (DLP).
- Enabled big data solutions with scalability, performance, high availability and security through extensible and modular architecture and design.
- Explored and utilized the latest big data technologies to establish a scalable data platform for historic and real-time data ingestion, analytics, machine learning, and visualization with high performance and low latency. Used SQL and NoSQL strategies for analyzing data.
- Designed and implemented several big data pipelines and solutions for risk analysis, analytics, data science, visualization, real-time monitoring, and reporting.
- Designed and implemented real-time money transfer transaction monitoring and alerts for fraud detection using a real-time, low-latency data pipeline, a search engine, and event alerts.
- The real-time data pipeline consists of event data collection from the field, posting to the TIBCO message bus, Flume, Kafka, Spark Streaming, HBase, the Lily indexer, Solr, web services, and the alert system (see the pipeline sketch after this list).
- Risky and fraudulent transactions are detected and stopped within seconds of initiation.
- The analytics engine continuously updates, in real time, hundreds of measures classified as age variables, velocity variables, lifetime variables, and models.
- A money transfer transaction has distinct phases, called events, from initiation to final payout. Data is collected on the fly for each event of a transaction and stored in the data lake implemented on the HBase NoSQL store. The data stored in HBase is published into real-time monitoring, analysis, data science, and reporting projects.
- Designed a fact-dimensional star schema for reporting and analytics. Offered SQL access and enabled SQL queries for data scientists and end users through Oracle, Hadoop Hive, and Kudu/Impala. Continuously updated reporting tables as new data arrives.
- Reporting schema implementation and evolution are driven by metadata, so no code changes are necessary in the Spark modules that update the reporting tables (see the metadata-driven sketch after this list).
- Wrote numerous architecture, design, and product documents with a long-term vision for product evolution and market leadership.
- Planned and worked on the Hadoop clusters and cloud solutions in Amazon Web Services environment.
- Trained and certified in AWS architecture, development, DevOps, and container services.
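
The following is a minimal sketch (Scala 2.13, Kafka Java client) of the producer/consumer usage pattern referenced in the Kafka prototyping bullet above. The broker address, the "events" topic, the group id, and the record contents are hypothetical placeholders; the streaming and connector APIs are not shown.

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaApiSketch {
  def main(args: Array[String]): Unit = {
    // Producer API: publish one record to a hypothetical "events" topic.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord[String, String]("events", "txn-0001", """{"amount":125.0}"""))
    producer.close()

    // Consumer API: subscribe to the same topic and poll one batch of records.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "events-reader")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(Collections.singletonList("events"))
    consumer.poll(Duration.ofSeconds(5)).asScala.foreach(r => println(s"${r.key} -> ${r.value}"))
    consumer.close()
  }
}
```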
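
Below is a minimal incremental-update sketch of the two-step analytics computation described above, assuming Spark 2.3+ and hypothetical paths and column names. Partial aggregates for the newly arrived batch are merged with the persisted intermediate aggregates, and the final measures are derived from the merge (written as Parquet here; in practice the measures are published into Impala).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object IncrementalAnalyticsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("incremental-analytics").getOrCreate()

    // Step 1: partial aggregates over the newly arrived batch only.
    val newBatch: DataFrame = spark.read.parquet("/data/events/new_batch")
    val batchAgg = newBatch.groupBy("customer_id")
      .agg(sum("amount").as("amount_sum"), count(lit(1)).as("txn_count"))

    // Step 2: merge with the persisted intermediate aggregates and derive the final measures.
    val intermediate = spark.read.parquet("/data/intermediate_agg")
    val merged = intermediate.unionByName(batchAgg)
      .groupBy("customer_id")
      .agg(sum("amount_sum").as("amount_sum"), sum("txn_count").as("txn_count"))
    val measures = merged.withColumn("avg_amount", col("amount_sum") / col("txn_count"))

    // Persist the new intermediate state and publish the final measures.
    merged.write.mode("overwrite").parquet("/data/intermediate_agg_next")
    measures.write.mode("overwrite").parquet("/data/analytics_measures")
    spark.stop()
  }
}
```

Only the batch-sized partial aggregation touches the new data, which is what keeps the update low latency as the full data set grows.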
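
A minimal pipeline sketch of the Kafka-to-Spark Streaming-to-HBase leg of the real-time fraud monitoring pipeline above, assuming the spark-streaming-kafka-0-10 integration and hypothetical names for the topic ("txn_events"), the HBase table ("txn_raw"), and its column family ("e"); the Flume, Lily indexer/Solr, and alerting stages are omitted.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object TxnMonitoringSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("txn-monitoring"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "txn-monitoring")

    // Each Kafka record is assumed to be one transaction event, keyed by transaction id.
    val events = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("txn_events"), kafkaParams))

    events.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One HBase connection per partition; each event becomes a cell in its transaction's row.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("txn_raw"))
        records.foreach { rec =>
          val put = new Put(Bytes.toBytes(rec.key))
          put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value))
          table.put(put)
        }
        table.close(); conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```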
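
A minimal metadata-driven sketch of the reporting update noted above, under the assumption of a hypothetical metadata table that maps source columns to reporting columns. With this pattern, adding or renaming a reporting column requires only a metadata change, not a change to the Spark module.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object MetadataDrivenReportingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("metadata-driven-reporting").getOrCreate()

    // (source_column, report_column) pairs; in practice read from the metadata store.
    val columnMap = spark.read.parquet("/metadata/report_columns")
      .collect()
      .map(r => (r.getString(0), r.getString(1)))

    // Project the curated data into the reporting shape purely from metadata.
    val source = spark.read.parquet("/data/events/curated")
    val report = source.select(columnMap.map { case (src, dst) => col(src).as(dst) }: _*)

    report.write.mode("append").parquet("/warehouse/report_fact")
    spark.stop()
  }
}
```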
Senior Consulting Architect
Confidential, San Francisco
Responsibilities:
- Laid out a big data cluster consisting of master nodes, worker nodes, Kafka and ZooKeeper nodes, and edge nodes for a scalable, fail-safe data lake expected to grow to petabytes in volume. Installed the Hortonworks Hadoop distribution, configured it for optimal performance, and administered and monitored it for continuous availability.
- Designed and implemented a web server to collect telemetry events from online games and from related game applications and tools.
- The telemetry handler converts incoming events into Avro byte arrays, if not already converted by the clients, and then posts them to Kafka message streams for consumption by the Spark ETL modules (a serialization sketch follows this list).
- Spark/Scala ETL modules were implemented using Spark Kafka streaming and file streaming.
- Transformation logic was then applied to the Spark DStreams to generate transformed objects for the events.
- Typical transformation logic involves data type conversion, addition and omission of raw event columns, nesting level changes, object reformatting, aggregations, etc. (a transformation sketch also follows this list).
- The transformed event objects are then processed by HBase DAO modules to load them into the data lake built on HBase.
- Analytics and aggregates are computed on the stored data and the data in the pipeline using Hive modules and Spark SQL and then published to the Analytics data marts.
- Implemented Analytics API to support dashboards, analytics presentation layer, and connectivity to other analytics tools such as Tableau, Apache Zeppelin and SQL access through Hive.
- Worked with the data science team and used the Spark machine learning library with Spark DataFrames and Zeppelin.
- Conducted performance analysis of big data applications such as ETL modules, ingestion tools, analytics computations, and queries against Hive and HBase.
- Worked on configuration, tuning, and scale-up of HBase, Hive, and YARN performance.
- Worked with Hadoop distribution vendors as the point of contact to resolve problems faced during big data solutions design and development.
- Attended a number of big data conferences to keep current with the latest developments.
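
A minimal serialization sketch of the telemetry handler's Avro conversion and Kafka post described above; the event schema, field names, and the "telemetry" topic are hypothetical placeholders for illustration.

```scala
import java.io.ByteArrayOutputStream
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TelemetryAvroSketch {
  // Hypothetical minimal telemetry event schema.
  val schemaJson: String =
    """{"type":"record","name":"TelemetryEvent","fields":[
      |{"name":"game_id","type":"string"},
      |{"name":"event_type","type":"string"},
      |{"name":"ts","type":"long"}]}""".stripMargin
  val schema: Schema = new Schema.Parser().parse(schemaJson)

  // Convert one event into an Avro byte array (binary encoding).
  def toAvroBytes(gameId: String, eventType: String, ts: Long): Array[Byte] = {
    val record: GenericRecord = new GenericData.Record(schema)
    record.put("game_id", gameId)
    record.put("event_type", eventType)
    record.put("ts", Long.box(ts)) // box the epoch-millis timestamp for the generic record
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")
    val producer = new KafkaProducer[String, Array[Byte]](props)
    // Post the Avro bytes to the hypothetical "telemetry" topic.
    producer.send(new ProducerRecord[String, Array[Byte]]("telemetry", "game-42",
      toAvroBytes("game-42", "level_complete", System.currentTimeMillis())))
    producer.close()
  }
}
```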
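
A minimal transformation sketch of the per-batch DStream logic described above, using hypothetical raw and curated event shapes; the real modules operate on the decoded Avro events rather than case classes, and only the transformation functions are shown.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event shapes used only for illustration.
case class RawEvent(gameId: String, eventType: String, ts: Long, debugInfo: String)
case class CuratedEvent(gameId: String, eventType: String, eventDate: String)

object TransformSketch {
  // Type conversion, column omission (debugInfo dropped), and reformatting (epoch millis to a date string).
  def curate(raw: DStream[RawEvent]): DStream[CuratedEvent] =
    raw.map { e =>
      CuratedEvent(e.gameId, e.eventType, new SimpleDateFormat("yyyy-MM-dd").format(new Date(e.ts)))
    }

  // Per-batch aggregation: event counts by (gameId, eventType).
  def counts(curated: DStream[CuratedEvent]): DStream[((String, String), Long)] =
    curated.map(e => ((e.gameId, e.eventType), 1L)).reduceByKey(_ + _)
}
```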
Confidential, CA
- The collected metadata is stored in HBase and is used to generate Spark/Scala code modules that apply transformations, validations, verifications, and key generation, and then migrate the data to a Hadoop-based data platform such as HBase or Hive/Impala based on data characteristics.
- The data from the base data platform is published to application layers to generate Analytics, Reports, and Complex event processing.
- The Data Ingestion Tool can handle both incremental and snapshot data loads and preserves the data integrity.
- Spark/Scala code modules are regenerated and packaged whenever a schema change or version change is detected for a data source.
Confidential, Milpitas, CA
- Developed big data infrastructure solutions for Hadoop on demand, cluster provisioning through virtual clusters, and elastic data processing.
- Built big data systems management making use of the OpenStack and OpenStack Sahara infrastructure frameworks, Thrift communication modules, Python programming, and a PostgreSQL repository.
- Enabled cloud application design using OpenStack components, OpenStack Sahara, virtual clusters, web services, elastic data processing, and Hadoop big data technologies.
- Further speedup and performance gains were enabled through Spark SQL in-memory analytics processing and multi-tier caching using flash drives.
- Developed applications for data processing, analysis, and ETL using Spark/Scala distributed data management.
- Worked on a number of Big data performance optimizations such as MapReduce scheduling optimizations, QoS parameter injection, On-demand resource provisioning and Machine learning.
- Big data applications were monitored using Ganglia, Nagios, and Apache Hadoop Metrics collection.
Confidential, Kansas City, MO
- Collected large data sets, ran transformations, normalized and standardized raw data, and stored it in HBase and the Hadoop Distributed File System for the data platform.
- Ran analytics and data warehouse queries, derived intelligence, and produced reports.
- Used Apache Crunch pipelines for batch data collection and Apache Storm for real-time data collection.
- The Population Health Care Applications Suite utilizes Hadoop ecosystem technologies consisting of Avro data models, HBase, Hadoop HDFS, MapReduce parallel programming, the Apache Crunch Java API for data pipelines, Apache Storm real-time computation, Hive analytical queries, Pig Latin scripts, ZooKeeper, Oozie, etc.
Confidential, Santa Clara, CA
- Worked on big data analytics, designing parallel query processing and efficient data partitioning strategies, including columnar storage.
- Identified ways to integrate open source software modules into the overall Analytics data management solution.
- Speed, scale-out, and efficient code generation complementing parallel processing for OLAP queries were the key requirements.
Confidential, Sunnyvale, CA
- Worked for Confidential on Hadoop storage, scalability, data warehousing and ETL, data mining, search engines, etc.
- The interoperability and federated communication across disparate office communication systems, such as Microsoft OCS, Openfire, Cisco WebEx, Google Apps, and the Confidential Sametime platform, generate billions of messages per month, demanding scalable storage.
- Designed and implemented a Hadoop distributed and cluster file system for highly scalable message logging.
- Worked on scalability issues and brought a saturated, out-of-control system back to scale with spare capacity of 80%.
- Worked on Java modules, SQL and data warehouse design with efficient ETL, data mining, reporting modules, big data management, NoSQL, Hadoop cluster file storage, Hive, Cassandra, and Pig Latin.
- Designed a shared Directory System for the Community of Unified Communication to exchange messages and to conduct web-based audio and video conferences.
- The directory information is loaded directly into the Directory Database from the Active Directories of the participating companies.
- The directory entry consists of communication addresses, profile and other relevant information on the participating members.
- This information is securely shared among the collaborating professionals and end users.
- An advanced search engine based on Apache Solr and a REST web services API supports a wide variety of efficient, fast searches on the directory entries to quickly locate key resources (a query sketch follows this list).
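
A minimal SolrJ sketch of the directory search described above; the Solr core name ("directory") and the field names are hypothetical.

```scala
import scala.jdk.CollectionConverters._
import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object DirectorySearchSketch {
  def main(args: Array[String]): Unit = {
    // Point the client at the hypothetical "directory" core.
    val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/directory").build()

    // Find up to 10 members matching a name and company; field names are placeholders.
    val query = new SolrQuery("name:smith AND company:acme")
    query.setRows(10)

    val results = solr.query(query).getResults
    results.asScala.foreach(doc => println(s"${doc.getFieldValue("name")} -> ${doc.getFieldValue("sip_address")}"))
    solr.close()
  }
}
```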
Technical Director
Confidential, Mountain View, California
Responsibilities:
- Designed the architecture of a scalable data store, management server, web server, and massively distributed security agent network to scale up and protect from a single installed security site up to a million computers against all kinds of intrusions, viruses, spyware, malware, etc., with good response times and an easy-to-use configuration, administration, and reporting dashboard.
- The Confidential Enterprise Protection (SEP) Server is responsible for the deployment of security agents and for distributing security policies and updates. The SEP server periodically collects security event information from the clients and supports a reporting and administrative dashboard.
- Identified the performance bottlenecks and critical sections in different layers and components of the security system and designed solutions for massive scalability and high availability.
- Introduced load balancing, load distribution, information pipelining, information aggregation, data model for I/O and CPU parallelism resulting in improved response and throughput.
- Optimized database queries for faster execution. Also, resolved several usability and manageability issues.
- Introduced Hadoop HDFS parallel, clustered, partitioned storage into the Enterprise Security Server for collecting log-type data, using the Hadoop file system API for faster, parallel, and scalable I/O (a sketch follows this list).
- Used Sqoop data import and export to load the data from HDFS into SQL Server for data analysis and reporting.
- The performance gain was significant and helped further scalability.
- Resolved and designed solutions for a number of high-profile pre-sale and post-sale enterprise customer technical issues, contributing to the company's reputation as the industry leader.
- Worked across divisions of the company designing solutions for superior performance.
- A five-fold scalability improvement, from protecting 100K endpoints to 500K endpoints per security site, was the goal, and it was accomplished.
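
A minimal sketch of writing collected security log records through the Hadoop FileSystem API, as referenced above; the path and record layout are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SecurityLogHdfsSketch {
  def main(args: Array[String]): Unit = {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath.
    val fs = FileSystem.get(new Configuration())

    // One file per collection cycle; HDFS replicates and distributes the blocks.
    val logPath = new Path(s"/security/logs/agent-events-${System.currentTimeMillis()}.log")
    val out = fs.create(logPath)
    out.writeBytes("2024-01-01T00:00:00Z\tagent-17\tvirus-definition-update\n")
    out.close()
    fs.close()
  }
}
```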
Lead Architect
Confidential, Santa Clara, California
Responsibilities:
- Designed and developed high-speed, real-time complex event processing in a streaming data environment. User-defined aggregates extend the query capabilities beyond standard SQL features.
- Provided conceptual design and architecture for caching relational data and objects into a distributed in-memory data fabric and offered SQL and OQL user interfaces.
- Enhanced query features and achieved query performance 5-10 times that of traditional database systems.
- Designed and implemented caching for critical data for fast OLAP performance.
- Optimal code generation, lock avoidance, and elimination of bottlenecks contributed to high performance.
- Designed and implemented continuous queries in a distributed data cache environment for complex event processing and strategic business decisions emphasizing high performance and speed. MySQL provided the persistence for the data in the distributed cache.
Principal Engineer
Confidential
Responsibilities:
- Worked on distributed query optimization in a wide area network for enterprise data integration. The MySQL query engine was the starting point for this project.
- Designed and implemented a query engine to offer integrated data access and updates in a widely distributed network of data sources using function shipping, data shipping, and data partitioning, etc.
- Designed and implemented several features for making Virtual Operational Data Warehousing a reality.
- MySQL query engine is enhanced for the distributed querying and distributed updates.
- Modified several MySQL query features to work in a distributed data warehouse environment.
- The query engine successfully distributed queries and updates in a wide area network among the data sources Teradata, Oracle, MySQL, and DB2. Teradata was popular in the data warehousing arena, and significant effort was devoted to Teradata SQL generation, configuration, and performance tuning, including load balancing.
Confidential, Dublin, California
Architect
Responsibilities:
- Envisioned an architecture and design to accomplish decision support query scaling with parallelism in SMP and multi-node cluster environments, along with mixed-load transaction processing.
- The parallel optimizer and data-flow execution engine encapsulate parallelism and adapt to resource availability.
- Designed an extensible query engine that allows different search strategies and physical access methods and extends to a cluster architecture.
- Transitive closures allow for additional optimization improvements.
- Properties generated in the plan flow up to avoid additional sorts and repartitioning.
- Confidential component addresses cluster issues. It is easy to generate SQL from plan segments for function shipping in a cluster environment.
- Cost-based pruning, a greedy algorithm for the high-water-mark plan, and a timeout mechanism are some additional features.
- Specifically implemented the following features in the operational DSS project: a data partition property model for optimal partitioning in parallel query processing, repartition rules, a transitive closure model, a query plan cache for the search engine, global optimization, a framework for partition statistics, a strategy to propagate statistics up the query plan, etc.
- Implemented a number of performance features such as parallel sort, large I/O, multiple named caches, and page prefetch for I/O speedup in the parallel server product.
- Modified the optimizer to use large I/O whenever appropriate; a five-fold speedup in performance was achieved as a result.
- Specifically, targeted resource manager and disk I/O manager components for possible performance gains.
- Supported variable page sizes for fast I/O depending on the type of query. Implemented data and index page prefetch using automatically configured large I/O.