
Big Data Architect (Chief, Scientist and Strategist) Resume


Roseland, NJ

SUMMARY:

  • I am a technology leader and innovator with significant experience in enterprise Big Data roadmaps, strategies, best practices and innovation; BI and data warehousing; data governance; and data science.
  • I am intellectually hungry and continually challenge myself through ongoing learning.
  • I am hands-on with most technologies in order to understand critical details and lead teams by example, while keeping a big-picture view of scalability, reliability, cost-effectiveness and innovation. I am solution oriented and focus on using innovative and emerging technologies to solve real-world problems and achieve the firm’s goals.
  • I always do my best to earn other people’s trust, encourage others through positivity, and be helpful and resourceful.
  • I thrive in high pressure environments in both start-ups and mature organizations.
  • I have high bandwidth and have made a practice throughout my career of helping startups and open source communities with technology leadership in addition to my full-time roles.
  • These engagements keep me in touch with a vibrant, innovative community that is intellectually stimulating and energizing.

SKILLS:

Hadoop: Apache Hadoop (HDFS/HDFS2, MapReduce, YARN, Hive, Pig, Oozie, HBase, Mahout, Sqoop, Knox), Cloudera (CDH, Cloudera Manager, Impala, Flume, Hue, Sentry, BDR, Parquet), Hortonworks (HDP, Ambari, WebHDFS, HCatalog, Stinger and Tez, Falcon, ORCFile), Pivotal HD (HAWQ, GemFire XD, PXF, Spring XD), SQL-in-Hadoop, RainStor, ZooKeeper, MRUnit, Datameer, Platfora, Pentaho Big Data, Tableau Big Data, Talend Big Data

Spark: Apache Spark (RDD), Spark Streaming, Spark SQL, MLlib, GraphX, Shark, SparkR, Tachyon, Spark-on-YARN, Hive-on-Spark, Spark on HBase, Spark on Cassandra

Scala: SBT, ScalaNLP, Breeze, Scalding

Streaming: Akka, S4, Storm (Trident), Spark Streaming

Java: GC, multi-threading, JUnit, JDBC, JMS, Java XML, JEE (JSP, Servlet, EJB), Tomcat, Eclipse, Spring

NoSQL: Bigtable, Cassandra (DataStax), HBase, MongoDB, Couchbase, Aerospike, GemFire, MarkLogic

Data Warehouse: Kimball methodologies, Star/Snowflake schema, Operational Data Store (ODS), Master Data Management, Meta-Data Management, MPP (Teradata, Vertica, Greenplum, Netezza, RedShift)

Data Mining: Clustering, Classification, Predictive Modeling, Machine Learning, Natural Language Processing, Text Mining, PCA, SVD, A/B testing, R, MATLAB, SAS, SPSS, Weka, PMML

Graph: Neo4j, RDF, Netflix-Graph, Titan, Faunus, Giraph, Hama, GraphLab, GraphX

R: CRAN, RStudio, RHadoop, Revolution Analytics R

Search: Lucene, Solr, ElasticSearch

BI: OLAP, MDX, Tableau, Cognos, SSAS, SSRS, Pentaho, Excel, 1010data, Google Analytics

IBM: IBM BigInsights (Big SQL, BigSheets), InfoSphere Streams, LanguageWare, GPFS-SNC, WebSphere

Amazon: AWS EC2, EBS, S3, RightScale, Elastic MapReduce (EMR), RedShift, RDS, Dynamo

ETL: Pentaho Data Integration (PDI, Kettle), Talend (MDM, Big Data), SSIS, Informatica, Ab Initio, DataStage

Social Media: Facebook API, Twitter Firehose, Gnip, DataSift

MQ/ESB: ActiveMQ, Kafka, MQSeries

WEB: JavaScript, JSON, Node.js, D3.js, RESTful, jQuery, Ajax, Web2.0, OAuth, Google Analytics, JMeter

SQL: 3NF, ERD, ACID, Oracle, Sybase, DB2 UDB, SQL Server, MySQL, Postgres, KDB, ERwin, TOAD

O/S: AIX, Linux, Solaris, Microsoft Windows/NT, Mac OS X

Linux: RedHat, RHEL, CentOS, SuSE, SSH/SSL, top, iostat, Ganglia, Nagios, Chef, Puppet

Languages: Java/Groovy/Scala, C/C++/C#, PHP, Perl, Python, Ruby, R, Shell Scripts

Project Management: Agile (Scrum, Adaptive and Iterative, Extreme Programming), Rational Unified Process (RUP), RAP, UML, unit testing, Basecamp, Pivotal Tracker, Jira

Others: CVS, SVN, Git/GitHub, Kerberos, Apache Thrift, Akamai, Flash, VMware, Scribe, Flume, Azure

PROFESSIONAL EXPERIENCE:

Big Data Architect (Chief, Scientist and Strategist)

Confidential, Roseland, NJ

Responsibilities:

  • Report to the CTO and use soft skills to assist in building enterprise Big Data blueprints, roadmaps and best practices across multiple data centers with multi-tenancy for people analytics and social applications; help evaluate and select enterprise Big Data solutions (NoSQL, Hadoop, social and mobile) for the next-generation platform - Confidential 2.0; educate stakeholders through presentations, concepts, designs and issues using technology visualization, examples and internal wiki articles; build and maintain relationships with Big Data vendors and open source communities; develop new use cases, assess expected values, identify key requirements and score enterprise Big Data solutions with stakeholders; define evolutionary data quality and data governance methodologies and processes; present recommended approaches for CIO-level selection
  • Lead installation, configuration, trouble-shooting, performance- and memory-tuning, monitoring and maintenance of Hortonworks HDP clusters (HDFS2, YARN, Hive (HCatalog and Tez), HBase, Pig, Oozie, Sqoop) using Ambari integrated with Storm clusters, Cloudera CDH clusters (HDFS2, YARN, Hive, HBase, Impala) integrated with Solr, Spark and Spark Streaming clusters using Cloudera Manager, and Pivotal HD clusters (HDFS2 and HAWQ) integrated with GemFire XD clusters using Pivotal Command Center, in production environments across multiple data centers (active-active) with high scalability, high availability, high throughput, security, backup and disaster recovery, and multi-tenancy for real-time applications, ad-hoc queries and interactive analytics, rich analytics, fraud detection and risk management, data warehousing and dimensional modeling, and a data lake of client data, enterprise events, web logs, big events, user behavior logs and social data
  • Lead ELT/ETL, data lineage, data profiling and cleansing, data replication and synchronization, data governance, scheduling, monitoring and auditing using Talend and Talend Big Data with Sqoop, Oozie and Spring XD, from RDBMSs (Oracle, SQL Server, DB2), legacy mainframe systems and Greenplum into Hortonworks HDP, Cloudera CDH and Pivotal HD, as well as MongoDB, GemFire, GemFire XD and Greenplum
  • Lead installation, configuration, trouble-shooting and performance-tuning of integrated data visualization, BI and reporting using Tableau and Tableau Big Data with Greenplum, Hive and HBase in Cloudera CDH and Hortonworks HDP, Cloudera Impala, and HAWQ in Pivotal HD and GemFire XD
  • Lead design, development and implementation of low-latency distributed batch processing and analytics using Spark and real-time distributed micro-batch processing and analytics using Spark Streaming with Cloudera CDH in Scala, with memory-efficient data structures, data modeling and data formats, fast data serialization, multi-threading with short garbage-collection pause times, and data quality control and data governance (a minimal Spark Streaming sketch follows this list)
  • Lead interactive and sophisticated Big Data analytics and mining using R and RStudio, Revolution R and RHadoop for exploratory data analysis, principal component analysis (PCA), clustering and classification, fraud detection, risk analytics and text mining
  • Lead design, development and implementation of real-time distributed Storm applications with Hortonworks and Kafka: query processing (distributed joins, de-normalizations, aggregations and materialized views, etc), text mining and machine learning of logs, events and social data joined with enterprise client data stored in DataStax (Cassandra) column families, HBase column families and Hive tables
  • Lead Spark innovation PoC: tuning with concurrency and G1 garbage collector in Java 7; integration with YARN and Hive; interactive SQL, ad-hoc queries analytics using Shark, Spark SQL, Spark SQL integrated with Hive, Hive-on-Spark; integration with HBase, DataStax Cassandra, MongoDB, GemFire and VoltDB
  • Lead Hadoop innovation PoC: SQL-on-Hadoop: Big Data modeling using a star schema with ORCFile, column orientation, compression, encoding, predicate pushdown and vectorized query execution using Hortonworks Stinger (Hive on Tez) on YARN (an illustrative ORC star-schema sketch follows this list); Cloudera Impala on YARN (Llama) with Sentry and BDR for data warehousing, data governance, analytics and BI reporting; Falcon for data governance and data replication; Apache Hadoop, Apache Hive and Apache HBase: compilation, building, configuration and testing of latest versions; HDFS2: heterogeneous storage (memory, SSD and disk), centralized cache management, quotas, extended attributes, short-circuit local reads, federation and ViewFS, ACLs, NFS Gateway; YARN: node labels for heterogeneous nodes
  • Lead data science innovation PoC: Datameer, Platfora and Pentaho Big Data for Big Data visualization; Spark with MLlib and SparkR for exploratory data analysis, principal component analysis (PCA), clustering, classification, predictive modeling and text mining; Giraph and Hama on YARN and GraphX on Spark for sophisticated Big Graph analytics and mining
  • Lead NoSQL database and data modeling PoC of MongoDB vs GemFire as in-memory NoSQL store for cache and metadata management, Neo4j as low-latency Graph for organization structure data, Couchbase as NoSQL store for client data and enterprise events and DataStax (Cassandra, Solr and Hadoop) for single data store of client data and enterprise events
  • Lead data aggregation and data modeling PoC of MongoDB, GemFire and VoltDB (in-memory NewSQL) with Cloudera CDH vs GemFire XD (in-memory SQL) integrated with Pivotal HD (HAWQ and PXF) as real-time, high-performance operational data stores (ODS), and MarkLogic (disk-based NoSQL - XML) as the primary data store for transactional data in the financial industry from BPS, Gloss, Impact and other surrounding systems, as well as data held by different clients, into a common data model; lead design and implementation of ETL processes using Red Hat JBoss ActiveMQ as an enterprise service bus (ESB), Spring with JMS, Spring Data and Spring XD for MongoDB, GemFire and Pivotal HD; lead an offshore team in China to develop a web GUI for real-time big data aggregation and visualization using Node.js and D3.js
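
A minimal Spark Streaming sketch in Scala, illustrating the kind of Kafka-fed micro-batch job described above; the ZooKeeper quorum, consumer group, topic name and event format are hypothetical, and the classic receiver-based Kafka integration (spark-streaming-kafka) is assumed rather than the actual production pipeline:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object EventMicroBatch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("event-micro-batch")
      // 5-second micro-batches over a Kafka event topic
      val ssc = new StreamingContext(conf, Seconds(5))

      // hypothetical ZooKeeper quorum, consumer group and topic
      val events = KafkaUtils.createStream(
        ssc, "zk1:2181,zk2:2181", "event-consumers", Map("user-events" -> 2))

      // count events per client per batch; the CSV layout is illustrative
      events
        .map { case (_, line) => line.split(",")(0) -> 1L }
        .reduceByKey(_ + _)
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
  }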
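
The SQL-on-Hadoop PoC above (star schema on ORCFile) can be sketched as follows; the table names, columns and compression setting are illustrative assumptions, and a Spark 1.x HiveContext is assumed purely as a convenient way to issue the HiveQL:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object OrcStarSchema {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("orc-star-schema"))
      val hive = new HiveContext(sc) // Spark 1.x HiveContext assumed

      // Hypothetical fact table: ORC storage, compressed, partitioned by date
      hive.sql("""
        CREATE TABLE IF NOT EXISTS fact_client_event (
          client_key BIGINT,
          event_type STRING,
          amount     DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
        TBLPROPERTIES ('orc.compress' = 'ZLIB')
      """)

      // Star-schema style query joining against a hypothetical dimension table
      hive.sql("""
        SELECT d.client_name, COUNT(*) AS events
        FROM fact_client_event f
        JOIN dim_client d ON f.client_key = d.client_key
        WHERE f.event_date = '2014-06-30'
        GROUP BY d.client_name
      """).show()
    }
  }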

Chief Data Officer (Scientist)

Confidential, New York City

Responsibilities:

  • Report to a non-technical CEO and assist in milestone planning and project execution; build real-time, large-scale and innovative Big Data, Graph and recommendation blueprints, roadmaps, and engineering and research teams; educate stakeholders through presentations and examples; work with PhD students and professors from leading universities; maintain relationships with vendors and open source communities and lead price negotiations
  • Lead research and evaluation of real-time, large-scale Graph solutions: Titan on Cassandra, Neo4j and Netflix-Graph; lead implementation of Graph-based recommendations integrated with Netflix-Graph, integration of Neo4j with YARN on Hortonworks HDP for real-time Graph analytics, and MapReduce jobs and Hive tables & UDFs for pre-computation of user similarities and dynamic recommendation rankings
  • Lead NoSQL database and data modeling PoC: Cassandra, MongoDB, Couchbase and Aerospike; lead configuration, performance tuning and maintenance of DataStax (Cassandra, Solr and Hadoop) clusters with SSDs on Amazon EC2; lead NoSQL data modeling and data lifecycle management with DataStax (Cassandra); lead data migration from Microsoft SQL Server on Azure to Cassandra
  • Lead design and development of a real-time, innovative hybrid of Graph-based and neighborhood-based collaborative filtering (CF) recommendation framework running on Apache Thrift Java servers to replace Apache Mahout Taste, integrated with model-based CF, content-based and social media-based approaches; lead evaluation and customization of existing CF user-similarity and item-ranking algorithms (a minimal similarity sketch follows this list); use R and RStudio for statistics, data visualization, exploratory data analysis, user segmentation and profiling, and to help evaluate different CF algorithms under different customizations and optimizations
  • Lead research and evaluation of real-time, large-scale social media solutions: Gnip and DataSift; lead integration of user-similarity and ranking algorithms with social data from Twitter, Facebook and other social networks, powered by Apache Storm and enriched with augmentations such as geo-location, social influence and sentiment analysis provided by DataSift
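
A minimal Scala sketch of the neighborhood-based similarity building block referenced above; the items, users and ratings are made up, and this is not the production recommendation framework:

  object CosineSimilarity {
    // Sparse rating vectors: userId -> rating
    type Ratings = Map[String, Double]

    def cosine(a: Ratings, b: Ratings): Double = {
      val dot = a.keySet.intersect(b.keySet).toSeq.map(u => a(u) * b(u)).sum
      val norm = math.sqrt(a.values.map(x => x * x).sum) *
                 math.sqrt(b.values.map(x => x * x).sum)
      if (norm == 0.0) 0.0 else dot / norm
    }

    def main(args: Array[String]): Unit = {
      // Illustrative item -> (user -> rating) matrix
      val itemRatings: Map[String, Ratings] = Map(
        "itemA" -> Map("u1" -> 5.0, "u2" -> 3.0, "u3" -> 4.0),
        "itemB" -> Map("u1" -> 4.0, "u3" -> 5.0),
        "itemC" -> Map("u2" -> 2.0, "u3" -> 1.0))

      // Rank items most similar to itemA (the neighborhood-CF building block)
      val neighbors = (itemRatings - "itemA").toSeq
        .map { case (item, r) => item -> cosine(itemRatings("itemA"), r) }
        .sortBy(-_._2)

      neighbors.foreach { case (item, sim) => println(f"$item%-6s $sim%.3f") }
    }
  }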

Big Data Architect (Principal, Scientist)

Confidential, Waltham, MA

Responsibilities:

  • Report to the CTO and assist in milestone planning and project execution; build enterprise Big Data blueprints and roadmaps; build and lead the Big Data platform & analytics team; lead Big Data project management; evaluate and select the Big Data stack; educate stakeholders by example; work with division directors and segment VPs; work with data scientists and consultants from the IBM Big Data Innovation Center, HP Vertica, EMC Greenplum, RainStor and commercial open source vendors: DataStax, Cloudera, Hortonworks, Talend, Revolution Analytics; build and maintain relationships with vendors and open source communities, and lead price negotiations
  • Lead PoCs of real-time email send optimization, real-time email fraud detection, real-time scoring of accounts for compliance, and real-time social media analytics with Big Data vendors, including vendor responses, evaluation and scoring of paper down-select exercises; execute current-state assessments, desired architecture, gap analysis and associated deliverables; lead the analytical / operational track and data-gathering interviews
  • Lead configuration, implementation, performance-tuning and maintenance of the big data platform using DataStax (Cassandra, Solr and Hadoop), with NoSQL data modeling for real-time solutions over structured and semi-structured data with no ETL and low-latency OLTP and OLAP in one system
  • Lead implementation of Hive tables & UDFs for big data processing and warehousing (a minimal UDF sketch follows this list), Pig scripts and MapReduce jobs for big data preprocessing, and Mahout scripts for big data mining; lead implementation of ETL/ELT and data quality / data governance using Talend Big Data with Cassandra, DB2 and Cloudera CDH
  • Write internal wiki / blog posts with proof-of-concept evaluations of leading-edge big data technologies and analytics: Revolution Analytics (RHadoop) with Cloudera CDH; R with Vertica; IBM BigInsights (Natural Language Processing and Text Mining) with Netezza, Greenplum and Vertica for data warehousing, Cloudera CDH for Big Data and IBM BigSheets for reporting; IBM InfoSphere Streams vs Storm; YARN and HCatalog on Hortonworks HDP
  • Work with Chief Analytics Officer and lead big data analytics with business analysts, product managers, data scientist consultants from IBM, using R for data visualization, exploratory data analysis, principal components analysis, customer segmentation and profiling, predictive analysis, modeling, decision trees, linear regression, what-if analysis to discover hidden insights, provide competitive advantages and address business problems
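
A minimal sketch of the kind of Hive UDF referenced above, written in Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API; the function name, logic and registration statements are illustrative assumptions:

  import org.apache.hadoop.hive.ql.exec.UDF
  import org.apache.hadoop.io.Text

  // Normalizes an email address for matching; Hive resolves evaluate() by reflection
  class NormalizeEmail extends UDF {
    def evaluate(input: Text): Text =
      if (input == null) null else new Text(input.toString.trim.toLowerCase)
  }

  // Registered in Hive with hypothetical jar / function names:
  //   ADD JAR hdfs:///udfs/normalize-email.jar;
  //   CREATE TEMPORARY FUNCTION normalize_email AS 'NormalizeEmail';
  //   SELECT normalize_email(email) FROM contacts LIMIT 10;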

Lead Data Architect (Scientist)

Confidential, New York City

Responsibilities:

  • Report to a non-technical CEO and assist in milestone planning and project execution; work closely with the business team, data stewards, backend team, and professors and PhD students from leading universities; define project scopes and maintain project charters; set up the BI, analytics and recommendation team, cultivate a creative and productive environment with great passion, establish team identity and generate excitement
  • Lead design and implementation of BI, analytics and recommendation solutions with high scalability, high reliability and cost-effectiveness; evaluate BI, analytics and recommendation products and choose vendors; negotiate with vendors for large discounts, free or cost-effective training, technical support and short-term consultants; evaluate BI, analytics and recommendation products (Pentaho for BI, OLAP, data visualization and data mining; Talend for daily and real-time ETL/ELT, data profiling and data governance, Metadata Management and Master Data Management (MDM); Vertica for data warehousing, in-database mining and analytics, and real-time analytics; Cloudera for Hadoop, Hive, Pig, HBase (per-user and video tracking, real-time queries) and Mahout (classification, clustering and recommendation); Cloudera Flume for transaction-guaranteed log delivery); design data flow and integration solutions, making tradeoffs between consistency, latency, throughput and resiliency
  • Hadoop Architect
  • Design, configure, performance-tune and maintain a Cloudera Distribution of Hadoop (CDH) v4.0.0 PROD cluster with Cloudera Manager for cluster management, Ganglia for performance monitoring, and Hive, ZooKeeper, HBase, Pig, Oozie and Mahout on two master nodes (active/passive, hot standby, automatic fail-over with ZooKeeper) for high availability and 53 nodes in two racks with CentOS 6.2 as data nodes of HDFS, task trackers of MRv1, region servers of HBase, etc; work with backend teams to define hardware configurations (CPU, memory, disks, network cards, etc) of the cluster; configure Hadoop security with MIT Kerberos 5 and migrate data and nodes gradually from Apache Hadoop v1.0.1 and Amazon EMR with Amazon S3 with no loss of critical data
  • Set up and maintain Hive v0.8.0 on CDH v4.0.0; design Hive schemas using appropriate file formats (text, sequence files, RCFiles, etc), AvroSerde or JSONSerde, and multi-level partitions and buckets (sorted) for cost-effective data warehousing with high scalability (a minimal partitioned-table sketch follows this list); design HiveQL scripts using built-in functions and custom UDFs for preprocessing, querying and analysis; integrate HBase with Hive for federated queries and analysis, and migrate data between HBase and Hive
  • Install, configure, performance-tune and maintain Vertica v5.1 PROD and DEV clusters on 5 nodes with CentOS 6.2; work with Vertica and backend teams to define hardware settings (CPU, memory and disks) for, order and set up Dell servers for the Vertica clusters
  • Design and maintain OLAP and star/snowflake schema using Kimball methodologies; design dimension tables (user, video, clip, time, etc), fact tables (storage costs, delivery costs, transcoding costs, revenues, etc), slowly changing dimension (SCD) for users, aggregation tables for log data from Hive and HBase
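
A minimal sketch of the partitioned, bucketed Hive table design referenced above, issued over JDBC from Scala; a HiveServer2 endpoint is assumed, and the host, table, columns and bucket count are illustrative:

  import java.sql.DriverManager

  object CreateTrackingTable {
    def main(args: Array[String]): Unit = {
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "", "")
      val stmt = conn.createStatement()

      // Partitioned, bucketed (sorted) event table of the kind described above
      stmt.execute("""
        CREATE TABLE IF NOT EXISTS video_events (
          user_id  BIGINT,
          video_id BIGINT,
          event    STRING,
          ts       BIGINT
        )
        PARTITIONED BY (dt STRING, hr STRING)
        CLUSTERED BY (user_id) SORTED BY (ts) INTO 32 BUCKETS
        STORED AS SEQUENCEFILE
      """)

      stmt.close(); conn.close()
    }
  }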

Senior Data Architect and Lead Consultant

Confidential, Weehawken, NJ

Responsibilities:

  • Work closely with business teams, data stewards, offshore teams, vendors to design, recommend and implement enterprise level Big Data management strategy and framework encompassing logging, reporting, analytics, data warehousing, data profiling, data lifecycle management, master data management, reference data management, risk management solutions with best practices in high scalability, high reliability, data security, regulatory compliance, data governance and cost-effectiveness for securities trading data, security master data, back office-related data and financial application data, unstructured data, legal documents, logs and emails
  • Document specifications, workflows, best practices and models; set up status meetings with the IT development team, QA team, and system and database management team, and deliver presentations; assist in milestone planning and project execution; recruit new members, establish offshore IT teams in China, mentor and train members, and perform assessments and evaluations
  • Configure and maintain CDH v3 PROD clusters with Cloudera Manager for cluster management, Ganglia for performance monitoring, and Hive, ZooKeeper, HBase and Mahout on over 100 nodes across multiple racks in two data centers; work with the GTIS SA team to configure Hadoop security in private network regions with Kerberos 5; configure file system images and edit logs on local disks and a remote NFS mount for high availability; copy data across clusters for migration and backup; add/stop/decommission data nodes and TaskTracker nodes on the fly; recover name nodes and data nodes from failure
  • Design and maintain Java MapReduce jobs, Avro (JSON serialization), Pig as a data factory and Yahoo Oozie workflows to transform, cleanse and analyze transactional log files, time series data, database dump files, notes, reports, metrics, news, documents and email archives from Teradata and archiving systems, then load and bulk-load into Hive and HBase; configure map and reduce child JVM options for performance tuning and health monitoring; optimize joins with distributed cache, partitioned and sorted datasets, and CompositeInputFormat
  • Setup and maintain Hive v0.5.0 on Hadoop v0.20.0, design Hive schema using appropriate file formats, AvroSerde, multi-level partitions and buckets for cost-effective data warehousing of historic and static data; develop HiveQL scripts using built-in functions and also develop custom UDFs for preprocessing, querying and analysis; integrate HBase with Hive for federated queries and analysis, migrate data between HBase and Hive
  • Set up and maintain HBase v0.90 on Hadoop v0.20-append, AvatarNode and ZooKeeper with replication across data centers as a cost-effective NoSQL solution for continuously loading and updating real-time data; design HBase schemas optimized for use cases that take full advantage of the HBase and BigTable storage architecture (a minimal row-key sketch follows this list)
  • Design and maintain comprehensive data warehouse with data integrity, federated and consolidated data marts of business processes with conformed dimensions, star/snowflake schema using Kimball methodologies; use CA ERwin for logical and physical data modeling, reverse and forward engineering
  • Design and maintain Teradata physical tables and logical views for data marts, DDL and DML scripts, macros, stored procedures and stored functions, triggers, Teradata SQL queries for data profiling, ETL/ELT processes, analytics, business intelligence and trouble-shootings
  • Use Teradata Aggregate Designer to design Aggregate Join Indexes (AJIs) to support high performance MDX queries; run Teradata SQL Explain feature to verify AJIs; recheck MDX expressions and adjust AJIs
  • Use Teradata Master Data Management (MDM) to model and manage master data in a single and centralized repository as a hub, resolve master data issues across various sources; migrate master data from Velocity MDM with Oracle to Teradata MDM; integrate ETL/ELT processes with Teradata MDM
  • Use Teradata Meta Data Services (MDS) to develop and maintain AIMs (metadata classes, properties and relationships), and manage technical metadata (data modeling), business metadata (data profiling, data lineage, data quality), process metadata (ETL/ELT process and business processes)
  • Design and maintain migration processes of terabyte data from SQL server, Sybase, Oracle and DB2 into Teradata using Teradata loading utilities, Teradata SQL scripts and Java stored procedures, Informatica ETL/ELT mappings/workflows; use Informatica PowerCenter Connect for JMS sources to process and load real-time data
  • Use Teradata Analytic Data Set Generator in Teradata Warehouse Miner to create analytic data sets and execute in-database mining using SAS; develop, deploy and maintain Teradata Java stored procedures and Java user-defined functions using Teradata plug-in for Eclipse for transformation, in-database analytics and mining
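
A minimal Scala sketch of the HBase row-key design idea referenced above; a recent HBase client API is assumed, and the table, column family and key layout are illustrative rather than the production schema:

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
  import org.apache.hadoop.hbase.util.Bytes

  object TimeSeriesWrite {
    def main(args: Array[String]): Unit = {
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("trade_events"))

      // Salted, reverse-timestamp composite row key: spreads hot, recent writes
      // across regions while keeping the newest rows first within each instrument
      val instrument = "IBM"
      val ts = System.currentTimeMillis()
      val salt = (instrument.hashCode & 0x7fffffff) % 16
      val rowKey = f"$salt%02d|$instrument|${Long.MaxValue - ts}%019d"

      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("px"), Bytes.toBytes("187.45"))
      table.put(put)

      table.close(); conn.close()
    }
  }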

Architect and Consultant of Business Intelligence and Data Warehouse

Confidential, Jersey City, NJ

Responsibilities:

  • Design and maintain comprehensive BI, data warehouse, data integration and data governance with best practices in data security, data profiling and data quality, and integrated data marts of business processes across transactional, collaborative and analytical systems with conformed dimensions for drill across, and star/snowflake schema, fact tables, dimension tables using Kimball methodologies with onsite team and offshore team in China
  • Identify slowly changing dimensions (SCD) and the right strategies for handling them (a minimal SCD Type 2 sketch follows this list); identify junk dimensions and the right strategies for handling and populating them; identify rapidly changing dimensions, break them off and use mini-dimensions to hold rapidly changing attributes
  • Use CA ERwin data modeler for data warehouse and data mart modeling, logical data modeling, physical Teradata and DB2 table modeling, for documentation, reverse engineering, forward engineering, comparison, validation and reducing defects and redundancies
  • Design and maintain Teradata databases and users, physical Teradata tables and Teradata views for data marts; design and maintain DDL and DML scripts, stored procedures and stored functions, triggers; design, develop and improve Teradata SQL queries using joins, unions and sub-queries for data profiling, ETL processes, analytics, business intelligence and trouble-shootings
  • Design and monitor migration processes of terabyte data from DB2 UDB into Teradata EDW; design, tune, monitor and maintain ETL/ELT transformations, mappings, sessions and workflows using Informatica Designer, Workflow Manager and Monitor; also use Teradata load utilities
  • Define model plans, design and configure OLAP cubes, build and maintain multidimensional analysis cubes - PowerCubes using PowerPlay Transformer periodically, design and maintain reports, ad hoc reports and dashboards using Cognos Query Studio, Report Studio and PowerPlay Studio
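
A minimal Scala sketch of Type 2 slowly changing dimension handling as referenced above: when a tracked attribute changes, the current row is end-dated and a new row becomes current; the dimension shape and attribute are illustrative assumptions:

  import java.time.LocalDate

  case class CustomerDim(customerId: Long, segment: String,
                         effectiveFrom: LocalDate, effectiveTo: Option[LocalDate],
                         isCurrent: Boolean)

  object Scd2 {
    def applyChange(history: List[CustomerDim], customerId: Long,
                    newSegment: String, asOf: LocalDate): List[CustomerDim] =
      history.find(r => r.customerId == customerId && r.isCurrent) match {
        case Some(cur) if cur.segment == newSegment => history // no change, keep as-is
        case Some(cur) =>
          // end-date the current version and open a new current version
          val closed = cur.copy(effectiveTo = Some(asOf.minusDays(1)), isCurrent = false)
          val opened = CustomerDim(customerId, newSegment, asOf, None, isCurrent = true)
          opened :: closed :: history.filterNot(_ eq cur)
        case None =>
          CustomerDim(customerId, newSegment, asOf, None, isCurrent = true) :: history
      }

    def main(args: Array[String]): Unit = {
      val v1 = applyChange(Nil, 42L, "retail", LocalDate.of(2009, 1, 1))
      val v2 = applyChange(v1, 42L, "premium", LocalDate.of(2010, 6, 15))
      v2.foreach(println)
    }
  }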

Senior Consultant of Business Intelligence and Data Warehouse

Confidential, New York City

Responsibilities:

  • Designed and maintained SAP CRM, R/3 ERP, BI and Data Warehouse using SAP APIs with Oracle 8i and WebSphere 5 in Sun Solaris environments, integrated in-house applications and migrated applications to the SAP platform; worked with an offshore team in China
  • Designed and maintained Oracle DMS tablespaces and stored redo logs on raw devices to achieve better I/O performance and data integrity with Oracle; designed and maintained physical Oracle tables for OLTP tables, dimension tables and fact tables in data marts of data warehouses; designed, developed and maintained Oracle DML and DDL scripts, triggers and stored procedures; tuned Oracle SQL statements using SQL hints; periodically ran the ANALYZE TABLE command and adjusted the rule-based query optimizer (see the sketch following this list)
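
A minimal Scala/JDBC sketch of the Oracle SQL tuning described above; the connection string, credentials, table and index names are hypothetical, and the Oracle JDBC driver is assumed to be on the classpath:

  import java.sql.DriverManager

  object OracleTuning {
    def main(args: Array[String]): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@db-host:1521:ORCL", "app_user", "secret")
      val stmt = conn.createStatement()

      // Refresh optimizer statistics for the (hypothetical) orders table
      stmt.execute("ANALYZE TABLE orders COMPUTE STATISTICS")

      // Steer the optimizer toward a specific index via a SQL hint
      val rs = stmt.executeQuery(
        "SELECT /*+ INDEX(o orders_cust_idx) */ order_id, amount " +
        "FROM orders o WHERE customer_id = 1001")
      while (rs.next()) println(s"${rs.getLong(1)} ${rs.getBigDecimal(2)}")

      rs.close(); stmt.close(); conn.close()
    }
  }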
