Big Data Cloud Engineer Resume
PROFESSIONAL SUMMARY:
- 7 years of experience as a Big Data Developer and Engineer working with cutting-edge technology
- 13 years of overall IT experience as a polyglot programmer in software design and development
- Hands-on with the Hadoop/Spark stack, Data Lakes, NoSQL, Kafka, and streaming, on-premise and in the cloud
- Well versed in Amazon Web Services and Google Cloud Platform storage and compute capabilities
- In-depth knowledge of writing MapReduce and Apache Spark jobs in Scala/Java
- Lead Big Data Engineer onsite for 5 years in major divisions: IP & Science and Finance & Risk
- Experienced in server-side financial software development using tick/time-series data
- Interested in emerging technology, evaluating NoSQL options, data structures & algorithms, and performance tuning
- Developed an advanced human-performance platform for an Athlete Management System on AWS with Spark
- Proficient in Hive and Pig over large-scale, continually growing unstructured/semi-structured/structured data warehouses (OLAP)
- Design & implement data pipelines and data models/massaging, and optimize ETL workflows
- Knowledgeable in data ingestion to HDFS from RDBMS and other sources using Sqoop/Flume
- Proven track record of installing, troubleshooting & managing Cloudera and Hortonworks multi-node clusters
- Visa particulars: Canada Permanent Resident, citizenship due in Sep 2019, valid USA B1
TECHNICAL SKILLS:
Hadoop: MapReduce, HDFS, YARN, Hive, Pig, Sqoop, Flume, Oozie, Hue, HBase
Spark: Spark Core, Spark SQL, Spark Streaming, Mesos
Others: NoSQL, Cassandra, Phoenix, Zookeeper, Kafka, NiFi, Ambari, Qubole
Cloud DevOps (AWS): console, CLI, S3, EBS, Glacier, IAM, VPC, EC2, EMR, Kinesis, RDS
Cloud DevOps (GCP): console, shell, GCS, IAM, GCE, DataProc, Beam, DataFlow, BigQuery
Programming: Java, Scala, C, C++, Python (beginner), Shell Script, JavaScript, SQL, Testing
File Formats: Columnar (ORC, Parquet), JSON, XML, Avro, BSON, Protocol Buffers
Tools: Eclipse, IntelliJ, Maven, GitHub, SVN, JUnit, Cygwin, Tomcat, Jira, MS Visio
Architecture: Big Data, Distributed systems, Stock tick data, OOAD, Architectural patterns, Data Lake, Data Model, UML, Star Schema
Domain Experience: Finance, Ecommerce, Sport Science, Telecom, Content Technology, Intellectual Property & Science
Methodology: Agile, Scrum, SDLC, Waterfall
WORK EXPERIENCE:
Confidential
Big Data Cloud Engineer
Responsibilities:
- Focused on improving how the customer allocates inventory to its store locations, taking sales forecasts and regional effects into account
- Architected the Replenishment Bot from development through to production
- RDBMS data ingestion to GCP: migrated a 4 TB compressed/encrypted Oracle raw dump to persistent storage, set up an Oracle DB on GCE, and processed/exported CSV data to a GCS bucket using Data Pump
- Developed routines for a one-time migration of 36 TB of exported data from the persistent store to GCS
- Cleansed & preprocessed data using Scala & Spark on a DataProc cluster so the Data Science team could apply machine learning techniques to create various bots (see the cleansing sketch after this list)
- Imported data from Google Cloud Storage into BigQuery in the form of datasets for analytics
- Used Cloud Composer workflows & the Airflow scheduler to run the job automatically every week
- Optimized Spark execution to cut the job from 11 hours to under 1 hour & opened the door to scaling up (see the tuning sketch after this list)
- Solved an infrastructure problem to utilize the cluster's full capacity: executors, executor cores & RAM
- Troubleshot a disk-out-of-space issue: Spark was not using all available RAM and spilled intermediate data to disk, which was never freed because the job never completed; the excessive disk I/O caused slow job execution
- Understood the project's blocker issues and the Hadoop/Spark cluster infrastructure with its limitations: antivirus, blocked ports, CPU/cores/RAM & number of nodes
- Skills: Java, Scala, Spark ecosystem, GCP, GCE, GCS, GCFS, Oracle, EDW, Scripting, MS Visio
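A minimal sketch of the kind of DataProc cleansing job described above; the bucket paths and column names are hypothetical, not the actual schema:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object CleanseJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("CleanseRawSales")
          .getOrCreate()

        // Read the Data Pump CSV exports from GCS (path is hypothetical).
        val raw = spark.read
          .option("header", "true")
          .csv("gs://example-bucket/export/sales/*.csv")

        // Basic cleansing: drop fully-null rows, trim keys, normalize dates,
        // and de-duplicate on the natural key.
        val cleansed = raw
          .na.drop("all")
          .withColumn("store_id", trim(col("store_id")))
          .withColumn("sale_date", to_date(col("sale_date"), "yyyy-MM-dd"))
          .dropDuplicates("store_id", "sku", "sale_date")

        // Write Parquet for the Data Science team and the BigQuery load.
        cleansed.write.mode("overwrite")
          .parquet("gs://example-bucket/cleansed/sales/")

        spark.stop()
      }
    }

The cleansed Parquet output is the kind of dataset that would then be imported into BigQuery for analytics.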
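An illustrative executor-sizing configuration for the tuning described above; the specific numbers depend on node CPU/RAM and are assumptions here, not the production values:

    import org.apache.spark.sql.SparkSession

    object ReplenishmentJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ReplenishmentJob")
          // Explicit sizing so the job uses the cluster's full capacity
          // instead of small defaults (all values illustrative).
          .config("spark.executor.instances", "20")
          .config("spark.executor.cores", "5")
          .config("spark.executor.memory", "14g")
          // More memory for execution reduces spilling shuffle data to disk.
          .config("spark.memory.fraction", "0.7")
          // Right-sized shuffle partitions keep spill files small and I/O low.
          .config("spark.sql.shuffle.partitions", "400")
          .getOrCreate()

        // ... job logic ...
        spark.stop()
      }
    }

Giving executors enough memory and right-sizing shuffle partitions addresses the disk-spill symptom directly: intermediate data stays in RAM instead of accumulating on disk.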
Confidential
Advanced Analytics & Data Strategy Assessment
Responsibilities:
- Conducted interviews & workshops with IT leadership team to understand current state pain points
- Evaluated Azure, Amazon & Google cloud offerings for Data Lake implementation
- Led current state analysis by investigating existing functionalities & architectural stacks
- Analyzed the gap between the As-Is and To-Be states, including issues with DB synchronization, data deduplication, etc.
- Developed a target architecture & delivery model to meet the client's Advanced Analytics needs
- Skills: Big Data, AWS, GCP, Azure, Data Warehouse, PowerPoint, Diagramming Tools, MS Visio
Kinduct Technologies, Halifax
Big Data Engineer
Responsibilities:
- Developed an advanced human-performance platform for an Athlete Management System in Sports Science
- Implemented data collection and backend processing for a sports-healthcare analytical system
- Built a data pipeline using Scala & Spark that transforms raw game data into business-relevant insights
- Set up secure, reliable Big Data infrastructure for customers from the ground up on the AWS Cloud
- Set up Hortonworks HDP Hadoop environments & used Ambari to configure them
- POC Project Aggregation: successfully transitioned a proof of concept to production on AWS
- Re-designed & developed the current Dynamic Reports app to use Spark SQL for complex aggregated metrics, connecting to the Prod environment that provides data for Dynamic Reports (see the sketch after this list)
- Scala/Spark: processed raw XY game data in formats such as JSON/XML for the Data Science team
- NBA production data providers (XML/JSON): Sportradar, Second Spectrum, SportVU
- Skills: Java, Scala, Hadoop/Spark ecosystem, AWS services, JSON, XML, Parquet, ORC
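A hedged sketch of the Spark SQL aggregation pattern behind reports like the above; the S3 path, table, and column names are hypothetical, not the platform's actual schema:

    import org.apache.spark.sql.SparkSession

    object DynamicReportsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("DynamicReports").getOrCreate()

        // Raw per-event game data landed by the ingestion pipeline.
        val events = spark.read.parquet("s3://example-bucket/game-events/")
        events.createOrReplaceTempView("game_events")

        // One "dynamic report": weekly workload metrics per player.
        val playerLoad = spark.sql(
          """SELECT player_id,
            |       date_trunc('week', event_time) AS week,
            |       SUM(distance_m)  AS total_distance_m,
            |       AVG(heart_rate)  AS avg_heart_rate,
            |       MAX(speed_mps)   AS peak_speed_mps
            |FROM game_events
            |GROUP BY player_id, date_trunc('week', event_time)""".stripMargin)

        playerLoad.write.mode("overwrite")
          .parquet("s3://example-bucket/reports/player_load/")

        spark.stop()
      }
    }

A benefit of this pattern is that each report is plain SQL over a temp view, so new aggregated metrics can be added without touching the ingestion code.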
Thomson Reuters, Bangalore
Big Data Lead Engineer
Responsibilities:
- Involved in the complete life cycle of custom dataset metrics generation from Web of Science data for publishers, researchers & universities; developed & optimized Hive queries for better throughput
- Designed the DB schema in HDFS for research articles in the form of cubes, facts & dimensions
- Computed metrics such as Times Cited, Hot Paper, Highly Cited Paper, university rank & H-Index
- Installed a multi-node Cloudera CDH 5.4 Hadoop cluster, configured with Cloudera Manager
- Optimized performance of the current IMS ingestion system to HDFS/HBase from the Oracle-based MPR
- Used Sqoop extensively to import/export structured data between HDFS and RDBMS
- Developed MapReduce jobs for data cleansing, and moved Hive data to HBase for interactive & row-level updates
- Data Lake Content Integration: captured & ingested Grant & Funding data from external sources
- Data Lake real-time cross-linking of Literature (WOS) & Patents (TI) documents into the AWS Cloud to create One Platform for the various data owned by TR, leveraged to create valuable insights
- Implemented a Lambda Architecture: Spark Streaming, Kafka, HBase, Hive, patents resolution (see the streaming sketch after this list)
- Collected & aggregated large amounts of log data using Flume, staging the data in HDFS
- Expertise in developing custom UDFs, SerDes & InputFormats in Java to extend Hive/Pig execution (see the UDF sketch after this list)
- Designed & implemented the backend of the ALUM Charts application using HBase, Spark & FusionCharts to visualize article access trends over time & the correlation between accesses and the citations an article receives
- Processed usage logs, filtering out bot sessions and unproductive data
- Optimized performance against common bottlenecks: disk & network I/O, RAM, CPU/threads
- Improved grant-information coverage in WOS; created business rules to add & de-dupe grant info
- Data Modeling: Star Schema Modeling, Cubes, Fact & Dimension, Visio
- Environment: Cloudera CDH 5.4, Hadoop 2.6; cluster: 160 nodes; total data size: 3 petabytes
- Skills: Java, Scala, Hadoop/Spark ecosystem, AWS Cloud, NoSQL, Data Lake, Data Model
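A speed-layer sketch of the Lambda Architecture mentioned above, using the Spark Streaming Kafka direct stream; the broker, topic, and group id are hypothetical, and the real sink would write resolved documents to HBase:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.kafka.common.serialization.StringDeserializer

    object PatentResolutionStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("PatentResolutionStream")
        val ssc  = new StreamingContext(conf, Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "patent-resolution",
          "auto.offset.reset"  -> "latest"
        )

        // Direct stream over the incoming patent/literature documents.
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("patent-docs"), kafkaParams)
        )

        // Parse each message and hand it to the resolution layer; the real
        // implementation writes matches to HBase per partition.
        stream.map(_.value())
          .foreachRDD { rdd =>
            rdd.foreachPartition { docs =>
              docs.foreach(doc => println(s"resolved: ${doc.take(80)}"))
            }
          }

        ssc.start()
        ssc.awaitTermination()
      }
    }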
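A minimal Hive UDF sketch of the kind described above; the resume's UDFs were written in Java, but Scala is used here to keep all examples in one language, and the grant-id normalization rule is a hypothetical illustration:

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // Hive resolves evaluate() by reflection; this one normalizes a
    // grant identifier by trimming, upper-casing, and removing whitespace.
    class NormalizeGrantId extends UDF {
      def evaluate(input: Text): Text = {
        if (input == null) return null
        new Text(input.toString.trim.toUpperCase.replaceAll("\\s+", ""))
      }
    }

Once packaged into a jar, such a UDF is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION, then called like any built-in function in queries.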
- Developed server-side financial software for Velocity Analytics (Reuters Tick Capture Engine)
- Understood financial risk management, trades, quotes, corporate actions, stock splits, symbology, etc.
- Involved in multiple successful POCs, including a horizontal-scaling solution
- Productized the Elektron TimeSeries DB migration from a proprietary archive to a NoSQL store
- Managed & analyzed stock tick data, market data feeds, and Data & Tick History data
- Implemented the Firebird data migration to enhance read/write performance
- Ingested fact summaries into HBase as the data store, with query sharding, for the ElektronTS Hosted Solution
- Prototyped a Cassandra-based time-series data store, designing the data model for real-time ingestion (see the schema sketch after this list)
- Skills: C++, Cassandra, Hadoop, HDFS, HBase, TCL, SQL, Batch, Financial Trading
- Finance Skills: Reuters Tick Capture Engine, RMDS, Timeseries & Data, Tick History
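A sketch of a Cassandra time-series data model of the kind prototyped above; the keyspace, table, and column names are hypothetical. Partitioning by (ric, trade_date) keeps one instrument-day of ticks per partition, clustered by timestamp for fast range scans:

    import com.datastax.driver.core.Cluster

    object TickStorePrototype {
      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect()

        // Assumes the marketdata keyspace already exists.
        session.execute(
          """CREATE TABLE IF NOT EXISTS marketdata.ticks (
            |  ric        text,
            |  trade_date date,
            |  ts         timestamp,
            |  price      double,
            |  volume     bigint,
            |  PRIMARY KEY ((ric, trade_date), ts)
            |) WITH CLUSTERING ORDER BY (ts DESC)""".stripMargin)

        cluster.close()
      }
    }

Bounding each partition to a single trading day avoids unbounded partition growth while keeping "latest N ticks for a symbol" a single-partition read.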
Samsung R&D Institute, Bangalore
Lead Engineer
Responsibilities:
- WebKit-based mobile browser development; modules: WebCore, WebKit, rendering, painting, layout
- Developed browser features and Dolfin as a shared object (SO); debugged hardware on Trace32, including watchdog issues
- Skills: C++, ARM/gcc/MinGW compilers, Wireshark, Firebug, Web Developer, DOM, CSS, JS, HTML
Wipro Technologies, Bangalore
Project Engineer
Responsibilities:
- Firmware feature development in the core modules: coding, debugging, review and unit testing for MFP printers already released in the market
- Skills: C++, Java, Linux, make, gdb, ClearCase
Freelance Big Data Trainer
- Taught the entire Big Data Hadoop stack to 500+ professionals worldwide via webinars & in-class sessions
- Subject Matter Expert for Jigsaw Academy; developed Big Data Analytics online course content
- Speaker on Hadoop, Hive, HBase & Big Data at the Thomson Reuters Tech Connect conference
