- Extended Hive and Pig core functionality with custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF), and User Defined Aggregating Functions (UDAF).
- Good knowledge of the Spark framework for both batch and real-time data processing.
- Hands-on experience processing data using Spark Streaming API.
- Skilled in AWS, Redshift, Cassandra, DynamoDB and various cloud tools.
- Use of cloud platforms AWS, Microsoft Azure, and Google Cloud Platform.
- Have worked with over 100 terabytes of data in a data warehouse and over 1 petabyte of data in a Hadoop cluster.
- Have handled over 70 billion messages a day funneled through Kafka topics.
- Responsible for moving and transforming massive datasets into valuable and insightful information.
- Capable of building data tools to optimize utilization of data and configure end-to-end systems.
- Spark SQL to perform transformations and actions on data residing in Hive.
- Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
- Responsible for building quality data-transfer pipelines for data transformation using Flume, Spark, Spark Streaming, and Hadoop.
- Able to architect and build new data models that provide intuitive analytics to customers.
- Able to design and develop new systems and tools that enable clients to optimize and track data using Spark.
- Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on-premises nodes.
- Expert in the big data ecosystem using Hadoop, Spark, and Kafka with column-oriented big data systems such as Cassandra and HBase.
- Worked with various file formats (delimited text files, click stream log files, Apache log files, Avro files, JSON files, XML Files).
- Uses Flume, Kafka, NiFi, and HiveQL scripts to extract, transform, and load data into databases.
- Able to perform cluster and system performance tuning.
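The Spark Streaming work noted above rests on one core idea: a continuous stream is divided into small batches before being handed to the batch engine. A minimal, Spark-free sketch of that micro-batching pattern (all names are illustrative, not part of any Spark API):

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an unbounded stream of records into fixed-size batches,
    mimicking how Spark Streaming turns a stream into per-interval batches."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Example: six incoming messages split into batches of four.
incoming = ["m1", "m2", "m3", "m4", "m5", "m6"]
batches = list(micro_batches(incoming, batch_size=4))
```

In real Spark Streaming the "batch size" is a time interval rather than a record count, but the downstream batch engine sees the same shape of input either way.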
OPERATING SYSTEMS: Windows XP, 7, 8, 10, 2000, 2016, UNIX, Linux
SOFTWARE: Adobe Acrobat, Lotus Notes, MS Office
PROTOCOLS: HTTP, TCP/IP, FTP, POP3, SMTP
MICROSOFT WEB SERVER TECH: IIS
TESTING: Test Director
APPLICATIONS: Oracle CRM
DESKTOP TOOLS: Rumba
CLOUD SERVICES: Amazon AWS (EC2, SQS, S3, DynamoDB), Azure, Google Cloud, Horton Labs, Rackspace, Adobe, Anaconda Cloud, Elastic
DATA SCRIPTING: PIG/Pig Latin, HiveQL, Python
DISTRIBUTIONS: Cloudera CDH, Hortonworks HDP, MapR
BIG DATA HADOOP TECHNOLOGIES: MapReduce
COMPUTE ENGINES: Apache Spark, Spark Streaming, Storm
SHELL SCRIPTING: UNIX Shell
DATA PIPELINE: Apache Airflow, Apache Camel, Apache Flink/Stratosphere
DATABASE SCRIPTING: SQL, PL/SQL
HADOOP: HDFS, MapReduce
PROGRAMMING LANGUAGES: Scala, Python, ASP.NET, C, C++
ARCHITECTURE: Design and develop cloud-based data solutions POC, Architectural Planning, Hadoop Cycle, Virtualization, HAAS environments
IDE: Eclipse, Oracle JDeveloper, Visual Studio, SQL Navigator, IntelliJ
TOOLS: Sqoop, Elasticsearch, Lambda Functions, Toad
STORAGE: S3, Talon, DAS, NAS, SAN
FILE FORMATS: Apache Parquet & Avro, JSON, ORC
NoSQL DATABASE: Apache Cassandra, DataStax Cassandra, Apache HBase, MariaDB, MongoDB
FILE COMPRESSION: Snappy, Gzip
SQL/RDBMS DATABASE: SQL, SQL Server, MySQL, PostgreSQL, PL/SQL, Oracle, MS Access
Hadoop Big Data Architect
Confidential, Alpharetta, GA
- Met with stakeholders, client engineers and project manager to gather requirements and determine needs accurately.
- Evaluated existing infrastructure, systems, and technologies; provided gap analysis; and documented requirements, evaluations, and recommendations for systems, upgrades, and technologies.
- Created proposed architecture and specifications along with recommendations.
- Involved in migrating MapReduce jobs to Spark, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Implemented Spark using Scala and utilized DataFrames and Spark SQL API for faster processing of data.
- Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Used the Spark DataFrame API over the Cloudera platform to perform analytics on Hive data.
- Used Spark Streaming to receive real-time data from Kafka.
- Provided architecture and system design schematics, along with comprehensive implementation plan and dependencies.
- Responsible for presenting recommendations to stakeholders, walking implementation teams through the plan, and ensuring that all cross-functional teams understand the proposed system, its function and rationale, and are on board with the plan.
- Worked together with executive management to determine key performance indicators and implement long-term architectural/technical strategy.
- Handled matrix management of resources for both external and internal website teams, along with the Dashboards and Portal teams.
- Defined roadmaps, scope, and execution timelines for various projects.
- Captured the logs from the relevant server using log4j configuration.
- Analyzed log data, filtered required columns through Logstash configuration, and sent the results to Elasticsearch.
- Designed the Elasticsearch configuration files based on the number of hosts available, naming the cluster and nodes accordingly.
- Designed batch processing jobs using Apache Spark to increase processing speed compared to MapReduce jobs.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation.
- Worked on cluster maintenance and data migration from one server to other.
- Used the Elasticsearch Curator API for data backup and restore.
- Merged data into shards to avoid data crashes and support load balancing.
- Used Kibana to illustrate the data with various dashboard displays such as metrics, graphs, pie charts, and aggregation tables.
- Migrated streaming and static data from RDBMS into the Hadoop cluster using Cloudera clusters, and fine-tuned them to run Spark jobs efficiently.
- Explored Apache Flink and delivered its use cases to the business partners for review.
- Deployed Hadoop components on the Cluster like Hive, HBase, Spark and others with respect to the requirement.
- Targeted to process 60,000 RPS (records per second) on the currently implemented engine; to date, the rating engine has attained 45,000 RPS.
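The HiveQL-to-Spark conversion work in this role amounts to re-expressing SQL aggregations as map/reduce-style transformations. A plain-Python sketch of the equivalent of `SELECT key, SUM(value) ... GROUP BY key`, with no Spark dependency (function and variable names are illustrative):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def reduce_by_key(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Equivalent of SELECT key, SUM(value) GROUP BY key, expressed
    as the reduceByKey pattern used when porting HiveQL to Spark RDDs."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Example: summing values per key, as a GROUP BY would.
rows = [("GA", 10), ("TX", 5), ("GA", 7)]
totals = reduce_by_key(rows)
```

In actual Spark code this would be `rdd.reduceByKey(lambda a, b: a + b)`; the dictionary accumulation above is the single-machine analogue of that shuffle-and-combine step.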
Hadoop Data Engineer
Confidential, Atlanta, GA
- Created a custom Hadoop analytics solution integrating a data analytics platform to pull data from decentralized Confidential systems for various analytical uses, from marketing to actual usage data.
- Worked with stakeholders and sales and marketing management to gather requirements and determine needs.
- Evaluated proprietary software and current systems to determine gaps.
- Documented findings including current environment and technologies, in addition to anticipated use cases.
- Created architecture schematics and implementation plan.
- Led implementation and participated in hands-on data system engineering.
- Created Cloudera Hadoop system on AWS consisting of multiple nodes with defined systems by use case.
- Used Spark as an ETL tool to remove duplicates from the input data, apply certain joins, and aggregate the data, which in turn is provided as input to the TwitteR package to calculate the time series for anomaly detection.
- Developed a Pipeline that runs once a day which does a copy job.
- Developed a JDBC connection to get the data from SQL and feed it to a Spark Job.
- Worked closely with HDInsight production team for the optimization of the performance of Spark Jobs on the cluster.
- Involved in reverse engineering to obtain the Business Rules from the current Commerce Platform.
- Used SQL Server Management Studio and SQL Server 2014 to develop the business rules.
- Implemented the business rules in Spark/Scala to put the business logic in place to run the Rating Engine.
- Used Ambari UI to observe the running of a submitted Spark Job at the node level.
- Used Pentaho to showcase Hive tables interactively with pie charts and graphs.
- Used Spark to parse the data and extract the required fields.
- Created external Hive tables on the Blobs to showcase the data to the Hive MetaStore.
- Used both Hive context as well as SQL context of Spark to do the initial testing of the Spark job.
- Worked with PuTTY and Jupyter Notebook to run Spark SQL commands.
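The Spark ETL steps described in this role (deduplicate, join, aggregate) can be sketched in plain Python to show the logic involved; the record shapes and lookup table below are hypothetical stand-ins for the actual datasets:

```python
from typing import Dict, List, Tuple

def etl(records: List[Tuple[str, int]],
        lookup: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Remove duplicate records, then inner-join each surviving record
    against a lookup table (records whose key has no match are dropped)."""
    deduped = list(dict.fromkeys(records))  # preserves first occurrence
    joined = [(key, lookup[key], value)
              for (key, value) in deduped if key in lookup]
    return joined

# Example: duplicate ("u1", 3) is collapsed; "u9" has no join match.
records = [("u1", 3), ("u1", 3), ("u2", 5), ("u9", 1)]
lookup = {"u1": "Alice", "u2": "Bob"}
result = etl(records, lookup)
```

In Spark the same pipeline would be `df.dropDuplicates().join(lookup_df, "key")`; the list/dict version here is the single-machine analogue.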
Confidential, Atlanta, GA
- Implemented a data analytics system for collection and analysis of global marketing data to be used by sales. By utilizing big data analytics, the company was able to reduce wasted sales time and effort, gain better marketing intelligence, and create a more efficient sales process.
- Architected Hadoop system pulling data from Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.
- Integrated web scrapers and pulled data from web and social media.
- Performed aggregations and queries, writing data back to the OLTP system directly or through Sqoop.
- Loaded large RDBMS datasets into big data storage using Sqoop.
- Used Pig as ETL tool to do transformations, joins and some pre-aggregations before storing the data into HDFS.
- Transformed data from legacy tables to HDFS, and HBase tables using Sqoop.
- Analyzed the data by running Hive (HiveQL) queries, Impala queries, and Pig Latin scripts.
- Involved in writing Pig scripts for cleansing the data and implemented Hive tables for the processed data in tabular format.
- Parsed data from various sources and stored the parsed data into HBase and Hive using HBase-Hive integration.
- Used HBase to store the majority of data, which needed to be divided by region.
- Involved in benchmarking Hadoop and Spark cluster on a TeraSort application in AWS.
- Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
- Used Spark codes to run a sorting application on the data stored on AWS.
- Deployed the application jar files into AWS instances.
- Used the image files of an instance to create instances containing Hadoop installed and running.
- Developed a task execution framework on EC2 instances using SQS and DynamoDB.
- Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.
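The task execution framework mentioned above polled SQS for work items and tracked task state in DynamoDB. A minimal sketch of that polling-worker pattern, using a stdlib queue as a stand-in for SQS and a dict as a stand-in for the DynamoDB status table (all names here are illustrative):

```python
import queue
from typing import Dict

def run_worker(tasks: "queue.Queue[str]", status_table: Dict[str, str]) -> None:
    """Drain a task queue, recording per-task state transitions,
    mirroring the SQS-polling / DynamoDB-status pattern."""
    while True:
        try:
            task_id = tasks.get_nowait()   # stand-in for SQS ReceiveMessage
        except queue.Empty:
            break                          # no more work; worker exits
        status_table[task_id] = "RUNNING"  # stand-in for DynamoDB PutItem
        # ... execute the task body here ...
        status_table[task_id] = "DONE"
        tasks.task_done()                  # stand-in for SQS DeleteMessage

# Example: two queued tasks processed to completion.
work: "queue.Queue[str]" = queue.Queue()
for t in ("task-1", "task-2"):
    work.put(t)
status: Dict[str, str] = {}
run_worker(work, status)
```

The real version gains durability because SQS redelivers messages that are received but never deleted, so a crashed worker's task is retried by another instance.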
Confidential, Duluth, GA
- Evaluated needs and engineered a data analytics platform for marketing for this software company headquartered in Great Britain.
- Worked with cross-functional teams and project management and collaborated with British Architect to help architect and engineer additions and improvement to facilitate a marketing initiative using Hadoop analytics platform.
- Used Hibernate ORM framework with Spring framework for data persistence and transaction management.
- Used the lightweight container of the Spring Framework to provide architectural flexibility through Inversion of Control (IoC).
- Involved in designing web interfaces using HTML/ JSP as per user requirements. Improved the look and feel of these screens.
- Involved in creating Hive tables, loading with data and writing Hive Queries, which will internally run a Map Reduce job.
- Implemented Partitioning, Dynamic Partitions and Buckets in Hive for optimized data retrieval.
- Connected various data centers and transferred data between them using Sqoop and various ETL tools.
- Extracted the data from RDBMS (Oracle, MySQL) to HDFS using Sqoop.
- Used the Hive JDBC to verify the data stored in the Hadoop cluster.
- Worked with the client to reduce churn rate, read and translate data from social media websites.
- Generated and published reports on various predictive analyses of user comments. Created reports and documented their retrieval times using ETL tools like QlikView and Pentaho.
- Performed sentiment analysis using text mining algorithms to find the sentiment, emotions, and opinions about the company/product in the social circle.
- Implemented logistic regression in MapReduce to find the customer's claim probability and k-means clustering in Mahout to group customers with similar behavior.
- Worked with Phoenix, a SQL layer on top of HBase, to provide a SQL interface over the NoSQL database.
- Extensively used Impala to read, write, and query Hadoop data in HDFS.
- Developed workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive.
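The customer-grouping work above used k-means clustering (via Mahout) to group customers with similar behavior. A tiny one-dimensional sketch of the k-means loop itself, in plain Python (the data and initial centers are made-up examples):

```python
from typing import List, Tuple

def kmeans_1d(points: List[float], centers: List[float],
              iters: int = 10) -> Tuple[List[float], List[int]]:
    """Tiny 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    labels: List[int] = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
                  for p in points]
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

# Example: two obvious behavior groups (low spenders vs. high spenders).
pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, labels = kmeans_1d(pts, centers=[0.0, 10.0])
```

Mahout runs the same assign/update iteration as distributed MapReduce jobs, with the assignment step as the map phase and the center recomputation as the reduce phase.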
Confidential, Lawrenceville, GA
- Migrated the client facing database from Postgres to MongoDB leading to a 90% decrease in query response times.
- Optimized the data pipeline by using in-memory datastores for faster object dereferencing, which led to a 60% reduction in job duration.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Involved in Setup and benchmark of Hadoop /HBase clusters for internal use.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Pig scripts.
- Wrote SQL queries to perform Data Validation and Data Integrity testing.
- Developed UNIX shell scripts to run the batch jobs.
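The data-validation and data-integrity SQL work above can be illustrated with a self-contained example using Python's built-in sqlite3 module; the `source`/`target` schema below is a hypothetical stand-in for the actual tables:

```python
import sqlite3

# Illustrative integrity checks after a data load: reconcile row counts
# and find source rows that never arrived in the target table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE source(id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE target(id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO source VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO target VALUES (1, 10.0), (2, 20.0);
""")

# Row-count reconciliation: a nonzero difference flags a failed load.
row_count_diff = cur.execute(
    "SELECT (SELECT COUNT(*) FROM source) - (SELECT COUNT(*) FROM target)"
).fetchone()[0]

# Orphan check: ids present in source but missing from target.
missing_ids = [row[0] for row in cur.execute(
    "SELECT id FROM source WHERE id NOT IN (SELECT id FROM target)"
)]
```

Checks like these typically run inside the batch shell scripts after each load, failing the job when any reconciliation query returns a nonzero result.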
Confidential, Atlanta, GA
- Responsible for maintaining, supporting, and troubleshooting network and peripheral systems.
- System Administrator for three Windows 2003 Servers
- Network Administrator for company LAN
- PC and Printer Support and Repair
- Manage hosting and administration of website
- Installation configuration and troubleshooting IP DVRs and surveillance cameras
- Software installation, configuration and troubleshooting
- Administrator of security system including issuing badges
- Liaison between staff at all levels of a client organization
- Provided end user Production Support
- Trained customers on how to use the Knowlagent Applications
- Resolved Open Trouble tickets using salesforce.com
- Utilized MS SQL to troubleshoot and resolve customer reporting issues
- Managed weekly client meetings
- Managed test environment for reproducing customer issues
Confidential, Atlanta, GA
- Implemented new promotions and discounts for services and hardware in database.
- Implemented design using Oracle Apps "Service for Comms" manually as well as with PL/SQL.
- Trained, managed, and supported 2 offshore resources.
- Developed validation PL/SQL scripts to verify SETUPS.
- Developed PL/SQL scripts that inputted most of the data for release.
- Developed and implemented component testing scripts.
- Key person for SETUPS on production release nights.
- Maintained a matrix of which setups were done on each of the twelve environments.
- Analyzed defects in Test Director with PL/SQL using SQL*Plus, Toad, and SQL Navigator 4.
- Identified the root cause of each defect.
- Created and produced documents with one or more suggestions on how to fix each defect.