- 6+ years of experience with Big Data/Hadoop ecosystem technologies including Hadoop, AWS EMR, EC2, Hive, HBase, Oozie, ZooKeeper, Sqoop, Storm, Flume, Impala, Tez, Kafka, and Spark, running MapReduce and Spark/Scala jobs on YARN and Mesos.
- Migrated a legacy infrastructure by recreating the entire environment within Amazon Web Services (AWS).
- Transferred Oracle SQL Developer data to AWS storage and built the cluster from scratch.
- Experienced in data analysis, design, development, and testing of ETL methodologies across all phases of data warehousing, with visualization in Tableau.
- Experienced in optimizing and performance-tuning mappings and implementing complex business rules by creating reusable transformations.
- Queried Oracle SQL Developer, Hive SQL, and Spark SQL (via spark-submit) for data validation, and developed validation worksheets in Excel to validate Tableau dashboards.
- Used various versions of Hive, Spark, and Presto on multiple projects; beyond regular queries, also implemented UDFs.
- Worked on a project migrating Hive tables and their underlying data from Cloudera CDH to Hortonworks HDP.
- Developed Spark code using Scala, PySpark, and Spark SQL for faster testing, transformation, and data processing.
- Extensively used SQL and PL/SQL to develop procedures, functions, packages, and triggers.
- Experienced in integrating Kafka with Spark Streaming for high-speed data processing.
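The Kafka-plus-Spark-Streaming pairing follows a micro-batch pattern. The sketch below illustrates that pattern in plain Python only (a generator stands in for a Kafka topic and a list comprehension for a Spark job); it is not production code and assumes no live cluster.

```python
from itertools import islice

# Stand-in for a Kafka topic: an iterator of messages (illustrative data).
messages = iter(f"event-{i}" for i in range(7))

def micro_batches(stream, batch_size):
    """Yield fixed-size batches from a stream, mirroring Spark
    Streaming's micro-batch model (the final batch may be short)."""
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch

# "Process" each batch -- here we just record batch sizes.
processed = [len(batch) for batch in micro_batches(messages, 3)]
print(processed)  # [3, 3, 1]
```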
- Experience with the Hadoop Distributed File System (HDFS), the Hadoop framework, and parallel-processing implementations (AWS EMR, Cloudera) using MapReduce, Hive, HBase, YARN, Sqoop, Spark, Java, RDBMSs, and Linux/Unix shell scripting.
- Experience implementing cloud solutions using AWS EC2 and S3 as well as Azure storage.
- Experienced in developing business reports by writing complex SQL queries using views, macros, and volatile and global temporary tables.
- Worked with the AWS team to test our Apache Spark ETL application on EMR/EC2 using S3.
- Experience in designing both time-driven and data-driven automated workflows using Oozie.
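A time-driven Oozie coordinator with a data-driven input dependency can be sketched as below; all paths, dates, and frequencies are illustrative placeholders rather than values from any actual project.

```xml
<coordinator-app name="daily-etl" frequency="${coord:days(1)}"
                 start="2018-01-01T00:00Z" end="2019-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <!-- Data-driven part: wait until the day's input directory exists -->
  <datasets>
    <dataset name="raw" frequency="${coord:days(1)}"
             initial-instance="2018-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/raw/${YEAR}/${MONTH}/${DAY}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="raw">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <!-- Time-driven part: the frequency above triggers this workflow daily -->
  <action>
    <workflow>
      <app-path>hdfs:///apps/etl/workflow.xml</app-path>
    </workflow>
  </action>
</coordinator-app>
```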
- Experienced with workflow schedulers and data architecture, including data-ingestion pipeline design and data modeling.
- Configured Elasticsearch on Amazon Web Services with static-IP authentication security features.
- Managed AWS EC2 instances using Auto Scaling, Elastic Load Balancing, and Glacier for QA, DevOps, and UAT environments as well as infrastructure servers.
- Experienced in handling different file formats such as text files, Avro data files, SequenceFiles, XML, and JSON.
- Extensively worked with Spark Core, numeric RDDs, DataFrames, and caching when developing Spark applications.
- Experience and expertise in ETL, data analysis, and designing data-warehouse strategies.
- Experience with query languages such as SQL and HQL, NoSQL stores, and programming languages such as Scala, Java, and Python.
- Experience importing and exporting data between relational databases and HDFS using Sqoop.
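A typical Sqoop import/export pair looks like the following sketch; the connection string, credentials, table names, and HDFS paths are placeholders, not real project values.

```shell
# Import a relational table into HDFS (illustrative placeholders throughout).
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

# Export processed results back to the relational database.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table order_summary \
  --export-dir /data/out/order_summary
```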
- Experience with the Oozie workflow engine, running workflow jobs with actions for Hadoop MapReduce, Hive, and Spark.
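An Oozie workflow chaining a Hive action into a Spark action might be sketched as below; the application names, script, class, and jar paths are illustrative assumptions.

```xml
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-step"/>
  <!-- Run a Hive script, then hand off to a Spark job on success -->
  <action name="hive-step">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.hql</script>
    </hive>
    <ok to="spark-step"/>
    <error to="fail"/>
  </action>
  <action name="spark-step">
    <spark xmlns="uri:oozie:spark-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <master>yarn</master>
      <name>daily-aggregate</name>
      <class>com.example.DailyAggregate</class>
      <jar>${nameNode}/apps/etl/etl.jar</jar>
    </spark>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>ETL workflow failed</message></kill>
  <end name="end"/>
</workflow-app>
```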
Specialties: AWS EMR, EC2, Auto Scaling, S3, Glacier, Kinesis, KMS; data warehousing/ETL/BI concepts; data visualization; data architecture; software development methodologies; data modeling; cloud computing
Business Tools: Tableau 8.x/9.x, Power BI, Business Objects XI R2, Informatica PowerCenter 8.x, OLAP/OLTP, Talend, Teradata 13.x, Teradata SQL Assistant
Big Data: AWS EMR, Hadoop, MapReduce 1.0/2.0, Pig, Hive, Presto, HBase, Sqoop, Oozie, ZooKeeper, Kafka, Spark, Flume, Mahout, Hue, Tez, HCatalog, Storm, Cassandra
Databases and Platforms: DB2, MySQL, MS SQL Server, Vertica, MongoDB, Oracle SQL Developer, SQL 2008, Hortonworks, Cloudera, AWS EMR Hadoop framework
Operating Systems: Mac OS, Unix, Linux (various versions), Windows 2003/7/8/8.1/XP
Application Servers: Apache Tomcat, WebLogic, WebSphere; Tools: Eclipse, NetBeans
Confidential, Durham, NC
Senior Big Data/Hadoop Developer
- Experience in designing build processes, software product development, process automation, build and deployment automation, release management, source-code repositories, environment management, cloud computing, and software configuration management (SCM).
- Responsible for ingesting existing databases into the AWS environment using the Hadoop framework.
- Worked extensively on creating Hive and Spark metastores in AWS EMR.
- Worked on migrating servers such as Jira, Bitbucket, Confluence, and Jenkins master and worker nodes (Windows and Linux) from the data center to AWS.
- Used Concourse, whose simple mechanics of resources, tasks, and jobs provide a general approach to automation well suited for CI/CD.
- Worked with Ansible to automate repetitive tasks, deploy critical applications quickly, and proactively manage changes.
- Expertise in the Amazon AWS cloud, which comprises services such as EC2, S3, VPC, ELB, RDS, IAM, CloudFront, CloudWatch, and Security Groups.
- Proficient in the AWS Cloud platform and its features, including EC2, VPC, EBS, AMI, SNS, RDS, CloudWatch, CloudTrail, CloudFormation, AWS Config, Auto Scaling, CloudFront, IAM, S3, and Route 53.
- Implemented Amazon EC2 by setting up instances, virtual private clouds (VPCs), and security groups; set up databases in AWS using RDS and storage using S3 buckets, and configured instance backups to S3.
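The EC2/S3/RDS provisioning described above could be driven from the AWS CLI roughly as follows; every ID, name, and credential here is an illustrative placeholder.

```shell
# Launch an EC2 instance into an existing security group
# (AMI and group IDs are placeholders).
aws ec2 run-instances --image-id ami-0123456789abcdef0 \
  --instance-type t2.micro --security-group-ids sg-0123456789abcdef0

# Create an S3 bucket for storage and backups (bucket name is a placeholder).
aws s3 mb s3://example-backup-bucket

# Provision a MySQL database via RDS (identifier and password are placeholders).
aws rds create-db-instance --db-instance-identifier example-db \
  --db-instance-class db.t2.micro --engine mysql \
  --master-username admin --master-user-password '<password>' \
  --allocated-storage 20
```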
- Built and configured a virtual data center in the Amazon Web Services cloud to support Enterprise Data Warehouse hosting, including a Virtual Private Cloud (VPC), public and private subnets, security groups, route tables, and an Elastic Load Balancer.
- Experience in version control using Bitbucket, Git, and GitHub, and continuous-integration management using Jenkins.
- Ability to work under stringent deadlines with team as well as independently.
- Strong technical skills with Unix/Linux systems.
Environment: AWS EMR, Hive, Spark, Presto, Tez, Oozie, Hue, Tableau, Informatica, KMS, EC2, Kafka, AWS S3, Apache Hadoop, Pig, shell scripting, ETL, Agile methodology.
Confidential, New York, NY
Big Data/Hadoop Developer
- Involved in testing and migration to Presto.
- Worked extensively with data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
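The Hive-to-Spark conversion above boils down to rewriting SQL clauses as filter/map/group transformations. The sketch below shows the mapping using plain Python lists as stand-ins for RDDs (the data and column layout are invented for illustration).

```python
from collections import defaultdict

# Rows as (name, dept, salary) tuples -- a stand-in for an RDD of records.
rows = [
    ("alice", "eng", 100), ("bob", "eng", 90),
    ("carol", "ops", 80), ("dave", "ops", 70),
]

# Hive: SELECT dept, COUNT(*), MAX(salary) FROM emp
#       WHERE salary >= 80 GROUP BY dept
filtered = [r for r in rows if r[2] >= 80]                 # rdd.filter(...)
keyed = [(dept, salary) for _, dept, salary in filtered]   # rdd.map(...)

grouped = defaultdict(list)                                # rdd.groupByKey()
for dept, salary in keyed:
    grouped[dept].append(salary)

result = {dept: (len(vals), max(vals)) for dept, vals in grouped.items()}
print(result)  # {'eng': (2, 100), 'ops': (1, 80)}
```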
- Experienced with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, Sqoop, Spark, YARN, and Oozie.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily ingestion.
- Experienced in writing complex SQL queries, stored procedures, triggers, views, cursors, joins, constraints, DDL, DML, and user-defined functions to implement business logic.
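As a minimal, self-contained illustration of trigger-based business logic, the sketch below uses SQLite (a stand-in for the production RDBMS, and all table names are invented) to audit salary updates automatically.

```python
import sqlite3

# In-memory database with a trigger that records every salary change.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER PRIMARY KEY, salary REAL);
    CREATE TABLE emp_audit (emp_id INTEGER, old_salary REAL, new_salary REAL);
    CREATE TRIGGER trg_emp_salary AFTER UPDATE OF salary ON emp
    BEGIN
        INSERT INTO emp_audit VALUES (OLD.id, OLD.salary, NEW.salary);
    END;
""")
conn.execute("INSERT INTO emp VALUES (1, 50000)")
conn.execute("UPDATE emp SET salary = 55000 WHERE id = 1")

# The trigger fired on UPDATE and wrote the audit row.
audit = conn.execute("SELECT * FROM emp_audit").fetchall()
print(audit)  # [(1, 50000.0, 55000.0)]
```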
- Developed a custom ETL solution with batch-processing and real-time data-ingestion pipelines to move data in and out of Hadoop using Python and shell scripts.
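A batch ETL step of the kind described above follows an extract-transform-load shape; this is a minimal sketch with invented field names, filtering bad records and deriving a field before emitting JSON lines.

```python
import csv
import io
import json

# Illustrative raw extract (field layout is an assumption).
raw = "id,amount\n1,10.5\n2,-3.0\n3,7.25\n"

def etl(csv_text):
    """Extract CSV rows, drop invalid records, derive a tier, emit JSON lines."""
    out = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        amount = float(row["amount"])
        if amount <= 0:        # cleansing step: drop non-positive amounts
            continue
        row["amount"] = amount
        row["tier"] = "high" if amount > 10 else "low"  # derived field
        out.append(json.dumps(row))
    return out

lines = etl(raw)
print(lines)
```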
- Developed AWS Data Pipeline and SNS workflows to automate the dunning process in the cloud.
- Experience in large-scale data processing and transformation using Hadoop, Hive, and Sqoop.
- Built real-time predictive-analytics capabilities using Spark Streaming, Spark SQL, and Oracle Data Mining tools.
- Worked with the AWS team to test our Apache Spark ETL application on EMR/EC2 using S3.
- Assisted in data analysis, star-schema data modeling, and design specific to data warehousing and business intelligence environments.
- Expertise in platform-related Hadoop production-support tasks, analyzing job logs.
- Monitored system health and logs and responded to any warning or failure conditions.
Environment: Amazon Web Services, Vertica, Informatica PowerCenter, Spark, Kafka, AWS S3, AWS EMR, Apache Hadoop, Hive, Pig, shell scripting, ETL, Tableau, Agile methodology.
Confidential, Atlanta, GA
Big Data/Hadoop Developer
- Worked with variables and parameter files; designed an ETL framework that generates parameter files to make workflows dynamic.
- Worked on the Teradata to HP Vertica data migration project: used the COPY command extensively to load data from files into Vertica, monitored ETL jobs, and validated the data loaded into the Vertica data warehouse.
- Built a full-service catalog system with a complete workflow using Elasticsearch, Logstash, Kibana, Kinesis, and CloudWatch.
- Responsible for data extraction and ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
- Experienced in transferring data from different data sources into HDFS using Kafka producers, consumers, and brokers.
- Preprocessed logs and semi-structured content stored on HDFS using Pig, and imported the processed data into the Hive warehouse, enabling business analysts to write Hive queries.
- Worked on data migration from Hadoop clusters to the cloud; good knowledge of cloud components such as AWS S3, EMR, ElastiCache, and EC2.
- Responsible for writing Hive and Pig scripts as ETL tools to perform transformations, event joins, filtering, and some pre-aggregations before storing data in HDFS. Developed Vertica UDFs to preprocess the data for analysis.
- Designed a reporting application that uses Spark SQL to fetch data and generate reports on HBase.
- Built a custom batch aggregation framework for creating reporting aggregates in Hadoop.
- Experience working with the Hive data-warehouse tool: creating tables, distributing data via partitioning and bucketing, and writing and optimizing Hive queries. Built a real-time pipeline for streaming data using Kafka and Spark Streaming.
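Partitioning and bucketing in Hive look roughly like the DDL below; the table, columns, and bucket count are invented for illustration. Partitioning lets queries scan only matching partitions, while bucketing supports efficient bucketed joins and sampling.

```sql
-- Illustrative Hive DDL: a table partitioned by date and bucketed by user_id.
CREATE TABLE events (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Partition pruning: only the 2019-01-01 partition is scanned.
SELECT action, COUNT(*)
FROM events
WHERE event_date = '2019-01-01'
GROUP BY action;
```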
- Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on formats such as text and CSV files.
- Wrote Python scripts to access databases and execute scripts and commands.
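A script of that shape typically queries a database and then shells out to a command; here is a minimal sketch using SQLite and `echo` as stand-ins for the real database and commands (the table and job names are invented).

```python
import sqlite3
import subprocess

# Stand-in database with a table of job names (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (name TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?)",
                 [("load_hive",), ("export_mysql",)])

# Query the database, then pass the results to an external command.
names = [r[0] for r in conn.execute("SELECT name FROM jobs ORDER BY name")]
proc = subprocess.run(["echo"] + names, capture_output=True, text=True)
print(proc.stdout.strip())  # export_mysql load_hive
```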
- Responsible for defining the data flow within the Hadoop ecosystem, directing the team in implementing it, and exporting result sets from Hive to MySQL using shell scripts.
Environment: Hadoop, Hive, Apache Spark, Apache Kafka, Hortonworks, AWS, Elasticsearch, Lambda, Apache Cassandra, HBase, SQL, Sqoop, Oozie, Java (JDK 1.6).
Confidential, Hopkinton, MA
Junior Big Data/Hadoop Developer
- Experienced in Kafka streaming, using StreamSets to continuously ingest data from Oracle systems into the Hive warehouse.
- Developed a generic utility in Spark for pulling data from RDBMS systems using multiple parallel connections.
- Integrated existing HiveQL code logic into the Spark application for data processing.
- Extensively used Hive/Spark optimization techniques such as partitioning, bucketing, map joins, parallel execution, broadcast joins, and repartitioning.
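A broadcast (map-side) join ships the small table to every worker so the large side never shuffles. The sketch below shows the idea in plain Python only, with a dict standing in for the broadcast variable and a list for one partition of fact rows (all data invented).

```python
# Small dimension table, "broadcast" as an in-memory dict.
dim_dept = {"d1": "engineering", "d2": "operations"}

# One partition of the large fact table: (user, dept_id) rows.
facts = [("alice", "d1"), ("bob", "d2"), ("carol", "d1"), ("dave", "d9")]

# Map-side join: each fact row joins locally against the broadcast dict,
# with no shuffle of the fact rows; unmatched keys drop out (inner join).
joined = [(user, dim_dept[d]) for user, d in facts if d in dim_dept]
print(joined)
```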
- Involved in review of functional and non-functional requirements.
- Responsible for designing and implementing the data pipeline using Big Data tools including Hive, Spark, Scala, and StreamSets.
- Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
- Involved in importing data from Microsoft SQL Server, MySQL, and Teradata into HDFS using Sqoop.
- Good knowledge of using Apache NiFi to automate data movement.
- Developed Sqoop scripts to import data from relational sources and handled incremental loading.
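Incremental loading in Sqoop typically uses a check column and a last-imported value, as in this sketch; the connection string, table, path, and last-value are placeholders, not actual project values.

```shell
# Illustrative incremental import: only rows with id greater than the
# last recorded value are pulled on each run.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column id \
  --last-value 10000
```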
- Extensively used StreamSets Data Collector to create ETL pipelines that pull data from RDBMS systems into HDFS.
- Implemented the data-processing framework using Scala and Spark SQL.
- Worked on performance-optimization methods to improve data-processing times.
- Experienced in creating shell scripts to automate jobs.
- Extensively worked with DataFrames and Datasets using Spark and Spark SQL.
Environment: Spark, Python, Scala, Hive, Hue, UNIX scripting, Spark SQL, StreamSets, Kafka, Impala, Beeline, Git, Tidal.