- Outstanding knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, ApplicationMaster, ResourceManager, NodeManager and the MapReduce programming paradigm.
- 7+ years of IT industry experience, including 4 years in Big Data technologies.
- Hands-on experience with major components of the Hadoop ecosystem such as Hadoop MapReduce, HDFS, Hive, Pig, HBase, Sqoop, Oozie and Flume.
- Ingested streaming data through Apache NiFi into Kafka.
- Experience with the CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
- Well experienced in Cloudera and Hortonworks Hadoop distributions.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Experienced in loading data into partitioned and bucketed Hive tables.
- Implemented Spark scripts using Scala and Spark SQL to read Hive tables into Spark for faster data processing.
- Experienced in performance tuning of Spark applications, including setting the right batch interval, the correct level of parallelism and memory tuning.
- Developed Pig Latin scripts for transformations and used Hive Query Language for data analytics.
- Experienced in importing data from databases such as MySQL, Oracle and Teradata into HDFS and exporting it back using Sqoop.
- Hands-on experience with batch processing of data sources using Apache Spark.
- Implemented Spark RDD transformations and actions to carry out business analysis.
- Used Flume to collect, aggregate and store web log data in HDFS.
- Used ZooKeeper for various types of centralized configurations.
- Experienced in loading large volumes of data from the local file system and HDFS into Hive and writing complex queries to load data into internal tables.
- Experience in loading and transforming large data sets of structured, unstructured and semi-structured data.
- Imported and extracted the needed data from the server into HDFS using Sqoop and bulk loaded the cleaned data into HBase using MapReduce.
- Designed and created ETL jobs in Talend to load huge volumes of data into the Hadoop ecosystem and relational databases.
- Developed software in Java to read data files from the UK crime database and implemented a MapReduce program in Java to sort the data on different crimes in the cities.
- Implemented mappers and reducers across 24 nodes and distributed the data among the nodes.
- Developed a website using RESTful APIs to fetch data from the web server.
- Java developer with extensive experience in various Java libraries, APIs, front-end and back-end development, and frameworks.
- Skilled in data management, extraction, manipulation, validation and analysis of huge volumes of data.
- Strong ability to understand new concepts and applications.
- Excellent verbal and written communication skills, proven highly effective in interfacing across business and technical groups.
Hadoop Ecosystem Development: HDFS, Hadoop MapReduce, Hive, Impala, Pig, Oozie, HBase, Sqoop, Flume, YARN, Scala, Kafka, ZooKeeper
Hadoop Distribution System: Cloudera, Hortonworks.
Languages: Java, C/C++, SQL, Spring Boot, Python
Scripting: Pig Latin
Database: Oracle, MS-SQL, PL/SQL
Tools: Apache NiFi, Talend, Airflow, Informatica
Web Design: HTML5, CSS, AJAX, REST, JSON
Frameworks: MVC, Struts, Hibernate and Spring
OS: Linux (Ubuntu, Fedora), Unix, Windows
Confidential, Chicago, IL
- Worked on Hadoop cluster scaling from 4 nodes in the development environment to 8 nodes in the pre-production stage and up to 24 nodes in production.
- Involved in the complete implementation lifecycle; specialized in writing custom MapReduce, Pig and Hive programs.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used Hive queries (HQL) to query and search for strings in Hive tables in HDFS.
- Possess good Linux and Hadoop system administration, networking and shell scripting skills, and familiarity with open-source configuration management and deployment tools such as Chef.
- Worked with Puppet for application deployment.
- Experience in developing custom UDFs in Java to extend Hive and Pig Latin functionality.
- Created HBase tables to store data in various formats coming from different sources.
- Used Maven to build and deploy code on the YARN cluster.
- Good knowledge of building Apache Spark applications using Scala.
- Developed several business services as Java RESTful web services using the Spring MVC framework.
- Managed and scheduled jobs to remove duplicate log data files in HDFS using Oozie.
- Used Apache Oozie for scheduling and managing Hadoop jobs. Knowledge of HCatalog for Hadoop-based storage management.
- Expert in designing and creating data ingest pipelines using technologies such as Spring Integration and Apache Storm-Kafka.
- Used Flume extensively in gathering and moving log data files from Application Servers to a central location in Hadoop Distributed File System (HDFS).
- Implemented test scripts to support test-driven development and continuous integration.
- Exported data from HDFS to a MySQL database and imported it back using Sqoop.
- Responsible for managing data coming from different sources.
- Experienced in analyzing the Cassandra database and comparing it with other open-source NoSQL databases to find which best suits the current requirements.
- Used File System check (FSCK) to check the health of files in HDFS.
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Developed UNIX shell scripts for creating reports from Hive data.
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Involved in the pilot of a Hadoop cluster hosted on Amazon Web Services (AWS).
- Extensively used Sqoop to get data from RDBMS sources like Teradata and Netezza.
- Created a complete processing engine based on Cloudera's distribution.
- Worked on connecting Cassandra database to the Amazon EMR File System for storing the database in S3.
- Involved in collecting metrics for Hadoop clusters using Ganglia and Ambari.
- Extracted files from CouchDB and MongoDB through Sqoop and placed them in HDFS for processing.
- Used Spark Streaming to collect data from Kafka in real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in a NoSQL store (HBase).
- Configured Kerberos for the clusters.
- Well versed in using Elastic Load Balancer for autoscaling of EC2 servers.
Environment: Hadoop, MapReduce, HDFS, Ambari, Hive, Sqoop, Apache Kafka, Oozie, SQL, Alteryx, Flume, Spark, Cassandra, Scala, Java, AWS, GitHub.
- Worked on analyzing the Hadoop stack and different big data analytics tools including Pig, Hive, HBase and Sqoop.
- Experienced in implementing the Hortonworks distribution (HDP 2.1, HDP 2.2 and HDP 2.3).
- Developed MapReduce programs for refined queries on big data.
- Created an Azure HDInsight instance and deployed a Hadoop cluster on the cloud platform.
- Used Hive queries to import data into the Microsoft Azure cloud and analyzed the data using Hive scripts.
- Used Ambari in the Azure HDInsight cluster to record and manage the logs of the NameNode and DataNodes.
- Created Hive tables and worked on them for data analysis to meet the requirements.
- Developed a framework to load and transform large sets of unstructured data from UNIX systems into Hive tables.
- Worked with the business team in creating Hive queries for ad hoc access.
- In-depth understanding of classic MapReduce and YARN architectures.
- Implemented Hive generic UDFs for business logic.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Installed and configured Pig for ETL jobs.
- Developed Pig UDFs to pre-process the data for analysis.
- Deployed a Cloudera Hadoop cluster on Azure for big data analytics.
- Analyzed the data using Hive queries, Pig scripts, Spark SQL and Spark Streaming.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
- Used Apache NiFi to copy the data from local file system to HDFS.
- Developed a Spark Streaming script that consumes topics from Kafka, a distributed messaging source, and periodically pushes batches of data to Spark for real-time processing.
- Extracted files from Cassandra through Sqoop and placed in HDFS for further processing.
- Involved in creating a generic Sqoop import script for loading data into Hive tables from RDBMS sources.
- Involved in continuous monitoring of operations using Storm.
- Developed a workflow in Oozie to automate loading data into HDFS and pre-processing it with Pig.
- Implemented indexing of Oozie logs in Elasticsearch.
- Designed, developed, unit tested and supported ETL mappings and scripts for data marts using Talend.
Environment: Hortonworks, Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Apache Kafka, Azure, Apache Storm, Oozie, SQL, Flume, Spark, Talend, HBase, Cassandra, Informatica, Java, GitHub.
Confidential, Atlanta, GA
- Extracted, updated and loaded data from different data sources into HDFS using the Sqoop import/export command-line utility.
- Loaded data from the UNIX file system into HDFS, created Hive tables, and loaded and analyzed data using Hive queries.
- Loaded data back into Teradata for Basel reporting and for business users to analyze and visualize.
- Developed UDFs in Java for Hive and Pig and worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Developed multiple POCs using Scala, deployed them on the YARN cluster and compared the performance of Spark with Hive and SQL/Teradata.
- Analyzed the SQL scripts and designed a solution to implement them using Scala.
- Defined Pig UDFs for financial functions such as swap, hedging, speculation and arbitrage.
- Created end-to-end Spark-Solr applications using Scala to perform data cleansing, validation, transformation and summarization according to requirements.
- Streamlined the Hive tables using optimization techniques such as partitioning and bucketing to improve Hive query performance.
- Created custom shell scripts to import data via Sqoop from Oracle databases.
- Created big data workflows in Oozie to ingest data from various sources into Hadoop; these workflows comprise heterogeneous jobs such as Hive and Sqoop.
- Experienced in SparkContext, Spark SQL, pair RDDs and Spark on YARN.
- Handled moving of data from various data sources and performed transformations using Pig.
- Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data.
- Experience in improving search focus and quality in Elasticsearch by using aggregations.
- Worked with Elastic MapReduce and set up a Hadoop environment on AWS EC2 instances.
- Worked with and learned a great deal from AWS cloud services such as EC2, S3, EBS, RDS and VPC.
- Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storage of small data sets; experienced in maintaining the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it to Cassandra.
Environment: Hadoop, MapReduce, Sqoop, HDFS, HBase, Hive, Pig, Oozie, Spark, Kafka, Cassandra, AWS, Elasticsearch, Java, Oracle 10g, MySQL, Ubuntu, HDP.