- Around 5 years of professional IT experience in Big Data/Hadoop ecosystem components such as HDFS, MapReduce, Pig, Hive, ZooKeeper, HBase, Sqoop, Oozie, Kafka, Impala and Flume for data storage and analysis
- Experience in installing, configuring, managing, supporting and monitoring Hadoop clusters using various distributions such as Apache Hadoop and Cloudera
- Strong experience across the complete project life cycle (design, development and testing) and with AWS services such as EC2 and S3
- Specialized in developing complex MapReduce jobs in Java and user-defined functions (UDFs) in Pig and Hive
- Experience in project planning, setting implementation standards and designing Hadoop-based applications
- Experience with job/workflow scheduling and coordination tools such as Oozie and ZooKeeper
- Proficient in fixing production issues and providing error-free solutions
- Good communication and teamwork skills, with strong analytical and problem-solving abilities
Big Data Ecosystem: HDFS, YARN, HBase, MapReduce, Hive, Pig, Sqoop, Oozie, ZooKeeper, Flume, Spark & Kafka.
Programming Languages: C, C++, Shell Scripting, Scala, Java, SQL, Python.
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, JNDI, Java Beans.
Version Control: Git, SVN.
Databases: Oracle 10g/9i/8i, MySQL, SQL Server, Teradata, DB2, Informix.
Web Technologies: HTML, XML, jQuery, PHP, CSS.
IDE Tools: MyEclipse, Eclipse, IntelliJ IDEA, NetBeans, WSAD.
NoSQL Databases: HBase, Cassandra, MongoDB.
Operating Systems: Windows variants, UNIX, LINUX.
Other Tools: SQL Developer, Maven, JUnit.
Confidential, Bridgewater, NJ
Hadoop / Spark Developer
- Designing, implementing and maintaining Spark applications to drive quality and consistency within the design and development phases, and analyzing the scope of the project
- Identifying the production and non-production application issues
- Handling data coming from different data sources and loading it from various file systems and databases into HDFS using Sqoop
- Transforming large sets of structured, semi-structured and unstructured data using Hive and Pig based on business requirements
- Created Hive internal and external tables and implemented partitioning and bucketing
- Developed Pig Latin and HiveQL scripts for data analysis and ETL, and extended their functionality by developing complex Pig and Hive user-defined functions (UDFs) in Java
- Developed Hive queries that invoke and run MapReduce jobs in the backend
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager
- Developed complex MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables
- Managed and scheduled jobs on the Hadoop cluster using Oozie workflows and the Oozie Coordinator engine
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team
- Collected large volumes of data from different sources, aggregated the data using Apache Kafka and stored it in HDFS for analysis
- Supported formal testing, resolved test defects and executed unit test cases to ensure software quality
- Prepared deliverables for client review and approval
- Setting up Kerberos principals and testing HDFS, Hive, Pig, and MapReduce access for the new users
- Controlling the version of the project related components and documents and conducting knowledge sharing sessions and weekly status report meetings within the project
- Coordinated with onsite/offshore team members on a daily basis
Technology/Tools: Hadoop - Cloudera (CDH3/4), HDFS, MapReduce, Hive, Pig, Sqoop, Kafka, Scala, Spark, HBase, Talend, Oozie, Maven, Java, SQL, Oracle, UNIX, SQL Developer, PuTTY
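The MapReduce bullets above describe parsing raw data and aggregating it into partitioned staging tables. A minimal Hadoop Streaming-style sketch of that map/aggregate pattern, in stdlib-only Python; the comma-delimited (date, region, amount) record layout is a hypothetical example, not the project's actual schema:

```python
from collections import defaultdict

def mapper(lines):
    """Parse raw comma-delimited records and emit ((date, region), amount) pairs.
    The three-field layout here is a hypothetical example; malformed
    records are skipped, as a defensive MapReduce parser would."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # skip malformed records
        date, region, amount = fields
        yield (date, region), float(amount)

def reducer(pairs):
    """Sum amounts per (date, region) key, mirroring how refined data
    would be aggregated before loading into a partitioned table."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)

raw = ["2016-01-01,NJ,10.5", "2016-01-01,NJ,4.5", "bad-record", "2016-01-02,NY,7.0"]
result = reducer(mapper(raw))
```

In a real job, Hadoop performs the shuffle between the map and reduce phases; this sketch collapses that into a single in-memory aggregation to show only the parsing and aggregation logic.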
Confidential, Waltham, MA
Hadoop / Spark Developer
- Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software
- Developed MapReduce jobs to convert data files into Parquet format and used MRUnit to test the correctness of the MapReduce programs
- Designed and architected the batch implementation on the Hadoop environment
- Created Job Streams/Jobs in Talend Administration Center (TAC) to run the Hadoop jobs
- Worked on Data loading into Hive for Data Ingestion history and Data content summary
- Worked on Spark SQL; created DataFrames by loading data from Hive tables, prepared data and stored it in AWS S3
- Programmed MapReduce jobs to analyze petabyte-scale data sets on a daily basis and derive data patterns
- Imported and exported terabytes of data using Sqoop between relational database systems and HDFS
- Optimized MapReduce codes, Pig scripts, Hive queries and involved in performance tuning and analysis
- Used Spark for series of dependent jobs and for iterative algorithms.
- Developed a data pipeline using Kafka and Spark Streaming to store data into HDFS
- Used Apache Flume for streaming data from various sources into HDFS
- Implemented a streaming process using Spark to pull data from an external REST API
- Used Kafka for website activity tracking, stream processing and auto-scaling backend servers based on event throughput
- Extensively worked on Oozie and UNIX scripts for batch processing and scheduling workflows dynamically
- Troubleshot, debugged and resolved Talend issues while maintaining the health and performance of the ETL environment
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior
- Used Hive to partition and bucket data
- Converted the existing relational database model to the Hadoop ecosystem
Technology/Tools: Hadoop - Hortonworks, HDFS, MapReduce, Hive, Sqoop, Kafka, Scala, Spark, HBase, Talend, Oozie, Maven, Data Processing Layer, Hue, Azure, Erwin, MS Visio, Tableau, SQL, MongoDB, UNIX, MySQL, RDBMS, Ambari, Cron
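One bullet above covers converting data files to Parquet. The core benefit of Parquet's columnar layout can be illustrated with a stdlib-only sketch; real Parquet also adds per-column encoding, compression and metadata, and the (user_id, event, ts) schema here is a hypothetical example:

```python
def rows_to_columns(rows, schema):
    """Pivot row-oriented records into a column-oriented layout,
    the core idea behind Parquet: queries that touch a few columns
    can scan just those columns instead of whole rows."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}

schema = ["user_id", "event", "ts"]  # hypothetical schema
rows = [
    (1, "click", 1000),
    (2, "view", 1001),
    (1, "view", 1002),
]
columns = rows_to_columns(rows, schema)

# a scan now touches only one field's values:
clicks = sum(1 for e in columns["event"] if e == "click")
```

This is why analytical Hive queries over Parquet are cheaper than over row-oriented text files: the engine reads only the columns the query references.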
Confidential, Woonsocket, RI
- Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis
- Created Kafka topics and partitions and wrote custom partitioner classes
- Responsible for developing a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS
- Imported and exported data using Sqoop between HDFS and relational database systems
- Integrating bulk data into Cassandra using MapReduce programs
- Analyzed the security requirements for Hadoop and integrated with the Kerberos authentication and authorization infrastructure
- Processed data into HDFS, analyzed the data using MapReduce, Pig and Hive, and produced summary results from Hadoop for downstream systems
- Involved in the POC implementation of migrating MapReduce programs to Spark transformations using Spark and Scala
- Enabled speedy reviews and first mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System and Pig to pre-process the data
- Optimized Hive queries by tuning combinations of Hive parameters and developed user-defined functions (UDFs) to extend the core functionality of Pig and Hive as required
- Extracted data from Teradata into HDFS using Sqoop
- Involved in collecting metrics for Hadoop clusters using Ganglia and Ambari
- Exported the result set from Hive to MySQL using Shell scripts
- Involved in writing custom Pig loaders and storage classes to work with a variety of data formats such as JSON, compressed CSV, etc.
- Documenting the procedures performed for the project development
Technology/Tools: Hadoop - Hortonworks/Cloudera, HDFS, MapReduce, Hive, Sqoop, Spark, HBase, Pig, Flume, Scala, Kafka, Oozie, ZooKeeper, Teradata, Windows 7, SSH Tectia client, Remedy and Jira ticketing tools
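A bullet above mentions writing custom Kafka partitioner classes. The essential logic, mapping a record key deterministically to a partition so that all records for a key preserve ordering, can be sketched in stdlib-only Python (Kafka's default partitioner uses murmur2; `crc32` here is a stand-in so the sketch needs no external library):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index, mimicking the contract
    of a Kafka custom partitioner: the same key always lands on the
    same partition, which preserves per-key ordering. crc32 is a
    stand-in hash; Kafka's default partitioner uses murmur2."""
    return zlib.crc32(key) % num_partitions

# the same key always maps to the same partition:
p1 = partition_for(b"customer-42", 6)
p2 = partition_for(b"customer-42", 6)
```

A real custom partitioner would implement Kafka's `Partitioner` interface in Java/Scala; the only project-specific part is the key-to-partition function shown here.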
- Involved in complete Big Data flow of the application data ingestion from upstream to HDFS, processing the data in HDFS and analyzing the data using several tools.
- Imported data in various formats such as JSON, Sequence, text, CSV, Avro and Parquet into the HDFS cluster with compression for optimization.
- Extracted data from agent nodes into HDFS using Python scripts and ran UNIX shell commands via the Python subprocess module.
- Ingested data from RDBMS sources like - Oracle, SQL Server and Teradata into HDFS using Sqoop.
- Imported data from Amazon S3 to HIVE using Sqoop & Kafka and maintained multi-node Dev and Test Kafka Clusters
- Configured Hive, wrote Hive UDFs and UDAFs, and created static and dynamic partitions with bucketing.
- Integrated Amazon Redshift with Spark using Scala.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
- Imported and exported data between HDFS and Hive using Sqoop and Kafka in both batch and streaming modes.
- Used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that receives data from Kafka in near real time and persists it into HBase.
- Implemented performance analysis of Spark streaming and batch jobs by using Spark tuning parameters.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Used Hive join queries to join multiple tables of a source system and load them into Elasticsearch tables.
- Designed and created analytical reports and automated dashboards to help users identify critical KPIs and facilitate strategic planning in the organization.
- Involved in Cluster maintenance, Cluster Monitoring and Troubleshooting.
- Built an automated build and deployment framework using GitHub and Maven.
- Created Data Pipelines as per the business requirements and scheduled it using Oozie Coordinators.
Technology/Tools: Scala, Hadoop, HDFS, Hive, Oozie, Sqoop, NiFi, Spark, Kafka, Elasticsearch, Shell Scripting, HBase, Python, GitHub, Tableau, Oracle, MySQL, Teradata and AWS
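The Hive UDF/UDAF work above can alternatively be done with a streaming TRANSFORM script: Hive streams tab-delimited rows to the script on stdin and reads transformed rows back from stdout. A minimal per-row sketch in Python; the two-column (user_id, email) layout and the normalization rule are hypothetical examples:

```python
def clean_record(line: str) -> str:
    """Normalize one tab-separated record the way a Hive TRANSFORM
    script would. Hive pipes each row to the script as tab-delimited
    text; the (user_id, email) layout here is a hypothetical example."""
    user_id, email = line.rstrip("\n").split("\t")
    return f"{user_id}\t{email.strip().lower()}"

# In production, Hive would invoke the script via
#   SELECT TRANSFORM(user_id, email) USING 'clean.py' ...
# streaming rows through stdin/stdout; here we call it directly:
cleaned = clean_record("7\t Foo@Example.COM ")
```

Whether to use a TRANSFORM script or a compiled Java UDF is mostly a packaging choice; the per-row logic stays the same.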