- 5+ years of experience in IT and 2+ years of experience with Hadoop/Big Data ecosystems and Java technologies such as HDFS, MapReduce, Apache Pig, Hive, HBase, Spark, Kafka and Sqoop.
- In-depth knowledge of Hadoop architecture and Hadoop daemons such as NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.
- Experience writing MapReduce programs using Apache Hadoop for analyzing Big Data.
- Hands-on experience writing ad hoc queries to move data from HDFS to Hive and analyzing the data using HiveQL.
- Experience importing and exporting data between relational database systems and HDFS using Sqoop.
- Experience writing Hadoop jobs for analyzing data using Pig Latin.
- Good knowledge of analyzing data in HBase using Hive and Pig.
- Working knowledge of NoSQL databases such as HBase and Cassandra.
- Hands-on experience setting up automated monitoring and escalation infrastructure for Hadoop clusters using Ganglia and Nagios.
- Experience designing and developing POCs in Scala, deploying them on a YARN cluster, and comparing the performance of Spark with Hive and SQL/Teradata.
- Good knowledge of AWS services such as EMR, EC2, EBS, S3 and RDS, which provide fast and efficient processing of Big Data.
- Experience integrating BI tools such as Tableau and pulling the required data into the BI tool's memory.
- Experience launching EC2 instances in Amazon EMR using the console.
- Experience extending Hive and Pig core functionality by writing custom UDFs, including UDAFs and UDTFs.
- Experience with administrative tasks such as installing Hadoop and ecosystem components such as Hive and Pig in distributed mode.
- Experience in using Apache Flume for collecting, aggregating and moving large amounts of data from application servers.
- Passionate about working in Big Data and analytics environments.
- Knowledge of reporting tools such as Tableau, used for analytics on data in the cloud.
- Extensive experience with SQL, PL/SQL, Shell Scripting and database concepts.
- Experience working on Windows and UNIX/Linux platforms with technologies such as Big Data, SQL, XML, HTML, Core Java and shell scripting.
Database: DB2, MySQL, Oracle, MS SQL Server
Languages: Core Java, Pig Latin, SQL, HiveQL, Shell Scripting and XML
APIs/Tools: NetBeans, Eclipse, MySQL Workbench, Visual Studio
Big Data Ecosystem: HDFS, Pig, MapReduce, Hive, Kafka, Sqoop, Flume, HBase
Operating System: Unix, Linux, Windows XP
Visualization Tools: Tableau, Zeppelin
Virtualization Software: VMware, Oracle VirtualBox
Cloud Computing Services: AWS (Amazon Web Services)
Confidential, Berkeley Heights, NJ
Hadoop Big Data/Spark Developer
- Analyzed the requirements to set up a cluster.
- Developed MapReduce programs in Java for parsing the raw data and populating staging tables.
- Created Hive queries to compare the raw data with EDW reference tables and perform aggregations.
- Imported and exported data into HDFS and Hive using Sqoop.
- Wrote Pig scripts to process the data.
- Developed and designed Hadoop, Spark and Java components.
- Developed Spark programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the EDW.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Involved in HBase setup and storing data in HBase for further analysis.
- Explored Spark to improve the performance of, and optimize, existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN; converted Hive queries into Spark transformations using Spark RDDs.
- Created Kafka-based applications that monitor consumer lag within Apache Kafka clusters; these are used in production by multiple companies.
- Developed Unix/Linux Shell Scripts and PL/SQL procedures.
- Worked on creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Performed performance optimizations on Spark/Scala; diagnosed and resolved performance issues.
- Installed and configured Hive and wrote Hive UDFs.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala and Python.
- Involved in creating Hive tables, loading them with data and writing Hive queries in HiveQL, which run internally as MapReduce jobs.
- Loaded some of the data into Cassandra for fast retrieval.
- Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and explored optimizing it using SparkContext, Spark SQL, pair RDDs and Spark on YARN.
- Exported the analyzed data to relational databases using Sqoop for visualization and for report generation by the BI team.
- Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them.
- Implemented Big Data solutions on the Hortonworks distribution and the AWS cloud platform.
- Developed Pig Latin scripts for handling data formation.
- Extracted data from MySQL into HDFS using Sqoop.
- Managed and monitored the Hadoop cluster using Cloudera Manager.
Environment: Hadoop, Cloudera distribution, Hortonworks distribution, AWS, EMR, Azure cloud platform, HDFS, MapReduce, DocumentDB, Kafka, Pig, Hive, Sqoop, Flume, Oozie, ZooKeeper, Core Java, Impala, HiveQL, Spark, UNIX/Linux Shell Scripting.
Confidential, Newark, CA
Big Data Developer
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in loading data from the Linux file system into HDFS.
- Worked with HDFS admin shell commands.
- Experience with ETL methods for data extraction, transformation and loading in corporate-wide ETL solutions, and with data warehouse tools for reporting and data analysis.
- Understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode and DataNode.
- Developed Kafka producers and consumers, HBase clients, and Apache Spark and Hadoop MapReduce jobs, along with components on HDFS and Hive.
- Used the Spark Streaming API with Kafka to build live dashboards; worked on RDD transformations and actions, Spark Streaming, pair RDD operations, checkpointing and SBT.
- Used Kafka to transfer data from different data systems to HDFS.
- Migrated complex MapReduce programs into Spark RDD transformations and actions.
- Involved in developing a Spark Streaming application for one of the data sources using Scala and Spark, applying transformations to the incoming data.
- Developed a Scala script to read all the Parquet tables in a database and convert them to JSON files, and another script to load them as structured tables in Hive.
- Implemented Spark RDD transformations to express business analysis logic and applied actions on top of those transformations.
- Used Spark to parse XML files, extract values from tags and load them into multiple Hive tables.
- Experience with different Hadoop distributions such as Cloudera and Hortonworks.
- Hands-on experience with Cassandra.
- Analyzed large data sets by running Hive queries and Pig scripts.
- Created Hive tables and was involved in loading data and writing Hive UDFs.
- Hands-on use of Sqoop to import data from RDBMS into HDFS and to export it back.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Sqoop, Avro, Hive, Pig, Java and MapReduce daily to develop ETL, batch processing and data storage functionality.
- Supported implementation and execution of MapReduce programs in a cluster environment.
Environment: Hadoop, MapReduce, Hive, Pig, HBase, Sqoop, Kafka, Cassandra, Flume, Java, SQL, Cloudera Manager, Eclipse, Unix Script, YARN.
Confidential, Columbus, OH
- Wrote MapReduce code to parse data from various sources and store the parsed data in HBase and Hive.
- Integrated MapReduce with HBase to bulk-import large amounts of data into HBase using MapReduce programs.
- Imported data from relational data sources such as Oracle and Teradata into HDFS using Sqoop.
- Worked on stand-alone as well as distributed Hadoop applications.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Used Oozie to automate the flow of jobs and ZooKeeper for coordination in the cluster.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Extensive knowledge of Pig scripts using bags and tuples, and of Pig UDFs to pre-process data for analysis.
- Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers.
- Used Teradata in building both the Hadoop project and the ETL project.
- Developed several shell scripts that act as wrappers to start these Hadoop jobs and set their configuration parameters.
- Involved in writing queries using Impala for faster processing of data.
- Involved in developing Impala scripts for extraction, transformation and loading of data into the data warehouse.
- Migrated HiveQL queries to Impala to minimize query response time.
- Involved in moving log files generated from various sources to HDFS through Flume for further processing.
- Involved in collecting and aggregating large amounts of log data using Apache and staging data in HDFS for further analysis.
- Developed testing scripts in Python, prepared test procedures, analyzed test result data and suggested improvements to the system and software.
Environment: HDFS, MapReduce, Python, CDH5, HBase, NoSQL, Hive, Pig, Hadoop, Sqoop, Impala, YARN, Shell Scripting, Ubuntu, Linux Red Hat.
Java / Hadoop Developer
- Involved in design and development phases of Software Development Life Cycle (SDLC) using Scrum methodology.
- Worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Involved in creating Hive tables, loading them with data and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Worked on analyzing the Hadoop cluster and different Big Data analytic tools including MapReduce, Hive and Spark.
- Developed MapReduce programs to parse the raw data and store the pre-aggregated data in partitioned tables.
- Involved in the end-to-end process of Hadoop cluster installation, configuration and monitoring.
- Responsible for building scalable distributed data solutions using Hadoop; involved in submitting and tracking MapReduce jobs using JobTracker.
- Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Worked with HBase, creating tables to load large sets of semi-structured data coming from various sources.
- Created design documents and reviewed them with the team, and assisted the business analyst/project manager in explaining them to the line of business.
- Responsible for understanding the scope of the project and requirement gathering.
- Involved in analysis, design, construction and testing of the application.
- Developed the web tier using JSP to show account details and summary.
- Used Tomcat web server for development purposes.
- Involved in creating test cases for JUnit testing.
- Used Oracle as the database and Toad for query execution; also wrote SQL scripts and PL/SQL code for procedures and functions.
- Developed the application using Eclipse and used Maven as the build and deploy tool.