
Hadoop Developer Resume


New York, New York

SUMMARY

  • 4 years of IT experience in analysis, design, development, and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python, and Scala.
  • Strong experience writing Spark Core, Spark SQL, Spark Streaming, and Java MapReduce applications, including Spark applications in Java.
  • Experienced with the analytical functions of Apache Spark, Hive, and Pig, and in extending Spark, Hive, and Pig functionality by writing custom UDFs and hooking them into larger Spark applications as in-line functions.
  • Experience with installation, backup, recovery, configuration, and development on multiple Hadoop distribution platforms (Cloudera and Hortonworks) as well as on the Amazon AWS and Google Cloud platforms.
  • Highly skilled in optimizing and moving large-scale pipeline applications from on-premises clusters to the AWS Cloud.
  • Working knowledge of spinning up, configuring, and maintaining long-running Amazon EMR clusters, both manually and through CloudFormation scripts on Amazon AWS.
  • Experienced in building frameworks for large-scale streaming applications in Apache Spark.
  • Worked on migrating Hadoop MapReduce programs to Apache Spark in Scala.
  • Extensive hands-on knowledge of working with the Amazon AWS and Google Cloud architectures.
  • Highly skilled in integrating Amazon Kinesis streams with Spark Streaming applications to build long-running real-time applications.
  • Experienced in configuring Kinesis shards for optimal throughput in Kinesis streams feeding Spark Streaming applications on AWS.
  • Solid understanding of RDD operations in Apache Spark, including transformations and actions, persistence (caching), accumulators, broadcast variables, and broadcast optimization (see the sketch after this list).
  • In-depth knowledge of handling large amounts of data using the Spark DataFrame/Dataset API and case classes.
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
  • Working knowledge of Hadoop file formats such as SequenceFile, ORC, Avro, and Parquet, as well as Text/CSV and JSON formatted files.
  • In-depth knowledge of the Big Data architecture and the various components of Hadoop 1.x and 2.x, such as HDFS, Job Tracker, Task Tracker, Data Node, and Name Node, and YARN concepts such as Resource Manager and Node Manager.
  • Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, DynamoDB, Workspaces, Lambda, Kinesis, SNS, SQS).
  • Wrote HiveQL and Pig Latin scripts, leading to a good understanding of MapReduce design patterns and of data analysis using Hive and Pig.
  • Strong knowledge of working with the Apache Spark Streaming API on Big Data distributions in an active cluster environment.
  • Adept at using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
  • Well versed in writing and deploying Oozie workflows and coordinators, and in scheduling, monitoring, and troubleshooting them through the Hue UI.
  • Proficient in importing and exporting data from Relational Database Systems to HDFS and vice versa, using Sqoop.
  • Good understanding of column-family and document NoSQL databases such as HBase, Cassandra, and MongoDB in enterprise use cases.
  • Very capable in processing large sets of structured, semi-structured, and unstructured data and in supporting system application architectures on Hadoop, Spark, and SQL databases such as Teradata, MySQL, and DB2.
  • Working experience with Impala, Mahout, Spark, Storm, Avro, Kafka, Hue, and AWS.
  • Experience with installation, backup, recovery, configuration, and development on multiple Hadoop distribution platforms such as the Hortonworks Data Platform (HDP) and the Cloudera Distribution for Hadoop (CDH).
  • Experienced in version control and source code management tools like GIT, SVN, and Bitbucket.
  • Software development experience in Java application development and client/server applications, implementing application environments using MVC, J2EE, JDBC, JSP, XML technologies (XML, XSL, XSD), web services, relational databases, and NoSQL databases.
  • Hands-on experience in application development using Java, RDBMS, Linux shell scripting, and Perl.
  • Hands-on experience with IDE tools such as Eclipse, IntelliJ, NetBeans, and Visual Studio, along with Git and Maven, and experienced in writing cohesive end-to-end applications on Apache Zeppelin.
  • Experience working in Waterfall and Agile (Scrum) methodologies.
  • Ability to adapt to evolving technologies, with a strong sense of responsibility and accomplishment.
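
As a quick illustration of the RDD concepts listed above, here is a minimal PySpark sketch showing a transformation, an action, caching, and a broadcast variable; the lookup table, sample data, and application name are illustrative placeholders rather than code from the projects below.

    # Minimal PySpark sketch: transformation, action, caching, and a broadcast variable.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-concepts-sketch").getOrCreate()
    sc = spark.sparkContext

    # Broadcast a small lookup table once instead of shipping it with every task.
    state_lookup = sc.broadcast({"NY": "New York", "NJ": "New Jersey"})

    records = sc.parallelize([("NY", 10), ("NJ", 5), ("NY", 7)])

    # Transformation (lazy): enrich each record through the broadcast variable.
    enriched = records.map(lambda kv: (state_lookup.value.get(kv[0], "Unknown"), kv[1]))

    # Persist because the RDD is reused by the two actions below.
    enriched.cache()

    # Actions (eager): these trigger the actual computation.
    totals = enriched.reduceByKey(lambda a, b: a + b).collect()
    print(totals, enriched.count())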

PROFESSIONAL EXPERIENCE

Confidential, New York, New York

Hadoop Developer

Responsibilities:

  • Worked on developing the architecture document and proper development guidelines.
  • Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Processed input from multiple data sources in the same reducer using GenericWritable and the MultipleInputs format.
  • Worked on Big Data processing of clinical and non-clinical data using MapReduce.
  • Visualized HDFS data for customers using a BI tool backed by the Hive ODBC driver.
  • Customized a BI tool for the management team to perform query analytics using HiveQL.
  • Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
  • Created partitions and buckets based on state to enable further processing using bucket-based Hive joins.
  • Created Hive generic UDFs to process business logic that varies based on policy.
  • Moved relational database data into Hive dynamic partition tables via staging tables using Sqoop.
  • Monitored the cluster using Cloudera Manager.
  • Involved in discussions with business users to gather the required knowledge.
  • Capable of creating real-time data streaming solutions and batch-style, large-scale distributed computing applications using Apache Spark, Spark Streaming, Kafka, and Flume.
  • Analyzed the requirements to develop the framework.
  • Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and Big Data technologies.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
  • Developed Java Spark Streaming scripts to load raw files and the corresponding processed metadata files into AWS S3 and an Elasticsearch cluster.
  • Developed Python scripts to get the most recent S3 keys from Elasticsearch.
  • Developed Python scripts to fetch S3 files using the Boto3 module (see the first sketch after this list).
  • Implemented PySpark logic to transform and process data in various formats such as XLSX, XLS, JSON, and TXT.
  • Built scripts to load PySpark-processed files into a Redshift database using a variety of PySpark logic (see the second sketch after this list).
  • Developed scripts to monitor and capture the state of each file as it moves through the pipeline.
  • Developed MapReduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs and used Oozie operational services for batch processing and scheduling workflows dynamically.
  • Handled migration of existing applications and development of new applications using AWS cloud services.
  • Worked with data investigation, discovery, and mapping tools to scan every single data record from many sources.
  • Implemented shell scripts to automate the whole process.
  • Fine-tuned PySpark applications/jobs to improve the efficiency and overall processing time of the pipelines.
  • Wrote Hive queries and ran scripts in Tez mode to improve performance on the Hortonworks Data Platform.
  • Wrote a PySpark job in AWS Glue to merge data from multiple tables (see the Glue sketch after this list).
  • Utilized a crawler to populate the AWS Glue Data Catalog with metadata table definitions.
  • Generated a script in AWS Glue to transfer the data.
  • Utilized AWS Glue to run ETL jobs and perform aggregations in PySpark code.
  • Integrated Apache Storm with Kafka to perform web analytics.
  • Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
  • Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Managing and reviewing data backups & log files.
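
Below is a hedged sketch of the S3 fetch step mentioned above: given a list of recent object keys (assumed to come from the Elasticsearch lookup), each file is downloaded locally with Boto3. The bucket name, keys, and destination directory are hypothetical placeholders.

    # Hypothetical sketch: download recent S3 objects with boto3.
    import os
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-raw-data-bucket"  # placeholder bucket name

    def fetch_s3_files(keys, dest_dir="/tmp/raw"):
        """Download each S3 key to dest_dir and return the local paths."""
        os.makedirs(dest_dir, exist_ok=True)
        local_paths = []
        for key in keys:
            local_path = os.path.join(dest_dir, os.path.basename(key))
            s3.download_file(BUCKET, key, local_path)
            local_paths.append(local_path)
        return local_paths

    # Example usage with keys that would normally come from the Elasticsearch query:
    # fetch_s3_files(["incoming/2019/05/01/file1.json"])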
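
The next sketch illustrates the transform-and-load step in PySpark: a few of the input formats listed above are read and a cleaned DataFrame is appended to a Redshift table over JDBC. The paths, Redshift endpoint, credentials, table name, and JDBC driver class are assumptions for illustration only.

    # Illustrative PySpark sketch: read mixed-format files and append to Redshift over JDBC.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

    json_df = spark.read.json("/tmp/raw/*.json")                            # JSON files
    txt_df = spark.read.option("sep", "\t").csv("/tmp/raw/*.txt")           # delimited TXT files
    xlsx_df = spark.createDataFrame(pd.read_excel("/tmp/raw/report.xlsx"))  # XLS/XLSX via pandas

    # ...format-specific cleansing and standardization would happen here...

    (json_df.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/dev")  # placeholder endpoint
        .option("dbtable", "staging.processed_files")               # placeholder table
        .option("user", "redshift_user")
        .option("password", "********")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")      # assumes the Redshift JDBC jar is on the classpath
        .mode("append")
        .save())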
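
Finally, a rough sketch of the AWS Glue merge job: two hypothetical Data Catalog tables (registered by the crawler) are read as DynamicFrames, joined on a shared key, and written back to S3 as Parquet. The database, table, key, and path names are placeholders.

    # Sketch of a Glue PySpark job that merges two Data Catalog tables.
    import sys
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # DynamicFrames read from tables the crawler registered in the Data Catalog.
    orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
    customers = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

    # Merge the two tables on the shared key column.
    merged = Join.apply(orders, customers, "customer_id", "customer_id")

    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/merged/"},
        format="parquet",
    )
    job.commit()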

Environment: AWS S3, Java, Maven, Python, Spark, Kafka, Elasticsearch, Mapper Cluster, Amazon Redshift, shell scripts, pandas, PySpark, Pig, Hive, Oozie, JSON, AWS Glue.

Confidential, Kenilworth, NJ

Hadoop Developer

Responsibilities:

  • Involved in the complete project life cycle, from design discussions to production deployment.
  • Worked closely with the business team to gather their requirements and new support features.
  • Involved in running POCs on different use cases of the application and maintained a standard document of best coding practices.
  • Developed a 200-node cluster while designing the data lake with the Hortonworks distribution.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Installed, configured, and implemented high-availability Hadoop clusters with the required services (HDFS, Hive, HBase, Spark, Zookeeper).
  • Implemented Kerberos for authenticating all the services in the Hadoop cluster.
  • Responsible for installation and configuration of Hive, Pig, HBase, and Sqoop on the Hadoop cluster and created Hive tables to store the processed results in a tabular format.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data to HDFS using Scala (see the first sketch after this list).
  • Developed Sqoop scripts to handle the interaction between Hive and the Vertica database.
  • Processed data into HDFS by developing solutions and analyzed the data using MapReduce, Pig, and Hive to produce summary results from Hadoop for downstream systems.
  • Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto-scaling, load balancers, Route 53, SES, and SNS in the defined virtual private cloud.
  • Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
  • Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzed them by running Hive queries and Pig scripts.
  • Created Managed tables and External tables in Hive and loaded data from HDFS.
  • Developed Spark code by using Scala and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
  • Scheduled several time-based Oozie workflows by developing Python scripts.
  • Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
  • Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
  • Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.
  • Designed the ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
  • End-to-end architecture and implementation of client-server systems using Scala, Akka, Java, JavaScript, and related technologies on Linux.
  • Optimized the Hive tables using techniques such as partitioning and bucketing to provide better query performance.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
  • Implemented Hadoop on the AWS EC2 system using a few instances to gather and analyze data log files.
  • Involved in Spark and Spark Streaming, creating RDDs and applying operations (transformations and actions).
  • Created partitioned tables and loaded data using both the static partition and dynamic partition methods (see the second sketch after this list).
  • Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
  • Used Kafka for publish-subscribe messaging as a distributed commit log, gaining experience with its speed, scalability, and durability.
  • Followed the Test-Driven Development (TDD) process, with extensive experience in Agile and Scrum programming methodologies.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
  • Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
  • Involved in cluster maintenance, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
  • Analyzed the Hadoop cluster and different Big Data analytics tools including Pig, Hive, HBase, and Sqoop.
  • Improved performance by tuning Hive and MapReduce.
  • Researched, evaluated, and utilized modern technologies/tools/frameworks around the Hadoop ecosystem.
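
A sketch of the Kafka-to-HDFS streaming pattern follows, written in PySpark Structured Streaming for consistency with the other sketches (the project code itself was in Scala). The broker, topic, and HDFS paths are placeholders, and the spark-sql-kafka package is assumed to be available.

    # PySpark Structured Streaming sketch: consume a Kafka topic and land it on HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "clickstream")                  # placeholder topic
        .load()
        .select(col("value").cast("string").alias("event")))

    query = (events.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/clickstream/")                      # HDFS landing directory
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
        .start())

    query.awaitTermination()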
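
The second sketch shows the two partition-loading methods issued through a Hive-enabled SparkSession; the table and column names are hypothetical.

    # Static vs. dynamic partition loading into a partitioned Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("hive-partition-load-sketch")
        .enableHiveSupport()
        .getOrCreate())

    # Static partition: the partition value is fixed in the statement.
    spark.sql("""
        INSERT INTO TABLE sales_partitioned PARTITION (load_date = '2019-05-01')
        SELECT order_id, customer_id, amount
        FROM sales_staging
        WHERE load_date = '2019-05-01'
    """)

    # Dynamic partition: Hive derives the partition value from the last selected column.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE sales_partitioned PARTITION (load_date)
        SELECT order_id, customer_id, amount, load_date
        FROM sales_staging
    """)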

Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Java, Shell Scripts, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Scala, ETL, Python.

Confidential, New York, New York

Spark Developer

Responsibilities:

  • Developed Spark applications using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Involved in Agile methodologies, daily scrum meetings, and sprint planning; wrote scripts to distribute queries for performance test jobs in the Amazon data lake.
  • Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data of 2 petabytes in size.
  • Developed MapReduce jobs for cleaning, accessing, and validating the data, and created and worked with Sqoop jobs using incremental load to populate Hive external tables.
  • Developed optimal strategies for distributing the weblog data over the cluster, importing and exporting the stored weblog data into HDFS and Hive using Sqoop.
  • Installed and configured Apache Hadoop across multiple nodes on AWS EC2 and developed Pig Latin scripts to replace the existing legacy process with Hadoop, with the data fed to AWS S3.
  • Responsible for building scalable distributed data solutions using Hadoop (Cloudera) and designed and developed automation test scripts using Python.
  • Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
  • Analyzed the SQL scripts and designed the solution to implement them using Spark, and implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Responsible for developing data pipeline with Amazon AWS to extract the data from weblogs and store it in HDFS.
  • Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm and wrote Pig scripts to transform raw data from several data sources into baseline data.
  • Worked on MongoDB using CRUD (Create, Read, Update, and Delete), indexing, replication, and sharding features.
  • Involved in designing the row key in HBase to store text and JSON as key values in the HBase table, designing the row key so rows can be fetched and scanned in sorted order.
  • Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts)
  • Created Hive tables and worked on them using HiveQL, and designed and implemented static and dynamic partitioning and bucketing in Hive.
  • Used Spark Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra (see the sketch after this list).
  • Developed syllabus/curriculum data pipelines from the syllabus/curriculum web services to HBase and Hive tables.
  • Worked on Cluster coordination services through Zookeeper and monitored workload, job performance, and capacity planning using Cloudera Manager
  • Involved in building applications using Maven and integrated them with CI servers like Jenkins to build jobs.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters and implemented data ingestion and handling clusters for real-time processing using Kafka.
  • Created cubes in Talend to build different types of aggregations over the data and to visualize them.
  • Monitored Hadoop NameNode health status and the number of Task Trackers and Data Nodes running, and automated all jobs, from pulling data from different data sources such as MySQL to pushing the result-set data to the Hadoop Distributed File System.
  • Developed storytelling dashboards in Tableau Desktop, published them to Tableau Server, and used GitHub version control to maintain project versions.
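
A hedged sketch of the Kafka-to-Cassandra pattern described above, written in PySpark Structured Streaming for consistency with the other sketches (the project used the Scala Spark Streaming APIs). It assumes the spark-sql-kafka and spark-cassandra-connector packages are available; the broker, topic, schema, keyspace, and table names are placeholders.

    # PySpark sketch: stream learner events from Kafka and persist each micro-batch to Cassandra.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("kafka-to-cassandra-sketch").getOrCreate()

    schema = StructType([
        StructField("learner_id", StringType()),
        StructField("course_id", StringType()),
        StructField("score", IntegerType()),
    ])

    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "learner-events")               # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*"))

    def write_to_cassandra(batch_df, batch_id):
        # Append each micro-batch to the Cassandra table via the DataStax connector.
        (batch_df.write
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="learning", table="learner_model")  # placeholder keyspace/table
            .mode("append")
            .save())

    query = (events.writeStream
        .foreachBatch(write_to_cassandra)
        .option("checkpointLocation", "hdfs:///checkpoints/learner-model/")
        .start())

    query.awaitTermination()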

Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java, PL/SQL, Oracle 11g, Unix/Linux.
