Hadoop Developer Resume
New York, New York
SUMMARY
- 4 years of IT experience in analysis, design, development, and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python, and Scala.
- Strong experience writing Spark Core, Spark SQL, Spark Streaming, and Java MapReduce applications, including Spark applications in Java.
- Experienced in the analytical functions of Apache Spark, Hive, and Pig, and in extending Spark, Hive, and Pig functionality by writing custom UDFs and hooking UDFs into larger Spark applications for use as in-line functions.
- Experience with installation, backup, recovery, configuration, and development on multiple Hadoop distribution platforms (Cloudera and Hortonworks), including the Amazon AWS and Google Cloud platforms.
- Highly skilled in optimizing and moving large-scale pipeline applications from on-premises clusters to the AWS Cloud.
- Working knowledge of spinning up, configuring, and maintaining long-running Amazon EMR clusters both manually and through CloudFormation scripts on Amazon AWS.
- Experienced in building frameworks for large-scale streaming applications in Apache Spark.
- Worked on migrating Hadoop MapReduce programs to Apache Spark in Scala.
- Extensive hands-on knowledge of working with the Amazon AWS and Google Cloud architectures.
- Highly skilled in integrating Amazon Kinesis streams with Spark Streaming applications to build long-running real-time applications.
- Experienced in configuring Kinesis shards for optimal throughput in Kinesis streams feeding Spark Streaming applications on AWS.
- Solid understanding of RDD operations in Apache Spark, i.e., transformations and actions, persistence (caching), accumulators, broadcast variables, and optimizing broadcasts (illustrated in the sketch following this summary).
- In-depth knowledge of handling large amounts of data using the Spark DataFrame/Dataset API and case classes.
- Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
- Working knowledge of Hadoop file formats such as SequenceFile, ORC, Avro, and Parquet, as well as plain Text/CSV and JSON files.
- In-depth knowledge of Big Data architecture and the various components of Hadoop 1.x and 2.x, such as HDFS, JobTracker, TaskTracker, DataNode, and NameNode, and YARN concepts such as ResourceManager and NodeManager.
- Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, DynamoDB, WorkSpaces, Lambda, Kinesis, SNS, SQS).
- Wrote HiveQL and Pig Latin scripts, building a good understanding of MapReduce design patterns and data analysis using Hive and Pig.
- Strong knowledge of working with the Apache Spark Streaming API on Big Data distributions in an active cluster environment.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Well versed in writing and deploying Oozie workflows and coordinators, and in scheduling, monitoring, and troubleshooting them through the Hue UI.
- Proficient in importing and exporting data from Relational Database Systems to HDFS and vice versa, using Sqoop.
- Good understanding of NoSQL databases such as HBase, Cassandra, and MongoDB in enterprise use cases.
- Capable of processing large sets of structured, semi-structured, and unstructured data and supporting system application architecture in Hadoop, Spark, and SQL databases such as Teradata, MySQL, and DB2.
- Working experience with Impala, Mahout, Spark, Storm, Avro, Kafka, Hue, and AWS.
- Experience with installation, backup, recovery, configuration, and development on multiple Hadoop distribution platforms, including the Hortonworks Data Platform (HDP) and Cloudera Distribution for Hadoop (CDH).
- Experienced in version control and source code management tools like GIT, SVN, and Bitbucket.
- Software development experience in Java application development and client/server applications, implementing application environments using MVC, J2EE, JDBC, JSP, XML technologies (XML, XSL, XSD), web services, relational databases, and NoSQL databases.
- Hands-on experience in application development using Java, RDBMS, Linux shell scripting, and Perl.
- Hands-on experience with IDE and build tools such as Eclipse, IntelliJ, NetBeans, Visual Studio, Git, and Maven, and experienced in writing cohesive end-to-end applications on Apache Zeppelin.
- Experience working in Waterfall and Agile - SCRUM methodologies.
- Ability to adapt to evolving technologies, a strong sense of responsibility and accomplishment.
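A minimal PySpark sketch of the RDD concepts listed above (transformations and actions, persistence, broadcast variables, and the DataFrame API). All names and sample data are illustrative placeholders, not drawn from any project in this resume.

```python
# Sketch of core Spark concepts: transformations/actions, caching, broadcasts.
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("rdd-concepts-sketch").getOrCreate()
sc = spark.sparkContext

# Broadcast a small lookup table instead of shipping it with every task.
state_lookup = sc.broadcast({"NY": "New York", "NJ": "New Jersey"})

records = sc.parallelize([("NY", 120), ("NJ", 80), ("NY", 45)])

# Transformations are lazy; nothing executes until an action is called.
expanded = records.map(lambda kv: (state_lookup.value[kv[0]], kv[1]))
totals = expanded.reduceByKey(lambda a, b: a + b)

# Persist because the RDD is reused by two actions below.
totals.persist(StorageLevel.MEMORY_ONLY)

print(totals.collect())   # action 1
print(totals.count())     # action 2

# DataFrame API equivalent of the same aggregation.
df = spark.createDataFrame(expanded, ["state", "amount"])
df.groupBy("state").sum("amount").show()
```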
PROFESSIONAL EXPERIENCE
Confidential, New York, New York
Hadoop Developer
Responsibilities:
- Worked on developing architecture documents and proper guidelines.
- Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Processed input from multiple data sources in the same reducer using GenericWritable and multiple input formats.
- Worked on Big Data processing of clinical and non-clinical data using MapReduce.
- Visualized HDFS data for customers in a BI tool with the help of the Hive ODBC driver.
- Customized a BI tool for the management team to perform query analytics using HiveQL.
- Imported data using Sqoop to load data from MySQL into HDFS on a regular basis.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Moved relational database data into Hive dynamic-partition tables using Sqoop and staging tables.
- Monitored the cluster using Cloudera Manager.
- Involved in discussions with business users to gather the required knowledge.
- Capable of creating real-time data streaming solutions and batch-style, large-scale distributed computing applications using Apache Spark, Spark Streaming, Kafka, and Flume.
- Analyzed the requirements to develop the framework.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
- Developed Java Spark Streaming scripts to load raw files and the corresponding processed metadata files into AWS S3 and an Elasticsearch cluster.
- Developed Python scripts to get the most recent S3 keys from Elasticsearch.
- Wrote Python scripts to fetch S3 files using the Boto3 module (a Boto3 sketch follows at the end of this section).
- Implemented PySpark logic to transform and process data in various formats such as XLSX, XLS, JSON, and TXT.
- Built scripts to load PySpark-processed files into Redshift and applied a variety of PySpark transformations.
- Developed scripts to monitor and capture the state of each file as it moves through the pipeline.
- Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie Operational Services for batch processing and scheduling workflows dynamically.
- Included migration of existing applications and development of new applications using AWS cloud services.
- Worked with data investigation, discovery, and mapping tools to scan every data record from many sources.
- Implemented shell scripts to automate the whole process.
- Fine-tuned PySpark applications/jobs to improve the efficiency and overall processing time of the pipelines.
- Wrote Hive queries and ran scripts in Tez mode to improve performance on the Hortonworks Data Platform.
- Wrote a PySpark job in AWS Glue to merge data from multiple tables (a Glue sketch follows at the end of this section).
- Utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
- Generated a script in AWS Glue to transfer the data.
- Utilized AWS Glue to run ETL jobs and perform aggregations in PySpark code.
- Integrated Apache Storm with Kafka to perform web analytics.
- Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Managing and reviewing data backups & log files.
Environment: AWS S3, Java, Maven, Python, Spark, Kafka, Elasticsearch, Mapper Cluster, Amazon Redshift, shell scripts, pandas, PySpark, Pig, Hive, Oozie, JSON, AWS Glue.
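A hedged Boto3 sketch of the S3 fetch step referenced above. The bucket name, prefix, and local paths are hypothetical placeholders, not values from the project.

```python
# List recent objects under an S3 prefix with Boto3 and download them locally.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-raw-bucket"      # hypothetical
PREFIX = "incoming/2023/"          # hypothetical

# List objects under the prefix and sort newest first by LastModified.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
objects = sorted(response.get("Contents", []),
                 key=lambda o: o["LastModified"], reverse=True)

# Download the newest files for downstream PySpark processing.
for obj in objects[:10]:
    key = obj["Key"]
    local_path = "/tmp/" + key.replace("/", "_")
    s3.download_file(BUCKET, key, local_path)
    print("fetched", key, "->", local_path)
```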
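A hedged sketch of an AWS Glue PySpark job of the kind described above, merging two Data Catalog tables (populated by a crawler) and writing the result to S3. The database, table, column, and path names are hypothetical.

```python
# AWS Glue PySpark job: read two catalog tables, join, aggregate, write Parquet.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read tables registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="orders").toDF()
customers = glue_context.create_dynamic_frame.from_catalog(
    database="example_db", table_name="customers").toDF()

# Merge and aggregate with plain Spark DataFrame operations.
merged = orders.join(customers, on="customer_id", how="inner")
summary = merged.groupBy("state").count()

# Write the merged output back to S3 as Parquet.
summary.write.mode("overwrite").parquet("s3://example-output-bucket/summary/")

job.commit()
```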
Confidential, Kenilworth, NJ
Hadoop Developer
Responsibilities:
- Involved in complete project life cycle starting from design discussion to production deployment
- Worked closely with the business team to gather their requirements and new support features
- Involved in running POCs on different use cases of the application and maintained a standard document for best coding practices
- Developed a 200-node cluster while designing the data lake with the Hortonworks distribution
- Responsible for building scalable distributed data solutions using Hadoop
- Installed, configured, and implemented high availability Hadoop Clusters wif required services (HDFS, Hive, HBase, Spark, Zookeeper)
- Implemented Kerberos for authenticating all the services in the Hadoop cluster
- Responsible for installation and configuration of Hive, Pig, HBase, and Sqoop on the Hadoop cluster, and created Hive tables to store the processed results in a tabular format.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data to HDFS using Scala (see the streaming sketch after this section).
- Developed Sqoop scripts to handle the interaction between Hive and the Vertica database.
- Processed data into HDFS by developing solutions, and analyzed the data using MapReduce, Pig, and Hive to produce summary results from Hadoop for downstream systems.
- Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto-scaling, load balancers, Route 53, SES, and SNS in the defined virtual private cloud.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries and Pig scripts.
- Created managed and external tables in Hive and loaded data from HDFS.
- Developed Spark code using Scala and Spark SQL for faster processing and testing, and performed complex HiveQL queries on Hive tables.
- Scheduled several time-based Oozie workflows by developing Python scripts.
- Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
- Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
- Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.
- Designed ETL data pipeline flows to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, and MySQL.
- End-to-end architecture and implementation of client-server systems using Scala, Akka, Java, JavaScript, and related technologies on Linux.
- Optimized Hive tables using techniques like partitioning and bucketing to provide better performance (see the Hive partitioning sketch after this section).
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig, and Sqoop.
- Implemented Hadoop on AWS EC2 using a few instances to gather and analyze data log files.
- Involved in Spark and Spark Streaming, creating RDDs and applying operations, i.e., transformations and actions.
- Created partitioned tables and loaded data using both static and dynamic partitioning methods.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
- Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability, and durability.
- Followed the Test-Driven Development (TDD) process; extensive experience with the Agile and SCRUM programming methodologies.
- Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, cluster monitoring and troubleshooting, and managing and reviewing data backups and log files.
- Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
- Analyzed the Hadoop cluster and different Big Data analytics tools, including Pig, Hive, HBase, and Sqoop.
- Improved performance by tuning Hive and MapReduce.
- Researched, evaluated, and utilized modern technologies, tools, and frameworks in the Hadoop ecosystem.
Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie Scheduler, Java, Shell Scripts, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Scala, ETL, Python.
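A hedged PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion described above (the original work is described as Scala; this is an illustrative Python equivalent). The broker address, topic, and paths are hypothetical, and the spark-sql-kafka package is assumed to be on the classpath.

```python
# Read a Kafka topic as a stream and append it to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Subscribe to a Kafka topic.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka keys/values arrive as bytes; cast to strings before persisting.
parsed = events.select(col("key").cast("string"), col("value").cast("string"))

# Append the stream to HDFS with a checkpoint for fault tolerance.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/clickstream/")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
         .outputMode("append")
         .start())

query.awaitTermination()
```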
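A minimal sketch, run through a Hive-enabled SparkSession, of the external/managed table and partitioning patterns mentioned above, including a dynamic-partition load from a staging table. Table names, columns, and HDFS paths are hypothetical; bucketing is omitted for brevity.

```python
# Hive external table, partitioned managed table, and dynamic-partition insert.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table over raw files already landed in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_claims (
        claim_id STRING, amount DOUBLE, state STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/raw_claims/'
""")

# Managed table partitioned by state, stored as ORC.
spark.sql("""
    CREATE TABLE IF NOT EXISTS claims_by_state (
        claim_id STRING, amount DOUBLE
    )
    PARTITIONED BY (state STRING)
    STORED AS ORC
""")

# Dynamic-partition insert from the external (staging) table.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE claims_by_state PARTITION (state)
    SELECT claim_id, amount, state FROM raw_claims
""")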
Confidential, New York, New York
Spark Developer
Responsibilities:
- Developed Spark applications using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Involved in Agile methodologies, daily scrum meetings, and sprint planning; wrote scripts for distributing query performance test jobs in the Amazon data lake.
- Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data 2 petabytes in size.
- Developed MapReduce jobs for cleaning, accessing, and validating the data, and created and ran Sqoop jobs with incremental load to populate Hive external tables.
- Developed optimal strategies for distributing the weblog data over the cluster, importing and exporting the stored weblog data into HDFS and Hive using Sqoop.
- Installed and configured Apache Hadoop on multiple nodes on AWS EC2, and developed Pig Latin scripts to replace the existing legacy process with Hadoop, with the data fed to AWS S3.
- Responsible for building scalable distributed data solutions using Cloudera Hadoop, and designed and developed automation test scripts using Python.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Analyzed the SQL scripts and designed the solution to implement them using Spark, and implemented Hive generic UDFs to incorporate business logic into Hive queries.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS.
- Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm, and wrote Pig scripts to transform raw data from several data sources into baseline data.
- Worked on MongoDB using its CRUD (Create, Read, Update, and Delete), indexing, replication, and sharding features (see the PyMongo sketch after this section).
- Involved in designing the HBase row key to store text and JSON as key-values in the HBase table, and designed the row key so data can be retrieved/scanned in sorted order.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Created Hive tables and worked on them using HiveQL, and designed and implemented partitioning (static and dynamic) and buckets in Hive.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real-time and persists it into Cassandra (a streaming sketch follows at the end of this section).
- Developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
- Worked on Cluster coordination services through Zookeeper and monitored workload, job performance, and capacity planning using Cloudera Manager
- Involved in building applications using Maven and integrating with CI servers like Jenkins to build jobs.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters, and implemented data ingestion and cluster handling for real-time processing using Kafka.
- Created cubes in Talend to produce different types of aggregations of the data and to visualize them.
- Monitored Hadoop NameNode health status, the number of TaskTrackers running, and the number of DataNodes running, and automated all jobs, from pulling data from different data sources like MySQL to pushing the result-set data to the Hadoop Distributed File System.
- Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, and used GitHub version control tools to maintain project versions.
Environment: Hadoop, HDFS, HBase, Spark, Scala, Hive, MapReduce, Sqoop, ETL, Java, PL/SQL, Oracle 11g, Unix/Linux.
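A hedged PyMongo sketch of the MongoDB CRUD and indexing work referenced above. The connection URI, database, collection, and field names are hypothetical placeholders.

```python
# Basic MongoDB CRUD operations and a secondary index using PyMongo.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")   # hypothetical URI
collection = client["analytics"]["sessions"]

# Create: insert a document.
collection.insert_one({"session_id": "s-001", "user": "alice", "clicks": 12})

# Read: fetch it back.
doc = collection.find_one({"session_id": "s-001"})
print(doc)

# Update: increment a counter field.
collection.update_one({"session_id": "s-001"}, {"$inc": {"clicks": 1}})

# Delete: remove the document.
collection.delete_one({"session_id": "s-001"})

# Secondary index to support lookups by user.
collection.create_index([("user", ASCENDING)])
```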
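A hedged PySpark Structured Streaming sketch of the Kafka-to-Cassandra flow described above, using foreachBatch with the Spark-Cassandra connector (assumed to be on the classpath). The topic, schema, keyspace, table, and paths are hypothetical.

```python
# Stream learner events from Kafka and persist each micro-batch into Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("kafka-to-cassandra-sketch").getOrCreate()

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("score", IntegerType()),
])

# Read learner events from Kafka and parse the JSON payload.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "learner-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Append each micro-batch to the Cassandra table via the connector.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_model")
     .mode("append")
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "hdfs:///checkpoints/learner-model/")
         .start())

query.awaitTermination()
```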