
Hadoop Developer Resume

SUMMARY

  • 7+ years of IT experience in analysis, design, development and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python and Scala.
  • Strong experience writing Spark Core, Spark SQL, Spark Streaming, Java MapReduce, Spark on Java Applications.
  • Experienced with the analytical functions of Apache Spark, Hive and Pig, and in extending their functionality by writing custom UDFs and hooking them into larger Spark applications as in-line functions.
  • Experience with installing, backup, recovery, configuration and development on multiple Hadoop distribution platforms, Cloudera and Hortonworks, as well as on the Amazon AWS and Google Cloud platforms.
  • Highly skilled in Optimizing and moving large scale pipeline applications from on-premise clusters to AWS Cloud.
  • Working knowledge of spinning up, configuring and maintaining long-running Amazon EMR clusters, both manually and through CloudFormation scripts, on Amazon AWS.
  • Experienced in building frameworks for Large scale streaming applications in Apache Spark.
  • Worked on migrating Hadoop MapReduce programs to Apache Spark on Scala.
  • Extensive hands-on knowledge of working on the Amazon AWS and Google Cloud Architecture.
  • Highly skilled in integrating Amazon Kinesis streams with Spark Streaming applications to build long running real-time applications.
  • Configuring Kinesis Shards for optimal throughput in Kinesis Streams for Spark Streaming Applications on AWS.
  • Solid understanding of RDD operations in Apache Spark i.e., Transformations & Actions, Persistence (Caching), Accumulators, Broadcast Variables, Optimizing Broadcasts.
  • In-depth knowledge of handling large amounts of data utilizing Spark Data Frames/Datasets API and Case Classes.
  • Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
  • Working knowledge of utilizing Hadoop file formats such as Sequence, ORC, Avro, Parquet as well as open source Text/CSV and JSON formatted files.
  • In-depth knowledge of Big Data architecture and the various components of Hadoop 1.x and 2.x such as HDFS, Job Tracker, Task Tracker, Data Node and Name Node, and YARN concepts such as Resource Manager and Node Manager.
  • Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, DynamoDB, Workspaces, Lambda, Kinesis, SNS, SQS).
  • Wrote HiveQL and Pig Latin scripts, leading to a good understanding of MapReduce design patterns and of data analysis using Hive and Pig.
  • Great knowledge of working with Apache Spark Streaming API on Big Data Distributions in an active cluster environment.
  • Very capable at using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
  • Very well versed in writing and deploying Oozie Workflows and Coordinators, and in scheduling, monitoring and troubleshooting them through the Hue UI.
  • Proficient in importing and exporting data from Relational Database Systems to HDFS and vice versa, using Sqoop.
  • Good understanding of NoSQL databases such as HBase, Cassandra and MongoDB in enterprise use cases.
  • Very capable in processing of large sets of structured, semi-structured and unstructured data and supporting system application architecture in Hadoop, Spark and SQL databases such as Teradata, MySQL, DB2.
  • Working experience with Impala, Mahout, Spark, Storm, Avro, Kafka, Hue and AWS.
  • Experienced in version control and source code management tools like GIT, SVN, and Bitbucket.
  • Software development experience in Java application development and client/server applications, implementing application environments using MVC, J2EE, JDBC, JSP, XML technologies (XML, XSL, XSD), Web Services, relational databases and NoSQL databases.
  • Hands-on experience in application development using Java, RDBMS, and Linux shell scripting, Perl.
  • Hands-on experience with IDEs such as Eclipse, IntelliJ, NetBeans and Visual Studio, with Git and Maven, and in writing cohesive end-to-end applications on Apache Zeppelin.
  • Experience working in Waterfall and Agile - SCRUM methodologies.
  • Ability to adapt to evolving technologies, a strong sense of responsibility and accomplishment.
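The Kinesis shard sizing mentioned above follows directly from the documented per-shard ingest limits (1 MB/s and 1,000 records/s per shard); a minimal sketch of the estimate, with purely hypothetical traffic numbers:

```python
import math

# Per-shard ingest limits documented for Amazon Kinesis Data Streams:
# 1 MB/s of data and 1,000 records/s per shard.
SHARD_MB_PER_SEC = 1.0
SHARD_RECORDS_PER_SEC = 1000

def required_shards(mb_per_sec: float, records_per_sec: int) -> int:
    """Estimate the minimum shard count for a target write throughput."""
    by_bytes = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    # The stream must satisfy both limits, so take the larger requirement.
    return max(by_bytes, by_records, 1)

# Hypothetical workload: 4.5 MB/s of events arriving at 2,300 records/s.
print(required_shards(4.5, 2300))  # -> 5
```

Over-provisioning slightly beyond this floor leaves headroom for traffic spikes before throttling kicks in.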

PROFESSIONAL EXPERIENCE

Confidential

Hadoop Developer

Responsibilities:

  • Worked on developing architecture document and proper guidelines
  • Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Processed Multiple Data sources input to same Reducer using Generic Writable and Multi Input format.
  • Worked on Big Data processing of clinical and non-clinical data using MapReduce.
  • Visualized HDFS data for customers in a BI tool with the help of the Hive ODBC driver.
  • Customized a BI tool for the management team to perform query analytics using HiveQL.
  • Imported data using Sqoop to load data from MySQL to HDFS on regular basis.
  • Created Partitions, Buckets based on State to further process using Bucket based Hive joins.
  • Created Hive generic UDFs to process business logic that varies based on policy.
  • Moved relational database data using Sqoop into Hive dynamic-partition tables using staging tables.
  • Experienced in Monitoring Cluster using Cloudera manager.
  • Involved in Discussions with business users to gather the required knowledge.
  • Capable of creating real time data streaming solutions and batch style large scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
  • Analyzing the requirements to develop the framework.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
  • Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
  • Developed Java Spark Streaming scripts to load raw files and their corresponding metadata files into AWS S3 and an Elasticsearch cluster.
  • Developed Python Scripts to get the recent S3 keys from Elasticsearch.
  • Elaborated Python Scripts to fetch/get S3 files using Boto3 module.
  • Implemented PySpark logic to transform and process various formats of data like XLSX, XLS, JSON, TXT.
  • Built scripts to load PySpark-processed files into Redshift and applied diverse PySpark logic.
  • Developed scripts to monitor and capture the state of each file as it moves through the pipeline.
  • Developed Map Reduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie Operational Services for batch processing and scheduling workflows dynamically.
  • Included migration of existing applications and development of new applications using AWS cloud services.
  • Configured Spark Streaming to receive real time data from the Apache Kafka and store the stream data to HDFS using Scala.
  • Worked with data investigation, discovery and mapping tools to scan every single data record from many sources.
  • Implemented Shell script to automate the whole process.
  • Fine-tuned PySpark applications/jobs to improve the efficiency and overall processing time of the pipelines.
  • Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on the Hortonworks Data Platform.
  • Wrote PySpark jobs in AWS Glue to merge data from multiple tables
  • Utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions
  • Generated a script in AWS Glue to transfer the data
  • Utilized AWS Glue to run ETL jobs and run aggregations in PySpark code.
  • Integrated Apache Storm with Kafka to perform web analytics.
  • Uploaded click stream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
  • Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Managing and reviewing data backups & log files.
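The regular MySQL-to-HDFS loads above were driven by Sqoop; a minimal sketch of how such an import command can be assembled, where the connection string, table, path and column names are placeholders rather than actual project values:

```python
def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4,
                     check_column=None, last_value=None):
    """Build a `sqoop import` command line for a MySQL -> HDFS load.

    When check_column/last_value are given, the import runs incrementally
    in append mode, pulling only rows newer than the last recorded value.
    """
    parts = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]
    if check_column is not None:
        parts += ["--incremental", "append",
                  "--check-column", check_column,
                  "--last-value", str(last_value)]
    return " ".join(parts)

# Placeholder connection details, for illustration only.
cmd = sqoop_import_cmd("jdbc:mysql://db-host/sales", "orders",
                       "/data/raw/orders",
                       check_column="order_id", last_value=91250)
print(cmd)
```

Keeping the command generation in a script makes it easy to schedule the same import from cron or an Oozie shell action with a refreshed `--last-value`.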

Environment: AWS S3, Java, Maven, Python, Spark, Scala, Kafka, Elasticsearch, MapR cluster, Amazon Redshift, shell scripts, pandas, PySpark, Pig, Hive, Oozie, JSON, AWS Glue.

Confidential

Hadoop Developer

Responsibilities:

  • Involved in complete project life cycle starting from design discussion to production deployment
  • Worked closely with the business team to gather their requirements and new support features
  • Involved in running POCs on different use cases of the application and maintained a standard document for best coding practices
  • Developed a 200-node cluster while designing the data lake with the Hortonworks distribution
  • Responsible for building scalable distributed data solutions using Hadoop
  • Installed, configured and implemented high availability Hadoop Clusters with required services (HDFS, Hive, HBase, Spark, Zookeeper)
  • Implemented Kerberos for authenticating all the services in Hadoop Cluster
  • Responsible for installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster, and created Hive tables to store the processed results in a tabular format.
  • Configured Spark Streaming to receive real time data from the Apache Kafka and store the stream data to HDFS using Scala.
  • Developed Sqoop scripts to enable interaction between Hive and the Vertica database.
  • Processed data into HDFS by developing solutions and analyzed the data using Map Reduce, PIG, and Hive to produce summary results from Hadoop to downstream systems.
  • Built servers using AWS: importing volumes, launching EC2 instances, creating security groups, auto-scaling, load balancers, Route 53, SES and SNS in the defined virtual private cloud.
  • Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
  • Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
  • Created Managed tables and External tables in Hive and loaded data from HDFS.
  • Developed Spark code by using Scala and Spark-SQL for faster processing and testing and performed complex HiveQL queries on Hive tables.
  • Scheduled several time-based Oozie workflows by developing Python scripts.
  • Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
  • Exported data using Sqoop to RDBMS servers and processed that data for ETL operations.
  • Worked with S3 buckets on AWS to store CloudFormation templates and created EC2 instances on AWS.
  • Designed the ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop and MySQL.
  • End-to-end architecture and implementation of client-server systems using Scala, Akka, Java, JavaScript and related technologies on Linux
  • Optimized Hive tables using techniques like partitioning and bucketing to provide better performance.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
  • Implemented Hadoop on the AWS EC2 system using a few instances for gathering and analyzing data log files.
  • Involved in Spark and Spark Streaming, creating RDDs and applying operations: transformations and actions.
  • Created partitioned tables and loaded data using both static partition and dynamic partition method.
  • Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
  • Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability and durability.
  • Followed the Test-Driven Development (TDD) process; extensive experience with Agile and Scrum methodologies.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala
  • Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
  • Involved in Cluster maintenance, Cluster Monitoring and Troubleshooting, Manage and review data backups and log files.
  • Designed and implemented MapReduce jobs to support distributed processing using Java, Hive and Apache Pig
  • Analyzed the Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase and Sqoop.
  • Improved performance by tuning Hive and MapReduce jobs.
  • Researched, evaluated and utilized modern technologies/tools/frameworks around the Hadoop ecosystem.
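The partitioning and bucketing optimizations described above correspond to Hive DDL of roughly the following shape; the helper and the table/column names (`claims`, `state`, `member_id`) are illustrative, not taken from the project:

```python
def hive_ddl(table, columns, partition_cols, bucket_col=None, buckets=0):
    """Render a CREATE TABLE statement with partitioning and optional bucketing.

    Partition columns live in PARTITIONED BY and must not repeat in the
    main column list; bucketing (CLUSTERED BY ... INTO n BUCKETS) is what
    enables bucket-based Hive joins of the kind mentioned above.
    """
    col_list = ", ".join(f"{name} {typ}" for name, typ in columns)
    part_list = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    ddl = f"CREATE TABLE {table} ({col_list}) PARTITIONED BY ({part_list})"
    if bucket_col:
        ddl += f" CLUSTERED BY ({bucket_col}) INTO {buckets} BUCKETS"
    return ddl + " STORED AS ORC"

# Hypothetical claims table, partitioned by state and bucketed by member id.
print(hive_ddl("claims",
               [("claim_id", "BIGINT"), ("amount", "DOUBLE")],
               [("state", "STRING")],
               bucket_col="member_id", buckets=32))
```

Generating DDL from a script keeps the partition and bucket spec consistent across the many environment-specific tables a pipeline creates.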

Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Vertica, Oozie scheduler, Java, shell scripts, Teradata, Oracle, HBase, MongoDB, Cassandra, Cloudera, AWS, JavaScript, JSP, Kafka, Spark, Scala, ETL, Python.

Confidential

Hadoop Developer

Responsibilities:

  • Provided application demo to the client by designing and developing a search engine, report analysis trends, application administration prototype screens using AngularJS, and Bootstrap JS.
  • Took ownership of the complete application design of the Java part and the Hadoop integration
  • Apart from the normal requirement gathering, participated in a Business meeting with the client to gather security requirements.
  • Assisted the architect in analyzing the existing and future systems; prepared design blueprints and application flow documentation
  • Experienced in managing and reviewing Hadoop log files; loaded and transformed large sets of structured, semi-structured and unstructured data
  • Responsible for managing data coming from different sources and applications; supported MapReduce programs running on the cluster
  • Responsible for working with message broker systems such as Kafka; extracted data from mainframes, fed it to Kafka and ingested it into HBase to perform analytics
  • Wrote an event-driven link-tracking system to capture user events and feed them to Kafka, which pushes them to HBase.
  • Created MapReduce jobs to extract the contents from HBase and configured them in an Oozie workflow to generate analytical reports.
  • Worked on setting up Kafka for streaming data and monitoring for the Kafka Cluster.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Participated in SOLR schema design and ingested data into SOLR for data indexing.
  • Wrote MapReduce programs to organize the data and ingest it in a client-specified format suitable for analytics
  • Hands-on experience in writing Python scripts to optimize performance; implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.
  • Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them; implemented Bloom filters in Cassandra during keyspace creation
  • Involved in writing Cassandra CQL statements; good hands-on experience in developing concurrency using Spark and Cassandra together
  • Involved in writing Spark applications using Scala; hands-on experience in creating RDDs, transformations and actions while implementing Spark applications
  • Good knowledge of creating data frames using Spark SQL; involved in loading data into the Cassandra NoSQL database
  • Implemented record-level atomicity on writes using Cassandra; wrote Pig scripts to query and process the datasets to find trend patterns by applying client-specific criteria, and configured Oozie workflows to run these jobs along with the MR jobs
  • Stored the derived results of the analysis in HBase and made them available to data ingestion for SOLR indexing
  • Involved in integration of the Java search UI, SOLR and HDFS; involved in code deployments using Jenkins as the continuous integration tool
  • Documented all the challenges, issues involved to deal with the security system and Implemented best practices
  • Created project structures and configurations according to the project architecture and made them available for junior developers to continue their work
  • Handled the onsite coordinator role to deliver work to offshore; involved in code reviews and application lead support activities
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
  • Involved in creating Hive tables, loading them with data and writing Hive queries which run internally as MapReduce jobs.
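The Storm cleansing step described above (scrub each record before it reaches Cassandra) reduces, per record, to logic like the following; the field names are hypothetical, not taken from the project:

```python
from typing import Optional

# Fields a record must carry to be storable; illustrative names only.
REQUIRED = ("event_id", "user_id", "ts")

def cleanse(record: dict) -> Optional[dict]:
    """Normalize one event before it is written downstream.

    Drops records missing required fields, trims string values, and
    lower-cases the user id so lookups hit a single Cassandra row key.
    """
    if any(not record.get(k) for k in REQUIRED):
        return None  # incomplete events are filtered out, not stored
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}
    cleaned["user_id"] = cleaned["user_id"].lower()
    return cleaned

print(cleanse({"event_id": "e1", "user_id": " Alice ", "ts": 1700000000}))
print(cleanse({"event_id": "e2", "user_id": "", "ts": 1700000001}))  # -> None
```

In a Storm topology this function would be the body of a cleansing bolt's `execute` method, with `None` results simply acked and not emitted.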

Environment: Cassandra, Spring 3.2, Spring Data, Pig, Hive, Apache Avro, MapReduce, Sqoop, Zookeeper, SVN, Jenkins, Spark, HBase.

Confidential

Hadoop / Scala Developer

Responsibilities:

  • Responsible for implementation and ongoing administration of Hadoop infrastructure and setting up infrastructure
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames and Pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved the data in Parquet format in HDFS.
  • Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as a data pipe-line system.
  • Configured Spark streaming to get ongoing information from the Kafka and store the stream information to HDFS.
  • Used Spark and Spark-SQL to read the parquet data and create the tables in hive using the Scala API.
  • Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
  • Worked on Spark using Python and Spark SQL for faster testing and processing of data.
  • Implemented sample Spark programs in Python using PySpark.
  • Analyzed SQL scripts and designed solutions to implement them using PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premise environment.
  • Developed multiple Kafka Producers and Consumers as per the software requirement specifications.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building common learner data model which gets the data from Kafka in near real time and persist it to Cassandra.
  • Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
  • Used Kafka and Kafka brokers, initiated the Spark context and processed live streaming information with RDDs; used Kafka to load data into HDFS and NoSQL databases.
  • Used Zookeeper to store offsets of messages consumed for a specific topic and partition by a specific Consumer Group in Kafka.
  • Used Kafka functionalities like distribution, partition, replicated commit log service for messaging systems by maintaining feeds and created applications, which monitors consumer lag within Apache Kafka clusters.
  • Using Spark-Streaming APIs to perform transformations and actions on the fly for building the common learner data model.
  • Involved in Cassandra Cluster planning and had good understanding in Cassandra cluster mechanism.
  • Responsible for developing a Spark-Cassandra connector to load data from flat files into Cassandra for analysis; modified the cassandra.yaml and cassandra-env.sh files to set various configuration properties.
  • Used Sqoop to import data into Cassandra tables from different relational databases like Oracle and MySQL, and designed column families.
  • Maintained ELK (Elasticsearch, Logstash and Kibana) and wrote Spark scripts using the Scala shell.
  • Worked in AWS environment for development and deployment of custom Hadoop applications.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
  • Wrote Oozie workflows to run the Sqoop and HQL scripts in Amazon EMR.
  • Involved in creating custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL.
  • Developed shell scripts to generate Hive CREATE statements from the data and to load data into the tables.
  • Involved in writing custom MapReduce programs using the Java API for data processing.
  • Created Hive tables as per requirements, as internal or external tables, defined with appropriate static or dynamic partitions and bucketing for efficiency.
  • Worked on Apache NiFi: executing Spark and Sqoop scripts through NiFi, creating scatter-and-gather patterns, ingesting data from Postgres into HDFS, fetching Hive metadata and storing it in HDFS, and creating a custom NiFi processor for filtering text from flow files.
  • Cluster coordination services through Zookeeper.
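The consumer-lag monitoring described above boils down to comparing the brokers' log-end offsets with the offsets the consumer group has committed (stored in ZooKeeper on older Kafka versions, as in this setup); the offsets below are hypothetical:

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Compute per-partition and total lag for one consumer group.

    Lag is the distance between the broker's latest offset and the
    offset the group has committed; a partition with no committed
    offset is treated as starting from 0.
    """
    per_partition = {
        p: max(log_end_offsets[p] - committed_offsets.get(p, 0), 0)
        for p in log_end_offsets
    }
    return {"per_partition": per_partition,
            "total": sum(per_partition.values())}

# Hypothetical offsets for a 3-partition topic.
lag = consumer_lag({0: 1500, 1: 980, 2: 2000},
                   {0: 1450, 1: 980, 2: 1700})
print(lag["total"])  # -> 350
```

An alerting job would run this calculation on a schedule and page when the total (or any single partition's lag) stays above a threshold, which is what "monitoring consumer lag within Apache Kafka clusters" amounts to in practice.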

Environment: Scala, Hive, Kafka, Zookeeper, Avro, EMR, Python, Java, Cassandra.
