Hadoop/Spark Developer Resume

SUMMARY

6+ years of professional IT experience, including 5 years of extensive experience in data integration, data engineering and data analytics using Big Data systems like Hadoop and Spark.
Very strong knowledge of Hadoop ecosystem components like HDFS, MapReduce, Spark, Hive, Pig, Sqoop, Flume, Kafka, Oozie and HBase.
Strong experience utilizing programming languages like Java, Scala and Python.
Strong knowledge of the architecture of distributed systems and parallel processing frameworks.
In-depth understanding of the Spark execution model and the internals of the MapReduce framework.
Expertise in developing production-ready Spark applications utilizing the Spark Core, DataFrames, Spark SQL, Spark ML and Spark Streaming APIs.
Experience working with various Hadoop distributions like Cloudera (CDH3, 4 and 5) and Hortonworks (HDP).
Worked extensively on AWS Cloud services like EMR, S3, Redshift, Athena and Glue.
Good knowledge of fine-tuning resources for long-running Spark applications to achieve better parallelism and allocate more executor memory for caching.
Strong experience working with both batch and real-time processing using Spark frameworks.
Proficient in Apache Spark and Scala programming for analyzing large datasets and processing real-time data.
Worked extensively on Hive for data analytics and ETL modelling.
Strong knowledge of performance tuning Hive queries and troubleshooting issues related to joins and memory exceptions in Hive.
Very good understanding of partitioning and bucketing concepts in Hive; designed both internal and external tables in Hive to optimize performance.
Strong experience using different columnar file formats like confidential and Parquet formats.
Hands-on experience in installing, configuring and deploying Hadoop distributions both in-house and in the cloud.
Experience in optimizing MapReduce algorithms by using Combiners and custom Partitioners.
Experience in NoSQL databases like HBase, Apache Cassandra and MongoDB and their integration with a Hadoop cluster.
Expertise in back-end/server-side Java technologies such as web services, Java Persistence API (JPA), Java Message Service (JMS) and Java Database Connectivity (JDBC).
Experienced with scripting languages like Python and shell.
Experienced in data processing tasks such as collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
Extensive experience in ETL processes consisting of data transformation, data sourcing, mapping, conversion and loading.
In-depth understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm.
Worked with Sqoop to move (import/export) data between relational databases and Hadoop.
Knowledge of UNIX shell scripting for automating deployments and other routine tasks.
Experienced in using agile methodologies including Extreme Programming, Scrum and Test-Driven Development (TDD).
Used custom SerDes like RegexSerDe, JSONSerDe and CSVSerDe in Hive to handle multiple data formats.
Intensive work experience in developing enterprise solutions using Java, J2EE, Servlets, JSP, JDBC, Struts, Spring, Hibernate, JavaBeans, JSF and MVC.
Experience in building and deploying web applications on multiple application servers and middleware platforms including WebLogic, WebSphere, Apache Tomcat and JBoss.
Experience in using version control tools like Bitbucket, Git and SVN.
Experience in writing build scripts using Maven, Ant and Gradle.
Flexible, enthusiastic and project-oriented team player with excellent communication and leadership skills, able to develop creative solutions for challenging client requirements.

TECHNICAL SKILLS:

Big Data Ecosystems: HDFS, MapReduce, YARN, Hive, Storm, Sqoop, Pig, Spark, HBase, Scala, Flume, Zookeeper, Oozie.

NoSQL Databases: HBase, Cassandra, MongoDB.

Java & J2EE Technologies: Java, J2EE, JDBC, SQL, PL/SQL, JavaScript, C, Hibernate 3.0, Spring 3.x, Struts.

Cloud technologies: Azure, Data Pipeline, Redshift, EMR.

Languages: Java, Scala, Python, SQL, Pig Latin, HiveQL, Shell Scripting.

Database: Microsoft SQL Server, MySQL, Oracle, DB2.

Web/Application Servers: WebLogic, WebSphere, JBoss, Tomcat.

IDEs & Utilities: Eclipse, JCreator, NetBeans.

Operating Systems: UNIX, Windows, Mac, LINUX.

GUI Technologies: HTML, XHTML, CSS, JavaScript, Ajax, AngularJS.

Business Intelligence tools: Tableau, Splunk, QlikView.

Development Methodologies: Agile, V-Model, Waterfall Model, Scrum.

EXPERIENCE:

Confidential, Richmond, VA

Hadoop/Spark Developer

Responsibilities:

Ingested gigabytes of clickstream data from external sources such as FTP servers and S3 buckets on a daily basis using custom-built input adapters.
Created Sqoop scripts to import/export data between RDBMS and the S3 data store.
Developed various Spark applications using Scala to perform cleansing, transformation and enrichment of this clickstream data.
Involved in data cleansing, event enrichment, data aggregation, de-normalization and data preparation needed for machine learning and reporting.
Utilized Spark Scala API to implement batch processing of jobs.
Troubleshot Spark applications to improve error tolerance and reliability.
Fine-tuned Spark applications/jobs to improve the efficiency and overall processing time of the pipelines.
Used the Kafka producer API to send live-stream JSON data into various Kafka topics.
Developed Spark Streaming applications to consume data from Kafka topics and insert the processed streams into HBase.
Utilized Spark's in-memory capabilities to handle large datasets.
Used broadcast variables, efficient joins, transformations and other Spark capabilities for data processing.
Worked with EMR clusters and S3 in the AWS cloud.
Created Hive tables, and loaded and analyzed data using Hive scripts. Implemented partitioning, dynamic partitions and buckets in Hive.
Involved in continuous integration of the application using Jenkins.
Interacted with the infrastructure, network, database, application and BA teams to ensure data quality and availability.

Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, Java

Confidential, San Jose, CA

Spark Developer

Responsibilities:

Involved in developing the roadmap for migration from the legacy system to a Hadoop cluster.
Created, validated and maintained scripts to load data manually using Sqoop.
Loaded and transformed large sets of structured, semi-structured and unstructured data coming from various downstream systems.
Migrated data between RDBMS and HDFS/Hive with Sqoop.
Created, validated and maintained scripts to extract and transform data from MySQL to flat files and JSON format.
Created Oozie workflows and coordinators to automate the workflows on a daily and monthly basis.
Worked on reading multiple data formats on HDFS using Apache Spark.
Wrote Scala scripts for Spark to perform operations such as data inspection, cleaning, loading and transformation of large sets of structured and semi-structured imported data.
Involved in migrating code from Hive to Apache Spark and Scala using Spark SQL and RDDs.
Developed Spark applications with Scala and Spark SQL for testing and processing of data.
Developed, validated and maintained HiveQL queries.
Designed Hive tables to load data to and from external HDFS datasets.
Created partitioned and bucketed tables in Hive and designed both managed and external tables in Hive for optimized performance.
Managed and scheduled Oozie jobs on a Hadoop cluster.

Environment: Hadoop, HDFS, Apache Spark 1.6, Spark-SQL, Unix, Hive, Sqoop, Flume, Scala, Oozie, DB2

Confidential, Nashville, TN

Hadoop Developer

Responsibilities:

Extensively involved in installation and configuration of the Cloudera Hadoop distribution, including NameNode, Secondary NameNode, JobTracker, TaskTrackers and DataNodes.
Developed MapReduce programs in Java and used Sqoop to import data from an Oracle database.
Responsible for building scalable distributed data solutions using Hadoop. Wrote various Hive and Pig scripts.
Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
Experienced with scripting languages like Python and shell.
Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks and performance analysis.
Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
Experienced with handling administration activities using Cloudera Manager.
Expertise in partitioning and bucketing concepts in Hive.
Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
Responsible for loading data from UNIX file system to HDFS.
Analyzed weblog data using HiveQL and integrated Oozie with the rest of the Hadoop stack.
Utilized cluster coordination services through Zookeeper.
Gained good experience with various NoSQL databases and comprehensive knowledge of process improvement, normalization/de-normalization, data extraction, data cleansing and data manipulation.
Experience with creating scripts for data modeling and data import and export. Extensive experience in deploying, managing and developing MongoDB clusters.
Created Partitioned Hive tables and worked on them using HiveQL.
Developed Shell scripts to automate routine DBA tasks.
Used Maven extensively to build JAR files of MapReduce programs and deployed them to the cluster.
Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring, troubleshooting, managing and reviewing data backups and Hadoop log files.

Environment: HDFS, MapReduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, Java, Maven, Avro, Cloudera, Eclipse and Shell Scripting.

Confidential, San Mateo, CA

Responsibilities:

Performed aggregations and analysis on large sets of log data; collected the log data using custom-built input adapters and Sqoop.
Developed MapReduce programs for data extraction, transformation and aggregation.
Monitored and troubleshot MapReduce jobs running on the cluster.
Implemented solutions for ingesting data from various sources and processing the data utilizing Hadoop services like Sqoop, Hive, Pig, HBase and MapReduce.
Worked on creating Combiners, Partitioners and Distributed Cache to improve the performance of MapReduce jobs.
Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
Experienced in handling Avro data files by passing schemas into HDFS using Avro tools and MapReduce.
Optimized MapReduce algorithms using Combiners and Partitioners to deliver the best results and worked on application performance optimization for an HDFS cluster.
Orchestrated many Sqoop scripts, Pig scripts, Hive queries using Oozie workflows and sub workflows.
Used Flume to collect, aggregate and store web log data from different sources like web servers and pushed it to HDFS.
Involved in creating Hive tables, loading them with data and writing Hive queries that invoke and run MapReduce jobs in the backend.
Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
Involved in debugging MapReduce jobs using the MRUnit framework and optimizing MapReduce jobs.
Involved in troubleshooting errors in Shell, Hive and MapReduce.
Worked on debugging and performance tuning of Hive and Pig jobs.
Designed and implemented MapReduce jobs to support distributed processing using MapReduce, Hive and Apache Pig.
Created Hive external tables on the MapReduce output before applying partitioning and bucketing on top of them.

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Oozie, MySQL, SVN, PuTTY, Zookeeper, UNIX, Shell Scripting, HiveQL, NoSQL database (HBase), RDBMS, Eclipse, Oracle 11g.
