- 7+ years of IT work experience in Big Data Analytics, including analysis, design, development, deployment, and maintenance of projects using Apache Hadoop - HDFS, Amazon S3, MapReduce, YARN, Hive, Impala, Hue, Sqoop, Flume, Spark, and Hadoop APIs.
- Experience delivering on Hadoop clusters using Cloudera and MapR under Agile Scrum methodologies.
- Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments working with various file formats like Avro, Parquet, JSON and ORC.
- Experience in data ingestion, processing and analysis using Spark, Flume, Sqoop and Shell Script.
- Proficient in developing Sqoop jobs for migrating data from RDBMS to Hive/HDFS and vice versa.
- Experience in developing NoSQL applications using MongoDB, HBase, and Cassandra.
- Thorough knowledge of Hadoop architecture and core components: NameNode, DataNodes, JobTracker, TaskTracker, Oozie, Hue, Flume, HBase, etc.
- Strong experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
- Experience in extending Hive and Pig core functionality using custom user-defined functions (UDFs).
- Working knowledge of Oozie, a workflow scheduler system used to manage jobs that run on Pig, Hive, and Sqoop.
- Experience developing Spark applications in Python (PySpark) for easier Hadoop transitions.
- Knowledge of utilizing Flume for real-time data streaming and ingestion.
- Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatch.
- Wrote multiple customized MapReduce programs for various input file formats.
- Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
- Involved in converting SQL queries into HiveQL.
- Designed and created Hive external tables using a shared metastore instead of the embedded Derby database, with partitioning, dynamic partitioning, and buckets.
- Experience using the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Experience in integrating Hive and HBase for effective operations.
- Designed and developed a full-text search feature with multi-tenant Elasticsearch after collecting real-time data through Spark.
- Experienced in working with the Apache Spark ecosystem, using Spark SQL and Python queries on different data file formats such as .txt, .csv, etc.
- Developed data pipelines for real-time use cases using Flume and PySpark; able to develop MapReduce and Spark applications using Scala and Python.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; hands-on experience developing DataFrames and SQL queries in Spark SQL.
- Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
- An excellent team player with good organizational, interpersonal, communication, and leadership skills; a quick learner with a positive attitude and flexibility toward the ever-changing industry.
- Technically strong, with the ability to work with business users, project managers, team leads, architects, and peers, maintaining a healthy project environment.
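As an illustration of the Hive partitioning and bucketing work summarized above, a minimal sketch, held here as HiveQL strings in Python; the table, column, and path names are hypothetical:

```python
# Hypothetical HiveQL illustrating a partitioned, bucketed external table
# of the kind described above; all names are illustrative only.
CREATE_TXNS = """
CREATE EXTERNAL TABLE IF NOT EXISTS txns (
    txn_id   BIGINT,
    amount   DOUBLE,
    customer STRING
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (customer) INTO 32 BUCKETS
STORED AS PARQUET
LOCATION '/data/warehouse/txns';
"""

# Dynamic-partition insert: Hive routes each row to its txn_date partition.
INSERT_DYNAMIC = """
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE txns PARTITION (txn_date)
SELECT txn_id, amount, customer, txn_date FROM staging_txns;
"""
```

An external table keeps the raw files intact when the table is dropped, while partitioning plus bucketing lets Hive prune partitions and sample buckets at query time.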
Distributed Computing: - Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, ZooKeeper, Hue, Impala, Oozie, Spark (Core & SQL)
SDLC Methodologies: - Agile/Scrum, Waterfall
Databases: - MySQL, MSSQL, Oracle, NoSQL (HBase, Cassandra, MongoDB), Teradata, Netezza.
Distributed Filesystems: - HDFS, Amazon S3
Distributed Query Engines: - Hive, Presto
Distributed Computing Environment: - Cloudera, MapR
Operating Systems: - Windows, Mac OS, Unix, Ubuntu
Programming Languages: - Java, Python, UNIX, Pig Latin, HiveQL
Scripting: - Shell Scripting
Version Control: - GitHub
IDE: - PyCharm, Jupyter Notebook, Eclipse (PyDev)
Confidential, Charlotte NC
Big Data Hadoop Developer
- Involved in designing the data pipeline using various Hadoop components such as Spark, Hive, Impala, and Sqoop with the Cloudera distribution.
- Developed Spark (Python API) applications in PyCharm using Spark core libraries (RDD, Spark SQL) for performing ETL transformations, thereby eliminating the need for an ETL tool (SSIS/ODI).
- Developed Spark applications for batch processing.
- Implemented partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables for more efficient data access.
- Using open-source packages, designed a POC to demonstrate integration of Flume with Spark SQL for real-time data ingestion and processing.
- Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
- Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
- Performed data validation using Sqoop on the exported data.
- Performed transformations and analysis by writing complex HQL queries in Hive and exported the results to HDFS in various file formats (JSON, Avro, Parquet, ORC).
- Implemented data ingestion and transformation through automated Oozie workflows.
- Created audit reports with Cloudera Navigator that flag security threats and track all user and tool activity across the Hadoop components in use.
- Developed a strategy for the Hadoop components used to track data lineage and metadata extracted from the pipeline using Cloudera Navigator.
- Designed various plots showing HDFS analytics and other operations performed on the environment.
- Worked with the infrastructure team on testing the environment after patches, upgrades, and migrations.
- Developed multiple Python scripts delivering end-to-end support and common routines while maintaining product integrity.
- Performed analytical queries using Impala (Cloudera) for swifter responses and exported result to Tableau for data visualization.
- Documented all applications worked on and presented them to senior leadership.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.
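The Sqoop CDC (Change Data Capture) process described above could be sketched roughly as follows; the host, database, table, and column names are hypothetical:

```python
import shlex

# Rough sketch of a Sqoop incremental (CDC) import of the kind described
# above; the connection string, table, and column names are hypothetical.
def sqoop_incremental_import(table, check_column, last_value):
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:netezza://nz-host:5480/edw",  # hypothetical source
        "--table", table,
        "--incremental", "lastmodified",   # capture rows changed since last run
        "--check-column", check_column,    # timestamp column driving the CDC
        "--last-value", last_value,        # high-water mark from the prior run
        "--target-dir", f"/data/raw/{table}",
        "--as-parquetfile",
    ]
    return shlex.join(cmd)

print(sqoop_incremental_import("claims", "updated_at", "2016-01-01 00:00:00"))
```

In the real pipeline, the returned high-water mark would be persisted (e.g. by an Oozie coordinator property) so the next run only pulls newly changed rows.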
Confidential, Charlotte NC
Big Data Research Lab Intern
- Deployed a 5-node Hadoop cluster running the big data daemons (HDFS, Spark, Hive, Sqoop) with support for Avro and Parquet formats, with 60 cores, 80 GB RAM, and 25 TB of storage.
- Configured the NameNode, Secondary NameNode, ResourceManager, NodeManagers, and DataNodes.
- Used YARN as the resource manager for MapReduce and Spark applications.
- Implemented Kafka for streaming data; filtered and processed the data.
- Executed various analytical and windowing functions using Spark SQL.
- Worked on the “Framework for Social Network Sentiment Analysis Using Big Data Analytics” project.
- Worked as a Graduate Teaching Assistant.
- Related coursework: Bigdata Solutions for Business.
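The Kafka filter-and-process step above can be mimicked in plain Python; the real pipeline consumed from Kafka topics, and the record shape here is hypothetical:

```python
# Pure-Python stand-in for the Kafka filter-and-process step described above.
# In production, `records` would be messages consumed from a Kafka topic;
# the field names ("user", "score") are illustrative only.
def filter_stream(records, min_score=0.5):
    """Drop malformed or low-score records, normalize the rest."""
    for rec in records:
        if "user" not in rec or rec.get("score", 0.0) < min_score:
            continue  # malformed or below threshold: filtered out
        yield {"user": rec["user"].lower(), "score": rec["score"]}

sample = [
    {"user": "Alice", "score": 0.9},
    {"score": 0.2},                    # malformed: no user field
    {"user": "Bob", "score": 0.4},     # below threshold
]
print(list(filter_stream(sample)))
```

A generator keeps the filter streaming-friendly: records are processed one at a time rather than materialized in memory.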
Confidential, California, CA
- Imported data from different relational data sources, such as Teradata and other RDBMSs, to HDFS using Sqoop.
- Analyzed large data sets by running Hive queries and exported the results as views, flat files, etc.
- Worked with teh Data Science team to gather requirements for various data mining projects and conducted POC.
- Involved in creating Hive managed / external tables while maintaining raw files integrity and analyzed data using hive queries.
- Worked on Spark core Libraries like RDD, Spark SQL extensively by handling structured and unstructured data.
- Designed Spark applications performing ETL transformations using Python API.
- Developed simple to complex MapReduce jobs using Hive.
- Wrote Hive queries to build a consolidated view of the mortgage and retail data.
- Orchestrated hundreds of Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed a Python utility to validate Hive tables ingested with Sqoop against the source RDBMS tables.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Responsible for managing data from multiple sources using Flume.
- Loaded and transformed large data sets consisting of structured, semi-structured, and unstructured data.
- Created and maintained technical documentation for launching Hadoop clusters and executing Hive queries.
- Performed statistical analysis on datasets using Python libraries such as Pandas, NumPy, and SciPy; discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
- Integrated Tableau with Impala as a source to create interactive BI dashboard.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.
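The Sqoop-ingest validation utility described above could work along these lines; in production the rows would come from the RDBMS and Hive, while here they are plain tuples for illustration:

```python
import hashlib

# Sketch of the ingest-validation utility described above: compare row counts
# and an order-independent checksum between source and target extracts.
def table_fingerprint(rows):
    """Row count plus XOR of per-row hashes (order-independent)."""
    digest = 0
    for row in rows:
        h = hashlib.md5("|".join(map(str, row)).encode()).hexdigest()
        digest ^= int(h, 16)  # XOR makes the combined digest order-insensitive
    return len(rows), digest

def validate(source_rows, target_rows):
    """True when both sides hold the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)

src = [(1, "a"), (2, "b")]
tgt = [(2, "b"), (1, "a")]   # same data, different order
print(validate(src, tgt))    # prints True
```

Comparing counts alongside the XOR digest guards against the degenerate case where duplicated rows cancel each other out of the checksum.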
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Experience in defining job flows using Oozie and shell scripts.
- Experienced in implementing various customizations in MapReduce by building custom input formats, custom record readers, partitioners, and data types in Java.
- Experience in ingesting data using Flume from web server logs and telnet sources.
- Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
- Experienced in managing disaster recovery cluster and responsible for data migration and backup.
- Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
- Implemented encryption and masking of customer-sensitive data in Flume by building a custom interceptor that masks and encrypts the data as required, based on rules stored in MySQL.
- Experience in managing and reviewing Hadoop log files.
- Extracted files from an RDBMS through Sqoop, placed them in HDFS, and processed them.
- Experience in running Hadoop streaming jobs to process terabytes of XML-format data.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from UNIX file system to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Experienced in implementing Hive-HBase integration by creating Hive external tables and using the HBase storage handler.
- Executed queries using Hive and developed MapReduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files and load it into HDFS.
- Developed Hive queries for the analysts.
- Involved in loading data from LINUX and UNIX filesystem to HDFS.
- Designed and implemented MapReduce based large scale parallel relation learning system.
- Developed master tables in Hive using a JSON SerDe or Hive's get_json_object / json_tuple functions.
- Designed the entire HDFS data flow so it is orchestrated through Oozie workflows.
Technologies Used - Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, Pig, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.
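The masking logic in the custom Flume interceptor described above could look roughly like this; the real interceptor was in the Java Flume event path, and the field names and masking rules here are hypothetical:

```python
import re

# Illustrative stand-in for the Flume masking interceptor described above.
# The field names and rules are hypothetical; in production the rules were
# driven by a table in MySQL rather than hard-coded.
MASK_RULES = {
    "ssn":   lambda v: "***-**-" + v[-4:],            # keep last 4 digits
    "email": lambda v: re.sub(r"^[^@]+", "****", v),  # hide the local part
}

def mask_event(event):
    """Apply the masking rule for each sensitive field, pass others through."""
    return {k: MASK_RULES[k](v) if k in MASK_RULES else v
            for k, v in event.items()}

print(mask_event({"ssn": "123-45-6789", "email": "jane@corp.com", "city": "NYC"}))
```

Keeping the rules in a lookup table mirrors the rules-in-MySQL design: new fields can be masked by adding a rule, without touching the interceptor logic.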
Jr. Hadoop Developer
- Involved in creating Hive tables, loading them with data, and writing Hive queries to process the data.
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive.
- Implemented partitioning and bucketing in Hive for better organization of the data.
- Worked with the team fetching live stream data from DB2 into HBase tables using Flume.
- Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text file, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
- Involved in the design phase for getting live event data from the database to the front-end application.
- Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs.
- Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
- Collected the log data from web servers and integrated it into HDFS using Flume.
- Responsible for managing data coming from different sources.
- Extracted files from CouchDB, placed them into HDFS using Sqoop, and pre-processed the data for analysis.
- Developed subqueries in Hive.
- Partitioned and bucketed the imported data using HiveQL.
- Partitioned dynamically using the dynamic partition insert feature.
- Moved this partitioned data into different tables per business requirements.
Technologies Used - Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Flume, HBase, CouchDB, Apache Maven.
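An Oozie workflow of the kind used above for RDBMS-to-Hive imports can be sketched as a minimal workflow definition, held here as an XML string in Python; the action, table, and property names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal Oozie workflow of the kind described above: one Sqoop
# action that imports into Hive, with kill-on-error; names are illustrative.
WORKFLOW_XML = """
<workflow-app name="rdbms-to-hive" xmlns="uri:oozie:workflow:0.5">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --table orders --hive-import</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Sqoop import failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

# Sanity-check that the definition is well-formed XML.
root = ET.fromstring(WORKFLOW_XML)
print(root.tag)
```

Each action names its `ok`/`error` transitions, which is how Oozie builds the DAG of control flow between actions.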