Bigdata Hadoop Developer Resume

PROFESSIONAL SUMMARY:

  • 7+ years of IT work experience in Big Data Analytics, including Analysis, Design, Development, Deployment & Maintenance of projects using Apache Hadoop - HDFS, Amazon S3, MapReduce, YARN, Hive, Impala, Hue, Sqoop, Flume, Spark, and Hadoop APIs.
  • Experience working with Hadoop clusters on Cloudera and MapR distributions, following Agile Scrum methodologies.
  • Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments working with various file formats like Avro, Parquet, JSON and ORC.
  • Experience in data ingestion, processing and analysis using Spark, Flume, Sqoop and Shell Script.
  • Efficient in developing Sqoop jobs for migrating data from RDBMS to Hive / HDFS and vice versa.
  • Experience in developing NoSQL applications using MongoDB, HBase and Cassandra.
  • Thorough knowledge of Hadoop architecture and core components such as NameNode, DataNode, JobTracker and TaskTracker, along with Oozie, Hue, Flume, HBase, etc.
  • Strong experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
  • Experience in extending HIVE and PIG core functionality by using Custom User Defined functions.
  • Working knowledge of Oozie, a workflow scheduler system, to manage the jobs that run on Pig, Hive and Sqoop.
  • Experience in Spark applications using Python for easy Hadoop transitions.
  • Knowledge of utilizing Flume technologies for real time data streaming and ingestion.
  • Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatch.
  • Wrote multiple customized MapReduce Programs for various Input file formats.
  • Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
  • Involved in converting SQL queries into HiveQL.
  • Designed and created Hive external tables using a shared metastore instead of the embedded Derby database, with partitioning, dynamic partitioning and buckets.
  • Experience in using the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
  • Experience in integrating Hive and HBase for effective operations.
  • Designed and developed a full-text search feature with multi-tenant Elasticsearch after collecting the real-time data through Spark.
  • Experienced in working with the Apache Spark ecosystem using Spark SQL and Python queries on different data file formats like .txt, .csv, etc.
  • Developed data pipelines for real-time use cases using Flume and PySpark; able to develop MapReduce and Spark applications using Scala and Python.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; hands-on experience in developing DataFrames and SQL queries in Spark SQL.
  • Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
  • An excellent team player with good organizational, interpersonal and communication skills and leadership qualities; a quick learner with a positive attitude and flexibility towards the ever-changing industry.
  • Technically strong, with the ability to work with business users, project managers, team leads, architects and peers, thus maintaining a healthy environment in the project.

TECHNICAL SKILLS:

Distributed Computing: - Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Zookeeper, Hue, Impala, Oozie, Spark (Core & SQL)

SDLC Methodologies: - Agile/Scrum, Waterfall

Databases: - MySQL, MSSQL, Oracle, NoSQL (HBase, Cassandra, MongoDB), Teradata, Netezza

Distributed Filesystems: - HDFS, Amazon S3

Distributed Query Engines: - Hive, Presto

Distributed Computing Environment: - Cloudera, MapR

Operating Systems: - Windows, Mac OS, Unix, Ubuntu

Programming Languages: - Java, Python, UNIX, Pig Latin, HiveQL

Scripting: - Shell Scripting

Version Control: - GitHub

IDE: - PyCharm, Jupyter Notebook, Eclipse (PyDev)

PROFESSIONAL EXPERIENCE:

Confidential, Charlotte NC

Bigdata Hadoop Developer

Responsibilities:

  • Involved in designing the data pipeline using various Hadoop components like Spark, Hive, Impala and Sqoop with the Cloudera distribution.
  • Developed Spark (Python API) applications in PyCharm using Spark core libraries (RDD, Spark SQL) for performing ETL transformations, thereby eliminating the need for an ETL tool (SSIS/ODI).
  • Developed Spark application for Batch processing.
  • Implemented partitioning, dynamic partitioning and bucketing in Hive using internal and external tables for more efficient data access.
  • Using Open-Source packages, designed POC to demonstrate Integration of Flume with Spark SQL for real-time data Ingestion and processing.
  • Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
  • Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
  • Performed data validation using Sqoop on the exported data.
  • Performed transformations / analysis by writing complex HQL queries in Hive and exported result to HDFS in discrete file format (JSON, AVRO, Parquet, ORC).
  • Implemented data ingestion and transformation using automated workflows using Oozie.
  • Using Cloudera Navigator, created audit reports that flag security threats and track all user and tool activity across the various Hadoop components.
  • Developed strategy for various Hadoop components used to track Data Lineage and Meta-Data extracted from Pipeline using Cloudera Navigator.
  • Designed various plots showing HDFS analytics and other operations performed on the environment.
  • Worked with the Infra team to test the environment after patches, upgrades and migrations.
  • Developed multiple python scripts for delivering End-To-End support and common routines, while maintaining product integrity.
  • Performed analytical queries using Impala (Cloudera) for swifter responses and exported result to Tableau for data visualization.
  • Documented all the applications worked on and presented them to senior management.

Technologies Used - Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.

Confidential, Charlotte NC

Bigdata Research Lab Intern

Responsibilities:

  • Deployed a 5-node Hadoop cluster with big data components such as HDFS, Spark, Hive, Avro, Parquet and Sqoop, with 60 cores, 80 GB RAM and 25 TB of storage.
  • Configured the NameNode, Secondary NameNode, Resource Manager, Node Manager and DataNodes
  • Used YARN as a resource manager for MapReduce and Spark applications
  • Implemented Kafka for streaming data, then filtered and processed the data
  • Executed various Analytical functions and Windowing functions using SparkSQL
  • Worked on the “Framework for Social Network Sentiment Analysis Using Big Data Analytics” project
  • Worked as a “Graduate Teaching Assistant”
  • Related coursework: Bigdata Solutions for Business.

Confidential, California, CA

Hadoop Developer

Responsibilities:

  • Imported Data from Different Relational Data Sources like RDBMS, Teradata to HDFS using Sqoop.
  • Analyzed large data sets by running Hive queries and exported the results as views, flat files, etc.
  • Worked with the Data Science team to gather requirements for various data mining projects and conducted POCs.
  • Involved in creating Hive managed / external tables while maintaining raw files integrity and analyzed data using hive queries.
  • Worked on Spark core Libraries like RDD, Spark SQL extensively by handling structured and unstructured data.
  • Designed Spark applications performing ETL transformations using Python API.
  • Developed Simple to complex MapReduce Jobs using Hive.
  • Wrote Hive queries to produce a consolidated view of the mortgage and retail data.
  • Orchestrated hundreds of Sqoop scripts, Pig scripts and Hive queries using Oozie workflows and sub-workflows.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Developed Python utility to validate ingested Hive tables using Sqoop with source RDBMS tables
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Responsible for managing data from multiple sources using Flume.
  • Loaded and transformed large data sets consisting of structured, semi-structured and unstructured data.
  • Created and maintained technical documentation for launching HADOOP Clusters and for executing Hive queries.
  • Performed statistical analysis on datasets using Python libraries like Pandas, NumPy, and SciPy. Discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
  • Integrated Tableau with Impala as a source to create interactive BI dashboard.

Technologies Used - Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Importing and exporting data into HDFS and Hive using Sqoop.
  • Experience in defining job flows using Oozie and shell scripts.
  • Experienced in implementing various customizations in MapReduce at various levels by implementing custom input formats, custom record readers, partitioners and data types in java.
  • Experience in ingesting data using flume from web server logs and telnet sources.
  • Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
  • Experienced in managing disaster recovery cluster and responsible for data migration and backup.
  • Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
  • Implemented encryption and masking of customer-sensitive data in Flume by building a custom interceptor that masks and encrypts the data as per the requirements, based on rules stored in MySQL.
  • Experience in managing and reviewing Hadoop log files.
  • Extracted files from RDBMS through Sqoop and placed in HDFS and processed.
  • Experience in running Hadoop streaming jobs to process terabytes of xml format data.
  • Supported MapReduce programs running on the cluster.
  • Involved in loading data from UNIX file system to HDFS.
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
  • Experience in implementing Hive-HBase integration by creating Hive external tables and using the HBase storage handler.
  • Executed queries using Hive and developed MapReduce jobs to analyze data.
  • Developed Pig Latin scripts to extract the data from the web server output files and load it into HDFS.
  • Developed Hive queries for the analysts.
  • Involved in loading data from LINUX and UNIX filesystem to HDFS.
  • Designed and implemented MapReduce based large scale parallel relation learning system.
  • Developed master tables in Hive using a JSON SerDe or the get_json_object and json_tuple functions of Hive.
  • Designed the entire data flow in HDFS so that it could be orchestrated using Oozie workflows.

Technologies Used - Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, Pig, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.

Confidential

Jr. Hadoop Developer

Responsibilities:

  • Involved in creating Hive tables, loading them with data and writing Hive queries to process the data.
  • Developing and maintaining Workflow Scheduling Jobs in Oozie for importing data from RDBMS to Hive.
  • Implemented partitioning and bucketing in Hive for better organization of the data.
  • Worked with the team on fetching live stream data from DB2 into HBase tables using Flume.
  • Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text file, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
  • Involved in the design phase for getting live event data from the database to the front-end application.
  • Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs.
  • Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
  • Collected the log data from web servers and integrated it into HDFS using Flume.
  • Responsible to manage data coming from different sources.
  • Extracted files from CouchDB, placed them into HDFS using Sqoop and pre-processed the data for analysis.
  • Developed subqueries in Hive.
  • Partitioned and bucketed the imported data using HiveQL.
  • Partitioned data dynamically using the dynamic partition insert feature.
  • Moved this partitioned data into different tables as per business requirements.

Technologies Used - Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Flume, HBase, CouchDB, Apache Maven.
