
Hadoop Developer Resume


CA

SUMMARY:

  • 5 years of overall experience in building and developing Hadoop MapReduce solutions, along with experience using Hive, Impala, Pig, Spark, Flume, and Kafka.
  • Experience in installation, configuration, supporting and monitoring Hadoop clusters using Cloudera distributions and AWS.
  • Good experience in writing Python Scripts.
  • Good experience with both JobTracker (MapReduce 1) and YARN (MapReduce 2).
  • Good experience in Spark and its related technologies such as Spark SQL and Spark Streaming.
  • Working experience in DevOps environment.
  • Experience in defining detailed application software test plans, including organization, participant, schedule, test and application coverage scope.
  • Good understanding of Apache Hue and Accumulo.
  • Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimating effort, designing custom solutions, development, leading developers, producing documentation, and production support.
  • Good understanding of version control systems such as GitHub and SVN.
  • Experience in importing and exporting data using Sqoop between HDFS and relational database systems.
  • Experience in converting Hive queries into Spark transformations using Spark RDDs and Scala (see the sketch after this list).
  • Experience with the RDD architecture, implementing Spark operations on RDDs, and optimizing Spark transformations and actions.
  • Expertise in using various tools in Hadoop ecosystem including MapReduce, Hive, Pig, Oozie, Sqoop, Hbase, Flume, Spark, Kafka, and Zookeeper.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
  • Experience extending Hive and Pig core functionality by writing custom UDFs.
  • Experience in analyzing data using HQL, Pig Latin, and custom MapReduce programs in core Java.
  • Knowledge of job workflow scheduling and monitoring tools such as Oozie.
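
Illustration (a minimal sketch, not from the original projects): the Hive-to-Spark conversion mentioned above, expressed here in PySpark against a hypothetical sales.orders table; the actual work used Scala and RDDs.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("hive-to-spark-example")   # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Original HiveQL form of the query:
    #   SELECT country, COUNT(*) AS orders
    #   FROM sales.orders WHERE status = 'SHIPPED' GROUP BY country
    hive_style = spark.sql(
        "SELECT country, COUNT(*) AS orders "
        "FROM sales.orders WHERE status = 'SHIPPED' GROUP BY country"
    )

    # Equivalent Spark DataFrame transformations (table and columns are hypothetical).
    df_style = (spark.table("sales.orders")
                .filter(F.col("status") == "SHIPPED")
                .groupBy("country")
                .agg(F.count("*").alias("orders")))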

TECHNICAL SKILLS:

Big Data: Hadoop HDFS, MapReduce, Hive, Impala, Pig, HBase, ZooKeeper, Sqoop, Oozie, Spark, Scala, Flume, Kafka, and Avro.

Programming Languages: C, C++, Java/J2EE, Python.

Methodologies: AGILE, Waterfall.

Web Technologies: HTML5, CSS3, JavaScript, jQuery, AJAX, JSON.

Java Technologies: Servlets, JSP, EJB, web services, JDBC, JSON

Databases: Oracle 11g/10g, DB2, SQL Server, MySQL, MS-Access

Application Servers: WebLogic, WebSphere.

Monitoring and Reporting Tools: Ganglia, Custom Shell scripts.

Version Control: Perforce, SVN, Git, Bitbucket

PROFESSIONAL EXPERIENCE:

Confidential, CA

Hadoop Developer

Responsibilities:

  • Involved in loading data from the UNIX file system to HDFS.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries (HiveQL) that run internally as MapReduce jobs.
  • Designed and developed data pipelines for different application data events to filter consumer responses in AWS S3 buckets and load them into Hive external tables.
  • Worked with different file formats such as JSON, Avro, CSV, ORC, and Parquet, and compression techniques such as Snappy and Zlib.
  • Followed Agile Scrum methodology for the entire project.
  • Selected appropriate AWS services based on data, compute, and system requirements.
  • Involved in the design and development of generic PySpark programs in Python to reduce the delivery time of data processing applications.
  • Designed and implemented data check and data quality frameworks in PySpark for the initial load process and the final publish stages.
  • Used AWS EMR for processing ETL jobs and loading to S3 buckets, and AWS Athena for ad hoc/low-latency querying on S3 data.
  • Developed Python code for workflow management and automation in Airflow (see the DAG sketch after this list).
  • Implemented Spark and Hive best practices and optimizations to process data efficiently, using partitioning, resource tuning, and memory management.
  • Developed UDFs in PySpark to anonymize users' personal data and created a framework to delete inactive users.
  • Used Bitbucket as the code repository and Jenkins as the continuous integration tool.
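
Illustration (a minimal sketch using Airflow 1.x-style imports; the DAG name, task names, and script paths are hypothetical) of the kind of Airflow workflow referenced above:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator  # Airflow 1.x-style import

    default_args = {
        "owner": "data-eng",                 # hypothetical owner
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
    }

    # Daily pipeline: run the PySpark data quality checks on the raw events,
    # then publish the curated data to the Hive external tables.
    with DAG(
        dag_id="consumer_events_pipeline",   # hypothetical DAG name
        default_args=default_args,
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        data_quality = BashOperator(
            task_id="data_quality_checks",
            bash_command="spark-submit /opt/jobs/dq_checks.py {{ ds }}",  # hypothetical script
        )

        publish = BashOperator(
            task_id="publish_to_hive",
            bash_command="spark-submit /opt/jobs/publish.py {{ ds }}",    # hypothetical script
        )

        data_quality >> publish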

Environment: Linux, Hadoop, Spark, HBase, Sqoop, Pig, Impala, Hive, HQL, Flume, AWS, ZooKeeper, Elasticsearch, Maven, DevOps, Agile, Oracle 11g, Cloudera.

Confidential, NC

Hadoop Developer

Responsibilities:

  • Responsible for designing and implementing the data pipeline using Big Data tools including Hive, Oozie, Spark, Sqoop, Kafka, EC2, S3, and EMR.
  • Used Sqoop to extract and load incremental and non-incremental data from RDBMS systems into Hadoop.
  • Involved in converting JSON data into DataFrames and storing it in Hive tables (see the sketch after this list).
  • Created multiple groups and set permission policies for various groups in AWS.
  • Created streaming cubes and persisted them in HBase for building OLAP cubes.
  • Used the Parquet file format with Snappy compression and addressed the Hive small-files problem using Hive's merge-files and merge-mapred-files parameters.
  • Converted existing snowflake-schema data into a star schema in Hive for building OLAP cubes.
  • Extensively used Hive optimization techniques like partitioning, bucketing, Map Join and parallel execution.
  • Converted existing Sqoop and Hive jobs into Spark SQL applications that read data from Oracle over JDBC and write it to Hive tables.
  • Analyzed SQL scripts and designed solutions to implement them using Spark with Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames.
  • Developed shell scripts for removing orphan partitions from Hive tables and for archive retention in HDFS.
  • Explored Spark for improving the performance and optimization of existing Hadoop jobs using Spark context, Spark SQL, Spark Streaming, DataFrames, pair RDDs, and Spark on YARN.
  • Validated fact table data migrated as part of the daily load.
  • Used AWS EMR (Elastic MapReduce) for resource-intensive transformation jobs.
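
Illustration (a minimal sketch; the bucket path, database, table, and partition column are hypothetical) of converting JSON data into a DataFrame and storing it as a Snappy-compressed Parquet Hive table, as referenced above:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("json-to-hive")        # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Read raw JSON events into a DataFrame (hypothetical S3 path).
    events = spark.read.json("s3a://raw-bucket/events/2019-01-01/")

    # Write as Parquet with Snappy compression into a partitioned Hive table
    # (hypothetical database, table, and partition column).
    (events.write
           .mode("append")
           .format("parquet")
           .option("compression", "snappy")
           .partitionBy("event_type")
           .saveAsTable("analytics.consumer_events"))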

Environment: Hive, Spark, S3, AWS, SQL, DB2, Impala, Tableau, Git, Kafka, ZooKeeper, YARN, UNIX shell scripting, Cloudera, HBase, Elastic MapReduce.

Confidential, San Jose, CA

Big Data Developer

Responsibilities:

  • Worked on data querying tool Hive to store and retrieve data.
  • Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
  • Developed Oozie workflows for scheduling ETL processes and Hive scripts.
  • Involved in writing queries in Spark SQL using Scala.
  • Integrated Spark with MapR-DB using Scala to persist data into Elasticsearch, among other use cases.
  • Exported data from Impala to the Tableau reporting tool and created dashboards on a live connection.
  • Designed multiple Python packages used within a large ETL process to load 2 TB of data from an existing Oracle database into a new PostgreSQL cluster (a simplified sketch of the chunked-copy pattern follows this list).
  • Developed Spark code using Scala and Spark-SQL for faster testing and processing of data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, exporting results back to OLTP systems through Sqoop.
  • Developed multiple MapReduce jobs in Java to clean datasets.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Developed UNIX shell scripts for creating the reports from Hive data.
  • Manipulated, serialized, and modeled data in multiple formats such as JSON and XML.
  • Prepared Avro schema files for generating Hive tables.
  • Created Hive tables, loaded data into the tables, and queried the data using HQL.
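
Illustration (a simplified sketch of the chunked Oracle-to-PostgreSQL copy pattern referenced above, assuming cx_Oracle and psycopg2; connection details, table, and column names are hypothetical):

    import cx_Oracle
    import psycopg2
    from psycopg2.extras import execute_values

    BATCH_SIZE = 50_000  # rows per round trip; tune for throughput

    # Hypothetical connection details.
    ora = cx_Oracle.connect("etl_user", "secret", "ora-host:1521/ORCLPDB")
    pg = psycopg2.connect(host="pg-host", dbname="warehouse",
                          user="etl_user", password="secret")

    ora_cur = ora.cursor()
    pg_cur = pg.cursor()

    # Stream rows from Oracle in batches and bulk-insert them into PostgreSQL.
    ora_cur.execute("SELECT id, name, created_at FROM source_schema.customers")
    while True:
        rows = ora_cur.fetchmany(BATCH_SIZE)
        if not rows:
            break
        execute_values(
            pg_cur,
            "INSERT INTO public.customers (id, name, created_at) VALUES %s",
            rows,
        )
        pg.commit()

    ora.close()
    pg.close()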

Environment: Hadoop MapReduce 2 (YARN), ZooKeeper, Scala, HDFS, Pig, Hive, Flume, Eclipse, Ignite, Core Java, Sqoop, Spark, Agile, Spark SQL, DevOps, Cloudera, Linux shell scripting.

Confidential

Hadoop Developer

Responsibilities:

  • Extensively worked on importing data from SQL Server and converting stored procedures to Spark jobs.
  • Developed common utilities for Spark jobs to import data in parallel from source RDBMS systems and handle data skew (see the partitioned JDBC read sketch after this list).
  • Developed a Python framework for incrementally loading data back to SQL Server from Hive.
  • Worked with different file formats such as JSON, Avro, ORC, and Parquet, and compression techniques such as Snappy.
  • Extensively used Spark optimization techniques to decrease job processing time, including repartitioning and memory parameter tuning.
  • Used AWS services such as S3 for storing data, and EC2, EBS, and RDS for spinning up instances on demand.
  • Extensively used Hive optimization techniques for improving query performance, and LLAP/Drill for low-latency end-user queries.
  • Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS.
  • Used Stonebranch as the workflow orchestration tool for scheduling ETL jobs.
  • Worked on a POC using Apache Kylo as a self-service tool built on Apache Spark and NiFi; Kylo automates many of the tasks associated with data lakes, such as data ingest, preparation, discovery, profiling, and management.
  • Wrote complex SQL queries and stored procedures.
  • Used ZooKeeper for cluster coordination services.
  • Followed Agile methodology to analyze, define, and document the application supporting functional and business requirements.
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to adopting Impala in the project.
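
Illustration (a minimal sketch; the JDBC URL, table names, key column, and bounds are hypothetical) of the partitioned JDBC read used to parallelize RDBMS imports and mitigate skew, as referenced above:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sqlserver-parallel-import")   # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # Parallel JDBC read: Spark splits the load into numPartitions range scans
    # on partitionColumn, so the import runs concurrently rather than as one task.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=sales")  # hypothetical
              .option("dbtable", "dbo.orders")                                     # hypothetical
              .option("user", "etl_user")
              .option("password", "secret")
              .option("partitionColumn", "order_id")   # numeric, roughly uniform key
              .option("lowerBound", "1")
              .option("upperBound", "50000000")
              .option("numPartitions", "32")
              .load())

    # Repartition after load to even out skewed key ranges before heavy transformations.
    orders = orders.repartition(200, "customer_id")

    orders.write.mode("overwrite").saveAsTable("staging.orders")   # hypothetical target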

Environment: Hive, Impala, HBase, UNIX, Hortonworks, MySQL, AWS.
