Hadoop Developer Resume
CA
SUMMARY:
- 5 years of overall experience in building Hadoop MapReduce solutions, with additional experience using Hive, Impala, Pig, Spark, Flume, and Kafka.
- Experience in installation, configuration, supporting and monitoring Hadoop clusters using Cloudera distributions and AWS.
- Good experience in writing Python Scripts.
- Good experience with both JobTracker (MapReduce 1) and YARN (MapReduce 2).
- Good experience in Spark and its related technologies such as Spark SQL and Spark Streaming.
- Working experience in a DevOps environment.
- Experience in defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope.
- Good understanding of Apache Hue and Accumulo.
- Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimating effort, designing custom solutions, development, leading developers, producing documentation, and production support.
- Good understanding of version control systems such as GitHub and SVN.
- Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice versa.
- Experience in converting Hive queries into Spark transformations using Spark RDDs and Scala (a brief PySpark sketch of the pattern follows this summary).
- Experience with the RDD architecture, implementing Spark operations on RDDs, and optimizing Spark transformations and actions.
- Expertise in using various tools in the Hadoop ecosystem including MapReduce, Hive, Pig, Oozie, Sqoop, HBase, Flume, Spark, Kafka, and Zookeeper.
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
- Extending Hive and Pig core functionality by writing custom UDFs.
- Experience in analyzing data using HQL, Pig Latin, and custom Map Reduce programs in core Java.
- Knowledge of job workflow scheduling and monitoring tools like Oozie.
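As a brief illustration of the Hive-to-Spark conversion experience noted above, the following is a minimal PySpark sketch of rewriting a Hive aggregation as RDD transformations. The table and column names (sales, region, amount) are hypothetical placeholders, and the original work also used Scala.

    # Equivalent Hive query: SELECT region, SUM(amount) FROM sales GROUP BY region;
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-to-spark-example")
             .enableHiveSupport()
             .getOrCreate())

    # Read the Hive table, then express the aggregation as RDD transformations.
    totals = (spark.table("sales").rdd
              .map(lambda row: (row["region"], row["amount"]))
              .reduceByKey(lambda a, b: a + b))

    for region, total in totals.collect():
        print(region, total)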
TECHNICAL SKILLS:
Big Data: Hadoop HDFS, MapReduce, Hive, Impala, Pig, HBase, ZooKeeper, Sqoop, Oozie, Spark, Scala, Flume, Kafka, and Avro.
Programming Languages: C, C++, JAVA/J2EE, Python.
Methodologies: AGILE, Waterfall.
Web Technologies: HTML5, CSS3, JavaScript, jQuery, AJAX, JSON.
Java Technologies: Servlets, JSP, EJB, web services, JDBC, JSON
Databases: Oracle 11g/10g, DB2, SQL Server, MySQL, MS-Access
Application Servers: WebLogic, WebSphere.
Monitoring and Reporting Tools: Ganglia, Custom Shell scripts.
Version Control: Perforce, SVN, Git, Bitbucket
PROFESSIONAL EXPERIENCE:
Confidential, CA
Hadoop Developer
Responsibilities:
- Involved in loading data from UNIX file system to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries (HiveQL) that run internally as MapReduce jobs.
- Designed and developed data pipelines for different application data events to filter consumer response data and load it from an AWS S3 bucket into Hive external tables.
- Worked with different file formats like JSON, Avro, CSV, ORC, and Parquet and compression techniques like Snappy and Zlib.
- Followed Agile Scrum methodology for the entire project.
- Selected the appropriate AWS services based on data, compute, and system requirements.
- Involved in the design and development of generic PySpark programs in Python to reduce the delivery time of data processing applications.
- Designed and implemented data-check and data-quality frameworks in PySpark for the initial load and final publish stages.
- Used AWS EMR for processing ETL jobs and loading to S3 buckets, and AWS Athena for ad hoc/low-latency querying of S3 data.
- Developed Python code for workflow management and automation in Airflow.
- Implemented Spark and Hive best practices and optimizations to process data efficiently, using features such as partitioning, resource tuning, and memory management.
- Developed UDFs in PySpark to anonymize users' personal data and created a framework to delete inactive users (see the sketch at the end of this section).
- Used Bitbucket as the code repository and Jenkins as the continuous integration tool.
Environment: Linux, Hadoop, Spark, HBase, Sqoop, Pig, Impala, Hive, HQL, Flume, AWS, Zookeeper, Elasticsearch, Maven, DevOps, Agile, Oracle 11g, Cloudera.
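As an illustration of the PySpark anonymization UDFs mentioned above, here is a minimal sketch under stated assumptions: the table name (users), the column (email), and the SHA-256 hashing approach are hypothetical placeholders, not details from the project.

    # Minimal PySpark sketch: anonymize a personal-data column with a UDF.
    import hashlib

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("anonymize-example")
             .enableHiveSupport()
             .getOrCreate())

    @udf(returnType=StringType())
    def anonymize(value):
        # Replace the raw value with a one-way SHA-256 hash (hypothetical scheme).
        return hashlib.sha256(value.encode("utf-8")).hexdigest() if value else None

    users = spark.table("users").withColumn("email", anonymize(col("email")))
    users.write.mode("overwrite").saveAsTable("users_anonymized")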
Confidential, NC
Hadoop Developer
Responsibilities:
- Responsible for designing and implementing the data pipeline using Big Data tools including Hive, Oozie, Spark, Sqoop, and Kafka, along with AWS EC2, S3, and EMR.
- Used Sqoop to extract and load incremental and non-incremental data from RDBMS systems into Hadoop.
- Involved in converting JSON data into DataFrames and storing them in Hive tables.
- Created multiple groups and set permission policies for various groups in AWS.
- Created streaming cubes and persisted them into HBase for building OLAP cubes.
- Used the Parquet file format with Snappy compression and addressed the Hive small-files problem using the hive.merge.mapfiles and hive.merge.mapredfiles parameters.
- Converted existing snowflake-schema data into a star schema in Hive for building OLAP cubes.
- Extensively used Hive optimization techniques like partitioning, bucketing, Map Join and parallel execution.
- Converted existing Sqoop and Hive jobs to Spark SQL applications that read data from Oracle over JDBC and write it to Hive tables (see the sketch at the end of this section).
- Analyzed the SQL scripts and designed the solution for implementation in Spark with Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames.
- Developed shell scripts for removing orphan partitions from Hive tables and for archive retention in HDFS.
- Explored Spark for improving the performance and optimization of existing Hadoop jobs using the Spark context, Spark SQL, Spark Streaming, DataFrames, pair RDDs, and Spark on YARN.
- Validated fact table data migrated on a daily load basis.
- Used AWS EMR (Elastic Map Reduce) for resource intensive transformation jobs.
Environment: Hive, Spark, S3, AWS, SQL, DB2, Impala, Tableau, Git, Kafka, Zookeeper, YARN, Unix shell scripting, Cloudera, HBase, Elastic MapReduce.
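As an illustration of the Sqoop/Hive-to-Spark-SQL conversion pattern mentioned above, here is a minimal PySpark sketch; the connection details, credentials, table names, and partition bounds are hypothetical placeholders, and the original applications were not necessarily structured this way.

    # Minimal PySpark sketch: read an Oracle table over JDBC and write it to Hive.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("oracle-to-hive-example")
             .enableHiveSupport()
             .getOrCreate())

    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")
              .option("dbtable", "SALES.ORDERS")
              .option("user", "etl_user")
              .option("password", "********")
              .option("driver", "oracle.jdbc.OracleDriver")
              # Split the read into parallel partitions on a numeric key.
              .option("partitionColumn", "ORDER_ID")
              .option("lowerBound", "1")
              .option("upperBound", "10000000")
              .option("numPartitions", "8")
              .load())

    (orders.write
     .mode("overwrite")
     .format("parquet")
     .option("compression", "snappy")
     .saveAsTable("staging.orders"))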
Confidential, San Jose, CA
Big Data Developer
Responsibilities:
- Worked on data querying tool Hive to store and retrieve data.
- Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
- Developed Oozie workflows for scheduling ETL processes and Hive scripts.
- Involved in writing queries in Spark SQL using Scala.
- Integrated Spark with MapR-DB using Scala to persist data into Elasticsearch, among other use cases.
- Exported data from Impala to the Tableau reporting tool and created dashboards on a live connection.
- Designed multiple Python packages used within a large ETL process that loaded 2TB of data from an existing Oracle database into a new PostgreSQL cluster (a minimal sketch follows this section).
- Developed Spark code using Scala and Spark-SQL for faster testing and processing of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, and exported the results back into OLTP systems through Sqoop.
- Developed multiple MapReduce jobs in Java to clean datasets.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Developed UNIX shell scripts for creating the reports from Hive data.
- Manipulated, serialized, and modeled data in multiple formats such as JSON and XML.
- Prepared Avro schema files for generating Hive tables.
- Created Hive tables, loaded data into them, and queried the data using HQL.
Environment: Hadoop MapReduce 2 (YARN), ZooKeeper, Scala, HDFS, Pig, Hive, Flume, Eclipse, Ignite, Core Java, Sqoop, Spark, Agile, Spark SQL, DevOps, Cloudera, Linux shell scripting.
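As an illustration of the Python ETL packages mentioned above (Oracle to PostgreSQL), here is a minimal, chunked copy-step sketch. The library choices (cx_Oracle, psycopg2), connection details, and the customers table and its columns are assumptions for illustration only.

    # Minimal sketch: copy rows from Oracle to PostgreSQL in batches.
    import cx_Oracle
    import psycopg2

    BATCH_SIZE = 10000

    src = cx_Oracle.connect("etl_user", "********", "db-host:1521/ORCL")
    dst = psycopg2.connect(host="pg-host", dbname="warehouse",
                           user="etl_user", password="********")

    read_cur = src.cursor()
    write_cur = dst.cursor()
    read_cur.execute("SELECT id, name, created_at FROM customers")

    while True:
        rows = read_cur.fetchmany(BATCH_SIZE)
        if not rows:
            break
        write_cur.executemany(
            "INSERT INTO customers (id, name, created_at) VALUES (%s, %s, %s)",
            rows,
        )
        dst.commit()  # commit each batch so a failure does not lose all progress

    src.close()
    dst.close()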
Confidential
Hadoop Developer
Responsibilities:
- Extensively worked on importing data from SQL Server and converting stored procedures to Spark jobs.
- Developed common utilities for Spark jobs to import data in parallel from source RDBMSs while handling data skew.
- Developed a Python framework for loading incremental data back to SQL Server from Hive (see the sketch at the end of this section).
- Worked with different file formats like JSON, Avro, ORC, and Parquet and compression techniques like Snappy.
- Extensively used Spark optimization techniques, including but not limited to repartitioning and memory parameter tuning, to decrease job processing time.
- Used AWS services such as S3 for storing data, and EC2, EBS, and RDS for spinning up instances on demand.
- Extensively used Hive optimization techniques for improving query performance and LLAP/Drill for low-latency end-user queries.
- Developed a Spark application to filter JSON source data in an AWS S3 location and store it in HDFS.
- Used Stonebranch as the workflow orchestration tool for scheduling ETL jobs.
- Worked on a POC using Apache Kylo as a self-service tool based on Apache Spark and NiFi; Kylo automates many of the tasks associated with data lakes, such as data ingest, preparation, discovery, profiling, and management.
- Wrote complex SQL queries and stored procedures.
- Provided cluster coordination services through ZooKeeper.
- Followed Agile methodology to analyze, define, and document the application supporting functional and business requirements.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, to evaluate adopting Impala in the project.
Environment: Hive, Impala, HBase, UNIX, Hortonworks, MySQL, AWS.
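As an illustration of the Hive-to-SQL Server load-back framework mentioned above, here is a minimal PySpark sketch; the connection details, table names, and the load_date watermark column are hypothetical placeholders rather than details from the project.

    # Minimal PySpark sketch: push incremental rows from a Hive table to SQL Server.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (SparkSession.builder
             .appName("hive-to-sqlserver-example")
             .enableHiveSupport()
             .getOrCreate())

    # Watermark of the last successful load; a real job would get this from the scheduler.
    last_loaded = "2019-01-01"
    incremental = spark.table("warehouse.orders").where(col("load_date") > last_loaded)

    (incremental.write.format("jdbc")
     .option("url", "jdbc:sqlserver://sql-host:1433;databaseName=reporting")
     .option("dbtable", "dbo.orders")
     .option("user", "etl_user")
     .option("password", "********")
     .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
     .mode("append")
     .save())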