Hadoop Developer Resume
Raleigh, NC
SUMMARY:
- Over 6 years of data analytics and visualization experience, including 2+ years with big data/Hadoop technologies, covering full project development, implementation, and deployment on Linux/Windows/Unix.
- 2+ years of experience implementing big data applications using HDFS, MapReduce, Pig, and Hive.
- Proficient with data visualization tools including Tableau, QlikView, Plotly, Raw, Palladio, and MS Excel.
- Experience in building data models with PowerPivot.
- Hands-on experience with HDFS, Hive, Pig, the Hadoop MapReduce framework, and Sqoop.
- Worked extensively with Hive DDL and the Hive Query Language (HiveQL).
- Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
- Developed Pig Latin scripts to handle business transformations.
- Used Sqoop to transfer large datasets between Hadoop and relational databases (RDBMSs).
- Worked with join patterns and implemented map-side and reduce-side joins using MapReduce (an illustrative sketch follows this summary).
- Built ETL reports using Tableau and created statistical dashboards for analytics.
- Hands-on experience with SequenceFiles, RCFiles, combiners, counters, dynamic partitions, and bucketing for best practices and performance improvement.
- Interacted directly with the Hortonworks team on Hadoop cluster issues and resolved them.
- Experience setting up Hadoop in a pseudo-distributed environment.
- Experience setting up Hive, Pig, and Sqoop on the Ubuntu operating system.
- Familiarity with common computing environments (e.g., Linux, shell scripting).
- Good team player with the ability to solve problems and to organize and prioritize multiple tasks.
- Excellent communication and interpersonal skills.
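The map-side join bullet above is illustrated by the following minimal Hadoop Streaming sketch in Python. It is a sketch under stated assumptions, not code from an actual engagement: the lookup file customers.tsv, the tab-delimited layout, and the column positions are all hypothetical.

```python
#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper for a map-side join:
# the small "customers" table is shipped to every mapper (e.g. via
# -files customers.tsv), loaded into memory, and joined against each
# transaction record arriving on stdin. Run map-only (-numReduceTasks 0).
import sys


def load_lookup(path="customers.tsv"):
    """Load the small side of the join: customer_id -> region."""
    lookup = {}
    with open(path) as handle:
        for line in handle:
            customer_id, region = line.rstrip("\n").split("\t")[:2]
            lookup[customer_id] = region
    return lookup


def main():
    lookup = load_lookup()
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        customer_id = fields[0]
        region = lookup.get(customer_id)
        # Emit the transaction enriched with the region; skip unmatched keys.
        if region is not None:
            print("\t".join(fields + [region]))


if __name__ == "__main__":
    main()
```

A reduce-side join follows the same streaming model, except the mapper tags each record with its source table and the join itself happens in the reducer, at the cost of shuffling both datasets.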
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Flume, Oozie, Tez, Impala, Mahout, Ambari, Hadoop Streaming
RDBMS: Oracle, DB2, SQL Server
Scripting/Query: Shell, SQL, HiveQL
NoSQL: HBase, Cassandra
Visualization: Tableau Desktop 8.3, Plotly, Raw, Palladio
Web Servers: WebLogic, WebSphere, Apache Tomcat
IDEs: RStudio, PyCharm, Eclipse
Platforms: Windows, UNIX, LINUX
Currently Learning: Spark, Scala, R and Python
PROFESSIONAL EXPERIENCE:
Confidential, Raleigh, NC
Hadoop Developer
Responsibilities:
- Primary responsibilities included building scalable distributed data solutions using the Hadoop ecosystem
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster
- Developed simple to complex MapReduce streaming jobs in Python, with equivalent logic also implemented in Hive and Pig (a minimal streaming sketch follows this section).
- Optimized MapReduce jobs to use HDFS efficiently through various compression mechanisms
- Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop
- Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Used Impala to read, write, and query Hadoop data in HDFS and HBase
- Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs
- Used Mahout to explore machine learning algorithms for efficient data processing
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Wrote multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and various compressed file formats
Environment: Hadoop 0.20.2, Pig, Hive, Apache Sqoop, Oozie, HBase, ZooKeeper, Cloudera Manager, 30-node cluster running Ubuntu Linux.
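As referenced in the streaming bullet above, the following is a minimal sketch of a Python MapReduce streaming job of the kind described: a mapper/reducer pair that counts records per key. The file name, CSV layout, and key column are illustrative assumptions.

```python
#!/usr/bin/env python
# Illustrative Hadoop Streaming job: count records per key.
# Run with something like:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/events -output /data/event_counts \
#     -mapper "python count_job.py map" -reducer "python count_job.py reduce" \
#     -file count_job.py
import sys


def mapper():
    """Emit <key, 1> for every CSV record; the key is assumed to be column 0."""
    for line in sys.stdin:
        key = line.rstrip("\n").split(",")[0]
        print("%s\t1" % key)


def reducer():
    """Sum counts per key; streaming guarantees the input arrives sorted by key."""
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, count))
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```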
Confidential, Durham, NC
Big Data Developer
Responsibilities:
- Developed machine learning, statistical analysis, and data visualization applications for challenging data processing problems in the clinical and biomedical domains.
- Read data from local files, XML files, Excel files, and JSON files in Python using the pandas module.
- Read data from SQL databases and from the web through APIs, and processed it for further use in Python with pandas.
- Performed subsetting, sorting, reshaping, merging, slicing, and editing on the collected data using the NumPy and pandas modules (see the pandas sketch at the end of this section).
- Developed histograms, scatter plots, 3-D plots, and other visualizations in various color combinations using Python's Matplotlib library.
- Worked in large-scale data environments such as Hadoop and MapReduce, with a working knowledge of Hadoop clusters, nodes, and the Hadoop Distributed File System (HDFS).
- Interfaced with a large-scale database system through an ETL server for data extraction and preparation.
- Migrated data from Oracle and MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.
- Proposed an automated system using shell scripts to run the Sqoop jobs.
- Worked within an Agile development approach.
- Created the estimates and defined the sprint stages.
- Developed a strategy for full and incremental loads using Sqoop.
- Worked mainly on Hive queries to categorize data from different claims.
- Integrated the Hive warehouse with HBase
- Wrote customized Hive UDFs in Java where the required functionality was too complex.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Generated final reporting data using Tableau for testing by connecting to the corresponding Hive tables using the Hive ODBC connector.
- Maintained system integrity of all sub-components (primarily HDFS, MapReduce, HBase, and Hive).
- Monitored system health and logs and responded to any warning or failure conditions.
- Presented data and data flows using Talend for reusability.
Environment: Apache Hadoop, HDFS, Hive, Java, Sqoop, Cloudera CDH4, Oracle, MySQL, Tableau, Talend, Elasticsearch
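The pandas sketch referenced in the bullets above: a minimal example of reading a JSON extract and an Excel lookup, merging and aggregating them with pandas, and producing a quick Matplotlib chart. All file names, column names, and the date cutoff are hypothetical assumptions, not artifacts from the project.

```python
# Illustrative pandas/Matplotlib preprocessing: read a JSON extract and an
# Excel lookup, merge them, aggregate, and plot a quick summary chart.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical inputs: claim records (JSON) and a member lookup table (Excel).
claims = pd.read_json("claims_extract.json")      # claim_id, member_id, amount, claim_date
members = pd.read_excel("member_lookup.xlsx")     # member_id, region

# Merge, subset, and aggregate with pandas.
merged = claims.merge(members, on="member_id", how="left")
merged["claim_date"] = pd.to_datetime(merged["claim_date"])
recent = merged[merged["claim_date"] >= "2015-01-01"]
by_region = (recent.groupby("region")["amount"]
                   .sum()
                   .sort_values(ascending=False))

# Quick visual check before handing curated data to reporting.
by_region.plot(kind="bar", title="Claim amount by region")
plt.tight_layout()
plt.savefig("claims_by_region.png")
```

The same curated output would typically be pushed to the Hive tables consumed by the Tableau reporting described above; the plot here is only a quick sanity check.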