Hadoop Developer Resume
Hopkins, MN
SUMMARY
- Around 6 years of professional IT experience in Big Data technologies, architecture and systems.
- Hands-on experience using CDH and HDP Hadoop ecosystem components such as MapReduce, Yarn, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Oozie, Zookeeper, Kafka and Flume.
- Configured Spark Streaming to receive real-time data from Kafka and stored the streamed data to HDFS using Scala (see the sketch at the end of this summary).
- Experienced in importing and exporting data using stream-processing platforms such as Flume and Kafka
- Wrote Hive UDFs as required and executed complex HiveQL queries to extract data from Hive tables
- Used partitioning and bucketing in Hive and designed both managed and external tables for performance optimization
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra
- Experienced with workflow scheduling and coordination services such as Oozie and Zookeeper
- Practiced ETL methods in enterprise-wide solutions, data warehousing, reporting and data analysis
- Experienced in working with AWS, using EMR and EC2 for compute and S3 for storage
- Developed Impala scripts for extraction, transformation and loading of data into the data warehouse
- Good knowledge of using Apache NiFi to automate data movement between Hadoop systems
- Used Pig scripts for transformations, event joins, filters and pre-aggregations for HDFS storage
- Imported and exported data with Sqoop between HDFS and RDBMSs including Oracle, MySQL and MS SQL Server
- Good knowledge of UNIX shell scripting for automating deployments and other routine tasks
- Experienced in using IDEs like Eclipse, NetBeans and IntelliJ.
- Used JIRA and Rally for bug tracking, and GitHub and SVN for version control and code reviews
- Experienced in working in all phases of the SDLC under both Agile and Waterfall methodologies
- Good understanding of Agile Scrum methodology, Test-Driven Development and CI/CD
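As a concrete illustration of the Kafka-to-HDFS streaming work noted above, the following is a minimal Scala sketch using Spark Structured Streaming; the broker address, topic name and HDFS paths are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hdfs")
      .getOrCreate()

    // Subscribe to a Kafka topic as a streaming DataFrame
    // (broker address and topic name are placeholders)
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Persist the stream to HDFS as Parquet, with checkpointing for fault tolerance
    stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```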
TECHNICAL SKILLS
Hadoop Technologies: HDFS, MapReduce, Hive, Impala, Pig, Sqoop, Flume, Oozie, Zookeeper, Ambari, Hue, Apache Spark, Storm, Kafka, Yarn, NiFi.
Operating System: Windows, Unix, Linux
Languages: Java, SQL, PL/SQL, Shell Script, Python, Scala
Testing tools: JUnit, EasyMock
SQL Databases: MySQL, Oracle 11g/10g/9i, SQL Server, Teradata
NoSQL Databases: HBase, Cassandra, MongoDB
File System: HDFS
Reporting Tools: Tableau
Version control: SVN, GIT and CVS
Build Tools: Maven, Gradle, Ant.
Cloud Technologies: AWS, EC2, S3
PROFESSIONAL EXPERIENCE
Confidential, Hopkins, MN
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in importing data from various sources into HDFS using Sqoop, applying transformations using Hive and Apache Spark, and then loading the data into Hive tables or AWS S3 buckets.
- Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop.
- Configured Splunk alerts to capture log files during execution and store them to an S3 bucket location while the cluster was running.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python (PySpark).
- Wrote Oozie workflows to schedule and automate jobs on the EMR cluster.
- Used Bitbucket as the code repository and integrated it with Bamboo for continuous integration.
- Experienced in bringing up EMR clusters and deploying code stored in S3 buckets to the cluster.
- Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.
- Developed Apache Spark applications using Scala and Python, and implemented a Spark data processing project to handle data from various RDBMS and streaming sources.
- Experience in developing various Spark Streaming jobs using Python (PySpark) and Scala.
- Developed Spark code using PySpark to apply various transformations and actions for faster data processing.
- Working knowledge of Apache Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
- Used Spark stream processing with Scala to bring data into memory, implemented RDD transformations and performed actions.
- Used various Python libraries with PySpark to create DataFrames and store them in Hive.
- Created Sqoop jobs and Hive queries for data ingestion from relational databases to compare with historical data.
- Experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
- Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
- Knowledge of creating user-defined functions (UDFs) in Hive.
- Worked with different file formats such as text file, Avro and Parquet for Hive querying and processing based on business logic.
- Involved in pulling data from AWS S3 buckets into the data lake, building Hive tables on top of it and creating DataFrames in Spark for further analysis (a sketch of this pattern follows this section).
- Worked on Sequence files, RC files, map-side joins, bucketing and partitioning for Hive performance enhancement and storage improvement.
- Involved in Test Driven Development writing unit and integration test cases for the code.
- Implemented Hive UDFs for business logic and was responsible for performing extensive data validation using Hive.
- Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
- Involved in developing code that generated various DataFrames based on business requirements and created temporary tables in Hive.
- Utilized AWS CloudWatch to monitor environment instances for operational and performance metrics during load testing.
- Experience in building scripts using Maven and in continuous integration with Bamboo.
- Used JIRA to create user stories and created branches in the Bitbucket repositories based on each story.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
Environment: Cloudera, Map Reduce, HDFS, Scala, Hive, Sqoop, Spark, Oozie, Linux, Maven, Splunk, NoMachine, Putty, HBase, Python, AWS EMR Cluster, EC2 instances, S3 Buckets, Bamboo, Bitbucket.
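A minimal Scala sketch of the S3-to-Hive pattern referenced above (pulling landed data from S3, applying a transformation and persisting a partitioned Hive table); the bucket, columns and table names are illustrative assumptions rather than project specifics.

```scala
import org.apache.spark.sql.{SparkSession, functions => F}

object S3ToHive {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-to-hive")
      .enableHiveSupport()   // lets saveAsTable register the table in the Hive metastore
      .getOrCreate()

    // Pull raw data landed in S3 (bucket and prefix are placeholders)
    val raw = spark.read.parquet("s3://example-bucket/landing/orders/")

    // Apply a representative transformation before loading into Hive
    val cleaned = raw
      .filter(F.col("order_status").isNotNull)
      .withColumn("ingest_date", F.current_date())

    // Persist as a partitioned Hive table for downstream analysis
    cleaned.write
      .mode("overwrite")
      .partitionBy("ingest_date")
      .saveAsTable("analytics.orders_curated")
  }
}
```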
Confidential
Hadoop/BigData Engineer
Responsibilities:
- Installed and configured Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Developed Simple to complex MapReduce Jobs using Hive and Pig.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Worked with Senior Engineer on configuring Kafka for streaming data.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Performed processing on large sets of structured, unstructured and semi-structured data.
- Created applications that monitor consumer lag within Apache Kafka clusters.
- Handled importing of data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Worked with the Spark ecosystem using Scala and Hive queries on different data formats such as text file and Parquet (see the sketch after this section).
- Used Pig UDF's to implement business logic in Hadoop.
- Implemented business logic by writing and applying various UDFs in Java.
- Responsible for migrating from Hadoop MapReduce to the Spark framework's in-memory distributed computing for real-time fraud detection.
- Used Spark to cache data in memory.
- Implemented batch processing of data sources using Apache Spark.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL.
- Developed Pig UDFs to pre-process the data for analysis.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Developed predictive analytics using Apache Spark Scala APIs.
- Provided cluster coordination services through ZooKeeper.
- Used Apache Kafka for collecting, aggregating, and moving large amounts of data from application servers.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Installed Oozie workflow engine to run multiple Hive and Pig jobs.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Apache Sqoop, Spark, Oozie, HBase, AWS, PL/SQL, MySQL and Windows.
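A hedged Scala sketch of the Hive-on-Spark analysis described above: Hive tables loaded by Sqoop are queried through Spark SQL and the result is written back as a Hive table. The table and column names are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CustomerBehavior {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("customer-behavior")
      .enableHiveSupport()   // query Hive metastore tables directly
      .getOrCreate()

    // A HiveQL-style aggregation expressed through Spark SQL; raw.web_clicks is
    // assumed to be a text-backed table and raw.customers a Parquet-backed one.
    val behavior = spark.sql(
      """
        |SELECT c.segment, COUNT(*) AS page_views
        |FROM raw.web_clicks k
        |JOIN raw.customers c ON k.customer_id = c.customer_id
        |GROUP BY c.segment
      """.stripMargin)

    // Persist the result for downstream Hive/BI consumption
    behavior.write.mode("overwrite").saveAsTable("analytics.customer_behavior")
  }
}
```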
Confidential
Hadoop Developer
Responsibilities:
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Created Spark jobs to see trends in data usage by users.
- Used Spark and Spark-SQL to read Parquet data and create tables in Hive using the Scala API.
- Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
- Developed Kafka pub-sub and Cassandra clients along with Spark components on HDFS and Hive (a producer sketch follows this section)
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
- Used the Prefuse open-source Java framework for the GUI.
- Developed Pig UDFs to pre-process the data for analysis.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig and HiveQL.
- Created Hive tables to store data and wrote Hive queries.
- Extracted the data from Teradata into HDFS using Sqoop.
- Exported the patterns analyzed back to Teradata using Sqoop.
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
- Developed Spark code using Scala and Spark-SQL for faster processing and testing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
- Experienced in building data pipelines using Kafka and Akka to handle terabytes of data.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Developed PIG Latin scripts to extract the data from the web server output files to load into HDFS.
- Designed and implemented MapReduce jobs to support distributed data processing.
- Processed large data sets utilizing the Hadoop cluster.
- Designed NoSQL schemas in HBase.
- Developed MapReduce ETL in Java/Pig.
- Involved in data validation using Hive.
- Imported and exported data using Sqoop between HDFS and relational database systems.
- Involved in weekly walkthroughs and inspection meetings, to verify the status of the testing efforts and the project as a whole.
Environment: Hadoop, Spark, Kafka, Map Reduce, Pig Latin, Teradata, Python, Zookeeper, Oozie, Sqoop, Java, Hive, HBase, UNIX Shell Scripting.
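One possible shape of the Kafka pub-sub work mentioned above: a minimal Scala producer built on the standard Kafka Java client. The broker address, topic and payload are illustrative assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    // Producer configuration (broker address is a placeholder)
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Publish a record to the topic consumed downstream by Spark Streaming
      producer.send(new ProducerRecord[String, String]("web-logs", "session-42", """{"page":"/home"}"""))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```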
Confidential
Hadoop Developer
Responsibilities:
- Developed several advanced Map Reduce programs to process data files received.
- Developed Map Reduce Programs for data analysis and data cleaning.
- Firm knowledge of various summarization patterns used to calculate aggregate statistical values over datasets.
- Experience in implementing joins in the analysis of datasets to discover interesting relationships.
- Completely involved in the requirement analysis phase.
- Extended Hive and Pig core functionality by writing custom UDFs (a UDF sketch follows this section).
- Worked on partitioning Hive tables and running the scripts in parallel to reduce their run time.
- Strong expertise in Hive internal (managed) and external tables; created Hive tables to store the processed results in tabular format.
- Implemented partitioning, dynamic partitions and bucketing in Hive.
- Developed Pig Scripts and Pig UDFs to load data files into Hadoop.
- Analyzed the data by performing Hive queries and running Pig scripts.
- Developed Pig Latin scripts for the analysis of semi-structured and unstructured data.
- Strong knowledge on the process of creating complex data pipelines using transformations, aggregations, cleansing and filtering.
- Experience in writing cron jobs to run at regular intervals.
- Developed MapReduce jobs for Log Analysis, Recommendation and Analytics.
- Experience in using Flume to efficiently collect, aggregate and move large amounts of log data.
- Involved in loading data from edge node to HDFS using shell scripting.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Experience in managing and reviewing Hadoop log files.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
Environment: Hadoop, Java, Apache Pig, Apache Hive, MapReduce, HDFS, Flume, GIT, UNIX Shell scripting, PostgreSQL, Linux.
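As an illustration of the custom Hive UDF work mentioned above, here is a minimal UDF sketch. Hive UDFs of this style are typically written in Java; this version uses Scala (which compiles to the same bytecode) to keep the examples in one language, and the URL-normalization logic is a made-up placeholder for the actual business rules.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Simple Hive UDF: lowercases a URL and strips a trailing slash before aggregation.
// After packaging the jar, it would be registered in Hive along the lines of:
//   ADD JAR normalize-url.jar;
//   CREATE TEMPORARY FUNCTION normalize_url AS 'NormalizeUrl';
class NormalizeUrl extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.toLowerCase.stripSuffix("/"))
  }
}
```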