Sr. Hadoop Developer Resume
Aurora, Illinois
PROFESSIONAL SUMMARY:
- Overall 8+ years of professional IT experience in the analysis, design, development, deployment, and maintenance of critical software and big data applications.
- In-depth knowledge of HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
- Expertise in converting MapReduce programs into Spark transformations using Spark RDDs (a brief Scala sketch follows this summary).
- Expertise in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala.
- Experience implementing real-time event processing and analytics with Spark Streaming over messaging systems.
- Experience using Kafka and Kafka brokers to feed a Spark context and processing live streaming data with RDDs.
- Good knowledge of Amazon AWS concepts such as the EMR and EC2 web services, which provide fast and efficient processing of big data.
- Experience with all major Hadoop distributions, including Cloudera, Hortonworks, MapR, and Apache.
- Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (5.x) distributions and on Amazon Web Services (AWS).
- Expertise in implementing Spark applications in Scala using higher-order functions for both batch and interactive analysis requirements.
- Extensive experience working with Spark features such as RDD transformations, Spark MLlib, and Spark SQL.
- Hands-on experience writing Hadoop jobs to analyze data using HiveQL, Pig Latin (a data flow language), and custom MapReduce programs in Java.
- Experienced in working with structured data using HiveQL: join operations, Hive UDFs, partitioning, bucketing, and internal/external tables.
- Extensive experience collecting streaming data such as logs and storing it in HDFS using Apache Flume.
- Experienced in using Pig scripts for transformations, event joins, filters, and pre-aggregations before storing the data in HDFS.
- Created custom UDFs for Pig and Hive to bring Python/Java logic and functionality into Pig Latin and HiveQL.
- Good experience with NoSQL databases such as HBase, MongoDB, and Cassandra.
- Experience using CQL with the Cassandra Java API to retrieve data from Cassandra tables.
- Hands-on experience querying and analyzing data in Cassandra for quick searching, sorting, and grouping through CQL.
- Experience working with MongoDB for distributed storage and processing.
- Experienced in extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Worked on importing data into HBase using the HBase shell and the HBase client API.
- Experience in designing and developing tables in HBase and storing aggregated data from Hive tables.
- Good knowledge of scheduling jobs in Hadoop using the FIFO, Fair, and Capacity schedulers.
- Experienced in designing both time-driven and data-driven automated workflows using Oozie and ZooKeeper.
- Experience working with Solr to develop search over unstructured data in HDFS.
- Extensively used Solr indexing to enable searches on non-primary-key columns in Cassandra keyspaces.
- Experience in writing stored procedures and complex SQL queries against relational databases such as Oracle, SQL Server, and MySQL.
- Experience in the extraction, transformation, and loading (ETL) of data from multiple sources such as flat files, XML files, and databases.
- Supported various reporting teams and have experience with the data visualization tool Tableau.
- Implemented data quality rules in the ETL tool Talend, and have good knowledge of data warehousing and ETL tools such as IBM DataStage, Informatica, and Talend.
- Experienced in, and with in-depth knowledge of, cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, Redshift, and Microsoft Azure.
- Detailed understanding of the Software Development Life Cycle (SDLC) and strong knowledge of project implementation methodologies such as Waterfall and Agile.
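Illustrative only: a minimal Scala sketch of the MapReduce-to-Spark RDD conversion pattern referenced above, using a plain-text word count as the example; the application name and HDFS paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountRdd {
  def main(args: Array[String]): Unit = {
    // Hypothetical app name and HDFS paths, for illustration only.
    val conf = new SparkConf().setAppName("mapreduce-to-rdd-example")
    val sc   = new SparkContext(conf)

    // The classic MapReduce word count expressed as RDD transformations:
    // flatMap/map replace the Mapper, reduceByKey replaces the Reducer.
    val counts = sc.textFile("hdfs:///data/input/sample.txt")
      .flatMap(line => line.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/output/wordcount")
    sc.stop()
  }
}
```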
TECHNICAL SKILLS:
Languages: C, C++, Python, R, PL/SQL, Java, HiveQL, Pig Latin, Scala, UNIX shell scripting.
Hadoop Ecosystem: HDFS, YARN, Scala, Map Reduce, Hive, Pig, Zookeeper, Sqoop, Oozie, Bedrock, Flume, Kafka, Impala, NiFi, MongoDB, HBase.
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL, NoSQL (HBase, Cassandra, MongoDB), Teradata.
Tools: Eclipse, NetBeans, Informatica, IBM DataStage, Talend, Maven, Jenkins.
Hadoop Platforms: Hortonworks, Cloudera, Azure, Amazon Web Services (AWS).
Operating Systems: Windows XP/2000/NT, Linux, UNIX.
Amazon Web Services: Redshift, EMR, EC2, S3, RDS, Cloud Search, Data Pipeline, Lambda.
Version Control: GitHub, SVN, CVS.
Packages: MS Office Suite, MS Visio, MS Project Professional.
PROFESSIONAL EXPERIENCE:
Confidential, Aurora, Illinois
Sr. Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Created Kafka producers and consumers for Spark Streaming, which pulls data from the patients' different learning systems.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala (see the sketch following this role).
- Used Spark Streaming to divide streaming data into micro-batches as input to the Spark engine for batch processing.
- Evaluated the performance of Apache Spark in analyzing genomic data.
- Performed advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Wrote a Storm topology to accept events from the Kafka producer and emit them into Cassandra.
- Created a POC using the Spark SQL and MLlib libraries.
- Experienced in managing and reviewing Hadoop log files.
- Worked closely with EC2 infrastructure teams to troubleshoot complex issues.
- Worked with the AWS cloud and created EMR clusters with Spark to analyze and process raw data and to access data in S3 buckets.
- Involved in installing EMR clusters on AWS.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Applied transformation rules on top of DataFrames.
- Worked with different file formats such as TextFile, Avro, ORC, and Parquet for Hive querying and processing.
- Developed Hive UDFs and UDAFs for rating aggregation.
- Developed a Java client API for CRUD and analytical operations by building a RESTful server and exposing data from NoSQL databases such as Cassandra over REST.
- Created Hive tables and involved in data loading and writing Hive UDFs.
- Worked extensively with Sqoop to move data from DB2 and Teradata to HDFS.
- Collected log data from web servers and integrated it into HDFS using Kafka.
- Provided ad-hoc queries and data metrics to business users using Hive and Impala.
- Worked on various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Implemented row-level updates and real-time analytics on Cassandra data using CQL.
- Used CQL with the Cassandra Java API to retrieve data from Cassandra tables.
- Worked on analyzing and examining customer behavioral data using Cassandra.
- Worked on Solr configuration and customizations based on requirements.
- Indexed documents using Apache Solr.
- Extensively used ZooKeeper as the job scheduler for Spark jobs.
- Worked with BI teams in generating the reports on Tableau.
- Used JIRA for bug tracking and CVS for version control.
Environment: Hadoop, MapReduce, HDFS, PIG, Hive, Sqoop, Oozie, Storm, Kafka, Spark, Spark Streaming, Scala, Cassandra, Cloudera, ZooKeeper, AWS, Solr, MySQL, Shell Scripting, Java, Tableau.
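Illustrative only: a minimal Scala sketch of the Kafka-to-HDFS Spark Streaming flow described in this role, assuming Spark 2.x with the spark-streaming-kafka-0-10 integration; the broker address, topic name, consumer group, and output path are placeholders, not actual project values.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaToHdfsStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-hdfs-example")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Placeholder Kafka settings; real broker list, group id, and topic differ.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-streaming-example",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    // Keep only the message payloads and write each batch to HDFS as text files.
    stream.map(record => record.value)
          .saveAsTextFiles("hdfs:///data/streaming/events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```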
Confidential, Austin, Texas
Sr. Hadoop Developer
Responsibilities:
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from various streaming sources such as Kafka and JMS.
- Experienced in writing live, real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline.
- Developed Shell, Perl, and Python scripts to automate Pig scripts and provide control flow to them.
- Worked with Amazon AWS services such as EMR and EC2 for fast and efficient processing of big data.
- Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and partitions.
- Applied custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Implemented a POC with Hadoop and extracted data into HDFS with Spark.
- Used Spark SQL with Scala to create DataFrames and performed transformations on them (see the sketch following this role).
- Implemented Spark SQL to access Hive tables from Spark for faster data processing.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed code to read the data stream from Kafka and send it to the appropriate bolts through their respective streams.
- Worked on Spark Streaming with Apache Kafka for real-time data processing.
- Experience in creating Kafka producers and consumers for Spark Streaming.
- Developed MapReduce jobs using the MapReduce Java API and HiveQL.
- Developed UDF, UDAF, and UDTF functions and used them in Hive queries.
- Developed scripts and batch jobs to schedule an Oozie bundle (a group of coordinators) consisting of various Hadoop programs.
- Experienced in using the Avro data serialization system to handle Avro data files in MapReduce programs.
- Experienced in optimizing Hive queries and joins to handle different data sets.
- Configured Oozie schedulers to handle different Hadoop actions on a timely basis.
- Involved in ETL, data integration, and migration by writing Pig scripts.
- Used different file formats such as text files, SequenceFiles, and Avro with Hive SerDes.
- Integrated Hadoop with Solr and implemented search algorithms.
- Experience with Storm for real-time processing.
- Hands-on experience working with the Hortonworks distribution.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Worked hands-on with NoSQL databases such as MongoDB for a POC on storing images and URIs.
- Designed and implemented MongoDB and associated RESTful web service.
- Worked on analyzing and examining customer behavioral data using MongoDB.
- Designed data aggregations in Hive for ETL processing on Amazon EMR to process data per business requirements.
- Involved in writing test cases and implementing test classes using MRUnit and mocking frameworks.
- Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
- Set up Spark on EMR to process large volumes of data stored in Amazon S3.
- Experience in processing large volumes of data and in parallel execution of processes using Talend functionality.
- Used Talend to create workflows for processing data from multiple source systems.
Environment: MapReduce, HDFS, Sqoop, LINUX, Oozie, Hadoop, Pig, Hive, Solr, Spark Streaming, Kafka, Storm, Spark, Scala, Python, MongoDB, Hadoop Cluster, Amazon Web Services, Talend.
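Illustrative only: a minimal Scala sketch of the Spark SQL / Hive DataFrame pattern described in this role, assuming Spark 2.x with Hive support enabled; the database, table, and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveTableAnalysis {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets Spark SQL read tables from the Hive metastore.
    val spark = SparkSession.builder()
      .appName("spark-sql-hive-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical Hive table and columns, for illustration only.
    val orders = spark.table("sales_db.orders")

    // DataFrame transformations: filter, aggregate, and order the results.
    val dailyTotals = orders
      .filter(col("status") === "COMPLETED")
      .groupBy(col("order_date"))
      .agg(sum(col("amount")).as("total_amount"), count(lit(1)).as("order_count"))
      .orderBy(col("order_date"))

    // Write the aggregate back as a Hive table for downstream reporting.
    dailyTotals.write.mode("overwrite").saveAsTable("sales_db.daily_order_totals")

    spark.stop()
  }
}
```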
Confidential, CA
Hadoop Developer
Responsibilities:
- Developed solutions to process data into HDFS (Hadoop Distributed File System), process within Hadoop and emit the summary results from Hadoop to downstream systems.
- Developed a wrapper script around the Teradata connector for Hadoop (TCD) to support optional parameters.
- Used Sqoop extensively to ingest data from various source systems into HDFS.
- Used Hive to quickly produce results for the reports that were requested.
- Played a major role in working with the team to leverage Sqoop for extracting data from Teradata.
- Imported data from different relational data sources like Oracle, Teradata to HDFS using Sqoop.
- Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver for auto-generation of Hive queries for non-technical business users.
- Integrated data from multiple sources (SQL Server, DB2, Teradata) into the Hadoop cluster and analyzed it through Hive-HBase integration.
- Involved in Hive-HBase integration by creating Hive external tables and specifying HBase as the storage format.
- Implemented data validation using MapReduce programs to remove unnecessary records before moving data into Hive tables.
- Developed Pig UDFs for needed functionality, such as a custom Pig loader known as the timestamp loader.
- Wrote optimized Pig scripts and was involved in developing and testing Pig Latin scripts.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and compressed CSV.
- Oozie and Zookeeper were used to automate the flow of jobs and coordination in the cluster respectively.
- Involved in moving log files generated from various sources to HDFS for further processing through Flume.
- Worked on different file formats such as text files, Parquet, SequenceFiles, Avro, and Record Columnar (RC) files.
- Developed several shell scripts that act as wrappers to start these Hadoop jobs and set the configuration parameters.
- Kerberos security was implemented to safeguard the cluster.
- Worked on a stand-alone as well as a distributed Hadoop application.
- Tested the performance of the data sets on various NoSQL databases.
- Understood complex data structures of different types (structured, semi-structured) and de-normalized them for storage in Hadoop.
Environment: Hadoop, HDFS, Pig, Flume, Hive, MapReduce, Sqoop, Oozie, Zookeeper, HBase, Java Eclipse, SQL Server, Shell Scripting.
Confidential, Minneapolis, MN
Hadoop/Java Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Migrated existing SQL queries to HiveQL queries to move to the big data analytical platform.
- Integrated the Cassandra File System with Hadoop using MapReduce to perform analytics on Cassandra data.
- Installed and configured a Cassandra DSE multi-node, multi-data-center cluster.
- Designed and implemented a 24-node Cassandra cluster for a single-point inventory application.
- Analyzed the performance of the Cassandra cluster using nodetool tpstats and cfstats for thread and latency analysis.
- Implemented real-time analytics on Cassandra data using the Thrift API.
- Responsible for managing data coming from different sources.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from the UNIX file system to HDFS.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Loaded and transformed large data sets into HDFS using Hadoop fs commands.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Implemented UDFs and UDAFs in Java and Python for Hive to handle processing that cannot be done with Hive's built-in functions (see the sketch following this role).
- Performed various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Worked on importing and exporting data between Oracle/DB2 and HDFS/Hive using Sqoop for analysis, visualization, and report generation.
- Wrote optimized Pig scripts and was involved in developing and testing Pig Latin scripts.
- Supported setting up and updating configurations for implementing scripts with Pig and Sqoop.
- Designed the logical and physical data models and wrote DML scripts for an Oracle 9i database.
- Used Hibernate ORM framework with Spring framework for data persistence.
- Wrote test cases in JUnit for unit testing of classes.
- Involved in developing templates and screens in HTML and JavaScript.
Environment: Java, HDFS, Cassandra, Map Reduce, Sqoop, JUnit, HTML, JavaScript, Hibernate, Spring, Pig, Hive.
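Illustrative only: the Hive UDF work in this role was done in Java and Python; the minimal sketch below shows the UDF mechanics in Scala (kept consistent with the other sketches), with a hypothetical package, class, and function name.

```scala
package com.example.udf // hypothetical package, for illustration only

import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

/**
 * Simple Hive UDF that strips formatting from phone-number strings,
 * e.g. "(555) 123-4567" -> "5551234567".
 * Registered in Hive with (jar and function names are placeholders):
 *   ADD JAR custom-hive-udfs.jar;
 *   CREATE TEMPORARY FUNCTION clean_phone AS 'com.example.udf.CleanPhoneNumber';
 */
class CleanPhoneNumber extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.replaceAll("[^0-9]", ""))
  }
}
```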
