- Around 5 years of professional IT experience in Big Data environments and the Hadoop ecosystem, with strong experience in Spark, SQL, and Java development.
- Hands-on experience across the Hadoop ecosystem, including extensive work with Big Data technologies such as HDFS, MapReduce, YARN, Spark, Sqoop, Hive, Pig, Impala, Oozie, Oozie Coordinator, ZooKeeper, Apache Cassandra, and HBase.
- Experience using tools such as Sqoop, Flume, Kafka, NiFi, and Pig to ingest structured, semi-structured, and unstructured data into the cluster.
- Designed both time-driven and data-driven automated workflows using Oozie and used ZooKeeper for cluster coordination.
- Experience with Hadoop clusters using Cloudera CDH and Hortonworks HDP distributions.
- Experience working with structured data using HiveQL: join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
- Expertise in writing MapReduce jobs in Java and Python to process large structured, semi-structured, and unstructured data sets and store the results in HDFS.
- Experience working with Python, UNIX, and shell scripting.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources such as flat files and databases.
- Good knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, as well as Microsoft Azure.
- Experience with the complete Software Development Life Cycle (SDLC): requirement gathering, analysis, design, development, testing, implementation, and documentation.
- Worked with waterfall and Agile methodologies.
- Good team player with excellent communication skills and a strong attitude toward learning new technologies.
Hadoop: HDFS, MapReduce, Hive, Beeline, Sqoop, Flume, Oozie, Impala, Pig, Kafka, ZooKeeper, NiFi, Cloudera Manager, Hortonworks
Spark Components: Spark Core, Spark SQL (DataFrames and Datasets), Scala, Python
Programming Languages: Core Java, Scala, Shell, HiveQL, Python
Operating Systems: Linux, Ubuntu, Windows 10/8/7
Databases: Oracle, MySQL, SQL Server
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud: AWS Cloud Formation, Azure
Version Control and Tools: Git, Maven, SBT, CBT
Methodologies: Agile, Waterfall
IDEs & Command-Line Tools: Eclipse, NetBeans, IntelliJ
Confidential, Nashville, TN
- Worked with product owners, designers, QA, and other engineers in an Agile development environment to deliver timely solutions per customer requirements.
- Transferred data from different data sources into HDFS using Kafka producers, consumers, and brokers.
- Used Oozie for automating the end-to-end data pipelines and Oozie coordinators for scheduling the workflows.
- Involved in creating Hive tables, loading data, and writing Hive queries and views using HiveQL.
- Optimized Hive queries using map-side joins, dynamic partitions, and bucketing.
- Applied Hive queries using SerDe tables to perform data analysis on HBase, meeting the data requirements of downstream applications.
- Executed Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
- Implemented MapReduce secondary sorting to improve performance when sorting results in MapReduce programs.
- Loaded and transformed large sets of structured and semi-structured data, including Avro and sequence files.
- Migrated existing jobs to Spark to improve performance and reduce execution time.
- Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Hive join queries to join multiple tables from a source system and load the results into Elasticsearch.
- Experience with the ELK stack in building quick search and visualization capabilities over data.
- Experience with data formats such as JSON, Avro, Parquet, and ORC, and compression codecs such as Snappy and bzip2.
- Coordinated with the testing team for bug fixes and created documentation for recorded data, agent usage and release cycle notes.
Environment: Hadoop, Big Data, HDFS, Scala, Python, Oozie, Hive, HBase, NiFi, Impala, Spark, AWS, Linux.
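The MapReduce secondary-sort pattern used in this role can be illustrated outside Hadoop with a composite-key sort; a minimal Python sketch, with illustrative data and names (not from the original jobs):

```python
from itertools import groupby
from operator import itemgetter

def secondary_sort(records):
    """Mimic MapReduce secondary sort on (key, value) pairs.

    Sorting once on the composite key (key, value) replaces sorting
    inside the reducer, just as a composite-key shuffle does in Hadoop.
    """
    shuffled = sorted(records, key=itemgetter(0, 1))
    # Group by the natural key only, as the grouping comparator would;
    # values within each group then arrive already sorted.
    return {k: [v for _, v in grp]
            for k, grp in groupby(shuffled, key=itemgetter(0))}

events = [("user2", 9), ("user1", 5), ("user1", 2), ("user2", 3)]
print(secondary_sort(events))  # → {'user1': [2, 5], 'user2': [3, 9]}
```

In real MapReduce this is done with a composite key class, a custom partitioner on the natural key, and a grouping comparator; the sketch only shows the ordering guarantee the pattern buys.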
Confidential, Hudson, Ohio
- Developed a cloud-based EDW and Data Lake solution supporting data asset management, data integration, and continuous data analytic discovery workloads.
- Developed and implemented real-time data pipelines with Spark Streaming, Kafka, and Cassandra to replace existing lambda architecture without losing the fault-tolerant capabilities of the existing architecture.
- Created a Spark Streaming application to consume real-time data from Kafka sources and applied real-time data analysis models that are updated on new data as it arrives in the stream.
- Worked on importing and transforming large sets of structured, semi-structured, and unstructured data.
- Used Spark Structured Streaming to apply the necessary transformations and data model to data received from Kafka in real time and persist it into HDFS.
- Implemented workflows using the Apache Oozie framework to automate tasks and used ZooKeeper to coordinate cluster services.
- Created various Hive external and staging tables and joined them as per requirements.
- Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables. Used map-side joins and parallel execution to optimize Hive queries.
- Developed and implemented custom Hive and Spark UDFs for date transformations, such as date formatting and age calculation, per business requirements.
- Wrote Spark programs in Scala and Python for data quality checks.
- Wrote transformations and actions on DataFrames and used Spark SQL on DataFrames to access Hive tables in Spark for faster data processing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartition, increased memory overhead limits, parallelism tuning, and modified default Spark configuration variables for performance tuning.
- Performed various benchmarking steps to optimize the performance of Spark jobs and thus improve the overall processing.
- Worked in an Agile environment, delivering the agreed user stories within the sprint time.
Environment: Hadoop, HDFS, Hive, Sqoop, Oozie, Spark, Scala, Kafka, Python, Cloudera, Linux.
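The data-quality checks described in this role ran as Spark jobs in Scala and Python; the kind of rule they apply can be sketched in plain Python (function, column, and bound names are hypothetical):

```python
def quality_report(rows, required, ranges):
    """Classify rows by simple data-quality rules (illustrative).

    rows:     iterable of dicts
    required: columns that must be non-null
    ranges:   {column: (low, high)} inclusive bounds
    """
    report = {"missing": 0, "out_of_range": 0, "passed": 0}
    for row in rows:
        if any(row.get(col) is None for col in required):
            report["missing"] += 1
        elif any(row.get(col) is not None and not lo <= row[col] <= hi
                 for col, (lo, hi) in ranges.items()):
            report["out_of_range"] += 1
        else:
            report["passed"] += 1
    return report

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # fails the null check
    {"id": 3, "age": 150},    # fails the range check
]
print(quality_report(rows, required=["id", "age"], ranges={"age": (0, 120)}))
# → {'missing': 1, 'out_of_range': 1, 'passed': 1}
```

In the actual jobs the same per-row logic would run distributed, e.g. as a filter/aggregation over a DataFrame rather than a Python loop.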
Confidential, Bowie, Maryland
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Used Sqoop to load the data from relational databases.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Worked with CSV, JSON, Avro, and Parquet file formats.
- Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Worked on Kafka to collect and load data onto Hadoop file systems.
- Used Hive to form an abstraction on top of structured data residing in HDFS and implemented partitions and buckets on Hive tables.
- Developed and implemented real-time data pipelines with Spark Streaming.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store HBase for data access and analysis.
- Worked with Python to develop analytical jobs using the PySpark API of Spark.
- Used the Apache Oozie job scheduler to execute workflows.
- Used Ambari to monitor node health and job status and to run analytics jobs on Hadoop clusters.
- Experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
- Involved in performance tuning of Spark jobs using caching and by taking full advantage of the cluster environment.
Environment: Hadoop, Spark, Scala, Python, Kafka, Hive, Sqoop, Pyspark, Ambari, Oozie, HBase, Tableau, Jenkins, HortonWorks.
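The partitioned and bucketed Hive tables described in this role might look like the following HiveQL sketch (table, column, and path names are hypothetical, not from the actual project):

```sql
-- External table over raw data in HDFS, partitioned by date and
-- bucketed by customer for efficient joins and sampling.
CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC
LOCATION '/data/orders';

-- Dynamic-partition load from a staging table: the partition column
-- comes last in the SELECT list.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orders_raw PARTITION (order_date)
SELECT order_id, customer_id, amount, order_date
FROM orders_staging;
```

Partitioning prunes whole date directories at query time, while bucketing on the join key enables bucket map-side joins.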
Jr Java Developer
- Involved in different SDLC phases, including requirement gathering, design and analysis, development, and customization of the application.
- Wrote database queries using SQL and PL/SQL for accessing, manipulating and updating Oracle database.
- Created database design for new tables and forms with the help of Technical Architect.
- Worked with managers to identify user needs and troubleshoot issues as they arise.
- Performed unit testing once the basic implementation was done.
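The Oracle work in this role used SQL and PL/SQL for accessing and updating data; a minimal PL/SQL sketch of that kind of update logic (table and column names are hypothetical):

```sql
-- Illustrative PL/SQL block: update a row and fail loudly if it
-- does not exist, instead of silently updating nothing.
BEGIN
  UPDATE accounts
     SET balance = balance - 100
   WHERE account_id = 42;

  IF SQL%ROWCOUNT = 0 THEN
    RAISE_APPLICATION_ERROR(-20001, 'Account not found');
  END IF;

  COMMIT;
END;
/
```

`SQL%ROWCOUNT` reports how many rows the preceding DML statement touched, which is the standard way to detect a no-op update in PL/SQL.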