- Around 8 years of professional IT experience, including 4+ years in the Big Data ecosystem covering ingestion, querying, processing, and analysis of big data.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming using Scala and Python.
- Good understanding of the Hadoop ecosystem, including Spark, Hive, Pig, HBase, Oozie, Sqoop, and Kafka.
- Experience with DataFrame and RDD architecture, implementing Spark operations on RDDs and DataFrames, and optimizing transformations and actions in Spark.
- Experience processing large sets of structured, semi-structured, and unstructured data using Spark and Scala.
- Strong knowledge of object-oriented programming concepts and solid experience with exception handling, debugging, and tracing.
- Experience with Hadoop architecture and its components, such as HDFS, NameNode, DataNode, and the MapReduce programming paradigm.
- Experience creating Hive tables, Hive joins, and HQL queries, up to and including complex Hive UDFs.
- Experience working with Cassandra for fast data retrieval.
- Worked with several file formats, such as Avro, Parquet, CSV, JSON, SequenceFile, and ORC.
- Knowledge of ETL (Extract, Transform, Load) methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for data analysis.
- Familiarity with several programming languages, including Scala, Python, Java, and C++.
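The DataFrame/RDD optimization experience above can be illustrated with a minimal sketch (the dataset, column names, and local master are hypothetical, not taken from any actual project):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TransformationSketch {
  def main(args: Array[String]): Unit = {
    // Minimal local sketch, assuming a local SparkSession; not production code.
    val spark = SparkSession.builder()
      .appName("transformation-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val orders = Seq(("c1", 10.0), ("c1", 5.0), ("c2", 7.5))
      .toDF("customer_id", "amount")

    // filter and groupBy/agg are lazy transformations;
    // the action (show) is what actually triggers execution.
    orders
      .filter($"amount" >= 5.0)
      .groupBy($"customer_id")
      .agg(sum($"amount").as("total"))
      .show()

    spark.stop()
  }
}
```

Chaining narrow transformations before the action lets Catalyst optimize the whole plan at once, which is the usual reason DataFrame code outperforms hand-written RDD equivalents.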
Data Ingestion: Sqoop, Kafka, Flume, Apache Hadoop ecosystem
Data Processing: Spark, Impala, YARN, MapReduce
Distributed Storage and Computing: HDFS, ZooKeeper, S3
Programming Languages: Scala, SQL, Python
ETL: IBM DataStage, Talend, Ab Initio
Relational Databases: Oracle, MySQL, MS Access
NoSQL Databases: MongoDB, Cassandra, HBase, DynamoDB
Cloud (AWS): EMR, EC2, S3
Build Tools: Jenkins, Maven, Gradle
Version Control: Git
IDEs: IntelliJ IDEA, Eclipse
Operating Systems: Linux, Windows, UNIX
Data Formats: Parquet, SequenceFile, Avro, ORC, CSV, JSON
Monitoring: Ambari, Cloudera Manager
Confidential, Boston, MA
- Developed Spark programs that process data faster than equivalent standard MapReduce programs.
- Wrote extensive Hive queries to transform data for use by downstream models.
- Developed Hive queries to analyze data and generate end reports for business users.
- Developed Oozie workflows with Sqoop actions to migrate data from relational databases such as Oracle and Teradata to HDFS.
- Worked on scalable distributed computing systems, software architecture, data structures, and algorithms using Hadoop and Apache Spark, and ingested streaming data into Hadoop using Spark and Scala.
- Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Developed Scala scripts using both DataFrames and RDDs in Spark for data aggregation and queries, and moved data with Sqoop.
- Developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
- Worked with various HDFS file formats, such as Avro, ORC, and SequenceFile, and compression codecs such as Snappy and gzip.
- Extensively used Maven for builds, SVN as the code repository, and VersionOne for managing the day-to-day agile development process and keeping track of issues and blockers.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high-volume data.
Environment: Hadoop, Spark, Hive, Oracle, Maven, Scala, Python, Pig, Sqoop, Oozie, MongoDB, SVN.
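A hedged sketch of the Hive transformation and reporting work described above (the database, table, and column names are hypothetical, and Hive support must be enabled on the session):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: run a Hive-style aggregation through Spark SQL and
// persist the result for downstream business users. All identifiers
// (raw_db.loans, reports_db.*) are invented for illustration.
val spark = SparkSession.builder()
  .appName("hive-report-sketch")
  .enableHiveSupport()
  .getOrCreate()

val report = spark.sql(
  """SELECT region, COUNT(*) AS loan_count, AVG(balance) AS avg_balance
    |FROM raw_db.loans
    |WHERE status = 'ACTIVE'
    |GROUP BY region""".stripMargin)

// Overwrite keeps the report table idempotent across scheduled runs.
report.write.mode("overwrite").saveAsTable("reports_db.active_loans_by_region")
```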
Confidential, Reston, VA
- Ingested incremental batch data from MySQL and Teradata databases using Sqoop.
- Ingested real-time data into HDFS using Kafka and Oozie.
- Worked with Elastic MapReduce (EMR) for data processing on Amazon Web Services (AWS).
- Worked with Amazon S3 for storage on AWS.
- Worked with different Spark modules, including Spark Core, Spark SQL, Spark Streaming, Datasets, and DataFrames.
- Converted files in HDFS from multiple data formats into RDDs and performed data cleansing using RDD operations.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Wrote complex queries and user-defined functions (UDFs) in Scala for custom functionality in Hive.
- Worked with different HDFS file formats, such as Avro, ORC, Parquet, and SequenceFile.
- Integrated Oozie with the rest of the Hadoop stack, supporting several job types as well as system-specific jobs such as Java programs and shell scripts.
Environment: HDFS, Spark, Hive, Sqoop, Kafka, AWS EMR, AWS S3, Oozie, Spark Core, Spark SQL, Maven, Scala, SQL, Linux, YARN, IntelliJ, Agile Methodology
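The RDD-based data cleansing described above might look like the following sketch (the HDFS paths and three-column record layout are assumptions made for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of cleansing raw delimited text with RDD operations;
// both paths and the expected column count are hypothetical.
val spark = SparkSession.builder().appName("rdd-cleansing-sketch").getOrCreate()
val sc = spark.sparkContext

val raw = sc.textFile("hdfs:///data/incoming/events/*.csv")

val cleansed = raw
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#")) // drop blanks and comments
  .map(_.split(",", -1))
  .filter(_.length == 3)                                  // keep only well-formed records
  .map(cols => s"${cols(0)},${cols(1)},${cols(2).toLowerCase}")

cleansed.saveAsTextFile("hdfs:///data/cleansed/events")
```

Validating record shape before parsing individual fields keeps a single malformed line from failing the whole job.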
Big Data Developer
Confidential, Pittsburgh, PA
- Utilized Sqoop, Kafka, Flume, and the Hadoop File System APIs to implement data ingestion pipelines from heterogeneous data sources.
- Created Amazon S3 storage for data and worked on transferring data from Kafka topics into AWS S3.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
- Implemented Spark scripts in Scala and used Spark SQL to load Hive tables into Spark for faster processing of data.
- Created data pipelines for ingestion and aggregation events, loading consumer response data from AWS S3 buckets into Hive external tables.
- Worked on various data formats, such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML.
- Used Apache NiFi to automate data movement between Hadoop components and to convert raw XML data into JSON and Avro.
Environment: Hadoop, HDFS, AWS, Scala, Kafka, MapReduce, YARN, Spark, Pig, Hive, Python, Java, NiFi, HBase, IMS Mainframe, Maven.
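The Kafka-to-S3 streaming path described above can be sketched with Spark Structured Streaming (broker address, topic name, and bucket paths are all invented for illustration; the spark-sql-kafka connector is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: consume a Kafka topic and land raw messages in S3,
// with checkpointing for exactly-once sink semantics.
val spark = SparkSession.builder().appName("kafka-streaming-sketch").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "consumer-events")           // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

events.writeStream
  .format("json")
  .option("path", "s3a://example-bucket/raw/consumer-events/")
  .option("checkpointLocation", "s3a://example-bucket/checkpoints/consumer-events/")
  .start()
  .awaitTermination()
```

The checkpoint location is what lets the query resume from the last committed Kafka offsets after a restart.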
- Implemented Spring Security features using AOP interceptors for authentication.
- Developed Spring Framework-based RESTful web services for handling and persisting requests, and used Spring MVC to return responses to the presentation tier.
- Used multithreading to improve overall performance and applied the Singleton design pattern in a Hibernate utility class.
- Wrote complex SQL queries to pull data from multiple tables to build reports.
- Used Log4j for error handling, monitoring service status, and filtering bad loans.
- Performed development and debugging using the Eclipse IDE.
Java/ J2EE Developer
- Developed the application using the Spring Framework, leveraging Model-View-Controller (MVC) architecture, Spring Security, and Java APIs.
- Implemented design patterns such as Singleton, Factory, and MVC.
- Deployed the applications on IBM WebSphere Application Server.
- Worked on JavaScript, CSS style sheets, and jQuery.
- Wrote SQL queries to retrieve data from Oracle and MySQL databases.