- Seven years of programming and software development experience, with skills in data analysis, design, development, testing, and deployment of software systems from development through production in Big Data and Java technologies.
- Experience with Hadoop ecosystem components such as Hive, HDFS, Sqoop, Spark, Kafka, and Pig.
- Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as ResourceManager, NodeManager, NameNode, DataNode, MapReduce concepts, and the HDFS framework.
- Working knowledge of the Spark RDD, DataFrame, Dataset, and Data Source APIs, Spark SQL, and Spark Streaming.
- Experience developing data pipelines using Sqoop to extract data from RDBMS sources and store it in HDFS.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop, and loading it into partitioned Hive tables.
- Worked on HQL for data extraction and join operations as required; good experience optimizing Hive queries.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Developed Spark code using Scala, Python, and Spark SQL/Streaming for faster data processing.
- Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets), and used PySpark and spark-shell as appropriate.
- Solid experience creating real-time data streaming solutions using Apache Spark/Spark Streaming, Kafka, and Flume.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Good experience in handling messaging services using Apache Kafka.
- Excellent knowledge of job workflow scheduling and distributed coordination tools/services such as Oozie and ZooKeeper.
- Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
- Good understanding of Amazon Web Services (AWS), including EC2 for compute, S3 for storage, EMR, Redshift, and DynamoDB.
- Good understanding and knowledge of Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps.
- Worked with various file formats such as delimited text, JSON, and XML. Proficient with columnar file formats such as RCFile, ORC, and Parquet, with a good understanding of compression codecs used in Hadoop processing such as gzip, Snappy, and LZO.
- Experience using build tools such as Maven and version control tools such as Git.
- Experienced in the Software Development Lifecycle (SDLC) using Scrum and Agile methodologies.
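As a minimal illustration of the Hive bucketing mentioned above (key values are hypothetical): Hive routes each row to a bucket using a hash of the bucketing key modulo the number of buckets. The sketch below uses a Java `String.hashCode`-style hash; Hive's actual hash function differs in details across versions.

```python
def java_string_hashcode(s: str) -> int:
    """Replicate Java's String.hashCode: h = 31*h + char, in signed 32-bit arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Interpret the result as a signed 32-bit integer, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(key: str, num_buckets: int) -> int:
    """Hive-style bucket assignment: (hash & Integer.MAX_VALUE) % numBuckets."""
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_buckets

# Rows sharing a key always land in the same bucket file,
# which is what makes bucketed map-side joins possible.
print(bucket_for("customer_42", 8))
```

Because the assignment is deterministic, two bucketed tables with the same key and bucket count can be joined bucket-by-bucket without a full shuffle.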
Big Data Technologies: Spark, Airflow, Kafka, Snowflake, Hadoop, HDFS, Hive, MapReduce, Pig, Sqoop, Flume, ZooKeeper, Oozie
Programming Languages: Python, Java, Scala
Java/J2EE Technologies: Java, JavaBeans, J2EE (JSP, Servlets, EJB), Struts, Spring, JDBC
DB Languages: SQL, PL/SQL
Databases: Oracle 11g, SQL Server, IBM DB2
NoSQL Databases: HBase, MongoDB, Cassandra
Cloud Services: AWS, Azure
AWS Services: S3, EC2, EMR, Redshift, RDS, etc.
Operating Systems: Linux, UNIX, CentOS, and Windows variants
Confidential, Eden Prairie, MN
- Developed and added features to existing data analytics applications built with Spark and Hadoop, using Scala, Java, and Python on top of AWS services.
- Translated business problems into analytics problems, recommending and applying the most appropriate methods to yield insights and results.
- Involved in developing Spark applications using Scala and Python for data transformation, cleansing, and validation using the Spark API.
- Worked with all the core Spark APIs (RDD, DataFrame, Data Source, and Dataset) to transform data.
- Worked on both batch and streaming data sources; used Spark Streaming and Kafka for streaming data processing.
- Developed a Spark Streaming script that consumes topics from Kafka, a distributed messaging source, and periodically pushes batches of data to Spark for real-time processing.
- Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
- Worked on Apache NiFi to automate data movement between RDBMS and HDFS.
- Created shell scripts to launch various jobs (MapReduce, Hive, Pig, Spark, etc.) based on requirements.
- Used Hive techniques such as bucketing and partitioning when creating tables.
- Experience with Spark SQL for processing large amounts of structured data.
- Experienced working with source formats including CSV, JSON, Avro, and Parquet.
- Worked on AWS to aggregate clean files in Amazon S3, and on Amazon EC2 clusters to deploy files into buckets.
- Used AWS EMR to create Hadoop and Spark clusters, which were used to submit and execute Scala and Python applications in production.
- Responsible for developing a data pipeline on AWS to extract data from weblogs and store it in HDFS.
- Migrated data from AWS S3 to HDFS using Kafka.
- Worked with NoSQL databases such as HBase and Cassandra to load and retrieve data for real-time processing via a REST API.
- Worked on creating data models for Cassandra from the existing Oracle data model.
- Responsible for transforming and loading large sets of structured, semi-structured, and unstructured data.
- Performed data profiling; identified and communicated data quality issues and worked with other teams as needed to resolve them.
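The streaming bullets above describe consuming Kafka topics and pushing periodic batches to Spark. A toy sketch of that micro-batch pattern in plain Python (no real Kafka or Spark APIs; names and record values are illustrative):

```python
from typing import Iterable, Iterator, List

def micro_batches(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group a record stream into fixed-size batches, loosely mimicking
    how Spark Streaming hands periodic micro-batches to a processing job."""
    batch: List[str] = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch when the stream ends
        yield batch

# Each yielded batch would be transformed and written to a sink.
for batch in micro_batches(["e1", "e2", "e3", "e4", "e5"], 2):
    print(batch)
```

In production Spark Streaming, batching is driven by a time interval rather than a record count, but the shape of the loop is the same: accumulate, hand off, reset.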
Environment: Apache Hive, Apache Kafka, Apache Spark 2.3, Spark SQL, Spark Streaming, ZooKeeper, Pig, Oozie, Scala, Python 3, S3, EMR, EC2, Redshift, Snowflake, Cassandra, NiFi, Flume, HBase.
Confidential, Reading, PA
- Developed a data set process for data mining and data modeling, and recommended ways to improve data quality, efficiency, and reliability.
- Extracted, transformed, and loaded (ETL) and cleansed data from various sources such as XML files, flat files, and databases; involved in UAT, batch testing, and test plans.
- Responsible for writing Hive queries to analyze data in the Hive warehouse using Hive Query Language (HQL); involved in developing Hive DDLs to create, drop, and alter tables.
- Extracted data and loaded it into HDFS using Sqoop imports from various sources such as Oracle, Teradata, and SQL Server.
- Created Hive staging tables and external tables, and joined the tables as required.
- Implemented dynamic partitioning, static partitioning, and bucketing.
- Installed and configured Hadoop MapReduce, Hive, HDFS, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Implemented Sqoop jobs for data ingestion from Oracle to Hive.
- Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps; also completed a POC on Azure Databricks.
- Worked with various file formats such as delimited text files, clickstream logs, Apache log files, Avro, JSON, and XML files. Proficient with columnar file formats such as RCFile, ORC, and Parquet.
- Developed custom Unix/Bash shell scripts for pre- and post-validation of the master and slave nodes, before and after configuring the NameNode and DataNodes respectively.
- Developed Oozie workflows to automate the tasks of loading data into HDFS.
- Implemented compact and efficient storage of big data using file formats such as Avro, Parquet, and JSON, with compression codecs such as gzip and Snappy applied on top of the files.
- Explored Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Worked on Spark using Python and Scala, and on Spark SQL, for faster testing and processing of data.
- Extensively used Stash, Bitbucket, and GitHub for code control.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Wrote test cases, analyzed results, and reported them to product teams.
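One bullet above mentions storing files with compression codecs such as gzip and Snappy. A small stdlib-only sketch of the gzip trade-off on a repetitive JSON payload (Snappy would require a third-party library; the payload is made up for illustration):

```python
import gzip
import json

# A repetitive JSON record batch, the kind of data that compresses well.
payload = json.dumps([{"id": i, "status": "OK"} for i in range(1000)]).encode("utf-8")

compressed = gzip.compress(payload)
assert gzip.decompress(compressed) == payload  # lossless round trip
print(f"raw: {len(payload)} bytes, gzip: {len(compressed)} bytes")
```

The same trade-off drives codec choice in Hadoop: gzip compresses harder, while Snappy and LZO favor decompression speed for hot, frequently scanned data.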
Environment: Hadoop 2.x, HDFS, Microsoft Azure services (HDInsight, Blob Storage, ADLS, Logic Apps, etc.), Hive, Sqoop, Apache Spark 2.2, Spark SQL, ETL, Maven, Oozie, Scala, Python 3, Unix shell scripting.
- Worked on creating MapReduce programs to analyze data for claim report generation, and ran the JARs on Hadoop.
- Extracted, transformed and loaded the data sets using Apache Sqoop.
- Used NiFi and Sqoop for moving data between HDFS and RDBMS.
- Involved in writing Hive queries to analyze ad hoc data from structured as well as semi-structured sources.
- Created Hive tables and worked on them using HiveQL.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Imported data from different sources into Spark RDD for processing.
- Ingested data into HDFS using Flume.
- Worked on Hadoop components such as Pig and Oozie.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Involved in agile development methodology and actively participated in daily scrum meetings.
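The daily incremental loads mentioned above typically track a high-watermark so each run pulls only rows newer than the previous run. A minimal sketch of that bookkeeping (field names and timestamps are hypothetical):

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

def incremental_load(rows: List[Row], last_watermark: str) -> Tuple[List[Row], str]:
    """Return rows newer than the stored watermark, plus the new watermark.
    ISO-8601 timestamp strings compare correctly lexicographically."""
    fresh = [r for r in rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in fresh), default=last_watermark)
    return fresh, new_watermark

rows = [
    {"id": "1", "updated_at": "2020-01-01T00:00:00"},
    {"id": "2", "updated_at": "2020-01-02T09:30:00"},
]
fresh, wm = incremental_load(rows, "2020-01-01T12:00:00")
print(len(fresh), wm)  # only row 2 is newer than the watermark
```

In a Sqoop/Oozie setup the same idea appears as Sqoop's last-value incremental mode, with the workflow persisting the watermark between daily runs.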
Environment: Hadoop 2.x, HDFS, Hive, Sqoop, Apache Spark 2.2, NiFi, ETL, Pig, Oozie, Scala, Python 3.