- Around 6 years of professional IT experience in Big Data technologies, architecture, and systems.
- Hands-on experience with CDH and HDP Hadoop ecosystem components such as Hadoop, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Oozie, ZooKeeper, Kafka, and Flume.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
- Experienced in importing and exporting data using the Flume and Kafka stream-processing platforms
- Wrote Hive UDFs as required and executed complex HiveQL queries to extract data from Hive tables
- Used partitioning and bucketing in Hive and designed both managed and external tables for performance optimization
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames and Scala
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra
- Experienced in workflow scheduling and locking tools/services such as Oozie and ZooKeeper
- Practiced ETL methods in enterprise-wide solutions, data warehousing, reporting and data analysis
- Experienced in working with AWS, using EMR and EC2 for compute and S3 for storage
- Developed Impala scripts for extraction, transformation, and loading of data into the data warehouse
- Good knowledge of using Apache NiFi to automate data movement between Hadoop systems
- Used Pig scripts for transformations, event joins, filters and pre-aggregations for HDFS storage
- Used Sqoop to import and export data between HDFS and RDBMSs including Oracle, MySQL, and MS SQL Server
- Good knowledge of UNIX shell scripting for automating deployments and other routine tasks
- Experienced in using IDEs like Eclipse, NetBeans, IntelliJ.
- Used JIRA and Rally for bug tracking, and GitHub and SVN for code reviews and unit testing
- Experienced in all phases of the SDLC under both Agile and Waterfall methodologies
- Good understanding of Agile Scrum methodology, Test-Driven Development, and CI/CD
- Responsible for building scalable distributed data solutions using Hadoop and migrating legacy retail applications from Talend ETL to Hadoop.
- Performed real-time analytics on HBase using the Java API and moved data to and from HBase by writing MapReduce jobs. Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and processing using HDP 2.0.
- Wrote SQL queries to process the data using Spark SQL. Worked extensively on importing metadata into Hive, migrated existing tables and applications to work on Hive, and made the data available.
- Extracted data from different databases and copied it into HDFS using Sqoop.
- Worked extensively on importing metadata into Hive using Python and migrated existing tables and applications to work on the Talend ETL tool.
- Created Talend mappings to populate the staging, dimension, and fact tables.
- Worked on a project to retrieve log messages using Spark Streaming.
- Designed Oozie jobs for the automated processing of similar data and collected the data using Spark Streaming.
- Analyzed the data by running Hive queries and scripts to understand user behavior.
- Installed the Oozie workflow engine to run multiple Hive jobs. Used the Scala collections framework to store and process complex consumer information. Used Scala functional programming concepts to develop business logic.
- Developed Hive scripts in areas where extensive coding needed to be reduced.
- Applied an in-depth understanding of the Scala programming language and the Lift framework. Generated Scala and Java classes from the respective APIs so that they could be incorporated into the overall application.
- Worked with Spark Streaming to ingest data into the Spark engine. Extensively used PL/SQL FORALL and BULK COLLECT to fetch large volumes of data from tables.
- Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Imported data from various data sources using Sqoop, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Ran reports in a Linux environment, wrote shell scripts to generate those reports, and used Linux to manage files.
- Analyzed the Hadoop cluster and different big data analytics tools including Pig, HBase, and Sqoop.
- Translated high-level design specifications into simple ETL coding and mapping standards.
- Developed complex Talend job mappings to load data from various sources using different components. Designed, developed, and implemented solutions using Talend Integration Suite.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data from relational databases into HDFS using Sqoop imports.
- Responsible for analyzing and cleansing raw data by running Hive queries and Pig scripts on the data.
- Developed a Spark application to filter JSON source data and store it in HDFS with partitions, and used Spark to extract the schema of the JSON files.
- Imported data from different sources, such as Talend ETL and the local file system, into Spark RDDs. Developed and maintained applications written for Elastic MapReduce.
- Managed data coming from RDBMS sources and was involved in HDFS maintenance and the loading of structured data.
- Optimized several MapReduce algorithms in Java according to client requirements for big data analytics.
- Responsible for importing data from MySQL into HDFS and providing query capabilities using Hive.
- Used Sqoop to import data from RDBMSs into the Hadoop Distributed File System (HDFS) and later analyzed the imported data using Hadoop components.
- Developed Sqoop scripts to enable interaction between Pig and the MySQL database.
- Wrote shell scripts for scheduling and automating tasks.
- Managed and reviewed Hadoop log files to identify issues when jobs failed.
Environment: Hadoop, HDFS, Hive, Oozie, Sqoop, Spark, ETL, ESP Workstation, Shell Scripting, HBase, GitHub, Tableau, Oracle, MySQL, Agile/Scrum
Confidential, Bethesda, MD
- Managed jobs using the Fair Scheduler and developed job-processing scripts using Oozie workflows.
- Used Spark and Hive to implement the transformations needed to join daily ingested data with historical data.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
- Developed Spark scripts using Scala shell commands as required.
- Used the Spark API on an EMR cluster running Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using DataFrames, Spark SQL, Datasets, and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Tuned the performance of Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
- Developed reusable transformations to load data from flat files and other data sources to the Data Warehouse.
- Assisted the operations support team with transactional data loads by developing SQL*Loader and UNIX scripts
- Implemented Spark SQL queries that intermix Hive queries with the programmatic data manipulations supported by RDDs and DataFrames in Scala and Python.
- Implemented a Python script to call the Cassandra REST API, perform transformations, and load the data into Hive.
- Worked extensively with Python and built a custom ingest framework.
- Handled large datasets during the ingestion process itself using partitions, Spark's in-memory capabilities, Spark broadcasts, efficient joins, and other transformations.
- Wrote live real-time processing jobs using Spark Streaming with Kafka.
- Created Cassandra tables to store data in various formats coming from different sources.
- Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
- Worked extensively with Sqoop for importing metadata from Oracle.
- Created Hive tables, and loaded and analyzed data using Hive queries.
- Developed Hive queries to process the data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Good experience with Talend Open Studio for designing ETL jobs for data processing.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Moved files between HDFS and AWS S3.
- Worked extensively with S3 buckets in AWS.
- Used reporting tools such as Tableau, connected to Hive, to generate daily data reports.
- Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
Environment: Hadoop, HDFS, Hive, Oozie, Hortonworks Sandbox, Java, Eclipse Luna, ZooKeeper, JSON file format, Scala, Apache Spark, Kafka.
Confidential, New York
- Responsible for building scalable distributed data solutions using Apache Hadoop and Spark.
- Worked in the BI team in the area of Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Configured, deployed, and maintained multi-node dev and test Kafka clusters.
- Developed Spark scripts using Scala shell commands as required.
- Used the Spark API on Cloudera Hadoop YARN to perform analytics on data in Hive.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Developed Hive queries for the analysts.
- Worked extensively with Sqoop to import and export data between HDFS and relational database systems/mainframes.
- Managed and reviewed Hadoop log files.
- Shared responsibility for administration of Apache Spark, Hive and Pig.
- Built and maintained scalable data pipelines using the Hadoop ecosystem and other open source components such as Hive and HBase.
- Performed in-memory processing using Spark and ran real-time streaming analytics on it.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in EDW tables.
- Enabled speedy reviews and first-mover advantages by using Oozie to automate data loading into the Hadoop Distributed File System.
- Captured data from existing databases that provide SQL interfaces using Sqoop.
- Developed and maintained complex outbound notification applications that run on custom architectures, using diverse technologies including Core Java, J2EE, SOAP, XML and Web Services.
- Tested raw data and executed performance scripts.
- Developed Oozie workflows to automate loading data into HDFS.
- Wrote MapReduce programs and used the Apache Hadoop API to analyze the data
- Provided design recommendations and thought leadership to sponsors/stakeholders that improved review processes and resolved technical problems.
- Configured Hive bolts and wrote data to Hive in the Hortonworks Sandbox as part of a POC.
- Assessed existing and available data warehousing technologies and methods to ensure that the data warehouse/BI architecture meets the needs of the business unit and enterprise and allows for business growth.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, SQL, Cloudera Manager, Pig, Apache Sqoop, Spark, Oozie, HBase, AWS, PL/SQL, MySQL and Windows.