- 8+ years of IT experience, including analysis, design, and development of Big Data solutions using Hadoop and Scala; design and development of web applications using Java and Spring Boot; and database and data warehousing development using MySQL and Oracle.
- 4+ years of work experience in Big Data analytics, with hands-on experience installing, configuring, and using ecosystem components such as Hadoop MapReduce, HDFS, HBase, Zookeeper, Hive, Sqoop, Pig, Flume, Cassandra, Kafka, Spark, and NiFi.
- Good understanding of Hadoop architecture and hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, and DataNode, as well as MapReduce concepts and the HDFS framework.
- Experience using Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters (CDH4 and CDH5).
- Implemented design patterns in Scala and developed quality code adhering to Scala coding standards and best practices.
- Strong experience in analyzing large data sets by writing PySpark scripts and Hive queries.
- Implemented applications in Scala with Akka and the Play framework, and implemented RESTful services in Spring.
- Experience running Apache Hadoop, CDH, and MapR distributions, as well as Amazon Elastic MapReduce (EMR) on EC2.
- Implemented advanced procedures such as text analytics and processing in Scala, using the in-memory computing capabilities of Apache Spark.
- Experience in pulling data from Amazon S3 cloud to HDFS.
- Worked extensively with AWS services such as EC2, S3, EMR, FSx, Lambda, CloudWatch, RDS, Auto Scaling, and CloudFormation.
- Hands-on experience with VPN, PuTTY, and WinSCP.
- Experience in data load management, importing and exporting data using Sqoop and Flume.
- Experience in analyzing data using Hive, Pig, and custom MapReduce programs in Java.
- Experience in scheduling and monitoring jobs using Oozie and Zookeeper.
- Experienced in writing MapReduce programs and UDFs for both Pig and Hive in Java.
- Experience in processing log files to extract data and copy it into HDFS using Flume.
- Experience in integrating Hive and HBase for effective operations.
- Experience in Impala, Solr, MongoDB, HBase and Spark.
- Hands on knowledge of writing code in Scala.
- Expertise in Waterfall and Agile/Scrum methodologies.
- Experienced with code versioning and dependency management systems such as Git, SVN, Bitbucket, and Maven.
- Experience writing single-threaded, multi-threaded, and user-interface event-driven applications, both stand-alone and those that access servers or services.
- Good experience in using data modeling techniques and SQL and PL/SQL queries to derive results.
- Experience working with different databases, such as Oracle, SQL Server, and MySQL, and writing stored procedures, functions, joins, and triggers for different data models.
- Expertise in implementing Service Oriented Architectures (SOA) with XML-based web services (SOAP/REST).
- Great team player and quick learner with effective communication, motivation, and organizational skills combined with attention to detail and business improvements.
- Experienced in handling different file formats such as text files, Avro data files, sequence files, XML, and JSON files.
Big Data Technologies: HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Oozie, Hadoop distributions, HBase, Spark, YARN, Zookeeper, Kafka.
Programming languages: Core Java, Spring Boot, Scala.
Databases: MySQL, SQL/PL-SQL, MS SQL Server 2012/16, Oracle 10g/11g/12c
NoSQL Databases: Cassandra, HBase, MongoDB, Elasticsearch
Operating Systems: Linux, Windows XP/7/8/10, Mac.
Software Life Cycle: SDLC, Waterfall and Agile models.
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Alteryx, Visio.
Data Visualization Tools: Tableau, SSRS, CloudHealth.
AWS Services: EC2, S3, EMR, RDS, Lambda, CloudWatch, FSx, Auto Scaling, CloudFormation
Sr. Big Data/Hadoop Developer
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Loaded data into Spark RDDs and performed advanced procedures such as text analytics using Spark's in-memory computation capabilities to generate the output response.
- Used the Akka framework to enable concurrent processing while loading the data lake.
- Executed many performance tests using the cassandra-stress tool to measure and improve the read and write performance of the cluster.
- Handled large datasets during the ingestion process itself using partitioning, broadcast variables, efficient joins, and transformations in Spark.
- Configured Spark Streaming to consume data from Kafka streams and store it in HDFS.
- Partitioned data streams using Kafka; designed and configured a Kafka cluster to accommodate heavy throughput of 1 million messages per second. Used Kafka producer APIs to produce messages.
- Handled ingestion of data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
- Created an Akka Actor in Scala that parses an API response to extract information for an analysis workflow.
- Performed data cleansing and analysis with appropriate tools.
- Worked with highly unstructured and semi-structured data and processed it according to customer requirements.
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Developed efficient Spark programs in Python to perform batch processing on huge unstructured datasets.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Analyzed the data by running Hive queries (HiveQL) and Pig scripts (Pig Latin) to study customer behaviour.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Tuned Spark applications by setting the right batch interval, the correct level of parallelism, and memory configuration.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Wrote Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Ingested data from RDBMS into Hive to perform data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Experience in AWS cloud services with EC2, EMR, Redshift, Kinesis, Glue, and S3.
- Loaded Parquet files from AWS S3 using Spark.
- Administered, provisioned, patched, and maintained Cloudera Hadoop clusters on Linux.
- Responsible for building scalable distributed data solutions using Hadoop.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for huge volumes of data. Set up job management using the Fair Scheduler and developed job processing scripts using Oozie workflows.
- Designed and implemented the end-to-end architecture of client-server systems using Scala and Akka.
- Wrote Spark applications in Scala to interact with the MySQL database using the Spark SQLContext and accessed Hive tables using HiveContext. Extracted the data from Teradata into HDFS/dashboards using Spark Streaming.
- Implemented Informatica Procedures and Standards while developing and testing the Informatica objects.
- Experienced in data pipelines using Kafka and Akka for handling terabytes of data.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storage of small data sets. Experienced in maintaining the Hadoop cluster on AWS EMR.
- Migrated Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR.
- Developed fully functional, responsive modules based on business requirements using Scala with Akka.
- Wrote AWS Lambda functions in Scala with cross-functional dependencies, generating custom libraries for deploying the Lambda functions in the cloud.
Environment: MapReduce, HDFS, Scala, Oracle, Kafka connectors, Maven 4.0, Spark, Hive, AWS (EC2, EMR, S3, etc.), SQL, Tableau, Cloudera, YARN, Zookeeper, Scripting (Shell/Python), Sqoop, Oozie, GitHub.
Sr. Data Engineer / Hadoop Developer
- Involved in requirements gathering and business analysis, and translated business requirements into technical designs for Hadoop and Big Data.
- Involved in the Sqoop implementation for loading data from various RDBMS sources into Hadoop systems and vice versa.
- Developed Python scripts to extract the data from the web server output files to load into HDFS.
- Involved in HBase setup and in storing data in HBase for further analysis.
- Used the CloudHealth tool to generate AWS reports and dashboards for cost analysis.
- Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications.
- Worked extensively with Avro and Parquet files and converted data from either format to the other. Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
- Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action. Documented system processes and procedures for future reference.
- Involved in Configuring Hadoop cluster and load balancing across the nodes.
- Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging the configuration of multiple nodes using the Hortonworks platform.
- Worked with Spark on top of YARN/MRv2 for interactive and batch analysis.
- Worked closely with AWS EC2 infrastructure teams to troubleshoot complex issues
- Expertise in writing Scala code using higher-order functions for iterative algorithms in Spark, with performance in mind.
- Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
- Experienced in writing real-time processing jobs using Spark Streaming with Kafka.
- Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting
- Experienced in querying data using Spark SQL on top of the Spark engine.
- Involved in managing and monitoring Hadoop cluster using Cloudera Manager.
- Used Python and Shell scripting to build pipelines.
- Developed a data pipeline using Sqoop, HQL, Spark, and Kafka to ingest enterprise message delivery data into HDFS.
- Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Assisted in cluster maintenance, cluster monitoring, and adding and removing cluster nodes. Installed and configured Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Involved in moving files between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Created data partitions on large data sets in S3 and DDL on partitioned data.
- Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Monitored and troubleshot Hadoop jobs using the YARN ResourceManager, and EMR job logs using Genie and Kibana.
Environment: HDFS, Hive, Java, Sqoop, Spark, YARN, Cloudera Manager, CloudHealth, Splunk, Oracle, Elasticsearch, Kerberos, Impala, Jira, Confluence, Shell/Perl Scripting, Python, Avro, Zookeeper, AWS (EC2, S3, EMR, VPC, RDS, Lambda, CloudWatch, etc.), Ranger, Git, Airflow.
Data Engineer / Hadoop Developer
- Imported data from sources such as HDFS and HBase into Spark RDDs.
- Used the Spark Streaming and Spark SQL APIs to process the files.
- Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes.
- Stored data in AWS S3, used in the same way as HDFS, and ran EMR jobs on the data stored in S3.
- Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
- Developed UDFs in Java for Hive and Pig, and worked on reading multiple data formats on HDFS using Scala.
- Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Involved in migrating the platform from Cloudera to EMR.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Developed scripts that automated end-to-end data management and synchronization between all the clusters.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke and run MapReduce jobs in the backend.
- Extensively involved in developing RESTful APIs using the JSON library of the Play framework.
- Developed a Storm topology to ingest data from various sources into the Hadoop data lake.
- Developed a web application using the HBase and Hive APIs to compare schemas between HBase and Hive tables.
- Played a vital role in developing web-based applications with the Scala/Akka framework.
- Connected to AWS EMR using SSH and ran spark-submit jobs.
- Developed a Python script to import data from SQL Server into HDFS and created Hive views on the data in HDFS using Spark.
- Created scripts to append data from a temporary HBase table to the target HBase table in Spark.
- Developed complex, multi-step data pipelines using Spark.
- Worked on Big Data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods technologies.
- Populated HDFS and Cassandra with huge amounts of data using Apache Kafka.
- Monitored YARN applications; troubleshot and resolved cluster-related system problems.
- Upgraded the Hadoop cluster from CDH3 to CDH4, set up a high-availability cluster, and integrated Hive with existing applications.
- Involved in creating ETL flows using Pig, loading data, and writing Pig Latin queries that run internally as MapReduce jobs.
- Involved in writing Unix/Linux shell scripts for scheduling jobs, and in writing Pig scripts and HiveQL.
- Assisted in exporting data into Cassandra and writing column families to provide fast listing outputs.
- Used Zookeeper to provide coordination services to the cluster.
- Worked with the Hue UI for job scheduling, file browsing, job browsing, and metastore management.
- Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
Environment: Apache Hadoop, HDFS, Hive, Core Java, Sqoop, Spark, Cloudera CDH4, Oracle, Elasticsearch, Kerberos, SFTP, Impala, Jira, Wiki, Alteryx, Teradata, Shell/Perl Scripting, Kafka, AWS (EC2, S3, EMR).