- 5+ (USA) and 7+ (BD) years of IT experience in analysis, design, development and implementation of large-scale applications using Big Data and Java/J2EE technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python and Scala.
- Strong experience writing Spark Core, Spark SQL, Spark Streaming, Java MapReduce and Spark-on-Java applications.
- Experienced with the analytical functions of Apache Spark, Hive and Pig, and in extending their functionality by writing custom UDFs and hooking them into larger Spark applications as in-line functions.
- Experience with installation, backup, recovery, configuration and development on multiple Hadoop distributions (Cloudera and Hortonworks) as well as the Amazon AWS and Google Cloud platforms.
- Highly skilled in optimizing and moving large-scale pipeline applications from on-premises clusters to the AWS cloud.
- Experienced in building frameworks for large-scale streaming applications in Apache Spark.
- Worked on migrating Hadoop MapReduce programs to Apache Spark in Scala.
- Extensive hands-on knowledge of working on the Amazon AWS and Google Cloud Architecture.
- Highly skilled in integrating Amazon Kinesis streams with Spark Streaming applications to build long-running real-time applications.
- Solid understanding of RDD operations in Apache Spark i.e., Transformations & Actions, Persistence (Caching), Accumulators, Broadcast Variables, Optimizing Broadcasts.
- In-depth knowledge of handling large amounts of data using the Spark DataFrame/Dataset APIs and case classes.
- Experienced in running queries using Impala and in using BI tools to run ad-hoc queries directly on Hadoop.
- Working knowledge of utilizing Hadoop file formats such as Sequence, ORC, Avro, Parquet as well as open source Text/CSV and JSON formatted files.
- In-depth knowledge of the Big Data architecture and the components of Hadoop 1.x and 2.x, such as HDFS, JobTracker, TaskTracker, DataNode and NameNode, and of YARN concepts such as ResourceManager and NodeManager.
- Hands-on experience with AWS cloud services (VPC, EC2, S3, RDS, Glue, Redshift, Data Pipeline, EMR, DynamoDB, Workspaces, Lambda, Kinesis, SNS, SQS).
- Wrote HiveQL and Pig Latin scripts, leading to a good understanding of MapReduce design patterns and data analysis using Hive and Pig.
- Great knowledge of working with Apache Spark Streaming API on Big Data Distributions in an active cluster environment.
- Very capable at using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop/Spark jobs on AWS.
- Well versed in writing and deploying Oozie workflows and coordinators, and in scheduling, monitoring and troubleshooting them through the Hue UI.
- Proficient in importing and exporting data from Relational Database Systems to HDFS and vice versa, using Sqoop.
- Good understanding of NoSQL databases in enterprise use cases, including the column-family stores HBase and Cassandra and the document store MongoDB.
- Experience working in Waterfall and Agile - SCRUM methodologies.
- Ability to adapt to evolving technologies, a strong sense of responsibility and accomplishment.
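The map-and-reduce pattern behind the MapReduce and Spark RDD work listed above can be sketched in plain Python. This is an illustrative word-count example only; the function names and sample data are hypothetical and not drawn from any project described here.

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    return chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

def reduce_phase(pairs):
    # Reducer: group by key and sum the counts
    # (the grouping stands in for the shuffle step).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big", "data pipeline"]
print(reduce_phase(map_phase(lines)))  # → {'big': 2, 'data': 2, 'pipeline': 1}
```

The same shape maps directly onto a Spark RDD pipeline (`flatMap` → `map` → `reduceByKey`), with the reduce step distributed across partitions.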
- Worked on developing the architecture document and accompanying guidelines.
- Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and the loading of structured and unstructured data.
- Processed multiple data-source inputs to the same Reducer using GenericWritable and multiple input formats.
- Worked on big data processing of clinical and non-clinical data using MapReduce.
- Visualized HDFS data for customers in a BI tool with the help of the Hive ODBC driver.
- Customized a BI tool for the management team to perform query analytics using HiveQL.
- Imported data from MySQL into HDFS on a regular basis using Sqoop.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Created Hive generic UDFs to process business logic that varies by policy.
- Moved relational database data into Hive dynamic-partition tables through staging tables, using Sqoop.
- Experienced in monitoring the cluster using Cloudera Manager.
- Involved in discussions with business users to gather the required knowledge.
- Capable of creating real-time data-streaming solutions and batch-style, large-scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
- Analyzed the requirements to develop the framework.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL and Big Data technologies.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Developed Java Spark Streaming scripts to load raw files and their corresponding processed metadata files into AWS S3 and an Elasticsearch cluster.
- Developed Python scripts to get the most recent S3 keys from Elasticsearch.
- Wrote Python scripts to fetch S3 files using the Boto3 module.
- Implemented PySpark logic to transform and process various data formats such as XLSX, XLS, JSON and TXT.
- Built scripts to load PySpark-processed files into the Redshift database, applying diverse PySpark logic.
- Developed scripts to monitor and capture the state of each file as it passes through the pipeline.
- Developed Map Reduce programs to cleanse the data in HDFS obtained from heterogeneous data sources.
- Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs, and used Oozie operational services for batch processing and dynamic workflow scheduling.
- Work included migration of existing applications and development of new applications using AWS cloud services.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Responsible for Cluster maintenance, adding and removing cluster nodes, Cluster Monitoring and Troubleshooting, Managing and reviewing data backups & log files.
Environment: AWS S3, Java, Maven, Python, Spark, Kafka, Elasticsearch, MapR cluster, Amazon Redshift, shell scripting, pandas, PySpark, Pig, Hive, Oozie, JSON, AWS Glue.
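The "most recent S3 keys from Elasticsearch" step above can be sketched in plain Python. The record shape and field names here are hypothetical stand-ins for Elasticsearch query hits, chosen for illustration only; a real implementation would query the cluster and fetch the matching objects with Boto3.

```python
from datetime import datetime, timezone

def recent_keys(records, since):
    """Return the S3 keys whose last-modified timestamp is at or after `since`.

    `records` stands in for hits returned by an Elasticsearch query, as a
    list of {"key": str, "last_modified": ISO-8601 str} dicts (assumed shape).
    """
    out = []
    for rec in records:
        ts = datetime.fromisoformat(rec["last_modified"])
        if ts >= since:
            out.append(rec["key"])
    return sorted(out)

records = [
    {"key": "raw/2024/01/a.json", "last_modified": "2024-01-02T00:00:00+00:00"},
    {"key": "raw/2023/12/b.json", "last_modified": "2023-12-15T00:00:00+00:00"},
]
cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(recent_keys(records, cutoff))  # → ['raw/2024/01/a.json']
```

Keeping the cutoff comparison in one small pure function makes the selection logic easy to unit-test without touching AWS or Elasticsearch.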
- Involved in the complete project life cycle, from design discussions to production deployment.
- Worked closely with the business team to gather requirements and new support features.
- Involved in running POCs on different use cases of the application and maintained a standard document of best coding practices.
- Built a 200-node cluster while designing the data lake on the Hortonworks distribution.
- Responsible for building scalable distributed data solutions using Hadoop.
- Installed, configured and implemented high-availability Hadoop clusters with the required services (HDFS, Hive, HBase, Spark, Zookeeper).
- Implemented Kerberos to authenticate all services in the Hadoop cluster.
- Responsible for installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster and created hive tables to store the processed results in a tabular format.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala.
- Exported data to RDBMS servers using Sqoop and processed it for ETL operations.
- Worked on S3 buckets on AWS to store CloudFormation templates and created EC2 instances on AWS.
- Designed the ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop and MySQL.
- Involved in Spark and Spark Streaming, creating RDDs and applying operations (transformations and actions).
- Created partitioned tables and loaded data using both the static and dynamic partition methods.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Handled importing of data from various sources, performed transformations using Hive and MapReduce, loaded the data into HDFS and extracted data from Oracle into HDFS using Sqoop.
- Used Kafka for publish-subscribe messaging as a distributed commit log; experienced with its speed, scalability and durability.
- Followed the Test-Driven Development (TDD) process, with extensive experience in the Agile and Scrum methodologies.
- Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, monitoring and troubleshooting, and in managing and reviewing data backups and log files.
- Provided an application demo to the client by designing and developing a search engine, report analysis trends and application administration prototype screens using AngularJS and Bootstrap JS.
- Took ownership of the complete application design of the Java part and the Hadoop integration.
- Apart from normal requirements gathering, participated in business meetings with the client to gather security requirements.
- Assisted the architect in analyzing the existing and future systems; prepared design blueprints and application flow documentation.
- Experienced in managing and reviewing Hadoop log files; loaded and transformed large sets of structured, semi-structured and unstructured data.
- Responsible for managing data coming from different sources and applications; supported MapReduce programs running on the cluster.
- Responsible for working with message broker systems such as Kafka; extracted data from mainframes, fed it to Kafka and ingested it into HBase to perform analytics.
- Wrote an event-driven link-tracking system to capture user events and feed them to Kafka, which pushed them to HBase.
- Created MapReduce jobs to extract contents from HBase and configured them in Oozie workflows to generate analytical reports.
- Created project structures and configurations according to the project architecture and made them available for junior developers to continue their work.
- Handled the onsite coordinator role to deliver work to the offshore team; involved in code reviews and application-lead support activities.
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
Environment: Cassandra, Spring 3.2, Spring Data, Pig, Hive, Apache Avro, MapReduce, Sqoop, Zookeeper, SVN, Jenkins, Spark, HBase.
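The static and dynamic Hive partitioning mentioned above can be illustrated with a minimal Python sketch. This only mimics, client-side, how dynamic partitioning routes each row to a `column=value` partition directory; the row layout and column names are hypothetical, and real Hive performs this server-side during the insert.

```python
from collections import defaultdict

def dynamic_partition(rows, partition_col):
    """Group rows by a partition column, mimicking how Hive dynamic
    partitioning routes each row to a partition directory
    (illustrative only; not how Hive is actually implemented)."""
    parts = defaultdict(list)
    for row in rows:
        value = row[partition_col]
        parts[f"{partition_col}={value}"].append(row)
    return dict(parts)

rows = [
    {"id": 1, "state": "CA"},
    {"id": 2, "state": "NY"},
    {"id": 3, "state": "CA"},
]
print(sorted(dynamic_partition(rows, "state")))  # → ['state=CA', 'state=NY']
```

Static partitioning corresponds to the caller fixing the partition value up front for a whole batch, whereas here each row carries its own partition value, as in a dynamic-partition insert.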