Hadoop/Spark Developer Resume
SUMMARY
- Experienced Hadoop developer with 6+ years of extensive hands-on experience in the IT industry, including 5 years' experience deploying Hadoop ecosystem components such as MapReduce, YARN, Sqoop, Flume, Pig, Hive, HBase, Cassandra, ZooKeeper, Oozie, Ambari, BigQuery, and Bigtable, and 7 years' experience with Spark, Storm, Scala, and Python.
- Experience in OLTP and OLAP design, development, testing, implementation, and support of enterprise data warehouses. Strong knowledge of Hadoop cluster capacity planning, performance tuning, and cluster monitoring.
- Extensive experience in business data science project life cycle including Data Acquisition, Data Cleaning, Data Manipulation, Data Validation, Data Mining, Machine Learning Algorithms, and Visualization.
- Good hands-on experience working with ecosystem components like Hive, Pig, Sqoop, MapReduce, Flume, and Oozie. Strong knowledge of Hive and Pig core functionality using custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF), and User Defined Aggregating Functions (UDAF) for Hive. Experience productionizing Apache NiFi for data flows with significant processing requirements and controlling data flow security.
PROFESSIONAL EXPERIENCE
Hadoop/Spark Developer
Confidential
Responsibilities:
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing Hadoop algorithms, using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, Spark on YARN, and Spark Streaming APIs to perform transformations and actions on the fly.
- Used Akka concurrency for processing PDL files
- Built a learner data model that receives data from Kafka in near real time and persists it to Cassandra. Developed a Kafka consumer API in Scala for consuming data from Kafka topics (see the consumer sketch after this list).
- Derived business insights from extremely large datasets using Google BigQuery.
- Consumed XML messages using Kafka and processed the XML files using Spark Streaming to capture UI data. Worked with streaming platforms (Confluent/Kafka): database integration as both source and sink, and Confluent/Kafka configuration for offsets and other parameters. Implemented a Confluent/Kafka producer application to produce near-real-time data for stream processing.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
- Developed reusable components in Informatica.
- Performed tuning of Hive queries, SQL queries, and Informatica sessions.
- Migrated Informatica mappings from lower to higher environments.
- Implemented event sourcing using Akka.
- Developed distributed applications using the Akka actor model for extreme scalability (see the actor sketch after this list).
- Enabled and automated data pipelines for moving data from Oracle and DB2 source tables to Hadoop and Google BigQuery, using GitHub for source control and Jenkins; experience setting up CI/CD pipelines using Jenkins, Maven, Nexus, GitHub, Chef, Terraform, and AWS.
- Deployed various services such as Spark, MongoDB, and Cassandra in Kubernetes and Hadoop clusters using Docker.
- Created Airflow Scheduling scripts in Python
- Built data pipelines using Hadoop ecosystem components such as Hive, Spark, and Airflow.
- Optimized Confluent/Kafka clusters and workloads. Managed Confluent Control Center (C3), Replicator, Connect, brokers, the REST API, ZooKeeper, and KSQL, and reviewed and optimized the Spring Cloud Stream to Kafka interface.
- Experience developing microservices with Spring Boot using Java and with the Akka framework using Scala.
- Strong programming skills in designing and implementing multi-tier applications using web technologies like Spring MVC and Spring Boot.
- Used an AngularJS frontend to integrate REST endpoints for the client.
- Used AngularJS for developing single-page applications (SPAs), making use of several built-in core directives, expressions, and modules.
- Reviewed the Apache Kafka architecture implemented in the NT Matrix project and helped tune Kafka producer, consumer, and broker performance.
- Worked with the Play framework and Akka parallel processing.
- Reviewed current configuration settings, including but not limited to compression, batch size, linger time, and sync/async on the producer side and fetch size on the consumer side, and recommended optimal settings.
- Helped tune Kafka configuration to improve latency and throughput. Worked from scratch on Kafka configuration, including managers and brokers.
- Led SQL data integration and Hadoop developer roles and responsibilities on Google Cloud Platform (Google Cloud Storage, BigQuery, Bigtable, Cloud SQL).
- Used Apache Kafka to aggregate web log data from multiple servers and make them available in Kafka
- Worked on analyzing the Hadoop cluster using different big data analytic tools including Flume, Pig, Hive, HBase, Oozie, Zookeeper, ksql, Sqoop, Spark, and Kafka.
- Used development tools such as Jenkins, Rundeck, SVN/Crucible/Jira, Git/Stash, etc.
- Extensively used the Akka actor architecture for scalable and hassle-free multi-threading.
- As a big data developer, implemented solutions for ingesting data from various sources and processing data at rest using big data technologies such as Hadoop, MapReduce frameworks, MongoDB, Hive, Oozie, Flume, Impala, Sqoop, and Talend.
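A minimal sketch of the Scala Kafka consumer described above (the learner data model bullet); the broker address, consumer group, and topic name are placeholders rather than the project's actual values:

```scala
// Minimal Kafka consumer sketch in Scala (assumes Scala 2.13+ and the kafka-clients library).
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object LearnerEventConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")   // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "learner-consumer-group")  // assumed group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("learner-events").asJava)                    // assumed topic

    try {
      while (true) {
        val records = consumer.poll(Duration.ofMillis(500))
        records.asScala.foreach { r =>
          // Downstream step (persisting to Cassandra) is omitted here.
          println(s"offset=${r.offset} key=${r.key} value=${r.value}")
        }
      }
    } finally consumer.close()
  }
}
```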
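The Akka actor-model work can be illustrated with a short classic-actors sketch; the message type, actor names, and file paths below are hypothetical, not the project's actual model:

```scala
// Hedged Akka (classic actors) sketch: a round-robin pool of workers processing files
// concurrently without shared-state locking. Names and paths are illustrative only.
import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

final case class ProcessFile(path: String)

class FileWorker extends Actor {
  def receive: Receive = {
    case ProcessFile(path) =>
      // Each actor handles one message at a time, so no explicit synchronization is needed.
      println(s"processing $path on ${Thread.currentThread().getName}")
  }
}

object ActorDemo extends App {
  val system = ActorSystem("pdl-processing")

  // Scale out by routing messages across a pool of identical workers.
  val workers = system.actorOf(RoundRobinPool(4).props(Props[FileWorker]()), "file-workers")
  (1 to 8).foreach(i => workers ! ProcessFile(s"/data/in/file-$i.pdl")) // assumed paths
}
```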
Hadoop/Spark Developer
Confidential
Responsibilities:
- Played a crucial role in installing, configuring, supporting, and managing Hadoop clusters, handling Cassandra security, maintenance, and tuning of both database and server.
- Worked on designing and developing the real-time analysis module for an analytics dashboard using Cassandra, Kafka, and Spark Streaming.
- Installed and configured Confluent Kafka in the R&D line and validated the installation with the HDFS and Hive connectors.
- Deployed high availability on the Hadoop cluster using quorum journal nodes.
- Involved in implementing SAX (Symbolic Aggregate approXimation) in Java for use with Apache Spark to normalize time series data, and in defining job flows and managing and reviewing log files.
- Developed a server-side application to interact with the database using Spring Boot and Hibernate.
- Developed microservices using Spring Boot and core Java/J2EE, hosted on AWS.
- Worked as part of a team to build the company framework and portal layouts using JavaScript and React JS.
- Set up, configured, and optimized the Cassandra cluster, and developed a real-time Spark-based application to work with the Cassandra database.
- Responsible for managing data coming from different sources through Kafka by installing Kafka producers on different servers, scheduled to produce data every 10 seconds (see the producer sketch after this list).
- Integrated Kafka with Spark Streaming to listen to multiple Kafka brokers and topics on a 5-second batch interval (see the streaming sketch after this list).
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark framework and handled JSON Data.
- Handled JSON data coming from the Kafka direct stream on each partition and transformed it into the required DataFrame formats.
- Worked on Kafka brokers, ZooKeeper, KSQL, Kafka Streams, and Confluent Control Center.
- Upgraded Spark 1.6 to Spark 2.2 and configured Kafka 0.10. Managed Kafka offsets, saving them to external databases such as HBase as well as back to Kafka itself.
- Worked on Import & Export of data using the ETL tool Sqoop from MySQL to HDFS.
- Experience in Performance Tuning and Debugging of existing ETL processes.
- Worked on a Lambda architecture for both batch processing and real-time streaming.
- Used Oozie to Schedule Spark and Kafka Producer Jobs to run in parallel.
- Experience in writing SQL queries and PL/SQL (Stored Procedures, Functions and Triggers)
- Created stored procedures and packages in Oracle as part of the pre- and post-ETL process.
- Appended DataFrames to Cassandra keyspace tables using the DataStax Spark-Cassandra Connector (see the write sketch after this list).
- Installed and configured DataStax OpsCenter and Nagios for Cassandra cluster maintenance and alerting.
- Fine-tuned SQL queries and PL/SQL blocks for maximum efficiency and fast response using Oracle hints.
- Configured authentication and security in the Apache Kafka pub-sub system.
- Good experience with AWS Cloud for provisioning virtual machines, creating resource groups, configuring key vaults for storing encryption keys, monitoring, etc.
- Created data pipelines in the cloud using Azure Data Factory.
- Implemented a POC on launching HDInsight clusters on Azure.
- In-depth understanding of scalable machine learning libraries like Apache Mahout and MLlib.
- Strong hands-on experience benchmarking the Hadoop cluster to analyze queue utilization. Performed OS-level setup and kernel-level tuning.
- Fluent in Data Mining and Machine Learning, such as classification, clustering, regression and anomaly detection
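The 10-second Kafka producer bullet above could look roughly like this in Scala; the broker, topic, and payload shape are assumptions, not the actual project values:

```scala
// Hedged sketch: a Kafka producer that emits one record every 10 seconds.
import java.util.Properties
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ScheduledProducer extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // assumed broker
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Send one JSON payload every 10 seconds (payload shape is illustrative).
  scheduler.scheduleAtFixedRate(new Runnable {
    def run(): Unit = {
      val payload = s"""{"ts":${System.currentTimeMillis()},"source":"sensor-1"}"""
      producer.send(new ProducerRecord("events", payload))            // assumed topic
    }
  }, 0, 10, TimeUnit.SECONDS)
}
```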
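The Spark Streaming integration (5-second batches reading JSON from Kafka and converting it to DataFrames) could be sketched as below; brokers, group id, and topic names are placeholders, and the spark-streaming-kafka-0-10 integration is assumed to match the Spark 2.2 / Kafka 0.10 upgrade mentioned above:

```scala
// Hedged sketch: Kafka direct stream -> 5-second micro-batches -> DataFrame per batch.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaJsonToDataFrame extends App {
  val spark = SparkSession.builder.appName("kafka-json-stream").getOrCreate()
  val ssc = new StreamingContext(spark.sparkContext, Seconds(5))      // 5-second batch interval

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "broker1:9092,broker2:9092",              // assumed brokers
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "stream-consumer",                        // assumed group id
    "auto.offset.reset"  -> "latest"
  )
  val topics = Array("topic-a", "topic-b")                            // assumed topics

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

  // Turn each micro-batch of JSON strings into a DataFrame and query it with Spark SQL.
  stream.map(_.value).foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      import spark.implicits._
      val df = spark.read.json(rdd.toDS())
      df.createOrReplaceTempView("events")
      spark.sql("SELECT COUNT(*) AS events_in_batch FROM events").show()
    }
  }

  ssc.start()
  ssc.awaitTermination()
}
```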
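Appending DataFrames to Cassandra with the DataStax Spark-Cassandra Connector generally follows the pattern below; the keyspace, table, host, and columns are stand-ins:

```scala
// Hedged sketch: append a DataFrame to a Cassandra table via the Spark-Cassandra Connector.
import org.apache.spark.sql.{SaveMode, SparkSession}

object CassandraAppend extends App {
  val spark = SparkSession.builder
    .appName("cassandra-append")
    .config("spark.cassandra.connection.host", "cassandra-host")       // assumed host
    .getOrCreate()

  import spark.implicits._
  // Stand-in data; in practice this DataFrame comes from the streaming job.
  val df = Seq(("user-1", 42L), ("user-2", 17L)).toDF("user_id", "score")

  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "analytics", "table" -> "user_scores")) // assumed names
    .mode(SaveMode.Append)
    .save()
}
```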
Hadoop/Spark Developer
Confidential
Responsibilities:
- Involved in deploying systems on Amazon Web Services (AWS) infrastructure services such as EC2. Experience configuring and deploying web applications on AWS servers using SBT and Play.
- Migrated MapReduce jobs to Spark RDD transformations using Scala (see the RDD sketch after this list). Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Spark code using Spark RDDs and Spark SQL/Streaming for faster data processing. Performed configuration, deployment, and support of cloud services, including Amazon Web Services (AWS).
- Used various AWS technologies such as SQS queuing, SNS notifications, S3 storage, Redshift, Data Pipeline, and EMR for both public (AWS) and private (OpenStack/VMware/DC/OS/Mesos/Marathon) cloud infrastructure.
- Developed a Flume ETL job for handling data from an HTTP source with an HDFS sink, and configured data pipelining.
- Used Sqoop to move structured Oracle data to HDFS, Hive, Pig, and HBase.
- Used the Hive data warehouse tool to analyze the unified historic data in HDFS to identify issues and behavioral patterns.
- Involved in developing a RESTful service using the Python Flask framework. Expertise working with Python GUI frameworks such as PyJamas.
- Designed and developed a custom single-page application using AngularJS, creating services, factories, models, controllers, and views.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation using Airflow.
- Worked in UNIX and Linux/SAS Grid environments, cross-functionally in a high-paced and fluid Agile team.
- Worked with Agile methodologies on a 3-week sprint cycle and used ServiceNow for SDLC management.
- Worked on importing and exporting data from Oracle data into HDFS using SQOOP for analysis, visualization and to generate reports.
- Experienced in using Apache Drill for data-intensive distributed applications and interactive analysis of large-scale datasets.
- Developed end to end ETL batch and streaming data integration into Hadoop (MapR), transforming data.
- Designed and Developed UNIX Shell scripts to enhance the functionality of ETL application
- Used Python modules such as requests, urllib, urllib2 for web crawling.
- Used Modern technologies like Scala, Spray Framework, Akka and Play Framework
- Developed extensively with tools including Spark, Drill, Hive, HBase, Kafka & MapR Streams, PostgreSQL, and StreamSets.
- Used Hive queries in Spark SQL for analysis and processing of the data (see the Spark SQL sketch after this list).
- Played a key role in a team developing an initial prototype of a NiFi big data pipeline, which demonstrated an end-to-end scenario of data ingestion and processing.
- Used Hue for running Hive queries. Created partitions by day in Hive to improve performance.
- Wrote Python routines to log into the websites and fetch data for selected options.
- Worked on custom Pig Loaders and storage classes to work with a variety of data formats such as JSON and XML file formats.
- Loaded some of the data into Cassandra for the fast retrieval of data.
- Worked on provisioning and managing multi-tenant Hadoop clusters on a public cloud environment (Amazon Web Services) and on private cloud infrastructure (OpenStack), and worked on DynamoDB and ML.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, KSQL, Sqoop, Pig, Hive, Impala, and NoSQL databases.
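A MapReduce-to-Spark migration of the kind described above typically collapses the mapper and reducer into a couple of RDD transformations; the input path below is hypothetical:

```scala
// Hedged sketch: a word-count-style MapReduce job rewritten as Spark RDD transformations.
import org.apache.spark.sql.SparkSession

object RddMigration extends App {
  val spark = SparkSession.builder.appName("rdd-migration").getOrCreate()
  val sc = spark.sparkContext

  val counts = sc.textFile("hdfs:///data/logs/*")   // assumed input path
    .flatMap(_.split("\\s+"))                       // map phase: emit tokens
    .map(word => (word, 1))
    .reduceByKey(_ + _)                             // reduce phase: sum per key

  counts.take(20).foreach(println)
}
```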
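Running Hive queries through Spark SQL, as in the bullet above, looks roughly like this; the database, table, and columns are assumptions:

```scala
// Hedged sketch: querying Hive tables from Spark SQL with Hive support enabled.
import org.apache.spark.sql.SparkSession

object HiveOnSparkSql extends App {
  val spark = SparkSession.builder
    .appName("hive-on-spark-sql")
    .enableHiveSupport()                 // resolves tables from the Hive metastore
    .getOrCreate()

  val daily = spark.sql(
    """SELECT event_date, COUNT(*) AS events
      |FROM analytics.clickstream        -- assumed Hive table
      |WHERE event_date >= '2020-01-01'
      |GROUP BY event_date""".stripMargin)

  daily.show(10)
}
```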
Hadoop Developer
Confidential
Responsibilities:
- Well-versed in Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, and MapReduce programming.
- Developed Map-Reduce programs to get rid of irregularities and aggregate the data.
- Developed Cluster coordination services through Zookeeper.
- Implemented Hive UDFs and performed tuning for better results.
- Developed Pig Latin Scripts to extract data from log files and store them to HDFS. Created User Defined Functions (UDFs) to pre-process data for analysis
- Implemented Optimized Map Joins to get data from different sources to perform cleaning operations before applying the algorithms.
- Created highly optimized SQL queries for MapReduce jobs, seamlessly matching the query to the appropriate Hive table configuration to generate efficient report.
- Used other packages such as Beautiful soup for data parsing in Python.
- Tuned and developed SQL on HiveQL, Drill, and Spark SQL.
- Experience in using Sqoop to import and export the data from Oracle DB into HDFS and HIVE, HBase.
- Implemented CRUD operations on HBase data using thrift API to get real time insights.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster for generating reports on nightly, weekly and monthly basis.
- Worked on integrating independent microservices for real-time bidding (Scala/Akka, Firebase, Cassandra, Elasticsearch).
- Used Slick to query and store data in the database in a Scala idiom, using the powerful Scala collections framework (see the Slick sketch after this list).
- Processed extensive ETL loads on structured data using Hive.
- Defined job flows and developed simple to complex Map Reduce jobs as per the requirement. Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.
- Created various Parser programs to extract data from Autosys, Tibco Business Objects, XML, Informatica, Java, and database views using Scala
- Transformed mock-ups into hand-written HTML 4/5, DHTML, Java, JavaScript, and React JS.
- Worked on data masking/de-identification using Informatica.
- Involved in SQL query tuning and Informatica Performance Tuning.
- Developed a Pig UDF to extract area information from the large volume of data received from sensors. Responsible for creating Hive tables based on business requirements.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access (see the partitioning sketch after this list).
- Involved in NoSQL database design, integration and implementation. Loaded data into NoSQL database HBase.
- Worked on debugging and performance-tuning Pig and Hive scripts by understanding the joins, grouping, and aggregation involved.
- Used Flume to collect, aggregate and store the web log data from different sources like web servers and pushed to HDFS
- Connected Hive tables to data analysis tools like Tableau for graphical representation of trends.
- Experienced in managing and reviewing Hadoop log files.
- Involved in loading data from UNIX file system to HDFS.
- Responsible for design & development of Spark SQL Scripts based on Functional Specifications
- Used the Apache Hue interface to monitor and manage HDFS storage.
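The Slick usage above can be sketched as follows; the table, columns, and connection settings are hypothetical, and PostgreSQL is assumed only for illustration:

```scala
// Hedged Slick 3.x sketch: a table mapping plus a collections-style query.
import scala.concurrent.Await
import scala.concurrent.duration._

import slick.jdbc.PostgresProfile.api._

class Bids(tag: Tag) extends Table[(Long, String, BigDecimal)](tag, "bids") {
  def id     = column[Long]("id", O.PrimaryKey)
  def bidder = column[String]("bidder")
  def amount = column[BigDecimal]("amount")
  def * = (id, bidder, amount)
}

object SlickDemo extends App {
  val db = Database.forURL(
    "jdbc:postgresql://localhost/bids_db",            // assumed connection settings
    user = "app", password = "secret", driver = "org.postgresql.Driver")

  val bids = TableQuery[Bids]

  // filter/map compose like Scala collections and are translated to SQL.
  val highBids = bids.filter(_.amount > BigDecimal(100)).map(b => (b.bidder, b.amount))

  val result = Await.result(db.run(highBids.result), 10.seconds)
  result.foreach(println)
}
```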
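The Hive partitioning and bucketing work above is normally plain HiveQL DDL; since the other sketches here use Scala, this shows the equivalent pattern through Spark's DataFrameWriter, with stand-in data and an assumed database/table name:

```scala
// Hedged sketch: writing a partitioned, bucketed table via Spark's DataFrameWriter.
import org.apache.spark.sql.SparkSession

object PartitionedTableWrite extends App {
  val spark = SparkSession.builder
    .appName("partitioned-write")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._
  // Stand-in data; the real input came from upstream pipelines.
  val events = Seq(
    ("e1", "2020-01-01", "click"),
    ("e2", "2020-01-02", "view")
  ).toDF("event_id", "event_date", "event_type")

  events.write
    .partitionBy("event_date")        // one directory per date, as with a Hive partition
    .bucketBy(32, "event_id")         // hash-bucket rows by event_id
    .sortBy("event_id")
    .format("parquet")
    .saveAsTable("analytics.events_bucketed")   // assumed database.table
}
```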