Senior Hadoop Developer Resume
Lisle, IL
SUMMARY:
- More than 6 years of professional IT experience in developing, implementing, and maintaining various web-based applications using Java, J2EE technologies, and Big Data ecosystems, with experience working in Linux environments. Over 4 years of experience in Hadoop/Big Data technologies, covering the storing, querying, processing, and analysis of data.
- Hands-on experience with various Big Data analytics tools and concepts such as Hadoop, HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Oozie, YARN, Spark, Kafka, Zookeeper, and Flume.
- Excellent knowledge of Hadoop components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, and the MapReduce programming paradigm.
- Experience working with Spark architecture, including Spark Core, Spark SQL, DataFrames, and Spark Streaming for real-time processing.
- Hands-on experience building data pipelines using Hadoop components: Sqoop, Hive, Pig, MapReduce, Spark, and Spark SQL.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Experience in developing efficient solutions to analyze large data sets.
- Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems.
- Populated HDFS with huge amounts of data using Apache Kafka and Flume.
- Excellent knowledge of data mapping and of extracting, transforming, and loading data from different data sources.
- Experience using SequenceFile, Avro, and Parquet file formats; managing and reviewing Hadoop log files.
- Well experienced in data transformation using custom MapReduce, Hive, and Pig scripts for various file formats.
- Expertise in extending Hive and Pig core functionality by writing custom UDFs and UDAFs.
- Designed and created Hive external tables using a shared metastore instead of Derby, with static partitioning, dynamic partitioning, and buckets (see the table sketch at the end of this summary).
- Proficient in UNIX bash scripting, with a good understanding of NoSQL databases such as HBase and hands-on working experience with HBase. Firm grip on data modeling, data mapping, database performance tuning, and NoSQL MapReduce systems.
- Hands on experience migrating complex Map Reduce programs into Apache Spark RDD transformations.
- Experienced in Apache Spark for implementing advanced procedures like text analytics and processing using the in-memory computing capabilities written in Scala.
- Used Spark Streaming to divide streaming data into batches as an input to spark engine for batch processing.
- Experience in scheduling and monitoring Oozie workflows for parallel execution of jobs.
- Good understanding of Zookeeper for monitoring and managing Hadoop jobs.
- Monitoring Map Reduce Jobs and YARN Applications.
- Hands-on experience with AWS: Glue, Lambda, CloudWatch, EMR, S3 storage, EC2 instances, and data warehousing.
- Excellent object-oriented programming skills with Java and an in-depth understanding of data structures and algorithms.
- Proficient in Java, J2EE, JDBC, Collection Framework, Servlets, JSP, Spring, Hibernate, JSON, XML, REST, SOAP Web Services.
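As an illustration of the external-table work mentioned above, a minimal sketch using Spark SQL against a shared Hive metastore; the table, columns, paths, and the staging table are placeholders, not actual project objects:

```python
from pyspark.sql import SparkSession

# Hypothetical example: an external, partitioned Hive table registered in a
# shared metastore (not the embedded Derby one); names and paths are illustrative.
spark = (SparkSession.builder
         .appName("hive-external-table-example")
         .enableHiveSupport()          # use the cluster's shared Hive metastore
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_events (
        event_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/warehouse/sales_events'
""")

# Dynamic partitioning: the partition value is derived from the data itself.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO TABLE sales_events PARTITION (event_date)
    SELECT event_id, customer_id, amount, event_date
    FROM staging_sales_events           -- hypothetical staging table
""")
```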
TECHNICAL SKILLS:
Programming Languages: Java, Scala, Python and C
Big Data Technologies: HDFS, YARN, Map Reduce, Pig, Hive, HBase, Spark, Spark SQL, Spark Streaming, Sqoop, Flume, Kafka, ZooKeeper, Oozie, NIFI
Big Data Distributions: Hortonworks, Cloudera, Amazon Web Services, Azure
Frameworks: Spring, Hibernate, Struts
Version Control: Git
Databases: Oracle, MySQL, MS SQL Server, HBase
Operating Systems: Ubuntu, CentOS, Windows, Linux
Development Tools: IntelliJ, Eclipse, NetBeans, Workbench
PROFESSIONAL EXPERIENCE:
Confidential, Lisle, IL
Senior Hadoop Developer
Responsibilities:
- Used Sqoop and DataStage to ingest data into Hadoop from multiple sources such as Oracle and Teradata.
- Wrote Hive and Impala queries according to business requirements and optimized them using different properties and joins.
- Managed multiple live applications involving a CDC (change data capture) process.
- Connected to the enterprise RDBMS to validate the data used in the CDC process.
- Used Apache Airflow to schedule PySpark jobs, creating DAGs with Hive and Python operators and using XCom to pass connection details between operators; pushed code to Git, from where it was automatically deployed via CI/CD (see the DAG sketch after this list).
- Used EMR to run PySpark jobs and optimized the existing ETL pipeline through efficient joins and improved code standards.
- Used Spark to process huge data sets ingested with Sqoop and performed ETL processing according to business requirements.
- Used regular expressions in Spark to process the data sets.
- Developed and optimized PySpark code to process 8 TB of ODO mileage data per run.
- Designed multiple end-to-end data pipelines using PySpark in AWS, leveraging the boto3 library to connect from Python notebooks to data in S3.
- Used NiFi to automate batch data loads, replacing Flume.
- Developed an end-to-end workflow to bring ControlTec files from AWS to the edge node, move them from the edge node into Hadoop, run a Spark program for ETL processing, and then FTP the output to Bosch and Cummins.
- Set up multiple Lambda functions in AWS to gather the current day's files from multiple S3 subfolders into a specific directory, download them to the edge node, move them between nodes via FTP, and then run ETL with a Spark job for reporting, scheduled via cron-style scheduling in AWS.
- Used AWS Lambda to move a particular type of file (snapshots) from one S3 bucket to another, scheduled with a CloudWatch cron expression.
- Used AWS Glue crawlers to consume data from S3, transformed the data, stored it back in S3, and made it available to users through Athena.
- Was part of multiple upgrades, including a CDP upgrade in August 2021.
- Developed shell scripts to run Hive and Impala queries (using Beeline for Hive) and automated them with Control-M jobs.
- Developed shell scripts to perform regex operations such as replacing spaces in file names with underscores and moving files into Hadoop.
- Parameterized shell scripts for different applications.
- Set up TFS for the team to check in code in a structured way.
- Improved cluster performance by converting long-running Impala and Hive jobs to Spark.
- Developed UDFs by converting Oracle stored procedures into Java code, registered them as Hive functions, and used them in queries.
- Used FTP processes to ingest and transfer files to servers owned by different companies.
- Leveraged CDSW to develop code and directly connect to hive.
- Used Cloudera Manager to maintain system health and leverage the available resources from different pools.
- Played a key part in the migration from one cluster (DSE) to another (DDE) and ran numerous BDR jobs to move data from DSE to DDE.
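A minimal sketch of the kind of Airflow DAG described above, assuming hypothetical DAG/task names, connection details, and script paths (the real DAGs also used Hive operators and were deployed through Git-based CI/CD):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator


def fetch_connection(**context):
    # Hypothetical lookup; the real DAGs pulled connection details from Airflow
    # connections and shared them between operators via XCom.
    jdbc_url = "jdbc:oracle:thin:@//example-host:1521/ORCL"  # placeholder
    context["ti"].xcom_push(key="jdbc_url", value=jdbc_url)


with DAG(
    dag_id="odo_mileage_etl",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    get_conn = PythonOperator(
        task_id="get_conn",
        python_callable=fetch_connection,
    )

    # The downstream task pulls the connection string from XCom via Jinja
    # templating and passes it to a PySpark job submitted on EMR / the edge node.
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command=(
            "spark-submit /opt/jobs/odo_mileage_etl.py "
            "--jdbc-url '{{ ti.xcom_pull(task_ids='get_conn', key='jdbc_url') }}'"
        ),
    )

    get_conn >> run_etl
```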
Environment: Scala, Cloudera, DataStage, Oozie, Cloudera Manager, Hive, Spark SQL, Spark, TFS, AWS, PySpark, CDSW, Teradata, Oracle, Sqoop, Shell Scripting, Airflow, EMR, AWS Lambda
Confidential
Sr. Hadoop Developer
Responsibilities:
- Implemented advanced procedures and processing using the in-memory computing capabilities in Apache Spark written in Scala.
- Consumed complex JSON messages from a 3-node Kafka cluster.
- Used Spark Streaming to subscribe to topics for near-real-time processing.
- Designed two data pipelines, one for streaming data and the other for raw data.
- Used Spark DataFrames via SparkSession to consume data from Kafka and land it on Hortonworks HDFS (see the streaming sketch after this list).
- Used Spark SQL to sort, join, and filter the data.
- Created various Hive external tables and staging tables and joined them as per the requirements.
- Used Spark SQL on DataFrames to load Hive tables into Spark for faster data processing.
- Stored the output of the ingested JSON in CSV format.
- Stored Kafka offsets in HBase.
- Set the Spark micro-batch interval to 15 seconds.
- Implemented a Lambda architecture with speed and batch layers for the streaming data.
- Created a Jethro external table pointing to the HDFS file location to make streamed data available for querying (the speed layer).
- Scheduled the Jethro loader to load data into the Jethro database every 5 minutes and clean up HDFS (the batch layer).
- Used NiFi to move raw data from Kafka to HDFS in the second data pipeline.
- Routed tenants using a routing processor in NiFi to handle multitenancy.
- Executed in-memory Spark data computations.
- Handled multitenancy by routing data to the respective tenant folders.
- Transformed the ingested data and stored it in DataFrames using Spark SQL.
- Was part of the architecture planning for how to store data in the RDBMS using a snowflake schema so that the number of updates is minimized.
- The Jethro loader cleans up the HDFS files after loading them into the Jethro database.
- Created a view over the batch table and the Jethro external table with de-duplication logic.
- Created a data pipeline using Oozie workflows that runs Jethro loader jobs on a daily basis.
- Ran Spark jobs on the edge node using spark-submit to connect to the parent company's Kafka servers, write data to HDFS, and connect to Jethro.
- Wrote DDL scripts to create tables in Jethro.
- Wrote shell scripts to create the different folder structures in different environments such as Dev, UAT, and Prod.
- Created Hive tables on top of Parquet files in HDFS for querying.
- For development, installed the Cloudera QuickStart VM on VirtualBox.
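A minimal PySpark Structured Streaming sketch of the speed-layer ingestion described above (the original job was written in Scala; broker addresses, topic name, schema, and paths are placeholders, and offsets are tracked in the checkpoint directory here rather than in HBase as in the original job):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs-speed-layer").getOrCreate()

# Schema of the (hypothetical) JSON messages published to the Kafka topic.
schema = StructType([
    StructField("tenant_id", StringType()),
    StructField("event_id", StringType()),
    StructField("metric", DoubleType()),
    StructField("event_time", StringType()),
])

# Subscribe to the topic on the 3-node Kafka cluster.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092")
       .option("subscribe", "events")          # placeholder topic name
       .load())

# Parse the JSON payload into columns.
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("event"))
          .select("event.*"))

# Land the parsed records on HDFS as CSV every 15 seconds, matching the
# micro-batch interval above; the Jethro external table points at this path.
query = (parsed.writeStream
         .format("csv")
         .option("path", "hdfs:///data/streaming/events")            # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .trigger(processingTime="15 seconds")
         .start())

query.awaitTermination()
```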
Environment: Scala, Hortonworks, Kafka, Oozie, Cloudera, Hive, Spark SQL, Spark Streaming, Jethro, VSTS, HBase, MySQL, Oracle, NiFi
Confidential, Atlanta, GA
Sr. Hadoop Developer
Responsibilities:
- Designed a pipeline to collect, clean, and prepare data for analysis using MapReduce, Spark, Pig, Hive, and Sqoop, with reporting in Tableau.
- Loaded the data into Spark RDDs and performed in-memory computations to generate the output response.
- Tuned Spark applications for performance by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Migrated complex MapReduce programs to Spark RDD transformations such as flatMap, map, filter, join, and mapPartitions, and actions such as collect, countByKey, reduce, and foreach.
- Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDDs/MapReduce in Spark 1.6 for data aggregation and queries, writing data back to the OLTP system through Sqoop.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Developed Spark programs for batch and real-time processing of structured and unstructured data.
- Collected and aggregated large amounts of log data using Apache Flume and staged it in HDFS for further analysis.
- Good knowledge of Spark components such as Spark SQL and Spark Streaming.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Worked in different environments such as DEV, QA, the Data Lake, and the Analytics cluster as part of Hadoop development.
- Used Spring technologies such as MVC, JDBC, ORM, and Web Services with Eclipse, integrating with Hibernate.
- Developed a real-time Kafka order-streaming application as a microservice, improving data processing performance using the Spring Boot framework and the Kafka client library on top of the Kafka messaging system, processing millions of records.
- Experienced with Kafka Connect (included in Apache Kafka), which integrates Kafka with other systems: source connectors import data from another system (e.g., a MySQL relational database into Kafka) and sink connectors export data (e.g., the contents of a Kafka topic to MySQL, MongoDB, or HDFS files).
- Worked on EMR and S3 in AWS and implemented Amazon S3 to store user data.
- Developed data ingestion into HDFS using Sqoop.
- Experience with PySpark, using Spark libraries through Python scripting for data analysis.
- Performed Sqoop imports from the data warehouse platform to HDFS and built Hive tables on top of the datasets.
- Used file formats such as Avro, CSV, JSON, and Parquet with Snappy compression when storing data in S3 and HDFS.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and DataFrames.
- Developed Spark code in Python with Spark SQL/Streaming for faster testing and processing of data.
- Developed Sqoop scripts to move data from relational databases to HDFS and vice versa.
- Optimized performance on large datasets using partitions, Spark's in-memory capabilities, broadcast variables, and efficient joins and transformations, doing the heavy lifting during the ingestion process itself (see the join sketch after this list).
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Performed DB activities such as indexing, performance tuning, and backup and restore.
- Developed scheduled job workflows using Airflow.
- Developed low-latency, high-performance Hive queries to process and analyze data in HDFS, resolved memory-over-limit issues, and generated batch-oriented and real-time business reports.
- Used reporting tools such as Tableau with the Hive ODBC connector to generate daily data reports.
- Utilized capabilities of Tableau such as Data extracts, Data blending, Forecasting, Dashboard actions and table calculations to build dashboards.
- Bug fixing and 24-7 production support for running the processes.
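An illustrative PySpark snippet for the kind of ingestion-time optimization described above, using a broadcast join and a partitioned, Snappy-compressed Parquet write to S3 (bucket names, paths, and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("ingest-optimize").getOrCreate()

# Large fact data landed by Sqoop, plus a small dimension/lookup table.
orders = spark.read.parquet("s3a://example-bucket/raw/orders/")          # placeholder path
customers = spark.read.parquet("s3a://example-bucket/raw/customers/")    # small lookup table

# Broadcasting the small side avoids shuffling the large fact table.
enriched = orders.join(broadcast(customers), on="customer_id", how="left")

# Write back partitioned by date with Snappy compression so downstream Hive
# queries can prune partitions.
(enriched.write
 .mode("overwrite")
 .option("compression", "snappy")
 .partitionBy("order_date")
 .parquet("s3a://example-bucket/curated/orders/"))
```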
Environment: AWS S3, EMR, Python, Scala, Hadoop, MapReduce, Hive, Spark SQL, Spark Streaming, Airflow, Sqoop, Git, Tableau, Jenkins, Kafka, MySQL, Spring, Eclipse.
Confidential, Bellevue, WA
Hadoop Developer
Responsibilities:
- Developed simple to complex Map Reduce jobs using Java for processing and validating data.
- Developed a data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavioral data.
- Used the Spark API to perform analytics on data in Hive.
- Designed and developed data loads into Hive using Spark.
- Worked on debugging and performance tuning of Hive jobs.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data (a simplified log-parsing sketch follows this list).
- Performed joins, dynamic partitioning, and bucketing on Hive tables utilizing Hive SerDes such as CSV, Regex, JSON, and Avro.
- Developed Java/J2EE applications using object-oriented analysis and was extensively involved throughout the Software Development Life Cycle (SDLC).
- Created Hive internal and external tables, partitions, and buckets for further analysis in Hive.
- Implemented dynamic partitioning and bucketing in Hive and created external tables for loading customer data.
- Exported analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database to handle huge volumes of data.
- Developed MapReduce programs using the Hadoop Java API and performed analytics on time-series data using HBase and the Java API.
- Used HiveQL for querying and managing large data sets residing in distributed storage.
- Developed Kafka producers and consumers in Java, integrating with Apache Storm and ingesting data into HDFS and HBase by implementing the rules in Storm.
- Built a prototype for real time analysis using Spark streaming and Kafka.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Involved in creating Hive tables, working on them using HiveQL, and performing data analysis using Hive and Pig.
- Developed workflow in Oozie to manage and schedule jobs on Hadoop cluster to trigger daily, weekly and monthly batch cycles.
- Designed data warehouse using Hive. Analyzed the data by performing Hive queries.
- Experience in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Expertise in extending Hive and Pig core functionalities by writing custom User Defined Functions (UDF).
- Used Impala to pull data from Hive tables.
- Worked with Apache Flume to collect and aggregate huge amounts of log data and store it in HDFS for further analysis.
- Involved in the architecture and design of a distributed time-series database platform using NoSQL technologies such as Hadoop/HBase and Zookeeper.
- Integrated a NoSQL database (HBase) with MapReduce to move bulk amounts of data into HBase.
- Efficiently put and fetched data to/from HBase by writing MapReduce jobs.
- Wrote MapReduce jobs in Java to parse the raw data, populate staging tables, and store the refined data.
- Followed Agile Methodologies while working on the project.
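For illustration, a simplified log-parsing MapReduce job expressed as Hadoop Streaming scripts in Python (the jobs described above were written against the Java MapReduce API; the log format and field position assumed here are hypothetical):

```python
#!/usr/bin/env python
# mapper.py - emit (status_code, 1) for each well-formed access-log line.
import sys

for line in sys.stdin:
    parts = line.strip().split()
    if len(parts) < 9:          # skip malformed records (validation step)
        continue
    status_code = parts[8]      # assumed position of the HTTP status code
    print("%s\t1" % status_code)
```

```python
#!/usr/bin/env python
# reducer.py - sum the counts per status code (input arrives sorted by key).
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, int(value)

if current_key is not None:
    print("%s\t%d" % (current_key, count))
```

Such scripts would be submitted with the Hadoop Streaming jar, passing mapper.py and reducer.py via -files, -mapper, and -reducer along with the input and output HDFS paths.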
Environment: Hadoop, Map Reduce, Hive, Pig, Spark, Flume, Kafka, Sqoop, Oozie, Zookeeper, HBase, Impala, NOSQL, Java, Java Map-Reduce.