
Sr. Hadoop/Spark Developer Resume

West Des Moines, IA

TECHNICAL SKILLS

  • Hadoop
  • Hive
  • MapReduce
  • Sqoop
  • Kafka
  • Spark
  • Yarn
  • Pig
  • Cassandra
  • Oozie
  • Shell Scripting
  • Scala
  • Maven
  • Java
  • JUnit
  • Agile methodologies
  • NiFi
  • MySQL
  • Tableau
  • AWS
  • EC2
  • S3
  • Hortonworks
  • Power BI
  • Solr

PROFESSIONAL EXPERIENCE

Sr. Hadoop/Spark Developer

Confidential - West Des Moines, IA

Responsibilities:

  • Hands-on experience in Spark and Spark Streaming, creating RDDs and applying transformations and actions.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark-SQL to read Parquet data and create tables in Hive using the Scala API.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark sample programs in Python using PySpark.
  • Analyzed the SQL scripts and designed the solution to implement them using PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premise environment.
  • Used Spark Streaming APIs to perform necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time.
  • Responsible for building data pipelines that load data from web servers and Teradata using Sqoop with Kafka and the Spark Streaming API.
  • Developed Kafka producers and consumers, Cassandra clients, and Spark components on top of HDFS and Hive.
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Used Kafka to ingest data into Spark engine.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Managing and scheduling Spark Jobs on a Hadoop Cluster using Oozie.
  • Experienced with different scripting languages like Python and shell scripts.
  • Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
  • Built data pipelines using Kafka and Akka to handle terabytes of data.
  • Written shell scripts that run multiple Hive jobs which helps to automate different Hive tables incrementally which are used to generate different reports using Tableau for the Business use.
  • Developed Solr web apps to query and visualize Solr-indexed data from HDFS.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
  • Worked on SparkSQL, created Data frames by loading data from Hive tables and created prep data and stored in AWS S3.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
  • Created custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL.
  • Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
  • Implemented Hortonworks NiFi (HDP 2.4) and recommended solution to inject data from multiple data sources to HDFS and Hive using NiFi.
  • Developed various data loading strategies and performed various transformations for analyzing the datasets by using Hortonworks Distribution for Hadoop ecosystem.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirement; used Cassandra through Java services.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Built servers using AWS: imported volumes, launched EC2 and RDS instances, and created security groups, auto-scaling groups, and load balancers (ELBs) in the defined virtual private cloud; used OpenStack to provision new machines for clients.
  • Implemented AWS solutions using EC2, S3, RDS, ECS, EBS, Elastic Load Balancer, Auto scaling groups, Optimized volumes and EC2 instances.
  • Created S3 buckets, managed policies for S3 buckets, and utilized S3 and Glacier for storage and backup on AWS.
  • Performed AWS Cloud administration managing EC2 instances, S3, SES and SNS services.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded the data into HDFS using Java and Talend.
  • Along with the infrastructure team, designed and developed a Kafka- and Storm-based data pipeline.
  • Created Partitions, Buckets based on State to further process using Bucket based Hive joins.
  • Worked on Sequence files, RC files, map-side joins, bucketing, and partitioning for Hive performance enhancement and storage improvement, utilizing Hive SerDes like REGEX, JSON and Avro.
  • Used Oozie operational services for batch processing and scheduling workflows dynamically.
  • Worked entirely in an Agile methodology and developed Spark scripts using the Scala shell.
  • Involved in loading and transforming large Datasets from relational databases into HDFS and vice-versa using Sqoop imports and export.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS and vice versa using Sqoop.
  • Used Hibernate ORM framework with Spring framework for data persistence and transaction management.
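The custom Pig/Hive UDF work above (Python logic exposed to HiveQL) can be illustrated with a Hive TRANSFORM-style streaming script. This is a minimal sketch, and the two-column layout (id, state code) is an assumed example, not the project's actual schema.

```python
#!/usr/bin/env python
# Minimal sketch of a Hive TRANSFORM streaming script: Hive pipes
# tab-separated rows to stdin and reads cleaned rows from stdout.
# The assumed layout (column 2 = state code) is a hypothetical example.
import sys

def clean_row(line):
    """Trim all fields and upper-case the state code in column 2."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return None  # skip malformed rows
    fields = [f.strip() for f in fields]
    fields[1] = fields[1].upper()
    return "\t".join(fields)

if __name__ == "__main__":
    for line in sys.stdin:
        cleaned = clean_row(line)
        if cleaned is not None:
            print(cleaned)
```

In HiveQL such a script would be attached with `ADD FILE` and invoked via `SELECT TRANSFORM(...) USING 'python clean_rows.py'`.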

Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, Yarn, Pig, Cassandra, Oozie, Shell Scripting, Scala, Maven, Java, JUnit, Agile methodologies, NiFi, MySQL, Tableau, AWS, EC2, S3, Hortonworks, Power BI, Solr.

Hadoop/Spark Developer

Confidential - Hilmar, CA

Responsibilities:

  • Optimized existing algorithms in Hadoop using SparkContext, Spark-SQL, Data Frames and Pair RDDs.
  • Developed Spark scripts using Java and Python shell commands as per the requirement.
  • Involved with ingesting data received from various relational database providers, on HDFS for analysis and other big data operations.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Worked on Spark SQL and Data Frames for faster execution of Hive queries using Spark SQLContext.
  • Performed analysis on implementing Spark using Scala.
  • Responsible for creating, modifying topics (Kafka Queues) as and when required with varying configurations involving replication factors and partitions.
  • Extracted files from MongoDB through Sqoop and placed in HDFS and processed.
  • Created and imported various collections, documents into MongoDB and performed various actions like query, project, aggregation, sort and limit.
  • Experience with creating script for data modeling and data import and export. Extensive experience in deploying, managing and developing MongoDB clusters.
  • Experience in migrating HiveQL into Impala to minimize query response time.
  • Creating Hive tables to import large data sets from various relational databases using Sqoop and export the analyzed data back for visualization and report generation by the BI team.
  • Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
  • Collected data from various Flume agents imported on various servers using multi-hop flows.
  • Used Flume to collect log data from different sources and transfer the data to Hive tables using different SerDes, storing it in JSON, XML and SequenceFile formats.
  • Developed Scala scripts, UDFs using Data frames/SQL/Data sets and RDD/MapReduce in Spark 1.6 for Data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to analyze HDFS data.
  • Maintained the cluster securely using Kerberos and kept the cluster up and running at all times.
  • Implemented optimization and performance testing and tuning of Hive and Pig.
  • Developed a data pipeline using Kafka to store data into HDFS.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Wrote shell scripts and Python scripts for job automation.
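The script wrappers around Sqoop and Hive described above can be sketched as a small Python driver. The JDBC URL, check column, and target directory below are hypothetical placeholders, not the project's actual configuration.

```python
# Illustrative driver for an incremental Sqoop import; the connection
# string, check column, and paths are made-up examples.
import subprocess

def build_sqoop_import(table, last_value, jdbc_url="jdbc:mysql://dbhost/sales"):
    """Assemble the argument list for a 'lastmodified' incremental import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--incremental", "lastmodified",
        "--check-column", "updated_at",
        "--last-value", last_value,
        "--target-dir", "/data/raw/%s" % table,
    ]

def run_import(table, last_value, dry_run=True):
    cmd = build_sqoop_import(table, last_value)
    if dry_run:
        print(" ".join(cmd))   # inspect the command without running Sqoop
        return cmd
    return subprocess.check_call(cmd)
```

The dry-run path makes the generated command easy to review before handing it to the scheduler.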

Environment: Cloudera, HDFS, Hive, HQL scripts, MapReduce, Java, HBase, Pig, Sqoop, Kafka, Impala, Shell Scripts, Python Scripts, Spark, Scala, Oozie.

Big Data Developer/Spark Developer

Confidential - Malvern, PA

Responsibilities:

  • Responsible for building scalable distributed data solutions using Hadoop.
  • Involved in importing data from various data sources into HDFS using Sqoop, applying various transformations using Hive and Apache Spark, and then loading the data into Hive tables or AWS S3 buckets.
  • Involved in moving data from various DB2 tables to AWS S3 buckets using Sqoop process.
  • Configured Splunk alerts to capture log files during execution and store them in an S3 bucket while the cluster is running.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python (PySpark).
  • Wrote Oozie scripts to schedule and automate jobs on the EMR cluster.
  • Used Bitbucket as a repository for storing the code and integrated it with Bamboo for continuous integration.
  • Experienced in bringing up EMR clusters and deploying code to the cluster from S3 buckets.
  • Migrated the existing on-prem code to AWS EMR cluster.
  • Experienced in using NoMachine and PuTTY to SSH into the EMR cluster and run spark-submit.
  • Developed Apache Spark applications using Scala and Python, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Experience in developing various Spark Streaming jobs using Python (PySpark) and Scala.
  • Developed Spark code using PySpark to apply various transformations and actions for faster data processing.
  • Working knowledge of Apache Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Used Spark stream processing with Scala to bring data into memory, implemented RDD transformations, and performed actions.
  • Used various Python libraries with PySpark to create DataFrames and store them in Hive.
  • Created Sqoop jobs and Hive queries for data ingestion from relational databases to compare with historical data.
  • Experience in working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
  • Experienced in migrating HiveQL into Impala to minimize query response time.
  • Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment.
  • Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
  • Knowledge of creating user-defined functions (UDFs) in Hive.
  • Worked with different file formats like text, Avro, and Parquet for Hive querying and processing based on business logic.
  • Knowledge of pulling data from an AWS S3 bucket to a data lake, building Hive tables on top of it, and creating data frames in Spark to perform further analysis.
  • Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance enhancement and storage improvement.
  • Involved in Test Driven Development writing unit and integration test cases for the code.
  • Implemented Hive UDFs to apply business logic and responsible for performing extensive data validation using Hive.
  • Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and Data Frames API.
  • Involved in developing code, generated various data frames based on the business requirement, and created temporary tables in Hive.
  • Utilized AWS CloudWatch to monitor the performance environment instances for operational and performance metrics during load testing.
  • Experience in building scripts using Maven and performing continuous integration with tools like Bamboo.
  • Used JIRA for creating the user stories and creating branches in the bitbucket repositories based on the story.
  • Knowledge of using Sonar to validate code and enforce coding standards.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
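Submitting a PySpark job stored in S3 as a step on a running EMR cluster, as described above, can be sketched via the AWS CLI's `aws emr add-steps`. The cluster id, bucket, and script name below are invented examples.

```python
# Hypothetical helper that builds an 'aws emr add-steps' command to run
# spark-submit on an EMR cluster; cluster id and S3 path are examples.
def emr_spark_step(cluster_id, script_s3_path, step_name="nightly-etl"):
    """Build the AWS CLI command that adds a Spark step to the cluster."""
    args = ",".join(["spark-submit", "--deploy-mode", "cluster", script_s3_path])
    step = "Type=Spark,Name=%s,ActionOnFailure=CONTINUE,Args=[%s]" % (step_name, args)
    return ["aws", "emr", "add-steps",
            "--cluster-id", cluster_id,
            "--steps", step]

cmd = emr_spark_step("j-EXAMPLE123", "s3://my-bucket/jobs/transform.py")
```

Keeping the job code in S3 and adding steps this way lets the same script run against any cluster id the scheduler supplies.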

Environment: Cloudera, MapReduce, HDFS, Scala, Hive, Sqoop, Spark, Oozie, Linux, Maven, Control-M, Splunk, NoMachine, PuTTY, HBase, Python, AWS EMR Cluster, EC2 instances, S3 Buckets, STS, Bamboo, Bitbucket.

Big Data Developer

Confidential - Wilmington, DE

Responsibilities:

  • Responsible for building a framework that uses Sqoop to ingest data and store it in HDFS in a raw zone; from there the data is transformed through stage and merge zones, with the most refined data landing in an enriched zone. The end-to-end flow is under development so that the framework automates the majority of the work.
  • Experience in job management using Autosys; developed job processing scripts using shell script workflows.
  • Configured, deployed, and maintained multi-node Dev and Test clusters.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both Data Frames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
  • Optimized existing algorithms in Hadoop using SparkContext, Spark-SQL, Data Frames and Pair RDDs.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Experienced in handling large datasets using Partitions, Spark in Memory capabilities, Broadcasts in Spark, Effective & efficient Joins, Transformations and other during ingestion process itself.
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, to decide whether to adopt Impala in the project.
  • Worked extensively with Sqoop for importing metadata from Teradata and Oracle.
  • Involved in creating Hive tables and loading and analyzing data using Hive queries.
  • Developed Hive queries to process the data and generate the data cubes for visualizing.
  • Implemented schema extraction for Parquet and Avro file Formats in Hive.
  • Implemented Partitioning, Dynamic Partitions, Buckets in HIVE.
  • Good experience with continuous Integration of application using Jenkins.
  • Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
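The raw → stage → merge → enriched zone flow described in the first bullet can be sketched with a small path helper. The base directory and partition naming convention here are assumptions for illustration, not the framework's actual layout.

```python
# Sketch of the zone layout: each source's data moves raw -> stage ->
# merge -> enriched; the /data base path and partition naming are assumed.
ZONES = ("raw", "stage", "merge", "enriched")

def zone_path(zone, source, load_date, base="/data"):
    """HDFS directory for one source in a given zone, partitioned by load date."""
    if zone not in ZONES:
        raise ValueError("unknown zone: %s" % zone)
    return "%s/%s/%s/load_date=%s" % (base, zone, source, load_date)

def promote(zone):
    """Next zone in the pipeline, or None once data is enriched."""
    i = ZONES.index(zone)
    return ZONES[i + 1] if i + 1 < len(ZONES) else None
```

Centralizing the zone names and paths in one helper is what lets such a framework automate the promotion steps instead of hard-coding directories per job.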

Environment: Shell Scripting, Hadoop YARN, Spark Core, Spark SQL, Scala, Hive, Sqoop, Toad for Apache Hadoop, Impala, Oracle SQL developer, Cloudera, DB2 10.1, Linux, JIRA

Big Data Developer

Confidential - Needham, MA

Responsibilities:

  • Worked in an Agile/Scrum environment
  • Extensively involved in installation and configuration of the Confidential big data platform
  • Used Flume to collect, aggregate, and move large amounts of semi-structured/unstructured data from web log files
  • Used Sqoop to efficiently transfer bulk data from RDBMS (Oracle database) to HDFS
  • Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency
  • Used Hive for data sanitization and filtering to ensure data reliability
  • Developed Spark programs with Scala to effectively process the data with extraction, transformation, and loading
  • Worked with the data science team to build statistical models with Spark MLlib and PySpark
  • Worked with the BI team to prepare data visualizations in Tableau for reporting
  • Collaborated on and tracked work with Confluence, Git, and JIRA
  • Designed unit testing programs using ScalaTest and Pytest
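The Pytest-based unit testing mentioned above can be illustrated with a small example; the `sanitize` helper and its fields are hypothetical, standing in for the project's data-sanitization logic.

```python
# Illustrative pytest-style tests for a tiny sanitization helper; the
# helper and field names are invented for this sketch.
def sanitize(record):
    """Drop records with no id; strip and lower-case the email field."""
    if not record.get("id"):
        return None
    out = dict(record)
    out["email"] = out.get("email", "").strip().lower()
    return out

def test_sanitize_normalizes_email():
    assert sanitize({"id": 1, "email": " A@B.COM "}) == {"id": 1, "email": "a@b.com"}

def test_sanitize_drops_missing_id():
    assert sanitize({"email": "x@y.com"}) is None
```

Run with `pytest`, these assert-based tests document the cleaning rules as executable examples.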

Environment: Spark 1.6, HDFS, Scala 2.1, Sqoop 1.4.6, Flume 1.6.0, Hive 0.14, PySpark, MLlib, Tableau 9.2, ScalaTest, Pytest, Oracle 11g, IntelliJ.

Big Data Developer

Confidential - Waltham, MA

Responsibilities:

  • Worked on Cloudera platform CDH 4 with Agile methodology
  • Responsible for developing scalable distributed data solutions using Hadoop
  • Involved in the data ingestion process using Sqoop and Kafka from log files and RDBMS
  • Developed Kafka consumers to move data into different data stores such as HDFS and HBase
  • Wrote HiveQL and Pig Latin scripts to clean the data after loading it from HDFS
  • Based on the ETL process, developed a MapReduce program in Java to process the semi-structured data through the map, shuffle, and reduce phases
  • Used HBase to store the processed data for further analysis
  • Involved in the installation of the Oozie workflow engine to run multiple MapReduce and Hive jobs
  • Worked with the data science team to build a statistical model with Python and prepared data visualizations with Tableau
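The map/shuffle/reduce flow above was implemented in Java on the project; as a language-neutral sketch, here is the same idea in Hadoop Streaming style Python, counting log levels in semi-structured log lines. The line layout (second token = log level) is an assumed example.

```python
# Hadoop Streaming-style sketch of the map/shuffle/reduce phases; the
# project used Java MapReduce, and the log-line layout here is assumed
# (second whitespace-separated token = log level).
from itertools import groupby

def mapper(lines):
    """Map phase: emit (log_level, 1) for each well-formed line."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            yield parts[1], 1

def reducer(pairs):
    """Reduce phase: sum counts per key. Input must be sorted by key,
    which is what the shuffle phase guarantees."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)
```

The `sorted()` call a caller applies between the two functions stands in for the shuffle: it groups all values for a key before the reducer sees them.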

Environment: Cloudera CDH 4, Hadoop 2.5.2, MapReduce, HDFS, HBase, Impala, Pig, Java, Python, Kafka 2.10, 1.6.0, Sqoop 1.4.6, Oozie, IntelliJ, Oracle 11g, Tableau 9.2
