Spark Developer Resume
CA
SUMMARY:
- Overall 5+ years of experience in the IT industry, including 4+ years of experience with major Hadoop ecosystem components such as Hive, SQL, Pig, HDFS, Spark, and Kafka, with Scala programming.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as HDFS, Hive, Pig, MapReduce, YARN, Sqoop, Flume, HBase, Impala, Oozie, ZooKeeper, Kafka, and Spark.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, Spark MLlib, and GraphX.
- Experience using accumulator variables, broadcast variables, and RDD caching in Spark Streaming applications (see the sketch following this summary).
- Hands-on experience across big data application phases such as data ingestion, data analytics, and data visualization.
- Expertise in using Spark SQL with various data sources such as JSON, Parquet, and Hive.
- Expertise in Hadoop distributions such as Cloudera, Hortonworks, and Amazon AWS.
- Experience transferring data from RDBMS to HDFS and Hive tables using Sqoop.
- Experience creating tables, partitioning, bucketing, loading, and aggregating data using Hive.
- Migrated code from Hive to Apache Spark and Scala using Spark SQL and RDDs.
- Experience working with Flume to load log data from multiple sources directly into HDFS.
- Experience analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
- Experience with NoSQL column-oriented databases such as HBase and Cassandra and their integration with Hadoop clusters.
- Experience manipulating and analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
- Hands-on experience writing HiveQL and Spark SQL queries.
- Working experience writing Sqoop jobs to transfer bulk data between Apache Hadoop and structured data stores.
- Good knowledge of extracting data from log files and importing it into HDFS using Flume.
- Expertise in job scheduling and monitoring tools such as Oozie.
- Experience with the Oozie workflow engine for scheduling MapReduce, Pig, Hive, and Kafka jobs.
- Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka.
- Expertise in writing Spark RDD transformations and actions, DataFrames, and Dataset case classes for the required input data, and in performing data transformations using Spark Core.
- Strong knowledge of upgrading MapR, CDH, and HDP clusters.
- Experience working with Cloudera distributions (CDH 4/CDH 5) and knowledge of Hortonworks and Amazon EMR Hadoop distributions.
- Expert in analyzing real-time queries using different NoSQL databases, including Cassandra, MongoDB, and HBase.
- Experience converting business processes into RDD transformations using Apache Spark and Scala.
- Experienced in collecting metrics for Hadoop clusters using Ambari & Cloudera Manager.
- Excellent understanding of data abstractions in Scala and Spark.
- Knowledge of running Hive queries through Spark SQL within the Spark environment.
- Proficient in writing Spark scripts using Scala and Python.
- Participated in daily Scrum meetings to discuss development progress.
- Experience with cloud technologies such as Amazon Web Services (AWS).
- Involved in deploying the content cloud platform on Amazon AWS using EC2 and S3.
- Worked on Talend to run ETL jobs on the data in HDFS.
- Good experience with Windows, Linux, and UNIX platforms.
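A minimal sketch of how broadcast variables, an accumulator, and RDD caching typically fit together in a Spark/Scala job (a batch illustration only; the lookup map, input path, and field layout are hypothetical, not taken from any project below):

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lookup-enrichment"))

    // Broadcast a small lookup table once to every executor (values are made up).
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    // Accumulator counting records whose country code is missing from the lookup.
    val unknownCodes = sc.longAccumulator("unknownCodes")

    // Hypothetical input: one "userId,countryCode" pair per line.
    val enriched = sc.textFile("hdfs:///data/events.csv")
      .map(_.split(","))
      .filter(_.length == 2)
      .map { case Array(userId, code) =>
        val name = countryNames.value.getOrElse(code, { unknownCodes.add(1); "UNKNOWN" })
        (userId, name)
      }
      .cache() // cached because the RDD is reused by both actions below

    // Accumulator values are only reliable after an action has run.
    println(s"records: ${enriched.count()}, unknown codes: ${unknownCodes.value}")
    println(s"distinct users: ${enriched.keys.distinct().count()}")

    sc.stop()
  }
}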
TECHNICAL SKILLS:
Programming Languages: C, Core Java, Scala, SQL, PL/SQL, Python
Distributed File Systems: Apache Hadoop HDFS
Hadoop Distributions: Amazon AWS/EMR, Apache Cloudera, Hortonworks, and MapR
Hadoop Technologies: HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, ZooKeeper, Flume, Spark SQL, and Apache Kafka
NoSQL Databases: Cassandra, HBase, MongoDB
Databases: Oracle, MySQL.
Search Platforms: Apache Solr
In-Memory/MPP/Streaming: Apache Spark, Apache Spark Streaming, Apache Storm
Operating Systems: Windows, UNIX, LINUX
Cloud Platforms: Amazon AWS, OpenStack.
PROFESSIONAL EXPERIENCE:
Confidential, CA
Spark Developer
Responsibilities:
- Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
- Implemented Spark applications using Scala and Spark SQL for faster testing and processing of data.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and bucketing.
- Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL (see the sketch following this list).
- Exported the analyzed data from HDFS to relational databases (MySQL, Oracle) using Sqoop.
- Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest data into HDFS for analysis.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Used Impala to query HDFS data for better performance.
- Implemented Apache Pig scripts to load data from and store data to Hive.
- Imported data from Amazon S3 into Spark RDDs and performed transformations and actions on them.
- Used JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
- Worked extensively with Amazon Web Services (AWS) cloud services such as EC2, S3, and VPC.
- Imported data from different sources such as S3 into Spark RDDs.
- Worked with various HDFS file formats such as Avro and SequenceFiles, and compression codecs such as Snappy.
- Expert knowledge of MongoDB, NoSQL data modeling, disaster recovery, and backup.
- Developed Spark/MapReduce jobs to parse JSON and XML data.
- Involved in HBase setup and in storing data into HBase for later analysis.
- Used Scala libraries to process XML data stored in HDFS and wrote the processed data back to HDFS.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Used Spark for interactive queries, streaming data processing, and integration with popular NoSQL databases for large data volumes.
- Experienced with NoSQL databases such as HBase, Cassandra, and MongoDB.
- Wrote various Pig scripts to clean up the ingested data and created partitions for the daily data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Analyzed SQL scripts and designed solutions to implement them using PySpark.
- Involved in converting MapReduce programs into Spark transformations using Spark RDDs in Scala.
- Developed Spark scripts using Scala shell commands as per requirements.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Used Avro, Parquet, and ORC data formats to store data in HDFS.
- Used Oozie workflows to coordinate Pig and Hive scripts.
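A minimal sketch of loading JSON data with Spark SQL and persisting it to a Hive table, as described above (shown with the Spark 2.x SparkSession API; the input path, column names, and table name are hypothetical placeholders):

import org.apache.spark.sql.{SaveMode, SparkSession}

object JsonToHiveSketch {
  def main(args: Array[String]): Unit = {
    // Hive support lets DataFrames be saved as tables in the shared metastore.
    val spark = SparkSession.builder()
      .appName("json-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder input path; Spark infers the schema from the JSON documents.
    val events = spark.read.json("hdfs:///landing/events/*.json")

    // Query the raw data with Spark SQL before persisting it to Hive.
    events.createOrReplaceTempView("events_raw")
    val cleaned = spark.sql(
      "SELECT user_id, event_type, ts FROM events_raw WHERE user_id IS NOT NULL")

    // Writes a managed Hive table; on Spark 1.x the same flow uses HiveContext and schema RDDs.
    cleaned.write.mode(SaveMode.Overwrite).saveAsTable("events_clean")

    spark.stop()
  }
}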
Confidential, New York
Hadoop Developer
Responsibilities:
- In-depth understanding of Hadoop architecture and its components, such as HDFS, ApplicationMaster, NodeManager, ResourceManager, NameNode, DataNode, and MapReduce concepts.
- Involved in developing a MapReduce framework that filters out bad and unnecessary records.
- Involved in moving all log files generated from various sources into HDFS for further processing using Flume.
- Imported required tables from RDBMS to HDFS using Sqoop, and used Storm and Kafka for real-time streaming of data into HBase.
- Experience with NoSQL column-oriented databases such as Cassandra and their integration with the Hadoop cluster.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Good experience with the NoSQL database HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented workflows using an Apache framework to automate tasks.
- Wrote MapReduce code that takes log files as input, parses them, and structures them in a tabular format to facilitate effective querying of the log data.
- Developed Java code to generate, compare, and merge Avro schema files.
- Developed complex MapReduce streaming jobs in Java that were integrated with Hive and Pig.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Used Hive optimization techniques for joins and followed best practices when writing Hive scripts in HiveQL.
- Imported and exported data into HDFS and Hive using Sqoop.
- Wrote Hive queries to extract the processed data.
- Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Implemented Spark applications in Scala, utilizing the Spark Core, Spark Streaming, and Spark SQL APIs for faster data processing instead of MapReduce in Java (see the sketch following this list).
- Used Spark SQL to load JSON data, create schema RDDs, load them into Hive tables, and handle structured data.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Created HBase tables to store data in varying formats coming from different legacy systems.
- Used Hive for transformations, event joins, and pre-aggregations before storing the data in HDFS.
- Good understanding of Cassandra architecture, replication strategies, gossip, snitches, etc.
- Expert knowledge of MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
- Developed Spark scripts using Scala shell commands as per requirements.
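A minimal sketch of the kind of log-parsing work described above, rewritten as Spark RDD transformations in Scala and exposed to Spark SQL (the log layout, regular expression, path, and field names are hypothetical assumptions, not the actual project code):

import org.apache.spark.sql.SparkSession

object LogParsingSketch {
  // One parsed log record; the fields are made up for illustration.
  case class LogRecord(ip: String, timestamp: String, url: String, status: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log-parsing").getOrCreate()
    import spark.implicits._

    // Hypothetical space-delimited access log: ip timestamp url status
    val pattern = """(\S+) (\S+) (\S+) (\d{3})""".r

    val records = spark.sparkContext
      .textFile("hdfs:///logs/access/*.log")
      .flatMap {
        case pattern(ip, ts, url, status) => Some(LogRecord(ip, ts, url, status.toInt))
        case _                            => None // drop malformed lines instead of failing
      }

    // Register the parsed logs as a table so they can be queried with Spark SQL.
    records.toDF().createOrReplaceTempView("access_logs")
    spark.sql("SELECT status, count(*) AS hits FROM access_logs GROUP BY status").show()

    spark.stop()
  }
}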
Confidential, New York
Hadoop Developer
Responsibilities:
- Extensively involved in installation and configuration of the Cloudera distribution of Hadoop: NameNode, Secondary NameNode, JobTracker, TaskTrackers, and DataNodes.
- Developed MapReduce programs in Java and used Sqoop to pull data from an Oracle database.
- Responsible for building scalable distributed data solutions using Hadoop; wrote various Hive and Pig scripts.
- Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
- Experienced with scripting languages such as Python and shell scripting.
- Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and performance analysis.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that trigger independently based on time and data availability.
- Experienced in handling administration activities using Cloudera Manager.
- Expertise in partitioning and bucketing concepts in Hive.
- Automated all jobs for pulling data from the FTP server and loading it into Hive tables using Oozie workflows; also wrote Pig scripts to run ETL jobs on the data in HDFS.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract data on a timely basis; responsible for loading data from the UNIX file system to HDFS.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using the MR testing library.
- Analyzed the weblog data using HiveQL and integrated Oozie with the rest of the Hadoop stack.
- Utilized cluster coordination services through ZooKeeper.
- Worked on ingestion of files into HDFS from remote systems using MFT.
- Gained good experience with various NoSQL databases and comprehensive knowledge of process improvement, normalization/denormalization, data extraction, data cleansing, and data manipulation.
- Experience creating scripts for data modeling and data import/export; extensive experience deploying, managing, and developing MongoDB clusters.
- Developed Pig scripts to convert data from text files to Avro format.
- Created partitioned Hive tables and worked on them using HiveQL.
- Developed Shell scripts to automate routine DBA tasks.
- Used Maven extensively to build JAR files of MapReduce programs and deployed them to the cluster.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups and Hadoop log files.
Environment: HDFS, MapReduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, Java, Maven, Avro, Cloudera, Eclipse, and Shell Scripting.
Confidential
Java Developer
Responsibilities:
- Understood business objectives and implemented business logic.
- Designed the front end using JSP and implemented business logic in Servlets.
- Used JSPs, HTML and CSS to develop user interface.
- Responsible for designing and building the data mart as per requirements.
- Interacted and coordinated with team members to develop detailed software requirements that drove the design, implementation, and testing of the Consolidated Software application.
- Implemented the object-oriented programming concepts for validating the columns of the import file.
- Extensively used Oracle ETL process for address data cleansing.
- Used Eclipse Integrated Development Environment (IDE) in entire project development.
- Involved in technical design, logical data modeling, data validation, verification, data cleansing, data scrubbing.
- Created Rulesets for data quality index reports.
- Extensively worked on views, stored procedures, triggers, and SQL queries for loading data (staging) to enhance and maintain existing functionality.
- Involved in creating error logs and improving the performance of the jobs.
- Wrote queries to test the functionality of the code during testing.