Spark Developer Resume
Middletown, NJ
SUMMARY:
- 7+ years of extensive experience in Information Technology, with 5+ years in Hadoop/Big Data processing and 2 years in Java/J2EE technologies.
- Comprehensive working experience implementing Big Data projects using Apache Hadoop, Pig, Hive, HBase, Spark, Sqoop, Flume, ZooKeeper, and Oozie.
- Experience working on Hortonworks, Cloudera, and MapR.
- Excellent working knowledge of the HDFS file system and Hadoop daemons such as ResourceManager, NodeManager, NameNode, DataNode, Secondary NameNode, and containers.
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAG scheduler, task scheduler, stages, and tasks.
- Experience working on Spark and Spark Streaming.
- Hands-on experience with major components in the Hadoop ecosystem such as MapReduce, HDFS, YARN, Hive, Pig, HBase, Sqoop, Oozie, Cassandra, Impala, and Flume.
- Knowledge of installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Pig, Spark, Kafka, Storm, ZooKeeper, and Flume.
- Experience with the Hadoop 2.0 YARN architecture and with developing YARN applications on it.
- Worked on performance tuning to ensure that assigned systems were patched, configured, and optimized for maximum functionality and availability; implemented solutions that reduced single points of failure and improved system uptime to 99.9% availability.
- Experience with distributed systems, large-scale non-relational data stores and multi-terabyte data warehouses.
- Firm grasp of data modeling, data marts, database performance tuning, and NoSQL/MapReduce systems.
- Experience in managing and reviewing Hadoop log files
- Real-time experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data.
- Experience setting up Hadoop clusters on cloud platforms such as AWS.
- Customized dashboards and handled identity and access management (IAM) in AWS.
- Worked with data serialization formats such as Avro, Parquet, JSON, and CSV for converting complex objects into byte sequences.
- Expertise in extending Hive and Pig core functionality by writing custom UDFs and UDAFs.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and bucketing.
- Worked with file formats such as TEXTFILE, SEQUENCEFILE, AVRO, ORC, and PARQUET for Hive querying and processing (a format-conversion sketch follows this summary).
- Proficient in NoSQL databases like HBase.
- Experience in importing and exporting data using Sqoop between HDFS and Relational Database Systems.
- Populated HDFS with vast amounts of data using Apache Kafka and Flume.
- Knowledge in Kafka installation & integration with Spark Streaming.
- Hands-on experience building data pipelines using Hadoop components Sqoop, Hive, Pig, MapReduce, Spark, Spark SQL.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats like text, zip, XML, and JSON.
- Experience in designing both time driven and data driven automated workflows using Oozie.
- Good understanding of Zookeeper for monitoring and managing Hadoop jobs.
- Monitored MapReduce jobs and YARN applications.
- Strong Experience in installing and working on NoSQL databases like HBase, Cassandra.
- Work experience with cloud infrastructure such as Azure Compute services and Amazon Web Services (AWS) EC2 and S3.
- Used Git for source code and version control management.
- Experience with RDBMS and writing SQL and PL/SQL scripts used in stored procedures.
- Proficient in Java, J2EE, JDBC, the Collections Framework, JSON, XML, REST, and SOAP web services. Strong understanding of Agile and Waterfall SDLC methodologies.
- Experience working with both small and large teams; successful in meeting new technical challenges and finding solutions that meet customer needs.
- Excellent problem-solving, proactive thinking, analytical, programming, and communication skills.
- Experience working both independently and collaboratively to solve problems and deliver high-quality results in a fast-paced, unstructured environment.
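Illustrative sketch for the format-conversion bullets above: a minimal Spark job in Scala that converts delimited source files into partitioned Parquet. This is a hedged, hypothetical example; the paths, column names, and the txn_date partition key are placeholders, not details from any engagement listed below.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: convert raw CSV files to partitioned Parquet.
// All paths and column names are illustrative placeholders.
object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .getOrCreate()

    val raw = spark.read
      .option("header", "true")      // first line holds column names
      .option("inferSchema", "true") // let Spark derive column types
      .csv("/data/raw/transactions")

    raw.write
      .mode("overwrite")
      .partitionBy("txn_date")       // assumes the source carries a txn_date column
      .parquet("/data/curated/transactions")

    spark.stop()
  }
}
```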
TECHNICAL SKILLS:
Big Data Frameworks: Hadoop (HDFS, MapReduce), Spark, Spark SQL, Spark Streaming, Hive, Impala, Kafka, HBase, Flume, Pig, Sqoop, Oozie, Cassandra.
Big Data distributions: Cloudera, Hortonworks, Amazon EMR, Azure
Programming languages: Core Java, Scala, Python, Shell scripting
Operating Systems: Windows, Linux (Ubuntu, CentOS)
Databases: Oracle, SQL Server, MySQL
Designing Tools: UML, Visio
IDEs: Eclipse, NetBeans
Java Technologies: JSP, JDBC, Servlets, Junit
Web Technologies: XML, HTML, JavaScript, jQuery, JSON
Linux Experience: System Administration Tools, Puppet
Development methodologies: Agile, Waterfall
Logging Tools: Log4j
Application / Web Servers: Apache Tomcat, WebSphere
Messaging Services: ActiveMQ, Kafka, JMS
Version Tools: Git and CVS
Others: PuTTY, WinSCP, Data Lake, Talend, AWS, Terraform
PROFESSIONAL EXPERIENCE:
Confidential, Middletown NJ
Spark Developer
Responsibilities:
- Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.
- Used Spark for interactive queries and processing of streaming data, and integrated it with popular NoSQL databases.
- Experience with AWS Cloud IAM, Data pipeline, EMR, S3, EC2.
- Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations.
- Developed Spark code using Scala and Spark-SQL for faster processing of data.
- Created Oozie workflows to run multiple Spark jobs.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Wrote Terraform scripts that automate step execution in EMR to load data into ScyllaDB.
- De-normalized data coming from Netezza as part of the transformation and loaded it into NoSQL databases and MySQL.
- Developed a Kafka consumer in Scala for consuming data from Kafka topics.
- Wrote real-time processing and core jobs in Scala using Spark Streaming with Kafka as the data pipeline (see the sketch after this list).
- Implemented data quality checks using Spark Streaming and flagged records as bad or passable.
- Good knowledge of setting batch, slide, and window intervals in Spark Streaming using Scala.
- Implemented Spark-SQL with various data sources like JSON, Parquet, ORC and Hive.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Used Spark Streaming APIs to perform transformations and actions on the fly, building a common learner data model that gets data from Kafka in near real time and persists it to Cassandra.
- Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
- Developed Spark scripts using Scala Shell commands as per the requirements.
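A minimal sketch of the Kafka-to-Spark Streaming pattern described above, assuming the spark-streaming-kafka-0-10 integration. The broker address, topic and group names, the 30-second batch interval, and the field-count quality rule are illustrative placeholders, and the Cassandra write is intentionally left as a stub.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Minimal sketch of a Kafka -> Spark Streaming pipeline with a simple
// quality gate. Broker, topic, group id, and the parse rule are placeholders.
object LearnerEventStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("learner-event-stream")
    val ssc  = new StreamingContext(conf, Seconds(30)) // batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "learner-stream",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Tag each record as "pass" or "bad" based on a trivial field-count check.
    val flagged = stream.map(_.value).map { line =>
      val ok = line.split(",").length >= 3
      (if (ok) "pass" else "bad", line)
    }

    flagged.foreachRDD { rdd =>
      // Persist the passing records of each micro-batch; the actual Cassandra
      // write (e.g. via a connector) is deliberately omitted from this sketch.
      rdd.filter(_._1 == "pass").foreachPartition(_.foreach(_ => ()))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```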
Environment: HDFS, Spark, Scala, Tomcat, Netezza, EMR, Oracle, Sqoop, AWS, Terraform, ScyllaDB, Cassandra, MySQL, Oozie
Confidential, Houston, TX
Sr. Hadoop Developer
Responsibilities:
- Experience with the complete SDLC process: staging, code reviews, source code management, and the build process.
- Implemented Big Data platforms using Cloudera CDH4 as data storage, retrieval and processing systems.
- Experienced in Spark Core, Spark SQL, Spark Streaming.
- Performed transformations on the data using different Spark modules.
- Developed data pipelines using Flume, Sqoop, Pig and Map Reduce to ingest data into HDFS for analysis.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Implemented Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs/MapReduce in Spark for data aggregation, queries, and writing data into HDFS through Sqoop (see the aggregation sketch after this list).
- Utilized Azure services, Databricks, and automation tools including Azure Resource Manager, Puppet, Chef, and Ansible to implement a cloud operating model enabling Environment-as-a-Service and DevOps.
- Developed, tested, and documented enhancements and extensions to Databricks and the Azure Data Lake.
- Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
- Developed a pipeline for continuous data ingestion using Kafka and Spark Streaming.
- Wrote Sqoop scripts for importing large data sets from Teradata into HDFS.
- Performed Data Ingestion from multiple internal clients using Apache Kafka.
- Wrote MapReduce jobs to discover trends in data usage by the users.
- Developed Flume configurations to extract log data from different sources and transfer data in different file formats (JSON, XML, Parquet) to Hive tables using different SerDes.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Pig.
- Experienced working on Pig to do transformations, event joins, filtering and some pre-aggregations before storing the data onto HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Involved in developing Hive UDFs for needed functionality that is not available out of the box in Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Experienced in running Hadoop streaming jobs to process terabytes of formatted data using Python scripts.
- Responsible for executing Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
- Developed and executed Hive queries for de-normalizing the data.
- Developed the Apache Storm, Kafka, and HDFS integration project to do a real-time data analysis.
- Experience loading and transforming structured and unstructured data into HBase, and exposure to handling automatic failover in HBase.
- Ran POCs in Spark to benchmark the implementation.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Automated the end to end processing using Oozie workflows and coordinators.
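A minimal sketch of the DataFrame/UDF aggregation pattern referenced above, reading from and writing to Hive tables. The database, table, and column names and the region-normalization rule are hypothetical placeholders, not details of the actual workload.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

// Minimal sketch: aggregate a staging table with a small Scala UDF and
// publish the result to a reporting table. All names are placeholders.
object DailyUsageAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-usage-aggregation")
      .enableHiveSupport()          // read and write through the Hive metastore
      .getOrCreate()
    import spark.implicits._

    // Normalize a free-text region code (placeholder business rule).
    val normalizeRegion = udf((r: String) =>
      Option(r).map(_.trim.toUpperCase).filter(_.nonEmpty).getOrElse("UNKNOWN"))

    val usage = spark.table("staging.daily_usage")
      .withColumn("region", normalizeRegion($"region"))
      .groupBy($"region", $"usage_date")
      .agg(sum($"bytes_used").as("total_bytes"))

    // Overwrite the reporting table with the day's aggregates.
    usage.write.mode("overwrite").saveAsTable("warehouse.usage_by_region")

    spark.stop()
  }
}
```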
Environment: Cloudera, Java, Scala, Hadoop, Spark, HDFS, MapReduce, Yarn, Hive, Pig, Zookeeper, Impala, Oozie, Sqoop, Flume, Kafka, Teradata, SQL, GitHub, Phabricator, Amazon Web Services
Confidential, San Diego, CA
Big Data Developer
Responsibilities:
- Worked on a live 90-node Hadoop cluster running CDH 4.4.
- Worked with highly unstructured and semi-structured data of 90 TB (270 TB with replication).
- Extracted the data from Teradata into HDFS using Sqoop.
- Worked with Sqoop (version 1.4.3) jobs with incremental load to populate Hive External tables.
- Extensive experience writing Pig (version 0.10) scripts to transform raw data from several data sources into baseline data.
- Experience with Amazon AWS services such as EMR, EC2, S3, CloudFormation, and Redshift, which provide fast and efficient processing of Big Data.
- Created a data lake on Amazon S3.
- Implemented scheduled downtime for non-prod servers for optimizing AWS pricing.
- Developed Hive (version 0.10) scripts for end user / analyst requirements to perform ad hoc analysis.
- Implemented AWS solutions using EC2, S3, RDS, EBS, Elastic Load Balancing, and Auto Scaling groups.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (see the DDL sketch after this list).
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Developed UDFs in Java as needed for use in Pig and Hive queries.
- Experience using SequenceFile, RCFile, Avro, and HAR file formats.
- Developed Oozie workflow for scheduling and orchestrating the ETL process
- Worked on performance tuning to ensure that assigned systems were patched, configured, and optimized for maximum functionality and availability; implemented solutions that reduced single points of failure and improved system uptime to 99.9% availability.
- Wrote MapReduce programs in Python using the Hadoop Streaming API.
- Extracted data from CouchDB through Sqoop, placed it in HDFS, and processed it.
- Worked on Hive to expose data for further analysis and to transform files from different analytical formats to text files.
- Imported data from MySQL server and other relational databases to Apache Hadoop with the help of Apache Sqoop.
- Created Hive tables and worked on them for data analysis to meet the business requirements.
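A minimal sketch of the partitioned external Hive table work mentioned above. In the project itself this DDL would have run through the Hive CLI; it is issued here through Spark SQL only to keep all the sketches in one language. The database, table, column, path, and partition values are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a partitioned external Hive table over Sqoop-landed text
// files, plus registration of a newly landed partition. Names are placeholders.
object HiveTableSetup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-table-setup")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS analytics.customer_events (
        customer_id BIGINT,
        event_type  STRING,
        amount      DOUBLE
      )
      PARTITIONED BY (load_date STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/data/landing/customer_events'
    """)

    // Register a newly landed partition so ad hoc Hive queries can see it.
    spark.sql(
      "ALTER TABLE analytics.customer_events " +
      "ADD IF NOT EXISTS PARTITION (load_date = '2016-01-01')")

    spark.stop()
  }
}
```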
Environment: Hadoop, MapReduce, HDFS, Hive, HBase, Sqoop, Pig, Flume, Oracle 11/10g, DB2, Teradata, MySQL, Eclipse, PL/SQL, Java, Linux, Shell Scripting, SQL Developer, SOLR.
Confidential
Hadoop Developer
Responsibilities:
- Worked with business teams and created Hive queries for ad hoc access.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Involved in review of functional and non-functional requirements.
- Responsible for managing data coming from various sources.
- Loaded daily data from websites into the Hadoop cluster using Flume.
- Involved in loading data from UNIX file system to HDFS.
- Created Hive tables and worked on them using HiveQL.
- Created complex Hive tables and executed complex Hive queries on Hive warehouse.
- Wrote MapReduce code to convert unstructured data to semi-structured data (the parsing step is sketched after this list).
- Used Pig for extract, transform, and load (ETL) of semi-structured data.
- Installed and configured Hive and wrote Hive UDFs.
- Developed Hive queries for the analysts.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Designed a technical solution for real-time analytics using Kafka and HBase.
- Provided cluster coordination services through ZooKeeper.
- Collected log data from web servers and ingested it into HDFS using Flume.
- Used Pig as an ETL tool for transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Supported BI data analysts and developers with Hive/Pig development.
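The unstructured-to-semi-structured conversion noted above comes down to per-line parsing inside the MapReduce mapper. The sketch below shows only that parsing logic in Scala; the access-log layout, the chosen fields, and the tab-delimited output are assumptions for illustration, not details of the actual job.

```scala
// Sketch of the line-parsing logic behind the "unstructured to semi-structured"
// MapReduce step. Log layout and fields are assumed; malformed lines are dropped.
object LogLineParser {
  private val LogPattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d{3}) (\S+).*$""".r

  /** Turn one raw access-log line into a tab-delimited record, if it parses. */
  def parse(line: String): Option[String] = line match {
    case LogPattern(ip, timestamp, method, path, status, bytes) =>
      Some(Seq(ip, timestamp, method, path, status, bytes).mkString("\t"))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val sample =
      """203.0.113.7 - - [10/Oct/2015:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326"""
    println(parse(sample).getOrElse("unparsed"))
  }
}
```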
Environment: Apache Hadoop, HDFS, Cassandra, MapReduce, HBase, Impala, Java (JDK 1.6), Kafka, MySQL, Amazon, DB Visualizer, Linux, Sqoop, Apache Hive, Apache Pig, InfoSphere, Python, Scala, NoSQL, Flume, Oozie
Confidential
Java/J2EE Developer
Responsibilities:
- Analyzed and modified Java/J2EE applications using JDK 1.7/1.8 and developed web pages using the Spring MVC framework.
- Coordinated with business analysts and application architects to maintain knowledge of all functional requirements and ensure compliance with all architecture standards.
- Followed Agile methodology with TDD through all phases of the SDLC.
- Used Connection Pooling to get JDBC connection and access database procedures.
- Attended the daily stand-up meetings.
- Used Rally for managing the portfolio and for creating and tracking user stories.
- Responsible for analysis, design, development and integration of UI components with backend using J2EE technologies.
- Used JUnit to validate input for functions as part of TDD.
- Developed User Interface pages using HTML5, CSS3 and JavaScript.
- Involved in development activities using Core Java/J2EE, Servlets, JSP, and JSF for creating web applications, along with XML and Spring.
- Used Maven for building the application and ran it on a Tomcat server.
- Used Git as version control for tracking changes in the project.
- Used the JUnit framework for unit testing and Selenium for integration testing and test automation.
- Assisted in development of various applications, maintained their quality, and troubleshot to resolve all application issues/bugs identified during the test cycles.
Environment: Java/J2EE, JDK 1.7/1.8, LINUX, Spring MVC, Eclipse, JUnit, Servlets, DB2, Oracle 11g/12c, GIT, GitHub, JSON, RESTful, HTML5, CSS3, JavaScript, Rally, Agile/Scrum