Spark / Hadoop Developer Resume
Plano, TX
SUMMARY
- Over 8 years of professional IT experience with a strong emphasis on the development and testing of software applications.
- 4+ years of experience with the Hadoop Distributed File System (HDFS), Impala, Sqoop, Hive, HBase, Spark, Hue, the MapReduce framework, Kafka, YARN, Flume, Oozie, Zookeeper and Pig.
- Hands-on experience with various components of the Hadoop ecosystem such as Job Tracker, Task Tracker, Name Node, Data Node, Resource Manager and Application Manager.
- Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR and Amazon Elastic Compute Cloud (Amazon EC2).
- Experience in working with Amazon EMR, Cloudera (CDH3 & CDH4) and Hortonworks Hadoop Distributions.
- Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrames APIs.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala (a sketch follows this summary).
- Experience implementing real-time event processing and analytics using Spark Streaming with messaging systems such as Kafka.
- Capable of creating real time data streaming solutions and batch style large scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
- Experience in analyzing data using Spark SQL, HiveQL, Pig Latin, Spark/Scala and custom MapReduce programs in Java.
- Have experience in Apache Spark, Spark Streaming, Spark SQL and NoSQL databases like HBase, Cassandra, and MongoDB.
- Experience in creating DStreams from sources like Flume and Kafka and performing different Spark transformations and actions on them.
- Experience in integrating Apache Kafka with Apache Storm and created Storm data pipelines for real time processing.
- Performed operations on real-time data using Storm, Spark Streaming from sources like Kafka, Flume.
- Implemented Pig Latin scripts to process, analyze and manipulate data files to get required statistics.
- Experienced with different file formats like Parquet, ORC, Avro, Sequence, CSV, XML, JSON, Text files.
- Worked with Big Data Hadoop distributions: Cloudera, Hortonworks and Amazon AWS.
- Developed MapReduce jobs using Java to process large data sets by fitting the problem into the MapReduce programming paradigm.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions; streamed data in real time using Spark with Kafka for faster processing.
- Experience developing a data pipeline using Kafka to store data in HDFS.
- Good experience in creating and designing data ingest pipelines using technologies such as Apache Storm-Kafka.
- Used Scala and SBT to develop Scala-coded Spark projects and executed them using spark-submit.
- Experience working with data extraction, transformation and loading in Hive, Pig and HBase.
- Orchestrated various Sqoop queries, Pig scripts, Hive queries using Oozie workflows and sub-workflows.
- Responsible for handling different data formats like Avro, Parquet and ORC formats.
- Experience in performance tuning and monitoring the Hadoop cluster by gathering and analyzing the existing infrastructure using Cloudera Manager.
- Knowledge of job workflow scheduling and monitoring tools like Oozie (Hive, Pig) and Zookeeper (HBase).
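A minimal sketch of the Hive-to-Spark conversion pattern referenced above (rewriting a HiveQL aggregation as Spark SQL and as RDD transformations). The events table and category column are hypothetical placeholders, not taken from any project below; a job like this would typically be built with SBT and submitted with spark-submit, as noted above.

```scala
import org.apache.spark.sql.SparkSession

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; the "events" table and "category" column are placeholders.
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // The original HiveQL aggregation, run directly through Spark SQL.
    val viaSql = spark.sql(
      "SELECT category, COUNT(*) AS cnt FROM events GROUP BY category")

    // The same aggregation rewritten as RDD transformations plus an action.
    val viaRdd = spark.table("events").rdd
      .map(row => (row.getAs[String]("category"), 1L))
      .reduceByKey(_ + _)

    viaSql.show(10)                    // action on the DataFrame
    viaRdd.take(10).foreach(println)   // action on the RDD

    spark.stop()
  }
}
```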
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm & Parquet.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache
Languages: Java, Python, SQL, HTML, DHTML, Scala, JavaScript, XML and C/C++
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, and Struts
Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery and CSS, AngularJS, ExtJS and JSON
Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j
Frameworks: Struts, Spring and Hibernate
App/Web servers: WebSphere, WebLogic, JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i, 10g, 11g, MS SQL Server, MySQL and DB2
Operating systems: UNIX, LINUX, Mac OS and Windows Variants
PROFESSIONAL EXPERIENCE
Confidential, Plano TX
Spark / Hadoop Developer
Responsibilities:
- Hands-on experience in Spark and Spark Streaming, creating RDDs and applying transformations and actions on them.
- Developed Spark applications using Scala for easy Hadoop transitions.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Used the Spark Streaming API to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time (a sketch of this pattern follows this list).
- Responsible for loading Data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
- Developed Kafka producers and consumers, Cassandra clients and Spark components on HDFS and Hive.
- Populated HDFS and HBase with huge amounts of data using Apache Kafka.
- Used Kafka to ingest data into the Spark engine.
- Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
- Managed and scheduled Spark jobs on a Hadoop cluster using Oozie.
- Experienced with different scripting languages such as Python and shell scripting.
- Developed various Python scripts to find vulnerabilities with SQL Queries by doing SQL injection, permission checks and performance analysis.
- Experienced in Apache Spark for implementing advanced procedures like text analytics and processing using the in-memory computing capabilities, written in Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
- Worked on Spark SQL, created DataFrames by loading data from Hive tables, and created prep data stored in AWS S3.
- Experience with AWS services including IAM, Data Pipeline, EMR, S3, EC2, the AWS CLI, SNS and other services.
- Created custom UDFs for Pig and Hive to incorporate Python methods and functionality into Pig Latin and HiveQL.
- Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Gzip and Zlib.
- Implemented Hortonworks NiFi (HDP 2.4) and recommended solution to inject data from multiple data sources to HDFS and Hive using NiFi.
- Developed various data loading strategies and performed various transformations for analyzing the datasets by using Hortonworks Distribution for Hadoop ecosystem.
- Ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra per business requirements; used Cassandra through Java services.
- Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
- Created S3 buckets and managed bucket policies, utilizing S3 and Glacier for storage and backup on AWS.
- Performed AWS Cloud administration managing EC2 instances, S3, SES and SNS services.
- Worked with Elasticsearch time-series data such as metrics and application events, an area where the large Beats ecosystem makes it easy to collect data from common applications.
- Hands-on experience in developing applications with Java, J2EE, Servlets, JSP, EJB, SOAP, Web Services, JNDI, JMS, JDBC 2.0, Hibernate, Struts, Spring, XML, HTML, XSD, XSLT, PL/SQL, Oracle 10g and MS SQL Server RDBMS.
- Experience migrating on-premise ETLs to Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Dataproc, Google Cloud Storage and Cloud Composer.
- Delivered zero-defect code for three large projects involving changes to both the front end (Core Java, Presentation Services) and the back end (Oracle).
- Worked with the infrastructure team on the design and development of a Kafka- and Storm-based data pipeline.
- Used Oozie operational services for batch processing and scheduling workflows dynamically.
- Keen interest in the technology stack that Google Cloud Platform (GCP) offers.
- Worked in parallel across both the GCP and Azure clouds.
- Involved in loading and transforming large datasets from relational databases into HDFS and vice versa using Sqoop imports and exports.
- Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS and vice versa using Sqoop.
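A minimal sketch of the Kafka-to-Spark Streaming ingestion pattern described in this list, assuming the spark-streaming-kafka-0-10 integration. The broker address, topic, consumer group and HDFS output path are placeholders, and the per-batch logic is illustrative only, not the actual learner data model.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-spark-streaming-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Broker address, group id and topic name are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer"  -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"          -> "learner-model-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Transformations and an action applied to each micro-batch.
    stream.map(record => record.value)
      .filter(_.nonEmpty)
      .foreachRDD { rdd =>
        // Placeholder HDFS output path, one directory per batch.
        if (!rdd.isEmpty()) rdd.saveAsTextFile("/data/learner/raw/" + System.currentTimeMillis())
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```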
Environment: Hadoop, Hive, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, Cassandra, Oozie, Shell Scripting, Scala, Maven, Java, JUnit, NiFi, MySQL, AWS, EMR, EC2, S3, Hortonworks.
Confidential, Plano TX
Hadoop/Spark Developer
Responsibilities:
- Optimized existing algorithms in Hadoop using the Spark context, Spark SQL, DataFrames and pair RDDs.
- Developed Spark scripts using Java and Python shell commands as per requirements.
- Involved with ingesting data received from various relational database providers, on HDFS for analysis and other big data operations.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQLContext.
- Performed analysis on implementing Spark using Scala.
- Used DataFrames/Datasets to write SQL-style queries with Spark SQL against datasets stored on HDFS (a sketch follows this list).
- Extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
- Created and imported various collections, documents into MongoDB and performed various actions like query, project, aggregation, sort and limit.
- Experience creating scripts for data modeling and data import/export. Extensive experience in deploying, managing and developing MongoDB clusters.
- Experience in migrating HiveQL into Impala to minimize query response time.
- Created Hive tables to import large data sets from various relational databases using Sqoop and exported the analyzed data back for visualization and report generation by the BI team.
- Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
- Implemented big data operations on the AWS cloud: created clusters using EMR and EC2 instances, used S3 buckets, ran analytical operations on Redshift, performed RDS and Lambda operations, and managed resources using IAM.
- Utilized frameworks such as Struts, Spring, Hibernate and web services to develop backend code.
- Used Hibernate reverse engineering tools to generate domain model classes, performed association and inheritance mapping using annotations and XML, and implemented second-level caching with the EHCache cache provider.
- Developed Pig Scripts, Pig UDFs and Hive Scripts, Hive UDFs to analyze HDFS data.
- Maintained cluster security using Kerberos and kept the cluster up and running at all times.
- Implemented optimization and performance testing and tuning of Hive and Pig.
- Developed a data pipeline using Kafka to store data into HDFS.
- Worked on reading multiple data formats on HDFS using Scala.
- Wrote shell scripts and Python scripts for job automation.
- Configured Zookeeper to restart failed jobs without human intervention.
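A minimal sketch of the Spark SQL / DataFrame-Dataset usage described in this list, written against the SparkSession entry point rather than the older SQLContext mentioned above. The Order schema, HDFS path and aggregation are hypothetical; the temp-view variant shows the same logic expressed as plain SQL, the style used when migrating existing Hive queries.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkSqlOnHdfsSketch {
  // Case class describing the assumed layout of the HDFS dataset.
  case class Order(orderId: String, customerId: String, amount: Double, status: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-sql-hdfs-sketch").getOrCreate()
    import spark.implicits._

    // Placeholder HDFS path; the Parquet files are assumed to match the Order schema.
    val orders = spark.read.parquet("hdfs:///data/orders").as[Order]

    // SQL-style query expressed through the Dataset/DataFrame API.
    val revenueByCustomer = orders
      .filter($"status" === "COMPLETE")
      .groupBy($"customerId")
      .agg(sum($"amount").alias("total_revenue"))
      .orderBy(desc("total_revenue"))

    // The same query run through a temp view with plain SQL.
    orders.createOrReplaceTempView("orders")
    val viaSql = spark.sql(
      """SELECT customerId, SUM(amount) AS total_revenue
        |FROM orders WHERE status = 'COMPLETE'
        |GROUP BY customerId ORDER BY total_revenue DESC""".stripMargin)

    revenueByCustomer.show(20)
    viaSql.show(20)
    spark.stop()
  }
}
```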
Environment: Cloudera, HDFS, Hive, HQL Scripts, MapReduce, Java, HBase, Pig, Sqoop, Kafka, Impala, Shell Scripts, Python Scripts, Spark, Scala, Oozie, Zookeeper, Maven, JUnit, NiFi, AWS, EMR, EC2, S3.
Confidential, Nashua NH
Spark/Hadoop Developer
Responsibilities:
- Experience in developing customized UDFs in Java to extend Hive and Pig Latin functionality.
- Responsible for installing, configuring, supporting, and managing of Hadoop Clusters.
- Imported and exported data into HDFS from an Oracle 10.2 database and vice versa using Sqoop.
- Installed and configured Pig and wrote Pig Latin scripts.
- Designed and implemented Hive queries and functions for evaluating, filtering, loading and storing data.
- Created HBase tables and column families to store the user event data.
- Wrote automated HBase test cases for data quality checks using HBase command-line tools.
- Developed a data pipeline using HBase, Spark and Hive to ingest, transform and analyze customer behavioral data.
- Experience collecting log data from different sources (web servers and social media) using Flume and storing it on HDFS to run MapReduce jobs.
- Handled importing of data from machine logs using Flume.
- Created Hive Tables, loaded data from Teradata using Sqoop.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
- Configured, monitored, and optimized the Flume agent that captures web logs from the VPN server into the Hadoop data lake.
- Responsible for loading data from UNIX file systems to HDFS. Installed and configured Hive and wrote Pig/Hive UDFs.
- Wrote, tested and implemented Teradata FastLoad, MultiLoad and BTEQ scripts, DML and DDL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
- Exported the analyzed data to the relational databases using Sqoop to further visualize and generate reports for the BI team.
- Developed ETL processes using Spark, Scala, Hive and HBase (a sketch follows this list).
- Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
- Wrote Java code to format XML documents and upload them to the Solr server for indexing.
- Used NoSQL technology (Amazon DynamoDB) to gather and track event-based metrics.
- Maintained all the services in the Hadoop ecosystem using ZooKeeper.
- Worked on implementing the Spark framework.
- Designed and implemented Spark jobs to support distributed data processing.
- Expertise in extracting, transforming and loading data from Oracle, DB2, SQL Server, MS Access, Excel, flat files and XML using Talend.
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
- Helped design scalable big data clusters and solutions.
- Followed agile methodology for the entire project.
- Experience in working with Hadoop clusters using Cloudera distributions.
- Involved in Hadoop cluster tasks such as adding and removing nodes without affecting running jobs or data.
- Developed workflows using Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Developed interactive shell scripts for scheduling various data cleansing and data loading process.
- Converted the existing relational database model to the Hadoop ecosystem.
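A minimal sketch of the Spark/Scala/Hive ETL step referenced above; the HBase portion is omitted here, and the database, table and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CustomerBehaviorEtlSketch {
  def main(args: Array[String]): Unit = {
    // Hive-enabled session; database, table and column names are placeholders.
    val spark = SparkSession.builder()
      .appName("customer-behavior-etl-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Extract: raw events previously ingested into a Hive table.
    val events = spark.table("raw_db.customer_events")

    // Transform: aggregate behavior per customer and event type.
    val behavior = events
      .filter(col("event_ts").isNotNull)
      .groupBy(col("customer_id"), col("event_type"))
      .agg(count(lit(1)).alias("event_count"),
           max(col("event_ts")).alias("last_seen"))

    // Load: write the curated dataset back to Hive as ORC for downstream analysis.
    behavior.write
      .mode("overwrite")
      .format("orc")
      .saveAsTable("curated_db.customer_behavior")

    spark.stop()
  }
}
```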
Environment: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell Scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Cloudera.
Confidential
Java Developer
Responsibilities:
- Implemented applications using Java, J2EE, JSP, Servlets, JDBC, RAD, XML, HTML, XHTML, Hibernate, Struts, Spring and JavaScript on Windows environments.
- Experienced in developing web-based applications using Python, Django, PHP, XML, CSS, HTML, JavaScript and jQuery.
- Designed and implemented the training and reports modules of the application using Servlets, JSP and Ajax.
- Developed XML Web Services using SOAP, WSDL, and UDDI.
- Created the UI tool using Java, XML, XSLT, DHTML and JavaScript.
- Experience with the SDLC and involvement in all of its phases.
- Developed action Servlets and JSPs for presentation in Struts MVC framework.
- Worked with Struts MVC objects like Action Servlet, Controllers, validators, Web Application Context, Handler Mapping, Message Resource Bundles, Form Controller, and JNDI for look-up for J2EE components.
- Developed a PL/SQL view function in the Oracle 9i database for the get-available-date module.
- Used Oracle SQL 4.0 as the database and wrote SQL queries in the DAO layer.
- Experience in application development using Core Java, JDBC, JSP, Servlets, Spring, Hibernate, Web Services, SOAP and WSDL.
- Used RESTful services to interact with the client by providing RESTful URL mappings.
- Used SVN and GitHub as version control tool.
- Implemented Hibernate in the data access object layer to access and update information in the Oracle 10g Database.
- Experience with JIRA; tracked test results and interacted with developers to resolve issues.
- Used XSLT to transform XML data structures into HTML pages.
- Deployed EJB components on Tomcat. Used the JDBC API for interaction with Oracle DB.
- Wrote build and deployment scripts using shell, Perl and ANT.
- Extensively used Java multi-threading to implement batch Jobs with JDK 1.5 features
Environment: HTML, JavaScript, Ajax, Servlets, JSP, SOAP, SDLC, Java, Hibernate, Scrum, JIRA, GitHub, jQuery, CSS, XML, ANT, Tomcat Server, Jasper Reports.