Hadoop/Spark Developer Resume
Piscataway, NJ
PROFESSIONAL SUMMARY:
- 8+ years of IT experience in software development and support, including developing strategic methods for deploying Big Data technologies to efficiently solve Big Data processing requirements.
- Expertise in Hadoop ecosystem components (HDFS, MapReduce, YARN, HBase, Pig, Sqoop and Hive) for scalability, distributed computing and high-performance computing.
- Experience in Apache Spark, Spark Streaming, Spark SQL and NoSQL databases such as Cassandra and HBase.
- Experience in using Hive Query Language (HiveQL) for data analytics.
- Experienced in installing, configuring and maintaining Hadoop clusters.
- Hands-on experience with the Talend Open Studio ETL tool.
- Strong knowledge of creating and monitoring Hadoop clusters on Amazon EC2 and VMs, with Hortonworks Data Platform 2.1 & 2.2 and Cloudera CDH3/CDH4 (Cloudera Manager) on Linux and Ubuntu.
- Capable of processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture.
- Good knowledge of single-node and multi-node cluster configurations.
- Strong knowledge of NoSQL column-oriented databases such as HBase, Cassandra and MongoDB, and their integration with Hadoop clusters.
- Executed batch jobs over data streams using Spark Streaming.
- Experience with Hadoop distributions such as Amazon, Cloudera and Hortonworks.
- Thorough knowledge of Spark architecture and how RDDs work internally, with exposure to Spark Streaming and Spark SQL.
- Used Talend Open Studio to load files into Hive tables and performed ETL aggregations in Hive.
- Expertise in the Scala programming language and Spark Core.
- Designed and created ETL jobs in Talend to load large volumes of data into Cassandra, the Hadoop ecosystem and relational databases.
- Experienced in job workflow scheduling and monitoring tools like Oozie and ZooKeeper.
- Good knowledge of Amazon EMR, S3 buckets, DynamoDB and Redshift.
- Analyze data, interpret results and convey findings in a concise and professional manner.
- Partner with the Data Infrastructure team and business owners to implement new data sources and ensure consistent definitions are used in reporting and analytics.
- Promote a full-cycle approach including request analysis, creating/pulling datasets, report creation and implementation, and providing final analysis to the requestor.
- Very good understanding of SQL, ETL and data warehousing technologies.
- Knowledge of MS SQL Server 2012/2008/2005, Oracle 11g/10g/9i and E-Business Suite.
- Expert in T-SQL, creating and using stored procedures, views and user-defined functions, and implementing Business Intelligence solutions using SQL Server 2000/2005/2008.
- Developed a web services module for integration using SOAP and REST.
- NoSQL database experience with HBase and Cassandra.
- Flexible with Unix/Linux and Windows environments, working with operating systems such as CentOS 5/6, Ubuntu 13/14 and Cosmos.
- Good experience with Kafka and Storm.
- Knowledge of the Java Virtual Machine (JVM) and multithreaded processing.
- Strong programming skills in designing and implementing applications using Core Java, J2EE, JDBC, JSP, HTML, Spring Framework, Spring Batch, Spring AOP, Struts, JavaScript and Servlets.
- Experience writing build scripts using Maven and working with continuous integration systems such as Jenkins.
- Java developer with extensive experience with various Java libraries, APIs and frameworks.
- Hands-on development experience with RDBMSs, including writing complex SQL queries, stored procedures and triggers.
- Sound knowledge of designing data warehousing applications using tools such as Teradata, Oracle and SQL Server.
- Experience working with job schedulers such as Autosys and Maestro.
- Strong in databases such as Sybase, DB2, Oracle and MS SQL Server.
- Strong understanding of Agile (Scrum) and Waterfall SDLC methodologies.
- Strong communication, collaboration and team-building skills, with proficiency at grasping new technical concepts quickly and utilizing them productively.
- Adept in analyzing information system needs, evaluating end-user requirements, custom-designing solutions and troubleshooting information systems.
- Strong analytical and problem-solving skills.
TECHNICAL SKILLS:
Big Data Ecosystem: HDFS, HBase, Hadoop MapReduce, Hive, Pig, Sqoop, Spark, Flume, Oozie, Cassandra, Storm and Impala.
Distributions: Apache Hadoop 1.0.4, Cloudera CDH3, CDH4.
Languages: C, C++, Java, SQL/PLSQL, Python.
Methodologies: Agile, Waterfall.
Database: Oracle 10g, DB2, MySQL, MS SQL Server.
Web Tools: HTML, JavaScript, XML, ODBC, JDBC, Hibernate, JSP, Servlets, Java, Struts, Spring, JUnit, JSON and Avro.
IDE / Testing Tools: Eclipse, Visual Studio, NetBeans, PuTTY.
Operating System: Windows, UNIX, Linux.
Scripts: JavaScript, Shell Scripting.
Version Control: SVN, CVS, TFS.
PROFESSIONAL EXPERIENCE:
Hadoop/Spark Developer
Confidential, Piscataway, NJ
Responsibilities:
- Importing and exporting data into HDFS and Hive using Sqoop.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used the Talend ETL tool.
- Migrated MapReduce jobs to Spark Jobs to achieve better performance.
- Designed and implemented real-time applications using Apache Storm, Trident, Kafka, the Apache Ignite in-memory data grid and Accumulo.
- Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
- Streamed data into Apache Ignite by setting up caches for efficient data analysis.
- Created Hive external tables, loaded data into them and queried the data using HiveQL.
- Responsible for managing data coming from different sources.
- Good knowledge of cloud integration with Amazon Elastic MapReduce (EMR).
- Installed and maintained the Hadoop/Spark cluster from scratch in a plain Linux environment and defined code outputs as PMML.
- Implemented data loading using Spark, Storm, Kafka and Elasticsearch.
- Experience in integrating Cassandra with Elasticsearch and Hadoop.
- Loaded and transformed large sets of structured, semi-structured and unstructured data, including joins and some pre-aggregations, before storing the data in HDFS.
- Used Hive join queries to join multiple tables of a source system for loading and analyzing data, and loaded the results into Elasticsearch.
- Worked on Apache Ranger for HDFS, HBase and Hive access and permissions for users through Active Directory.
- Designed and implemented a POC for a microservices-based interface to OLS.
- Extracted data from Teradata into HDFS, databases and dashboards using Spark Streaming.
- Involved in migrating MapReduce jobs to Spark jobs and used the Spark SQL and DataFrames APIs to load structured and semi-structured data into Spark clusters.
- Extensive experience in Spark Streaming (version 1.5.2) through the core Spark API, running Scala, Java and Python scripts to transform raw data from several data sources into baseline data.
- Hands-on expertise in running Spark and Spark SQL on the Spark engine.
- Implemented Spark batch jobs on AWS instances using Amazon Simple Storage Service (Amazon S3).
- Performed performance tuning for Spark Streaming, e.g., setting the right batch interval, the correct level of parallelism, the right serialization format and memory tuning.
- Replaced MapReduce with GridGain.
- Created HBase tables to store variable data formats of input data coming from different portfolios. Involved in adding huge volumes of data in rows and columns to store data in HBase.
- Optimized performance with large datasets using partitioning, Spark in-memory capabilities, Spark broadcasts, effective and efficient joins, transformations and other heavy lifting during the ingestion process itself.
- Built Big Data solutions using HBase, handling millions of records for different data trends and exporting them to Hive.
- Developed Spark code using Scala and Spark SQL for batch processing of data.
- Integrated Cassandra with Talend and automated jobs.
- Involved in the requirements and design phases to implement a streaming Lambda architecture for real-time processing using Spark and Kafka (see the sketch at the end of this list).
- Designed and developed database operations in PostgreSQL.
- Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Hands-on experience working with HBase (NoSQL) and PostgreSQL.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Used the Spark DataFrame API to process structured and semi-structured files and load them back into an S3 bucket.
- Designed an application that receives data from several source systems and ingests it into a PostgreSQL database.
- Experience developing Spark applications for loading/streaming data into NoSQL databases (MongoDB) and HDFS.
- Used Spark to develop machine learning algorithms that analyze clickstream data.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Pig, Hive and Sqoop) as well as system-specific jobs (such as Perl and shell scripts).
- Automated all the jobs for pulling NetFlow data from relational databases into Hive tables using Oozie workflows, and enabled email alerts for any failure cases.
- Automated and scheduled Sqoop jobs using Unix shell scripts.
- Installed and configured Hive, Pig, Sqoop and the Oozie workflow engine (developed Sqoop, Hive and Pig actions).
- Developed Scala and Python scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark 1.3+ for data aggregation, queries and writing data back to the OLTP system directly or through Sqoop.
- Involved in creating workflows to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Integrated Apache Storm with Kafka to perform web analytics; uploaded clickstream data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
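The following is a minimal sketch of the Kafka-to-HDFS streaming ingestion pattern referenced above, assuming Spark 1.5 with the spark-streaming-kafka integration; the broker list, topic name and output path are hypothetical placeholders, not the project's actual configuration.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ClickstreamIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ClickstreamIngest")
    // The batch interval is one of the main Spark Streaming tuning knobs
    val ssc = new StreamingContext(conf, Seconds(30))

    // Hypothetical broker list and topic name
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("clickstream")

    // Direct (receiver-less) Kafka stream of (key, value) string pairs
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Keep only the message payload and persist each non-empty micro-batch to HDFS
    stream.map(_._2).foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/clickstream/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```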
Environment: Hadoop, MapReduce, Spark, Spark SQL, Kafka, Storm, HDFS, Hive, Sqoop, Oozie, Java, SQL, Shell script, Talend
Hadoop Developer
Confidential, Bentonville, AR
Responsibilities:
- Installed and configured Hadoop MapReduce, HDFS and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Developed data pipeline programs with the Spark Scala API, data aggregations with Hive, and formatted the data as JSON for visualization, generating Highcharts visualizations (outlier, data distribution, correlation/comparison and two-dimensional charts) using JavaScript; see the sketch at the end of this list.
- Developed Scala and Python scripts and UDFs using both DataFrames/SQL and RDDs/MapReduce in Spark.
- Extracted files from CouchDB and MongoDB through Sqoop and placed them in HDFS for processing.
- Used Flume to collect, aggregate and store web log data from different sources such as web servers, mobile and network devices, and pushed it to HDFS.
- Developed Puppet scripts to install Hive, Sqoop, etc. on the nodes.
- Data backup and synchronization using Amazon Web Services.
- Built and maintained scalable data pipelines using the Hadoop ecosystem and other open-source components such as Hive and HBase.
- Worked on loading data from remote PostgreSQL to HBase tables for fast transactional lookups.
- Developed a Spark Streaming job to consume data from HDFS and perform lookups in HBase.
- Created HBase tables to load large sets of structured data.
- Executed queries in HBase to gather information about the data in real time.
- Installed Hadoop, MapReduce and HDFS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Involved in setting up Storm and Kafka clusters in an AWS environment, and in monitoring and troubleshooting the clusters.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Supported MapReduce programs running on the cluster.
- Loaded log data into HDFS using Flume and Kafka and performed ETL integrations.
- Designed and implemented DR and OR procedures.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond accordingly to any warning or failure conditions.
- Involved in loading data from UNIX file system to HDFS, configuring Hive and writing Hive UDFs
- Utilized Java and MySQL from day to day to debug and fix issues with client processes
- Used Java and J2EE application development skills with object-oriented analysis and was extensively involved throughout the Software Development Life Cycle (SDLC).
- Hands-on experience with Sun ONE Application Server, WebLogic Application Server, WebSphere Application Server, WebSphere Portal Server and J2EE application deployment technology.
- Gained very good business knowledge on health insurance, claim processing, fraud suspect identification, appeals process, etc.
- Worked on loading data from several flat-file sources to staging using Teradata MultiLoad and FastLoad.
- Monitored the Hadoop cluster using tools such as Nagios, Ganglia, Ambari and Cloudera Manager.
- Wrote automation scripts to monitor HDFS and HBase through cron jobs.
- Used MRUnit for debugging MapReduce jobs that use sequence files containing key-value pairs.
- Developed a high-performance cache, making the site stable and improving its performance.
- Created a complete processing engine based on Cloudera's distribution.
- Proficient with SQL languages and a good understanding of Informatica and Talend; provided administrative support for parallel computation research on a 24-node Fedora/Linux cluster.
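Below is a minimal sketch of the Hive-backed aggregation-to-JSON step referenced above (feeding a JavaScript charting layer such as Highcharts), assuming Spark 1.4+ with a HiveContext; the web_events table, its columns and the output path are hypothetical placeholders.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object DailyChartFeed {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyChartFeed"))
    val hiveContext = new HiveContext(sc)

    // Aggregate a hypothetical web_events Hive table into per-page daily hit counts
    val daily = hiveContext.sql(
      """SELECT page, to_date(event_ts) AS event_date, COUNT(*) AS hits
        |FROM web_events
        |GROUP BY page, to_date(event_ts)""".stripMargin)

    // Write the result as JSON so a JavaScript charting layer can consume it
    daily.coalesce(1).write.json("hdfs:///reports/daily_page_hits")

    sc.stop()
  }
}
```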
Environment: Hadoop, MapReduce, HDFS, Hive, Apache Spark, Kafka, CouchDB, Flume, AWS, Cassandra, Java, Struts, Servlets, HTML, XML, SQL, J2EE, MRUnit, JUnit, JDBC, Eclipse.
Hadoop Developer
Confidential, Bloomington, IL
Responsibilities:
- Involved in Requirement gathering, Business Analysis and translated business requirements into Technical design in Hadoop and Big Data
- Importing and exporting data into HDFS from database and vice versa using Sqoop.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
- Used the Spark SQL Scala interface, which automatically converts RDDs of case classes to schema RDDs (see the sketch at the end of this list).
- Used Spark SQL to read and write tables stored in Hive.
- Involved in creating workflows to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Involved in developing shell scripts and automating data management for end-to-end integration work.
- Used Pig as an ETL tool to perform transformations, joins and some pre-aggregations before storing data in HDFS.
- Developed a MapReduce program for parsing data and loading it into HDFS.
- Built reusable Hive UDF libraries for business requirements, which enabled users to apply these UDFs in Hive queries.
- Automated and scheduled Sqoop jobs using Unix shell scripts.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Experienced in using ZooKeeper and Oozie operational services for coordinating the cluster and scheduling workflows.
- Used HBase to store the majority of the data, which needs to be divided based on region.
- Developed MapReduce programs for data analysis and data cleaning
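A minimal sketch of the reflection-based schema inference referenced above (an RDD of case classes converted to a DataFrame and saved as a Hive table), assuming Spark 1.4+ with a HiveContext; the log path, pipe-delimited field layout and table name are hypothetical placeholders.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

// Case class whose fields define the schema Spark SQL infers via reflection
case class LogRecord(host: String, status: Int, bytes: Long)

object LogTableLoader {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogTableLoader"))
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Parse raw, pipe-delimited log lines into typed case-class records
    val records = sc.textFile("hdfs:///logs/raw/")
      .map(_.split('|'))
      .filter(_.length >= 3)
      .map(f => LogRecord(f(0), f(1).toInt, f(2).toLong))
      .toDF()

    // Persist as a Hive table so downstream HiveQL jobs can query it
    records.write.saveAsTable("parsed_logs")

    hiveContext.sql("SELECT status, COUNT(*) AS cnt FROM parsed_logs GROUP BY status").show()

    sc.stop()
  }
}
```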
Environment: HiveQL, MySQL, HDFS, Hive, HBase, Java, Eclipse, MS SQL, Spark, Azure, Pig, Sqoop, UNIX.
Java Developer
Confidential
Responsibilities:
- Involved in designing the shares and cash modules using UML.
- Effectively used the iterative waterfall software development methodology during this time-constrained project.
- Used HTML and JSP for the web pages and JavaScript for client-side validation.
- Created XML pages with DTDs for front-end functionality and information exchange.
- Responsible for writing Java SAX parser programs.
- Familiar with the state-of-the-art standards, processes and design practices used in creating and designing optimal UIs using Web 2.0 technologies like Ajax, JavaScript, CSS and XSLT.
- Developed Ant build scripts to build and deploy the application in enterprise archive format (.ear).
- Performed unit testing using JUnit and functional testing.
- Used the Hibernate framework and Spring JDBC framework modules for backend communication in the extended application.
- Used the JSON response format to retrieve data from web servers.
- Wrote ActionForm and Action classes, used various HTML, Bean and Logic tags, and configured struts-config.xml for global forwards, error forwards and action forwards.
- Developed the UI using JSP and Servlets and server-side code with Java.
- Used JDBC 2.0 extensively and was involved in writing several SQL queries for the data retrieval.
- Prepared program specifications for the loans module and involved in database designing.
- Provided security for REST using SSL and for SOAP using encryption with X.509 digital signatures.
- Involved in creating JUnit test cases and running the test suite using the EMMA tool.
- Developed Schema/Component Template/Template Building Block components in SDL Tridion.
- Developed code using SQL and PL/SQL: queries, joins, views, procedures/functions, triggers and packages.
- Implemented Hibernate to persist data into the database and wrote HQL-based queries to implement CRUD operations on the data.
- Java programming using Swing to complete the functionality of the cash lockers and security modules.
- Used application servers like WebLogic, WebSphere, Apache Tomcat, Glassfish and JBoss based on the client requirements and project specifications.
- Servlet programming for connecting to the database server and retrieving serialized data.
- Programmed stored procedures using SQL and PL/SQL for the bulk calculations of general ledger.
Environment: Java, J2EE, EJB 2.0, Servlets, JavaScript, OO, JSP, JNDI, Java Beans, WebLogic, XML, XSL, Eclipse, PL/SQL, Oracle 8i, HTML, DHTML, UML.
Associate Java Developer
Confidential
Responsibilities:
- Involved in Design, Development and Support phases of Software Development Life Cycle (SDLC)
- Reviewed the functional, design, source code and test specifications
- Involved in developing the complete front end using JavaScript and CSS.
- Authored the functional, design and test specifications.
- Implemented Backend, Configuration DAO, XML generation modules of DIS
- Analyzed, designed and developed the component
- Used JDBC for database access.
- Used UNIX Scripting technologies for coding and decoding.
- Used Data Transfer Object (DTO) design patterns
- Unit testing and rigorous integration testing of the whole application
- Wrote and executed test scripts using JUnit.
- Actively involved in system testing
- Developed XML parsing tool for regression testing
- Prepared the Installation, Customer guide and Configuration document which were delivered to the customer along with the product.
Environment: Java, JavaScript, HTML, CSS, JDK 1.5.1, JDBC, Oracle 10g, XML, XSL, Solaris and UML.