Spark/Hadoop Developer Resume
NY
PROFESSIONAL SUMMARY:
- Over 8 years of professional IT experience in all phases of the Software Development Life Cycle, including hands-on experience in Java/J2EE technologies and Big Data analytics.
- More than 4 years of work experience in ingestion, storage, querying, processing and analysis of Big Data, with hands-on experience in Hadoop ecosystem development including MapReduce, HDFS, Hive, Pig, Spark, Cloudera Navigator, Mahout, HBase, ZooKeeper, Sqoop, Flume, Oozie and AWS.
- Extensive experience working with Teradata, Oracle, Netezza, SQL Server and MySQL databases.
- Excellent understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
- Strong experience working with different Hadoop distributions like Cloudera, Hortonworks, MapR and Apache distributions.
- Experience in installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions and on Amazon Web Services (AWS).
- Experience in Amazon AWS services such as EMR, EC2, S3, CloudFormation and Redshift, which provide fast and efficient processing of Big Data.
- Implemented Kafka custom encoders for custom input formats to load data into Kafka partitions; streamed data in real time using Spark with Kafka for faster processing (a minimal sketch follows this summary).
- In-depth understanding of Hadoop architecture and its components such as HDFS, MapReduce, HDFS Federation, High Availability and YARN, with a good understanding of workload management, scalability and distributed platform architectures.
- Good understanding of R Programming, Data Mining and Machine Learning techniques.
- Strong experience and knowledge of real time data analytics using Storm, Kafka, Flume and Spark.
- Experience in troubleshooting errors in HBase Shell, Pig, Hive and MapReduce.
- Experience in installing and maintaining Cassandra by configuring the cassandra.yaml file as per the requirement.
- Involved in upgrading existing MongoDB instances from version 2.4 to version 2.6 by upgrading the security roles and implementing newer features.
- Responsible for performing reads and writes in Cassandra from a web application using Java JDBC connectivity.
- Experience in extending Hive and Pig core functionality using custom UDFs and UDAFs.
- Debugged MapReduce jobs using counters and MRUnit testing.
- Expertise in writing real-time processing applications using spouts and bolts in Storm.
- Experience in configuring various Storm topologies to ingest and process data on the fly from multiple sources and aggregate it into a central Hadoop repository.
- Good understanding of Spark Algorithms such as Classification, Clustering, and Regression.
- Good understanding on Spark Streaming with Kafka for real-time processing.
- Extensive experience working with Spark tools like RDD transformations, Spark MLlib and Spark SQL.
- Experienced in moving data from different sources using Kafka producers and consumers and preprocessing data using Storm topologies.
- Experience working on Solr to develop search engines on unstructured data in HDFS.
- Used Solr indexing to enable searching on non-primary-key columns from Cassandra keyspaces.
- Experienced in migrating ETL transformations using Pig Latin scripts, transformations and join operations.
- Good understanding of MPP databases such as HP Vertica, Greenplum and Impala.
- Hands-on experience in implementing sequence files, combiners, counters, dynamic partitions and bucketing for best practice and performance improvement.
- Good knowledge of streaming data from different data sources like log files, JMS and application sources into HDFS using Flume.
- Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice versa.
- Worked on Docker-based containerized applications.
- Knowledge of data warehousing and ETL tools like Informatica, Talend and Pentaho.
- Developed ETL processes using Pentaho PDI to extract data from HIS and Vista and populate a BI data mart.
- Experienced in working with monitoring tools to check cluster status using Cloudera Manager, Ambari and Ganglia.
- Experience with testing MapReduce programs using MRUnit and JUnit.
- Extensive experience in middle-tier development using J2EE technologies like JDBC, JNDI, JSP, Servlets, JSF, Struts, Spring, Hibernate, EJB.
- Expertise in developing responsive front-end components with JSP, HTML, XHTML, JavaScript, DOM, Servlets, JSF, NodeJS, Ajax, jQuery and AngularJS.
- Extensive experience in working with SOA-based architectures using REST web services with JAX-RS and SOAP web services with JAX-WS.
- Experience working with version control tools like SVN and Git (GitHub), JIRA/Mingle to track issues and Crucible for code reviews.
- Worked on various tools and IDEs like Eclipse, IBM Rational, Visio, Apache Ant, MS Office, PL/SQL Developer and SQL*Plus.
- Experience in different application servers like JBoss/Tomcat, WebLogic, IBM WebSphere.
- Experience in working with the onsite-offshore model.
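The snippet below is a minimal, illustrative sketch of Spark Streaming consuming from Kafka, as referenced in the summary above. It assumes the spark-streaming-kafka-0-10 integration; the broker address, topic name and consumer group are hypothetical placeholders, not values from any actual project.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaStreamingSketch"), Seconds(10))

    // Hypothetical broker, topic and consumer-group names.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "resume-sketch",
      "auto.offset.reset"  -> "latest"
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Stand-in for real processing: count the records in each 10-second micro-batch.
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```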
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark, Solr, Storm, Drill, Ambari, Mahout, MongoDB, Cassandra, Avro, Parquet and Snappy
Hadoop Distributions: Cloudera, MapR, Hortonworks, IBM BigInsights
Languages: Java, Scala, Python, JRuby, SQL, HTML, DHTML, JavaScript, XML and C/C++
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery, CSS, AngularJS, ExtJS and JSON
Development / Build Tools: Eclipse, Ant, Maven, Gradle, IntelliJ, JUnit and Log4j
Frameworks: Struts, Spring and Hibernate
App/Web Servers: WebSphere, WebLogic, JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i/10g/11i, MS SQL Server, MySQL and DB2
Operating systems: UNIX, LINUX, Mac OS and Windows Variants
Data analytical tools: R, SAS and MATLAB
ETL/BI Tools: Tableau, Talend, Informatica, Pentaho
PROFESSIONAL EXPERIENCE:
Confidential, NY
Spark/Hadoop Developer
Responsibilities:
- Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
- Extensively used the Spark stack to develop preprocessing jobs, using the RDD, Dataset and DataFrame APIs to transform data for upstream consumption (a minimal sketch follows this list).
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka, Flume and JMS.
- Worked on extracting and enriching HBase data across multiple tables using joins in Spark.
- Worked on writing APIs to load the processed data to HBase tables.
- Replaced the existing MapReduce programs with Spark applications using Scala.
- Built on-premise data pipelines using Kafka and Spark Streaming, fed from an API streaming gateway REST service.
- Developed Hive UDFs to handle data quality and create filtered datasets for further processing.
- Experienced in writing Sqoop scripts to import data into Hive/HDFS from RDBMS.
- Used the data integration tool Pentaho to design ETL jobs in the process of building data warehouses and data marts.
- Good knowledge of the Kafka Streams API for data transformation.
- Implemented a logging framework, the ELK stack (Elasticsearch, Logstash and Kibana), on AWS.
- Operated on Elasticsearch time-series data such as metrics and application events, an area where the broad Beats ecosystem makes it easy to collect data from common applications.
- Implemented AWS solutions using EC2, S3, RDS, ECS, EBS, Elastic Load Balancer and Auto Scaling groups, and optimized volumes and EC2 instances.
- Set up Spark on EMR to process huge data volumes stored in Amazon S3.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Used Talend tool to create workflows for processing data from multiple source systems.
- Created sample flows in Talend and StreamSets with custom-coded jars and analyzed the performance of StreamSets and Kafka Streams.
- Developed Hive Queries to analyze the data in HDFS to identify issues and behavioral patterns.
- Involved in writing optimized Pig scripts along with developing and testing Pig Latin scripts.
- Used the Python Pandas and NumPy modules for data analysis, data scraping and parsing.
- Deployed applications using Jenkins, integrating Git version control with it.
- Participated in production support on a regular basis to support the analytics platform.
- Used Rally for task/bug tracking.
- Used Git for version control.
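The snippet below is a minimal sketch of the kind of DataFrame-based preprocessing job referenced in this list. The input path, column names and output layout are hypothetical placeholders standing in for the actual upstream feed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{to_date, to_timestamp}

object PreprocessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PreprocessSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical input path and column names standing in for the real upstream feed.
    val raw = spark.read.option("header", "true").csv("/data/incoming/events")

    val cleaned = raw
      .filter($"event_id".isNotNull)                     // drop incomplete records
      .withColumn("event_ts", to_timestamp($"event_ts")) // normalize the timestamp column
      .withColumn("event_date", to_date($"event_ts"))    // derive a partition column
      .dropDuplicates("event_id")

    // Write partitioned Parquet for downstream consumers (e.g. a Hive external table).
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/events")
    spark.stop()
  }
}
```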
Environment: MapR, Hadoop, HBase, HDFS, AWS, Pig, Hive, Drill, Spark SQL, MapReduce, Spark Streaming, Kafka, Flume, Sqoop, Oozie, Jupyter Notebook, Docker, Spark, Scala, Talend, Shell Scripting, Java.
Confidential, TX
Spark/Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Expertise in implementing Spark Scala applications using higher-order functions for both batch and interactive analysis requirements.
- Involved in creating, modifying and dropping Teradata objects like tables, views, join indexes, triggers, macros, procedures and databases.
- Involved in writing test cases and documentation. Implemented change data capture (CDC) using Informatica PowerExchange to load data from the Clarity DB to the Teradata warehouse.
- Experienced in developing Spark scripts for data analysis in both Python and Scala.
- Built on-premise data pipelines using Kafka and Spark for real-time data analysis.
- Created reports in Tableau for visualization of the data sets created, and tested native Drill, Impala and Spark connectors.
- Analyzed the SQL scripts and designed solutions to implement them using Scala.
- Implemented complex Hive UDFs to execute business logic with Hive queries.
- Responsible for bulk-loading data into HBase using MapReduce by directly creating HFiles and loading them.
- Evaluated the performance of Spark SQL vs. Impala vs. Drill on offline data as part of a POC.
- Worked on Solr configuration and customizations based on requirements.
- Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Responsible for developing a data pipeline by implementing Kafka producers and consumers (a minimal producer sketch follows this list).
- Performed data analysis on HBase using Apache Phoenix.
- Exported the analyzed data to Impala to generate reports for the BI team.
- Developed multiple Spark jobs in PySpark for data cleaning and preprocessing.
- Managed and reviewed Hadoop log files to resolve any configuration issues.
- Developed a program to extract named entities from OCR files.
- Used Gradle for building and testing the project.
- Fixed defects as needed during the QA phase, supported QA testing, troubleshot defects and identified the source of defects.
- Used Mingle and later moved to JIRA for task/bug tracking.
- Used Git for version control.
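The snippet below is a minimal sketch of a Kafka producer of the sort used in the data pipeline described in this list. The broker address, topic name and source file are hypothetical placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducerSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical broker address; standard string serializers for key and value.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Each line of a hypothetical local log file becomes one message on the topic.
      scala.io.Source.fromFile("/var/log/app/sample.log").getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("log-events", line))
      }
    } finally {
      producer.close()
    }
  }
}
```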
Environment: MapR, Cloudera, Hadoop, HDFS, AWS, Pig, Hive, Impala, Drill, Spark SQL, OCR, MapReduce, Flume, Sqoop, Oozie, Storm, Zeppelin, Mesos, Docker, Solr, Kafka, MapR-DB, Spark, Scala, HBase, ZooKeeper, Tableau, Shell Scripting, Gerrit, Java, Redis.
Confidential, NY
Hadoop Developer
Responsibilities:
- Analyzed the requirements to set up a cluster.
- Worked on analyzing the Hadoop cluster and different big data analytic tools including MapReduce, Hive and Spark.
- Involved in loading data from the Linux file system, servers and Java web services using Kafka producers and partitions.
- Implemented Kafka Custom encoders for custom input format to load data into Kafka Partitions.
- Implemented Storm topologies to pre-process data before moving into HDFS system.
- Implemented Kafka high-level consumers to get data from Kafka partitions and move it into HDFS.
- Implemented a POC to migrate MapReduce programs into Spark transformations using Spark and Scala.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Migrated complex MapReduce programs into Spark RDD transformations and actions (a minimal sketch follows this list).
- Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Involved in creating Hive tables, loading them with data and writing Hive queries which run internally as MapReduce jobs.
- Developed MapReduce programs to parse the raw data and store the pre-aggregated data in partitioned tables.
- Loaded and transformed large sets of structured, semi-structured and unstructured data with MapReduce, Hive and Pig.
- Developed MapReduce programs in Java for parsing the raw data and populating staging Tables.
- Implemented Python scripts for writing MapReduce programs using Hadoop Streaming.
- Involved in using HCatalog to access Hive table metadata from MapReduce or Pig code.
- Experience in implementing custom serializers, interceptors, sources and sinks as per requirements in Flume to ingest data from multiple sources.
- Experience in setting up fan-out workflows in Flume to design a V-shaped architecture that takes data from many sources and ingests it into a single sink.
- Worked on implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Implemented monitoring on all the NiFi flows to send notifications if no data flows through a flow for more than a specified time.
- Converted unstructured data to structured data by writing Spark code.
- Indexed documents using Apache Solr.
- Set up SolrCloud for distributed indexing and search.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to send notifications if there are any failures.
- Worked closely with the Spark team on parallel computing to explore RDDs in DataStax Cassandra.
- Worked on NoSQL databases like Cassandra and MongoDB for POC purposes, storing images and URIs.
- Integrated bulk data into the Cassandra file system using MapReduce programs.
- Worked on MongoDB for distributed storage and processing.
- Designed and implemented Cassandra and associated RESTful web service.
- Implemented row-level updates and real-time analytics using CQL on Cassandra data.
- Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Worked on analyzing and examining customer behavioral data using Cassandra.
- Created partitioned tables in Hive and mentored the analyst and SQA teams in writing Hive queries.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Involved in cluster setup, monitoring and test benchmarks for results.
- Involved in building and deploying applications using Maven, integrated with the Jenkins CI/CD server.
- Involved in agile methodologies, daily scrum meetings and sprint planning.
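The snippet below is a minimal sketch of how a MapReduce-style aggregation maps onto Spark RDD transformations, as referenced in this list. The log path and field layout are hypothetical assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HitsPerUrlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HitsPerUrlSketch"))

    // Hypothetical web-server log location in HDFS.
    val logs = sc.textFile("hdfs:///data/raw/access_logs")

    // map + reduceByKey stands in for the original Mapper/Reducer pair: count hits per URL.
    val hitsPerUrl = logs
      .map(_.split(" "))
      .filter(_.length > 6)            // skip malformed lines
      .map(fields => (fields(6), 1L))  // assume the request path is the seventh field
      .reduceByKey(_ + _)

    hitsPerUrl.saveAsTextFile("hdfs:///data/curated/hits_per_url")
    sc.stop()
  }
}
```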
Environment: Hadoop, Cloudera, HDFS, Pig, Hive, Flume, Sqoop, NiFi, AWS Redshift, Python, Spark, Scala, MongoDB, Cassandra, Snowflake, Solr, ZooKeeper, MySQL, Talend, Shell Scripting, Linux Red Hat, Java.
Confidential, NJ
Hadoop Developer
Responsibilities:
- Converted the existing relational database model to the Hadoop ecosystem.
- Generated datasets and loaded them into the Hadoop ecosystem.
- Worked with Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Involved in review of functional and non-functional requirements.
- Implemented frameworks using Java and Python to automate the ingestion flow.
- Responsible for managing data coming from different sources.
- Loaded CDRs from a relational DB using Sqoop and from other sources to the Hadoop cluster using Flume.
- Executed HDFS, Pig, Hive and MapReduce inline within a SAS program.
- Experience in processing large volumes of data and skills in parallel execution of processes using Talend functionality.
- Involved in loading data from UNIX file system and FTP to HDFS.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Created Hive tables and worked on them using HiveQL.
- Developed a data pipeline using Kafka and Storm to store data in HDFS.
- Created reporting views in Impala using Sentry policy files.
- Developed Hive queries to analyze the output data.
- Handled cluster coordination services through ZooKeeper.
- Collected log data from web servers and stored it in HDFS using Flume.
- Used Hive to do transformations, event joins and some pre-aggregations before storing the data in HDFS.
- Implemented several Akka actors responsible for loading data into Hive (a minimal sketch follows this list).
- Design and implement Spark jobs to support distributed data processing.
- Supported the existing MapReduce programs running on the cluster.
- Developed and implemented two service endpoints (end to end) in Java using the Play framework, Akka and Hazelcast.
- Wrote the shell scripts to monitor the health check of Hadoop daemon services and respond accordingly to any warning or failure conditions.
- Wrote Java code to format XML documents; upload them to Solr server for indexing.
- Involved in Hadoop cluster tasks like adding and removing nodes without any effect on running jobs and data.
- Developed PowerCenter mappings to extract data from various databases and flat files and load it into the data mart using Informatica.
- Followed agile methodology for the entire project.
- Installed and configured Apache Hadoop, Hive and Pig environment.
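The snippet below is a minimal sketch of an Akka actor that loads an HDFS file into a Hive table, as referenced in this list. The HiveServer2 URL, credentials and table name are hypothetical placeholders; real values would come from configuration.

```scala
import java.sql.DriverManager
import akka.actor.{Actor, ActorSystem, Props}

// Message carrying an HDFS path whose contents should be loaded into a Hive table.
case class LoadFile(hdfsPath: String)

class HiveLoaderActor extends Actor {
  // Hypothetical HiveServer2 URL and user; real values came from configuration.
  private val url = "jdbc:hive2://hive-server:10000/default"

  def receive: Receive = {
    case LoadFile(path) =>
      Class.forName("org.apache.hive.jdbc.HiveDriver") // ensure the Hive JDBC driver is registered
      val conn = DriverManager.getConnection(url, "etl_user", "")
      try {
        conn.createStatement().execute(s"LOAD DATA INPATH '$path' INTO TABLE staging_events")
      } finally {
        conn.close()
      }
  }
}

object HiveLoaderApp extends App {
  val system = ActorSystem("hive-loader")
  val loader = system.actorOf(Props[HiveLoaderActor](), "loader")
  loader ! LoadFile("/data/incoming/part-00000")
}
```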
Environment: Hadoop, Hortonworks, HDFS, Pig, Hive, Flume, Sqoop, Ambari, Ranger, Python, Akka, Play framework, Informatica, Elasticsearch, Linux Ubuntu, Solr.
Confidential
Java Developer
Responsibilities:
- Involved in analysis, design, development, integration and testing of application modules and followed the Agile/Scrum methodology. Participated in estimating the size of backlog items, daily scrums and translating backlog items into engineering designs and logical units of work (tasks).
- Used the Spring framework to implement IoC, JDBC/ORM, AOP and Spring Security in the business layer.
- Developed and Consumed Web services securely using JAX-WS API and tested using SOAP UI.
- Extensively used Action, DispatchAction, ActionForms, Struts tag libraries and Struts configuration from Struts.
- Extensively used Hibernate Query Language for data retrieval from the database and processed the data in the business methods.
- Developed pages using JSP, JSTL, Spring tags, jQuery and JavaScript, and used jQuery to make AJAX calls.
- Used the Jenkins continuous integration tool for deployments.
- Worked on JDBC for database connections.
- Worked on multithreaded middleware using socket programming to introduce a new set of business rules, applying OOP design principles.
- Involved in implementing Java multithreading concepts.
- Developed several REST web services supporting both XML and JSON to perform tasks such as demand response management.
- Used Servlets, Java and Spring for server-side business logic.
- Implemented logging functionality using Log4j and internal logging APIs.
- Used JUnit for server-side testing.
- Used Maven build tools and SVN for version control.
- Developed the front end of the application using the Bootstrap, AngularJS and Node.js frameworks.
- Implemented SOA architecture using Enterprise Service Bus (ESB).
- Designed a front-end, data-driven GUI using JSF, HTML4, JavaScript and CSS.
- Used IBM MQ Series as the JMS provider.
- Responsible for writing SQL Queries and Procedures using DB2.
- Implemented connections to Oracle and MySQL databases using Hibernate ORM; configured Hibernate entities using annotations from scratch.
Environment: Core Java 1.5, EJB, Hibernate 3.6, AWS, JSF, Struts, Spring 2.5, JPA, REST, JBoss, Selenium, socket programming, DB2, Oracle 10g, XML, JUnit 4.0, XSLT, IDE, AngularJS, Node.js, HTML4, CSS, JavaScript, Apache Tomcat 5.x, Log4j.
Confidential
Java Developer
Responsibilities:
- Designed Java Servlets and Objects using J2EE standards.
- Involved in developing multithreading to improve CPU time.
- Used multithreading to process tables simultaneously as and when user data is completed in one table.
- Used JDBC calls in the Enterprise Java Beans to access Oracle Database.
- Involved in developing the presentation layer using Spring MVC, AngularJS and jQuery.
- Involved in the design and development of rich internet applications using Flex, ActionScript and Java.
- Designed and developed web pages using HTML 4.0 and CSS, including Ajax controls and XML.
- Worked closely with Photoshop designers to implement mock-ups and the layouts of the application.
- Involved in writing the properties and methods in the class modules and consumed web services.
- Played a vital role in defining, implementing and enforcing quality practices in the team and organization to ensure internal controls, quality and compliance with policies and standards.
- Used JavaScript 1.5 for custom client-side validation.
- Involved in designing and developing the GUI for the user interface with various controls.
- Worked with ViewState to maintain data between the pages of the application.
Environment: Core Java, JavaBeans, HTML 4.0, CSS 2.0, PL/SQL, MySQL 5.1, AngularJS, JavaScript 1.5, Flex, AJAX and Windows
