Sr. Big Data/Spark Developer Resume
Bentonville, Arkansas
SUMMARY:
- A dynamic professional with around 8 years of diversified experience in Information Technology, with an emphasis on the Big Data/Hadoop ecosystem, SQL/NoSQL databases, and Java/J2EE technologies and tools, using industry-accepted methodologies and procedures.
- Extensive experience working with various Hadoop distributions, including enterprise versions of Cloudera and Hortonworks, with good knowledge of the MapR distribution and Amazon EMR.
- In-depth experience using Hadoop ecosystem tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Spark, Storm, Kafka, Oozie, Elasticsearch, HBase, and ZooKeeper.
- Experienced in using Apache Spark to improve the performance and optimization of existing Hadoop algorithms, working with Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Knowledge of installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4) distributions, as well as on Amazon Web Services (AWS).
- Exposure to data lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
- Experience using Spark components such as Spark Streaming to process both real-time and historical data.
- Worked with Apache Spark, a fast, general-purpose engine for large-scale data processing, integrated with the functional programming language Scala.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that consumes data from Kafka in near real time and persists it into Cassandra.
- Developed streaming pipelines using Kafka, MemSQL, and Storm.
- Experience capturing data and importing it into HDFS, using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
- Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Worked with RDDs and DataFrames to process data in Spark.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
- Experience integrating Hive queries into the Spark environment using Spark SQL (a brief sketch follows this summary).
- Expertise in performing real-time analytics on big data using HBase and Cassandra.
- Strong familiarity with creating Hive tables, Hive joins, and HiveQL for querying databases, extending to complex Hive UDFs.
- Handled importing data from RDBMS into HDFS using Sqoop, and vice versa.
- Experience developing data pipelines using Pig, Sqoop, and Flume to extract data from weblogs and store it in HDFS.
- Created user-defined functions (UDFs) and user-defined aggregate functions (UDAFs) in Pig and Hive.
- Hands-on experience with tools like Oozie and Automic to orchestrate jobs.
- Hands-on expertise in row-key and schema design for NoSQL databases such as MongoDB 3.0.1, HBase, Cassandra, and DynamoDB (AWS).
- Experience in performance tuning a Cassandra cluster to optimize it for writes and reads.
- Accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Worked with different compression codecs (LZO, Snappy, Gzip) and file formats (ORC, Avro, TextFile, Parquet).
- Experience in practical implementation of AWS cloud technologies including IAM, Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda, and EBS.
- Experience in enterprise search using Solr to implement full-text search with advanced text analysis, faceted search, and filtering, using advanced features such as DisMax, Extended DisMax, and grouping.
- Experienced in writing ad hoc queries using Cloudera Impala, including Impala analytic functions; good understanding of MPP databases such as HP Vertica.
- Installed and configured clusters with CDH and HDP on AWS and on local resources.
- Worked on data warehousing and ETL tools like Informatica, Talend, and Pentaho.
- Expertise working with Java/J2EE, JDBC, ODBC, JSP, Eclipse, JavaBeans, EJB, and Servlets.
- Developed web page interfaces using JSP, Java Swing, and HTML.
- Experience working with the Spring and Hibernate frameworks for Java.
- Worked across various programming languages using IDEs such as Eclipse, NetBeans, and IntelliJ.
- Proficient with version control tools such as PVCS, SVN, VSS, and Git.
- Well versed in Atlassian tools such as Bamboo, Bitbucket, and JIRA, as well as GitHub.
- Performed web-based UI development using JavaScript, jQuery, jQuery UI, CSS, HTML, HTML5, and XHTML.
- Development experience with DBMSs such as Oracle, MS SQL Server, and MySQL.
- Developed stored procedures and queries using PL/SQL.
- Experience with best practices for web services development and integration (both REST and SOAP).
- Experienced in using build tools such as Ant, Gradle, SBT, and Maven to build and deploy applications to servers.
- Experience in complete Software Development Life Cycle (SDLC) in both Waterfall and Agile methodologies.
- Knowledge of creating dashboards and data visualizations using Tableau to provide business insights.
- Excellent communication, interpersonal, and problem-solving skills; a strong team player with a can-do attitude and the ability to communicate effectively with all levels of the organization, including technical staff, management, and customers.
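The Spark SQL integration noted above can be illustrated with a minimal sketch. The code below is a hypothetical example rather than project code: the database, table, and column names (sales_db.orders, state, amount) are placeholders, and it only shows the common pattern of running a Hive query through Spark SQL and rewriting the aggregation as RDD transformations.

```scala
import org.apache.spark.sql.SparkSession

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession with Hive support so spark.sql can read Hive metastore tables
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // A Hive/SQL query expressed through Spark SQL; sales_db.orders is a hypothetical table
    val ordersDf = spark.sql(
      """SELECT order_id, state, amount
        |FROM sales_db.orders
        |WHERE order_date >= '2017-01-01'""".stripMargin)

    // The same data as an RDD, with the aggregation rewritten as RDD transformations
    val totalsByState = ordersDf.rdd
      .map(row => (row.getAs[String]("state"), row.getAs[Double]("amount")))
      .reduceByKey(_ + _)

    totalsByState.take(10).foreach(println)
    spark.stop()
  }
}
```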
TECHNICAL SKILLS:
Big Data Technologies: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Solr, Ambari, Oozie
NoSQL Databases: HBase, Cassandra, MongoDB, Redshift, Redis
Languages: C, C++, Java, Scala, Python, HTML, SQL, PL/SQL, Pig Latin, HiveQL, UNIX, JavaScript, Shell Scripting
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB
Application Servers: WebSphere, WebLogic, JBoss, Tomcat
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch), Google Cloud
Databases: Oracle 10g/11g, Microsoft SQL Server, MySQL, DB2
Build Tools: Jenkins, Maven, ANT
Business Intelligence Tools: Tableau, Splunk
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans
Development Methodologies: Agile, Waterfall
PROFESSIONAL EXPERIENCE:
Confidential, Bentonville, Arkansas
Sr. Big Data/Spark Developer
Responsibilities:
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Experienced in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions.
- Worked on the Cloudera distribution of the Hadoop ecosystem; installed and configured Flume, Hive, Pig, Sqoop, Oozie, and Automic on the Hadoop cluster.
- Explored Spark for improving the performance and optimization of existing Hadoop algorithms, using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Managed and reviewed Hadoop log files to identify issues when jobs failed, and used Hue for UI-based Pig script execution and Automic scheduling.
- Involved in creating a data lake by extracting customer data from various sources into HDFS, including Excel files, databases, and server log data.
- Automated workflows using shell scripts to pull data from various databases into Hadoop, and developed scripts to automate the process and generate reports.
- Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially implemented in Python (PySpark).
- Developed Spark SQL automation components and was responsible for modifying a Java component to connect directly to the Thrift server.
- Used various Spark transformations and actions to cleanse the input data, and used the Spark application master to monitor Spark jobs and capture their logs.
- Refactored the existing Spark batch processes, written in Scala, for different logs.
- Implemented Spark applications in Scala, utilizing DataFrames and the Spark SQL API for faster data processing, and worked on an extensible framework for building high-performance batch and interactive data processing applications on Hive.
- Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded the data into Cassandra.
- Fine-tuned and productionized Teradata SQL queries that had been running for a long time in a queue.
- Created Hive external tables for incremental imports into Hive using an ingest, reconcile, compact, and purge strategy (sketched after this list).
- Worked with Hive for processing the raw data.
- Created partitions and buckets by state in Hive to handle structured data.
- Implemented dashboards that internally run HiveQL queries, including aggregation functions, basic Hive operations, and different kinds of join operations.
- Implemented state-based business logic in Hive using generic UDFs.
- Used Hive queries to analyze large data sets.
- Built reusable Hive UDF libraries for business requirements.
- Designed and implemented incremental imports into Hive tables.
- Wrote workflows and handled scheduling using Automic.
- Provided Automic batch job flow support to application development and management during production releases.
- Developed Automic workflows for scheduling and orchestrating the ETL process within the Cloudera Hadoop system.
- Involved in daily SCRUM meetings to discuss the development/progress of Sprints and was active in making scrum meetings more productive.
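The ingest/reconcile/compact/purge strategy mentioned above is sketched below in spark-shell style Scala, issuing HiveQL through spark.sql. It is a minimal illustration under assumed names (base_table, incremental_table, reporting_table, id, modified_date), not the project's actual schema or jobs; the same four-step merge can equally be run as plain HiveQL.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of a four-step incremental merge into a Hive table.
// Assumes base_table holds history and incremental_table holds the latest import.
val spark = SparkSession.builder()
  .appName("hive-incremental-merge-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Reconcile: keep only the newest version of each id across base + incremental data
spark.sql("""
  CREATE OR REPLACE VIEW reconcile_view AS
  SELECT t1.* FROM
    (SELECT * FROM base_table
     UNION ALL
     SELECT * FROM incremental_table) t1
  JOIN
    (SELECT id, MAX(modified_date) AS max_modified
     FROM (SELECT * FROM base_table
           UNION ALL
           SELECT * FROM incremental_table) t2
     GROUP BY id) s
  ON t1.id = s.id AND t1.modified_date = s.max_modified
""")

// Compact: materialize the reconciled view into a reporting table
spark.sql("DROP TABLE IF EXISTS reporting_table")
spark.sql("CREATE TABLE reporting_table AS SELECT * FROM reconcile_view")

// Purge: replace the base table with the compacted data and clear the increment
spark.sql("INSERT OVERWRITE TABLE base_table SELECT * FROM reporting_table")
spark.sql("TRUNCATE TABLE incremental_table")
```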
Environment: Hadoop stack, Spark SQL, KSQL, Spark-Streaming, Scala, CI/CD, Cassandra, Cloudera, Kafka, Hive, Pig, Sqoop, Automic, Linux.
Confidential, Columbus, Ohio
Big Data/Spark Developer
Responsibilities:
- Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.
- Configured Spark Streaming to receive ongoing data from Kafka and store the DStream data in HDFS.
- Responsible for fetching real-time data from Kafka and processing it using Spark Streaming with Scala.
- Involved in loading data from REST endpoints into Kafka producers and transferring the data to Kafka brokers.
- Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations (see the sketch after this list).
- Worked on Building and implementing real-time streaming ETL pipeline using Kafka Streams API.
- Migrated Map Reduce programs into Spark transformations using Scala.
- Experienced with Spark Context, Spark-SQL, Spark YARN.
- Implemented Spark-SQL with various file formats like JSON, Parquet and ORC.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Worked on loading Avro/Parquet/text files in the Spark framework using Scala, creating Spark DataFrames and RDDs to process the data, and saved the files in Parquet format in HDFS to load into a fact table using an ORC reader.
- Worked on Spark Streaming APIs in Scala to perform transformations and actions that stream and store data in HDFS.
- Good knowledge of setting up batch intervals, sliding intervals, and window intervals in Spark Streaming.
- Implemented data quality checks using Spark Streaming and flagged records as passable or bad.
- Developed traits, case classes, etc., in Scala.
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Deployed the project on Amazon EMR with S3 connectivity for backup storage; also worked on RESTful web services.
- Tested performance using Elasticsearch and Kibana with APM.
- Implemented CI/CD, allowing deployment to multiple client Kubernetes/AWS environments.
- Used Bitbucket to check in and check out code changes.
- Worked on Hive to implement web interfacing and stored the data in Hive external tables.
- Implemented Hive partitioning and bucketing on the collected data in HDFS.
- Involved in data querying and summarization using Hive, and created UDFs, UDAFs, and UDTFs.
- Created and managed tables in Hive and Impala using the Hue web interface.
- Worked extensively on data extraction, transformation, and loading from source to target systems using Informatica and Teradata utilities such as FastExport, FastLoad, MultiLoad, and TPT.
- Worked with Teradata utilities such as BTEQ, FastLoad, and MultiLoad.
- Implemented Sqoop jobs to import/export large data exchanges between RDBMS and Hive platforms.
- Extensively used ZooKeeper as a backup server and for scheduling Spark jobs.
- Worked on Cloudera distribution and deployed on AWS EC2 Instances.
- Experienced in loading real-time data into NoSQL databases such as Cassandra.
- Experienced in using the DataStax Spark-Cassandra Connector to store data in Cassandra from Spark.
- Worked with Cassandra by writing scripts and invoking them using cqlsh.
- Well versed in data manipulation, compaction, and tombstones in Cassandra.
- Experience retrieving data from a Cassandra cluster by running queries in CQL (Cassandra Query Language).
- Worked on connecting the Cassandra database to Amazon EMR to store the data in S3.
- Well versed in using Elastic Load Balancing for auto scaling EC2 servers.
- Configured workflows involving Hadoop actions using the Oozie scheduler.
- Used Oozie workflows and Java schedulers to manage and schedule jobs on the Hadoop cluster.
- Used Sqoop to import data from Relational Databases like MySQL, Oracle.
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager.
- Used Cloudera Manager to pull metrics on various cluster features, such as the JVM and running map and reduce tasks.
- Involved in importing structured and unstructured data into HDFS.
- Developed Pig scripts to help perform analytics on JSON and XML data.
- Experienced with faceted search and full-text search querying using Solr.
- Maintained the data lake in Hadoop by building data pipelines using Sqoop, Hive, and PySpark.
- Created Tableau visualizations for internal management (the client team) using the Simba Spark SQL connector.
- Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver for auto-generation of Hive queries for non-technical business users.
- Used Jira for issue tracking.
- Coordinated with SCRUM team in delivering agreed user stories on time for every sprint.
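A minimal sketch of the Kafka-direct-stream-to-Cassandra path described above, using the spark-streaming-kafka-0-10 API and the DataStax Spark-Cassandra Connector. Broker addresses, the topic and consumer group, the keyspace/table, and the toy "userId,score" record format are hypothetical placeholders rather than project code.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector._          // SomeColumns and RDD save support
import com.datastax.spark.connector.streaming._ // adds saveToCassandra on DStreams

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kafka-to-cassandra-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder Cassandra host
    val ssc = new StreamingContext(conf, Seconds(10))      // 10-second batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-broker:9092",          // placeholder brokers
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "learner-model-sketch",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Kafka direct stream (spark-streaming-kafka-0-10)
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("learner-events"), kafkaParams))

    // Toy transformation: parse "userId,score" records and persist them to Cassandra
    stream.map(_.value)
      .flatMap { line =>
        line.split(",") match {
          case Array(userId, score) => Seq((userId, score.toDouble))
          case _                    => Seq.empty[(String, Double)] // drop malformed records
        }
      }
      .saveToCassandra("learner_ks", "learner_scores", SomeColumns("user_id", "score"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```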
Environment: Hadoop stack, Spark SQL, KSQL, Spark-Streaming, AWS S3, AWS EMR, Google Cloud, GraphX, Scala, Python, PySpark, Kafka, Hive, Pig, Sqoop, Solr, Oozie, Vertica, Impala, CI/CD, Cassandra, Cloudera, Oracle 10g, MySQL, Spring Boot, Linux.
Confidential, Troy, Michigan
Hadoop Developer
Responsibilities:
- Involved in reviewing functional and non-functional requirements (NFRs).
- Worked on analyzing the Hadoop cluster using different big data analytics tools, including Pig, Hive, Oozie, ZooKeeper, Sqoop, Spark, Kafka, and Tez, on the Hortonworks (HDP) distribution.
- Contributed to building hands-on tutorials for the community on how to use Hortonworks Data Platform and Hortonworks DataFlow, covering categories such as Hello World, real-world use cases, and operations.
- Collected and aggregated large volumes of weblog and unstructured data from various sources, such as web servers and network devices, using Apache Flume, and stored the data in HDFS for analysis.
- Implemented transformations and data quality checks using Flume Interceptor.
- Implemented the business logic in Flume Interceptor in Java.
- Developed Restful APIs using Spring Boot for faster development and created Swagger documents as specs for defining the REST APIs.
- Involved in configuring Sqoop and Flume to extract/export data from IBM QRadar and MySQL.
- Responsible for collecting and aggregating large amounts of data from various sources and ingesting it into the Hadoop file system (HDFS) using Sqoop and Flume; the data was then transformed for business use cases using Pig and Hive.
- Developed and maintained data integration programs in RDBMS and Hadoop environments for data access and analysis.
- Worked on importing data from various sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
- Developed and implemented MapReduce jobs to support distributed processing using Java, Hive, and Apache Pig.
- Executed Hive queries on Parquet tables stored in Hive metastore to perform data analysis to meet the business requirements.
- Implemented partitioning and bucketing in Hive.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Created Pig Latin scripts to sort, group, join, and filter the enterprise-wide data.
- Used Spark SQL to load data into Hive tables and wrote queries to fetch data from those tables (see the sketch after this list).
- Developed Pig scripts and UDFs per the business logic.
- Used Pig to import semi-structured data like Avro files and perform serialization.
- Performed secondary indexing on tables using Elasticsearch.
- Evaluated Hortonworks NiFi (HDF 2.0) and recommended a solution to ingest data from multiple data sources into HDFS and Hive using NiFi, including importing data from Linux servers with the NiFi tool.
- Involved in developing multiple MapReduce jobs in Java for data cleaning and processing.
- Experienced in implementing MapReduce programs to handle semi-structured and unstructured data such as JSON, XML, and Avro data files, as well as sequence files for log data.
- Experience working with MongoDB for distributed storage and processing.
- Responsible for using a Flume sink to drain data from the Flume channel and deposit it in MongoDB.
- Implemented collections and the aggregation framework in MongoDB.
- Configured Oozie workflow engine to automate Map/Reduce jobs.
- Experienced with NiFi, which runs in a cluster and provides real-time control, making it easy to manage the movement of data between any source and destination.
- Worked with BI teams on generating reports and designing ETL workflows in Tableau.
- Collaborated with Database, Network, application and BI teams to ensure data quality and availability.
- Hands-on experience using Python scripts for data manipulation.
- Experienced in using agile approaches, including Test-Driven Development, Extreme Programming, and Scrum.
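A minimal sketch of loading a partitioned Hive table through Spark SQL, as referenced in the list above. The database, table, and column names (analytics.web_events, analytics.raw_web_events, event_date) are hypothetical, and bucketing is omitted to keep the example to the dynamic-partition load path.

```scala
import org.apache.spark.sql.SparkSession

// Spark-shell style sketch: create a partitioned Hive table and load it
// with dynamic partitioning via Spark SQL.
val spark = SparkSession.builder()
  .appName("hive-partition-load-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical target table, partitioned by event_date and stored as ORC
spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.web_events (
    user_id    STRING,
    url        STRING,
    latency_ms INT
  )
  PARTITIONED BY (event_date STRING)
  STORED AS ORC
""")

// Allow dynamic partitioning so each distinct event_date lands in its own partition
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// Load from a hypothetical raw staging table; the partition column is selected last
spark.sql("""
  INSERT INTO TABLE analytics.web_events PARTITION (event_date)
  SELECT user_id, url, latency_ms, event_date
  FROM analytics.raw_web_events
""")
```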
Environment: Hortonworks HDP, Hadoop, Spark, Flume, Elasticsearch, AWS, EC2, S3, Pig, Hive, MapReduce, HDFS, NiFi, Python, Java, MongoDB, Spring Boot, ZooKeeper, Avro.
Confidential, St. Louis, Missouri
Hadoop Developer
Responsibilities:
- Worked on analyzing the Hadoop cluster using different big data analytics tools, including Pig, Hive, HBase, and MapReduce.
- Extracted daily customer transaction data from DB2, exported it to Hive, and set up online analytical processing.
- Installed and configured Hadoop, MapReduce, and HDFS clusters.
- Used Flume to collect, aggregate, and store weblog data from disparate sources such as web servers and mobile and network devices, and imported it into HDFS.
- Exported analyzed data to relational databases using Sqoop for visualization and report generation.
- Created Hive tables, loaded the data, and performed data manipulation using Hive queries in MapReduce execution mode.
- Loaded the structured data resulting from MapReduce jobs into Hive tables.
- Analyzed user request patterns and implemented various performance optimizations, such as skewed joins and SerDe techniques in HiveQL.
- Identified issues on behavioral patterns and analyzed the logs using Hive queries.
- Involved in using HCatalog to access Hive table metadata from MapReduce and Pig code.
- Developed several REST web services that produce both XML and JSON to perform tasks, leveraged by both web and mobile applications.
- Implemented business logic by writing UDFs in Java and used various UDFs from other sources (see the sketch after this list).
- Developed unit test cases for Hadoop MapReduce jobs and driver classes with the MRUnit testing library.
- Analyzed and transformed stored data by writing MapReduce and Pig jobs based on business requirements.
- Integrated MapReduce with HBase to import bulk data using MR programs.
- Used Maven extensively to build JAR files of MapReduce programs and deployed them to the cluster.
- Worked on developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
- Developed data pipelines using Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
- Used Pig as an ETL tool to perform transformations, joins, and pre-aggregations before storing the data in HDFS.
- Used SQL queries, stored procedures, user-defined functions (UDFs), and database triggers, with tools such as SQL Profiler and Database Tuning Advisor (DTA).
- Installed a cluster, commissioned and decommissioned data nodes, performed NameNode recovery, and handled capacity planning and slot configuration in line with business requirements.
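A small sketch of a Hive UDF of the kind described in the list above. The project's UDFs were written in Java; the version below uses Scala against the classic org.apache.hadoop.hive.ql.exec.UDF API to stay consistent with the other sketches in this resume, and the masking rule, class name, and table are hypothetical.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

/**
 * Hypothetical Hive UDF that masks all but the last four characters of an
 * account number, e.g. mask_account('1234567890') -> '******7890'.
 */
class MaskAccount extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) return null
    val value = input.toString
    val keep = 4
    val masked =
      if (value.length <= keep) value
      else "*" * (value.length - keep) + value.takeRight(keep)
    new Text(masked)
  }
}

// Registering and using the UDF from Hive, after packaging the class into a jar:
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION mask_account AS 'MaskAccount';
//   SELECT mask_account(account_number) FROM transactions;
```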
Environment: HDFS, Hortonworks HDP, Map Reduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, Talend, HiveQL, Java, Maven, Avro, Eclipse and Shell Scripting.
Confidential, St. Louis, Missouri
Hadoop Developer
Responsibilities:
- Worked on tuning Hive and Pig to improve performance and solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Exported analyzed data to relational databases using Sqoop for visualization and report generation.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Wrote MapReduce jobs to discover trends in data usage by users.
- Built custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
- Wrote multiple MapReduce programs in Java for data analysis.
- Wrote MapReduce jobs using Pig Latin and the Java API.
- Extensively analyzed data using HiveQL, Pig Latin, and custom MapReduce programs.
- Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Flume.
- Involved in automating the FTP process in Talend and FTPing the files in UNIX.
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Pig scripts.
- Experience with various data transformation and analysis tools, such as MapReduce, Pig, and Hive, to handle files in multiple formats (JSON, text, XML, binary, logs, etc.).
Environment: Hadoop, Map Reduce, Pig, Hive, Flume, Java, HDFS, ETL, JSON, XML.
Confidential
Java Developer
Responsibilities:
- Developed rules based on different state policies using Spring MVC, iBatis ORM, Spring Web Flow, JSP, JSTL, Oracle, MSSQL, SOA, XML, XSD, JSON, AJAX, and Log4j.
- Gathered requirements and developed, implemented, tested, and deployed enterprise integration pattern (EIP)-based applications using Apache Camel and JBoss Fuse.
- Developed service classes, domain/DAOs, and controllers using JAVA/J2EE technologies.
- Worked on the ActiveMQ messaging service for integration.
- Worked with SQL queries to store and retrieve data in MS SQL Server.
- Performed unit testing using JUnit.
- Developed the front end using JSTL, JSP, HTML, and JavaScript.
- Worked on continuous integration using Jenkins/Hudson.
- Participated in all phases of development life cycle including analysis, design, development, testing, code reviews and documentations as needed.
- Used Eclipse as the IDE, Maven for build management, Jira for issue tracking, Confluence for documentation, Git for version control, ARC (Advanced REST Client) for endpoint testing, Crucible for code reviews, and SQL Developer as the DB client.
Environment: Spring Framework, Spring MVC, Spring Web Flow, JSP, JSTL, SoapUI, rating engine, IBM Rational Team, Oracle 11g, XML, JSON, Ajax, HTML, CSS, IBM WebSphere Application Server, RAD with sub-eclipse, Jenkins, Maven, SOA, SonarQube, Log4j, Java, JUnit.
Confidential
Intern/Java Developer
Responsibilities:
- Involved in gathering business requirements, analyzing the project and created UML diagrams such as Use Cases, Class Diagrams, Sequence Diagrams and flowcharts for the optimization Module using Microsoft Visio .
- Configured faces-config.xml for the page navigation rules and created managed and backing beans for the Optimization module.
- Developed an enterprise application using Spring MVC, JSP, and MySQL.
- Worked on developing client-side web services components using JAX-WS technologies.
- Extensively worked with JUnit to test the application code for server-client data transfer.
- Developed and enhanced products in design and in alignment with business objectives.
- Used SVN as a repository for managing/deploying application code.
- Used XML to maintain the Queries, JSP page mapping, Bean Mapping etc.
- Used Oracle 10g as the backend database and wrote PL/SQL scripts.
- Implemented database transactions using Spring AOP & Java EE CDI capability.
- Enriched organization reputation via fulfilling requests and exploring opportunities.
- Performed business analysis and reporting services, and integrated with Sage Accpac (ERP).
- Developed new and maintained existing functionality using Spring MVC and Hibernate.
- Created new and maintained existing web pages built with JSP and Servlets.
Environment: Java, Spring MVC, Hibernate, MSSQL, JSP, Servlets, JDBC, ODBC, NetBeans, GlassFish, Spring, Oracle, MySQL, Sybase, Eclipse, Tomcat, WebLogic Server.