Lead Spark/Scala Developer Resume
Plano, TX
SUMMARY:
- 11+ years of experience in Information Technology, including 5+ years of experience in Big Data and the Hadoop ecosystem.
- Excellent hands-on experience developing Hadoop architectures on Windows and Linux platforms.
- Good understanding of Hadoop architecture.
- Experience with Hadoop distributions such as Cloudera and Hortonworks.
- Hands-on experience with major components of the Hadoop ecosystem, including HDFS, MapReduce, Hive, Pig, Spark, HBase, and Sqoop, plus working knowledge of Flume and Talend.
- Set up standards and processes for Hadoop based application design and implementation.
- Experienced in developing MapReduce programs in Java using Apache Hadoop for working with big data.
- Good experience in optimizing MapReduce algorithms using mappers, reducers, combiners, and partitioners to deliver the best results for large datasets.
- Worked on Spark SQL and DataFrames for faster execution of Hive queries using the Spark SQLContext/HiveContext (a brief sketch appears at the end of this summary).
- Experience in developing programs in Spark using Python to compare the performance of Spark with Hive and SQL/Oracle.
- Extensive knowledge of Spark SQL development.
- Experience in analyzing data using HiveQL and Pig Latin, and in extending Hive and Pig core functionality with custom UDFs.
- Excellent knowledge of different RDBMS like Teradata, Oracle 11g and SQL Server.
- Experience in Teradata performance tuning, identifying and resolving performance bottlenecks, as well as SQL performance tuning, table structure, and index design for better query performance.
- Hands on experience in writing Map Reduce jobs on Hadoop Ecosystem using Pig Latin and creating Pig scripts to carry out essential data operations and tasks.
- Experience in Designing, developing and implementing connectivity products that allow efficient exchange of data between our core database engine and Hadoop ecosystem.
- Experience in importing and exporting the data using Sqoop from HDFS to Relational Database systems/mainframe and vice-versa.
- Experience ingesting structured and unstructured data, including streaming data, into HDFS from legacy systems using Flume.
- Good expertise working with different types of data, including semi-structured and unstructured data.
- Worked on NoSQL databases including HBase, MongoDB.
- Experience in processing different file formats like XML, JSON and sequence file formats.
- Good experience working with machine learning workflows and devising machine learning algorithms using Python.
- Good knowledge of Amazon AWS concepts such as the EMR and EC2 web services, which provide fast and efficient processing of big data.
- Good Experience in creating Business Intelligence solutions and designing ETL workflows using Tableau.
- Good experience in Agile engineering practices, Scrum, Test-Driven Development, and Waterfall methodologies.
- Good knowledge of Apache Kafka and of configuring its producers and consumers.
- Hands-on Experience in Object Oriented Analysis, Design (OOAD) and development of software using UML Methodology.
- Exposure to Java development projects; used the Spark-TS library for analyzing large-scale time series data sets.
- Hands on experience in database design using PL/SQL to write Stored Procedures, Functions, Triggers and strong experience in writing complex queries, using Oracle, DB2 and MySQL.
- Good working experience on different operating systems, including UNIX/Linux, Apple macOS, and Windows.
- Experience working both independently and collaboratively to solve problems and deliver high quality results in a fast-paced, unstructured environment.
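The Spark SQL/DataFrame bullet above refers to the kind of job sketched below. This is a minimal, illustrative example only, assuming a Spark 1.6 HiveContext and a hypothetical Hive table named orders with customer_id and amount columns; it is not taken from any specific project.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQueryViaSparkSql {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveQueryViaSparkSql"))

    // HiveContext reads table definitions from the Hive metastore (Spark 1.6 API).
    val hiveContext = new HiveContext(sc)

    // "orders" is a hypothetical Hive table used only for illustration.
    val totals = hiveContext.sql(
      "SELECT customer_id, SUM(amount) AS total_amount FROM orders GROUP BY customer_id")

    // The aggregation runs on Spark executors rather than as MapReduce jobs,
    // which is where the speed-up over plain Hive execution typically comes from.
    totals.show(20)

    sc.stop()
  }
}
```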
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Oozie, Kafka, Spark and HBase
Programming Languages: Java (5, 6, 7), Python, Scala
Databases/RDBMS: MySQL, SQL/PL-SQL, MS SQL Server 2005, Oracle 9i/10g/11g, Teradata
Scripting/ Web Languages: JavaScript, HTML5, CSS3, XML, SQL, Shell
NoSQL/Search Data Stores: Cassandra, HBase, Elasticsearch
Operating Systems: Linux, Windows XP/7/8
Software Life Cycles: SDLC, Waterfall and Agile models
Office Tools: MS-Office, MS-Project and Risk Analysis tools, Visio
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SVN, Log4j, SOAP UI, ANT, Maven, Automation and MR-Unit
Cloud Platforms: Amazon EC2
PROFESSIONAL EXPERIENCE:
Confidential, Plano, TX
Lead Spark/Scala Developer
Responsibilities:
- Participated in the sprint review meetings and explained the technical changes to the clients.
- Used the Spark Streaming and Spark SQL APIs to process the files.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
- Designed and developed automated processes using shell scripting for data movement and purging.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed Scala scripts and UDFs using DataFrames/SQL/Datasets as well as RDDs/MapReduce in Spark 1.6 for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Worked on Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups and Hadoop log files.
- Involved in the process of data acquisition, data pre-processing and data exploration of Telecommunication project.
- In the pre-processing phase, used Spark to clean up missing data.
- Used Flume, Sqoop, Hadoop, Spark, and Oozie to build the data pipeline.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Python API.
- Used AWS, which provides a variety of computing and networking services, to meet the needs of the applications.
- Involved in understanding the existing application and the transformations built using Ab Initio and Teradata.
- Extensively involved in developing a RESTful API using the JSON library of the Play framework.
- Used Scala collection framework to store and process the complex consumer information.
- Used Scala functional programming concepts to develop business logic.
- Developed automated workflows for monitoring the landing zone for files and ingesting them into HDFS using the Bedrock tool and Talend.
- Implemented Apache NiFi processors for the end-to-end ETL process of extracting, transforming, and loading data files.
- Improved Apache NiFi cluster performance by distributing the flow of data to multiple nodes with Remote Process Groups (RPG).
- Developed optimal strategies for distributing the ITCM log data over the cluster; imported and exported the stored log data into HDFS and Hive using Apache NiFi.
- Imported data into HDFS from various SQL databases and files using Sqoop and from streaming systems using Storm into Big Data Lake.
- Collected and aggregated large amounts of log data using Apache Flume and staged data in HDFS for further analysis.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Used Sqoop tool to load data from RDBMS into HDFS.
- Worked on real-time streaming data received from Kafka, processed it using Spark, and stored the results in the HDFS cluster using Python.
- Streamed real-time data using Spark with Kafka (a hedged Scala sketch of this pattern follows this list).
- Managing and reviewing Hadoop log files.
- Running Hadoop streaming jobs to process terabytes of xml format data.
- Analyzing large-scale time series data sets and using the Spark-TS library.
- Worked on the Spark-TS library, which provides both Scala and Python APIs for manipulating and modeling time series data on top of Spark.
- Supported MapReduce programs running on the cluster.
- Cluster coordination services through Zookeeper.
- Involved in loading data from UNIX file system to HDFS.
- Installed and configured Hive and wrote Hive UDFs.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Automated all the jobs, for pulling data from FTP server to load data into Hive tables, using Oozie workflows.
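A minimal sketch of the Kafka-to-Spark-Streaming-to-HDFS flow referenced above, written against the Spark 1.6 direct-stream API; the broker list, topic name, and output path are placeholders, not values from the actual project.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfsStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToHdfsStream")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Broker list, topic name, and output path are illustrative placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("portal-events")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Keep only the message payload and append each micro-batch to HDFS.
    stream.map(_._2)
      .foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty()) {
          rdd.saveAsTextFile(s"hdfs:///data/portal-events/batch-${time.milliseconds}")
        }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}
```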
Environment: HDFS, MapReduce, Sqoop, Oozie, Pig, Teradata, Hive, NiFi, HBase, Flume, Linux, Java, Eclipse, Cassandra, Hadoop Distribution of Cloudera, PL/SQL, and UNIX Shell Scripting.
Confidential - Philadelphia, PA
Lead Hadoop Developer
Responsibilities:
- Involved in various stages of Software Development Life Cycle (SDLC) during application development.
- Involved in installing and configuring various Hadoop components such as Pig, Hive, Sqoop, Flume, Oozie.
- Used Sqoop as data ingestion tool to import and export data from RDBMS to HDFS and Hive.
- Log data collected from the web servers was channeled into HDFS using Flume and Spark Streaming.
- Data was also processed using Spark, for example aggregating and calculating statistical values by using different transformations and actions.
- Large data sets were analyzed using Pig scripts and Hive queries.
- Implemented bucketing concepts in Hive, and designed managed and external tables to enhance performance.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Processed schema-oriented and non-schema-oriented data using Scala and Spark.
- Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
- Involved in developing Pig scripts to transform raw data into data that is useful to gain business insights.
- Used Sqoop to export the analyzed data for visualization and generation of reports, which are given to BI team.
- Ingested flat files received via the ECG FTP tool and files received from Sqoop into the UHG data lake (Hive and HBase) using Data Fabric functionality.
- Extensively used SQL in analyzing, testing, and prototyping the data solutions in Teradata.
- Worked on Snappy compression for Avro and Parquet files (a brief Parquet example follows this list).
- Configured MySQL database to store Hive metadata.
- Oozie workflow engine was installed to run multiple Hive and Pig Jobs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Migrated ETL processes from RDBMS to Hive to enable easier data manipulation.
- Implemented test scripts to support test-driven development and continuous integration.
- Collected and aggregated large amounts of data from different sources, such as COSMA (CSX Onboard System Management Agent), BOMR (Back Office Message Router), ITCM (Interoperable Train Control Messaging), and onboard mobile and network devices from the PTC (Positive Train Control) network, using Apache NiFi, and stored the data in HDFS for analysis.
- Supported QA engineers in understanding, troubleshooting, and testing.
- Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Mentored analyst and test team for writing Hive queries.
- Cluster co-ordination services through ZooKeeper.
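A short, illustrative example of writing Snappy-compressed Parquet output with Spark 1.6, as referenced in the Snappy/Avro/Parquet bullet above; the input and output paths, the JSON source format, and the event_date column are assumptions made only for the sketch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}

object SnappyParquetWriter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SnappyParquetWriter"))
    val sqlContext = new SQLContext(sc)

    // Compress Parquet output with Snappy (Spark 1.6 configuration key).
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    // Input and output paths are placeholders for illustration.
    val rawDf = sqlContext.read.json("hdfs:///data/raw/events")

    rawDf.write
      .mode(SaveMode.Overwrite)
      .partitionBy("event_date") // assumes the source data carries an event_date column
      .parquet("hdfs:///data/curated/events_parquet")

    sc.stop()
  }
}
```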
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, NiFi, HBase, Sqoop, Spark, Oozie, Zookeeper, RDBMS/DB, MySQL, CSV.
Confidential, Dallas
Sr. Big Data Developer
Responsibilities:
- Worked with BI team in the area of Big Data Hadoop cluster implementation and data integration in developing large-scale system software.
- Installed/Configured/Maintained Apache Hadoop clusters for application development and Hadoop tools like Hive, Pig, HBase, Flume, Oozie Zookeeper and Sqoop.
- Responsible for building scalable distributed data solutions using Hadoop.
- Performed performance tuning and troubleshooting of Map Reduce jobs by analysing and reviewing Hadoop log files.
- Developed several custom User defined functions in Hive & Pig using Java.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Collected the logs from the physical machines and the OpenStack controller and integrated them into HDFS using Flume.
- Migrated an existing on-premises application to AWS.
- Experience in running Hadoop streaming jobs to process terabytes of xml format data.
- Migrated MongoDB sharded/replica clusters from one data center to another without downtime.
- Managed and monitored large production MongoDB sharded cluster environments holding terabytes of data.
- Worked on Importing and exporting data from RDBMS into HDFS with Hive and PIG using Sqoop.
- Highly skilled and experienced in Agile Development process for diverse requirements.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala, Python.
- Set up MongoDB profiling to identify slow queries.
- Implemented MMS (MongoDB Management Service) monitoring and backup in the cloud and on local servers (on-premises with Ops Manager).
- Configured Hive and Oozie to store metadata in Microsoft SQL Server.
- Experienced in migrating HiveQL into Impala to minimize query response time.
- Developing and running Map-Reduce jobs on YARN and Hadoop clusters to produce daily and monthly reports as per user's need
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Extensive experience working with HDFS, Pig, Hive, Sqoop, Flume, Oozie, MapReduce, Zookeeper, Kafka, Spark, and HBase; worked on a text mining project with Kafka.
- Developed a data pipeline to store data into HDFS.
- Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Expertise in deployment of Hadoop YARN, Spark, and Storm integration with Cassandra, Ignite, Kafka, etc.
- Moved data between clusters using distributed copy (DistCp); supported and maintained Sqoop jobs and programs; designed and developed Spark RDDs and Spark SQL jobs.
- Worked with the customer to provide solutions to various problems; worked with Spark for POC purposes.
- Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform (a hedged sketch follows this list).
- Sqoop jobs, PIG and Hive scripts were created for data ingestion from relational databases to compare with historical data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Involved in submitting and tracking Map Reduce jobs using Job Tracker.
- Implemented Hive Generic UDF's to implement business logic.
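A hedged sketch of the Scala log producer mentioned above, using the standard Kafka producer client; the broker list, topic name, and log path are placeholders, and a real implementation would tail the file and track offsets rather than read it once.

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // Broker list and topic are illustrative placeholders.
    props.put("bootstrap.servers", "broker1:9092,broker2:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val topic = "app-logs"

    // A real watcher would tail the file continuously; reading it once keeps the sketch short.
    val logFile = args.headOption.getOrElse("/var/log/app/application.log")
    try {
      for (line <- Source.fromFile(logFile).getLines()) {
        producer.send(new ProducerRecord[String, String](topic, line))
      }
    } finally {
      producer.close()
    }
  }
}
```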
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Sqoop, Flume, Oozie, Java, Linux, Teradata, Zookeeper, Kafka, Impala, Akka, Apache Spark, Spark Streaming, Hortonworks, HBase, MongoDB.
Confidential - Austin, TX
Sr. Hadoop Developer
Roles and Responsibilities:
- Day-to-day responsibilities include solving developer issues, handling deployments (moving code from one environment to another), providing access to new users, providing immediate solutions to reduce impact, documenting them, and preventing future issues.
- Added and installed new components, and removed them, through Cloudera Manager.
- Collaborated with application teams to install operating system and Hadoop updates, patches, and version upgrades.
- Worked with cloud services such as Azure and was involved in ETL, data integration, and migration.
- Wrote Lambda functions in Python for AWS Lambda that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes, Troubleshooting, Manage and review data backups, Manage & review log files
- Responsible for designing and developing the business components using Java.
- Created Java classes and interfaces to implement the system.
- Designed, built, and deployed a multitude of applications utilizing much of the Azure stack, focusing on high availability, fault tolerance, and auto-scaling.
- Designed and developed automation test scripts using Python.
- Azure Cloud Infrastructure design and implementation utilizing ARM templates.
- Orchestrated hundreds of Sqoop scripts, python scripts, Hive queries using Oozie workflows and sub-workflows
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the sketch after this list).
- Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
- Partitioned and queried the data in Hive for further analysis by the BI team
- Extended the functionality of Hive and Pig with custom UDFs and UDAFs written in Java.
- Involved in extracting the data from various sources into Hadoop HDFS for processing
- Worked on analyzing Hadoop cluster and different big data analytic tools including Pig, HBase database and Sqoop
- Responsible for strategic solutions design and client engagement for projects involving machine learning based generation of insights using big data analytics.
- Generated synthetic data necessary for demonstration of the machine learning algorithms developed for activity recognition from sensor data.
- Created and truncated HBase tables in Hue and took backups of submitter ID(s).
- Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.
- Creating and managing Azure Web-Apps and providing the access permission to Azure AD users
- Commissioned and Decommissioned nodes on CDH5 Hadoop cluster on Red hat LINUX
- Involved in loading data from LINUX file system to HDFS
- Experience configuring Storm to load data from MySQL to HBase using JMS.
- Worked with BI teams in generating the reports and designing ETL workflows on Tableau
- Experience in managing and reviewing Hadoop log files.
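An illustrative sketch of converting a simple Hive/SQL aggregation into Spark RDD transformations in Scala, as referenced above; the query, file layout, delimiter, and paths are assumptions, not taken from the actual project.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HiveQueryAsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveQueryAsRdd"))

    // RDD equivalent of a Hive query such as:
    //   SELECT status, COUNT(*) FROM access_logs GROUP BY status;
    val lines = sc.textFile("hdfs:///data/access_logs/")

    val statusCounts = lines
      .map(_.split("\t"))             // assume tab-delimited records
      .filter(_.length > 2)           // drop malformed rows
      .map(fields => (fields(2), 1L)) // assume the third column holds the status
      .reduceByKey(_ + _)

    statusCounts.saveAsTextFile("hdfs:///data/reports/status_counts")
    sc.stop()
  }
}
```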
Environment: HDFS, MapReduce, Hive, Hue, Pig, Azure, Flume, Oozie, Sqoop, CDH5, Apache Hadoop, Spark, Python, R programming, Qlik, Hortonworks, Ambari, Cloudera Manager, Red Hat, Java, MySQL and Oracle.
Confidential, Tampa, FL
Hadoop Developer/Admin
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Involved in Installing and configuring Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, loaded data into HDFS and Extracted the data from MySQL into HDFS using Sqoop.
- Hands on experience in Hadoop administration and support activities for installations and configuring Apache Big Data Tools and Hadoop clusters using Cloudera Manager.
- Capable of handling Hadoop cluster installations in various environments such as Unix, Linux, and Windows; able to implement and execute Pig Latin scripts in the Grunt shell.
- Strong capability in Unix shell programming; able to diagnose and resolve complex configuration issues and to adapt the Unix environment for Hadoop tools.
- Experienced with file manipulation and advanced research to resolve various problems and correct data integrity for critical big data in HDFS and NoSQL databases.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Translated high level requirements into ETL process.
- Collected the log data from web servers and integrated it into HDFS using Flume.
- Implemented NameNode backup using NFS.
- Developed PIG Latin scripts to extract the data from the web server output files to load into HDFS.
- Involved in the installation of CDH3 and the upgrade from CDH3 to CDH4.
- Worked on NoSQL databases including HBase, MongoDB, and Cassandra.
- Created Hive external tables, loaded data into them, and queried the data using HQL.
- Working with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Involved in installing the Oozie workflow engine in order to run multiple Hive and Pig jobs.
- Exporting the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Hadoop, MapReduce, HDFS, Hive, Pig, Java, SQL, Cloudera Manager, Sqoop, Flume, Oozie, CDH3, MongoDB, Cassandra, HBase, Java (jdk 1.6), Eclipse, Oracle and Unix/Linux.
Confidential
Java Developer
Responsibilities:
- Involved in analysis and design phase of Software Development Life cycle (SDLC).
- Analysis of Business Requirements and Technical Requirements.
- Participated in requirements gathering for the application. Coordinated with the business team to review the requirements and went through the Software Requirement Specification (SRS) document.
- Participated in developing different UML diagrams such as Class diagrams, Use case diagrams and Sequence diagrams.
- The Spring framework was used to develop the application, which follows the Model View Controller (MVC) architecture.
- Designed database connections using JDBC.
- Participated in designing and developing UI using HTML, CSS and JavaScript.
- Developed J2EE components on Eclipse IDE.
- Modifications on the database were done using Triggers, Views, Stored procedures, SQL, PL/SQL.
- Implemented multi-threading for faster processing.
- Developed various data gathering forms using JSP's, HTML, CSS.
- JDBC was used for database connectivity to SQL and to invoke stored procedures.
- Developed JavaScript code for input validations.
- Developed Action classes and DAO classes to access the database.
- Used Git version control software to monitor and track all the changes that are done to the source code.
- Used Tomcat Application Server to deploy the applications.
- Involved in Unit testing.
- Actively involved in customer interaction to strengthen customer relationship.
Environment: Oracle 11g, Java 1.5, Struts, Servlets, HTML, XML, SQL, J2EE, JUnit, Tomcat 6, MVC, JavaScript, Git.
Confidential
Java Developer
Responsibilities:
- Worked as a software developer for ECIL, developing a supply chain management system.
- The application involved tracking invoices, raw materials and finished products.
- Gathered user requirements and specifications.
- Developed and programmed the required classes in Java to support the User account module.
- Used HTML, JSP and JavaScript for designing the front-end user interface.
- Implemented error checking/validation on the Java Server Pages using JavaScript.
- Developed Servlets to handle the requests, perform server side validation and generate result for user.
- Used Java script for client side validations.
- Developed SQL queries to store and retrieve data from database & used PL SQL.
- Used Struts framework to maintain MVC and created action forms, action mappings, DAOs, application properties for Internationalization etc.
- Used the Struts Validation framework to perform business validation on the server side.
- Involved in developing business components using EJB Session Beans and persistence using EJB Entity beans.
- Involved in managing the Business Delegate to maintain decoupling between the presentation and business layers.
- Used JMS for Asynchronous messaging.
- Used Eclipse IDE to develop the application
- Involved in fixing defects and tracking them using QC; provided support, maintenance, and customization.
- Developing customized reports and Unit Testing using JUnit.
- Used JDBC interface to connect to database.
- Performed User Acceptance Test.
- Deployed and tested the web application on WebLogic application server.
Environment: JDK 1.4, Servlet 2.3, JSP 1.2, JavaScript, HTML, JDBC 2.1, SQL, Microsoft SQL Server, UNIX and BEA WebLogic Application Server.