- Around 6 years of IT experience, with 3 years developing data pipelines on big data technologies such as Spark, Hive, Pig, Hadoop, MapReduce, Sqoop, and Kafka.
- Experience in developing Apache Spark programs using Java, Scala, and Python.
- Solid knowledge of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Experienced in writing Spark programs/applications in Scala using Spark APIs for data extraction, transformation, and aggregation (see the sketch after this summary).
- Expertise in processing large sets of structured and semi-structured data in Spark and Hadoop and storing them in HDFS.
- Experienced with the Spark framework for both batch and real-time data processing.
- Experience in developing Kafka consumers in Spark/Scala applications using the Kafka Consumer API.
- In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, ResourceManager, NodeManager, NameNode, and DataNode, as well as MR v1 & v2 concepts.
- Developed MapReduce programs in Java for data cleansing, filtering, and aggregation.
- Hands-on experience in installing, configuring, supporting, and managing Hadoop clusters using Hortonworks and Cloudera.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Implemented Hive UDFs to achieve customized functionality.
- Experience in migrating ETL operations from Hive to Spark.
- Worked on importing and exporting data between RDBMS systems and HDFS/Hive using Sqoop.
- Experienced in analyzing data using Pig Latin scripts.
- Proficient in big data ingestion and streaming tools like Sqoop and Kafka.
- Good knowledge of NoSQL databases, with hands-on experience writing applications on Cassandra and MongoDB.
- Working knowledge of RDBMS databases such as Oracle 11g, SQL Server, MySQL, and MS Access.
- Good knowledge of Data warehousing concepts and ETL processes.
- Good knowledge of scripting languages such as Linux/Unix shell scripting and Python.
- Experienced in using IDEs and tools like Eclipse, NetBeans, GitHub, Maven, and IntelliJ.
- Experienced in working with different file formats: Avro, text files, XML, JSON, and CSV.
- Good understanding of algorithms, data structures, performance optimization techniques and object-oriented programming.
- Proficient in data visualization, creating multiple dashboards using Tableau and R.
- Skilled in using version control software such as Git.
- Robust understanding of Agile methodology and implementing the Scrum framework in project development.
- Involved in various stages of the Waterfall methodology, including Analysis, Development, and Maintenance.
- Able to work independently as well as a strong team player, with excellent communication skills.
- Quick learner; self-motivated and adaptable to new environments.
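A minimal sketch of the Spark/Scala extraction-transformation-aggregation pattern summarized above, written against the Spark 1.6-era RDD API; the HDFS paths, delimiter, and field layout are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object SalesAggregation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SalesAggregation"))

        // Extraction: read delimited records from HDFS (path is a placeholder).
        val lines = sc.textFile("hdfs:///data/raw/sales")

        // Transformation: parse and drop malformed rows.
        val parsed = lines.map(_.split(","))
          .filter(_.length >= 3)
          .map(f => (f(0), f(2).toDouble)) // (productId, amount): assumed layout

        // Aggregation: total amount per product, written back to HDFS.
        parsed.reduceByKey(_ + _)
          .map { case (id, amt) => s"$id,$amt" }
          .saveAsTextFile("hdfs:///data/curated/sales_totals")

        sc.stop()
      }
    }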
Languages: Python 2.7, Java 1.8, Scala 2.10, SQL, R, C, C++
Cluster Mgmt. & Monitoring: Cloudera 5.7.6, Hortonworks Ambari 2.5
Hadoop Ecosystem: Hadoop 2.6, MapReduce v1 & v2, YARN, Spark 1.6 (Spark SQL, Spark Streaming; Scala and Python APIs), HDFS, Sqoop 1.4.6, Hive 0.13, Pig, Kafka
Databases: Oracle 11g, SQL Server, MySQL, MS Access
NoSQL Databases: MongoDB, Cassandra
Virtualization: VMware Workstation, Oracle VM VirtualBox
Reporting & Visualization: MS Excel, R, Tableau
Cloud Computing: Google Cloud
IDEs & Tools: Eclipse, NetBeans, GitHub, Maven, IntelliJ
Version Control: Git, SVN
Operating Systems: Unix, Linux, Windows
Confidential, New Jersey
Big Data Engineer
- Performed data ingestion from various sources into the Hadoop data lake using Kafka.
- Built a real-time pipeline for streaming data using Kafka and Spark Streaming (see the first sketch after this list).
- Wrote and ran Java producer programs to post messages to Kafka topics.
- Wrote and ran Java consumer programs to read and process messages from Kafka topics.
- Created tables in DataStax Cassandra and loaded large sets of data for processing.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop, using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Responsible for implementing a POC to migrate MapReduce jobs to Spark RDD transformations using Scala.
- Created a Spark application to load data into a dynamically partitioned Hive table (see the second sketch after this list).
- Created Hive external tables for each source table in the Hadoop data lake.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Optimized datasets by implementing dynamic partitioning and bucketing in Hive.
- Developed business-specific custom UDFs in Hive and Pig (see the third sketch after this list).
- Involved in Agile methodologies, daily Scrum meetings, and Sprint planning.
- Used code repositories such as Git.
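The streaming pipeline above combined Kafka producers/consumers with Spark Streaming. A condensed sketch of that pattern follows; the original producer and consumer programs were written in Java, but Scala is used here for consistency with the other sketches, and the broker address and topic name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object EventsPipeline {
      // Producer side: post a message to a Kafka topic.
      def produce(): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        producer.send(new ProducerRecord[String, String]("events", "key-1", """{"user":"u1"}"""))
        producer.close()
      }

      // Consumer side: Spark 1.6 direct stream over the same topic.
      def consume(): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("EventsStream"), Seconds(10))
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
        stream.map(_._2).count().print() // placeholder for the real per-batch processing
        ssc.start()
        ssc.awaitTermination()
      }
    }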
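The dynamically partitioned Hive load could look like this Spark 1.6 sketch; the database, table, and column names are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveLoad"))
    val hc = new HiveContext(sc)

    // Dynamic partitioning must be enabled before the insert.
    hc.sql("SET hive.exec.dynamic.partition = true")
    hc.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Hive derives the target partition from the trailing event_date column.
    hc.sql(
      """INSERT OVERWRITE TABLE curated.events PARTITION (event_date)
        |SELECT user_id, event_type, payload, event_date
        |FROM staging.events_raw""".stripMargin)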
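And a custom Hive UDF of the kind mentioned above, built on the classic org.apache.hadoop.hive.ql.exec.UDF API; the masking logic is a placeholder for the actual business rule:

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // Masks all but the last four characters of a column value.
    class MaskValue extends UDF {
      def evaluate(input: Text): Text =
        if (input == null) null
        else {
          val s = input.toString
          val masked = if (s.length <= 4) s else "*" * (s.length - 4) + s.takeRight(4)
          new Text(masked)
        }
    }

Once packaged into a jar, such a class is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being called from HiveQL.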
Environment: CDH 5.7.6, Hadoop 2.6, Spark 1.6.0, Scala 2.10, Maven, Kafka 2.10, Sqoop 1.4.6, MapReduce, HDFS, Pig, Hive 0.13, IntelliJ, Oracle, DataStax Cassandra 4.8, CentOS, Windows, Python 2.7, Tableau 9.0
Confidential, Charlotte, NC
- Worked on analyzing the Hadoop cluster using different big data analytic tools, including Spark, Pig, Hive, and MapReduce.
- Developed Spark code using Scala for faster data processing.
- Migrated complex MapReduce programs and Hive scripts to Spark RDD transformations and actions.
- Developed Scala scripts and UDFs, using both SQL and RDD/MapReduce in Spark, for data aggregation and queries, and wrote data back into the RDBMS through Sqoop (see the sketch after this list).
- Designed and developed Pig Latin scripts and Pig command-line transformations for data joins and custom processing of MapReduce outputs.
- Wrote Pig scripts to process unstructured data and make it available for processing in Hive.
- Created Hive schemas using performance techniques such as partitioning and bucketing.
- Performed data analysis on Cassandra data using Hive external tables.
- Exported the analyzed data to Cassandra using Sqoop to generate reports for the BI team.
- Deployed code to version control using Git.
- Worked on different data formats such as CSV and JSON.
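A sketch of the Spark SQL UDF pattern referenced above, against the Spark 1.6 SQLContext API; the function, paths, and column names are hypothetical, and the result is staged as delimited text in HDFS so a Sqoop export to the RDBMS can pick it up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("CustomerAggregation"))
    val sqlContext = new SQLContext(sc)

    // Register a UDF callable from SQL queries.
    sqlContext.udf.register("normalize_state",
      (s: String) => if (s == null) null else s.trim.toUpperCase)

    val df = sqlContext.read.json("hdfs:///data/raw/customers")
    df.registerTempTable("customers")

    // Aggregate with the UDF, then stage the result for the Sqoop export.
    sqlContext.sql(
      """SELECT normalize_state(state) AS state, COUNT(*) AS cnt
        |FROM customers
        |GROUP BY normalize_state(state)""".stripMargin)
      .rdd.map(r => s"${r.getString(0)},${r.getLong(1)}")
      .saveAsTextFile("hdfs:///data/curated/customers_by_state")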
Environment: CDH, HDFS, Spark, Pig, Hive, Sqoop, MapReduce, YARN, Unix shell scripting, Agile methodology
- Worked on a live 8-node Hadoop cluster running CDH 4.
- Used Sqoop to import data from the RDBMS into the Hadoop Distributed File System (HDFS).
- Developed several MapReduce programs to analyze and transform the data to uncover insights into customer usage patterns (see the sketch after this list).
- Used Pig as an ETL tool for transformations, event joins, and pre-aggregations before storing the data in HDFS.
- Responsible for creating Hive external tables, loading data into them, and querying the data using HiveQL.
- Used the Hive data warehouse tool to analyze the unified historic data in HDFS and identify issues and behavioral patterns.
- Enabled concurrent access to Hive tables with shared and exclusive locking, supported by the ZooKeeper implementation in the cluster.
- Involved in design, development, and support of the application using Agile methodology, and participated in Scrum meetings.
- Developed user interfaces using JSP, HTML, JavaScript, and CSS; designed and developed client-server network communication.
- Designed and developed an offline, location-based ERP solution.
- Conducted design and technical reviews with other project stakeholders; implemented services using Core Java.
- Developed analysis-level documentation such as Use Case, Business Domain Model, Activity & Sequence, and Class diagrams.
- Developed client and server components using Core Java, Swing, and C++.
- Provided technical support to clients.
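A trimmed sketch of one of the usage-pattern MapReduce programs mentioned in this role; the originals were written in Java, but the same Hadoop API is shown here from Scala for consistency with the other sketches, and the log layout (comma-delimited, user id in the first field) is assumed:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map: emit (userId, 1) for each usage-log line.
    class UsageMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one = new IntWritable(1)
      private val user = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
        val fields = value.toString.split(",")
        if (fields.nonEmpty) { user.set(fields(0)); ctx.write(user, one) }
      }
    }

    // Reduce: sum events per user.
    class UsageReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
        var sum = 0
        val it = values.iterator()
        while (it.hasNext) sum += it.next().get()
        ctx.write(key, new IntWritable(sum))
      }
    }

    object UsageCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "usage-count")
        job.setJarByClass(classOf[UsageMapper])
        job.setMapperClass(classOf[UsageMapper])
        job.setReducerClass(classOf[UsageReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }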
Environment: Java (Core Java, AWT, Applets, Swing), C++, Struts, JSP, Servlets, JDBC, SQL Server
- Used HTML and CSS to build page layouts.
- Followed design requirements to build user-friendly layouts using HTML and CSS.
- Requested and retrieved data from the back end using AJAX to exchange JSON data.
- Used SVN for version control and QC for defect tracking.
- Created cross-browser-compatible, standards-compliant CSS-based page layouts.
- Performed daily website maintenance and content updates.