- IT professional with 8 years of experience and proven technical and interpersonal skills across various software development methodologies, both as a team player and as an individual contributor.
- Expertise in Big Data analytics and development with over 5 years of commendable experience using Hadoop and its Ecosystem components in Retail, Banking and Insurance domains.
- 3 years of experience in challenging roles as a Java and SQL developer.
- Extensively involved in writing hundreds of lines of code throughout the career as a developer.
- Actively interacted with business teams to understand the business requirements and prepare high-level design documentation.
- Strong knowledge of object-oriented programming, shell scripting, ETL design, and monitoring and managing multi-node clusters.
- Involved in all phases of project development, including development, UAT and deployment.
- Installed and configured Hadoop and its ecosystem components according to Business models on multi-node clusters.
- Emphasis on entire data flow of enterprise Hadoop ecosystem from upstream to downstream and vice versa.
- Excellent knowledge in configuring the architecture of Hadoop core components such as HDFS, Name Node, Secondary Name Node, YARN and Data Nodes.
- Configured the NFS Gateway, which supports browsing, downloading, uploading and streaming data directly to HDFS through the mount point.
- Configured distributed file systems, administered NFS servers and NFS clients, and edited auto-mount maps as per system requirements.
- Deployed rpcbind, mountd and nfsd daemons for successful start and stop of the NFS gateway service. Implemented Name Node backup using NFS.
- Expertise in implementing enterprise level security using AD/LDAP, Kerberos, Sentry and Ranger.
- In-depth understanding of Hadoop rack topology, cluster monitoring and maintenance, and managing and reviewing data backups and Hadoop log files.
- Worked on various distributions of Hadoop like Cloudera, Hortonworks and MapR.
- Extensively worked on writing complex MapReduce jobs on vast amounts of distributed data present in HDFS Data lake and scheduled the jobs using shell scripting.
- Improved the efficiency of MapReduce jobs by modifying the existing code using enhanced programming techniques and considering the time and space complexities.
- Ingested historical data from various traditional databases using Sqoop import to migrate it to distributed databases.
- Used Sqoop incremental imports to ingest data produced on a daily basis, scheduling and monitoring the jobs using Autosys and Cron.
- Created schemas and loaded the data present in HDFS into Hive tables. Used staging tables to apply static and dynamic partitioning, reducing the MapReduce work needed when queries include filter conditions.
- Used Bucketing technique on columns with high cardinality to increase the performance of join queries.
- Wrote Hive UDFs in Python to manipulate dates and strings and to execute complex queries where default Hive functions failed to produce the expected results.
- Submitted a POC on using various kinds of performance enhancement techniques available with Hive when performing complex queries.
- Integrated the Hive tables with HBase using the HBaseStorageHandler class to make it possible to perform Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) on existing Hive tables without any data redundancy.
- Used Hive Serializer/Deserializer (SerDe) classes to load data from JSON and XML files into Hive tables.
- Combined Pig with Hive to create processing pipelines which can scale quite easily in place of writing low-level MapReduce jobs.
- Used Pig to extract, transform, clean and process large data sets and store the results in HDFS.
- Built Kafka pipelines to ingest Real time and Near Real time data from various data sources to HDFS.
- Configured Kafka producers and created consumer groups to publish and subscribe to streams of records in a distributed, fault-tolerant environment.
- Created SQL Context with Hive Context to load Hive tables into RDDs and query large data sets more efficiently.
- Widely used Spark transformations to normalize data coming from real time data sources.
- Created Spark DataFrames by loading flat files into RDDs, transforming unstructured data into structured data to perform analytics on data present in flat files.
- Developed Spark streaming programs in Python to transform and store the data into HDFS on the fly.
- Good knowledge of setting up batch intervals, slide intervals and window intervals in Spark Streaming.
- Migrated the traditional MapReduce jobs to Spark jobs to improve the speed of data transformations, analysis and to achieve Real-time processing.
- Hands-on knowledge of creating Amazon EC2 instances and S3 buckets and using Amazon EMR clusters to store, process and analyze data. Also used PSSH to run commands on multiple nodes at a time.
- Widely used various compression techniques like Snappy, LZ4 and bzip2 on the file formats Parquet, Optimized Row Columnar (ORC) and Avro.
- Working knowledge of Talend, Informatica, Maven, Git Enterprise, Jenkins, Control-M, Cron, Autosys, PuTTY and WinSCP.
- Actively collaborated with team members in daily scrum meetings to ensure smooth progress in development and on-time completion of sprints.
- Worked with Core Java and J2EE technologies such as Servlets, JSP, Collections, Multi-Threading, Exception Handling, EJB, JDBC and Web services.
- Extensive experience in working with SQL and NoSQL databases such as MySQL, DB2, MongoDB, Cassandra.
Programming languages: Java, Python, Scala, Pig Latin, Shell scripting, C, C#, C++, JavaScript, SQL and HQL
Hadoop Components: YARN, HDFS, Hive, HBase, Impala, Kafka, Flume, Flink, Sqoop, Apache Phoenix, Spark, Oozie and Zookeeper
Databases: MySQL, MongoDB, HBase, DB2, Cassandra, Netezza and Oracle
Web Services: Elastic MapReduce, Amazon EC2, Amazon S3 (cloud storage), CloudWatch, RESTful and SOAP
BI Tools: Tableau, Talend and Informatica
Frameworks: Hadoop, Hibernate, Spring, MVC and Spring Boot
Web Technologies: HTML, CSS, Bootstrap, jQuery, JSON and XML
Web Servers: Jetty, Tomcat, GlassFish and Apache HTTP server
Development Methodologies: Scrum, Waterfall, Extreme Programming and Spiral models
Other Technologies: NiFi, Git, Cloudera Manager, Jenkins, ELK, Control-M, Autosys, Cron, PuTTY, Maven, Eclipse IDE, NetBeans, IntelliJ, WinSCP, SSRS, SSIS
Sr. Hadoop/Spark Developer
Confidential - Kansas City, Missouri
- Designed and deployed a Hadoop cluster (Cloudera) and data pipelines using Big Data analytic tools.
- Involved in installing and configuring Hadoop ecosystem components like Hadoop MapReduce, HDFS, Sqoop, Hive, Impala, Spark, HBase, Kafka, Flume, Flink and Pig on the multi-node cluster.
- Worked closely with the data source team to understand the scale and format of data to be ingested on a daily basis.
- Developed custom Unix shell scripts to run pre- and post-validations of master and slave nodes, before and after configuring the Name node and Data nodes respectively.
- Configured core-site.xml, hdfs-site.xml and mapred-site.xml according to the multi-node cluster environment.
- Used Sqoop import functionality for loading historical data present in a relational database system into the Hadoop Distributed File System (HDFS).
- Imported and exported data from RDBMS to the HDFS data lake and from HDFS to Teradata using Sqoop import, Sqoop incremental import and Sqoop export functionalities, and scheduled the jobs on a daily basis with shell scripting.
- Efficiently joined raw data with the reference data using Pig scripting.
- Normalized the data coming from various sources like RDBMS, Flat Files and various log files.
- Used various file formats like Parquet, Avro, ORC and compression techniques like Snappy, LZO and GZip for efficient management of cluster resources.
- Wrote Hadoop MapReduce jobs using the Java API for processing data present on HDFS.
- Integrated Hive SQL and HBase NoSQL databases to make it possible to perform Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) on the same data without redundancy.
- Extensively worked with workflow schedulers like Oozie and with UNIX shell scripts for automating data ingestion and processing events.
- Used Pig predefined functions to convert fixed width file to delimited file.
- Wrote complex HQL queries involving joins and aliases.
- Wrote Hive user-defined functions (UDFs) in Python where HiveQL could not produce the results required by the data science team.
- Used Zookeeper Quorum with HBaseStorageHandler class to integrate the schema of Hive tables with HBase tables.
- Consumed real-time and near-real-time data coming from various data sources through Kafka data pipelines and applied various transformations to normalize the data, which was then stored in the HDFS data lake.
- Developed the configuration files for Flume source, channel and sink for creating pipelines from various data sources into HDFS.
- Extensively worked with Spark DataFrames for ingesting data from flat files into RDDs to transform unstructured data into structured data.
- Created the Spark SQL context to load data from Hive tables into RDDs for performing complex queries and analytics on data present in the data lake.
- Used Spark transformations for Data Wrangling and ingesting the real-time data of various file formats.
- Exported the data present in HDFS to Teradata for further usage by the data science team.
- Used Apache NiFi for building and automating dataflow from data source to HDFS and from HDFS to Teradata.
- Worked on a POC on using Apache Kylo for management of data lakes in HDFS on open source platforms like Apache Spark and NiFi.
- Deployed Control-M on on-premises servers for managing file transfer operations and automating job scheduling and application deployment.
- Monitored the Hadoop cluster continuously using Cloudera Manager and wrote shell scripts to automate emails to the business team.
- Exported the sprint jobs into Git Enterprise for successful deployment of the project.
- Performed Unit Testing as well as Functional testing at different steps of development to make sure that the data is being processed without any refinement.
- Involved in setting up the Hadoop environment during User Acceptance Testing (UAT) of the project.
Environment: Hadoop MapReduce, Hadoop Distributed File System (HDFS), Java, Python, Hive, Impala, Sqoop, Pig, Oozie, Zookeeper, Kafka, Flume, Flink, Teradata, Kylo, Bash Scripting, UNIX, PuTTY, WinSCP, Jenkins, Git Enterprise, Control-M, Microsoft Excel.
Sr. Hadoop/Spark Developer
Confidential - Atlanta, GA
- Configured the Hadoop Ecosystem components like YARN, Hive, Pig, HBase, Impala and Apache Spark on Amazon EMR cluster.
- Interacted with the administration team for managing the permissions and visibility of data present on S3.
- Developed job processing scripts using Oozie workflow.
- Used Amazon Simple Storage Service (S3), Amazon Elastic MapReduce (EMR) and Amazon Elastic Compute Cloud (EC2).
- Created and configured elastic clusters of Amazon EC2 instances used for running Hadoop and other applications in Hadoop ecosystem using Amazon Elastic MapReduce (EMR).
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data processing and storage, and maintained the Hadoop cluster on AWS EC2.
- Installed and configured Parallel SSH (PSSH) for running commands on multiple nodes at a time.
- Leveraged Amazon S3 as the data layer for the Amazon EMR cluster through the EMR File System (EMRFS).
- Created triggers using Lambda functions for launching Elastic MapReduce (EMR) jobs and scheduled them with UNIX shell scripting.
- Configured HDFS on the Amazon EMR cluster and encrypted the data present in HDFS using Amazon EMR security configurations.
- Integrated Apache Phoenix with Apache HBase for low-latency SQL access over Apache HBase tables and secondary indexing for increased performance.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Developed a Spark job in Python which indexes data into Elasticsearch from external Hive tables which are in HDFS.
- Created indexes for various statistical parameters on Elasticsearch and generated visualization using Kibana.
- Submitted a POC on advantages of creating Amazon S3 buckets in the same region of cluster launch to reduce cross-region bandwidth charges.
- Developed Spark scripts to import large files from Amazon S3 buckets.
- Developed solutions for importing data into and exporting data from HDFS, analyzed the data using MapReduce and Hive, and delivered summary results from Hadoop to downstream systems.
- Used Sqoop import and export functionalities to transfer data between the Hadoop Distributed File System (HDFS) and RDBMS.
- Created the schema for External Hive tables to load data from HDFS for querying as per the requirement.
- Developed custom Elastic MapReduce programs to analyze the data and Python scripts to remove unwanted data.
- Created custom Hive UDFs for functionality missing in Hive to analyze and process large volumes of data.
- Worked on various performance optimizations like using distributed cache for small datasets, Partitioning, Bucketing in Hive.
- Used map-side joins for faster querying when joining two or more tables.
- Involved in writing complex queries to perform join operations between multiple tables.
- Configured the Hive metadata and the Impala catalog service (catalogd) so that the Impala daemons could pull data using Hive metadata.
- Teamed up with architects to design Spark models for the existing MapReduce models to migrate MapReduce jobs to Spark jobs using Python.
- Consumed JSON and XML data through Kafka, processed it with Spark Streaming code written in Python, and pushed the results into the HDFS data lake.
- Automated the data flow from Amazon S3 to on-premises servers using Apache NiFi.
- Developed scripts and Scheduled Autosys jobs to filter the data.
- Monitored Autosys file watcher jobs, tested the data for each transaction and verified that each job ran properly.
Environment: Hadoop MapReduce, Hadoop File System, YARN, Apache Phoenix, Hive, HBase, Sqoop, Impala, Spark, Kafka, Java, Python, Oozie, Zookeeper, Flume, Amazon EC2, Amazon S3, Amazon EMR, UNIX, JSON, XML, Windows.
Confidential - Boston, MA
- Responsible for building and maintaining scalable, distributed data solutions using Hadoop (Hortonworks Data Platform) with the architecture team.
- Involved in installing and configuring Hadoop MapReduce, HDFS, YARN, Hive, HBase, Sqoop, Kafka, Flume, Spark and Oozie on the Hadoop cluster (Hortonworks).
- Worked closely with the architecture team to understand the data coming from various sources and configured the nodes on the cluster accordingly for performance optimization.
- Used Maven extensively for building jar files of Map-Reduce jobs as per the requirement and deployed to cluster.
- Imported the historical data present in MongoDB using Sqoop import and stored in HDFS using compression techniques.
- Used Sqoop incremental imports for ingesting data on a daily basis and scheduled the jobs using shell scripting.
- Extensively worked with file formats like Avro, Parquet and ORC and converted data between these formats.
- Conducted a POC on using different compression techniques on various file formats to speed up compression and file transfer over the network.
- Used Regex, JSON and XML serialization and de-serialization packages with Hive to parse the contents of streamed log data and implemented Hive custom UDF’s and Hive dynamic partitions to move data on time series.
- Parsed semi-structured and unstructured data from JSON and XML files into Parquet or bzip2-compressed output using DataFrames in PySpark.
- Created HBase tables to store variable data formats of input data coming from different portfolios.
- Used Spark to consume a continuous stream of data coming through Kafka from various sources.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Used Spark API over Hortonworks Hadoop YARN for performing transformations and analytics on Hive tables.
- Developed scripts to integrate Spark streaming and Spark batch processing using Bash scripting.
- Ingested and transformed the data coming from Kafka on the fly.
- Wrote code to create Kafka topics and consumer groups, configured brokers and monitored log files in a distributed environment.
- Wrote Python scripts to parse XML documents.
- Created Oozie workflows to automate scripts for collecting inputs and initializing Spark jobs.
- Moved log files generated by various sources into HDFS through Flume for further processing.
- Implemented Talend for automated loading and processing of data into HDFS which reduced manual loading and processing of data.
- Expert knowledge in MongoDB NoSQL data modelling, tuning, disaster recovery and backup.
- Involved in daily scrum meetings to discuss the development/progress of Sprints and was active in making scrum meetings more productive.
Environment: Hadoop MapReduce, Hadoop File System (Hortonworks), YARN, Java, Maven, Python, HBase, Hive, Spark, Sqoop, Oozie, Flume, Zookeeper, MongoDB, Talend, JSON, XML.