- Hadoop Developer with 7+ years of professional IT experience, including 4+ years as a Big Data consultant working with Hadoop ecosystem components for ingestion, data modeling, querying, processing, storage, analysis, data integration, and implementation of enterprise-level Big Data systems.
- Hands-on development and implementation experience on a Big Data Management Platform (BMP) using Hadoop 2.x, HDFS, MapReduce/YARN/Spark, Hive, Pig, Oozie, Airflow, Talend, Sqoop, and other Hadoop ecosystem components as data storage and retrieval systems.
- Excellent knowledge of Hadoop architecture and the daemons of Hadoop clusters, including the NameNode, DataNode, ResourceManager, NodeManager, and JobHistory Server.
- Experience working with the Hortonworks and Cloudera Hadoop distributions, MapR, and EMR.
- Experience designing end-to-end scalable architectures to solve business problems using Azure services such as HDInsight, Data Factory, Data Lake, Databricks, and Machine Learning Studio.
- Experience using Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Hands-on experience coding MapReduce/YARN programs in Scala and Python to analyze Big Data, and strong experience building data pipelines using Big Data technologies.
- Experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL, Kafka, Spark Streaming and Apache Storm.
- Experience importing and exporting data between HDFS and relational databases such as MySQL, Teradata, Oracle, and DB2 using Sqoop, and experience with data formats such as JSON, Avro, Parquet, RC, and ORC and compression codecs such as Snappy and Gzip.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources such as flat files and databases, and integration with popular NoSQL databases for large volumes of data.
- Hands-on experience with NoSQL databases including HBase, Cassandra, and MongoDB and their integration with Hadoop clusters for large volumes of data.
- Experience collecting and aggregating data from various sources using Apache Kafka and Flume.
- Hands-on experience working with Flume to load log data from multiple sources directly into HDFS.
- Strong experience with Pig, Hive, and Impala analytical functions, and with extending Hive, Impala, and Pig core functionality by writing custom User Defined Functions (UDFs).
- Expertise in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Experience working with internal and external tables in Hive, and developed batch processing jobs using MapReduce, Pig, and Hive.
- In-depth understanding of Spark architecture including Spark Core, RDDs, DataFrames, Datasets, Spark SQL, and Spark Streaming, and experience importing data from HDFS into Spark RDDs for in-memory computation to generate the output response.
- Hands-on experience accessing Hive tables from Spark, performing transformations, and creating DataFrames on Hive tables using Spark SQL.
- Experience converting Hive/SQL queries into RDD transformations using Spark, Scala, and PySpark.
- Worked with Apache Spark, a fast and general engine for large-scale data processing, using PySpark and the functional programming language Scala.
- Good knowledge of using Apache NiFi to automate data movement between different Hadoop systems.
- Expertise in using Oozie to configure, schedule, automate, and manage job workflows based on time-driven and data-driven triggers.
- Worked with Python, Linux/UNIX and shell scripting.
- Experience with BI tools like Tableau for report creation and further analysis.
- Solid understanding and practical experience of Software Development Life Cycle principles.
Big Data Technologies: HDFS, MapReduce, Pig, Hive, Sqoop, Oozie, Scala, Spark, Kafka, Flume, Ambari, Hue
Hadoop Frameworks: Cloudera CDH, Hortonworks HDP, MapR
Database: Oracle 10g/11g, PL/SQL, MySQL, MS SQL Server 2012, DB2
Languages: Core Java, Scala, Python
Scripting: Shell Scripting
AWS Components: S3, EMR, EC2, Lambda, Route 53, CloudWatch, SNS
Methodologies: Agile, Waterfall
Build Tools: Maven, Gradle, Jenkins
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
IDE Tools: Eclipse, NetBeans, IntelliJ
Modelling Tools: Rational Rose, StarUML, Visual Paradigm for UML
Architecture: Relational DBMS, Client-Server Architecture
Cloud Platforms: AWS Cloud
BI Tools: Tableau, SSIS
Operating System: Windows 7/8/10, Vista, UNIX, Linux, Ubuntu, Mac OS X
Confidential, Goodlettsville, TN
- Involved in the complete project life cycle, from design discussions to production deployment.
- Worked closely with the business team to gather their requirements and requests for new support features.
- Involved in running POCs on different use cases of the application and maintained a standard document of best coding practices.
- Built a 250-node cluster for the Data Lake design using the Hortonworks distribution.
- Responsible for building scalable distributed data solutions using Hadoop
- Installed, configured, and implemented high-availability Hadoop clusters with the required services (HDFS, Hive, HBase, Spark, ZooKeeper).
- Implemented Kerberos to authenticate all services in the Hadoop cluster.
- Responsible for installing and configuring Hive, Pig, HBase, and Sqoop on the Hadoop cluster, and created Hive tables to store the processed results in tabular format.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala.
- Developed solutions to load data into HDFS and analyzed it using MapReduce, Pig, and Hive to produce summary results for downstream systems.
- Built servers on AWS: imported volumes, launched EC2 instances, and created security groups, auto-scaling groups, load balancers, Route 53, SES, and SNS within the defined virtual private cloud (VPC).
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Streamed AWS log groups into a Lambda function to create ServiceNow incidents.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and Pig scripts.
- Created Managed tables and External tables in Hive and loaded data from HDFS.
- Developed Spark code using Scala and Spark SQL for faster processing and testing, and ran complex HiveQL queries on Hive tables.
- Scheduled several time-based Oozie workflows by developing Python scripts.
- Developed Pig Latin scripts using operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
- Exported data to RDBMS servers using Sqoop and processed it for ETL operations.
- Worked with S3 buckets on AWS to store CloudFormation templates and created EC2 instances on AWS.
- Designed an ETL data pipeline flow to ingest data from RDBMS sources into Hadoop using shell scripts, Sqoop, packages, and MySQL.
- Optimized Hive tables using techniques such as partitioning and bucketing to provide better query performance.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Pig, and Sqoop.
- Implemented Hadoop on AWS EC2 using a few instances to gather and analyze log files.
- Worked with Spark and Spark Streaming, creating RDDs and applying transformations and actions.
- Created partitioned tables and loaded data using both static and dynamic partitioning.
- Developed custom Apache Spark programs in Scala to analyze and transform unstructured data.
- Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Oracle into HDFS using Sqoop.
- Used Kafka for publish-subscribe messaging as a distributed commit log, gaining experience with its speed, scalability, and durability.
- Followed a Test-Driven Development (TDD) process, with extensive experience in Agile and Scrum methodologies.
- Scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, monitoring, and troubleshooting; managed and reviewed data backups and log files.
- Analyzed the Hadoop cluster using different Big Data analytic tools including Pig, Hive, HBase, and Sqoop.
- Improved performance by tuning Hive and MapReduce.
- Researched, evaluated, and used modern technologies, tools, and frameworks around the Hadoop ecosystem.
Environment: HDFS, MapReduce, Hive, Sqoop, Pig, Flume, Oozie, Shell Scripts, Teradata, HBase, MongoDB, Cloudera, AWS, Kafka, Spark, Scala, ETL, Python, Git, IntelliJ, YARN.
Confidential, Arlington, Texas
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming from streaming sources such as Kafka and JMS.
- Wrote live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Developed Shell, Perl, and Python scripts to automate and provide control flow to Pig scripts.
- Worked with AWS services such as EMR and EC2 for fast and efficient processing of Big Data.
- Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and partitions.
- Applied custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Implemented a POC with Hadoop, extracting data with Spark into HDFS.
- Used Spark SQL with Scala to create DataFrames and perform transformations on them.
- Used Spark SQL to access Hive tables from Spark for faster data processing.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed code to read data streams from Kafka and send them to the respective bolts through the respective streams.
- Worked on Spark Streaming using Apache Kafka for real-time data processing.
- Created Kafka producers and consumers for Spark Streaming.
- Developed MapReduce jobs using the MapReduce Java API and HiveQL.
- Developed UDFs, UDAFs, and UDTFs and used them in Hive queries.
- Developed scripts and batch jobs to schedule an Oozie bundle (a group of coordinators) consisting of various Hadoop programs.
- Used the Avro data serialization system to handle Avro data files in MapReduce programs.
- Optimized Hive queries and joins to handle different data sets.
- Configured Oozie schedulers to run different Hadoop actions on a timely basis.
- Involved in ETL, data integration, and migration by writing Pig scripts.
- Used different file formats such as text files, SequenceFiles, and Avro through Hive SerDes.
- Integrated Hadoop with Solr and implemented search algorithms.
- Used Storm for real-time processing.
- Hands-on experience working with the Hortonworks distribution.
- Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Worked hands-on with NoSQL databases such as MongoDB for a POC storing images and URIs.
- Designed and implemented a MongoDB database and an associated RESTful web service.
- Worked on analyzing and examining customer behavioral data using MongoDB.
- Designed data aggregations in Hive for ETL processing on Amazon EMR to process data according to business requirements.
- Wrote test cases and implemented test classes using MRUnit and mocking frameworks.
- Developed Sqoop scripts to extract data from MySQL and load it into HDFS.
- Set up Spark on EMR to process large volumes of data stored in Amazon S3.
- Processed large volumes of data and ran processes in parallel using Talend functionality.
- Used Talend to create workflows for processing data from multiple source systems.
Environment: MapReduce, HDFS, Sqoop, Linux, Oozie, Hadoop, Pig, Hive, Solr, Spark Streaming, Kafka, Storm, Spark, Scala, Python, MongoDB, Hadoop Cluster, AWS, Talend.
Confidential, Louisville, Kentucky
- Analyzed the Hadoop cluster using different Big Data analytic tools including Pig, Hive, and MapReduce.
- Developed a data pipeline using Flume and Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Used Pig to validate data ingested through Sqoop and Flume, and pushed the cleansed data set into HBase.
- Participated in development/implementation of Cloudera Hadoop environment.
- Collected and aggregated large amounts of log data using Apache Flume and staged it in HDFS for further analysis.
- Worked with ZooKeeper, Oozie, AppWorx, and Data Pipeline Operational Services to coordinate the cluster and schedule workflows.
- Designed and built a reporting application that uses Spark SQL to fetch HBase table data and generate reports.
- Extracted the required data from the server into HDFS and bulk-loaded the cleaned data into HBase.
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into the tables, and writing Hive queries to further analyze the logs and identify issues and behavioral patterns.
- Implemented advanced Spark procedures such as text analytics and processing using its in-memory computing capabilities.
- Involved in running MapReduce jobs for processing millions of records.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed Hive queries and Pig scripts to analyze large datasets.
- Imported and exported data between RDBMS and HDFS using Sqoop.
- Generated ad hoc reports using Pig and Hive queries.
- Used Hive to analyze data ingested into HBase through Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Developed job flows in Oozie to automate the workflow for Pig and Hive jobs.
- Loaded the aggregated data into Oracle from the Hadoop environment using Sqoop for reporting on the dashboard.
Environment: Red Hat Linux, HDFS, MapReduce, Hive, Java JDK 1.6, Pig, Sqoop, Flume, Spark, ZooKeeper, Oozie, Oracle, HBase.
- Administered 1200+ RHEL 5.x/6.x servers, including installation, testing, tuning, upgrading, patching, and troubleshooting of both physical and virtual server issues.
- Implemented and administered VMware ESXi and vCenter to run Red Hat Linux servers in production and development.
- Installed Red Hat Linux using Kickstart and applied security policies to harden servers in line with company policies.
- Installed packages with RPM and YUM.
- Configured DNS, DHCP, NIS, and NFS on Red Hat Enterprise Linux 5.x/6.x.
- Experienced with Java application servers such as Tomcat and WebSphere Application Server Community Edition.
- Worked in Agile Project Management environment to deliver high priority, high-quality work.
- Created file systems using NFS and mounted them. Created devices and special files using mknod.
- Managed daily system administration cases using BMC Remedy Help Desk.
- Used system monitoring tools such as top, sar, vmstat, iostat, free, and ps.
- Mirrored root disks using hardware RAID controllers on HP and Dell hardware.
- Created volumes on storage such as NetApp and VMAX arrays for Red Hat servers.
- Performed host-level data migration using Red Hat LVM and Veritas Volume Manager.
- Automated routine jobs by developing Korn, Perl, and Bash shell scripts.
- Installed, configured, and managed disks and file systems through Veritas Volume Manager.
- Performed clustering and cluster management on servers using Veritas Cluster Server 2.0 and 3.5.
- Scheduled full and incremental backups to tape and performed backups and restores using the ufsdump and ufsrestore commands.
- Handled performance monitoring of file systems, CPU, memory, and processes on all UNIX servers.
- Configured integration tools (Jenkins).
- Configured Red Hat Cluster and supported the GFS file system.
- Troubleshot problems and performance issues; installed the latest patches for Linux and performed Red Hat Linux kernel tuning.
- Responsible for determining hardware and compatibility requirements for installation of application software and different tools.
- Installed Oracle patches, performed troubleshooting, and created and modified application-related objects.
- Created profiles, users, and roles and maintained security.
- Created users, managed user permissions, and maintained user and file system quotas on Linux servers.
- Configured backups using Veritas NetBackup.
- Set up and configured TCP/IP networking on Red Hat Linux and CentOS, including RPC connectivity for NFS. Created mount points for server directories and mounted these directories on the servers.
- Performed day-to-day administration on RHEL 5.x/6.x, including installation, upgrades, patch management, and package loading.
- Provided 24x7 on-call support on a rotation basis for the production environment.
Environment: Linux 6.x/5.x, Oracle 10g, ESX, Shell Script, VMware, Red Hat Satellite, Veritas Volume Manager (VVM), LDAP, Tomcat, WebSphere.