- 8+ years of professional IT experience, including 3+ years as a Spark/Hadoop developer consultant involved in analysis, design, and development using Hadoop ecosystem components: data ingestion, data modeling, querying, processing, storage analysis, data integration, and implementing enterprise-level systems that transform big data.
- Worked on data modeling using various machine learning (ML) algorithms in R and Python (GraphLab). Programmed in Core Java and Scala.
- Knowledgeable in developing and implementing Spark programs in Scala on Hadoop to work with structured and semi-structured data.
- Used Spark for interactive queries, processing of streaming data, and integration with a NoSQL database for large volumes of data.
- Good experience in optimization and performance tuning of Spark jobs and Pig and Hive queries.
- Comfortable with data architecture, including data ingestion pipeline design, Hadoop architecture, data modeling, data mining, and advanced data processing. Experienced in optimizing ETL workflows.
- Excellent understanding of Spark Architecture and framework, Spark Context, APIs, RDDs, Spark SQL, Data frames, Streaming, MLlib.
- Good understanding of Hadoop Gen1/Gen2 architecture, with hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, Secondary NameNode, and DataNode; the YARN architecture and its daemons (NodeManager, ResourceManager, and ApplicationMaster); and the MapReduce programming paradigm.
- Hands on experience in using the Hue browser for interacting with Hadoop components.
- Good understanding and Experience with Agile and Waterfall methodologies of Software Development Life Cycle (SDLC).
- Proficient in Oracle packages, procedures, functions, triggers, views, SQL*Loader, performance tuning, UNIX shell scripting, and data architecture.
- Extensive hands-on experience in data extraction, transformation, and loading (ETL), data analysis, and data visualization using the Cloudera platform (Spark, Scala, HDFS, Hive, Sqoop, Kafka, Oozie).
- Developed end-to-end Spark applications in Scala to perform data cleansing, validation, transformation, and summarization as per requirements.
- Extracted data from heterogeneous sources such as flat files, MySQL, and Teradata into HDFS using Sqoop, and vice versa.
- Broad experience working with structured data using Spark SQL, DataFrames, and HiveQL, optimizing queries, and incorporating complex UDFs into business logic.
- Experience working with Text, Sequence files, XML, Parquet, JSON, ORC, AVRO file formats and Click Stream log files.
- Experienced in migrating ETL transformations using Spark jobs and Pig Latin Scripts.
- Experience in transferring Streaming data from different data sources into HDFS and HBase using Apache Kafka and Flume.
- Experience in using Oozie schedulers and Unix scripting to implement cron jobs that execute different kinds of Hadoop actions.
- Highly motivated self-learner with a positive attitude, a willingness to learn new concepts, and a readiness to accept challenges.
Big Data Technologies: Spark and Scala, Hadoop Ecosystem Components - HDFS, Hive, Sqoop, Impala, Flume, Map Reduce, Pig and Cloudera Hadoop Distribution CDH 5.8.2
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure and Google Cloud
Monitoring Tools: Cloudera Manager
Programming Languages: Scala, Java, SQL, PL/SQL, Python.
Scripting Languages: Shell Scripting, CSH.
NoSQL Databases: HBase
Databases: Oracle 11g, MySQL, MS SQL Server
Operating Systems: Windows 7/8/10, Unix, Linux
Other Tools: Hue, IntelliJ IDEA, Eclipse, Maven, ZooKeeper
Front End Technologies: HTML5, XHTML, XML, CSS
Confidential, Nashville, TN
Big Data/Hadoop Developer
- Used Flume to collect and aggregate web log data from different sources such as web servers and push it to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run as MapReduce jobs in the backend.
- Developed HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Involved in debugging Map Reduce jobs using MRUnit framework and optimizing Map Reduce jobs.
- Involved in troubleshooting errors in Shell, Hive, and MapReduce.
- Worked on debugging, performance tuning of Hive & Pig Jobs.
- Implemented solutions for ingesting data from various sources and processing it with big data technologies such as Hive, Pig, Sqoop, HBase, and MapReduce.
- Developed Pig scripts to migrate the existing home loans legacy process to Hadoop, with data fed back to the retail legacy mainframe systems.
- Implemented solutions for ingesting data from various sources and processing it with big data technologies such as Hive, Spark, Pig, Sqoop, HBase, and MapReduce.
- Worked on creating Combiners, Partitioners and Distributed cache to improve the performance of Map Reduce jobs.
- Developed MapReduce programs for data extraction, transformation, and aggregation. Supported MapReduce jobs running on the cluster.
- Developed Pig Scripts to generate Map Reduce jobs and performed ETL procedures on the data in HDFS.
- Experienced in handling Avro data files by passing schema into HDFS using Avro tools and Map Reduce.
- Optimized MapReduce algorithms using combiners and partitioners to deliver the best results, and worked on application performance optimization for an HDFS cluster.
- Wrote Hive Queries to have a consolidated view of the telematics data.
- Orchestrated many Sqoop scripts, Pig scripts, Hive queries using Oozie workflows and sub workflows.
- Designed and implemented MapReduce jobs to support distributed processing using MapReduce, Hive, and Apache Pig.
- Created Hive external tables on the MapReduce output, then applied partitioning and bucketing on top of them.
Environment: Hadoop, Map Reduce, HDFS, Hive, Pig, Sqoop, Hbase, DB2, Flume, Oozie, CDH 5.6.1, Maven, Unix Shell Scripting.
Confidential, Dallas, TX
Spark/ Hadoop Developer
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data.
- Tuned Spark job performance by changing configuration properties and using broadcast variables.
- Streamed data in real time using Spark with Kafka. Responsible for handling streaming data from web server console logs.
- Tuned the performance of long-running Greenplum user-defined functions. Used temporary tables to break the code into smaller parts, loading each part into a temp table and joining it later with the corresponding tables. Refined table distribution keys based on data granularity and the primary key column combination.
- Worked with numerous file formats such as text, sequence files, Avro, Parquet, ORC, JSON, XML, and flat files using MapReduce programs.
- Expanded the daily process to perform incremental imports of data from DB2 and Teradata into Hive tables using Sqoop.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Resolved performance issues in Hive and Pig scripts by analyzing joins, grouping, and aggregation and how they translate to MapReduce jobs.
- Loaded data into Spark RDDs and performed in-memory computation to generate output matching the requirements.
- Involved in writing Spark applications in Scala to perform data cleansing, validation, transformation, and summarization as per requirements.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyze operational data.
- Developed Spark jobs, Hive jobs to encapsulate and transform data.
- Fine-tuned Spark applications to improve performance.
- Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
- Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that determine business insights and solve clients' operational and strategic problems.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
- Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
- Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
- Assisted analytics team by writing Pig and Hive scripts to perform further detailed analysis of the data.
- Designed Oozie workflows for job scheduling and batch processing.
Environment: Java 1.8, Scala 2.10.5, Apache Spark 1.6.0, MySQL, CDH 5.8.2, IntelliJ IDEA, Hive, HDFS, YARN, Map Reduce, Sqoop 1.4.3, Flume, Unix Shell Scripting, Python 2.6, Apache Kafka.
Confidential, Malvern, PA
- Worked as Hadoop admin responsible for everything related to the clusters, totaling 150 nodes and ranging from proof-of-concept (POC) to production (PROD) clusters.
- Experience with implementing High Availability for HDFS, Yarn, Hive and HBase.
- Commissioned and decommissioned nodes as needed.
- Component unit testing using Azure Emulator.
- Implemented NameNode automatic failover using the ZooKeeper Failover Controller (ZKFC).
- As Hadoop admin, monitored cluster health daily, tuned performance-related configuration parameters, and backed up configuration XML files.
- Introduced SmartSense to obtain optimization recommendations from the vendor and to help troubleshoot issues.
- Good experience with Hadoop Ecosystem components such as Hive, HBase, Pig and Sqoop.
- Configured Kerberos and installed the MIT Kerberos ticketing system.
- Secured the Hadoop cluster from unauthorized access by Kerberos, LDAP integration and TLS for data transfer among the cluster nodes.
- Installed and configured CDAP, an ETL tool, in the development and production clusters.
- Integrated CDAP with Ambari for easy operations monitoring and management.
- Experienced in installing, configuring, deploying, maintaining, monitoring, and troubleshooting Hadoop clusters in development, test, and production environments using the Ambari front-end tool and scripts.
- Created databases in MySQL for Hive, Ranger, Oozie, Dr. Elephant and Ambari.
- Used Sqoop to import data from databases such as MySQL and Oracle into HDFS and Hive, and to export it back.
- Worked on the Hortonworks distribution; Hortonworks is a major contributor to Apache Hadoop.
- Installed Apache NiFi with Hortonworks DataFlow to make data ingestion from the Internet of Anything fast, easy, and secure.
- Completed the end-to-end design and development of an Apache NiFi flow that acts as the agent between the middleware team and the EBI team and executes all the actions mentioned above.
- Hands on experience in installation, configuration, supporting and managing Hadoop Clusters.
- Replaced retired Hadoop slave nodes through the AWS console and Nagios repositories.
- Installed and configured Ambari metrics, Grafana, Knox, Kafka brokers on Admin Nodes.
- Interacted with application teams to install operating system and Hadoop updates, patches, version upgrades when required.
- Used CDAP to monitor the datasets and workflows to ensure smooth data flow.
- Monitored the Hadoop cluster and proactively optimized and tuned it for performance.
- Experienced in defining job flows. Enabled Ranger security on all clusters.
- Experienced in managing and reviewing Hadoop log files.
- Connected to HDFS from third-party tools such as Teradata SQL Assistant via an ODBC driver.
- Installed Grafana as the metrics analytics and visualization suite.
- Installed various services like Hive, HBase, Pig, Oozie, and Kafka.
- Monitored local file system disk space and CPU usage using Ambari.
- Handled production support responsibilities, including cluster maintenance.
- Integrated Apache Storm with Kafka to perform web analytics and to load clickstream data from Kafka into HDFS.
- Engaged in requirements review meetings and interacted with business analysts to clarify specific scenarios.
Environment: HDP, Ambari, HDFS, MapReduce, YARN, Hive, NiFi, Flume, Pig, ZooKeeper, Tez, Oozie, MySQL, Puppet, and RHEL
Linux / Hadoop Developer
- Tuned infrastructure and Hadoop settings for optimal job performance and throughput.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action, including for lab clusters.
- Designed the Cluster tests before and after upgrades to validate the cluster status.
- Set up clusters and installed all ecosystem components through MapR and manually through the command line in the lab cluster.
- Designed and developed Hadoop system to analyze the SIEM (Security Information and Event Management) data using MapReduce, HBase, Hive, Sqoop and Flume.
- Performed regular maintenance using Cloudera Manager, commissioning and decommissioning nodes as disk failures occurred.
- Documented and prepared run books of systems process and procedures for future references.
- Performed Benchmarking and performance tuning on the Hadoop infrastructure.
- Migrated the Hive schema from the production cluster to the DR cluster.
- Worked on migrating applications from relational database systems by carrying out POCs.
- Helped users and teams with incidents related to administration and development.
- Onboarded new users migrating to our clusters and trained them on best practices.
- Guided users in development and worked closely with developers to prepare a data lake.
- Tested and performed enterprise-wide installation, configuration, and support for Hadoop using the MapR distribution.
- Migrated data from SQL Server to HBase using Sqoop.
- Scheduled data pipelines for automation of data ingestion in AWS.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (like MapReduce, Pig, Hive, Sqoop) as well as system specific jobs.
- Experienced in MapR patching and cluster upgrades using proper strategies.
- Developed entire data transfer model using Sqoop framework.
- Configured a Flume agent with a Flume syslog source to receive data from syslog servers.
- Implemented Hadoop NameNode HA to make the Hadoop services highly available.
- Imported data from RDBMS into Hive and HDFS, and exported data from Hive and HDFS to RDBMS, using Sqoop.
- Installed and managed multiple Hadoop clusters - Production, stage, development.
- Developed custom Writable MapReduce Java programs to load web server logs into HBase using Flume.
- Installed and managed a 150-node production cluster with 4+ PB.
- Performed AWS EC2 instance mirroring, WebLogic domain creations and several proprietary middleware Installations.
- Monitored multiple Hadoop cluster environments using Nagios. Monitored workload and job performance and performed capacity planning using the MapR Control System.
- Automated data loading between production and disaster recovery cluster.
- Processed and analyzed log data stored in HBase, then imported it into the Hive warehouse, enabling business analysts to write HQL queries.
- Utilized AWS for content storage and Elasticsearch for document search.
Environment: Red Hat Linux (RHEL 3/4/5), Solaris, Logical Volume Manager, Sun & Veritas Cluster Server, Global File System, Red Hat Cluster Servers.
- Troubleshooting and analysis of hardware and failures for various Solaris servers (Core dump and log file analysis)
- Performed configuration and troubleshooting of services like NFS, FTP, LDAP and Web servers.
- Installation and configuration of VxVM, Veritas file system (VxFS).
- Management of Veritas Volume Manager (VxVM), Zettabyte File System (ZFS), and Logical Volume Manager.
- Involved in patching Solaris and RedHat servers.
- Worked with NAS and SAN concepts and technology.
- Installation and configuration of Solaris 9/10 and Red Hat Enterprise Linux 5/6 systems.
- Addition and configuration of SAN disks for LVM on Linux, and Veritas Volume Manager and ZFS on Solaris LDOMs.
- Provided production support and 24/7 support on rotation basis.
- Involved in building servers using jumpstart and kickstart in Solaris and RHEL respectively.
- Installation and configuration of RedHat virtual servers using ESXi 4/5 and Solaris servers (LDOMS) using scripts and Ops Center.
- Managed Logical volumes, Volume Groups, using Logical Volume Manager.
- Performed package and patches management, firmware upgrades and debugging.
- Configuration and troubleshooting of NAS mounts on Solaris and Linux Servers.
- Configuration and administration of ASM disks for Oracle RAC servers.
- Analyzing and reviewing the System performance tuning and Network Configurations.
- Configured and maintained Network Multipathing in Solaris and Linux.
- Configuration of multipathing and EMC PowerPath on Linux and Solaris servers.
- Performed a POC on Tableau that included running load tests and evaluating system performance with large amounts of data.
Environment: Solaris 9/10/11, RedHat Linux 4/5/6, AIX, Sun Enterprise Servers E5500/E4500, Sun Fire V1280/480/440, Sun SPARC 1000, HP 9000 K, L, and N class servers, HP & Dell blade servers, IBM RS/6000.