- An accomplished Hadoop/Spark developer experienced in ingestion, storage, querying, processing and analysis of big data.
- Experienced with big data frameworks: Kafka, Spark, HDFS, HBase and ZooKeeper.
- Experienced with Spark Streaming, Spark SQL and Kafka for real-time data processing.
- Excellent programming skills at a high level of abstraction in Scala, Java and Python.
- Well versed in using Elasticsearch to extract, transform and index source data.
- Extensive experience working with various Hadoop distributions, including enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, and good knowledge of the MapR distribution.
- Hands-on expertise in row key and schema design for NoSQL databases such as MongoDB 3.0.1, HBase, Cassandra and DynamoDB (AWS).
- Experience in using DStreams, accumulators, broadcast variables and RDD caching for Spark Streaming.
- Experienced in implementing schedulers using Oozie, Airflow, crontab and shell scripts.
- Extensive experience in importing and exporting streaming data into HDFS using stream processing platforms such as Flume and the Kafka messaging system.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Strong experience and knowledge of real time data analytics using Spark Streaming, Kafka and Flume.
- Exposure to Data Lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Well versed in Spark components such as Spark SQL, MLlib, Spark Streaming and GraphX.
- Used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Expertise in writing Spark RDD transformations and actions, DataFrames and case classes for the required input data, and in performing data transformations using Spark Core.
- Experience in integrating Hive queries into the Spark environment using Spark SQL.
- Expertise in performing real time analytics on big data using HBase and Cassandra.
- Wrote multiple MapReduce jobs using the Java API, Pig and Hive for data extraction, transformation and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV and ORC, and compression codecs such as gzip, Snappy and LZO.
- Experienced in monitoring cluster status using Cloudera Manager and Ambari.
- Experience in developing data pipelines using Pig, Sqoop and Flume to extract data from weblogs and store it in HDFS. Accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Developed custom UDFs and UDAFs in Java to extend Pig and Hive core functionality.
- Very familiar with creating Hive tables, writing Hive joins and HQL queries, and developing complex Hive UDFs.
- Experience in writing complex SQL queries, PL/SQL, views, stored procedures and triggers.
- Experience in optimizing MapReduce algorithms using mappers, reducers, combiners and partitioners to deliver the best results for large datasets.
- Expert in coding Teradata SQL, Teradata stored procedures, macros and triggers.
- Experienced in migrating data from various sources using the pub-sub model in Redis and Kafka producers and consumers, and in preprocessing data using Storm topologies.
- Competent in using the configuration and automation tools Chef, Puppet and Ansible. Configured and administered CI tools such as Jenkins, Hudson and Bamboo for automated builds.
- Working knowledge of Amazon's Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Knowledge of installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4) distributions and on Amazon Web Services (AWS).
- Built secure AWS solutions by creating VPCs with public and private subnets.
- Proficient in developing, deploying and managing Solr from development to production.
- Experienced in using build tools such as Maven, Ant and SBT, along with Log4j for logging, to build and deploy applications to the server.
- Worked on data warehousing and ETL tools like Informatica, Talend, and Pentaho.
- Worked on the ELK stack (Elasticsearch, Logstash, Kibana) for log management.
- Extensive experience in developing applications that perform data processing tasks using Teradata, Oracle, SQL Server and MySQL databases.
- Hands-on knowledge of core Java concepts such as exceptions, collections, data structures, I/O, multithreading, and serialization and deserialization for streaming applications.
- Worked with various programming languages using IDEs such as Eclipse, NetBeans and IntelliJ.
- Experienced with ticketing tools such as Rally and JIRA for tracking issues and bugs, GitHub for code reviews, and version control tools such as CVS, Git and SVN.
- Experience in maintaining Apache Tomcat, MySQL, LDAP, LAMP and web service environments.
- Good working experience in importing data with Sqoop and SFTP from sources such as RDBMS, Teradata, mainframes, Oracle and Netezza into HDFS, and performing transformations on it using Hive, Pig and Spark.
- Designed ETL workflows on Tableau and deployed data from various sources to HDFS.
- Experience in working with different data sources such as flat files, XML files and databases. Domain experience includes ERP and software quality processes.
- Experience in complete Software Development Life Cycle (SDLC) in both Waterfall and Agile methodologies.
- Good understanding of all aspects of testing, such as unit, regression, Agile, white-box and black-box.
- Experience with best practices of Web services development and Integration (REST and SOAP).
- Experience in automating database activities with Unix shell scripts.
- Working experience with Linux distributions such as Red Hat and CentOS.
- Good analytical, communication and problem-solving skills, and a love of learning new technical and functional skills.
- Experienced in Agile, Scrum, Waterfall and Test-Driven Development methodologies.
Big Data Ecosystem: HDFS, YARN, MapReduce, Hive, Pig, Impala, Sqoop, Storm, Flume, Spark, Apache Kafka, Zookeeper, Solr, Ambari, Oozie.
NoSQL Databases: HBase, Elasticsearch, Cassandra, MongoDB, Amazon DynamoDB.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache.
Languages: Java, C, C++, Scala, Python, XML, XHTML, HTML, AJAX, CSS, SQL, PL/SQL, Pig Latin, HiveQL, Unix, JavaScript, Shell Scripting
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB
Source Code Control: GitHub, Bitbucket, SVN
Application Servers: WebSphere, WebLogic, JBoss, Tomcat
Cloud Computing Tools: Amazon AWS (S3, EMR, EC2, Lambda, VPC, Route 53, CloudWatch, CloudFront), Microsoft Azure
Databases: Teradata, Oracle 10g/11g, Microsoft SQL Server, MySQL, DB2
DB languages: MySQL, PL/SQL, PostgreSQL & Oracle
Build Tools: Jenkins, Maven, ANT, Log4j
Business Intelligence Tools: Tableau, Splunk, Dynatrace
Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, NetBeans
ETL Tools: Talend, Pentaho, Informatica.
Development Methodologies: Agile, Scrum, Waterfall.
Confidential - Redmond, WA
Hadoop/Spark Developer
- Imported data from various data sources, performed transformations using Spark and Python, and loaded the data into Cassandra.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Optimized the PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Analysed the SQL scripts and redesigned them using PySpark SQL for faster performance.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Wrote real-time processing jobs using Spark Streaming with Kafka on AWS EMR.
- Worked with AWS Glue jobs to transform data into a format that optimizes query performance for Athena.
- Used the Spark API on an EMR cluster with Hadoop YARN to perform analytics on data.
- Analysed real-time data with Spark Streaming via Kafka.
- Installed the Apache Kafka cluster and Confluent Kafka open source in different environments.
- Extensively involved in infrastructure as code, execution plans, resource graphs and change automation using Terraform.
- Addressed technical and operational challenges arising from production issues involving Kafka security.
- Estimated effort for new applications to be developed and the code changes pertaining to them.
- Installed and configured MapReduce, HiveQL and HDFS; implemented a CDH3 Hadoop cluster on CentOS. Assisted with performance tuning and monitoring.
- Completed production programming, deployments and deliveries within the defined SLAs.
- Negotiated, concluded and monitored service level agreements (SLAs) with key stakeholders.
- Ensured SLAs were clearly defined and agreed to by the business partner.
- Worked on import and export of data from MySQL to HDFS using the ETL tool Sqoop.
- Involved in tuning and debugging of existing ETL processes.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3.
- Performed data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark.
- Queried and analysed data from Cassandra for quick searching, sorting and grouping.
- Derived business insights from extremely large datasets using Google BigQuery.
- Enabled and automated data pipelines for moving data from Oracle and DB2 source tables to Hadoop and Google BigQuery, using GitHub for source control and Jenkins.
- Used JSON schema to define table and column mapping from S3 data to Redshift.
- Used Teradata and Oracle databases with the Informatica DW tool to load source data.
- Created Teradata schemas with constraints and created macros, functions and procedures in Teradata. Loaded the data using the FastLoad utility.
- Developed batch and streaming workflows with the built-in Stonebranch scheduler and bash scripts to automate the Data Lake systems.
- Involved in writing SCOPE scripts in Data Lake and HDFS to structure the petabytes of unstructured data stored in Azure Data Lake.
- Created various parser programs in Scala to extract data from Autosys, Tibco Business Objects, XML, Informatica, Java and database views.
- Strong programming skills in designing and implementing multi-tier applications using web-based technologies such as Spring MVC and Spring Boot.
- Developed a server-side application to interact with the database using Spring Boot and Hibernate.
- Worked on data masking/de-identification using Informatica.
- Extensively used the Akka actor architecture for scalable and hassle-free multithreading.
- Worked on integrating independent microservices for real-time bidding (Scala/Akka, Firebase, Cassandra, Elasticsearch).
- Involved in developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analysing and transforming the data to uncover insights into customer usage patterns.
- Worked on a large-scale Hadoop YARN cluster for distributed data processing and analysis using Databricks connectors, Spark Core, Spark SQL, KSQL, Sqoop, Pig, Hive, Impala and NoSQL databases.
- Involved in building ETL pipelines on Kubernetes with Apache Airflow and Spark in GCP.
- Worked with Docker-based containers for running Airflow.
- Completed POCs on newly adopted technologies such as Apache Airflow, Snowflake and GitLab.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Involved in SQL query tuning and Informatica Performance Tuning.
- Used Spark SQL to read data from external sources and processed the data using the Scala computation framework.
- Involved in writing efficient search queries against Solr indexes using the Solr REST/Java API.
- Developed microservices using Spring Boot and core Java/J2EE hosted on AWS.
- Extensive experience with search engines such as Apache Lucene, Apache Solr and Elasticsearch.
- Experienced in deploying Apache Solr/ZooKeeper clusters in the cloud and on premises, and in working on data storage and disaster recovery for Solr/ZooKeeper.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Good Knowledge of Atlassian SDK, JIRA Java API, JIRA REST APIs and JIRA plugin development.
- Developed a Python-based API for converting files to key-value pairs for the files sourced to the Splunk forwarder.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Enhanced and provided core design impacting the Splunk framework and components.
- Built and published customized interactive reports and dashboards, and scheduled reports using Tableau Server.
- Deployed various microservices such as Spark, MongoDB and Cassandra in Kubernetes and Hadoop clusters using Docker.
- Deployed Kubernetes in both AWS and Google Cloud. Set up the cluster and replicator. Deployed multiple containers in a pod.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
- Automated application and MySQL container deployment in Docker using Python.
- Deployed Azure Resource Manager templates, the deployment method used by Data Factory in the Azure portal.
- Experience with Google Cloud components, Google container builders, GCP client libraries and Cloud SDKs.
- Worked extensively on building and maintaining clusters managed by Kubernetes, using Linux, Bash, Git and Docker on GCP (Google Cloud Platform).
- Extensively worked on creating Ansible Playbooks for the application deployment and configuration changes.
- Managed AWS infrastructure as code using Terraform.
- Designed data flows and configured their sizing using Airflow and Azure Databricks, which were also used in most cases for debugging and previewing data.
- Assigned the built-in Contributor role on the Azure Data Factory resource to the user.
- Developed Spark SQL queries and shell scripts for transforming and processing large amounts of data.
- Worked on Data Migration and Data Reconciliation Projects.
- Coordinated with other teams and the release management team on deployment-related activities.
- Provided production support and fixes for the migrated applications.
- Coordinated team meetings, client meetings and scrum calls for day-to-day activities.
Technologies: HDFS, Pig, HiveQL, Oozie, ZooKeeper, MapReduce, Google Cloud, MongoDB, AWS, Azure, Java 1.8, Scala 2.11.8, Hive, BigQuery, Java API, YARN, Apache Spark 1.6.1, Solr, Cassandra, Tableau, Sqoop, Kafka, Python, Shell Scripting.
Confidential - Dallas, TX
- Performed data ingestion from various APIs that hold geospatial location, weather and product-based information on the fields and the products grown in them.
- Cleaned and processed the data obtained and performed statistical analysis on it to get useful insights.
- Explored the Spark framework to improve the performance and optimization of existing algorithms in Hadoop using Spark Core, Spark SQL and Spark Streaming APIs.
- Ingested data from relational databases into HDFS on a regular basis using Sqoop incremental import.
- Extracted structured data from multiple relational data sources as DataFrames in Spark SQL.
- Involved in schema extraction from file formats such as Avro and Parquet.
- Transformed the DataFrames as per the requirements of the data science team.
- Loaded the data into HDFS in Parquet and Avro formats with compression codecs such as Snappy and LZO, as required.
- Worked on the integration of Kafka service for stream processing.
- Worked towards creating near-real-time data streaming solutions using Spark Streaming and Kafka, persisting the data in Cassandra.
- Involved in data modeling, ingesting data into Cassandra using CQL, Java APIs and other drivers.
- Implemented CRUD operations using CQL on top of the Cassandra file system.
- Analysed transactional data in HDFS using Hive and optimized query performance by segregating the data using clustering and partitioning.
- Set up databases in AWS using RDS and storage using S3 buckets, and configured instance backups to S3. Prototyped a CI/CD system with GitLab on GKE, using Kubernetes and Docker as the runtime environment to build, test and deploy.
- Developed Spark applications for various business logic using Scala.
- Created dynamic visualizations displaying location-based statistics of the data on maps.
- Wrote RESTful APIs in Scala to implement the defined functionality.
- Collaborated with other teams in the data pipeline to achieve desired goals.
- Used Amazon DynamoDB to gather and track event-based metrics.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Worked with various AWS Components such as EC2, S3, IAM, VPC, RDS, Route 53, SNS and SQS.
- Involved in pulling data from the Amazon S3 data lake and building Hive tables using HiveContext in Spark.
- Involved in running Hive queries and Spark jobs on data stored in S3.
- Ran short-term ad hoc queries and jobs on data stored in S3 using AWS EMR.
Environment: Hadoop, HDFS, Hive, Kafka, Sqoop, Shell Scripting, Spark, AWS EMR, Linux (CentOS), AWS S3, Cassandra, Java, Scala, Eclipse, Maven, Agile.