- Around 7 years of IT experience, including over four years working with Big Data and Hadoop.
- Worked on various components in the Hadoop ecosystem such as HDFS, YARN, Hive, Spark, MapReduce, Pig, HBase, Sqoop, Flume, Kafka, Oozie, and ZooKeeper.
- Excellent programming skills in Scala, Java, and Python.
- Involved in managing scalable Hadoop clusters, including cluster design, provisioning, custom configuration, monitoring, and maintenance, using the Cloudera CDH and Hortonworks HDP distributions.
- Worked with file formats such as SequenceFile, ORC, Avro, and Parquet.
- Exposure to Data Lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
- Worked on Spark APIs like Spark Core, Spark SQL, and Spark Streaming.
- Worked with Analytics on Hadoop (Alteryx, R, Python).
- Good knowledge of implementing Apache Spark and Spark Streaming projects using Scala and Spark SQL.
- Involved in performing transformations and actions on data using Spark RDDs, Datasets, DataFrames, and DStreams.
- Worked on minimizing data transfers using Broadcast variables and Accumulators in Spark.
- Involved in monitoring scheduler stages and tasks, RDD sizes and memory usage, environment information, and running executors using the Spark UI.
- Experience migrating infrastructure and applications from on-premises to AWS and from cloud to cloud, such as AWS to Microsoft Azure.
- Good knowledge of AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
- Able to build deployments on AWS, write build scripts (Boto3 and AWS CLI), and automate solutions using Shell and Python.
- Hands-on experience in Azure Cloud Services (PaaS & IaaS), Storage, Web Apps, Active Directory, Azure Container Service, VPN Gateway, Content Delivery Management, Traffic Manager, Azure Monitoring, OMS, Key Vault, Visual Studio Online (VSO), and SQL Azure.
- Implemented the processing framework for converting SQL to a graph of Map/Reduce jobs and the execution time framework to run those jobs in the order of dependencies using Hive.
- Expertise in using Aggregate functions in Hive using HQL.
- Worked on EDA Analysis and building POCs.
- Worked on HBase to store data in indexed StoreFiles on HDFS for high-speed lookups.
- Expert knowledge of Cassandra, Redis, and MongoDB, including NoSQL data modeling, tuning, and disaster recovery backup; used them for distributed storage and processing with CRUD operations.
- Worked on Ad hoc queries, Indexing, Replication, and Load balancing.
- Experienced in using Kafka as a messaging system to read and write streams of data in Hadoop environments.
- Involved in managing Hadoop jobs using Oozie.
- Generated various kinds of reports using Power BI and Tableau based on client specifications.
- Experience with monitoring and visualization tools such as Prometheus, Datadog, Grafana, and CloudWatch.
- Worked in all stages of the SDLC (Agile, Waterfall), including writing technical design documents, development, testing, and implementation of enterprise-level data marts and data warehouses.
Data Access Tools: HDFS, YARN, Hive, Pig, HBase, Solr, Impala, Spark Core, Spark SQL, Spark Streaming
Data Management: HDFS, YARN
Data Workflow: Sqoop, Flume, Kafka
Data Operation: Zookeeper, Oozie
Data Security: Ranger, Knox
Big Data Distributions: Hortonworks, Cloudera
Cloud Technologies: AWS (Amazon Web Services): EC2, S3, IAM, CloudWatch, DynamoDB, SNS, SQS, EMR, Kinesis
IDE/Build Tools: Eclipse, IntelliJ
Business Intelligence Tools: Tableau, Power BI, MS Excel, Microsoft Visio, SSIS, SSAS, SSRS
Java/J2EE Technologies: XML, JUnit, JDBC, AJAX, JSON, JSP
Operating Systems: Linux, Windows, Kali Linux
SDLC: Agile/SCRUM, Waterfall
Confidential - Alpharetta, Georgia
Senior Hadoop/Spark Developer
- Involved in combining traditional transactional and event data with social network data.
- Responsible for building scalable distributed data solutions using Apache Hadoop and Spark.
- Used Cloudera QuickStart VM for deploying the cluster.
- Involved in installation and configuration of the Cloudera Hadoop distribution (CDH 3.x and CDH 4.x).
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Installed, configured, and maintained several Hadoop clusters, which included HDFS, YARN, Hive, HBase, Knox, Kafka, Oozie, Ranger, Atlas, Infra Solr, ZooKeeper, and NiFi, in Kerberized environments.
- Worked extensively on importing metadata into Hive and migrating existing tables and applications to work on Hive and Cloudera.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in loading data from the relational database into HDFS using Sqoop.
- Worked with the Spark Core, Spark Streaming, and Spark SQL modules.
- Involved in developing generic Spark-Scala functions for transformations, aggregations, and designing schema for rows.
- Involved in minimizing data transfers over Hadoop clusters using Spark optimizations like broadcast variables and Accumulators.
- Involved in working with Spark APIs such as RDDs, Datasets, DataFrames, and DStreams to perform transformations on the data.
- Used Spark SQL to perform interactive analysis on the data using SQL and HiveQL.
- Involved in processing live data streams using Spark Streaming with high-level functions like map, reduce, join, and window.
- Worked on setting up a Kafka cluster and topics, and developed several Kafka producers and consumers in Python to stream data into Spark from multiple sources to multiple targets.
- Worked with Amazon Web Services (AWS) such as Elastic Compute Cloud (EC2), Simple Storage Service (S3), Elastic MapReduce (EMR), Amazon SimpleDB, and Amazon CloudWatch.
- Used AWS CloudFormation and other tools to collect an application's components into a single package that can be deployed and managed as one resource.
- Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Migrated the on-premises database structure to the Confidential Redshift data warehouse.
- Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets.
- Involved in pulling data from an Amazon S3 bucket into the data lake, building Hive tables on top of it, and creating DataFrames in Spark for further analysis.
- Configured AWS Lambda functions to mount an Amazon Elastic File System (Amazon EFS) file system to a local directory to access and modify shared resources safely and at high concurrency.
- Used AWS Lambda functions to process events, invoking them through the Lambda API or by configuring an AWS service or resource to invoke them.
- Utilized AWS CloudWatch to monitor environment instances for operational and performance metrics during load testing.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Analyzed the SQL scripts and designed a solution to implement them in Scala.
- Worked on troubleshooting Spark applications to make them more fault-tolerant.
- Involved in loading the processed data into the Hive warehouse.
- Stored the data in tabular formats using Hive tables and Hive SerDes.
- Implemented Static partitions, Dynamic partitions, and Buckets in Hive.
- Configured Sqoop, including JDBC drivers for the respective relational databases, parallelism control, distributed cache control, import process control, compression codecs, imports into Hive and HBase, incremental imports, saved jobs and passwords, the free-form query option, and troubleshooting.
- Used Oozie Operational Services for scheduling workflows dynamically.
- Involved in running queries using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
- Successfully loaded files from Teradata into HDFS, and from HDFS into Hive and Impala.
- Used reporting tools like Tableau to connect with Impala and generate daily reports.
Environment: Hadoop CDH 3.x, CDH 4.x, Cloudera, Spark Core, Spark SQL, Alteryx 11.0 & 11.3, Spark Streaming, Scala, Kafka, Hive, YARN, HBase, Zookeeper, Sqoop, Oozie, Tableau, Amazon EMR, Impala, JIRA, AWS, EC2, Redshift.
Confidential - San Antonio, Texas
Big Data Developer
- Developed code for importing and exporting data to and from HDFS and Hive using Sqoop.
- Worked on the Hortonworks-HDP distribution of Hadoop.
- Worked on installing and configuring Hortonworks HDP and Cloudera (CDH 5.5.1) clusters in development and production environments.
- Created an ODBC connection through Sqoop between Hortonworks and SQL Server.
- Worked on Data Lake setup with the Hortonworks and Azure teams.
- Responsible for cluster management, managing and reviewing data backups, and managing and reviewing Hadoop log files on Hortonworks.
- Created data pipelines in the cloud using Azure Data Factory.
- Implemented a POC on launching HDInsight on Azure.
- Worked with Microsoft Azure services including ADF, ADLS, Azure Blob Storage, and Cosmos DB.
- Good knowledge in building pipelines using Azure Data Factory and moving the data into Azure Data Lake Store.
- Created pipelines to move data from on-premises servers to Azure Data Lake.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Developed Spark scripts using Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
- Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
- Automated Sqoop incremental imports by using Sqoop jobs and automated jobs using Oozie.
- Worked with various compression codecs and file formats such as Avro, Parquet, and text.
- Responsible for writing Hive Queries for analyzing data in Hive warehouse using HQL.
- Responsible for creating complex dynamic partition tables using Hive for best performance and faster querying.
- Involved in developing Hive user-defined functions (UDFs) in Java, compiling them into JARs, adding them to HDFS, and executing them in Hive queries.
- Developed several advanced MapReduce programs in Java as part of functional requirements for Big Data.
- Submitted MapReduce jobs to queues as collections of jobs, allowing the system to provide the specified functionality.
- Developed multiple POCs using Spark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Developed Spark scripts using the Scala shell as per requirements.
- Developed Kafka producers and consumers and Spark clients, along with components on HDFS and Hive.
- Involved in defining Oozie job flows to schedule and manage Apache Hadoop jobs.
- Tested and reported defects following Agile methodology.
Environment: Hadoop, HDFS, Hive, MapReduce, Spark, Scala, Kafka, Oozie, Java, Linux, Azure, Cloudera.
- Collected and aggregated large amounts of structured and complex data from multiple silos, combined the data, and looked for patterns.
- Deployed multiple clusters on Cloudera CDH distributions.
- Strong knowledge of creating and monitoring Hadoop clusters on VMs, Hortonworks Data Platform 2.1 and 2.2, CDH5 Cloudera Manager, and HDP on Linux, Ubuntu, etc.
- Sound programming capability in Python and core Java, along with the Hadoop framework, utilizing Cloudera Hadoop ecosystem projects (HDFS, Spark, Sqoop, Hive, HBase, Oozie, Impala, ZooKeeper, etc.).
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using the CDH4 distribution.
- Monitored workload and job performance and performed capacity planning using Cloudera Manager.
- Exported the analyzed patterns back into Teradata using Sqoop. Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Wrote programs in Spark using Scala and Python for data quality checks.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Performed advanced procedures such as text analytics and processing in Scala, using Spark's in-memory computing capabilities.
- Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
- Involved in using Apache Flume to store data in HDFS for analysis.
- Implemented multiple MapReduce Jobs in Java for data cleansing and pre-processing of data.
- Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
- Involved in defining job flows, managing, and reviewing log files.
- Worked on analyzing the weblog data using HiveQL to extract the number of unique visitors per day, page views, and returning visitors.
- Responsible for loading data and creating Hive tables and partitions based on requirements.
- Created Hive queries that helped market analysts to spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
- Developed Spark scripts using Python shell commands as per requirements.
- Involved in data migration from various databases to Hadoop HDFS and Hive using Sqoop.
- Worked on the Oozie workflow engine for job scheduling.
- Used ZooKeeper for various types of centralized configuration.
Environment: MapReduce, Hive, Pig, Spark, Flume, ZooKeeper, Oozie, Tableau, Java, Python, UNIX.