- 8+ years of IT experience across a variety of industries, including hands-on experience in Hadoop, Hive, Sqoop, and Spark.
- Expertise in coding in multiple technologies, i.e., Python, Java, and Unix shell scripting.
- Extensive experience in the Big Data ecosystem and its various components, such as Spark, Scala, MapReduce, HDFS, Hive, Pig, Sqoop, ZooKeeper, Oozie, and Flume.
- Well versed in Amazon Web Services (AWS) Cloud services such as EC2, S3, EMR, and DynamoDB.
- Good working knowledge of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
- Experience in handling different file formats such as text files, sequence files, and Avro data files, using different SerDes in Hive.
- Proficient in using columnar file formats such as RCFile, ORC, and Parquet.
- Experience in working with the MapReduce framework and the Spark execution model.
- Hands-on experience in programming with Resilient Distributed Datasets (RDDs) and the DataFrame and Dataset APIs.
- Experienced in partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Experienced in writing custom Hive UDFs to incorporate business logic into Hive queries.
- Experience in process improvement, normalization/denormalization, data extraction, data cleansing, and data manipulation in Hive.
- Experience in loading data files from HDFS to Hive for reporting.
- Experience in writing Sqoop commands to import data from relational databases to HDFS.
- Experience with SQL Server and Confidential databases and in writing queries.
- Explored Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Extensively used Pig for data cleansing.
- Developed Pig UDFs to pre-process the data for analysis.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Used Python scripts to build a workflow in Autosys to automate the tasks in three zones in the cluster.
- Experienced in working with business teams to gather and fully understand the business requirements.
- Designed and created data extracts supporting reporting applications in Power BI, Tableau, and other visualization tools.
- Knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for data analysis.
- Experience in Database design, Data analysis, Programming SQL.
- Hands-on experience in designing REST-based microservices using Spring Boot and Spring Data with JPA.
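The RDD programming model noted above chains map, filter, and reduce steps over a collection; a minimal pure-Python sketch of the same pattern (illustrative only; the records and field layout are hypothetical, and no Spark installation is assumed):

```python
from functools import reduce

# Hypothetical raw CSV records, standing in for an RDD loaded from HDFS
records = ["2023-01-01,click,5", "2023-01-02,view,3", "2023-01-03,click,7"]

# map: parse each line into an (event, count) pair
parsed = [(line.split(",")[1], int(line.split(",")[2])) for line in records]

# filter: keep only click events
clicks = [pair for pair in parsed if pair[0] == "click"]

# reduce: sum the counts, as a reduceByKey would do per key
total_clicks = reduce(lambda acc, pair: acc + pair[1], clicks, 0)
print(total_clicks)  # 5 + 7 = 12
```

In Spark the same chain would run lazily and in parallel across partitions; the sequential version above only shows the shape of the transformations.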
Big Data Technologies: HDFS, MapReduce, YARN, Hive, Pig, Pentaho, HBase, Oozie, ZooKeeper, Sqoop, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy.
NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis.
Cloud Services: Amazon AWS, Google Cloud.
ETL Tools: Informatica, IBM DataStage, Talend.
Application Servers: Web Logic, Web Sphere, JBoss, Tomcat.
Databases: Confidential, MySQL, DB2, Teradata, Microsoft SQL Server.
Operating Systems: UNIX, Windows, iOS, LINUX.
Build Tools: Jenkins, Maven, ANT, Azure.
Frameworks: MVC, Struts, Spring, Hibernate.
Version Controls: Subversion, Git, Bitbucket, GitHub.
Methodologies: Agile, Waterfall.
Sr. Hadoop Spark Developer
Confidential - Austin, TX
- Wrote Apache Pig scripts to process HDFS data.
- Created Hive tables to store the processed results in a tabular format.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which consumes data from Kafka in near real time and persists it into Cassandra.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Python.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory tuning.
- Experienced in handling large datasets using partitions, Spark's in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
- Worked on migrating Map Reduce programs into Spark transformations using Spark and Python.
- Strong experience in working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
- Pulled Excel data into HDFS.
- Pulled data from MySQL databases into Hive tables using NiFi.
- Worked with Apache NiFi as an ETL tool for batch and real-time processing.
- Designed and Developed Spark workflows using Scala for data pull from cloud-based systems and applying transformations on it.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data using Sqoop, both from Hadoop Distributed File System to relational database systems and from relational database systems back to HDFS.
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Experience in using Flume to efficiently collect, aggregate and move large amounts of log data.
- Developed Spark scripts using Python shell commands as per requirements.
- Developed Hive queries and UDFs.
- Developed ETL workflow which pushes webserver logs to an Amazon S3 bucket.
- Worked extensively with Dimensional modeling, Data migration, Data cleansing, ETL Processes for data warehouses.
- Extensively used Informatica Power Center Data Validation tool to unit test the ETL mappings.
- Worked with BI teams in generating the reports and designing ETL workflows on Tableau.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDD and Scala.
- Designed the ETL process and created the High-level design document including the logical data flows, source data extraction process, the database staging, job scheduling and Error Handling.
- Wrote Python scripts for processing data and loading it to HDFS.
- Created External Hive Table on top of parsed data.
- Moved all log/text files generated by various products into HDFS location using NiFi.
- Actively participated in Scrum meetings and followed Agile methodology for implementation.
Environment: Linux/UNIX, CentOS, Hadoop 2.4.x, Oozie, Hive 0.13, Sqoop, Kafka, Cassandra, Spark, Hortonworks 2.1.1, AWS, Tableau, Avro.
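The Kafka-to-Cassandra streaming pipeline described above centers on a per-micro-batch transformation step; a plain-Python sketch of that step (illustrative only; all field names and the message schema are hypothetical):

```python
import json

def transform_batch(batch):
    """Transform one micro-batch of raw Kafka messages (JSON strings)
    into rows ready to persist into a learner-model table.
    All field names here are hypothetical."""
    rows = []
    for msg in batch:
        event = json.loads(msg)
        rows.append({
            "learner_id": event["id"],
            "course": event["course"].lower(),  # normalize for consistent keys
            "score": float(event["score"]),     # raw feed delivers strings
        })
    return rows

# Example micro-batch as it might arrive from a Kafka consumer
batch = ['{"id": "u1", "course": "MATH", "score": "88"}',
         '{"id": "u2", "course": "Bio", "score": "91.5"}']
print(transform_batch(batch))
```

In the real job this function body would run inside a Spark Streaming `foreachRDD`/`foreachBatch` callback, with the resulting rows written to Cassandra by the connector.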
Confidential - Orlando, FL
- Responsible for building customer centric Data Lake in Hadoop which would serve as the Analysis and Data Science Platform.
- Responsible for building scalable distributed data solutions on Cloudera distributed Hadoop.
- Used Sqoop, Kafka for migrating data and incremental import into HDFS and Hive from various other data sources.
- Modeled and built Hive tables to combine and store structured and unstructured data sources for best possible access.
- Integrated Cassandra file system to Hadoop using Map Reduce to perform analytics on Cassandra data.
- Used Cassandra to store billions of records, enabling faster and more efficient querying, aggregation, and reporting.
- Developed Spark Jobs using Python (Pyspark) APIs.
- Developed business logic using Scala.
- Migrated Python programs into Spark Jobs for Various Processes.
- Involved in Job management and Developed job processing scripts using Oozie workflow.
- Implemented optimization techniques in Hive such as partitioning tables, denormalizing data, and bucketing.
- Used Spark SQL to create structured data using DataFrames and to query Hive and other data sources.
- Supported data scientists with data and platform setup for their analysis, and migrated their finished products to production.
- Worked on cleansing and extracting meaningful information from click stream Data using Spark and Hive.
- Involved in performance tuning of Spark Applications for setting right level of Parallelism and memory tuning.
- Used optimization techniques in spark like Data Serialization and Broadcasting.
- Optimizing of existing algorithms in Hadoop using Spark, Spark-SQL and Data Frames.
- Implemented a POC for persisting clickstream data with Apache Kafka.
- Implemented data pipelines to move processed data from Hadoop to RDBMS and NoSQL databases.
- Followed Agile & Scrum principles in developing the project.
Environment: Hadoop (2.6.5), HDFS, Spark (2.0.2), Spark SQL, Sqoop, Hive, Apache Kafka (0.10.1.0), Python, Scala (2.11), PySpark, Cassandra, Oozie, Cloudera (CDH 5).
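The partitioning and bucketing optimizations mentioned above rest on two simple mechanics: Hive lays out one HDFS directory per partition value, and bucketing assigns rows to a fixed number of files by hashing a key. A plain-Python sketch of both ideas (illustrative only; paths, table names, and the hash are stand-ins, not Hive's actual implementation):

```python
def partition_path(base, table, dt, region):
    # Hive materializes one directory per partition column value,
    # e.g. /warehouse/clicks/dt=2021-06-01/region=us, so a query
    # filtering on dt/region only scans the matching directories.
    return f"{base}/{table}/dt={dt}/region={region}"

def bucket_for(key, num_buckets=4):
    # Bucketing maps each row to hash(key) % num_buckets; Hive uses
    # its own hash function, this stand-in just shows the idea.
    return sum(ord(c) for c in key) % num_buckets

print(partition_path("/warehouse", "clicks", "2021-06-01", "us"))
print(bucket_for("user42"))
```

Partition pruning cuts I/O for selective queries, while consistent bucketing on the join key lets Hive run bucketed map-side joins without a full shuffle.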
- Managed a 300+ node HDP cluster with 14 petabytes of data using Ambari and Linux CentOS.
- Installed and configured Hortonworks Ambari for easy management of the existing Hadoop cluster.
- Responsible for the design and implementation of a multi-datacenter Hadoop environment intended to support the analysis of large amounts of unstructured data along with ETL processing.
- Coordinated with the Hortonworks support team through the support portal to sort out critical issues during upgrades.
- Conducting RCA to find out data issues and resolve production problems.
- Responsible for troubleshooting issues in the execution of Map Reduce jobs by inspecting and reviewing log files.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Spark.
- Worked with big data developers, designers and scientists in troubleshooting map reduce job failures and issues with Hive, Pig and Sqoop.
- Experienced in job workflow scheduling and monitoring tools like Oozie and Zookeeper.
- Worked on the design, implementation, configuration, and performance tuning of a Hortonworks HDP 2.3 cluster with High Availability and Ambari 2.2.
- Analyzed server logs for errors and exceptions; scheduled and monitored Jenkins job builds and console outputs.
- Worked in an Agile/Scrum environment and used Jenkins and GitHub for continuous integration and deployment.
- Experience with JIRA and ServiceNow to track issues on the big data platform.
- Experienced in managing and reviewing Hadoop log files.
- Configured Jenkins for successful deployment to test and production environments.
- Worked with Sqoop in importing and exporting data from different databases, such as MySQL and Confidential, into HDFS and Hive.
- Worked on setting up high availability for major production cluster and designed automatic failover control using zookeeper and quorum journal nodes.
- Experience with HBase high availability, manually tested using failover tests.
- Created queues and allocated cluster resources to prioritize jobs.
- Working experience in creating and maintaining MySQL databases, setting up users, and backing up cluster metadata databases with cron jobs.
- Provided technical assistance for configuration, administration, and monitoring of Hadoop clusters.
- Coordinated with technical teams for installation of Hadoop and related third-party applications on systems.
- Supported technical team members for automation, installation and configuration tasks.
- Suggested improvement processes for all process automation scripts and tasks.
- Assisted in the design, development, and architecture of Hadoop and HBase systems.
- Formulated procedures for planning and executing system upgrades for all existing Hadoop clusters.
- Responsible for cluster Maintenance, Monitoring, Troubleshooting, Tuning, commissioning and Decommissioning of nodes.
- Responsible for cluster availability; experienced in on-call support.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Documented the systems processes and procedures for future references.
Environment: Hortonworks, Ambari, Hive, Pig, Sqoop, ZooKeeper, HBase, Knox, Spark, YARN, MapReduce.
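Reviewing Hadoop daemon logs for errors, as described above, usually starts with a pass that pulls out ERROR/FATAL entries; a small Python sketch of that first pass (illustrative only; the log format shown is a simplified stand-in for real log4j output):

```python
import re

# Match standard log4j severity levels that warrant investigation
ERROR_RE = re.compile(r"\b(ERROR|FATAL)\b")

def scan_log(lines):
    """Return (line_number, line) pairs for error-level entries,
    the kind of first pass used when reviewing Hadoop daemon logs."""
    return [(i, ln) for i, ln in enumerate(lines, 1) if ERROR_RE.search(ln)]

sample = [
    "2021-03-01 10:00:01 INFO namenode started",
    "2021-03-01 10:05:12 ERROR datanode heartbeat lost",
    "2021-03-01 10:05:40 FATAL journalnode quorum lost",
]
for line_no, line in scan_log(sample):
    print(line_no, line)
```

In practice this kind of filter is the seed for root-cause analysis: the flagged timestamps are then correlated across NameNode, DataNode, and JournalNode logs.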
- Worked on analyzing the Hadoop cluster using different big data analytic tools, including Pig, Hive, HBase, and MapReduce.
- Extracted everyday customer transaction data from DB2, exported it to Hive, and set up online analytical processing.
- Worked on a Pig script to count the number of times a URL was opened during a particular duration.
- Developed Pig UDFs for needed functionality, such as a custom Pig loader known as a timestamp loader.
- Installed and configured Hadoop, MapReduce, and HDFS clusters
- Created Hive tables, loaded the data and Performed data manipulations using Hive queries in MapReduce Execution Mode.
- Worked on Developing custom MapReduce programs and User Defined Functions (UDFs) in Hive to transform the large volumes of data with respect to business requirement.
- Involved in creating Hive tables and working on them using Hive QL.
- Analyzed data using Hadoop components Hive and Pig.
- Scripted complex HiveQL queries on Hive tables to analyze large datasets and wrote complex Hive UDFs to work with sequence files.
- Scheduled workflows using Oozie to automate multiple Hive and Pig jobs, which run independently with time and data availability.
- Responsible for creating Hive tables, loading data and writing Hive queries to analyze data.
- Handled importing data from various data sources, performed transformations using Hive, Map Reduce, and loaded data into HDFS.
- Imported data from Teradata database into HDFS and exported the analyzed patterns data back to Teradata using Sqoop.
- Developed MapReduce jobs for Log Analysis, Recommendation and Analytics.
- Wrote MapReduce jobs to generate reports on the number of activities created on a particular day; the data was dumped from multiple sources and the output was written back to HDFS.
- Created HBase tables, used HBase sinks and loaded data into them to perform analytics using Tableau.
- Designed HBase schemas based on requirements and performed HBase data migration and validation.
- Extended Hive and Pig core functionality by writing custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs) for Hive and Pig using Scala.
- Proficient in using Cloudera Manager, an end-to-end tool to manage Hadoop operations.
- Created Partitioned Hive tables and worked on them using HiveQL.
- Developed Shell scripts to automate routine DBA tasks.
- Used Maven extensively for building jar files of Map Reduce programs and deployed to Cluster.
- Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring and troubleshooting, managing and reviewing data backups and Hadoop log files.
Environment: Pig, Hive, MapReduce, Hadoop, HDFS, HiveQL, Oozie, Cloudera, HBase
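The log-analysis MapReduce jobs above, like the Pig URL-count script, follow the classic map/shuffle/reduce shape; a plain-Python sketch of that shape applied to URL counting (illustrative only; the access-log format is a hypothetical stand-in):

```python
from collections import defaultdict

def map_phase(log_lines):
    # map: emit a (url, 1) pair for each request line;
    # assumes a "METHOD /path STATUS" layout for illustration
    for line in log_lines:
        url = line.split()[1]
        yield (url, 1)

def reduce_phase(pairs):
    # shuffle + reduce: group pairs by key and sum the ones,
    # as a MapReduce reducer would after the shuffle
    counts = defaultdict(int)
    for url, one in pairs:
        counts[url] += one
    return dict(counts)

logs = ["GET /home 200", "GET /about 200", "GET /home 404"]
print(reduce_phase(map_phase(logs)))  # {'/home': 2, '/about': 1}
```

In a real cluster the map and reduce phases run on different nodes with the framework handling the shuffle; the sequential version only shows the data flow.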