- 4+ years of experience with Apache Hadoop components such as MapReduce, HDFS, Hive, Sqoop, Pig, Kafka, Flume, and Impala, and with Big Data analytics.
- Hands-on expertise in Scala development, including Spark RDD and DataFrame programming.
- Strong experience with application migration from RDBMS to Hadoop.
- Sound grasp of relational database concepts, with extensive experience in DB2 and Oracle. Expert in writing complex SQL queries and stored procedures.
- Experience with real-time streaming using Apache Kafka and Spark Streaming.
- Strong knowledge of Database architecture and Data Modeling including Hive and Oracle.
- Excellent interpersonal and communication skills; technically competent and results-oriented, with strong problem-solving and leadership skills.
- Sound understanding of Agile development and Agile Tools.
- Experience leading projects across verticals such as Banking, Communications, Insurance, Retail & Hospitality, and Man-log.
- Extensive knowledge of cloud technologies such as Microsoft Azure and AWS.
Big Data: Hadoop, HDFS, MapReduce, Hive, Sqoop, Apache Spark, SparkSQL, Spark Streaming, HBase, YARN
Database: DB2, Oracle, SQL Server, MySQL, Hive
Hadoop Management: Cloudera Hadoop Distribution, HDInsight
Languages: SQL, Scala, Python, Shell Scripting
IDEs and Tools: IntelliJ, Eclipse, Maven, Bitbucket
Hadoop / Spark Developer
- Load the data from SQL Server, Oracle RDBMS to Hive using Sqoop.
- Create Hive tables to store the processed results in a tabular format.
- Develop Sqoop scripts to automate data loads between RDBMS databases and Hadoop.
- Develop Apache Spark-based programs to implement complex business transformations.
- Develop Java custom record reader, partitioner and serialization techniques.
- Use different data formats (Text, Avro, Parquet, JSON, ORC) while loading the data into HDFS.
- Create managed and external tables in Hive and load data from HDFS.
- Perform complex HiveQL queries on Hive tables for data profiling and reporting
- Optimize the Hive tables using optimization techniques such as partitions and bucketing to provide better performance with HiveQL queries.
- Use Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Create partitioned tables and load data using both static and dynamic partitioning.
- Create custom user defined functions in Hive to implement special date functions
- Perform Sqoop imports from Oracle to load data into HDFS and directly into Hive tables.
- Create and schedule Sqoop jobs for automated batch data loads.
- Use JSON and XML SerDe Properties to load JSON and XML data into Hive tables.
- Used Spark SQL and Spark DataFrames extensively to cleanse and integrate imported data and derive meaningful insights.
- Worked with several source systems (RDBMS, HDFS, S3) and file formats (JSON, ORC, Parquet) to ingest, transform, and persist data in Hive for downstream consumption.
- Built Spark Applications using IntelliJ and Maven
- Worked extensively in Scala for data engineering with Spark.
- Scheduled Spark jobs in the production environment using the Oozie scheduler.
- Maintained Hadoop jobs (Sqoop/Hive and Spark) in production environment
Big Data POCs
- As part of the Big Data adoption journey, participated in a couple of proofs of concept (POCs) involving technical and performance assessment of the Big Data tech stack (Sqoop, Hive, and Spark).
- As part of the POC program, moved a set of Oracle tables to Hadoop and evaluated the data load process using Sqoop.
- Migrated associated business logic (PL/SQL procedures and functions) to Apache Spark DataFrame modules.
- Created parallel Hive tables equivalent to Oracle tables and evaluated Hive Partitioning and Bucketing
- Contributed to a real-time streaming POC to load customer behavior data using Kafka and Spark Streaming. Customer web clickstream data was simulated in real time to evaluate Hadoop real-time ingestion and processing capability.
Environment: Cloudera 5.8, Spark 2.0, HDFS, Map Reduce, Hive 2.0.1, Sqoop 1.4.6, Oozie Scheduler 4.3, YARN, Java, Linux Shell Scripting, Scala, Spark SQL, Impala 2.8, and Kafka.
- Implemented a POC on the Hadoop stack and various Big Data analytics tools, including exports and imports between relational databases and HDFS.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Created Hive tables, loaded values, and generated ad hoc reports from the table data.
- Demonstrated a strong understanding of Hadoop architecture, including HDFS, MapReduce, Hive, Pig, Sqoop, and Oozie.
- Gathered business requirements in meetings to support a successful POC (proof of concept) and implementation of the Hadoop cluster.
- Loaded existing data warehouse data from Oracle database to Hadoop Distributed File System (HDFS).
- Developed Oozie workflows for automating Sqoop, Hive and Pig scripts.
- Managed and reviewed Hadoop log files.
- Managed data coming from different sources.
- Supported MapReduce programs running on the cluster.
- Installed and configured Pig and wrote Pig Latin scripts.
- Used Sqoop to import data from Oracle into HDFS on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote Hive queries for data analysis to meet business requirements.
- Created Hive tables and queried them using HiveQL.
- Utilized Agile Scrum Methodology to help manage and organize a team of 4 developers with regular code review sessions.
- Held weekly meetings with technical collaborators and actively participated in code review sessions with senior and junior developers.
Environment: Hadoop, Hive, Pig, Flume, Oracle, Java, HBase, Oozie, Shell scripting, Amazon EMR
Confidential, Louisville, KY
- DB2 Database design and manipulations.
- Production database capacity monitoring & performance analysis
- Performance tuning of DB2 SQL statements.
- Capacity management of production DB2 objects
- REORGs, RUNSTATS, and RTS updates, both proactively and as needed
- Monitoring large, growing partitions, adjusting key values or adding new partitions accordingly, and scheduling necessary maintenance
- Data rebalancing in large sized partitions
- Performance management of DB2 objects and engagement with technical support teams on performance and problem resolution
- Actively involved in client DR testing and DB2 version migration project
- Introduced multiple automation measures to improve overall system performance
- Add/Delete/Rebuild Indexes for performance improvements
- Database refresh from Prod to TEST/QA regions and data movement.
Environment: z/OS, JCL, IBM DB2 V8 and V9.1 on z/OS, IBM Admin tools
Environment: z/OS, JCL, IBM DB2 V8 and V9.1 on z/OS, IBM Admin tools
- Create, alter, and drop database objects.
- Load data from Model office and Prod region to Dev region using xloads.
- Perform DBA checkout during Version 9 migration project.
- Execute database online reorganizations.
- Participate in Plan specific Mock conversion activities.
- Tune application queries on request.
- Provide technical support to the application development team.
- Review application developers' code.
- Create ManageNow tickets for critical database changes in the Prod and MO regions.
- Monitor DASD and raise requests for extra volumes with the MVS team.
- Apply patches containing data modeling changes to MTV databases.
Sr. DB2 DBA
Environment: z/OS, JCL, IBM DB2 V7.1/V8.1 on z/OS, BMC Master Mind tools
- Database object creation, alteration, and drops
- Controlling access to DB2 Objects
- Implement and execute database backup
- Execute database recovery when needed using RECOVER and DSN1COPY
- Execute database reorganizations (Online and Offline)
- Resizing of Tablespaces
- Tune application queries on request
- Technical support for Application Development Team
- Loading and unloading tablespaces
- Create partitioned tablespaces and tables