- Data Engineer with 7+ years of overall experience working with ETL, Big Data, Python/Scala, relational database management systems (RDBMS), and enterprise-level cloud-based computing and applications.
- Comprehensive experience with the Hadoop ecosystem, using technologies such as MapReduce, Hive, HBase, Spark, Sqoop, Kafka, Oozie, and Zookeeper, as well as EC2 cloud computing on AWS.
- Designed Hive tables with partitioning and bucketing to optimize performance.
- Working experience developing user-defined functions (UDFs) for the Apache Hive data warehouse using Java, Scala, and Python.
- Experienced in performing in-memory data processing for batch, real-time, and advanced analytics using Apache Spark (Spark Core, Spark SQL, and Streaming).
- Ingested data into Hadoop from various data sources such as Oracle, MySQL, and Teradata using Sqoop.
- Strong knowledge of NoSQL column-oriented databases such as HBase and their integration with Hadoop clusters.
Hadoop/Big Data: HDFS, MapReduce, Hive, Impala, Spark SQL, HBase, Kafka, Sqoop, Spark Streaming, Oozie, Zookeeper, Hue, Scala, PySpark
Hadoop Distribution: Cloudera (CDH 12.2), Amazon AWS.
Programming/Scripting Languages: Core Java, Linux shell scripts, Python, Scala.
Database: MySQL, PL/SQL, SQL Developer, Teradata, HBase
ETL: Ab Initio, Informatica
Real Time/Stream Processing: Apache Spark
Build Tools: Maven, SBT
Cloud: AWS, Microsoft Azure
IDEs, Web/App Servers: IntelliJ, Eclipse, PyCharm
Confidential - Chicago, IL
- Analyzed data on the Hadoop cluster using big data analytics tools including Spark (Spark SQL, spark-shell), the Hive data warehouse, and Impala.
- Implemented Spark jobs in Scala using the DataFrame and Spark SQL APIs for faster data processing.
- Used Spark for fast in-memory data processing and performed joins (broadcast hash join, sort-merge join), pivots (data transposition), and complex transformations on terabytes of data.
- Developed POCs on Spark Streaming to ingest flat files automatically when they landed in the edge node landing zone.
- Developed Spark code using Scala and Spark SQL/Streaming for faster data processing. Used Spark and Sqoop export to move data from Hadoop to an Oracle database.
- Involved in developing Hive scripts to parse raw data, populate staging tables, and store the refined data in partitioned Hive tables.
- Developed Hive UDFs (SHA256, GUID, UUID, MD5) for cases where the functionality was too complex.
- Automated the data flow with bash scripts, from pulling data from databases through loading it into HDFS.
- Involved in ingesting data into HDFS from a variety of sources using Spark and Sqoop with connectors such as JDBC.
- Performed various Hadoop data warehousing operations such as de-normalization and aggregation in Hive using DML statements.
- Created and worked on Sqoop jobs with incremental loads to populate Hive external tables.
- Created a JCEKS password encryption file for Sqoop imports and a TD Wallet for TDCH imports when importing data from Teradata.
- Worked with Avro and Parquet file formats, leveraging schema evolution, and used AVSC schemas to create Hive tables that accommodate dynamic schema changes from the source.
- Used HBase filters and comparators (BinaryPrefixComparator, RegexStringComparator, BinaryComparator, SingleColumnValueFilter) to scan HBase tables and retrieve the expected output.
- Developed a bash wrapper script for file ingestion that picks up flat files delivered via FTP and ingests them into HDFS.
- Implemented data pipelines using Kafka as a streaming data platform that captures stream events in topics and feeds data systems such as HDFS, cleansing and aggregating data on the fly before channeling it to the data lake.
- Pre-processed large sets of structured and semi-structured data in formats such as text files, Avro, Parquet, SequenceFiles, and JSON records, using Snappy and LZ4 compression.
- Worked with Oozie and Zookeeper to manage the flow of jobs and coordination in the cluster.
- Used Git with Bitbucket for version control and code review, and SonarQube for code analysis.
Environment: Spark, HDFS, Hive, MapReduce, Scala, Sqoop, Spark SQL, Kafka, PySpark, Python, Linux Shell Scripting, JDBC, Git, Bitbucket, Control-M
Confidential - Birmingham, AL
- Developed Hive scripts, Hive UDFs, and Python scripts, and used Spark (Spark SQL, spark-shell) to process data on Hortonworks.
- Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
- Designed and developed Scala code to pull data from cloud-based systems and apply transformations to it.
- Used Sqoop to import data from a MySQL database into HDFS and vice versa.
- Implemented optimized joins to perform analysis on different data sets using MapReduce programs.
- Experience loading and transforming large sets of structured, semi-structured, and unstructured data on Hortonworks.
- Implemented partitioning, dynamic partitions, and bucketing in Hive and Impala for efficient data access.
- Worked extensively with HiveQL, including join operations and custom UDFs, and gained solid experience optimizing Hive queries.
- Ran queries using Impala and used BI and reporting tools (Tableau) to run ad hoc queries directly on Hadoop.
- Worked with Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, for Hive jobs.
- Experience using the Spark framework with Scala and Python. Good exposure to performance tuning of Hive queries and MapReduce jobs in the Spark (Spark SQL) framework on Hortonworks.
- Developed Scala and Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark SQL for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Configured Spark Streaming receivers in Scala to consume Kafka input streams and specified the exact block interval for processing data into HDFS.
- Collected data using Spark Streaming and loaded it into HBase and Cassandra. Used the Spark Cassandra Connector to load data to and from Cassandra.
- Collected and aggregated large amounts of log data using Kafka and staged it in the HDFS data lake for further analysis.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and HBase Filters to compute various metrics for reporting on the dashboard.
- Developed shell scripts in a UNIX environment to automate the data flow from source to different zones in HDFS.
- Defined job workflows in Oozie according to their dependencies, set up e-mail notifications on job completion for the teams requesting the data, and monitored jobs using Oozie on Hortonworks.
- Experience designing both time-driven and data-driven automated workflows using Oozie.
Environment: HDFS, MapReduce, Hive, Impala, Spark SQL, Spark Streaming, Sqoop, AWS S3, Java, JDBC, Python, Scala, UNIX Shell Scripting, Git.
Confidential - Dallas, TX
- Involved in the high-level design of the Hadoop architecture for the existing data structure and problem statement; set up the multi-node cluster and configured the entire Hadoop platform.
- Extracted data from MySQL, Oracle, and Teradata through Sqoop, placed it in HDFS, and processed it.
- Worked with various HDFS file formats such as Avro, Parquet, ORC, SequenceFile, and JSON, and compression formats such as Snappy, bzip2, and gzip.
- Developed efficient MapReduce programs for filtering out the unstructured data and developed multiple MapReduce jobs to perform data cleaning and preprocessing.
- Developed Hive UDFs to pre-process data for analysis and migrated ETL operations into the Hadoop system using Pig Latin and Python scripts.
- Used Hive to perform transformations, event joins, filtering, and some pre-aggregations before storing the data in HDFS.
- Developed bash scripts to automate the data flow using commands such as awk, sed, grep, xargs, and exec, and integrated the scripts with YAML.
- Developed Hive queries for data sampling and analysis for the analysts.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Developed bash scripts and Python modules to convert mainframe fixed-width source files to delimited files.
- Experienced in running Hadoop streaming jobs to process terabytes of formatted data using Python scripts.
- Created workflows in Talend to extract data from various data sources and load it into HDFS.
- Designed ETL data pipeline flows to ingest data from RDBMS sources into HDFS using shell scripts.
- Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.
- Used Hive to validate data ingested using Sqoop and Flume, and pushed the cleansed data set into HBase.
Environment: Hadoop (Cloudera), HDFS, MapReduce, Hive, Scala, Python, Pig, Sqoop, AWS, DB2, UNIX Shell Scripting, JDBC.
Confidential - Los Angeles, CA
- Involved in the full life cycle of the project and used most of the Ab Initio components.
- Involved in all phases of the ETL process, including requirements analysis, the source-to-target mapping document, and ETL process design.
- Extensively used the Partition by Key and Sort, Partition by Expression, Partition by Round-robin, Filter by Expression, Sort, Reformat, Gather, Redefine, Replicate, Scan, Denormalize Sorted, and Normalize components to develop the ETL transformation logic.
- Designed and developed Ab Initio graphs based on the business requirements. Transformed data using complex business process rules.
- Developed complex Ab Initio XFRs to derive new fields and satisfy rigorous business requirements.
- Generated DB configuration files (.dbc, .cfg) for source and target tables using db config and modified them according to the requirements.
- Created HBase tables from Hive and wrote HiveQL statements to access HBase table data.
- Responsible for validating the different sources using Ab Initio functions.
- Created summary tables using the Ab Initio Rollup and Scan components.
- Developed generic graphs to extend a single functionality to many processes and reduce redundant graphs.
- Created complex Ab Initio graphs and extensively used Partition and Departition components to process huge volumes of records quickly, decreasing execution time.
- Improved the performance of Ab Initio graphs by using lookups (instead of joins), in-memory joins, and Rollup and Scan components to speed up execution.
- Developed Ab Initio graphs following Ab Initio best practices.
- Responsible for creating test cases to test the production-ready graphs for integrability and operability, and to verify that data originating from the source reaches the target in the right format.
- Wrote user-defined functions for business processes to improve application performance.
Environment: Ab Initio 2.13, GDE 1.13, UNIX, TOAD 7.5, SunOS 5.8.
- Involved in the complete SDLC, including design and development of the application.
- Followed Agile methodology and participated in Scrum meetings.
- Created various Java bean classes to capture the data from the UI controls.
- Designed UML Diagrams like Class Diagrams, Sequence Diagrams and Activity Diagrams.
- Implemented Java web services, JSPs, and servlets for handling data.
- Used the Struts validation framework for server-side validation.
- Created and implemented the DAO layer using Hibernate tools.
- Implemented custom interceptors and exception handlers for the Struts 2 application.
- Used Ajax to provide dynamic search capabilities for the application.
- Developed business components using the Service Locator and Session Facade design patterns.
- Developed SQL queries.
- Developed a Session Facade with stateless session beans for coarse-grained functionality.
- Used Log4j for logging in the project.
Environment: Java 1.5, JavaScript, Struts 2.0, Hibernate 3.0, Ajax, JAXB, XML, XSLT, Eclipse, Tomcat.