- Detail-oriented Data Engineer/Hadoop developer with 5 years of IT experience
- Excellent problem-solving, communication, and teamwork skills; 2 years with Big Data technologies on the Hadoop platform, with expertise in key Hadoop technologies such as Hive, Sqoop, Spark, Scala and Kafka
- Hands-on experience with application migration from RDBMS to the Hadoop platform, the health care domain, and real-time streaming with Apache Kafka and Spark Streaming; strong knowledge of Scala, Python and Java development, including Spark RDD and DataFrame programming.
- Extensive working experience following all phases of SDLC including application design, development, production support and maintenance projects.
- Progressive hands-on experience in analysis, ETL processes, and the design and development of enterprise-level data warehouse architectures, including designing, coding, testing, and integrating ETL.
- Solid knowledge of big data frameworks: Hadoop, HDFS, Apache Spark, Hive, MapReduce and Sqoop
- Hands-on expertise in Scala development, including Spark RDD and DataFrame programming
- Strong experience with application/database migration from RDBMS to Hadoop
- Sound grasp of relational database concepts and extensive work with SQL Server and Oracle. Expert in writing complex SQL queries and stored procedures
- Used JSON and XML SerDe properties to load JSON and XML data into Hive tables.
- Built Spark Applications using IntelliJ and Maven.
- Worked extensively with the Scala programming language for data engineering using Spark
- Experience with real-time streaming involving Apache Kafka and Spark Streaming.
- Strong knowledge of Database architecture and Data Modeling including Hive and Oracle.
- Excellent interpersonal and communication skills; technically competent and results-oriented, with problem-solving and leadership skills.
- Sound understanding of Agile development and Agile tools
Big Data Ecosystem: Hadoop, Apache Spark, Hive, Kafka, HDFS, MapReduce, Sqoop, HBase, ZooKeeper, Scala, Python
Databases: MySQL, HBase, MS SQL
Programming Languages: Java, Python, Scala, SQL, NoSQL, HiveQL, T-SQL
Tools: IntelliJ, Eclipse, PyCharm, PuTTY, LIMS
Operating Systems: Linux, Windows
Packages: VMware, Oracle VM VirtualBox, MS Office, SSIS (ETL)
Hadoop/Spark Developer
- Loaded data on tests performed at the local RRL from SQL Server into Hive using Sqoop
- Used different data formats (Text, Avro, Parquet, JSON, ORC) when loading files into HDFS
- Created Hive tables that stored the processed results in a tabular format
- Created Managed and External tables in Hive, loaded data from HDFS and performed complex HiveQL queries on the tables based on business needs for reporting
- Wrote HiveQL queries using partitioning and bucketing based on STAT criteria, which helped optimize query performance
- Created and scheduled Sqoop jobs for automated batch data loads
- Created complex application logic to process the test data, using Spark SQL and Spark DataFrames to cleanse and integrate imported data based on business requirements
- Built Spark applications using IntelliJ and Maven, and used the Scala and Python programming languages for data engineering in the Spark framework
- Monitored and maintained all Hive, Sqoop and Spark jobs to keep them optimized at all stages. Used the ResourceManager and YARN queues to manage and monitor Hadoop jobs
- Executed a streaming proof of concept (POC) in which streaming data was ingested and processed using Kafka and Spark Streaming and saved into HDFS and Hive tables. As part of the POC, real-time service logs were streamed into a Hive table using Kafka and the Spark Streaming framework. Participated in designing the Kafka producers, consumers and topics and in integrating the data flow with the Spark Streaming pipeline.
- Executed a POC to implement a business use case that required migrating data sets from SQL Server to Hive and migrating a few SQL Server programs to Spark/Scala applications. As part of the POC, a few partitioned Hive tables were built along with a few master tables to store the use case tables. T-SQL programs were migrated to Spark/Scala applications, and performance was benchmarked against the legacy application.
Big Data Administrator
- Worked collaboratively to manage the build-out of large data clusters.
- Helped design big data clusters and administered them.
- Performed analysis of large data sets using components from the Hadoop ecosystem
- Extracted the data from MySQL into HDFS using Sqoop (version 1.4.6).
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
- Developed workflows for the complete end-to-end ETL process: getting data into HDFS, validating it and applying business logic, storing clean data in Hive external tables, exporting data from Hive to RDBMS sources for reporting, and escalating any data quality issues.
- Developed Spark SQL jobs to load tables into HDFS and run SELECT queries on top of them.
- Assisted in writing Scala Spark scripts for data cleansing. Also performed data cleansing and enrichment, such as removing duplicates and null values, using Pig Latin and HiveQL.
- Loaded data into Hive Tables from Hadoop Distributed File System (HDFS) to provide SQL-like access on Hadoop data.