- Over 7 years of professional experience in designing, developing, integrating and testing software applications, including 5 years of experience with Hadoop Big Data technologies such as MapReduce, Hive, Spark (Core and Spark SQL), Impala and Sqoop, and 3+ years of experience in programming.
- Hands-on experience in working with Big Data in the Hadoop ecosystem using Spark, Hive, Impala and MapReduce.
- Hands-on experience in programming and implementing Java, Scala and Python code, with strong knowledge of object-oriented and functional programming concepts.
- In-depth knowledge of implementing data visualization techniques in Spark using Apache Zeppelin.
- Highly skilled at SQL and shell scripting operations.
- Good knowledge of data architecture including data ingestion pipeline designs, Lambda architecture, data streams, data lakes and data warehouses.
- Good knowledge of data modeling and advanced data processing techniques for structured, semi-structured and unstructured data.
- Able to assess business rules, collaborate with stakeholders and perform source-to-target data mapping, design and review with strong analytical and communication skills.
- Hands-on experience in tuning mappings, with expertise in identifying and resolving performance bottlenecks at various levels.
- Excellent skills in analyzing system architecture usage and in defining and implementing procedures.
- A quick learner, punctual and trustworthy.
- Motivated problem solver and resourceful team member with decent written and verbal communication skills.
Hadoop Platforms: Hortonworks (HDP) and Cloudera (CDH)
Filesystems: HDFS, S3
Databases: Hive, Impala
Scheduling: Oozie and crontab
Data Ingestion and Streaming: Sqoop, Flume and Kafka
Querying Engines: Beeline, Phoenix
NoSQL DBs: HBase, MongoDB and Cassandra
Other Tools: Apache Zeppelin, Tableau, Arcadia, Microsoft Office (PowerPoint, Word & Excel) and GitHub.
Development Technologies: Spark, Python, Scala, Java, SQL and Shell Scripting.
IDEs, FTP and SSH tools: Eclipse, IntelliJ, PyCharm, DbVisualizer, Xshell, PuTTY, FileZilla and WinSCP.
Build tools: SBT, Maven
File Formats: Structured (delimiter-separated values), Semi-Structured (JSON, XML, HTML, etc.), Compressed (Gzip, Snappy, LZO, etc.) and Binary (Sequence, Avro, Parquet, ORC, etc.).
Operating Systems: Mac, Ubuntu, Linux and Windows.
Senior Data Engineer
Confidential, Costa Mesa, CA
Environment: Cloudera (CDH), Hadoop, PuTTY, Spark, Beeline, Impala, MySQL, HDFS, Python, Scala, SQL scripting, Sqoop, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.
- Designed an ETL pipeline using Spark, Hive and HBase components.
- Built a batch data ingestion pipeline using Sqoop and Spark.
- Built a near-real-time data ingestion pipeline using Kafka and Spark.
- Integrated Hive with HBase using the HBaseStorageHandler to enable inserting and updating data in the NoSQL HBase store.
- Enabled transactional tables in Hive to support row-level updates.
- Encrypted sensitive data using Hive, and provided a mechanism to encrypt the Sqoop passwords used to connect to the legacy systems.
- Designed a framework that stores the RDBMS and JDBC connection details in MySQL to automate and unify data-pull processing for Sqoop ingestions.
- Used Oozie and crontab to schedule Hadoop application jobs so that the cluster is used effectively.
- Worked on real-time analytics of streaming text data by integrating Flume and Kafka with Spark Streaming.
- Performed schema tuning, performance triage/troubleshooting and data distribution for ingested and existing data in the enterprise data platform.
- Tuned, debugged and optimized Hive queries.
Hadoop Solutions Engineer
Confidential, Colorado Springs, CO
Environment: Hortonworks (HDP), Hadoop, PuTTY, Oracle, MySQL, HDFS, Spark, Hive, Arcadia, Python, Scala, SQL scripting, Sqoop, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.
- Created an ETL pipeline using Spark, Hive and Arcadia components.
- Stored and retrieved data in HDFS in different formats such as text, JSON, Sequence, Avro, Parquet and ORC, as well as in compressed formats.
- Created apps and visuals in Arcadia Data to enable data visualization and descriptive analytics.
- Tuned Spark RDD parallelism to improve the performance of Spark jobs on the Hadoop cluster.
- Designed Hive table schemas using partitioning and bucketing, storing tables as both external and internal tables.
- Developed Hive UDFs in Python to define custom analytical functions.
- Programmed Spark applications in Python and Scala, and optimized memory parameters for efficient cluster utilization.
- Loaded data between RDBMS and HDFS using Spark and JDBC connectors, integrating Hadoop with MySQL and Oracle.
Confidential, Southlake, TX, US
Environment: Cloudera (CDH), Hadoop, PuTTY, Oracle, MySQL, HDFS, Spark, Hive, Impala, Python, Scala, SQL scripting, Linux shell scripting, Eclipse, IntelliJ, PyCharm, SBT, and Maven.
- Parsed and filtered semi-structured data such as JSON using DataFrames/Spark SQL, with schemas inferred from case classes or specified programmatically.
- Modeled data using Avro schemas and stored it in Parquet format using Spark SQL.
- Worked on real-time data analytics in Spark Streaming for streaming text and Kafka topic data.
- Prepared data in Spark DataFrames using set operations, regular expressions, sorting, parsing of arbitrary date/time inputs and conversion of JSON array values into lists.
- Tuned, debugged and optimized Hive queries by changing the default YARN settings.
- Developed queries to analyze data of different formats in Impala and Hive.
- Loaded data between RDBMS and HDFS using Spark and JDBC connectors.
- Loaded data from local file systems into HDFS and retrieved it back to local systems.
Confidential, Mayfield Village, OH
Environment: Big Data Platform - CDH 5.0.3, Hadoop HDFS, MapReduce, Hive, Sqoop, Spark, Impala, Java, Shell Scripts, Oracle 10g, Eclipse, Tableau, PuTTY and IntelliJ.
- Prepared technical design documents and data flow diagrams based on business requirements.
- Implemented new designs as per technical specifications.
- Integrated Hadoop with Oracle in order to load and then cleanse raw unstructured data in the Hadoop ecosystem, making it suitable for processing in Oracle using stored procedures and functions.
- Used the MapReduce programming model for batch processing of data stored in HDFS.
- Developed Java MapReduce programs that transform log data into a structured form to find user location, login/logout times, time spent and errors.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Used Sqoop to import data into HDFS and to export data from HDFS to an Oracle database.
- Built reusable Hive UDF libraries for business requirements, which enabled users to use these UDFs in Hive queries.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Optimized Hive tables using techniques such as partitioning and bucketing to improve the performance of HiveQL queries.
- Developed Spark Scala scripts for ETL operations on captured data and for delta-record processing between newly arrived data and existing data in HDFS.
- Extensively used Pig for data cleansing.
- Used PySpark to perform transformations, event joins, bot-traffic filtering and some pre-aggregations before storing the data in HDFS.
- Extended Hive and Pig core functionality by writing custom UDFs in Java and Python.
- Created action filters, parameters and calculated sets for preparing dashboards and worksheets in Tableau.
- Worked extensively on performance optimization by adopting appropriate design patterns for the MapReduce jobs, based on analysis of I/O latency, map time, combiner time, reduce time, etc.
- Troubleshooting: Used Hadoop logs to debug the scripts.
Confidential, Houston, TX
Environment: Big Data Platform - CDH 4.2.1, Hadoop HDFS, Map Reduce, Hive, Sqoop, IBM DB2, PL/SQL, UNIX, Python, Eclipse.
- Integrated, managed and optimized utility systems, including assets, devices, networks, servers, applications and data.
- Ensured quality integration of smart meter functions into the system's data acquisition and processing.
- Enabled the use of metering data for a variety of applications such as billing, outage detection and recovery, fraud detection, finance, energy efficiency, customer care and a variety of analytics.
- Analyzed large amounts of raw data to create information. Compiled technical specifications that allowed IT to create data systems, which supported the smart metering system.
- Conducted technical reviews and provided quick-fix solutions to the customer for production defects.
- Developed Map-Reduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables in the enterprise data warehouse (EDW).
- Imported and exported data between HDFS and relational database systems using Sqoop.
- Developed Scala code to filter and parse raw data in HDFS using Spark.
- Wrote Java MapReduce programs to analyze log data for large-scale weather data sets.
- Tested MapReduce programs using the MRUnit and JUnit testing frameworks.
- Customized the parser-loader application for data migration to HBase.
- Provided support to data analysts in running ad-hoc Pig and Hive queries.
- Developed PL/SQL procedures, functions and packages using Oracle utilities such as PL/SQL and SQL*Loader, with exception handling, to implement key business logic.
- Utilized the PL/SQL bulk collect feature to optimize ETL performance. Fine-tuned and optimized a number of SQL queries and performed code debugging.
- Developed UNIX and SQL scripts to load large volumes of data for data mining and data warehousing.
Environment: Big Data Platform - CDH 4.0.1, XML, Hadoop HDFS, Spark, Hive, Sqoop, Impala, Oracle 10g, Java, Eclipse.
- Involved in the design and development of the server-side layer using XML, JDBC and JDK patterns in the Eclipse IDE.
- Involved in unit testing, system integration testing and enterprise user testing.
- Extensively used Core Java, Servlets, and JDBC.
- Developed a data pipeline using Hive, Sqoop, Spark and MapReduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Worked with NoSQL databases such as HBase, creating tables to load large sets of semi-structured data coming from various sources.
- Wrote MRUnit test cases to test and debug MapReduce programs on a local machine.
- Created Hive tables, loaded data and ran Hive queries on that data.
- Imported data from Oracle to HDFS using Sqoop on a regular basis.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Wrote Hive queries for data analysis to meet business requirements.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Wrote Hive queries to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Developed Pig UDFs to pre-process data for analysis.
- Developed complex, multi-step data pipelines using Spark.
- Wrote Spark SQL queries for data analysis.
Environment: Big Data Platform - CDH 3, Map-Reduce, Hive, Spark Scripting, JDK 1.6, and Oracle.
- Involved in the analysis, design and development of data collection, data ingestion, data profiling and data aggregation.
- Developed controller, batch and logging modules using JDK 1.6.
- Developed the data ingestion process using FS Shell to load data into HDFS.
- Defined Hive queries for profiling rules such as business checks, outlier checks, and domain and data range validation.
- Automated the generation of Hive queries and MapReduce programs.
- Developed user-defined functions in Java and Python to facilitate data analysis in Hive and Pig.
- Managed end-to-end delivery during the different phases of the software implementation.
- Involved in the initial POC implementation using Hadoop MapReduce, Spark scripting and Hive scripting.
- Designed the framework for data ingestion, data profiling and generation of the Risk Aggregation report based on various business entities.
- Mapped the business requirements and rules to the Risk Aggregation System.
- Used JDBC for Oracle database connectivity and to invoke stored procedures.
- Debugged code and created documentation for future use.