- Results-oriented Big Data consultant with 6+ years of IT experience and a proven track record of effectively administering Hadoop ecosystem components and architecture and managing distributed file systems.
- Proficient in collaborating with key stakeholders to conceptualize and execute solutions for systems-architecture-based technical issues. Highly skilled in processing complex data and designing Machine Learning modules for effective data mining and modeling.
- Adept at Hadoop cluster management & capacity planning for end-to-end data management & performance optimization.
- Hadoop Developer with work experience in using HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, Kafka, ZooKeeper, Ambari, and HBase.
- Experienced in working with various Hadoop distributions (Cloudera, Hortonworks, MapR) to fully implement and leverage new Hadoop features.
- Proficient in using Unix-based command-line interfaces.
- Experience in moving data between HDFS and relational database systems (RDBMS) using Apache Sqoop.
- Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and developing and tuning HQL queries.
- Involved in creating Hive tables, loading them with data, and writing ad hoc Hive queries that run internally on MapReduce and Spark.
- Significant experience writing custom UDFs in Hive and custom InputFormats in MapReduce.
- Experience in managing and reviewing Hadoop log files.
- Knowledge of job workflow management and monitoring tools like Oozie, UC4, Tidal, Control-M.
- Experience working with NoSQL database technologies, including MongoDB, Cassandra, and HBase.
- Strong experience building end-to-end data pipelines on the Hadoop platform.
- Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
- Replaced existing MR jobs and Hive scripts with Spark SQL & Spark data transformations for efficient data processing.
- Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
- Strong understanding of the real-time streaming technologies Spark and Kafka.
- Worked with real-time data processing and streaming techniques using Spark Streaming and Kafka.
- Good understanding of Machine Learning methodologies for uncovering hidden patterns, analyzing user behaviour, and modeling with Spark MLlib.
- Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions. Used MySQL, MS SQL Server, DB2, and Oracle.
- Good understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
- Effective communication and interpersonal skills; an excellent team player who works toward the growth of the organization.
Operating Systems: Linux (Ubuntu, CentOS), Windows, Mac OS
Hadoop Ecosystem: Hadoop, MapReduce, YARN, HDFS, Pig, Oozie, ZooKeeper
Big Data Ecosystem: Spark, Spark SQL, Spark Streaming, Spark MLlib, Hive, Impala, Hue
Cloud Ecosystem: AWS, Snowflake cloud data warehouse
Data Ingestion: Sqoop, Flume, NiFi, Kafka
NoSQL Databases: HBase, Cassandra, MongoDB, CouchDB
Programming Languages: C, C++, Scala, Core Java, J2EE
Scripting Languages: UNIX shell, Python, R
Databases: Oracle 10g/11g/12c, PostgreSQL 9.3, MySQL, SQL Server, Teradata, HANA
IDE: IntelliJ, Eclipse, Visual Studio, IDLE
Hadoop & Snowflake Data Engineer
Confidential - San Jose, CA
- Migrated objects from a variety of sources, such as Oracle, SAP/HANA, MongoDB, and Teradata, using the custom ingestion framework.
- Created Snowpipe for continuous data loading from staged data residing on cloud gateway servers.
- Used COPY INTO to bulk load the data.
- Used the FLATTEN table function to produce lateral views of VARIANT, OBJECT, and ARRAY columns.
- Worked with both Maximized and Auto-scale modes when running multi-cluster warehouses.
- Used temporary and transient tables on different datasets.
- Shared sample data with customers for UAT/BAT by granting access.
- Used the Snowflake Time Travel feature to access historical data.
- Heavily involved in testing Snowflake to understand the best possible way to use cloud resources.
- Migrated jobs from Tidal to Control-M and created new scheduled jobs in Control-M.
- Analyzed data using Hive.
- Orchestrated scrum calls for a couple of Supply Chain functions to track project progress.
Environment: Snowflake Web UI, SnowSQL, Hadoop MapR 5.2, Hive, Hue, Toad 12.9, SharePoint, Control-M, Tidal, ServiceNow, Teradata Studio, Oracle 12c, Tableau
Confidential - Westchester, PA
- Developed Spark programs with Scala APIs to compare the performance of Spark with Hive and SQL.
- Used Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of the data.
- Designed and created Hive external tables with partitioning and buckets.
- Used the JSON and XML SerDes for serialization and deserialization to load JSON and XML data into Hive tables.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Worked with and gained hands-on experience in AWS (Amazon Web Services) EC2, S3, EBS, RDS, and VPC.
- Worked with various HDFS file formats such as Avro, Parquet, and SequenceFile, and with compression formats such as gzip and Snappy.
- Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables for handling structured data.
- Used Spark for interactive queries and for processing streaming data with Spark Streaming.
- Converted Hive and SQL queries into Spark transformations using Spark RDDs, Scala, and Python.
- Created Spark workflows using the UC4 scheduler.
- Provided L2 DevOps on-call production support for the operations team on a monthly basis.
Environment: Hadoop, HDFS, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Scala, Python, MapReduce, Hive, Sqoop, Kafka, AWS, Databricks, Pentaho Data Integration (PDI/Kettle), Oracle 11g.
Hadoop and Spark Developer
Confidential - Houston, TX
- Created Hive tables for loading and analysing data, implemented partitions and buckets, and developed Hive queries to process the data and generate data cubes for visualization.
- Performed Hive modeling and wrote many Hive scripts for the various kinds of data preparation needed to run machine learning models.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Loaded data into Spark RDDs and performed advanced procedures such as text analytics and processing, using Spark's in-memory computation capabilities with Scala to generate the output response.
- Handled large datasets during the ingestion process itself using partitions, Spark's in-memory capabilities, broadcasts, efficient joins, transformations, and other techniques.
- Queried Parquet files by loading them into Spark DataFrames in a Zeppelin notebook.
- Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Tuned the performance of Spark applications by setting the right batch interval, the correct level of parallelism, and memory settings.
- Worked closely with the data science team to automate and productionize various models, such as logistic regression and k-means, using Spark MLlib.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it into Cassandra.
- Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyse data from Cassandra tables for quick searching, sorting, and grouping.
- Troubleshot problems arising during batch data processing jobs.
- Wrote Sqoop scripts for importing and exporting structured data between RDBMS and HDFS.
- Responsible for building scalable distributed data solutions using Hadoop.
- Created a data pipeline for the different ingestion and aggregation events that loads consumer response data from an AWS S3 bucket into Hive external tables at an HDFS location, serving as a feed for Tableau dashboards.
- Worked with BI team to create various kinds of reports using Tableau based on the requirements.
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Spark MLlib, Scala, Python, Kafka, Hive, Sqoop, AWS, Elasticsearch, Impala, Cassandra, Tableau, Talend, Cloudera, MySQL, Linux.
Confidential - Franklin Lakes, NJ
- Utilized Sqoop, Kafka, Flume, and Hadoop File System APIs to implement data ingestion pipelines.
- Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming.
- Created storage with Amazon S3 for storing data. Worked on transferring data from Kafka topics into AWS S3 storage.
- Created Hive tables, loaded with data, and wrote Hive queries to process the data.
- Created partitions and used bucketing on Hive tables, set the required parameters to improve performance, and developed Hive UDFs for business use cases.
- Developed Hive scripts for source data validation and transformation.
- Automated data loading into HDFS and Hive for pre-processing the data using Oozie.
- Collaborated on data modeling, data mining, Machine Learning methodologies, advanced data processing, and ETL optimization.
- Worked on various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet, and XML.
- Worked extensively with AWS and related cloud components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Used Apache NiFi to automate data movement between different Hadoop components.
- Used NiFi to perform conversion of raw XML data into JSON, Avro.
- Experienced in working with Hadoop from Cloudera Data Platform and running services through Cloudera Manager.
- Assisted in Hadoop administration and support activities for installing and configuring Apache big data tools and Hadoop clusters using Cloudera Manager.
- Experienced in Hadoop production support tasks, analysing application and cluster logs.
- Used the Agile Scrum methodology (Scrum Alliance) for development.
Environment: Hadoop, HDFS, AWS, Vertica, Scala, Kafka, MapReduce, YARN, Drill, Spark, Pig, Hive, Java, NiFi, HBase, MySQL, Kerberos, Maven.