- Big Data professional with 6+ years of combined experience in the fields of Data Applications, Big Data implementations and Java/J2EE technologies.
- 4 years of experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
- Strong exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, plus knowledge of Spark MLlib.
- Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) in Scala, PySpark, and spark-shell.
- Experienced in data manipulation using Python for loading and extraction, as well as Python libraries such as NumPy, SciPy, and Pandas for data analysis and numerical computations.
- Experienced in using Pig scripts to perform transformations, event joins, filters, and pre-aggregations before storing the data in HDFS.
- Strong knowledge of Hive analytical functions; extended Hive functionality by writing custom UDFs.
- Expertise in writing MapReduce jobs in Python that process large structured, semi-structured, and unstructured data sets and store them in HDFS.
- Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
- Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
- Hands-on experience with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
- Hands-on experience with SQL and NoSQL databases such as Snowflake, HBase, Cassandra, and MongoDB.
- Hands on experience in setting up workflow using Apache Airflow and Oozie workflow engine for managing and scheduling Hadoop jobs.
- Strong experience in working with UNIX/LINUX environments, writing shell scripts.
- Excellent knowledge of J2EE architecture, design patterns, and object modeling using various J2EE technologies, with comprehensive experience in web-based applications built on J2EE frameworks such as Spring, Hibernate, Struts, and JMS.
- Worked with various file formats, including delimited text files, clickstream log files, Apache log files, Avro, JSON, and XML.
- Experienced in working within the SDLC under both Agile and Waterfall methodologies.
- Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently as well as in a team and to follow the best practices and principles defined for the team.
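The MapReduce jobs in Python mentioned above typically follow the Hadoop Streaming mapper/reducer pattern; a minimal local sketch of that pattern (a word count, with illustrative function names) looks like this:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (word, 1) pairs, as a Hadoop Streaming mapper writes to stdout.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Sum counts per word. Hadoop's shuffle phase delivers pairs grouped
    # and sorted by key, which sorted() simulates here.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Local simulation of map -> shuffle/sort -> reduce on a tiny input.
counts = dict(reducer(mapper(["big data big", "data pipeline"])))
```

In an actual Hadoop Streaming job, the mapper and reducer run as separate scripts reading tab-delimited records from stdin and writing them to stdout, with the framework handling the shuffle between them.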
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm
Hadoop Distributions: Cloudera (CDH) and Hortonworks (HDP)
Languages and Frameworks: Scala, Java, Spring, Hibernate, JDBC, JSON, HTML, CSS
Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application Servers: Apache Tomcat, WebLogic, WebSphere
IDEs: IntelliJ IDEA, Eclipse, and NetBeans
Version Control and Build Tools: Git, Maven, SBT, CBT
- Responsible for the execution of big data analytics, predictive analytics and machine learning initiatives.
- Implemented a proof of concept deploying the product to an AWS S3 bucket and Snowflake.
- Utilized AWS services with a focus on big data architecture, analytics, enterprise data warehousing, and business intelligence solutions to ensure optimal architecture, scalability, flexibility, availability, and performance, and to provide meaningful, valuable information for better decision-making.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation and queries, writing results back into the S3 bucket.
- Experience in data cleansing and data mining.
- Wrote, compiled, and executed programs as necessary using Apache Spark in Scala to perform ETL jobs with ingested data.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation; used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis.
- Prepared scripts in Python and Scala to automate the ingestion process from various sources such as APIs, AWS S3, Teradata, and Snowflake.
- Designed and developed Spark workflows in Scala to pull data from AWS S3 buckets and Snowflake and apply transformations to it.
- Implemented Spark RDD transformations to map business logic and applied actions on top of those transformations.
- Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
- Created Python scripts to read CSV, JSON, and Parquet files from S3 buckets and load the data into AWS S3, DynamoDB, and Snowflake.
- Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway.
- Migrated data from an AWS S3 bucket to Snowflake by writing custom Snowflake read/write utility functions in Scala.
- Worked on Snowflake schemas and data warehousing, and processed batch and streaming data-load pipelines using Snowpipe and Matillion from the Confidential data lake on AWS S3.
- Profiled structured, unstructured, and semi-structured data across various sources to identify patterns, and implemented data quality metrics using queries or Python scripts depending on the source.
- Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
- Created DAGs using the Email, Bash, and Spark Livy operators to execute jobs on an EC2 instance.
- Deployed code to EMR via CI/CD using Jenkins.
- Extensively used Codecloud for code check-ins and checkouts for version control.
Environment: Agile Scrum, MapReduce, Snowflake, Pig, Spark, Scala, Hive, Kafka, Python, Airflow, JSON, Parquet, CSV, Codecloud, AWS
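The validation-and-cleansing step in the ingestion work above can be sketched in plain Python; the orders feed, its field names, and the quality rules here are hypothetical, chosen only to show the drop-and-count pattern:

```python
import csv
import io

# Hypothetical quality rules for an ingested orders feed.
REQUIRED = ("order_id", "amount")

def clean_rows(raw_csv):
    # Parse a CSV string, drop rows that fail quality checks, coerce types.
    # Returns (clean_rows, metrics) so rejects can be reported downstream
    # as data quality metrics.
    metrics = {"read": 0, "rejected": 0}
    clean = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        metrics["read"] += 1
        # Reject rows with missing or blank required fields.
        if any(not row.get(field, "").strip() for field in REQUIRED):
            metrics["rejected"] += 1
            continue
        # Reject rows whose amount is not numeric.
        try:
            row["amount"] = float(row["amount"])
        except ValueError:
            metrics["rejected"] += 1
            continue
        clean.append(row)
    return clean, metrics
```

In a pipeline, the `metrics` dictionary would feed the quality dashboards and the clean rows would continue to the Snowflake load step.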
Big Data Engineer
- Extensively involved in Installation and configuration of Cloudera Hadoop Distribution.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Developed Spark applications for performing large-scale transformations and denormalization of relational datasets.
- Gained real-time experience with Kafka and Storm on the HDP 2.2 platform for real-time analysis.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Created reports for the BI team by using Sqoop to import data into HDFS and Hive.
- Performed analysis on unused user-navigation data by loading it into HDFS and writing MapReduce jobs; the analysis provided inputs to the new APM front-end developers and the Lucent team.
- Loaded data from multiple sources (SQL Server, DB2, and Oracle) into HDFS using Sqoop and loaded it into Hive tables.
- Created Hive queries to process large sets of structured, semi-structured, and unstructured data and stored the results in managed and external tables.
- Developed complex HiveQL queries using a JSON SerDe.
- Created HBase tables to load large sets of structured data.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Performed real-time event processing of data from multiple servers in the organization using Apache Storm integrated with Apache Kafka.
- Managed and reviewed Hadoop log files.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked on PySpark APIs for data transformations.
- Ingested data into Hadoop via Sqoop imports and performed validations and consolidations on the imported data.
- Extended Hive and Pig core functionality by writing custom UDFs for data analysis.
- Upgraded the current Linux version to RHEL 5.6.
- Expertise in hardening Linux servers and in compiling, building, and installing Apache Server from source with minimal modules.
- Worked on JSON, Parquet, Hadoop File formats.
- Worked with Java technologies such as Hibernate, Spring, JSP, and Servlets, developing both server-side and client-side code for our web application.
- Used GitHub for continuous integration services.
Environment: Agile Scrum, MapReduce, Hive, Pig, Sqoop, Spark, Scala, Oozie, Flume, Java, HBase, Kafka, Python, Storm, JSON, Parquet, Git, JSON SerDe, Cloudera.
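One way to extend Hive, alongside the custom UDFs noted above, is a Python streaming script invoked through Hive's TRANSFORM clause; the (user_id, bytes) column layout below is illustrative:

```python
def transform(lines):
    # Hive's TRANSFORM clause pipes tab-delimited rows to the script's stdin
    # and reads tab-delimited rows back from stdout; the plumbing is factored
    # out here so the row logic can be exercised locally.
    for line in lines:
        user_id, raw_bytes = line.rstrip("\n").split("\t")
        # Normalize the id and convert raw bytes to megabytes.
        megabytes = int(raw_bytes) / (1024 * 1024)
        yield f"{user_id.lower()}\t{megabytes:.2f}"

# In production the script would end with:
#     for row in transform(sys.stdin): print(row)
```

The script is then registered with `ADD FILE` and invoked via `SELECT TRANSFORM(user_id, bytes) USING 'python transform.py' ...` in HiveQL.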