Senior Big Data Engineer Resume
Los Angeles, California
PROFESSIONAL SUMMARY:
- 6+ years of experience with the Big Data Hadoop ecosystem in ingestion, storage, querying, processing, and analysis of big data.
- Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for Big Data and Big Data analytics.
- Hands-on experience installing and configuring Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, HBase, Oozie, Sqoop, Flume, and Kafka.
- Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Experience developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
- Wrote data transformations and data cleansing logic using Pig operations; good experience in data retrieval and processing using Hive.
- Worked with HBase to perform quick lookups (updates, inserts, and deletes) in Hadoop.
- Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
- Proficient in Hive optimization techniques such as bucketing and partitioning.
- Experienced in loading datasets into Hive for ETL (Extract, Transform, and Load) operations.
- Experience importing and exporting data using Sqoop from relational database systems to HDFS and vice versa.
- Extensive experience importing and exporting data using stream processing platforms like Flume.
- Developed Apache Spark jobs using Scala and Python for faster data processing and used Spark Core and Spark SQL libraries for querying.
- Experience in creating Spark Streaming jobs to process huge sets of data in real time.
- Experience tuning Spark jobs for efficiency in terms of storage and processing.
- Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
- Experience working with AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, and IAM.
- Tested, cleaned, and standardized data to meet business standards using Execute SQL Task, Conditional Split, Data Conversion, and Derived Column in different environments.
- Expertise in relational database systems (RDBMS) such as MySQL and NoSQL database systems like HBase, with basic knowledge of MongoDB and Cassandra.
- Experience in database development using SQL and PL/SQL, working with databases such as Oracle 12c/11g/10g, SQL Server, and MySQL.
- Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, Secondary NameNode, DataNode, MapReduce concepts, and the YARN architecture, which includes the NodeManager, ResourceManager, and ApplicationMaster.
- A great team player with the ability to communicate effectively with all levels of the organization, including technical staff, management, and customers.
- Ability to quickly master new concepts and applications.
TECHNICAL SKILLS:
Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Sqoop, Oozie, Spark, Kafka
Hadoop platforms: Cloudera, MapR
AWS services: EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, IAM
Languages: JavaScript, Scala, SQL, Python
Web Technologies: HTML5, CSS3, JavaScript
Scripting Language: UNIX Shell Script
RDBMS DB: MySQL, Oracle 12c/11g/10g
NoSQL Technologies: HBase, MongoDB, Cassandra, DynamoDB
Tools & Utilities: Eclipse, Visual Studio, NetBeans, GitHub, Maven, Jenkins
Operating Systems: Unix, Windows, CentOS, Linux (Ubuntu, Red Hat)
Others: PuTTY, WinSCP, GitHub
PROFESSIONAL EXPERIENCE:
Confidential, Los Angeles, California
Senior Big Data Engineer
Responsibilities:
- Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
- Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
- Developed applications that leverage Step Functions and CloudWatch event triggers to fetch data and generate features from it.
- Contributed to key project decisions spanning design, planning, implementation, and security.
- Helped create a research data lake by extracting customer data from various sources, including Excel files, databases, and server logs, into S3.
- Developed Apache Spark applications in Scala to process data from various streaming sources.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in DynamoDB using Scala.
- Created Python-based Lambda functions for feature extraction.
- Created a lightweight serverless pipeline that runs on Lambda functions and generates insights.
- Created a set of data classifiers that read from DynamoDB, classify the features into bins, and store them in DynamoDB.
- Used Lambda with SNS to send insight notifications to mobile devices.
- Implemented Tableau mobile dashboards via the Tableau Mobile application.
- Used various DataStage Designer stages such as Lookup, Join, Merge, Funnel, Filter, Copy, Aggregator, and Sort.
- Involved in all phases of the SDLC and collaborated with a large team to get this pipeline operational.
- Configured CloudWatch logs and created a CloudWatch dashboard for monitoring.
- Deployed machine learning models on SageMaker and exposed them as endpoints.
- Called the endpoints to invoke the models in real time and generate insights.
Environment: Lambda, MSK, KMS, Spark, SQL Server 2016/2014, DB2, DynamoDB, CloudWatch, Tableau, Python, SNS, Step Functions.
Confidential, Columbus, Ohio
Big Data Developer
Responsibilities:
- Achieved near-real-time reporting by adopting an event-based processing approach instead of micro-batching for data coming from Kafka.
- Developed Spring Boot applications to read data from Kafka in an event-based manner. These applications ran as microservices, each handling part of the problem, and were deployed in Docker containers that were built and deployed automatically using Jenkins pipelines.
- Wrote Spring Boot applications that read data from Kafka and write it to MapR-DB (the MapR version of HBase).
- Wrote applications that produced data to Kafka and consumed data from it.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Implemented Spark solutions to generate reports and to fetch and load data in Hive.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Implemented Spark jobs in Scala and Python, using DataFrames and the Spark SQL API for faster data processing.
- Contributed to key project decisions spanning design, planning, and implementation.
- Wrote HiveQL scripts to populate tables and brought in data from various systems using Sqoop.
- Used DataStage as an ETL tool to extract data from source systems and load it into the Oracle database.
- Built a data lake on the MapR cluster which was used by different teams.
- Wrote Spark applications and mentored other team members on the benefits of Spark.
- Implemented complex logic in Spark to process data stored in MapR-DB and Hive.
- Involved in all phases of the project lifecycle, from requirements collection through design, development, testing, and deployment.
- Built a dashboard of all the YARN applications running on the cluster using YARN API.
Environment: Hadoop, HDFS, Hive, HBase, Sqoop, Oracle 12c, Apache Spark, MapReduce, Python, SQL Server 2012, Spring Boot, Linux, Relational Databases.
Confidential, Detroit, Michigan
Hadoop Developer
Responsibilities:
- Worked on Big Data infrastructure for batch and real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
- Created Hive tables, loaded them with data, and wrote Hive queries that run MapReduce jobs in the backend.
- Designed and implemented Incremental Imports into Hive tables.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Experience importing and exporting terabytes of data using Sqoop from HDFS to relational database systems and vice versa.
- Moved relational database data using Sqoop into Hive dynamic partition tables via staging tables.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Experience in manipulating/analyzing large datasets and finding patterns and insights within structured and unstructured data.
- Experienced in managing and reviewing Hadoop log files.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Worked with different file formats such as Sequence files, XML files, and Map files using MapReduce programs.
- Implemented data ingestion and cluster handling for real-time processing using Kafka.
- Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Implemented Python scripts that perform transformations and actions on tables and send incremental data to the next zone using spark-submit.
- Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed Spark scripts using Scala shell commands as per the requirements.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Exported the analyzed data to relational databases using Sqoop for visualization and report generation.
Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Flume, Python, Kafka, Apache Spark, HBase, Scala, Zookeeper, Maven, AWS, MySQL.
Confidential
Hadoop Developer
Responsibilities:
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Developed Sqoop scripts to load data into HDFS from DB2 and preprocessed it with Pig.
- Automated the tasks of loading the data into HDFS and pre-processing with Pig by developing workflows using Oozie.
- Loaded data from the UNIX file system to HDFS and wrote Hive user-defined functions (UDFs).
- Used Sqoop to load data from DB2 to HBase for faster querying and performance optimization.
- Worked on streaming to collect data from Flume and performed real-time batch processing.
- Developed Hive scripts for implementing dynamic partitions.
- Developed Python scripts to find SQL injection vulnerabilities in SQL queries.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using a testing library.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Developed ETL workflows in Scala, orchestrated with Oozie, to process the obtained data in HDFS and HBase.
- Wrote ETL jobs using DataStage to visualize data and generate reports from the MySQL database.
Environment: Hadoop, HDFS, Hive, Pig, Flume, Mapper, ETL Workflows, HBase, Python, Sqoop, Oozie, DataStage, Linux, Relational Databases, SQL Server 2012, DB2.
Confidential
Associate Software Engineer
Responsibilities:
- Imported data from Cassandra databases and stored it in AWS.
- Performed transformations on the data using different Spark modules.
- Responsible for Spark Core configuration based on the type of input source.
- Ran Scala-based Spark Streaming and Spark SQL code for faster data processing.
- Performed SQL Joins among Hive tables to get input for Spark batch process.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Imported data into Hadoop using Kafka and implemented an Oozie job for daily imports.
- Used the AWS CLI for data transfers to and from Amazon S3 buckets.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
- Wrote various SQL and PL/SQL queries and stored procedures for data retrieval.
- Prepared utilities for unit testing of the application using JSP and Servlets.
- Developed database applications using SQL and PL/SQL.
- Applied design patterns and object-oriented design concepts to improve the existing Java/J2EE code base.
- Pulled data from Amazon S3 buckets into the data lake, built Hive tables on top of it, and created DataFrames in Spark for further analysis.
- Implemented Spark RDD transformations and actions to support business analysis.
- Developed Spark scripts using Scala shell commands as per the requirements.
Environment: Cassandra, Kafka, Spark, Pig, Hive, Oozie, AWS, SQL, Scala, Python, Core Java, FileZilla, PuTTY, IntelliJ, GitHub.
