Senior Bigdata Engineer Resume

Los Angeles, California

PROFESSIONAL SUMMARY:

  • 6+ years of experience with the Hadoop big data ecosystem, covering ingestion, storage, querying, processing and analysis of big data.
  • Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark and Flume for big data processing and analytics.
  • Hands-on experience installing and configuring Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, HBase, Oozie, Sqoop, Flume and Kafka.
  • Excellent knowledge of Hadoop architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm.
  • Experience developing MapReduce programs on Apache Hadoop to analyze big data according to requirements.
  • Wrote data transformations and data cleansing routines using Pig operations; good experience retrieving and processing data using Hive.
  • Worked with HBase to perform quick lookups, updates, inserts and deletes in Hadoop.
  • Experience using the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Proficient in Hive optimization techniques such as bucketing and partitioning.
  • Experienced in loading datasets into Hive for ETL (Extract, Transform and Load) operations.
  • Experience importing and exporting data using Sqoop from relational database systems to HDFS and vice versa.
  • Extensive experience importing and exporting data using stream processing platforms like Flume.
  • Developed Apache Spark jobs using Scala and Python for faster data processing and used Spark Core and Spark SQL libraries for querying.
  • Experience in creating Spark Streaming jobs to process huge sets of data in real time.
  • Experience tuning Spark jobs for efficiency in terms of storage and processing.
  • Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
  • Experience working with AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway and IAM.
  • Tested, cleaned and standardized data to meet business standards using Execute SQL Task, Conditional Split, Data Conversion and Derived Column in different environments.
  • Expertise in relational database systems (RDBMS) such as MySQL and NoSQL database systems such as HBase; basic knowledge of MongoDB and Cassandra.
  • Experience in database development using SQL and PL/SQL, and in working with databases such as Oracle 12c/11g/10g, SQL Server and MySQL.
  • Good understanding of Hadoop Gen1/Gen2 architecture and hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, Secondary NameNode and DataNode, MapReduce concepts, and the YARN architecture, which includes NodeManager, ResourceManager and ApplicationMaster.
  • A great team player with the ability to communicate effectively with all levels of the organization, including technical staff, management and customers.
  • Ability to quickly master new concepts and applications.

TECHNICAL SKILLS:

Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Pig, Sqoop, Oozie, Spark, Kafka

Hadoop platforms: Cloudera, MapR

AWS services: EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, IAM

Languages: JavaScript, Scala, SQL, Python

Web Technologies: HTML5, CSS3, JavaScript

Scripting Language: UNIX Shell Script

RDBMS DB: MySQL, Oracle 12c/11g/10g

NoSQL Technologies: HBase, MongoDB, Cassandra, DynamoDB

Tools & Utilities: Eclipse, Visual Studio, NetBeans, GitHub, Maven, Jenkins

Operating Systems: Unix, Windows, CentOS, Linux (Ubuntu, Red Hat)

Others: PuTTY, WinSCP, GitHub

PROFESSIONAL EXPERIENCE:

Confidential, Los Angeles, California

Senior Bigdata Engineer

Responsibilities:

  • Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
  • Developed Java applications that read data from MSK (Kafka) and write it to DynamoDB.
  • Developed applications that leverage Step Functions and CloudWatch event triggers to fetch data and generate features from it.
  • Closely involved in key project decisions spanning design, planning, implementation and security.
  • Helped create a research data lake by extracting customer data from various sources, including Excel files, databases and server logs, into S3.
  • Developed Apache Spark applications in Scala to process data from various streaming sources.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in DynamoDB using Scala.
  • Created Python-based Lambda functions for feature extraction.
  • Created a lightweight serverless pipeline that runs on Lambda functions and generates insights.
  • Created a set of data classifiers that read from DynamoDB, classify the features into bins and store them back in DynamoDB.
  • Used Lambda with SNS to send insight notifications to mobile devices.
  • Implemented Tableau mobile dashboards via Tableau mobile application.
  • Used various DataStage Designer stages such as Lookup, Join, Merge, Funnel, Filter, Copy, Aggregator and Sort.
  • Involved in all phases of the SDLC and collaborated with a large team to get this pipeline operational.
  • Configured CloudWatch logs and created a CloudWatch dashboard for monitoring.
  • Deployed machine learning models on SageMaker and exposed them as endpoints.
  • Called the endpoints to invoke the models in real time and generate insights.

Environment: Lambda, MSK, KMS, Spark, SQL Server 2016/2014, DB2, DynamoDB, CloudWatch, Tableau, Python, SNS, Step Functions.

Confidential, Columbus, Ohio

Bigdata Developer

Responsibilities:

  • Achieved near real-time reporting by adopting an event-based processing approach instead of micro-batching for data coming from Kafka.
  • Developed Spring Boot applications to read data from Kafka in an event-based manner. These applications ran as microservices, each handling part of the problem, and were deployed in Docker containers that were built and deployed automatically by Jenkins pipelines.
  • Wrote Spring Boot applications that read data from Kafka and write it to MapR-DB (the MapR version of HBase).
  • Wrote applications that both produced data to Kafka and consumed data from it.
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
  • Implemented Spark solutions to generate reports and to fetch and load data in Hive.
  • Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
  • Implemented Spark jobs in Scala and Python using DataFrames and the Spark SQL API for faster data processing.
  • Closely involved in key project decisions spanning design, planning and implementation.
  • Wrote HiveQL scripts to populate tables and brought in data from various systems using Sqoop.
  • Used DataStage as an ETL tool to extract data from source systems and load it into the Oracle database.
  • Built a data lake on the MapR cluster which was used by different teams.
  • Wrote Spark applications and mentored other team members on the benefits of Spark.
  • Implemented complex logic in Spark to process data stored in MapR-DB and Hive.
  • Involved in all phases of the project lifecycle, from requirements gathering through design, development, testing and deployment.
  • Built a dashboard of all YARN applications running on the cluster using the YARN API.

Environment: Hadoop, HDFS, Hive, HBase, Sqoop, Oracle 12c, Apache Spark, MapReduce, Python, SQL Server 2012, Spring Boot, Linux, Relational Databases.

Confidential, Detroit, Michigan

Hadoop Developer

Responsibilities:

  • Worked on big data infrastructure for both batch and real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
  • Created Hive tables, loaded them with data and wrote Hive queries that invoke and run MapReduce jobs in the backend.
  • Designed and implemented Incremental Imports into Hive tables.
  • Applied partitioning and bucketing concepts in Hive and designed both managed and external tables to optimize performance.
  • Imported and exported terabytes of data using Sqoop between HDFS and relational database systems.
  • Moved relational database data using Sqoop into Hive dynamic partition tables via staging tables.
  • Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
  • Manipulated and analyzed large datasets to find patterns and insights within structured and unstructured data.
  • Managed and reviewed Hadoop log files.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Worked with different file formats such as SequenceFiles, XML files and MapFiles using MapReduce programs.
  • Implemented data ingestion and cluster handling for real-time processing using Kafka.
  • Designed and developed Spark applications in Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Implemented Python scripts that perform transformations and actions on tables and send incremental data to the next zone using spark-submit.
  • Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Developed Spark scripts using Scala shell commands as required.
  • Developed Spark code using Spark SQL and Spark Streaming for faster testing and processing of data.
  • Exported the analyzed data to relational databases using Sqoop for visualization and report generation.

Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Flume, Python, Kafka, Apache Spark, HBase, Scala, Zookeeper, Maven, AWS, MySQL.

Confidential

Hadoop Developer

Responsibilities:

  • Developed Pig and Hive jobs (executed as MapReduce) for data cleaning and pre-processing.
  • Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
  • Developed Sqoop scripts to load data into HDFS from DB2 and pre-processed it with Pig.
  • Automated the tasks of loading the data into HDFS and pre-processing with Pig by developing workflows using Oozie.
  • Loaded data from the UNIX file system to HDFS and wrote Hive user-defined functions.
  • Used Sqoop to load data from DB2 to HBase for faster querying and performance optimization.
  • Collected streaming data from Flume and performed real-time batch processing.
  • Developed Hive scripts for implementing dynamic partitions.
  • Developed Python scripts to find vulnerabilities in SQL queries through SQL injection testing.
  • Developed a suite of unit test cases for Mapper, Reducer and Driver classes using a testing library.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Developed ETL workflows in Scala, orchestrated with Oozie, to process the collected data in HDFS and HBase.
  • Wrote ETL jobs using DataStage to visualize data and generate reports from the MySQL database.

Environment: Hadoop, HDFS, Hive, Pig, Flume, Mapper, ETL Workflows, HBase, Python, Sqoop, Oozie, DataStage, Linux, Relational Databases, SQL Server 2012, DB2.

Confidential

Associate Software Engineer

Responsibilities:

  • Imported data from Cassandra databases and stored it in AWS.
  • Performed transformations on the data using different Spark modules.
  • Responsible for Spark Core configuration based on the type of input source.
  • Executed Spark Streaming and Spark SQL code written in Scala for faster data processing.
  • Performed SQL Joins among Hive tables to get input for Spark batch process.
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs.
  • Created partitions and buckets based on state for further processing with bucket-based Hive joins.
  • Imported data into Hadoop using Kafka and implemented an Oozie job for daily imports.
  • Used the AWS CLI for data transfers to and from Amazon S3 buckets.
  • Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
  • Wrote various SQL and PL/SQL queries and stored procedures for data retrieval.
  • Prepared utilities for unit testing of the application using JSP and Servlets.
  • Developed database applications using SQL and PL/SQL.
  • Applied design patterns and object-oriented design concepts to improve the existing Java/J2EE-based codebase.
  • Pulled data from Amazon S3 buckets into the data lake, built Hive tables on top of it and created DataFrames in Spark for further analysis.
  • Implemented Spark RDD transformations and actions for business analysis.
  • Developed Spark scripts using Scala shell commands as required.

Environment: Cassandra, Kafka, Spark, Pig, Hive, Oozie, AWS, SQL, Scala, Python, Core Java, FileZilla, PuTTY, IntelliJ, GitHub.
