Sr. Big Data Engineer/Hadoop Developer Resume
Durham, NC
SUMMARY
- Experience in working on various Hadoop data access components like MAPREDUCE, PIG, HIVE, HBASE, SPARK and KAFKA.
- Experience in handling Hive queries using Spark SQL integrated with the Spark environment.
- Good knowledge of Hadoop data management components like HDFS and YARN.
- Hands-on experience in using various Hadoop workflow components like SQOOP, FLUME and KAFKA.
- Worked on Hadoop data operation components like ZOOKEEPER and OOZIE.
- Strong experience writing MapReduce and Spark jobs in Scala, Java and Python using the Java, Apache Hadoop, PySpark and Spark APIs for data analysis.
- Working knowledge of AWS technologies like S3 and EMR for storage, big data processing and analysis.
- Good understanding of Hadoop security components like RANGER and KNOX.
- Good experience working with Hadoop distributions such as HORTONWORKS and CLOUDERA.
- Excellent programming skills at a higher level of abstraction using SCALA and JAVA.
- Experience in Java programming, with skills in analysis, design, testing and deployment using technologies like J2EE, JavaScript, JSP, JDBC, HTML, XML and JUnit.
- Good knowledge of Apache Spark components including SPARK CORE, SPARK SQL, SPARK STREAMING and SPARK MLLIB.
- Experience in performing transformations and actions on Spark RDDs using Spark Core.
- Experience in using broadcast variables, accumulator variables and RDD caching in Spark (see the sketch after this list).
- Experience in troubleshooting cluster jobs using the Spark UI.
- Experience working with the Cloudera Distribution of Hadoop (CDH) and the Hortonworks Data Platform (HDP).
- Expert in the Hadoop and big data ecosystem, including Hive, HDFS, Spark, Kafka, MapReduce, Sqoop, Oozie and Zookeeper.
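A minimal Scala sketch of the broadcast-variable, accumulator and RDD-caching techniques listed above; the lookup map, record format and sample data are purely hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-basics-sketch")
      .master("local[*]")        // local master only for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Broadcast variable: small lookup table shipped once to every executor.
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))
    // Accumulator variable: counts malformed records seen during the transformation.
    val badRecords = sc.longAccumulator("badRecords")

    val raw = sc.parallelize(Seq("US,100", "IN,250", "??,7"))

    // Transformation: parse and enrich each record; cache() because two actions reuse it.
    val enriched = raw.flatMap { line =>
      val Array(code, amount) = line.split(",")
      countryNames.value.get(code) match {
        case Some(name) => Some((name, amount.toLong))
        case None       => badRecords.add(1); None
      }
    }.cache()

    // Actions: count() and sum() both run against the cached RDD.
    println(s"records kept: ${enriched.count()}, total amount: ${enriched.values.sum()}")
    println(s"malformed records: ${badRecords.value}")
    spark.stop()
  }
}
```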
Technical Skills:
- Data Management: HDFS, YARN
- Data Workflow: Sqoop, Flume, Kafka
- Data Operation: Zookeeper, Oozie
- Data Security: Ranger, Knox
- Big Data Distributions: Hortonworks, Cloudera
- Cloud Technologies: AWS (Amazon Web Services) EC2, S3, IAM, CloudWatch, DynamoDB, SNS, SQS, EMR, Kinesis
- Programming & Languages: Java, Scala, Pig Latin, HQL, SQL, Shell Scripting, HTML, CSS, JavaScript
- IDE/Build Tools: Eclipse, IntelliJ IDEA
- Java/J2EE Technologies: XML, Junit, JDBC, AJAX, JSON, JSP
- Operating Systems: Linux, Windows, Kali Linux
- SDLC: Agile/SCRUM, Waterfall
Professional Experience:
Confidential, Durham NC
Sr. Big Data Engineer/Hadoop Developer
Responsibilities:
- Implemented Spark applications using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Converted existing MapReduce jobs into Spark transformations and actions using Spark RDDs, DataFrames and the Spark SQL API.
- Implemented solutions using advanced AWS components (Glue, Lambda, Athena, SNS, SQS, etc.) integrated with big data/Hadoop distribution frameworks such as Zookeeper, YARN, Hive (Beeline), Spark, PySpark and Pig.
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze customer behavioral data (sketched after this list).
- Worked on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
- Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming against streaming sources such as Kafka.
- Developed Spark jobs and Hive jobs to summarize and transform data.
- Streamed AWS CloudWatch log groups into a Lambda function to create ServiceNow incidents.
- Created end-to-end Spark-Solr applications using Scala to perform data cleansing, validation and transformation according to requirements.
- Developed Spark programs with Python, applying functional programming principles to process complex structured data sets.
- Implemented Spark applications in Scala using higher-order functions for both batch and interactive analysis requirements.
- Developed Spark scripts for data analysis in Scala.
- Used Spark-Streaming APIs to perform necessary transformations.
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL and Scala.
- Worked with Spark to consume data from Kafka and convert it to a common format using Scala.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
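A minimal Scala sketch of the Kafka-to-Hive behavioral-data pipeline described above, using Spark Structured Streaming; the broker address, topic name, JSON schema and HDFS paths are placeholders rather than actual project values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object KafkaToHiveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hive-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical schema for the JSON behavioral events arriving on the topic.
    val eventSchema = new StructType()
      .add("userId", StringType)
      .add("action", StringType)
      .add("eventTime", TimestampType)

    // Consume the Kafka topic as a streaming DataFrame (placeholder brokers/topic).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "customer-events")
      .load()

    // Convert the raw Kafka value bytes into a common, columnar format.
    val events = raw
      .select(from_json($"value".cast("string"), eventSchema).as("e"))
      .select($"e.userId", $"e.action", $"e.eventTime")
      .withColumn("event_date", to_date($"eventTime"))

    // Land the cleansed stream as date-partitioned Parquet files.
    events.writeStream
      .format("parquet")
      .option("path", "/data/customer_events")
      .option("checkpointLocation", "/checkpoints/customer_events")
      .partitionBy("event_date")
      .start()
      .awaitTermination()
  }
}
```

A Hive external table defined over the landed path (partitioned by event_date) would then make the cleansed data queryable for downstream analysis.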
ENVIRONMENT: Hadoop 2.x, Spark Core, Spark SQL, Spark API, Spark Streaming, PySpark, Hive, Oozie, Amazon EMR, Tableau, Impala, RDBMS, YARN, JIRA, MapReduce.
Confidential, New York, NY
Sr. Big Data Engineer/Hadoop Developer
Responsibilities:
- Responsible for collecting, cleaning and storing data for analysis using Kafka, Sqoop, Spark and HDFS.
- Used the Kafka and Spark frameworks for real-time and batch data processing.
- Ingested large amounts of data from different data sources into HDFS using Kafka.
- Implemented Spark using Scala and performed cleansing of data by applying Transformations and Actions
- Used Scala case classes to convert RDDs into DataFrames in Spark (illustrated after this list).
- Processed and analyzed data stored in HBase and HDFS.
- Completed a highly immersive data science program involving data manipulation and visualization, web scraping, machine learning, Python programming, SQL, Git, Unix commands, NoSQL, MongoDB and Hadoop.
- Implemented AWS solutions using EC2, S3 and load balancers.
- Created on-demand tables over S3 files using Lambda functions and AWS Glue with Python and PySpark.
- Installed applications on AWS EC2 instances and configured storage on S3 buckets.
- Stored and loaded data between HDFS and Amazon S3 and backed up the namespace data.
- Developed Spark jobs using Scala on top of YARN for interactive and batch analysis.
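A minimal Scala sketch of the case-class-to-DataFrame conversion mentioned above; the Customer fields and sample records are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Case class describing one cleansed record; its fields become the DataFrame columns.
case class Customer(id: Long, name: String, state: String)

object RddToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df-sketch").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    // Raw delimited lines as they might sit on HDFS after ingestion.
    val raw = sc.parallelize(Seq("1,Alice,NC", "2,Bob,NY"))

    // Transformation: parse each line into the case class.
    val customers = raw.map { line =>
      val Array(id, name, state) = line.split(",")
      Customer(id.toLong, name, state)
    }

    // toDF() infers the schema from the case class.
    val customerDF = customers.toDF()
    customerDF.createOrReplaceTempView("customers")
    spark.sql("SELECT state, count(*) AS cnt FROM customers GROUP BY state").show()
    spark.stop()
  }
}
```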
ENVIRONMENT: Spark, Sqoop, Scala, Hive, Kafka, YARN, PySpark, Teradata, RDBMS, HDFS, Oozie, Zookeeper, AWS, HBase, Tableau, Hadoop (Cloudera), JIRA
Confidential, Boston, MA
Hadoop Developer
Responsibilities:
- Actively participated in interaction with users to fully understand the requirements of the system
- Gained experience with the Hadoop ecosystem and NoSQL databases.
- Migrated required data from Oracle and MySQL into HDFS using Sqoop and imported flat files in various formats into HDFS.
- Imported data from RDBMSs (MySQL, Teradata) into HDFS and vice versa using Sqoop (a big data ETL tool) for business intelligence, visualization and report generation.
- Worked with Kafka to bring near real-time data onto the big data cluster and the required data into Spark for analysis.
- Used Spark Streaming to receive near real-time data from Kafka and store the stream data, using Scala, to HDFS and NoSQL databases such as Cassandra.
- Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and migrated MapReduce programs into Spark transformations using Spark and Scala, initially implemented in Python (PySpark).
- Worked on collecting large data sets using Python scripting and Spark SQL.
- Worked on large sets of structured and unstructured data.
- Involved in analyzing data by writing HiveQL queries for faster data processing.
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning and buckets (sketched after this list).
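A minimal sketch of the partitioned Hive external table work, issued here as HiveQL through a Scala SparkSession with Hive support (and thus the shared metastore); the table name, columns, HDFS location and the clickstream_staging source table are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object HiveExternalTableSketch {
  def main(args: Array[String]): Unit = {
    // Hive support points Spark at the shared metastore configured in hive-site.xml.
    val spark = SparkSession.builder()
      .appName("hive-external-table-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // External, partitioned table over data already landed on HDFS (placeholder layout).
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
      )
      PARTITIONED BY (event_date STRING)
      STORED AS PARQUET
      LOCATION '/data/clickstream'
    """)

    // Dynamic partitioning: the partition value is derived from the data itself.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
      INSERT OVERWRITE TABLE clickstream PARTITION (event_date)
      SELECT user_id, url, ts, date_format(ts, 'yyyy-MM-dd') AS event_date
      FROM clickstream_staging
    """)

    // HiveQL-style analysis over the partitioned table.
    spark.sql("""
      SELECT event_date, count(DISTINCT user_id) AS daily_users
      FROM clickstream
      GROUP BY event_date
    """).show()

    spark.stop()
  }
}
```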
ENVIRONMENT: HDFS, Kafka, Sqoop, Scala, Java, Hive, Oozie, NoSQL, Oracle, MySQL, Git, Zookeeper, DataStax Cassandra, JIRA, Hortonworks Data Platform, Jenkins, Agile (SCRUM).
Confidential, IL
Hadoop Developer
Responsibilities:
- Responsible for managing data coming from different sources, and involved in HDFS maintenance and loading of structured and unstructured data.
- Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest behavioral data into HDFS for analysis.
- Responsible for importing log files from various sources into HDFS using Flume.
- Imported data using Sqoop to load data from MySQL into HDFS on a regular basis.
- Extracted files from MongoDB through Sqoop, placed them in HDFS and processed them.
- Created a customized BI tool for the management team that performs query analytics using HiveQL.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins (sketched after this list).
- Estimated the hardware requirements for the NameNode and DataNodes and planned the cluster.
- Developed a framework to import data from databases into HDFS using Sqoop, and developed HQL queries to extract data from Hive tables for reporting.
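A minimal sketch of the state-partitioned, bucketed Hive table and bucket-based join described above. On the cluster this HiveQL would typically be run through the Hive CLI or Hue; it is wrapped in a Scala JDBC client here only for illustration, and the HiveServer2 URL, table layouts and the customers table are assumptions:

```scala
import java.sql.DriverManager

object HivePartitionBucketSketch {
  def main(args: Array[String]): Unit = {
    // Requires the hive-jdbc driver on the classpath; endpoint and credentials are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "")
    val stmt = conn.createStatement()

    // Orders table partitioned by state and bucketed on customer_id for bucket-based joins.
    stmt.execute("""
      CREATE TABLE IF NOT EXISTS orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
      )
      PARTITIONED BY (state STRING)
      CLUSTERED BY (customer_id) INTO 32 BUCKETS
      STORED AS ORC
    """)

    // Ask Hive to exploit bucketing when joining tables bucketed on the same key.
    stmt.execute("SET hive.optimize.bucketmapjoin=true")

    // Analytic query joining the bucketed orders table with an assumed customers table.
    val rs = stmt.executeQuery("""
      SELECT o.state, count(*) AS order_count
      FROM orders o JOIN customers c ON o.customer_id = c.customer_id
      GROUP BY o.state
    """)
    while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")

    conn.close()
  }
}
```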
ENVIRONMENT: Hadoop, HDFS, HBase, MapReduce, Java, C++, Python, Linux, AWS, Hive, Pig, Sqoop, Flume, Kafka, Oozie, Hue, Storm, Zookeeper, SQL, ETL, Cassandra, Cloudera Manager, MySQL, MongoDB, Agile.