Big Data Engineer Resume
Pleasanton, CA
SUMMARY
- 8+ years of experience in the IT industry implementing, developing, and maintaining various web-based applications using Java, J2EE technologies, and the Big Data ecosystem.
- Strong knowledge of Hadoop architecture and daemons such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, as well as MapReduce concepts.
- Well versed in implementing end-to-end big data solutions using the Hadoop framework.
- Hands-on experience writing MapReduce programs in Java to handle different data sets using map and reduce tasks.
- Hands-on experience with Sequence files, RC files, combiners, counters, dynamic partitioning, and bucketing for best practices and performance improvement.
- Worked with join patterns and implemented map-side and reduce-side joins in MapReduce.
- Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
- Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
- Experience developing a data pipeline using Kafka to store data in HDFS.
- Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), Amazon EMR, and Amazon Elastic Compute Cloud (Amazon EC2).
- Implemented ad-hoc queries using Hive to perform analytics on structured data.
- Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries (see the UDF sketch after this summary).
- Experienced in optimizing Hive queries by tuning configuration parameters.
- Involved in designing the data model in Hive for migrating the ETL process to Hadoop and wrote Pig scripts to load data into the Hadoop environment.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
- Extensively used Apache Flume to collect logs and error messages across the cluster.
- Experienced in performing real-time analytics on HDFS using HBase.
- Used Cassandra CQL with the Java API to retrieve data from Cassandra tables.
- Worked on implementing and optimizing Hadoop/MapReduce algorithms for big data analytics.
- Experience working with Amazon EMR, Cloudera (CDH3 & CDH4), and Hortonworks Hadoop distributions and delivering to expectations on Hadoop clusters built with them.
- Worked with Oozie and ZooKeeper to manage job flow and coordination in the cluster.
- Experience in performance tuning and monitoring of the Hadoop cluster, gathering and analyzing data about the existing infrastructure using Cloudera Manager.
- Good knowledge of writing Spark applications using Python and Scala.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Used pre-defined operators in Spark such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (a brief sketch follows this summary).
- Used sbt to develop Scala-based Spark projects and executed them using spark-submit.
- Worked with different file formats (ORC, text) and compression codecs (GZIP, Snappy, LZO).
- Experienced in writing and implementing unit test cases using testing frameworks such as JUnit, EasyMock, and Mockito.
- Worked on Talend Open Studio and Talend Integration Suite.
- Adequate knowledge of and working experience with Agile and Waterfall methodologies.
- Good understanding of all aspects of testing, such as unit, regression, agile, white-box, and black-box testing.
- Expert in developing applications using Servlets, JPA, JMS, Hibernate, and the Spring framework.
- Extensive experience implementing and consuming REST-based web services.
- Good knowledge of Web/Application Servers like Apache Tomcat, IBM WebSphere and Oracle WebLogic.
- Ability to work with onsite and offshore team members.
- Able to work on own initiative; highly proactive, self-motivated, and resourceful, with a strong commitment to work.
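A minimal sketch of the Spark operators listed above, assuming a simple word-count layout over text files on HDFS; the input and output paths and the object name are illustrative, not taken from any actual project.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-operators-sketch"))

    val lines = sc.textFile("hdfs:///data/sample/input")            // assumed input path
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)   // flatMap + filter
    val pairs = words.map(word => (word, 1))                        // map to (key, value)

    // Equivalent ways to aggregate counts per key with the operators named above.
    val viaReduce    = pairs.reduceByKey(_ + _)
    val viaGroup     = pairs.groupByKey().mapValues(_.sum)
    val viaAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)
    val viaCombine   = pairs.combineByKey(
      (v: Int) => v,                       // createCombiner
      (acc: Int, v: Int) => acc + v,       // mergeValue
      (a: Int, b: Int) => a + b)           // mergeCombiners

    viaReduce.saveAsTextFile("hdfs:///data/sample/output")          // assumed output path
    sc.stop()
  }
}
```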
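A minimal sketch of a simple Hive UDF; it is written in Scala here to keep all sketches in one language, while the production UDFs referenced above were in Java, and the masking rule and class name are illustrative assumptions. Once packaged, such a UDF is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Simple Hive UDF: masks the local part of an e-mail address, keeping the domain,
// e.g. "jane@example.com" -> "****@example.com". Hive resolves evaluate() by reflection.
class MaskEmailUDF extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) return null
    val value = input.toString
    val at = value.indexOf('@')
    if (at > 0) new Text("*" * at + value.substring(at)) else new Text(value)
  }
}
```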
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop MapReduce, HDFS, ZooKeeper, Hive, Pig, Sqoop, Oozie, Flume, YARN, Spark
Database Languages: SQL, PL/SQL (Oracle)
Programming Languages: Java, Scala
Frameworks: Spring, Hibernate, JMS
Scripting Languages: JSP, Servlets, JavaScript, XML, HTML, Python
Web Services: RESTful web services
Databases: RDBMS, HBase, Cassandra
IDE: Eclipse, IntelliJ
Platforms: Windows, Linux, Unix
Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss
Methodologies: Agile, Waterfall
ETL Tools: Informatica, Talend
PROFESSIONAL EXPERIENCE
Confidential, Pleasanton, CA
Big Data Engineer
Responsibilities:
- Developed Spark applications using Scala and Java and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Experience implementing Spark RDDs in Scala.
- Experience building bi-directional data pipelines between Oracle 11g and HDFS using Sqoop.
- Experience developing MapReduce programs in Java and optimizing MapReduce algorithms using Custom Partitioner, Custom Shuffle, and Custom Sort.
- Designed and built the reporting application, which uses Spark SQL to fetch HBase table data and generate reports (see the reporting sketch after this list).
- Designed and built a Spark Streaming application that analyzes and evaluates streaming data against business rules through a rules engine, then sends alerts to business users to address customer preferences and run product promotions.
- Parsed JSON and XML files with Pig loader functions and extracted information from Pig relations by applying regular expressions with Pig's built-in functions.
- Experience writing Pig Latin scripts for Data Cleansing, ETL operations and query optimization of existing scripts.
- Experience creating Hive tables, loading tables with data and aggregating data by writing Hive queries.
- Performed Schema design for Hive and optimized the Hive configuration.
- Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances.
- Ability to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Experience designing and executing time-driven and data-driven Oozie workflows.
- Experience developing algorithms for full-text search capability using Solr.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Experience developing programs to deal with multiple compression formats such as LZO, GZIP, Snappy and LZ4.
- Experience loading and transforming large amounts of structured and unstructured data into HBase, with exposure to handling automatic failover in HBase.
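A minimal sketch of the reporting flow mentioned above: it assumes the HBase table has already been exposed to the Hive metastore as an external table (called events_hbase here) through the HBase storage handler, so Spark SQL can query it like any other Hive table; table, column, and path names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object DailyReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("daily-report-sketch")
      .enableHiveSupport()                      // read tables registered in the Hive metastore
      .getOrCreate()

    // Aggregate the HBase-backed Hive table into a daily report.
    val report = spark.sql(
      """SELECT event_date, event_type, COUNT(*) AS event_count
        |FROM events_hbase
        |GROUP BY event_date, event_type""".stripMargin)

    // Persist the aggregated report for downstream consumers (path is an assumption).
    report.write.mode("overwrite").parquet("hdfs:///reports/daily_event_counts")
    spark.stop()
  }
}
```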
Environment: Hadoop, Spark, Spark Streaming, Spark SQL, AWS EMR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Cloudera (CDH 4), Java, Eclipse, Teradata, MongoDB, Ubuntu, UNIX, and Maven.
Confidential, Minneapolis, MN
Spark/Scala Developer
Responsibilities:
- Analyzed and defined the researchers' strategy and determined the system architecture and requirements needed to achieve the goals.
- Developed multiple Kafka producers and consumers as per the software requirement specifications.
- Used Kafka for log aggregation, gathering physical log files from servers and placing them in a central location such as HDFS for processing.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Used various Spark transformations and actions to cleanse the input data.
- Developed shell scripts to generate Hive CREATE statements from the data and load the data into the tables.
- Wrote MapReduce jobs using the Java API and Pig Latin.
- Optimized HiveQL and Pig scripts by using execution engines such as Tez and Spark.
- Involved in writing custom MapReduce programs using the Java API for data processing.
- Integrated Maven build and designed workflows to automate the build and deploy process.
- Involved in developing a linear regression model, built using Spark with the Scala API, to predict a continuous measurement and improve observations on wind turbine data (see the regression sketch after this list).
- Created Hive tables as internal or external tables per requirements, defined with appropriate static or dynamic partitions and bucketing for efficiency.
- Loaded and transformed large sets of structured and semi-structured data using Hive.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the streaming sketch after this list).
- Used Spark and Spark SQL with the Scala API to read the Parquet data and create the tables in Hive.
- Developed Hive queries for the analysts.
- Implemented Cassandra access using the DataStax Java API.
- Very good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Implemented Spark applications using Scala, DataFrames, and the Spark SQL API for faster testing and processing of data.
- Involved in making code changes to a turbine simulation module for processing across the cluster using spark-submit.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
- Imported data from different sources, such as AWS S3 and the local file system (LFS), into Spark RDDs.
- Involved in performing analytics and visualization on log data to estimate the error rate and study the probability of future errors using regression models.
- Used the WebHDFS REST API to make HTTP GET, PUT, POST, and DELETE requests from the web server to perform analytics on the data lake.
- Worked on a POC to perform sentiment analysis of Twitter data using the OpenNLP API.
- Worked on high-performance computing (HPC) to simulate tools required for the genomics pipeline.
- Used Kafka to build a customer activity tracking pipeline as a set of real-time publish-subscribe feeds.
- Provided cluster coordination services through ZooKeeper.
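A minimal sketch of the Kafka-to-HDFS streaming path described above, using the spark-streaming-kafka-0-10 direct integration; the broker address, topic, consumer group, batch interval, and output path are illustrative assumptions.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hdfs-sketch").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(30))   // assumed batch interval
    import spark.implicits._

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",                           // assumed broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "hdfs-ingest",                            // assumed consumer group
      "auto.offset.reset"  -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Convert each micro-batch of Kafka record values to a DataFrame and append it as Parquet on HDFS.
    stream.map(_.value).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        rdd.toDF("raw_event").write.mode("append").parquet("hdfs:///data/events/parquet")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```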
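A minimal sketch of the linear regression on wind turbine data mentioned above, using Spark MLlib's DataFrame-based API; the Hive table name and the feature/label columns are illustrative assumptions.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object TurbineRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("turbine-regression-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Assumed Hive table of sensor readings; drop rows with missing values.
    val readings = spark.table("turbine_readings")
      .select("wind_speed", "rotor_rpm", "ambient_temp", "power_output")
      .na.drop()

    // Assemble the assumed feature columns into a vector; power_output is the continuous label.
    val assembled = new VectorAssembler()
      .setInputCols(Array("wind_speed", "rotor_rpm", "ambient_temp"))
      .setOutputCol("features")
      .transform(readings)
      .withColumnRenamed("power_output", "label")

    val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model   = new LinearRegression().setMaxIter(50).fit(train)
    val metrics = model.evaluate(test)
    println(s"RMSE on held-out data: ${metrics.rootMeanSquaredError}")

    spark.stop()
  }
}
```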
Environment: Hadoop, Hive, HDFS, HPC, WebHDFS, WebHCat, Spark, Spark SQL, Kafka, Java, Scala, web servers, Maven, and sbt builds.