Data Engineer Resume
Sunnyvale, California
SUMMARY
- Data Engineer with 4+ years of experience in Information Technology, including 2+ years in the Hadoop / Spark ecosystem.
- Expertise in Hadoop ecosystem components such as HDFS, MapReduce, Sqoop, Kafka, the Spark framework, and Spark Streaming. 4+ years of exposure to SQL, including MySQL, PL/SQL, and MS SQL.
- Good knowledge of core and advanced Java and the Spring framework (2+ years).
- Good knowledge of the Python programming language (2+ years).
- Worked with Kafka 0.11 as the producer and Spark Streaming as the consumer.
- Worked with key-value pairs using RDD transformations and actions for sorting, filtering, and analyzing big data (PySpark; see the sketch after this summary). Good knowledge of NoSQL databases, particularly Cassandra.
- Knowledge of Amazon EMR, S3, EC2 clusters, and McQueen (Confidential cloud service).
- Knowledge of the Scala programming language for developing Spark applications.
- Good knowledge of regular expressions and parsing unstructured log files of any type.
- Worked with file formats such as JSON, CSV, SequenceFile, Parquet, and plain text for both importing from and exporting to AWS S3 and Cassandra.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems / mainframes.
- Expertise in writing MapReduce jobs in Java for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.
- Knowledge of BI visualization tools such as Tableau.
- Experience handling multiple relational databases: MySQL, SQL Server, PostgreSQL, and Oracle.
- Good knowledge of the Software Development Life Cycle (SDLC) and Software Testing Life Cycle (STLC). Used Git and RIO (Confidential CI/CD service) for repositories and Jenkins for continuous integration.
- Excellent communication and interpersonal skills; a detail-oriented, analytical, responsible team player with strong time management, a high degree of self-motivation, and the ability to learn quickly.
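A minimal PySpark sketch of the key-value RDD work mentioned above; the input path, record layout, and field positions are illustrative assumptions, not taken from any specific project.

    from pyspark import SparkContext

    sc = SparkContext(appName="kv-rdd-sketch")

    # Hypothetical input: comma-separated records of (user_id, event, bytes)
    lines = sc.textFile("hdfs:///data/events/sample.csv")

    pairs = (
        lines.map(lambda line: line.split(","))   # parse each record
             .filter(lambda f: len(f) == 3)       # drop malformed rows
             .map(lambda f: (f[0], int(f[2])))    # key by user_id, value = bytes
    )

    # Aggregate per key, then sort by total bytes in descending order
    totals = pairs.reduceByKey(lambda a, b: a + b)
    top_users = totals.sortBy(lambda kv: kv[1], ascending=False).take(10)

    for user_id, total_bytes in top_users:
        print(user_id, total_bytes)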
TECHNICAL SKILLS
Big Data Technologies: Hadoop Architecture, HDFS, Hive, Sqoop, Zookeeper, Kafka, Apache Spark, Spark Streaming, Spark SQL
Databases: MySQL, SQL Server, Oracle 12C, Oracle 10g, Cassandra
Programming Language: Python, Scala
CICD: Jenkins, GIT, RIO
Operating Systems: Windows, macOS, CentOS, Linux/UNIX
Environment: Cloudera, Amazon EMR
Cloud: Amazon S3, McQueen, Bithub
PROFESSIONAL EXPERIENCE
Confidential, Sunnyvale, California
Data Engineer
Responsibilities:
- Handled up to 3 TB of data for batch processing, with exposure to datasets of up to 100 TB.
- Handled up to 20 GB of live data per minute using a Kafka producer and Spark Streaming as the consumer (see the streaming sketch after this list).
- Extensive knowledge of Kafka producers (Confluent PyKafka), brokers, topics, offsets, etc.
- Worked with multiple direct streams as a consumer group (Spark Streaming).
- Used the DataStax connector to connect Spark with Cassandra for bulk reads and writes.
- Worked on building a REST API that serves Cassandra data to the UI.
- Extensive knowledge of McQueen cloud storage (an abstraction over AWS S3).
- Experience handling data put/get operations between Spark compute and McQueen.
- Exposure to multiple Python packages such as boto3 and pandas.
- Good understanding of Spark (PySpark 2.3) compute on Mesos, Hadoop YARN, and standalone clusters.
- Good knowledge of Linux/Unix commands and shell scripting.
- Good understanding of Spark concepts such as DataFrames and RDDs.
- Tuned Spark job performance by setting appropriate executor memory, cores, etc.
- Used Chronos and crontab for scheduling Mesos-based and other Spark batch / live jobs.
- Knowledge of parsing log files using regular expressions in Python (see the parsing sketch after this list).
- More than 4 years of SQL knowledge.
- Good knowledge of merging and massaging data during preprocessing for data models from different data sources.
- Good knowledge of Scala functional concepts used for building RDD and DataFrame logic in Spark.
- Worked on multi-node clusters (around 100+ nodes, each with 180 GB of executor memory).
- Knowledge of Python multithreading and the requests library.
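A minimal sketch of the Kafka-to-Spark-Streaming consumer pattern referenced in the list above, assuming the PySpark 2.3 KafkaUtils direct-stream API; the broker address, topic name, and batch interval are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-direct-stream-sketch")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches

    # Direct stream: Spark tracks Kafka offsets itself rather than using a receiver
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["events"],                                     # placeholder topic
        kafkaParams={"metadata.broker.list": "broker1:9092"},  # placeholder broker
    )

    # Each record arrives as a (key, value) tuple; count messages per micro-batch
    stream.map(lambda kv: kv[1]).count().pprint()

    ssc.start()
    ssc.awaitTermination()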
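A small Python sketch of the regex-based log parsing mentioned above; the log layout, field names, and file path are assumed for illustration.

    import re

    # Assumed layout: "2021-06-01 10:32:07 ERROR payment-service: message text"
    LOG_PATTERN = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
        r"(?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
    )

    def parse_line(line):
        """Return a dict of named fields, or None if the line does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    with open("app.log") as fh:  # placeholder log file
        errors = [rec for rec in map(parse_line, fh)
                  if rec and rec["level"] == "ERROR"]

    print(len(errors), "error lines parsed")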
Environment: Hadoop v2.7.0, PIE Spark, PIE Kafka, McQueen (S3), Python, Scala, Spark Streaming, Cassandra 3.2, Spark Core 2.3, SQL, Linux/Unix, Git, RIO
Confidential, Manhattan, New York
Spark Developer
Responsibilities:
- As a ground-up project, the application was developed entirely from scratch; worked mainly on writing the Kafka consumer code as per requirements.
- Used Spark Streaming to access live data from Kafka using stateless transformations.
- Good understanding of concepts such as stateful and stateless transformations.
- Hands-on experience with Spark SQL for processing DataFrames and Datasets residing in HDFS using Python (see the Spark SQL sketch after this list).
- Performed Spark tuning at the memory-management and data-serialization levels.
- Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and memory usage. Involved in writing basic Scala programs to process data from the Kafka server using KafkaUtils.
- Used Spark Streaming to access Kafka brokers via the KafkaUtils API and processed live stream data with window transformations.
- Hands-on experience with Cassandra 3.2 for writing data from Spark to Cassandra (see the connector sketch after this list).
- Hands-on experience with data modelling concepts such as relationship keys and hierarchies.
- Good understanding of Cassandra architecture: nodes, clusters, partitions, virtual nodes, snitches, etc.
- Hands-on experience setting consistency levels and working with hinted handoff and read repair. Work experience connecting Spark with Cassandra using the DataStax connector.
- Worked on Tableau for descriptive visualizations such as dashboards and stories by connecting to Excel / databases.
- Did a POC on loading data from the Hadoop cluster to Amazon S3 for processing in Amazon EMR, along with some data modelling in Redshift.
- Configured, supported, and maintained all network, firewall, storage, load balancer, operating system, and software components in AWS EC2.
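A minimal PySpark sketch of the Spark SQL processing referenced in the list above; the HDFS path, columns, and query are illustrative assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

    # Hypothetical Parquet dataset in HDFS with columns (customer_id, region, amount)
    orders = spark.read.parquet("hdfs:///warehouse/orders")

    orders.createOrReplaceTempView("orders")
    summary = spark.sql("""
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders
        GROUP BY region
    """)

    summary.show()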
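A sketch of writing a Spark DataFrame to Cassandra through the DataStax spark-cassandra-connector, as referenced above; the connection host, keyspace, table, and connector version are assumptions, and the connector JAR is expected on the classpath (for example via spark-submit --packages).

    from pyspark.sql import SparkSession

    # Assumes submission with the DataStax connector available, e.g.
    #   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.0 ...
    spark = (
        SparkSession.builder
        .appName("spark-cassandra-sketch")
        .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [("u1", 120), ("u2", 45)],
        ["user_id", "event_count"],
    )

    # Append to an existing Cassandra table (placeholder keyspace/table)
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .mode("append")
       .options(keyspace="analytics", table="user_events")
       .save())

    # Read back for a quick sanity check
    (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="analytics", table="user_events")
          .load()
          .show())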
Environment: Hadoop v2.6.0, HDFS, HDP 2.6, MapReduce, Sqoop, Core Java, Hive, Spark Streaming, Apache Kafka, Cassandra 3.2, Spark Core 2.2
Confidential, Tampa, Florida
ETL Developer
Responsibilities:
- Performed integration testing of the scrum deliverables. Imported large data sets from MySQL into Hive tables using Sqoop. Developed Hive scripts to extract, transform, and load data from MySQL. Optimized Hive queries for performance.
- Created Hive managed and external tables as per requirements.
- Wrote custom Java UDFs for processing data in Hive. Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS into Hive.
- Defined Hive tables as managed or external, with appropriate static and dynamic partitions, for efficiency.
- Implemented partitioning and bucketing in Hive for better organization of the data (see the Hive sketch after this list).
- Extensive exposure to Sqoop commands for importing and exporting structured data to and from HDFS. Worked with file formats such as Avro, SequenceFile, Parquet, and plain text for both imports and exports from HDFS.
- In-depth analysis of large files with respect to delimiters such as enclosed-by, terminated-by, and optionally-enclosed-by. Hands-on experience with Hive for storing and retrieving data in HDFS using HiveQL.
- Experience creating external tables for Hive and Impala in HDFS, and knowledge of manipulating the metastore/schema.
- Scheduled Hive job workflows in Oozie, the YARN scheduler, and environments such as Control-M. Experience creating physical data models using star and snowflake schemas in Hive for ETL.
- Worked with key-value pairs using RDD transformations and actions for sorting, filtering, and analyzing big data (PySpark).
- Worked with data modeling tools such as SSMS to extract data from a demographic application called Nterview. Worked on Tableau for descriptive visualizations such as dashboards and stories by connecting through the Cloudera Hive ODBC driver.
- Involved in analyzing user requirements and identifying resources. Developed reusable mapplets and transformations. Worked on snowflake-schema data modelling in the data mart.
- Worked with various transformations such as expression, aggregator, stored procedure, lookup, filter, join, rank, and router for ETL from the data warehouse.
- Worked on sessions for loading data into targets and wrote queries for staging.
- Created a new mapping to pull data into the target using lookup tables, aggregators, and joins.
- Extensive working knowledge of partitioned tables, UDFs, performance tuning, compression-related properties, and the Thrift server in Hive.
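A sketch of the Hive partitioning and bucketing referenced in the list above, rendered as SQL run through a Hive-enabled SparkSession; the table names, columns, staging table, and location are illustrative assumptions, and the bucketed table uses Spark's native CLUSTERED BY syntax as a stand-in for Hive bucketing.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-partitioning-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    # External table partitioned by load date (placeholder names and location)
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales (
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (load_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///warehouse/sales'
    """)

    # Bucketed table keyed by customer_id for more efficient joins and sampling
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_bucketed (
            customer_id STRING,
            amount      DOUBLE
        )
        USING PARQUET
        CLUSTERED BY (customer_id) INTO 8 BUCKETS
    """)

    # Dynamic-partition insert from a hypothetical staging table
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE sales PARTITION (load_date)
        SELECT customer_id, amount, load_date
        FROM staging_sales
    """)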
Environment: Hadoop v2.4.0, HDFS, Map Reduce, Core Java, Python, Hive, Sqoop, CDH 4.x.x
Confidential
Back End Developer
Responsibilities:
- Involved in the development, testing, and maintenance of the application.
- Implemented REST web services for lightweight access with better performance and accessibility.
- Used the Spring MVC framework to implement the MVC architecture.
- Participated in implementation efforts such as coding and unit testing.
- Developed stored procedures, triggers, and functions in Oracle.
- Developed Spring services and DAOs and performed object-relational mapping using Hibernate.
- Wrote PL/SQL queries, stored procedures, and triggers to perform back-end database operations. Prepared test case documents and performed unit testing and system testing.
- Followed algorithms provided by senior database programmers and developed tables and database queries.
- Involved in understanding the business processes and defining the requirements.
- Built test cases and performed unit testing.
- Implemented logging using Log4j.
Environment: Java 7, IntelliJ, Maven, Spring Framework, JavaScript, Oracle SQL Developer