Data Engineer Resume
Houston, TX
SUMMARY
- 7+ years of professional software development experience with specialization in Big Data Engineering, Analytics, and Java projects.
- Hands-on experience working with Spark and the Hadoop ecosystem, including MapReduce, HDFS, Sqoop, Hive, Kafka, Oozie, YARN, Impala, Pig, Flume, and the NoSQL database HBase.
- Excellent knowledge and understanding of distributed computing and parallel processing frameworks.
- Strong experience working with both batch and streaming processing using the Spark framework.
- Good experience working with Kafka clusters to store real-time streaming data and writing custom Kafka producers and Spark Streaming consumers.
- Experience installing, configuring, and monitoring Hadoop clusters both on-premises and in the cloud.
- Strong experience building data lakes in the AWS Cloud using services such as S3, EMR, Glue Metastore, Athena, Redshift, and Step Functions.
- Strong experience and knowledge of real-time data analytics using Kafka and Spark Streaming.
- Expertise in developing production-ready Spark applications using the Spark RDD, DataFrame, Spark SQL, and Spark Streaming APIs (a minimal streaming sketch follows this summary).
- Strong hands-on knowledge of Scala and Python for developing Spark applications.
- Good experience troubleshooting data pipeline failures and identifying bottlenecks in long-running pipelines.
- Good experience productionizing and automating end-to-end data pipelines and enabling downstream applications to consume data from data lakes in an optimized fashion.
- Strong experience working with various file formats such as Parquet, ORC, Avro, and JSON.
- Strong experience using Hive features such as managed and external tables, partitioning, and bucketing.
- Extended Hive core functionality by writing custom UDFs for data analysis.
- Developed multiple Kafka producers and consumers from scratch per software requirement specifications.
- Proficient in importing/exporting data between RDBMS and HDFS using Sqoop.
- Hands-on experience with Apache NiFi and Apache Airflow.
- Created and ran workflow DAGs using Apache Airflow.
- Hands-on experience creating Docker containers for microservice REST applications.
- Strong experience working with Core Java and Spring Boot for developing REST APIs, along with JDBC, JEE technologies, and Servlets.
- Experience with version control systems (SVN, Git/GitHub) and issue-tracking tools such as Jira.
- Extensive experience working with relational databases such as PostgreSQL, Teradata, and MySQL.
- Worked in Agile/Scrum software development.
- Ability to meet deadlines and handle pressure while coordinating multiple tasks.
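To illustrate the Kafka and Spark Streaming work summarized above, the following is a minimal Spark Structured Streaming sketch in Scala that reads a Kafka topic and lands it as partitioned Parquet in an S3 data lake path. The broker address, topic name, and S3 paths are placeholders, the Kafka connector dependency is assumed to be on the classpath, and the production pipelines included schema handling and monitoring not shown here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ClickstreamToDataLake {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clickstream-to-data-lake")
      .getOrCreate()

    // Read the raw event stream from Kafka; broker and topic names are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "user-events")
      .option("startingOffsets", "latest")
      .load()

    // Kafka delivers binary key/value columns; cast the payload to string
    // and stamp each record with an ingestion date for partitioning.
    val events = raw
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
      .withColumn("ingest_date", to_date(col("timestamp")))

    // Land the stream as partitioned Parquet in an S3-backed data lake path.
    val query = events.writeStream
      .format("parquet")
      .option("path", "s3a://example-data-lake/raw/user_events/")
      .option("checkpointLocation", "s3a://example-data-lake/checkpoints/user_events/")
      .partitionBy("ingest_date")
      .start()

    query.awaitTermination()
  }
}
```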
TECHNICAL SKILLS
Big Data Tools: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, ZooKeeper, Ambari, Storm, Spark, and Kafka
NoSQL: HBase, Cassandra, MongoDB
Build and Deployment Tools: Maven, sbt, Git, SVN, Jenkins
Programming and Scripting: Java, Scala, Python, SQL, Shell Scripting, Pig Latin, HiveQL
Databases: Teradata, Redshift, Oracle, MySQL, PostgreSQL
Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript
AWS Services: EC2, EMR, S3, Redshift, Lambda, Glue, Simple Workflow, Athena
PROFESSIONAL EXPERIENCE
Confidential, Houston, TX
Data Engineer
Responsibilities:
- Ingested large volumes of user behavioral data and customer profile data into the analytics data store.
- Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting data from FTP servers and data warehouses.
- Developed Scala-based Spark applications for data cleansing, event enrichment, aggregation, de-normalization, and data preparation for machine learning and reporting teams to consume.
- Troubleshot Spark applications to make them more fault tolerant.
- Fine-tuned Spark applications to improve overall pipeline processing time.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics (a minimal producer sketch follows this list).
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to Snowflake.
- Handled large datasets using Spark's in-memory capabilities, broadcast variables, and effective, efficient joins and transformations.
- Worked extensively with Sqoop for importing data from Oracle.
- Designed and customized data models for a data warehouse supporting data from multiple sources in real time.
- Worked with EMR clusters in the AWS Cloud along with S3, Redshift, and Snowflake.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Implemented partitioning, dynamic partitions, and bucketing in Hive.
- Set up continuous integration of applications using Bamboo.
- Used reporting tools such as Tableau connected to Impala to generate daily data reports.
- Collaborated with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Designed and documented operational problems following standards and procedures using Jira.
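A minimal sketch of the kind of REST-to-Kafka producer described above, in Scala. The REST endpoint, broker address, and topic name are placeholders; it assumes Java 11's HttpClient and the standard Kafka client library, and omits the scheduling, retry, and error-handling logic of the production jobs.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object RestToKafkaProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.ACKS_CONFIG, "all") // wait for full acknowledgment

    val producer = new KafkaProducer[String, String](props)
    val http = HttpClient.newHttpClient()

    try {
      // Poll the (placeholder) REST endpoint and forward the response body to a Kafka topic.
      val request = HttpRequest.newBuilder(URI.create("https://api.example.com/events")).GET().build()
      val response = http.send(request, HttpResponse.BodyHandlers.ofString())
      if (response.statusCode() == 200) {
        producer.send(new ProducerRecord[String, String]("user-events", response.body()))
      }
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```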
Environment: Hadoop, Spark, Scala, Python, Hive, Sqoop, Oozie, Kafka, Amazon EMR, YARN, Jira, AWS, Shell Scripting, sbt, GitHub, Maven.
Confidential, Richmond, VA
Big Data Developer
Responsibilities:
- Developed Spark applications in Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Handled large datasets using partitions, Spark's in-memory capabilities, broadcasts, and effective, efficient joins and transformations.
- Used Spark to implement transformations on historical data.
- Used PySpark with Python scripting for data analysis and aggregation, working with DataFrames and the Spark SQL API to process data.
- Used the Spark programming API on an EMR cluster running Hadoop YARN to meet various data processing requirements.
- Ran DAGs using Apache Airflow to structure batch jobs efficiently.
- Developed Spark Scala applications using RDDs, DataFrames, and Spark SQL for data aggregation and queries, writing results back to the OLTP system via Spark JDBC (a minimal sketch follows this list).
- Performance-tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
- Configured Spark Streaming to receive real-time data from Kafka and write the processed stream data back to Kafka.
- Wrote real-time processing jobs using Spark Streaming with Kafka.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Developed Hive queries to process data and generate data cubes for visualization.
- Implemented schema extraction for Parquet and Avro file formats in Hive.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Worked extensively with S3 buckets in AWS.
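A minimal sketch of a Spark Scala aggregation that writes results back to an OLTP system over Spark JDBC, as described above. The S3 path, column names, JDBC URL, and table name are placeholders, and credentials are read from the environment for illustration only.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._

object DailyOrderAggregates {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-order-aggregates").getOrCreate()

    // Read curated order data from the data lake (placeholder S3 path).
    val orders = spark.read.parquet("s3a://example-data-lake/curated/orders/")

    // Aggregate order amounts per customer per day.
    val dailyTotals = orders
      .groupBy(col("customer_id"), to_date(col("order_ts")).as("order_date"))
      .agg(sum("order_amount").as("total_amount"), count("*").as("order_count"))

    // Write the aggregates back to an OLTP table over Spark JDBC;
    // URL, table, and credentials are placeholders.
    dailyTotals.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://oltp-host:5432/reporting")
      .option("dbtable", "daily_order_aggregates")
      .option("user", sys.env.getOrElse("DB_USER", "reporting_user"))
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .mode(SaveMode.Overwrite)
      .save()

    spark.stop()
  }
}
```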
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, Hive, Apache Kafka, Java, Scala, Shell scripting, Jenkins, Eclipse, Git, Tableau, MySQL, and Agile methodologies.
Confidential, Jersey City, NJ
Hadoop Developer
Responsibilities:
- Built scalable distributed data solutions in a Hadoop cluster environment with the Cloudera distribution.
- Converted raw data to efficient storage formats such as Avro and Parquet to reduce data processing time and improve network transfer efficiency (a minimal sketch follows this list).
- Built end-to-end data pipelines on Hadoop data platforms.
- Applied normalization and de-normalization techniques for optimal performance in relational and dimensional database environments.
- Designed, developed, and tested Extract-Transform-Load (ETL) applications with different types of sources.
- Created files and tuned SQL queries in Hive using Hue; implemented MapReduce jobs in Hive by querying the available data.
- Used Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Used PySpark with Python scripting for data analysis.
- Converted HiveQL into Spark transformations using Spark RDDs and Scala.
- Created user-defined functions (UDFs) and user-defined aggregate functions (UDAFs) in Pig and Hive.
- Built custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
- Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
- Supported the cluster and topics via Kafka Manager; worked on CloudFormation scripting, security, and resource automation.
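A minimal sketch of converting raw delimited data into Parquet and Avro with Spark, as described above. The HDFS paths are placeholders, schema inference is used only for brevity, and the Avro write assumes the spark-avro package is on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RawToColumnar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("raw-to-columnar").getOrCreate()

    // Read raw delimited files landed on HDFS (placeholder path; schema inference for brevity).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/raw/transactions/")

    // Rewrite as Parquet for efficient columnar scans downstream.
    raw.write.mode(SaveMode.Overwrite).parquet("hdfs:///data/processed/transactions_parquet/")

    // Rewrite as Avro as well; requires the spark-avro package on the classpath.
    raw.write.mode(SaveMode.Overwrite).format("avro").save("hdfs:///data/processed/transactions_avro/")

    spark.stop()
  }
}
```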
Environment: Python, Cloudera, HDFS, MapReduce, Flume, Kafka, ZooKeeper, Pig, Hive, HQL, HBase, Spark, ETL, REST services.
Confidential
Hadoop/Java Developer
Responsibilities:
- Installed and configured Apache Hadoop to test maintenance of log files in the Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Developed Java MapReduce programs to analyze sample log files stored in the cluster.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed MapReduce programs for data analysis and data cleaning.
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Developed industry-specific user-defined functions (UDFs) (a minimal Hive UDF sketch follows this list).
- Created Hive tables, loaded data, and wrote Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Migrated ETL processes from RDBMS to Hive to simplify data manipulation.
- Developed Hive queries to process the data for downstream data analysis.
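A minimal sketch of a Hive UDF of the kind described above, shown in Scala for consistency with the other sketches (the original work was Java-based, per the environment); the class name and normalization logic are purely illustrative.

```scala
import org.apache.hadoop.hive.ql.exec.UDF
import org.apache.hadoop.io.Text

// Illustrative Hive UDF that trims and upper-cases a string column.
// Register in Hive after packaging the jar, e.g.:
//   ADD JAR hive-udfs.jar;
//   CREATE TEMPORARY FUNCTION normalize_code AS 'NormalizeCode';
class NormalizeCode extends UDF {
  def evaluate(input: Text): Text = {
    if (input == null) null
    else new Text(input.toString.trim.toUpperCase)
  }
}
```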
Environment: Apache Hadoop, HDFS, Cloudera Manager, CentOS, Java, MapReduce, Eclipse, Hive, PIG, Sqoop, Oozie and SQL.
Confidential
Java Developer
Responsibilities:
- Designed and developed applications using the Spring MVC framework with Agile methodology.
- Developed JSP and HTML pages using CSS and JavaScript as part of the presentation layer.
- Used the Hibernate framework in the persistence layer to map the object-oriented domain model to the database.
- Developed database schema and SQL queries for querying, inserting, and managing database.
- Implemented various design patterns in the project such as Data Transfer Object, Data Access Object and Singleton.
- Used Git for Source Code Management.
- Used Maven scripts to fetch, build, and deploy the application to the development environment.
- Created a RESTful web service interface to a Java-based runtime engine.
- Used Apache Tomcat for deploying the application.
- Used Junit for functional and unit testing of code.
Environment: Eclipse IDE, Java/J2EE, Spring, Hibernate, JSP, HTML, CSS, JavaScript, Maven, RESTful Web services, Apache Tomcat, Oracle, JUnit, Git