
Sr Data Engineer Resume


Richardson, TX

SUMMARY

  • Years of professional experience as a Big Data Engineer/Developer in analysis, design, development, implementation, maintenance, and support, with expertise in Big Data, Hadoop development, Python, PL/SQL, Java, SQL, REST APIs, and the GCP cloud platform.
  • Hands-on experience with the Hadoop ecosystem, including HDFS, MapReduce, Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Zookeeper; also worked with Spark SQL, Spark Streaming, and AWS services such as EMR, S3, Airflow, Glue, and Redshift.
  • Hands-on experience designing and implementing large-scale data lakes, pipelines, and efficient ETL (extract/transform/load) workflows to collect, organize, and standardize data that helps generate insights and addresses reporting needs.
  • Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
  • Experience working with both Streaming and Batch data processing using multiple technologies.
  • Hands-on experience developing data pipelines using Spark components: Spark SQL, Spark Streaming, and MLlib.
  • Hands-on experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
  • Excellent understanding of Hadoop architecture and components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
  • Worked on AWS components and services, particularly Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), Simple Storage Service (S3), and Lambda functions.
  • Hands-on experience working with Kafka Streams using KTables, GlobalKTables, and KStreams, and deploying these on Confluent and Apache Kafka environments.
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
  • Hands-on experience interacting with REST APIs built on a microservices architecture to retrieve data from different sources.
  • Experience implementing CI/CD pipelines for DevOps: source code management with Git, unit testing, and build and deployment scripts.
  • Hands-on experience working with DevOps tools such as Jenkins, Docker, Kubernetes, GoCD, and the Autosys scheduler.
  • Expertise with RDBMSs such as Oracle and MySQL, writing complex SQL queries, stored procedures, and triggers.
  • Keen interest in the newer technology stack that Google Cloud Platform (GCP) adds.
  • Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Experienced in developing web-based applications using Python, Qt, C++, XML, CSS, JSON, HTML, and DHTML.
  • Hands-on experience building, scheduling, and monitoring workflows using Apache Airflow with Python (a minimal DAG sketch follows this list).
  • Experienced in software development methodologies such as Agile (Scrum), SAFe, Waterfall, and TDD, including sprint planning, daily standups, and story grooming.
  • Worked through the complete software development life cycle in an Agile model.
  • Strong problem-solving skills with an ability to isolate, deconstruct and resolve complex data challenges.
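
To illustrate the Airflow-with-Python experience above, here is a minimal, hypothetical DAG sketch; the DAG id, script path, bucket, and table names are placeholders rather than details from an actual engagement:

```python
# Minimal Airflow DAG sketch: a daily spark-submit transform followed by a
# BigQuery load. All names and paths below are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_sales_etl",                 # placeholder DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    # Run the PySpark transformation via spark-submit.
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_sales.py --date {{ ds }}",
    )

    # Load the transformed Parquet files from GCS into BigQuery with the bq CLI.
    load_to_bq = BashOperator(
        task_id="load_to_bigquery",
        bash_command=(
            "bq load --source_format=PARQUET "
            "analytics.sales_daily gs://my-bucket/sales/{{ ds }}/*.parquet"
        ),
    )

    transform >> load_to_bq
```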

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, YARN, MapReduce, Spark, Kafka, Kafka Connect, Hive, Pig, Airflow, Impala, Sqoop, HBase, Flume, Oozie, Zookeeper

Hadoop Distributions: Cloudera, Hortonworks, Apache.

Cloud Environments: GCP, AWS (EMR, EC2, S3), Azure Data Factory

Operating Systems: Linux, Windows

Languages: Python, SQL, Scala, Java

Databases: Oracle, SQL Server, MySQL, HBase, MongoDB, DynamoDB

ETL Tools: Informatica

Report & Development Tools: Eclipse, IntelliJ IDEA, Visual Studio Code, Jupyter Notebook, Tableau, Power BI.

Development/Build Tools: Maven, Gradle

Repositories: GitHub, SVN.

Scripting Languages: Bash/shell scripting (Linux/Unix)

Methodology: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, Richardson, TX

Sr Data engineer

Responsibilities:

  • Configured Spark Streaming in Scala to receive real-time data from Apache Kafka and store the streamed data to HDFS (a PySpark sketch of this pattern follows this list).
  • Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
  • Developed ETL data pipelines using Hadoop big data tools: HDFS, Hive, Presto, Apache NiFi, Sqoop, Spark, Elasticsearch, and Kafka.
  • Developed data pipelines using Flume, Pig, and Sqoop to ingest customer data and histories into HDFS for analysis.
  • Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Experience in moving data between GCP and Azure using Azure Data Factory.
  • Created access points to the data through Spark, Python, and Presto.
  • Developed Spark applications using PySpark, Spark SQL, and Python for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Developed and designed a system using Python and Scala to collect data from multiple portals using Kafka and then process it using Spark.
  • Responsible for designing ETL data pipelines for effective data ingestion from existing data management platforms into the enterprise data lake.
  • Developed and executed interface test scenarios and test scripts for complex business rules using available ETL tools.
  • Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Developed Spark code in Scala and Python for the project in the Hadoop/Hive environment.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Designed the ETL process by creating a high-level design document covering the logical data flows, source data extraction, database staging, extract creation, source archival, job scheduling, and error handling.
  • Developed Spark code from scratch in Scala according to the technical requirements and used both the Hive context and the SQL context of Spark for initial testing of the Spark jobs.
  • Created partitioned external tables in Hive, designed external and managed Hive tables, and moved data into HDFS using Sqoop.
  • Developed Linux shell scripts for creating reports from Hive data.
  • Hands-on experience with AWS data analytics services such as Athena, Glue Data Catalog, and QuickSight.
  • Managed and supported enterprise data warehouse operations and advanced big data predictive application development using Cloudera and Hortonworks HDP.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats such as text, ZIP, XML, and JSON.
  • Worked on Apache Airflow as a job trigger, with code written in Hive Query Language and Scala; this helps read, backfill, and write data for a particular time frame from Hive tables to HDFS locations.
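
A minimal sketch of the Kafka-to-HDFS streaming pattern described in the first bullet, shown here in PySpark rather than the original Scala; the broker, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster:

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and land
# the raw events on HDFS as Parquet. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-to-hdfs")
    .getOrCreate()
)

# Read the stream from Kafka; key/value arrive as bytes and are cast to strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")    # placeholder broker
    .option("subscribe", "clickstream")                    # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(
        col("key").cast("string"),
        col("value").cast("string"),
        col("timestamp"),
    )
)

# Write micro-batches to HDFS with a checkpoint for fault-tolerant file output.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream")        # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```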

Environment: Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL, PySpark, SQL Server, GCP, Python

Confidential, Dallas, TX

Sr Data engineer (Bigdata, GCP)

Responsibilities:

  • Wrote Spark SQL queries and Python scripts to design the solutions and implemented them using PySpark.
  • Wrote a Python program to maintain raw file archival in a GCS bucket (a minimal sketch follows this list).
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Designed and built production data pipelines from data ingestion to consumption within a hybrid big data architecture, using cloud-native GCP services, Java, Python, Scala, and SQL.
  • Opened an SSH tunnel to Google Dataproc to access the YARN resource manager and monitor Spark jobs.
  • Worked on monitoring, scheduling, and authoring Data Pipelines using Apache Airflow.
  • Developed Spark jobs to create DataFrames from the source system and to process and analyze the data in those DataFrames based on business requirements.
  • Involved in enhancing the existing ETL data pipeline for better data migration with reduced data issues by using Apache Airflow.
  • Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
  • Developed RESTful API services using Spring Boot to upload data from local storage to Amazon S3, list S3 objects, and perform file manipulation operations.
  • Scheduled workflows through the Oozie engine to run multiple Hive and Pig jobs, and wrote shell scripts to automate the jobs in UNIX.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke MapReduce jobs in the backend.
  • Automated repeated tasks using Python and UNIX Bash scripting.
  • Utilized the Apache Hadoop environment from Cloudera, monitoring and debugging Spark jobs running on a Spark cluster using Cloudera Manager.
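
A minimal, assumed sketch of the raw-file archival step mentioned above, using the google-cloud-storage client; the bucket and prefix names are placeholders:

```python
# Minimal sketch: move processed objects under an archive/ prefix in a GCS
# bucket. Bucket and prefix names are placeholders.
from google.cloud import storage


def archive_raw_files(bucket_name: str, source_prefix: str, archive_prefix: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    for blob in client.list_blobs(bucket_name, prefix=source_prefix):
        archived_name = blob.name.replace(source_prefix, archive_prefix, 1)
        # Copy to the archive location, then delete the original object.
        bucket.copy_blob(blob, bucket, archived_name)
        blob.delete()


if __name__ == "__main__":
    archive_raw_files("raw-landing-bucket", "incoming/", "archive/")
```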

Environment: Hadoop, MapReduce, HDFS, Spark, Java, Yarn, Hive 2.1, Sqoop, Cassandra, Oozie, Scala, Python, AWS, Flume, Kafka, Tableau, Linux, SQL Server, Shell Scripting, Apache Airflow, Unix, GCP.

Confidential, Pittsburg, PA

Bigdata / Data engineer

Responsibilities:

  • Delivery experience on major Hadoop ecosystem components such as Hive, Spark, Kafka, Elasticsearch, and HBase, with monitoring through Cloudera Manager; worked on loading disparate data sets coming from different sources into the Hadoop environment using Spark.
  • Used PySpark and Spark SQL for extracting, transforming, and loading the data according to the business requirements.
  • Wrote a Python program to maintain raw file archival in a GCS bucket.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Designed and built production data pipelines from data ingestion to consumption within a hybrid big data architecture, using cloud-native GCP services, Java, Python, Scala, and SQL.
  • Developed UNIX scripts for batch loads to bring large volumes of data from relational databases into the big data platform.
  • Developed Spark SQL scripts using PySpark to perform transformations and actions on DataFrames and Datasets in Spark for faster data processing.
  • Involved in enhancing the existing ETL data pipeline for better data migration with reduced data issues by using Apache Airflow.
  • Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
  • Knowledge of creating star schemas for drill-down analysis; created PySpark procedures, functions, and packages to load data.
  • Scheduled workflows through the Oozie engine to run multiple Hive and Pig jobs, and wrote shell scripts to automate the jobs in UNIX.
  • Worked on monitoring, scheduling, and authoring Data Pipelines using Apache Airflow.
  • Developed Spark jobs to create DataFrames from the source system and to process and analyze the data in those DataFrames based on business requirements.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that invoke MapReduce jobs in the backend.
  • Utilized the Apache Hadoop environment from Cloudera, monitoring and debugging Spark jobs running on a Spark cluster using Cloudera Manager.
  • Fetched data and generated monthly reports, visualizing them using Tableau and Python.
  • Managed and supported enterprise data warehouse operations and advanced big data predictive application development using Cloudera.
  • Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real-time analysis (a minimal consumer sketch follows this list), and worked extensively on Hive to create, alter, and drop tables and write Hive queries.
  • Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig, and translated high-level design specs into simple ETL coding and mapping standards.
  • Created reports with different selection criteria from Hive tables on the data residing in the data lake.
  • Extensively used ETL methodology to support data extraction, transformation, and load processing using Hadoop.
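
A minimal sketch of the Kafka-to-Cassandra consumer pattern described above, assuming the kafka-python and cassandra-driver libraries; the broker, topic, keyspace, table, and column names are placeholders:

```python
# Minimal sketch: consume JSON events from Kafka and write each record into a
# Cassandra table for near real-time analysis. All names are placeholders.
import json

from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "events",                                   # placeholder topic
    bootstrap_servers=["broker1:9092"],         # placeholder broker
    group_id="cassandra-writer",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

cluster = Cluster(["cassandra1"])               # placeholder contact point
session = cluster.connect("analytics")          # placeholder keyspace
insert = session.prepare(
    "INSERT INTO events_by_device (device_id, event_time, payload) VALUES (?, ?, ?)"
)

# Write each Kafka record into Cassandra as it arrives.
for message in consumer:
    event = message.value
    session.execute(
        insert,
        (event["device_id"], event["event_time"], json.dumps(event)),
    )
```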

Environment: Hadoop, MapReduce, HDFS, Spark, PySpark, Java, Yarn, Hive 2.1, Sqoop, Cassandra, Oozie, Scala, Python, GCP, Flume, Kafka, Tableau, Linux, SQL Server, Shell Scripting, Apache Airflow, Cloudera.

Confidential

Hadoop Developer

Responsibilities:

  • Understood client requirements and prepared design documents accordingly.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data; used Sqoop to import and export data between MySQL, HDFS, and NoSQL databases on a regular basis, and designed and developed Hive scripts to process the data in batch for analysis.
  • Implemented partitioning and bucketing in Hive as part of performance tuning, and built workflow and coordination files using the Oozie framework to automate tasks.
  • Developed data pipelines using Sqoop, Pig, and Hive to ingest customer data into HDFS for data analytics.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Developed Sqoop scripts to handle change data capture, processing incremental records between newly arrived and existing data in RDBMS tables (a sketch of the incremental import follows this list).
  • Wrote Sqoop jobs to move data from various RDBMSs into HDFS and vice versa.
  • Developed ETL pipelines to deliver data to business intelligence teams for building visualizations.
  • Optimized Hive queries using various file formats such as Parquet, JSON, Avro, and CSV.
  • Worked with Oozie workflow engine to schedule time-based jobs to perform multiple actions.
  • Involved in unit testing, interface testing, system testing, and user acceptance testing of the workflow tool.
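
A hedged sketch of the incremental (change-data-capture style) Sqoop import described above, driven from Python; the JDBC URL, credentials file, table, check column, and target directory are placeholders:

```python
# Minimal sketch: run an incremental Sqoop import for newly arrived rows.
# Connection details, table, check column, and target directory are placeholders.
import subprocess


def incremental_import(last_value: str) -> None:
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",   # placeholder JDBC URL
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", "orders",
        "--incremental", "append",
        "--check-column", "order_id",
        "--last-value", last_value,
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",
    ]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # The last imported key would normally come from Sqoop job metadata or a control table.
    incremental_import("0")
```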

Environment: HDFS, Hive, Oozie, Cloudera Distribution with Hadoop (CDH4), MySQL, CentOS, Apache HBase, MapReduce, Hue, Pig, Sqoop, SQL, Windows, Linux.

Confidential

Java/SQL Developer

Responsibilities:

  • Developed and created database objects in Oracle and SQL Server, i.e., tables, indexes, stored procedures, views, functions, triggers, etc.
  • Used SQL*Loader scripts to load the data into temporary tables and wrote procedures to validate the data.
  • Wrote complex SQL queries and scripts to provide input to the reporting tools.
  • Worked with large amounts of data from various data sources and loaded it into the database.
  • Wrote stored procedures and packages per the business requirements and scheduled jobs for health checks.
  • Developed complex queries to generate monthly and weekly reports, extracting data for visualization in QlikView (a driver sketch follows this list).
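
The work in this role was done in SQL and PL/SQL; purely as a language-consistent illustration, the sketch below assumes a Python/cx_Oracle driver that calls a hypothetical validation procedure after a load and exports a weekly report query to CSV. The DSN, procedure name, and query are placeholders:

```python
# Minimal sketch: run a post-load validation procedure, then export a weekly
# report query to CSV for the reporting tool. All names are placeholders.
import csv

import cx_Oracle

conn = cx_Oracle.connect(user="report_user", password="secret", dsn="dbhost/ORCLPDB1")
cur = conn.cursor()

# Validate the freshly loaded staging data (hypothetical procedure name).
cur.callproc("validate_staged_orders")

# Weekly report query feeding the reporting tool (placeholder SQL).
cur.execute("""
    SELECT region, TRUNC(order_date, 'IW') AS week_start, SUM(amount) AS total
    FROM orders
    GROUP BY region, TRUNC(order_date, 'IW')
    ORDER BY week_start, region
""")

with open("weekly_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur.fetchall())

cur.close()
conn.close()
```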
