GCP Data Engineer Resume

Columbia, SC

SUMMARY

  • 8+ years of experience as a Senior Big Data Engineer / GCP Data Engineer with demonstrated expertise in building and deploying data pipelines using open-source Hadoop-based technologies such as Apache Spark, Hive, ZooKeeper, Apache Storm, Python, and YARN.
  • Extensive experience with GCP using Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience working with the Amazon Web Services (AWS) cloud and its services such as EC2, S3, RDS, EMR, VPC, IAM, Elastic Load Balancing, Lambda, Redshift, ElastiCache, Auto Scaling, CloudFront, CloudWatch, Data Pipeline, DMS, Aurora, ETL, and other AWS services.
  • Hands-on experience developing Spark applications using PySpark DataFrames, RDDs, and Spark SQL.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the brief sketch following this summary).
  • Good understanding of Big Data Hadoop and YARN architecture along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager, as well as Kafka (distributed stream processing).
  • Experience in Database Design and development with Business Intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema, and Snowflake Schema.
  • Strong skills in visualization tools such as Power BI and Confidential Excel (formulas, Pivot Tables, charts, and DAX commands).
  • Experience analyzing data using HiveQL and MapReduce programs.
  • Experienced in ingesting data into HDFS from relational databases such as MySQL, Oracle, DB2, Teradata, and Postgres using Sqoop.
  • Well-versed with various Hadoop distributions, including Cloudera (CDH), Hortonworks (HDP), and Azure HDInsight.
  • Extended Hive and Pig core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregate Functions (UDAFs).
  • Experience working on NoSQL Databases like HBase, Cassandra, and MongoDB.
  • Experience in Python, Scala, shell scripting, and Spark.
  • Experience testing MapReduce programs using MRUnit, JUnit, and EasyMock.
  • Experience in ETL methodology supporting data extraction, transformation, and loading using Hadoop.
  • Worked on data visualization tools like Tableau and integrated data using the ETL tool Talend.
  • Hands-on development experience with Java, shell scripting, and RDBMS, including writing complex SQL queries, PL/SQL, views, stored procedures, triggers, etc.
  • Passionate about working on the most cutting-edge Big Data technologies.
  • Expertise in Creating, Debugging, Scheduling, and Monitoring jobs using Composer Airflow.
  • Experience in documenting Design specs, Unit test plans, and deployment plans.
  • Keen on keeping up with the newer technology stack that the Google Cloud Platform adds.
  • Vast knowledge of Teradata SQL Assistant. Developed BTEQ scripts to load data from the Teradata staging area to the data warehouse, and from the data warehouse to data marts for specific reporting requirements; tuned existing BTEQ scripts to enhance performance.
  • Proficient in developing UNIX Shell Scripts.
  • Experienced in working with CI/CD pipelines such as Jenkins and Bamboo.
  • Experienced in working with source code management tools such as Git and Bitbucket.
  • Heavily involved in pinpointing performance bottlenecks in targets, sources, and transformations and successfully tuning them for optimum performance.
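
The summary above references PySpark DataFrame and Spark SQL work; the following is a minimal, illustrative sketch of that style of job. File paths, column names, and the aggregation logic are hypothetical placeholders, not code from an actual engagement.

    # Minimal PySpark sketch: read CSV and JSON sources, join them, and
    # aggregate usage with both the DataFrame API and Spark SQL.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

    orders = spark.read.option("header", True).csv("/data/raw/orders.csv")  # hypothetical path
    events = spark.read.json("/data/raw/events.json")                       # hypothetical path

    daily_usage = (
        orders.join(events, on="customer_id", how="inner")
              .groupBy("customer_id", "event_date")
              .agg(F.count("*").alias("events"),
                   F.sum(F.col("amount").cast("double")).alias("total_amount"))
    )

    daily_usage.createOrReplaceTempView("daily_usage")
    spark.sql(
        "SELECT customer_id, SUM(events) AS total_events "
        "FROM daily_usage GROUP BY customer_id"
    ).show()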

TECHNICAL SKILLS

Hadoop Core Services: HDFS, MapReduce, Spark, YARN.

Hadoop Distribution: Cloudera (CDH), Hortonworks (HDP), Apache Hadoop.

Databases: Oracle 11g/10g/9i, Teradata, MS SQL.

Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.

Scheduling Tools: Zookeeper, Oozie.

Monitoring Tools: Cloudera Manager.

Cloud Computing Tools: AWS, Azure, GCP

GCP Cloud Platform: BigQuery, Cloud Dataproc, GCS buckets, Cloud Functions, Apache Beam, Cloud Shell, gsutil, bq command-line tool, Cloud Dataflow

Programming Languages: C, Java, Scala, Python, R, SQL, PL/SQL, Pig Latin, HiveQL, Unix, JavaScript, Shell Scripting.

Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB.

Operating Systems: UNIX, Windows, LINUX.

Build Tools: Jenkins, Maven, ANT.

Development Tools: Eclipse, NetBeans, Microsoft SQL Studio, Toad.

PROFESSIONAL EXPERIENCE

Confidential, Columbia, SC

GCP Data Engineer

Environment: GCP, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Informatica PowerCenter, Talend for Big Data, Airflow, Hadoop, Hive, Teradata, SAS, Spark, Python, Java, SQL Server.

Responsibilities:

  • Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Built data pipelines in GCP using Airflow for ETL-related jobs with different Airflow operators.
  • Used Apache Airflow in the GCP Cloud Composer environment to build data pipelines, using operators such as the bash operator, Hadoop operators, Python callables, and branching operators.
  • Built NiFi dataflows to consume data from Kafka, apply transformations, place the data in HDFS, and expose a port to run a Spark Streaming job.
  • Maintained the Hadoop cluster on GCP using Google Cloud Storage, BigQuery, and Dataproc.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
  • Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
  • Used the GCP environment for Cloud Functions for event-based triggering, Cloud Monitoring, and alerting.
  • Used Cloud Functions with Python to load data into BigQuery for CSV files arriving in the GCS bucket.
  • Worked on Spark RDDs, the DataFrame API, the Dataset API, the Data Source API, Spark SQL, and Spark Streaming.
  • Used Spark Streaming APIs to perform transformations and actions on the fly.
  • Developed a Kafka consumer API in Python for consuming data from Kafka topics (see the illustrative consumer sketch after this list).
  • Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML file using Spark Streaming to capture User Interface (UI) updates.
  • Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
  • Designed GCP Cloud Composer DAGs to load data from on-prem CSV files into GCP BigQuery tables and scheduled the DAGs to load in incremental mode (see the illustrative DAG sketch after this list).
  • Configured Snowpipe to pull data from Google Cloud Storage buckets into Snowflake tables.
  • Extensive experience in IT data analytics projects; hands-on experience migrating on-premises ETLs to the Google Cloud Platform (GCP) using cloud-native tools such as BigQuery, Cloud Dataproc, Google Cloud Storage, and Composer.
  • Used HiveQL to analyze partitioned and bucketed data stored in Hive and executed Hive queries on Parquet tables to meet the business specification logic.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available in downstream systems for data analysis and engineering.
  • Worked on implementing Kafka security and boosting its performance.
  • Developed Oozie coordinators to schedule Hive scripts to create Data pipelines.
  • Performed on-cluster testing of HDFS, Hive, Pig, and MapReduce and set up cluster access for new users.
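
The Cloud Composer work described above (Airflow DAGs loading arriving CSV files from GCS into BigQuery in incremental mode) could be sketched roughly as below. This is an illustrative outline only: the bucket, dataset, and table names are hypothetical, and it assumes the apache-airflow-providers-google package available in a Composer environment.

    # Illustrative Composer/Airflow DAG: append newly arrived GCS CSV files
    # into a BigQuery table on a daily schedule.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="gcs_csv_to_bigquery",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        announce = BashOperator(
            task_id="announce_run",
            bash_command="echo 'Starting daily GCS-to-BigQuery load'",
        )

        load_csv = GCSToBigQueryOperator(
            task_id="load_csv_to_bq",
            bucket="example-landing-bucket",                # hypothetical bucket
            source_objects=["incoming/*.csv"],
            destination_project_dataset_table="example-project.analytics.daily_facts",  # hypothetical table
            source_format="CSV",
            skip_leading_rows=1,
            autodetect=True,
            write_disposition="WRITE_APPEND",               # incremental-mode load
        )

        announce >> load_csv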
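
Similarly, a Python Kafka consumer of the kind mentioned above might look like the following sketch. It assumes the kafka-python client library; the topic, broker address, and field names are hypothetical.

    # Sketch of a Python Kafka consumer that deserializes JSON messages
    # from a topic for downstream processing.
    import json

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "weblogs",                                # hypothetical topic
        bootstrap_servers=["kafka-broker:9092"],  # hypothetical broker
        group_id="weblog-etl",
        auto_offset_reset="earliest",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )

    for message in consumer:
        record = message.value
        # hand the record off to the downstream pipeline (placeholder)
        print(record.get("user_id"), record.get("event_type"))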

Confidential, Patskala, Ohio

Azure Data Engineer

Environment: Hadoop, Azure Data Factory, Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse, Azure Databricks, Azure PowerShell, MapReduce, Hive, Spark, Python, YARN, Tableau, Kafka, Sqoop, Scala, HBase.

Responsibilities:

  • Analyzed, designed, and built modern data solutions using Azure PaaS services to support visualization of data; understood the current production state of the application and determined the impact of new implementations on existing business processes.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and write-back tools.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the illustrative sketch after this list).
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Experienced in performance tuning of Spark Applications for setting the right Batch Interval time, the correct level of Parallelism, and memory tuning.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Hands-on experience in developing SQL Scripts for automation purposes.
  • Created Build and Release for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
  • Worked extensively with Dimensional modeling, Data migration, Data cleansing, and ETL Processes for data warehouses
  • Worked in Agile Methodology and used JIRA to maintain the stories about the project.
  • Involved in gathering the requirements, designing, developing, and testing.
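
The PySpark and UDF work described above could be sketched as follows. This is illustrative only: the mount paths, column names, and the UDF's business rule are hypothetical placeholders of the kind used in a Databricks notebook.

    # Illustrative PySpark snippet: read multiple file formats, apply a small
    # UDF, aggregate usage, and write curated Parquet output.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("usage-transform").getOrCreate()

    sessions = spark.read.parquet("/mnt/datalake/raw/sessions/")  # hypothetical path
    accounts = spark.read.option("header", True).csv("/mnt/datalake/raw/accounts.csv")

    @F.udf(returnType=StringType())
    def usage_band(event_count):
        # simple illustrative business rule
        if event_count is None:
            return "unknown"
        return "heavy" if int(event_count) >= 100 else "light"

    usage = (
        sessions.groupBy("account_id")
                .agg(F.count("*").alias("event_count"))
                .join(accounts, on="account_id", how="left")
                .withColumn("usage_band", usage_band(F.col("event_count")))
    )

    usage.write.mode("overwrite").parquet("/mnt/datalake/curated/usage_by_account/")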

Confidential, Washington, PA

AWS Data Engineer

Environment: HDFS, Hadoop, Python, Hive, Sqoop, Flume, Spark, MapReduce, Scala, Oozie, YARN, Tableau, Spark SQL, Spark MLlib, Impala, Nagios, UNIX shell scripting, ZooKeeper, Kafka, Agile methodology, SBT

Responsibilities:

  • Worked on analyzing the Hadoop cluster and different big data analytical and processing tools, including Sqoop, Hive, Spark, Kafka, and PySpark.
  • Designed and implemented data processing pipelines using AWS services such as AWS Glue, AWS Lambda, AWS Step Functions, and AWS EMR to extract, transform, and load large volumes of data from various sources into a centralized data lake (see the illustrative Glue job sketch after this list).
  • Built and maintained data warehouses using Amazon Redshift to enable efficient querying and reporting of data for business intelligence and analytics purposes.
  • Managed the ingestion and transformation of real-time streaming data using Amazon Kinesis and Apache Kafka and integrated them with batch processing pipelines to enable near-real-time analytics and decision-making.
  • Developed automated data validation and quality checks using AWS Data Pipeline, AWS Glue, and AWS Lambda to ensure data accuracy and completeness throughout the data processing pipeline.
  • Worked with various AWS services such as AWS S3, AWS DynamoDB, AWS Athena, and AWS RDS to store, query, and retrieve data for various use cases, including ETL processing, ad-hoc querying, and reporting.
  • Worked on the MapR platform team performance-tuning the Hive and Spark jobs of all users.
  • Used the Hive Tez engine to increase the performance of applications.
  • Worked on incidents created by users for the platform team on Hive and Spark issues by monitoring Hive and Spark logs and fixing them, or else raising MapR support cases.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Tested cluster performance using the Cassandra stress tool to measure and improve read/write performance.
  • Worked on Hadoop Data Lake for ingesting data from different sources such as Oracle and Teradata through the INFOWORKS ingestion tool.
  • Worked on Arcadia to create analytical views on top of tables so that, even while a batch is loading, there are no reporting issues or table locks, since reports point to the Arcadia view.
  • Worked on a Python API for converting assigned group-level permissions to table-level permissions using MapR ACEs, by creating a unique role and assigning it through the EDNA UI.
  • Hands-on experience in Spark using Scala and Python, creating RDDs and applying transformations and actions.
  • Extensively perform complex data transformations in Spark using Scala language.
  • Involved in converting Hive/SQL queries into Spark transformations using Scala.
  • Created master job sequences for integration (ETL control) logic to capture job success, failure, error, and audit information for reporting.
  • Used the TES scheduler engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Spark, Kafka, and Sqoop.
  • Experienced in creating recursive and replicated joins in Hive.
  • Experienced in developing scripts for transformations using Scala.
  • Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
  • Experienced in creating shell scripts and automating jobs.
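
A Glue ETL job of the kind described above could be sketched as follows. The database, table, and S3 paths are hypothetical, and the awsglue libraries are available only inside the Glue job runtime.

    # Sketch of an AWS Glue ETL job (PySpark): read a cataloged source,
    # apply a light transformation, and write Parquet to an S3 data lake.
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Glue Data Catalog (hypothetical database/table)
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_zone", table_name="transactions"
    )

    # Deduplicate and filter with the Spark DataFrame API
    df = source.toDF().dropDuplicates(["transaction_id"]).filter("amount > 0")

    # Write curated Parquet back to the data lake (hypothetical bucket)
    df.write.mode("append").parquet("s3://example-data-lake/curated/transactions/")

    job.commit()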

Confidential

Data Engineer

Environment: Spark, Scala, Python, Hadoop, MapReduce, CDH, Cloudera Manager, Control M Scheduler, Shell Scripting, Agile Methodology, JIRA, Git, Tableau.

Responsibilities:

  • Prepared complicated T-SQL queries and user-defined functions in SQL Server to meet business needs.
  • Designed and implemented code changes in existing modules (Java, Python, and shell scripts) for enhancements.
  • Generated reports to maintain zero percent errors in all the data warehouse tables.
  • Developed Spark and Scala pipelines which transform the raw data from several formats to parquet files for consumption by downstream systems.
  • Developed and performed Sqoop import from Oracle to load the data into HDFS
  • Used AWS Glue services like crawlers and ETL jobs to catalog all the parquet files and make transformations over data according to the business needs.
  • Worked with AWS services like S3, Glue, EMR, SNS, SQS, Lambda, EC2, RDS, and Athena to process data for downstream customers.
  • Created libraries and SDKs for making JDBC connections to the Hive database and querying the data using the Play Framework and various AWS services.
  • Developed Spark scripts used to load data from Hive into Amazon RDS (Aurora) at a faster rate (see the illustrative sketch after this list).
  • Load and transform data using scripting languages and tools (Ex: Python, Linux shell, Sqoop)
  • Led ETL efforts to integrate transform and map data from multiple sources using Python.
  • Wrote AWS Lambda functions in Python that load structured and unstructured data from Hive to RDS.
  • Excellent analytical thinking for translating data into informative visuals and reports.
  • In-depth understanding of database management systems, online analytical processing (OLAP), and ETL (extract, transform, load) frameworks.
  • Developed data pipeline using Flume, Sqoop, Hive, and Spark to ingest subscriber data, provider data and claims into HDFS for analysis.
  • Working knowledge of Spark RDD, Data Frame API, Data set API, Data Source API, Spark SQL, and Spark Streaming
  • Writing stored procedures, tuning indexes, and troubleshooting performance bottlenecks.
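
The Hive-to-Amazon-RDS (Aurora) loads mentioned above can be sketched with Spark's JDBC writer as below. The table names, JDBC URL, and credentials are hypothetical, and the sketch assumes a MySQL-compatible Aurora endpoint with the MySQL JDBC driver on the Spark classpath.

    # Minimal sketch: load a Hive table into Amazon RDS (Aurora MySQL) via JDBC.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-to-aurora")
        .enableHiveSupport()
        .getOrCreate()
    )

    claims = spark.table("curated.claims")  # hypothetical Hive table

    (claims.write
           .format("jdbc")
           .option("url", "jdbc:mysql://example-aurora-cluster:3306/analytics")  # hypothetical endpoint
           .option("dbtable", "claims")
           .option("user", "app_user")
           .option("password", "********")  # supply via a secrets mechanism in practice
           .option("driver", "com.mysql.cj.jdbc.Driver")
           .option("batchsize", "10000")    # larger batches speed up the load
           .mode("append")
           .save())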
