
Cloud Data Engineer Resume


San Francisco, CA

SUMMARY

  • Around 8 years of experience in designing, developing, and integrating applications using Spark, Hadoop, and Hive across all major platforms: Cloudera, Hortonworks, and MapR.
  • 4 years of experience in cloud data engineering across the big data Hadoop ecosystem (HDFS, Hive, Spark, Databricks, Kafka, YARN) on AWS and GCP cloud services and cloud relational databases.
  • Experience with Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud Dataproc, Cloud Functions, Cloud Pub/Sub, Cloud Shell, Cloud SQL, BigQuery, Cloud Dataflow, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Experience with Amazon Web Services (AWS) for creating and managing EC2, Elastic MapReduce (EMR), Elastic Load Balancers, Elastic Container Service (Docker containers), S3, Lambda, Elastic File System, RDS, CloudWatch, CloudTrail, IAM, and Kinesis Streams.
  • Experience writing Pig and Hive scripts.
  • Experience in writing MapReduce programs using Apache Hadoop for analyzing big data.
  • Hands-on experience writing ad-hoc queries for moving data from HDFS to Hive and analyzing it using HiveQL.
  • In-depth knowledge of Hadoop architecture and Hadoop daemons such as NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.
  • Good working knowledge of Snowflake and Teradata databases.
  • Extensively worked on Spark with Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
  • Expertise in data extraction and manipulation with Python, using widely adopted libraries such as pandas for data analysis.
  • Experienced in configuring Snowflake multi-cluster warehouse sizes and managing credit usage.
  • Played a key role in Migrating Teradata objects into the Snowflake environment.
  • Experience with Snowflake virtual warehouses and multi-cluster warehouses.
  • In-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.
  • Experience using Snowflake cloning (zero-copy clones) and Time Travel (a clone/Time Travel sketch follows this list).
  • Solid understanding of and hands-on experience with extract, transform, load (ETL) processes.
  • Wrote a Kafka consumer to move Adobe clickstream JSON objects into the data lake; experience working with file formats such as text, SequenceFile, Parquet, and Avro.
  • Expertise in using Sqoop & Spark to load data from MySQL/Oracle to HDFS or HBase.
  • Well versed in ETL methodology for supporting corporate-wide solutions using Informatica 7.x/8.x/9.x.
  • Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored the results in AWS S3 (a PySpark sketch follows this list).
  • Implemented data warehouse solutions using the Snowflake product.
  • Proficient in integrating various data sources and multiple relational databases such as Oracle 11g/10g/9i, Sybase 12.5, Teradata, and flat files into the staging area, data warehouse, and data marts.
  • Expertise in tuning and optimizing database issues (SQL tuning).
  • Involved in all phases of the ETL life cycle, from scope analysis, design, and build through production support.
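Below is a minimal sketch of the Snowflake cloning and Time Travel usage mentioned above, driven from Python via the Snowflake connector. The account, warehouse, and table names are placeholders, not values from the projects described here.

```python
import snowflake.connector

# Connection parameters are placeholders; real credentials would come from a vault or env vars.
conn = snowflake.connector.connect(
    account="example_account",
    user="example_user",
    password="...",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Zero-copy clone: an instant copy of a table without duplicating storage.
    cur.execute("CREATE OR REPLACE TABLE ORDERS_CLONE CLONE ORDERS")

    # Time Travel: query the table as it looked one hour ago (offset in seconds).
    cur.execute("SELECT COUNT(*) FROM ORDERS AT(OFFSET => -3600)")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```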
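The following is a minimal PySpark sketch of the Hive-to-S3 preparation flow referenced above. The table name, columns, and S3 bucket are hypothetical; the real jobs depended on project-specific schemas and cluster configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hive-enabled Spark session; assumes the cluster is configured for Hive and S3 access.
spark = (SparkSession.builder
         .appName("hive-to-s3-prep")
         .enableHiveSupport()
         .getOrCreate())

# Load a Hive table into a DataFrame (placeholder table name).
orders = spark.table("analytics.orders")

# Example prep steps: drop invalid rows and derive a partition column.
prepped = (orders
           .filter(F.col("order_amount") > 0)
           .withColumn("order_date", F.to_date("order_ts")))

# Write the prepared data to S3 as Parquet, partitioned by date (placeholder bucket).
(prepped.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("s3a://example-prep-bucket/orders_prep/"))
```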

TECHNICAL SKILLS

Big Data Ecosystem: Hadoop, MapReduce, Hive, Spark, YARN, Sqoop, Oozie, Pig, Flume, Kafka, Impala, Zookeeper

Hadoop Distributions: Cloudera, Hortonworks, MapR and Apache

Cloud Technologies: AWS, GCP

Languages: Python, Java, Scala, SQL, PL/SQL, Pig Latin, HiveQL, Unix, Shell Scripting

NoSQL Databases: HBase, Cassandra, and MongoDB

Java Technologies: JavaBeans, JSP, Servlets, JDBC, Struts and JNDI

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB

Development Methodologies: Agile, waterfall

Web Design: HTML, DHTML, AJAX, JavaScript, jQuery, CSS, AngularJS, ExtJS and JSON

App/Web servers: WebSphere, WebLogic, JBoss and Tomcat

DB Languages: MySQL, PL/SQL and Oracle

RDBMS: Snowflake, Teradata, Oracle, MySQL, DB2 and Postgres

Development Tools: Eclipse, IntelliJ, Microsoft SQL Studio, Toad, NetBeans

Operating systems: UNIX, LINUX, Mac OS and Windows Variants

PROFESSIONAL EXPERIENCE

Confidential - San Francisco, CA

Cloud Data Engineer

Responsibilities:

  • Gathered user requirements and designed technical and functional specifications.
  • Installed, configured, and maintained Hadoop clusters for application development along with Hadoop tools such as Hive, Impala, HBase, HDFS, Pig, and Sqoop.
  • Developed ETL scripts using PySpark, Sqoop, and MapReduce jobs to ingest data from various sources, including the enterprise data grid, into the enterprise data lake (HDFS) and implement business rules; reduced storage usage by choosing appropriate file formats.
  • Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous and heterogeneous data sources, and built various graphs for business decision-making using the Python matplotlib library.
  • Loaded data from different data sources (Teradata, Oracle, and Redshift) into HDFS using Sqoop and loaded it into partitioned Hive tables.
  • Worked extensively with AWS components such as Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3).
  • Automated exploratory data analysis by retrieving data from Redshift using psycopg2 and performing the analysis in Python according to requirements (a minimal sketch follows this list).
  • Created Hive tables and wrote Hive queries for data analysis to meet business requirements; used Sqoop to import and export data from Oracle and MySQL.
  • Developed Oozie workflows for daily and weekly incremental loads to ingest data from external sources and import it into Hive tables in HDFS, reducing manual work by making the workflows reusable and reconfigurable.
  • Used AWS SQS to send processed data to downstream teams for further processing.
  • Integrated data from multiple source systems using the Snowflake cloud data warehouse and AWS S3 buckets, including loading nested JSON-formatted data into Snowflake tables.
  • Developed complex SQL and HQL scripts to transform data and load it into Hive and Impala tables for user access.
  • Used AWS services such as EC2 and S3 for small data sets and was responsible for setting up the Hadoop cluster on AWS EC2 instances.
  • Extended Hive and Pig core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregate Functions (UDAFs) written in Python.
  • Worked on importing and exporting data into HDFS and Hive using Sqoop.
  • Used Flume to handle streaming data and loaded the data into the Hadoop cluster.
  • Developed and executed Hive queries for de-normalizing the data.
  • Worked with the Hue interface for querying data in Hive and Impala.
  • Executed Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
  • Implemented CI and CD process using Jenkins along with custom scripts to automate repetitive tasks.
  • Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to adopting Impala in the project.
  • Developed a batch processing pipeline using Python and Airflow (an Airflow DAG sketch follows this list) and scheduled Spark jobs using Control-M.
  • Managed and reviewed Hadoop log files, analyzed SQL scripts, and designed Spark-based solutions for the process.
  • Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection checks, permission checks, and performance analysis.
  • Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HDFS.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Involved in different data migration activities.
  • Involved in fixing various issues related to data quality, data availability and data stability.
  • Worked in determining various strategies related to data security.
  • Strong in building exception-handling mappings for data quality, data cleansing, and data validation.
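A minimal sketch of the Redshift-to-Python exploratory analysis flow referenced above, assuming psycopg2 and pandas. The cluster endpoint, table, and columns are placeholders.

```python
import pandas as pd
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="...",
)

# Pull a recent slice of data for exploratory analysis (hypothetical table/columns).
query = """
    SELECT event_date, channel, revenue
    FROM sales.daily_revenue
    WHERE event_date >= CURRENT_DATE - 30
"""
df = pd.read_sql(query, conn)
conn.close()

# Simple EDA steps: summary statistics and a channel-level aggregate.
print(df.describe())
print(df.groupby("channel")["revenue"].sum().sort_values(ascending=False))
```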
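A minimal Airflow DAG sketch for the Python batch pipeline referenced above, using Airflow 2.x operator imports. The DAG id, task logic, and spark-submit command are placeholders; the production pipeline's steps were project-specific.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_and_stage(**_):
    # Placeholder for the Python ingestion/cleansing logic.
    print("extracting source files and staging them in HDFS/S3")


default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_batch_pipeline",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="extract_and_stage", python_callable=extract_and_stage)

    # Submit the Spark transformation step (spark-submit command is a placeholder).
    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit --master yarn /opt/jobs/transform.py",
    )

    stage >> transform
```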

Environment: AWS, Hadoop, Spark, HDFS, Hive, Impala, YARN, HBase, EMR, Teradata, Redshift, Python, Oozie, MySQL, XML, Bash scripting, EC2, PuTTY, JIRA, Control-M, Jenkins.

Confidential - Bloomington, IL

Data Engineer

Responsibilities:

  • Involved in various phases of development; analyzed and developed the system within an Agile development model.
  • Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation on GCP, and coordinated tasks across the team.
  • Responsible for data extraction and ingestion from different data sources into the Hadoop data lake by creating ETL pipelines using Pig and Hive.
  • Compared self-hosted Hadoop with GCP's Dataproc and explored Bigtable (managed HBase) use cases and performance evaluation.
  • Developed Spark scripts using Python on GCP for Data Aggregation and Validation.
  • Extensively worked with the Spark SQL context to create DataFrames and Datasets for preprocessing model data.
  • Involved in Platform Modernization project to get the data into GCP.
  • Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery.
  • Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
  • Used the Cloud Shell SDK in GCP to configure the Dataproc, Cloud Storage, and BigQuery services (a BigQuery load sketch follows this list).
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data, and analyzed them by running Hive queries and Pig scripts.
  • Designed HBase row keys to store text and JSON as key values, structured so that gets and scans return data in sorted order.
  • Wrote JUnit tests and integration test cases for those microservices.
  • Designed and built data pipelines to load data into the GCP platform.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations (a streaming sketch follows this list).
  • Created Hive schemas using performance techniques like partitioning and bucketing.
  • Developed and maintained batch data flow using HiveQL and Unix scripting
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
  • Built large-scale data processing systems for data warehousing solutions and worked with unstructured data mining on NoSQL databases.
  • Specified the cluster size, resource-pool allocation, and Hadoop distribution by writing the specifications in JSON format.
  • Demonstrable experience designing and implementing complex applications and distributed systems into public cloud infrastructure (AWS, GCP)
  • Developed workflow in Oozie to manage and schedule jobs on the Hadoop cluster to trigger daily, weekly and monthly batch cycles.
  • Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both Managed and External tables created by Hive using Impala.
  • Developed customized Hive UDFs and UDAFs in Java, set up JDBC connectivity with Hive, and developed and executed Pig scripts and Pig UDFs.
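A minimal sketch of loading data from Cloud Storage into BigQuery with the google-cloud-bigquery Python client, as referenced above. The project, dataset, bucket, and query are placeholders.

```python
from google.cloud import bigquery

# Assumes application-default credentials; project/dataset/bucket names are placeholders.
client = bigquery.Client(project="example-project")

# Load a Parquet export from Cloud Storage into a BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/events/*.parquet",
    "example-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish

# Run a quick validation query against the loaded table.
query = """
    SELECT event_type, COUNT(*) AS cnt
    FROM `example-project.analytics.events`
    GROUP BY event_type
    ORDER BY cnt DESC
"""
for row in client.query(query).result():
    print(row.event_type, row.cnt)
```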
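The streaming bullet above describes Kafka Direct Stream with Spark Streaming; the sketch below uses the Structured Streaming Kafka source as a close stand-in, since that is the current PySpark API. Broker addresses, topic name, and payload schema are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-stream-transform").getOrCreate()

# Expected shape of the JSON payload on the topic (hypothetical fields).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read from Kafka; broker and topic names are placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
       .option("subscribe", "clickstream-events")
       .load())

# Parse the JSON value and apply a simple business transformation.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*")
          .filter(F.col("event_type") == "purchase"))

# Write the transformed stream out (console sink here; HBase/warehouse sinks in practice).
query = (events.writeStream
         .outputMode("append")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/clickstream")
         .start())
query.awaitTermination()
```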

Environment: GCP, Hadoop 3.0, BigQuery, Microservices, Java 8, MapReduce, Agile, Spark 2.4, Scala, HBase 1.2, JSON, Kafka, JDBC, Hive 2.3, Pig 0.17

Confidential - Chicago, IL

Data Engineer

Responsibilities:

  • Developed Spark applications using Scala.
  • Performance analysis of batch jobs by using Spark Tuning parameters.
  • Enhanced and optimized Spark (Scala and PySpark) jobs to aggregate, group, and run data mining tasks using the Spark framework.
  • Worked with AWS services such as Kinesis, DynamoDB, and S3 (a boto3 sketch follows this list).
  • Imported and exported data into HDFS and Hive using Sqoop and Kafka in both batch and streaming modes.
  • Worked with a MySQL backend database to store monitoring information for the CCPA project.
  • Worked with AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Used the ServiceNow platform to open and close tickets for the CCPA project.
  • Involved in complete Big Data flow of the application data ingestion from upstream to HDFS, processing the data in HDFS and analyzing the data using several tools.
  • Imported data in various formats such as JSON, ORC, and Parquet into the HDFS cluster with compression for optimization.
  • Experienced in ingesting data from RDBMS sources such as Oracle, SQL Server, and Teradata into HDFS using Sqoop.
  • Developed and deployed PySpark applications on a Databricks cluster.
  • Experience in managing and reviewing huge Hadoop log files.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which consumes data from Kafka in near real time and persists it into HBase.
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for small data set processing and storage, and maintained the Hadoop cluster on AWS EMR.
  • Performance analysis of Spark streaming and batch jobs by using Spark tuning parameters.
  • Enhanced and optimized Spark/Python jobs to aggregate, group, and run data mining tasks using the Spark framework.
  • Installed, configured, and developed various pipeline activities in NiFi using processors such as the Sqoop, Kafka, HDFS, and file processors.
  • Created Data Pipelines as per the business requirements and scheduled it using Oozie.
  • Used Hive to join multiple tables from a source system and load them into Elasticsearch.
  • Imported data in various formats such as JSON, SequenceFile, text, CSV, Avro, and Parquet into the HDFS cluster with compression for optimization.
  • Configured Hive and wrote Hive UDFs and UDAFs; also created static and dynamic partitions with bucketing.
  • Expertise in designing and creating various analytical reports and Automated Dashboards to help users to identify critical KPIs and facilitate strategic planning in the organization.
  • Experience in CI/CD tool Jenkins for code deployment and scheduling of jobs.
  • Expertise in creating metrics and processing data using a query exporter and Prometheus dashboards.
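A minimal boto3 sketch for the Kinesis and S3 work referenced above. The stream name, bucket, key, and record fields are placeholders.

```python
import json

import boto3

# Clients assume credentials from the environment or instance profile; names are placeholders.
kinesis = boto3.client("kinesis", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

record = {"user_id": "u-123", "event": "page_view", "ts": "2021-06-01T12:00:00Z"}

# Push one event onto a Kinesis stream (hypothetical stream name).
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["user_id"],
)

# Land the same payload in S3 for the batch/Redshift path (hypothetical bucket and key).
s3.put_object(
    Bucket="example-raw-events",
    Key="events/2021/06/01/u-123.json",
    Body=json.dumps(record).encode("utf-8"),
)
```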

Environment: Hive, Prometheus, PySpark, Jenkins, Airflow, Gerrit, Kafka, Spark, Sqoop, Maven, Automic, SQL, Scala, JUnit, IntelliJ, MySQL, Databricks, AWS Cloud.

Confidential

ETL Developer

Responsibilities:

  • Responsible for the development, support, and maintenance of ETL (extract, transform, load) processes using Informatica PowerCenter.
  • Designed and developed several ETL scripts using Informatica and UNIX shell scripts.
  • Analyzed source data coming from Oracle, flat files, and MS Excel, and coordinated with the data warehouse team in developing the dimensional model.
  • Created FTP, ODBC, Relational connections for the sources and targets.
  • Implemented the Slowly Changing Dimension Type 2 methodology to retain the full history of account and transaction information (a minimal sketch of the pattern follows this list).
  • Well versed in developing complex SQL queries, unions, and multi-table joins, with experience using views.
  • Experience in database programming in PL/SQL (stored procedures, triggers, and packages).
  • Scheduled Sessions and Batches on the Informatica Server using Informatica Server Manager.
  • Experience with writing and executing test cases for data transformations in Informatica.
  • Created JIL scripts and scheduled workflows using CA Autosys.
  • Extensively used SQL scripts/queries for data verification at the backend.
  • Executed SQL queries, stored procedures and performed data validation as a part of backend testing.
  • Used SQL to test various reports and ETL job loads in development, testing, and production environments.
  • Developed UNIX shell scripts to control the process flow for Informatica workflows to handle high volume data.
  • Prepared Test cases based on Functional Requirements Document.
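The Slowly Changing Dimension Type 2 bullet above refers to logic built in Informatica mappings; the sketch below only illustrates the underlying pattern (expire changed rows, then insert new current versions) as plain SQL driven from Python with cx_Oracle. Table and column names are hypothetical.

```python
import cx_Oracle

# SCD Type 2 sketch: expire changed dimension rows, then insert new current versions.
# Connection string and table/column names are placeholders for illustration only.
conn = cx_Oracle.connect("etl_user", "...", "db-host/ORCLPDB1")
cur = conn.cursor()

# Step 1: close out current rows whose attributes changed in the staging data.
cur.execute("""
    UPDATE dim_account d
       SET d.effective_end_date = TRUNC(SYSDATE) - 1,
           d.current_flag = 'N'
     WHERE d.current_flag = 'Y'
       AND EXISTS (
             SELECT 1
               FROM stg_account s
              WHERE s.account_id = d.account_id
                AND (s.account_status <> d.account_status
                     OR s.account_tier <> d.account_tier))
""")

# Step 2: insert a new current version for new or changed accounts.
cur.execute("""
    INSERT INTO dim_account
        (account_id, account_status, account_tier,
         effective_start_date, effective_end_date, current_flag)
    SELECT s.account_id, s.account_status, s.account_tier,
           TRUNC(SYSDATE), DATE '9999-12-31', 'Y'
      FROM stg_account s
     WHERE NOT EXISTS (
             SELECT 1
               FROM dim_account d
              WHERE d.account_id = s.account_id
                AND d.current_flag = 'Y')
""")

conn.commit()
cur.close()
conn.close()
```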

Environment: Informatica Power Center 9.x, Oracle 11g, SQL plus, PL/SQL, Oracle, SQL Developer, UNIX.

Confidential

Sr.Data Analyst

Responsibilities:

  • Classified users into groups within a gambling model to calculate odds, affecting 100,000 users of the Dotamax app, using Python and R.
  • Wrote SQL queries to analyze the behavior of existing customers and to connect with a targeted VIP group of new customers.
  • Crafted and implemented daily mission and ranking features designed to attract Max+ mobile app users.
  • Collaborated with the technical directors and site leads to ensure upgrades were scheduled at opportune times to minimize interruption to site activities.
  • Initiated multiple data projects, including building models, algorithms, and tools to improve the user experience.
  • Designed and implemented data cleansing processes and statistical analysis with R and Python 3.x (a cleansing sketch follows this list).
  • Created visualization reports and dashboards with Tableau 9.x, ggplot2.
  • Designed and developed new reports and maintained existing reports for the Human Resource Management System Dashboards using Tableau, Qlikview and Microsoft Excel to support the business strategy and management.
  • Identified process improvements that significantly reduce workloads or improve quality.
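A minimal sketch of a Python data cleansing step like the one referenced above. The input file and column names are hypothetical; the actual process combined R and Python 3.x against project data.

```python
import pandas as pd

# Load raw activity data (hypothetical file and columns).
raw = pd.read_csv("user_activity.csv")

cleaned = (
    raw
    .drop_duplicates(subset=["user_id", "session_id"])  # remove duplicate sessions
    .dropna(subset=["user_id"])                          # drop rows without a user id
    .assign(
        signup_date=lambda df: pd.to_datetime(df["signup_date"], errors="coerce"),
        total_spend=lambda df: df["total_spend"].fillna(0.0),
    )
)

# Quick statistical summary used to sanity-check the cleansed data.
print(cleaned.describe(include="all"))
```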

Environment: Informatica 9.1, Oracle 11g, SQL Developer, PL/SQL, Cognos, Splunk, TOAD, MS Access, MS Excel
