Sr Data Engineer Resume
Richardson, TX
SUMMARY
- Years of professional experience as a Big Data Engineer/Developer in analysis, design, development, implementation, maintenance, and support, with experience in Big Data, Hadoop development, Python, PL/SQL, Java, SQL, REST APIs, and the GCP cloud platform.
- Hands-on experience in the Hadoop ecosystem, including HDFS, MapReduce, Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, and Zookeeper; also worked on Spark SQL, Spark Streaming, and AWS services such as EMR, S3, Airflow, Glue, and Redshift.
- Hands-on experience implementing and designing large-scale data lakes, pipelines, and efficient ETL (extract/transform/load) workflows to organize, collect, and standardize data that helps generate insights and address reporting needs.
- Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
- Experience working with both streaming and batch data processing using multiple technologies.
- Hands-on experience developing data pipelines using Spark components: Spark SQL, Spark Streaming, and MLlib.
- Hands-on experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Excellent understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Worked on AWS components and services particularly Elastic Map Reduce (EMR), Elastic Compute Cloud (EC2), Simple Storage Service (S3) and Lambda functions.
- Hands-on experience working with Kafka streaming using KTables, GlobalKTables, and KStreams, and deploying these on Confluent and Apache Kafka environments.
- Developed Python code to gather data from HBase and designed solutions implemented using PySpark.
- Hands-on experience interacting with REST APIs developed using a microservices architecture to retrieve data from different sources.
- Experience implementing CI/CD pipelines for DevOps: source code management using Git, unit testing, and build and deployment scripts.
- Hands-on experience working with DevOps tools such as Jenkins, Docker, Kubernetes, GoCD, and the Autosys scheduler.
- Expertise with RDBMSs such as Oracle and MySQL, writing complex SQL queries, stored procedures, and triggers.
- Keen on learning the newer technology stack that Google Cloud Platform (GCP) adds.
- Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experienced in developing web-based applications using Python, Qt, C++, XML, CSS, JSON, HTML, and DHTML.
- Hands-on experience building, scheduling, and monitoring workflows using Apache Airflow with Python.
- Experienced in software methodologies such as Agile and SAFe: sprint planning, attending daily standups, and story grooming.
- Experience working in various software development methodologies such as Waterfall, Agile Scrum, and TDD.
- Worked through the complete Software Development Life Cycle in the Agile model.
- Strong problem-solving skills with an ability to isolate, deconstruct, and resolve complex data challenges.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, Yarn, MapReduce, Spark, Kafka, Kafka Connect, Hive, Pig, Airflow, Impala, Sqoop, HBase, Flume, Oozie, Zookeeper
Hadoop Distributions: Cloudera, Hortonworks, Apache.
Cloud Environments: GCP, AWS EMR, EC2, S3 and Azure Data Factory
Operating Systems: Linux, Windows
Languages: Python, SQL, Scala, Java
Databases: Oracle, SQL Server, MySQL, HBase, MongoDB, DynamoDB
ETL Tools: Informatica
Report & Development Tools: Eclipse, IntelliJ IDEA, Visual Studio Code, Jupyter Notebook, Tableau, Power BI.
Development/Build Tools: Maven, Gradle
Repositories: GitHub, SVN.
Scripting Languages: bash/Shell scripting, Linux/Unix
Methodology: Agile, Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Richardson, TX
Sr Data Engineer
Responsibilities:
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala (see the streaming sketch at the end of this section).
- Exported data to RDBMS servers using Sqoop and processed that data for ETL operations.
- Developed ETL data pipelines using Hadoop big data tools: HDFS, Hive, Presto, Apache NiFi, Sqoop, Spark, Elasticsearch, and Kafka.
- Developed data pipelines using Flume, Pig, and Sqoop to ingest data and customer histories into HDFS for analysis.
- Experience designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Created access points to the data through Spark, Python, and Presto.
- Developed Spark applications using PySpark and Spark-SQL and Python for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
- Developed and designed a system using Python and Scala to collect data from multiple portals using Kafka and then process it using Spark.
- Responsible for designing data pipelines using ETL for effective data ingestion from existing data management platforms to enterprise Data Lake.
- Developed and executed interface test scenarios and test scripts for complex business rules using available ETL tools.
- Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
- Created multi-node Hadoop and Spark clusters in AWS instances to generate terabytes of data and stored it in AWS HDFS.
- Developed Spark code in Scala and Python for the project in the Hadoop/Hive environment.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
- Designed the ETL process by creating a high-level design document covering logical data flows, the source data extraction process, database staging, extract creation, source archival, job scheduling, and error handling.
- Developed Spark code from scratch in Scala according to the technical requirements and used both the Hive context and the SQL context of Spark for initial testing of the Spark jobs.
- Created external tables with partitions in Hive, designed external and managed Hive tables, and processed data into HDFS using Sqoop.
- Developed Linux shell scripts to create reports from Hive data.
- Hands-on experience with data analytics services such as Athena, Glue Data Catalog, and QuickSight.
- Managed and supported enterprise data warehouse operations and advanced predictive big data application development using Cloudera and Hortonworks HDP.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data in various formats such as text, zip, XML, and JSON.
- Worked on Apache Airflow as a job trigger, with code written in Hive Query Language and Scala, to read, backfill, and write data for a particular time frame from Hive tables to HDFS locations.
Environment: Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL, PySpark, SQL Server, GCP, Python
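A minimal sketch of the Kafka-to-HDFS streaming pattern described in the first responsibility above. The work itself used Scala and Spark Streaming; this sketch uses PySpark Structured Streaming to show the same shape, and the broker, topic, and HDFS paths are placeholder assumptions rather than values from this resume.

```python
# Illustrative sketch only: stream records from a Kafka topic into HDFS as Parquet.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs-stream").getOrCreate()

# Read the Kafka topic as a streaming DataFrame (placeholder broker and topic).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092")
          .option("subscribe", "clickstream-events")
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Append each micro-batch to HDFS; the checkpoint directory tracks Kafka offsets.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/clickstream")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
         .outputMode("append")
         .start())

query.awaitTermination()
```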
Confidential, Dallas, TX
Sr Data Engineer (Big Data, GCP)
Responsibilities:
- Wrote Spark SQL queries and Python scripts to design the solutions and implemented them using PySpark.
- Wrote a Python program to maintain raw file archival in a GCS bucket.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
- Designed and built production data pipelines from data ingestion to consumption within a hybrid big data architecture, using cloud-native GCP services, Java, Python, Scala, and SQL.
- Opened an SSH tunnel to Google Dataproc to access the YARN resource manager and monitor Spark jobs.
- Worked on monitoring, scheduling, and authoring data pipelines using Apache Airflow (see the DAG sketch at the end of this section).
- Developed Spark jobs to create DataFrames from the source system, and to process and analyze the data in those DataFrames based on business requirements.
- Involved in enhancing the existing ETL data pipeline for better data migration with reduced data issues using Apache Airflow.
- Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
- Developed RESTful API services using Spring Boot to upload data from local storage to Amazon S3, list S3 objects, and perform file manipulation operations.
- Scheduled workflows through the Oozie engine to run multiple Hive and Pig jobs, and wrote shell scripts to automate the jobs in UNIX.
- Involved in creating Hive tables, loading them with data, and writing Hive queries, which invoke MapReduce jobs in the backend.
- Automated repeated tasks using Python and UNIX Bash scripting.
- Utilized the Apache Hadoop environment provided by Cloudera; monitored and debugged Spark jobs running on a Spark cluster using Cloudera Manager.
Environment: Hadoop, MapReduce, HDFS, Spark, Java, Yarn, Hive 2.1, Sqoop, Cassandra, Oozie, Scala, Python, AWS, Flume, Kafka, Tableau, Linux, SQL Server, Shell Scripting, Apache Airflow, Unix, GCP.
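A minimal sketch of the kind of Airflow DAG referenced above for authoring and scheduling pipelines on GCP. The DAG id, cluster, bucket, and script names are placeholder assumptions; the Dataproc job is submitted through the gcloud CLI via BashOperator rather than any project-specific operator.

```python
# Illustrative sketch only: a daily DAG that runs a PySpark job on Dataproc and
# then archives the day's raw files in GCS. All names below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="daily_raw_to_curated",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Submit a PySpark job to an existing Dataproc cluster via the gcloud CLI.
    submit_spark_job = BashOperator(
        task_id="submit_spark_job",
        bash_command=(
            "gcloud dataproc jobs submit pyspark "
            "gs://example-bucket/jobs/transform.py "
            "--cluster=example-cluster --region=us-central1"
        ),
    )

    # Move the day's raw files to an archive prefix once the job succeeds.
    archive_raw_files = BashOperator(
        task_id="archive_raw_files",
        bash_command="gsutil -m mv gs://example-bucket/raw/* gs://example-bucket/archive/",
    )

    submit_spark_job >> archive_raw_files
```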
Confidential, Pittsburgh, PA
Big Data / Data Engineer
Responsibilities:
- Delivery experience on major Hadoop ecosystem components such as Hive, Spark, Kafka, Elasticsearch, and HBase, with monitoring through Cloudera Manager; worked on loading disparate data sets from different sources into the Hadoop environment using Spark.
- Used PySpark and Spark SQL for extracting, transforming, and loading the data according to business requirements.
- Wrote a Python program to maintain raw file archival in a GCS bucket.
- Used the Cloud Shell SDK in GCP to configure the Dataproc, Storage, and BigQuery services.
- Designed and built production data pipelines from data ingestion to consumption within a hybrid big data architecture, using cloud-native GCP services, Java, Python, Scala, and SQL.
- Developed UNIX scripts to create batch loads that bring huge amounts of data from relational databases to the big data platform.
- Developed Spark SQL scripts using PySpark to perform transformations and actions on DataFrames and Datasets in Spark for faster data processing.
- Involved in enhancing the existing ETL data pipeline for better data migration with reduced data issues using Apache Airflow.
- Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
- Knowledge of creating star schemas for drilling into data; created PySpark procedures, functions, and packages to load data.
- Scheduled workflows through the Oozie engine to run multiple Hive and Pig jobs, and wrote shell scripts to automate the jobs in UNIX.
- Worked on monitoring, scheduling, and authoring Data Pipelines using Apache Airflow.
- Developed Spark jobs to create DataFrames from the source system, and to process and analyze the data in those DataFrames based on business requirements.
- Involved in creating Hive tables, loading them with data, and writing Hive queries, which invoke MapReduce jobs in the backend.
- Utilized the Apache Hadoop environment provided by Cloudera; monitored and debugged Spark jobs running on a Spark cluster using Cloudera Manager.
- Fetched data and generated monthly reports, visualizing them using Tableau and Python.
- Managed and supported enterprise data warehouse operations and advanced predictive big data application development using Cloudera.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real-time analysis (see the consumer sketch at the end of this section), and worked extensively on Hive to create, alter, and drop tables; involved in writing Hive queries.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig, and parsed high-level design specs into simple ETL coding and mapping standards.
- Created reports with different selection criteria from Hive tables on the data residing in the data lake.
- Extensively used ETL methodology to support data extraction, transformation, and load processing using Hadoop.
Environment: Hadoop, MapReduce, HDFS, Spark, PySpark, Java, Yarn, Hive 2.1, Sqoop, Cassandra, Oozie, Scala, Python, GCP, Flume, Kafka, Tableau, Linux, SQL Server, Shell Scripting, Apache Airflow, Cloudera.
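A minimal sketch of the Kafka-to-Cassandra consumer mentioned above. It assumes the kafka-python and cassandra-driver client libraries, and the topic, keyspace, table, and column names are placeholders rather than details from this resume.

```python
# Illustrative sketch only: consume JSON messages from Kafka and insert them into
# Cassandra for near real-time analysis. All names below are placeholders.
import json

from kafka import KafkaConsumer            # kafka-python
from cassandra.cluster import Cluster      # cassandra-driver

consumer = KafkaConsumer(
    "orders",                              # placeholder topic
    bootstrap_servers=["broker-1:9092"],   # placeholder broker
    group_id="orders-to-cassandra",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Connect to a placeholder keyspace and prepare the insert once, so the CQL is
# parsed a single time rather than on every message.
session = Cluster(["cassandra-1"]).connect("analytics")
insert = session.prepare(
    "INSERT INTO orders_by_id (order_id, customer_id, amount) VALUES (?, ?, ?)"
)

# Poll the topic and write each record as it arrives.
for message in consumer:
    record = message.value
    session.execute(insert, (record["order_id"], record["customer_id"], record["amount"]))
```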
Confidential
Hadoop Developer
Responsibilities:
- Understood client requirements and prepared design documents accordingly.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data; imported and exported data using Sqoop between MySQL, HDFS, and NoSQL databases on a regular basis, and designed and developed Hive scripts to process the data in batch for analysis.
- Implemented partitioning and bucketing in Hive as part of performance tuning (see the partitioning sketch at the end of this section), and built workflow and coordination files using the Oozie framework to automate tasks.
- Developed data pipelines using Sqoop, Pig, and Hive to ingest customer data into HDFS and perform data analytics.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Developed Sqoop scripts to handle change data capture, processing incremental records between newly arrived and existing data in RDBMS tables.
- Experience writing Sqoop jobs to move data from various RDBMS into HDFS and vice versa.
- Developed ETL pipelines to provide data to business intelligence teams for building visualizations.
- Optimized Hive queries using various file formats such as Parquet, JSON, Avro, and CSV.
- Worked with the Oozie workflow engine to schedule time-based jobs that perform multiple actions.
- Involved in unit testing, interface testing, system testing and user acceptance testing of the workflow Tool.
Environment: HDFS, Hive, Oozie, Cloudera Distribution with Hadoop (CDH4), MySQL, CentOS, Apache HBase, MapReduce, Hue, Pig, Sqoop, SQL, Windows, Linux.
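A minimal sketch of the Hive partitioning and bucketing described above. The original work used Hive scripts directly; PySpark's Hive support is used here only as a convenient way to run equivalent HiveQL, and every table, column, and path name is a placeholder.

```python
# Illustrative sketch only: a partitioned external Hive table loaded with dynamic
# partitioning, plus a bucketed table. All names and paths are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Tiny placeholder staging data so the dynamic-partition insert has a source.
staging = spark.createDataFrame(
    [(1, 101, 25.00, "2023-01-01"), (2, 102, 40.00, "2023-01-02")],
    ["order_id", "customer_id", "amount", "load_date"],
)
staging.createOrReplaceTempView("sales_staging")

# External table partitioned by load date and stored as Parquet.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_curated (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION '/tmp/data/curated/sales'
""")

# Dynamic partitioning: each row's load_date value decides the partition it lands in.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_curated PARTITION (load_date)
    SELECT order_id, customer_id, amount, load_date
    FROM sales_staging
""")

# Bucketing (CLUSTERED BY ... INTO n BUCKETS in Hive DDL) hashes rows into a fixed
# number of files on the bucketed key, which helps joins and sampling. Spark's own
# bucketed-table syntax is used here because it runs cleanly through spark.sql.
spark.sql("""
    CREATE TABLE IF NOT EXISTS customers_bucketed (
        customer_id BIGINT,
        name        STRING
    )
    USING PARQUET
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
""")
```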
Confidential
Java/SQL Developer
Responsibilities:
- Developed and created database objects in Oracle and SQL Server, i.e., tables, indexes, stored procedures, views, functions, triggers, etc.
- Used SQL*Loader scripts to load the data into temporary tables and procedures to validate the data.
- Wrote complex SQL queries and scripts to provide input to the reporting tools.
- Worked with large amounts of data from various data sources and loaded it into the database.
- Wrote stored procedures and packages per business requirements and scheduled jobs for health checks.
- Developed complex queries to generate monthly and weekly reports, extracting data for visualization in QlikView.