Data Engineer Resume

Houston, TX

SUMMARY:

  • Well-rounded Big Data Engineer and Developer with hands-on experience in all phases of Big Data projects, including design, implementation, development, customization, performance tuning, data cleaning, and database work.
  • Extends Hive and Pig core functionality with custom User-Defined Functions (UDFs), User-Defined Table-Generating Functions (UDTFs), and User-Defined Aggregating Functions (UDAFs).
  • Good knowledge of the Spark framework for both batch and real-time data processing; hands-on experience processing data with the Spark Streaming API.
  • Skilled in AWS, Redshift, Cassandra, DynamoDB, and various cloud tools; has used the AWS, Microsoft Azure, and Google Cloud platforms.
  • Have worked with over 100 terabytes of data from data warehouse and over 1 petabyte of data from Hadoop cluster.
  • Have handled over 70 billion messages a day funneled through Kafka topics. Responsible for moving and transforming massive datasets into valuable and insightful information.
  • Capable of building data tools to optimize utilization of data and configure end-to-end systems. Uses Spark SQL to perform transformations and actions on data residing in Hive.
  • Uses Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
  • Responsible for building quality data-transfer pipelines for data transformation using Flume, Spark, Spark Streaming, and Hadoop.
  • Able to architect and build new data models that provide intuitive analytics to customers.
  • Able to design and develop new systems and tools to enable clients to optimize and track using Spark.
  • Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on premise nodes.
  • Expert in the big data ecosystem, using Hadoop, Spark, and Kafka with column-oriented big data systems such as Cassandra and HBase.
  • Worked with various file formats (delimited text files, click stream log files, Apache log files, Avro files, JSON files, XML Files).
  • Uses Flume, Kafka, NiFi, and HiveQL scripts to extract, transform, and load data into databases. Able to perform cluster and system performance tuning.
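The Hive UDF extension work mentioned above is usually wired up in HiveQL along these lines (the jar path, function, class, and table names are illustrative placeholders, not taken from any specific project):

```sql
-- Register a custom UDF packaged in a jar, then call it like a
-- built-in function. All names below are hypothetical examples.
ADD JAR hdfs:///libs/custom-udfs.jar;
CREATE TEMPORARY FUNCTION normalize_phone AS 'com.example.udf.NormalizePhone';

SELECT normalize_phone(phone_number) AS phone
FROM customers
LIMIT 10;
```

UDTFs and UDAFs are registered the same way; only the Hive interface the jar implements differs.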

TECHNICAL SKILLS:

Databases: Apache Cassandra, Amazon Redshift, Amazon RDS, SQL, Apache HBase, Hive, MongoDB

Data Storage: HDFS, Data Lake, Data Warehouse, Database, PostgreSQL

Amazon Stack: AWS, EMR, EC2, SQS, S3, DynamoDB, Redshift, CloudFormation

Programming Languages: Scala, Python (PySpark, PyTorch), Java, Shell scripting

Virtualization: VMWare, vSphere, Virtual Machine

Data Pipelines/ETL: Flume, Apache Kafka, Logstash

Development Environment: IDEs: Jupyter Notebooks, PyCharm, IntelliJ, Spyder, Anaconda; Continuous Integration (CI/CD): Jenkins; Versioning: Git, GitHub

Cluster Security & Authentication: Kerberos and Ranger

Query Languages: SQL, Spark SQL, HiveQL, CQL

Log Analysis: Elastic Stack (Elasticsearch, Logstash, and Kibana)

Distributions: Hadoop, Cloudera, Hortonworks

Hadoop Ecosystem: Hadoop, Hive, Spark, Maven, Ant, Kafka, HBase, YARN, Flume, Zookeeper, Impala, HDFS, Pig, Mesos, Oozie, Tez, Apache Airflow

Frameworks: Spark, Kafka

Search Tools: Apache Solr/Lucene, Elasticsearch

File Formats: Parquet, Avro, ORC

File Compression: Snappy, Gzip

Methodologies: Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing, Regression Testing

Streaming Data: Kinesis, Spark, Spark Streaming, Spark Structured Streaming

PROFESSIONAL EXPERIENCE:

DATA ENGINEER

Confidential, Houston, TX

Responsibilities:

  • Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS
  • Cleaned and manipulated different datasets in real time using the DStreams API
  • Researched the migration of DStreams to Spark Structured Streaming
  • Schematized different streams using case classes and struct types
  • Optimized Spark Streaming jobs
  • Collaborated on the creation of DevOps and testing for Spark jobs
  • Documented Spark code for knowledge transfer
  • Created AWS CloudFormation YAML templates to automate creation of a full stack for various environments specified with input parameters
  • Worked on creating AWS SNS messages & SQS queues for various internal company services
  • Created AWS Lambda functions that trigger on object PUT events and convert txt and csv files to parquet for further processing
  • Developed AWS Lambda that starts AWS Glue Workflow process with multiple ON DEMAND and CONDITIONAL triggers to start AWS Glue Jobs.
  • Developed AWS Glue scripts and Python shell scripts for the Glue ETL process, working with Glue DynamicFrames
  • Developed Python scripts that are executed on persistent EC2 machines to catch SNS & SQS queues
  • Started external process using custom jar builds to create merged data based on geolocation, site information, status, and additional Cassandra table specifications
  • Wrote CQL scripts to collect, join and insert data in Cassandra
  • Worked with QA team to assist with testing
  • Developed Glue jobs that execute Spark on the backend
  • Created shell scripts to organize information
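As a rough sketch of the S3-triggered Lambda pattern described above (the bucket layout and file-type routing are assumptions, and the actual Parquet conversion step is stubbed out rather than shown):

```python
import urllib.parse

def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 PUT event payload.

    Object keys arrive URL-encoded in the event record, so decode
    them before handing them to the conversion step.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = urllib.parse.unquote_plus(s3.get("object", {}).get("key", ""))
        if bucket and key:
            objects.append((bucket, key))
    return objects

def handler(event, context):
    # A real handler would read each object (e.g. via boto3) and
    # rewrite it as Parquet; here we only route txt/csv uploads.
    converted = [
        (bucket, key) for bucket, key in extract_s3_objects(event)
        if key.endswith((".txt", ".csv"))
    ]
    return {"to_convert": converted}
```

Decoding with `unquote_plus` matters because S3 event keys are delivered percent-encoded, which would otherwise break downstream reads.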

SENIOR DATA ENGINEER

Confidential, San Antonio, TX

Responsibilities:

  • Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS
  • Optimized data collection, flow, and delivery for cross-functional teams
  • Configured a test environment in GitLab to create an AWS EMR cluster
  • Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets
  • Used Apache Hue to create Apache Hive queries that filter user data
  • Validated AVRO schema tables to support changed user data
  • Processed multiple terabytes of data stored in AWS using Elastic Map Reduce (EMR) to AWS Redshift
  • Developed Hive queries (HQL) to aggregate user subscription data
  • Initiated a data pipeline using a Docker image container with the AWS CLI and Maven, deployed on AWS EMR.
  • Wrote Bash script to be used during cluster launch to set up HDFS
  • Appended EMR cluster steps using JSON format to execute tasks preparing cluster during launch
  • Worked on Hue to write queries that generate daily, weekly, monthly as well as custom reports
  • Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3)
  • Ensured reports were generated accurately and all fields populated correctly per client specifications.
  • Scheduled report generation using Airflow and automated delivery via SFTP, SSH and Email.
  • Wrote streaming applications with Spark Streaming/Structured Streaming.
  • Used the Spark SQL module to store data into HDFS
  • Implemented applications on Hadoop/Spark on Kerberos secured cluster
  • Set up and configured AWS ECR to be used as default EMR cluster container
  • Worked on troubleshooting and fixing VERTEX FAILURE errors caused by Hive configuration on EMR, data-type mismatches, and EMR instance types; at various points these errors caused pipeline disruptions, cluster failures, and backlogs.
  • Contributed to switching table creation using AVRO schema files. Old versions used hardcoded queries.
  • Updated and re-wrote test cases written in Java. Test cases were used by GitLab Runner during test stage of cluster deployment to ensure all tables are correctly and successfully created.
  • Built a Spark proof of concept with Python using PySpark
  • Wrote Bash script to be used in GitLab Runner’s YML that automatically detects and backfills reports which have not been generated.
  • Worked on adding steps to EMR cluster that deliver reports via email as per client’s requests
  • Contributed to re-creating existing workflow in Jenkins for CI/CD and CRON task scheduling during company’s transition away from GitLab.
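Appending EMR cluster steps in JSON, as mentioned above, typically takes the shape of an `aws emr add-steps --steps file://steps.json` payload like the following (the HDFS path is a made-up example):

```json
[
  {
    "Name": "prepare-hdfs",
    "Type": "CUSTOM_JAR",
    "ActionOnFailure": "CONTINUE",
    "Jar": "command-runner.jar",
    "Args": ["bash", "-c", "hdfs dfs -mkdir -p /staging/reports"]
  }
]
```

`command-runner.jar` lets a step run an arbitrary shell command on the master node, which is a common way to prepare HDFS during cluster launch.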

BIG DATA ENGINEER

Confidential, San Ramon, CA

Responsibilities:

  • Worked in a multi-cluster environment, setting up the Cloudera and Hortonworks Hadoop ecosystems.
  • Performed upgrades, patches and bug fixes in HDP and CDH clusters.
  • Hands-on with Spark Core, Spark SQL and Data Frames/Data Sets/RDD API.
  • Used Spark jobs, Spark SQL, and the DataFrames API to load structured data into Spark clusters.
  • Created a Kafka broker which uses schema to fetch structured data in structured streaming.
  • Interacted with data residing in HDFS using Spark to process the data.
  • Handled structured data via Spark SQL, stored into Hive tables for consumption.
  • Accessed Hadoop file system (HDFS) using Spark and managed data in Hadoop data lakes with Spark.
  • Handled structured data with Spark SQL to process in real time from Spark Structured Streaming.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs in Python
  • Supported the clusters and topics via Kafka Manager.
  • Configured Spark Streaming to receive real-time data to store in HDFS.
  • Developed ETL pipelines w/ Spark and Hive for business-specific transformations.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data. Created various types of data visualizations using Python and Tableau.
  • Used Avro for serializing and deserializing data, and for Kafka producer and consumer.
  • Played a key role in installation and configuration of the various Big Data ecosystem tools such as Elastic Search, Logstash, Kibana, Kafka and Cassandra.
  • Knowledge of setting up Kafka cluster.
  • Experience processing Avro data files using PySpark
  • Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for completeness, duplication, accuracy, and consistency
  • Optimized Spark jobs by migrating from the Spark RDD API to DataFrames.
  • Built a model of the data processing by using the PySpark programs for proof of concept.
  • Used Spark SQL to perform transformations and actions on data residing in Hive.
  • Responsible for designing and deploying new ELK clusters.
  • Implemented CI/CD tool upgrades, backups, and restores
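A minimal sketch of the data-quality checks described above (completeness and duplication) in plain Python; the field names are hypothetical, and a real script would also check accuracy and consistency against reference data:

```python
def data_quality_report(rows, required_fields, key_field):
    """Summarize completeness and duplication for a batch of records.

    rows: list of dicts, one per ingested record
    required_fields: fields that must be present and non-empty
    key_field: field expected to be unique across the batch
    """
    seen = set()
    duplicates = 0
    incomplete = 0
    for row in rows:
        key = row.get(key_field)
        if key in seen:
            duplicates += 1  # same key seen earlier in this batch
        seen.add(key)
        if any(not row.get(f) for f in required_fields):
            incomplete += 1  # missing or empty required field
    return {"total": len(rows), "duplicates": duplicates, "incomplete": incomplete}
```

In practice the same counts would be produced with Hive or SQL aggregates over the loaded table; this version is just the batch-script form.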

AMAZON AWS BIG DATA ENGINEER

Confidential, Austin, TX

Responsibilities:

  • Worked with Amazon Web Services (AWS), and cloud services such as EMR, EC2, S3, EBS and IAM entities, roles, and users.
  • Developed AWS Cloud Formation templates for RedShift.
  • Processed multiple terabytes of data stored in AWS using Elastic Map Reduce (EMR) to AWS Redshift.
  • Used Spark to build and process real-time data streams from Kafka producers
  • Used the Spark SQL module to store data into HDFS
  • Used Spark DataFrame API over Hortonworks platform to perform analytics on data.
  • Used Hive for queries and incremental imports, with Spark and Spark jobs for data processing and analytics
  • Implemented security measures AWS provides, employing key concepts of AWS Identity and Access Management (IAM).
  • Installed, configured, and managed AWS tools such as ELK and CloudWatch for resource monitoring.
  • Used AWS EMR to process big data across Hadoop clusters of virtual servers backed by Amazon Simple Storage Service (S3).
  • Launched and configured Amazon EC2 cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.
  • Implemented AWS Lambda functions to run scripts in response to events in Amazon DynamoDB tables or S3 buckets, or to HTTP requests via Amazon API Gateway.
  • Experienced with various Python packages such as pandas, NumPy, matplotlib, and Beautiful Soup.
  • Used AWS Kinesis for real-time data processing.
  • Implemented AWS IAM user roles and policies to authenticate and control access.
  • Specified nodes and performed data analysis queries on Amazon Redshift clusters on AWS.
  • Developed AWS Cloud Formation templates to create custom infrastructure of our pipeline.
  • Worked on AWS Kinesis to process huge amounts of real-time data.
  • Developed multiple Spark Streaming and batch Spark jobs using Scala and Python on AWS.
  • Strong technical skills in Python and good working knowledge of Scala.
  • Ingested data through AWS Kinesis Data Streams and Firehose from various sources to S3
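IAM access control of the kind mentioned above is commonly expressed as a policy document like the following (the bucket name is a placeholder, not from any actual environment):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PipelineS3ReadWrite",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::example-pipeline-bucket/*"
    }
  ]
}
```

Attaching a policy like this to a role, rather than embedding credentials, is the standard way EMR and Lambda workloads are granted S3 access.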

BIG DATA DEVELOPER

Confidential, Rockville, MD

Responsibilities:

  • Collected metrics for Hadoop clusters using Ambari & Cloudera Manager.
  • Implemented several highly distributed, scalable, large applications using Cloudera Hadoop.
  • Migrated streaming or static RDBMS data into the Hadoop cluster from dynamically generated files using Flume and Sqoop.
  • Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
  • Captured data and importing it to HDFS using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
  • Wrote streaming applications with Spark Streaming/Kafka.
  • Used the Spark SQL module to store data into HDFS
  • Configured Kafka broker for the Kafka cluster of the project and streamed the data to Spark for structured streaming to get structured data by schema
  • Identified and ingested source data from different systems into Hadoop HDFS using Sqoop, Flume, creating HBase tables to store variable data formats for data analytics.
  • Mapped to HBase tables and implemented SQL queries to retrieve data.
  • Streamed events from HBase to Solr using HBase Indexer.
  • Assisted in upgrading, configuration and maintenance of various Hadoop infrastructures like Pig, Hive, and HBase.
  • Imported data from different sources like HDFS/HBase into Spark RDDs.
  • Created UNIX shell scripts to automate the build process, and to perform regular jobs like file transfers between different hosts.
  • Implemented workflows using Apache Oozie framework to automate tasks.
  • Created Hive generic UDFs to process business Confidential.
  • Moved relational database data using Sqoop into Hive dynamic-partition tables via staging tables
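Loading a Sqoop-populated staging table into a dynamically partitioned Hive table, as in the last bullet, generally follows this pattern (table and column names are illustrative):

```sql
-- Enable dynamic partitioning, then let Hive derive the partition
-- value from the last column of the SELECT list.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE orders PARTITION (order_date)
SELECT order_id, customer_id, amount, order_date
FROM staging_orders;
```

The staging table absorbs Sqoop's flat import; the partitioned insert then lays the data out for efficient date-scoped queries.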

HADOOP ADMINISTRATOR

Confidential, Milwaukee, WI

Responsibilities:

  • Worked on Hortonworks Hadoop distributions
  • Worked with users to ensure efficient resource usage in the Hortonworks Hadoop clusters and alleviate multi-tenancy concerns.
  • Set-up Kerberos for more advanced security features for users and groups.
  • Implemented enterprise security measures on big data products, including HDFS encryption and Apache Ranger.
  • Managed and scheduled batch jobs on a Hadoop cluster using Oozie.
  • Used Spark to build and process real-time data streams from Kafka producers
  • Used Spark DataFrame API over Cloudera platform to perform analytics on data
  • Worked on Kafka cluster environment and zookeeper.
  • Monitored multiple Hadoop clusters environments using Ambari.
  • Experience in configuring, installing and managing Hortonworks (HDP) Distributions.
  • Involved in implementing security on HDP Hadoop Clusters with Kerberos for authentication and Ranger for authorization and LDAP integration for Ambari, Ranger
  • Secured the Kafka cluster with Kerberos.
  • Worked on tickets related to various Hadoop/Big data services which include HDFS, Yarn, Hive, Sqoop, Spark, Kafka, HBase, Kerberos, Ranger, Knox.
  • Set-up Hortonworks Infrastructure from configuring clusters to Node security using Kerberos.
  • Performed cluster maintenance and upgrades to ensure stable performance.
  • Defined data security standards and procedures in Hadoop using Apache Ranger and Kerberos.
  • Worked on Hortonworks Hadoop distributions (HDP 2.5)
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
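An Oozie workflow for scheduling ETL, as described above, can be sketched roughly as follows (the workflow name, script, and property names are hypothetical):

```xml
<workflow-app name="etl-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-etl"/>
  <action name="hive-etl">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>daily_etl.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

A coordinator definition would then run this workflow on a schedule; `${jobTracker}` and `${nameNode}` are supplied from the job properties file at submission time.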

Data Engineer

Confidential, Banbury, CT

Responsibilities:

  • Implemented enterprise security measures on big data products including HDFS encryption/Apache Ranger.
  • Worked with users to ensure efficient resource usage in the Hortonworks Hadoop clusters and alleviate multi-tenancy concerns.
  • Set-up Kerberos for more advanced security features for users and groups.
  • Managed and scheduled batch jobs on a Hadoop cluster using Oozie.
  • Used Spark to build and process real-time data streams from Kafka producers
  • Used Spark DataFrame API over Cloudera platform to perform analytics on data
  • Involved in implementing security on HDP Hadoop Clusters with Kerberos for authentication and Ranger for authorization and LDAP integration for Ambari, Ranger
  • Secured the Kafka cluster with Kerberos.
  • Created DTS packages to schedule jobs for batch processing.
  • Involved in performance tuning to optimize SQL queries using query analyzer.
  • Created indexes, Constraints and rules on database objects for optimization.
  • Worked on Kafka cluster environment and zookeeper.
  • Worked on tickets related to various Hadoop/Big data services which include HDFS, Yarn, Hive, Sqoop, Spark, Kafka, HBase, Kerberos, Ranger, Knox.
  • Set-up Hortonworks Infrastructure from configuring clusters to Node security using Kerberos.
  • Performed cluster maintenance and upgrades to ensure stable performance.
  • Worked on Hortonworks Hadoop distributions (HDP 2.5)
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Managed Hadoop clusters via the command line and the Hortonworks Ambari agent.
  • Defined data security standards and procedures in Hadoop using Apache Ranger and Kerberos.
  • Monitored multiple Hadoop clusters environments using Ambari.
  • Experience in configuring, installing and managing Hortonworks (HDP) Distributions.
