Sr. Big Data Engineer Resume

San Francisco, CA

SUMMARY

  • Big Data Engineer with 9+ years of experience in Big Data technologies and 11+ years of experience in IT.
  • Expertise in Hadoop, Spark, Big Data tools, multiple clouds (AWS, Azure, GCP), data warehousing using on-premises and cloud services, automation tools, and ETL process design.
  • Experience working with both SQL and NoSQL databases.
  • Add value to Agile/Scrum processes such as Sprint Planning, Backlog, Sprint Retrospective, and Requirements Gathering and provide planning and documentation for projects.
  • Create Spark Core ETL processes and automate them with a workflow scheduler.
  • Successfully worked with AWS services such as EMR, EC2, Lambda, S3, Glue, Redshift, and Kinesis.
  • Well versed in the Azure environment, e.g., ADLS Gen2, Blob Storage, ADF (Data Factory), Azure Databricks, Azure SQL, and Azure Synapse for analytics.
  • Proficient with HDFS, Spark, Hive, Sqoop, HBase, Flume, Oozie, and Zookeeper.
  • Experience in ecosystems like Hive, Sqoop, MapReduce, Flume, and Oozie.
  • Experience handling XML files as well as Avro and Parquet SerDes.
  • Performance tuning at the source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle.
  • Hands-on experience in text analytics, developing statistical machine learning and data mining solutions for various business problems, and generating data visualizations using R, Python, and Tableau.
  • Use Apache Hadoop to work with Big Data and analyze large data sets efficiently.
  • Hands-on experience developing PL/SQL Procedures and Functions and SQL tuning of large databases by creating tables, views, indexes, stored procedures, and functions.
  • Skilled at bucketing, partitioning, multithreaded computing, and streaming (Python, PySpark).
  • Developed and deployed complex ETL workflows using Apache Airflow and Jenkins to automate data processing and improve efficiency.
  • Excellent written and oral communication, interpersonal, and presentation skills. Able to perform at a high level, meet deadlines, adapt to ever-changing priorities, and manage projects.

TECHNICAL SKILLS

IDE: Workbench, Jupyter Notebooks, Eclipse, IntelliJ, PyCharm

PROJECT METHODS: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development

HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop, Hortonworks Hadoop

CLOUD PLATFORMS: Amazon AWS, Microsoft Azure

CLOUD SERVICES: Databricks, Snowflake

DATABASES AND DATA WAREHOUSES: SQL, Snowflake, MongoDB, Redshift, DynamoDB, Cassandra, HBase

PROGRAMMING LANGUAGES: Spark, Spark Streaming, Java, Python, Scala, PySpark, Django, Flask, .NET Core

SCRIPTING: Shell Scripting, Python

CONTINUOUS INTEGRATION (CI-CD): Jenkins, GitHub, Bitbucket, Jira

FILE FORMAT AND COMPRESSION: CSV, JSON, Avro, Parquet, ORC

FILE SYSTEMS: HDFS, Data Lake, S3

ETL TOOLS: Apache NiFi, Flume, Kafka, Talend, Sqoop, Oozie

DATA VISUALIZATION TOOLS: Tableau, Kibana, Python

SECURITY: Kerberos, Ranger

AWS: AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM

PROFESSIONAL EXPERIENCE

SR. BIG DATA ENGINEER

Confidential - San Francisco, CA

Responsibilities:

  • Designs and implements monitoring systems for various components that raise failure and warning alerts, using Serverless, Python, and AWS Lambda among other tools.
  • Liaises with product users to understand their needs.
  • Designs and implements ETL-based solutions, from the data ingestion stage to the presentation of useful information to users.
  • Creates and maintains objects used to manage data.
  • Creates front-end interfaces and applications that present information to users.
  • Writes SQL and Spark queries for data transformations.
  • Works with other developers in pipeline hardening to strengthen and improve existing resources.
  • Resolves issues raised by users of the platform's various resources.
  • Maintains and supports the general ETL process and data pipelines.
  • Communicates with team on steps and resources needed for development.
  • Helps maintain data quality within the pipelines and automates data integrity checks.
  • Assists in knowledge transfer and training of the new on-site team.
  • Debugs code and improves the codebase.
  • Participates in PR review.
  • Trains new team members.
  • Writes documentation for various features and processes used in the project.
  • Leads and participates in whiteboarding sessions for ideas that could bring in new features or improve existing ones.

CLOUD DATA ENGINEER

Confidential - San Jose, CA

Responsibilities:

  • Populated a data lake on AWS S3 from various data sources using AWS Kinesis.
  • Used Amazon EMR to process Big Data with tools such as Hadoop, Spark, and Hive.
  • Authored AWS Lambda functions to run Python scripts in response to events in S3.
  • Created AWS CloudFormation templates to provision infrastructure in the cloud.
  • Created and implemented AWS IAM user roles and policies to authenticate and control access.
  • Implemented optimizations in Spark nodes and improved the performance of the Spark Cluster.
  • Processed multiple terabytes of data stored in S3 using AWS Redshift and AWS Athena.
  • Designed, built, and maintained a database to analyze the life cycle of checking transactions.
  • Developed ETL jobs in AWS Glue to extract data from S3 buckets and loaded it into the data mart in Amazon Redshift.
  • Implemented and maintained an EMR and Redshift pipeline for data warehousing.
  • Used Spark, Spark SQL, and Spark Streaming for data analysis and processing.
  • Worked under an agile methodology and contributed to the creation of user stories.
  • Drove the conversation with non-technical stakeholders and worked closely with an offshore team.

BIG DATA ENGINEER

Confidential - Los Angeles, GA

Responsibilities:

  • Implemented Spark jobs and used Spark SQL for optimized analysis and data processing.
  • Created Spark Streaming jobs to process real-time data via Kafka and store it to HDFS.
  • Loaded data from multiple AWS services to AWS S3 buckets and configured bucket permissions using IAM roles.
  • Used Apache Kafka to ingest and process data in real-time.
  • Implemented partitioning, dynamic partitions, and bucketing in Hive for performance optimization.
  • Maintained a Hadoop cluster and analyzed log files for error handling and access statistics to support fine-tuning.
  • Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.
  • Authored distributed, well-documented, and testable code in Python and PySpark.
  • Created complex SQL queries for data aggregation and analysis.
  • Designed fine-tuning techniques for Spark jobs running in AWS EMR clusters.
  • Implemented EMR auto-scaling policies to improve the performance of the queries in the cluster.
  • Created Hive external tables from Parquet files stored in the data lake in S3.
  • Attended daily Scrum calls and contributed to the creation of user stories.

BIG DATA ENGINEER

Confidential - Hartford, CT

Responsibilities:

  • Ingested data from multiple sources into the HDFS data lake.
  • Created and analyzed Python code for unit testing and data validation.
  • Pre-processed data to make it available and reliable for the end-users.
  • Built Spark code in Python to import data from Parquet files and various database engines.
  • Created Spark jobs and HiveQL queries to pull data from the database and manipulate it.
  • Developed Python scripts for data ingestion and performed the ETL and processing phases using Apache Hive, PySpark, and Spark SQL scripting.
  • Applied data processing techniques such as joins, aggregations, and map-reduce operations using Spark in Python.
  • Authored Spark scripts for data processing and the creation of automated reports.
  • Configured a full Kafka cluster.
  • Created and managed topics in Kafka.
  • Configured the replication factor on topic partitions.
  • Managed consumer groups in Kafka.
  • Ingested data from Spark into HBase.

HADOOP DEVELOPER

Confidential - Chicago, IL

Responsibilities:

  • Cleansed and preprocessed data by implementing MapReduce jobs in a multi-node Hadoop cluster.
  • Performed aggregation functions with SQL.
  • Migrated applications from MapReduce to Spark to improve performance.
  • Created a benchmark between Cassandra and HBase for fast ingestion.
  • Processed terabytes of data in real time using Spark Streaming.
  • Loaded data from legacy warehouses onto HDFS using Sqoop.
  • Built a data pipeline using MapReduce scripts and Hadoop commands to store data onto HDFS.
  • Used Oozie to orchestrate the MapReduce jobs that extract the data on time.
  • Created ELT jobs using Hive, Spark, and Pig to store data onto HDFS.
  • Built graphs and plots using Python libraries such as pyplot to visualize data.
  • Performed analytics and produced recommendations for the business using Hive and Python scripts for ETL processing.
  • Developed and deployed spark-submit commands with suitable executor, core, and driver settings on the cluster.
  • Applied transformations such as filters and aggregations to the data using Spark in Java.

DATABASE ADMINISTRATOR

Confidential - Redmond, WA

Responsibilities:

  • Created and maintained databases in SQL Server 2010.
  • Designed and established SQL applications.
  • Created tables and views in the SQL database.
  • Supported schema changes and maintained the database to perform in optimal conditions.
  • Created and managed tables, views, user permissions, and access control.
  • Sent requests to the source REST-based API from a Scala script via a Kafka producer.
  • Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
  • Wrote Hive queries in Hive Query Language to analyze data in the Hive warehouse.
  • Built Hive views on top of the source data tables and built secured provisioning.
  • Created and managed dynamic web parts.
  • Customized library attributes, imported and exported existing data, and configured data connections.
  • Provided a workflow and initiated the workflow processes.
  • Worked on SharePoint Designer and InfoPath Designer and developed workflows and forms.
