Sr. Big Data Engineer Resume
San Francisco, CA
SUMMARY
- Big Data Engineer with 9+ years of experience in Big Data technologies and 11+ years of experience in IT.
- Expertise in Hadoop, Spark, Big Data tools, various clouds (AWS, Azure, GCP), data warehousing using on-premises and cloud services, automation tools, and the ETL design process.
- Experience working with both SQL and NoSQL databases.
- Add value to Agile/Scrum processes such as Sprint Planning, Backlog, Sprint Retrospective, and Requirements Gathering and provide planning and documentation for projects.
- Create Spark Core ETL processes and automate them using a workflow scheduler.
- Successfully worked with AWS services such as EMR, EC2, Lambda, S3, Glue, Redshift, and Kinesis.
- Well versed in the Azure environment, e.g., ADL Gen2, Blob Storage, ADF (Data Factory), Azure Databricks, Azure SQL, and Azure Synapse for analytics.
- Proficient with HDFS, Spark, Hive, Sqoop, HBase, Flume, Oozie, and Zookeeper.
- Experience in ecosystems like Hive, Sqoop, MapReduce, Flume, and Oozie.
- Experience handling XML files as well as Avro and Parquet SerDes.
- Performance tuning at source, target, and DataStage job levels using indexes, hints, and partitioning in DB2 and Oracle.
- Hands-on experience in text analytics, developing statistical machine learning and data mining solutions for various business problems, and generating data visualizations using R, Python, and Tableau.
- Use Apache Hadoop to work with Big Data and analyze large data sets efficiently.
- Hands-on experience developing PL/SQL Procedures and Functions and SQL tuning of large databases by creating tables, views, indexes, stored procedures, and functions.
- Skilled at bucketing, partitioning, multithreaded computing, and streaming (Python, PySpark).
- Developed and deployed complex ETL workflows using Apache Airflow and Jenkins to automate data processing and improve efficiency.
- Excellent written, oral, interpersonal, and presentation skills. Able to perform at a high level, meet deadlines, adapt to ever-changing priorities, and manage projects.
TECHNICAL SKILLS
IDE: Workbench, Jupyter Notebooks, Eclipse, IntelliJ, PyCharm
PROJECT METHODS: Agile, Kanban, Scrum, DevOps, Continuous Integration, Test-Driven Development
HADOOP DISTRIBUTIONS: Hadoop, Cloudera Hadoop, Hortonworks Hadoop
CLOUD PLATFORMS: Amazon AWS, Microsoft Azure
CLOUD SERVICES: Databricks, Snowflake
DATABASES AND DATA WAREHOUSES: SQL, Snowflake, MongoDB, Redshift, DynamoDB, Cassandra, HBase
PROGRAMMING LANGUAGES: Spark, Spark Streaming, Java, Python, Scala, PySpark, Django, Flask, .NET Core
SCRIPTING: Shell Scripting, Python
CONTINUOUS INTEGRATION (CI/CD): Jenkins, GitHub, Bitbucket, Jira
FILE FORMAT AND COMPRESSION: CSV, JSON, Avro, Parquet, ORC
FILE SYSTEMS: HDFS, Data Lake, S3
ETL TOOLS: Apache NiFi, Flume, Kafka, Talend, Sqoop, Oozie
DATA VISUALIZATION TOOLS: Tableau, Kibana, Python
SECURITY: Kerberos, Ranger
AWS: AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM
PROFESSIONAL EXPERIENCE
SR. BIG DATA ENGINEER
Confidential - San Francisco, CA
Responsibilities:
- Designs and implements monitoring systems for various components that raise failure and warning alerts, using serverless tooling, Python, and AWS Lambda, among other tools.
- Liaises with product users to understand their needs.
- Designs and implements ETL-based solutions, from the data ingestion stage to the presentation of useful information to users.
- Creates and maintains objects used to manage data.
- Creates front-end interfaces and applications that present information to users.
- Writes SQL and Spark queries for data transformations.
- Works with other developers in pipeline hardening to strengthen and improve existing resources.
- Resolves issues raised by users who rely on various resources in the platform.
- Maintains and supports general ETL process and data pipelines.
- Communicates with team on steps and resources needed for development.
- Helps maintain data quality within the pipelines and automates data integrity checks.
- Assists in knowledge transfer and training of the new on-site team.
- Debugs code and improves the codebase.
- Participates in PR reviews.
- Trains new team members.
- Writes documentation for various features and processes used in the project.
- Leads and participates in whiteboarding sessions for ideas that could bring in new features or improve existing ones.
CLOUD DATA ENGINEER
Confidential - San Jose, CA
Responsibilities:
- Populated a data lake on AWS S3 from various data sources using AWS Kinesis.
- Used Amazon EMR to process Big Data with tools such as Hadoop, Spark, and Hive.
- Authored AWS Lambda functions to run Python scripts in response to events in S3.
- Created AWS CloudFormation templates to provision infrastructure in the cloud.
- Created and implemented AWS IAM user roles and policies to authenticate and control access.
- Implemented optimizations in Spark nodes and improved the performance of the Spark Cluster.
- Processed multiple terabytes of data stored in S3 using AWS Redshift and AWS Athena.
- Designed, built, and maintained a database to analyze the life cycle of checking transactions.
- Developed ETL jobs in AWS Glue to extract data from S3 buckets and load it into the data mart in Amazon Redshift.
- Implemented and maintained an EMR and Redshift pipeline for data warehousing.
- Used Spark, Spark SQL, and Spark Streaming for data analysis and processing.
- Worked under an agile methodology and contributed to the creation of user stories.
- Drove conversations with non-technical stakeholders and worked closely with an offshore team.
BIG DATA ENGINEER
Confidential - Los Angeles, CA
Responsibilities:
- Implemented Spark jobs and used Spark SQL for optimized analysis and data processing.
- Created Spark Streaming jobs to process real-time data via Kafka and store it to HDFS.
- Loaded data from multiple AWS services to AWS S3 buckets and configured bucket permissions using IAM roles.
- Used Apache Kafka to ingest and process data in real-time.
- Implemented partitioning, dynamic partitions, and bucketing in Hive for performance optimization.
- Maintained a Hadoop cluster and performed log file analysis for error handling and access statistics for fine-tuning.
- Optimized data storage in Kafka Brokers within the Kafka cluster by partitioning Kafka Topics.
- Authored distributed, well-documented, and testable code in Python and PySpark.
- Created complex SQL queries for data aggregation and analysis.
- Designed fine-tuning techniques for Spark jobs running in AWS EMR clusters.
- Implemented EMR auto-scaling policies to improve the performance of the queries in the cluster.
- Created Hive external tables from Parquet files stored in the data lake in S3.
- Attended daily Scrum calls and contributed to the creation of user stories.
BIG DATA ENGINEER
Confidential - Hartford, CT
Responsibilities:
- Ingested data from multiple sources into the HDFS data lake.
- Created and analyzed Python code for unit testing and data validation.
- Pre-processed data to make it available and reliable for the end-users.
- Built Spark code in Python to import data from Parquet files and various other database engines.
- Created Spark jobs and HiveQL queries to pull data from the database and manipulate it.
- Developed Python scripts for data ingestion and performed the ETL and processing phases using Apache Hive, PySpark, and Spark SQL.
- Performed different data processing techniques like joins, aggregates, and map-reduce using Spark in Python.
- Authored Spark scripts for data processing and the creation of automated reports.
- Configured a full Kafka cluster.
- Created and managed topics in Kafka.
- Configured the replication factor on topic partitions.
- Managed consumer groups in Kafka.
- Ingested data from Spark into HBase.
HADOOP DEVELOPER
Confidential - Chicago, IL
Responsibilities:
- Cleansed and preprocessed data by implementing MapReduce jobs in a multi-node Hadoop cluster.
- Performed aggregation functions with SQL.
- Migrated applications from MapReduce to Spark to improve performance.
- Created a benchmark between Cassandra and HBase for fast ingestion.
- Processed terabytes of information in real time using Spark Streaming.
- Loaded data from legacy warehouses onto HDFS using Sqoop.
- Built a data pipeline using MapReduce scripts and Hadoop commands to store data onto HDFS.
- Used Oozie to orchestrate the MapReduce jobs that extract the data on time.
- Created ELT jobs using Hive, Spark, and Pig to store data onto HDFS.
- Built graphs and plots using Python libraries such as pyplot to visualize data.
- Performed analytics and produced recommendations for the business using Hive and Python scripts for ETL processing.
- Developed and deployed spark-submit commands with suitable executor, core, and driver settings on the cluster.
- Applied transformations such as filters and aggregations to the data using Spark in Java.
DATABASE ADMINISTRATOR
Confidential - Redmond, WA
Responsibilities:
- Created and maintained databases in SQL Server 2010.
- Designed and established SQL applications.
- Created tables and views in the SQL database.
- Supported schema changes and maintained the database to perform in optimal conditions.
- Created and managed tables, views, user permissions, and access control.
- Sent requests to a source REST-based API from a Scala script via a Kafka producer.
- Utilized a cluster of multiple Kafka brokers to handle replication needs and allow for fault tolerance.
- Wrote Hive queries to analyze data in the Hive warehouse using Hive Query Language (HiveQL).
- Built Hive views on top of the source data tables and built secured provisioning.
- Created and managed dynamic web parts.
- Customized library attributes, imported and exported existing data, and configured data connections.
- Provided a workflow and initiated the workflow processes.
- Worked on SharePoint Designer and InfoPath Designer and developed workflows and forms.
