Big Data Engineer Resume
Cary, NC
SUMMARY
- 7+ years of experience architecting, designing, and building data platforms using Spark, ETL tools, open-source technologies, and AWS components.
- Expert knowledge in building data pipelines by analyzing business, system, and functional requirements.
- Extensive experience building real-time and batch pipelines using AWS and container services.
- Used container technologies such as Docker and Kubernetes.
- Worked with orchestration tools such as Cron, Apache Airflow, and AWS Step Functions.
- Fluent in writing production-quality code in Python, Shell, SQL, and Spark SQL.
- Experience with the big data ecosystem and tools: Hadoop, Spark, Hive, Kafka, Airflow, and dbt.
- Knowledge of machine learning tools such as NumPy, Pandas, Spark SQL, and Spark MLlib.
- Experience with databases: Oracle, MySQL, Snowflake, RDS, Redshift, and SAP HANA.
- Experience with AWS components: IAM, S3, EC2, Glue, Lambda, EMR, Kinesis, SQS, SNS, CloudWatch, Athena, RDS, Redshift, and CloudFormation.
- Strong expertise in designing and implementing ETL processes using Glue, EMR, dbt, and Spark.
- Migrated an existing on-premises application to AWS, using EC2 and S3 to process and store small data sets; experienced in maintaining Hadoop clusters on AWS EMR.
- Extensive experience using Jenkins and Terraform.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, including creating and handling DataFrames in Spark with Scala.
- Expertise in querying and analyzing data to make recommendations and build dashboards using Tableau, Power BI, and QuickSight.
- Experience optimizing SQL queries and improving code performance using Spark DataFrames, RDDs, and other Spark components.
- Hands-on experience with machine learning concepts such as supervised and unsupervised learning.
- Knowledge of monitoring and alerting tools such as Datadog and Opsgenie.
- Collaborated with team members on code reviews; strong track record of contributing to end-to-end data solutions based on stakeholder needs.
- Expertise in Git and branching strategies; good knowledge of continuous integration tools such as Jenkins.
- Experience with Agile/Scrum development and project-tracking tools such as Jira.
TECHNICAL SKILLS
Programming languages: SQL, Python, PySpark, Shell (Linux)
Cloud: AWS
Container Services: Docker and Kubernetes
Orchestration tools: Cron, Airflow, Step Functions
Data Services: AWS S3, AWS Redshift, AWS RDS, BigQuery, Snowflake
PROFESSIONAL EXPERIENCE
Confidential, Cary, NC
Big Data Engineer
Responsibilities:
- Designed, developed, and deployed batch and streaming pipelines using AWS services.
- Developed data pipelines using cloud and container services such as Docker and Kubernetes.
- Developed high-performance data pipelines using AWS Glue and PySpark jobs on EMR clusters.
- Developed multiple ETL pipelines to deliver data to stakeholders.
- Designed and developed monitoring solution using AWS services like AWS CloudWatch, AWS IAM, AWS Glue and AWS QuickSight.
- Used AWS services such as Lambda, Glue, EMR, EC2, and EKS for data processing.
- Used Spark and Kafka for building batch and streaming pipelines.
- Developed data marts, data lakes, and data warehouses using AWS services.
- Extensive experience using AWS storage and querying tools such as AWS S3, AWS RDS, and AWS Redshift.
- Evaluated and implemented a next-generation AWS serverless architecture.
- Developed UDFs to standardize the entire dataset.
- Worked on end-to-end development of an ingestion framework using AWS Glue, IAM, CloudFormation, Athena, and REST APIs.
- Worked on the core masking logic and built the masking utility using SHA-2 hashing (a minimal sketch follows this list).
- Used Redshift to store the hashed and unhashed values for each PII attribute and to map users to their email IDs or their unique Oxygen IDs.
- Converted the existing datasets into GDPR-compliant datasets and pushed them to production.
- Worked on successfully rolling out the change to users in production so the company is fully GDPR compliant.
- Provided architectural guidance, validation, and implementation.
- Wrote pipelines for automated hashing and unhashing of datasets in business-critical scenarios.
- Actively responsible for onboarding new technologies by evaluating tools and components against functional and cost requirements.
- Created POC projects in dbt and wrote macros to calculate formula fields for Salesforce data.
- Improved operational efficiency by partnering with business partners and data scientists.
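A minimal sketch of the SHA-2 masking approach in PySpark is shown below; the bucket paths and column names are hypothetical placeholders, and the production utility additionally stored the hash-to-value mappings in Redshift.

```python
# Minimal sketch of a SHA-2 (SHA-256) masking UDF in PySpark.
# Bucket paths and column names are illustrative, not the production schema.
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pii-masking").getOrCreate()

def sha2_mask(value: str) -> str:
    """Return a SHA-256 hex digest for a PII value; pass nulls through unchanged."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

mask_udf = udf(sha2_mask, StringType())

# Read raw data, mask the PII column, and write the GDPR-compliant copy back to S3.
df = spark.read.parquet("s3://example-bucket/raw/users/")        # hypothetical path
masked = df.withColumn("email_hash", mask_udf(df["email"]))
masked.drop("email").write.mode("overwrite").parquet(
    "s3://example-bucket/masked/users/"                          # hypothetical path
)
```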
Confidential, Milwaukee, WI
Big Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services.
- Designed and implemented streaming and batch pipelines using Confluent Cloud Kafka for incremental jobs that read data from RDBMS sources, transformed batch data using PySpark, loaded it into AWS Redshift tables, and generated interactive reports in AWS QuickSight dashboards.
- Created views in AWS Athena to allow secure and streamlined data analysis access to downstream business teams.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats; integrated Kafka with Spark Streaming; and developed data analysis tools using SQL and Python.
- Involved in creating external AWS Redshift tables over files stored in AWS S3.
- Extracted and generated data into CSV files using AWS EC2, stored them in AWS S3, and then structured and loaded them into AWS Redshift.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks; migrated data from on-premises systems to AWS storage buckets.
- Developed various mappings covering all sources, targets, and transformations using Informatica Designer.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Designed and implemented CDC data pipelines using Kafka Connect source and sink connectors for incremental jobs that read data from databases and load it into AWS Redshift tables, connected to QuickSight dashboards for interactive reporting.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
- Responsible for loading processed data into AWS Redshift tables so the business reporting team could build dashboards.
- Used Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Python (a minimal sketch follows this list).
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling with Erwin.
- Automated data processing with Apache Airflow, including automated data loading into the Hadoop Distributed File System.
- Built machine learning models including SVM, random forest, and XGBoost with Python scikit-learn to score and identify potential new business cases.
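A minimal sketch of the Kafka-to-HDFS flow using Spark Structured Streaming is shown below; the broker address, topic name, schema, and output paths are illustrative assumptions rather than the actual production configuration.

```python
# Minimal sketch of a Kafka-to-HDFS streaming job with Spark Structured Streaming.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream and parse the JSON payload into columns.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # hypothetical broker
    .option("subscribe", "user-events")                    # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Persist the parsed events to HDFS as Parquet, with checkpointing for recovery.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/")                # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```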
Confidential, Chicago, IL
Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Designed several DAGs (directed acyclic graphs) to automate ETL pipelines (a minimal sketch follows this list).
- Performed data extraction, transformation, loading, and integration in data warehouses, operational data stores, and master data management systems.
- Extensively worked with AWS components such as S3, Lambda, EC2, EMR, Athena, Glue, RDS, Redshift, IAM, and Kinesis Firehose.
- Created YAML files for each data source, including Glue table stack creation.
- Worked on a Python script to extract data from MySQL and Oracle databases and transfer it to AWS S3.
- Developed Lambda functions with assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Set up high availability for the major production cluster and designed automatic failover control using ZooKeeper and quorum journal nodes.
- Provided troubleshooting and best-practice guidance for development teams, including process automation and new application onboarding.
- Produced unit tests for Spark transformations and helper methods, and designed data processing pipelines.
- Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines.
- Created several types of data visualizations using Python and Tableau; extracted large datasets from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models; developed a data pipeline using Kafka to store data in AWS S3 and AWS Redshift.
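A minimal sketch of one such Airflow DAG is shown below; the DAG ID, task names, and helper callables are hypothetical placeholders rather than the actual production pipeline.

```python
# Minimal sketch of an Airflow DAG for a daily ETL pipeline.
# Task names and helper callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3(**context):
    """Pull data from the source databases and land it in S3 (placeholder)."""
    ...

def load_to_redshift(**context):
    """Copy the landed files from S3 into Redshift tables (placeholder)."""
    ...

with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Run the extract step before the load step.
    extract >> load
```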
Confidential
Data Engineer
Responsibilities:
- Imported legacy data from SQL Server and MySQL into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported data into Snowflake by creating staging tables to load files of different formats from Amazon S3 (a minimal sketch follows this list).
- As part of the data migration, wrote many SQL scripts to reconcile data mismatches and worked on loading historical data from Teradata to Snowflake.
- Worked on retrieving data from HDFS to AWS S3 using Spark commands.
- Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Created performance dashboards in Tableau for key stakeholders.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Understood business requirements thoroughly and developed a test strategy based on business rules.
- Implemented a defect-tracking process using Jira by assigning bugs to the development team.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
- Generated custom SQL to verify dependencies for daily, weekly, and monthly jobs.
- Experienced in working with the Spark ecosystem using Spark SQL.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
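A minimal sketch of the S3-to-Snowflake staging load is shown below, using the snowflake-connector-python library; the connection parameters, external stage, table name, and file format are illustrative assumptions.

```python
# Minimal sketch of loading files from Amazon S3 into a Snowflake staging table.
# Connection parameters, stage, and table names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # hypothetical account identifier
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Copy CSV files from an external S3 stage into the staging table.
    cur.execute("""
        COPY INTO STAGING.ORDERS_STG
        FROM @S3_LANDING_STAGE/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()
```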