Big Data Engineer Resume
Cary, NC
SUMMARY
- 7+ years of experience architecting, designing, and building data platforms using Spark, ETL tools, open-source technologies, and AWS components.
- Expert knowledge in building data pipelines by analyzing business, system, and functional requirements.
- Extensive experience building real-time and batch pipelines using AWS and container services.
- Used container technologies such as Docker and Kubernetes.
- Worked with orchestration tools such as Cron, Apache Airflow, and AWS Step Functions.
- Fluent in writing production-quality code in Python, Shell, SQL, and Spark SQL.
- Experience with the big data ecosystem and tools: Hadoop, Spark, Hive, Kafka, Airflow, and dbt.
- Knowledge of machine learning tools such as NumPy, Pandas, Spark SQL, and Spark MLlib.
- Experience with databases: Oracle, MySQL, Snowflake, RDS, Redshift, and SAP HANA.
- Experience with AWS components: IAM, S3, EC2, Glue, Lambda, EMR, Kinesis, SQS, SNS, CloudWatch, Athena, RDS, Redshift, and CloudFormation.
- Strong expertise in designing and implementing ETL processes using Glue, EMR, dbt, and Spark.
- Migrated an existing on-premises application to AWS, using EC2 and S3 to process and store small data sets; experienced in maintaining Hadoop clusters on AWS EMR.
- Extensive experience using Jenkins and Terraform.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, including creating and handling DataFrames in Spark with Scala.
- Expertise in querying and analyzing data to make recommendations and build dashboards using Tableau, Power BI, and QuickSight.
- Experience optimizing SQL queries and improving code performance using Spark DataFrames, RDDs, and other Spark components.
- Hands-on experience with machine learning concepts such as supervised and unsupervised learning.
- Knowledge of monitoring and alerting tools such as Datadog and Opsgenie.
- Collaborated with team members on code reviews; strong track record of contributing to end-to-end data solutions based on stakeholder needs.
- Expertise in Git and branching strategies; good knowledge of continuous integration tools such as Jenkins.
- Experience with Agile/Scrum development and project-tracking tools such as Jira.
TECHNICAL SKILLS
Programming languages: SQL, Python, PySpark, Shell (Linux)
Cloud: AWS
Container Services: Docker and Kubernetes
Orchestration tools: Cron, Airflow, Step Functions
Data Services: AWS S3, AWS Redshift, AWS RDS, BigQuery, Snowflake
PROFESSIONAL EXPERIENCE
Confidential, Cary, NC
Big Data Engineer
Responsibilities:
- Designed, developed, and deployed batch and streaming pipelines using AWS services.
- Developed data pipelines using cloud and container services such as Docker and Kubernetes.
- Developed high-performance data pipelines using AWS Glue and PySpark jobs on EMR clusters.
- Developed multiple ETL pipelines to deliver data to stakeholders.
- Designed and developed monitoring solution using AWS services like AWS CloudWatch, AWS IAM, AWS Glue and AWS QuickSight.
- Used AWS services such as Lambda, Glue, EMR, EC2, and EKS for data processing.
- Used Spark and Kafka for building batch and streaming pipelines.
- Developed data marts, data lakes, and data warehouses using AWS services.
- Extensive experience using AWS storage and querying tools such as AWS S3, AWS RDS, and AWS Redshift.
- Evaluated and implemented a next-generation AWS serverless architecture.
- Developed UDFs to standardize the entire dataset.
- Worked on end-to-end development of an ingestion framework using AWS Glue, IAM, CloudFormation, Athena, and REST APIs.
- Worked on the core masking logic and built the masking utility using SHA-2 hashing (a minimal sketch follows this list).
- Used Redshift to store the hashed and unhashed values for each PII attribute and to map users to their email IDs or their unique Oxygen IDs.
- Converted the existing datasets into GDPR-compliant datasets and pushed them to production.
- Worked on successfully rolling out the change to users in production so the company is fully GDPR compliant.
- Provided architectural guidance, validation, and implementation.
- Wrote pipelines for automated hashing and unhashing of datasets in business-critical scenarios.
- Actively responsible for onboarding new technologies by evaluating tools and components against functional and cost requirements.
- Created POC projects in dbt and wrote macros to calculate formula fields for Salesforce data.
- Improved operational efficiency by partnering with business partners and data scientists.
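A minimal sketch of the SHA-2 masking approach in PySpark is shown below; the bucket paths and column names are hypothetical placeholders, and the production utility additionally stored the hash-to-value mappings in Redshift.

```python
# Minimal sketch of a SHA-2 (SHA-256) masking UDF in PySpark.
# Bucket paths and column names are illustrative, not the production schema.
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pii-masking").getOrCreate()

def sha2_mask(value: str) -> str:
    """Return a SHA-256 hex digest for a PII value; pass nulls through unchanged."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

mask_udf = udf(sha2_mask, StringType())

# Read raw data, mask the PII column, and write the GDPR-compliant copy back to S3.
df = spark.read.parquet("s3://example-bucket/raw/users/")        # hypothetical path
masked = df.withColumn("email_hash", mask_udf(df["email"]))
masked.drop("email").write.mode("overwrite").parquet(
    "s3://example-bucket/masked/users/"                          # hypothetical path
)
```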
Confidential, Milwaukee, WI
Big Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services.
- Designed and implemented streaming and batch pipelines using Confluent Cloud Kafka for incremental jobs that read data from RDBMS sources, transformed batch data using PySpark, loaded it into AWS Redshift tables, and generated interactive reports in AWS QuickSight dashboards.
- Created views in AWS Athena to allow secure and streamlined data analysis access to downstream business teams.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats; integrated Kafka with Spark Streaming; and developed data analysis tools using SQL and Python.
- Involved in creating external AWS Redshift tables over files stored in AWS S3.
- Extracted and generated data into CSV files using AWS EC2, stored them in AWS S3, and then structured and loaded them into AWS Redshift.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks; migrated data from on-premises systems to AWS storage buckets.
- Developed various mappings covering all sources, targets, and transformations using Informatica Designer.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Designed and implemented CDC data pipelines using Kafka Connect source and sink connectors for incremental jobs that read data from databases and load it into AWS Redshift tables, connected to QuickSight dashboards for interactive reporting.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats.
- Responsible for loading processed data into AWS Redshift tables so the business reporting team could build dashboards.
- Used Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Python (a minimal sketch follows this list).
- Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical, and physical data modeling with Erwin.
- Automated data processing with Apache Airflow, including automated data loading into the Hadoop Distributed File System.
- Built machine learning models including SVM, random forest, and XGBoost with Python scikit-learn to score and identify potential new business cases.
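A minimal sketch of the Kafka-to-HDFS flow using Spark Structured Streaming is shown below; the broker address, topic name, schema, and output paths are illustrative assumptions rather than the actual production configuration.

```python
# Minimal sketch of a Kafka-to-HDFS streaming job with Spark Structured Streaming.
# Broker, topic, schema, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw Kafka stream and parse the JSON payload into columns.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")   # hypothetical broker
    .option("subscribe", "user-events")                    # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Persist the parsed events to HDFS as Parquet, with checkpointing for recovery.
query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events/")                # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```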
Confidential, Chicago, IL
Big Data Engineer
Responsibilities:
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Designed several DAGs (directed acyclic graphs) to automate ETL pipelines (a minimal sketch follows this list).
- Performed data extraction, transformation, loading, and integration in data warehouses, operational data stores, and master data management systems.
- Extensively worked with AWS components such as S3, Lambda, EC2, EMR, Athena, Glue, RDS, Redshift, IAM, and Kinesis Firehose.
- Created YAML files for each data source, including Glue table stack creation.
- Worked on a Python script to extract data from MySQL and Oracle databases and transfer it to AWS S3.
- Developed Lambda functions with assigned IAM roles to run Python scripts, along with various triggers (SQS, EventBridge, SNS).
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
- Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean, consistent data.
- Set up high availability for the major production cluster and designed automatic failover control using ZooKeeper and quorum journal nodes.
- Provided troubleshooting and best-practice guidance for development teams, including process automation and new application onboarding.
- Produced unit tests for Spark transformations and helper methods, and designed data processing pipelines.
- Leveraged cloud and GPU computing technologies such as AWS and GCP for automated machine learning and analytics pipelines.
- Created several types of data visualizations using Python and Tableau; extracted large datasets from AWS using SQL queries to create reports.
- Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
- Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models; developed a data pipeline using Kafka to store data in AWS S3 and AWS Redshift.
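A minimal sketch of one such Airflow DAG is shown below; the DAG ID, task names, and helper callables are hypothetical placeholders rather than the actual production pipeline.

```python
# Minimal sketch of an Airflow DAG for a daily ETL pipeline.
# Task names and helper callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3(**context):
    """Pull data from the source databases and land it in S3 (placeholder)."""
    ...

def load_to_redshift(**context):
    """Copy the landed files from S3 into Redshift tables (placeholder)."""
    ...

with DAG(
    dag_id="daily_etl_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    # Run the extract step before the load step.
    extract >> load
```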
Confidential
Data Engineer
Responsibilities:
- Imported legacy data from SQL Server and MySQL into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported data into Snowflake by creating staging tables to load files of different formats from Amazon S3 (a minimal sketch follows this list).
- As part of the data migration, wrote many SQL scripts to reconcile data mismatches and worked on loading historical data from Teradata to Snowflake.
- Worked on retrieving data from HDFS to AWS S3 using Spark commands.
- Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
- Created performance dashboards in Tableau for key stakeholders.
- Worked with stakeholders to communicate campaign results, strategy, issues or needs.
- Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
- Understood business requirements thoroughly and developed a test strategy based on business rules.
- Implemented a defect-tracking process using Jira by assigning bugs to the development team.
- Involved in functional, integration, regression, smoke, and performance testing; tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
- Created metric tables and end-user views in Snowflake to feed data for Tableau refreshes.
- Generated custom SQL to verify dependencies for daily, weekly, and monthly jobs.
- Experienced in working with the Spark ecosystem using Spark SQL.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions/postconditions based on requirements.
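A minimal sketch of the S3-to-Snowflake staging load is shown below, using the snowflake-connector-python library; the connection parameters, external stage, table name, and file format are illustrative assumptions.

```python
# Minimal sketch of loading files from Amazon S3 into a Snowflake staging table.
# Connection parameters, stage, and table names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # hypothetical account identifier
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()
    # Copy CSV files from an external S3 stage into the staging table.
    cur.execute("""
        COPY INTO STAGING.ORDERS_STG
        FROM @S3_LANDING_STAGE/orders/
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
finally:
    conn.close()
```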