Data Engineer Resume
Dallas, TX
SUMMARY
- 8+ years of experience in systems analysis, design, and development in Data Warehousing, AWS Cloud Data Engineering, Data Visualization, Reporting, and Data Quality Solutions.
- Good experience with Amazon Web Services including S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other AWS services.
- Worked on Snowflake access controls and architecture design, implementing authentication standards and security features for Snowflake.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
- Generated AWS Glue scripts to transfer data and used AWS Glue to run ETL jobs and aggregations in PySpark code.
- Knowledge of Amazon EC2, S3, VPC, RDS, Elastic Load Balancing, Auto Scaling, IAM, SQS, SWF, SNS, Security Groups, Lambda, and CloudWatch services.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java applications to handle data in MongoDB and HBase, and used Apache Phoenix to provide a SQL layer on HBase.
- Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
- Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and migrating and transforming data from Oracle, Access, and Excel sheets using SQL Server SSIS.
- Wrote AWS Lambda functions in Python that invoke scripts to perform transformations and analytics on large data sets in EMR clusters (a minimal sketch follows this summary).
- Experience in building and optimizing AWS data pipelines, architectures, and data sets.
- Hands-on experience with Hive for data analysis, Sqoop for data ingestion, and Oozie for scheduling, including configuring Oozie and writing Oozie workflows and coordinators.
- Worked on different file formats such as JSON, XML, CSV, ORC, and Parquet; experience in processing both structured and semi-structured data in these formats.
- Worked on Apache Spark (Spark Core, Spark SQL, Spark Streaming), performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts.
- Good experience with different SDLC models including Waterfall, V-Model, and Agile.
- Participated in daily stand-ups, sprint planning, and review meetings in Agile teams.
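Code sketch - Lambda invoking an EMR Spark step: a minimal, hedged illustration of the Lambda-to-EMR pattern mentioned above; the cluster ID, bucket, and script path are hypothetical placeholders, not values from this resume.

    import boto3

    emr = boto3.client("emr")

    def lambda_handler(event, context):
        """Submit a PySpark step to an already running EMR cluster."""
        # JobFlowId and the S3 script path below are placeholders
        response = emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",
            Steps=[{
                "Name": "nightly-transform",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://example-bucket/scripts/transform.py"],
                },
            }],
        )
        return {"StepIds": response["StepIds"]}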
TECHNICAL SKILLS
Big Data Technologies: Hadoop, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, YARN, Apache Spark, Mahout, Spark MLlib
Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos.
Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL
Cloud Technologies: AWS, Microsoft Azure
Frameworks: Django REST framework, MVC, Hortonworks
Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman
Versioning tools: SVN, Git, GitHub
Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, MacOS
Network Security: Kerberos
Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling
Monitoring Tool: Apache Airflow
Visualization/Reporting: Tableau, ggplot2, Matplotlib, SSRS, and Power BI
PROFESSIONAL EXPERIENCE
Confidential, Dallas, TX
Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake on various AWS services to support use cases including storage, processing, analytics, and reporting of voluminous, rapidly changing data.
- Evaluated Snowflake design considerations for any change in the application.
- Built the logical and physical data models for Snowflake as per the required changes.
- Used various AWS services including S3, EC2, AWS Glue, Athena, Redshift, EMR, SNS, SQS, DMS, and Kinesis.
- Extracted data from multiple source systems (S3, Redshift, RDS) and created Glue crawlers to populate tables and databases in the Glue Data Catalog.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Wrote PySpark jobs in AWS Glue to merge data from multiple tables and used crawlers to populate the AWS Glue Data Catalog with metadata table definitions (a minimal sketch follows this role's environment list).
- Used AWS Glue for transformations and AWS Lambda to automate the process.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Used various Spark transformations and actions to cleanse the input data.
- Used Jira for ticketing and issue tracking and Jenkins for continuous integration and deployment.
- Enforced standards and best practices around data cataloging and data governance efforts.
- Created DataStage jobs using stages such as Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample, Surrogate Key, Column Generator, and Row Generator.
- Created, debugged, scheduled, and monitored jobs using Airflow for ETL batch processing that loads into Snowflake for analytical processing.
- Built ETL pipelines for data ingestion, transformation, and validation on AWS, working alongside data stewards under data compliance requirements.
- Scheduled all jobs using Airflow scripts in Python, adding different tasks to DAGs and Lambda triggers.
- Used PySpark for extracting, filtering, and transforming data in data pipelines.
- Created monitors, alarms, notifications, and logs for Lambda functions and Glue jobs using CloudWatch.
- Performed end-to-end architecture and implementation assessments of AWS services such as Amazon EMR, Redshift, and S3.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Used Athena extensively to run queries on data processed by Glue ETL jobs and QuickSight to generate reports for business intelligence.
Environment: AWS Glue, S3, IAM, EC2, RDS, Redshift, Lambda, Boto3, DynamoDB, Apache Spark, Kinesis, Athena, Hive, Sqoop, Python, Snowflake.
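Code sketch - AWS Glue PySpark merge job: a hedged sketch of the kind of Glue job described above that joins two catalog tables and writes Parquet back to S3; the database, table, key, and bucket names are hypothetical placeholders.

    import sys
    from awsglue.transforms import Join
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read two tables registered in the Glue Data Catalog by a crawler (names are placeholders)
    orders = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
    customers = glue_context.create_dynamic_frame.from_catalog(database="sales_db", table_name="customers")

    # Merge the tables on the customer key
    merged = Join.apply(orders, customers, "customer_id", "customer_id")

    # Write the merged result back to S3 as Parquet
    glue_context.write_dynamic_frame.from_options(
        frame=merged,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/merged/"},
        format="parquet",
    )
    job.commit()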
Confidential, Dearborn, MI
Data Engineer
Responsibilities:
- Provisioned key AWS cloud services and configured them for scalability, flexibility, and cost optimization.
- Created VPCs, private and public subnets, and NAT gateways in a multi-region, multi-zone infrastructure landscape to manage worldwide operations.
- Built the logical and physical data models for Snowflake as per the required changes.
- Defined virtual warehouse sizing in Snowflake for different types of workloads.
- Managed Amazon Web Services (AWS) infrastructure with orchestration tools such as CloudFormation templates, Terraform, and Jenkins pipelines.
- Created Terraform scripts to automate deployment of EC2 instances, S3, EFS, EBS, IAM roles, snapshots, and Jenkins servers.
- Evaluated Snowflake design considerations for any change in the application.
- Built cloud data stores in S3 with logical layers for raw, curated, and transformed data management.
- Monitored servers using Nagios, CloudWatch, and the ELK stack (Elasticsearch, Kibana).
- Used dbt (Data Build Tool) for transformations in the ETL process, along with AWS Lambda and AWS SQS.
- Scheduled all jobs using Airflow scripts in Python, adding different tasks to DAGs and defining dependencies between the tasks (a minimal sketch follows this role's environment list).
- Worked on Oracle databases, Redshift, and Snowflake.
- Created conceptual, logical, and physical models for OLTP, data warehouse Data Vault, and data mart star/snowflake schema implementations.
- Created parameters and SSM documents using AWS Systems Manager.
- Established CI/CD tooling with Jenkins and Bitbucket for the code repository, build, and deployment of the Python code base.
- Set up data shares across Snowflake accounts.
- Used the Kinesis family (Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics) to collect, process, and analyze streaming data.
- Created Athena data sources on S3 buckets for ad hoc querying and business dashboarding using the QuickSight and Tableau reporting tools.
- Monitored and alerted on the Snowflake account through the Snowflake web console.
- Performed various Spark transformations and actions, saving the final result data to HDFS and loading it from there into the target Snowflake database.
- Copied fact/dimension and aggregate output from S3 to Redshift for historical data analysis using Tableau and QuickSight.
- Used Lambda functions and Step Functions to trigger Glue jobs and orchestrate the data pipeline.
- Used the PyCharm IDE for Python/PySpark development and Git for version control and repository management.
Environment: AWS EC2, VPC, S3, EBS, ELB, CloudWatch, CloudFormation, ASG, Lambda, AWS CLI, Git, Glue, Athena, QuickSight, Python, PySpark, Shell Scripting, Jenkins, Snowflake.
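Code sketch - Airflow DAG with task dependencies: a hedged sketch of the Python-defined DAG scheduling described above, assuming Airflow 2.x; the DAG ID, schedule, and callables are hypothetical placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder callables standing in for real extract/transform/load logic
    def extract():
        pass

    def transform():
        pass

    def load():
        pass

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_transform = PythonOperator(task_id="transform", python_callable=transform)
        t_load = PythonOperator(task_id="load", python_callable=load)

        # Dependencies between the tasks
        t_extract >> t_transform >> t_load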
Confidential, Washington, PA
Data Engineer
Responsibilities:
- Implemented a serverless architecture using API Gateway, Lambda, and DynamoDB; deployed AWS Lambda code from Amazon S3 buckets and configured a Lambda function to receive events from an S3 bucket (a minimal sketch follows this role's environment list).
- Designed data models for data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
- Wrote code that optimizes the performance of AWS services used by application teams and provided code-level application security for clients (IAM roles, credentials, encryption, etc.).
- Created AWS Lambda functions in Python for deployment management in AWS, and designed and implemented public-facing websites on Amazon Web Services integrated with other application infrastructure.
- Created AWS Lambda functions and API Gateways so that data submitted via API Gateway is accessible to the Lambda functions.
- Built CloudFormation templates for SNS, SQS, Elasticsearch, DynamoDB, Lambda, EC2, VPC, RDS, S3, IAM, and CloudWatch, and integrated them with Service Catalog.
- Performed regular monitoring of Unix/Linux servers (log verification, CPU usage, memory, load, and disk space checks) to ensure application availability and performance using CloudWatch and AWS X-Ray. Implemented AWS X-Ray at Confidential, allowing development teams to visually detect node and edge latency distribution directly from the service map.
- Designed and developed ETL processes in AWS Glue to migrate data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Utilized Python libraries such as Boto3 and NumPy for AWS work.
- Used Amazon EMR for MapReduce jobs and tested locally using Jenkins.
- Created external tables with partitions using Hive, AWS Athena and Redshift.
- Developed the PySpark code for AWS Glue jobs and for EMR.
- Good understanding of other AWS services such as S3, EC2, IAM, and RDS; experience with orchestration and data pipeline services such as AWS Step Functions, Data Pipeline, and Glue.
- Experience writing SAM templates to deploy serverless applications on the AWS cloud.
- Hands-on experience with AWS services including Lambda, Athena, DynamoDB, Step Functions, SNS, SQS, S3, and IAM.
- Designed and developed ETL jobs in AWS Glue to extract data from S3 objects and load it into a data mart in Redshift.
- Designed logical and physical data models for various data sources on Redshift.
- Experienced with event-driven and scheduled AWS Lambda functions that trigger various AWS resources.
- Integrated Lambda with SQS, DynamoDB, and Step Functions to iterate through lists of messages and update statuses in a DynamoDB table.
Environment: AWS EC2, S3, EBS, ELB, EMR, Lambda, RDS, SNS, SQS, VPC, IAM, CloudFormation, CloudWatch, ELK Stack, Bitbucket, Python, Shell Scripting, Git, Jira, Unix/Linux, AWS X-Ray, DynamoDB, Kinesis.
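Code sketch - Lambda handler for S3 events: a hedged sketch of a Lambda function receiving S3 ObjectCreated events as described above; the logged fields are illustrative only.

    import json
    import urllib.parse
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        """Triggered by an S3 ObjectCreated notification; logs metadata for each new object."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            head = s3.head_object(Bucket=bucket, Key=key)
            print(json.dumps({"bucket": bucket, "key": key, "size": head["ContentLength"]}))
        return {"statusCode": 200}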
Confidential
Data Engineer
Responsibilities:
- Designed and developed ETL processes with PySpark in AWS Glue to migrate data from S3 and generate reports.
- Wrote and scheduled Databricks jobs using Airflow.
- Developed PySpark code for AWS Glue jobs and for EMR.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Developed Java MapReduce programs to analyze sample log files stored in the cluster.
- Implemented Spark using Python and Spark SQL for faster testing and processing of data.
- Developed Spark scripts in Python on Azure HDInsight for data aggregation and validation, and verified their performance against MapReduce jobs.
- Used SageMaker notebooks as the development endpoint for Glue development.
- Authored Spark jobs for data filtering and transformation through PySpark DataFrames, both in AWS Glue and in Databricks.
- Imported data with Sqoop from MySQL into HDFS on a regular basis.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala.
- Used IAM to create new accounts, roles, groups, and policies, and developed critical modules such as generating Amazon Resource Names (ARNs) and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues.
- Reviewed explain plans for SQL queries in Snowflake.
- Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
- Used the AWS Glue Data Catalog with Athena to query data in S3 and perform SQL operations (a minimal sketch follows this role's environment list).
- Wrote various data normalization jobs for new data ingested into S3.
- Created Airflow DAGs to run jobs on daily, weekly, and monthly schedules.
- Designed and developed ETL processes with PySpark in AWS Glue to migrate data from external sources and S3 files into AWS Redshift.
- Wrote and scheduled Glue jobs, building the data catalog and mappings from S3 to Redshift.
- Created AWS Lambda functions with assigned IAM roles to schedule Python scripts using CloudWatch triggers, supporting infrastructure needs that required extraction of XML tags.
- Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
Environment: AWS EMR 5.0.0, EC2, S3, Oozie 4.2, Kafka, Spark, Spark SQL, PostgreSQL, Shell Script, Sqoop 1.4, Scala.
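Code sketch - querying a Glue Catalog table with Athena via Boto3: a hedged sketch of the Athena-on-S3 querying described above; the database, table, and results bucket names are hypothetical placeholders.

    import time
    import boto3

    athena = boto3.client("athena")

    # Submit a query against a Glue Data Catalog table backed by S3 (names are placeholders)
    submitted = athena.start_query_execution(
        QueryString="SELECT event_date, COUNT(*) AS events FROM raw_events GROUP BY event_date",
        QueryExecutionContext={"Database": "analytics_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = submitted["QueryExecutionId"]

    # Poll until the query finishes, then print the result rows
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])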
Confidential
AWS Data Engineer
Responsibilities:
- Wrote Spark applications in Scala to interact with the PostgreSQL database using the Spark SQLContext and accessed Hive tables using the HiveContext.
- Designed different components of the system, including the big data event processing framework (Spark), the distributed messaging system (Kafka), and the SQL database (PostgreSQL).
- Implemented Spark Streaming and Spark SQL using DataFrames.
- Integrated product data feeds from Kafka into the Spark processing system and stored the order details in the PostgreSQL database (a minimal sketch follows this role's environment list).
- Created AWS Lambda functions with assigned roles to run Python scripts, and Lambda functions in Java for event-driven processing.
- Created multiple Hive tables and implemented dynamic partitioning and bucketing in Hive for efficient data access.
- Designed tables and columns in Redshift for data distribution across the cluster's data nodes, keeping columnar database design considerations in mind.
- Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
- Created Hive external tables and used custom SerDes based on the structure of the input files so that Hive knows how to load the files into tables.
- Managed large datasets using Pandas DataFrames and MySQL.
- Monitored resources and applications using AWS CloudWatch, including creating alarms on metrics for EBS, EC2, ELB, RDS, S3, and SNS, and configured notifications for alarms generated based on defined events.
- Monitored system health and logs and responded to any warning or failure conditions.
- Worked on scheduling all jobs using Oozie.
Environment: AWS EMR, EC2, S3, Oozie, Kafka, Spark, Spark SQL, PostgreSQL, Shell Script, Sqoop, Scala.
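Code sketch - Kafka feed into Spark with a PostgreSQL sink: the Kafka-to-PostgreSQL flow above was built in Scala; this is a hedged PySpark Structured Streaming sketch of the same pattern, where the broker, topic, schema, table, and credential values are hypothetical placeholders and the spark-sql-kafka and PostgreSQL JDBC packages are assumed to be available.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("kafka-orders-to-postgres").getOrCreate()

    # Illustrative schema for the order feed
    order_schema = StructType([
        StructField("order_id", StringType()),
        StructField("product_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Read the product/order feed from Kafka (broker and topic are placeholders)
    orders = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "orders")
        .load()
        .select(from_json(col("value").cast("string"), order_schema).alias("o"))
        .select("o.*")
    )

    # Persist each micro-batch to PostgreSQL over JDBC (connection details are placeholders)
    def write_to_postgres(batch_df, batch_id):
        (batch_df.write.format("jdbc")
            .option("url", "jdbc:postgresql://db-host:5432/orders_db")
            .option("dbtable", "order_details")
            .option("user", "etl_user")
            .option("password", "etl_password")
            .mode("append")
            .save())

    orders.writeStream.foreachBatch(write_to_postgres).start().awaitTermination()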