AWS Data Engineer Resume
Chicago
SUMMARY
- 7+ years of IT experience in Data Engineering with big data tools such as Hadoop, Spark, and Kafka, along with AWS and Azure cloud services.
- Hands-on with end-to-end solutions for data ingestion, modeling, processing, transformation, and analysis, with focus skills in Azure, SQL, Databricks, Data Factory, PySpark, Azure Data Lake, AWS Glue, AWS Lambda, and Athena.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Azure SQL Data Warehouse, RDS, Redshift, and S3.
- Extensively worked on Spark with Scala on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
- Comprehensive knowledge and experience in process improvement, normalization/de-normalization, data extraction, data cleansing, and data manipulation.
- Strong experience interacting with stakeholders/customers, gathering requirements through interviews, workshops, and existing system documentation or procedures, defining business processes, and identifying and analyzing risks using appropriate templates and analysis tools.
- Experience creating Teradata SQL scripts using OLAP functions such as RANK() OVER to improve query performance when pulling data from large tables.
- Expertise in creating complex SSRS reports against OLTP and OLAP databases.
- Skilled in analysis, modeling, development, and project management.
- Skilled in Python, with proven ability to adopt new tools and technical developments.
- Strong in performance tuning for very large datasets in SAS.
- Experienced using SAS/SQL for extract, transform, and load (ETL) methodology and processes.
- Strong experience in business and data analysis, data profiling, data migration, data conversion, data quality, data integration, metadata management services, and configuration management.
- Converted SQL queries into Spark transformations using Spark RDDs and PySpark (see the PySpark sketch following this summary).
- Real-time data processing using Azure Event Hubs as the streaming ingestion layer for downstream reads.
- Experience in various phases of the Software Development Life Cycle (analysis, requirements gathering, design), with expertise in documenting requirement specifications, functional specifications, test plans, source-to-target mappings, and SQL joins.
- Worked on PIM, building different interfaces with PIM and handling data migration into PIM.
- Managed Amazon EC2 instances by taking AMIs, and performed administration and monitoring of instances using Amazon CloudWatch.
- Expertise in relational databases like Oracle, MySQL and SQL Server.
- Experienced in using Flume and Kafka to load the log data from multiple sources into HDFS.
- Expertise in creating/managing database objects like Tables, Views, Indexes, Procedures, Triggers, and Functions.
- Created many complex ETL jobs for exchanging data to and from database servers and various other systems, including RDBMS, XML, JSON, CSV, and flat-file structures, into staging.
- Developed and executed detailed ETL functional, performance, integration, and regression test cases and documentation.
- Analyzed existing ETL workflows to understand and document their logic.
- Performed ETL on different data formats such as JSON and CSV files, converting them to Parquet/ORC while loading into final tables.
- Experienced with source control repositories such as CVS, SVN, and GitHub.
- Experienced in Database development, ETL, OLAP, OLTP.
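A minimal PySpark sketch of the SQL-to-DataFrame conversion and Parquet conversion work summarized above; the SparkSession setup, S3 paths, and column names are illustrative assumptions, not production code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-to-dataframe-sketch").getOrCreate()

# Hypothetical source data; in practice this would be a Hive table or files on S3/ADLS.
orders = spark.read.parquet("s3a://example-bucket/raw/orders/")

# DataFrame equivalent of:
#   SELECT customer_id, SUM(amount) AS total_amount
#   FROM orders WHERE status = 'COMPLETE'
#   GROUP BY customer_id
totals = (
    orders
    .filter(F.col("status") == "COMPLETE")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount"))
)

# Persist in a columnar format for the final tables.
totals.write.mode("overwrite").parquet("s3a://example-bucket/curated/order_totals/")
```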
TECHNICAL SKILLS
Hadoop/Big Data Technologies: HDFS, MapReduce, Spark, Hive, Hue, Impala, Pig, Sqoop, Flume, Kafka, Oozie, ZooKeeper, Ambari, Storm.
Hadoop Distributions: Cloudera, Hortonworks, MapR.
Programming Languages: Python, Scala, Bash/Shell Scripting, PL/SQL.
Frameworks: Spring, Hibernate, Struts, JSF, JMS, EJB.
Web Technologies: HTML, XML, CSS, JavaScript, JSON, AJAX, jQuery, Bootstrap.
Databases: Oracle, MySQL, DB2, Teradata, SQL Server.
Operating Systems: UNIX, Windows, Linux
Web/Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss.
Methodologies: Agile, Waterfall.
Version Control: SVN, CVS, GIT.
Cloud: AWS, Azure
PROFESSIONAL EXPERIENCE
Confidential, Chicago
AWS Data Engineer
Responsibilities:
- Responsible for creating on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark (see the Glue sketch at the end of this role).
- Supported continuous storage in AWS using Elastic Block Storage, S3, and Glacier; created volumes and configured snapshots for EC2 instances.
- Installed, configured, and maintained Apache Hadoop clusters for application development based on requirements.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing.
- Developed Spark scripts by using Scala Shell commands as per the requirement.
- Developed Spark code in Python (PySpark) and Spark SQL for faster testing and processing of data.
- Created data pipelines for different events to load data from DynamoDB into an AWS S3 bucket and then into HDFS.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Imported real-time data into Hadoop using Kafka and implemented Oozie jobs for daily imports.
- Developed a Spark job that indexes data from external Hive tables in HDFS into Elasticsearch.
- Developed Spark programs with Scala and applied principles of functional programming to do batch processing.
- Involved in the development of real time streaming applications using PySpark, Apache Flink, Kafka, Hive on distributed Hadoop Cluster.
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Developed application to clean semi-structured data like JSON/XML into structured files before ingesting them into HDFS.
- Built real time pipeline for streaming data using Kafka and Spark Streaming.
- Wrote HiveQL as per requirements, processed data in the Spark engine, and stored the results in Hive tables.
- Responsible for importing data from Postgres into HDFS and Hive using Sqoop.
- Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
- Wrote Spark Core programs for processing and cleansing data, then loaded the data into Hive or HBase for further processing.
- Developed a framework for automated data ingestion from sources such as relational databases, delimited files, JSON files, and XML files into HDFS, and built Hive/Impala tables on top of them.
- Developed real-time data ingestion application using Flume and Kafka.
- Hive tables were created on HDFS to store the data processed by Apache Spark on the Cloudera Hadoop Cluster in Parquet format.
- Developed a tool in Scala and Apache Spark to load S3 JSON files into Hive tables in Parquet format.
- Wrote a Scala tool that scrubs files in Amazon S3, removing unwanted characters and performing other cleanup.
Environment: Cloudera, Hive, Impala, Spark, Apache Kafka, Flume, Scala, AWS, EC2, S3, DynamoDB, Auto Scaling, Lambda, NiFi, Snowflake, Java, Shell scripting, SQL, Sqoop, Oozie, PL/SQL, Oracle 12c, SQL Server, HBase, Python.
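A hedged sketch of the kind of AWS Glue PySpark job referenced in this role, converting raw JSON on S3 into Parquet for on-demand querying; the bucket names and prefixes are hypothetical, and catalog registration (Glue crawlers or table definitions) is not shown.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job bootstrap; JOB_NAME is supplied by the Glue runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read raw JSON landed on S3 (hypothetical bucket/prefix).
raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/events/"]},
    format="json",
)

# Light cleansing, then write Parquet so Athena/Hive-style tables can be built on top.
df = raw.toDF().dropDuplicates()
df.write.mode("overwrite").parquet("s3://example-curated-bucket/events_parquet/")

job.commit()
```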
Confidential, Chicago
Data Engineer
Responsibilities:
- Ingested data through cleansing and transformation steps, leveraging AWS Lambda, AWS Glue, and Step Functions.
- Created YAML files for each data source, including Glue table stack creation.
- Wrote a Python script to extract data from Netezza databases and transfer it to AWS S3.
- Developed Lambda functions with IAM roles to run Python scripts, triggered by SQS, EventBridge, and SNS (see the first sketch at the end of this role).
- Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
- Migrated data from on-premises systems to AWS storage buckets.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Analyzed and defined the researchers' strategy, determined system architecture and requirements to achieve goals, and developed multiple Kafka producers and consumers per the software requirement specifications.
- Used Kafka for log aggregation, gathering physical log files from servers and placing them in a central location such as HDFS for processing.
- Configured Spark Streaming to consume real-time data from Kafka and store the streamed data in HDFS (see the second sketch at the end of this role).
- Created partitioned external tables using AWS Athena and Redshift.
- Developed Spark programs using Scala and performed transformations and actions on RDDs.
- Involved in writing custom Map-Reduce programs for data processing.
- Documented the complete process flow describing program development, logic, testing, implementation, application integration, and coding.
- Performed fit-gap analysis for all data migration requirements.
- Developed ETL data workflows and code for new data integrations, and maintained/enhanced existing workflows and code.
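A minimal sketch of a Python Lambda handler of the sort described above, assuming an SQS trigger and a hypothetical landing bucket; the bucket name, prefix, and payload shape are illustrative only.

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical target location; real names came from environment-specific configuration.
TARGET_BUCKET = "example-landing-bucket"
TARGET_PREFIX = "netezza-extracts/"

def handler(event, context):
    """Triggered by SQS: write each message body to S3 as a JSON object."""
    records = event.get("Records", [])
    for record in records:
        body = json.loads(record["body"])
        key = f"{TARGET_PREFIX}{record['messageId']}.json"
        s3.put_object(Bucket=TARGET_BUCKET, Key=key, Body=json.dumps(body).encode("utf-8"))
    return {"statusCode": 200, "processed": len(records)}
```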
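And a sketch of the Kafka-to-HDFS streaming path, written here with Spark Structured Streaming (the role may have used DStream-based Spark Streaming); the broker address, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Subscribe to a hypothetical log topic.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "server-logs")
    .load()
)

# Kafka keys/values arrive as bytes; cast to strings before persisting.
parsed = events.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "timestamp",
)

# Append the stream to HDFS with a checkpoint for exactly-once file output.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/server-logs/")
    .option("checkpointLocation", "hdfs:///checkpoints/server-logs/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```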
Confidential
Azure Data Engineer
Responsibilities:
- Created notebooks using Databricks, Python, and Spark, capturing data from Delta tables in Delta Lake.
- Created Azure Data Factory instances, managed Data Factory policies, and utilized Blob Storage for storage and backup on Azure.
- Designed and maintained ADF pipelines with activities such as Copy, Lookup, ForEach, Get Metadata, Execute Pipeline, Stored Procedure, If Condition, Web, Wait, and Delete.
- Extensive knowledge in migrating applications from internal data storage to Azure. Experience in building streaming applications in Azure Notebooks using Kafka and Spark.
- Captured SCD Type 2 changes and performed updates, inserts, or deletes based on business requirements using Databricks (see the Delta merge sketch at the end of this role).
- Developed a framework for creating new snapshots and deleting old snapshots in Azure Blob Storage, and set up lifecycle policies to back up data from Delta Lake.
- Expert in building Azure Notebook functions using Python and Spark.
- Built and configured a virtual data center in the Azure cloud to support enterprise data warehouse hosting, including virtual private cloud networking, public and private subnets, security groups, and route tables.
- Integrated the framework with CloudFormation to automate Azure environment creation and deployment using build scripts (Azure CLI), and automated solutions using Terraform.
- Transformed data on the Azure Databricks Spark platform into Parquet format for efficient storage.
- Worked on reading and writing multiple data formats like JSON, ORC, Parquet on HDFS using PySpark.
- Understood business requirements and developed common solutions that meet them.
- Implemented secure views and row-level security on Snowflake tables.
- Created external tables and copied data from external storage accounts (S3) into Snowflake.
- Wrote Databricks code and fully parameterized ADF pipelines for efficient code management.
- Developed Spark applications in Python (PySpark) in a distributed environment to load large numbers of CSV files with different schemas into Hive ORC tables.
Environment: Git, Azure, Snowflake, PySpark, Kafka, Delta Lake, Blob Storage, Event Hubs, Databricks, Notebooks, Scala, Python
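A simplified Databricks Delta sketch of the update/insert/delete handling described above: a single MERGE keyed on a hypothetical customer_id with an op flag. This is not a full SCD Type 2 implementation with effective dates; the table and column names are assumptions.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` is already provided

# Hypothetical staged changes with an `op` column ('I', 'U', 'D') and business key `customer_id`.
changes = spark.table("staging.customer_changes")

target = DeltaTable.forName(spark, "curated.customers")

# Apply deletes, updates, and inserts in one MERGE against the Delta table.
(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'D'")
    .whenMatchedUpdateAll(condition="s.op = 'U'")
    .whenNotMatchedInsertAll(condition="s.op = 'I'")
    .execute()
)
```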
Confidential
SQL Developer
Responsibilities:
- Building databases and validating their stability and efficiency
- Writing optimized SQL queries for integration with other applications
- Designing database tables and dictionaries
- Maintaining data quality and backups
- Worked on software design and testing for billing software; carried out manual testing of versions 3, 4, and 5 of the POS software.
- Created and administered a Postgres database.
- Developed and managed the Postgres database using PL/pgSQL (see the sketch after this role).
- Created SQL queries for diverse use cases and ensured data integrity with frequent backups and restores.
- Identified data sources and defined them to build data source views.
- Performed market research and gathered statistics.
Environment: PL/pgSQL, SQL
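A small psycopg2 sketch illustrating the kind of Postgres table creation and parameterized query work listed in this role; the connection details and the invoices table are hypothetical.

```python
import psycopg2

# Placeholder connection parameters; real values came from environment configuration.
conn = psycopg2.connect(host="localhost", dbname="billing", user="app", password="secret")

with conn:  # commits the transaction on success, rolls back on error
    with conn.cursor() as cur:
        # Example billing table of the sort used by the POS/billing software.
        cur.execute(
            """
            CREATE TABLE IF NOT EXISTS invoices (
                invoice_id  SERIAL PRIMARY KEY,
                customer_id INTEGER NOT NULL,
                amount      NUMERIC(12, 2) NOT NULL,
                created_at  TIMESTAMP DEFAULT now()
            )
            """
        )
        # Parameterized insert keeps queries safe and plan-friendly.
        cur.execute(
            "INSERT INTO invoices (customer_id, amount) VALUES (%s, %s)",
            (42, 199.99),
        )

conn.close()
```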