Implement Lake House in AWS – Data Lake, Lake House, and Delta Lake

Implement Lake House in AWS

Now, let us implement a lake house in AWS using the health data CSV file we used while building the lake house on Azure. We will go through the following steps to build the lake house in AWS:

• Create an S3 bucket to store the raw data.

• Create an AWS Glue job to convert the raw data into a delta table.

• Query the delta table using the AWS Glue job.

As a prerequisite, you should have permission for the IAM role, as in Figure 3-53.

Figure 3-53. IAM permissions

Create an S3 Bucket to Keep the Raw Data

Let us create an S3 bucket, as in Figure 3-54, to keep the raw data and the delta table. The AWS Glue job will transform the raw CSV data to a delta table in parquet format and keep it in the S3 bucket.

Figure 3-54. Create bucket

Click on the Create folder, as in Figure 3-55, to create a folder to keep the raw CSV file.

Figure 3-55. Create folder

Provide the name of the folder as raw data, as in Figure 3-56. Click on Create folder.

Figure 3-56. Create folder

Use the same steps to create another folder called delta-lake, as in Figure 3-57. We will keep the delta table here in this folder.

Figure 3-57. Folders in S3

Go to the raw-data folder and click on Upload, as in Figure 3-58. We need to upload the raw CSV file to this folder.

Figure 3-58. Upload raw CSV file.

Figure 3-59 depicts the uploaded CSV file in the S3 bucket. We will use the AWS Glue job to transform this CSV file into a delta table in parquet format.

Figure 3-59. Uploaded raw CSV file

Implement Lake House in AWS – Data Lake, Lake House, and Delta Lake

Leave a Reply Cancel reply

Related Posts

Advantages of Modern Data Warehouse over Traditional Data Warehouse – Modern Data WarehousesAdvantages of Modern Data Warehouse over Traditional Data Warehouse – Modern Data Warehouses

Autonomous Administration Capabilities – Modern Data WarehousesAutonomous Administration Capabilities – Modern Data Warehouses

Performance – Modern Data WarehousesPerformance – Modern Data Warehouses