Data Lake, Storage, and Data Processing Engines Synergies and Dependencies

Data processing engines and pipelines play a pivotal role in the lake house. They facilitate data movement and ingest data into the data lake from external sources. The data ingested into the data lake arrives in raw format and cannot be consumed by the delta lake as-is, because delta lake stores its data in the Parquet format. You need data processing engines to convert the raw data into the Parquet-based delta lake format. Even once the data is available in the lake house, you will need data processing engines to further refine it so that it is ready for consumption as-is by the target data consumers.
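
As a rough illustration of that conversion step, here is a minimal PySpark sketch that reads a raw CSV file from the data lake and rewrites it in delta lake format. The paths are placeholders, and it assumes the Delta Lake library is available on the Spark cluster (it ships with the Azure Synapse and Databricks Spark runtimes).

from pyspark.sql import SparkSession

# Spark session with the open-source Delta Lake extensions enabled.
spark = (
    SparkSession.builder
    .appName("raw-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the raw CSV landed in the data lake (path is illustrative).
raw_df = spark.read.option("header", True).csv("/datalake/raw/health_data.csv")

# Write it out in delta format; the underlying data files are Parquet
# plus a transaction log.
raw_df.write.format("delta").mode("overwrite").save("/datalake/delta/health_data")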

Azure provides Azure Synapse pipelines and Azure Data Factory, which can connect to a wide variety of data stores and facilitate data processing and movement. AWS provides AWS Glue, which helps you move and process data across a variety of data stores.
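
On the AWS side, a Glue job is typically a PySpark script built around Glue's DynamicFrame API. The sketch below, with illustrative S3 bucket names, reads raw CSV data and writes it back out as Parquet:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read the raw CSV data from the landing zone (bucket names are placeholders).
raw_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://raw-zone/health-data/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the data back to the curated zone in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=raw_dyf,
    connection_type="s3",
    connection_options={"path": "s3://curated-zone/health-data/"},
    format="parquet",
)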

Data processing engines are an integral part of the lake house and accelerate solution delivery on the lake house platform. In subsequent sections, we will explore building a lake house on the cloud, where we will rely on these data processing engines and pipelines to a great extent.

Implement Lake House in Azure

By now we have seen how the lake house works as well as its architecture. Let us build a lake house on the Azure platform. We have a health-care platform that generates health data for patients daily in CSV format and ingests the CSV file into the data lake. We need to process the data and make it available in the lake house in Azure so that the target data-consuming platforms can query it from the lake house database via SQL. To implement this scenario, we will perform the following steps:

•    Create a data lake on Azure and ingest the health data CSV file.

•    Create an Azure Synapse pipeline to convert the CSV file to a Parquet file.

•    Attach the Parquet file to the lake database (a minimal sketch of the last two steps follows this list).
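
The following is a rough sketch of the conversion and attachment steps as they might appear in a Synapse Spark notebook invoked from the pipeline. The storage account, container, database, and table names are placeholders, and `spark` is the session that Synapse notebooks provide by default.

# Read the ingested health data CSV from the raw zone of the data lake.
csv_path = "abfss://raw@healthlakestorage.dfs.core.windows.net/health_data.csv"
df = spark.read.option("header", True).csv(csv_path)

# Convert the data to Parquet and store it in a curated folder.
parquet_path = "abfss://curated@healthlakestorage.dfs.core.windows.net/health_data/"
df.write.mode("overwrite").parquet(parquet_path)

# Attach the Parquet data to a lake database table so that target
# data consumers can query it with SQL.
spark.sql("CREATE DATABASE IF NOT EXISTS health_lakehouse")
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS health_lakehouse.patient_health
    USING PARQUET
    LOCATION '{parquet_path}'
""")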
