Data Lake vs. Data Warehouse

With the rise of the data lake concept in recent years, the industry has been debating how data warehouses and data lakes compare. Some say the data lake is the next-generation big data platform; major cloud vendors are proposing their own data lake solutions, and some cloud data warehouse products have added features for linking with a data lake. But what exactly is the difference between a data warehouse and a data lake?

In the last article, we introduced what a data lake is, the basic architecture of a data lake, and a cloud-based data lake framework. This article analyzes the differences between a data lake and a data warehouse, and introduces the new direction in which the two are converging: the integrated lakehouse.

  1. What is a data warehouse

The concept of the data warehouse originated in the database field, mainly to handle complex data-oriented query and analysis scenarios. With the development of big data technology, many database techniques, such as the SQL language and query optimizers, were borrowed to form the big data warehouse, which became mainstream thanks to its powerful analytical capabilities. In recent years, combining the data warehouse with cloud-native technology produced the cloud data warehouse, which solves the resource-provisioning problem enterprises face when deploying a data warehouse. As a high-level (enterprise-grade) big data platform capability, the cloud data warehouse has attracted growing attention for being ready out of the box, scalable without practical limit, and simple to operate and maintain.

A data warehouse is a strategic collection of data that supports decision-making processes at all levels of an enterprise. It is a single data store created for analytical reporting and decision support. For businesses that require business intelligence, it provides guidance on improving business processes and on monitoring time, cost, quality, and control.

The essence of the data warehouse consists of the following three parts:
(1) A built-in storage system: data is provided through abstractions (such as tables or views), and the file system is not exposed.
(2) Data needs to be cleaned and transformed, usually via ETL/ELT.
(3) Emphasis on modeling and data management for business intelligence decision-making.
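The first point can be sketched in a few lines. This is a minimal illustration, not any particular warehouse product's API; SQLite merely stands in for the warehouse's built-in storage, and the table and view names are hypothetical:

```python
import sqlite3

# SQLite stands in for a warehouse's built-in storage: the client sees
# only tables and views, never the files behind them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# A view adds another layer of abstraction over the same data.
conn.execute("CREATE VIEW big_orders AS SELECT * FROM orders WHERE amount > 20")

rows = conn.execute("SELECT id, amount FROM big_orders").fetchall()
print(rows)  # [(2, 25.5)]
```

Because every access goes through SQL over tables and views, the engine is free to reorganize the underlying storage without breaking clients, which is exactly what keeping the file system unexposed buys.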

Judging by the above criteria, both traditional data warehouses (such as Teradata) and emerging cloud data warehouse systems (AWS Redshift, Google BigQuery, Alibaba Cloud MaxCompute) embody this design essence: none of them exposes the file system to the outside world; instead, they provide service interfaces for moving data in and out. This design has several advantages:

(1) The engine deeply understands the data, so storage and computation can be deeply optimized.
(2) Full data lifecycle management and complete data lineage.
(3) Fine-grained data management and governance.
(4) Complete metadata management capabilities, making it easy to build an enterprise-level data platform.

Because of this, Alibaba's Feitian big data platform adopted the data warehouse architecture from the start, in the form of the MaxCompute big data platform. MaxCompute (formerly ODPS) is not only the big data platform for the Alibaba economy, but also an online big data computing service on Alibaba Cloud that is secure, reliable, high-performance, low-cost, and elastically scalable on demand from GB to EB (Figure 6 shows the MaxCompute product architecture; for details, see the Alibaba Cloud MaxCompute official website). As an enterprise-level cloud data warehouse delivered as SaaS, MaxCompute is widely used across the Alibaba economy and by thousands of customers on Alibaba Cloud, spanning the Internet, new finance, new retail, and digital government.

Thanks to the MaxCompute data warehouse architecture, Alibaba gradually built management capabilities on top of it, such as data security, data quality, data governance, and data labeling, finally forming Alibaba's big data middle platform. It can be said that Alibaba, as the earliest proponent of the data middle platform concept, owes its data middle platform to the data warehouse architecture.

  2. The evolution from database to data warehouse to data lake

A database is application-oriented, and its data must conform to a predefined schema; each application may need its own database. If a company has dozens of applications, it will have dozens of databases. How do you join and analyze data across dozens of databases? With databases alone, there is no good way.

The database then evolved into the data warehouse. A data warehouse is not oriented to any single application, but it is connected to the databases: scheduled ETL batch tasks (for example, daily) aggregate data from the different applications and join it according to a modeling paradigm, producing an overall data view for a given period. The premise is that many databases feed data into the data warehouse.
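A minimal sketch of such a daily ETL task, under the assumption that two application databases exist (here simulated with in-memory SQLite; all table and source names are illustrative):

```python
import sqlite3

# Two hypothetical application databases, each with its own signup counts.
app_a = sqlite3.connect(":memory:")
app_a.execute("CREATE TABLE signups (day TEXT, n INTEGER)")
app_a.execute("INSERT INTO signups VALUES ('2023-01-01', 3)")

app_b = sqlite3.connect(":memory:")
app_b.execute("CREATE TABLE signups (day TEXT, n INTEGER)")
app_b.execute("INSERT INTO signups VALUES ('2023-01-01', 5)")

# The warehouse holds one unified table fed by all sources.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE daily_signups (day TEXT, source TEXT, n INTEGER)")

# ETL: extract from each source database, load into the warehouse table.
for name, db in [("app_a", app_a), ("app_b", app_b)]:
    for day, n in db.execute("SELECT day, n FROM signups"):
        warehouse.execute("INSERT INTO daily_signups VALUES (?, ?, ?)", (day, name, n))

# The warehouse can now answer questions no single application database could.
total = warehouse.execute(
    "SELECT SUM(n) FROM daily_signups WHERE day = '2023-01-01'"
).fetchone()[0]
print(total)  # 8
```

The cross-application aggregation in the final query is exactly the "overall data view" the paragraph above describes.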

As data volumes grew and data types changed, unstructured data such as video, audio, and documents came to occupy an ever larger share of the total. The original data warehouse could hardly support this, so more and more enterprises wanted to keep data in its true initial state. Driven by this demand, the idea of the data lake took shape: data is saved in its original state so that companies can analyze it from multiple dimensions. Data can enter the lake easily, and users can defer data collection, cleaning, and normalization until a business need actually arises. In a traditional data warehouse, the modeling paradigm means the business cannot change casually, since any change ripples through the underlying data. By comparison, data lakes are more flexible and can adapt more quickly to changes in upper-layer data applications.
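The deferred-processing idea above is often called schema-on-read. A minimal sketch, assuming the "lake" is just a directory of raw files (the file name and event fields are invented for illustration):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lake: a plain directory. Raw events land as-is, with no
# upfront schema, cleaning, or rejection.
lake = Path(tempfile.mkdtemp())
raw = [
    '{"user": "a", "action": "click"}',
    '{"user": "b", "action": "buy", "amount": 9.5}',
    "not json at all -- kept anyway, nothing is validated on ingest",
]
(lake / "events-2023-01-01.log").write_text("\n".join(raw))

# Schema-on-read: parsing and cleaning happen only when an analysis needs them.
def purchases(path):
    out = []
    for line in path.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped at read time, not write time
        if event.get("action") == "buy":
            out.append(event)
    return out

result = purchases(lake / "events-2023-01-01.log")
print(result)  # [{'user': 'b', 'action': 'buy', 'amount': 9.5}]
```

Note how ingest never fails: the cost of interpreting (or discarding) a record is paid by each reader, which is precisely the flexibility and the burden the paragraph above attributes to data lakes.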

  3. Data lake vs. data warehouse

A data lake stores data in its original format: any data, structured or unstructured, can be kept in its most primitive form, ensuring it can be used without losing any detail. All real-time and batch data is aggregated into the lake, and relevant data is then drawn from the lake for machine learning or data analysis.

The data-lake-first design brings maximum flexibility to data entering the lake by opening up the underlying file storage. Data entering the lake can be structured, semi-structured, or completely unstructured raw logs. Open storage also gives more flexibility to the upper-layer engines: various engines can read and write data in the lake according to their own scenarios, needing to follow only fairly loose compatibility conventions. At the same time, direct access to the file system makes many higher-level functions hard to implement, such as fine-grained (sub-file) permission management and unified file management; upgrading read/write interfaces is also very difficult, because every engine that accesses the files must itself be upgraded before the upgrade is complete.

The data-warehouse-first design pays more attention to enterprise-level "growth" requirements such as data usage efficiency, large-scale data management, and security/compliance. Data enters the warehouse through a unified but open service interface, usually with a predefined schema, and users access the files in the distributed storage system through the data service interface or a computing engine. By abstracting the data access interface, permission management, and the data itself, this design trades some flexibility for higher performance (in both storage and compute), a closed-loop security system, and data governance capabilities. These capabilities are crucial for an enterprise's long-term use of big data; we call them growth requirements.

  4. Lake and warehouse integration

Lake and warehouse integration (the lakehouse) means connecting the data warehouse and the data lake so that data and computation can flow freely between them, building a complete and organic big data technology ecosystem.

Alibaba Cloud's integrated lakehouse solution:

Building on its original data warehouse architecture, Alibaba Cloud MaxCompute integrates with open-source data lakes and cloud-based data lakes, realizing an overall lakehouse architecture. Although multiple underlying storage systems coexist in this architecture, a unified storage access layer and unified metadata management present an integrated interface to the upper-level engines, and users can jointly query tables in the data warehouse and the data lake. The overall architecture also provides unified data security, management, and governance capabilities.
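The joint-query idea can be sketched in miniature. This is not the MaxCompute API; it only illustrates the principle, with SQLite as the warehouse, a raw CSV string as the lake file, and all table and column names invented:

```python
import csv
import io
import sqlite3

# A warehouse table with a predefined schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "ann"), (2, "bob")])

# A raw CSV file sitting in the lake.
lake_csv = "customer_id,clicks\n1,12\n2,7\n"

# Unified access layer: register the lake data as a table so that both
# sides become queryable with a single SQL statement.
warehouse.execute("CREATE TABLE lake_clicks (customer_id INTEGER, clicks INTEGER)")
for row in csv.DictReader(io.StringIO(lake_csv)):
    warehouse.execute(
        "INSERT INTO lake_clicks VALUES (?, ?)",
        (int(row["customer_id"]), int(row["clicks"])),
    )

# One query now joins warehouse and lake data.
joined = warehouse.execute(
    "SELECT c.name, l.clicks FROM customers c "
    "JOIN lake_clicks l ON c.id = l.customer_id ORDER BY c.id"
).fetchall()
print(joined)  # [('ann', 12), ('bob', 7)]
```

In a real lakehouse the registration step is handled by unified metadata management rather than a manual copy, but the user-visible effect is the same: one query spanning both systems.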

Based on MaxCompute's lakehouse technology, DataWorks can further encapsulate the two systems, shielding the heterogeneous cluster details of the lake and the warehouse to build an integrated big data middle platform, so that one set of data and one set of tasks can be seamlessly scheduled and managed across lake and warehouse. Enterprises can use these integrated middle-platform capabilities to optimize their data architecture and fully combine the respective strengths of the data lake and the data warehouse: use the data lake as centralized raw data storage to exploit its flexibility and openness, and seamlessly schedule production-oriented, high-frequency data and tasks into the data warehouse for better performance and cost, as well as a range of production-oriented data governance and optimization, finding the best balance between flexibility and efficiency.

In general, the MaxCompute lakehouse provides enterprises with a more flexible, efficient, and economical data platform solution. It suits both enterprises building new big data platforms and enterprises with existing big data platforms upgrading their architecture, protecting existing investments while realizing returns on data assets.