Whitepaper
January 30, 2023

Chapter 2: Background and History of Data Architecture and Lakehouses

How data lakes and warehouses were brought together, the challenges surrounding data lakes and warehouses and how data lakehouses overcome these challenges.

Now that we have defined data lakehouses, data lakes and data warehouses, this chapter explains how lakes and warehouses were brought together, the challenges each of them presents and how data lakehouses overcome those challenges.

Bringing Lakes and Warehouses Together

To discuss the background and history of data lakehouses, we first need to address where data lakes and warehouses came from and how they came together in the data lakehouse architecture.

Where Lakes and Warehouses Came From

As the development of applications rapidly increased, the issue of data integrity arose. With so many applications, the same data appeared in multiple places with differing values. This made it difficult for a user to sift through the various applications to determine the correct version of the data. If the user couldn't find the correct version of the data, they might make the wrong decision. Businesses realized they required a different architectural approach to locate the correct data for decision-making. This led to the creation of the data warehouse.

A data warehouse placed disparate application data in one location, requiring the development of a new infrastructure altogether. Data warehouses have long been used in BI applications and decision support. However, data warehouses were not suited for handling semi-structured data, unstructured data or data with high volume, velocity and variety. These limitations quickly became evident. Even if they could handle these data structures, warehouses were often too expensive for many organizations. This led to the emergence of data lakes.

A data lake is a collection of the various types of data an organization holds. It stores raw data in several formats on affordable storage for machine learning (ML) and data science. Initially, businesses believed that a data lake required nothing more than extracting data and placing it in the lake.

Once data was in the lake, the user could simply find the data and perform an analysis. However, businesses quickly learned that the reality of using the data was quite different from merely placing it in the data lake. Data lakes lacked the critical features that data warehouses offered. A data lake does not enforce data quality or support transactions, for example. This led to the creation of data lakehouses.
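The gap described above, no enforced data quality and no transactions, can be illustrated with a small sketch. Everything in it is hypothetical and chosen only for illustration (the ORDER_SCHEMA columns, the RawLake and ValidatedTable classes); it is not how any particular data lake or warehouse product is implemented. The raw store accepts whatever it is given, while the validated table checks every record and commits a batch only if all of it passes.

```python
from dataclasses import dataclass, field

# Hypothetical schema for illustration: column name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def validate(record: dict, schema: dict) -> None:
    """Raise ValueError if the record does not match the schema exactly."""
    if set(record) != set(schema):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")


@dataclass
class RawLake:
    """Accepts any record as-is, like files dropped into a lake."""
    rows: list = field(default_factory=list)

    def append_batch(self, batch: list) -> None:
        for record in batch:
            self.rows.append(record)  # no checks, no rollback


@dataclass
class ValidatedTable:
    """Checks every record first, then commits the whole batch or nothing."""
    schema: dict
    rows: list = field(default_factory=list)

    def append_batch(self, batch: list) -> None:
        for record in batch:
            validate(record, self.schema)  # fails before anything is written
        self.rows.extend(batch)


good = {"order_id": 1, "amount": 9.5}
bad = {"order_id": "two", "amount": None}

lake = RawLake()
lake.append_batch([good, bad])  # both records land, including the bad one

table = ValidatedTable(ORDER_SCHEMA)
try:
    table.append_batch([good, bad])
    rejected = False
except ValueError:
    rejected = True  # the batch is refused and the table stays empty
```

In this sketch the malformed record ends up in the raw store, where a later analysis would have to discover and handle it, whereas the validated table rejects the whole batch up front.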

How Lakes and Warehouses Formed Data Lakehouse Architecture

Data lakes and warehouses were brought together to form lakehouses. A standardized and open system design enables this new data architecture. This system combines the data management features and data structures of a data warehouse with the low-cost storage of a data lake. The architecture of the data lakehouse also addresses the key data architecture challenges.

Data Architecture Challenges

The limitations of data warehouses and data lakes mean these systems come with multiple challenges, especially when an organization uses several data warehouses and a lake. Some of these common data architecture challenges include limited support for ML, lack of openness and forced trade-offs between lakes and warehouses.

1. Limited Support for Machine Learning

Plenty of research exists on the confluence of data management and machine learning, yet leading ML systems do not perform well on top of data warehouses. Where BI typically extracts a small amount of data, an ML system processes large amounts of data via complex non-SQL code. Exporting the data to files so that ML code can read it adds another step and can further complicate the process.

2. Lack of Openness

A data warehouse locks data in a proprietary format, increasing the expense of migrating workloads or data to other systems. Typically, a data warehouse provides SQL-only access, so running other analytics engines like ML systems can be difficult. Accessing data directly in the data warehouse with SQL can be a slow and expensive process, which can make integration with other technology challenging.

3. Forced Trade-Off Between Lakes and Warehouses

A large amount of enterprise data is stored in lakes because of the flexibility and low-cost storage associated with data lakes. To overcome the quality issues and lack of performance of a data lake, enterprises often transform application data from the lake into corporate data in a data warehouse. They can then use this data for important BI applications and decision support.

When using both of these systems, organizations must continuously transform application data into corporate data between the data lake and data warehouse. Each transformation run carries a risk of failures or bugs that could reduce the quality of the data. Keeping data consistent between the two systems is also challenging and costly, since the organization pays to store the data in both.
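The two-system pattern can be sketched in a few lines. This is a toy model under stated assumptions: the transform and sync functions and the record layout are invented for illustration, and a real pipeline would be far more involved. It shows the two costs named above: the warehouse copy is out of date until the transformation job runs again, and every record is stored twice.

```python
def transform(record: dict) -> dict:
    """Hypothetical cleanup step: normalize the customer name."""
    return {"id": record["id"], "customer": record["customer"].strip().title()}


def sync(lake: list) -> dict:
    """Rebuild the warehouse copy from the lake, keyed by record id."""
    return {record["id"]: transform(record) for record in lake}


# Raw application data lands in the lake; the warehouse holds a cleaned copy.
lake = [{"id": 1, "customer": " ada lovelace "}]
warehouse = sync(lake)

# New data arrives in the lake, but the warehouse has not been refreshed yet.
lake.append({"id": 2, "customer": "alan turing"})
missing_before_sync = [r["id"] for r in lake if r["id"] not in warehouse]

# Only after the job reruns do the two systems agree again.
warehouse = sync(lake)
missing_after_sync = [r["id"] for r in lake if r["id"] not in warehouse]

# Each record now exists in both systems, so storage is paid for twice.
stored_copies = len(lake) + len(warehouse)
```

Any bug in the hypothetical transform step would be repeated on every run, which is the quality risk the paragraph above describes.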

The Best of Both Worlds

As with any new technology, early data lakes had some weaknesses. Data quality was hard to enforce, transactions were unsupported, database features were lacking, and the eventual consistency model was used instead of the strong consistency model.

  • Eventual consistency model: If a record receives no new updates, all reads of that record will eventually return its most recently written value. Until then, a read may return an older value.
  • Strong consistency model: Every read returns the most recently written value, but access can take longer.
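The difference between the two models can be sketched with a toy key-value store that has one primary copy and one replica. The ReplicatedStore class and its method names are hypothetical and exist only for this illustration; real storage systems replicate data in far more sophisticated ways.

```python
class ReplicatedStore:
    """Toy key-value store with one primary and one replica (illustrative only)."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.replica: dict = {}
        self.pending: list = []  # writes not yet copied to the replica

    def write_eventual(self, key: str, value: str) -> None:
        """Return as soon as the primary is updated; the replica lags behind."""
        self.primary[key] = value
        self.pending.append((key, value))

    def write_strong(self, key: str, value: str) -> None:
        """Bring the replica fully up to date before returning (more work per write)."""
        self.replicate()  # flush older queued writes first so order is preserved
        self.primary[key] = value
        self.replica[key] = value

    def replicate(self) -> None:
        """Apply queued writes to the replica in the order they were made."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def read_replica(self, key: str):
        """Read from the replica, which may or may not have caught up."""
        return self.replica.get(key)


store = ReplicatedStore()
store.write_strong("price", "10")    # replica is updated immediately
store.write_eventual("price", "12")  # replica still holds the old value

stale_read = store.read_replica("price")      # read before replication runs
store.replicate()                             # no new updates arrive meanwhile
converged_read = store.read_replica("price")  # read after replication runs
```

Here the first replica read returns the older value and the second returns the newer one, which is the "eventually" in eventual consistency. The strong write avoids the stale read at the price of doing the replication work before it returns.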

Data lakehouses were formed to overcome those challenges and offer the best of both worlds. The name itself is a combination of both systems and represents the goal of integrating the key features of data lakes and data warehouses to achieve a new framework. With a data lakehouse, there is no forced trade-off between lakes and warehouses. A data lakehouse also addresses the lack of openness and limited support for ML associated with the other forms of data architecture.

Technology evolves quickly, and businesses need to adapt just as quickly to implement new technology and data sources successfully. To learn more about how data lakehouses combine the best features of both data lakes and warehouses, contact us at RightData.