Chapter 4: Top Features Within a Data Lakehouse
To help you understand the structure of a data lakehouse, this chapter covers how a lakehouse is built and the key technologies and features that define it.
Building the Data Lakehouse
Data lakehouses are amalgamations of different kinds of data and features. The three elements of a data lakehouse are a data lake, an analytical infrastructure and a collection of different types of data.
1. A Data Lake
Building a lakehouse, of course, starts with a data lake, where large amounts of raw data are stored. To understand how to design a data lake, you should know its common features: a data lake lets your business store raw data in structured, semi-structured and unstructured formats.
2. An Analytical Infrastructure
The analytical infrastructure is where end users find descriptive information about the data. It works like a library's card catalog, helping users quickly and efficiently find the data they are looking for: just as you use a card catalog to locate a book, you use the analytical infrastructure to narrow down your search so you know exactly where your data resides.
When you find the data, it's open for access and analysis. You can analyze and read data found in a data lake and analytical infrastructure using many different tools, including real-time applications, business intelligence (BI) tools and machine learning (ML).
3. A Collection of Different Types of Data
A data lakehouse collects many different types of data, including structured data, textual data and other unstructured data.
- Structured data: Structured data was the earliest kind of data in computers and is mainly transaction-based. New transactions create new data that leads to structured records. For each occurrence, the record has a uniform structure and contains various types of information, such as attributes and keys. For example, the data for a sale generally includes the product name, quantity, price and payment method.
- Textual data: Textual data is another type of data that can be found in a lakehouse. Sources of textual data include the internet, emails and transcribed phone conversations. Typically, text is stored in a database format so analytical tools can be used, but the lakehouse can store raw text if this aligns better with your business needs.
- Other unstructured data: In a data lakehouse, your business can also store other unstructured data. Usually, Internet of Things (IoT) data and analog data fall into this category. Machines generate and collect these data types.
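As a minimal sketch of what "uniform structure" means for structured data, the sale example above can be modeled as a record type with fixed attributes and a key (the field names here are illustrative assumptions, not part of any standard):

```python
from dataclasses import dataclass

# A structured record: every occurrence shares the same uniform set of
# attributes, plus a key that identifies the transaction.
@dataclass(frozen=True)
class SaleRecord:
    sale_id: int          # key
    product_name: str     # attribute
    quantity: int
    unit_price: float
    payment_method: str

sale = SaleRecord(1001, "Widget", 3, 9.99, "credit_card")
print(sale.product_name)  # every SaleRecord exposes the same fields
```

Because every record has the same shape, structured data like this maps directly onto database tables and analytical queries.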
These three elements are brought together to build data lakehouses. However, technology is also a key part of a data lakehouse.
Key Technologies
The key technologies that allowed data architects to bring lakes and warehouses together include metadata layers, improved performance and open data formats.
Metadata Layers
Metadata layers for a data lake sit on top of an open file format. By tracking which files belong to each version of a table, a metadata layer raises the lakehouse's abstraction level and provides rich management capabilities that are otherwise available only for warehouse data, such as ACID-compliant transactions.
Metadata layers enable common features such as:
- Data validation
- Schema enforcement and evolution
- Support for streaming inputs and outputs
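A minimal sketch of what schema enforcement at a metadata layer does (the schema and helper names here are illustrative assumptions, not any specific product's API): incoming records are validated against a declared schema before they are committed to the table.

```python
# Illustrative sketch of metadata-layer schema enforcement: a table's
# metadata declares expected columns and types, and writes that do not
# conform are rejected before they reach lake storage.
TABLE_SCHEMA = {"sale_id": int, "product_name": str, "quantity": int}

def validate(record: dict, schema: dict) -> bool:
    """Return True only if the record matches the declared schema exactly."""
    if set(record) != set(schema):
        return False  # missing or unexpected columns
    return all(isinstance(record[col], typ) for col, typ in schema.items())

good = {"sale_id": 1, "product_name": "Widget", "quantity": 3}
bad = {"sale_id": "1", "product_name": "Widget"}  # wrong type, missing column

print(validate(good, TABLE_SCHEMA))  # True
print(validate(bad, TABLE_SCHEMA))   # False
```

Schema evolution, in the same spirit, would amount to committing a new version of `TABLE_SCHEMA` alongside the data it governs.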
Newer systems have improved scalability and added capabilities: modern lakehouse systems tend to outperform plain data lakes and offer helpful management features like zero-copy cloning, transactions and time travel to previous versions of a table. Metadata layers are also a natural place to implement data quality enforcement, and these features improve the quality of pipelines built on a data lake.
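As an illustrative sketch of how time travel falls out of version-tracking metadata (a toy model, not any particular table format's implementation): each commit records the list of files that make up that table version, so reading an old version is just reading an old file list, with no data copied.

```python
# Illustrative sketch: time travel via versioned metadata. Each commit
# stores the set of data files that make up that table version, so an
# older version can be read back without copying any data.
class VersionedTable:
    def __init__(self):
        self._versions = []  # version N -> list of data file names

    def commit(self, files: list) -> int:
        self._versions.append(list(files))
        return len(self._versions) - 1  # new version number

    def files_at(self, version: int) -> list:
        return self._versions[version]  # "time travel" to an old version

table = VersionedTable()
v0 = table.commit(["part-0.parquet"])
v1 = table.commit(["part-0.parquet", "part-1.parquet"])
print(table.files_at(v0))  # ['part-0.parquet']
```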
You can also use metadata layers to implement governance features like audit logging and access control. For example, a metadata layer can determine whether someone has access to a table and log every access reliably.
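A hedged sketch of those two governance features together (the table and user names are hypothetical): before serving a table, the metadata layer checks an access-control list and appends every attempt, allowed or denied, to an audit log.

```python
# Illustrative sketch of metadata-layer governance: an access-control
# list gates reads, and every access attempt is reliably logged.
ACL = {"sales": {"alice"}}   # table -> users allowed to read it
AUDIT_LOG = []

def read_table(user: str, table: str) -> bool:
    allowed = user in ACL.get(table, set())
    AUDIT_LOG.append((user, table, "allowed" if allowed else "denied"))
    return allowed

print(read_table("alice", "sales"))  # True, and logged
print(read_table("bob", "sales"))    # False, and logged
print(len(AUDIT_LOG))                # 2 -- every access is recorded
```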
Improved Performance
Improved performance is what let the data lakehouse become a predominant data architecture among businesses. Slow SQL access to data lakes was one of the reasons data previously lived in a two-tier lake-plus-warehouse architecture, so a lakehouse needs to provide fast SQL performance directly on top of Optimized Row Columnar (ORC) or Parquet datasets. Performance features include:
- Caching
- Indexing
- Auditing
- Data versioning
- ACID transactions
- Query optimization
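One of the listed features, caching, can be sketched in a few lines (a toy illustration, not a real query engine): repeated identical queries are served from memory instead of re-scanning slow object storage.

```python
from functools import lru_cache

SCAN_COUNT = 0  # counts how often we touch (slow) object storage

@lru_cache(maxsize=128)
def run_query(sql: str) -> int:
    """Toy query: pretend to scan object storage and return a row count."""
    global SCAN_COUNT
    SCAN_COUNT += 1
    return 42  # stand-in result

run_query("SELECT count(*) FROM sales")
run_query("SELECT count(*) FROM sales")  # served from cache, no new scan
print(SCAN_COUNT)  # 1
```

Real engines cache at many levels (file blocks, decoded columns, query results), but the principle is the same: pay the object-store latency once, not on every query.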
Though data lakes use inexpensive object stores that were previously slow to access, lakehouses use query engine designs that enable high-performance SQL analysis. A data lakehouse accepts SQL and optimizes the data layout within these open formats to achieve performance comparable to that of data warehouses.
Open Data Formats
Data lakehouses use open data formats that make it easy for ML engineers and data scientists to access data: popular tools from the data science (DS) and ML ecosystem can read sources like ORC and Parquet directly. With growing regulatory requirements around data management, businesses may need to delete certain data, search through old datasets or change their data processing infrastructure. Standardizing on an open format helps ensure you always have direct access to your data.
Combining all of these technologies helps data lakehouses rival data warehouses in performance on large datasets.
Features of a Data Lakehouse
A data lakehouse includes many features similar to previous data warehouse and data lake concepts. However, these features are updated to be more effective for the current technology and digital world. Some of the key features enabled by these new technologies include the following:
- Open and flexible storage: When your business uses standardized and open storage formats, you have a head start in preparing your data from curated sources for reporting or analytics. Additionally, the ability to separate storage resources from compute resources makes scaling storage simple.
- Data management: A data lakehouse should be capable of storing raw data and simultaneously supporting Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) processes that curate data to enhance its quality for analysis. Data warehouses tend to offer data management features like schema enforcement, ETL and data cleansing. These management features help you prepare data from curated sources quickly for BI tools and analytics.
- Diverse workloads: Data lakehouses integrate the features of a data lake and a data warehouse, creating the ideal solution for several different types of workloads. Data lakehouses can support diverse workloads in your organization, including analytics tools and business reporting.
- Streaming: Many data sources stream data in real time directly from devices. Data lakehouses are built to support real-time ingestion better than a standard data warehouse, and as IoT devices become more integrated into the world, real-time support matters more than ever.
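The streaming support described above often takes the form of micro-batching, which can be sketched as follows (an illustrative pattern with hypothetical names, not any specific engine's API): device events are buffered and flushed to the table in small batches.

```python
# Illustrative micro-batch ingestion: device events are buffered and
# flushed to the lakehouse table once the batch reaches a size threshold.
class StreamIngestor:
    def __init__(self, batch_size: int = 3):
        self.batch_size = batch_size
        self.buffer = []
        self.table = []  # stand-in for the lakehouse table

    def ingest(self, event: dict):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.table.extend(self.buffer)  # commit one micro-batch
            self.buffer.clear()

ingestor = StreamIngestor(batch_size=2)
ingestor.ingest({"sensor": "t1", "temp": 21.5})
ingestor.ingest({"sensor": "t2", "temp": 19.8})  # triggers a flush
print(len(ingestor.table))  # 2
```

Batching like this keeps ingestion near real time while writing to the lake in efficiently sized commits rather than one file per event.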
Essentially, a data lakehouse enables you to use data management features fundamental to data warehousing along with the raw data stored in an affordable data lake. Request a free demo from RightData to learn more about how the features of a data lakehouse can benefit your organization.