Optimization #4

Late Materialization

Deferring the copying of data until absolutely necessary to minimize memory bandwidth usage.

The Concept

Late materialization is a query optimization technique that delays data retrieval. Instead of fetching all column values at the start, we begin only with the column needed for the next join.

Retrieval is performed using row IDs (unique identifiers for each row). When an operator needs data from a specific column, it uses these row IDs to fetch only the corresponding values.

This significantly reduces unnecessary data movement, as only information crucial for the next step is transferred, improving performance for queries handling subsets of columns from large tables.

Assumptions

Joins are performed on INT32 columns.
Columns are either INT32 or VARCHAR.
Columns without NULL values are compact (dense packing).

Visualizing the Process

Example: A join uses column R.a and produces results as row IDs. If the next join requires R.b, it materializes values only for the active row IDs (e.g., rows 1, 1, 3).

Better String Handling

Columns involved only in the final result (mostly VARCHAR) do not need to be stored before all joins are complete. Since they are rarely join keys, we can change the string representation to a rowid reference.

We retrieve the actual string value only at the very end, saving expensive copy operations for rows that might be filtered out later.

1. New Indexing Structure

We designed a new 64-bit structure to index the initial column store without holding actual string data.

struct StringPtr {
  uint16_t table_id;
  uint16_t column_id;
  uint16_t page_id;
  uint16_t offset;
};

2. The `value_t` Type

Replacing the costly std::variant in the Data type, we defined value_t to efficiently represent both int32 and our new string references.

union value_t {
int32_t int_val;
StringPtr str_ptr;
};

Note: Careful handling of NULLs and Long Strings is essential.

Implementation Changes

ScanNodes
Modified to produce a row store of format vector<value_t>. INT32 columns are fully materialized, while strings use the new pointer representation.
Joins
The join logic remains largely unchanged, as the columns involved in join conditions (INT32) still possess their actual values.
Final Result Generation
At the very end, we convert the final tuples (indices) back to the standard Data variant format. While this final copy step has overhead, delaying it until after filtering and joining provides massive net gains.