Late Materialization
Deferring the copying of data until absolutely necessary to minimize memory bandwidth usage.
The Concept
Late materialization is a query optimization technique that delays data retrieval. Instead of fetching all column values at the start, we begin only with the column needed for the next join.
Retrieval is performed using row IDs (unique identifiers for each row). When an operator needs data from a specific column, it uses these row IDs to fetch only the corresponding values.
This significantly reduces unnecessary data movement, as only information crucial for the next step is transferred, improving performance for queries handling subsets of columns from large tables.
Assumptions
-
Joins are performed on
INT32columns. -
Columns are either
INT32orVARCHAR. - Columns without NULL values are compact (dense packing).
Visualizing the Process
Example: A join uses column R.a and produces results as row IDs. If the next join requires R.b, it materializes values only for the active row IDs (e.g., rows 1, 1, 3).
Better String Handling
Columns involved only in the final result (mostly VARCHAR) do not need to be stored before all joins are complete. Since they are rarely join keys, we can change the string representation to a rowid reference.
We retrieve the actual string value only at the very end, saving expensive copy operations for rows that might be filtered out later.
1. New Indexing Structure
We designed a new 64-bit structure to index the initial column store without holding actual string data.
uint16_t table_id;
uint16_t column_id;
uint16_t page_id;
uint16_t offset;
};
2. The value_t Type
Replacing the costly std::variant in the Data type, we defined value_t to efficiently represent both int32 and our new string references.
int32_t int_val;
StringPtr str_ptr;
};
Note: Careful handling of NULLs and Long Strings is essential.
Implementation Changes
-
ScanNodes
Modified to produce a row store of format
vector<value_t>.INT32columns are fully materialized, while strings use the new pointer representation. -
Joins
The join logic remains largely unchanged, as the columns involved in join conditions (INT32) still possess their actual values.
-
Final Result Generation
At the very end, we convert the final tuples (indices) back to the standard
Datavariant format. While this final copy step has overhead, delaying it until after filtering and joining provides massive net gains.