Indexing Optimization
Implementing a zero-copy strategy for dense integer columns to eliminate redundant memory operations.
The Overhead
In our current implementation, we perform deep copies of all data pages for every table column processed.
This approach makes sense for:
- VARCHAR columns (variable-length data).
- INT32 columns containing gaps (NULL values), which need consolidation for easier management.
However, for dense INT32 columns without NULLs, this copying is pure overhead.
Memory Access Pattern
Zero-Copy Indexing
For INT32 columns that do not contain NULL values, copying data is unnecessary. Instead, we can build the index so that it points directly at the raw pages provided in the input.
This requires adapting the creation of our column_t structure so it can index the original input column directly without owning the memory.
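One way to picture a column_t that can either own its data or merely reference the input is sketched below. This is a minimal illustration, not the project's actual definition: the member names and the borrow/copy factory functions are assumptions made for the example, and it treats a column as a single contiguous run of INT32 values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column_t: every member and function name here is illustrative.
struct column_t {
    // Zero-copy path: remember the caller's pointer, allocate nothing.
    static column_t borrow(const int32_t* values, std::size_t count) {
        assert(values != nullptr || count == 0);
        column_t col;
        col.borrowed_ = values;
        col.count_ = count;
        return col;
    }

    // Copying path: consolidate the values into storage owned by the column.
    static column_t copy(const int32_t* values, std::size_t count) {
        assert(values != nullptr || count == 0);
        column_t col;
        col.owned_.assign(values, values + count);
        col.count_ = count;
        col.owns_ = true;
        return col;
    }

    // Resolved on each call so that copies of an owning column stay valid.
    const int32_t* data() const { return owns_ ? owned_.data() : borrowed_; }
    std::size_t size() const { return count_; }
    bool owns_memory() const { return owns_; }

private:
    const int32_t* borrowed_ = nullptr;  // valid only while the input page is alive
    std::vector<int32_t> owned_;         // stays empty on the zero-copy path
    std::size_t count_ = 0;
    bool owns_ = false;
};
```

The trade-off of the borrowing path is lifetime: the input pages must outlive any column_t that references them, since the column no longer keeps its own copy.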
Implementation Strategy
- Inspect Headers: Check the header of each column page to determine if the column contains NULL values or is dense.
- Direct Pointer Assignment: If dense, assign the internal data pointer of column_t to the address of the input page data, skipping allocation.
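The two steps above can be sketched as a single per-page decision. The page layout below is an assumption for illustration only (a header holding num_rows and num_values, followed by the non-NULL values); page_t, page_view_t, is_dense, and index_page are hypothetical names, and the sparse branch simply copies into a caller-supplied scratch buffer as a stand-in for the existing consolidation logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed header fields, matching the names used in the hint.
struct page_header_t {
    uint16_t num_rows;    // logical rows in the page, NULLs included
    uint16_t num_values;  // non-NULL values actually stored
};

// Hypothetical view of one input page.
struct page_t {
    page_header_t header;
    const int32_t* values;  // assumed to hold num_values entries
};

// What the index ends up pointing at for this page.
struct page_view_t {
    const int32_t* data;
    std::size_t size;
    bool zero_copy;  // true when data aliases the input page
};

// Step 1: a page is dense when every row has a stored value.
inline bool is_dense(const page_header_t& h) {
    return h.num_rows == h.num_values;
}

// Step 2: alias the input for dense pages, copy otherwise.
inline page_view_t index_page(const page_t& page, std::vector<int32_t>& scratch) {
    assert(page.header.num_values <= page.header.num_rows);
    if (is_dense(page.header)) {
        // No allocation and no copy: point at the input page directly.
        return {page.values, page.header.num_values, true};
    }
    // Sparse page: fall back to copying the stored values.
    scratch.assign(page.values, page.values + page.header.num_values);
    return {scratch.data(), scratch.size(), false};
}
```

A column qualifies for the zero-copy path as a whole only if every one of its pages passes the dense check; a single page with NULLs would still need the copying path for that page.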
Technical Hint
"Look at the header of the column pages to understand which columns have null values and which do not."
if (page.header.num_rows == page.header.num_values) {
    // Dense page (no NULLs) -> zero copy
}