CTRL + S our souls
Optimization #8

Indexing Optimization

Implementing a zero-copy strategy for dense integer columns to eliminate redundant memory operations.

The Overhead

In our current implementation, we perform deep copies of all data pages for every table column processed.

This approach makes sense for:

  • VARCHAR columns (variable length data).
  • INT32 columns containing gaps (NULL values), which need consolidation for easier management.

However, for dense INT32 columns without NULLs, this copying is pure overhead.

Memory Access Pattern

Current: Input Page -> Allocation -> Copy
Zero-copy: Input Page -> Direct Reference (Pointer)

Zero-Copy Indexing

For INT32 columns that do not contain NULL values, copying data is unnecessary. Instead, we can build the index so that it points directly at the raw pages provided in the input.

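This requires adapting the creation of our column_t structure so it can index the original input column directly without owning the memory. Because the memory is borrowed, the input pages must stay alive for as long as the column is in use.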
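A minimal sketch of what a column_t that can either own or borrow its values might look like. The field and function names here (owned, borrowed, make_owning, make_borrowing) are hypothetical and not taken from our code base, and a single contiguous span is assumed for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout: the real column_t may look different.
struct column_t {
    std::vector<int32_t> owned;         // filled only on the copy path
    const int32_t* borrowed = nullptr;  // set only on the zero-copy path
    std::size_t count = 0;              // number of values in the column

    // Resolve the pointer on every call so that copying or moving a
    // column_t never leaves a stale pointer into another object's buffer.
    const int32_t* data() const { return borrowed ? borrowed : owned.data(); }
    std::size_t size() const { return count; }
    bool owns_memory() const { return borrowed == nullptr; }
};

// Copy path: the column gets its own private buffer.
column_t make_owning(const int32_t* src, std::size_t n) {
    column_t c;
    c.owned.assign(src, src + n);
    c.count = n;
    return c;
}

// Zero-copy path: the column only remembers where the input values live.
// The caller must keep the input alive for as long as the column is used.
column_t make_borrowing(const int32_t* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    column_t c;
    c.borrowed = src;
    c.count = n;
    return c;
}
```

Code that reads the column goes through data() and size() only, so it does not need to know which of the two paths produced the column.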

Implementation Strategy

  • Inspect Headers

    Check the header of each column page to determine if the column contains NULL values or is dense.

  • Direct Pointer Assignment

    If dense, assign the internal data pointer of column_t to the address of the input page data, skipping allocation.
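The two steps above can be sketched as follows. This is an illustration under assumptions, not our actual implementation: input_page stands for an already-decoded page, and chunk, column_index and build_index are hypothetical names. The column is treated as dense only if every one of its pages is dense.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, already-decoded view of one input page.
struct input_page {
    uint16_t num_rows;      // rows covered by the page, NULLs included
    uint16_t num_values;    // non-NULL values actually stored
    const int32_t* values;  // num_values packed INT32 values
};

// One contiguous run of values that the index can scan.
struct chunk {
    const int32_t* values;
    std::size_t count;
};

struct column_index {
    std::vector<chunk> chunks;   // what the index reads
    std::vector<int32_t> owned;  // stays empty on the zero-copy path

    column_index() = default;
    column_index(column_index&&) = default;
    // A copy would leave chunks pointing into the other object's buffer.
    column_index(const column_index&) = delete;
};

column_index build_index(const std::vector<input_page>& pages) {
    column_index col;

    // Step 1: inspect headers. Dense means no page has fewer values than rows.
    const bool dense = std::all_of(pages.begin(), pages.end(),
        [](const input_page& p) { return p.num_rows == p.num_values; });

    if (dense) {
        // Step 2: direct pointer assignment, no allocation and no copy.
        for (const input_page& p : pages)
            col.chunks.push_back({p.values, p.num_values});
    } else {
        // Fallback: consolidate the non-NULL values into one owned buffer.
        for (const input_page& p : pages)
            col.owned.insert(col.owned.end(), p.values, p.values + p.num_values);
        col.chunks.push_back({col.owned.data(), col.owned.size()});
    }
    return col;
}
```

On the zero-copy path the chunks point into the input pages, so those pages have to outlive the returned column_index.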

Technical Hint

"Look at the header of the column pages to understand which columns have null values and which do not."

// Pseudo-check
if (page.header.num_rows == page.header.num_values) {
  // Dense Page -> Zero Copy
}
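
The pseudo-check can be made concrete once the header layout is fixed. The sketch below assumes a page that starts with two 16-bit counters, num_rows followed by num_values, stored in the machine's native byte order. That layout, and the names page_header, read_header and is_dense, are assumptions for illustration and should be checked against the real page format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed header layout: two 16-bit counters at the start of the page.
struct page_header {
    uint16_t num_rows;    // rows covered by the page, NULLs included
    uint16_t num_values;  // non-NULL values stored in the page
};

// Decode the header with memcpy so that no alignment or aliasing
// assumptions are made about the raw page buffer.
page_header read_header(const unsigned char* page) {
    page_header h;
    std::memcpy(&h.num_rows, page, sizeof h.num_rows);
    std::memcpy(&h.num_values, page + sizeof h.num_rows, sizeof h.num_values);
    return h;
}

// A page is dense when every row has a stored value, i.e. it has no NULLs.
bool is_dense(const page_header& h) {
    return h.num_rows == h.num_values;
}
```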