CTRL + S our souls
Optimization #5

Column Store Joins

Transitioning intermediate results from row-oriented to column-oriented storage to maximize cache locality and throughput.

The Transition

The benefits of column store versus row store organization are well-established for analytical workloads. However, our baseline implementation was still using a row store format for intermediate results produced during joins.

The Goal: To maintain intermediate results in a column store structure. This allows us to exploit CPU cache locality (keeping similar data types together) and significantly accelerate the execution of subsequent joins.

Architectural Shift

Row Store
{id, val}
{id, val}
Column Store
[id, id]
[val, val]

Paged Storage for Intermediate Results

The storage space for data in the intermediate results of each join is now designed as a collection of physically contiguous buffers (pages). Each page contains many elements packed tightly together.

Memory Efficiency

By using large pages instead of allocating memory for individual tuples or small vectors, we minimize the overhead of memory allocation (malloc/new) directly within the algorithm's hot path.

Reduced Allocation Overhead

Future-Proofing for Parallelism

This paged structure is a precursor to parallel probing. It allows each thread to write results into its own private pages without locking.

Partial results from all threads can then be trivially combined by simply collecting the list of pages produced by each thread, requiring zero data copying.

Implementation Details

  • New Structure: column_t

    A paged structure holding value_t elements (defined in the previous step) contiguously. This replaces the old row-based vectors.

  • ScanNode Modification

    ScanNodes were updated to produce intermediate results in the form of vector<column_t> instead of rows.

  • Hash Join Iteration

    The hash join logic was rewritten to iterate over this new columnar structure and produce output in the same format. Crucially, no row store structures remain in execute() after this optimization.