CTRL + S our souls
Optimization #11

Work Stealing

Dynamic load balancing to ensure all threads remain productive until the entire task is complete.

The Problem

When using a fixed number of threads, workload imbalance is a common issue. Some threads may finish their assigned tasks significantly faster than others due to data skew or OS scheduling.

This leads to stragglers: fast threads sit idle waiting for the slower threads to finish, wasting valuable CPU cycles and delaying the final result.

The Solution

Allow fast threads to "steal" work from the queues of busy threads.

Diagram: Thread A (idle) steals a task from the queue of Thread B (overloaded).

Implementation Logic

This optimization is particularly useful during the probing phase. Instead of statically assigning pages to threads, we treat the work as a shared pool.

When a thread finishes its current task, it picks up the next unprocessed page from the shared pool, even if that page was originally intended for another thread.

Atomic Counters

We use a global std::atomic<size_t> counter to manage the index of the next available work unit (e.g., a data page).

#include <atomic>

std::atomic<size_t> next_page(0);

// Worker loop: each thread atomically claims the next page index.
while (true) {
  size_t page_idx = next_page.fetch_add(1);  // equivalent to next_page++
  if (page_idx >= total_pages) break;
  process_page(page_idx);
}

Result Aggregation

Since threads are picking up arbitrary pages, the output results (matches) will be scattered across different thread-local buffers.

Final Step: A merging phase aggregates these fragmented results into the single contiguous output structure required by the execution engine.