CTRL + S our souls
Optimization #9

Building Parallelization

Executing the hash table build phase in parallel using a lock-free, two-pass counting approach.

The Parallel Build Phase

Building the unchained hash table can be efficiently parallelized. The build phase operates on the build-side relation and constructs the global directory and the tuple storage buffer.

To avoid expensive locking mechanisms during insertion, we adopt a two-pass approach. This allows threads to determine exactly where to write data in the global buffer before moving any tuples.

Strategy: Count then Write

  • Pass 1: each thread builds a local per-bucket histogram.
  • Sync: compute global write offsets (exclusive prefix sum).
  • Pass 2: write tuples to their precomputed offsets (lock-free).

The Algorithm

1. Initialization

We initialize the global directory with \( 2^k \) entries and prepare a next_free atomic counter (or a per-thread offset map).

2. Histogram Construction (Pass 1)

The build relation is split into chunks processed by separate threads. Each thread scans its chunk, hashes the keys, and computes a local histogram: it counts how many tuples fall into each directory bucket (based on the hash prefix).

parallel_for (chunk : chunks) {
  for (tuple : chunk) {
    hash = h(tuple.key);
    idx = hash >> (64 - k);   // top k bits select the bucket
    local_counts[idx]++;      // per-thread histogram, no synchronization
  }
}

3. Prefix Sum (Synchronization)

After counting, we compute the exclusive prefix sum of the counts to determine the global start index for each bucket in the contiguous tuple storage. This also gives each thread its specific write offset for every bucket.

Optimization: if Pass 1 instead used atomic adds on a single global histogram, this step would be implicit, but the local-count-then-merge scheme avoids cache-line contention between threads.
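
The merge can be sketched as follows (names illustrative): `local_counts[t][b]` is thread t's Pass-1 count for bucket b, and the exclusive prefix sum walks buckets in order and, within each bucket, threads in order, so every (thread, bucket) pair receives a private, non-overlapping write range in the global buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge per-thread histograms into per-thread, per-bucket write offsets.
std::vector<std::vector<size_t>>
compute_offsets(const std::vector<std::vector<size_t>>& local_counts) {
    const size_t threads = local_counts.size();
    const size_t buckets = local_counts[0].size();
    std::vector<std::vector<size_t>> offsets(threads,
                                             std::vector<size_t>(buckets));
    size_t running = 0;                        // exclusive running total
    for (size_t b = 0; b < buckets; ++b)       // buckets laid out contiguously
        for (size_t t = 0; t < threads; ++t) { // threads partition each bucket
            offsets[t][b] = running;
            running += local_counts[t][b];
        }
    return offsets;
}
```

Because the bucket-major order matches the final storage layout, all tuples of a bucket end up contiguous even though several threads contribute to it.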

4. Writing Tuples (Pass 2)

Threads scan their input chunks a second time. Using the computed offsets, they write the tuples directly into their final positions in the global buffer.

Simultaneously, the Bloom filters in the directory are updated. Since register-sized Bloom filters fit in the unused bits of the pointers (as described in the Unchained Optimization), we update them using atomic OR operations.

parallel_for (chunk : chunks) {
  for (tuple : chunk) {
    hash = h(tuple.key);
    idx = hash >> (64 - k);
    offset = thread_offsets[idx]++;   // this thread's private cursor for the bucket
    global_buffer[offset] = tuple;    // conflict-free: write ranges are disjoint

    // Update Bloom filter in the directory entry (atomic OR)
    filter_mask = compute_bloom(hash);
    directory[idx].filter.atomic_or(filter_mask);
  }
}
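
The Bloom-filter update in the loop above might look like this in C++ (a sketch assuming a 16-bit tag in the upper pointer bits, as in the unchained layout; the slice choice in `compute_bloom` is illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// x86-64 user-space pointers leave the top 16 bits free for a tag.
constexpr uint64_t TAG_SHIFT = 48;

inline uint64_t compute_bloom(uint64_t hash) {
    // Two "hash functions": two 4-bit slices of the hash each set
    // one of the 16 tag bits.
    uint64_t bit1 = uint64_t{1} << (hash & 15);
    uint64_t bit2 = uint64_t{1} << ((hash >> 4) & 15);
    return (bit1 | bit2) << TAG_SHIFT;
}

// fetch_or guarantees no thread's bits are lost under concurrency;
// the pointer part of the entry is untouched.
inline void update_filter(std::atomic<uint64_t>& entry, uint64_t hash) {
    entry.fetch_or(compute_bloom(hash), std::memory_order_relaxed);
}

// Probe side: if any required bit is missing, the bucket definitely
// does not contain the key.
inline bool might_contain(uint64_t entry, uint64_t hash) {
    uint64_t mask = compute_bloom(hash);
    return (entry & mask) == mask;
}
```

Using `fetch_or` keeps the update wait-free on common hardware, which is why sharing one 64-bit word between pointer and filter costs almost nothing during the build.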

Performance Impact

This approach eliminates the need for expensive fine-grained locking or latching on individual buckets. By separating space reservation (counting) from data movement (writing), each thread writes to a disjoint range of the buffer, so we achieve high parallelism with efficient, conflict-free memory writes.