Accumulation Strategies

Two ways to fold child results into the parent’s heap, selected at compile time via the AccumulateStrategy trait. Both preserve child order — accumulate is called in slot order regardless of which worker delivered first. This is what allows hylic’s non-associative accumulate to run correctly in parallel:

```rust
/// Accumulation strategy: how child results flow into the parent's heap.
pub trait AccumulateStrategy: 'static {
    type Spec: Copy + Default + Send + Sync;

    fn deliver<N, H, R>(
        chain: &FoldChain<H, R>, slot: SlotRef, result: R,
        fold: &impl FoldOps<N, H, R>,
    ) -> Option<R>;

    fn set_total<N, H, R>(
        chain: &FoldChain<H, R>,
        fold: &impl FoldOps<N, H, R>,
    ) -> Option<R>;
}
```
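The compile-time selection can be sketched with zero-sized marker types. This is a simplified stand-in for the real trait above (the method names here are illustrative, not hylic's API): the strategy is a type parameter, so the fold is monomorphized per strategy with no runtime dispatch.

```rust
/// Simplified sketch of strategy selection via zero-sized marker types.
trait StrategySketch: 'static {
    fn name() -> &'static str;
}

struct OnArrival;
struct OnFinalize;

impl StrategySketch for OnArrival {
    fn name() -> &'static str { "OnArrival" }
}
impl StrategySketch for OnFinalize {
    fn name() -> &'static str { "OnFinalize" }
}

/// A fold generic over the strategy: the choice is resolved at compile
/// time, so each instantiation is a separate, fully inlined function.
fn fold_with<S: StrategySketch>() -> &'static str {
    S::name()
}

fn main() {
    assert_eq!(fold_with::<OnArrival>(), "OnArrival");
    assert_eq!(fold_with::<OnFinalize>(), "OnFinalize");
}
```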

Both use the same ticket system for last-event detection. They differ in WHEN and HOW the heap is swept.
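The ticket idea can be sketched as follows (a minimal model, not hylic's actual state word): every delivery takes a ticket with `fetch_add`, and the delivery whose ticket equals `total - 1` knows it is the last event.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Take a ticket; return true iff this delivery is the last event.
fn take_ticket(state: &AtomicU64, total: u64) -> bool {
    let ticket = state.fetch_add(1, Ordering::Relaxed);
    ticket + 1 == total
}

fn main() {
    let state = Arc::new(AtomicU64::new(0));
    let total = 8u64;
    let handles: Vec<_> = (0..total)
        .map(|_| {
            let s = Arc::clone(&state);
            thread::spawn(move || take_ticket(&s, total))
        })
        .collect();
    // Exactly one delivery observes itself as the last event.
    let last_events = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&is_last| is_last)
        .count();
    assert_eq!(last_events, 1);
}
```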

OnArrival: streaming sweep

Each delivery tries to sweep contiguous filled slots immediately. A CAS gate (bit 31 of sweep: AtomicU32) ensures only one thread sweeps at a time. A cursor (bits 0-30) tracks sweep progress.

[Diagram: slots 0-3 feed sweep (sweep: AtomicU32, gate bit 31 + cursor bits 0-30); as slots fill, contiguous filled slots are accumulated into heap: H.]

Per-delivery flow:

  1. Write result to slot, filled.store(true, Release)
  2. state.fetch_add(1, Relaxed) — take ticket
  3. Try CAS gate: if won, sweep contiguous filled slots from cursor
  4. If this was the last event (per its ticket), spin until the sweep completes
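The gate/cursor packing described above can be sketched like this (helper names are hypothetical): bit 31 of the word says "someone is sweeping", and bits 0-30 hold the index of the next slot to sweep.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const GATE: u32 = 1 << 31;
const CURSOR_MASK: u32 = GATE - 1;

/// Try to win the sweep gate with a CAS; on success, return the cursor.
fn try_acquire(sweep: &AtomicU32) -> Option<u32> {
    let cur = sweep.load(Ordering::Relaxed);
    if cur & GATE != 0 {
        return None; // another thread holds the gate; back off
    }
    sweep
        .compare_exchange(cur, cur | GATE, Ordering::Acquire, Ordering::Relaxed)
        .ok()
        .map(|prev| prev & CURSOR_MASK)
}

/// Publish the advanced cursor and clear the gate in one Release store.
fn release(sweep: &AtomicU32, new_cursor: u32) {
    sweep.store(new_cursor & CURSOR_MASK, Ordering::Release);
}

fn main() {
    let sweep = AtomicU32::new(0);
    let cursor = try_acquire(&sweep).expect("gate is free");
    assert_eq!(cursor, 0);
    assert!(try_acquire(&sweep).is_none()); // gate held: only one sweeper
    release(&sweep, 3); // swept slots 0..3
    assert_eq!(try_acquire(&sweep), Some(3)); // next sweep resumes at 3
}
```

Losing the CAS is not a correctness problem: the winner will sweep any slot the loser just filled, because the filled flag was published with Release before the loser tried the gate.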

Advantage: Results accumulate as they arrive — lower latency to completion when children finish in order.

Cost: CAS contention on the sweep gate when multiple threads deliver simultaneously. ~16-28ns per delivery depending on gate outcome.

OnFinalize: bulk sweep

Deliveries only store + ticket. No sweep, no CAS gate. The last event (determined by the ticket) bulk-sweeps all slots at once.

[Diagram: slots 0-3 deliver (store + ticket) into the ticket state (state: AtomicU64, events_done/total); the last event bulk-accumulates by sweeping ALL slots into heap: H.]

Per-delivery flow:

  1. Write result to slot, filled.store(true, Release)
  2. state.fetch_add(1, Relaxed) — take ticket
  3. If last event: iterate all slots, filled.load(Acquire) each, accumulate, finalize
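The whole OnFinalize flow can be sketched end to end (a simplified model: `AtomicI64` slots stand in for the real typed result slots). The accumulate here is deliberately non-commutative, to show that slot order, not delivery order, decides the result:

```rust
use std::sync::atomic::{AtomicBool, AtomicI64, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct Chain {
    slots: Vec<(AtomicBool, AtomicI64)>, // (filled flag, result value)
    tickets: AtomicU64,
}

fn deliver(chain: &Chain, slot: usize, result: i64) -> Option<i64> {
    // 1. Write the result, then publish the filled flag with Release.
    chain.slots[slot].1.store(result, Ordering::Relaxed);
    chain.slots[slot].0.store(true, Ordering::Release);
    // 2. Take a ticket.
    let ticket = chain.tickets.fetch_add(1, Ordering::Relaxed);
    // 3. The last event bulk-sweeps every slot, in slot order.
    if ticket + 1 == chain.slots.len() as u64 {
        let mut acc = 0i64;
        for (filled, value) in &chain.slots {
            while !filled.load(Ordering::Acquire) {
                std::hint::spin_loop(); // rare: publication not yet visible
            }
            // Non-commutative accumulate (append a digit).
            acc = acc * 10 + value.load(Ordering::Relaxed);
        }
        Some(acc)
    } else {
        None
    }
}

fn main() {
    let chain = Arc::new(Chain {
        slots: (0..4)
            .map(|_| (AtomicBool::new(false), AtomicI64::new(0)))
            .collect(),
        tickets: AtomicU64::new(0),
    });
    // Deliver in reverse slot order, each from its own thread.
    let handles: Vec<_> = (0..4).rev()
        .map(|i| {
            let c = Arc::clone(&chain);
            thread::spawn(move || deliver(&c, i, i as i64 + 1))
        })
        .collect();
    let result = handles
        .into_iter()
        .filter_map(|h| h.join().unwrap())
        .next()
        .expect("exactly one last event");
    assert_eq!(result, 1234); // slot order wins over delivery order
}
```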

Advantage: Zero CAS contention per delivery. Each delivery is ~11ns (store + fetch_add only).

Cost: The finalizer must spin-wait on any slot whose filled hasn’t been published yet (rare — Release/Acquire propagates quickly). Cold-cache sweep if slots were written by other cores.

When to use which

| Workload | Strategy | Why |
|---|---|---|
| Init-heavy, graph-heavy | OnArrival | Streaming pipelines the computation |
| Balanced, finalize-heavy | OnFinalize | Minimal per-delivery overhead |
| Wide trees (bf > 10) | Either | OnArrival if results arrive in order |
| Deep narrow (bf = 2) | OnFinalize | Only 2 slots — sweep trivial |

Memory footprint

Both strategies use destructive reads — results are moved out of their slots during accumulation, then dropped. For result types that own heap memory (String, Vec, etc.), this means resources are freed progressively as the sweep advances, not held alive until fold completion.
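A destructive read can be illustrated with `Option::take`, used here as a safe stand-in for however the real slots move results out: ownership leaves the slot at accumulation time, so the owned allocation is freed immediately rather than at fold completion.

```rust
fn main() {
    // A filled slot holding a result that owns heap memory.
    let mut slot: Option<String> = Some("payload".to_string());

    // Destructive read: the String is moved out of the slot.
    let value = slot.take().expect("slot was filled");
    assert!(slot.is_none()); // the slot no longer keeps the allocation alive

    let len = value.len();
    drop(value); // freed here, as the sweep advances
    assert_eq!(len, 7);
}
```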

With OnArrival, the live memory at any point is bounded by the number of delivered-but-not-yet-swept results (typically much smaller than total results). With OnFinalize, all results are delivered before the bulk sweep, so peak memory equals the node’s child count × result size — freed in one pass.

This is relevant for folds over large trees where each result carries significant heap data (parsed documents, artifact records, aggregated datasets). See Infrastructure for the arena allocation model.