Cascade: The Trampolined Upward Pass
When a child completes, fire_cont delivers its result and cascades
upward through the continuation chain. It is a loop, not
recursion — zero stack growth. One thread can cascade from leaf to
root without touching the queue.
This is the dual of walk_cps: walk descends,
fire_cont ascends. Together they form a single DFS round-trip —
down and back up — on one thread for the inline spine, handing off
across threads at multi-child boundaries.
The function
#![allow(unused)]
fn main() {
pub(crate) fn fire_cont<N, H, R, F, G, P: FunnelPolicy>(
ctx: &WalkCtx<'_, F, G, H, R, P>,
mut cont: Cont<H, R>,
mut result: R,
) where
F: FoldOps<N, H, R> + 'static,
G: TreeOps<N> + 'static,
N: Clone + Send + 'static,
H: 'static,
R: Send + 'static,
{
loop {
match cont {
Cont::Root(cell_ptr) => {
// SAFETY: cell_ptr points to stack-local RootCell in run_fold.
// The scoped pool guarantees this thread finishes before run_fold returns.
let cell = unsafe { &*cell_ptr };
cell.set(result);
let view = ctx.view_ref();
view.fold_done.store(true, Ordering::Release);
view.event().notify_all();
return;
}
Cont::Direct { mut heap, parent_idx } => {
let fold = ctx.fold_ref();
fold.accumulate(&mut heap, &result);
result = fold.finalize(&heap);
// SAFETY: parent_idx came from cont_arena.alloc in
// walk_cps; Direct conts are consumed exactly once
// (the child's fire_cont path), so this take is the
// sole reader.
cont = unsafe { ctx.cont_arena().take(parent_idx) };
}
Cont::Slot { node: node_idx, slot } => {
let arena = ctx.chain_arena();
// SAFETY: node_idx came from chain_arena.alloc in
// walk_cps and outlives every Cont::Slot that holds it
// (arena is freed by run_fold after all workers join).
let node = unsafe { arena.get(node_idx) };
let fold = ctx.fold_ref();
let delivered = P::Accumulate::deliver(&node.chain, slot, result, fold);
match delivered {
Some(finalized) => {
cont = node.take_parent_cont();
result = finalized;
}
None => return,
}
}
}
}
}
}
Three continuation variants, three behaviors, one loop.
Per-variant behavior
Cont::Root — terminal
The fold is complete. Write the result to the RootCell, set
fold_done, notify all parked workers. Cost: ~5ns. One cell write,
one atomic store, one futex wake.
Cont::Direct — single-child fast path
Accumulate the child result into the heap. Finalize. Take the parent
continuation from the ContArena. Continue the loop. No atomics, no
synchronization — pure sequential speed. The heap was moved INTO the
continuation by walk_cps. The entire single-child spine collapses
into loop iterations at ~2ns overhead + user work each.
Cont::Slot — multi-child delivery
Deliver the result to the FoldChain slot via
P::Accumulate::deliver(). The ticket system
determines if this thread is the last to arrive. If yes: sweep or
finalize the chain, take the parent continuation, continue cascading.
If no: return — this thread’s cascade is done.
This is where parallelism meets sequentiality. Multiple threads race to deliver. Exactly one wins. The winner cascades; the losers return to the help loop to steal more work.
The cascade as a round-trip
The walk and cascade form a symmetric pair — down through task creation, up through result delivery:
A leaf fires upward. Direct levels collapse at sequential speed. The Slot level requires atomic delivery and a ticket check. Root stores the result and signals completion.
Parallel interleaving
The cascade runs WITHOUT touching the queue. One thread goes up while other threads simultaneously go down:
Thread 0’s delivery to R and Thread 1’s delivery to R are concurrent
atomic operations on R’s FoldChain. The ticket
determines which thread cascades past R.
Cache warmth
The same thread that walks DOWN the DFS spine walks back UP via
fire_cont. ChainNodes allocated on the way down are in L1 cache
on the way up — no cross-core transfer. This is a structural
consequence of first-child inlining:
the allocating thread is the reading thread.
Compile-time accumulation dispatch
In the Cont::Slot arm, the accumulation strategy is resolved at
compile time:
let delivered = P::Accumulate::deliver(&node.chain, slot, result, fold);
P::Accumulate is an associated type on FunnelPolicy, resolved
via monomorphization. No runtime branch — the compiler inlines
deliver_and_sweep (OnArrival) or deliver_and_finalize (OnFinalize)
directly. See Accumulation for the two strategies.