Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Zero-cost performance

The closure-based API (Fold from a domain module, plus Treeish) is the ergonomic default. For performance-critical paths, the graph side admits user-defined TreeOps implementations whose visit method monomorphises directly. The fold side does not — the executor signature pins the fold type to D::Fold<H, R> (the closure-based wrapper). The two ops traits are nevertheless the right vocabulary for thinking about per-node cost.

The overhead budget

Per node with K children, the fused executor makes these calls through the closure-based API:

Call siteCountDispatch
fold.init(node)1dyn Fn via Arc/Rc/Box
graph.visit(node, cb)1dyn Fn via Arc/Rc/Box
cb(child) inside visitK&mut dyn FnMut callback
fold.accumulate(heap, &r)Kdyn Fn via Arc/Rc/Box
fold.finalize(heap)1dyn Fn via Arc/Rc/Box
Total3+2K

Measured: ~0.47 ns per indirect call (well-predicted by the branch predictor). On a noop workload (bf=8, 200 nodes): ~1.8 µs above hand-written recursion. On any real workload (>10 µs/node) the overhead drops below the noise floor.

Eliminating graph dispatch: implement TreeOps

The executor’s graph parameter is generic — G: TreeOps<N> — so any concrete impl is monomorphised at the call site:

#![allow(unused)]
fn main() {
    #[test]
    fn zero_cost_treeops() {
        use hylic::prelude::*;
        use hylic::ops::TreeOps;

        #[derive(Clone)]
        struct TreeNode { id: usize, value: u64 }

        struct AdjGraph {
            adj:   Vec<Vec<usize>>,
            nodes: Vec<TreeNode>,
        }
        impl TreeOps<TreeNode> for AdjGraph {
            fn visit(&self, node: &TreeNode, cb: &mut dyn FnMut(&TreeNode)) {
                for &child_id in &self.adj[node.id] {
                    cb(&self.nodes[child_id]);
                }
            }
        }

        let graph = AdjGraph {
            adj: vec![vec![1, 2], vec![], vec![]],
            nodes: vec![
                TreeNode { id: 0, value: 1 },
                TreeNode { id: 1, value: 2 },
                TreeNode { id: 2, value: 3 },
            ],
        };

        let f: Fold<TreeNode, u64, u64> = fold(
            |n: &TreeNode| n.value,
            |h: &mut u64, r: &u64| *h += r,
            |h: &u64| *h,
        );
        let root = graph.nodes[0].clone();
        let total: u64 = FUSED.run(&f, &graph, &root);
        assert_eq!(total, 6);
    }
}

AdjGraph::visit is a direct, inlinable loop. Only the callback cb: &mut dyn FnMut is still indirect — K calls per node. The closure-based fold is still in the picture because executors take &D::Fold<H, R>; replacing it requires a custom executor (below).

The shape of the trait

FoldOps and TreeOps are the operation traits any user code can target:

pub trait FoldOps<N, H, R> {
    fn init(&self, node: &N) -> H;
    fn accumulate(&self, heap: &mut H, result: &R);
    fn finalize(&self, heap: &H) -> R;
}

pub trait TreeOps<N> {
    fn visit(&self, node: &N, cb: &mut dyn FnMut(&N));
}

The closure-based domain folds (shared::Fold, local::Fold, owned::Fold) implement FoldOps by delegating to their stored closures. A user-defined FoldOps struct is callable from any custom executor that drives the recursion through the trait — bypassing the closure layer entirely. The shipped Fused executor’s inner loop is exactly this:

#![allow(unused)]
fn main() {
fn recurse<N, H, R>(
    fold: &impl FoldOps<N, H, R>,
    graph: &impl TreeOps<N>,
    node: &N,
) -> R {
    let mut heap = fold.init(node);
    graph.visit(node, &mut |child: &N| {
        let r = recurse(fold, graph, child);
        fold.accumulate(&mut heap, &r);
    });
    fold.finalize(&heap)
}
}

When the budget matters

PathPer-node overheadWhen to use
Closure-based Fold + Treeish3+2K indirect callsDefault — combinators, lifts, sugars
Closure-based Fold + custom TreeOpsK+1 indirect callsAdjacency lists or graph types where the visit path is the hot side
Custom executor over FoldOps + TreeOpsK indirect callsMaximum control; sacrifices the lift / pipeline machinery for one specific shape

Why LTO doesn’t help

LLVM cannot devirtualise Rust dyn Fn calls. Rust does not emit the !vcall_visibility metadata that LLVM’s whole-program devirtualisation would need. Neither thin LTO nor fat LTO changes this. The trait-based path is the only reliable way to eliminate dispatch.

See Benchmarks for the measured comparison across all execution modes.