Zero-cost performance

The closure-based API (Fold from a domain module, plus Treeish) is the ergonomic default. For performance-critical paths, the graph side admits user-defined TreeOps implementations whose visit method monomorphises directly. The fold side does not — the executor signature pins the fold type to D::Fold<H, R> (the closure-based wrapper). The two ops traits are nevertheless the right vocabulary for thinking about per-node cost.

The overhead budget

Per node with K children, the fused executor makes these calls through the closure-based API:

Call site	Count	Dispatch
`fold.init(node)`	1	`dyn Fn` via Arc/Rc/Box
`graph.visit(node, cb)`	1	`dyn Fn` via Arc/Rc/Box
`cb(child)` inside visit	K	`&mut dyn FnMut` callback
`fold.accumulate(heap, &r)`	K	`dyn Fn` via Arc/Rc/Box
`fold.finalize(heap)`	1	`dyn Fn` via Arc/Rc/Box
Total	3+2K

Measured: ~0.47 ns per indirect call (well-predicted by the branch predictor). On a noop workload (bf=8, 200 nodes): ~1.8 µs above hand-written recursion. On any real workload (>10 µs/node) the overhead drops below the noise floor.

Eliminating graph dispatch: implement TreeOps

The executor’s graph parameter is generic — G: TreeOps<N> — so any concrete impl is monomorphised at the call site:

#![allow(unused)]
fn main() {
    #[test]
    fn zero_cost_treeops() {
        use hylic::prelude::*;
        use hylic::ops::TreeOps;

        #[derive(Clone)]
        struct TreeNode { id: usize, value: u64 }

        struct AdjGraph {
            adj:   Vec<Vec<usize>>,
            nodes: Vec<TreeNode>,
        }
        impl TreeOps<TreeNode> for AdjGraph {
            fn visit(&self, node: &TreeNode, cb: &mut dyn FnMut(&TreeNode)) {
                for &child_id in &self.adj[node.id] {
                    cb(&self.nodes[child_id]);
                }
            }
        }

        let graph = AdjGraph {
            adj: vec![vec![1, 2], vec![], vec![]],
            nodes: vec![
                TreeNode { id: 0, value: 1 },
                TreeNode { id: 1, value: 2 },
                TreeNode { id: 2, value: 3 },
            ],
        };

        let f: Fold<TreeNode, u64, u64> = fold(
            |n: &TreeNode| n.value,
            |h: &mut u64, r: &u64| *h += r,
            |h: &u64| *h,
        );
        let root = graph.nodes[0].clone();
        let total: u64 = FUSED.run(&f, &graph, &root);
        assert_eq!(total, 6);
    }
}

AdjGraph::visit is a direct, inlinable loop. Only the callback cb: &mut dyn FnMut is still indirect — K calls per node. The closure-based fold is still in the picture because executors take &D::Fold<H, R>; replacing it requires a custom executor (below).

The shape of the trait

FoldOps and TreeOps are the operation traits any user code can target:

pub trait FoldOps<N, H, R> {
    fn init(&self, node: &N) -> H;
    fn accumulate(&self, heap: &mut H, result: &R);
    fn finalize(&self, heap: &H) -> R;
}

pub trait TreeOps<N> {
    fn visit(&self, node: &N, cb: &mut dyn FnMut(&N));
}

The closure-based domain folds (shared::Fold, local::Fold, owned::Fold) implement FoldOps by delegating to their stored closures. A user-defined FoldOps struct is callable from any custom executor that drives the recursion through the trait — bypassing the closure layer entirely. The shipped Fused executor’s inner loop is exactly this:

#![allow(unused)]
fn main() {
fn recurse<N, H, R>(
    fold: &impl FoldOps<N, H, R>,
    graph: &impl TreeOps<N>,
    node: &N,
) -> R {
    let mut heap = fold.init(node);
    graph.visit(node, &mut |child: &N| {
        let r = recurse(fold, graph, child);
        fold.accumulate(&mut heap, &r);
    });
    fold.finalize(&heap)
}
}

When the budget matters

Path	Per-node overhead	When to use
Closure-based Fold + Treeish	3+2K indirect calls	Default — combinators, lifts, sugars
Closure-based Fold + custom TreeOps	K+1 indirect calls	Adjacency lists or graph types where the visit path is the hot side
Custom executor over `FoldOps + TreeOps`	K indirect calls	Maximum control; sacrifices the lift / pipeline machinery for one specific shape

Why LTO doesn’t help

LLVM cannot devirtualise Rust dyn Fn calls. Rust does not emit the !vcall_visibility metadata that LLVM’s whole-program devirtualisation would need. Neither thin LTO nor fat LTO changes this. The trait-based path is the only reliable way to eliminate dispatch.

See Benchmarks for the measured comparison across all execution modes.