Zero-cost performance
The closure-based API (Fold from a domain module, plus Treeish)
is the ergonomic default. For performance-critical paths, the
graph side admits user-defined TreeOps implementations whose
visit method monomorphises directly. The fold side does not — the
executor signature pins the fold type to D::Fold<H, R> (the
closure-based wrapper). The two ops traits are nevertheless the
right vocabulary for thinking about per-node cost.
The overhead budget
Per node with K children, the fused executor makes these calls through the closure-based API:
| Call site | Count | Dispatch |
|---|---|---|
fold.init(node) | 1 | dyn Fn via Arc/Rc/Box |
graph.visit(node, cb) | 1 | dyn Fn via Arc/Rc/Box |
cb(child) inside visit | K | &mut dyn FnMut callback |
fold.accumulate(heap, &r) | K | dyn Fn via Arc/Rc/Box |
fold.finalize(heap) | 1 | dyn Fn via Arc/Rc/Box |
| Total | 3+2K |
Measured: ~0.47 ns per indirect call (well-predicted by the branch predictor). On a noop workload (bf=8, 200 nodes): ~1.8 µs above hand-written recursion. On any real workload (>10 µs/node) the overhead drops below the noise floor.
Eliminating graph dispatch: implement TreeOps
The executor’s graph parameter is generic — G: TreeOps<N> —
so any concrete impl is monomorphised at the call site:
#![allow(unused)]
fn main() {
#[test]
fn zero_cost_treeops() {
use hylic::prelude::*;
use hylic::ops::TreeOps;
#[derive(Clone)]
struct TreeNode { id: usize, value: u64 }
struct AdjGraph {
adj: Vec<Vec<usize>>,
nodes: Vec<TreeNode>,
}
impl TreeOps<TreeNode> for AdjGraph {
fn visit(&self, node: &TreeNode, cb: &mut dyn FnMut(&TreeNode)) {
for &child_id in &self.adj[node.id] {
cb(&self.nodes[child_id]);
}
}
}
let graph = AdjGraph {
adj: vec![vec![1, 2], vec![], vec![]],
nodes: vec![
TreeNode { id: 0, value: 1 },
TreeNode { id: 1, value: 2 },
TreeNode { id: 2, value: 3 },
],
};
let f: Fold<TreeNode, u64, u64> = fold(
|n: &TreeNode| n.value,
|h: &mut u64, r: &u64| *h += r,
|h: &u64| *h,
);
let root = graph.nodes[0].clone();
let total: u64 = FUSED.run(&f, &graph, &root);
assert_eq!(total, 6);
}
}
AdjGraph::visit is a direct, inlinable loop. Only the callback
cb: &mut dyn FnMut is still indirect — K calls per node. The
closure-based fold is still in the picture because executors take
&D::Fold<H, R>; replacing it requires a custom executor (below).
The shape of the trait
FoldOps and TreeOps are the operation traits any user code
can target:
pub trait FoldOps<N, H, R> {
fn init(&self, node: &N) -> H;
fn accumulate(&self, heap: &mut H, result: &R);
fn finalize(&self, heap: &H) -> R;
}
pub trait TreeOps<N> {
fn visit(&self, node: &N, cb: &mut dyn FnMut(&N));
}
The closure-based domain folds (shared::Fold, local::Fold,
owned::Fold) implement FoldOps by delegating to their stored
closures. A user-defined FoldOps struct is callable from any
custom executor that drives the recursion through the trait —
bypassing the closure layer entirely. The shipped Fused
executor’s inner loop is exactly this:
#![allow(unused)]
fn main() {
fn recurse<N, H, R>(
fold: &impl FoldOps<N, H, R>,
graph: &impl TreeOps<N>,
node: &N,
) -> R {
let mut heap = fold.init(node);
graph.visit(node, &mut |child: &N| {
let r = recurse(fold, graph, child);
fold.accumulate(&mut heap, &r);
});
fold.finalize(&heap)
}
}
When the budget matters
| Path | Per-node overhead | When to use |
|---|---|---|
| Closure-based Fold + Treeish | 3+2K indirect calls | Default — combinators, lifts, sugars |
| Closure-based Fold + custom TreeOps | K+1 indirect calls | Adjacency lists or graph types where the visit path is the hot side |
Custom executor over FoldOps + TreeOps | K indirect calls | Maximum control; sacrifices the lift / pipeline machinery for one specific shape |
Why LTO doesn’t help
LLVM cannot devirtualise Rust dyn Fn calls. Rust does not emit
the !vcall_visibility metadata that LLVM’s whole-program
devirtualisation would need. Neither thin LTO nor fat LTO changes
this. The trait-based path is the only reliable way to eliminate
dispatch.
See Benchmarks for the measured comparison across all execution modes.