Benchmark results
Wall-clock means from criterion across four harnesses. The
Overhead harness pits Fused against handrolled
single-threaded recursions to measure framework cost. The
Matrix harness runs Funnel's 16 policy variants
alongside Rayon and a scoped pool, all parallel, across 14
workload scenarios. Module simulation runs a synthetic
dependency-graph resolver — the workload that originally
motivated the library. Quick is a small subset of the
Matrix grid, used to track changes during development.
What the numbers say
Sequential first. Fused lands within ±20% of hand.seq
on every row of the Overhead bench, faster on 8 of 11. The
spread against real.seq (a plain fn f(&T) -> R with no
hylic types in sight) is within ±16%. The library’s
fold/treeish indirection is, in this regime, on the order of
compiler-level noise rather than an integer multiple. The
plausible reason is a uniform per-node shape that monomorphises
predictably plus closures held inside Fold and Treeish
that the compiler can inline through; whatever the cause, the
practical statement is parity, not dominance.
The parallel picture is more interesting. A Funnel variant
is the row winner on 10 of 14 Matrix workloads. On the
remaining 4 the row winner is handrolled and the nearest
Funnel variant lies within a few percent. No single policy
preset wins across the grid: shallow-wide workloads prefer
Shared queues with OnArrival accumulation; deep-narrow
prefer PerWorker with OnFinalize; the wake axis can move a
row by 10–30% on its own. The 14-row table below is most
useful read row-by-row — for any one workload, the policy that
wins tells you something about the workload’s shape.
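The axis trade-offs above rest on policies being compile-time choices. As a standalone illustration of the mechanism (hypothetical types and a toy `Vec<u64>` accumulator, not Funnel's actual traits), selecting the accumulation strategy by type parameter lets the compiler specialise the whole walk per policy, with no runtime dispatch:

```rust
// Hypothetical sketch of compile-time policy selection, not Funnel's real API.
// Each policy is a zero-sized type; the generic reducer monomorphises per
// policy, so the strategy choice costs nothing at runtime.
trait AccumulationPolicy {
    // Fold a child result into the running accumulator.
    fn deliver(acc: &mut Vec<u64>, child: u64);
    // Called once all children of a node are complete.
    fn finalize(acc: &mut Vec<u64>) -> u64;
}

struct OnArrival;
impl AccumulationPolicy for OnArrival {
    fn deliver(acc: &mut Vec<u64>, child: u64) {
        // Fold eagerly: keep only a running sum, freeing the slot immediately.
        if acc.is_empty() {
            acc.push(0);
        }
        acc[0] += child;
    }
    fn finalize(acc: &mut Vec<u64>) -> u64 {
        acc.drain(..).sum()
    }
}

struct OnFinalize;
impl AccumulationPolicy for OnFinalize {
    fn deliver(acc: &mut Vec<u64>, child: u64) {
        // Buffer until all siblings are complete, then drain in one pass.
        acc.push(child);
    }
    fn finalize(acc: &mut Vec<u64>) -> u64 {
        acc.drain(..).sum()
    }
}

// Generic over the policy: each instantiation is a separate, fully
// inlinable function.
fn reduce_children<P: AccumulationPolicy>(children: &[u64]) -> u64 {
    let mut acc = Vec::new();
    for &c in children {
        P::deliver(&mut acc, c);
    }
    P::finalize(&mut acc)
}
```

Both instantiations produce the same result; they differ only in when the folding work happens, which is exactly the axis the Matrix rows discriminate on.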
The Module-simulation harness probes the same trade-off. On
the four _fast rows (large-dense, large-sparse, small-dense,
small-sparse) Funnel variants win three of four, with
different policy axes per row, unsurprising given the Matrix
story. On the _slow rows, where per-node work dominates and
scheduling ceases to matter, the runners cluster.
These properties of Funnel are statements about the source,
not inferences from the benchmarks. Policies are monomorphised
(Funnel<P> is generic, the entire walk specialises per
policy, no runtime dispatch on strategy). Continuations are
defunctionalised — Cont<H, R> is a three-variant enum
(Root, Direct, Slot); the inner loop is match cont in
a loop, no Box<dyn FnOnce> per step. Continuations and
fold chains live in arenas (ChainNode<H, R> in a scoped
Arena, Cont<H, R> in a ContArena, both released in bulk
at the end of the pool’s lifetime; no per-node malloc/free).
Under the OnArrival accumulation policy, each child result
is folded into its parent’s heap on arrival via
P::Accumulate::deliver, and the slot is freed; OnFinalize
buffers until siblings are complete and then drains. The walk
references the user’s fold and treeish by &'a _, with the
lifetime tied to the pool’s with(...) scope; user closures
are not cloned into worker queues. Queue topology is a
compile-time choice — per-worker deques (local push, remote
steal) or a single shared FIFO — and selection is per workload
rather than universal.
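The defunctionalisation point can be shown in miniature (hypothetical types, not Funnel's actual Cont<H, R> or its Root/Direct/Slot variants): pending work becomes a small enum driven by a match in a loop, rather than a boxed closure per step:

```rust
// Standalone sketch of defunctionalised continuations, using a toy binary
// tree sum. Hypothetical types, not Funnel's Cont<H, R>.
enum Tree {
    Leaf(u64),
    Node(Box<Tree>, Box<Tree>),
}

// Each variant names one "what to do next" shape; the continuation stack
// lives in one Vec, with no Box<dyn FnOnce> allocated per step.
enum Cont<'a> {
    // After the left subtree, walk this right subtree.
    ThenRight(&'a Tree),
    // After the right subtree, add the saved left result.
    ThenAdd(u64),
}

fn sum(tree: &Tree) -> u64 {
    let mut conts: Vec<Cont> = Vec::new();
    let mut current = tree;
    let mut result: u64;
    loop {
        // Descend to a leaf, recording continuations on the way down.
        loop {
            match current {
                Tree::Leaf(v) => {
                    result = *v;
                    break;
                }
                Tree::Node(l, r) => {
                    conts.push(Cont::ThenRight(r));
                    current = l;
                }
            }
        }
        // The inner driver: `match cont` in a loop.
        loop {
            match conts.pop() {
                None => return result,
                Some(Cont::ThenRight(r)) => {
                    conts.push(Cont::ThenAdd(result));
                    current = r;
                    break; // descend into the right subtree
                }
                Some(Cont::ThenAdd(left)) => result += left,
            }
        }
    }
}
```

The enum's variants enumerate every continuation shape the walk can produce, which is what makes the arena storage described above possible: a fixed-size value in a bulk-freed arena instead of an individually heap-allocated closure.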
See the Funnel deep-dive for the walk, ticket system, and arena details, and Policies and presets for the policy traits.
Interactive: Funnel axes viewer
The Matrix bench output filtered by policy axis, marginalised
on demand, with cell-level deviations from real.rayon.
Overhead
make -C hylic-benchmark bench-overhead
The Overhead table also lists several parallel runners
(real.rayon, hylic-rayon, hand.rayon, hylic-parref+rayon,
hylic-eager+rayon) for cross-reference. They are not the
denominators for sequential-overhead statements; a parallel
runner beating a sequential one says that multiple cores are
faster than one, not that the framework is slow.
For a framework-vs-handrolled comparison in the parallel
regime, hylic-rayon versus real.rayon is the
apples-to-apples pair: within ±15% on most rows, with a
worst-case +33% on parse-lt_sm. That is a real framework tax
on the parallel path; whether it’s tolerable depends on the
choice between Funnel and a Rayon-backed executor.
Matrix
make -C hylic-benchmark bench-matrix
Each cell shows the wall-clock mean and the +X% deviation
from the row’s fastest entry; the row winner is marked
(best). Reading a few rows together brings out the
policy-axis story.
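The cell labels are relative, so a sketch of how a row could be rendered helps with reading them (a hypothetical helper, not the harness's actual formatting code):

```rust
// Hypothetical row-labelling helper: deviation is computed against the
// row's fastest mean, and the winner is tagged "(best)".
fn label_row(means_ms: &[f64]) -> Vec<String> {
    let best = means_ms.iter().cloned().fold(f64::INFINITY, f64::min);
    means_ms
        .iter()
        .map(|&m| {
            if m == best {
                format!("{m:.1}ms (best)")
            } else {
                // +X% relative to the row winner, rounded to whole percent.
                format!("{m:.1}ms (+{:.0}%)", (m / best - 1.0) * 100.0)
            }
        })
        .collect()
}
```

Note the consequence for cross-row reading: a +6% in one row and a +6% in another are relative to different winners, so percentages compare within a row only.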
wide_sm (200 nodes, branching 20):
funnel.pw.arrv.push = 6.2ms (best), 20% ahead of both
hand.pool and hand.rayon at 7.5ms. Wide fan-out plus
immediate OnArrival delivery and per-worker deques keep the
push cheap and drain the child heap as siblings complete.
graph-hv_sm (heavy edge-discovery, modelling a dependency
resolver): funnel.sh.fin.k2 = 16.4ms (best), 2% ahead of
hand.rayon at 16.8ms. Dropping the wake frequency to every
second child amortises the edge-discovery cost better than the
handrolled approaches.
The 4 rows where handrolled wins are bal_sm, io_sm,
graph-io_sm, noop_sm. On bal_sm, hand.rayon = 16.1ms
versus funnel.sh.fin.push = 17.0ms (+6%). noop_sm is the
zero-work cell: it is dominated by per-node bookkeeping,
absolute times are sub-millisecond, and percentage deltas are
therefore misleading. The framework cost is most visible
there, and it is unavoidable for any tree-shaped recursive
parallelisation.
Module simulation
make -C hylic-benchmark bench-modsim
Eight workloads on two axes — sparse vs dense graph, fast vs
slow per-node work. On the four _fast rows, Funnel
variants take three of four winners (funnel.pw.fin.push = 1.0ms on large-dense_fast, funnel.pw.arrv.push = 1.0ms
on large-sparse_fast, funnel.sh.arrv.push = 0.3ms on
small-sparse_fast); the fourth, small-dense_fast, sits
near 0.3ms across runners. For dependency-graph-shaped
workloads with cheap per-node work — the common case for a
module resolver — Funnel is the faster choice. Where
per-node work dominates, scheduler choice ceases to matter and
the runners converge.
Quick
make bench-quick-light
Five runners — real.rayon plus four Funnel variants
covering both queue axes (PerWorker, Shared) and both
accumulation axes (OnArrival, OnFinalize), all with EveryK<4>
wake. Nine scenarios chosen for variation: noop, hash,
parse-lt, parse-hv, aggr, xform, bal, wide,
graph-hv. Near-parity scenarios (io, deep, fin,
graph-io, lg-dense) are excluded.
The -ab variants run the same bench across multiple git
revisions of hylic, archiving each run with a timestamp.
Further revisions can be added by appending label=gitref to
the makefile target.
Workload scenarios
Each scenario is a TreeSpec (node count, branching factor)
and a WorkSpec (per-phase CPU burn amounts plus an optional
I/O spin-wait). busy_work is the deterministic u64 LCG
loop inside black_box; spin_wait_us is a wall-clock
busy-wait. The scenarios are synthetic — the intent is to
cover a shape space (shallow-wide, deep-narrow,
accumulate-heavy, finalize-heavy, I/O-bound, graph-discovery-
heavy) rather than reproduce any specific production workload.
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/scenario.rs:scenario_catalog}}
}
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/work.rs:work_spec}}
}
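The busy_work loop described above can be sketched as follows. This is a minimal illustration of the pattern (a deterministic u64 LCG stepped inside black_box), with Knuth's MMIX constants assumed; the harness's actual constants and shape may differ:

```rust
use std::hint::black_box;

// Sketch of a deterministic CPU-burn loop in the busy_work style: a u64
// linear congruential generator stepped `iters` times. black_box keeps the
// compiler from folding the loop away, while the LCG keeps the work
// deterministic (same seed and iters always yield the same result).
fn busy_work(seed: u64, iters: u64) -> u64 {
    let mut x = seed;
    for _ in 0..iters {
        // One LCG step: x := a*x + c (mod 2^64), via wrapping arithmetic.
        x = black_box(
            x.wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407),
        );
    }
    x
}
```

Determinism matters for the correctness gate described later: every runner over the same scenario must compute the same result, so the per-node work cannot depend on timing or thread identity.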
Funnel policy variants
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/executor_set.rs:funnel_specs}}
}
See Funnel policies for the meaning of each axis, the rationale, and guidance on selecting a preset.
Text tables
Overhead
workload hand-pool hand-rayon hand-seq hylic-eager+fused hylic-eager+rayon hylic-fused hylic-fused-local hylic-fused-owned hylic-parref+fused hylic-parref+rayon hylic-rayon real-rayon real-seq
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
aggr_sm 19.5ms (+68%) 14.2ms (+23%) 50.1ms (+334%) 21.7ms (+88%) 20.1ms (+74%) 52.3ms (+353%) 52.0ms (+350%) 55.7ms (+382%) 15.6ms (+35%) 12.6ms (+9%) 11.5ms (best) 15.9ms (+38%) 47.9ms (+315%)
bal_sm 25.8ms (+69%) 15.3ms (best) 92.3ms (+505%) 62.8ms (+311%) 28.5ms (+87%) 89.0ms (+483%) 88.2ms (+478%) 88.7ms (+481%) 59.9ms (+292%) 21.2ms (+39%) 21.0ms (+38%) 20.7ms (+36%) 76.9ms (+404%)
deep_sm 10.0ms (+33%) 9.6ms (+28%) 34.9ms (+365%) 32.3ms (+330%) 13.3ms (+78%) 34.6ms (+361%) 36.8ms (+391%) 35.0ms (+367%) 31.9ms (+325%) 9.5ms (+26%) 8.6ms (+15%) 7.5ms (best) 31.0ms (+313%)
fin_sm 15.1ms (+60%) 9.5ms (best) 45.1ms (+377%) 14.8ms (+56%) 15.7ms (+66%) 44.6ms (+371%) 47.9ms (+407%) 43.9ms (+364%) 9.5ms (best) 10.5ms (+11%) 10.2ms (+7%) 11.0ms (+17%) 43.0ms (+355%)
hash_sm 1.6ms (+69%) 1.2ms (+23%) 5.9ms (+520%) 4.5ms (+377%) 1.6ms (+66%) 4.2ms (+339%) 4.5ms (+375%) 4.5ms (+374%) 4.2ms (+340%) 1.3ms (+37%) 1.0ms (+5%) 1.0ms (best) 4.6ms (+381%)
io_sm 10.9ms (+43%) 7.6ms (+1%) 42.4ms (+460%) 42.7ms (+464%) 7.9ms (+4%) 42.3ms (+458%) 42.6ms (+462%) 42.5ms (+461%) 42.6ms (+462%) 7.6ms (best) 7.6ms (+1%) 7.6ms (best) 42.0ms (+455%)
lg-dense_sm 29.1ms (+49%) 19.5ms (best) 91.1ms (+368%) 74.8ms (+284%) 22.7ms (+17%) 102.8ms (+428%) 87.2ms (+348%) 85.7ms (+340%) 73.0ms (+275%) 22.9ms (+17%) 20.3ms (+4%) 20.5ms (+5%) 96.4ms (+395%)
noop_sm 0.1ms (+7762%) 0.0ms (+2148%) 0.0ms (+36%) 0.1ms (+13599%) 0.2ms (+20791%) 0.0ms (+336%) 0.0ms (+867%) 0.0ms (+623%) 0.1ms (+12240%) 0.1ms (+8872%) 0.0ms (+2548%) 0.0ms (+2697%) 0.0ms (best)
parse-hv_sm 37.6ms (+56%) 24.2ms (best) 102.4ms (+323%) 119.6ms (+395%) 24.8ms (+3%) 119.7ms (+395%) 114.0ms (+372%) 124.1ms (+413%) 106.6ms (+341%) 24.6ms (+2%) 27.2ms (+13%) 24.6ms (+2%) 111.5ms (+361%)
parse-lt_sm 10.5ms (+101%) 7.2ms (+39%) 28.9ms (+455%) 26.8ms (+414%) 7.7ms (+48%) 26.2ms (+403%) 29.0ms (+458%) 26.2ms (+403%) 27.2ms (+422%) 6.9ms (+33%) 5.7ms (+10%) 5.2ms (best) 30.6ms (+488%)
wide_sm 9.6ms (+40%) 6.9ms (best) 38.6ms (+461%) 31.6ms (+359%) 10.0ms (+45%) 35.6ms (+418%) 35.6ms (+417%) 35.6ms (+417%) 30.9ms (+349%) 10.2ms (+49%) 8.9ms (+30%) 9.4ms (+37%) 31.3ms (+355%)
xform_sm 17.3ms (+66%) 11.7ms (+13%) 53.9ms (+419%) 19.5ms (+88%) 16.3ms (+57%) 43.6ms (+320%) 50.1ms (+383%) 50.0ms (+382%) 13.1ms (+27%) 12.4ms (+19%) 10.4ms (best) 11.6ms (+12%) 50.9ms (+390%)
(Matrix and Module-simulation text tables refresh after the
next make bench-matrix / make bench-modsim run.)
Benchmark source
Overhead harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_overhead.rs}}
}
Matrix harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_matrix.rs}}
}
Module simulation harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_modsim.rs}}
}
Runner matrix construction
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/runners.rs}}
}
Handrolled baselines
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/baselines.rs}}
}
Funnel policy specs
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/executor_set.rs}}
}
Correctness
Performance numbers are uninformative without correctness. The
Funnel executor has a unit and integration suite under
hylic/src/exec/variant/funnel/tests/ covering the API, parity
with the Fused baseline, and deterministic results across all
policy variants. An interleaving stress harness in
tests/interleaving.rs and tests/stress.rs exercises the
scheduler under aggressive steal patterns. Every benchmark
harness asserts that the computed R matches a reference Fused
run (PreparedScenario::expected) before timing begins; a
policy variant producing a faster-but-incorrect answer would
never reach the tables above.
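That assert-before-timing pattern is easy to state in miniature (hypothetical harness shape; the real harness checks against PreparedScenario::expected and times through criterion):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Miniature of the assert-before-timing gate: check the runner's result
// against a precomputed reference, and only then time it.
fn bench_checked<R: PartialEq + std::fmt::Debug>(
    run: impl Fn() -> R,
    expected: &R,
) -> Duration {
    // Correctness gate: a faster-but-wrong runner panics here and never
    // produces a timing that could reach a results table.
    assert_eq!(&run(), expected, "runner disagrees with reference result");
    // Time a run (one iteration here; criterion would do many and average).
    let start = Instant::now();
    black_box(run());
    start.elapsed()
}
```

The gate runs outside the timed region, so it adds no cost to the reported means.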