Benchmark results
Wall-clock means from criterion across four harnesses. The
Overhead harness pits Fused against handrolled
single-threaded recursions to measure framework cost. The
Matrix harness runs Funnel's 16 policy variants
alongside Rayon and a scoped pool, all parallel, across 14
workload scenarios. Module simulation runs a synthetic
dependency-graph resolver — the workload that originally
motivated the library. Quick is a small subset of the
Matrix grid, used to track changes during development.
What the numbers say
Sequential first. Fused lands within ±20% of hand.seq
on every row of the Overhead bench, faster on 8 of 11. The
spread against real.seq (a plain fn f(&T) -> R with no
hylic types in sight) is within ±16%. The library’s
fold/treeish indirection is, in this regime, on the order of
compiler-level noise rather than an integer multiple. The
plausible reason is a uniform per-node shape that monomorphises
predictably plus closures held inside Fold and Treeish
that the compiler can inline through; whatever the cause, the
practical statement is parity, not dominance.
The parallel picture is more interesting. A Funnel variant
is the row winner on 10 of 14 Matrix workloads. On the
remaining 4 the row winner is handrolled and the nearest
Funnel variant lies within a few percent. No single policy
preset wins across the grid: shallow-wide workloads prefer
Shared queues with OnArrival accumulation; deep-narrow
prefer PerWorker with OnFinalize; the wake axis can move a
row by 10–30% on its own. The 14-row table below is most
useful read row-by-row — for any one workload, the policy that
wins tells you something about the workload’s shape.
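The axis trade-offs above rest on policies being compile-time choices. As a standalone illustration of the mechanism (hypothetical types and a toy `Vec<u64>` accumulator, not Funnel's actual traits), selecting the accumulation strategy by type parameter lets the compiler specialise the whole walk per policy, with no runtime dispatch:

```rust
// Hypothetical sketch of compile-time policy selection, not Funnel's real API.
// Each policy is a zero-sized type; the generic reducer monomorphises per
// policy, so the strategy choice costs nothing at runtime.
trait AccumulationPolicy {
    // Fold a child result into the running accumulator.
    fn deliver(acc: &mut Vec<u64>, child: u64);
    // Called once all children of a node are complete.
    fn finalize(acc: &mut Vec<u64>) -> u64;
}

struct OnArrival;
impl AccumulationPolicy for OnArrival {
    fn deliver(acc: &mut Vec<u64>, child: u64) {
        // Fold eagerly: keep only a running sum, freeing the slot immediately.
        if acc.is_empty() {
            acc.push(0);
        }
        acc[0] += child;
    }
    fn finalize(acc: &mut Vec<u64>) -> u64 {
        acc.drain(..).sum()
    }
}

struct OnFinalize;
impl AccumulationPolicy for OnFinalize {
    fn deliver(acc: &mut Vec<u64>, child: u64) {
        // Buffer until all siblings are complete, then drain in one pass.
        acc.push(child);
    }
    fn finalize(acc: &mut Vec<u64>) -> u64 {
        acc.drain(..).sum()
    }
}

// Generic over the policy: each instantiation is a separate, fully
// inlinable function.
fn reduce_children<P: AccumulationPolicy>(children: &[u64]) -> u64 {
    let mut acc = Vec::new();
    for &c in children {
        P::deliver(&mut acc, c);
    }
    P::finalize(&mut acc)
}
```

Both instantiations produce the same result; they differ only in when the folding work happens, which is exactly the axis the Matrix rows discriminate on.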
The Module-simulation harness probes the same trade-off. On
the four _fast rows (large-dense, large-sparse, small-dense,
small-sparse) Funnel variants win three of four, with
different policy axes per row, unsurprising given the Matrix
story. On the _slow rows, where per-node work dominates and
scheduling ceases to matter, the runners cluster.
These properties of Funnel are statements about the source,
not inferences from the benchmarks. Policies are monomorphised
(Funnel<P> is generic, the entire walk specialises per
policy, no runtime dispatch on strategy). Continuations are
defunctionalised — Cont<H, R> is a three-variant enum
(Root, Direct, Slot); the inner loop is match cont in
a loop, no Box<dyn FnOnce> per step. Continuations and
fold chains live in arenas (ChainNode<H, R> in a scoped
Arena, Cont<H, R> in a ContArena, both released in bulk
at the end of the pool’s lifetime; no per-node malloc/free).
Under the OnArrival accumulation policy, each child result
is folded into its parent’s heap on arrival via
P::Accumulate::deliver, and the slot is freed; OnFinalize
buffers until siblings are complete and then drains. The walk
references the user’s fold and treeish by &'a _, with the
lifetime tied to the pool’s with(...) scope; user closures
are not cloned into worker queues. Queue topology is a
compile-time choice — per-worker deques (local push, remote
steal) or a single shared FIFO — and selection is per workload
rather than universal.
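The defunctionalisation point can be shown in miniature (hypothetical types, not Funnel's actual Cont<H, R> or its Root/Direct/Slot variants): pending work becomes a small enum driven by a match in a loop, rather than a boxed closure per step:

```rust
// Standalone sketch of defunctionalised continuations, using a toy binary
// tree sum. Hypothetical types, not Funnel's Cont<H, R>.
enum Tree {
    Leaf(u64),
    Node(Box<Tree>, Box<Tree>),
}

// Each variant names one "what to do next" shape; the continuation stack
// lives in one Vec, with no Box<dyn FnOnce> allocated per step.
enum Cont<'a> {
    // After the left subtree, walk this right subtree.
    ThenRight(&'a Tree),
    // After the right subtree, add the saved left result.
    ThenAdd(u64),
}

fn sum(tree: &Tree) -> u64 {
    let mut conts: Vec<Cont> = Vec::new();
    let mut current = tree;
    let mut result: u64;
    loop {
        // Descend to a leaf, recording continuations on the way down.
        loop {
            match current {
                Tree::Leaf(v) => {
                    result = *v;
                    break;
                }
                Tree::Node(l, r) => {
                    conts.push(Cont::ThenRight(r));
                    current = l;
                }
            }
        }
        // The inner driver: `match cont` in a loop.
        loop {
            match conts.pop() {
                None => return result,
                Some(Cont::ThenRight(r)) => {
                    conts.push(Cont::ThenAdd(result));
                    current = r;
                    break; // descend into the right subtree
                }
                Some(Cont::ThenAdd(left)) => result += left,
            }
        }
    }
}
```

The enum's variants enumerate every continuation shape the walk can produce, which is what makes the arena storage described above possible: a fixed-size value in a bulk-freed arena instead of an individually heap-allocated closure.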
See the Funnel deep-dive for the walk, ticket system, and arena details, and Policies and presets for the policy traits.
Interactive: Funnel axes viewer
The Matrix bench output filtered by policy axis, marginalised
on demand, with cell-level deviations from real.rayon.
Overhead
make -C hylic-benchmark bench-overhead
The Overhead table also lists several parallel runners
(real.rayon, hylic-rayon, hand.rayon, hylic-parref+rayon,
hylic-eager+rayon) for cross-reference. They are not the
denominators for sequential-overhead statements; a parallel
runner beating a sequential one says that multiple cores are
faster than one, not that the framework is slow.
For a framework-vs-handrolled comparison in the parallel
regime, hylic-rayon versus real.rayon is the
apples-to-apples pair: within ±15% on most rows, with a
worst-case +33% on parse-lt_sm. That is a real framework tax
on the parallel path; whether it’s tolerable depends on the
choice between Funnel and a Rayon-backed executor.
Matrix
make -C hylic-benchmark bench-matrix
Each cell shows the wall-clock mean and the +X% deviation
from the row’s fastest entry; the row winner is marked
(best). Reading a few rows together brings out the
policy-axis story.
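The cell labels are relative, so a sketch of how a row could be rendered helps with reading them (a hypothetical helper, not the harness's actual formatting code):

```rust
// Hypothetical row-labelling helper: deviation is computed against the
// row's fastest mean, and the winner is tagged "(best)".
fn label_row(means_ms: &[f64]) -> Vec<String> {
    let best = means_ms.iter().cloned().fold(f64::INFINITY, f64::min);
    means_ms
        .iter()
        .map(|&m| {
            if m == best {
                format!("{m:.1}ms (best)")
            } else {
                // +X% relative to the row winner, rounded to whole percent.
                format!("{m:.1}ms (+{:.0}%)", (m / best - 1.0) * 100.0)
            }
        })
        .collect()
}
```

Note the consequence for cross-row reading: a +6% in one row and a +6% in another are relative to different winners, so percentages compare within a row only.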
wide_sm (200 nodes, branching 20):
funnel.pw.arrv.push = 6.2ms (best), 20% ahead of both
hand.pool and hand.rayon at 7.5ms. Wide fan-out plus
immediate OnArrival delivery and per-worker deques keep the
push cheap and drain the child heap as siblings complete.
graph-hv_sm (heavy edge-discovery, modelling a dependency
resolver): funnel.sh.fin.k2 = 16.4ms (best), 2% ahead of
hand.rayon at 16.8ms. Dropping the wake frequency to every
second child amortises the edge-discovery cost better than the
handrolled approaches.
The 4 rows where handrolled wins are bal_sm, io_sm,
graph-io_sm, noop_sm. On bal_sm, hand.rayon = 16.1ms
versus funnel.sh.fin.push = 17.0ms (+6%). noop_sm is the
zero-work cell: it is dominated by per-node bookkeeping,
absolute times are sub-millisecond, and percentage deltas are
therefore misleading. The framework cost is most visible
there, and it is unavoidable for any tree-shaped recursive
parallelisation.
Module simulation
make -C hylic-benchmark bench-modsim
Eight workloads on two axes — sparse vs dense graph, fast vs
slow per-node work. On the four _fast rows, Funnel
variants take three of four winners (funnel.pw.fin.push = 1.0ms on large-dense_fast, funnel.pw.arrv.push = 1.0ms
on large-sparse_fast, funnel.sh.arrv.push = 0.3ms on
small-sparse_fast); the fourth, small-dense_fast, sits
near 0.3ms across runners. For dependency-graph-shaped
workloads with cheap per-node work — the common case for a
module resolver — Funnel is the faster choice. Where
per-node work dominates, scheduler choice ceases to matter and
the runners converge.
Quick
make bench-quick-light
Five runners — real.rayon plus four Funnel variants
covering both queue axes (PerWorker, Shared) and both
accumulation axes (OnArrival, OnFinalize), all with EveryK<4>
wake. Nine scenarios chosen for variation: noop, hash,
parse-lt, parse-hv, aggr, xform, bal, wide,
graph-hv. Near-parity scenarios (io, deep, fin,
graph-io, lg-dense) are excluded.
The -ab variants run the same bench across multiple git
revisions of hylic, archiving each run with a timestamp.
Further revisions can be added by appending label=gitref to
the makefile target.
Workload scenarios
Each scenario is a TreeSpec (node count, branching factor)
and a WorkSpec (per-phase CPU burn amounts plus an optional
I/O spin-wait). busy_work is the deterministic u64 LCG
loop inside black_box; spin_wait_us is a wall-clock
busy-wait. The scenarios are synthetic — the intent is to
cover a shape space (shallow-wide, deep-narrow,
accumulate-heavy, finalize-heavy, I/O-bound, graph-discovery-
heavy) rather than reproduce any specific production workload.
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/scenario.rs:scenario_catalog}}
}
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/work.rs:work_spec}}
}
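The busy_work loop described above can be sketched as follows. This is a minimal illustration of the pattern (a deterministic u64 LCG stepped inside black_box), with Knuth's MMIX constants assumed; the harness's actual constants and shape may differ:

```rust
use std::hint::black_box;

// Sketch of a deterministic CPU-burn loop in the busy_work style: a u64
// linear congruential generator stepped `iters` times. black_box keeps the
// compiler from folding the loop away, while the LCG keeps the work
// deterministic (same seed and iters always yield the same result).
fn busy_work(seed: u64, iters: u64) -> u64 {
    let mut x = seed;
    for _ in 0..iters {
        // One LCG step: x := a*x + c (mod 2^64), via wrapping arithmetic.
        x = black_box(
            x.wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407),
        );
    }
    x
}
```

Determinism matters for the correctness gate described later: every runner over the same scenario must compute the same result, so the per-node work cannot depend on timing or thread identity.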
Funnel policy variants
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/executor_set.rs:funnel_specs}}
}
See Funnel policies for the meaning of each axis, the rationale, and guidance on selecting a preset.
Text tables
Overhead
workload hand-pool hand-rayon hand-seq hylic-eager+fused hylic-eager+rayon hylic-fused hylic-fused-local hylic-fused-owned hylic-parref+fused hylic-parref+rayon hylic-rayon real-rayon real-seq
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
aggr_sm 19.5ms (+68%) 14.2ms (+23%) 50.1ms (+334%) 21.7ms (+88%) 20.1ms (+74%) 52.3ms (+353%) 52.0ms (+350%) 55.7ms (+382%) 15.6ms (+35%) 12.6ms (+9%) 11.5ms (best) 15.9ms (+38%) 47.9ms (+315%)
bal_sm 25.8ms (+69%) 15.3ms (best) 92.3ms (+505%) 62.8ms (+311%) 28.5ms (+87%) 89.0ms (+483%) 88.2ms (+478%) 88.7ms (+481%) 59.9ms (+292%) 21.2ms (+39%) 21.0ms (+38%) 20.7ms (+36%) 76.9ms (+404%)
deep_sm 10.0ms (+33%) 9.6ms (+28%) 34.9ms (+365%) 32.3ms (+330%) 13.3ms (+78%) 34.6ms (+361%) 36.8ms (+391%) 35.0ms (+367%) 31.9ms (+325%) 9.5ms (+26%) 8.6ms (+15%) 7.5ms (best) 31.0ms (+313%)
fin_sm 15.1ms (+60%) 9.5ms (best) 45.1ms (+377%) 14.8ms (+56%) 15.7ms (+66%) 44.6ms (+371%) 47.9ms (+407%) 43.9ms (+364%) 9.5ms (best) 10.5ms (+11%) 10.2ms (+7%) 11.0ms (+17%) 43.0ms (+355%)
hash_sm 1.6ms (+69%) 1.2ms (+23%) 5.9ms (+520%) 4.5ms (+377%) 1.6ms (+66%) 4.2ms (+339%) 4.5ms (+375%) 4.5ms (+374%) 4.2ms (+340%) 1.3ms (+37%) 1.0ms (+5%) 1.0ms (best) 4.6ms (+381%)
io_sm 10.9ms (+43%) 7.6ms (+1%) 42.4ms (+460%) 42.7ms (+464%) 7.9ms (+4%) 42.3ms (+458%) 42.6ms (+462%) 42.5ms (+461%) 42.6ms (+462%) 7.6ms (best) 7.6ms (+1%) 7.6ms (best) 42.0ms (+455%)
lg-dense_sm 29.1ms (+49%) 19.5ms (best) 91.1ms (+368%) 74.8ms (+284%) 22.7ms (+17%) 102.8ms (+428%) 87.2ms (+348%) 85.7ms (+340%) 73.0ms (+275%) 22.9ms (+17%) 20.3ms (+4%) 20.5ms (+5%) 96.4ms (+395%)
noop_sm 0.1ms (+7762%) 0.0ms (+2148%) 0.0ms (+36%) 0.1ms (+13599%) 0.2ms (+20791%) 0.0ms (+336%) 0.0ms (+867%) 0.0ms (+623%) 0.1ms (+12240%) 0.1ms (+8872%) 0.0ms (+2548%) 0.0ms (+2697%) 0.0ms (best)
parse-hv_sm 37.6ms (+56%) 24.2ms (best) 102.4ms (+323%) 119.6ms (+395%) 24.8ms (+3%) 119.7ms (+395%) 114.0ms (+372%) 124.1ms (+413%) 106.6ms (+341%) 24.6ms (+2%) 27.2ms (+13%) 24.6ms (+2%) 111.5ms (+361%)
parse-lt_sm 10.5ms (+101%) 7.2ms (+39%) 28.9ms (+455%) 26.8ms (+414%) 7.7ms (+48%) 26.2ms (+403%) 29.0ms (+458%) 26.2ms (+403%) 27.2ms (+422%) 6.9ms (+33%) 5.7ms (+10%) 5.2ms (best) 30.6ms (+488%)
wide_sm 9.6ms (+40%) 6.9ms (best) 38.6ms (+461%) 31.6ms (+359%) 10.0ms (+45%) 35.6ms (+418%) 35.6ms (+417%) 35.6ms (+417%) 30.9ms (+349%) 10.2ms (+49%) 8.9ms (+30%) 9.4ms (+37%) 31.3ms (+355%)
xform_sm 17.3ms (+66%) 11.7ms (+13%) 53.9ms (+419%) 19.5ms (+88%) 16.3ms (+57%) 43.6ms (+320%) 50.1ms (+383%) 50.0ms (+382%) 13.1ms (+27%) 12.4ms (+19%) 10.4ms (best) 11.6ms (+12%) 50.9ms (+390%)
(Matrix and Module-simulation text tables refresh after the
next make bench-matrix / make bench-modsim run.)
Benchmark source
Overhead harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_overhead.rs}}
}
Matrix harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_matrix.rs}}
}
Module simulation harness
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/bench_modsim.rs}}
}
Runner matrix construction
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/runners.rs}}
}
Handrolled baselines
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/baselines.rs}}
}
Funnel policy specs
#![allow(unused)]
fn main() {
{{#include ../../../../hylic-benchmark/benches/support/executor_set.rs}}
}
Correctness
Performance numbers are uninformative without correctness. The
Funnel executor has a unit and integration suite under
hylic/src/exec/variant/funnel/tests/ covering the API, parity
with the Fused baseline, and deterministic results across all
policy variants. An interleaving stress harness in
tests/interleaving.rs and tests/stress.rs exercises the
scheduler under aggressive steal patterns. Every benchmark
harness asserts that the computed R matches a reference Fused
run (PreparedScenario::expected) before timing begins; a
policy variant producing a faster-but-incorrect answer would
never reach the tables above.
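That assert-before-timing pattern is easy to state in miniature (hypothetical harness shape; the real harness checks against PreparedScenario::expected and times through criterion):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Miniature of the assert-before-timing gate: check the runner's result
// against a precomputed reference, and only then time it.
fn bench_checked<R: PartialEq + std::fmt::Debug>(
    run: impl Fn() -> R,
    expected: &R,
) -> Duration {
    // Correctness gate: a faster-but-wrong runner panics here and never
    // produces a timing that could reach a results table.
    assert_eq!(&run(), expected, "runner disagrees with reference result");
    // Time a run (one iteration here; criterion would do many and average).
    let start = Instant::now();
    black_box(run());
    start.elapsed()
}
```

The gate runs outside the timed region, so it adds no cost to the reported means.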