The table on the homepage shows four numbers per framework. This post explains how those numbers were produced, what they mean precisely, and where the methodology falls short.
Skipping this would leave the numbers floating without context. That is not useful to anyone evaluating Kernex for real work.
Test environment
All benchmarks were run on a single machine. No cloud instances, no shared compute.
- CPU: Apple M2 Pro (12-core)
- RAM: 32 GB unified memory
- OS: macOS 14.4
- Rust: 1.77.0 (stable)
- Python: 3.11.8
- Kernex: 0.3.0
- LangChain: 0.1.14
- LangGraph: 0.0.30
- CrewAI: 0.22.5
Each benchmark was run 10 times. The first run was discarded to exclude cold filesystem cache effects, and the reported value is the median of the remaining runs.
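As a rough sketch of that procedure (the run count and the run_once closure here are placeholders, not the actual harness code):

```rust
use std::time::{Duration, Instant};

/// Run a benchmark closure `runs` times, drop the first (cold-cache) run,
/// and return the median of the remaining timings.
fn median_of_runs<F: FnMut()>(mut run_once: F, runs: usize) -> Duration {
    let mut timings: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            run_once();
            start.elapsed()
        })
        .collect();
    timings.remove(0); // first run warms the filesystem cache; discard it
    timings.sort();
    timings[timings.len() / 2] // upper median for even counts
}
```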
What each metric measures
Cold start
Cold start is the time from process launch to the runtime being ready to accept the first task.
For Kernex, this measures RuntimeBuilder::build() end-to-end (see the sketch below the list), including:
- Binary load and memory mapping
- SQLite store initialization (in-memory mode, :memory:)
- Provider registry setup
- Tool registry setup
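For illustration, a Criterion benchmark of this shape looks roughly like the following. Only RuntimeBuilder::build() comes from the list above; the kernex import path, RuntimeBuilder::new(), and the defaults it implies are assumptions about the builder surface.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use kernex::RuntimeBuilder; // hypothetical import path

fn cold_start(c: &mut Criterion) {
    // Builds a fresh runtime per iteration: in-memory SQLite store plus
    // provider and tool registry setup, per the list above.
    c.bench_function("runtime_builder_build", |b| {
        b.iter(|| RuntimeBuilder::new().build())
    });
}

criterion_group!(benches, cold_start);
criterion_main!(benches);
```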
For Python frameworks, this measures from subprocess.Popen() to the first line of user code executing, which includes:
- CPython interpreter startup
- Import of the framework package and its dependencies (LangChain 0.1.14, LangGraph 0.0.30, CrewAI 0.22.5)
- Any framework-level initialization that runs on import
This is a realistic cold start scenario: a new process is spawned for each measurement. It is not measuring a warm process that has already initialized.
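The actual baselines are driven by the Python harness (bench/python/run_all.py), but the shape of the measurement can be sketched as follows: spawn a fresh interpreter, import the framework, and stop the clock when the first line of user code prints a marker. The python3 invocation and the 'ready' marker below are illustrative choices, not the harness's real mechanics.

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};
use std::time::{Duration, Instant};

/// Time from process spawn to the first line of user code running in the
/// child, where that first line executes only after the framework import.
fn python_cold_start(import_stmt: &str) -> std::io::Result<Duration> {
    let start = Instant::now();
    let mut child = Command::new("python3")
        .arg("-c")
        .arg(format!("{import_stmt}; print('ready', flush=True)"))
        .stdout(Stdio::piped())
        .spawn()?;
    let mut line = String::new();
    BufReader::new(child.stdout.take().expect("stdout is piped")).read_line(&mut line)?;
    let elapsed = start.elapsed();
    child.wait()?;
    Ok(elapsed)
}

fn main() -> std::io::Result<()> {
    for import in ["import langchain", "import langgraph", "import crewai"] {
        println!("{import}: {:?}", python_cold_start(import)?);
    }
    Ok(())
}
```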
Peak memory
Memory is measured as peak RSS (Resident Set Size) delta. The baseline RSS of an idle process is subtracted so the number reflects framework overhead, not OS overhead.
Concurrent agent counts tested: 1, 5, 10. The table shows the 10-agent figure, which is where the difference is most pronounced. The 1-agent figures are less interesting but are in the raw results.
Each agent was running a minimal task: read a string input, call a single no-op tool, return a string output. No LLM calls were made during this test. The goal is to measure runtime overhead, not model latency.
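A minimal sketch of the peak-RSS-delta idea, using getrusage(2) through the libc crate. Two simplifications: the real baseline comes from an idle process, whereas this sketch takes the baseline inside the measuring process before any agents start; and ru_maxrss is reported in bytes on macOS but kilobytes on Linux.

```rust
// Peak RSS via getrusage(2); requires the `libc` crate.
fn peak_rss() -> i64 {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    usage.ru_maxrss // bytes on macOS, kilobytes on Linux
}

fn main() {
    // Baseline: peak RSS before the framework does any work.
    let baseline = peak_rss();
    // ... spin up N concurrent agents and run the no-op task here ...
    let peak = peak_rss();
    println!("framework overhead ~ {} (ru_maxrss units)", peak - baseline);
}
```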
Throughput
Throughput measures requests per second for a single-tool agent task over a 30-second window, using a fixed pool of 10 concurrent agents.
The task is the same no-op task used in the memory test. Again, no LLM calls. This isolates runtime scheduling and dispatch overhead from model latency, which would dominate and make the comparison meaningless.
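The measurement loop can be sketched like this; run_noop_task is a stand-in for the single-tool agent task, not a Kernex or framework API.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

// Stand-in for "read a string input, call a no-op tool, return a string output".
fn run_noop_task() {
    std::hint::black_box("input".to_uppercase());
}

fn main() {
    const AGENTS: usize = 10;
    const WINDOW: Duration = Duration::from_secs(30);

    let completed = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));
    let start = Instant::now();

    // Fixed pool of concurrent workers, each looping on the task until the
    // measurement window closes.
    let workers: Vec<_> = (0..AGENTS)
        .map(|_| {
            let (completed, stop) = (Arc::clone(&completed), Arc::clone(&stop));
            std::thread::spawn(move || {
                while !stop.load(Ordering::Relaxed) {
                    run_noop_task();
                    completed.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    std::thread::sleep(WINDOW);
    stop.store(true, Ordering::Relaxed);
    let elapsed = start.elapsed();
    for w in workers {
        w.join().unwrap();
    }

    let rps = completed.load(Ordering::Relaxed) as f64 / elapsed.as_secs_f64();
    println!("throughput: {rps:.1} req/s");
}
```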
What the benchmarks do not cover
LLM call latency. The tests do not make real model calls. This is intentional: LLM latency is dominated by the provider, not the framework. Including it would obscure the runtime overhead we are trying to measure.
Complex multi-step pipelines. The test task is deliberately minimal. A 10-step agent with memory lookups, web search, and tool chaining will show different relative performance. We do not have those numbers yet.
Linux. All measurements are on macOS. Kernex’s sandboxing primitives differ between macOS (Seatbelt) and Linux (Landlock + seccomp). We expect Linux cold start to be lower for Kernex due to faster process spawn, but have not run the full suite on Linux yet.
Python framework warm-path performance. Long-running Python servers that keep the framework warm in memory are a different workload. The cold start numbers are not relevant there. The throughput numbers still are, but should be weighted differently.
CrewAI and LangGraph internals. We used each framework’s default configuration with a minimal agent setup. Experts in those frameworks may be able to get meaningfully better numbers through configuration. We welcome corrections.
Reproducing the results
The benchmark harness is in the bench/ directory of the main repository. Rust micro-benchmarks use Criterion.rs for statistical rigor: each measurement includes a warmup phase, outlier detection, and confidence intervals. To run:
# Rust benchmarks (requires Rust 1.77+)
cargo bench --package kernex-bench
# Python baselines (requires Python 3.11+, each framework installed)
python bench/python/run_all.py
Output is written to bench/results/. The compare.py script generates the summary table.
Full raw results, including per-run timings and the 1-agent and 5-agent memory figures, are in bench/results/.
The numbers are preliminary. We marked them that way on the homepage and we mean it. As Kernex matures and the test suite expands to cover Linux, real LLM calls, and more complex pipelines, we will update both the table and this post.
Corrections, reproductions on different hardware, and methodological critiques are welcome via GitHub.