Nearly every agent framework in production is Python. LangChain, LangGraph, CrewAI, AutoGen. The ecosystem, the tutorials, the hiring pool: all Python. Building in Rust is a deliberate bet against the grain.
This post explains that bet. Not the marketing version. The actual reasoning, including the parts that cut against us.
The problem we kept running into
We were building an internal agent for a client in a regulated environment. The requirements were specific:
- No external network calls from the agent process
- Bounded memory usage (hard ceiling enforced by the platform)
- Audit trail for every LLM call and tool invocation
- Cold start under 500ms (scheduled triggers, not persistent services)
We tried LangChain first. The cold start alone was 2,200ms. Peak memory for 10 concurrent agents was 310 MB. The process had no concept of OS-level isolation — the agent could read any file the process had access to, make any network call, execute arbitrary subprocesses. Every mitigation required wrapping Python in another container boundary.
That works. We have shipped Python agents in containers. But it felt like using a tarp as a wall. The boundary was outside the tool, not part of it.
What Rust actually gives you
Cold start. The Kernex runtime starts in 12ms. This is not a tuned benchmark. It is the natural result of a compiled binary with no interpreter startup, no JIT warmup, no garbage collection pause. For scheduled workloads, this changes the cost model entirely.
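A minimal sketch of what that measurement looks like as a Criterion bench. The kernex import path and RuntimeBuilder::new() are illustrative assumptions, not the actual bench code; the real harness is in the bench directory.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Illustrative import path; the actual builder lives in the Kernex crates.
use kernex::RuntimeBuilder;

fn cold_start(c: &mut Criterion) {
    c.bench_function("runtime_cold_start", |b| {
        b.iter(|| {
            // Build the runtime end-to-end, including store initialization,
            // so the number reflects what a scheduled trigger actually pays.
            let runtime = RuntimeBuilder::new()
                .build()
                .expect("runtime should build");
            std::hint::black_box(runtime);
        })
    });
}

criterion_group!(benches, cold_start);
criterion_main!(benches);
```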
Memory. Rust’s ownership model makes memory use deterministic: every allocation is freed at a point known at compile time, not whenever a collector decides to run. 24 MB peak for 10 concurrent agents is not us being clever; it is the absence of a GC heap and a reference-counted interpreter runtime that can balloon under load.
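The peak figure itself is cheap to observe. On Linux, a minimal sketch reads the process high-water mark straight from /proc; the published harness reports a delta against a baseline and may differ in detail:

```rust
use std::fs;

/// Peak resident set size (VmHWM) of the current process, in kilobytes.
/// Linux-only: parsed from /proc/self/status.
fn peak_rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmHWM:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    // Sample after the agents under test have run; subtracting a
    // baseline process gives the per-agent cost.
    println!("peak RSS: {:?} kB", peak_rss_kb());
}
```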
Sandboxing as a first-class primitive. kernex-sandbox wraps macOS Seatbelt and Linux Landlock directly. The agent process declares what it needs at startup. The OS enforces it. There is no “agent tried to call home and we caught it at the proxy” — the syscall never completes. This is a meaningful difference for environments where the threat model includes the LLM itself.
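To make the declare-then-enforce model concrete, here is roughly what the Linux half looks like using the landlock crate directly. The paths and access split are illustrative, and kernex-sandbox’s own API wraps this differently:

```rust
use landlock::{
    path_beneath_rules, Access, AccessFs, Ruleset, RulesetAttr,
    RulesetCreatedAttr, ABI,
};

fn lock_down_filesystem() -> Result<(), Box<dyn std::error::Error>> {
    let abi = ABI::V2;
    Ruleset::default()
        // Declare which filesystem access types this ruleset governs.
        .handle_access(AccessFs::from_all(abi))?
        .create()?
        // Read-only access to the agent's config directory.
        .add_rules(path_beneath_rules(&["/etc/agent"], AccessFs::from_read(abi)))?
        // Read-write access to a single scratch directory.
        .add_rules(path_beneath_rules(&["/tmp/agent"], AccessFs::from_all(abi)))?
        // From this point on, the kernel denies everything else with EACCES.
        .restrict_self()?;
    Ok(())
}
```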
The type system as a correctness layer. Agent pipelines have a lot of moving parts: providers, tools, memory reads, reward signals, multi-step workflows. In Python, a wrong key in a dict surfaces only at runtime, often far from where it was introduced. In Rust, the type checker catches the structural error before deployment. For pipelines defined in TOML and loaded at startup, this matters: the process panics on load with a clear message rather than silently misbehaving mid-run.
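A minimal sketch of that load-time check with serde and the toml crate; the field names here are hypothetical, not Kernex’s actual schema:

```rust
use serde::Deserialize;

// deny_unknown_fields turns a typo'd key in the TOML into a hard
// load-time error instead of a silently ignored setting.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct PipelineConfig {
    provider: String,
    tools: Vec<String>,
    max_steps: u32,
}

fn load_pipeline(path: &str) -> PipelineConfig {
    let raw = std::fs::read_to_string(path)
        .unwrap_or_else(|e| panic!("cannot read {path}: {e}"));
    // A missing field, a wrong type, or an unknown key panics here,
    // at startup, with the offending key named in the message.
    toml::from_str(&raw)
        .unwrap_or_else(|e| panic!("invalid pipeline config {path}: {e}"))
}
```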
The parts that cut against us
Hiring. The number of engineers who can write idiomatic Rust and also understand LLM pipeline patterns is small. We are building in a space where we cannot hire our way out of problems easily.
Ecosystem. tokio-openai exists. async-openai exists. But the tooling ecosystem around Rust AI is two years behind Python. We built kernex-providers from scratch because the available wrappers did not handle streaming correctly for our use case.
Development velocity. The borrow checker is real. Early prototypes took longer to write than the equivalent Python would have. The payoff comes at integration time — when the Python prototype would be accumulating bugs, the Rust version is usually already correct. But that is a tricky thing to internalize before you have seen the cycle a few times.
REPL-driven exploration. Python’s interactive shell is genuinely better for exploring an API you do not control. We compensate with kx run for quick one-shot tests, but it is not the same as Jupyter.
Where this lands
The bet is specific: Rust is the right choice for agent infrastructure deployed in production environments where cold start, memory, and isolation guarantees are load-bearing requirements. It is not the right choice for every use case.
If you are building a one-off automation script, a research prototype, or an internal tool where Python’s ecosystem is an asset — use Python. The frameworks are good. LangChain ships fast.
If you are building something that runs in a regulated environment, on constrained hardware, or where “the agent is sandboxed by default” needs to be a true statement — the Rust path is worth the friction.
That is the bet we made. We will keep publishing the numbers as the project matures.
Full methodology, environment notes, and raw Criterion results are in the bench directory on GitHub. The cold start benchmark measures RuntimeBuilder::build() end-to-end, including SQLite store initialization. The memory benchmark measures peak RSS delta for 1, 5, and 10 concurrent agent instances. For a full explanation of how each metric was measured and what the tests do not cover, see Benchmark methodology. Corrections welcome.