# Treni Experiment Docs
A unified GPU agent runtime that can feel, not just be told.
## What Is Treni
Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Unlike an SDK agent that calls remote tools and receives serialized responses, the Treni agent lives inside the runtime: it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.
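To make "lives inside the runtime" concrete, here is a minimal C sketch of the kind of per-step context an in-process agent could observe. The names (`treni_step_ctx`, `agent_decide_next`, the thresholds) are illustrative assumptions, not the actual Treni API.

```c
/* Hypothetical sketch: the per-step signals an in-process agent could read
 * directly, instead of waiting on a serialized tool response.
 * Names and thresholds are illustrative, not the real Treni interface. */
#include <math.h>
#include <stdio.h>

typedef enum { ACTION_CONTINUE, ACTION_CALL_TOOL, ACTION_RESAMPLE } agent_action;

typedef struct {
    const float *logprobs;       /* log-probabilities over the vocab for the next token */
    int          vocab_size;
    int          chosen_id;      /* token id the sampler picked */
    int          tokens_emitted; /* length of the current generation */
} treni_step_ctx;

/* Decide the next step from in-process signals: a low-confidence pick
 * triggers a resample, a long generation hands off to a tool. */
static agent_action agent_decide_next(const treni_step_ctx *ctx)
{
    float chosen_lp = ctx->logprobs[ctx->chosen_id];
    if (chosen_lp < logf(0.2f))     /* chosen token carries < 20% probability */
        return ACTION_RESAMPLE;
    if (ctx->tokens_emitted > 512)  /* arbitrary budget for the sketch */
        return ACTION_CALL_TOOL;
    return ACTION_CONTINUE;
}

int main(void)
{
    float lp[4] = { logf(0.05f), logf(0.10f), logf(0.70f), logf(0.15f) };
    treni_step_ctx ctx = { lp, 4, 2, 12 };
    printf("action = %d\n", agent_decide_next(&ctx));
    return 0;
}
```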
The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.
This is not just about speed. It's about aware generation.
## What We've Proven So Far
| Claim | Status | Key Number |
|---|---|---|
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.8 ms mean, 89.6 ms p99 |
| Internal routing beats external | Proven | 1.032x faster |
| Cold start manageable | Proven (after fix) | 15-117x speedup via index cache |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Next | Track C in progress |
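The mean and p99 figures in the table are plain order statistics over per-request wall-clock samples. The sketch below shows one way such numbers could be derived; it is not the Treni benchmark harness, and the sample values are made up.

```c
/* Minimal sketch: mean and p99 latency from per-request timings in ms.
 * Not the Treni harness; sample data is illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Note: sorts ms in place. p99 uses a nearest-rank-style index,
 * which is coarse for small sample counts. */
static void latency_stats(double *ms, size_t n, double *mean, double *p99)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += ms[i];
    *mean = sum / (double)n;

    qsort(ms, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(0.99 * (double)(n - 1));
    *p99 = ms[idx];
}

int main(void)
{
    double samples[] = { 78.1, 80.4, 79.9, 82.3, 88.7, 81.0, 79.2, 90.1 };
    double mean, p99;
    latency_stats(samples, sizeof samples / sizeof samples[0], &mean, &p99);
    printf("mean = %.1f ms, p99 = %.1f ms\n", mean, p99);
    return 0;
}
```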
## Reading Order
Start here, then follow the links in order:
- Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation (a minimal entropy sketch follows at the end of this list)
- Objectives and thesis — why a GPU agent that can feel beats one that gets told
- Findings changelog — what we discovered, in order
- Leaderboard — all the benchmark numbers
- Routing comparison — internal vs external routing breakdown
- Canonical G5 artifact set — the official reference run set
- Benchmark status — detailed completion status
- Raw artifacts — every JSON and report file
- TODO and next actions — what's coming next
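Since the entropy-guided loop is the core research thread, here is a small sketch of the underlying quantity: token-level entropy H = -Σ p_i log p_i over the softmax of a logit vector, the kind of in-process uncertainty signal such a loop could branch on. The function name, inputs, and comparison in `main` are assumptions for illustration, not the paper's actual criterion.

```c
/* Illustrative only: token-level entropy from a raw logit vector.
 * H = -sum_i p_i * log(p_i), with p = softmax(logits). */
#include <math.h>
#include <stdio.h>

static float token_entropy(const float *logits, int vocab_size)
{
    /* Numerically stable softmax: subtract the max logit before exponentiating. */
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float z = 0.0f;
    for (int i = 0; i < vocab_size; i++)
        z += expf(logits[i] - max_logit);

    float h = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        float p = expf(logits[i] - max_logit) / z;
        if (p > 0.0f) h -= p * logf(p);
    }
    return h;  /* in nats; higher entropy = a more uncertain next-token distribution */
}

int main(void)
{
    float confident[4] = { 8.0f, 0.5f, 0.1f, -1.0f };  /* peaked distribution */
    float uncertain[4] = { 1.0f, 1.0f, 1.0f,  1.0f };  /* uniform: H = ln(4) */
    printf("confident: H = %.3f nats\n", token_entropy(confident, 4));
    printf("uncertain: H = %.3f nats\n", token_entropy(uncertain, 4));
    return 0;
}
```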