# Treni Experiment Docs
A unified GPU agent runtime that can feel, not just be told.
## What Is Treni
Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Unlike an SDK agent that calls remote tools and receives serialized responses, the Treni agent lives inside the runtime: it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.
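To make "lives inside the runtime" concrete, here is a minimal C sketch of the kind of per-step context an in-process agent could observe. The names (`treni_step_ctx`, `agent_decide_next`, the thresholds) are illustrative assumptions, not the actual Treni API.

```c
/* Hypothetical sketch: the per-step signals an in-process agent could read
 * directly, instead of waiting on a serialized tool response.
 * Names and thresholds are illustrative, not the real Treni interface. */
#include <math.h>
#include <stdio.h>

typedef enum { ACTION_CONTINUE, ACTION_CALL_TOOL, ACTION_RESAMPLE } agent_action;

typedef struct {
    const float *logprobs;       /* log-probabilities over the vocab for the next token */
    int          vocab_size;
    int          chosen_id;      /* token id the sampler picked */
    int          tokens_emitted; /* length of the current generation */
} treni_step_ctx;

/* Decide the next step from in-process signals: a low-confidence pick
 * triggers a resample, a long generation hands off to a tool. */
static agent_action agent_decide_next(const treni_step_ctx *ctx)
{
    float chosen_lp = ctx->logprobs[ctx->chosen_id];
    if (chosen_lp < logf(0.2f))     /* chosen token carries < 20% probability */
        return ACTION_RESAMPLE;
    if (ctx->tokens_emitted > 512)  /* arbitrary budget for the sketch */
        return ACTION_CALL_TOOL;
    return ACTION_CONTINUE;
}

int main(void)
{
    float lp[4] = { logf(0.05f), logf(0.10f), logf(0.70f), logf(0.15f) };
    treni_step_ctx ctx = { lp, 4, 2, 12 };
    printf("action = %d\n", agent_decide_next(&ctx));
    return 0;
}
```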
The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.
This is not just about speed. It's about aware generation.
## What We've Proven So Far
| Claim | Status | Key Number |
|---|---|---|
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.8 ms mean, 89.6 ms p99 |
| Internal routing beats external | Proven | 1.032x faster |
| Cold start manageable | Proven (after fix) | 15-117x speedup via index cache |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Next | Track C in progress |
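The mean and p99 figures in the table are plain order statistics over per-request wall-clock samples. The sketch below shows one way such numbers could be derived; it is not the Treni benchmark harness, and the sample values are made up.

```c
/* Minimal sketch: mean and p99 latency from per-request timings in ms.
 * Not the Treni harness; sample data is illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Note: sorts ms in place. p99 uses a nearest-rank-style index,
 * which is coarse for small sample counts. */
static void latency_stats(double *ms, size_t n, double *mean, double *p99)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += ms[i];
    *mean = sum / (double)n;

    qsort(ms, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(0.99 * (double)(n - 1));
    *p99 = ms[idx];
}

int main(void)
{
    double samples[] = { 78.1, 80.4, 79.9, 82.3, 88.7, 81.0, 79.2, 90.1 };
    double mean, p99;
    latency_stats(samples, sizeof samples / sizeof samples[0], &mean, &p99);
    printf("mean = %.1f ms, p99 = %.1f ms\n", mean, p99);
    return 0;
}
```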
## Reading Order
Start here, then follow the links in order:
- Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation (a minimal entropy sketch follows at the end of this list)
- Objectives and thesis — why a GPU agent that can feel beats one that gets told
- Findings changelog — what we discovered, in order
- Leaderboard — all the benchmark numbers
- Routing comparison — internal vs external routing breakdown
- Canonical G5 artifact set — the official reference run set
- Benchmark status — detailed completion status
- Raw artifacts — every JSON and report file
- TODO and next actions — what's coming next
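Since the entropy-guided loop is the core research thread, here is a small sketch of the underlying quantity: token-level entropy H = -Σ p_i log p_i over the softmax of a logit vector, the kind of in-process uncertainty signal such a loop could branch on. The function name, inputs, and comparison in `main` are assumptions for illustration, not the paper's actual criterion.

```c
/* Illustrative only: token-level entropy from a raw logit vector.
 * H = -sum_i p_i * log(p_i), with p = softmax(logits). */
#include <math.h>
#include <stdio.h>

static float token_entropy(const float *logits, int vocab_size)
{
    /* Numerically stable softmax: subtract the max logit before exponentiating. */
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float z = 0.0f;
    for (int i = 0; i < vocab_size; i++)
        z += expf(logits[i] - max_logit);

    float h = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        float p = expf(logits[i] - max_logit) / z;
        if (p > 0.0f) h -= p * logf(p);
    }
    return h;  /* in nats; higher entropy = a more uncertain next-token distribution */
}

int main(void)
{
    float confident[4] = { 8.0f, 0.5f, 0.1f, -1.0f };  /* peaked distribution */
    float uncertain[4] = { 1.0f, 1.0f, 1.0f,  1.0f };  /* uniform: H = ln(4) */
    printf("confident: H = %.3f nats\n", token_entropy(confident, 4));
    printf("uncertain: H = %.3f nats\n", token_entropy(uncertain, 4));
    return 0;
}
```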