Treni Experiment Docs

A unified GPU agent runtime that can feel, not just be told.

What Is Treni

Treni is a single C/CUDA binary that packs multiple ML models, routing, tokenization, and tool execution into one GPU process. Instead of an SDK agent calling remote tools and receiving serialized responses, the agent lives inside the runtime — it can see logprobs, feel tokenization quality, inspect model state, and adapt its next step inline.

The core hypothesis: an agent that can observe its own execution context makes better decisions than one that gets sent results.

This is not just about speed. It's about aware generation.
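Below is a minimal sketch of what "aware generation" means in practice. All names, the toy vocabulary, and the entropy threshold are hypothetical illustrations, not Treni's actual API: the point is that an in-process agent can read the next-token distribution directly and decide inline whether to keep generating or to invoke a tool, with no serialization round-trip.

```c
/*
 * Hypothetical sketch, not Treni's real interface: an agent step that
 * inspects the logprobs it just produced, computes the entropy of the
 * next-token distribution, and adapts its next action inline.
 */
#include <math.h>
#include <stdio.h>

#define VOCAB 4  /* tiny toy vocabulary for illustration */

/* Shannon entropy (in nats) of a probability distribution over the vocabulary. */
static double token_entropy(const double *probs, int n) {
    double h = 0.0;
    for (int i = 0; i < n; i++) {
        if (probs[i] > 0.0) {
            h -= probs[i] * log(probs[i]);
        }
    }
    return h;
}

int main(void) {
    /* Pretend these came straight from the model's softmax, still resident
     * in the same GPU process rather than serialized back to a remote caller. */
    double next_token_probs[VOCAB] = {0.40, 0.30, 0.20, 0.10};

    double h = token_entropy(next_token_probs, VOCAB);
    double threshold = 1.0;  /* assumed tuning knob, not taken from these docs */

    if (h > threshold) {
        printf("entropy %.3f > %.3f: high uncertainty, run a tool inline\n",
               h, threshold);
    } else {
        printf("entropy %.3f <= %.3f: confident, keep generating\n",
               h, threshold);
    }
    return 0;
}
```

The same check could just as easily gate a re-tokenization pass or a model switch; the design point is that the signal is observed in place rather than reported after the fact.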

What We've Proven So Far

| Claim | Status | Key Number |
| --- | --- | --- |
| Faster than Python baseline | Proven | 29x on warm path |
| Sub-100ms steady-state | Proven | 80.8 ms mean, 89.6 ms p99 |
| Internal routing beats external | Proven | 1.032x faster |
| Cold start manageable | Proven (after fix) | 15-117x speedup via index cache |
| Identical numerical outputs | Proven | 0 parity failures, strict mode |
| Aware generation improves loops | Next | Track C in progress |

Reading Order

Start here, then follow the links in order:

  1. Paper — Entropy-Guided Loop — the research foundation for uncertainty-aware generation
  2. Objectives and thesis — why a GPU agent that can feel beats one that gets told
  3. Findings changelog — what we discovered, in order
  4. Leaderboard — all the benchmark numbers
  5. Routing comparison — internal vs external routing breakdown
  6. Canonical G5 artifact set — the official reference run set
  7. Benchmark status — detailed completion status
  8. Raw artifacts — every JSON and report file
  9. TODO and next actions — what's coming next