Objectives and Thesis
Why a GPU agent that can feel its execution context beats one that gets sent results.
The Problem with External Agents
Today's agent architectures run on SDK frameworks (Anthropic, OpenAI, LangChain) that call registered tools — endpoints for models, databases, RAG, regression services. The agent receives serialized responses. It never sees:
- Tokenization quality — was the input tokenized well for this retrieval query?
- Logprobs / entropy — how confident is the model in this output? Should it branch or clarify?
- Model state — what does the regression output distribution look like before it's serialized to JSON?
- Timing and resource state — is this path slow because of a cold model or a bad query?
The agent is told things. It doesn't feel things.
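The gap is easy to see side by side. Below is a minimal, hypothetical sketch (the field names and the ExecutionContext type are illustrative, not Treni's API): the serialized payload an SDK agent is handed, next to the execution-state signals that never leave the runtime.

```python
# Hypothetical sketch: what an SDK agent receives over the wire vs. the
# signals an in-process agent could inspect. Names are illustrative only.
from dataclasses import dataclass, field

# What a registered RAG tool typically returns after serialization:
serialized_response = {
    "results": ["doc_41", "doc_07", "doc_93"],
    "scores": [0.82, 0.61, 0.58],
}

@dataclass
class ExecutionContext:
    """Signals visible only inside the runtime, lost once the payload is JSON."""
    token_ids: list[int] = field(default_factory=list)         # how the query tokenized
    logprobs: list[float] = field(default_factory=list)        # per-token confidence
    embed_distances: list[float] = field(default_factory=list)  # retrieval quality
    warm: bool = True                                           # cold vs. warm model path
    latency_ms: float = 0.0                                     # where the time went
```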
The Thesis: Feel vs Be Told
A GPU agent that runs inside its own runtime — with models, tools, state, and observability packed into one process — can make fundamentally better decisions because it has direct access to its own execution context.
Concrete Example
An SDK agent calls a RAG endpoint. The RAG pipeline tokenizes the query, runs retrieval, returns top-k results as JSON. The agent sees the results but has no idea if the tokenization was appropriate — maybe a medical term was split badly, maybe a code identifier was mangled. It can't fix what it can't see.
A GPU agent running the same RAG in-process can see the tokenizer output, check the embedding distances, notice the retrieval quality is low, and immediately re-tokenize or run a direct search — all without a network round trip, all within the same execution context.
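A minimal sketch of what that in-process correction loop could look like. The helper callables and the distance threshold are hypothetical placeholders injected as parameters, not Treni's actual API; they stand in for whatever tokenizer, embedder, and index the runtime hosts.

```python
# Sketch of an in-process retrieval-correction loop. All callables and the
# 0.35 threshold are hypothetical; retrieve(vec, k) is assumed to return a
# list of dicts with at least a "distance" field.

def answer_with_rag(query, tokenize, embed, retrieve, direct_search,
                    retokenize_fallback, max_distance=0.35):
    tokens = tokenize(query)              # raw token ids are visible, not hidden
    hits = retrieve(embed(tokens), k=5)   # embedding distances are visible too

    # Retrieval quality is directly observable: if even the best hit is far
    # away in embedding space, the tokenizer may have mangled a key term.
    if min(h["distance"] for h in hits) > max_distance:
        tokens = retokenize_fallback(query)   # e.g. byte-level or domain vocab
        hits = retrieve(embed(tokens), k=5)

    # Still poor? Fall back to a direct (e.g. lexical) search. No network
    # round trip, no re-serialization, same execution context throughout.
    if min(h["distance"] for h in hits) > max_distance:
        hits = direct_search(query)

    return hits
```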
This is what the Entropy-Guided Loop paper formalizes: using token-level uncertainty (Shannon entropy, perplexity) as a feedback signal for targeted corrections. In the paper, this achieves ~95% of larger reasoning models' performance at ~1/3 the cost, with a 16pp accuracy improvement when the loop activates.
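The signals themselves are cheap to compute once logprobs are visible. The sketch below assumes per-step log-probabilities (full vocabulary or top-k, in which case the entropy is an approximation) are available, and computes the two quantities the loop gates on.

```python
# Minimal sketch of the token-level signals an entropy-guided loop uses:
# Shannon entropy of each next-token distribution and sequence perplexity.
import math

def step_entropy(logprobs_over_vocab):
    """Shannon entropy H = -sum(p * ln p) of one next-token distribution."""
    return -sum(math.exp(lp) * lp for lp in logprobs_over_vocab)

def sequence_perplexity(chosen_token_logprobs):
    """Perplexity = exp(mean negative log-likelihood of the emitted tokens)."""
    nll = -sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    return math.exp(nll)

# A step where probability mass is spread out (high entropy) is a candidate
# for a targeted correction; a peaked step is not.
flat = [math.log(0.25)] * 4                                          # 4 candidates at 25%
peaked = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]
print(step_entropy(flat), step_entropy(peaked))                      # ~1.39 vs ~0.17
```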
Treni is the runtime that makes this loop native — not a post-hoc wrapper, but the architecture itself.
What Must Be Proven (In Order)
Track A: Speed (Proven)
The unified runtime is 29x faster than the Python baseline on warm paths, with sub-100ms p99 latency serving 4 models from a single binary. Cold starts are manageable after the tensor index cache fix (15-117x speedup).
Track B: Routing (Proven)
Internal routing beats external routing on every matched task. The overhead of network hops and serialization is real and measurable, even on the same host.
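As a back-of-the-envelope illustration (not the Track B benchmark itself), the sketch below measures only the JSON encode/decode cost of a hypothetical tool payload. An external route pays this on every hop, on top of the network round trip; an internal route pays neither.

```python
# Micro-measurement of pure (de)serialization cost per hop. Payload shape and
# size are hypothetical; the network round trip itself adds further latency,
# even on localhost.
import json, time

payload = {"results": [{"id": i, "text": "x" * 512, "score": 0.5} for i in range(50)]}

t0 = time.perf_counter()
for _ in range(1_000):
    json.loads(json.dumps(payload))   # one encode + decode per simulated hop
per_hop_ms = (time.perf_counter() - t0) / 1_000 * 1e3
print(f"~{per_hop_ms:.3f} ms of serialization per hop for this payload")
```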
Track C: Awareness (Next)
This is the thesis that matters most. Speed and routing are table stakes — they prove the architecture works. The real question is whether an agent that can see logprobs, entropy, tokenization state, and model internals makes better decisions in multi-step loops:
- Retrieval correction: bad tokenization → detect via embedding quality → re-query
- Tool-state adaptation: regression output changes next reasoning step because the agent sees the distribution, not just the prediction
- Confidence-gated branching: high entropy on a generation step → branch to clarification instead of committing (sketched after this list)
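A minimal sketch of that third loop, assuming hypothetical generate/clarify/commit callables and a placeholder entropy threshold; the gate reuses the per-step entropy shown earlier.

```python
# Sketch of confidence-gated branching. The callables and the threshold are
# hypothetical placeholders to be tuned per task, not Treni's actual API.

HIGH_ENTROPY = 2.5  # nats; placeholder threshold

def gated_step(generate_step, clarify, commit):
    """generate_step() -> (text, mean_step_entropy); clarify/commit are callables."""
    text, mean_entropy = generate_step()
    if mean_entropy > HIGH_ENTROPY:
        # Uncertain generation: branch to clarification (or re-retrieval)
        # instead of committing a low-confidence answer to the next step.
        return clarify(text)
    return commit(text)
```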
This is what separates Treni from "just another fast inference server."
Guardrails
- No blended or proxy metrics presented as final truth.
- Same tasks, budgets, and hardware class for all comparisons.
- Raw artifacts and exact commands available for every headline number.
- Qualitative claims must be backed by trace-level evidence, not just aggregate metrics.