Findings Changelog

Dated summary of major experiment findings and interpretation.

At A Glance

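Warm request path on G5 is stable and fast in the current runtime (~80.8 ms mean, ~89.6 ms p99).
Internal routing beats external routing by about 3% on matched benchmark tasks (94.8 ms vs 97.9 ms mean).
Cold start had a major bottleneck; caching the tensor lookup index removed most of it (15x to 118x faster cold TTFT on qwen, donut, and bart).
Most of the remaining cold cost is in the Qwen first-hit path (~1.77 s cold TTFT after the index cache).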

Timeline

Latest Key Numbers

Warm Path (G5)

Warm steady-state request mean: ~80.8 ms
Warm steady-state p99: ~89.6 ms
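
For readers reproducing the warm-path summary, the sketch below shows one way to reduce a list of per-request latencies to a mean and a p99. The sample values are hypothetical placeholders, not measured data, and the nearest-rank percentile method is an assumption; this page does not state which percentile method the benchmark harness uses.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical warm request latencies in ms (illustrative only, not measured).
latencies_ms = [79.8, 80.1, 80.4, 80.6, 80.9, 81.0, 81.3, 81.7, 82.2, 89.5]

mean_ms = statistics.fmean(latencies_ms)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean: {mean_ms:.1f} ms, p99: {p99_ms:.1f} ms")
```

With only a handful of samples the p99 is simply the slowest request, so a meaningful p99 needs a few hundred steady-state requests at minimum.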

Routing (Internal vs External, G5)

Internal mean: 94.849 ms
External mean: 97.927 ms
External/Internal: 1.032x (internal faster)
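
As a quick check on the ratio, the snippet below derives the absolute gap, the External/Internal ratio, and the relative latency reduction from the two reported means. The variable names are illustrative and not taken from the benchmark code.

```python
# Mean latencies in ms, copied from the routing results above.
internal_mean_ms = 94.849
external_mean_ms = 97.927

# Absolute gap between the two routing modes.
gap_ms = external_mean_ms - internal_mean_ms

# Ratio as reported above (external divided by internal).
ratio = external_mean_ms / internal_mean_ms

# Relative latency reduction of internal routing versus external routing.
reduction_pct = 100 * gap_ms / external_mean_ms

print(f"gap: {gap_ms:.3f} ms, ratio: {ratio:.3f}x, reduction: {reduction_pct:.2f}%")
```

The gap is about 3.1 ms per request, which is roughly a 3.1% reduction relative to external routing.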

Cold TTFT Before vs After Index Cache (3-run means, G5)

Model	Before (ms)	After (ms)	Speedup
qwen	27574.564	1774.951	15.535x
donut	67360.388	572.485	117.663x
bart	77520.798	743.652	104.243x
minilm	23.342	22.698	1.028x
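
The Speedup column is the before mean divided by the after mean. A minimal sketch that recomputes it from the table values (the dictionary layout and the function name are illustrative, not part of the benchmark tooling):

```python
# Cold TTFT 3-run means in ms, copied from the table above: (before, after).
cold_ttft_ms = {
    "qwen": (27574.564, 1774.951),
    "donut": (67360.388, 572.485),
    "bart": (77520.798, 743.652),
    "minilm": (23.342, 22.698),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the 'after' measurement is."""
    return before_ms / after_ms


speedups = {
    model: round(speedup(before, after), 3)
    for model, (before, after) in cold_ttft_ms.items()
}

for model, factor in speedups.items():
    print(f"{model}: {factor:.3f}x")
```

The minilm row is effectively unchanged (about 1.03x), which suggests its cold path was not dominated by the tensor lookup that the index cache addresses.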

What Was Actually Tested

Baseline (Python/dependency path) runs on T4 and G5.
Runtime cold and warm request-path benchmarks.
Time to first token (TTFT) as reported by the runtime itself, not a proxy based on the first SSE event.
Internal-vs-external routing comparison on matched tasks.
Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).

What Is Not Finished Yet

Phase 3 agentic loop capability study (retrieval correction, tool-state adaptation, confidence-gated branching).
A100/H100 reruns from the original expansion phase.
Paper-grade figures package.

Raw Artifact Links
