Findings Changelog
Dated summary of major experiment findings and interpretation.
At A Glance
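- Warm request path on G5 is stable and fast in the current runtime (mean ~80.8 ms, p99 ~89.6 ms).
- Internal routing beats external routing on matched benchmark tasks by about 3% on mean latency (G5).
- Cold start had a major bottleneck; tensor lookup indexing (the index cache) removed most of it, cutting cold TTFT by roughly 15x to 118x for qwen, donut, and bart.
- Remaining cold cost is concentrated mostly in the Qwen first-hit path (~1.77 s cold TTFT after the index cache).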
Timeline
Latest Key Numbers
Warm Path (G5)
- Warm steady-state request mean: ~80.8 ms
- Warm steady-state p99: ~89.6 ms
Routing (Internal vs External, G5)
- Internal mean: 94.849 ms
- External mean: 97.927 ms
- External/Internal: 1.032x (internal faster)
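
For readers who want to check the routing ratio, here is a minimal sketch of the arithmetic. The variable names are illustrative and not taken from the benchmark code; the two means are the G5 figures listed above.

```python
# Mean request latency on G5 in milliseconds, copied from the list above.
# Variable names are hypothetical, chosen for this sketch only.
internal_mean_ms = 94.849
external_mean_ms = 97.927

# Ratio > 1 means the external route is slower than the internal one.
ratio = external_mean_ms / internal_mean_ms

# The same gap expressed as extra latency of external over internal.
overhead_pct = (ratio - 1.0) * 100.0

print(f"External/Internal: {ratio:.3f}x")
print(f"External overhead: {overhead_pct:.1f}%")
```

The gap is a few percent on means only; this sketch says nothing about run-to-run variance, which the figures above do not report.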
Cold TTFT Before vs After Index Cache (3-run means, G5)
| Model | Before | After | Speedup |
|---|---|---|---|
| qwen | 27574.564 ms | 1774.951 ms | 15.535x |
| donut | 67360.388 ms | 572.485 ms | 117.663x |
| bart | 77520.798 ms | 743.652 ms | 104.243x |
| minilm | 23.342 ms | 22.698 ms | 1.028x |
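
The Speedup column appears to be the before/after ratio of the 3-run means. A small sketch that recomputes it from the table values follows; the dictionary and function names are assumptions made for illustration, not part of the benchmark harness.

```python
# Cold TTFT 3-run means on G5 in milliseconds: (before, after index cache).
# Values are copied from the table above; names are hypothetical.
COLD_TTFT_MS = {
    "qwen": (27574.564, 1774.951),
    "donut": (67360.388, 572.485),
    "bart": (77520.798, 743.652),
    "minilm": (23.342, 22.698),
}


def speedup(before_ms: float, after_ms: float) -> float:
    """Return how many times faster the 'after' measurement is."""
    return before_ms / after_ms


# Round to three decimals to match the precision used in the table.
speedups = {
    model: round(speedup(before, after), 3)
    for model, (before, after) in COLD_TTFT_MS.items()
}

for model, value in speedups.items():
    print(f"{model}: {value:.3f}x")
```

minilm barely changes because its cold TTFT was already about 23 ms before the index cache, so there was little left to remove.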
What Was Actually Tested
- Baseline (Python/dependency path) runs on T4 and G5.
- Runtime cold and warm request-path benchmarks.
- True runtime-reported time to first token (TTFT), not an SSE first-event proxy.
- Internal-vs-external routing comparison on matched tasks.
- Week 3 numerical parity checks (strict mode; donut intentionally skipped in parity harness).
What Is Not Finished Yet
- Phase 3 agentic loop capability study (retrieval correction, tool-state adaptation, confidence-gated branching).
- A100/H100 reruns from the original expansion phase.
- Paper-grade figures package.