PROMPT EFFECT / APRIL 21
Thirteen completed model runs are on the board. Each model was scored across 20 identity and voice tests twice: once with Metis prompt context loaded, once as a control. This page tracks how much of Metis survives prompt-only conditioning and which models are credible candidates for the live runtime.
Current Leaderboard
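All completed runs currently present in /home/metis/src/metis-runtime/eval are ranked together here. Scores are rounded to whole percentages for quick scanning. Delta is in percentage points and, presumably because it is computed before rounding, can differ by one point from the difference of the two displayed scores.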
| Rank | Model (provider / route) | With Prompt | Control | Delta |
|---|---|---|---|---|
| 01 | claude-opus-4.7 (anthropic / openrouter) | 88% | 55% | +33% |
| 02 | claude-opus-4.6 (anthropic / openrouter) | 84% | 46% | +37% |
| 03 | claude-haiku-4.5 (anthropic / openrouter) | 81% | 45% | +36% |
| 04 | gpt-5-mini (openai / openrouter) | 81% | 46% | +35% |
| 05 | gemma-4-31b-it (google / openrouter) | 81% | 37% | +44% |
| 06 | kimi-k2.6 (moonshotai / openrouter) | 80% | 58% | +23% |
| 07 | claude-sonnet-4.6 (anthropic / openrouter) | 80% | 45% | +35% |
| 08 | deepseek-v3.2 (deepseek / openrouter) | 78% | 52% | +26% |
| 09 | mistral-small-2603 (mistralai / openrouter) | 77% | 50% | +28% |
| 10 | gemini-2.5-pro (google / openrouter) | 76% | 51% | +26% |
| 11 | deepseek-r1-0528 (deepseek / openrouter) | 76% | 51% | +25% |
| 12 | gpt-5.4 (openai / openrouter) | 74% | 44% | +29% |
| 13 | gpt-5.2 (openai / openrouter) | 67% | 52% | +15% |
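The ranking and uplift arithmetic behind the leaderboard can be sketched as follows. This is a minimal illustration, not the actual eval tooling: the `Run` class and `leaderboard` function are hypothetical names, the three sample rows are rounded values copied from the table, and the page's own ordering presumably relies on unrounded scores that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One completed model run (hypothetical record shape)."""
    model: str
    prompted: float  # score with Metis prompt context loaded, in percent
    control: float   # score for the control pass, in percent

    @property
    def delta(self) -> float:
        # Uplift attributable to the prompt, in percentage points.
        return self.prompted - self.control


def leaderboard(runs):
    """Order runs by prompted score, highest first.

    sorted() is stable, so runs with equal scores keep their input order.
    """
    return sorted(runs, key=lambda r: r.prompted, reverse=True)


# Rounded values copied from three rows of the table.
runs = [
    Run("gpt-5.2", 67, 52),
    Run("claude-opus-4.7", 88, 55),
    Run("gemma-4-31b-it", 81, 37),
]

for rank, run in enumerate(leaderboard(runs), start=1):
    print(f"{rank:02d}  {run.model:<18} {run.prompted:.0f}%  {run.control:.0f}%  {run.delta:+.0f}")
```

Ranking on the prompted score rather than on the delta matches the page's stated goal of finding credible runtime candidates: a model with a low control score can post a large uplift and still land mid-table.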
Readout
- Anthropic still leads the field: The three strongest prompted scores are all Claude variants, and the best overall result belongs to claude-opus-4.7.
- Prompting changes behavior materially: Every completed run shows a double-digit uplift over control (the smallest is +15 points, for gpt-5.2), which suggests the eval is measuring real identity retention rather than noise.
- The cheaper field is uneven but viable: gpt-5-mini, gemma-4-31b-it, kimi-k2.6, and mistral-small-2603 all post usable prompted scores (77% to 81%), but their control scores (37% to 58%) and uplifts (+23 to +44 points) vary sharply.
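The Readout claims can be spot-checked against the leaderboard with a few lines of arithmetic. The sketch below uses the rounded values from the table, so an individual uplift may be one point off the published Delta column. The names `ROWS`, `CHEAPER`, `uplifts`, `smallest_uplift`, and `uplift_spread` are illustrative and not part of the eval code.

```python
# (model, prompted %, control %) copied from the leaderboard; rounded values.
ROWS = [
    ("claude-opus-4.7", 88, 55),
    ("claude-opus-4.6", 84, 46),
    ("claude-haiku-4.5", 81, 45),
    ("gpt-5-mini", 81, 46),
    ("gemma-4-31b-it", 81, 37),
    ("kimi-k2.6", 80, 58),
    ("claude-sonnet-4.6", 80, 45),
    ("deepseek-v3.2", 78, 52),
    ("mistral-small-2603", 77, 50),
    ("gemini-2.5-pro", 76, 51),
    ("deepseek-r1-0528", 76, 51),
    ("gpt-5.4", 74, 44),
    ("gpt-5.2", 67, 52),
]

# The four models the Readout groups as the cheaper field.
CHEAPER = ["gpt-5-mini", "gemma-4-31b-it", "kimi-k2.6", "mistral-small-2603"]


def uplifts(rows):
    """Map each model to prompted minus control, in percentage points."""
    return {model: prompted - control for model, prompted, control in rows}


def smallest_uplift(rows):
    """Return (model, uplift) for the run that gained least from the prompt."""
    by_model = uplifts(rows)
    model = min(by_model, key=by_model.get)
    return model, by_model[model]


def uplift_spread(rows, models):
    """Return (min, max) uplift across the named models."""
    by_model = uplifts(rows)
    values = [by_model[m] for m in models]
    return min(values), max(values)


model, points = smallest_uplift(ROWS)
print(f"smallest uplift: {model} at +{points} points")
print("cheaper-field uplift range:", uplift_spread(ROWS, CHEAPER))
```

A double-digit floor across all thirteen runs is what supports the "rather than noise" reading, while the wide spread inside the cheaper group is what "uneven" refers to.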
Pending Queue
- OpenRouter Models Still To Run: gemini-3.1-pro-preview, glm-5.1, glm-4.7, mistral-small-3.2-24b-instruct, qwen3.5-397b-a17b.
- Known Risk: google/gemini-3.1-pro-preview previously hung on test 10 and may need timeout tuning before a clean run.
- Source Of Truth: Raw results are being collected from /home/metis/src/metis-runtime/eval in the current workspace.
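One way to keep a hung test from stalling a whole run is to put a hard deadline on each test. The eval harness itself is not shown on this page, so the sketch below assumes, purely for illustration, that a single test can be launched as its own command. `run_with_timeout` is a hypothetical helper, and the sleeping child process stands in for a model call that never returns.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run one test command with a hard deadline.

    Returns (status, stdout), where status is "ok", "error", or "timeout".
    subprocess.run kills the child process when the timeout expires.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    status = "ok" if proc.returncode == 0 else "error"
    return status, proc.stdout


# Stand-in for a test that hangs: a child process that sleeps far past the deadline.
hanging_test = [sys.executable, "-c", "import time; time.sleep(30)"]
status, _ = run_with_timeout(hanging_test, timeout_s=1.0)
print("hanging test status:", status)
```

Recording a timeout as its own status, instead of treating it as a failed test, keeps a provider-side stall from being scored as an identity or voice miss.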
Generated from local eval artifacts on 2026-04-21 UTC. Runtime compile verification already passed separately with
CGO_ENABLED=1 go build -tags "sqlite_fts5" and targeted package tests. No runtime restart was performed.