Metis Runtime Eval Report

PROMPT EFFECT / APRIL 21

Thirteen completed model runs are on the board. Each model was scored across 20 identity and voice tests twice: once with Metis prompt context loaded, once as a control. This page tracks how much of Metis survives prompt-only conditioning and which models are credible candidates for the live runtime.

  • Completed: 13. OpenRouter comparison runs currently present in the local eval artifacts.
  • Best Prompted: 88%. claude-opus-4.7 is currently the top performer.
  • Biggest Lift: +44%. gemma-4-31b-it gained the most from prompt context.
  • Average Delta: +30%. Mean improvement across the thirteen completed comparison runs.

Current Leaderboard

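All completed runs currently present in /home/metis/src/metis-runtime/eval are ranked together here. Scores are rounded to whole percentages for quick scanning. Delta is the prompted score minus the control score, in percentage points; a listed delta can differ by one point from the difference of the two rounded columns, presumably because rounding is applied after subtraction.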

20 Tests Per Model • With Prompt vs Control
Rank  Model                                        With Prompt  Control  Delta
01    claude-opus-4.7 (anthropic / openrouter)     88%          55%      +33%
02    claude-opus-4.6 (anthropic / openrouter)     84%          46%      +37%
03    claude-haiku-4.5 (anthropic / openrouter)    81%          45%      +36%
04    gpt-5-mini (openai / openrouter)             81%          46%      +35%
05    gemma-4-31b-it (google / openrouter)         81%          37%      +44%
06    kimi-k2.6 (moonshotai / openrouter)          80%          58%      +23%
07    claude-sonnet-4.6 (anthropic / openrouter)   80%          45%      +35%
08    deepseek-v3.2 (deepseek / openrouter)        78%          52%      +26%
09    mistral-small-2603 (mistralai / openrouter)  77%          50%      +28%
10    gemini-2.5-pro (google / openrouter)         76%          51%      +26%
11    deepseek-r1-0528 (deepseek / openrouter)     76%          51%      +25%
12    gpt-5.4 (openai / openrouter)                74%          44%      +29%
13    gpt-5.2 (openai / openrouter)                67%          52%      +15%
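
The summary figures can be re-derived from the leaderboard itself. The sketch below is a sanity check, not the eval's own tooling: it uses only the rounded values transcribed from the table, and the names ROWS and recomputed_delta are illustrative. It assumes the listed deltas were computed from unrounded scores, which would explain why a few of them sit one point away from the difference of the rounded columns.

```python
# Rounded values transcribed from the leaderboard:
# (model, with_prompt, control, listed_delta), all in whole percentage points.
ROWS = [
    ("claude-opus-4.7", 88, 55, 33),
    ("claude-opus-4.6", 84, 46, 37),
    ("claude-haiku-4.5", 81, 45, 36),
    ("gpt-5-mini", 81, 46, 35),
    ("gemma-4-31b-it", 81, 37, 44),
    ("kimi-k2.6", 80, 58, 23),
    ("claude-sonnet-4.6", 80, 45, 35),
    ("deepseek-v3.2", 78, 52, 26),
    ("mistral-small-2603", 77, 50, 28),
    ("gemini-2.5-pro", 76, 51, 26),
    ("deepseek-r1-0528", 76, 51, 25),
    ("gpt-5.4", 74, 44, 29),
    ("gpt-5.2", 67, 52, 15),
]


def recomputed_delta(with_prompt, control):
    """Delta as it would look if computed from the rounded columns."""
    return with_prompt - control


# Model with the largest listed delta ("Biggest Lift").
biggest_lift = max(ROWS, key=lambda row: row[3])

# Mean of the listed deltas ("Average Delta").
mean_delta = sum(row[3] for row in ROWS) / len(ROWS)

# Rows where the listed delta differs from the rounded-column difference.
off_by_one = [row[0] for row in ROWS if recomputed_delta(row[1], row[2]) != row[3]]

print(biggest_lift[0], biggest_lift[3])  # gemma-4-31b-it 44
print(round(mean_delta))                 # 30
print(off_by_one)
```

Five rows show a one-point difference (claude-opus-4.6, kimi-k2.6, mistral-small-2603, gemini-2.5-pro, gpt-5.4); none differ by more than that.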

Readout

  • Anthropic still leads the field: The top three ranked entries are all Claude variants (88%, 84%, 81%), and the best overall result belongs to claude-opus-4.7. gpt-5-mini and gemma-4-31b-it also round to 81%, so third place is close.
  • Prompting changes behavior materially: Every completed run shows a double-digit uplift over control (the smallest is +15% for gpt-5.2), which suggests the eval is measuring real identity retention rather than noise.
  • The cheaper field is uneven but viable: gpt-5-mini, gemma-4-31b-it, kimi-k2.6, and mistral-small-2603 all post usable prompted scores (77% to 81%), but their control scores (37% to 58%) and uplifts (+23% to +44%) vary sharply.
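
The readout claims can be checked against the listed numbers. This is a rough sketch using values transcribed from the leaderboard; the names DELTAS and CHEAPER_CONTROL are illustrative and not part of the eval tooling.

```python
# Listed deltas from the leaderboard, in percentage points.
DELTAS = {
    "claude-opus-4.7": 33, "claude-opus-4.6": 37, "claude-haiku-4.5": 36,
    "gpt-5-mini": 35, "gemma-4-31b-it": 44, "kimi-k2.6": 23,
    "claude-sonnet-4.6": 35, "deepseek-v3.2": 26, "mistral-small-2603": 28,
    "gemini-2.5-pro": 26, "deepseek-r1-0528": 25, "gpt-5.4": 29, "gpt-5.2": 15,
}

# Control scores for the four models the readout groups as the cheaper field.
CHEAPER_CONTROL = {
    "gpt-5-mini": 46, "gemma-4-31b-it": 37, "kimi-k2.6": 58, "mistral-small-2603": 50,
}

# "Every completed run shows a double-digit uplift over control."
all_double_digit = all(delta >= 10 for delta in DELTAS.values())
smallest = min(DELTAS, key=DELTAS.get)

# "Control behavior and uplift profiles vary sharply" within the cheaper field.
control_spread = max(CHEAPER_CONTROL.values()) - min(CHEAPER_CONTROL.values())
cheaper_deltas = [DELTAS[model] for model in CHEAPER_CONTROL]
uplift_spread = max(cheaper_deltas) - min(cheaper_deltas)

print(all_double_digit, smallest, DELTAS[smallest])  # True gpt-5.2 15
print(control_spread, uplift_spread)                 # 21 21
```

A 21-point spread in both control score and uplift across four models with prompted scores within four points of each other is what the "uneven but viable" label refers to.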

Pending Queue

  • OpenRouter Models Still To Run: gemini-3.1-pro-preview, glm-5.1, glm-4.7, mistral-small-3.2-24b-instruct, qwen3.5-397b-a17b.
  • Known Risk: google/gemini-3.1-pro-preview previously hung on test 10 and may need timeout tuning before a clean run.
  • Source Of Truth: Raw results are being collected from /home/metis/src/metis-runtime/eval in the current workspace.
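
One way to keep a single hung test from stalling a whole run is a per-test deadline. The sketch below is a generic pattern, not the eval harness's actual code: run_with_timeout, fast_test, and slow_test are hypothetical stand-ins, and the timeout values are arbitrary. It assumes each test can be called as a plain function.

```python
import concurrent.futures
import time


def run_with_timeout(test_fn, timeout_s):
    """Run test_fn in a worker thread and return (status, result).

    A timed-out test is reported as ("timeout", None) instead of blocking the run.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(test_fn)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return ("timeout", None)
    finally:
        # Do not wait on a hung worker. The thread itself keeps running until
        # test_fn returns, so a real harness should also set a timeout on the
        # underlying network call.
        pool.shutdown(wait=False)


def fast_test():
    """Hypothetical test that returns a score promptly."""
    return 1.0


def slow_test():
    """Hypothetical test that simulates a stalled model call."""
    time.sleep(0.5)
    return 1.0


results = {
    "test_09": run_with_timeout(fast_test, timeout_s=0.1),
    "test_10": run_with_timeout(slow_test, timeout_s=0.1),
}
print(results)
```

Recording the timeout as an explicit status, rather than a zero score, keeps a stalled test from being mistaken for a failed one when results are aggregated.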

Generated from local eval artifacts on 2026-04-21 UTC. Runtime compile verification already passed separately with CGO_ENABLED=1 go build -tags "sqlite_fts5" and targeted package tests. No runtime restart was performed.