Metis Runtime Eval Report

PROMPT EFFECT / APRIL 21

Thirteen completed model runs are on the board. Each model was scored across 20 identity and voice tests twice: once with Metis prompt context loaded, once as a control. This page tracks how much of Metis survives prompt-only conditioning and which models are credible candidates for the live runtime.

Completed

13 Completed OpenRouter comparison runs currently present in the local eval artifacts.

Best Prompted

88% claude-opus-4.7 is currently the top performer.

Biggest Lift

+44% gemma-4-31b-it gained the most from prompt context.

Average Delta

+30% Mean improvement across the thirteen completed comparison runs.
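
The two delta cards above can be rechecked from the Delta column of the leaderboard on this page. The sketch below assumes the listed whole-number deltas are the inputs; the `DELTAS` mapping and the `headline_stats` helper are illustrative names, not part of the Metis eval tooling.

```python
from statistics import mean

# Listed deltas (percentage points), copied from the leaderboard on this page.
DELTAS = {
    "claude-opus-4.7": 33,
    "claude-opus-4.6": 37,
    "claude-haiku-4.5": 36,
    "gpt-5-mini": 35,
    "gemma-4-31b-it": 44,
    "kimi-k2.6": 23,
    "claude-sonnet-4.6": 35,
    "deepseek-v3.2": 26,
    "mistral-small-2603": 28,
    "gemini-2.5-pro": 26,
    "deepseek-r1-0528": 25,
    "gpt-5.4": 29,
    "gpt-5.2": 15,
}


def headline_stats(deltas):
    """Return (model with the largest delta, mean delta) for a {model: delta} map."""
    biggest = max(deltas, key=deltas.get)
    return biggest, mean(deltas.values())


biggest, avg = headline_stats(DELTAS)
print(f"biggest lift: {biggest} (+{DELTAS[biggest]})")
print(f"average delta: +{avg:.1f} over {len(DELTAS)} runs")
```

Averaging the rounded deltas gives a value that rounds to the +30% shown on the card; the card itself may have been computed from unrounded scores.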

Current Leaderboard

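All completed runs currently present in /home/metis/src/metis-runtime/eval are ranked together here. Scores are rounded to whole percentages for quick scanning. Delta is the prompted score minus the control score, in percentage points; a listed delta can differ by one point from the difference of the two rounded columns, presumably because it was computed before rounding.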

20 Tests Per Model • With Prompt vs Control

Rank	Model	With Prompt	Control	Delta
01	claude-opus-4.7 (anthropic / openrouter)	88%	55%	+33%
02	claude-opus-4.6 (anthropic / openrouter)	84%	46%	+37%
03	claude-haiku-4.5 (anthropic / openrouter)	81%	45%	+36%
04	gpt-5-mini (openai / openrouter)	81%	46%	+35%
05	gemma-4-31b-it (google / openrouter)	81%	37%	+44%
06	kimi-k2.6 (moonshotai / openrouter)	80%	58%	+23%
07	claude-sonnet-4.6 (anthropic / openrouter)	80%	45%	+35%
08	deepseek-v3.2 (deepseek / openrouter)	78%	52%	+26%
09	mistral-small-2603 (mistralai / openrouter)	77%	50%	+28%
10	gemini-2.5-pro (google / openrouter)	76%	51%	+26%
11	deepseek-r1-0528 (deepseek / openrouter)	76%	51%	+25%
12	gpt-5.4 (openai / openrouter)	74%	44%	+29%
13	gpt-5.2 (openai / openrouter)	67%	52%	+15%
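
In several rows the listed Delta is one point off from With Prompt minus Control. The likely cause is that deltas were computed from unrounded scores, although the report does not say so. The sketch below flags those rows; `ROWS` is copied from the table, and `rounding_mismatches` is a hypothetical helper name.

```python
# (model, with_prompt, control, listed_delta) in percent, copied from the table.
ROWS = [
    ("claude-opus-4.7", 88, 55, 33),
    ("claude-opus-4.6", 84, 46, 37),
    ("claude-haiku-4.5", 81, 45, 36),
    ("gpt-5-mini", 81, 46, 35),
    ("gemma-4-31b-it", 81, 37, 44),
    ("kimi-k2.6", 80, 58, 23),
    ("claude-sonnet-4.6", 80, 45, 35),
    ("deepseek-v3.2", 78, 52, 26),
    ("mistral-small-2603", 77, 50, 28),
    ("gemini-2.5-pro", 76, 51, 26),
    ("deepseek-r1-0528", 76, 51, 25),
    ("gpt-5.4", 74, 44, 29),
    ("gpt-5.2", 67, 52, 15),
]


def rounding_mismatches(rows):
    """Return (model, recomputed_delta, listed_delta) where the two disagree."""
    return [(m, w - c, d) for m, w, c, d in rows if w - c != d]


mismatches = rounding_mismatches(ROWS)
for model, recomputed, listed in mismatches:
    print(f"{model}: rounded columns give +{recomputed}, table lists +{listed}")
```

Every flagged row is off by exactly one point, which is consistent with rounding applied after subtraction. A larger gap would point to a transcription error instead.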

Readout

Anthropic still leads the field: The top three ranked prompted scores are all Claude variants, and the best overall result belongs to claude-opus-4.7. At whole-percent rounding, gpt-5-mini and gemma-4-31b-it tie claude-haiku-4.5 at 81%.
Prompting changes behavior materially: Every completed run shows a double-digit uplift over control, from +15% (gpt-5.2) to +44% (gemma-4-31b-it), which suggests the eval is measuring real identity retention rather than noise.
The cheaper field is uneven but viable: gpt-5-mini, gemma-4-31b-it, kimi-k2.6, and mistral-small-2603 all post usable prompted scores (77% to 81%), but their control scores (37% to 58%) and uplifts (+23% to +44%) vary sharply.
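
The "uneven but viable" reading of the cheaper field can be made concrete by comparing the range of each column for those four models. This is a minimal sketch using the rounded leaderboard values; the grouping comes from the readout, while `CHEAPER` and `spread` are illustrative names.

```python
# (with_prompt, control, delta) in percent, copied from the leaderboard.
CHEAPER = {
    "gpt-5-mini": (81, 46, 35),
    "gemma-4-31b-it": (81, 37, 44),
    "kimi-k2.6": (80, 58, 23),
    "mistral-small-2603": (77, 50, 28),
}


def spread(values):
    """Return (min, max) of an iterable of numbers."""
    values = list(values)
    return min(values), max(values)


prompted = spread(w for w, _, _ in CHEAPER.values())
control = spread(c for _, c, _ in CHEAPER.values())
uplift = spread(d for _, _, d in CHEAPER.values())

for label, (lo, hi) in [("prompted", prompted), ("control", control), ("uplift", uplift)]:
    print(f"{label}: {lo} to {hi} (range {hi - lo} points)")
```

Prompted scores sit within a few points of each other, while control scores and uplifts each span roughly twenty points. That is the pattern the readout describes.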

Pending Queue

OpenRouter Models Still To Run: gemini-3.1-pro-preview, glm-5.1, glm-4.7, mistral-small-3.2-24b-instruct, qwen3.5-397b-a17b.
Known Risk: google/gemini-3.1-pro-preview previously hung on test 10 and may need timeout tuning before a clean run.
Source Of Truth: Raw results are being collected from /home/metis/src/metis-runtime/eval in the current workspace.
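
For the hang noted under Known Risk, one generic mitigation is a per-test deadline, so that a stalled test is recorded as a timeout instead of blocking the whole run. The sketch below is not the Metis harness: `run_with_timeout`, `fast_test`, and `slow_test` are hypothetical stand-ins, and it assumes each test can be called as a plain function.

```python
import concurrent.futures
import time


def run_with_timeout(test_fn, timeout_s):
    """Run test_fn in a worker thread; return ("ok", result) or ("timeout", None)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(test_fn)
    try:
        return "ok", future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "timeout", None
    finally:
        # Do not wait for a stalled worker. The thread is abandoned, not killed,
        # so a truly hung call still needs a timeout on the underlying request.
        pool.shutdown(wait=False)


def fast_test():
    """Hypothetical test that returns a score immediately."""
    return 0.8


def slow_test():
    """Hypothetical test that stalls briefly to simulate a hang."""
    time.sleep(0.5)
    return 0.8


print(run_with_timeout(fast_test, timeout_s=2.0))
print(run_with_timeout(slow_test, timeout_s=0.1))
```

A thread-based deadline only stops the caller from waiting. If the stall is inside an HTTP call, setting a timeout on the request itself is the more direct fix.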

Generated from local eval artifacts on 2026-04-21 UTC. Runtime compile verification already passed separately with CGO_ENABLED=1 go build -tags "sqlite_fts5" and targeted package tests. No runtime restart was performed.