TWITTER_ARTICLE

Across 19 frontier models tested on a closed-book SQuAD task, answer F1 scores…

Brief

Phoebe Yao reports that metacognitive confidence in frontier LLMs appears to reflect a shared fact-recall difficulty signal rather than genuine self-knowledge. In closed-book SQuAD evaluations across 19 models, F1 scores fell in the 0.6-0.8 range, yet confidence and accuracy were nearly uncorrelated between models. The claimed mechanism is a common learned difficulty heuristic combined with model-specific decision thresholds. As supporting evidence, shifting a single steering coefficient on Mistral-7B reproduced other models' confidence profiles at roughly 80% agreement.

Why it matters

On a closed-book SQuAD task, all 19 frontier models tested scored answer F1 of roughly 0.6-0.8, but their reported confidence was nearly uncorrelated with actual accuracy when compared between models.
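
For readers unfamiliar with the metric, the sketch below shows the token-overlap F1 conventionally used to score SQuAD answers. The post does not say which F1 implementation was used, so this is an assumption about the scoring rather than the author's code; the function names `normalize` and `token_f1` are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall for one answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Convention: two empty answers match; one empty answer scores zero.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A partially correct answer earns partial credit under this metric, which is why a per-model F1 of 0.6-0.8 can coexist with many individual answers that are only approximately right.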

Key details

  • The authors argue that confidence variance is largely explained by a single shared, model-agnostic difficulty heuristic learned during training, with models differing in their decision threshold; in the post's shorthand, Claude is cautious and GPT is eager.
  • On Mistral-7B, adjusting a single steering coefficient reproduced any target model's confidence profile with about 80% agreement, suggesting confidence behavior is tunable and not evidence of true self-knowledge.
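
The shared-signal-plus-threshold claim can be illustrated with a toy model. The sketch below is not the author's method and involves no activation steering: it assumes every model sees the same per-question difficulty score and reports confidence whenever that score falls below its own threshold. The difficulty values, the thresholds 0.3 and 0.7, and all function names are hypothetical.

```python
def confidence_profile(difficulties, threshold):
    """True where a model reports confidence: shared difficulty below its threshold."""
    return [d < threshold for d in difficulties]


def agreement(profile_a, profile_b):
    """Fraction of questions on which two confidence profiles coincide."""
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


def fit_threshold(difficulties, target_profile, candidates):
    """Pick the single threshold that best reproduces a target profile."""
    return max(
        candidates,
        key=lambda t: agreement(confidence_profile(difficulties, t), target_profile),
    )


# Synthetic difficulty scores, evenly spaced in [0, 1); not measured data.
difficulties = [i / 100 for i in range(100)]
cautious = confidence_profile(difficulties, 0.3)  # hypothetical cautious model
eager = confidence_profile(difficulties, 0.7)     # hypothetical eager model

cross_agreement = agreement(cautious, eager)
recovered = fit_threshold(difficulties, cautious, [i / 20 for i in range(21)])
```

In this noise-free toy, sweeping one scalar recovers the cautious model's profile exactly. The roughly 80% agreement reported for Mistral-7B would be consistent with a shared signal plus some model-specific residual, though the post itself does not break that down.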

Source evidence

title: @phoebeyao: model confidence tracks a shared model-agnostic signal for fact recall, not true...
author: phoebeyao
contenttype: twitterarticle
published: 2026-04-01T17:14:29+00:00
source_url: https://x.com/phoebeyao/status/2039399882486861977

word_count: 116

model confidence tracks a shared model-agnostic signal for fact recall, not true self-knowledge.

we tested metacognitive confidence across 19 frontier models on a closed-book SQuAD task. f1 scores look reasonable (0.6–0.8), but confidence and accuracy are nearly uncorrelated between models.

the variance traces to a single shared difficulty heuristic learned during training. models differ only in their decision threshold. claude is cautious. gpt is eager.

shifting one steering coefficient on mistral-7b recovers any target model's confidence profile at ~80% agreement.

full breakdown + methods in the article

Across 19 frontier models, metacognitive confidence on question and answer tasks tracks a shared difficulty heuristic with only a weak relationship to actual performance.
Do models know what they...


Posted: 2026-04-01T17:14:29.000Z

Engagement: 0 likes, 2 retweets, 1 reply