Code-quality judge agreement

Ratings: data/code_quality_preferences_n1000.jsonl

Models: claude-haiku-4-5-20251001 vs gpt-5-nano; n: 1000

Summary

| metric | value |
| --- | --- |
| raw Pearson | 0.468 |
| raw Spearman | 0.458 |
| raw MAE | 1.910 |
| mean claude-haiku-4-5-20251001 − gpt-5-nano | 1.828 |
| tercile exact agreement | 51.9% |
| tercile unbiased agreement | 0.279 |
| Cohen's kappa | 0.260 |
| adjacent agreement | 93.6% |
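
These numbers can be reproduced from the ratings file. A minimal sketch follows, assuming each JSONL record carries both judges' raw scores under the hypothetical keys `claude_score` and `gpt_score` (the real field names are not shown in this report), and reading "adjacent agreement" as raw scores within one point of each other:

```python
import json

import numpy as np
from scipy.stats import pearsonr, spearmanr

with open("data/code_quality_preferences_n1000.jsonl") as f:
    records = [json.loads(line) for line in f]

# Hypothetical field names; the actual schema is not shown in this report.
a = np.array([r["claude_score"] for r in records], dtype=float)
b = np.array([r["gpt_score"] for r in records], dtype=float)

print("raw Pearson:", pearsonr(a, b)[0])
print("raw Spearman:", spearmanr(a, b)[0])
print("raw MAE:", np.mean(np.abs(a - b)))
print("mean difference:", np.mean(a - b))
# Reading "adjacent agreement" as raw scores within 1 point of each other.
print("adjacent agreement:", np.mean(np.abs(a - b) <= 1))
# The tercile and kappa rows are computed on the tercile labels described
# in the "Tercile label agreement" section below.
```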

Model score distributions

| model | mean | stdev | min | max |
| --- | --- | --- | --- | --- |
| claude-haiku-4-5-20251001 | 7.040 | 1.250 | 3.0 | 9.0 |
| gpt-5-nano | 5.212 | 0.782 | 2.0 | 7.0 |

(Figures: raw score distributions; raw score scatter)

Normalized/rank agreement

Percentile normalization ranks each judge's scores within its own distribution, mapping them to the 0–1 range; this removes scale bias, such as one model being harsher overall.
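
A minimal sketch of the normalization, using SciPy's average ranking (ties share the mean of their ranks, and results fall in (0, 1]):

```python
import numpy as np
from scipy.stats import rankdata

def percentile_scores(scores: np.ndarray) -> np.ndarray:
    """Map a judge's raw scores to within-judge percentile ranks."""
    return rankdata(scores, method="average") / len(scores)

# A harsh judge and a lenient judge that rank items identically
# normalize to the same percentile scores.
harsh = np.array([2, 3, 3, 5])
lenient = np.array([6, 8, 8, 9])
print(percentile_scores(harsh))    # [0.25 0.625 0.625 1.0]
print(percentile_scores(lenient))  # identical: scale bias is removed
```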

(Figure: percentile score scatter)

Tercile label agreement

Labels are derived from each model's percentile scores: bottom third = low, middle third = mid, top third = high.
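
A minimal sketch of the labeling and the label-agreement metrics; the 1/3 and 2/3 cutoffs are one reading of "bottom/middle/top third", and the scores below are synthetic stand-ins, not the actual ratings:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def tercile_labels(scores: np.ndarray) -> np.ndarray:
    """0 = low, 1 = mid, 2 = high, by within-judge percentile thirds."""
    pct = rankdata(scores, method="average") / len(scores)
    return np.digitize(pct, bins=[1 / 3, 2 / 3])

# Synthetic stand-ins for two correlated judges on 1000 items.
rng = np.random.default_rng(0)
a = rng.integers(1, 10, size=1000)
b = np.clip(a - 2 + rng.integers(-1, 2, size=1000), 1, 9)

la, lb = tercile_labels(a), tercile_labels(b)
print("tercile exact agreement:", np.mean(la == lb))
print("Cohen's kappa:", cohen_kappa_score(la, lb))
print("confusion matrix:\n", confusion_matrix(la, lb))
```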

(Figures: label confusion matrix; label distributions; label agreement metrics)