Code-quality judge agreement

Ratings: data/code_quality_preferences_n1000.jsonl

Models: claude-haiku-4-5-20251001 vs gpt-5-nano; n: 1000

Summary

model	mean	stdev	min	max
claude-haiku-4-5-20251001	7.040	1.250	3.0	9.0
gpt-5-nano	5.212	0.782	2.0	7.0

Percentile normalization ranks each judge's own scores from 0 to 1, so this removes scale bias such as one model being harsher overall.

Labels are made from each model's percentile scores: bottom third = low, middle third = mid, top third = high.