A new research paper introduces Annotator Policy Models (APMs), interpretable machine learning models that infer annotators' internal safety policies directly from their labeling behavior, without requiring annotators to explain their decisions. The approach achieves over 80% accuracy in modeling annotator policies and can distinguish among three sources of disagreement: operational failures, policy ambiguity, and value pluralism across demographic groups.
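The summary doesn't specify the paper's actual architecture, but the core idea, recovering an interpretable per-annotator policy from labels alone, can be sketched with a toy example. The sketch below is a hypothetical illustration, not the authors' method: it fits one shallow decision tree per simulated annotator and compares the learned policies. The feature names (`toxicity`, etc.), the simulated thresholds, and the disagreement heuristic at the end are all assumptions made for the sake of the example.

```python
# Hypothetical sketch of an "annotator policy model": fit one small,
# interpretable classifier per annotator on (content features -> label),
# then inspect where the learned policies diverge. All features and
# thresholds here are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
FEATURES = ["toxicity", "graphic_violence", "self_harm_mention"]  # assumed features

# Synthetic content items: each row is a feature vector in [0, 1].
X = rng.random((500, len(FEATURES)))

def annotator_labels(X, tox_threshold, noise=0.05):
    """Simulate an annotator with an internal toxicity threshold
    (value pluralism) plus occasional random flips (operational failure)."""
    y = (X[:, 0] > tox_threshold) | (X[:, 1] > 0.8)
    flips = rng.random(len(y)) < noise
    return np.where(flips, ~y, y).astype(int)

labels = {
    "annotator_a": annotator_labels(X, tox_threshold=0.4),
    "annotator_b": annotator_labels(X, tox_threshold=0.7),
}

# One shallow decision tree per annotator: the tree itself is the
# inferred, human-readable policy.
policies = {}
for name, y in labels.items():
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    policies[name] = tree
    print(f"--- inferred policy for {name} ---")
    print(export_text(tree, feature_names=FEATURES))

# Crude disagreement probe: items where the two *policies* disagree suggest
# genuinely different internal thresholds; items where a policy disagrees
# with its own annotator's label look more like labeling noise.
pa = policies["annotator_a"].predict(X)
pb = policies["annotator_b"].predict(X)
print("policy-level disagreement rate:", (pa != pb).mean())
```

In this toy setup, comparing each annotator's labels against their own fitted policy separates noise from principled disagreement, which is roughly the distinction the paper draws between operational failures and value pluralism.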
Why it matters: As AI safety relies heavily on human annotation to train and evaluate models, understanding why annotators disagree is critical for designing clearer safety policies and ensuring diverse perspectives are represented in AI governance.