For each model, we release its per-sample prediction dump on the validation split (the exact outputs that extract_predicts produced), so you can reproduce the reported metrics by running only the evaluator ...
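
As a rough illustration of evaluating from a dump alone, the sketch below recomputes a metric without running the model. It is not the project's evaluator: the function name `evaluate_dump`, the JSON Lines layout, the `prediction` and `label` keys, and the choice of accuracy as the metric are all assumptions made for the example, so adapt them to the actual dump format and metrics.

```python
import json
from pathlib import Path


def evaluate_dump(path):
    """Recompute accuracy from a saved prediction dump.

    Hypothetical layout (an assumption, not the released format):
    one JSON object per line with "prediction" and "label" keys.
    """
    total = 0
    correct = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            correct += int(record["prediction"] == record["label"])
    if total == 0:
        raise ValueError(f"no records found in {path}")
    return {"num_samples": total, "accuracy": correct / total}
```

Because the function reads only the dump, no model weights or inference code are needed; calling `evaluate_dump("preds.jsonl")` on a file in the assumed layout returns a dictionary with the sample count and the accuracy.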