Table 1 Interrater agreement and consistency under the three prompts
From: Evaluating large language models for criterion-based grading: from agreement to consistency
| | Prompt 1 | Prompt 2 | Prompt 3 |
|---|---|---|---|
| Interrater agreement | 0.46 [−0.04, 0.74] | 0.37 [−0.10, 0.73] | 0.61 [0.02, 0.85] |
| Interrater consistency | 0.63 [0.35, 0.80] | 0.71 [0.48, 0.85] | 0.77 [0.57, 0.88] |

Values are point estimates with 95% confidence intervals in brackets.
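The pattern in Table 1, with consistency exceeding agreement under every prompt, is what two-way intraclass correlations produce when raters track each other's rank order but differ in systematic leniency. As a hedged illustration (the paper does not specify its exact estimator; the sketch below assumes the common single-rating forms ICC(A,1) for agreement and ICC(C,1) for consistency, and the sample ratings are invented):

```python
def icc_pair(ratings):
    """Compute ICC(A,1) (absolute agreement) and ICC(C,1) (consistency)
    from a two-way layout: rows = graded items, columns = raters.
    Formulas follow the standard two-way mixed/random ANOVA decomposition."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_total = sum((ratings[i][j] - grand) ** 2
                   for i in range(n) for j in range(k))
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    consistency = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
    agreement = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e
                                 + (k / n) * (ms_c - ms_e))
    return agreement, consistency

# Invented example: rater 2 scores every item exactly one point higher.
scores = [[1, 2], [2, 3], [3, 4], [4, 5]]
agr, con = icc_pair(scores)
print(f"agreement={agr:.2f}, consistency={con:.2f}")
# The constant offset leaves consistency perfect while lowering agreement.
```

A systematic bias between a model grader and a human grader penalizes agreement but not consistency, which is one way to read the gap between the two rows of the table.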