Table 1 Interrater agreement and consistency under the three prompts

From: Evaluating large language models for criterion-based grading from agreement to consistency

 

                          Prompt 1             Prompt 2             Prompt 3
Interrater agreement      0.46 [−0.04, 0.74]   0.37 [−0.10, 0.73]   0.61 [0.02, 0.85]
Interrater consistency    0.63 [0.35, 0.80]    0.71 [0.48, 0.85]    0.77 [0.57, 0.88]

Note: Point estimates and their 95% confidence intervals are presented.
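The table does not state the estimator behind these values, but interrater agreement and interrater consistency are commonly operationalized as intraclass correlation coefficients, ICC(A,1) and ICC(C,1) respectively. Purely as an illustration of how such point estimates and 95% confidence intervals can be obtained, the sketch below uses the pingouin library on hypothetical long-format grades (the essay IDs, rater labels, and scores are placeholders, not data from the study).

import pandas as pd
import pingouin as pg

# Hypothetical long-format grades: each essay scored once by a human rater
# and once by the LLM under a given prompt (illustrative values only).
df = pd.DataFrame({
    "essay": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater": ["human", "llm"] * 4,
    "score": [7, 6, 5, 5, 8, 9, 6, 7],
})

# pingouin reports all six ICC forms; ICC2 corresponds to single-rater
# absolute agreement (ICC(A,1)) and ICC3 to single-rater consistency (ICC(C,1)).
icc = pg.intraclass_corr(data=df, targets="essay", raters="rater", ratings="score")

# The CI95% column holds the 95% confidence interval reported next to each estimate.
print(icc.set_index("Type").loc[["ICC2", "ICC3"], ["ICC", "CI95%"]])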