Fig. 1: Bland-Altman plot showing agreement between IELTS official examiners and ChatGPT.
From: Evaluating large language models for criterion-based grading from agreement to consistency

Each plot shows the difference between the examiner and ChatGPT scores against the mean of the two scores. The solid line represents the mean difference, the dashed lines represent the upper and lower limits of agreement, and the dotted line represents the trend.
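
The quantities drawn in a Bland-Altman plot can be computed in a few lines. The following is a minimal sketch, not the authors' code. It assumes the conventional limits of agreement (mean difference ± 1.96 standard deviations of the differences) and a least-squares line for the trend; the caption does not state how either was computed. The function name `bland_altman` and the sample scores are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(scores_a, scores_b, z=1.96):
    """Return the summary quantities drawn in a Bland-Altman plot.

    scores_a, scores_b: paired scores from two raters (same length, >= 2).
    z: multiplier for the limits of agreement (1.96 is the usual
       convention; assumed here, not taken from the caption).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired score lists of length >= 2")

    # x-axis: mean of each pair; y-axis: difference of each pair.
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    bias = mean(diffs)   # solid line: mean difference
    sd = stdev(diffs)    # sample standard deviation of the differences
    lower = bias - z * sd  # dashed line: lower limit of agreement
    upper = bias + z * sd  # dashed line: upper limit of agreement

    # Dotted line: least-squares fit of the differences on the means.
    mean_x = mean(means)
    sxx = sum((x - mean_x) ** 2 for x in means)
    sxy = sum((x - mean_x) * (d - bias) for x, d in zip(means, diffs))
    slope = sxy / sxx if sxx else 0.0  # flat trend if all means coincide
    intercept = bias - slope * mean_x

    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "slope": slope,
        "intercept": intercept,
    }


# Hypothetical band scores, for illustration only (not data from the study).
examiner = [6.0, 6.5, 7.0, 5.5, 8.0]
chatgpt = [6.5, 6.5, 6.5, 6.0, 7.0]
result = bland_altman(examiner, chatgpt)
```

A slope near zero suggests that disagreement does not depend on the score level, while a clearly non-zero slope indicates proportional bias, with one rater scoring relatively higher at one end of the scale.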