Table 4. Sample analysis of the differences between model and human scoring.
Sample ID | S1 | S2 | S3 |
---|---|---|---|
Prompt | Should governments invest more in public transportation? | Is it better to live in a city or a rural area? | Should college education be free? |
Excerpt | “While some may argue that cars offer greater freedom, I firmly believe that investing in public transportation leads to a greener, more efficient society. Isn’t it better to reduce traffic jams and pollution?” | “Living in a city has many benefits. You can go to museums, restaurants, or hospitals easily. Everything is close.” | “College education should be free so that everyone can access knowledge. However, the government needs a sustainable plan to fund it.” |
Human Score | 4.5 | 3.0 | 4.8 |
Model Score | 3.7 | 4.1 | 4.5 |
Score Gap (Model − Human) | −0.8 | +1.1 | −0.3 |
Analysis of Deviation | The model misinterpreted the rhetorical question and contrastive reasoning, underestimating the strength of the author’s stance and giving a lower score. | Despite the fluent language and clear structure, the essay lacked critical analysis. The model over-weighted surface fluency and failed to penalize the lack of argument depth, resulting in an inflated score. | Minor spelling and grammar issues were over-penalized by the model, leading to a slight underestimation of the overall quality. |
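
The score gaps in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the gap is defined as model score minus human score, which matches the three values shown. The names `samples`, `score_gap`, and `direction` are illustrative and do not come from the original study.

```python
# Human and model scores copied from Table 4.
samples = {
    "S1": {"human": 4.5, "model": 3.7},
    "S2": {"human": 3.0, "model": 4.1},
    "S3": {"human": 4.8, "model": 4.5},
}


def score_gap(human: float, model: float) -> float:
    """Assumed definition: model score minus human score, rounded to one decimal."""
    return round(model - human, 1)


def direction(gap: float) -> str:
    """Label whether the model scored above or below the human rater."""
    if gap > 0:
        return "model over-scored"
    if gap < 0:
        return "model under-scored"
    return "agreement"


gaps = {sid: score_gap(s["human"], s["model"]) for sid, s in samples.items()}

for sid, gap in gaps.items():
    print(f"{sid}: gap = {gap:+.1f} ({direction(gap)})")
```

A negative gap (S1, S3) means the model scored the essay lower than the human rater, and a positive gap (S2) means it scored the essay higher, in line with the deviation analysis in the last row.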