Table 11 Distribution of measured outcomes of linguistic feature-based and LLM-based AES approaches.
From: Applying large language models for automated essay scoring for non-native Japanese
| Statistical metric | AES system | Mean score (Lexical richness) | Mean score (Syntactic complexity) | Mean score (Content) | Mean score (Grammatical accuracy) |
|---|---|---|---|---|---|
| QWK | Jess-human | 0.608 | 0.591 | 0.518 | 0.655 |
| QWK | JWriter-human | 0.600 | 0.589 | 0.521 | 0.661 |
| QWK | BERT-human | 0.653 | 0.652 | 0.638 | 0.671 |
| QWK | GPT-4-human | 0.665 | 0.655 | 0.657 | 0.689 |
| QWK | OCLL-human | 0.639 | 0.648 | 0.623 | 0.662 |
| QWK | Human-human | 0.657 | 0.639 | 0.578 | 0.677 |
| PRMSE | Jess-human | 0.691 | 0.658 | 0.601 | 0.689 |
| PRMSE | JWriter-human | 0.683 | 0.675 | 0.584 | 0.732 |
| PRMSE | BERT-human | 0.701 | 0.746 | 0.628 | 0.749 |
| PRMSE | GPT-4-human | 0.711 | 0.733 | 0.634 | 0.754 |
| PRMSE | OCLL-human | 0.687 | 0.699 | 0.619 | 0.694 |
| PRMSE | Human-human | 0.691 | 0.745 | 0.590 | 0.744 |
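
For readers unfamiliar with the QWK rows, the following is a minimal sketch of how quadratic weighted kappa between two sets of integer scores is commonly computed. It is not the authors' implementation: the function name `quadratic_weighted_kappa`, its parameters, and the 1-to-3 rating scale in the usage note are assumptions made for illustration only.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """Quadratic weighted kappa for two equal-length lists of integer ratings.

    Hypothetical helper for illustration; the rating scale is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n_cat = max_rating - min_rating + 1
    if n_cat < 2:
        raise ValueError("at least two rating categories are required")
    n = len(rater_a)

    # Observed agreement matrix: observed[i][j] counts essays that
    # rater A placed in category i and rater B placed in category j.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms of each rater's scores.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            # Quadratic penalty grows with the squared distance between categories.
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            # Expected count if the two raters scored independently.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used a single identical category; treat as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

With an assumed 1-to-3 scale, `quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3)` gives full agreement, while reversing the second list gives full disagreement. Values such as those in the table fall between these extremes.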