Table 4. Good/Same/Bad (GSB) evaluation results comparing model outputs to author responses (N = 200)
From: Self-reflection enhances large language models towards substantial academic response
| Method | Good | Same | Bad |
|---|---|---|---|
| ChatGLM-DP | 2.5% | 15.5% | 82.0% |
| ChatGLM-CoT | 4.0% | 35.0% | 61.0% |
| ChatGLM-SR | 6.5% | 42.5% | 51.0% |
| ChatGLM-ABR | 3.0% | 31.0% | 66.0% |
| Ours | 9.0% | 55.5% | 35.5% |
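
The percentages above can be turned back into sample counts using the N = 200 stated in the caption. The following is a minimal sketch, not part of the original paper: the helper names `to_counts` and `good_or_same` are hypothetical, and it assumes that "Good" means the model output was judged better than the author response, "Same" comparable, and "Bad" worse.

```python
N = 200  # number of evaluated samples, taken from the table caption

# (Good %, Same %, Bad %) per method, copied from the table
results = {
    "ChatGLM-DP": (2.5, 15.5, 82.0),
    "ChatGLM-CoT": (4.0, 35.0, 61.0),
    "ChatGLM-SR": (6.5, 42.5, 51.0),
    "ChatGLM-ABR": (3.0, 31.0, 66.0),
    "Ours": (9.0, 55.5, 35.5),
}


def to_counts(percentages, n=N):
    """Convert a (Good, Same, Bad) percentage triple into sample counts."""
    return tuple(round(p * n / 100) for p in percentages)


def good_or_same(percentages):
    """Share of samples rated Good or Same (assumed reading: not worse than the author response)."""
    good, same, _bad = percentages
    return good + same


counts = {method: to_counts(p) for method, p in results.items()}

for method, p in results.items():
    good, same, bad = counts[method]
    print(f"{method}: Good={good}, Same={same}, Bad={bad}, Good+Same={good_or_same(p)}%")
```

Every row sums to 100% and maps to whole-number counts out of 200, which is consistent with the stated sample size.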