Table 4 GSB (Good/Same/Bad) evaluation results comparing model outputs to author responses (N = 200)

From: Self-reflection enhances large language models towards substantial academic response

Method        Good    Same    Bad
ChatGLM-DP    2.5%    15.5%   82.0%
ChatGLM-CoT   4.0%    35.0%   61.0%
ChatGLM-SR    6.5%    42.5%   51.0%
ChatGLM-ABR   3.0%    31.0%   66.0%
Ours          9.0%    55.5%   35.5%
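
The percentages in Table 4 can be converted back to raw sample counts as a quick consistency check. The sketch below assumes each percentage is a share of the same N = 200 comparisons stated in the caption; the names `gsb_percent` and `to_counts` are illustrative and do not come from the paper.

```python
# Number of evaluated comparisons, taken from the table caption.
N = 200

# (Good, Same, Bad) percentages copied from Table 4.
gsb_percent = {
    "ChatGLM-DP": (2.5, 15.5, 82.0),
    "ChatGLM-CoT": (4.0, 35.0, 61.0),
    "ChatGLM-SR": (6.5, 42.5, 51.0),
    "ChatGLM-ABR": (3.0, 31.0, 66.0),
    "Ours": (9.0, 55.5, 35.5),
}


def to_counts(percentages, n=N):
    """Convert a tuple of percentages into sample counts out of n.

    Assumption: every percentage is a share of the same n samples.
    """
    return tuple(round(p * n / 100) for p in percentages)


gsb_counts = {method: to_counts(p) for method, p in gsb_percent.items()}

for method, (good, same, bad) in gsb_counts.items():
    # Each row should account for all N comparisons.
    assert good + same + bad == N
    print(f"{method:<12} good={good:>3} same={same:>3} bad={bad:>3}")
```

Under that assumption every row sums to 200, which matches the fact that each row's percentages sum to 100%.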