Table 7. Numerical results of LLM performance, evaluated by GPT-4 score and vector similarity.
Metric | Key insight | Yi | Yi FT | Mixtral | Mixtral FT | InternLM2 | InternLM2 FT | Multi-Actor |
---|---|---|---|---|---|---|---|---|
GPT-4 score | Aim | 78.1 | 81.2 | 58.3 | 62.0 | 87.5 | 91.1 | 95.7 |
GPT-4 score | Motivation | 71.2 | 74.1 | 52.1 | 55.3 | 79.1 | 82.5 | 92.7 |
GPT-4 score | Methods | 72.1 | 77.0 | 54.3 | 57.3 | 81.0 | 84.7 | 93.2 |
GPT-4 score | Question addressed | 73.4 | 77.4 | 48.7 | 52.7 | 78.2 | 81.3 | 91.2 |
GPT-4 score | Evaluation metrics | 60.6 | 62.2 | 44.4 | 44.1 | 64.5 | 69.4 | 91.5 |
GPT-4 score | Findings | 74.1 | 79.5 | 52.7 | 53.0 | 79.8 | 84.5 | 96.4 |
GPT-4 score | Limitations | 46.1 | 47.2 | 32.6 | 32.7 | 44.5 | 49.7 | 66.3 |
GPT-4 score | Contribution | 75.3 | 76.6 | 55.3 | 58.8 | 81.1 | 83.6 | 91.5 |
GPT-4 score | Future work | 66.0 | 68.1 | 46.4 | 48.6 | 69.4 | 73.5 | 89.7 |
GPT-4 score | Average | 68.5 | 71.4 | 49.4 | 51.6 | 73.9 | 77.8 | 89.8 |
Vector similarity | Aim | 77.6 | 80.4 | 78.4 | 77.2 | 84.1 | 86.4 | 86.1 |
Vector similarity | Motivation | 66.7 | 68.5 | 65.2 | 65.8 | 71.4 | 72.6 | 76.4 |
Vector similarity | Methods | 66.0 | 69.3 | 65.5 | 67.1 | 72.2 | 74.4 | 78.8 |
Vector similarity | Question addressed | 68.9 | 70.0 | 64.8 | 65.9 | 69.9 | 71.5 | 76.8 |
Vector similarity | Evaluation metrics | 55.5 | 56.9 | 55.2 | 56.3 | 57.6 | 59.2 | 67.6 |
Vector similarity | Findings | 66.7 | 68.8 | 65.3 | 67.4 | 69.7 | 72.1 | 76.7 |
Vector similarity | Limitations | 47.2 | 48.4 | 47.9 | 46.9 | 46.5 | 48.3 | 55.0 |
Vector similarity | Contribution | 71.3 | 72.2 | 70.7 | 71.0 | 72.6 | 74.2 | 80.8 |
Vector similarity | Future work | 58.0 | 57.9 | 58.7 | 57.1 | 59.1 | 60.1 | 66.9 |
Vector similarity | Average | 64.2 | 65.8 | 63.5 | 63.9 | 67.0 | 68.8 | 73.9 |
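
The Average rows in Table 7 appear consistent with an unweighted mean over the nine key-insight scores, rounded to one decimal place. The table does not state the aggregation rule, so this is an assumption. The minimal sketch below checks that reading against two GPT-4 score columns copied from the table; the names `gpt4_scores` and `column_average` are hypothetical and introduced only for illustration.

```python
from statistics import mean

# GPT-4 scores for the nine key insights, copied from Table 7
# (order: Aim, Motivation, Methods, Question addressed, Evaluation metrics,
#  Findings, Limitations, Contribution, Future work).
gpt4_scores = {
    "Yi": [78.1, 71.2, 72.1, 73.4, 60.6, 74.1, 46.1, 75.3, 66.0],
    "Multi-Actor": [95.7, 92.7, 93.2, 91.2, 91.5, 96.4, 66.3, 91.5, 89.7],
}


def column_average(scores):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    return round(mean(scores), 1)


# Recompute the Average row entry for each selected model.
averages = {model: column_average(s) for model, s in gpt4_scores.items()}
print(averages)
```

Under this assumption, the recomputed values for Yi and Multi-Actor agree with the reported averages of 68.5 and 89.8.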