Table 7. Numerical results of LLM performance, evaluated by GPT-4 score and vector similarity.

Metric	Key insight	Yi	Yi FT	Mixtral	Mixtral FT	InternLM2	InternLM2 FT	Multi-Actor
GPT-4 score	Aim	78.1	81.2	58.3	62.0	87.5	91.1	95.7
	Motivation	71.2	74.1	52.1	55.3	79.1	82.5	92.7
	Methods	72.1	77.0	54.3	57.3	81.0	84.7	93.2
	Question addressed	73.4	77.4	48.7	52.7	78.2	81.3	91.2
	Evaluation metrics	60.6	62.2	44.4	44.1	64.5	69.4	91.5
	Findings	74.1	79.5	52.7	53.0	79.8	84.5	96.4
	Limitations	46.1	47.2	32.6	32.7	44.5	49.7	66.3
	Contribution	75.3	76.6	55.3	58.8	81.1	83.6	91.5
	Future work	66.0	68.1	46.4	48.6	69.4	73.5	89.7
	Average	68.5	71.4	49.4	51.6	73.9	77.8	89.8
Vector similarity	Aim	77.6	80.4	78.4	77.2	84.1	86.4	86.1
	Motivation	66.7	68.5	65.2	65.8	71.4	72.6	76.4
	Methods	66.0	69.3	65.5	67.1	72.2	74.4	78.8
	Question addressed	68.9	70.0	64.8	65.9	69.9	71.5	76.8
	Evaluation metrics	55.5	56.9	55.2	56.3	57.6	59.2	67.6
	Findings	66.7	68.8	65.3	67.4	69.7	72.1	76.7
	Limitations	47.2	48.4	47.9	46.9	46.5	48.3	55.0
	Contribution	71.3	72.2	70.7	71.0	72.6	74.2	80.8
	Future work	58.0	57.9	58.7	57.1	59.1	60.1	66.9
	Average	64.2	65.8	63.5	63.9	67.0	68.8	73.9
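
The Average rows appear to be the unweighted mean of the nine key-insight scores in each column. This is an assumption, since the table does not state how the average is computed. A minimal sketch that recomputes it for two columns of the GPT-4 score block follows; the names `gpt4_scores`, `reported_average`, and `column_mean` are illustrative and not from the source.

```python
# Scores copied from the GPT-4 score block of Table 7, in row order
# (Aim, Motivation, Methods, Question addressed, Evaluation metrics,
#  Findings, Limitations, Contribution, Future work).
gpt4_scores = {
    "Yi": [78.1, 71.2, 72.1, 73.4, 60.6, 74.1, 46.1, 75.3, 66.0],
    "Multi-Actor": [95.7, 92.7, 93.2, 91.2, 91.5, 96.4, 66.3, 91.5, 89.7],
}

# Values from the Average row of the same block.
reported_average = {"Yi": 68.5, "Multi-Actor": 89.8}


def column_mean(scores):
    """Unweighted mean of a column, rounded to one decimal place.

    Assumption: the table's Average row is a plain arithmetic mean.
    """
    return round(sum(scores) / len(scores), 1)


for model, scores in gpt4_scores.items():
    recomputed = column_mean(scores)
    status = "matches" if recomputed == reported_average[model] else "differs from"
    print(f"{model}: recomputed {recomputed} {status} reported {reported_average[model]}")
```

The same check can be applied to the other columns and to the vector similarity block by adding their scores to the dictionaries.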
