Table 7 The numerical results of LLM performance evaluated by GPT-4.

From: A scientific-article key-insight extraction system based on multi-actor of fine-tuned open-source large language models

 

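|                   | Key-insights       | Yi   | Yi FT | Mixtral | Mixtral FT | InternLM2 | InternLM2 FT | Multi-Actor |
|-------------------|--------------------|------|-------|---------|------------|-----------|--------------|-------------|
| GPT-4 score       | Aim                | 78.1 | 81.2  | 58.3    | 62.0       | 87.5      | 91.1         | 95.7        |
|                   | Motivation         | 71.2 | 74.1  | 52.1    | 55.3       | 79.1      | 82.5         | 92.7        |
|                   | Methods            | 72.1 | 77.0  | 54.3    | 57.3       | 81.0      | 84.7         | 93.2        |
|                   | Question addressed | 73.4 | 77.4  | 48.7    | 52.7       | 78.2      | 81.3         | 91.2        |
|                   | Evaluation metrics | 60.6 | 62.2  | 44.4    | 44.1       | 64.5      | 69.4         | 91.5        |
|                   | Findings           | 74.1 | 79.5  | 52.7    | 53.0       | 79.8      | 84.5         | 96.4        |
|                   | Limitations        | 46.1 | 47.2  | 32.6    | 32.7       | 44.5      | 49.7         | 66.3        |
|                   | Contribution       | 75.3 | 76.6  | 55.3    | 58.8       | 81.1      | 83.6         | 91.5        |
|                   | Future work        | 66.0 | 68.1  | 46.4    | 48.6       | 69.4      | 73.5         | 89.7        |
|                   | Average            | 68.5 | 71.4  | 49.4    | 51.6       | 73.9      | 77.8         | 89.8        |
| Vector similarity | Aim                | 77.6 | 80.4  | 78.4    | 77.2       | 84.1      | 86.4         | 86.1        |
|                   | Motivation         | 66.7 | 68.5  | 65.2    | 65.8       | 71.4      | 72.6         | 76.4        |
|                   | Methods            | 66.0 | 69.3  | 65.5    | 67.1       | 72.2      | 74.4         | 78.8        |
|                   | Question addressed | 68.9 | 70.0  | 64.8    | 65.9       | 69.9      | 71.5         | 76.8        |
|                   | Evaluation metrics | 55.5 | 56.9  | 55.2    | 56.3       | 57.6      | 59.2         | 67.6        |
|                   | Findings           | 66.7 | 68.8  | 65.3    | 67.4       | 69.7      | 72.1         | 76.7        |
|                   | Limitations        | 47.2 | 48.4  | 47.9    | 46.9       | 46.5      | 48.3         | 55.0        |
|                   | Contribution       | 71.3 | 72.2  | 70.7    | 71.0       | 72.6      | 74.2         | 80.8        |
|                   | Future work        | 58.0 | 57.9  | 58.7    | 57.1       | 59.1      | 60.1         | 66.9        |
|                   | Average            | 64.2 | 65.8  | 63.5    | 63.9       | 67.0      | 68.8         | 73.9        |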