Table 1 Results of ScholarQABench

From: Synthesizing scientific literature with retrieval-augmented language models

 

Columns under Pub, Sci and QASA report single-paper performance; columns under CS, Multi, Bio and Neu report multi-paper performance; the final column reports cost on Scholar-CS in USD per question.

| Model | Pub Acc. | Pub Cite | Sci Acc. | Sci Cite | QASA Acc. | QASA Cite | CS Rub. | CS Cite | Multi LLM | Multi Cite | Bio Cite | Neu Cite | Cost (USD per question) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 61.5 | 0 | 66.8 | 0 | 14.3 | 0 | 41.9 | 0 | 3.79 | 0 | 0 | 0 | 0.0001 |
| +RAG_OSDS | 75.2 | 63.9 | 75.5 | 36.2 | 18.6 | 47.2 | 46.7 | 26.1 | 4.22 | 25.3 | 38.0 | 36.8 | 0.0001 |
| OpenScholar-8B | 76.4 | 68.9 | 76.0 | 43.6 | 23.0 | 56.3 | 51.1 | 47.9 | 4.12 | 42.8 | 50.8 | 56.8 | 0.003 |
| Llama 3.1 70B | 69.5 | 0 | 76.9 | 0 | 13.7 | 0 | 44.9 | 0 | 3.82 | 0 | 0 | 0 | 0.0004 |
| +RAG_OSDS | 77.4 | 71.1 | 78.2 | 42.5 | 22.7 | 63.6 | 48.5 | 24.5 | 4.24 | 41.4 | 53.8 | 58.1 | 0.0004 |
| OpenScholar-70B | 79.6 | 74.0 | 82.1 | 47.5 | 23.4 | 64.2 | 52.5 | 45.9 | 4.03 | 54.7 | 55.9 | 63.1 | 0.01 |
| GPT-4o | 65.8 | 0 | 77.8 | 0 | 21.2 | 0 | 45.0 | 0.1 | 4.16 | 0.7 | 0.2 | 0.1 | 0.006 |
| +RAG_OSDS | 75.1 | 73.7 | 79.3 | 47.9 | 18.3 | 53.6 | 52.4 | 31.1 | 4.03 | 31.5 | 36.3 | 21.9 | 0.01 |
| OpenScholar-GPT-4o | 74.8 | 77.1 | 81.3 | 56.5 | 18.7 | 60.4 | 57.7 | 39.5 | 4.51 | 37.5 | 51.5 | 43.5 | 0.05 |
| PaperQA2 | – | – | – | – | – | – | 45.6 | 48.0 | 3.82 | 47.2 | 56.7 | 56.0 | 0.3–2.3 |
| Perplexity | – | – | – | – | – | – | 40.0 | – | 4.15 | – | – | – | 0.002* |

Pub, Sci and QASA indicate the three single-paper tasks, PubMedQA41, SciFact42 and QASA43. CS, Multi, Bio and Neu indicate Scholar-CS (computer science), Scholar-Multi, Scholar-Bio (biomedicine) and Scholar-Neuro (neuroscience), which require multi-paper synthesis and long-form answer generation. ‘Acc.’ indicates the correctness metric for the single-paper tasks (accuracy for PubMedQA and SciFact, ROUGE-L for QASA). Rubric accuracy (‘Rub.’) on Scholar-CS is used as the primary metric for correctness on the multi-paper synthesis tasks. ‘Cite’ indicates citation F1. ‘LLM’ indicates the average score of organization, relevance and coverage as predicted by Prometheus44. PaperQA2 is based on GPT-4o, and its pricing depends on the number of PDF files used during inference. Although the 8B and 70B models were evaluated on our local machines, we estimated their costs based on Together AI pricing. *We used Perplexity Pro (which requires a monthly subscription at US$20) and divided this cost by 9,000, the maximum number of queries allowed under the Pro subscription. Because the Perplexity user interface does not provide snippets for each citation, we were unable to evaluate its citation accuracy.
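As a reading aid, the sketch below shows how the two derived quantities referenced in this note are typically computed: citation F1 as the harmonic mean of citation precision and recall, and the Perplexity per-question cost as the US$20 monthly fee divided by the 9,000-query cap. The function names and the example precision/recall values are illustrative assumptions, not taken from the OpenScholar evaluation code.

```python
# Illustrative sketch (not from the OpenScholar codebase). Assumes citation F1
# is the standard harmonic mean of citation precision and recall.

def citation_f1(precision: float, recall: float) -> float:
    """Harmonic mean of citation precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def perplexity_cost_per_query(monthly_fee_usd: float = 20.0,
                              max_queries: int = 9_000) -> float:
    """Per-question cost for the Perplexity row: monthly subscription fee
    divided by the maximum number of Pro queries per month."""
    return monthly_fee_usd / max_queries


if __name__ == "__main__":
    # Hypothetical precision/recall values, for illustration only.
    print(round(citation_f1(0.6, 0.5), 3))        # 0.545
    print(round(perplexity_cost_per_query(), 4))  # 0.0022, reported as ~0.002
```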