Table 1 Results on ScholarQABench
From: Synthesizing scientific literature with retrieval-augmented language models
| Model | Pub Acc. | Pub Cite | Sci Acc. | Sci Cite | QASA Acc. | QASA Cite | CS Rub. | CS Cite | Multi LLM | Multi Cite | Bio Cite | Neu Cite | Cost (USD per question) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 61.5 | 0 | 66.8 | 0 | 14.3 | 0 | 41.9 | 0 | 3.79 | 0 | 0 | 0 | 0.0001 |
| +RAG (OSDS) | 75.2 | 63.9 | 75.5 | 36.2 | 18.6 | 47.2 | 46.7 | 26.1 | 4.22 | 25.3 | 38.0 | 36.8 | 0.0001 |
| OpenScholar-8B | 76.4 | 68.9 | 76.0 | 43.6 | 23.0 | 56.3 | 51.1 | 47.9 | 4.12 | 42.8 | 50.8 | 56.8 | 0.003 |
| Llama 3.1 70B | 69.5 | 0 | 76.9 | 0 | 13.7 | 0 | 44.9 | 0 | 3.82 | 0 | 0 | 0 | 0.0004 |
| +RAG (OSDS) | 77.4 | 71.1 | 78.2 | 42.5 | 22.7 | 63.6 | 48.5 | 24.5 | 4.24 | 41.4 | 53.8 | 58.1 | 0.0004 |
| OpenScholar-70B | 79.6 | 74.0 | 82.1 | 47.5 | 23.4 | 64.2 | 52.5 | 45.9 | 4.03 | 54.7 | 55.9 | 63.1 | 0.01 |
| GPT-4o | 65.8 | 0 | 77.8 | 0 | 21.2 | 0 | 45.0 | 0.1 | 4.16 | 0.7 | 0.2 | 0.1 | 0.006 |
| +RAG (OSDS) | 75.1 | 73.7 | 79.3 | 47.9 | 18.3 | 53.6 | 52.4 | 31.1 | 4.03 | 31.5 | 36.3 | 21.9 | 0.01 |
| OpenScholar-GPT-4o | 74.8 | 77.1 | 81.3 | 56.5 | 18.7 | 60.4 | 57.7 | 39.5 | 4.51 | 37.5 | 51.5 | 43.5 | 0.05 |
| PaperQA2 | – | – | – | – | – | – | 45.6 | 48.0 | 3.82 | 47.2 | 56.7 | 56.0 | 0.3–2.3 |
| Perplexity | – | – | – | – | – | – | 40.0 | – | 4.15 | – | – | – | 0.002* |

Pub, Sci and QASA report single-paper performance (Acc. and Cite per task); CS, Multi, Bio and Neu report multi-paper performance (CS: Rub. and Cite; Multi: LLM and Cite; Bio and Neu: Cite only). Dashes indicate tasks the system does not support.