Table 1 Results on ScholarQABench
From: Synthesizing scientific literature with retrieval-augmented language models
| Model | Pub Acc. | Pub Cite | Sci Acc. | Sci Cite | QASA Acc. | QASA Cite | CS Rub. | CS Cite | Multi LLM | Multi Cite | Bio Cite | Neu Cite | Cost (USD per question) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 61.5 | 0 | 66.8 | 0 | 14.3 | 0 | 41.9 | 0 | 3.79 | 0 | 0 | 0 | 0.0001 |
| +RAG (OSDS) | 75.2 | 63.9 | 75.5 | 36.2 | 18.6 | 47.2 | 46.7 | 26.1 | 4.22 | 25.3 | 38.0 | 36.8 | 0.0001 |
| OpenScholar-8B | 76.4 | 68.9 | 76.0 | 43.6 | 23.0 | 56.3 | 51.1 | 47.9 | 4.12 | 42.8 | 50.8 | 56.8 | 0.003 |
| Llama 3.1 70B | 69.5 | 0 | 76.9 | 0 | 13.7 | 0 | 44.9 | 0 | 3.82 | 0 | 0 | 0 | 0.0004 |
| +RAG (OSDS) | 77.4 | 71.1 | 78.2 | 42.5 | 22.7 | 63.6 | 48.5 | 24.5 | 4.24 | 41.4 | 53.8 | 58.1 | 0.0004 |
| OpenScholar-70B | 79.6 | 74.0 | 82.1 | 47.5 | 23.4 | 64.2 | 52.5 | 45.9 | 4.03 | 54.7 | 55.9 | 63.1 | 0.01 |
| GPT-4o | 65.8 | 0 | 77.8 | 0 | 21.2 | 0 | 45.0 | 0.1 | 4.16 | 0.7 | 0.2 | 0.1 | 0.006 |
| +RAG (OSDS) | 75.1 | 73.7 | 79.3 | 47.9 | 18.3 | 53.6 | 52.4 | 31.1 | 4.03 | 31.5 | 36.3 | 21.9 | 0.01 |
| OpenScholar-GPT-4o | 74.8 | 77.1 | 81.3 | 56.5 | 18.7 | 60.4 | 57.7 | 39.5 | 4.51 | 37.5 | 51.5 | 43.5 | 0.05 |
| PaperQA2 | – | – | – | – | – | – | 45.6 | 48.0 | 3.82 | 47.2 | 56.7 | 56.0 | 0.3–2.3 |
| Perplexity | – | – | – | – | – | – | 40.0 | – | 4.15 | – | – | – | 0.002* |

Pub, Sci and QASA report single-paper performance (Acc. and Cite per task); CS, Multi, Bio and Neu report multi-paper performance (CS: Rub. and Cite; Multi: LLM and Cite; Bio and Neu: Cite only). Dashes indicate tasks the system does not support.