Extended Data Table 1 Overview of ScholarQABench datasets and evaluation protocols

From: Synthesizing scientific literature with retrieval-augmented language models

The top three rows show single-paper datasets adapted from existing benchmarks. The bottom four rows are new datasets that we constructed by recruiting PhD-level experts. Answer* indicates that the dataset provides questions only (no reference answers) and Answer† indicates that answers are evaluated against human-annotated rubrics. The evaluation columns correspond to the multifaceted evaluations in the Methods (‘Metrics and evaluation protocols’). ‘Multi-paper’ indicates that the task requires several papers to answer. To evaluate response correctness, we use ‘Acc.’ for single-paper tasks (SciFact, PubMedQA, QASA), corresponding to the primary metrics of the original datasets (that is, accuracy for SciFact and PubMedQA and ROUGE-L for QASA), and ‘Rub.’ (rubric accuracy) for Scholar-CS. For Scholar-Multi, we assess relevance, organization and coverage using both an LLM judge (aggregated as ‘LLM’) and expert annotators (aggregated as ‘Exp.’). Avg. x and Avg. y denote the average token lengths of the inputs and of the reference answers (where applicable), respectively.
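
The correctness metrics named above (label accuracy, ROUGE-L and rubric accuracy) can be summarized with a minimal sketch such as the one below. This is not the evaluation code used in the paper; the function names are illustrative, the rubric scoring is an unweighted simplification of rubric-based grading, and the third-party rouge-score package is an assumed dependency.

```python
# Illustrative sketch only -- not the ScholarQABench evaluation code.
# Assumes the third-party `rouge-score` package (pip install rouge-score).
from typing import Callable, Sequence

from rouge_score import rouge_scorer


def label_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy, as used for label-style tasks (e.g. SciFact, PubMedQA)."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def rouge_l_f1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Mean ROUGE-L F1 against reference answers, as used for free-form QASA answers."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(ref, pred)["rougeL"].fmeasure
              for pred, ref in zip(predictions, references)]
    return sum(scores) / len(scores)


def rubric_accuracy(answer: str,
                    rubric_items: Sequence[str],
                    judge: Callable[[str, str], bool]) -> float:
    """Simplified, unweighted rubric accuracy: the fraction of human-annotated
    rubric items that a judge (for example, an LLM prompt) deems satisfied."""
    satisfied = sum(judge(answer, item) for item in rubric_items)
    return satisfied / len(rubric_items)
```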