Fig. 2: Automatic evaluation of SIDE components on the WAFER test set.

a, Proportion of citations for which our retrievers surface the gold source among the top 200 results, for citations in featured and other Wikipedia articles. The verification engine bar (green) combines the sparse and dense retrievers, which contribute 100 passages each. b, Accuracy in surfacing the gold source in the first position, for citations in featured and other Wikipedia articles. The verification engine (green) reranks a pool of 100 passages from the sparse retriever and 100 from the dense retriever. c, Precision versus recall in distinguishing citations marked as failed verification from citations in featured articles. We compare a passage-level and a document-level approach for the verification engine, along with a baseline that simply uses the depth of the cited URL.
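
The quantities in panels a and b can be illustrated with a minimal Python sketch. It is not the evaluation code behind the figure: the function names (`success_at_k`, `combine_pools`, `url_depth`), the toy citation IDs and URLs, and the order-preserving deduplication when pooling are all assumptions made for illustration. In particular, `url_depth` counts non-empty path segments, which is only one plausible reading of "depth of the cited URL" in panel c.

```python
from typing import Dict, List, Sequence
from urllib.parse import urlparse


def success_at_k(rankings: Dict[str, Sequence[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of citations whose gold source appears among the top-k results."""
    if not rankings:
        return 0.0
    hits = sum(1 for cid, ranked in rankings.items() if gold[cid] in ranked[:k])
    return hits / len(rankings)


def combine_pools(sparse: Sequence[str], dense: Sequence[str], per_retriever: int = 100) -> List[str]:
    """Pool the top results of both retrievers (assumed: sparse first, duplicates dropped)."""
    seen = set()
    merged: List[str] = []
    for doc in list(sparse[:per_retriever]) + list(dense[:per_retriever]):
        if doc not in seen:
            seen.add(doc)
            merged.append(doc)
    return merged


def url_depth(url: str) -> int:
    """Number of non-empty path segments in a URL (assumed definition of depth)."""
    return len([segment for segment in urlparse(url).path.split("/") if segment])


# Hypothetical toy data: two citations, each with a ranked list per retriever.
sparse_results = {"c1": ["u1", "u2", "u3"], "c2": ["u9", "u8"]}
dense_results = {"c1": ["u4", "u1"], "c2": ["u7", "u5"]}
gold_sources = {"c1": "u1", "c2": "u5"}

combined_results = {
    cid: combine_pools(sparse_results[cid], dense_results[cid]) for cid in gold_sources
}

recall_sparse = success_at_k(sparse_results, gold_sources, k=200)      # panel a style
recall_combined = success_at_k(combined_results, gold_sources, k=200)  # panel a style
top1_combined = success_at_k(combined_results, gold_sources, k=1)      # panel b style
```

In this sketch the pooled list is left unranked, so the top-1 figure reflects only the pooling order. In the setup the caption describes, a reranker would reorder the pooled passages before the first position is scored.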