Fig. 2: Qualitative evaluation results on inconsistency, missing information, and hallucinations. | Nature Communications

From: Benchmarking large language models for biomedical natural language processing applications and recommendations

A Error analysis on the named entity recognition benchmark NCBI Disease. Correct entities: predicted entities whose text spans and entity types both match the gold standard; Wrong entities: predicted entities that are incorrect; Missing entities: gold-standard entities that are not predicted; Boundary issues: predicted entities that are correct but whose text spans differ from the gold standard. B–D Qualitative evaluation on ChemProt, HoC, and MedQA, where the gold standard is a fixed classification type or multiple-choice option. Inconsistent responses: responses that are in different formats; Missingness: responses that are missing; Hallucinations: responses in which the LLM fails to address the prompt and which may contain repetitions and misinformation.
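The four categories in panel A can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes entities are given as character offsets with a type, and it assumes that a "boundary issue" means a prediction that overlaps a gold entity of the same type without matching its span exactly. The names `Entity`, `overlaps`, and `categorize` are hypothetical.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int  # inclusive character offset (assumed representation)
    end: int    # exclusive character offset
    type: str


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two character ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def categorize(gold: list[Entity], predicted: list[Entity]) -> dict[str, int]:
    """Count correct, boundary, wrong, and missing entities.

    Assumption: a boundary issue is a same-type prediction that overlaps
    a gold entity without matching its span exactly.
    """
    counts = {"correct": 0, "boundary": 0, "wrong": 0, "missing": 0}
    gold_set = set(gold)
    matched = set()  # gold entities covered by some prediction

    for pred in predicted:
        if pred in gold_set:
            # Exact match on span and type.
            counts["correct"] += 1
            matched.add(pred)
            continue
        partner = next(
            (g for g in gold if g.type == pred.type and overlaps(g, pred)),
            None,
        )
        if partner is not None:
            counts["boundary"] += 1
            matched.add(partner)
        else:
            counts["wrong"] += 1

    # Gold entities that no prediction matched or overlapped.
    counts["missing"] = len(gold_set - matched)
    return counts


gold = [Entity(0, 22, "Disease"), Entity(27, 41, "Disease"), Entity(50, 56, "Disease")]
predicted = [Entity(9, 22, "Disease"), Entity(27, 41, "Disease"), Entity(60, 70, "Disease")]
result = categorize(gold, predicted)
print(result)
```

In this toy input the first prediction covers only part of a gold span, the second matches exactly, the third overlaps nothing, and one gold entity is never predicted, so each category receives one count.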
