Fig. 2: Qualitative evaluation results on inconsistency, missing information, and hallucinations.

A Error analysis on the named entity recognition benchmark NCBI Disease. Correct entities: the predicted entities are correct with both text spans and entity types; Wrong entities: the predicted entities are incorrect; Missing entities: true entities are not predicted; and Boundary issues: the predicted entities are correct but with different text spans than the gold standard. B–D Qualitative evaluation on ChemProt, HoC, and MedQA where the gold standard is a fixed classification type or multiple-choice option. Inconsistent responses: the responses are in different formats; Missingness: the responses are missing; and Hallucinations, where LLMs fail to address the prompt and may contain repetitions and misinformation in the output.