Extended Data Fig. 2: Error pattern distributions across the five core PsychBench tasks.
From: A domain-adapted large language model to support clinicians in psychiatric clinical practice

Bar plots summarize the major categories of model errors for each task: Task 1 (clinical information summarization), where most errors arose from onset-pattern misjudgment; Tasks 2 and 3 (diagnosis and differential diagnosis), where errors were dominated by inaccurate assessment of associated symptoms; Task 4 (medication recommendation), where overly conservative treatment decisions accounted for the majority of errors; and Task 5 (long-term course management), where errors stemmed primarily from failures to retain remote information and detailed information. Percentages indicate each error type's share of all errors recorded for that task.