Fig. 1: Model Performance on the Full Held-Out Test Set.
From: Detecting stigmatizing language in clinical notes with large language models for addiction care

Bootstrapped (n = 1,000 resamples) performance on the complete held-out test set (11,586 clinical notes). Each approach reflects a distinct prompting or fine-tuning strategy for identifying stigmatizing language. Results are reported as the mean macro F1 score with 95% bootstrap confidence intervals.
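
For readers unfamiliar with the reporting convention in this caption, the following is a minimal sketch of how a bootstrapped mean macro F1 score and a 95% confidence interval could be computed. It is not the authors' code. The function names `macro_f1` and `bootstrap_macro_f1`, resampling at the level of individual notes, the percentile method for the interval, and scoring a class as 0 when it is absent from a resample are all assumptions made for illustration.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores over `labels`."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted instances scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_macro_f1(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of macro F1 over bootstrap resamples."""
    rng = random.Random(seed)
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample note indices with replacement (assumed unit of resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(
            macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx], labels)
        )
    stats.sort()
    # Percentile interval: take the alpha/2 and 1 - alpha/2 order statistics.
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(stats) / n_boot, lower, upper
```

A call such as `bootstrap_macro_f1(gold_labels, model_labels)` would be run once per approach, where `gold_labels` and `model_labels` are hypothetical per-note label lists of equal length. The fixed seed keeps the interval reproducible across runs.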