Fig. 3: Evaluation metrics on \(\boldsymbol{\mathcal{D}}_{\mathbf{Eval}}\) for the three AI approaches.
From: Large language models for preventing medication direction errors in online pharmacies

a, Distribution of the NLP scores BLEU and METEOR for MEDIC, T5-FineTuned (1.5M) and Claude, calculated across all suggested directions (n = 1,200 prescriptions). Average values are indicated with a horizontal black line and median values are highlighted with a notch on each box plot. Whiskers extend from the first and third quartiles (box limits) to the minimum and maximum observed values for each metric and model. b, Ratios of all categories of possible near-miss events for the different models with respect to MEDIC, from a total of n = 1,200 prescriptions. Black lines represent 95% percentile intervals obtained via bootstrap (ref. 59) to account for the ratios’ skewed distribution, and their centers represent the median values. c, Ratios restricted to near-misses related to incorrect dosage or frequency, which carry an elevated risk of patient harm, from a total of n = 1,200 prescriptions. Black lines represent 95% percentile intervals obtained via bootstrap to account for the ratios’ skewed distribution, and their centers represent the median values.
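
The percentile bootstrap behind the intervals in panels b and c can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `bootstrap_ratio_interval`, the 0/1 near-miss flags per prescription, the paired resampling of prescriptions, and the number of resamples are all hypothetical choices made for the example.

```python
import numpy as np


def bootstrap_ratio_interval(model_flags, baseline_flags, n_boot=2000, seed=0):
    """Percentile-bootstrap interval for a ratio of near-miss counts.

    model_flags / baseline_flags: 0/1 arrays, one entry per prescription,
    marking whether the model (or the baseline, e.g. MEDIC) produced a
    near-miss on that prescription. Both the encoding and the paired
    resampling scheme are assumptions made for this sketch.
    Returns (2.5th percentile, median, 97.5th percentile) of the ratio.
    """
    model_flags = np.asarray(model_flags, dtype=float)
    baseline_flags = np.asarray(baseline_flags, dtype=float)
    if model_flags.shape != baseline_flags.shape or model_flags.ndim != 1:
        raise ValueError("flag arrays must be 1-D and of equal length")

    rng = np.random.default_rng(seed)
    n = model_flags.size

    # Resample prescriptions with replacement; the same indices are used for
    # both models so that each resample keeps the prescriptions paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    model_counts = model_flags[idx].sum(axis=1)
    baseline_counts = baseline_flags[idx].sum(axis=1)

    # Drop resamples in which the baseline has no near-misses (ratio undefined).
    valid = baseline_counts > 0
    if not valid.any():
        raise ValueError("baseline has no near-misses in any resample")
    ratios = model_counts[valid] / baseline_counts[valid]

    # Percentile interval: no symmetry assumption, which suits skewed ratios.
    lo, med, hi = np.percentile(ratios, [2.5, 50.0, 97.5])
    return float(lo), float(med), float(hi)


if __name__ == "__main__":
    # Synthetic flags for illustration only; these are not the paper's data.
    demo_rng = np.random.default_rng(1)
    baseline = demo_rng.random(1200) < 0.05
    model = demo_rng.random(1200) < 0.10
    print(bootstrap_ratio_interval(model, baseline))
```

Reporting the median of the resampled ratios together with the 2.5th and 97.5th percentiles gives an interval that need not be symmetric around its center, which is the property the caption appeals to when it mentions the ratios’ skewed distribution.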