Fig. 4: Accuracy of vision-language models (VLMs) detecting diseases in images with weak artefacts using standard prompts, and the percentage change in their performance relative to the accuracy obtained on the original, unaltered images.
From: Understanding the robustness of vision-language models to medical image artefacts

We present only the top three models for lesion detection on original unaltered images because the remaining models performed poorly, with accuracies around 0.5 (Fig. 3), and their performance on distorted images provides limited insight into model robustness. a Shows the models’ accuracy (y-axis) after adding weak artefacts to the images. Full model performance across the various prompt strategies is shown in Supplementary Figs. 6–8. b Illustrates the percentage change in the models’ accuracy (y-axis) after introducing weak artefacts. Positive values indicate increased performance, whereas negative values represent decreased performance. P values comparing original model performance with performance after adding weak artefacts are shown above each bar. Unless otherwise specified, p < 0.001; ** indicates 0.001 < p < 0.01, * indicates 0.01 < p < 0.05 and ns indicates p ≥ 0.05. The percentage change in model performance across the various prompt strategies is shown in Supplementary Figs. 9–17. The complete quantitative results are presented in Supplementary Data 7, while all p values can be found in Supplementary Data 9. For each disease detection task, we performed 1000 iterations of stratified bootstrapping to calculate accuracy and sensitivity. Results show mean performance with 95% confidence intervals (error bars).
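The stratified bootstrapping procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes binary ground-truth labels, resamples within each class to preserve class proportions, and takes the 2.5th/97.5th percentiles of the bootstrap distribution as the 95% confidence interval. The function name and signature are hypothetical.

```python
import numpy as np

def stratified_bootstrap_accuracy(y_true, y_pred, n_iter=1000, seed=0):
    """Estimate mean accuracy and a 95% percentile CI via stratified bootstrapping.

    Resampling is done within each class (stratum) so every bootstrap
    sample keeps the original class proportions.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Indices of each class, used as strata for resampling.
    strata = [np.flatnonzero(y_true == c) for c in np.unique(y_true)]
    accs = np.empty(n_iter)
    for i in range(n_iter):
        # Sample with replacement within each stratum, then pool the indices.
        idx = np.concatenate(
            [rng.choice(ix, size=ix.size, replace=True) for ix in strata]
        )
        accs[i] = np.mean(y_true[idx] == y_pred[idx])
    lo, hi = np.percentile(accs, [2.5, 97.5])
    return accs.mean(), (lo, hi)
```

The same resampled indices could be reused to compute sensitivity (recall on the positive class) per iteration, yielding matched confidence intervals for both metrics.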