Fig. 2: Overview of building benchmarks and the pipeline for evaluating robustness of Vision-Language models (VLMs) on disease detection tasks.

From: Understanding the robustness of vision-language models to medical image artefacts

A presents the robustness evaluation benchmarks covering disease detection tasks across different medical imaging modalities. Three intensity artefacts (random bias field, noise and motion) at weak (w₁–w₃) and strong (s₁–s₃) scales, and two spatial artefacts (random cropping and rotation) at weak (w₄–w₅) and strong (s₄–s₅) scales, are applied to the original, unaltered images. All hyperparameters used to generate artefacts at the weak and strong scales are provided in Supplementary Data 3. B illustrates the project pipeline, showing the evaluation process used to assess the robustness of the VLMs. When weak artefacts were added, we measured model performance (e.g. accuracy) and the performance drop relative to the original, unaltered images. When images were severely distorted by strong artefacts, we assessed the VLMs’ ability to detect poor image quality. All experiments were repeated with different prompting strategies, ranging from restricting reasoning via structured output to encouraging step-by-step reasoning via Chain of Thought.
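For illustration, the sketch below shows one way the weak- and strong-scale artefacts described above could be generated in Python using the TorchIO library. This is an assumption about tooling, not the study's implementation: the function name make_artefact_transforms, the file path, and all hyperparameter values are hypothetical, and the study's actual settings are those listed in Supplementary Data 3.

```python
# Illustrative sketch only: generating weak/strong intensity and spatial artefacts
# with TorchIO. Hyperparameter values here are assumptions, not the paper's settings
# (see Supplementary Data 3 for the actual configuration).
import torchio as tio

def make_artefact_transforms(scale: str) -> dict:
    """Return one transform per artefact type at a 'weak' or 'strong' scale."""
    strong = scale == "strong"
    return {
        # Intensity artefacts (w1-w3 / s1-s3)
        "bias_field": tio.RandomBiasField(coefficients=0.7 if strong else 0.3),
        "noise":      tio.RandomNoise(std=(0.1, 0.2) if strong else (0.01, 0.05)),
        "motion":     tio.RandomMotion(num_transforms=4 if strong else 1),
        # Spatial artefacts (w4-w5 / s4-s5)
        "rotation":   tio.RandomAffine(degrees=30 if strong else 5, scales=0),
        # Centre crop shown for simplicity; the study applies random cropping.
        "cropping":   tio.CropOrPad((160, 160, 1) if strong else (200, 200, 1)),
    }

# Usage: corrupt an image at the weak scale before querying a VLM.
image = tio.ScalarImage("example_image.nii.gz")  # hypothetical path
for name, transform in make_artefact_transforms("weak").items():
    corrupted = transform(image)
```

In this sketch, weak-scale outputs would feed the performance-drop comparison against the unaltered images, while strong-scale outputs would be used to test whether the VLMs flag poor image quality.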