Fig. 4: Stage I (CLIP guidance) evidence.

Left: curated and learnable prompts improve concordantly across ERS–CAF, LR, and H; synonym prompts remain stable, while OOD and antonym prompts deteriorate. Right: text guidance drives focused tile attention, whereas the no-text condition yields diffuse, low-contrast maps. a, Prompt × target performance (mean r over 5 folds). b, Macro-r across folds by prompt group. c, Attention overlay with curated prompts. d, Attention overlay with no text.