Fig. 5: Visual Comparison of Multimodal Model Performance and Interpretability on Yuan Dynasty Grotto Sculptures.
From: Multimodal AI for Yuan Buddhist sculpture chronology and style

a Summary of scores across six evaluation dimensions for CSN and five baseline multimodal large language models (GPT-4o, Claude 3.5, Gemini 1.5 Pro, LLaMA 3.3 70B, and Grok Beta), shown as the mean (left), standard deviation (middle), and median (right) across dimensions. b Distribution of evaluation scores for each model across all dimensions, visualized as violin plots. c Comparison of CSN’s mean score with that of the best-performing baseline model for each dimension. Asterisks indicate statistical significance (*p < 0.05, **p < 0.01, ***p < 0.001; two-tailed t-test).
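
The summary statistics in panel a and the asterisk convention in panel c can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names `summarize` and `significance_stars` are hypothetical, the scores are placeholder values rather than data from the study, and the use of the sample (n - 1) standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, median, stdev


def summarize(scores):
    """Return the mean, sample standard deviation, and median of a list of scores."""
    return {"mean": mean(scores), "sd": stdev(scores), "median": median(scores)}


def significance_stars(p_value):
    """Map a p-value to the asterisk notation used in the caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level


# Placeholder scores for one model across six dimensions (illustration only,
# NOT values from the study).
example_scores = [4.0, 3.5, 4.5, 4.0, 3.0, 5.0]
summary = summarize(example_scores)

print(summary)
print(repr(significance_stars(0.004)))
```

The p-values themselves would come from a two-tailed t-test computed separately, for example with a statistics library; this sketch covers only the mapping from p-value to asterisks.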