Fig. 2: Inter-rater reliability across scenarios for each LLM.

Radar chart comparing the Fleiss' kappa scores of four large language models (LLMs) across four scenarios. Each axis represents one scenario, with Fleiss' kappa values increasing radially from the center. Each model is plotted as a distinct line. Values closer to the outer edge indicate stronger inter-rater agreement.