Fig. 8
From: Arch-Eval benchmark for assessing Chinese architectural domain knowledge in large language models

Preliminary Experiment ② Results: Sample Variance of Accuracy for Qwen-14B-Chat and GPT-3.5-turbo Outputs at Different Test Set Sizes.