Table 1 Mean balanced accuracy of models across validation requirements without (w/o) and with RAG

Evaluated Models	Comprh. w/o RAG	Comprh. RAG	Comprh. \(\Delta\)RAG	Correct w/o RAG	Correct RAG	Correct \(\Delta\)RAG	Useful w/o RAG	Useful RAG	Useful \(\Delta\)RAG	Explnb. w/o RAG	Explnb. RAG	Explnb. \(\Delta\)RAG	Safe w/o RAG	Safe RAG	Safe \(\Delta\)RAG
Llama 3.2 3B	0.28 ± 0.08	0.25 ± 0.07	−0.03	0.52 ± 0.08	0.63 ± 0.02	+0.11	0.44 ± 0.08	0.46 ± 0.06	+0.02	0.54 ± 0.11	0.54 ± 0.08	±0	0.89 ± 0.05	0.86 ± 0.03	−0.03
Qwen 2.5 14B	0.42 ± 0.19	0.56 ± 0.04	+0.14	0.68 ± 0.01	0.70 ± 0.02	+0.02	0.59 ± 0.15	0.71 ± 0.02	+0.12	0.65 ± 0.18	0.78 ± 0.03	+0.13	0.85 ± 0.17	0.93 ± 0.02	+0.08
DSR Llama 70B	0.49 ± 0.05	0.53 ± 0.02	+0.04	0.69 ± 0.02	0.68 ± 0.01	−0.01	0.70 ± 0.05	0.69 ± 0.03	−0.01	0.80 ± 0.04	0.79 ± 0.02	−0.01	0.96 ± 0.01	0.92 ± 0.02	−0.04
GPT-4o	0.85 ± 0.06	0.76 ± 0.06	−0.09	0.73 ± 0.02	0.73 ± 0.02	±0	0.89 ± 0.03	0.82 ± 0.03	−0.07	0.94 ± 0.04	0.87 ± 0.04	−0.07	0.99 ± 0.01	0.93 ± 0.02	−0.06
GPT-4o mini	0.45 ± 0.17	0.45 ± 0.16	±0	0.66 ± 0.02	0.67 ± 0.02	+0.01	0.69 ± 0.08	0.65 ± 0.08	−0.04	0.77 ± 0.10	0.70 ± 0.09	−0.07	0.97 ± 0.01	0.93 ± 0.03	−0.04
o3 mini	0.67 ± 0.06	0.69 ± 0.05	+0.02	0.65 ± 0.03	0.68 ± 0.02	+0.03	0.79 ± 0.02	0.77 ± 0.03	−0.02	0.84 ± 0.03	0.82 ± 0.04	−0.02	0.96 ± 0.02	0.95 ± 0.02	−0.01
Llama3 Med42 8B	0.20 ± 0.09	0.19 ± 0.02	−0.01	0.61 ± 0.06	0.51 ± 0.04	−0.10	0.48 ± 0.10	0.41 ± 0.01	−0.07	0.53 ± 0.13	0.48 ± 0.04	−0.05	0.91 ± 0.02	0.86 ± 0.02	−0.05

Model performance varied across validation requirements. With RAG, GPT-4o showed a marked drop in Comprehensiveness (0.85 to 0.76), while Qwen 2.5 14B, DSR Llama 70B and o3 mini improved (+0.14, +0.04 and +0.02, respectively). GPT-4o nevertheless remained the strongest model. \(\Delta\)RAG is the score with RAG minus the score without (w/o) RAG, so positive values indicate an improvement with RAG. Highest scores per column are printed in bold.
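
As a small illustration of how the \(\Delta\)RAG column relates to the other two columns, the following sketch recomputes it for the Comprehensiveness requirement from the mean values in Table 1. The names `comprehensiveness` and `delta_rag` are hypothetical and chosen only for this example. The sign convention (RAG minus w/o RAG) is inferred from the tabulated values, and rounding to two decimals is an assumption made to match the table's precision.

```python
# Mean balanced accuracy for Comprehensiveness, copied from Table 1
# as (without RAG, with RAG). The ± spread values are left out here.
comprehensiveness = {
    "Llama 3.2 3B": (0.28, 0.25),
    "Qwen 2.5 14B": (0.42, 0.56),
    "DSR Llama 70B": (0.49, 0.53),
    "GPT-4o": (0.85, 0.76),
    "GPT-4o mini": (0.45, 0.45),
    "o3 mini": (0.67, 0.69),
    "Llama3 Med42 8B": (0.20, 0.19),
}


def delta_rag(without_rag: float, with_rag: float) -> float:
    """Return the change in score when RAG is added (RAG minus w/o RAG).

    Rounded to two decimals to match the precision used in the table.
    """
    return round(with_rag - without_rag, 2)


# Positive values mean the model improved with RAG, negative values mean it got worse.
deltas = {model: delta_rag(*scores) for model, scores in comprehensiveness.items()}

for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f}")
```

Under this convention a negative value, such as the one for GPT-4o, indicates that the model scored lower with RAG than without it.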
