Table 1 Mean balanced accuracy of models across validation requirements without (w/o) and with RAG; \(\Delta\)RAG is the RAG score minus the w/o RAG score
Evaluated Models | Comprh. w/o RAG | Comprh. RAG | Comprh. \(\Delta\)RAG | Correct w/o RAG | Correct RAG | Correct \(\Delta\)RAG | Useful w/o RAG | Useful RAG | Useful \(\Delta\)RAG | Explnb. w/o RAG | Explnb. RAG | Explnb. \(\Delta\)RAG | Safe w/o RAG | Safe RAG | Safe \(\Delta\)RAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Llama 3.2 3B | 0.28 ± 0.08 | 0.25 ± 0.07 | −0.03 | 0.52 ± 0.08 | 0.63 ± 0.02 | +0.11 | 0.44 ± 0.08 | 0.46 ± 0.06 | +0.02 | 0.54 ± 0.11 | 0.54 ± 0.08 | ±0 | 0.89 ± 0.05 | 0.86 ± 0.03 | −0.03 |
Qwen 2.5 14B | 0.42 ± 0.19 | 0.56 ± 0.04 | +0.14 | 0.68 ± 0.01 | 0.70 ± 0.02 | +0.02 | 0.59 ± 0.15 | 0.71 ± 0.02 | +0.12 | 0.65 ± 0.18 | 0.78 ± 0.03 | +0.13 | 0.85 ± 0.17 | 0.93 ± 0.02 | +0.08 |
DSR Llama 70B | 0.49 ± 0.05 | 0.53 ± 0.02 | +0.04 | 0.69 ± 0.02 | 0.68 ± 0.01 | −0.01 | 0.70 ± 0.05 | 0.69 ± 0.03 | −0.01 | 0.80 ± 0.04 | 0.79 ± 0.02 | −0.01 | 0.96 ± 0.01 | 0.92 ± 0.02 | −0.04 |
GPT-4o | 0.85 ± 0.06 | 0.76 ± 0.06 | −0.09 | 0.73 ± 0.02 | 0.73 ± 0.02 | ±0 | 0.89 ± 0.03 | 0.82 ± 0.03 | −0.07 | 0.94 ± 0.04 | 0.87 ± 0.04 | −0.07 | 0.99 ± 0.01 | 0.93 ± 0.02 | −0.06 |
GPT-4o mini | 0.45 ± 0.17 | 0.45 ± 0.16 | ±0 | 0.66 ± 0.02 | 0.67 ± 0.02 | +0.01 | 0.69 ± 0.08 | 0.65 ± 0.08 | −0.04 | 0.77 ± 0.10 | 0.70 ± 0.09 | −0.07 | 0.97 ± 0.01 | 0.93 ± 0.03 | −0.04 |
o3 mini | 0.67 ± 0.06 | 0.69 ± 0.05 | +0.02 | 0.65 ± 0.03 | 0.68 ± 0.02 | +0.03 | 0.79 ± 0.02 | 0.77 ± 0.03 | −0.02 | 0.84 ± 0.03 | 0.82 ± 0.04 | −0.02 | 0.96 ± 0.02 | 0.95 ± 0.02 | −0.01 |
Llama3 Med42 8B | 0.20 ± 0.09 | 0.19 ± 0.02 | −0.01 | 0.61 ± 0.06 | 0.51 ± 0.04 | −0.10 | 0.48 ± 0.10 | 0.41 ± 0.01 | −0.07 | 0.53 ± 0.13 | 0.48 ± 0.04 | −0.05 | 0.91 ± 0.02 | 0.86 ± 0.02 | −0.05 |
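
For readers who want to reproduce the quantities in Table 1, the following is a minimal sketch of how balanced accuracy and the \(\Delta\)RAG column could be computed. It assumes the standard definition of balanced accuracy (the mean of per-class recall); the source does not spell out its exact computation, and the function names `balanced_accuracy` and `delta_rag` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true.

    Assumption: this is the standard macro-averaged-recall definition,
    not necessarily the exact procedure used for Table 1.
    """
    total = defaultdict(int)  # number of samples per true class
    hits = defaultdict(int)   # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    return mean(hits[c] / total[c] for c in total)


def delta_rag(score_without, score_with):
    """Difference between the RAG and w/o RAG scores, rounded to two decimals."""
    return round(score_with - score_without, 2)


# Toy labels (not from the study): class 0 recall is 2/3, class 1 recall is 1/1.
toy_score = balanced_accuracy([0, 0, 0, 1], [0, 0, 1, 1])

# Values taken from the Comprh. columns of Table 1.
qwen_delta = delta_rag(0.42, 0.56)    # Qwen 2.5 14B
gpt4o_delta = delta_rag(0.85, 0.76)   # GPT-4o
```

Because balanced accuracy averages recall over classes rather than over samples, a model that only ever predicts the majority class scores poorly even when plain accuracy looks high, which is one plausible reason to prefer it when rating classes are imbalanced.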