Table 1 Mean balanced accuracy of models across validation requirements without (w/o) and with RAG

From: Benchmarking large language models for personalized, biomarker-based health intervention recommendations

| Evaluated Models | Comprh. w/o RAG | Comprh. RAG | Comprh. \(\Delta\)RAG | Correct w/o RAG | Correct RAG | Correct \(\Delta\)RAG | Useful w/o RAG | Useful RAG | Useful \(\Delta\)RAG | Explnb. w/o RAG | Explnb. RAG | Explnb. \(\Delta\)RAG | Safe w/o RAG | Safe RAG | Safe \(\Delta\)RAG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | 0.28 ± 0.08 | 0.25 ± 0.07 | −0.03 | 0.52 ± 0.08 | 0.63 ± 0.02 | +0.11 | 0.44 ± 0.08 | 0.46 ± 0.06 | +0.02 | 0.54 ± 0.11 | 0.54 ± 0.08 | ±0 | 0.89 ± 0.05 | 0.86 ± 0.03 | −0.03 |
| Qwen 2.5 14B | 0.42 ± 0.19 | 0.56 ± 0.04 | +0.14 | 0.68 ± 0.01 | 0.70 ± 0.02 | +0.02 | 0.59 ± 0.15 | 0.71 ± 0.02 | +0.12 | 0.65 ± 0.18 | 0.78 ± 0.03 | +0.13 | 0.85 ± 0.17 | 0.93 ± 0.02 | +0.08 |
| DSR Llama 70B | 0.49 ± 0.05 | 0.53 ± 0.02 | +0.04 | 0.69 ± 0.02 | 0.68 ± 0.01 | −0.01 | 0.70 ± 0.05 | 0.69 ± 0.03 | −0.01 | 0.80 ± 0.04 | 0.79 ± 0.02 | −0.01 | 0.96 ± 0.01 | 0.92 ± 0.02 | −0.04 |
| GPT-4o | 0.85 ± 0.06 | 0.76 ± 0.06 | −0.09 | 0.73 ± 0.02 | 0.73 ± 0.02 | ±0 | 0.89 ± 0.03 | 0.82 ± 0.03 | −0.07 | 0.94 ± 0.04 | 0.87 ± 0.04 | −0.07 | 0.99 ± 0.01 | 0.93 ± 0.02 | −0.06 |
| GPT-4o mini | 0.45 ± 0.17 | 0.45 ± 0.16 | ±0 | 0.66 ± 0.02 | 0.67 ± 0.02 | +0.01 | 0.69 ± 0.08 | 0.65 ± 0.08 | −0.04 | 0.77 ± 0.10 | 0.70 ± 0.09 | −0.07 | 0.97 ± 0.01 | 0.93 ± 0.03 | −0.04 |
| o3 mini | 0.67 ± 0.06 | 0.69 ± 0.05 | +0.02 | 0.65 ± 0.03 | 0.68 ± 0.02 | +0.03 | 0.79 ± 0.02 | 0.77 ± 0.03 | −0.02 | 0.84 ± 0.03 | 0.82 ± 0.04 | −0.02 | 0.96 ± 0.02 | 0.95 ± 0.02 | −0.01 |
| Llama3 Med42 8B | 0.20 ± 0.09 | 0.19 ± 0.02 | −0.01 | 0.61 ± 0.06 | 0.51 ± 0.04 | −0.10 | 0.48 ± 0.10 | 0.41 ± 0.01 | −0.07 | 0.53 ± 0.13 | 0.48 ± 0.04 | −0.05 | 0.91 ± 0.02 | 0.86 ± 0.02 | −0.05 |

  1. Model performance varied across validation requirements. With RAG, GPT-4o showed a marked drop in Comprehensiveness (0.85 to 0.76), whereas Qwen 2.5 14B, DSR Llama 70B and o3 mini improved. Even so, GPT-4o remained the strongest model. \(\Delta\)RAG is the RAG score minus the score without (w/o) RAG, so a negative value means performance fell when RAG was added. Highest scores per column are printed in bold.
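
As a quick check on the sign convention, the \(\Delta\)RAG values can be recomputed from the reported means. The sketch below does this for the Comprehensiveness column only; the dictionary and the `delta_rag` helper are illustrative names, not part of the original study, and the rounding to two decimals is an assumption made to match the precision shown in the table.

```python
# Mean balanced accuracy for Comprehensiveness, copied from Table 1.
# Each entry maps a model to (score without RAG, score with RAG).
comprehensiveness = {
    "Llama 3.2 3B": (0.28, 0.25),
    "Qwen 2.5 14B": (0.42, 0.56),
    "DSR Llama 70B": (0.49, 0.53),
    "GPT-4o": (0.85, 0.76),
    "GPT-4o mini": (0.45, 0.45),
    "o3 mini": (0.67, 0.69),
    "Llama3 Med42 8B": (0.20, 0.19),
}


def delta_rag(without_rag: float, with_rag: float) -> float:
    """Return the RAG score minus the w/o RAG score (hypothetical helper).

    Rounded to two decimals to match the table's precision.
    """
    return round(with_rag - without_rag, 2)


# Recompute the delta for every model in the column.
deltas = {
    model: delta_rag(without_rag, with_rag)
    for model, (without_rag, with_rag) in comprehensiveness.items()
}

for model, delta in deltas.items():
    print(f"{model:16s} {delta:+.2f}")
```

A negative result, as for GPT-4o, indicates that the score with RAG is lower than the score without it.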