Table 2 Mean balanced accuracy of models across different age groups without (w/o) and with RAG

From: Benchmarking large language models for personalized, biomarker-based health intervention recommendations

| Evaluated Models | Young, w/o RAG | Young, RAG | Young, \(\Delta\)RAG | Mid-Age/Pregeriatric, w/o RAG | Mid-Age/Pregeriatric, RAG | Mid-Age/Pregeriatric, \(\Delta\)RAG | Geriatric, w/o RAG | Geriatric, RAG | Geriatric, \(\Delta\)RAG |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | 0.33 ± 0.07 | 0.38 ± 0.04 | +0.05 | 0.36 ± 0.08 | 0.42 ± 0.05 | +0.06 | 0.47 ± 0.06 | 0.46 ± 0.04 | −0.01 |
| Qwen 2.5 14B | 0.47 ± 0.06 | 0.57 ± 0.01 | +0.10 | 0.47 ± 0.03 | 0.48 ± 0.01 | +0.01 | 0.63 ± 0.04 | 0.66 ± 0.02 | +0.03 |
| DSR Llama 70B | 0.47 ± 0.03 | 0.52 ± 0.01 | +0.05 | 0.51 ± 0.01 | 0.48 ± 0.01 | −0.03 | 0.70 ± 0.01 | 0.69 ± 0.02 | −0.01 |
| GPT-4o | **0.67 ± 0.02** | **0.65 ± 0.03** | −0.02 | **0.63 ± 0.02** | **0.55 ± 0.01** | −0.08 | **0.78 ± 0.02** | **0.76 ± 0.01** | −0.02 |
| GPT-4o mini | 0.48 ± 0.07 | 0.46 ± 0.07 | −0.02 | 0.49 ± 0.03 | 0.45 ± 0.04 | −0.04 | 0.68 ± 0.04 | 0.68 ± 0.04 | ±0 |
| o3 mini | 0.57 ± 0.04 | 0.58 ± 0.04 | +0.01 | 0.54 ± 0.02 | 0.51 ± 0.01 | −0.03 | 0.72 ± 0.03 | 0.75 ± 0.02 | +0.03 |
| Llama3 Med42 8B | 0.31 ± 0.06 | 0.30 ± 0.01 | −0.01 | 0.39 ± 0.06 | 0.33 ± 0.03 | −0.06 | 0.54 ± 0.04 | 0.45 ± 0.03 | −0.09 |

  1. In both scenarios (w/o RAG and with RAG), all models achieve their highest scores for the "geriatric" age group. Highest scores per column are printed in bold.
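
The \(\Delta\)RAG columns and the footnote's claim can be re-derived from the reported means. The following is a minimal sketch, assuming that \(\Delta\)RAG is the RAG mean minus the w/o RAG mean rounded to two decimals (the table values are consistent with this reading). The names `TABLE_2`, `delta_rag`, and `best_age_group` are hypothetical and introduced only for illustration; the numbers are transcribed from Table 2.

```python
# Mean balanced accuracies transcribed from Table 2, stored as
# (w/o RAG, RAG) per age group. Standard deviations are omitted.
TABLE_2 = {
    "Llama 3.2 3B":    {"young": (0.33, 0.38), "mid": (0.36, 0.42), "geriatric": (0.47, 0.46)},
    "Qwen 2.5 14B":    {"young": (0.47, 0.57), "mid": (0.47, 0.48), "geriatric": (0.63, 0.66)},
    "DSR Llama 70B":   {"young": (0.47, 0.52), "mid": (0.51, 0.48), "geriatric": (0.70, 0.69)},
    "GPT-4o":          {"young": (0.67, 0.65), "mid": (0.63, 0.55), "geriatric": (0.78, 0.76)},
    "GPT-4o mini":     {"young": (0.48, 0.46), "mid": (0.49, 0.45), "geriatric": (0.68, 0.68)},
    "o3 mini":         {"young": (0.57, 0.58), "mid": (0.54, 0.51), "geriatric": (0.72, 0.75)},
    "Llama3 Med42 8B": {"young": (0.31, 0.30), "mid": (0.39, 0.33), "geriatric": (0.54, 0.45)},
}


def delta_rag(without_rag: float, with_rag: float) -> float:
    """Assumed definition: RAG mean minus w/o RAG mean, rounded to two decimals."""
    return round(with_rag - without_rag, 2)


def best_age_group(scores: dict, column: int) -> str:
    """Return the age group with the highest mean (column 0 = w/o RAG, 1 = RAG)."""
    return max(scores, key=lambda group: scores[group][column])


if __name__ == "__main__":
    for model, scores in TABLE_2.items():
        deltas = {group: delta_rag(*pair) for group, pair in scores.items()}
        print(model, deltas, best_age_group(scores, 0), best_age_group(scores, 1))
```

Run as a script, this prints each model's recomputed deltas together with its best-scoring age group in both scenarios, which can be compared against the \(\Delta\)RAG columns and the footnote.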