Table 2 Mean balanced accuracy of models across different age groups without (w/o) and with RAG

Evaluated Models	Young w/o RAG	Young RAG	Young \(\Delta\)RAG	Mid-Age/Pregeriatric w/o RAG	Mid-Age/Pregeriatric RAG	Mid-Age/Pregeriatric \(\Delta\)RAG	Geriatric w/o RAG	Geriatric RAG	Geriatric \(\Delta\)RAG
Llama 3.2 3B	0.33 ± 0.07	0.38 ± 0.04	+0.05	0.36 ± 0.08	0.42 ± 0.05	+0.06	0.47 ± 0.06	0.46 ± 0.04	−0.01
Qwen 2.5 14B	0.47 ± 0.06	0.57 ± 0.01	+0.10	0.47 ± 0.03	0.48 ± 0.01	+0.01	0.63 ± 0.04	0.66 ± 0.02	+0.03
DSR Llama 70B	0.47 ± 0.03	0.52 ± 0.01	+0.05	0.51 ± 0.01	0.48 ± 0.01	−0.03	0.70 ± 0.01	0.69 ± 0.02	−0.01
GPT-4o	0.67 ± 0.02	0.65 ± 0.03	−0.02	0.63 ± 0.02	0.55 ± 0.01	−0.08	0.78 ± 0.02	0.76 ± 0.01	−0.02
GPT-4o mini	0.48 ± 0.07	0.46 ± 0.07	−0.02	0.49 ± 0.03	0.45 ± 0.04	−0.04	0.68 ± 0.04	0.68 ± 0.04	±0
o3 mini	0.57 ± 0.04	0.58 ± 0.04	+0.01	0.54 ± 0.02	0.51 ± 0.01	−0.03	0.72 ± 0.03	0.75 ± 0.02	+0.03
Llama3 Med42 8B	0.31 ± 0.06	0.30 ± 0.01	−0.01	0.39 ± 0.06	0.33 ± 0.03	−0.06	0.54 ± 0.04	0.45 ± 0.03	−0.09

In both scenarios (w/o RAG and with RAG), all models achieve their highest scores for the “geriatric” age group. Highest scores per column are printed in bold.
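
The \(\Delta\)RAG columns and the note above can be checked directly from the reported means. The following is a minimal sketch, assuming that \(\Delta\)RAG is the mean balanced accuracy with RAG minus the mean without RAG, rounded to two decimals. The names `TABLE_2`, `delta_rag`, and `best_group` are illustrative and not taken from the study's code. Standard deviations are omitted.

```python
# Mean balanced accuracies transcribed from Table 2 (standard deviations omitted).
# Layout: model -> {age group: (score w/o RAG, score with RAG)}.
TABLE_2 = {
    "Llama 3.2 3B":    {"young": (0.33, 0.38), "mid": (0.36, 0.42), "geriatric": (0.47, 0.46)},
    "Qwen 2.5 14B":    {"young": (0.47, 0.57), "mid": (0.47, 0.48), "geriatric": (0.63, 0.66)},
    "DSR Llama 70B":   {"young": (0.47, 0.52), "mid": (0.51, 0.48), "geriatric": (0.70, 0.69)},
    "GPT-4o":          {"young": (0.67, 0.65), "mid": (0.63, 0.55), "geriatric": (0.78, 0.76)},
    "GPT-4o mini":     {"young": (0.48, 0.46), "mid": (0.49, 0.45), "geriatric": (0.68, 0.68)},
    "o3 mini":         {"young": (0.57, 0.58), "mid": (0.54, 0.51), "geriatric": (0.72, 0.75)},
    "Llama3 Med42 8B": {"young": (0.31, 0.30), "mid": (0.39, 0.33), "geriatric": (0.54, 0.45)},
}


def delta_rag(without_rag, with_rag):
    """Assumed definition: score with RAG minus score without RAG, to two decimals."""
    return round(with_rag - without_rag, 2)


def best_group(groups, use_rag):
    """Return the age group with the highest mean score in the chosen scenario."""
    idx = 1 if use_rag else 0
    return max(groups, key=lambda g: groups[g][idx])


# Recompute every Delta-RAG cell from the two reported means.
deltas = {
    model: {group: delta_rag(*pair) for group, pair in groups.items()}
    for model, groups in TABLE_2.items()
}

# Check the table note: every model peaks on the geriatric group in both scenarios.
all_geriatric = all(
    best_group(groups, use_rag) == "geriatric"
    for groups in TABLE_2.values()
    for use_rag in (False, True)
)

for model, row in deltas.items():
    print(model, row)
print("All models peak on geriatric:", all_geriatric)
```

Because the differences are taken between already rounded means, a recomputed value could in principle differ from a published one by 0.01 if the authors subtracted unrounded means.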
