Fig. 3: LLM mean balanced accuracy across various system prompts, age groups and diseases. | npj Digital Medicine

From: Benchmarking large language models for personalized, biomarker-based health intervention recommendations

a Overview of LLM performance across five system prompts without retrieval-augmented generation (RAG). The response quality of Llama 3.2 3B and Qwen 2.5 14B, and also of GPT-4o mini and Llama3 Med42 8B, depends notably on the system prompt. b System-prompt-specific LLM performance with RAG. c Distribution of LLM performance across three age groups without RAG. All models perform significantly better for geriatric individuals than for the other two age groups (P < 0.001). d Distribution of LLM performance across three age groups with RAG. e Distribution of LLM performance across diseases without RAG. LLMs achieve higher scores for degenerative diseases. The mean balanced accuracy is displayed above each bar. Error bars and individual data points are shown. GH Growth hormone, PCOS Polycystic ovarian syndrome.
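
The caption reports mean balanced accuracy but does not define it. As a minimal sketch, assuming the standard definition (the unweighted mean of per-class recall), the metric can be computed as shown here. The function name and the example labels are hypothetical and are not taken from the article; the authors' exact computation may differ.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed standard definition)."""
    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    # Average the recall of every class that occurs in y_true
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, imbalanced labels (illustration only, not study data)
y_true = ["yes", "yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "yes", "no", "no"]

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy: {score:.3f}, plain accuracy: {plain:.3f}")
```

On imbalanced labels the two metrics diverge, because balanced accuracy weights each class equally regardless of how many samples it has.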
