Table 2 Few-shot experiments (with up to three in-context examples)
| Prompt | d002 0-shot | d002 1-shot | d002 2-shot | d002 3-shot | d003 0-shot | d003 1-shot | d003 2-shot | d003 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.76 | 0.70 | 0.78 | 0.78 | 0.76 | 0.86 | 0.86 | 0.86 |
| Non-expert | 0.48 | 0.64 | 0.74 | 0.76 | 0.72 | 0.82 | 0.82 | 0.82 |
| Expert | 0.68 | 0.74* | 0.76* | 0.78* | 0.72 | 0.82* | 0.84* | 0.84* |

| Prompt | FT5 0-shot | FT5 1-shot | FT5 2-shot | FT5 3-shot | ChatGPT 0-shot | ChatGPT 1-shot | ChatGPT 2-shot | ChatGPT 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.56 | 0.66 | 0.64 | 0.70 | 0.76 | 0.82 | 0.88 | 0.84 |
| Non-expert | 0.54 | 0.68* | 0.66* | 0.64 | 0.80 | 0.80 | 0.88 | 0.86 |
| Expert | 0.74 | 0.68* | 0.72 | 0.72 | 0.90 | 0.84 | 0.88 | 0.88 |

| Prompt | Llama3 0-shot | Llama3 1-shot | Llama3 2-shot | Llama3 3-shot | GPT-4 0-shot | GPT-4 1-shot | GPT-4 2-shot | GPT-4 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.82 | 0.86 | 0.84 | 0.80 | 0.86 | 0.84 | 0.86 | 0.86 |
| Non-expert | 0.80 | 0.76 | 0.80 | 0.80 | 0.86 | 0.86 | 0.88 | 0.88 |
| Expert | 0.80 | 0.76 | 0.82 | 0.78 | 0.88 | 0.88 | 0.92 | 0.90 |

| Prompt | MedLlama3 0-shot | MedLlama3 1-shot | MedLlama3 2-shot | MedLlama3 3-shot |
|---|---|---|---|---|
| No-context | 0.78 | 0.80 | 0.76 | 0.76 |
| Non-expert | 0.76 | 0.78 | 0.78 | 0.82 |
| Expert | 0.80 | 0.82 | 0.80 | 0.80 |
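
Each cell reports accuracy for one prompt style (no-context, non-expert persona, or expert persona) combined with 0 to 3 in-context examples. As a rough illustration of how such a k-shot prompt can be assembled, the sketch below builds one from a persona preamble plus k worked question/answer pairs; the persona wordings and example questions are placeholders, not the paper's actual prompts.

```python
# Sketch of k-shot prompt assembly for yes/no health questions.
# Persona texts and in-context examples are illustrative placeholders,
# not the prompts or data used in the paper.

PERSONAS = {
    "no-context": "",
    "non-expert": "Answer the following health question.",
    "expert": "You are a medical expert. Answer the following health question.",
}

# Hypothetical in-context examples: (question, gold answer).
SHOTS = [
    ("Does vitamin C cure the common cold?", "No"),
    ("Can regular exercise lower blood pressure?", "Yes"),
    ("Are antibiotics effective against viral infections?", "No"),
]

def build_prompt(question: str, persona: str, k: int) -> str:
    """Assemble a k-shot prompt (k = 0..3) for the given persona."""
    parts = []
    if PERSONAS[persona]:
        parts.append(PERSONAS[persona])
    for q, a in SHOTS[:k]:  # prepend up to k worked examples
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")  # query to be answered
    return "\n\n".join(parts)

# Example: expert persona with two in-context examples (a "2-shot" cell).
print(build_prompt("Can zinc supplements shorten a cold?", "expert", k=2))
```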