Table 2 Few-shot experiments (with up to three in-context examples)
| Prompt | d002 0-shot | d002 1-shot | d002 2-shot | d002 3-shot | d003 0-shot | d003 1-shot | d003 2-shot | d003 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.76 | 0.70 | 0.78 | 0.78 | 0.76 | 0.86 | 0.86 | 0.86 |
| Non-expert | 0.48 | 0.64 | 0.74 | 0.76 | 0.72 | 0.82 | 0.82 | 0.82 |
| Expert | 0.68 | 0.74* | 0.76* | 0.78* | 0.72 | 0.82* | 0.84* | 0.84* |

| Prompt | FT5 0-shot | FT5 1-shot | FT5 2-shot | FT5 3-shot | ChatGPT 0-shot | ChatGPT 1-shot | ChatGPT 2-shot | ChatGPT 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.56 | 0.66 | 0.64 | 0.70 | 0.76 | 0.82 | 0.88 | 0.84 |
| Non-expert | 0.54 | 0.68* | 0.66* | 0.64 | 0.80 | 0.80 | 0.88 | 0.86 |
| Expert | 0.74 | 0.68* | 0.72 | 0.72 | 0.90 | 0.84 | 0.88 | 0.88 |

| Prompt | Llama3 0-shot | Llama3 1-shot | Llama3 2-shot | Llama3 3-shot | GPT-4 0-shot | GPT-4 1-shot | GPT-4 2-shot | GPT-4 3-shot |
|---|---|---|---|---|---|---|---|---|
| No-context | 0.82 | 0.86 | 0.84 | 0.80 | 0.86 | 0.84 | 0.86 | 0.86 |
| Non-expert | 0.80 | 0.76 | 0.80 | 0.80 | 0.86 | 0.86 | 0.88 | 0.88 |
| Expert | 0.80 | 0.76 | 0.82 | 0.78 | 0.88 | 0.88 | 0.92 | 0.90 |

| Prompt | MedLlama3 0-shot | MedLlama3 1-shot | MedLlama3 2-shot | MedLlama3 3-shot |
|---|---|---|---|---|
| No-context | 0.78 | 0.80 | 0.76 | 0.76 |
| Non-expert | 0.76 | 0.78 | 0.78 | 0.82 |
| Expert | 0.80 | 0.82 | 0.80 | 0.80 |
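
Each cell reports accuracy for one prompt style (no-context, non-expert persona, or expert persona) combined with 0 to 3 in-context examples. As a rough illustration of how such a k-shot prompt can be assembled, the sketch below builds one from a persona preamble plus k worked question/answer pairs; the persona wordings and example questions are placeholders, not the paper's actual prompts.

```python
# Sketch of k-shot prompt assembly for yes/no health questions.
# Persona texts and in-context examples are illustrative placeholders,
# not the prompts or data used in the paper.

PERSONAS = {
    "no-context": "",
    "non-expert": "Answer the following health question.",
    "expert": "You are a medical expert. Answer the following health question.",
}

# Hypothetical in-context examples: (question, gold answer).
SHOTS = [
    ("Does vitamin C cure the common cold?", "No"),
    ("Can regular exercise lower blood pressure?", "Yes"),
    ("Are antibiotics effective against viral infections?", "No"),
]

def build_prompt(question: str, persona: str, k: int) -> str:
    """Assemble a k-shot prompt (k = 0..3) for the given persona."""
    parts = []
    if PERSONAS[persona]:
        parts.append(PERSONAS[persona])
    for q, a in SHOTS[:k]:  # prepend up to k worked examples
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question}\nAnswer:")  # query to be answered
    return "\n\n".join(parts)

# Example: expert persona with two in-context examples (a "2-shot" cell).
print(build_prompt("Can zinc supplements shorten a cold?", "expert", k=2))
```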