Table 2 Few-shot experiments (with up to three in-context examples)

From: Evaluating search engines and large language models for answering health questions

Prompt       d002                                 d003
             0-shot   1-shot   2-shot   3-shot    0-shot   1-shot   2-shot   3-shot
No-context   0.76     0.70     0.78     0.78      0.76     0.86     0.86     0.86
Non-expert   0.48     0.64     0.74     0.76      0.72     0.82     0.82     0.82
Expert       0.68     0.74*    0.76*    0.78*     0.72     0.82*    0.84*    0.84*

Prompt       FT5                                  ChatGPT
             0-shot   1-shot   2-shot   3-shot    0-shot   1-shot   2-shot   3-shot
No-context   0.56     0.66     0.64     0.70      0.76     0.82     0.88     0.84
Non-expert   0.54     0.68*    0.66*    0.64      0.80     0.80     0.88     0.86
Expert       0.74     0.68*    0.72     0.72      0.90     0.84     0.88     0.88

Prompt       Llama3                               GPT-4
             0-shot   1-shot   2-shot   3-shot    0-shot   1-shot   2-shot   3-shot
No-context   0.82     0.86     0.84     0.80      0.86     0.84     0.86     0.86
Non-expert   0.80     0.76     0.80     0.80      0.86     0.86     0.88     0.88
Expert       0.80     0.76     0.82     0.78      0.88     0.88     0.92     0.90

Prompt       MedLlama3
             0-shot   1-shot   2-shot   3-shot
No-context   0.78     0.80     0.76     0.76
Non-expert   0.76     0.78     0.78     0.82
Expert       0.80     0.82     0.80     0.80

Note: Accuracy of each model and prompt. For each row, if a few-shot instance outperforms the 0-shot case, the few-shot value is bolded, and the symbol “*” marks instances where the improvement was deemed statistically significant (McNemar’s test, α = 0.05).
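The significance marker in the note refers to McNemar’s test on paired per-question outcomes (0-shot vs. a few-shot run over the same questions). Below is a minimal sketch of such a comparison, assuming two lists of per-question correctness flags; the variable names and the example data are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of a McNemar comparison between a 0-shot and a few-shot run.
# Assumes paired per-question correctness flags (1 = correct, 0 = incorrect);
# the data below are illustrative placeholders, not results from the paper.
from statsmodels.stats.contingency_tables import mcnemar

zero_shot = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical 0-shot outcomes
few_shot  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]   # hypothetical 3-shot outcomes

# Build the 2x2 contingency table of agreements/disagreements between runs.
both_correct = sum(a == 1 and b == 1 for a, b in zip(zero_shot, few_shot))
only_zero    = sum(a == 1 and b == 0 for a, b in zip(zero_shot, few_shot))
only_few     = sum(a == 0 and b == 1 for a, b in zip(zero_shot, few_shot))
both_wrong   = sum(a == 0 and b == 0 for a, b in zip(zero_shot, few_shot))

table = [[both_correct, only_zero],
         [only_few,     both_wrong]]

# Exact McNemar's test on the discordant pairs; alpha = 0.05 as in the table note.
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
print("significant" if result.pvalue < 0.05 else "not significant")
```

The test only uses the discordant pairs (questions answered correctly by one configuration but not the other), which is why two runs with similar overall accuracy can still differ significantly, or vice versa.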