Table 3 Examples of inconsistent, missing, and hallucinated responses

From: Benchmarking large language models for biomedical natural language processing applications and recommendations

Type

Example

1. Inconsistency

Example 1. GPT-3 zero-shot on ChemProt

Required output: one of six labels: the relation types CPR:3, CPR:4, CPR:5, CPR:6, CPR:9, or false

Actual and expected output: CPR:9 [the expected output]

Inconsistent output 1: SUBSTRATE [it might refer to CPR:9]

Inconsistent output 2: SUBSTRATE PRODUCT OF [it might refer to CPR:9]

Example 2. LLaMA 2 zero-shot on MedQA

Required output: a multiple-choice option from A to E

Actual and somewhat expected output: I’m happy to help! Based on the information provided, the correct answer is E. Age above 50 years. Aspirin is contraindicated… [it directly provided the option at the start, which is similar to the expected output]

Inconsistent output 1: I’m happy to help! Based on the patient’s symptoms and physical examination findings, the most likely diagnosis is pyloric stenosis. … [it did not provide the answer option]

Inconsistent output 2: Great! Let’s break down the question and the options. The question is asking about an enzyme deficiency that is causing the symptoms described in the scenario. … Therefore, option E, …, is the correct answer. [it went through all the options and provided the answer option only at the end]

Inconsistent output 3: Great question! To confirm the most likely diagnosis of this patient, we need to consider the following options: Options A, C, and D … Therefore, the correct answer is: E. … It is important to consider all possible causes…

2. Missingness

Example 1. GPT-3.5 one-shot on MedQA

Required output: a multiple-choice option from A to E

Actual and expected output: C [the expected output]

Missing output 1: [no answer]

3. Hallucinations

Example 1. LLaMA 2 zero-shot on HoC

Required output: one or more document topic labels out of 10 in total separated by semicolons

Actual and somewhat expected output: Based on the input article, the following topics are relevant: * Sustaining proliferative signaling * Resisting cell death * Enabling replicative immortality * Activating invasion and metastasis * Genomic instability and mutation * Tumor promoting inflammation * Cellular energetics * Avoiding immune destruction Please note that the article is related to multiple topics, and the output is a semicolon-separated list of relevant topics. [it lists the topics in roughly the expected format, though they are not separated by semicolons]

Hallucinated output 1: Please provide the list of topics that are relevant to the input article.

Example 2. LLaMA 2 zero-shot on MedQA

Required output: a multiple-choice option from A to E

Actual and expected output: an example is provided above (1. Inconsistency, Example 2)

Hallucinated output 1: Great! You have selected the correct answer. Let me explain why. …

Hallucinated output 2: That’s correct! Tetralogy of Fallot is a congenital heart defect …

Hallucinated output 3: Great question! Based on the patient’s symptoms and physical examination findings, the most likely impaired structure is the ________________. … [it asks to fill in the blank]

Hallucinated output 4: Please select one of the options from A to E.

Note: Text in square brackets represents annotated explanations. Unnecessary detail is omitted due to space constraints.
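
The MedQA rows above suggest a simple first-pass triage of raw responses before scoring. The following is a minimal sketch, not the authors' evaluation code: the function name `triage_response`, the category names, and both regular expressions are assumptions made for illustration. It separates a bare option letter, an option letter embedded in free text, an empty response, and everything else. It cannot tell a hallucinated response from a merely unparseable one, so the last category would still need manual review.

```python
import re

# Hypothetical patterns; tune them for the prompts and models actually used.
# A response that is nothing but an option letter, e.g. "C" or "C."
_BARE = re.compile(r"^\s*([A-E])[.)]?\s*$")
# An option letter embedded in prose, e.g. "the correct answer is: E."
# Only the cue words are case-insensitive; the letter must be uppercase.
_EMBEDDED = re.compile(r"(?i:answer is|option)[:\s,]*([A-E])\b")


def triage_response(raw):
    """Return (category, option) for one raw multiple-choice response.

    Categories (assumed names, not taken from the table):
      "missing"  - empty or whitespace-only response
      "exact"    - the response is just the option letter
      "embedded" - an option letter was found inside free text
      "unparsed" - no option letter found; needs manual review
    """
    if raw is None or not raw.strip():
        return ("missing", None)
    bare = _BARE.match(raw)
    if bare:
        return ("exact", bare.group(1))
    embedded = _EMBEDDED.findall(raw)
    if embedded:
        # Heuristic: when several letters are mentioned, keep the last one,
        # since the examples above state the final answer after the reasoning.
        return ("embedded", embedded[-1])
    return ("unparsed", None)
```

Under these assumptions, `triage_response("C")` falls in the "exact" category, a response such as "Therefore, the correct answer is: E." falls in "embedded", and "Please select one of the options from A to E." is left as "unparsed". Taking the last match is a judgment call that can pick the wrong letter when a response discusses other options after stating its answer.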