Table 2. Summary of datasets evaluated in this study and the methodology applied to each

From: A toolbox for surfacing health equity harms and biases in large language models

| Name | Count | Description | Rubrics | Rater groups |
| --- | --- | --- | --- | --- |
| OMAQ | 182 | Human-written queries including explicit and implicit adversarial queries across health topics | Independent, pairwise | Physician, health equity expert |
| EHAI | 300 | Equity-related health questions written using participatory research methods | Independent, pairwise | Physician, health equity expert |
| FBRT-Manual | 150 | Human-written queries written using Med-PaLM 2 failure cases, designed to cover different failure modes | Independent, pairwise | Physician, health equity expert |
| FBRT-LLM | 661 | LLM-produced queries using Med-PaLM 2 failure cases, designed to cover different failure modes. Subset of the full set of 3,558 | Independent, pairwise | Physician, health equity expert |
| TRINDS | 106 | Questions related to diagnosis, treatment and prevention of tropical diseases, generally in a global context | Independent, pairwise | Physician, health equity expert |
| CC-Manual | 123 pairs | Human-written pairs of questions with changes in axes of identity or other context | Independent, counterfactual | Physician, health equity expert |
| CC-LLM | 200 pairs | LLM-produced pairs of questions with changes in axes of identity or other context | Independent, counterfactual | Physician, health equity expert |
| HealthSearchQA | 1,061 | Sample of long-form medical questions studied by Singhal et al. [3,4] | Independent, pairwise | Physician, health equity expert |
| Omiye et al. | 9 | The set of questions used by Omiye et al. [23] to test models for harmful race-based misconceptions | Independent, pairwise | Physician, health equity expert |
| Mixed MMQA–OMAQ | 240 | 140 questions sampled from MultiMedQA [3,4] and 100 questions sampled from OMAQ used for some analyses | Independent, pairwise | Physician, health equity expert, consumer |

  1. These include the seven EquityMedQA datasets as well as three additional datasets used for further evaluations and comparisons with prior studies.