Table 2: Summary of datasets evaluated in this study and the methodology applied to each
From: A toolbox for surfacing health equity harms and biases in large language models
Name | Count | Description | Rubrics | Rater groups
|---|---|---|---|---|
OMAQ | 182 | Human-written queries including explicit and implicit adversarial queries across health topics | Independent, pairwise | Physician, health equity expert |
EHAI | 300 | Equity-related health questions written using participatory research methods | Independent, pairwise | Physician, health equity expert |
FBRT-Manual | 150 | Human-written queries created using Med-PaLM 2 failure cases, designed to cover different failure modes | Independent, pairwise | Physician, health equity expert |
FBRT-LLM | 661 | LLM-produced queries created using Med-PaLM 2 failure cases, designed to cover different failure modes. A subset of the full set of 3,558 queries | Independent, pairwise | Physician, health equity expert |
TRINDS | 106 | Questions related to diagnosis, treatment and prevention of tropical diseases, generally in a global context | Independent, pairwise | Physician, health equity expert |
CC-Manual | 123 pairs | Human-written pairs of questions with changes in axes of identity or other context | Independent, counterfactual | Physician, health equity expert |
CC-LLM | 200 pairs | LLM-produced pairs of questions with changes in axes of identity or other context | Independent, counterfactual | Physician, health equity expert |
HealthSearchQA | 1,061 | Sample of long-form medical questions studied by Singhal et al. (refs. 3,4) | Independent, pairwise | Physician, health equity expert |
Omiye et al. | 9 | The set of questions used by Omiye et al. (ref. 23) to test models for harmful race-based misconceptions | Independent, pairwise | Physician, health equity expert |
Mixed MMQA–OMAQ | 240 | 140 questions sampled from MultiMedQA (refs. 3,4) and 100 questions sampled from OMAQ; this mixed set is used for some analyses | Independent, pairwise | Physician, health equity expert, consumer |