Fig. 4: Overview of development and test sets used in this study.

A development set was created to guide prompt engineering. Test set 1 was also used in our prior work on KEEPER and thus provides a benchmark for consistency. Test set 2 mirrors test set 1 but uses insurance claims data. Test set 3 is a truly random sample across more diseases, intended to enhance generalizability. The highly sensitive set demonstrates the use of LLMs to annotate a large set of patients, allowing computation of the sensitivity and positive predictive value (PPV) of phenotype algorithms.
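
As a reminder of what the last two metrics measure, the following is a minimal sketch, not the study's own code. It assumes each patient has a reference label (for instance, the LLM annotation) and a phenotype algorithm label, both booleans. The function name `sensitivity_and_ppv` and the example labels are hypothetical.

```python
import math


def sensitivity_and_ppv(reference_labels, algorithm_labels):
    """Return (sensitivity, ppv) for paired boolean labels.

    reference_labels: True where the reference annotation marks a case
                      (assumed here to be the LLM annotation).
    algorithm_labels: True where the phenotype algorithm flags the patient.
    """
    if len(reference_labels) != len(algorithm_labels):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference_labels, algorithm_labels))
    true_pos = sum(1 for ref, alg in pairs if ref and alg)
    false_neg = sum(1 for ref, alg in pairs if ref and not alg)
    false_pos = sum(1 for ref, alg in pairs if alg and not ref)

    # Sensitivity: share of reference cases that the algorithm finds.
    sensitivity = (
        true_pos / (true_pos + false_neg) if (true_pos + false_neg) else math.nan
    )
    # PPV: share of algorithm-flagged patients that are reference cases.
    ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else math.nan
    return sensitivity, ppv


# Hypothetical labels for eight patients (illustration only, not study data).
reference = [True, True, True, True, False, False, False, False]
algorithm = [True, True, False, False, True, False, False, False]
sens, ppv = sensitivity_and_ppv(reference, algorithm)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

Sensitivity can only be estimated when the annotated set captures most true cases, which is why a highly sensitive set is needed for it, whereas PPV requires only that the patients flagged by the algorithm be annotated.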