MedHELM is an extensible framework for evaluating large language models on real-world clinical tasks. It comprises a new taxonomy for classifying medical tasks and a benchmark of many datasets spanning the taxonomy's categories.
- Suhana Bedi
- Hejie Cui
- Nigam H. Shah