Chen, Cai, et al. present a rigorous human-in-the-loop framework for evaluating the medical performance of AI-generated responses to real clinical questions, using a scale aligned with physician career stages. Although some models perform at levels comparable to early-career physicians, substantial rates of incompetent answers and hallucinations show that unmonitored clinical deployment remains risky and that expert oversight is still needed.
- Peikai Chen
- Jifu Cai
- Kenneth M. C. Cheung