Foundation models in medicine are a social experiment

The rise of foundation models has significantly accelerated the adoption of AI in healthcare, transforming both its pace and scope. Unlike traditional medical technologies, which have been gradually integrated into healthcare following extensive testing, unapproved and novel AI models, such as those from the GPT series, have swiftly been employed to support medical tasks1. As general-purpose AI systems, foundation models serve as versatile platforms for a wide range of applications, including language-based tasks powered by large language models (LLMs)2. Moreover, these models hold the potential to expand into multimodal domains, enabling the analysis and generation of images, sound, and video—or even integrating these capabilities together3.

To understand the immediate impact of foundation models on healthcare, we conducted a systematic literature review focusing on their use in the month following OpenAI’s release of ChatGPT4. Our findings revealed a broad range of potential applications, including promises to enhance healthcare efficiency, alleviate the workload of healthcare professionals, and improve patient outcomes. Foundation models, particularly generic and publicly available LLM-based applications, were quickly tested to assist in diagnosing and triaging patients, to streamline clinical workflows by supporting documentation, and to provide direct patient support4.

However, we also identified a significant lack of professional guidelines and clear frameworks for the appropriate use of foundation models in medicine. Additionally, we found evidence of several unwanted effects, including “hallucinations”, biases, opacity of models, and potential privacy violations4. This contributes to a growing discourse on the contradictory expectations surrounding foundation models in healthcare. While some highlight their promise for improving clinical outcomes and expanding access to health information—potentially reducing health disparities—others caution against the inconclusive evidence on their diagnostic reasoning5 and their potential to perpetuate existing inequities6 or to reproduce racial biases in medicine7. Reports have already documented real-world harm to patients who relied on layperson-facing applications of foundation models for self-triage8, raising further concerns about their safe integration into healthcare systems.

Regulating the unpredictable?

The growing use of potent foundation models, alongside their ambiguous potential in real-world medical settings, has sparked calls for robust regulatory frameworks. Some have argued for regulations akin to those governing medical devices9, while others propose the creation of a novel regulatory category tailored to their unique features10. However, regulating publicly available foundation models presents significant challenges9. Generic models, such as Claude, Gemini, or GPT-4, make no claim of medical use and therefore typically fall outside the scope of medical device regulations in regions such as the US or the EU10. The rapid evolution of foundation models, combined with their general-purpose capabilities and often inconsistent, sometimes unpredictable outputs, makes performance testing difficult. Their inherent opacity further complicates regulatory efforts, as many publicly available applications deprive users of the means to critically evaluate their outputs9,10.

The call for regulation is also rooted in an overly optimistic belief that it should be possible to foresee the impact of such technologies and how these systems will be used. Such a “crystal ball” approach is unlikely to succeed for foundation models. The significant uncertainties they introduce stem from a lack of real-world operational experience11 and understanding, making their impact difficult to predict. As with other advanced technologies, the consequences of foundation models—both positive and negative—for healthcare systems, patients, and professionals cannot be fully anticipated before these systems are widely implemented. Unlike other technologies, foundation models are highly versatile and can function as general-purpose technologies9, allowing for a broad range of applications that are often shaped by the unpredictable ways in which users choose to engage with them12. Most publicly available models lack a clearly defined purpose for their use; it is therefore unsurprising that medical professionals have begun to experiment with these tools. As we observed in our review data, right after the release of ChatGPT in November 2022, a vibrant phase of exploration and testing emerged, often driven by curiosity and an experimental attitude that lacked institutionalized control and oversight4.

Foundation models as a social experiment

Given these developments, we suggest framing the ongoing use of foundation models in medicine as a large-scale social experiment13. This notion captures the introduction of experimental technologies into uncontrolled environments, effectively transforming society—or specific sectors like healthcare—into real-world laboratories. This perspective highlights the largely unpredictable, uncontrolled, and not entirely controllable nature of foundation model deployment in healthcare. It acknowledges two key points: first, that efforts to fully anticipate and foresee the risks, benefits, and ethical implications of foundation models are likely to achieve only partial success; second, that a deeper understanding of these risks and benefits can only be gained incrementally, as the technology becomes embedded in real-world scenarios and social practices.

Within just two years of ChatGPT’s release, one in five general practitioners in the UK has incorporated generative AI tools into clinical practice14. Without comprehensive consent procedures, oversight, or established methodologies, millions of patients are, in effect, being enrolled as subjects in the informal testing of these technologies. Concurrently, one in six adults in the US now turns to AI chatbots for health-related questions15, often sharing sensitive personal information. For the first time, significant volumes of patient data are being processed by AI systems without clear instructions or defined clinical roles. Foundation models trained for nonmedical purposes are thus being used in capacities typically reserved for certified, rigorously tested, and monitored medical devices.

Ethics framework for experimental technologies

Observing current developments requires acknowledging that increasing control over foundation models in healthcare necessitates a coordinated process of gradual learning and reduction of uncertainties. From an ethical perspective, this demands moving from a largely tacit to a deliberate and reflective approach. The experimental nature of foundation models calls for strategies to mitigate risks, particularly in sensitive domains like medicine, where the consequences of errors can be severe.

For dealing with experimental technologies, van de Poel proposes a framework that synthesizes different approaches from research ethics and biomedical ethics, echoing the four-principles approach: non-maleficence, beneficence, autonomy, and justice. His approach builds on earlier work13 and has since been applied and further developed in various contexts to address experimental technologies16,17. For the further integration of foundation models into medicine and healthcare, we suggest applying this framework. Given the specific nature of these AI systems, which can be opaque to human actors, we propose extending the framework with the widely accepted principle of explicability18. Explicability addresses the specific epistemic challenges posed by foundation models by ensuring that the inner workings of these systems are reasonably understandable to users and stakeholders (intelligibility) and by clarifying who is responsible for decisions based on the output of AI systems (accountability)18.

However, applying these principles remains challenging due to the inherent uncertainties that come with experimental technologies. Limited knowledge exists about which foundation model applications might cause harm, raise justice concerns, or require specific information for truly informed consent. Therefore, these ethical principles must be translated into conditions that address these uncertainties, allowing for a conscious and deliberately phased introduction that mitigates potentially severe consequences.

In this context, van de Poel proposes an incremental approach as a process of iterative learning from experience, in which technologies are tested cautiously, step by step, and on a small scale11. Embracing this framework means acknowledging that errors may occur, while ensuring that they happen on a limited scale and generate feedback that informs the learning process and reduces risks over time. Through this gradual accumulation of knowledge, errors become acceptable, provided they contribute to the overall improvement and safe integration of the technology into healthcare.

Table 1 outlines an adapted framework for experimental technologies, providing a non-exhaustive set of conditions and corresponding specifications for the responsible integration of foundation models into medicine. While it may not be possible to fully meet all of these conditions, a negative response to any specific question should prompt a pause to carefully assess the ethical implications and the experimental design. This approach minimizes the need for extensive anticipatory knowledge by emphasizing incremental learning and step-by-step implementation. Framing foundation models as a social experiment offers a pragmatic mechanism for managing emerging technologies responsibly. It avoids the rigidity of moratoriums and the unrealistic expectation of foreseeing all possible outcomes, instead advocating for a flexible and adaptive framework that prioritizes safe and ethical use.

Table 1 Ethical conditions for experimental technologies (framework adapted from van de Poel 2016) and corresponding questions to determine the responsible use of foundation models (FM) in medicine

The framework provides a starting point for practitioners, as well as researchers and developers of foundation models, to critically reflect on their work from an ethical perspective—something we see as essential for conducting responsible research with experimental technologies. While it is not an exhaustive checklist of necessary or sufficient criteria, and while research projects may vary in their potential ethical implications depending on their specific aims, means, and contexts, the framework is intended to steer reflection towards key areas of ethical relevance and to support concrete actions—such as adjusting the research design, seeking further ethical guidance, or implementing risk mitigation strategies. In the future, this framework may also serve as a basis for rethinking the criteria for institutionalized oversight of research with foundation model applications in healthcare. However, doing so would involve addressing a number of additional questions and lies beyond the scope of this paper.