arising from Weissman et al. npj Digital Medicine https://www.nature.com/articles/s41746-025-01544-y (2025)

In their recent article, Weissman et al.1 examined the extent to which artificial intelligence (AI)-based large language models (LLMs) generate clinical decision support (CDS) outputs that meet the criteria of regulated medical devices1 and called for new regulations for LLM-based CDS systems. In this manuscript, we suggest that some proposed considerations have been addressed by existing guidelines and would not need new frameworks. We also outline specific areas in which new regulations may be required and risk mitigation strategies that could be incorporated.

Calls for Increased Scope of Regulations Given the Gaps Identified in This Research

First, the authors called for regulations to refine LLM CDS criteria pertaining to clinician or non-clinician end-users. Fortunately, this consideration has been embedded in the latest Food and Drug Administration (FDA) Software as a Medical Device (SaMD) guidelines, that are consistent with SaMD regulatory frameworks from the European Union, Australia, and Singapore2,3,4. CDS systems (CDSS) with device-like functionality are currently subject to strict regulatory approval calibrated to risk level across international frameworks4 while those meeting all four non-device FDA criteria are exempt. Nonetheless, all CDSS (regardless of underlying technology) fall under the broader SaMD framework, with the 2025 update5 refining explicit compliance requirements based on intended use and end-users (e.g. clinicians or non-clinicians).

Second, all CDSS regardless of their underlying technology, whether Generative AI (GenAI), LLM-based or otherwise, are subject to the current SaMD guidelines2. This has been accepted in the digital health field as reflected in publications by us and others. For example, in the APPRAISE study we gathered consensus from over 1000 international experts on clinician acceptance of various CDSS tools in ophthalmology based on FDA-SaMD criteria6. Therefore, new guidelines for LLM-based CDSS specifically may not be required.

Third, we agree with the authors on the need for new regulations of “generalized” CDSS that are not anchored to specific clinical indications. Contextualizing this regulatory gap requires distinguishing between confined and unconfined AI systems. Early CDSS that were popularised comprised of confined, deterministic clinical software (DCS) algorithms that have known, fixed input data-output label (IDOL) relationships. These systems generated outputs from predefined, bounded labels e.g., binary disease classifiers (present/absent) or categorical risk stratification (low/medium/high). They were well-addressed by existing regulatory guidelines based on evaluation using relatively small, well-characterized datasets. The next iteration of CDSS that followed were confined clinical software (CCS), using techniques such as deep learning (DL) to improve handling of IDOL pairings with unknown relationships. These CCS exhibited predictable variability due to a confined spectrum of output labels, and remained amenable to evaluation with expanded datasets based on existing FDA-SaMD guidelines. While the outputs were conformed in some ways, they were not necessarily safe7 and so required extensive testing and assurances such as post-processing limits or guardrails.

In contrast, unconfined AI systems, such as general-purpose CDSS using transformer-based LLMs, operate across an open-ended semantic space in response to unstructured input prompts. This design introduces unique risks, including errors and outright “hallucinations”. Hallucinations are semantic errors and can be considered inherent to the engineering of LLMs, as the models are a small approximate representation of a large corpus of training material, using what could be considered a form of data compression8. Within unconfined systems, a further distinction may be made based on whether they incorporate non-deterministic components. Most LLMs are based on transformers which are inherently deterministic (the same input always results in the same output). However, non-determinism may be (sometimes intentionally) introduced via methods such as “temperature”, which involve random selection of the logits at the top layer of the transformer stack, or sometimes caused by inaccuracies in floating point calculations9,10. Some contemporary LLMs employ temperature to enhance naturalistic, human-like language generation. This stochasticity produces an unpredictably random spectrum of probabilistically-sampled outputs that are difficult to confine, as demonstrated in Weissman et al’s study. This limits the feasibility of traditional dataset-driven evaluation based on exhaustive testing with large datasets. Together with the massive compression of training data into a relatively small model, this creates a behaviour that could be referred to as non-determinism.

Therefore, we suggest that a new category of regulations may be required for novel general-purpose SaMD solutions developed using GenAI or other AI techniques for generalized CDS, which we term unconfined non-deterministic clinical software (UNDCS). This could apply across all health-related applications of such UNDCS technology, from healthcare administration to health promotion, not limited to CDSS. These regulations may set standards for the inclusion of potential safeguards such as red teaming, guardrails, agent-agent moderation and confined retrieval-augmented generation (RAG). Each approach offers distinct strengths and weaknesses in addressing UNDCS’ unique risks (Fig. 1) detailed in the next section. Unconfined deterministic systems may benefit from red teaming test cases specifically designed to target critical failure modes or underrepresented clinical scenarios, along with multi-agent system (MAS) implementations which offer the potential for consensus and thereby lower the frequency of errors. UNDCS can also be improved by extensive testing with repeated sampling and adjudication through LLM-as-a-Judge loops to score and verify aggregate output validity across multiple runs, albeit with some limitations11.

Fig. 1: Strengths and Weaknesses of Potential Safeguards to Facilitate the Alignment of Unconfined Non-Deterministic Clinical Software (UNDCS).
figure 1

Description: This figure outlines the strengths and weaknesses of potential safeguards for unconfined non-deterministic clinical software (UNDCS) including red teaming, guardrails, agent-agent moderation and confined retrieval-augmented generation (RAG) that can be used to facilitate their alignment with intended use.

Risk Mitigation for Unconfined Non-Deterministic Clinical Software (UNDCS) such as Clinical GenAI

First, red teaming involves stress-testing AI systems by simulating challenging scenarios under experimental conditions through techniques such as jailbreaking, prompt injection and adversarial attacks. This also presents an opportunity to involve clinicians early in traditionally developer-driven technical evaluation, providing them practical insight into UNDCS strengths and limitations12.

Second, guardrails are algorithms that can be used to help filter inappropriate LLM output. Existing open-source frameworks include Llama Guard and Guardrails AI with healthcare-specific implementations demonstrating promise in addressing these risks13. However, these systems are underpinned by computationally-based methods that are not always able to consistently check the full spectrum of non-deterministic LLM outputs. Moreover, their susceptibility to jailbreaks14 further highlights the need to co-implement this method with relevant defences against adversarial attacks15.

Third, RAG may reduce risks by integrating information retrieval from additional trusted knowledge sources, grounding responses in validated sources. However, it trades versatility for specialization. It is highly effective within its source content domain, yet limited in broader contexts. Potential limitations of RAG include omissions where RAG materials overpower the local context of a query, leading to an incorrect response.

Fourth, the latest developments in AI have introduced software workflows and practical implementations that enable LLM-based agents to perform digital functions. Reinforcement learning with human feedback (RLHF) has emerged as a powerful tool that can both learn from other models (a process called distillation) and use human feedback (including aspects of goal and safety alignment) to catalyse a process of open-ended generation tasks16. However, limitations of single-agent evaluations include challenges with specialized domains and risk of biases including self-reference bias whereby LLMs have a positive bias towards models types similar to their own17. Agent-agent moderation can help address these limitations using MAS architectures. These systems may be further augmented through RAG integration across multiple checkpoints18, and by incorporating neuro-symbolic models that reason deterministically from validated guidelines for improved reliability19. These approaches can help ensure CDSS outputs align with their intended use.

The Need for a New Regulatory Paradigm for UNDCS That Is Not Label-Driven

Given the enormous advances in GenAI techniques, another consideration is whether existing regulatory frameworks are even suitable for novel UNDCS. Current regulations are label-driven, with device classification based on manufacturer-designated intended use. These frameworks were effective for traditional medical devices whereby distribution was confined either to purpose-driven applications or licensed providers, themselves subject to certification requirements and ongoing quality audits. They were used effectively for earlier DCS and CCS SaMDs that had an AI core “wrapped” in a customized application layer bearing the appropriate labels, distributed by a regulated manufacturer.

However, today’s popular LLMs (e.g., ChatGPT, Grok, Claude etc.) are developed by technology providers that control the entire AI supply chain from base model to consumer-facing interface, and may not always detail their training sources. These direct-to-consumer models are not addressed by regulations tied to labelling and manufacturer registration. This regulatory void lacks protections for end users whereby LLM manufacturers have used general-purpose disclaimers while scaling distribution to a broad user base. For example, blanket statements prohibiting use of LLMs for clinical purposes20 are unlikely to deter real-world use.

These LLMs are now widely accessible and lack the consumer protections afforded by traditional SaMD distribution pathways that ensure appropriate user selection (e.g., based on health and technology literacy), appropriate right-sighting of care (particularly for clinical emergencies) and adverse event monitoring. Weissman et al.1 have demonstrated that in high-risk situations, LLMs may provide seemingly credible but inappropriate device-like recommendations based on incomplete clinical information, potentially leaving end-users at risk of serious medical harm.

With GenAI techniques advancing in sophistication, static regulations focused on present-day norms risk rapidly becoming outdated21. Recent work has demonstrated approaches to evaluate and deploy non-clinical LLM applications such as AI scribes22. However, even non-clinical, administrative tools applied in healthcare settings may have unforeseen clinical consequences, such as hallucinations causing flawed documentation or diagnosis labelling that compound errors in downstream clinical decisions, impact medical claims and potentially increase patients’ insurance premiums. Thus, future regulations for all UNDCS may need to set acceptable standards for risk mitigation. Particularly for direct-to-consumer UNDCSs, built-in safeguards may be needed to demonstrably restrict outputs to non-medical device use cases unless formally evaluated in clinical trials and deployed with ongoing quality controls.

In conclusion, while applications of UNDCS such as LLMs in healthcare could yield tremendous clinical benefits, appropriate safeguards are still required for consumer protections and patient safety. As UNDCS blurs the boundaries between intended uses and users, regulators have a challenging responsibility to adopt forward-looking frameworks as agile as the technologies they govern without stifling advances in healthcare transformation. Therefore, a new regulatory paradigm may be needed to encourage the safe use of UNDCS in healthcare, provide consumer protections for the public, and ensure manufacturers are accountable for the software solutions they monetise.