Large language models (LLMs) and generative artificial intelligence (genAI) have seen a surge in interest in both research and adoption, following the release of OpenAI’s ChatGPT in November 2022. The possible applications of genAI are vast, with one important field of interest being healthcare. Medical use cases can range from clinical decision support to personal health chatbots. In the mental health area, chatbots for cognitive behavioural therapy are being actively explored. The general interest in the area is being reflected in progressing institutional adoption of LLM-based tools, such as Chinese hospitals having rapidly adopted the LLM DeepSeek1.

Dangers and real-world cases of harm

Shortly after the release of ChatGPT 3.5, reports emerged describing how it responded to mental health and other medical questions, offering personalised information on diagnosis, monitoring and treatment of symptoms and diseases. These interactions occur without regulatory approval or oversight as a medical device2.

A more recent innovation, which has emerged as an inevitable extension of LLM chatbots, is that layperson users have gained access to tooling that allows for the creation of individual chatbots3. One of these, now removed, chatbots was created by a single individual, had over 47.3 million uses in July 2025 before its removal and interacted with patients explicitly claiming to be a therapist stating ‘[…] I am a Licensed Clinical Professional Counselor (LCPC). I am a Nationally Certified Counselor (NCC) and is trained to provide EMDR treatment in addition to Cognitive Behavioural (CBT) therapies. So what did you want to discuss?’4, while other Character.ai bots also claim to be psychologists, with user feedback praising how helpful these bots have been in giving advice for their mental well-being5. The validating tone of AI will be recognisable to anyone with lived psychotherapy experience6. Unsurprisingly, people appear to gravitate towards use of ChatGPT and its ilk for mental health counselling. This appears rational in light of the restricted access to effective talking therapy, one of the major bottlenecks in modern psychiatry, with months-long waiting lists even in rich western countries7. In low- and middle-income countries, an AI might even be the only possible access point to therapy8. However, none of these self-proclaiming psychologist bots have any medical training and neither certification as such nor as a medical device.

There has been much discussion of the potential harms of LLMs in mental health9,10. Unsurprisingly, alongside widespread use of LLM chatbots came the first reports of actual and serious harms including deaths. These reports are in the form of court cases taken by families after the suicide of vulnerable relatives, thus far predominantly minors, who committed suicide after engaging with LLM chatbots about mental health problems. Interestingly, these real cases coincide in their presentation with simulated cases described by early entrepreneurial investigators of GPT, prior to its widespread public availability (Table 1).

Table 1 Reports of harm from the use of unapproved LLM-enabled tools in mental health and a classification of their level of evidence

Is a LLM a medical device?

Since ChatGPT’s release, the regulatory approval of LLM-based and -enhanced applications under current regulations remains a matter of active debate11.

But what makes a device a medical device? Under current European and US medical device regulations it is often required that software providing AI-enabled personalised information to patients that serves the medical purpose of disease diagnosis, monitoring, prediction, prognosis, treatment or alleviation meet design and evidence requirements and that user safety is demonstrated and monitored12,13,14. The principal criterion that is used by regulators to decide if a LLM is regulated as a medical device is whether the ‘manufacturer’ that made it available on the market intended for it to be used for a medical purpose. Here, the developers’ description of the product in accompanying claims, labels or product information are critical. An explanation of these terms is provided in Table 2.

Table 2 A brief overview of terms relevant for the definition of medical devices according to MDR

So, does a LLM responding to medical questions constitute a medical device?

In the case of OpenAI’s ChatGPT, although there are documented cases of use by members of the public for mental health purposes15 and although there is evidence of harm from such use16, this does not bring the LLM under the remit of regulation as a medical device. Indeed, they state in their terms of use that ‘You must not use any Output relating to a person for any purpose that could have a legal or material impact on that person, such as […] medical, or other important decisions about them.’17. However, when a user asks a personalised mental health question, in the manner of consulting a therapist, they get a relatively personalised answer with minimal disclaimer at the time of use. This very behaviour was the reason for Vorberg and partner to argue that LLMs and ChatGPT specifically should be classified as medical devices under the MDR11. In their view, while the broad spectrum of possible applications makes ChatGPT a general-purpose device, its behaviour in a medical context is the main point in question. Given that it provides information on diagnosis, monitoring, treatment and prevention of medical issues, it should be considered a medical device, especially as it does not refuse to answer when asked such questions. The response from the regulator to this argument was that ChatGPT is not a medical device, as it’s ‘offered by the manufacturer as a multifunctional and interactive language model. It is not intended by the manufacturer to be used as a medical device as defined by the MDR.’18

In contrast to OpenAI, Anthropic provides additional information about their Claude LLMs. In their release notes, they detail their system prompts (initial instructions guiding a model’s behaviour as shown in Fig. 1) providing users with additional product information. A part of the system prompt of Claude Sonnet 4 is ‘Claude provides emotional support alongside accurate medical or psychological information or terminology where relevant.’19—clearly stating that Claude should answer medical and mental health questions and be accurate while doing so. It is critical to note that the clear intended and resultant effect of this system prompt, when combined with the individual user input, is for the model to try to provide personalised and conversational therapeutic support to people when they prompt with personal mental health issues, and in so doing, to use the language of a professional therapist, and to interpret and support the individual on the basis of psychological information they, and the training data, have provided to the model.

Fig. 1: The effect of system prompts on the output of LLMs.
figure 1

A brief overview of how user prompt and a system prompt influence the LLM’s processing and output.

Anthropic’s transparency in publishing the system prompt should be respected, but it shows clear ‘manufacturer’ intent for the model to be used in medical contexts, such as a mental health setting. The ‘defence’ that the LLM is not regulated as a medical device thus falls apart—chatbots running on the Claude Sonnet 4 model, alongside any other Claude models that use this system prompt, are therefore medical devices under the MDR, as their developers, with intent, have instructed them to be so. After receiving this system prompt chatbots running on Claude Sonnet 4 can exercise no other intent, than to behave as therapists (Fig. 2).

Fig. 2: If it walks like a duck and walks like a duck, it is a duck.
figure 2

LLMs broadly do what they are asked to do and regulation needs to consider the reality that the purpose of a system is what it does.

Should all LLM uses in mental health require approval?

Unsurprisingly the formal regulatory approval of LLM-enabled medical decision support and support bots come behind the first wave of excitement about these tools. The first LLM-enabled medical decision support system approved in the EU, covering multiple medical disciplines, including mental health, was Professor ValMed20,21, approved with an EU Class IIb CE-mark. The first low autonomy LLM-enhanced application specifically approved in Europe was Limbic22,23, approved with a UK Class IIa UKCA mark.

Should all LLMs that interact with users on their mental health have regulatory approval? The increasing sophistication, the underlying functioning and ever-broadening capabilities of LLMs show the fundamental weakness of the current Intended Purpose-focused regulation of medical devices. The approach of some LLM ‘manufactures’ has been to hide information about their models, including system prompts, as this information would reveal the clear intent in prompting to deliver medical purposes.

Incentivizing LLM providers to remove system prompts is likely to be detrimental to patients’ health—it would just decrease the accuracy of medical answers and quality of emotional support, possibly in crisis scenarios, without changing use patterns. Nevertheless, the system prompt reflects awareness on the side of Anthropic that their Claude models would be used as a medical device. It is extremely unlikely that the public will stop using LLMs altogether. It’s equally unlikely that patients will stop asking generally accessible LLMs for interactive personalised psychotherapeutic advice.

We argue that regulation needs to catch up with the reality of LLM deployment and use and apply the principle of ‘POSIWID’—the ‘purpose of a system is what it does’24. Regulation needs to be adapted and enforced in a manner where it is much clearer that the ‘manufacturer’ has a level of responsibility towards all medical use of these tools. Regulatory frameworks need to be modified so that LLMs that actually deliver mental health therapist behaviour are considered medical devices? The test should be whether there is widespread and/or dangerous use of an LLM for medical purposes—removing the incentivisation for ‘manufacturers’ to pretend that their systems do not do this. If regulation is not updated to take account of broad medical use in practice, it will increasingly become irrelevant, unenforceable and ignored.

But how can regulation of general LLMs be practically achieved? In our view, regulation needs to adopt a more flexible and adaptive approach, in a hierarchy depending on manufacturers claims for systems, and pragmatic to their level of risk. It should not, however, miss off the most important rung of the ladder - the systems that every individual in society has ready access to and are most likely to turn to at the point of need. Regulation needs to pragmatically acknowledge that LLMs are broad scope systems25, that can and do provide utility across a vast area. Some regulatory approaches have already been proposed for AI agents. These proposals include the use of ‘enforcement discretion’, where the regulatory body acknowledges a device as a medical device, but selectively chooses not to enforce certain requirements, a method used in the US22. Other approaches include ‘voluntary alternative pathways’, which allow manufacturers to opt into a regulatory track tailored to the unique characteristics of genAI-enabled applications22. Regulators retain the ability to move the device to the standard pathway in cases of misconduct or performance concerns22.

Medical functionality cannot be simply delineated from non-medical functionality in layperson facing LLM chatbots. As in the non-virtual world, where we seek advice on our anxieties from friends, family members and even professionals such as fitness instructors or hairdressers, not every virtual world mental health interaction is a formal medical therapy session. Rational approaches and criteria are required to describe what types of these interactions are ‘regulated’ medical device interactions, and what type are. We suggest actionable criteria for layperson facing chatbots, based on our own experience and literature sources9,10,26,27, and describe how these could be measured and policed in the real world (Table 3), as regulation without enforcement of limited value21,28. For example, all LLMs should be treated as medical devices if they impersonate mental health therapists when asked to do so by users. Only approved medical devices should be allowed to do this, and their approval must ensure that they do this in a reasonable and safe manner, not providing advice beyond their competence. This effectiveness and application of these actionable criteria could be ensured through the provision of simple open access tools to test chatbots with prompts (curated human generated29 or automated LLM-generated prompts), allowing all stakeholders to test systems for safety on an ongoing basis, to ensure they have adequate guardrailing of their functionality Although such tools are will not be perfect, and may initially challenge tools with too few scenarios, they are likely to be better than no criteria or assessment of on-market unapproved chatbots.

Table 3 Actionable criteria for lay person-facing LLM-enabled tools in mental health

Without applying the guardrails we suggest to LLM-enabled mental health therapy chatbots, substantial harms will unfortunately continue, and these will not only affect adolescents but also the many vulnerable adults with undiagnosed or incompletely addressed mental health problems, and it is likely that we are only seeing the tip of the iceberg of cases. Of course, mental health therapy through LLM-enabled approaches also has great promise. Here, governments have the responsibility to make safe and approved tools, which already exist, available to more of their citizens. Manufacturers of these systems, international aid organisations and world health bodies should take measures to make these tools affordable and accessible to the large market and populations at need in lower- and middle-income countries, and the same bodies have a responsibility to ensure that the dangerous LLM chatbots, often provided by high-income country BigTech, are appropriately challenged It is not a feasible public health approach to ignore mental health therapy through chatbots—instead minimal standards should be enforced on all systems providing this functionality—better a safe system than a useless misleading disclaimer. The current system of regulating only those chatbots that make explicit medical claims is without merit and dangerous to children and the vulnerable. It will need to be revised, and it is inevitable that it will eventually be changed—hopefully legislators have the sense to act before many more deaths under the circumstances described in Table 1.