Objective

Artificial Intelligence (AI) is a field of science and technology that enables computers and machines to perform tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, language comprehension, visual perception, and decision-making1.

Artificial intelligence is utilized in various fields of medicine, including diagnosis, treatment, and patient monitoring. Image processing technologies facilitate the rapid and accurate diagnosis of diseases in radiology and pathology, while machine learning algorithms assist in the development of personalized treatment plans. Additionally, AI-powered robots enhance precision in surgery, and natural language processing systems contribute to time efficiency by analyzing electronic health records. Furthermore, AI-driven clinical decision support systems provide guidance to physicians, improving the effectiveness and accuracy of healthcare services2.

In psychiatry, artificial intelligence is utilized for mood analysis, therapeutic support, and mental health assessments, facilitating the understanding of patients’ conditions and the determination of appropriate treatment methods. AI-based chatbots and applications enable individuals to receive psychological support while enhancing opportunities for early diagnosis and intervention3,4.

In this context, the primary aim of this study is to comprehensively examine the performance of ChatGPT, an AI-based language model, in evaluating and interpreting psychiatric cases. Additionally, by comparing this performance with the clinical decision-making processes of psychiatry specialists, the study seeks to reveal ChatGPT’s potential and limitations in the medical field. The study aims to analyze the role of artificial intelligence in the diagnosis and treatment management of psychiatric disorders in parallel with the expertise of clinicians.

Methods

Ethical considerations

Ethics committee approval for the study was obtained from the Adnan Menderes University Non-Interventional Clinical Research Ethics Committee (protocol number 2025/104). All methods were performed in accordance with the relevant guidelines and regulations. Names of human participants, identifying information, and images that could lead to the identification of a participant were not used in the database or in any section of the article (text/figures/tables/images/database). Written informed consent was obtained from all participants after they had been given detailed information about the purpose, methods, and possible risks and benefits of the study. Participants were clearly informed that they had the right to withdraw from the study at any time and that the confidentiality of their data would be protected. The consent process was carried out on a voluntary basis, and each participant signed the consent form of their own free will.

Study design and settings

In this study, ChatGPT-4 was used to initiate conversations and collect responses to clinical cases in the field of psychiatry. Responses to 120 clinical cases were collected and then evaluated by psychiatrists. ChatGPT's responses were assessed in three main categories (evaluation, diagnosis, and treatment planning) and their respective subcategories.
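The study does not specify the exact workflow used to submit the cases; purely as an illustrative sketch, the snippet below shows one way case vignettes could be sent to a GPT-4-class model through the OpenAI Python client and the raw responses stored for later rating. The model identifier, prompt wording, and file names are assumptions made for illustration, not details of the study protocol.

# Hypothetical sketch: sending psychiatric case vignettes to a GPT-4-class model
# and saving the raw responses for later rating by the psychiatrists.
# Model name, prompt wording, and file names are illustrative assumptions.
import csv
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT_TEMPLATE = (
    "You are evaluating a psychiatric case. For the vignette below, describe "
    "(1) your assessment, (2) the most likely diagnosis with differentials, "
    "and (3) a treatment plan.\n\nCase: {vignette}"
)

def query_case(vignette: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(vignette=vignette)}],
    )
    return response.choices[0].message.content

with open("cases.csv", newline="", encoding="utf-8") as f_in, \
     open("chatgpt_responses.csv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)            # expects a 'vignette' column
    writer = csv.writer(f_out)
    writer.writerow(["case_id", "response"])
    for case_id, row in enumerate(reader, start=1):
        writer.writerow([case_id, query_case(row["vignette"])])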

In evaluating ChatGPT's clinical management, the clinical management applied by two psychiatrists was accepted as the gold-standard reference. ChatGPT's and the physicians' responses to the clinical scenarios were compared for content similarity within a pre-structured question algorithm (Table 1). Each category in which ChatGPT's response resembled the physician's clinical management was classified as positive, i.e., similar. In analyzing these similarities, statistical metrics such as sensitivity and specificity were used to objectively measure clinical accuracy and performance. In this way, the accuracy of ChatGPT's recommendations and their alignment with the physicians' management were systematically assessed.

Table 1. A Pre-Configured Question Algorithm.
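To make the comparison described above concrete, the following sketch computes sensitivity and specificity for a single subcategory from paired binary ratings (whether the physician addressed an element versus whether ChatGPT's response contained a similar element). The variable names and example ratings are invented for illustration and are not the study data.

# Hypothetical sketch: sensitivity and specificity of ChatGPT's management
# against the physician reference for one subcategory.
# 1 = the element was addressed, 0 = it was not. Example ratings are made up.
def sensitivity_specificity(physician, chatgpt):
    tp = sum(1 for p, c in zip(physician, chatgpt) if p == 1 and c == 1)
    fn = sum(1 for p, c in zip(physician, chatgpt) if p == 1 and c == 0)
    tn = sum(1 for p, c in zip(physician, chatgpt) if p == 0 and c == 0)
    fp = sum(1 for p, c in zip(physician, chatgpt) if p == 0 and c == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

physician_ratings = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # illustrative only
chatgpt_ratings   = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]   # illustrative only
sens, spec = sensitivity_specificity(physician_ratings, chatgpt_ratings)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")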

Statistical analyses

The research data were analyzed using the SPSS 27.0 statistical package. Descriptive statistics were presented as numbers (n) and percentages (%). The chi-square test was used to determine whether there were differences between categorical variables. A p value of less than 0.05 was considered statistically significant.
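The analyses were performed in SPSS; for readers who wish to reproduce the approach, an equivalent chi-square test of independence can be sketched in Python as below. The 2 x 2 counts are invented solely for illustration.

# Hypothetical sketch: chi-square test of independence between two categorical
# variables (e.g., reference psychiatrist vs. similarity classification),
# analogous to the SPSS procedure described above. Counts are invented.
from scipy.stats import chi2_contingency

table = [[92, 28],   # rater 1: similar / not similar (made-up counts)
         [85, 35]]   # rater 2: similar / not similar (made-up counts)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("statistically significant at p < 0.05" if p_value < 0.05 else "not statistically significant")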

Results

A total of 120 patients were included in the study. The distribution of diagnoses among the patients is summarized in Fig. 1. The most common diagnosis among the included patients was anxiety disorder, followed by mood disorders.

Fig. 1. Distribution of Primary Diagnoses Among Patients Included in the Study.

Table 2 presents the evaluation of ChatGPT's performance in the assessment category. ChatGPT's similarity to the physician's management was found to be 0.0% in the subcategory 'In-depth inquiry of complaints, signs, and symptoms,' 76.7% in 'Inquiry of the patient's past psychiatric or medical history,' and between 64.2% and 73.3% in 'Risk assessment.'

Table 2. Evaluation of ChatGPT’s Performance in the Assessment Category.

Table 3 evaluates ChatGPT's performance in the diagnosis category. The similarity to the physician's management was between 51.7% and 57.5% in the subcategory 'Psychometric tests, scales, or laboratory tests,' between 70.0% and 76.7% in 'Differential diagnosis,' and 39.3% in 'Diagnosis.'

Table 3. Evaluation of ChatGPT’s Performance in the Diagnosis Category.

Table 4 assesses ChatGPT's performance in the treatment category. The similarity to the physician's management was between 94.2% and 98.3% in 'Prescribed medications,' between 96.7% and 97.5% in 'Dosage information of prescribed medications,' between 13.3% and 15.8% in 'Potential side effects of prescribed medications,' between 57.5% and 58.3% in 'Patient and family education recommendations,' between 76.7% and 78.3% in 'Lifestyle modification recommendations,' between 73.3% and 76.7% in 'Psychotherapy recommendations,' and between 82.5% and 85.0% in 'Hospitalization recommendation.'

Table 4. Evaluation of ChatGPT’s Performance in the Treatment Category.

When similarity rates were evaluated, the highest similarity was observed in ‘Prescribed medications’ (98.3%), followed by ‘Dosage information of prescribed medications’ (97.5%), while the lowest similarity was found in ‘In-depth inquiry of complaints, signs, and symptoms’ (0.0%), followed by ‘Potential side effects of prescribed medications’ (13.3%).

Discussion

AI systems such as ChatGPT are designed not to replace physicians, but to support their work. Among the strengths of artificial intelligence are speed, accuracy, the ability to learn from current medical developments, the analysis of radiological images, and the performance of repetitive tasks. However, its weaknesses include the inability to understand the emotional states of patients and provide comfort, the potential to make incorrect decisions in complex situations, and uncertainties regarding responsibility for incorrect diagnoses or treatment recommendations5.

While artificial intelligence is beginning to play a significant role in medicine, it currently seems unlikely that it will fully replace physicians. AI may be highly effective in disease diagnosis and treatment, image analysis, drug discovery, and patient monitoring. For instance, in certain specialties such as radiology, AI systems can provide faster and more accurate results than human doctors. However, the human factor—particularly empathy, communication, ethical decision-making, and relationships with patients—still plays a critical role. It is expected that AI will support doctors, making their work more efficient, but the complete replacement of physicians depends not only on technology but also on complex issues such as ethics and human rights. Physicians demonstrate emotional intelligence, ethical values, and contextually appropriate decision-making in their relationships with patients, which are particularly challenging for AI6,7.

In our study, we observed that artificial intelligence's evaluations of past psychiatric illnesses or medical histories closely resemble the decisions made by physicians. Indeed, AI is quite successful in listing and organizing information from medical texts, particularly in the assessment of past psychiatric illnesses or medical histories. However, we believe that AI falls short in actively utilizing this information in case management and in formulating treatment plans tailored to individual patient needs, especially when compared to the clinical experience and human judgment of physicians. This suggests that the current algorithms of AI are unable to fully replicate the significance of the human factor and intuitive approach in case management. It should be noted, however, that our study was conducted using a free artificial intelligence application that is widely used worldwide. The literature indicates that the more advanced models found in paid versions have greater analytical and intuitive decision-making capabilities. Therefore, repeating similar studies using more powerful models could increase the scope and validity of the findings. In this context, the use of the free version in the study can be considered a limitation, as the results may be constrained by the capacity of the artificial intelligence model used.

Our study has shown that artificial intelligence is not sufficiently effective in risk assessment. In situations where risk assessment of an individual's health condition is critical (such as in cases of suicide, harm to others, or treatment refusal), AI may not be as effective as current approaches. This limitation is related to the inability to properly interpret the patient's unique history and environmental factors, as well as the challenge of understanding the data in its contextual framework. As a result, reliably identifying risks and detecting potential dangers in a timely manner becomes more difficult, particularly in complex cases, where clinical decision-making processes are adversely affected. Our study found that AI's risk assessment appeared to closely resemble the decisions made by physicians; however, AI actively performed risk questioning in only a small subset of the patients in whom the physician conducted a risk evaluation. This suggests that AI generally fails to engage in questioning even when the physician does, indicating its inadequacy in risk assessment.

According to the results of our study, while ChatGPT proves effective in providing information on differential diagnoses, it remains limited in its ability to make definitive diagnoses through complex clinical decision-making. This limitation primarily stems from its difficulty in fully evaluating the patient’s unique clinical history, environmental factors, and dynamic variables. The inability to accurately synthesize contextual data and the complexity of situations that require patient-specific approaches constrain the AI’s capacity to make definitive diagnoses. Therefore, particularly in complex cases requiring multidisciplinary evaluation, AI generally plays a complementary role. However, the fact that only the free version of artificial intelligence was used in the study can be considered a significant limitation. Indeed, the literature indicates that paid versions of ChatGPT can produce more reliable outputs than free versions thanks to their more advanced model capacity, fast and priority access, ability to process long texts, and advanced file and web-based analysis capabilities. They are also noted for providing more consistent and comprehensive solutions, particularly when addressing complex, multi-step, or contextually rich cases.

Our study indicates that ChatGPT is generally successful in determining appropriate medications and dosages within the treatment category, but its capacity to provide warnings regarding drug side effects is relatively weak. While ChatGPT performs strongly in offering supportive treatment recommendations, it demonstrates significant limitations in more complex clinical processes, such as making decisions regarding patient admissions. These results suggest that AI-based systems can serve as helpful tools in medical practice, but they should be used with an awareness of their current limitations.

Conclusions

According to the results of our study, ChatGPT has been identified as a strong artificial intelligence model as a source of information, functioning particularly effectively with medical data and general health knowledge. Additionally, it can assist healthcare professionals in the diagnostic and treatment processes, helping them make evidence-based decisions. However, limitations have been observed in ChatGPT's communication with patients and in conducting in-depth inquiries. In terms of comprehensively assessing and resolving complex cases, AI has yet to reach the intellectual and emotional intelligence capacity possessed by human physicians. Notably, deficiencies have been observed in human-specific competencies such as empathy, understanding the patient's psychological and emotional state, and making appropriate decisions based on this understanding. Therefore, while ChatGPT can be a useful tool to support decision-making processes in healthcare, it may not be sufficient on its own in establishing patient relationships and handling more complex clinical decisions. Furthermore, only the free version of artificial intelligence was used in our study; it should therefore be noted that the more advanced models offered in paid versions may reflect such communicative and analytical competencies at a higher level. However, the fact that the free version is the most preferred worldwide and widely used in the literature adds further social value to the findings of our study.

In conclusion, while artificial intelligence holds great potential as an auxiliary tool in healthcare systems, the complete replacement of physicians is unlikely in the short term.