Objective

Artificial Intelligence (AI) is a field of science and technology that enables computers and machines to perform tasks that typically require human intelligence. These tasks include learning, reasoning, problem-solving, language comprehension, visual perception, and decision-making1.

Artificial intelligence is utilized in various fields of medicine, including diagnosis, treatment, and patient monitoring. Image processing technologies facilitate the rapid and accurate diagnosis of diseases in radiology and pathology, while machine learning algorithms assist in the development of personalized treatment plans. Additionally, AI-powered robots enhance precision in surgery, and natural language processing systems contribute to time efficiency by analyzing electronic health records. Furthermore, AI-driven clinical decision support systems provide guidance to physicians, improving the effectiveness and accuracy of healthcare services2.

In psychiatry, artificial intelligence is utilized for mood analysis, therapeutic support, and mental health assessments, facilitating the understanding of patients’ conditions and the determination of appropriate treatment methods. AI-based chatbots and applications enable individuals to receive psychological support while enhancing opportunities for early diagnosis and intervention3,4.

In this context, the primary aim of this study is to comprehensively examine the performance of ChatGPT, an AI-based language model, in evaluating and interpreting psychiatric cases. Additionally, by comparing this performance with the clinical decision-making processes of psychiatry specialists, the study seeks to reveal ChatGPT’s potential and limitations in the medical field. The study aims to analyze the role of artificial intelligence in the diagnosis and treatment management of psychiatric disorders in parallel with the expertise of clinicians.

Methods

Ethical considerations

Ethics committee approval for the study was obtained from the Adnan Menderes University Non-Interventional Clinical Research Ethics Committee (protocol number 2025/104). All methods were performed in accordance with the relevant guidelines and regulations. Names of human participants, identifying information, and images that could lead to the identification of a participant were not used in the database or in any section of the article (text/figures/tables/images/database). Written informed consent was obtained from all participants after they had been given detailed information about the purpose, methods, and possible risks and benefits of the study. Participants were clearly informed that they had the right to withdraw from the study at any time and that the confidentiality of their data would be protected. The consent process was carried out on a voluntary basis, and each participant signed the consent form of their own free will.

Study design and settings

In this study, ChatGPT-4 was used to initiate conversations and collect responses to clinical cases in the field of psychiatry. Responses to 120 clinical cases were collected and then evaluated by psychiatrists. ChatGPT's responses were assessed in three main categories (evaluation, diagnosis, and treatment planning) and their respective subcategories.
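The study does not specify the exact workflow used to submit the cases; purely as an illustrative sketch, the snippet below shows one way case vignettes could be sent to a GPT-4-class model through the OpenAI Python client and the raw responses stored for later rating. The model identifier, prompt wording, and file names are assumptions made for illustration, not details of the study protocol.

# Hypothetical sketch: sending psychiatric case vignettes to a GPT-4-class model
# and saving the raw responses for later rating by the psychiatrists.
# Model name, prompt wording, and file names are illustrative assumptions.
import csv
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

PROMPT_TEMPLATE = (
    "You are evaluating a psychiatric case. For the vignette below, describe "
    "(1) your assessment, (2) the most likely diagnosis with differentials, "
    "and (3) a treatment plan.\n\nCase: {vignette}"
)

def query_case(vignette: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(vignette=vignette)}],
    )
    return response.choices[0].message.content

with open("cases.csv", newline="", encoding="utf-8") as f_in, \
     open("chatgpt_responses.csv", "w", newline="", encoding="utf-8") as f_out:
    reader = csv.DictReader(f_in)            # expects a 'vignette' column
    writer = csv.writer(f_out)
    writer.writerow(["case_id", "response"])
    for case_id, row in enumerate(reader, start=1):
        writer.writerow([case_id, query_case(row["vignette"])])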

In evaluating ChatGPT's clinical management, the clinical management applied by two psychiatrists was accepted as the gold-standard reference. ChatGPT's and the physicians' responses to the clinical scenarios were compared for content similarity within a pre-structured question algorithm (Table 1). Each category in which ChatGPT's response resembled the physician's clinical management was classified as positive, i.e., similar. In analyzing these similarities, statistical metrics such as sensitivity and specificity were used to objectively measure clinical accuracy and performance. In this way, the accuracy of ChatGPT's recommendations and their alignment with the physicians' management were systematically assessed.

Table 1. A Pre-Configured Question Algorithm.
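To make the comparison described above concrete, the following sketch computes sensitivity and specificity for a single subcategory from paired binary ratings (whether the physician addressed an element versus whether ChatGPT's response contained a similar element). The variable names and example ratings are invented for illustration and are not the study data.

# Hypothetical sketch: sensitivity and specificity of ChatGPT's management
# against the physician reference for one subcategory.
# 1 = the element was addressed, 0 = it was not. Example ratings are made up.
def sensitivity_specificity(physician, chatgpt):
    tp = sum(1 for p, c in zip(physician, chatgpt) if p == 1 and c == 1)
    fn = sum(1 for p, c in zip(physician, chatgpt) if p == 1 and c == 0)
    tn = sum(1 for p, c in zip(physician, chatgpt) if p == 0 and c == 0)
    fp = sum(1 for p, c in zip(physician, chatgpt) if p == 0 and c == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

physician_ratings = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # illustrative only
chatgpt_ratings   = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]   # illustrative only
sens, spec = sensitivity_specificity(physician_ratings, chatgpt_ratings)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")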

Statistical analyses

The research data were analyzed using the SPSS 27.0 statistical package. Descriptive statistics were presented as numbers (n) and percentages (%). The chi-square test was used to determine whether there were differences between categorical variables. A p value of less than 0.05 was considered statistically significant.
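The analyses were performed in SPSS; for readers who wish to reproduce the approach, an equivalent chi-square test of independence can be sketched in Python as below. The 2 x 2 counts are invented solely for illustration.

# Hypothetical sketch: chi-square test of independence between two categorical
# variables (e.g., reference psychiatrist vs. similarity classification),
# analogous to the SPSS procedure described above. Counts are invented.
from scipy.stats import chi2_contingency

table = [[92, 28],   # rater 1: similar / not similar (made-up counts)
         [85, 35]]   # rater 2: similar / not similar (made-up counts)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
print("statistically significant at p < 0.05" if p_value < 0.05 else "not statistically significant")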

Results

A total of 120 patients were included in the study. The distribution of diagnoses among the patients is summarized in Fig. 1. The most common diagnosis among the included patients was anxiety disorder, followed by mood disorders.

Fig. 1. Distribution of Primary Diagnoses Among Patients Included in the Study.

Table 2 presents the evaluation of ChatGPT's performance in the assessment category. ChatGPT's similarity to the physician's management was found to be 0.0% in the subcategory 'In-depth inquiry of complaints, signs, and symptoms,' 76.7% in 'Inquiry of the patient's past psychiatric or medical history,' and between 64.2% and 73.3% in 'Risk assessment.'

Table 2. Evaluation of ChatGPT’s Performance in the Assessment Category.

Table 3 evaluates ChatGPT's performance in the diagnosis category. The similarity to the physician's management was between 51.7% and 57.5% in the subcategory 'Psychometric tests, scales, or laboratory tests,' between 70.0% and 76.7% in 'Differential diagnosis,' and 39.3% in 'Diagnosis.'

Table 3. Evaluation of ChatGPT’s Performance in the Diagnosis Category.

Table 4 assesses ChatGPT's performance in the treatment category. The similarity to the physician's management was between 94.2% and 98.3% in 'Prescribed medications,' between 96.7% and 97.5% in 'Dosage information of prescribed medications,' between 13.3% and 15.8% in 'Potential side effects of prescribed medications,' between 57.5% and 58.3% in 'Patient and family education recommendations,' between 76.7% and 78.3% in 'Lifestyle modification recommendations,' between 73.3% and 76.7% in 'Psychotherapy recommendations,' and between 82.5% and 85.0% in 'Hospitalization recommendation.'

Table 4. Evaluation of ChatGPT’s Performance in the Treatment Category.

When similarity rates were evaluated, the highest similarity was observed in ‘Prescribed medications’ (98.3%), followed by ‘Dosage information of prescribed medications’ (97.5%), while the lowest similarity was found in ‘In-depth inquiry of complaints, signs, and symptoms’ (0.0%), followed by ‘Potential side effects of prescribed medications’ (13.3%).

Discussion

AI systems such as ChatGPT are designed not to replace physicians, but to support their work. Among the strengths of artificial intelligence are speed, accuracy, the ability to learn from current medical developments, the analysis of radiological images, and the performance of repetitive tasks. However, its weaknesses include the inability to understand the emotional states of patients and provide comfort, the potential to make incorrect decisions in complex situations, and uncertainties regarding responsibility for incorrect diagnoses or treatment recommendations5.

While artificial intelligence is beginning to play a significant role in medicine, it currently seems unlikely that it will fully replace physicians. AI may be highly effective in disease diagnosis and treatment, image analysis, drug discovery, and patient monitoring. For instance, in certain specialties such as radiology, AI systems can provide faster and more accurate results than human doctors. However, the human factor—particularly empathy, communication, ethical decision-making, and relationships with patients—still plays a critical role. It is expected that AI will support doctors, making their work more efficient, but the complete replacement of physicians depends not only on technology but also on complex issues such as ethics and human rights. Physicians demonstrate emotional intelligence, ethical values, and contextually appropriate decision-making in their relationships with patients, which are particularly challenging for AI6,7.

In our study, we observed that artificial intelligence's evaluations of past psychiatric illnesses or medical histories closely resemble the decisions made by physicians. Indeed, AI is quite successful in listing and organizing information from medical texts, particularly in the assessment of past psychiatric illnesses or medical histories. However, we believe that AI falls short in actively utilizing this information in case management and in formulating treatment plans tailored to individual patient needs, especially when compared to the clinical experience and human judgment of physicians. This suggests that the current algorithms of AI are unable to fully replicate the significance of the human factor and intuitive approach in case management. It should be noted, however, that our study was conducted using a free artificial intelligence application that is widely used worldwide. The literature indicates that the more advanced models found in paid versions have greater analytical and intuitive decision-making capabilities. Therefore, repeating similar studies using more powerful models could increase the scope and validity of the findings. In this context, the use of the free version in the study can be considered a limitation, as the results may be constrained by the capacity of the artificial intelligence model used.

Our study has shown that artificial intelligence is not sufficiently effective in risk assessment. In situations where risk assessment of an individual's health condition is critical (such as in cases of suicide, harm to others, or treatment refusal), AI may not be as effective as current approaches. This limitation is related to the inability to properly interpret the patient's unique history and environmental factors, as well as the challenge of understanding the data in its contextual framework. As a result, reliably identifying risks and detecting potential dangers in a timely manner becomes more difficult, particularly in complex cases, where clinical decision-making processes are adversely affected. Our study found that AI's risk assessment appeared to closely resemble the decisions made by physicians; however, AI actively performed risk questioning in only a small subset of the patients in whom the physician conducted a risk evaluation. This suggests that AI generally fails to engage in questioning even when the physician does, indicating its inadequacy in risk assessment.

According to the results of our study, while ChatGPT proves effective in providing information on differential diagnoses, it remains limited in its ability to make definitive diagnoses through complex clinical decision-making. This limitation primarily stems from its difficulty in fully evaluating the patient’s unique clinical history, environmental factors, and dynamic variables. The inability to accurately synthesize contextual data and the complexity of situations that require patient-specific approaches constrain the AI’s capacity to make definitive diagnoses. Therefore, particularly in complex cases requiring multidisciplinary evaluation, AI generally plays a complementary role. However, the fact that only the free version of artificial intelligence was used in the study can be considered a significant limitation. Indeed, the literature indicates that paid versions of ChatGPT can produce more reliable outputs than free versions thanks to their more advanced model capacity, fast and priority access, ability to process long texts, and advanced file and web-based analysis capabilities. They are also noted for providing more consistent and comprehensive solutions, particularly when addressing complex, multi-step, or contextually rich cases.

Our study indicates that ChatGPT is generally successful in determining appropriate medications and dosages within the treatment category, but its capacity to provide warnings regarding drug side effects is relatively weak. While ChatGPT performs strongly in offering supportive treatment recommendations, it demonstrates significant limitations in more complex clinical processes, such as making decisions regarding patient admissions. These results suggest that AI-based systems can serve as helpful tools in medical practice, but they should be used with an awareness of their current limitations.

Conclusions

According to the results of our study, ChatGPT has been identified as a strong artificial intelligence model as a source of information, functioning particularly effectively with medical data and general health knowledge. Additionally, it can assist healthcare professionals in the diagnostic and treatment processes, helping them make evidence-based decisions. However, limitations have been observed in ChatGPT's communication with patients and in conducting in-depth inquiries. In terms of comprehensively assessing and resolving complex cases, AI has yet to reach the intellectual and emotional intelligence capacity possessed by human physicians. Notably, deficiencies have been observed in human-specific competencies such as empathy, understanding the patient's psychological and emotional state, and making appropriate decisions based on this understanding. Therefore, while ChatGPT can be a useful tool to support decision-making processes in healthcare, it may not be sufficient on its own in establishing patient relationships and handling more complex clinical decisions. Furthermore, only the free version of artificial intelligence was used in our study; it should therefore be noted that the more advanced models offered in paid versions may reflect such communicative and analytical competencies at a higher level. However, the fact that the free version is the most preferred worldwide and widely used in the literature adds further social value to the findings of our study.

In conclusion, while artificial intelligence holds great potential as an auxiliary tool in healthcare systems, the complete replacement of physicians is unlikely in the short term.