Introduction

In clinical microbiology, accurately verifying reports generated by automated microbial identification and antibiotic susceptibility testing systems (automated ID/AST systems) is a critical step1,2. These reports provide a detailed description of the pathogen, its phenotype, quantity, and antibiotic resistance3,4. As microbial automation advances, the need for quick, easy, and accurate interpretation of automated ID/AST system reports becomes increasingly crucial.

Although automated ID/AST system reports save time and assist clinical laboratory staff, they carry potential drawbacks. Unless a laboratory has extensive experience in drug susceptibility reporting and interpretation, there is a risk that interpretation errors, incorrect antibiotic susceptibility breakpoint judgments, and other erroneous information will be sent directly to the clinician or patient5. The field of clinical microbiology is constantly evolving; for example, the Clinical & Laboratory Standards Institute (CLSI) updates its guidelines annually, incorporating new susceptibility breakpoints, resistance phenotype interpretations, and bacterial reclassifications. Furthermore, as new antibiotic treatments are developed, new resistant strains continue to emerge. Issuing automated ID/AST system reports is time-consuming, resource-intensive, and requires expert knowledge and ongoing education for clinical microbiology staff, which presents a substantial challenge for less experienced clinical microbiologists (CM). Hence, methods for improving the quality of clinical automated ID/AST system reports need to be explored.

Large language models (LLMs) are transforming various aspects of life, and the chatbot "ChatGPT" has received significant attention and praise6. ChatGPT uses a natural language processing model based on the transformer architecture to generate human-like responses covering a wide range of topics and inquiries7,8. ChatGPT performs particularly well in the medical and healthcare fields, having been trained on a massive dataset and developed with approximately 175 billion parameters9,10. There is evidence that ChatGPT can assist with clinical diagnosis11. By skillfully leveraging the complex language patterns in its training data, the LLM produces tailored and insightful responses drawn from a rich knowledge base. CM, infectious diseases experts, and nurses can use chatbots to make diagnostic decisions regarding tests and to improve interactions with medical microbiology laboratories12. The application of ChatGPT in clinical microbiology has garnered significant attention for its potential to improve diagnostic processes13. However, although ChatGPT has been studied in fields such as neurosurgery14, its application to clinical microbiology reporting has not been investigated.

We evaluated whether ChatGPT, an LLM tool, could assist in issuing automated ID/AST system reports, compared against the recommendations of microbiology professionals. We supplied ChatGPT with a summary of the automated ID/AST system report and used prompt engineering to guide it in generating the verification report, training it with the CLSI guidelines (2024)15. To standardize the issuance process, we developed a workflow. We compared ChatGPT's responses with CM's recommendations in terms of accuracy, relevance, objectivity, completeness, and clarity16,17,18,19. The objective of this study is to assess the potential of ChatGPT to enhance the clinical microbiology workflow, specifically by assisting in the generation and review of automated ID/AST system reports. Our goal is to determine whether ChatGPT can offer practical assistance in improving the accuracy and standardization of clinical automated ID/AST system report generation.

Methods

ChatGPT training

To prepare the content for report generation, we provided ChatGPT with standardized summaries of automated ID/AST system reports derived from Vitek-2 output; the Vitek-2 system is widely used globally20,21 and represents a typical automated platform. All outputs were formatted according to the 2024 edition of the CLSI guidelines. In this study, ChatGPT-4, the most recent version of the model at the time of the research, was used for its accuracy and context-awareness. Reporting the model version ensures transparency, enabling future studies to replicate or compare results reliably. The same standardized Vitek-2 outputs (containing only organism names, antimicrobial susceptibility results, and alarm signal colors) were also presented to CM in the questionnaire, ensuring identical input content and format across groups. All three groups generated complete written reports for each case, ensuring directly comparable outputs.
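To make the input format concrete, the sketch below shows how such a standardized summary might be structured as data. The field names and example values are hypothetical illustrations, not the study's actual schema; the full case materials are provided in Supplementary Text 2.

```python
# Minimal sketch of a standardized Vitek-2 summary as supplied to ChatGPT
# and, in identical form, to the CM questionnaire. Field names and values
# are illustrative only; actual case materials are in Supplementary Text 2.
case_summary = {
    "organism": "Klebsiella pneumoniae",  # identification (ID) result
    "ast_results": {                      # antimicrobial susceptibility results
        "imipenem":    {"mic": ">=16", "interpretation": "R"},
        "ceftazidime": {"mic": "8",    "interpretation": "I"},
    },
    "alarm_signal": "red",                # Vitek-2 alert color
}
```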

Developing the standardized review protocol

Following the CLSI (2024) standards, we trained ChatGPT using the datasets provided in Supplementary Text 1. As a result of this training, a standardized review protocol was developed to ensure the accuracy and consistency of automated ID/AST system report reviews. The protocol is illustrated in Fig. 1, which outlines key steps such as alarm interpretation, phenotypic analysis, and verification of intrinsic drug resistance mechanisms. In the GPT_AT group, structured prompts were designed to follow these steps in sequence, ensuring consistent and reproducible outputs; by contrast, GPT_BT received the same standardized inputs without stepwise prompting. The complete list of prompts used for GPT_BT and GPT_AT is available in Supplementary Text 1 to ensure transparency and reproducibility. To clarify the methodological background: although clinical microbiology laboratories routinely employ multiple approaches, such as MALDI-TOF mass spectrometry for identification, selective media for screening, and molecular assays for resistance confirmation, both the CM group and ChatGPT were free to incorporate such complementary methodologies in their responses if they considered them relevant. This reflects real-world reasoning and ensured that the evaluation captured not only data interpretation but also professional judgment.
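As an illustration of how the GPT_AT stepwise prompting could be organized, the sketch below arranges the Fig. 1 steps into an ordered prompt sequence. The wording and the helper function are hypothetical; the verbatim prompts used in the study are those given in Supplementary Text 1.

```python
# Illustrative stepwise prompt sequence mirroring the Fig. 1 review protocol.
# Wording is a sketch; the verbatim GPT_AT prompts are in Supplementary Text 1.
REVIEW_STEPS = [
    "01 Interpret the automated ID/AST system alarm signal.",
    "02 Check MIC values and analyze the resistance phenotype.",
    "03 Confirm colony purity and that morphology matches the ID.",
    "04 Verify the ID, cross-checking with other methodologies.",
    "05 Validate the resistance mechanism by complementary methods.",
    "06 Consult relevant literature.",
    "07 Apply intrinsic resistance rules and product insert limitations.",
    "08 Issue the verified report.",
]

def build_prompts(case_summary: str) -> list[str]:
    """Prepend the standardized case summary to each protocol step in order."""
    return [f"Case:\n{case_summary}\n\nTask: {step}" for step in REVIEW_STEPS]
```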

Fig. 1

ChatGPT training for the automated ID/AST system report review protocol: correctly understanding automated ID/AST system alert information (01); checking the MIC value and performing phenotypic analysis (02); checking the original plate to confirm that the colony is pure and that the colony morphology matches the ID (03); verifying the accuracy of the ID, including cross-verification with different methodologies (04); validating resistance mechanisms through various methodologies (05); studying relevant papers (06); paying attention to intrinsic resistance adjustments and product insert limitations (07); and issuing an accurate report (08).

Experimental setup

Twenty hospitals in Fujian Province, China, participated in this study through their clinical microbiology laboratories; a total of 63 participants in the CM group completed the survey and provided responses. Eight representative clinical microbiology cases were collected from routine practice in the Department of Laboratory Medicine, Xiamen Chang Gung Hospital, Hua Qiao University, and represented common challenges encountered in daily laboratory work22,23,24,25. All cases were anonymized to protect patient privacy. The details of the eight clinical cases, including the questions, purposes, and correct solutions, are presented in Table 1, while the full case materials are provided in Supplementary Text 2. The GPT_BT and GPT_AT groups represented ChatGPT's performance before and after prompt training, respectively, as ChatGPT operates as a single LLM. Scores were calculated based on the accuracy, relevance, objectivity, completeness, and clarity of the collected questionnaire responses (Supplementary Table 1).

Table 1 Eight clinical cases used for evaluation.

Experimental grouping

We divided the participants into three groups: CM (63 participants), GPT_BT (one instance of ChatGPT), and GPT_AT (one instance of ChatGPT). The GPT_BT and GPT_AT groups represented different configurations of the same LLM (ChatGPT), reflecting its performance before and after prompt training. Clinical microbiologists in the CM group wrote automated ID/AST system reports empirically. Both the GPT_BT and GPT_AT groups used recommendations generated by ChatGPT, with the GPT_AT group additionally receiving our prompt training.

Evaluation

The evaluation was blinded: the experts were not informed of the group origin of each output. Outputs from all three groups (CM, GPT_BT, GPT_AT) were independently assessed by two senior clinical microbiology experts according to the CLSI (2024) guidelines, which served as the gold standard. Each case was scored on five dimensions (accuracy, relevance, objectivity, completeness, and clarity) using a four-level scale (0–3; Supplementary Table 1). Scores from the two experts were averaged; if their difference exceeded one point, the experts discussed the case and reached consensus before finalizing the score. This procedure ensured the reproducibility and reliability of the evaluation. For analysis, scores (including dimension-specific values, total scores, and case-level results) were summarized as mean ± standard deviation (SD). This protocol ensured that performance comparisons were based on standardized expert consensus rather than subjective impressions.
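A minimal sketch of this scoring rule, assuming integer expert scores on the 0–3 scale, is shown below; the function name and consensus flag are hypothetical conveniences used for illustration, not part of the study protocol.

```python
# Sketch of the two-expert scoring rule described above: scores are averaged,
# and a gap of more than one point triggers a consensus discussion (flagged
# here for re-review). Names are illustrative, not the study's actual code.
def combine_scores(expert_a: int, expert_b: int) -> tuple[float, bool]:
    """Return (score, needs_consensus) for one dimension of one case,
    using the four-level 0-3 scale (Supplementary Table 1)."""
    needs_consensus = abs(expert_a - expert_b) > 1
    return (expert_a + expert_b) / 2, needs_consensus

score, flag = combine_scores(3, 1)  # -> (2.0, True): discuss and re-score
```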

Ethics approval

The studies involving human participants were reviewed and approved by the Ethics Committee of Xiamen Chang Gung Hospital (approval number XMCGIRB2024018, approval date: April 26, 2024). Electronic informed consent was obtained from all participants prior to their participation in the study. The consent process was embedded at the beginning of the survey, where participants were informed that by proceeding, they agreed to participate and understood the research purpose. The study adhered to the principles outlined in the Declaration of Helsinki. All methods were carried out in accordance with relevant guidelines and regulations, including institutional, national, and international standards for research involving human participants.

Statistical analysis

To determine which groups differed significantly, a one-way analysis of variance was conducted in GraphPad Prism 10.0, followed by Tukey's post hoc test. The analysis also included descriptive statistics on participant characteristics, such as gender, years of automated ID/AST system usage, and whether the participant's institution is ISO15189 accredited. Significance levels were denoted as p < 0.05 (*), p < 0.01 (**), p < 0.001 (***), and p < 0.0001 (****).
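The analyses were run in GraphPad Prism 10.0; for readers without Prism, an equivalent open-source sketch of the same procedure is shown below, using placeholder scores rather than the study's data.

```python
# Open-source equivalent of the Prism analysis: one-way ANOVA followed by
# Tukey's post hoc test. The score arrays are placeholders, not study data.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

cm     = np.array([19.0, 18.5, 20.0, 19.5])  # placeholder total scores
gpt_bt = np.array([23.0, 24.5, 22.0, 25.0])
gpt_at = np.array([31.0, 32.5, 30.5, 32.0])

f_stat, p_value = f_oneway(cm, gpt_bt, gpt_at)        # omnibus test
scores = np.concatenate([cm, gpt_bt, gpt_at])
groups = ["CM"] * 4 + ["GPT_BT"] * 4 + ["GPT_AT"] * 4
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))  # pairwise comparisons
```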

Results

Establishing a standardized review protocol

As detailed in the Methods section, a standardized review protocol (Fig. 1) was developed following the training of ChatGPT. This protocol was applied to systematically review automated ID/AST system reports, ensuring consistency and alignment with the CLSI (2024) standards. The outcomes of this application are described below.

Basic information on the quality assessment participants

Table 2 shows the gender and age distribution of the 63 CM, who had an average of 8.6 years of experience using automated ID/AST systems.

Table 2 Characteristics of CM participants in the survey.

Comparison among the GPT_BT, GPT_AT, and CM groups

According to the evaluation results, there was no significant difference in the quality of responses from CM based on gender, experience with automated ID/AST systems, or ISO15189 certification (all p ≥ 0.05; Fig. 2). Overall, the quality of automated ID/AST system reports generated by ChatGPT (the GPT_BT and GPT_AT groups) was superior to that of the CM group, with significantly higher mean total scores (GPT_BT: 23.63 ± 7.69, GPT_AT: 31.63 ± 4.31, CM: 19.25 ± 3.97; GPT_BT vs. CM: p < 0.01, GPT_AT vs. CM: p < 0.0001, GPT_BT vs. GPT_AT: p < 0.001; Fig. 3F). Even without training, GPT_BT reports outperformed CM, particularly in relevance (p < 0.0001; Fig. 3B) and completeness (CM: 4.06 ± 0.48; p < 0.0001; Fig. 3D). Although the GPT_BT group scored higher than the CM group in accuracy (GPT_BT: 6.50 ± 2.62, CM: 5.95 ± 2.69; p ≥ 0.05; Fig. 3A), objectivity (GPT_BT: 3.13 ± 0.99, CM: 2.66 ± 0.75; p ≥ 0.05; Fig. 3C), and clarity (GPT_BT: 1.88 ± 0.83, CM: 1.52 ± 0.69; p ≥ 0.05; Fig. 3E), these differences were not statistically significant (Supplementary Table 2). After structured prompt engineering, GPT_AT showed substantial gains over CM across nearly all dimensions, including accuracy, relevance, objectivity, completeness, clarity, and total score (all p < 0.001; Fig. 3A–F; Supplementary Table 3). Compared with GPT_BT, GPT_AT responses scored significantly higher in relevance, objectivity, completeness, and clarity (all p < 0.05; Fig. 3B–E; Supplementary Table 4), but not significantly higher in accuracy (p ≥ 0.05; Fig. 3A; Supplementary Table 4).

Fig. 2

Correlation analysis of response quality within the clinical microbiologist (CM) group. There was no significant difference in response quality based on gender (A), years of automated ID/AST system usage (B), or whether the institution was ISO15189 certified (C). The abbreviation 'ns' means p ≥ 0.05, indicating no statistical significance.

Fig. 3

Bar charts for group comparisons: the clinical microbiologist (CM) group, the ChatGPT before training (GPT_BT) group, and the ChatGPT after training (GPT_AT) group. Group comparisons were made in terms of accuracy (A), relevance (B), objectivity (C), completeness (D), clarity (E), and total score (F). * indicates p < 0.05, ** indicates p < 0.01, *** indicates p < 0.001, **** indicates p < 0.0001, and 'ns' indicates no statistical significance (p ≥ 0.05).

Characteristics of response quality in different groups

The radar chart analysis shows that, without training, ChatGPT's response quality was better than that of the CM group in terms of accuracy, relevance, and completeness, while there was no significant difference in clarity or objectivity. Despite significant improvements in clarity, accuracy, and objectivity after training, ChatGPT's relevance and completeness did not improve significantly (Fig. 4).

Fig. 4

Radar analysis of response characteristics across the three groups. Blue represents the clinical microbiologist (CM) group, green represents the GPT_BT group, and red represents the GPT_AT group. The analysis examines changes across five dimensions: accuracy, clarity, relevance, objectivity, and completeness.

Impact of the training protocol on ChatGPT's response quality

To further highlight the differences between groups, we calculated the total score differences GPT_BT − CM and GPT_AT − CM, defined as the mean total score of ChatGPT before or after training minus the mean total score of the CM group (a positive value indicates that ChatGPT outperformed CM, whereas a negative value indicates that CM outperformed ChatGPT). After prompt engineering training, the quality of automated ID/AST system reports generated by ChatGPT improved significantly: the GPT_AT − CM differences exceeded the GPT_BT − CM differences in cases 1, 3, 4, 5, 6, 7, and 8 (all p < 0.0001; Fig. 5A,C–H), while there was no significant difference for case 2 (GPT_BT − CM: 8.16 ± 0.67, GPT_AT − CM: 8.16 ± 0.67; p ≥ 0.05; Fig. 5B). The detailed results are presented in Supplementary Table 5. Overall, the GPT_AT − CM values were consistently positive across the eight clinical cases, demonstrating that after prompt engineering training, ChatGPT generated significantly higher-quality automated ID/AST system reports than CM (case 1: 17.29 ± 0.49, case 2: 8.16 ± 0.67, case 3: 6.86 ± 0.28, case 4: 9.11 ± 0.48, case 5: 14.37 ± 0.54, case 6: 13.63 ± 0.47, case 7: 21.63 ± 0.48, case 8: 16.02 ± 0.48; see also Fig. 5). In contrast, the GPT_BT − CM values varied. In case 1 (−6.71 ± 0.49; Fig. 5A) and case 3 (−1.14 ± 0.28; Fig. 5C), CM outperformed GPT_BT, reflected by negative values. However, in case 2 (8.16 ± 0.67; Fig. 5B), case 4 (3.11 ± 0.48; Fig. 5D), case 5 (11.37 ± 0.54; Fig. 5E), case 6 (10.37 ± 0.47; Fig. 5F), case 7 (12.37 ± 0.48; Fig. 5G), and case 8 (5.37 ± 0.48; Fig. 5H), GPT_BT achieved higher scores than CM.
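For clarity, the ΔTotal score in Fig. 5 reduces to a simple difference of group means per case; the sketch below uses placeholder values, not the study's data.

```python
# Sketch of the per-case ΔTotal score in Fig. 5: ChatGPT's mean total score
# minus the CM group's mean total score. Positive values favor ChatGPT.
# The example numbers are placeholders, not the study's data.
import numpy as np

def delta_total(gpt_totals, cm_totals) -> float:
    """Mean GPT total score minus mean CM total score for one case."""
    return float(np.mean(gpt_totals) - np.mean(cm_totals))

print(delta_total([32.0, 31.0], [19.0, 20.0]))  # -> 12.0
```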

Fig. 5

Differences in case results before and after ChatGPT training. Case 1 (A), case 2 (B), case 3 (C), case 4 (D), case 5 (E), case 6 (F), case 7 (G), and case 8 (H). GPT_BT − CM represents the difference in total scores between the GPT_BT group and the CM group, GPT_AT − CM represents the difference in total scores between the GPT_AT group and the CM group, and ΔTotal score denotes these total score differences. **** indicates statistical significance (p < 0.0001), while 'ns' indicates no statistical significance (p ≥ 0.05).

Discussion

Key findings on ChatGPT's performance in automated ID/AST system report generation

This study evaluated the potential of ChatGPT to assist with the issuance of automated ID/AST system reports in clinical microbiology. By implementing a training methodology grounded in the latest CLSI (2024) guidelines and established clinical workflows, we aimed to enhance ChatGPT's capability to process complex datasets effectively. The findings indicate that utilizing ChatGPT can significantly improve report accuracy, leading to better-informed clinical decisions. In particular, the eight representative clinical cases used in this study (Table 1) covered scenarios that frequently challenge routine practice, including database limitations (case 3), carbapenemase gene detection (case 4), enzyme-mediated resistance mechanisms (case 7), and novel or unclassified resistance patterns (case 8). ChatGPT demonstrated an improved ability to provide structured and clinically relevant interpretations across these cases, underscoring its potential to support healthcare professionals in both standard and complex diagnostic contexts. This application highlights the transformative role that advanced language models can play in streamlining laboratory processes and supporting healthcare professionals in their diagnostic tasks.

Limitations of ISO15189 certification in clinical microbiology

The responses from CM were comparable to or marginally lower in quality than those of GPT_BT (Fig. 3). Through an analysis of gender, years of automated ID/AST system usage, and whether the hospital's department of laboratory medicine was ISO15189 certified, we found that the quality of clinical microbiologists' responses was not significantly correlated with these factors (Fig. 2). In Fujian, China, ISO15189 certification of hospitals does not appear to have substantially enhanced clinical microbiology standards, which may be attributable to intrinsic characteristics of the discipline. Clinical microbiology is a highly complex field that demands continuous learning and adaptation. The dynamic nature of microbial evolution, the diversity of pathogens, and the rapid advancement of diagnostic technologies require CM to constantly update their knowledge and skills26,27. These factors make it challenging to standardize procedures across institutions, even with ISO15189 certification28. Moreover, variability in resources, such as laboratory infrastructure and staff expertise, may further limit the impact of such certification on clinical microbiology practice29,30,31. Therefore, achieving meaningful improvements in this field may require not only certification but also a concerted effort to promote ongoing education, invest in cutting-edge technologies, and foster collaboration between institutions to ensure consistent laboratory standards32,33. Another potential explanation is that statistical variation in sampling, such as differences in geographic region and economic status, may have influenced the results. Hospitals in more economically developed areas may have better access to advanced technologies and more trained personnel, while those in less developed regions might struggle with limited resources, affecting the consistency and quality of microbiology testing34,35. Despite these limitations, our study provides a novel perspective that leverages ChatGPT to assist clinical microbiologists in addressing challenges associated with automated microbiology testing. While the findings are promising, future studies should validate this approach across a broader range of geographic and socioeconomic contexts to ensure its generalizability and scalability in diverse clinical settings.

Importantly, ISO15189 itself still provides substantial benefits to clinical laboratories, such as establishing standardized validation frameworks, ensuring traceability of results, and facilitating external quality assessments36. These mechanisms support the safe introduction of novel technologies, including AI-assisted tools, into routine workflows. Therefore, although our data did not show a significant performance advantage for ISO-certified institutions, the certification should be viewed as a complementary safeguard that can promote reliability and harmonization when integrating AI into clinical microbiology practice.

In addition, geographic variability in pathogen prevalence and infection rates can complicate standardization, as laboratories in different regions may encounter distinct diagnostic challenges37,38. Addressing these disparities may require a tailored approach that accounts for regional differences in resources, training, and local epidemiology39,40,41. A more equitable distribution of resources, coupled with region-specific training programs, could help harmonize clinical microbiology practices across diverse geographic and economic contexts42. To mitigate these issues, CM must continuously expand their knowledge base and stay updated with the latest advancements in the field. In this context, developing a personalized and continuously updated clinical microbiology knowledge base for ChatGPT could significantly alleviate the knowledge constraints faced by clinical microbiologists and facilitate standardization.

Enhancing ChatGPT performance through prompt training

Based on the research findings, the quality of ChatGPT's responses exhibited a marked improvement following prompt training with our protocol (Figs. 1 and 3), particularly in the areas of clarity, accuracy, and objectivity (Fig. 4). This improvement can be attributed to several factors. First, the standardized training protocol helped eliminate human cognitive biases that might otherwise affect the consistency of report generation. Second, ChatGPT demonstrated the ability to apply a uniform set of clinical guidelines and references, ensuring that its responses adhered to established standards. Third, its capacity to process and synthesize large volumes of information without fatigue contributed to more reliable and comprehensive outputs. Additionally, the iterative feedback mechanism during training allowed the model to refine its understanding of complex microbiological concepts, further enhancing its performance compared with clinical microbiologists, who may rely on subjective judgment in certain scenarios. However, there was minimal enhancement in completeness for both the human participants and ChatGPT. For CM, this may be attributed to insufficient knowledge reserves, inadequate knowledge updates, and a lack of proactive engagement30. In the case of ChatGPT, the limitation is inherent to its operational logic: each conversation initiates a memory reset, and it lacks a dedicated, specialized database8. In the future, to facilitate the issuance of automated ID/AST system reports by CM, a specialized ChatGPT microbiology knowledge base could be developed, incorporating continuously updated and verified information. Our findings indicate that prompt engineering training significantly enhances the quality of automated ID/AST system reports generated by ChatGPT. This conclusion aligns with previous studies, which demonstrated that, with appropriate training and guidance, LLMs like ChatGPT exhibit improved capabilities in natural language processing and generation43,44. It also confirms our hypothesis that, with appropriate training and guidance, ChatGPT can become an effective tool for assisting in the issuance of high-quality automated ID/AST system reports.

ChatGPT exhibited the highest trainability in cases 1, 3, 4, 5, 6, 7, and 8 (p < 0.05; Fig. 5); however, the training effect in case 2 did not show a statistically significant difference (p ≥ 0.05). Because case 2 hinged on the requirement to use pure colonies for testing, training did not enhance ChatGPT's ability to recognize this specific issue. CM also encounter similar problems, which stem primarily from inefficient workflows and inadequate attention to detail. To minimize reporting errors resulting from the use of non-pure colonies, it is advisable to carefully verify the use of pure colonies before testing and to double-check the plates during report issuance. This dual verification could significantly reduce errors associated with non-pure colonies. In cases 1 and 3, the CM group demonstrated superior performance compared with the GPT_BT group. As indicated in Table 1, in accurately identifying rare bacteria that produce NDM enzymes without matching phenotypes in the expert database, the quality of responses from GPT_BT was inferior to that of CM. This shortcoming is related to ChatGPT's design as a general-purpose question-answering tool, but it can be effectively addressed through training or the development of a specialized clinical microbiology database.

Innovations and limitations of the study

This study is distinguished by four key aspects. First, we are pioneers in utilizing an LLM, ChatGPT, to assist in generating automated ID/AST system reports, thereby opening new possibilities for the practical application of such models in frontline clinical microbiology. Second, this multicenter clinical study, based on real-world data from actual clinical scenarios, provides robust scientific evidence for ChatGPT's application as a tool to assist in the issuance of automated ID/AST system reports12,45. Third, our findings indicate that ChatGPT-generated automated ID/AST system reports are superior to those produced by clinical microbiologists, enhancing report accuracy and, consequently, clinical diagnostic precision, which has significant implications for patient treatment outcomes. Finally, we present a new prompt engineering training method that significantly improves the quality of ChatGPT-assisted automated ID/AST system reports through targeted training and guidance. This approach offers novel insights into the training of LLMs and holds promise for expanding their applications across various fields.

Despite employing a multicenter sampling strategy, our study's sample distribution was limited by regional and economic disparities. Previous research has highlighted significant socioeconomic differences between China's eastern and western regions, which may influence both healthcare outcomes and access to services46,47. These factors may have affected the representativeness of our sample and should be considered when interpreting the results. To address these limitations, future studies should include more geographically and economically diverse regions to enhance sample representativeness and generalizability. Additionally, adopting standardized protocols across different healthcare settings could mitigate the variability introduced by regional disparities. Further research could also explore collaborations with international institutions to validate the findings in a global context, thereby extending the applicability of ChatGPT-assisted tools beyond the current study scope. These strategies will provide stronger evidence for the adoption of large language models like ChatGPT in diverse clinical environments and further improve the robustness of their application.

Conclusion

The results of this study hold significant value for clinical microbiology: training ChatGPT to assist in generating automated ID/AST system reports can enhance report generation efficiency and reduce the workload of CM.