Introduction

Multidisciplinary tumor boards (MDTs) are pivotal in determining optimal diagnostic and therapeutic strategies for oncology patients1,2. The increasing complexity of renal cell carcinoma (RCC) management, guided by evolving recommendations such as the 2024 EAU guidelines, necessitates an efficient and evidence-based approach3. With recent advancements in generative AI, particularly in natural language processing, AI-driven decision support systems may help streamline case discussions and reduce variability in clinical decision-making4,5.

Previous research has explored AI applications in oncology, particularly in radiology and pathology6,7; however, its role in clinical decision support remains underexplored. This study evaluates the capability of an advanced large language model (LLM) chatbot to provide RCC treatment recommendations in alignment with MDT decisions, with the aim of assessing its potential utility in enhancing oncological workflow efficiency.

Methods

All RCC cases discussed by the institutional MDT were reviewed. For each case, a summarized clinical history—including age, sex, relevant imaging findings, tumor stage, and any treatment and/or diagnostic procedures (e.g., biopsy) performed, with histology—was input into the AI chatbot after removing all identifying patient data.

All interactions were conducted using the “GPT for Slides™ Docs™ Sheets™” add-on, configured with the OpenAI o1 model, with a temperature of 0.30, top-p of 1.0, and a 120k-token context window. The first query was submitted on January 24, 2025, at 08:43 UTC.

To automate data processing, Google Sheets was linked to ChatGPT o1 (OpenAI, San Francisco, CA, USA) using the following procedure:

  1. A Google (Google LLC, Mountain View, CA, USA) account was required to access Google Sheets.

  2. Within Google Sheets, the “Add-ons” menu was accessed, and the option “Get add-ons” was selected.

  3. The Google Workspace Marketplace opened, where multiple applications were available to link Google Sheets to ChatGPT. The selected tool was GPT for Slides™ Docs™ Sheets™ (Qualtir Technology, Roseville, CA, USA), and the necessary permissions were granted following the on-screen instructions.

  4. Once installed, the new add-on appeared under the “Add-ons” menu in Google Sheets and was activated for use.

The clinical cases were derived from multidisciplinary discussions held during 2023–2024 at our institution. Only patients presented for their first discussion were included, to ensure that the AI’s recommendations were compared against an unbiased MDT decision-making process. The medical team summarized each case in a concise text (30–50 words) following a standardized format. These summaries were entered into the first column of a Google Sheets worksheet. To generate automated therapeutic and diagnostic suggestions based on the 2024 European Association of Urology (EAU) guidelines3, a custom function was implemented in Google Sheets:

=GPT(“Can you help us suggest the pathway according to the 2024 EAU guidelines for the following patients? Please give us only the first choice of the next step you would recommend based on the clinical scenario and the patient’s age. You must use a maximum of 30 words”)

All relevant clinical factors—such as disease stage, prior treatments, comorbidities, and imaging findings—were incorporated within the free-text clinical summary provided to the LLM.

This function allowed the patient summaries from the first column to be processed automatically, generating a response from ChatGPT within seconds.
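As an illustration, the worksheet paired one case summary per row with the function above; a minimal sketch of a single row follows. The summary is hypothetical, and the exact way the add-on consumes cell input (here, string concatenation with “& A2”) is our assumption for illustration, not a documented signature of the Qualtir add-on.

  A2 (hypothetical summary): 72-year-old man, incidental 4 cm left renal mass on contrast-enhanced CT, cT1a N0 M0, no prior treatments, no biopsy performed, fit for surgery.
  B2 (formula): =GPT(“<prompt reported above>: ” & A2)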

The generated outputs were compared to the official MDT recommendations, and concordance rates were analyzed after review by a third party (A.A.).

Statistical analysis

To compare the output of the human MDT with that of the AI chatbot in suggesting the diagnostic and/or therapeutic pathway for a series of patients, the following statistical methods were used. Cohen’s kappa (κ) measured the level of agreement between the AI and the MDT while adjusting for chance agreement, stratified by the clinical stage of the disease. Fisher’s exact test was applied after categorizing the AI and MDT recommendations into different decision-making settings, to assess whether a significant difference existed in the distribution of suggestions. Multivariable logistic regression analysis was performed to identify factors predicting a greater discrepancy between AI and MDT recommendations.
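To make the analysis pipeline concrete, the following Python sketch reproduces the three steps on synthetic placeholder data (not the study data); the libraries shown (scikit-learn, SciPy, statsmodels) are our choice for illustration, not necessarily those used in the study.

import numpy as np
import statsmodels.api as sm
from scipy.stats import fisher_exact
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)

# 1) Cohen's kappa on categorized recommendations (one label per case).
categories = ["surgery", "biopsy", "imaging follow-up", "systemic therapy"]
mdt = rng.choice(categories, size=103)
# Simulate an AI that agrees with the MDT in roughly 60% of cases.
ai = np.where(rng.random(103) < 0.6, mdt, rng.choice(categories, size=103))
kappa = cohen_kappa_score(mdt, ai)

# 2) Fisher's exact test on a 2x2 contingency table (placeholder counts).
# The study categorized recommendations into several settings, which in
# practice requires an r x c exact test rather than this 2x2 example.
odds, fisher_p = fisher_exact([[30, 12], [18, 43]])

# 3) Multivariable logistic regression: concordance (1 = AI agrees with
# the MDT) predicted by dummy-coded covariates (e.g., ongoing systemic
# therapy, nodal/metastatic status); exponentiated coefficients are ORs.
concordant = (mdt == ai).astype(int)
X = sm.add_constant(rng.integers(0, 2, size=(103, 2)).astype(float))
fit = sm.Logit(concordant, X).fit(disp=0)
odds_ratios = np.exp(fit.params)

print(round(kappa, 2), fisher_p, odds_ratios)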

Results

A total of 103 RCC cases were included. The patients’ demographics and clinical characteristics are summarized in Supplementary Table 1. The analysis of agreement between the AI chatbot and the MDT in suggesting the next diagnostic and/or therapeutic pathway showed an overall agreement of 62.1%, with an expected agreement of 32.6%, resulting in a Cohen’s Kappa (κ) of 0.44 (p < 0.001), indicating moderate agreement.
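As a sanity check, this value follows directly from the observed agreement (p_o) and the chance-expected agreement (p_e):

κ = (p_o − p_e) / (1 − p_e) = (0.621 − 0.326) / (1 − 0.326) ≈ 0.44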

Stratifying by disease stage, agreement was highest in the Nx/N0 M0 group (73.8% observed vs. 48.9% expected, κ = 0.48, p < 0.001), reflecting moderate agreement. In the Nx/N0 M+ subgroup, agreement was 60% observed vs. 28.9% expected (κ = 0.44, p = 0.001), also suggesting moderate agreement. Conversely, lower agreement was observed in patients with N+ M0 disease (45.4% observed vs. 28.1% expected, κ = 0.24, p = 0.03). Agreement was particularly weak in the N+ M+ subgroup (31.2% observed vs. 22.3% expected, κ = 0.11, p = 0.09), where no significant agreement was detected (data detailed in Table 1).

Table 1 Cohen’s Kappa (κ) measuring the level of agreement between the LLM and the MDT while adjusting for chance agreement, stratified by the clinical stage of the disease

Significant differences between the AI chatbot and the MDT were found in the distribution of recommendations across decision-making settings (p = 0.001).

Higher discordance was found in cases where a biopsy was suggested, whereas lower discordance was noted in cases where follow-up imaging was indicated (Table 2).

Table 2 Fisher’s exact test was applied after categorizing the AI and MDT recommendations into different decision-making settings to assess whether a significant difference existed in the distribution of suggestions

The multivariable analysis identified several factors influencing the concordance between the multidisciplinary team and the AI chatbot. Ongoing systemic therapy showed a potential association with higher concordance (OR = 4.54, 95% CI: 0.82–25.05, p = 0.08), although this did not reach statistical significance. Disease status had a notable impact on concordance: compared to patients with Nx/N0 M0 disease (reference category), those with both nodal and metastatic involvement (N+ M+) had significantly lower odds of concordance (OR = 0.11, 95% CI: 0.03–0.5, p = 0.004). Patients with nodal involvement but no metastases (N+ M0) also showed a signal toward reduced concordance (OR = 0.26, 95% CI: 0.06–1.11, p = 0.07), though this did not reach statistical significance (Table 3).

Table 3 Multivariable logistic regression analysis performed to identify factors predicting a greater discrepancy between AI and MDT recommendations

Discussion

Our study demonstrates that AI-driven decision support systems have the potential to align with expert MDT decision-making in a proportion of RCC cases. Agreement between the AI and the MDT varied across disease stages, with weaker agreement in more advanced disease settings. Disagreement was more common in cases where invasive diagnostic or therapeutic procedures were recommended rather than simple imaging follow-up.

Previous researchers have recently published pilot experiences analogous to ours in other fields. Most of these attempts concern breast cancer8. In an observational study, Griewing et al. compared the concordance of treatment recommendations from ChatGPT 3.5 with those of a breast cancer multidisciplinary tumor board; overall concordance between the LLM and the MDT was reached for half of the patient profiles9. Sorin et al. asked the LLM to recommend the next most appropriate step in the management of their patients, providing the LLM with detailed patient history as a basis for the decision. The recommendations of the LLM were retrospectively compared to the decisions of the MDT: in seven out of ten cases, the LLM recommendations overlapped with those of the MDT. The authors underlined that the LLM tended to overlook important patient information10. These results are very similar to what we observed in our experience. It is entirely understandable that discrepancies may arise between the verdicts of an LLM and those of an MDT. These discrepancies may stem from unique clinical presentations that are not sufficiently addressed by the guidelines, or from the fact that the LLM lacks full awareness of the patient’s frailty status and cannot view and interpret radiological imaging, for example. This is why managing atypical cases presents the greatest room for improvement when aiming to integrate AI into the workflow of an MDT.

In the colorectal cancer field, Choo et al. discussed colorectal cancer cases in the MDT board at a single tertiary institution. The treatment recommendations made by the LLM ChatGPT were analyzed to ensure adherence to oncological principles and compared with the decision plans made by the MDT. The resulting concordance rate for oncological management recommendations between the LLM and the MDT was 86.7%, which is very optimistic compared to what we observed in our experience11.

Lechien et al. evaluated the performance of ChatGPT-4 in oncological board decisions regarding 20 medical records of patients with head and neck cancer; GPT-4 was accurate in 13 cases (65%)12.

In another field, Haemmerli et al. prompted ChatGPT with detailed patient histories to recommend treatments for glioma patients. As in the other reported experiences, the output of the LLM was evaluated by raters, and inter-rater agreement was assessed. The performance of the LLM was poor at classifying glioma types, but good for recommending adjuvant treatments. Overall, expert agreement was moderate, as indicated by an intraclass correlation coefficient of 0.7 (ref. 13).

It is clear that early experiences with testing LLMs in multidisciplinary decision-making for oncology patients are beginning to emerge. With some exceptions, such studies consistently report similar findings, with agreement rates around 60–70%. As expected, the ability of an LLM to replicate the verdict of an MDT varies depending on case complexity, with higher concordance observed in less intricate cases.

To our knowledge, ours is the first study to assess a generative AI model’s ability to propose guideline-based MDT recommendations in the field of kidney cancer. However, we are still far from the day when an LLM could fully replace a human MDT. Machine learning algorithms trained on large datasets of decisions made by MDTs could progressively improve the accuracy of AI.

While the study presents an original and timely concept, it is accompanied by several limitations and controversial aspects that should be carefully considered when interpreting the results.

Concerning methodological flaws, the study focused on a single, widely used LLM and did not include a comparative evaluation across different generative models. While this choice was intentional for a pilot feasibility analysis, it limits the generalizability of our findings across the broader and rapidly evolving landscape of LLMs. Comparing different models, or even the same model with varying hyperparameters, would require careful consideration of factors such as prompt design, temperature, and random seeds, which can introduce significant variability in performance. Another limitation is the non-systematic approach to prompt design: the prompt was written in a straightforward and practical manner without formal testing or comparison against alternative formulations. Prompt engineering strategies—such as testing variants, iterative refinement, or formal validation—would be recommended. While case summaries provided to the LLM were generated using a standardized format, their internal consistency was not formally assessed; subtle variability in how clinical scenarios were framed may have influenced downstream comparisons—much like different angles can alter the perception of the same object. Future studies should consider quantifying this representational variability using methods such as cosine similarity14 or the Jaccard index15, especially in settings where LLM outputs are highly sensitive to input phrasing. When evaluating LLM outputs and MDT decisions, textual similarity analyses using BLEU16, ROUGE17, and cosine metrics14 (Supplementary Material, Supplementary Discussion 1) revealed limited lexical and structural overlap, highlighting the need for refined prompting and more semantically aware evaluation methods; however, the reader should note that low scores do not necessarily indicate clinical inaccuracy, and they warrant qualitative case-by-case assessment, as was done in this study.
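To make these metrics concrete, the short Python sketch below computes a token-level Jaccard index and a TF-IDF cosine similarity between two hypothetical texts (not actual study outputs); BLEU and ROUGE would require dedicated packages (e.g., nltk, rouge-score), and the same caveat applies: low lexical overlap does not by itself imply clinical discordance.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard(a, b):
    # Token-level Jaccard index: |A ∩ B| / |A ∪ B| over word sets.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Hypothetical, illustrative texts (not actual study outputs).
llm_text = "Partial nephrectomy is the recommended first choice for this cT1a renal mass."
mdt_text = "The MDT recommends nephron-sparing surgery (partial nephrectomy)."

# Cosine similarity over TF-IDF vectors of the two texts.
tfidf = TfidfVectorizer().fit_transform([llm_text, mdt_text])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

print(f"Jaccard: {jaccard(llm_text, mdt_text):.2f}, cosine: {cos:.2f}")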

Regarding limitations of the data evaluated, the study focused exclusively on first-time MDT discussions to ensure unbiased comparisons, thereby excluding follow-up cases where decisions are often guided by prior therapeutic steps; while this enhances internal validity, it limits insights into scenarios where LLMs might eventually offer greater workflow support. In fact, it may be precisely the more straightforward setting of re-discussions—where clinical pathways are already partially defined—that represents the most immediate opportunity for meaningful LLM integration into MDT workflows. A major limitation is the lack of granular data on performance status, the absence of a systematic frailty assessment using geriatric scores, and the lack of standardized evaluation of comorbidities, which could have provided a more detailed and personalized influence on the decision-making process. These factors were included in the clinical case scenario only when considered essential for the LLM to make an informed decision. Additionally, the lack of direct integration of imaging or pathology data into the LLM workflow represents another drawback. Unlike human MDTs, which routinely base their decisions on direct visual inspection of radiologic and histologic images, the LLM relied solely on textual inputs. This introduces an asymmetry in the comparison, as critical diagnostic nuances may not be fully captured in narrative reports. That said, incorporating raw image data into general-purpose LLMs raises substantial ethical and cybersecurity concerns, including risks related to patient privacy and data protection, which currently preclude such integration in routine clinical research settings.

A further limitation lies in the absence of systematic follow-up data, which prevents a direct evaluation of the clinical impact of both MDT and AI-driven decisions. In fact, the real-world effectiveness of MDT recommendations is itself not always measurable, making any outcome-based comparison with AI inherently challenging and beyond the scope of this study.

Finally, the study lacks external validation, as all cases were drawn from a single institution. This limits the generalizability of our findings. Future research will focus on fine-tuning the model and conducting external validations to enhance applicability across broader settings.

With all these limitations in mind, in the context of optimizing patient care, our pilot experience suggests that LLMs could at least serve as a triage tool, helping to prioritize the most critical cases before discussion.

A continuous, exponential increase in the number of cases requiring discussion is expected in the coming years, in alignment with modern clinical practices. Consider, for example, localized kidney cancer: many patients who, in the past, would have been treated exclusively with surgery—such as partial or radical nephrectomy—must now at least be evaluated and counseled for adjuvant immunotherapy18. And this is just the beginning.

As for legal and regulatory considerations, we acknowledge that we are still far from a point where AI could ethically or legally replace human decision-making in high-stakes clinical contexts such as MDT discussions. We remark that this study should be interpreted as a pilot exploration of AI’s supportive potential, not as an endorsement of autonomous, AI-driven care.

In conclusion, LLMs show promise as a support tool for RCC decision-making within an MDT framework, particularly for cases with lower complexity. While AI may not replace human expertise, it has the potential to optimize case discussions and improve workflow efficiency. Further validation studies and AI model enhancements will be essential to maximize its utility in real-world oncology settings.