Introduction

Rare Gynecological Tumors (RGTs), comprising over 30 distinct histological subtypes, such as sex cord stromal tumors, and uterine or ovarian carcinosarcomas, account for more than 50% of gynecologic malignancies, presenting a major clinical challenge1. With an incidence rate below six per 100,000 individuals, RGTs are difficult to study through large-scale randomized trials, leading to reliance on less standardized treatment approaches such as retrospective studies, case reports, and expert opinions. This lack of robust clinical guidelines has contributed to persistently poor prognosis for patients with RGTs2.

Technological advancements in cancer diagnostics have enabled the identification of biomarkers as therapeutic targets. Biomarker-guided treatments promise to accelerate the development of precision therapeutics across tumor types, reducing the relevance of organ-based classification3. The prevailing organ-centric approach to clinical trial design hinders the development of effective treatments for rare cancers with shared biomarkers4. This obstacle extends beyond rare cancers: The premature dismissal of olaparib in ovarian cancer and the seven to ten year delay in addressing Programmed Cell Death Ligand 1 (PD-L1) expressing breast and gynecological cancers with PD-L1 inhibition illustrate the need for biomarker-driven stratification for cancer treatment5,6.

Molecular tumor boards (MTBs) are essential for interpreting biomarker profile results and matching cancer patients with appropriate therapies. This includes identifying suitable investigational drugs7. The manual interpretation of multiple, co-occurring molecular alterations requires an in-depth understanding of their functional implications and correlations with treatment sensitivity or resistance. The rapid growth of biomedical literature and the fragmented nature of data sources make manual curation a bottleneck in efficiently translating genomic data into actionable treatment strategies7.

The data produced by MTBs is often stored in unstructured formats within electronic health records (EHRs) or other repositories, hindering their reusability for similar patients. Evaluating the effectiveness of MTB-guided treatments requires extracting follow-up data from EHRs. Unstructured text within EHRs, coupled with the lack of interoperability across healthcare institutions—particularly when MTB patients receive treatment at external facilities—renders the process labor-intensive, error-prone, and time-consuming8. Consequently, applying MTB insights to future patients is hindered.

Advances in data capture and analysis, alongside decreasing costs in genome sequencing, are paving the way for innovative tools to manage rare or refractory cancers more effectively9. Digital twin technology constructs virtual representations of physical entities with dynamic, bidirectional interfaces10. Initially applied in industrial engineering, digital twins can also represent the human body in healthcare. By modeling physiological processes and predicting biomarker-specific responses to treatments, digital twins can address the challenges of patient variability and the limitations of traditional one-size-fits-all approaches11. In the case of RGT, the standard carboplatin and paclitaxel regimen, followed by chemotherapy monotreatments for subsequent lines, may not be the most effective approach12. Digital twins could help stratify RGT patients based on their unique biomarker profiles, enabling more tailored treatments and potentially improving outcomes, even in heavily pretreated cases.

Despite their potential, the adoption of digital twins in clinical practice is constrained by the challenges associated with integrating the diverse and complex data required for their development13. Unlike conventional computational models that often rely on static, population-level assumptions, digital twins are dynamic and personalized—virtual representations of individual patients that evolve in real time with incoming clinical data. This adaptability offers potential for precision medicine, particularly in complex or rare diseases. However, building such models requires synthesizing heterogeneous, often unstructured, information. Traditional natural language processing (NLP) methods are limited in this regard. Recent advances in large language models (LLMs), such as transformer-based architectures, enable more context-aware, scalable, and flexible extraction of clinical insights. Their in-context learning capabilities make them especially suited to support the creation of responsive and individualized digital twins, as explored in this study14.

In this study, we demonstrate the application of an LLM-enabled workflow for constructing digital twins for patients with RGT, specifically metastatic uterine carcinosarcoma (UCS).

The research question was inspired by a real-world UCS case presented to a major German cancer center for evaluation of third-line treatment options. According to a consensus statement by Bogani et al., third-line monotherapy in UCS typically results in a median progression-free survival (PFS) of 1.8 months and a response rate of 5.5%, highlighting the urgent need for novel therapeutic strategies12. The patient presented with a proficient mismatch repair (pMMR) carcinosarcoma with intermediate Tumor Mutational Burden (TMB) and high PD-L1 expression. Although PD-L1 positivity is common in UCS15,16 and has been suggested as an independent prognostic factor17, it has not been validated as a target for immunotherapy12.

Given the potential therapeutic importance of PD-L1, we investigated outcomes in similar patients. We identified cases with high PD-L1 expression, pMMR status and low to intermediate TMB from the institutional MTB database, including non-gynecological cancers, and to expand the pool of UCS cases, from the literature. The unstructured nature of EHR and academic publications posed challenges for immediate analysis. We utilized a local LLM to extract and structure data from EHRs and a cloud-based LLM for literature data. These datasets were integrated into a unified local database, forming the foundation of an RGT Digital Twin system. This system enabled the generation of virtual representations of individual patients, allowing for the simulation of personalized treatment strategies.

The RGT Digital Twin system facilitated the identification of additional therapeutic options, which were subsequently evaluated by MTB members. By integrating data from institutional sources (including non-gynecological cancers) and literature sources (to expand the pool of UCS cases), this approach provided novel insights that were not apparent from either data source alone. This integration has the potential to guide more effective treatment strategies for RGT patients and supports a shift towards a biology-based rather than organ-based definition of tumors. LLM technology enabled us to streamline the extraction, structuring, and analysis of EHR and web data, making it readily accessible for MTB evaluation. This is especially valuable in resource-limited settings like MTBs, where results can occasionally arrive too late to guide timely treatment decisions18.

Results

Patient cohort

A retrospective analysis of 1821 cases discussed at the institutional MTB between September 2017 and July 2024 was conducted. Among these, 132 cases exhibited high PD-L1 expression (CPS ≥ 40), encompassing 28 different tumor entities. The analysis was restricted to patients with TMB < 15 mutations/megabase and pMMR status with either gynecological cancers or carcinosarcoma/sarcomatoid carcinoma, resulting in a cohort of nine patients. Of these, seven patients received ICI therapy and were included in the study. The cohort comprised six females and one male aged 32 to 83 years at MTB presentation. Given that the similarity analysis focused on biomarker profiles and cancer morphology, the male patient’s inclusion was appropriate. His sarcomatoid carcinoma aligned with the other inclusion criteria, regardless of his gender or cancer type.

Data extraction

89 clinical reports were extracted for the patient cohort (median: 11, range: 9-21). Documents had a median of two pages and 4340 characters. Documents contained 70 data points for the selected attributes. Experts reviewed all extractions in the sample. The local LLM achieved accuracy of 0.76, precision of 0.85, recall of 0.87 and F1 of 0.86. The highest accuracy was achieved for ‘diagnosis’ and ‘ICI treatment’ (1.00). Low accuracy occurred in ‘previous treatment’ (0.29) and ‘PFS’ (0.14), mainly due to parsing errors in order and dates of previous treatments. See Table 1 for full results of the analysis.

Table 1 Evaluation of LLM performance for EHR and literature record extraction

Document analysis revealed that primary tumor sites included metastatic UCS (n = 1), metastatic cervical cancer (n = 4; three squamous cell carcinoma, one adenocarcinoma), metastatic uterine serous carcinoma (n = 1), and metastatic, undifferentiated sarcomatoid carcinoma of the pancreas (n = 1). Patients exhibited high PD-L1 expression with a median CPS of 75 (range: 40–95) and a median TMB of 5.5 (range: 0–11). Median follow-up duration was 48 months (range: 15–132 months). Detailed baseline clinical characteristics are presented in Supplementary Table 3.

The LLM-based systematic literature research yielded a dataset of 663 scientific documents. Files had a median of seven pages and 27,995 characters, with a maximum of 934,513 characters. The LLM extracted 7956 attributes from scientific documents. Attribute extraction was reviewed with a random sample of n = 352 (Z = 1.96, N = 7956, e = 0.05, P = 0.5). The cloud-based LLM achieved accuracy of 0.97, precision of 0.97, recall of 0.99, and F1 of 0.98. Lowest precision was observed in PFS (0.77).

The LLM system identified 15 studies reporting ICI treatment in UCS, encompassing a total of 215 patients. While seven of the studies did not exclusively enroll UCS patients, four provided stratified outcomes for UCS cases. Phase II studies that provided stratified analysis for UCS patients showed objective response rates between 0 and 10% for ICI treatment. None of the seven studies allowed for individual patient-level data extraction to create digital twins (see details on these seven studies in Supplementary Table 4).

PD-L1 status was reported in 10 of the 215 literature-derived UCS cases treated with immunotherapy, with three cases exhibiting PD-L1 positivity. This limited sample size precluded stratified analysis. Notably, two of the PD-L1-positive UCS patients harbored dMMR and one had a high TMB, both of which are known to influence ICI treatment response.

Further analysis of the 15 identified studies yielded eight studies with individual patient follow-up data, comprising a total of 14 cases. The median age of these literature-derived patients was 63 years (range: 55–68 years).

Treatment response outcomes for 21 individual patients

In the institutional cohort, seven patients received ICI therapy: five with pembrolizumab monotherapy, one with pembrolizumab plus lenvatinib, and one with ipilimumab plus nivolumab. ICI therapy was initiated on average in the third line (range: 2–4). Median PFS was 6 months (range: 1–48). One patient remained disease-free after 45 months, two continued to respond, one received a subsequent treatment line, and three had died.

Treatments in the 14 cases of the literature-derived cohort consisted of pembrolizumab (n = 4), pembrolizumab plus lenvatinib (n = 7), pembrolizumab plus lenvatinib plus letrozole (n = 1), PD-1/Cytotoxic T-lymphocyte associated protein 4 inhibitors (n = 1), and avelumab plus axitinib (n = 1). ICI was typically given in the third line (range: 2–5). Median PFS was 4 months (range: 0.9–15), and median OS was 9.9 months (range: 2.1–48). At data cut-off, six patients were alive, seven had died, and one had unknown status.

Table 2 provides a summary of ICI treatment response outcomes for all 21 cases.

Table 2 ICI treatment outcomes in 7 institutional cases and 14 literature-derived cases

RGT digital twins enable predictive modeling of individualized patient treatment strategies

To inform personalized treatment planning for the UCS patient (case 1), digital twins were created based on 21 evaluable patients. Treatment outcomes were predicted based on a database of additional 404,265 cases derived from scientific papers (n = 655). Potential treatment strategies were predicted for a patient with UCS with disease progression following third-line pembrolizumab monotherapy.

Supplementary Figure 1 presents treatment-relevant biomarkers after progression on standard-of-care combination treatment with carboplatin and paclitaxel identified by the digital twin system.

The digital twin system tailored treatment recommendations based on the patient’s specific tumor characteristics, treatment history, and geographic location. Considering the patient’s ongoing pembrolizumab therapy, the system suggested testing for Folate Receptor Alpha (FRα) to assess potential eligibility for an off-label treatment regimen currently under clinical investigation. This trial investigated the combination of mirvetuximab soravtansine and pembrolizumab in FRα-positive UCS, with eligibility criteria including pMMR status and prior pembrolizumab progression. However, the trial was no longer recruiting participants and was limited to the United States19. Therefore, the digital twin system suggested considering off-label use of this regimen for the patient. A previous evaluation (2021) identified HER2 amplification in the patient’s tumor, a biomarker linked to high objective response rates to trastuzumab deruxtecan20. Due to the potential for HER2 status to evolve, the system recommended confirming this finding through a new biopsy21. Additionally, the digital twin system suggested evaluating Melanoma-Associated Antigen A4 (MAGE-A4) and Preferentially Expressed Antigen in Melanoma (PRAME), biomarkers frequently expressed in UCS22,23. Ongoing research explores targeted therapies for these markers. Three relevant clinical trials were accessible within the patient’s geographic area. To monitor disease progression, the system recommended continued tracking of serum Cancer Antigen-125 (CA-125) levels24 based on its established correlation with disease progression identified in the patient’s 2021 EHR data.

Potential treatment trajectories for treatment line four derived from the Digital Twin pipeline are demonstrated in Table 3.

Table 3 Digital twin pipeline provided the following individualized treatment targets for case 1

Quantifying LLM throughput

For local patient data processed on a standard PC (Macbook Pro M4), LLM extraction took a total of 61 min, averaging 0.68 min (41 s) per file. We estimate that this could be reduced to around 10 s per file when using a clustered computing environment. Manual review of the LLM output added an average of 8 min per file, resulting in a total time of 8.69 min per file for the combined LLM-assisted and human-reviewed approach. In comparison, manual extraction of the same data took an average of 55 min per file, representing a 6.3-fold reduction in time.

When processing a larger volume of literature with a cloud-based LLM, extraction took 83 minutes, averaging 0.125 min (7.5 s) per document. Manual validation added another 6 minutes per document, for a total of 6.125 min per document. In contrast, manual extraction from literature averaged 17 min per document, resulting in a 2.8-fold time saving with the LLM-assisted method.

Discussion

Extracting meaningful data from unstructured medical text is a prerequisite for precision medicine. In this study, we implemented an LLM-based extraction pipeline to systematically retrieve, structure, and analyze data from real-world EHRs and online sources to support and evaluate diagnostic and targeted therapeutic strategies for constructing patient-specific digital twins for metastatic UCS.

The limited data for this rare cancer led us to strategically use in-context learning rather than fine-tuning, which risks overfitting on small datasets and poor generalization. In-context learning leverages the pre-trained LLMs’ broad knowledge through prompt examples for effective data extraction without extensive retraining.

The LLM-based extraction pipeline facilitated timely and accurate synthesis of all relevant full-text scientific publications available through institutional access up to August 15, 2024.

Quantifying LLM throughput highlights the considerable efficiency gains achieved through LLM-assisted data extraction, even when incorporating necessary manual review. These improvements demonstrate the practical viability of integrating LLMs into clinical workflows, particularly for accelerating data curation from unstructured sources.

The cloud-based LLM achieved precision of 0.97 on a complex corpus of medical literature, close to the 0.96 observed by other researchers25. This enabled the generation of evidence-based recommendations and predictive insights grounded in the latest research. Key gaps were observed in extraction of complex data structures, with precision of ‘PFS’ (0.77) and ‘OS’ (0.94) below the overall precision of 0.97. This was due to the fragmented and unstructured way of reporting PFS and OS. Sentences such as “Patient survived for 14 months with the residual tumor post-relapse,”26 make it challenging to accurately determine PFS, as it requires estimation based on prior treatments and the number of treatment cycles. However, estimated PFS may not be accurate if treatment cycles were prolonged. This highlights the challenge of extracting precise outcome data when the primary source lacks comprehensive reporting. We strongly advocate for standards in reporting treatment outcomes, e.g., by clearly stating PFS in months and not date ranges.

In institutional data, unstructured EHR impeded the extraction of key clinical information. This limitation delayed the integration of institutional patient data for informing the management of similar cases. Phase II trials neglecting biomarker-stratification in patients with UCS yielded low objective response rates to ICI therapy, ranging from only zero to ten percent27,28. For our UCS patient, this bottleneck might have precluded ICIs based on high PD-L1 expression, despite the fact that pembrolizumab proved highly efficacious with no adverse effects in this patient. The local LLM system was able to extract structured follow-up data from EHRs across a diverse and complex set of medical documents. While it achieved lower precision than the cloud-based model at 0.85, this is in line with the performance of similar models on complex EHR29. Notably, precision for biomarkers was rated the second-lowest at 0.57. The LLM achieved full recall for all biomarkers given as examples for in-context learning, but did not recognize biomarkers that were not explicitly mentioned (e.g., BRAF for case ID 4). The LLM again achieved lowest precision for PFS at 0.14. This is due to the highly unstructured and fragmented way of reporting PFS, often across multiple documents.

Error analysis of the LLM-based extraction highlighted two challenges: Errors were found in extracting PFS from literature and EHRs, likely due to inconsistent formats and the need for temporal reasoning across multiple paragraphs. Similarly, identifying the number and type of prior treatments from free-text clinical notes led to incorrect extractions. In contrast, the LLMs performed reliably when extracting clearly defined and consistently reported data points. Notably, errors were either presented as missing or incomplete data points, rather than the fabrication of information. The use of in-context learning improved the model’s grounding for well-structured tasks but was less effective in addressing the ambiguities and contextual dependencies inherent in more complex extractions.The European Society for Medical Oncology Precision Medicine Working Group recently established criteria for evaluating the tumor-agnostic potential of molecularly guided therapies, mandating an ORR of ≥20% in at least one of five patients across at least four investigated tumor types, with a minimum of five evaluable patients per type3. Our institutional MTB database identified six analogous cases involving four additional tumor types, most of which exhibited durable responses to ICIs. While the limited number of evaluable patients per tumor type in our single-institution cohort restricted the statistical power, an LLM-driven literature review highlighted an underreporting of PD-L1 expression in UCS in studies conducted to date, despite the known high prevalence of PD-L1 positivity in this malignancy15,16. This underreporting impeded our ability to assess the predictive value of PD-L1 to guide ICI treatment in UCS. Despite the limited sample size of our institutional cohort, the promising outcomes observed suggest that targeting PD-L1 expression in RGT may be a viable therapeutic strategy. The inclusion of diverse tumor types in our institutional cohort further strengthens the role of PD-L1 inhibition in both gynecological and non-gynecological cancers, making it a potential tumor-agnostic marker.

To inform treatment strategies in the event of disease progression, we constructed 21 individualized digital twins, including 7 from our institutional database and 14 from the literature, and queried an LLM-derived database containing 404,265 patient cases. Although our systematic PubMed search was specifically limited to the terms “uterine carcinosarcoma” and “endometrial carcinosarcoma,” the resulting sample also included other uterine and ovarian malignancies. This is because UCS is frequently reported within the broader context of clinical trials involving more common gynecologic cancers.

The RGT digital twin system generated individualized trajectory predictions for various targeted therapies within a secure local environment that respects patient data privacy, offering guidance on further diagnostics, potential treatment options and continued treatment monitoring with serum CA-125. Additionally, since our real-world cohort comprised only White patients, being able to extract data on patients of other races from a vast corpus of literature helped us validate the generalizability of our treatment recommendations30.

This study successfully demonstrated the utility of RGT digital twins for individualized treatment prediction and response modeling. The digital twin not only provided generalized recommendations for additional diagnostic testing but also incorporated specific clinical details from the patient’s treatment history—such as prior pembrolizumab administration—to refine eligibility assessments for targeted therapies. The extraction and analysis of follow-up data revealed that, following the MTB recommendation, the patient received pembrolizumab due to high PD-L1 expression and exhibited a sustained partial response for over 30 months. Subsequent to pembrolizumab initiation, CA-125 levels normalized within three months and have remained stable in the context of ongoing stable disease.

This study had several limitations. Firstly, despite a large dataset, the combination of stringent similarity criteria, a limited institutional cohort, and underreported PD-L1 status in published UCS cases prevented us from stratifying patients by PD-L1 status. Efficacy of ICI treatment in PD-L1–positive UCS remains uncertain, and current trials lack PD-L1 as a stratification factor. Bogani et al. listed nine clinical trials currently exploring ICI treatment in UCS, many of which are nearing completion12. None of these trials included PD-L1 expression as a stratification factor. Our findings could inform the design of future trials that specifically evaluate ICI efficacy in pMMR UCS with high PD-L1 expression and low to intermediate TMB. Secondly, only somatic biomarkers were included, potentially underestimating clinical actionability by excluding germline mutations, such as BRCA1/2, which are predictive of PARP inhibitor response31. Thirdly, a local LLM was used for data extraction, impacting extraction performance due to its smaller size and no fine-tuning on German medical texts32. Lack of a German-language equivalent to the English-language MIMIC labeled medical record dataset33 precluded fine-tuning our own model. Finally, the German Network for Personalized Medicine (DNPM) data model is under revision34, necessitating the use of a custom data model for this study and highlighting the importance of future validation for compatibility with DNPM v2.

National and international collaborative initiatives, such as the DNPM Data Integration Platform35 and the Molecular Tumor Board Portal by Cancer Core Europe36, aim to enhance MTB decision-making by standardizing and harmonizing data collection across institutions. These platforms stand to benefit significantly from the integration of LLM-based extraction pipelines, which could facilitate the automated extraction of both baseline and follow-up data, thereby enabling the real-time utilization of MTB data across different healthcare systems. Once the DNPM database is fully operational, clinical narratives derived from EHR data could be transformed into HL7 Fast Healthcare Interoperability Resources (FHIR), streamlining interoperability and reducing the biases and costs associated with manual documentation37. Such automation would enable the analysis of larger patient cohorts, thereby providing the statistical power necessary for accurate treatment predictions in rare cancers, a critical step towards advancing personalized oncology. These outcome data offer a valuable resource for fine-tuning LLMs and for guiding both preclinical research and stratified clinical trial design.

At present, LLM knowledge synthesis forms the core of the system’s predictive capabilities. The future incorporation of auxiliary machine learning models could offer complementary approaches for enhanced predictive analytics.

Beyond the specific limitations of our study, the broader applicability of generating digital twins necessitates careful consideration of local environments and ethical implications. Our framework is designed to prioritize local data processing within institutional settings, ensuring sensitive patient data remains secure, as reflected in our accompanying GitHub code. For the 27b model used, local processing is feasible with standard computer hardware. This is paired with a cloud-based approach for processing publicly available literature, allowing us to leverage external resources while maintaining local control over sensitive information, thereby addressing key ethical and privacy concerns. Future work should continue to explore strategies for secure and ethical deployment across diverse institutional infrastructures and data governance models.

Our LLM-enabled precision oncology approach can inform more effective treatment strategies for RGT patients and suggests the potential for incorporating biomarkers, such as PD-L1 expression, to broaden therapeutic options and inform more personalized strategies that transcend traditional organ-based boundaries, particularly in the context of cancers with limited standard treatments.

Given the increasing volume and complexity of precision oncology data from MTBs, and the limited availability of precision oncologists to translate this abundance of information into clinically meaningful actions7, there is an urgent need for advanced digital tools to facilitate the extraction, structuring, and analysis of large datasets18. Our proof-of-concept study demonstrates the potential of LLMs to efficiently synthesize relevant information for MTB evaluation.

While this study focused on RGTs, the LLM-enabled digital twin approach holds potential for a wide range of refractory cancers. By accurately predicting individual patient trajectories, these digital twins can inform personalized diagnostics and treatment strategies in a timely and cost-effective manner, potentially improving patient outcomes.

Methods

Study setup

We employed an RGT Digital Twin system to create personalized treatment suggestions for UCS. A real-world patient case, along with molecular profiling data, was analyzed using the RGT Digital Twin system. The findings were then compared to analogous cases drawn from institutional and public databases. Treatment options were discussed at the MTB to inform individualized care decisions. Post-treatment outcomes were documented in the patient’s EHR and updated for the individual RGT Digital Twin to improve future predictions. The RGT Digital Twin provided rationale for cost coverage requests and supported study inclusion decisions. Refer to Fig. 1 for an overview of the study process.

Fig. 1: Workflow from RGT patient to RGT digital twin.
figure 1

Rare Gynecological Tumor (RGT) Digital Twin is a dynamic system that can integrate diverse data sources to predict individual patient trajectories. Molecular profiling identifies patient biomarkers. Large Language Model (LLM) capabilities support clinical interpretation of molecular profiles, patient matching to clinical trials, reasoning for cost-coverage requests, medical documentation, and data preparation for advanced computing. Advanced computing techniques such as classification and regression algorithms enable the creation and exploration of Digital Twin models. By adjusting parameters such as biomarker expression or previous treatment strategies, clinicians can model potential patient outcomes and determine suitable treatment strategies. The RGT Digital Twin then integrates outcome data back into the RGT Patient Electronic Health Record (EHR), supporting a continuous learning process. RGT Rare Gynecological Tumor, DB Database, LLM Large Language Model Fig. 1 uses free icons obtained from Flaticon.com, in accordance with their licensing terms.

Patient description and research question

The patient is a 77-year-old woman with metastatic UCS, initially diagnosed with FIGO IIIC2 UCS at age 66. Six years after surgery and adjuvant chemotherapy with carboplatin and paclitaxel, the patient experienced a recurrence in the cervical lymph nodes and pelvis, which was also associated with an increase in CA-125 levels. A cervical lymph node biopsy confirmed the recurrence, and the patient underwent the same chemotherapy regimen followed by MTB presentation in 2021 (see Supplementary Table 1 for detailed results). Genomic profiling was conducted using the TruSight Oncology 500 (TSO 500) and TruSight Tumor 170 (TST 170) panels. By analyzing a wide range of cancer-related genes, these panels facilitate the discovery of potential therapeutic targets. Molecular profiling revealed an intermediate TMB of 6.3 mutations/megabase, high PD-L1 expression (Combined Positive Score, CPS: 41), and a pMMR status. The patient’s high PD-L1 expression prompted us to investigate the potential efficacy of PD-L1 inhibitors in metastatic tumors, regardless of primary tumor site or regional drug approval status. To this end, we searched our institutional MTB database for analogous cases, not restricted to gynecological cancers, and expanded our cohort with additional UCS cases identified in the literature.

Data collection

EHR data obtained from Technical University of Munich (TUM) University Hospital was the primary data source. Data downloaded from web-based repositories through institutional access extended the dataset. These repositories included PubMed, ClinicalTrials.gov, and clinical practice guidelines from the National Comprehensive Cancer Network (NCCN) and the German Cancer Society (DKG), which ensured adherence to up-to-date clinical standards.

A two-stage approach was employed for extracting structured, actionable data from source files. First, a locally deployed LLM system extracted relevant data from institutional EHRs. Second, a cloud-based LLM system processed documents from web-based repositories. Afterwards, the extracted structured dataset was made available to clinicians and the locally deployed LLM for further analysis. This process is shown in Fig. 3.

Identification of analogous institutional cases

We utilized a locally deployed large language model to systematically analyze pathology reports from MTB cases discussed at TUM University Hospital between September 2017 and July 2024. PD-L1 expression, TMB and MMR status were documented in a total of 1821 cases. Eligibility criteria included high PD-L1 expression (CPS ≥ 40) and the availability of both MMR and TMB status. Patients with high TMB ( ≥ 15 mutations/megabase) or deficient MMR were excluded to avoid bias, as these features are already established FDA-approved biomarkers for Immune Checkpoint Inhibitor (ICI) therapy responsiveness.

The LLM-assisted screening process applied predefined inclusion criteria and stratified cases based on similarity to the presented UCS patient. Similarity was determined by either medical specialty (gynecological oncology) or histopathological features, specifically carcinosarcoma or sarcomatoid carcinoma morphology, independent of gender or anatomical origin. This strategy was based on the recognized molecular overlap between UCS and high-grade serous ovarian or endometrial carcinomas12. Given the limited clinical value of distinguishing between carcinosarcoma and sarcomatoid carcinoma in practice, the institutional MTB members grouped these under the unified category of “morphology.”

As visualized in the CONSORT diagram (Fig. 2), 9 cases met all inclusion criteria: CPS ≥ 40, pMMR, TMB < 15 mutations/megabase, and concordance with at least one similarity parameter. Of these, 7 patients received ICI treatment and were included in the final LLM-driven analysis, which focused on extracting clinical and treatment outcomes, and evaluating case-level comparability to the index UCS patient.

Fig. 2: CONSORT flow diagram for evaluation of immune checkpoint inhibitor treatment in institutional and literature-derived cases.
figure 2

Flow diagram depicting patient selection for analysis from the institutional Molecular Tumor Board (MTB), including those eligible for and treated with immune checkpoint inhibitors (ICI). Due to the limited number of analogous digital twins within the institutional MTB cohort (n = 7), additional UCS cases receiving ICI treatment were identified through a targeted literature review and included to supplement the Rare Gynecological Tumor (RGT) Digital Twin (DT) dataset. MTB Molecular Tumor Board, PD-L1 Programmed Cell Death-Ligand 1, MMR Mismatch Repair, TMB Tumor Mutational Burden, CPS Combined Positive Score, pMMR proficient Mismatch Repair, Mut/Mb Mutations/Megabase, UCS Uterine Carcinosarcoma, RGT Rare Gynecological Tumor. DT Digital Twin.

Figure 2 provides an exemplary execution of the system for institutional data enhanced with external literature due to the limited availability of analogous digital twins in EHR.

Institutional patient data extraction pipeline

EHR of selected patients were processed in a secure hospital environment. Documents varied in format, from (handwritten) medical notes to obituaries. Ten attributes were extracted from documents for each patient to form the RGT Digital Twin. Supplementary Table 2 shows the full list of attributes. Optical Character Recognition (OCR) was performed using Tesseract38. Raw text was processed with a locally deployed version of pre-trained LLM gemma-2-27b-it, chosen for its ability to run locally while maintaining strong performance on medical texts39. This privacy-preserving architecture ensured that patient data would not leave the local clinic environment. In-context learning was used to adapt the LLM to the task at hand. This involves providing the LLMs with extensive instructions in their prompts, including example input and output (e.g., “age”: “63”, “gender”: “female”, “diagnosis”: “Uterine Carcinosarcoma”). The publicly available GitHub repository provides the full prompt templates and examples. This improves their recall and precision40.

Literature extraction pipeline

To extend the limited sample of analogous digital twins available in institutional EHR, a systematic literature search was conducted on PubMed using the terms ‘uterine carcinosarcoma’ and ‘endometrial carcinosarcoma’. Studies and case reports that included individual clinical follow-up data on patients with UCS treated with ICI were considered for inclusion in the analysis. Potential alternative treatment options and therapeutic targets were identified through a comprehensive review of national (DKG) and international (NCCN) oncological guidelines, PubMed-indexed publications, as well as the ClinicalTrials.gov database.

Data points were extracted in the structure shown in Supplementary Table 2. The sample size was captured as an additional data point. Additionally, the LLM was instructed to extract the main treatment recommendation based on the patient profile in the paper. General purpose LLM Google-Gemini-1.5-Pro was selected for this task due to its large context window, which enabled it to process all files in the sample without splitting them into smaller chunks41. Since no institutional patient data was processed in this step, use of public cloud resources was permitted. The LLM was instructed to return results in the form of a JSON object. The LLM processed all documents sequentially, with each document processed in-context. Outputs were exported to a Pandas dataframe on the local machine in the secure hospital environment for convenient analysis by clinicians.

Ethics statement

Patients from the institutional MTB were included in a clinical registry that allowed for retrospective analyses of clinical and molecular anonymized data in accordance with the Declaration of Helsinki. The retrospective analysis was approved by the Ethics committee of the Medical Faculty of the Technical University of Munich (Reference No. 2023-486-S-SB).

All web-based research procedures were conducted exclusively on publicly accessible, anonymized patient data and in accordance with the Declaration of Helsinki, maintaining all relevant ethical standards.

Construction of LLM-enabled digital twin system

Next, extracted data points were stored in a database in the secure hospital environment that constituted the patient’s digital twin. Clinicians were able to model potential outcomes for their patients and determine suitable treatment strategies by reviewing treatment outcomes from patients with similar biomarkers and treatment history. Additionally, they were able to employ the local LLM to combine treatment strategies identified from web sources with the patient’s digital twin, creating personalized treatment recommendations.

The digital twin system provided a comprehensive overview of all potential recommendations relevant to the patient’s digital representation. Suitable recommendations were determined based on matching similar cases based on disease, prior treatments and biomarkers extracted from institutional data, oncological guidelines, clinical trials and literature. If the patient had received a specific treatment before, it was not suggested by the system. If relevant biomarkers were not specified in the patient’s digital representation, the digital twin system proactively suggested their analysis to inform treatment decisions. The recommendations represented the complete set of applicable and viable options for that consultation. As new patient follow-up data and relevant literature were added to the local database, the LLM’s accessible knowledge base expanded. This growing repository enabled the system to generate increasingly informed and updated insights for subsequent patients, enhancing its ability to provide personalized and temporally relevant recommendations.

After selecting a treatment strategy, the database served as evidence for MTB evaluation, clinical trial matching, and creation of cost coverage requests with health insurance providers.

The developed pipeline is illustrated in Fig. 3.

Fig. 3: LLM-enabled RGT Digital Twin Pipeline.
figure 3

To obtain institutional patient data and matching patient profiles from literature, we first filtered institutional records and public data sources (e.g., PubMed) by biomarker profiles and patient disease. We then extracted structured patient data from EHR using a locally deployed, privacy-preserving LLM, and extracted similar data from published literature using a cloud-based LLM. By utilizing a broader patient population than what is available in institutional data, the RGT Digital Twin system generated personalized treatment plans for MTB evaluation. This method revealed additional treatment options that might have been missed when considering each data source alone. EHR Electronic Health Record, LLM Large Language Model. Figure 3 uses free icons obtained from Flaticon.com, in accordance with their licensing terms.

Analysis

Clinical characteristics, treatment regimens, duration of therapy, treatment responses, PFS, and overall survival (OS) were systematically collected from patient records and reports. Treatment response was captured from radiology reports and categorized as complete response (CR), partial response (PR), stable disease (SD), mixed response (MR), or progressive disease (PD). For additional treatment strategies, outcomes were summarized for each therapeutic approach. Cases were sequentially numbered, starting with those retrieved from the institutional MTB database, followed by cases identified from the literature.

Formal statistical analysis to evaluate the accuracy of LLM data retrieval was performed by experts. Due to the large amount of data processed by the Digital Twin pipeline, we adopted human-in-the-loop reviews, an important aspect of machine learning42. To ensure that no information was missed during extraction, a sample-based review of LLM output was performed according to machine learning leading practice43. For institutional data, experts reviewed all attributes extracted from EHR by the LLM for correctness. For public research data, experts reviewed a random sample of attributes extracted from scientific studies for correctness. Additionally, all attributes that were used by the LLM to construct the literature-derived digital twins were manually reviewed. Afterwards, accuracy, precision, recall, and F1 scores of LLM extraction were calculated. Finally, all treatment recommendations generated by the LLM were manually reviewed and corrected by human experts. The data extraction review panel included two bioinformaticians and two gynecological oncologists with five and 16 years of clinical experience.

A panel of five MTB members, including three clinicians, one pathologist, and one biologist, along with a senior gynecological oncologist, evaluated the personalized treatment recommendations generated by the RGT Digital Twin system.

All statistical analyses were conducted using Pandas and SciPy libraries in Python (Version 3.10.12). The full code and documentation is available on GitHub.