Introduction

Pelvic and sacral tumors (PSTs) are rare, and metastatic tumors are the most common type owing to the prominent hematopoietic function of this site1,2,3. Primary benign PSTs mainly include giant cell tumors, schwannoma, neurofibroma, osteoid osteoma, and osteoblastoma4,5,6. Primary malignant PSTs mainly include chordoma, chondrosarcoma, osteosarcoma, Ewing’s sarcoma, and lymphoma7,8,9. Because PSTs are rare and share similar clinical and imaging features, radiologists have difficulty acquiring sufficient clinical experience to make a definite diagnosis10. In the early stage, PSTs are usually small and asymptomatic. By the time they are detected, they are usually large and compress surrounding organs, which often requires surgical intervention. Wide resection should be selected for all primary malignant sacral tumors and for benign lesions involving the lower segments when preservation of both S3 roots is possible3. However, the prognoses of patients with PSTs are poor owing to the complex anatomical structures, multiple surrounding organs, and difficulty of operating at this site2,11. Consequently, the diagnostic challenges posed by PSTs (their rarity, symptom latency, imaging similarities, and the critical need for early detection to enable potentially curative but complex surgery) underscore the urgent requirement for accurate and efficient diagnostic tools12.

In recent years, deep learning (DL) has shown great potential in exploring the nature of tumors and has been extensively used in bone tumor diagnosis, efficacy evaluation, and prognosis prediction13,14,15,16,17,18. Few studies have used DL models to distinguish between benign and malignant bone tumors, and those that have were mainly based on plain films12,19,20,21. Compared with plain films, multi-sequence magnetic resonance imaging (MRI) can better display the bone marrow infiltration and surrounding soft tissue involvement of PSTs. Owing to the large sizes of PSTs, the manual segmentation of lesions is time-consuming4,22. An MRI-based DL segmentation model may be able to automatically segment PST lesions and remove the tedious process of manual delineation. In addition, attention-based DL models have been applied to medical image classification problems and have shown better aggregation and representation capabilities23. Crucially, non-enhanced MRI is the cornerstone of initial bone lesion evaluation in clinical practice owing to its wide availability, absence of contrast-related risks (e.g., nephrogenic systemic fibrosis, allergic reactions), and lower cost compared with contrast-enhanced protocols. However, interpreting complex non-enhanced MRI studies for rare PSTs remains challenging, particularly for less experienced radiologists. Therefore, a robust DL model capable of automatically diagnosing PSTs directly on routine non-enhanced MRI sequences holds significant promise for addressing the aforementioned diagnostic challenges. Such a tool could potentially: (1) augment radiologists’ diagnostic confidence and accuracy, especially in settings with limited PST expertise; (2) expedite the diagnostic workflow by automating lesion segmentation and analysis, reducing time-to-diagnosis; and (3) leverage the most accessible and safest initial MRI protocol, maximizing clinical utility and impact.

The aim of our study was to develop an end-to-end DL model for the diagnosis of benign and malignant PSTs using non-enhanced MRI.

Results

Patient characteristics

A total of 835 patients (441 males and 394 females; median age 45.0 years [interquartile range: 29.0–58.0]; age range 3 to 83 years) were included in this study (see Table 1). This cohort comprised 621 malignant tumors and 214 benign tumors. Centers 2, 3, and 4 had 17, 19, and 18 benign tumors and 46, 58, and 23 malignant tumors, respectively. Clinical data for patients in the different sets are detailed in Supplementary Table 1S.

Table 1 Clinical characteristics of patients

Age, sex, tumor size, and tumor location differed significantly between patients with benign and malignant tumors (P < 0.01). The median age of patients with malignant tumors was 48.0 (29.0, 60.0) years, which was significantly higher than that of patients with benign tumors, 38.0 (28.0, 51.0) years (Z = −4.483; P < 0.001). The sex ratio differed significantly between the two groups (χ² = 14.55; P < 0.001): among patients with malignant tumors, the proportion of males was higher than that of females, whereas among patients with benign tumors, the proportion of females was higher. In addition, malignant tumors were significantly larger than benign tumors (Z = −3.431; P = 0.001). Benign tumors were most often located in the sacrum (168 cases; 78.5%), followed by the ilium (19 cases; 8.9%). Malignant tumors were also most often located in the sacrum (289 cases; 46.5%), followed by the ilium (136 cases; 21.9%). The distribution of tumor locations differed significantly between the benign and malignant groups (χ² = 69.259; P < 0.001).

Performance of different models

The average Dice score and IoU value of the segmentation model were 0.758 and 0.610, respectively. For T1-w, T2-w, DWI, and CET1-w sequences, Dice scores were 0.606, 0.792, 0.694, and 0.728, and IoU values were 0.472, 0.678, 0.573, and 0.598, respectively. As shown in Fig. 1, the model’s segmentations agreed reasonably well with the radiologist’s delineations.
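For readers unfamiliar with the two overlap metrics reported above, the following is a minimal sketch of how a Dice score and an IoU value can be computed for one pair of binary masks. The function name and the toy masks are illustrative assumptions and are not taken from the study's code; how the study averaged scores over slices or patients is not stated in the text.

```python
import numpy as np

def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention, not from the paper).
        return 1.0, 1.0
    dice = 2.0 * intersection / total
    iou = intersection / union
    return float(dice), float(iou)

# Toy masks (made-up values, for illustration only).
pred_mask = np.array([[1, 1, 0],
                      [0, 1, 0]])
true_mask = np.array([[1, 0, 0],
                      [0, 1, 1]])
dice, iou = dice_and_iou(pred_mask, true_mask)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")
```

For a single pair of masks the two metrics are linked by IoU = Dice / (2 − Dice), so Dice is always at least as large as IoU; this identity need not hold exactly for values averaged over many cases.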

Fig. 1: Demonstration of the segmentation model’s predictions.

A, B T1-w images of a 62-year-old female patient with neurofibroma; C, D T2-w images of a 44-year-old male patient with mucinous papillary ependymoma; E, F DWI images of a 45-year-old female patient with schwannoma; G, H CET1-w images of the same patient as (A) and (B). The second row shows T1-w, T2-w, DWI, and CET1-w images (from left to right) with the model’s and the radiologist’s segmentations overlaid; the darker color represents the radiologist’s segmentation.

Among Models ORI, MAN, and SEG, Model SEG had the best performance (Fig. 2; Table 2). Model ORI achieved an AUC of 0.735 and ACC of 0.759 in the Internal Test Set 1 and an AUC of 0.697 and ACC of 0.686 in the Internal Test Set 2. Model MAN achieved an AUC of 0.728 and ACC of 0.716 in the Internal Test Set 1. Model SEG had an AUC of 0.852 and ACC of 0.767 in the Internal Test Set 1 and an AUC of 0.736 and ACC of 0.743 in the Internal Test Set 2. The DeLong test showed that the AUC of Model SEG was significantly higher than those of Model ORI (P = 0.01) and Model MAN (P = 0.02) in the Internal Test Set 1. However, no significant difference between Model ORI and Model SEG was found in the Internal Test Set 2 (P = 0.58).

Fig. 2

ROC curves and precision-recall curves (PRCs) of the different models in Internal Test Set 1 (A, B) and Internal Test Set 2 (C, D). A, C ROC curves of all models. B, D PRCs of all models.

Table 2 Performance of different models

Model SEG-NC achieved an AUC of 0.825 and ACC of 0.750 in the Internal Test Set 1 and an AUC of 0.735 and ACC of 0.724 in the Internal Test Set 2. The DeLong test showed no significant difference between Models SEG and SEG-NC in the Internal Test Set 1 (P = 0.06) and Internal Test Set 2 (P = 0.92). Model SEG-CL had an AUC of 0.852 and ACC of 0.784 in the Internal Test Set 1 and an AUC of 0.840 and ACC of 0.800 in the Internal Test Set 2. Model SEG-CL-NC achieved an AUC of 0.823 (95% confidence interval [CI]: 0.726, 0.901), ACC of 0.776 (95% CI: 0.698, 0.845), sensitivity of 0.827 (95% CI: 0.740, 0.906), and specificity of 0.657 (95% CI: 0.486, 0.812) in the Internal Test Set 1 and an AUC of 0.836 (95% CI: 0.749, 0.907), ACC of 0.781 (95% CI: 0.705, 0.857), sensitivity of 0.825 (95% CI: 0.741, 0.907), and specificity of 0.640 (95% CI: 0.448, 0.824) in the Internal Test Set 2. The DeLong test showed no significant difference between Models SEG-CL and SEG-CL-NC in the Internal Test Set 1 (P = 0.22) and Internal Test Set 2 (P = 0.82).

In addition, we found no difference between the AUCs and ACCs of Models SEG and SEG-CL in the Internal Test Set 1 (AUC P = 0.99; ACC P = 0.625) but a significant difference in the Internal Test Set 2 (AUC P = 0.004; ACC P = 0.03). Similarly, the AUCs and ACCs of Models SEG-NC and SEG-CL-NC did not differ in the Internal Test Set 1 (AUC P = 0.94; ACC P = 0.25) but significantly differed in the Internal Test Set 2 (AUC P = 0.01; ACC P = 0.03). Figure 3 shows confusion matrices for differentiating between benign and malignant PSTs in the test sets. Figure S1 shows the nomogram of Model SEG-CL-NC.

Fig. 3: Confusion matrices for Model SEG-CL-NC in distinguishing benign from malignant PSTs.

Panel A displays the confusion matrix for Internal Test Set 1, while Panel B shows the matrix for Internal Test Set 2. The x-axis represents the predictions made by Model SEG-CL-NC, and the y-axis represents the pathological results. 0 = benign, 1 = malignant.
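As a reminder of how ACC, sensitivity, and specificity follow from a confusion matrix laid out as in Fig. 3 (rows = pathology, columns = prediction, 0 = benign, 1 = malignant), a minimal sketch is given below. The function name and the toy labels are illustrative assumptions; the labels are made up and are not the study's data.

```python
def binary_metrics(y_true, y_pred):
    """Compute the confusion matrix and summary metrics for 0/1 labels.

    The matrix is [[TN, FP], [FN, TP]]: rows are the pathological result,
    columns are the model prediction, with 1 = malignant as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # malignant cases correctly called malignant
        "specificity": tn / (tn + fp),   # benign cases correctly called benign
    }

# Toy example with made-up labels.
pathology = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 0, 0, 1]
metrics = binary_metrics(pathology, predicted)
print(metrics)
```

Because malignant cases outnumber benign ones in this cohort, ACC is dominated by sensitivity, which is why a model or reader can reach a fairly high ACC while specificity remains low.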

The overall ACC on the external test set was 0.734, with a sensitivity of 0.680 and a specificity of 0.847. For Centers 2, 3, and 4, the ACC values were 0.714, 0.740, and 0.756; the sensitivity values were 0.690, 0.660, and 0.704; and the specificity values were 0.762, 0.917, and 0.857, respectively.

Performance of radiologist’s diagnosis

The diagnostic ACC values of the two residents and one junior attending physician were 0.819, 0.771, and 0.790; their sensitivity values were 0.9, 0.925, and 0.862; and their specificity values were 0.56, 0.28, and 0.56, respectively. The diagnostic ACC values of Models SEG-CL and SEG-CL-NC in the Internal Test Set 2 (0.800 and 0.781) did not differ significantly from those of the three physicians (P > 0.05). The average diagnostic time per patient for the three physicians was 5.61, 4.42, and 2.94 min, respectively, whereas Models SEG-CL and SEG-CL-NC required only 2.8 and 2.1 s to provide segmentation and classification results, significantly less than the time required by the radiologists (Table 3).

Table 3 Results of physician-reading experiment

Discussion

In this study, we developed an end-to-end DL model (Model SEG-CL-NC) for diagnosing benign and malignant PSTs using non-enhanced MRI. We evaluated its efficacy by comparing it with five other diagnostic models and with radiologists. Our findings demonstrated that Model SEG-CL-NC achieved diagnostic accuracy comparable to that of the contrast-enhanced models and of radiologists, with the added benefit of significantly shorter reading times than radiologists.

Patients with PSTs exhibit similar clinical and imaging features, posing challenges for preoperative diagnosis. Our study identified significant differences in sex, age, tumor size, and location between benign and malignant PSTs, consistent with previous studies2,3,24. Compared with patients with benign tumors, patients with malignant PSTs were older and more often male, and their tumors were larger and mostly located in the sacrum and ilium. These distinctions likely correlate with the diverse pathological profiles of benign and malignant PSTs. Malignant PSTs commonly include metastatic tumors, bone sarcomas, and chordomas, whereas benign tumors typically encompass giant cell tumors of the bone and neurogenic tumors. Our study highlights the enhanced performance of models that integrate clinical information. By integrating clinical data, our model aligns more closely with real-world clinical practice, potentially improving overall diagnostic accuracy and utility in clinical settings25.

Owing to the large sizes of PSTs, the manual segmentation of lesions is time-consuming and susceptible to interobserver variability4,26,27. In this study, we used coarsely labeled ROIs to train the model and finely labeled ROIs to test it. Our study demonstrated that the segmentation-based diagnostic model (Model SEG) outperformed both the diagnostic model based solely on original images (Model ORI) and the model relying on manual lesion delineation (Model MAN). Model SEG seamlessly integrated segmentation with diagnosis, eliminating the need for manual lesion delineation. This approach sets a precedent for future research, indicating that training models in this manner can enhance algorithm efficiency, reduce manual annotation costs, improve accessibility, and ensure ease of use in clinical applications.

Our results showed that the model based on non-enhanced MR images obtained an ACC comparable to that of the enhanced model. CET1-w may generate high-quality images of pelvic tumors, showing enhanced regions within tumors and distinguishing between necrotic tissues and solid tumors4,28. However, the utility of MR-enhanced images in bone tumor treatment is primarily limited to guiding biopsy and planning tumor resection29. In contrast, non-enhanced MRI scans are more commonly employed in clinical practice for diagnosing bone lesions. This approach is advantageous for patients who may be unwilling to undergo enhanced MRI due to factors such as fear of injections (especially children) or allergies to contrast media. Additionally, utilizing non-enhanced MRI can potentially reduce medical costs and shorten examination times, thereby enhancing overall efficiency.

Our proposed non-enhanced MRI-based model (Model SEG-CL-NC) achieves automatic lesion segmentation and diagnosis, providing an end-to-end diagnostic solution. Following a non-enhanced MR scan of a patient with a suspected PST, the images are automatically transmitted to the radiologist’s diagnostic system. Our model then autonomously identifies and segments the lesion and promptly provides a benign or malignant diagnosis. This capability assists radiologists in making accurate diagnoses efficiently. Furthermore, Model SEG-CL-NC demonstrated performance comparable to that of radiologists while significantly reducing diagnostic time. In our hospital, which manages a substantial number of PST cases, the implementation of our model has enhanced physician efficiency, minimized the risk of misdiagnosing primary bone tumors, and facilitated personalized patient treatment. Patients with malignant tumors particularly benefit, because the model enables more aggressive treatment strategies in clinical practice. Moreover, our model’s potential for generalization to other medical centers is promising, offering utility to physicians with varying levels of expertise in bone tumor imaging, including those in smaller hospitals12. Our model performed less effectively at other centers than at our own; this discrepancy may be due to differences in scanners, acquisition parameters, and scanning protocols across centers. Specifically, our model integrates T1-w, T2-w, and DWI sequences, which enhances its performance. However, data from other centers might not include all three sequences simultaneously (e.g., Center 3 provided only T2-w and T1 TSE), leading to incomplete input sequences. Future research will address multicenter protocol variability through: 1) deep harmonization networks for parameter standardization, 2) sequence-robust transformers to handle partial inputs, and 3) federated calibration systems30,31,32,33. Additionally, prospective multicenter trials will further validate these solutions for broader clinical implementation.

This study has several limitations. First, excluding patients with incomplete or poor-quality images may introduce selection bias, potentially compromising real-world generalizability. Second, restricting the comparison to radiologists with 4–6 years of experience precludes benchmarking against senior expertise; future trials will implement multi-tier radiologist assessment. Third, we did not perform a formal cost-effectiveness analysis or quantify segmentation stability through repeated measurements. Fourth, restricting the analysis to single primary PSTs overlooks multifocal malignancies; however, multiple PSTs are more common in malignancies (such as metastases, multiple myeloma, and lymphoma) and are easier to diagnose. Finally, the inadequate assessment of cross-center imaging variability constrains the model’s generalizability. Future research will focus on addressing protocol variability across multicenter settings and validating the proposed approaches through prospective multicenter trials.

In conclusion, our end-to-end DL Model SEG-CL-NC exhibited diagnostic performance comparable to contrast-enhanced models and radiologists in distinguishing benign and malignant PSTs, which may provide an accurate, efficient, and cost-effective tool for clinical practice.

Methods

Patients and data acquisition

A total of 1211 patients with pathologically confirmed benign or malignant PSTs, treated at four hospitals between April 2011 and August 2024, were retrospectively analyzed.

Initially, we examined data from 1021 PST patients at our hospital (Center 1) for the period from April 2011 to May 2022. Patients from April 2011 to June 2021 were included in Internal Dataset 1, while those from July 2021 to May 2022 were included in Internal Dataset 2. Dataset 2, with its more recent data, provides a better foundation for the application of the model.

The inclusion criteria for Center 1 were as follows: 1) a single lesion on MRI; 2) complete preoperative MRI comprising T1-w, T2-w, diffusion-weighted imaging (DWI), and contrast-enhanced T1-weighted (CET1-w) images; and 3) pathologically confirmed benign or malignant PST. Tumors classified as intermediate according to the WHO classification criteria were grouped as benign in this study5,14. The exclusion criteria for Center 1 were as follows: 1) multiple lesions (n = 50); 2) incomplete enhanced MR sequence (n = 254); and 3) postoperative recurrence or severe image artifacts (n = 63).

Ultimately, 654 patients with PSTs from Center 1 were included in this study. Of these, 549 patients from Center 1 were assigned to Internal Dataset 1, which was further divided into 346 patients for the training set, 87 for the validation set, and 116 for Internal Test Set 1. Additionally, 105 patients from Center 1 were included in Internal Dataset 2, designated for Internal Test Set 2.

To further validate our model, data from 190 patients with PSTs at Centers 2, 3, and 4 were used as external test sets. All datasets adhered to the same inclusion and exclusion criteria as Center 1, except that incomplete MR sequences were not considered an exclusion criterion. This adjustment was made due to significant variations in scanning sequences between centers. Nine patients were excluded due to the presence of multiple lesions. Finally, 181 patients were included for external validation from Centers 2, 3, and 4, with 63 from Center 2, 77 from Center 3, and 41 from Center 4 (Fig. 4). Sex, age, tumor location, and maximal tumor size of the patients were also analyzed.

Fig. 4: Patient selection flowchart.

A retrospective analysis included 1211 patients with pathologically confirmed benign or malignant PSTs treated across four hospitals (April 2011 to August 2024). From Center 1 (1021 patients screened; 654 included), Internal Dataset 1 (549 patients, April 2011 to June 2021) comprised a training set (n = 346), a validation set (n = 87), and Internal Test Set 1 (n = 116). Internal Dataset 2 (105 patients, July 2021 to May 2022) formed Internal Test Set 2. For external validation, 181 patients from Centers 2, 3, and 4 (External Test Set: Center 2 = 63, Center 3 = 77, Center 4 = 41) were included.
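The patient flow described above can be checked with simple arithmetic. The short sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Center 1: screening and exclusions, as reported in the text.
screened_center1 = 1021
excluded_center1 = {
    "multiple lesions": 50,
    "incomplete enhanced MR sequence": 254,
    "postoperative recurrence or severe image artifacts": 63,
}
included_center1 = screened_center1 - sum(excluded_center1.values())

# Center 1: split into Internal Dataset 1 (by subset) and Internal Dataset 2.
internal_dataset_1 = {"training": 346, "validation": 87, "internal test set 1": 116}
internal_dataset_2 = 105

# Centers 2-4: external test set.
screened_external = 190
excluded_external = 9  # multiple lesions
external_by_center = {"Center 2": 63, "Center 3": 77, "Center 4": 41}
included_external = screened_external - excluded_external

total_screened = screened_center1 + screened_external
total_included = included_center1 + included_external

print("Center 1 included:", included_center1)
print("External included:", included_external)
print("Total screened / included:", total_screened, "/", total_included)
```

The totals reproduce the 1211 screened patients and the 835 patients analyzed in the Results.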

This retrospective study was approved by institutional review boards at four institutions, including Peking University People’s Hospital (Approval No.2020PHB293), Peking University Third Hospital (Approval No.M2023827), The First Affiliated Hospital of Guangxi Medical University (Approval No.2025-E0250), and The First Affiliated Hospital of Chongqing Medical University (Approval No.2023-139). Given the study’s retrospective nature and reliance on standard clinical protocols, the requirement for informed consent was waived by the Institutional Review Board. The study was conducted following the Declaration of Helsinki.

All images from Center 1 were acquired on Signa HDxt 3.0 T (GE Healthcare), Signa EXCITE 1.5 T (GE Healthcare), and Discovery 750 3.0 T (GE Healthcare) MR scanners. The acquisition parameters were as follows. Axial T1-w liver acquisition with volume acceleration-flexible (LAVA-Flex) or axial T1-w FSE fs: repetition time (TR) = 3.8 ~ 700 ms, echo time (TE) = 1.7 ~ 7.8 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 4 ~ 7 mm, and field of view (FOV) = 38 × 38 cm ~ 42 × 42 cm. T2-w: TR = 2300 ~ 5119 ms, TE = 84.1 ~ 102.5 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 6 ~ 7 mm, and FOV = 38 × 38 cm ~ 44 × 44 cm. DWI: b value = 1000 s/mm², TR = 4800 ~ 5000 ms, TE = 59.2 ~ 60 ms, matrix = 128 × 128 ~ 160 × 160, slice thickness = 6 ~ 7 mm, and FOV = 36 × 36 cm ~ 44 × 44 cm. Axial CET1-w was performed following the intravenous injection of 0.2 mL/kg contrast medium (gadopentetate dimeglumine injection) with a manual push or high-pressure syringe: TR = 3.8 ~ 700 ms, TE = 1.7 ~ 7.8 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 4 ~ 7 mm, and FOV = 38 × 38 cm ~ 42 × 42 cm.

All images from Center 2 were acquired on Signa HDxt 1.5 T (GE Healthcare), Discovery 750 3.0 T (GE Healthcare), Discovery 750w 3.0 T (GE Healthcare), and uMR780 3.0 T (United Imaging Healthcare) MR scanners from December 2014 to July 2024. The acquisition parameters were as follows. Axial T1 TSE: TR = 631 ms, TE = 11.1 ms, matrix = 320 × 256, slice thickness = 5 mm, and FOV = 36 × 36 cm. Axial T2-w: TR = 2700 ~ 4939 ms, TE = 58 ~ 100 ms, matrix = 288 × 224 ~ 320 × 256, slice thickness = 4 ~ 7 mm, and FOV = 24 × 20 cm ~ 42 × 42 cm. DWI: b value = 800 s/mm², TR = 3000 ~ 6650 ms, TE = 62.9 ~ 65 ms, matrix = 128 × 64 ~ 128 × 128, slice thickness = 4 ~ 6 mm, and FOV = 24 × 20 cm ~ 44 × 44 cm.

All images from Center 3 were acquired on Signa HDxt 1.5 T (GE Healthcare), Signa Premier 3.0 T (GE Healthcare), Siemens Verio / Prisma 3.0 T (SIEMENS Healthcare), Siemens Altea 1.5 T (SIEMENS Healthcare), and Philips Achieva 3.0 T (Philips Healthcare) MR scanners from May 2014 to June 2024. The acquisition parameters were as follows. Axial T2-w: TR = 3040 ~ 5240 ms, TE = 64 ~ 130 ms, matrix = 288 × 192 ~ 400 × 306, slice thickness = 4 ~ 8 mm, and FOV = 32 × 32 cm ~ 64 × 64 cm. Axial T1 TSE: TR = 500 ~ 544 ms, TE = 8.6 ms, matrix = 280 × 312 ~ 512 × 370, slice thickness = 5 ~ 8 mm, and FOV = 52.8 × 51.2 cm ~ 64 × 64 cm.

All images from Center 4 were acquired on Signa HDxt 1.5 T (GE Healthcare), Discovery 750w 3.0 T (GE Healthcare), MAGNETOM ESSENZA 1.5 T (SIEMENS Healthcare), and Skyra 3.0 T (SIEMENS Healthcare) MR scanners from April 2013 to August 2024. The acquisition parameters were as follows. Axial T2-w: TR = 2000 ~ 4970 ms, TE = 83 ~ 131 ms, matrix = 256 × 179 ~ 288 × 288, slice thickness = 4 ~ 5 mm, and FOV = 15 × 10.5 cm ~ 40 × 40 cm. Axial T1-w: TR = 150 ~ 630 ms, TE = 1.5 ~ 14 ms, matrix = 256 × 230 ~ 320 × 272, slice thickness = 5 ~ 7 mm, and FOV = 19.1 × 12.7 cm ~ 28 × 19.6 cm. DWI: b value = 1000 s/mm², TR = 3689 ms, TE = 75.4 ms, matrix = 112 × 114, slice thickness = 5 mm, and FOV = 11.2 × 11.4 cm.

Radiologist’s segmentation

The PSTs in the retrospective dataset were manually segmented using ITK-SNAP software version 3.6.0 (www.itksnap.org)34. All regions of interest (ROIs) in the training set were coarsely labeled (five slices above and below the slice showing the largest lesion cross-section), and ROIs in the internal test set were finely labeled (all slices of the lesion). All lesions in the retrospective dataset were carefully delineated along the lesion edge in each sequence by a musculoskeletal radiologist with 5 years of experience and validated by a senior musculoskeletal radiologist with 15 years of experience. No ROIs in the Internal Test Set 2 were manually segmented; the finely labeled ROIs of each sequence in the Internal Test Set 1 were used as the ground truth.

Model preprocessing

For segmentation model preprocessing, we first normalized the input MRI series and padded images whose width or height was smaller than 224 pixels to meet the crop requirement of the subsequent step. Then, we randomly selected eight slices with and without segmentation. For each selected slice, we stacked it with the slices above and below it to produce a three-channel image. We randomly cropped each three-channel image to 3 × 224 × 224 and then applied random flipping and rotation to reduce overfitting. The transformed image patches were fed into the segmentation model. During testing, each image slice was input into the segmentation model together with the slices above and below it, without cropping.
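The padding, three-channel stacking, and random cropping steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the volume is assumed to be already normalized and stored as a (slices, height, width) array, padding is added at the bottom and right, edge slices reuse themselves as the missing neighbor, and rotation is restricted to multiples of 90°. None of these details is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def pad_to_min_size(volume, min_size=224):
    """Zero-pad a (slices, H, W) volume so that H and W are at least min_size."""
    _, h, w = volume.shape
    pad_h = max(min_size - h, 0)
    pad_w = max(min_size - w, 0)
    return np.pad(volume, ((0, 0), (0, pad_h), (0, pad_w)), mode="constant")

def make_25d_patch(volume, index, crop=224, rng=None):
    """Build one augmented 3 x crop x crop training patch centered on slice `index`."""
    if rng is None:
        rng = np.random.default_rng()
    volume = pad_to_min_size(volume, crop)
    n_slices = volume.shape[0]
    # Slice below, the slice itself, slice above; clamped at the volume boundary (assumption).
    neighbors = [max(index - 1, 0), index, min(index + 1, n_slices - 1)]
    stack = volume[neighbors]                      # shape (3, H, W)
    _, h, w = stack.shape
    top = int(rng.integers(0, h - crop + 1))       # random crop origin
    left = int(rng.integers(0, w - crop + 1))
    patch = stack[:, top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, :, ::-1]                  # random horizontal flip
    k = int(rng.integers(0, 4))
    patch = np.rot90(patch, k=k, axes=(1, 2))      # random rotation by k * 90 degrees
    return np.ascontiguousarray(patch)

# Usage with a synthetic volume (random values, for illustration only).
demo_volume = np.random.default_rng(0).normal(size=(10, 200, 260))
demo_patch = make_25d_patch(demo_volume, index=0, rng=np.random.default_rng(1))
print(demo_patch.shape)
```

In training, the same crop, flip, and rotation would also have to be applied to the corresponding mask slice so that image and label stay aligned.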

For diagnostic model preprocessing, we first normalized the input images and then cropped the cuboid region containing the ROI from the MRI series to reduce irrelevant information. After determining the ROI, we randomly selected 12 slices and cropped them into a 12 × 169 × 169 volume. Data augmentation, including random flipping and rotation, was then applied to the cropped patches, which were finally fed to the classification model.
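A minimal sketch of the ROI cropping and slice sampling is shown below. It rests on assumptions that the text does not specify: the cuboid is taken as the tight bounding box of the mask, slices are sampled with replacement only when the lesion spans fewer than 12 slices, and the 169 × 169 in-plane size is reached by center cropping or zero padding. All function names are hypothetical.

```python
import numpy as np

def crop_roi_cuboid(volume, mask):
    """Crop the tightest cuboid of `volume` that contains every nonzero voxel of `mask`."""
    coords = np.argwhere(mask > 0)
    if coords.size == 0:
        raise ValueError("mask is empty; no ROI to crop")
    lo = coords.min(axis=0)
    hi = coords.max(axis=0) + 1
    return volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]

def sample_slices(cuboid, n_slices=12, rng=None):
    """Randomly pick n_slices slices, kept in anatomical order."""
    if rng is None:
        rng = np.random.default_rng()
    depth = cuboid.shape[0]
    # Sample with replacement only if the lesion spans fewer slices than requested (assumption).
    idx = np.sort(rng.choice(depth, size=n_slices, replace=depth < n_slices))
    return cuboid[idx]

def center_crop_or_pad(stack, size=169):
    """Bring each slice of a (slices, H, W) stack to size x size by zero padding, then center cropping."""
    _, h, w = stack.shape
    pad_h, pad_w = max(size - h, 0), max(size - w, 0)
    stack = np.pad(stack, ((0, 0),
                           (pad_h // 2, pad_h - pad_h // 2),
                           (pad_w // 2, pad_w - pad_w // 2)))
    _, h, w = stack.shape
    top, left = (h - size) // 2, (w - size) // 2
    return stack[:, top:top + size, left:left + size]

# Usage with a synthetic volume and a box-shaped mask (illustration only).
rng = np.random.default_rng(0)
volume = rng.normal(size=(20, 300, 300))
mask = np.zeros_like(volume)
mask[5:9, 100:150, 50:260] = 1
cuboid = crop_roi_cuboid(volume, mask)
model_input = center_crop_or_pad(sample_slices(cuboid, rng=rng))
print(cuboid.shape, model_input.shape)
```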

DL model development

This study compared six diagnostic models differentiated by input data sources: Original Image-based model (ORI, 1) trained directly on original images; Manual Annotation-based model (MAN, 2) utilizing radiologist annotations (manual lesion delineations); Segmentation-based model (SEG, 3) employing an automated segmentation model to eliminate manual delineation; Non-Contrast Segmentation-based model (SEG-NC, 4) applying the automated segmentation model to non-contrast MR sequences (excluding CET1-w); Clinical-feature augmented Segmentation-based model (SEG-CL, 5) combining the automated segmentation model with clinical features; and Non-Contrast Clinical-feature augmented Segmentation-based model (SEG-CL-NC, 6) integrating the automated segmentation model, clinical features, and non-contrast MR sequences (excluding CET1-w). See Table 4 for details.
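The six configurations can be summarized as a small lookup table, sketched below. The field names are illustrative, and the assumption that Models ORI, MAN, SEG, and SEG-CL use all four sequences (including CET1-w) is inferred from the descriptions above rather than copied from Table 4.

```python
from dataclasses import dataclass

ALL_SEQUENCES = ("T1-w", "T2-w", "DWI", "CET1-w")
NON_CONTRAST = ("T1-w", "T2-w", "DWI")

@dataclass(frozen=True)
class ModelInputs:
    sequences: tuple   # MRI sequences given to the diagnostic model
    roi_source: str    # "none" (whole image), "manual", or "automatic"
    clinical: bool     # whether clinical features are fused with the DL output

# Assumed mapping, reconstructed from the prose description of the six models.
MODEL_INPUTS = {
    "ORI":       ModelInputs(ALL_SEQUENCES, "none", False),
    "MAN":       ModelInputs(ALL_SEQUENCES, "manual", False),
    "SEG":       ModelInputs(ALL_SEQUENCES, "automatic", False),
    "SEG-NC":    ModelInputs(NON_CONTRAST, "automatic", False),
    "SEG-CL":    ModelInputs(ALL_SEQUENCES, "automatic", True),
    "SEG-CL-NC": ModelInputs(NON_CONTRAST, "automatic", True),
}

def needs_contrast(model_name):
    """Return True if the named model requires the contrast-enhanced sequence."""
    return "CET1-w" in MODEL_INPUTS[model_name].sequences

contrast_free = [name for name in MODEL_INPUTS if not needs_contrast(name)]
print(contrast_free)
```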

Table 4 The inputs to different models

The segmentation and diagnostic models were trained separately, and the optimal diagnostic model was trained according to the segmentation model’s predicted mask. In the segmentation stage, we adopted U-Net as the architecture of the segmentation model because it has shown promising segmentation performance in a recent study35. We also used MobileNetV2 initialized with ImageNet pre-trained weights as the encoder to improve model performance. The 2.5D training method proposed by Lv et al.36 was applied to train our segmentation model instead of a 2D or 3D method. The 2.5D method captures more spatial information than the 2D method, whereas 3D segmentation of the large ROIs in this cohort would require a large amount of memory, which would restrict model performance. All MRI sequences were input into the segmentation model for training after preprocessing.

In the diagnostic stage, given that all patients had undergone multi-sequence MRI scans, we selected a state-of-the-art CoaT-based transformer network together with the multiple instance learning strategy proposed by Huang et al.37 as our diagnostic model, which has been shown to be efficient in multi-modality input classification problems. In Models SEG-CL and SEG-CL-NC, late fusion was used to combine the DL output with clinical information. The clinical information included age, sex, tumor size, and tumor location, and we fitted a logistic regression model on the DL output and the clinical features. A nomogram was drawn to visualize the logistic regression model. Figure 5 shows the schematic diagram of Model SEG-CL-NC. Detailed U-Net-based Network and CoaT-based Network structures are shown in Supplementary Figures S2, S3, and S4.
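To make the late-fusion step concrete, the sketch below shows how a fitted logistic regression would turn the DL output and the clinical features into a single malignancy probability; a nomogram is a graphical rendering of the same linear predictor. The coefficient values are hypothetical placeholders (the fitted values are not reported in the text), and tumor location is simplified to a single sacrum indicator for brevity.

```python
import math

# Hypothetical coefficients for illustration only; NOT the study's fitted values.
COEFS = {
    "intercept": -3.0,
    "dl_prob": 4.0,    # DL model's predicted probability of malignancy
    "age": 0.02,       # per year
    "male": 0.5,       # 1 = male, 0 = female
    "size_cm": 0.1,    # maximal tumor size in cm
    "sacrum": -0.8,    # 1 = sacral location, 0 = elsewhere (simplified encoding)
}

def fused_probability(dl_prob, age, male, size_cm, sacrum, coefs=COEFS):
    """Late fusion: logistic regression over the DL output and clinical features."""
    z = (coefs["intercept"]
         + coefs["dl_prob"] * dl_prob
         + coefs["age"] * age
         + coefs["male"] * male
         + coefs["size_cm"] * size_cm
         + coefs["sacrum"] * sacrum)
    return 1.0 / (1.0 + math.exp(-z))

# Same hypothetical patient, two different DL outputs.
p_low = fused_probability(dl_prob=0.2, age=45, male=1, size_cm=8.0, sacrum=1)
p_high = fused_probability(dl_prob=0.9, age=45, male=1, size_cm=8.0, sacrum=1)
print(round(p_low, 3), round(p_high, 3))
```

Fusing at the output level keeps the image model unchanged and lets the clinical variables be inspected directly through their coefficients.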

Fig. 5: Schematic diagram of Model SEG-CL-NC.

Panel A illustrates the entire inference process of Model SEG-CL-NC. Panel B shows the training process for the segmentation component, while Panel C depicts the training process for the classification component.

Radiologist’s diagnosis

To compare the model’s diagnostic accuracy (ACC) with that of radiologists, two senior residents (YL and JWZ) and one junior attending physician (TYZ), with 4, 5, and 6 years of clinical experience, respectively, independently diagnosed the 105 patients in Internal Test Set 2 without using our DL model. The original images of each patient were loaded into RadiAnt DICOM Viewer (version 2021.2.2). To be consistent with the information provided to the model, the three readers saw only the patient’s images, sex, and age; the rest of the clinical information was hidden. The readers could also measure the maximum diameter of the lesion in the viewer. The diagnostic result (benign or malignant) and the time spent were recorded for each patient. The diagnostic time was measured from the moment the reader opened the patient’s images to the moment the diagnosis was made.

Statistical analysis

The performance of the different models was assessed using the area under the receiver operating characteristic curve (AUC), ACC, sensitivity, and specificity. Diagnostic time, ACC, sensitivity, and specificity were used as evaluation indexes for the radiologists’ diagnoses. Each model’s predicted probabilities were converted to binary labels using the threshold that maximized Youden’s index in the validation set38. Statistical analysis was performed with R software (R Core Team) version 3.4.3. The two-sample t-test was used to compare continuous variables, and the chi-squared test was used to compare categorical variables between groups. The Dice score and the intersection over union (IoU) were used to evaluate segmentations. All statistical tests were two-sided, and a P value less than 0.05 was considered statistically significant.
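The thresholding step can be illustrated with a short sketch: Youden's index J = sensitivity + specificity − 1 is evaluated at every candidate cut-off on the validation scores, and the cut-off with the largest J is then applied unchanged to the test sets. The function name and the toy scores are illustrative assumptions; the study's analysis was carried out in R.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive (malignant) when its score is >= threshold.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:               # ties keep the lowest threshold
            best_threshold, best_j = t, j
    return best_threshold, best_j

# Toy validation data (made-up values).
val_labels = [0, 0, 1, 0, 1, 1, 1]
val_scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
threshold, j_index = youden_threshold(val_labels, val_scores)

# Apply the validation-derived threshold to new (test) scores.
test_scores = [0.3, 0.6, 0.95]
test_predictions = [int(s >= threshold) for s in test_scores]
print(threshold, j_index, test_predictions)
```

Choosing the threshold on the validation set rather than on the test sets avoids optimistic bias in the reported sensitivity and specificity.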