Abstract
This study developed an end-to-end deep learning (DL) model for diagnosing benign and malignant pelvic and sacral tumors (PSTs) on non-enhanced MRI. Retrospective data from 835 patients across four hospitals were used to train, validate, and test the models, and six diagnostic models with varied input sources were compared. Model performance (AUC and accuracy [ACC]) and diagnosis times were also compared with those of three radiologists. The proposed Model SEG-CL-NC achieved AUC/ACC of 0.823/0.776 in Internal Test Set 1 and 0.836/0.781 in Internal Test Set 2. In External Dataset Centers 2, 3, and 4, its ACC was 0.714, 0.740, and 0.756, comparable to the contrast-enhanced models and the radiologists (P > 0.05), while its diagnosis time was significantly shorter than that of the radiologists (P < 0.01). These results suggest that Model SEG-CL-NC achieves performance comparable to contrast-enhanced models and radiologists in diagnosing benign and malignant PSTs, offering an accurate, efficient, and cost-effective tool for clinical practice.
Introduction
Pelvic and sacral tumors (PSTs) are rare, and metastatic tumor is the most common type owing to the prominent hematopoietic function of this site1,2,3. Primary benign PSTs mainly include giant cell tumors, schwannoma, neurofibroma, osteoid osteoma, and osteoblastoma4,5,6. Primary malignant PSTs mainly include chordoma, chondrosarcoma, osteosarcoma, Ewing’s sarcoma, and lymphoma7,8,9. Because PSTs are rare and share similar clinical and imaging features, radiologists have difficulty acquiring sufficient clinical experience to make a definite diagnosis10. In the early stage, PSTs are usually small and asymptomatic; by the time they are detected, they are usually large and compress surrounding organs, often requiring surgical intervention. Wide resection should be selected for all primary malignant sacral tumors, and for benign lesions involving the lower segments when both S3 roots can be preserved3. However, the prognoses of patients with PSTs are poor owing to the complex anatomical structures, the many surrounding organs, and the difficulty of operating at this site2,11. Consequently, the diagnostic challenges posed by PSTs (their rarity, symptom latency, imaging similarities, and the critical need for early detection to enable potentially curative but complex surgery) underscore the urgent requirement for accurate and efficient diagnostic tools12.
In recent years, deep learning (DL) has shown great potential in exploring the nature of tumors and has been extensively used in bone tumor diagnosis, efficacy evaluation, and prognosis prediction13,14,15,16,17,18. Few studies have used DL models to distinguish between benign and malignant bone tumors and were mainly based on plain films12,19,20,21. Compared with plain films, multi-sequence magnetic resonance imaging (MRI) can better display the bone marrow infiltration and surrounding soft tissue involvement of PSTs. Owing to the large sizes of PSTs, the manual segmentation of lesions is time-consuming4,22. An MRI-based DL segmentation model may be able to automatically segment PSTs lesions and reduce the tedious process of manually delineating lesions. In addition, attention-based DL models have been applied to medical image classification problems and have shown better aggregation and representation capabilities23. Crucially, non-enhanced MRI is the cornerstone of initial bone lesion evaluation in clinical practice due to its wide availability, absence of contrast-related risks (e.g., nephrogenic systemic fibrosis, allergic reactions), and lower cost compared to contrast-enhanced protocols. However, interpreting complex non-enhanced MRI studies for rare PSTs remains challenging, particularly for less experienced radiologists. Therefore, developing a robust DL model capable of automatically diagnosing PSTs directly on routine non-enhanced MRI sequences holds significant promise for directly addressing the aforementioned diagnostic challenges. Such a tool could potentially: (1) augment radiologists’ diagnostic confidence and accuracy, especially in settings with limited PST expertise; (2) expedite the diagnostic workflow by automating lesion segmentation and analysis, reducing time-to-diagnosis; and (3) leverage the most accessible and safest initial MRI protocol, maximizing clinical utility and impact.
The aim of our study was to develop an end-to-end DL model for the diagnosis of benign and malignant PSTs using non-enhanced MRI.
Results
Patient characteristics
A total of 835 patients (441 males and 394 females; median age 45.0 years [interquartile range: 29.0–58.0]; overall range 3–83 years) were included in this study (see Table 1). The cohort comprised 621 malignant and 214 benign tumors. Centers 2, 3, and 4 contributed 17, 19, and 18 benign tumors and 46, 58, and 23 malignant tumors, respectively. Clinical data for patients in the different sets are detailed in Supplementary Table 1S.
We found significant differences in age, sex, tumor size, and tumor location between patients with benign and malignant tumors (P < 0.01). The median age of patients with malignant tumors was 48.0 years (29.0, 60.0), significantly higher than that of patients with benign tumors, 38.0 years (28.0, 51.0) (Z = −4.483; P < 0.001). The sex ratio also differed significantly between the two groups (χ2 = 14.55; P < 0.001): males predominated among patients with malignant tumors, whereas females predominated among patients with benign tumors. In addition, malignant tumors were significantly larger than benign tumors (Z = −3.431; P = 0.001). Benign tumors were most frequently located in the sacrum (168 cases; 78.5%), followed by the ilium (19 cases; 8.9%); malignant tumors were likewise most common in the sacrum (289; 46.5%), followed by the ilium (136; 21.9%). Tumor location distribution differed significantly between the benign and malignant groups (χ2 = 69.259; P < 0.001).
Performance of different models
The average Dice score and IoU value of the segmentation model were 0.758 and 0.610, respectively. For T1-w, T2-w, DWI, and CET1-w sequences, Dice scores were 0.606, 0.792, 0.694, and 0.728, and IoU values were 0.472, 0.678, 0.573, and 0.598, respectively. As shown in Fig. 1, the segmentation model achieved a relatively good segmentation effect.
A, B T1-w images of a 62-year-old female patient with neurofibroma; C, D T2-w images of a 44-year-old male patient with mucinous papillary ependymoma; E, F DWI images of a 45-year-old female patient with schwannoma; G, H CET1-w images of the same patient as (A) and (B). The second row shows T1-w, T2-w, DWI, and CET1-w images (from left to right) overlaid with the model’s and the radiologist’s segmentations; the darker color represents the radiologist’s segmentation.
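For reference, the Dice score and IoU used above can be computed directly from a pair of binary masks. A minimal NumPy sketch (function names are ours, not from the study’s released code):

```python
import numpy as np

def dice_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum() + eps)

def iou_score(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection over union between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / (union + eps)

# 2x2 prediction overlapping a 2x3 ground truth in 4 pixels:
pred = np.zeros((4, 4)); pred[1:3, 1:3] = 1
truth = np.zeros((4, 4)); truth[1:3, 1:4] = 1
# Dice = 2*4/(4+6) = 0.8, IoU = 4/6 ≈ 0.667
```

In the study, these per-mask values are averaged over cases and reported per sequence.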
Among Models ORI, MAN, and SEG, Model SEG had the best performance (Fig. 2; Table 2). Model ORI achieved an AUC of 0.735 and ACC of 0.759 in the Internal Test Set 1 and an AUC of 0.697 and ACC of 0.686 in the Internal Test Set 2. Model MAN achieved an AUC of 0.728 and ACC of 0.716 in the Internal Test Set 1. Model SEG had an AUC of 0.852 and ACC of 0.767 in the Internal Test Set 1 and an AUC of 0.736 and ACC of 0.743 in the Internal Test Set 2. The DeLong test showed that Model SEG was significantly better than Model ORI (P = 0.01) and Model MAN (P = 0.02) in the Internal Test Set 1. However, no significant difference between Models ORI and SEG was found in the Internal Test Set 2 (P = 0.58).
The ROC curve and precision recall curve (PRC) of different models in Internal Test Set 1 (A, B) and Internal Test Set 2 (C, D). A and C ROC curve of all models. B and D PRC of all models.
Model SEG-NC achieved an AUC of 0.825 and ACC of 0.750 in the Internal Test Set 1 and an AUC of 0.735 and ACC of 0.724 in the Internal Test Set 2. The DeLong test showed no significant difference between Models SEG and SEG-NC in the Internal Test Set 1 (P = 0.06) and Internal Test Set 2 (P = 0.92). Model SEG-CL had an AUC of 0.852 and ACC of 0.784 in the Internal Test Set 1 and an AUC of 0.840 and ACC of 0.800 in the Internal Test Set 2. Model SEG-CL-NC achieved 82.3% AUC (95% confidence interval [CI]: 72.6, 90.1), 77.6% ACC (95% CI: 69.8, 84.5), 82.7% sensitivity (95% CI: 74.0, 90.6), and 65.7% specificity (95% CI: 48.6, 81.2) in the Internal Test Set 1 and 83.6% AUC (95% CI: 74.9, 90.7), 78.1% ACC (95% CI: 70.5, 85.7), 82.5% sensitivity (95% CI: 74.1, 90.7), and 64.0% specificity (95% CI: 44.8, 82.4) in the Internal Test Set 2. The DeLong test showed no significant difference between Models SEG-CL and SEG-CL-NC in the Internal Test Set 1 (P = 0.22) and Internal Test Set 2 (P = 0.82).
In addition, we found no difference between the AUCs and ACCs of Models SEG and SEG-CL in the Internal Test Set 1 (AUC P = 0.99; ACC P = 0.625) but a significant difference in the Internal Test Set 2 (AUC P = 0.004; ACC P = 0.03). Similarly, the AUCs and ACCs of Models SEG-NC and SEG-CL-NC did not differ in the Internal Test Set 1 (AUC P = 0.94; ACC P = 0.25) but significantly differed in the Internal Test Set 2 (AUC P = 0.01; ACC P = 0.03). Figure 3 shows confusion matrices for differentiating between benign and malignant PSTs in the test sets. Figure S1 shows the nomogram of Model SEG-CL-NC.
Panel A displays the confusion matrix for Internal Test Set 1, while Panel B shows the matrix for Internal Test Set 2. The x-axis represents the predictions made by Model SEG-CL-NC, and the y-axis represents the pathological results. 0 = benign, 1 = malignant.
For the External Dataset overall, the ACC was 0.734, sensitivity 0.680, and specificity 0.847. The ACCs for Centers 2, 3, and 4 were 0.714, 0.740, and 0.756; sensitivities were 0.690, 0.660, and 0.704; and specificities were 0.762, 0.917, and 0.857, respectively.
Performance of radiologist’s diagnosis
The diagnostic ACC values of the two residents and one junior attending physician were 0.819, 0.771, and 0.790; sensitivity values were 0.900, 0.925, and 0.862; and specificity values were 0.560, 0.280, and 0.560, respectively. Although the diagnostic ACC values of Models SEG-CL and SEG-CL-NC were slightly lower than those of the two residents and the junior attending physician, the difference was nonsignificant (P > 0.05). The average times for the three physicians to diagnose one patient were 5.61, 4.42, and 2.94 min, respectively. In contrast, Models SEG-CL and SEG-CL-NC required only 2.8 and 2.1 s to provide segmentation and classification results, significantly less than the time required by the radiologists (Table 3).
Discussion
In this study, we developed an end-to-end DL model (Model SEG-CL-NC) for diagnosing benign and malignant PSTs using non-enhanced MRI. We evaluated its efficacy by comparing it with five other diagnostic models and with radiologists. Our findings demonstrated that Model SEG-CL-NC achieved comparable diagnostic accuracy to contrast-enhanced MRI and radiologists, with the added benefit of significantly shorter reading times compared to radiologists.
Patients with PSTs exhibit similar clinical and imaging features, posing challenges for preoperative diagnosis. Our study identified significant differences in sex, age, tumor size, and location between benign and malignant PSTs, consistent with previous studies2,3,24. Compared with patients with benign tumors, those with malignant PSTs were older and more often male, and their tumors were larger and mostly located in the sacrum and ilium. These distinctions likely correlate with the diverse pathological profiles of benign and malignant PSTs. Malignant PSTs commonly include metastatic tumors, bone sarcomas, and chordomas, whereas benign tumors typically encompass giant cell tumors of the bone and neurogenic tumors. Our study highlights the enhanced performance of models that integrate clinical information. By integrating clinical data, our model aligns more closely with real-world clinical practice, potentially improving overall diagnostic accuracy and utility in clinical settings25.
Owing to the large sizes of PSTs, the manual segmentation of lesions is time-consuming and is susceptible to interobserver variability4,26,27. In this study, we used coarse labeled ROIs to train the model and refined labeled data to test the model. Our study demonstrated that the segmentation-based diagnostic model (Model SEG) outperformed both the diagnostic model based solely on original images (Model ORI) and the model relying on manual lesion delineation (Model MAN). Model SEG seamlessly integrated segmentation with diagnosis, eliminating the need for manual lesion delineation. This approach sets a precedent for future research, indicating that training models in this manner can enhance algorithm efficiency, reduce manual annotation costs, improve accessibility, and ensure ease of use in clinical applications.
Our results showed that the model based on non-enhanced MR images obtained an ACC comparable to that of the enhanced model. CET1-w may generate high-quality images of pelvic tumors, showing enhanced regions within tumors and distinguishing between necrotic tissues and solid tumors4,28. However, the utility of MR-enhanced images in bone tumor treatment is primarily limited to guiding biopsy and planning tumor resection29. In contrast, non-enhanced MRI scans are more commonly employed in clinical practice for diagnosing bone lesions. This approach is advantageous for patients who may be unwilling to undergo enhanced MRI due to factors such as fear of injections (especially children) or allergies to contrast media. Additionally, utilizing non-enhanced MRI can potentially reduce medical costs and shorten examination times, thereby enhancing overall efficiency.
Our proposed non-enhanced MRI-based model (Model SEG-CL-NC) achieves automatic lesion segmentation and diagnosis, providing an end-to-end diagnostic solution. Following a non-enhanced MR scan for patients suspected of PSTs in clinical settings, images are automatically transmitted to the radiologist’s diagnostic system. Our model then autonomously identifies and segments the lesion, providing a benign or malignant diagnosis promptly. This capability assists radiologists in making accurate diagnoses efficiently. Furthermore, our proposed Model SEG-CL-NC demonstrated performance comparable to that of radiologists while significantly reducing diagnostic time. In our hospital, which manages a substantial number of PST cases, the implementation of our model has enhanced physician efficiency, minimized the risk of misdiagnosing primary bone tumors, and facilitated personalized patient treatment. Malignant cases particularly benefit from the model by enabling more aggressive treatment strategies in clinical practice. Moreover, our model’s potential for generalization to other medical centers is promising, offering utility to physicians with varying levels of expertise in bone tumor imaging, including those in smaller hospitals12. Although our model performed less effectively at other centers compared to our own, this discrepancy may be due to differences in scanners, acquisition parameters and inconsistent scanning protocols across centers. Specifically, our model integrates T1-w, T2-w, and DWI sequences, which enhances its performance. However, data from other centers might not include all three sequences simultaneously (e.g., Center 3 provided only T2-w and T1 TSE), leading to incomplete input sequences. Future research will address multicenter protocol variability through: 1) deep harmonization networks for parameter standardization, 2) sequence-robust transformers to handle partial inputs, and 3) federated calibration systems30,31,32,33. 
Additionally, prospective multicenter trials will further validate these solutions for broader clinical implementation.
This study has several limitations. First, excluding patients with incomplete or poor-quality images may introduce selection bias, potentially compromising real-world generalizability. Second, restricting the radiologist comparison to readers with 4–6 years of experience precludes benchmarking against senior expertise; future trials will implement multi-tier radiologist assessment. Third, formal cost-effectiveness analysis and quantification of segmentation stability through repeated measurements were not examined. Fourth, restricting the analysis to single primary PSTs overlooks multifocal malignancies; multiple PSTs are more common in malignancies (such as metastases, multiple myeloma, and lymphoma) and are easier to diagnose. Finally, the limited assessment of cross-center imaging variability constrains the model’s generalizability. Future research will focus on addressing protocol variability across multicenter settings and validating the proposed approaches through prospective multicenter trials.
In conclusion, our end-to-end DL Model SEG-CL-NC exhibited diagnostic performance comparable to contrast-enhanced models and radiologists in distinguishing benign and malignant PSTs, which may provide an accurate, efficient, and cost-effective tool for clinical practice.
Methods
Patients and data acquisition
A total of 1211 patients with pathologically confirmed benign or malignant PSTs, treated at four hospitals between April 2011 and August 2024, were retrospectively analyzed.
Initially, we examined data from 1021 PST patients at our hospital (Center 1) for the period from April 2011 to May 2022. Patients scanned from April 2011 to June 2021 were included in Internal Dataset 1, while those from July 2021 to May 2022 were included in Internal Dataset 2. Because Internal Dataset 2 comprises more recent data, it better reflects the conditions under which the model would be applied.
The inclusion criteria for Center 1 were as follows: 1) a single lesion was found on MRI; 2) complete preoperative MRI, including T1-weighted (T1-w), T2-weighted (T2-w), diffusion-weighted imaging (DWI), and contrast-enhanced T1-weighted (CET1-w) sequences; 3) pathologically confirmed benign or malignant PSTs. Tumors classified as intermediate according to the WHO classification criteria were grouped as benign in this study5,14. The exclusion criteria for Center 1 were as follows: 1) multiple lesions (n = 50); 2) incomplete enhanced MR sequences (n = 254); 3) postoperative recurrence or severe image artifacts (n = 63).
Ultimately, 654 patients with PSTs from Center 1 were included in this study. Of these, 549 patients from Center 1 were assigned to Internal Dataset 1, which was further divided into 346 patients for the training set, 87 for the validation set, and 116 for Internal Test Set 1. Additionally, 105 patients from Center 1 were included in Internal Dataset 2, designated for Internal Test Set 2.
To further validate our model, data from 190 patients with PSTs at Centers 2, 3, and 4 were used as external test sets. All datasets adhered to the same inclusion and exclusion criteria as Center 1, except that incomplete MR sequences were not considered an exclusion criterion. This adjustment was made due to significant variations in scanning sequences between centers. Nine patients were excluded due to the presence of multiple lesions. Finally, 181 patients were included for external validation from Centers 2, 3, and 4, with 63 from Center 2, 77 from Center 3, and 41 from Center 4 (Fig. 4). Sex, age, tumor location, and maximal tumor size of the patients were also analyzed.
A retrospective analysis included 1211 patients with pathologically confirmed benign or malignant PSTs treated across four hospitals (April 2011 to August 2024). From Center 1 (1021 patients screened; 654 included) Internal Dataset 1 (549 patients, April 2011 to June 2021) comprised a training set (n = 346), validation set (n = 87), and Internal Test Set 1 (n = 116). Internal Dataset 2 (105 patients, July 2021 to May 2022) formed Internal Test Set 2. For external validation, 181 patients from Centers 2, 3, and 4 (External Test Set: Center 2 = 63, Center 3 = 77, Center 4 = 41) were included.
This retrospective study was approved by institutional review boards at four institutions, including Peking University People’s Hospital (Approval No.2020PHB293), Peking University Third Hospital (Approval No.M2023827), The First Affiliated Hospital of Guangxi Medical University (Approval No.2025-E0250), and The First Affiliated Hospital of Chongqing Medical University (Approval No.2023-139). Given the study’s retrospective nature and reliance on standard clinical protocols, the requirement for informed consent was waived by the Institutional Review Board. The study was conducted following the Declaration of Helsinki.
All images from Center 1 were acquired on Signa HDxt 3.0 T (GE Healthcare), Signa EXCITE 1.5 T (GE Healthcare), and Discovery 750 3.0 T (GE Healthcare) MR scanners. The acquisition parameters were as follows: axial T1-w liver acquisition with volume acceleration-flexible (LAVA-Flex) or axial T1-w FSE fs, repetition time (TR) = 3.8 ~ 700 ms, echo time (TE) = 1.7 ~ 7.8 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 4 ~ 7 mm, and field of view (FOV) = 38 × 38 cm ~ 42 × 42 cm. T2-w, TR = 2300 ~ 5119 ms, TE = 84.1 ~ 102.5 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 6 ~ 7 mm, and FOV = 38 × 38 cm ~ 44 × 44 cm. DWI, b value = 1000, TR = 4800 ~ 5000 ms, TE = 59.2 ~ 60 ms, matrix = 128 × 128 ~ 160 × 160, slice thickness = 6 ~ 7 mm, and FOV = 36 × 36 cm ~ 44 × 44 cm. Axial CET1-w was performed following the intravenous injection of 0.2 mL/kg contrast medium (gadopentetate dimeglumine) with a manual push or high-pressure syringe, TR = 3.8 ~ 700 ms, TE = 1.7 ~ 7.8 ms, matrix = 288 × 224 ~ 320 × 224, slice thickness = 4 ~ 7 mm, and FOV = 38 × 38 cm ~ 42 × 42 cm.
All images from Center 2 were acquired on Signa HDxt 1.5 T (GE Healthcare), Discovery 750 3.0 T (GE Healthcare), Discovery 750w 3.0 T (GE Healthcare), and uMR780 3.0 T (United Imaging Healthcare) MR scanners from December 2014 to July 2024. The acquisition parameters were as follows: axial T1 TSE, TR = 631 ms, TE = 11.1 ms, matrix = 320 × 256, slice thickness = 5 mm, and FOV = 36 × 36 cm. Axial T2-w, TR = 2700 ~ 4939 ms, TE = 58 ~ 100 ms, matrix = 288 × 224 ~ 320 × 256, slice thickness = 4 ~ 7 mm, and FOV = 24 × 20 cm ~ 42 × 42 cm. DWI, b value = 800, TR = 3000 ~ 6650 ms, TE = 62.9 ~ 65 ms, matrix = 128 × 64 ~ 128 × 128, slice thickness = 4 ~ 6 mm, and FOV = 24 × 20 cm ~ 44 × 44 cm.
All images from Center 3 were acquired on Signa HDxt 1.5 T (GE Healthcare), Signa Premier 3.0 T (GE Healthcare), Verio and Prisma 3.0 T (Siemens Healthcare), Altea 1.5 T (Siemens Healthcare), and Achieva 3.0 T (Philips Healthcare) MR scanners from May 2014 to June 2024. The acquisition parameters were as follows: axial T2-w, TR = 3040 ~ 5240 ms, TE = 64 ~ 130 ms, matrix = 288 × 192 ~ 400 × 306, slice thickness = 4 ~ 8 mm, and FOV = 32 × 32 cm ~ 64 × 64 cm. Axial T1 TSE, TR = 500 ~ 544 ms, TE = 8.6 ms, matrix = 280 × 312 ~ 512 × 370, slice thickness = 5 ~ 8 mm, and FOV = 52.8 × 51.2 cm ~ 64 × 64 cm.
All images from Center 4 were acquired on Signa HDxt 1.5 T (GE Healthcare), Discovery 750w 3.0 T (GE Healthcare), MAGNETOM ESSENZA 1.5 T (Siemens Healthcare), and Skyra 3.0 T (Siemens Healthcare) MR scanners from April 2013 to August 2024. The acquisition parameters were as follows: axial T2-w, TR = 2000 ~ 4970 ms, TE = 83 ~ 131 ms, matrix = 256 × 179 ~ 288 × 288, slice thickness = 4 ~ 5 mm, and FOV = 15 × 10.5 cm ~ 40 × 40 cm. Axial T1-w, TR = 150 ~ 630 ms, TE = 1.5 ~ 14 ms, matrix = 256 × 230 ~ 320 × 272, slice thickness = 5 ~ 7 mm, and FOV = 19.1 × 12.7 cm ~ 28 × 19.6 cm. DWI, b value = 1000, TR = 3689 ms, TE = 75.4 ms, matrix = 112 × 114, slice thickness = 5 mm, and FOV = 11.2 × 11.4 cm.
Radiologist’s segmentation
The PSTs of the retrospective dataset were manually segmented using ITK-SNAP software version 3.6.0 (www.itksnap.org)34. All regions of interest (ROIs) in the training set were coarsely labeled (5 layers above and below the largest level of the lesion), whereas ROIs in Internal Test Set 1 were finely labeled (all layers of the lesion). All lesions were carefully delineated along the lesion edge in each sequence by a musculoskeletal radiologist with 5 years of experience and validated by a senior musculoskeletal radiologist with 15 years of experience. ROIs in Internal Test Set 2 were not manually segmented; the finely labeled ROIs of each sequence in Internal Test Set 1 served as the segmentation ground truth.
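The coarse labeling scheme above (5 layers above and below the largest lesion level) can be sketched as follows, assuming axis 0 indexes slices; `coarse_slice_range` is a hypothetical helper, not from the study’s code:

```python
import numpy as np

def coarse_slice_range(mask: np.ndarray, margin: int = 5):
    """Return the [start, stop) range of slices covering `margin` layers
    above and below the slice with the largest lesion area (axis 0 = slices)."""
    areas = mask.reshape(mask.shape[0], -1).sum(axis=1)
    center = int(np.argmax(areas))
    start = max(center - margin, 0)
    stop = min(center + margin + 1, mask.shape[0])
    return start, stop

# toy volume: the lesion is largest on slice 7
vol = np.zeros((20, 8, 8))
vol[5:10, 2:6, 2:6] = 1
vol[7, 1:7, 1:7] = 1
start, stop = coarse_slice_range(vol)  # slices start..stop-1 get coarse labels
```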
Model preprocessing
For segmentation model preprocessing, we first normalized the input MRI series and padded images with a width or height smaller than 224 to meet the crop requirement in the subsequent step. Then, we randomly selected eight slices with and without segmentation. After acquiring the slices, we used the slices above and below these images to produce three-channel images. Within each three-channel image, we randomly cropped it to 3 × 224 × 224 and then performed random flipping and rotation to decrease overfitting. The transformed image patches were fed into the segmentation model. During testing, each image slice was input into the segmentation model with its above and below slices without cropping.
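The 2.5D three-channel construction and random cropping described above might look like this in NumPy (a simplified sketch; the actual pipeline also applies random flipping and rotation, and the helper names are ours):

```python
import numpy as np

def to_25d(volume: np.ndarray, idx: int) -> np.ndarray:
    """Stack a slice with its neighbours above and below into a 3-channel
    image (2.5D input), clamping at the volume boundaries."""
    lo = max(idx - 1, 0)
    hi = min(idx + 1, volume.shape[0] - 1)
    return np.stack([volume[lo], volume[idx], volume[hi]], axis=0)

def random_crop(img: np.ndarray, size: int = 224, rng=None) -> np.ndarray:
    """Pad a CxHxW image if it is smaller than `size`, then crop a random
    C x size x size patch."""
    rng = rng or np.random.default_rng()
    _, h, w = img.shape
    img = np.pad(img, ((0, 0), (0, max(size - h, 0)), (0, max(size - w, 0))))
    _, h, w = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[:, y:y + size, x:x + size]

vol = np.random.default_rng(0).random((12, 240, 240))  # one normalized series
patch = random_crop(to_25d(vol, 4))                    # shape (3, 224, 224)
```

At test time, each slice and its neighbours are fed to the model without cropping, as described above.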
For diagnostic model preprocessing, we first normalized the input images and then cropped the cuboid region with ROI from the MRI series to reduce irrelevant information. After determining the ROI, we randomly selected 12 slices and cropped them into 12 × 169 × 169 slices. Data augmentation techniques including random flipping and rotation were then performed on the cropped patches and finally sent to the classification model.
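Cropping the cuboid ROI region, as described, reduces to taking the bounding box of the lesion mask; a minimal sketch (the helper name is ours):

```python
import numpy as np

def crop_to_roi(volume: np.ndarray, mask: np.ndarray, pad: int = 0) -> np.ndarray:
    """Crop the volume to the tight bounding cuboid of a binary ROI mask,
    optionally padded by `pad` voxels on every side."""
    coords = np.argwhere(mask > 0)
    lo = np.maximum(coords.min(axis=0) - pad, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + pad, volume.shape)
    return volume[tuple(slice(a, b) for a, b in zip(lo, hi))]

vol = np.random.default_rng(1).random((30, 128, 128))
mask = np.zeros_like(vol)
mask[10:22, 40:80, 50:90] = 1          # lesion mask
roi = crop_to_roi(vol, mask)           # shape (12, 40, 40)
```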
DL model development
This study compared six diagnostic models differentiated by input data source: 1) the original-image-based model (ORI), trained directly on original images; 2) the manual-annotation-based model (MAN), utilizing radiologists’ manual lesion delineations; 3) the segmentation-based model (SEG), employing an automated segmentation model to eliminate manual delineation; 4) the non-contrast segmentation-based model (SEG-NC), applying the automated segmentation model to non-contrast MR sequences (excluding CET1-w); 5) the clinical-feature-augmented segmentation-based model (SEG-CL), combining the automated segmentation model with clinical features; and 6) the non-contrast clinical-feature-augmented segmentation-based model (SEG-CL-NC), integrating the automated segmentation model, clinical features, and non-contrast MR sequences (excluding CET1-w). See Table 4 for details.
The segmentation and diagnostic models were trained separately, and the optimal diagnostic model was trained according to the segmentation model’s predicted mask. In the segmentation stage, we adapted U-Net to be the structure of the segmentation model as it has shown promising segmentation performance in a recent study35. We also used MobilenetV2 initialized with ImageNet pre-trained weights as the encoder to improve model performance. The 2.5D training method proposed by Lv et al.36 was applied to train our segmentation model instead of 2D or 3D. The 2.5D segmentation method contained more spatial information than the 2D segmentation method. Given that some ROIs were large, 3D segmentation occupied large memory, which restricted model performance. All MRI sequences were input into the segmentation model for training after preprocessing.
In the diagnostic stage, given that all patients had undergone multi-sequence MRI scans, we selected a state-of-the-art CoaT-based transformer network together with multiple instance learning strategy proposed by Huang et al.37 as our diagnostic model, which is proven efficient in multi-modality input classification problems. In Models SEG-CL and SEG-CL-NC, late fusion was used to combine the DL output with clinical information. Clinical information included age, sex, tumor size, and tumor location, and we executed logistic regression to fit DL and clinical features. A nomogram was drawn for the visualization of our logistic regression model. Figure 5 shows the schematic diagram of Model SEG-CL-NC. Detailed U-Net-based Network and CoaT-based Network structures are shown in supplementary Figures S2, S3, and S4.
Panel A illustrates the entire inference process of Model SEG-CL-NC. Panel B shows the training process for the segmentation component, while Panel C depicts the training process for the classification component.
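The late-fusion step above fits a logistic regression on the DL output together with the clinical features. A self-contained sketch on synthetic data (a plain gradient-descent fit; the study used standard logistic regression in R, so this NumPy version is illustrative only):

```python
import numpy as np

def fit_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.1,
                 n_iter: int = 5000) -> np.ndarray:
    """Gradient-descent logistic regression; returns weights with bias last."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-np.clip(Xb @ w, -30, 30)))

# fused features per patient: [DL malignancy probability, age, tumor size],
# all scaled to [0, 1]; labels are synthetic for illustration
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] + 0.3 * X[:, 1] > 0.7).astype(float)
w = fit_logistic(X, y)
acc = np.mean((predict_proba(X, w) > 0.5) == y)
```

The fitted coefficients are what the nomogram in Figure S1 visualizes: each clinical feature contributes a weighted score that is summed with the DL output.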
Radiologist’s diagnosis
To compare the models’ diagnostic accuracy (ACC) with that of radiologists, two senior residents (YL and JWZ) and one junior attending physician (TYZ), with 4, 5, and 6 years of clinical experience, respectively, independently diagnosed the 105 patients in the prospective dataset without using our DL model. The original images were loaded into RadiAnt DICOM Viewer 2021.2.2 software. To keep the available information consistent with the model’s input, the three readers saw only the patient’s images, sex, and age; all other clinical information was hidden. Readers could also measure the maximum diameter of the lesion in the viewer. The diagnostic result (benign or malignant) and the reading time were recorded; diagnostic time was measured from the moment the reader opened the patient’s images until the diagnosis was made.
Statistical analysis
The performance of the different models was assessed using the area under the receiver operating characteristic curve (AUC), ACC, sensitivity, and specificity. Diagnostic time, ACC, sensitivity, and specificity were used as evaluation indexes for the radiologists’ diagnoses. All models’ prediction probabilities were transformed into binary labels according to Youden’s index in the validation sets38. Statistical analysis was performed with R software (R Core Team) version 3.4.3. The two-sample t-test was used to compare continuous variables, and the chi-squared test was used to compare categorical variables between groups. The Dice score and intersection over union (IoU) were used to evaluate segmentations. All statistical tests were two-sided, and a P value less than 0.05 was considered statistically significant.
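Youden’s index thresholding, used above to binarize the prediction probabilities on the validation sets, can be sketched as:

```python
import numpy as np

def youden_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Cut-off that maximizes Youden's J = sensitivity + specificity - 1."""
    best_t, best_j = 0.5, -1.0
    for t in np.unique(y_score):
        pred = y_score >= t
        sens = np.sum(pred & (y_true == 1)) / max(np.sum(y_true == 1), 1)
        spec = np.sum(~pred & (y_true == 0)) / max(np.sum(y_true == 0), 1)
        if sens + spec - 1 > best_j:
            best_j, best_t = sens + spec - 1, t
    return best_t

# toy validation-set probabilities (0 = benign, 1 = malignant)
y_val = np.array([0, 0, 0, 1, 1, 1])
p_val = np.array([0.10, 0.20, 0.45, 0.50, 0.80, 0.90])
cut = youden_threshold(y_val, p_val)   # 0.50 separates the classes perfectly
```

The cut-off chosen on the validation set is then applied unchanged to the test sets.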
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Code availability
The code required to reproduce these findings is available for download from https://github.com/chencancan1018/BoneTumorRecognition.
References
Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer statistics, 2022. CA Cancer J. Clin. 72, 7–33 (2022).
Yin, P. et al. Radiomics models for the preoperative prediction of pelvic and sacral tumor types: a single-center retrospective study of 795 cases. Front. Oncol. 11, 709659 (2021).
Yin, P. et al. Machine and deep learning based radiomics models for preoperative prediction of benign and malignant sacral tumors. Front. Oncol. 10, 564725 (2020).
Yin, P. et al. A triple-classification radiomics model for the differentiation of primary chordoma, giant cell tumor, and metastatic tumor of sacrum based on T2-weighted and contrast-enhanced T1-weighted MRI. J. Magn. Reson. Imaging 49, 752–759 (2019).
WHO Classification of Tumours Editorial Board eds. World Health Organization classification of soft tissue and bone tumours. 5th ed. (IARC Press, 2020).
Si, M. J. et al. Differentiation of primary chordoma, giant cell tumor and schwannoma of the sacrum by CT and MRI. Eur. J. Radiol. 82, 2309–2315 (2013).
Thornton, E. et al. Imaging features of primary and secondary malignant tumours of the sacrum. Br. J. Radiol. 85, 279–286 (2012).
Gerber, S. et al. Imaging of sacral tumours. Skelet. Radiol. 37, 277–289 (2008).
Olson, J. T., Wenger, D. E., Rose, P. S., Petersen, I. A. & Broski, S. M. Chordoma: 18F-FDG PET/CT and MRI imaging features. Skelet. Radiol. 50, 1657–1666 (2021).
Sambri, A. et al. Primary tumors of the sacrum: imaging findings. Curr. Med. Imaging 18, 170–186 (2022).
Marmouset, D. et al. Characteristics, survivals and risk factors of surgical site infections after En Bloc sacrectomy for primary malignant sacral tumors at a single center. Orthop. Traumatol. Surg. Res. 108, 103197 (2022).
von Schacky, C. E. et al. Multitask deep learning for segmentation and classification of primary bone tumors on radiographs. Radiology 301, 398–406 (2021).
Zhao, W. et al. GMILT: a novel transformer network that can noninvasively predict EGFR mutation status. IEEE Trans. Neural Netw. Learn. Syst. 35, 7324–7338 (2024).
Eweje, F. R. et al. Deep learning for classification of bone lesions on routine MRI. EBioMedicine 68, 103402 (2021).
Liu, H. et al. Benign and malignant diagnosis of spinal tumors based on deep learning and weighted fusion framework on MRI. Insights Imaging 13, 87 (2022).
Li, M. D. et al. Artificial intelligence applied to musculoskeletal oncology: a systematic review. Skelet. Radiol. 51, 245–256 (2022).
He, Y. et al. Convolutional neural network to predict the local recurrence of giant cell tumor of bone after curettage based on pre-surgery magnetic resonance images. Eur. Radiol. 29, 5441–5451 (2019).
Vogrin, M., Trojner, T. & Kelc, R. Artificial intelligence in musculoskeletal oncological radiology. Radiol. Oncol. 55, 1–6 (2020).
Liu, R. et al. A deep learning-machine learning fusion approach for the classification of benign, malignant, and intermediate bone tumors. Eur. Radiol. 32, 1371–1383 (2022).
He, Y. et al. Deep learning-based classification of primary bone tumors on radiographs: a preliminary study. EBioMedicine 62, 103121 (2020).
Huang, P. Y. et al. Osteomyelitis of the femur mimicking bone tumors: a review of 10 cases. World J. Surg. Oncol. 11, 283 (2013).
Langevelde, K. V., Vucht, N. V., Tsukamoto, S., Mavrogenis, A. F. & Errani, C. Radiological assessment of giant cell tumour of bone in the sacrum: from diagnosis to treatment response evaluation. Curr. Med. Imaging 18, 162–169 (2022).
Ilse, M., Tomczak, J. M. & Welling, M. Attention-based deep multiple instance learning. Proc. 35th Int. Conf. Mach. Learn. PMLR 80, 2127–2136 (2018).
Yin, P., Sun, C., Wang, S., Chen, L. & Hong, N. Clinical-deep neural network and clinical-radiomics nomograms for predicting the intraoperative massive blood loss of pelvic and sacral tumors. Front. Oncol. 11, 752672 (2021).
Murphey, M. D. & Kransdorf, M. J. Staging and classification of primary musculoskeletal bone and soft-tissue tumors according to the 2020 WHO update, from the AJR special series on cancer staging. AJR Am. J. Roentgenol. 217, 1038–1052 (2021).
Avanzo, M., Stancanello, J. & El Naqa, I. Beyond imaging: the promise of radiomics. Phys. Med. 38, 122–139 (2017).
Larue, R. T., Defraene, G., De Ruysscher, D., Lambin, P. & van Elmpt, W. Quantitative radiomics studies for tissue characterization: a review of technology and methodological procedures. Br. J. Radiol. 90, 20160665 (2017).
Samji, K. et al. Comparison of high-resolution T1W 3D GRE (LAVA) with 2-point Dixon fat/water separation (FLEX) to T1W fast spin echo (FSE) in prostate cancer (PCa). Clin. Imaging 40, 407–413 (2016).
Sundaram, M. The use of gadolinium in the MR imaging of bone tumors. Semin Ultrasound CT MR 18, 307–311 (1997).
Zhou, Y. et al. Development and validation of a deep learning-based framework for automated lung CT segmentation and acute respiratory distress syndrome prediction: a multicenter cohort study. EClinicalMedicine 75, 102772 (2024).
Ye, Z. et al. Deep learning algorithms for melanoma detection using dermoscopic images: a systematic review and meta-analysis. Artif. Intell. Med. 155, 102934 (2024).
Morelli, L. et al. Addressing intra- and inter-institution variability of a radiomic framework based on apparent diffusion coefficient in prostate cancer. Med. Phys. 51, 8096–8107 (2024).
Lew, C. O. et al. Artificial intelligence outcome prediction in neonates with encephalopathy (AI-OPiNE). Radiol. Artif. Intell. 6, e240076 (2024).
Yushkevich, P. A. et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31, 1116–1128 (2006).
Chen, W. et al. Improving the diagnosis of acute ischemic stroke on non-contrast CT using deep learning: a multicenter study. Insights Imaging 13, 184 (2022).
Lv, P., Wang, J. & Wang, H. 2.5D lightweight RIU-Net for automatic liver and tumor segmentation from CT. Biomed. Signal Process. Control 75, 103567 (2022).
Huang, C. et al. Transformer-based deep-learning algorithm for discriminating demyelinating diseases of the central nervous system with neuroimaging. Front. Immunol. 13, 897959 (2022).
Ruopp, M. D., Perkins, N. J., Whitcomb, B. W. & Schisterman, E. F. Youden Index and optimal cut-point estimated from observations affected by a lower limit of detection. Biom. J. Biom. Z. 50, 419–430 (2008).
Acknowledgements
This study received funding from the National Natural Science Foundation of China (No. 82001764), Peking University People’s Hospital Scientific Research Development Funds (RDY2020-08, RS2021-10), and the Beijing United Imaging Research Institute of Intelligent Imaging Foundation (CRIBJQY202105).
Author information
Authors and Affiliations
Contributions
P.Y. designed the study, collected and analyzed the data, and prepared and edited the paper. K.L. participated in the clinical research process and edited the paper. R.R.C. participated in the clinical research process and edited the paper. Y.L. participated in the clinical research process. L.L. revised the paper. C.S. participated in the clinical research process. Y.L. participated in the clinical research process. T.Y.Z. participated in the clinical research process. J.W.Z. participated in the clinical research process. W.D.C. took part in the processing of data analysis and statistics. R.Z.Y. took part in the processing of data analysis and statistics. D.W.W. took part in the processing of data analysis and statistics. X.L. participated in the process of clinical guidance and article preparation. N.H. designed the study, ensured the integrity of the whole study and revised the paper. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
All authors declare no financial or non-financial competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yin, P., Liu, K., Chen, R. et al. End-to-end deep learning for the diagnosis of pelvic and sacral tumors using non-enhanced MRI: a multi-center study. npj Precis. Onc. 9, 286 (2025). https://doi.org/10.1038/s41698-025-01077-3
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41698-025-01077-3