Introduction

Ovarian cancer (OvCa) is the most lethal gynecological malignancy1,2 and is usually diagnosed at an advanced stage because of its asymptomatic nature in the early stages3,4, leading to high recurrence rates and poor prognosis5. Ovarian tumors are classified as benign, borderline, or malignant, and accurate classification is crucial for determining treatment strategies and surgical planning6. However, preoperative diagnostic methods, including imaging and the use of blood-based biomarkers, have limited accuracy, particularly for borderline ovarian tumors (BOTs). Intraoperative frozen section analysis shows a high concordance with final pathology (approximately 93–94%), but its reliability significantly decreases for BOTs. Recent studies report a wide range of diagnostic performance: sensitivity varies between 71 and 92%, overall accuracy between 69 and 82%, and positive predictive value between 81 and 88%, highlighting a persistent risk of BOT misclassification7,8,9. These diagnostic limitations result in overtreatment10, unintended loss of fertility, or inefficient use of surgical resources11, including limited operating room slots, surgical staff, and gynecologic oncologists, highlighting the need for improved preoperative classification methods12.

Recent advancements in artificial intelligence (AI) have expanded its application to various medical fields, including the classification of ovarian tumors13. Machine learning (ML) effectively analyzes numerical data such as blood test parameters to classify diseases or assess risk factors, whereas deep learning (DL), an ML subset, excels in processing complex data such as magnetic resonance imaging (MRI) data to detect patterns and anomalies with high accuracy. Kawakami et al. used ML with the Random Forest algorithm to distinguish between benign and malignant ovarian tumors (MOTs) and achieved a remarkable receiver operating characteristic area under the curve (ROC-AUC) of 0.96814. Similarly, Wang et al. developed a DL model using T2-weighted MRI data to classify BOTs and MOTs, achieving an ROC-AUC of 0.87, outperforming radiologists, who reported an AUC of 0.7515. While many studies have considered blood test or imaging data for ovarian tumor classification, few studies have integrated both modalities into a single classification model, suggesting a promising but underexplored research area. Furthermore, the lack of transparency in AI decision-making, commonly termed the “black-box problem”, limits its practical implementation16. Addressing this limitation and exploring integrated approaches may enhance the potential of AI in preoperative diagnostic support.

To address the literature gap, we aim to improve the preoperative classification accuracy of BOTs by developing a diagnostic approach that integrates an ML model for blood test classification and a DL model for MRI data classification, focusing on the binary classification of BOTs and MOTs, as the intermediate category in the three-class classification is more challenging owing to its less distinct features17. To evaluate the utility of the integrated approach, we developed five models: ML-only, DL-only, and three integrated models that combined both modalities. Our findings revealed that different integrated models excelled in detecting BOTs and MOTs, suggesting that combining multiple data modalities enhances diagnostic accuracy. This approach may facilitate accurate preoperative classification, contributing to appropriate treatment planning and the efficient use of limited medical resources.

Methods

Patient cohort

This study, which aimed to improve the preoperative classification accuracy of BOTs, was approved by the Ethics Committee of Nagoya University in accordance with the principles of the Declaration of Helsinki (approval number: 2019-0113), and the requirement for informed consent was waived, as this was a non-invasive, non-interventional study that did not involve the use of human-derived specimens. Instead, an opt-out procedure was implemented via the university website. A total of 285 patients with serous ovarian tumors who underwent surgery at Nagoya University Hospital between 2001 and 2023 were included in the initial cohort. The inclusion and exclusion criteria for patient selection are shown in Fig. 1. Of these 285 patients, the following were excluded: (1) 73 with benign tumors, (2) 47 with inadequate MRI data, (3) 27 lacking sufficient preoperative blood test data, (4) 24 with recurrent tumors, and (5) 5 with other coexisting cancers. After applying these criteria, 109 patients were included in the final analysis, comprising 31 with BOTs and 78 with MOTs.

Fig. 1

Patient flowchart.

Data collection

We used two types of preoperative data, blood test results and MRI, in addition to basic clinical information. Blood test data, including tumor markers such as CA125 and parameters from hematology, biochemistry, and coagulation tests, were primarily collected from the preoperative examinations required for treatment. Incomplete test parameters were removed during data preprocessing. Twenty-eight features were ultimately used, comprising two clinical variables (age and body mass index (BMI)) and 26 blood test parameters (Table 1). For image recognition, T2-weighted axial MRI data were used because of their effectiveness in detecting ovarian tumors compared with other available techniques such as computed tomography (CT) or ultrasonography18. MRI data were reviewed by two gynecologic oncologists, and the slice that best captured the ovarian tumor was selected for each case (evaluation slices). To develop image-based classifiers in clinical settings, data augmentation is commonly performed to increase the number of images available for model training. In this study, given the clinical nature of the data and the considerable variability in tumor position and size across images, the application of conventional augmentation techniques was limited. To address this challenge, we increased the number of images by incorporating multiple tumor-containing MRI slices from each patient’s scan.

Table 1 Features used in this study.

ML models with preoperative blood test data

To develop a representative ML-based classification model with preoperative blood test data, we compared three tree-based classification algorithms, namely Random Forest (RF)19, Extreme Gradient Boosting (XGB)20, and Light Gradient Boosting Machine (LGBM)21. Clinical information and numerical data from blood tests were used for model development. A five-fold cross-validation (CV) method was used for all datasets, and the average values of the five test sets for each model were used as the results. The developed models were evaluated using precision, recall, accuracy, ROC-AUC, PR-AUC with 95% confidence intervals (CIs), and the F1 score22, calculated on the basis of the confusion matrix. The seed value was fixed at 0 during the analysis. The selected model was subsequently used in the fusion models with an image-based classifier and served as a comparator in further analyses.
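As an illustrative sketch (not the authors' code), the comparison of the three tree-based classifiers under five-fold CV can be organized as follows; the file name, feature layout, label encoding (0 = BOT, 1 = MOT), and the use of stratified folds are assumptions for demonstration.

```python
# Sketch: baseline comparison of RF, XGB, and LGBM on tabular blood-test features
# with five-fold cross-validation and a fixed seed of 0, as described above.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

SEED = 0  # seed fixed at 0, as in the study

df = pd.read_csv("blood_test_features.csv")   # hypothetical file: 28 features + label
X = df.drop(columns=["label"]).values          # age, BMI, and 26 blood-test parameters
y = df["label"].values                         # 0 = BOT, 1 = MOT

models = {
    "RF": RandomForestClassifier(random_state=SEED),
    "XGB": XGBClassifier(random_state=SEED, eval_metric="logloss"),
    "LGBM": LGBMClassifier(random_state=SEED),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
for name, model in models.items():
    accs, aucs = [], []
    for train_idx, test_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        pred = (proba >= 0.5).astype(int)
        accs.append(accuracy_score(y[test_idx], pred))
        aucs.append(roc_auc_score(y[test_idx], proba))
    # average of the five test folds, reported as the model result
    print(f"{name}: accuracy={np.mean(accs):.3f}, ROC-AUC={np.mean(aucs):.3f}")
```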

DL models with MRI data

To develop image-based classifiers using preoperative MRI findings, we compared three image classification algorithms, namely Visual Geometry Group 16-layer network (VGG16)23,24, Residual Network with 50 layers (ResNet50)25, and Densely Connected Convolutional Network with 121 layers (DenseNet121)26. The study included 109 cases in total (BOT: 31; MOT: 78). We performed stratified, patient-level five-fold CV. In each fold, cases were split 1:4; the central tumor-containing slice from the one-fifth of cases was used exclusively as the test set, and all available tumor-containing slices from the remaining four-fifths were used for training and validation. This design ensured no patient overlap between the test and training/validation sets. Thus, in each fold, the training/validation dataset comprised 142 BOT and 398 MOT slices, and the test set comprised 35 BOT and 100 MOT slices. The batch size for training was set to 32, and the maximum number of epochs was limited to 100. Early stopping was applied if the minimum loss was not improved over five epochs. The developed classifier models were evaluated using the same set of metrics defined earlier for the ML analyses, and we report the mean values across CV folds. We also implemented Gradient-weighted Class Activation Mapping (Grad-CAM) to assess whether the image-based classifier (VGG16) focuses on tumor-relevant regions27. Grad-CAM maps were generated for the predicted class using the final convolutional block of VGG16. Heatmaps were produced post hoc and did not influence training. The same MRI preprocessing was applied as in model training.
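The following sketch outlines one way to implement the patient-level split and the VGG16 fine-tuning loop with early stopping described above; it is a minimal illustration under stated assumptions (hypothetical data loaders, array names, and helper functions), not the authors' implementation.

```python
# Sketch: VGG16 fine-tuning with batch size 32, up to 100 epochs, and early
# stopping when the minimum validation loss is not improved for 5 epochs.
import copy
import torch
import torch.nn as nn
from sklearn.model_selection import StratifiedKFold
from torchvision import models

def make_vgg16(num_classes: int = 2) -> nn.Module:
    net = models.vgg16(weights="IMAGENET1K_V1")       # ImageNet-pretrained backbone
    net.classifier[6] = nn.Linear(4096, num_classes)  # binary BOT/MOT head
    return net

def train_fold(train_loader, val_loader, device, max_epochs=100, patience=5):
    model = make_vgg16().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed optimizer/LR
    best_loss, best_state, stall = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in train_loader:                   # batches of 32 slices
            optimizer.zero_grad()
            loss = criterion(model(xb.to(device)), yb.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(xb.to(device)), yb.to(device)).item()
                           for xb, yb in val_loader) / len(val_loader)
        if val_loss < best_loss:                      # track the minimum loss
            best_loss, best_state, stall = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            stall += 1
            if stall >= patience:                     # early stopping after 5 stalled epochs
                break
    model.load_state_dict(best_state)
    return model

# Patient-level stratification: folds are defined on the 109 cases so that no
# patient appears in both the training/validation and test sets.
case_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# for train_cases, test_cases in case_cv.split(patient_ids, patient_labels):
#     build train_loader from all tumor-containing slices of train_cases and a
#     test loader from the central slice of each test case, then call train_fold().
```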

Development of fusion models using multimodal AI

The best-performing ML model with blood test data, LGBM, and the best-performing DL model with preoperative MRI data, VGG16, were subsequently used to develop and evaluate the fusion model. There are four major types of fusion strategies depending on the timing at which information from the two modalities is integrated: early fusion (integration at the input level), late fusion (integration at the output level), intermediate fusion (integration and joint learning at the intermediate feature level), and dense fusion (a combination of multiple fusion layers throughout the model)28. To develop the late fusion (L-F) model29, the LGBM-based ML and VGG16-based DL models developed as mentioned above were combined in the last step, and the final prediction of the L-F method was defined by performing one of three operations on the individual inference results: AND, OR, or a maximum-probability rule (using the larger of the two estimated probabilities).
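A minimal sketch of the three L-F decision rules is shown below; the label convention (0 = BOT, 1 = MOT), the 0.5 threshold, and the variable names are assumptions for illustration, not the authors' code.

```python
# Sketch: late-fusion decision rules (AND, OR, maximum probability) applied to
# the per-case malignancy probabilities produced by the ML and DL models.
import numpy as np

def late_fusion(p_ml: np.ndarray, p_dl: np.ndarray, rule: str = "and", thr: float = 0.5):
    """p_ml, p_dl: per-case estimated probabilities of malignancy (MOT)."""
    ml_mot = p_ml >= thr
    dl_mot = p_dl >= thr
    if rule == "and":       # MOT only if both models predict MOT; otherwise BOT
        return (ml_mot & dl_mot).astype(int)
    if rule == "or":        # MOT if either model predicts MOT
        return (ml_mot | dl_mot).astype(int)
    if rule == "max_prob":  # adopt the prediction of the more confident model
        return (np.maximum(p_ml, p_dl) >= thr).astype(int)
    raise ValueError(f"unknown rule: {rule}")

# Example: LGBM estimates 0.8 (MOT) but VGG16 estimates 0.3 (BOT);
# the AND rule therefore outputs BOT.
print(late_fusion(np.array([0.8]), np.array([0.3]), rule="and"))  # [0]
```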

To implement intermediate fusion (IM-F)30, features must first be extracted from the MRI data and then integrated as novel features into the tree-based classifier. Thus, we developed a novel model by integrating U-Net, a segmentation model commonly used to identify objects and their locations in images31, with a Variational Autoencoder (VAE)32, an improved version of the traditional autoencoder. The VAE defines the latent variable z on the basis of a probabilistic distribution (e.g., Gaussian), improving decoding accuracy and efficiency. We first pretrained a U-Net in an unsupervised manner with all available tumor-containing slices from the entire cohort. We then connected the encoder to XGB to form the IM-F model and ran stratified patient-level five-fold CV with 109 cases (BOT: 31; MOT: 78). To align with the ML setting and avoid patient overlap, classification used one slice per case (the central tumor-containing slice) for both the training/validation set and the held-out test set. The batch size for training was set to 32, and the maximum number of epochs was limited to 200. Early stopping was applied if the loss did not improve for five consecutive epochs.
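The following sketch illustrates the IM-F step of passing frozen-encoder latent features to XGB, under the assumption that the pretrained encoder maps each preprocessed slice to a 128-dimensional latent vector and that these image-derived features are concatenated with the tabular blood-test features; the function and variable names are hypothetical.

```python
# Sketch: intermediate fusion, with image latents from a frozen pretrained
# encoder combined with tabular features and fed to an XGBoost classifier.
import numpy as np
import torch
from xgboost import XGBClassifier

@torch.no_grad()
def extract_latents(encoder: torch.nn.Module, slices: torch.Tensor) -> np.ndarray:
    """slices: (N, C, H, W) tensor of preprocessed MRI slices (one per case)."""
    encoder.eval()                       # the pretrained encoder is frozen for IM-F
    z = encoder(slices)                  # assumed to return an (N, 128) latent matrix
    return z.cpu().numpy()

def build_imf_features(encoder, slices, blood_features: np.ndarray) -> np.ndarray:
    """Concatenate tabular blood-test features with image-derived latents."""
    latents = extract_latents(encoder, slices)
    return np.concatenate([blood_features, latents], axis=1)

# Hypothetical usage within one CV fold:
# X_train = build_imf_features(unet_encoder, train_slices, blood_train)
# X_test = build_imf_features(unet_encoder, test_slices, blood_test)
# clf = XGBClassifier(random_state=0, eval_metric="logloss").fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```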

We further explored a method to merge the IM-F model with an LGBM-based classifier to enhance performance, leading to the development of the Dense Fusion (D-F) model, which involves multiple integrations of information and performs fusion at multiple layers33. The models were merged using the L-F approach, and the final prediction of the D-F model was defined by performing an AND operation on the IM-F and LGBM inference results. These developed models were evaluated using the same evaluation metrics as previously described. Additionally, to examine potential biases in the detection performance, we calculated the recall ratio by dividing the recall for BOTs by that for MOTs.
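As a brief sketch, the D-F decision rule and the recall ratio used to quantify detection bias can be expressed as follows; the label convention (0 = BOT, 1 = MOT) and variable names are assumptions for illustration.

```python
# Sketch: dense-fusion AND rule and the recall ratio (BOT recall / MOT recall).
import numpy as np
from sklearn.metrics import recall_score

def dense_fusion_predict(pred_imf: np.ndarray, pred_lgbm: np.ndarray) -> np.ndarray:
    """AND rule: output MOT (1) only when both IM-F and LGBM predict MOT."""
    return (pred_imf.astype(bool) & pred_lgbm.astype(bool)).astype(int)

def recall_ratio(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    recall_bot = recall_score(y_true, y_pred, pos_label=0)
    recall_mot = recall_score(y_true, y_pred, pos_label=1)
    return recall_bot / recall_mot      # a value of 1.0 indicates balanced detection

# A recall ratio well below 1.0 (e.g., around 0.4) indicates under-detection of BOTs.
```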

Data processing and statistical analysis

All data processing and model development were performed using Python (version 3.10.10). Blood test data were handled as comma-separated values (CSV) files, and image analysis was performed using MRI data converted from DICOM to PNG. In brief, MRI images (DICOM) were converted to PNG, resized to 200 × 200 pixels, and normalized using transforms.Normalize with mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). The Python packages for XGB and LGBM were used for the learning models based on numerical data, whereas the TorchVision model library (version 0.15.0) in PyTorch (version 2.0.0) was used for the image-based learning models.
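A minimal sketch of the image preprocessing described above (PNG input, 200 × 200 resizing, and normalization with the stated mean and standard deviation) is given below; the file name is hypothetical, and the DICOM-to-PNG conversion is assumed to have been performed beforehand.

```python
# Sketch: preprocessing of a converted MRI slice into a model-ready tensor.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((200, 200)),                     # resize to 200 x 200 pixels
    transforms.ToTensor(),                             # PNG -> float tensor in [0, 1]
    transforms.Normalize(mean=(0.485, 0.456, 0.406),   # normalization used in the study
                         std=(0.229, 0.224, 0.225)),
])

img = Image.open("slice_0001.png").convert("RGB")      # hypothetical converted slice
x = preprocess(img).unsqueeze(0)                       # (1, 3, 200, 200) model input
```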

For loss functions, cross-entropy loss was used for the DL classifier and for the IM-F classifier, whereas the U-Net used for feature pretraining was trained in an unsupervised reconstruction setting with mean squared error loss. The U-Net encoder produced a 128-dimensional latent representation; this latent vector itself was used as the feature set for IM-F. During IM-F training, the U-Net encoder was frozen (i.e., not updated). For the ML models, we did not perform hyperparameter tuning: our goal was to compare baseline performance to understand model-class tendencies while reducing the risk of overfitting given the modest dataset size; default library settings were used unless otherwise specified.
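The following sketch shows one way to realize the unsupervised reconstruction pretraining with mean squared error loss and the subsequent freezing of the encoder; the UNetVAE class and the slice loader are hypothetical stand-ins, and only the training pattern is illustrated.

```python
# Sketch: unsupervised reconstruction pretraining of the U-Net, followed by
# freezing the encoder so that it is not updated during IM-F training.
import torch
import torch.nn as nn

def pretrain_reconstruction(model: nn.Module, loader, device, epochs=200, patience=5):
    criterion = nn.MSELoss()                     # mean squared error reconstruction loss
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed optimizer/LR
    best, stall = float("inf"), 0
    for _ in range(epochs):
        model.train()
        epoch_loss = 0.0
        for xb in loader:                        # tumor-containing slices, no labels
            xb = xb.to(device)
            optimizer.zero_grad()
            loss = criterion(model(xb), xb)      # reconstruct the input slice
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if epoch_loss < best:
            best, stall = epoch_loss, 0
        else:
            stall += 1
            if stall >= patience:                # stop after 5 epochs without improvement
                break
    return model

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():                # encoder weights are not updated later
        p.requires_grad = False
    return module.eval()

# model = pretrain_reconstruction(UNetVAE(latent_dim=128), slice_loader, "cuda")
# encoder = freeze(model.encoder)                # 128-dim latents then feed the XGB classifier
```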

The 95% CIs for ROC-AUC, PR-AUC, and F1 score were calculated using the bootstrap method with 1,000 resamples. The percentile method was applied, with the 2.5th and 97.5th percentiles of the bootstrap distribution taken as the lower and upper bounds of the 95% CI, respectively. Statistical analyses were performed using R (version 4.4.2). Statistical significance was set at p < 0.05. Data normality was assessed using the Shapiro–Wilk test. Depending on the nature of the data, the Student t-test, Chi-squared test, or Fisher exact test was applied as appropriate. Pairwise comparisons of ROC–AUCs between models were conducted using the DeLong test.
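As an illustrative sketch, the percentile bootstrap for the 95% CI of ROC-AUC can be implemented as follows; skipping resamples that contain only one class is an assumption rather than a detail reported in the study, and the variable names are hypothetical.

```python
# Sketch: percentile bootstrap (1,000 resamples) for a 95% CI of ROC-AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_score, metric=roc_auc_score, n_boot=1000, seed=0):
    """Return the 2.5th and 97.5th percentiles of the resampled metric."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    stats, n = [], len(y_true)
    while len(stats) < n_boot:
        idx = rng.integers(0, n, n)              # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:      # skip resamples with a single class
            continue
        stats.append(metric(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(stats, [2.5, 97.5])   # percentile method
    return lower, upper

# ci_low, ci_high = bootstrap_ci(y_test, model_probabilities)   # hypothetical inputs
```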

Results

Characteristics of patients

Patient characteristics are presented in Table 2. The study comprised 109 patients with serous ovarian tumors, of whom 31 had BOT and 78 had MOT. The median ages of the patients were 46.2 and 57.7 years for BOT and MOT, respectively (p < 0.001). The median BMIs were 21.7 for BOT and 21.8 for MOT (p = 0.949). According to the International Federation of Gynecology and Obstetrics staging, 25 patients (80.6%) with BOT were in Stage I, 3 (9.7%) were in Stage II, 3 (9.7%) were in Stage III, and 0 were in Stage IV. In contrast, 6 patients (7.7%) with MOT were in Stage I, 4 (5.1%) were in Stage II, 50 (64.1%) were in Stage III, and 18 (23.1%) were in Stage IV (p < 0.001). CA125 levels (4.07 U/mL for BOT and 6.25 U/mL for MOT, p < 0.001) and CA72-4 levels (0.80 U/mL for BOT and 2.53 U/mL for MOT, p < 0.001) differed significantly; however, CA19-9 levels showed no significant differences (2.66 U/mL for BOT and 2.62 U/mL for MOT, p = 0.917).

Table 2 Patient Background.

Development of the ML-based prediction models with preoperative blood test data

Using preoperative blood test data, three ML classification models, namely, RF, XGB, and LGBM, were developed on the basis of the common ensemble learning methods for supervised ML tasks. Prediction models were constructed on the basis of 28 features (Table 1) of 109 cases, including 31 BOT and 78 MOT cases. A five-fold CV method was adopted, and the average evaluation values of the five test sets for each model were used as the results (Fig. 2). The precision values for BOT classification were 0.647, 0.665, and 0.707 for RF, XGB, and LGBM, respectively, whereas the recall values were 0.514 for RF, 0.614 for XGB, and 0.681 for LGBM (Fig. 2A and B). The precision values for the MOT were 0.831, 0.852, and 0.881 for RF, XGB, and LGBM, respectively, whereas the recall values were 0.893 for RF, 0.869 for XGB, and 0.882 for LGBM (Fig. 2C and D). The accuracies for RF, XGB, and LGBM were 0.786, 0.797, and 0.825, respectively (Fig. 2E). Because the LGBM demonstrated the highest accuracy, it was selected as the comparative model for evaluation against the fusion-based model described subsequently. The evaluation metrics for all models are summarized in Fig. 2F for ease of comparison. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each ML model are summarized in Supplementary Table S1.

Fig. 2

Results for RF, XGB, and LGBM. (A) Borderline precision for RF, XGB, and LGBM. (B) Borderline recall for RF, XGB, and LGBM. (C) Malignant precision for RF, XGB, and LGBM. (D) Malignant recall for RF, XGB, and LGBM. (E) Accuracy for RF, XGB, and LGBM. (F) Evaluation metrics for each ML model. (G) Feature importance ranked in descending order, shown as a horizontal bar plot. RF, Random Forest; XGB, Extreme Gradient Boosting; LGBM, Light Gradient Boosting Machine; ML, machine learning.

To examine the interpretability of the developed models in terms of the indicators that contributed to their decision-making process, we visualized the feature importance output in ML-based classifiers (Fig. 2G). The top-ranking features included lactate dehydrogenase (LDH), CA125, CA72-4, age, and white blood cell count, highlighting the significance of well-established tumor markers, such as CA125, as well as relatively underestimated markers in clinical settings, such as LDH and CA72-4.
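As a minimal sketch, feature importance can be extracted from a fitted LGBM classifier and displayed as a horizontal bar plot similar to Fig. 2G; the fitted model, the feature names, and the choice of gain as the importance type are assumptions, not details reported in the study.

```python
# Sketch: gain-based feature importance from a fitted LGBMClassifier, plotted
# as a horizontal bar chart with the most important features at the top.
import matplotlib.pyplot as plt
import numpy as np

def plot_importance(clf, feature_names, top_k=10):
    imp = clf.booster_.feature_importance(importance_type="gain")
    order = np.argsort(imp)[-top_k:]                 # top features, ascending for barh
    plt.barh([feature_names[i] for i in order], imp[order])
    plt.xlabel("Feature importance (gain)")
    plt.tight_layout()
    plt.show()

# plot_importance(lgbm_clf, feature_names)           # hypothetical fitted model and names
```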

Development of the DL-based prediction models with MRI data

Next, three DL-based pre-trained image classifiers, namely VGG16, ResNet50, and DenseNet121, were applied to develop DL-based prediction models using preoperative MRI data of ovarian tumors. The dataset comprised the same 109 cases as those in the ML classification models. A five-fold CV method was applied, and the average values of the five test sets for each model were used as the results (Fig. 3). For the BOT cases, the precision values were 0.567 for VGG16, 0.447 for ResNet50, and 0.480 for DenseNet121, whereas the recall values were 0.357, 0.224, and 0.162 for VGG16, ResNet50, and DenseNet121, respectively (Fig. 3A and B). The precision values for MOT were 0.773 for VGG16, 0.745 for ResNet50, and 0.720 for DenseNet121, whereas the recall values were 0.867, 0.900, and 0.869 for VGG16, ResNet50, and DenseNet121, respectively (Fig. 3C and D). The accuracies of the VGG16, ResNet50, and DenseNet121 were 0.722, 0.707, and 0.668, respectively (Fig. 3E). As VGG16 demonstrated the highest accuracy, it was selected as the comparative model for evaluation against the fusion-based model described subsequently. The evaluation metrics for all models are summarized in Fig. 3F for ease of comparison. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each DL model are summarized in Supplementary Table S2. In addition, qualitative visual assessment using the Grad-CAM method confirmed that the classifier's learning was focused on tumor regions (Supplementary Figure S1).
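A minimal sketch of how such Grad-CAM heatmaps can be generated with forward and backward hooks on the final convolutional block of VGG16 is shown below; this is an illustration under stated assumptions, not the authors' implementation.

```python
# Sketch: Grad-CAM for the predicted class using the output of the final
# convolutional block of a fine-tuned VGG16 (x is a preprocessed slice tensor).
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_class=None):
    """x: tensor of shape (1, 3, H, W); returns a heatmap scaled to [0, 1]."""
    feats, grads = {}, {}
    layer = model.features[-1]                     # output of the final convolutional block
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    model.eval()
    logits = model(x)
    cls = int(logits.argmax(dim=1)) if target_class is None else target_class
    model.zero_grad()
    logits[0, cls].backward()                      # gradient of the predicted-class score
    h1.remove(); h2.remove()
    w = grads["a"].mean(dim=(2, 3), keepdim=True)  # channel weights: pooled gradients
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()

# heatmap = grad_cam(vgg16_model, x)   # overlay on the MRI slice for visual inspection
```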

Fig. 3

Results for VGG16, ResNet50, and DenseNet121. (A) Borderline precision of VGG16, ResNet50, and DenseNet121. (B) Borderline recall of VGG16, ResNet50, and DenseNet121. (C) Malignant precision of VGG16, ResNet50, and DenseNet121. (D) Malignant recall of VGG16, ResNet50, and DenseNet121. (E) Accuracy of VGG16, ResNet50, and DenseNet121. (F) Evaluation metrics for each DL model. VGG16, Visual Geometry Group 16-layer network; ResNet50, Residual Network with 50 layers; DenseNet121, Densely Connected Convolutional Network with 121 layers; DL, deep learning.

Development of multimodal prediction models

To integrate multimodal inputs such as numerical results from preoperative blood tests and images from diagnostic imaging, several fusion approaches were explored. Because the computational cost of the early fusion (E-F) approach, which extracts features from images and integrates them with numerical data at the initial step, increases substantially, we initially adopted the L-F approach, in which the final decision was determined by combining the inferred output vectors from differently trained models at the final step (Fig. 4A), and the IM-F approach, which integrates feature vectors obtained from intermediate layers of the image learning network into the final classifier model based on decision tree algorithms (Fig. 4B). To develop the L-F model, LGBM-based ML models with blood test data and VGG16-based DL models with diagnostic MRI data were trained to infer separately. Subsequently, the final prediction of the L-F model was defined by performing an AND operation on the individual inference results, whereby the output of the L-F model became MOT only when both outputs from the LGBM- and VGG16-based models were MOT; otherwise, the output became BOT. The rationale for applying the AND operation was as follows: although the recall of the VGG16-based DL model for BOT was approximately half that of the LGBM, its precision did not show as large a discrepancy as its recall did. Thus, this strategy aimed to construct a complementary model. Although the VGG16-based DL model identified fewer cases as BOT, its precision for MOT prediction was high. Hence, the approach sought to leverage the strengths of the VGG16-based DL model while compensating for its detection performance using LGBM. The results of the L-F model are summarized in Fig. 5, along with the results from the other integration approaches. The precision for BOT was 0.601, and the recall was 0.810. The precision for MOT was 0.914, and the recall was 0.762. The accuracy was 0.776. Thus, the L-F model achieved a higher recall for BOT cases and a higher precision for MOT than did the standalone models.

Beyond the AND rule, we also examined two alternative fusion rules: (i) an OR rule, in which the L-F output was MOT if either the LGBM- or VGG16-based model predicted MOT (i.e., BOT only when both predicted BOT), and (ii) the maximum-probability rule, which adopted the prediction of whichever model yielded the higher estimated probability of malignancy. The results of these alternatives are provided in Supplementary Table S3. Among the three fusion rules evaluated for L-F, the maximum-probability rule achieved the highest overall accuracy (0.835). However, BOT recall was low for both the OR and maximum-probability rules (0.229 and 0.552, respectively). Given our aim to enhance BOT detection, we therefore adopted the AND rule, which provided the highest BOT recall, to construct the final L-F model.

Fig. 4

Conceptual diagrams. (A) L-F method. (B) IM-F method. (C) D-F method. L-F, late fusion; IM-F, intermediate fusion; D-F, Dense Fusion.

Fig. 5

Results for LGBM, VGG16, L-F, IM-F, and D-F. (A) Borderline precision for LGBM, VGG16, L-F, IM-F, and D-F. (B) Borderline recall for LGBM, VGG16, L-F, IM-F, and D-F. (C) Malignant precision for LGBM, VGG16, L-F, IM-F, and D-F. (D) Malignant recall for LGBM, VGG16, L-F, IM-F, and D-F. (E) Accuracy for LGBM, VGG16, L-F, IM-F, and D-F. (F) Recall ratio for LGBM, VGG16, L-F, IM-F, and D-F. (G) Evaluation metrics for each ML, DL, and fusion model. LGBM, Light Gradient Boosting Machine; VGG16, Visual Geometry Group 16-layer network; DL, deep learning; ML, machine learning; L-F, late fusion; IM-F, intermediate fusion; D-F, Dense Fusion.

Subsequently, the IM-F approach was applied to combine multimodal information during the learning step at a relatively low computational cost. DL models such as ResNet50 were initially used as image encoders. However, when combined with the LGBM as the numerical encoder, all 109 cases were classified as malignant, rendering the binary classification system ineffective. To address this challenge, we implemented a VAE-adapted U-Net as the image encoder, which classified three cases as BOTs. Next, replacing the LGBM with XGB as the numerical encoder improved the classification performance, revealing 12 cases as borderline and enhancing the overall accuracy. Based on these findings, XGB was selected as the numerical encoder. The encoder part of U-Net was trained in an unsupervised manner to learn feature extraction from MRI data. After combining with XGB, a five-fold CV was conducted for the entire model, and the average evaluation metrics were calculated. With the IM-F model, the precision for BOT was 0.950, and the recall was 0.362. The precision for MOT was 0.799, and the recall was 0.987. The accuracy was 0.809 (Fig. 5). Thus, the IM-F method demonstrated superior MOT detection capability compared with the DL and traditional ML approaches. Although the precision for BOT cases was notably high for the IM-F method (0.950), the correspondingly low BOT recall indicates a tendency to classify cases as malignant.

Because the IM-F method demonstrated higher precision in classifying BOT with a more pronounced bias toward classification as MOT than did VGG16, we considered combining IM-F and LGBM at the final step of decision-making (Fig. 4C). This method combines the IM-F and LGBM models with an L-F approach that aggregates the output results. Because it performs information integration at multiple levels in the intermediate and output layers, it is characterized as a model employing D-F. In the developed D-F model, the final prediction was defined by performing an AND operation on the IM-F and LGBM inference results, and the evaluation metrics were subsequently calculated. As shown in Fig. 5, the precision for BOT was 0.690, and the recall was 0.714. The precision for MOT was 0.892, and the recall was 0.869. The accuracy was 0.825. While changes in individual recall and precision values for the BOT and MOT cases were not prominent, the D-F method improved the recall ratio from 0.772 to 0.822 compared with that of LGBM, indicating a meaningful reduction in detection bias. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each fusion model are summarized in Supplementary Table S4.

The higher threshold of IM-F for BOT led to low recall and high precision, and its bias toward malignancy resulted in high recall but low precision for MOT. This bias limited its BOT detection compared with conventional methods. In contrast, L-F used a higher threshold for MOT, improving BOT recall but lowering its precision, while achieving higher MOT precision with reduced recall. This adjustment helped correct over-prediction of malignancy. These outcomes reflect differences in learning mechanisms: IM-F requires learning interactions between image and numerical data, increasing computational cost, whereas L-F combines pre-trained models without extra training. D-F, which integrates IM-F and LGBM, showed intermediate detection ability and the highest accuracy. In terms of recall ratio, IM-F had the strongest bias (0.367), L-F was most balanced (1.063), and D-F fell in between. Overall, L-F effectively enhanced BOT detection and corrected malignancy bias with low computational cost. L-F and IM-F outperformed the image-only model (VGG16), improving BOT recall and overall precision and accuracy. This suggests that incorporating low-cost modalities like numerical data enhances DL-based model performance. A summary of our workflow and the overall results is presented in Fig. 6.

Fig. 6

An overall schematic image of the present study.

Discussion

Precise preoperative diagnosis is essential for the implementation of patient-centered, evidence-based medicine in the field of OvCa treatment. However, reports concerning the development of prediction models that integrate multimodal information are limited. In this study, we developed multimodal prediction models by integrating ML-based models using blood test data and DL-based models using MRI data through three fusion approaches: L-F, IM-F, and D-F. An E-F approach was not implemented owing to the high computational burden and complex preprocessing required for aligning input dimensions34. Our results showed that L-F achieved the best performance for BOT detection, whereas IM-F was more effective for malignancies.

Similar to the study of Kawakami et al.14, several previous studies have applied ML models to classify ovarian tumors based on blood test data. For example, Ahamad et al. adopted ML models, such as LGBM, using tumor markers and general blood test parameters to perform binary classification of benign and malignant ovarian tumors and reported an accuracy of 0.9135. Similarly, Akazawa et al. used blood test data and clinical information to develop ML models, including XGB, for three-class classification of benign, borderline, and malignant ovarian tumors, achieving an accuracy of 0.8036. In addition to the study of Wang et al.15, DL models have been applied to imaging data for ovarian tumor classification. For example, Kodipalli et al. used CT images and DL models, such as DenseNet121, to perform a binary classification of benign and malignant ovarian tumors, reporting an accuracy of 0.9637. Furthermore, Jian et al. applied a multiple-instance convolutional neural network to MRI images for the binary classification of BOT and malignant ovarian tumors, achieving an AUC of 0.88, which outperformed radiologists’ evaluations (AUC, 0.80)38. In this study, we developed a novel fusion approach that integrates ML and DL models using blood test and MRI data. Although the highest accuracy obtained with the D-F model (0.825) does not represent an evident improvement over previous reports, our findings suggest that further refinement and development of fusion methods may enhance diagnostic performance.

We also computed ROC-AUC and PR-AUC with 95% confidence intervals and the F1 score for each model (Supplementary Tables S1, S2, and S4). Across MOTs, F1 scores were generally in the 0.8 range for all models, whereas for BOTs the DL model showed markedly lower F1 scores (0.242–0.438) than the ML models (0.573–0.693), likely reflecting the greater data requirements of DL and the smaller number of BOT cases. In the IM-F approach, ROC-AUC and PR-AUC were lower than in the other fusion strategies, consistent with a stronger tendency to predict malignancy. Taken together, these results indicate that the ML models and the fusion models (excluding IM-F) outperform the DL model and IM-F. Given that our task was binary classification between BOTs and MOTs, accuracy at a fixed threshold of 0.5 might be a more practical summary metric for operational use.

LDH was identified as the most influential feature when we assessed feature importance in the ML-based models. LDH levels increase when surrounding tissues are disrupted by tumors or when tumors undergo rupture or necrosis39 and are associated with various malignant tumors beyond OvCa40,41,42. Furthermore, its correlation with the prognosis of ovarian tumors has been reported43. The significance of CA125, a well-established tumor marker for OvCa, and age has also been highlighted. These findings are consistent with those of previous reports and emphasize the reliability of our predictive models. CA72-4, which is not markedly elevated in serous ovarian tumors44, ranked immediately after CA125. One possible explanation is that CA72-4 contributes to classification decisions when considered along with other blood test parameters. Decision trees usually classify samples based on complex feature interactions, where individual features may not be independently significant but may gain importance when combined with other variables. Thus, CA72-4 may have played such a role in our model. This analysis of feature importance may also serve to partially address the black-box problem inherent in ML models by providing interpretability and insight into how specific variables contribute to classification.

We also examined the explainability of the DL model by applying Grad-CAM to the VGG16-based classifier. In many cases, the heatmaps exhibited prominent activation (reddish regions) near the tumor, supporting the view that the classifier leverages tumor-adjacent image cues. However, such tumor-centered activation was not observed uniformly across all cases. A more comprehensive assessment, potentially including alternative CAM variants (e.g., Grad-CAM++, Score-CAM)45, quantitative localization metrics, and weakly supervised or segmentation-assisted approaches, remains an important direction for future work.

This study had some limitations. First, while Jian et al. conducted a comparative analysis between model performance and radiologists’ diagnostic accuracy38, our study did not include such a comparison. Instead, we evaluated our models against a well-established clinical scoring system, the Risk of Malignancy Index-1 (RMI-1), which has been widely used for the preoperative assessment of ovarian tumors, with a cutoff value of 20046,47. As shown in Supplementary Table S6, our fusion models demonstrated comparable performance, suggesting that the classification accuracy of our approach is at least on par with that of RMI-1. Second, because this was a single-institution retrospective study with a limited sample size and a class imbalance between BOTs and MOTs, the risk of overfitting and restricted generalizability cannot be excluded. In particular, in the development of the DL model, the limited number of cases may have resulted in insufficient learning. In addition, our analysis focused on serous tumors, as they represent the most common histological type of OvCa, preventing evaluation across other histological subtypes. Moreover, we did not develop a segmentation model that uses additional information, such as tumor position, as an annotation when training the model using the MRI data. Although using a segmentation model could improve the accuracy of classification models, creating annotated data is usually challenging because of the difficulty of accurately delineating indistinct boundaries. This challenge frequently raises concerns regarding the reliability of the training data; therefore, we did not adopt a segmentation model in this study. Furthermore, for potential future clinical implementation, annotation-based models would require clinicians to manually annotate tumor regions prior to applying the classification model, introducing an additional step that may hinder practical use. To promote seamless integration into clinical workflows, we therefore chose to avoid using segmentation models.

Future perspectives include conducting multi-institutional collaborative studies to increase the sample size, extending the analysis to other histological subtypes, and exploring combined analyses across multiple subtypes. Indeed, based on the hypothesis-generating results obtained in this study, we are currently planning to conduct a multi-institutional prospective study with radiologist involvement.

Conclusions

In this study, we developed and validated multimodal diagnostic models that integrate multiple information types to improve the preoperative diagnostic accuracy of BOTs and MOTs. From the perspective of improving DL-based classification methods, our findings demonstrate the effectiveness of integrating ML-based classifiers based on laboratory data. To facilitate clinical implementation, prospective validation of the developed models and redevelopment using larger sample sizes will be essential.