Application of multimodal integration to develop preoperative diagnostic models for borderline and malignant ovarian tumors

Kunishima, Atsushi; Inaba, Daiki; Iyoshi, Shohei; Ikeda, Yoshiki; Goto, Mayuko; Muramatsu, Reina; Hashimoto, Mizuki; Yoshida, Kosuke; Mogi, Kazumasa; Yoshihara, Masato; Nagao, Yukari; Tamauchi, Satoshi; Yokoi, Akira; Yoshikawa, Nobuhisa; Niimi, Kaoru; Koizumi, Norihiro; Kajiyama, Hiroaki

doi:10.1038/s41598-025-21128-w

Download PDF

Article
Open access
Published: 23 October 2025

Application of multimodal integration to develop preoperative diagnostic models for borderline and malignant ovarian tumors

Atsushi Kunishima¹,
Daiki Inaba²,
Shohei Iyoshi^1,3^na1,
Yoshiki Ikeda^1,4^na1,
Mayuko Goto⁵,
Reina Muramatsu⁵,
Mizuki Hashimoto⁵,
Kosuke Yoshida¹,
Kazumasa Mogi¹,
Masato Yoshihara¹,
Yukari Nagao¹,
Satoshi Tamauchi¹,
Akira Yokoi¹,
Nobuhisa Yoshikawa¹,
Kaoru Niimi¹,
Norihiro Koizumi² &
…
Hiroaki Kajiyama¹

Scientific Reports volume 15, Article number: 37114 (2025) Cite this article

10k Accesses
Metrics details

Subjects

Abstract

Malignant ovarian tumors (MOTs) and borderline ovarian tumors (BOTs) differ in treatment strategies and prognosis. However, accurate preoperative diagnosis remains challenging, and improving diagnostic accuracy is crucial. We developed and validated a system using artificial intelligence (AI) to integrate machine learning (ML) models based on blood test data and deep learning (DL) models based on magnetic resonance imaging (MRI) findings to distinguish between MOT and BOT. We analyzed 78 patients with malignant serous ovarian tumors and 31 with borderline serous ovarian tumors treated at our institution. A classification model was developed using ML for blood test data, and a DL model was constructed using MRI data. By integrating these models, we developed three fusion models as multimodal diagnostic AI and compared them with standalone models. The performance was evaluated using precision, recall, and accuracy. The classification model using Light Gradient Boosting Machine achieved an accuracy of 0.825, and the DL model using Visual Geometry Group 16-layer network achieved an accuracy of 0.722 for discriminating BOT from MOT. The intermediate, late, and dense fusion models achieved accuracies of 0.809, 0.776, and 0.825, respectively. Integrating multimodal information such as blood test and imaging data may enhance learning efficiency and improve diagnostic accuracy.

Introduction

Ovarian cancer (OvCa) is the most lethal gynecological malignancy^1,2 and is usually diagnosed at an advanced stage because of its asymptomatic nature in the early stages^3,4, leading to high recurrence rates and poor prognosis⁵. Ovarian tumors are classified as benign, borderline, or malignant, and accurate classification is crucial for determining treatment strategies and surgical planning⁶. However, preoperative diagnostic methods, including imaging and the use of blood-based biomarkers, have limited accuracy, particularly for borderline ovarian tumors (BOTs). Intraoperative frozen section analysis shows a high concordance with final pathology (approximately 93–94%), but its reliability significantly decreases for BOTs. Recent studies report a wide range of diagnostic performance: sensitivity varies between 71 and 92%, overall accuracy between 69 and 82%, and positive predictive value between 81 and 88%, highlighting a persistent risk of BOT misclassification^7,8,9. These diagnostic limitations result in overtreatment¹⁰, unintended loss of fertility, or inefficient use of surgical resources¹¹, including limited operating room slots, surgical staff, and gynecologic oncologists, highlighting the need for improved preoperative classification methods¹².

Recent advancements in artificial intelligence (AI) have expanded its application to various medical fields, including the classification of ovarian tumors¹³. Machine learning (ML) effectively analyzes numerical data such as blood test parameters to classify diseases or assess risk factors, whereas deep learning (DL), a ML subset, excels in processing complex data such as magnetic resonance imaging (MRI) data to detect patterns and anomalies with high accuracy. Kawakami et al. used ML with the Random Forest algorithm to distinguish between benign and malignant ovarian tumors (MOTs) and achieved a remarkable receiver operating characteristic area under the curve (ROC-AUC) of 0.968¹⁴. Similarly, Wang et al. developed a DL model using T2-weighted MRI data to classify BOTs and MOTs, achieving a ROC-AUC of 0.87, outperforming radiologists, who reported an AUC of 0.75¹⁵. While many studies have considered blood test or imaging data for ovarian tumor classification, few studies have integrated both modalities into a single classification model, suggesting a promising but underexplored research area. Furthermore, the lack of transparency in AI decision-making, usually termed the “black-box problem”, limits its practical implementation¹⁶. Addressing this limitation and exploring integrated approaches may enhance the potential of AI in preoperative diagnostic support.

To address the literature gap, we aim to improve the preoperative classification accuracy of BOTs by developing a diagnostic approach that integrates an ML model for blood test classification and a DL model for MRI data classification, focusing on the binary classification of BOTs and MOTs, as the intermediate category in the three-class classification is more challenging owing to its less distinct features¹⁷. To evaluate the utility of the integrated approach, we developed five models: ML-only, DL-only, and three integrated models that combined both modalities. Our findings revealed that different integrated models excelled in detecting BOTs and MOTs, suggesting that combining multiple data modalities enhances diagnostic accuracy. This approach may facilitate accurate preoperative classification, contributing to appropriate treatment planning and the efficient use of limited medical resources.

Methods

Patient cohort

This study, which aimed to improve the preoperative classification accuracy of BOTs, was approved by the Ethics Committee of Nagoya University in accordance with the principles of the Declaration of Helsinki (approval number: 2019 − 0113), and the requirement for informed consent was waived, as this was a non-invasive, non-interventional study that did not involve the use of human-derived specimens. Instead, an opt-out procedure was implemented via the university website. A total of 285 patients with serous ovarian tumors who underwent surgery at Nagoya University Hospital between 2001 and 2023 were included in the initial cohort. The inclusion and exclusion criteria for patient selection are shown in Fig. 1. Of these 285 patients, the following were excluded: (1) 73 with benign tumors, (2) 47 with inadequate MRI data, (3) 27 lacking sufficient preoperative blood test data, (4) 24 with recurrent tumors, and (5) 5 with other coexisting cancers. After applying these criteria, 109 patients were included in the final analysis, comprising 31 with BOTs and 78 with MOTs.

Data collection

We used two types of preoperative data, blood test and MRI, in addition to basic clinical information. Blood test data, including tumor markers, such as CA125, and parameters from hematology, biochemistry, and coagulation tests, were primarily collected from the preoperative examinations required for treatment. We removed the incomplete test parameters during data preprocessing. Twenty-eight features were finally used, comprising two clinical information (age and body mass index (BMI)) and 26 blood test data (Table 1). For image recognition, T2-weighted axial MRI data were used because of their effectiveness in detecting ovarian tumors compared with those of other available techniques such as computed tomography (CT) or ultrasonography¹⁸. MRI data were reviewed by two gynecologic oncologists, and the slice that captured most of the ovarian tumors was collected per case (evaluation slices). To develop image-based classifiers in clinical settings, data augmentation is commonly performed to increase the number of images that can be used for model training. In this study, given the clinical nature of the data and the significant variability in tumor position and size across images, the application of conventional augmentation techniques was limited. To address this challenge, we increased the number of images by incorporating multiple tumor-containing MRI slices from each patient’s scan.

Table 1 Features used in this study.

Full size table

ML models with preoperative blood test data

To develop a representative ML-based classification model with preoperative blood test data, we compared three tree-based classification algorithms, namely Random Forest (RF)¹⁹, Extreme Gradient Boosting (XGB)²⁰, and Light Gradient Boosting Machine (LGBM)²¹. Clinical information and numerical data from blood tests were used for model development. A five-fold cross-validation (CV) method was used for all datasets, and the average values of the five test sets for each model were used as the results. The developed models were evaluated using evaluation metrics, including precision, recall, accuracy, ROC-AUC, PR-AUC with 95% confidence intervals (CIs), and F1 score²², which were calculated on the basis of the confusion matrix. The seed value was fixed at 0 during the analysis. The selected model was subsequently used in the fusion model with an image-based classifier or comparator for further analysis.

DL models with MRI data

To develop image-based classifiers using preoperative MRI findings, we compared three image classification algorithms, namely Visual Geometry Group 16-layer network (VGG16)^23,24, Residual Network with 50 layers (ResNet50)²⁵, and Densely Connected Convolutional Network with 121 layers (DenseNet121)²⁶. The study included 109 cases in total (BOT: 31; MOT: 78). We performed stratified, patient-level five-fold CV. In each fold, cases were split 1:4, with the central tumor-containing slice from the one-fifth were used exclusively as the test set, and all available tumor-containing slices from the remaining four-fifths were used for training and validation. This design ensured no patient overlap between the test and training/validation sets. Thus, in each fold, the training/validation dataset comprised 142 BOT and 398 MOT slices, and the test set comprised 35 BOT and 100 MOT slices. The batch size for training was set to 32, and the maximum number of epochs was limited to 100. Early stopping was implemented if the minimum loss was not updated over five epochs. The developed classifier models were evaluated using the same set of metrics defined earlier for the ML analyses, and we report the mean values across CV folds. We also implemented Gradient-weighted Class Activation Mapping (Grad-CAM) to assess whether the image-based classifier (VGG16) focuses on tumor-relevant regions²⁷. Grad-CAM maps were generated for the predicted class using the final convolutional block of VGG16. Heatmaps were produced post hoc and did not influence training. The same MRI preprocessing was applied as in model training.

Development of fusion models using multimodal AI

The best-performing ML model with blood test data, LGBM, and the best-performing DL model with preoperative MRI data, VGG16, were subsequently used to develop and evaluate the fusion model. There are four major types of fusion strategies depending on the timing at which information from the two modalities is integrated: early fusion (integration at the input level), late fusion (integration at the output level), intermediate fusion (integration and joint learning at the intermediate feature level), and dense fusion (a combination of multiple fusion layers throughout the model)²⁸. To develop the late fusion (L-F) model²⁹, the LGBM-based ML and VGG16-based DL models developed as mentioned above were combined in the last step, and the final prediction of the L-F method was defined by performing one of three operations on the individual inference results: AND, OR, or a maximum-probability rule (using the larger of the two estimated probabilities).

To implement an intermediate fusion (IM-F)³⁰, feature extraction should first be conducted from MRI data, and these features need to be integrated as novel features into the tree-based classifier. Thus, we developed a novel model by integrating U-Net, which is a common segmentation model in identifying objects and their locations in images³¹, with a Variational Autoencoder (VAE)³², an improved version of the traditional autoencoder. The VAE defines latent variable z on the basis of a probabilistic distribution (e.g., Gaussian), improving the decoding accuracy and efficiency. We first pretrained a U-Net in an unsupervised manner with all available tumor-containing slices from the entire cohort. We then connected the encoder to XGB to form the IM-F model and ran stratified patient-level five-fold CV with 109 cases (BOT: 31; MOT: 78). To align with the ML setting and avoid patient overlap, classification used one slice per case (the central tumor-containing slice) for both the training/validation set and the held-out test set. Notably, the batch size for training was set to 32, and the maximum number of epochs was limited to 200. Furthermore, early stopping was applied if the loss did not improve for five consecutive epochs.

We further explored a method to merge the IM-F model with an LGBM-based classifier to enhance the performance, leading to the development of the Dense Fusion (D-F) model, which involves multiple integrations of information and performs fusion at multiple layers³³. The models were merged using the L-F approach, and the final prediction of the D-F model was defined by performing an AND operation on the IM-F and LGBM inference results. These developed models were evaluated using the same evaluation metrics as previously described. Additionally, to examine potential biases in the detection performance, we calculated the recall ratio by dividing the recall for the BOT by that for the MOT.

Data processing and statistical analysis

All data processing and model development were performed using Python (version 3.10.10). Blood test data were handled as comma-separated values files, and image analysis was performed using DICOM-to-PNG-converted MRI data. In short, MRI images (DICOM) were converted to PNG, resized to 200 × 200 pixels, and normalized using transforms.Normalize with mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). Python packages for XGB and LGBM were adopted for the Learning Models using numerical data. In contrast, for image-based Learning Models, the TorchVision model library (version 0.15.0) in PyTorch (version 2.0.0) was used.

For loss functions, cross-entropy loss was used for the DL classifier and for the IM-F classifier, whereas the U-Net used for feature pretraining was trained in an unsupervised reconstruction setting with mean squared error loss. The U-Net encoder produced a 128-dimensional latent representation; this latent vector itself was used as the feature set for IM-F. During IM-F training, the U-Net encoder was frozen (i.e., not updated). For the ML models, we did not perform hyperparameter tuning: our goal was to compare baseline performance to understand model-class tendencies while reducing the risk of overfitting given the modest dataset size; default library settings were used unless otherwise specified.

The 95% CIs for ROC-AUC, PR-AUC, and F1 score were calculated using the bootstrap method with 1,000 resamples. The percentile method was applied, with the 2.5th and 97.5th percentiles of the bootstrap distribution taken as the lower and upper bounds of the 95% CI, respectively. Statistical analyses were performed using R (version 4.4.2). Statistical significance was set at p < 0.05. Data normality was assessed using the Shapiro–Wilk test. Depending on the nature of the data, the Student t-test, Chi-squared test, or Fisher exact test was applied as appropriate. Pairwise comparisons of ROC–AUCs between models were conducted using the DeLong test.

Results

Characteristics of patients

Patient characteristics are presented in Table 2. The study comprised 109 patients with serous ovarian tumors, of whom 31 had BOT and 78 had MOT. The median ages of the patients were 46.2 and 57.7 years for BOT and MOT, respectively (p < 0.001). The median BMIs were 21.7 for borderline and 21.8 for malignant (p = 0.949). According to the International Federation of Gynecology and Obstetrics staging, 25 patients (80.6%) with BOT were in Stage I, 3 (9.7%) were in Stage II, 3 (9.7%) were in Stage III, and 0 were in Stage IV. In contrast, 6 patients (7.7%) with MOT were in Stage I, 4 (5.1%) were in Stage II, 50 (64.1%) were in Stage III, and 18 (23.1%) were in Stage IV (p < 0.001). CA125 (4.07 U/mL for BOT and 6.25 U/mL for MOT, p < 0.001) and in CA72-4 (0.80 U/mL for BOT and 2.53 U/mL for MOT, p < 0.001) levels differed significantly; however, CA19-9 levels showed no significant differences (2.66 U/mL for BOT and 2.62 U/mL for MOT, p = 0.917).

Table 2 Patient Background.

Full size table

Development of the ML-based prediction models with preoperative blood test data

Using preoperative blood test data, three ML classification models, namely, RF, XGB, and LGBM, were developed on the basis of the common ensemble learning methods for supervised ML tasks. Prediction models were constructed on the basis of 28 features (Table 1) of 109 cases, including 31 BOT and 78 MOT cases. A five-fold CV method was adopted, and the average evaluation values of the five test sets for each model were used as the results (Fig. 2). The precision values for BOT classification were 0.647, 0.665, and 0.707 for RF, XGB, and LGBM, respectively, whereas the recall values were 0.514 for RF, 0.614 for XGB, and 0.681 for LGBM (Fig. 2A and B). The precision values for the MOT were 0.831, 0.852, and 0.881 for RF, XGB, and LGBM, respectively, whereas the recall values were 0.893 for RF, 0.869 for XGB, and 0.882 for LGBM (Fig. 2C and D). The accuracies for RF, XGB, and LGBM were 0.786, 0.797, and 0.825, respectively (Fig. 2E). Because the LGBM demonstrated the highest accuracy, it was selected as the comparative model for evaluation against the fusion-based model described subsequently. The evaluation metrics for all models are summarized in Fig. 2F for ease of comparison. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each ML model are summarized in Supplementary Table S1.

To examine the interpretability of the developed models in terms of the indicators that contributed to their decision-making process, we visualized the feature importance output in ML-based classifiers (Fig. 2G). The top-ranking features included lactate dehydrogenase (LDH), CA125, CA72-4, age, and white blood cell count, highlighting the significance of well-established tumor markers, such as CA125, as well as relatively underestimated markers in clinical settings, such as LDH and CA72-4.

Development of the DL-based prediction models with MRI data

Next, three DL-based pre-trained image classifiers, namely VGG16, ResNet50, and DenseNet121, were applied to develop DL-based prediction models using preoperative MRI data of ovarian tumors. The dataset comprised the same 109 cases as those in the ML classification models. A five-fold CV method was applied, and the average values of the five test sets for each model were used as the results (Fig. 3). For the BOT cases, the precision values were 0.567 for VGG16, 0.447 for ResNet50, and 0.480 for DenseNet121, whereas the recall values were 0.357, 0.224, and 0.162 for VGG16, RestNet50, and DenseNet121, respectively (Fig. 3A and B). The precision values for MOT were 0.773 for VGG16, 0.745 for ResNet50, and 0.720 for DenseNet121, whereas the recall values were 0.867, 0.900, and 0.869 for VGG16, ResNet50, and DenseNet121, respectively (Fig. 3C and D). The accuracies of the VGG16, ResNet50, and DenseNet121 were 0.722, 0.707, and 0.668, respectively (Fig. 3E). As VGG16 demonstrated the highest accuracy, it was selected as the comparative model for evaluation against the fusion-based model described subsequently. The evaluation metrics for all models are summarized in Fig. 3F for ease of comparison. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each DL model are summarized in Supplementary Table S2. Also, from visual explainability assessment by Grad-CAM method, learning was confirmed to be tumor-focused in this qualitative analysis (Supplementary Figure S1).

Development of multimodal prediction models

To integrate multimodal inputs such as numerical results from preoperative blood tests and images from diagnostic imaging, several fusion approaches were explored. As the computational cost of the early fusion (E-F) approach that extracts features from images and integrates them with numerical data at the initial step increases substantially, we initially adopted late L-F, where the final decision was determined by combining the inferred output vectors from differently trained models at the final step (Fig. 4A), and the IM-F approach, a method that integrates feature vectors obtained as intermediate layers of the image learning network into the final classifier model based on decision tree algorithms (Fig. 4B). To develop the L-F model, LGBM-based ML models with blood test data and VGG16-based DL models with diagnostic MRI data were trained to infer separately. Subsequently, the final prediction of the L-F model was defined by performing an AND operation on the individual inference results, where the output of the L-F model became MOT only when both outputs from LGBM- and VGG16-based models were MOT; otherwise, the output became BOT. The rationale for applying the AND operation was as follows: while the recall of the VGG16-based DL model for BOT was approximately half that of the LGBM, its precision did not exhibit a large discrepancy, as did the recall. Thus, this strategy aimed to construct a complementary model. Although the VGG16-based DL model revealed fewer cases as BOT, its precision for MOT prediction was high. Hence, the approach sought to leverage the strengths of the VGG16-based DL model while compensating for the detection performance using LGBM. The results of the L-F model are summarized in Fig. 5, along with the results from other integration approaches. The precision for BOT was 0.601, and the recall was 0.810. The precision for MOT was 0.914, and the recall was 0.762. The accuracy was 0.776. Thus, the L-F model achieved a higher recall for BOT cases, as well as an improvement in precision for MOT than did the standalone models. Beyond the AND rule, we also examined two alternative fusion rules: (i) an OR rule, in which the L-F output was MOT if either the LGBM- or VGG16-based model predicted MOT (i.e., BOT only when both predicted BOT), and (ii) the maximum-probability rule that adopted the prediction from whichever model yielded the higher estimated probability of malignancy. The results of these alternatives are provided in Supplementary Table S3. Among the three fusion rules evaluated for L-F, the maximum-probability rule achieved the highest overall accuracy (0.835). However, BOT recall was low for both the OR and maximum-probability rules (0.229 and 0.552, respectively). Given our aim to enhance BOT detection, we therefore adopted the AND rule—which provided the highest BOT recall—to construct the final L-F model.

Subsequently, the IM-F approach was attempted to combine multimodal information during the learning step at a relatively low computational cost. DL models such as ResNet50 were initially used as image encoders. However, when combined with the LGBM as the numerical encoder, all 109 cases were classified as malignant, rendering the binary classification system ineffective. To address this challenge, we implemented a VAE-adapted U-Net as the image encoder, which classified three cases as BOTs. Next, replacing the LGBM with XGB as the numerical encoder improved the classification performance, revealing 12 cases as borderline and enhancing the overall accuracy. Based on these findings, the XGB was selected as the numerical encoder. The encoder part of U-Net was trained in an unsupervised manner to learn feature extraction from MRI data. After combining with XGB, a five-fold CV was conducted for the entire model, and the average of the evaluation matrix was calculated. With the IM-F model, the precision for BOT was 0.950, and the recall was 0.362. The precision for MOT was 0.799, and the recall was 0.987. The accuracy was 0.809 (Fig. 5). Thus, the IM-F method demonstrated superior MOT detection capability compared with those of the DL and traditional ML approaches. In contrast, the precision for BOT cases was notably high for the IM-F method (0.950), suggesting a tendency to classify cases more easily as malignant.

Because the IM-F method demonstrated higher precision in classifying BOT with a more pronounced bias toward classification as MOT than did VGG16, we considered combining IM-F and LGBM at the final step of decision-making (Fig. 4C). This method combines the IM-F and LGBM models with an L-F approach that aggregates the output results. Because it performs information integration at multiple levels in the intermediate and output layers, it is characterized as a model employing D-F. In the developed D-F model, the final prediction was defined by performing an AND operation on the IM-F and LGBM inference results, and the evaluation metrics were subsequently calculated. As shown in Fig. 5, the precision for BOT was 0.690, and the recall was 0.714. The precision for MOT was 0.892, and the recall was 0.869. The accuracy was 0.825. While changes in individual recall and precision values for the BOT and MOT cases were not prominent, the D-F method improved the recall ratio from 0.772 to 0.822 compared with that of LGBM, indicating a meaningful reduction in detection bias. As supplementary evaluation metrics, the F1 score, ROC-AUC, and PR-AUC for each fusion model are summarized in Supplementary Table S4.

The higher threshold of IM-F for BOT led to low recall and high precision, and its bias toward malignancy resulted in high recall but low precision for MOT. This bias limited its BOT detection compared with conventional methods. In contrast, L-F used a higher threshold for MOT, improving BOT recall but lowering its precision, while achieving higher MOT precision with reduced recall. This adjustment helped correct over-prediction of malignancy. These outcomes reflect differences in learning mechanisms: IM-F requires learning interactions between image and numerical data, increasing computational cost, whereas L-F combines pre-trained models without extra training. D-F, which integrates IM-F and LGBM, showed intermediate detection ability and the highest accuracy. In terms of recall ratio, IM-F had the strongest bias (0.367), L-F was most balanced (1.063), and D-F fell in between. Overall, L-F effectively enhanced BOT detection and corrected malignancy bias with low computational cost. L-F and IM-F outperformed the image-only model (VGG16), improving BOT recall and overall precision and accuracy. This suggests that incorporating low-cost modalities like numerical data enhances DL-based model performance. A summary of our workflow and the overall results is presented in Fig. 6.

Discussion

Precise preoperative diagnosis is essential for the implementation of patient-centered, evidence-based medicine in the field of OvCa treatment. However, reports concerning the development of prediction models that integrate multimodal information are limited. In this study, we developed multimodal prediction models by integrating ML-based models using blood test data and DL-based models using MRI data through three fusion approaches: L-F, IM-F, and D-F. An E-F approach was not implemented owing to the high computational burden and complex preprocessing required for aligning input dimensions³⁴. Our results showed that L-F achieved the best performance for BOT detection, whereas IM-F was more effective for malignancies.

Similar to the study of Kawakami et al.¹⁴, several previous studies have considered ML models in classifying ovarian tumors based on blood test data. For example, Ahamad et al. adopted ML models, such as LGBM, using tumor markers and general blood test parameters to perform binary classification of benign and malignant ovarian tumors and reported an accuracy of 0.91³⁵. Similarly, Akazawa et al. used blood test data and clinical information to develop ML models, including XGB, for three-class classification of benign, borderline, and malignant ovarian tumors, achieving an accuracy of 0.80³⁶. In addition to the study of Wang et al.¹⁵, DL models have been applied to imaging data for ovarian tumor classification. For example, Kodipalli et al. used CT images and DL models, such as DenseNet121, to perform a binary classification of benign and malignant ovarian tumors, reporting an accuracy of 0.96³⁷. Furthermore, Jian et al. applied a multiple-instance convolutional neural network to MRI images for the binary classification of BOT and malignant ovarian tumors, achieving an AUC of 0.88, which outperformed radiologists’ evaluations (AUC, 0.80)³⁸. In this study, we developed a novel fusion approach that integrates ML and DL models using blood test and MRI data. Although the highest accuracy obtained with the D-F model (0.825) does not represent an evident improvement over previous reports, our findings suggest that further refinement and development of fusion methods enhance diagnostic performance.

We also computed ROC–AUC and PR–AUC with 95% confidence intervals and the F1 score for each model (Supplementary Tables S1,S2,and S4). Across MOTs, F1 scores were generally in the 0.8 range for all models, whereas for BOTs the DL model showed markedly lower F1 scores (0.242–0.438) than the ML models (0.573–0.693), likely reflecting the greater data requirements of DL and the smaller number of BOT cases. In the IM-F approach, ROC–AUC and PR–AUC were lower than in the other fusion strategies, consistent with a stronger tendency to predict malignancy. Taken together, these results indicated that the ML models and the fusion models (excluding IM-F) outperform the DL model and IM-F. Given that our task was binary classification between BOTs and MOTs, accuracy at a fixed threshold of 0.5 might be a more practical summary metric for operational use.

LDH was identified as the most influential feature when we assessed feature importance in the ML-based models. LDH levels increase when surrounding tissues are disrupted by tumors or when tumors undergo rupture or necrosis³⁹ and are associated with various malignant tumors beyond OvCa^40,41,42. Furthermore, its correlation with the prognosis of ovarian tumors has been reported⁴³. The significance of CA125, a well-established tumor marker for OvCa, and age has also been highlighted. These findings are consistent with those of previous reports and emphasize the reliability of our predictive models. CA72-4, which is not markedly elevated in serous ovarian tumors⁴⁴, ranked immediately after CA125. One possible explanation is that CA72-4 contributes to classification decisions when considered along with other blood test parameters. Decision trees usually classify samples based on complex feature interactions, where individual features may not be independently significant but may gain importance when combined with other variables. Thus, CA72-4 may have played such a role in our model. This analysis of feature importance may also serve to partially address the black-box problem inherent in ML models by providing interpretability and insight into how specific variables contribute to classification.

We also examined explainability of DL model by applying Grad-CAM to the VGG16-based classifier. In many cases, the heatmaps exhibited prominent activation (reddish regions) near the tumor, supporting the view that the classifier leverages tumor-adjacent image cues. However, such tumor-centered activation was not observed uniformly across all cases. A more comprehensive assessment—potentially including alternative CAM variants (e.g., Grad-CAM++, Score-CAM)⁴⁵, quantitative localization metrics, and weakly supervised/segmentation-assisted approaches—remains an important direction for future work.

This study had some limitations. First, while Jian et al. conducted a comparative analysis between model performance and radiologists’ diagnostic accuracy³⁸, our study did not include such a comparison. Instead, we evaluated our models against a well-established clinical scoring system, the Risk of Malignancy Index-1 (RMI-1), which has been widely used for the preoperative assessment of ovarian tumors, with cutoff value of 200^46,47. As shown in Supplementary Table S6, our fusion models demonstrated comparable performance, suggesting that the classification accuracy of our approach is at least on par with that of RMI-1. As another limitation, it should be noted that, because of a single-institutional retrospective study design with a limited sample size, as well as a class imbalance between BOTs and MOTs, the risk of overfitting and restricted generalizability cannot be excluded. Particularly, in the development of the DL model, the limited number of cases may have resulted in insufficient learning. In addition, our analysis focused on serous tumors, as they represent the most common histological type of OvCa, preventing evaluation across other histological subtypes. Moreover, we did not develop a segmentation model that uses additional information, such as tumor position, as an annotation when training the model using the MRI data. Although using a segmentation model could improve the accuracy of classification models, creating annotated data is usually challenging because of the obstacle in accurately delineating indistinct boundaries. This challenge frequently raises concerns regarding the reliability of the training data; therefore, we did not adopt a segmentation model in this study. Furthermore, for potential future clinical implementation, annotation-based models would require clinicians to manually annotate tumor regions prior to applying the classification model, introducing an additional step that may hinder practical use. To promote seamless integration into clinical workflows, we therefore chose to avoid using segmentation models.

Future perspectives include conducting multi-institutional collaborative studies to increase the sample size, extending the analysis to other histological subtypes, and exploring combined analyses across multiple subtypes. Indeed, based on the hypothesis-generating results obtained in this study, we are currently planning to conduct a multi-institutional prospective study with radiologist involvement.

Conclusions

In this study, we developed and validated multimodal diagnostic models that integrate multiple information types to improve the preoperative diagnostic accuracy of BOTs and MOTs. From the perspective of improving DL-based classification methods, our findings demonstrate the effectiveness of integrating ML-based classifiers based on laboratory data. To facilitate clinical implementation, prospective validation of the developed models and redevelopment using larger sample sizes will be essential.

Data availability

The data that support the findings of this study are available from Nagoya University; however, restrictions apply to the availability of these data, which were used under license for the current study, and thus are not publicly available. Additional datasets generated and/or analyzed during the current study are available from the corresponding author, Dr. Shohei Iyoshi, upon reasonable request.

Abbreviations

AI:: Artificial intelligence
BOT:: Borderline ovarian tumor
MOT:: Malignant ovarian tumor
ML:: Machine learning
DL:: Deep learning
MRI:: Magnetic resonance imaging
CT:: Computed tomography
LGBM:: Light Gradient Boosting Machine
RF:: Random Forest
XGB:: Extreme Gradient Boosting
VGG16:: Visual Geometry Group 16-layer network
ResNet50:: Residual Network with 50 layers
DenseNet121:: Densely Connected Convolutional Network with 121 layers
ROC-AUC:: Receiver operating characteristic area under the curve
PR-AUC:: Precision-recall area under the curve
CIs:: Confidence intervals
CV:: Cross-validation
L-F:: Late fusion
IM-F:: Intermediate fusion
VAE:: Variational Autoencoder
D-F:: Dense Fusion
E-F:: Early fusion
OvCa:: Ovarian cancer
BMI:: Body mass index
LDH:: Lactate dehydrogenase
RMI-1:: The Risk of Malignancy Index-1
Grad-CAM:: Gradient-weighted Class Activation Mapping

References

Brun, J. L. et al. Long-term results and prognostic factors in patients with epithelial ovarian cancer. Gynecol. Oncol. 78, 21–27 (2000).
Article CAS PubMed Google Scholar
Jemal, A. et al. Cancer statistics, 2007. CA Cancer J. Clin. 57, 43–66 (2007).
PubMed Google Scholar
Weidle, U. H., Birzele, F., Kollmorgen, G. & Rueger, R. Mechanisms and targets involved in dissemination of ovarian cancer. Cancer Genom. Proteom. 13, 407–423 (2016).
Article CAS PubMed Google Scholar
Jelovac, D. & Armstrong, D. K. Recent progress in the diagnosis and treatment of ovarian cancer. CA Cancer J. Clin. 61, 183–203 (2011).
PubMed PubMed Central Google Scholar
Bast, R. C., Hennessey, B. & Mills, G. B. Hennessey B mills GB: biology ovarian cancer: new opportunities translation. Nat. Rev. Cancer. 9, 415–428 (2009).
Article CAS PubMed PubMed Central Google Scholar
Prat, J. Ovarian tumors of borderline malignancy (tumors of low malignant potential): a critical appraisal. Adv. Anat. Pathol. 6, 247–274 (1999).
Article CAS PubMed Google Scholar
Akalin, M., Akalin, E. E. & Gi̇ray, B. Kabaca Kocakuşak, C. Challenges of frozen section in borderline ovarian tumors: A 10-year retrospective analysis from a tertiary gynecological cancer center: retrospective analysis. J. Clin. Obstet. Gynecol. 31, 40–45 (2021).
Article Google Scholar
Huang, Z. et al. Diagnostic accuracy of frozen section analysis of borderline ovarian tumors: a meta-analysis with emphasis on misdiagnosis factors. J. Cancer. 9, 2817–2824 (2018).
Article PubMed PubMed Central Google Scholar
Kung, F. Y. L., Tsang, A. K. H. & Yu, E. L.-M. Intraoperative frozen section analysis of ovarian tumors: a 11-year review of accuracy with clinicopathological correlation in a Hong Kong regional hospital. Int. J. Gynecol. Cancer. 29, 772–778 (2019).
Article PubMed Google Scholar
Ratnavelu, N. D. G. et al. Intraoperative frozen section analysis for the diagnosis of early stage ovarian cancer in suspicious pelvic masses. Cochrane Database Syst. Rev. 3, CD010360 (2016).
PubMed Google Scholar
Fischerova, D., Zikan, M., Dundr, P. & Cibula, D. Diagnosis, treatment, and follow-up of borderline ovarian tumors. Oncologist 17, 1515–1533 (2012).
Article PubMed PubMed Central Google Scholar
Temkin, S. M., Tanner, E. J., Dewdney, S. B. & Minasian, L. M. Reducing overtreatment in gynecologic oncology: the case for less in endometrial and ovarian cancer. Front. Oncol. 6, 118 (2016).
Article PubMed PubMed Central Google Scholar
Jiang, F. et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2, 230–243 (2017).
Article PubMed PubMed Central Google Scholar
Kawakami, E. et al. Application of artificial intelligence for preoperative diagnostic and prognostic prediction in epithelial ovarian cancer based on blood biomarkers. Clin. Cancer Res. 25, 3006–3015 (2019).
Article CAS PubMed Google Scholar
Wang, Y. et al. Deep learning for the ovarian lesion localization and discrimination between borderline and malignant ovarian tumors based on routine MR imaging. Sci. Rep. 13, 2770 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Muhammad, D. & Bendechache, M. Unveiling the black box: A systematic review of explainable artificial intelligence in medical image analysis. Comput. Struct. Biotechnol. J. 24, 542–560 (2024).
Article PubMed PubMed Central Google Scholar
Fujii, I. et al. Artificial intelligence and image analysis-assisted diagnosis for fibrosis stage of metabolic Dysfunction-Associated steatotic liver disease using ultrasonography: A pilot study. Diagnostics (Basel). 14, 2585 (2024).
Article PubMed Google Scholar
Sohaib, S. A. A. & Reznek, R. H. MR imaging in ovarian cancer. Cancer Imaging. 7 (Spec No A), S119–S129 (2007).
Article PubMed PubMed Central Google Scholar
Biau, G. & Scornet, E. A random forest guided tour. Test. (Madr). 25, 197–227 (2016).
Article MathSciNet Google Scholar
Chen, T., Guestrin, C. & XGBoost: A Scalable Tree Boosting System. arXiv [cs.LG] (2016).
Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Neural Inf. Process. Syst. 30, 3146–3154 (2017).
Foody, G. M. Challenges in the real world use of classification accuracy metrics: from recall and precision to the Matthews correlation coefficient. PLoS One. 18, e0291908 (2023).
Article CAS PubMed PubMed Central Google Scholar
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv [cs.CV] (2014).
Tammina, S. Transfer learning using VGG-16 with deep convolutional neural network for classifying images. Int. J. Sci. Res. Publ (IJSRP). 9, 9420 (2019).
Google Scholar
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016). (IEEE, 2016). (2016). https://doi.org/10.1109/cvpr.2016.90
Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. arXiv [cs.CV] (2016).
Ramprasaath, R. et al. Grad-Cam: Visual Explanations from Deep Networks via Gradientbased Localization. (2017).
Ramachandram, D. & Taylor, G. W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal. Process. Mag. 34, 96–108 (2017).
Article ADS Google Scholar
Tanno, R. 5. Using multimodal deep learning for classification of drive recorder data. J. Inst. Image Inf. Telev. Eng. 74, 44–48 (2020).
Tan, H. & Bansal, M. LXMERT. Learning Cross-Modality Encoder Representations from Transformers. arXiv [cs.CL] (2019).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. in Lecture Notes in Computer Science 234–241Springer International Publishing, Cham, (2015).
Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. arXiv [stat.ML] (2013).
Hu, D., Wang, C., Nie, F. & Li, X. Dense Multimodal Fusion for Hierarchically Joint Representation. in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2019) https://doi.org/10.1109/icassp.2019.8683898
Shen, Z., Li, Y., Zhang, H., Weng, Y. & Wang, J. Rethinking early-fusion strategies for improved multimodal image segmentation. arXiv [cs.CV] (2025).
Ahamad, M. M. et al. Early-stage detection of ovarian cancer based on clinical data using machine learning approaches. J. Pers. Med 12, 1211 (2022).
Akazawa, M. & Hashimoto, K. Artificial intelligence in ovarian cancer diagnosis. Anticancer Res. 40, 4795–4800 (2020).
Article CAS PubMed Google Scholar
Kodipalli, A., Fernandes, S. L., Gururaj, V., Varada Rameshbabu, S. & Dasar, S. Performance analysis of segmentation and classification of CT-scanned ovarian tumours using U-Net and deep convolutional neural networks. Diagnostics (Basel) 13, 2282 (2023).
Jian, J. et al. MRI-Based multiple instance convolutional neural network for increased accuracy in the differentiation of borderline and malignant epithelial ovarian tumors. J. Magn. Reson. Imaging. 56, 173–181 (2022).
Article PubMed Google Scholar
Van Wilpe, S. et al. Lactate dehydrogenase: a marker of diminished antitumor immunity. Oncoimmunology 9, 1731942 (2020).
Article PubMed PubMed Central Google Scholar
Ban, E. J. et al. Lactate dehydrogenase A as a potential new biomarker for thyroid cancer. Endocrinol. Metab. (Seoul). 36, 96–105 (2021).
Article CAS PubMed Google Scholar
Wang, H., Wang, M. S., Zhou, Y. H., Shi, J. P. & Wang, W. J. Prognostic values of LDH and CRP in cervical cancer. Onco Targets Ther. 13, 1255–1263 (2020).
Article CAS PubMed PubMed Central Google Scholar
Zhao, J. et al. LDHA promotes tumor metastasis by facilitating epithelial-mesenchymal transition in renal cell carcinoma. Mol. Med. Rep. 16, 8335–8344 (2017).
Article CAS PubMed Google Scholar
Zhu, L. et al. A lactate metabolism-related signature predicting patient prognosis and immune microenvironment in ovarian cancer. Front. Endocrinol. (Lausanne). 15, 1372413 (2024).
Article PubMed Google Scholar
Kobayashi, H. The clinical usefulness of serum CA72-4 analysis in patients with ovarian cancer. Nihon Sanka Fujinka Gakkai Zasshi. 41, 585–589 (1989).
CAS PubMed Google Scholar
Wang, H. et al. Score-CAM: Score-weighted visual explanations for convolutional neural networks. arXiv [cs.CV] (2019).
Jacobs, I. et al. A risk of malignancy index incorporating CA 125, ultrasound and menopausal status for the accurate preoperative diagnosis of ovarian cancer. Br. J. Obstet. Gynaecol. 97, 922–929 (1990).
Article CAS PubMed Google Scholar
Tingulstad, S. et al. Evaluation of a risk of malignancy index based on serum CA125, ultrasound findings and menopausal status in the pre-operative diagnosis of pelvic masses. Br. J. Obstet. Gynaecol. 103, 826–831 (1996).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We would like to thank Editage (www.editage.jp) for English language editing.

Funding

This study was supported by Japan Society for the Promotion of Science KAKENHI Grants-in-Aid for Scientific Research (grant numbers: 23K08795 and 24K19723).

Author information

Shohei Iyoshi and Yoshiki Ikeda contributed equally to this work.

Authors and Affiliations

Department of Obstetrics and Gynecology, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya-shi, 466-8550, Aichi, Japan
Atsushi Kunishima, Shohei Iyoshi, Yoshiki Ikeda, Kosuke Yoshida, Kazumasa Mogi, Masato Yoshihara, Yukari Nagao, Satoshi Tamauchi, Akira Yokoi, Nobuhisa Yoshikawa, Kaoru Niimi & Hiroaki Kajiyama
Department of Mechanical and Intelligent Systems Engineering, Graduate School of Informatics and Engineering, The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu-shi, 182-8585, Tokyo, Japan
Daiki Inaba & Norihiro Koizumi
Institute for Advanced Research, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-8601, Japan
Shohei Iyoshi
Department of Obstetrics and Gynecology, Kasugai Municipal Hospital, 1-1-1 Takaki- cho, Kasugai-shi, 486-8510, Aichi, Japan
Yoshiki Ikeda
Nagoya University School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya-shi, 466-8550, Aichi, Japan
Mayuko Goto, Reina Muramatsu & Mizuki Hashimoto

Authors

Atsushi Kunishima
View author publications
Search author on:PubMed Google Scholar
Daiki Inaba
View author publications
Search author on:PubMed Google Scholar
Shohei Iyoshi
View author publications
Search author on:PubMed Google Scholar
Yoshiki Ikeda
View author publications
Search author on:PubMed Google Scholar
Mayuko Goto
View author publications
Search author on:PubMed Google Scholar
Reina Muramatsu
View author publications
Search author on:PubMed Google Scholar
Mizuki Hashimoto
View author publications
Search author on:PubMed Google Scholar
Kosuke Yoshida
View author publications
Search author on:PubMed Google Scholar
Kazumasa Mogi
View author publications
Search author on:PubMed Google Scholar
Masato Yoshihara
View author publications
Search author on:PubMed Google Scholar
Yukari Nagao
View author publications
Search author on:PubMed Google Scholar
Satoshi Tamauchi
View author publications
Search author on:PubMed Google Scholar
Akira Yokoi
View author publications
Search author on:PubMed Google Scholar
Nobuhisa Yoshikawa
View author publications
Search author on:PubMed Google Scholar
Kaoru Niimi
View author publications
Search author on:PubMed Google Scholar
Norihiro Koizumi
View author publications
Search author on:PubMed Google Scholar
Hiroaki Kajiyama
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization and design of the study: AK, DI, YI, and NK. Data acquisition: MG, RM, MH, KY, KM, MY, YN, ST, AY, NY, and KN. Analysis and interpretation of data: AK, DI, SI, YI, and NK. Drafting of the manuscript: AK, SI, and YI. Statistical analysis: AK, DI, and SI. Funding acquisition: SI and YI. Supervision: HK. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shohei Iyoshi.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Ethics Committee of Nagoya University (number: 2019 − 0113) in accordance with the principles of the Declaration of Helsinki. The requirement for informed consent was waived in accordance with Chap. 5, Article 12 of the Ethical Guidelines for Medical and Biological Research Involving Human Subjects (revised February 2017), as this was a non-invasive, non-interventional study that did not involve the use of human-derived specimens. Instead, an opt-out procedure was implemented via the university website.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Kunishima, A., Inaba, D., Iyoshi, S. et al. Application of multimodal integration to develop preoperative diagnostic models for borderline and malignant ovarian tumors. Sci Rep 15, 37114 (2025). https://doi.org/10.1038/s41598-025-21128-w

Download citation

Received: 24 July 2025
Accepted: 18 September 2025
Published: 23 October 2025
Version of record: 23 October 2025
DOI: https://doi.org/10.1038/s41598-025-21128-w