Abstract
Recently developed pathology foundation models, pretrained on large-scale pathology datasets, have demonstrated excellent performance in various downstream tasks. This study evaluated the utility of pathology foundation models combined with multiple instance learning (MIL) for kidney pathology analysis. We used 242 hematoxylin and eosin (H&E)-stained whole slide images (WSIs) from the Kidney Precision Medicine Project (KPMP) and Japan-Pathology Artificial Intelligence Diagnostics Project databases as the development cohort, comprising 47 healthy control, 35 acute interstitial nephritis, and 160 diabetic kidney disease (DKD) slides. External validation was performed using 83 WSIs from the University of Tokyo Hospital. Pretrained pathology foundation models were used as patch encoders and compared with ImageNet-pretrained ResNet50. Using the extracted patch features, we trained MIL models to classify diagnoses. In internal validation, all foundation models outperformed ResNet50, achieving an area under the receiver operating characteristic curve (AUROC) above 0.980. In external validation, the performance of ResNet50 dropped markedly, whereas all foundation models maintained robust performance. Visualization of attention heatmaps confirmed that the foundation models accurately recognized diagnostically relevant structures. In the overt proteinuria (albuminuria ≥ 300 mg/gCre or proteinuria ≥ 1000 mg/gCre) prediction task, the foundation models also outperformed ResNet50. We successfully integrated pathology foundation models with MIL to achieve robust diagnostic performance.
Introduction
The diagnosis of kidney diseases requires interpreting diverse pathological features, such as inflammation, atherosclerosis, and metabolic changes, which demands significant effort and expertise. The advent of whole-slide imaging (WSI) and advances in deep learning have spurred intensive research in digital pathology. In particular, image analysis with convolutional neural networks (CNNs) has been successfully applied to kidney pathology, enabling segmentation of functional structures such as glomeruli and tubules, as well as quantitative analysis of kidney pathology and its association with clinical outcomes1,2,3. These approaches are expected to facilitate efficient kidney histopathological analysis and reduce the time and effort required for evaluation. However, these segmentation-based approaches depend heavily on manually annotated datasets for supervised training, posing a significant limitation to scalability. Self-supervised learning (SSL)4 has been introduced as a powerful approach for extracting meaningful representations from large unlabeled datasets. In kidney pathology, the use of SSL for effective feature extraction from glomerular images with limited annotations has been reported5. Recently, pathology foundation models, trained using SSL with Vision Transformer (ViT)6 architectures, have demonstrated remarkable performance in various downstream tasks compared with ImageNet-trained CNN-based models7,8,9,10,11. A key difference between CNNs and ViTs lies in their inductive biases. CNNs extract spatial patterns using small convolutional kernels across multiple layers and have strong inductive biases, which yield high performance even with small datasets. However, these inductive biases in turn prevent CNNs from fully leveraging large-scale datasets. In contrast, ViTs have minimal inductive biases, which enables them to outperform CNNs when trained on large-scale datasets12. Pathology foundation models can be used as patch-level encoders to extract meaningful features from image patches, and these extracted features can be applied to downstream tasks such as classification and segmentation.
One of the major computational challenges in WSI analysis is the massive data size, which makes direct slide-level analysis infeasible. The standard solution is to divide each slide into smaller image patches for computational processing. Multiple instance learning (MIL) is an effective approach for slide-level classification because it aggregates information from individual patches without requiring patch-level annotations. In heterogeneous tissue slides, where diagnostic value varies widely among patches, MIL excels by learning to focus on the patches that matter for classification.
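To make the bag abstraction concrete, the following minimal sketch (not the implementation used in this study) shows how a slide-level prediction can be produced from unlabeled patch features; the 1024-dimensional features, three classes, and max-pooling aggregation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaxPoolingMIL(nn.Module):
    """Simplest MIL head: a slide is a bag of patch feature vectors.

    Patch features come from a frozen pretrained encoder; only this
    lightweight head is trained, using slide-level labels alone.
    """

    def __init__(self, feat_dim: int = 1024, n_classes: int = 3):
        super().__init__()
        self.patch_scorer = nn.Linear(feat_dim, n_classes)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:
        logits = self.patch_scorer(bag)      # (n_patches, n_classes)
        # Max pooling: the most indicative patch determines the bag label.
        slide_logits, _ = logits.max(dim=0)  # (n_classes,)
        return slide_logits

# A hypothetical slide with 500 patches encoded as 1024-dim features.
bag = torch.randn(500, 1024)
print(MaxPoolingMIL()(bag).shape)  # torch.Size([3])
```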
In this study, we utilized pathology foundation models as encoders to extract features from image patches and employed MIL to aggregate patch-level features for classifying expert pathologist-derived disease categorizations and predicting a clinical outcome. Rather than conducting detailed assessments of glomerular lesions commonly performed in prior studies, we aimed to investigate whether slide-level supervision can identify diagnostically informative patches without patch-level labels. We therefore focused on acute interstitial nephritis (AIN) and diabetic kidney disease (DKD), in which discriminative pathological findings are distributed across the slide, and we evaluated overt proteinuria (albuminuria ≥ 300 mg/gCre or proteinuria ≥ 1000 mg/gCre) as the clinical outcome. We benchmarked the performance of pathology foundation models against ImageNet-pretrained ResNet5013, a widely used CNN-based model, and assessed robustness to inter-institutional variability by comparing performance with and without stain normalization.
Methods
Dataset
We conducted two analyses: a main disease-classification analysis with external validation and an exploratory clinical-outcome prediction analysis using a single dataset without external validation.
In the main analysis, we developed a diagnostic model for classifying healthy control (HC), AIN, and DKD using H&E-stained WSIs (Fig. 1a). Training data were obtained from two publicly available datasets: the Kidney Precision Medicine Project (KPMP, accessed on 20th January 2025)14 and the Japan-Pathology Artificial Intelligence Diagnostics Project (JP-AID, accessed on 20th January 2025)15. The KPMP dataset comprised 26 HC slides from 21 patients, 26 AIN slides from 12 patients, and 151 DKD slides from 70 patients. From the JP-AID database, we collected H&E-stained WSIs comprising 21 HC slides, nine DKD slides, and nine AIN slides. For external validation, an independent dataset of H&E-stained WSIs was obtained from biopsy slides collected between 2009 and 2025 at the University of Tokyo Hospital (the UT dataset). The UT dataset included 39 HC slides, 17 AIN slides, and 30 DKD slides. Patients with other coexisting kidney diseases were excluded. Diagnoses were based on adjudicated diagnoses (KPMP) and expert pathologist-derived diagnoses (JP-AID and UT). Most HC cases were derived from transplanted kidney biopsies. A detailed dataset description and selection flowcharts are presented in the Supplementary Methods and Figures S1–S3. This study protocol adhered to the Declaration of Helsinki. The Research Ethics Committee of the Faculty of Medicine of the University of Tokyo approved the study protocol (approval number 2024526NI) and waived the requirement for written informed consent; instead, all participants were given the opportunity to opt out of the study.
Overview of study cohorts and workflow of slide-level classification using multiple instance learning. (a) The diagram illustrates the diagnostic model training and evaluation process. The KPMP dataset (203 H&E-stained WSIs from 103 patients in total) and the JP-AID dataset (39 H&E-stained WSIs from 39 patients in total) across three diagnostic categories were used for model training and internal validation via fivefold cross-validation. The development cohort was randomly split into five subsets while ensuring a balanced distribution of diagnostic categories across the folds. Multiple slides from the same patient were kept in the same subset. In each fold, four subsets were used for training, and the remaining subset was used for validation. The five trained models were subsequently evaluated on the UT dataset, an external validation cohort comprising 86 H&E-stained WSIs from 86 patients. (b) DKD cases from the KPMP dataset were used for developing the classification model for overt proteinuria in the exploratory analysis. Overt proteinuria was defined as proteinuria ≥ 1000 mg/gCre or albuminuria ≥ 300 mg/gCre. For proteinuria classification, six slides from three patients were excluded due to missing data. (c) The workflow for WSI analysis involves four steps: (1) Patch extraction is conducted by dividing WSIs into patches (256 × 256 pixels) at 20× magnification. Background patches are removed during preprocessing. (2) Feature extraction is performed using pretrained encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2). (3) The extracted patch features are aggregated using four MIL methods (max pooling, ABMIL, TransMIL, and CLAM). (4) The slide-level classifier predicts the diagnostic category. KPMP Kidney Precision Medicine Project, H&E hematoxylin and eosin, WSIs whole slide images, JP-AID Japan-Pathology Artificial Intelligence Diagnostics, HC healthy control, AIN acute interstitial nephritis, DKD diabetic kidney disease, UT the University of Tokyo Hospital, MIL multiple instance learning, ABMIL attention-based MIL, TransMIL transformer-based MIL, CLAM clustering-constrained attention multiple instance learning.
In the exploratory analysis, we evaluated our pipeline's ability to predict a clinical outcome (overt proteinuria, defined as albuminuria ≥ 300 mg/gCre or proteinuria ≥ 1000 mg/gCre) using DKD slides from the KPMP dataset (Fig. 1b). Six slides from three cases were excluded because of missing values. External validation using the UT dataset was not performed in this exploratory analysis owing to significant class imbalance, as most DKD cases in the UT dataset had overt proteinuria.
Image preprocessing
To prepare images for analysis, each WSI was divided into non-overlapping tiles using Slideflow version 3.0216 with the Libvips backend at 20× magnification (128 μm, 256 pixels). Background was removed using Otsu's thresholding17, Gaussian blur filtering (sigma = 2, threshold = 0.02), and whitespace filtering (< 45% tissue content). Slides containing fewer than 100 patches were excluded to ensure consistent image quality and diagnostic reliability. Thumbnails of representative slides from each dataset are shown in Supplementary Figure S4. Variations in staining protocols and slide scanning methods introduce color heterogeneity, posing challenges in digital pathology analysis18. However, recently developed pathology foundation models, trained on pathology slides with diverse color variations, have demonstrated robust performance even without stain normalization7,9,19,20. Given these findings, we primarily analyzed the images without stain normalization to assess the robustness of the foundation models, and in the main analysis we compared performance with and without stain normalization.
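As an illustration of this background-filtering step, the sketch below re-implements Otsu thresholding with Gaussian smoothing on a single patch using scikit-image; the thresholds mirror those reported above, but the actual pipeline relied on Slideflow's built-in quality-control filters, and applying Otsu per patch rather than per slide is a simplifying assumption.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gaussian, threshold_otsu

def keep_patch(patch_rgb: np.ndarray,
               sigma: float = 2.0,
               min_tissue_fraction: float = 0.45) -> bool:
    """Return True if a patch contains enough tissue to keep.

    Tissue appears darker than the bright glass background on H&E
    slides, so pixels below the Otsu threshold are counted as tissue.
    """
    gray = gaussian(rgb2gray(patch_rgb), sigma=sigma)
    tissue_mask = gray < threshold_otsu(gray)
    # Whitespace filtering: drop patches with < 45% tissue content.
    return tissue_mask.mean() >= min_tissue_fraction
```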
Multiple instance learning
For slide-level classification, we adopted the MIL approach. MIL is a weakly supervised learning framework that uses slide-level labels rather than instance-level labels for training. In the MIL framework, each WSI is treated as a bag with a slide-level label, whereas the individual patches within the slide, which serve as instances, are unlabeled. The overall workflow is illustrated in Fig. 1c. First, feature vectors were extracted from the patches using pretrained pathology foundation models, including UNI7, UNI2-h7, Prov-Gigapath10, Phikon11, Virchow8, and Virchow29, all of which were pretrained on large-scale pathology slides using SSL. As a baseline, we used ResNet5013 pretrained on ImageNet. A summary of the encoders is presented in Supplementary Table S1. ResNet50 was instantiated from the Torchvision library with ImageNet-pretrained weights21. The pretrained weights of the pathology foundation models were downloaded from Hugging Face22. In this study, we employed four aggregation methods: max pooling, attention-based MIL (ABMIL)23, transformer-based MIL (TransMIL)24, and clustering-constrained attention multiple instance learning (CLAM)25. Max pooling assumes that the single most indicative patch in a bag determines the bag label. ABMIL uses an attention mechanism to perform weighted aggregation of patch features, with weights learned by a neural network. TransMIL utilizes a transformer mechanism to learn spatial relationships between patches via self-attention. CLAM incorporates instance-level clustering into its attention mechanism to enhance feature representation learning. Of the two CLAM variants, we employed the multi-branch variant (CLAM-MB), which computes class-specific attention weights for each class. In the exploratory analysis, patch features were aggregated using CLAM.
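To illustrate the attention-based aggregation at the core of ABMIL (and, with class-specific branches plus an instance-level clustering loss, of CLAM), the sketch below implements gated attention pooling following Ilse et al.23; layer sizes are illustrative assumptions rather than the study's configuration.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention pooling in the spirit of ABMIL (sketch).

    Each patch receives a learned weight, and the slide embedding is
    the attention-weighted average of patch features. The same
    attention scores can later be rendered as a heatmap over the slide.
    """

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 256,
                 n_classes: int = 3):
        super().__init__()
        self.V = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.U = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, bag: torch.Tensor):
        # bag: (n_patches, feat_dim)
        scores = self.w(self.V(bag) * self.U(bag))  # (n_patches, 1)
        attn = torch.softmax(scores, dim=0)         # weights sum to 1
        slide_embedding = (attn * bag).sum(dim=0)   # (feat_dim,)
        return self.classifier(slide_embedding), attn.squeeze(-1)
```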
Model development and validation
Each model was trained to predict the diagnostic category (main analysis) or overt proteinuria (exploratory analysis). The hyperparameters used for model development are listed in Supplementary Table S2. To develop the models, we performed fivefold cross-validation, in which the dataset was randomly split into five subsets while ensuring a balanced distribution of diagnostic categories across the folds. Multiple slides from the same patient were kept in the same subset. In each fold, four subsets were used for training and the remaining subset for validation; this process was repeated five times. Each model was trained for 50 epochs per fold, and the checkpoint with the lowest validation loss was selected. In the main analysis, the five trained models were subsequently validated on the external UT dataset to assess their generalizability. All computations were performed using Python 3.9.12 and PyTorch 1.11.0. Details of the computational environment are provided in the Supplementary Methods.
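The splitting behavior described above (class-stratified folds that keep all of a patient's slides together) matches what scikit-learn's StratifiedGroupKFold provides, as in the sketch below; the paper does not state which implementation was actually used, and the toy labels are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Toy metadata: 20 slides across 3 diagnostic classes; some patients
# (e.g., patients 0, 5, and 11) contribute more than one slide.
slide_labels = np.array([0, 0, 0, 0, 0, 0,        # HC
                         1, 1, 1, 1, 1, 1, 1,     # AIN
                         2, 2, 2, 2, 2, 2, 2])    # DKD
patient_ids = np.array([0, 0, 1, 2, 3, 4,
                        5, 5, 6, 7, 8, 9, 10,
                        11, 11, 12, 13, 14, 15, 16])

# Stratify by diagnosis while grouping by patient, so slides from the
# same patient never straddle the train/validation boundary.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(
        cv.split(np.zeros(len(slide_labels)), slide_labels, patient_ids)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val slides")
```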
Visualization
Among the four aggregation methods, CLAM-based models were used for visualization. CLAM-MB leverages an attention mechanism to assign a weight to each patch. Based on these attention scores, an attention heatmap is generated to visualize diagnostically relevant regions within the slide. Because CLAM-MB employs a multi-branch design that learns class-specific attention weights, the model generates a distinct attention heatmap for each class label, highlighting the regions important for that classification. We used Slideflow-Studio16 to visualize attention heatmaps with the Turbo colormap, where warmer colors (yellow to red) represent higher attention scores and cooler colors (green to blue) indicate lower attention scores. Additionally, we identified the top nine most important patches for classification according to their attention scores.
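For intuition about how such heatmaps are rendered (Slideflow-Studio handles this internally), the matplotlib sketch below maps min–max-scaled attention scores through the Turbo colormap and alpha-blends them over a slide thumbnail; the function and its parameters are illustrative assumptions, not Slideflow's API.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

def attention_overlay(thumbnail: np.ndarray, attn_grid: np.ndarray,
                      alpha: float = 0.4) -> None:
    """Blend per-patch attention scores over a slide thumbnail.

    attn_grid holds one attention score per patch position. Scores are
    min-max scaled to [0, 1] and mapped through Turbo, so red marks
    high-attention regions and blue marks low-attention regions.
    """
    span = attn_grid.max() - attn_grid.min()
    scaled = (attn_grid - attn_grid.min()) / (span + 1e-8)
    heat = cm.turbo(scaled)[..., :3]  # drop the alpha channel, keep RGB
    h, w = thumbnail.shape[:2]
    plt.imshow(thumbnail)
    plt.imshow(heat, alpha=alpha, extent=(0, w, h, 0))  # stretch to fit
    plt.axis("off")
    plt.show()
```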
Statistical analysis
Quantitative or ordinal data are expressed as medians and interquartile ranges or as means and 95% confidence intervals (95% CI). Qualitative data are expressed as numbers and percentages. Performance metrics, including accuracy, macro-averaged F1 score, and area under the receiver operating characteristic curve (AUROC), were calculated to assess the predictive performance of the trained models. Accuracy is the proportion of correct classifications out of all classifications. The F1 score is the harmonic mean of precision (the proportion of true positives among all positive predictions) and recall (the proportion of true positives among all actual positive samples). The macro-averaged F1 score is the arithmetic mean of the per-class F1 scores. The results from the five cross-validation folds were aggregated and reported as the mean and 95% CI. All metric calculations and statistical analyses were conducted using the scikit-learn package.
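For reference, the three metrics can be computed from slide-level predictions with scikit-learn as sketched below; the toy labels and probabilities are invented, and the one-vs-rest setting for the multi-class AUROC is our assumption, as the paper does not specify the averaging strategy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy slide-level results from one validation fold
# (0 = HC, 1 = AIN, 2 = DKD); rows of y_prob sum to 1.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7],
                   [0.3, 0.2, 0.5],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.2, 0.2]])
y_pred = y_prob.argmax(axis=1)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted class mean
auroc = roc_auc_score(y_true, y_prob, multi_class="ovr")  # one-vs-rest
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}, AUROC={auroc:.3f}")
```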
Results
Main analysis
Clinical characteristics of the development cohort
The clinical characteristics of the KPMP and JP-AID cohorts used in the main analysis are presented in Table 1. In the KPMP cohort, 19 of 21 HC cases were zero-hour transplant kidney biopsies. In the JP-AID cohort, 19 of 21 HC cases were transplant kidney biopsies: 6 were zero-hour, 8 were one-hour post-transplant, and 5 were obtained beyond the perioperative period. The JP-AID cohort does not provide detailed clinical information other than age, sex, and diagnosis. For eGFR calculations in the KPMP dataset, where eGFR data are provided as ranges (e.g., 40–50 ml/min/1.73 m²), the midpoint of the range was used (e.g., 45 ml/min/1.73 m²). For AIN cases in the KPMP cohort, baseline eGFR values and AKI severity are presented; severity was graded using Kidney Disease: Improving Global Outcomes (KDIGO) AKI stages26, with four patients classified as Stage 2 and eight as Stage 3. In the KPMP cohort, DKD cases were characterized by older age and a higher prevalence of hypertension.
Model development and internal validation
Based on the development cohort (the KPMP and JP-AID datasets), we developed and evaluated models to predict the slide-level diagnosis (HC, AIN, or DKD) using fivefold cross-validation. We performed a comparative evaluation of seven encoders (ResNet50, UNI, UNI2-h, Phikon, Prov-Gigapath, Virchow, and Virchow2) with four MIL methods (max pooling, ABMIL, TransMIL, and CLAM). The results are summarized in Fig. 2. Among the MIL methods tested, CLAM consistently showed high performance across all seven encoders. In contrast, max pooling underperformed relative to the other MIL methods, although its performance did not decline when Virchow or UNI2-h was used as the encoder. Regarding encoders, all pathology foundation models outperformed ResNet50.
Model performance for disease classification in the development cohort. Performance metrics (Accuracy, F1 score, and AUROC) of models using seven encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2) and four aggregation methods (max pooling, ABMIL, TransMIL, and CLAM) in the internal validation cohort are presented. Data are reported as means and 95% confidence intervals. AUROC area under the receiver operating characteristic curve, ABMIL attention-based multiple instance learning, TransMIL transformer-based multiple instance learning, CLAM clustering-constrained attention multiple instance learning.
External validation
To investigate the robustness and generalizability of the developed models, we assessed their performance on the external validation cohort (the UT dataset). The characteristics of the UT dataset are described in Table 1. All HC cases in the UT dataset were one-hour transplant kidney biopsies. Compared with the development cohort, the AIN cases in the UT dataset consisted of older patients, with a greater proportion of females. Additionally, patients with DKD in the UT dataset had lower eGFR levels and a higher prevalence of overt proteinuria. The model performance is summarized in Fig. 3. The external validation results followed a similar trend to those observed in the development cohort: CLAM outperformed the other MIL methods, and all pathology foundation models surpassed ResNet50. For AUROC, max pooling-based models showed a substantial decline on external validation, whereas ABMIL, TransMIL, and CLAM largely maintained comparable performance. Exceptions were observed with ResNet50 and Phikon: for ResNet50, the decline was attenuated only with CLAM, and the remaining aggregation methods experienced comparable performance drops; for Phikon, relatively larger performance decreases with ABMIL, TransMIL, and CLAM were observed compared with the other pathology foundation models. For accuracy and F1 scores, max pooling also exhibited a pronounced performance drop, while ABMIL, TransMIL, and CLAM declined to a lesser extent. When UNI2-h or Virchow was used as the encoder, the magnitude of the decline with max pooling was similar to that of the other aggregation methods; with ResNet50, the decline was again attenuated only with CLAM. We next assessed the impact of stain variability by applying Macenko stain normalization to all patches27. After stain normalization, each model was trained on the development cohort and validated on the external validation cohort using CLAM. As shown in Table 2, stain normalization had a negligible impact on the performance of models based on UNI, UNI2-h, Virchow, and Virchow2, whereas the other encoder-based models demonstrated improved performance after stain normalization. These findings indicate the limited generalizability of ImageNet-pretrained ResNet50 and highlight the advantages of pathology foundation models, particularly UNI, UNI2-h, Virchow, and Virchow2, for external applicability.
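The paper does not name the Macenko implementation used; as one possibility, the open-source staintools package exposes a Macenko normalizer, sketched below with a hypothetical reference-patch filename that defines the target stain appearance.

```python
import numpy as np
import staintools  # assumes the staintools package and its documented API

# Fit the normalizer once on a reference patch (hypothetical filename),
# then map every patch onto its color distribution before encoding.
reference = staintools.read_image("reference_patch.png")
normalizer = staintools.StainNormalizer(method="macenko")
normalizer.fit(reference)

def normalize_patch(patch_rgb: np.ndarray) -> np.ndarray:
    """Project a patch's H&E stain vectors onto the reference's."""
    return normalizer.transform(patch_rgb)
```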
Model performance for disease classification in the external validation cohort. Performance metrics (Accuracy, F1 score, and AUROC) of models using seven encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2) and four aggregation methods (max pooling, ABMIL, TransMIL, and CLAM) in the external validation cohort are presented. Data are reported as means and 95% confidence intervals. AUROC area under the receiver operating characteristic curve, ABMIL attention-based multiple instance learning, TransMIL transformer-based multiple instance learning, CLAM clustering-constrained attention multiple instance learning.
Visualization of the developed models
To understand how the developed models identified diagnostically important pathological features, we sampled slides from patients diagnosed with HC, AIN, and DKD in the UT cohort. Using CLAM, which outperformed the other MIL methods, we evaluated the seven encoders. For each encoder, the model with the highest AUROC among the five folds in the external validation was used. We created attention heatmaps and identified the top nine patches with the highest attention scores per slide (Fig. 4). Analysis of these high-attention patches revealed that classification of HC generally relied on recognizing healthy tubules (Fig. 4a). Notably, the Phikon-based model did not assign high attention scores to tubular structures, which may explain its inferior performance in the external validation. Interestingly, UNI2-h-, Prov-Gigapath-, Virchow-, and Virchow2-based models highlighted patches exhibiting coagulative tubular cell necrosis. This is consistent with the fact that HC samples obtained from zero-hour or one-hour kidney transplant biopsies frequently show ischemia and/or ischemia–reperfusion injury. Because normal tubules may also appear to varying degrees in AIN and DKD, ischemia–reperfusion injury is more specific to transplanted kidneys. Regarding AIN classification (Fig. 4b), all models focused on regions of inflammatory cell infiltration, except for the ResNet50-based model, which incorrectly emphasized largely normal tubules. For diagnosing DKD (Fig. 4c), Virchow- and Virchow2-based models specifically focused on glomerular alterations, whereas models built on the other encoders highlighted tubulointerstitial and vascular changes. More detailed observations of heatmaps from the ResNet50- and Virchow2-based models are presented in Fig. 5. A sample slide was obtained from a patient diagnosed with AIN in the UT cohort (the same slide as shown in Fig. 4b). The Virchow2-based model correctly classified this slide as AIN, whereas the ResNet50-based model misclassified it as HC. Attention scores for each diagnostic label (HC, AIN, and DKD) were used to visualize the regions of focus for the models in the disease classification task. In the whole-slide view, the Virchow2-based model appropriately captured AIN-related histological features, with high attention scores (shown in red) covering most of the tissue in the attention heatmap for AIN (Fig. 5a). In Region A, where the tissue structure remained relatively intact, the attention heatmap for HC assigned high scores, whereas the attention maps for AIN and DKD assigned low scores. Region B exhibited prominent inflammatory cell infiltration within the interstitium, with high attention scores for AIN and low scores for HC. The attention map for DKD showed elevated scores only in the glomerular regions. These findings indicate that the Virchow2-based model successfully identified the key histological patterns associated with each diagnostic category. In contrast, the attention heatmaps generated by the ResNet50-based model failed to capture diagnostically important structures, leading to misclassification of this slide as HC (Fig. 5b). While this model correctly assigned high attention scores to Region A for HC classification, it also assigned high scores to Region B, indicating a failure to recognize inflammatory cell infiltration.
Attention heatmaps and top nine high-attention patches. HC (a), AIN (b), and DKD (c) slides in the UT dataset were used to visualize how each model built on the seven encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2) extracted features from image patches. Attention scores for the true label (HC, AIN, or DKD) were used to visualize attention heatmaps with the Turbo colormap, where warmer colors (yellow to red) represent higher attention scores and cooler colors (green to blue) indicate lower attention scores. The top nine patches with the highest attention scores were identified. Scale bar = 500 μm. HC healthy control, AIN acute interstitial nephritis, DKD diabetic kidney disease.
Attention heatmaps of models using Virchow2 and ResNet50 in the disease classification analysis. A sample slide was obtained from a patient diagnosed with AIN in the UT cohort. Attention scores for each diagnostic label (HC, AIN, and DKD) were used to visualize the regions of focus for the models in the disease classification task. Heatmaps were generated using the Turbo colormap. Attention heatmaps of the Virchow2-based model (a) and the ResNet50-based model (b) for each label (HC, AIN, and DKD) are shown. The original H&E-stained images were used as references. H&E hematoxylin–eosin, HC healthy control, AIN acute interstitial nephritis, DKD diabetic kidney disease. Scale bar = 100 μm.
Exploratory analysis
In the exploratory analysis, we developed predictive models for overt proteinuria (albuminuria ≥ 300 mg/gCre or proteinuria ≥ 1000 mg/gCre) with DKD cases from the KPMP dataset using fivefold cross-validation. In this exploratory experiment, we employed CLAM as the MIL method. The results are summarized in Fig. 6. All foundation models showed higher performance than ResNet50; in particular, UNI2-h demonstrated the highest performance, with an AUROC of 0.872 (95% CI 0.794–0.949). We then visualized attention heatmaps to determine which histological structures were highlighted by each model. Attention heatmaps from the ResNet50-based model and the best-performing UNI2-h-based model are shown in Fig. 7. The slide was derived from a patient with overt proteinuria, and both models correctly classified it as "overt proteinuria". Whereas the ResNet50-based model assigned high attention scores across most patches, the UNI2-h-based model focused on patches containing glomeruli with mesangial matrix expansion and nodular lesions, as well as arteriolar hyalinosis.
Model performance for classification of overt proteinuria. Performance metrics (Accuracy, F1 score, and AUROC) of models using seven encoders (ResNet50, UNI, UNI2-h, Prov-Gigapath, Phikon, Virchow, and Virchow2) with CLAM in the KPMP-DKD cohort for the overt proteinuria classification task are presented. Data are reported as means and 95% confidence intervals. AUROC area under the receiver operating characteristic curve.
Attention heatmaps of models using ResNet50 and UNI2-h for overt proteinuria classification. Attention scores for classification of overt proteinuria were used to visualize the regions of focus in the model. Heatmaps were generated using the Turbo colormap. The sample slide was obtained from a patient diagnosed with DKD who had overt proteinuria. The original H&E-stained images were used as references. Glomeruli with mesangial matrix expansion and nodular lesions (arrows) and arteriolar hyalinosis (arrowhead) are noted in the enlarged view. H&E hematoxylin–eosin. Scale bar = 100 μm.
Discussion
In this study, we demonstrated that pathology foundation models pretrained on large-scale pathological image datasets are highly effective feature extractors for kidney pathology analysis compared with ResNet50 pretrained on ImageNet. Furthermore, we successfully integrated foundation models with the MIL framework to achieve high diagnostic performance without requiring patch-level annotations. MIL aggregates patch-level features without relying on patch-level labels, making it a practical approach for large-scale histopathological analysis, where detailed manual annotation is infeasible. Among the MIL methods evaluated, CLAM showed the best performance. CLAM aggregates patch-level features through attention pooling and calculates both slide-level classification and instance-level losses to better separate positive and negative instances, enabling the model to efficiently learn discriminative features from important patches.
For disease diagnosis prediction, the superiority of the foundation models was consistent when the models were applied to the external validation cohort from a different institution. A major challenge in computational pathology is inter-institutional staining variability, which has been shown to hinder robust feature extraction28. Remarkably, our findings indicate that the foundation models effectively extracted meaningful features even without stain normalization, demonstrating robustness against staining variations. This advantage was particularly evident when compared with ResNet50, which exhibited a marked performance drop in external validation. Among the pathology foundation models, UNI, UNI2-h, Virchow, and Virchow2 showed robust performance regardless of whether stain normalization was applied. This finding is consistent with a recent study showing greater robustness of UNI2-h and Virchow229. Attention heatmap analysis further revealed that ResNet50 was less effective at extracting pathological features than the pathology foundation models. Of note, Virchow and Virchow2 specifically focused on glomeruli in DKD classification, suggesting that these encoders successfully recognized glomerular structures. This finding may be attributed to their substantially larger training datasets compared with the other pathology foundation models. Understanding the factors that influence the performance of foundation models in kidney pathology analysis is crucial for optimizing pretraining strategies and developing more effective models.
Diagnosis of kidney pathology typically relies on a combination of multiple stains. In this study, we focused on AIN and DKD, both of which exhibit distinct morphological features on H&E-stained slides. This explains why the model for the disease classification task demonstrated good performance despite being trained solely on H&E-stained slides. For other kidney diseases, such as lupus nephritis and membranous nephropathy, integrating multiple stains may be crucial for improving diagnostic accuracy. Indeed, a recent study showed that combining multiple stains enhanced the prediction of clinical remission in lupus nephritis30. The foundation models used in this study were trained primarily on H&E slides. Therefore, to improve diagnostic performance in kidney pathology, developing domain-specific pathology foundation models trained on large-scale, multi-stain slides is an important next step.
In the exploratory analysis, we attempted to predict overt proteinuria in patients with DKD using the same pipeline as in the main analysis. The pathology foundation models again demonstrated higher performance than ResNet50. In the attention heatmap analysis, the ResNet50-based model assigned high attention scores across most patches, which may reflect overfitting to the training data. In contrast, the UNI2-h-based model focused on glomeruli with mesangial matrix expansion and nodular lesions, as well as arteriolar hyalinosis. As nodular lesions and severe arteriolar hyalinosis have been reported to be associated with higher proteinuria levels31, the UNI2-h-based model's focus on these regions is pathologically plausible. Although the analysis was limited to proteinuria prediction in this study, the model's ability to capture histologically meaningful features suggests that this approach could be extended to forecasting renal prognosis if sufficiently large WSI datasets linked to longitudinal clinical data were available.
This study has several limitations that warrant consideration. First, we used zero-hour and one-hour transplant kidney biopsies as healthy controls, which exhibited mild ischemic and/or ischemia–reperfusion injury, indicating that they may not fully represent normal kidney histology. Some models assigned high attention scores to coagulative tubular necrosis when classifying HC, likely reflecting the inclusion of transplant biopsies in the HC category. Note that zero-hour and one-hour transplant kidney biopsies are wedge biopsies; this sampling difference may have influenced model learning. Nevertheless, as shown in Fig. 5, the model correctly identified normal regions in the needle-biopsy AIN case. This may be attributed to the inclusion of needle-biopsy cases among the JP-AID HC cases. Second, although we excluded cases with coexisting kidney diseases based on pathology reports, some DKD and AIN slides contained atherosclerotic lesions, likely attributable to aging or hypertension, raising the possibility that the model learned hypertensive changes rather than diabetic changes. In fact, visualization analysis revealed that the models identified patches with interstitial fibrosis as important for DKD classification (Fig. 4c). Of note, the models based on Virchow and Virchow2 recognized glomerular alterations as important features for DKD diagnosis, indicating that the pathological features prioritized by the model may differ depending on the encoder. Third, our model was limited to classifying HC, AIN, and DKD. Since glomerular diseases such as IgA nephropathy and membranous nephropathy account for the majority of kidney biopsy cases, AIN and DKD cover only a minor subset; therefore, the present models are not immediately applicable in clinical practice. Although our results suggest that pathology foundation models can recognize certain pathological findings (e.g., mesangial matrix expansion, interstitial inflammation), their ability to distinguish fine-grained lesions, such as discriminating the mesangial hypercellularity with matrix expansion seen in IgA nephropathy from the mesangial matrix expansion seen in DKD, remains unknown. Developing such fine-grained classifiers typically requires lesion-level annotations, which impose a substantial labeling burden and restrict recognition to predefined pathological patterns. In contrast, the weakly supervised framework used in this study could identify histologic signals, including previously unrecognized features. Of note, such analyses require sufficiently large datasets, and any newly identified features would need validation with ancillary testing (e.g., immunohistochemistry) to establish their biological and clinical significance. Fourth, the relatively small number of slides used in this study may limit the generalizability of our findings. However, even with this limited training set, the pathology foundation models effectively extracted meaningful features and demonstrated good diagnostic performance in the external cohort.
In conclusion, we utilized pathology foundation models as feature extractors and demonstrated superior performance in the slide-level classification of kidney specimens compared with ImageNet-pretrained ResNet50. Their robustness against staining variability and inter-institutional differences highlights their potential for real-world clinical implementation. Since detailed annotations are not required, this approach has the potential to be extended to diagnosis of other diseases and prediction of clinical outcomes.
Data availability
The results here are in part based upon data generated by the Kidney Precision Medicine Project, accessed on January 20, 2025 (https://www.kpmp.org). KPMP is funded by the National Institute of Diabetes and Digestive and Kidney Diseases (Grant numbers: U01DK133081, U01DK133091, U01DK133092, U01DK133093, U01DK133095, U01DK133097, U01DK114866, U01DK114908, U01DK133090, U01DK133113, U01DK133766, U01DK133768, U01DK114907, U01DK114920, U01DK114923, U01DK114933, U24DK114886, UH3DK114926, UH3DK114861, UH3DK114915, UH3DK114937). The KPMP dataset used in this study is publicly available at https://atlas.kpmp.org/. The JP-AID dataset is available at https://jpaid.jp/ upon request, following the prescribed procedures. The UT dataset is not publicly available due to data protection regulations.
References
Jayapandian, C. P. et al. Development and evaluation of deep learning–based segmentation of histologic structures in the kidney cortex with multiple histologic stains. Kidney Int. 99, 86–101 (2021).
Holscher, D. L. et al. Next-Generation Morphometry for pathomics-data mining in histopathology. Nat. Commun. 14, 470 (2023).
Pilva, P., Bulow, R. & Boor, P. Deep learning applications for kidney histology analysis. Curr. Opin. Nephrol. Hypertens 33, 291–297 (2024).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. arXiv:2002.05709 (2020).
Abe, M. et al. Self-supervised learning for feature extraction from glomerular images and disease classification with minimal annotations. J. Am. Soc. Nephrol. 36, 471–486 (2025).
Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 (2020).
Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850–862 (2024).
Vorontsov, E. et al. Virchow: A million-slide digital pathology foundation model. arXiv:2309.07778 (2023).
Zimmermann, E. et al. Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology. arXiv:2408.00738 (2024).
Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).
Filiot, A. et al. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv 2023.07.21.23292757 (2023).
Li, J. et al. Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives. Med. Image Anal. 85, 102762 (2023).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) 770–778 (2016).
de Boer, I. H. et al. Rationale and design of the kidney precision medicine project. Kidney Int. 99, 498–510 (2021).
Sakai, Y. et al. Development of artificial intelligence to help pathological diagnosis: Japan Pathology AI Diagnostics (JP-AID) project. Impact 6, 40–42 (2019).
Dolezal, J. M. et al. Slideflow: Deep learning for digital histopathology with real-time whole-slide visualization. BMC Bioinformatics 25, 134 (2024).
Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979).
Schömig-Markiefka, B. et al. Quality control stress test for deep learning-based diagnostic model in digital pathology. Mod. Pathol. 34, 2098–2108 (2021).
Vorontsov, E. et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 30, 2924–2935 (2024).
Wolflein, G. et al. Benchmarking pathology feature extractors for whole slide image classification. arXiv:2311.11772v5 (2024).
TorchVision: PyTorch's computer vision library. GitHub repository, https://github.com/pytorch/vision (2016).
Wolf, T. et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771 (2019).
Ilse, M., Tomczak, J. M. & Welling, M. Attention-based deep multiple instance learning. arXiv:1802.04712 (2018).
Shao, Z. et al. TransMIL: Transformer based correlated multiple instance learning for whole slide image classification. arXiv:2106.00908 (2021).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
KDIGO Clinical Practice Guideline for Acute Kidney Injury. Kidney Int. Suppl. 2 (2012).
Macenko, M. et al. A method for normalizing histology slides for quantitative analysis. In Proc. IEEE Int. Symp. Biomed. Imaging: From Nano to Macro 1107–1110 (2009).
Boschman, J. et al. The utility of color normalization for AI-based diagnosis of hematoxylin and eosin-stained pathology images. J. Pathol. 256, 15–24 (2022).
de Jong, E. D., Marcus, E. & Teuwen, J. Current pathology foundation models are unrobust to medical center differences. arXiv:2501.18055 (2025).
Cheng, C. et al. Multi-stain deep learning prediction model of treatment response in lupus nephritis based on renal histopathology. Kidney Int. 107, 714–727 (2025).
Rodriguez-Rodriguez, R. et al. The role of vascular lesions in diabetes across a spectrum of clinical kidney disease. Kidney Int. Rep. 6, 2392–2403 (2021).
Acknowledgements
We sincerely thank Prof. Matthias Kretzler from the Division of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, for his helpful advice. We also thank Dr. Isao Matsui from Osaka University, and Dr. Ryo Inuzuka, Dr. Masaya Sato, Dr. Hiroaki Ikushima, and Dr. Syohei Hanaoka from the University of Tokyo Hospital for their guidance and insightful discussions, which contributed to the completion of this study. We appreciate the help of Kenichi Hashimoto from the Division of Urology of the University of Tokyo in collecting the materials. This research was conducted using the JP-AID database of the Japanese Society of Pathology.
Funding
This study was supported by Cross-ministerial Strategic Innovation Promotion Program (SIP) on “Integrated Health Care System” Grant Number JPJ012425.
Author information
Authors and Affiliations
Contributions
YK and IM conceived and designed the study. YK, IM, and DY collected clinical data and kidney biopsy specimens. YK, IM, and HA digitized kidney biopsy sections. YK performed data analysis, deep learning, and visualization. YK wrote the manuscript. IM, SK, and TT edited the manuscript. SK, HK, TU, NT, TT, and MN supervised the study.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kurata, Y., Mimura, I., Kodera, S. et al. Multiple instance learning using pathology foundation models effectively predicts kidney disease diagnosis and clinical classification. Sci Rep 15, 35298 (2025). https://doi.org/10.1038/s41598-025-19297-9