Introduction

Histopathology remains the gold standard for diagnosing cancer and confirming malignancies1,2. Traditionally reviewed under light microscopes, pathology slides can now be digitized using whole slide imaging (WSI) for clinical sign-out, telepathology, and computer-assisted analysis2,3,4. While conventional computer-assisted image analysis methods have achieved moderate success, advances in artificial intelligence (AI) and deep learning are revolutionizing histopathological interpretation and computational pathology3,5. AI tools have been shown to reduce subjectivity, improve accuracy, and drive workflow efficiencies that support digital pathology adoption and quality of care6,7. Despite these clear benefits, computational pathology algorithms are not yet widely deployed.

Figure 1

Representation shift. Same-slide tissue regions scanned on two devices, Scanner 1 (Aperio) and Scanner 2 (Sakura VisionTek), show differences in patch-level embeddings despite identical spatial locations. These differences are quantified using novel representation-shift metrics applied to deep neural network (DNN)-derived feature vectors.

Deep learning models suffer performance degradation when applied to unseen or out-of-distribution data8,9. Scanner variation in histopathology images is a leading technical barrier to robust deployment of AI across pathology laboratories10,11,12,13. Different labs may support different scanner vendors, which creates differences in colour and noise distributions, among other “invisible” acquisition factors that can inadvertently affect deep learning algorithms. Figure 1 illustrates the variability in appearance when the same glass slide is scanned using two different scanners, which can lead to shifts in the model’s learned feature representations (“representation shift”). These variations can undermine the consistency and reliability of model performance4, creating inequities in patient care, as the same model would behave differently across scanners and laboratories.

Histopathology foundation models (FMs) have emerged to improve generalization14. These models leverage large-scale, unlabeled histopathology archives to learn rich and transferable representations14,15. FM generalization in pathology applications is only beginning to be explored, and to our knowledge, there are no studies that specifically assess the impact of scanner bias. A key innovation of this work lies in the evaluation of FM robustness using a novel dataset of identical histological slides scanned using two different WSI scanners. This setup enables a targeted analysis of performance variation and representational shifts arising from scanner variability.

A major drawback of FMs is their substantial demand for computational resources and large-scale clinical training datasets16,17,18. FMs have been reported to consume up to 35 times more power than traditional task-specific models, raising concerns about environmental sustainability and long-term feasibility19. These constraints can also limit the application of AI in low-resource healthcare settings and reduce overall efficiency.

In response to these challenges, we propose HistoLite, a lightweight self-supervised deep learning framework designed to be trained on smaller datasets using reduced computational resources. HistoLite incorporates a novel autoencoder-based contrastive learning architecture designed to be resource-efficient and capable of learning robust, generalizable feature representations. Autoencoders represent one of the most established and effective approaches for learning deep representations in a self-supervised manner20. They have been utilized successfully in a range of histopathology applications, including segmentation, detection, feature extraction, compact representation learning, and cross-modal representation learning21,22,23,24.

We introduce a novel framework to assess model sensitivity to scanner bias by quantifying representation shifts in identical pathology slides digitized using two distinct scanners. Leveraging this framework, we compare HistoLite against state-of-the-art (SOTA) FMs in breast cancer pathology images.

Related work

Foundation models in histopathology

Advanced self-supervised learning (SSL) techniques15,25,26,27,28,29,30 have enabled the development of FMs. SSL techniques have several advantages for histopathology image analysis, where annotated data is scarce, in that large unlabeled datasets can be leveraged to obtain robust representations that can be fine-tuned for downstream tasks. SSL generates supervisory signals directly from the data itself, rather than relying on expert-provided labels25,26,27,28. Recently, numerous FMs have emerged in the literature, driven by the growing popularity of SSL27,28. By removing the dependency on labeled data, these models can effectively utilize large datasets obtained from gigapixel WSIs. The latest FMs for histopathology, including UNI30, Hibou-B31, Virchow29, Virchow2/2G32, and Prov-GigaPath15, have utilized the DINOv2 self-supervised learning framework28, leveraging extensive in-house datasets. iBOT-Path33 used the iBOT SSL framework34, while the Hierarchical Image Pyramid Transformer (HIPT)35 and PathDino36 used the DINO SSL framework27. In addition to SSL Vision Transformer (ViT) FMs, there exists a Convolutional Neural Network (CNN) called KimiaNet37, which is built upon the DenseNet-121 architecture and has been trained in a supervised manner using the entire TCGA dataset. These large-scale FMs are trained on extensive graphics processing unit (GPU) clusters with large datasets15,29,30,31,32. Table 1 provides a summary of all models, including model architecture, parameters, patch sizes, number of organs, number of WSIs and patches used for training, the source of the WSIs, and the datasets from which they were obtained.

Table 1 Histopathology FMs available in the literature.

Domain shift in histopathology

A data domain is defined as the joint distribution of feature and label spaces for the source in-domain (ID) data used to train a model, and the target out-of-domain (OOD) data unseen during training8. Domain shift occurs when the distributions of the source and target domains differ, and may manifest as covariate, prior, posterior, or class-conditional shifts8. In histopathology, covariate shift (differences in image appearance due to staining, scanning, or pre-processing) is the most common38, and has been widely recognized as a critical technical challenge39,40,41. Such shifts mean that datasets acquired from different laboratories or imaging centers can yield markedly different model performance, raising the risk of healthcare inequities.

Domain Generalization (DG) techniques address domain shift using only source domain data, without access to target data42,43,44. This is important for translation to clinical use, as models can be applied widely and robustly at new imaging centres without the need to collect data and labels or to fine-tune, which can have large regulatory implications. This is in contrast to Domain Adaptation (DA) techniques, which have access to both the source and target data9.

In computational pathology, a variety of DG methods have been proposed and can be broadly categorized based on their underlying mechanisms. Domain alignment strategies such as stain normalization45,46, the use of generative models47, and feature alignment48 aim to mitigate domain shift by learning feature representations that are invariant across different data distributions. Data augmentation techniques47,49,50,51 are another family of DG methods that enhance model robustness by artificially expanding the training dataset using perturbations of existing samples. Domain separation52,53,54,55 focuses on decomposing the learned feature space into domain-invariant and domain-specific components. Meta-learning approaches56,57,58,59 seek to learn a generalizable learning algorithm itself, enabling a model to quickly adapt to new, unseen domains with minimal data. Ensemble learning41,60,61,62 leverages the collective power of multiple models to reduce the risk of overfitting and compensate for the weaknesses of a single model. Tailored model design63,64,65,66 involves developing specialized architectures optimized for specific tasks, which can lead to resource savings and better regularization against overfitting. Regularization training strategies67,68,69,70 constrain model complexity, thereby preventing overfitting and reducing the influence of irrelevant features. While many of these methods achieved modest results, newer approaches such as FMs, which are trained on large datasets of unlabeled source domain data, are demonstrating superior performance in histopathology tasks15,30,32, and have the potential to generalize due to the large datasets used for pretraining.

HistoLite: lightweight self-supervised model

This work proposes HistoLite, a lightweight self-supervised model designed to achieve strong generalization from substantially smaller datasets than those required by conventional FMs. It can be trained on a single standard GPU, prioritizing efficiency and accessibility.

HistoLite uniquely leverages a dual-stream contrastive autoencoder architecture with shared weights for improved generalization. Figure 2a shows the HistoLite architecture. In the first stream, the network learns features by reconstructing the original image, while the second stream processes an augmented version of the image that simulates realistic variations in stain, contrast, sharpness, and field-of-view (FOV). Compressed representations from the bottlenecks of both streams are aligned via a contrastive objective, encouraging the learning of domain-invariant representations. This is further supported by a novel rotation augmentation strategy called Adaptive HistoRotate, which dynamically adjusts rotational transformations to maximize robustness to orientation variability.

Figure 2

HistoLite SSL framework. (a) A dual-stream autoencoder-based self-supervised learning framework designed to achieve domain-agnostic representation learning. (b) HistoLite autoencoder design for both contrastive learning streams.

Model architecture

Autoencoders are one of the first self-supervised deep learning architectures that leverage the inherent structure of data to learn useful representations without explicit labels20. Autoencoders learn deep feature representations in an unsupervised manner by defining the reconstruction of the input as the primary learning objective. Figure 2b illustrates the proposed design of the lightweight CNN-based 2D autoencoder for both contrastive learning streams. The encoder progressively extracts 64, 128, 256, 384, and 512 2D feature maps through convolutional layers. The decoder reconstructs the input image by sequentially reducing the number of feature maps from 512 to 64. For input images of size \(512 \times 512\), HistoLite consists of 41M parameters across the autoencoder streams and predictive modules. For an individual autoencoder (Fig. 2b), the parameter count is 17.2M, with the encoder accounting for 7.6M parameters. This architecture ensures precise image reconstruction while generating a compact, robust representation suitable for downstream tasks. The dual-stream contrastive learning framework promotes feature alignment and invariance to domain-specific augmentations (see Loss Functions).
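The channel progression above can be sketched in PyTorch as follows. The kernel sizes, strides, normalization, and activations are assumptions (the text specifies only the feature-map widths), so the parameter count of this sketch will differ from the reported 17.2M:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Conv + BatchNorm + ReLU with stride-2 downsampling (one plausible
    # realization; the paper states only the channel widths)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

def deconv_block(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class LiteAutoencoder(nn.Module):
    """Encoder widens 3 -> 64 -> 128 -> 256 -> 384 -> 512; decoder mirrors it."""
    def __init__(self):
        super().__init__()
        widths = [3, 64, 128, 256, 384, 512]
        self.encoder = nn.Sequential(
            *[conv_block(a, b) for a, b in zip(widths, widths[1:])])
        rev = widths[::-1]
        self.decoder = nn.Sequential(
            *[deconv_block(a, b) for a, b in zip(rev[:-2], rev[1:-1])],
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # reconstruct RGB in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)          # (B, 512, H/32, W/32) bottleneck maps
        return self.decoder(z), z

x = torch.randn(1, 3, 512, 512)
recon, z = LiteAutoencoder()(x)
```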

Representation vector

A compact 1D representation of the feature maps is required for the output of the encoder, which is used for feature alignment across dual contrastive learning streams, fine-tuning for downstream tasks, and for analyzing the representation shift. Similar to other small FMs, we define the model embedding to be length 384. The 512 2D feature maps are processed using a \(2\times 2\) Max Pooling followed by global average pooling. The resultant vector is then passed through a single multilayer perceptron (MLP) layer with 384 units, yielding the final encoded vectorized representation of the image. The 384-dimensional vector representation generated by the autoencoder is passed through a predictor MLP that first expands it to 1536 dimensions before reducing it back to 384 dimensions. This \(4\times\) expansion and subsequent reduction facilitate better alignment of the representations, enabling the predictor to model more complex relationships and promote effective feature alignment. The predictor MLP comprises 3.5M trainable parameters.
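A minimal sketch of the embedding head and predictor described above, assuming standard PyTorch layers; details beyond the stated widths (pooling order, activation in the predictor) are assumptions:

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """512 feature maps -> 2x2 max pooling -> global average pooling ->
    a single 384-unit MLP layer, yielding the encoded representation."""
    def __init__(self, in_maps=512, dim=384):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_maps, dim)

    def forward(self, feature_maps):
        x = self.gap(self.pool(feature_maps)).flatten(1)  # (B, 512)
        return self.fc(x)                                 # (B, 384)

class Predictor(nn.Module):
    """Predictor MLP: 4x expansion (384 -> 1536) then reduction back to 384."""
    def __init__(self, dim=384, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * expansion),  # 384 -> 1536
            nn.ReLU(inplace=True),            # activation choice is assumed
            nn.Linear(dim * expansion, dim),  # 1536 -> 384
        )

    def forward(self, z):
        return self.net(z)

z = EmbeddingHead()(torch.randn(2, 512, 16, 16))
p = Predictor()(z)
```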

Loss function

Chen et al.26 introduced the concept of feature alignment through predictor similarity in a Siamese network, using a stop-gradient mechanism to improve training stability and effectiveness. Inspired by this, a similar strategy is implemented to align HistoLite’s representations, as shown in Fig. 2a. A predictor is applied to the output of each autoencoder, and training minimizes the discrepancy between these predicted representations. This encourages consistency and robust feature learning while preserving the unique characteristics of each stream. The method eliminates the need for negative pairs or a momentum encoder, and remains effective with standard batch sizes, thereby eliminating the dependency on large-batch training. As a result, the approach is resource-efficient and well-suited for typical computational environments.

To ensure similarity in the features learned by the autoencoders, the Mean Squared Error (MSE) loss is applied to the decoder outputs from both autoencoders as well as to the representations generated by the corresponding bottlenecks, resulting in three MSE losses:

$$\begin{aligned} \mathcal {L}_{\text {total MSE}} = \mathcal {L}_{\text {S1 MSE}} + \mathcal {L}_{\text {S2 MSE}} +\mathcal {L}_{\text {Sim MSE}}, \end{aligned}$$
(1)

where \(\mathcal {L}_{\text {S1 MSE}}\) represents the MSE similarity loss associated with the reconstruction of the original images in stream one, \(\mathcal {L}_{\text {S2 MSE}}\) represents the MSE similarity loss associated with the reconstruction of the augmented images in stream two, \(\mathcal {L}_{\text {Sim MSE}}\) represents the MSE similarity loss associated with the representations generated from stream one and stream two, and \(\mathcal {L}_{\text {total MSE}}\) represents the total MSE loss combined for backpropagation.
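The three-term loss of Eq. (1) can be sketched as below, assuming a SimSiam-style symmetric stop-gradient pairing between predictor outputs and detached bottleneck representations (the exact pairing is an assumption inferred from the cited Chen et al. strategy):

```python
import torch
import torch.nn.functional as F

def histolite_loss(x1, recon1, x2, recon2, z1, z2, p1, p2):
    """Total MSE loss: two reconstruction terms plus a similarity term.

    x1/recon1: original image and its reconstruction (stream one)
    x2/recon2: augmented image and its reconstruction (stream two)
    z1/z2: bottleneck representations; p1/p2: predictor outputs
    """
    l_s1 = F.mse_loss(recon1, x1)   # stream-one reconstruction loss
    l_s2 = F.mse_loss(recon2, x2)   # stream-two reconstruction loss
    # Align each predictor output with the *detached* representation of the
    # other stream (stop-gradient), averaged symmetrically (assumed form).
    l_sim = 0.5 * (F.mse_loss(p1, z2.detach()) + F.mse_loss(p2, z1.detach()))
    return l_s1 + l_s2 + l_sim

x = torch.rand(2, 3, 64, 64)
z = torch.rand(2, 384)
loss = histolite_loss(x, x, x, x, z, z, z, z)  # identical pairs -> zero loss
```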

Data augmentation

To capture the inherent variability of histopathology data, we applied a set of augmentations encompassing changes in orientation, staining, contrast, sharpness, FOV, and magnification. The objective is to perturb the data realistically to encourage learning of domain-invariant features across the two streams.

Rotation augmentation was applied to the inputs of both streams. Previously, Alfasly et al.36 proposed HistoRotate, a \(360^\circ\) rotation augmentation in which a large image is randomly rotated and then center-cropped to form the network input. Although this method improves rotational invariance and robustness71, it can generate repeated versions of the same patch. We address this limitation with Adaptive HistoRotate, which first performs a random crop from the larger image, followed by rotation, ensuring that a distinct patch is selected at each iteration (Fig. 3).

Stream two was further subjected to additional perturbations to enhance representation diversity and promote domain-invariant contrastive learning, analogous to Siamese26 and SimCLR72 frameworks. Specifically, Color Jitter was applied with brightness, contrast, saturation, and hue factors of 0.3, 0.3, 0.15, and 0.05, respectively, each with a probability of 1.0. Gaussian Blur used a \(7 \times 7\) kernel with \(\sigma \in [0.1, 3.0]\) and a probability of 0.3. Sharpness augmentation was applied with a factor of 2.0 with a probability of 0.3. Horizontal and vertical flips were applied independently with a probability of 0.3. For size variation, Magnification augmentation was implemented by random cropping to \(448^2\), \(480^2\), or \(512^2\) pixels. The augmentation parameters are determined based on empirical observation.

Figure 3

Proposed orientation augmentation method that incorporates the surrounding context and tissue patterns. A random region of size \(1024 \times 1024\) is initially selected from a larger \(1536 \times 1536\) image. This region is then subjected to a random 360-degree rotation. Finally, a \(512 \times 512\) region is cropped from the center of the randomly rotated \(1024 \times 1024\) region.

Experimental design

We evaluated the generalization performance of HistoLite and SOTA FMs through two experiments. First, zero-shot feature representations were used to quantify representation shift, i.e. the change in embeddings across scanners (covariate shift), as shown in Fig. 1. Second, we assessed downstream performance on automated tumour versus non-tumour patch classification using ground truth labels from slides scanned on different devices.

Datasets

Model evaluation was conducted with 111 FFPE H&E-stained breast cancer slides from the Ontario Institute for Cancer Research (OICR), comprising invasive ductal carcinoma, ductal carcinoma in situ, encapsulated papillary carcinoma, and invasive lobular carcinoma. Each slide was scanned on two devices—Aperio AT2 (40\(\times\), 0.25 mpp) and Sakura VisionTek (20\(\times\), 0.27 mpp)—yielding 222 WSIs. VisionTek images were upsampled to match Aperio resolution, and affine registration1 was performed using Aperio as reference (Fig. 1).

An expert pathologist annotated ten tumour and ten non-tumour HPFs (212,000 \(\mu\)m\(^2\) each) on Aperio WSIs; the same regions were mapped to VisionTek images via registration. From these, \(512\times 512\) pixel patches at 0.5 mpp were extracted at identical coordinates across scanners, producing 4,952 patches per scanner (2,959 tumour; 1,993 non-tumour), for a total of 9,904 aligned patches. This design enables controlled analysis of scanner-induced domain shift and model robustness.

HistoLite was trained on 2,761 public breast cancer WSIs: CAMELYON1673 (270 WSIs, 46,239 patches), CAMELYON1774 (997 WSIs, 136,837 patches), HEROHE75 (360 WSIs, 108,490 patches), and TCGA76 (1,134 WSIs, 253,456 patches). Patches (\(1536\times 1536\) pixels, 20\(\times\)) were selected to contain \(\ge\)80% tissue while excluding artifacts and blur, yielding 545,022 training patches.
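The \(\ge\)80% tissue criterion can be sketched as below, assuming a simple luminance threshold as the tissue detector (the text does not specify the detection method, so the threshold values are illustrative):

```python
import numpy as np

def tissue_fraction(rgb_patch, white_thresh=220):
    """Fraction of non-background pixels in an (H, W, 3) uint8 patch.
    Pixels brighter than `white_thresh` in mean intensity are treated as
    background glass (a heuristic, not the paper's actual detector)."""
    gray = rgb_patch.mean(axis=2)
    return float((gray < white_thresh).mean())

# A patch is kept if at least 80% of its pixels are tissue-like.
patch = np.full((1536, 1536, 3), 120, dtype=np.uint8)  # all tissue-like
keep = tissue_fraction(patch) >= 0.8
```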

HistoLite training

Figure 4

Training losses. Training loss curves for HistoLite over 14 epochs.

The model was trained for 14 epochs using the proposed SSL framework, with training halted once the loss reached a plateau. The training process used the Adam optimizer with a learning rate and weight decay, both set to \(1e^{-4}\). A batch size of four images was used, effectively doubled to eight images after data augmentation, as both the original and augmented images were input into the two network streams. Figure 4a presents the loss curves observed during training, which include the MSE loss for the first stream, representing the reconstruction error of the original input image; the MSE loss for the second stream, representing the reconstruction error of the augmented input image; and the similarity MSE loss, which enforces alignment between the representations generated by the bottlenecks of both streams to ensure consistency despite domain variations. These individual losses are combined to form a unified loss function used for backpropagation, enabling the network to optimize effectively and learn robust features.

Zero-shot feature representations across scanners

To analyze the generalization properties of HistoLite and the SOTA FMs, we conducted a novel assessment of the same tissue across scanners. The same tissue sample scanned by two different scanners is used as an input to the pre-trained models for inference. All models are used as is, i.e. zero-shot with no additional training or fine-tuning. The embeddings produced by each network serve as the model representations, which are subsequently compared pairwise with the embeddings of the patches from the two scanners to measure representation shift.

To quantify the representation shift, vector-, histogram- and cluster-based measures are proposed. Vector-based measures use the z-score normalized embeddings to standardize for different embedding lengths and ranges. Representation shift is measured using mean absolute error (MAE) and cosine distance between paired patches across scanners. The Kullback-Leibler (KL) divergence is used to measure the differences in the embedding probability distributions of paired patches from two scanners.
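The vector- and histogram-based measures can be sketched as follows; the histogram binning used for the KL divergence is an assumption, since the text does not specify it:

```python
import numpy as np

def representation_shift(E1, E2, bins=50, eps=1e-10):
    """Pairwise shift metrics between z-scored embeddings of the same
    patches from two scanners (E1, E2: n_patches x dim arrays)."""
    def zscore(E):
        return (E - E.mean(axis=0)) / (E.std(axis=0) + eps)
    Z1, Z2 = zscore(E1), zscore(E2)

    mae = np.abs(Z1 - Z2).mean(axis=1)                 # per-pair MAE
    cos = 1.0 - (Z1 * Z2).sum(1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1) + eps)

    # KL divergence between per-pair histograms of embedding values
    lo, hi = min(Z1.min(), Z2.min()), max(Z1.max(), Z2.max())
    kl = np.empty(len(Z1))
    for i, (a, b) in enumerate(zip(Z1, Z2)):
        p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
        q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
        p, q = p + eps, q + eps
        kl[i] = np.sum(p * np.log(p / q)) * (hi - lo) / bins
    return mae.mean(), cos.mean(), kl.mean()

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 384))
mae, cosd, kl = representation_shift(E, E)  # identical embeddings -> no shift
```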

Cluster-based metrics consider proximity of samples in a high-dimensional feature space and are used to investigate the compactness and separability of the embeddings across classes. In this work, the Calinski-Harabasz (CH) Index is used, which measures the ratio of between-cluster dispersion (the weighted sum of squared Euclidean distances of each class centroid to the overall centroid of the data) to within-cluster dispersion (the sum of squared Euclidean distances between the embeddings and their respective cluster centroids). The centroid (mean) and classes are based on scanner type and tissue class. This results in the following comparisons (with \(\downarrow\) and \(\uparrow\) denoting the desirable outcome, i.e. small or large):

  • Tumour CH \(\downarrow\): Compares tumour patches from the Aperio scanner to tumour patches from the VisionTek scanner (Tumour AP vs. Tumour VT).

  • Non-tumour CH \(\downarrow\): Compares non-tumour patches from Aperio to non-tumour patches from VisionTek (Non-Tumour AP vs. Non-Tumour VT).

  • Scanner CH \(\downarrow\): Compares all patches from Aperio to all patches from VisionTek.

  • Tissue CH \(\uparrow\): Compares all tumour patches (Tumour AP + Tumour VT combined) to all non-tumour patches (Non-Tumour AP + Non-Tumour VT).

To summarize the similarity and differences in the embeddings across tissues and scanners, we propose the Robustness Index (RI) as:

$$\begin{aligned} {\text {Robustness Index (RI)}} = \frac{2 \times (\text {Tissue CH})}{\text {Tumour CH} + \text {Non-Tumour CH}}, \end{aligned}$$
(2)

where the numerator represents the inter-class separation, specifically the difference between the centroids of the tumour and non-tumour clusters (tissue discrimination), and the denominator accounts for the intra-class compactness of each class across scanners (tissue-specific scanner discrimination), computed as the average CH for the tumour and non-tumour clusters.
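A hedged sketch of the CH comparisons and the RI of Eq. (2), using scikit-learn's `calinski_harabasz_score` for the two-cluster CH computations:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

def robustness_index(emb_ap_t, emb_vt_t, emb_ap_n, emb_vt_n):
    """CH comparisons and RI from (n, dim) embedding arrays of tumour (t)
    and non-tumour (n) patches from Aperio (AP) and VisionTek (VT)."""
    def ch(a, b):
        X = np.vstack([a, b])
        labels = np.r_[np.zeros(len(a)), np.ones(len(b))]
        return calinski_harabasz_score(X, labels)

    tumour_ch = ch(emb_ap_t, emb_vt_t)        # ideally small (scanner overlap)
    non_tumour_ch = ch(emb_ap_n, emb_vt_n)    # ideally small
    tissue_ch = ch(np.vstack([emb_ap_t, emb_vt_t]),
                   np.vstack([emb_ap_n, emb_vt_n]))  # ideally large
    ri = 2 * tissue_ch / (tumour_ch + non_tumour_ch)
    return tumour_ch, non_tumour_ch, tissue_ch, ri

# Synthetic check: scanner-invariant but tissue-separable embeddings -> high RI
rng = np.random.default_rng(0)
ap_t, vt_t = rng.normal(5, 1, (50, 8)), rng.normal(5, 1, (50, 8))
ap_n, vt_n = rng.normal(-5, 1, (50, 8)), rng.normal(-5, 1, (50, 8))
t_ch, n_ch, tis_ch, ri = robustness_index(ap_t, vt_t, ap_n, vt_n)
```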

Breast cancer classification across scanners

To evaluate the models on a downstream task, the network backbone was frozen for all models, and a two-class classification head (tumour vs. non-tumour) was added on top of the frozen feature extractor. This classification head was trained for 20 epochs using a cross-entropy loss and the Adam optimizer, with the learning rate and weight decay both set to \(1e^{-4}\). Data augmentation strategies were applied consistently across all linear probing models. These included color jitter, Gaussian blur, random sharpness, autocontrast, and horizontal and vertical flips, as defined in “Data augmentation”. No augmentations were applied to the validation or test sets. For models pretrained on \(224\times 224\) resolution (e.g., most ViT-based FMs), images were center-cropped to match the input size. In contrast, models such as PathDino, KimiaNet, and HistoLite were evaluated using \(512\times 512\) input patches.
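A minimal linear-probing sketch with the stated optimizer settings; the toy backbone and batch are placeholders standing in for any of the evaluated feature extractors and the real patch data:

```python
import torch
import torch.nn as nn

# Placeholder backbone producing 384-d features (stands in for any model).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))
for p in backbone.parameters():
    p.requires_grad = False       # freeze the feature extractor
backbone.eval()

head = nn.Linear(384, 2)          # two-class head: tumour vs. non-tumour
opt = torch.optim.Adam(head.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

x = torch.rand(8, 3, 32, 32)      # toy batch standing in for tissue patches
y = torch.randint(0, 2, (8,))
with torch.no_grad():
    feats = backbone(x)           # frozen features, no gradients
loss = criterion(head(feats), y)  # only the head is updated
opt.zero_grad(); loss.backward(); opt.step()
```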

The models were first trained on Aperio data and evaluated on both the Aperio test set (in-domain, ID) and the paired VisionTek set (out-of-domain, OOD) containing the same matched patches. In the second experiment, models were trained on VisionTek data and tested on the same paired datasets, with VisionTek as ID and Aperio as OOD.

Classification performance of the models is evaluated over five folds with an 80/20 training/testing split to predict tumour or non-tumour labels for each patch. The True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR) are measured. Accuracy is calculated as the ratio of correctly classified instances (true positives, correctly identified tumour patches, and true negatives, correctly identified non-tumour patches) to the total number of instances in the test dataset.
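The reported rates follow directly from the binary confusion matrix; a small sketch:

```python
import numpy as np

def classification_rates(y_true, y_pred):
    """TPR, FPR, TNR, FNR and accuracy for binary tumour (1) vs non-tumour
    (0) patch labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # correctly identified tumour
    tn = np.sum((y_pred == 0) & (y_true == 0))  # correctly identified non-tumour
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "TPR": tp / (tp + fn), "FNR": fn / (tp + fn),
        "TNR": tn / (tn + fp), "FPR": fp / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }

rates = classification_rates([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```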

The recently proposed generalization test for histopathology algorithms using the Two One-Sided Test (TOST) was conducted77. TOST is a statistical approach to assess performance equivalence by verifying whether performance differences between ID and OOD datasets fall within an automatically derived equivalence margin. This margin is determined using the standard error to compute the 95% confidence interval77.
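A simplified paired TOST sketch on fold-wise accuracies; here the equivalence margin is passed in directly rather than derived by the cited data-driven procedure, so this is an illustration of the testing logic only:

```python
import numpy as np
from scipy import stats

def tost_equivalence(acc_id, acc_ood, margin, alpha=0.05):
    """Paired TOST: ID and OOD accuracies (e.g. per cross-validation fold)
    are declared equivalent if both one-sided t-tests reject at level alpha,
    i.e. the mean difference lies within +/- margin."""
    d = np.asarray(acc_id) - np.asarray(acc_ood)
    n = len(d)
    se = d.std(ddof=1) / np.sqrt(n)
    df = n - 1
    p_low = 1 - stats.t.cdf((d.mean() + margin) / se, df)   # H0: diff <= -margin
    p_high = stats.t.cdf((d.mean() - margin) / se, df)      # H0: diff >= +margin
    return max(p_low, p_high) < alpha

# Fold accuracies nearly identical across domains -> equivalent within 2%
id_acc = [0.93, 0.94, 0.92, 0.935, 0.945]
ood_acc = [0.928, 0.938, 0.921, 0.933, 0.944]
equivalent = tost_equivalence(id_acc, ood_acc, margin=0.02)
```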

To determine the relationship between performance consistency and feature similarity, we introduce a novel comparative analysis that examines the change in ID and OOD performance of the models compared to the representation shift. To further understand the trade-offs in model designs, we also take into account the model size, considering its impact on both computational efficiency and generalization ability.

Results

Experiments were conducted on a Windows PC equipped with an Intel i9-12900K CPU, 64GB of RAM, and an NVIDIA RTX 3090 Ti GPU with 24GB of VRAM. Python version 3.9.19, PyTorch version 2.2.2, and CUDA version 11.8 were used.

Figure 5

t-SNE plots. Combined t-SNE projections of tumorous and non-tumorous (normal) patch embeddings from two scanners, Aperio (AP) and VisionTek (VT), are shown. Subfigures (a)–(j) correspond to the models: (a) HistoLite (ours), (b) KimiaNet, (c) PathDino, (d) HIPT, (e) iBOT-Path, (f) Hibou-B, (g) UNI, (h) Virchow, (i) Virchow2, and (j) Prov-GigaPath. All the plots are at the same scale.

Zero-shot feature representations across scanners

The same patches from the two scanners were processed by HistoLite and the comparative FMs using zero-shot feature representations, and the differences in embeddings were analyzed using the proposed metrics. Table 2 presents the mean and standard deviation of vector- and histogram-based similarity metrics, including MAE, cosine distance, and KL divergence. The distributions for MAE, cosine distance, and KL divergence are included in the supplementary material (see Figs. S1, S2 and S3, respectively). HistoLite achieves the lowest KL divergence, which indicates that the distributions of the embeddings are more similar across scanners, followed closely by HIPT. HIPT achieves the lowest MAE and cosine distance, followed closely by HistoLite. These results suggest that HistoLite and HIPT are learning more scanner-invariant features (most similar embedding magnitudes across scanners).

The cluster-based CH metric is reported in Table 3, which considers the degree of overlap in embeddings in a high-dimensional space with respect to tissue or scanner classes. HIPT consistently has the lowest CH over the Non-Tumour CH, Tumour CH and Scanner CH, and the highest Tissue CH, indicating good discrimination over tissue classes and feature overlap across scanners. HistoLite has moderate CH metrics across all categories, with good separation between tumour and non-tumour in Tissue CH. Consistently, Prov-GigaPath, Virchow2, UNI and iBOT-Path all demonstrate strong performance in the clustering metrics, indicating good generalization across scanners and good separability between tissue classes.

The RI shown in Fig. 6e and Table 3 indicates that HIPT has the highest robustness, followed by Prov-GigaPath, Virchow2, and then iBOT-Path. The RI of HistoLite is comparable to that of iBOT-Path. This reflects a tradeoff between tissue discriminability and scanner invariance. From the RI, HistoLite outperforms KimiaNet, PathDino, Hibou-B, UNI and Virchow. In addition, the embeddings are visualized using t-distributed Stochastic Neighbor Embedding (t-SNE) in Fig. 5. The t-SNE plots support the cluster-based metrics in feature separation and overlap for different classes and scanners.

Table 2 Representation shift metrics MAE, cosine distance, and KL divergence to quantify the similarity between the embeddings of corresponding OICR patches from two different scanners.
Table 3 Calinski–Harabasz (CH) Index, and Robustness Index (RI) for different scanners and tissue types for all models.
Figure 6

CH and robustness index. The CH Indexes are illustrated as follows: (a) Tumour CH Index, (b) Non-Tumour CH Index, (c) Scanner CH Index, and (d) Tissue CH Index. Additionally, the Robustness Index is shown in (e). This figure corresponds to Table 3.

Figure 7

Mean accuracy and performance drop. Mean accuracies across all ID and OOD evaluation scenarios, along with the mean performance drop for the corresponding OOD datasets. The top bar chart corresponds to the mean values in Table 4, and the bottom bar chart corresponds to the mean values in Table 5.

Breast cancer classification across scanners

Classification as a downstream task was evaluated using fine-tuned models to classify tumour and non-tumour tissue patches across scanners. The 4,952 patches from each scanner were split into five folds for cross-validation, with each network performing 10 classification experiments in total (five folds per scanner). Table 4 presents the mean classification accuracy for both ID and OOD scenarios. Table 5 reports the performance differences in ID and OOD data for both scanners. The corresponding mean accuracy and performance drop are also illustrated in Fig. 7. To enable a more comprehensive evaluation of model performance, we report additional metrics beyond accuracy. Specifically, fold-wise accuracy along with the average TPR, TNR, FPR, and FNR computed across the five cross-validation folds, are provided in the supplementary material (Tables S1 and S2, respectively).

Table 4 Mean classification accuracy over five folds: (1) Aperio training data (ID) with VisionTek as OOD data and (2) VisionTek training data (ID) with Aperio as OOD data.
Table 5 Difference in mean classification accuracy between ID and OOD: (1) Aperio training data (ID) with VisionTek as OOD, and (2) VisionTek training data (ID) with Aperio as OOD.
Figure 8

Cross-validation. Five-fold cross-validation accuracy results for all the models. (a) and (b) present the mean accuracy and standard deviation when Aperio and VisionTek were used as the ID datasets, respectively. (c) and (d) present the TOST analysis.

The best performing models are UNI, Virchow2, and Prov-GigaPath, possibly due to their large model size and training datasets. The drop in mean classification accuracy of these models is also among the lowest, indicating they perform well across scanners in downstream tasks. The proposed lightweight model, HistoLite, had an average classification accuracy of 91.8% across both ID and OOD datasets, which is higher than KimiaNet and HIPT. While HIPT and HistoLite had the smallest vector- and histogram-based representation shift, they exhibited lower mean accuracy in classification. This means the embedding vectors had similar magnitudes and distributions in zero-shot inference, but the features are less robust at differentiating tissue classes when fine-tuned for classification. Considering the performance drop across datasets, HistoLite had the smallest difference between ID and OOD, followed by HIPT. Therefore, although these models did not achieve the highest performance, their performance across ID and OOD datasets is more consistent, suggesting higher model reliability.

TOST was completed to investigate whether the performance on ID and OOD datasets was statistically equivalent, which may be used as a proxy for generalization. The equivalence margin is automatically computed as proposed by Varnava et al.77. The tissue classification accuracy for both training setups, as well as the TOST analysis with the data-driven bounds, are shown in Fig. 8. HistoLite has low mean differences and variance, and is contained within the equivalence bounds, indicating that HistoLite generalizes across scanners. In contrast, SOTA FMs such as KimiaNet, PathDino, and Virchow exceed the bound in both training scenarios, indicating these models experience significant performance drops on the OOD dataset.

To investigate the relationship between model performance and differences in embeddings, classification accuracy is correlated with the MAE representation shift and the RI, with circle sizes representing model size, as shown in Fig. 9. Across all models, the performance difference between ID and OOD data widens as the MAE representation shift increases. Despite being the smallest model, HistoLite exhibits a minimal performance drop with minimal representation shift across scanners (and modest classification performance). HIPT, which has approximately three times the parameters of HistoLite and is trained on 100 million more patches, exhibits comparable results. However, on the downstream task of tissue classification, HistoLite outperforms HIPT. Other FMs with considerably larger MAE and performance differences (but top classification accuracy) include Prov-GigaPath, Virchow2, UNI, and iBOT-Path. KimiaNet has both the largest performance drop and the largest MAE representation shift. Additionally, Fig. S4 in the supplementary file illustrates the relationship between performance differences and both cosine distance and KL divergence, along with trend lines for tumor, normal, and mixed tumor–normal cases.
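A minimal sketch of a patch-level MAE representation shift, assuming paired embeddings for the same tissue regions from two scanners (the embedding dimensions and the simulated scanner perturbation below are hypothetical, not the study's actual data):

```python
import numpy as np

def mae_representation_shift(emb_id, emb_ood):
    """Mean absolute error between paired ID and OOD patch embeddings.

    Rows are patches at identical spatial locations; columns are
    embedding dimensions. A larger value indicates a larger shift
    in the model's feature representation across scanners.
    """
    assert emb_id.shape == emb_ood.shape
    return float(np.mean(np.abs(emb_id - emb_ood)))

rng = np.random.default_rng(0)
z_id = rng.normal(size=(100, 384))                        # embeddings, scanner 1
z_ood = z_id + rng.normal(scale=0.05, size=z_id.shape)    # same patches, scanner 2
shift = mae_representation_shift(z_id, z_ood)
```

Because the same slides were scanned on both devices, the patches can be paired exactly, and the MAE directly quantifies how much the embedding of each region moves between scanners.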

Figure 9c,d shows the relationship between the ID-to-OOD accuracy difference and the RI. Prov-GigaPath has the highest robustness across both the Aperio and VisionTek scanners, with a low performance drop. HistoLite operates with a moderate RI and a low performance drop. KimiaNet has a low RI and the highest accuracy difference, indicating poor generalization. HistoLite, positioned in the middle, demonstrates a balanced profile, reflecting its ability to maintain a stable latent representation and consistency between ID and OOD data, with modest classification performance but greater efficiency. HIPT is not included in this figure due to its exceptionally high RI. This highlights that a single metric does not capture the complete picture, and it is essential to consider additional metrics for a comprehensive evaluation. Cosine distance and KL divergence, as metrics for representation shift, are provided in Fig. S4 of the supplementary materials.
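The two complementary shift metrics can be sketched as follows; this is an illustrative implementation under our own assumptions (row-paired embeddings, value histograms with shared bin edges), not necessarily the exact formulation used in the study:

```python
import numpy as np
from scipy.stats import entropy

def mean_cosine_distance(emb_a, emb_b):
    """Average cosine distance between paired row embeddings."""
    num = np.sum(emb_a * emb_b, axis=1)
    den = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return float(np.mean(1.0 - num / den))

def histogram_kl(emb_a, emb_b, bins=50, eps=1e-8):
    """KL divergence between histograms of (flattened) embedding values."""
    lo = min(emb_a.min(), emb_b.min())
    hi = max(emb_a.max(), emb_b.max())
    p, _ = np.histogram(emb_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(emb_b, bins=bins, range=(lo, hi))
    # eps avoids empty bins; scipy's entropy(p, q) normalizes the
    # counts and returns KL(p || q)
    return float(entropy(p + eps, q + eps))

rng = np.random.default_rng(0)
z_id = rng.normal(size=(200, 128))                       # scanner 1 embeddings
z_ood = z_id + rng.normal(scale=0.3, size=z_id.shape)    # simulated scanner shift
```

Cosine distance is sensitive to per-patch directional changes of the embedding vectors, while the histogram KL divergence captures distributional differences in the feature values, which is why the two metrics can rank models differently.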

Figure 9
Full size image

Performance vs representation shift/robustness index. The relationship between the average ID-OOD performance difference and the representation shift is illustrated in panels (a) and (b), while the relationship with the robustness index is depicted in panels (c) and (d). In (a) and (b), circle size represents the model's parameter count, providing a visual comparison of model scale, and the red line shows the trend.

Discussion

Domain variation in digital pathology, arising from differences in scanners, can lead to performance degradation and poses a significant challenge to the generalization of machine learning models. Recent large-scale FMs have attempted to solve this problem, but a recent study conducted by Mulliqi et al.19 raises concerns regarding the purported generalizability benefits of FMs. It was also found that FMs consume 35 times more energy and require significantly more data than conventional task-specific models.

In this study, we present HistoLite, a lightweight SSL framework designed to achieve domain-generalized representation learning in histopathology. Unlike resource-intensive FMs, HistoLite is optimized to operate efficiently on modest computational setups, such as a personal GPU computer. The framework incorporates a customizable autoencoder architecture, which can be tailored to fit the computational resources available, enhancing its adaptability.

A dataset was created from the same tissue slides scanned on two different scanners to study the variability arising from scanner differences (covariate shift). Representation shift was measured at the embedding level across scanners, which, to our knowledge, is a novel approach and the first work of its kind. The results can further our understanding of model generalizability. Model performance was further validated through a downstream classification task and correlated with representation shift. HistoLite exhibits the smallest (ID – OOD) performance drop, the second-lowest representation shift, and moderate classification accuracy. This performance profile is likely due to its smaller model size, which provides a favorable balance between accuracy and generalization. Although its downstream classification accuracy is moderate, HistoLite achieves the strongest cross-scanner generalization, as indicated by the reduced (ID – OOD) performance gap and further supported by the TOST analysis.

It was shown that HIPT has high similarity in embeddings across scanners, but its classification performance after fine-tuning is among the lowest. The clustering metrics showed good overlap across scanners and separation between tissue classes, yet HIPT did not perform as well on the downstream task. However, the generalization analysis using TOST showed that, while its performance was lower, it was consistent across scanners. These results for HistoLite and HIPT suggest a possible tradeoff between accuracy and generalization in smaller models: it may be possible to generate robust representations that are more consistent across scanners, but this may come with a reduction in performance.

The top-performing models in terms of classification accuracy are Prov-GigaPath, Virchow2, UNI, and iBOT-Path, each outperforming HistoLite by at least 4%. Among these, UNI achieves the smallest (ID – OOD) performance difference (1.55%), followed by Prov-GigaPath (1.74%) and Virchow2 (1.99%), while iBOT-Path exhibits the largest gap among the top performers (2.64%). These are all ViT-based models trained on multiple organs and large datasets, which could indicate that such configurations are better suited to achieving both top classification accuracy and generalization. In contrast, KimiaNet demonstrates the lowest accuracy and the poorest generalization, with a mean accuracy of 88.1% and the largest mean performance drop of 11.24%. KimiaNet is a relatively small CNN-based model trained using supervised learning. These factors may contribute to its reduced generalization and classification performance.

Our analysis reveals several important insights into the relationship between model characteristics, representation similarity, and cross-scanner performance. First, we observe that a small representation shift (i.e., high similarity between ID and OOD embeddings) does not necessarily guarantee superior classification accuracy; rather, it tends to be associated with a smaller performance gap between ID and OOD evaluations. This is an important observation, as it may suggest a trade-off between representation robustness and task-specific accuracy. This trend is illustrated in Fig. 5, which correlates representation shift with performance drop. The models with the smallest performance drop also had a small representation shift, but not necessarily the top classification accuracy, whereas the larger, top-performing models had a modest performance drop and representation shift. Similarly, a high RI does not inherently translate into the best performance. For example, while HIPT achieves the highest RI, it does not yield the top accuracy. These findings suggest that consistent performance is associated with better representation similarity and RI, but more investigation is required to ensure that such models also perform well on downstream tasks. The relationship between accuracy and generalization should be studied further, and models that can achieve both are an interesting avenue for future research.

Contrary to the common assumption that larger models inherently deliver better performance and generalization, our results show this is not always the case. For instance, UNI is smaller than Virchow/Virchow2, yet it outperforms Virchow in terms of generalizability, representation shift, and performance drop. Likewise, Virchow and Virchow2 share the same architecture, yet Virchow2 achieves better generalization and classification accuracy, which may be due to its multi-resolution training and the broader diversity of organ types in its training data. Interestingly, Prov-GigaPath, the largest model in our evaluation, performs comparably to smaller models such as Virchow2 and UNI, showing only a slight advantage in representation shift, robustness index, and accuracy.

Model comparisons further support these observations. Hibou-B and iBOT-Path are similar in size, yet iBOT-Path exhibits stronger generalization, as evidenced by its smaller performance drop and reduced representation shift. Among the smaller models, HistoLite and KimiaNet have comparable sizes, while PathDino is slightly larger. HistoLite demonstrates superior cross-scanner generalization, as indicated by both its lower representation shift and smaller performance drop, whereas PathDino achieves higher classification accuracy (by approximately 1%) but exhibits a larger performance drop (6.33%). This slight gain in classification accuracy therefore comes at the expense of scanner robustness. In contrast, KimiaNet shows the weakest performance overall, likely due to its supervised training paradigm and CNN-based architecture, compared with the self-supervised, transformer-based approaches used by the other evaluated models. Notably, while HistoLite also employs a CNN-based encoder, its self-supervised training enables it to outperform KimiaNet.

A limitation of this work may be the differences in input patch sizes between HistoLite and the evaluated FMs. HistoLite operates on 512\(\times\)512 patches, while most FMs, including the ViT-based architectures, use 224\(\times\)224 patches at 20\(\times\) magnification. We chose to employ these models in their off-the-shelf form without fine-tuning or altering their patch size. It was not our intention to optimize the performance of each model individually, but rather to apply them as designed. To ensure a fair comparison, the pixel resolution of the patches was the same for all models (which may be just as important as the patch size). Harmonizing patch sizes in future work could yield more standardized results for isolating further architectural differences.

Going forward, we aim to explore the proposed HistoLite framework with alternative architectures, such as ViTs, to investigate how the attention mechanisms in ViTs can learn domain-invariant features using the proposed SSL framework. This could provide insights into whether attention-based models further enhance the framework's ability to generalize across domains. To generalize the findings more broadly, the framework could be evaluated on datasets from various organs, as this study focused exclusively on breast tissue and may not fully represent other anatomies. However, since the same tissue slides were acquired by both scanners, the remaining variability was likely related to scanner bias alone. Expanding the evaluation to include datasets from organs such as lung, colon, and prostate will enable a more comprehensive assessment of generalizability, robustness, and fairness across diverse biological and clinical contexts.

Another potential future direction involves refining the representation alignment approach. Specifically, we propose replacing the current predictor, which is based on a Siamese network26, with a DINO head27,28. By incorporating the DINO head, it would be possible to examine how it aligns features and whether it enhances the performance of the framework. However, this approach would introduce a higher number of learnable parameters, which could impact the computational efficiency and may require careful resource management during training.

Conclusion

Developing models that generalize across scanners is critical for the translation of computational pathology algorithms. FMs have been developed as tools to overcome these challenges. However, training large FMs is often inaccessible due to limitations in data availability and computational resources. Additionally, the generalization abilities of these models across different scanning platforms are largely understudied. To address these challenges, we propose HistoLite: a lightweight, self-supervised representation learning framework that is resource-efficient, customizable, and robust to covariate shifts. HistoLite is designed to be trained on a personal GPU, making it accessible to researchers and institutions with limited resources. Due to the feature alignment of augmented images and its dual-stream autoencoder, the framework is designed to be robust to data variability.

The experimental results demonstrate that HistoLite achieves performance comparable to larger models trained on millions of patches using extensive GPU clusters. Moreover, HistoLite has consistent performance across scanners, highlighting its robustness. This framework offers a practical solution for smaller healthcare facilities or research units with limited data and computational capabilities. Depending on the available resources and domain requirements, one must carefully consider whether to develop a lightweight, domain-specific model with scanner robustness or opt for a larger model with slightly higher accuracy that demands significantly greater resources and data.