Abstract
Normative representation learning focuses on understanding the typical anatomical distributions from large datasets of medical scans of healthy individuals. Generative Artificial Intelligence (AI) leverages this attribute to synthesize images that accurately reflect these normative patterns. This capability enables AI models to detect and correct anomalies in new, unseen pathological data without the need for expert labeling. Traditional evaluations often focus on anomaly detection performance alone, overlooking the crucial role of normative learning. In our analysis, we introduce novel metrics specifically designed to evaluate this facet of AI models. We apply these metrics across various generative AI frameworks, including advanced diffusion models, and rigorously test them against complex and diverse brain pathologies. In addition, we conduct a large multi-reader study to compare these metrics to experts’ evaluations. Our analysis demonstrates that models proficient in normative learning exhibit exceptional versatility, adeptly detecting a wide range of unseen medical conditions. Our code is available at https://github.com/compai-lab/2024-ncomms-bercea.git.
Introduction
The continuous advancement in medical imaging technology has markedly enhanced our ability to diagnose diseases. Yet, this progress presents a new challenge: extracting actionable insights from the vast volumes of medical imaging data now available1. This scenario underscores the urgency for automated diagnostic tools capable of efficiently processing this data to offer accurate and timely diagnoses, thereby easing the burden on healthcare systems. Computer-aided diagnostics, particularly those employing supervised learning2,3, have represented a significant leap in this direction, enabling machines to recognize disease patterns across various imaging modalities4. However, these methods often struggle to fully capture the complexity and rarity of human pathologies (https://rarediseases.info.nih.gov/about), especially in the absence of large, annotated datasets5.
In response to these limitations, Unsupervised Anomaly Detection (UAD)6 has gained prominence, offering the promise of autonomous anomaly detection without reliance on labeled data. The potential of UAD extends across diverse imaging modalities, from brain MRIs7,8,9,10 to chest X-rays11,12 and beyond13,14,15, suggesting a transformative change in diagnostic approaches. Nonetheless, the clinical integration of UAD faces challenges, including biases toward certain pathology profiles16 and the opaque nature of these ‘black box’ systems.
Generative AI17 has brought a novel dimension to anomaly detection by adeptly capturing the nuances of what is considered ‘normal’ in medical images, see Fig. 1. The true innovation of generative AI lies in normative representation learning, a concept driven by data to uncover characteristics of a healthy population. Here, anomaly detection assumes a complementary role, focusing on pinpointing deviations from these learned representations. However, the evaluation of generative AI methods often tilts disproportionately towards anomaly detection, leading to biased or incomplete assessments. For instance, certain approaches optimize solely for hyper-intense lesion detection18, anchor their design principles (e.g., specific noise types) around large tumors19, or approximate the unknown anomaly distribution through self- or weak-supervision20,21. While such methodologies can be advantageous for detecting specific pathologies, they often fall short in broader anomaly detection contexts16,22.
a shows the AI model being trained on a dataset of normal brain anatomy, a process known as normative representation learning. This can be achieved using compression, e.g., autoencoders, or diffusion processes. In b, a pathological brain magnetic resonance image (MRI) is input into the AI system, which then outputs a pseudo-healthy restoration of the brain. The transition from pathological input to healthy output is evaluated to produce an anomaly map, highlighting the areas of deviation.
Addressing this critical gap, our analysis advocates for a reorientation in the evaluation of generative AI, emphasizing its intrinsic role in normative representation learning. This approach marks a shift in the assessment of generative AI, positioning it not just as a tool for disease identification but as a system for evaluating the realness and plausibility of generated counterfactuals in medical imaging.
Proposed metrics
We propose specific metrics to assess the quality of the pseudo-healthy restorations: Restoration Quality Index (RQI): an image-based metric that evaluates the perceived quality of synthesized images by measuring their semantic similarity to the original inputs.
Anomaly to Healthy Index (AHI): measures the closeness of the distribution of restored pathological images to a healthy reference set.
Healthy Conservation and Anomaly Correction Index (CACI): measures the effectiveness of models in maintaining integrity in healthy regions and correcting anomalies in pathological areas.
These metrics collectively provide a more comprehensive evaluation, extending beyond simple anomaly mapping to include assessments of the quality and accuracy of the normative and pseudo-healthy representations generated by AI models.
Evolution of generative AI for UAD
The impact of generative AI on medical imaging for anomaly detection has unfolded in distinct evolutionary phases, each marking a pivotal advancement in the field, as illustrated in Fig. 2.
The timeline showcases the evolution of generative AI techniques in medical imaging, classified into distinct ‘Eras’ from Variational Autoencoders (VAEs) to Diffusion models. The lower segment details the evolution of guided restoration, highlighting the shift from basic deformations to sophisticated automatic masking strategies. Corresponding anomaly maps for each category are provided, allowing direct visual comparison of their detection capabilities in identifying stroke lesions.
Autoencoders (AEs) set the basis for UAD, premised on the hypothesis that anomalies would induce higher reconstruction errors4,23,24,25,26. Despite their promise, AEs struggle with generalizing beyond training data without losing detail, leading to anomaly misidentification26,27,28.
Variational AEs (VAEs) introduced probabilistic constraints on latent variables, advancing anomaly detection through nuanced posterior distribution approximations8,28,29,30. However, challenges in high-dimensional data representation still impact reconstruction quality.
Generative Adversarial Networks (GANs) revolutionized data synthesis through adversarial training, significantly enhancing anomaly detection31,32,33,34. However, their tendency towards mode collapse and limitations in preserving healthy tissue are critical areas for improvement.
Hybrid Models, such as Adversarial AEs (AAEs), combine structured latent spaces of VAEs with the superior image generation of GANs16,35,36,37. However, they still face challenges in handling discrepancies in healthy regions.
Diffusion Models have revamped generative modeling by capturing complex characteristics without latent space constraints38. Within medical anomaly detection19,21, diffusion models incrementally add noise to a pathological input, obscuring anomalies up to a certain threshold before methodically reverting them to a pseudo-healthy state. However, the choice of the noise level remains an important challenge39,40.
Guided Restoration Techniques utilize context-encoding and masking strategies to enhance the accuracy of diagnostics. Incorporating shape priors derived from healthy structures27 and techniques like random masking in masked AEs (MAEs) and Patched Diffusion models (pDDPM)41,42,43,44 offer nuanced adaptability in anomaly detection. Recent advancements include automatic masking strategies that intelligently transform only regions likely to contain anomalies, preserving the integrity of healthy tissue40,45.
Results
In our evaluation of generative AI models, we utilized normal T1-weighted brain MRI datasets, FastMRI+46 with 176 scans and 581 samples from IXI, for model training. For the evaluation phase, we focused on two key datasets: the enhanced FastMRI+ dataset, which encompasses a wide spectrum of 171 brain pathologies, and 420 subjects from the ATLAS v2.0 dataset47, known for its diverse range of stroke lesions. This strategy allowed us to rigorously test the capabilities of different models in detecting and localizing a broad array of anomalies, benchmarking their performance against the complexity and diversity of real-world brain pathologies. See Supplementary Figs. 1 and 2 for more details.
Normative learning evaluation
Our analysis primarily centered on normative representation learning, a crucial aspect for ensuring that models accurately represent healthy anatomical patterns. Figure 3 summarizes the results.
The central pathological MRI image is encircled by the pseudo-healthy restorations of different models, with their respective dataset-wide metric scores (RQI, AHI, CACI) radiating outward. Segments with darker and higher values indicate better performance, highlighting effectiveness in image restoration, anomaly normalization, and preservation of healthy tissue. Source Data are provided as a Source Data File.
Each of these indices offers a unique perspective on the performance of the generative models. For instance, methods that simply replicate input images, like AEs, achieve high RQI but score low on AHI and CACI. Conversely, methods such as VAEs, latent transformer models (LTMs), and MAEs that remove anomalies and improve CACI often produce blurry outputs, resulting in poor RQI. The AHI metric, which requires the synthesized image distributions to closely resemble a healthy set, proves challenging for many methods, often resulting in near-zero scores. Notably, the Fréchet Inception Distance (FID) underlying the AHI is particularly demanding; even methods that produce realistic healthy images but lack diversity or show slight domain shifts, such as RA or DDPM-G, find it challenging to achieve good scores. Guided restoration techniques using intelligent masking tend to achieve the best scores.
Therefore, it is crucial to consider these metrics collectively rather than in isolation. Optimal performance is characterized by high scores across RQI, AHI, and CACI, indicating the comprehensive ability to understand and replicate healthy anatomical structures while effectively identifying and rectifying anomalies. To fuse the metrics, we propose a harmonic mean between RQI and CACI, averaged with AHI. This approach balances image quality and anomaly correction while mitigating the impact of near-zero AHI scores on the overall evaluation (see Eq. (4)).
Anomaly detection performance
The anomaly detection results, detailed in Table 1, reveal notable insights into the performance of various generative AI models (please refer to Supplementary Table 1 for the complete results). PHANES and AutoDDPM, in particular, demonstrated exceptional proficiency in mastering normative aspects of medical imaging, which translated effectively into their leading roles in anomaly detection. In the FastMRI+ dataset, AutoDDPM achieved great success, detecting 159 out of 171 pathologies. PHANES showed superior performance in identifying enlarged ventricles and was especially effective in segmenting large stroke lesions in the ATLAS dataset. In contrast, AutoDDPM exhibited heightened sensitivity to smaller stroke lesions. These findings highlight the predictive power of high normative learning scores, as measured by RQI, AHI, and CACI, in determining the capability of a model to navigate the complexities of medical image analysis.
Interestingly, the results further reveal the varying effectiveness of models with average or lower scores in normative learning metrics. These models exhibited inconsistent results across different datasets, suggesting a dependency on the specific pathology types and dataset characteristics. For instance, the MAE model, despite ranking 7th in stroke lesion detection and showing commendable performance in identifying large lesions, ranked only 12th in the FastMRI+ benchmark for detecting more diverse and subtle anomalies like edema or smaller lesions. This inconsistency in performance across varying conditions underscores the critical importance of considering RQI, AHI, and CACI scores collectively to assess the generalization ability in anomaly detection across a wide spectrum of pathologies.
Interplay between normative learning and anomaly detection
Our in-depth analysis, detailed in Fig. 4, delves into the relationship between normative learning metrics and anomaly detection metrics (⌈Dice⌉ or F1 as in Table 1), revealing key insights:
We present a chord diagram illustrating the correlations between normative learning metrics (RQI, AHI, CACI) and the detection performance of various pathologies. In the primary diagram, the width and brightness of the links denote the strength of correlation. Together, all three metrics show a strong positive correlation with improved anomaly detection performance across all methods. Heatmap insets on the right offer a detailed view of correlation patterns in different sub-groups (top-performing, average-performing, and low-performing methods). In some cases, intra-correlations within these groups may appear counter-intuitive, such as negative correlations with RQI for certain disease types. This suggests a complex optimization landscape where initial improvements in restoration quality may paradoxically yield inferior results, highlighting the nuanced challenge of refining anomaly detection techniques. Source Data are provided as a Source Data File.
The analysis, visually represented in the chord diagram and top heatmap of Fig. 4, underscores the crucial role of normative learning metrics in universal anomaly detection. Notably, while individual metrics like RQI provide valuable insights, their diagnostic impact is limited when considered in isolation. For instance, models focusing solely on RQI might simply replicate anomalies, missing critical detections. Conversely, CACI shows a high predictive value for anomaly detection. However, the most comprehensive insights are obtained when these metrics are collectively analyzed, with combinations such as CACI and RQI, and particularly the integration of all three metrics (RQI, AHI, CACI), demonstrating enhanced predictive anomaly detection power. This highlights the importance of a balanced normative learning approach in generalizing across various pathologies. Interestingly, we observed a single negative correlation where higher normative metrics were associated with reduced detection of cases labeled as ‘paranasal sinus opacification’. Upon review by an expert neuroradiologist, these cases exhibited no clear visual signs of pathology. Methods with a tendency for more false positives might mistakenly detect issues in these cases, likely attributable to the proximity of the affected area to the skull, a region prone to false positives.
Our analysis of the five leading methods (Top 5), as per normative learning metrics, led to two significant observations. First, we noticed negative correlations in the detection of white matter lesions (WML). Expert neuroradiologists pointed out that some lesions were not visible, often obscured by significant motion artifacts, which are not typically classified as anomalies. Second, a higher AHI seemed to negatively affect the quantification of extensive stroke lesions. This trend may be related to how these lesions are annotated and evaluated. Pathologists often mark entire regions impacted by stroke, covering both affected and healthy tissues. Consequently, a model adept at normalizing pathological areas without altering healthy tissue tends to show a reduced overlap in these regions, leading to lower DICE scores. This issue is more apparent in cases with larger lesions, as indicated by the performance of AutoDDPM. These findings indicate a potential need to revisit current annotation and evaluation methods, particularly when the focus is on quantifying pathological burden.
Models with only moderate scores (Avg. 4) in normative learning metrics often demonstrate inconsistent diagnostic capabilities, hinting at overfitting to specific scenarios rather than wide-ranging disease detection. These models, with suboptimal scores in RQI, AHI, and CACI, generally excel in limited contexts but lack broad applicability across diverse pathologies.
Less effective methods (Bottom 5) showed a significant negative correlation between restoration quality and anomaly detection performance. This finding aligns with existing literature18,48, which notes that methods with dense latent spaces, often producing less sharp reconstructions, can surprisingly outperform more advanced methods. However, this trend may inadvertently lead researchers to focus on optimizing anomaly detection for a limited range of pathologies, typically marked by distinct features such as hyperintense or hypointense lesions. Pursuing this narrowed research direction risks developing models with limited adaptability and reduced clinical utility.
Clinical validation
To evaluate the performance of different AI algorithms and validate our quantitative metrics–RQI and AHI–we conducted a comprehensive test involving 16 radiologists. Each was given 180 images in a randomized sequence, including 30 pathology-free originals and 30 from each AI method (15 from ATLAS and 15 from FastMRI+). Radiologists rated each image on ‘Realness’ (1 for likely fake to 5 for real), ‘Image Quality’ (1 for poor to 5 for excellent), and ‘Health Status’ (1 for pathological to 5 for healthy). Evaluating the CACI, which requires the analysis of input-reconstruction pairs, is challenging in a blinded test setting where radiologists view a randomized order of unknown images.
The violin plots in Fig. 5a reveal the scores given by radiologists for both real and AI-generated images. Even real images show score variability, especially regarding health status, indicating that not all pathology-free images should be automatically deemed ‘healthy’. Differentiating between AI-generated and real images proves difficult, with real images scoring only marginally higher, highlighting AI’s growing proficiency in replicating authentic radiological scans. The AutoDDPM model often deceived radiologists, receiving high ‘Realness’ scores (≥3). In contrast, the RA method achieved strong ‘Health’ scores, similar to real images, but displayed limited sample diversity, as reflected by its lower AHI score (0.16) compared to 0.49 for AutoDDPM. Enhancing reconstruction quality is crucial for methods like MAE, RA, and AnoDDPM to improve realism and diagnostic utility. Others, including AutoDDPM and pDDPM, should focus on better generating ‘healthy’ images, especially transforming large pathological areas. The primary challenge remains to balance accurate reconstruction with effective pseudo-healthy synthesis, a complex yet critical task in anomaly detection40.
a Violin plots display average score distributions across metrics like Realness, Quality, and Health. b Heatmap indicates inter-rater variability in scoring image realness among radiologists, including both experienced ([Exp]) and resident ([Res]). c Bar charts compare mean scores between residents (N = 13) and experienced radiologists (N = 3). The error bars represent the standard deviation (SD) across raters. Individual data points are overlaid to show the score distribution. d The matrix shows Pearson correlation coefficients between proposed metrics (RQI, AHI) and radiologists' judgments of image quality and health. Source Data are provided as a Source Data File.
The heatmap in Fig. 5b shows the variability in realness scores among different radiologists. While the general trends from the previous violin plot are consistent, the scores vary between individual raters. Generally, real images received the highest realness scores. Except for the MAE method, which consistently received the lowest ratings, other AI methods varied in their ratings, with some instances even surpassing real images.
The bar charts in Fig. 5c highlight differences in evaluation scores between residents (N = 13) and experienced, board-certified radiologists (N = 3). These variations suggest that experience level affects the interpretation of image quality and health implications. Specifically, for realness scores (illustrated in the upper chart), experienced radiologists distinguished more clearly between real and AI-generated images. Additionally, the experts generally assigned higher ratings for both image quality and health status.
The correlation matrix in Fig. 5d demonstrates the relationships between the scores given by radiologists and our proposed metrics. The RQI showed a very high degree of correlation with the perceived image realness and quality. The AHI also exhibited a positive correlation with the perceived health status, albeit with slightly reduced strength. Minor inconsistencies in the AHI might stem from its use of the FID, which evaluates not only the "health" status of images but also other aspects such as sampling diversity and domain alignment between the evaluated sets.
Discussion
Our analysis marks a significant shift in generative AI for medical imaging, advocating for emphasis on normative representation learning. To facilitate this, we introduced new metrics–Restoration Quality Index (RQI), Anomaly to Healthy Index (AHI), and Healthy Conservation and Anomaly Correction Index (CACI)–designed to evaluate how well AI models learn the underlying, normal anatomy. Our findings demonstrate that the proposed metrics indicate the ability of a model to generalize across diverse conditions without relying on predefined labels or expectations about disease characteristics. We conducted a comprehensive clinical validation with 16 radiologists and found that the proposed RQI, and to a lesser extent the AHI, correlate well with clinical assessments.
Clinically, the implications are considerable. Models that master normative learning can discern subtle pathological nuances, essential for early disease detection and accurate diagnostics. Beyond improving diagnosis, the insights from this analysis can extend to preoperative planning, therapeutic monitoring, and training healthcare professionals. The pseudo-healthy reconstructions provided by such models can serve as a ‘baseline’ view against which deviations can be accurately assessed. This could be particularly beneficial in complex scenarios such as brain volume loss. Here, the distinction between healthy and pathological states is often not clear-cut but exists on a continuum. The ability of these models to accurately represent this spectrum is invaluable for clinicians, aiding in more nuanced diagnosis and treatment planning. Ultimately, these AI models can assist clinicians in developing a deeper understanding of underlying pathophysiologies. This could be instrumental in generating novel hypotheses and advancing medical research, ultimately contributing to improved patient care strategies.
Despite the promising advancements, integrating these AI systems seamlessly into clinical workflows presents ongoing challenges. As critical tools in patient triage, these systems must exhibit robustness across various scanning protocols and adaptability to different patient demographics. Additionally, while 2D generative models produce high-quality slices, they often fail to maintain spatial integrity in a full 3D context49. Future work could evaluate these models across axial, coronal, and sagittal planes to assess their coherence in constructing accurate 3D structures.
In summary, our emphasis on normative learning and the introduction of new metrics contribute to developing AI models with greater clinical relevance. Moving forward, the development of AI models should emphasize not just excellence in experimental conditions but also resilience and adaptability within the complex realities of clinical environments. The pursuit of AI that comprehensively encompasses the range of normal anatomical variations is a critical step towards systems that enhance clinical decision-making and ultimately improve patient care and outcomes.
Methods
Background
This study complies with all relevant ethical regulations. The analysis was conducted on publicly available, anonymized datasets; no additional ethical approval was required. The comprehensive understanding of standard anatomical structures is imperative for the identification of anomalies. This process, as illustrated in Fig. 1, involves training models on datasets (\({{{{\mathcal{D}}}}}_{H}\)) composed solely of images \({{{\boldsymbol{x}}}}\in {{\mathbb{R}}}^{C\times W\times H}\) from healthy subjects, where C, W, and H represent the channels, width, and height of the images, respectively. These models are then employed to discern and morphologically convert pathological structures within mixed datasets (\({{{{\mathcal{D}}}}}_{P}\)) into their normative (healthy) counterparts.
Mathematical framework
Generative models typically employ an encoder-decoder mechanism, where both the encoder E and the decoder G are formulated as neural networks with parameters θ. E compresses an input image x into a lower-dimensional latent representation z, typically in \({{\mathbb{R}}}^{d}\) where d ≪ C × W × H. Then, a decoder (or generative model G) restores a pseudo-healthy image \(\hat{{{{\boldsymbol{x}}}}}\in {{{{\mathcal{D}}}}}_{H}\) from z. The primary training objective is to minimize the reconstruction loss \(L({{{\boldsymbol{x}}}},\hat{{{{\boldsymbol{x}}}}})\), often measured as the mean squared error (MSE), thereby optimizing the parameters θ. Interestingly, not all models follow this traditional “condensation” route. Diffusion models, for instance, take a different path. They preserve the original dimensionality and iteratively add noise to the input. This action allows the model to build up content while preserving intricate details50.
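For illustration, the encoder-decoder scheme above can be sketched with a linear "autoencoder": a truncated-SVD projection stands in for the learned encoder E, and its transpose for the decoder G. This is a simplification for exposition only (real models use trained neural networks), but it shows the compression to d ≪ C × W × H and the MSE reconstruction loss concretely.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for D_H: 200 flattened "images" of dimension C*W*H = 64.
X = rng.normal(size=(200, 64))

# Linear encoder/decoder via truncated SVD: E projects onto the top-d
# right-singular vectors of the training set; G maps the code back.
d = 8
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:d].T                      # (64, d) projection basis

def E(x):
    return x @ V                  # latent code z in R^d, d << C*W*H

def G(z):
    return z @ V.T                # reconstruction x_hat in image space

# Reconstruction loss L(x, x_hat) measured as mean squared error.
x = X[0]
x_hat = G(E(x))
mse = np.mean((x - x_hat) ** 2)
print(f"latent dim d={d}, reconstruction MSE={mse:.4f}")
```

Because G(E(x)) is an orthogonal projection, the residual energy (and hence the MSE) can never exceed that of the input itself.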
Anomaly scoring
The discrepancy between the original image x from \({{{{\mathcal{D}}}}}_{P}\) and the reconstructed image \(\hat{{{{\boldsymbol{x}}}}}\) from \({{{{\mathcal{D}}}}}_{H}\) is quantified as an anomaly score S(x). This score provides a pixel-wise indication of anomalies, aiding in the localization of abnormal regions. For broader detection purposes, the anomaly presence in an image can be summarized by computing the mean or maximum anomaly score across all pixels: \({S}_{{{{\rm{image}}}}}={\max }_{i,j}S({{{{\boldsymbol{x}}}}}_{i,j})\) or \({S}_{{{{\rm{image}}}}}=\frac{1}{N}{\sum }_{i,j}S({{{{\boldsymbol{x}}}}}_{i,j})\) where xi,j denotes the pixel at position (i, j) and N is the total number of pixels.
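The scoring described above can be written in a few lines. Using the absolute reconstruction residual as the pixel-wise score S(x) is one common choice, assumed here for illustration; the text itself does not prescribe a particular residual.

```python
import numpy as np

def anomaly_map(x, x_hat):
    """Pixel-wise anomaly score S(x): absolute reconstruction residual
    between the input and its pseudo-healthy restoration."""
    return np.abs(x - x_hat)

def image_score(s_map, reduction="max"):
    """Summarize the pixel-wise map into one image-level score,
    via either the maximum or the mean over all pixels."""
    return s_map.max() if reduction == "max" else s_map.mean()

# Toy example: a 4x4 "image" whose restoration removed one bright region.
x     = np.array([[0.1, 0.1, 0.1, 0.1],
                  [0.1, 0.9, 0.9, 0.1],
                  [0.1, 0.9, 0.9, 0.1],
                  [0.1, 0.1, 0.1, 0.1]])
x_hat = np.full((4, 4), 0.1)       # pseudo-healthy restoration

s = anomaly_map(x, x_hat)
print(image_score(s, "max"))       # peak deviation: 0.8
print(image_score(s, "mean"))      # average deviation: 0.2
```

The max reduction is more sensitive to small focal lesions, while the mean favors spatially extended anomalies.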
Normative representation learning
Req. i. Restoration Quality Index (RQI)
Evaluates restoration quality by semantically comparing two images using the Learned Perceptual Image Patch Similarity (LPIPS). LPIPS is a metric that quantifies perceptual similarity between images in a way that aligns closely with human judgment. This metric is derived from the internal activations of deep convolutional networks, which, although trained on high-level classification tasks, have been shown to be effective in representing perceptual similarity51. The RQI evaluates the fidelity of image restorations of unseen healthy samples, with a lower LPIPS indicating greater accuracy:
where \({{{{\mathcal{D}}}}}_{{{{\rm{H}}}}}\) represents the dataset of N unseen healthy samples. The \({\min }_{lp}\) denotes the minimum LPIPS possible, which is 0 in the case that a method returns the identity function. We set \({\max }_{lp}\) to 25, which is higher than the worst-performing method (VAE). To maintain consistency and avoid negative values, performances worse than 25 can be capped at this maximum threshold, thus normalizing the metric between 0 and 1.
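Since Eq. (1) itself is not reproduced in this excerpt, the sketch below only mirrors the normalization described in the text: mean LPIPS over the unseen healthy set, capped at max_lp = 25, and linearly rescaled so that lower LPIPS yields a higher RQI in [0, 1]. The exact published formula may differ, and the LPIPS values here are placeholders (in practice they would come from e.g. the `lpips` package).

```python
import numpy as np

MIN_LP, MAX_LP = 0.0, 25.0   # bounds stated in the text

def rqi(lpips_scores):
    """Hypothetical RQI: mean LPIPS over N unseen healthy samples, capped
    at MAX_LP and rescaled so lower LPIPS -> higher RQI in [0, 1].
    A sketch of the stated normalization, not the exact Eq. (1)."""
    mean_lp = np.clip(np.mean(lpips_scores), MIN_LP, MAX_LP)
    return (MAX_LP - mean_lp) / (MAX_LP - MIN_LP)

print(rqi([0.0, 0.0]))    # identity-like restorations -> RQI = 1.0
print(rqi([25.0, 30.0]))  # worse than the cap -> RQI = 0.0
print(rqi([10.0, 15.0]))  # intermediate quality
```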
Req. ii. Restoration to a healthy state
To test how well the models can normalize image abnormalities, we propose measuring the distances between image distributions. Specifically, the Anomaly to Healthy Index (AHI) measures the ability to transform a pathological dataset \({{{{\mathcal{D}}}}}_{P}\) towards the normative patterns of the healthy training set \({{{{\mathcal{D}}}}}_{H}\). Using the Fréchet Inception Distance (FID)52 to measure the distance between two distributions, we compute AHI as:
where FIDA,B represents the FID between two datasets A and B, and P and RP represent the pathological and restored pathological samples, respectively. AHI is normalized to the range between 0 and 1.
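The FID that the AHI builds on compares Gaussian fits of deep-feature distributions. The numpy-only sketch below simplifies it to diagonal covariances so it stays self-contained; the standard FID uses full covariance matrices (with a matrix square root) and Inception-v3 features, and the AHI normalization of Eq. (2) is not reproduced here.

```python
import numpy as np

def fid_diag(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets,
    simplified to diagonal covariances:
        ||mu_a - mu_b||^2 + sum_k (sigma_a,k - sigma_b,k)^2.
    Illustration only; the real FID uses full covariances and
    Inception features."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    sd_a, sd_b = feats_a.std(0), feats_b.std(0)
    return float(np.sum((mu_a - mu_b) ** 2) + np.sum((sd_a - sd_b) ** 2))

rng = np.random.default_rng(1)
healthy      = rng.normal(0.0, 1.0, size=(500, 16))  # reference features
restored     = rng.normal(0.1, 1.0, size=(500, 16))  # near the healthy set
pathological = rng.normal(1.5, 1.5, size=(500, 16))  # shifted distribution

# A good restoration moves the pathological set toward the healthy one:
print(fid_diag(healthy, restored) < fid_diag(healthy, pathological))
```

A successful model should drive FID(RP, H) well below FID(P, H), which is exactly what the AHI quantifies after normalization.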
Req. iii. Healthy Conservation and Anomaly Correction Index (CACI)
Utilizes the Structural Similarity Index (SSIM)53 to assess the proficiency of the models in preserving healthy tissue details and correcting anomalies:
where \({{{{\rm{SSIM}}}}}_{{{{\rm{H}}}}}\) measures similarity within healthy regions and \({{{{\rm{SSIM}}}}}_{{{{\rm{AN}}}}}\) within anomalous regions. This index requires lesion segmentation masks annotated by expert clinicians, formally defined as binary masks mi, where mi = 1 if the pixel is identified as anomalous and mi = 0 otherwise. CACI is bounded between 0 and 1.
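The region-restricted SSIM terms can be sketched as follows. Eq. (3) itself is not reproduced in this excerpt, so the code only computes SSIM_H and SSIM_AN from an expert mask m, treating each masked region as a single SSIM window, a simplification of the usual sliding-window SSIM.

```python
import numpy as np

def ssim_region(x, y, mask, c1=1e-4, c2=9e-4):
    """Single-window SSIM over the pixels where mask is True.
    Simplification: standard SSIM averages a sliding-window map; here
    the whole region is one window, for illustration only."""
    a, b = x[mask].astype(float), y[mask].astype(float)
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

rng = np.random.default_rng(2)
x = rng.random((32, 32))                 # pathological input
m = np.zeros((32, 32), dtype=bool)       # expert mask (m=1: anomalous)
m[10:20, 10:20] = True

x_hat = x.copy()
x_hat[m] = 0.5                           # restoration rewrites the lesion only

ssim_h  = ssim_region(x, x_hat, ~m)      # healthy tissue: preserved
ssim_an = ssim_region(x, x_hat, m)       # anomalous tissue: corrected
print(ssim_h > ssim_an)
```

An ideal model keeps SSIM_H high (healthy tissue untouched) while SSIM_AN drops (the anomaly was actually rewritten), which is the behavior CACI rewards.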
Metric integration
Integrating the RQI, AHI, and CACI offers a comprehensive evaluation of generative models in medical imaging. However, individual assessments of these metrics still provide unique insights into specific aspects of model performance, highlighting strengths and areas for improvement. For a balanced overall evaluation, we propose a combined metric calculated as:
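The fusion described earlier (a harmonic mean of RQI and CACI, averaged with AHI) can be transcribed directly; the published Eq. (4) may include additional stabilization terms beyond the small epsilon assumed here.

```python
def combined_score(rqi, ahi, caci, eps=1e-8):
    """Combined metric as described in the text: harmonic mean of RQI and
    CACI, averaged with AHI. The harmonic mean penalizes imbalance between
    restoration quality and conservation/correction, while averaging with
    AHI keeps near-zero AHI values from collapsing the whole score."""
    hm = 2 * rqi * caci / (rqi + caci + eps)
    return 0.5 * (hm + ahi)

# A model that copies its input: perfect RQI but no correction -> 0.0.
print(round(combined_score(rqi=1.0, ahi=0.0, caci=0.0), 3))
# A balanced model scores well on the fused metric.
print(round(combined_score(rqi=0.8, ahi=0.5, caci=0.8), 3))
```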
Datasets
We used the following datasets in our manuscript:
Healthy data for training
Two public T1w brain MRI datasets from healthy individuals, \({{{{\mathcal{D}}}}}_{{{{\rm{H}}}}}\), were employed for model training: IXI (581 training samples) and FastMRI+46 (131 training, 15 validation). We kept 30 samples from the FastMRI+ as unseen healthy test samples, \({{{{\mathcal{D}}}}}_{{{{\rm{UH}}}}}\).
Pathology data for evaluation
We used two public datasets containing several disease classes as our datasets, \({{{{\mathcal{D}}}}}_{{{{\rm{P}}}}}\), containing pathology:
FastMRI+ Dataset46: It includes 643 annotated pathologies across 30 classes. We selected mid-axial T1-weighted slices, yielding 171 unique pathologies in 13 classes. We used the annotations provided as bounding boxes by medical experts to assess the detection (‘#det’) and precision (F1 score) of the models16. We considered a detection to be a true positive (TP) if at least 10% of the pixels within the annotated bounding box were flagged as anomalous. False positives (FP) were calculated as the ratio of misdetected pixels on healthy tissue relative to correctly detected pixels within the anomaly box. Finally, we report the F1 score as: \(F1=\frac{1}{N}{\sum }_{i=1}^{N}\frac{2\times P\times TP}{P+TP}\), where \(P=\frac{TP}{TP+FP}\) and N is the number of test cases.
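For a single test case with one expert bounding box, the stated protocol can be sketched as below. The helper name and the single-box assumption are illustrative; cases with multiple boxes and the averaging over N cases follow the same per-case logic.

```python
import numpy as np

def evaluate_case(anom_mask, box):
    """One test case: a binary anomaly prediction and an expert bounding
    box (r0, r1, c0, c1). Per the stated protocol: TP = 1 if >= 10% of
    the box pixels are flagged; FP is the ratio of flagged pixels outside
    the box to correctly flagged pixels inside it."""
    r0, r1, c0, c1 = box
    inside = anom_mask[r0:r1, c0:c1]
    tp = int(inside.mean() >= 0.10)              # detection criterion
    hits_in = inside.sum()
    hits_out = anom_mask.sum() - hits_in
    fp = hits_out / hits_in if hits_in > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    f1 = 2 * precision * tp / (precision + tp) if (precision + tp) > 0 else 0.0
    return tp, f1

pred = np.zeros((64, 64), dtype=int)
pred[20:30, 20:30] = 1                           # flagged lesion region
pred[50:52, 50:52] = 1                           # a few false alarms
tp, f1 = evaluate_case(pred, (18, 32, 18, 32))
print(tp, round(f1, 3))
```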
ATLAS v2.0 Dataset47: Featuring scans with stroke lesions, the ATLAS v2.0 presents a challenging range of lesion sizes and intensities. The dataset includes 655 training scans with detailed annotations, offering in-depth views of stroke anomalies. We stratified the test sets into small (bottom 25th percentile, <71 pixels), medium, and large lesions (top 25th percentile, ≥570 pixels) for performance evaluation, with the largest Dice coefficient (⌈Dice⌉) as the metric. ⌈Dice⌉ represents the theoretical maximum segmentation accuracy, achieved by a greedy search for the best residual threshold on the test set8. We excluded the middle slices containing no visible anomalies (N = 215) and the scans showing visible unlabeled artifacts (N = 20) as in ref. 40.
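The percentile-based stratification can be sketched as follows; the function name is illustrative, and the reported ATLAS cut-offs (<71 and ≥570 pixels) arise from applying the same rule to the actual lesion-size distribution.

```python
import numpy as np

def stratify_lesions(sizes):
    """Split lesion sizes (in pixels) into small / medium / large groups
    at the 25th and 75th percentiles of the size distribution, mirroring
    the stratification described in the text."""
    lo, hi = np.percentile(sizes, [25, 75])
    groups = {"small": [], "medium": [], "large": []}
    for s in sizes:
        if s < lo:
            groups["small"].append(s)
        elif s >= hi:
            groups["large"].append(s)
        else:
            groups["medium"].append(s)
    return lo, hi, groups

sizes = [10, 40, 71, 150, 300, 570, 900, 2000]   # toy lesion sizes
lo, hi, groups = stratify_lesions(sizes)
print(lo, hi, {k: len(v) for k, v in groups.items()})
```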
Data pre-processing
We have intentionally preserved the variability inherent in the data, adhering to the preprocessing protocols of the original datasets (See Supplementary Table 2 for details). Additionally, we normalized the mid-axial slices to the 98th percentile, applied padding, and resized them to a resolution of 128 × 128 pixels. For training, we used affine augmentations with a random rotation up to 10 degrees, up to 0.1 translation, scaling from 0.9 to 1.1, and horizontal flips with a probability of 0.5.
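A numpy-only sketch of the normalization and resizing steps described above, under stated assumptions: "normalizing to the 98th percentile" is interpreted as dividing by the 98th-percentile intensity and clipping to [0, 1], and nearest-neighbour resizing stands in for whatever interpolation the actual pipeline uses.

```python
import numpy as np

def preprocess_slice(img, out_size=128):
    """Sketch of the described pipeline: 98th-percentile intensity
    normalization, padding to square, and resizing to out_size x out_size.
    Nearest-neighbour resizing keeps this numpy-only; the real pipeline
    may use a different interpolation (an assumption)."""
    p98 = np.percentile(img, 98)
    img = np.clip(img / max(p98, 1e-8), 0.0, 1.0)

    # Pad the shorter dimension so the slice is square, centered.
    h, w = img.shape
    side = max(h, w)
    padded = np.zeros((side, side), dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    padded[top:top + h, left:left + w] = img

    # Nearest-neighbour resize to the target resolution.
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return padded[np.ix_(idx, idx)]

slice_ = np.random.default_rng(3).random((200, 180))
out = preprocess_slice(slice_)
print(out.shape, float(out.max()) <= 1.0)
```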
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All datasets used in this study are publicly available, and their usage complies with the respective terms and conditions of the databases where they were sourced.
The FastMRI dataset is available at https://fastmri.org, with the respective labels of the fastMRI+ dataset at https://github.com/microsoft/fastmri-plus. The FastMRI dataset is provided under a Dataset Sharing Agreement by NYU Langone Health, permitting its use for internal research or educational purposes only. Our use of this dataset strictly adheres to these terms, as it has been employed solely for non-commercial academic research purposes.
The IXI dataset is available at https://brain-development.org/ixi-dataset/ and is licensed under the Creative Commons CC BY-SA 3.0 license. This license permits use, distribution, and adaptation of the data with appropriate credit and indication of changes. We have ensured compliance with these requirements by properly citing and crediting the dataset in this study.
The ATLAS v2.0 dataset is available at https://atlas.grand-challenge.org. The terms of this dataset require acknowledgment of its source in publications or presentations. We confirm that the ATLAS v2.0 dataset has been used in accordance with these terms and appropriately cited in this manuscript. Source data are provided with this paper.
Code availability
Implementations of all models and the code presented in this manuscript are publicly available at https://github.com/compai-lab/2024-ncomms-bercea.git54.
References
Kim, Y. W. & Mansfield, L. T. Fool me twice: delayed diagnoses in radiology with emphasis on perpetuated errors. Am. J. Roentgenol. 202, 465–470 (2014).
Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digital Med. 4, 65 (2021).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
Tschuchnig, M. E. & Gadermayr, M. Anomaly detection in medical imaging-a mini review. In Data Science–Analytics and Applications: Proc. 4th International Data Science Conference 33–38, (Springer Fachmedien Wiesbaden, 2022).
Ruff, L. et al. A unifying review of deep and shallow anomaly detection. Proc. IEEE 109, 756–795 (2021).
Pang, G., Shen, C., Cao, L. & Hengel, A. V. D. Deep learning for anomaly detection: a review. ACM Comput. Surv. 54, 1–38 (2021).
Bercea, C. I., Wiestler, B., Rueckert, D. & Albarqouni, S. Federated disentangled representation learning for unsupervised brain anomaly detection. Nat. Mach. Intell. 4, 685–695 (2022).
Pinaya, W. H. et al. Unsupervised brain imaging 3d anomaly detection and segmentation with transformers. Med. Image Anal. 79, 102475 (2022).
Zimmerer, D., Isensee, F., Petersen, J., Kohl, S. & Maier-Hein, K. Unsupervised anomaly localization using variational auto-encoders. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019. MICCAI 2019. Lecture Notes in Computer Science, Vol 11767 (eds Shen, D. et al.) (Springer, Cham, 2019).
Pawlowski, N. et al. Unsupervised lesion detection in brain CT using Bayesian convolutional autoencoders. In: International Conference on Medical Imaging with Deep Learning (2018).
Zhang, J. et al. Viral pneumonia screening on chest x-rays using confidence-aware anomaly detection. IEEE Trans. Med. Imaging 40, 879–890 (2021).
Kim, M., Moon, K.-R. & Lee, B.-D. Unsupervised anomaly detection for posteroanterior chest x-rays using multiresolution patch-based self-supervised learning. Sci. Rep. 13, 3415 (2023).
Tan, J. et al. Detecting outliers with foreign patch interpolation. Mach. Learn. Biomed. Imaging 1, 1–27 (2022).
Park, S., Lee, K. H., Ko, B. & Kim, N. Unsupervised anomaly detection with generative adversarial networks in mammography. Sci. Rep. 13, 2925 (2023).
Lee, S. et al. Emergency triage of brain computed tomography via anomaly detection with a deep generative model. Nat. Commun. 13, 4251 (2022).
Bercea, C. I., Wiestler, B., Rueckert, D. & Schnabel, J. A. Generalizing unsupervised anomaly detection: towards unbiased pathology screening. In International Conference on Medical Imaging with Deep Learning in Proceedings of Machine Learning Research, Vol 227, 39–52 (2024).
Koohi-Moghadam, M. & Bae, K. T. Generative AI in medical imaging: applications, challenges, and ethics. J. Med. Syst. 47, 94 (2023).
Baur, C., Denner, S., Wiestler, B., Navab, N. & Albarqouni, S. Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med. Image Anal. 69, 101952 (2021).
Wyatt, J., Leach, A., Schmon, S. M. & Willcocks, C. G. AnoDDPM: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 650–656 (2022).
Kascenas, A., Pugeault, N. & O’Neil, A. Q. Denoising autoencoders for unsupervised anomaly detection in brain MRI. International Conference on Medical Imaging with Deep Learning, in Proceedings of Machine Learning Research 172, 653–664 (2022).
Wolleb, J., Bieder, F., Sandkühler, R. & Cattin, P. C. Diffusion models for medical anomaly detection. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, Vol 13438 (eds Wang, L. et al.) (Springer, Cham, 2022).
Bercea, C. I., Wiestler, B., Rueckert, D. & Schnabel, J. A. Diffusion models with implicit guidance for medical anomaly detection. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. MICCAI 2024. Lecture Notes in Computer Science, Vol 15011 (eds Linguraru, M. G. et al.) (Springer, Cham, 2024).
Japkowicz, N. et al. A novelty detection approach to classification. Int. Jt. Conf. Artif. Intell. 1, 518–523 (1995).
Yoon, S., Noh, Y.-K. & Park, F. Autoencoding under normalization constraints. In International Conference on Machine Learning in Proceedings of Machine Learning Research, 12087–12097 (2021).
Denouden, T. et al. Improving reconstruction autoencoder out-of-distribution detection with Mahalanobis distance. Preprint at https://arxiv.org/abs/1812.02765 (2018).
Gong, D. et al. Memorizing normality to detect anomaly: memory-augmented deep autoencoder for unsupervised anomaly detection. In International Conference on Computer Vision 1705–1714 (2019).
Bercea, C. I., Rueckert, D. & Schnabel, J. A. What do AEs learn? Challenging common assumptions in unsupervised anomaly detection. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, Vol 14224 (eds Greenspan, H. et al.) (Springer, Cham, 2023).
Zong, B. et al. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In: International Conference on Learning Representations (2018).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Higgins, I. et al. beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations (2017).
Goodfellow, I. et al. Generative adversarial nets. Adv. Neural Inform. Process. Syst. 27 (2014).
Akcay, S., Atapour-Abarghouei, A. & Breckon, T. P. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Proc Asian Conference on Computer Vision. Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, Vol 11363 (eds Jawahar, C. et al.) 622–637 (Springer, Cham, 2019).
Kim, D.-Y. et al. Feasibility of anomaly score detected with deep learning in irradiated breast cancer patients with reconstruction. NPJ Digital Med. 5, 125 (2022).
Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G. & Schmidt-Erfurth, U. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Med. Image Anal. 54, 30–44 (2019).
Pidhorskyi, S., Almohsen, R. & Doretto, G. Generative probabilistic novelty detection with adversarial autoencoders. Adv. Neural Inform. Process. Syst. 31 (2018).
Daniel, T. & Tamar, A. Soft-IntroVAE: analyzing and improving the introspective variational autoencoder. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4391–4400 (IEEE Computer Society, 2021).
Chen, X. & Konukoglu, E. Unsupervised detection of lesions in brain MRI using constrained adversarial auto-encoders. In International Conference on Medical Imaging with Deep Learning (2022).
Kazerouni, A. et al. Diffusion models in medical imaging: a comprehensive survey. Med. Image Anal. 88, 102846 (2023).
Graham, M. S. et al. Denoising diffusion models for out-of-distribution detection. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 2947–2956 (2023).
Bercea, C. I., Neumayr, M., Rueckert, D. & Schnabel, J. A. Mask, stitch, and re-sample: enhancing robustness and generalizability in anomaly detection through automatic diffusion models. In ICML 3rd Workshop on Interpretable Machine Learning in Healthcare (2023).
Zimmerer, D., Kohl, S. A., Petersen, J., Isensee, F. & Maier-Hein, K. H. Context-encoding variational autoencoder for unsupervised anomaly detection. (Medical Imaging with Deep Learning, 2018).
He, K. et al. Masked autoencoders are scalable vision learners. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16000–16009 (2022).
Schwartz, E. et al. MAEDAY: MAE for few- and zero-shot AnomalY-Detection. Comput. Vis. Image Underst. 241, 103958 (2024).
Behrendt, F., Bhattacharya, D., Krüger, J., Opfer, R. & Schlaefer, A. Patched diffusion models for unsupervised anomaly detection in brain MRI. In International Conference on Medical Imaging with Deep Learning in Proceedings of Machine Learning Research, Vol 227, 1019–1032 (2023).
Bercea, C. I., Wiestler, B., Rueckert, D. & Schnabel, J. A. Reversing the abnormal: pseudo-healthy generative networks for anomaly detection. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2023. MICCAI 2023. Lecture Notes in Computer Science, Vol 14224 (eds Greenspan, H. et al.) (Springer, Cham, 2023).
Zhao, R. et al. fastMRI+, Clinical pathology annotations for knee and brain fully sampled magnetic resonance imaging data. Sci. Data 9, 152 (2022).
Liew, S.-L. et al. A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Sci. Data 9 (2022).
Lambert, B. et al. Leveraging 3D information in unsupervised brain MRI segmentation. In Proc 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI) (IEEE, 2021).
Durrer, A. et al. Denoising diffusion models for 3D healthy brain tissue inpainting. In Deep Generative Models. DGM4MICCAI 2024. Lecture Notes in Computer Science, Vol 15224 (eds Mukhopadhyay, A.) (Springer, Cham, 2025).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Zhang, R., Isola, P., Efros, A. A., Shechtman, E. & Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 586–595 (2018).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inform. Process. Syst. 30 (2017).
Wang, Z., Bovik, A. C., Sheikh, H. R. & Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004).
Bercea, C. I., Wiestler, B., Rueckert, D. & Schnabel, J. A. Evaluating normative representation learning in generative AI for robust anomaly detection in brain imaging (2024). https://github.com/ci-ber/GenAI_UAD (Accessed 28 Dec 2024).
Acknowledgements
The authors thank Prof. Jan Kirschke, Philipp Raffler, Dr. Cornelius Berberich, Dr. Kirsten Jung, Dr. Su Hwan Kim, Dr. Lukas Walder, Dr. Severin Schramm, Dr. Joachim Schmitt, Constanze Ramschuetz, Lena Schmitzer, Dr. Olivia Kertels, and Mirjam Beyrle for their invaluable contributions to our multi-reader study. Their expertise and dedication were essential in validating our proposed metrics and enhancing the rigor of our work. C.I.B. is funded via the EVUK program (“Next-generation AI for Integrated Diagnostics”) of the Free State of Bavaria and partially supported by the Helmholtz Association under the joint research school ‘Munich School for Data Science’. This work was in part supported by Berdelle-Stiftung (grant TimeFlow).
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
C.I.B. led the project; contributed to the conception of key ideas and study design; conducted experiments; interpreted data and results; and drafted and edited the manuscript. J.A.S. and D.R. secured funding for the project; provided critical feedback on study design, data analysis, and interpretation; and contributed to manuscript editing. B.W. led a clinical study involving 16 radiologists to validate the proposed metrics; provided clinical insights; and reviewed the manuscript for clinical accuracy. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Ulas Bagci, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Bercea, C.I., Wiestler, B., Rueckert, D. et al. Evaluating normative representation learning in generative AI for robust anomaly detection in brain imaging. Nat Commun 16, 1624 (2025). https://doi.org/10.1038/s41467-025-56321-y