Introduction

Magnetic resonance imaging (MRI) is a cornerstone of non-invasive diagnostic imaging, owing to its superior soft-tissue contrast. However, the clinical utility of MRI remains constrained by inherently prolonged data acquisition times. In the era of artificial intelligence (AI), acceleration of MRI acquisition can generally follow two algorithmic strategies: AI-driven MRI reconstruction from undersampled k-space data, and AI-driven denoising applied to routinely reconstructed MR images. The former approaches1,2,3,4,5,6, while effective, necessitate large-scale, fully-sampled k-space datasets for training, which are challenging to obtain due to their limited clinical relevance, substantial storage requirements, and the complexities associated with data access and formatting across various MRI platforms. Consequently, such models are typically tailored for specific organs, imaging contrasts, or acceleration factors. In contrast, denoising approaches operating on magnitude images (DICOM) leverage data readily available from standard reconstruction pipelines such as those based on GRAPPA7 or SENSE8, particularly from the widely deployed 1.5 T scanners, and do not require access to multi-coil, complex k-space data. Thus, the development of robust, generalizable, and versatile denoising models for accelerated MRI bears significant clinical promise, offering a practical pathway to improving image quality within existing clinical workflows.

Traditionally, the development of denoising algorithms has relied on synthetically generated noisy images, created by superimposing Rician9 or mixed noise patterns onto clean images10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25. While these strategies have enabled the advancement of denoising techniques, they rely on simplified noise models. The resulting gap between synthesized and real-world noise inherently limits these methods in clinical practice and may introduce unanticipated risks. Diffusion models (DMs)26,27,28,29 represent the latest development in image denoising. Conditional diffusion models use the noisy input as the condition, sampling the denoised output from the posterior distribution during the reverse diffusion process. However, these approaches can produce hallucinated anatomical structures, raising concerns regarding safety and reliability in clinical applications. More recently, there has been increasing interest in DM-based frameworks for addressing inverse problems30,31,32,33. These methods typically embed a degradation model into the reverse diffusion process at inference, steering the reconstruction to be consistent with observed measurements. Despite these advances, degradation models are commonly assumed to be linear for simplicity. Our observations of paired noisy and clean MR images indicate that real-world degradation in MRI is inherently non-linear, underscoring the need for denoising methods that more accurately reflect the complex nature of MRI noise for clinical utility.

To address the limitations posed by hallucinations in diffusion models and the oversimplified characterization of degradation processes across heterogeneous clinical environments, we prospectively assembled a large-scale dataset comprising 148,930 paired noisy-clean MRI slices from six organs, sourced from six medical centers utilizing 1.5 T MRI scanners. Leveraging these real-world data, we developed a unified, advanced denoising framework for accelerated MR imaging, designed for seamless integration with standard commercial reconstruction algorithms. We performed extensive validation, encompassing internal (N = 102,060 slices) and external (N = 46,870 slices) cohorts from multiple centers. We systematically benchmark our method against five state-of-the-art denoising models and further validate our method by multi-reader assessments. Experimental results demonstrate that the proposed denoising model substantially extends the acceleration capacity of 96 routinely used MRI protocols, while remaining compatible with conventional reconstruction algorithms such as GRAPPA7 and SENSE8 on 1.5 T MR scanners. Notably, the acquisition time of these clinically representative MRI protocols can be significantly reduced within a range between 25% and 72% (with 30% on average), depending upon organ site and protocol (Supplementary Table 1). Importantly, many routinely used MRI protocols can be completed within one minute without compromising image quality and diagnostic accuracy, substantially enhancing the acquisition efficiency of MRI in clinical usage.

Results

Data sources and unified denoising network

Our unified denoising network was trained on a large-scale prospective dataset, which consists of 5366 real-world noisy-clean volume pairs (N = 102,060 slice pairs), covering six organs including head (N = 37,482), knee (N = 8329), C-spine (N = 14,097), L-spine (N = 14,447), T-spine (N = 18,139), and shoulder (N = 9566) with 82 MRI protocols (e.g., T1w, T1-FLAIR, T2w, T2-FLAIR, DWI, PDw, DIXON), and three MRI manufacturers (i.e., Siemens, GE, Philips) acquired from January 2024 to August 2024 in three hospitals in Shenzhen and Guangzhou, China. For external evaluation, we further collected 2157 volume pairs (N = 46,870 slice pairs) of healthy and non-healthy subjects, acquired on MRI scanners from Siemens, GE, UIH, and Philips at four data centers from October 2024 to March 2025 and covering 96 MRI protocols in total (Siemens:29, GE:25, UIH:14, Philips:19). The noisy images were acquired by accelerating the clinically used MRI sequences through either k-space subsampling or deactivation of k-space averaging. The clean images were obtained by either full k-space sampling or k-space averaging. The acceleration rates of the clinically used sequences were mainly between 1.1× and 3.5×, depending on the MRI protocols and scanners. For convenience, we split these acceleration rates into three groups, i.e., 1.5×, 2×, and 3×. It is worth noting that the definition of “acceleration rate” used here differs from the one used in MRI reconstruction31,34,35,36,37,38. In MRI reconstruction, acceleration rate often refers to the undersampling rate in k-space during acquisition, whereas in this work it refers to the number of excitations (averaged acquisitions). We summarize the population characteristics and imaging parameters in Supplementary Table 1.
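The relationship between the number of averaged acquisitions and image noise can be illustrated with a toy simulation. The sketch below is not the acquisition pipeline used in this study: it assumes independent, zero-mean Gaussian noise in each acquisition (a simplification of real MRI noise), and the helper `average_acquisitions` is a hypothetical name. Under that assumption, averaging n acquisitions lowers the noise standard deviation by roughly a factor of √n, which is why dropping the averages shortens the scan but yields noisier images.

```python
import numpy as np

def average_acquisitions(clean, n_avg, sigma, rng):
    """Simulate n_avg noisy repeats of `clean` and return their mean."""
    # Assumption: independent zero-mean Gaussian noise per acquisition.
    noise = rng.normal(0.0, sigma, size=(n_avg,) + clean.shape)
    return (clean[None, ...] + noise).mean(axis=0)

rng = np.random.default_rng(0)
clean = np.full((256, 256), 0.5)
single = average_acquisitions(clean, 1, 0.05, rng)    # no averaging (fast, noisy)
averaged = average_acquisitions(clean, 3, 0.05, rng)  # three averages (slow, cleaner)

# Ratio of residual noise levels; close to sqrt(3) under the stated assumption.
noise_ratio = (single - clean).std() / (averaged - clean).std()
print(f"{noise_ratio:.2f}")
```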

As the key principle of our model design, the denoised MR images are expected to have high fidelity and good sharpness with suppressed noise for routinely used MRI sequences. Given the complexity and diversity of clinical scenarios, the proposed model must handle various MRI contrasts and multiple organs imaged by different MRI scanners under different acceleration factors. To this end, we propose a novel denoising framework, which integrates a learnable real-world non-linear imaging model (describing the image degradation process) into a text-guided conditional DM framework as shown in Fig. 1. The text-guided DM framework is employed to “inpaint” the missing anatomical structures in the noisy images by resorting to its generative nature, which is particularly effective at relatively high acceleration rates. Meanwhile, a degradation model is learned from real-world noisy-clean MR image pairs using the proposed multi-cycle (MC) training strategy. The non-linear degradation model is built on a non-generative model, which alleviates the hallucination issue of the diffusion model and hence improves data fidelity. To endow our denoising model with enhanced versatility, a pretrained CLIP text encoder is integrated into the diffusion model and finetuned by Low-Rank Adaptation (LoRA). The non-linear degradation model and the conditional DM equipped with a finetuned text encoder were trained on the internal dataset containing 3750 volume pairs (N = 71,780 slice pairs). At inference time, an objective function is constructed using the trained DM and degradation model with their parameters frozen, and it is solved iteratively along with the reverse diffusion process using a gradient-based algorithm. In terms of computational demands, with 10 sampling steps, our method has 248 M parameters overall and needs 6.31 GB of RAM for inference. A desktop computer with an NVIDIA GeForce RTX 3080 GPU (10 GB VRAM) processes a 256 × 256 DICOM image in only 0.39 s. More detailed descriptions of our model are given in the Methods section.
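The inference procedure described above, in which a frozen denoiser and a frozen degradation model are combined through a gradient-based update inside each reverse diffusion step, can be sketched on a toy problem. Everything in the block below is an illustrative assumption rather than the actual model: `degrade` is a hypothetical elementwise non-linearity standing in for the learned degradation model, `toy_denoiser` is a simple smoothing operator standing in for the text-guided diffusion denoiser, and the step size is arbitrary. Only the number of sampling steps (10) is taken from the text.

```python
import numpy as np

def degrade(x):
    """Hypothetical non-linear degradation (elementwise tanh distortion)."""
    return np.tanh(1.5 * x)

def degrade_grad(x):
    """Elementwise derivative of `degrade` with respect to x."""
    return 1.5 * (1.0 - np.tanh(1.5 * x) ** 2)

def toy_denoiser(x):
    """Stand-in for a frozen denoiser: blend each pixel with its 4-neighbour mean."""
    neighbours = (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                  + np.roll(x, 1, 1) + np.roll(x, -1, 1)) / 4.0
    return 0.5 * x + 0.5 * neighbours

def data_residual(x, y):
    """Mean squared mismatch between the degraded estimate and the observation."""
    return float(np.mean((degrade(x) - y) ** 2))

def guided_sampling(y, rng, n_steps=10, step_size=0.2):
    """Alternate a prior (denoising) update with a gradient step on the data term."""
    x = rng.normal(0.0, 1.0, size=y.shape)  # start from pure noise
    for _ in range(n_steps):
        x = toy_denoiser(x)  # prior update from the (toy) generative model
        # Gradient of the per-pixel squared error (degrade(x) - y)^2 w.r.t. x.
        grad = 2.0 * (degrade(x) - y) * degrade_grad(x)
        x = x - step_size * grad  # data-consistency update
    return x

rng = np.random.default_rng(0)
clean = np.linspace(0.2, 0.8, 64)[None, :] * np.ones((64, 1))
y = degrade(clean) + rng.normal(0.0, 0.02, size=clean.shape)  # observed noisy image
x0 = rng.normal(0.0, 1.0, size=y.shape)                       # unguided reference
x_hat = guided_sampling(y, np.random.default_rng(1))
print(data_residual(x0, y), data_residual(x_hat, y))
```

The design point this toy example tries to convey is that the degradation model only steers the sample toward consistency with the observed image, while the denoiser supplies the prior; neither component is retrained at inference.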

Fig. 1: Overview of our study.

a The collected prospective large-scale real-world noisy-clean data pairs containing 148,930 slices of 6 organs from four MRI manufacturers. b Our denoising model built on the diffusion model framework, which consists of a learnable degradation model and a text-guided denoiser. During inference, an objective function constructed by the pretrained degradation model and text-guided denoiser is optimized via a gradient-based algorithm in each reverse diffusion step. c Extensive evaluation including denoising performance, downstream task by multi-organ tissue segmentation, and clinical impact by comprehensive reader studies.

Internal evaluation

We conducted comprehensive evaluation of our model on the internal dataset, which consists of in total 102,060 real-world noisy-clean slice pairs of head, knee, C-spine, L-spine, T-spine, and shoulder. The internal dataset was split subject-wise into training (N = 71,780), validation (N = 10,137), and test (N = 20,143) sets. On the test dataset, we conducted systematic evaluation of our model from different aspects, including (1) benchmarking with five state-of-the-art (SOTA) denoising methods quantitatively and qualitatively; (2) analyzing tissue segmentation of multiple organs on denoised images; (3) evaluating the effectiveness of core components of our framework in the ablation study. Detailed descriptions are presented in the following sections.

To validate the denoising performance of our model, we compared it with five state-of-the-art denoising methods, including the CNN-based NBNet39 and BME-X33, and the diffusion model-based DDPM26, RED-Diff32, and BlindDPS40. BlindDPS is an extended version of DPS41 that employs a learned non-linear degradation model39 instead of a pre-defined linear one. In the experiments, we adopted the same degradation model as ours in BlindDPS and, for a fair comparison, used the same training strategy. We utilized three metrics for quantitative evaluation: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index Measure (SSIM), and the Learned Perceptual Image Patch Similarity (LPIPS). LPIPS is a widely used metric that measures the perceptual similarity between two images by comparing features extracted by a deep learning (DL) model such as VGG42 or AlexNet43. In our work, we employed the LPIPS module in Python based on VGG16.
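As a reference for the simplest of the three metrics, PSNR can be computed directly from the mean squared error. The sketch below assumes images scaled to the range [0, 1] (the `data_range` argument is a hypothetical parameter name) and is not the evaluation code used in this study; SSIM and LPIPS are usually taken from existing libraries and are not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))

reference = np.zeros((4, 4))
estimate = np.full((4, 4), 0.1)   # constant error of 0.1, i.e. MSE = 0.01
print(round(psnr(reference, estimate), 2))  # 20.0
```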

In Fig. 2a, we present a visual comparison with the other methods for representative MRI sequences. For each MRI sequence, we show both the denoised images and close-up views of the regions marked in green. Our model is clearly superior to the other methods in visual perception, especially in preserving structural fidelity and anatomical details. Moreover, our model effectively removes the noise in flat regions while retaining fine details in lesion regions. This merit arises from our particular combination of the learned real-world non-linear degradation model and the text-guided denoising DM. We also validate this property in the External validation section. More visual comparisons can be found in Supplementary Figs. 29.

Fig. 2: Comparison with state-of-the-art methods on the internal dataset.

a Qualitative comparison with close-up views for two representative cases. The first case is with acceleration rate of 3×. The second case is with acceleration rate of 2×. More visual comparisons can be found in the Supplementary Fig. 9; b Quantitative comparison for different organs with acceleration rates of 1.5×, 2×, and 3× in terms of PSNR, SSIM, and LPIPS. Acceleration is achieved by avoiding averaging of multiple noisy acquisitions.

In Fig. 2b, we summarize the results of different methods in terms of SSIM, PSNR, and LPIPS according to acceleration factors of 1.5×, 2×, and 3×. We can see that, for each acceleration factor, our model (marked in orange) obtains the best performance in all the three metrics. As the acceleration rate increases (fewer scans are averaged), the SNR drops due to magnified noise. In this circumstance, denoising algorithm plays a critical role for increasing diagnosis confidence. Besides, it turns out that our model obtains more evident performance gain in LPIPS than PSNR and SSIM, indicating the advancement of our model in visual perception.

In addition, we further evaluated our denoising model by tissue segmentation. To be specific, we adopted a widely used segmentation method named FAST44 and segmented multiple organs. The segmentation results of different methods for acceleration factors of 1.5×, 2×, and 3× are illustrated in Dice score using barplots in Fig. 3a. We also performed statistical analysis for different organs, including white matter (WM), gray matter (GM), knee, C-spine, L-spine, and T-spine. We can see that our model outperforms the other methods significantly for all the investigated organs. The consistent performance gain in both the denoised images and their corresponding tissue segmentation verifies the superiority and effectiveness of our model.
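For readers unfamiliar with the Dice score used here, a minimal sketch of the computation for a pair of binary masks follows. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the segmentation itself (FAST) is not reproduced.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks (1.0 when both are empty)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention for two empty masks
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

pred = np.array([[1, 1, 0, 0]])     # mask from a denoised image (toy example)
target = np.array([[1, 0, 0, 0]])   # mask from the clean reference (toy example)
print(dice_score(pred, target))     # 2 * 1 / (2 + 1)
```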

Fig. 3: Quantitative comparison of organ segmentation.

Quantitative comparison for segmentation of different organs on internal (upper panel) and external dataset (bottom panel) in terms of Dice score with two-sided p-value test (p < 0.05 as *, p < 0.01 as **, p < 0.001 as ***, and p < 0.0001 as ****). In each barplot, the midline represents the median value, and its lower and upper boundaries represent the first and third quartiles.

Finally, we conducted ablation studies to evaluate the effectiveness of (1) the proposed MC strategy for learning the non-linear degradation model, (2) the conditional DM, (3) the LoRA-finetuned CLIP text encoder. We conduct experiments on the internal test dataset and list the results in the Supplementary Table 2. We can see that the use of MC can largely improve the denoising performance. In fact, from our empirical observations, MC can stabilize the training and improve the fidelity of the denoised images. In addition to the degradation model, the integration of conditional DM further improves the denoising performance due to its inherent generative nature, especially for highly accelerated MR images. Moreover, it is shown that the adoption of finetuned CLIP via LoRA can further improve the denoising performance by explicitly providing the information of imaging contrast, organ, acceleration factor, and MRI manufacturer.
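The idea behind LoRA finetuning, which keeps the pretrained weights frozen and trains only a low-rank update, can be sketched for a single linear layer. The block below is a generic illustration of the technique, not the configuration used for the CLIP text encoder in this work: the layer size, rank, and scaling factor are hypothetical values chosen for the example.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable, small random init
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))      # hypothetical pretrained projection
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 512))
print(layer.trainable_params(), W.size)  # 4096 262144
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only a small fraction of the parameters needs to be updated during finetuning.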

External validation

In order to evaluate the generalizability of our model, we directly apply the pretrained model on a prospectively collected external dataset, which consists of 46,870 slice pairs from four cohorts. The external dataset covers slice pairs of head (N = 24,912), knee (N = 6778), C-spine (N = 4358), L-spine (N = 4284), T-spine (N = 2240), and shoulder (N = 4298) from Siemens (N = 8487), GE (N = 20,046), Philips (N = 17,702), and United Imaging Healthcare (UIH) (N = 635) with acceleration factors of 1.5× (N = 23,875), 2× (N = 16,359), and 3× (N = 6636). More descriptions of the population characteristics are detailed in the Supplementary Table 1. In this experiment, we not only conduct quantitative and qualitative evaluations on the denoised images and their corresponding segmentation maps, but also carry out extensive multi-round reader studies. The results of the multi-perspective evaluation are described in the following sections.

As in the internal evaluation, we demonstrate the denoising performance of different methods on the external dataset in Fig. 4. Although the MR images were acquired by four different scanners (Siemens Aera 1.5T, GE SIGNA Voyager 1.5T, Philips Prodiva CX 1.5T, UIH uMR 660 1.5T) from four data centers, our model provides reliable denoised images of high fidelity with an improvement of 1.09% in SSIM, 1.51 dB in PSNR, and 0.0218 in LPIPS compared to the other denoising methods, indicating strong generalizability to unseen data. More visual comparisons can be found in Supplementary Figs. 29. We also demonstrate across-slice consistency on a 3D T1w denoised image (UIH uMR 660, 1.5T, isotropic spacing of 0.67 mm) in Supplementary Fig. 10. The axial view was used for slice-wise denoising, and our model achieves good consistency and continuity across slices in the sagittal and coronal views.

Fig. 4: Comparison with state-of-the-art methods on the external dataset.

a Qualitative comparison with close-up views for two representative cases. The first case is with acceleration rate of 2×. The second case is with acceleration rate of 3×. More visual comparisons can be found in Supplementary Fig. 28; b Quantitative comparison for different organs with acceleration rates of 1.5×, 2×, and 3× in terms of PSNR, SSIM, and LPIPS. Acceleration is achieved by avoiding averaging of multiple noisy acquisitions.

In addition, we perform tissue segmentation on the four external unseen cohorts (N = 46,870 slices) as well. The results are depicted in Dice score in barplots in Fig. 3b. It is shown that our method achieves consistent superiority in tissue segmentation for different organs compared to the other methods. Interestingly, the other DM-based methods obtain slightly worse segmentation performance than NBNet, which might be due to the lack of structural fidelity preservation.

Finally, reader studies were conducted by two radiologists with 5 and 3 years of specialty experience, respectively, to evaluate the clinical utility of our model. In the first reader study, one previously published denoising model (BME-X) was also included for comparison. As shown in Fig. 5a, b and Supplementary Fig. 13, our denoised images demonstrate statistically significant superiority over the noisy images and BME-X images in terms of both overall image quality and diagnostic confidence for head (n = 145), spinal (n = 50), and articular regions (n = 99) (all p < 0.05), achieving comparable visual fidelity to the GT images (all p > 0.05). The second reader study assessed the diagnostic performance of our denoised images and the GT images across three clinical benchmarks: (1) Fazekas grading of white matter lesions, (2) signal intensity agreement in spinal disc, and (3) musculoskeletal pathology detection. Brain MRI analysis shows no significant differences in white matter lesion detection accuracy (Fig. 5c, p > 0.05). Cervical and lumbar spine MRI analyses reveal strong correlations in disc signal intensity indices (Spearman’s ρ = 0.84 for cervical, ρ = 0.98 for lumbar; both p < 0.001, Fig. 5d, e). Shoulder MRI evaluations (Fig. 5f) demonstrate promising diagnostic consistency for tendon integrity (supraspinatus/infraspinatus/subscapularis/teres minor), biceps tendon pathology, and cartilage and labral abnormalities. Knee MRI assessments (Fig. 5g) similarly achieve full diagnostic agreement with GT for meniscal tears, ligament injuries (ACL/PCL/MCL/LCL), and cartilage/bone marrow abnormalities. Notably, our denoising model exhibited superior performance when compared to BME-X in detecting injuries of the infraspinatus tendon, teres minor tendon, and long head of the biceps tendon (Fig. S1b). Good agreement was observed between the readers, with intraclass correlation coefficients ranging from moderate to excellent (0.67–0.99 [95% CI, 0.43–0.99]). Representative clinical cases are provided in Supplementary Fig. 1.

Fig. 5: Multi-dimensional evaluation of imaging performance and clinical validity across anatomical region.

a Clinical evaluation of overall image quality (4-point scale: 1=poor to 4=excellent) across head, cervical/lumbar spine (C + L), and shoulder/knee joints for noisy images, denoised images by our method (Ours), and ground-truth references (GT). b Diagnostic confidence ratings (1=inadequate to 4=definitive certainty). Friedman test with post hoc Wilcoxon signed-rank tests (Bonferroni-corrected α = 0.0167) showed Ours significantly outperformed noisy images (all p < 0.001) with no significant differences from GT (all p > 0.0167). c Fazekas grading (1–3) for cerebral small vessel disease in Ours vs GT. d, e Strong Spearman correlations between Ours and GT for cervical (ρ = 0.84) and lumbar (ρ = 0.98) disc degeneration indices (both p < 0.001). f, g Detection rates of key pathologies (e.g., rotator cuff tears, meniscal injuries) in shoulder/knee joints, quantified as percentage agreement with GT. Color-coded stacked bars (yellow/green/blue/purple: 1–4 ratings) highlight consistent proximity of our outputs to GT benchmarks, with no statistical significance (all p > 0.001, pairwise Wilcoxon signed-rank tests) across metrics.

Public dataset validation

To verify the versatility of our denoising method on other MRI field strengths, organs, and international cohorts, we conducted additional experiments on other cohorts: the M4Raw45 low-field strength dataset and the fastMRI knee34, prostate46, and breast47 datasets. The M4Raw dataset is a real-world low-field (0.3T) MRI dataset that contains repetitive MRI scans of the brain. It comprises a training set of 128 subjects, a validation set of 30 subjects, and a test set of 25 subjects. Due to large domain gap, we fine-tuned our pretrained model on this 0.3T dataset. Following data processing approach of existing methods25, the training and validation dataset used three-repetition-averaged images as ground truth images, while higher-SNR labels of the test dataset were created by averaging six repetitions for T1w and T2w, and four for FLAIR.

We further conducted comparison experiments on the 1.5T and 3.0T MR images of the fastMRI knee34, prostate46, and breast47 datasets. Since noisy images are unavailable in fastMRI, we chose two noise synthesis approaches to simulate noisy images: (1) adding Rician noise with σ = 13/255, following the common setting in real-world MR images24,25; (2) adding Rician noise with σ ~ U[6, 25], covering a wide range of real-world scenarios. For the knee dataset, we performed both zero-shot and few-shot evaluation (using 50 knee volumes). For the prostate and breast datasets, we performed zero-shot evaluation. Results are shown in Supplementary Tables 3, 4 and 5. Our method significantly outperforms all the other compared methods in terms of PSNR and SSIM, indicating its versatility across different MRI field strengths, organs, noise distributions, and application settings such as zero-shot and few-shot.
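The first noise synthesis setting can be sketched as follows. The block uses the standard construction of Rician noise (Gaussian noise added independently to a real and an imaginary channel, followed by the magnitude operation) and assumes magnitude images scaled to [0, 1], which is an assumption about the preprocessing rather than something stated in the text; `add_rician_noise` is a hypothetical helper name.

```python
import numpy as np

def add_rician_noise(image, sigma, rng):
    """Corrupt a magnitude image with Rician noise of per-channel std `sigma`."""
    real = image + rng.normal(0.0, sigma, size=image.shape)  # noisy real channel
    imag = rng.normal(0.0, sigma, size=image.shape)          # noisy imaginary channel
    return np.sqrt(real ** 2 + imag ** 2)                    # magnitude image

rng = np.random.default_rng(0)
sigma = 13 / 255
clean = np.zeros((256, 256))   # background-only toy image, assumed range [0, 1]
noisy = add_rician_noise(clean, sigma, rng)

# With zero signal the magnitude is Rayleigh distributed, with mean sigma * sqrt(pi / 2).
expected_background = sigma * np.sqrt(np.pi / 2)
print(noisy.mean(), expected_background)
```

The non-zero background mean shown here is the signal-dependent bias that distinguishes Rician noise from additive Gaussian noise in low-intensity regions.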

Discussion

In this work, we have presented a unified MRI denoising framework, developed on 148,930 real-world noisy-clean image pairs covering 96 MRI protocols for six organs from Siemens, GE, Philips, and UIH. Our model is built on a text-guided diffusion model and integrates a learnable non-linear degradation model at inference time to ensure data consistency. We rigorously assess our model using an internal test set comprising 20,143 clinical image pairs and a multi-center external dataset of 46,870 image pairs, employing both quantitative and qualitative analyses across similarity metrics, tissue segmentation tasks, and blinded reader studies. Our model consistently surpasses the leading denoising approaches, particularly with respect to visual fidelity. Furthermore, when directly applied to real-world MR images exhibiting varying noise levels from accelerated clinical acquisitions, our approach maintains diagnostic performance on par with ground-truth images, underscoring its robustness and generalizability to previously unseen data.

Our empirical investigations and comprehensive assessments reveal that the proposed model exhibits two principal denoising merits: (1) superior fidelity preservation and (2) substantial perceptual enhancement. The high-fidelity outcome stems from a non-linear degradation model, learned from a large-scale, prospectively assembled dataset of real-world noisy–clean image pairs in conjunction with a multi-cycle learning strategy. Empirically, we found that using a degradation model, which generates noisy images from their clean counterparts, tends to yield sharper denoised images than using a restoration model, which restores clean images from their noisy counterparts. Our hypothesis is that, for non-generative models trained with an L2-norm loss, which favors predictions that are correct only on average, estimating noisy images is more difficult than estimating clean ones. Consequently, the degradation model requires a relatively noisier clean image to better align with its noisy reference image. Because the degradation model converges with more difficulty than the restoration model, we introduce a restoration model to accompany the training of the degradation model in a multi-cycle fashion. The two models are treated as mutually inverse functions, and deep supervision is applied to each cycle. In this way, the degradation model can provide sharper denoised images with high fidelity in a stable manner. The second merit originates from the text-assisted diffusion model, whose generative nature facilitates denoising, especially for severely contaminated images. The degradation model and diffusion model, trained independently, are integrated during inference: the diffusion process is regularized by the degradation model, which constrains the denoising trajectory and further enhances denoising performance.
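One possible reading of the multi-cycle strategy with per-cycle supervision is sketched below. This is an interpretation for illustration only: the exact loss, the number of cycles, and the network architectures are not specified in this section, so `multi_cycle_loss`, `n_cycles`, and the toy `degrade`/`restore` functions are all hypothetical. The sketch chains the degradation and restoration models and supervises the output of every cycle against the paired noisy and clean images.

```python
import numpy as np

def multi_cycle_loss(clean, noisy, degrade, restore, n_cycles=3):
    """Sum supervised L2 losses over repeated degrade -> restore cycles."""
    total = 0.0
    x = clean
    for _ in range(n_cycles):
        y_hat = degrade(x)                              # clean estimate -> noisy estimate
        total += float(np.mean((y_hat - noisy) ** 2))   # supervise the degradation output
        x = restore(y_hat)                              # noisy estimate -> clean estimate
        total += float(np.mean((x - clean) ** 2))       # supervise the restoration output
    return total

# Toy pair in which the two models are exact inverses of each other.
clean = np.full((8, 8), 0.5)
noisy = clean + 0.1
perfect = multi_cycle_loss(clean, noisy, lambda x: x + 0.1, lambda y: y - 0.1)
imperfect = multi_cycle_loss(clean, noisy, lambda x: x + 0.2, lambda y: y - 0.1)
print(perfect, imperfect)
```

In this toy setting the loss vanishes only when the two functions are mutual inverses that also match the paired data, and an error in either model accumulates across cycles.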

Our model achieves significant improvement compared to the state-of-the-art DM-based denoising methods on both the internal and external test data for diverse clinical scenarios. Notably, the superiority of our method is particularly pronounced in terms of visual quality. We observe that this advantage becomes increasingly apparent under stronger noise associated with increased acceleration factors. Quantitatively, our model yields an average PSNR improvement of 2.06 dB for head, 2.02 dB for knee, 1.87 dB for cervical spine, 1.77 dB for lumbar spine, 1.61 dB for thoracic spine, and 1.98 dB for shoulder, compared to the cutting-edge diffusion model-based baselines on the test dataset. Furthermore, subsequent tissue segmentation on the denoised images achieves a mean Dice coefficient of 85.57% across six organs, consistently surpassing alternative denoising methods and thereby further verifying the efficacy of our framework.

Moreover, our model shows promising generalizability to unseen cohorts without any finetuning. The large-scale evaluation on 2157 real-world volumes covering diverse clinical scenarios, including six organs, multiple noise levels, 96 MRI protocols, and four MRI vendors, demonstrates the denoising feasibility and reliability of our model. Specifically, we have extensively evaluated the clinical impact of our model on external unseen data in multi-perspective reader studies. Our findings demonstrate that our model achieves promising overall image quality and diagnostic confidence comparable to that of the GT images across head, spine, and joint, significantly outperforming the noisy input in both evaluations. More importantly, we show that our denoised images consistently achieve diagnostic performance equivalent to the GT images in three critical clinical assessments: (1) Fazekas grading for cerebral small vessel disease shows no statistically significant differences (p > 0.05); (2) signal intensity measurements in spinal disc exhibit strong Spearman correlations in both cervical (ρ = 0.84) and lumbar (ρ = 0.98) regions; and (3) key pathology detection rates in shoulder/knee examinations, including rotator cuff tears and meniscal injuries, reveal comparable diagnostic accuracy with no statistically significant differences (p > 0.05). These evaluations confirm the ability of our model to preserve clinically essential tissue contrast properties and anatomical integrity. The combination of highly accelerated acquisition and high-fidelity image quality makes our method a viable strategy for optimizing the efficiency of the MRI workflow without compromising diagnostic reliability relative to existing clinical setups.

Our denoising method is applied to the DICOM images reconstructed by the MR scanner. Therefore, it can be seamlessly integrated into the user interface of the MR scanner as a plugin or an optional image enhancement module. Our method also has flexible practical applications. The existing clinical protocols used on 1.5T MRI scanners usually apply 1.2×–3× acceleration by undersampling in k-space. To achieve enhanced SNR, usually 2–3 averages of accelerated scans are required, which increases the overall acquisition time. Figure 5 illustrates a significant difference in diagnostic confidence between accelerated images with and without averaging (GT image vs. noisy image). Our denoising method achieves acceleration by eliminating the averaging without compromising diagnostic confidence. The acquisition time for clinically representative MRI protocols can be reduced by 30% on average. This shortens patient waiting time and improves equipment utilization. In addition, our method requires only a desktop computer with an NVIDIA GeForce RTX 3080 GPU (10 GB VRAM) to run inference within seconds, without encumbering the existing clinical workflow.

Despite the promising performance of our denoising method, our study has several limitations. First, it is worth noting that our denoising model is a blind denoising method, applied to reconstructed DICOM images with no aliasing artifacts. That is, when traditional reconstruction algorithms such as GRAPPA7 or SENSE8 are used, acceleration through k-space subsampling is usually limited to at most 3× to avoid aliasing artifacts. Another line of work to reduce acquisition time relies on MRI reconstruction, which can deal with aliasing artifacts but needs access to the subsampled k-space data, which is difficult to collect in practice. Recently, some physics-informed MRI reconstruction methods36,37 have sought to alleviate the reliance on real-world k-space data and boost model generalizability for fast MRI reconstruction. These physics-informed frameworks rely on theoretical equations to synthesize data. In contrast, our degradation model is learned directly from large-scale real-world data, and it empirically captures complex, non-linear noise characteristics (e.g., system-specific electronic noise, physiological motion). A combination of real-world blind denoising approaches and synthesis-empowered MRI reconstruction approaches would be a potential research direction. Second, our method was developed and trained on real-world data of six anatomical regions, including the head, shoulder, knee, cervical spine, lumbar spine, and thoracic spine. Future efforts will extend the applicability of our method to more organs.

To summarize, we proposed a unified denoising model for real-world accelerated MRI, which integrates an elaborately designed non-linear degradation model into a text-assisted diffusion model and leverages large-scale, prospectively collected real-world noisy-clean image pairs. Extensive validation, including quantitative evaluation, qualitative assessments, tissue segmentation, and multi-center reader studies, demonstrates that our model achieves superior perceptual quality, promising diagnostic reliability, and strong generalizability across various imaging protocols. By integrating our proposed denoising model, the acquisition time of clinically representative MRI protocols can be reduced by 30% on average, bringing the scan time below one minute without compromising diagnostic performance or diagnostic confidence compared with the routinely used scanning setups (typically 2–3 min or even longer). These findings highlight its promising potential for a broad range of clinical scenarios.

Methods

Ethical approval

The prospective data collection was approved by the institutional review board (IRB) at each institution with a waiver for informed consent: Huazhong University of Science and Technology Union Hospital (Nanshan Hospital, ky-2024-102301), The Second People’s Hospital of Panyu Guangzhou (py2y-xjsll-20250017), Longgang Central Hospital of Shenzhen (2025ECPJ170), Southern University of Science and Technology Hospital (SUCTH-014), Shenzhen Bao’an Songgang People’s Hospital (IRB-YJ-2025-043), and Shenzhen FuYong People’s Hospital (KY202603). In this study, patients were directly involved or recruited for the study. For all research involving human participants, informed consent to participate in the study has been obtained from participants.

Dataset collection and preprocessing

We collected internal data from three hospitals, namely Huazhong University of Science and Technology Union Hospital (Nanshan Hospital), The Second People’s Hospital of Panyu Guangzhou, and Longgang Central Hospital of Shenzhen in Guangdong Province, China, comprising both healthy and non-healthy subjects. The internal data were acquired on three 1.5T MRI scanners, i.e., Siemens MAGNETOM Amira 1.5T, GE SIGNA Voyager 1.5T, and Philips Ingenia 1.5T, from January 2024 to August 2024. The internal dataset consists of 5366 noisy-clean volume pairs with 102,060 paired slices in total, as shown in Fig. 1a. The GT images were obtained by averaging the accelerated acquisitions or from fully sampled acquisitions. Some subjects were scanned for multiple organs, including head, knee, cervical vertebra, lumbar vertebra, thoracic vertebra, and shoulder. Depending on the scanned organs, 18 imaging protocols such as T1-weighted (T1w), T2-weighted (T2w), proton density-weighted (PDw), fat-suppressed T1 fluid-attenuated inversion recovery (T1 FLAIR), fat-suppressed T2 FLAIR, and diffusion-weighted imaging (DWI) were acquired. More details of the imaging parameters can be found in Supplementary Table 1. In the experiments, the entire internal dataset was randomly divided into training, validation, and test sets in a ratio of 7:1:2.

In addition to Nanshan Hospital, three further hospitals, namely Southern University of Science and Technology Hospital, Shenzhen Bao’an Songgang People’s Hospital, and Shenzhen FuYong People’s Hospital, were involved in external validation (four hospitals in total), as shown in Supplementary Table 1. Specifically, 2157 volume pairs (46,870 slice pairs) of six organs were collected from Siemens (N = 8487), GE (N = 20,046), Philips (N = 17,702), and UIH (N = 635) scanners, where N counts slice pairs. The same imaging sequences as for training were used for external validation, covering accelerations of 1.5×, 2×, and 3×.

Because the noisy and clean images of each pair were acquired sequentially in clinical setups, their image size and object orientation can be inconsistent. To address this issue, we first transformed the images into the frequency domain with the fast Fourier transform, applied zero-padding in the frequency domain, and transformed the result back to the image domain with the inverse Fourier transform. Subsequently, we used rigid registration to align the noisy images with the clean ones.
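The frequency-domain zero-padding step can be illustrated with a minimal one-dimensional sketch. It assumes a real-valued signal and handles upsampling only. The names `dft`, `idft`, and `fourier_upsample` are illustrative and not taken from the actual pipeline, which operates on 2D images (where the same operation would be applied along each axis).

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_upsample(x, m):
    """Resample a real signal of length n to length m >= n by zero-padding its spectrum."""
    n = len(x)
    if m < n:
        raise ValueError("this sketch only supports upsampling (m >= n)")
    spectrum = dft(x)
    padded = [0j] * m
    for k in range(n):
        # Map bin k to its signed frequency so that negative frequencies land
        # at the end of the longer spectrum; the new high-frequency bins stay zero.
        signed = k if k <= n // 2 else k - n
        padded[signed % m] += spectrum[k]
    # Rescale by m / n to preserve intensities; the input is real, so keep the real part.
    return [(v * m / n).real for v in idft(padded)]


# A 4-sample cosine is resampled onto an 8-sample grid.
coarse = [math.cos(2 * math.pi * t / 4) for t in range(4)]
fine = fourier_upsample(coarse, 8)
```

In this toy case the resampled values follow the same cosine on the finer grid, which is the sinc-interpolation behavior that makes zero-padding a convenient way to match image sizes before registration.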

Overall model architecture and training

Our denoising framework contains two main modules, namely a non-linear degradation model and a text-guided conditional diffusion model, as shown in Fig. 1b. The two modules were trained individually and used jointly in the inference phase. As mentioned above, diffusion models have strong generative ability and can provide plausible-looking results on severely degraded images. However, they are prone to hallucination, which can lead to serious consequences in the safety-critical field of medical imaging. To mitigate this issue, we introduce a learnable degradation model (which, given a clean image and text information about the degradation, estimates the noisy counterpart) to strengthen denoising fidelity and sharpness. From our empirical observations, directly training a degradation model on real-world image pairs is difficult due to the vast search space. We therefore propose a multi-cycle training strategy in which the degradation model is trained along with its inverse function, i.e., a restoration model. Specifically, we treat the degradation model and its restoration counterpart as a unity and stack several such unities during training, with deep supervision applied to each of them. This training strategy encourages multi-cycle consistency and leads to more stable convergence and more reliable degradation. It is worth noting that the degradation and restoration models share the same NBNet39 architecture, with 1-128-64-32-16-8-16-32-64-128-1 as channel numbers, but have different model weights. For the conditional diffusion model, the noisy image is employed as the condition of the denoiser, and a pretrained CLIP text encoder is finetuned using LoRA. We adopted the Dhariwal U-Net48 as the denoiser of the diffusion model.

Multi-cycle training for non-linear degradation model

Denoising is regarded as an ill-posed inverse problem. Generally speaking, the inverse problem can be formulated as:

$$y=f\left(x\right)+\varepsilon$$
(1)

where \(y\) is the noisy image (observation or measurement), \(x\) is the expected clean image, and \(f({\rm{\cdot }})\) is a system model or degradation model (for denoising problems, \(f(\cdot )\) is usually taken to be the identity function). The noise term \(\varepsilon\) is usually simplified and assumed to be additive white Gaussian noise (AWGN). In reality, however, the recorded intensity values are not pixel-independent, and the noise is more complex than AWGN. For example, the noise in MRI magnitude images (DICOM) is intensity-dependent; it is usually modeled by a Rician distribution and is more complex still for multi-coil imaging. To better model the noise distribution in real-world MR images, we learn a non-linear degradation function \(f({\rm{\cdot }})\) with a neural network \({f}_{\varphi }\), aiming to better preserve data fidelity and alleviate the hallucination effect.
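The intensity dependence of magnitude-image noise can be checked with a small simulation. The sketch below assumes the textbook single-coil model, in which independent Gaussian noise is added to the real and imaginary channels before the magnitude is taken; the function name `magnitude_samples` and all numerical settings are illustrative choices, not values from this study.

```python
import math
import random


def magnitude_samples(signal, sigma, n, rng):
    """Magnitudes of a real signal corrupted by complex Gaussian noise (Rician distributed)."""
    return [math.hypot(signal + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
sigma = 1.0
n = 20000

background = magnitude_samples(0.0, sigma, n, rng)  # no signal: Rayleigh regime
tissue = magnitude_samples(10.0, sigma, n, rng)     # strong signal: nearly Gaussian

# Bias of the mean magnitude relative to the true signal level.
background_bias = sum(background) / n - 0.0
tissue_bias = sum(tissue) / n - 10.0
print(f"bias at signal 0:  {background_bias:.3f}")
print(f"bias at signal 10: {tissue_bias:.3f}")
```

In theory the background bias is \(\sigma \sqrt{\pi /2}\approx 1.25\sigma\), while at high signal it shrinks to roughly \({\sigma }^{2}/(2A)\) for signal level \(A\). The same noise source therefore acts differently in dark and bright regions, which a purely additive signal-independent model cannot express.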

The straightforward way to approximate the real-world degradation function \(f(\cdot )\) is to directly train an end-to-end model \({f}_{\varphi }\) using the clean-noisy image pairs. However, in our experiments, we found that this training paradigm leads to unstable results. To address this issue, we propose a novel training strategy based on multi-cycle consistency. Specifically, in addition to the degradation model \({f}_{\varphi }\), we introduce a restoration model \({f}_{\psi }\), which acts as the inverse function of the degradation model. The restoration model \({f}_{\psi }\) is cascaded with the degradation model \({f}_{\varphi }\) as a unity. A series of such unities is stacked and trained simultaneously with deep supervision. In this way, the errors of the later unities can be backpropagated to the earlier ones, and hierarchical constraints are imposed on the degradation model, which effectively reduces the solution space and stabilizes the performance of the degradation model. It should be mentioned that different unities do not share model weights, and we set the number of unities to two in our experiments as a tradeoff between denoising performance and training resources.

Compared with straightforward end-to-end training, the degradation model trained with our multi-cycle strategy improves the fidelity and sharpness of the denoised images. To further enhance the perceptual quality of the denoised images, we additionally use an adversarial loss to train the stacked unities. Specifically, each of \({f}_{\psi }\) and \({f}_{\varphi }\) uses a weighted sum of the L1-norm, MS-SSIM, and adversarial losses as its loss function, and the overall loss is the sum of the losses for both \({f}_{\psi }\) and \({f}_{\varphi }\) over all unities:

$${L}_{{\rm{Total}}}=\mathop{\sum }\limits_{n=1}^{N}\left[{L}_{n}\left({f}_{\psi }\left({x}_{n}^{{\rm{noisy}}}\right)\right)+{L}_{n}\left({f}_{\varphi }\left({x}_{n}^{{\rm{clean}}}\right)\right)\right]$$
(2)
$$\begin{array}{l}{L}_{n}\left({f}_{\psi }\left({x}_{n}^{\mathrm{noisy}}\right)\right)={\left|{f}_{\psi }\left({x}_{n}^{\mathrm{noisy}}\right)-{x}_{\mathrm{GT}}\right|}_{1}+{\lambda }_{1}{L}_{\mathrm{MS}-\mathrm{SSIM}}\left({f}_{\psi }\left({x}_{n}^{\mathrm{noisy}}\right),{x}_{\mathrm{GT}}\right)\\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+{\lambda }_{2}D\left({f}_{\psi }\left({x}_{n}^{\mathrm{noisy}}\right),{x}_{\mathrm{GT}}\right)\end{array}$$
(3)
$$\begin{array}{l}{L}_{n}\left({f}_{\varphi }\left({x}_{n}^{\mathrm{clean}}\right)\right)={\left|{f}_{\varphi }\left({x}_{n}^{\mathrm{clean}}\right)-{x}_{\mathrm{noisy}}\right|}_{1}+{\lambda }_{1}{L}_{\mathrm{MS}-\mathrm{SSIM}}\left({f}_{\varphi }\left({x}_{n}^{\mathrm{clean}}\right),{x}_{\mathrm{noisy}}\right)\\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,+{\lambda }_{2}D\left({f}_{\varphi }\left({x}_{n}^{\mathrm{clean}}\right),{x}_{\mathrm{noisy}}\right)\end{array}$$
(4)

where \({x}_{n}^{{\rm{noisy}}}\) is the noisy input for the n-th restoration model, and \({x}_{n}^{{\rm{clean}}}\) is the clean (denoised) input for the n-th degradation model. \({x}_{{\rm{GT}}}\) is the clean GT image, \({x}_{{\rm{noisy}}}\) is the acquired noisy image, and N is the total number of unities. The function D(·) is the cross-entropy function serving as the discriminator to distinguish whether the results of the models \({f}_{\psi }\) and \({f}_{\varphi }\) resemble the real-world clean or noisy images, respectively. The same set of weighting parameters \({\lambda }_{1},\,{\lambda }_{2}\) was used for \({L}_{n}\left({f}_{\psi }\left({x}_{n}^{{\rm{noisy}}}\right)\right)\) and \({L}_{n}({f}_{\varphi }({x}_{n}^{{\rm{clean}}}))\). Note that the degradation network trained in the first unity was used for inference. For better interpretation, we illustrate the output of the degradation model over the reverse diffusion process in Supplementary Fig. 11.
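The data flow behind the deeply supervised sum in Eq. 2 can be sketched in a few lines. This is a simplified illustration under stated assumptions: only the L1 term is kept (the MS-SSIM and adversarial terms are omitted), the networks are replaced by plain Python callables, and the chaining order (the output of each degradation model feeds the next restoration model) is one reading of the description rather than a confirmed implementation detail. All names are hypothetical.

```python
def l1(a, b):
    """Mean absolute error between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def multi_cycle_loss(y_noisy, x_gt, restorers, degraders):
    """Accumulate the per-unity losses of stacked restoration/degradation pairs.

    restorers[n] plays the role of f_psi and degraders[n] the role of f_phi
    in the n-th unity; unities do not share weights, hence separate callables.
    """
    total = 0.0
    current_noisy = y_noisy
    for f_psi, f_phi in zip(restorers, degraders):
        restored = f_psi(current_noisy)   # x_n^clean
        total += l1(restored, x_gt)       # restoration output supervised by the GT image
        degraded = f_phi(restored)        # re-degraded estimate
        total += l1(degraded, y_noisy)    # degradation output supervised by the noisy image
        current_noisy = degraded          # assumed input x_{n+1}^noisy of the next unity
    return total


# Toy data: the "noise" is a constant offset of 0.5.
x_gt = [1.0, 2.0]
y_noisy = [1.5, 2.5]


def remove_offset(img):
    return [v - 0.5 for v in img]


def add_offset(img):
    return [v + 0.5 for v in img]


def identity(img):
    return list(img)


perfect = multi_cycle_loss(y_noisy, x_gt, [remove_offset] * 2, [add_offset] * 2)
untrained = multi_cycle_loss(y_noisy, x_gt, [identity] * 2, [identity] * 2)
```

Because every unity contributes its own supervised terms, an error made by the first degradation model also raises the losses of the second unity, which is the mechanism by which later unities constrain earlier ones.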

LoRA-finetuned text guidance for diffusion model

In this work, we aim to develop a unified denoising model that can handle multiple organs and clinically widely used imaging sequences from four MRI vendors. Since different organs, imaging sequences, and acceleration rates lead to large diversity in noise level and data distribution, we integrate a CLIP49 text encoder to explicitly guide the denoising model. The text encoder embeds a text prompt constructed from the organ, contrast, acceleration rate, and vendor information and is finetuned by LoRA50 on the training data. Specifically, the template for the prompt is: “MR image acquired by facilities from [vendor name], organ in image is [organ name], acquisition protocol is [protocol name]”. Acceleration rate information is included in the protocol name. A 16-token prompt containing the above imaging metadata is fed into the pre-trained CLIP text encoder of Stable Diffusion v2.1, which is adapted to MRI data through LoRA finetuning, and we ultimately obtain a 16 × 1024 text embedding. The image embedding is fused with the text embedding by cross-attention, where the features of the denoiser bottleneck are flattened and used as the query (Q), and the text embeddings are used as the key (K) and value (V). With the text guidance, the score-based denoiser can better group features from different classes into clusters with larger inter-class distances and smaller intra-class distances. We show the effectiveness of the CLIP-based text encoder in Supplementary Fig. 12.
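For concreteness, the prompt construction can be sketched as a small helper. The template string is the one quoted above; the function name `build_prompt` and the example metadata values (in particular the protocol string "PDw 2x") are hypothetical and only show how the acceleration rate could be carried inside the protocol name.

```python
PROMPT_TEMPLATE = (
    "MR image acquired by facilities from {vendor}, "
    "organ in image is {organ}, "
    "acquisition protocol is {protocol}"
)


def build_prompt(vendor: str, organ: str, protocol: str) -> str:
    """Fill the text-guidance template with imaging metadata."""
    return PROMPT_TEMPLATE.format(vendor=vendor, organ=organ, protocol=protocol)


# Hypothetical metadata for a single knee scan.
prompt = build_prompt("Siemens", "knee", "PDw 2x")
print(prompt)
```

Tokenization, padding or truncation to 16 tokens, and the LoRA-adapted encoder itself are outside the scope of this sketch.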

Fidelity-enhanced optimization-based inference

Diffusion models have strong generative ability but are prone to producing unrealistic structures, which is fatal for medical diagnosis. To address this issue, we incorporate data consistency into the diffusion model framework following the inference paradigm of RED-diff32, which is a variational diffusion model:

$$\mathop{\min }\limits_{\varphi }{D}_{{\rm{KL}}}\left({q}_{\varphi }({x}_{0}{|y}){||p}({x}_{0}{|y})\,\right)=\mathop{\min }\limits_{\varphi }{{\rm{E}}}_{{x}_{0} \sim {q}_{\varphi }}\left[\log {q}_{\varphi }\left({x}_{0}|y\right)-\log p\left({x}_{0}|y\right)\right]$$
(5)

Here, \(\varphi\) denotes the parameters of a learnable model that outputs the restoration result \({x}_{0}\) given the input \(y\). Optimizing the KL objective in Eq. 5 is equivalent to the following minimization problem:

$$\mathop{\min }\limits_{\varphi }\,{{\rm{E}}}_{{x}_{0} \sim {q}_{\varphi }}[-\log p\left(y|{x}_{0}\right)]+{D}_{{\rm{KL}}}\left({q}_{\varphi }({x}_{0}{|y}){||p}({x}_{0})\,\right)$$
(6)
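The step from Eq. 5 to Eq. 6 follows from Bayes' rule. The short derivation below spells it out using only the symbols already defined; it is included as a reading aid and does not replace the formal treatment in RED-diff32. Writing the posterior as

$$\log p\left({x}_{0}|y\right)=\log p\left(y|{x}_{0}\right)+\log p\left({x}_{0}\right)-\log p\left(y\right)$$

and substituting it into the expectation in Eq. 5 gives

$${{\rm{E}}}_{{x}_{0} \sim {q}_{\varphi }}\left[\log {q}_{\varphi }\left({x}_{0}|y\right)-\log p\left({x}_{0}|y\right)\right]={{\rm{E}}}_{{x}_{0} \sim {q}_{\varphi }}\left[-\log p\left(y|{x}_{0}\right)\right]+{D}_{{\rm{KL}}}\left({q}_{\varphi }({x}_{0}|y)\,||\,p({x}_{0})\right)+\log p\left(y\right)$$

Since \(\log p(y)\) does not depend on \(\varphi\), it can be dropped from the minimization, which leaves the data-fidelity term and the KL regularizer of Eq. 6.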

The reverse diffusion process involves solving the above objective using stochastic gradient descent (SGD), so that the intermediate state \(\mu\) of the reverse diffusion step continuously approximates the clean image \({x}_{0}\) given the measurement \(y\):

$${\nabla }_{{\mu }_{t}}{{\mathscr{L}}}_{{\mu }_{t}}={\nabla }_{{\mu }_{t}}\left[-\log p\left(y|{\mu }_{t}\right)+{D}_{{\rm{KL}}}\left({q}_{\varphi }({\mu }_{t}{|y}){||p}({\mu }_{t})\,\right)\right]$$
(7)
$${\mu }_{t-1}={\mu }_{t}-{\nabla }_{{\mu }_{t}}{{\mathscr{L}}}_{{\mu }_{t}}$$
(8)

The first term in the above equation, \(-\log p\left(y|{\mu }_{t}\right)\), can be realized by the reconstruction (data fidelity) term, corresponding to \({\Vert y-{f}_{\varphi }\left({\mu }_{t}\right)\Vert }^{2}\), to ensure the fidelity of the result. The distribution \({q}_{\varphi }\left({\mu }_{t}|y\right)\) can be expressed using the simple model \({q}_{\varphi }\left(\mu |y\right){\rm{:= }}N(\mu ,{\sigma }^{2}{I}_{n})\), and the prior \(p({x}_{0})\) is modeled through its score function using the score-matching method30,51. Ultimately, the gradient of the KL-regularization term in the above loss function can be simplified to the following equivalent form:

$${\nabla }_{{\mu }_{t}}{D}_{{\rm{KL}}}\left({q}_{\varphi }({\mu }_{t}{|y}){||p}({\mu }_{t})\,\right)=\lambda {\left[{\mu }_{t}-{g}_{\theta }\left({x}_{t},t\right)\right]}^{{\rm{T}}}$$
(9)

In the above formula, \(\lambda\) is a constant, and \({g}_{\theta }\left({x}_{t},t\right)\) is the output of the score-matching denoiser that fits \(p\left({x}_{0}\right)\). It is worth noting that the output of the score-matching denoiser is the clean image \({x}_{0}\) predicted from \({x}_{t}\), rather than standard Gaussian noise. For the complete mathematical proof of the above formula, please refer to Proposition 2 in RED-diff32. To ease interpretability, we illustrate the evolution of the denoised image over the reverse diffusion process in Supplementary Fig. 11.
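The interplay of the two gradient terms in Eqs. 7 to 9 can be illustrated with a scalar toy problem. Everything here is an assumption made for illustration: the quadratic `degrade` function stands in for the learned degradation model, `toy_denoiser` stands in for \({g}_{\theta }\) and simply returns a fixed clean value, the current estimate is passed to the denoiser directly instead of a re-noised \({x}_{t}\), and an explicit `step_size` is introduced although Eq. 8 does not show one.

```python
def degrade(mu):
    """Toy non-linear degradation standing in for the learned model f_phi."""
    return mu + 0.1 * mu ** 2


def degrade_grad(mu):
    """Derivative of the toy degradation with respect to mu."""
    return 1.0 + 0.2 * mu


def toy_denoiser(x_t, t):
    """Stand-in for g_theta(x_t, t): always predicts the clean value 2.0."""
    return 2.0


def reverse_step(mu, y, t, lam=0.5, step_size=0.1):
    """One gradient step combining the data-fidelity and prior terms."""
    # Gradient of ||y - f(mu)||^2 with respect to mu (chain rule).
    fidelity_grad = -2.0 * degrade_grad(mu) * (y - degrade(mu))
    # Gradient of the KL regularizer in the form of Eq. 9.
    prior_grad = lam * (mu - toy_denoiser(mu, t))
    return mu - step_size * (fidelity_grad + prior_grad)


y = degrade(2.0)   # measurement generated from the clean value 2.0
mu = 0.0           # initial estimate
for t in range(200, 0, -1):   # 200 reverse steps, one gradient update each
    mu = reverse_step(mu, y, t)
print(f"final estimate: {mu:.6f}")
```

In this toy setting both terms agree on the clean value, so the iterate settles there. In practice the fidelity term is what keeps the estimate tied to the measurement when the denoiser prediction drifts toward a plausible but wrong structure.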

Implementation details

The proposed framework was implemented using PyTorch. Both the degradation model and the diffusion model were trained on an NVIDIA A100 Tensor Core GPU with 80GB of GPU memory. For the degradation model, we set the mini-batch size to 32. For both models, Adam was used as the optimizer with an initial learning rate of \(10^{-4}\) and a decay of 0.998 applied every epoch. For the diffusion model, the denoiser was trained from scratch for 100 epochs, and the text encoder was finetuned by LoRA (rank = 8) for 100 epochs. We empirically set the weighting parameters to \({\lambda }_{1}={\lambda }_{2}=\) 0.5 and \({\lambda }_{3}=\) 0.25 in our experiments based on grid search. The reverse diffusion process runs for 200 iterations in total, and within each step one SGD-based optimization is performed.
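As a rough illustration of what such a schedule implies, the sketch below assumes an initial learning rate of \(10^{-4}\) and reads the factor 0.998 as a multiplicative learning-rate decay applied once per epoch. Both points are interpretations of the settings above rather than confirmed details, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-4, decay=0.998):
    """Learning rate after `epoch` epochs of per-epoch exponential decay (assumed schedule)."""
    return initial_lr * decay ** epoch


final_lr = learning_rate(100)
print(f"learning rate after 100 epochs: {final_lr:.3e}")
```

Under these assumptions the learning rate after 100 epochs is still roughly 82% of its initial value, so the decay is gentle over the stated training length.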

Reader studies

Reader studies were conducted by two radiologists with 5 and 3 years of specialty experience, respectively, to evaluate the clinical utility of our model. The first study used a four-point Likert scale to assess both image quality and diagnostic confidence. Image quality was graded from 1 (poor: inadequate signal-to-noise ratio, low spatial resolution, severe artifacts) to 4 (excellent: optimal signal-to-noise ratio, high resolution, minimal artifacts). Diagnostic confidence was similarly rated from 1 (inadequate pathological evaluation) to 4 (definitive diagnostic certainty with excellent lesion detection), with intermediate scores indicating increasing degrees of diagnostic uncertainty and morphological clarity. The second reader study systematically evaluated diagnostic performance across multiple anatomical regions. Cerebral small vessel disease was analyzed using the Fazekas scale (Grades 1–3) for white matter hyperintensities. Intervertebral disc degeneration was quantified by the disc signal intensity index, calculated as the normalized T2 signal ratio between the nucleus pulposus and cerebrospinal fluid. Shoulder MRI evaluations covered tendon integrity (supraspinatus, infraspinatus, subscapularis, teres minor, biceps long head), glenoid labral pathology, cartilage abnormalities, and bone marrow lesions. Knee MRI assessments included meniscal tear evaluation (medial/lateral), ligament integrity (ACL, PCL, MCL, LCL), cartilage pathology, and bone marrow abnormalities. All findings were classified using a four-category diagnostic certainty scale: 1 = definitely absent, 2 = probably absent, 3 = probably present, 4 = definitely present. To minimize bias, evaluations were distributed across three randomized sessions with ≥2-week intervals. Comprehensive blinding protocols included anonymization of patient identifiers and masking of acquisition methods (noisy images, Ours, and GT). Images of the same patient were not allowed to appear in the same evaluation session. Discordant interpretations were adjudicated through consensus review by both readers. Inter-rater reliability for qualitative ratings was assessed using a two-way mixed absolute-agreement intraclass correlation coefficient (Poor: <0.5; Moderate: 0.5 to <0.75; Good: 0.75 to <0.9; Excellent: ≥0.9). To validate the advancement of our model, we further conducted a comparative analysis between our model and one previously published model (BME-X) regarding both image quality and diagnostic performance.
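The reliability bands quoted above can be encoded directly. The sketch below only maps an already computed intraclass correlation coefficient to its category; estimating the coefficient itself is not shown, and the function name `interpret_icc` is illustrative.

```python
def interpret_icc(icc: float) -> str:
    """Map an intraclass correlation coefficient to the reliability bands used here."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"


# Values exactly on a boundary fall into the higher band, matching the
# half-open intervals in the text (e.g. "0.75 to <0.9" is "good").
labels = [interpret_icc(v) for v in (0.42, 0.5, 0.75, 0.9)]
```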