Abstract
Recent advancements in deep learning have enabled the development of generalizable models that achieve state-of-the-art performance across various imaging tasks. Vision Transformer (ViT)-based architectures, in particular, have demonstrated strong feature extraction capabilities when pre-trained on large-scale datasets. In this work, we introduce the Magnetic Resonance Image Processing Transformer (MR-IPT), a ViT-based image-domain framework designed to enhance the generalizability and robustness of accelerated MRI restoration. Unlike conventional deep learning models that require separate training for different acceleration factors, MR-IPT is pre-trained on a large-scale dataset encompassing multiple undersampling patterns and acceleration settings, enabling a unified framework. By leveraging a shared transformer backbone, MR-IPT effectively learns universal feature representations, allowing it to generalize across diverse restoration tasks. Extensive experiments demonstrate that MR-IPT outperforms both CNN-based and existing transformer-based methods, achieving superior quality across varying acceleration factors and sampling masks. Moreover, MR-IPT exhibits strong robustness, maintaining high performance even under unseen acquisition setups, highlighting its potential as a scalable and efficient solution for accelerated MRI. Our findings suggest that transformer-based general models can significantly advance MRI restoration, offering improved adaptability and stability compared to traditional deep learning approaches.
Introduction
Magnetic Resonance Imaging (MRI) is a widely used diagnostic and research tool in clinical settings, offering high-resolution imaging and diverse contrast mechanisms to visualize various structural and functional characteristics of the underlying anatomy. However, one of the significant limitations of MRI is its relatively long acquisition time, which can reduce patient throughput, increase costs, and lead to delays in diagnosis. Long scan times can also contribute to patient discomfort and motion-related artifacts, which may degrade image quality. To address this challenge, various acceleration techniques have been developed, with compressed sensing (CS)1,2,3 being one of the most widely adopted methods. CS accelerates MRI acquisition by undersampling k-space data, relying on sparsity constraints and iterative algorithms to recover high-quality images from limited measurements. While effective, CS-based approaches often introduce reconstruction errors, including aliasing artifacts and loss of fine structural details, particularly in high-acceleration settings.
Recent advances in deep learning have provided powerful alternatives for accelerated MRI reconstruction4,5,6,7,8,9,10. Deep learning-based methods leverage large datasets to learn complex mappings between undersampled and fully sampled k-space or image domain representations. Many state-of-the-art deep learning models for accelerated MRI are based on convolutional neural networks (CNNs). These CNN-based architectures have demonstrated remarkable improvements over traditional CS-based methods, yielding higher-quality reconstructions with fewer artifacts and faster inference times. However, CNNs have inherent limitations due to their local receptive fields and translation-invariant convolutional operations11,12. These characteristics can restrict their ability to capture long-range dependencies and global contextual information, leading to suboptimal performance, especially in cases where high-frequency details are critical13.
The emergence of transformer architectures has introduced new possibilities for image processing and computer vision tasks. Originally developed for natural language processing (NLP), transformers14 and their variants have been widely applied to tasks such as text classification, machine translation, and question-answering15,16,17,18,19,20. A key advantage of transformer-based models is their ability to capture long-range dependencies and contextual information via self-attention mechanisms21,22. Inspired by this, Vision Transformer (ViT)23 adapted transformers for image-related tasks by treating input images as sequences of non-overlapping patches, similar to words in NLP. Unlike CNNs, which gradually expand the receptive field hierarchically, even a shallow ViT model can effectively model global contextual relationships, making it highly competitive across various vision applications24,25,26.
Further developments in transformer-based models have led to innovations such as masked autoencoders (MAE)27, which utilize a masked token prediction strategy and encoder-decoder structure during pretraining to enhance representation learning. These techniques have demonstrated strong potential in image reconstruction tasks, where learning robust feature representations is crucial. Additionally, models like the Segment Anything Model (SAM)28,29 have demonstrated the adaptability of ViT-based structures as backbones for multimodal learning, further expanding their applicability to medical imaging.
Despite these advancements, applying transformer-based models to accelerated MRI remains an area of active research. Several studies have explored ViT-based architectures for accelerated MRI, showing that transformer models can benefit from pretraining on large datasets and outperform CNN-based approaches in certain settings30,31. However, most of these efforts focus on task-specific models rather than developing a generalized framework32. Existing ViT-based models are often designed for specific undersampling patterns and acceleration factors, limiting their adaptability across different acquisition setups. A more generalizable approach is needed to fully leverage the capabilities of transformers for accelerated MRI.
Image Processing Transformer (IPT)33 has emerged as a promising framework for achieving generalizability in low-level imaging tasks. IPT introduces a multi-task learning paradigm by incorporating multiple input-output configurations within a single framework. By integrating multiple heads and tails for different image processing tasks, the shared transformer body learns to extract universal feature representations, improving model adaptability across diverse imaging scenarios. This design has demonstrated success in generalizing across multiple low-level image processing tasks such as denoising, deraining, and super-resolution.
Motivated by these developments, we introduce the Magnetic Resonance Image Processing Transformer (MR-IPT), a novel image-domain framework designed to enhance the generalizability of ViT-based models for accelerated MRI restoration. MR-IPT extends the IPT paradigm by interpreting different undersampling restoration setups as distinct tasks, allowing the core ViT backbone to focus on learning robust feature representations. We pre-train MR-IPT on a large-scale medical imaging dataset to maximize its feature extraction capabilities. Subsequently, we evaluate its performance on multiple downstream MRI restoration tasks, incorporating various acceleration factors and sampling masks to assess its adaptability. Our experimental results demonstrate that MR-IPT outperforms both CNN- and ViT-based models across a range of MRI restoration scenarios. Notably, MR-IPT exhibits strong generalization capabilities, effectively handling unseen sampling patterns and acceleration rates. Additionally, we conduct model stability assessments, showing that MR-IPT maintains high quality even when trained with limited downstream data. These findings highlight the potential of MR-IPT as a robust and scalable solution for accelerated MRI restoration.
Results
MR-IPT framework
The MR-IPT framework consists of five core components: heads, tails, a prompt encoder, a shared encoder, and a shared decoder (Fig. 1a). The shared encoder utilizes shifted-window multi-head self-attention (W-MSA)34 to efficiently capture global context across multiple layers. Inspired by MAE and SAM, we implemented a lightweight decoder incorporating prompt self-attention and two-way cross-attention, facilitating effective feature refinement and restoration. This lightweight design allows for a deeper encoder architecture without significantly increasing model size and computational costs, thereby enhancing the model’s representational capacity. The heads extract features from undersampled images, transforming them into patch tokens. The prompt encoder generates prompt tokens based on acceleration labels, which, together with image patch tokens, are processed by the shared encoder-decoder to recover missing information. The tails then reconstruct fully sampled images from the learned restoration tokens (Fig. 1b). To ensure broad generalizability in accelerated MRI restoration, we train MR-IPT on images corrupted using a diverse range of sampling masks and acceleration ratios, as illustrated in Fig. 1c.
Unlike the original IPT, which assigns a dedicated head-tail pair to each specific restoration task, we implement three MR-IPT variants across diverse undersampling patterns: (1) MR-IPT-type: Heads and tails are aggregated based on acceleration ratios, where each head-tail pair specializes in reconstructing images from different sampling masks; (2) MR-IPT-level: Heads and tails are aggregated based on sampling masks, allowing each head-tail pair to focus on restoration across different acceleration ratios; (3) MR-IPT-split: Each unique combination of sampling mask and acceleration ratio is assigned a dedicated head-tail pair.
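To make this routing concrete, the snippet below sketches how a (sampling mask, acceleration ratio) pair could be mapped to a head-tail index under each variant. The mask and ratio lists, the function name, and the index ordering are illustrative assumptions, not the released implementation.

```python
# Hypothetical routing sketch for the three MR-IPT variants; groupings follow
# the description above, but names and ordering are illustrative assumptions.
MASKS = ["cartesian_random", "cartesian_equispaced", "gaussian_1d", "gaussian_2d"]
RATIOS = [2, 4, 6, 8, 10]

def head_tail_index(variant: str, mask: str, ratio: int) -> int:
    """Return the index of the head-tail pair assigned to this degradation."""
    if variant == "type":    # one pair per acceleration ratio, shared across masks
        return RATIOS.index(ratio)
    if variant == "level":   # one pair per sampling mask, shared across ratios
        return MASKS.index(mask)
    if variant == "split":   # one pair per (mask, ratio) combination
        return MASKS.index(mask) * len(RATIOS) + RATIOS.index(ratio)
    raise ValueError(f"unknown variant: {variant}")

# MR-IPT-level routes all acceleration ratios of a given mask through one pair:
assert head_tail_index("level", "gaussian_1d", 4) == head_tail_index("level", "gaussian_1d", 8)
```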
To fully leverage the generalization potential of MR-IPT, we trained our model on RadImageNet35, a large-scale medical imaging dataset. Training images were corrupted at five levels, with sampling ratios ranging from two to ten, incorporating both 1D and 2D sampling masks to enhance robustness. For evaluation, we conducted downstream MRI restoration experiments on the fastMRI dataset36, assessing MR-IPT’s performance across typical restoration scenarios, unseen sampling ratios and masks, zero-shot generalization capabilities, and model stability under limited data conditions.
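As a concrete illustration of this corruption step, the sketch below generates a 1D Cartesian random mask and applies it in k-space to produce a zero-filled undersampled image. The center-fraction value and the mask construction details are assumptions for illustration, not the exact settings used during pre-training.

```python
# Minimal retrospective undersampling sketch, assuming magnitude images and a
# 1D Cartesian random mask; exact mask parameters in the paper may differ.
import numpy as np

def cartesian_random_mask(shape, acceleration, center_fraction=0.08, seed=None):
    """1D mask keeping a fully sampled center plus random phase-encode lines."""
    rng = np.random.default_rng(seed)
    h, w = shape
    mask = np.zeros(w, dtype=bool)
    num_center = int(round(w * center_fraction))
    pad = (w - num_center) // 2
    mask[pad:pad + num_center] = True                      # fully sampled center
    prob = (w / acceleration - num_center) / (w - num_center)
    mask |= rng.random(w) < max(prob, 0.0)                 # random outer lines
    return np.broadcast_to(mask, (h, w))

def undersample(image, mask):
    """Apply the mask in k-space and return the zero-filled magnitude image."""
    kspace = np.fft.fftshift(np.fft.fft2(image))
    return np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * mask)))

img = np.random.rand(224, 224)                             # stand-in fully sampled image
corrupted = undersample(img, cartesian_random_mask(img.shape, acceleration=4, seed=0))
```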
The overall architecture of the Magnetic Resonance Image Processing Transformer (MR-IPT) framework. (a) MR-IPT consists of heads, tails, a prompt encoder, a shared encoder, and a shared decoder. (b) A lightweight decoder incorporating prompt self-attention and two-way cross-attention enhances feature refinement and restoration. This design enables a deeper encoder architecture without significantly increasing model size or computational cost, improving the model’s representational capacity. The heads extract features from undersampled images and transform them into patch tokens. The prompt encoder generates prompt tokens based on acceleration labels, which, together with image patch tokens, are processed by the shared encoder-decoder to recover missing information. The tails reconstruct fully sampled images from the learned restoration tokens. (c) During pre-training, MR-IPT is trained on diverse acceleration ratios (2×, 4×, 6×, 8×, and 10×) and various sampling masks. The 1D sampling masks include Cartesian random, Cartesian equispaced, and 1D Gaussian, while the 2D sampling mask follows a 2D Gaussian distribution.
Accelerated MRI restoration performance
In this section, we evaluate MR-IPT on accelerated MRI restoration tasks using the fastMRI knee and brain datasets. Following the standard fastMRI benchmark, knee dataset restorations are performed with 4× and 8× Cartesian random undersampling, while brain dataset restorations are tested at 4× and 8× Cartesian equispaced undersampling.
We compare MR-IPT against multiple representative models, including UNet3237, UNet128, ViT-Base, ViT-L23,30, E2E-VarNet9, and ReconFormer31. Additionally, to assess the impact of large-scale pretraining, we introduce UNet128-FT and ViT-L-FT, both of which are pretrained on RadImageNet (the same as MR-IPT) before being fine-tuned on the corresponding fastMRI datasets. Note that models like E2E-VarNet are optimized for parallel imaging; applying them to single-coil data could limit the network’s capacity to exploit multi-coil sensitivity maps. The quantitative results are presented in Table 1, where MR-IPT consistently outperforms baseline models across different acceleration ratios. Among its three variants, MR-IPT-level exhibits slightly superior performance compared to MR-IPT-type and MR-IPT-split, suggesting that aggregating heads and tails by sampling mask provides an effective balance between task-specific optimization and generalizability.
For instance, in the 4× brain restoration task, MR-IPT-level achieves a PSNR/SSIM of 42.48/0.9831, outperforming UNet128 (36.25/0.9648), ViT-L (37.54/0.9558) and ReconFormer (41.63/0.9737). When compared to pretrained and fine-tuned models (UNet128-FT: 37.27/0.9653, ViT-L-FT: 37.54/0.9558), MR-IPT still demonstrates a clear advantage. This highlights the effectiveness of MR-IPT’s multi-head-tail structure and its unified shared encoder-decoder, which maximizes the benefits of large-scale pretraining by improving feature adaptability across different undersampling conditions.
Figure 2 illustrates qualitative comparisons of reconstructed images, including error maps that represent absolute differences between restorations and ground truth images (intensified by a factor of three for better visualization). Figure 3 provides a comparative analysis of the three MR-IPT variants (MR-IPT-type, MR-IPT-level, and MR-IPT-split) under 4× and 8× Cartesian random and Cartesian equispaced undersampling. Overall, MR-IPT produces cleaner error maps across all tested sampling ratios and masks. However, a closer look reveals that some methods, such as ReconFormer, can produce sharper details, especially at higher acceleration factors, owing to their specialized modules for enforcing data consistency (e.g., the 8× Cartesian random knee image in Fig. 2). Conversely, the MR-IPT restorations can sometimes appear smoother. When evaluating images with pathological features, such as the 8× Cartesian equispaced brain image in Fig. 2, all tested networks struggle to accurately recover finer structures, like the tumor, and show suboptimal performance compared to the fully sampled image. This highlights a critical area for future research focusing on pathology-specific evaluation.
Restoration comparison across different models. Each column presents reconstructed images from various methods, highlighting differences in image quality and artifact suppression. The second column of each subplot shows the corresponding error maps (intensified by a factor of three for better visualization), which visualize absolute differences between the reconstructed images and the fully sampled ground truth. The red boxes highlight areas of differences among models.
Comparison of MR-IPT variants across different undersampling patterns. We implement three MR-IPT variants across diverse undersampling patterns: (1) MR-IPT-type, where heads and tails are grouped based on acceleration ratios, with each head-tail pair specializing in different sampling masks; (2) MR-IPT-level, where heads and tails are grouped based on sampling masks, allowing each pair to generalize across different acceleration ratios; and (3) MR-IPT-split, where each unique combination of sampling mask and acceleration ratio is assigned a dedicated head-tail pair. The results demonstrate that all three variants achieve high-quality restorations, highlighting MR-IPT’s flexibility in handling diverse undersampling patterns.
Performance on new sampling ratios
Given that downstream tasks may involve different undersampling ratios, it is crucial to assess MR-IPT’s generalization to previously unseen acceleration factors. To strike a balance between the number of head-tail pairs and overall generalizability, we pre-train MR-IPT using five acceleration ratios (2×, 4×, 6×, 8×, and 10×), covering a broad range of undersampling scenarios. To further evaluate its adaptability, we conduct downstream restoration on the brain dataset with unseen acceleration ratios of 5× and 7× during inference.
Table 2 presents the quantitative results. UNet128 and ViT-L are trained directly on the indicated acceleration ratios, whereas UNet128-FT and ViT-L-FT follow the same pretraining-finetuning pipeline as MR-IPT for a fair comparison. Notably, both UNet128 and ViT-L benefit from pretraining, with ViT-L-FT showing greater improvements, achieving PSNR/SSIM of 36.29/0.9490 (5×) and 34.60/0.9378 (7×), compared to UNet128-FT at 35.52/0.9554 (5×) and 32.96/0.9346 (7×). MR-IPT-level consistently outperforms both models, achieving PSNR/SSIM of 39.92/0.9763 (5×) and 36.69/0.9626 (7×), highlighting its superior restoration capability.
Figure 4 provides qualitative comparisons of reconstructed images. Interestingly, while ViT-L-FT and UNet128-FT exhibit differences in error map characteristics, UNet128-FT, despite having a higher maximum absolute error, produces cleaner backgrounds, particularly at higher acceleration ratios where overall image intensity is lower. MR-IPT consistently yields the cleanest error maps across all tested setups, including unseen sampling ratios, demonstrating its strong adaptability and generalization to novel acceleration factors.
Restoration comparison across multiple sampling ratios, including unseen configurations. This figure showcases the performance of different models in reconstructing MRI images at various acceleration ratios. Results highlighted in the yellow block represent new sampling ratios (e.g., 5× and 7×) that were not encountered during pre-training, demonstrating each model’s generalization ability. The results highlight MR-IPT’s strong adaptability to previously unseen acceleration factors, effectively preserving fine anatomical structures while minimizing artifacts, compared to other baseline methods.
Performance on new sampling masks
In this section, we evaluate MR-IPT’s adaptability to novel sampling masks. As depicted in Fig. 1c, our pretraining strategy includes a diverse set of 1D (Cartesian random, Cartesian equispaced, and 1D Gaussian) and 2D (2D Gaussian) sampling masks. To assess the impact of excluding 2D masks during pretraining, we introduced three MR-IPT-1D variants, trained exclusively on 1D masks. For downstream evaluations, we test all models on accelerated MRI restorations using 4× and 8× 2D Gaussian sampling masks.
Table 3 presents quantitative results comparing standard MR-IPT, MR-IPT-1D, and baseline models UNet128-FT and ViT-L-FT. Notably, both UNet128-FT and ViT-L-FT are pretrained with 2D Gaussian masks, aligning with the standard MR-IPT setup. Surprisingly, despite being trained solely on 1D masks, MR-IPT-1D demonstrates strong generalization and competitive performance on 2D-masked restorations. For instance, MR-IPT-level-1D achieves a PSNR/SSIM of 37.99/0.9685 (8×), closely matching MR-IPT-level at 38.94/0.9727 and significantly surpassing UNet128-FT (32.30/0.9310) and ViT-L-FT (33.21/0.9373). Interestingly, when evaluating 1D-masked restorations, MR-IPT-1D slightly outperforms the standard MR-IPT due to its specialized pretraining. For example, in 8× Cartesian equispaced restoration, MR-IPT-level-1D achieves a PSNR/SSIM of 35.60/0.9562, compared to MR-IPT-level at 35.53/0.9557. Figure 5 provides qualitative comparisons of reconstructed images, demonstrating that both MR-IPT and MR-IPT-1D generate cleaner error maps for 2D-masked inputs than ViT-L-FT, further validating the robustness and adaptability of the MR-IPT framework.
Restoration comparison using 2D Gaussian sampling masks. Results highlighted in the green block correspond to MR-IPT-1D models, which were pre-trained exclusively on 1D sampling masks. Despite the absence of 2D masks during pre-training, MR-IPT-1D demonstrates strong generalization capabilities, achieving high-fidelity restorations with minimal artifacts, further validating its robustness for new sampling masks.
Zero-shot performance
In this part, we evaluate the zero-shot performance of MR-IPT in comparison to other models. The quantitative results are summarized in Table 4. All models are pretrained on the same dataset before being tested directly on the specified acceleration setups without additional fine-tuning. Since UNet and ViT do not include a multi-head-tail structure for different sampling masks and ratios, during pretraining, sampling masks and ratios were randomly selected for each image in a batch. Across all tested configurations, MR-IPT outperforms both UNet128 and ViT-L by a significant margin, demonstrating superior generalization ability. For example, in the 8× Cartesian random sampling test, MR-IPT-level achieves a PSNR/SSIM of 30.65/0.7814, surpassing UNet128 at 27.48/0.7184 and ViT-L at 29.00/0.7296. Similarly, in the 4× Cartesian equispaced test, MR-IPT-level attains a PSNR/SSIM of 39.90/0.9756, outperforming UNet128 at 32.89/0.9315 and ViT-L at 35.73/0.9489. These results highlight the effectiveness of MR-IPT’s multi-head-tail design and shared encoder-decoder architecture in adapting to novel sampling patterns without task-specific fine-tuning.
Model stability regarding downstream dataset size
In many real-world scenarios, acquiring large-scale medical imaging datasets for comprehensive training of CNN-based models is challenging. When only a limited number of images are available, fine-tuning MR-IPT on a small dataset becomes a more practical approach. To evaluate MR-IPT’s performance under varying dataset sizes, we fine-tuned our model on the fastMRI brain dataset using an 8× Cartesian equispaced mask, with dataset sizes ranging from 10 to 2500. For each size, we randomly sampled subsets from the dataset and repeated the process ten times to assess model performance and stability.
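This evaluation protocol can be summarized as a simple resampling loop, sketched below. The intermediate subset sizes and the `finetune_and_evaluate` callable are placeholders standing in for the actual fine-tuning and evaluation pipeline, not a description of the released code.

```python
# Illustrative sketch of the limited-data stability protocol: for each subset
# size, draw ten random subsets of the downstream set, fine-tune, and record
# PSNR/SSIM. `finetune_and_evaluate` is a placeholder for the real pipeline.
import random

SUBSET_SIZES = [10, 50, 100, 500, 1000, 2500]   # endpoints from the text; rest illustrative
N_REPEATS = 10

def stability_sweep(all_indices, finetune_and_evaluate):
    results = {}
    for size in SUBSET_SIZES:
        scores = []
        for repeat in range(N_REPEATS):
            rng = random.Random(repeat)                    # reproducible resampling
            subset = rng.sample(all_indices, size)
            scores.append(finetune_and_evaluate(subset))   # -> (psnr, ssim)
        results[size] = scores                             # spread reflects stability
    return results
```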
As illustrated in Fig. 6, both the average performance and stability of MR-IPT improve as dataset size increases. For instance, with just 10 training samples, MR-IPT achieves a PSNR/SSIM of 32.78/0.9316. As the dataset size increases to 2500, the performance improves to 34.05/0.9441, approaching the performance of MR-IPT fine-tuned on the full dataset (35.53/0.9557). These results highlight MR-IPT’s ability to leverage large-scale pretraining, maintaining stable and competitive restoration quality even in data-constrained settings.
MR-IPT performance across different dataset sizes. To assess the impact of dataset size on restoration quality, MR-IPT was fine-tuned on the fastMRI brain dataset using an 8× Cartesian equispaced mask, with training subsets ranging from 10 to 2500 images. Each subset was randomly sampled, and the process was repeated ten times to evaluate both model performance and stability. The red line and green line represent zero-shot and fully fine-tuned performance. The results indicate that as the dataset size increases, MR-IPT exhibits significant improvements in both accuracy and consistency. Even with a limited number of training samples, the model maintains competitive performance, demonstrating its ability to leverage large-scale pretraining effectively and achieve stable, high-quality restorations in data-constrained scenarios.
Performance compared to IPT
We conducted a comparison between our MR-IPT model and the original IPT framework. Quantitative results are summarized in Table 5, with visual comparisons shown in Fig. 7. In IPT, the head-tail setup functions similarly to MR-IPT’s split mode, where each unique combination of sampling mask and acceleration ratio corresponds to a dedicated head-tail pair. We evaluate both zero-shot and fine-tuned performance for 4× and 8× undersampling with Cartesian random and Cartesian equispaced masks. Pre-trained on the same dataset, which includes both 1D and 2D sampling masks, MR-IPT generally outperforms IPT across various configurations. For example, in the 4× Cartesian random restoration on the knee dataset, MR-IPT achieves a PSNR/SSIM of 34.22/0.8635 (zero-shot) and 34.52/0.8681 (fine-tuned), whereas IPT performs at 31.61/0.8038 (zero-shot) and 32.70/0.8094 (fine-tuned). Similarly, for the 4× Cartesian equispaced restoration on the brain dataset, MR-IPT reaches a PSNR/SSIM of 39.90/0.9756 (zero-shot) and 42.48/0.9831 (fine-tuned), compared to IPT’s 37.63/0.9580 (zero-shot) and 39.43/0.9640 (fine-tuned).
Comparison of MR-IPT and IPT restorations across different undersampling settings. This figure evaluates the performance of MR-IPT against the original Image Processing Transformer (IPT) under both fine-tuned and zero-shot scenarios. Results in the green block represent fine-tuned comparisons, where both IPT and MR-IPT were trained on 4× and 8× undersampling with Cartesian random and Cartesian equispaced masks. Results in the yellow block illustrate zero-shot comparisons, where models were tested without additional fine-tuning. The results demonstrate that MR-IPT consistently outperforms IPT, achieving higher restoration fidelity, better structural preservation, and reduced artifacts, highlighting its superior adaptability and generalization capabilities.
Ablation studies and other analysis
To evaluate the effectiveness of the prompt encoder, we conducted ablation studies to analyze the performance on the fastMRI knee dataset with 4× and 8× Cartesian random sampling masks. As demonstrated in Table 6, excluding prompt encoding in MR-IPT leads to a performance drop of PSNR/SSIM from 34.52/0.8681 to 33.51/0.8283 in 4× restoration and 31.45/0.7952 to 30.96/0.7670 in 8× restoration.
In Fig. 8, we further analyze model performance on the low (central 1/3 region of k-space) and high (peripheral 2/3 region of k-space) spatial-frequency components of the fastMRI brain dataset with 4× Cartesian equispaced sampling masks. As demonstrated, MR-IPT achieves better performance in both regimes, recovering more accurate overall shapes and sharper details.
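A minimal sketch of this frequency split is shown below, treating the central 1/3 of k-space (per axis) as the low-frequency band and the remaining periphery as the high-frequency band; the exact band boundaries used in the paper are assumed from the description above.

```python
# Low/high spatial-frequency split sketch: central 1/3 of k-space per axis is
# the low band, the remaining periphery the high band (assumed definition).
import numpy as np

def frequency_split(image):
    h, w = image.shape
    kspace = np.fft.fftshift(np.fft.fft2(image))
    low_mask = np.zeros((h, w), dtype=bool)
    low_mask[h // 3: 2 * h // 3, w // 3: 2 * w // 3] = True   # central region
    low = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * low_mask)))
    high = np.abs(np.fft.ifft2(np.fft.ifftshift(kspace * ~low_mask)))
    return low, high
```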
In Table 7, we present a computational cost analysis comparing model throughputs, measured on an RTX 4090 GPU during evaluation. Figure 9 illustrates model performance relative to model size for multiple downstream tasks, with each model’s parameter count represented by the size of the corresponding marker. Overall, this test highlights that MR-IPT consistently outperforms IPT, ViT-L-FT, and UNet128-FT at similar model sizes. This demonstrates the effectiveness of MR-IPT’s lightweight decoder and prompt encoder design, which allows for a larger encoder and thus more effective latent space learning. In summary, our results show that MR-IPT provides robust and stable performance across a wide range of downstream tasks. Notably, it excels in handling new sampling ratios and masks and demonstrates impressive zero-shot generalization as well as stability when trained on limited dataset sizes.
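For reference, the throughput comparison above can be reproduced with a simple timing loop along the lines of the sketch below; the batch size, input shape, warm-up count, and single-tensor model signature are illustrative assumptions rather than the exact benchmark configuration.

```python
# Throughput measurement sketch (images/second) on a single CUDA GPU, with
# warm-up iterations and explicit synchronization to avoid timing launch latency.
import time
import torch

@torch.no_grad()
def throughput(model, batch_size=16, n_iters=50, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(batch_size, 1, 224, 224, device=device)   # assumed input shape
    for _ in range(5):                                         # warm-up passes
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * n_iters / (time.perf_counter() - start)
```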
(a) Frequency analysis comparison on the fastMRI brain dataset. (b) Restoration samples for MR-IPT in low frequency (1/3 central region in k-space) and high frequency (2/3 peripheral region in k-space). All high frequency images and error maps have been intensified three times for better demonstration.
Model performance versus size across different restoration tasks. Each marker represents a model, with its size proportional to the model’s parameter count, illustrating the trade-off between computational complexity and restoration performance. The results show that MR-IPT consistently outperforms IPT, ViT-L-FT, and others, even when operating at comparable model sizes. This superior performance is attributed to MR-IPT’s lightweight decoder and prompt encoder design, which enables the use of a deeper and more expressive encoder for improved latent space learning without significantly increasing computational cost. The findings highlight MR-IPT’s efficiency, scalability, and effectiveness in MRI restoration across diverse undersampling conditions.
Discussion
In this work, we introduce MR-IPT, a transformer-based framework designed for general accelerated MRI restoration. Unlike previous ViT-based models that primarily focus on task-specific restoration, MR-IPT is built to fully leverage the potential of ViT through large-scale pre-training. This approach enables MR-IPT to deliver superior performance across diverse accelerated MRI restoration tasks. Its multi-head-tail design provides flexibility, supporting a wide range of undersampling masks and acceleration ratios during image degradation. The shared ViT backbone facilitates a unified structure for latent space representation and learning under various sampling conditions. Additionally, the lightweight decoder, deeper encoder, and label prompt encoder contribute to enhanced restoration performance compared to conventional ViT models.
MR-IPT consistently demonstrates strong performance compared to traditional MRI restoration networks such as UNet and ViT. As illustrated in Table 1; Fig. 2, MR-IPT outperforms these models significantly. For example, in the 4-fold Cartesian equispaced sampling test on the fastMRI brain dataset, MR-IPT-level achieves a PSNR/SSIM of 42.48/0.9831, compared to UNet128 at 36.25/0.9648 and ViT-L at 37.54/0.9558. Even when employing the same pre-training strategies, MR-IPT maintains its advantage, outperforming both UNet128-FT (37.27/0.9653) and ViT-L-FT (37.63/0.9564).
MR-IPT also exhibits stable and superior performance when evaluated on new sampling masks and acceleration ratios, as shown in Tables 2 and 3; Figs. 4 and 5. In the Cartesian equispaced restoration test with a new 5× sampling ratio, MR-IPT-level achieves a PSNR/SSIM of 39.92/0.9763, outperforming UNet128-FT (35.52/0.9554) and ViT-L-FT (36.29/0.9490). Remarkably, in the 4× 2D Gaussian sampling restoration test, where the model was pre-trained solely with 1D sampling masks, MR-IPT-level-1D achieves a PSNR/SSIM of 42.30/0.9839. This performance is comparable to MR-IPT pre-trained with both 1D and 2D masks (MR-IPT-level at 42.69/0.9849) and superior to UNet128-FT (35.62/0.9571) and ViT-L-FT (37.47/0.9638), both of which included 2D sampling masks during pre-training. These results underscore MR-IPT’s adaptability and generalizability in accelerated MRI restoration tasks.
Furthermore, MR-IPT demonstrates robust performance in zero-shot learning scenarios, as summarized in Table 4. For instance, in the 8× Cartesian equispaced test, MR-IPT-level achieves a PSNR/SSIM of 32.52/0.9288, outperforming UNet128 (27.93/0.8584) and ViT-L (30.21/0.9197). In situations with limited training data, MR-IPT maintains stable and high-quality restoration performance, as shown in Fig. 6. These results highlight MR-IPT’s ability to leverage large-scale pre-training effectively, ensuring competitive restoration quality even in data-constrained settings.
When compared to the original IPT framework, MR-IPT consistently exhibits superior performance in both zero-shot and fine-tuned scenarios, as demonstrated in Table 5; Fig. 7. For example, in the 4× Cartesian equispaced restoration of the brain dataset, MR-IPT achieves a PSNR/SSIM of 39.90/0.9756 (zero-shot) and 42.48/0.9831 (fine-tuned), compared to IPT’s 37.63/0.9580 (zero-shot) and 39.43/0.9640 (fine-tuned). Comparative studies (Fig. 9) further reveal that MR-IPT outperforms IPT, ViT-L-FT, and UNet128-FT even with similar model sizes. This superior performance is attributed to MR-IPT’s lightweight decoder and prompt encoder design, which enables a larger encoder for more effective latent space learning.
Despite its strong performance, this study has several limitations. First, our current framework operates as an image-domain restoration network and, as such, does not directly process raw k-space data. A comprehensive evaluation of its efficacy for multi-coil MRI would require extending the approach to work directly with clinical k-space data. This raises another important topic regarding how data discrepancies should be handled when combining large-scale pre-training with downstream fine-tuning. For example, since a resource like RadImageNet is composed solely of images, it is not immediately clear how a network pre-trained on it can be adapted for downstream multi-coil data. One potential avenue for future work is to explore how a pre-trained image-domain model could be integrated into a dual-domain pipeline. Alternatively, a separate adaptation layer could be introduced during fine-tuning to bridge the gap between single-coil and multi-coil images. As an image-domain method, MR-IPT does not enforce explicit k-space data consistency. While this approach successfully leverages ViT-based multi-task pre-training to suppress common artifacts and enhance image quality, the absence of a data fidelity constraint presents a limitation: it can potentially introduce hallucination artifacts and may compromise quantitative fidelity, factors that are critical for clinical adoption. Consequently, the integration of data consistency mechanisms represents a vital direction for future work to improve the reliability and clinical applicability of the framework. Second, the dataset used for pre-training is limited to accelerated MRI data. Although previous studies have shown that pre-training on general image datasets can further enhance performance on accelerated MRI30, our pre-training dataset remains confined to medical imaging. Future work should investigate the benefits of incorporating more diverse datasets and explore the potential of merging multiple datasets to scale up pre-training, potentially improving model performance. In addition, we mainly use the fastMRI dataset for downstream restoration task evaluations; future studies could include other datasets for a more comprehensive investigation of generalizability. Furthermore, our model’s performance is constrained by the fixed input size required for pre-training, which may not match the native resolution of all downstream clinical datasets. To improve generalizability, a key direction for future research is the development of resolution-adaptive processes that can effectively transfer learned features across varying image dimensions. Third, we only include a 2D Gaussian mask for 2D sampling patterns during pre-training and downstream evaluations. More extensive evaluation across diverse sampling patterns (e.g., radial, spiral, or Poisson-disc) would further establish the model’s robustness and generalizability. While our generalizability test from 1D to 2D masks demonstrates translational potential, we note that prospective 2D undersampling is primarily relevant for 3D imaging acquisitions; further work is therefore needed to validate its clinical applicability. Fourth, our downstream tasks are primarily focused on image restoration. Recent works like MAE27 and SAM28 have demonstrated the versatility of pre-trained encoders for tasks such as classification and object detection.
Future research directions could explore integrating MR-IPT into additional clinical workflows for tasks such as image segmentation and lesion detection, involving radiologists to further assess its diagnostic value. This extension would help evaluate the framework’s broader utility and applicability in medical imaging analysis beyond restoration tasks. Moreover, downstream interference tests (e.g., noise injection, adversarial attacks) at inference would provide further insight into the model’s resilience and applicability in real-world deployments. Additionally, integrating other advanced techniques such as denoising diffusion models38 and flow matching39,40 holds potential for developing more generalized multimodal frameworks in medical imaging. Integrating MR-IPT with these methods could broaden its applicability beyond MRI restoration and image processing. Finally, our MR-IPT currently employs a simple L1 loss function during training, which might limit its perceptual quality. Future studies will explore more sophisticated objective functions, such as SSIM loss and contrastive loss, to further enhance the performance.
In conclusion, we present MR-IPT, a ViT-based framework for general accelerated MRI restoration. MR-IPT demonstrates superior performance across various sampling setups, including new sampling masks and ratios, zero-shot learning, and limited dataset scenarios. By leveraging a multi-head-tail structure and a shared backbone design, MR-IPT exhibits strong adaptability to diverse acceleration configurations. Its efficient latent representation learning and large-scale pre-training capabilities highlight the potential of ViT-based architecture in advancing medical imaging deep learning models. This approach represents a significant step forward in the development of generalizable, high-performance deep learning models for medical imaging, with the potential to advance both clinical applications and future AI-driven diagnostic tools.
Methods
MR-IPT model structure and details
The MR-IPT framework builds upon a modified IPT structure, specifically designed to accommodate a wide range of accelerated MRI restoration tasks. The architecture of MR-IPT is illustrated in Fig. 1, showcasing its five core components: heads, tails, a prompt encoder, a shared encoder, and a lightweight decoder.
Each head consists of a 3 × 3 convolutional layer followed by two 5 × 5 residual blocks, enabling robust feature extraction from accelerated images. The tail, responsible for image restoration, includes a 3 × 3 upsampling convolutional block followed by another 3 × 3 convolutional layer to refine output quality. The prompt encoder, inspired by the design of the Segment Anything Model (SAM)28, utilizes sparse embeddings with an embedding layer to project acceleration information into the same dimensional space as image embeddings, enhancing task-specific feature representation.
For the shared encoder, we adopted a 24-layer ViT equipped with W-MSA34, which effectively captures global contextual information across multiple layers. This design mirrors the heavy-weight encoder/light-weight decoder paradigm seen in MAE27, balancing computational efficiency with strong representational capacity. The 2-layer lightweight decoder incorporates two-way cross-attention mechanisms between image patch tokens and prompt tokens, facilitating effective feature refinement and image restoration.
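The sketch below captures this overall structure (multiple heads and tails, a prompt embedding, a 24-layer encoder, and a 2-layer decoder) in PyTorch. For brevity it uses standard global self-attention in place of W-MSA and simplified convolutional blocks, so the internal details and dimensions are assumptions for illustration, not the exact released architecture.

```python
# Structural sketch of the MR-IPT forward pass; global attention stands in for
# W-MSA and the head/tail blocks are simplified, so this is illustrative only.
import torch
import torch.nn as nn

class MRIPTSketch(nn.Module):
    def __init__(self, n_pairs=4, n_accel=5, dim=768, patch=16, img=224):
        super().__init__()
        # one head/tail per group (per ratio, per mask, or per combination)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Conv2d(1, dim // 12, 3, padding=1),      # 3x3 conv
                          nn.ReLU(),
                          nn.Conv2d(dim // 12, dim // 12, 5, padding=2))
            for _ in range(n_pairs))
        self.patch_embed = nn.Conv2d(dim // 12, dim, kernel_size=patch, stride=patch)
        self.prompt_embed = nn.Embedding(n_accel, dim)                # acceleration-label prompt
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=24)  # deep shared encoder
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)   # lightweight decoder
        self.tails = nn.ModuleList(
            nn.Sequential(nn.ConvTranspose2d(dim, dim // 12, kernel_size=patch, stride=patch),
                          nn.Conv2d(dim // 12, 1, 3, padding=1))
            for _ in range(n_pairs))
        self.grid = img // patch

    def forward(self, x, pair_idx, accel_idx):
        feat = self.heads[pair_idx](x)                                # head features
        tokens = self.patch_embed(feat).flatten(2).transpose(1, 2)    # B x N x dim patch tokens
        prompt = self.prompt_embed(accel_idx).unsqueeze(1)            # B x 1 x dim prompt token
        latent = self.encoder(torch.cat([prompt, tokens], dim=1))
        restored = self.decoder(latent[:, 1:], latent)                # cross-attend to prompt+patches
        b, n, d = restored.shape
        restored = restored.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        return self.tails[pair_idx](restored)                         # tail: upsample to image

# Usage: out = MRIPTSketch()(torch.randn(2, 1, 224, 224), pair_idx=0,
#                            accel_idx=torch.tensor([1, 1]))
```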
During pre-training, MR-IPT is exposed to diverse acceleration ratios (2×, 4×, 6×, 8×, and 10×) and a variety of sampling masks. The 1D sampling masks include Cartesian random, Cartesian equispaced, and 1D Gaussian, while the 2D sampling mask is based on 2D Gaussian distributions. We developed three MR-IPT variants to explore different aggregation strategies: (1) MR-IPT-type: Heads and tails are grouped based on acceleration ratios, with each head-tail pair specialized for different sampling masks; (2) MR-IPT-level: Heads and tails are aggregated based on sampling masks, allowing each pair to handle various acceleration ratios; (3) MR-IPT-split: A dedicated head-tail pair is assigned to each unique combination of sampling mask and acceleration ratio.
For downstream tasks, the appropriate head-tail pair is selected based on the specific sampling configuration. In cases involving unseen acceleration ratios (e.g., 5×), we utilize the head-tail pair trained for the nearest higher ratio (e.g., 6×). For MR-IPT-1D models, pre-training is limited to 1D masks, and during downstream evaluations, head-tails trained with Cartesian random masks are preferred due to their robustness against diverse sampling strategies.
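This selection rule can be expressed as a small helper, sketched below under the pre-trained ratio set listed above; the MR-IPT-1D mask routing is simplified to re-using the Cartesian random pair for masks outside its 1D training set, which is an illustrative reading of the preference described in the text.

```python
# Downstream head-tail selection sketch: unseen ratios fall back to the nearest
# higher pre-trained ratio; 1D-only models prefer the Cartesian random pair.
PRETRAINED_RATIOS = [2, 4, 6, 8, 10]
TRAINED_1D_MASKS = {"cartesian_random", "cartesian_equispaced", "gaussian_1d"}

def select_ratio(target: float) -> int:
    """Map an unseen ratio (e.g. 5x) to the nearest higher pre-trained one (6x)."""
    higher = [r for r in PRETRAINED_RATIOS if r >= target]
    return min(higher) if higher else max(PRETRAINED_RATIOS)

def select_mask(mask: str, trained_on_1d_only: bool) -> str:
    """For MR-IPT-1D, unseen masks are routed through the Cartesian random pair."""
    if trained_on_1d_only and mask not in TRAINED_1D_MASKS:
        return "cartesian_random"
    return mask

assert select_ratio(5) == 6 and select_ratio(7) == 8
```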
MR-IPT was implemented using PyTorch41 and trained on systems equipped with either an NVIDIA RTX 3090 Ti (24GB VRAM) or an NVIDIA RTX 4090 (24GB VRAM). Model optimization uses the Adam optimizer42 with a learning rate of 1e-5. The training protocol consists of pre-training for 5 epochs on the large-scale dataset, followed by fine-tuning for 15 epochs on each downstream restoration task.
The training objective for MR-IPT is the L1 loss, defined as

$$\mathcal{L}_{1}=\sum_{i}{\left\|\text{MR-IPT}\left({x}_{accelerated}^{i}\right)-{x}_{clean}\right\|}_{1},$$

where \(x_{accelerated}^{i}\) denotes the accelerated image for sampling task \(i\), and \(x_{clean}\) denotes the fully sampled clean ground truth image.
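A minimal training-step sketch consistent with this setup (Adam, learning rate 1e-5, L1 objective) is given below. The dataloader fields and the assumption that each loader yields batches from a single degradation setting (so the head-tail index is constant) are placeholders, not the released training code.

```python
# One training stage: Adam with lr 1e-5 and an L1 objective, assuming each
# dataloader corresponds to a single (mask, ratio) degradation setting.
import torch
import torch.nn.functional as F

def train_one_stage(model, loader, epochs, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.to(device).train()
    for _ in range(epochs):
        for x_acc, x_clean, pair_idx, accel_idx in loader:
            pred = model(x_acc.to(device), int(pair_idx[0]), accel_idx.to(device))
            loss = F.l1_loss(pred, x_clean.to(device))     # L1 objective from above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Pre-train for 5 epochs on the large-scale dataset, then fine-tune for 15
# epochs on each downstream task:
# train_one_stage(model, pretrain_loader, epochs=5)
# train_one_stage(model, finetune_loader, epochs=15)
```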
Datasets and image processing
For large-scale pre-training, we utilized the RadImageNet dataset35, which supports the findings of this study and is publicly available at https://www.radimagenet.com/. Specifically, we employed the MRI subset, comprising 672,675 images, which were split into a 9:1 ratio, resulting in 605,408 images for training and 67,267 for validation. For downstream task evaluations, we used the fastMRI dataset36, which is openly accessible at https://fastmri.med.nyu.edu/. For the fastMRI knee dataset, we used the training set of 34,742 images for training and the validation set of 7135 images for testing. For the fastMRI brain dataset, we used the training set of 70,748 images for training and the test set of 8852 slices for testing. To standardize the data, all images were resized to 224 × 224 pixels with pixel intensity values normalized within the range of [0, 1] based on the down-sampled image. Given the variability in image dimensions across the fastMRI datasets, we first applied center cropping to reduce the images to 320 × 320 pixels, ensuring the preservation of central anatomical features, and subsequently resized (downsampling interpolation) them to the target dimensions for model training and evaluation. To ensure a fair comparison, for methods that require k-space data as input, such as E2E-VarNet and ReconFormer, the k-space data was first back-calculated from the resized fully sampled images. This k-space data was then undersampled with the corresponding masks and served as the input.
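The preprocessing described above can be sketched as follows; the normalization formula (min-max scaling by the down-sampled image) and the use of `skimage` for interpolation are assumptions consistent with, but not necessarily identical to, the actual pipeline.

```python
# Preprocessing sketch: center-crop to 320x320, resize to 224x224, min-max
# normalize to [0, 1] using the down-sampled image, and back-calculate k-space
# for methods (e.g. E2E-VarNet, ReconFormer) that consume k-space input.
import numpy as np
from skimage.transform import resize

def crop_and_resize(slice_image, crop=320, target=224):
    h, w = slice_image.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = slice_image[top:top + crop, left:left + crop]   # keep central anatomy
    return resize(cropped, (target, target), anti_aliasing=True)

def normalize_pair(clean, accelerated, eps=1e-12):
    lo, hi = accelerated.min(), accelerated.max()             # range of the down-sampled image
    return (clean - lo) / (hi - lo + eps), (accelerated - lo) / (hi - lo + eps)

def back_calculated_kspace(image):
    """k-space recomputed from the resized fully sampled image."""
    return np.fft.fftshift(np.fft.fft2(image))
```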
Evaluation metrics
To comprehensively assess restoration performance, we adopted two widely used quantitative metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). We report slice-wise PSNR/SSIM across our comparisons.
The PSNR is defined as

$$\text{PSNR}\left(x,{x}_{clean}\right)=10\,{\log}_{10}\left(\frac{{L}^{2}}{\text{MSE}\left(x,{x}_{clean}\right)}\right),$$

where \(\text{MSE}(x,{x}_{clean})\) is the mean squared error between the reconstructed image \(x\) and the fully sampled ground truth clean image \(x_{clean}\), and \(L\) is the dynamic range of the pixel values.
The SSIM is defined as

$$\text{SSIM}\left(x,{x}_{clean}\right)=\frac{\left(2{\mu}_{x}{\mu}_{{x}_{clean}}+{C}_{1}\right)\left(2{\sigma}_{x{x}_{clean}}+{C}_{2}\right)}{\left({\mu}_{x}^{2}+{\mu}_{{x}_{clean}}^{2}+{C}_{1}\right)\left({\sigma}_{x}^{2}+{\sigma}_{{x}_{clean}}^{2}+{C}_{2}\right)},$$

where \(\mu_{x}\), \(\mu_{{x}_{clean}}\), \(\sigma_{x}^{2}\), and \(\sigma_{{x}_{clean}}^{2}\) are the means and variances of the reconstructed image \(x\) and the fully sampled clean image \(x_{clean}\), respectively; \(\sigma_{x{x}_{clean}}\) is the covariance of \(x\) and \(x_{clean}\); and \({C}_{1}={\left(0.01L\right)}^{2}\), \({C}_{2}={\left(0.03L\right)}^{2}\), where \(L\) is the dynamic range of the pixel values.
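In practice these metrics can be computed slice-wise with standard reference implementations, as in the sketch below; `data_range` plays the role of \(L\) above, and the scikit-image defaults (e.g., SSIM window size) are assumed rather than confirmed settings.

```python
# Slice-wise PSNR/SSIM evaluation sketch using scikit-image implementations.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_slice(x, x_clean):
    data_range = float(x_clean.max() - x_clean.min())   # dynamic range L
    psnr = peak_signal_noise_ratio(x_clean, x, data_range=data_range)
    ssim = structural_similarity(x_clean, x, data_range=data_range)
    return psnr, ssim
```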
Data availability
The data that support the findings of this study are available from RadImageNet and fastMRI, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors (G.S.) upon reasonable request and with permission of RadImageNet and fastMRI. Our code is available at: https://github.com/GuoyaoShen/MR-IPT.
References
Lustig, M., Donoho, D. & Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 58 (6), 1182–1195 (2007).
Donoho, D. L. Compressed sensing. IEEE Trans. Inf. Theory. 52 (4), 1289–1306 (2006).
Candès, E. J. & Wakin, M. B. An introduction to compressive sampling. IEEE. Signal. Process. Mag. 25 (2), 21–30 (2008).
Eo, T. et al. KIKI-net: cross‐domain convolutional neural networks for reconstructing undersampled magnetic resonance images. Magn. Reson. Med. 80 (5), 2188–2201 (2018).
Souza, R. & Frayne, R. A hybrid frequency-domain/image-domain deep network for magnetic resonance image reconstruction. In 2019 32nd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI) (pp. 257–264). IEEE. (2019).
Souza, R. et al. Dual-domain cascade of U-nets for multi-channel magnetic resonance image reconstruction. Magn. Reson. Imaging. 71, 140–153 (2020).
Shen, G. et al. Attention hybrid variational net for accelerated MRI reconstruction. APL Mach. Learn., 1(4). (2023).
Shen, G., Li, M., Farris, C. W., Anderson, S. & Zhang, X. Learning to reconstruct accelerated MRI through K-space cold diffusion without noise. Sci. Rep. 14 (1), 21877 (2024).
Sriram, A. et al. End-to-end variational networks for accelerated MRI reconstruction. In Medical image computing and computer assisted intervention–MICCAI 2020: 23rd international conference, Lima, Peru, October 4–8, 2020, proceedings, part II 23 (pp. 64–73). Springer International Publishing. (2020).
Montalt-Tordera, J., Muthurangu, V., Hauptmann, A. & Steeden, J. A. Machine learning in magnetic resonance imaging: image reconstruction. Physica Med. 83, 79–87 (2021).
Zheng, M. et al. Attention-based CNNs for image classification: A survey. In Journal of Physics: Conference Series (Vol. 2171, No. 1, p. 012068). IOP Publishing. (2022).
Li, Z., Liu, F., Yang, W., Peng, S. & Zhou, J. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst. 33 (12), 6999–7019 (2021).
Knoll, F. et al. Deep-learning methods for parallel magnetic resonance imaging reconstruction: A survey of the current approaches, trends, and issues. IEEE. Signal. Process. Mag. 37 (1), 128–140 (2020).
Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems, 30. (2017).
Brown, T. et al. Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020).
Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. (2018).
Ouyang, L. et al. Training Language models to follow instructions with human feedback. Adv. Neural. Inf. Process. Syst. 35, 27730–27744 (2022).
Radford, A. et al. Language models are unsupervised multitask learners. OpenAI Blog. 1 (8), 9 (2019).
Tay, Y., Dehghani, M., Bahri, D. & Metzler, D. Efficient transformers: A survey. ACM Comput. Surveys. 55 (6), 1–28 (2022).
Xu, P., Zhu, X. & Clifton, D. A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 45 (10), 12113–12132 (2023).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171–4186). (2019).
Lin, T., Wang, Y., Liu, X. & Qiu, X. A survey of Transformers. AI open. 3, 111–132 (2022).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Han, K. et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 45 (1), 87–110 (2022).
Ali, A. M. et al. Vision Transformers in image restoration: A survey. Sensors 23 (5), 2385 (2023).
Zhao, H. et al. Comprehensive and delicate: An efficient transformer for image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14122–14132). (2023).
He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 16000–16009). (2022).
Kirillov, A. et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4015–4026). (2023).
Ma, J. et al. Segment anything in medical images. Nat. Commun. 15 (1), 654 (2024).
Lin, K. & Heckel, R. Vision transformers enable fast and robust accelerated MRI. In International Conference on Medical Imaging with Deep Learning (pp. 774–795). PMLR. (2022).
Guo, P. et al. Accelerated mri reconstruction using recurrent transformer. IEEE Trans. Med. Imaging. 43 (1), 582–593 (2023).
Shamshad, F. et al. Transformers in medical imaging: A survey. Med. Image. Anal. 88, 102802 (2023).
Chen, H. et al. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 12299–12310). (2021).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012–10022). (2021).
Mei, X. et al. RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiology: Artif. Intell. 4 (5), e210315 (2022).
Zbontar, J. et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839. (2018).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III 18 (pp. 234–241). Springer international publishing. (2015).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural. Inf. Process. Syst. 33, 6840–6851 (2020).
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022).
Lipman, Y. et al. Flow matching guide and code. arXiv preprint arXiv:2412.06264. (2024).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Acknowledgements
This work was supported by the Rajen Kilachand Fund for Integrated Life Science and Engineering. We would like to thank the Boston University Photonics Center for technical support.
Funding
This work was supported by the Rajen Kilachand Fund for Integrated Life Science and Engineering.
Author information
Authors and Affiliations
Contributions
Guoyao Shen: Conceptualization (lead); Methodology (equal); Software (equal); Formal Analysis (equal); Writing – Original Draft (lead). Mengyu Li: Methodology (equal); Software (equal); Formal Analysis (equal); Writing – Original Draft (supporting). Stephan Anderson: Methodology (supporting); Writing – Review & Editing (supporting). Chad W. Farris: Conceptualization (supporting); Formal Analysis (supporting); Writing – Review & Editing (supporting). Xin Zhang: Conceptualization (lead); Methodology (equal); Software (equal); Formal Analysis (equal); Writing – Review & Editing (lead); Project Administration (lead); Funding Acquisition (lead).
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shen, G., Li, M., Anderson, S. et al. Magnetic resonance image processing transformer for general accelerated image restoration. Sci Rep 15, 40064 (2025). https://doi.org/10.1038/s41598-025-23851-w