Introduction

We seek to provide a critical assessment of the opportunities and limitations of artificial intelligence (AI) in the context of magnetic resonance imaging (MRI) of cancer and to provide the reader with guidance on where to go for further information. Indeed, there have been many excellent publications on various aspects of AI in MRI (see, for example, references1,2,3,4). However, given the explosion of publications on the topic, it can be challenging to separate the hype from the actual successes and to identify the current state of the art of the field. In particular, it is difficult to identify what is currently available, and what the practical limitations are, for the application of AI in the MRI of cancer. In this contribution, we respectfully seek to provide a more balanced presentation of the utility and limitations of AI for MRI within clinical oncology. To address this problem, we begin by describing the key components of AI that are frequently employed in medical imaging, in general, and MRI, in particular. In Section 3, MRI acquisition and reconstruction are presented as a regression task that involves estimating the voxel values of an image from raw scanner measurements. In Section 4, we discuss the classification task of image segmentation, which entails categorizing every voxel as (for example) tumor or healthy tissue, and registration, which entails spatially aligning images to a common space. Sections 5 and 6 concern the classification tasks of making a diagnosis and prognosis from the MRI data, respectively, which require accounting for global characteristics (e.g., the location and size of tumors) to make a prediction.

Key AI concepts for medical imaging

What is artificial intelligence?

Artificial intelligence (AI) refers to the theory and development of computer systems that can perform tasks normally thought to require human intelligence including language, visual perception, and reasoning5. We focus on a branch of AI called machine learning (ML), which refers to training a statistical model on relevant data to perform a task. Deep learning is a sub-branch of ML that concerns the development of neural networks (NN), a special class of models that have demonstrated practical utility in medical imaging6,7,8.

Common ML techniques for medical imaging

A key design consideration when developing ML models is the type(s) of data available for training, which can be split into supervised and unsupervised settings. Supervised datasets contain matched {sample, label} pairs; for example, MR images annotated as containing or not containing a tumor. In this setting, the model learns to predict the label for each sample. However, annotation can be time-consuming and expensive, as domain experts (e.g., radiologists) must label each individual data point9. In contrast, unsupervised datasets contain only samples, without any labels. For example, given a set of MR training images obtained with low spatial resolution, one may train a model to learn to increase the resolution (called super-resolution10,11) without access to any high-resolution examples. After training, the model can be applied to enhance the quality of new images acquired at lower resolution, thus reducing the need for acquiring high-resolution data. While the unsupervised setting eases the burden of annotating data, the model's ability to produce useful outputs remains intimately connected to the ability to collect high-quality data that are relevant to the intended application. For example, if only images of knees are available to train a super-resolution model, the model is likely to perform poorly when used to enhance images of brains12.
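
To make this distinction concrete, the minimal sketch below contrasts the structure of a supervised dataset of {sample, label} pairs with an unsupervised dataset of samples only; the NumPy arrays are hypothetical stand-ins for real MR data.

```python
# Minimal sketch (hypothetical arrays) contrasting supervised and unsupervised
# training data for an imaging task.
import numpy as np

# Supervised: matched {sample, label} pairs, e.g., an image and a tumor flag.
images = np.random.rand(100, 128, 128)        # 100 hypothetical MR slices
labels = np.random.randint(0, 2, size=100)    # 1 = contains tumor, 0 = does not
supervised_dataset = list(zip(images, labels))

# Unsupervised: samples only, e.g., low-resolution images for super-resolution,
# with no high-resolution references available.
low_res_images = np.random.rand(100, 64, 64)
unsupervised_dataset = list(low_res_images)
```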

The reader may find it helpful to refer to Table 1 as they consider the rest of the paper.

Table 1 A taxonomy of machine learning techniques for medical imaging

AI-based image acquisition and reconstruction of MRI data

Image acquisition and reconstruction are inextricably linked in MRI. Image acquisition consists of both signal and spatial encoding through user-controllable radiofrequency and gradient waveforms. Modifying the different components of the acquisition process (pulse time delay, signal frequency, signal strength, signal phase, etc.) changes both the speed of acquisition and the sensitivity of the resulting image to a particular tissue property. The resulting image contrast can be described by:

$$x=f(\theta ,q)$$
(1)

where \(x\in {\mathbb{C}}^{n}\) is the (vectorized) image (containing n voxels) generated based on a spatiotemporal function f of both acquisition-controlled signal encoding parameters, θ, and biophysical tissue parameters, q13. The acquired measurements (called k-space) are related to the image through the linear operator

$$y={A}_{{\rm{\phi }}}x+\eta$$
(2)

where \({A}_{\phi }\in {\mathbb{C}}^{m\times n}\) represents a (possibly multi-coil) sampled Fourier transform, \(\eta \in {\mathbb{C}}^{m}\) is additive complex-valued Gaussian noise, and m is the number of acquired measurements with locations determined by the spatial encoding parameters ϕ.

The measured raw signals (i.e., y) are not immediately ready for visualization as they represent Fourier components of the image. Therefore, optimization algorithms are used to reconstruct the image. The reconstruction algorithm can be represented by a function g(y) which either inverts the linear measurement process in Eq. (2) to estimate the image x, or inverts the non-linear measurement process in Eq. (1) to estimate the tissue parameters, q. These tasks are often challenging as the data are corrupted by additive noise and often subsampled (m << n) to reduce scan time, thereby leading to an ill-posed inverse problem.
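
To make the measurement model concrete, the following minimal sketch simulates Eq. (2) for a single coil using a synthetic image; the subsampling pattern, noise level, and image are illustrative assumptions, and the naive zero-filled inversion at the end hints at why the subsampled problem is ill-posed.

```python
# Hedged, single-coil sketch of the measurement model in Eq. (2); coil
# sensitivities are omitted and all data are synthetic placeholders.
import numpy as np

n = 128
x = np.random.rand(n, n) + 0j                  # stand-in for the true image x

# Spatial encoding: keep a random subset of phase-encode lines (m << n).
rng = np.random.default_rng(0)
keep = rng.choice(n, size=n // 4, replace=False)
mask = np.zeros((n, n), dtype=bool)
mask[keep, :] = True                           # subsampled Cartesian pattern

k_full = np.fft.fft2(x, norm="ortho")          # Fourier transform of the image
noise = 0.01 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
y = (k_full + noise) * mask                    # noisy, subsampled k-space

# Naive inversion (zero-filled reconstruction): the result is aliased and
# blurred relative to x, illustrating the ill-posed nature of the problem.
x_zero_filled = np.fft.ifft2(y, norm="ortho")
```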

Common machine learning techniques in image acquisition and reconstruction

Training machine learning models to assist in both acquisition and reconstruction procedures is often framed as an optimization problem in which we train a parameterized function \({f}_{w}(\cdot )\), often in the form of a deep neural network, to complete a specific task. In the supervised case, the function is optimized with respect to a training set of images (or scan parameters, or tissue parameters) given input k-space. This often takes the form of minimizing some loss function \(D\left(\cdot ,\cdot \right)\) via large-scale optimization solvers in the following way:

$${w}^{* }=\mathop{{\rm{argmin}}}\limits_{w}\mathop{\sum }\limits_{i=1}^{N}D\left({f}_{w}\left({y}_{i}\right),{x}_{i}\right)$$
(3)

where yi represents the acquired k-space measurements of the ith training sample, xi is the corresponding reference image, and N is the size of the training set. In unsupervised learning, the same task is solved but with a minimization that only depends on the inputs yi.
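
As a hedged illustration of how the supervised objective in Eq. (3) might be minimized in practice, the sketch below trains a toy network with PyTorch; the architecture, random training pairs, and optimizer settings are placeholders rather than any specific published reconstruction method.

```python
# Hedged sketch of the supervised objective in Eq. (3); all components are
# illustrative placeholders.
import torch
import torch.nn as nn

class SimpleRecon(nn.Module):
    """Toy f_w: maps a 2-channel (real/imag) zero-filled input to an image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )
    def forward(self, y):
        return self.net(y)

f_w = SimpleRecon()
D = nn.MSELoss()                               # the distance D(.,.) in Eq. (3)
opt = torch.optim.Adam(f_w.parameters(), lr=1e-3)

# Hypothetical training pairs: inputs derived from k-space, reference images x_i.
inputs = torch.randn(8, 2, 64, 64)
targets = torch.randn(8, 1, 64, 64)

for epoch in range(10):                        # real solvers run far longer
    opt.zero_grad()
    loss = D(f_w(inputs), targets)             # averaged over the batch (cf. the sum in Eq. (3))
    loss.backward()
    opt.step()
```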

Image acquisition

The selection of signal encoding parameters θ and spatial encoding parameters ϕ can influence both the contrast of the resulting image (Eq. (1)) and the acquisition time (Eq. (2)). As image sensitivity to pathology is affected by the acquisition parameters, it makes sense to optimize the image acquisition pipeline to increase sensitivity to relevant tissue contrast. This means finding the optimal radiofrequency flip angles, phases, timings, etc. for a desired contrast. Supervised learning has been used to find these acquisition strategies for optimizing contrast sensitivity14,15 as well as optimizing the k-space measurements16,17,18,19. Here, the optimization variable w in Eq. (3) is replaced by either or both of the contrast parameters (θ) and the k-space sampling locations (ϕ). For example, k-space trajectory optimization can be used to accelerate imaging by reducing the number of acquired k-space measurements for a given target spatial resolution. In this case, the optimization problem takes the form

$${\phi }^{* }=\mathop{{\rm{argmin}}}\limits_{\phi }\mathop{\sum }\limits_{i=1}^{N}D\left({f}_{w}\left({y}_{i}(\phi )\right),{x}_{i}\right)$$
(4)

where yi(ϕ) represents the k-space measurements of the ith training sample acquired along a particular sampling trajectory ϕ (for example, specific phase-encode lines), and \({f}_{w}\left({y}_{i}\left(\phi \right)\right)\) represents a reconstruction algorithm that takes the subsampled k-space and outputs an image. When the reconstruction network and the sampling trajectory are both differentiable, this is straightforward to implement; however, for Cartesian sampling, this problem is combinatorial in nature. Therefore, greedy approaches or smooth approximations become necessary16,17,18,19. A similar approach can be used to update other scan parameters that influence image contrast15.

Once Eq. (4) is solved, the optimized acquisition scheme can be represented as a new measurement model Aϕ* and implemented on the scanner. When this sequence is used for acquisition, the image reconstruction problem can then be solved given the acquired measurements and the measurement model (i.e., Eq. (2)).
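
As a concrete illustration of the combinatorial Cartesian case noted above, the sketch below implements a simple greedy search that adds one phase-encode line at a time, using a zero-filled inverse FFT as a stand-in for the reconstruction \({f}_{w}\); the training images and line budget are hypothetical.

```python
# Hedged sketch of a greedy Cartesian sampling search in the spirit of Eq. (4).
import numpy as np

def recon(kspace, mask):
    """Stand-in reconstruction: zero-filled inverse FFT."""
    return np.fft.ifft2(kspace * mask, norm="ortho")

def greedy_lines(train_images, budget):
    n = train_images[0].shape[0]
    kspaces = [np.fft.fft2(x, norm="ortho") for x in train_images]
    chosen, mask = [], np.zeros((n, n), dtype=bool)
    for _ in range(budget):
        best_line, best_err = None, np.inf
        for line in range(n):                  # try each unchosen phase-encode line
            if line in chosen:
                continue
            trial = mask.copy()
            trial[line, :] = True
            err = sum(np.mean(np.abs(recon(k, trial) - x) ** 2)
                      for k, x in zip(kspaces, train_images))
            if err < best_err:
                best_line, best_err = line, err
        chosen.append(best_line)               # keep the line that helps most
        mask[best_line, :] = True
    return chosen, mask

# Usage with toy stand-ins for the training images x_i:
images = [np.random.rand(32, 32) for _ in range(3)]
lines, mask = greedy_lines(images, budget=8)
```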

Image reconstruction

Classical image reconstruction relies on hand-crafted priors like L1-wavelet regularization20. More recently, deep neural networks have contributed promising techniques for image reconstruction. AI-based techniques can be separated into several categories: end-to-end supervised, end-to-end unsupervised, and generative modeling. In the end-to-end supervised setting, the function fw in Eq. (3) is trained with pairs of subsampled k-space data and images obtained from fully sampled data, (yi, xi), and \(D\left(\cdot ,\cdot \right)\) can represent any valid metric of distance (e.g., mean squared error). End-to-end unsupervised methods also train a NN, except with access only to subsampled measurements, yi, and no accompanying reference image, xi. This is important in scenarios where it is only possible to collect large amounts of subsampled data (e.g., in dynamic imaging). End-to-end NNs for supervised and unsupervised reconstruction can take many forms, but unrolled optimization networks, which alternate between classical optimization steps (e.g., conjugate gradient descent) and forward passes through a NN, are quite common21,22.
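
The sketch below illustrates the unrolled-network idea in a minimal form: each iteration applies a gradient-based data-consistency step for a masked Fourier forward model, followed by a small CNN refinement. The operator, network, and data are illustrative placeholders rather than a specific published architecture.

```python
# Hedged sketch of an unrolled reconstruction network (data consistency + CNN).
import torch
import torch.nn as nn

def A(x, mask):                                 # masked FFT forward operator
    return torch.fft.fft2(x, norm="ortho") * mask

def At(y, mask):                                # adjoint: masked inverse FFT
    return torch.fft.ifft2(y * mask, norm="ortho")

class UnrolledNet(nn.Module):
    def __init__(self, iters=4):
        super().__init__()
        self.iters = iters
        self.step = nn.Parameter(torch.tensor(1.0))
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),
        )

    def forward(self, y, mask):
        x = At(y, mask)                                    # zero-filled initialization
        for _ in range(self.iters):
            x = x - self.step * At(A(x, mask) - y, mask)   # data-consistency step
            xr = torch.stack([x.real, x.imag], dim=1)      # (B, 2, H, W)
            xr = xr + self.cnn(xr)                         # learned refinement
            x = torch.complex(xr[:, 0], xr[:, 1])
        return x

# Usage with toy data:
mask = (torch.rand(64, 64) < 0.25).float()
y = A(torch.randn(1, 64, 64, dtype=torch.complex64), mask)
x_hat = UnrolledNet()(y, mask)
```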

Although end-to-end methods are powerful, performance can degrade with changes in how the measurements are taken at training versus at test time, which is referred to as a “test-time distribution shift”. Recently, the use of generative models for inverse problems has been shown to improve robustness to variations in acquisition schemes. These methods train a Bayesian prior on the distribution of fully sampled images, p(x), and are therefore agnostic to distribution shifts in the likelihood, p(y|x), which can change across scans and imaging protocols (i.e., the signal and spatial encoding parameters). Perhaps the most popular method for including generative priors in inverse problems is by way of score-based, or diffusion probabilistic models23,24.
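
The following highly simplified sketch conveys how a learned generative (score-based) prior can be decoupled from the measurement model during reconstruction: Langevin-style updates combine a placeholder score network, standing in for a trained diffusion model, with the analytic gradient of the data-consistency term. Step sizes, iteration counts, and the untrained score network are illustrative assumptions.

```python
# Hedged, highly simplified sketch of posterior sampling with a generative prior.
import torch
import torch.nn as nn

score_net = nn.Sequential(                      # placeholder for a trained score model
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

def data_grad(x, y, mask):
    """Gradient of 0.5*||mask*FFT(x) - y||^2 with respect to a real-valued x."""
    resid = torch.fft.fft2(x, norm="ortho") * mask - y
    return torch.fft.ifft2(resid * mask, norm="ortho").real

def posterior_sample(y, mask, steps=50, eta=1e-3, lam=1.0):
    x = torch.zeros(1, 1, 64, 64)
    with torch.no_grad():
        for _ in range(steps):
            prior_score = score_net(x)          # approximates the gradient of log p(x)
            grad = prior_score - lam * data_grad(x, y, mask)
            x = x + eta * grad + (2 * eta) ** 0.5 * torch.randn_like(x)
    return x

# Toy usage: simulate subsampled measurements of a random "image".
mask = (torch.rand(64, 64) < 0.25).float()
y = torch.fft.fft2(torch.randn(1, 1, 64, 64), norm="ortho") * mask
x_hat = posterior_sample(y, mask)
```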

Regardless of the distinction between supervised and unsupervised training, it is important to note that AI models are trained on data derived from raw sensors (i.e., the raw k-space measurements), which are typically acquired via multi-coil arrays and are often subsampled. Even when the data are fully sampled, they are noisy and must first be reconstructed into images. Therefore, there is no real notion of "ground-truth" images. New training methods that account for the acquisition process have been proposed, both for end-to-end methods that are trained to take in acquired measurements and output reconstructed images, as well as for generative models (where the AI model is used as a statistical image prior within an iterative reconstruction).

It is also possible to jointly train the acquisition parameters together with the reconstruction method, which may consist of deep neural networks14,15,16,17. For example, Aggarwal et al. formulate the optimization problem as

$${\phi }^{* },{w}^{* }=\mathop{{\rm{argmin}}}\limits_{\phi ,w}\mathop{\sum }\limits_{i=1}^{N}D\left({f}_{w}\left({y}_{i}(\phi )\right),{x}_{i}\right)$$
(5)

In other words, they jointly solve for the phase encode lines given by ϕ and the neural network reconstruction given by fw.

Current clinical needs for AI-based image acquisition and reconstruction

AI has shown the ability to maintain high image quality using fewer measurements than conventional reconstruction techniques, and can even produce diagnostic images for use in specific downstream clinical tasks25. For example, AI for the reconstruction of prospectively accelerated abdominal imaging produced non-inferior images compared to conventional reconstruction methods26. AI for image reconstruction has made its way into products from several scanner manufacturers; two recent examples are GE's AIR™ reconstruction protocol and Siemens' Deep Resolve™ product line. However, it is important to emphasize that many of these products target specialized cases (e.g., specific anatomy/contrast) and are not yet deployed across the broader range of clinical use cases. An ability to generalize across imaging protocols is critical for widespread adoption.

An additional concern for longer scan sessions is patient motion, especially in pediatric populations. Addressing motion includes developing robust reconstruction methods that can handle dynamic imaging scenarios (e.g., contrast-enhanced imaging27) where it is often infeasible to collect the desired measurements from a single imaging volume. As many forms of pathology exhibit only subtle contrast changes, there is also a need for acquisition/reconstruction methods capable of resolving subtle variations in tissue properties, thus reducing the use and dose of exogenous contrast agents containing gadolinium28. Finally, as many new methods train and measure performance based only on image quality metrics, the link between image quality and downstream diagnostic metrics (e.g., tumor classification) must be better quantified and explicitly optimized when designing new AI-based techniques.

Barriers to practical deployment in the clinical setting

Although AI has displayed marked improvements over classical reconstruction and acquisition techniques17,21,22,23, there are still many issues that must be addressed before clinical adoption can be pursued. For example, AI techniques need to be robust to variations in how data are collected due to differences in vendor, field strength, field inhomogeneity, and patient motion. Models must be trained on representative data for each of the variations in hardware and imaging protocols; if the scan protocol changes (or if the scan parameters change to accommodate a specific patient, such as a larger field of view, different resolution, different echo time, or larger fat saturation bands), then this "new" protocol could fall outside the distribution included in the training set, and performance will degrade, as seen in Jalal et al.23. This can quickly devolve into a "model soup": for example, the winning team in the FastMRI 2020 reconstruction challenge trained eight independent models to account for field strength, scan anatomy, etc1. Interpretability must also be improved through uncertainty quantification. To demonstrate these points, Fig. 1 shows two example deep learning reconstructions (reproduced from the FastMRI 2020 reconstruction challenge29) where fully sampled scans were retrospectively subsampled to simulate faster scanning. In the first case, the DL reconstruction results in a faithful and diagnostic image; in the second case, the DL reconstruction hallucinates a blood vessel, likely due to the unseen artifact caused by surgical staples. This demonstrates the enormous challenge, and potentially clinically confounding problems, of deploying these models in clinical settings30.

Fig. 1: Example reconstructions from the 2020 FastMRI Challenge, adapted from ref. 29.
figure 1

The fully sampled images (A, C) were retrospectively subsampled to simulate 8× (top) and 4× (bottom) faster scans. In the top case, the DL reconstruction (B) is able to reproduce with high fidelity the lesion in the post-contrast T1-weighted image, though with some blurring. In the bottom case, the DL reconstruction (D) hallucinated a false vessel (red arrow), perhaps due to the surgical staple artifact not being well-represented in the training set.

A potential avenue for addressing these problems is the use of generative models for reconstruction: generative models have been shown to be less sensitive to changes in scan protocols and anatomy, largely due to the decoupling between the physical forward model and the statistical image prior. In other words, generative models are trained to learn the distribution of MR images independent of the physical imaging system parameters that produced those images. This means that the generative model can be used to reconstruct images from MR data acquired with other imaging schemes, so long as the images follow the model's learned distribution and the imaging parameters are known. Learning optimal sampling patterns for generative models could also be helpful, as this would allow the acquisition to be tailored to the specific imaging protocol while still benefiting from the robustness offered by generative model-based reconstruction.

Registering and segmenting MRI data via AI

Common AI techniques in segmentation and registration

Tumor and tissue segmentation

Tumor and tissue segmentations are often used for surgical planning, radiotherapy design, and assessing treatment response31,32. Currently, manual or semi-automated segmentations are often resource-intensive and exhibit unacceptable inter-observer variability31,33. While automated segmentation approaches enabled by AI may reduce both of these concerns, they require extensive training sets with expert-defined segmentations and (due to their black-box nature) may fail without warning31,34. Given these limitations, AI-based approaches require human oversight to review and, if necessary, edit the resulting segmentations31,35. Many AI-automated approaches operate on voxel-level features and typically employ convolutional neural networks (CNNs, such as U-Nets) to identify healthy tissue (e.g., organs-at-risk for radiotherapy planning), cancer, and intra-tumoral sub-regions (e.g., necrosis, edema, enhancing lesions)32. Pal et al. introduced a method combining fuzzy c-means clustering and random forest algorithms and showed 99% accuracy for segmenting spine tumors36. Kundal et al. evaluated four CNN-based methods for brain tumor segmentation (CaPTk, 2DVNet, EnsembleUNets, and ResNet50); EnsembleUNets achieved Dice scores of 0.93 and 0.85 and Hausdorff distances of 18 and 17.5 for testing and validation, respectively37. While current approaches do not eliminate the need for human refinement, they can often provide an acceptable segmentation of organs at risk that can be manually refined when greater precision and accuracy are required38.
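
For reference, the Dice score reported in such studies is twice the overlap between the predicted and reference masks divided by their combined size; a minimal sketch with hypothetical toy masks is shown below.

```python
# Minimal sketch of the Dice score for comparing binary segmentation masks.
import numpy as np

def dice_score(pred, ref):
    pred, ref = pred.astype(bool), ref.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    denom = pred.sum() + ref.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Usage with hypothetical binary tumor masks:
pred = np.zeros((64, 64), dtype=bool); pred[20:40, 20:40] = True
ref = np.zeros((64, 64), dtype=bool); ref[22:42, 22:42] = True
print(dice_score(pred, ref))                    # ~0.81 for this toy overlap
```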

Image registration

Classical image registration methods estimate the optimal transformation that maximizes the similarity between an image (or set of images) and the "target" image. Multiple combinations of transformations (rigid or deformable) and cost functions (sum of squared differences, normalized cross-correlation, or mutual information) exist. Osman et al. introduced a deformable CNN-based registration method for 3D MRI scans of glioma patients, called ConvUNet-DIR39, that outperformed the VoxelMorph method with a mean Dice score of 0.975 and similarity index of 0.908, compared to 0.969 and 0.893. However, each method has specific drawbacks, such as being sensitive to artifacts or limited in capturing local differences40. Learning-based methods, both supervised and unsupervised, attempt to overcome these difficulties. Supervised learning models such as BIRNet41 and DeepFLASH42 are trained using paired images (i.e., registered and un-registered images) with their corresponding transformation. Even though such methods can achieve accurate registration performance, assembling the necessary training data (which includes before/after registered images and the associated transformations) is difficult. Thus, unsupervised deep learning-based approaches are often preferred43,44, typically focusing on optimizing similarity metrics and model architectures using gradient-based optimizers like stochastic gradient descent.
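
The hedged sketch below illustrates the kind of objective such unsupervised registration methods optimize, in the spirit of VoxelMorph: an image-similarity term between the warped moving image and the fixed image, plus a smoothness penalty on the displacement field. The network that would predict the displacement field is omitted, and the images, field, and weighting are placeholders.

```python
# Hedged sketch of an unsupervised deformable-registration objective.
import torch
import torch.nn.functional as F

def warp(moving, disp):
    """Warp a (B,1,H,W) image by a (B,2,H,W) displacement field (in voxels)."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid_x = (xs + disp[:, 0]) / (W - 1) * 2 - 1     # normalize to [-1, 1]
    grid_y = (ys + disp[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)     # (B, H, W, 2)
    return F.grid_sample(moving, grid, align_corners=True)

def registration_loss(moving, fixed, disp, lam=0.01):
    warped = warp(moving, disp)
    similarity = F.mse_loss(warped, fixed)           # could be NCC or MI instead
    # Smoothness: penalize spatial gradients of the displacement field.
    dx = disp[..., :, 1:] - disp[..., :, :-1]
    dy = disp[..., 1:, :] - disp[..., :-1, :]
    smooth = (dx ** 2).mean() + (dy ** 2).mean()
    return similarity + lam * smooth

# Toy usage; in practice disp would be predicted by a CNN from the image pair.
moving = torch.rand(1, 1, 64, 64)
fixed = torch.rand(1, 1, 64, 64)
disp = torch.zeros(1, 2, 64, 64, requires_grad=True)
loss = registration_loss(moving, fixed, disp)
loss.backward()                                      # gradients would update the network
```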

Current clinical needs for AI-based image segmentation and registration

We assign a moderate state of readiness for DL-based registration of MRIs of brain cancer. In particular, locally registering voxels at the tumor boundary is challenging. Estienne et al.45 introduced a joint 3D-CNN method for brain registration and tumor segmentation. The model was trained using the BraTS 2018 dataset46 and then tested on over 200 pairs from the same dataset. The registration method was evaluated by computing the average distance between two ratios: (1) the ratio of original to deformed tumor mask area, and (2) the ratio of brain volumes of the paired images. The method outperformed a previously established approach based on whole tumor masks (part of the VoxelMorph package47), which employed an unsupervised learning-based inference algorithm initially applied to healthy brain MRIs48.

There are several areas where DL-based segmentation can meet clinical needs. First, segmenting a tumor from surrounding healthy tissues is central to radiotherapy planning. In fact, the time-intensive nature of manual segmentation has limited the widespread implementation of adaptive radiotherapy49. To address this challenge, AI tools employing CNNs to perform semi-automated and automated segmentation have been developed. Because such tools often have reduced performance after treatment (e.g., surgery, radiotherapy, and chemotherapy), an increased focus should be placed on training these methods with post-treatment data50 to enable accurate longitudinal segmentation of tumors. Clinical deployment of these automated segmentation tools seeks to accelerate the segmentation task within the workflow of adaptive radiotherapy. However, these algorithms have had mixed success due to their limited generalizability when applied to data they did not see during training, thereby limiting their widespread adoption. There are numerous potential reasons for the lack of success of AI-based segmentation algorithms, including variability in the scan acquisition, patient-specific presentation of normal tissue and tumors, and the segmentation goals of the treating clinical team.

Tumor segmentation and registration are also necessary steps for quantitatively describing imaging features that represent the underlying biology (Fig. 2). Inconsistencies in MRI segmentation can arise from various factors, including inter-rater variability (segmentations performed by different readers) and intra-rater variability (segmentations performed by the same reader at multiple times). Additional sources of inconsistency include variation in input data quality and resolution due to acquisition protocols, algorithmic bias across different segmentation methods and/or parameter settings, and temporal changes in the anatomical structure or intensity contrast of segmentation targets due to disease progression or intervention effects51,52,53,54. These inconsistencies introduce bias in the measurement of volume, morphology, and contrast of segmented regions, and influence the accurate and reproducible extraction of (for example) radiomic features55,56,57,58, which in turn restricts the reliability and generalizability of radiomics' clinical application in diagnosis and prognosis59,60. Therefore, it is critical to develop approaches to improve the consistency of segmentation to advance quantitative imaging.

Fig. 2: The flow chart depicts the downstream impact that quantitative imaging workflow can have on clinical applications.
figure 2

Beginning with image acquisition (A), variability and uncertainty are propagated through each step and can affect the accuracy in assessing lesion response and/or anticipated clinical outcome (H). Image processing routines (B) are then used to quantify image features prior to image registration (C); however, image registration may also occur prior to post-processing. Once the images are co-registered, automatic or manual segmentation (D) identifies the tissue of interest. At this stage, the clinical response can be assessed by determining lesion response (E). Alternatively, the features (F) from imaging and -omics (G) can be identified to predict clinical outcome (H) or lesion response (E). Error at any of these steps can compound and result in a false classification of patient outcomes.

Shortcomings of current registration and segmentation approaches

A fundamental shortcoming of current AI-based segmentation and registration approaches is the limitations of the data available for training and validating these approaches. These limitations include scarcity of data in the target setting, variations in data quality, lack of standardization of imaging protocols, and deviation of imaging protocols from those used to train previously validated approaches. For example, in the setting of brain cancer, while there have historically been significant efforts to segment pre-operative tumors, a recent literature review of 180 articles observed only three papers that included post-operative imaging31. The lack of trained models or expert-segmented or curated data limits61 the transferability of registration and segmentation approaches to new applications where treatment may alter image contrast or introduce large deformations and alterations in the anatomy. Another pitfall is the phenomenon of data drift, where changes in imaging protocols or other factors may lead to input data that falls outside of the distribution used for training62, resulting in reduced accuracy of segmentation and registration.

AI for diagnosis from MRI data

Common AI techniques for diagnosis via MRI

While computer-aided detection systems (CADe) leverage AI to identify the position of the tumor within an image, computer-aided diagnosis (CADx) systems exploit imaging features to characterize it quantitatively (see Fig. 3). These technologies have been constructed using AI approaches based on radiomics, ML32,63,64,65 (e.g., random forests, support vector machines), and DL32,63,64,65. In particular, CNNs are the most common choice of architecture for DL-based CADe and CADx. For example, Chakrabarty et al.66 trained a CNN with post-contrast, T1-weighted data to perform differential diagnosis of six brain tumor types and healthy tissue, achieving AUCs over 0.95 in internal and external validation. Additionally, Saha et al.67 achieved AUCs over 0.86 with a CNN-based CADe/CADx system for clinically significant prostate cancer that was informed by T2-weighted, diffusion-weighted MRI (DW-MRI), apparent diffusion coefficient maps (ADC), and an anatomic prior of zonal cancer prevalence. While CNN training requires vast labeled datasets and computational resources, pre-trained CNNs (e.g., AlexNet, GoogLeNet) can also be applied to new datasets for cancer detection and diagnosis tasks (i.e., transfer learning)32,65,68,69. For instance, Antropova et al.70 achieved an AUC of 0.89 in the diagnosis of malignant breast lesions on dynamic contrast-enhanced MRI (DCE-MRI) data with a CADx system that combined radiomics with CNN-based features, for which they used the pre-trained VGG19 model.
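
As a hedged illustration of the transfer-learning strategy mentioned above, the sketch below reuses an ImageNet-pretrained VGG19 backbone (via torchvision, assuming a recent version) and trains only a new two-class head; the data, preprocessing, and hyperparameters are placeholders rather than the setup of any cited study.

```python
# Hedged sketch of transfer learning for MRI-based lesion classification.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.vgg19(weights="DEFAULT")       # ImageNet-pretrained weights
for p in backbone.features.parameters():
    p.requires_grad = False                      # freeze convolutional features

# Replace the final classifier layer with a 2-class head (benign vs malignant).
backbone.classifier[6] = nn.Linear(backbone.classifier[6].in_features, 2)

optimizer = torch.optim.Adam(
    [p for p in backbone.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss()

# Toy batch standing in for preprocessed MRI slices replicated to 3 channels.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, 2, (4,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```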

Fig. 3: AI for imaging-based cancer detection and diagnosis.
figure 3

The figure illustrates the global process whereby MRI sequences (i.e., T2-weighted (T2W), DW-MRI, and DCE-MRI) are processed with a CNN to identify whether there is a tumor present, its localization, and the differential diagnosis (e.g., tumor class, tumor subtype, clinical risk). Several CNN architectures have been used to construct CADe and CADx systems for this type of MRI analysis66,67,69,91,94. This figure depicts a U-Net, given its successful application to automatic tumor detection and segmentation, as well as to cancer diagnosis.

Current clinical needs

There are ongoing efforts to leverage AI to analyze standard-of-care MRI data for cancer detection and diagnosis to reduce inter-reader variability, while also streamlining time-intensive diagnostic processes32,63,64,65,69. These methods attempt to enable timely and reliable assistance for MRI-informed clinical decision-making for diagnosis, monitoring, and treatment. For example, CADe/CADx systems can assist in determining an appropriate clinical management strategy for the identification of clinically-significant prostate cancer64,69 on multiparametric MRI data. A fundamental challenge in neuro-oncology is the differential diagnosis of primary central nervous system tumor subtypes and brain metastases, which necessitate different treatments32,66. Current clinical needs in breast cancer diagnosis include early detection of high-risk disease, improvement of screening for intermediate-risk and dense breast populations, the differentiation between benign and malignant lesions, and the identification of specific subtypes63,65,70,71. These are all tasks that could potentially be assisted by leveraging AI methods, although there are several challenges to their development (see Section 5.3). Additionally, more recent efforts are tackling the integration of multimodal data (e.g., imaging, histopathology, biomarkers, omics) using AI to better inform cancer management.

Barriers to practical deployment in the clinical setting

In medicine, as opposed to many other application areas of AI, understanding why a decision is made (e.g., a diagnosis) is as important as the decision itself. Thus, establishing model interpretability is a fundamental barrier to the clinical translation of AI-based techniques for decision-making in clinical oncology63,64,72,73. Beyond explaining the biophysical causes underlying AI-driven cancer detection and diagnosis, model interpretability is also required to extrapolate and interrogate model outcomes63,64, which would contribute to more informed clinical decisions. The medical AI community has proposed several approaches to address the lack of interpretability in MRI-informed AI models for cancer detection and diagnosis74,75,76, such as class activation methods66,67, knowledge-driven priors67,74, and integration of multimodal multiscale data77,78. Furthermore, recent approaches in the field of scientific machine learning have shown promise in improving the interpretability of AI models79 and could therefore be developed for the detection and diagnosis of cancer on MRI data. For example, mechanistic feature engineering can identify biophysically-relevant inputs that characterize tumor biology80,81, physics-informed neural networks (PINNs)82,83 include a (bio)physical model in the loss function, and biology-informed neural networks (BINNs) adapt the model architecture according to prior biological knowledge84,85.
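
To illustrate the physics-informed idea in its simplest form, the sketch below trains a small network to fit hypothetical tumor-burden measurements while penalizing violation of a logistic-growth ODE; the ODE, data, and constants are illustrative assumptions, not a validated tumor-growth model.

```python
# Hedged, minimal illustration of a physics-informed loss: fit data and also
# satisfy dN/dt = k*N*(1 - N/K) (an assumed logistic-growth placeholder model).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
k, K = 0.5, 1.0                                  # assumed growth rate / capacity

t_data = torch.tensor([[0.0], [1.0], [2.0]])     # hypothetical imaging time points
N_data = torch.tensor([[0.1], [0.15], [0.24]])   # hypothetical tumor burden
t_phys = torch.linspace(0, 3, 50).reshape(-1, 1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    data_loss = ((net(t_data) - N_data) ** 2).mean()
    N = net(t_phys)
    dNdt = torch.autograd.grad(N.sum(), t_phys, create_graph=True)[0]
    residual = dNdt - k * N * (1 - N / K)        # physics residual from the ODE
    loss = data_loss + (residual ** 2).mean()    # data term + physics term
    loss.backward()
    opt.step()
```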

Despite the progress in the development of MRI-informed AI methods for cancer detection and diagnosis, these technologies need to address potentially critical pitfalls in their performance in tumor-specific scenarios. For example, the detection of malignant breast lesions using MRI-informed AI models can be affected by complex tumor geometries (e.g., non-mass tumors), the architectural and radiological features of surrounding healthy tissue (e.g., tissue density, background parenchymal enhancement), as well as signal distortion and movement of the tumor during DCE acquisition65,71,86,87. Additionally, the performance of CADe/CADx methods for prostate cancer can be affected by several well-established MRI confounders64,67,88,89,90, such as prostatitis and benign prostatic hyperplasia. Similarly, the different types of tumors that may develop in the brain can produce similar MRI signals that complicate their differential diagnosis32,66. To address these issues, future studies require (i) training and validation databases that balance the amount and diversity of confounding features and comorbidities (e.g., breasts with mass-like and non-mass geometries; prostates with cancer alone and combined with other prostatic pathologies; and diverse brain tumor cases), and (ii) training AI models to recognize confounding patterns (e.g., standard textural and deep learning features) to boost the performance of CADe/CADx technologies for cancer32,64,65,66,67,71,86,87,88.

Several validation, implementation, ethical, and data issues have also hindered the clinical translation of AI models for cancer detection and diagnosis. Firstly, the diagnostic performance of CADe/CADx systems still needs to be externally validated in large prospective clinical trials including diverse imaging acquisition methods, patient demographics, and reader expertise32,63,64,65,69,91,92. Additionally, two key obstacles to their practical clinical deployment are the lack of local data science support and the high computational cost, particularly during the training of DL models. The need for high-performance hardware and efficient algorithms demands a considerable financial investment, hindering widespread adoption in resource-constrained healthcare settings. As noted in Section 5.1, transfer learning is a promising strategy to address these computational limitations by leveraging pre-trained DL models32,65,68,69. Another fundamental issue is the limited availability of rigorously curated data for AI model development, which may require specific labels and non-standard preprocessing that can be costly and time-consuming63. Furthermore, biases in data collection processes and unrecognized biases in clinical practices from which the training data is gathered can lead to distorted outcomes, even perpetuating societal inequalities93. Data inaccessibility due to confidentiality concerns, potential breaches compromising patient confidentiality, and strict data privacy regulations limiting collaboration also constitute important challenges63,93. To address limitations in data transfer for multicenter collaborations, federated learning94 is a potential strategy that relies on sharing model parameter updates rather than datasets.

AI for predicting response from MRI data

Common AI techniques for predicting response

AI can be used to link imaging data to outcomes such as pathological response, time to recurrence, and overall survival. Commonly used machine learning methods for predicting response include support vector machines (SVM)95,96,97, regression98,99, random survival forests (RSF)100, clustering101,102, and CNNs99,103,104. More recently, DL models105 have been established based on the transformer architecture or by incorporating attention mechanisms106. Panels A and B of Fig. 4 demonstrate how an SVM and a CNN, respectively, can be used to predict response.

Fig. 4: Demonstration of how an SVM and a CNN can use imaging data to predict treatment response.
figure 4

In Panel A, features related to histograms of relative cerebral blood volume97 or peak height of a perfusion signal107 are extracted from the imaging data. Clinical information such as patient age and genetic data can also be included as features. The goal of the SVM is to take N features and determine the (N-1)-dimensional hyperplanes that maximally separate (for example) patients into short, medium, or long survival, or complete response as determined by pathology. Panel B shows potential inputs to a CNN: either the whole image domain, imaging-derived features, or a patch of the domain. Extracting patches from a domain can be used to increase the amount of training data, or to reduce the computational burden when working with large images. These are then input to a CNN, here represented with convolution and down-sampling layers feeding into a fully connected architecture. In general, multiple sets of convolution and down-sampling layers are used. The network output accomplishes the same goal as the SVM in Panel A; namely, separating inputs into classes such as responders and non-responders, or survival at a particular time.

In brain cancer, combining clinical data with AI methods has been shown to be more effective in predicting survival outcomes than clinical methods alone107,108. SVMs, for instance, have been used to predict treatment outcomes for gliomas96 by employing both clinical and functional features96. An SVM trained by Emblem et al. found whole-tumor relative cerebral blood volume to be the optimal predictor of overall survival97. CNNs have also been used for the assessment and prediction of outcomes. For example, Jang et al.104 distinguished between pseudo-progression and progression in patients with GBM using CNNs with long short-term memory (LSTM)109,110, a recurrent NN architecture. They compared two options for the CNN input data: (1) MRI (post-contrast T1-weighted images) and clinical parameters, and (2) MRI data alone. The model trained on MRI and clinical data outperformed the MRI-only model, with AUCs for predicting progression versus pseudo-progression of 0.83 and 0.69, respectively. This demonstrates that the combination of AI, MRI, and clinical features might help in post-treatment decision-making for patients with GBM104.
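
A minimal sketch of the SVM-based approach described above is shown below, using scikit-learn to separate patients into survival groups from imaging-derived features; the features and labels are random placeholders for real cohort data.

```python
# Hedged sketch of an SVM classifier on imaging-derived features (e.g., rCBV
# histogram statistics) separating patients into survival groups.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(120, 10)                      # 120 patients, 10 image features
y = np.random.randint(0, 3, size=120)            # short / medium / long survival

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```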

In prostate cancer, MRI-based AI has enabled the prediction of recurrence after surgery and radiotherapy111. For example, Lee et al.99 developed a DL model trained on preoperative multiparametric MRI data (T2-weighted, DW-MRI, and DCE-MRI) to predict long-term post-surgery recurrence-free survival. In Cox models and Kaplan-Meier survival analyses, the features obtained from multiparametric MRI data via their DL model outperformed clinical and radiomics features, while the combination of clinical and DL features yielded the best predictive performance. Additionally, in breast cancer, prognostic CNN models trained on MRI data have also enabled the prediction of pathological complete response to neoadjuvant chemotherapy112.
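
The following hedged sketch shows how such a Cox proportional-hazards analysis could combine clinical variables with deep-learning-derived features using the lifelines package; the dataframe contents are synthetic placeholders, not data from the cited studies.

```python
# Hedged sketch of a Cox proportional-hazards analysis with clinical and
# (hypothetical) deep-learning-derived features.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "time_to_recurrence": rng.exponential(36, size=200),   # months (synthetic)
    "recurred": rng.integers(0, 2, size=200),               # event indicator
    "age": rng.normal(65, 8, size=200),                     # clinical feature
    "dl_feature_1": rng.normal(size=200),                   # DL-derived features
    "dl_feature_2": rng.normal(size=200),
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_recurrence", event_col="recurred")
cph.print_summary()                              # hazard ratios per feature
```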

Current clinical needs

Common clinical challenges in oncology include the stratification of patients by treatment response, risk of relapse, and overall survival113,114. Identification of patient outcomes before treatment begins (or early during treatment) could help select, escalate, or de-escalate prescribed treatments, as well as tailor personalized follow-up schedules. It is important to note that analyses (AI-based and otherwise) leveraging multi-parametric MRI, as compared to biopsy-based stratification, also enable longitudinal whole-tumor coverage, thereby addressing the sampling-bias limitations of biopsy113. Furthermore, many of these AI-based approaches leveraging imaging data achieve better performance than standard clinical information (e.g., age, sex, extent of surgery) alone113,114.

Barriers to practical deployment in the clinical setting

Although AI-based approaches have demonstrated better performance than standard clinical information in some settings, these studies are typically done in specific cohorts with limited external validation and therefore have limited generalizability115. The variability in MR acquisition techniques between manufacturers, scanner types, protocols, and institutions can lead to substantial bias when training image-guided AI without proper data harmonization116. For example, Marzi et al.117 reduced site effects in a study of T1-weighted MRI data from 1740 healthy subjects at 36 sites by using a harmonizer transformer as part of the preprocessing steps of a machine learning pipeline. Moreover, restrictions on inter-institution sharing of patient data118 (due to concerns such as privacy) may restrict the verification of model generalizability. These limitations, alongside uncertainty in the AI models themselves, create substantial barriers to the widespread clinical adoption of AI models for response prediction.

Efforts to construct large, publicly available MRI datasets for brain tumors are ongoing. However, careful consideration is needed when using them for training, testing, or validating AI-based models121. One significant issue is the potential for overlaps between different datasets (i.e., multiple datasets containing the same patient), which can reduce the amount of uniquely available data. More specifically, the IvyGAP Radiomics dataset contains the pre-operative MRIs of the IvyGAP dataset with additional segmentations and derived radiomics parameters. Moreover, the BraTS 2021 dataset contains data that were available in the previous BraTS challenges and other public datasets such as TCGA-LGG (65 patients), TCGA-GBM (102 patients), and IvyGAP (30 patients). This may lead to redundancy and diminish the overall diversity of the training data121. Furthermore, these datasets have been available for more than a decade and some of them have undergone updates, adding variability in protocols and scan quality and reflecting the evolving WHO classification. We identify similar issues in prostate cancer122, particularly regarding the age of the publicly available datasets (many being 10+ years old) and the overlap between datasets (which may or may not be known). Dataset sizes are also often small, especially once missing or low-quality data are removed.

Efforts for breast cancer data standardization are also in progress. For instance, Kilintzis et al.123 attempted to produce a harmonized dataset from 5 publicly available datasets from the TCIA platform, including 2035 patients. More generally, Kosvyra et al.124 propose a new methodology for assessing the data quality of cancer imaging repositories by incorporating a three-step procedure that includes a data integration quality check tool to ensure compliance with quality requirements.

Federated learning could be a practical strategy to overcome the complexity of inter-institutional data sharing. Instead of assembling data from different institutions into a centralized large-scale dataset, federated learning trains the AI model on decentralized, private datasets at multiple sites independently; only the trained parameters (instead of the original data) are then shared between training sites to generate the globally tuned model. For detailed discussions of these points, see Rauniyar et al.119 and Guan et al.120.
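
A minimal sketch of this federated-averaging idea is shown below: each site trains a local copy of the model on its private data, and only the resulting parameters are averaged into the global model. The model, per-site data, and single aggregation round are illustrative placeholders.

```python
# Hedged sketch of federated averaging (FedAvg-style): data never leave a site,
# only model parameters are shared and aggregated.
import copy
import torch
import torch.nn as nn

def local_update(model, data, target, lr=1e-3, steps=5):
    model = copy.deepcopy(model)                 # site-local copy; data stay local
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(data), target).backward()
        opt.step()
    return model.state_dict()

def federated_average(state_dicts):
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

global_model = nn.Linear(10, 1)                  # placeholder predictive model
sites = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]  # 3 institutions
local_states = [local_update(global_model, x, y) for x, y in sites]
global_model.load_state_dict(federated_average(local_states))          # share weights only
```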

Beyond data variation, the cancer population itself is heterogeneous (as categorized by cancer subtype, staging, patient demographics, etc.), which leads to differences in therapy response between patients125. Moreover, novel therapies can be introduced into clinical care, therapeutic regimens vary between institutions, and treatments can be refined, leading to significant differences in datasets collected at different times126. Overall, the trade-off between the sample size and homogeneity of accessible datasets to train AI models is a major barrier to the robust application of AI for response prediction. One method that could help combat generalizability issues with small datasets is pretraining. For example, Yuan et al.127 and Han et al.128 pretrain convolutional neural networks using the ImageNet129 dataset to successfully classify risk in prostate tumors and differentiate between long- and short-term glioma survivors, respectively. As an alternative to pretraining on ImageNet, Wen et al.130 explore the possible benefits of pretraining using medical images. While their study ultimately found that ImageNet pretraining provided more accurate results, medical pretraining showed potential. As such, it could be useful to construct a large medical image database, analogous to ImageNet, to further explore pretraining on medical images. Zhang et al.131 explore another possible workflow for fine-tuning a network designed to classify breast cancer molecular subtypes from DCE-MRI data. Rather than relying on a separate, large dataset for pre-training, they separate their data into a training set and two testing sets, A and B. They then compare (1) testing the network with both A and B, (2) fine-tuning with B and testing with A, and (3) fine-tuning with A and testing with B. They conclude that fine-tuning with this method increases accuracy. Thus, in a situation where the dataset of interest is very small, initial network training could be completed on large, publicly available datasets, with fine-tuning and testing completed using the dataset of interest.

Many AI-based algorithms lack interpretability, thereby also limiting their clinical adoption. Furthermore, most AI-based approaches for outcome prediction are deployed with set parameters that govern their sensitivity and specificity. The patient-specific use of these models may require their optimization to fit the specific wishes of the patient, their family, and their clinical team. This patient-specific optimization creates an additional barrier to clinical deployment. The integration of AI-based prediction algorithms into the clinical workflow faces additional challenges as it has not been well established and currently generates additional work for the clinical team. This limits their adoption to clinics that have a robust computational infrastructure, clinicians who are willing to spend extra time generating the data, and ready access to the necessary data for the algorithm.

Discussion

AI-based methods have established some measure of success in several areas of MRI including image acquisition, reconstruction, registration, and segmentation, as well as assisting in diagnosis and prognosis. Some registration and segmentation techniques have even been approved by the FDA for clinical application69,91,92,132. However, while the application of AI techniques for MRI in cancer clearly has tremendous potential, three major challenges persist: (1) model generalizability, (2) model interpretability, and (3) establishing confidence in the output of an AI model. Indeed, these problems are common to many applications of AI in medicine.

The limited progress in transporting an AI method from one institution to another, and from one disease setting to another (i.e., generalizability), is heavily influenced by both the data and the devices employed to capture the data133. The wide variety of available scanners (both from different manufacturers and different models within a manufacturer), as well as the protocols they run, can further hamper generalizability, as the data employed for training may not adequately sample all the acquisition scenarios encountered in the testing data set133,134. The lack of standardization in quality assurance and control (QA/QC) adds additional complexities135. Even when an AI method has had success in one clinical setting, there is no guarantee it will be successful when applied in another disease setting due to variations in patient characteristics and previously received treatments136. Thus, retraining the AI model is required, and this can introduce nontrivial changes to the AI architecture or method of implementation. This is especially true in cancer, where the differences between diseases and the site of origin can introduce tremendous anatomical and physiological heterogeneities that may confound a previously trained AI model. It is increasingly recognized that each model should be evaluated with local data in the disease setting for which it is intended to be used137.

A fundamental issue with the majority of AI-based analyses is their limited "interpretability", in the sense of providing only a limited understanding of the relationships between model input and model output; that is, there is limited insight into how the AI model gets from cause to effect. This is an active area of investigation72,138,139 and is now well-recognized as extremely important in domains that involve high-consequence decisions, like oncology140,141. For example, lack of interpretability generates problems when trying to identify optimal interventions for a particular patient, where it is critical to understand why an AI model selects a particular therapeutic regimen over another. In particular, for an AI model built on population-based MRI data, does the training data set adequately capture the unique characteristics of the features in the individual's MRI data? One attractive way forward is to link mechanism-based modeling with AI methods via scientific machine learning, in which known/established biological and physical laws can be explicitly incorporated into the AI algorithm142,143,144. This can have the additional benefit of increasing confidence in the method.

The challenge in establishing confidence is related not only to the lack of model generalizability and interpretability, but also to concerns over the ethical issues associated with applying AI techniques to healthcare data, as well as over maintaining patient privacy and data security. This is particularly true with medical imaging data, in which detailed anatomical features can be rendered in 3D. Importantly, once introduced in the clinical workflow, the performance of the AI tool must be continuously monitored as the data inputs are adjusted with new imaging hardware, software, and methods of image acquisition and subsequent analysis. More generally, the Office of Science and Technology Policy has published a white paper entitled, "A Blueprint for an AI Bill of Rights"145 designed to guide the safe, effective, and unbiased development and application of AI.

Beyond the issues related to generalizability, interpretability, and confidence enumerated above, there may be fundamental limitations to what AI can contribute to cancer imaging. For example, it is important to note that every AI-based method requires a training set, and since cancer is notoriously heterogeneous across both space and time, there are fundamental limits to what a population-based method can achieve. Indeed, this problem manifests itself in everything from image reconstruction to predicting response if the specific details of the pathology under investigation are not well-characterized in the training set. For a disease as heterogeneous as cancer, and an imaging modality as flexible as MRI, this seems to be a difficult problem to overcome with an AI-only approach.

Conclusion

AI has shown promise in accelerating image acquisition and reconstruction, maintaining high spatial resolution in a fraction of the time typically required to obtain such an image. It has also been a useful tool for improving image segmentation and registration, as well as offering valuable results in both diagnostic and prognostic settings. However, questions remain about the interpretability and generalizability of these techniques, especially when a method is ported from one disease setting to another or even from one institution to another. The dearth of robust QA/QC and ethics-ensuring methods means that much work remains to be done to maintain patient safety given the current status of AI techniques in cancer imaging and healthcare. Finally, questions concerning the fundamental limitations of any method that requires a large training dataset remain.