Introduction

The wide variability in tumor appearance and location makes precise monitoring of oncological disease a critical challenge for both clinical care and research. Effective evaluation is essential for assessing tumor aggressiveness, predicting prognosis, and guiding treatment decisions. RECIST v1.1 has long been regarded as the gold standard for assessing solid tumors over time1, allowing for patient stratification based on disease response or progression.

However, this method has significant limitations: low information yield from linear long-axis measurements compared to total tumor burden2,3; arbitrary and non-reproducible selection of target lesions, leading to misclassification of disease status4; and inaccuracies in measuring the long axis, with inter-reader variability exceeding 20%5, further contributing to classification errors.

Traditionally, long and short axes of the tumor are used as proxies for estimating tumor size on CT scans. However, linear measurements are increasingly considered inadequate as the field shifts toward more informative markers, such as volumetry6 and shape assessments, including tumor eccentricity and irregularity7. Volumetric analysis, more sensitive to change due to its proportionality to the cube of the radius, is advantageous for detecting tumor burden changes, especially for tumors with irregular shapes, where linear measurements fail8. Novel radiomics biomarkers derived from volumetric analysis show promise in oncological evaluation, notably in colon and lung cancers9,10.

Despite its promise, volumetric measurement is time-consuming11 and impractical to perform manually. Approaches ranging from early models based on manual feature extraction to deep-learning methods using convolutional neural networks12 have had limited success. Most are organ-specific and effective primarily in straightforward cases, such as lung nodules, but struggle with more complex lesions. Furthermore, these methods lack interactivity and adaptability, restricting their clinical integration.

The emergence of foundation models, powered by the transformer architecture and self-attention mechanisms13, could alleviate these issues. Pre-trained on extensive unannotated datasets in a self-supervised manner, foundation models tend to outperform traditional deep-learning systems when evaluated on a wide range of downstream tasks14. Remarkably, these models also exhibit emergent properties, where complex capabilities arise naturally from their scale and training, further boosting their adaptability. Their capacity for transfer learning and zero-shot classification allows them to tackle previously unseen challenges, or to do so with minimal fine-tuning, making them a transformative force in medical imaging15,16. In computer vision, for instance, they can generate reliable segmentation masks from simple visual cues such as bounding boxes or point-click inputs17. This ability to dynamically refine segmentations paves the way for more explainable AI, enhancing usability for radiologists.

In response, we developed ONCOPILOT, an interactive foundation model trained on more than 8000 publicly available CT scans of general anatomy and tumors. ONCOPILOT aims to deliver precise RECIST measurements and facilitate volumetric analysis, integrating seamlessly into the radiologist’s workflow.

Results

Foundation model

ONCOPILOT is a foundation model adapted from SAM17, developed specifically for segmenting biomedical images. Similar methods have been concurrently explored in MedSAM18, SegVol19, and SAM-Med3D20. It was pre-trained on a diverse dataset of 2374 CT scans comprising normal anatomy and oncological lesions, covering 104 anatomical structures (e.g., organs, bones) and 4 oncological lesion types from the MSD dataset (lung, liver, pancreas, and colon tumors), without distinction regarding their histological type or malignancy (Fig. 1a). To specialize the model for oncology, it was subsequently fine-tuned on a comprehensive dataset of 6229 tumors from various organs (e.g., pancreas, bone, liver, kidney, lung, lymph nodes).

Fig. 1: ONCOPILOT Foundation Model Training and Evaluation.
figure 1

A Overview of the datasets used for training the ONCOPILOT segmentation model, including the distribution across train, test, and validation sets. B Diagram illustrating the ONCOPILOT segmentation model’s workflow. The model accepts visual prompts (either point-clicks or bounding boxes) of 3D tumor volumes and outputs corresponding 3D segmentation masks. Optional editing allows for real or simulated radiologist interaction, where positive and negative edit-points can be set manually in a viewer environment or automatically during evaluation.

ONCOPILOT is designed to interactively segment oncological lesions in 3D, using visual prompts such as a bounding box (referred to as bbox) around the lesion of interest or a point-click (referred to as point) inside it (Fig. 1b). To simulate the dynamic refinement of the predicted segmentation masks by radiologists, we developed an editing mechanism (referred to as point-edit, see Methods) that prompts the model up to 4 successive times with negative and positive edit-points targeting over-segmented and under-segmented areas, respectively.

Segmentation performance

The state-of-the-art baseline for comparing ONCOPILOT’s performance was the model used in the ULS23 oncological lesion segmentation challenge21, based on nnUnet22.

ONCOPILOT matched or surpassed the ULS23 model21 in all prompting modes (point, point-edit, and bbox) across all lesion types, with the exception of lung tumors, where only the point-edit model demonstrated superior performance (Fig. 2a, with examples of successful segmentations in Fig. 2b). ONCOPILOT achieved mean DICE scores of 0.70 (95% CI: 0.69–0.71) in point mode, 0.70 (95% CI: 0.68–0.71) in bbox mode, and 0.79 (95% CI: 0.78–0.80) in point-edit mode, compared to 0.70 for the baseline.

Fig. 2: ONCOPILOT performance against baseline.
figure 2

A Radar plot (left) and table (right) displaying segmentation mean DICE scores across 7 lesion types for 3 different ONCOPILOT models (point, point-edit, bbox) compared to the best-performing baseline from the ULS23 segmentation challenge on the 10% held-out test set. B Examples of successful segmentations from the test set, comparing point mode (left columns) and bbox mode (right columns). The top row shows the visual prompt provided to the model, the middle row displays the ground truth mask for that slice, and the bottom row presents the ONCOPILOT model’s predicted segmentation.

The distribution of lesion sizes by organ is provided in Supplementary Fig. 1b. Of note, the test dataset was imbalanced, with over 40% of the lesions being lung tumors (further addressed in the Discussion section).

Morphology analysis

The segmentation masks produced by the model in point mode were influenced by the lesion morphology and size. Indeed, ONCOPILOT exhibited lower DICE scores for lesions with irregular, non-spherical shapes, with a mean DICE of 0.66 (95% CI: 0.63–0.69) for tumors with a sphericity index below 0.6, compared to 0.71 (95% CI: 0.70–0.73) for more spherical tumors in point mode (p < 0.001, Fig. 3a, Supplementary Fig. 2a).

Fig. 3: ONCOPILOT Performance on Different Lesion Types.
figure 3

A Bar plot showing the mean DICE scores from ONCOPILOT segmentation masks in point mode (red) and point-edit mode (blue) for spherical lesions (sphericity > 0.6) versus irregular lesions (see Methods for the sphericity formula). B Bar plot showing the mean DICE scores from ONCOPILOT segmentation masks in point mode (red) and point-edit mode (blue) for large lesions (long axis > 15 mm) versus smaller lesions. C Bar plot showing the mean DICE scores from ONCOPILOT segmentation masks in point mode (red) and point-edit mode (blue) for voluminous lesions (volume >1 mL) versus smaller lesions. D Boxplot displaying the distribution of DICE scores produced by ONCOPILOT in point mode (red) and point-edit mode (blue) across various lesion types in the 10% held-out test set, with median values and interquartile ranges highlighted. E Boxplot showing RECIST measurements error against the ground truth, derived from ONCOPILOT's predicted masks in point mode (red) and point-edit mode (blue) across different lesion types in the 10% held-out test set, highlighting median values and interquartile ranges. The long axis is defined as the longest possible line in the axial plane across the predicted 3D mask. ***: p < 0.001; n.s: non-significant.

Similarly, smaller lesions yielded lower DICE scores, with a mean of 0.67 (95% CI: 0.65–0.69) for lesions with a long axis <15 mm versus 0.73 (95% CI: 0.72–0.75) for larger lesions (p < 0.001, Fig. 3b, Supplementary Fig. 2b). This trend persisted when using volume as a metric: lesions under 1 mL had a mean DICE of 0.67 (95% CI: 0.66–0.69), compared to 0.74 (95% CI: 0.72–0.76) for larger lesions (p < 0.001, Fig. 3c, Supplementary Fig. 2c).

Crucially, interactive editing mitigated these biases, eliminating significant differences (p > 0.05) in DICE scores between lesions of varying sphericity, long axis, or volume in point-edit mode. This approach also reduced disparities in DICE between lesion types (Fig. 3d). Additionally, when using RECIST measurements for the long axis instead of DICE scores, interactive editing significantly reduced measurement errors, with the median error decreasing from 14.1% in point mode to 9.6% in point-edit mode (p < 0.001). This level of accuracy is consistent with the reported inter-reader variability among radiologists for single-lesion measurements5 (Fig. 3e).

ONCOPILOT evaluation against radiologists

To evaluate ONCOPILOT’s accuracy in a clinical setting, its long-axis measurements were compared to those of radiologists. A validation set of 67 tumors from the ULS23 DeepLesion dataset was used, selected according to RECIST v1.1 guidelines (long axis ≥10 mm for solid lesions, short axis ≥15 mm for lymph nodes) and segmentation quality.

ONCOPILOT demonstrated radiologist-level performance in point, point-edit, and bbox modes (Fig. 4a, b). There was no statistically significant difference (p > 0.05) between the different ONCOPILOT models when evaluated against radiologists, with a median absolute error in long axis measurement of 1.3 mm (95% CI: 1–2.1) for radiologists (8.6% of the median lesion size) versus 1.1 mm (95% CI: 0.9–1.3) for ONCOPILOT in point-edit mode (7.4%), 1.6 mm (95% CI: 1–3) in point mode (10.8%), and 1.5 mm (95% CI: 0.9–2.3) in bbox mode (10.4%).

Fig. 4: ONCOPILOT Integration Into Radiologist’s Workflow.
figure 4

A Diagram and results comparing ONCOPILOT in point, point-edit, and bbox modes against three radiologists for the long-axis measurement of diverse oncological lesions. Median absolute error (mm) and median relative error (% of lesion size) are shown. p values from t-tests compare ONCOPILOT models to radiologists for long-axis measurement error; no comparison reached statistical significance (p ≥ 0.05). The long axis is the longest line in the axial plane across the predicted 3D mask. B Boxplot (bottom) of ONCOPILOT's tumor long-axis measurement performance against radiologists. Left: median absolute error (mm) vs. ground truth. Right: median relative error (% of lesion size). Median and interquartile ranges are shown. C Diagram of an experiment evaluating radiologists' inter-operator variability and measurement time when measuring tumor long axes in a digital viewer, comparing manual vs. ONCOPILOT-assisted (bbox mode) assessments. D Boxplots show radiologists' inter-operator variability in measurement error (left) and measurement time (right) using manual vs. ONCOPILOT-assisted annotations across diverse tumors, with t-test p values; n = 3.

ONCOPILOT integration into radiologist’s workflow

To assess whether ONCOPILOT could serve as an AI companion, its integration into the workflow was evaluated. ONCOPILOT enhanced the reproducibility and efficiency of radiologist measurements, with an inter-reader deviation of 1.7 mm when assisted by ONCOPILOT versus 2.4 mm manually (Fig. 4c, d, p < 0.05), leading to an increase in ICC(2,1) from 0.88 to 0.93. Additionally, radiologists measured faster with ONCOPILOT, averaging 17.2 s per measurement compared to 20.6 s with manual annotations (p < 0.05). Notably, this speed gain was achieved without any dedicated speed optimization, on a non-optimized web-based platform (showcased in Supplementary Fig. 3a–e). Most of the measurement time was spent locating the lesion within the exam, suggesting that ONCOPILOT could be further accelerated with targeted improvements.

Segmentation performance on an external dataset

To demonstrate out-of-distribution generalizability, we evaluated ONCOPILOT on an external multicentric cohort of real-world patients undergoing longitudinal oncological evaluation. Moreover, we compared ONCOPILOT to the ULS23 baseline and to non-specialized models: the state-of-the-art promptable foundation model SAM-Med3D-turbo20 and ONCOPILOT-zero, the pre-trained version of ONCOPILOT (see Methods).

On this dataset, ONCOPILOT achieved a mean DICE score of 0.79 (95% CI: 0.75–0.84), 0.68 (95% CI: 0.63–0.74), and 0.70 (95% CI: 0.65–0.76) for point-edit, point, and bbox modes, respectively. These results match or slightly exceed the ULS23 baseline, which exhibited a mean DICE of 0.66 (95% CI: 0.61–0.71) (see Supplementary Fig. 4a).

For comparison, ONCOPILOT-zero and SAM-Med3D (turbo version, 5 edits) achieved mean DICE scores of 0.58 (95% CI: 0.53–0.64) and 0.44 (95% CI: 0.38–0.50), respectively, demonstrating that specialized models outperform generic ones.

Of note, ONCOPILOT’s performance on lung lesions was superior in this validation cohort compared to previously reported results, with a mean DICE of 0.71 (95% CI: 0.60–0.80) vs 0.68 (95% CI: 0.66–0.70) for point, and 0.80 (95% CI: 0.73–0.86) vs 0.66 (95% CI: 0.65–0.69) for bbox. Consistently, the lung tumors in the external validation cohort were substantially larger, with a median long axis of 23 mm vs 9 mm (Supplementary Fig. 4b). Successful and failed segmentation cases from the external cohort are displayed in Supplementary Fig. 5.

Discussion

In summary, ONCOPILOT matched state-of-the-art performance in tumor segmentation across a diverse set of oncological lesions, achieving radiologist-level accuracy in RECIST 1.1 measurements. The model’s flexibility, enabled by interactive visual prompts and refinement capabilities in a radiological viewer, marks a significant advancement in integrating an explainable AI copilot into the imaging workflow while keeping the radiologist in the loop. This strategy not only minimizes inter-reader variability and reduces measurement time, but also offers greater flexibility than rigid, task-specific segmentation models.

Indeed, while recent advances have improved segmentation models considerably, they still sometimes generate errors, ranging from significant failures reminiscent of generative-model hallucinations to more subtle inaccuracies along tumor boundaries. For example, delineating certain tumors poses inherent challenges that demand nuanced expert intervention and manual refinement, especially when dealing with heterogeneous tumors, necrotic lymph node clusters, confluent lesions, or diffuse infiltrative masses. Consequently, retaining radiologist oversight is crucial, underscoring the value of an interactive editing mode. Without the ability to correct initial segmentation shortcomings, the clinical adoption of these tools could be severely limited.

Clinical evaluation of this segmentation model is needed to establish how these gains in efficiency and precision translate into meaningful improvements in real-world longitudinal oncological evaluation and influence disease status assessment. Addressing the management of multiple lesions per patient across various time points, in line with RECIST 1.1 guidelines, is a critical prerequisite for future studies.

Additionally, it will be important to examine the learning curve and workflow implications for radiologists employing this segmentation tool, especially for the editing mechanism, as well as to identify potential pitfalls in its use, given that human-AI collaborations do not always yield optimal outcomes23. However, due to the straightforward nature of our visual prompts, relying solely on bounding boxes and click points, we are confident that these segmentation tools can be easily mastered by users across a wide range of skill levels.

Regarding ONCOPILOT’s suboptimal performance on lung tumors, lesion size appears to be the main factor involved, highlighting a limitation of our model that should be addressed in future versions. Indeed, lung tumors in the test set were predominantly small nodules of uncertain oncological relevance, many of which would not qualify under the RECIST 1.1 guidelines, with a median size of 9 mm compared to 20 mm for non-lung tumors. This disproportionate representation of lung lesions (more than 40% of the test dataset) skewed the overall results to the disadvantage of our model. This hypothesis is supported by the more convincing performance of ONCOPILOT on the lung lesions from the external validation cohort, which were larger.

ONCOPILOT not only enhances the precision and consistency of RECIST-based oncological assessments but also goes beyond traditional RECIST measurements by enabling volumetric analysis and uncovering previously unexplored radiomic features. Volumetric biomarkers, such as tumor growth rate and total tumor burden, combined with morphology-based markers, offer more comprehensive and accurate indicators of tumor mass and aggressiveness compared to conventional long and short axis measurements. These novel radiomic biomarkers will better accommodate the variability in tumor presentations, providing a more precise characterization of oncological disease.

This study leverages a publicly available baseline and data21 as well as an open-source model architecture17, demonstrating that foundation models are already capable of delivering impactful results in the biomedical field without significant technical hurdles. ONCOPILOT showcases the promising potential of this technology, with future iterations expected to be significantly more advanced and effective, which will require extensive clinical validation. These results reinforce our belief that foundation models are a pivotal step toward the next generation of AI-assisted radiology.

Through this work, we aim to demonstrate oncological evaluation as the first use case for the native integration of foundation-model-based AI assistants into the radiologist’s workflow, paving the way for improved patient stratification, optimized clinical trial monitoring, more informed treatment decisions, and ultimately enhanced patient care.

Methods

Foundation model

ONCOPILOT is trained to perform segmentation tasks on 2D images using prompts such as a bounding box, a point, or a mask. It generates a 3D prediction of an anatomical structure from the input image and visual prompts. The segmentation is propagated sequentially along the z-axis, starting from the initial 2D slice, until the object’s boundaries are reached, resulting in a segmentation mask. Alternatively, propagation can stop based on predefined criteria.

Our foundation model was initialized using SAM model weights17 and underwent supervised training on diverse anatomical segmentation tasks to produce an intermediary model, ONCOPILOT-zero. It was then fine-tuned specifically on tumors to deliver the final model, ONCOPILOT. The model’s objective is a combination of DICE and cross-entropy loss. The model is trained to segment various anatomical structures on all axes and is prompted using either a point sampled randomly on the mask, a bounding box sampled around the mask with a random pixel offset, or a mask that is rotated by -10 to 10 degrees, scaled up or down by up to 10%, translated up, down, left, or right by up to 10%, and finally eroded or dilated. Random windowing augmentation is performed with a shift of up to 10%.
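The mask-prompt augmentations described above can be sketched as follows; this is a minimal NumPy/SciPy illustration under our own assumptions (function and parameter names are ours, not the authors’ implementation):

import numpy as np
from scipy import ndimage

def augment_mask_prompt(mask: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly jitter a 2D binary mask used as a visual prompt (illustrative sketch)."""
    out = mask.astype(float)
    # Rotate by a random angle in [-10, 10] degrees around the image center.
    out = ndimage.rotate(out, angle=rng.uniform(-10, 10), reshape=False, order=1)
    # Scale up or down by up to 10% about the center (shape-preserving affine transform).
    scale = rng.uniform(0.9, 1.1)
    center = (np.asarray(out.shape) - 1) / 2
    out = ndimage.affine_transform(out, np.eye(2) / scale,
                                   offset=center - center / scale, order=1)
    # Translate by up to 10% of the image size along each axis.
    out = ndimage.shift(out, [rng.uniform(-0.1, 0.1) * s for s in out.shape], order=1)
    # Erode or dilate by one iteration to perturb the boundary.
    binary = out > 0.5
    return ndimage.binary_erosion(binary) if rng.random() < 0.5 else ndimage.binary_dilation(binary)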

Finally, the model is fine-tuned on tumors following the same procedure. The pre-training took 40 h using 32 V100 GPUs (1280 GPU h) with a constant learning rate of 10⁻⁵. The fine-tuning required 10 h on an Nvidia 4090 GPU and used the same configuration.

Baseline

The baseline model (nnUnet-ResEnc+SS) was evaluated on the 10% held-out test set from the ULS23 labeled dataset of 38,693 lesions. It takes as input volume-of-interest images of size 128 × 256 × 256, padded with the minimum intensity value when necessary. To simulate a radiologist’s click on the lesion, the volume-of-interest is centered on the lesion.
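For illustration, the lesion-centered crop with minimum-intensity padding could be implemented as in the following sketch (our own reconstruction of the described input pipeline, not the challenge code):

import numpy as np

def extract_voi(volume, center_zyx, shape_zyx=(128, 256, 256)):
    """Crop a lesion-centered volume of interest, padding with the scan's minimum intensity."""
    voi = np.full(shape_zyx, volume.min(), dtype=volume.dtype)
    src, dst = [], []
    for c, size, dim in zip(center_zyx, shape_zyx, volume.shape):
        start = c - size // 2
        lo, hi = max(start, 0), min(start + size, dim)        # valid range in the scan
        src.append(slice(lo, hi))
        dst.append(slice(lo - start, lo - start + (hi - lo)))  # matching range in the VOI
    voi[tuple(dst)] = volume[tuple(src)]
    return voi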

Datasets

ONCOPILOT’s pre-training was performed using publicly available datasets with medical images and segmentation masks for general anatomy and oncological lesions:

  • 1204 CT scans from TotalSegmentator v124, with 104 labeled anatomical structures (27 organs, 59 bones, 10 muscles, 8 vessels).

  • 743 diverse tumors from the DeepLesion dataset25, curated and segmented for the ULS23 challenge21, referred to as ULS23 DeepLesion.

  • 697 bone oncological lesions and 120 pancreatic tumors from the Radboudumc hospital, available through the ULS23 dataset21.

  • 470 volumes from the multimodal MSD challenge26, using only the Lung, Colon, Pancreas datasets.

  • 700 lung nodules from the LNDb dataset27.

  • 300 kidney tumors from the KITS23 dataset28.

  • 832 liver tumors from the LiTS dataset29, also part of the MSD challenge.

  • 932 mediastinal and abdominal lymph nodes from the NIH-LN dataset30.

  • 2236 lung oncological lesions from the LIDC-IDRI dataset31.

A 90% training set was selected randomly, leaving 10% as a held-out test dataset, following the ULS23 challenge methodology. A validation set comprising 67 tumors ≥10 mm (≥15 mm for lymph nodes) was reserved for comparison against radiologists, selected from the ULS23 DeepLesion training set.

Segmentation process

The model had access to the entire volume and visual prompts. The volume was thresholded using a wide windowing range of [−500, 1000] HU, which yielded optimal results on the diverse ULS23 DeepLesion dataset (Supplementary Fig. 1a).
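As a minimal sketch, this windowing step amounts to clipping and rescaling the Hounsfield units (the normalization to [0, 1] is our assumption):

import numpy as np

def window_ct(volume_hu: np.ndarray, low: float = -500.0, high: float = 1000.0) -> np.ndarray:
    """Clip the CT volume to the [-500, 1000] HU window and rescale to [0, 1]."""
    clipped = np.clip(volume_hu, low, high)
    return (clipped - low) / (high - low)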

The model initially outputs a 2D segmentation mask for the middle slice. Then, autoregressive propagation of segmentation masks occurs across adjacent slices, using the prior mask as a prompt for the next slice, thus producing a 3D segmentation mask.
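A schematic version of this propagation loop is sketched below; segment_slice stands in for a single 2D model call, and stopping on an empty prediction is a simplification of the boundary criteria described above:

import numpy as np

def propagate_3d(volume, start_idx, start_mask, segment_slice, min_area=1):
    """Grow a 3D mask slice by slice in both directions along the z-axis."""
    mask_3d = np.zeros_like(volume, dtype=bool)
    mask_3d[start_idx] = start_mask
    for step in (+1, -1):                      # propagate upward, then downward
        prev = start_mask
        z = start_idx + step
        while 0 <= z < volume.shape[0]:
            pred = segment_slice(volume[z], prompt_mask=prev)  # prior mask as prompt
            if pred.sum() < min_area:          # stop when the lesion boundary is reached
                break
            mask_3d[z] = pred
            prev = pred
            z += step
    return mask_3d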

ONCOPILOT was evaluated using three experimental settings simulating real-life visual prompts:

  • Bounding box: A 2D bounding box outlines the lesion from the middle slice of the ground-truth mask, expanded by a 15-pixel offset.

  • Point-click: A single point is placed at the barycenter of the ground-truth mask or the nearest point within the mask.

  • Point-edit: To simulate radiologist interactions, the 3D segmentation mask from point-click mode is refined by up to 4 edits, targeting areas with prediction errors (either reducing over-segmentation or expanding under-segmentation).
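The automatic selection of edit-points can be simulated as in the sketch below. This is an illustration under our assumptions (one positive and one negative point per round, placed in the largest connected error components); segment_with_points is a hypothetical interface, not the actual ONCOPILOT API:

import numpy as np
from scipy import ndimage

def largest_component_center(error_region):
    """Return a voxel index inside the largest connected error region, or None."""
    labels, n = ndimage.label(error_region)
    if n == 0:
        return None
    sizes = ndimage.sum(error_region, labels, index=range(1, n + 1))
    coords = np.argwhere(labels == (np.argmax(sizes) + 1))
    return tuple(coords[len(coords) // 2])

def simulate_point_edits(volume, ground_truth, prediction, segment_with_points, n_edits=4):
    """Refine a prediction with up to n_edits rounds of positive/negative edit-points.

    ground_truth and prediction are boolean 3D arrays."""
    positives, negatives = [], []
    for _ in range(n_edits):
        under = ground_truth & ~prediction      # missed tumor -> positive edit-point
        over = prediction & ~ground_truth       # spilled prediction -> negative edit-point
        pos, neg = largest_component_center(under), largest_component_center(over)
        if pos is not None:
            positives.append(pos)
        if neg is not None:
            negatives.append(neg)
        prediction = segment_with_points(volume, positives, negatives)
    return prediction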

On the ULS23 DeepLesion dataset, each segmentation inference from a visual prompt takes approximately 50 ms, allowing real-time edits. The initial exam preprocessing step takes around 7 seconds for an entire CT scan on an Nvidia 4090 GPU, can be performed asynchronously, and is only performed once per exam.

Segmentation performance

DICE scores were reported to compare segmentation models’ performances on the test sets. 95% confidence intervals (CI) were computed using the standard normal distribution approach, assuming the normality of the sampling distribution.
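Concretely, the DICE score and the normal-approximation confidence interval reduce to the following sketch (assuming the CI is taken over per-lesion scores using the standard error of the mean):

import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """DICE overlap between two binary masks."""
    inter = np.logical_and(pred, truth).sum()
    return 2.0 * inter / (pred.sum() + truth.sum())

def mean_with_ci(scores: np.ndarray, z: float = 1.96):
    """Mean and 95% CI using the standard normal approximation."""
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, mean - half_width, mean + half_width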

Morphology analysis

A sphericity index is used to evaluate shape regularity, calculated as the ratio of the surface area of a perfect sphere to the surface area of the ground-truth segmentation mask for objects of equal volume. A perfect sphere has a sphericity index of 1, while irregular structures have values closer to 0. The formula is

$$S=\frac{{\pi }^{1/3}\cdot {(6V)}^{2/3}}{A}$$
(1)

where S represents sphericity, V the object volume, and A the surface area. Independent Student’s t tests were performed to compare mean DICE/RECIST measurements across predefined size, volume and sphericity groups.
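Equation (1) can be computed from a binary ground-truth mask as sketched below; estimating the surface area with marching cubes is one possible choice, as the exact estimator is not specified here:

import numpy as np
from skimage import measure

def sphericity(mask: np.ndarray, spacing=(1.0, 1.0, 1.0)) -> float:
    """Sphericity S = pi^(1/3) * (6V)^(2/3) / A for a binary mask with voxel spacing in mm."""
    volume = mask.sum() * np.prod(spacing)                               # V (mm^3)
    verts, faces, _, _ = measure.marching_cubes(mask.astype(float), level=0.5, spacing=spacing)
    area = measure.mesh_surface_area(verts, faces)                       # A (mm^2)
    return np.pi ** (1 / 3) * (6 * volume) ** (2 / 3) / area

# Group comparison as in the text, e.g.:
# from scipy.stats import ttest_ind; t, p = ttest_ind(dice_irregular, dice_spherical)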

RECIST measurement

ONCOPILOT’s RECIST measurements were based on the segmentation masks in bounding box, point, and point-edit modes. The primary measurement was the long axis of the lesion, simplified and restricted to the axial plane, even for lymph nodes. The axial plane is generally recommended as the default plane for RECIST 1.1 measures on CT scans, even when provided with isotropic sequences, to ensure comparison with non-isotropic follow-ups or previous exams1.
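The axial long axis defined above can be extracted from a predicted 3D mask as in the following sketch (restricting the pairwise search to convex-hull vertices is our implementation choice):

import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist

def axial_long_axis_mm(mask, pixel_spacing_yx=(1.0, 1.0)):
    """Longest in-plane distance across the mask, searched slice by slice in the axial plane."""
    longest = 0.0
    for axial_slice in mask:                                   # mask shape: (z, y, x)
        points = np.argwhere(axial_slice) * np.asarray(pixel_spacing_yx)
        if len(points) < 2:
            continue
        if len(points) > 3:                                    # shrink the pairwise search
            try:
                points = points[ConvexHull(points).vertices]
            except Exception:                                  # degenerate (e.g., collinear) slices
                pass
        longest = max(longest, pdist(points).max())
    return float(longest)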

Measurements from ONCOPILOT and three radiologists with ≥18 months of experience were compared against measurements inferred from ground-truth segmentation masks to calculate measurement errors. Wilcoxon signed-rank tests were performed to compare median RECIST measurements from the different visual prompts versus radiologists’ measurements. 95% confidence intervals (CI) for median measurement errors were computed using the standard normal distribution approach, assuming the normality of the sampling distribution.
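As an illustration of this paired comparison (per-lesion pairing of absolute errors is our reading of the setup; variable names are ours):

import numpy as np
from scipy.stats import wilcoxon

def compare_measurement_errors(model_mm, reader_mm, ground_truth_mm):
    """Paired Wilcoxon signed-rank test on per-lesion absolute long-axis errors (mm)."""
    model_err = np.abs(np.asarray(model_mm) - np.asarray(ground_truth_mm))
    reader_err = np.abs(np.asarray(reader_mm) - np.asarray(ground_truth_mm))
    return np.median(model_err), np.median(reader_err), wilcoxon(model_err, reader_err)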

Radiologists used a web-based viewer for both manual and ONCOPILOT-assisted measurements. They could freely adjust the windowing and navigate the volume but without multi-planar reconstruction.

ONCOPILOT integration into radiologist’s workflow

Inter-operator variability was calculated as the absolute deviation of each radiologist’s measurement from the overall average for each lesion, using either manual measurements or ONCOPILOT-assisted measurements with bounding-box prompts. A paired Student’s t test was performed to compare inter-operator deviation with and without ONCOPILOT assistance. ICC(2,1) coefficients (two-way random-effects model for absolute agreement) were reported with and without ONCOPILOT. Measurement duration was defined as the time from the initial display of the CT to the final measurement. A paired Student’s t test was performed to compare mean measurement durations.
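A minimal sketch of this deviation metric and its paired comparison is given below, assuming a lesion-by-reader matrix of long-axis measurements; the ICC(2,1) itself can be obtained from a library such as pingouin (intraclass_corr):

import numpy as np
from scipy.stats import ttest_rel

def interoperator_deviation(measurements_mm):
    """Per-lesion mean absolute deviation of each reader from the lesion-wise average.

    measurements_mm: array of shape (n_lesions, n_readers), values in mm."""
    measurements_mm = np.asarray(measurements_mm, dtype=float)
    lesion_mean = measurements_mm.mean(axis=1, keepdims=True)
    return np.abs(measurements_mm - lesion_mean).mean(axis=1)

# Paired comparison with vs. without assistance, e.g.:
# t, p = ttest_rel(interoperator_deviation(manual), interoperator_deviation(assisted))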

Segmentation performance on an external multicentric dataset

A private, anonymized external dataset was curated from the Gradient Health data platform. 37 patients from multiple centers in North and South America who underwent CT examination for RECIST 1.1 longitudinal analysis were included, totaling 87 lesions. The lesions were segmented in 3D by one radiologist and reviewed by another. This dataset was used exclusively for testing the model, with no overlap with the training datasets. DICE scores were reported, along with their stratification by lesion type.