Introduction

Prostate cancer mortality is highly dependent upon stage of disease, and assessment of metastatic disease at prostate cancer diagnosis is critical for adequate treatment planning and selection between potentially morbid treatment options. Bone scan staging remains the most widely available tool for quantifying metastatic burden and the tool best supported for basing treatment decisions upon [1, 2]. Yet, recommendations on which patients to scan vary between guidelines and the literature. Guideline recommendations are reported as weak [1] or based only upon expert opinion and grade 2A–C evidence [2, 3]. Recommendations from the primary literature were developed in small selective cohorts [4,5,6], are often based upon insensitive performance measures (such as negative predictive value), are infrequently externally validated and, where validated, were validated in small selective cohorts [7,8,9,10,11,12,13,14,15,16,17,18,19]. Head-to-head comparisons of strategies are also limited [7, 13,14,15,16].

Decision curve analysis offers a novel approach to evaluate these strategies and compare them at various levels of conservatism (preference to avoid missing a positive scan) and tolerance (preference to limit the number of people scanned). This approach compares strategies on net-benefit, which considers both the positive scans detected by a particular strategy and the number of people scanned under it, weighting these two quantities according to the degree of conservatism and tolerance preferred.

We use decision curve analysis to review and validate strategies for bone scan staging in patients with newly diagnosed prostate cancer, comparing them against major clinical guidelines. The aim is to identify optimal strategies for bone scan staging in newly diagnosed prostate cancer patients.

Subjects and methods

Identifying models used for selective bone scan staging

Models were chosen from the published literature and guidelines and validated in the South Australian Prostate Cancer Clinical Outcomes Collaborative (SA-PCCOC) database. A model was defined as any allocation of bone scan positivity risk to a group of newly diagnosed prostate cancer patients based on one or more predictors. The MEDLINE and EMBASE databases were searched for models using the keywords Prostate Cancer, Metastases, Prediction, Staging, Screening and Imaging (with related terms) and an English-only limit. Titles and abstracts were screened for relevance. Abstract-only records and reviews were manually excluded. Articles containing models predicting bone scan positivity from common clinical predictors were further assessed. Those using tests not routinely available (circulating tumour cells, cell-free DNA and similar) were excluded. Common predictors included serum prostate-specific antigen, tumour stage and Gleason score (GS) at diagnosis. We used the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [20] for quality assessment.

Validation cohort

The cohort comprised all patients diagnosed between 1 January 2005 and 26 May 2019 in the SA-PCCOC registry. This registry captures more than 90% of prostate cancer patients diagnosed in South Australia, collecting data on disease characteristics at diagnosis, initial treatment type, cause of death, time to biochemical recurrence and more. Patients are retained unless they opt out of data collection. Survival data are obtained from the births, deaths and marriages registry and are available for all patients. Only patients diagnosed before 2005 or without a diagnosis date were excluded.

Model outcome

Bone scans performed within 20 weeks of histological diagnosis were considered staging scans [21]. Indeterminate scans were reclassified as positive or negative using subsequent imaging and clinical information. Where further classification was unachievable, results were imputed.

Model predictors

Most models used serum prostate-specific antigen (PSA), tumour (T) stage and/or GS as predictors. For validation, the PSA level measured prior to treatment and closest to diagnosis was used as “PSA at diagnosis”. If all PSA levels on record were post-treatment, PSA was set as unknown and imputed. T-stage was assessed by physical examination at diagnosis. GS was based on diagnostic biopsies.

Ethics

The SA-PCCOC research committee approved the use of de-identified data, under authority delegated by the Southern Australian Clinical Human Research Ethics Committee. This study was performed in accordance with the Declaration of Helsinki 2013.

Statistical methods

Calibration

Calibration slope and calibration-in-the-large (calibration intercept) were assessed to gauge the accuracy of each model's predicted risk of bone scan positivity. These were calculated by fitting logistic regressions of observed bone scan positivity against predicted risk [22]. Calibration-in-the-large was calculated similarly, with the slope fixed at one [22]. These analyses were performed in each imputed dataset and pooled using Rubin’s rules [23]. The ideal calibration slope is one and the ideal calibration-in-the-large is zero [22]. Where predicted risk was not specified for a model risk group, the rate of bone scan positivity in the model’s development study was taken as the predicted risk (Supplementary Table 6). Calibration was not calculated for guideline models, which did not report numeric predicted risks.
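For illustration, these calibration measures can be obtained in R as in the minimal sketch below, assuming a vector y of observed bone scan results (1 = positive, 0 = negative) and a vector p of model-predicted risks (both hypothetical names); this is not the authors’ exact code.

lp <- qlogis(p)                                      # logit of predicted risk (linear predictor)
slope_fit <- glm(y ~ lp, family = binomial)          # coefficient of lp = calibration slope
citl_fit <- glm(y ~ offset(lp), family = binomial)   # slope fixed at one; intercept = calibration-in-the-large
coef(slope_fit)[["lp"]]
coef(citl_fit)[["(Intercept)"]]

In practice, these regressions would be refitted in each imputed dataset and the estimates pooled using Rubin’s rules, as described above.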

Discrimination

The area under the receiver operating characteristic curve (AUC) was used to summarize each model’s ability to discriminate between patients with a positive and a negative bone scan. AUC was interpreted in accordance with Hosmer et al. [24].
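As a sketch (again assuming hypothetical vectors y and p as above), the AUC can be computed with the pROC package; pooling across imputed datasets is omitted here.

library(pROC)
roc_obj <- roc(response = y, predictor = p)   # receiver operating characteristic curve
auc(roc_obj)                                  # area under the curve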

Decision curve analysis

Decision curve analysis was used to compare the net-benefit of models at different scanning thresholds (staging strategies) over varying degrees of conservatism (preference to avoid missing disease) and tolerance (preference to scan fewer people) [25]. Traditionally, varying degrees of conservatism and tolerance (“preference”) are reflected on the x-axis of decision curves as a probability threshold (pt), the point at which the user believes intervention is appropriate. To avoid confusion between model thresholds and pt, we used the alternative measures of preference ratio [25] and number-willing-to-test (NWT). A preference ratio of 1:99, in this context, represents a belief that scanning one hundred people to capture one positive bone scan is reasonable [25], and corresponds to a pt of 0.01 and an NWT of 100. A preference ratio of 1:9 was the upper limit of preference assessed, as it represents a willingness to scan at least ten patients to capture one positive bone scan, a number we felt was universally acceptable.
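The relationships between preference ratio, pt and NWT, and the resulting net-benefit calculation, are illustrated in the R sketch below; the function names are hypothetical and y and p are as above.

pref_to_pt <- function(a, b) a / (a + b)       # preference ratio a:b -> probability threshold
pt_to_nwt <- function(pt) 1 / pt               # e.g. pt = 0.01 -> NWT = 100
net_benefit <- function(y, p, pt) {
  scanned <- p >= pt                           # scan patients whose predicted risk reaches pt
  tp <- mean(y == 1 & scanned)                 # true positives per patient in the cohort
  fp <- mean(y == 0 & scanned)                 # false positives per patient in the cohort
  tp - fp * pt / (1 - pt)                      # false positives weighted by the odds of pt
}
net_benefit(y, p, pref_to_pt(1, 99))           # net-benefit at a preference ratio of 1:99 (pt = 0.01, NWT = 100)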

Continuous and categorical models were presented differently. As categorical models provide qualitative rather than quantitative predictions, they had fewer potential decision thresholds. They were presented as fixed strategies, akin to the presentation of a “test” in Vickers et al. [25], with each potential threshold from a categorical model displayed as a straight line across the range of preference ratios assessed (equation in Supplementary 2). Continuous models were presented both as decision-analysis curves (demonstrating the potential outcomes of using any threshold in that model) and as straight lines for the fixed strategies their source articles recommended. Strategies with higher net-benefit were considered higher performing, with the magnitude of the difference being irrelevant [25].
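For a fixed (categorical) strategy, net-benefit at each preference ratio can be written in terms of the strategy’s sensitivity, specificity and the cohort prevalence of positive scans. The sketch below uses the standard binary-test form of net-benefit, which we assume corresponds to the equation in Supplementary 2; the sensitivity, specificity and prevalence values are purely illustrative.

# Net-benefit of a fixed scanning strategy against the preference ratio expressed as odds
# (pt / (1 - pt)); in this parameterization the decision curve is a straight line.
nb_fixed <- function(sens, spec, prev, odds) {
  sens * prev - (1 - spec) * (1 - prev) * odds
}
odds <- seq(1 / 99, 1 / 9, length.out = 200)   # preference ratios from 1:99 to 1:9
plot(odds, nb_fixed(sens = 0.95, spec = 0.40, prev = 0.09, odds), type = "l",
     xlab = "Preference ratio (odds)", ylab = "Net-benefit")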

Missing data

Missing data were multiply imputed using chained equations (mice package [26]). Based upon the analyses in Supplementary 1, the reasons for missingness were considered to be well explained by, and correlated with, prostate cancer-specific overall survival, initial treatment, treatment in a public or private setting, biopsy type and disease factors, supporting the missing-at-random assumption. We imputed one hundred datasets, each with one hundred iterations, and pooled results using Rubin’s rules [23]. Kaplan–Meier curves were used to compare survival in patients with imputed positive bone scans to those with observed positive scans, and likewise for imputed negative bone scans (Supplementary 1).
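A minimal sketch of this imputation step is shown below, assuming a data frame dat of registry variables; the variable names and the analysis model are hypothetical and chosen purely for illustration.

library(mice)
imp <- mice(dat, m = 100, maxit = 100, seed = 1)   # one hundred imputed datasets, one hundred iterations each
fits <- with(imp, glm(scan_positive ~ psa + t_stage + gleason, family = binomial))
summary(pool(fits))                                # pool estimates across imputations using Rubin's rules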

All statistical analyses were performed using R version 3.4.2 [27].

Results

Validation cohort

The cohort comprised 10,721 consecutive men newly diagnosed with prostate cancer (Fig. 1), 4079 of whom had a staging bone scan, of which 354 (8.7%) were positive (Table 1). As expected, patients with positive scans had poorer survival and higher GS, PSA at diagnosis, clinical T-stage and percent positive cores on biopsy than those with negative scans. There were 150 indeterminate bone scans (3.6%, 150/4079), the majority of which (n = 135) were subsequently reclassified as negative based on follow-up imaging and data. The remaining fifteen were imputed. A total of 6642 patients had no staging bone scan result in our database. These patients had lower GS and T-stage than patients with staging bone scans on record, were more often treated in the private setting (Supplementary Table 1) and had better survival (Supplementary Fig. 2). This points towards two main mechanisms of missing data: selective use of bone scan staging (in patients thought to be at “higher risk” as per previous clinical guidelines) and restricted access to data in privately treated patients. As the difference in survival between patients with and without bone scans diminishes with stratification by risk group (Supplementary Fig. 3 and Supplementary Table 2), there is strong support for these mechanisms of missingness and thus our choice of imputation model. Supplementary 1 confirms the reliability of the imputations. Survival was almost identical in patients imputed with a positive bone scan compared to those with a known positive scan, and likewise for patients imputed with negative scans (Supplementary Fig. 1). Post-imputation cohort characteristics (Supplementary Table 3) show that the distribution of disease stage and incidence of metastatic disease in our cohort were similar to the SEER database [28].

Fig. 1: Selection of validation cohort.
figure 1

Flow diagram demonstrating the cohort selection process, exclusion criteria and cohort breakdown for selective staging strategy validation.

Table 1 Characteristics of the validation cohort prior to multiple imputation.

Model identification

Thirteen distinct models were identified from the guidelines and literature search (Supplementary Fig. 4): EAU 2020 risk strata [29], AUA 2018 risk strata [3], NCCN 2019 risk strata [30], Ho [31], Wang [32], Chybowski [33], Briganti [4], O’Sullivan [34], Lai [5], ISUP [7], Gnanapragasam [7], Wang 2 [35] and Lorente [36]. Two could not be validated (Wang 2 [35] and Lorente [36]) as they used serum alkaline phosphatase (not recorded in the database). Three provided continuous estimates of risk based on logistic regression (Ho [31], Wang [32] and Chybowski [33]), while the others categorized patients as low, intermediate, high risk or similar based upon common clinical thresholds [3, 5, 7, 29, 30, 34] or classification-and-regression-tree analysis [4]. The thresholds recommended by these models were used to select patients for bone scanning (Table 2).

Table 2 Model details and characteristics of sources.

A high risk of bias was identified in all literature-derived models due to small sample sizes, limited internal and external validation and, in some cases, biased recruitment processes (Supplementary Tables 4 and 5 and Supplementary Fig. 5). The rationale behind threshold selection for staging strategies was sometimes missing [30] or poor. Three main approaches were used to select thresholds: percent bone scan positivity (inadequate in small studies, where observed risk may not generalize) [7], negative predictive value (insensitive for rare events) and the highest point on the ROC curve (which balances sensitivity and specificity equally, although sensitivity must be prioritized in this context).

Model validation

No model had the ideal calibration-in-the-large of zero (Table 3). Most models had a positive calibration-in-the-large, indicating that they under-estimated risk on average. Lai deviated least from zero (calibration-in-the-large −0.28 [95% confidence interval, CI: −0.37, −0.19]) and Ho deviated most (−1.88 [95% CI: −1.96, −1.80]), over-estimating risk on average.

Table 3 Calibration and discrimination of models.

The calibration slope was also rarely one, the ideal (Table 3). The Wang model was closest, with a slope of 0.94 [95% CI: 0.88, 1.00], but most others deviated significantly. Those with slopes less than one (Chybowski, ISUP and Lai) over-predicted risk in high-risk groups and under-predicted it in low-risk groups (Supplementary Fig. 6), a pattern classic of overfitting. The Ho and Gnanapragasam models had slopes far greater than one. Their calibration plots suggest this was likely due to under-prediction of risk in high-risk groups for Gnanapragasam and over-prediction in low-risk groups for Ho (Supplementary Fig. 6).

AUCs ranged from 0.68 to 0.80 across models, considered “fair” by Hosmer et al. [24]. The highest AUCs were seen with Ho, Wang and Gnanapragasam (Table 3).

Strategy validation

Figure 2 summarizes the net-benefit comparisons. Part A presents the decision-analysis curves for the guideline recommendations and the two novel selective staging strategies that outperformed all other approaches: scanning EAU high-risk patients only and scanning Gnanapragasam Group 5 patients only. Part B highlights the strategy performing best at each assessed preference ratio. The scan-all strategy had higher net-benefit than all other strategies at preference ratios of 1:99–1:39 (representing a number-willing-to-test, NWT, of 100–40 to capture one positive scan). The EAU guideline recommendation was best for preference ratios of 1:39–3:97 (NWT 40–33), scanning EAU high-risk patients for preference ratios of 3:97–7:93 (NWT 32–14) and scanning Gnanapragasam Group 5 patients for preference ratios of 7:93–1:9 (NWT 13–10). Supplementary Fig. 7 contains decision-analysis curves for all strategies.

Fig. 2: Model performance by decision curve analysis.
figure 2

A Decision analysis curves for guideline recommendations and top-performing alternative staging strategies. B Stepwise plot demonstrating optimal staging strategy for each potential preference ratio.

Supplementary Table 7 presents the net-benefit of each model’s recommended staging strategy (the strategy advised by the model’s source) at different preference ratios, alongside the net-benefit of the best performing strategy within that model for that preference ratio. There was often a discrepancy, highlighting the value of using net-benefit to identify optimal staging strategies. The table also shows that fixed strategies from continuous models often had higher net-benefit than the continuous model itself at the same preference ratio. This may be due to mis-calibration.

Discussion

Bone scans are the most widely available tool for prostate cancer staging and remain the most evidence-based for guiding treatment selection [2]. Bone scan results can significantly alter the optimal treatment plan for newly diagnosed prostate cancer patients. A finding of oligometastases may shift a patient from radical curative treatment to combined radiotherapy and systemic therapy, or from systemic therapy alone to combined radiotherapy and systemic therapy. However, recommendations for bone scan staging vary and are based upon consensus opinion or models developed in small cohorts, often with selective recruitment and limited rigorous external validation. Ours is the first study to validate such a broad range of bone scan staging strategies head-to-head in a large independent cohort using net-benefit.

We found that (i) none of the commonly used models or strategies were universally superior across preference ratios, and (ii) the optimal staging strategy varied with preference ratio. The selective staging strategies that performed best were the EAU 2020 guideline recommendation (scanning patients with intermediate-risk GS 4 + 3 disease or high-risk disease), scanning EAU high-risk patients only and scanning patients in Group 5 of the novel Gnanapragasam model. The choice between them depends upon the preference ratio of conservatism and tolerance appropriate to the local health system and a given patient’s case. As bone scan results can radically alter treatment, some clinicians and patients may prefer more conservative approaches such as the EAU guideline recommendation. In other scenarios, with different patients or health systems, or in health crises, such changes in treatment or such generous scanning may not be feasible, necessitating more “tolerant” strategies, such as scanning only EAU high-risk patients or Gnanapragasam Group 5 patients.

Interestingly, at high levels of conservatism, scanning everyone had greater net-benefit than currently available selective staging strategies. This may be a result of true misses with selective staging strategies. In our pre-imputation cohort, approximately 3% (35/1086) of patients with GS 6 disease on biopsy had positive staging bone scans. These patients are often excluded from selective staging strategies, as GS 6 disease is generally thought not to metastasize. However, upgrading of Gleason 6 prostate cancer is common at radical prostatectomy [37,38,39], and these patients may have a risk of metastatic disease higher than appreciated by current selective staging strategies. Additional predictors of final grade, such as PIRADS score, may improve the accuracy of selective staging strategies at conservative preference ratios [40]. A scan-all approach may also have appeared superior to selective staging approaches because of false positives. The present literature suggests a specificity of 79% for bone scan staging [41], but patients with low-risk disease were often excluded from these studies. Our own data suggest a higher rate of false-positive scans in patients with low-risk disease, as indeterminate scans in patients with low-risk disease were classified as negative more often than in patients with high-risk disease. False positives have the potential to inappropriately alter treatment plans and lead to sub-optimal care, and thus such inclusive strategies should be used with care. Improved imaging technologies should bring fewer false positives, and conditioning future models on true positive scan results rather than all positives could also circumvent this issue.

Our analysis confirmed inaccuracies in the prediction of bone scan positivity risk by current models. Chybowski, Lai and ISUP were overfitted (calibration slope less than one), and Ho and Gnanapragasam were underfitted (calibration slope greater than one). Both are consequences of small sample sizes, with fewer than ten events (positive bone scans) per predictor variable (EPV) at model development or few events at model validation and repurposing (ISUP and Gnanapragasam) [42]. Calibration issues are likely responsible for the differences in net-benefit between continuous models and the “fixed strategies” recommended from them. Recalibration may make these models more useful. This analysis confirms the widespread problems of model development noted by Moon et al. [42], but also shows that, despite mis-calibration, the Gnanapragasam model provided a highly effective selective staging strategy, underscoring the importance of practical measures of model performance such as net-benefit.

Another key strength of our study is that it is one of the few in this field to meet the sample size requirements for reliable external validation [42, 43]. Our cohort was also derived from an opt-out, population-based registry with minimal exclusion criteria, limiting selection bias. Although missing data are a key limitation, this is a common issue in the field [42], and our study is the first to report on it in such detail and the first in the field to use multiple imputation to handle it. Additionally, we have strong evidence to support the reliability of our imputations, with post-imputation distributions of prostate cancer disease characteristics matching those expected in a prostate cancer population. Finally, while PSMA-PET use is extending to primary prostate cancer staging [44, 45], radionuclide bone scans have the most evidence in guiding treatment strategies and have FDA approval [2]. Thus, this work is of critical relevance and use now and, in future, may help evaluate PSMA-PET staging.

This study found that no single model performed best for selective bone scan staging; rather, different strategies from different models performed better over different degrees of conservatism and tolerance. Of the selective staging strategies assessed, three performed best: scanning patients as per the 2020 EAU guideline, scanning EAU high-risk patients only and scanning Gnanapragasam Group 5 patients only. Scanning only EAU high-risk patients provided the greatest net-benefit over the widest range of preference ratios (NWT 14–32), but other approaches may be preferred in settings with different degrees of conservatism and tolerance. This study provides a robust analysis that can improve bone scan use and decision making in primary prostate cancer staging now, and serves as a template for the assessment of future technologies such as PSMA-PET/CT.