Background

Melanoma is the 5th most common cancer in the UK, accounting for approximately 5% of all new cancer cases [1]. Age-standardised incidence rates of melanoma have increased by 140% in the UK since the early 1990s. Whilst prognosis for early-stage melanoma (AJCC stages I/II) is good, more deaths overall occur in patients diagnosed at these stages than in those with more advanced disease (AJCC stages III/IV) [2]. This illustrates the challenge clinicians face in making accurate survival predictions at initial diagnosis and suggests that AJCC staging alone is not sufficient in early-stage cutaneous melanoma.

Cancer staging systems are designed to combine disease variables with established prognostic associations to provide estimates of outcomes of interest, such as survival or disease progression. They are widely used to group together patients with a similar expected survival to guide decision making regarding further investigations and treatments. The American Joint Committee on Cancer (AJCC) staging systems are commonly used for numerous cancer types and are frequently updated to improve accuracy and reflect current survival trends [3].

With an increasing interest in individualised medicine, prognostic prediction tools have been developed for a variety of cancer types. These tools take additional variables into consideration, combining them with disease specific information to provide survival predictions as well as other interim outcomes of interest to patients. Such models are specifically defined in guidelines designed to standardise the reporting of new tools as: “a mathematical equation that relates multiple predictors for a particular individual to the probability of risk for the presence (diagnosis) or future occurrence (prognosis) of a particular outcome” [4].

The TRIPOD reporting guidelines [4] have been designed to increase the transparency of newly developed models. The aim is to enable potential users to understand how a model was developed, the population in which it was developed and how well it performs in that population. This enables users to assess how useful the model might be for patient groups of interest, especially in different patient populations. The development of such tools is based on a variety of statistical techniques and increasingly utilises artificial intelligence and machine learning based methods. The decision about whether to use a particular tool is therefore not a straightforward one.

Some prognostic prediction tools have proved extremely popular, suggesting that both clinicians and patients find them a useful adjunct to current standards of care; several have been endorsed by professional bodies such as the American Joint Committee on Cancer (AJCC) and the National Institute for Health and Care Excellence (NICE) in the UK [5, 6].

To highlight the complexities of selecting a prognostic prediction tool to use, this review systematically examines prognostic tools currently available for use in patients diagnosed with primary cutaneous melanoma. It examines the methodological basis of the tools and validates a subset of those recently published on a dataset derived from an unselected group of patients from a University Hospital in the UK [7].

Input variables for prognostic prediction tools in melanoma

Existing tools utilise clinicopathological variables to predict outcomes. These data are easily attainable and require little processing of tumour tissue or patient data beyond what is already done routinely. Those clinicopathological variables with established relationships with disease severity are naturally the most frequent input variables included in such models. Variables such as histological subtype, Breslow thickness, ulceration, mitotic rate, sex and age are well supported in the literature as useful prognostic indicators. Any additional variables may be included based on particular data being available to an institution and/or being noted to significantly improve model performance.

Data sources to develop prediction tools

Tools are generated by analysing patient data for those who have been diagnosed and undergone treatment for melanoma. A variety of methods can then be used to create a model of the data by determining the relationship between combinations of variables with the outcomes of interest, such as survival or sentinel lymph node positivity.

Regardless of the methods used, the data used to determine the relationships are crucial to the success of the model. For instance, a dataset that does not include any patients over the age of 60 is unlikely to produce a model that performs well when making predictions in an older patient group. This also extends to the diversity of the individuals in the dataset, and to the similarity between the population in which the model is developed and that in which it is applied. Datasets collected entirely from tertiary centres are at risk of containing a disproportionate number of patients with advanced disease, or disease requiring specialist treatment, including enrolment in trials. The development of models is often specifically focused on patients with a particular stage or type of disease, and the user must be aware of these criteria before attempting to apply any resulting tool in a wider patient group.

Validation & performance of prediction tools

Validation is the process of assessing the performance of a predictive model. This is performed on the data used to derive the model (internal validation) and should also be performed on separate data not involved in model derivation (external validation). Of these, external validation is of most interest, since the primary concern is how the model performs when making predictions in unseen data.

Performance is assessed by measuring the calibration and discrimination of the model. Calibration is a measure of the agreement between estimated risks of an outcome and the observed outcome frequencies. Discrimination is the ability of the model to differentiate between individuals that experience an outcome and those that remain event free. In models that predict survival it can be thought of as the ability of the model to correctly rank individuals by their risk.

It is important that both these aspects of model performance are assessed. Good models are both well calibrated and discriminate effectively; excellent performance in one domain cannot compensate for poor performance in the other.
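This distinction can be made concrete with a minimal sketch on synthetic numbers (not study data): the concordance statistic depends only on the ranking of predictions, so halving every predicted probability leaves discrimination untouched while ruining calibration.

```python
# Illustrative example with synthetic data: a model can discriminate well
# yet be poorly calibrated. The C-statistic (equivalent to the AUC for
# binary outcomes) counts correctly ranked event/non-event pairs.

def c_statistic(predictions, outcomes):
    """Proportion of event/non-event pairs ranked correctly (ties count 0.5)."""
    pairs = concordant = 0.0
    for p_i, y_i in zip(predictions, outcomes):
        for p_j, y_j in zip(predictions, outcomes):
            if y_i == 1 and y_j == 0:          # one event, one non-event
                pairs += 1
                if p_i > p_j:
                    concordant += 1
                elif p_i == p_j:
                    concordant += 0.5
    return concordant / pairs

outcomes  = [0, 0, 0, 0, 1, 0, 1, 1]                       # observed events
well_cal  = [0.1, 0.2, 0.65, 0.3, 0.7, 0.2, 0.8, 0.6]      # plausible risks
under_cal = [p / 2 for p in well_cal]                      # halve every risk

# The ranking is unchanged, so discrimination is identical...
assert c_statistic(well_cal, outcomes) == c_statistic(under_cal, outcomes)

# ...but the halved model now predicts far fewer events than actually occur.
print(sum(under_cal), "expected events vs", sum(outcomes), "observed")
```

Here the halved predictions retain a C-statistic of about 0.93 while systematically underestimating risk, which is exactly the failure mode that discrimination statistics alone cannot detect.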

Methods

Systematic search strategy

A database search strategy was developed to identify relevant manuscripts and online tools published between January 1985 and March 2023, following PRISMA principles [8]. An example search term is provided in Supplementary Methods. The search was restricted to articles published in English, with full text available and focusing specifically on cutaneous melanoma.

A prognostic tool was defined as any equation, nomogram, risk classification system, electronic calculator, or other tool format that had a foundation in a statistical model or algorithm, developed with the purpose of predicting survival in clinical practice [9]. All references in identified articles were scrutinised for additional relevant work meeting the search criteria.

This search initially yielded 196 results, and an additional study was identified during anonymous manuscript expert review (Fig. 1). Following removal of duplicates (102), articles relating to melanoma of sites other than the skin (2), articles not related to the development of a prognostic prediction tool for clinical use (10), articles not published in English (1), articles specifically relating to prediction for individual subtypes of melanoma or metastatic disease in one anatomical location (4) and articles using genetic or other non-clinicopathological predictors (8), 29 studies remained for inclusion in the review.

Fig. 1: PRISMA flowchart.

Figure outlines article search and filtering process identifying primary research articles with prediction models for clinical outcomes in primary cutaneous melanoma.

Articles were assessed using the CHARMS [11] and TRIPOD [4] guidelines designed to aid systematic review of prognostic prediction tools, model development and validation. The criteria set out by the AJCC for individualised risk prediction models was also utilised as a reference for model assessment [12].

External validation of existing tools for predicting a positive sentinel lymph node biopsy result

Sentinel lymph node biopsy (SLNB) has become part of standardised care pathways for melanoma. Of the tools identified by our search, seven [13,14,15,16,17,18,19] are designed to predict the probability of a positive result from this procedure. Such a prediction has potential use in clinical settings to determine an individual's risk more accurately and to better assess the balance of risks and benefits of undertaking the procedure.

A dataset containing patient data for individuals who underwent sentinel lymph node biopsy for cutaneous melanoma between 2008–2023 (n = 1564) was curated from a tertiary university melanoma centre (Addenbrooke’s Hospital, Cambridge, UK) [7]. This dataset was utilised to externally validate selected prediction tools identified in the literature search. The results provide an indication of the suitability of these models in the UK population. This study was reviewed by the Cambridge University Hospitals EHR Research and Innovation (ERIN) Database Access Committee (Reference A096904). We did not have data for patients not undergoing SLNB to validate survival or recurrence models.

Given the changing guidance regarding eligibility for SLNB it was felt that assessment of the most recently published prediction tools would be most appropriate. Those models utilising the AJCC 8th edition staging criteria were identified and included models published by Tripathi et al. (2023) [19], Bertolli et al. (2021) [18], Lo et al. (2020) [17] and Friedman et al. (2019) [16].

Sufficient information on the statistical models derived was available in the articles published by Lo et al. and Bertolli et al. Sufficient detail was provided by the authors of the Friedman et al. paper on request. Model details from Tripathi et al. were not received, so the model they describe could not be validated in our dataset.

Variables from our dataset were recoded according to the requirements of each model. The Friedman et al. model is specifically designed to make predictions for individuals diagnosed with thin melanomas (Breslow thickness 0.5–1.0 mm) [16]. The model does not permit the input of unknown or missing values for any variable, and hence only individuals with thin melanomas and recorded values for all required variables were included (n = 215). Whilst the Lo et al. calculator does allow unknown values for several variables, only those with complete data for all required variables were used for analysis (n = 1348) [17]. To validate the Bertolli et al. model, only those patients with complete data for all variables required by the model were included (n = 714) [18]. We compared the patient populations used to derive each model by variable. All provided population data were categorical and were compared with the UK population using either the chi-squared test or Fisher's exact test, with a significance threshold of 0.05.
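The population comparison for a single categorical variable can be sketched as follows. The contingency counts are hypothetical (not taken from any of the papers), and the critical value is hard-coded for one degree of freedom at the 0.05 threshold used here.

```python
# Sketch of the 2 x k chi-squared comparison used to contrast a model's
# development population with a validation cohort on one categorical
# variable. All counts below are hypothetical illustration only.

def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a list-of-rows contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical ulceration counts: rows = cohorts, columns = present / absent.
table = [[296, 704],    # development cohort
         [219, 781]]    # validation cohort
stat = chi_squared_statistic(table)

# df = (rows - 1) * (cols - 1) = 1; critical value at alpha = 0.05 is 3.841.
print(f"chi-squared = {stat:.2f}, significant at 0.05: {stat > 3.841}")
```

In practice a statistical library (e.g., `chisq.test` in R) would return an exact p-value, and Fisher's exact test would replace this statistic where expected cell counts are small.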

Model discrimination was assessed by plotting the receiver operating characteristics for each model and calculating the area under the curve utilising the R package pROC [20]. 95% confidence intervals for the area were computed using this package with 2000 stratified bootstrap replicates.
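The discrimination analysis can be sketched as follows. This is an illustrative Python equivalent of the pROC computation, not the study's actual code: the AUC is obtained via the rank (Mann–Whitney) formulation and the confidence interval via a stratified percentile bootstrap, on synthetic scores.

```python
import random

# Sketch of AUC estimation with a stratified bootstrap CI, mirroring what
# pROC's ci.auc computes. Scores and labels below are synthetic.

def auc(scores, labels):
    """AUC as the probability that an event outranks a non-event."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(scores, labels, reps=2000, alpha=0.05, seed=1):
    rng = random.Random(seed)
    pos = [(s, 1) for s, y in zip(scores, labels) if y == 1]
    neg = [(s, 0) for s, y in zip(scores, labels) if y == 0]
    stats = []
    for _ in range(reps):
        # Resample cases and controls separately (stratified bootstrap),
        # preserving the event rate in every replicate.
        sample = [rng.choice(pos) for _ in pos] + [rng.choice(neg) for _ in neg]
        s, y = zip(*sample)
        stats.append(auc(s, y))
    stats.sort()
    lo = stats[int(reps * alpha / 2)]
    hi = stats[int(reps * (1 - alpha / 2)) - 1]
    return lo, hi

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]   # synthetic predicted risks
labels = [1, 1, 0, 1, 0, 0]               # synthetic observed outcomes
point = auc(scores, labels)
lo, hi = stratified_bootstrap_ci(scores, labels)
print(f"AUC = {point:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

The stratification matters: resampling cases and controls separately keeps every bootstrap replicate's prevalence equal to that of the original cohort, which stabilises the interval when events are rare, as with positive SLNB results.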

Model calibration was assessed by plotting the observed frequency of the outcome and predicted probabilities produced by models and comparing the slope and intercept for each. It was additionally assessed by regressing the outcome onto the probability prediction produced by the model.

Results

Identification of current prognostic prediction tools

The literature search identified 29 clinical prognostic tools for use in cutaneous melanoma [14,15,16,17,18,19, 21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42]. Twelve of these tools were available to use with an online interface to calculate and display patient risk [13, 14, 17, 19, 21,22,23,24,25,26,27,28], with one of these also available as an Android app [24]. The remainder were available as a publication only. A detailed summary of identified tools can be seen in Fig. 2.

Fig. 2: Summary of identified prognostic prediction tools for cutaneous melanoma.

AJCC American Joint Committee on Cancer, TILs tumour infiltrating lymphocytes, Cox Reg Cox Regression Model (proportional hazards or logistic), LR Logistic Regression, KM Kaplan Meier, ML machine learning, Survival model predicting patient survival (disease specific or overall), Recurrence model predicting disease recurrence, SLNB result model predicting the result of a sentinel lymph node biopsy procedure, SEER Surveillance, Epidemiology, and End Results, EORTC European Organisation for Research and Treatment of Cancer.

Fifty-five input variables were used in the models identified. Breslow thickness was the most frequently used pathological variable and age the most common patient factor, appearing in 24 and 23 models, respectively. The top 10 variables are illustrated in Fig. 2, with a complete list of input variables included in Supplementary Table 1.

All articles exclude children, with the exception of Bertolli et al. [18]. Whilst the age cut-off varies between tools, no tool includes data on any individual diagnosed with cutaneous melanoma at less than fifteen years of age, except for Bertolli et al., which has an age range of 5–89 years.

From the models identified, the data used to develop them came from twenty different sources. Six population-level datasets: the Surveillance, Epidemiology, and End Results (SEER) database (USA) [19, 27, 29, 30], Dutch Pathology Registry (PALGA, Netherlands) [26], Queensland Cancer Registry (Australia) [31], Swedish melanoma registry (Sweden [16, 19]) [32], Veneto Cancer Registry (Italy) [28] and National Cancer Database (USA) [16, 19]. Nine multi-centre datasets: Pigmented Lesion Group, University of Pennsylvania (USA) [33, 34], AJCC Melanoma Database (USA) [23], Cancer Genome Atlas (USA) [25], Scottish Melanoma Group Database (Scotland) [35], Sunbelt Melanoma Trial [36] data [24], European Organisation for Research and Treatment of Cancer (EORTC) Melanoma Group Centres [37], six European melanoma centres [38] and combined datasets from five melanoma centres in the UK [15]. Eleven tools made use of single-centre datasets from: the John Wayne Cancer Institute (USA) [39], the Melanoma Institute Australia [17], Memorial Sloan Kettering (USA) [13], Edmonton, Alberta (Canada) [40], Mayo Clinic, Rochester (USA) [41], Princess Alexandra Hospital, Queensland (Australia) [42], Mass General Brigham, Dana-Farber Cancer Institute, Boston (USA) [43, 44], Massachusetts General Hospital (USA) [14, 21, 22] and the A.C. Camargo Cancer Centre (Brazil) [18].

The sample sizes used for initial model creation ranged from 68 to 156,154 patients (median 2647, IQR 979–25,930). One tool provided details of a sample size calculation and verification performed to ensure that a large enough sample was used for the chosen methodology [43].

A range of statistical methods was utilised to generate the prediction tools reviewed. These are outlined in Supplementary Table 2. 'Classical' statistical methods such as Cox and logistic regression were the most frequently utilised. Techniques based on Cox regression were the most common, utilised in sixteen of the tools reviewed. Machine learning techniques were less frequently used and, as expected, appeared in work published much more recently.

Nineteen tools provided details of internal validation methodology; however, only sixteen provided statistical results. The most common statistic provided was a concordance statistic or equivalent (e.g., the Harrell C statistic for models utilising censored data).
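For survival models with censored follow-up, the Harrell C statistic restricts attention to comparable pairs: those in which the shorter follow-up time ends in an event. A minimal sketch on synthetic data (a simplified version that ignores tied event times):

```python
# Sketch of Harrell's C-statistic for right-censored survival data. A pair
# is comparable only if the individual with the shorter follow-up time
# experienced the event; concordance means the higher predicted risk
# belongs to the earlier event. Data below are synthetic.

def harrell_c(risks, times, events):
    concordant = comparable = 0.0
    n = len(risks)
    for i in range(n):
        for j in range(n):
            # i must have the event strictly before j's event/censoring time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

risks  = [0.9, 0.4, 0.7, 0.3, 0.2]   # model-predicted risk scores
times  = [2, 4, 5, 8, 10]            # follow-up time (years)
events = [1, 1, 0, 1, 0]             # 1 = event observed, 0 = censored
print(harrell_c(risks, times, events))
```

Note that censored individuals still contribute as the longer-surviving member of comparable pairs, which is why the statistic remains usable despite incomplete follow-up.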

Fourteen tools provided details of an external validation process, with nine providing external validation statistics within the publication describing the tool. Some tools have subsequently been externally validated in separate publications, sometimes by separate authors; these are not included in this review. Concordance statistics were again the most frequently presented. For two tools, the datasets used for external validation were drawn from the same source as the training data but at different time points, providing temporal validation. The other nine tools used data from an external source, including datasets originating from another country.

External validation of tools predicting positive sentinel lymph node biopsy results

Population comparisons

Populations utilised by Friedman et al., Lo et al. and Bertolli et al. in the development of their respective tools were compared to the patient population used to externally validate them [7]. Tables 1–3 display comparisons between the populations utilised by each model and our own dataset used to assess their performance. All model development populations show significant differences in disease variables when compared with our own dataset.

Table 1 Comparison of demographic, tumour and lymph node status of patient populations used in development of Friedman et al. tool [16] and melanoma database from Cambridge University Hospital.
Table 2 Comparison of demographic, tumour and lymph node status of patient populations used in development of Lo et al. [17] prediction tool and melanoma database from Cambridge University Hospital.
Table 3 Comparison of demographic, tumour and lymph node status of patient populations used in development of Bertolli et al. [18] prediction tool and melanoma database from Cambridge University Hospital.

The comparison with the thin melanoma cohort from Friedman et al. showed several variables with adverse features present in our cohort [16]. For example, the Cambridge cohort had a higher proportion of thinner melanomas (32.2% vs 17.2% in the 0.5–0.8 mm category, p > 0.005), fewer patients with absent mitotic figures (15.4% vs 32.4%, p < 0.005) and fewer patients with dermal regression (7.0% vs 15.2%, p < 0.005) [Table 1]. This may explain why the proportion of positive sentinel lymph node biopsies was lower in the Cambridge cohort compared to Friedman et al. (4.0% vs 7.3%, p < 0.005). This is likely reflective of higher thresholds for SLNB use in the UK population for pT1b melanomas.

In the case of Lo et al., the population comparison highlighted that the Australian cohort had a significantly greater proportion of patients with adverse features [17] [Table 2]. A greater proportion of patients had thicker melanomas (≥2 mm) (46.7% vs 40.1%, p < 0.005), ulcerated tumours (29.6% vs 21.9%, p < 0.005), higher mitotic counts (54.5% vs 35.7% in the ≥4 category, p < 0.005) and evidence of lymphovascular invasion (5.8% vs 2.7%, p < 0.005). These tie in with the significantly greater proportion of positive SLNB results observed (21.0% vs 20.1%, p < 0.005). This is likely reflective of more adverse melanomas presenting in the Australian population.

Comparison of the Bertolli population with our Cambridge database population demonstrates some key differences [18] [Table 3]. Median Breslow thickness is higher in the Brazilian cohort (2.26 mm vs 1.4 mm), with a greater proportion presenting with ulcerated tumours (25.8% vs 16.1%). A greater proportion of tumours were of the acral subtype, which is associated with a significantly worse prognosis, than in the Cambridge group (8.9% vs 2.1%).

Model performance comparisons

The Friedman et al. model had an AUC of 77.1% (95% CI: 66.8–85.7%) (Fig. 3a), a significantly better result than the 67% (95% CI: 65–70%) reported in the original paper. Figure 3b, c displays the calibration plots for the model, demonstrating consistent underestimation of risk that appears to worsen with increasing frequency of observed events. It is worth reiterating that this model is only designed to make predictions for individuals with thin melanoma (0.5–1.0 mm) and hence has only been assessed on such patients from our dataset.

Fig. 3: Model performance plots for described models applied to subsets from Cambridge University database.

a–c Comparisons with the Friedman et al. model [16]: (a) receiver operating characteristic curve, area under the curve 77.1% (95% CI 66.8–85.7%); (b) calibration plot with slope = 32.92 and intercept = −4.01; (c) differences between predicted and observed probabilities in the Cambridge University dataset. d–f Comparisons with the Lo et al. model [17]: (d) receiver operating characteristic curve, area under the curve 68.1% (95% CI 64.5–71.8%); (e) calibration plot with slope = 0.44 and intercept = −1.17; (f) differences between predicted and observed probabilities in the Cambridge University dataset. g–i Comparisons with the Bertolli et al. model [18]: (g) receiver operating characteristic curve, area under the curve 68.6% (95% CI 63.3–74.1%); (h) calibration plot with slope = 4.88 and intercept = −2.75; (i) differences between predicted and observed probabilities in the Cambridge University dataset.

The Lo et al. model had an AUC of 68.1% (95% CI: 64.5–71.8%), demonstrating reasonable discriminative performance (Fig. 3d). This compares with the 74.1% (95% CI: 72.1–76.0%) result from the internal validation reported in the original paper and 75.0% (95% CI: 73.2–76.7%) from external validation using data from the MD Anderson Cancer Centre [17]. Figure 3e, f displays the calibration plots for the model and demonstrates a tendency to overestimate the risk of a positive SLNB result, particularly in the group of patients with the highest clinicopathological risk.

The Bertolli et al. model had an AUC of 68.6% (95% CI: 63.3–74.1%), demonstrating reasonable discriminative performance (Fig. 3g). The original paper describing this model reports a value of 75.1% (no CI provided) from internal validation. Figure 3h, i displays the calibration of the model for our dataset, demonstrating overestimation of the risk of a positive SLNB result, with the most accurate predictions at the extremes of observed patient risk.

Discussion

This systematic literature search has identified published prognostic prediction tools designed for use in patients diagnosed with cutaneous melanoma. They aim to make a variety of predictions, but most commonly focus on melanoma-specific survival, recurrence, and probability of a positive sentinel lymph node result. These models have been developed on datasets ranging from single-centre cohorts to national cohorts of patients. The techniques used range from classical statistical methods to newer machine learning derived methods. The data utilised and techniques employed to create the models are of interest to the end user, since they can materially impact the suitability of the model for use in other patient groups.

Validation of such prediction models is essential, and although validation analyses are often reported alongside models, this is not done in all cases. There is also a focus on presenting validation statistics that relate only to the discriminative performance of models, with the calibration component either not performed or not specifically reported. Good discriminative performance cannot make up for poor calibration and can indeed result in inaccurate predictions [45].

The three models that underwent external validation on our own dataset demonstrate poor calibration in our patient group. Two tended to overestimate risk and the other to underestimate, whilst demonstrating reasonable discriminative performance. These results suggest that they should be used in a UK population with caution and an awareness of these specific tendencies. In comparing the populations used in the development of these models with our own patient dataset, we have identified significant differences in the distribution of disease specific variables such as Breslow thickness, ulceration, and mitotic count. This further suggests that these models can be improved for use in a UK population.

In the case of the Friedman model, our dataset used for external validation is small, with a low number of positive SLNBs. This is secondary to UK melanoma guidelines for performing SLNB in patients with thin melanomas. For T1b melanomas, SLNB uptake varies internationally (18.2% in Sweden versus 28.1% in Australia) [46]. Our criteria for offering SLNB have evolved over time, in line with changes to AJCC 7th edition staging, in which a single mitotic figure classified thin melanomas (<1 mm Breslow thickness) as pT1b, and current NICE guidelines, under which SLNB can be considered for thin melanomas with a mitotic rate ≥3. Variations in practice as indications for SLNB change, both over time and between countries, should be considered when developing and applying risk prediction models.

Technical developments in the fields of genetics and genomics have enabled the development of tools based on additional molecular features such as gene expression profiles (GEPs), ctDNA and individual biomarkers. These have deliberately been excluded from this review of clinically validated prediction tools. Such tools are the consequence of developments in technology and in our understanding of the genetic basis of disease. They may present opportunities for improvements in prognostic prediction and treatment, but several issues limit their use in clinical practice at present. A review of GEP use in cutaneous melanoma [47] highlights that they are not endorsed by either the American Academy of Dermatology [48] or the National Comprehensive Cancer Network [49]. No guidance exists to specify interventions based on GEP test results, although data on use-case scenarios continue to develop.

Current GEP tests largely assign an individual's tumour to a prognostic class (high vs low risk, or class 1 vs 2) rather than calculating specific survival probabilities. This can group patients as either high or low risk despite significant differences in their clinicopathological factors that would normally be associated with different expected survival [50]. Development and validation of the available tools appear to be based on small case numbers. The studies forming the basis of the two largest commercially available GEP-based tools report using between 217 [10] and 260 [51] patient samples (DecisionDX-Melanoma) and 245 [52] patient samples (Melagenix). Some authors have also expressed concerns regarding the minimal overlap among gene panels across various studies. The review concludes that “there is insufficient data to support routine use of the currently available GEP tests” [47].

We suggest that utilising a large national dataset to develop a prognostic prediction tool for patients with cutaneous melanoma in the UK is warranted. This would serve as a valuable resource for patients and clinicians, enabling better communication about risk and the decision-making process for further investigations and treatment. It would also serve to contrast and understand differences between melanoma patient cohorts internationally.