Introduction

Soft tissue sarcomas (STS) are the most frequent sarcomas in adults representing approximately 3000–5000 new patients each year in France1. Predicting relapse after initial curative surgery in STS patients is challenging due to the heterogeneity of sarcomas in terms of clinical, radiological and histological presentations. The most important prognostic feature is the histological sarcoma grade according to the French federation of cancer centers (FNCLCC), which relies on three criteria: histological differentiation, mitotic count and tumor necrosis2. The FNCLCC 3-tier grade is particularly associated with the risk of metastatic relapse, which is the major determinant of patient prognosis3,4. However, the FNCLCC grade is imperfectly reproducible from one pathologist to another and provides limited assistance in tumors classified as intermediate grade II, which accounts for 40% of cases, particularly when dealing with small biopsy specimens.

Computational pathology is opening new possibilities for analyzing histological slides5,6,7 that may help overcome the shortcomings of the histological grade. Deep learning (DL) models, notably convolutional neural networks (CNNs) dedicated to computer vision and taking scanned histological slides as input, are increasingly used to develop predictive models (or signatures) for various purposes, including cancer subtyping, correlations with mutations of interest, treatment response prediction and prognostication8,9,10. A validated radiomics model can predict the histological type and grade of retroperitoneal sarcomas with excellent performance11. In sarcomas, recent studies have demonstrated the potential of deep learning on digital Hematoxylin, Eosin, Saffron-stained (HES) slides from gastrointestinal stromal tumors to accurately predict patient outcomes as well as PDGFRA and KIT mutational status12. DL models have also been developed to identify the five most common subtypes of sarcomas and predict disease-specific survival in leiomyosarcoma patients6. Furthermore, DL models trained on HES slides from rhabdomyosarcoma in children and young adults have shown high diagnostic accuracy in predicting molecular subtypes (PAX3/7-FOXO1 fusion, RAS, MYOD1 and TP53 alterations) with better results for the prediction of event-free and overall survival compared to molecular and clinical models13.

Therefore, our aim was to develop a novel risk score (first continuous, then categorized as appropriate into high risk or low risk) for survival prediction, derived from deep learning analysis of digitalized HES-stained slides, that could meaningfully challenge the prognostic utility of the FNCLCC histological grade. As the management and biology of peripheral sarcomas (located in the trunk walls and limbs) and intra-truncular sarcomas (located in the abdomen and mediastinum) are distinct, we focused our study on retrospective cohorts of patients affected with peripheral sarcomas managed homogeneously in our sarcoma centers14,15. Additionally, previous studies have predominantly focused on characteristics of a single tumor area selected within the central tumor bulk, often referred to as the tumor center, despite evidence of the heterogeneous nature of tumors16,17. In this study, we aimed to simultaneously investigate several areas of the tumor bulk to implement different DL models, exploring whether tumor areas other than the tumor center, such as the tumor margins, may help refine patient prognosis prediction.

Material and methods

Study design and patients

This two-center study was approved by the institutional review boards of Bergonié Institute (Bordeaux, France) and Gustave Roussy Institute (Villejuif, France), two French sarcoma reference centers, part of the French national network for the diagnosis and treatment of sarcomas (Netsarc +). All methods were performed in accordance with the relevant guidelines and regulations including the Declaration of Helsinki. The need for written informed consent with an opt-out mechanism was waived by the Ethics committee of Bergonié Institute (Bordeaux, France) because of its retrospective nature using pseudonymized data.

The study involved two initial populations, as illustrated in the study flowchart (Supplementary Fig. SF1). The first cohort included all consecutive patients from Bergonié Institute registered in the French ‘Base Clinico-Biologique’ (BCB) sarcoma database (https://conticabase.sarcomabcb.org/connect), which is approved by the National Committee for Protection of Personal Data (CNIL, no. 910390), between January 1st, 1990 and December 1st, 2020. Inclusion criteria encompassed adult patients (> 18 years old) with newly diagnosed non-metastatic primary tumors, located in the limbs or trunk walls, who underwent upfront curative surgery in a sarcoma center, and had available tissue samples for HES slide digitalization.

Exclusion criteria were: atypical lipomatous tumors (which are tumors of intermediate malignancy), patients who received neoadjuvant treatments posing a risk of denaturing the tissue sample, and patients with no available follow-up. The cohort consisted of 213 patients who were randomly allocated into a Training cohort (70% [149/213]) and a validation cohort (Validation-1, 30% [64/213]).
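The random 70/30 allocation can be illustrated with a minimal sketch. The function name `split_cohort`, the fixed seed and the use of integer patient identifiers are assumptions made for illustration; the source does not specify how the random allocation was implemented.

```python
import random

def split_cohort(patient_ids, train_fraction=0.7, seed=0):
    """Randomly split patient identifiers into a training and a validation list."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# 213 hypothetical patient identifiers, as in the Bergonié population
train_ids, val1_ids = split_cohort(range(213))
print(len(train_ids), len(val1_ids))  # 149 64
```

With 213 patients, a 70% training fraction rounds to 149 patients, leaving 64 for Validation-1, which matches the cohort sizes reported above.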

A second population of 95 patients from Gustave Roussy Institute, fulfilling the same inclusion criteria for a period of inclusion between January 1st, 1998 and December 1st, 2016, was used as a second independent validation cohort (Validation-2). Hence, three distinct cohorts were studied: a Training cohort from Bergonié Institute, an internal Validation-1 cohort (or testing set) also from Bergonié Institute, and an external Validation-2 cohort from Gustave Roussy. We chose this approach to ensure a rigorous evaluation of model generalizability at multiple levels. Using a separate internal Validation-1 cohort from the same institution as the training set allowed us to assess the model's ability to generalize to unseen patients from the same clinical setting, while mitigating the risk of overfitting to the training data. The external Validation-2 cohort from Gustave Roussy was used to further evaluate the model's robustness across institutions, encompassing potential differences in clinical practice, population characteristics, or image acquisition.

The primary endpoint was metastatic relapse-free survival (MFS), defined as the time (in months) from curative surgery to the occurrence of a distant metastatic relapse or last patient follow-up. In accordance with prior sarcoma studies3,4, patients who died without experiencing a metastatic relapse were censored at the time of death. Although non systematic, this cause-specific definition of MFS was chosen to allow direct comparability with earlier literature and to maintain consistency with Cox regression analyses, while acknowledging that death represents a competing risk for the event of interest.

Other endpoints included overall survival (OS) and local relapse-free survival (LFS), defined as the time elapsed from curative surgery to death and local relapse, respectively. The clinical and follow-up data of all included patients were updated for the purpose of this study.
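The cause-specific definition of MFS given above (metastatic relapse as the event, censoring at death without metastasis or at last follow-up) can be sketched as follows. The function name `mfs_endpoint`, its argument names and the days-to-months conversion factor are illustrative assumptions, not the study's actual data pipeline.

```python
from datetime import date

DAYS_PER_MONTH = 365.25 / 12  # assumed conversion factor

def mfs_endpoint(surgery, metastasis=None, death=None, last_follow_up=None):
    """Return (time_in_months, event) for cause-specific MFS.

    event = 1 for a distant metastatic relapse; event = 0 when the patient is
    censored, either at death without metastasis or at last follow-up.
    """
    if metastasis is not None:
        end, event = metastasis, 1
    elif death is not None:
        end, event = death, 0
    elif last_follow_up is not None:
        end, event = last_follow_up, 0
    else:
        raise ValueError("at least one end date is required")
    return (end - surgery).days / DAYS_PER_MONTH, event

# Hypothetical patient who died two years after surgery without metastasis
time_months, event = mfs_endpoint(date(2010, 1, 1), death=date(2012, 1, 1))
print(round(time_months, 1), event)
```

In a competing-risk analysis, the same patient would instead be coded as having experienced the competing event (death without metastasis) rather than being censored.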

Data collection

We extracted the following variables from the BCB sarcoma database: age at diagnosis, histological type (further categorized as in the Sarculator nomogram, i.e., leiomyosarcoma [LMS], dedifferentiated liposarcoma, myxoid/round cell liposarcoma [M/RC-LPS], malignant peripheral nerve sheath tumor [MPNST], myxofibrosarcoma, synovial sarcoma [SS], undifferentiated sarcoma, vascular sarcoma and others)18 and FNCLCC histological grade performed on the complete surgical specimen (all reviewed by referent pathologists in sarcoma). Regarding M/RC-LPS, according to the last WHO classification, myxoid and round cell liposarcomas belong to the same histotype and the distinction between both subtypes may be subjective. In fact, myxoid liposarcomas were graded 1 or 2 according to the presence of necrosis or not, and round cell liposarcomas were graded 2 or 3 according to the presence of necrosis and to the mitotic index. The data collection also comprised tumor location (categorized as upper limb, lower limb or trunk wall), depth (categorized as superficial [i.e., located entirely above the superficial muscular fascia and not invading it], deep [i.e., located beneath the superficial muscular fascia or invading it] or deep and superficial [i.e., when the tumor extends across the fascia, involving both superficial and deep compartments]) and size (in mm), surgical margins (categorized as: R0 or R1), adjuvant chemotherapy and radiotherapy, dates of curative surgery, local relapse, metastatic relapse and death or last follow-up.

Review and annotation of histology slides

For all included patients, all slides from the surgical specimen were entirely reviewed by senior pathologists (F.L.L., C.N., J.M.C.), with a total of 2 to 87 slides per case (mean: 21.6 and median: 18) and 1 to 46 slides with tumor (mean: 11.7 and median: 10). At least one slide per case was selected for having a representative view of tumor centrum, tumor periphery and peritumoral tissue, and 1 to 4 slides were selected per case with a total of 407 slides. When multiple slides were available, we selected the most cellular one with the highest number of mitoses, as we do when determining the FNCLCC grade. Tumor centrum (C) was defined as the tumor tissue located far from the periphery of the tumor and at least at a distance of 10 mm from its margin for the smallest tumors. Tumor periphery (P) was defined as the tumor tissue between the margin and 10 mm from this margin. Peritumoral tissue (R) was defined as the margin non-tumoral tissue in direct continuity with the tumor area. Figure 1A depicts the demarcation of each area reported on whole slide images (WSI). These HES slides were digitalized using a Hamamatsu Nanozoomer S360 scanner at 40-fold magnification (Hamamatsu, Japan). The TOOLKIT software was used for pseudonymization. The digital slides were annotated with the NDP.view2 software (Hamamatsu, Japan). Annotation of the 3 areas (C, P, R) was done on digital slides with the annotation tool of the Hamamatsu Nanozoomer Series scanner and subsequently submitted for DL analysis. For tumors with infiltrative margins, immunohistochemistry was used to precisely define the boundary between the tumor and normal tissues, particularly with dedifferentiated liposarcomas (HMGA2 and MDM2 antibodies).
However, defining the margins between apparently normal (R) and tumor tissues (P) could be challenging for undifferentiated pleomorphic sarcoma (UPS) and myxofibrosarcoma given the lack of specific markers for these histological subtypes, with particular difficulty in the case of myxofibrosarcoma due to their often infiltrative nature. Among the 31 patients with myxofibrosarcoma, 9 were well circumscribed, 12 showed focal infiltration, and 10 were diffusely infiltrative. In the latter cases, we considered the peripheral tissue to be normal (R) when histological examination did not reveal definite tumor cells, and defined the peripheral tumor tissue (P) as the transitional zone containing both tumor and normal tissue. We acknowledge that this method may introduce a part of subjectivity, which could affect consistency.

Fig. 1

Three-stage workflow for predicting metastasis risk with the DL model: (A) Demarcation of each area reported on whole slide images (WSI), with Ct: tumor center, P: tumor periphery, R: adjacent non-tumoral tissue; the WSIs are pre-processed to extract the tiles and their features. (B) Feature extraction using a pre-trained DL extractor (e.g. ResNet50). (C) Deep attention-MIL model to predict the risk of metastasis for each patient.

The cohort included different tumor histotypes, with 6 sub-groups ultimately retained: UPS, myxofibrosarcoma, LMS, M/RC-LPS, other liposarcomas (dedifferentiated and pleomorphic), and others, comprising SS, low grade fibromyxoid sarcoma (LGFMS), and other rarer entities (malignant solitary fibrous tumours, extraskeletal myxoid chondrosarcoma, rhabdomyosarcoma, MPNST, extraskeletal osteosarcoma, clear cell sarcoma, alveolar soft tissue sarcoma, angiosarcoma and epithelioid sarcoma).

Development of DL models from digital WSI

We used a three-stage procedure (Fig. 1B) to generate the risk score for each patient based on different areas of digitized tumor tissue and their combinations: (i) we extracted image tiles (patches) from selected areas of the HE slides; (ii) we computed imaging features from each tile using a pre-trained deep learning model, e.g., ResNet5019; and (iii) we derived an attention-based deep learning model to predict the metastasis risk score for each patient using imaging features from the tiles. These steps are described in detail below.

Tile extraction

We first extracted the tiles associated with each area of interest (from the center of the tumor to the R margins). These zones were delineated by pathologists with more than 10 years of experience. The extraction step was performed according to the delineated areas on the image at 40X magnification9. The original image was divided into non-overlapping tiles with a size of 224 × 224 (W × H) pixels. Based on the proportion of tissue, the tiles were divided into 4 groups: group A consists of the tiles that contain more than 80% tissue, group B includes the tiles that contain more than 10% and less than 80% tissue, group C includes the tiles that contain less than 10% tissue, and group D includes the tiles that contain no tissue. In this work, only tiles containing at least 10% tissue were considered (i.e. groups A and B). The number of tiles depends on the size of the selected area and can vary from a few tens of thousands to several hundred thousand tiles. The average number of tiles per patient in the Bergonié and Gustave Roussy cohorts is around 90 K and 67 K, respectively. The average number of tiles per area (C, P and R) per patient in the Bergonié cohort corresponds to 38 K, 35 K and 18 K tiles; these values in the Gustave Roussy cohort are 32 K, 23 K and 12 K for areas C, P and R, respectively.
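The tiling and tissue-based filtering described above can be sketched as follows. The helper names `tile_origins`, `tile_group` and `keep_tile` are hypothetical, and the handling of the exact 10% and 80% boundaries (assigned here to group B) is an assumption, since the text leaves these edge cases unspecified. How the tissue fraction itself is measured is outside the scope of this sketch.

```python
def tile_origins(width, height, tile=224):
    """Top-left corners of non-overlapping tiles lying fully inside a region."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]

def tile_group(tissue_fraction):
    """Assign a tile to group A, B, C or D from its tissue fraction (0.0 to 1.0)."""
    if tissue_fraction > 0.80:
        return "A"
    if tissue_fraction >= 0.10:  # boundary handling is an assumption
        return "B"
    if tissue_fraction > 0.0:
        return "C"
    return "D"

def keep_tile(tissue_fraction):
    """Only groups A and B (at least 10% tissue) are kept for analysis."""
    return tile_group(tissue_fraction) in ("A", "B")

# Hypothetical tissue fractions for five tiles
fractions = [0.95, 0.40, 0.05, 0.0, 0.10]
print([tile_group(f) for f in fractions])  # ['A', 'B', 'C', 'D', 'B']
print(sum(keep_tile(f) for f in fractions))  # 3
```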

Feature extraction

This step is performed using a pre-trained deep learning model, ResNet5019, to capture 2048 features from each tile. Therefore, for each slide, we obtain a matrix of N (tiles) × 2048 (features) (where N is the number of extracted tiles of the slide). Since it was not feasible to use all tiles from an entire slide due to computational constraints, we instead randomly sampled a subset of 10,000 tiles per epoch for training the models9.
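The per-epoch subsampling of tiles can be sketched as below. The function name `sample_tile_indices` is hypothetical, and sampling without replacement is an assumption, since the text only states that 10,000 tiles were randomly sampled per epoch. The returned indices would select rows of the N × 2048 feature matrix.

```python
import random

def sample_tile_indices(n_tiles, max_tiles=10_000, rng=None):
    """Indices of a random subset of tiles (without replacement) for one epoch."""
    rng = rng or random.Random()
    if n_tiles <= max_tiles:
        # Fewer tiles than the budget: keep them all
        return list(range(n_tiles))
    return sorted(rng.sample(range(n_tiles), max_tiles))

# A hypothetical slide with 90,000 extracted tiles
indices = sample_tile_indices(90_000, rng=random.Random(42))
print(len(indices))  # 10000
```

Drawing a fresh subset at every epoch lets the model see different tiles over the course of training while keeping the memory footprint of each step bounded.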

Generating DL scores for all patients20

To generate the DL scores for each patient based on the input tiles of each area and realistic incremental combinations of these areas, we used attention-based deep multiple instance learning21. Our model uses the extracted features from the tiles (of the WSI) and outputs the risk score for each patient. This model can be broken down into three parts: (i) the first part consists of the layers before the attention module to aggregate the features of each tile; (ii) the second part is an attention module21, which provides the relative importance weights for each tile and aggregates the tile features to obtain features at the slide level; (iii) the layers in the last part are used to predict the risk of metastasis for each patient (Fig. 1C). The model was trained to predict MFS as a time-to-event outcome using a survival-based deep learning framework. Right-censored data (i.e., patients without metastasis at last follow-up) were incorporated through a loss function adapted for censored observations, such as the negative partial log-likelihood from the Cox proportional hazards model. This allows the model to learn from both censored and uncensored patients without biasing survival estimates. The risk scores were normalized to a fixed range of [− 1, 1] to harmonize the results of different datasets. The DL model and its implementation are described in detail in Supplementary Method SM1. The algorithm was trained on the Bergonié cohort using different tumor areas (C, P, R) and various combinations to produce five DL models. More precisely, these models produced five DL risk scores: DL-C (for tumor centrum alone), DL-P (for the tumor periphery alone), DL-R (for tumor margin R), DL-CP (combination of tumor centrum and periphery) and DL-CPR (combination of all areas). The DL models were then applied on the Validation cohorts to generate the corresponding DL scores.
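The two central ingredients named above, attention pooling of tile features and a Cox-type loss for censored data, can be sketched in plain Python. This is a simplified illustration under stated assumptions: the attention form (a softmax over w · tanh(V h)) follows the generic attention-based MIL formulation, the parameters `V` and `w` are hypothetical placeholders for learned weights, and the loss ignores tied event times. The architecture actually used is the one described in Supplementary Method SM1, which is not reproduced here.

```python
import math

def attention_pool(tile_features, V, w):
    """Attention pooling: weight a_k = softmax_k(w . tanh(V h_k)).

    Returns the slide-level feature vector and the per-tile attention weights.
    """
    scores = []
    for h in tile_features:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * ui for wi, ui in zip(w, hidden)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tile_features[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, tile_features))
              for j in range(dim)]
    return pooled, weights

def cox_neg_partial_log_likelihood(risks, times, events):
    """Negative Cox partial log-likelihood (no correction for tied times).

    Censored patients (event = 0) contribute only through the risk sets.
    """
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i:
            risk_set = sum(math.exp(r) for r, t in zip(risks, times) if t >= t_i)
            loss -= risks[i] - math.log(risk_set)
    return loss

# Two hypothetical tiles with 2 features each (real tiles have 2048 features)
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], V=[[0.5, -0.5]], w=[1.0])
print(weights, pooled)
```

The attention weights sum to one, so the slide-level vector is a weighted average of tile features, and the weights themselves indicate which tiles the model considers most informative.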
The normalization of DL scores to the fixed range [− 1, 1] was performed using the minimum and maximum DL score values calculated from the Training cohort to ensure consistency. Finally, for binarization, we computed the median of each DL score in the Training cohort and used these thresholds to dichotomize patients in every cohort into ‘low risk’ (< median DL score in the Training cohort) and ‘high risk’ (≥ median DL score in the Training cohort).
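The normalization and dichotomization steps can be sketched as follows. The function names `fit_scaler` and `risk_group` and the example scores are hypothetical. Note that, with training-derived bounds, a validation score may fall slightly outside [− 1, 1]; whether such values were clipped is not stated, so no clipping is applied here.

```python
from statistics import median

def fit_scaler(train_scores):
    """Build a min-max scaler to [-1, 1] from the Training cohort scores only."""
    lo, hi = min(train_scores), max(train_scores)
    def scale(score):
        return 2.0 * (score - lo) / (hi - lo) - 1.0
    return scale

def risk_group(score, threshold):
    """'high risk' if the score is at or above the training median, else 'low risk'."""
    return "high risk" if score >= threshold else "low risk"

# Hypothetical raw model outputs
train_raw = [0.2, 1.4, 0.8, 2.0, 0.5]
val_raw = [0.3, 1.9]

scale = fit_scaler(train_raw)
train_scaled = [scale(s) for s in train_raw]
threshold = median(train_scaled)  # threshold learned on Training, reused elsewhere
print([risk_group(scale(s), threshold) for s in val_raw])
```

Fitting both the scaler and the threshold on the Training cohort alone keeps the validation cohorts untouched by any information derived from their own score distribution.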

Comprehensive histological analysis of the DL score

To better understand the outputs of the best-performing DL model, we selected a total of 20 patients: 5 patients with the lowest DL scores and 5 patients with the highest DL scores from the Training cohort, and similarly, 5 patients with the lowest DL scores and 5 patients with the highest DL scores from the Validation-1 and Validation-2 cohorts. Next, we extracted 50 tiles per patient, which were retrospectively analyzed by a senior pathologist with 45 years of experience in sarcoma pathology (J.M.C.), blinded to any patient data or model output. The pathologist analyzed these 20 patients and collected the following variables: tumor cellularity (categorized as: < 10%, 10–50% or > 50%), tumor stroma (categorized as: absent, chondroid, fibromyxoid, fibrosis or myxoid), main cell type (categorized as: epithelioid, pleomorphic, round or spindle cells), atypia (categorized as: no or very mild [0], moderate [1] or severe [2]), hyperchromasia (categorized as: absent or present), mitosis (categorized as: absent, 1 mitosis or 2 mitoses), necrosis (categorized as: absent or present), red blood cells (categorized as: absent or present), tumor differentiation (categorized as: absent, chondroid or smooth muscle), tumor infiltration (categorized as: absent or present [by lymphocytes ± plasmocytes]) and vessels (categorized as: absent or present).

Statistical analysis

Statistical analyses were performed with R (v4.1.0, The R foundation for Statistical Computing, Vienna, Austria). All tests were two-tailed. Significance was set at p < 0.05. Survival analysis utilized the ‘survival’, ‘pec’, and ‘survminer’ packages22.

Descriptive and exploratory analysis

Descriptive statistics were presented for categorical variables as numbers and percentages, and for numeric variables as mean ± standard deviation, or median with range (minimum–maximum) and interquartile range (Q1–Q3). In the entire population, correlations between continuous DL scores and tumor size were assessed using the Spearman rank test. Associations between continuous DL scores and histologic types and grade were investigated using one-way analysis of variance (ANOVA-1) or the Friedman rank test with post-hoc Tukey or Mann–Whitney tests (depending on the Shapiro–Wilk normality test), corrected for multiple comparisons with the Benjamini–Hochberg adjustment.
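The Benjamini–Hochberg adjustment mentioned above can be illustrated with a short sketch. The study's analyses were run in R; this Python version, with the hypothetical function name `benjamini_hochberg` and made-up P values, is given only to show the computation.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Three hypothetical post-hoc P values
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Each raw P value is multiplied by the number of tests divided by its rank, and the running minimum ensures that adjusted values never decrease as raw P values increase.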

Univariable MFS analyses including competing risk analysis

Since deaths may preclude the occurrence of metastatic relapse, cumulative incidence functions (CIF) for distant metastasis were estimated, treating death without metastasis as a competing event. Gray’s test was used to compare CIFs between DL-score groups and FNCLCC grades. This approach was used solely for descriptive visualization of metastatic incidence under competing risks for the main study objective and did not modify the Cox regression analyses. Classical Kaplan–Meier curves for MFS were also drawn for the five dichotomized DL-scores and all covariables in the Training, Validation-1 and Validation-2, and difference in survivals were tested with the log-rank test. Univariable Cox regressions estimated hazard ratios (HRs) with 95% confidence intervals (CIs). A subgroup analysis for grade II STS patients was also conducted.
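The cumulative incidence function under competing risks can be sketched with a nonparametric Aalen–Johansen-type estimator. The function name `cumulative_incidence` and the event coding (0 = censored, 1 = metastatic relapse, 2 = death without metastasis) are assumptions for illustration; the study's estimates were obtained in R, and Gray's test is not implemented here.

```python
def cumulative_incidence(times, events, cause=1):
    """Nonparametric cumulative incidence for one cause under competing risks.

    events: 0 = censored, 1 = metastatic relapse, 2 = death without metastasis.
    Returns a list of (time, cumulative incidence) at each distinct time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0  # probability of being free of any event just before time t
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_time = [e for tt, e in data if tt == t]
        d_cause = sum(1 for e in same_time if e == cause)
        d_any = sum(1 for e in same_time if e != 0)
        cif += surv * d_cause / n_at_risk
        surv *= 1.0 - d_any / n_at_risk
        curve.append((t, cif))
        n_at_risk -= len(same_time)
        i += len(same_time)
    return curve

# Four hypothetical patients: relapse, death, censoring, relapse
times = [1, 2, 3, 4]
events = [1, 2, 0, 1]
print(cumulative_incidence(times, events, cause=1))
print(cumulative_incidence(times, events, cause=2))
```

Unlike one minus the Kaplan–Meier estimate, which treats deaths as censored, this estimator lowers the pool of patients who can still relapse whenever a competing death occurs.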

Multivariable analyses

We then designed a multivariable analysis similar to that of the Sarculator nomogram to investigate the prognostic significance of the five DL scores and compared them with an analogous model incorporating the FNCLCC grade. For each dichotomized score and grade (I vs. II–III, as per the latest ESMO guidelines)14, multivariable Cox regressions including size (continuously), age (continuously), histologic type (categorical variable with M/RC-LPS as the reference category) and the score of interest were trained in the Training cohort. The models were then applied on Validation-1 and Validation-2, and their performances in those two validation cohorts were estimated with the Harrell concordance index (c-index), which ranges from 0 (worst possible) to 1 (perfect model)23. Pairwise c-index comparisons were conducted via bootstrapping on 1000 replicates (‘boot’ package). Calibration curves were plotted for all multivariable models22.
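For readers unfamiliar with the concordance index, a minimal sketch of Harrell's c-index is given below. The function name `harrell_c_index` and the example data are hypothetical; pairs with tied survival times are ignored in this simplified version, and the bootstrap comparison of c-indices is not shown.

```python
def harrell_c_index(risks, times, events):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when patient i had the event and patient j was
    still event-free at that time. It is concordant when the patient with the
    earlier event has the higher predicted risk; tied risks count for 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical predicted risks, follow-up times (months) and event indicators
risks = [0.9, 0.4, 0.7, 0.1]
times = [12, 60, 30, 80]
events = [1, 0, 1, 0]
print(harrell_c_index(risks, times, events))  # 1.0
```

A value of 0.5 corresponds to random ordering of patients, which is the practical reference point when interpreting the c-indices reported in this study.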

Other survival outcomes

The Kaplan–Meier curves for LFS and OS were generated for the five dichotomized DL-scores in the Training, Validation-1 and Validation-2 cohorts, complemented with log-rank tests.

Histological features associated with high and low risk DL scores

Associations between categorical histological features and the low- and high-risk groups were evaluated using Chi-square tests, both across the entire tile dataset and stratified by tile location.
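The Chi-square statistic used for these associations can be sketched as follows. The function name `chi_square_statistic` and the counts are hypothetical; the sketch returns the Pearson statistic and its degrees of freedom without continuity correction, and leaves the P value to a statistical library.

```python
def chi_square_statistic(table):
    """Pearson Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical tile counts: rows = low/high risk group, columns = feature absent/present
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
print(round(stat, 3), dof)  # 6.667 1
```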

Results

Patient and tumor characteristics in the three cohorts

Patient characteristics are summarized in Table 1. In the Training cohort, 75/149 patients (50.3%) were women, compared to 23/64 (35.9%) in Validation-1 and 37/95 (28.9%) in Validation-2 (P = 0.0781, Chi-square test). The median age was 67 years (Q1–Q3: 52–79) in Training, 52.5 years (Q1–Q3: 37.5–75) in Validation-1, and 58 years (Q1–Q3: 44.5–72) in Validation-2 (P = 0.0011, ANOVA-1). Tumor size averaged 96.1 ± 57 mm in Training, 96 ± 52.4 mm in Validation-1, and 84 ± 44 mm in Validation-2 (P = 0.1720, ANOVA-1). The proportion of patients with FNCLCC grade III sarcomas was 70/149 (47%) in Training, 27/64 (42.2%) in Validation-1, and 52/95 (54.7%) in Validation-2 (P = 0.3110, Chi-square test). Metastatic relapses were observed in 45/149 (30.2%) patients in Training (5-year MFS probability: 67.3%, 95%CI 59.6–76.1), 20/64 (31.3%) in Validation-1 (5-year MFS probability: 70.3%, 95%CI 59.1–83.6), and 33/95 (34.7%) in Validation-2 (5-year MFS probability: 63.8%, 95%CI 54.3–75.1) (Supplementary Table ST1). There were no significant differences in MFS probabilities among the three cohorts (P = 0.8000, log-rank test).

Table 1 Characteristics of the study population.

Understanding the DL risk scores

All continuous and binarized DL risk scores exhibited strong associations with both the FNCLCC grade and histologic types (all P values < 0.0001, ANOVA-1 and Chi-square tests), demonstrating higher scores in leiomyosarcomas and the undifferentiated sarcoma group, as well as in grade III tumors (Supplementary Table ST2, Supplementary Fig. SF2). However, no association was observed between the DL risk scores and tumor size. Additionally, significant correlations were found between the risk scores of DL-C and DL-R (Spearman rho = 0.258, 95%CI 0.146–0.358, P < 0.0001), as well as between the risk scores of DL-P and DL-R (Spearman rho = 0.305, 95%CI 0.202–0.403, P < 0.0001).

Univariable survival analyses

Figure 2 depicts the CIF curves for DL-C, DL-P, DL-R, DL-CP, DL-CPR and grade in the 3 cohorts with corresponding P values according to the Gray test (Supplementary Fig. SF3 represents the corresponding classical Kaplan–Meier curves). Univariable analysis for the dichotomized DL risk scores and clinical and histological covariables are detailed in Table 2. In the Training cohort, all DL risk scores showed significant associations with MFS according to log-rank tests (P value range: 0.0004 [for DL-C] to < 0.0001 [for all other scores]) and Gray’s test (P value range: 0.0004 [for DL-C] to < 0.0001 [for all other scores]). However, DL-R groups did not exhibit associations with MFS in Validation-1 (log-rank P = 0.0979 and Gray’s test P = 0.0935) or in Validation-2 cohort (log-rank P = 0.3696 and Gray’s test P = 0.7485). Moreover, the DL-CPR score did not demonstrate associations with MFS in Validation-2 (log-rank P = 0.0747 and Gray’s test P = 0.3622). In contrast, all other combinations revealed significant associations between high-risk groups and lower MFS. Regarding histologic grade, it was associated with MFS in Training and Validation-2 (log-rank P = 0.0122 and P = 0.0002, respectively; Gray’s test P = 0.0238 and 0.0005, respectively), but not in Validation-1 (log-rank P = 0.0712 and Gray’s test P = 0.0565).

Fig. 2

Cumulative incidence function (CIF) curves for metastatic relapse (solid lines) and death without prior metastasis (dotted lines) according to the five deep learning (DL) scores and FNCLCC grade, shown separately in the Training, Validation-1, and Validation-2 cohorts. For each panel, the y-axis indicates the cumulative probability of the event of interest over time since curative surgery, accounting for death as a competing risk. DL scores were based on the tumor centrum (C), periphery (P) and surrounding tissues (R) and their incremental combinations (CP, CPR). P values from Gray’s tests for differences in metastatic relapse-free survival are reported: *P < 0.05, **P < 0.005, ***P < 0.001.

Table 2 Univariable survival analysis for metastatic relapse-free survival in the three cohorts.

Subgroup analysis in patients with grade II tumors

There were 112/308 (36.4%) grade II patients in total. Among these patients, 25/112 (22.3%) experienced metastatic relapses, with 54 patients in Training (including 12 [22.2%] metastatic relapses), 24 patients in Validation-1 (including 7 [29.2%] metastatic relapses) and 28 patients in Validation-2 (including 6 [21.4%] metastatic relapses). Univariable survival analysis depending on the DL risk score is given in Supplementary Table ST3 and the Kaplan–Meier curve in Supplementary Fig. SF4. No significant differences were observed with dichotomized DL-R and DL-CPR scores across all cohorts. Regarding the model DL-P, a significantly lower MFS probability was found in the high-risk group in the Training and Validation-1 cohorts (P = 0.0014 and P = 0.0176, respectively, log-rank tests).

For DL-C and DL-CP risk scores, a similar association was only evident in Training (P = 0.0199 and P = 0.0038, respectively) but not in Validation-1 and Validation-2. Notably, in Validation-2, no significant associations were found regardless of the DL risk score tested.

Multivariable analysis

The results for all multivariable modeling with HRs for each input variable are detailed in Table 3. Each DL risk score emerged as an independent predictor of MFS. Notably, the Cox regression trained on the Training cohort revealed lower survival rates in the high-risk group across all DL risk scores. Specifically, the HR values were as follows: 3.28, 95% CI 1.56–6.89 (P < 0.0001) for DL-C; 23.41, 95% CI 6.98–78.53 (P < 0.0001) for DL-P; 3.74, 95% CI 1.76–7.96 (P < 0.0001) for DL-R; 8.82, 95% CI 3.39–22.96 (P < 0.0001) for DL-CP; and 6.86, 95% CI 2.91–16.18 (P < 0.0001) for DL-CPR. The FNCLCC grade (with its 3 categories) did not exhibit a significant association with MFS in the Training cohort (P = 0.5692 for grade II and P = 0.1021 for grade III, respectively, with grade I as the reference). However, when combining grade II and grade III (with grade I as the reference), a significant and independent association with MFS was observed (HR = 1.87, 95% CI 1–1.05, P = 0.0444), prompting the adoption of this combination for subsequent analyses.

Table 3 Results and performances of the multivariable modeling involving the deep learning risk scores and the FNCLCC grade.

Comparing the predictive performances, in Validation-1, the highest c-index was reached with the DL-CPR model (c-index = 0.786) followed by the DL-C model (c-index = 0.759). In Validation-2, the highest c-index was reached with the DL-C model (c-index = 0.741) followed by the DL-CP (c-index = 0.698) and DL-CPR models (c-index = 0.698). Notably, the c-indices of a similar model substituting risk scores with grade were 0.739 in Validation-1 and 0.723 in Validation-2, with no significant difference observed against DL-based models. Similarly, no significant differences were noted when comparing the DL models against the FNCLCC grade (Supplementary Table ST4). Detailed c-index comparisons and calibration curves for all models across the three cohorts are illustrated in Fig. 3.

Fig. 3

Results of the multivariable modeling in the validation cohorts. Barchart of the concordance index (c-index) with 95% confidence intervals (95%CIs) in Validation-1 (A) and Validation-2 (B). *: P < .05. Calibration plot of the DL and grade-based models in Validation-1 (C) and Validation-2 (D).

Retrospective histological review of the DL model output

Since the DL models incorporating both centrum and periphery tiles consistently showed associations with MFS across all three cohorts, we focused our analysis on the DL-CP model. Supplementary Table ST5 presents the associations between the DL-CP low and high risk groups and the histological features, analyzed across the entire tile dataset, as well as separately for tiles from the tumor periphery and centrum. Several significant associations were observed. The DL-CP high-risk group was characterized by higher tumor cellularity in both the centrum (P = 0.0002) and periphery (P = 0.0002), a predominance of pleomorphic cell types in the centrum (P = 0.0129) and periphery (P = 0.0038), and more pronounced atypia in both regions (centrum: P = 0.0029; periphery: P = 0.0036). Conversely, myxoid or fibromyxoid stroma was more commonly seen in the DL-CP low-risk group, both in the centrum (P = 0.0054) and periphery (P = 0.0043). Additionally, increased mitotic activity was noted in the tumor periphery of the high risk group (P = 0.0357). Representative tiles from the DL-CP low and high risk groups are shown in Fig. 4.

Fig. 4

Typical examples of histological tiles from the high risk (A) and low risk (B) group according to the deep learning model trained on the tumor centrum and periphery. The most predictive tiles for the high risk group showed high cellular density (A1) or fibrous stroma (A2), pleomorphic cells (A3 and A4), nuclear atypia (A1 to A4) and mitoses (A4) while tiles with low risk showed low cellular density with fibromyxoid (B1) or myxoid (B2) stroma, spindle cells with no nuclear atypia and no mitoses (B1 to B4).

Other outcomes

We provide the results of the models for both LFS and OS in Supplementary Tables ST6 and ST7, respectively. A notable association with LFS was observed solely with the DL-P risk score in Training (log-rank P = 0.0126). In Training, all DL risk scores demonstrated significant associations with OS (log-rank P value range: < 0.0001 [for DL-P and DL-R] to 0.0057 [for DL-C]). Conversely, in Validation-1, only the DL-CPR score exhibited a significant association with OS (log-rank P = 0.0009), while neither the other scores nor the FNCLCC grade showed significance. Similarly, in Validation-2, the DL-P score displayed a significant association with OS (log-rank P = 0.0123), whereas the other scores and the FNCLCC grade did not.

Discussion

The prognostication and risk stratification of patients with STS pose significant challenges. In cases of newly-diagnosed locally-advanced STS, patient outcomes are primarily linked to the occurrence of metastatic relapses, heavily influenced by the FNCLCC histologic grade. This grade dictates treatment strategies, including anthracycline-based chemotherapy and radiotherapy, in addition to curative surgery14. However, accurately assessing the FNCLCC grade requires considerable expertise in STS pathology, which is hindered by the disease’s relative rarity and the scarcity of expert pathologists. This can result in prolonged delays before diagnosis, grading, and referral to specialized centers. Therefore, there is a pressing need for precise and reproducible automated histological tools to assist pathologists. In this study, we introduce an original deep learning pipeline leveraging digital pathology, a pre-trained CNN, and MIL. Our objectives were twofold: (i) to predict MFS and (ii) to explore whether incorporating normal-appearing surrounding tissues (R areas) in the digitized HES slide could enhance performance compared to standard assessments focusing solely on the tumor center and periphery (C and P areas) or histological grading. Utilizing one training cohort and two independent validation cohorts from two of the three French sarcoma reference centers, all annotated by expert pathologists, our approach revealed that including the R area did not improve the performance of DL models already utilizing the C and P areas. Moreover, we found that DL models could outperform models based on grading assessed by senior pathologists in predicting MFS.

First, we observed significant associations between MFS and the DL risk scores evaluated on C alone, P alone and C + P in the two independent validation cohorts. However, no consistent associations were observed for R alone and C + P + R. Specifically, the DL-R risk score was not linked to MFS in Validation-1 and Validation-2, while the DL-CPR risk score showed no association with MFS in Validation-2. In parallel, the histological grade was not significantly associated with MFS in Validation-1, but was in Validation-2. Despite this finding, the population remained representative of the typical demographic seen in sarcoma studies, with the majority of patients aged over 50 years, presenting with large tumors exceeding 5 cm, and half of the cases exhibiting high histological grade III, resulting in a considerable 30% risk of metastatic relapse at 5 years. Moreover, except for the grade, the clinical and pathological covariables linked to lower MFS were consistent with previous findings, namely older age, larger tumor size, deep-seated or in-between sarcomas, and R1 surgical margins24.

Secondly, we developed supervised survival models inspired by the methodology employed in the construction of the Sarculator nomogram for primary non-metastatic tumors, utilizing Cox regression models18. These regressions included patient age, tumor size, and histotype, with simplified categorization due to the small sample sizes in vascular sarcoma and MPNST. Additionally, we included adjuvant chemotherapy and radiotherapy as factors potentially affecting patient outcomes14. The coefficients of the multivariable DL models obtained in the Training cohort indicated that all DL risk scores were independent predictors of MFS, whereas the histological grade was not associated with MFS.

Thirdly, according to the c-index, the DL-CPR model exhibited the highest prognostic performance in Validation-1 (c-index = 0.786), followed by DL-C. In Validation-2, the highest prognostic performance was achieved with DL-C (c-index = 0.741), followed by DL-CP and DL-CPR. Notably, the DL model based on the DL-C risk score provided the best overall performance across Validation-1 and Validation-2, outperforming the models based on FNCLCC grade, although the comparisons did not reach statistical significance. It is worth noting that the DL models exhibited a decrease in performance from Training to Validation-1 and Validation-2, particularly the DL-CPR, DL-CP, and DL-P models, with a c-index decrease > 0.1, indicating potential overfitting in the modeling process. This finding was not unexpected given the complexity of the DL models and the numerous hyperparameters involved in training. Importantly, while additional covariables might have bolstered the c-indices, we chose to limit their inclusion in the modeling and to assess our DL risk scores and subsequent models against a grade-based model akin to the Sarculator. For reference, the c-indices of the Sarculator nomogram in external validation ranged between 0.65 and 0.75 (refs. 18,25), i.e., closely mirroring the performances of our benchmark grade-based model and slightly lower than those of the DL-C model in Validation-1 and Validation-2.
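As a reminder of what the c-index measures, the sketch below computes Harrell's concordance index for right-censored data in plain Python. It is a simplified illustration under stated assumptions (pairs with tied follow-up times are skipped, and tied risk scores count as half-concordant); it is not the implementation used in this study, and the function name is hypothetical.

```python
from itertools import combinations

def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    times: follow-up durations; events: 1 if relapse was observed,
    0 if censored; risk_scores: higher value means higher predicted risk.
    Simplification (assumption): pairs with equal follow-up times are ignored.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter follow-up ended with an
        # observed event; otherwise the true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier relapse had the higher score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of patients by time to relapse, which is the scale on which the values above (for example 0.786 and 0.741) should be read.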

Overall, these findings underscore the potential utility of the DL-C and DL-CP models as prognostic tools for identifying patients at heightened risk of metastatic relapse, thus guiding the allocation of more aggressive local and systemic therapies, as well as intensified monitoring. Moreover, DL models offer robustness by consistently providing the same prediction when presented with identical input images. In contrast, the inter-observer reproducibility of the FNCLCC grade among senior pathologists from both centers in the Validation-2 cohort yielded a weighted Kappa of 0.480 (95%CI 0.333–0.644, P < 0.0001; data not shown), indicating only fair agreement, a finding consistent with previous studies26. Hence, a main advantage of DL models lies in their reproducibility and elimination of the inter-observer variability associated with FNCLCC grading. Moreover, the DL approach offers practical utility in settings with limited access to expert sarcoma pathologists, such as low-resource or non-specialized centers. By enabling automated and consistent risk assessment from routine histology slides, the model could help reduce disparities in prognostic evaluation. These strengths position the DL models as complementary or alternative tools to existing nomograms. Future prospective studies could therefore compare the prognostic value of the Sarculator nomogram with FNCLCC grade, the Sarculator coupled with a DL score, and an end-to-end DL model, to better define their respective clinical utilities.
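To make the weighted Kappa statistic concrete, the following sketch computes Cohen's weighted kappa for two raters grading the same cases on an ordinal scale such as the FNCLCC grade. The weighting scheme used in this study is not stated here, so linear weights are assumed by default (`power=1`; `power=2` gives quadratic weights). The function name and parameters are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories, power=1):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    categories: ordered list of possible grades, e.g. [1, 2, 3].
    power: 1 for linear weights (assumed here), 2 for quadratic weights.
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters must grade the same non-empty set of cases")
    k = len(categories)
    index = {c: pos for pos, c in enumerate(categories)}
    # Observed joint proportions of (grade by A, grade by B).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Disagreement weight: 0 on the diagonal, 1 for the largest gap.
            w = (abs(i - j) / (k - 1)) ** power
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]  # chance disagreement
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when expected disagreement is 0")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance; disagreements of two grades are penalized more heavily than disagreements of one grade.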

The advent of digital pathology, coupled with the widespread scanning of tissue slides in extensive sarcoma databases, coincides with a scarcity of sarcoma pathology experts and the emergence of telemedicine. In this context, such tools could assist in pre-labeling the HES slide from each newly-diagnosed patient, streamlining the process for pathologists during secondary reviews and confirmatory analyses.

Previous studies have consistently supported these findings, albeit in different cancer types or with smaller sample sizes. For instance, Foersch et al. developed a DL model utilizing CNN to predict disease-specific survival in a cohort of 85 leiomyosarcoma patients, demonstrating high diagnostic accuracy and superior predictive performance compared to histological grading6. Similarly, other DL models have been tailored to specific histological subtypes such as synovial sarcoma27 and rhabdomyosarcoma28. Milewski et al. recently investigated DL models in a cohort of 321 rhabdomyosarcoma patients, revealing strong discrimination in overall survival and event-free survival, surpassing existing molecular and clinical models, although quantitative comparison metrics specific to survival analysis were not utilized by the authors13.

Future research should include validating the DL-C and DL-CP models in independent prospective cohorts to assess their utility in clinical decision-making processes compared to unaided pathology reviews. Integration of digital immunostaining alongside HES slides could potentially enhance characterization of the tumor microenvironment. Given the focus of our study on peripheral STS, further investigations are warranted for visceral sarcomas, leveraging pre-trained DL pipelines through transfer learning. Moreover, dissecting the characteristics of the most crucial tiles in accurately predicting outcomes could provide valuable insights for enhancing future prognostic models. Herein, our retrospective review of the histological characteristics of representative tiles from the DL-CP high risk and low risk groups yielded findings consistent with established histological markers of STS aggressiveness, namely: increased cellularity, higher mitotic activity, and distinctive stromal patterns and cell types. Furthermore, it is also important to note that our study was not designed to identify the most accurate DL model for predicting patient outcomes from digital slides, nor to systematically benchmark various optimized CNN architectures. Rather, our primary objectives were to demonstrate the feasibility of generating a prognostic deep histological grade, and to determine which combinations of tiles from the tumor core, periphery, and peritumoral region would yield the highest prognostic performance for this deep grade. Accordingly, the only benchmarking performed pertained to the comparison between the conventional FNCLCC grade, DL-C, DL-P, DL-R, DL-CP and DL-CPR. However, current expansions of our work include alternative encoders, notably CONtrastive learning from Captions for Histopathology (CONCH), UNI2 and CTransPath.
Lastly, an important future direction would be the development of DL models tailored to individual histological subtypes of STS, rather than applying a single model across all subtypes. However, this approach was not feasible in our study due to the limited sample sizes available for each subtype across the three cohorts. Even for the most prevalent histological type—undifferentiated sarcoma—only 42 patients were included in the Training cohort, raising concerns about the reliability and generalizability of subtype-specific models. Moreover, one of our objectives was to critically assess the performance of the FNCLCC grade, which is universally applied across all histological subtypes in clinical practice. In this context, developing DL models on a similarly heterogeneous population was a deliberate and clinically consistent choice. Nonetheless, future studies with larger, subtype-enriched datasets may enable the development of more refined, histotype-specific models.

Our study has limitations. First, it was a retrospective study with a limited study population for a deep learning framework, though it was the largest population regarding the use of digital pathology and AI for predicting MFS in STS patients. Hence, some histotypes were under-represented and gathered in the ‘Other’ group. Second, while the study reported associations between DL risk scores and clinical and pathological variables, the underlying mechanisms driving these associations may not be fully elucidated. Enhancing model interpretability and transparency could improve the clinical adoption and trustworthiness of DL-based risk prediction models. Third, the DL model performances could have been enhanced by including other ‘omics’ data (including gene-expression or radiomics), as it has already been shown that the Sarculator nomogram and gene expression signatures (such as CINSARC) are complementary and potentiate each other29,30. Fourth, the selection of digital slides by pathologists for training and validating the DL models could introduce a sampling bias, potentially impacting model generalizability. Future studies should assess the reproducibility and robustness of DL models with respect to inter-observer variability in slide selection, to determine how different slide sampling strategies influence model performance. Fifth, patients who died without experiencing metastatic relapse were censored at the time of death. However, we acknowledge that death acts as a competing risk, as it precludes the observation of relapse. Future studies could consider competing risk methods to further refine prognostic assessment. Sixth, this study was exploratory in nature, and no multiple comparison corrections were applied to the evaluation of DL scores. However, all findings were assessed in two independent validation cohorts, limiting the risk of false-positive results.
Seventh, several factors likely contributed to the lower and variable prognostic performance of the DL models in the validation cohorts, particularly the discrepancy in c-indices between Validation-1 and Validation-2. These include differences in patient characteristics, histological subtype distribution, and technical variations in slide preparation, staining, or scanning protocols across centers and time periods. Importantly, the relatively small size of the validation cohorts likely impacted statistical power and model stability. As this is a proof-of-concept study, further validation on larger, multicenter datasets will be essential before clinical implementation. Lastly, another limitation of our study is the handling of death as a censoring event in the definition of MFS, whereas death may also be considered a competing risk for metastatic relapse. We adopted this cause-specific approach to ensure comparability with prior sarcoma studies and to maintain consistency with Cox regression analyses3,4. Nevertheless, competing risk methods were additionally applied for descriptive purposes, which confirmed the robustness of our findings.
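The difference between censoring deaths and treating them as a competing risk can be illustrated with a short sketch. The code below computes the cumulative incidence of relapse with a nonparametric (Aalen-Johansen type) estimator and compares it with the naive estimate obtained by censoring deaths (one minus Kaplan-Meier). The specific competing risk method used in this study is not described here, so this is only an assumed, generic illustration; the status coding and function names are hypothetical.

```python
def cumulative_incidence(times, status, cause=1):
    """Cumulative incidence of `cause` in the presence of competing events.

    status (hypothetical coding): 0 = censored, 1 = metastatic relapse,
    2 = death without relapse. Returns [(time, cumulative incidence), ...]
    at each time where at least one event of any type occurred.
    """
    data = sorted(zip(times, status))
    n_at_risk = len(data)
    event_free = 1.0  # probability of being free of any event so far
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d_cause = d_any = n_at_t = 0
        # Gather all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            s = data[i][1]
            n_at_t += 1
            if s != 0:
                d_any += 1
            if s == cause:
                d_cause += 1
            i += 1
        if d_any:
            # Probability of reaching t event-free, then failing from `cause`.
            cif += event_free * d_cause / n_at_risk
            event_free *= 1.0 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= n_at_t
    return curve

def naive_incidence(times, status, cause=1):
    """One minus Kaplan-Meier, treating competing events as censored."""
    recoded = [s if s == cause else 0 for s in status]
    return cumulative_incidence(times, recoded, cause)
```

When deaths without relapse occur, the naive estimate is at least as large as the competing-risk estimate, because censored patients are implicitly assumed to remain at risk of relapse.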

In conclusion, our study underscores the potential of the DL-C model as a robust prognostic tool for identifying STS patients at heightened risk of metastatic relapse, aiding pathologists and guiding treatment allocation and monitoring strategies. Adding the surrounding tissues from the digitized HES slides did not improve the model performance. Despite limitations such as its retrospective design, limited representation of certain histotypes and lack of other ‘omics’ data, our findings contribute to the growing body of evidence supporting the utility of digital pathology and deep learning in oncology and sarcoma.