Introduction

Immune checkpoint inhibitors (ICIs) targeting PD-1 or its ligand PD-L1 are approved for the treatment of metastatic non-small-cell lung cancer (mNSCLC) both alone1 and in combination with chemotherapy2, anti-CTLA-43 or anti-CTLA-4 and chemotherapy4. One factor in the choice of which treatment to select for a given patient is the expression of PD-L1 within their tumor sample. The use of monotherapy anti-PD-1 is currently recommended only for patients whose tumors demonstrate ≥50% PD-L1 expression5. Across PD-L1 expression levels, other checkpoint blockade regimens are among the treatment options; however, the addition of anti-CTLA-4 appears to show greater differentiation from regimens containing only anti-PD-(L)1 in patients whose tumors have <1% PD-L1 expression3,4. Further, recent studies of combinations with anti-TIGIT suggest that the additional benefit of this combination may be most clear in the PD-L1 ≥50% setting6.

In the context of these data, optimizing the measurement of PD-L1 expression is critical to guiding patients to the most appropriate treatment. Currently this expression is assessed by pathologist scoring of immunohistochemically (IHC) stained tissue7. However, such a semi-quantitative process is subjective8, with measurable variation in the overall percentage agreement among pathologists, particularly at the lowest levels of expression9. Additionally, current scoring of PD-L1 leaves room for improvement in the classification of samples between the well-defined cut-points of 1% and 50%10. The group of patients whose samples fall within this 1–49% group likely represents a population with heterogeneous tumor biology, some of whom may have the potential to benefit from chemotherapy-free regimens, though they currently cannot access them.

In recent years, computational pathology has emerged as a promising alternative to visual scoring of digital pathology slides11,12,13,14. Among the main motivations are consistency to alleviate effects of inter- and intraobserver variability10,15,16,17, the ability for computational pathology algorithms to effectively consider the entire available tissue18 and its power of true quantitative sample scoring by deriving more precise measurements and descriptive statistics of the stain distribution14,19,20,21. Successful examples of digital pathology applications in IHC include the Immunoscore22, digital pathology for multiplex IHC23 as well as applications exploiting the ability to derive fine-grained continuous measurement of protein expression both in brightfield24,25,26 and in immunofluorescence imaging27,28.

In the field of digitized PD-L1 scoring, various approaches have been proposed, mainly aiming to replicate tumor cell (TC) or tumor proportion scoring (TPS) by detecting and classifying cells15,29,30,31,32,33,34,35,36,37, but also including assessment of the relative pixel proportion of positively stained tumor epithelium38,39,40, weakly-supervised slide-level learning for TPS prediction41 or directly learning survival prediction from PD-L1 positive regions39. Most published systems rely on machine learning driven classification of PD-L1 cell positivity, i.e. cell positivity is implicitly defined by training of a machine learning system using positive and negative cell annotations, allowing replication of TC scoring among other statistics to be derived from pre-categorized positive and negative TCs31,33,34,36,37,42,43. However, using pre-set positivity categories does not offer the option to capture subtle graduations in cellular expression. In contrast, including quantitative scoring of per-cell staining intensity into the design of the digital scoring system allows for fine-grained calibration of cell positivity against pathology scores for precise PD-L1 TC scoring29,44 and further combined calibration of slide-level features together with cell-level marker expression against outcome for improved patient selection24,45,46.

In this work, we present and evaluate a system for quantitative continuous scoring of PD-L1 expression (PD-L1 QCS), expanding on previously published work on quantitative continuous scoring24 of digitized whole slide images (WSIs). The analysis includes a detailed investigation of quantitative scoring of PD-L1 and its association with clinical outcomes based on the MYSTIC randomized and controlled phase 3 clinical trial47. Of note, we identified patient subpopulations optimally benefiting from ICI treatment. Expanding on existing digital scoring schemes including TC scoring, additional PD-L1 expression schemes are explored for their utility in patient subgroup identification. Finally, a thorough assessment of the robustness of the chosen cut-points and cell positivity thresholds is presented, elucidating the stability of the identified digital biomarkers.

Results

PD-L1 QCS identifies NSCLC patients more likely to benefit from Durvalumab

157 baseline WSIs from 157 NSCLC patients randomized to CP1108 were available to support initial PD-L1 QCS analysis (see Supplementary Table 1 for statistics of the biomarker evaluable population (BEP)). Optimization for a PD-L1 QCS classifier was performed by evaluating the density and percentage of positive cells at various positivity thresholds and descriptive statistics of the overall staining intensity distribution in the entire slide. In total, considering the mean and a range of quantiles for descriptive statistics, this amounted to a pool of 59 feature/parameter combinations. Progression-free survival (PFS) within the QCS biomarker positive group was maximized with the constraint that log-rank p-value remained significant while a prevalence between 20% and 80% was preserved.
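The selection rule described above can be sketched as a filter followed by a maximization. This is a minimal illustration and not the authors' implementation: the `Candidate` record, the function name and the 0.05 significance level are assumptions (the text only states that the log-rank p-value had to remain significant), and the survival summaries are taken as precomputed inputs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Candidate:
    name: str          # hypothetical label of a feature/parameter combination
    pfs_bm_pos: float  # PFS summary (months) in the biomarker-positive group
    p_value: float     # log-rank p-value, BM+ vs. BM-
    prevalence: float  # fraction of patients classified BM+ (0..1)


def select_classifier(
    candidates: Sequence[Candidate],
    alpha: float = 0.05,     # assumed significance level
    min_prev: float = 0.20,  # prevalence bounds as stated in the text
    max_prev: float = 0.80,
) -> Optional[Candidate]:
    """Return the candidate with the highest PFS in the BM+ group among
    those with a significant log-rank test and admissible prevalence."""
    eligible = [
        c for c in candidates
        if c.p_value < alpha and min_prev <= c.prevalence <= max_prev
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.pfs_bm_pos)
```

Applying both constraints before the maximization keeps very small, high-PFS subgroups and non-significant splits out of the comparison.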

These analyses yielded a classifier in which positive cells were defined as having a PD-L1 membrane staining intensity (SI) ≥40 (see Section “Quantitative Continuous Scoring of PD-L1”), and a sample was considered positive where >3% of cells met this SI threshold. As demonstrated in Section “Quantitative Continuous Scoring of PD-L1”, the score mainly describes the proportion of medium to strongly stained tumor cells; hence the score is referred to as PD-L1 QCS-PMSTC. Use of the QCS-PMSTC >3% classifier to stratify patients resulted in a significant improvement in PFS for QCS-PMSTC+ as compared to QCS-PMSTC- patients, hazard ratio (HR) 0.43. This compared favorably to stratification based on conventional pathologist-derived TC scores at both the ≥1% and ≥50% cut-points (see Table 1).
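The two-level rule described above (a per-cell SI threshold followed by a per-sample cut-point) can be written down in a few lines. This is a sketch under the assumption that the per-cell membrane SI values are already available as a plain list; the function names are illustrative and not taken from the QCS software.

```python
from typing import Sequence


def qcs_pmstc(membrane_si: Sequence[float], si_threshold: float = 40.0) -> float:
    """Percentage of tumor cells whose membrane SI is >= si_threshold."""
    if len(membrane_si) == 0:
        raise ValueError("at least one tumor cell is required")
    n_positive = sum(1 for si in membrane_si if si >= si_threshold)
    return 100.0 * n_positive / len(membrane_si)


def is_biomarker_positive(score_pct: float, cut_point_pct: float = 3.0) -> bool:
    """Sample-level call; the CP1108 classifier uses a strict '> 3%' rule."""
    return score_pct > cut_point_pct


# Toy example with four cells, two of which reach SI >= 40.
score = qcs_pmstc([10.0, 40.0, 55.0, 5.0])
positive = is_biomarker_positive(score)
```

Keeping the cell-level threshold and the sample-level cut-point as separate parameters is what allows both to be tuned after cell detection.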

Table 1 CP1108 prevalence, mean progression-free survival (mPFS) results in the optimized biomarker positive groups and in TC score subgroups from visual scoring

Independent analyses in MYSTIC confirm the predictive value of PD-L1 QCS

From 1118 randomized patients in the MYSTIC clinical trial, 920 digital WSIs were received. Visual quality inspection by data curation experts together with pathologists indicated that 72 samples needed to be removed as they did not meet predefined standards for digital image analysis (incl. not enough tissue or less than 100 TCs, extensive staining artifacts not permitting analysis, scanning issues such as scan lines or blur, and preparation artifacts). A further 48 samples were either fine needle aspiration samples or were acquired from patients who did not receive treatment and were also excluded from analysis.

PD-L1 QCS was applied to 800 WSIs from MYSTIC. Following removal of 32 images with scoring discrepancies between visual and digital scoring caused by non-obvious out-of-focus regions (N = 19) or inappropriate image analysis (N = 13), a BEP of 768 remained. The BEP included 256 patients from the anti-PD-L1 (Durvalumab) arm, 266 patients from the anti-PD-L1/anti-CTLA-4 combination arm (Durvalumab-Tremelimumab) and 246 patients from the chemotherapy treatment arm (standard-of-care, SoC). A comparison of outcome data for this BEP as compared to the intention-to-treat (ITT) population is presented across PD-L1 sub-groups, as defined by pathologists' TC scores, in Supplementary Tables 8 and 9. Overall, a trend for increased activity for Durvalumab and reduced activity for Durvalumab + Tremelimumab was observed across PD-L1 sub-groups within the BEP.

For parameterized features, the parameters were varied, and a single representative threshold was chosen by calculating the Spearman rank correlation and picking the median expression or the lowest average correlation to all others within a correlation set (rho >0.8). In addition to each of these parameterized features, QCS-PMSTC as identified within CP1108 was considered as an additional feature.

In order to identify optimal cut-points for the parameterized features, standardized two-sample linear rank statistics incorporating survival data (Log-Rank statistics) were computed over a set of reasonable cut-points48. The optimal cut-point was determined for each arm separately as the one maximizing the computed rank statistics. The parameterized features and cut-points that defined the final set of classifiers for further analysis are described in Supplementary Table 7.
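The cut-point search described above can be illustrated with a small, dependency-free sketch. It assumes the standardized two-sample log-rank statistic as the rank statistic and scans a user-supplied list of candidate cut-points; the function names and the minimum group size are illustrative assumptions, and this is not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple


def logrank_z(times: Sequence[float], events: Sequence[int],
              in_group: Sequence[bool]) -> float:
    """Standardized log-rank statistic (observed minus expected events in
    the flagged group, divided by the hypergeometric standard deviation)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                          # at risk, all
        n1 = sum(1 for x, g in zip(times, in_group) if x >= t and g)  # at risk, group
        d = sum(1 for x, e in zip(times, events) if x == t and e)     # events, all
        d1 = sum(1 for x, e, g in zip(times, events, in_group)
                 if x == t and e and g)                               # events, group
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e / math.sqrt(var) if var > 0 else 0.0


def best_cut_point(feature: Sequence[float], times: Sequence[float],
                   events: Sequence[int], candidate_cuts: Sequence[float],
                   min_group: int = 2) -> Optional[Tuple[float, float]]:
    """Return (cut, |Z|) for the cut-point maximizing the absolute statistic."""
    best: Optional[Tuple[float, float]] = None
    for cut in candidate_cuts:
        high: List[bool] = [f >= cut for f in feature]
        if min(sum(high), len(high) - sum(high)) < min_group:
            continue  # skip splits that leave one group too small
        z = abs(logrank_z(times, events, high))
        if best is None or z > best[1]:
            best = (cut, z)
    return best
```

Because the cut-point is chosen to maximize the statistic, a naive p-value computed at the selected cut-point is optimistic; maximally selected rank statistics are normally paired with an adjusted null distribution, which this sketch omits.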

For both Durvalumab-containing arms in MYSTIC, PD-L1 QCS biomarker-positive (BM+) groups for each feature were assessed with respect to prevalence and median overall survival (mOS, see Fig. 1). Pathologists' scoring of TC percentage at 1%, 25% and 50% demonstrated a trend of increasing mOS combined with declining prevalence proportional to increasing cut-point, which is in line with previously published results49.

Fig. 1: Prevalence analysis and mOS benefit in the biomarker-positive group of MYSTIC for various PD-L1 QCS features.

Subfigure a depicts the D and subfigure b the D + T arm. The features were applied in their identified configuration (cut-point and threshold) and used for stratification; superior stratification, which combines a high mOS with a high prevalence, lies towards the top right quadrant. In addition to PD-L1 QCS features, the pathologist stratification based on TC scores at 1%, 25% and 50% cut-points is indicated in red. A detailed account of the feature parameters is provided in Supplementary Table 7.

When compared to pathologists’ TC score, two QCS features provided a favorable combination of prevalence and mOS within the Durvalumab (D) arm. The first one encompassed fitting a mixture model of two Gaussians (GMM) to the SI distribution over all TC membranes in a WSI, calculating the difference of their means, and thresholding this feature at an SI value of 11.32. The second classifier considered patients QCS-PMSTC+ when the proportion was greater than 0.575% (see also Fig. 1a). Both features identify a population with comparable mOS relative to %TC ≥ 50 scoring, but offer an increased prevalence of BM+ patients (GMM with 52.7% prevalence and mOS = 20.9 m, QCS-PMSTC with 54.3% prevalence and mOS = 19.9 m).
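To make the GMM feature concrete, the following sketch fits a two-component one-dimensional Gaussian mixture by expectation-maximization and returns the absolute difference of the component means. It is only an illustration: the quartile-based initialization, the iteration count and the function name are assumptions, and the source does not state which fitting procedure was actually used.

```python
import math
from typing import List, Sequence


def gmm2_mean_difference(values: Sequence[float], n_iter: int = 200) -> float:
    """Fit a two-Gaussian mixture to 1-D data with EM and return |mu2 - mu1|."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    # Assumed initialization: means at the quartiles, shared variance, equal weights.
    mu: List[float] = [xs[n // 4], xs[(3 * n) // 4]]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n or 1e-6
    var: List[float] = [var_all, var_all]
    w: List[float] = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of component 0 for each value.
        resp = []
        for x in xs:
            p = [
                w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in (0, 1)
            ]
            s = p[0] + p[1]
            resp.append(p[0] / s if s > 0 else 0.5)
        # M-step: update weights, means and variances.
        for k in (0, 1):
            r = [ri if k == 0 else 1.0 - ri for ri in resp]
            nk = sum(r)
            if nk < 1e-9:
                continue  # component collapsed; keep previous parameters
            w[k] = nk / n
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var[k] = max(sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk, 1e-6)
    return abs(mu[1] - mu[0])
```

The returned value would then be compared against the feature's cut-point (11.32 in the text) to assign a sample to the biomarker-positive or biomarker-negative group.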

Within the combination arm of Durvalumab and Tremelimumab (D + T), classifiers based on the median calculation of the cell membrane and the analysis of the median of the cytoplasm and membrane difference offered comparable performance to the %TC ≥ 50 classifier based on pathology scoring (see Fig. 1b). However, none of the PD-L1 QCS derived classifiers showed performance that improved over the prevalence and mOS by pathology scoring.

Comparison to histopathological scoring offers explainability for PD-L1 QCS TC classifier

Quantifying the percentage of tumor cells with SI ≥ 40 was identified both in MYSTIC and CP1108 as predictive of activity for Durvalumab, and potentially superior to visual TC scoring. Moreover, this feature demonstrated excellent transferability across studies CP1108 and MYSTIC, although the WSIs were acquired with different scanners and processed by different providers.

Examination of which cells are defined as positive using the SI ≥ 40 threshold suggests that these represent mainly medium to strongly stained cells (see Fig. 2a). A correlation analysis of visual TC scores against percentage of cells above the SI = 40 threshold within CP1108 rendered a Spearman correlation value of R = 0.76. Inspection of the concordance correlation coefficient (CCC) of PD-L1 QCS percentage of TCs at differing SI increments against the percentage of 1+, 2+ and 3+ cells scored by pathologists available for a subset of 74 patients in CP1108 indicates a maximized correlation with SI = 15 against 1+ cells (CCC = 0.83), similarly for 2+ cells with somewhat decreased correlation (CCC = 0.65) and an SI = 100 maximizing CCC to 3+ cells (CCC = 0.69). This alignment analysis confirms that the SI ≥ 40 threshold captures cells with moderately increased intensity above 2+ level (Fig. 2b), suggesting SI ≥ 40 to be a reasonable approximation to what would be considered a moderate to strongly positive cell by pathologist classification. The QCS-PMSTC score captures the proportion of these tumor cells as a percentage.
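For reference, the concordance correlation coefficient used in this alignment analysis can be computed as follows. The sketch uses Lin's definition with population (1/n) moments; the function name is illustrative, and the source does not specify which software computed the CCC.

```python
from typing import Sequence


def concordance_ccc(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2 * cov / denom
```

Unlike a rank correlation, the CCC penalizes systematic offsets: two series that differ by a constant shift are perfectly correlated but have a CCC below 1, which makes it suitable for checking whether digital percentages agree with visual percentages in absolute terms.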

Fig. 2: Classified cells for PD-L1 QCS PMSTC calculation and correlation analysis of stain intensities against visual cell categorization.

a Example of PD-L1 QCS application with cell positivity classification at SI = 40 threshold. Cells overlaid in green are considered negative (lower SI than the threshold), while overlaid orange cells are considered positive (SI equal or larger than the threshold). The SI value is derived from the segmented epithelial cell membrane, depicted in red in the middle overlay. b Analysis of PD-L1 QCS at various SI thresholds against visual assessment of the percentage of positive cells classified according to 1+, 2+, 3+ scheme in the CP1108 dataset using the concordance correlation coefficient (CCC). Vertical bars indicate the maximum CCC obtained at an SI threshold (SI = 15 for cells classified ≥ 1+ or ≥2+, SI = 100 for 3+).

PD-L1 QCS is predictive of Durvalumab activity within the MYSTIC study

The classifier was applied back to the MYSTIC trial in order to assess its predictive value, considering patients with QCS-PMSTC ≥ 0.575% biomarker positive (QCS-PMSTC+) and the remaining patients biomarker negative (QCS-PMSTC-). Within the Durvalumab arm, the prevalence of QCS-PMSTC+ patients was 54.3%, greater than the prevalence of patients selected using a pathologists' TC score cut-point of either 50% or 25%. Despite this higher prevalence, median overall survival (mOS) was comparable to that achieved using a TC score 25% cut-point and only slightly less than that achieved by using a TC score 50% cut-point. Importantly, the mOS in the QCS-PMSTC- group was lower than using either pathologists' TC score cut-point, resulting in an improved HR for QCS-PMSTC+ vs. QCS-PMSTC- patients (Fig. 3). In contrast, the same classifier was unable to meaningfully stratify patients within the D + T arm (see Supplementary Fig. 2).
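The median overall survival values quoted for each subgroup are read off Kaplan–Meier curves. A minimal sketch of that estimate is given below, assuming the common convention that the median is the first time at which the estimated survival drops to 0.5 or below; the function name is illustrative and the source does not state which software produced the estimates.

```python
from typing import Optional, Sequence


def km_median(times: Sequence[float], events: Sequence[int]) -> Optional[float]:
    """Kaplan-Meier median survival time.

    events[i] is 1 for an observed death and 0 for a censored observation.
    Returns None when the curve never reaches 0.5 (median not reached).
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths, censored = 0, 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / n_at_risk
            if surv <= 0.5:
                return t
        n_at_risk -= deaths + censored
    return None
```

Computing this separately for the biomarker-positive and biomarker-negative patients of one arm gives the pair of mOS values compared in the text.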

Fig. 3: Patient stratification as defined by visual scoring and PD-L1 QCS in the Durvalumab arm of MYSTIC.

Kaplan–Meier Analysis (incl. confidence intervals) of MYSTIC Durvalumab arm (OS), indicating performance of TC scores at 1%, 25% and 50% cut-point together with QCS-PMSTC ≥ 0.575%. HR indicates hazard ratio.

In order to confirm the predictive vs. prognostic value of the QCS-PMSTC classifier, Durvalumab treatment was compared to SoC chemotherapy within the QCS-PMSTC+ and QCS-PMSTC- sub-groups. Within the QCS-PMSTC+ subgroup this comparison yielded an HR of 0.62 (CI 0.46–0.82, logrank p = 0.0008), as compared to a HR of 1.23 (CI 0.93–1.63, logrank p = 0.1413) in the QCS-PMSTC- sub-group. For comparison, HRs were 0.69 (CI 0.46–1.02, logrank p = 0.0642) and 0.93 (CI 0.74–1.17, logrank p = 0.5133) for patients considered positive and negative, respectively, when employing a visual pathologist assessment of TC scores at the 50% cut-point (Fig. 4).

Fig. 4: Patient stratification as defined by visual scoring and PD-L1 QCS in the Durvalumab and SoC arm of MYSTIC.

Kaplan–Meier analysis (OS) indicating subgroup stratification performance of TC scores at 1%, 25% and 50% cut-point together with QCS-PMSTC ≥ 0.575%. The p-values are derived from log-rank tests (p) or obtained for Cox interaction terms (pinteract).

PD-L1 QCS improves the stratification of patients in the 0–49% TC score sub-group

Visual assessment of PD-L1 expression is well established in NSCLC for guiding treatment strategies. To assess the performance of QCS in identifying responders to anti-PD-L1 treatment, we compared the classification of BM+ and BM- patients by QCS in the three established categories of visual PD-L1 scoring in NSCLC, based on TC scoring: no/low expressors (%TC < 1), intermediate expressors (%TC in 1–49), and high expressors (%TC ≥ 50). Importantly, all QCS-PMSTC+ patients within the Durvalumab arm were also positive based on pathologist TC scoring at a 50% cut-point. Within the sub-group of patients considered <50%, the QCS-PMSTC classifier split these patients almost equally between positive and negative subgroups (Fig. 5a). Three patients exhibited a category change due to incorrect assessment of intratumoral immune cells, mainly macrophages. Within the QCS-PMSTC+ sub-group mOS was 18.6 months while in the QCS-PMSTC- sub-group mOS was 7.8 months, with an HR between the groups of 0.48. In contrast, patients classified as %TC ≥ 1 and %TC < 1 by pathologist assessment had an mOS of 11.1 and 8.6 months, respectively, with an HR for a comparison of these two groups of 0.63 (Fig. 5b). Overall, these data indicate the potential of the QCS approach to more precisely stratify patients currently within the 1–49% sub-group as determined by pathologist TC scoring.

Fig. 5: Patient subgroup mapping and survival analysis in the 0–49% TC score subgroup.

a Patient mapping of TC scores and PD-L1 QCS-PMSTC categories across the Durvalumab arm. b Kaplan–Meier analysis of visual TC scoring at 1% threshold and PD-L1 QCS constrained to the %TC < 50 subgroup of the Durvalumab treatment arm. The p-values are derived from log-rank tests.

PD-L1 QCS demonstrates pronounced stratification in non-SqCC patients

Further, PD-L1 QCS-PMSTC stratification in the Durvalumab arm was examined across the Squamous Cell Carcinoma (SqCC) and non-SqCC NSCLC subtypes, which represented 30.6% and 69.4% of the BEP in MYSTIC, respectively (see Supplementary Table 8). In non-SqCC, an increased prevalence in the PD-L1 QCS-PMSTC+ group compared to %TC ≥ 50 scoring was observed, which was paralleled by a decreased mOS in PD-L1 QCS-PMSTC- patients (see Fig. 6a). In SqCC, prevalence was also increased; however, a decrease of 5.1 m mOS in the PD-L1 QCS-PMSTC+ subgroup was observed compared to manual scoring (see Fig. 6b). For further comparison to other visual scoring thresholds, please refer to Supplementary Fig. 1.

Fig. 6: Kaplan–Meier analysis (OS) in NSCLC subtypes within the Durvalumab arm.

a Kaplan–Meier analysis of patient stratification in the Non-SqCC subgroup with PD-L1 QCS-PMSTC compared to visual %TC ≥ 50 scoring. b Kaplan–Meier analysis of patient stratification in the SqCC subgroup with PD-L1 QCS-PMSTC compared to visual %TC ≥ 50 scoring. P-values are derived from log-rank tests. NR indicates median OS was not reached.

PD-L1 QCS classification based on cellular optical density is stable across a range of thresholds

To assess the robustness of the novel QCS-based biomarker across SI thresholds and its relationship to prevalence, a stability analysis for the Cox interaction term for D vs. SoC against the threshold was carried out by varying the SI for cell positivity classification together with the cut-point (implicitly defining the prevalence). This analysis indicated that the chosen pair of SI ≥ 40 threshold and cut-point at 0.575% is embedded within a range of thresholds (approx. between SI = 15 and SI = 90) and prevalences (approx. between 50% and 65%), in which the parameters could be slightly altered while still retrieving a comparable stratification (Fig. 7a). Similarly, analysis of stratification within the D arm indicated that a large range of possible SI and prevalence ranges would result in a p < 0.001 (Fig. 7b, left) and that the HR for the resulting QCS BM+ vs. QCS BM- groups was optimized within a similar range as that for the comparison of D vs. SoC (Fig. 7b, right).
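The parameter sweep behind such a stability analysis can be sketched as a grid over SI thresholds and cut-points. The example below only computes the prevalence implied by each pair; the survival tests evaluated at each grid point in the text (Cox interaction term, log-rank test, HR) are omitted. The function name, the input layout (one list of per-cell SI values per patient) and the use of "≥" at the sample level are assumptions.

```python
from typing import Dict, Sequence, Tuple


def prevalence_grid(
    patients: Sequence[Sequence[float]],  # per-patient lists of per-cell membrane SI
    si_thresholds: Sequence[float],
    cut_points_pct: Sequence[float],
) -> Dict[Tuple[float, float], float]:
    """Map each (SI threshold, cut-point) pair to the fraction of patients
    whose percentage of cells with SI >= threshold is >= cut-point."""
    grid: Dict[Tuple[float, float], float] = {}
    for si in si_thresholds:
        # Percentage of positive cells per patient at this SI threshold.
        scores = [
            100.0 * sum(1 for v in cells if v >= si) / len(cells)
            for cells in patients
        ]
        for cut in cut_points_pct:
            n_pos = sum(1 for s in scores if s >= cut)
            grid[(si, cut)] = n_pos / len(scores)
    return grid
```

Plotting a test statistic over such a grid, with prevalence on one axis, shows whether a chosen parameter pair sits in a broad plateau or on an isolated peak.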

Fig. 7: Stability analysis varying thresholds and cut-points of the PD-L1 QCS-defined percentage of positive tumor cells.

The yellow rhombus indicates the discovered parameterization of QCS-PMSTC with SI = 40 as cell positivity threshold together with a cut-point of 0.575% for patient categorization. Subfigure a shows the p-values obtained for Cox interaction terms for the Durvalumab vs. SoC arm upon variation of both parameters, and b shows the p-value of log-rank testing (left) and the hazard ratio (right) for stratification in the Durvalumab arm.

Discussion

First-line treatment decisions in driver mutation-negative metastatic NSCLC are currently guided through visual assessment of PD-L1 expression by IHC. Three main groups are defined based on the proportion of tumor cells expressing PD-L1: the %TC ≥ 50 group who are eligible to receive monotherapy anti-PD-1, the 1–49% group who receive mainly anti-PD-1 together with chemotherapy and the %TC < 1 group who are also eligible for this treatment, but for whom anti-CTLA-4 containing regimens may also be an attractive option. Despite its firm establishment in clinical practice, there are multiple factors50 challenging the predictiveness of visual IHC scoring for PD-L1 testing, including tissue sampling and preparation, assay, tumor heterogeneity as well as inter- and intraobserver variability in accurately scoring tissue10,51. In particular, the 1–49% group is likely highly heterogeneous, made up of patients for whom monotherapy anti-PD-1 may be sufficient, those for whom a chemotherapy combination is needed and those who may be best served by more intensive combination regimens, e.g., containing anti-CTLA-4. To this end, reported benefits of digitized PD-L1 TC scoring systems include improved interpathologist concordance with digital scoring supported reading17,37 and increased prevalences observed with digital scoring in comparison to visual pathologist scoring alone34,49,52,53. Furthermore, a digital pathology approach offers the opportunity to leverage fine-grained algorithmic assessment of expression to optimize the stratification of patients54.

In this work, we examine Quantitative Continuous Scoring for PD-L1, which, in contrast to many other published approaches that rely on direct detection of positive and negative cells, avoids the need to derive cell positivity based on pathologist annotations7,17,31,36,37,49,52,53,55,56. Instead, PD-L1 QCS expands on the previously established approach of quantitative continuous scoring24,57: cell identification is handled via an instance segmentation approach, which determines cells together with their compartments and then derives a range of features through measurement of continuous staining intensity on a per-cell basis. Individual cells can be defined as positive or negative for expression after detection, by applying a threshold value to any SI-derived feature, similar in principle to the definition of cells as 0, 1+, 2+ or 3+ expression through visual assessment of, for example, HER258. A sample can then be classified as positive or negative by applying a cut-point for the percentage of cells that must meet this threshold. This approach enables post-detection classification of the cells according to their intensity, among other readouts, and so is analogous in many ways to pathology scoring but with the granularity of a digital approach that can support more accurate optimization of classifiers for positivity.

Initial application of PD-L1 QCS in the context of the CP1108 study confirmed the potential for this approach to improve on classifications based on visual assessment of TC expression. Utilizing PD-L1 QCS to determine the proportion of medium to strongly stained tumor cells yielded a patient group that maximized prevalence while retaining favorable mOS. Subsequent application to the phase 3 study MYSTIC confirmed these findings, identifying two features with the potential to optimize patient stratification. Interestingly, these features, GMM and QCS-PMSTC, are rooted in the definition of a relationship between positive and negative tumor cells across a sample. The identification of these two features, amongst a collection of more complex features, speaks to the strength of the existing TC scoring approach used in pathology which also defines a relationship between positive and negative cells across a sample. This suggests that, at least in the setting of NSCLC, an advantage of a digital approach to PD-L1 quantification may be derived from the ability to more accurately fine-tune components of a classifier, such as threshold and cut-point. While the degree of PD-L1 most relevant for anti-PD-L1 response is not well understood and needs to be further investigated50, we identified a specific level of PD-L1 expression as a novel biomarker in a data-driven approach.

A comparison of the final PD-L1 QCS classifier to visual TC scoring in MYSTIC supports the potential value of a fine-grained quantitative scoring in optimizing patient selection: PD-L1 QCS-PMSTC ≥ 0.575% identified a higher proportion of patients compared to the %TC ≥ 50 cut-point for visual scoring (54.3% vs. 29.7%), while sustaining a comparable mOS for D (19.9 m vs. 21.1 m) and HR for comparison of D vs. SoC (0.62 vs. 0.69) within the QCS-PMSTC+ group. Importantly, the cell positivity threshold of SI = 40 was derived on CP1108, a separate study sourced from a different lab and scanned with a different scanner. The improvements with the QCS approach were particularly pronounced in non-SqCC NSCLC subtypes and in the population defined as 1–49% TC by visual scoring. Especially in the %TC < 50 group, the QCS-PMSTC- subgroup increased the prevalence together with a decreased mOS, paralleled by a higher mOS in the QCS-PMSTC+ subgroup.

These findings align with reports on increased patient prevalences when employing systems that replicate the visual PD-L1 TC scoring paradigm34,49,52,53 and more pronounced effects observed in lower PD-L1 expression levels52,53. However, the QCS framework allowed the biomarker candidate search to be expanded beyond replication of established scoring paradigms by considering combined analysis of SI thresholds and cut-points, and provided insights into their stability upon parameter variation24. As a result, PD-L1 QCS-PMSTC encompasses a TC population with SI ≥ 40, which represents a higher intensity than that considered in established visual TC scoring and hence takes advantage of the data-driven approach with more degrees of freedom than score replication59.

Interestingly, the results indicate that there is a sub-set of patients within the PD-L1 < 50% sub-group that may have the potential to benefit from a chemotherapy-free regimen, which is consistent with previous studies suggesting that the 1–49% segment is the most challenging for visual assessment10. Considering recent studies that suggest addition of anti-TIGIT to anti-PD-L1 may bring the greatest benefit in patients with baseline immunity indicated by PD-L1 ≥ 50%6, it would be interesting to assess whether a QCS-based classifier could also more accurately identify patients benefiting from this potentially superior chemotherapy-free regimen. In contrast to the D arm, QCS-based classification was not able to improve patient stratification within the D + T arm of MYSTIC, which warrants further investigation.

Notably, the described analysis underscores the significant value that clinical trial specimens represent when utilized for additional digital biomarker research60,61. Particularly in the light of recent advances in digital pathology, such as the abilities to accurately learn complex, whole-tissue representations from large WSI databases with foundation models and additionally integrate non-imaging data62,63,64, this approach likely enables extraction of novel insights by leveraging the abundance of relevant information present in clinical sample collections. To fully realize this potential, we advocate for systematic integration of high-quality, well curated tissue and data collection incl. standardized annotation, and appropriate consent for secondary use into early clinical development workflows.

In summary, we have developed a novel computational pathology approach to the assessment of PD-L1 expression by IHC: PD-L1 QCS. Application of this approach across two independent trials of Durvalumab treatment in NSCLC confirmed the validity of existing scoring approaches that focus on assessment of tumor cell positivity, while more sophisticated image analysis features were not superior. Where the PD-L1 QCS methodology offered advantages was in its ability to identify scoring parameters with a predictive value not possible via visual pathology scoring. This enabled identification of additional patients who may potentially benefit from monotherapy anti-PD-(L)1 through better classification of the current 1–49% TC sub-group. It is worth noting that NSCLC represents the simplest possible scenario for PD-L1 assessment. Almost every other indication requires combined scoring of both immune cells and tumor cells and represents a notably greater challenge for visual pathology assessment. While outside the scope of this work, it is plausible that the application of the QCS approach to such settings could bring even greater benefits in both reproducibility and predictive value. Additionally, novel ways of scoring PD-L1 expression may be highly relevant for new drug combinations, such as IO and ADC therapies, as they can better capture nuanced expression patterns, enabling more precise patient stratification and treatment efficacy assessment.

Methods

Immunohistochemical staining for PD-L1

Tumor tissue was obtained prior to any treatment from NSCLC patients enrolled in CP1108 (NCT01693562), a phase 1/2 clinical trial of Durvalumab monotherapy, and MYSTIC (NCT02453282), a phase 3 trial of Durvalumab and Tremelimumab, which have been described previously47,65. The studies were approved by the institutional review board or Ethics Committee at each site and conducted in adherence with the International Council for Harmonization Good Clinical Practice guidelines, the Declaration of Helsinki, and local regulations on the conduct of clinical research. All patients provided written informed consent before participation in the trial.

PD-L1 expression was assessed using the VENTANA PD-L1 (SP263) Assay on sections from formalin-fixed, paraffin-embedded tissue samples. Baseline percentages of TC (TC scores) were acquired from the laboratory providers of the respective studies, in which these were determined by trained board-certified pathologists, who visually assessed the proportion of epithelial TCs with membrane positivity for PD-L1 at any intensity above background.

Preparation of whole slide images

WSIs were obtained by scanning glass slides containing PD-L1 stained tissue samples with the Aperio Scan Scope AT Turbo Scanner (Leica Biosystems, USA) in study CP1108 at 20x magnification (0.5 µm pixel edge length) and the Hamamatsu NanoZoomer 360 Scanner (Hamamatsu Photonics K.K., Japan) in the MYSTIC study at 40x magnification (0.23 µm pixel edge length). WSIs of insufficient quality were excluded, e.g., when these exhibited out-of-focus regions, folds, blur or overstaining and other artifacts. After review, the tumor area on WSIs was annotated to define the region of interest (ROI) for the analysis. WSIs were reviewed by trained specialists with more than 5 years of laboratory experience and a histopathology quality assurance background, and by a veterinary pathologist with more than 8 years of experience in human tissue analysis, under supervision of board-certified pathologists. In CP1108, reviews and annotations were carried out under supervision of a board-certified pathologist with experience in PD-L1 scoring; MYSTIC WSIs were annotated by three board-certified pathologists, including two with more than 10 years of experience in PD-L1 scoring.

Segmentation of regions, cells and subcellular compartments in CP1108

For epithelial region segmentation, 178 field of view (FOV) annotations from a subset of 27 WSIs from CP1108 were pooled into an internal data set (for details see Supplementary Table 2), and a supervised deep learning model for segmentation of tumor epithelium regions was trained using a custom digital pathology framework relying on Keras and TensorFlow66 (see also Supplementary Table 3).

TCs and their corresponding subcellular compartments (membrane, nucleus and cytoplasm) were segmented using an independently developed deep learning model described elsewhere45. In detail, the epithelium was segmented first, followed by derivation of TCs with the second model combined with watershed segmentation using Definiens Developer XD67 image analysis software. Results were visually inspected by pathologists, and 33/33 images passed quality control (10 images at the ≥80% correctness level and 23 images at the ≥90% correctness level).

Segmentation of regions, cells and subcellular compartments in MYSTIC

Segmentation employed a four-class region segmentation model (EMRA), trained on data independent from MYSTIC, which segments the ROI into epithelial tumor area, alveolar macrophages, artifacts and the remaining region, which is treated as stroma. First, EMRA was pre-trained on a database of ca. 53,000 image patches sampled from 25 full slide images, normalized to a resolution of 0.5 µm edge length per pixel. These images were IHC WSIs synthetically generated from multiplex immunofluorescence (mIF) WSIs including PD-L1, DAPI and the autofluorescence channel. The mIF WSIs included Pan-Cytokeratin as a marker. These were segmented and superimposed on synthetic PD-L1 images generated with a Wasserstein CycleGAN trained for domain translation39,68,69,70,71. After pre-training, the model was refined on a database of 4618 FOVs, manually annotated by pathologists and obtained from scanned WSIs, which included a variety of markers staining either epithelium only (such as HER2) or both tumor epithelium and stroma (such as PD-L1). For details of model training parameters and the underlying database, please refer to Supplementary Tables 4 and 5.

EMRA region analysis was benchmarked against independent annotations of 89 FOVs by three pathologists for discrimination of tumor epithelium vs rest. To adequately capture morphologies that are challenging for image analysis, such as those of non-squamous carcinomas (non-SqCC), the regions for FOV placement were selected such that NSCLC tumor subtypes were sampled approximating their prevalence in the biomarker evaluable population (28% SqCC, 72% non-SqCC). The resulting Dice scores indicated comparable segmentation performance between pathologists and EMRA (see also Supplementary Table 6).
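As an illustration of the agreement metric used in this benchmark, the following is a minimal sketch of a Dice score between two binary masks (tumor epithelium vs rest). The function name `dice_score` and the convention for two empty masks are assumptions made for this example and are not taken from the study pipeline.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """Dice coefficient between two binary masks (True = tumor epithelium).

    Hypothetical helper for illustration; not the study implementation.
    """
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    # Dice = 2 * |A and B| / (|A| + |B|)
    return 2.0 * np.logical_and(a, b).sum() / total
```

In such a setup, the same function could be applied to each pathologist pair and to each pathologist vs model pair per FOV, so that model agreement can be read against inter-pathologist agreement.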

A second model, termed Subcellular Segmentation of Tumor Cells (SSTC)72, was applied within identified tumor areas in order to define membrane, cytoplasmic and nuclear regions. Final segmentation of cells as instances was performed on predicted posteriors, constrained by the previously detected epithelial tumor region. Here, cell objects were segmented by identifying cell centers using non-maximum suppression (NMS), followed by a seeded watershed, using an implementation in Python 3.8, skimage and OpenCV73,74. To facilitate quantification of PD-L1 staining within stromal regions, stroma was structured into equally sized tiles of 1 µm edge length. Visual quality control of SSTC by subject matter experts under supervision of pathologists across 89 FOVs indicated 78 FOVs (87.6%) with ≥80% correctness and 11 FOVs (12.4%) with a quality rating below 80%.
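The instance segmentation step (NMS for cell centers followed by a seeded watershed) can be sketched with skimage as follows. This is an illustrative sketch only: the function name `segment_cells`, the use of a single cell-center posterior map both for seeding and as the watershed landscape, and the parameter values are assumptions, not the published SSTC post-processing.

```python
import numpy as np
from skimage.feature import peak_local_max
from skimage.segmentation import watershed


def segment_cells(center_posterior, tumor_mask, min_distance=5, threshold=0.5):
    """Split a posterior map into cell instances inside the tumor region.

    center_posterior: 2D float array, assumed to peak near cell centers.
    tumor_mask: 2D bool array of the detected epithelial tumor region.
    min_distance and threshold are hypothetical values for illustration.
    """
    tumor_mask = np.asarray(tumor_mask, dtype=bool)
    # Ignore posteriors outside the tumor region when looking for centers.
    masked = np.where(tumor_mask, center_posterior, 0.0)
    # NMS: keep local maxima above the threshold, at least min_distance apart.
    peaks = peak_local_max(
        masked,
        min_distance=min_distance,
        threshold_abs=threshold,
        exclude_border=False,
    )
    # One integer seed label per detected center.
    seeds = np.zeros(masked.shape, dtype=np.int32)
    seeds[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # Seeded watershed on the inverted posterior, limited to the tumor mask.
    return watershed(-masked, markers=seeds, mask=tumor_mask)
```

The returned label image assigns 0 to pixels outside the tumor mask and one positive integer per cell object inside it.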

Quantitative continuous scoring of PD-L1

Following segmentation, PD-L1 staining intensity was measured by deriving the continuous optical density (OD) score proportional to 3,3’-diaminobenzidine (DAB) intensity as described by van der Laak et al.75. Segmented cell compartments were used to query the OD values for each image pixel; e.g., for a segmented tumor cell membrane, the OD values of all included pixels were calculated and averaged to derive a cell membrane staining intensity (SI) measurement24. Similarly, SI measurements were calculated for other SSTC compartments (e.g., cytoplasm)24 and for stroma tiles. Finally, image features were derived which summarized the individual SI measurements into slide-level scores, such as the positive cell density, descriptive statistics over TC membranes, the percentage of positive TCs and descriptive stromal expression statistics (see Supplementary Table 7). The overall workflow for PD-L1 QCS as described above is illustrated in Fig. 8.
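To make the averaging step concrete, the sketch below computes a per-pixel OD and the mean OD per cell membrane. It is a simplification under stated assumptions: it applies the plain Beer-Lambert relation OD = -log10(I / I0) to a single intensity channel with an assumed white level I0 = 255, and does not reproduce the color model of van der Laak et al. The function names and the label-image layout are hypothetical.

```python
import numpy as np


def optical_density(intensity, background=255.0):
    """Per-pixel OD = -log10(I / I0); I0 (white level) is an assumed value."""
    i = np.clip(np.asarray(intensity, dtype=float), 1.0, background)
    return -np.log10(i / background)


def membrane_staining_intensity(od, membrane_labels):
    """Mean OD over the membrane pixels of each cell.

    membrane_labels: integer image, 0 = not membrane, k > 0 = membrane
    pixels belonging to cell k (hypothetical layout for this example).
    Returns a dict {cell_id: mean OD}.
    """
    labels = np.asarray(membrane_labels).astype(np.intp).ravel()
    values = np.asarray(od, dtype=float).ravel()
    sums = np.bincount(labels, weights=values)   # summed OD per label
    counts = np.bincount(labels)                 # pixel count per label
    cell_ids = np.nonzero(counts)[0]
    cell_ids = cell_ids[cell_ids > 0]            # drop background label 0
    return {int(k): sums[k] / counts[k] for k in cell_ids}
```

The same averaging could be applied to cytoplasm labels or to stroma tile labels to obtain the corresponding SI measurements.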

Fig. 8: Overview of the PD-L1 QCS analysis pipeline.

In the first step (left), WSIs are inspected and pathologists’ annotations of the tumor center are added. Next, the tumor center is segmented into four classes at the region level and four classes at the cell level. Subsequently, the staining intensity is calculated to measure PD-L1 expression at each pixel (brighter pixels indicate higher intensity). Finally, image-derived features are calculated, e.g., the percentage of positive tumor cells and descriptive statistics of the staining intensity distribution.

Calibration of PD-L1 QCS for patient stratification

A QCS biomarker signature encompasses several components, namely the QCS feature, its associated parameters and a cut-point, which together allow a WSI to be classified as QCS biomarker positive or negative24,29,57. To derive the percentage of positive TCs in a slide as a feature, the individual cell membrane SI is first determined by averaging the OD over all pixels in the cell membrane. Next, an SI threshold is applied to define each detected cell as positive or negative, from which the percentage of positive TCs is calculated. Finally, a cut-point, which defines what proportion of cells must meet the positivity threshold for a sample to be considered positive, assigns the patient to the QCS biomarker negative (QCS-) or QCS biomarker positive (QCS+) category76. For other features, further parameters may apply, e.g., the choice of quantile of the overall membrane intensity distribution over all detected TCs (see Supplementary Table 7). In subsequent analyses, the feature parameters (e.g., the TC positivity threshold) and the overall feature cut-point were treated as free parameters of the QCS system24.
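The two-level decision described above (an SI threshold per cell, then a cut-point per slide) can be sketched as follows. The function names, the use of "greater than or equal to" at both levels and all numeric values are assumptions for illustration; the source treats the threshold and cut-point as free parameters.

```python
import numpy as np


def percent_positive_tcs(membrane_si, si_threshold):
    """Percentage of tumor cells whose membrane SI meets the SI threshold."""
    si = np.asarray(membrane_si, dtype=float)
    if si.size == 0:
        raise ValueError("no tumor cells detected")
    # A cell is called positive if its mean membrane OD is >= the threshold
    # (whether the comparison is inclusive is an assumption of this sketch).
    return 100.0 * float(np.mean(si >= si_threshold))


def qcs_status(membrane_si, si_threshold, cut_point_percent):
    """Assign QCS+ or QCS- by comparing the slide-level feature to a cut-point."""
    pct = percent_positive_tcs(membrane_si, si_threshold)
    return "QCS+" if pct >= cut_point_percent else "QCS-"


# Hypothetical per-cell membrane SI values for one slide.
example_si = [0.1, 0.3, 0.5, 0.7]
example_status = qcs_status(example_si, si_threshold=0.4, cut_point_percent=50.0)
```

With these made-up values, two of four cells meet the threshold, so the feature is 50% and the slide falls on the positive side of a 50% cut-point.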