Introduction

Breast cancers represent a diverse array of tumors characterized by significant differences in morphology, molecular profiles, clinical presentations, biological behaviours, and therapeutic responses1. Treatment decisions primarily rely on the clinical stage of the disease and the outcomes of conventional biomarker analyses, which include assessment of estrogen and progesterone receptors, human epidermal growth factor receptor 2 (HER2), and the proliferative marker Ki672.

Human epidermal growth factor receptor 2 (HER2) belongs to the epidermal growth factor receptor (EGFR) family of receptor tyrosine kinases. About 15% of breast cancers exhibit HER2 amplification and HER2 protein overexpression, resulting in increased aggressiveness and reduced progression-free and overall survival rates3.

The advancement of HER2-targeting therapies over the past thirty years has resulted in markedly improved clinical outcomes across all stages and altered the prognosis for this patient population4.

HER2 overexpression testing in routine practice utilizes immunohistochemistry (IHC), and in cases of equivocal results (score 2+), in situ hybridization (ISH) is employed. HER2 status is categorized into four categories (scores 0, 1+, 2+ and 3+). However, only score 3+ cases and score 2+ cases with ISH amplification have demonstrated clinical benefit from anti-HER2 agents, which inhibit the hyperactive HER2 pathway resulting from HER2 amplification and protein overexpression5.

Until recently, IHC negative scores of 0 and 1+, along with equivocal score 2+, ISH non-amplified cases, were collectively classified as the “HER2-negative” group regarding therapeutic options, rendering the differentiation between these scores irrelevant to oncologists.

The DESTINY-Breast 04 clinical study indicates that breast cancers exhibiting low HER2 protein expression, without ERBB2 amplification, respond positively to the novel antibody-drug conjugate Trastuzumab-deruxtecan (T-DXd). This therapy utilizes antibodies to deliver cytotoxic agents to cells expressing a specific target protein while also activating the host immune system6. Breast cancers exhibiting HER2 protein expression, previously categorized as negative (score 1+ and score 2+, ISH non-amplified), demonstrated significantly extended progression-free and overall survival (9.9 versus 5.1 months and 23.4 versus 16.8 months, p < 0.001 for both) when treated with T-DXd, across both luminal (hormone receptor-positive) and triple-negative subgroups6. Subsequent to these findings, both the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) granted approval for T-DXd in the treatment of unresectable or metastatic “HER2-low” breast cancers7,8.

In the DESTINY-Breast 04 clinical study, “HER2-low” breast cancers encompassed cases with IHC scores 1+ and 2+ that were ISH non-amplified. This expansion of criteria broadens treatment options for a significantly larger population of hormone receptor-positive and triple-negative breast cancer patients who previously did not qualify for HER2-targeted therapies6.

The 2023 updated ASCO/CAP guidelines highlight the necessity of accurate HER2 scoring, while rejecting the term “HER2-low”. This decision is based on the observation that such tumors do not represent a distinct molecular and biological category, and that tumors exhibiting incomplete membranous staining in less than 10% of tumor cells (designated as HER2 “ultra-low”) were excluded from the DESTINY-Breast 04 study5.

Previous studies assessing concordance among pathologists in “HER2-low” scoring have yielded mixed results, with reported concordance ranging from good to moderate and, in some studies, significant discrepancies9,10,11,12,13,14. This study was initiated in response to these results and involves experts in breast pathology from four Balkan countries.

Results

Cohort characteristics

In the study, 20 samples were analyzed, with the majority being invasive breast carcinomas NST (70%). The remaining samples comprised one case each of tubular carcinoma, mucinous carcinoma, invasive micropapillary carcinoma, carcinoma with apocrine differentiation, invasive lobular carcinoma, and mixed invasive lobular and NST carcinoma. The median age of patients was 65 years. Seventy-five percent of the cases were surgical specimens, while 25% were core needle biopsy specimens (Table 1).

Table 1 Clinico-pathological characteristics of the cohort.

According to the original IHC scoring, nine of the samples were scored as 1+, eight samples were scored as 0, one sample was scored as 2+ and two samples were scored as 3+. The score 2+ case was SISH non-amplified (Table 1).

Levels of agreement

Levels of agreement were assessed for the individual scoring categories and for two combined scoring categories.

Agreement levels were divided into four categories: absolute when the agreement was 100%, and high, moderate, and poor when the agreement was higher than 75%, higher than 50% and lower than 50%, respectively (Table 2).

Table 2 Levels of agreement.

Absolute agreement with individual scores was found in five cases, with all participants scoring four samples as score 0 and one sample as score 3+. High agreement was found in 10 samples, and moderate agreement in five, all of which were score 1+. When score 1+ and score 2+ were combined into a HER2-low category, two additional samples were categorized as absolute agreement, and only two samples remained in the moderate agreement category (Tables 3, 4 and 5).

Table 3 Scoring of individual samples with histologic subtypes.
Table 4 Levels of agreement achieved for each of 20 samples.
Table 5 Levels of agreement among raters (pathologists) across individual and combined IHC scoring categories.

Inter-observer and consensus observers’ agreement

Cohen’s weighted kappa was used to evaluate the degree of agreement between two raters while also taking into account the magnitude of disagreements (Table 6, Fig. 1), in the entire cohort and additionally in HER2-low cases14. The kappa values in the table are grouped into five categories: almost perfect, substantial, moderate, fair and slight, which represent the ranges 0.81–1, 0.61–0.8, 0.41–0.6, 0.21–0.4 and 0–0.2, respectively15. Inter-observer agreement was substantial (11.8%) to almost perfect (85.25%) for all cases, and was mostly moderate (38.85%) to substantial (33.01%) for HER2-low cases. Consensus observers’ agreement was almost perfect for the whole cohort (98.36%), and moderate (42.62%) to fair (36.07%) for HER2-low cases.
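The grouping of kappa values into the five categories above can be illustrated with a short Python sketch. The function names and the example kappa values are hypothetical and are not taken from the study; rounding to two decimals before banding, and labelling negative values "below range", are assumptions made here because the stated ranges do not cover those cases.

```python
from collections import Counter

# Lower bounds of the five bands quoted in the text, highest first.
BANDS = [
    (0.81, "almost perfect"),
    (0.61, "substantial"),
    (0.41, "moderate"),
    (0.21, "fair"),
    (0.00, "slight"),
]


def interpret_kappa(kappa):
    """Map a kappa value to its agreement band.

    Assumption: the value is rounded to two decimals first, so that a value
    such as 0.805 does not fall into the gap between 0.80 and 0.81.
    """
    value = round(kappa, 2)
    for lower_bound, label in BANDS:
        if value >= lower_bound:
            return label
    # Negative kappa is not covered by the ranges in the text (assumption).
    return "below range"


def band_distribution(kappas):
    """Return the percentage of kappa values that fall into each band."""
    counts = Counter(interpret_kappa(k) for k in kappas)
    return {label: 100.0 * counts[label] / len(kappas) for _, label in BANDS}


# Hypothetical pairwise kappa values, for illustration only.
example = [0.9, 0.9, 0.7, 0.5]
distribution = band_distribution(example)
```

Applied to all rater pairs, a tabulation of this kind would give percentage shares per band similar in form to those reported in Table 6.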

Table 6 Distribution of Cohen’s weighted kappa values, indicating the level of agreement among raters.
Fig. 1

The heatmap displays Cohen’s weighted kappa values among pathologists for two different sets of cases: the whole cohort (upper triangle) and the HER2-low cases (lower triangle). The x-axis and y-axis represent the indices of the pathologists (P1 to P61), and the colour scale indicates the kappa values, ranging from 0 (no agreement) to 1 (perfect agreement). C-All indicates the consensus scores for all cases, and C-Low indicates the consensus scores (reference scores) only for the HER2-low cases.

Additionally, Fleiss’ kappa values, representing the level of agreement among pathologists for individual and combined IHC scoring categories, were calculated. The IHC scoring categories are 0, 1+, 2+, and 3+, with an additional column for Overall Agreement (OA) (Table 7). The table shows a notable increase in Fleiss’ kappa values after combining IHC scoring categories, suggesting a reduction in variability and enhanced consistency in pathologists’ scoring, although this can be attributed to the loss of information caused by merging categories.

Table 7 Fleiss’ kappa values, representing the level of agreement among pathologists for individual and combined IHC scoring categories.

Discussion

Historically, pathologists have reported HER2 status in breast cancer in a binary manner: HER2 positive (IHC 3+ or 2+, ISH amplified) and negative (IHC 0, 1+ or 2+, ISH non-amplified), as only HER2 positive patients qualified for anti-HER2 treatment3.

The recent advancement of a new generation of antibody-drug conjugates, including Trastuzumab-deruxtecan, provides a novel therapeutic option for patients exhibiting lower levels of HER2 protein (score 1+ and 2+, ISH non-amplified). The impressive results of the DESTINY-Breast 04 study6 and the subsequent approval of T-DXd by both the FDA and EMA7,8 raise an important question: can pathologists accurately assess HER2-low levels?

The updated 2023 guidelines from the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP) for HER2 testing indicate that there are no changes to the recommendations established in 2018 for HER2 IHC scoring but call for increased awareness of IHC 1+ or 2+ non-amplified cases to better stratify patients eligible for T-DXd therapy5.

Concordance studies of HER2-low scoring among pathologists are crucial, particularly in distinguishing between scores of 0 and 1+5.

This study represents the first multicentric study of concordance in pathologists’ reporting of the HER2-low category conducted in four countries in the Balkan region: Bulgaria, Croatia, Serbia and Montenegro.

A total of 20 samples and 61 pathologists were included. The findings indicate a strong consensus among the pathologists participating in the study. The majority of cases exhibited high to absolute agreement among pathologists, particularly when HER2-low cases were aggregated into a single category. Moderate agreement rates were identified in five samples classified as HER2-low (score 1+), whereas the highest concordance was noted in the score 0 and score 3+ categories, consistent with expectations.

A 2022 study published in JAMA Oncology by Fernandez et al. involving pathologists from the USA reported a 41% discordance rate in distinguishing HER2 0 from HER2 1+, 2+ or 3+. In contrast, the discordance rate for HER2 3+ versus non-3+ was only 11%9. These data indicate that, despite the ASCO/CAP HER2 guidelines providing explicit criteria for HER2 scoring, pathologists may overlook the critical distinction between HER2 0 and HER2 1+ if they do not recognize the significance of accurately defining the HER2-low category.

Subsequent studies indicate comparable high levels of agreement, as demonstrated by Vialle et al. and Ruschoff et al., both of which reported agreement levels exceeding 80%10,11. Discrepancies in inter-observer scoring predominantly occurred in cases scored 1+, aligning with our findings.

This ring study focused on including breast pathology experts from large hospitals, as they are the primary evaluators of samples in our countries and represent a suitable cohort for comparison with previously published results. It is important to note that this cohort selection may not accurately reflect the true status of breast cancer pathology in various countries, including those in the Balkan region. The majority of participants had received formal education in HER2-low scoring prior to this study. Pathologists in smaller hospitals, which perform fewer breast cancer biopsies, often lack specialized training in breast cancer pathology and in the assessment of predictive markers like HER2, and are unlikely to achieve the same level of concordance.

In addition to temporal and spatial heterogeneity, various factors influence IHC scoring, including pre- and post-analytical variables, test sensitivity, specimen type and the experience of the laboratory and/or pathologist16,17,18. Membranous staining may exhibit heterogeneity, characterized by varying intensity and type in different areas. Additionally, cytoplasmic staining may occur, complicating the evaluation process.

The findings of this study indicate that pathologists from major institutions handling breast cancer patients in the Balkan region can consistently assess “HER2-low” status in routine practice. Continuous education, increased awareness of the significance of HER2-low levels and regular workshops led by experts are essential for all pathologists involved in the evaluation of predictive biomarkers in breast cancers, in both larger and smaller institutions.

The primary limitation of this study is the relatively small sample size. Nevertheless, we believe that, combined with the large number of participating pathologists, it allowed us to collect sufficient data for a reliable analysis. Additional studies with a larger sample size and an increased number of pathologists are planned to validate the obtained results.

Materials and methods

The research was carried out in four Balkan countries (Bulgaria, Croatia, Montenegro and Serbia), encompassing all major hospitals engaged in breast cancer diagnosis and treatment within these nations. A total of 18 hospitals were included: five from Bulgaria, six from Croatia, one from Montenegro and six from Serbia, along with 61 breast pathology experts.

All pathologists participating in this study are acknowledged as leading authorities in breast pathology, both nationally and internationally. Inclusion was limited to individuals from major centers that manage a substantial volume of breast cancer biopsies in their routine practice. The majority have more than 10 years of experience in breast pathology and participate in continuous training and education within the discipline.

Twenty samples of invasive breast cancers, including surgical specimens and core needle biopsies from routine practice, enriched with HER2 0, 1+ and 2+ samples, were selected from the archives of a single institution (Department of Pathology, University Hospital Split, Croatia). All slides were stained using the Ventana 4B5 assay on the Ventana Autostainer (Ventana BenchMark Ultra, Roche, Indianapolis, USA), following the manufacturer’s instructions. Controls included four breast cancer cell lines (HER2 Analyte ControlDR, HistoCyte Laboratories Ltd, The Neon Building, Quorum Park, Benton Lane, Newcastle upon Tyne, United Kingdom) of HER2 scores 0, 1+, 2+ and 3+, according to ASCO/CAP guidelines5. All samples were evaluated on H&E slides and IHC stained slides by five expert pathologists trained in HER2-low scoring (consensus score). Data on patients’ age, histological type, hormonal status and the original HER2 scores were obtained from the histopathology reports. The SISH amplification status for all cases was also recorded, and in this study showed amplification only in samples that were scored as 3+ by both the consensus and original scoring.

A statistical analysis was performed on the whole cohort as well as on samples split into two groups: HER2-low (HER2 IHC scores 1+ or 2+, SISH non-amplified) and HER2-positive (HER2 IHC score 3+).

In this study, 61 pathologists evaluated 20 samples. The level of agreement among the pathologists was categorized based on several thresholds: absolute (100% agreement), high (76–99% agreement), moderate (50–75% agreement), and poor in cases with less than 50% agreement.
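A minimal Python sketch of this categorization is given below. It assumes that the agreement for a sample is the percentage of pathologists who gave the most frequent score, which the text does not state explicitly; the function names and the example scores are hypothetical and are not study data.

```python
from collections import Counter


def modal_agreement(scores):
    """Percentage of raters who gave the most frequent score for one sample.

    Assumption: per-sample agreement is defined by the modal score.
    """
    counts = Counter(scores)
    return 100.0 * max(counts.values()) / len(scores)


def agreement_level(percent):
    """Apply the thresholds from the text: 100, above 75, 50 to 75, below 50."""
    if percent == 100:
        return "absolute"
    if percent > 75:
        return "high"
    if percent >= 50:
        return "moderate"
    return "poor"


def merge_her2_low(scores):
    """Combine scores 1+ and 2+ into a single HER2-low category."""
    return ["low" if s in ("1+", "2+") else s for s in scores]


# Hypothetical sample scored by ten raters, for illustration only.
sample = ["1+"] * 7 + ["2+"] * 3
individual = agreement_level(modal_agreement(sample))
combined = agreement_level(modal_agreement(merge_her2_low(sample)))
```

In this invented example the sample shows moderate agreement on individual scores (70%) and absolute agreement once scores 1+ and 2+ are merged.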

To further assess inter-observer reliability, both Fleiss’ kappa and Cohen’s weighted kappa were calculated. These statistical measures provided insights into the consistency of the ratings across the cohort, with Fleiss’ kappa used for assessing agreement among multiple pathologists and Cohen’s weighted kappa, with a quadratic weight metric, employed to evaluate pairwise agreement.
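As an illustration of pairwise agreement with quadratic weights, the following plain-Python sketch computes Cohen’s weighted kappa for two raters from the standard formula, with weight (i − j)² / (k − 1)² for a disagreement between ordered categories i and j out of k. It is not the study’s own code; the function name and the example scores are hypothetical.

```python
CATEGORIES = ["0", "1+", "2+", "3+"]  # ordered IHC scores


def quadratic_weighted_kappa(rater_a, rater_b, categories=CATEGORIES):
    """Cohen's kappa with quadratic disagreement weights for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty sample set")
    index = {category: i for i, category in enumerate(categories)}
    k = len(categories)
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores: a one-step miss is penalized less than a three-step miss.
reference = ["0", "0", "3+", "3+"]
near_miss = ["0", "1+", "3+", "3+"]
far_miss = ["0", "3+", "3+", "3+"]
kappa_near = quadratic_weighted_kappa(reference, near_miss)
kappa_far = quadratic_weighted_kappa(reference, far_miss)
```

Here a single disagreement of one category (0 versus 1+) lowers kappa far less than a single disagreement of three categories (0 versus 3+), which is the property that makes the weighted form suitable for ordered IHC scores. Library implementations exist as well, for example `cohen_kappa_score(..., weights="quadratic")` in scikit-learn; the text does not state which implementation was used.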

The consensus scores served as the reference scores (true labels) throughout the analysis. Cohen’s weighted kappa was applied to both the entire cohort and specifically the HER2-low cohort, as demonstrated in the heatmap (Fig. 1), with the upper triangle depicting the whole cohort and the lower triangle the HER2-low cohort.

Fleiss’ kappa was calculated for the individual scoring categories, as well as for combined categories: scores 1+ and 2+ grouped together (1/2), and score 0 against all other scores.
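The effect of grouping categories on Fleiss’ kappa can be sketched in plain Python as follows. This is a textbook implementation of the overall Fleiss’ kappa, not the study’s own code, and it does not reproduce the per-category values; the function names, the category label "1/2" as a code value, and the small ratings matrix are hypothetical.

```python
def fleiss_kappa(ratings, categories):
    """Overall Fleiss' kappa.

    ratings: one list of scores per sample, each with the same number of raters.
    """
    n_samples = len(ratings)
    n_raters = len(ratings[0])
    if any(len(sample) != n_raters for sample in ratings):
        raise ValueError("every sample must be scored by the same number of raters")

    # counts[i][j] = number of raters assigning sample i to category j
    counts = [[sample.count(c) for c in categories] for sample in ratings]

    # Mean observed agreement across samples.
    per_sample = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_sample) / n_samples

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_samples * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1:
        return float("nan")  # all ratings in one category; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


def regroup(ratings, mapping):
    """Relabel scores according to mapping; unmapped scores are kept as they are."""
    return [[mapping.get(score, score) for score in sample] for sample in ratings]


# Hypothetical ratings (4 samples, 4 raters), for illustration only.
ratings = [
    ["0", "0", "0", "0"],
    ["1+", "1+", "2+", "2+"],
    ["1+", "2+", "1+", "2+"],
    ["3+", "3+", "3+", "3+"],
]
kappa_individual = fleiss_kappa(ratings, ["0", "1+", "2+", "3+"])
kappa_low_merged = fleiss_kappa(
    regroup(ratings, {"1+": "1/2", "2+": "1/2"}), ["0", "1/2", "3+"]
)
kappa_zero_vs_rest = fleiss_kappa(
    regroup(ratings, {"1+": "rest", "2+": "rest", "3+": "rest"}), ["0", "rest"]
)
```

In this invented example the disagreement lies entirely between scores 1+ and 2+, so merging them raises kappa to 1. The information about which of the two scores was given is lost in the process.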

All statistical analyses were performed using Python 3.10.

The study was approved by the Ethics Committee of the University Hospital of Split (Split, 24 May 2024; 520-03/24-01/108). Informed consent was obtained from all patients for the additional HER2 scoring and data collection. All methods regarding tissue processing, staining, scoring and use of control samples were carried out in accordance with current ASCO/CAP guidelines.