Introduction

Approximately 15% of breast cancers exhibit human epidermal growth factor receptor 2 (HER2) gene overexpression1. In breast cancer molecular classification, HER2 overexpression is recognized as an independent subtype2, and targeted therapies such as trastuzumab have been developed against this receptor3. Current guidelines recommend a combined detection approach using HER2 immunohistochemistry (IHC) with fluorescence in situ hybridization (FISH)/in situ hybridization (ISH) to accurately identify patients eligible for HER2-targeted treatment4. In recent years, the novel antibody-drug conjugate trastuzumab deruxtecan (T-DXd) has transformed the treatment landscape by demonstrating efficacy even in HER2-non-amplified breast cancers—a population historically considered unresponsive to conventional HER2-targeted agents like trastuzumab5. This breakthrough stems from several key pharmacological advantages of T-DXd, including its unique mechanism to overcome drug resistance, enhanced tumor cell cytotoxicity, and a potent bystander effect6,7,8. Accumulating clinical evidence has confirmed that T-DXd significantly improves survival outcomes in patients with chemotherapy-resistant advanced breast cancer9 and provides substantial clinical benefit for both HER2-low10,11 and ultralow-expressing populations12. These advances have heightened the clinical relevance of precisely identifying HER2 expression levels, particularly in the low and ultralow ranges.

However, accurate HER2 IHC testing—essential for guiding these therapeutic decisions—is hampered by multiple technical challenges. Pre-analytical variables such as variations in fixation time and fixative concentration, analytical inconsistencies in section thickness and staining conditions, and interpreter subjectivity during evaluation all contribute to substantial inter-laboratory variability13,14. This lack of reproducibility poses a critical barrier to the reliable identification of HER2-low and ultralow cases, potentially affecting patient access to appropriate targeted therapies.

To address these issues, this study systematically investigates the clinical significance and methodological reliability of HER2 IHC testing through three integrated dimensions: First, we analyze the clinicopathological characteristics associated with different HER2 expression categories—specifically distinguishing HER2-ultralow from HER2-null cases—to uncover potential biological and prognostic implications. Second, we conduct a multicenter reproducibility assessment of HER2 IHC staining and interpretation to identify key variables affecting consistency and to propose standardization strategies. Third, we evaluate the utility of artificial intelligence (AI) tools in improving interpretation agreement and accuracy, thereby providing practical insights into the evolving role of computational pathology in HER2 scoring.

Results

Clinicopathological differences between HER2-low and HER2-ultralow expression

Among 1455 HER2-non-amplified cases, 1379 (94.78%) were invasive breast carcinoma of no special type (NST), while 76 (5.22%) were special subtypes, including 7 (0.48%) lobular carcinomas, 30 (2.06%) mucinous carcinomas, 27 (1.86%) micropapillary carcinomas, 4 (0.27%) cribriform carcinomas, 1 (0.07%) adenoid cystic carcinoma, and 7 (0.48%) metaplastic carcinomas. The cohort comprised 1269 (87.22%) hormone receptor-positive (HR+) cases and 186 (12.78%) triple-negative breast cancers (TNBCs). HER2-low expression was identified in 1119 cases (76.91%) and HER2 0 in 336 (23.09%); the HER2 0 group comprised 134 HER2-ultralow (9.21% of the cohort) and 202 HER2-null (13.88%) cases. No significant differences were observed among groups regarding patient age, tumor T stage, or N stage. Compared with HER2 0 cases (both HER2-ultralow and HER2-null), HER2-low tumors showed significantly lower histological grade (all p < 0.001), higher estrogen receptor (ER)/progesterone receptor (PR) expression, and lower Ki-67 indices (HER2-ultralow vs. HER2-low: ER p < 0.001, PR p = 0.006, Ki-67 p < 0.001; HER2-null vs. HER2-low: ER p < 0.001, PR p < 0.001, Ki-67 p = 0.004), whereas the HER2-ultralow and HER2-null groups did not differ significantly from each other in these features. In addition, HER2-null cases demonstrated significantly higher tumor-infiltrating lymphocytes (TILs) than HER2-low cases (p = 0.005) (Table 1).
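As a quick arithmetic check on the category breakdown above, the following minimal Python sketch recomputes the reported percentages from the case counts given in the text. It assumes, as the counts imply, that HER2-ultralow and HER2-null are the two subsets of HER2 0; the variable and function names are illustrative only and are not part of the study's analysis.

```python
# Case counts as reported in the text for the HER2-non-amplified cohort.
TOTAL = 1455
counts = {
    "HER2-low": 1119,
    "HER2-ultralow": 134,
    "HER2-null": 202,
}


def pct(n, total=TOTAL):
    """Share of the cohort in percent, rounded to two decimals as in the text."""
    return round(100 * n / total, 2)


# Assumption: HER2 0 is the union of the HER2-ultralow and HER2-null groups.
her2_zero = counts["HER2-ultralow"] + counts["HER2-null"]

for name, n in counts.items():
    print(f"{name}: {n} ({pct(n)}%)")
print(f"HER2 0: {her2_zero} ({pct(her2_zero)}%)")
```

The two HER2 0 subsets sum to 336 and, together with the HER2-low cases, account for all 1455 cases, which matches the figures reported above.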

Table 1 Comparison of clinicopathological features in HER2-negative breast cancer (n = 1455)

All cases were stratified into HR+ breast cancer and TNBC groups. Among HR+ cases, HER2-ultralow cases showed higher ER expression than HER2-null cases (p = 0.018), and ER expression correlated positively with HER2 IHC score within HER2-low cases (HER2 2+ vs. HER2 1+, p = 0.023). However, ER expression did not differ significantly between HER2-ultralow and HER2-low cases. No significant associations were observed between HER2 IHC scores and patient age, T stage, N stage, or TIL levels. Consistent with the overall analysis, HER2-null HR+ tumors demonstrated significantly higher histological grade than HER2-low tumors (p = 0.011), and HER2-ultralow cases exhibited higher Ki-67 indices than HER2-low cases (p = 0.042). In addition, HER2 1+ cases showed lower Ki-67 indices than HER2 2+ cases (p = 0.034).

In TNBC cases, HER2 scores showed no association with T stage, N stage, or TILs, and no clinicopathological differences were observed between the HER2-ultralow and HER2-null subgroups. Other comparisons did show differences: the proportion of young patients (≤50 years) was higher in the HER2 1+ group than in the HER2 2+ group (p = 0.013), and higher in the HER2-null group than in the combined HER2-low group (p = 0.040). Regarding histological grade, HER2-null TNBC demonstrated higher grades than HER2-low cases (p = 0.007), and HER2 1+ tumors showed higher grades than HER2 2+ tumors (p < 0.001). Ki-67 indices followed the same pattern as histological grade: both the HER2 1+ group (vs. HER2 2+, p = 0.030) and the HER2-null group (vs. the combined HER2-low group, p = 0.003) demonstrated significantly higher Ki-67 indices (Table 2).

Table 2 Clinicopathological characteristics of HR+/HER2- vs. triple-negative breast cancers by HER2 IHC scores (n = 1269 vs. 186)

HER2 immunohistochemical staining performance

The overall staining concordance rate between all three groups of test slides and reference standards was 72.60%, with Tissue 1 (HER2 2+) showing 55.56% concordance (1 case [2.22%] HER2 0 and 19 cases [42.22%] HER2 1+), Tissue 2 (HER2 0) demonstrating 66.67% concordance (12 cases [26.67%] HER2 1+ and 3 cases [6.67%] HER2 2+), and Tissue 3 (HER2 3+/2+) achieving 95.56% concordance (2 cases [4.44%] HER2 2+). Two laboratories (4.44%) failed external controls (Fig. 1).
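The per-tissue concordance rates above can be reproduced from the reported non-concordant counts. The sketch below is a minimal illustration, assuming that each tissue was stained by all 45 laboratories (consistent with 1 case corresponding to 2.22%) and that the overall rate pools the three tissues; the names used are hypothetical and not taken from the study.

```python
# Number of participating laboratories (one test slide per tissue per laboratory, assumed).
N_LABS = 45

# Non-concordant staining results per tissue, as reported in the text.
discordant = {
    "Tissue 1": {"HER2 0": 1, "HER2 1+": 19},
    "Tissue 2": {"HER2 1+": 12, "HER2 2+": 3},
    "Tissue 3": {"HER2 2+": 2},
}


def concordance(n_discordant, n_total):
    """Concordance rate in percent: concordant slides divided by all slides."""
    return 100 * (n_total - n_discordant) / n_total


per_tissue = {
    tissue: concordance(sum(calls.values()), N_LABS)
    for tissue, calls in discordant.items()
}

# Pooled rate over all 3 x 45 slides.
total_discordant = sum(sum(calls.values()) for calls in discordant.values())
overall = concordance(total_discordant, N_LABS * len(discordant))

for tissue, rate in per_tissue.items():
    print(f"{tissue}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

Under these assumptions the per-tissue rates match the text (55.56%, 66.67%, 95.56%). The pooled rate comes out at 98/135, about 72.59%, marginally below the reported 72.60%, which presumably reflects rounding.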

Fig. 1: Examples of unacceptable HER2 IHC staining in test slides.

A A test slide from Tissue 1 showing unacceptable IHC 1+ staining (200×). B Weak external control for the test slide from Tissue 1 (200×). C IHC 1+ region in a test slide from Tissue 3 (200×). D IHC 2+ region in the same test slide from Tissue 3 (200×). E Tumor heterogeneity observed in the test slide from Tissue 3 (20×). F Weak external control corresponding to the test slide from Tissue 3 (200×).

Among antibody clones, the 4B5 group showed a higher HER2 1+ rate (78.95%) in HER2 2+, FISH-negative samples than MXR001 and the remaining antibody clones, grouped here as laboratory-developed tests (LDTs) (MXR001 vs. 4B5, p = 0.003; 4B5 vs. LDTs, p < 0.001). In HER2 0 samples, the 4B5 group had a higher HER2 0 rate (84.21%) than the MXR001 group (p = 0.017), while in HER2 3+ heterogeneous samples, the MXR001 group showed a higher HER2 3+ rate (69.23%) than the 4B5 and LDT groups (MXR001 vs. 4B5, p = 0.011; MXR001 vs. LDTs, p = 0.015). Across platforms, Roche Ventana showed significantly higher HER2 1+ rates (94.12% vs. 10.71%, p < 0.05) in HER2 2+, FISH-negative samples and higher HER2 0 rates (100% vs. 12.5%, p < 0.05) in HER2 0 samples than Leica BOND MAX/Titan/UltraPATH, as well as higher HER2 2+ rates (94.12% vs. 0%, p < 0.05) in HER2 3+ heterogeneous samples than Titan (key findings summarized in Table 3). The full statistical breakdown for all antibody clones and staining platforms is provided in Supplementary Tables S1 and S2.

Table 3 Analysis of key technical factors influencing HER2 immunohistochemical staining

Concordance of HER2 score interpretation

The overall concordance rate between the tested laboratories’ interpretations and the reference scores was 68.89% across all three slide sets. When analyzed by predefined reference score, concordance was highest for slides with a reference score of HER2 3+ (86.67%), followed by HER2 0 (70.97%), and lowest for HER2 1+ (57.58%). The concordance for slides defined as HER2-low was 65.56% (Fig. 2).

Fig. 2: Concordance and variability in test slide evaluations.

Overall interpretation consistency was moderate (kappa = 0.566, p < 0.001), with Tissue 2 (HER2 0) showing the highest agreement (kappa = 0.458, p < 0.001, moderate strength), while the other two groups demonstrated fair agreement (Tissue 1 [HER2 2+]: kappa = 0.391, p = 0.001; Tissue 3 [HER2 2+/3+]: kappa = 0.373, p = 0.004) (Table 4).

Table 4 Concordance analysis between testing laboratories and review team interpretations of HER2 immunohistochemical staining

The role of AI in HER2 immunohistochemical interpretation

Following AI-assisted interpretation, the most frequent score adjustment across all groups (Pathologist A, Pathologist B, and Consensus Scoring [CS]) occurred in cases reclassified from HER2 0 to HER2 1+ (A: 10.5%; B: 4.3%; CS: 5.3%), resulting in an overall reduction of 3.7% in HER2 0 cases. For HER2 2+ cases, AI-assisted interpretation led to balanced redistributions toward both ends (HER2 1+ and HER2 3+), though manual interpretations exhibited a pronounced tendency to downgrade HER2 2+ to HER2 1+ rather than upgrade to HER2 3+ (A: 4.8% vs. 0.5%; B: 3.3% vs. 0%; CS: 2.9% vs. 0%). Notably, all groups demonstrated a modest yet consistent decrease in HER2 2+ cases (A: 1 case [3.0%]; B: 2 cases [7.1%]; CS: 3 cases [9.4%]), while changes in other score categories showed no discernible pattern (Fig. 3).

Fig. 3: Distribution of manual and AI-assisted scoring results.

A–C The changes in HER2 IHC scores across different groups following AI interpretation. D The quantitative changes in each HER2 IHC score category before and after AI analysis. AI-a AI-assisted, CS consensus scoring.

AI assistance improved both concordance rates and agreement levels between pathologists, with overall interpretation concordance increasing from 79.43% to 85.17% and kappa values improving from 0.711 (p < 0.001) to 0.794 (p < 0.001). For non-amplified cases specifically, concordance rose from 77.02% to 82.61%, with kappa improving from 0.610 (p < 0.001) to 0.713 (p < 0.001). When individual pathologists’ results were compared with consensus scores, Pathologist A’s concordance improved from 86.60% to 91.39% (kappa from 0.813 to 0.881), while Pathologist B’s improved from 92.82% to 93.78% (kappa from 0.899 to 0.913). Non-amplified cases showed relatively lower but still improved agreement (Pathologist A: kappa from 0.751 to 0.836; Pathologist B: kappa from 0.861 to 0.875; all p < 0.001) (Table 5, Fig. 4).

Fig. 4: Concordance analysis between manual and AI-assisted HER2 scoring.

CS consensus scoring.

Table 5 Concordance rates and agreement between manual and AI-assisted HER2 scoring

AI interpretation markers for HER2 3+ to 0 cases are shown in Fig. 5. Manual review identified interpretation discrepancies in 35 AI-classified cases. These were primarily attributable to unrecognized cells (21/35, 60%) and cell type misclassification (9/35, 25.7%); the remainder involved excessive staining intensity interfering with cell detection (3/35, 8.6%) and nonspecific cytoplasmic staining affecting interpretation accuracy (2/35, 5.7%).

Fig. 5: AI interpretation examples.

Discussion

Previous studies have suggested that HER2-ultralow tumors demonstrate higher histological grade and lymph node metastasis rates, along with decreased ER/PR expression, compared with HER2-low cases, and that HER2-null tumors exhibit higher histological grade and lower ER/PR expression than HER2-ultralow cases15. Our findings confirm that invasive carcinomas of the HER2-null subtype are associated with a higher histological grade than HER2-low cases. Although the HER2-ultralow group also showed a trend towards higher grade compared with the HER2-low group in the overall cohort, this difference was no longer significant after the cases were stratified into HR+ breast cancer and TNBC subgroups for separate analysis. In HR+ tumors, we confirmed higher ER expression in HER2-ultralow versus HER2-null cases, suggesting that tumors with HER2-ultralow expression are more likely to benefit from endocrine therapy than those with HER2-null expression. Based on these characteristics, we propose that HR+ breast cancers with HER2-ultralow expression more closely resemble HER2-low than HER2-null tumors in their clinicopathological features. In TNBCs, by contrast, tumors with HER2-ultralow expression showed a trend toward closer resemblance to those with HER2-null expression, although this was not statistically significant. Although tumor-infiltrating lymphocytes (TILs) were significantly more abundant in HER2-null than in HER2-low tumors overall, this difference disappeared when the HR+ and TNBC subgroups were analyzed separately. These subtype-specific patterns may call for more refined treatment stratification. Limited research exists on TILs in luminal/HER2-negative breast cancer: some evidence suggests that high TILs may correlate with poorer overall survival (but not disease-free survival) after neoadjuvant chemotherapy16, whereas a more recent study showed a favorable prognosis associated with high TILs in young patients treated with adjuvant chemotherapy or endocrine therapy17.

HER2 immunohistochemistry is the most cost-effective and efficient method for patient stratification, which makes quality control within and between laboratories particularly important. We therefore conducted a multicenter assessment of both staining and interpretation. Given the inter-laboratory variability in IHC staining and interpretation, we did not require further subcategorization of HER2 0 into HER2-ultralow and HER2-null in our scoring system. The evaluation revealed that HER2 3+ staining showed superior reproducibility compared with HER2 0 and non-amplified HER2 2+ cases, with the latter demonstrating the lowest consistency. Ventana platforms showed distinct staining patterns for non-amplified HER2 2+ cases (higher IHC 1+ rates) and HER2 0 cases (more frequent HER2 0 calls) compared with other platforms. Antibody-specific analysis demonstrated that 4B5 clones generated significantly lower membrane positivity rates in non-amplified specimens, whereas MXR001 tended toward higher membrane staining in amplified cases, aligning with prior reports of 4B5’s superior specificity versus HercepTest and MXR00118. These variations may directly affect treatment decisions for patients with HER2-low tumors. They underscore that staining reliability depends on multiple technical factors and that FDA-approved platforms and antibodies, together with rigorous internal quality control, are essential.

Analysis of 45 laboratories revealed moderate overall interpretation consistency (kappa = 0.566), with HER2 3+ showing the highest agreement (86.67%), followed by HER2 0 (70.97%), while HER2 1+ demonstrated the poorest reproducibility (57.58%). The major discordance occurred between HER2 0 and 1+, which reflects the declining sensitivity of the 10% cutoff threshold for faint membrane staining19,20. Emerging requirements to distinguish HER2-ultralow from HER2-null cases will likely exacerbate these inconsistencies.

AI is increasingly used in pathological diagnosis, so we also examined whether it can assist in IHC interpretation. AI-assisted interpretation significantly improved inter-observer agreement while reducing ambiguous HER2 2+ calls and unnecessary FISH tests. However, current AI tools cannot reliably differentiate HER2-ultralow from HER2-null because of persistent difficulty in recognizing nonspecific staining; other limitations include inaccurate detection of invasive carcinoma and misinterpretation of cytoplasmic staining. Conventional machine learning algorithms struggle with image logic interpretation, an area where advanced deep learning shows promise for pathological diagnosis21. Nonetheless, our results confirm AI’s clinical utility in expanding patient eligibility for novel targeted therapies, although manual verification is still required for optimal accuracy. The ultimate clinical value of HER2 IHC stratification warrants further investigation.

Methods

Clinicopathological characteristics of HER2-low and HER2-ultralow expression

This retrospective study identified and included a total of 1455 treatment-naïve cases of HER2-non-amplified invasive breast cancer from the pathological archives of the First Affiliated Hospital of China Medical University. The cases, consecutively accessioned between July 2022 and July 2024, were selected based on the availability of complete clinical, pathological, and immunohistochemical data. The study protocol was reviewed and approved by the Ethics Committee of the First Affiliated Hospital of China Medical University (Approval Number: 2025-79). This study was conducted in accordance with the Declaration of Helsinki. The committee granted a waiver for informed consent due to the retrospective nature of the analysis. Tumor-infiltrating lymphocytes (TILs) were assessed according to the 2024 Chinese Society of Clinical Oncology (CSCO) Breast Cancer Diagnosis Guidelines22. HER2 status was evaluated following the Guidelines for HER2 Testing in Breast Cancer (2024 Edition)23.

Reproducibility of HER2 immunohistochemical testing

Forty-five laboratories were evaluated for staining performance and HER2 interpretation. Three formalin-fixed, paraffin-embedded (FFPE) tissue blocks with distinct HER2 scores were used to prepare control and test slides. Control slides were processed using the Roche Ventana platform with the 4B5 antibody clone. The HER2 scores for each tissue were as follows: Tissue 1: 2+ (IHC 2+ and FISH-negative); Tissue 2: Ultralow expression (<10% of tumor cells showing faint, incomplete membrane staining); Tissue 3: 3+ or 2+ (demonstrating heterogeneity with 50% 3+ and 50% 2+ staining areas confirmed as FISH-amplified) (Fig. 6).

Fig. 6: HER2 IHC staining of reference tissues and their external controls.

A Tissue 1: HER2 2+ (200×). B External control for Tissue 1 (200×). C Tissue 2: HER2 0 (<10% of tumor cells with incomplete/faint membranous staining) (200×). D External control for Tissue 2 (200×). E IHC 2+ region (FISH-positive) of Tissue 3 (200×). F IHC 3+ region of Tissue 3 (200×). G Tumor heterogeneity in Tissue 3 (approximately 50% area 3+, 50% area 2+) (20×). H Positive control for Tissue 3 (200×).

To account for inter-laboratory variability, the interpretation criteria for staining performance were as follows: Tissue 1: 2+; Tissue 2: 0 (<10% of tumor cells showing faint, incomplete membrane staining); Tissue 3: 3+ or 2+ (with more than 50% 2+ staining areas). For interpretation, scores within a reasonable range of the reference were accepted. The acceptable reference score ranges for test slides were defined as: Tissue 1: 1+ or 2+; Tissue 2: 1+ or 0; Tissue 3: 2+ or 3+. Test slides were deemed non-conforming if they met either of the following criteria: failure of internal or external controls, or presence of nonspecific heterogeneity in invasive carcinoma regions.
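The acceptance rules above can be written down as two simple checks. The following Python sketch is illustrative only: the function and variable names are hypothetical, and the two rules are deliberately kept separate because the text does not state how they were combined in the study's workflow.

```python
# Acceptable reference score ranges for test slides, as listed in the text.
ACCEPTABLE_SCORES = {
    "Tissue 1": {"1+", "2+"},
    "Tissue 2": {"0", "1+"},
    "Tissue 3": {"2+", "3+"},
}


def score_in_range(tissue, score):
    """True if a reported HER2 IHC score lies in the acceptable range for the tissue."""
    return score in ACCEPTABLE_SCORES[tissue]


def slide_conforms(controls_passed, nonspecific_heterogeneity):
    """A slide is non-conforming if internal/external controls failed or if
    nonspecific heterogeneity is present in invasive carcinoma regions."""
    return controls_passed and not nonspecific_heterogeneity


# Hypothetical example: a laboratory scores Tissue 2 as 1+ on a slide with valid controls.
print(score_in_range("Tissue 2", "1+"))   # score falls inside the accepted range
print(slide_conforms(True, False))        # slide passes both conformity criteria
```

Keeping the score-range check apart from the slide-conformity check makes it easy to report staining failures and interpretation discordance separately.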

Significance of AI tools in HER2 immunohistochemical assessment

This study conducted a comparative analysis of 209 HER2 immunohistochemical (IHC) slides from treatment-naïve postoperative invasive breast cancer cases at the First Affiliated Hospital of China Medical University between July 22, 2019, and July 22, 2022. The study evaluated the inter-pathologist concordance in IHC interpretation, the agreement between individual pathologists and consensus scoring (CS), as well as the differences in interpretation accuracy before and after AI-assisted evaluation. The AI tool employed in this study was the D-Path AI platform developed by Dpath Technology Co., Ltd.

Statistical analysis

All statistical analyses were performed using SPSS 24.0 software. This study employed both comparative analysis and agreement analysis, with the former conducted using χ² tests, Monte Carlo simulations, and Fisher’s exact probability tests. The kappa coefficient was used to evaluate inter-rater agreement, with the following interpretation guidelines: kappa ≤ 0.20 indicating poor agreement, 0.20 < kappa ≤ 0.40 indicating fair agreement, 0.40 < kappa ≤ 0.60 indicating moderate agreement, 0.60 < kappa ≤ 0.80 indicating good agreement, and 0.80 < kappa ≤ 1.00 indicating excellent agreement.
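To make the agreement scale above concrete, the sketch below computes an unweighted Cohen's kappa for two raters and maps it to the interpretation bands listed in the text. It is a minimal pure-Python illustration, not the SPSS procedure used in the study; the choice of unweighted two-rater kappa is an assumption, and the example scores are invented for demonstration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rater lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa):
    """Map kappa to the interpretation bands defined in the text."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "excellent"


# Invented HER2 IHC scores for six slides read by two hypothetical raters.
scores_a = ["0", "1+", "2+", "3+", "1+", "0"]
scores_b = ["0", "1+", "1+", "3+", "1+", "1+"]
k = cohen_kappa(scores_a, scores_b)
print(f"kappa = {k:.3f} ({agreement_band(k)})")
```

Applied to values reported in this study, the same bands classify the overall multicenter kappa of 0.566 as moderate and the AI-assisted inter-pathologist kappa of 0.794 as good.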