Main

Breast cancer remains a substantial global health challenge, and possesses significant molecular and clinical heterogeneity1. Over the past few decades, advances in understanding of genomic drivers have led to improved treatment modalities, including targeted therapies and immunotherapies5. However, recurrence and metastasis of breast cancers remain common, underscoring the need for a deeper understanding of the biology. Recent advances in genomic technology serve as essential assets to unravel the genetic complexities of breast cancer, facilitating personalized treatment approaches to improve outcomes.

Despite previous valuable insights, the molecular landscape of breast cancer remains partially understood. Traditional methods, such as targeted sequencing approaches, mostly focus on individual mutations in known cancer genes, and miss significant information outside the targets and other pattern-based markers6, such as most genomic rearrangements, copy number alterations and mutational signatures7. In this context, whole-genome sequencing (WGS) is a comprehensive alternative technique that captures the full spectrum of genomic changes and offers an unbiased view of cancer genomes3,8, enabling biological discovery and the exploration of potential biomarkers for future clinical application.

Over the past few decades, academic-driven cancer genome research has analysed more than 10,000 cancer genomes across diverse cancer types8,9,10. Although these studies have identified and archived many genomic mutations, their clinical significance often remains unclear owing to insufficient integration of clinical records. To maximize the real-world impact of genome sequencing, it will be essential to integrate genomic data with comprehensive medical records, including treatment responses, disease recurrence and long-term clinical outcomes11.

To this end, we sequenced and analysed the whole genomes of 1,364 breast cancers from a Korean cohort in combination with full medical records. To our knowledge, this cohort—referred to as CUBRICS cohort—is the largest of its kind to date (Supplementary Methods, Supplementary Fig. 1 and Supplementary Table 1). In addition to WGS, transcriptome sequencing data were incorporated for most cases (n = 1,209; 88.6% of cases), which enabled stratification of these cancers into five prediction analysis of microarray 50 (PAM50) subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like)12 and tracking the expression of acquired genomic variants. The cohort is characterized by a younger median age (44 years) and lower proportions of oestrogen receptor-positive (ER+)/luminal A subtypes compared with breast cancer cases in Western countries, making it a distinct population for investigation13,14,15 (Supplementary Discussion 1).

Mutational landscapes and cancer drivers

From the 1,364 whole-genome sequences, we identified 10,929,118 somatic mutations, which include 8,935,132 single nucleotide variants (SNVs), 1,785,446 indels and 208,540 structural variants with a substantial variation in their burden per sample (Extended Data Fig. 1 and Supplementary Figs. 2 and 3 present the full oncogrid plots; interactive exploration of the genomic variants is also possible at https://open.cancervision.com). The median tumour mutational burden (TMB) was 4,742 (interquartile range (IQR) 2,392–8,842), with subtype-specific medians as follows: luminal A, 2,182 (IQR 1,628–2,841); luminal B, 4,027 (IQR 2,376–7,392); HER2-enriched, 6,737 (IQR 4,076–14,213); and basal-like, 8,360 (IQR 5,296–11,504). Basal-like tumours had the highest TMB, as previously reported16.

By applying the IntOGen pipeline17, we identified 41 breast cancer driver genes (Methods), including 4 novel driver genes in addition to classical drivers such as TP53, PIK3CA and GATA33 (Fig. 1 and Extended Data Fig. 2). For example, BCL11B, which encodes a transcription factor that is involved in chromatic remodelling, gene regulation and mammary stem cell self-renewal18,19, was mutated in 23 patients at much higher frequency than random expectation. Similarly, RREB1, which encodes a zinc finger transcription factor that regulates Ras and TGFβ signalling and directly binds the p53 promoter20,21,22, was truncated in 26 patients, suggesting that its disruption could affect oncogenic pathways. Our data also suggest that RAF123 and SPECC124 (both were mutated in five patients) act as tumour suppressors, with potential roles in tumour initiation and progression.

Fig. 1: Driver genes in breast cancer.
figure 1

Seven driver gene identification algorithms (SMRegions, OncodriveFML, OncodriveCLUSTL, MutPanning, HotMAPS, dNdScv and CBaSE) were applied to identify protein-coding driver genes, generating a combined significance score and identifying 41 driver genes in breast cancer. Top to bottom: the number of affected cases for each gene, ranked by mutation frequency across the cohort; PAM50 subtype for each mutated gene; integrative clusters (IntClust) subtype for each mutated gene; types of point mutations; results from the seven driver gene detection algorithms; a combined significance score, summarizing the overall support for each gene as a driver; genes that were not observed in any external dataset were classified as putative novel drivers, highlighted in blue; and the number of external breast cancer cohorts (out of 14) in which each gene was reported. Del, deletion; ins, insertion; LumA, luminal A; lumB, luminal B; NA, not available; mut, mutation; subs, substitution.

Tumours with TP53 mutations frequently exhibited genomic instability25, characterized by higher ploidy and increased homologous recombination DNA repair deficiency (HRD) for double-strand DNA breaks (Extended Data Fig. 3a,b). TP53 mutations were positively associated with the level of intratumour genetic heterogeneity (calculated by mutant allele tumour heterogeneity (MATH) scoring with tumour WGS4), particularly in basal-like subtypes (Extended Data Fig. 3c shows MATH scores associated with different TP53 mutations), indicating more active late-stage mutagenesis and emergence of potent subclones in TP53-mutant tumours. Notably, higher MATH scores were associated with poorer overall survival, particularly in TP53-mutant tumours (hazard ratio 1.66, 95% confidence interval 1.08–2.74; P = 0.048; Extended Data Fig. 3d,e), suggesting that intratumoural heterogeneity, as measured by MATH score, is associated with clinical outcomes and deserves further investigation as a potential prognostic factor. The prognostic effect of the MATH score was further demonstrated in the METABRIC cohort of 1,670 participants26 (Extended Data Fig. 3f and Supplementary Discussion 2).

Key mutational signatures

To systematically explore the mutational processes that shape alterations in breast cancer genomes, we performed a comprehensive mutational signature analysis2 (Methods, Supplementary Discussion 3 and Supplementary Fig. 4). We identified various signatures, including 17 SNV signatures, 9 insertion–deletion mutation (indel) signatures and 6 structural variant signatures (Fig. 2a; reference mutational signatures are available at https://cancer.sanger.ac.uk/signatures/). Signatures with clock-like properties, such as SBS5, ID1 and ID2, were present in nearly all samples, and accounted for a substantial proportion (more than 10%) of somatic mutations. By contrast, some mutational signatures contributed to a high number of mutations in only a subset of samples, suggesting hyper-mutagenicity that was restricted to the affected tumours. For examples, mutational processes attributable to polymerase epsilon exonuclease domain mutations (SBS10a and SBS10b) and defective DNA mismatch repair (SBS21 and SBS44) were observed in four cases with an extremely high TMB (more than 100,000; Supplementary Fig. 5).

Fig. 2: Mutational signatures in breast cancer.
figure 2

a, Distribution of mutational signatures across breast cancer subtypes. Top, number of mutations per signature for individual samples. Bottom, heat map showing the percentage of individuals within each subtype for whom a signature contributes more than 10% of total mutations. b, Correlation between signatures based on Pearson’s correlation coefficient. c, HRD score distribution (top) and receiver operating characteristic curve for predicting germline BRCA1 and BRCA2 mutations (bottom). d, Distribution of HRD and HRP cancers across all samples and subtypes; outer tracks indicate germline or somatic homologous recombination pathway mutations. e, Germline and somatic mutations in the homologous recombination pathway in cases with HRD. f, Disease-free survival (DFS) in patients with TNBC according to homologous recombination pathway status after adjuvant anthracycline–cyclophosphamide chemotherapy. Left, Kaplan–Meier curves for HRD versus HRP. Right, multivariate Cox proportional hazards regression (two-sided Wald test, no adjustment for multiple comparisons; n = 89). Error bars indicate 95% confidence intervals (CIs) and the centre dot is defined as the hazard ratio. g, Distribution of germline deletion of APOBEC3A/3B in our cohort (left), a healthy Korean population (middle) and Europeans with breast cancer (right). h, TMB (left), ratio of C>G and C>T mutations in YpTpCpA versus RpTpCpA contexts (middle) and sequence logo plots for C>T and C>G mutations at TC sites (right) in carriers of APOBEC3A/3B germline deletions. nYTCA and nRTCA represent the number of mutations occuring in YpTpCpA and RpTpCpA sequence contexts, respectively. Box plots indicate median (middle line), first and third quartiles (box edges) and 1.5× IQR (whiskers). P values were calculated using a two-sided Student’s t-test. Het, heterozygote; hom, homozygote; WT, wild type. i, APOBEC3A/3B germline status according to APOBEC mutational signature. Bars indicate the proportion of germline status in groups (n = 30) ordered by signature intensity. In all panels, n refers to the number of participants.

Mutational signatures associated with HRD3,27, including SBS3, SBS8, ID6, SV3 and SV5, are of particular interest, as they are potential WGS-based biomarkers for poly(ADP-ribose) polymerase inhibitors (Fig. 2b,c). The combined estimated HRD scores (Methods) showed a clear bimodal distribution (Fig. 2c), identifying 315 cases as HRD in the CUBRICS cohort (23.1%, 315 out of 1,364; Fig. 2d and Supplementary Fig. 6). Although it is highly enriched in the basal-like cancers (72.2%, 182 out of 252), HRD was also present in other subtypes (Fig. 2d). Out of the 315 cases identified as HRD, we identified causal pathogenic variants on genes in the homologous recombination pathway in 130 cases (41.3%, 130 out of 315), including 90 with germline defects (28.6%, 90 out of 315) and 40 (17.8%, 40 out of 225) with only somatically acquired defects (Supplementary Discussion 4, Supplementary Fig. 7 and Supplementary Table 2). Although germline pathogenic variants predominantly inactivated BRCA1 and BRCA2, somatic mutations most frequently truncated RAD51B, primarily by genomic rearrangements rather than point mutations (Fig. 2e).

The CUBRICS cohort included 89 patients with triple-negative breast cancer (TNBC) with a treatment history of adjuvant chemotherapy containing anthracycline–cyclophosphamide (Supplementary Table 3). Individuals with HRD (n = 66) exhibited significantly superior disease-free survival compared with those with homologous recombination DNA repair-proficient (HRP) cancer (hazard ratio 0.10, 95% confidence interval 0.02–0.54; P = 0.006; Fig. 2f); this difference was more pronounced than in a previous study with a targeted panel-based assay28, highlighting the potential of WGS for more specific clinical evaluation of HRD.

APOBEC-associated mutational signatures (SBS2 and SBS13; Fig. 2b) were widespread in breast cancers, contributing more than 10% of somatic SNVs in 633 samples. Of these, 12 samples were hypermutators with APOBEC-associated mutational processes, with TMB of more than 50,000. A germline deletion involving APOBEC3A and APOBEC3B (APOBEC3A/B), which effectively generates APOBEC3AAPOBEC3B fusion transcripts, has been suggested to predispose to APOBEC-mediated hypermutations29 (Extended Data Fig. 4a). Of note, the population allele frequency of the germline deletion was 31.8% in our cohort (present in 736 individuals, 131 being homozygous and 605 heterozygous; Fig. 2g and Extended Data Fig. 4b), significantly higher than the frequency in the European population29 (31.8% versus 8.5%; P < 0.001 by two-sided chi-square test), suggesting an East Asian origin. Contrary to previous reports30,31, the deletion was not substantially enriched in patients with breast cancer compared with the background Korean population (51.0%, 844 out of 1,654 individuals, with 150 being homozygous and 694 heterozygous; P = 0.276 by two-sided chi-square test; Fig. 2g), suggesting that it is not a strong cancer-predisposing germline variant. However, carriers exhibited reduced APOBEC3B expression (Extended Data Fig. 4c), significantly higher TMB (median 4,325 in wild type versus median 5,148 in carriers; P < 0.001; Fig. 2h) and APOBEC-associated signatures (SBS2, SBS13 and ID9; Extended Data Fig. 4d) in cancers. In addition, wild-type individuals were substantially depleted in the APOBEC hypermutator cancers (10.0% in the top 30 cases versus 46.9% in others; P < 0.001 by two-sided Fisher’s exact test; Fig. 2i). APOBEC-deletion carriers showed an enrichment of APOBEC-associated mutations in the YpTpCpN context than in the RpTpCpN context (where Y represents pyrimidine, R represents purine and the mutated cytosine is underlined; Fig. 2h).

Frequently rearranged genes

A key mutational process in cancer is structural variation, which acts to amplify, delete or reorder chromosomal material at scales that range from single genes to entire chromosomes32. To determine the structural variation landscape, we generated a DNA interaction map between 5-Mb segments across the entire genome (Fig. 3a), along with large-scale copy number amplification patterns (CNAs; G-score). Of the 208,540 structural variation events, local intrachromosomal (<5 Mb) rearrangements were predominant (162,988 out of 208,540, 78.2%), but reshuffling of genomic materials between distal intrachromosomal or interchromosomal regions were also observed (45,266 out of 208,540, 21.7%). Some of the rearrangements were observed more frequently than random. For example, 15 breast cancers carried an interchromosomal translocation between chromosomes 8 and 11, which brought CCND1, and ZNF703 and FGFR1 (ZNF703/FGFR1) into close proximity (mostly luminal B type; 73.3%, 11 out of 15; Fig. 3a, top left). Of note, the interaction was observed in lung cancer33, and was also reported to be associated with reduced benefit from aromatase inhibitors in metastatic breast cancer34. Consistent with these findings, tumours with these rearrangements exhibited distinct transcriptomic profiles, including upregulation of tumour progression-related pathways such as TNF, TGFβ signalling and epithelial–mesenchymal transition, as well as metabolic reprogramming (Extended Data Fig. 5).

Fig. 3: Whole-genome structural variation landscape.
figure 3

a, Genome-wide structural variation (SV) and copy number variation (CNV) landscape. Top, interchromosomal and intrachromosomal structural variant interactions across 5-Mb windows. Frequent events between chr. 8 and chr. 11 (top left) and between chr. 17 and chr. 20 (top right) are highlighted. Middle, number of samples with structural variants or super-enhancers (SEs) per window. Bottom, recurrent CNVs identified by GISTIC. b, Distribution of distal structural variant breakpoints (BPs) within 1 Mb of ERBB2 (grey) and breast cancer-specific super-enhancers (red). c, Structural variant breakpoint density within 1 Mb of ERBB2. Peaks at −99.0 kb (5′) and +91.7 kb (3′) indicate clustering near the gene. d, Effect of extragenic structural variations (<1 Mb from gene) on RNA expression. Genes with significant upregulation (log2(fold change) ≥ 1, q < 0.10) are highlighted. Dot size reflects the number of affected samples. e, CCF of extragenic structural variants affecting genes highlighted in d. f, Permutation test (10,000 iterations) shows that structural variant breakpoints (from d) are closer to super-enhancers than expected by chance (two-sided; P = 0.001). g, Left, genes that are downregulated by intragenic structural variants (q < 0.10). Well-known tumour suppressors (RB1, PTEN and RUNX1) are shown in bold. Right, PAM50 subtype distribution. h, Recurrent gene fusions (occurring in at least four participants). Number of patients and whole-transcriptome sequencing (WTS) support (left), and CCF (middle) and PAM50 (right) subtype for each fusion. i, NRG1 fusions in 17 patients in our cohort. Protein domain architecture of the N-terminal fusion partner and C-terminal NRG1 are shown. CCF values indicate the clonality of each fusion event. ADAM CR, ADAM family cysteine-rich domain; Del, deletion; DNA pol A exo 1, exonuclease 1 domain; Dup, duplication; glyco tran 10 N, glycosyltransferase family 10 domain; glyco transf 64, glycotransferase 64 domain; guanylate cyc 2, guanylate cyclase 2 domain; Inv, inversion; I-set, immunoglobulin-like domain; KH1, KH-type RNA-binding domain 1; Tra, translocation.

Another frequent interchromosomal translocation involved the ERBB2 locus on chromosome 17 and a locus on chromosome 20 (observed in 30 participants), where breast cancer super-enhancers were enriched35 (Fig. 3a, top right). Structural variations can influence transcription by disrupting enhancer–promoter interactions, chromatin topology and long-range regulatory elements36. In line with this idea, structural variations involving the ERBB2 locus were clustered with super-enhancers genome-wide (P = 0.028 by permutation test; Fig. 3b), which translocated super-enhancers in about 90 kb from the ERBB2 locus (Fig. 3c). Tumours with extragenic structural variants in the vicinity (within 1 Mb) showed significantly higher ERBB2 gene expression levels than their gene copy number (log2 fold change >1, q-value <0.10; Fig. 3d). They often carried extragenic structural variants in the vicinity (within 1 Mb) with a cancer cell fraction (CCF) of approximately 1 (Fig. 3e), and distal breakpoints significantly enriched near breast cancer-specific super-enhancers35 (Fig. 3f). Our observation overall suggests a possible enhancer-hijacking mechanism for transcriptional activation in breast cancer genomes.

In addition, we found gene truncation events associated with structural variation. A total of 28 genes, including the well-established tumour suppressors RB1, PTEN and RUNX1, exhibited significant downregulation due to gene-truncating structural variants, reinforcing their tumour suppressor role in breast cancer (Fig. 3g).

Although recurrent fusion oncogenes are considered rare in breast cancer37, we identified several recurrent fusion events (Fig. 3h and Extended Data Figs. 6 and 7). In luminal-type breast cancers, frequent fusion genes were MIPOL1–TTC6 (n = 9), CEP112–PRKCA (n = 6) and CCDC170–ESR1 (n = 6). In basal-like breast cancers, BCL2L14–ETV6 (n = 12), AGO2–PTK2 (n = 6) and BRD4–NOTCH3 (n = 6) were recurrent. Of note, CCDC170–ESR1 has been implicated in endocrine therapy resistance and metastasis in breast cancers38 and BCL2L14–ETV6 has been linked to increased invasiveness and paclitaxel resistance39. The fraction of variant-supporting reads suggests that most of the fusion events are clonal or near-clonal, implicating them in tumour initiation.

In the gene-level analysis, NRG1, which encodes a small ligand of ERBB family kinase receptors, was frequently affected by 166 structural variants in 110 cases (8%, 110 out of 1,364; Extended Data Fig. 8; top 2). Although gene disruption was the most frequent outcome (72.9%, 121 out of 166 structural variants), oncogenic fusions were sometimes formed with various partner genes40 (n = 17; Fig. 3i), collectively sixfold more frequently than previous reports41 (1.25% versus 0.2%). Of note, the EGF-like domain of NRG1 was retained in all 17 fusion cases, supporting their oncogenic effect.

Hotspots of focal amplifications

Focal amplification, which increases the copy number of short-segment DNA (typically up to 3 Mb in size, with a copy number gain greater than 3 compared with flanking regions), is a prominent signature in many human cancers42,43. To explore these events in breast cancers, we calculated the recurrence score of focal amplifications across the genome, referred to as F-score (Fig. 4a). The most prominent peaks were observed at chromosome 8 (chr. 8):35–40 Mb, chr. 11:65–70 Mb and chr. 17:35–40 Mb, which included ZNF703 (chr. 8; n = 124; 9.1%) and FGFR1 (chr. 8; n = 107; 7.8%), CCND1 (chr. 11; n = 108; 7.9%) and ERBB2 (chr. 17; n = 255; 18.7%), respectively (Fig. 4b and Extended Data Fig. 9a). Notably, ZNF703 exhibited higher amplification levels than FGFR1 on chromosome 8, suggesting its oncogenic role in breast cancer.

Fig. 4: CNAs in breast cancer with clinical and evolutionary implications.
figure 4

ah, Focal CNAs, particularly ERBB2, and their clinical relevance. il, CNAs and their temporal dynamics with implications for therapeutic resistance. a, Genome-wide F-score plots highlight recurrent focal amplifications at chromosomes 8, 11 and 17. Orange rings indicate regions that are frequently amplified as ecDNA. b, Copy number profiles at the three major focal peaks. Black lines show mean copy number and grey shading indicates 95% confidence intervals. c, Amplification mechanisms of the oncogenes in b. BFB, breakage–fusion bridge. d, Distribution of PAM50 subtype, ERBB2 amplification type, copy number and expression across HER2 IHC and fluorescence in situ hybridization (ISH) groups. Amp, amplification. e, Clinicogenomic profiles of 75 HER2-positive breast cancers treated with neoadjuvant docetaxel–carboplatin–trastuzumab–pertuzumab (TCHP), stratified by pCR (n = 38) and non-pCR (n = 37). The top 10 most mutated genes are shown. ChT, chromothripsis; PR, progesterone receptor. f, Sankey plots showing associations of HER2 IHC 3+, ERBB2 copy number ≥33 and ecDNA status with pCR (two-sided chi-square test). g, Performance metrics comparing HER2 IHC 3+, ERBB2 copy number ≥33 and chromothripsis for predicting pCR. CN, copy number; LR, likelihood ratio; OR, odds ratio. h, Chromothripsis frequency in pCR versus non-pCR groups (two-sided chi-square test). i, Prognostic effect of 9p23 amplification in basal-like breast cancer. n = 252. Top, Kaplan–Meier overall survival. Bottom, multivariate Cox regression. Two-sided Wald test without adjustment for multiple comparisons. Error bars indicate 95% confidence intervals and the centre dot is defined as the hazard ratio. j,k, Circos plots and CNA timing curves for two cases with HRD: a triple-negative tumour (j) and a hormone receptor-positive tumour with a germline BRCA1 1-bp deletion (k). l, Integrative Genomics Viewer snapshot for patient 703, showing a somatic 8-bp deletion overlapping a germline 1-bp deletion in BRCA1 exon 14, which is likely to restore the reading frame. In all panels, n refers to the number of participants.

Gene expression levels of the oncogenes positively correlated with gene copy number in a dose-dependent manner (Extended Data Fig. 9b). Notably, ERBB2 expression exhibited mutual exclusivity with the other oncogenes, suggesting a redundant oncogenic mechanism between ERBB2 and the other three genes in breast tumorigenesis (Extended Data Fig. 9c). However, CCND1 and ZNF703/FGFR1 loci displayed a positive correlation in RNA expression, in line with their co-rearrangement as previously discussed (Fig. 3a).

Of note, more than 40% of the ERBB2, FGFR1, ZNF703 and CCND1 focal amplifications were formed by extrachromosomal DNA (ecDNA) (Fig. 4c), suggesting that ecDNA is the primary mechanism of focal amplification42,44. Overall, 387 cancer tissues (28.4%, 387 out of 1,364) carried ecDNA, whose location was tightly correlated with F-score (Fig. 4a).

Clinical effect of ERBB2 focal gain

Immunohistochemistry (IHC) targeting HER2 protein, encoded by ERBB2, is a gold standard technique for clinical detection of HER2-positive breast cancers45. In our cohort, as expected, HER2 IHC correlated with, but did not completely overlap with PAM50 subtypes or ERBB2 gene copy number gains (Fig. 4d). For example, ecDNA-derived focal amplification of ERBB2 (copy number 68) was observed in a case with HER2 IHC 0 (Extended Data Fig. 9d). Conversely, we also observed a case with HER2 IHC 3+ without obvious ERBB2 amplification (Extended Data Fig. 9e). The PAM50 subtype of this case was not HER2-enriched but luminal B, with ERBB2 transcription markedly lower (transcripts per million (TPM) = 43.5) than typically observed in HER2 IHC 3+ tumours (mean TPM = 692.0), but similar to HER2 IHC 0 tumours (Extended Data Fig. 9e).

Then, we evaluated whether these genome and transcriptome data can predict treatment responses in HER2-positive cancer. Out of 75 individuals with HER2-positive tumours who received neoadjuvant TCHP regimens (docetaxel (microtubule inhibitor), carboplatin (platinum intercalator), trastuzumab and pertuzumab (anti-HER2 inhibitors)), 38 showed pathologic complete response (pCR) (Fig. 4e and Supplementary Table 4). Although cases without pCR (n = 37) were characterized by a higher prevalence of luminal subtypes (P < 0.001, two-sided Fisher’s exact text), hormone receptor positivity (P = 0.001, two-sided chi-square test) and PIK3CA mutations (P = 0.003, two-sided chi-square test), cases with pCR (n = 38) exhibited significantly higher ERBB2 expression (P < 0.001, two-sided Student’s t-test) and a greater frequency of ERBB2 focal amplification (P = 0.001, two-sided Fisher’s exact test; Extended Data Fig. 9f–h (external validation in the TransNEO cohort46)). Although HER2 IHC 3+ was a sensitive biomarker for pCR (36 out of 38, 94.7%), ERBB2 copy number (n ≥ 33; Extended Data Fig. 9i,j) demonstrated higher precision (30 out of 38, 78.9%; Fig. 4f). Overall, ERBB2 copy number provided a complementary performance, with higher prediction in specificity, precision, and likelihood ratio (Fig. 4g).

Of note, ecDNA is generally associated with poor prognosis across multiple cancer types47. Notably, in our cohort, tumours with ERBB2 copy number of 33 or more were significantly more likely to be ecDNA-positive (P = 0.003; Fig. 4f, right); however, ERBB2 amplification by ecDNA was not associated with pCR rates (P = 0.540). Our observation suggests that absolute copy number. rather than the amplification mechanism, may be a more relevant predictor of therapeutic response. It also indicates that ERBB2-amplifying ecDNA may be an actionable target for anti-HER2 therapy.

Further, we observed a significant enrichment of chromothripsis events, a complex genomic rearrangement with catastrophic chromosomal shattering and erroneous reassembly, in cases with pCR48 (71.1% versus 43.2%; Fig. 4e,h and Supplementary Fig. 8). Adding chromothripsis to cancer stratification improved the precision and specificity of predicting responses to TCHP (Fig. 4g, Extended Data Fig. 9k and Supplementary Discussion 5).

Long-segmental copy number gains

Along with focal amplifications, long-segmental amplifications (typically larger than 1 Mb), often chromosome- or arm-level alterations, were common in breast cancers (Fig. 3a). The timing analysis with the burden of co-amplified, pre-amplification somatic base substitutions yielded the acquisition timing of the frequent long-segmental CNAs in breast cancer in molecular time49 (Extended Data Fig. 10). Most of the recurrent long-segmental CNAs were acquired by 20% of molecular time, when recent common ancestral cells of a cancer emerge, which should be decades earlier than tumour diagnosis. This implies that acquisition of the long-segmental CNAs is an early evolutionary event in breast cancer, which is presumably acquired in early puberty, consistent with a previous report50. Our findings also suggest that full neoplastic transformation can take decades from the first event of genomic instability.

The most common CNAs occurred at 8q21.13, with a relative timing of 0.15 (IQR 0.08–0.25) and a median copy number gain of 5.5 (IQR 4.0–8.0). This amplification was associated with poor overall survival (hazard ratio 1.67, 95% confidence interval 1.09–2.55; P = 0.019). This region contains the MYC oncogene, which has been linked to treatment resistance and poor survival across various solid tumours, including breast cancer51.

Of note, CNAs of 9q23 (n = 40), occurring at a relative timing of 0.13 (IQR 0.09–0.25), were enriched in basal-like cancers and exhibited poor overall survival in the subtype (hazard ratio 2.45, 95% confidence interval 1.16–5.18; P = 0.019; Fig. 4i and Supplementary Table 5). External validation in the METABRIC cohort confirmed this trend52 (Extended Data Fig. 11a,b and Supplementary Table 6). The 9p23 gain was associated with 823 differentially expressed genes genome-wide (282 upregulated and 531 downregulated; Extended Data Fig. 11c), including PSIP1, which has been reported to be negatively correlated with survival of patients with TNBC53.

Late-stage CNA acquisitions were more frequently observed in HRD breast cancers (P < 0.001; Extended Data Fig. 11d,e). This implies that the rate of large-segmental CNAs is accelerated when a tumour cell acquires the HRD phenotype. For instance, in one case with HRD (no. 635; Fig. 4j), the rate of CNA acquisition was substantially increased from about 75% of the molecular time, which we speculate is the time of complete inactivation of the homologous recombination-mediated DNA repair pathway.

By contrast, in another case with HRD (no. 703; Fig. 4k), there was no such late CNA acquisition stage. The cancer developed talazoparib resistance after ten months of the second-line palliative treatment. In line with the resistance, post-therapy cancer genome exhibited an 8-bp deletion in the vicinity of the germline 1-bp deletion in BRCA1, which is likely to have converted an HRD tumour back to HRP (Fig. 4l). From the timing analysis (Fig. 4k, bottom), we believe that the rescuing mutation could be acquired in about 50% of the molecular time in one of the cells in the primary cancer tissue, which could lead to talazoparib resistance.

Potential biomarkers from whole genomes

Finally, we explored the potential utility of whole-genome results by integrating genomic profiles and clinical records. First, we observed that an increased MATH score, which is indicative of higher intratumoural heterogeneity, was associated with worse progression-free survival (PFS) in first-line anti-HER2 therapies against HER2-positive breast cancer. In our cohort, 45 patients who received the therapy showed a median PFS of 1.08 years (range 0.14–9.96 years; as of 15 November 2023; Fig. 5a and Supplementary Table 7). Individuals with a high MATH score (≥40; cut-off with the median MATH score) exhibited worse PFS compared to those with a MATH score below 40 (hazard ratio 2.95, 95% confidence interval 1.48–5.91; P = 0.002; Fig. 5b,c).

Fig. 5: WGS-based biomarkers for first-line palliative treatment of metastatic breast cancer.
figure 5

ac, Swimmer plot (a), Kaplan–Meier curves (b) and multivariate Cox proportional hazard analysis (c) comparing PFS according to MATH score in patients with HER2-positive breast cancer who received first-line anti-HER2 treatment as palliative treatment. a, Patients are stratified into two groups: MATH < 40 (n = 27) and MATH ≥ 40 (n = 18). Bars represent individual patients, with progression events indicated in yellow. Bottom, clinicopathologic and genomic features, including treatment regimen, ER and PR expression status, ERBB2 transcript expression as copy number (CN) gain and TPM, Ki67 index, tumour ploidy, HRD score and MATH score. L, lapatinib; Pro, prospective; retro, retrospective; T, taxane; TP, taxane plus platinum. d, Genomic and molecular characteristics of patients with hormone receptor-positive breast cancer (n = 57) treated with first-line CDK4/6 inhibitors (CDK4/6i), stratified by progression status. Top panel displays patient recruitment type, CDK4/6i regimen, PAM50 subtype, Ki67 expression, ploidy, MATH score, TMB and HRD score. Bottom, somatic mutations in recurrently altered cancer-related genes. e, Kaplan–Meier survival curves comparing PFS between individuals with high versus low TMB. f, Kaplan–Meier survival curves comparing PFS between HRD and HRP cases. g, Multivariate Cox proportional hazards analysis. The model accounts for potential correlations between TMB and HRD by incorporating a HRD:TMB interaction term. Multivariate Cox proportional hazards analysis was performed using a two-sided Wald test. Error bars indicate 95% confidence intervals, with the centre dot defined as the hazard ratio. AI, aromatase inhibitor; GNRH, gonadotrophin-releasing hormone; SERD, selective oestrogen receptor degrader. In all panels, n refers to the number of participants included in the analysis.

Similarly, we identified high TMB, and HRD scores were associated with worse PFS in CDK4/6 inhibitor plus endocrine therapy, a first-line palliative treatment recently developed for hormone receptor-positive breast cancer54 (Fig. 5d–g). In a total of 57 patients receiving the regimens, individuals with a high TMB (≥1.7 per Mb, cut-off with the median TMB) exhibited poor PFS (hazard ratio 2.55, 95% confidence interval 1.26–5.16; P = 0.009; Fig. 5d,e). Thirteen out of the 57 patients exhibited HRD, 85% of whom experienced progression, resulting in a shorter PFS than those with HRP (hazard ratio 4.20, 95% confidence interval 1.91–9.25; P < 0.001; Fig. 5f and Supplementary Table 8). Multivariate Cox proportional hazards analysis indicated HRD as the single most significant factor predicting favourable PFS to the treatment (hazard ratio 10.20, 95% confidence interval 1.68–61.80; P = 0.012; Fig. 5g), warranting validation in larger, prospective studies. In gene set enrichment analysis (GSEA), these HRD tumours showed significant enrichment of pathways related to cell cycle progression, DNA repair and mitotic checkpoint activation, suggesting that these tumours may exhibit increased reliance on cell cycle regulatory mechanisms (Extended Data Fig. 12). Of note, PTEN mutations were exclusively present in individuals who experienced progression on the treatment, suggesting that they had a negative impact on the treatment (P = 0.042, two-sided Fisher’s exact test; Fig. 5d).

Discussion

This study provides a comprehensive analysis of 1,364 tumour–normal pair WGS datasets with highly annotated paired clinical information for breast cancer (the findings are summarized in Supplementary Table 9). Our extensive efforts enabled us to construct a highly detailed somatic mutation landscape of breast cancer, revealing novel insights into the heterogeneous and complex genomics of the disease. By leveraging this meticulously curated clinical data, we were able to identify genomic features with potential clinical relevance and offer hypotheses for future prospective validation (Supplementary Discussion 5).

Our findings highlight the potential of whole-genome analysis for advancing precision oncology for breast cancer. However, integration of such whole-genome analysis into clinical practice has been challenging owing to the complexity of data interpretation, variability in clinical utility and lack of standardization. Despite these challenges, recent studies have shown promising results, such as WGS aiding in decision making for childhood solid cancers55, providing actionable insights for adult cancers11 and improving treatment outcomes in high-risk paediatric patients56. These findings, together with our breast cancer data, highlight the increasing potential of WGS in clinical research and its value for building a robust platform for systematic investigation through the integration of clinical and genomic big data.

Intratumoural heterogeneity poses a significant challenge in the clinical management of breast cancer, particularly in HER2-positive cases57, as it is closely linked to treatment responses. Traditionally, heterogeneity has been assessed through pathology, which is limited by its reliance on small tumour samples and often cannot detect subclonal mutations. By contrast, WGS offers a more comprehensive approach by analysing the entire genome, enabling detection of diverse genetic alterations, including subclonal mutations. This comprehensive analysis provides a quantitative understanding of tumour heterogeneity, capturing the full genetic diversity within tumours and aiding in the prediction of treatment responses (Fig. 5a–c). Given these advantages, we anticipate that WGS-based quantitative assessment of tumour heterogeneity, both at diagnosis and during progression, will have a critical role in shaping future precision oncology strategies.

The exploration of genome-wide mutational patterns reveals the various mutational processes that are active in breast cancer. Identifying mutational signatures associated with specific biological processes, such as HRD and APOBEC activity, provides a deeper understanding of different breast cancer subtypes2. Our study has implications in demonstrating the potential clinical relevance of mutational patterns. Notably, our study revealed the potential of HRD as a predictive biomarker for treatment response, particularly in adjuvant chemotherapy for TNBC (Fig. 2f) and first-line CDK4/6 inhibitor treatment for patients with advanced hormone receptor-positive breast cancers (Fig. 5d–g), providing a rationale for further investigation. Notably, HRD predicted a better response to the former but worse prognosis to the latter. This dichotomy highlights the nuanced role of HRD across different treatment contexts.

Cancer progresses along a specific trajectory, with somatic events accumulating in a defined order over time. By analysing somatic mutations, we were able to reconstruct the chronological trajectory of CNAs in key genomic regions (Extended Data Fig. 10), including ERBB2. Traditional sequencing methods have primarily viewed the genome in a two-dimensional manner, mapping genomic coordinates against the presence of mutations. However, by incorporating a time axis, we can visualize the genome in three dimensions. This advancement offers a more dynamic perspective, and provides crucial insights into cancer biology, tumorigenesis and the resistance mechanisms that emerge during treatment (Fig. 4j–l). Such data will be invaluable in deepening our understanding of these processes.

One potential limitation of our study is the possibility of left truncation bias in survival analyses, owing to the mixed retrospective and prospective nature of the cohort and variable timing of inclusion. In retrospective data, participants must survive until enrolment, which may exclude early deaths and bias survival estimates. Although our dataset lacks formal entry time variables for fully adjusted models, we applied several complementary strategies to mitigate this bias, including subgroup analyses and external validation (Supplementary Discussion 6). Nonetheless, residual bias may remain, particularly for the subgroup of patients with de novo metastases, and we urge caution when interpreting the overall survival findings in this context.

Overall, this study enhances our understanding of the genomic landscape of breast cancer and underscores the value of WGS in discovery research and its potential to inform future clinical strategies, contingent upon further validation. Integrating genomic data with detailed clinical outcomes paves the way for more personalized and effective treatment strategies, with the ultimate aim of improving outcomes for patients. Future prospective clinical trials will be essential to confirm the functional implications of the identified genomic alterations, facilitating their translation into real-world clinical practice.

Methods

Study design and participants

Participants were recruited from Samsung Medical Center and Seoul St Mary’s Hospital (Seoul, Korea) between 2012 and 2023 through prospective and retrospective cohorts. Retrospective cases were selected based on the availability of archived primary tumour and matched normal blood samples, along with sufficient clinical information, and were enroled regardless of disease stage or survival status at accrual. All retrospective samples were obtained shortly after diagnosis, typically at curative-intent surgery. Detailed study design, inclusion and exclusion criteria, and the CONSORT diagram are provided in Supplementary Methods and Supplementary Fig. 1. The study was approved by the Institutional Review Boards of both institutions (Samsung Medical Center: SMC 2022-05-050 and SMC 2013-04-005; Seoul St Mary’s Hospital: KC21TISI0007 and KC22TISI0292) and conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. Written informed consent, including consent to publish de-identified clinical and genomic data, was obtained from all prospective participants. For retrospective cases, the use and publication of de-identified data were approved by the relevant institutional review boards, with a waiver of informed consent where applicable. A subset of participants was enroled as part of substudy components within the clinical trials NCT03131089 and NCT06334471, which informed specific aspects of the study design.

Sample preparation for sequencing

We performed WGS using CancerVision assay as previously reported58. In brief, WGS was performed on tumour samples obtained as part of routine clinical care either via surgery or biopsy and stored as fresh frozen tissue. For biopsy sample cores were retrieved first for routine pathology, followed by at least one additional core for cancer WGS. Biopsy sample cores were retrieved first for routine pathology, followed by at least one additional core for cancer WGS. For the matched normal samples peripheral blood was used. DNA extraction and library preparation was performed at the Inocras in a Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory. We used the Allprep DNA/RNA Mini Kit (Qiagen) for DNA extraction, and TruSeq DNA PCR-Free (Illumina) for library preparation. Sequencing was performed on the Illumina NovaSeq6000 platform (Illumina) with an average depth of coverage of 40x for tumour and 20x for blood. Quality assessment of the WGS data is available in Supplementary Table 10.

For whole-transcriptome sequencing, total RNA was extracted using the AllPrep DNA/RNA Mini Kit (Qiagen) in accordance with the manufacturer’s protocol. Total RNA was quantified using a Qubit Fluorometer (Invitrogen) and the purity and integrity were assessed by the TapeStation RNA ScreenTape (Agilent Technologies). Total RNA-sequencing analysis enabled detection of both coding and noncoding RNA, along with other long intergenic noncoding RNA (lincRNA), small nuclear RNA (snRNA) and small nucleolar RNA (snoRNA). RNA-sequencing libraries were constructed using the KAPA RNA HyperPrep Kit, with RiboErase (Roche Molecular Systems, following the manufacturer’s protocol. We quantified and assessed the libraries using KAPA Library Quantification Kits for Illumina Sequencing platforms according to the qPCR Quantification Protocol Guide (KAPA Biosystems, KK4854) and the TapeStation D1000 ScreenTape (5067–5582, Agilent Technologies) recommendations. The generated libraries were sequenced by using a paired-end 150-bp read protocol on a NovaSeq 6000 platform (Illumina) with the S4 reagent kit (Illumina).

Genomic analysis and interpretation

Comprehensive genomic analysis and interpretation were conducted using the CancerVision platform (Inocras). WGS data were aligned to the GRCh38 human reference genome using bwa-mem (v.0.7.17-r1188)59. Preprocessing included duplicate marking and generation of compressed reference-oriented alignment map (CRAM) files. Somatic SNVs and short indels were called using Mutect2 (GATK v.4.0) and Strelka2 (v.2.9.10)60,61. High-confidence somatic variants were selected based on the following criteria: ≥2 variant reads in tumour, ≤1 variant read in matched normal, mapping quality ≥15, variant allele frequency (VAF) ≤ 5% in tumour, and population allele frequency ≤1% in the panel of normals.

Tumour purity, ploidy, and segmented CNV profiles were estimated with Sequenza (v.3.0.0)62 and somatic structural variations were identified using Delly (v.0.7.6)63. High-quality structural variations were defined as those with ≥2 variant reads in tumour, ≤1 in normal, mapping quality ≥15, and population allele frequency ≤5% in the panel of normals. GISTIC (v.2.0.23) was applied to identify recurrently amplified or deleted genomic regions64. AmpliconArchitect (v.1.2) was used to detect and characterize ecDNA amplicons42. Variant annotation was performed using the Ensembl Variant Effect Predictor (VEP, release 112)65. Variant call format (VCF) files were processed with bcftools (v.1.9). All variants, both germline and somatic, were subjected to rigorous manual review and curation within Inocras’s proprietary genome browser.

Identification of protein-coding driver genes

Protein-coding driver genes were identified using the IntOGen pipeline (v.2023), which integrates seven independent driver gene identification methods: dNdSCV, OncodriveFML, OncodriveCLUSTL, cBaSE, MutPanning, HotMaps3D and smRegions17. These methods collectively assess the selective advantage of somatic mutations in cancer genomes.

To determine a consensus set of driver genes, we combined the results from the seven methods using a strategy previously described. In brief, for each method, the top 100 ranked genes were selected along with their associated P values and q-values. Somatically mutated genes categorized as tier 1 or tier 2 in the COSMIC Cancer Gene Census (CGC) were used as a reference set of known drivers66. The relative enrichment of CGC genes in the top-ranked gene lists was evaluated to assign a per-method weighting. The final consensus ranking was obtained using Schulze’s voting method, and combined P values were estimated using a weighted Stouffer z-score method.

Candidate driver roles were further classified based on dN/dS ratios for missense (wmis) and nonsense (wnon) mutations, derived from dNdSCV. A distance metric was defined as:

$${\rm{Distance}}=({\rm{wmis}}-{\rm{wnon}})/\sqrt{2}$$

Candidate drivers with distance >0.1 were considered oncogenes, as they exhibited an excess of missense over nonsense mutations, and candidate drivers with distance <0.1 were classified as tumour suppressor genes, characterized by an excess of nonsense over missense mutations.

Additionally, each candidate driver was annotated based on its presence in any IntOGen cohorts from a previous pan-cancer analysis (Supplementary Table 11).

To ensure accuracy, we manually filtered the significant driver genes (combined q-value <0.10), excluding known artefacts (for example, PABPC1) and genes whose annotated driver roles were inconsistent with prior literature, such as JAK1, VAV2 and HGF.

Mutational signature analysis

We estimated contributions of mutational signatures to an observed mutational spectrum in each sample (the presumed amount of exposure to corresponding mutational processes). We solved the following constrained optimization problem67:

$${\mathrm{Argmin}}_{h}{|v-Wh|}^{2}$$

where vR10+, WRm×k0+, hR10+ where R denotes the set of real numbers, m×1 and k×1 indicate real vectors of dimensions m and k, respectively, m×k indicates a real matrix of dimension m by k, and 0+ denotes strictly positive real values (here, m is the number of mutation types and k is the number of mutational signatures). For each sample, given the observed counts of each mutation type v from a sample and the pre-trained mutational signature matrix W, we calculated exposure h. We used the R package pracma, which internally uses an active-set method to solve the above problem. Our method demonstrates comparable performance to SigProfiler68, an elaborated version of the framework used for the previous Catalogue Of Somatic Mutations In Cancer (COSMIC) compendium of mutational signatures, and SignatureAnalyzer69, which is based on a Bayesian variant of non-negative matrix factorization, in accurately estimating mutational signature contributions (Supplementary Discussion 3).

The relative contributions of mutational signatures were calculated by fitting 16 consensus mutational signatures previously identified in breast cancers (COSMIC signatures SBS1, SBS2, SBS3, SBS5, SBS8, SBS13, SBS17a, SBS17b, SBS18, ID1, ID2, ID3, ID5, ID6, ID8 and ID9; available at https://cancer.sanger.ac.uk/cosmic/signatures)2. For samples with a cosine similarity below 0.90, a manual review was performed, and additional COSMIC signatures were incorporated for refitting to improve the accuracy of signature decomposition. The additional signatures included those associated with polymerase eta somatic hypermutation activity (SBS9), polymerase epsilon exonuclease domain mutations (SBS10a and SBS10b), platinum chemotherapy treatment (SBS31), defective DNA mismatch repair (SBS21 and SBS44), and activation-induced cytidine deaminase (SBS84), as well as signatures of unknown aetiology (SBS34, SBS41, ID4 and ID11).

HRD and MSI

To assess HRD, we developed our proprietary algorithm by combining HRD-associated features, such as mutational signatures of point mutations, and copy number changes27. These included single-base substitution signatures (SBS3 and SBS8; reference signatures are available at https://cancer.sanger.ac.uk/signatures/), an indel signature (ID6), genomic rearrangement signatures (RS3 and RS5), deletions accompanied by microhomology, and CNVs. Custom scripts scaled scores of the multi-dimensional features using coefficients derived from published algorithms to compute the final HRD probability scores. For a quantitative evaluation of somatic microsatellite alterations, we considered both the score from MSIsensor70 and the proportion of microsatellite instability (MSI)-related mutational signatures (SBS6, SBS15, SBS20, SBS21, SBS26, SBS30 and SBS44)2.

Mutation copy numbers and determination of pre- and post-amplification events

We estimated mutation copy number (nmut) using the previously described formula71

$${n}_{\mathrm{mut}}={f}_{s}\times (1/\rho )\times [\rho \times {{n}^{t}}_{\mathrm{locus}}+{{n}^{{n}}}_{\mathrm{locus}}\times (1-\rho )]$$

In this formula, fs indicates variant allele fraction, ρ indicates tumour cellularity. nt locus and nn locus are absolute copy numbers in tumour and normal cells, respectively, which were derived from the following formula.

$${n}_{\mathrm{locus}}=2\times {\mathrm{RD}}_{\mathrm{locus}}/{\mathrm{RD}}_{\mathrm{auto}}$$

in which RDlocus indicates read depth of the locus of interest, and RDauto indicates average haploid autosomal coverage that was obtained from paired normal WGS.

A mutation was classified as a pre-amplification event, when nmut was larger than the major copy number of tumour multiplied by a factor of 0.75. If nmut was smaller than the above value but larger than 0.75, the mutation was assigned as post-amplification or minor allele mutations.

Molecular within-tumour timing of copy number gains

When a chromosomal segment is gained, it inherently co-amplifies all somatic mutations acquired within the segment up to that point in time, and thus the relative timing of genomic gains can be inferred by comparing frequencies of these co-amplified somatic mutations. To estimate relative event timing within individual tumours we utilized a relative timing measure, π, for every copy gain72,73, which represents the proportion of mutations per unit length of DNA that occurred before the event relative to the total number of mutations on the same DNA interval. Since mutations accumulate over time, this measure can be used to evaluate the timing of a copy number gain in a hypothetical molecular clock. To remove the effect of mutational processes that lead to a hypermutator phenotype (such as APOBEC mutagenesis, MSI and HRD), the number of mutations in each amplified segment was adjusted according to the proportion of SBS1 and SBS5, considering their clock-like nature74,75.

Breast cancer subtyping

We classified breast cancer samples into intrinsic subtypes using the prediction analysis of microarray 50 (PAM50) gene expression profiling method12. After loading gene expression data and removing duplicates and missing values, subtyping was conducted with the molecular.subtyping function from the genefu package76, applying the PAM50 model to classify samples into five categories: luminal A, luminal B, HER2-enriched, basal-like and normal-like.

To classify breast cancer samples into integrative clusters (IntClust) based on their copy number and expression profiles, we utilized the ic10 R package, a validated tool designed for the IntClust subtyping system. This classification method was originally derived from the METABRIC cohort and integrates both gene expression and segmented copy number variation data to define ten molecular subtypes of breast cancer52. The algorithm was run with default parameters, using log2 copy number ratios as input.

Scoring focally amplified genomics regions

To construct the landscape of focal amplifications across the whole genome, we calculated the F-score for each 5-Mb window. For each sample, we identified the segment with the maximum copy number within each window and computed the ratio of this copy number to the segment’s width. We then summed these ratios across all samples to define the F-score. The F-scores were subsequently plotted across the entire genome.

$$\begin{array}{l}{\rm{F}} \mbox{-} {\mathrm{score}}_{\mathrm{window}}\\ \,=\,{\Sigma }_{\mathrm{all\; samples}}\max {({\mathrm{major\; copy\; number}}_{\mathrm{segment}}/{\mathrm{width}}_{\mathrm{segment}})}_{\mathrm{window}}\end{array}$$

Survival analysis

All tumour samples used for survival analyses were collected prior to the initiation of targeted therapies, ensuring that the results were not influenced by treatment-induced genomic alterations. Survival analyses were conducted using the Kaplan–Meier method, and differences between groups were compared using the log-rank test.

PFS was defined as the time from treatment initiation to disease progression or death from any cause, whichever occurred first. Overall survival (OS) was measured from the time of diagnosis to death from any cause. Disease-free survival was defined as the time from adjuvant treatment initiation to the first documented recurrence or death from any cause, whichever occurred first.

In all survival analyses, both breast cancer-related and unrelated deaths were treated as events.

To evaluate the association between specific variables and survival outcomes, a Cox proportional hazards regression model was applied, with multivariate models incorporating relevant clinical and molecular covariates to adjust for potential confounders.

Given the mixed prospective and retrospective nature of this study, immortality bias (left truncation bias) was carefully addressed: (1) consistent survival definitions: the same survival time definitions were applied uniformly to both retrospective and prospective cohorts; (2) statistical adjustment for cohort type: cohort type (prospective versus retrospective) was included as a covariate in Cox regression models to correct for potential bias; (3) sensitivity analyses: we conducted additional sensitivity analyses, assessing the impact of cohort selection on survival estimates to ensure the robustness of our findings; and (4) external validation: where possible, independent external cohorts were used to validate survival analyses and confirm reproducibility.

These measures minimize potential survival biases, ensuring that our findings accurately reflect the prognostic and predictive significance of genomic alterations (Supplementary Discussion 6).

Statistics and reproducibility

Statistical tests or methods are described in the figure legends. We used R (v.3.4.0) for all data processing and secondary computational analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.