Whole-genome landscapes of 1,364 breast cancers

Kim, Ryul; Yu, Jonghan; Lim, Joonoh; Oh, Brian Baek-Lok; Nam, Seok Jin; Kim, Seok Won; Lee, Jeong Eon; Chae, Byung Joo; Kim, Ji-Yeon; Park, Ga Eun; Kang, Bong Joo; Paik, Pill Sun; Bae, Soo Yeon; Yoon, Chang Ik; Lee, Young Joo; Kim, Dooreh; Shin, Kabsoo; Lee, Ji Eun; Kang, Jun; Lee, Ahwon; Connolly-Strong, Erin; Lee, Sangmoon; Lee, Bo Rahm; Lee, Yuna; Yi, Ki Jong; Kwon, Young Oh; Chun, In Hwan; Park, Junggil; Kim, Jihye; Choi, Chahyun; Shin, Jong Yeon; Lee, Hyungjung; Kim, Minji; Park, Hansol; Jeong, Ilecheon; Yi, Boram; Lee, Won-Chul; Lee, Jeong Seok; Park, Woo Chan; Kim, Sung Hun; Choi, Yoon-La; Lee, Jeongmin; Ju, Young Seok; Park, Yeon Hee

doi:10.1038/s41586-025-09812-3

Article
Open access
Published: 03 December 2025

Nature volume 649, pages 1282–1291 (2026)

Abstract

Breast cancer remains a major global health challenge¹. Here, to comprehensively characterize its genomic landscape and the clinical significance of genomic characteristics, we analysed whole-genome sequences from 1,364 clinically annotated breast cancers, with transcriptome data available for most cases. Our study expands the repertoire of oncogenic alterations and identifies novel driver genes, recurrent gene fusions, structural variants and copy number alterations. Timing analyses on copy number alterations suggest that genomic instability emerges decades before tumour diagnosis, and offer insights into early initiation of tumorigenesis. Pattern-driven genomic features, including mutational signatures², homologous recombination deficiency³, tumour mutational burden and tumour heterogeneity scores⁴, were associated with clinical outcomes, highlighting their potential utility as predictive biomarkers for clinical evaluation of treatments such as CDK4/6 and HER2 inhibitors, as well as adjuvant and neoadjuvant chemotherapy. These findings highlight the power of large-scale, clinically annotated whole-genome sequencing in advancing our understanding of how genomic alterations shape patient outcomes.

Main

Breast cancer remains a substantial global health challenge, and possesses significant molecular and clinical heterogeneity¹. Over the past few decades, advances in understanding of genomic drivers have led to improved treatment modalities, including targeted therapies and immunotherapies⁵. However, recurrence and metastasis of breast cancers remain common, underscoring the need for a deeper understanding of the biology. Recent advances in genomic technology serve as essential assets to unravel the genetic complexities of breast cancer, facilitating personalized treatment approaches to improve outcomes.

Despite previous valuable insights, the molecular landscape of breast cancer remains only partially understood. Traditional methods, such as targeted sequencing approaches, mostly focus on individual mutations in known cancer genes, and miss significant information outside the targets and other pattern-based markers⁶, such as most genomic rearrangements, copy number alterations and mutational signatures⁷. In this context, whole-genome sequencing (WGS) is a comprehensive alternative technique that captures the full spectrum of genomic changes and offers an unbiased view of cancer genomes^3,8, enabling biological discovery and the exploration of potential biomarkers for future clinical application.

Over the past few decades, academic-driven cancer genome research has analysed more than 10,000 cancer genomes across diverse cancer types^8,9,10. Although these studies have identified and archived many genomic mutations, their clinical significance often remains unclear owing to insufficient integration of clinical records. To maximize the real-world impact of genome sequencing, it will be essential to integrate genomic data with comprehensive medical records, including treatment responses, disease recurrence and long-term clinical outcomes¹¹.

To this end, we sequenced and analysed the whole genomes of 1,364 breast cancers from a Korean cohort in combination with full medical records. To our knowledge, this cohort (referred to as the CUBRICS cohort) is the largest of its kind to date (Supplementary Methods, Supplementary Fig. 1 and Supplementary Table 1). In addition to WGS, transcriptome sequencing data were incorporated for most cases (n = 1,209; 88.6% of cases), which enabled stratification of these cancers into five prediction analysis of microarray 50 (PAM50) subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like)¹² and tracking the expression of acquired genomic variants. The cohort is characterized by a younger median age (44 years) and lower proportions of oestrogen receptor-positive (ER+)/luminal A subtypes compared with breast cancer cases in Western countries, making it a distinct population for investigation^13,14,15 (Supplementary Discussion 1).

Mutational landscapes and cancer drivers

From the 1,364 whole-genome sequences, we identified 10,929,118 somatic mutations, which include 8,935,132 single nucleotide variants (SNVs), 1,785,446 indels and 208,540 structural variants with a substantial variation in their burden per sample (Extended Data Fig. 1 and Supplementary Figs. 2 and 3 present the full oncogrid plots; interactive exploration of the genomic variants is also possible at https://open.cancervision.com). The median tumour mutational burden (TMB) was 4,742 (interquartile range (IQR) 2,392–8,842), with subtype-specific medians as follows: luminal A, 2,182 (IQR 1,628–2,841); luminal B, 4,027 (IQR 2,376–7,392); HER2-enriched, 6,737 (IQR 4,076–14,213); and basal-like, 8,360 (IQR 5,296–11,504). Basal-like tumours had the highest TMB, as previously reported¹⁶.

By applying the IntOGen pipeline¹⁷, we identified 41 breast cancer driver genes (Methods), including 4 novel driver genes in addition to classical drivers such as TP53, PIK3CA and GATA3³ (Fig. 1 and Extended Data Fig. 2). For example, BCL11B, which encodes a transcription factor that is involved in chromatin remodelling, gene regulation and mammary stem cell self-renewal^18,19, was mutated in 23 patients at much higher frequency than random expectation. Similarly, RREB1, which encodes a zinc finger transcription factor that regulates Ras and TGFβ signalling and directly binds the p53 promoter^20,21,22, was truncated in 26 patients, suggesting that its disruption could affect oncogenic pathways. Our data also suggest that RAF1²³ and SPECC1²⁴ (both were mutated in five patients) act as tumour suppressors, with potential roles in tumour initiation and progression.

**Fig. 1: Driver genes in breast cancer.**

Tumours with TP53 mutations frequently exhibited genomic instability²⁵, characterized by higher ploidy and increased homologous recombination DNA repair deficiency (HRD) for double-strand DNA breaks (Extended Data Fig. 3a,b). TP53 mutations were positively associated with the level of intratumour genetic heterogeneity (calculated by mutant allele tumour heterogeneity (MATH) scoring with tumour WGS⁴), particularly in basal-like subtypes (Extended Data Fig. 3c shows MATH scores associated with different TP53 mutations), indicating more active late-stage mutagenesis and emergence of potent subclones in TP53-mutant tumours. Notably, higher MATH scores were associated with poorer overall survival, particularly in TP53-mutant tumours (hazard ratio 1.66, 95% confidence interval 1.08–2.74; P = 0.048; Extended Data Fig. 3d,e), suggesting that intratumoural heterogeneity, as measured by MATH score, is associated with clinical outcomes and deserves further investigation as a potential prognostic factor. The prognostic effect of the MATH score was further demonstrated in the METABRIC cohort of 1,670 participants²⁶ (Extended Data Fig. 3f and Supplementary Discussion 2).
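The MATH score used above can be illustrated with a minimal Python sketch. It assumes the commonly cited definition (100 × the median absolute deviation of the variant allele fractions divided by their median, with the deviation scaled by 1.4826); the variant filters applied to tumour WGS in this study are described in ref. 4 and are not reproduced here. The function name and the example allele fractions are hypothetical.

```python
from statistics import median

def math_score(vafs):
    """MATH = 100 * MAD / median of the variant allele fractions (VAFs)."""
    vafs = list(vafs)
    if not vafs:
        raise ValueError("at least one VAF is required")
    centre = median(vafs)
    # Median absolute deviation, scaled by 1.4826 so that it estimates the
    # standard deviation when the data are normally distributed.
    mad = 1.4826 * median(abs(v - centre) for v in vafs)
    return 100.0 * mad / centre

# Hypothetical VAFs for illustration only (not study data).
clonal_like = [0.48, 0.50, 0.49, 0.51, 0.50]
heterogeneous = [0.10, 0.22, 0.35, 0.50, 0.48]

print(f"clonal-like tumour:   MATH = {math_score(clonal_like):.1f}")
print(f"heterogeneous tumour: MATH = {math_score(heterogeneous):.1f}")
```

A tumour whose mutations share a similar allele fraction yields a low score, whereas a wide spread of allele fractions, as expected when subclones are present, yields a high score.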

Key mutational signatures

To systematically explore the mutational processes that shape alterations in breast cancer genomes, we performed a comprehensive mutational signature analysis² (Methods, Supplementary Discussion 3 and Supplementary Fig. 4). We identified various signatures, including 17 SNV signatures, 9 insertion–deletion mutation (indel) signatures and 6 structural variant signatures (Fig. 2a; reference mutational signatures are available at https://cancer.sanger.ac.uk/signatures/). Signatures with clock-like properties, such as SBS5, ID1 and ID2, were present in nearly all samples, and accounted for a substantial proportion (more than 10%) of somatic mutations. By contrast, some mutational signatures contributed to a high number of mutations in only a subset of samples, suggesting hyper-mutagenicity that was restricted to the affected tumours. For example, mutational processes attributable to polymerase epsilon exonuclease domain mutations (SBS10a and SBS10b) and defective DNA mismatch repair (SBS21 and SBS44) were observed in four cases with an extremely high TMB (more than 100,000; Supplementary Fig. 5).

**Fig. 2: Mutational signatures in breast cancer.**

Mutational signatures associated with HRD^3,27, including SBS3, SBS8, ID6, SV3 and SV5, are of particular interest, as they are potential WGS-based biomarkers for poly(ADP-ribose) polymerase inhibitors (Fig. 2b,c). The combined estimated HRD scores (Methods) showed a clear bimodal distribution (Fig. 2c), identifying 315 cases as HRD in the CUBRICS cohort (23.1%, 315 out of 1,364; Fig. 2d and Supplementary Fig. 6). Although it is highly enriched in the basal-like cancers (72.2%, 182 out of 252), HRD was also present in other subtypes (Fig. 2d). Out of the 315 cases identified as HRD, we identified causal pathogenic variants on genes in the homologous recombination pathway in 130 cases (41.3%, 130 out of 315), including 90 with germline defects (28.6%, 90 out of 315) and 40 (17.8%, 40 out of 225) with only somatically acquired defects (Supplementary Discussion 4, Supplementary Fig. 7 and Supplementary Table 2). Although germline pathogenic variants predominantly inactivated BRCA1 and BRCA2, somatic mutations most frequently truncated RAD51B, primarily by genomic rearrangements rather than point mutations (Fig. 2e).

The CUBRICS cohort included 89 patients with triple-negative breast cancer (TNBC) with a treatment history of adjuvant chemotherapy containing anthracycline–cyclophosphamide (Supplementary Table 3). Individuals with HRD (n = 66) exhibited significantly superior disease-free survival compared with those with homologous recombination DNA repair-proficient (HRP) cancer (hazard ratio 0.10, 95% confidence interval 0.02–0.54; P = 0.006; Fig. 2f); this difference was more pronounced than in a previous study with a targeted panel-based assay²⁸, highlighting the potential of WGS for more specific clinical evaluation of HRD.

APOBEC-associated mutational signatures (SBS2 and SBS13; Fig. 2b) were widespread in breast cancers, contributing more than 10% of somatic SNVs in 633 samples. Of these, 12 samples were hypermutators with APOBEC-associated mutational processes, with TMB of more than 50,000. A germline deletion involving APOBEC3A and APOBEC3B (APOBEC3A/B), which effectively generates APOBEC3A–APOBEC3B fusion transcripts, has been suggested to predispose to APOBEC-mediated hypermutations²⁹ (Extended Data Fig. 4a). Of note, the population allele frequency of the germline deletion was 31.8% in our cohort (present in 736 individuals, 131 being homozygous and 605 heterozygous; Fig. 2g and Extended Data Fig. 4b), significantly higher than the frequency in the European population²⁹ (31.8% versus 8.5%; P < 0.001 by two-sided chi-square test), suggesting an East Asian origin. Contrary to previous reports^30,31, the deletion was not substantially enriched in patients with breast cancer compared with the background Korean population (51.0%, 844 out of 1,654 individuals, with 150 being homozygous and 694 heterozygous; P = 0.276 by two-sided chi-square test; Fig. 2g), suggesting that it is not a strong cancer-predisposing germline variant. However, carriers exhibited reduced APOBEC3B expression (Extended Data Fig. 4c), significantly higher TMB (median 4,325 in wild type versus median 5,148 in carriers; P < 0.001; Fig. 2h) and APOBEC-associated signatures (SBS2, SBS13 and ID9; Extended Data Fig. 4d) in cancers. In addition, wild-type individuals were substantially depleted in the APOBEC hypermutator cancers (10.0% in the top 30 cases versus 46.9% in others; P < 0.001 by two-sided Fisher’s exact test; Fig. 2i). APOBEC-deletion carriers showed greater enrichment of APOBEC-associated mutations in the YpTpCpN context than in the RpTpCpN context (where Y represents pyrimidine, R represents purine and C is the mutated cytosine; Fig. 2h).
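The allele-frequency arithmetic and the cohort-versus-background comparison quoted above can be checked with a short Python sketch. It uses only the genotype counts given in the text. The choice of a Pearson chi-square test on the 2 × 3 genotype table (wild type, heterozygous, homozygous; 2 degrees of freedom) is an assumption, because the text does not state which contingency table was tested; the function names are hypothetical.

```python
import math

def allele_frequency(n_hom, n_het, n_total):
    """Frequency of the deletion allele among 2 * n_total chromosomes."""
    return (2 * n_hom + n_het) / (2 * n_total)

def chi_square_2x3(row_a, row_b):
    """Pearson chi-square test on a 2 x 3 table; returns (statistic, P value)."""
    total = sum(row_a) + sum(row_b)
    stat = 0.0
    for row in (row_a, row_b):
        for observed, col_a, col_b in zip(row, row_a, row_b):
            expected = sum(row) * (col_a + col_b) / total
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    return stat, math.exp(-stat / 2)

# Genotype counts from the text: (wild type, heterozygous, homozygous).
cohort = (1364 - 736, 605, 131)
background = (1654 - 844, 694, 150)

af = allele_frequency(n_hom=131, n_het=605, n_total=1364)
stat, p = chi_square_2x3(cohort, background)
print(f"allele frequency = {af:.3f}, chi-square = {stat:.2f}, P = {p:.3f}")
```

Under this assumption the sketch gives an allele frequency of about 0.318 and a P value of about 0.28, in line with the reported 31.8% and P = 0.276.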

Frequently rearranged genes

A key mutational process in cancer is structural variation, which acts to amplify, delete or reorder chromosomal material at scales that range from single genes to entire chromosomes³². To determine the structural variation landscape, we generated a DNA interaction map between 5-Mb segments across the entire genome (Fig. 3a), along with large-scale copy number amplification patterns (CNAs; G-score). Of the 208,540 structural variation events, local intrachromosomal (<5 Mb) rearrangements were predominant (162,988 out of 208,540, 78.2%), but reshuffling of genomic material between distal intrachromosomal or interchromosomal regions was also observed (45,266 out of 208,540, 21.7%). Some of the rearrangements were observed more frequently than expected by chance. For example, 15 breast cancers carried an interchromosomal translocation between chromosomes 8 and 11, which brought CCND1, and ZNF703 and FGFR1 (ZNF703/FGFR1) into close proximity (mostly luminal B type; 73.3%, 11 out of 15; Fig. 3a, top left). Of note, the interaction was observed in lung cancer³³, and was also reported to be associated with reduced benefit from aromatase inhibitors in metastatic breast cancer³⁴. Consistent with these findings, tumours with these rearrangements exhibited distinct transcriptomic profiles, including upregulation of tumour progression-related pathways such as TNF, TGFβ signalling and epithelial–mesenchymal transition, as well as metabolic reprogramming (Extended Data Fig. 5).
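The binning behind such an interaction map can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used in the study: each structural variant is reduced to a pair of breakpoints, labelled as local (<5 Mb apart on the same chromosome), distal intrachromosomal or interchromosomal, and assigned to a pair of 5-Mb bins. Function names and breakpoint coordinates are hypothetical.

```python
from collections import Counter

BIN_SIZE = 5_000_000  # 5-Mb segments, as described in the text

def classify_sv(chrom1, pos1, chrom2, pos2, local_max=BIN_SIZE):
    """Label a breakpoint pair by the distance between its two ends."""
    if chrom1 != chrom2:
        return "interchromosomal"
    if abs(pos2 - pos1) < local_max:
        return "local"
    return "distal_intrachromosomal"

def interaction_bins(chrom1, pos1, chrom2, pos2, bin_size=BIN_SIZE):
    """Return the unordered pair of bins joined by a structural variant."""
    end_a = (chrom1, pos1 // bin_size)
    end_b = (chrom2, pos2 // bin_size)
    return tuple(sorted((end_a, end_b)))

# Hypothetical breakpoints for illustration only (not study data).
svs = [
    ("chr8", 37_500_000, "chr11", 69_600_000),
    ("chr8", 38_100_000, "chr11", 69_900_000),
    ("chr17", 39_700_000, "chr17", 39_950_000),
]

classes = Counter(classify_sv(*sv) for sv in svs)
interaction_map = Counter(interaction_bins(*sv) for sv in svs)
print(classes)
print(interaction_map.most_common(1))
```

Bin pairs with unusually high counts across a cohort would correspond to recurrent interactions such as the chromosome 8 and 11 translocation described above; assessing whether a count exceeds random expectation needs a background model that this sketch does not include.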

**Fig. 3: Whole-genome structural variation landscape.**

Another frequent interchromosomal translocation involved the ERBB2 locus on chromosome 17 and a locus on chromosome 20 (observed in 30 participants), where breast cancer super-enhancers were enriched³⁵ (Fig. 3a, top right). Structural variations can influence transcription by disrupting enhancer–promoter interactions, chromatin topology and long-range regulatory elements³⁶. In line with this idea, structural variations involving the ERBB2 locus were clustered with super-enhancers genome-wide (P = 0.028 by permutation test; Fig. 3b), which translocated super-enhancers to within about 90 kb of the ERBB2 locus (Fig. 3c). Tumours with extragenic structural variants in the vicinity (within 1 Mb) showed significantly higher ERBB2 gene expression levels than expected from their gene copy number (log₂ fold change >1, q-value <0.10; Fig. 3d). They often carried extragenic structural variants in the vicinity (within 1 Mb) with a cancer cell fraction (CCF) of approximately 1 (Fig. 3e), and distal breakpoints significantly enriched near breast cancer-specific super-enhancers³⁵ (Fig. 3f). Overall, our observations suggest a possible enhancer-hijacking mechanism for transcriptional activation in breast cancer genomes.

In addition, we found gene truncation events associated with structural variation. A total of 28 genes, including the well-established tumour suppressors RB1, PTEN and RUNX1, exhibited significant downregulation due to gene-truncating structural variants, reinforcing their tumour suppressor role in breast cancer (Fig. 3g).

Although recurrent fusion oncogenes are considered rare in breast cancer³⁷, we identified several recurrent fusion events (Fig. 3h and Extended Data Figs. 6 and 7). In luminal-type breast cancers, frequent fusion genes were MIPOL1–TTC6 (n = 9), CEP112–PRKCA (n = 6) and CCDC170–ESR1 (n = 6). In basal-like breast cancers, BCL2L14–ETV6 (n = 12), AGO2–PTK2 (n = 6) and BRD4–NOTCH3 (n = 6) were recurrent. Of note, CCDC170–ESR1 has been implicated in endocrine therapy resistance and metastasis in breast cancers³⁸ and BCL2L14–ETV6 has been linked to increased invasiveness and paclitaxel resistance³⁹. The fraction of variant-supporting reads suggests that most of the fusion events are clonal or near-clonal, implicating them in tumour initiation.

In the gene-level analysis, NRG1, which encodes a small ligand of ERBB family kinase receptors, was frequently affected by 166 structural variants in 110 cases (8%, 110 out of 1,364; Extended Data Fig. 8; top 2). Although gene disruption was the most frequent outcome (72.9%, 121 out of 166 structural variants), oncogenic fusions were sometimes formed with various partner genes⁴⁰ (n = 17; Fig. 3i), collectively sixfold more frequently than previous reports⁴¹ (1.25% versus 0.2%). Of note, the EGF-like domain of NRG1 was retained in all 17 fusion cases, supporting their oncogenic effect.

Hotspots of focal amplifications

Focal amplification, which increases the copy number of short-segment DNA (typically up to 3 Mb in size, with a copy number gain greater than 3 compared with flanking regions), is a prominent signature in many human cancers^42,43. To explore these events in breast cancers, we calculated the recurrence score of focal amplifications across the genome, referred to as F-score (Fig. 4a). The most prominent peaks were observed at chromosome 8 (chr. 8):35–40 Mb, chr. 11:65–70 Mb and chr. 17:35–40 Mb, which included ZNF703 (chr. 8; n = 124; 9.1%) and FGFR1 (chr. 8; n = 107; 7.8%), CCND1 (chr. 11; n = 108; 7.9%) and ERBB2 (chr. 17; n = 255; 18.7%), respectively (Fig. 4b and Extended Data Fig. 9a). Notably, ZNF703 exhibited higher amplification levels than FGFR1 on chromosome 8, suggesting its oncogenic role in breast cancer.
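The working definition of a focal amplification given above (a segment of up to 3 Mb with a copy number gain greater than 3 relative to flanking regions) can be expressed as a small Python sketch. The per-bin sample count below is a simple stand-in for recurrence and is not the F-score, whose exact calculation is not given in this section. Function names, the tuple layout and the example segments are hypothetical.

```python
from collections import Counter

MAX_FOCAL_LENGTH = 3_000_000  # "typically up to 3 Mb in size"
MIN_GAIN = 3                  # gain relative to flanking regions must exceed this

def is_focal_amplification(start, end, copy_number, flank_copy_number):
    """Apply the length and gain thresholds quoted in the text."""
    short_enough = (end - start) <= MAX_FOCAL_LENGTH
    gained_enough = (copy_number - flank_copy_number) > MIN_GAIN
    return short_enough and gained_enough

def recurrence_by_bin(segments, bin_size=5_000_000):
    """Count samples carrying a focal amplification in each genomic bin."""
    hits = set()
    for sample, chrom, start, end, cn, flank_cn in segments:
        if is_focal_amplification(start, end, cn, flank_cn):
            for b in range(start // bin_size, end // bin_size + 1):
                hits.add((sample, chrom, b))  # each sample counted once per bin
    return Counter((chrom, b) for _, chrom, b in hits)

# Hypothetical segments: (sample, chromosome, start, end, copy number, flank copy number).
segments = [
    ("S1", "chr17", 39_500_000, 39_900_000, 20, 3),  # focal
    ("S2", "chr17", 39_600_000, 40_200_000, 9, 2),   # focal, spans two bins
    ("S3", "chr17", 30_000_000, 45_000_000, 8, 2),   # too long to be focal
    ("S4", "chr8", 37_000_000, 38_500_000, 5, 3),    # gain too small
]

recurrence = recurrence_by_bin(segments)
print(recurrence)
```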

**Fig. 4: CNAs in breast cancer with clinical and evolutionary implications.**

Gene expression levels of the oncogenes positively correlated with gene copy number in a dose-dependent manner (Extended Data Fig. 9b). Notably, ERBB2 expression exhibited mutual exclusivity with the other oncogenes, suggesting a redundant oncogenic mechanism between ERBB2 and the other three genes in breast tumorigenesis (Extended Data Fig. 9c). However, CCND1 and ZNF703/FGFR1 loci displayed a positive correlation in RNA expression, in line with their co-rearrangement as previously discussed (Fig. 3a).

Of note, more than 40% of the ERBB2, FGFR1, ZNF703 and CCND1 focal amplifications were formed by extrachromosomal DNA (ecDNA) (Fig. 4c), suggesting that ecDNA is a major mechanism of focal amplification^42,44. Overall, 387 cancer tissues (28.4%, 387 out of 1,364) carried ecDNA, whose location was tightly correlated with F-score (Fig. 4a).

Clinical effect of ERBB2 focal gain

Immunohistochemistry (IHC) targeting HER2 protein, encoded by ERBB2, is a gold standard technique for clinical detection of HER2-positive breast cancers⁴⁵. In our cohort, as expected, HER2 IHC correlated with, but did not completely overlap with PAM50 subtypes or ERBB2 gene copy number gains (Fig. 4d). For example, ecDNA-derived focal amplification of ERBB2 (copy number 68) was observed in a case with HER2 IHC 0 (Extended Data Fig. 9d). Conversely, we also observed a case with HER2 IHC 3+ without obvious ERBB2 amplification (Extended Data Fig. 9e). The PAM50 subtype of this case was not HER2-enriched but luminal B, with ERBB2 transcription markedly lower (transcripts per million (TPM) = 43.5) than typically observed in HER2 IHC 3+ tumours (mean TPM = 692.0), but similar to HER2 IHC 0 tumours (Extended Data Fig. 9e).

We then evaluated whether these genome and transcriptome data can predict treatment responses in HER2-positive cancer. Out of 75 individuals with HER2-positive tumours who received neoadjuvant TCHP regimens (docetaxel (microtubule inhibitor), carboplatin (platinum intercalator), trastuzumab and pertuzumab (anti-HER2 inhibitors)), 38 showed pathologic complete response (pCR) (Fig. 4e and Supplementary Table 4). Although cases without pCR (n = 37) were characterized by a higher prevalence of luminal subtypes (P < 0.001, two-sided Fisher’s exact test), hormone receptor positivity (P = 0.001, two-sided chi-square test) and PIK3CA mutations (P = 0.003, two-sided chi-square test), cases with pCR (n = 38) exhibited significantly higher ERBB2 expression (P < 0.001, two-sided Student’s t-test) and a greater frequency of ERBB2 focal amplification (P = 0.001, two-sided Fisher’s exact test; Extended Data Fig. 9f–h (external validation in the TransNEO cohort⁴⁶)). Although HER2 IHC 3+ was a sensitive biomarker for pCR (36 out of 38, 94.7%), ERBB2 copy number (n ≥ 33; Extended Data Fig. 9i,j) demonstrated higher precision (30 out of 38, 78.9%; Fig. 4f). Overall, ERBB2 copy number provided complementary performance, with higher specificity, precision and likelihood ratio (Fig. 4g).
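The performance measures named above follow from a 2 × 2 table of test result against pCR, as the Python sketch below illustrates. The counts in the example are an assumption made for illustration: "30 out of 38" is read as 30 pCR cases among 38 tumours with ERBB2 copy number of 33 or more, and the remaining cells are derived from the reported totals of 38 pCR and 37 non-pCR cases. The full table is in Fig. 4f,g and is not reproduced here; the function name is hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard binary-test metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    # Positive likelihood ratio; treated as infinite when there are no false positives.
    lr_positive = sensitivity / (1 - specificity) if fp > 0 else float("inf")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "lr_positive": lr_positive,
    }

# Assumed counts for ERBB2 copy number >= 33 versus pCR (see the caveat above).
metrics = diagnostic_metrics(tp=30, fp=8, fn=8, tn=29)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity asks how many responders the marker detects, whereas precision asks how many marker-positive tumours respond, which is why a marker can be less sensitive than HER2 IHC 3+ and still more precise.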

Of note, ecDNA is generally associated with poor prognosis across multiple cancer types⁴⁷. Notably, in our cohort, tumours with ERBB2 copy number of 33 or more were significantly more likely to be ecDNA-positive (P = 0.003; Fig. 4f, right); however, ERBB2 amplification by ecDNA was not associated with pCR rates (P = 0.540). Our observation suggests that absolute copy number, rather than the amplification mechanism, may be a more relevant predictor of therapeutic response. It also indicates that ERBB2-amplifying ecDNA may be an actionable target for anti-HER2 therapy.

Further, we observed a significant enrichment of chromothripsis events, a complex genomic rearrangement with catastrophic chromosomal shattering and erroneous reassembly, in cases with pCR⁴⁸ (71.1% versus 43.2%; Fig. 4e,h and Supplementary Fig. 8). Adding chromothripsis to cancer stratification improved the precision and specificity of predicting responses to TCHP (Fig. 4g, Extended Data Fig. 9k and Supplementary Discussion 5).

Long-segmental copy number gains

Along with focal amplifications, long-segmental amplifications (typically larger than 1 Mb), often chromosome- or arm-level alterations, were common in breast cancers (Fig. 3a). The timing analysis with the burden of co-amplified, pre-amplification somatic base substitutions yielded the acquisition timing of the frequent long-segmental CNAs in breast cancer in molecular time⁴⁹ (Extended Data Fig. 10). Most of the recurrent long-segmental CNAs were acquired by 20% of molecular time, when recent common ancestral cells of a cancer emerge, which should be decades earlier than tumour diagnosis. This implies that acquisition of the long-segmental CNAs is an early evolutionary event in breast cancer, which is presumably acquired in early puberty, consistent with a previous report⁵⁰. Our findings also suggest that full neoplastic transformation can take decades from the first event of genomic instability.
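The logic of timing a copy number gain from co-amplified mutations can be illustrated for the simplest case, a single-copy gain (1+1 to 2+1), under the assumptions of a constant mutation rate per copy and clonal mutations only. Mutations acquired on the duplicated allele before the gain are carried on two copies, so their number is proportional to t, the molecular time of the gain. Mutations carried on one copy arise on the other allele before the gain (proportional to t) or on any of the three copies afterwards (proportional to 3(1 − t)). Solving gives t = 3·n_multi / (n_single + 2·n_multi). The Python sketch below implements only this simplified case; the method used in the study (ref. 49) handles more complex configurations and is not reproduced here. The function name and counts are hypothetical.

```python
def gain_timing(n_multi, n_single):
    """Molecular time (0 to 1) of a single-copy gain, 1+1 -> 2+1.

    n_multi:  clonal mutations carried on both copies of the duplicated allele
              (acquired before the gain)
    n_single: clonal mutations carried on a single copy
    """
    if n_multi < 0 or n_single < 0 or n_multi + n_single == 0:
        raise ValueError("mutation counts must be non-negative and not both zero")
    t = 3 * n_multi / (n_single + 2 * n_multi)
    # Sampling noise can push the estimate above 1, so clamp it.
    return min(t, 1.0)

# Hypothetical counts for illustration only (not study data).
print(gain_timing(n_multi=20, n_single=260))   # early gain
print(gain_timing(n_multi=150, n_single=200))  # late gain
```

A region with few co-amplified mutations relative to single-copy mutations is therefore inferred to have been gained early in molecular time.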

The most common CNAs occurred at 8q21.13, with a relative timing of 0.15 (IQR 0.08–0.25) and a median copy number gain of 5.5 (IQR 4.0–8.0). This amplification was associated with poor overall survival (hazard ratio 1.67, 95% confidence interval 1.09–2.55; P = 0.019). This region contains the MYC oncogene, which has been linked to treatment resistance and poor survival across various solid tumours, including breast cancer⁵¹.

Of note, CNAs of 9p23 (n = 40), occurring at a relative timing of 0.13 (IQR 0.09–0.25), were enriched in basal-like cancers and were associated with poor overall survival in the subtype (hazard ratio 2.45, 95% confidence interval 1.16–5.18; P = 0.019; Fig. 4i and Supplementary Table 5). External validation in the METABRIC cohort confirmed this trend⁵² (Extended Data Fig. 11a,b and Supplementary Table 6). The 9p23 gain was associated with 823 differentially expressed genes genome-wide (282 upregulated and 531 downregulated; Extended Data Fig. 11c), including PSIP1, which has been reported to be negatively correlated with survival of patients with TNBC⁵³.

Late-stage CNA acquisitions were more frequently observed in HRD breast cancers (P < 0.001; Extended Data Fig. 11d,e). This implies that the rate of large-segmental CNAs is accelerated when a tumour cell acquires the HRD phenotype. For instance, in one case with HRD (no. 635; Fig. 4j), the rate of CNA acquisition was substantially increased from about 75% of the molecular time, which we speculate is the time of complete inactivation of the homologous recombination-mediated DNA repair pathway.

By contrast, in another case with HRD (no. 703; Fig. 4k), there was no such late CNA acquisition stage. The cancer developed talazoparib resistance after ten months of the second-line palliative treatment. In line with the resistance, the post-therapy cancer genome exhibited an 8-bp deletion in the vicinity of the germline 1-bp deletion in BRCA1, which is likely to have converted an HRD tumour back to HRP (Fig. 4l). From the timing analysis (Fig. 4k, bottom), we believe that the rescuing mutation could have been acquired at about 50% of the molecular time in one of the cells in the primary cancer tissue, which could have led to talazoparib resistance.

Potential biomarkers from whole genomes

Finally, we explored the potential utility of whole-genome results by integrating genomic profiles and clinical records. First, we observed that an increased MATH score, which is indicative of higher intratumoural heterogeneity, was associated with worse progression-free survival (PFS) in first-line anti-HER2 therapies against HER2-positive breast cancer. In our cohort, 45 patients who received the therapy showed a median PFS of 1.08 years (range 0.14–9.96 years; as of 15 November 2023; Fig. 5a and Supplementary Table 7). Individuals with a high MATH score (≥40; cut-off with the median MATH score) exhibited worse PFS compared to those with a MATH score below 40 (hazard ratio 2.95, 95% confidence interval 1.48–5.91; P = 0.002; Fig. 5b,c).

**Fig. 5: WGS-based biomarkers for first-line palliative treatment of metastatic breast cancer.**

Similarly, we found that high TMB and HRD were associated with worse PFS in CDK4/6 inhibitor plus endocrine therapy, a first-line palliative treatment recently developed for hormone receptor-positive breast cancer⁵⁴ (Fig. 5d–g). In a total of 57 patients receiving the regimens, individuals with a high TMB (≥1.7 per Mb, cut-off with the median TMB) exhibited poor PFS (hazard ratio 2.55, 95% confidence interval 1.26–5.16; P = 0.009; Fig. 5d,e). Thirteen out of the 57 patients exhibited HRD, 85% of whom experienced progression, resulting in a shorter PFS than those with HRP (hazard ratio 4.20, 95% confidence interval 1.91–9.25; P < 0.001; Fig. 5f and Supplementary Table 8). Multivariate Cox proportional hazards analysis indicated HRD as the single most significant factor predicting unfavourable PFS on the treatment (hazard ratio 10.20, 95% confidence interval 1.68–61.80; P = 0.012; Fig. 5g), warranting validation in larger, prospective studies. In gene set enrichment analysis (GSEA), these HRD tumours showed significant enrichment of pathways related to cell cycle progression, DNA repair and mitotic checkpoint activation, suggesting that these tumours may exhibit increased reliance on cell cycle regulatory mechanisms (Extended Data Fig. 12). Of note, PTEN mutations were exclusively present in individuals who experienced progression on the treatment, suggesting that they had a negative impact on the treatment (P = 0.042, two-sided Fisher’s exact test; Fig. 5d).

Discussion

This study provides a comprehensive analysis of 1,364 tumour–normal pair WGS datasets with highly annotated paired clinical information for breast cancer (the findings are summarized in Supplementary Table 9). Our extensive efforts enabled us to construct a highly detailed somatic mutation landscape of breast cancer, revealing novel insights into the heterogeneous and complex genomics of the disease. By leveraging this meticulously curated clinical data, we were able to identify genomic features with potential clinical relevance and offer hypotheses for future prospective validation (Supplementary Discussion 5).

Our findings highlight the potential of whole-genome analysis for advancing precision oncology for breast cancer. However, integration of such whole-genome analysis into clinical practice has been challenging owing to the complexity of data interpretation, variability in clinical utility and lack of standardization. Despite these challenges, recent studies have shown promising results, such as WGS aiding in decision making for childhood solid cancers⁵⁵, providing actionable insights for adult cancers¹¹ and improving treatment outcomes in high-risk paediatric patients⁵⁶. These findings, together with our breast cancer data, highlight the increasing potential of WGS in clinical research and its value for building a robust platform for systematic investigation through the integration of clinical and genomic big data.

Intratumoural heterogeneity poses a significant challenge in the clinical management of breast cancer, particularly in HER2-positive cases⁵⁷, as it is closely linked to treatment responses. Traditionally, heterogeneity has been assessed through pathology, which is limited by its reliance on small tumour samples and often cannot detect subclonal mutations. By contrast, WGS offers a more comprehensive approach by analysing the entire genome, enabling detection of diverse genetic alterations, including subclonal mutations. This comprehensive analysis provides a quantitative understanding of tumour heterogeneity, capturing the full genetic diversity within tumours and aiding in the prediction of treatment responses (Fig. 5a–c). Given these advantages, we anticipate that WGS-based quantitative assessment of tumour heterogeneity, both at diagnosis and during progression, will have a critical role in shaping future precision oncology strategies.

The exploration of genome-wide mutational patterns reveals the various mutational processes that are active in breast cancer. Identifying mutational signatures associated with specific biological processes, such as HRD and APOBEC activity, provides a deeper understanding of different breast cancer subtypes². Our study demonstrates the potential clinical relevance of these mutational patterns. In particular, it revealed the potential of HRD as a predictive biomarker for treatment response, both in adjuvant chemotherapy for TNBC (Fig. 2f) and in first-line CDK4/6 inhibitor treatment for patients with advanced hormone receptor-positive breast cancers (Fig. 5d–g), providing a rationale for further investigation. Notably, HRD predicted a better response to the former but a worse prognosis with the latter. This dichotomy highlights the nuanced role of HRD across different treatment contexts.

Cancer progresses along a specific trajectory, with somatic events accumulating in a defined order over time. By analysing somatic mutations, we were able to reconstruct the chronological trajectory of CNAs in key genomic regions (Extended Data Fig. 10), including ERBB2. Traditional sequencing methods have primarily viewed the genome in a two-dimensional manner, mapping genomic coordinates against the presence of mutations. However, by incorporating a time axis, we can visualize the genome in three dimensions. This advancement offers a more dynamic perspective, and provides crucial insights into cancer biology, tumorigenesis and the resistance mechanisms that emerge during treatment (Fig. 4j–l). Such data will be invaluable in deepening our understanding of these processes.

One potential limitation of our study is the possibility of left truncation bias in survival analyses, owing to the mixed retrospective and prospective nature of the cohort and variable timing of inclusion. In retrospective data, participants must survive until enrolment, which may exclude early deaths and bias survival estimates. Although our dataset lacks formal entry time variables for fully adjusted models, we applied several complementary strategies to mitigate this bias, including subgroup analyses and external validation (Supplementary Discussion 6). Nonetheless, residual bias may remain, particularly for the subgroup of patients with de novo metastases, and we urge caution when interpreting the overall survival findings in this context.

Overall, this study enhances our understanding of the genomic landscape of breast cancer and underscores the value of WGS in discovery research and its potential to inform future clinical strategies, contingent upon further validation. Integrating genomic data with detailed clinical outcomes paves the way for more personalized and effective treatment strategies, with the ultimate aim of improving outcomes for patients. Future prospective clinical trials will be essential to confirm the functional implications of the identified genomic alterations, facilitating their translation into real-world clinical practice.

Methods

Study design and participants

Participants were recruited from Samsung Medical Center and Seoul St Mary’s Hospital (Seoul, Korea) between 2012 and 2023 through prospective and retrospective cohorts. Retrospective cases were selected based on the availability of archived primary tumour and matched normal blood samples, along with sufficient clinical information, and were enrolled regardless of disease stage or survival status at accrual. All retrospective samples were obtained shortly after diagnosis, typically at curative-intent surgery. Detailed study design, inclusion and exclusion criteria, and the CONSORT diagram are provided in Supplementary Methods and Supplementary Fig. 1. The study was approved by the Institutional Review Boards of both institutions (Samsung Medical Center: SMC 2022-05-050 and SMC 2013-04-005; Seoul St Mary’s Hospital: KC21TISI0007 and KC22TISI0292) and conducted in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. Written informed consent, including consent to publish de-identified clinical and genomic data, was obtained from all prospective participants. For retrospective cases, the use and publication of de-identified data were approved by the relevant institutional review boards, with a waiver of informed consent where applicable. A subset of participants was enrolled as part of substudy components within the clinical trials NCT03131089 and NCT06334471, which informed specific aspects of the study design.

Sample preparation for sequencing

We performed WGS using the CancerVision assay as previously reported⁵⁸. In brief, WGS was performed on tumour samples obtained as part of routine clinical care either via surgery or biopsy and stored as fresh frozen tissue. For biopsies, sample cores were retrieved first for routine pathology, followed by at least one additional core for cancer WGS. For the matched normal samples, peripheral blood was used. DNA extraction and library preparation were performed at Inocras in a Clinical Laboratory Improvement Amendments (CLIA)-certified laboratory. We used the AllPrep DNA/RNA Mini Kit (Qiagen) for DNA extraction, and TruSeq DNA PCR-Free (Illumina) for library preparation. Sequencing was performed on the Illumina NovaSeq 6000 platform (Illumina) with an average depth of coverage of 40× for tumour and 20× for blood. Quality assessment of the WGS data is available in Supplementary Table 10.

For whole-transcriptome sequencing, total RNA was extracted using the AllPrep DNA/RNA Mini Kit (Qiagen) in accordance with the manufacturer’s protocol. Total RNA was quantified using a Qubit Fluorometer (Invitrogen) and the purity and integrity were assessed by the TapeStation RNA ScreenTape (Agilent Technologies). Total RNA-sequencing analysis enabled detection of both coding and noncoding RNA, including long intergenic noncoding RNA (lincRNA), small nuclear RNA (snRNA) and small nucleolar RNA (snoRNA). RNA-sequencing libraries were constructed using the KAPA RNA HyperPrep Kit with RiboErase (Roche Molecular Systems), following the manufacturer’s protocol. We quantified and assessed the libraries using KAPA Library Quantification Kits for Illumina Sequencing platforms according to the qPCR Quantification Protocol Guide (KAPA Biosystems, KK4854) and the TapeStation D1000 ScreenTape (5067–5582, Agilent Technologies) recommendations. The generated libraries were sequenced using a paired-end 150-bp read protocol on a NovaSeq 6000 platform (Illumina) with the S4 reagent kit (Illumina).

Genomic analysis and interpretation

Comprehensive genomic analysis and interpretation were conducted using the CancerVision platform (Inocras). WGS data were aligned to the GRCh38 human reference genome using bwa-mem (v.0.7.17-r1188)⁵⁹. Preprocessing included duplicate marking and generation of compressed reference-oriented alignment map (CRAM) files. Somatic SNVs and short indels were called using Mutect2 (GATK v.4.0) and Strelka2 (v.2.9.10)⁶⁰,⁶¹. High-confidence somatic variants were selected based on the following criteria: ≥2 variant reads in tumour, ≤1 variant read in matched normal, mapping quality ≥15, variant allele frequency (VAF) ≥5% in tumour, and population allele frequency ≤1% in the panel of normals.
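The variant-level criteria above can be expressed as a single predicate. The following is a minimal sketch rather than the CancerVision implementation: the `VariantCall` record and its field names are hypothetical, and the VAF threshold is treated as a lower bound, which is the usual convention for high-confidence calls and an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Minimal per-variant evidence; all field names are hypothetical."""
    tumour_alt_reads: int
    tumour_depth: int
    normal_alt_reads: int
    mapping_quality: float
    pon_allele_frequency: float  # allele frequency in the panel of normals


def is_high_confidence(call: VariantCall,
                       min_tumour_alt_reads: int = 2,
                       max_normal_alt_reads: int = 1,
                       min_mapping_quality: float = 15.0,
                       min_tumour_vaf: float = 0.05,
                       max_pon_frequency: float = 0.01) -> bool:
    """Return True when a call passes every threshold listed in the text."""
    if call.tumour_depth <= 0:
        return False
    # VAF is assumed here to be variant reads divided by total depth in tumour.
    vaf = call.tumour_alt_reads / call.tumour_depth
    return (call.tumour_alt_reads >= min_tumour_alt_reads
            and call.normal_alt_reads <= max_normal_alt_reads
            and call.mapping_quality >= min_mapping_quality
            and vaf >= min_tumour_vaf
            and call.pon_allele_frequency <= max_pon_frequency)
```

Keeping the thresholds as keyword arguments lets the same function be reused with other cut-offs by changing `max_pon_frequency` and dropping the VAF bound.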

Tumour purity, ploidy, and segmented CNV profiles were estimated with Sequenza (v.3.0.0)⁶² and somatic structural variations were identified using Delly (v.0.7.6)⁶³. High-quality structural variations were defined as those with ≥2 variant reads in tumour, ≤1 in normal, mapping quality ≥15, and population allele frequency ≤5% in the panel of normals. GISTIC (v.2.0.23) was applied to identify recurrently amplified or deleted genomic regions⁶⁴. AmpliconArchitect (v.1.2) was used to detect and characterize ecDNA amplicons⁴². Variant annotation was performed using the Ensembl Variant Effect Predictor (VEP, release 112)⁶⁵. Variant call format (VCF) files were processed with bcftools (v.1.9). All variants, both germline and somatic, were subjected to rigorous manual review and curation within Inocras’s proprietary genome browser.

Identification of protein-coding driver genes

Protein-coding driver genes were identified using the IntOGen pipeline (v.2023), which integrates seven independent driver gene identification methods: dNdSCV, OncodriveFML, OncodriveCLUSTL, cBaSE, MutPanning, HotMaps3D and smRegions¹⁷. These methods collectively assess the selective advantage of somatic mutations in cancer genomes.

To determine a consensus set of driver genes, we combined the results from the seven methods using a strategy previously described. In brief, for each method, the top 100 ranked genes were selected along with their associated P values and q-values. Somatically mutated genes categorized as tier 1 or tier 2 in the COSMIC Cancer Gene Census (CGC) were used as a reference set of known drivers⁶⁶. The relative enrichment of CGC genes in the top-ranked gene lists was evaluated to assign a per-method weighting. The final consensus ranking was obtained using Schulze’s voting method, and combined P values were estimated using a weighted Stouffer z-score method.
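As an illustration of the final combination step, the sketch below implements a generic weighted Stouffer z-score combination of one-sided P values. It is not the IntOGen code; the function name and the clipping bounds are assumptions introduced here, and the per-method weights are taken as given.

```python
from math import sqrt
from statistics import NormalDist


def weighted_stouffer(p_values, weights):
    """Combine one-sided P values using the weighted Stouffer z-score method."""
    if not p_values or len(p_values) != len(weights):
        raise ValueError("p_values and weights must be non-empty and equal in length")
    weight_norm = sqrt(sum(w * w for w in weights))
    if weight_norm == 0.0:
        raise ValueError("at least one weight must be non-zero")
    std_normal = NormalDist()
    # Clip P values away from 0 and 1 so that the z-scores stay finite.
    clipped = [min(max(p, 1e-15), 1.0 - 1e-15) for p in p_values]
    # A small P value maps to a large positive z-score.
    z_scores = [-std_normal.inv_cdf(p) for p in clipped]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / weight_norm
    # Upper-tail probability of the combined z-score.
    return std_normal.cdf(-combined_z)
```

With equal weights, two methods that each report P = 0.05 combine to roughly P = 0.01; giving a larger weight to the method with the smaller P value lowers the combined P value further.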

Candidate driver roles were further classified based on dN/dS ratios for missense (wmis) and nonsense (wnon) mutations, derived from dNdSCV. A distance metric was defined as:

$${\rm{Distance}}=({\rm{wmis}}-{\rm{wnon}})/\sqrt{2}$$

Candidate drivers with distance >0.1 were considered oncogenes, as they exhibited an excess of missense over nonsense mutations, and candidate drivers with distance <−0.1 were classified as tumour suppressor genes, characterized by an excess of nonsense over missense mutations.
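The role assignment can be sketched as follows. This assumes symmetric cut-offs of ±0.1 around zero and leaves genes between the two cut-offs as "ambiguous"; that third label and the function name are assumptions of this sketch.

```python
from math import sqrt


def driver_role(w_mis, w_non, cutoff=0.1):
    """Classify a candidate driver from its dN/dS ratios.

    w_mis and w_non are the dN/dS ratios for missense and nonsense
    mutations. The "ambiguous" label for intermediate distances is an
    assumption of this sketch.
    """
    distance = (w_mis - w_non) / sqrt(2)
    if distance > cutoff:
        return "oncogene"  # excess of missense over nonsense mutations
    if distance < -cutoff:
        return "tumour suppressor gene"  # excess of nonsense over missense
    return "ambiguous"
```

For example, a gene with wmis = 3.0 and wnon = 1.0 has a distance of about 1.41 and is labelled an oncogene, whereas wmis = 1.5 and wnon = 8.0 gives about −4.60 and a tumour suppressor label.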

Additionally, each candidate driver was annotated based on its presence in any IntOGen cohorts from a previous pan-cancer analysis (Supplementary Table 11).

To ensure accuracy, we manually filtered the significant driver genes (combined q-value <0.10), excluding known artefacts (for example, PABPC1) and genes whose annotated driver roles were inconsistent with prior literature, such as JAK1, VAV2 and HGF.

Mutational signature analysis

We estimated contributions of mutational signatures to an observed mutational spectrum in each sample (the presumed amount of exposure to corresponding mutational processes). We solved the following constrained optimization problem⁶⁷:

$${\mathrm{Argmin}}_{h}{|v-Wh|}^{2}$$

where v ∈ R^{m×1}_{0+}, W ∈ R^{m×k}_{0+} and h ∈ R^{k×1}_{0+}. Here, R denotes the set of real numbers, m×1 and k×1 indicate real vectors of dimensions m and k, respectively, m×k indicates a real matrix of dimension m by k, and 0+ denotes non-negative real values (m is the number of mutation types and k is the number of mutational signatures). For each sample, given the observed counts of each mutation type v and the pre-trained mutational signature matrix W, we calculated the exposure h. We used the R package pracma, which internally uses an active-set method to solve the above problem. Our method demonstrates comparable performance to SigProfiler⁶⁸, an elaborated version of the framework used for the previous Catalogue Of Somatic Mutations In Cancer (COSMIC) compendium of mutational signatures, and SignatureAnalyzer⁶⁹, which is based on a Bayesian variant of non-negative matrix factorization, in accurately estimating mutational signature contributions (Supplementary Discussion 3).
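To make the optimization concrete, the sketch below solves the same non-negative least-squares problem in plain Python. It uses cyclic coordinate descent rather than the active-set method used by pracma, so it illustrates the problem being solved and does not reproduce the authors' solver. The function name and the toy matrices in the usage note are hypothetical.

```python
def nnls_exposures(W, v, n_iter=5000, tol=1e-12):
    """Minimise |v - W h|^2 subject to h >= 0 by cyclic coordinate descent.

    W is a list of m rows, each holding k signature weights;
    v is a list of m observed mutation counts.
    """
    m, k = len(W), len(W[0])
    h = [0.0] * k
    residual = [float(x) for x in v]  # equals v - W h while h is all zeros
    col_norms = [sum(W[i][j] ** 2 for i in range(m)) for j in range(k)]
    for _ in range(n_iter):
        largest_step = 0.0
        for j in range(k):
            if col_norms[j] == 0.0:
                continue
            # Exact one-dimensional minimiser for h[j], projected onto h[j] >= 0.
            correlation = sum(W[i][j] * residual[i] for i in range(m))
            new_hj = max(0.0, h[j] + correlation / col_norms[j])
            step = new_hj - h[j]
            if step != 0.0:
                for i in range(m):
                    residual[i] -= W[i][j] * step
                h[j] = new_hj
                largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return h
```

As a usage example, with a toy signature matrix `W = [[0.7, 0.1], [0.2, 0.3], [0.1, 0.6]]` and counts `v = [75, 35, 40]`, the recovered exposures are close to 100 and 50, the values used to generate `v`.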

The relative contributions of mutational signatures were calculated by fitting 16 consensus mutational signatures previously identified in breast cancers (COSMIC signatures SBS1, SBS2, SBS3, SBS5, SBS8, SBS13, SBS17a, SBS17b, SBS18, ID1, ID2, ID3, ID5, ID6, ID8 and ID9; available at https://cancer.sanger.ac.uk/cosmic/signatures)². For samples with a cosine similarity below 0.90, a manual review was performed, and additional COSMIC signatures were incorporated for refitting to improve the accuracy of signature decomposition. The additional signatures included those associated with polymerase eta somatic hypermutation activity (SBS9), polymerase epsilon exonuclease domain mutations (SBS10a and SBS10b), platinum chemotherapy treatment (SBS31), defective DNA mismatch repair (SBS21 and SBS44), and activation-induced cytidine deaminase (SBS84), as well as signatures of unknown aetiology (SBS34, SBS41, ID4 and ID11).
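The review trigger can be sketched as a cosine-similarity check. This assumes that the similarity is computed between the observed spectrum and the spectrum reconstructed from the fitted exposures, which the text does not state explicitly; the function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def needs_manual_review(observed, W, exposures, threshold=0.90):
    """Flag a sample whose reconstructed spectrum W h fits the data poorly."""
    reconstructed = [sum(w * h for w, h in zip(row, exposures)) for row in W]
    return cosine_similarity(observed, reconstructed) < threshold
```

A flagged sample would then be refitted with the extended signature set described above.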

HRD and MSI

To assess HRD, we developed our proprietary algorithm by combining HRD-associated features, such as mutational signatures of point mutations, and copy number changes²⁷. These included single-base substitution signatures (SBS3 and SBS8; reference signatures are available at https://cancer.sanger.ac.uk/signatures/), an indel signature (ID6), genomic rearrangement signatures (RS3 and RS5), deletions accompanied by microhomology, and CNVs. Custom scripts scaled scores of the multi-dimensional features using coefficients derived from published algorithms to compute the final HRD probability scores. For a quantitative evaluation of somatic microsatellite alterations, we considered both the score from MSIsensor⁷⁰ and the proportion of microsatellite instability (MSI)-related mutational signatures (SBS6, SBS15, SBS20, SBS21, SBS26, SBS30 and SBS44)².

Mutation copy numbers and determination of pre- and post-amplification events

We estimated mutation copy number (n_mut) using the previously described formula⁷¹:

$${n}_{\mathrm{mut}}={f}_{s}\times \frac{1}{\rho }\times [\rho \times {n}_{\mathrm{locus}}^{t}+{n}_{\mathrm{locus}}^{n}\times (1-\rho )]$$

In this formula, f_s indicates the variant allele fraction and ρ indicates tumour cellularity; n^t_locus and n^n_locus are the absolute copy numbers in tumour and normal cells, respectively, which were derived from the following formula:

$${n}_{\mathrm{locus}}=2\times {\mathrm{RD}}_{\mathrm{locus}}/{\mathrm{RD}}_{\mathrm{auto}}$$

in which RD_locus indicates read depth of the locus of interest, and RD_auto indicates average haploid autosomal coverage that was obtained from paired normal WGS.

A mutation was classified as a pre-amplification event when n_mut was larger than the major copy number of the tumour multiplied by a factor of 0.75. If n_mut was smaller than this value but larger than 0.75, the mutation was assigned as a post-amplification or minor-allele mutation.
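The calculation and classification rule can be sketched as follows. The function names are hypothetical, the normal copy number defaults to 2 as an assumption for autosomal loci, and the "unclassified" label for mutations with n_mut of 0.75 or less is introduced here because the text does not name that category.

```python
def locus_copy_number(rd_locus, rd_auto):
    """Absolute copy number from read depth: 2 x RD_locus / RD_auto."""
    return 2.0 * rd_locus / rd_auto


def mutation_copy_number(vaf, purity, n_tumour, n_normal=2.0):
    """n_mut = f_s x (1/rho) x [rho x n_t + n_n x (1 - rho)]."""
    return vaf * (1.0 / purity) * (purity * n_tumour + n_normal * (1.0 - purity))


def classify_timing(n_mut, major_copy_number, factor=0.75):
    """Assign a mutation relative to the amplification of its segment."""
    if n_mut > factor * major_copy_number:
        return "pre-amplification"
    if n_mut > factor:
        return "post-amplification or minor allele"
    return "unclassified"  # label assumed here; not named in the text
```

For example, a mutation with a VAF of 0.7 in a tumour of 80% cellularity, at a locus with four copies in tumour cells, has n_mut ≈ 3.15; with a major copy number of 4 the threshold is 3.0, so the mutation is classified as pre-amplification.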

Molecular within-tumour timing of copy number gains

When a chromosomal segment is gained, it inherently co-amplifies all somatic mutations acquired within the segment up to that point in time, and thus the relative timing of genomic gains can be inferred by comparing frequencies of these co-amplified somatic mutations. To estimate relative event timing within individual tumours we utilized a relative timing measure, π, for every copy gain⁷²,⁷³, which represents the proportion of mutations per unit length of DNA that occurred before the event relative to the total number of mutations on the same DNA interval. Since mutations accumulate over time, this measure can be used to evaluate the timing of a copy number gain in a hypothetical molecular clock. To remove the effect of mutational processes that lead to a hypermutator phenotype (such as APOBEC mutagenesis, MSI and HRD), the number of mutations in each amplified segment was adjusted according to the proportion of SBS1 and SBS5, considering their clock-like nature⁷⁴,⁷⁵.
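As a worked illustration for the simplest case only, consider a single-copy gain that takes a segment from one copy of each allele to two copies of one allele and one of the other. Assuming a constant mutation rate per copy, mutations now present on two copies must have arisen on the gained allele before the gain (proportional to π), whereas single-copy mutations arose either on the other allele before the gain (π) or on any of the three copies afterwards (3(1 − π)). Solving for π gives π = 3·n₂ / (n₁ + 2·n₂), where n₂ and n₁ are the counts of two-copy and single-copy mutations. The sketch below encodes this derivation; it is not necessarily the estimator used by the authors, and it omits the SBS1/SBS5 adjustment.

```python
def gain_timing_pi(n_two_copies, n_one_copy):
    """Relative timing of a single-copy gain (1+1 to 2+1).

    n_two_copies: mutations carried on two copies (acquired before the gain).
    n_one_copy: mutations carried on one copy.
    Assumes a constant per-copy mutation rate; returns a value in [0, 1].
    """
    total = n_one_copy + 2 * n_two_copies
    if total == 0:
        raise ValueError("no mutations on the segment")
    pi = 3 * n_two_copies / total
    # Sampling noise can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, pi))
```

With 5 two-copy and 20 single-copy mutations, π = 0.5, placing the gain halfway along the molecular clock; a segment with no two-copy mutations gives π = 0, consistent with a very early gain.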

Breast cancer subtyping

We classified breast cancer samples into intrinsic subtypes using the prediction analysis of microarray 50 (PAM50) gene expression profiling method¹². After loading gene expression data and removing duplicates and missing values, subtyping was conducted with the molecular.subtyping function from the genefu package⁷⁶, applying the PAM50 model to classify samples into five categories: luminal A, luminal B, HER2-enriched, basal-like and normal-like.

To classify breast cancer samples into integrative clusters (IntClust) based on their copy number and expression profiles, we utilized the ic10 R package, a validated tool designed for the IntClust subtyping system. This classification method was originally derived from the METABRIC cohort and integrates both gene expression and segmented copy number variation data to define ten molecular subtypes of breast cancer⁵². The algorithm was run with default parameters, using log₂ copy number ratios as input.

Scoring focally amplified genomic regions

To construct the landscape of focal amplifications across the whole genome, we calculated the F-score for each 5-Mb window. For each sample, we identified the segment with the maximum copy number within each window and computed the ratio of this copy number to the segment’s width. We then summed these ratios across all samples to define the F-score. The F-scores were subsequently plotted across the entire genome.

$$\begin{array}{l}{\rm{F}} \mbox{-} {\mathrm{score}}_{\mathrm{window}}\\ \,=\,{\Sigma }_{\mathrm{all\; samples}}\max {({\mathrm{major\; copy\; number}}_{\mathrm{segment}}/{\mathrm{width}}_{\mathrm{segment}})}_{\mathrm{window}}\end{array}$$
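The F-score calculation can be sketched for a single chromosome as follows. This follows the prose description (take the highest-copy segment overlapping each window, then divide its copy number by its width). The segment tuple layout, the use of megabases for width and the function names are assumptions of this sketch.

```python
WINDOW_BP = 5_000_000  # 5-Mb windows, as in the text


def window_f_score(samples, window_start, window_end):
    """Sum over samples of (major copy number / segment width in Mb).

    samples: one list of segments per sample; each segment is a
    (start, end, major_copy_number) tuple with half-open bp coordinates.
    """
    score = 0.0
    for segments in samples:
        overlapping = [s for s in segments
                       if s[0] < window_end and s[1] > window_start]
        if not overlapping:
            continue
        # Segment with the maximum copy number within this window.
        start, end, major_cn = max(overlapping, key=lambda s: s[2])
        score += major_cn / ((end - start) / 1e6)
    return score


def chromosome_f_scores(samples, chrom_length):
    """F-score for every consecutive 5-Mb window of one chromosome."""
    return [window_f_score(samples, start, min(start + WINDOW_BP, chrom_length))
            for start in range(0, chrom_length, WINDOW_BP)]
```

Because the copy number is divided by the segment width, a narrow, high-level amplification contributes far more to a window's score than a broad, low-level gain, which is what makes the score sensitive to focal events.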

Survival analysis

All tumour samples used for survival analyses were collected prior to the initiation of targeted therapies, ensuring that the results were not influenced by treatment-induced genomic alterations. Survival analyses were conducted using the Kaplan–Meier method, and differences between groups were compared using the log-rank test.
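For readers unfamiliar with the estimator, the Kaplan–Meier product-limit calculation can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and input layout, not the code used for the analyses, and it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Kaplan–Meier survival estimates at each distinct event time.

    times: follow-up time for each patient.
    events: 1 (or True) if the event occurred, 0 (or False) if censored.
    Returns a list of (time, survival_probability) tuples.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve
```

For four patients with times 1, 2, 3 and 4, of whom the second and fourth are censored, the estimate drops to 0.75 at time 1 and to 0.375 at time 3; censored patients leave the risk set without lowering the curve.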

PFS was defined as the time from treatment initiation to disease progression or death from any cause, whichever occurred first. Overall survival (OS) was measured from the time of diagnosis to death from any cause. Disease-free survival was defined as the time from adjuvant treatment initiation to the first documented recurrence or death from any cause, whichever occurred first.

In all survival analyses, both breast cancer-related and unrelated deaths were treated as events.

To evaluate the association between specific variables and survival outcomes, a Cox proportional hazards regression model was applied, with multivariate models incorporating relevant clinical and molecular covariates to adjust for potential confounders.

Given the mixed prospective and retrospective nature of this study, immortality bias (left truncation bias) was carefully addressed: (1) consistent survival definitions: the same survival time definitions were applied uniformly to both retrospective and prospective cohorts; (2) statistical adjustment for cohort type: cohort type (prospective versus retrospective) was included as a covariate in Cox regression models to correct for potential bias; (3) sensitivity analyses: we conducted additional sensitivity analyses, assessing the impact of cohort selection on survival estimates to ensure the robustness of our findings; and (4) external validation: where possible, independent external cohorts were used to validate survival analyses and confirm reproducibility.

These measures were intended to minimize potential survival biases so that our findings reflect the prognostic and predictive significance of genomic alterations as accurately as possible (Supplementary Discussion 6).

Statistics and reproducibility

Statistical tests or methods are described in the figure legends. We used R (v.3.4.0) for all data processing and secondary computational analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Whole-genome and transcriptome sequencing data from this study are available in the European Genome–Phenome Archive (EGA; study ID EGAS50000001377). WGS CRAM files have been deposited under accession number EGAD50000001994, and RNA-seq data are available under accession number EGAD50000001993. Clinical information for the study patients is provided in Supplementary Table 12.

Code availability

In-house scripts for analyses are available on GitHub (https://github.com/RyulKim-Inocras/breast_cancer).

References

Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73, 17–48 (2023).
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
Mroz, E. A. & Rocco, J. W. MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral Oncol. 49, 211–215 (2013).
Jallah, J. K., Dweh, T. J., Anjankar, A. & Palma, O. A review of the advancements in targeted therapies for breast cancer. Cureus 15, e47847 (2023).
Horak, P., Fröhling, S. & Glimm, H. Integrating next-generation sequencing into clinical oncology: strategies, promises and pitfalls. ESMO Open 1, e000094 (2016).
Abbasi, A. & Alexandrov, L. B. Significance and limitations of the use of next-generation sequencing technologies for detecting mutational signatures. DNA Repair 107, 103200 (2021).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Sosinsky, A. et al. Insights for precision oncology from the integration of genomic and clinical data of 13,880 tumors from the 100,000 Genomes Cancer Programme. Nat. Med. 30, 279–289 (2024).
Priestley, P. et al. Pan-cancer whole-genome analyses of metastatic solid tumours. Nature 575, 210–216 (2019).
Kim, R. et al. Clinical application of whole-genome sequencing of solid tumors for precision oncology. Exp. Mol. Med. 56, 1856–1868 (2024).
Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27, 1160–1167 (2009).
Ahn, S. H. et al. Poor outcome of hormone receptor-positive breast cancer at very young age is due to tamoxifen resistance: nationwide survival data in Korea-a report from the Korean Breast Cancer Society. J. Clin. Oncol. 25, 2360–2368 (2007).
Min, S. Y. et al. The basic facts of Korean breast cancer in 2013: results of a nationwide survey and breast cancer registry database. J. Breast Cancer 19, 1–7 (2016).
Kan, Z. et al. Multi-omics profiling of younger Asian breast cancers reveals distinctive molecular signatures. Nat. Commun. 9, 1725 (2018).
Thomas, A. et al. Tumor mutational burden is a determinant of immune-mediated survival in breast cancer. Oncoimmunology 7, e1490854 (2018).
Martínez-Jiménez, F. et al. A compendium of mutational cancer driver genes. Nat. Rev. Cancer 20, 555–572 (2020).
Miller, D. H. et al. BCL11B drives human mammary stem cell self-renewal in vitro by inhibiting basal differentiation. Stem Cell Rep. 10, 1131–1145 (2018).
Przybylski, G. K., Przybylska, J. & Li, Y. Dual role of BCL11B in T-cell malignancies. Blood Sci. 6, e00204 (2024).
Nitz, M. D., Harding, M. A., Smith, S. C., Thomas, S. & Theodorescu, D. RREB1 transcription factor splice variants in urologic cancer. Am. J. Pathol. 179, 477–486 (2011).
Morgani, S. M., Su, J., Nichols, J., Massagué, J. & Hadjantonakis, A.-K. The transcription factor Rreb1 regulates epithelial architecture, invasiveness, and vasculogenesis in early mouse embryos. eLife 10, e64811 (2021).
Kent, O. A. et al. Repression of the miR-143/145 cluster by oncogenic Ras initiates a tumor-promoting feed-forward pathway. Genes Dev. 24, 2754–2759 (2010).
Rocca, A., Braga, L., Volpe, M. C., Maiocchi, S. & Generali, D. The predictive and prognostic role of RAS–RAF–MEK–ERK pathway alterations in breast cancer: revision of the literature and comparison with the analysis of cancer genomic datasets. Cancers 14, 5306 (2022).
Liu, D. et al. SPECC1 as a pan-cancer biomarker: unraveling its role in drug sensitivity and resistance mechanisms. Discov. Oncol. 15, 552 (2024).
Marvalim, C., Datta, A. & Lee, S. C. Role of p53 in breast cancer progression: An insight into p53 targeted therapy. Theranostics 13, 1421–1442 (2023).
Pereira, B. et al. The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes. Nat. Commun. 7, 11479 (2016).
Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat. Med. 23, 517–525 (2017).
Sharma, P. et al. Impact of homologous recombination deficiency biomarkers on outcomes in patients with triple-negative breast cancer treated with adjuvant doxorubicin and cyclophosphamide (SWOG S9313). Ann. Oncol. 29, 654–660 (2018).
Nik-Zainal, S. et al. Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer. Nat. Genet. 46, 487–491 (2014).
Long, J. et al. A common deletion in the APOBEC3 genes and breast cancer risk. J. Natl Cancer Inst. 105, 573–579 (2013).
Xuan, D. et al. APOBEC3 deletion polymorphism is associated with breast cancer risk among women of European ancestry. Carcinogenesis 34, 2240–2243 (2013).
Li, Y. et al. Patterns of somatic structural variation in human cancer genomes. Nature 578, 112–121 (2020).
Yang, Y., Lu, T., Li, Z. & Lu, S. FGFR1 regulates proliferation and metastasis by targeting CCND1 in FGFR1 amplified lung cancer. Cell Adh. Migr. 14, 82–95 (2020).
Aleksakhina, S. N. et al. CCND1 and FGFR1 gene amplifications are associated with reduced benefit from aromatase inhibitors in metastatic breast cancer. Clin. Transl. Oncol. 23, 874–881 (2021).
Jiang, Y. et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res. 47, D235–D243 (2019).
Dubois, F., Sidiropoulos, N., Weischenfeldt, J. & Beroukhim, R. Structural variations in cancer and the 3D genome. Nat. Rev. Cancer 22, 533–546 (2022).
Loo, S. K. et al. Fusion-associated carcinomas of the breast: diagnostic, prognostic, and therapeutic significance. Genes Chromosomes Cancer 61, 261–273 (2022).
Lei, J. T., Gou, X. & Ellis, M. J. ESR1 fusions drive endocrine therapy resistance and metastasis in breast cancer. Mol. Cell. Oncol. 5, e1526005 (2018).
Lee, S. et al. Landscape analysis of adjacent gene rearrangements reveals BCL2L14–ETV6 gene fusions in more aggressive triple-negative breast cancer. Proc. Natl Acad. Sci. USA 117, 9912–9921 (2020).
Howarth, K. D. et al. NRG1 fusions in breast cancer. Breast Cancer Res. 23, 3 (2021).
Jonna, S. et al. Detection of NRG1 gene fusions in solid tumors. Clin. Cancer Res. 25, 4966–4972 (2019).
Deshpande, V. et al. Exploring the landscape of focal amplifications in cancer using AmpliconArchitect. Nat. Commun. 10, 392 (2019).
Schwab, M. Amplification of oncogenes in human cancer cells. Bioessays 20, 473–479 (1998).
Turner, K. M. et al. Extrachromosomal oncogene amplification drives tumour evolution and genetic heterogeneity. Nature 543, 122–125 (2017).
Wolff, A. C. et al. Human epidermal growth factor receptor 2 testing in breast cancer: American Society of Clinical Oncology/College of American Pathologists clinical practice guideline focused update. J. Clin. Oncol. 36, 2105–2122 (2018).
Sammut, S.-J. et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2022).
Bailey, C. et al. Origins and impact of extrachromosomal DNA. Nature 635, 193–200 (2024).
Cortés-Ciriano, I. et al. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing. Nat. Genet. 52, 331–341 (2020).
Lee, J. J.-K. et al. Tracing oncogene rearrangements in the mutational history of lung adenocarcinoma. Cell 177, 1842–1857.e21 (2019).
Nishimura, T. et al. Evolutionary histories of breast cancer and related clones. Nature 620, 607–614 (2023).
Xu, J., Chen, Y. & Olopade, O. I. MYC and breast cancer. Genes Cancer 1, 629–640 (2010).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Singh, D. K. et al. PSIP1/p75 promotes tumorigenicity in breast cancer cells by promoting the transcription of cell cycle genes. Carcinogenesis 38, 966–975 (2017).
Im, S. A. et al. Overall survival with ribociclib plus endocrine therapy in breast cancer. N. Engl. J. Med. 381, 307–316 (2019).
Hodder, A. et al. Benefits for children with suspected cancer from routine whole-genome sequencing. Nat. Med. 30, 1905–1912 (2024).
Lau, L. M. S. et al. Precision-guided treatment in high-risk pediatric cancers. Nat. Med. 30, 1913–1922 (2024).
Rye, I. H. et al. Intratumor heterogeneity defines treatment-resistant HER2+ breast tumors. Mol. Oncol. 12, 1838–1855 (2018).
Lee, S. et al. Target-enhanced whole-genome sequencing shows clinical validity equivalent to commercially available targeted oncology panel. Cancer Res. Treat. 57, 350–361 (2025).
Jung, Y. & Han, D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 38, 2404–2413 (2022).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219 (2013).
Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
Favero, F. et al. Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data. Ann. Oncol. 26, 64–70 (2015).
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12, R41 (2011).
McLaren, W. et al. The ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Tate, J. G. et al. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., Campbell, P. J. & Stratton, M. R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3, 246–259 (2013).
Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 8866 (2015).
Niu, B. et al. MSIsensor: microsatellite instability detection using paired tumor-normal sequence data. Bioinformatics 30, 1015–1016 (2014).
Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
Purdom, E. et al. Methods and challenges in timing chromosomal abnormalities within cancer samples. Bioinformatics 29, 3113–3120 (2013).
Leshchiner, I. et al. Comprehensive analysis of tumour initiation, spatial and temporal progression under multiple lines of treatment. Preprint at bioRxiv https://doi.org/10.1101/508127 (2018).
Alexandrov, L. B. et al. Clock-like mutational processes in human somatic cells. Nat. Genet. 47, 1402–1407 (2015).
Blokzijl, F. et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature 538, 260–264 (2016).
Gendoo, D. M. A. et al. Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics 32, 1097–1099 (2016).

Acknowledgements

This study was conducted with internal funding from Inocras for collaboration with Samsung Medical Center and Saint Mary’s Hospital, with partial financial support from Bayer, Korea (KCT0006142 to J. Lee). We acknowledge the Korea Research Environment Open Network (KREONET) service and the usage of the Global Science Experimental Data Hub Center (GSDC) provided by Korea Institute of Science and Technology Information (KISTI). We thank J. Lee for invaluable guidance and insightful advice throughout the course of this research.

Author information

These authors contributed equally: Ryul Kim, Jonghan Yu

Authors and Affiliations

Inocras, San Diego, CA, USA
Ryul Kim, Joonoh Lim, Brian Baek-Lok Oh, Erin Connolly-Strong, Sangmoon Lee, Bo Rahm Lee, Yuna Lee, Ki Jong Yi, Young Oh Kwon, In Hwan Chun, Junggil Park, Jihye Kim, Chahyun Choi, Jong Yeon Shin, Hyungjung Lee, Minji Kim, Hansol Park, Ilecheon Jeong, Boram Yi, Won-Chul Lee, Jeong Seok Lee & Young Seok Ju
Department of Surgery, Samsung Medical Center, Sungkyunkwan University College of Medicine, Seoul, Republic of Korea
Jonghan Yu, Seok Jin Nam, Seok Won Kim, Jeong Eon Lee & Byung Joo Chae
Division of Hematology-Oncology, Department of Medicine, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
Ji-Yeon Kim & Yeon Hee Park
Department of Radiology, Seoul St Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
Ga Eun Park, Bong Joo Kang, Sung Hun Kim & Jeongmin Lee
Division of Breast Surgery, Department of Surgery, Bucheon St Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
Pill Sun Paik
Division of Breast Surgery, Department of Surgery, Seoul St Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
Soo Yeon Bae, Chang Ik Yoon, Young Joo Lee, Dooreh Kim & Woo Chan Park
Division of Medical Oncology, Department of Internal Medicine, Seoul St Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
Kabsoo Shin & Ji Eun Lee
Department of Hospital Pathology, Seoul St Mary’s Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
Jun Kang & Ahwon Lee
Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
Jeong Seok Lee & Young Seok Ju
Department of Pathology and Translational Genomics, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
Yoon-La Choi
Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
Jeongmin Lee

Contributions

Y.S.J., J. Lee and Y.H.P. conceived the study. J.Y., S.J.N., S.W.K., Jeong Eon Lee, B.J.C., J.-Y.K., G.E.P., B.J.K., P.S.P., S.Y.B., C.I.Y., Y.J.L., D.K., K.S., Ji Eun Lee, J. Kang, A.L., S.H.K., Y.-L.C., W.C.P. and J. Lee collected breast cancer samples. J. Kim, C.C. and J.Y.S. conducted genome sequencing. R.K. performed most genome and statistical analyses with contributions from J.Y., J. Lim, B.B.-L.O. and Y.S.J. H.L., M.K. and B.B.-L.O. collected clinical histories and contributed to data management with support from E.C.-S. and S.L. H.P., I.J., B.Y., W.-C.L., K.J.Y., Y.O.K., I.H.C. and J.P. developed the bioinformatics pipeline and conducted variant calling. B.R.L. and Y.L. participated in whole-genome interpretation and data curation. J.S.L. contributed to data integration and project coordination. R.K., J. Lee, Y.S.J. and Y.H.P. wrote the manuscript with input from all authors. Y.S.J. supervised the overall study.

Corresponding authors

Correspondence to Jeongmin Lee, Young Seok Ju or Yeon Hee Park.

Ethics declarations

Competing interests

Y.S.J. and J.S.L. are co-founders of Inocras, a San Diego-based precision medicine company. Y.H.P. has received grants from MSD, AstraZeneca, Pfizer, Gencurix, Roche, Inocras and Novartis, and consulting fees from AstraZeneca, MSD, Pfizer, Eisai, Lilly, Roche, Gilead, Daiichi-Sankyo, Menarini, Everest and Novartis. R.K., J. Lim, B.B.-L.O., E.C.-S., S.L., B.R.L., Y.L., K.J.Y., Y.O.K., I.H.C., J.P., J. Kim, C.C., J.Y.S., H.L., M.K., H.P., I.J., B.Y. and W.-C.L. are employees of Inocras. The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Comprehensive overview of the genomic and molecular characteristics of 1,364 breast cancer samples.

Columns are ordered to visualize mutual exclusivity between alterations across samples, highlighting potential co-occurrence and exclusivity patterns among key genomic events. Rows represent driver genes identified by seven different driver gene detection algorithms: SMRegions, OncodriveFML, OncodriveCLUSTL, MutPanning, HotMAPS, dNdScv, and CBaSE, providing a robust and integrative approach to detecting candidate driver mutations. Abbreviations: TMB, tumor mutational burden; ER, estrogen receptor; PR, progesterone receptor; HRD, homologous recombination deficiency; MATH, mutant-allele tumor heterogeneity; SBS, single base substitution; ID, indel; SV, structural variation.

Extended Data Fig. 2 Rare but putatively novel driver genes identified in this study.

Each lollipop plot represents the distribution of mutations along the protein-coding sequence of each gene: (a) BCL11B, (b) RREB1, (c) RAF1, and (d) SPECC1. The x-axis corresponds to the amino acid position, while the y-axis indicates the count of samples in which a given mutation was identified. Circles denote mutation types and their frequency (number of samples in which the mutation was observed). Colored rectangles on the coding sequence represent distinct functional domains of each protein.

Extended Data Fig. 3 Genomic instability associated with TP53 mutation.

a–c, Distribution of homologous recombination deficient (HRD) and proficient (HRP) tumors (a), ploidy (b), and mutant allele tumor heterogeneity (MATH) scores (c) according to TP53 mutation status (TP53^wt, wild-type; TP53^mut, mutant). P-values were calculated using a two-sided chi-square test (a) and two-sided Student’s t-test (b, c). d–f, Kaplan–Meier survival curves stratified by TP53 mutation status and MATH score in the CUBRICS cohort with multivariate Cox regression analysis (d,e), and in the METABRIC cohort (f). In CUBRICS, MATH^high was defined as MATH ≥ 40 and MATH^low as MATH < 40, whereas in METABRIC, MATH^high and MATH^low were defined by the upper and lower quartiles, respectively. Both cohorts showed consistent associations between MATH, TP53 mutation status, and overall survival. Error bars in panel e represent 95% confidence intervals, with the centre defined as the hazard ratio. Notes: Box plots show median (line), first and third quartiles (box edges), and 1.5× the interquartile range (whiskers). In all panels, “n” indicates the number of patients included in the analysis.
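The MATH grouping described in this caption can be illustrated with a short sketch. The caption does not state how the score is computed; the code assumes the commonly used definition of MATH (100 × the scaled median absolute deviation of variant allele fractions, divided by their median). The function names `math_score` and `math_group` are hypothetical, and only the cutoff of 40 is taken from the caption.

```python
from statistics import median


def math_score(vafs):
    """Mutant-allele tumor heterogeneity score.

    Assumed definition (not stated in the caption):
    MATH = 100 * MAD / median, where MAD is the median absolute
    deviation of the variant allele fractions scaled by 1.4826.
    """
    if not vafs:
        raise ValueError("at least one variant allele fraction is required")
    med = median(vafs)
    if med <= 0:
        raise ValueError("median variant allele fraction must be positive")
    mad = 1.4826 * median([abs(v - med) for v in vafs])
    return 100.0 * mad / med


def math_group(score, cutoff=40.0):
    """Apply the CUBRICS cutoff from the caption: high if MATH >= 40."""
    return "MATH_high" if score >= cutoff else "MATH_low"


if __name__ == "__main__":
    example_vafs = [0.1, 0.2, 0.3, 0.4, 0.5]  # made-up values for illustration
    score = math_score(example_vafs)
    print(round(score, 2), math_group(score))
```

The METABRIC grouping in the same caption uses upper and lower quartiles rather than a fixed cutoff, so `math_group` as written applies only to the CUBRICS definition.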

Extended Data Fig. 4 Impact of APOBEC3A/3B germline deletion on mutational signatures and APOBEC gene expression.

a, Detection of APOBEC3A/3B germline deletion. Germline deletion status was determined using the depth ratio method, where R1 (d₀/d₁) represents the ratio of sequencing depth between the upstream 30 kbp region and the deletion region, and R2 (d₀/d₂) represents the ratio between the downstream 30 kbp region and the deletion region. b, Distribution of R1 and R2 values. The density plot shows distinct clustering of samples based on APOBEC3A/3B germline deletion status (wild-type, heterozygous deletion, and homozygous deletion), confirming that these metrics effectively differentiate deletion groups. c, RNA expression differences in APOBEC family genes by germline deletion status. Violin plots display the transcriptional impact of APOBEC3A/3B germline deletion on APOBEC gene expression levels, highlighting significant differences where applicable. Sample sizes: homozygous deletion (n = 121), heterozygous deletion (n = 539), wild-type (n = 549). d, Differences in mutational signatures by germline deletion status. Violin plots illustrate the number of mutations assigned to each COSMIC mutational signature across samples stratified by APOBEC3A/3B germline deletion status. Sample sizes: deletion (n = 736), wild-type (n = 628). Note: P-values were calculated using a two-sided Student’s t-test without adjustment for multiple comparisons. Box plots indicate median (middle line), first and third quartiles (edges).
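The depth ratio method in panel a can be sketched as follows. Because d₀ appears in both ratios, the sketch reads d₀ as the mean depth over the deletion region and d₁ and d₂ as the mean depths over the upstream and downstream 30 kbp flanks; this reading is an assumption. The caption reports no numeric cutoffs, so the thresholds 0.25 and 0.75 and the function names are purely illustrative.

```python
def depth_ratios(d0, d1, d2):
    """Return (R1, R2) = (d0/d1, d0/d2).

    Assumed reading of the caption: d0 is the mean depth over the
    deletion region, d1 the upstream flank, d2 the downstream flank.
    """
    if d1 <= 0 or d2 <= 0:
        raise ValueError("flanking depths must be positive")
    return d0 / d1, d0 / d2


def call_deletion(r1, r2, hom_max=0.25, het_max=0.75):
    """Classify germline deletion status from the two depth ratios.

    hom_max and het_max are hypothetical thresholds, not values from
    the paper; ratios near 0, 0.5 and 1 are expected for homozygous
    deletion, heterozygous deletion and wild-type, respectively.
    """
    mean_ratio = (r1 + r2) / 2.0
    if mean_ratio < hom_max:
        return "homozygous deletion"
    if mean_ratio < het_max:
        return "heterozygous deletion"
    return "wild-type"


if __name__ == "__main__":
    r1, r2 = depth_ratios(d0=15.0, d1=30.0, d2=30.0)  # made-up depths
    print(r1, r2, call_deletion(r1, r2))
```

In practice the cutoffs would be chosen from the clusters visible in the density plot of panel b rather than fixed in advance.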

Extended Data Fig. 5 Transcriptomic impact of structural variations spanning chr8:35-40 Mb and chr11:65-70 Mb in luminal B breast cancer.

a, Volcano plot displaying differentially expressed genes between luminal B breast cancer cases with structural variations (SVs) spanning chr8:35-40 Mb and chr11:65-70 Mb (right) and those without these SVs (left). The x-axis represents log₂ fold-change in gene expression, while the y-axis represents the -log₁₀(q-value). b, Gene Set Enrichment Analysis results showing pathways enriched in luminal B breast cancer cases with these SVs. Positive normalized enrichment scores (NES) indicate upregulated pathways, including TNF-α signaling via NF-κB, TGF-β signaling, and epithelial-mesenchymal transition, which are known to contribute to tumor progression and metastasis. Conversely, downregulated pathways include oxidative phosphorylation, glycolysis, and fatty acid metabolism, suggesting a metabolic shift in tumors harboring these SVs.

Extended Data Fig. 6 Structural characterization of recurrent fusions identified in our cohort.

Illustrated are fusions involving TTC6/MIPOL1, BCL2L14/ETV6, PRKCA/CEP112, ESR1/CCDC170, AGO2/PTK2, GALNT17/AUTS2, and BRD4/NOTCH3.

Extended Data Fig. 7 Structural characterization of recurrent fusions identified in our cohort.

Illustrated are fusions involving TRAPPC9/PTK2, UHRF1BP1L/ANKS1B, FBXL20/IKZF3, PRKCA/RGS9, DLG2/TENM4, IL34/SF3B3, AGO2/TRAPPC9, IMMP2L/DOCK4, SLC39A11/SDK2, KAT6A/ANK1, and IKZF3/ERBB2.

Extended Data Fig. 8 Structural variations (SVs) affecting COSMIC cancer gene census genes.

The left panel shows the number of cases with SVs in each gene, categorized by PAM50 molecular subtype. The right panel displays the percentage distribution of different SV types. The data highlight genes recurrently affected by SVs in breast cancer, with tumor suppressor genes (e.g., PTEN, RB1, and RUNX1) and oncogenes (e.g., NRG1, ERBB4, and ESR1) prominently impacted. These results suggest that SVs may contribute to the dysregulation of key cancer-related genes across different breast cancer subtypes.

Extended Data Fig. 9 Focal amplification of key oncogenes and predictive role of ERBB2 focal amplification in neoadjuvant anti-HER2 therapy response.

a, Segment length (Mbp) and copy number (CN) gain of ERBB2, CCND1, ZNF703, and FGFR1, classified as focal amplification (≤3 Mbp, relative CN gain >3 compared to surrounding regions, absolute CN ≥ 7), broad amplification (CN gain ≥1 without focal), or none. Vertical and horizontal lines mark thresholds for focal amplification. b, Relationship between log₁₀(CN) and log₁₀(RNA expression; TPM) for the four oncogenes. Density shading (red = highest) and yellow dots (individual tumors) shown. c, Correlation matrix of RNA expression: lower = density plots, diagonal = TPM distribution, upper = Pearson correlation with significance (*P < 0.05, **P < 0.01, ***P < 0.001). d, Genome-wide structural plot (Yilong plot) showing focal ERBB2 amplification (CN = 68) in a HER2 IHC 0 tumor; structural analysis confirms extrachromosomal DNA. e, HER2 IHC 3+ case without focal ERBB2 amplification, with ERBB2 TPM of 43.5, lower than typical IHC 3+ tumors. f–h, Association of ERBB2 focal amplification with pathologic complete response (pCR) to neoadjuvant trastuzumab + pertuzumab (TransNEO cohort). f, Amplicon width vs. major allele CN in 168 patients. g, CN profiles in eight treated patients: pCR cases (brown) show higher focal amplification; non-pCR cases (grey) lack it. h, Sankey plot: all without focal amplification failed pCR; 75% (3/4) with focal amplification responded. i–k, ERBB2 CN as predictor of pCR in 75 HER2-positive breast cancers treated with neoadjuvant TCHP. i, ROC curve (AUC = 0.819). j, CN distribution in responders vs. non-responders. k, Sankey plot showing pCR vs. non-pCR outcomes stratified by ERBB2 copy number (≥33) and chromothripsis status in the CUBRICS cohort. “n” indicates the number of patients in each analysis.
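The amplification classes defined in panel a can be written as a small decision rule. The three focal thresholds (segment ≤3 Mbp, relative CN gain >3, absolute CN ≥7) come from the caption. The caption does not say what the broad-amplification gain of ≥1 is measured against, so the sketch assumes a baseline copy number supplied by the caller (default 2); that baseline, the function name and the argument names are hypothetical.

```python
def classify_amplification(length_mbp, cn, flank_cn, baseline_cn=2.0):
    """Classify a segment as 'focal', 'broad' or 'none'.

    length_mbp  : segment length in Mbp
    cn          : absolute copy number of the segment
    flank_cn    : copy number of the surrounding regions
    baseline_cn : reference copy number for the broad-gain rule
                  (assumption; not specified in the caption)
    """
    relative_gain = cn - flank_cn
    is_focal = length_mbp <= 3.0 and relative_gain > 3.0 and cn >= 7.0
    if is_focal:
        return "focal"
    if cn - baseline_cn >= 1.0:
        return "broad"
    return "none"


if __name__ == "__main__":
    # Made-up segments for illustration only.
    print(classify_amplification(1.2, 20.0, 4.0))   # short, high-level gain
    print(classify_amplification(25.0, 5.0, 5.0))   # long, modest gain
    print(classify_amplification(10.0, 2.0, 2.0))   # no gain
```

Checking the focal rule first mirrors the caption's wording that broad amplification is a gain "without focal".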

Extended Data Fig. 10 Long-segmental copy number changes in breast cancer.

Relative molecular timing of recurrent copy number amplifications (CNAs). From top to bottom: number of patients harboring each CNA; percentage of PAM50; integrative clusters (IntClust) molecular subtypes among samples with the respective CNA; box plots indicating median (middle line), first and third quartiles (edges) and 1.5x the interquartile range (whiskers); hazard ratio estimates with 95% confidence intervals; violin plots displaying the distribution of relative CNA timing (black dots represent the mean timing); and the G-score from GISTIC algorithm.

Extended Data Fig. 11 Prognostic impact of 9p23 amplification and copy number amplification timing in breast cancer.

a,b, Impact of 9p23 amplification on overall survival (OS) in TNBC within the METABRIC cohort. a, Kaplan-Meier survival curves comparing OS between patients with (n = 102) and without (n = 228) 9p23 amplification, demonstrating a poorer prognosis in those with amplification. b, Multivariate Cox regression analysis for OS, displaying hazard ratios and 95% confidence intervals (CI) for 9p23 amplification, tumor size (>50 mm vs. ≤50 mm), and lymph node metastasis status. A two-sided Wald test was performed without adjustment for multiple comparisons. Error bars indicate 95% confidence intervals, with the centre defined as the hazard ratio. c, Differential gene expression analysis based on 9p23 amplification status in basal-like breast cancer. Significantly upregulated genes in 9p23-amplified samples are shown in red, while those upregulated in non-amplified samples are shown in yellow. The x-axis represents log₂ fold-change in expression, and the y-axis shows the -log₁₀(q-value), indicating statistical significance. d, Copy number amplification timing in homologous recombination-proficient (HRP) and homologous recombination-deficient (HRD) samples. Each row represents a sample, and the x-axis represents the relative timing of amplification events. The color intensity indicates the number of amplification segments at a given time point, with darker shades representing a higher number of amplification events. e, Distribution of amplification duration in HRP and HRD samples. The violin plot compares the duration of amplification events, defined as the difference between the earliest and latest amplification times within a sample. Box plots indicate median (middle line), first and third quartiles (edges). The p-value was estimated by a two-sided Student’s t-test. Note: In all panels, “n” refers to the number of patients included in the analysis.
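The amplification duration used in panel e is defined in the caption as the difference between the earliest and latest amplification times within a sample. A minimal sketch of that definition follows; the function name, the sample identifiers and the timing values are hypothetical and serve only to show the calculation.

```python
def amplification_duration(timings):
    """Latest minus earliest relative amplification time in one sample."""
    if not timings:
        raise ValueError("sample has no timed amplification segments")
    return max(timings) - min(timings)


if __name__ == "__main__":
    # Hypothetical relative timings per sample (0 = earliest, 1 = latest).
    samples = {
        "sample_A": [0.20, 0.35, 0.80],
        "sample_B": [0.50, 0.55],
    }
    for name, timings in samples.items():
        print(name, round(amplification_duration(timings), 3))
```

A sample with a single timed segment has a duration of zero under this definition.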

Extended Data Fig. 12 Gene set enrichment analysis (GSEA) of RNA expression data in homologous recombination-deficient (HRD) vs. homologous recombination-proficient (HRP) hormone receptor-positive breast cancer patients treated with CDK4/6 inhibitors as first-line palliative treatment.

The x-axis represents the normalized enrichment score (NES), indicating whether a pathway is upregulated (positive NES) or downregulated (negative NES) in HRD patients (n = 13) compared to HRP patients (n = 44). The y-axis lists the significantly enriched pathways. The size of each dot corresponds to -log₁₀(q-value), with larger dots indicating stronger statistical significance. Pathways on the right (positive NES) are enriched in HRD tumors, whereas pathways on the left (negative NES) are enriched in HRP tumors.

Supplementary information

Supplementary Information

Supplementary Methods, Supplementary Discussion 1–6, Supplementary Figs. 1–9, a guide to Supplementary Tables 1–12 and references.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–12

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kim, R., Yu, J., Lim, J. et al. Whole-genome landscapes of 1,364 breast cancers. Nature 649, 1282–1291 (2026). https://doi.org/10.1038/s41586-025-09812-3

Received: 15 September 2024
Accepted: 27 October 2025
Published: 03 December 2025
Version of record: 03 December 2025
Issue date: 29 January 2026
DOI: https://doi.org/10.1038/s41586-025-09812-3