Main

Head and neck cancer (HNC), including malignancies affecting the mouth, pharynx and larynx, represents ~4% of the global cancer burden, with an annual incidence of about 750,000 new cases1. The incidence rate of HNC varies between countries, largely reflecting the distribution of its main risk factors, including tobacco smoking, alcohol consumption1,2 and infection with high-risk strains of human papillomavirus (HPV) for oropharynx cancer3,4,5. Other proposed risk factors include consumption of hot beverages, obesity and poor oral health, although evidence for their role in HNC is limited6,7,8. In addition, a substantial proportion of HNCs (about 42% for women and 26% for men) cannot be attributed to known lifestyle habits or exposures9.

Epidemiological studies in Europe and America suggest that seven out of ten HNC cases are caused by preventable behavioral risk factors, with tobacco use, either alone or in combination with alcohol, accounting for most cases9. Conversely, alcohol use on its own is responsible for only ~4% of the disease burden, suggesting a limited independent effect. This raises the question of whether alcohol acts as an independent carcinogen or simply enhances the known carcinogenic effect of tobacco. Furthermore, the susceptibility to these exposures varies depending on the anatomical region, with smoking posing a higher risk for developing larynx cancer and the risk associated with alcohol being greater for other subsites10.

Considering the dominant role of tobacco in HNC development, risk differences across subsites and potential interactions with other risk factors, HNC offers a particularly interesting opportunity to investigate the effects of tobacco exposure. In this context, the analysis of mutational signatures is an effective tool to track the complex mutagenic patterns linked to this and other exposures over a patient’s lifetime11,12,13. Certain mutational signatures have been related to well-established biological mechanisms and exposures. Signatures SBS4, found predominantly in lung cancer, and SBS92, in bladder cancer, capture two distinct mutagenic processes linked to tobacco use12,14,15. Conversely, signature SBS16 has been attributed to alcohol consumption in esophageal and liver cancer13,16.

Previous studies exploring the genomic landscape of HNC have relied predominantly on exome sequencing data, which have limited power to detect mutational signatures, lacked a diverse geographical and ethnic representation of cases and/or were limited to specific anatomical subsites17,18,19,20. Therefore, the carcinogenic mechanisms underpinning this cancer type in different geographical regions and anatomical subsites remain unclear. To bridge this gap, we performed whole-genome sequencing of 265 HNC samples from individuals exposed to known and suspected risk factors across eight countries with varying incidence rates. By leveraging mutational signature analysis combined with extensive epidemiological data, we shed light on the complexity of tobacco-induced mutagenesis and its interplay with alcohol consumption and other HNC risk factors.

Results

Case-series overview and multicountry study design

A total of 265 HNC cases were included in the study, comprising retrospective collections from eight countries in Europe and South America6,21 (Fig. 1 and Supplementary Table 1). These provide a broad geographic representation of HNC, including cases from high-incidence regions, with sex-combined age-standardized rates (ASRs) ranging from 9.4 to 18.2 per 100,000 in Romania, Slovakia, Czech Republic and Brazil, as well as moderate-incidence regions, with ASRs from 3.8 to 7.8 per 100,000 in Colombia, Argentina, Greece and Italy1. The study population encompasses diverse ethnic backgrounds, including European, Latin American, African and East Asian descent (Supplementary Table 2 and Supplementary Fig. 1). The dataset contains cases from all HNC anatomical subsites, with 127 oral cavity, 46 oropharynx, 17 hypopharynx and 75 larynx cancers. Epidemiological questionnaire data were available on exposure to known and suspected HNC risk factors, with both exposed and nonexposed cases represented for alcohol drinking and tobacco smoking (Supplementary Table 3). DNA was extracted from paired tumor and blood samples and whole-genome sequenced to an average coverage of 55-fold and 27-fold, respectively.

Fig. 1: HNC incidence and epidemiological characteristics.

a, Incidence of HNC, sex-combined, ASRs per 100,000, data from GLOBOCAN 2022. Dots indicate countries included in this study and number of participating patients. Panel a adapted from ref. 1, © International Agency for Research on Cancer. Data version: GLOBOCAN 2022-08.02.2024. b, Anatomical subsites of HNC, with number of tumor samples indicated in brackets. Panel b created using BioRender.com. c, Known and suspected risk factors included in the study, based on epidemiological questionnaire data and HPV detection. Frequencies of risk factors in the complete dataset (left) and by anatomical subsite (right) are indicated. OC, oral cavity; OPC, oropharynx; HPX, hypopharynx; LYX, larynx.

Mutation burden

Among the 265 HNC cases, we observed a median of 12,887 single-base substitutions (SBSs; range = 720–244,026), 63 doublet-base substitutions (DBSs; range = 2–7,113) and 757 small insertions and deletions (indels; range = 124–9,898) (Supplementary Table 4). Tumor samples from tobacco users exhibited higher SBS, DBS and indel burdens compared to nonsmokers (Extended Data Fig. 1b and Supplementary Table 5), as previously reported for larynx cancer14. Differences were also found between anatomical subsites, with larynx samples presenting higher mutation burdens, even after correcting for tobacco status (Extended Data Fig. 1a and Supplementary Table 5). No significant differences were found between geographical regions or ancestry profiles (Extended Data Fig. 1c and Supplementary Table 5).

Mutational signatures of exogenous and endogenous exposures

To investigate the mutational processes and carcinogenic exposures that have been operative in HNC development, we extracted SBS, DBS and indel signatures and estimated the contribution of each signature to every sample. We obtained 15 de novo SBS signatures, which were decomposed into 18 reference signatures from the Catalog of Somatic Mutations in Cancer (COSMIC v3.2) database, and two signatures that could not be decomposed into any combination of existing signatures, SBS_I and SBS_L (Fig. 2a,b, Extended Data Fig. 2, Supplementary Tables 6 and 9–11 and Supplementary Note).

Fig. 2: Mutational signature landscape of HNC.

a, SBS, DBS and indel signatures extracted in 265 HNC tumors. The size of each dot represents the proportion of samples presenting each mutational signature in the whole HNC dataset and across anatomical subsites. The color represents the mean relative attribution of each signature. Dots filled in white indicate signatures without significantly different relative burdens across subsites. Significance was assessed using a two-sided Kruskal–Wallis test and Bonferroni correction. Top, the mutations per megabase attributed to each signature in samples with counts higher than zero. b, Mutational spectrum of undecomposed signatures extracted from HNC. c, Known SBS signatures of tobacco exposure identified in the HNC dataset. ROS, reactive oxygen species; HR, homologous recombination; DSB, double-strand break.

Among the identified signatures, several have been previously associated with exogenous mutational processes12. The tobacco-related signatures SBS4 and SBS92 were found in 33.6% and 7.6% of HNC samples, respectively, and accounted for 6.3% and 3.5% of the mutational burden on average across all HNC cases. SBS16, attributed to alcohol consumption13,16, was present in 19.2% of the samples, with a modest average contribution of 1.4% to the HNC mutation burden. Signatures SBS7a and SBS7b, related to ultraviolet (UV) light exposure, co-occurred in 4.2% of cases.

We also identified signatures associated with endogenous exposures and aberrant cellular processes. Notably, SBS2 and SBS13, which result from cytosine deamination by apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like (APOBEC)12, were present in the majority of HNC cases (92.8% and 91.7%, respectively) (Fig. 2a) and were highly correlated (Supplementary Fig. 2). Combined, these signatures accounted for an average of 20.4% of the total SBS mutation burden. Other prevalent signatures included SBS18, which is caused by reactive oxygen species (77.4% of samples), and clock-like signatures SBS1 (78.1%) and SBS5 (54.7%) (Fig. 2a).

Extraction of DBS signatures identified four de novo signatures, which decomposed into four COSMIC reference signatures (DBS1, DBS2, DBS4 and DBS6) and one nondecomposed signature (DBS_D) (Fig. 2a,b, Extended Data Fig. 3a and Supplementary Tables 7 and 9–11). We also extracted seven de novo indel signatures, all of which were decomposed into 12 COSMIC signatures (Fig. 2a,b, Extended Data Fig. 3b and Supplementary Tables 8–11). DBS and indel signatures of exogenous exposures were positively correlated with their SBS counterparts (Supplementary Fig. 2). For instance, the known tobacco-related signatures DBS2 (59.2% of samples) and ID3 (41.4%), along with DBS6, which has been previously registered as of unknown etiology, correlated with both SBS4 and SBS92. These associations are consistent with the SBS, DBS and indel signatures being generated by the same underlying mutational process. Similarly, ID11 (38.1%), which was associated with alcohol consumption in esophageal cancer13, exhibited a positive correlation with the alcohol signature SBS16, while UV-related DBS1 (16.6%) and ID13 (1.5%) signatures showed the same link with SBS7a–c.

To establish which mutagenic exposures were active earlier or later during the development of HNC, we estimated the molecular timing of each SBS signature (Methods and Supplementary Table 12). Signatures of tobacco and alcohol consumption, as well as the SBS_L signature, were enriched in early clonal mutations (Extended Data Fig. 4a–c), consistent with carcinogenic exposures occurring in normal cells22. Similarly, SBS_I was significantly enriched in early clonal mutations in cases exposed to tobacco and in oral cavity cases, while no significant differences were seen in other subsites (Extended Data Fig. 4d,e). Signatures of APOBEC signaling and SBS39 were enriched in late clonal mutations, suggesting that the corresponding mutational processes increased in activity during the evolution of cancer clones22.

HNC tumors present complex tobacco-related mutation patterns

We then investigated the associations between mutational signatures and epidemiological features using regression analysis (Supplementary Tables 13 and 14). Several signatures were independently associated with tobacco consumption, including the previously recognized tobacco-related signatures SBS4, SBS92, DBS2 and ID3, as well as signature DBS6, reported as of unknown etiology, and the newly discovered SBS_I (Fig. 3a,b, Extended Data Fig. 5a, Supplementary Table 13 and Supplementary Note). The tobacco-associated SBS signatures were composed of three different substitution patterns (predominantly C>A for SBS4, T>C for SBS92 and T>A for SBS_I) (Fig. 2c) and exhibited transcriptional strand bias15,23 (Supplementary Fig. 3). This strand bias often occurs as a result of transcription-coupled DNA repair and is found in mutations owing to bulky adducts, caused by exogenous exposures such as tobacco smoke carcinogens23. Assuming this mechanism is responsible for the strand bias in SBS_I, this is indicative of adduct formation on adenine bases.

Fig. 3: Tobacco-related signatures.

a, Mutational burdens of tobacco-related signatures in HNC cases sorted by subsite and tobacco status. The tumor mutational burden (TMB) per sample is also displayed. b, Mutational burdens for SBS, DBS and indel signatures showing significant positive associations with tobacco consumption (n = 265 biologically independent samples). The Kruskal–Wallis test (two-sided) was used to test for global differences. Box-and-whisker plots are in the style of Tukey. The line within the box is plotted at the median, while the upper and lower ends indicate the 25th and 75th percentiles. Whiskers show 1.5× the interquartile range (IQR). The y axes were cut at 1.25× upper whisker for clarity. Bar plots indicate the frequencies of dichotomized signatures. c, Percentage of driver mutations occurring in C>A contexts in LYX and OC HNC from smokers. d, SBS96-mutation spectrum of driver mutations in LYX and OC HNC from smokers, showing enrichment in the frequency of C>A driver mutations in LYX cases.
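The box-plot conventions described in this legend can be sketched in a few lines. This is only an illustration under stated assumptions: the quartile interpolation method and the rule that whiskers end at the most extreme observation within 1.5× IQR of the box are common Tukey conventions assumed here, and the burden values are hypothetical rather than taken from the dataset.

```python
from statistics import quantiles


def tukey_box(values):
    """Summary statistics for a Tukey-style box-and-whisker plot."""
    data = sorted(values)
    # Assumption: quartiles by linear interpolation ("inclusive" method).
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    lower = min(v for v in data if v >= low_fence)
    upper = max(v for v in data if v <= high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower,
        "upper_whisker": upper,
        # The legend states that y axes were cut at 1.25x the upper whisker.
        "y_axis_cut": 1.25 * upper,
    }


# Hypothetical per-sample signature burdens, for illustration only.
burdens = [0, 0, 120, 340, 560, 900, 1500, 2100, 3300, 25000]
box = tukey_box(burdens)
print(box)
```

With this convention a single extreme sample sits beyond the upper whisker and above the axis cut, which is why truncating the axis keeps the boxes readable.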

The distribution of tobacco-associated signatures varied across different anatomical subsites (Fig. 3a, Extended Data Fig. 5c,d and Supplementary Table 13). Previously established tobacco signatures exhibited higher signature burdens and frequencies in larynx cases compared to other subsites. For instance, SBS4 was present in 17.3% of oral cavity, 17.4% of oropharynx, 52.9% of hypopharynx and 66.7% of larynx cases. Similar distributions were observed for SBS92, DBS2 and ID3. Conversely, the previously unknown SBS_I signature was present in smokers across all subsites, with particular enrichment in the oral cavity. The associations between signatures and subsites remained significant after correction for tobacco consumption and other confounding variables (Extended Data Fig. 5c and Supplementary Table 13). The enrichment of the SBS_I signature in tobacco smokers and oral cavity cases was confirmed in an external dataset (Supplementary Note).
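As a consistency check, the subsite-level SBS4 frequencies quoted in this paragraph can be combined with the subsite case counts from the case-series overview to recover the overall SBS4 prevalence. The sketch below is illustrative arithmetic only; the per-subsite positive counts are reconstructed by rounding the reported percentages, which is an assumption and not a value read from the supplementary tables.

```python
# Subsite case counts and SBS4 frequencies as reported in the text.
cases = {"oral cavity": 127, "oropharynx": 46, "hypopharynx": 17, "larynx": 75}
sbs4_freq = {
    "oral cavity": 0.173,
    "oropharynx": 0.174,
    "hypopharynx": 0.529,
    "larynx": 0.667,
}

# Assumption: positive counts are reconstructed as round(frequency x cases).
positives = {site: round(sbs4_freq[site] * n) for site, n in cases.items()}

total_cases = sum(cases.values())
total_positive = sum(positives.values())
overall_pct = 100 * total_positive / total_cases

print(positives)
print(f"{total_positive}/{total_cases} = {overall_pct:.1f}%")
```

The reconstructed total agrees with the 33.6% overall SBS4 prevalence reported for the full dataset, and makes explicit that most SBS4-positive cases come from the larynx.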

Effects of tobacco exposure on the driver mutation spectra

We explored the driver mutation profile in tobacco-related HNC. This revealed 96 cancer genes with driver mutations in our dataset, including TP53, NOTCH1, CDKN2A, KMT2D and CASP8, which are commonly implicated in HNC24 (Extended Data Fig. 6a,b and Supplementary Tables 15 and 16). TP53 mutations were significantly enriched among smokers compared to nonsmokers (83.2% (164/197) versus 61.8% (42/68), Fisher’s exact test q = 0.0112), while CASP8 mutations were more frequent among nonsmokers (6.09% (12/197) versus 20.6% (14/68), Fisher’s exact test q = 0.0135). A total of 642 driver mutations were identified (Methods), and these showed enrichment of C>A substitutions in smokers compared to nonsmokers (24.9% (114/457) versus 17.3% (32/185), Fisher’s exact test P = 0.0379) (Extended Data Fig. 6c), consistent with the SBS4 mutation profile12. The frequency of C>A driver mutations in tobacco-exposed cases was higher in the larynx subsite compared to oral cavity (31.5% (53/168) versus 19.9% (38/191), Fisher’s exact test P = 0.0148) (Fig. 3c,d). This reflects the lower contribution of SBS4 to mutations in tobacco-exposed oral cavity HNC compared to larynx cases, which has been carried through into the generation of driver mutations. T>A driver mutations were also observed among smokers, albeit in low frequencies (6.6% (11/168) in larynx and 8.4% (16/191) in oral cavity), hinting at a lower presence of SBS_I in driver mutations.
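The enrichment comparisons in this paragraph are 2×2 contingency tests. A minimal standard-library sketch of a two-sided Fisher's exact test is given below, applied to the counts quoted in the text. It is not the authors' pipeline: the function name is hypothetical, the two-sided rule used (summing all tables no more likely than the observed one) is one common convention, and the values it prints are unadjusted P values, so they are not expected to equal the multiple-testing-adjusted q values reported for TP53 and CASP8.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x events in row 1 given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# (with event, without event) in smokers, then in nonsmokers, from the text.
comparisons = {
    "TP53 mutated cases": (164, 197 - 164, 42, 68 - 42),
    "CASP8 mutated cases": (12, 197 - 12, 14, 68 - 14),
    "C>A driver mutations": (114, 457 - 114, 32, 185 - 32),
}
for name, table in comparisons.items():
    print(f"{name}: unadjusted P = {fisher_exact_two_sided(*table):.4g}")
```

Exact integer binomial coefficients are used so that the large totals (up to 642 driver mutations) do not overflow or lose precision before the final division.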

Tobacco-related signatures correlate with HNC incidence

We analyzed the link between tobacco mutagenesis and variations in HNC incidence across different countries, sexes and anatomical subtypes. Our findings support previous epidemiological evidence, which has shown a connection between HNC incidence and smoking habits2 (Fig. 4a,b). Moreover, HNC incidence correlated with tobacco-related signatures (Fig. 4c and Supplementary Fig. 4), showing a higher ASR of HNC incidence in demographic groups presenting higher signature burdens. This further confirms that the geographical and demographic differences in tobacco exposure have a dominant role in driving HNC incidence.

Fig. 4: Association of tobacco use with incidence of HNC.

a, Association between ASR of HNC incidence and tobacco smoking per country and sex (n = 16) measured by linear regression analysis. Estimate of ASR of tobacco smoking prevalence was obtained from the WHO Global Health Observatory data repository (2019). b, Association between cigarette quantity smoked per day in the HNC dataset and ASR incidence per country, sex and subsite, adjusted for age (n = 265). c, Association of tobacco-related signatures with ASR incidence per country, sex and subsite, adjusted for age. Data are represented as average mutations attributed to tobacco-related SBS (SBS4, SBS92 and SBS_I), DBS (DBS2 and DBS6) and indel (ID3) mutational signatures per group. The number of cases per group and frequency of positive cases are indicated by size and color, respectively. For a–c, the 95% confidence interval is shown in light blue. The P values shown correspond to ASR incidence in regressions across all data points with ASR of tobacco smoking (a), cigarette quantity (b) or mutation attributions (c) as explanatory variables.

Alcohol-related signatures in drinkers and smokers

Next, we assessed the signature profile in HNC cases with a history of alcohol intake. Regression analysis revealed significant associations between alcohol consumption and the following three specific signatures: SBS16, ID11 and DBS4 (Extended Data Fig. 5b, Supplementary Table 13 and Supplementary Note). Although the etiology of DBS4 is unclear, it has been found to be prevalent in esophageal cancer cases from countries with high alcohol intake rates13. SBS16, ID11 and DBS4 presented higher signature burdens in cases exposed to both tobacco and alcohol compared to alcohol alone (Fig. 5a,b). In the regression analysis, these signatures showed significant associations with the combined exposure (Fig. 5c and Supplementary Table 13). Models incorporating both tobacco and alcohol showed improved performance over those with alcohol alone (Supplementary Note). Together, these observations indicate a combined effect of smoking and drinking in shaping the mutation profile of HNC, even after correcting for alcohol quantity and other potential confounding factors (Supplementary Note). The results are, therefore, consistent with SBS16, DBS4 and ID11 all being generated by the same underlying alcohol-related mutational process, and with the mutagenicity of this process being increased with co-exposure to tobacco smoke.

Fig. 5: Alcohol-related signatures.

a, Mutational burdens of tobacco-related signatures in HNC cases sorted by subsite, alcohol and tobacco status. TMB per sample is also displayed. b, Mutational burdens for SBS, DBS and indel signatures showing positive associations with the tobacco plus alcohol status (n = 265 biologically independent samples). The Kruskal–Wallis test (two-sided) was used to test for global differences. Pairwise comparisons with the tobacco plus alcohol group were assessed with Dunn’s test (P values are shown in gray). Box-and-whisker plots are in the style of Tukey. The line within the box is plotted at the median, while the upper and lower ends indicate the 25th and 75th percentiles. Whiskers show 1.5× IQR. The y axes were cut at 1.25× upper whisker for clarity. Bar plots indicate the frequencies of dichotomized signatures. c, Associations between alcohol-related mutational signatures and the combined tobacco and alcohol exposures measured by logistic regression analysis. Regressions were corrected for sex, age of diagnosis, anatomical subsite and region. The Bonferroni method was used to adjust P values for multiple hypothesis testing. Effect size (log2(OR), color) and significance level (−log10(adjusted P), size) are shown. Dots filled in white indicate nonsignificant associations (Bonferroni-adjusted P ≥ 0.05). OR, odds ratio.
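The dot encoding used in panel c can be made concrete with a short sketch. It assumes the standard Bonferroni adjustment (each P value multiplied by the number of tests and capped at 1) and treats an adjusted P of 0.05 or above as nonsignificant; the function names, signature labels, odds ratios and P values are hypothetical placeholders, not results from the study.

```python
from math import log2, log10


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def dot_encoding(odds_ratio, adjusted_p, alpha=0.05):
    """Return (color value, size value, filled-white flag) for one dot."""
    effect = log2(odds_ratio)          # color scale: log2(OR)
    significance = -log10(adjusted_p)  # size scale: -log10(adjusted P)
    nonsignificant = adjusted_p >= alpha
    return effect, significance, nonsignificant


# Hypothetical (signature, odds ratio, raw P) triples, for illustration only.
results = [("sig_A", 4.0, 0.001), ("sig_B", 2.0, 0.02), ("sig_C", 1.1, 0.6)]
adjusted = bonferroni([p for _, _, p in results])
for (name, odds_ratio, _), p_adj in zip(results, adjusted):
    print(name, dot_encoding(odds_ratio, p_adj))
```

On the log2 scale an odds ratio of 1 maps to 0, so positive and negative associations are symmetric around the neutral color.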

For driver mutations, samples from individuals exposed to both tobacco and alcohol were characterized by a particularly high frequency of TP53 mutations (87.0% (141/162), 71.4% (25/35), 68.8% (22/32) and 55.6% (20/36) in the tobacco plus alcohol, alcohol alone, tobacco alone and unexposed groups, respectively; Fisher’s exact test q = 0.0024) (Extended Data Fig. 6). The driver mutation burden in the SBS16 context was too low to assess differences in the driver spectra between groups. However, TP53 mutations in the SBS16 contexts were exclusively found in samples from individuals exposed to both tobacco and alcohol (n = 5 TP53 variants).

HPV-positive HNC is characterized by APOBEC signatures

HPV infection in oropharynx cases did not elicit a specific mutational signature profile (Supplementary Table 13). However, most of the mutations in HPV-infected cases were driven by APOBEC activity (57.6% of the signature burden on average). This reflects a trend toward higher relative burdens of APOBEC signatures compared to HPV-negative oropharynx (30.0% of signature burden) (Extended Data Fig. 7a–c), consistent with previous reports18. Notably, the presence of APOBEC signatures was nearly ubiquitous across HNC cases (Fig. 2a), suggesting a broader role for APOBEC activation beyond its antiviral function11.

We also observed differences between HPV-positive and HPV-negative oropharynx cases exposed to tobacco. Among smokers, only 1/6 (16.7%) HPV-positive oropharynx cases presented tobacco-related SBS signatures, compared to 7/26 (26.9%) in HPV-negative cases (Fisher’s exact test P = 0.0214). Despite the well-known influence of tobacco smoking on the driver profiles of HNC20,24, the driver alterations in HPV-positive smokers differed from those of HPV-negative smokers and, instead, resembled the profile in HPV-positive cases from nonsmokers. This included PIK3CA mutations, PTEN mutations and deletions, as well as the absence of TP53 mutations and of FADD gains (Extended Data Fig. 7d,e). This, together with the reduced presence of tobacco-related signatures, suggests that oncogenesis in HPV-positive smokers may primarily be driven by viral infection rather than tobacco exposure.

Signature profiles of exposure to putative HNC risk factors

We next investigated the presence of additional environmental exposures beyond the most widely known HNC risk factors. Notably, UV-related signatures, SBS7a–c, DBS1 and ID13, were detected predominantly in oral cavity cases (Fig. 6a, Supplementary Table 13 and Supplementary Note). SBS7 signatures have been previously described in HNC, but the anatomical and epidemiological features of positive cases have not been previously investigated12. Samples with a relative SBS7a–c burden of >10% were categorized as positive for UV exposure, a criterion met by 13 oral cavity cases from the lip, tongue and floor of the mouth (Fig. 6b). All positive cases were either tobacco or alcohol users, with 11/13 presenting both risk factors (Fig. 6b). Thus, our data suggest a potential role of UV light exposure in HNC carcinogenesis23, which could be enhanced by tobacco and/or alcohol.
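The UV-positivity criterion described above is a simple threshold on the relative SBS7a–c burden. A minimal sketch follows, assuming that the relative burden is the fraction of a sample's attributed SBS mutations assigned to SBS7a, SBS7b and SBS7c; the function name and the example samples are hypothetical.

```python
UV_SIGNATURES = ("SBS7a", "SBS7b", "SBS7c")
UV_THRESHOLD = 0.10  # relative SBS7a-c burden above which a sample is UV positive


def is_uv_positive(sbs_counts, threshold=UV_THRESHOLD):
    """Classify one sample from a mapping of signature name -> attributed mutations."""
    total = sum(sbs_counts.values())
    if total == 0:
        return False  # no attributed mutations, nothing to classify
    uv = sum(sbs_counts.get(sig, 0) for sig in UV_SIGNATURES)
    # Strict inequality, matching the ">10%" criterion in the text.
    return uv / total > threshold


# Hypothetical attribution profiles, for illustration only.
samples = {
    "sample_1": {"SBS1": 500, "SBS2": 2000, "SBS7a": 900, "SBS7b": 400},
    "sample_2": {"SBS1": 800, "SBS4": 6000, "SBS7a": 100},
}
uv_calls = {name: is_uv_positive(counts) for name, counts in samples.items()}
print(uv_calls)
```

Using a relative rather than absolute burden keeps the call independent of the total mutation load, which varies by more than two orders of magnitude across the cohort.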

Fig. 6: UV-related signatures in HNC.

a, Mutational burdens for mutational signatures related to UV light exposure showing positive associations with the HNC anatomical subsite (n = 265 biologically independent samples). The Kruskal–Wallis test (two-sided) was used to test for global differences. Box-and-whisker plots are in the style of Tukey. The line within the box is plotted at the median, while the upper and lower ends indicate the 25th and 75th percentiles. Whiskers show 1.5× IQR. Frequencies of positive samples in each category are indicated in bar plots. b, SBS, DBS and indel signature burdens in samples positive for UV exposure based on relative SBS7a–c contributions above 10% of relative mutational burdens. Samples are sorted by lip (inner, n = 3 or unspecified, n = 1), tongue and floor of the mouth location within the OC. Positive tobacco and alcohol status are indicated in black.

Our analysis did not show any specific mutational patterns associated with other putative HNC risk factors, including hot drink consumption, poor oral health score and high body mass index6,7 (Supplementary Table 13). This suggests that these agents are likely not causing direct mutagenesis. Finally, the previously unknown DBS_D signature and ID4, with unknown etiology, were enriched among nonsmokers (Extended Data Fig. 5e), suggesting a potential link to unidentified mutational processes in this population.

HNC risk factors elicit distinct copy number profiles

HNC is characterized by complex patterns of copy number aberrations throughout the genome19,20. Unsupervised hierarchical clustering analysis on the copy number counts in HNC samples (n = 242) revealed two main clusters: one displaying diploid genomes (cluster D) and another presenting polyploidy and a high burden of copy number gains and losses (cluster P) (Extended Data Fig. 8 and Supplementary Fig. 5). These clusters were further subdivided into four groups (D1, D2, P1 and P2). Notably, subgroup D2 was characterized by a copy-neutral profile, exhibiting substantially lower burdens of copy number events compared to the other groups.

The copy number clusters were associated with distinct epidemiological profiles (Fig. 7c,d). Specifically, tobacco-related HNC was enriched within both the diploid and polyploid copy number-high clusters (that is, D1, P1 and P2), while the copy number-silent cluster D2 consisted mostly of samples from nonsmokers, including cases with unknown risk factors and alcohol drinkers in the absence of tobacco (Supplementary Table 17). Consistent with this pattern, the D2 cluster was enriched in samples from female patients, oral cavity cases and older patients (Supplementary Table 17), aligning with the characteristic features of HNC with undefined risk factors24. Finally, HPV-positive oropharynx cases were enriched in the diploid clusters, predominantly in cluster D1.

Fig. 7: Copy number profile and copy number signature analysis in HNC.

a, Copy number signatures extracted in 242 HNC tumors. The size of each dot represents the proportion of samples presenting the signature, and the color represents the mean relative attribution of each signature. b, Copy number spectrum of the newly identified signature CN_G, defined by a 48 context copy number classification incorporating loss-of-heterozygosity status, total copy number state and segment length to categorize segments from allele-specific copy number profiles. c, Copy number profiles of HNC cases classified by copy number cluster. Relative signature burdens, copy number burden and associated epidemiological characteristics are indicated. The displayed epidemiological variables show significant differences by copy number cluster as per Fisher’s exact test and Benjamini–Hochberg procedure. d, Summary of exposures, driver alterations and copy number signatures associated with each cluster. Alluvial diagram depicts the frequency of each etiology in the copy number clusters. WGD, whole-genome duplication; CIN, chromosomal instability; LOH, loss of heterozygosity.

To identify distinct copy number particularities within each cluster and etiology, we conducted copy number signature analysis25 (Fig. 7a,b, Extended Data Fig. 9 and Supplementary Note). Cluster D1 exhibited enrichment in signatures of chromosomal instability within a diploid genome background (signatures CN1, CN9 and CN13) (Extended Data Fig. 10a). By contrast, cluster D2 presented a signature profile related to a diploid copy-neutral background (CN1). Clusters P1 and P2 displayed associations with signatures of whole-genome duplication (CN2 and CN20) along with genomic aberrations (CN5 and CN_G) (Extended Data Fig. 10b,c). Cluster P1 was consistent with double whole-genome duplications (CN18), while P2 showed signatures of chromosomal instability in conjunction with genome doubling (CN12). Collectively, our analysis suggests that HNC risk factors align with different copy number profiles and provides an enhanced characterization of the copy number aberrations in each HNC etiology (Fig. 7d). Specifically, tobacco use, alone or with alcohol, may trigger chromosomal instability and aneuploidy, while HPV infection may confer a copy number-unstable diploid profile. Finally, samples with unknown risk factors exhibit a copy-neutral profile.

We explored whether this difference in the copy number profile could be due to the driver profile that is associated with each risk factor (Fig. 7d and Extended Data Fig. 10d). TP53 mutations and MYC gains, two known promoters of genomic instability25,26,27, as well as gains in the anti-apoptotic FADD gene, were enriched in cluster P. CASP8 and HRAS mutations were enriched in the D2 copy-neutral cluster, in agreement with previous studies in HNC20,24,25. Finally, PTEN and RB1 mutations were enriched in the D1 cluster. Overall, these results indicate that tobacco use in HNC is associated with a distinct copy number-rich profile and driver alterations related to genome instability.

Discussion

The role of tobacco as one of the most avoidable cancer risk factors has been known for over 50 years. Yet, the detailed mechanisms by which tobacco smoke leads to DNA damage and carcinogenesis in different tissues are still not fully understood14,28,29. In this study encompassing HNC cases from eight countries in Europe and South America, we shed light on the effects of tobacco as the main mutagenic exposure in HNC and explored the complex mutational patterns and genomic alterations linked to tobacco exposure in different HNC subsites, as well as its interplay with alcohol consumption and other risk factors.

Tobacco smoke contains a mixture of thousands of chemicals, including over 60 carcinogens, among which benzo(a)pyrene (BaP) and nitrosamines are the most widely studied. These carcinogens undergo metabolic activation, generating reactive intermediates that interact with DNA in exposed tissues, resulting in complex mutagenic processes that can lead to cancer development29. In HNC, tobacco exposure resulted in six different signatures, identifying at least three mutational processes due to tobacco in HNC. Signature SBS4, characterized by C>A transversions, has been largely attributed to BaP adducts14,30,31. Exposure to this compound is also consistent with the CC>AA substitutions and C deletions present in DBS2 and ID3 tobacco signatures, respectively31. Conversely, signature SBS92, composed predominantly of T>C transitions, has not been related to specific carcinogens in tobacco smoke15. Finally, the T>A-rich substitution profile captured by the previously unidentified signature SBS_I is compatible with adduct formation on adenines, which have been observed in response to multiple tobacco compounds31,32,33. Among those, exposure to nicotine-derived nitrosamine ketone, one of the main tobacco carcinogens in oral tissues34, also yielded a T>A-rich signature in vitro and in mouse tumors35,36. Notably, a signature exhibiting high T>A frequencies and transcriptional strand bias has been described in normal lung epithelia from patients with a history of smoking37.

Our epidemiological analysis revealed that the mutational effects of tobacco vary among anatomical subsites. The canonical tobacco signatures, SBS4 and SBS92, were found predominantly in larynx cases, along with the tobacco-related DBS and indel signatures. Conversely, SBS_I was extracted in HNC cases from all subsites, with a notable enrichment in oral cavity cases. While previous studies primarily reported SBS4 in laryngeal HNC14,38, suggesting a minimal mutational impact of tobacco in the oral cavity and pharynx, our findings reveal different tobacco-related mutagenic processes occurring across all subsites. Altogether, our observations hint at varying susceptibility, exposure level or clearance of tobacco carcinogens across tissues, leading to different genotoxic effects. A possible explanation for these differences is the tissue-specific pattern of cytochrome P450 function. CYP1A1, the main BaP metabolizer, is primarily expressed in lung and larynx, whereas enzymes responsible for nitrosamine metabolism, such as CYP2E1, are predominant in the upper aerodigestive tract, including the oral cavity34,39,40,41. These differences in the response to tobacco across tissues may partially explain the greater susceptibility to smoking found for larynx cancers compared to other anatomical subsites10. While tobacco use was associated with elevated mutation burdens and BaP-related driver mutations in larynx cancers, this was not observed in oral cavity cases, aligning with a reduced carcinogenic effect. Thus, additional carcinogenic processes may be necessary to aid in the development of oral cavity and oropharynx cancers, including alcohol and HPV infection10.

In this regard, we also identified mutagenic processes linked to alcohol exposure13,16, including signatures SBS16, ID11 and an unreported association with signature DBS4. In HNC, alcohol-related signatures were predominantly observed in patients reporting both alcohol and tobacco consumption, consistent with epidemiological evidence showing a synergistic effect between these two factors on disease risk9,42. Furthermore, a previous study suggested an enrichment of SBS16 in oropharynx cases from tobacco users18. Altogether, our findings indicate that tobacco could enhance the carcinogenic effects of alcohol through shared mutagenic processes. Experimental evidence suggests that salivary concentrations of acetaldehyde, the genotoxic byproduct of alcohol metabolism, are greatly increased by tobacco smoking43, which could result in enhanced alcohol-related mutagenesis in cases with combined exposure.

Our data show that tobacco use, alone or in conjunction with alcohol, is also associated with a distinct copy number-rich profile, characterized by signatures of chromosomal instability, and resembling a previously described subset of copy number-rich HNC19. These genomic profiles are likely due to driver alterations leading to genome instability such as TP53 mutations, which are prevalent among smokers and drinkers24. Although high copy number burdens have been reported in lung adenocarcinoma cases from smokers14, the link between this exposure and specific copy number or driver profiles in HNC was previously unclear44. Cases with unknown etiology, on the other hand, exhibit few copy number alterations, prevalence of CASP8 and HRAS mutations and wild-type TP53. A similar copy-neutral group of samples has been observed in HNC, with an unreported link with HNC etiology19,45.

Regarding the mutagenic potential of other investigated risk factors, HPV infection did not elicit a specific mutational signature profile, but it was associated with distinct driver mutations and a copy number-unstable diploid genome. Poor oral hygiene, high body mass index and consumption of hot drinks did not display a direct effect on the mutation profile of HNC cases and likely contribute to the development of HNC of unknown etiology through mechanisms distinct from direct mutagenesis. This pattern has been proposed for several carcinogens in prior studies13,46. Nevertheless, there may exist additional unidentified mutagens leading to HNC, as hinted by the presence of the previously unidentified signature SBS_L as well as the enrichment of DBS_D and ID4 among nonsmokers.

Furthermore, we provide evidence suggesting that sunlight exposure may contribute to HNC development. Specifically, we identified signatures consistent with pyrimidine dimer formation (SBS7a–c and DBS1) in oral HNC cases, indicative of DNA damage by UV light12,23. UV light has only been described as a risk factor for malignancies in the external lip47, but experimental evidence suggests that oral cavity epithelia are susceptible to this exposure, and its carcinogenic processes could be enhanced by tobacco smoking48,49,50,51,52. While we cannot exclude the possibility of other mutational processes eliciting CC>TT substitutions, such as those driven by reactive oxygen species53, the presence of ID13 signatures, identified in melanoma12, provides additional evidence supporting the role of sunlight exposure in oral HNC.

In summary, through our comprehensive analysis of the mutational, genomic and epidemiological profile of HNC cases from diverse geographical regions, we have uncovered genomic mechanisms by which tobacco smoke and other risk factors contribute to HNC development. These findings enhance our understanding of the complexity and tissue specificity of tobacco mutagenesis, offering additional evidence that may inform prevention strategies aimed at reducing the risk of this disease.

Methods

Recruitment of cases and informed consent

The International Agency for Research on Cancer (IARC)/World Health Organization (WHO) coordinated participant recruitment through the HEADSpAcE and Central European international networks, comprising 13 collaborators from the eight participating countries in Europe and South America (Supplementary Table 18). Inclusion criteria for patients were ≥18 years of age (ranging from 18 to 90 years, with a mean of 60 and s.d. of 12 years), confirmed diagnosis of primary HNC and no prior cancer treatment. Written informed consent was obtained for all participants. Patients were excluded if they had any condition that could interfere with their ability to provide informed consent or if there were no means of obtaining adequate tissues as per protocol requirements. Ethical approvals were first obtained from each local research ethics committee and federal ethics committee when applicable, as well as from the IARC Ethics Committee (project 17-10).

Bio-samples and data collection

Dedicated standard operating procedures, following guidelines from the International Cancer Genome Consortium, were designed by the IARC/WHO to select adequate retrospective case series with complete biological samples and exposure information as described previously11,13 (Supplementary Table 18). In brief, for all case series included, anthropometric measures were taken, together with relevant information regarding medical and familial history. All biological samples from retrospective cohorts were collected using rigorous, standardized protocols and fulfilled the required standards of sample collection defined by the IARC/WHO for sequencing and analysis. Retrospective case series were included only after their recruitment protocols had been examined to confirm that the necessary biological samples were available, in line with these standard operating procedures, and that relevant exposure history had been collected, as assessed by comparing the validated epidemiological questionnaires used in each region. Comparable smoking and alcohol history was available from all centers, as well as detailed epidemiological information on oral health, coffee, tea and mate consumption for specific regions2.

Potential limitations of using retrospective clinical data collected using different protocols from different populations were addressed by central data harmonization to ensure a comparable group of exposure variables (Supplementary Table 1). All patient-related data, as well as clinical, demographical, lifestyle, pathological and outcome data, were pseudonymized locally using a dedicated alphanumerical identifier system before being transferred to the IARC/WHO central database.

Expert pathology review

Original diagnostic pathology departments provided diagnostic histological details of contributing cases through standard abstract forms, together with a representative hematoxylin–eosin-stained slide of formalin-fixed paraffin-embedded tumor tissues whenever possible. The IARC/WHO centralized the entire pathology workflow and coordinated a centralized digital pathology examination of frozen tumor tissues collected for the study, as well as formalin-fixed paraffin-embedded sections when available, via a web-based report approach and a dedicated expert panel, following standardized procedures as described previously13. A minimum of 50% viable tumor cells was required for eligibility for whole-genome sequencing.

DNA extraction

Extraction of DNA from fresh frozen tumors and matched blood samples was centrally conducted at IARC/WHO. For the cases that proceeded to the final analysis (n = 265), germline DNA was extracted from blood samples using previously described protocols and methods13.

HPV infection status and genome detection

The HPV infection status was determined by HPV16 E1, E2, E6 and E7 serology. To assess the HPV status in oropharynx cases with missing serologic information (n = 3), we used two orthogonal next-generation sequencing-based viral integration tools—Virus intEgration sites through iterative Reference SEquence customization (VERSE) and Fast Viral Integration and Fusion Identification (FastViFi)54,55 (Supplementary Table 19 and Supplementary Note). VERSE was used as part of the VirusFinder2.0 package (https://bioinfo.uth.edu/VirusFinder/), and FastViFi was installed using GitHub (https://github.com/sara-javadzadeh/FastViFi). Default parameters were used for running both tools.

Whole-genome sequencing

A total of 618 patients with HNC were enrolled in the study. Of those, 315 cases were selected based on pathologic review and DNA quality (tumor and germline), and DNA was received at the Wellcome Sanger Institute for whole-genome sequencing. To ensure that the tumor and matched normal sample originated from the same individual, Fluidigm SNP genotyping with a custom panel was performed. Whole-genome sequencing (150 bp paired-end) was performed on the NovaSeq 6000 platform with a target coverage of 40× for tumors and 20× for matched normal tissues. All sequencing reads were aligned to the GRCh38 human reference genome using Burrows-Wheeler-MEM (v.0.7.16a and v.0.7.17). A standard set of postsequencing quality criteria was applied for metrics, including total coverage, evenness of coverage and contamination. Cases were excluded if coverage was below 30× for tumors or 15× for normal tissue. For evenness of coverage, the median over mean coverage (MoM) score was calculated, and tumor samples with MoM scores outside the range determined to be appropriate by previous studies (0.92–1.09) were excluded56. Conpair56 (https://github.com/nygenome/Conpair) was used to detect contamination, and any tumor or normal sample with a value above 3% was excluded57. A total of 265 cases passed all criteria and were included in subsequent analysis.
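The postsequencing criteria above can be summarized as a single pass/fail check per case. The following is a minimal sketch rather than the pipeline actually used: the `CaseQC` record and its field names are hypothetical, and the range and contamination bounds are assumed to be inclusive of the quoted values.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
MIN_TUMOR_COVERAGE = 30.0    # cases excluded below 30x tumor coverage
MIN_NORMAL_COVERAGE = 15.0   # cases excluded below 15x normal coverage
MOM_RANGE = (0.92, 1.09)     # accepted median-over-mean coverage range
MAX_CONTAMINATION = 0.03     # samples excluded above 3% contamination


@dataclass
class CaseQC:
    """Hypothetical per-case QC summary (field names are illustrative)."""
    tumor_coverage: float
    normal_coverage: float
    tumor_mom: float
    tumor_contamination: float
    normal_contamination: float


def passes_qc(case: CaseQC) -> bool:
    """Return True if the case meets every criterion described in the text."""
    if case.tumor_coverage < MIN_TUMOR_COVERAGE:
        return False
    if case.normal_coverage < MIN_NORMAL_COVERAGE:
        return False
    # Assumption: the MoM bounds themselves are acceptable values.
    if not (MOM_RANGE[0] <= case.tumor_mom <= MOM_RANGE[1]):
        return False
    # Contamination is checked for both the tumor and the normal sample.
    if max(case.tumor_contamination, case.normal_contamination) > MAX_CONTAMINATION:
        return False
    return True
```

For example, `passes_qc(CaseQC(40, 20, 1.0, 0.01, 0.0))` returns `True`, whereas a tumor MoM of 1.10 would fail the evenness criterion.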

Somatic variant calling

A standard analysis pipeline (https://github.com/cancerit) was used to perform variant calling for copy number variants (CNVs; ASCAT58 and Battenberg59, when tumor purity allowed), SNVs (cgpCaVEMan60), indels (cgpPindel61) and structural rearrangements (BRASS). CaVEMan and BRASS were run using the copy number profile and purity values determined from ASCAT when possible (complete pipeline, n = 242), or using copy number defaults and an estimate of purity obtained from ASCAT–Battenberg when tumor purity was insufficient to determine an accurate copy number profile (partial pipeline, n = 23). For SNVs, additional filters (ASRD ≥ 140 and CLPM = 0) were applied in addition to the standard PASS filter. To further exclude the possibility of caller-specific artifacts being included in the analysis, a second variant caller, Strelka2, was run for SNVs and indels13,62, and only variants called by both the Sanger variant-calling pipeline and Strelka2 were included in the final analysis.
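The SNV filtering and two-caller consensus described above amount to a filter followed by a set intersection. The sketch below illustrates that logic on already-parsed calls; the `Variant` tuple, the function names and the input layout are hypothetical, and matching variants on chromosome, position, reference and alternate allele is an assumption about how concordance is defined.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical key used to match a call between the two callers."""
    chrom: str
    pos: int
    ref: str
    alt: str


def passes_snv_filters(filter_field: str, asrd: float, clpm: float) -> bool:
    """Standard PASS filter plus the additional ASRD >= 140 and CLPM = 0 filters."""
    return filter_field == "PASS" and asrd >= 140 and clpm == 0


def consensus_snvs(
    caveman_calls: Iterable[Tuple[Variant, str, float, float]],
    strelka_calls: Iterable[Variant],
) -> List[Variant]:
    """Keep filtered CaVEMan SNVs that were also called by Strelka2.

    caveman_calls yields (variant, FILTER, ASRD, CLPM) tuples;
    strelka_calls yields variants already passing that caller's own filters.
    """
    strelka = set(strelka_calls)
    return [
        variant
        for variant, filter_field, asrd, clpm in caveman_calls
        if passes_snv_filters(filter_field, asrd, clpm) and variant in strelka
    ]
```

Applying the filter before the intersection means that a variant failing ASRD or CLPM is dropped even when Strelka2 also reports it.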

Generation of mutational matrices

Mutational matrices for SBS, DBS, indels and CNVs were generated using SigProfilerMatrixGenerator (https://github.com/AlexandrovLab/SigProfilerMatrixGenerator) with default options (v1.2.0)63.

Mutational signature analysis

Multiple methods were used to extract mutational signatures. The primary extractions were performed using SigProfilerExtractor (https://github.com/AlexandrovLab/SigProfilerExtractor) with a second method, mSigHdp, used to validate the de novo mutational signatures extracted (https://github.com/steverozen/mSigHdp)15,64. SigProfilerExtractor (v1.1.13) was run using nndsvd_min initialization (NMF_init=‘nndsvd_min’) for 1–20 signature solutions and 500 non-negative matrix factorization (NMF) replicates. For SBS, mutational signatures were extracted in both SBS1536 and SBS288 contexts. Both results were similar (Supplementary Note), with the SBS1536 results taken forward for the final analysis (Supplementary Table 6). Signatures were extracted using SigProfilerExtractor in the following contexts for other variant types: DBS78 for DBS, ID83 for indels and CNV48 for CNVs (Supplementary Tables 7, 8 and 20). The extracted de novo signatures were decomposed to COSMIC reference signatures where possible; this step is important as it allows the detection of de novo signatures that are made up of multiple reference signatures that have not separated during the extraction process (Supplementary Note). mSigHdp extractions were performed using the suggested parameters and using the country of origin to construct the hierarchy for SBS96 and ID83 contexts. A comparison of the SigProfilerExtractor and mSigHdp results can be found in the Supplementary Note.

Attribution of activities of mutational signatures

MSA (v2.0, https://gitlab.com/s.senkin/MSA) was used to attribute both de novo and decomposed mutational signatures65. For decomposed attributions, the panel of signatures included COSMIC reference signatures identified during the decomposition of mutational signatures in addition to newly extracted signatures that were not decomposed. A conservative approach was used for MSA attributions using the (params.no_CI_for_penalties=false) option for the calculation of optimum penalties. Pruned attributions were used for the final analysis, where confidence intervals were applied to each attributed mutational signature, and any signature activity with a lower confidence limit equal to 0 was removed.
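The pruning step can be illustrated on a single sample. This is a minimal sketch of the rule as stated in the text, not the MSA implementation; the function name and the `(activity, lower, upper)` layout are hypothetical.

```python
from typing import Dict, Tuple


def prune_attributions(
    attributions: Dict[str, Tuple[float, float, float]],
) -> Dict[str, float]:
    """Drop signatures whose lower confidence limit is zero.

    attributions maps a signature name to (activity, ci_lower, ci_upper)
    for one sample. Activities are non-negative, so a lower limit of zero
    means the interval is compatible with the signature being absent.
    """
    return {
        signature: activity
        for signature, (activity, ci_lower, _ci_upper) in attributions.items()
        if ci_lower > 0
    }
```

For instance, a sample with `{"SBS4": (1200.0, 800.0, 1500.0), "SBS16": (90.0, 0.0, 300.0)}` retains only SBS4 after pruning.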

Driver mutations

Driver mutations in HNC were identified using the following methods. First, the normalized ratio of non-synonymous to synonymous mutations (dN/dS) was used to identify genes under positive selection in HNC66. Results were calculated both for the whole genome (q < 0.01) and with restricted hypothesis testing for a panel of 369 known cancer genes66. Variants in any gene identified as under positive selection in global dN/dS or in the 369-cancer gene panel were considered as potential driver mutations and were then classified as likely drivers if they met any of the following criteria: (1) truncating mutations in genes annotated as tumor suppressors; (2) mutations annotated as likely or known oncogenic in MutationMapper; (3) truncating variants in genes under selection (q < 0.05) for truncating mutations, which were assumed to be tumor suppressors; (4) missense variants in genes under positive selection with dN/dS ratios for missense mutations above five (assuming that four of every five missense mutations are drivers); or (5) in-frame indels in genes under significant positive selection for in-frame indels. The Cancer Gene Census (https://cancer.sanger.ac.uk/census) and the Cancer Genome Interpreter tool (https://www.cancergenomeinterpreter.org) were used to annotate potential drivers with the mode of action. Missense mutations were assessed using the MutationMapper tool (http://www.cbioportal.org/mutation_mapper).
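The arithmetic behind criterion (4) can be made explicit. Under the usual interpretation of dN/dS, and assuming that negative selection on the gene is negligible, a ratio of ω means that ω missense mutations are observed for every one expected under neutrality, so the excess fraction attributable to positive selection is approximately (ω − 1)/ω. The symbols ω and f_driver are introduced here for illustration only.

```latex
\[
f_{\text{driver}} \approx \frac{\omega - 1}{\omega},
\qquad
\omega = 5 \;\Rightarrow\; f_{\text{driver}} \approx \frac{5 - 1}{5} = 0.8 .
\]
```

A missense dN/dS ratio above five therefore corresponds to at least roughly four in five missense mutations in that gene being drivers.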

Copy number profile

The copy number profiles were investigated in a subset of cases with available copy number data (complete pipeline, n = 242). Unsupervised clustering analysis of the copy number counts was performed using Euclidean distance and Ward’s agglomerative procedure. Driver copy number alterations were defined as cancer-related alterations in the COSMIC cancer gene census as follows25,67: (1) homozygous deletion (copy number = (0, 0)) of genes listed as deleted in COSMIC and (2) amplification (copy number > 2 × ploidy + 1) of genes listed as amplified (A) in COSMIC or PIK3CA gains, a commonly reported HNC alteration20,24.
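The two definitions of driver copy number alterations translate into simple threshold rules. The sketch below assumes that "copy number" in the amplification rule refers to the total (major plus minor) copy number of the segment covering the gene; the function name and the `cosmic_role` labels are hypothetical, and the separate handling of PIK3CA gains is not modeled.

```python
from typing import Optional


def classify_cn_driver(
    major: int, minor: int, ploidy: float, cosmic_role: Optional[str]
) -> Optional[str]:
    """Classify a gene-level copy number state as a driver alteration.

    major, minor: allele-specific copy numbers of the gene.
    ploidy: estimated tumor ploidy of the sample.
    cosmic_role: "deleted" or "amplified" for genes listed as such in
    COSMIC, otherwise None (illustrative labels).
    """
    total = major + minor  # assumption: the rule uses total copy number
    if cosmic_role == "deleted" and (major, minor) == (0, 0):
        return "homozygous_deletion"
    if cosmic_role == "amplified" and total > 2 * ploidy + 1:
        return "amplification"
    return None
```

With a ploidy of 2, the amplification threshold is 5, so a gene needs a total copy number of at least 6 to be called amplified under this rule.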

Evolutionary analysis

MutationTimeR22 was run to annotate mutations as either early clonal, late clonal, subclonal or not-assigned clonal (meaning clonality could not be assigned). Samples with at least 256 early clonal mutations and at least 256 late clonal mutations were retained (n = 173), and the early and late clonal mutations for these samples were split into individual VCF files. SigProfilerAssignment68 was run on the resulting VCF files to identify the mutational processes active in the early clonal and late clonal mutations for each sample. Differences between the early and late relative activity of each mutational signature were assessed using a Wilcoxon signed-rank test, and P values were corrected across signatures using the Benjamini–Hochberg procedure (q value).
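Two steps of this analysis are easy to state in code: the sample retention rule and the Benjamini–Hochberg adjustment. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the Wilcoxon signed-rank test itself is not reimplemented, and the adjustment is the standard step-up procedure rather than code taken from the study.

```python
from typing import List, Sequence


def retain_sample(n_early_clonal: int, n_late_clonal: int, minimum: int = 256) -> bool:
    """Keep samples with at least `minimum` early and late clonal mutations."""
    return n_early_clonal >= minimum and n_late_clonal >= minimum


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values (q values) in input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues
```

In this setting, `pvalues` would hold one Wilcoxon signed-rank P value per mutational signature, so the correction is applied across signatures as described.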

Genomic ancestry and admixture analyses

The genetic ancestry of individuals within the HNC dataset was inferred using the ADMIXTURE tool (v1.3.0)69. The admixture and principal component analyses were restricted to HapMap SNPs. Germline variants with minor allele frequency <1% within regions of long-range and high linkage disequilibrium in the human genome (GRCh38) were excluded, leaving 1,182,596 variants. After pruning for linkage disequilibrium (r2 < 20% within a 50 kb window), 159,464 independent variants remained in HNC genotype data. The 1000 Genomes reference population genotype data70 (phase 3) for Europeans (n = 489), Africans (n = 661), East Asians (n = 504) and Latin Americans (n = 347) were filtered and merged with the HNC genotype data based on the pruned set of variants present in both datasets. ADMIXTURE analysis was performed on the merged genotype data with k = 4, which would correspond to the four ancestral continental population groups reflecting the participants of our study (Supplementary Table 2). To complement the ADMIXTURE results, principal component analysis was run on the same samples, and HNC cases were visualized in two dimensions in comparison with each reference population included in the 1000 Genomes dataset (Supplementary Fig. 1).

Regressions and associations with signatures

Signature attributions were dichotomized into presence and absence using confidence intervals, with presence defined as both lower and upper limits being positive and absence as the lower limit being zero. If a signature was present in at least 75% of cases (SBS1, SBS2, SBS13, SBS18, SBS_I, ID1 and ID2), it was dichotomized into above and below the median of attributed mutation counts. The binary attributions served as dependent variables in logistic regressions, and relevant risk factors, epidemiological characteristics or ancestry data were used as factorized independent variables. Regressions with variables presenting data separation were performed using Firth’s penalized logistic regression.
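The dichotomization rule for one signature across all cases can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name is hypothetical, the median is taken over all cases, and cases exactly at the median are assigned to the lower group.

```python
from statistics import median
from typing import List, Sequence


def dichotomize_signature(
    activities: Sequence[float],
    ci_lowers: Sequence[float],
    prevalence_cutoff: float = 0.75,
) -> List[bool]:
    """Convert one signature's attributions into a binary outcome per case.

    activities: attributed mutation counts, one per case.
    ci_lowers: lower confidence limits, one per case. Activities are
    non-negative, so a positive lower limit implies a positive upper limit.
    """
    present = [lower > 0 for lower in ci_lowers]
    prevalence = sum(present) / len(present)
    if prevalence >= prevalence_cutoff:
        # Widespread signature: split cases at the median attributed count.
        cutoff = median(activities)
        return [activity > cutoff for activity in activities]
    # Otherwise use presence versus absence.
    return present
```

The resulting booleans would serve as the dependent variable in the logistic regressions.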

For SBS, DBS and indel mutation burden analyses, cases defined as hypermutators (mutation burdens more than 1.5 interquartile range above Q3) were excluded, and associations with epidemiological factors were assessed using linear regression analysis.
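The hypermutator definition corresponds to the familiar upper outlier fence, Q3 + 1.5 × IQR. The sketch below uses the Python standard library; the function names are hypothetical, and the quartile convention (linear interpolation with inclusive endpoints) is an assumption, because the text does not state which one was used.

```python
from statistics import quantiles
from typing import List, Sequence


def hypermutator_threshold(burdens: Sequence[float]) -> float:
    """Return Q3 + 1.5 * IQR for a set of mutation burdens."""
    # Assumption: quartiles by linear interpolation over the observed range.
    q1, _q2, q3 = quantiles(burdens, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def exclude_hypermutators(burdens: Sequence[float]) -> List[float]:
    """Drop cases whose burden lies above the hypermutator threshold."""
    threshold = hypermutator_threshold(burdens)
    return [burden for burden in burdens if burden <= threshold]
```

The threshold would be computed separately for each mutation type (SBS, DBS and indels) before fitting the linear regressions.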

To adjust for confounding factors, sex, age of diagnosis, subsite, region, tobacco and alcohol status were added as covariates in all regressions. The region variable was categorized as Europe and South America. The Bonferroni method was used to correct P values for multiple testing.

Regressions with HNC incidence (ASRs) were performed as linear regressions with signature attributions for signatures present in at least 75% of cases. Signatures present in less than 75% of cases were dichotomized into presence and absence, as indicated above, and analyzed using logistic regressions. ASRs were obtained from the Global Cancer Observatory (GLOBOCAN 2022)1. Regressions were performed on a sample basis and adjusted for age.

Statistics and reproducibility

Analyses were conducted using R (v.4.1.2)71. Data handling and statistical analysis were conducted using the R packages dplyr 1.1.4, tidyr 1.3.1, stringr 1.5.1, logistf 1.26.0 and yardstick 1.3.1 (refs. 72,73,74,75,76). Figures 1–7 and Extended Data Figs. 1,4,5,7,8 and 10 were created using ggplot2 3.5.1, ggrepel 0.9.5, ggpubr 0.6.0, ggalluvial 0.12.5, viridis 0.6.5, cowplot 1.1.3, patchwork 1.2.0, gridExtra 2.3, circlize 0.4.16, scales 1.3.0, ComplexHeatmap 2.10.0, EnrichedHeatmap 1.24.0, GenomicRanges 1.46.1 and CNAqc 1.0.0 (refs. 77,78,79,80,81,82,83,84,85,86,87,88,89,90). Signature extraction was replicated two times independently at both Wellcome Sanger Institute and University of California San Diego (UCSD). Signature attribution was replicated two times independently at both Wellcome Sanger Institute and IARC. All attempts at replication were successful and provided similar results. No experiments other than those mentioned here were replicated independently. Additional details relating to the methods used in this study can be found in the Supplementary Note.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.