Extended Data Fig. 1: Study design, DNA analysis & epidemiology.

A) Study design schematic showing the three aspects of the paper. LEFT: Epidemiological analysis of cancer incidence and PM2.5. MIDDLE: Pollution exposure in mouse models. RIGHT: Normal lung tissue analysis. B) TX421 Tumours from Smokers. Barplots indicating the proportion of SNVs in each tumour attributed to each SBS mutational signature. The barplots (Top: lung adenocarcinoma (LUAD), Bottom: lung squamous cell carcinoma (LUSC)) reflect the probability that clonal driver mutations in patients in whom smoking-related signatures have been detected are caused by different mutational processes (SBS4 and SBS92 smoking, SBS2 and SBS13 APOBEC, SBS1 and SBS5 ageing). Each observed driver mutation in each patient is assigned a probability of having been caused by each mutational signature, based on its trinucleotide context and the signature exposures of the patient (see Methods), and these probabilities are then aggregated. Asterisks mark patients in whom the aggregated smoking-related probability is below 0.5. C) Correlation between PM2.5 levels and EGFR mutant (EGFRm) lung adenocarcinoma incidence in England. Blue line: robust linear regression; grey shading: 95% confidence interval. D-E) The Canadian Lung Cancer Cohort. D) Distribution of 3-year and 20-year cumulative PM2.5 exposure levels for all patients in the Canadian cohort. Red lines mark the thresholds used to define the Low, Intermediate and High groups that are used in (E). These are the 1st (6.77 µg/m3) and 5th (7.27 µg/m3) quintiles of the distribution. The full distribution is displayed in the top plot, while the bottom plot displays a narrower range of 4–10 µg/m3 (for clarity). E) Counts and frequencies of EGFRm in the Canadian Cohort, where 3-year and 20-year cumulative PM2.5 exposure levels were available. Patients are grouped into high, intermediate and low groups based on the thresholds described in (D).
These groups are defined based on 3 year cumulative PM2.5 exposure data (left) and based on 20 year cumulative PM2.5 exposure data (right). The bar plots display the counts and frequency of EGFRm amongst patients within each group. The map was created using DEFRA data in R. The illustrations in A were created using BioRender (https://biorender.com).
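
The per-mutation attribution described for panel B can be illustrated with a minimal sketch. The exact procedure is given in the Methods and is not reproduced here, so everything below is an assumption made for illustration: the function names, the toy exposures and signature profiles, the Bayes-style weighting (exposure multiplied by the signature's probability of producing the trinucleotide context), and the use of a simple mean to aggregate across driver mutations.

```python
from collections import defaultdict


def signature_probabilities(exposures, profiles, context):
    """Probability that each signature caused a mutation in `context`.

    Assumed form: weight = patient exposure to the signature multiplied by
    the signature's probability of producing this trinucleotide context,
    normalised so the weights sum to 1 across signatures.
    """
    weights = {
        sig: exposures[sig] * profiles[sig].get(context, 0.0)
        for sig in exposures
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError(f"no signature can explain context {context!r}")
    return {sig: w / total for sig, w in weights.items()}


def aggregate_by_process(mutation_contexts, exposures, profiles, process_of):
    """Mean per-process probability over one patient's driver mutations."""
    if not mutation_contexts:
        raise ValueError("patient has no driver mutations to aggregate")
    totals = defaultdict(float)
    for context in mutation_contexts:
        probs = signature_probabilities(exposures, profiles, context)
        for sig, p in probs.items():
            totals[process_of[sig]] += p
    n = len(mutation_contexts)
    return {process: t / n for process, t in totals.items()}


# Toy values only; these are NOT real signature profiles or patient data.
exposures = {"SBS4": 0.6, "SBS1": 0.4}
profiles = {
    "SBS4": {"C[C>A]A": 0.05, "A[C>T]G": 0.01},
    "SBS1": {"C[C>A]A": 0.01, "A[C>T]G": 0.10},
}
process_of = {"SBS4": "smoking", "SBS1": "ageing"}
driver_contexts = ["C[C>A]A", "A[C>T]G"]

result = aggregate_by_process(driver_contexts, exposures, profiles, process_of)
# Mirrors the asterisk rule: flag if aggregated smoking probability < 0.5.
flagged = result["smoking"] < 0.5
```

In this sketch a patient would receive an asterisk when `flagged` is true, that is, when the aggregated smoking-related probability falls below 0.5.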