Abstract
DNA damage is preferentially repaired in expressed genes; thus, genome-wide correlations between somatic mutation patterns and normal cell transcription may reflect tumor cell origins. Accordingly, we found that aggregate lung adenocarcinoma (LUAD) and squamous cancer (LUSC) somatic mutation density associated most strongly with distal (alveolar) and proximal (basal) lung cell-type-specific gene expression, respectively, consistent with presumed LUAD and LUSC cell origins. Analyzing individual genomes, 21% of LUADs bore mutational footprints of proximal airway origins, with 38% classified as ambiguous. Distal origin LUADs, enriched for KRAS and STK11 drivers, occurred mainly in smokers; proximal origin LUADs, enriched for EGFR drivers, were more common in never-smokers. Ambiguous origin LUADs showed APOBEC signatures and SMARCA4 alterations. TP53 mutant LUADs with non-distal cell origins preferentially exhibited non-distal transcriptional identity. Our study reveals a complex interplay between lineage and identity in LUAD evolution and offers a scalable strategy to infer tumor origins in human cancers.
Data availability
Obtained data sources
Whole-genome sequencing (WGS) data were obtained for cases from the TCGA Research Network consortium through the Genomic Data Commons (https://portal.gdc.cancer.gov). We obtained 205 cases with appropriate dbGaP permissions (study accession, phs000178.v11.p8). Additional WGS data were obtained for 90 cases from multiple studies publicly available through the EGA (https://ega-archive.org) and dbGaP (https://dbgap.ncbi.nlm.nih.gov). These cohorts include 49 LUADs (accession no. EGAS00001002801; ref. 28), 13 LUADs and three LUSCs from Cancer Alliance (accession no. EGAS00001004013) and 25 LUADs (accession no. phs000488.v2.p1; ref. 25). The HLCA was downloaded from CellXGene (https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293). scRNA-seq data for the lung cancer atlas were obtained from several publicly available cohorts, including https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6149 (ref. 30), GEO accession GSE123904 (ref. 32), https://lungcancer.chenlulab.com (ref. 34), https://ega-archive.org/studies/EGAS00001004419 (ref. 33) and GSE133747 (ref. 31).
Generated data sources
Five LUAD samples were processed for both scRNA-seq and WGS (described in detail above). Raw count scRNA-seq data for these samples are available as Supplementary Data 1. WGS data for WCM-1 are available as Supplementary Data 2. Raw sequencing data are available upon request to the corresponding author, subject to a 4–8 week review of the request by a data access committee and IRB. Source data are provided with this paper.
Code availability
Code for generating analyses and figures is provided on GitHub (https://github.com/mskilab-org/lung_coo_2025) and Zenodo (https://doi.org/10.5281/zenodo.17243535; ref. 59).
References
1. Visvader, J. E. Cells of origin in cancer. Nature 469, 314–322 (2011).
2. Sainz de Aja, J., Dost, A. F. M. & Kim, C. F. Alveolar progenitor cells and the origin of lung cancer. J. Intern. Med. 289, 629–635 (2021).
3. Ferone, G., Lee, M. C., Sage, J. & Berns, A. Cells of origin of lung cancers: lessons from mouse studies. Genes Dev. 34, 1017–1032 (2020).
4. Rubin, M. A., Bristow, R. G., Thienger, P. D., Dive, C. & Imielinski, M. Impact of lineage plasticity to and from a neuroendocrine phenotype on progression and response in prostate and lung cancers. Mol. Cell 80, 562–577 (2020).
5. Quintanal-Villalonga, A. et al. Lineage plasticity in cancer: a shared pathway of therapeutic resistance. Nat. Rev. Clin. Oncol. 17, 360–371 (2020).
6. Ci, B. et al. Molecular differences across invasive lung adenocarcinoma morphological subgroups. Transl. Lung Cancer Res. 9, 1029–1040 (2020).
7. Nicholson, A. G., Scagliotti, G., Tsao, M. S., Yatabe, Y. & Travis, W. D. 2021 WHO classification of lung cancer: a globally applicable and molecular biomarker-relevant classification. J. Thorac. Oncol. 17, e80–e83 (2022).
8. Tomasetti, C., Vogelstein, B. & Parmigiani, G. Half or more of the somatic mutations in cancers of self-renewing tissues originate prior to tumor initiation. Proc. Natl Acad. Sci. USA 110, 1999–2004 (2013).
9. Pleasance, E. D. et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196 (2010).
10. Pleasance, E. D. et al. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 463, 184–190 (2010).
11. Imielinski, M., Guo, G. & Meyerson, M. Insertions and deletions target lineage-defining genes in human cancers. Cell 168, 460–472.e14 (2017).
12. Haradhvala, N. J. et al. Mutational strand asymmetries in cancer genomes reveal mechanisms of DNA damage and repair. Cell 164, 538–549 (2016).
13. Supek, F. & Lehner, B. Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature 521, 81–84 (2015).
14. Stamatoyannopoulos, J. A. et al. Human mutation rate associated with DNA replication timing. Nat. Genet. 41, 393–395 (2009).
15. Tomkova, M., Tomek, J., Kriaucionis, S. & Schuster-Böckler, B. Mutational signature distribution varies with DNA replication timing and strand asymmetry. Genome Biol. 19, 129 (2018).
16. Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).
17. Schuster-Böckler, B. & Lehner, B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507 (2012).
18. Polak, P. et al. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364 (2015).
19. Gonzalez-Perez, A., Sabarinathan, R. & Lopez-Bigas, N. Local determinants of the mutational landscape of the human genome. Cell 177, 101–114 (2019).
20. Pich, O. et al. Somatic and germline mutation periodicity follow the orientation of the DNA minor groove around nucleosomes. Cell 175, 1074–1087.e18 (2018).
21. Salvadores, M., Mas-Ponte, D. & Supek, F. Passenger mutations accurately classify human tumors. PLoS Comput. Biol. 15, e1006953 (2019).
22. Jiao, W. et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nat. Commun. 11, 728 (2020).
23. Nguyen, L., Van Hoeck, A. & Cuppen, E. Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features. Nat. Commun. 13, 4013 (2022).
24. Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29, 1563–1577 (2023).
25. Imielinski, M. et al. Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell 150, 1107–1120 (2012).
26. Carrot-Zhang, J. et al. Whole-genome characterization of lung adenocarcinomas lacking the RTK/RAS/RAF pathway. Cell Rep. 34, 108707 (2021).
27. Collisson, E. A. et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550 (2014).
28. Lee, J. J.-K. et al. Tracing oncogene rearrangements in the mutational history of lung adenocarcinoma. Cell 177, 1842–1857.e21 (2019).
29. Hadi, K. et al. Distinct classes of complex structural variation uncovered across thousands of cancer genome graphs. Cell 183, 197–210.e32 (2020).
30. Lambrechts, D. et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat. Med. 24, 1277–1289 (2018).
31. Raredon, M. S. B. et al. Single-cell connectomic analysis of adult mammalian lungs. Sci. Adv. 5, eaaw3851 (2019).
32. Laughney, A. M. et al. Regenerative lineages and immune-mediated pruning in lung cancer metastasis. Nat. Med. 26, 259–269 (2020).
33. Lukassen, S. et al. SARS-CoV-2 receptor ACE 2 and TMPRSS 2 are primarily expressed in bronchial transient secretory cells. EMBO J. 39, e105114 (2020).
34. Zhang, L. et al. Integrated single-cell RNA sequencing analysis reveals distinct cellular and transcriptional modules associated with survival in lung cancer. Signal Transduct. Target. Ther. 7, 9 (2022).
35. Helleday, T., Eshtad, S. & Nik-Zainal, S. Mechanisms underlying mutational signatures in human cancers. Nat. Rev. Genet. 15, 585–598 (2014).
36. Spisak, N., de Manuel, M., Milligan, W., Sella, G. & Przeworski, M. The clock-like accumulation of germline and somatic mutations can arise from the interplay of DNA damage and repair. PLoS Biol. 22, e3002678 (2024).
37. Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
38. Campbell, J. D. et al. Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nat. Genet. 48, 607–616 (2016).
39. Kadur Lakshminarasimha Murthy, P. et al. Human distal lung maps and lineage hierarchies reveal a bipotent progenitor. Nature 604, 111–119 (2022).
40. Hill, W. et al. Lung adenocarcinoma promotion by air pollutants. Nature 616, 159–167 (2023).
41. Tong, X. et al. Adeno-to-squamous transition drives resistance to KRAS inhibition in LKB1 mutant lung cancer. Cancer Cell 42, 413–428.e7 (2024).
42. Rekhtman, N. et al. SMARCA4-deficient thoracic sarcomatoid tumors represent primarily smoking-related undifferentiated carcinomas rather than primary thoracic sarcomas. J. Thorac. Oncol. 15, 231–247 (2020).
43. Concepcion, C. P. et al. Smarca4 inactivation promotes lineage-specific transformation and early metastatic features in the lung. Cancer Discov. 12, 562–585 (2022).
44. Petljak, M., Green, A. M., Maciejowski, J. & Weitzman, M. D. Addressing the benefits of inhibiting APOBEC3-dependent mutagenesis in cancer. Nat. Genet. 54, 1599–1608 (2022).
45. Ettinger, D. S. et al. NCCN guidelines insights: non-small cell lung cancer, version 2.2021: featured updates to the NCCN guidelines. J. Natl Compr. Cancer Netw. 19, 254–266 (2021).
46. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
47. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
48. Díaz-Gay, M. et al. Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics 39, btad756 (2023).
49. Mallick, S. et al. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206 (2016).
50. Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
51. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
52. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
53. Tokheim, C. & Karchin, R. CHASMplus reveals the scope of somatic missense mutations driving human cancers. Cell Syst. 9, 9–23.e28 (2019).
54. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers. Nat. Rev. Cancer 18, 696–705 (2018).
55. Suehnholz, S. P. et al. Quantifying the expanding landscape of clinical actionability for patients with cancer. Cancer Discov. 14, 49–65 (2024).
56. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).
57. Masica, D. L. et al. CRAVAT 4: Cancer-Related Analysis of Variants Toolkit. Cancer Res. 77, e35–e38 (2017).
58. Finak, G. et al. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol. 16, 278 (2015).
59. Panja, S., Mantri, P., Johnson, K. E., Martinez, J. S. A. & Imielinski, M. Analysis code for inferring cell-of-origin and transcriptional identity in human lung adenocarcinoma. Zenodo https://doi.org/10.5281/zenodo.17243535 (2025).
Acknowledgements
We thank C. Kim, T. Tammela and D. Lyden for helpful discussions and the Weill Cornell Medicine Epigenomics and Histology core facilities for technical support. Project support for this research was provided in part by the Center for Translational Pathology at the Department of Pathology and Laboratory Medicine, Weill Cornell Medicine. M.I., S.P., P.M. and H.T. were supported by National Institutes of Health (NIH) award R37CA229861 to M.I. In addition, M.I., J.S.A.-M., A.D., J.R. and H.T. were supported by Weill Cornell Medicine Department of Pathology and Laboratory Medicine startup funds, NYU Perlmutter Cancer Center startup funds, a Pershing Square Sohn Cancer Prize and a Burroughs Wellcome Fund Career Award for Medical Scientists awarded to M.I. K.E.J. was supported by a National Science Foundation Graduate Research Fellowship (1746886).
Author information
Contributions
M.I. designed and supervised the study. S.P., P.M., J.S.A.-M. and K.E.J. performed experiments and data analysis with contributions from M.I. and A.D. S.B., K.O. and J.M.M. coordinated sample collection and reviewed histopathology. M.S. and P.S. performed scRNA-seq experiments. H.T., M.S., J.S.A.-M. and K.E.J. performed sample processing and library preparation for scRNA-seq and WGS. S.P., P.M., J.S.A.-M. and K.E.J. performed data curation, scRNA-seq analysis, cancer genomics analyses and simulations, with assistance from A.L. on driver mutation analysis. S.R.Y. and W.T. conducted the expert pathology review. K.E.J., S.B., J.R. and J.M.M. performed immunohistochemistry analysis. S.P., P.M., K.E.J., J.S.A.-M., A.D., P.P. and M.I. interpreted data. S.P., P.M., J.S.A.-M., K.E.J. and M.I. wrote the manuscript with comments from all authors.
Ethics declarations
Competing interests
M.I. reports receiving personal or consultancy fees from ImmPACT Bio outside of the scope of the submitted work. M.S. and P.S. report receiving personal fees from 10× Genomics outside of the scope of the submitted work. P.P. reports receiving personal fees from C2I-Genomics outside of the scope of the submitted work. S.R.Y. reports receiving speaking fees from Medscape, Onclive, Medical Learning Institute, PRIME Education, AstraZeneca and Roche outside of the scope of the submitted work. S.R.Y. reports consultancy fees from AstraZeneca, AbbVie, Merus, Eli Lilly, Boehringer Ingelheim, Roche, Amgen and Sanofi outside of the scope of the submitted work. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Identification of carcinoma cells.
(a) Average gene expression of selected cell type specific markers across lung cell atlas scRNA-seq clusters (see Fig. 1b). Each circle i,j represents the expression of gene i in cell cluster j. The size of each circle i,j represents the percentage of cells in cluster j expressing gene i. The color represents the average expression of gene i in cluster j. (b) Scatter plot of Shannon’s diversity index (quantifying uniformity of cluster membership across patient samples) and cluster-level aneuploidy score used to label epithelial cell (EPCAM+) clusters as carcinoma vs. non-carcinoma (see Methods and Supplementary Note 5). Carcinoma cells are EPCAM+ clusters with low diversity (Shannon’s diversity index < 10) and high aneuploidy (aneuploidy score > 40, thresholds indicated by dotted lines in the plot). (c) Heatmap of chromosome arm-level aneuploidy scores (average of normalized gene expression across each chromosome arm, see Methods and Supplementary Note 5 for details) showing gains (red) and losses (blue) across individual cells from EPCAM+ clusters (rows). Bars (right) delineate carcinoma and non-carcinoma labels for clusters to which cells were assigned.
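The two-threshold rule in panel (b) can be illustrated with a minimal sketch. It assumes a natural-log Shannon index over per-cluster patient sample labels; the caption does not state the scale of either score, so the thresholds are passed in as arguments and the values and sample labels below are purely hypothetical, not those used in the study.

```python
import math
from collections import Counter

def shannon_diversity(sample_labels):
    """Shannon index H = -sum(p * ln p) over patient samples within one cluster."""
    counts = Counter(sample_labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def label_cluster(diversity, aneuploidy, max_diversity, min_aneuploidy):
    """Carcinoma = low sample diversity AND high aneuploidy; otherwise non-carcinoma."""
    if diversity < max_diversity and aneuploidy > min_aneuploidy:
        return "carcinoma"
    return "non-carcinoma"

# Hypothetical cluster dominated by a single patient (low diversity).
patient_specific = ["P1"] * 98 + ["P2"] * 2
# Hypothetical cluster shared evenly across four patients (high diversity).
shared = ["P1", "P2", "P3", "P4"] * 25

h_specific = shannon_diversity(patient_specific)
h_shared = shannon_diversity(shared)

# Illustrative thresholds and aneuploidy score on an assumed scale.
print(label_cluster(h_specific, 55.0, max_diversity=1.0, min_aneuploidy=40.0))
print(label_cluster(h_shared, 55.0, max_diversity=1.0, min_aneuploidy=40.0))
```

The intuition is that benign epithelial clusters tend to mix cells from many patients, whereas carcinoma clusters tend to be patient-specific, so low diversity combined with high aneuploidy flags a cluster as malignant.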
Extended Data Fig. 2 Hierarchical representation of epithelial cell-type composition across five levels of annotation in the healthy human lung atlas.
Hierarchy of human lung cell atlas cell types across five levels, from broadest (level 1) to most detailed (finest level). The “finest level” annotation comprises 23 cell types. Cell types in level 4 and “finest level” are denoted: b = bronchial, n = nasal, nn = non-nasal, ss = subsegmental, SMG = submucosal gland, TB = terminal bronchiole, and AT = alveolar type.
Extended Data Fig. 3 Benchmarking label transfer and COO inference.
(a) Bar plot showing label concordance between auto-encoder based label transfer to the human lung cell atlas (scANVI) and marker-based cell annotation (Garnett) for benign cells from tumor and adjacent normal tissue samples (see Methods). (b) Gene-specific SNV density (SNVs/Mbp) in LUAD (n = 242 independent tumor/normal pairs) and LUSC (n = 53 independent tumor/normal pairs). Densities are shown across tertiles of AT2 and basal-resting gene expression. (c) Gene-specific LUAD SNV density of tobacco, aging, and APOBEC SNVs (SNVs/Mbp) across three tertiles of average lung epithelial cell gene expression (n = 242, as in b). (d) Accuracy of COO inference (see Methods) across levels of the HLCA lung epithelial cell hierarchy (see Extended Data Fig. 2). Cell types are denoted: b = bronchial, n = nasal, nn = non-nasal, ss = subsegmental, SMG = submucosal gland, TB = terminal bronchiole, and AT = alveolar type. (e) Accuracy of COO inference at various levels of cell type resolution (see Extended Data Fig. 2) and tumor mutational burden (TMB) (for a single tumor sample and/or aggregated cohort). Error bars in (b-c) represent standard error of the mean. Accuracy for (d) and (e) was calculated as the fraction of simulations in which the inferred COO matched the true COO at the specified cell type taxonomy level (see Methods).
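The accuracy definition used for panels (d) and (e) can be sketched as follows. The three-level `HIERARCHY` below is an illustrative stand-in rather than the actual HLCA taxonomy, and the simulated labels are hypothetical.

```python
# Illustrative (not the full HLCA) paths from broad to fine annotation.
HIERARCHY = {
    "AT1": ("Epithelial", "Alveolar", "AT1"),
    "AT2": ("Epithelial", "Alveolar", "AT2"),
    "Basal resting": ("Epithelial", "Airway", "Basal resting"),
    "Club": ("Epithelial", "Airway", "Club"),
}

def coo_accuracy(true_types, inferred_types, level):
    """Fraction of simulations whose inferred COO matches the true COO
    after both are mapped to the given hierarchy level (0 = broadest)."""
    if len(true_types) != len(inferred_types) or not true_types:
        raise ValueError("need two non-empty, equal-length lists")
    hits = sum(
        HIERARCHY[t][level] == HIERARCHY[p][level]
        for t, p in zip(true_types, inferred_types)
    )
    return hits / len(true_types)

# Hypothetical outcomes of four simulations.
true_coo = ["AT2", "AT2", "Basal resting", "Club"]
inferred = ["AT2", "AT1", "Club", "AT2"]

acc_fine = coo_accuracy(true_coo, inferred, level=2)    # exact cell type
acc_coarse = coo_accuracy(true_coo, inferred, level=1)  # alveolar vs airway
print(acc_fine, acc_coarse)
```

Scoring at a coarser level credits an inference that lands in the right compartment but on the wrong fine-grained cell type, which is why accuracy can only stay the same or rise as the taxonomy level becomes broader.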
Extended Data Fig. 4 Inferring patient-specific LUSC COO from passenger mutational patterns uncovers proximal origins.
(a) Hierarchically clustered heatmap showing association between benign cell type-specific gene expression and SNV density across individual LUSC WGS samples (n = 53 independent tumor/normal pairs). Each heatmap pixel i, j represents the strength of correlation (RR) for tumor sample i and benign lung cell type j, with values below 1 (blue) representing anti-correlation. (b) Bar plot of benign cell type-specific gene expression vs SNV counts regression results across the aggregated mutation calls from each of the LUSC clusters 1–3 (n = 6, 25, 22 cases respectively). Lime green bars represent proximal cell types, and orange bars represent distal cell types. Relative risk (RR) and 95% confidence interval (error bars) from the maximum likelihood regression fit are plotted for each aggregate cluster sample and benign lung cell type combination. Error bars represent 95% confidence intervals on the Bernoulli trial parameter. Cell types are labeled with the following abbreviations: b = bronchial, n = nasal, nn = non-nasal, ss = subsegmental, SMG = submucosal gland, TB = terminal bronchiole, and AT = alveolar type.
Extended Data Fig. 5 Relationship between identity, origin, histology and TP53 mutations in LUAD.
(a) Bar plot comparing the fraction of cases with distal/non-distal lineage plasticity (that is, those with distal COO and non-distal identity or vice versa) grouped by inferred LUAD COO and TP53 mutation status. (b) Alluvial plot linking origin, identity and histology with TP53 wild-type (WT) and mutant status across n = 75 LUAD cases. Bar height is proportional to the number of cases with the given feature, and ribbons indicate the number of cases with the given feature pair. (c-d) Bar plots comparing fraction of (c) papillary histology and (d) NSCLC, NOS histology between non-distal (ND, n = 34 cases) and distal (D, n = 41 cases) identity groups. (e) Bar plots comparing fraction of NSCLC, NOS between TP53 mutant (n = 44) and wild-type (n = 31) groups. (f) Bar plots comparing fraction of TP53 mutant NSCLC, NOS (n = 19 cases) and other TP53 mutant histologies (n = 56 cases). (g) Oncoprint of genomic alterations in LUAD, LUSC and Pan NSCLC drivers in NSCLC-NOS tumor samples. Error bars in a,c-f represent 95% confidence intervals on the Bernoulli trial parameter. P values in c-f obtained by two-sided Fisher’s exact test.
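For readers unfamiliar with the statistics behind these bar plots, here is a minimal standard-library sketch of a two-sided Fisher's exact test and a 95% interval for a proportion. The caption does not say which interval method was used; the Wilson score interval here is an assumption, and the 2x2 table is a toy example rather than data from the study.

```python
from math import comb, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a Bernoulli parameter."""
    p = successes / trials
    scale = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / scale
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / scale
    return centre - half, centre + half

# Toy table: 3 of 4 cases in group A and 1 of 4 in group B carry the feature.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
ci_low, ci_high = wilson_interval(3, 4)
print(round(p_value, 4), round(ci_low, 3), round(ci_high, 3))
```

With groups this small the exact test is preferable to a chi-squared approximation, which is presumably why it was chosen for subgroup comparisons of this size.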
Extended Data Fig. 6 Identification of two distinct carcinoma cell populations in WCM-1.
(a) Euclidean distance of the (gene-wise cell type expression vs. SNV density) regression result vector for WCM-1 to regression vector centroids for distal, proximal, and ambiguous groups (Fig. 3, see Methods). (b-c) UMAP projection highlighting WCM-1 carcinoma cells (b) and two major WCM-1 carcinoma cell clusters (WCM-1-A, n = 618 cells; WCM-1-B, n = 1,297 cells) (c) against the backdrop of scRNA-seq transcriptomes of carcinoma cells from other patients’ tumor samples.
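The distance comparison in panel (a) amounts to measuring how far one sample's regression vector lies from each group centroid. The sketch below uses made-up three-element vectors (one value per cell type) and hypothetical group members; reading off the closest centroid as the likely group is an assumption about how such distances might be interpreted, not a procedure stated in the caption.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical per-sample regression vectors over three cell types.
groups = {
    "distal": [[1.4, 1.0, 0.8], [1.6, 1.0, 0.8]],
    "proximal": [[0.8, 1.1, 1.5], [0.8, 1.3, 1.5]],
    "ambiguous": [[1.0, 1.0, 1.0], [1.0, 1.2, 1.0]],
}
centroids = {name: centroid(vs) for name, vs in groups.items()}

query = [1.45, 1.0, 0.85]  # hypothetical vector for a new sample
distances = {name: euclidean(query, c) for name, c in centroids.items()}
closest = min(distances, key=distances.get)
print(closest, {k: round(v, 3) for k, v in distances.items()})
```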
Supplementary information
Supplementary Information
Supplementary Notes.
Supplementary Table
Supplementary Table 1 contains lung cancer single-cell atlas cases; Supplementary Table 2 contains cell-type-specific markers; Supplementary Table 3 contains LUAD WGS cases; Supplementary Table 4 contains antibody information; Supplementary Table 5 contains reagents and instruments used in the study.
Supplementary Data 1
Zip file with contents: highly_variable_genes.txt, most highly variable genes across lung epithelial cell types; WCMcounts.csv, scRNA-seq counts matrix for LUAD patients profiled as part of this study; and WCMMetaData.csv, metadata for LUAD patients profiled as part of this study.
Supplementary Data 2
Zip file with contents: WCM-1_non-apobec_data.csv, per-gene counts of non-APOBEC mutations in LUAD patient WCM-1.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Panja, S., Mantri, P., Johnson, K.E. et al. Passenger mutations link cellular origin and transcriptional identity in human lung adenocarcinomas. Nat Genet 57, 3066–3074 (2025). https://doi.org/10.1038/s41588-025-02418-5
DOI: https://doi.org/10.1038/s41588-025-02418-5


