Introduction

The multistep transition from normal mammalian cells to a malignant phenotype during carcinogenesis represents a fundamental evolutionary transformation. Somatic cells are not directly subject to natural selection as their death, survival, phenotype, movement, and proliferation are normally controlled by the host tissue1,2. That is, mammalian cells within differentiated tissue contribute to and are defined by the fitness of the evolving multicellular organism. In contrast, survival and proliferation of individual cancer cells are governed by Darwinian dynamics determined by interactions of their heritable phenotype with local environmental selection forces. This transition from one evolutionary state to another is our working definition of carcinogenesis1.

Carcinogenesis is often framed as a series of specific genetic events. Here we present an alternative model in which carcinogenesis is viewed as a sequence of phenotypic adaptations that give rise to a cancer cell that is evolutionarily distinct from its mammalian cell progenitors. In the context of the “evolutionary transformation” model, several key adaptations are necessary and, therefore, expected in all cancers: 1) Independence from controlling host signals; 2) Acquisition of alternative signaling pathways to optimize fitness (proliferation); 3) Maintenance of molecular machinery for proliferation; 4) Down regulation of whole-organism differentiated functions which benefit the host but do not contribute to fitness (proliferation) of an individual cell; 5) Promotion of phenotypic plasticity to rapidly detect and adapt to spatially and temporally variable environmental selection forces; 6) Niche construction to generate an extracellular environment that optimizes cancer cell survival and proliferation.

Thus, we view accumulating molecular changes in each cancer cell as the legacy of this evolutionary “march to independence”. Although each cancer must pass through predictable evolutionary changes, their specific evolutionary adaptations and corresponding molecular dynamics are subject to differences in the cell of origin and the selection pressures imposed by local dynamics. Here we use bulk data from patient cohorts in the TCGA database to investigate similarities and differences in the molecular dynamics driving key evolutionary steps in adenocarcinomas of the lung (without EGFR, KRAS, or BRAF driver mutations) (LUAD) and breast (BRCA). LUAD without common drivers typically evolves in a smoking-induced mutagenic3 and inflammatory environment4 while BRCA emerges from a plastic tissue that expands and regresses in response to systemic hormone levels5. We choose these as oncogenic extremes with the intent of exploring divergent, parallel, and convergent evolutionary pathways during carcinogenesis.

The large TCGA data sets for LUAD and BRCA can be analyzed through multiple bioinformatic methods. Here, we use evidence of evolutionary selection to reveal key genetic and molecular pathways. While carcinogenesis is often viewed as a sequence of genetic mutations6, we note Darwinian dynamics only require a “mechanism of inheritance” for selected phenotypes, which can, therefore, be genetic or epigenetic. Thus, we also view expression data as a reflection of Darwinian cost/benefit optimization. That is, the change in expression of each gene in a cancer cell compared to its normal progenitor is determined by the fitness benefit of the encoded protein compared to the cost of synthesizing it.

Our present work builds upon and extends techniques we have developed for studying under- and over-mutated genes, and gene expression arising through non-mutational gene regulation including changes in expression, the spliceosome, or copy number7. Previously we have compared gene mutations across three subtypes of lung cancer (KRAS mutated, EGFR mutated, and non-driver mutation), focusing primarily on those genes in common that are rarely mutated8, or associated specifically with immune evasion and activation9. Lastly, we have examined in detail the mutational and expression pathways to carcinogenesis in BRCA and LUAD to identify the key selection pressures, opportunities, and hazards for each cancer type10. Here we take advantage of evolutionary principles to examine parallel and convergent evolution, and co-adaptations, comparing LUAD with no known driver mutations (high mutation rates) with breast cancer (BRCA), which is typified by fewer mutations. Using the TCGA data sets, this allows us to integrate both mutational changes (under- or over-mutated genes) and comparisons of gene expression between normal lung and breast cells and cancerous lung and breast cells.

Our approach complements other approaches to using mutations and gene expression to compare cancers and infer evolutionary processes such as positive and negative selection11. Here we go beyond standard differential gene expression approaches such as edgeR12 and DEseq213 by considering gene expression and mutations together as evolutionary driving forces in cancer progression as well as overall rates of expression change across groups of genes. Similar to our interests, the Curated Cancer Cell Atlas provides single cell RNAseq from diverse cancers and studies that are grouped into meta-programs that identify possible co-adapted syndromes within both normal and cancer cells14. For instance, we use under-mutated and over-mutated genes as evidence for stabilizing or directional selection, in a manner similar to MutSigCV and complementary to approaches using the ratio of synonymous to non-synonymous mutations (e.g., dNdScv).

Eco-evolutionary mathematical models15 of carcinogenesis and clinical observations16 demonstrate normal cells can evolve to a malignant phenotype through a wide range of genetic/epigenetic changes. Molecular data that traces the evolutionary arc of a single cancer is, therefore, difficult to generalize. However, we view each cancer within a TCGA cohort as a “replicate experiment” in carcinogenesis. Thus, patterns of genetic and transcriptional changes can provide insights into parallel and convergent adaptive strategies resulting from different or common evolutionary selection forces.

To illustrate, we have previously noted the analogy of cancer evolution to that of cavefish17. Cavefish converge on a common phenotype (without eyes or skin pigment) through diverse genetic pathways18,19; the cave morph has evolved from >85 different fish species. Thus, molecular sampling from a single cave population may be difficult to generalize but adding data from multiple caves reveals a general pattern of genetic and epigenetic selection on genes associated with, for example, eyes, skin pigment, and adaptation to hypoxia20. Furthermore, specific attributes of the cave system may result in repeatable and predictable divergent trait evolution. For instance, cavefish in a sulfidic stream have larger heads, longer gill filaments, and fewer and larger offspring than those inhabiting a stream without hydrogen sulfide21.

Hutchinson described Darwinian dynamics as “The Ecological Theater and Evolutionary Play22”. Within that analogy, we find the evolutionary plays for breast and lung cancers have different first acts but both their later acts and associated ecological stages tend to converge (Fig. 1).

Fig. 1: The evolutionary play and ecological stages in cancer development.

The cell population in normal tissue has limited opportunities to evolve as the fates of individual cells are determined by tissue control. Initially, cancer evolution occurs on the ecological “stage” of normal tissue. Transition to somatic evolution requires severing tissue controls so that the cell’s fate is determined by its own fitness in the context of its environment. Cancer cells initially increase their fitness by ceasing differentiated tissue functions that consume resources while not increasing the probability of survival and proliferation. Subsequently, similar co-adapted traits show broad convergence among different cancer types including transition to a highly modified and engineered ecological stage.

Methods

Gene list acquisition

We divided the TCGA lung adenocarcinoma cohort based on known driver mutations in KRAS (G12, G13, Q61, A146), BRAF (V600, N581, G464, G466, G469, G596, D594), and EGFR (L858, S768, L861, G719, T790, indels in exons 18–21). The 313 non-EGFR/KRAS/BRAF LUAD patients were selected as the study group because of a high prevalence (90%) of smoking (and associated mutagenic, inflammatory environment) that results in the highest mutation rate among the different lung cancer cohorts8. Somatic mutations along with gene expression in normal tissue and cancers were downloaded from the TCGA PanCancer Atlas23. We similarly obtained the entire breast cancer cohort (727 patients) (BRCA), including mutation and expression data from the TCGA PanCancer Atlas. RNAseq gene-level expression was downloaded from https://api.gdc.cancer.gov/data/3586c0da-64d0-4b74-a449-5ff4d9136611 (EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv). Mean expression was calculated across all samples in each organ and disease status group, and then log2 transformed. In contrast to LUAD, BRCA has minimal exposure to environmental toxins and, therefore, evolves into a cancer with low mutational burden. However, BRCA arises in a normal tissue background that undergoes (or has undergone) cycles of expansion and contraction in response to systemic hormone changes.

Tumor and normal sequence alignment files used to calculate individual base coverage were downloaded from Genome Data Commons. A base was considered sufficiently covered if coverage depth was ≥14 in tumor samples and ≥8 in normal samples as has been previously described: https://www.synapse.org/#!Synapse:syn1695394. The fraction of each gene’s protein coding bases (using the longest RefSeq transcript) covered by sufficient sequence data was calculated for each sample using the Negative Storage Model24. Gene-level depth of coverage was then determined by calculating the number of bases sufficiently covered by sequencing for each of the RefSeq coding genes (with 25 base-pair flanking regions). This procedure measures the fraction of each gene (longest transcript) covered by sequencing data. To address sequencing artifacts that falsely decrease mutation rates, we excluded genes with a low average depth of coverage frequency ( < 50%) and those with errors in the RefSeq gene model.
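The coverage filter described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline actually used: the function names and the input layout (per-base depth lists for a paired tumor and normal sample, and a dictionary of per-sample covered fractions for each gene) are hypothetical, and we assume a base counts as covered only when both the tumor and the normal threshold are met.

```python
# Thresholds taken from the text; everything else in this sketch is illustrative.
TUMOR_MIN_DEPTH = 14      # minimum depth for a base in the tumor sample
NORMAL_MIN_DEPTH = 8      # minimum depth for a base in the paired normal sample
MIN_MEAN_COVERAGE = 0.50  # genes below this average covered fraction are excluded


def covered_fraction(tumor_depths, normal_depths):
    """Fraction of a gene's coding bases sufficiently covered in one tumor/normal pair."""
    if len(tumor_depths) != len(normal_depths):
        raise ValueError("tumor and normal depth lists must have the same length")
    if not tumor_depths:
        return 0.0
    covered = sum(
        1
        for t, n in zip(tumor_depths, normal_depths)
        if t >= TUMOR_MIN_DEPTH and n >= NORMAL_MIN_DEPTH
    )
    return covered / len(tumor_depths)


def genes_passing_coverage(per_sample_fractions):
    """Keep genes whose covered fraction, averaged across samples, is at least 50%.

    per_sample_fractions maps a gene symbol to a list of covered fractions,
    one per sample (hypothetical layout).
    """
    return sorted(
        gene
        for gene, fractions in per_sample_fractions.items()
        if fractions and sum(fractions) / len(fractions) >= MIN_MEAN_COVERAGE
    )
```

For example, a four-base gene with tumor depths 20, 14, 13, 30 and normal depths 8, 7, 9, 10 has two sufficiently covered bases, giving a covered fraction of 0.5.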

TCGA data provides whole exome sequencing with paired tumor/normal analysis to exclude germline mutations and focus on somatic mutations, which were downloaded from https://api.gdc.cancer.gov/data/1c8cfe5f-e52d-41ba-94da-f15ea1337efc (mc3.v0.2.8.PUBLIC.maf.gz). The identified mutations include non-synonymous, synonymous, intronic, untranslated region (UTR), and intergenic mutations. For our analysis, we limited the mutations to protein-altering mutations: non-synonymous and truncating (stop-gain, frameshifting indel, splicing).
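As an illustration of this filtering step, the sketch below counts protein-altering mutations per gene from MAF-formatted text. The column names (Hugo_Symbol, Variant_Classification) follow the standard MAF layout, but the exact set of classification labels treated as protein altering is our assumption for this example, as is the toy input; whether in-frame indels or other classes were included in the analysis is not specified in the text.

```python
import csv
import io
from collections import Counter

# ASSUMPTION: MAF Variant_Classification labels treated as "protein altering"
# (non-synonymous plus stop-gain, frameshifting indel, splicing).
PROTEIN_ALTERING = {
    "Missense_Mutation",
    "Nonsense_Mutation",
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "Splice_Site",
}


def count_protein_altering(maf_text):
    """Count protein-altering mutations per gene in tab-delimited MAF text."""
    # Skip comment lines (starting with '#') that can precede the header row.
    lines = (ln for ln in io.StringIO(maf_text) if not ln.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        if row["Variant_Classification"] in PROTEIN_ALTERING:
            counts[row["Hugo_Symbol"]] += 1
    return counts


# Toy input for illustration only; sample barcodes are made up.
EXAMPLE_MAF = "\n".join([
    "#toy example",
    "\t".join(["Hugo_Symbol", "Variant_Classification", "Tumor_Sample_Barcode"]),
    "\t".join(["TP53", "Missense_Mutation", "SAMPLE-01"]),
    "\t".join(["TP53", "Nonsense_Mutation", "SAMPLE-02"]),
    "\t".join(["TP53", "Silent", "SAMPLE-03"]),
    "\t".join(["PIK3CA", "Missense_Mutation", "SAMPLE-01"]),
    "\t".join(["GATA3", "Intron", "SAMPLE-02"]),
])

print(dict(count_protein_altering(EXAMPLE_MAF)))
```

In the toy input, the synonymous (Silent) and intronic records are dropped, leaving two counted mutations in TP53 and one in PIK3CA.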

To minimize potential artifacts related to expression, and to focus on genes that are functional, we examined average differential gene expression between tumor samples and normal adjacent tissue. Genes were included in the analysis of mutations only if their average expression (log base 2) was ≥ 1.5 in either cancer or normal tissue. In each gene, the average expression in normal tissue and tumor samples across the cohort was analyzed separately to supplement the genetic data.

When gene expression increased ≥ 4-fold in cancer cells, we considered this to indicate greater utility to the cancer cells than to the functioning of normal cells. Conversely a ≥ 75% decrease in expression indicated a gene function that benefited normal cells but not cancer cells.
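Because mean expression is log2 transformed after averaging, these thresholds translate into simple differences on the log2 scale: a ≥4-fold increase is a difference of at least +2, and a ≥75% decrease (expression at or below one quarter of normal) is a difference of at most −2. The sketch below illustrates this arithmetic; the function names and category labels are hypothetical.

```python
import math

MIN_LOG2_EXPRESSION = 1.5         # expression filter from the text (log base 2)
UP_THRESHOLD = math.log2(4)       # >= 4-fold increase  -> difference >= +2
DOWN_THRESHOLD = math.log2(0.25)  # >= 75% decrease     -> difference <= -2


def is_expressed(log2_normal, log2_tumor):
    """True if the gene reaches the expression cutoff in either tissue."""
    return max(log2_normal, log2_tumor) >= MIN_LOG2_EXPRESSION


def classify_change(log2_normal, log2_tumor):
    """Label a gene by its tumor-versus-normal change in mean expression."""
    delta = log2_tumor - log2_normal  # log2 of the fold change of the means
    if delta >= UP_THRESHOLD:
        return "increased"
    if delta <= DOWN_THRESHOLD:
        return "decreased"
    return "unchanged"
```

For instance, a gene with mean log2 expression of 3.0 in normal tissue and 5.0 in the tumor has a 4-fold increase and would be labelled as increased.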

Observed mutational frequency as a function of its fitness effects

Similar to Cannataro et al.25, we characterize the fitness benefit of a gene mutation based on “evolutionary triage”15, which links the fitness effect of each genetic change to cellular proliferation. Thus, genetic mutations or non-mutational changes in expression or the spliceosome that increase fitness (and, therefore, proliferation) will be observed more frequently than expected by chance alone. Conversely, mutations that decrease fitness are eliminated by more fit cancer cell lineages. Mutations in these genes, therefore, are observed less frequently than expected by chance alone. We note this could be interpreted as an intrinsically lower mutation rate in this gene.

We identify genes as “over-mutated” by comparing the observed frequency of non-synonymous mutations in each gene against that gene’s size (number of base pairs) in each cohort. Assuming the probability of mutation was approximately equal for every base pair (see below) in an expressed gene, the background mutation rate was determined by regressing the mutational frequency of each gene against gene size. The distance of each gene to the regression line was then determined, and this standardized residual was compared across all genes. Negative and positive residual values indicate under-mutated and over-mutated genes, respectively. We scored a gene as over-mutated and under positive selection if it was ≥2 standard deviations above the neutral line. Similarly, a gene was labelled “under-mutated” (i.e., conserved) if it was ≥2 standard deviations below the neutral line. In other words, our primary metric for natural selection was based on whether mutations to a gene were less (stabilizing selection) or more (directional selection) frequent than expected by chance.
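The scoring procedure can be illustrated with a short sketch. It assumes an ordinary least-squares fit of per-gene mutation counts on gene size and standardizes each residual by the standard deviation of all residuals; the text does not specify the exact standardization, so that choice, along with the function names and labels, is an assumption of this example.

```python
import statistics


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def standardized_residuals(gene_sizes, mutation_counts):
    """Residual of each gene from the neutral line, in residual-SD units."""
    slope, intercept = fit_line(gene_sizes, mutation_counts)
    residuals = [
        count - (intercept + slope * size)
        for size, count in zip(gene_sizes, mutation_counts)
    ]
    sd = statistics.pstdev(residuals)  # ASSUMPTION: population SD of residuals
    if sd == 0:
        return [0.0 for _ in residuals]
    return [r / sd for r in residuals]


def classify_genes(genes, gene_sizes, mutation_counts, threshold=2.0):
    """Label genes as over-mutated, under-mutated, or neutral."""
    labels = {}
    for gene, z in zip(genes, standardized_residuals(gene_sizes, mutation_counts)):
        if z >= threshold:
            labels[gene] = "over-mutated"
        elif z <= -threshold:
            labels[gene] = "under-mutated"
        else:
            labels[gene] = "neutral"
    return labels
```

A design point worth noting is that strongly over-mutated genes pull the fitted line toward themselves, so a robust fit or an iterative refit after removing outliers may be preferable in practice; this sketch does neither.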

Our approach is similar to prior studies identifying “essential genes” in bacterial26 and human genomes27. We note, however, that evolutionary selection on genes will be context dependent. That is, genes necessary for cancer cell proliferation in the host may differ significantly from genes required for cancer cells competing in a culture dish.

We recognize that our assumption of a roughly equal probability for mutations in all base pairs differs from prior studies that find variation in mutation rates related to gene expression, differences in DNA repair, and chromosomal location28. However, as discussed in prior publications8,9, these studies typically are carried out in populations over time, during which competition among cancer cells will occur. Thus, variation in the observed frequency of mutations that decrease fitness and are eliminated by natural selection could be interpreted as the result of intra-genomic variation in mutation rates. For example, Monroe et al.29 observed that in Arabidopsis thaliana “genes subject to stronger purifying selection have a lower mutation rate”. We hypothesize important genes are observed to be mutated less often30 because such mutations reduce fitness and are not transmitted across generations. Thus, we use apparent differences in the number of mutations observed in a cohort compared to the number expected based on the gene size as an estimate of its contributions to fitness. A genetic mutation observed significantly more frequently than expected indicates that mutation increases fitness while a gene that has significantly fewer mutations than expected must encode a protein whose normal function is essential for optimal fitness of the cancer cell.

Identifying pathways and functions under selection

The collected TCGA data includes the expression of each gene in the cancer and its corresponding normal tissue. These data are used to estimate the average transcriptomic state of the normal and cancer cells. To determine coordinated functions of related groups of genes, curated lists of selected or conserved genes were subjected to pathway analysis using DAVID (Database for Annotation and Integrated Discovery), available at https://david.ncifcrf.gov, and the ConsensusPathDB functional annotation tool, available at http://cpdb.molgen.mpg.de/CPDB/fct_annot. Both Gene Ontology (GO) DIRECT and FAT categories were used to identify significant ontologies, including Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). We then performed functional annotation clustering. We selected clusters based on high numbers of queried genes or based on biological significance. Statistical p values are from the bioinformatic tools and represent the probability that the observed cluster could be the result of chance alone. Venn diagrams were constructed using the Ghent University VIB/UGent Center website: http://bioinformatics.psb.ugent.be/webtools/Venn/.

Comparing expression changes

To investigate non-mutational evolutionary strategies, we used the average expression for each gene in each TCGA cohort. Similarities and differences in the transcriptional states of normal tissue and cancer tissue in each cohort were measured by comparing the average expression of genes under investigation in normal breast and normal lung tissue. The average expression of each gene was then plotted for lung cancer and breast cancers in the same TCGA cohort. Variations in the linear regression from each gene set in the normal tissue and cancers were used as a metric of divergence or convergence.

Statistics and reproducibility

We used chi-square tests of heterogeneity when comparing counts in response to two categorical factors, where one factor might be cancer versus normal or lung versus breast, and the other factor might be upregulated versus downregulated expression, or in common versus exclusive to one group or the other. We used chi-square tests of homogeneity when there is a single factor (cancer versus normal or lung versus breast) and there is an expectation that counts should be equal between the categories of the factor.
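Both tests reduce to short calculations when there is one degree of freedom. The sketch below assumes a Pearson chi-square without continuity correction and uses hypothetical function names; the p value for one degree of freedom is obtained from the complementary error function.

```python
import math


def chi2_p_value_df1(chi2):
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def chi2_heterogeneity_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, chi2_p_value_df1(chi2)


def chi2_homogeneity_two_counts(x, y):
    """Chi-square test that two counts are equal (expected value is their mean)."""
    expected = (x + y) / 2.0
    chi2 = ((x - expected) ** 2 + (y - expected) ** 2) / expected
    return chi2, chi2_p_value_df1(chi2)


# Shared versus unshared genes among the top 150: normal tissues (75 shared)
# against cancers (100 shared).
print(chi2_heterogeneity_2x2(75, 75, 100, 50))
# Conserved genes in LUAD (260) against BRCA (130).
print(chi2_homogeneity_two_counts(260, 130))
```

Under these assumptions, the first call gives χ2 ≈ 8.57 and the second χ2 ≈ 43.3, in agreement with the values reported in the Results for those two comparisons.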

For the genes of a gene family, we compared, using least-squares regression, each gene’s expression in normal lung with its expression in normal breast, or each gene’s expression in lung cancer with its expression in breast cancer. Both the slope (a slope closer to 1 indicates a greater convergence of gene expression) and the R-squared (closer to 1 indicates a tighter correlation in gene expression with less residual variation around the fitted line) indicate the degree of correspondence in gene expression between normal tissues and cancer. While there was much variation in these values across gene families, overall, lung and breast cancer generally showed higher slopes and greater R-squared values than the corresponding normal lung and breast.
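A minimal sketch of this comparison is given below. It assumes breast expression on the x-axis and lung expression on the y-axis, with one value per gene in the family; the function name and example values are hypothetical.

```python
def slope_and_r_squared(breast_expr, lung_expr):
    """Least-squares slope and R-squared of lung expression on breast expression.

    Each list holds the mean log2 expression of the genes in one family,
    in the same gene order.
    """
    n = len(breast_expr)
    mean_x = sum(breast_expr) / n
    mean_y = sum(lung_expr) / n
    sxx = sum((x - mean_x) ** 2 for x in breast_expr)
    syy = sum((y - mean_y) ** 2 for y in lung_expr)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(breast_expr, lung_expr))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, r_squared


# Made-up expression values for a four-gene family.
slope, r2 = slope_and_r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(slope, r2)
```

Applying the same function to the normal tissues and to the cancers gives two (slope, R-squared) pairs for each gene family, and movement of both values toward 1 in the cancers is what we read as convergence.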

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Initial and final cell states

Prior mathematical models show the evolutionary arc of each cancer is strongly influenced by the initial state of the normal progenitor cells and the final state of the optimally fit cancer cells15. As a general estimate of differences and similarities in initial and final transcriptional states, we identified the 150 most highly expressed genes in each group. As evidence for convergent evolution, normal lung and normal breast tissue shared fewer of their 150 most highly expressed genes (75) than did LUAD and BRCA (100) (χ2 = 8.57, df = 1, p < 0.005, test for heterogeneity) (Fig. 2, Supplemental Fig. 1).
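The overlap count itself is a set intersection of two ranked gene lists. The sketch below is illustrative only: the function names and the input layout (a dictionary mapping gene symbol to mean log2 expression for each group) are hypothetical, and ties at the cutoff are broken arbitrarily by sort order.

```python
def top_expressed_genes(mean_expression, n=150):
    """Set of the n genes with the highest mean expression in one group."""
    ranked = sorted(mean_expression, key=mean_expression.get, reverse=True)
    return set(ranked[:n])


def shared_top_genes(expression_a, expression_b, n=150):
    """Genes that are among the top n in both groups."""
    return top_expressed_genes(expression_a, n) & top_expressed_genes(expression_b, n)


# Toy example with four genes and n = 3.
group_a = {"A": 9.0, "B": 8.0, "C": 7.0, "D": 1.0}
group_b = {"A": 5.0, "B": 1.0, "C": 9.0, "D": 8.0}
print(sorted(shared_top_genes(group_a, group_b, n=3)))
```

In the toy example, genes A and C are in the top three of both groups, so the overlap is two.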

Fig. 2: Overlap and differences in the 150 most highly expressed genes in normal breast and lung tissue and breast and lung cancers as a metric of convergence.

In normal breast and lung tissue, 75 of the 150 highest expressing genes are common. This increases to 100 of the 150 highest expressing genes in breast and lung cancers. See Supplemental Figs. 1, 2, 3, and 4 for specific genes and functional pathways.

Functional annotations of the 150 highest expressing genes in lung and breast cancer show common (Supplemental Fig. 2) genes associated with cytoplasmic translation, signaling, antigen presentation and processing, unfolded protein response, glycolysis, lysosomes, serine endopeptidase inhibitors, spliceosome, the melanosome, and the extracellular matrix including both collagen production and degradation.

Of the 150 genes most highly expressed in normal lungs, LUAD maintained or increased expression of 96. This was a significantly greater proportion than for BRCA, which maintained or increased expression of 78 of the 150 genes most highly expressed in normal breast tissue (χ2 = 4.43, df = 1, p < 0.05, test for heterogeneity) (Supplemental Figs. 3 and 4). This may reflect variations in tissue background during somatic evolution. The stress of an inflamed, smoking-related environment in “normal” lung tissue may require upregulation of genes that also contribute to fitness in the cancer environment.

Mutations and expression changes in transition from normal to malignant states

LUADs without known driver mutations, unlike BRCA, typically arise in a smoking-induced mutagenic and inflamed ecosystem. Consistent with this, BRCAs average 34 mutations/tumor compared to 273 mutations/tumor in LUADs. As an alternative metric of mutational frequency, we find only 2 gene mutations (TP53 and PIK3CA) occurred in >10% of the breast cancer cohort compared to 231 in lung cancer (Table 1, Supplemental Fig. 5). LUAD has significantly more evolution-selected mutations than BRCA (321 versus 93, respectively; χ2 = 366.4, df = 1, p < 0.001, test for homogeneity) (Supplemental Fig. 6). Of these, sixteen are common to both cancers (Supplemental Fig. 6).

Table 1 Differences in genetic and epigenetic mechanisms of inheritance in LUAD and BRCA

While mutations are extensively studied in cancers, the conceptual model of evolutionary triage expects that conserved genes reveal valuable insights into critical pathways that must be preserved for optimal fitness15. We identified significantly more conserved genes in LUAD than in BRCA (260 versus 130, respectively, χ2 = 43.3, df = 1, p < 0.001, test for homogeneity) with only one (HTT, the Huntingtin gene) conserved in both (Supplemental Fig. 7).

As a measure of transcriptional changes, breast cancers have 963 genes with ≥4-fold increased expression versus normal breast tissue compared to 558 in lung cancer with 246 common to both cancers. However, there are more genes with >75% decreased expression (compared to normal tissue) in LUAD than BRCA (914 versus 822, respectively, χ2 = 13.3, df = 1, p < 0.001, test for homogeneity) with 255 of these genes in common (Supplemental Fig. 8).

Evolutionary steps in the transition to a malignant phenotype

General patterns of mutations and expression change

LUADs have far more mutations than BRCAs but the latter exhibit more genes with large expression changes (Table 1). However, there is some overlap of mutational selection in LUAD and expression changes in BRCA, as 79 of 341 genes with evolution-selected mutations in LUAD have >2-fold increased expression and 126 have >50% decreased expression in BRCA (Supplemental Fig. 9).

Although parallel changes in expression are common in BRCA and LUAD, 60 genes show >4-fold increased expression in BRCA and >75% decreased expression in LUAD (Supplemental Fig. 8), and 12 genes have the opposite divergence. However, these apparently divergent expression changes generated an overall pattern of convergence so that there was greater similarity of expression in the cancers compared to normal tissue (Supplemental Fig. 10). This group of genes is enriched (p < 0.05) for 7 genes associated with collagen-containing extracellular matrix and 5 membrane transporters (SLC29A4, SLC7A10, SLC5A8, SLC6A4, SLC1A) which have diverse functions including the uptake of amino acids, monocarboxylates, and neurotransmitters31,32,33.

Abrogating response to normal tissue controls

To evolve, cancer cells must become independent of local tissue controls that govern their phenotypic properties, position, and proliferation/death. Compared to normal breast and lung, respectively, we expect BRCA and LUAD to significantly downregulate the expression of genes associated with local signaling pathways that govern cellular proliferation, death, phenotype, function and location. As noted above, we see ≥75% decreased expression (compared to normal tissue) in 822 genes in breast cancer and 914 genes in lung cancer with 255 common to both. The functional annotations (Supplemental Fig. 11) of the genes with decreased expression exhibit both convergent and parallel evolution (see below). For example (Supplemental Fig. 11), BRCA and LUAD downregulate 152 and 153 genes, respectively, associated with signal transduction. Of these, 59 are common to both while 93 and 92, respectively, are specific for each tumor type. BRCA and LUAD downregulated 5 and 8 genes, respectively, in the HIPPO pathway with no gene in common. Conversely, they downregulate the same genes in the VEGFA-VEGFR signaling pathway. For adrenergic signaling, peptide-ligand receptors, calcium signaling, and G alpha (i) signaling, roughly half of the downregulated genes in each tumor type are unique while the other half are common to both. BRCA downregulated 9 genes in the RAF/MAPK cascade in common with LUAD, which additionally downregulated 14 other genes in the pathway. The opposite pattern was seen in chemokine, cytokine, and PPAR signaling as well as the neuroactive ligand-receptor signaling pathway: 6 to 14 downregulated genes were common to both cancers, and only BRCA decreased expression of an additional 8 to 19 genes. LUADs uniquely downregulate genes in the Netrin-UNC5B, corticotropin releasing hormone, ERBB, AGE-RAGE, PI3/AKT, and secretin receptor signaling pathways. Similarly, BRCAs uniquely downregulate genes in the insulin, glucagon, apelin, AMPK, oxytocin, cGMP-PKG, nuclear receptor, retinoic acid, estrogen, adipocytokine, and NR1H2- and NR1H3-mediated pathways.

A critical factor in normal tissue communication is cell-cell interactions. We expect this interaction will normally be tissue specific and, for example, find relatively low correlation between normal breast and normal lung in the expression of genes related to cell-cell interactions, such as the Cadherin, Protocadherin, and Claudin gene families (Fig. 3). However, these families show strong convergence in BRCA and LUAD, suggesting selection for a common cancer phenotype that governs a cancer cell’s interactions with both host cells and other cancer cells.

Fig. 3: Rewiring genes associated with cell-cell interaction.

Left panels plot the average expression of each member of the protocadherin (top), cadherin (middle), and claudin (bottom) gene families in normal breast (x-axis) and normal lung (y-axis) from their TCGA cohorts. Right panels plot their expression in breast cancer (x-axis) and lung cancer (y-axis). There is a pattern of convergence as the expression of members of each gene family becomes more similar in the cancers than in the normal tissue.

Downregulation of differentiated functions

As independence from host control evolves, the specialized functions of mammalian cells, like eyes in a cavefish, impose an ongoing cost while usually providing no evolutionary fitness benefit. These functions are generally tissue specific. Thus, BRCA turned off 108 genes highly expressed in normal breast compared to 32 genes expressed in normal lung but turned off in LUAD with only 7 in common (Fig. 4). Similarly, BRCA and LUAD downregulated 44 and 33 developmental genes, respectively, with no overlap (Supplemental Fig. 11) and each tumor downregulated multiple genes associated with cell-cell adhesion/communication, circadian entrainment, vascular smooth muscle, and prostaglandin synthesis with no genes in common. LUAD downregulated 18, 7, and 8 genes related to complement, surfactant, and oxygen exchange, respectively, with 4, 3, and 4 in common with BRCA. BRCA downregulated 22 and 31 genes associated with adipogenesis and muscle proteins, respectively, with 8 and 15 in common with LUAD. LUADs uniquely downregulated genes associated with cilia, ECM-receptor interactions, and clathrin-mediated endocytosis. BRCA uniquely downregulated genes associated with steroid metabolism, androgen/estrogen biosynthesis, and triacylglycerol synthesis.

Fig. 4: Genes expressed in normal tissue but not in the cancers arising from that tissue.

Early evolutionary selection will decrease differentiated function that benefits the host. The associated molecular machinery incurs a cost to the nascent, evolving cancer cell but typically provides no fitness benefit. The involved genes are highly tissue-specific and show minimum overlap. See Supplemental Figs. 12 and 13.

Maximizing cancer cell fitness

In cavefish, specific phenotypic properties (loss of eyes and skin pigment with increased tactile sensory function) maximize fitness. Thus, although multiple, diverse fish species enter and live in caves, they evolutionarily converge on an optimal phenotype for their common environment. We hypothesized that this might be observed in cancers. That is, once cells are independent of host tissue, they will tend to converge on a common phenotype that is optimally fit for a single-cell protist living in situ.

Promoting phenotypic plasticity – accessing restricted regions of the genome

Phenotypic variability within individual cancer populations is well recognized. Here, we propose the ability to rapidly change phenotypic properties is a trait explicitly selected during cancer development. That is, the cancer environment, due to variations in blood flow and host response34, can experience sudden, potentially lethal changes in local environmental conditions. This selects for cells that can rapidly detect changes in environmental conditions and alter their phenotype to survive and proliferate. In other words, phenotypic plasticity is, itself, a selected trait.

Normal mammalian cells have epigenetic restrictions on regions of the genome35 to maintain stable differentiated functions36. We identified 273 and 111 genes (27 in common) with high expression (>1.5) in BRCA and LUAD, respectively, but no or minimal expression (<0.4) in their corresponding normal tissue (Supplemental Fig. 12). The smaller number in LUAD may reflect prior “unlocking” of restricted genes due to smoking-induced inflammation. LUAD (Supplemental Fig. 13) eliminated expression of 32 genes expressed in normal lung compared to 108 in BRCA with 7 genes in common (χ2 = 89.7, df = 1, p < 0.001, test for homogeneity).

The 27 activated genes common to both LUAD and BRCA include telomerase reverse transcriptase (TERT); LIN28B, otherwise expressed only in embryonic stem cells37; CDH18, expressed on the embryonic subplate38; and CSAG1, which maintains integrity of the mitotic centrosome with defective p5339. LUAD and BRCA each activate 10 members of the MAGE gene family (7 in common; Supplemental Fig. 14), which are typically described as “cancer/testes antigens40.” This strong evolutionary selection suggests MAGE genes contribute to cancer cell fitness although their precise function is currently unclear41. PLAC1, also described as a cancer testes gene42, has increased expression in both cancers, while BRCA uniquely activates CTAG2, CT83, PAGE2, PAGE2B, and PAGE5.

In BRCA, access to restricted elements of the genome may be promoted by frequent mutations in 7 genes associated with pluripotency (TP53, AKT1, KRAS, TBX3, PIK3CA, PIK3R1 and GATA3). Of these, 4 (AKT1, KRAS, TBX3, GATA3) have ≥8-fold increased expression in BRCA. PIK3CA is the most highly mutated gene in BRCA but its expression, and that of PIK3R1, are ~50% decreased. Likely related, expression of GATA4, which can induce pluripotency43, is increased >32-fold in BRCA. Interestingly, TBX3 shows 90% decreased expression in LUAD. However, LUAD has strong genetic and epigenetic selection on HOX genes, which regulate pluripotency in stem cells44. LUAD also exhibits 8-fold increased expression of HMGB3, a multifunctional gene that can promote stemness45.

Optimizing the molecular machinery of proliferation

Among the upregulated genes in BRCA and LUAD, approximately half are associated with the molecular machinery of proliferation (Table 2). This includes large, identical gene sets associated with cell cycle, mitosis, nucleosome assembly and DNA synthesis (all p < 0.001).

Table 2 Transcriptional selection (defined as >4-fold increased or 75% decreased expression) is extensively observed in both tumor types
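The threshold in the Table 2 caption can be written as a simple classification rule. The sketch below is an illustration only: the function name and labels are hypothetical, and the treatment of boundary values (strictly greater than 4-fold; a decrease of at least 75%) is an assumption, since the text uses both ">" and "≥" for these cutoffs.

```python
def classify_selection(normal, tumor, up_fold=4.0, down_fraction=0.75):
    """Label a gene by its tumor/normal expression ratio.

    Returns "increased" when tumor expression exceeds `up_fold` times normal,
    "decreased" when expression has fallen by at least `down_fraction`
    (75% by default, i.e. ratio <= 0.25), and "none" otherwise.
    """
    if normal <= 0:
        # A fold change is not defined without baseline expression.
        return "undefined"
    ratio = tumor / normal
    if ratio > up_fold:
        return "increased"
    if ratio <= 1.0 - down_fraction:
        return "decreased"
    return "none"


# Hypothetical (normal, tumor) expression pairs.
examples = {
    "GENE_UP": (1.0, 5.0),     # 5-fold increase
    "GENE_DOWN": (4.0, 1.0),   # 75% decrease
    "GENE_FLAT": (2.0, 3.0),   # 1.5-fold change, below either cutoff
}
labels = {gene: classify_selection(n, t) for gene, (n, t) in examples.items()}
```

Genes with no expression in normal tissue are returned as "undefined" here; in the text such genes are handled separately using absolute expression cutoffs.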

While the “mutator phenotype” to generate new mutations and accelerate evolution is often posited as a necessary trait of cancers, this likely has limits, as excessive numbers of mutations risk an “error catastrophe46” in which the progressively degraded cellular information is insufficient to maintain optimal fitness. We find that both cancers upregulate the same 17 genes associated with DNA repair (p = 3.0e-6). Interestingly, both tumors upregulate 15 genes associated with meiosis (p = 3.5e-10), which may participate in repair of DNA double-strand breaks47 but have also been posited to drive oncogenesis48.
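This section does not state which statistical test produced the enrichment p-values quoted for gene sets such as DNA repair and meiosis. One common choice for this kind of overlap is the upper tail of the hypergeometric distribution, sketched below as an assumption rather than as the method used here. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, pathway_size, selected, overlap):
    """Probability of seeing at least `overlap` pathway genes by chance.

    population   -- total number of genes considered
    pathway_size -- number of those genes annotated to the pathway
    selected     -- number of genes in the upregulated list
    overlap      -- number of upregulated genes that are in the pathway
    """
    max_overlap = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: 20 genes, 5 in the pathway, 5 upregulated, 3 of them in the pathway.
p_value = hypergeom_upper_tail(population=20, pathway_size=5, selected=5, overlap=3)
```

A small p-value indicates that the overlap is larger than expected if the upregulated genes were drawn at random from the gene population.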

Optimal fitness requires upregulation of the molecular machinery for genetic transcription. Consistent with this, both cancers upregulate 29 genes associated with RNA Polymerase II and 12 genes in the E2F transcription network (both p < 0.01). We find convergence of expression in both gene families as well as the APOBEC, E2F, and DDX gene families (Supplemental Fig. 15).

TP53 is the most mutated gene in LUAD and BRCA, and both cancers have a ≥4-fold increase in 7 genes associated with p53 regulation of cell cycle (p = 7.2e-5) (Table 2).

Cancer cells are assumed to be immortal and, in a seeming paradox, both cancers upregulate 10 genes related to cellular senescence (p = 7.1e-5) and 6 related to telomere maintenance (p < 0.01). This observation warrants further investigation (Table 2).

Alternative signaling pathways

While loss of tissue control is necessary for cancer initiation, alternative signaling pathways, not governed by host tissue signals, are necessary for responding to environmental opportunities and threats, as well as maintaining pro-proliferative signaling. Thus, both cancers upregulate the same 51 genes associated with signal transduction (p < 0.01), 29 common genes in the “generic” transcription pathway (p < 0.01), and 11 microRNAs associated with cancer (p < 0.01). Both tumors upregulate 6 genes encoding targets of C-MYC activation, 7 members of the p53 signaling pathway, 12 members of the FOXM1 transcription network, and 7 members of the p73 signaling pathway (p < 0.01 for all) (Table 2).

Other pathways with genes upregulated in both cancers include PLK1, Aurora A and B, Rho GTPase, progesterone, estrogen, ATM, ATR, NIMA kinases, and Syndecan-1 (p < 0.01 for all).

We find that expression of genes associated with the MAPK pathway and EGFR transduction are similar in normal breast and lung cells and further converge in breast and lung cancer. Expression of members of the Rho and G protein coupled receptors (Fig. 5) and C-C Motif chemokine ligands (Supplemental Fig. 16) converge in breast and lung cancers, suggesting an important and common role in optimizing fitness.
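One way to make "convergence" concrete is to compare how similar the two tissues are before and after transformation. The sketch below uses the Pearson correlation of per-gene expression between breast and lung as the similarity measure; this metric is an assumption for illustration and is not stated in the text, which presents the comparison graphically. Function names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def convergence_gain(normal_breast, normal_lung, cancer_breast, cancer_lung):
    """Change in breast-lung similarity from normal tissue to cancer.

    A positive value means the gene family is expressed more similarly in
    the two cancers than in the two normal tissues (convergence).
    """
    return pearson(cancer_breast, cancer_lung) - pearson(normal_breast, normal_lung)


# Hypothetical expression of four genes from one family (same gene order in each list).
normal_breast = [1.0, 4.0, 2.0, 5.0]
normal_lung = [4.0, 1.0, 5.0, 2.0]
cancer_breast = [2.0, 3.0, 4.0, 5.0]
cancer_lung = [2.2, 2.9, 4.1, 5.3]

gain = convergence_gain(normal_breast, normal_lung, cancer_breast, cancer_lung)
```

In scatter-plot terms, a positive gain corresponds to points lying closer to a common line in the cancer panel than in the normal-tissue panel.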

Fig. 5: Promoting alternative signaling pathways.

In the left panels, expression of each gene in normal breast (x-axis) and normal lung (y-axis) is plotted. In the right panels, the expression of each gene in breast cancer (x-axis) and lung cancer (y-axis) is plotted. Top panels demonstrate convergence of expression in MAPK pathway genes in breast and lung cancers compared to their normal tissue counterparts. Middle panels demonstrate changes in specific components of EGFR, RAS, and RAF signaling pathways. Note, for example, convergence on expression of EGF, NRAS, KRAS, and RRAS. Lower panels show that expression of elements of the G protein-coupled receptor signaling families is more similar in breast and lung cancers than in normal breast and lung.

Although there is broad convergent evolution in many signaling pathways, tumor-specific changes persist as a legacy of the normal tissue of origin. For example, genes from the BMP pathway show no convergence (Supplemental Fig. 16). LUADs increase expression of ESR1 nearly 8-fold, but its expression remains much lower than in BRCA, which increases expression by > 3-fold. Interestingly, BRCAs decrease expression of ESR2 by > 90%, whereas ESR2 is highly expressed in normal lung, with increased expression in LUAD. PGR is highly expressed in both normal breast and lung, with ~ 50% increased expression in LUAD and BRCA.

Cellular infrastructure

In the often-harsh intratumoral environment, cancer cells must maintain the necessary infrastructure for survival and proliferation. Both cancers upregulate 10, 11, and 11 genes associated with kinesins, chromosome maintenance, and Golgi-to-ER retrograde transport (all p < 0.01), respectively, with relative convergence in the expression patterns of genes encoding members of the actin and myosin families (Supplemental Fig. 17).

A necessary function of cancer cells is foraging as they transport diverse molecules in and out of the cell. In Supplemental Fig. 18, we demonstrate convergence in the ABCA and ABCB families. The former is associated with lipid transport49 and the latter with diverse developmental functions50. Both have prognostic significance in LUAD51,52. In contrast, we show moderate convergence within the SCLC gene family and no convergence in TMEM genes (Supplemental Fig. 19).

Ion channels and neural receptors

Differentiated mammalian cells typically maintain a high transmembrane potential, but the potential is small or absent in cancer cells53,54,55. Both cancers demonstrate strong selection on genes associated with membrane ion fluxes, including ion channels and neural receptors (Supplemental Figs. 20 and 21). BRCAs increase ( ≥ 4-fold) expression of 15 genes associated with voltage-gated ion channels, and LUADs increase expression of 7, with 3 in common. BRCAs decrease ( ≥ 75%) expression of 25 genes associated with voltage-gated ion channels, and LUADs decrease expression of 22, with 5 in common (Supplemental Fig. 20, Table 2). The pattern demonstrates tissue-specific variations in the voltage-gated ion channels primarily responsible for the transmembrane potential in normal breast and lung cells but general convergence of expression in LUAD and BRCA, particularly in Na+ voltage-gated ion channels (Fig. 6, Supplemental Fig. 21).

Fig. 6: Upregulating alternative signaling pathways.

Convergence of some potassium and sodium ion channels suggests a significant signaling role in breast and lung cancers.

Among ion channels associated with neural receptors, we see strong selection and expression convergence in the RYR family (Supplemental Fig. 22). Among other neuroreceptors, we see convergence in the 5-HT family, which promotes cyclic AMP (Supplemental Fig. 22), but not in cholinergic, adrenergic, or glutamate receptors (Supplemental Fig. 23).

Niche engineering and host response

Recent evolutionary theory recognizes the role of ecological engineering as adaptations for organisms to enhance their fitness by causing favorable changes to their environment56,57. Because these changes can persist over multiple generations, they represent a form of “ecological inheritance.”

Three competing dynamics govern the extracellular tumor environment: 1) a host response of deploying components of the extracellular matrix that can inhibit tumor growth; 2) tumor cell signaling to promote angiogenesis, blunt the immune response, and degrade host extracellular matrix to enable invasion; 3) tumor cell co-option of host mesenchymal or immune cells to produce an environment that maximizes their own fitness.

We find that 19 and 12 genes related to ECM are among the 150 most highly expressed genes in BRCA and LUAD, respectively. These include 8 collagen genes in BRCA and 6 in LUAD, with 5 of these in common (p < 0.001 for all) (Fig. 1, Supplemental Fig. 1).

Mutation and expression data in LUAD and BRCA demonstrate extensive genetic and epigenetic selection on elements of the ECM (Supplemental Fig. 24). Genes with ≥4-fold increased expression in both LUAD and BRCA are highly enriched (p < 0.001) for collagen degradation and components of the extracellular matrix including 4 collagen genes and 6 matrix metallopeptidases. In LUAD, we find high mutation rates in most collagen genes and some members of the ADAMTS, Mucin, and Fraser gene families. In BRCA, there is strong evolutionary selection applied to members of the Fibrillin and Mucin gene families.

The most highly expressed collagen genes in normal breast and lung tissue, COL1A1, COL1A2, COL3A1, COL6A6 and COL6A3, have a further 2- to 6-fold increased expression in both cancers. COL10A1 and COL11A1, which are produced by acid-adapted cancer cells58, are moderately expressed in both normal tissues but have >50-fold increased expression in both cancers. Conversely, COL4A3 is highly expressed in both normal tissues with ~90% decreased expression in both cancers. COL2A1 is highly expressed in normal breast tissue and further increased by ~10-fold in breast cancer but minimally expressed in normal lungs and increased about 2-fold in lung cancer.

Divergent patterns of expression changes are observed in COL16A1, COL17A1, COL19A1, COL3A1, COL4A3BP, COL4A4, COL4A6, COL6A1 and COL6A6. However, as shown in Fig. 7, the net result is that expression of collagen genes is more similar in BRCA and LUAD than in normal breast and lung. An even greater convergence is seen in expression of members of the ADAM gene family (Fig. 5). However, little or no convergence is seen in other gene families associated with the extracellular space such as MMP, carbonic anhydrase, or mucin (Supplemental Fig. 25).

Fig. 7: Niche construction – remodeling the extracellular matrix.

The tumor microenvironment is subject to complex dynamics that mix host response with cancer cell niche construction. In the left panels, expression of each gene in normal breast (x-axis) and normal lung (y-axis) is plotted. In the right panels, the expression of each gene in breast cancer (x-axis) and lung cancer (y-axis) is plotted. Common patterns of increased and decreased collagen genes and metalloproteinase genes result in a convergence of expression patterns (see Supplemental Fig. 19).

A key component of cancer niche construction is angiogenesis (Fig. 8, Supplemental Fig. 24). In both cancers, HIF1A is conserved (1 observed mutation in 1040 samples) with about 20% increased expression. EPAS1 (HIF2A) is highly expressed in both normal breast and normal lung (among the 150 highest expressed genes in normal lung), but its expression decreases by 75% and 90% in BRCA and LUAD, respectively. Expression of VEGFA increased by 10 to 20% in both cancers, while VEGFB and VEGFC decreased by 10–50%. Interestingly, we find broadly decreased expression of angiopoietin genes as well as decreased expression of TEK (cell surface receptor for ANGPT1, ANGPT2, and ANGPT4) in both cancers. Expression of ANGPTL4 is increased 3-fold in LUAD, and expression of ANGPTL6 is increased by a similar amount in BRCA. The net effect is relative convergence of the VEGF and ANGPT gene families (Fig. 8).

Fig. 8: Niche construction - angiogenesis.

In the left panels, expression of each gene in normal breast (x-axis) and normal lung (y-axis) is plotted. In the right panels, the expression of each gene in breast cancer (x-axis) and lung cancer (y-axis) is plotted. Both cancers increase the expression of some VEGF family members that promote angiogenesis while other members of the family as well as angiopoietin genes that promote lymphangiogenesis or vascular maturation are downregulated. The overall selection forces result in relative convergence of expression in LUAD and BRCA.

Tumor to host signaling pathways

Cancer cells can co-opt fibroblasts and other host cells. As potential tumor-to-host signaling pathways, we find both cancers significantly increase expression of some ligands while decreasing expression of their associated receptors. For example, both cancers show markedly increased expression of Neuromedin U (NMU) but markedly decreased expression of its receptor (NMUR1) (Supplemental Fig. 21), suggesting that cancer cells use the NMU protein to manipulate non-cancer cells within the tumor. NMU binding to NMUR1 in lymphoid cells, for instance, causes release of IL-25, which strongly amplifies allergic inflammation59. NMU may also promote cancer cachexia60. Similarly, EDN2 (Supplemental Fig. 21) is upregulated in both cancers while its receptors are downregulated or not expressed. BRCA highly upregulates the expression of UCN3, CGA, and CPLX3 (none of which is expressed in normal breast), which produce signaling proteins.

A more general form of tumor-to-host signaling is through production of acid58. Both cancers upregulate genes associated with glycolysis, which produces lactic acid as a byproduct (Supplemental Figs. 24 and 25). Similarly, both cancers highly upregulate CA9 expression (by 32- to 64-fold) while generally decreasing expression of other members of the Carbonic Anhydrase gene family. CA9 buffers cancer cells against acidic conditions, thus allowing cancer cells to adapt to conditions that are lethal or induce dysfunction in local host cells.

Discussion

We examined patterns of gene mutations and transcriptional variations in two cancer populations with different cells of origin, tissue environments, and mutation rates. LUAD develops in smoking-related inflamed and mutagenic environments and has direct contact with the external environment. Conversely, BRCAs emerge in tissue sheltered from environmental pathogens and cyclically respond to changing systemic hormone levels.

When carcinogenesis is viewed as an evolutionary transition, several predictable steps must be achieved by all cancers: 1) Independence from controlling host signals; 2) Acquisition of alternative signaling to optimize Darwinian fitness (proliferation); 3) Maintenance of molecular machinery for proliferation; 4) Down regulation of differentiated, whole-organism functions; 5) Promotion of phenotypic plasticity to adapt to temporally and spatially heterogeneous environments; 6) Niche construction to generate an extracellular environment that optimizes cancer cell survival and proliferation. In general, evolutionary dynamics 1 and 4 represent alterations that will be dependent on the structure and function of the normal tissue of origin while dynamics 2, 3, 5, and 6 represent adaptations broadly necessary for optimal fitness of cancer cells, as free-living protists in situ.

We investigated each of these dynamics by identifying genes and gene families under evolutionary selection. That is, we assume natural selection produces consistent patterns of co-adaptations among the genetic and epigenetic properties of cancer populations. We use directional (positive) and stabilizing (negative) selection on suites of genes as evidence of adaptive strategies across each patient cohort.

Because all replicating populations are subject to random gene mutations, we identify evolution-favored mutations as well as gene conservation based on the observed frequency of the mutation in the cohort compared to the expected frequency based on the gene size. As discussed above, this assumption differs from models in which the mutation rate is assumed to vary across the genome. We note that our assumption that the observed frequency of a mutation is due to a fitness gain or loss (with a corresponding difference in proliferation) could also be interpreted as a difference in the mutation rate. We minimize the potential error through application of stringent criteria for evidence of evolutionary selection (at least 2 standard deviations from the neutral line).
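The criterion described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the analysis: the neutral line is taken to be a least-squares line through the origin relating mutation count to gene size, and the standard deviation is computed from the residuals around that line. The function name, labels, gene names, and numbers are all hypothetical.

```python
from math import sqrt


def selection_calls(gene_sizes, mutation_counts, n_sd=2.0):
    """Flag genes whose mutation count departs from the size-based expectation.

    gene_sizes      -- gene name -> gene size (e.g. coding length in bases)
    mutation_counts -- gene name -> observed mutations in the cohort
    Returns gene name -> "favored", "conserved", or "neutral".
    """
    genes = sorted(gene_sizes)
    if len(genes) < 3:
        raise ValueError("need at least 3 genes to estimate the neutral line")
    # Assumed neutral line: expected mutations = rate * gene size.
    sxx = sum(gene_sizes[g] ** 2 for g in genes)
    if sxx == 0:
        raise ValueError("gene sizes must not all be zero")
    sxy = sum(gene_sizes[g] * mutation_counts[g] for g in genes)
    rate = sxy / sxx
    residuals = {g: mutation_counts[g] - rate * gene_sizes[g] for g in genes}
    # Residual standard deviation (one fitted parameter, hence n - 1).
    sd = sqrt(sum(r * r for r in residuals.values()) / (len(genes) - 1))
    calls = {}
    for g in genes:
        if residuals[g] > n_sd * sd:
            calls[g] = "favored"      # more mutations than expected for its size
        elif residuals[g] < -n_sd * sd:
            calls[g] = "conserved"    # fewer mutations than expected for its size
        else:
            calls[g] = "neutral"
    return calls


# Hypothetical cohort: nine genes near one mutation per 100 bases, one outlier.
sizes = {"G1": 1000, "G2": 2000, "G3": 3000, "G4": 4000, "G5": 5000,
         "G6": 1000, "G7": 2000, "G8": 3000, "G9": 4000, "G10": 2000}
counts = {"G1": 10, "G2": 20, "G3": 30, "G4": 40, "G5": 50,
          "G6": 10, "G7": 20, "G8": 30, "G9": 40, "G10": 80}

calls = selection_calls(sizes, counts)
```

Because a single strong outlier inflates both the fitted rate and the residual spread, a 2 standard deviation cutoff of this kind is conservative: only genes far from the line are called, which matches the stated intent of minimizing false evidence of selection.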

We find downregulated canonical signaling pathways related to differentiated functions and interactions with local tissue through decreased expression and/or increased mutations. This represents parallel evolution as the general evolutionary trends (independence from host control and loss of differentiated functions) are identical but the specific molecular pathways under selection are tissue specific.

In contrast, adaptations that increase the fitness of a cancer cell within host tissue, including new signaling pathways, upregulating molecular machinery of gene transcription and cell cycle, niche construction, and phenotypic plasticity, demonstrate significantly more overlap in BRCA and LUAD.

Perhaps the most predictable commonality is upregulation of the molecular machinery of proliferation. We find LUAD and BRCA upregulate a common set of core genes associated with cell cycle, mitosis, and DNA repair. However, we also find common selection on signaling pathways that promote proliferation. TP53, often described as a tumor suppressor, is the most mutated gene in both cancers, and both upregulate elements of the p53 signaling pathway, consistent with a primary role in promoting tumor fitness and proliferation61. Strong genetic/epigenetic selection on voltage-gated ion channels suggests that a reduction of the transmembrane gradient is a common strategy in LUAD and BRCA to increase proliferative signals62 while decreasing differentiated function63 and increasing pluripotency54.

Recent evolutionary theory has recognized the critical role of ecological engineering as adaptations for optimizing an organism’s fitness by causing favorable changes in its environment56,57. We find that, despite marked differences in the ecology of their normal tissue, both cancers engineer broadly similar environments through genetic/epigenetic selection of specific collagen genes which promote survival in an acidic environment and inhibit immune response. Because this niche construction can persist over multiple generations, it represents a form of ecological inheritance in which descendants inherit the environment sculpted by their forebears.

In conclusion, BRCA and LUAD, which arise from different cellular progenitors, evolve in different tissues, and rely on different mechanisms of inheritance, exhibit parallel evolutionary strategies in gaining independence from host control and downregulating the differentiated function of their specific tissue type, and convergent evolution towards strategies that optimize their fitness once they have become free-living protists in situ.