Main

Genetic diversity within the human population has been well described. The Human Genome Project resulted in the first near-complete mapping of the human DNA sequence1, and was followed by large-scale projects, such as the 1,000 Genomes Project2 and the Pangenome project3, that mapped the genetic diversity between individuals and populations. Now, there is growing recognition that extensive genetic variation exists within individuals among different tissues and cells. Two decades after completion of the first draft human genome, the Somatic Mosaicism across Human Tissues (SMaHT) Network plans to map the genetic diversity across different tissues and cells within individuals.

From fertilization onwards, the cells of the human body continuously experience damage to their genome, either from intrinsic causes or from exposure to mutagens4,5,6,7,8,9. Although the vast majority of DNA damage is repaired, and the genome is replicated with extremely high fidelity, cells steadily acquire somatic mutations throughout life. All cells within an individual harbour somatic mutations, but any given mutation is present in only a subset of the cells, or even in a single cell. Hence, somatic mutations are often described as mosaic10,11.

The detection of somatic mutations is challenging. In contrast to inherited variants, somatic mutations only exist in small and variable proportions of cells, ranging from embryonic mutations present in most cells to mutations present in just a single cell (Fig. 1a). This challenge is exacerbated by the introduction of artefacts and errors resembling low-frequency mutations during DNA library preparation and sequencing12. Current short-read sequencing technologies limit detection of mutations in repetitive regions of the genome and are likely to be less suitable for detection of somatic structural variations.

Fig. 1: Somatic mutations, causes and patterns.

a, Schematic comparison between inherited variants, an early somatic mutation and a late somatic mutation. b, Overview of causes and types of somatic mutations. EN, endonuclease; ME, mobile element; ORF, open reading frame; RT, reverse transcriptase; ssDNA, single-stranded DNA. c, Overview of the reported mutation rates of somatic SNV across developmental stages and tissues. Data of first cell divisions6,7,59 and later cell divisions6,7,59 are SNVs per cell per division. Data from fetal development of the early central nervous system (CNS)9 and placenta62 are SNVs per cell per day. Adult data are SNVs per year and estimated for seminiferous tubules48, haematopoietic stem cells26,52,144, B lymphocytes52, neurons63,145, T lymphocytes52, bronchial epithelium53, gastric epithelium146, endometrial epithelium79, hepatocytes19, small bowel epithelium19,115, colorectal epithelium19,24,29 and cardiomyocytes49. ZGA, zygotic genome activation.

Although most somatic mutations are probably functionally neutral13, some can profoundly alter the phenotype of a cell and are implicated in a wide variety of diseases. Many insights have come from sequencing the genomes of cancers14, the best-known example of disease arising from somatic mutation, but mutagenesis in tumours is often accelerated, and normal mutational patterns are distorted by genome instability. More recently, mapping the patterns of somatic mutations in normal tissues, exemplified by efforts of the Brain Somatic Mosaicism Network and other studies3,4,5,6,15,16,17,18,19,20,21,22,23,24,25,26, has identified a role for somatic mutations in developmental syndromes, neurological diseases and inflammatory disorders27,28,29,30,31,32,33,34,35,36,37,38,39. Despite these efforts, there is currently no comprehensive reference dataset of somatic mosaicism across many tissues of a large pool of donors.

In this Perspective article, we describe the SMaHT Network, initiated by the NIH Common Fund, which aims to generate a reference catalogue of somatic variation from 150 donors in 19 non-diseased tissue sites. To advance the field, the SMaHT Network will perform a comprehensive discovery and analysis of all types of somatic mutations at an unprecedented scale: the joint analysis of mosaicism across many tissues per donor; the robust discovery of structural variants (SVs) through long-read sequencing and donor-specific assemblies; and the widespread and robust application of ultrasensitive sequencing technologies, such as duplex sequencing, across sequencing centres. Furthermore, beyond applying established sequencing assays at scale, the SMaHT Network places a strong emphasis on developing technologies and computational tools that improve the detection of all types of somatic variation and enable the next generation of somatic mutation studies. Before describing the network goals in detail, we briefly review the current knowledge about somatic mutations in health and disease, as well as the technical challenges in detection of mutations.

Somatic mutations in healthy tissues

Throughout the human lifespan, from conception to death, cells acquire mutations in their DNA6,7,8,9,40 (Fig. 1b). These somatic mutations can be the consequence of erroneous repair of damaged DNA bases or DNA strand breaks, errors during replication, chromosome missegregation or the integration of mobile elements. Somatic mutations can be divided into different types41: substitutions, the vast majority of which are single-nucleotide variants (SNVs); small (less than 50 bp) insertions and deletions (indels); SVs, including segmental duplications, large deletions, translocations, inversions, mobile element insertions (MEIs) and complex SVs, including chromothripsis and chromoplexy; and other large chromosomal aberrations, such as whole-chromosome gains and losses. Duplications, deletions and whole-chromosome gains and losses are also referred to as copy number variants (CNVs) or mosaic chromosomal alterations. These classes differ profoundly in their underlying causes and patterns across tissues and their phenotypic effects on cells. In normal tissues, SNVs are by far the most common type of somatic variation, followed by indels. SVs and large chromosomal aberrations are observed less frequently27, but typically affect more base pairs and thus may have larger functional effects. However, most previous studies on somatic mutations have relied on short-read DNA sequencing, which may fail to detect various types of SVs. Studies of germline differences have shown that SVs are far more abundant, but the majority are missed by short-read approaches42.

Different mutagenic processes cause distinct patterns of somatic mutation, depending on the types of DNA damage incurred and the pathways responsible for DNA lesion repair. Research over the past decade has deconvolved these patterns into mutational signatures and linked certain signatures to specific mutagens, such as ultraviolet light, tobacco smoke, chemotherapy or natural age-related accumulation of endogenous mutations43. Mutational signatures are most commonly applied to SNVs44, but they have been defined for other classes of somatic mutations, including indels44, chromosomal alterations45,46 and SVs44,47. In the context of SNVs, mutational signatures reflect the distribution of specific base changes within their trinucleotide contexts.
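As an illustration of what a trinucleotide context means in practice, the following minimal Python sketch assigns an SNV to one of the 96 channels commonly used for single base substitution signatures (six pyrimidine-centred substitution types, each with 16 combinations of flanking bases). The function names are hypothetical and chosen for this example only; the sketch assumes that the three-base reference context and the alternate base are already known, and it is not the pipeline used by the Network.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sbs_channel(context: str, alt: str) -> str:
    """Map an SNV to its trinucleotide channel, e.g. 'A[C>T]G'.

    context: three reference bases centred on the mutated base.
    alt: the alternate base, given on the same strand as `context`.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or set(context + alt) - set(COMPLEMENT):
        raise ValueError("expected a 3-base context and a 1-base alt over A/C/G/T")
    if alt == context[1]:
        raise ValueError("alt must differ from the reference base")
    # By convention the mutated base is reported as a pyrimidine (C or T),
    # so purine-centred contexts are folded onto the opposite strand.
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def all_channels() -> set:
    """Enumerate every channel reachable from the 64 contexts and 3 alts each."""
    return {
        sbs_channel("".join(ctx), alt)
        for ctx in product("ACGT", repeat=3)
        for alt in "ACGT"
        if alt != ctx[1]
    }
```

Counting the channels observed across all SNVs in a sample gives the 96-element mutational profile that signature analyses decompose. Folding purine-centred contexts onto the opposite strand is why a G>A change and a C>T change end up in the same channel.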

All normal tissues, including post-mitotic cells, exhibit SNV mutational signatures linked to clock-like endogenous processes (single base substitution signature 1 (SBS1) or SBS5) and, to a lesser extent, oxidative damage (SBS18)48,49,50. Mutational signatures linked to mutagenic exposure can be confined to specific organs, such as UV damage (SBS7) in the skin51 or skin-resident T lymphocytes52, damage from tobacco smoke (SBS4) in the bronchial epithelium of the lung53 and exposure to a genotoxic strain of Escherichia coli (SBS88)24 in the large intestine. These exposure differences drive some of the variation in the types of somatic mutations observed across different tissues of the human body7,48,54. Furthermore, different mutational processes show different correlations with genomic features, such as replication timing, replication strand and transcription strand55,56,57,58, reflecting genomic biases of DNA damage and repair.

The somatic mutation rate varies across human tissues and life stages (Fig. 1c). During the initial embryonic cell divisions, somatic SNVs accumulate at a high rate of approximately three per division, probably due to the high division rate and delayed activation of the zygotic genome6,7,59. Afterwards, the mutation rate decreases (approximately one SNV per division) during development in utero, in both embryonic tissues, such as the fetal brain9,60,61, and extraembryonic tissues, such as the placenta62. After birth, mutation rates further decline 5–10-fold and vary substantially across tissues, from 16–20 SNVs per year in post-mitotic cells such as neurons21,50,61,63 to 44 SNVs per year in colonic stem cells24 (Fig. 1c). Germ cells have the lowest somatic mutation rate reported23, in line with the parental age effect on de novo germline mutations48. Although division rate may influence the endogenous somatic mutation rate, there are probably other factors that modulate both mutagenesis and repair of DNA damage64,65,66.
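To give a sense of scale for the post-natal rates quoted above, the following Python sketch multiplies an annual rate by age. It assumes, purely for illustration, that SNVs accumulate linearly at a constant rate after birth and it ignores mutations acquired before birth; the function name and the dictionary labels are hypothetical.

```python
def expected_postnatal_snvs(rate_per_year: float, age_years: float) -> float:
    """Expected SNVs per cell acquired after birth under a constant annual rate."""
    if rate_per_year < 0 or age_years < 0:
        raise ValueError("rate and age must be non-negative")
    return rate_per_year * age_years


# Annual rates taken from the text (SNVs per cell per year).
RATES = {
    "neurons (lower bound)": 16,
    "neurons (upper bound)": 20,
    "colonic stem cells": 44,
}

if __name__ == "__main__":
    for cell_type, rate in RATES.items():
        print(cell_type, expected_postnatal_snvs(rate, 60))
```

Under these simplifying assumptions, a 60-year-old donor would carry on the order of 1,000 post-natal SNVs per neuron and between 2,000 and 3,000 per colonic stem cell.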

Large somatic mutations such as SVs, chromosomal alterations and MEIs are detected much less frequently than SNVs and indels. Although somatic aneuploidy appears to be rare, sub-chromosomal structural variations affect 13–41% of neurons18,34,67,68. Frequent CNVs, mostly duplications of likely developmental origin, have been detected in approximately 7% of brains from the Brain Somatic Mosaicism Network consortium34, and mosaic chromosomal alterations have been observed in approximately 5% of blood samples in the UK Biobank69. Single-neuron DNA sequencing of mobile element-enriched libraries or whole genomes has revealed MEI events that appear to occur during development and create mosaicism in the human brain5,70,71. Bulk sequencing approaches have also detected a few examples of somatic MEIs in the brain72 and non-brain tissues including the heart73, fibroblasts73 and liver74. Recent somatic MEI profiling in colorectal epithelial single-cell clones has indicated peak insertion rates during early embryogenesis75. Considering the potential effect of these large mutations on the sequence, splicing or expression of genes76,77, it is valuable to understand their prevalence across human tissues during development and ageing.

Although most somatic mutations do not discernibly affect the phenotype of a cell, some somatic mutations are under selection in different tissues. Such driver mutations may lead to a proliferative advantage or increased survival of the cell and its progeny, resulting in clonal expansions in tissues. Cancer is the canonical example of somatic evolution and often involves the stepwise accumulation of key somatic mutations and genomic instability1,78. Mutations typically associated with cancer can be abundant across normal tissues with age. For comparison, in a typical individual of 60 years of age, approximately 90% of the endometrial epithelium harbours a driver mutation79, whereas this is true of only about 1% of the colonic epithelium24, despite the latter having a much higher somatic mutation rate24,79. This difference is probably caused by the menstrual cycles of shedding and regrowth in the endometrium. Probably due to similar clonal expansion in development or ageing, about 6% of individuals harbour a 3–20-fold higher than average number of detectable SNVs in their brain34. These varying proportions of clonally expanded cell populations probably reflect differences in tissue architecture, cell turnover, regeneration and selection pressures, but much is still unknown.

Although many driver mutations in normal tissues can be identical to those found in corresponding cancer types, their abundance and phenotypic consequence may differ profoundly as normal tissues may experience different selection pressures than cancer. For example, clones with NOTCH1 mutations are exceedingly abundant in normal oesophageal epithelium, at even higher rates than oesophageal cancers80. NOTCH1-mutant clones have a lower propensity of malignant transformation and even outcompete precancerous clones in the oesophagus81,82. These observations suggest that characterizing the somatic mutation landscape in normal individuals will be important to understand the role of these mutations in pathological phenomena such as cancer.

Finally, somatic mutations can be used as intrinsic barcodes to create phylogenies and trace the ancestries of cells, such that it becomes possible to quantitatively study human development from somatic mutations ascertained in adult donors6,7,20,25,40,58,59,60,70,83. This approach has been applied to studying embryogenesis, clonal expansions across the lifespan and the origins of childhood cancers84. As the allele frequency of a mutation reflects the fraction of cells within a population that harbours it, this method can be used to quantitatively assess the contribution of embryonic progenitors to the adult body. Such studies have found that one of the two daughter cells of the zygote often has at least twice as many descendant cells as the other6,7,20,25,40,58,59,60,70,83,85,86, probably due to cellular bottlenecks in embryogenesis, developmental cell death or migratory patterns, and confirming earlier observations in mice40,87.
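The relationship between allele frequency and cell fraction can be sketched as follows. This minimal Python example assumes a heterozygous mutation on a diploid autosome, so that each mutant cell carries the variant on one of its two copies and the fraction of mutant cells is twice the variant allele frequency. The function name, its parameters and the read counts in the usage note are hypothetical.

```python
def cell_fraction(alt_reads: int, total_reads: int,
                  copies_per_cell: int = 2, mutated_copies: int = 1) -> float:
    """Estimate the fraction of cells carrying a mutation from read counts.

    Assumes every mutant cell carries `mutated_copies` of the variant among
    `copies_per_cell` copies of the locus (defaults: heterozygous, diploid).
    """
    if total_reads <= 0 or not 0 <= alt_reads <= total_reads:
        raise ValueError("need 0 <= alt_reads <= total_reads and total_reads > 0")
    vaf = alt_reads / total_reads
    # Sampling noise can push the estimate above 1, so cap it.
    return min(1.0, vaf * copies_per_cell / mutated_copies)
```

For example, a mutation supported by 50 of 300 reads has a variant allele frequency of about 17%, which under these assumptions corresponds to roughly one-third of the sampled cells. If that mutation marks one of the two daughter cells of the zygote, the other daughter cell would account for the remaining two-thirds, which matches the kind of asymmetry described above.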

Together, these initial studies on somatic mutations in normal tissues have shown the variability of rates, patterns and selection of mutations across tissues. It is unknown, however, how variable these patterns are between individuals and how different types of somatic mutations are correlated with inherited genetic background, environmental exposures or other behavioural characteristics. In addition, mutation discovery is severely hampered in poorly mapped regions of the genome, including acrocentric chromosomes, centromeric and repetitive regions, and, hence, the mutational patterns in these regions are largely unknown. Thus, identification of the differences in mutational patterns between tissues and individuals, particularly in the context of specific organs21,26,29,30,34,88,89, may have profound clinical implications.

Somatic mutations and disease

Somatic mutations can profoundly alter the phenotype of a cell and have been implicated in human diseases. Besides cancer, various other diseases and conditions can be a result of somatic mutations, including cardiovascular anomalies, immunological and neurological disorders26,30,31,32,33,34,52,90,91. Of note, early somatic mutations can cause clonal expansions and alterations in the differentiation programs of precursor cells that subsequently can lead to paediatric cancers and organ overgrowth84,92,93. Among the first described instances of somatic mutagenesis, PI3K–AKT–mTOR pathway mutations involving the brain were associated with brain malformations leading to intractable epilepsy33,94,95. Other examples are NRAS mutations leading to congenital melanocytic nevi96 and UBA1 mutations in haematopoietic stem cells97 leading to VEXAS syndrome, a rare and severe inflammatory disorder. Somatic expansions of short tandem repeats in the brain can cause cell death and neurodegeneration98, and underpin Huntington disease99,100. Large SVs, including CNVs and MEIs, have also been implicated in neurodevelopmental and neurodegenerative disorders72,101,102.

The effects of somatic mutations can be highly specific to the timing and tissue of origin. For example, an activating PIK3CA mutation acquired during development can lead to widespread overgrowth across organs and vascular malformations91. However, PIK3CA mutations acquired after development can lead to cavernomas in the brain103 and are also a common driver mutation observed in normal colonic24 and endometrial epithelium79.

Clonal expansions can also indirectly lead to or influence other diseases26. An example is clonal haematopoiesis of indeterminate potential (CHIP), characterized by a clonal expansion within the haematopoietic stem cell compartment driven by somatic mutations. CHIP is highly prevalent in the context of normal ageing26. Besides acting as a potential cancer precursor clone, CHIP has been linked to various non-cancer diseases, such as an increased risk of cardiovascular disease104 and infections105.

Conversely, diseases can also select clones with certain adaptive somatic mutations. Recently, inflammatory bowel disease has been shown to lead to the preferential remodelling of the colonic epithelium with clones harbouring IL-17 and Toll-like receptor pathway mutations29,106. Likewise, chronic liver disease selects for clones of hepatocytes that escape the toxicity imposed by the disease, notably, by recurrent, independent mutations in FOXO1, CIDEB and GPAM, which are all involved in lipid metabolism89.

Together, research over the past years has shown that somatic evolution is ubiquitous in normal tissues and is fundamental to our understanding of the causes, mechanisms and consequences of disease, and the normal process of ageing.

The SMaHT Network

The SMaHT Network, funded by the NIH Common Fund, was established with the goal of transforming our understanding of how somatic variation in human cells influences biological processes. The SMaHT Network will accomplish this through the following aims: (1) generate a comprehensive dataset of somatic variants across human tissues (Fig. 2); (2) develop tools and technologies to optimize detection and characterization of various types of somatic variants; and (3) create a somatic mutation database that is widely used by researchers and the wider public, and interoperable with similar datasets.

Fig. 2: Tissue sampling.

Overview of sampling from 19 primary tissue sites, spanning three developmental germ layers (endoderm, mesoderm and ectoderm) and germ cells. Although organs represent a mixture of cells derived from the germ layers (for example, skin epidermis (ectoderm) versus dermis (mesoderm), and adrenal gland medulla (ectoderm) versus cortex (mesoderm)), we have indicated the major germ layer represented by each organ. Gonads represent germ cells and their supportive structures (mesoderm), whereas buccal swabs are variable mixtures of germ layers (mesoderm and ectoderm).

The Network comprises five Genome Characterization Centres (GCCs), 14 Tool and Technology Development projects (TTDs), an Organizational Centre (OC), a Data Analysis Centre (DAC) and a Tissue Procurement Centre (TPC), and includes over 250 researchers from 52 institutions. The GCCs are tasked with producing a core dataset of somatic mutations for the SMaHT Network from multiple tissues collected by TPC, whereas TTDs are tasked with developing novel experimental assays and computational tools. The DAC will integrate the data generated by GCCs and TTDs to build the somatic mutation catalogue, data portal and the analysis work bench for the Network. The OC will coordinate the Network activities and focus on outreach efforts and building liaison with other genomics consortia. The SMaHT Network has implemented a set of policies (https://smaht.org/policies/), including a policy to allow external researchers to apply for associate membership of the Network.

The tissues to be profiled by the Network include those arising from the three germ layers and germlines within the human body, which will give the opportunity to delineate early somatic mutations that are common across all tissues, as well as later mutations that are unique to certain tissues (Fig. 2). The TPC is partnering with multiple organ procurement organizations (OPOs) in the USA for the screening, authorization and recovery of tissues from post-mortem organ and tissue donors. Tissues will be collected following transplant recovery and include the ascending and descending colon, oesophagus, lung and liver (predominantly endoderm); blood, heart, aorta and skeletal muscle (predominantly mesoderm); and the brain, adrenal gland, sun-exposed and non-sun-exposed skin (predominantly ectoderm). We also aim to collect buccal swabs to assess the extent of the somatic mutation landscape that can be gleaned from clinically accessible tissues in living donors. To study mutagenesis in germ cells, we also aim to collect ovaries and testes. Finally, to enable various experimental techniques requiring live cells, we will derive fibroblast cultures from the dermis (skin). All tissues are requested to be recovered from each donor approached for the SMaHT tissue collection. The number and type of samples collected from each donor will vary based on donor authorization and eligibility (Box 1), but the goal is to recover as many tissues from a single donor as possible. To study the mechanisms and consequences of somatic mosaicism across the lifespan, these post-mortem donors will span the human adult age ranges from 18 to over 85 years. The race and ethnicity of donors are assessed using a single-question framework.

To maximize the scientific and clinical impact of the dataset, the TPC will collect a large amount of donor metadata during donation and biospecimen collection, building on practices developed for the Genotype-Tissue Expression (GTEx)107 and developmental GTEx projects108. De-identified donor-level data will include demographic information, medical history, sample-based laboratory test results and death circumstances. Sample-level data will include tissue type and location, ischaemic time and tissue metrics from pathology review. Pathology images will be made publicly available. When possible, tissue sampling will align with the common coordinate framework structure of other large-scale projects. For all of these biospecimens, sufficient fresh-frozen material will be collected and banked to enable all core assays as well as implementation of novel emerging technologies. Fixed samples for pathology review will be collected from adjacent sites to the fresh-frozen specimens utilizing a standardized collection schema developed for each tissue type.

To pursue a demographically robust and evenly sex-distributed pool of donors, the SMaHT Network includes an ethical, legal and social implications project109 consistent with the recommendations of the American Society for Human Genetics to address under-representation in human genomics studies with meaningful engagement of under-represented communities109. This ethical, legal and social implications substudy engages geographically, racially, ethnically and socioculturally diverse stakeholders, which include family decision-makers, tissue requesters, community advisory board members and multi-disciplinary specialty committee members throughout the entire duration of the SMaHT Network. Feedback from community stakeholders will be leveraged to inform communication and enrolment efforts as well as dissemination of study findings.

The SMaHT Network is uniquely positioned to collaborate with many other large consortia and programmes. These include the Human Pangenome Reference Consortium9, to leverage methods for constructing haplotype-phased genome assemblies; the Impact of Genomic Variation on Function Consortium110, to understand the functional consequences of genetic variation; the developmental GTEx project108, to access datasets from tissues at early developmental stages; the Human Tumor Analysis Network and PreCancer Atlas, to further understand the progression from normal cells to tumour cells through somatic mutations; and PsychENCODE111, to inform on the phenotypic consequences of brain somatic mosaicism. These collaborations will enrich the individual studies and, ultimately, through data integration and cross-network analyses, further enhance our understanding of the context and consequences of somatic mutations.

Producing the somatic mutation catalogue

To produce the first phase of the somatic mutation catalogue, the SMaHT Network will strike a balance between standard genomic assays, productionized and applied uniformly by the GCCs to all tissues, and bespoke assays developed by the TTD projects, focusing on novel technological approaches. As part of the initial phase of the SMaHT project, benchmarking efforts are nearing completion, using both primary human tissues and cell lines. We have used this benchmarking to determine optimal sequencing coverage, compare the accuracy of variant calling algorithms, and evaluate the utility of long-read and short-read sequencing data generated on diverse sequencing platforms from multiple GCCs.

The GCCs will deploy three core assays across all tissue specimens that meet quality thresholds: deep short-read whole-genome sequencing (WGS; over 300× coverage), long-read WGS (over 30× coverage) and RNA sequencing (over 50 million reads). The deep short-read WGS will enable the discovery of high-allele-frequency somatic mutations that were acquired early in embryogenesis and are shared across tissues, as well as discovery of the large clonal expansions arising later in life. As these core assays will be performed on bulk tissues, composed of heterogeneous cell types, only mutations with a relatively high variant allele frequency (above 1–2%) will be accurately detectable at the proposed depth of sequencing. The long-read WGS will facilitate the detection of complex SVs, MEIs and variants in complex genetic loci that have been challenging to accurately study using short-read data, such as the MHC region, centromeres, telomeres, acrocentric DNA including ribosomal DNA and other tandem-repeat regions of the genome. Ultra-long-read sequencing will enable us to generate near telomere-to-telomere donor-specific reference genome assemblies for at least 50 donors and, by reducing misalignment, enhance the discovery of diverse types of variants within an individual50, including complex somatic SVs and other mutations in previously unmappable regions of the genome112. Finally, RNA sequencing may allow us to assess transcriptional consequences of early mutations and late clonal expansions, as well as, by comparison with single-cell RNA sequencing atlases113, cell-type composition of heterogeneous tissues.
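The link between sequencing depth and the lowest detectable variant allele frequency can be illustrated with a simple binomial model. The Python sketch below computes the probability of sampling at least a minimum number of variant-supporting reads at a site. The threshold of three reads is a hypothetical choice for this example, and the model ignores sequencing errors, mapping artefacts and coverage variability, so it should be read as a rough upper bound on sensitivity rather than as the behaviour of any particular variant caller.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_alt_reads: int = 3) -> float:
    """Probability of seeing at least `min_alt_reads` variant reads at a site.

    Models the variant read count as Binomial(depth, vaf).
    """
    if depth < 0 or min_alt_reads < 0 or not 0.0 <= vaf <= 1.0:
        raise ValueError("need depth >= 0, min_alt_reads >= 0 and 0 <= vaf <= 1")
    # Sum the probabilities of observing fewer than min_alt_reads variant reads.
    p_below = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min(min_alt_reads, depth + 1))
    )
    return 1.0 - p_below


if __name__ == "__main__":
    for vaf in (0.001, 0.01, 0.02):
        print(f"VAF {vaf:.3f} at 300x: {detection_probability(300, vaf):.3f}")
```

Under this model, a 300× genome samples a 2% variant with at least three supporting reads in most cases, a 1% variant in slightly more than half of cases, and a 0.1% variant only rarely, which is consistent with the 1–2% detection floor stated above.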

In addition to these core assays, GCCs will deploy three approaches specifically designed to profile low-frequency somatic mutations: duplex sequencing, single-cell WGS and transcript-based detection of mutations. These technologies, although published and well-tested, represent recent innovations and have not yet been systematically deployed across sequencing centres or applied at large scale.

As conventional DNA sequencing platforms have a non-trivial sequencing error rate (in the order of 1 in 1,000–10,000), a putative mutation needs to be detected in multiple independent reads to assure it is not artefactual. However, by sequencing both the forward and the reverse strands of each individual DNA duplex molecule, this error rate is drastically reduced. As the reduced error rate is much lower than the expected number of somatic mutations in most tissues, an average mutation burden and mutational profile can be obtained by shallow genome-wide duplex coverage (0.5–2×)63,114. Duplex sequencing of bulk tissue samples is well suited to finding average mutation burdens and spectra of SNVs and indels within cell populations, but the low depth generally precludes discovery of somatic CNVs and SVs, or the precise inference of variant allele frequency of specific mutations.
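The principle behind duplex error suppression can be sketched in a few lines of Python: a base is accepted only when the reads derived from the forward strand and those derived from the reverse strand of the same original molecule independently agree. The function names and the thresholds (at least two reads per strand, 90% agreement within a strand) are hypothetical, and the sketch assumes that reads have already been grouped by molecule and strand and that all bases are expressed in reference orientation. Published duplex protocols differ in these details.

```python
from collections import Counter


def strand_consensus(bases, min_reads=2, min_agreement=0.9):
    """Return the consensus base for one strand's read family, or None."""
    if len(bases) < min_reads:
        return None  # too few reads to trust this strand
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) >= min_agreement else None


def duplex_consensus(forward_bases, reverse_bases):
    """Call a base only if both strands of the molecule support it."""
    forward = strand_consensus(forward_bases)
    reverse = strand_consensus(reverse_bases)
    if forward is None or forward != reverse:
        return None  # disagreement suggests damage or a sequencing artefact
    return forward
```

For instance, `duplex_consensus("TTT", "TTT")` returns `"T"`, whereas `duplex_consensus("TTT", "CCC")` returns `None`, because a change seen on only one strand is more likely to reflect DNA damage or an amplification error than a true mutation.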

Even with a reduced sequencing error rate, bulk DNA sequencing will average out the mutational patterns of all cells and does not allow assessment of the variability of mutational patterns between cells or the reconstruction of cell lineages. Instead, sequencing the DNA of single cells or single-cell-derived clones will enable the most detailed discovery of somatic mutations. This can be achieved either by expanding single cells in vitro6,25,26,52,59 or by using laser capture microdissection to isolate naturally occurring clonal populations of cells7,24,79,88,115.

Alternatively, direct single-cell DNA sequencing is applicable to all cell types, including non-dividing cells. However, whole-genome amplification can cause allelic or locus dropout, uneven coverage across the genome and artefactual variants introduced during biochemical amplification. The direct library preparation (DLP+)116,117 method avoids whole-genome amplification and allows for the accurate detection of CNVs at the single-cell level and other mutations at the population level. The primary template-directed amplification (PTA)30,118 method offers a substantial improvement in data quality over previous single-cell amplification methods, resulting in more uniform genome coverage and fewer artefactual variants. A more recent version of PTA, the ResolveOme approach, profiles both the transcriptome and the genome from the same single cell. If validated, this approach will represent a major advance in allowing new mutation detection and cellular phenotyping at the same time. Profiling somatic mutations in single cells will enable us to characterize mutational patterns and associations between mutation types and to reconstruct phylogenetic trees of normal cells across tissues. In polyploid cells, the variant allele frequencies of somatic mutations may deviate from the expected 0.5, and ploidy will need to be taken into consideration in downstream analyses.

Finally, at least some somatic mutations can be inferred from RNA119,120,121. Methods that allow for the interrogation of the full-length transcriptome in single cells, such as Smart-seq3 (ref. 122) or STORM-seq123, can facilitate the detection of somatic mutations, such as SNVs, indels and fusion genes within transcribed regions of the genome. This allows assessing cell-type specificity for clonal expansion of certain genetic variants. Furthermore, STORM-seq enables quantification of transposable element expression at single-cell resolution, which has been shown to be challenging with other single-cell RNA sequencing methods124. The single-cell data also provide references for a more precise deconvolution of cell types in bulk tissues.

Each of these methods for the detection of somatic mosaic variants presents its own advantages and disadvantages and thus they are complementary (Table 1). For example, although genome-wide duplex sequencing has a lower sequencing error rate and excels at population-level inferences of patterns of short mutations acquired during the entire lifespan, the low depth precludes detection of the precise allele frequency of a specific variant. Bulk sequencing at medium–high coverage (300×) will only detect variants at a sufficiently high frequency (that is, 1–2%) in tissues, which are mostly acquired in early embryogenesis. Single-cell sequencing can in principle detect all variants present in a single cell and allow reconstruction of cell phylogenies, but it requires significant costs and efforts to address genome amplification artefacts. RNA-based mutation discovery allows for direct integration of mutations with transcriptomic information but is naturally confined to expressed regions of the genome. Together, these genomic assays function as complementary techniques to detect somatic mutations and will enable the robust interrogation of mutational patterns across human tissues.

Table 1 Comparison between somatic mutation discovery methods

Areas of technological development

As new technologies to interrogate somatic mutations with high resolution or sensitivity are constantly emerging, a large part of the SMaHT Network is devoted to developing new tools and technologies (Box 2). The first area of innovation aims to increase the accuracy of mutation detection in single cells or molecules by further reducing background noise. For single-cell WGS, a limited cloning step to create small pools of cells can reduce allelic dropout and amplification artefacts. In parallel, the SMaHT Network aims to reduce the error rate of amplification and sequencing for single cells and molecules through various adaptations of duplex sequencing technologies63,125,126,127. These approaches will allow for the interrogation of the landscape of somatic variation in single cells and complex multicellular tissues with high precision, which is crucial to study tissues without large-scale expansions.

Second, the SMaHT Network aims to increase the sensitivity of SV detection to single molecules or cells. As SVs extend beyond the length of a typical short read, long-read sequencing unlocks SV detection across the genome, especially for MEIs and other rearrangements in repetitive regions128,129. However, many single-cell DNA amplification approaches result in short fragments. Therefore, we are applying long-read sequencing to clonal populations such as induced pluripotent stem cell lines, which have been used25 in lieu of single cells for lineage reconstructions as they can be expanded and analysed by bulk sequencing, avoiding in vitro DNA amplification. In addition, MEIs can be cost-effectively assessed by target enrichment assays, as new insertions share conserved sequences within each transposon subfamily. We are developing targeted detection of MEIs by utilizing Cas9-targeted long-read sequencing130 and PTA-amplified micro-bulk or single cells73. These efforts will unlock the study of SVs and MEIs in all tissues and across the lifespan, even in the absence of clonal expansions.

Third, the SMaHT Network will develop scalable platforms that can perform variant detection spatially in human tissues, through single-cell DNA and RNA sequencing with resolved spatial barcodes131,132. This will allow us to study the prevalence and extent of clonal expansions across ages and tissues, especially in organs without a clearly organized tissue architecture.

An outstanding question is the effect of specific somatic mutations on the phenotype of the cells that harbour them. Although certain mutations are under positive selection and lead to clonal expansions, how these mutations alter cellular phenotypes is mostly unknown. The consequence of a mutation can be assessed by combining mutational readouts, obtained either through genotyping of specific mutations133,134 or through genome sequencing, with functional readouts of cells, such as the transcriptome, proteome, epigenome, methylome and the chromatin accessibility landscape135,136,137,138,139,140. Interpreting the phenotypic effects of somatic mutations will greatly benefit our understanding of their clinical consequences.

The efforts in tool and technological development within the SMaHT Network are focused on improving precision in somatic mutation detection and interpretation at scale. Each effort addresses a vital shortcoming of current assays, with the goal of bringing many of these tools into production and deploying them across the Network at large. After the development phase, the precise extent and scope of the deployment of these assays across the SMaHT tissues and donors will depend on the cost, scalability and priorities of the Network.

Integration and analysis of data

The low variant allele frequency of mosaic variants brings unique challenges in bioinformatic analysis141, and we expect that novel computational methods and tools will be needed to fully analyse the data and to increase the sensitivity and specificity of variant detection. Somatic mutation detection algorithms developed in cancer genomics are often inadequate for detecting variants with allele fractions less than 2–5%, and simply increasing the depth of sequencing is not cost-effective. Thus, more sophisticated machine learning algorithms that efficiently incorporate various local features near candidate variants may prove useful136,137,138,142.
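Why depth alone scales poorly can be seen by asking how much coverage a simple read-count threshold would need at different allele fractions. The following sketch is a back-of-the-envelope illustration, not a method used by the Network: it assumes binomial sampling of reads, a hypothetical calling threshold of three supporting reads and a hypothetical target detection probability of 0.95, and it ignores sequencing errors, which in practice make the problem harder. The function names are our own.

```python
from math import comb


def prob_at_least(depth: int, vaf: float, min_alt_reads: int) -> float:
    """P(X >= min_alt_reads) for X ~ Binomial(depth, vaf)."""
    p_too_few = sum(
        comb(depth, k) * vaf**k * (1.0 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


def min_depth_for_power(
    vaf: float,
    min_alt_reads: int = 3,   # illustrative threshold (assumption)
    power: float = 0.95,      # illustrative target probability (assumption)
    max_depth: int = 100_000,
) -> int:
    """Smallest depth at which a variant at `vaf` yields at least
    `min_alt_reads` supporting reads with probability >= `power`."""
    # The tail probability increases with depth, so the first depth that
    # reaches the target is the minimum.
    for depth in range(min_alt_reads, max_depth + 1):
        if prob_at_least(depth, vaf, min_alt_reads) >= power:
            return depth
    raise ValueError("target power not reached within max_depth")


for vaf in (0.05, 0.02, 0.01, 0.001):
    print(f"VAF {vaf:.3f}: depth needed ~ {min_depth_for_power(vaf)}x")
```

In this simplified model the required depth grows roughly in inverse proportion to the allele fraction, so each tenfold drop in allele fraction calls for about ten times more sequencing per sample.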

Other challenges include optimal integration of long-read and short-read data, inference of lineage relationships based on bulk and single-cell data, and effective strategies for integrative and comparative analysis of samples across tissues and across individuals. An important aspect of our analysis will be the use of donor-specific diploid genomes assembled using short Illumina reads, long PacBio reads, ultra-long Nanopore reads and Hi-C reads. Alignment to the donor-specific reference genome135 will allow for more accurate variant identification, especially in repetitive regions, as well as for examination of allele-specific transcriptional and epigenetic modulations associated with genetic variants.

The SMaHT DAC will lead an effort to collect, curate and analyse the vast amount of multi-modal data generated on multiple platforms and to create a data resource for the scientific community. The DAC will ensure high data standards with various quality control steps and compile extensive metadata describing experimental and data processing protocols, following the FAIR (Findable, Accessible, Interoperable and Reusable) guidelines143. Scalable and cost-effective analytical workflows will be implemented on a cloud platform with full provenance and docker images to enable reproducibility of the analysis output.

The data generated by the consortium will be made available to the wider scientific community via a user-friendly and secure web portal (https://data.smaht.org). This portal will feature: (1) a reference catalogue of somatic variants that can be searched (for example, by locus, tissue or phenotypic features such as age) and annotated with information from other genomics databases; (2) a workbench that enables users to apply the computational pipelines developed by the SMaHT Network to their own data; and (3) data visualization tools including a multi-scale browser that allows users to navigate the data from a genome-level view to the sequencing read-level view. Visual inspection of variants using such a browser will be particularly helpful in assessing their quality, and the annotations will enable rapid identification of variants that may be functionally relevant.

Conclusion

The SMaHT Network aims to produce a comprehensive reference catalogue of somatic mutations, across tissues and individuals, by harnessing the full potential of many different genomic assays, including short-read and long-read bulk WGS, duplex sequencing, ultra-long-read sequencing, single-cell DNA sequencing and RNA sequencing (Fig. 3). The Network will develop new tools and technologies to increase our ability to detect somatic mutations as well as infer their phenotypic consequences at greater resolution. All of these various data modalities will be integrated, analysed and released to the research community and wider public.

Fig. 3: Methods, assays and questions.

Overview of sampling methods and sequencing assays deployed in the SMaHT Network, as well as the biological questions, outcomes and inferred mutational patterns from downstream analyses of the catalogue of somatic mutations across normal tissues, including mutation rates or burdens, selection, lineage tracing and mutational signatures (reference signatures were obtained from https://cancer.sanger.ac.uk/signatures)44. ZMW, zero-mode waveguide.

An extensive catalogue of somatic mutations will reveal mutational patterns, rates and signatures across tissues, allowing us to infer the biological and molecular processes that govern somatic mutagenesis and their adaptive and maladaptive consequences for development and disease (Fig. 3). Our assays can inform on mutations under selection in tissues, which result in clonal expansions and potentially tissue dysfunction. Single-cell analyses added to the bulk readouts will further allow us to generate cellular phylogenies of human development, infer embryonic differentiation dynamics and improve our future assessment of de novo germline mutations.

Delineating the full extent of somatic mosaicism greatly exceeds the scope of the Human Genome Project. A typical cell may acquire hundreds to thousands of somatic mutations in a lifetime. There are trillions of cells in a human body, and so the total number of somatic mutations acquired in a single individual may well exceed quadrillions (10^15), millions of times the size of the human genome. Beyond cataloguing somatic variation across tissues, the SMaHT Network provides the opportunity to understand the causes, patterns and consequences of somatic mutations in normal cells, and to provide a crucial comparison baseline for disease research. The efforts of the SMaHT Network will substantially contribute to our insights into the role of somatic variation in health, ageing and disease.
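The order-of-magnitude estimate above can be written out explicitly. The values used here are assumptions for illustration rather than figures from this article: a cell count of the order of 10^13 (N_cells), a per-cell burden of 10^2 to 10^3 mutations (m), and a haploid genome size of about 3 × 10^9 base pairs (G).

```latex
\begin{align*}
N_{\text{total}} &\approx N_{\text{cells}} \times m
  \approx 10^{13} \times \left(10^{2}\text{ to }10^{3}\right)
  = 10^{15}\text{ to }10^{16}, \\
\frac{N_{\text{total}}}{G} &\approx \frac{10^{15}\text{ to }10^{16}}{3 \times 10^{9}}
  \approx 3 \times 10^{5}\text{ to }3 \times 10^{6}.
\end{align*}
```

Under these assumed values, the ratio reaches the millions towards the upper end of the range, and higher still if the cell count is taken to be several times 10^13.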