Introduction

Genetic diversity and speciation rate play major roles in the evolution of species and clades. Genetic diversity measures the level of polymorphism of DNA sequences among individuals within a species. Understanding how and why genetic diversity varies across species is one of the main questions in population genetics1,2. Research in this area has focused on the role of species ecology (e.g., life-history strategies3,4), demographic changes (e.g., recent bottlenecks or expansions5,6), geographic structure and selection in driving variable levels of genetic diversity7,8, as well as on the consequences of genetic diversity on species adaptation and survival9,10.

Speciation rate measures the frequency at which a given species gives rise to two daughter species. Together with the rate at which species go extinct, it determines how species richness varies in time and across species groups. Understanding how and why speciation rate varies across species is thus one of the main questions in macroevolution11,12,13,14,15,16. Research in this area has focused on the role of species ecologies and the environment they experience (e.g., whether they occur in tropical or temperate biomes) in modulating these rates17,18,19.

In comparison with the wealth of studies investigating the factors that modulate either genetic diversity or speciation rates, and despite the interdependence between genetic and speciation processes20, the genetic diversity-speciation relationship remains poorly characterized. More generally, there is a gap in our understanding of the interrelationship between species ecology, the environment, population genetics, and speciation rates21,22,23,24,25,26. Only a handful of studies have investigated the relationship between speciation rates and population divergence rates27, population structure28,29, substitution rates30,31, or genetic diversity32,33, with contrasting results. The only two studies we are aware of that focus on genetic diversity found either no association32 or a weak negative association33 between genetic diversity and speciation rates.

The relationship between genetic diversity and speciation rates is crucial, but what type of relationship should we expect? Species-wide genetic diversity is the product of mutation rate and the effective population size of the species (Ne34). In an idealized species constituted of a single Wright-Fisher population with random mating and no selection, Ne is simply the census population size, and genetic diversity is thus higher in more abundant species and/or species with high mutation rates. Given that abundant species with wider ranges are more likely to be hit by isolating mechanisms such as geographic barriers, and that high mutation rates increase the rate at which populations acquire substitutions and thus reproductive isolation20,30, we could expect speciation rates to be higher in lineages with high genetic diversity. This positive association could be bolstered by intraspecific geographic structure (limited dispersal between sub-populations), which is the premise of allopatric speciation and tends to increase Ne (and thus genetic diversity), although the effect of geographic structure on species-wide genetic diversity is highly dependent on the details of the migration process35.

A positive genetic diversity-speciation rate association could also be related to divergent natural selection, which promotes fast speciation according to the theory of adaptive radiations36,37, and is more likely to occur in populations with many polymorphic alleles to act upon38,39,40,41. Speciation itself potentially maintains high genetic diversity within species by increasing the number of interacting species, as proposed by the diversity begets diversity hypothesis42. The proposed mechanism is that biotic interactions increase opportunities for intraspecific divergent selection; this can maintain high genetic diversity within species on genes involved in the interactions or linked loci43. A positive genetic diversity - speciation rate association could also be strengthened by external factors, such as temperature, which induces interspecific variation in mutation rates (e.g. related to latitude) that have either a direct flow-through effect on speciation rates as proposed by the evolutionary speed44 and metabolic45 theories of biodiversity, or are indirectly linked through the effect of temperature on other factors that modulate speciation rates, such as climatic stability and productivity46.

Alternatively, there are potential causes of a negative diversity-speciation rate association. Species with low genetic diversity (reflecting small effective population size) tend to accumulate reproductive incompatibilities faster due to the reduced efficacy of purifying selection, which could lead to higher speciation rates47,48,49. In the other direction, speciation can reduce genetic diversity through bottleneck effects50,51. Previously observed bursts of molecular evolution associated with speciation events52,53 support this inter-relationship between speciation, small effective population sizes and genetic effects. If speciation is instead adaptive, and not limited by standing genetic variation, we can expect positive selection to reduce genetic diversity by increasing heritable variance in fitness among individuals (which reduces Ne) and by fixing beneficial alleles54, while spurring speciation by driving populations towards distinct adaptive peaks. Heterogeneity in the time it takes for dividing populations to complete speciation, which can be related to different ecological, geographic or genomic contexts, could also generate a negative association between genetic diversity and speciation rate. Species for which speciation takes a long time to complete will tend to accumulate more genetic variation (as they encompass increasingly genetically differentiated sub-populations) while having lower speciation rates55. This is expected when there are few speciation initiation events and frequent population extinctions: in this case, completing speciation fast is key to induce speciation events56. Finally, if geographic structure spurs speciation but decreases rather than increases species-wide genetic diversity, as expected if some populations contribute many more migrants than others57,58,59, we also expect to find a negative genetic diversity - speciation rate relationship.

In addition to such potential mechanistic links between genetic diversity and speciation rates, an indirect association could also arise from independent factors that correlate with both genetic diversity and speciation rates, as already mentioned for temperature. Another important (somewhat related) example is latitude, which correlates with many biologically important variables: prior work has found lower levels of genetic diversity at temperate latitudes, generally attributed to recent climatic shifts and associated founder events60,61, and a latitudinal gradient in speciation rates, generally attributed to other factors17,26,62, although both the existence and direction of this gradient are debated26,63. Other potentially important factors are intrinsic characteristics of the species, in particular life-history traits that determine a species' position on the r/K-strategist gradient: it has been proposed that small species with short generations and high fecundity (r-strategists) have higher genetic diversity than large species with long generations and low fecundity (K-strategists), potentially in relation with their differential sensitivity to environmental variations3; on the other hand, r-strategists tend to live in unstable environments, which could either promote speciation (e.g., by inducing divergent selection) or impede it (e.g., by preventing a fine niche partitioning).

Here, to provide a thorough characterization of the genetic diversity - speciation relationship, we assemble a dataset encompassing the whole extant mammalian radiation. To do so, we use a mitochondrial gene (cytochrome b). Compared to nuclear genes, mitochondrial genes are characterized by high mutation rates, low population sizes, strong purifying selection, the absence of recombination, and strong linkage64. While it has been suggested that under this situation natural selection could erase the relationship between genetic diversity and effective population size65, this hypothesis has been disputed66. In mammals in particular, mitochondrial and nuclear polymorphisms have been shown to be correlated67,68. We confirm this finding here by analyzing several nuclear databases. We show that intraspecific genetic diversity and species-specific speciation rates are negatively correlated and that this association is likely not due to differences in species’ ecological characteristics.

Results and Discussion

Mitochondrial genetic diversity estimates across mammals

We assembled a mammal dataset using the phylogeny from Upham et al. (2019)69 as our reference. We gathered a database of cytochrome b alignments by matching GenBank sequences to species names from this phylogeny and estimated synonymous genetic diversity for each species as Tajima’s70 \({\theta }_{T{syn}}\) and Watterson’s71 \({\theta }_{W{syn}}\), corrected for gaps in the alignment72 (see Methods). Our database encompassed 90,337 sequences distributed across 1897 species, with a minimum of five sequences per species (Supplementary Fig. 1). These species provided a good representation of the entire extant mammal tree of life (Fig. 1). Consistent with previous analyses of genetic diversity across mammals68, we found that genetic diversity varies by several orders of magnitude across species (from \(7.5\times {10}^{-5}\) to 0.113 for \({\theta }_{T{syn}}\) with a mean of 0.0193, and from \(2\times {10}^{-4}\) to 0.105 for \({\theta }_{W{syn}}\) with a mean of 0.02, Fig. 1, Supplementary Fig. 1). The clade-level distributions of genetic diversity overlap largely (Fig. 1, Supplementary Fig. 2), with Castorimorpha (castors and beavers) showing the highest mean genetic diversity and Carnivora the lowest (Supplementary Fig. 2).

Fig. 1: Mammals species-level consensus phylogeny from Upham et al. (2019), with branches coloured with branch-specific speciation rates estimated with ClaDS2 (see color legend in the central inset).

Bars at tips reflect estimated within-species genetic diversity for those species with 5 or more cytochrome b sequences available: Tajima’s \({\theta }_{{Tsyn}}\) (inner circle, red color legend in the top right inset) and Watterson’s \({\theta }_{{Wsyn}}\) (outer circle, blue color legend). Central inset: distribution of tip speciation rates for all mammals (black line, shaded fill) and 14 clades with more than 20 species (coloured lines, no fill); Top-right inset: distribution of genetic diversity, log scaled following the same line colouration. Silhouette figures were contributed by various authors with a public domain license (public domain mark 1.0; CC0 1.0) from PhyloPic (http://phylopic.org). Source data are provided as a source data file.

Tip speciation rate estimates across mammals

We next estimated species-specific (tip) speciation rates using the ClaDS model15,73, fitted to the mammal phylogeny from Upham et al. (2019), using both the consensus Maximum Clade Credibility (MCC) and 100 trees from the posterior distribution69. Consistent with what was found for birds using the same method15,73, we found that speciation rates vary considerably within and across clades (from 0.03 to 1.29 events per lineage per million years with a mean of 0.25, Fig. 1, Supplementary Fig. 3). The speciation rates of the species with genetic diversity estimates spanned the full range found in the group as a whole (from 0.03 to 1.1).

Negative genetic diversity-speciation rate relationship

We found a highly significant negative association (Supplementary Fig. 4; PGLS using \({\theta }_{{Tsyn}}\) with MCC tree: slope estimate −0.431, p-value \(2.69\times {10}^{-9}\), \({R}_{{Resid}}^{2}\) 0.125; \({R}_{{Resid}}^{2}\) computed as in Ives 201874) between intraspecific genetic diversity and speciation rates across all mammals (Fig. 2, Fig. 3). This significant negative relationship was found while accounting for phylogenetic dependence in the data both with phylogenetic generalized least squares models (PGLS, Fig. 3) and Bayesian generalized linear multilevel models (BMLM, Fig. 3). This was consistent across clades, most of them showing a negative association (Fig. 2), even though the statistical significance of the relationship depended on the clade, the tree, and the analysis performed (Fig. 3, Supplementary Fig. 5). The results were largely consistent across genetic diversity estimates (i.e. \({\theta }_{{Tsyn}}\) or \({\theta }_{{Wsyn}}\)), with original or mean subsampled estimates, across trees (i.e. the MCC tree and the 100 posterior trees), and when using PGLS or BMLM analyses (Supplementary Fig. 5). The global (all mammals) relationship remained statistically significant when reducing the global dataset from species represented by at least five sequences (1897 species) to species represented by more sequences, up to a threshold of 100 sequences per species (the dataset then encompassed 219 species, Supplementary Fig. 6). The relationship also remained significant when accounting for the geographical extent of sampling (Supplementary Fig. 7). The reliability of speciation rate estimates depends on the reliability of the estimates of phylogenetic branching times; these depend on a variety of factors, including hypotheses made on the molecular clock model75, the birth-death prior76,77, and the potential interaction between the two78.
While we acknowledge this uncertainty (and accounted for some of it by running analyses on posterior tree samples), we cannot think of a consistent bias that could artifactually generate a negative correlation between speciation rates (computed from a 31-gene supermatrix phylogeny) and cyt b genetic diversity.
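As an illustration of what a PGLS slope estimate involves, the following is a minimal sketch, not the implementation used in this study. It assumes a Brownian-motion covariance matrix `cov` (shared branch lengths between tips) is already available, and the function names `solve` and `pgls_fit` are hypothetical. It computes the generalized least squares estimate \(\hat{\beta }={({X}^{T}{C}^{-1}X)}^{-1}{X}^{T}{C}^{-1}y\) for a single predictor plus an intercept.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def pgls_fit(x, y, cov):
    """Return (intercept, slope) of y ~ x under a given phylogenetic covariance.

    cov is assumed to be symmetric positive definite, e.g. the shared
    branch lengths between tips under Brownian motion (an assumption here).
    """
    ones = [1.0] * len(x)
    cinv_ones = solve(cov, ones)  # C^-1 1
    cinv_x = solve(cov, x)        # C^-1 x

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    # Normal equations (X' C^-1 X) beta = X' C^-1 y with design matrix X = [1, x]
    lhs = [[dot(ones, cinv_ones), dot(ones, cinv_x)],
           [dot(x, cinv_ones), dot(x, cinv_x)]]
    rhs = [dot(cinv_ones, y), dot(cinv_x, y)]
    intercept, slope = solve(lhs, rhs)
    return intercept, slope


# Toy example: four tips on the tree ((A,B),(C,D)) with total height 1
# and sister species sharing half of their history (made-up values).
toy_cov = [[1.0, 0.5, 0.0, 0.0],
           [0.5, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.5],
           [0.0, 0.0, 0.5, 1.0]]
log_theta = [0.0, 1.0, 2.0, 3.0]
log_lambda = [2.0, 1.5, 1.0, 0.5]
intercept, slope = pgls_fit(log_theta, log_lambda, toy_cov)
```

With an identity matrix as `cov`, the estimate reduces to ordinary least squares, which is a convenient sanity check.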

Fig. 2: Relationship between intraspecific genetic diversity (Tajima’s \({\theta }_{{Tsyn}}\)) and speciation rate across all mammals and for each of the 14 clades with at least 20 species.

The number of species included in each analysis is indicated. Speciation rates are represented by their 95% confidence intervals (CIs) from 100 posterior trees; CIs are very narrow, demonstrating that estimates vary little across posterior trees. Results of the PGLS analyses on the consensus MCC tree are provided and linear regression lines with 95% confidence intervals are shown in purple for visualization purposes. Axes are log scaled. Source data are provided as a source data file.

Fig. 3: Slope estimates of the relationship between intraspecific genetic diversity (\({\theta }_{{Tsyn}}\)) and speciation rates for all mammals (top panel) and each of the 14 clades with at least 20 species (bottom panel).

The grey density plots with median point and 95% confidence intervals in black represent the estimated posterior distribution of slopes obtained with the Bayesian Multilevel Models (BMLM) using 100 phylogenetic trees with approximately 1000 posterior samples per tree. The points below represent the slopes estimated with Phylogenetic Generalized Least Squares analyses conducted on each of the 100 trees and are coloured in red when significant (p-value < 0.05). Source data are provided as a source data file.

No detectable effect of ecological attributes or selection

If some external factors or intrinsic characteristics of the species (i.e. covariates) influence genetic diversity and speciation rates in opposite directions, this could indirectly induce the observed negative genetic diversity - speciation relationship. This would occur, for example, if genetic diversity decreases with latitude60,61 and speciation rates increase with latitude, as suggested by some authors17,26,62, or if r-strategists sustain higher genetic diversity3 but are less likely to speciate. To investigate these potential effects, we first tested the correlation between genetic diversity and latitudinal midpoint18, mean range temperature18, body mass79,80, generation length3,7,79, and fecundity (litter size)3,7,79. When these covariates were analyzed one-by-one, we found, as expected, that genetic diversity is significantly higher in mammals inhabiting low latitudes and warm climates, and in small species with short generations (Supplementary Fig. 8). The effect of generation length was no longer significant when the covariates were all combined in a single analysis (Supplementary Fig. 9 & Table 1). Genetic diversity was higher in species with small litter sizes (Supplementary Figs. 8, 9 & Table 1), but this relationship was highly sensitive to the set of species included and we therefore do not interpret it biologically (see Methods). The only covariate significantly correlated with speciation rate in at least some analyses was litter size, with a positive association (Supplementary Figs. 8, 9, and Table 1). The absence of a latitudinal gradient in speciation rate is consistent with recent findings across vertebrates63, and suggests that other covariates that correlate with latitude, such as rate of climate change and species richness, would not be strong predictors of speciation rates either.
The negative association between genetic diversity and speciation rate could in part be due to an indirect effect of litter size; however, this negative association remained highly significant (p-value < 0.01) when accounting for the effect of all traits (including litter size) (Table 1, Supplementary Fig. 9). Hence, although we cannot exclude the potential indirect effect of other covariates not considered here, these results suggest a direct negative association between genetic diversity and speciation rates.

Table 1 Correlations between genetic diversity (\({\theta }_{{Tsyn}}\)), speciation rates (\(\lambda\)) and species-specific covariates

Selection can potentially affect the genetic diversity - speciation rate relationship, as noted in the Introduction. While we measured genetic diversity at synonymous sites, these neutral sites can be closely linked to nonsynonymous sites under selection, particularly in the mtDNA with limited recombination. Depending on the nature of selection, in particular whether it is purifying or adaptive, and divergent or directional, selection could either generate a negative genetic diversity - speciation rate relationship at linked sites or weaken it. To assess this potential effect, we computed genetic diversity at nonsynonymous sites. Genetic diversity was lower at nonsynonymous than synonymous sites, showing that all sites are not entirely linked despite limited recombination (Supplementary Fig. 10). If selection drives the observed correlation between genetic diversity and speciation rate, we expect the correlation to be stronger with genetic diversity measured at nonsynonymous sites. Instead, the relationship was weaker with nonsynonymous (MCC PGLS slope estimate −0.380, p-value < 0.0001) than synonymous genetic diversity (slope estimate −0.430, p-value < 0.0001) with a significant difference (Pillai’s test p-value \(6.37\times {10}^{-8}\)) (Supplementary Fig. 10). These results suggest that the negative genetic diversity - speciation rate relationship is not driven by selection.

Role of mutation rate versus Ne

Levels of intraspecific genetic diversity depend on both effective population size (Ne) and mutation rate (\(\mu\)) (\(\theta={N}_{{\rm{e}}}\mu\)). To further investigate the potential mechanisms underlying the observed negative genetic diversity – speciation rates relationship, we used the scaling of phylogenetic branch lengths in units of substitutions at the 3rd codon position versus time (in years) as a proxy for mutation rates (see Methods). As expected, the relationships between genetic diversity and both mutation rate and Ne are positive (Supplementary Fig. 11). We also found a negative, although only marginally significant, correlation between mutation and speciation rates, as well as between Ne (estimated as \(\theta /\mu\)) and speciation rates (Fig. 4). The negative correlation between mutation and speciation rates is unexpected given the evolutionary speed44 and metabolic45 theories, as well as previous empirical results30,31, and could be due to the smoothing of speciation rate differences obtained with ClaDS15 (see Methods). These results nevertheless suggest that fast speciation is not explained by high mutation rates in mammals, and that at least part of the negative genetic diversity – speciation rate relationship arises from a negative association between Ne and speciation rates.
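The decomposition used here can be sketched as follows. This is a minimal illustration assuming the simple relation \(\theta={N}_{{\rm{e}}}\mu\) stated above; the function names and the numerical values are hypothetical and are not taken from the study. The mutation rate proxy is the number of substitutions per site divided by the time (in years) over which they accumulated, and the Ne proxy is the ratio of genetic diversity to that rate, so its units follow those of \(\mu\).

```python
def mutation_rate_proxy(subs_per_site, years):
    """Substitutions per site per year along a branch or root-to-tip path."""
    if years <= 0:
        raise ValueError("years must be positive")
    return subs_per_site / years


def effective_size_proxy(theta, mu):
    """Ne proxy obtained by rearranging theta = Ne * mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return theta / mu


# Made-up values for one species: 0.02 substitutions per site
# accumulated over one million years, and a diversity of 0.0193.
mu = mutation_rate_proxy(subs_per_site=0.02, years=1_000_000)
ne = effective_size_proxy(theta=0.0193, mu=mu)
```

On a log scale the relation becomes additive, \(\log \theta=\log {N}_{{\rm{e}}}+\log \mu\), which is why both components can be regressed against speciation rate separately.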

Fig. 4: Relationships between speciation rate (\(\lambda\)) and the two theoretical components of \(\theta\): effective population size (Ne) and mutation rate (\(\mu\)).

Mutation rates are computed using the scaling of phylogenetic branch lengths in units of substitutions versus time (in years) for 100 trees, and Ne is computed using the ratio of genetic diversity to mutation rates. Left panels: mean \(\lambda\), Ne and \(\mu\) across 100 trees and 95% confidence intervals are shown, with a regression line and log scaled axes. Right panels: slope estimates from BMLM analyses on 100 trees (shaded distributions) and MCC tree (triangles). The black intervals represent the corresponding 95% credibility intervals and the medians. The circles (and triangle) below each of these plots represent PGLS estimates and are coloured red when significant (p-value < 0.05). Source data are provided as a source data file.

Specificities of mammals and mitochondrial markers

Taken together, our results suggest that population genetic processes and the tempo of speciation are tightly linked, although the generality of the negative genetic diversity – speciation rates relationship we observed would need to be tested on other species groups and across genomic data. There is a possibility that mammals experience a particularly high frequency of founder geographic speciation events81, hence a strong reduction of genetic diversity82,83 at speciation, that could generate a negative genetic diversity – speciation rates relationship in this group that would not necessarily be observed in other species groups dominated by different speciation modes.

It would also be insightful to assess the genetic diversity - speciation rate relationship using nuclear markers. Indeed, nuclear and mitochondrial genetic diversity are expected to show substantial differences, with mitochondrial genetic diversity being in general more strongly influenced by variations in mutation rates, demography, geographic structure and selection than nuclear diversity. It has even been suggested that mitochondrial genetic diversity is relatively constant across species and does not reflect Ne65,68,84 although this remains debated66. Our analyses suggest that, in mammals at least, mitochondrial genetic diversity reflects both Ne and mutation rate, as reported above (Supplementary Fig. 12). Unfortunately, we still lack consistent nuclear data across species at broad taxonomic scales, and as a result, macrogenetic studies with a large phylogenetic scope such as the one conducted here are not yet possible with nuclear markers85,86,87, although we expect that they will soon be. In an effort to assess how much our results may be specific to mitochondrial genetic diversity versus general across genetic markers, we analyzed four nuclear datasets, each with their specific limitations (Supplementary Note 1). We found that mitochondrial genetic diversity is in general positively correlated to nuclear genetic diversity (Supplementary Fig. 12A). We did not recover a negative association between nuclear genetic diversity and speciation rates, but this is likely due to limitations of the nuclear databases (Supplementary Fig. 12B, see the Supplementary Note 1 for a detailed discussion). Our results confirm previous studies that reported several orders of magnitude variation in mitochondrial genetic diversity (and associated Ne estimates) across mammals (e.g. Piganeau and Eyre-Walker 200988) and a good correlation between mitochondrial and nuclear genetic diversity in this group68. 
This suggests the negative relationship between genetic diversity and speciation rates may hold across markers, although this will need to be tested in the future with a larger number of nuclear datasets.

Processes that lead to a negative association between genetic diversity and speciation rates therefore seem to dominate those that could have generated a positive association, although the exact mechanisms at play cannot easily be disentangled. The relationship could for example be related to the faster genetic divergence between sub-populations in species with small Ne20,48,49, bottleneck effects at speciation50, the preponderance of a geographic mode of speciation with asymmetric dispersal57, heterogeneity in the time it takes to complete speciation, or a combination of such processes. This illustrates the complexity of the micro-macro evolutionary continuum and highlights the need for quantitative models linking population genetics to the tempo of speciation. A first implication of the negative genetic diversity – speciation rates relationship is that the availability of polymorphic alleles does not exert a rate-limiting control on speciation dynamics. A second implication is that speciation does not maintain high genetic diversity below the species level. Quite to the contrary, the frequency of speciation events seems to limit the amount of genetic diversity that a species contains. Given that genetically poor species tend to be more prone to extinction89, we can speculate that fast-speciating lineages, depleted of genetic diversity, are “volatile” lineages with high extinction rates.

Regardless of the mechanisms that underlie the negative genetic diversity – speciation rates relationship we observed, the consistency and robustness of this relationship highlight the importance of microevolutionary processes for understanding the dynamics that shape broad-scale patterns of diversity. Reciprocally, these deep-time dynamics influence the genetic diversity of present-day species, and therefore their capacity to adapt and cope with global change.

Methods

Mammalian phylogeny

We used a recent time-scaled phylogeny built using a “backbone-and-patch” approach to assemble species-level relationships of living mammals69. Using a 31-gene supermatrix, the backbone tree of 28 main groups was estimated, and then the 28 species-level “patch” phylogenies were re-scaled from backbone divergence times and grafted on to form the combined Mammalia-wide trees. We used the phylogeny that contained only extant species with DNA data (“DNA-only” trees) which encompassed 4064 species, representing ~69% of the total mammalian diversity, rather than the trees that additionally contain recently extinct and taxonomically imputed species. In order to account for phylogenetic uncertainty in our analyses, we used both the maximum clade credibility consensus tree (MCC) and a set of 100 trees randomly sampled from the credible set of 10,000 trees from Upham et al. (2019)69.

Estimating intraspecific genetic diversity across mammals

We computed intraspecific genetic diversity across mammals using the mitochondrial DNA locus cytochrome b (cyt b). We downloaded DNA sequence data for each mammalian family from the NCBI GenBank database on the 13th of December 2019, using the search term “(Family)[Organism] AND CYTB NOT Homo sapiens[Organism]” and the R (v.3.5.1)90 package Reutils (v.0.2.3)91. We excluded sequences with hybrid or non-identifiable species names.
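The per-family search term can be assembled programmatically. The sketch below only builds the query strings in the form quoted above; the function name `genbank_cytb_query` and the example family names are illustrative assumptions, and the actual download step (done in this study with the R package Reutils) is not reproduced.

```python
def genbank_cytb_query(family):
    """Build the GenBank search term for cyt b sequences of one mammalian family."""
    family = family.strip()
    if not family:
        raise ValueError("family name must not be empty")
    return f"{family}[Organism] AND CYTB NOT Homo sapiens[Organism]"


# Illustrative family names only; the study iterated over all mammalian families.
example_families = ["Felidae", "Muridae", "Castoridae"]
queries = {family: genbank_cytb_query(family) for family in example_families}
```

Each resulting string can then be passed to whichever Entrez client is used to query the nucleotide database.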

We used a list of synonyms to match species names linked to the NCBI GenBank sequences to species names from the mammalian phylogeny, which were from a master taxonomy that includes IUCN (2015) accepted species names and new species names. We updated the synonym list from ref. 92 using the R packages Taxize (v.0.9.91)93 and rotl (v.3.0.10)94 and the Integrated Taxonomic Information System (ITIS), IUCN and Open Tree of Life (OTL) databases. We discarded sequences linked to NCBI species names that matched more than one species in the phylogeny. Additionally, subspecies sequence names that were not valid in all species names databases were only kept when the specific epithet but not the generic name with the subspecies name would match the species names in the phylogeny.

We selected one cyt b sequence per family to be used as a reference for the alignments; we chose the longest available sequence (cyt b is 1140 bp long in most mammal species). We used the recently developed Python toolkit SuperCRUNCH95 to process cyt b sequences and obtain family-level alignments. This consisted of filtering sequences while creating a BLAST database from the reference sequences, adjusting the orientation of the sequences, checking for the presence of stop codons and adjusting accordingly, and aligning them (with adjusted reading frames to improve the alignments) using Mafft96,97 set with the FFT-NS-i algorithm. We then separated the resulting family-level alignments into species-level alignments, trimmed these alignments to remove potential sites with only gaps, and inspected them visually using Geneious (v.7.1.9)98. We then employed the bioseq package (v.0.1.4)99 to identify any out-of-frame sequences, which we subsequently realigned using macse (v2.06)100. Conveniently, macse takes into account frameshifts in the alignment while incorporating the appropriate genetic code (in this case, the mitochondrial vertebrate genetic code for cyt b). We identified the synonymous sites in all alignments using the bioseq package, and computed genetic diversity at synonymous and nonsynonymous sites across species.

We measured intraspecific molecular genetic diversity using Tajima’s70 \(\theta_{Tsyn}\), which uses the mean pairwise difference among sequences, and Watterson’s71 \(\theta_{Wsyn}\), which is based on the number of segregating sites. We chose these measures of genetic diversity because (a) they are well grounded in population genetic theory, being direct estimators of \(\theta\) in an idealized population following the Wright-Fisher model, (b) they are the two most widely used measures of genetic diversity in population genetics, and (c) they have been modified to account for missing data. Specifically, we used the modified estimators of Ferretti et al. (2012)72, which account for missing data in the alignments and therefore avoid the need to remove bases or individuals from the analysis. We computed these estimators for all species represented by at least five sequences in the species-level alignments, hereafter referred to as “original” genetic diversity estimates. We initially gathered 124,289 sequences, which were reduced to 111,624 after filtering; these sequences spanned 1959 species represented by at least five sequences.
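For readers unfamiliar with the two estimators, the following Python sketch computes their classical per-site forms on a small alignment. It assumes complete data (no gaps or missing bases) and uses all sites rather than synonymous sites only, so it is not the Ferretti et al. modification used in the analysis; the function names are illustrative.

```python
from itertools import combinations

def theta_tajima(seqs):
    """Tajima's estimator: mean number of pairwise differences, per site."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return diffs / len(pairs) / length

def theta_watterson(seqs):
    """Watterson's estimator: segregating sites over a_n, per site."""
    n, length = len(seqs), len(seqs[0])
    # A site is segregating if more than one state is observed in the column.
    seg = sum(len(set(col)) > 1 for col in zip(*seqs))
    # a_n is the (n-1)th harmonic number.
    a_n = sum(1.0 / i for i in range(1, n))
    return seg / a_n / length
```

Both functions expect a list of at least two equal-length sequences.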

To better account for uncertainty due to the broad variation in the number of sequences per species (median 17 and maximum 2080 sequences, Supplementary Fig. 1), we subsampled each species-level alignment to four sequences 1000 times and computed genetic diversity estimates for each of these subsamples, hereafter referred to as “subsampled” genetic diversity estimates. We also computed the mean and standard error of these subsampled estimates for each species. The original and subsampled estimates were generally strongly correlated (Supplementary Fig. 1). This suggests that our original genetic diversity estimates were not overly sensitive to the number of individuals sampled, and therefore also to the fraction of the full species range represented. We excluded species for which the original estimate of either \(\theta_{Tsyn}\) or \(\theta_{Wsyn}\) fell outside the range of subsampled estimates (Supplementary Fig. 1). In the end, the dataset encompassed 98,966 sequences across 1897 species. Nonsynonymous genetic diversity was computed for the 1730 species that had nonsynonymous sites. As another approach to testing the potential effect of the number of sequences per species on the genetic diversity-speciation rate relationship, we restricted the dataset to species represented by at least 10, 20, 50, 75 and 100 sequences (Supplementary Fig. 6). Finally, to test the potential effect of the geographic extent of sampling, we obtained geographical coordinates from Theodoridis et al. (2020)61 by matching our sequences with theirs using NCBI IDs. Only a subset of our sequences were georeferenced by Theodoridis et al. (2020), reducing the dataset to 453 species. We assessed the correlation between the mean within-species geographic distance between sequences and genetic diversity, as well as the significance of the genetic diversity-speciation rate association when accounting for mean geographic distance (Supplementary Fig. 7).
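The subsampling procedure can be sketched as follows in Python. The sketch takes any diversity estimator as a function argument, draws sequences without replacement, and summarizes the spread with the sample standard deviation across replicates; this choice of spread statistic, the function names and the fixed random seed are assumptions made for illustration.

```python
import random
import statistics

def subsample_estimates(seqs, estimator, k=4, n_rep=1000, seed=0):
    """Apply `estimator` to n_rep random subsamples of k sequences each."""
    rng = random.Random(seed)
    # rng.sample draws k sequences without replacement.
    return [estimator(rng.sample(seqs, k)) for _ in range(n_rep)]

def summarize(original, subsampled):
    """Return the mean, the spread, and whether `original` lies in range."""
    mean = statistics.fmean(subsampled)
    spread = statistics.stdev(subsampled)
    within = min(subsampled) <= original <= max(subsampled)
    return mean, spread, within
```

Under the exclusion rule stated above, a species would be dropped when `within` is false for either estimator.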

Estimating species-specific speciation rates

We estimated branch-specific speciation rates using an updated implementation of the recently developed cladogenetic diversification rate shift (ClaDS) Bayesian model15. This implementation uses data augmentation for faster computation in large phylogenies, and is available in the PANDA.jl (v0.0.2) package73. ClaDS allows for gradual variation in diversification rates by implementing a rate shift at each speciation event. We used the ClaDS2 model, which implements a scenario with a constant turnover rate (i.e. extinction rate divided by speciation rate). We fitted ClaDS2 to the full mammalian phylogeny, accounting for missing species by including family-level sampling fraction information. We computed each “patch clade” sampling fraction as the number of species included in the (DNA only) phylogeny for this clade divided by the number of extant species in the “complete” phylogeny of Upham et al. (2019)69 with imputed species. We ran ClaDS2 on both the MCC tree and the 100 trees randomly sampled from the credible set. We extracted speciation rates at the tips to obtain species-specific (tip) speciation rates. Our species-specific rates are positively correlated with those estimated in Upham et al. (2020)101, where rates were estimated with DR102, but only loosely so (\(R^{2}=0.68\)). We present ClaDS-based results as this method accounts for small rate variations in an explicitly model-based approach15.

Assessing the correlation between species-specific genetic diversity and speciation rate

We assessed the correlation between species-specific speciation rate estimates and species genetic diversity (using both \(\theta_{Tsyn}\) and \(\theta_{Wsyn}\)), both at a global scale for the entire class Mammalia and within 14 of the 28 monophyletic patch clades from Upham et al. (2019)69, selected as those with at least 20 species for which genetic diversity could be computed. We analyzed the correlation using two statistical approaches that account for phylogenetic dependence in the data: a frequentist Phylogenetic Generalized Least Squares approach (PGLS) implemented in the R package nlme (v.3.1-162)103, and a Bayesian Multilevel Models approach (BMLM) implemented in the R package brms (v.2.11)104. In both cases, analyses were performed for both the MCC tree and each of the 100 trees. We performed the global and per-clade PGLS analyses independently, using either the original estimates of genetic diversity or the mean of subsampled estimates. The analyses used the gls function and included the phylogenetic correlation structure using the corPagel function, with parameters estimated by maximum likelihood (ML). We also calculated the R-squared values of the PGLS analyses105 using the rr2 package106, which decomposes the total variance of the response variable into variance explained by the predictor variable (R2_lik) and variance explained by other factors (R2_resid). The BMLM approach allows a joint analysis of the entire Mammalia and individual clades while propagating model uncertainty across levels. We performed this joint analysis with clade as a grouping factor, allowing the slope and intercept of the correlation to vary across clades. We ran these analyses with log-transformed genetic diversity as the response variable and log-transformed speciation rate as the predictor variable; preliminary results on a subset of analyses with speciation rate as the response variable suggested similar results.
BMLM also allows incorporating the error associated with the response variable, so we additionally performed an analysis using the mean of the subsampled genetic diversity with its standard deviation implemented in an errors-in-variables model. The analyses with nonsynonymous genetic diversity were performed with PGLS on the MCC tree and 100 posterior trees using nonsynonymous genetic diversity (\(\theta_{Tnonsyn}\)). Additionally, we performed a Multivariate Analysis of Variance (MANOVA) to compare the association between synonymous and nonsynonymous measures of genetic diversity and speciation rates (MCC tree and rates) using Pillai’s test implemented in mvMORPH (function manova.gls with “LL” method; v.1.1.5)107.
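For reference, the PGLS regression with a Pagel correlation structure described above can be written in its standard textbook form. The notation is introduced here for illustration and is an assumption about how the fitted model is parameterized: \(s_i\) denotes the tip speciation rate of species \(i\), \(\mathbf{C}\) the phylogenetic covariance matrix expected under Brownian motion, and \(\lambda_P\) Pagel's lambda, estimated jointly with the regression coefficients.

\[
\log \theta_{i} = \beta_0 + \beta_1 \log s_{i} + \varepsilon_i, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^{2}\, \mathbf{V}(\lambda_P)\right),
\]

\[
V_{ij}(\lambda_P) =
\begin{cases}
C_{ii}, & i = j,\\
\lambda_P\, C_{ij}, & i \neq j.
\end{cases}
\]

With \(\lambda_P = 0\) the residuals are independent across species, and with \(\lambda_P = 1\) they follow the covariance expected under Brownian motion along the tree.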

Species-specific covariates

To investigate whether the relationship between genetic diversity and speciation rates could be due to indirect effects from relationships of each variable with species-specific characteristics, we gathered data for species’ mean body mass and range area108, mean annual temperature over species range and latitudinal midpoint109, generation length and litter size110. Next, we independently analyzed the correlation between these covariates and both genetic diversity and speciation rate. We also analyzed the correlation between genetic diversity and speciation rate when accounting for the effect of the covariates by including them as predictor variables. These three models were run at the global scale for mammals as a whole, on the MCC tree and each of the 100 posterior trees, using \({\theta }_{{Tsyn}}\) as the estimate of genetic diversity. We ran both “one-by-one” analyses where each covariate was considered independently, and a “combined” analysis where they were all included simultaneously in a single analysis.

Surprisingly, we found that genetic diversity was higher in species with small litter sizes (Supplementary Figs. 8, 9 & Table 1). This was still true when analyzing this relationship without accounting for other covariates (MCC PGLS slope estimate −0.22, p-value = 0.014). This can seem counter-intuitive, and contrasts with Romiguier et al. (2014)3. We also computed fecundity as the product of litter size and frequency of litters per year, as in Welch et al. (2008)111, for the subset of species for which these data were available (1017 species). With this subset, we found that the relationship between genetic diversity and fecundity was non-significant with fecundity measured as litter size (MCC PGLS slope estimate 0.0025, p-value = 0.98), and significantly positive with fecundity measured as litter size × litter frequency (MCC PGLS slope estimate 0.12, p-value = 0.039). We also found that the relationship between genetic diversity and litter size was not significant when including all the species for which data were available (1427 species, MCC PGLS slope estimate −0.11, p-value = 0.19). The relationship between genetic diversity and litter size therefore seems highly sensitive to the set of species included, and we do not interpret it biologically in the paper. We include it in our analyses to account for a potential indirect effect of this covariate on the genetic diversity-speciation rate relationship.

Mutation rates and effective population sizes

Under the neutral theory and for a haploid population (the case relevant to cyt b, which is a haploid and maternally inherited mtDNA locus) at equilibrium, \(\theta = N_{\rm{e}}\mu\), where \(N_{\rm{e}}\) is the effective population size and \(\mu\) is the mutation rate per site per generation. We unfortunately do not have direct measures of \(N_{\rm{e}}\) and \(\mu\); the estimates we use here for \(\mu\) rely on phylogenetic branch lengths (see below), which are also used to estimate speciation rates, and the estimates of \(N_{\rm{e}}\) are based on our estimates of genetic diversity. This introduces an inevitable circularity in the analyses, which should be kept in mind when interpreting the results. We approximated species-specific mutation rates by substitution rates at the 3rd codon position of the cyt b alignment from Upham et al. (2019)69. This approach (versus, for example, using codon-based models to estimate synonymous substitution rate) is often used to mitigate computational constraints112. In theory, substitutions at the 3rd position more closely reflect neutral divergence arising from mutations because changes in 3rd codon positions are mostly synonymous, thus limiting the effect of natural selection. We estimated divergence in units of substitutions at third codon positions on the MCC tree and the 100 tree topologies under the GTR substitution model with rate variation among sites (modelled with a gamma distribution with 5 rate categories) using the baseml program in PAML (v.4.9j)113. These analyses were run across the same 100 trees used to infer speciation rates but performed per patch clade from Upham et al. (2019)69, with a further split of the Carnivora and Muridae clades due to computational limits. We obtained estimates of mutation rates in time units (i.e. per year) for each species by dividing the estimated number of substitutions per site on each terminal branch by branch length in time units (as given by the time-calibrated trees in years).
To avoid potential biases, 81 species with an estimated number of substitutions of \(4\times 10^{-6}\) (default value assigned to branches by PAML when the actual value is zero) were discarded. Finally, we calculated \(N_{\rm{e}}\) per species by dividing \(\theta_{Tsyn}\) by the estimated mutation rate (per year) for these species. We initially computed the mutation rate per generation by multiplying the mutation rates per year by the length (in years) of a generation, and used this per-generation rate to obtain \(N_{\rm{e}}\), but we found that variations in mutation rates (and therefore also \(N_{\rm{e}}\)) were then almost entirely explained by variations in generation length. This led to some spurious results, such as a negative relationship between genetic diversity and mutation rate. Not correcting for generation length is justified by our result that generation length does not explain the negative relationship between speciation rate and genetic diversity (Table 1). It also allows a more direct comparison between mutation rates and speciation rates, which are expressed per Myr. We explored the relationship between speciation rate (as the explanatory variable) and either \(\mu\) or \(N_{\rm{e}}\) using PGLS and BMLM analyses on the MCC tree and the 100 posterior trees. We found an unexpected negative correlation between mutation and speciation rates, which we attribute to the smoothing of speciation rates obtained with ClaDS analyses15. This smoothing would lead to speciation rates being underestimated on short branches with high speciation rates, and overestimated on long branches with low speciation rates. On the other hand, mutation rates are likely overestimated on short branches and underestimated on long branches because of the way they are computed (number of substitutions divided by branch length). Combined, these effects would generate the negative correlation we observed.
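The arithmetic behind the per-year mutation rates and the resulting effective population sizes can be summarized with a short Python sketch. The input layout (a dictionary of tuples), the function name and the tolerance used to detect the placeholder value are assumptions for illustration; the placeholder itself is the value reported in the text.

```python
# Value the text reports PAML assigning to branches with zero substitutions.
PAML_ZERO_BRANCH = 4e-6

def mutation_rate_and_ne(species):
    """species: dict name -> (subs_per_site, branch_length_years, theta_tsyn).

    Returns dict name -> (mu_per_year, Ne) for the species that are retained.
    """
    out = {}
    for name, (subs, years, theta) in species.items():
        if abs(subs - PAML_ZERO_BRANCH) < 1e-12 or years <= 0:
            # Placeholder substitution value or unusable branch: discard.
            continue
        mu = subs / years              # substitutions per site per year
        out[name] = (mu, theta / mu)   # Ne from theta = Ne * mu
    return out
```

For example, a terminal branch with 0.02 substitutions per site over 2 million years gives a rate of 1e-8 per site per year, and a synonymous diversity of 0.01 then corresponds to an effective size of about one million under this approximation.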

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.