Variance of protein heterozygosity in different species of mammals with respect to the number of loci studied

Makarieva, Anastassia M

doi:10.1046/j.1365-2540.2001.00899.x

Original Article
Published: 01 July 2001

Anastassia M Makarieva

Heredity volume 87, pages 41–51 (2001)

Abstract

Analysis of published data on protein heterozygosity of 321 species of mammals shows that it varies from 0 up to 22%, an average species being heterozygous at 5% of its protein-coding loci. Many attempts have been made to explain the observed differences in protein heterozygosity, relating its value to various species-, population-, or environment-specific parameters. In this work it is shown that the wide scatter of protein heterozygosity in different species of mammals can be explained by the small numbers of loci studied (usually 20–30). It is shown that with an increasing number of studied loci, the mean of the heterozygosity does not change, while its variance among different species decreases in accordance with a Poisson distribution. The true heterozygosity of the whole protein-coding region of the mammalian genome is thus characterized by a narrow spread around the mean. This means that the true heterozygosity of the protein-coding region is similar in all mammalian species. Its value can be viewed as the threshold level of variability of the protein-coding region of mammals, which characterizes the permissible level of erosion of genetic information of species and is maintained by stabilizing selection in natural ecological niches.

Introduction

The question of the nature and maintenance of intraspecific genetic variability is widely discussed in the literature (Nei, 1984; Nevo et al., 1984; Kimura, 1991; Ayala & Fitch, 1997; Amos & Harwood, 1998). The observations show that intraspecific genetic variability is remarkably different even among evolutionarily close species (Nevo et al., 1984). Much work has been done to reveal possible correlations between genetic variability and such characteristics as genome size (Pierce & Mitton, 1980; Larson, 1981), body mass (Wooten & Smith, 1985; Gorshkov & Makarieva, 1997), mating system (Appolonio & Hartl, 1993), ploidy level (Graur, 1985; Hofker et al., 1986; Crespi, 1991; Burrows & Ryder, 1997), population size, structure and history (Nei & Graur, 1984; O’Brien, 1987; Carson, 1990) and other ecological or behavioural parameters such as, for example, environmental stress (Imasheva, 1999), host–parasite interactions (Thompson & Lymbery, 1996) or habitat fragmentation (Gaines et al., 1997), but in spite of many significant findings, the general pattern remains unclear.

Abundant evidence on the levels of heterozygosity of the protein-coding region of the genome has been accumulated for thousands of species (Nevo et al., 1984). At present, due to the explosive development of new measuring techniques, most studies assess the DNA variability directly (e.g. by DNA sequencing) rather than by studying proteins encoded by it. However, the development of various DNA techniques during the last several years has not yet led to creation of a comparable dataset of DNA variability for the coding region of different species, especially taking into account the growing bias in molecular studies towards medicine, and, consequently, the single species Homo sapiens. Thus, the protein variability dataset still remains unique due to its extensiveness with respect to the number of species encompassed (Butlin & Tregenza, 1998). In these data, mammals represent one of the best-studied taxa.

In this study the statistical aspect of the problem of intraspecific genetic variability is investigated. Published data on the expected protein heterozygosity of 321 natural mammalian species were collected. It is shown that the different observed values of protein heterozygosity in different species and the wide spread of the observed values of protein heterozygosity around the mean can be explained by the small number of loci studied. With increasing number of studied loci the scatter of heterozygosity values decreases in accordance with a Poisson distribution while the mean does not change. This suggests that all mammals are characterized by nearly the same level of protein heterozygosity.

Rationale

The coding region of the mammalian genome is known to contain about 30 000–40 000 genes. Such a number of genes cannot be studied in any population genetics experiment. In most studies, sets of 20–30 loci are investigated in different species. The number of studied loci is determined by the available buffer systems, the techniques developed, manpower and money.

The expected heterozygosity h of a given locus (i.e. the relative frequency of heterozygous state in a random-mating population) is determined as:
$$h = 1 - \sum_{j} q_{j}^{2}, \tag{1}$$

where q_j is the relative frequency of the j-th allele.

Let l be the number of polymorphic loci (i.e. loci with non-zero h) in a study of L loci in a population. The average expected heterozygosity of individuals in a population is given by the expression
$$H_{L} = \frac{1}{L} \sum_{i = 1}^{l} h_{i} = h_{l} \frac{l}{L}, \tag{2}$$

where L is the total number of studied loci, h_i is the expected heterozygosity of the i-th polymorphic locus, see (eqn 1), and h_l is the average heterozygosity of the l polymorphic loci, i.e. $h_{l} \equiv \sum_{i = 1}^{l} h_{i} / l$ (l > 0).
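The two definitions above can be sketched in a few lines of Python. This is only an illustration, assuming the standard definition of expected heterozygosity, h = 1 − Σ q_j², and averaging h over all L studied loci (monomorphic loci contribute zero). The function names and the allele frequencies in `example_loci` are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def locus_heterozygosity(allele_freqs: Sequence[float]) -> float:
    """Expected heterozygosity of one locus: h = 1 - sum_j q_j^2."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(q * q for q in allele_freqs)


def average_heterozygosity(loci: Sequence[Sequence[float]]) -> float:
    """H_L: mean of h over all L studied loci (monomorphic loci give h = 0)."""
    return sum(locus_heterozygosity(freqs) for freqs in loci) / len(loci)


# Hypothetical sample: L = 4 loci, of which l = 2 are polymorphic.
example_loci = [[1.0], [1.0], [0.5, 0.5], [0.8, 0.2]]
H_L = average_heterozygosity(example_loci)

# Same value via H_L = h_l * l / L, with h_l averaged over polymorphic loci only.
polymorphic = [f for f in example_loci if len(f) > 1]
h_l = sum(locus_heterozygosity(f) for f in polymorphic) / len(polymorphic)
H_L_alt = h_l * len(polymorphic) / len(example_loci)
print(H_L, H_L_alt)
```

Both routes give the same number, which is the point of writing H_L as the product of the mean heterozygosity of polymorphic loci and the fraction of polymorphic loci.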

Let us denote by $h_{l_{tot}}$ the average heterozygosity of all l_tot polymorphic loci existing in the protein-coding region of a given species, $h_{l_{tot}} \equiv \sum_{i = 1}^{l_{tot}} h_{i} / l_{tot}$. The greater the number L of studied loci, the better the approximation of H_L that is given by the expression
$$H_{L} \approx h_{l_{tot}} \frac{l}{L}. \tag{3}$$

It will be shown below that under certain assumptions, a very accurate estimate of $h_{l_{tot}}$ can be obtained from data on heterozygosity of polymorphic loci in different species of mammals. Introduction of a constant value of $h_{l_{tot}}$ (eqn 3) instead of the random variable h_l (eqn 2) leads to a substantial simplification of all the formulae and allows one to analyse the variance of heterozygosity H_L as a function of variance of the number of polymorphic loci l alone, provided that the inaccuracy of the approximate formula (3) is not very large.

The random variable h_l defined in (eqn 2) assumes different values when different sets of L loci are studied in the same species. If in all sets of L loci the same number l of polymorphic loci were observed, the mean and variance of h_l would be given by the expressions $\bar{h}_{l} \equiv \sum_{k = 1}^{K_{tot}} h_{l}^{k} / K_{tot}$ and $\sigma_{h_{l}}^{2} \equiv \sum_{k = 1}^{K_{tot}} (h_{l}^{k} - \bar{h}_{l})^{2} / K_{tot}$, where $h_{l}^{k}$ is the average heterozygosity of polymorphic loci in the k-th set of L loci, K_tot is the total number of sets of L loci that cover the whole protein-coding region of the species genome, K_tot = L_tot/L, and L_tot is the total number of loci in the protein-coding region. Note that $\bar{h}_{l} = h_{l_{tot}}$ by definition. Thus, $\sigma_{h_{l}}^{2}$ characterizes the deviation of h_l from $h_{l_{tot}}$ and, consequently, the deviation of the approximate formula (eqn 3) from the exact formula (eqn 2). It can be easily shown from the definitions of $\bar{h}_{l}$ and $\sigma_{h_{l}}^{2}$ that
$$\sigma_{h_{l}}^{2} = \frac{\sigma_{h}^{2}}{l}, \tag{4}$$

where $\sigma_{h}^{2}$ is the variance of the random variable h, which assumes the values of expected heterozygosity of all l_tot polymorphic loci of this species, $\sigma_{h}^{2} \equiv \sum_{i = 1}^{l_{tot}} (h_{i} - h_{l_{tot}})^{2} / l_{tot}$, $l_{tot} = K_{tot} l$. Variance characterizes the squared spread around the mean. Equality (eqn 4) essentially represents the law of large numbers and reflects the fact that the more polymorphic loci are studied, the more accurate is the estimate of their mean heterozygosity. The average number of polymorphic loci observed in a given set of loci is proportional to the total number of loci studied in this set. It will be shown below that at the most commonly used L the average inaccuracy of (eqn 3) does not exceed 30%, see (eqn 12). Thus, in further calculations we use (eqn 3) as an acceptable approximation of (eqn 2).

Let us call the hypothetical value of heterozygosity that could be obtained by studying all protein-coding loci in a species the true heterozygosity H:
$$H \equiv h_{l_{tot}} \frac{l_{tot}}{L_{tot}} = h_{l_{tot}} P, \tag{5}$$

where P ≡ l_tot/L_tot is the true polymorphism. If H is equal in all species (the assumption that we will be further testing), it is likely that both $h_{l_{tot}}$ and P are equal simultaneously, rather than that there exists a certain inverse relationship between them. There are two independent factors that influence the level of heterozygosity of a particular locus in a particular species. Firstly, for a locus to become polymorphic it has to be affected by a mutation. It is natural to expect that the mutation-affected loci are randomly located in genomes of different species. Secondly, the exact value of heterozygosity h_i of a mutation-affected locus depends on how many chances the heterozygous genotypes have for spreading in the population. This may depend on the properties of the locus itself (Ward et al., 1992). Indeed, certain groups of loci tend to show elevated values of heterozygosity in most species, whereas others are generally less variable (O’Brien et al., 1980). However, if no mutation has affected the locus in the population, it will remain monomorphic with h=0, irrespective of how high its heterozygosity could be were such a mutation to occur. Thus, we assume below that the equality of both independent factors, P and $h_{l_{tot}}$ , in all species is both necessary and sufficient for equality of H in all species.

If determined by the random character of the process of mutagenesis, polymorphic loci should be distributed randomly over the coding region of genome. Then, given the low relative frequency of polymorphic loci as compared to monomorphic ones (as is usually the case) the probability p(l) of finding l polymorphic loci among L studied loci will be determined by a Poisson distribution. (It has been tested that applying a binomial distribution does not change in any significant way any results obtained below).

If the true heterozygosity H, and, consequently, the true polymorphism P (eqn 5) are equal in all species, then the number l of polymorphic loci observed in sets of L loci in different species will follow the same Poisson distribution as the number of polymorphic loci in sets of L loci chosen in different parts of the coding region of the same species. Then the probability p(l) of finding l polymorphic loci in a random set of L loci in a population of a given species is equal to
$$p(l) = \frac{\bar{l}^{\,l} e^{-\bar{l}}}{l!}, \tag{6}$$

where $\bar{l}$ is the average number of polymorphic loci that are found in different species, $\bar{l} \equiv \sum_{k = 1}^{N_{L}} l_{k} / N_{L}$, where l_k is the number of polymorphic loci in the k-th species and N_L is the number of species where L loci were studied. Note that $\bar{l} = LP$. Everywhere below we assume that only one population represents each species, so averaging over different populations is equivalent to averaging over different species.

Note that if different species are characterized by different levels of genetic variability determined by various species-specific parameters, there are no grounds to expect that the number of polymorphic loci found in different species will follow a Poisson distribution, see also (ii) below.

The Poisson distribution is characterized by the well-known relation between mean $\bar{l}$ and variance $\sigma_{l}^{2} \equiv \sum_{k = 1}^{N_{L}} (l_{k} - \bar{l})^{2} / N_{L}$:
$$\sigma_{l}^{2} = \bar{l}. \tag{7}$$

Using eqns (3) and (7) and the definition of mean and variance of ${\bar{H}}_{L} \equiv \sum_{k = 1}^{N_{L}} H_{L}^{k} / N_{L}$ and $σ_{H_{L}}^{2} \equiv \sum_{k = 1}^{N_{L}} {(H_{L}^{k} - {\bar{H}}_{L})}^{2} / N_{L}$ , where H^k_L is the expected heterozygosity of the k-th species, we obtain for $σ_{H_{L}}^{2}$ at fixed L
$$\sigma_{H_{L}}^{2} = \frac{h_{l_{tot}}^{2}}{L^{2}} \sigma_{l}^{2} = \frac{h_{l_{tot}}^{2} \bar{l}}{L^{2}} = h_{l_{tot}} \frac{\bar{H}_{L}}{L}. \tag{8}$$

Note that in (eqn 8) we applied our assumption about equal values of $h_{l_{tot}}$ in different species, while the assumption about equal values of P in different species was implicitly used in (eqn 6).
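The inverse dependence of the variance on L can be checked numerically. The sketch below assumes that l is Poisson-distributed with mean LP and that H_L = h_ltot · l / L, and then sums over the distribution directly. The value `P_ASSUMED` is not reported in the text; it is back-calculated here from H = 0.051 and h_ltot = 0.261 purely for illustration, and the function names are hypothetical.

```python
import math


def poisson_pmf(l: int, mean: float) -> float:
    """Probability of observing l polymorphic loci under a Poisson law."""
    return math.exp(-mean) * mean ** l / math.factorial(l)


def heterozygosity_mean_and_variance(L, P, h_ltot, l_max=100):
    """Mean and variance of H_L = h_ltot * l / L when l ~ Poisson(L * P).

    The sum is truncated at l_max; the Poisson law itself ignores the
    constraint l <= L, which is harmless while L * P is small.
    """
    mean_l = L * P
    terms = [(h_ltot * l / L, poisson_pmf(l, mean_l)) for l in range(l_max + 1)]
    mean_H = sum(H * p for H, p in terms)
    var_H = sum((H - mean_H) ** 2 * p for H, p in terms)
    return mean_H, var_H


H_LTOT = 0.261               # mean heterozygosity of polymorphic loci (from the text)
P_ASSUMED = 0.051 / H_LTOT   # assumption: P inferred from H = h_ltot * P

for L in (10, 20, 40):
    mean_H, var_H = heterozygosity_mean_and_variance(L, P_ASSUMED, H_LTOT)
    # The last two columns should coincide: var(H_L) = h_ltot * mean(H_L) / L.
    print(L, mean_H, var_H, H_LTOT * mean_H / L)
```

The mean stays the same for every L while the variance halves each time L doubles, which is the behaviour the regression in the Results section tests for.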

Generally, the relationship between $σ_{H_{L}}^{2}$ and L can be written in the logarithmic form (decimal logarithms were chosen just for reasons of convenience):
$$\log \sigma_{H_{L}}^{2} = a \log \frac{\bar{H}_{L}}{L} + b. \tag{9}$$

Thus, if $h_{l_{tot}}$ , P and, consequently, H are equal in all species, the following equalities should be true according to (eqn 8):
$$a = 1, \qquad b = \log h_{l_{tot}}. \tag{10}$$

These predictions were tested with the available empirical data.
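To make the predicted slope and intercept concrete, the following sketch fits a straight line to noise-free values generated from the relation σ² = h_ltot · H̄_L / L on a decimal log-log scale. The eight values of L are hypothetical stand-ins for the interval means (Table 1 data are not used here), and `linear_fit` is an ordinary least-squares helper written for this illustration.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


H_MEAN = 0.051   # average heterozygosity (from the text)
H_LTOT = 0.261   # mean heterozygosity of polymorphic loci (from the text)
loci_counts = [12, 16, 20, 24, 30, 40, 50, 60]   # hypothetical values of L

xs = [math.log10(H_MEAN / L) for L in loci_counts]
ys = [math.log10(H_LTOT * H_MEAN / L) for L in loci_counts]   # ideal variance
a, b = linear_fit(xs, ys)
print(a, b, math.log10(H_LTOT))
```

With ideal input the fit returns a slope of one and an intercept equal to log h_ltot, so any departure seen in real data reflects sampling noise or a violation of the Poisson assumption rather than the fitting procedure.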

Results

(i) Decreasing variance of heterozygosity with increasing number of studied loci

Published data on expected protein heterozygosity (eqn 2) of 321 mammalian species were collected (Fig. 1). In those cases when allele frequencies of polymorphic loci were available, heterozygosity h was calculated for each polymorphic locus according to (eqn 1) (99% criterion of polymorphism). In total, 2003 polymorphic loci were considered. The total number of all loci considered (including monomorphic ones) was 10 296.

The 2003 values obtained for h were used to construct the probability distribution p(h) (Fig. 2). Under our assumption of equal values of $h_{l_{tot}}$ in all species, the mean of this distribution gives a rather accurate estimate of $h_{l_{tot}}$ owing to the great number of studied loci. It is equal to $h_{l_{tot}} = \sum_{i = 1}^{2003} h_{i} / 2003 = 0.261$, where h_i is the expected heterozygosity (eqn 1) of the i-th polymorphic locus. Using this value we obtain an estimate of coefficient b in (eqn 10):
$$b = \log h_{l_{tot}} = \log 0.261 = -0.58. \tag{11}$$

Let us now show that the average inaccuracy of the approximate formula (eqn 3), which was used in the Rationale instead of (eqn 2), is not more than 30%. The distribution p(h) is characterized by variance σ²_h=0.034. The dataset considered (Fig. 1) is characterized by the average number of polymorphic loci l_av of about l_av ≈ 5.4. Expression (4) that characterizes inaccuracy of (3) was obtained under the assumption that the number of polymorphic loci l is the same in all sets of loci studied. It can be shown, however, that for the dataset considered (Fig. 1) where l is a random variable, expression (4) remains adequate as well, if the average number of polymorphic loci l_av is used in expression (4) instead of l. Thus, according to expression (4) the standard deviation of h_l at l ≈ l_av is equal to
$$\sigma_{h_{l}} = \sqrt{\frac{\sigma_{h}^{2}}{l_{av}}} = \sqrt{\frac{0.034}{5.4}} \approx 0.079, \tag{12}$$

which constitutes 30% of the average value $h_{l_{tot}} = 0.261$ . Thus, the average deviation of formula (3) from formula (2) is about 30%, as stated above.
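The arithmetic behind these estimates is easy to reproduce. The snippet below uses only the three numbers quoted in the text (0.261, 0.034 and 5.4) and assumes that the standard deviation of the mean of l values scales as the square root of σ²_h / l; the variable names are illustrative.

```python
import math

H_LTOT = 0.261   # mean heterozygosity of the 2003 polymorphic loci
VAR_H = 0.034    # variance of the distribution p(h)
L_AV = 5.4       # average number of polymorphic loci per studied set

b_expected = math.log10(H_LTOT)          # expected intercept of the log-log fit
sd_h_l = math.sqrt(VAR_H / L_AV)         # standard deviation of h_l at l = l_av
relative_error = sd_h_l / H_LTOT         # inaccuracy of the approximation for H_L

print(round(b_expected, 2), round(sd_h_l, 3), round(relative_error, 2))
```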

The minimum number of studied loci L in a given population was 11, the maximum was 62 (Fig. 1). The range (11, 62) was divided into eight intervals (Fig. 1), each of them containing not less than 20 values of heterozygosity H_L. The division chosen is arbitrary. It was chosen to minimize differences between lengths of intervals and at the same time between numbers of heterozygosity values observed in each interval. It can be shown, however, that the results of the study do not depend on different ways of choosing intervals.

For each interval the average number of studied loci $\bar{L}$, the average heterozygosity $\bar{H}_{L}$, and the variance of heterozygosity $\sigma_{H_{L}}^{2}$ were calculated (Table 1). It should be noted that though some species are represented by two or more populations (for 321 species a total of 411 populations were studied), none of the eight intervals of L contains two or more populations of the same species. Thus, in each interval all calculations are made for populations of different species.

Table 1 Numerical data for the eight intervals of Fig. 1.

Figure 3 shows the results of testing relationship (8) in the logarithmic form (9) with the available data. Variables $\sigma_{H_{L}}^{2}$, $\bar{H}_{L}$ and L in (9) assumed the values $\sigma_{H_{L}, i}^{2}$, $\bar{H}_{L, i}$ and $\bar{L}_{i}$, respectively (i ranging from 1 to 8, Table 1).

Linear regression of $\log \sigma_{H_{L}}^{2}$ on $\log (\bar{H}_{L}/L)$ (eqn 10) gave the following results (Fig. 3):

Uncertainty in (13) represents standard errors of the respective values.

The obtained value of a agrees very well with the corresponding relationship in (eqn 10). The high correlation coefficient and the low probability level support the statement that the variance of heterozygosity indeed decreases in inverse proportion to L. This means that with growing L the heterozygosity values observed in different species converge to a common mean.

The obtained estimate of b corresponds to $h_{l_{tot}} = 10^{0.17} = 1.19$ (which makes no sense, as $h_{l_{tot}}$ must be less than unity by definition), exceeds considerably the expected value b=−0.58 (eqn 11) and is characterized by high uncertainty. This is an indication of a departure of the distribution of the number of polymorphic loci in real samples from the proposed Poisson distribution (eqn 6). A possible reason for that may be nonrandom sampling of the loci studied. For example, there are many proteins coded by two or more loci. If a given protein is studied, all its loci are studied as a rule. If these loci tend to display similar levels of heterozygosity (which is often the case), they cannot be considered as random samples. This will result in a distortion of the Poisson distribution.

There can be other sources of distortion as well. One of them is an unconscious bias towards more variable loci when choosing a set of loci to be studied. I considered the relationship between protein heterozygosity and the number of studied loci in 30 species of Drosophila. Data were taken from Nevo et al. (1984). It turned out that heterozygosity decreased significantly with the increasing number of studied loci (r=−0.58; P=0.0002). This confirms, at least for Drosophila studies, the statement that investigators tend to choose more variable loci first (Harris & Hopkinson, 1972; Graur, 1985). One possible manifestation of this effect is that papers describing higher levels of heterozygosity may have a better chance of being published than those where no polymorphism is discovered. Then the initial Poisson distribution of the number of polymorphic loci would be distorted due to overly frequent observations of large numbers of polymorphic loci.

(ii) Testing the assumption of equal heterozygosity in different species of mammals

It can be shown that the observed decrease of variance of heterozygosity with increasing number of studied loci is not compatible with the assumption that different mammalian species have significantly different values of true heterozygosity H, see (eqn 5). Let us assume for simplicity that there are only two types of mammals with different true heterozygosities H₁ and H₂. This means that if we study all protein-coding loci in all mammalian species of one of the two types, we always get the value of either H₁ or H₂. However, if we study randomly chosen small sets of L loci, we will observe two probability distributions of heterozygosity p₁(H_L) and p₂(H_L) with nonzero variances σ₁²(L) and σ₂²(L) and means H₁ and H₂. Let the relative frequency of species of the two types be γ₁ and γ₂, γ₁ + γ₂=1. Then for the probability distribution of heterozygosity of all mammalian species p(H_L) we can write
$$p(H_{L}) = \gamma_{1} p_{1}(H_{L}) + \gamma_{2} p_{2}(H_{L}). \tag{14}$$

Expression (14) reflects the fact that in γ₁ cases mammals of the first type are studied and the heterozygosity value follows distribution p₁(H_L), while in the remaining γ₂ cases mammals of the second type are studied and the heterozygosity follows p₂(H_L) distribution.

It can be easily shown in such a case that the variance $σ_{H_{L}}^{2}$ characterizing probability distribution p(H_L) (eqn 14) is equal to
$$\sigma_{H_{L}}^{2} = \gamma_{1} \sigma_{1}^{2}(L) + \gamma_{2} \sigma_{2}^{2}(L) + \gamma_{1} \gamma_{2} (H_{1} - H_{2})^{2}. \tag{15}$$

(To get expression (15) one needs to use the definitions of the mean and variance of heterozygosity, $\bar{H}_{L} \equiv \int_{0}^{1} H_{L}\, p(H_{L})\, dH_{L}$ and $\sigma_{H_{L}}^{2} \equiv \int_{0}^{1} (H_{L} - \bar{H}_{L})^{2} p(H_{L})\, dH_{L}$, and the well-known relation between the variance and mean of any random variable x, $\sigma_{x}^{2} = \overline{x^{2}} - \bar{x}^{2}$.)

With increasing L the first two terms in (eqn 15) decrease in inverse proportion to L, similarly to (eqn 8). This reflects the fact that with an increasing number of studied loci more accurate estimates of H₁ and H₂ are obtained inside the two groups of species. The third term is determined by the difference between the true heterozygosities of the two groups of species and does not depend on L. Thus, if the true heterozygosities of the two types of mammals are significantly different (i.e. the third term in eqn 15 is very large compared with the first two terms), then however large the number of studied loci L, it will not lead to any considerable decrease of $σ_{H_{L}}^{2}$. This is because studying more loci in one and the same species gives more information only about the mean value of heterozygosity for this particular species. It cannot make the estimate of the mean of all species more accurate if different species have different heterozygosities.
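The behaviour of the three terms can be illustrated with a short sketch. It assumes the standard variance decomposition for a two-component mixture, γ₁σ₁² + γ₂σ₂² + γ₁γ₂(H₁ − H₂)², and models each within-type variance as h_ltot · H_i / L. The heterozygosities 0.03 and 0.07 and the small samples used for the sanity check are hypothetical.

```python
import statistics


def mixture_variance(g1, H1, var1, H2, var2):
    """Variance of a two-type mixture: within-type terms plus between-type term."""
    g2 = 1.0 - g1
    return g1 * var1 + g2 * var2 + g1 * g2 * (H1 - H2) ** 2


# Sanity check against a pooled sample of two equally frequent types.
type_a, type_b = [0.02, 0.04], [0.06, 0.08]
pooled = statistics.pvariance(type_a + type_b)
by_formula = mixture_variance(0.5, 0.03, statistics.pvariance(type_a),
                              0.07, statistics.pvariance(type_b))

H_LTOT = 0.261           # mean heterozygosity of polymorphic loci (from the text)
H1, H2 = 0.03, 0.07      # hypothetical true heterozygosities of the two types


def total_variance(L, g1=0.5):
    """Mixture variance when each type's sampling variance is h_ltot * H_i / L."""
    return mixture_variance(g1, H1, H_LTOT * H1 / L, H2, H_LTOT * H2 / L)


floor = 0.5 * 0.5 * (H1 - H2) ** 2   # part that no number of loci can remove
for L in (20, 200, 2000):
    print(L, total_variance(L), floor)
```

As L grows the total variance approaches the constant floor instead of zero, which is the signature that would reveal genuinely different true heterozygosities.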

More generally, assuming that all species of mammals have different values of true heterozygosity, one can write for $σ_{H_{L}}^{2}$
$$\sigma_{H_{L}}^{2} = \lambda \frac{\bar{H}_{L}}{L} + \sigma_{0}^{2}, \tag{16}$$

where λ is an arbitrary constant. Note also that under our assumptions H_L does not depend on L either. σ²₀ is the constant term, which corresponds to the third term in eqn 15 and does not depend on L. Its value reflects the differences between values of true heterozygosity in different species. The greater are the differences, the greater is the value of σ²₀.

Linear regression of $σ_{H_{L}}^{2}$ on H_L/L (eqn 16) gives the following results:

The analysed set of heterozygosity values for the 411 populations of mammals (Fig. 1) is characterized by variance σ²_H equal to 0.0017. The obtained value of σ²₀ is negative, differs insignificantly from zero and is nearly an order of magnitude smaller in absolute value than σ²_H. The linear regression giving results (17) is characterized by a high correlation coefficient and is highly significant (P < 0.0001). Taken together, these results allow us to conclude that the observed decrease of $σ_{H_{L}}^{2}$ with increasing L is incompatible with a large value of σ²₀ and, therefore, with significant differences between the true heterozygosities of different mammalian species.

Note that in (eqn 9) we used a log-scale when studying the dependence of $\sigma_{H_{L}}^{2}$ on $\bar{H}_{L}/L$, whereas in (eqn 16) a linear scale is used. The fact that, in both cases, linear regression fits the data well (eqns 13, 17) is not a contradiction. The coefficient a in (eqn 9) that determines the power of the $\bar{H}_{L}/L$ term and might cause deviations from linearity in (eqn 16) is very close to unity, see (eqn 13), while the coefficient $\sigma_{0}^{2}$ in (eqn 16) that might distort the log-linearity in (eqn 9) is negligibly small, see (eqn 17). Results 13 and 17 actually support our prediction that $\sigma_{H_{L}}^{2}$ is directly proportional to $\bar{H}_{L}/L$ (eqn 8), because only such a dependence can be described by linear regression in both linear and log scales.

It is possible to get an idea of the real differences in values of true heterozygosity in different mammalian species using (eqn 15) and the obtained absolute value of σ²₀ (eqn 17). If a half of all mammalian species had true heterozygosity H₁ and the other half H₂ (i.e. γ₁=γ₂=0.5) then the difference H₁ – H₂ would be equal to
$$H_{1} - H_{2} = \sqrt{\frac{|\sigma_{0}^{2}|}{\gamma_{1} \gamma_{2}}} = 2 \sqrt{|\sigma_{0}^{2}|} \approx 0.028. \tag{18}$$

Thus, given the average heterozygosity H=0.051, values of true heterozygosity of the two types of mammals will differ from the average by not more than 30% ((0.028/2)/0.051 × 100% ≈ 30%).

Note that the result obtained imposes constraints on the variance of the heterozygosity values in the whole class of mammals. It does not exclude the possibility of existence of a small number of mammalian species with heterozygosities noticeably differing from the average. In other words, when either γ₁ or γ₂ is very small, see (eqn 15), the difference H₁ – H₂ may become noticeable. In particular, the result obtained does not contradict the statement that large mammals may have noticeably lower values of true heterozygosity compared to the average for the whole class (Wooten & Smith, 1985; Gorshkov & Makarieva, 1997), because species of large mammals constitute but a small part of all mammalian species. Mammals with their body size exceeding 1 m are hardly responsible for more than a few percent of the total number of mammalian species known (Chislenko, 1981; Eisenberg, 1981).
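How the admissible difference grows when one type of species is rare can be sketched as follows. The code treats |σ²₀| as the between-type term γ₁γ₂(H₁ − H₂)² of a two-type mixture and solves for H₁ − H₂. The numerical value of |σ²₀| is an assumption back-calculated from the difference of 0.028 quoted for γ₁ = γ₂ = 0.5, and the 3% share used for the rare type is purely hypothetical.

```python
import math


def type_difference(sigma0_sq: float, gamma1: float) -> float:
    """H1 - H2 implied by |sigma0^2| = gamma1 * gamma2 * (H1 - H2)^2."""
    gamma2 = 1.0 - gamma1
    return math.sqrt(abs(sigma0_sq) / (gamma1 * gamma2))


SIGMA0_SQ = (0.028 / 2) ** 2   # assumption: inferred from H1 - H2 = 0.028 at 50/50
H_MEAN = 0.051                 # average heterozygosity (from the text)

diff_half = type_difference(SIGMA0_SQ, 0.5)    # two equally frequent types
diff_rare = type_difference(SIGMA0_SQ, 0.03)   # hypothetical: 3% of species differ

# Deviation of each type from the mean in the 50/50 case, as a share of the mean.
share_half = (diff_half / 2) / H_MEAN
print(diff_half, diff_rare, share_half)
```

For the same |σ²₀| a rare group can deviate from the rest by several times more than two equally frequent groups can, which is why the result constrains the class as a whole rather than every species.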

(iii) Poisson distribution of polymorphic loci

It is possible to demonstrate the Poisson distribution of numbers of polymorphic loci (eqn 6) in different species directly. The only obstacle here is the small amount of data obtained at fixed values of L. The number of studied species is maximum when L=20. Twenty loci were studied in 45 populations of different species, which constitutes somewhat more than 10% of the total number of populations considered (411). In order to enlarge the dataset, species with 19 and 21 studied loci were added, so that the three most commonly used values of L (19, 20 and 21) were considered.

The total number of species studied was 82. For each species the total number of studied loci L, the total number of observed polymorphic loci l_sum, and the numbers of polymorphic loci belonging to four groups according to their heterozygosity values h were considered (Fig. 4).

Loci with the frequency of the most common allele equalling or exceeding 0.9 were not considered. The number of such loci strongly depends on the number of individuals studied. For example, the probability of not discovering the minor allele of a diallelic polymorphic locus with frequency of the minor allele q when sampling n diploid individuals is equal to δ_q=(1 – q)²ⁿ, which at n=15 gives δ_0.05=0.21, δ_0.01=0.74. It means that in samples of 15 individuals the number of polymorphic loci with allele frequencies of 0.05 and 0.01 are underestimated by 21 and 74%, respectively. Meanwhile, in large samples all these loci can be discovered. To minimize possible distortions caused by this effect such weakly polymorphic loci were excluded from the analysis.
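The sampling effect described above follows directly from δ_q = (1 − q)²ⁿ and can be reproduced in a few lines. The helper `required_sample_size`, which inverts the formula to find the smallest n that keeps the miss probability below a chosen threshold, is an addition for illustration and is not part of the original analysis.

```python
import math


def miss_probability(q: float, n: int) -> float:
    """Chance that none of the 2n sampled gene copies carries the minor allele."""
    return (1.0 - q) ** (2 * n)


def required_sample_size(q: float, max_miss: float) -> int:
    """Smallest number of diploid individuals with miss probability <= max_miss."""
    return math.ceil(math.log(max_miss) / (2.0 * math.log(1.0 - q)))


print(round(miss_probability(0.05, 15), 2))   # minor allele frequency 0.05
print(round(miss_probability(0.01, 15), 2))   # minor allele frequency 0.01
print(required_sample_size(0.05, 0.05))       # individuals needed for q = 0.05
```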

Figure 4 shows the results of approximation of the observed distributions of polymorphic loci belonging to different intervals of h by Poisson distribution. The goodness of approximation was estimated by the χ²-test. Figure 4 makes it clear that the null hypothesis of Poisson distribution cannot be rejected in any of the five cases.
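A goodness-of-fit check of this kind can be sketched as follows. This is a simplified illustration, not the procedure used for Fig. 4: the per-species counts in `hypothetical_l` are invented, the Poisson mean is estimated from the sample, counts of `max_bin` or more are pooled into one open bin, and the statistic is compared with the standard 5% critical value of χ² for three degrees of freedom (7.815).

```python
import math
from collections import Counter

CHI2_CRIT_DF3_5PCT = 7.815   # standard 5% critical value, 3 degrees of freedom


def poisson_chi_square(l_values, max_bin):
    """Chi-square statistic for a Poisson fit with bins 0 .. max_bin-1, max_bin+."""
    n = len(l_values)
    mean_l = sum(l_values) / n                      # Poisson mean estimated from data
    observed = Counter(min(l, max_bin) for l in l_values)
    expected = [n * math.exp(-mean_l) * mean_l ** l / math.factorial(l)
                for l in range(max_bin)]
    expected.append(n - sum(expected))              # open last bin: l >= max_bin
    chi2 = sum((observed.get(l, 0) - e) ** 2 / e for l, e in enumerate(expected))
    dof = len(expected) - 2                         # bins - 1 - one fitted parameter
    return chi2, dof, expected


# Hypothetical numbers of polymorphic loci in 82 species (not the real data).
hypothetical_l = [0] * 10 + [1] * 22 + [2] * 24 + [3] * 15 + [4] * 8 + [5] * 3
chi2, dof, expected = poisson_chi_square(hypothetical_l, max_bin=4)
fits_poisson = chi2 < CHI2_CRIT_DF3_5PCT
print(chi2, dof, fits_poisson)
```

Pooling the upper tail keeps the expected count in every bin reasonably large, which is the usual precondition for the χ² approximation to be trustworthy with a sample of this size.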

Discussion

In the present study it is shown that the variance of protein heterozygosity between different species of mammals is inversely proportional to the number of the studied loci. This fact suggests that there exists a certain value of protein heterozygosity that is common to most species of mammals.

According to the traditional approach, one would expect intraspecific genetic variability to vary greatly between mammals, since it is determined by numerous factors such as differences in mutation rates between loci within species, global mutation rate differences between species, differences in effective population or actual population size, differences in population histories (bottlenecks), differences in the intensity or mode of selection, etc. It is highly unlikely that such different factors would readjust in such a manner that they yielded the same degree of heterozygosity in all species. Hence, if all these factors are indeed significant for determination of intraspecific variability, then the observation of equal heterozygosity values in different species essentially means that all the above parameters are the same in all species. However, too many factors have to coincide for such an explanation to be considered as reliable.

A much more likely explanation seems to be that the above factors (some of which are definitely different between many populations of mammals, see below) do not actually affect the levels of variability as strongly as predicted by the majority of selection/mutation/drift models. This point can be illustrated by the following two simple figures displaying different selection profiles, Fig. 5(a,b). Variable n stands for the number of mutational substitutions in the genome, which can be easily related to heterozygosity.

Figure 5(a) presents a selection profile that is often referred to as ‘truncated selection’. The selection profile displayed in Fig. 5(b) is a more common profile, its particular properties in different areas of n-values depending on the mode of selection. In a real situation the mode of selection itself can be a function of n.

In a population where the selection profile conforms to Fig. 5(a), the equilibrium value of intraspecific genetic variability will be determined by n₀=Δn₁ irrespective of the actual values of mutation rate and effective population number. On the contrary, in a population conforming to Fig. 5b, the equilibrium value of genetic variability n₀ may vary significantly within a broad corridor Δn₁ < n₀ < Δn₁ + Δn₂ depending — in the simplest case — on values of mutation rate and effective population number.

The critical difference between the selection profiles 5a and 5b lies in the ratio of the width of the area where fitness changes significantly, Δn₂, to the width of the area where changes of fitness are relatively small, Δn₁. In Fig. 5a fitness changes abruptly, so this ratio is equal to zero: Δn₂/Δn₁=0 and Δn₂ ≪ Δn₁. On the contrary, for profile 5b Δn₂≫Δn₁ and Δn₂/Δn₁≫1.

If we begin to increase the Δn₁ part of profile 5b keeping the width of the Δn₂ part intact, we can finally arrive at a situation where the corridor of possible changes of the equilibrium value of n₀, Δn₁ < n₀ < Δn₁ + Δn₂, becomes relatively narrow as compared to the absolute value of n₀, although the absolute width of this corridor may remain arbitrarily large. In such a situation, factors determining the mutation/selection/drift balance will have only limited impact on the value of n₀, and, consequently, on the equilibrium value of intraspecific variability. This situation will effectively resemble case 5a, although if we again restrict our consideration to the Δn₁ < n₀ < Δn₁ + Δn₂ interval we will still detect the influence of the mutation/selection/drift factors. But on a larger scale their impact will be lost. In such a situation the equilibrium value of genetic variability will be determined by the value of Δn₁ with an accuracy of Δn₂/Δn₁.

An important point is that, due to the unprecedented complexity of organization of living objects and their interactions, it is virtually impossible to predict a real selection profile from a mathematical model alone, however complex the latter may be. Parameters of any models can only be deduced from empirical evidence. The study performed here suggests that the real selection profile in mammals is characterized by a relatively wide plateau and a relatively narrow corridor of significant changes of fitness. (In a number of studies a significant influence of newly acquired slightly deleterious mutations on organisms’ fitness has been observed (see, e.g. de Visser et al., 1997). This is consistent with the view that the equilibrium value of intraspecific genetic variability is located within a corridor where fitness changes sharply, see Fig. 5(b). Meanwhile, the fact that addition of a tiny number of slightly deleterious mutations (as compared to the number of mutational substitutions already present in the genome) results in a relatively sharp change of fitness, testifies that this corridor is indeed very narrow.)

The next question is what are the fundamental factors that determine the plateau’s width and make it nearly the same in all species of mammals? The existence of a common value of protein heterozygosity in most species of mammals can be explained if one considers intraspecific genetic variability as a result of a mutational process that erases meaningful genetic information of a species. The process of information erosion is limited by natural stabilizing selection when the amount of information lost reaches a certain value that is determined by the sensitivity of selection. This critical value is a fundamental characteristic of a species and represents the permissible level of information erosion in natural populations. Evolutionarily close groups of organisms may share this fundamental characteristic, which may be the reason for all mammalian species being characterized by similar values of heterozygosity, close to the average heterozygosity of the whole class, H=0.051.

In essence, selection is a process of measuring and comparing certain phenotypic traits of individuals that compete with each other within a population. A limit of sensitivity is a very general property of any process of measurement. For example, it is impossible to weigh small loads (e.g. milligrams) using scales that are calibrated in kilograms. Such scales do not ‘discern’ differences between small loads. Similarly, if the process of selection is ‘calibrated’ in thousands of mutational substitutions, selection will not be able to tell apart individuals that differ from each other by two or three hundred substitutions. Due to the continuous process of mutagenesis, all individuals will eventually accumulate a number of substitutions of the order of the value of selection sensitivity. (Note that sensitivity of selection cannot be quantified as a simple number of mutational substitutions tolerated in a genome. Rather, each substitution should be weighted according to the degree to which it affects the normal functioning of the organism. But it is possible to speak about an average number of mutational substitutions of average ‘deleteriousness’.)

The sensitivity of any process of measurement is determined by the properties of the applied instrument (e.g. scales) and the process itself. Similarly, sensitivity of selection is a fundamental property of the organization of life and cannot be quantified a priori. The present work provides an insight into the problem of quantitative analysis of this important characteristic of living organisms.

In addition to the general picture outlined above (Fig. 5a,b), it is possible to perform a more detailed quantitative analysis of one factor that has traditionally been considered decisive in determining the levels of intraspecific variability, namely the effective population size, N_e (Amos & Harwood, 1998). Because effective population sizes of various species are very difficult to estimate, information about this parameter is derived by comparing the actual population sizes of species (see, e.g. Nei & Graur, 1984).

The overwhelming majority of terrestrial mammalian species are characterized by average body masses of the order of m ≈ 1 kg or less, which corresponds to a linear body size of the order of x ≈ (m/ρ)^(1/3) ≈ 0.1 m, where ρ ≈ 10³ kg m^–3 is the living mass density, see Fig. 6, which presents data of Chislenko (1981) and Eisenberg (1981). Note the logarithmic scale of the density of species numbers N(y).

Thus, an average terrestrial mammal has a linear body size of the order of x ≈ 10 cm and is characterized by protein heterozygosity H ≈ 0.051. Let us for simplicity restrict our consideration to purely neutral variability using the well-known formula H=4ν_gN_e/(1 + 4ν_gN_e), where ν_g is the mutation rate per gene per generation. At H ≪ 1 (which is the case) the heterozygosity is simply proportional to ν_gN_e.
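The neutral formula above can be checked numerically. The following is only a minimal sketch in Python: the function names and the shorthand theta = 4ν_gN_e are illustrative assumptions, not notation used in the paper. It inverts the formula for the class average H = 0.051 and shows how close the linear approximation H ≈ 4ν_gN_e is at this value.

```python
def neutral_heterozygosity(theta: float) -> float:
    """Expected heterozygosity H = theta / (1 + theta), with theta = 4 * nu_g * N_e."""
    if theta < 0:
        raise ValueError("theta must be non-negative")
    return theta / (1.0 + theta)


def theta_from_heterozygosity(h: float) -> float:
    """Invert the neutral formula: theta = H / (1 - H), valid for 0 <= H < 1."""
    if not 0.0 <= h < 1.0:
        raise ValueError("heterozygosity must lie in [0, 1)")
    return h / (1.0 - h)


H_MEAN = 0.051  # average heterozygosity of mammals quoted in the text

# Value of 4 * nu_g * N_e implied by the mean heterozygosity.
theta = theta_from_heterozygosity(H_MEAN)

# Relative error made by the linear approximation H ~ theta.
relative_error = (theta - H_MEAN) / H_MEAN

print(f"theta = {theta:.4f}")
print(f"relative error of the linear approximation = {relative_error:.1%}")
```

At H = 0.051 the relative error of the linear approximation equals theta itself, i.e. roughly 5%, which is negligible next to the orders-of-magnitude differences in population numbers discussed in the text.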

It is known that for herbivorous mammals the cumulative species biomass B grows approximately in proportion to body size x (Damuth, 1981), B ∝ x. Since the body mass of an individual animal is proportional to the third power of its linear body size, m ∝ x³, the population number of a species decreases with body size as N = B/m ∝ x^–2. Thus, population numbers of large herbivorous mammals with x ≈ 1 m would be a hundred times smaller than those of average mammals with x ≈ 0.1 m. Turning to carnivores, which reside at the top of the trophic pyramid (Odum, 1983), we conclude that population numbers of large carnivores would be about 10 times lower than population numbers of herbivores of the same body size. (This is because the energy content of the production of animal biomass does not generally exceed 10% of the energy content of the consumed food.) Thus, the difference in population numbers between large carnivores and average mammals (which are small and herbivorous) amounts to three orders of magnitude. Note that this difference, caused by fundamental biochemical and ecological regularities, is unlikely to decrease during bottleneck events.
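The scaling argument of this paragraph can be written out as a short calculation. This is a sketch under the stated assumptions only (B ∝ x, m ∝ x³, and a trophic factor of 0.1 taken as the upper bound quoted in the text); the function and constant names are hypothetical and introduced for illustration.

```python
RHO = 1.0e3               # living mass density, kg per cubic metre (value from the text)
TROPHIC_EFFICIENCY = 0.1  # assumed upper bound: ~10% of food energy goes to animal production


def linear_body_size(mass_kg: float, density: float = RHO) -> float:
    """Linear body size x = (m / rho)^(1/3), in metres."""
    return (mass_kg / density) ** (1.0 / 3.0)


def relative_population_number(size_m: float,
                               reference_size_m: float = 0.1,
                               carnivore: bool = False) -> float:
    """Population number relative to an average herbivore of size reference_size_m.

    Uses N = B / m with B proportional to x and m proportional to x^3,
    so N scales as x^-2. Carnivores are reduced by a further trophic factor.
    """
    ratio = (size_m / reference_size_m) ** -2
    if carnivore:
        ratio *= TROPHIC_EFFICIENCY
    return ratio


average_size = linear_body_size(1.0)                               # 1 kg mammal
large_herbivore = relative_population_number(1.0)                  # x = 1 m
large_carnivore = relative_population_number(1.0, carnivore=True)  # x = 1 m, carnivore

print(f"linear size of a 1 kg mammal: {average_size:.2f} m")
print(f"large herbivore / average mammal: {large_herbivore:.0e}")
print(f"large carnivore / average mammal: {large_carnivore:.0e}")
```

The two ratios come out at about 10^–2 and 10^–3, matching the "hundred times smaller" and "three orders of magnitude" figures in the text.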

Were the effective population sizes to change proportionally to real population sizes, then, according to Nei’s formula, we would expect heterozygosity values in carnivores to be of the order of 0.051 × 10^–3 = 5.1 × 10^–5. In the sample of 411 populations considered in this study, carnivores accounted for 24 populations with an average heterozygosity value of H_carn = 0.032, which corresponds to 63% of the global average H̄ = 0.051.

Whatever the real dependence between the effective and actual population sizes may be, it is very unlikely that a change of three orders of magnitude in the real population numbers of mammalian species translates into only a 37% reduction of the effective population number. This calls for different explanations of the observed slight decrease of heterozygosity values in a small number of large mammalian species. Note also that this slight decrease does not distort the first-order effect of constant heterozygosity observed in this study.
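The mismatch between the neutral prediction and the observed carnivore heterozygosity can be made explicit with the numbers quoted above. The sketch below assumes, as the text does, the linear regime H ∝ ν_gN_e and a population ratio of 10^–3; the variable names are illustrative only.

```python
H_MEAN = 0.051           # global average heterozygosity (from the text)
H_CARNIVORES = 0.032     # average over the 24 carnivore populations (from the text)
POPULATION_RATIO = 1e-3  # large carnivores vs. average mammals (three orders of magnitude)

# Prediction if effective size scaled with real population size (linear regime H << 1).
predicted_h = H_MEAN * POPULATION_RATIO

# What is actually observed, relative to the global average.
observed_fraction = H_CARNIVORES / H_MEAN
observed_reduction = 1.0 - observed_fraction

# Factor by which the observation exceeds the proportional-scaling prediction.
excess_factor = H_CARNIVORES / predicted_h

print(f"predicted H for carnivores: {predicted_h:.1e}")
print(f"observed H as a share of the mean: {observed_fraction:.0%}")
print(f"observed reduction: {observed_reduction:.0%}")
print(f"observed / predicted: {excess_factor:.0f}")
```

Under these assumptions the observed value exceeds the proportional-scaling prediction by a factor of several hundred, which is the discrepancy the paragraph above refers to.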

In conclusion, I would like to add that although the present study has focused on allozyme variability in mammals, the proposed approach can be readily applied to analysis of significance of the observed interspecific differences in values of genetic variability at the nucleotide or even microsatellite level as well, provided a sufficiently large dataset for different species has been created.

References

Altukhov, I. U. P. and Dubrova, I. U. E. (1981). Biochemical polymorphism of populations and its biological significance. Prog Modern Biol (Uspekhi Sovremennoj Biologii), 91: 467–480 (in Russian).
Amos, W. and Harwood, J. (1998). Factors affecting levels of genetic diversity in natural populations. Phil Trans R Soc B, 353: 177–186.
Appolonio, M. and Hartl, G. B. (1993). Are biochemical-genetic variation and mating systems related in large mammals? Acta Theriol, 38 (Suppl. 2), 175–185.
Ayala, F. J. and Fitch, W. M. (1997). Genetics and the origin of species: an introduction. Proc Nat Acad Sci USA, 94: 7691–7697.
Burrows, W. and Ryder, O. A. (1997). Y-chromosome variation in great apes. Nature, 385: 125–126.
Butlin, R. K. and Tregenza, T. (1998). Levels of genetic polymorphism: marker loci versus quantitative traits. Phil Trans R Soc B, 353: 187–198.
Carson, H. L. (1990). Increased genetic variation after a population bottleneck. Trends Ecol Evol, 5: 228–230.
Chislenko, L. L. (1981). Structure of Flora and Fauna as Related to Body Size of Organisms. Moscow University Press, Moscow (in Russian).
Crespi, B. J. (1991). Heterozygosity in the haplodiploid Thysanoptera. Evolution, 45: 458–464.
Damuth, J. (1981). Population density and body size in mammals. Nature, 290: 699–700.
Eisenberg, J. (1981). The Mammalian Radiations: An Analysis of Trends in Evolution, Adaptation and Behaviour. Athlone Press, London.
Fuerst, P. A., Chakraborty, R. and Nei, M. (1977). Statistical studies on protein polymorphism in natural populations. I. Distribution of single locus heterozygosity. Genetics, 86: 455–483.
Gaines, M. S., Diffendorfer, J. E., Tamarin, R. H. and Whittam, T. S. (1997). The effects of habitat fragmentation on the genetic structure of small mammal populations. J Hered, 88: 294–304.
Gorshkov, V. G. and Makarieva, A. M. (1997). Dependence of heterozygosity on body weight in mammals. Proc Russian Acad Sci (Dokl Akad Nauk), 355: 418–421 (in Russian).
Graur, D. (1985). Gene diversity in Hymenoptera. Evolution, 39: 190–199.
Harris, H. and Hopkinson, D. A. (1972). Average heterozygosity per locus in man: an estimate based on the incidence of enzyme polymorphisms. Ann Hum Genet, 36: 9–19.
Hofker, M. H., Scraastad, M. I., Bergen, A. A. B., Wapenaar, M. C. et al. (1986). The X chromosome shows less genetic variation at restriction sites than the autosomes. Am J Hum Genet, 39: 438–451.
Imasheva, A. G. (1999). Environmental stress and genetic variation in animal populations. Genetika, 35: 421–431 (in Russian).
Kimura, M. (1991). The neutral theory of molecular evolution: a review of recent evidence. Jap J Genet, 66: 367–386.
Larson, A. (1981). A re-evaluation of the relationship between genome size and genetic variation. Am Nat, 118: 119–125.
Nei, M. (1984). Genetic polymorphism and neomutationism. Lecture Notes Biomath, 53: 214–241.
Nei, M. and Graur, D. (1984). Extent of protein polymorphism and the neutral mutation theory. Evol Biol, 17: 73–118.
Nevo, E., Beiles, A. and Ben-Shlomo, R. (1984). The evolutionary significance of genetic diversity: ecological, demographic and life history correlates. Lecture Notes Biomath, 53: 13–213.
O'Brien, S. J., Gail, M. H. and Levin, D. L. (1980). Correlative genetic variation in natural populations of cats, mice and men. Nature, 288: 580–583.
O'Brien, S. J., Wildt, D. E., Bush, M., Caro, T. M., Fitzgibbon, C., Aggundey, I. and Leakey, R. E. (1987). East African Cheetahs: evidence for two population bottlenecks? Proc Natl Acad Sci, 84: 508–511.
Odum, E. P. (1983). Basic Ecology. Saunders College Publications, New York.
Pierce, B. A. and Mitton, J. B. (1980). The relationship between genome size and genetic variation. Am Nat, 116: 850–861.
Thompson, R. C. and Lymbery, A. J. (1996). Genetic variability in parasites and host–parasite interactions. Parasitology, 112 (Suppl.), S7–S22.
de Visser, J. A., Hoekstra, R. F. and van Den Ende, H. (1997). An experimental test for synergistic epistasis and its application in Chlamydomonas. Genetics, 145: 815–819.
Ward, R. D., Skidinski, O. F. and Woodmark, M. (1992). Protein heterozygosity, protein structure and taxonomic differentiation. Evol Biol, 26: 73–159.
Wooten, M. C. and Smith, M. H. (1985). Large mammals are genetically less variable? Evolution, 39: 210–212.

Acknowledgements

I am extremely grateful to Dr Victor Gorshkov for setting the problem, for discussions and for inspiration; to Dr Laurence Hurst and Dr Jos van Damme for encouragement and invaluable comments on an early version of the paper; and to Dr Maria Filipucci and Dr Shiro Wada for providing me with sources of data unavailable in my country. Special thanks are due to two anonymous referees and Dr John Brookfield, whose detailed critical analysis of the paper has greatly assisted the author in clarifying many important issues. The work was supported by the Ministry of Natural Resources of the Russian Federation, the Research Support Scheme of the Open Society Support Foundation, grant No. 800/2000, and the European Science Foundation Scientific Programme on Theoretical Biology of Adaptation.

Author information

Anastassia M Makarieva
Present address: Petersburg Nuclear Physics Institute, 188300, Gatchina, St-Petersburg, Russia

Authors and Affiliations

Theoretical Biology Group, Collegium Budapest, Institute for Advanced Study, Szentháromság utca 2, Budapest, H-1014, Hungary
Anastassia M Makarieva

Corresponding author

Correspondence to Anastassia M Makarieva.

About this article

Cite this article

Makarieva, A. Variance of protein heterozygosity in different species of mammals with respect to the number of loci studied. Heredity 87, 41–51 (2001). https://doi.org/10.1046/j.1365-2540.2001.00899.x

Received: 22 October 1999
Accepted: 02 April 2001
Published: 01 July 2001
Issue date: 01 July 2001
DOI: https://doi.org/10.1046/j.1365-2540.2001.00899.x
