Introduction

Mapping quantitative trait loci (QTLs) has long been treated as a variable selection problem (Broman and Speed, 2002; Manichaikul et al., 2009) because the number of markers (predictors) can be larger than the sample size, making the ordinary least squares method infeasible. Ridge regression (Hoerl and Kennard, 1970) is one solution for handling relatively large regression models and has been applied to QTL mapping (Whittaker et al., 2000). However, the results of the usual ridge regression are not satisfactory because all regression coefficients are shrunken by the same shrinkage factor. Xu (2003) developed a Bayesian shrinkage method to estimate QTL effects, in which different regression coefficients are shrunken using different shrinkage factors. This kind of selective shrinkage analysis discriminates against small regression coefficients and favors large ones. As a result, it performs far better than the classical ridge regression. The original Bayesian shrinkage analysis of Xu (2003) was implemented through a Markov chain Monte Carlo sampling algorithm, which is time consuming for large models coupled with large sample sizes. Xu (2007) recently proposed an empirical Bayesian method to improve the computational efficiency while still preserving the desired sparseness of the final model.

In the empirical Bayesian method of Xu (2007), estimation of variance components is achieved by repeated calls to the Nelder and Mead (1965) simplex algorithm. This method applies only to numerically coded predictors and cannot be used in the many situations in which the predictors are discrete classification variables. For example, in QTL mapping of F2 populations that are derived from the cross of two inbred lines, there are three possible genotypes at each locus. We have to code the three genotypes numerically as 1, 0 and −1 to capture the additive effect (a), and as 0, 1 and 0 to capture the dominance effect (d). With other mapping populations, for example, the four-way cross (Xu, 1998), the numerical coding is more complicated. In association mapping, in which the number of genotypes may vary from one locus to another, an optimal numerical coding system may not even exist. Therefore, a method that can handle classification predictor variables is more general than the simplex algorithm adopted by Xu (2007). With the general method, we can directly estimate the genotypic values and their variances, and then convert the genotypic values into additive effects, dominance effects or any other effects of interest. Whereas the simplex algorithm in the empirical Bayesian method of Xu (2007) cannot handle classification predictor variables, the expectation–maximization (EM) algorithm (Dempster et al., 1977) can do so in a straightforward manner. Therefore, we propose an EM algorithm to estimate the variance components under this general setting of the predictors.

It is well known that the least absolute shrinkage and selection operator (Lasso; Tibshirani, 1996) estimation of regression coefficients has a Bayesian interpretation. When the variance parameter in the normal prior of each regression coefficient is assigned an exponential prior, the Bayesian posterior mode estimate of the regression coefficient is the Lasso estimate (Tibshirani, 1996; Park and Casella, 2008; Yi and Xu, 2008). With the simplex algorithm adopted by Xu (2007), extension to the Lasso estimate is not obvious, but such an extension is straightforward when an EM algorithm is applied. Similar EM algorithms have been proposed by Figueiredo (2003) and Yi and Banerjee (2009), who treated the variance components as missing values. In the proposed EM algorithm, we treat the regression coefficients as missing values when we estimate the variance parameters. This makes the estimates of regression coefficients empirical Bayesian estimates. As a result, the theory and methods of the classical mixed-effect model apply to the empirical Bayesian estimation of QTL effects.

Theory and methods

Model

Let y be an n × 1 vector for the phenotypic values of a quantitative trait, where n is the number of individuals in the mapping population. The linear model for y is

$$y = \sum_{j=1}^{q} X_j \beta_j + \sum_{k=1}^{p} Z_k \gamma_k + \varepsilon \qquad (1)$$

where βj is the jth non-QTL effect (for example, the year effect), Xj is the corresponding design matrix, γk is a vector of genotypic values for locus k and Zk is the corresponding incidence matrix determined by the genotypes of locus k. The dimensions of γk and Zk depend on the number of genotypes for locus k. The residual error vector ɛ is assumed to be distributed as ɛ ~ N(0, σ2In), where In is an n × n identity matrix and σ2 is an unknown residual error variance. We are interested in estimating all the nuisance parameters (β), the genotypic values for all QTLs (γ) and the prior variances of all QTL effects simultaneously from the same model. If we evaluate markers of the entire genome, the number of loci p can be very large and sometimes may be even larger than the sample size, although the number of non-QTL effects q can be relatively small. In this case, we need to adopt a shrinkage method to estimate γ, which are the most important parameters in QTL analysis.
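The two ways of building Zk for an F2 locus can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the integer genotype encoding (0=A1A1, 1=A1A2, 2=A2A2) are assumptions made for the example.

```python
import numpy as np

def dummy_incidence(genotypes, n_genotypes):
    """Return the n x m_k incidence matrix Z_k of 0/1 dummy variables.

    genotypes: integer codes 0..n_genotypes-1, one per individual
    (hypothetical encoding: 0=A1A1, 1=A1A2, 2=A2A2 in an F2).
    """
    genotypes = np.asarray(genotypes)
    Z = np.zeros((genotypes.size, n_genotypes))
    Z[np.arange(genotypes.size), genotypes] = 1.0  # one 1 per row
    return Z

def additive_coding(genotypes):
    """Return the n x 1 numerically coded column (1, 0, -1) for an F2."""
    codes = np.array([1.0, 0.0, -1.0])
    return codes[np.asarray(genotypes)].reshape(-1, 1)

geno = [0, 1, 2, 1]                  # four hypothetical individuals
Z_dummy = dummy_incidence(geno, 3)   # genotypic coding, 4 x 3
Z_add = additive_coding(geno)        # numerical coding, 4 x 1
```

With the dummy coding, γk holds one genotypic value per genotype; with the numerical coding, γk is a single additive effect.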

Prior distribution

Let mk be the number of genotypes at locus k. For example, in an F2 population, each locus has three possible genotypes, and thus, mk=3 for all k=1,…, p. The dimension of Zk is n × mk and the dimension of γk is mk × 1. We adopt the normal prior for γk, that is,

Under this prior, model (1) becomes a typical mixed model so that y has a multivariate normal distribution with mean μ and variance–covariance matrix V, where

and

Following Yi and Xu (2008), we consider two classes of prior for σk2. The first class is the scaled inverse χ2 prior, whose density is

In the scaled inverse χ2 distribution, τ and ω are hyperparameters representing the degree of prior belief and the scale. Two special cases of the scaled inverse χ2 distribution are particularly interesting, because they represent priors commonly used in data analysis. One special case is ξ=(τ, ω)=(−2, 0), which is equivalent to the uniform prior P(σk2) ∝ 1. This uniform prior leads to the usual maximum likelihood estimate of the variance component. The other special case is ξ=(τ, ω)=(0, 0), which represents the Jeffreys’ prior (Figueiredo, 2003), that is, P(σk2)=1/σk2. This prior does not have hyperparameters at all, and thus, is extremely convenient to use in real data analysis (Figueiredo, 2003).
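The two special cases can be checked against the usual kernel of the scaled inverse χ2 density. The parameterization below is an assumption, chosen because it reproduces both priors named above; the normalizing constant is omitted.

```latex
\begin{align*}
p(\sigma_k^2 \mid \tau, \omega) &\propto (\sigma_k^2)^{-(\tau+2)/2}
  \exp\!\left(-\frac{\omega}{2\sigma_k^2}\right) \\
(\tau,\omega) = (-2, 0):\quad p(\sigma_k^2) &\propto (\sigma_k^2)^{0} = 1
  && \text{(uniform prior)} \\
(\tau,\omega) = (0, 0):\quad p(\sigma_k^2) &\propto (\sigma_k^2)^{-1} = 1/\sigma_k^2
  && \text{(Jeffreys' prior)}
\end{align*}
```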

The second class of prior is the exponential prior,

where λ2 is the shrinkage factor (hyperparameter). This exponential prior will generate the Lasso estimation (Tibshirani, 1996; Park and Casella, 2008; Yi and Xu, 2008) of the QTL effects.

Posterior mode

Our EM algorithm treats γ as the missing value. This is different from the EM algorithm of Figueiredo (2003) and Yi and Banerjee (2009), who treated σk2 as the missing value. The EM steps will be given after we describe the formulas for the maximization steps and the expectation steps. The target function for maximization in our EM algorithm is the expected complete-data log likelihood function, in which the regression coefficients are treated as missing values. For the scaled inverse χ2 prior, the part of the expected complete-data log likelihood function relevant to σk2 is

where E(γkTγk)=E(γkTγk | θ, y) is a short notation for the conditional expectation of the quadratic term of γk, given the current values of parameters (θ) and the data (y). Setting the partial derivative of this function with respect to σk2 to zero and solving for σk2, we obtain

When ξ=(τ, ω)=(−2, 0), we have σk2=E(γkTγk)/mk, equivalent to the solution when a uniform prior is used (typical mixed model solution for a variance component). When ξ=(τ, ω)=(0, 0), we get σk2=E(γkTγk)/(2+mk), a stronger shrinkage than the uniform prior.
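The two special cases quoted above are consistent with a variance update of the form (E(γkTγk)+ω)/(τ+2+mk). The sketch below assumes that form (the function name and the numerical inputs are hypothetical) and reproduces both cases.

```python
def update_variance_inv_chi2(e_quad, m_k, tau, omega):
    """M-step update of sigma_k^2 under a scaled inverse chi-square prior.

    Assumed form: (E[gamma_k' gamma_k] + omega) / (tau + 2 + m_k).
    e_quad is the conditional expectation of the quadratic term.
    """
    return (e_quad + omega) / (tau + 2.0 + m_k)

e_quad, m_k = 6.0, 3  # hypothetical E-step output for a three-genotype locus
uniform = update_variance_inv_chi2(e_quad, m_k, tau=-2.0, omega=0.0)   # E / m_k
jeffreys = update_variance_inv_chi2(e_quad, m_k, tau=0.0, omega=0.0)   # E / (2 + m_k)
```

The larger denominator under the Jeffreys' prior is what produces the stronger shrinkage.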

For the exponential (Lasso) prior, the part of the expected complete-data log likelihood function relevant to σk2 is

Setting ∂L(σk2 | λ2)/∂σk2=0 and solving for σk2 leads to two solutions, with the positive one being
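The positive root can be derived as follows. This is a sketch that assumes the exponential prior has density proportional to exp(−λ2σk2/2), as in Park and Casella (2008); under a different parameterization the constants would change.

```latex
\begin{align*}
L(\sigma_k^2 \mid \lambda^2) &= -\frac{m_k}{2}\ln\sigma_k^2
  - \frac{1}{2\sigma_k^2}\,E(\gamma_k^{T}\gamma_k)
  - \frac{\lambda^2}{2}\,\sigma_k^2 \\
\frac{\partial L}{\partial \sigma_k^2} &= -\frac{m_k}{2\sigma_k^2}
  + \frac{E(\gamma_k^{T}\gamma_k)}{2\sigma_k^4} - \frac{\lambda^2}{2} = 0 \\
\Longrightarrow\quad 0 &= \lambda^2\sigma_k^4 + m_k\,\sigma_k^2 - E(\gamma_k^{T}\gamma_k) \\
\sigma_k^2 &= \frac{-m_k + \sqrt{m_k^2 + 4\lambda^2\,E(\gamma_k^{T}\gamma_k)}}{2\lambda^2}
\end{align*}
```

The other root of the quadratic is negative whenever E(γkTγk)>0 and is therefore discarded.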

Formulas for the fixed effects and residual variances follow the standard procedure of mixed model methodology (Lindstrom and Bates, 1988). For the fixed effects, we have

For the residual error variance, we use

where E(γk)=E(γk | θ, y) is a short notation for the conditional expectation of γk. Finding the posterior modes of the parameters belongs to the maximization steps. Note that these maximization steps depend on E(γk) and E(γkTγk), which are the conditional expectations of the linear and quadratic terms of the missing value.

Best linear unbiased prediction

The expectation of the quadratic term required in the maximization steps is expressed as

where

is the conditional expectation and

is the conditional variance of the missing vector γk. Derivations of equations (14) and (15) are given in Appendix A. Both the expectation and the variance depend on the parameters, and thus iterations are needed. Once the iterations converge, the conditional expectation E(γk) is called the best linear unbiased prediction (BLUP) and the square root of the variance var(γk) is called the prediction error of γk. However, BLUP is defined on the basis of true parameters, whereas the conditional expectation of γk after the iterations converge is conditional on estimated parameters. Technically, the conditional expectation given in equation (14) is therefore not a BLUP but an empirical Bayesian estimate. We will refer to the BLUPs of QTL effects as the estimated QTL effects subsequently, although they are predicted QTL effects under the mixed model framework.
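A minimal sketch of the expectation step is given below. The formulas in the docstring are the standard mixed-model results and are assumed here to match equations (13) to (15); the function name and the toy inputs are hypothetical.

```python
import numpy as np

def e_step(y, X, beta, Z_list, sigma2_k, sigma2):
    """Conditional mean, variance and expected quadratic form of each gamma_k.

    Assumed formulas (standard mixed-model results):
      V                   = sum_k Z_k Z_k' sigma_k^2 + I sigma^2
      E(gamma_k)          = sigma_k^2 Z_k' V^{-1} (y - X beta)
      var(gamma_k)        = I sigma_k^2 - sigma_k^2 Z_k' V^{-1} Z_k sigma_k^2
      E(gamma_k' gamma_k) = E(gamma_k)' E(gamma_k) + tr[var(gamma_k)]
    """
    n = y.shape[0]
    V = sigma2 * np.eye(n)
    for Z, s2 in zip(Z_list, sigma2_k):
        V += s2 * (Z @ Z.T)
    Vinv_resid = np.linalg.solve(V, y - X @ beta)
    means, variances, quads = [], [], []
    for Z, s2 in zip(Z_list, sigma2_k):
        mean = s2 * (Z.T @ Vinv_resid)
        var = s2 * np.eye(Z.shape[1]) - s2 * s2 * (Z.T @ np.linalg.solve(V, Z))
        means.append(mean)
        variances.append(var)
        quads.append(float(mean @ mean + np.trace(var)))
    return means, variances, quads

# Toy check: one locus with Z = I, no fixed effect, sigma_k^2 = sigma^2 = 1,
# so V = 2I, E(gamma) = y / 2 and var(gamma) = 0.5 I.
y = np.array([2.0, 4.0])
X = np.zeros((2, 1))
means, variances, quads = e_step(y, X, np.array([0.0]), [np.eye(2)], [1.0], 1.0)
```

In the toy case the expected quadratic term is 1 + 4 + tr(0.5 I) = 6.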

EM steps

Now let us define θ={β,σ2,σ12,…,σp2} as the parameter vector and ξ={τ, ω} or λ2 as the hyperparameters. The genotypic values γ are treated as missing values. The EM steps are described below.

Step (0) Choose ξ or λ2, set t=0 and initialize parameters with θ=θ(t).

Step (1) Calculate E(γkTγk) using equations (13, 14, 15), which is the E-step.

Step (2) Update θ using equations (8, 10, 11, 12), which is the M-step.

Step (3) Let t=t+1, and repeat Steps (1) and (2) until convergence is reached.
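The steps above can be sketched as a single loop. This is an illustration under stated assumptions rather than the authors' implementation: the variance updates follow the forms implied by the special cases in the text, the updates for β and σ2 are standard mixed-model expressions assumed to match equations (11) and (12), and the starting values, the convergence rule and the small floor on σ2 are arbitrary choices.

```python
import numpy as np

def em_shrinkage(y, X, Z_list, prior="inv_chi2", tau=0.0, omega=0.0,
                 lam2=1.0, max_iter=200, tol=1e-8):
    """EM sketch: gamma_k are missing values, theta = (beta, sigma2, sigma2_k)."""
    n, p = y.shape[0], len(Z_list)
    m = np.array([Z.shape[1] for Z in Z_list], dtype=float)
    # Step (0): starting values (arbitrary, not taken from the source).
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = float(np.var(y - X @ beta))
    sigma2_k = np.ones(p)
    means = [np.zeros(Z.shape[1]) for Z in Z_list]
    for _ in range(max_iter):
        old = np.concatenate([beta, [sigma2], sigma2_k])
        # Step (1), E-step: conditional mean and quadratic term of each gamma_k.
        V = sigma2 * np.eye(n)
        for Z, s2 in zip(Z_list, sigma2_k):
            V += s2 * (Z @ Z.T)
        Vinv = np.linalg.inv(V)
        Vinv_resid = Vinv @ (y - X @ beta)
        quads = np.empty(p)
        for k, (Z, s2) in enumerate(zip(Z_list, sigma2_k)):
            means[k] = s2 * (Z.T @ Vinv_resid)
            var_k = s2 * np.eye(Z.shape[1]) - s2 * s2 * (Z.T @ Vinv @ Z)
            quads[k] = means[k] @ means[k] + np.trace(var_k)
        # Step (2), M-step: QTL variances, fixed effects, residual variance.
        if prior == "inv_chi2":
            sigma2_k = (quads + omega) / (tau + 2.0 + m)
        else:  # Lasso: positive root of lam2 * s^2 + m * s - E = 0
            sigma2_k = (-m + np.sqrt(m**2 + 4.0 * lam2 * quads)) / (2.0 * lam2)
        fitted = sum(Z @ g for Z, g in zip(Z_list, means))
        beta = np.linalg.lstsq(X, y - fitted, rcond=None)[0]
        resid = y - X @ beta
        sigma2 = max(float(resid @ (resid - fitted)) / n, 1e-12)  # floor is a guard
        # Step (3): stop when the parameter vector no longer changes.
        new = np.concatenate([beta, [sigma2], sigma2_k])
        if np.max(np.abs(new - old)) < tol:
            break
    return beta, sigma2, sigma2_k, means

# Hypothetical toy data: 10 unlinked F2 loci, one additive QTL at index 3.
rng = np.random.default_rng(0)
n, p = 200, 10
G = rng.choice([1.0, 0.0, -1.0], size=(n, p), p=[0.25, 0.5, 0.25])
y = 10.0 + 2.0 * G[:, 3] + rng.normal(size=n)
X = np.ones((n, 1))
Z_list = [G[:, [k]] for k in range(p)]
beta, sigma2, sigma2_k, means = em_shrinkage(y, X, Z_list)  # Jeffreys' prior
effects = np.array([g[0] for g in means])
```

Inverting the n × n matrix V at every iteration is acceptable for a toy example of this size; for a real analysis a more economical computation would be preferable.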

Linear contrasts

The EM algorithm is described with γ being defined as the genotypic values, which are not equivalent to QTL effects. The QTL effects can be defined as linear contrasts or linear combinations of the genotypic values. There are two ways to obtain the QTL effects. The first is to recode matrix Z so that γ directly represents the QTL effects. For example, if Zk for the jth individual of locus k is coded as 1, 0 and −1 for the three genotypes, the corresponding γk would be the additive effect of QTL k. The second way of obtaining QTL effects is through linear contrasts of the genotypic values. Here Zk retains its original definition as a matrix of dummy variables so that γk represents a vector of genotypic values. In this case, an extra step is required to obtain the QTL effects after the EM algorithm converges. First, we need to obtain the BLUP and prediction error of γk. Second, we define coefficients of a linear contrast and use them to convert the estimated genotypic values into a QTL effect. For example, in the F2 line crossing example, the three components of γk represent the three genotypic values denoted by γk=[G11 G12 G22]T for the three genotypes (A1A1, A1A2 and A2A2). The coefficients of the linear contrast for the additive effect may be defined as Ha=[1/2 0 −1/2]T. The additive effect for QTL k is then defined as ak=HaTγk. Similarly, the dominance effect may be defined as dk=HdTγk, where Hd=[−1/4 1/2 −1/4]T. Defining H=Ha||Hd as the horizontal concatenation of matrices Ha and Hd (notation used in the SAS language), the QTL effects (including both the additive and the dominance effects) are obtained using the formula ηk=[ak dk]T=HTγk. The estimated QTL effects for locus k are then

with a variance–covariance matrix

The coefficients of linear contrasts, denoted by matrix H, can be defined in many different ways. It is up to the investigator to choose his/her own favorite scale. Therefore, the genotypic effect model is more flexible than the QTL effect model. Finally, it is possible to test the hypothesis H0:ηk=0 using the Wald test statistic

for each locus. Under the null hypothesis, the Wald test statistic follows approximately a χ2 distribution with two degrees of freedom. This allows us to calculate the P-value for each locus. For this reason, the Wald test statistic is often called the χ2 statistic.
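The conversion from genotypic values to QTL effects and the Wald test can be sketched as follows. The contrast vectors are those given in the text; the forms HT var(γk) H for the variance–covariance matrix and ηT var(η)−1 η for the Wald statistic are standard expressions assumed here, and the input values are hypothetical.

```python
import math
import numpy as np

# Contrast coefficients from the text (F2 example).
H_a = np.array([0.5, 0.0, -0.5])
H_d = np.array([-0.25, 0.5, -0.25])
H = np.column_stack([H_a, H_d])           # 3 x 2, columns are H_a and H_d

def qtl_effects_and_wald(gamma_hat, var_gamma):
    """Return (eta_hat, var_eta, wald, p_value) for one locus."""
    eta_hat = H.T @ gamma_hat              # [a_k, d_k]
    var_eta = H.T @ var_gamma @ H          # assumed form H' var(gamma_k) H
    wald = float(eta_hat @ np.linalg.solve(var_eta, eta_hat))
    p_value = math.exp(-wald / 2.0)        # chi-square upper tail, 2 d.f.
    return eta_hat, var_eta, wald, p_value

gamma_hat = np.array([2.0, 1.0, 0.0])     # hypothetical [G11, G12, G22]
var_gamma = 0.1 * np.eye(3)                # hypothetical prediction variance
eta_hat, var_eta, wald, p_value = qtl_effects_and_wald(gamma_hat, var_gamma)
```

The closed form exp(−W/2) is the upper-tail probability of a χ2 variable with exactly two degrees of freedom, so no statistics library is needed for this case.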

The variance of the prior distribution of the genotypic value is σk2 for the kth QTL. After the linear contrasts (combinations), the additive effect ak has a prior N(0, HaTHaσk2)=N(0, 1/2 σk2) and the dominance effect dk has a prior N(0, HdTHdσk2)=N(0, 3/8 σk2). For the contrasts Ha and Hd defined above, the prior covariance between ak and dk is HaTHdσk2=(−1/8+0+1/8)σk2=0, so the two effects remain uncorrelated a priori; a different choice of H may induce a nonzero prior covariance. The additive effects estimated using the allelic effect model and the genotypic effect model with linear contrast will not be affected by the coding (see results of simulations described later).

Simulation study

Experimental setup

We simulated a single large chromosome 2400 cM (centiMorgan) long, evenly covered by 481 codominant markers (5 cM per marker interval). The simulated population was an F2 family derived from the cross of two inbred lines with sample size n=500. The genotype indicator variable for individual j at locus k is defined as Zjk={1, 0, −1} for the three genotypes (A1A1, A1A2, A2A2), respectively. Dominance effects were not simulated and were not included in the model for this simulation experiment, but will be considered in a separate experiment presented later. A total of 20 QTLs were simulated, with the sizes and locations of the QTLs listed in Table 1. These parameter values were used to generate a quantitative trait with a population mean β=10.0 and a residual error variance σ2=10.0. The total genetic variance for the trait is

where rkk′ is the recombination fraction between QTLs k and k′, cov(Zk, Zk′)=var(Z)(1−2rkk′) is the covariance between Zk and Zk′ and var(Z)=1/2 is the variance of Z (assuming no segregation distortion). The total genetic variance for the quantitative trait is VG=VQ+VL=66.384, which is the sum of the genetic variances due to QTL (VQ) and covariance between linked QTLs (VL), where

and

Table 1 QTL parameters used in the simulation studies

The residual error variance for the trait is σ2=VE=10.0. Therefore, the total phenotypic variance is VP=VG+VE=76.384. The proportion of the genetic variance contributed by each QTL is 0.5γk2/VG for the kth QTL (given in the column headed with Prop-G in Table 1). The corresponding proportion of the phenotypic variance contributed by the kth QTL is 0.5γk2/VP and given in the column headed with Prop-P in Table 1. The true QTL effects are depicted in Figure 1, which will be used as the standard for comparison with estimated QTL effects using various model and prior setups.
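The variance decomposition can be sketched as below. The Haldane map function is an assumption (the map function used in the simulation is not stated here), and the effects and positions in the example are hypothetical rather than the values of Table 1.

```python
import numpy as np

def genetic_variance(effects, positions_cM, var_z=0.5):
    """Return (V_Q, V_L, V_G) for additive QTL effects on one chromosome.

    Assumes the Haldane map function, r = 0.5 * (1 - exp(-2 d)) with d in
    Morgans, so that cov(Z_k, Z_k') = var_z * (1 - 2 r) = var_z * exp(-2 d).
    """
    effects = np.asarray(effects, dtype=float)
    pos = np.asarray(positions_cM, dtype=float) / 100.0   # cM -> Morgans
    dist = np.abs(pos[:, None] - pos[None, :])
    cov = var_z * np.exp(-2.0 * dist)                     # diagonal equals var_z
    v_g = float(effects @ cov @ effects)                  # total genetic variance
    v_q = float(var_z * np.sum(effects**2))               # own-locus terms
    return v_q, v_g - v_q, v_g

# Two hypothetical QTLs of effect 1.0 at the same position (complete linkage).
v_q, v_l, v_g = genetic_variance([1.0, 1.0], [0.0, 0.0])
prop_g = 0.5 * 1.0**2 / v_g   # proportion of V_G contributed by one QTL
```

With complete linkage the covariance term equals the own-locus term, so VQ=1, VL=1 and VG=2 in this toy case.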

Figure 1

True quantitative trait loci (QTL) effects (additive model) and locations of QTLs in a simulated genome of 2400 cM (centiMorgan) in length.

Allelic effect model

Under the allelic effect model, we numerically coded the three genotypes with Zk={1, 0, −1} for the three genotypes {A1A1, A1A2, A2A2}. The QTL effects were directly estimated without taking linear contrasts of the genotypic values. For 481 markers, the Z matrix has a dimensionality of 500 × 481. Three different priors were chosen for this data analysis: (1) ξ=(τ, ω)=(−2, 0) representing uniform prior for σk2; (2) ξ=(τ, ω)=(0, 0) representing the Jeffreys’ prior for σk2; (3) the Lasso prior λ2=5.1758. This particular Lasso prior value was chosen using the following empirical method,

More information about this empirical Lasso parameter will be discussed later. The results for the three different priors are presented in graphical form because a table cannot easily show all the small estimated QTL effects. The results are depicted in Figure 2, showing that the Jeffreys’ prior appears to be better than the Lasso prior, but both are better than the uniform prior. The QTL effect profile of the Jeffreys’ prior mimics the true QTL effect profile (see Figure 1) more closely than the other two priors. Compared with the Jeffreys’ prior, the Lasso prior tends to split major QTL effects into a few small effects in the neighborhood of the true QTL. Therefore, the Lasso-estimated QTL effect profile tends to have many small ‘bumps’ along the genome.

Figure 2

Estimated additive effects and locations of quantitative trait loci (QTLs) for the simulated data under the allelic effect model. The uniform prior is equivalent to ξ=(τ, ω)=(−2, 0). The Jeffreys’ prior is equivalent to ξ=(τ, ω)=(0, 0). The Lasso prior is λ2=5.1758.

We used the mean squared error (MSE) of the estimated QTL effects to further evaluate the performance of the three priors. The MSE is defined as

for the scaled inverse χ2 prior and

for the Lasso prior, where γkInv–χ2 is the BLUP value obtained under the scaled inverse χ2 distribution, γkLasso is the BLUP value obtained under the Lasso prior distribution and γk is the true value. The MSE comparison shows that MSE(−2, 0)=0.351129659, MSE(0, 0)=0.034842259 and MSE(5.1758)=0.033882049. Therefore, the Jeffreys’ prior and the Lasso prior perform equally well, and both are better than the uniform prior. The noisy signals of the Lasso prior have not increased the MSE compared with the Jeffreys’ prior. In fact, they have improved (decreased) the MSE slightly.

Genotypic effect model

The same data set was also analyzed using the genotypic effect model, in which the Z matrix was coded as dummy variables. For 481 markers, the Z matrix has 481 × 3=1443 columns, and thus, 1443 genotypic values were estimated. To compare this analysis with the allelic effect model, we used linear contrast Ha (described earlier) to convert the three genotypic values of each locus into an additive effect. The dominance effects, however, were not simulated (zero effects for all loci). Again, the three priors chosen in the allelic effect model analysis were used here, that is, ξ=(τ, ω)=(−2, 0), ξ=(τ, ω)=(0, 0) and λ2=4.786525. The results are almost duplicates of the allelic effect model. The additive effect profiles for the three priors are almost the same as those obtained in the allelic effect model (data not shown). The estimation errors are also very close for the two models (data not shown). The MSEs of the three priors are MSE(−2, 0)=0.417594, MSE(0, 0)=0.0682055 and MSE(4.786525)=0.031560243, respectively. The Lasso prior appeared to perform slightly better than the Jeffreys’ prior. The genotypic effect model and the allelic effect model can be used interchangeably for QTL mapping in line crosses. For line crossing experiments such as BC and F2, there is no advantage in using the genotypic effect model except that this model provides estimated genotypic values so that investigators can directly interpret the results regarding which parent is carrying the ‘high’ or ‘low’ allele at each locus.

Simulation with dominance effects

To examine the efficiency of the EM algorithm for estimating the dominance effects, we simulated another data set with all other settings being the same as the simulated data set described before, except that we added six dominance effects to the genome. The sizes and the locations of the dominance effects are depicted in Figure 3a (the upper panel). For simplicity, we only report the result for the Jeffreys’ prior ξ=(τ, ω)=(0, 0) under the genotypic effect model. The estimated additive effects and the dominance effects are depicted in Figure 3b (the lower panel). The estimated genotypic values and other relevant information for the data analysis are presented in Table 2. We used âk=HaTγ̂k and d̂k=HdTγ̂k to convert the genotypic values γk into additive (a) and dominance (d) effects. The variance–covariance matrix of the estimated QTL effects is then calculated and used to generate the Wald test statistic and the P-value using

where Pχ2−1 denotes the inverse of the χ2 distribution function with two degrees of freedom. We used an arbitrary cutoff point to determine the ‘significance’ of each locus using P-value <0.01 as the criterion of significance. The Wald test statistics and the P-values are listed in Table 2 for all the 24 simulated loci. All but four of the 24 loci were detected. The four loci that failed to reach the cutoff P-value are markers 123, 127, 243 and 270. Markers 123 and 127 are 20 cM apart from each other and each had an additive effect of 1.1 but with opposite signs. Marker 243 had an additive effect of a=−1.0, explaining only 0.65% of the phenotypic variance. Marker 270 had an additive effect of a=1.0, also explaining only 0.65% of the phenotypic variance. In fact, this marker is only 10 cM apart from marker 268, which had an additive effect of a=1.58. The effect of marker 270 was absorbed by marker 268, because the estimated effect of marker 268 is a=2.147, slightly less than 2.58=1.58+1.0 (sum of the additive effects of the two loci).

Figure 3

Additive and dominance quantitative trait loci (QTL) effects in the second simulation experiment. The upper panel shows the true QTL effects and the lower panel shows the estimated QTL effects under the genotypic effect model with the Jeffreys’ shrinkage prior. The blue needles with diamonds represent the additive effects and the red needles with triangles represent the dominance effects.

Table 2 Estimated genotypic values of the three genotypes (A1A1, A1A2 and A2A2), and the corresponding additive and dominance effects of QTL obtained under the genotypic effect model using the Jeffreys’ shrinkage prior ξ=(τ, ω)=(0, 0)

Alternative values of hyperparameters

For the same simulated data set without dominance effects (described in the experimental setup section), we chose a few alternative hyperparameters for the scaled inverse χ2 distribution and a few alternative Lasso parameters to evaluate the performance of the new method. We only evaluated the allelic effect model because it is simpler and faster. For the scaled inverse χ2 prior, we first let ξ=(τ, ω)=(τ, 0) and only varied τ from 0 to −1, decremented by 0.1. This type of prior is proper and was suggested by ter Braak et al. (2005). In addition, we let ξ=(τ, ω)=(−0.5, ω) and varied ω from 0 to 1, incremented by 0.1. For the Lasso prior, we chose λ2 in the neighborhood of λ2=5.1758 (empirical value obtained earlier for this data set) ranging from 1 to 10, incremented by 1. We used the MSE to evaluate the performance of the method under various hyperparameter values. The MSEs of these priors are presented in Table 3. For the set of priors in the ξ=(τ, ω)=(τ, 0) series (Prior I), the minimum MSE occurs at τ≈−0.6. For the set of priors in the ξ=(τ, ω)=(−0.5, ω) series (Prior II), the minimum MSE occurs at ω≈0.05. A slight increase of ω will dramatically increase the MSE. Therefore, 0<ω<0.1 seems to be optimal. For the Lasso priors (Prior III), the minimum MSE occurs when 6.0≤λ2≤10.0. The empirical value of λ2=5.1758 is not far away from the optimal values. Note that these optimal hyperparameters are sample specific and may not be generalized to other samples. More discussion on the optimal hyperparameters will be presented later.

Table 3 The mean squared error (MSE) of alternative prior choice for the simulated data set reported in the ‘experimental setup’ section under the allelic effect model

Power and false-positive rate

The Bayesian methods presented here can be reinterpreted for classical power analysis using replicated simulation experiments. In this section, we used the same QTL parameters given in Table 1 and the same experimental setup to simulate 100 additional samples for power analysis. We used the allelic effect model to estimate parameters and calculate the test statistics. As we only considered the additive effects, the test statistic for each locus is defined as the squared QTL effect divided by the squared prediction error of the estimated QTL effect. Under the null hypothesis, this test statistic approximately follows a χ2 distribution with one degree of freedom. This allows us to calculate the P-value for each locus. We chose 0.01 as the threshold for the P-value to determine the significance of a locus with a QTL effect and the false-positive status of a locus with no QTL effect. In other words, if a QTL has a P-value <0.01 in a particular replication, the QTL is claimed to be detected in that replication, and the proportion of the replicates in which the QTL is detected out of the 100 replications is the empirical statistical power for that QTL. As the power was evaluated for each QTL, the false-positive rate (FPR) should also be defined in a locus-specific manner. A locus with no QTL effect is labeled false positive if the P-value is smaller than the 0.01 threshold. The FPR of the non-QTL locus is then defined as the proportion of the replicates labeled as false positive out of the 100 replications. The FPR is also called the Type I error rate. We simulated 20 QTLs out of 481 loci. The distance between any consecutive loci is 5 cM. We observed that the effect of a QTL failing to be detected was very often picked up by a marker in the neighborhood. If a neighboring marker reaches the significance level, this QTL is also claimed to be detected. Therefore, for every true QTL, three consecutive loci (with the true QTL in the center) are claimed as QTLs. A non-QTL is defined as a locus that is separated by at least one neutral marker from a true QTL.
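The locus-specific power and FPR definitions above can be sketched as follows. The function name, the array layout (replicates in rows, loci in columns) and the small example are assumptions made for illustration.

```python
import numpy as np

def power_and_fpr(pvalues, qtl_index, alpha=0.01):
    """Empirical power per QTL and FPR per non-QTL locus.

    pvalues: replicates x loci array of locus-wise P-values.
    qtl_index: column indices of the true QTLs.
    """
    pvalues = np.asarray(pvalues)
    n_loci = pvalues.shape[1]
    significant = pvalues < alpha
    power = {}
    near_qtl = np.zeros(n_loci, dtype=bool)
    for q in qtl_index:
        lo, hi = max(q - 1, 0), min(q + 1, n_loci - 1)
        near_qtl[lo:hi + 1] = True
        # Detected if the QTL or either immediate neighbor is significant.
        power[q] = float(significant[:, lo:hi + 1].any(axis=1).mean())
    # Non-QTL loci are at least two markers away from every true QTL.
    fpr = {k: float(significant[:, k].mean())
           for k in range(n_loci) if not near_qtl[k]}
    return power, fpr

# Four hypothetical replicates, six loci, one true QTL at index 2.
pvals = [[0.5, 0.001, 0.5, 0.5, 0.5, 0.5],   # neighbor significant: detected
         [0.5, 0.5, 0.001, 0.5, 0.5, 0.5],   # QTL itself significant: detected
         [0.5, 0.5, 0.5, 0.5, 0.5, 0.001],   # false positive at locus 5
         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]
power, fpr = power_and_fpr(pvals, qtl_index=[2])
```

In this example the QTL is detected in two of four replicates, and loci 1 and 3 are excluded from the FPR calculation because they flank the true QTL.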

The 100 replicated samples were analyzed using three different priors (methods): the Lasso method (the Lasso parameter was empirically estimated), the Jeffreys’ method (the Jeffreys’ prior was used) and the method of Xu (2007) implemented with the Nelder and Mead (1965) simplex algorithm. The three methods are denoted as Lasso, Jeffreys and NM, respectively. For some reason, the NM method cannot handle the Jeffreys’ prior. Therefore, ξ=(τ, ω)=(−0.5, 0.05) was used as the prior for the NM method. The average estimated QTL effects for all the 481 loci over the 100 replications are depicted in Figure 4 for all three methods. The heights of the needles represent the average estimated QTL effects. The empirical statistical powers (numerical values) for the loci are placed at the tips of the needles in Figure 4. The three methods have similar powers, with the Jeffreys method slightly better than the Lasso method, which is slightly better than the NM method. Figure 5 shows the corresponding biases of the estimated QTL effects for the three methods. The biases are typically between −0.6 and 0.6. Two loci show large biases for the Jeffreys’ prior, from −0.8 to 0.8. The Bayesian shrinkage method is expected to be biased. The biases observed from the repeated simulation experiments are not too serious compared with the actual values of the QTL effects.

Figure 4

Average estimated quantitative trait loci (QTL) effects over 100 replicated simulations and the empirical statistical powers for the simulated QTL. The upper panel shows the results of the Jeffreys’ prior. The panel in the middle shows the results of the Lasso prior. The lower panel shows the results of the Nelder–Mead (NM) algorithm with hyperparameter ξ=(τ, ω)=(−0.5, 0.05). The heights of the needles represent the average estimated QTL effects. The numbers at the tip (end) of the needles represent the statistical powers measured in percentage, for example, the integer 99 represents 99%.

Figure 5

The biases (average estimated quantitative trait loci (QTL) effect minus true QTL effect) of the estimated QTL effects for the three methods. The upper, middle and lower panels represent the Jeffreys’ method, the Lasso method and the Nelder–Mead (NM) method, respectively. The true QTL positions are marked with the ticks of the horizontal axis.

Figure 6 presents the FPR profiles for the three methods. Most of the non-QTLs have zero FPR. A small percentage of the loci have one false positive out of the 100 replications. For the Jeffreys’ method, one locus has 6% FPR, six loci have 3% FPR and 14 loci have 2% FPR. The largest FPR occurs near a true QTL position with a small effect. The Lasso method has one locus with 3% FPR and two loci with 2% FPR. The NM method has the lowest FPR. Overall, all the three methods have quite low FPR.

Figure 6

The false-positive rate (FPR) profiles for the three methods. The upper, middle and lower panels represent the Jeffreys’ method, the Lasso method and the Nelder–Mead (NM) method, respectively. The quantitative trait loci (QTL) positions are marked with the red ticks.

The average numbers of iterations required to converge were 23.51, 15.96 and 11.81 for the Lasso, Jeffreys and NM methods, respectively. The corresponding total computing times for completing the analysis of 100 replications were 128 min (Lasso), 89 min (Jeffreys) and 100 min (NM). The longer computing time for the Lasso method was due to the large number of iterations required for the program to converge. The average estimated QTL parameters along with the estimated population mean and residual variance obtained from 100 replicated simulations are provided in the supplemental material for interested readers. The original simulated data sets are also given in the supplemental material.

Real data analysis

We used a real data set from recombinant inbred lines of Arabidopsis (Loudet et al., 2002) as an example to show the application of the method. The two parents initiating the line cross were Bay-0 and Shahdara, with Bay-0 as the female parent. The recombinant inbred lines were actually F7 progeny derived by single-seed descent (selfing) from the F2 plants. The residual heterozygosity was low (Loudet et al., 2002). Flowering time was recorded for each line in two environments: long day (16 h photoperiod) and short day (8 h photoperiod). We used the short-day flowering time as the quantitative trait for QTL mapping. The two parents had very little difference in short-day flowering time. The sample size (number of recombinant lines) was 420. A couple of lines did not have phenotypic records, and their phenotypic values were replaced by the population mean for convenience of data analysis. A total of 38 microsatellite markers were used for QTL mapping. These markers are more or less evenly distributed along five chromosomes with an average of 10.8 cM per marker interval. The marker names and positions can be found in the original article (Loudet et al., 2002).

We inserted a pseudo marker in every 2 cM of the genome. With the inserted pseudo markers, the total number of loci subject to analysis is 200 (38 true markers plus 162 pseudo markers). All the 200 putative loci were evaluated simultaneously in a single model. Therefore, the model for the short-day flowering time trait is

where X is a 420 × 1 vector of unity, β is the population mean (intercept), Zk is a 420 × 1 vector coded as 1 for one genotype and 0 for the other genotype for locus k. If locus k is a pseudo marker, Zk=Pr(genotype=1), which is the conditional probability of marker k being of genotype 1. Finally, γk is the QTL effect of locus k. We only used the allelic effect model for the real data analysis.

The data were analyzed using three different priors, (1) ξ=(τ, ω)=(−2, 0) corresponding to the uniform prior, (2) ξ=(τ, ω)=(0, 0) representing the Jeffreys’ prior and (3) the Lasso prior with λ2=3.2739. The estimated QTL effects are depicted in Figure 7. The Jeffreys’ prior (the panel in the middle of Figure 7) produced the cleanest signals of QTL effects. Four QTLs were detected in three chromosomes. The uniform prior (the panel at the top of Figure 7) and Lasso prior (the panel at the bottom of Figure 7) also produced four peaks corresponding to the same positions as those detected by the Jeffreys’ prior. However, additional signals also occur for these two priors. The estimated QTL effects and QTL positions along with the t-test statistics and other information under the Jeffreys’ prior are given in Table 4.

Figure 7

Estimated effects and locations of quantitative trait loci (QTLs) for the trait of short-day flowering time of Arabidopsis (Loudet et al., 2002). The five chromosomes are merged into a single genome and separated by the dotted green reference lines. The upper panel represents the results using the uniform prior. The panel in the middle represents the results using the Jeffreys’ prior. The lower panel gives the results of the Lasso prior with λ2=3.2739.

Table 4 The estimated QTL parameters for the Arabidopsis data using the Jeffreys’ prior under the allelic effect model

We also performed interval mapping on the short-day flowering time trait. The results are depicted in Figure 8. The results for chromosomes 1, 2, 3 and 4 agree well with our Bayesian analysis. However, interval mapping cannot separate the two QTLs on chromosome 5. Detailed results of interval mapping can be found in the original study (Loudet et al., 2002).

Figure 8

Results of interval mapping of quantitative trait loci (QTLs) for the Arabidopsis short-day flowering time trait. The upper panel shows the LOD (logarithm (base 10) of odds) score profile. The lower panel shows the estimated QTL effect profile.

Discussion

The EM algorithm developed in this study is not a new method of QTL mapping. It is an alternative algorithm for finding the empirical Bayesian estimates of QTL effects. All properties of the empirical Bayesian method of Xu (2007) implemented through the simplex algorithm apply to the EM algorithm. These properties (for example, dealing with epistatic effects) have been investigated by Xu (2007) and thus were not further explored in the current study. The advantages of the EM algorithm over the simplex algorithm are the flexibility to handle both the allelic effect model and the genotypic effect model, and the ability to deal with the Lasso prior. Although the simplex method in general can handle genotypic effect models, the fast algorithm to invert the variance matrix described by Xu (2007) cannot be applied, because that algorithm only holds for the allelic effect model, in which each regression coefficient has its own variance. Another advantage of the EM algorithm is the transparency of its formulation, as opposed to the simplex algorithm, which makes the EM algorithm much easier to program. Like any other EM algorithm, ours has its own limitation: convergence is slow when the parameters are near the local optimum. Therefore, the simplex algorithm adopted in the original empirical Bayes method (Xu, 2007) still has its value in terms of fast convergence and robustness to the initial values.

The empirical Bayesian estimation of QTL effects is a kind of posterior mode estimation, and thus, is different from the fully Bayesian estimation implemented through the MCMC algorithm (Xu, 2003; Wang et al., 2005). If the Markov chain is sufficiently long, results of the MCMC sampling would be better than the posterior mode estimation. However, the posterior mode estimation is a quick method to achieve the results that are almost as good as the fully Bayesian estimation. For the same simulated data, the EM algorithm took about 1 min to complete the estimation, whereas the MCMC-implemented sampling algorithm took about one-half hour (data not shown). In addition, our experience showed that the Jeffreys’ prior usually performs well compared with other hyperparameter values. However, the Jeffreys’ prior is improper in the sense that a marginal posterior distribution of σk2 does not exist (ter Braak et al., 2005). Although we are not interested in σk2 per se, but use σk2 as a shrinkage factor to control the estimate of γk, an improper posterior σk2 always presents a warning signal regarding the convergence of the chain. Theoretically, all parameters should converge to the stationary distribution to validate the MCMC algorithm. The posterior mode estimation does not have such a concern.

An obvious question with the posterior mode estimation is how to choose the hyperparameter ξ=(τ, ω) or λ2. We have noticed that the hyperparameter plays a large role in the final estimates of QTL effects. A common way of choosing the hyperparameter is to use a cross-validation test. Tibshirani (1996) in the original Lasso method took a fivefold cross-validation approach. We can adopt the same cross-validation method to help determine the optimal hyperparameter. If desired, cross-validation can be conducted by the users, because standard x-fold cross-validation is straightforward and easy to program. However, using cross-validation to determine the optimal parameter may also have its own problems. For example, the optimal Lasso parameter λ2 may depend on both the sample size and the dimensionality of the model. Assume that we decide to use the recommended fivefold cross-validation to determine the optimal λ2. The optimal value found in the fivefold cross-validation may not be optimal at all if a threefold cross-validation is performed. What is the optimal x in the x-fold cross-validation? Suppose that fivefold cross-validation is the choice and we do not want to use any other number of folds. The optimal λ2 is then, in fact, only optimal for a sample size of 4n/5, whereas our sample size is actually n. Such questions may keep coming one after another.
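The mechanics of such an x-fold cross-validation can be sketched as follows. This is only an illustration: the function `shrunken_slope` is a hypothetical one-predictor, ridge-type stand-in for the actual model fit, the folds are interleaved rather than randomly assigned, and all names are assumptions rather than part of the method described here.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds of near-equal size."""
    return [list(range(fold, n, k)) for fold in range(k)]

def shrunken_slope(x, y, lam):
    """Hypothetical stand-in for the model fit: slope shrunken by a penalty lam."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / (sxx + lam)
    return mean_y - slope * mean_x, slope

def cv_error(x, y, lam, k=5):
    """Mean squared prediction error of held-out records for one value of lam."""
    n = len(y)
    squared_error = 0.0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        # Each fit uses only the records outside the current fold, so with
        # k = 5 the training size is 4n/5 rather than n.
        train = [i for i in range(n) if i not in held_out]
        intercept, slope = shrunken_slope(
            [x[i] for i in train], [y[i] for i in train], lam
        )
        squared_error += sum(
            (y[i] - intercept - slope * x[i]) ** 2 for i in test_idx
        )
    return squared_error / n

def choose_lambda(x, y, grid, k=5):
    """Return the candidate value with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k))
```

Calling `choose_lambda` with a different `k` on the same data may return a different candidate, which is the dependence on the number of folds discussed above. In practice the fold assignment would usually be randomized.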

If one decides not to use a cross-validation to determine the hyperparameters, we offer the following suggestions based on our own experience of data analyses. The scale parameter ω in ξ=(τ, ω) can be set to zero or close to zero, say 0.001, and thus, we only have one hyperparameter τ to worry about. We should start with the Jeffreys’ prior ξ=(τ, ω)=(0, 0) and then choose an improved value from there. A cross-validation can be used to evaluate a few alternative values around τ=0. Given that the algorithm is computationally efficient, a wide range of values of τ can be evaluated within a short period of time.

The Lasso prior should be found using the cross-validation method suggested by Tibshirani (1996). By trial and error, however, we found that equation (22) usually is a good choice for the Lasso parameter. Let $\bar{\sigma}^2$ be the average of the QTL variance components. The empirical Lasso prior is simply $\lambda^2=\sqrt{2/\bar{\sigma}^2}$. Intuitively, when all QTLs have very large variance components, the average should also be large, and thus the Lasso prior should be small (little shrinkage). If all QTLs have small variance components, the average should also be small, leading to strong shrinkage. If we treat $\lambda^2$ as an unknown parameter and estimate it through maximization of the expected complete-data log likelihood function, the solution would be $\lambda^2=2/\bar{\sigma}^2$. However, this value did not work, because the shrinkage was so strong that all regression coefficients were shrunken to zero. Its square root worked just fine, although we can provide no theoretical justification for it. We used this empirical shrinkage parameter for the simulated data (500 individuals and 481 markers) and found that the optimal value of $\lambda^2$ was in the range between 6 and 10. It turned out that the empirical value of $\lambda^2=5.1758$ is not far from that optimal range.
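The remark about maximizing the expected complete-data log likelihood can be made concrete with a short derivation. It is a sketch under an assumption not restated in this section: each QTL variance follows the exponential prior $p(\sigma_k^2\mid\lambda^2)=\tfrac{\lambda^2}{2}\exp(-\lambda^2\sigma_k^2/2)$, with $p$ denoting the number of loci and the current variance estimates treated as given.

```latex
\begin{align}
Q(\lambda^2) &= \sum_{k=1}^{p}\left[\ln\frac{\lambda^2}{2}
                - \frac{\lambda^2}{2}\,\sigma_k^2\right], \\
\frac{\partial Q}{\partial \lambda^2}
             &= \frac{p}{\lambda^2} - \frac{1}{2}\sum_{k=1}^{p}\sigma_k^2 = 0
  \quad\Longrightarrow\quad
  \hat{\lambda}^2 = \frac{2p}{\sum_{k=1}^{p}\sigma_k^2}
                  = \frac{2}{\bar{\sigma}^2},
  \qquad \bar{\sigma}^2 = \frac{1}{p}\sum_{k=1}^{p}\sigma_k^2 .
\end{align}
```

Under this assumption, if the reported empirical value of 5.1758 for the simulated data is the square root of the maximizing value, the maximizing value itself would be about $5.1758^2 \approx 26.79$, well above the range of 6 to 10 found to be optimal, which is consistent with the observation that it shrinks too strongly.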

Programming the EM algorithm developed in this study is straightforward if one follows the EM steps described earlier. Alternatively, users can download the SAS/IML code that we used to analyze the simulated data. The SAS/IML code (EM-Lasso), along with the data, is posted on our website (www.statgen.ucr.edu). Skilled SAS users may use PROC MIXED and PROC IML interactively, with a SAS MACRO to drive the iterative process. We can use PROC MIXED to calculate β and γ, with variance parameters held at the values provided in a SAS data set. PROC MIXED is extremely efficient in estimating β and predicting γ. PROC IML can be used to calculate the variance components using the predicted γ and their standard errors generated by PROC MIXED. The calculated variance components are stored in a SAS data set, which in turn is read by PROC MIXED as the input parameter values. Finally, we can use a SAS MACRO to connect the two procedures iteratively and call the macro to obtain the EM estimates of QTL effects. There is a newly released mixed model procedure in SAS called PROC HPMIXED. This new procedure is a simplified version of PROC MIXED, designed for speed. We can replace PROC MIXED by PROC HPMIXED to improve the computational efficiency.
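The alternation between the two procedures can also be sketched outside SAS. The pure-Python sketch below is illustrative only and is not the posted SAS/IML code. It assumes the allelic effect model, solves Henderson's mixed model equations for the step played by PROC MIXED, and uses assumed update forms, σk2 = (E(γk2) + ω)/(τ + 3) and a residual variance of y'(y − Xβ − Zγ)/n, for the step played by PROC IML. These forms should be checked against the EM steps described earlier, and all function names are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def em_posterior_mode(y, Z, tau=0.0, omega=0.0, n_iter=200, floor=1e-8):
    """Alternate between (beta, gamma) given variances and variances given gamma."""
    n, m = len(y), len(Z[0])
    W = [[1.0] + list(row) for row in Z]          # [X Z], X = column of ones
    WtW = [[sum(W[i][a] * W[i][b] for i in range(n)) for b in range(m + 1)]
           for a in range(m + 1)]
    Wty = [sum(W[i][a] * y[i] for i in range(n)) for a in range(m + 1)]
    mean_y = sum(y) / n
    sigma2_e = sum((v - mean_y) ** 2 for v in y) / n   # starting value
    sigma2 = [1.0] * m                                 # starting values
    for _ in range(n_iter):
        # Step played by PROC MIXED: solve the mixed model equations with the
        # variance parameters held at their current values.
        C = [row[:] for row in WtW]
        for k in range(m):
            C[k + 1][k + 1] += sigma2_e / sigma2[k]
        C_inv = invert(C)
        sol = [sum(C_inv[a][b] * Wty[b] for b in range(m + 1))
               for a in range(m + 1)]
        beta, gamma = sol[0], sol[1:]
        # Step played by PROC IML: update each variance from the predicted
        # effect and its prediction error variance (assumed update form).
        for k in range(m):
            second_moment = gamma[k] ** 2 + sigma2_e * C_inv[k + 1][k + 1]
            sigma2[k] = max((second_moment + omega) / (tau + 3.0), floor)
        fitted = [beta + sum(Z[i][k] * gamma[k] for k in range(m))
                  for i in range(n)]
        sigma2_e = sum(y[i] * (y[i] - fitted[i]) for i in range(n)) / n
    return beta, gamma, sigma2, sigma2_e
```

The small positive `floor` is a design choice of this sketch: it keeps the ratio of residual to QTL variance finite when a locus variance shrinks towards zero. The dense inverse is only practical for a handful of loci; a real analysis would rely on a numerical linear algebra library or on the mixed model procedures named above.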

Finally, association study for quantitative traits involves no new statistical methods beyond the methods presented for linkage studies. The two only differ by the populations used for marker analysis. Association study uses randomly selected individuals from a target population for mapping. As a result, the inference space is the entire population from which the individuals are sampled. Linkage study, however, uses all individuals from the same family of line cross, and thus, the inference space is only the two lines initiating the cross. Association study can narrow down the actual genes because of cumulative historical recombinants, whereas the linkage study cannot unless the sample size is extremely large. The EM algorithm developed here can be used for both linkage study and association study, except that the fixed effects in the association study should be designed so that they can capture population admixture and other complicated factors unique to association study. The genotypic effect model is more useful than the allelic effect model in association study, because the number of genotypes per locus may vary from one locus to another. When the number of genotypes per locus is very large, linear contrasts for QTL effect conversion are not easy to define. In this case, association of marker k with a trait is actually indicated by the estimated value of σk2.
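For the simplest case of three genotypes per locus, as in an F2 population, the conversion from estimated genotypic values to QTL effects reduces to two standard linear contrasts. The sketch below assumes the usual definitions of the additive effect as half the difference between the two homozygotes and of the dominance effect as the deviation of the heterozygote from their midpoint. The function and argument names are hypothetical.

```python
def f2_contrasts(g_hom1, g_het, g_hom2):
    """Convert three genotypic values into additive and dominance effects."""
    additive = (g_hom1 - g_hom2) / 2.0
    dominance = g_het - (g_hom1 + g_hom2) / 2.0
    return additive, dominance

# Illustrative genotypic values only, not estimates from the data analyzed here.
additive, dominance = f2_contrasts(12.0, 11.0, 8.0)
```

With many genotypes per locus there is no comparably natural set of contrasts, which is why the estimated variance of a locus serves as the indicator of association in that case.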