Introduction

Plants compete for above- and below-ground resources. The competition intensity will depend on several factors such as the species, density and resource availability. In plant breeding, competition may distort the selection process. This is because competitive plants may impair the capacity of sensitive neighbors to express their full genetic potential. Furthermore, selecting only aggressive individuals may decrease the correspondence between what was observed in trials and the actual performance in commercial farms (Besag and Kempton 1986; Kempton 1982). These are consequences of overlooking competition and selecting based only on the portion due to direct genetic effects (DGE) of the total heritable variation (Bijma 2011; Walsh and Lynch 2018, chapter 22). A possible solution is to consider the underlying competition capacity of selection candidates. This is done by taking the indirect genotypic effects (IGE) into account during the selection process (Bijma et al. 2007; Muir 2005).

IGE can be seen as the influence that the genetic makeup of one individual has on the phenotypic performance of its surrounding neighbors. Evidence suggests that competition ability is a quantitative genetic trait (Bailey and Desjonquères 2022; Griffing 1967; Sakai 1955). Under this scenario, the phenotypic value of an individual v, surrounded by n neighbors, is given by \({P}_{v}={g}_{v}+\mathop{\sum }\nolimits_{k = 1}^{n}{c}_{k}+{e}_{v}\), with gv being the DGE, ck being the IGE of the kth neighbor, and ev being the non-heritable terms (Bijma 2014), instead of just Pv = gv + ev, in traditional modeling approaches. Given that each candidate has its own competition capacity and that genetic competition is relevant to phenotypic expression (i.e., IGE are significant), the most appropriate selection criterion is the summation \({g}_{v}+\hat{\phi }{c}_{v}\), known as total genotypic (or genetic, if kinship is considered) value (TGV), with \(\hat{\phi }\) being the mean competition intensity factor (see “Methods” for more details), and cv the IGE of candidate v (Costa e Silva and Kerr 2013; Ferreira et al. 2023). Note that if cv ≠ 0, the merit of a selection candidate will be different when comparing models with and without competition. In other words, competition causes change in candidates’ ranking, and, ultimately, in the selection.

Even with the relevant insights related to IGE in animal and plant breeding so far (Bijma 2010; Costa e Silva et al. 2017; Kempton and Lockwood 1984; Muir 2005), its statistical modeling is not straightforward. The main challenge is the construction of a competition matrix (IGE’s incidence matrix), as there are many ways to obtain it and each methodology yields different results. Another difficulty is the incorporation of IGE into a linear mixed model, where it is important to ensure both DGE and IGE are estimated properly. The estimation of competition effects in the genetic part of the model can be done in combination with spatial components so that the residual and genetic competitions are properly calculated (Stringer et al. 2017, 2011).

Here, we propose a new resource called gencomp, an R (R Core Team 2024) package that leverages the average information algorithm implemented in ASReml-R (Gilmour et al. 1995; The VSNi Team 2023) to facilitate the use and fitting of competition models in plant breeding programs. Currently, methods for considering competition in tree and crop breeding are implemented. In addition, the package has functions to: (i) fit (spatial-) genetic competition models, (ii) define competition classes, (iii) compute the total genotypic value and (iv) calculate the total heritable variation. Specifically for tree breeding, gencomp can also predict the performance of commercial forest stands composed of clonal mixtures, based on clones’ competition capacity (Ferreira et al. 2023).

gencomp is designed to make (spatial-) genetic competition models easily accessible, so that plant breeders can take full advantage of this information in decision-making. In this paper, we aimed to show how to properly use gencomp for both tree and crop breeding. Using gencomp, we also aimed to illustrate how considering genetic competition effects (along with spatial trends) can change genetic and non-genetic parameters and, most importantly for plant breeders, the final decision: the selection. This paper is organized as follows: first, we describe the Methods implemented within gencomp, highlighting the Theory behind them. Then, using two Motivating examples, one for tree breeding and another for crop breeding, we use the Results and Discussion sections to (i) describe the package's pipeline, from building the competition matrices to extracting the main results of the model; and (ii) compare the outputs of linear mixed models with and without modeling spatial trends and genetic competition.

Methods

Theory

Quantitative genetics base of Griffing (1967)

Following Griffing (1967), and using two individuals (1 and 2) in a group, a gene model that considers the competition effects (or associative effects, as Griffing called them) can be described as follows:

$$\begin{array}{lll}{u}_{{A_1a}_{1}}\;=\;{\alpha }_{{A}_{1}}+{\alpha }_{{a}_{1}}+{\delta }_{{A}_{1}{a}_{1}}+{\alpha }_{{A}_{2}}+{\alpha }_{{a}_{2}}+{\delta }_{{A}_{2}{a}_{2}}+{(\alpha \alpha )}_{{A}_{1}{A}_{2}}+{(\alpha \alpha )}_{{A}_{1}{a}_{2}}+{(\alpha \alpha )}_{{a}_{1}{A}_{2}}\\\qquad\quad\;\, +\,{(\alpha \alpha )}_{{a}_{1}{a}_{2}}\,+{(\alpha \delta )}_{{A}_{1}{A}_{2}{a}_{2}}+{(\alpha \delta )}_{{a}_{1}{A}_{2}{a}_{2}}+{(\delta \alpha )}_{{A}_{1}{a}_{1}{A}_{2}}+{(\delta \alpha )}_{{A}_{1}{a}_{1}{a}_{2}}+{(\delta \delta )}_{{A}_{1}{a}_{1}{A}_{2}{a}_{2}}\end{array}$$

where \({u}_{{A_1a}_{1}}\) is the genetic value of locus A in individual 1, \({\alpha }_{{A}_{1}}\) is the direct additive effect of allele A in individual 1 (the same applies to allele a), and \({\delta }_{{A}_{1}{a}_{1}}\) is the direct dominance effect of alleles A and a in individual 1. The effects described so far measure the DGE of individual 1. Proceeding, \({\alpha }_{{A}_{2}}\) is the competition additive effect of allele A in individual 2 as measured on alleles of individual 1 (the same applies to \({\alpha }_{{a}_{2}}\)), and \({\delta }_{{A}_{2}{a}_{2}}\) is the competition dominance effect of Aa in individual 2 as measured on A1a1. These effects relate to the indirect genetic effects (IGE) of individual 2 to which individual 1 is susceptible. The other “epistatic” effects are additive × additive, additive × dominant, and dominant × dominant interactions between direct and competition effects. This model is the basis of the variance-component approach (also known as genotypic or treatment interference approach) to model IGE, described in the next topics.

A linear mixed model

Let the starting point be a field trial with N plots wherein V candidates (v = 1, 2, …, V) are assessed. It is assumed that it has R rows and C columns, with each candidate replicated J times (j = 1, 2, …, J). For clarity, each plot is composed of a single plant. A typical linear mixed model fitted to this trial is described as follows:

$${\bf{y}}={\bf{X}}{\boldsymbol{\beta }}+{{\bf{Z}}}_{1}{\bf{u}}+{{\bf{Z}}}_{2}{\bf{p}}+{\mathbf{\varepsilon }}$$
(1)

where y is an N × 1 vector of phenotypic records, organized sequentially according to row and then column nested within row; β is the vector of fixed effects, u is the V × 1 vector of random genotypic effects, p is the vector of non-genetic random effects, and ε is the N × 1 vector of random residuals. The capital letters X, Z1 and Z2 are the incidence matrices that connect their respective vectors to y. Depending on the context, p can be absent from Equation (1), and the replication (block) effects can be random or fixed. The random effects are assumed to follow a multivariate Gaussian distribution with zero means and variance matrices Gu, Gp, and R. Gu and R are equal to \({\sigma }_{u}^{2}{{\bf{I}}}_{V}\) and \({\sigma }_{\varepsilon }^{2}{{\bf{I}}}_{N}\), respectively. IV can be replaced by a kinship matrix (based on pedigree or genomic data). The form of Gp depends on the context. Still, it usually follows the same structure as Gu and R, i.e., a variance component multiplied by an identity matrix whose order depends on the dimension of p.

The spatial component

The most popular way of addressing spatial trends in plant breeding is by modeling the residual covariance matrix with a separable first-order autoregressive structure (Gilmour et al. 1997). Under this scenario, \({\bf{R}}={\sigma }_{\varepsilon }^{2}\left({{\boldsymbol{AR1}}}_{R}\otimes {{\boldsymbol{AR1}}}_{C}\right)\), where AR1R and AR1C are R × R and C × C first-order autocorrelation matrices for rows and columns, respectively; and \(\otimes\) is the direct (Kronecker) product. Note that R × C = N, so \({\sigma }_{\varepsilon }^{2}\left({{\boldsymbol{AR1}}}_{R}\otimes {{\boldsymbol{AR1}}}_{C}\right)\) takes the place of \({\sigma }_{\varepsilon }^{2}{{\bf{I}}}_{N}\) as the structure of R. One may also want to fully distinguish between spatially dependent and independent errors (η). In this case, the independent errors are assumed to follow \(\eta \sim N({\bf{0}},{\sigma }_{\eta }^{2}{{\bf{I}}}_{N})\).
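To make the separable structure concrete, the following minimal Python sketch builds R for a small grid. It is an illustration only, not the ASReml-R implementation; the function names (ar1, kron, spatial_R) and the parameter values are hypothetical.

```python
def ar1(n, rho):
    """n x n first-order autocorrelation matrix: element (i, j) is rho**|i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


def kron(a, b):
    """Direct (Kronecker) product of two matrices stored as lists of lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def spatial_R(n_rows, n_cols, rho_r, rho_c, sigma2):
    """sigma2 * (AR1_R (x) AR1_C), with plots ordered by row, columns nested within rows."""
    k = kron(ar1(n_rows, rho_r), ar1(n_cols, rho_c))
    return [[sigma2 * v for v in row] for row in k]


# Hypothetical trial with 2 rows and 3 columns (N = 6 plots)
R = spatial_R(n_rows=2, n_cols=3, rho_r=0.5, rho_c=0.2, sigma2=1.0)
# Plot (row 1, col 1) vs plot (row 2, col 3): 0.5**1 * 0.2**2
print(R[0][5])
```

The residual correlation between two plots is the product of the row and column autocorrelations raised to the respective lags, which is what makes the structure separable.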

The genetic competition component

Once spatial trends are defined, we can focus on modeling the genetic competition. For this purpose, we make u a 2V × 1 vector, with \({{\bf{u}}}^{{\prime} }=({{\bf{g}}}^{{\prime} },{{\bf{c}}}^{{\prime} })\), where g and c are the V × 1 vectors of DGE and IGE, respectively. Consequently, Z1 becomes a N × 2V matrix containing an incidence matrix for DGE effects \(\left[{{\bf{Z}}}_{g}^{(N\times V)}\right]\) and another for IGE effects—known as competition matrix \(\left[{{\bf{Z}}}_{c}^{(N\times V)}\right]\), i.e., Z1 = (Zg, Zc) (Costa e Silva et al. 2013; Stringer et al. 2011). Under this notation:

$${\rm{Var}}({\bf{u}})=\left[\begin{array}{cc}{\sigma }_{g}^{2}&{\sigma }_{gc}\\ {\sigma }_{gc}&{\sigma }_{c}^{2}\end{array}\right]\otimes {{\bf{I}}}_{V}$$
(2)

where σgc is the covariance between g and c. An alternative parameterization disregards σgc and takes g and c as two orthogonal effects in the model (Ferreira et al. 2023).

The model differentiates g and c based on their incidence matrices. Zg is a regular design matrix, i.e., with ones in the corresponding position of a given plot and zero elsewhere. On the other hand, Zc is filled in the corresponding positions of the genotypes neighboring a given plot. For instance, a plot has the genotype “G7” neighbored by “G6” and “G9” in the horizontal direction, “G5” and “G9” in the vertical direction, and “G4”, “G8”, “G5” and “G4” in the diagonal directions (Fig. 1a). For crop breeding, Zc will be an incidence matrix with ones in the positions corresponding to the genotypes neighboring “G7” in the horizontal (“G6” and “G9”) or vertical directions (“G5” and “G9”), and zero elsewhere (Stringer et al. 2011). For tree breeding, due to the large area occupied by a single tree and large spacing between trees, the standard procedure is to compute the directional competition intensity factors for each direction by filling the positions corresponding to the genotypes neighboring “G7” in the respective row of Zc. gencomp has three options to calculate these factors:

  • MU: In this method proposed by Muir (2005), the competition intensity factors are the inverse of the distance between the focal individual and its neighbors in the diagonal, row, and column directions:

    $$\left\{\begin{array}{l}{f}_{d}=\frac{1}{\sqrt{{{\mathcal{D}}}_{r}^{2}+{{\mathcal{D}}}_{c}^{2}}} \\ {f}_{r}=\frac{1}{{{\mathcal{D}}}_{r}} \\ {f}_{c}=\frac{1}{{{\mathcal{D}}}_{c}}\quad \end{array}\right.$$
    (3)
  • CC: In this method proposed by Cappa and Cantet (2008), the distance and the number of neighbors in each direction are considered. This method assumes that the distance between the focal individual and its neighbors in the row is the same as the distance between the focal individual and its neighbors in the column:

    $$\left\{\begin{array}{l}{f}_{d}=\frac{1}{\sqrt{2\left({n}_{c}+{n}_{r}\right)+{n}_{d}}}\quad \\ {f}_{r}=\sqrt{\frac{2}{2\left({n}_{c}+{n}_{r}\right)+{n}_{d}}}\quad \\ {f}_{c}=\sqrt{\frac{2}{2\left({n}_{c}+{n}_{r}\right)+{n}_{d}}}\quad \end{array}\right.$$
    (4)
  • SK: This method, proposed by Costa e Silva and Kerr (2013), considers the number of neighbors, the distance between the focal individual and its neighbors, and the difference between distances in the row and column directions:

    $$\left\{\begin{array}{l}{f}_{d}=\frac{p}{\sqrt{\left({n}_{r}{p}^{4}\right)+\left({n}_{r}{p}^{2}\right)+\left({n}_{c}{p}^{2}\right)+\left({n}_{d}{p}^{2}\right)+{n}_{c}}} \\ {f}_{r}={f}_{d}\sqrt{1+{p}^{2}} \\ {f}_{c}=\frac{{f}_{d}\sqrt{1+{p}^{2}}}{p}\quad \end{array}\right.$$
    (5)
Fig. 1: Details about genetic competition models implemented in gencomp.

a Illustration of a field trial with contiguous blocks. The numbers followed by “G” represent genotypes, and followed by “P” represent plots. The plot sequence determines the direction of data collection, which will influence the construction of the competition matrix. b Illustration of a single trial divided into two areas by a geographical barrier (represented by a stream), a scenario that can occur in large tree breeding trials.

For all of the above methods, fr, fc and fd are the directional competition intensity factors for a given plot (i.e., a given row of Zc) in the row, column and diagonal directions, respectively; nr, nc and nd are the number of neighbors in the row, column and diagonal directions, respectively; \({{\mathcal{D}}}_{r}\) and \({{\mathcal{D}}}_{c}\) are the distances between the focal individual and its neighbors in the row and column directions, respectively; and \(p={{\mathcal{D}}}_{c}/{{\mathcal{D}}}_{r}\). Note that when the distances between the focal individual and its neighbors in the row and column directions are the same, SK = CC. Table 1 contains the key factors that differentiate the methods implemented in gencomp. Appendix A details the construction of Zc based on the example of Fig. 1a.
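As an illustration of Equations (3) to (5), the sketch below computes the three directional factors for a single focal plot. It is a direct transcription of the formulas, not the code used inside gencomp; the function name intensity_factors and the example spacings are assumptions.

```python
import math


def intensity_factors(method, d_r, d_c, n_r, n_c, n_d):
    """Return (f_r, f_c, f_d) for one focal plot.

    d_r, d_c: distances to the row and column neighbors
    n_r, n_c, n_d: numbers of row, column and diagonal neighbors
    """
    if method == "MU":      # Equation (3): inverse distances
        f_r = 1.0 / d_r
        f_c = 1.0 / d_c
        f_d = 1.0 / math.hypot(d_r, d_c)
    elif method == "CC":    # Equation (4): equal row and column spacing assumed
        denom = 2 * (n_c + n_r) + n_d
        f_d = 1.0 / math.sqrt(denom)
        f_r = f_c = math.sqrt(2.0 / denom)
    elif method == "SK":    # Equation (5): p = D_c / D_r
        p = d_c / d_r
        denom = n_r * p**4 + n_r * p**2 + n_c * p**2 + n_d * p**2 + n_c
        f_d = p / math.sqrt(denom)
        f_r = f_d * math.sqrt(1 + p**2)
        f_c = f_r / p
    else:
        raise ValueError("method must be 'MU', 'CC' or 'SK'")
    return f_r, f_c, f_d


# Interior plot: 2 row, 2 column and 4 diagonal neighbors (illustrative spacings)
mu = intensity_factors("MU", 2.0, 3.0, 2, 2, 4)
sk = intensity_factors("SK", 2.0, 3.0, 2, 2, 4)
cc = intensity_factors("CC", 2.0, 2.0, 2, 2, 4)
sk_equal = intensity_factors("SK", 2.0, 2.0, 2, 2, 4)
```

With equal row and column spacing (p = 1), the SK factors reduce to the CC factors, in line with the note in the text.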

Table 1 Information on the methods implemented in gencomp to build the competition matrix: the type of plant breeding, reference, considered direction, whether inter-row and inter-column distances are treated as the same or different, and the estimator of the mean competition intensity factor (\(\hat{\phi }\)).

Once Zc is built and the model is solved, the DGE \((\hat{{\bf{g}}})\) and IGE \((\hat{{\bf{c}}})\) are estimated. With these values, we can compute the total genotypic value (TGV) as:

$${TGV}_{v}={\hat{g}}_{v}+\hat{\phi }\times {\hat{c}}_{v}$$
(6)

where \(\hat{\phi }=1\) for crop breeding, or \(\hat{\phi }={\bar{n}}_{r}{\bar{f}}_{r}+{\bar{n}}_{c}{\bar{f}}_{c}+{\bar{n}}_{d}{\bar{f}}_{d}\) for tree breeding. In the latter case, \(\hat{\phi }\) is the mean competition intensity factor. In crop breeding, \(\hat{\phi }=1\) because there is no distance-based weighting, i.e., the genetic competition effects are not weighted by the distance between plots. In this case, the full contribution of IGE to the total heritable variance is considered (Bijma 2014). When the competition effects are significant, the TGV is the most appropriate selection criterion, as selection based solely on the DGE (which is equivalent to selecting based on the BLUPs in a typical linear model) may yield biased results, which can be reflected in the ranking (see the subsection “Competition affects the selection” in “Results” for more details). The significance of competition effects can be assessed via the likelihood ratio test.
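The short Python sketch below illustrates Equation (6) and the tree-breeding estimator of the mean competition intensity factor. All numbers are invented for illustration, and the function names (mean_intensity, tgv) are hypothetical rather than part of gencomp.

```python
def mean_intensity(n_bar, f_bar):
    """phi_hat = n_r*f_r + n_c*f_c + n_d*f_d, using mean counts and mean factors."""
    return sum(n_bar[k] * f_bar[k] for k in ("r", "c", "d"))


def tgv(g_hat, c_hat, phi=1.0):
    """Total genotypic value per candidate: g_hat + phi * c_hat (Equation 6)."""
    return {v: g_hat[v] + phi * c_hat[v] for v in g_hat}


# Tree-breeding style phi_hat from made-up mean neighbor counts and mean factors
phi_tree = mean_intensity({"r": 2, "c": 2, "d": 4}, {"r": 0.5, "c": 0.25, "d": 0.125})

# Crop-breeding style example (phi = 1) with made-up DGE and IGE
g_hat = {"G1": 5.0, "G2": 4.0, "G3": 3.0}
c_hat = {"G1": -3.0, "G2": 0.5, "G3": 1.0}
values = tgv(g_hat, c_hat, phi=1.0)

rank_dge = sorted(g_hat, key=g_hat.get, reverse=True)
rank_tgv = sorted(values, key=values.get, reverse=True)
print(rank_dge, rank_tgv)
```

In this toy case, G1 has the highest DGE but a strongly negative IGE, so it drops to the last position once the TGV is used as the selection criterion.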

The variance component estimates provided by the model can be used to calculate the broad-sense heritability (when no kinship matrix is incorporated). In genetic-competition models, breeders can base their decision-making on the DGE heritability (\({H}_{g}^{2}\)) or the total heritability (\({H}_{t}^{2}\)) (Bijma et al. 2007). The first is the portion of the total variance that refers to the DGE. The latter is a ratio between the sum of the total heritable components against the phenotypic variance, and it is an adjusted estimate of the heritability that considers the competition effects and the covariance between DGE and IGE. The expressions for these heritabilities are given below:

$${H}_{g}^{2}=\frac{{\hat{\sigma }}_{g}^{2}}{{\hat{\sigma }}_{y}^{2}}$$
(7)
$${H}_{t}^{2}=\frac{{\hat{\sigma }}_{g}^{2}+2\times \hat{\phi }\times {\hat{\sigma }}_{gc}+{\hat{\phi }}^{2}\times {\hat{\sigma }}_{c}^{2}}{{\hat{\sigma }}_{y}^{2}}$$
(8)

with \({\sigma }_{y}^{2}\) being the total phenotypic variance.
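Equations (7) and (8) can be evaluated directly from the estimated variance components, as in the sketch below. The component values are invented, the phenotypic variance is passed in as a given quantity, and the function names are hypothetical.

```python
def h2_direct(var_g, var_y):
    """DGE heritability, Equation (7)."""
    return var_g / var_y


def h2_total(var_g, cov_gc, var_c, phi, var_y):
    """Total heritability, Equation (8)."""
    return (var_g + 2.0 * phi * cov_gc + phi**2 * var_c) / var_y


# Made-up estimates: a negative DGE-IGE covariance shrinks the total heritability
hg = h2_direct(var_g=4.0, var_y=10.0)
ht = h2_total(var_g=4.0, cov_gc=-1.0, var_c=0.5, phi=1.0, var_y=10.0)
print(hg, ht)
```

Here the negative covariance between DGE and IGE makes the total heritable variation smaller than the DGE variance alone, which is one way competition can reduce the response to selection.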

Depending on the data, species and trait’s architecture, the reliability of DGE and IGE can vary, with DGE’s reliability being frequently higher than IGE’s. For this reason, one may want to weigh the sum of these estimated effects by their respective reliabilities when computing the TGV (Ferreira et al. 2024). In this case, the weighted TGV (wTGV) is given by:

$${wTGV}_{v}={\hat{g}}_{v}\times {r}_{{g}_{v}}^{2}+\hat{\phi }\times {\hat{c}}_{v}\times {r}_{{c}_{v}}^{2}$$
(9)

with \({r}_{{g}_{v}}^{2}\) and \({r}_{{c}_{v}}^{2}\) being the reliabilities of DGE and IGE, respectively; which are calculated as:

$${r}_{{g}_{v}}^{2}=1-\frac{{PEV}_{{g}_{v}}}{{\sigma }_{g}^{2}}\quad \,\text{and}\,\quad {r}_{{c}_{v}}^{2}=1-\frac{{PEV}_{{c}_{v}}}{{\sigma }_{c}^{2}}$$
(10)

with \({PEV}_{{g}_{v}}\) and \({PEV}_{{c}_{v}}\) being the prediction error variance of DGE and IGE of the vth genotype.
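A minimal sketch of Equations (9) and (10) for a single candidate follows. The input values are invented and the function names (reliability, wtgv) are hypothetical.

```python
def reliability(pev, var):
    """r^2 = 1 - PEV / variance component (Equation 10)."""
    return 1.0 - pev / var


def wtgv(g_hat, c_hat, phi, r2_g, r2_c):
    """Reliability-weighted total genotypic value (Equation 9)."""
    return g_hat * r2_g + phi * c_hat * r2_c


# Made-up values for one candidate
r2_g = reliability(pev=0.5, var=2.0)   # 0.75
r2_c = reliability(pev=0.3, var=0.5)   # about 0.4
w = wtgv(g_hat=2.0, c_hat=-1.0, phi=1.0, r2_g=r2_g, r2_c=r2_c)
print(r2_g, r2_c, w)
```

Because the IGE is predicted less reliably than the DGE in this example, its contribution to the weighted value is shrunk more strongly.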

The estimated competition effects can also be used to categorize the selection candidates into three competition classes: aggressive, homeostatic, and sensitive. For this purpose, we used the classification proposed by Ferreira et al. (2023), and detailed below:

$$\,{\text{Classes}}\,=\left\{\begin{array}{l}\bar{{\rm{c}}}-(\tau ){\rm{sd}}({\rm{c}}) \,<\,{\hat{{\rm{c}}}}_{{\rm{v}}} \,<\, \bar{{\rm{c}}}+(\tau ){\rm{sd}}({\rm{c}})\to \,{\text{Homeostatic}}\,\quad \\ {\hat{{\rm{c}}}}_{{\rm{v}}} \,<\,\bar{{\rm{c}}}-(\tau ){\rm{sd}}({\rm{c}})\to \,{\text{Aggressive}}\,\quad \\ {\hat{{\rm{c}}}}_{{\rm{v}}} \,>\, \bar{{\rm{c}}}+(\tau ){\rm{sd}}({\rm{c}})\to \,{\text{Sensitive}}\,\quad \end{array}\right.$$
(11)

with \(\bar{c}\) being the mean IGE in the population, and sd(c) being the IGE’s standard deviation. The parameter τ (not included in the original definition) was added to gencomp for additional flexibility when defining competition classes. τ is a weight defining the thresholds to declare if a genotype is aggressive, homeostatic or sensitive. For instance, let \(\bar{c}=-2,sd(c)=3,{c}_{G1}=0,{c}_{G2}=-6\) and cG3 = 6. If τ = 1, the lower and upper limits of the homeostatic class are −2 − 1 × 3 = −5 and −2 + 1 × 3 = 1, so G1 is homeostatic, G2 is aggressive and G3 is sensitive. Now, if τ = 2, the lower and upper limits are −2 − 2 × 3 = −8 and −2 + 2 × 3 = 4, hence G1 and G2 are homeostatic, and G3 is sensitive.
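The worked example can be reproduced with the short sketch below. It follows Equation (11) using the mean and standard deviation stated in the example; the function name classify is hypothetical, and treating a value exactly on a threshold as homeostatic is an assumption, since Equation (11) uses strict inequalities.

```python
def classify(c_hat, c_mean, c_sd, tau=1.0):
    """Assign a competition class to each candidate following Equation (11)."""
    lower = c_mean - tau * c_sd
    upper = c_mean + tau * c_sd
    classes = {}
    for name, value in c_hat.items():
        if value < lower:
            classes[name] = "aggressive"
        elif value > upper:
            classes[name] = "sensitive"
        else:
            # Assumption: values exactly on a threshold are kept as homeostatic
            classes[name] = "homeostatic"
    return classes


c_hat = {"G1": 0.0, "G2": -6.0, "G3": 6.0}
tau1 = classify(c_hat, c_mean=-2.0, c_sd=3.0, tau=1.0)
tau2 = classify(c_hat, c_mean=-2.0, c_sd=3.0, tau=2.0)
print(tau1)
print(tau2)
```

Larger values of τ widen the homeostatic band, so fewer candidates are declared aggressive or sensitive.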

Multi-age model (repeated measures) for tree breeding data

The procedures previously described are required to fit a spatial-genetic competition model that applies to a unique measurement (single age). Recently, Ferreira et al. (2024) proposed an extension to fit multi-age (or repeated measures) models that include competition. Let M represent the number of ages or measurements (m = 1, 2, …, M), and \(T=\mathop{\sum }\nolimits_{m}^{M}{N}_{m}\). The multi-age spatial-genetic competition model is written as:

$${\bf{y}}={\bf{X}}{\boldsymbol{\beta }}+{{\bf{Z}}}_{g}{\bf{g}}+{{\bf{Z}}}_{c}{\bf{c}}+{\mathbf{\varepsilon }}$$
(12)

All terms of this model were previously described, but note that their dimensions are modified to fit this multi-age definition, e.g., g and c are of dimension VM × 1, and Zg and Zc are T × VM. In addition, the age effect is added to the vector of fixed effects. In this model, \({{\bf{Z}}}_{c}=({{\bf{Z}}}_{{c}_{1}},{{\bf{Z}}}_{{c}_{2}},\ldots ,{{\bf{Z}}}_{{c}_{M}})\), i.e., a unique competition matrix must be built for each age level. The variance-covariance structure of both g and c is compound symmetry, meaning that Equation (12) estimates their main effects and their interactions with the different ages. The within-age DGE and IGE can be obtained by adding the main effect to its corresponding interaction effect from a specific age.

The residual modeling is also expanded to consider heterogeneous variances and spatial autocorrelations between ages:

$${\bf{R}}={\oplus }_{m = 1}^{M}{\sigma }_{{\varepsilon }_{m}}^{2}\left({{\boldsymbol{AR1}}}_{{C}_{m}}\otimes {{\boldsymbol{AR1}}}_{{R}_{m}}\right)$$
(13)

in which \(\oplus\) represents the direct sum. R is a T × T block diagonal matrix.
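The block-diagonal form produced by the direct sum can be sketched as follows. The per-age blocks here are small made-up matrices standing in for the scaled AR1 products of Equation (13), and direct_sum is a hypothetical helper.

```python
def direct_sum(*blocks):
    """Stack square matrices along the diagonal; off-diagonal blocks are zero."""
    size = sum(len(b) for b in blocks)
    out = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                out[offset + i][offset + j] = value
        offset += len(b)
    return out


# Two ages, two plots each, with age-specific residual variances (made-up values)
R_age1 = [[1.0, 0.4], [0.4, 1.0]]
R_age2 = [[2.5, 0.5], [0.5, 2.5]]
R = direct_sum(R_age1, R_age2)
for row in R:
    print(row)
```

The zero off-diagonal blocks show that, under this structure, residuals from different ages are modeled as uncorrelated while each age keeps its own variance and spatial correlation.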

Clonal composites for tree breeding data

In some tree breeding pipelines, the final goal is to deploy high-performance clones (as occurs in eucalyptus, cacao, rubber tree, poplar, etc.). Some clones can be more competitive and/or endure the competition more efficiently. In that case, there might be a clonal mixture that can coexist harmoniously when planted together. This is the core idea of Ferreira et al. (2023) to propose a method for defining the best composition of a clonal mixture considering competition effects. We implemented a function that simulates a physical planting grid considering the DGE and IGE of a given set of clones. These are positioned differently in each simulation, which enables the modification of focal individual-neighbor dynamics. In each simulation, the expected mean of each clone is predicted using the following equation:

$${\hat{y}}_{ik}=\hat{\mu }+{\hat{g}}_{i}+\mathop{\sum }\limits_{i\ne k}^{n}{\hat{c}}_{k}$$
(14)

where \({\hat{g}}_{i}\) is the DGE of the focal individual i, and \({\hat{c}}_{k}\) is the IGE of the neighbor k (i can have up to n neighbors). Both \({\hat{g}}_{i}\) and \({\hat{c}}_{k}\) are obtained from (spatial-) genetic competition models previously defined. In addition, one may want to weigh the IGE by the distance between the focal individual i and neighbor k (\({{\mathcal{D}}}_{ik}\)). This makes more biological sense, as the influence of a neighbor k diminishes as its distance to the focal individual increases. In this case, the equation is:

$${\hat{y}}_{ik}=\hat{\mu }+{\hat{g}}_{i}+\mathop{\sum }\limits_{i\ne k}^{n}\frac{1}{{{\mathcal{D}}}_{ik}}\times {\hat{c}}_{k}$$
(15)

As an extension of the approach proposed by Ferreira et al. (2023), gencomp estimates the 95% confidence interval of the predicted means using a bootstrap approach based on 10,000 random samples.
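The following sketch applies Equations (14) and (15) to one fixed arrangement of clones on a rectangular grid, considering up to eight neighbors per plant. It does not reproduce the repeated random arrangements or the bootstrap step; the function name predict_grid, its arguments, and all numeric values are assumptions made for illustration.

```python
import math


def predict_grid(grid, mu, g, c, d_row=1.0, d_col=1.0, weight_by_distance=True):
    """Expected phenotype of every plant in a grid of clone labels.

    d_row / d_col: spacing between adjacent rows / adjacent columns.
    weight_by_distance=False gives Equation (14); True gives Equation (15).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    pred = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            total = mu + g[grid[i][j]]          # overall mean + DGE of the focal clone
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue                # skip the focal plant itself
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n_rows and 0 <= nj < n_cols:
                        dist = math.hypot(di * d_row, dj * d_col)
                        w = 1.0 / dist if weight_by_distance else 1.0
                        total += w * c[grid[ni][nj]]   # IGE of neighbor k
            pred[i][j] = total
    return pred


grid = [["A", "B"], ["B", "A"]]
g = {"A": 1.0, "B": -1.0}      # made-up DGE
c = {"A": -0.5, "B": 0.5}      # made-up IGE
unweighted = predict_grid(grid, 10.0, g, c, weight_by_distance=False)
weighted = predict_grid(grid, 10.0, g, c, weight_by_distance=True)
print(unweighted[0][0], weighted[0][0])
```

Averaging the predictions of all plants carrying the same clone label gives that clone's expected mean for the arrangement considered.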

Here, it is important to clearly distinguish TGV and \({\hat{y}}_{ik}\). As previously described, the focal genotype’s TGV comprises its genetic merit (represented by the DGE) summed to the weighted IGE that it exerts over its neighbors. On the other hand, \({\hat{y}}_{ik}\) consists of the expected mean of a given individual considering its genetic merit and the influence of its neighbors’ IGE on its phenotype. In the first case, we are concerned about the genetic aspects of the competition for selection (Fig. 1C of Ferreira et al. 2023), while in the second one, we are interested in how it affects the phenotypic expression when the selected candidates are planted together in commercial orchards (Fig. 1B of Ferreira et al. 2023).

Motivating examples

Using the example datasets (euca and potato, described below), we demonstrate the usage of gencomp and the effects of modeling the genetic competition. For the second objective, we compare four models: a “traditional” mixed model (TMM, without spatial adjustment and genetic competition), a spatial mixed model (SMM, without genetic competition), a genetic competition mixed model (CMM, without spatial adjustment) and a spatial-genetic competition mixed model (SCMM, the most complete scenario). From each model, we obtained the REML-estimates of variance components, from which we computed the heritabilities. We also investigated the relationships between BLUP-estimates of genotypic values (of TMM and SMM), and DGE and IGE (of CMM and SCMM), and calculated how different the selection would be considering an intensity of 20% and 25%, for the euca and potato datasets, respectively.
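One simple way to quantify how different two selections are is the proportion of candidates selected under both criteria at a given intensity. The sketch below illustrates this idea with invented values; it is not necessarily the exact measure used in the comparison, and the function names are hypothetical.

```python
def select_top(values, intensity):
    """Names of the best candidates for a given selection intensity (fraction)."""
    n_sel = max(1, round(intensity * len(values)))
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:n_sel])


def coincidence(values_a, values_b, intensity):
    """Proportion of selected candidates shared by two selection criteria."""
    top_a = select_top(values_a, intensity)
    top_b = select_top(values_b, intensity)
    return len(top_a & top_b) / len(top_a)


# Made-up BLUPs from a model without competition and TGVs from a competition model
blup = {"G1": 5.0, "G2": 4.0, "G3": 3.0, "G4": 2.0, "G5": 1.0}
tgv = {"G1": 2.0, "G2": 4.5, "G3": 4.0, "G4": 1.0, "G5": 0.5}
shared = coincidence(blup, tgv, intensity=0.4)
print(shared)
```

A value of 1 means both models select the same candidates, while lower values indicate that modeling competition changes the selection.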

Tree breeding

The package gencomp contains a tree breeding dataset, named euca, which users can employ to test the functions and adapt them to their own datasets. In brief, euca is a dataset simulated using parameters from real data of an intermediate-stage clonal eucalyptus trial. It has the mean annual increment values (m³ ha⁻¹ year⁻¹, column MAI) of 100 clones (“C001” to “C100” in clone column) laid out in a randomized complete block design with 13 replicates (“B01” to “B13” in block column). The experimental unit is the same as the observation unit, i.e., there is a single plant per plot. The plants are spaced by 2 and 3 meters in the row and column directions, respectively; and the position of each tree in the field is found in columns row and col. Phenotypes of two ages are available (“3y” and “6y” in age column). This trial was not organized into contiguous blocks: the first six blocks were situated in one area, while the other seven were in another. The dataset includes a column labeled area, distinguishing between these areas. This scenario is common in large tree breeding trials, particularly when trials are divided by geographical features like streams, roads, or steep hills. In such cases, the lack of contiguity between areas must be considered in spatial-genetic competition models because clones on the border of one area do not neighbor clones on the border of another (Fig. 1b). Essentially, the layout information directs the construction of the Zc matrix. Furthermore, the autoregressive structure in the residual, described in “The spatial component” subsection, needs to be modified to \({\bf{R}}={\oplus }_{h = 1}^{H}{\sigma }_{{\varepsilon }_{h}}^{2}\left({{\bf{AR1}}}_{{C}_{h}}\otimes {{\bf{AR1}}}_{{R}_{h}}\right)\) or \({\oplus }_{m = 1}^{M}\left[{\oplus }_{h = 1}^{H}{\sigma }_{{\varepsilon }_{mh}}^{2}\left({{\bf{AR1}}}_{{C}_{mh}}\otimes {{\bf{AR1}}}_{{R}_{mh}}\right)\right]\) in the multi-age case, with H being the number of areas (h = 1, 2, …, H). In summary, this modification assumes that no residual covariance exists between plots from different areas.

Crop breeding

To exemplify the usage of genetic competition models in crop breeding, gencomp has a data set named potato, obtained from the package agridat [connolly.potato, Wright 2024]. Originally, this dataset was generated by Connolly et al. (1993) to study the inter-plot competition in single-drill plot trials of potatoes in Scotland. They measured the tuber yield of 20 varieties (column gen), which were replicated four times (column rep). Each replication was an independent row of 20 drills. The maturity class of each variety was also registered (column matur), with representatives of the first early (M1), second early (M2), and maincrop (M3) classes. Each drill had five tubers spaced 45 cm apart, with 75 cm between drills. The rows (replications) are not contiguous, which impedes the usage of an autocorrelation structure in the residual part of the model. Thus, in this case, we fitted a genetic-competition model considering the within-row competition. Check Connolly et al. (1993) for more details.
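For the within-row case, the competition matrix has one row per plot and one column per genotype, with entries marking the genotypes of the immediate left and right neighbors. The sketch below shows this for a single hypothetical row of three drills; it is not the prepcrop implementation, and counting a genotype twice when it occupies both neighboring plots is an assumption.

```python
def competition_matrix_row(layout, genotypes):
    """Build Zc for one row of plots, considering only within-row neighbors.

    layout: genotype label of each plot, in field order along the row
    genotypes: ordered list of all genotype labels (the columns of Zc)
    """
    col = {g: k for k, g in enumerate(genotypes)}
    zc = [[0] * len(genotypes) for _ in layout]
    for i in range(len(layout)):
        for j in (i - 1, i + 1):               # left and right neighbors
            if 0 <= j < len(layout):           # border plots have a single neighbor
                # Assumption: entries are counts, so a repeated neighbor adds up
                zc[i][col[layout[j]]] += 1
    return zc


layout = ["G2", "G1", "G3"]                    # hypothetical row of three drills
zc = competition_matrix_row(layout, ["G1", "G2", "G3"])
for plot, row in zip(layout, zc):
    print(plot, row)
```

Because the replications in potato are not contiguous, each row would be processed separately, so plots at the end of one replication never neighbor plots of another.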

Results and discussion

gencomp is an optimized tool designed to facilitate fitting spatial-genetic competition models. The example datasets comprise 2288 (euca) and 80 (potato) observations. Taking the larger dataset as an example, on a computer with 8 GB of RAM and a 12th Gen Intel® Core i7-1255U processor (base frequency of 1.70 GHz) running the Windows® 11 operating system, the entire process, from constructing the competition matrix to extracting the results of the fitted model, took only about 2 min. Below, we detail how to use gencomp with the example datasets.

First step: competition matrix

Following the logic presented in the “Methods” section, the first step is to build the competition matrix. For this, gencomp has two functions: prepfor and prepcrop. The former is designed to deal with tree breeding trials, and the latter, with crop breeding trials. These functions construct the matrix Zc and provide an overview of the dataset. Their basic structure is shown in Box 1.

The argument data receives the working data frame, and gen, row, col, and trait receive the name of the column that contains the corresponding information in the data frame. The plt argument is optional (defaulting to NULL) and allows users to specify the name of the column containing plot information. This helps ensure that the functions follow the same order as the data collection in the field. If plt is not provided, the functions will automatically generate a column to differentiate the plots, ordering the dataset by row and column. The effs argument accepts a string vector with the names of columns representing other effects to be considered in the model fitting step. For instance, the effect of block (block) for the tree breeding data, and maturity stage (matur) and block (rep) for the crop breeding data. This is crucial as it transforms these columns into factors in the data, which will then be used in the model fitting functions. The default value is NULL if no additional effects are included. verbose defines whether a progress bar should be shown (TRUE) in the console during the matrix construction process or not (FALSE).

The arguments mentioned in the last paragraph are common to both functions. prepfor has other exclusive arguments. area and age refer to the column that contains the area information (when the trial has non-contiguous blocks, for instance) and the age information (repeated measures), respectively. They both default to NULL. dist.row and dist.col correspond to the distances between rows and columns in the field trial, respectively. The method argument can receive three options: “MU”, “CC” and “SK”. They define which method will be used to estimate the directional competition intensity factors. Finally, n.dec allows users to specify the number of decimal digits to be displayed in the matrix. The sole exclusive argument for prepcrop is direction, which defines which direction will be considered (currently, row or column) for constructing the competition matrix.

The functions prepfor and prepcrop generate objects of classes comprepfor and comprepcrop, respectively, which contain the competition matrix, a data frame with the inputted data merged with Zc, another data frame containing the phenotypic records of each focal plot and its neighbors (see an example in Table 2), and the mean competition intensity factor (exclusive for prepfor). Two graphs are available from the comprepfor and comprepcrop objects using the S3 method plot: (i) a heatmap illustrating the field trial (Fig. 2A), and (ii) boxplots with each candidate’s performance (Fig. 2B).

Table 2 Neighborhood check for the euca dataset: phenotypic records of mean annual increment of wood volume (MAI) of each focal individual and mean phenotype of its neighbors in the row, column, and diagonal directions, phenotypic mean of all neighbors, and the number of neighbors in the row, column and diagonal directions.
Fig. 2: Plots generated from objects of class comprepfor and comprepcrop using the plot S3 method.

A Heatmap representing the grid, in which the cells are filled according to the phenotype value of each plot (missing values would be represented by blank cells). B Boxplots depicting the phenotypic performance (y-axis) of each selection candidate (x-axis). Both Figures were built using the potato dataset.

Second step: model fitting

We can proceed with model fitting once Zc is constructed. The model fitting step is the most computationally intensive. To optimize time and computational resources, gencomp internally utilizes the Average Information (Gilmour et al. 1995) algorithm of the ASReml-R package (The VSNi Team 2023) to solve linear mixed models rapidly. Thus, ASReml-R is currently a strong dependency of gencomp. Future versions will implement the same functionalities using open-source resources. It is worth mentioning that other freely available packages provide alternative solutions to deal with genetic competition, like breedR (Muñoz and Sanchez 2020) and sommer (Covarrubias-Pazaran 2016).

Currently, two functions are responsible for fitting the spatial-genetic competition model: asr and asr_ma. They have the same structure, as outlined in Box 2, but asr_ma can only be used in a multi-ages (repeated measures) context.

The argument prep.out receives the object generated by the functions prepfor or prepcrop. The arguments fixed and random receive formulas describing the fixed and random parts of the linear mixed model, using the usual ASReml-R syntax (more details in the ASReml-R manual). In the random argument, users must specify any random effects other than the genotypic effect, if applicable. These effects must also be previously declared in the effs argument of prepfor and prepcrop. The basic structure implemented in asr and asr_ma internally accounts for the partition of the genotypic effect into DGE and IGE. If there are no other effects, the function uses the default random = ~1. The cor, lrtest, and spatial arguments are logical values that dictate, respectively, whether the fitted model should consider the covariance between DGE and IGE, whether likelihood ratio tests should be performed, and whether the model should use a first-order autoregressive structure to adjust for spatial trends. For instance, to address both spatial and independent errors, users should set spatial = TRUE and random = ~units. Bear in mind that setting cor = FALSE introduces bias into the model, as the covariance between DGE and IGE is usually different from zero (Bijma et al. 2007; Costa e Silva et al. 2013; Trebissou et al. 2021). Additional arguments of the asreml function can also be supplied. In asr_ma, for example, maxit was used to increase the maximum number of iterations, which may be necessary if the model does not converge with the default number of iterations (13).
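To make the likelihood ratio test concrete, the following is a minimal Python sketch of the statistic computed from the log-likelihoods of two nested models. It is not gencomp code. The log-likelihood values are hypothetical, and the chi-square reference distribution with one degree of freedom is an assumption, since the reference distribution used by the package is not stated in this section.

```python
from math import erfc, sqrt

def lrt(loglik_full, loglik_reduced):
    """Likelihood ratio statistic and its chi-square (1 df) p-value."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    stat = max(stat, 0.0)  # guard against tiny negative values from rounding
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods of models fitted with and without IGE.
stat, p_value = lrt(loglik_full=-1520.4, loglik_reduced=-1526.9)
```

A small p-value here would suggest that dropping the tested effect significantly worsens the fit.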

Third step: main results

The resp function provides a list of the most relevant outputs: (i) results of the likelihood ratio tests (if lrtest = TRUE in asr or asr_ma), (ii) variance components, (iii) heritabilities of the DGE and the total genotypic effects (if cor = TRUE in asr or asr_ma), and (iv) BLUP-estimates of DGE, IGE, and TGV. In the case of multi-age models, resp provides the main DGE and IGE, and the within-ages DGE and IGE. These values are later used to compute the TGVs across and within ages. The structure of the resp function is shown in Box 3. Table 3 has the variance components and likelihood ratio test results, and Table 4 has the heritabilities (of DGE and total genotypic effects) estimated from the models fitted with the euca and potato datasets.

Table 3 Variance component and correlation estimates in euca and potato datasets, their respective standard errors, and likelihood ratio statistics for the direct (DGE) and indirect genotypic effects (IGE), and their interaction with ages (only for euca).
Table 4 Broad-sense heritabilities of the direct genotypic effects (\({H}_{g}^{2}\)) and of the total genotypic effects (\({H}_{t}^{2}\)) in the euca and potato datasets.

The argument prep.out receives the object of class comprepfor or comprepcrop generated by prepfor or prepcrop, and the argument model receives the asreml object generated by asr or asr_ma. The weight.tgv argument is a logical value that determines whether DGE and IGE should be weighted by their respective reliabilities when computing the TGV (see Equation 9). Finally, sd.class defines the value of τ, i.e., the weight given to sd(c) when establishing competition classes.
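The role of τ can be illustrated with a minimal Python sketch that assigns competition classes from a set of IGE values. It is not gencomp code. The thresholds (mean ± τ × sd(c)), the sign convention (strongly negative IGE labeled aggressive, strongly positive IGE labeled sensitive), and the toy values are assumptions made for illustration.

```python
from statistics import mean, stdev

def competition_classes(ige, tau=1.0):
    """Classify candidates by comparing their IGE with mean +/- tau * sd."""
    center = mean(ige.values())
    spread = stdev(ige.values())  # sample standard deviation of the IGE
    lower, upper = center - tau * spread, center + tau * spread
    classes = {}
    for name, c in ige.items():
        if c < lower:
            classes[name] = "aggressive"   # assumed: harms its neighbors
        elif c > upper:
            classes[name] = "sensitive"    # assumed: benefits its neighbors
        else:
            classes[name] = "homeostatic"
    return classes

# Hypothetical IGE of five candidates.
ige = {"G1": -2.0, "G2": -0.5, "G3": 0.0, "G4": 0.5, "G5": 2.0}
narrow = competition_classes(ige, tau=1.0)
wide = competition_classes(ige, tau=2.0)
```

Larger values of τ widen the homeostatic band, so fewer candidates are labeled aggressive or sensitive.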

The function resp provides an object of class comresp. The S3 methods summary, print, and plot generate specific results for comresp objects. Figures 3 and 4 depict the nine alternatives available via plot to illustrate the comresp object results. The lollipop plots illustrate the DGE, IGE (Fig. 3A), and TGV (Fig. 3B) of each selection candidate. Breeders can base selection on the TGV, which acts as a selection index incorporating DGE and IGE, or manually assign weights to DGE and IGE. For example, we might want to prioritize high-performing and homeostatic candidates over sensitive or aggressive ones. It is important not to overlook the reliability of the information: we recommend users consider it for decision-making, even when weight.tgv = FALSE (Ferreira et al. 2024).
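As an illustration of how an index changes the ranking, the Python sketch below computes the unweighted TGV from the Introduction, \({g}_{v}+\hat{\phi }{c}_{v}\), as a special case of a linear index with user-defined weights. It is not gencomp code. The function names, the toy DGE and IGE values, and the value of \(\hat{\phi }\) are hypothetical, and reliability weighting is not covered.

```python
def weighted_index(dge, ige, w_dge, w_ige):
    """Linear index w_dge * DGE + w_ige * IGE for each candidate."""
    return {name: w_dge * dge[name] + w_ige * ige[name] for name in dge}

def ranking(values):
    """Candidate names sorted from best to worst."""
    return sorted(values, key=values.get, reverse=True)

# Hypothetical predictions for three candidates.
dge = {"G1": 3.0, "G2": 2.0, "G3": 1.0}
ige = {"G1": -1.0, "G2": 0.1, "G3": 0.2}
phi = 2.0  # hypothetical mean competition intensity factor

# The unweighted TGV is the index with weights 1 (DGE) and phi (IGE).
tgv = weighted_index(dge, ige, w_dge=1.0, w_ige=phi)
by_dge = ranking(dge)
by_tgv = ranking(tgv)
```

In this toy case the candidate with the highest DGE drops to last place once its negative IGE is taken into account.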

Fig. 3: Plots generated from an object of class comresp using the plot S3 method.

A Direct (DGE) and indirect (IGE) genotypic effects (y-axis) of each candidate. The plots are in descending order according to the DGE. The color of the dots reflects the reliability of both the DGE and IGE for each genotype; B Total genotypic value (TGV) (y-axis) of each candidate (x-axis), in increasing order; C the number of different genotypes as neighbors (total and per competition class) of each selection candidate; D Density of IGE values. The area within the distribution is filled according to the competition class; E relationship between IGE (x-axis) and DGE (y-axis). The dots are colored according to the competition class. These Figures were built using the euca dataset.

Figures 3C, D, E and 4D are colored according to the three competition classes. Figure 3C shows the number of different genotypes that neighbored each selection candidate. Here, we exemplified using the euca dataset, which has neighbors in all directions. Overall, almost all clones neighbored each other, and most of them had homeostatic neighbors. Figure 3D presents the density of IGE values in the euca data and depicts the threshold adopted to determine a candidate’s class. Figure 3E illustrates the relationship between DGE and IGE, which is negative in the present example, a trend usually observed for growth- and yield-related traits (Costa e Silva et al. 2013; Ferreira et al. 2024). The distribution of genotypes according to their competition classes in the field is illustrated in Fig. 4D for the potato data. Note that these plots can change depending on the value passed to the sd.class argument.

Fig. 4: Plots generated from an object of class comresp using the plot S3 method.

Heatmap representations of the field trial, with cells filled according to the residual effect of each plot (A), Direct genotypic effect (DGE) of each genotype (B), indirect genotypic effect (IGE) of each genotype (C), and the competition class of each genotype (D). These Figures were built using the potato dataset.

Similar to Figs. 2A and 4D, Fig. 4A–C represent the field trial. The difference is the interpretation of each cell. In Fig. 4A, the cells are filled according to the magnitude of the residual effect. This Figure is useful for investigating extraneous trends in the field. The heatmaps of Fig. 4B and C are filled according to the DGE and IGE of the corresponding genotype of each cell. Even cells with missing values will be filled in this case. This is because the classification into aggressive, sensitive, and homeostatic is performed at the genotype level.

Simulation of clonal composites for tree breeding

The results presented in Figs. 3 and 4 are examples of information users can leverage. Nonetheless, there is one further resource available for tree breeders. We provide a function to simulate clonal composites, as in Ferreira et al. (2023). The function composite has the structure shown in Box 4.

The function composite (Box 4) uses all information obtained up to this point (arguments prep.out, model, and resp.out). Users may also provide the desired spacing between rows and columns as a numeric vector of size two in d.row.col. The d.weight argument indicates whether the IGEs should be weighted by the distance between the neighbor and the focal individual when computing the expected mean. The argument n.sim dictates the number of simulations, i.e., how many field grids will be generated. Running several simulations helps to guarantee that every neighbor-focal individual combination is sampled. Note that, depending on the number of selected clones, this might not happen. The selected argument receives the names of the clones that should be tested in clonal composites. In the example, we tested the top 10 clones based on their TGV. These results are presented in Table 5. Users can use an iterative process to test several clonal combinations and define which composite yields the best result.
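The logic of such a simulation can be outlined in a minimal Python sketch. It is not the implementation of composite. It assumes that each grid is filled by sampling the selected clones at random, that the expected value of a focal tree is the overall mean plus its DGE plus the IGE of its (up to eight) adjacent neighbors, and that distance weighting means multiplying each IGE by the inverse of the neighbor-focal distance. All names and values are hypothetical.

```python
import random
from math import hypot
from statistics import mean

def simulate_composite(dge, ige, mu, n_rows, n_cols, n_sim,
                       d_row_col=(3.0, 3.0), d_weight=False, seed=1):
    """Mean expected value of each clone over n_sim randomly arranged grids."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    clones = sorted(dge)
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    expected = {name: [] for name in clones}
    for _ in range(n_sim):
        grid = [[rng.choice(clones) for _ in range(n_cols)]
                for _ in range(n_rows)]
        for r in range(n_rows):
            for c in range(n_cols):
                total = mu + dge[grid[r][c]]
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        # Distance from the spacing between rows and columns.
                        dist = hypot(dr * d_row_col[0], dc * d_row_col[1])
                        weight = 1.0 / dist if d_weight else 1.0
                        total += weight * ige[grid[rr][cc]]
                expected[grid[r][c]].append(total)
    return {name: mean(vals) for name, vals in expected.items() if vals}

# Hypothetical DGE and IGE of three selected clones.
dge = {"C1": 3.0, "C2": 2.0, "C3": 1.0}
ige = {"C1": -0.4, "C2": 0.1, "C3": 0.3}
means = simulate_composite(dge, ige, mu=40.0, n_rows=4, n_cols=4, n_sim=20)
```

Increasing n_sim in this sketch makes it more likely that every clone meets every other clone as a neighbor at least once.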

Table 5 Predicted means of the 10 selected clones when planted together in clonal composites.

Future users can install gencomp from GitHub using the code presented in Box 5. Given the dependency, ASReml-R should already be installed.

Competition affects the selection

The availability of a tool that facilitates the management of competition in plant breeding trials is important since genetic competition represents a systematic bias that can distort the candidates’ genotypic values and hamper the selection process (Besag and Kempton 1986; Sakai 1955). Several studies showed that competition can change the candidates’ ranking, meaning that overlooking it may lead to suboptimal results (Ferreira et al. 2024; Hunt et al. 2013; Stringer et al. 2011). Furthermore, there is a hidden heritable variation related to competition effects which can only be accessed using genetic competition models, meaning that these models provide a better overview of a population’s genetic parameters (Bijma 2011; Costa e Silva et al. 2013).

Using gencomp and the available datasets, we also observed the aforementioned patterns. In the potato dataset, since we cannot fit spatial models, we compared only the traditional mixed model (TMM) with the genetic competition mixed model (CMM). The residual variance decreased from the TMM to the CMM (Fig. 5A), which can be attributed to the clear distinction between DGE and IGE. This decrease resulted in an increased heritability of the DGE (Fig. 5B). As expected, the total heritability was lower than the DGE’s, since there is a negative covariance between DGE and IGE. The selection of the top 5 candidates from TMM’s BLUPs was 40% and 20% different from selecting considering the DGE and TGV of CMM, respectively (Fig. 6).
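The effect of a negative covariance can be checked numerically with a minimal Python sketch. It uses only the variance of the TGV defined in the Introduction, \({g}_{v}+\hat{\phi }{c}_{v}\), which equals \({\sigma }_{g}^{2}+2\hat{\phi }{\sigma }_{gc}+{\hat{\phi }}^{2}{\sigma }_{c}^{2}\). The variance components, the value of \(\hat{\phi }\), and the use of a common phenotypic variance as denominator are hypothetical, and the sketch is not a reproduction of Equation (8).

```python
def total_heritable_variance(var_dge, var_ige, cov_dge_ige, phi):
    """Variance of g + phi * c given the DGE and IGE (co)variances."""
    return var_dge + 2.0 * phi * cov_dge_ige + phi ** 2 * var_ige

# Hypothetical variance components (implied DGE-IGE correlation of -0.6).
var_dge, var_ige, cov_dge_ige, phi = 4.0, 0.25, -0.6, 2.0
var_total = total_heritable_variance(var_dge, var_ige, cov_dge_ige, phi)

# Assumed common phenotypic variance, used only to compare the two ratios.
var_phenotypic = 10.0
h2_direct = var_dge / var_phenotypic
h2_total = var_total / var_phenotypic
```

With these values the covariance term outweighs the IGE variance term, so the total heritable variance falls below the DGE variance. A weaker negative covariance or a larger IGE variance could reverse that ordering.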

Fig. 5: Changes in estimates of population parameters by fitting indirect genotypic effects.

Variance component estimates (A) and heritabilities (B) obtained from analyzing the potato dataset using a traditional mixed model (TMM) and a genetic competition mixed model (CMM). In (A), “G” stands for the genotypic effect, “DGE” is the direct genotypic effect, “IGE” is the indirect genotypic effect, and “R” is the residual effect. In (B), the legend refers to the variance component used in the numerators. “Genotypic” is the genotypic effect of TMM, as used in a regular broad-sense heritability; and “Direct” and “Total” relate to the DGE and total heritable variance of CMM, as depicted in Equation (8).

Fig. 6: Relationship between BLUPs (obtained from the traditional mixed model, TMM), direct (DGE) and indirect genotypic effects (IGE), and total genotypic values (TGV) (obtained from the genetic competition mixed model, CMM) in the potato dataset.

The lower triangle has scatter plots, and the green dots represent candidates that were coincidentally selected by considering the criteria depicted on the top and right of the plot, considering an intensity of 25%. The diagonal contains the density plots of values of all candidates (in black) and of the selected ones (in green). The upper triangle has the ranking (Spearman) correlation between all values (in black), and the percentage of coincident candidates among the selected (in green).
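The two summaries shown in the upper triangle can be reproduced with a minimal Python sketch: the percentage of coincident candidates at a given selection intensity and the Spearman rank correlation. It is not gencomp code. The candidate values are hypothetical, the selected fraction is rounded to the nearest integer, and the rank correlation formula assumes there are no ties.

```python
def top_fraction(values, intensity):
    """Names of the best candidates at the given selection intensity."""
    n_selected = max(1, round(len(values) * intensity))
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:n_selected])

def coincidence(a, b, intensity):
    """Percentage of candidates selected under both criteria."""
    sel_a, sel_b = top_fraction(a, intensity), top_fraction(b, intensity)
    return 100.0 * len(sel_a & sel_b) / len(sel_a)

def spearman(a, b):
    """Spearman rank correlation, assuming no tied values."""
    names = sorted(a)
    rank_a = {name: i for i, name in enumerate(sorted(names, key=a.get))}
    rank_b = {name: i for i, name in enumerate(sorted(names, key=b.get))}
    n = len(names)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in names)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Hypothetical values of eight candidates under two criteria.
blup = {"G1": 8, "G2": 7, "G3": 6, "G4": 5, "G5": 4, "G6": 3, "G7": 2, "G8": 1}
tgv = {"G1": 8, "G2": 5, "G3": 7, "G4": 6, "G5": 4, "G6": 3, "G7": 2, "G8": 1}

shared = coincidence(blup, tgv, intensity=0.25)
rho = spearman(blup, tgv)
```

In this toy case the rank correlation stays high even though half of the selected candidates differ, which is why both summaries are worth inspecting together.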

In the euca dataset, we fitted a multi-age model and compared linear mixed models with spatial adjustments, either considering genetic competition (SCMM) or not (SMM). As in the potato dataset, the residual variance clearly decreases as complexity is added to the model, with SCMM reaching the lowest values (Fig. 7A). The value removed from the residual variance is redistributed to the other variance components (Fig. 7B), and more precise estimates of heritability (both DGE and total) are obtained (Fig. 7C). It is worth mentioning the differences in variance components and heritabilities between ages. Experimental precision tends to decline from 3y to 6y, as indicated by the increase in residual variance and the decrease in heritability. This is expected since the trees have been in the field longer and are susceptible to the cumulative effects of the environment. Furthermore, competition tends to increase with age, as individuals demand more space and resources (water, nutrients, and light) (Ferreira et al. 2024). Changes in ranking are more perceptible in the euca dataset (Fig. 8). The selection would be 35% different when comparing TMM with the TGV of SCMM, and 40% different when comparing TMM with the TGV of CMM. This shows the importance of considering an index such as the TGV, since the selection would be only 15% different when comparing TMM with the DGE of SCMM and CMM, showing that BLUPs and DGEs are somewhat equivalent. The addition of the spatial component also changes the ranking and the selection, but on a smaller scale.

Fig. 7: Changes in estimates of population parameters by fitting indirect genotypic effects.

Residual (A) and genetic (B) variance component estimates and heritabilities (C) obtained from analyzing the euca dataset using a traditional mixed model (TMM), a spatial mixed model (SMM, with spatial adjustment), a genetic competition mixed model (CMM), and a spatial-genetic competition mixed model (SCMM). “3y” and “6y” refer to the ages, and “A1” and “A3” to the areas. In (B), “G” stands for the genotypic effect, “DGE” is the direct genotypic effect, and “IGE” is the indirect genotypic effect. In (C), the legend refers to the variance component used in the numerators. “Genotypic” is the genotypic effect of TMM or SMM, as used in a regular broad-sense heritability; and “Direct” and “Total” relate to the DGE and total heritable variance of CMM and SCMM, as depicted in Equation (8).

Fig. 8: Relationship between BLUPs (obtained from the traditional mixed model, TMM; and from the spatial mixed model, SMM), direct (DGE) and indirect genotypic effects (IGE), and total genotypic values (TGV) (obtained from the genetic competition mixed model, CMM and the spatial-genetic competition mixed model, SCMM) in the euca dataset.

The lower triangle has scatter plots, and the green dots represent candidates that were coincidentally selected by considering the criteria depicted on the top and right of the plot, considering an intensity of 25%. The diagonal contains the density plots of values of all candidates (in black) and of the selected ones (in green). The upper triangle has the ranking (Spearman) correlation between all values (in black), and the percentage of coincident candidates among the selected (in green).

Concluding remarks

The R package gencomp stands as a user-friendly tool with the advantage of facilitating the fitting and utilization of (spatial-) genetic competition models, irrespective of users’ programming proficiency. Moreover, the package offers flexibility by including other effects in model fitting, enabling users to adapt models to their specific requirements. It is worth mentioning that gencomp is a work in progress and will continue to evolve as we introduce additional functionalities. Future versions may extend to fitting multi-environment and multi-trait models and integrating kinship matrices. User feedback is vital to this process.