To the Editor:

Phenotyping genetically engineered mouse lines has become a central strategy for discovering mammalian gene function. The International Mouse Phenotyping Consortium (IMPC) coordinates a large-scale community effort for phenotyping thousands of mutant lines1, making data accessible in public databases2 and distributing novel mutant lines as animal models of human diseases. The utility of any findings, however, critically depends on whether they can be replicated in other laboratories. This 'megascience' project is but one example of the general concern regarding replicability3. Here we introduce a statistical approach and implementation (https://stat.cs.tau.ac.il/gxl_app/) that can be used to estimate the interlab replicability of new results obtained in a single laboratory.

An influential multilaboratory phenotyping study4 concluded that “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory” on account of significant genotype-by-laboratory interaction (G × L) in several phenotypes. However, we proposed5 a more appropriate statistical model that ascribes a random effect to each laboratory and to its interaction with genotype. This 'random lab model' (RLM) treats the laboratories in the study as a sample representing all potential phenotyping laboratories. It therefore adds the interaction 'noise' σ²G×L to the individual animal noise, generating an adjusted yardstick against which genotype differences are tested. Consequently, the RLM raises the benchmark for finding a significant genotype effect, trades some statistical power for ensuring replicability, and widens the confidence interval of the estimated effect size (Supplementary Fig. 1).
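
In schematic form (a minimal sketch of the model structure, not the exact parameterization of the Supplementary Methods), the RLM for the measurement y of animal k of genotype i in laboratory j is

\[
y_{ijk} = \mu + G_i + L_j + (GL)_{ij} + \varepsilon_{ijk},
\qquad
L_j \sim N(0,\sigma^2_{L}),\quad
(GL)_{ij} \sim N(0,\sigma^2_{G\times L}),\quad
\varepsilon_{ijk} \sim N(0,\sigma^2_{\mathrm{within}}),
\]

with the genotype effects G_i treated as fixed. Under this model, the variance of the difference between two genotype means measured on n animals each in a single laboratory is 2σ²within/n + 2σ²G×L; the interaction term does not shrink as more animals are added, which is what widens the RLM yardstick relative to the within-lab one.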

In practice, however, almost all preclinical experiments are single-lab studies. Suppose that a researcher phenotypes an important animal model and discovers that the difference between the phenotypes of mutants and wild-type controls is large and statistically significant. How would researchers in other labs know whether to use this mutant and expect to replicate the effect? The RLM implies that in single-lab experiments the correct yardstick against which the genotype effect is tested should include σ²G×L in addition to the commonly used within-lab variability. We term this a 'G × L adjustment' (Supplementary Methods; implications demonstrated in Fig. 1a–c) and validate it by analyzing eight data sets from published multilab mouse phenotyping studies and databases (Supplementary Table 1 and Supplementary Note 1). These data sets include standard physiological, anatomical, and behavioral phenotypes measured in inbred strains and mutant lines. They offer the opportunity to assess the replicability of single-lab results against the multilab RLM conclusions regarding replicable genotypic differences.
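
As a concrete illustration, the following Python sketch contrasts the standard within-lab t-test with a G × L-adjusted test for a single-lab two-group comparison. It assumes an externally supplied estimate of σ²G×L (called gxl_var here) with a stated number of degrees of freedom, and it uses a Satterthwaite-type approximation for the adjusted degrees of freedom; the exact formulas of our Supplementary Methods and web tool may differ in detail.

    import numpy as np
    from scipy import stats

    def standard_and_gxl_adjusted_test(mutant, control, gxl_var, gxl_df=10):
        """Two-sided t-tests for a single-lab genotype comparison.

        mutant, control : arrays of individual-animal measurements
        gxl_var         : external estimate of the G x L interaction variance
        gxl_df          : degrees of freedom of that estimate (assumed here)
        """
        mutant, control = np.asarray(mutant, float), np.asarray(control, float)
        n1, n2 = len(mutant), len(control)
        diff = mutant.mean() - control.mean()

        # Pooled within-lab variance (the usual yardstick).
        within_df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * mutant.var(ddof=1) +
                      (n2 - 1) * control.var(ddof=1)) / within_df
        within_term = pooled_var * (1 / n1 + 1 / n2)

        # Standard analysis: within-lab variability only.
        t_std = diff / np.sqrt(within_term)
        p_std = 2 * stats.t.sf(abs(t_std), within_df)

        # G x L adjustment: each genotype mean carries its own interaction term.
        gxl_term = 2 * gxl_var
        se_adj = np.sqrt(within_term + gxl_term)
        # Satterthwaite-type degrees of freedom for the combined variance.
        df_adj = (within_term + gxl_term) ** 2 / (
            within_term ** 2 / within_df + gxl_term ** 2 / gxl_df)
        t_adj = diff / se_adj
        p_adj = 2 * stats.t.sf(abs(t_adj), df_adj)
        return p_std, p_adj

    # Hypothetical single-lab data and an assumed interaction variance.
    rng = np.random.default_rng(0)
    p_std, p_adj = standard_and_gxl_adjusted_test(
        mutant=rng.normal(12.0, 2.0, 12), control=rng.normal(10.0, 2.0, 12),
        gxl_var=0.8)
    print(f"standard P = {p_std:.3f}, GxL-adjusted P = {p_adj:.3f}")

With a nonzero gxl_var the adjusted standard error is always wider than the within-lab one, so the adjusted P value is typically larger, reflecting the stricter yardstick.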

Figure 1: Adjusting for genotype-by-laboratory interaction (G × L).

(a–c) Comparison of two mouse genotypes for three phenotypes across six laboratories from data set 1 (Supplementary Table 1). Each line connects genotype means within the same laboratory, so the slope of each line reflects the difference in these means. Within-lab significance (coded by line type) refers to two-sided tests at the 0.05 level. (a) Small G × L effect (similar slopes) and a significant genotype effect according to the random lab model (RLM). (b) Moderate G × L effect (more variation among labs), but the genotype effect appears fairly replicable and is significant according to the RLM. (c) Substantial G × L effect and no significant genotype effect according to the RLM. Standard single-lab analysis would report a significant genotype effect for Giessen that is opposite in direction to the significant effects for Mannheim, Muenster, and Munich. (d) G × L adjustment decreases the percentage of nonreplicable discoveries (type-I error rate).


From each laboratory's point of view, we compare the G × L adjustment method with the standard method of analysis, a two-tailed t-test at the 0.05 significance level using within-lab variability. Cases in which the RLM analysis did not indicate a replicated genotype effect enabled us to quantify false discoveries (type-I errors; note that we term a nonreplicable discovery 'false' simply because it proved idiosyncratic to the laboratory discovering it). Over all data sets, the average type-I error rate of the standard method ranged between 19.3% and 41% (Fig. 1d). This can be viewed as an estimate of the replicability situation in the field of mouse phenotyping, given the high level of standardization in these data sets. The G × L adjustment reduced this error rate to the vicinity of the chosen 0.05, ranging from 3.3% to 9%, at the cost of reducing power by 8–30% (Supplementary Table 1). Potential biases in the above estimates were addressed using a simulation study (Supplementary Note 2).
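
The gap between the two error rates can be reproduced qualitatively with a small simulation (a sketch under assumed parameter values, not the analysis of Supplementary Note 2): when there is no true genotype effect but σ²G×L is nonzero, the standard within-lab t-test rejects far more often than its nominal 5%, whereas a test against the adjusted yardstick stays near the nominal level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, sigma_within, sigma_gxl = 10, 1.0, 0.5   # assumed values for illustration
    n_sim, alpha = 20000, 0.05
    reject_std = reject_adj = 0

    for _ in range(n_sim):
        # No true genotype effect; the lab draws its own G x L shifts per genotype.
        shift_mut, shift_ctl = rng.normal(0.0, sigma_gxl, 2)
        mutant = shift_mut + rng.normal(0.0, sigma_within, n)
        control = shift_ctl + rng.normal(0.0, sigma_within, n)

        diff = mutant.mean() - control.mean()
        within_df = 2 * n - 2
        pooled = (n - 1) * (mutant.var(ddof=1) + control.var(ddof=1)) / within_df
        within_term = pooled * 2 / n

        # Standard within-lab t-test.
        t_std = diff / np.sqrt(within_term)
        reject_std += (2 * stats.t.sf(abs(t_std), within_df) < alpha)

        # Adjusted yardstick: add 2 * sigma_gxl**2 (treated here as known).
        t_adj = diff / np.sqrt(within_term + 2 * sigma_gxl**2)
        reject_adj += (2 * stats.t.sf(abs(t_adj), within_df) < alpha)

    print(f"standard type-I error ~ {reject_std / n_sim:.3f}")
    print(f"adjusted type-I error ~ {reject_adj / n_sim:.3f}")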

For brevity, we present the G × L adjustment in terms of statistical significance and type-I errors, but the same adjusted standard error should be used to construct replicable confidence intervals. When multiple phenotypes are compared, false discovery rate (FDR) control should be applied to the G × L-adjusted P values (as in ref. 6). Similarly, the error rate of 'hits' reported by the IMPC is lower than the rates in Figure 1d, because the IMPC imposes a considerably more conservative significance threshold.
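
For example, if several phenotypes are tested for the same mutant line, the adjusted P values can be FDR-corrected, for instance with the Benjamini–Hochberg step-up procedure (shown here with statsmodels; the P values are hypothetical):

    from statsmodels.stats.multitest import multipletests

    # Hypothetical GxL-adjusted P values for several phenotypes of one mutant line.
    adjusted_p = [0.002, 0.019, 0.031, 0.240, 0.470]
    reject, q_values, _, _ = multipletests(adjusted_p, alpha=0.05, method="fdr_bh")
    for p, q, r in zip(adjusted_p, q_values, reject):
        print(f"P = {p:.3f}  q = {q:.3f}  discovery: {r}")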

Here we perform the G × L adjustment using σ²G×L estimated from the multilab analysis itself; in general use, σ²G×L would be taken from previous multilab studies of similar phenotypes, though possibly of other genotypes, treating σ²G×L as a property of the phenotype rather than of the genotype. This procedure is practical: phenotyping only a few genotypes across several laboratories enables estimation of σ²G×L and adjustment of other genotypes in other laboratories. No highly coordinated collaboration is required, as these results can simply be posted in a combined database for the benefit of the community. We provide a prototype web server for performing the G × L adjustment (https://stat.cs.tau.ac.il/gxl_app/). By entering phenotypic results and testing conditions, users receive G × L-adjusted P values and confidence intervals based on the available relevant estimates, and they are encouraged to post their own results in order to further enrich the database (see Supplementary Fig. 2).
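
As an illustration of how such an estimate could be obtained, the sketch below fits a mixed model with a random laboratory effect and a laboratory-by-genotype variance component to a small synthetic multilab data set. The use of statsmodels, the synthetic data, and all variable names are assumptions for illustration; this is not the estimation code behind the web server.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic long-format data: a few genotypes phenotyped in several laboratories.
    rng = np.random.default_rng(2)
    rows = []
    for lab in range(6):
        lab_shift = rng.normal(0.0, 0.7)                    # random lab effect
        for g, geno_mean in enumerate([10.0, 11.0, 12.0]):  # fixed genotype means
            gxl_shift = rng.normal(0.0, 0.5)                # G x L interaction
            for _ in range(10):                             # animals per group
                rows.append({"lab": f"lab{lab}", "genotype": f"g{g}",
                             "phenotype": geno_mean + lab_shift + gxl_shift
                                          + rng.normal(0.0, 1.0)})
    df = pd.DataFrame(rows)

    # Mixed model: fixed genotype effect, random lab intercept, and a variance
    # component for lab-by-genotype, i.e. sigma^2_GxL.
    model = smf.mixedlm("phenotype ~ C(genotype)", data=df, groups="lab",
                        re_formula="1", vc_formula={"gxl": "0 + C(genotype)"})
    fit = model.fit(reml=True)
    print(fit.summary())                  # the 'gxl' variance component row
    print("estimated sigma^2_GxL:", float(fit.vcomp[0]))

The fitted variance component can then be supplied, together with its degrees of freedom, as the external σ²G×L estimate when adjusting a new single-lab comparison of the same phenotype.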

A similar approach 'simulates' σ²G×L by systematically 'heterogenizing' housing and testing conditions within single-lab studies7. As indicated by the consistently lower type-I error rate in the heterogenized data set (data sets 2 versus 1 in Supplementary Table 1), this may be a worthwhile effort, although the simple form of heterogenization used did not capture all of the estimated σ²G×L.

The concern about replicability of phenotyping results may be regarded as an example of the general concern about reproducibility in science, which has been attributed to issues such as the file drawer effect, publication bias, financial and publicity incentives, etc.4. While these are all relevant problems, substantial statistical issues have yet to be addressed, such as testing with the relevant variability as discussed here. The G × L-adjusted P value and confidence interval indicate the prospects of replicating the result in additional laboratories. Reporting these values side by side with the usual P value and confidence interval will promote replicability of preclinical research.

Data availability statement. All data and analysis are publicly available; see “S.3 Data and Code Availability” in the Supplementary Information.

Author contributions

N.K., I.G., and Y.B. conceived the project in cooperation with all other authors. All authors contributed data and/or extracted previously published data. I.J., H.M., T.S., S.Y., and Y.B. performed statistical analyses and programmed software. N.K., I.J., and Y.B. drafted the paper, to which I.G., H.M., T.S., H.W., and S.Y. also contributed.