To the Editor:
Phenotyping genetically engineered mouse lines has become a central strategy for discovering mammalian gene function. The International Mouse Phenotyping Consortium (IMPC) coordinates a large-scale community effort to phenotype thousands of mutant lines (ref. 1), making the data accessible in public databases (ref. 2) and distributing novel mutant lines as animal models of human diseases. The utility of any findings, however, critically depends on whether they can be replicated in other laboratories. This 'megascience' project is but one example of the general concern regarding replicability (ref. 3). Here we introduce a statistical approach and implementation (https://stat.cs.tau.ac.il/gxl_app/) that can be used to estimate the interlab replicability of new results obtained in a single laboratory.
An influential multilaboratory phenotyping study (ref. 4) concluded that “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory” on account of significant genotype-by-laboratory interaction (G×L) in several phenotypes. However, we proposed (ref. 5) a more appropriate statistical model that ascribes a random effect to each laboratory and to its interaction with genotype. This 'random lab model' (RLM) treats the laboratories in a study as a sample representing all potential phenotyping laboratories. It therefore adds the interaction 'noise' σ²G×L to the individual-animal noise, generating an adjusted yardstick against which genotype differences are tested. Consequently, the RLM raises the benchmark for finding a significant genotype effect, trades some statistical power for ensured replicability, and widens the confidence interval of the estimated effect size (Supplementary Fig. 1).
In practice, however, almost all preclinical experiments are single-lab studies. Suppose that a researcher phenotypes an important animal model and discovers a large and statistically significant difference between the phenotypes of mutants and wild-type controls. How would researchers in other labs know whether to use this mutant and expect to replicate the effect? The RLM also implies that in single-lab experiments the correct yardstick against which the genotype effect is tested should include σ²G×L in addition to the commonly used within-lab variability. We term this 'G×L adjustment' (Supplementary Methods; implications demonstrated in Fig. 1a–c) and validate it by analyzing eight data sets from published multilab mouse phenotyping studies and databases (Supplementary Table 1 and Supplementary Note 1). These data sets include standard physiological, anatomical, and behavioral phenotypes measured in inbred strains and mutant lines. They offer the opportunity to assess the replicability of single-lab results against the multilab RLM conclusions regarding replicable genotypic differences.
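To make the adjustment concrete, the following is a minimal Python sketch of a G×L-adjusted two-sample comparison. It assumes NumPy and SciPy; the function name, the pooled-variance form, and the conservative choice of degrees of freedom are illustrative assumptions of ours, not the exact implementation behind the web application.

```python
import numpy as np
from scipy import stats

def gxl_adjusted_test(mutant, control, var_gxl):
    """Two-sided test of a genotype difference in a single lab, with the
    standard error inflated by an interaction variance var_gxl imported
    from earlier multi-lab studies (illustrative sketch)."""
    mutant = np.asarray(mutant, float)
    control = np.asarray(control, float)
    n1, n2 = len(mutant), len(control)
    diff = mutant.mean() - control.mean()
    # Pooled within-lab variance, as in a standard two-sample t-test.
    s2 = ((n1 - 1) * mutant.var(ddof=1) +
          (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    se_standard = np.sqrt(s2 * (1 / n1 + 1 / n2))
    # GxL adjustment: the interaction 'noise' enters once per genotype,
    # so 2 * var_gxl is added to the squared standard error.
    se_adjusted = np.sqrt(2 * var_gxl + s2 * (1 / n1 + 1 / n2))
    t = diff / se_adjusted
    # Conservative df choice; an exact df would need a
    # Satterthwaite-type approximation.
    df = n1 + n2 - 2
    p = 2 * stats.t.sf(abs(t), df)
    return diff, se_standard, se_adjusted, p
```

Note that the adjusted standard error is always at least as large as the standard one, so an effect that is significant after adjustment remains significant under the usual single-lab analysis, but not vice versa.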
Figure 1: (a–c) Comparison of two mouse genotypes for three phenotypes across six laboratories, from data set 1 (Supplementary Table 1). Each line connects the genotype means within one laboratory, so its slope reflects the difference between those means. Within-lab significance (coded by line type) is from two-sided tests at the 0.05 level. (a) Small G×L effect (similar slopes) and a significant genotype effect according to the random lab model (RLM). (b) Moderate G×L effect (more variation among labs), but the genotype effect appears fairly replicable and is significant according to the RLM. (c) Substantial G×L effect and no significant genotype effect according to the RLM. A standard single-lab analysis would report a significant genotype effect for Giessen that is opposite in direction to the significant effects for Mannheim, Muenster, and Munich. (d) G×L adjustment decreases the percentage of nonreplicable discoveries (type-I error rate).
From each laboratory's point of view, we compare the G×L adjustment method with the standard method of analysis: a two-tailed t-test at the 0.05 significance level using within-lab variability. Cases in which the RLM analysis did not indicate a replicated genotype effect enabled us to quantify false discoveries (type-I errors; note that we term a nonreplicable discovery 'false' simply because it proved idiosyncratic to the laboratory discovering it). Over all data sets, the average type-I error rate of the standard method ranged from 19.3% to 41% (Fig. 1d). This can be viewed as an estimate of the replicability situation in the field of mouse phenotyping, given the high standardization level in these data sets. G×L adjustment reduced this error rate to the vicinity of the chosen 0.05, ranging from 3.3% to 9%, at the cost of reducing power by 8–30% (Supplementary Table 1). Potential biases in these estimates were addressed using a simulation study (Supplementary Note 2).
For brevity, we present G×L adjustment in terms of statistical significance and type-I errors, but the same adjusted standard error should be used to construct replicable confidence intervals. Comparing multiple phenotypes requires that false discovery rate (FDR) control be applied to the G×L-adjusted P values (as in ref. 6). Similarly, the error rate of 'hits' reported by the IMPC is lower than the rates in Figure 1d, because the IMPC imposes a considerably more conservative significance threshold.
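In the multiple-phenotype case, the Benjamini–Hochberg step-up procedure can be applied directly to the adjusted P values. A minimal self-contained sketch (the helper name and the P values are illustrative, not data from the study):

```python
import numpy as np

def bh_fdr(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: returns a boolean
    rejection mask at FDR level q (illustrative sketch)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    # Step-up thresholds q * i / m for the i-th smallest p-value.
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Reject all hypotheses up to the largest i meeting its threshold.
        k = below.nonzero()[0].max()
        reject[order[:k + 1]] = True
    return reject
```

Applied to hypothetical G×L-adjusted P values for five phenotypes, e.g. `bh_fdr([0.001, 0.012, 0.030, 0.21, 0.64])`, the procedure rejects the three smallest.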
Whereas here we perform G×L adjustment using σ²G×L estimated from the multilab analysis itself, general use will import σ²G×L from previous multilab studies of similar phenotypes but possibly other genotypes, treating σ²G×L as a property of the phenotype rather than of the genotype. This procedure is practical: phenotyping only a few genotypes across several laboratories is enough to estimate σ²G×L and adjust results for other genotypes in other laboratories. No highly coordinated collaboration is required; the results need only be posted in a combined database for the benefit of the community. We provide a prototype web server for performing G×L adjustment (https://stat.cs.tau.ac.il/gxl_app/). By entering their phenotypic results and testing conditions, users receive G×L-adjusted P values and confidence intervals based on the available relevant estimates, and they are encouraged to post their results to further enrich the database (Supplementary Fig. 2).
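Estimating σ²G×L from a multilab data set can be sketched with a balanced two-way ANOVA method-of-moments calculation. This is a simplified illustration under a balanced design (equal animals per genotype-by-lab cell); the studies themselves use mixed-model analyses that also handle unbalanced data.

```python
import numpy as np

def estimate_var_gxl(data):
    """Method-of-moments estimate of the interaction variance from a
    balanced genotype x lab x replicate array (shape G x L x n).
    Uses E[MS_GxL] = sigma2_within + n * sigma2_GxL from the balanced
    two-way ANOVA with random interaction (illustrative sketch)."""
    G, L, n = data.shape
    cell_means = data.mean(axis=2)                    # G x L table
    geno_means = cell_means.mean(axis=1, keepdims=True)
    lab_means = cell_means.mean(axis=0, keepdims=True)
    grand = cell_means.mean()
    # Interaction residuals of the cell-mean table (double centering).
    inter = cell_means - geno_means - lab_means + grand
    ms_gxl = n * (inter ** 2).sum() / ((G - 1) * (L - 1))
    ms_within = ((data - cell_means[:, :, None]) ** 2).sum() / (G * L * (n - 1))
    # Truncate at zero, since a variance component cannot be negative.
    return max((ms_gxl - ms_within) / n, 0.0)
```

On simulated balanced data with a known interaction standard deviation, the estimate recovers the true σ²G×L up to sampling noise, which shrinks as more labs are included.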
A similar approach 'simulates' σ²G×L by systematically 'heterogenizing' housing and testing conditions within single-lab studies (ref. 7). As indicated by the consistently lower type-I error rate in the heterogenized data set (data set 2 versus data set 1 in Supplementary Table 1), this may be a worthwhile effort, although the simple form of heterogenization used did not capture all of the estimated σ²G×L.
The concern about the replicability of phenotyping results may be regarded as an example of the general concern about reproducibility in science, which has been attributed to issues such as the file drawer effect, publication bias, and financial and publicity incentives (ref. 4). While these are all relevant problems, substantial statistical issues have yet to be addressed, such as testing against the relevant variability, as discussed here. The G×L-adjusted P value and confidence interval indicate the prospects of replicating the result in additional laboratories. Reporting them side by side with the usual P value and confidence interval will promote the replicability of preclinical research.
Data availability statement. All data and analyses are publicly available; see “S.3 Data and Code Availability” in the Supplementary Information.
Author contributions
N.K., I.G., and Y.B. conceived the project in cooperation with all other authors. All authors contributed and/or extracted previously published data. I.J., H.M., T.S., S.Y., and Y.B. performed statistical analyses and programmed software. N.K., I.J., and Y.B. drafted the paper, to which I.G., H.M., T.S., H.W., and S.Y. also contributed.
References
1. de Angelis, M.H. et al. Nat. Genet. 47, 969–978 (2015).
2. Koscielny, G. et al. Nucleic Acids Res. 42 (D1), D802–D809 (2014).
3. Collins, F.S. & Tabak, L.A. Nature 505, 612–613 (2014).
4. Crabbe, J.C., Wahlsten, D. & Dudek, B.C. Science 284, 1670–1672 (1999).
5. Kafkafi, N., Benjamini, Y., Sakov, A., Elmer, G.I. & Golani, I. Proc. Natl. Acad. Sci. USA 102, 4619–4624 (2005).
6. Benjamini, Y., Drai, D., Elmer, G., Kafkafi, N. & Golani, I. Behav. Brain Res. 125, 279–284 (2001).
7. Richter, S.H., Garner, J.P. & Würbel, H. Nat. Methods 6, 257–261 (2009).
Acknowledgements
This work is supported by European Research Council grants under FP7/2007-2013: ERC agreement no. 294519 (PSARPS; Y.B., N.K., I.G., I.J., T.S., and S.Y.) and REFINE (H.W.). We thank the International Mouse Phenotyping Consortium (IMPC) and its Data Coordination Centre for the provision of phenotyping data sets.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 The Random Lab Model.
The proposed random lab model (RLM) analysis versus the commonly used fixed lab model (FLM) standard analysis for a single-laboratory experiment. The illustrated example represents a phenotyping experiment comparing two genotypes g' and g'' (e.g., a knockout mutant versus its wild-type control) in a single laboratory l. The two models include the same effects (upper row), but in the RLM the laboratory, and therefore its interaction with the genotype, is modeled as random (effects in red) rather than fixed (blue). b_l is the contribution of the laboratory's specific measurement procedure, common to all animals of any genotype measured in lab l. c_g'l and c_g''l are the contributions of the interactions between the lab's specifics and genotypes g' and g'', common to all animals of the same genotype measured in lab l. When the two genotypes are phenotyped in the same laboratory, the laboratory effect cancels out whether it is fixed or random. The random interaction effects, however, are not the same for the two genotypes, so they do not cancel out; because they are independent, their variances sum in the standard error (SE, bottom row), just as the individual-animal effects do. Unlike the individual-animal 'noise', this interaction component cannot be reduced by increasing the number of animals n, and it cannot be estimated within a single laboratory; it therefore has to be imported from previous multilab experiments (G×L adjustment). The larger SE increases the P value and widens the confidence interval, thus requiring more power to show a difference, but also ensures that results will be replicated in other laboratories.
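The legend above can be summarized in a short model sketch. The notation here is ours, and it assumes a balanced design with n animals per genotype in lab l; it is a sketch, not the exact supplementary formulas.

```latex
% RLM for measurement a of genotype g in lab l
y_{gla} = \mu + \alpha_g + b_l + c_{gl} + \varepsilon_{gla},
\qquad b_l \sim N(0,\sigma^2_{L}),\;
c_{gl} \sim N(0,\sigma^2_{G\times L}),\;
\varepsilon_{gla} \sim N(0,\sigma^2)

% Difference of genotype means within one lab: b_l cancels,
% but the interaction terms do not
\bar{y}_{g'l} - \bar{y}_{g''l}
  = \alpha_{g'} - \alpha_{g''} + \left(c_{g'l} - c_{g''l}\right)
    + \left(\bar{\varepsilon}_{g'l} - \bar{\varepsilon}_{g''l}\right)

% GxL-adjusted standard error with n animals per genotype
\mathrm{SE} = \sqrt{\,2\sigma^2_{G\times L} + \frac{2\sigma^2}{n}\,}
```

The first term under the square root is the interaction 'noise' that does not shrink with n; the second is the familiar individual-animal term that does.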
Supplementary Figure 2 A proposed framework for practical community implementation of G×L adjustment.
Researchers in a local laboratory Lab_l (left) perform a local phenotyping experiment comparing genotypes g' and g''. They search an online community database (right) and retrieve the current estimate of the interaction variability σ²G×L for the phenotype p of interest, estimated from other genotypes g1–g4 across other laboratories Lab1–Lab3. The researchers use this σ²G×L to G×L-adjust the P value and confidence interval of the genotype effect in their local statistical analysis, deriving a conclusion that is more likely to replicate in other laboratories. They also submit their local data to the community database, thus enriching it and enabling an updated estimate of σ²G×L for future users.
Supplementary information
Supplementary Text and Figures
Supplementary Table 1, Supplementary Figures 1 and 2, Supplementary Methods, Supplementary Notes 1 and 2
Cite this article
Kafkafi, N., Golani, I., Jaljuli, I. et al. Addressing reproducibility in single-laboratory phenotyping experiments. Nat Methods 14, 462–464 (2017). https://doi.org/10.1038/nmeth.4259