To the Editor:

Phenotyping genetically engineered mouse lines has become a central strategy for discovering mammalian gene function. The International Mouse Phenotyping Consortium (IMPC) coordinates a large-scale community effort for phenotyping thousands of mutant lines1, making data accessible in public databases2 and distributing novel mutant lines as animal models of human diseases. The utility of any findings, however, critically depends on whether they can be replicated in other laboratories. This 'megascience' project is but one example of the general concern regarding replicability3. Here we introduce a statistical approach and implementation (https://stat.cs.tau.ac.il/gxl_app/) that can be used to estimate the interlab replicability of new results obtained in a single laboratory.

An influential multilaboratory phenotyping study4 concluded that “experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory” on account of significant genotype-by-laboratory interaction (G × L) in several phenotypes. However, we proposed5 a more appropriate statistical model that ascribes a random effect to each laboratory and to its interaction with genotype. This 'random lab model' (RLM) treats the laboratories in the study as a sample representing all potential phenotyping laboratories. It therefore adds the interaction 'noise' σ²G×L to the individual animal noise, generating an adjusted yardstick against which genotype differences are tested. Consequently, the RLM raises the benchmark for finding a significant genotype effect, trades some statistical power for ensuring replicability, and widens the confidence interval of the estimated effect size (Supplementary Fig. 1).
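
In schematic form (a minimal sketch of the model structure, not the exact parameterization of the Supplementary Methods), the RLM for the measurement y of animal k of genotype i in laboratory j is

\[
y_{ijk} = \mu + G_i + L_j + (GL)_{ij} + \varepsilon_{ijk},
\qquad
L_j \sim N(0,\sigma^2_{L}),\quad
(GL)_{ij} \sim N(0,\sigma^2_{G\times L}),\quad
\varepsilon_{ijk} \sim N(0,\sigma^2_{\mathrm{within}}),
\]

with the genotype effects G_i treated as fixed. Under this model, the variance of the difference between two genotype means measured on n animals each in a single laboratory is 2σ²within/n + 2σ²G×L; the interaction term does not shrink as more animals are added, which is what widens the RLM yardstick relative to the within-lab one.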

In practice, however, almost all preclinical experiments are single-lab studies. Suppose that a researcher phenotypes an important animal model and discovers that the difference between the phenotypes of mutants and wild-type controls is large and statistically significant. How would researchers in other labs know whether to use this mutant and expect to replicate the effect? The RLM implies that in single-lab experiments the correct yardstick against which the genotype effect is tested should include σ²G×L in addition to the commonly used within-lab variability. We term this a 'G × L adjustment' (Supplementary Methods; implications demonstrated in Fig. 1a–c) and validate it by analyzing eight data sets from published multilab mouse phenotyping studies and databases (Supplementary Table 1 and Supplementary Note 1). These data sets include standard physiological, anatomical, and behavioral phenotypes measured in inbred strains and mutant lines. They offer the opportunity to assess the replicability of single-lab results against the multilab RLM conclusions regarding replicable genotypic differences.
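
As a concrete illustration, the following Python sketch contrasts the standard within-lab t-test with a G × L-adjusted test for a single-lab two-group comparison. It assumes an externally supplied estimate of σ²G×L (called gxl_var here) with a stated number of degrees of freedom, and it uses a Satterthwaite-type approximation for the adjusted degrees of freedom; the exact formulas of our Supplementary Methods and web tool may differ in detail.

    import numpy as np
    from scipy import stats

    def standard_and_gxl_adjusted_test(mutant, control, gxl_var, gxl_df=10):
        """Two-sided t-tests for a single-lab genotype comparison.

        mutant, control : arrays of individual-animal measurements
        gxl_var         : external estimate of the G x L interaction variance
        gxl_df          : degrees of freedom of that estimate (assumed here)
        """
        mutant, control = np.asarray(mutant, float), np.asarray(control, float)
        n1, n2 = len(mutant), len(control)
        diff = mutant.mean() - control.mean()

        # Pooled within-lab variance (the usual yardstick).
        within_df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * mutant.var(ddof=1) +
                      (n2 - 1) * control.var(ddof=1)) / within_df
        within_term = pooled_var * (1 / n1 + 1 / n2)

        # Standard analysis: within-lab variability only.
        t_std = diff / np.sqrt(within_term)
        p_std = 2 * stats.t.sf(abs(t_std), within_df)

        # G x L adjustment: each genotype mean carries its own interaction term.
        gxl_term = 2 * gxl_var
        se_adj = np.sqrt(within_term + gxl_term)
        # Satterthwaite-type degrees of freedom for the combined variance.
        df_adj = (within_term + gxl_term) ** 2 / (
            within_term ** 2 / within_df + gxl_term ** 2 / gxl_df)
        t_adj = diff / se_adj
        p_adj = 2 * stats.t.sf(abs(t_adj), df_adj)
        return p_std, p_adj

    # Hypothetical single-lab data and an assumed interaction variance.
    rng = np.random.default_rng(0)
    p_std, p_adj = standard_and_gxl_adjusted_test(
        mutant=rng.normal(12.0, 2.0, 12), control=rng.normal(10.0, 2.0, 12),
        gxl_var=0.8)
    print(f"standard P = {p_std:.3f}, GxL-adjusted P = {p_adj:.3f}")

With a nonzero gxl_var the adjusted standard error is always wider than the within-lab one, so the adjusted P value is typically larger, reflecting the stricter yardstick.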

Figure 1: Adjusting for genotype-by-laboratory interaction (G × L).

(a–c) Comparison of two mouse genotypes for three phenotypes across six laboratories from data set 1 (Supplementary Table 1). Each line connects genotype means within the same laboratory, so the slope of each line reflects the difference in these means. Within-lab significance (coded by line type) refers to two-sided tests at the 0.05 level. (a) Small G × L effect (similar slopes) and a significant genotype effect according to the random lab model (RLM). (b) Moderate G × L effect (more variation among labs), but the genotype effect appears fairly replicable and is significant according to the RLM. (c) Substantial G × L effect and no significant genotype effect according to the RLM. Standard single-lab analysis would report a significant genotype effect for Giessen that is opposite in direction to the significant effects for Mannheim, Muenster, and Munich. (d) G × L adjustment decreases the percentage of nonreplicable discoveries (type-I error rate).


From each laboratory's point of view, we compare the G × L adjustment method with the standard method of analysis, a two-tailed t-test at the 0.05 significance level using within-lab variability. Cases in which the RLM analysis did not indicate a replicated genotype effect enabled us to quantify false discoveries (type-I errors; note that we term a nonreplicable discovery 'false' simply because it proved idiosyncratic to the laboratory discovering it). Over all data sets, the average type-I error rate of the standard method ranged between 19.3% and 41% (Fig. 1d). This can be viewed as an estimate of the replicability situation in the field of mouse phenotyping, given the high level of standardization in these data sets. The G × L adjustment reduced this error rate to the vicinity of the chosen 0.05, ranging from 3.3% to 9%, at the cost of reducing power by 8–30% (Supplementary Table 1). Potential biases in the above estimates were addressed using a simulation study (Supplementary Note 2).
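
The gap between the two error rates can be reproduced qualitatively with a small simulation (a sketch under assumed parameter values, not the analysis of Supplementary Note 2): when there is no true genotype effect but σ²G×L is nonzero, the standard within-lab t-test rejects far more often than its nominal 5%, whereas a test against the adjusted yardstick stays near the nominal level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, sigma_within, sigma_gxl = 10, 1.0, 0.5   # assumed values for illustration
    n_sim, alpha = 20000, 0.05
    reject_std = reject_adj = 0

    for _ in range(n_sim):
        # No true genotype effect; the lab draws its own G x L shifts per genotype.
        shift_mut, shift_ctl = rng.normal(0.0, sigma_gxl, 2)
        mutant = shift_mut + rng.normal(0.0, sigma_within, n)
        control = shift_ctl + rng.normal(0.0, sigma_within, n)

        diff = mutant.mean() - control.mean()
        within_df = 2 * n - 2
        pooled = (n - 1) * (mutant.var(ddof=1) + control.var(ddof=1)) / within_df
        within_term = pooled * 2 / n

        # Standard within-lab t-test.
        t_std = diff / np.sqrt(within_term)
        reject_std += (2 * stats.t.sf(abs(t_std), within_df) < alpha)

        # Adjusted yardstick: add 2 * sigma_gxl**2 (treated here as known).
        t_adj = diff / np.sqrt(within_term + 2 * sigma_gxl**2)
        reject_adj += (2 * stats.t.sf(abs(t_adj), within_df) < alpha)

    print(f"standard type-I error ~ {reject_std / n_sim:.3f}")
    print(f"adjusted type-I error ~ {reject_adj / n_sim:.3f}")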

For brevity, we present the G × L adjustment in terms of statistical significance and type-I errors, but the same adjusted standard error should be used to construct replicable confidence intervals. When multiple phenotypes are compared, false discovery rate (FDR) control should be applied to the G × L-adjusted P values (as in ref. 6). Similarly, the error rate of 'hits' reported by the IMPC is lower than the rates in Figure 1d, because the IMPC imposes a considerably more conservative significance threshold.
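
For example, if several phenotypes are tested for the same mutant line, the adjusted P values can be FDR-corrected, for instance with the Benjamini–Hochberg step-up procedure (shown here with statsmodels; the P values are hypothetical):

    from statsmodels.stats.multitest import multipletests

    # Hypothetical GxL-adjusted P values for several phenotypes of one mutant line.
    adjusted_p = [0.002, 0.019, 0.031, 0.240, 0.470]
    reject, q_values, _, _ = multipletests(adjusted_p, alpha=0.05, method="fdr_bh")
    for p, q, r in zip(adjusted_p, q_values, reject):
        print(f"P = {p:.3f}  q = {q:.3f}  discovery: {r}")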

Here we perform the G × L adjustment using σ²G×L estimated from the multilab analysis itself; in general use, σ²G×L would be taken from previous multilab studies of similar phenotypes, though possibly of other genotypes, treating σ²G×L as a property of the phenotype rather than of the genotype. This procedure is practical: phenotyping only a few genotypes across several laboratories enables estimation of σ²G×L and adjustment of other genotypes in other laboratories. No highly coordinated collaboration is required, as these results can simply be posted in a combined database for the benefit of the community. We provide a prototype web server for performing the G × L adjustment (https://stat.cs.tau.ac.il/gxl_app/). By entering phenotypic results and testing conditions, users receive G × L-adjusted P values and confidence intervals based on the available relevant estimates, and they are encouraged to post their own results in order to further enrich the database (see Supplementary Fig. 2).
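
As an illustration of how such an estimate could be obtained, the sketch below fits a mixed model with a random laboratory effect and a laboratory-by-genotype variance component to a small synthetic multilab data set. The use of statsmodels, the synthetic data, and all variable names are assumptions for illustration; this is not the estimation code behind the web server.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic long-format data: a few genotypes phenotyped in several laboratories.
    rng = np.random.default_rng(2)
    rows = []
    for lab in range(6):
        lab_shift = rng.normal(0.0, 0.7)                    # random lab effect
        for g, geno_mean in enumerate([10.0, 11.0, 12.0]):  # fixed genotype means
            gxl_shift = rng.normal(0.0, 0.5)                # G x L interaction
            for _ in range(10):                             # animals per group
                rows.append({"lab": f"lab{lab}", "genotype": f"g{g}",
                             "phenotype": geno_mean + lab_shift + gxl_shift
                                          + rng.normal(0.0, 1.0)})
    df = pd.DataFrame(rows)

    # Mixed model: fixed genotype effect, random lab intercept, and a variance
    # component for lab-by-genotype, i.e. sigma^2_GxL.
    model = smf.mixedlm("phenotype ~ C(genotype)", data=df, groups="lab",
                        re_formula="1", vc_formula={"gxl": "0 + C(genotype)"})
    fit = model.fit(reml=True)
    print(fit.summary())                  # the 'gxl' variance component row
    print("estimated sigma^2_GxL:", float(fit.vcomp[0]))

The fitted variance component can then be supplied, together with its degrees of freedom, as the external σ²G×L estimate when adjusting a new single-lab comparison of the same phenotype.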

A similar approach 'simulates' σ²G×L by systematically 'heterogenizing' housing and testing conditions within single-lab studies7. As indicated by the consistently lower type-I error rate in the heterogenized data set (data sets 2 versus 1 in Supplementary Table 1), this may be a worthwhile effort, although the simple form of heterogenization used did not capture all of the estimated σ²G×L.

The concern about replicability of phenotyping results may be regarded as an example of the general concern about reproducibility in science, which has been attributed to issues such as the file drawer effect, publication bias, financial and publicity incentives, etc.4. While these are all relevant problems, substantial statistical issues have yet to be addressed, such as testing with the relevant variability as discussed here. The G × L-adjusted P value and confidence interval indicate the prospects of replicating the result in additional laboratories. Reporting these values side by side with the usual P value and confidence interval will promote replicability of preclinical research.

Data availability statement. All data and analysis are publicly available; see “S.3 Data and Code Availability” in the Supplementary Information.

Author contributions

N.K., I.G., and Y.B. conceived the project in cooperation with all other authors. All authors contributed data and/or extracted previously published data. I.J., H.M., T.S., S.Y., and Y.B. performed statistical analyses and programmed software. N.K., I.J., and Y.B. drafted the paper, to which I.G., H.M., T.S., H.W., and S.Y. also contributed.