Abstract
Genome-wide association studies (GWAS) have brought methodological challenges in handling massive high-dimensional data, along with real opportunities for studying the joint effect of many risk factors acting in concert as an organic group. The random forest (RF) methodology is recognized by many for its potential in examining interaction effects in large data sets. However, RF is not designed to directly handle GWAS data, which typically have hundreds of thousands of single-nucleotide polymorphisms (SNPs) as predictor variables. We propose and evaluate a novel extension of RF, called random forest fishing (RFF), for GWAS analysis. RFF repeatedly updates a relatively small set of predictors obtained by RF tests to find globally important groups predictive of the disease phenotype, using a novel search algorithm based on genetic programming and simulated annealing. A key improvement of RFF comes from guiding the search with empirical test results of genome-wide pairwise interactions. Evaluated on simulated and real GWAS data sets, RFF is shown to be effective in identifying important predictors, particularly when both marginal effects and interactions exist, and is applicable to very large GWAS data sets.
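The abstract describes the search only at a high level, so the following is a minimal sketch of the general idea of repeatedly updating a small working set of predictors under a simulated-annealing acceptance rule. It is not the authors' RFF algorithm: it omits the genetic-programming component and the pairwise-interaction guidance, and it replaces the random forest test with a pluggable `score_fn`. The function name `anneal_subset_search`, its parameters, the toy scoring function and the "causal" SNP indices are all hypothetical, chosen for illustration only.

```python
import math
import random


def anneal_subset_search(n_predictors, subset_size, score_fn,
                         n_iter=5000, t0=1.0, cooling=0.995, seed=0):
    """Search for a high-scoring subset of predictor indices.

    score_fn maps a set of indices to a number (higher is better). In an
    RF-based setting it might be, e.g., out-of-bag accuracy of a forest
    grown on those predictors (an assumption, not taken from the paper).
    """
    rng = random.Random(seed)
    current = set(rng.sample(range(n_predictors), subset_size))
    current_score = score_fn(current)
    best, best_score = set(current), current_score
    temp = t0
    for _ in range(n_iter):
        # Propose a neighbour: swap one member for one outsider.
        out = rng.choice(sorted(current))
        inn = rng.randrange(n_predictors)
        while inn in current:
            inn = rng.randrange(n_predictors)
        candidate = (current - {out}) | {inn}
        cand_score = score_fn(candidate)
        delta = cand_score - current_score
        # Always accept improvements or ties; accept a worse subset with
        # probability exp(delta / temp), which shrinks as temp cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temp = max(temp * cooling, 1e-12)
    return best, best_score


# Toy stand-in for an RF test (hypothetical): SNPs 3, 17 and 42 each add a
# marginal effect, and SNPs 17 and 42 together add an interaction bonus.
CAUSAL = {3, 17, 42}


def toy_score(subset):
    marginal = sum(1 for snp in subset if snp in CAUSAL)
    interaction = 2 if {17, 42} <= subset else 0
    return marginal + interaction


best, best_score = anneal_subset_search(50, 10, toy_score)
print(sorted(best), best_score)
```

Keeping the scoring function separate from the search loop means a real forest-based score could be swapped in without touching the acceptance logic. With hundreds of thousands of SNPs, a uniform random proposal like the one above would rarely hit an informative predictor, which is presumably why the abstract emphasizes guided search.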
Acknowledgements
This research is supported in part by NIH grants HL091028, HL071782, DA012854, and DA027995.
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Supplementary Information accompanies this paper on the European Journal of Human Genetics website
About this article
Cite this article
Yang, W., Charles Gu, C. Random forest fishing: a novel approach to identifying organic group of risk factors in genome-wide association studies. Eur J Hum Genet 22, 254–259 (2014). https://doi.org/10.1038/ejhg.2013.109
DOI: https://doi.org/10.1038/ejhg.2013.109