Interpretation of an individual functional genomics experiment guided by massive public data

Abstract

A key unmet challenge in interpreting omics experiments is inferring biological meaning in the context of public functional genomics data. We developed a computational framework, Your Evidence Tailored Integration (YETI; http://yeti.princeton.edu/), which creates specialized functional interaction maps from large public datasets relevant to an individual omics experiment. Using this tailored integration, we predicted and experimentally confirmed an unexpected divergence in viral replication after seasonal or pandemic human influenza virus infection.

Fig. 1: Overview of YETI.
Fig. 2: Evaluation of network accuracy and relevance.
Fig. 3: YETI maps the specific functional landscapes of human dendritic cells after seasonal or pandemic influenza virus infection.

Data availability

The virus infection microarray data are available in GEO under accession GSE55278. Researchers may submit their data of interest for YETI analysis at http://yeti.princeton.edu/. Visualization and exploration of their YETI network and precomputed YETI networks are also available at http://yeti.princeton.edu. All data used in this study are available from the corresponding author on request.

Acknowledgements

We thank R. Dannenfelser for help in processing the TCGA RNA-seq datasets and A. Krishnan for discussions regarding the network evaluations. We greatly appreciate all members of the Troyanskaya lab for their valuable advice and discussions. This work was supported in part by the NIH (grant NIH U19 AI117873 to S.C.S.; grant NIH R01 GM071966 to O.G.T.). O.G.T. is a senior fellow of the Genetic Networks program of the Canadian Institute for Advanced Research (CIFAR).

Author information

Contributions

Y.-s.L., E.Z., S.C.S., and O.G.T. conceived and designed the research. Y.-s.L. performed the computational analyses with contributions from C.Y.P. A.K.W., A.T., and Y.-s.L. developed the web interface. B.M.H., V.A.D., and I.R. performed the molecular experiments. Y.-s.L., E.Z., S.C.S., and O.G.T. wrote the manuscript with revisions from all other authors.

Corresponding authors

Correspondence to Elena Zaslavsky or Stuart C. Sealfon.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Overview of other approaches.

(left) Coexpression networks are based entirely on correlated expression within a specific dataset. This yields functional relationships that are highly relevant to that dataset but does not accurately capture biological pathway interactions. (right) A generic Bayesian integration using a global data compendium accurately identifies biological pathway interactions, but the functional relationships in this network have low specificity to any individual dataset.
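To make the coexpression side of this comparison concrete, the following is a minimal sketch of how such a network could be built from a single dataset. It assumes Pearson correlation and an arbitrary cutoff of 0.8; the caption specifies neither, and the gene names and expression values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no coexpression signal
    return cov / (sx * sy)


def coexpression_edges(expr, cutoff=0.8):
    """Return {(gene_a, gene_b): r} for gene pairs with |r| >= cutoff.

    The cutoff is an assumption chosen for illustration only.
    """
    edges = {}
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= cutoff:
            edges[(a, b)] = r
    return edges


# Hypothetical toy dataset: gene -> expression across four samples.
toy = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],    # scales with GENE_A
    "GENE_C": [1.0, -1.0, 1.0, -1.0],  # weakly related to both
}
edges = coexpression_edges(toy)
```

Every edge here comes from the one input dataset alone, which is why such a network is dataset-specific but blind to pathway knowledge held elsewhere.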

Supplementary Figure 2 YETI computational framework to construct dataset-specific functional networks.

Known functional interactions are categorized into 237 distinctive biological processes spanning the multifaceted interaction landscape of the human genome. The context-specific interaction network of each biological process is learned through Bayesian integration of the public data compendium. These 237 Bayesian functional networks (i.e. source networks) are then selected based on similar interaction patterns in the user dataset of interest.
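As a rough illustration of Bayesian integration across datasets, the sketch below combines per-dataset evidence for one gene pair into a posterior probability of a functional relationship. It assumes a naive Bayes model in which datasets are conditionally independent; the caption does not state the classifier, and the function name, prior, and likelihood values are hypothetical.

```python
from math import exp, log


def posterior_related(prior, likelihoods):
    """Posterior P(related | evidence) under a naive Bayes assumption.

    prior       -- P(related) before seeing any data, strictly between 0 and 1
    likelihoods -- one (P(e | related), P(e | unrelated)) pair per dataset,
                   where e is the evidence observed for this gene pair
    """
    log_odds = log(prior / (1.0 - prior))
    for p_rel, p_unrel in likelihoods:
        # Each dataset shifts the log-odds by its log likelihood ratio.
        log_odds += log(p_rel / p_unrel)
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical example: two datasets support the relationship, one does not.
p = posterior_related(0.1, [(0.6, 0.2), (0.5, 0.25), (0.3, 0.4)])
```

Working in log-odds keeps the product of many likelihood ratios numerically stable when the compendium contains a large number of datasets.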

Supplementary Figure 3 Example work-through of the YETI webserver.

The user first submits her omics dataset of interest in a simple tab-separated values (TSV) format with genes in rows and omics assays in columns. The web server then integrates the public data compendium in accordance with the latent data structure of the input dataset. The user can then explore the dataset-relevant source networks and the dataset-specific functional map of query genes to gain deeper insight into the input dataset.
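The input layout described above (genes in rows, assays in columns) can be checked before submission with a short parser. This is a minimal sketch that assumes a header row whose first cell labels the gene column; the exact header and identifier conventions expected by the web server are not stated here, so the column names and gene identifiers below are hypothetical.

```python
import csv
import io


def read_expression_tsv(handle):
    """Parse a genes-by-assays TSV into (assay_names, {gene: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    assays = header[1:]  # assumed: first column holds gene identifiers
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(assays):
            raise ValueError(
                f"{gene}: expected {len(assays)} values, got {len(values)}"
            )
        matrix[gene] = [float(v) for v in values]
    return assays, matrix


# Hypothetical two-gene, two-assay file held in memory.
example = "gene\tassay_1\tassay_2\nGENE_A\t1.5\t2.0\nGENE_B\t0.3\t0.9\n"
assays, matrix = read_expression_tsv(io.StringIO(example))
```

Raising on a ragged row catches the most common formatting problem, a missing or extra tab, before the file is uploaded.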

Supplementary Figure 4 Effect of exclusion of a single dataset from generic or YETI integrations.

Distributions of the Dataset Specificity Score are shown for generic integration including the dataset, generic integration excluding it, and YETI. The center line represents the median, and the lower and upper hinges indicate the first and third quartiles. The upper whisker extends to the largest value no more than 1.5 × IQR above the upper hinge, and the lower whisker extends to the smallest value no more than 1.5 × IQR below the lower hinge. Ten of the 362 GEO datasets used for evaluation in Fig. 2 were chosen at random to be excluded from or included in the generic integrations, which were evaluated over the MeSH terms relevant to each dataset (see Supplemental Online Methods). YETI achieved significantly improved dataset specificity over generic integration (**p = 3.1 × 10⁻⁴, one-tailed paired t test), and including the dataset of interest in the generic integration had no effect on specificity (p = 0.87, one-tailed paired t test). N.S. = not significant.
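The boxplot conventions in this caption (median, hinges at the quartiles, whiskers limited to 1.5 × IQR beyond the hinges) can be written out as a small function. This is a sketch under one assumption: quartiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles`, whereas plotting tools differ slightly in how they define hinges. The sample values are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return the median, hinges, and whisker ends for a boxplot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "median": median,
        "lower_hinge": q1,
        "upper_hinge": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "lower_whisker": min(v for v in values if v >= low_fence),
        "upper_whisker": max(v for v in values if v <= high_fence),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Hypothetical scores with one extreme value.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Points beyond the fences are reported separately rather than stretching the whiskers, which is what keeps a single extreme score from distorting the plot.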

Supplementary Figure 5 Evaluation of YETI network performance robustness to the number of directly relevant datasets in the data compendium.

Accuracy scores of YETI networks from disease datasets were grouped by the number of datasets annotated to the disease, excluding the user dataset used for YETI analysis. Boxplots were drawn as in Supplementary Fig. 4. The sample sizes of the boxplots from left to right are: 52, 36, 64, 35, and 249.

Supplementary Figure 6 Evaluation of vulnerability of the density of YETI networks and co-expression networks to dataset size.

(a) Network densities of co-expression networks decreased exponentially with greater dataset size. (b) Network densities of YETI networks remained consistently low across input datasets of different sizes.
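For readers unfamiliar with the measure, network density can be computed as the fraction of possible gene pairs that are connected. The sketch below assumes an undirected, unweighted network obtained after thresholding edge scores; the caption does not say how edges were thresholded, and the node labels are hypothetical.

```python
def network_density(n_genes, edges):
    """Fraction of possible undirected gene pairs that are connected.

    edges -- iterable of (gene_a, gene_b) pairs; direction and duplicates
             are ignored, and self-loops are dropped.
    """
    if n_genes < 2:
        return 0.0
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    possible = n_genes * (n_genes - 1) / 2
    return len(unique) / possible


# Hypothetical four-gene network with three distinct edges.
density = network_density(
    4, [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")]
)
```

Because the denominator grows with the square of the number of genes, density is comparable across networks of different sizes only when the same gene universe is used.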

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6

Reporting Summary

Supplementary Data 1

Name and description of datasets included in the public data compendium

Supplementary Data 2

ID and name of the 237 source networks

About this article

Cite this article

Lee, Ys., Wong, A.K., Tadych, A. et al. Interpretation of an individual functional genomics experiment guided by massive public data. Nat Methods 15, 1049–1052 (2018). https://doi.org/10.1038/s41592-018-0218-5

  • DOI: https://doi.org/10.1038/s41592-018-0218-5
