
Evaluation of large language models for discovery of gene set function

Abstract

Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.
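The similarity figures quoted above are semantic similarities between an LLM-proposed name and the curated GO name, computed from sentence embeddings of the two names (the paper uses a biomedical embedding model; see ref. 73). As an illustrative sketch only — the toy vectors below stand in for real model embeddings — the comparison reduces to a cosine similarity:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for model outputs (illustrative values only).
llm_name_vec = [0.9, 0.1, 0.3]
go_name_vec = [0.8, 0.2, 0.4]
random_vec = [0.1, 0.9, 0.0]

print(cosine_similarity(llm_name_vec, go_name_vec))  # high: related names
print(cosine_similarity(llm_name_vec, random_vec))   # low: unrelated names
```

A matched pair of names scores near 1, while a name paired with an unrelated one scores much lower, which is what lets the similarity be ranked against a null distribution.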

Fig. 1: Use and evaluation of LLMs for functional analysis of gene sets.
Fig. 2: Evaluation of LLMs in recovering GO gene set names.
Fig. 3: Evaluation of LLM self-confidence.
Fig. 4: Evaluation of GPT-4 in naming ‘omics gene clusters.
Fig. 5: Representative analysis for protein interaction clusters (NeST:2-105).

Data availability

All data used in this paper are publicly available. The full GO (2023-11-15 release) was downloaded from http://release.geneontology.org/2023-11-15/ontology/index.html. The selected NeST gene set is available for download from https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation/blob/main/data/Omics_data/NeST__IAS_clixo_hidef_Nov17.edges. The L1000 data used in this study are available at https://maayanlab.cloud/static/hdfs/harmonizome/data/lincscmapchemical/gene_attribute_edges.txt.gz. The viral infection data are available at https://maayanlab.cloud/static/hdfs/harmonizome/data/geovirus/gene_attribute_matrix.txt.gz. Detailed information on data download and parsing procedures, along with all datasets used in this paper, is available in our GitHub repository at https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation.

Code availability

The code to run the LLM gene set analysis pipeline and to reproduce results for the evaluation tasks is available via GitHub at https://github.com/idekerlab/llm_evaluation_for_gene_set_interpretation or Code Ocean (ref. 77; https://doi.org/10.24433/CO.7045777.v1) under the MIT License. Note that LLM outputs are inherently stochastic, and the precise names and analysis text produced by the models are not guaranteed to be the same from run to run. We minimized the variability of the outputs as described in the ‘Controlling the variability of LLM responses’ section of the Methods.
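The exact variability controls are described in the Methods; one standard approach is to lower the sampling temperature toward greedy (argmax) decoding. The sketch below is illustrative of that mechanism only, not the authors' pipeline code:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample an index from a temperature-scaled softmax over logits.

    temperature <= 0 is treated as greedy (argmax) decoding, which is
    fully deterministic; higher temperatures flatten the distribution
    and increase run-to-run variability.
    """
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(logits) - 1

logits = [2.0, 1.0, 0.5]
print(sample_token(logits, temperature=0))  # always 0: the argmax index
```

With temperature 0 every run returns the same token, which is why low-temperature settings are the usual first step in stabilizing LLM outputs.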

References

  1. Zeeberg, B. R. et al. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol. 4, R28 (2003).

  2. Breitling, R., Amtmann, A. & Herzyk, P. Iterative group analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinf. 5, 34 (2004).

  3. Beissbarth, T. & Speed, T. P. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20, 1464–1465 (2004).

  4. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

  5. Al-Shahrour, F. et al. From genes to functional classes in the study of biological systems. BMC Bioinf. 8, 114 (2007).

  6. Backes, C. et al. GeneTrail—advanced gene set enrichment analysis. Nucleic Acids Res. 35, W186–W192 (2007).

  7. Huang, D. W., Sherman, B. T. & Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).

  8. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinf. 14, 128 (2013).

  9. Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinf. 19, 470 (2018).

  10. Cerami, E. G. et al. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 39, D685–D690 (2011).

  11. Fabregat, A. et al. The Reactome pathway Knowledgebase. Nucleic Acids Res. 44, D481–D487 (2015).

  12. Pico, A. R. et al. WikiPathways: pathway editing for the people. PLoS Biol. 6, e184 (2008).

  13. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).

  14. Pillich, R. T. et al. NDEx IQuery: a multi-method network gene set analysis leveraging the Network Data Exchange. Bioinformatics 39, btad118 (2023).

  15. Wang, S. et al. Typing tumors using pathways selected by somatic evolution. Nat. Commun. 9, 4159 (2018).

  16. Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).

  17. Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).

  18. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).

  19. Kanehisa, M. Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019).

  20. Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).

  21. Croft, D. Reactome: a database of biological pathways. Nat. Preced. https://doi.org/10.1038/npre.2010.5025.1 (2010).

  22. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).

  23. Sollis, E. et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985 (2023).

  24. Blake, J. A. et al. The Mouse Genome Database genotypes::phenotypes. Nucleic Acids Res. 37, D712–D719 (2009).

  25. Weng, M.-P. & Liao, B.-Y. MamPhEA: a web tool for mammalian phenotype enrichment analysis. Bioinformatics 26, 2212–2213 (2010).

  26. Keenan, A. B. et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 47, W212–W224 (2019).

  27. Rubin, J. D. et al. Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment. Commun. Biol. 4, 661 (2021).

  28. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  29. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  30. Hu, C. et al. CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 51, D870–D876 (2023).

  31. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).

  32. Brown, T. B. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) 1877–1901 (NeurIPS, 2020).

  33. Vaswani, A. et al. Attention is all you need. Neural Inf. Process Syst. 30, 5998–6008 (2017).

  34. OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  35. Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

  36. Jiang, A. Q. et al. Mixtral of experts. Preprint at https://arxiv.org/abs/2401.04088 (2024).

  37. Gemini Team et al. Gemini: a family of highly capable multimodal models. Preprint at https://arxiv.org/abs/2312.11805 (2023).

  38. Joachimiak, M. P., Harry Caufield, J., Harris, N. L., Kim, H. & Mungall, C. J. Gene set summarization using large language models. Preprint at https://arxiv.org/abs/2305.13338 (2023).

  39. Moghaddam, S. R. & Honey, C. J. Boosting theory-of-mind performance in large language models via prompting. Preprint at https://arxiv.org/abs/2304.11490 (2023).

  40. Hebenstreit, K., Praas, R., Kiesewetter, L. P. & Samwald, M. An automatically discovered chain-of-thought prompt generalizes to novel models and datasets. Preprint at https://arxiv.org/abs/2305.02897 (2023).

  41. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 24824–24837 (NeurIPS, 2022).

  42. Caufield, J. H. et al. Structured prompt interrogation and recursive extraction of semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40, btae104 (2024).

  43. Miller, G. A. & Charles, W. G. Contextual correlates of semantic similarity. Lang. Cogn. Process. 6, 1–28 (1991).

  44. Xiong, M. et al. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In The Twelfth International Conference on Learning Representations (ICLR, 2024).

  45. Fu, J., Ng, S.-K., Jiang, Z. & Liu, P. GPTScore: evaluate as you desire. In Proc. 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1: Long Papers (eds Duh, K. et al.) 6556–6576 (Association for Computational Linguistics, 2024).

  46. Kolberg, L. et al. g:Profiler-interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 51, W207–W212 (2023).

  47. Duan, Q. et al. LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res. 42, W449–W460 (2014).

  48. Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171, 1437–1452.e17 (2017).

  49. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2013).

  50. Zheng, F. et al. Interpretation of cancer mutations using a multiscale map of protein systems. Science 374, eabf3067 (2021).

  51. Pinkas, D. M. et al. Structural complexity in the KCTD family of Cullin3-dependent E3 ubiquitin ligases. Biochem. J. 474, 3747–3761 (2017).

  52. Dhanoa, B. S., Cogliati, T., Satish, A. G., Bruford, E. A. & Friedman, J. S. Update on the Kelch-like (KLHL) gene family. Hum. Genomics 7, 13 (2013).

  53. Pleiner, T. et al. WNK1 is an assembly factor for the human ER membrane protein complex. Mol. Cell 81, 2693–2704.e12 (2021).

  54. Berthold, J. et al. Characterization of RhoBTB-dependent Cul3 ubiquitin ligase complexes—evidence for an autoregulatory mechanism. Exp. Cell. Res. 314, 3453–3465 (2008).

  55. McCormick, J. A. et al. Hyperkalemic hypertension-associated cullin 3 promotes WNK signaling by degrading KLHL3. J. Clin. Invest. 124, 4723–4736 (2014).

  56. Sohara, E. & Uchida, S. Kelch-like 3/Cullin 3 ubiquitin ligase complex and WNK signaling in salt-sensitive hypertension and electrolyte disorder. Nephrol. Dial. Transpl. 31, 1417–1424 (2016).

  57. Tang, H., Finn, R. D. & Thomas, P. D. TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations. Bioinformatics 35, 518–520 (2019).

  58. Groh, B. S. et al. The antiobesity factor WDTC1 suppresses adipogenesis via the CRL4WDTC1 E3 ligase. EMBO Rep. 17, 638–647 (2016).

  59. Ji, W. & Rivero, F. Atypical rho GTPases of the RhoBTB subfamily: roles in vesicle trafficking and tumorigenesis. Cells 5, 28 (2016).

  60. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. Preprint at https://arxiv.org/abs/2303.13375 (2023).

  61. López Espejel, J., Ettifouri, E. H., Yahaya Alassan, M. S., Chouham, E. M. & Dahhane, W. GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot learning and performance boosting through prompts. Nat. Lang. Process. J. 5, 100032 (2023).

  62. Yu, H. et al. Evaluation of retrieval-augmented generation: a survey. Preprint at https://arxiv.org/abs/2405.07437 (2024).

  63. Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations (ICLR, 2023).

  64. Nair, V., Schumacher, E., Tso, G. & Kannan, A. DERA: enhancing large language model completions with dialog-enabled resolving agents. In Proc. 6th Clinical Natural Language Processing Workshop (eds Naumann, T. et al.) 122–161 (Association for Computational Linguistics, 2023).

  65. Shinn, N. et al. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 8634–8652 (NeurIPS, 2023).

  66. Li, G., Al Kader Hammoud, H. A., Itani, H., Khizbullin, D. & Ghanem, B. CAMEL: communicative agents for ‘mind’ exploration of large scale language model society. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 51991–52008 (NeurIPS, 2023).

  67. Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 68539–68551 (NeurIPS, 2023).

  68. Shen, Y. et al. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 38154–38180 (NeurIPS, 2023).

  69. Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C. & Socher, R. CTRL: a conditional transformer language model for controllable generation. Preprint at https://arxiv.org/abs/1909.05858 (2019).

  70. Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In The Eighth International Conference on Learning Representations (ICLR, 2020).

  71. Smith, B. et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25, 1251–1255 (2007).

  72. Tirmizi, S. H. et al. Mapping between the OBO and OWL ontology languages. J. Biomed. Semant. 2, S3 (2011).

  73. Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Toutanova, K. et al.) 4228–4238 (Association for Computational Linguistics, 2021).

  74. Wolf, T. et al. Transformers: state-of-the-art natural language processing. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45 (Association for Computational Linguistics, 2020).

  75. Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016, baw100 (2016).

  76. Seal, R. L. et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 51, D1003–D1009 (2023).

  77. Hu, M. et al. Evaluation of Large Language Models for Discovery of Gene Set Function (Code Ocean, 2024); https://doi.org/10.24433/CO.7045777.V1

Acknowledgements

This work was supported by National Institutes of Health grants U24 CA269436 (R.T.P., D.F., K.S., T.I. and D.P.), OT2 OD032742 (M.H., T.I. and D.P.), U24 HG012107 (D.F., T.I. and D.P.) and U01 MH115747 (S.A. and T.I.). Additional support was received from Schmidt Futures (M.H. and T.I.). We thank X. Zhao and A. Singhal for insightful comments, M. R. Kelly for providing the NeST data raw files and C. Churas and J. Lenkiewicz for helping to improve the GitHub repository.

Author information

Contributions

M.H., S.A., T.I. and D.P. designed the study. M.H. and S.A. developed and implemented the automated LLM-based gene set interpretation pipeline, performed the data analysis and organized the GitHub repository. S.A. developed and assessed the semantic similarity calculation. I.L. and M.H. contributed to the development of the citation search and validation pipeline. D.P. contributed to the coding and the evaluation of the analysis. R.T.P. assisted in the study design, prompt engineering and the evaluation of the analysis. M.H., R.T.P., R.B. and D.P. conducted the scientific review of the LLM output. M.H. and D.P. contributed to the user interface design for the GSAI tool. D.F. built the web interface for the GSAI tool, and K.S. set up the server for accessing open-source LLMs. M.H., S.A., T.I. and D.P. wrote the paper with input from all authors. All authors approved the final version of this paper.

Corresponding authors

Correspondence to Trey Ideker or Dexter Pratt.

Ethics declarations

Competing interests

T.I. is a cofounder and member of the advisory board and has an equity interest in Data4Cure and Serinus Biosciences. T.I. is a consultant for and has an equity interest in Ideaya Biosciences. The terms of these arrangements have been reviewed and approved by the University of California San Diego in accordance with its conflict-of-interest policies. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Qiao Jin, Zhiyong Lu, Zhizheng Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Schematic of the citation module.

a, GPT-4 is asked to provide gene symbol keywords and functional keywords separately. Multiple gene keywords and functions are combined and used to search PubMed for relevant paper titles and abstracts in the scientific literature. GPT-4 is then queried to evaluate each abstract, and supporting references are saved. b, Prompts used to query the GPT-4 model.
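The keyword-combination step in panel a can be sketched as the construction of a boolean PubMed title/abstract query. The helper name and field tags below are illustrative assumptions, not the authors' exact implementation:

```python
def build_pubmed_query(gene_symbols, function_keywords):
    """Combine gene symbol keywords and functional keywords into a single
    boolean PubMed query restricted to titles and abstracts.

    Hypothetical helper for illustration: genes are OR'd together, functions
    are OR'd together, and the two groups are AND'd.
    """
    genes = " OR ".join(f'"{g}"[Title/Abstract]' for g in gene_symbols)
    funcs = " OR ".join(f'"{k}"[Title/Abstract]' for k in function_keywords)
    return f"({genes}) AND ({funcs})"

# Example using genes discussed in the paper's CUL3 cluster analysis.
query = build_pubmed_query(["KLHL3", "CUL3"], ["ubiquitin ligase", "WNK signaling"])
print(query)
```

A query string of this shape could then be submitted to a PubMed search endpoint, with each returned abstract passed back to GPT-4 for relevance evaluation as the legend describes.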

Extended Data Fig. 2 Distribution of GO term gene sizes.

a, Distribution of term size (number of genes) for terms in the Biological Process branch (GO-BP). Terms with 3–100 genes shown (n = 8,910). b, Distribution of term size for the 1000 GO terms used in Task 1.

Extended Data Fig. 3 Evaluation of GPT-4 in recovery of GO-CC and GO-MF names.

a, Cumulative number of GO-CC term names recovered by GPT-4 (y-axis) at a given similarity percentile (x-axis). 0 = least similar, 100 = most similar. Blue curve: semantic similarities between GPT-4 names and assigned GO-CC term names. Grey dashed curve: semantic similarities between GPT-4 names and random GO-CC term names. The red dotted line marks that 642 of the 1000 sampled GO-CC names are recovered by GPT-4 at a similarity percentile of 95%. b, As for panel a, but for GO-MF terms rather than GO-CC. The red dotted line marks that 757 of the 1000 sampled GO-MF names are recovered by GPT-4 at a similarity percentile of 95%.
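A similarity percentile of the kind plotted here can be computed by ranking an observed name similarity against a null distribution of similarities to random GO term names. A minimal sketch, with a hypothetical helper and toy null values:

```python
from bisect import bisect_right

def similarity_percentile(observed, null_similarities):
    """Percentile of an observed similarity within a null distribution
    of similarities to randomly assigned GO term names.

    0 = least similar, 100 = most similar (matching the figure axes).
    Hypothetical helper for illustration.
    """
    ranked = sorted(null_similarities)
    return 100.0 * bisect_right(ranked, observed) / len(ranked)

# Toy null distribution standing in for similarities to random GO names.
null = [0.10, 0.20, 0.30, 0.40, 0.50]
print(similarity_percentile(0.45, null))  # 80.0: above 4 of the 5 null values
```

A proposed name counts as "recovered at the 95% percentile" when its similarity to the assigned GO name exceeds 95% of the null similarities.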

Extended Data Fig. 4 Supplemental analysis of the confidence score.

a, Distribution of confidence scores (n = 300) assigned by GPT-4, with confidence-level thresholds set based on the distribution pattern. “High confidence” (red): 0.87–1.00; “Medium confidence” (blue): 0.82–0.86; “Low confidence” (dark orange): 0.01–0.81; “Name not assigned” (gray): 0. b, Scatter plot of naming accuracy versus GPT-4 self-assessed confidence score for real gene sets drawn from GO (points, n = 100). Accuracy is estimated by the semantic similarity between the GPT-4 proposed name and the real GO term name. The best-fit regression line is shown in dark gray. The correlation coefficient (R) is determined by a two-sided Pearson’s correlation with p-value shown.
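The confidence bins quoted in panel a translate directly into a lookup. `confidence_level` is a hypothetical helper named here for illustration; the thresholds themselves are taken from the legend:

```python
def confidence_level(score):
    """Map a GPT-4 self-confidence score (0-1) to the bins described in
    Extended Data Fig. 4a: 0 -> not assigned, 0.01-0.81 -> low,
    0.82-0.86 -> medium, 0.87-1.00 -> high."""
    if score == 0:
        return "Name not assigned"
    if score >= 0.87:
        return "High confidence"
    if score >= 0.82:
        return "Medium confidence"
    return "Low confidence"

print(confidence_level(0.90))  # High confidence
print(confidence_level(0.84))  # Medium confidence
```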

Extended Data Fig. 5 Distribution of ‘omics gene set sizes.

Distribution shown for all ‘omics gene sets considered in this study (n = 300).

Extended Data Table 1 Engineered prompt for gene set analysis
Extended Data Table 2 Overview of five language models
Extended Data Table 3 Confidence assessment by GPT-4 versus human
Extended Data Table 4 Clusters named by LLM (GPT-4) versus enrichment (g:Profiler)
Extended Data Table 5 Engineered prompt for identifying genes supporting a proposed name

Supplementary information

Supplementary Information

Description for Supplementary Tables 1–4.

Reporting Summary

Supplementary Tables 1–4

Supplementary Table 1. Complete analysis of GO terms, 50/50 mix and random for all models (related to task 1: Figs. 2a and 3b and Table 1). Supplementary Table 2. Complete GPT-4 analysis of GO terms (related to task 1: Fig. 2). Supplementary Table 3. Complete GPT-4 analysis of omics gene sets (related to task 2: Fig. 4 and Extended Data Table 4). Supplementary Table 4. Reviewer fact-checking of GPT-4 analysis text and citation relevance (related to task 2: Fig. 5).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, M., Alkhairy, S., Lee, I. et al. Evaluation of large language models for discovery of gene set function. Nat Methods 22, 82–91 (2025). https://doi.org/10.1038/s41592-024-02525-x
