Fig. 2: The biological process names generated by GeneAgent are more similar to their ground truth than those produced by GPT-4 using the prompts proposed by Hu et al.16.
From: GeneAgent: self-verification language agent for gene-set analysis using domain databases

a, The ROUGE scores of GeneAgent and GPT-4 were evaluated across three datasets: 1,000 gene sets from GO, 50 from NeST and 56 from MSigDB. The s.d. for each bar was calculated using nine-fold cross-validation based on batch size (bs) sampling, with bs = 200 for GO and bs = 20 for both NeST and MSigDB. The ROUGE score for each sampled batch is presented in the figure. The central value of the error bars represents the mean score across all samples. The results are presented as the mean ± s.d.

b, Distribution of similarity scores obtained by GeneAgent and GPT-4 across the three datasets. The total number of gene sets used for the statistics is 1,000 (GO), 50 (NeST) and 56 (MSigDB). The middle points represent the mean values; bounds of the inner boxes of each violin plot represent the upper and lower percentiles; and whiskers represent the minimum and maximum points within all data samples. The statistically significant P value is 3.1 × 10⁻⁵ for the 1,106 evaluated gene sets, calculated by a one-tailed t-test with 95% confidence intervals. The results are reported as the mean ± s.d., calculated from all similarity scores obtained by GeneAgent or GPT-4.

c, The percentile distribution of semantic similarity between generated names and their ground truths, assessed across all candidate background terms. This background set comprises 12,320 terms, including 12,214 GO biological process terms used by Hu et al.16 and all available annotated terms in NeST (50) and MSigDB (56). The plot illustrates the distribution of gene sets within the top 90th percentile. The caption values represent the number of gene sets for GeneAgent and GPT-4 that fall within the top 98th percentile (that is, the shaded regions in the figure).

d, The accuracy of tested terms that exactly match the significantly enriched terms obtained by GSEA. Each value on the bar is calculated as the proportion of exactly matched terms among all terms tested by GeneAgent or SPINDOCTOR.
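The percentile ranking described for panel c can be illustrated with a minimal sketch. It assumes one plausible reading of the caption: for each gene set, the similarity between the generated name and its ground-truth term is ranked against the similarities between the same generated name and every background term. The function names, the toy scores and the tie-handling rule (counting scores less than or equal to the ground-truth score) are assumptions made for illustration and are not taken from the GeneAgent code.

```python
from typing import Sequence


def percentile_of_truth(truth_score: float, background_scores: Sequence[float]) -> float:
    """Percentage of background terms whose similarity to the generated name
    is no higher than the similarity of the ground-truth term.

    Hypothetical helper: the tie rule (<=) is an assumption.
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    at_or_below = sum(1 for s in background_scores if s <= truth_score)
    return 100.0 * at_or_below / len(background_scores)


def count_at_or_above(percentiles: Sequence[float], threshold: float = 98.0) -> int:
    """Number of gene sets whose percentile reaches the threshold."""
    return sum(1 for p in percentiles if p >= threshold)


# Toy background of 10 similarity scores (the figure uses 12,320 terms).
background = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

# Toy ground-truth similarities for three hypothetical gene sets.
truth_scores = [0.95, 0.65, 0.99]

percentiles = [percentile_of_truth(t, background) for t in truth_scores]
top = count_at_or_above(percentiles)

print(percentiles)  # [100.0, 60.0, 100.0]
print(top)          # 2
```

In this toy run, two of the three gene sets reach the 98th percentile, which corresponds to the kind of count reported as the caption values for each method. In practice the background scores would be recomputed for every generated name, since each name has its own similarity to each background term.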