Fig. 2: Literature search experiment results.
From: Accelerating clinical evidence synthesis with large language models

a The distribution of systematic literature reviews and clinical studies in the evaluation dataset, categorized by the primary type of intervention investigated. b The TrialMind interface for literature search allows users to refine search terms efficiently. TrialMind suggests an initial set of treatment and condition terms, which users can modify by adding, editing, or removing terms. A sampling button lets users explore additional terms generated by TrialMind. Once the term set is finalized, users click the “Search Studies” button to construct the search query and retrieve relevant studies from the literature. c The recall of search results for reviews across four topics. The left y-axis shows the average recall achieved by each method; the right y-axis shows the number of identified studies. TrialMind captures 70–80% of the ground-truth studies, whereas the other methods miss most of them. d Scatter plots of recall against the number of ground-truth studies, under the assumption that reviews with more ground-truth studies are harder to cover with a single search query. Each point represents one review. Regression estimates are shown with 95% CIs in blue or purple. TrialMind consistently outperforms the other methods and remains robust as the search task grows more complex. e Example cases comparing the outputs of the three methods. Manually crafted terms are usually precise but not diverse enough to cover all variants appearing in the literature. A vanilla GPT-4 approach can introduce terms that are too broad or irrelevant to the research objective.
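
The workflow in panel b (combining finalized treatment and condition terms into a boolean search query) and the evaluation metric in panel c (recall against the review's ground-truth studies) can be illustrated with a minimal sketch. The OR/AND query structure, function names, and example terms below are illustrative assumptions, not TrialMind's actual implementation.

```python
# Minimal sketch (not TrialMind's code): build a boolean PubMed-style query
# from treatment/condition term sets and compute recall of retrieved studies
# against a ground-truth set. Terms and study IDs are made-up placeholders.

def build_query(treatment_terms, condition_terms):
    """OR-join terms within each category, AND-join the two categories."""
    treatment_clause = " OR ".join(f'"{t}"' for t in treatment_terms)
    condition_clause = " OR ".join(f'"{c}"' for c in condition_terms)
    return f"({treatment_clause}) AND ({condition_clause})"


def recall(retrieved_ids, ground_truth_ids):
    """Fraction of ground-truth studies captured by the search results."""
    ground_truth_ids = set(ground_truth_ids)
    return len(ground_truth_ids & set(retrieved_ids)) / len(ground_truth_ids)


if __name__ == "__main__":
    # Hypothetical term sets a user might finalize in the interface.
    query = build_query(
        ["pembrolizumab", "anti-PD-1 antibody"],
        ["non-small cell lung cancer", "NSCLC"],
    )
    print(query)
    # Placeholder study identifiers: 3 of 4 ground-truth studies retrieved.
    print(recall(["s1", "s2", "s3", "s9"], ["s1", "s2", "s3", "s4"]))  # 0.75
```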