Background & Summary

Large language models (LLMs) have demonstrated remarkable success in several natural language processing tasks in the health and life sciences domain1. Due to parameter scaling, access to specialized corpora, and better human alignment techniques, performance has significantly improved in recent years2. Yet, they still struggle with factual accuracy in various domains3. LLMs may produce factual errors that contradict established knowledge available at the time4. These inaccuracies and errors are particularly concerning in critical fields like healthcare, where incorrect information can have severe consequences5.

To mitigate issues with factual accuracy and vulnerability to hallucinations, incorporating domain-specific knowledge when evaluating LLMs has been proposed6. This reflects the fact that factual accuracy7 and vulnerability to hallucinations8 in LLMs can vary significantly across domains9. General-purpose models tend to perform best in the general domain6, while models fine-tuned for specific domains, such as medicine (e.g., Meditron10, Med-PaLM11), often outperform general-purpose models in those areas.

Another critical challenge for LLMs is their ability to perform logical reasoning12. This is particularly important in clinical research, where scientific claims are posed as logical statements, such as ‘the intervention X is more effective than placebo for a specific outcome’13, that are either true or false. Evaluating these claims requires a strong understanding of hypothesis testing and causal inference14. However, because LLMs are trained to predict tokens within a context15, they struggle with complex logical statements16 and can even produce unfaithful reasoning17.

Research has shown that LLMs can be easily misled by irrelevant information18. Chain-of-thought (CoT) prompting can improve multi-step reasoning by providing intermediate rationales19, but concerns remain regarding the faithfulness and reliability of these explanations, which can be biased or misleading20. Furthermore, while methodologies such as self-correction can improve reasoning accuracy, current models still struggle to correct their errors autonomously without external feedback21, and in some cases their performance degrades after self-correction21. Integrating LLMs with symbolic solvers for logical reasoning22 and hypothesis-testing prompting for improved deductive reasoning23 have been proposed to address these limitations.

Claim verification datasets play a crucial role in assessing the factual accuracy of LLMs across various domains24. FEVER25, a general-domain dataset, was created by rewriting Wikipedia sentences into atomic claims, which are then verified against Wikipedia’s textual knowledge base. FEVER also introduces a three-step fact verification process: document retrieval, evidence selection, and stance detection. In the political domain, the UKP Snopes corpus26, derived from the Snopes fact-checking website, includes 6,422 validated claims paired with evidence text snippets. For the scientific domain, SciFact27 includes 1.4K expert-written biomedical claims paired with evidence-containing abstracts annotated with labels and rationales, while Climate-FEVER28 contains 1,535 claims sourced from web searches, with corresponding evidence from Wikipedia.

Specific to the health and life science domains, PUBHEALTH29 gathers public health claims from fact-checking websites and verifies them against news articles. ManConCorpus30 contains claims and sentences from 259 abstracts linked to 24 systematic reviews on cardiovascular disease. The COVID-19 pandemic and its infodemic effect7,31 have further motivated the development of specialized datasets. HealthVer32 is a medical-domain dataset derived by rewriting responses to questions from TREC-COVID33, verified against the CORD-19 corpus34. Similarly, COVID-Fact35 targets COVID-19 claims by scraping content from Reddit and verifying them against scientific papers and documents retrieved via Google search. CoVERt36 enhances claim verification in the clinical domain by providing a new COVID-19 verification dataset containing 15 PICO-encoded drug claims and 96 abstracts, each accompanied by one evidence sentence as rationale. These datasets either focus on lay claims29,32,35, which require simpler reasoning skills, or, when they address complex clinical research claims, are disease-specific, e.g., COVID-1936 or cardiovascular disease30, and of reduced scale (O(10¹) claims)30,36. They are therefore of limited use for evaluating how well LLMs assess the factuality of complex clinical research claims.

To reduce this gap, we propose CliniFact37, a large-scale claim dataset for evaluating the generalizability of LLMs in comprehending factuality and logical statements in clinical research. CliniFact37 claims were automatically extracted from clinical trial protocols and results available on ClinicalTrials.gov. The claims were linked to supporting information in scientific publications available in MEDLINE, with evidence provided at the abstract level. The resulting dataset contains O(10³) claims spanning 20 disease classes. This new benchmark offers a novel approach to evaluating LLMs in the health and life science domains, with specific challenges in understanding claims at the logical reasoning and hypothesis testing levels.

Methods

We utilized the ClinicalTrials.gov database as our primary data source, which comprises an extensive collection of registered clinical trials and their respective results. From each selected clinical trial, we systematically extracted key components to create the research claim, including the primary outcome measure, intervention, comparator, and type of statistical test. These components form the basis for generating one or more claims from each trial. We used the corresponding PubMed abstracts linked to these clinical trials as evidence to make judgments on the claims. In the following, we detail the dataset construction process.

Resources

The dataset uses information from two resources maintained by the U.S. National Library of Medicine: ClinicalTrials.gov (https://clinicaltrials.gov/) and PubMed (https://pubmed.ncbi.nlm.nih.gov/). ClinicalTrials.gov is a comprehensive online database that provides up-to-date information on clinical research studies and their results. These clinical trials serve as the most reliable medical evidence for evaluating the efficacy of single or multiple clinical interventions38. PubMed primarily includes the MEDLINE database of references and abstracts on life sciences and biomedical topics. The empirical evidence for clinical trial outcomes is often described in the results of clinical research studies published in medical journals indexed by PubMed39.

Data acquisition and pre-processing

On January 15th, 2024, we downloaded a total of 57,422 clinical trials from ClinicalTrials.gov (https://clinicaltrials.gov/search) that met the following criteria: (i) Study Status: Terminated or Completed; (ii) Study Type: Interventional; and (iii) Study Results: With results. The resulting download is a compressed ZIP archive containing one raw JSON file per study, named by its clinical trial identifier (NCT ID).
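The downloaded archive can be loaded as sketched below; this is a minimal illustration, and the archive layout and the protocolSection/identificationModule field path are assumptions about the ClinicalTrials.gov JSON export rather than a verified schema.

```python
import json
import zipfile

def load_studies(zip_path: str) -> dict:
    """Load the raw study records from the downloaded ZIP archive, keyed by NCT ID."""
    studies = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            with archive.open(name) as fh:
                record = json.load(fh)
            # Field path is an assumption about the export format; fall back to the file name.
            nct_id = (record.get("protocolSection", {})
                            .get("identificationModule", {})
                            .get("nctId", name.removesuffix(".json")))
            studies[nct_id] = record
    return studies
```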

Dataset construction

In the following, we formally describe the claim generation process. An overview of the pipeline is illustrated in Fig. 1, and an example of the extracted fields is illustrated in Table 1.

Fig. 1
figure 1

Overview of the CliniFact dataset construction pipeline with three major modules: claim extraction, claim-evidence pairing, and labeling.

Table 1 Example of the extracted fields in CliniFact.

Claim extraction

Let \(D\) represent the filtered ClinicalTrials.gov database we downloaded, with each clinical trial represented as \(ct^{\prime} \in D\). From this set, we extracted the intervention, outcome measure, and comparator information for the subset of clinical trials \(ct\), limiting the selection to trials with bi-arm groups (see Primary outcome-arm group pairs for claim generation) that report p-value results for the primary outcome measures. For the \(ct\) set, we then extracted the primary outcome measures \(O_{n}(ct)\), with their corresponding intervention \(I_{n}(ct)\), comparator \(P_{n}(ct)\), and type of statistical test \(T_{n}(ct)\). These components are used to construct one or more claims \(C_{n}(ct)\) per clinical trial, where \(n\) can vary between 0 and \(N\). We represent the generation of the claim \(C_{n}(ct)\) as a function \(f\), such that:

$$C_{n}(ct)=f\left(O_{n}(ct),I_{n}(ct),P_{n}(ct),T_{n}(ct)\right).$$
(1)

Claim-evidence pairing

For each \(ct \in D\), there may be an associated scientific abstract \(E(ct)\) from PubMed, reported by the authors of the study in ClinicalTrials.gov. We link \(E(ct)\) to each corresponding claim \(C_{n}(ct)\) to create a claim-evidence pair \((C_{n}(ct), E(ct))\). In this context, \(E(ct)\) represents the evidence from the PubMed abstract used to make judgments on the claim \(C_{n}(ct)\). An abstract \(E(ct)\) can have one of two statuses depending on the type of information it provides: (i) background, if it provides background information for the clinical trial, and (ii) result, if it describes results of the clinical trial. This information is provided by the study authors and is used further on to label the claim-evidence pairs \((C_{n}(ct), E(ct))\).

Claim-evidence labeling

The labeling process for the claim-evidence pairs involves two key steps, as illustrated in Algorithm S1. First, for each claim \(C_{n}(ct)\), we assign a positive or negative label \(L_{1}(C_{n}(ct))\) based on the p-value reported for its respective primary outcome measure. A positive label is assigned if the p-value indicates a statistically significant result, and a negative label if it does not. Following conventional statistical thresholds, we considered p-value < 0.05 as statistically significant. Second, we consider the nature of the link between the clinical trial and the scientific abstract. If a clinical trial \(ct\) is linked to a scientific abstract \(E(ct)\) that reports results for the trial, we further filter these instances to include only those where exactly one abstract is linked to the clinical trial. For these cases, the label of the claim-evidence pair \(L_{2}(C_{n}(ct), E(ct))\) is defined as “evidence” if the claim \(C_{n}(ct)\) is positive, and “inconclusive” if it is negative. Conversely, if the scientific abstract \(E(ct)\) linked to the clinical trial provides background information, we include all the linked abstracts and define the label of the claim-evidence pair \(L_{2}(C_{n}(ct), E(ct))\) as “not enough information” (NEI).
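A minimal sketch of this two-step labeling, assuming the per-claim p-values and the publication link types (result or background) have already been extracted from the study records; function and field names are illustrative and not the released implementation (Algorithm S1).

```python
P_THRESHOLD = 0.05

def efficacy_label(p_values: list[float]) -> str:
    """L1: positive if any reported p-value is significant, negative otherwise."""
    return "positive" if any(p < P_THRESHOLD for p in p_values) else "negative"

def pair_labels(claim_label: str, links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """L2: labels for claim-evidence pairs; links are (pmid, reference_type) tuples."""
    results = [pmid for pmid, ref_type in links if ref_type == "result"]
    background = [pmid for pmid, ref_type in links if ref_type == "background"]
    pairs = []
    if len(results) == 1:  # keep only trials linked to exactly one results abstract
        label = "evidence" if claim_label == "positive" else "inconclusive"
        pairs.append((results[0], label))
    # all background abstracts yield NEI pairs
    pairs += [(pmid, "not enough information") for pmid in background]
    return pairs
```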

The test labels were further manually revised by two groups of reviewers, with an inter-annotator agreement of Cohen’s kappa (κ) = 0.85. In instances where reviewers disagreed (9%), a senior reviewer provided the final judgment. The manual revisions and the original labels aligned in 80% of the instances. The manually refined labels replace the original test labels in the final dataset.

Primary outcome-arm group pairs for claim generation

To extract primary outcome measures and arm group information from the clinical trial database \(D\), we focused exclusively on clinical trials \(ct \in D\) that included bi-arm groups of types Experimental and Comparator. In a clinical trial, an arm refers to a group of participants that receives a particular intervention, treatment, or no intervention, according to the trial’s protocol (https://clinicaltrials.gov/study-basics/glossary). The arm type represents the role of each arm in the clinical trial. For generating the clinical research claim, we used the term intervention to represent the Experimental arm group and comparator to represent the Comparator arm group, with mappings provided in Supplementary Table S1. The intervention and comparator terminologies are grounded in the PICOT framework, i.e., population, intervention, comparator, outcome, and time40. We excluded Population information when formulating the claim, as it is part of the extrinsic criteria and is not typically available from PubMed abstracts. Since the abstracts serve as grounding evidence for the claims in our dataset, including Population information would be out of the scope of our analysis.

In the study design section, arm groups are labeled as Experimental and Comparator, but these labels are not present in the results section, and arm group titles also vary between the two sections. Thus, to label the arm groups in the results section as intervention or comparator, we followed the approach proposed by Shi et al.41, mapping each arm group title in the results section to the most similar title in the study design section by the cosine similarity of their BioBERT42 embeddings.
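The title-matching step can be sketched as follows with the Hugging Face transformers library; the exact BioBERT checkpoint and the use of the [CLS] vector as the sentence embedding are assumptions for illustration, as the original approach may pool embeddings differently.

```python
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

@torch.no_grad()
def embed(titles: list[str]) -> torch.Tensor:
    """Encode arm group titles and return their [CLS] embeddings."""
    batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]

def map_arm_titles(result_titles: list[str], design_titles: list[str]) -> dict[str, str]:
    """Map each results-section title to the most similar design-section title."""
    sims = torch.nn.functional.cosine_similarity(
        embed(result_titles)[:, None], embed(design_titles)[None, :], dim=-1
    )
    best = sims.argmax(dim=1)
    return {title: design_titles[i] for title, i in zip(result_titles, best.tolist())}
```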

Efficacy label

We extracted the efficacy label \(L_{1}(C_{n}(ct))\) for the primary outcome \(O_{n}(ct)\) from the measure analysis in the outcome measure information module. The efficacy label represents both the statistical significance of the analysis and the clinical effectiveness of the intervention. Each primary outcome \(O_{n}(ct)\) paired with arm groups \(I_{n}(ct)\) and \(P_{n}(ct)\) may have one or multiple associated analyses, some of which include p-values representing the statistical significance of the results. We extracted only the analyses with p-values and compiled the p-values into a list for each primary outcome-arm group pair. We assigned a positive efficacy label to an outcome-group pair if any p-value in its associated list was smaller than 0.05, indicating statistical significance, and a negative efficacy label if all p-values were equal to or greater than 0.05, indicating statistical non-significance. We extracted 8,679 primary outcome-arm group pairs, of which 4,179 were labeled as positive and 4,500 as negative, covering 5,751 unique clinical trials.

Type of statistical test

Each primary outcome-arm group pair could be associated with a type of statistical test. We categorized these types according to the ones outlined in the study data structure of ClinicalTrials.gov43: Superiority, Noninferiority, Equivalence, Noninferiority or Equivalence, and Superiority or Other. A Superiority test evaluates whether a new treatment is better than another (e.g., standard treatment or placebo) by rejecting the null hypothesis of no difference44. A Noninferiority test shows that the new treatment is not significantly worse than the existing one, within a predefined margin44. An Equivalence test demonstrates that two treatments are statistically equivalent, with differences falling within a clinically insignificant margin44. Noninferiority or Equivalence tests first establish noninferiority and then assess equivalence45. Lastly, Superiority or Other tests may also evaluate noninferiority or equivalence if superiority is not shown46.

Clinical research claim

A scientific claim is a verifiable statement. The claim should be atomic (about one aspect of the statement), decontextualized (understandable without additional context), and check-worthy (its veracity can be confirmed)47. In natural language, a clinical research claim can be expressed as a scientific claim in the format ‘<Intervention> is <Type of Statistical Test> to <Comparator> in terms of <Outcome>.’ For example, for study NCT00234065 shown in Table 1, the outcome-group pair can be reframed as the following claim \(C_{n}(ct)\): Cilostazol is Non-Inferior or Equivalent to Aspirin in terms of Numbers of Patients With First Occurrence of Stroke From start of treatment to end of follow-up period (follow-up periods: 29 months [Standard Deviation 16, range 1–59 months]). The Intervention and Comparator terms are sourced from the Arm/Group Title, while the Outcome is the combination of the Outcome Measure Title and Outcome Measure Time Frame available in the clinical trial protocol (Table 2). In the absence of naturally occurring scientific claims in clinical trials, we used the PICOT framework to closely mimic naturally occurring scientific claims based on hypothesis testing results in clinical research.

Table 2 Fields extracted for clinical research claim generation.
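A minimal sketch of rendering a claim from the fields in Table 2 using this template; the mapping from statistical test-type values to the phrases used in the claim is an assumption, since the exact enumeration strings depend on the ClinicalTrials.gov export format.

```python
# Illustrative mapping from assumed test-type codes to claim phrases.
TEST_PHRASES = {
    "SUPERIORITY": "Superior",
    "NON_INFERIORITY": "Non-Inferior",
    "EQUIVALENCE": "Equivalent",
    "NON_INFERIORITY_OR_EQUIVALENCE": "Non-Inferior or Equivalent",
    "SUPERIORITY_OR_OTHER": "Superior or Other",
}

def render_claim(intervention: str, comparator: str, test_type: str,
                 outcome_title: str, time_frame: str) -> str:
    """Render '<Intervention> is <Type of Statistical Test> to <Comparator> in terms of <Outcome>.'"""
    outcome = f"{outcome_title} {time_frame}".strip()
    return (f"{intervention} is {TEST_PHRASES.get(test_type, test_type)} "
            f"to {comparator} in terms of {outcome}.")

# Example with NCT00234065-style fields:
# render_claim("Cilostazol", "Aspirin", "NON_INFERIORITY_OR_EQUIVALENCE",
#              "Numbers of Patients With First Occurrence of Stroke",
#              "From start of treatment to end of follow-up period")
```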

Clinical trial-publication linkage

Publications corresponding to clinical trials were identified by their PubMed Identifiers (PMIDs) provided in the Publications section of the clinical trial results and categorized by reference type. The types are assigned by the domain experts who register trials on ClinicalTrials.gov under standardized reporting guidelines. As shown in Table 3, we created a CSV file detailing the clinical trial-publication relationships by extracting the NCT ID, PMID, and reference type from the filtered ClinicalTrials.gov database \(ct \in D\). A total of 1,550 clinical trial-publication links were used in the balanced dataset, including 868 background links and 682 results links. The precision of the link-based labeling is 95%, as manually validated on the test split.

Table 3 Fields used to create clinical trial-publication pairs.
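A sketch of exporting the Table 3 fields to CSV; the JSON paths under referencesModule are assumptions about the study record layout, and `load_studies` refers to the loading sketch shown earlier.

```python
import csv

def iter_references(record: dict):
    """Yield (pmid, reference_type) from a study record; field names are assumed."""
    refs = (record.get("protocolSection", {})
                  .get("referencesModule", {})
                  .get("references", []))
    for ref in refs:
        if "pmid" in ref:
            yield ref["pmid"], ref.get("type", "")

def write_links(studies: dict, out_path: str) -> None:
    """Write one row per clinical trial-publication link."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["nct_id", "pmid", "reference_type"])
        for nct_id, record in studies.items():
            for pmid, ref_type in iter_references(record):
                writer.writerow([nct_id, pmid, ref_type])
```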

Claim-evidence pairs generation

Using the extracted relationships between clinical trials and scientific publications, we linked claims to their corresponding publications to generate claim-evidence pairs. Each clinical trial may correspond to one or multiple publications categorized as either background or results. For results publications, we focused on clinical trials linked to a single publication. In these cases, if a claim \(C_{n}(ct)\) had a positive label, we labeled the claim-evidence pair as evidence; conversely, if the claim had a negative label, we labeled the claim-evidence pair as inconclusive. For background publications, we labeled the claim-evidence pair as NEI, regardless of whether the clinical trial was linked to one or multiple background publications. Using the PubMedFetcher object in the Metapub Python library (https://metapub.org/overview/), we downloaded the titles and abstracts of publications based on their PubMed unique identifiers (PMIDs). We excluded samples with incomplete abstracts of fewer than 15 words. The statistics on the number of words for the extracted primary outcomes and publications are given in Table 4.

Table 4 Number of words for the text fields.
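Abstract retrieval and the 15-word completeness filter can be sketched with Metapub as follows; this is a simplified illustration, with error handling, rate limiting, and caching omitted.

```python
from metapub import PubMedFetcher

fetcher = PubMedFetcher()

def fetch_abstracts(pmids: list[str], min_words: int = 15) -> dict[str, dict]:
    """Download titles and abstracts, dropping incomplete abstracts (< min_words words)."""
    records = {}
    for pmid in pmids:
        article = fetcher.article_by_pmid(pmid)
        abstract = article.abstract or ""
        if len(abstract.split()) < min_words:
            continue  # exclude incomplete abstracts
        records[pmid] = {"title": article.title, "abstract": abstract}
    return records
```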

We extracted the date of each abstract using a direct HTTP request to the PubMed API (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml). The year, month, and day were extracted when available. Figure S1 shows the number of publications per year, giving an overview of when the grounding evidence was published.
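A sketch of extracting the publication date from the efetch XML response; the path to the PubDate element follows the typical PubMed record structure and would need a fallback for records that only provide a MedlineDate.

```python
import requests
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def publication_date(pmid: str) -> dict:
    """Return the year/month/day fields of a PubMed record when available."""
    resp = requests.get(EFETCH,
                        params={"db": "pubmed", "id": pmid, "retmode": "xml"},
                        timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    pub_date = root.find(".//Article/Journal/JournalIssue/PubDate")
    if pub_date is None:
        return {}
    # Children are typically Year, Month, Day; only present fields are returned.
    return {child.tag.lower(): child.text for child in pub_date}
```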

The generated dataset included 570 evidence, 415 inconclusive, and 8,196 not enough information (NEI) claim-evidence pairs. We further balanced the class distribution by downsampling the NEI class to 985 samples, matching the total number of evidence and inconclusive samples, to create the final dataset.
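Balancing by downsampling can be sketched with pandas as below; the column name and the random seed are illustrative assumptions.

```python
import pandas as pd

def balance(pairs: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Downsample NEI pairs to match the combined evidence + inconclusive count."""
    n_target = (pairs[label_col] != "NEI").sum()   # 570 + 415 = 985 in CliniFact
    nei = pairs[pairs[label_col] == "NEI"].sample(n=n_target, random_state=seed)
    rest = pairs[pairs[label_col] != "NEI"]
    return (pd.concat([rest, nei])
              .sample(frac=1, random_state=seed)   # shuffle
              .reset_index(drop=True))
```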

Data Records

The CliniFact dataset is available at Figshare37. It contains three CSV files – train.csv, validation.csv, and test.csv – each including 21 columns that capture key information about the claim and the corresponding PubMed abstracts. We present the variables, their corresponding column headings and descriptions in Table S2.

The dataset contains 1,970 primary claim-evidence pairs of 992 unique clinical trials and 1,540 unique publications. We show examples of evidence, inconclusive and NEI paired clinical research claims and abstracts in Table 5. The distribution of labels in the train, validation, and test splits is provided in Table 6.

Table 5 Examples of evidence, inconclusive and not enough information (NEI) paired clinical research claims and abstracts.
Table 6 Distribution of labels in train, validation, and test splits.

In Fig. 2, we show the clinical research claims stratified by the studied condition. Using the MeSH annotations provided by ClinicalTrials.gov, clinical research claims were associated with disease classes using the MeSH tree code (3 digits). In total, 20 disease categories (out of the 27 categories available in MeSH) are included in our dataset. It is important to note that a single clinical trial may be mapped to multiple disease classes according to the MeSH terminology. Therefore, when reporting the number of clinical research claims per disease, the total sample count may exceed the number of claims.

Fig. 2
figure 2

Clinical research claims stratified by study condition.

Technical Validation

Given the claim \(C_{n}(ct)\) and the abstract \(E(ct)\), we investigated the performance of several discriminative and generative LLMs in predicting the label \(L_{2}(C_{n}(ct), E(ct))\) for the claim-evidence pair. We treated it as a multiclass classification problem, where the output indicates whether the abstract states that there is evidence for the claim, that it is inconclusive, or that the abstract does not provide information for the claim (NEI). For the discriminative LLMs, we concatenated a claim \(C_{n}(ct)\) and its corresponding abstract \(E(ct)\) with the special tokens [CLS] and [SEP] to form the input sequence [CLS, \(C_{n}(ct)\), SEP, \(E(ct)\)] and fed this input to the LLM. The [CLS] token at the beginning of the input provides a summary embedding for classification, while the [SEP] token separates the two sequences so the model can identify the boundary between them. The model takes a sequence of tokens with a maximum length of 512 and produces a 768-dimensional sequence representation vector. For inputs shorter than 512 tokens, we added padding (empty tokens) at the end of the text to reach this length; for inputs longer than 512 tokens, we truncated the abstract \(E(ct)\) from the beginning to make the input sequence fit into the 512 tokens. The truncation algorithm is provided in Supplementary Figure S2. For the generative LLMs, we concatenated a claim \(C_{n}(ct)\) and its corresponding abstract \(E(ct)\) with the prompt shown in Table 7. In the zero-shot approach, we computed the probability of generating the token TRUE, FALSE, or NONE, and the token with the highest probability was taken as the response. We fine-tuned the generative LLMs on the training split, evaluated their performance on the validation split, and selected the model with the lowest cross-entropy loss.
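A simplified stand-in for the discriminative pair encoding, using the Hugging Face tokenizer: the built-in "only_second" strategy truncates the abstract so that [CLS] claim [SEP] abstract fits into 512 tokens. The checkpoint name is illustrative, and the released truncation algorithm is the one in Supplementary Figure S2.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def encode_pair(claim: str, abstract: str):
    """Encode (claim, abstract) as a single padded/truncated 512-token sequence."""
    return tokenizer(
        claim,
        abstract,
        max_length=512,
        padding="max_length",      # pad shorter inputs to 512 tokens
        truncation="only_second",  # truncate only the abstract segment
        return_tensors="pt",
    )
```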

Table 7 Prompt for generative language models.
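A sketch of the zero-shot scoring with a causal language model: the prompt built from Table 7 is scored, and the label whose verbalizer token (TRUE, FALSE, or NONE) has the highest next-token probability is returned. The checkpoint name and the single-token treatment of the verbalizers are assumptions; multi-token verbalizers would require summing log-probabilities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16,
                                             device_map="auto")

# Verbalizer token -> class label mapping.
LABELS = {"TRUE": "evidence", "FALSE": "inconclusive", "NONE": "NEI"}

@torch.no_grad()
def zero_shot_label(prompt: str) -> str:
    """Return the label whose verbalizer has the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    scores = {}
    for verbalizer, label in LABELS.items():
        # First sub-token approximation; comparing logits is equivalent to comparing probabilities.
        token_id = tokenizer.encode(" " + verbalizer, add_special_tokens=False)[0]
        scores[label] = next_token_logits[token_id].item()
    return max(scores, key=scores.get)
```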

We show the results of discriminative and generative LLMs on the test split in Table 8. The fine-tuned discriminative LLMs outperformed both zero-shot and fine-tuned generative LLMs in clinical research claim assessment. Specifically, BioBERT achieved the highest accuracy of 80.2% and an F1-macro score of 74.7%, showing improved effectiveness in processing biomedical text (p-value < 0.001, McNemar-Bowker test), likely due to its domain-specific training. Other discriminative models, such as PubMedBERT and RoBERTa, also performed well, with 77.9% and 75.4% accuracy, respectively. In contrast, zero-shot generative LLMs exhibited significantly lower performance, with OpenBioLLM-8B achieving the highest at 43.4% accuracy and an F1-macro of 30.6%, indicating limited capability in assessing biomedical claims without task-specific fine-tuning. Upon fine-tuning, generative LLMs showed significant improvements; for instance, Llama3-70B’s accuracy increased from 37.3% to 53.6%, and its F1-macro score from 25.2% to 38.1% (p-value < 0.001, McNemar-Bowker test). Similarly, OpenBioLLM-70B improved from 33.2% to 51.0% accuracy after fine-tuning (p-value < 0.001, McNemar-Bowker test). Nevertheless, generative LLMs remained suboptimal compared to discriminative LLMs, despite having a much higher number of parameters.

Table 8 Results of discriminative and generative language models on the test split.

We illustrate a detailed comparison of precision, recall, and F1-macro between the top discriminative and generative LLMs, BioBERT and OpenBioLLM-70B, across the classes evidence, inconclusive, and NEI in Fig. 3(a). BioBERT demonstrated superior performance across all classes. For the NEI class, it achieved a precision of 92.9% and a recall of 81.1%, indicating that it effectively determined whether relevant information was present in the abstracts. For the evidence class, BioBERT reached a precision of 72.8% and a recall of 87.6%, enabling it to accurately distinguish evidence from inconclusive grounding. OpenBioLLM-70B exhibited lower precision (37.6%) but higher recall (88.5%) in identifying the evidence class; conversely, it showed higher precision but lower recall for the inconclusive and NEI classes. These results suggest that the fine-tuned OpenBioLLM-70B tends to over-predict the evidence label.

Fig. 3
figure 3

Performance and stratified analysis of the top discriminative and generative models.

Figure 3(b) shows the analysis of the top discriminative and generative LLMs across samples with different statistical test types. BioBERT outperformed OpenBioLLM-70B in the Superiority, Superiority or Other, and Equivalence classes, achieving its highest performance in the Superiority class (Accuracy: 81.5%, F1-macro: 75.6%). OpenBioLLM-70B showed higher performance in the Non-inferiority or Equivalence and Non-inferiority classes, achieving its highest performance in the Non-inferiority class (Accuracy: 87.5%, F1-macro: 87.3%).

Figure 3(c) compares the performance of the top discriminative and generative LLMs across the 20 disease types, arranged in descending order of sample size. BioBERT achieved its highest F1-macro scores (100%) on Hemic and Lymphatic Diseases and Eye Diseases, and its lowest (45.0%) on Otorhinolaryngologic Diseases. OpenBioLLM-70B achieved its highest F1-macro score (77.8%) on Eye Diseases, and its lowest (17.8%) on Chemically Induced Disorders. Some diseases are clearly more challenging to classify than others, and we do not observe that this difficulty correlates with the number of available samples.

Usage Notes

CliniFact37 provides a benchmark for evaluating the accuracy of large language models (LLMs) in verifying scientific claims specific to clinical research. Researchers can utilize the dataset to develop and fine-tune models aimed at improving natural language understanding, logical reasoning, and misinformation48 detection in healthcare. Additionally, the dataset facilitates the comparison of performance across various types of LLMs.