Background & Summary

Large language models (LLMs) have demonstrated remarkable success in several natural language processing tasks in the health and life sciences domain1. Due to parameter scaling, access to specialized corpora, and better human alignment techniques, performance has significantly improved in recent years2. Yet, they still struggle with factual accuracy in various domains3. LLMs may produce factual errors that contradict established knowledge available at the time4. These inaccuracies and errors are particularly concerning in critical fields like healthcare, where incorrect information can have severe consequences5.

To mitigate issues with factual accuracy and vulnerability to hallucinations, incorporating domain-specific knowledge when evaluating LLMs has been proposed6. This reflects the fact that factual accuracy7 and vulnerability to hallucinations8 in LLMs can vary significantly across domains9. General-purpose models tend to perform best in the general domain6, while models fine-tuned for specific domains, such as medicine (e.g., Meditron10, Med-PaLM11), often outperform general-purpose models in those areas.

Another critical challenge for LLMs is their ability to perform logical reasoning12. This is particularly important in clinical research, where scientific claims are posed as logical statements, such as ‘the intervention X is more effective than placebo for a specific outcome’13, that are either true or false. Evaluating these claims requires a strong understanding of hypothesis testing and causal inference14. However, because LLMs are trained to predict tokens within a context15, they struggle with complex logical statements16 and can even produce unfaithful reasoning17.

Research has shown that LLMs can be easily misled by irrelevant information18. Chain-of-thought (CoT) prompting can improve multi-step reasoning by providing intermediate rationales19, but concerns remain regarding the faithfulness and reliability of these explanations, which can be biased or misleading20. Furthermore, while methodologies such as self-correction can improve reasoning accuracy, current models still struggle to correct their errors autonomously without external feedback21, and in some cases their performance degrades after self-correction21. Integrating LLMs with symbolic solvers for logical reasoning22 and hypothesis-testing prompting for improved deductive reasoning23 have been proposed to address these limitations.

Claim verification datasets play a crucial role in assessing the factual accuracy of LLMs across various domains24. FEVER25, a general-domain dataset, was created by rewriting Wikipedia sentences into atomic claims, which are then verified against Wikipedia’s textual knowledge base. FEVER also introduces a three-step fact verification process: document retrieval, evidence selection, and stance detection. In the political domain, the UKP Snopes corpus26, derived from the Snopes fact-checking website, includes 6,422 validated claims paired with evidence text snippets. For the scientific domain, SciFact27 includes 1.4K expert-written biomedical claims paired with evidence-containing abstracts annotated with labels and rationales, while Climate-FEVER28 contains 1,535 claims sourced from web searches, with corresponding evidence from Wikipedia.

Specific to the health and life science domains, PUBHEALTH29 gathers public health claims from fact-checking websites and verifies them against news articles. ManConCorpus30 contains claims and sentences from 259 abstracts linked to 24 systematic reviews on cardiovascular disease. The COVID-19 pandemic and its infodemic effect7,31 have further motivated the development of specialized datasets. HealthVer32 is a medical-domain dataset derived by rewriting responses to questions from TREC-COVID33, verified against the CORD-19 corpus34. Similarly, COVID-Fact35 targets COVID-19 claims by scraping content from Reddit and verifying them against scientific papers and documents retrieved via Google search. CoVERt36 enhances claim verification in the clinical domain by providing a new COVID-19 verification dataset containing 15 PICO-encoded drug claims and 96 abstracts, each accompanied by one evidence sentence as rationale. These datasets either focus on lay claims29,32,35, which require simpler reasoning skills, or, when they address complex clinical research claims, are disease-specific, e.g., COVID-1936 or cardiovascular disease30, and of reduced scale (O(10¹) claims)30,36. They are therefore of limited use for evaluating how well LLMs assess the factuality of complex clinical research claims.

To reduce this gap, we propose CliniFact37, a large-scale claim dataset for evaluating the generalizability of LLMs in comprehending factuality and logical statements in clinical research. CliniFact37 claims were automatically extracted from clinical trial protocols and results available on ClinicalTrials.gov. The claims were linked to supporting information in scientific publications available in MEDLINE, with evidence provided at the abstract level. The resulting dataset contains O(10³) claims spanning 20 disease classes. This new benchmark offers a novel approach to evaluating LLMs in the health and life science domains, with specific challenges in understanding claims at the logical reasoning and hypothesis testing levels.

Methods

We utilized the ClinicalTrials.gov database as our primary data source, which comprises an extensive collection of registered clinical trials and their respective results. From each selected clinical trial, we systematically extracted key components to create the research claim, including the primary outcome measure, intervention, comparator, and type of statistical test. These components form the basis for generating one or more claims from each trial. We used the corresponding PubMed abstracts linked to these clinical trials as evidence to make judgments on the claims. In the following, we detail the dataset construction process.

Resources

The dataset uses information from two resources maintained by the U.S. National Library of Medicine: ClinicalTrials.gov (https://clinicaltrials.gov/) and PubMed (https://pubmed.ncbi.nlm.nih.gov/). ClinicalTrials.gov is a comprehensive online database that provides up-to-date information on clinical research studies and their results. These clinical trials serve as the most reliable medical evidence for evaluating the efficacy of single or multiple clinical interventions38. PubMed primarily includes the MEDLINE database of references and abstracts on life sciences and biomedical topics. The empirical evidence for clinical trial outcomes is often described in the results of clinical research studies published in medical journals indexed by PubMed39.

Data acquisition and pre-processing

On January 15th, 2024, we downloaded a total of 57,422 clinical trials from ClinicalTrials.gov (https://clinicaltrials.gov/search) that met the following criteria: (i) Study Status: Terminated or Completed; (ii) Study Type: Interventional; and (iii) Study Results: With results. The resulting download is a compressed ZIP archive containing one raw JSON file per study, named by its clinical trial identifier (NCT ID).
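The downloaded archive can be loaded as sketched below; this is a minimal illustration, and the archive layout and the protocolSection/identificationModule field path are assumptions about the ClinicalTrials.gov JSON export rather than a verified schema.

```python
import json
import zipfile

def load_studies(zip_path: str) -> dict:
    """Load the raw study records from the downloaded ZIP archive, keyed by NCT ID."""
    studies = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            with archive.open(name) as fh:
                record = json.load(fh)
            # Field path is an assumption about the export format; fall back to the file name.
            nct_id = (record.get("protocolSection", {})
                            .get("identificationModule", {})
                            .get("nctId", name.removesuffix(".json")))
            studies[nct_id] = record
    return studies
```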

Dataset construction

In the following, we formally describe the claim generation process. An overview of the pipeline is illustrated in Fig. 1, and an example of the extracted fields is illustrated in Table 1.

Fig. 1
figure 1

Overview of the CliniFact dataset construction pipeline with three major modules: claim extraction, claim-evidence pairing, and labeling.

Table 1 Example of the extracted fields in CliniFact.

Claim extraction

Let \(D\) represent the filtered ClinicalTrials.gov database we downloaded, with each clinical trial represented as \(ct^{\prime} \in D\). From this set, we extracted the intervention, outcome measure, and comparator information for the subset of clinical trials \(ct\), limiting the selection to trials with bi-arm groups (see Primary outcome-arm group pairs for claim generation) that report p-value results for the primary outcome measures. For the \(ct\) set, we then extracted the primary outcome measures \(O_{n}(ct)\), with their corresponding intervention \(I_{n}(ct)\), comparator \(P_{n}(ct)\), and type of statistical test \(T_{n}(ct)\). These components are used to construct one or more claims \(C_{n}(ct)\) per clinical trial, where \(n\) can vary between 0 and \(N\). We represent the generation of the claim \(C_{n}(ct)\) as a function \(f\), such that:

$$C_{n}(ct)=f\left(O_{n}(ct),I_{n}(ct),P_{n}(ct),T_{n}(ct)\right).$$
(1)

Claim-evidence pairing

For each \(ct \in D\), there may be an associated scientific abstract \(E(ct)\) from PubMed, reported by the authors of the study in ClinicalTrials.gov. We link \(E(ct)\) to each corresponding claim \(C_{n}(ct)\) to create a claim-evidence pair \((C_{n}(ct), E(ct))\). In this context, \(E(ct)\) represents the evidence from the PubMed abstract used to make judgments on the claim \(C_{n}(ct)\). An abstract \(E(ct)\) can have one of two statuses depending on the type of information it provides: (i) background, if it provides background information for the clinical trial, and (ii) result, if it describes results of the clinical trial. This information is provided by the study authors and is used further on to label the claim-evidence pairs \((C_{n}(ct), E(ct))\).

Claim-evidence labeling

The labeling process for the claim-evidence pairs involves two key steps, as illustrated in Algorithm S1. First, for each claim \(C_{n}(ct)\), we assign a positive or negative label \(L_{1}(C_{n}(ct))\) based on the p-value reported for its respective primary outcome measure. A positive label is assigned if the p-value indicates a statistically significant result, and a negative label if it does not. Following conventional statistical thresholds, we considered p-value < 0.05 as statistically significant. Second, we consider the nature of the link between the clinical trial and the scientific abstract. If a clinical trial \(ct\) is linked to a scientific abstract \(E(ct)\) that reports results for the trial, we further filter these instances to include only those where exactly one abstract is linked to the clinical trial. For these cases, the label of the claim-evidence pair \(L_{2}(C_{n}(ct), E(ct))\) is defined as “evidence” if the claim \(C_{n}(ct)\) is positive, and “inconclusive” if it is negative. Conversely, if the scientific abstract \(E(ct)\) linked to the clinical trial provides background information, we include all the linked abstracts and define the label of the claim-evidence pair \(L_{2}(C_{n}(ct), E(ct))\) as “not enough information” (NEI).
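A minimal sketch of this two-step labeling, assuming the per-claim p-values and the publication link types (result or background) have already been extracted from the study records; function and field names are illustrative and not the released implementation (Algorithm S1).

```python
P_THRESHOLD = 0.05

def efficacy_label(p_values: list[float]) -> str:
    """L1: positive if any reported p-value is significant, negative otherwise."""
    return "positive" if any(p < P_THRESHOLD for p in p_values) else "negative"

def pair_labels(claim_label: str, links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """L2: labels for claim-evidence pairs; links are (pmid, reference_type) tuples."""
    results = [pmid for pmid, ref_type in links if ref_type == "result"]
    background = [pmid for pmid, ref_type in links if ref_type == "background"]
    pairs = []
    if len(results) == 1:  # keep only trials linked to exactly one results abstract
        label = "evidence" if claim_label == "positive" else "inconclusive"
        pairs.append((results[0], label))
    # all background abstracts yield NEI pairs
    pairs += [(pmid, "not enough information") for pmid in background]
    return pairs
```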

The test labels were further manually revised by two groups of reviewers, with an inter-annotator agreement of Cohen’s kappa (κ) = 0.85. In instances where reviewers disagreed (9%), a senior reviewer provided the final judgment. The manual revisions and the original labels aligned in 80% of the instances. The manually refined labels replace the original test labels in the final dataset.

Primary outcome-arm group pairs for claim generation

To extract primary outcome measures and arm group information from the clinical trial database \(D\), we focused exclusively on clinical trials \(ct \in D\) that included bi-arm groups of types Experimental and Comparator. In a clinical trial, an arm refers to a group of participants that receives a particular intervention, treatment, or no intervention, according to the trial’s protocol (https://clinicaltrials.gov/study-basics/glossary). The arm type represents the role of each arm in the clinical trial. For generating the clinical research claim, we used the term intervention to represent the Experimental arm group and comparator to represent the Comparator arm group, with mappings provided in Supplementary Table S1. The intervention and comparator terminologies are grounded in the PICOT framework, i.e., population, intervention, comparator, outcome, and time40. We excluded Population information when formulating the claim, as it is part of the extrinsic criteria and is not typically available from PubMed abstracts. Since the abstracts serve as grounding evidence for the claims in our dataset, including Population information would be out of the scope of our analysis.

In the study design section, arm groups are labeled as Experimental and Comparator, but these labels are not present in the results section, and arm group titles also vary between the two sections. Thus, to label the arm groups in the results section as intervention or comparator, we followed the approach proposed by Shi et al.41, mapping each arm group title in the results section to the most similar title in the study design section by the cosine similarity of their BioBERT42 embeddings.
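The title-matching step can be sketched as follows with the Hugging Face transformers library; the exact BioBERT checkpoint and the use of the [CLS] vector as the sentence embedding are assumptions for illustration, as the original approach may pool embeddings differently.

```python
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

@torch.no_grad()
def embed(titles: list[str]) -> torch.Tensor:
    """Encode arm group titles and return their [CLS] embeddings."""
    batch = tokenizer(titles, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]

def map_arm_titles(result_titles: list[str], design_titles: list[str]) -> dict[str, str]:
    """Map each results-section title to the most similar design-section title."""
    sims = torch.nn.functional.cosine_similarity(
        embed(result_titles)[:, None], embed(design_titles)[None, :], dim=-1
    )
    best = sims.argmax(dim=1)
    return {title: design_titles[i] for title, i in zip(result_titles, best.tolist())}
```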

Efficacy label

We extracted the efficacy label \(L_{1}(C_{n}(ct))\) for the primary outcome \(O_{n}(ct)\) from the measure analysis in the outcome measure information module. The efficacy label represents both the statistical significance of the analysis and the clinical effectiveness of the intervention. Each primary outcome \(O_{n}(ct)\) paired with arm groups \(I_{n}(ct)\) and \(P_{n}(ct)\) may have one or multiple associated analyses, some of which include p-values representing the statistical significance of the results. We extracted only the analyses with p-values and compiled the p-values into a list for each primary outcome-arm group pair. We assigned a positive efficacy label to an outcome-group pair if any p-value in its associated list was smaller than 0.05, indicating statistical significance, and a negative efficacy label if all p-values were equal to or greater than 0.05, indicating statistical non-significance. We extracted 8,679 primary outcome-arm group pairs, of which 4,179 were labeled as positive and 4,500 as negative, covering 5,751 unique clinical trials.

Type of statistical test

Each primary outcome-arm group pair could be associated with a type of statistical test. We categorized these types according to the ones outlined in the study data structure of ClinicalTrials.gov43: Superiority, Noninferiority, Equivalence, Noninferiority or Equivalence, and Superiority or Other. A Superiority test evaluates whether a new treatment is better than another (e.g., standard treatment or placebo) by rejecting the null hypothesis of no difference44. A Noninferiority test shows that the new treatment is not significantly worse than the existing one, within a predefined margin44. An Equivalence test demonstrates that two treatments are statistically equivalent, with differences falling within a clinically insignificant margin44. Noninferiority or Equivalence tests first establish noninferiority and then assess equivalence45. Lastly, Superiority or Other tests may also evaluate noninferiority or equivalence if superiority is not shown46.

Clinical research claim

A scientific claim is a verifiable statement. The claim should be atomic (about one aspect of the statement), decontextualized (understandable without additional context), and check-worthy (its veracity can be confirmed)47. In natural language, a clinical research claim can be expressed as a scientific claim in the format ‘<Intervention> is <Type of Statistical Test> to <Comparator> in terms of <Outcome>.’ For example, for study NCT00234065 shown in Table 1, the outcome-group pair can be reframed as the following claim \(C_{n}(ct)\): Cilostazol is Non-Inferior or Equivalent to Aspirin in terms of Numbers of Patients With First Occurrence of Stroke From start of treatment to end of follow-up period (follow-up periods: 29 months [Standard Deviation 16, range 1–59 months]). The Intervention and Comparator terms are sourced from the Arm/Group Title, while the Outcome is the combination of the Outcome Measure Title and Outcome Measure Time Frame available in the clinical trial protocol (Table 2). In the absence of naturally occurring scientific claims in clinical trials, we used the PICOT framework to closely mimic naturally occurring scientific claims based on hypothesis testing results in clinical research.

Table 2 Fields extracted for clinical research claim generation.
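A minimal sketch of rendering a claim from the fields in Table 2 using this template; the mapping from statistical test-type values to the phrases used in the claim is an assumption, since the exact enumeration strings depend on the ClinicalTrials.gov export format.

```python
# Illustrative mapping from assumed test-type codes to claim phrases.
TEST_PHRASES = {
    "SUPERIORITY": "Superior",
    "NON_INFERIORITY": "Non-Inferior",
    "EQUIVALENCE": "Equivalent",
    "NON_INFERIORITY_OR_EQUIVALENCE": "Non-Inferior or Equivalent",
    "SUPERIORITY_OR_OTHER": "Superior or Other",
}

def render_claim(intervention: str, comparator: str, test_type: str,
                 outcome_title: str, time_frame: str) -> str:
    """Render '<Intervention> is <Type of Statistical Test> to <Comparator> in terms of <Outcome>.'"""
    outcome = f"{outcome_title} {time_frame}".strip()
    return (f"{intervention} is {TEST_PHRASES.get(test_type, test_type)} "
            f"to {comparator} in terms of {outcome}.")

# Example with NCT00234065-style fields:
# render_claim("Cilostazol", "Aspirin", "NON_INFERIORITY_OR_EQUIVALENCE",
#              "Numbers of Patients With First Occurrence of Stroke",
#              "From start of treatment to end of follow-up period")
```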

Clinical trial-publication linkage

Publications corresponding to clinical trials were identified by their PubMed Identifiers (PMIDs) provided in the Publications section of the clinical trial results and categorized by reference type. The types are assigned by the domain experts who register trials on ClinicalTrials.gov under standardized reporting guidelines. As shown in Table 3, we created a CSV file detailing the clinical trial-publication relationships by extracting the NCT ID, PMID, and reference type from the filtered ClinicalTrials.gov database \(ct \in D\). A total of 1,550 clinical trial-publication links were used in the balanced dataset, including 868 background links and 682 results links. The precision of the link-based labeling is 95%, as manually validated on the test split.

Table 3 Fields used to create clinical trial-publication pairs.
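A sketch of exporting the Table 3 fields to CSV; the JSON paths under referencesModule are assumptions about the study record layout, and `load_studies` refers to the loading sketch shown earlier.

```python
import csv

def iter_references(record: dict):
    """Yield (pmid, reference_type) from a study record; field names are assumed."""
    refs = (record.get("protocolSection", {})
                  .get("referencesModule", {})
                  .get("references", []))
    for ref in refs:
        if "pmid" in ref:
            yield ref["pmid"], ref.get("type", "")

def write_links(studies: dict, out_path: str) -> None:
    """Write one row per clinical trial-publication link."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["nct_id", "pmid", "reference_type"])
        for nct_id, record in studies.items():
            for pmid, ref_type in iter_references(record):
                writer.writerow([nct_id, pmid, ref_type])
```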

Claim-evidence pairs generation

Using the extracted relationships between clinical trials and scientific publications, we linked claims to their corresponding publications to generate claim-evidence pairs. Each clinical trial may correspond to one or multiple publications categorized as either background or results. For results publications, we focused on clinical trials linked to a single publication. In these cases, if a claim \(C_{n}(ct)\) had a positive label, we labeled the claim-evidence pair as evidence; conversely, if the claim had a negative label, we labeled the claim-evidence pair as inconclusive. For background publications, we labeled the claim-evidence pair as NEI, regardless of whether the clinical trial was linked to one or multiple background publications. Using the PubMedFetcher object in the Metapub Python library (https://metapub.org/overview/), we downloaded the titles and abstracts of publications based on their PubMed unique identifiers (PMIDs). We excluded samples with incomplete abstracts of fewer than 15 words. The statistics on the number of words for the extracted primary outcomes and publications are given in Table 4.

Table 4 Number of words for the text fields.
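Abstract retrieval and the 15-word completeness filter can be sketched with Metapub as follows; this is a simplified illustration, with error handling, rate limiting, and caching omitted.

```python
from metapub import PubMedFetcher

fetcher = PubMedFetcher()

def fetch_abstracts(pmids: list[str], min_words: int = 15) -> dict[str, dict]:
    """Download titles and abstracts, dropping incomplete abstracts (< min_words words)."""
    records = {}
    for pmid in pmids:
        article = fetcher.article_by_pmid(pmid)
        abstract = article.abstract or ""
        if len(abstract.split()) < min_words:
            continue  # exclude incomplete abstracts
        records[pmid] = {"title": article.title, "abstract": abstract}
    return records
```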

We extracted the date of each abstract using a direct HTTP request to the PubMed API (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml). The year, month, and day were extracted when available. Figure S1 shows the number of publications per year, giving an overview of when the grounding evidence was published.
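A sketch of extracting the publication date from the efetch XML response; the path to the PubDate element follows the typical PubMed record structure and would need a fallback for records that only provide a MedlineDate.

```python
import requests
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def publication_date(pmid: str) -> dict:
    """Return the year/month/day fields of a PubMed record when available."""
    resp = requests.get(EFETCH,
                        params={"db": "pubmed", "id": pmid, "retmode": "xml"},
                        timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    pub_date = root.find(".//Article/Journal/JournalIssue/PubDate")
    if pub_date is None:
        return {}
    # Children are typically Year, Month, Day; only present fields are returned.
    return {child.tag.lower(): child.text for child in pub_date}
```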

The generated dataset included 570 evidence, 415 inconclusive, and 8,196 not enough information (NEI) claim-evidence pairs. We further balanced the class distribution by downsampling the NEI class to 985 samples, matching the total number of evidence and inconclusive samples, to create the final dataset.
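Balancing by downsampling can be sketched with pandas as below; the column name and the random seed are illustrative assumptions.

```python
import pandas as pd

def balance(pairs: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Downsample NEI pairs to match the combined evidence + inconclusive count."""
    n_target = (pairs[label_col] != "NEI").sum()   # 570 + 415 = 985 in CliniFact
    nei = pairs[pairs[label_col] == "NEI"].sample(n=n_target, random_state=seed)
    rest = pairs[pairs[label_col] != "NEI"]
    return (pd.concat([rest, nei])
              .sample(frac=1, random_state=seed)   # shuffle
              .reset_index(drop=True))
```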

Data Records

The CliniFact dataset is available at Figshare37. It contains three CSV files – train.csv, validation.csv, and test.csv – each including 21 columns that capture key information about the claim and the corresponding PubMed abstracts. We present the variables, their corresponding column headings and descriptions in Table S2.

The dataset contains 1,970 primary claim-evidence pairs of 992 unique clinical trials and 1,540 unique publications. We show examples of evidence, inconclusive and NEI paired clinical research claims and abstracts in Table 5. The distribution of labels in the train, validation, and test splits is provided in Table 6.

Table 5 Examples of evidence, inconclusive and not enough information (NEI) paired clinical research claims and abstracts.
Table 6 Distribution of labels in train, validation, and test splits.

In Fig. 2, we show the clinical research claims stratified by the studied condition. Using the MeSH annotations provided by ClinicalTrials.gov, clinical research claims were associated with disease classes using the MeSH tree code (3 digits). In total, 20 disease categories (out of the 27 categories available in MeSH) are included in our dataset. It is important to note that a single clinical trial may be mapped to multiple disease classes according to the MeSH terminology. Therefore, when reporting the number of clinical research claims per disease, the total sample count may exceed the number of claims.

Fig. 2
figure 2

Clinical research claims stratified by study condition.

Technical Validation

Given the claim \(C_{n}(ct)\) and the abstract \(E(ct)\), we investigated the performance of several discriminative and generative LLMs in predicting the label \(L_{2}(C_{n}(ct), E(ct))\) for the claim-evidence pair. We treated it as a multiclass classification problem, where the output indicates whether the abstract states that there is evidence for the claim, that it is inconclusive, or that the abstract does not provide information for the claim (NEI). For the discriminative LLMs, we concatenated a claim \(C_{n}(ct)\) and its corresponding abstract \(E(ct)\) with the special tokens [CLS] and [SEP] to form the input sequence [CLS, \(C_{n}(ct)\), SEP, \(E(ct)\)] and fed this input to the LLM. The [CLS] token at the beginning of the input provides a summary embedding for classification, while the [SEP] token separates the two sequences so the model can identify the boundary between them. The model takes a sequence of tokens with a maximum length of 512 and produces a 768-dimensional sequence representation vector. For inputs shorter than 512 tokens, we added padding (empty tokens) at the end of the text to reach this length; for inputs longer than 512 tokens, we truncated the abstract \(E(ct)\) from the beginning to make the input sequence fit into the 512 tokens. The truncation algorithm is provided in Supplementary Figure S2. For the generative LLMs, we concatenated a claim \(C_{n}(ct)\) and its corresponding abstract \(E(ct)\) with the prompt shown in Table 7. In the zero-shot approach, we computed the probability of generating the token TRUE, FALSE, or NONE, and the token with the highest probability was taken as the response. We fine-tuned the generative LLMs on the training split, evaluated their performance on the validation split, and selected the model with the lowest cross-entropy loss.
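A simplified stand-in for the discriminative pair encoding, using the Hugging Face tokenizer: the built-in "only_second" strategy truncates the abstract so that [CLS] claim [SEP] abstract fits into 512 tokens. The checkpoint name is illustrative, and the released truncation algorithm is the one in Supplementary Figure S2.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def encode_pair(claim: str, abstract: str):
    """Encode (claim, abstract) as a single padded/truncated 512-token sequence."""
    return tokenizer(
        claim,
        abstract,
        max_length=512,
        padding="max_length",      # pad shorter inputs to 512 tokens
        truncation="only_second",  # truncate only the abstract segment
        return_tensors="pt",
    )
```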

Table 7 Prompt for generative language models.
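A sketch of the zero-shot scoring with a causal language model: the prompt built from Table 7 is scored, and the label whose verbalizer token (TRUE, FALSE, or NONE) has the highest next-token probability is returned. The checkpoint name and the single-token treatment of the verbalizers are assumptions; multi-token verbalizers would require summing log-probabilities.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16,
                                             device_map="auto")

# Verbalizer token -> class label mapping.
LABELS = {"TRUE": "evidence", "FALSE": "inconclusive", "NONE": "NEI"}

@torch.no_grad()
def zero_shot_label(prompt: str) -> str:
    """Return the label whose verbalizer has the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    scores = {}
    for verbalizer, label in LABELS.items():
        # First sub-token approximation; comparing logits is equivalent to comparing probabilities.
        token_id = tokenizer.encode(" " + verbalizer, add_special_tokens=False)[0]
        scores[label] = next_token_logits[token_id].item()
    return max(scores, key=scores.get)
```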

We show the results of discriminative and generative LLMs on the test split in Table 8. The fine-tuned discriminative LLMs outperformed both zero-shot and fine-tuned generative LLMs in clinical research claim assessment. Specifically, BioBERT achieved the highest accuracy of 80.2% and an F1-macro score of 74.7%, showing improved effectiveness in processing biomedical text (p-value < 0.001, McNemar-Bowker test), likely due to its domain-specific training. Other discriminative models, such as PubMedBERT and RoBERTa, also performed well, with 77.9% and 75.4% accuracy, respectively. In contrast, zero-shot generative LLMs exhibited significantly lower performance, with OpenBioLLM-8B achieving the highest at 43.4% accuracy and an F1-macro of 30.6%, indicating limited capability in assessing biomedical claims without task-specific fine-tuning. Upon fine-tuning, generative LLMs showed significant improvements; for instance, Llama3-70B’s accuracy increased from 37.3% to 53.6%, and its F1-macro score from 25.2% to 38.1% (p-value < 0.001, McNemar-Bowker test). Similarly, OpenBioLLM-70B improved from 33.2% to 51.0% accuracy after fine-tuning (p-value < 0.001, McNemar-Bowker test). Nevertheless, generative LLMs remained suboptimal compared to discriminative LLMs, despite having a much higher number of parameters.

Table 8 Results of discriminative and generative language models on the test split.

We illustrate a detailed comparison of precision, recall, and F1-macro between the top discriminative and generative LLMs, BioBERT and OpenBioLLM-70B, across the classes evidence, inconclusive, and NEI in Fig. 3(a). BioBERT demonstrated superior performance across all classes. For the NEI class, it achieved a precision of 92.9% and a recall of 81.1%, indicating that it effectively determined whether relevant information was present in the abstracts. For the evidence class, BioBERT reached a precision of 72.8% and a recall of 87.6%, enabling it to accurately distinguish evidence from inconclusive grounding. OpenBioLLM-70B exhibited lower precision (37.6%) but higher recall (88.5%) in identifying the evidence class; conversely, it showed higher precision but lower recall for the inconclusive and NEI classes. These results suggest that the fine-tuned OpenBioLLM-70B tends to over-predict the evidence label.

Fig. 3
figure 3

Performance and stratified analysis of the top discriminative and generative models.

Figure 3(b) shows the analysis of the top discriminative and generative LLMs across samples with different statistical test types. BioBERT outperformed OpenBioLLM-70B in the Superiority, Superiority or Other, and Equivalence classes, achieving its highest performance in the Superiority class (Accuracy: 81.5%, F1-macro: 75.6%). OpenBioLLM-70B showed higher performance in the Non-inferiority or Equivalence and Non-inferiority classes, achieving its highest performance in the Non-inferiority class (Accuracy: 87.5%, F1-macro: 87.3%).

Figure 3(c) compares the performance of the top discriminative and generative LLMs across the 20 disease types, arranged in descending order of sample size. BioBERT achieved its highest F1-macro scores (100%) on Hemic and Lymphatic Diseases and Eye Diseases, and its lowest (45.0%) on Otorhinolaryngologic Diseases. OpenBioLLM-70B achieved its highest F1-macro score (77.8%) on Eye Diseases, and its lowest (17.8%) on Chemically Induced Disorders. Some diseases are clearly more challenging to classify than others, and we do not observe that this difficulty correlates with the number of available samples.

Usage Notes

CliniFact37 provides a benchmark for evaluating the accuracy of large language models (LLMs) in verifying scientific claims specific to clinical research. Researchers can utilize the dataset to develop and fine-tune models aimed at improving natural language understanding, logical reasoning, and misinformation48 detection in healthcare. Additionally, the dataset facilitates the comparison of performance across various types of LLMs.