Abstract
As qualitative research increasingly informs patient-centred care, rapid assessment of existing evidence against research reporting guidelines is needed to inform practice settings. We evaluate the performance of Claude, a generative AI model, in assessing qualitative articles' adherence to a consensus-based reporting guideline. The Consolidated Criteria for Reporting Qualitative Research (COREQ), commonly used in qualitative research, served as the reference criteria list for testing the performance of Claude. 15 articles from a systematic scoping review were extracted for analysis. Structured prompts were applied to Claude to evaluate whether each COREQ criterion was met for each article. Two independent reviewers checked model results for concordance and accuracy. F1 scores, balanced accuracy (BA), the Matthews correlation coefficient (MCC) and other performance metrics were tabulated at the criterion, criterion domain, and article level. 4 main categories were identified from performance results, namely: (1) balanced (6/32 criteria, 18.75%), (2) under-reported (2/32, 6.25%), (3) mixed errors (9/32, 28.13%), and (4) information limited (15/32, 46.88%) clusters. Results show heterogeneity across the different clusters of criteria. While balanced criteria perform consistently across a range of metrics, criteria in under- or over-reported clusters require targeted prompt adjustments, and information limited criteria require a larger sample of articles to verify results. Clearly defined criteria outperformed criteria that were broadly defined or required interpretation. Segmenting criteria into performance clusters allows researchers to identify areas of incongruence, so that specific strategies to modify prompts may be applied to any given set of research articles. Expertly crafted, customised approaches can allow for the rapid extraction of valuable insights to inform patient-centred recommendations and practice guidelines.
Introduction
Assessing the adherence of qualitative research studies to consensus-based guidelines is a crucial step in determining the reliability and robustness of health and medical research. Checklists and guidelines allow readers to validate whether study components have been adequately fulfilled and to assess whether a study has been systematically conducted. For qualitative research, the 32-item Consolidated Criteria for Reporting Qualitative Research (COREQ) (supplementary information file 3, table A1)1 and the Standards for Reporting Qualitative Research (SRQR) checklist2 are two of the most frequently used guidelines for determining whether qualitative articles are adequately reported. Given that patient-centric and scalable solutions are increasingly needed in dynamic, fast-paced healthcare settings, the adoption of insights from qualitative research can provide diverse perspectives and pragmatic nuance in the deployment of innovative technologies or treatments.
Recent developments in the use of large language models (LLMs) have shown potential in helping researchers improve the efficiency of qualitative analysis3. Bijker et al.4 applied ChatGPT to 539 user-generated forum messages related to sugar consumption to generate qualitative codes aimed at identifying behavioural change mechanisms. Findings indicate that inductively generated codes showed higher agreement with expert coders than deductive codes that followed a pre-determined framework (k = 0.69–0.84 for inductive coding; k = 0.52–0.73 for unconstrained deductive coding; k = 0.66 for structured deductive coding).
Prescott et al.5 tested the ability of ChatGPT and Google’s Bard to perform thematic analysis on 40 short SMS messages from a digital intervention program to promote medication adherence amongst methamphetamine users with HIV. Although intercoder reliability between the LLM tools and human coders ranged from fair to moderate for ChatGPT (ICR = 47% and 37% for inductive and deductive analysis) and Bard (ICR = 37% and 36%), thematic agreement was good (71% and 50% for ChatGPT; 71% and 58% for Bard, for inductive and deductive analysis respectively). More notably, ChatGPT took only 15 and 25 min to conduct inductive and deductive analysis, and Bard took 20 min for each, compared to 492 and 705 min for humans, representing time savings of 97% for both types of analysis for ChatGPT, and 96% and 97% for Bard.
Despite their wide-ranging potential and demonstrable proficiency in analysing text, current use of generative AI tools is largely confined to individual studies, where models are applied to an assemblage of data for qualitative coding to generate insights. Less explored are analyses conducted at the meta-analytic level, where a model is tasked with checking the adherence of multiple research studies to consensus-based or objective checklists, which remains a laborious, time-consuming and manual process in health and clinical research. By evaluating a model’s performance against a commonly used objective framework, this study aims to contribute to emerging research on AI-assisted evidence synthesis6,7,8.
Study aims
The aim of this study is to evaluate the performance of Claude 3.5 Sonnet (released June 2024)9 in assessing qualitative health research articles’ adherence to a consensus-based, objective set of qualitative reporting guidelines. A 2-step iterative zero-shot approach was used to evaluate 15 articles extracted from a previously conducted scoping review. Output was validated for accuracy and summarised in a confusion matrix at the article and criterion level by 1 reviewer, followed by a second reviewer on 5 articles to check for inter-rater reliability. Additional robustness checks were performed. Model performance was primarily evaluated using F1 scores, balanced accuracy (BA) and the Matthews correlation coefficient (MCC), derived from accuracy, precision, recall (sensitivity) and specificity scores. A number of additional supporting performance metrics were tabulated to holistically evaluate performance. Results are reported at the criterion level, at the criterion domain level as defined in COREQ, and at the article level. Quantitative error analysis was conducted to understand which criteria tend to be falsely evaluated by Claude.
We evaluate articles extracted from a scoping review rather than conducting a separate systematic search tailored specifically for model evaluation, so as to better understand the basic performance of an LLM when used as a checklist tool embedded within a larger scoping review.
The full list of articles evaluated can be found in supplementary information file 3.
Methods
Overview
The overall sequence of this study is as follows:
1. 15 qualitative articles were extracted from a scoping review conducted separately in a previous study, to evaluate the performance of Claude in assessing whether the extracted articles meet the objective list of reporting criteria defined in COREQ.

Initial prompt testing and adjustments using Claude 3 Opus:

2. The list of COREQ criteria as described in Tong et al. (2007) was first uploaded to Claude.
3. Instructions were given to create a 32 by 4 table, consisting of:
   - a sequential number, in ascending order;
   - the list of criteria, as described in the COREQ guideline;
   - a ‘Yes/No’ column to record the presence/absence of each criterion; and
   - a column justifying each ‘Yes/No’ response.
4. Qualitative full-text articles were uploaded to Claude individually for evaluation, with clear and specific prompt instructions provided for Claude to evaluate each article. [Study supplements were not uploaded to Claude]
5. Prompt adjustments were made based on the initial output and errors generated (see Figures 2 and 3 for full prompt wordings and adjustments made).

Application of prompts to Claude 3.5 Sonnet:

6. Steps (2) and (3) were repeated with the updated prompts.
7. Qualitative full-text articles were re-uploaded individually to Claude for evaluation on a separate message thread. [Article supplementary files/appendices (if any) were not uploaded]
8. Standardised prompts were applied to all articles individually.

Evaluation of results:

9. Output generated for all articles was checked and evaluated by a reviewer.
10. A second reviewer independently evaluated 5 randomly selected articles.
11. Results were classified as true positive/negative or false positive/negative by both reviewers and entered into separate, individual confusion tables, summarised at the criterion/article level.
12. The 2 reviewers convened to compare evaluations and discuss discordant criteria.
13. The final confusion table after consensus agreement was tabulated using a range of performance metrics.

An overview of study procedures is illustrated in Figure 1 below.
Data sources
15 qualitative articles were retrieved from a scoping review conducted previously on patient-physician communication of health and risk information pertaining to cardiovascular diseases and diabetes10. A comprehensive database search was conducted for articles published between 1st Jan 2000 and 3rd October 2023. Of 8378 articles screened, 88 articles were reviewed, of which 30 were included, comprising 15 qualitative, 14 quantitative and 1 mixed-methods studies. The PRISMA flow diagram, search terms and table of key characteristics of included studies can be found in additional files 1 to 3 of the referenced scoping review article.
Ethical considerations
This study is exempt from ethics approval, as only journal articles are included as data points for analysis. No human/patient identifiers or information on research subjects were collected.
Guiding prompts and model output
2-step prompt sequence
First, the original COREQ article with description of each criterion was uploaded into Claude as contextual information, followed by prompt instructions to create a 32 by 4 blank table. Claude is then instructed to populate the first and second columns of each row with a sequential number and to include the name of each criterion respectively. In the third column, a ‘Yes/No’ option is included to indicate whether a criterion is mentioned in an article. If the third column is indicated as ‘Yes’ in any given article, the model is tasked to provide evidence from the article justifying the presence of each criterion in the fourth column. If the third column is ‘No’, then ‘N.A.’ should be indicated by the model in the fourth column. The first prompt sequence is to be applied only once within each dialogue thread.
Each article is then uploaded to Claude individually for assessment. Prompts for each article include additional instructions to clarify the scope of extraction, including a request that textual evidence in column 4 be extracted and stated verbatim rather than paraphrased, to ensure that any potential ‘hallucinations’ generated can be verified in a transparent way. Hallucination refers to an LLM presenting wrong information as if it were correct and true, a risk LLMs occasionally carry11,12.
The 2nd prompt was used each time a new qualitative article was uploaded for assessment. The 2-prompt sequence was first tested on Claude 3 Opus, a precursor to Claude 3.5 Sonnet; Claude 3 Opus was used because it was the most advanced Claude model available during the initial test runs. Claude 3.5 Sonnet superseded Claude 3 Opus shortly after the initial testing of prompts was completed.
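To make the workflow concrete, the following minimal sketch (in Python) outlines how the 2-step prompt sequence could be applied programmatically. The prompt wordings shown are paraphrased placeholders rather than the exact prompts reported in Figures 2 and 3, and `send_to_claude` is a hypothetical helper standing in for the chat interface used in this study.

```python
# Illustrative sketch of the 2-step prompt sequence (not the study's actual prompts or code).

SETUP_PROMPT = (
    "Using the attached COREQ guideline, create a 32 by 4 table. "
    "Column 1: a sequential number. Column 2: the name of each COREQ criterion. "
    "Column 3: 'Yes' or 'No' indicating whether the criterion is mentioned in the article. "
    "Column 4: if 'Yes', quote verbatim evidence from the article; if 'No', state 'N.A.'."
)

ARTICLE_PROMPT = (
    "Evaluate the attached full-text article against the table defined earlier. "
    "Quote supporting text verbatim rather than paraphrasing."
)

def send_to_claude(prompt: str, attachment: str = "") -> str:
    """Hypothetical placeholder for one Claude chat turn with an optional attachment."""
    raise NotImplementedError

def evaluate_articles(coreq_text: str, article_texts: list[str]) -> list[str]:
    # Step 1: upload the COREQ guideline and set up the blank 32 x 4 table (applied once).
    send_to_claude(SETUP_PROMPT, attachment=coreq_text)
    # Step 2: upload each article individually and apply the standardised article prompt.
    return [send_to_claude(ARTICLE_PROMPT, attachment=text) for text in article_texts]
```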
Prompt adjustments
After applying the 2-step prompt procedure to Claude 3 Opus on a few qualitative articles and generating initial output, adjustments were made to the 2nd prompt due to persistent errors among a few criteria. For criteria 2 and 3 (‘credentials’ and ‘occupation’), a prompt qualifier was added to emphasize that credentials or occupation is not the same as the address of an institution, which Claude 3 Opus often conflated with them. For criterion 4 (‘gender’), a prompt was added to infer gender from a person’s name. Although such inferences are not necessarily correct, as names can be gender-neutral, we wanted to test the inference ability of the model; Claude 3 Opus tended to state ‘no’ initially even when gender was mentioned.
Once prompts were finalised, the 2-step prompt sequence was applied to all 15 articles individually in Claude 3.5 Sonnet for evaluation. The first prompt was applied once, while the second prompt was applied repeatedly for each additional article uploaded. Prompts were applied with no change in wording so that differences in output could not be attributed to arbitrary prompt changes. Probing of model results through additional prompts was also avoided to ensure that successive outputs within the same dialogue thread were not affected.
The preliminary and final prompts applied are shown in Figures 2 and 3.
Human evaluation and truth-value classification
After all articles were assessed, a reviewer (AC) manually checked the accuracy of the output generated by Claude and assigned each result to one of 4 truth-value categories:
- ‘True or False positive’ (TP or FP) – Claude correctly/wrongly classifies a criterion as mentioned in an article.
- ‘True or False negative’ (TN or FN) – Claude correctly/wrongly classifies a criterion as not mentioned in an article.
A description of truth-values and how each assessed result is assigned is described below in Table 1.
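The assignment logic can be expressed as a small sketch (illustrative only): column 3 of Claude’s table provides the model’s ‘Yes/No’ judgment, while the reviewer’s reading of the article provides the ground truth.

```python
def truth_value(model_says_yes: bool, reviewer_says_yes: bool) -> str:
    """Map one model judgment and the reviewer's ground truth to TP/FP/TN/FN
    (sketch of the scheme summarised in Table 1, not the study's actual code)."""
    if model_says_yes and reviewer_says_yes:
        return "TP"   # criterion reported, model says 'Yes'
    if model_says_yes and not reviewer_says_yes:
        return "FP"   # criterion not reported, model says 'Yes'
    if not model_says_yes and not reviewer_says_yes:
        return "TN"   # criterion not reported, model says 'No'
    return "FN"       # criterion reported, model says 'No'
```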
Inter-rater reliability and robustness checks
A second, independent reviewer (WT) evaluated 5 randomly selected articles to check for concordance/discordance in results and inter-rater reliability. Statistical tests were performed to ensure the robustness of results. Pre-consensus raw agreement between reviewers was high at 0.944 (Wilson’s CI 0.90–0.97)13. To confirm that results were not due to chance or to each reviewer’s positive or negative inclination, Cohen’s k was tabulated, yielding a relatively high value of 0.880 (CI 0.80–0.96)14. Confidence intervals were obtained by bootstrapping articles using 2,000 resamples15. To adjust for sensitivity to class imbalance and asymmetric category use between reviewers, the prevalence- and bias-adjusted k (PABAK) and Gwet’s AC1 were tabulated, achieving equally high values of 0.888 and 0.894 respectively. PABAK rescales percent agreement assuming categories are equally likely, while Gwet’s AC1 adjusts for chance based on how raters actually use categories during evaluation16,17. A prevalence index (PI) of 0.256 indicates notable imbalance tilted towards positive cases, while a very low bias index (BI) of 0.02 indicates low systematic rater bias. A non-significant McNemar’s χ2 test (χ2 = 1.00, p = 0.317), used to evaluate whether there were systematic differences between the ratings of the 2 reviewers on categorical, paired data, presents no evidence that one reviewer was more inclined than the other to indicate a criterion as being reported18.
A full documentation of inter-rater evaluation results and robustness tabulation can be found in supplementary information file 2.
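For transparency, the sketch below shows how these agreement statistics can be computed from the two reviewers’ paired binary labels using the standard 2 × 2 formulas; it is an illustration of the quantities reported above, not the study’s own code.

```python
def agreement_stats(rater1: list[bool], rater2: list[bool]) -> dict:
    """Raw agreement, Cohen's kappa, PABAK, Gwet's AC1, prevalence and bias indices
    for two raters' binary labels (standard 2x2 formulas; illustrative sketch)."""
    n = len(rater1)
    a = sum(r1 and r2 for r1, r2 in zip(rater1, rater2))              # both 'Yes'
    d = sum((not r1) and (not r2) for r1, r2 in zip(rater1, rater2))  # both 'No'
    b = sum(r1 and (not r2) for r1, r2 in zip(rater1, rater2))        # rater 1 'Yes', rater 2 'No'
    c = n - a - b - d                                                 # rater 1 'No', rater 2 'Yes'

    po = (a + d) / n                              # raw (observed) agreement
    p1, p2 = (a + b) / n, (a + c) / n             # each rater's 'Yes' proportion
    pe = p1 * p2 + (1 - p1) * (1 - p2)            # chance agreement (Cohen)
    kappa = (po - pe) / (1 - pe)
    pabak = 2 * po - 1                            # prevalence- and bias-adjusted kappa
    pbar = (p1 + p2) / 2
    pe_g = 2 * pbar * (1 - pbar)                  # chance agreement (Gwet)
    ac1 = (po - pe_g) / (1 - pe_g)
    return {
        "raw_agreement": po, "cohen_kappa": kappa, "pabak": pabak, "gwet_ac1": ac1,
        "prevalence_index": abs(a - d) / n, "bias_index": abs(b - c) / n,
    }
```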
Confusion matrix summary
Consensus results between the 2 reviewers were summarised in a confusion matrix table and tabulated at the criterion/article level (supplementary information file 1, Table 1). Numeric totals for each truth-value category (TP, TN, FP, FN) were summed, yielding a total of 480 values (32 criteria × 15 articles). The summary at the criterion level was further demarcated into 3 domains as categorized in COREQ. Confusion matrices allow for the tabulation of accuracy, precision, recall (sensitivity) and specificity scores, which in turn allow for the calculation of F1, balanced accuracy and the Matthews correlation coefficient, as well as an associated range of other scores used for performance analysis.
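A minimal sketch of this aggregation step is shown below, assuming the consensus labels are held in a long-format table with one row per (article, criterion) cell; the file and column names are illustrative.

```python
import pandas as pd

# Assumed long-format table of consensus labels: 480 rows (32 criteria x 15 articles),
# with columns 'article', 'criterion' and 'truth_value' in {'TP', 'FP', 'TN', 'FN'}.
labels = pd.read_csv("consensus_labels.csv")  # hypothetical file name

# Per-criterion and per-article confusion counts (TP/FP/TN/FN totals).
by_criterion = pd.crosstab(labels["criterion"], labels["truth_value"])
by_article = pd.crosstab(labels["article"], labels["truth_value"])
```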
Performance metrics
F1, balanced accuracy scores and the Matthews correlation coefficient
F1 scores were tabulated at the criterion and article level to understand the model’s performance in identifying positive cases. The score measures the harmonic mean between precision (the accuracy of positive predictions) and recall (the rate of true positive identification). Results range from 0 to 1, where 0 indicates no precision/recall, and 1 perfect precision and recall. The F1 score is a standardised metric commonly used to evaluate classification models applied to disease prediction and natural language processing in healthcare19,20.
Since the F1 score does not account for true negatives, we calculate balanced accuracy (BA) and the Matthews correlation coefficient (MCC) to provide a more balanced measure21,22. BA is the average of sensitivity (TP/(TP + FN)) and specificity (TN/(TN + FP)), while MCC is a discriminative metric that reflects true agreement under class imbalance; the present dataset is moderately rather than extremely imbalanced, with 35.21% (169/480) of all cases being true negatives (Table 2). Using BA and MCC provides a more meaningful basis for comparison across criteria, as both metrics balance criteria where a larger proportion of true negatives would otherwise yield a higher result via the F1 score.
Other performance metrics measured include the difference between F1 and BA scores, the false positive rate (FPR), false negative rate (FNR), actual positives (n+) and actual negatives (n-). Explanations and definitions of all metrics are provided in Table 3. It was decided that the text generated by Claude in the 32 × 4 table (column 4) would not be included in the quantitative analysis, since embedding textual analysis in the context of a closed-set confusion matrix would present labelling-related challenges; an illustrative problem that may arise from labelling is described in Table A2, supplementary information file 3. Class imbalance and class proportions for the present dataset are described in Table 2 below.
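The metric definitions summarised in Table 3 correspond to the standard confusion-matrix formulas sketched below (illustrative; zero denominators are handled by the correction described in the next subsection).

```python
import math

def criterion_metrics(tp: float, fp: float, tn: float, fn: float) -> dict:
    """Standard confusion-matrix metrics used in this study (illustrative sketch)."""
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                      # recall / true positive rate
    specificity = tn / (tn + fp)
    f1  = 2 * precision * sensitivity / (precision + sensitivity)
    ba  = (sensitivity + specificity) / 2             # balanced accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "F1": f1, "BA": ba, "MCC": mcc, "delta_BA_F1": ba - f1,
        "FPR": fp / (fp + tn), "FNR": fn / (fn + tp),
        "n_pos": tp + fn, "n_neg": tn + fp,
    }
```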
Imputation and metric stability
To enable item-level comparability and avoid undefined metrics where denominators were zero, the Haldane–Anscombe correction was applied by adding 0.5 to each cell of a 2 × 2 confusion table (TP, FP, TN, FN) prior to computing sensitivity, specificity, BA, F1 scores and MCCs23. This correction draws undefined or unsupported values towards more central, stable values and symmetrically reduces small-sample bias. The correction was applied at the criterion and criterion domain level, but not at the article level.
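A sketch of the correction, reusing the `criterion_metrics` function from the sketch above, is shown below; applied to a criterion reported and flagged as positive in all 15 articles, it reproduces the F1 ≈ 0.969, BA ≈ 0.734 and MCC ≈ 0.469 pattern reported for all-positive criteria in the Results.

```python
def corrected_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Haldane-Anscombe correction: add 0.5 to every cell of the 2x2 table
    before computing the metrics (illustrative sketch)."""
    return criterion_metrics(tp + 0.5, fp + 0.5, tn + 0.5, fn + 0.5)

# Example: a criterion reported in all 15 articles and always flagged by the model
# (TP = 15, FP = TN = FN = 0) would otherwise leave specificity and MCC undefined.
print(corrected_metrics(15, 0, 0, 0))  # F1 ~= 0.969, BA ~= 0.734, MCC ~= 0.469
```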
Results
To ensure a holistic understanding of Claude’s performance across all articles, we examine a range of key performance metrics, paying particular attention to criterion and criterion domain performance. 3 complementary metrics, MCC, balanced accuracy (BA) and the F1 score, served as the principal measures, along with Δ(BA–F1) and FPR/FNR to identify error direction, while counts of actual positives (n⁺) and negatives (n⁻) were tabulated to determine the interpretability of estimates.
Criteria were categorised using specific thresholds, then consolidated into performance clusters. Based on the overall results, 4 main clusters were identified, namely: (1) balanced, (2) under-reported, (3) mixed errors, and (4) information limited criteria. Performance thresholds for each cluster are described in Table 4 below:
Criterion level analysis
Cluster 1: balanced criterion
Based on a combination of assessed metrics and performance indicators, 6 out of 32 criteria (18.8%) displayed a balanced profile. Criteria with a balanced profile include ‘occupation’ (C3) (BA = 0.845, MCC = 0.660, FPR/FNR = 0.227/0.083), ‘experience and training’ (C5) (BA = 0.929, MCC = 0.858, FPR/FNR = 0.100/0.042), ‘participant knowledge of interviewer’ (C7) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050), ‘field notes’ (C20) (BA = 0.936, MCC = 0.871, FPR/FNR = 0.083/0.045), ‘data saturation’ (C22) (BA = 0.889, MCC = 0.768, FPR/FNR = 0.150/0.071), and ‘software’ (C27) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050). Criteria in this cluster have high discriminative scores (MCC ≥ 0.66), very small absolute differences between BA and F1 scores (|Δ(BA–F1)| < 0.05), consistently low FPR (≤ 0.227) and FNR (≤ 0.083), and a good mix of actual positive (n+) and negative (n-) cases. A balanced profile suggests close alignment between criterion definitions and the range of descriptions represented in the articles evaluated by Claude. The low FPR/FNR values confirm that the robustness of the BA/MCC scores is not due to model mislabelling.
Cluster 2: under-reported criterion
2 out of 32 (6.3%) criteria are grouped together as under-reported, due to markedly higher FNR than FPR and a difference between BA and F1 scores of ≥ +0.02, indicating misclassification of actual positive cases as negative. Criteria in this cluster include ‘relationship established’ (C6) (BA = 0.890, MCC = 0.758, FPR/FNR = 0.083/0.136) and ‘interviewer characteristics’ (C8) (BA = 0.841, MCC = 0.606, FPR/FNR = 0.125/0.192). Claude tends to be more conservative when evaluating these criteria, suggesting variable or implicit meaning in their definitions.
Cluster 3: mixed errors criterion
9 out of 32 (28.1%) criteria were grouped into a mixed errors cluster, due to heterogeneous performance and results that do not align neatly with clusters 1 or 2. Criteria in this cluster may be further demarcated into 4 sub-clusters: (i) near balanced, (ii) under-reported inclined, (iii) over-reported inclined and (iv) ambiguous. 4 criteria can be classified as near balanced, with high BA scores and MCCs but 1 or 2 metrics marginally incongruent with the main indicators, such as a high FPR. Criteria in this sub-cluster include ‘gender’ (C4) (BA = 0.863, MCC = 0.653, FPR = 0.107), ‘non-participant’ (C13) (BA = 0.852, MCC = 0.739, FPR = 0.250), ‘transcripts returned’ (C23) (FPR = 0.167; BA and MCC in Table 5) and ‘participant checking’ (C28) (BA = 0.899, MCC = 0.798, FPR = 0.167).
Criteria in the under-reported inclined sub-cluster have modest to relatively high BA scores and MCCs, but also high FNRs. Criteria in this sub-cluster include ‘interviewer/facilitator’ (C1) (BA = 0.793, MCC = 0.653, FNR = 0.375), ‘setting of data collection’ (C14) (BA = 0.788, MCC = 0.575, FNR = 0.300), and ‘number of data coders’ (C24) (BA = 0.796, MCC = 0.640, FNR = 0.357). 1 criterion, ‘credentials’ (C2) (BA = 0.719, MCC = 0.479, FPR = 0.500), may be classified as over-reported inclined, which mirrors the under-reported inclined sub-cluster but with a high FPR instead. 1 criterion, ‘clarity of minor themes’ (C32) (BA = 0.678, MCC = 0.316, FPR/FNR = 0.269/0.375), is classified as ambiguous due to unstable performance across all indicators.
Cluster 4: information limited criterion
Almost half of all criteria (15 out of 32, 46.88%) assessed by Claude had results mostly or entirely in one class (positive or negative), leading to performance indicator values that cannot be sufficiently interpreted given the near or complete absence of one class of values. Criteria with results falling into the positive class only include ‘methodological orientation and theory’ (C9), ‘sampling’ (C10), ‘sample size’ (C12), ‘description of sample’ (C16), ‘audio/visual recording’ (C19), ‘derivation of themes’ (C26), ‘quotations presented’ (C29), ‘data and findings consistent’ (C30), and ‘clarity of major themes’ (C31). All had similar F1 scores of 0.969, reflecting a high prevalence of positive cases, with moderately high BA scores of 0.734 and modest MCCs of 0.469 after incorporating true negative results. Criteria with results falling mostly within the positive class include ‘method of approach’ (C11) (BA = 0.758, MCC = 0.365), ‘interview guide’ (C17) (BA = 0.608, MCC = 0.297) and ‘duration’ (C21) (BA = 0.825, MCC = 0.549). The criterion with results comprising the negative class only is ‘repeat interviews’ (C18) (BA = 0.734, MCC = 0.469), while criteria with results consisting mostly of the negative class are ‘presence of non-participants’ (C15) (BA = 0.825, MCC = 0.549) and ‘description of the coding tree’ (C25) (BA = 0.608, MCC = 0.297).
A full list of criteria grouped by clusters and sub-clusters, with the corresponding performance metrics, can be found in Figure 4 and Table 5.
Criterion domain level analysis
Results at the criterion domain level show the aggregate performance of criteria when grouped with related criteria, following the domain categories defined in COREQ. Information limited criteria were excluded from this analysis to allow for fair intra- and inter-domain comparisons. Median confidence intervals for each domain were obtained by bootstrapping the median of all evaluable criteria using 2,000 resamples (supplementary information file 1).
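A percentile-bootstrap sketch of this procedure is shown below; the example reuses the eight domain 1 criterion-level MCCs reported above and is illustrative rather than a reproduction of the study’s code.

```python
import numpy as np

def bootstrap_median_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median of a domain's evaluable MCCs (sketch)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    medians = [np.median(rng.choice(values, size=values.size, replace=True))
               for _ in range(n_resamples)]
    lower, upper = np.percentile(medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.median(values), (lower, upper)

# Domain 1 criterion-level MCCs (C1-C8) as reported in the criterion-level results above.
domain1_mcc = [0.653, 0.479, 0.660, 0.653, 0.858, 0.758, 0.879, 0.606]
print(bootstrap_median_ci(domain1_mcc))  # median ~= 0.656
```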
Domain 2, ‘study design’, achieved the highest overall median MCC (0.768, CI 0.575–0.871), followed by domain 3, ‘analysis and findings’ (0.719, CI 0.316–0.879), and domain 1, ‘research team and reflexivity’ (0.656, CI 0.606–0.808). Domain 1 had the highest proportion of evaluable criteria, with all criteria (8/8, 100.0%) of evaluable quality, followed by domains 3 and 2, which had a moderate to low proportion of evaluable criteria per domain at 44.4% (4/9) and 33.3% (5/15) respectively. Although most criteria in these domains were not evaluable due to extreme class imbalance (all positive or negative only), the remaining criteria within each domain were still evaluated proficiently. These include criteria such as ‘field notes’ (C20) and ‘data saturation’ (C22) in domain 2, and ‘software’ (C27) in domain 3, which are the most straightforward, quantifiable criteria within these 2 domains.
The lower proportion of evaluable criteria within domains 2 and 3 suggests that a larger sample of articles may be needed to sufficiently gauge the model’s true performance within these domains: to determine whether the model’s assessment of extremely imbalanced criteria reflects its true discriminative ability, and whether imbalance towards one class can be attributed to performance rather than the tendency of a specific domain to group together a mix of criteria with diverse characteristics.
Conversely, the high proportion of evaluable criteria within domain 1 suggests that most criteria within the ‘research team and reflexivity’ domain are consistently well described, with clearly evaluable qualities for generative AI assessment, even though median MCC performance is relatively modest (0.656, CI 0.606–0.808) compared to the other 2 domains (0.768, CI 0.575–0.871 and 0.719, CI 0.316–0.879 for domains 2 and 3 respectively). The confidence interval for domain 1 (CI width: 0.202) shows a narrower range than for domains 2 (CI width: 0.293) and 3 (CI width: 0.563), indicating greater precision of estimates, in contrast to greater variability in performance and output for domains 2 and 3. Domain 3 has the widest confidence interval, extending to MCC < 0.500, suggesting the possibility of estimates falling within a lower range, likely due to a larger proportion of criteria with more open-ended meanings that are susceptible to interpretation.
Median MCC results by domain, and numeric criterion totals by cluster within each domain can be found in Tables 6 and 7 respectively. The list of all criteria and corresponding domains can be found in Table 5.
Article-level analysis
Analysis at the article level, excluding information limited criteria, displayed generally high scores across all key performance metrics, indicating strong model discrimination, with an overall mean F1 score of 0.904, BA of 0.911, and MCC of 0.827. Several articles achieved perfect scores (1.000) across all 3 metrics, while one article fared poorly because of a lack of positive cases. The typical article performed well, with an overall median F1 score of 0.875, BA of 0.929, and MCC of 0.789. Since article-level analysis subsumes multiple types of criteria, each with unique performance thresholds, into a single overall score, it is not possible to fully appraise the model’s performance based on article-level results alone. Article-level analysis is reported in supplementary information file 1, Table 4.
Error analysis
Error analysis was conducted quantitatively at the criterion and criterion domain level, excluding limited information criteria. The balanced error rate (BER), the average of the false positive and false negative rates, was tabulated to reflect the overall rate of misclassification; a higher rate indicates a model more prone to false positive or false negative errors. Results show domain 3, ‘analysis and findings’, to have the highest aggregate BER at 0.151 (FPR/FNR = 0.111/0.190), followed by domain 1, ‘research team and reflexivity’, at 0.091 (0.068/0.115) and domain 2, ‘study design’, at 0.071 (0.030/0.111). Although most individual criteria had a low to moderate BER of < 0.25, 3 criteria had elevated error rates: ‘description of the coding tree’ (C25), ‘clarity of minor themes’ (C32) and ‘credentials’ (C2) had the highest BERs at 0.392 (FPR/FNR = 0.750/0.033), 0.322 (0.269/0.375) and 0.281 (0.500/0.063) respectively. While C25 and C2 were more prone to FP errors, C32 was inclined towards FN errors.
Full BER results for each criterion are illustrated graphically in Figure 5 and reported in supplementary information file 1, Table 3.
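As a worked illustration, the BER for any criterion follows directly from its FPR and FNR; the example below reproduces the value reported for ‘clarity of minor themes’ (C32).

```python
def balanced_error_rate(fpr: float, fnr: float) -> float:
    """BER: the average of the false positive and false negative rates."""
    return (fpr + fnr) / 2

# 'Clarity of minor themes' (C32), using the rates reported above.
print(balanced_error_rate(0.269, 0.375))  # 0.322
```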
Discussion
We undertake a rigorous, data-driven approach to evaluate the performance of Claude in assessing qualitative articles against a consensus-based, objective set of criteria, with clinical implications for evidence-based medicine. LLMs such as Claude can be used as evaluation assistants where consensus-based criteria have been clearly demarcated and adequately defined in standards or checklists, to accelerate research in health communication, medication adherence, patient literacy and other health domains24,25,26.
A range of quantitative metrics was used to identify areas where Claude performs well without extensive pre-prompting in assessing adherence to a comprehensive criteria list. Results reveal 4 key performance clusters with varied outcomes: (a) balanced, (b) under-reported, (c) mixed errors, and (d) information limited. Criteria that fall within the balanced cluster, such as ‘occupation’ (C3) and ‘data saturation’ (C22), are typically clearly defined, distinct and well reported, encapsulating performance consistency that can be extrapolated across a diverse set of articles. Criteria categorised in the near balanced sub-cluster, such as ‘gender’ (C4) and ‘participant checking’ (C28), require further prompt enclosure to allow the model to achieve balanced assessment levels. Enclosed prompt adjustments include specifying or suggesting where each criterion may be located within an article or how it is conventionally described, for example stating within a prompting sequence that the “participant checking (C28) criterion is usually reported in the methods section…”, or providing information ‘flags’ commonly related to a criterion that allow the model to detect it more sensitively. Information flags for C28 may include the following example phrase within a prompt: “Participants in a study are usually provided a summary of results and are asked for their feedback. Feedback refers to thoughts and opinions about the study that they have participated in.”
For criteria that fall within the under-reported cluster or the under-reported inclined sub-cluster within the mixed errors group (where FNR > FPR), results suggest that multi-shot prompts aiming to clarify criterion definitions iteratively may be required to further parse the definitions provided in guidelines. Criteria with broad definitions include ‘relationship established’ (C6) and ‘interviewer characteristics’ (C8) in the under-reported cluster, and ‘setting of data collection’ (C14) and ‘number of data coders’ (C24) in the under-reported inclined sub-cluster. Paraphrasing or elaborating on primary definitions can allow the meaning of fundamental words to be explicated in detail so that polysemantic meanings are narrowed. This means interrogating the meaning of concepts such as ‘relationship’, ‘characteristics’, ‘setting’ and ‘data coders’ that are often syntactically determined and contextually dependent. In contrast, criteria that fall within the over-reported inclined or ambiguous sub-clusters, such as ‘credentials’ (C2) or ‘clarity of minor themes’ (C32), require prompt strategies that aim for exactness, through the use of multiple stated examples or hypothetical scenarios that concretely explicate the preferred output an LLM should generate for a given criterion.
Interestingly, even though effort was taken at the outset to define ‘credentials’ (C2) in this study (see Figure 3), including an illustration of what an output should look like, Claude’s output for this criterion was lacklustre, resulting in a low MCC and a high BER inclined towards FP (FPR/FNR = 0.500/0.063). Such intractable criteria may require a combination of prompt approaches to produce optimal outcomes, including pre-requesting examples from an AI model to describe or expand upon a given conceptual term prior to prompt iteration, allowing for chain-of-thought reasoning, using multi-shot prompts, and avoiding common errors such as providing irrelevant, redundant or conflicting instructions27,28,29,30.
The substantial proportion of criteria categorized within the information limited cluster highlights that a larger number of articles is needed to confirm whether discriminative ability holds for each criterion in this category. This means collating sufficient articles that trend in a balanced way towards both positive and negative cases for fair evaluation. In reality, gathering a balanced dataset may not be feasible for health and clinical research groups that are driven mainly by clinical hypotheses, since the evaluation of articles usually comes after a pool of articles has already been extracted from databases. Future research should consider a large-scale study dedicated solely to examining the performance of generative AI models in assessing articles, exploring both narrower and wider health domains and checking for comparative performance between domains.
Implications for clinical research
Results suggest that a customised or tailored strategy may be needed for researchers who plan to use AI models to assess whether research articles adhere to consensus-based, standardised guidelines. Careful preparation should be given to the development of prompts and the choice of performance metrics to ensure coherence between guidelines and output results. A balanced profile, indicated by consistent performance over a range of metrics, provides confirmation that a criterion can be evaluated reliably by a model. For criteria with under-reported, over-reported or ambiguous outcomes, additional clarificatory or narrower, demarcating prompts may be required to adjust for optimal outputs.
Categorisation into performance clusters can help clinical teams stratify levels of performance and determine where an AI model or prompt sequence needs fine-tuning. Researchers should be cognizant that current consensus-based or standardised reporting guidelines may not be developed in an ideal format for models to evaluate (e.g. STROBE for cross-sectional, observational studies; PRISMA for systematic reviews)31,32, and should therefore tailor prompt approaches to typologies of criteria even as new AI-relevant guidelines such as PRISMA-AI are being developed33. The new AI guidelines plan to look specifically at the reporting of systematic reviews related to AI topics such as machine learning, deep learning and neural networks, but it is unclear whether this will include the use of AI to check adherence to standardised checklists or guidelines.
It is likely that new reporting checklists developed for generative AI-assisted scoping or systematic reviews will be required in the future, similar to the CONSORT-AI guidelines used for reporting AI systems as interventions34 or the TRIPOD-AI guidelines used for diagnostic or prognostic prediction models35. Such checklists will need to be developed and structured optimally based on real-world feedback from healthcare professionals and users, and be sensitive to how systematic, scoping or literature reviews are usually conducted. They should ideally provide additional, detailed guidance on text descriptions that may appear in varied forms across articles.
To the best of our knowledge, this is the first study to use a quantitative, data-driven approach to assess qualitative research articles’ adherence to a consensus-based, objective guideline using a generative AI model.
Limitations
One limitation of this study is the relatively small number and range of research articles assessed, which limits the scope of results. Evaluating a larger pool of articles can provide a more robust understanding of how a generative AI model performs when presented with a more extensive or complex dataset, generate more precise estimates, and better test the discriminant abilities of the model. A larger-scale study involving comparisons between different models, using multiple criteria or standardised checklists as well as multi-modal benchmark tools simultaneously, can provide a deeper understanding of how generative AI outputs align with human inputs in different contexts, and gauge which specific models are more reliable for research evaluation.
Although prompts were developed iteratively at the initial stage of this study using an approximation approach, a fully fledged, systematic ‘prompt engineering’ strategy would be ideal to test prompts extensively before they are applied to different articles. It is known that the quality of prompts can substantially affect the quality of outputs that a model generates36,37,38. One specific technique, substituting similar words and phrases iteratively based on an initial parsing of results until satisfactory wordings are attained39, would have enhanced the generation of model outputs and allowed for a more multi-faceted analysis.
One additional limitation is the use of binary quantitative results (TP, TN, FP, FN) extracted from a classification model (confusion matrix) as the main driver of assessing model ability. Text generated by Claude was not included in this study, which would have allowed for a more comprehensive performance analysis and a comparison between textual and quantitative results.
Conclusion
Although LLMs can support and accelerate the evaluation of qualitative research articles against consensus-based guidelines, the quality of output depends significantly on prompts that are well calibrated and measured using a range of performance metrics. Near balanced criteria require prompt enclosure adjustments or information flags to achieve more balanced performance, while under-reported criteria require paraphrasing or interrogation of key concepts so that polysemantic meanings are narrowed. Over-reported criteria require the use of concretely stated examples or hypothetical scenarios to achieve balance.
Conversely, criteria with limited information require a larger sample of articles for assessment to determine that performance is not primarily due to the propensity of truth-values to fall within one class (T/F). Segmenting criteria into performance clusters allows researchers to identify areas of incongruence, so that specific strategies to modify prompts can be applied for any given set of research articles. Customised approaches that are expertly crafted can allow for the rapid extraction of valuable insights from articles to inform patient-centred recommendations and practice guidelines.
Data availability
All articles evaluated in this study were obtained from publicly available journal databases; article references may be found in supplementary file 3. The COREQ checklist and definition of each criterion is provided in supplementary information file 3. Human-coded classifications of output data generated by Claude 3.5 Sonnet are available in supplementary information files 1 and 2. No personal patient or identifiable data was used in this study.
References
Tong, A., Sainsbury, P. & Craig, J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int. J. Qual. Health Care. 19 (6), 349–357. https://doi.org/10.1093/intqhc/mzm042 (2007).
O’Brien, B. C., Harris, I. B., Beckman, T. J., Reed, D. A. & Cook, D. A. Standards for reporting qualitative research: a synthesis of recommendations. Acad. Med. 89 (9), 1245–1251. https://doi.org/10.1097/ACM.0000000000000388 (2014).
Mishra, T. et al. Use of large Language models as artificial intelligence tools in academic research and publishing among global clinical researchers. Sci. Rep. 14 (1), 31672. https://doi.org/10.1038/s41598-024-81370-6 (2024).
Bijker, R., Merkouris, S. S., Dowling, N. A. & Rodda, S. N. ChatGPT for automated qualitative research: content analysis. J. Med. Internet Res. 26, e59050. https://doi.org/10.2196/59050 (2024).
Prescott, M. R. et al. Comparing the efficacy and efficiency of human and generative AI: qualitative thematic analyses. JMIR AI. 3, e54482. https://doi.org/10.2196/54482 (2024).
Gartlehner, G. et al. Data extraction for evidence synthesis using a large Language model: A proof-of‐concept study. Res. Synthesis Methods. 15 (4), 576–589. https://doi.org/10.1002/jrsm.1710 (2024).
Ovelman, C., Kugley, S., Gartlehner, G. & Viswanathan, M. The use of a large Language model to create plain Language summaries of evidence reviews in healthcare: A feasibility study. Cochrane Evid. Synthesis Methods. 2 (2), e12041. https://doi.org/10.1002/cesm.12041 (2024).
Spillias, S. et al. Human-AI collaboration to identify literature for evidence synthesis. Cell. Rep. Sustain. 1 (7). https://doi.org/10.1016/j.crsus.2024.100132 (2024).
Anthropic. Claude AI (Sonnet 3.5, June 2024 release) [Large language model]. Anthropic. (2024). https://www.anthropic.com.
Chia, A. W. Y., Teo, W. L. L., Acharyya, S., Munro, Y. L. & Dalan, R. Patient-physician communication of health and risk information in the management of cardiovascular diseases and diabetes: a systematic scoping review. BMC Med. 23 (1), 96. https://doi.org/10.1186/s12916-025-03873-x (2025).
Agarwal, V. et al. MedHalu: hallucinations in responses to healthcare queries by large Language models. ArXiv Preprint arXiv:2409 19492. https://doi.org/10.48550/arXiv.2409.19492 (2024).
Zhang, Y. et al. Siren’s song in the AI ocean: a survey on hallucination in large Language models. ArXiv Preprint arXiv:2309 01219. https://doi.org/10.48550/arXiv.2309.01219 (2023).
Wilson, E. B. Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22 (158), 209–212. https://doi.org/10.1080/01621459.1927.10502953 (1927).
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46. https://doi.org/10.1177/001316446002000104 (1960).
Efron, B. Better bootstrap confidence intervals. J. Am. Stat. Assoc. 82 (397), 171–185. https://doi.org/10.2307/2289144 (1987).
Byrt, T., Bishop, J. & Carlin, J. B. Bias, prevalence and kappa. J. Clin. Epidemiol. 46 (5), 423–429. https://doi.org/10.1016/0895-4356(93)90018-v (1993).
Gwet, K. L. Computing inter-rater reliability and its variance in the presence of high agreement. Br. J. Math. Stat. Psychol. 61 (1), 29–48. https://doi.org/10.1348/000711006X126600 (2008).
McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12 (2), 153–157. https://doi.org/10.1007/BF02295996 (1947).
Grandini, M., Bagli, E. & Visani, G. Metrics for multi-class classification: an overview. ArXiv Preprint arXiv:2008 05756. https://doi.org/10.48550/arXiv.2008.05756 (2020).
Hicks, S. A. et al. On evaluation metrics for medical applications of artificial intelligence. Sci. Rep. 12 (1), 5979. https://doi.org/10.1038/s41598-022-09954-8 (2022).
Brodersen, K. H., Ong, C. S., Stephan, K. E. & Buhmann, J. M. The balanced accuracy and its posterior distribution. In 2010 20th international conference on pattern recognition (pp. 3121–3124). IEEE. (2010). https://doi.org/10.1109/ICPR.2010.764.
Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 21 (1), 6. https://doi.org/10.1186/s12864-019-6413-7 (2020).
Anscombe, F. J. On estimating binomial response relations. Biometrika 43 (3/4), 461–464 (1956).
Liu, C. et al. What is the meaning of health literacy? A systematic review and qualitative synthesis. Family Med. Commun. Health. 8 (2), e000351 (2020). https://doi.org/10.1136/fmch-2020-000351.
Marshall, I. J., Wolfe, C. D. & McKevitt, C. Lay perspectives on hypertension and drug adherence: systematic review of qualitative research. BMJ 345. https://doi.org/10.1136/bmj.e3953 (2012).
Mentrup, S., Harris, E., Gomersall, T., Köpke, S. & Astin, F. Patients’ experiences of cardiovascular health education and risk communication: a qualitative synthesis. Qual. Health Res. 30 (1), 88–104. https://doi.org/10.1177/1049732319887949 (2020).
Google Cloud. Overview of prompting strategies. Generative AI on Vertex AI — Google Cloud. (2025). https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/prompt-design-strategies.
Meskó, B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J. Med. Internet. Res. 25, e50638 (2023).
OpenAI. GPT-5 prompting guide. OpenAI Cookbook. (2025). https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide.
Microsoft. Prompt engineering techniques. Microsoft Learn (2025, September 30). https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/prompt-engineering.
Von Elm, E. et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet 370 (9596), 1453–1457. https://doi.org/10.1136/bmj.39335.541782.AD (2007).
Page, M. J., McKenzie, J. E., Bossuyt, P. M., Boutron, I., Hoffmann, T. C., Mulrow,C. D., & Moher, D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. https://doi.org/10.1136/bmj.n71 (2021).
Cacciamani, G. E., Chu, T. N., Sanford, D. I., Abreu, A., Duddalwar, V., Oberai, A., & Hung, A. J. PRISMA AI reporting guidelines for systematic reviews and meta-analyses on AI in healthcare. Nat. Med. 29(1), 14–15. https://doi.org/10.1038/s41591-022-02139-w (2023).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit. Health. 2 (10), e537–e548. https://doi.org/10.1038/s41591-020-1034-x (2020).
Collins, G. S. et al. TRIPOD + AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, 1. https://doi.org/10.1136/bmj-2023-078378 (2024).
Kim, J. et al. Which is better? Exploring prompting strategy for llm-based metrics. ArXiv Preprint arXiv:2311 03754. https://doi.org/10.48550/arXiv.2311.03754 (2023).
Yugeswardeenoo, D., Zhu, K. & O’Brien, S. Question-analysis prompting improves LLM performance in reasoning tasks. ArXiv Preprint arXiv:2407 03624. https://doi.org/10.48550/arXiv.2407.03624 (2024).
Sun, S., Zhuang, S., Wang, S. & Zuccon, G. An investigation of prompt variations for Zero-shot LLM-based rankers. ArXiv Preprint arXiv:2406 14117. https://doi.org/10.1007/978-3-031-88711-6_12 (2024).
Wang, B., Deng, X. & Sun, H. Iteratively prompt pre-trained Language models for chain of thought. ArXiv Preprint arXiv:2203 08383. https://doi.org/10.48550/arXiv.2203.08383 (2022).
Funding
This research study is kindly supported and generously funded by the Ng Teng Fong Foundation (Grant Reference: NTF_SRP_P1) and NHG Health as part of the Personalised Cardiometabolic Risk Management program (Predict to Prevent, ‘P2P study’). The P2P study is a population health program that aims to monitor, predict, and delay the risk of macrovascular complications through early risk identification and stratification amongst patients and population groups at high risk of developing cardiovascular disease. The funding agencies were not involved in the design, planning, screening, analysis or interpretation of the findings of this study, as well as the preparation of this manuscript in any way.
Author information
Contributions
Study design and conceptualisation: AC, WT; Prompt development and adjustments: AC; Model run and testing: AC; Data extraction from model output: AC; Intercoder validation: AC, WT; Statistical/quantitative analysis: AC; Drafting and preparation of manuscript: AC; Review of manuscript: AC, WT, RD. All authors approve of the final version of this manuscript.
For the scoping review (source of extracted articles for evaluation): Database search and extraction of article list: YM; Screening and review of articles: AC, WT.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chia, A.WY., Teo, W.LL., Munro, Y.L. et al. Evaluating the performance of a generative AI model in assessing qualitative health research articles adherence to objective reporting standards. Sci Rep 16, 3258 (2026). https://doi.org/10.1038/s41598-025-29591-1