Introduction

Assessing the adherence of qualitative research studies to consensus-based guidelines is a crucial step in determining the reliability and robustness of health and medical research. Checklists and guidelines allow readers to verify whether study components have been adequately fulfilled and to assess whether a study has been systematically conducted. For qualitative research, the 32-item Consolidated Criteria for Reporting Qualitative Research (COREQ) (supplementary information file 3, table A1)1 and the Standards for Reporting Qualitative Research (SRQR) checklist2 are two of the most frequently used guidelines for determining whether qualitative articles are adequately reported. Given that patient-centric and scalable solutions are increasingly needed in dynamic, fast-paced healthcare settings, the adoption of insights from qualitative research can help provide diverse perspectives and pragmatic nuance in the deployment of innovative technologies or treatments.

Recent developments in the use of large language models (LLMs) have shown potential in helping researchers improve the efficiency of qualitative analysis3. Bijker et al.4 applied ChatGPT to 539 user-generated forum messages related to sugar consumption to generate qualitative codes aimed at identifying behavioural change mechanisms. Findings indicate that inductively generated codes had higher agreement with expert coding than deductive codes that followed a pre-determined framework (k = 0.69–0.84 for inductive coding; k = 0.52–0.73 for unconstrained deductive coding; k = 0.66 for structured deductive coding).

Prescott et al.5 tested the ability of ChatGPT and Google’s Bard to perform thematic analysis on 40 short SMS messages as part of a digital intervention program to promote medication adherence amongst methamphetamine users with HIV. Although intercoder reliability between the LLM tools and human coders ranged from fair to moderate for ChatGPT (ICR = 47% and 37% for inductive and deductive analysis respectively) and Bard (ICR = 37% and 36%), thematic agreement was good (71% and 50% for ChatGPT; 71% and 58% for Bard, for inductive and deductive analysis respectively). More notably, ChatGPT took only 15 and 25 min to conduct inductive and deductive analysis respectively, while Bard took 20 min for each, compared with 492 and 705 min for human coders, corresponding to time savings of 97% for both types of analysis for ChatGPT, and 96% and 97% respectively for Bard.

Despite its wide-ranging potential and demonstrable proficiency in analysing text, current use of generative AI tools is largely confined to individual studies, where models are applied to an assemblage of data for qualitative coding to generate insights. Less explored are analyses conducted at the meta-analytic level, where a model is tasked with checking the adherence of multiple research studies to consensus-based or objective checklists, which remains a laborious, time-consuming and manual process in health and clinical research. By evaluating a model’s performance against a commonly used objective framework, this study aims to contribute to emerging research on AI-assisted evidence synthesis6,7,8.

Study aims

The aim of this study is to evaluate the performance of Claude 3.5 Sonnet (released June 2024)9 in assessing qualitative health research articles’ adherence to a consensus-based, objective set of qualitative reporting guidelines. A 2-step iterative zero-shot approach was used to evaluate 15 articles extracted from a previously conducted scoping review. Output was validated for accuracy and summarised in a confusion matrix at the article and criterion level by 1 reviewer, followed by a second reviewer on 5 articles to check for inter-rater reliability. Additional robustness checks were performed. Model performance was primarily evaluated using F1 scores, balanced accuracy (BA) and the Matthews Correlation Coefficient (MCC), derived from accuracy, precision, recall (sensitivity) and specificity scores. A number of additional supporting performance metrics were tabulated to evaluate performance results holistically. Results are reported at the criterion level, the criterion domain level as defined in COREQ, and the article level. Quantitative error analysis was conducted to understand which criteria tend to be falsely evaluated by Claude.

We evaluate articles extracted from a scoping review rather than conducting a separate systematic search tailored specifically for model evaluation, so as to better understand the basic performance of an LLM when used as a checklist tool embedded within a larger scoping review.

The full list of articles evaluated can be found in supplementary information file 3.

Methods

Overview

The overall sequence of this study is as follows:

  1. 15 qualitative articles were extracted from a scoping review conducted separately in a previous study, to evaluate the performance of Claude in assessing whether extracted articles meet an objective list of reporting criteria as defined in COREQ.

Initial prompt testing and adjustments using Claude 3 Opus

  2. The list of COREQ criteria as described in Tong et al. (2007) was first uploaded to Claude.

  3. Instructions were given to create a 32 by 4 table, consisting of:

     4. a sequential number, in ascending order;

     5. the list of criteria, as described in the COREQ guideline;

     6. a ‘Yes/No’ column to check for the presence/absence of each criterion; and

     7. a column to justify reasons for each ‘Yes/No’ response.

  8. Qualitative full-text articles were uploaded to Claude individually for evaluation, with clear and specific prompt instructions provided for Claude to evaluate each article. (Study supplements were not uploaded to Claude.)

  9. Prompt adjustments were made based on initial output and errors generated (see Figures 2 and 3 for full prompt wordings and adjustments made).

Application of prompts to Claude 3.5 Sonnet

  10. Steps (2) and (3) were repeated with updated prompts.

  11. Qualitative full-text articles were re-uploaded individually to Claude for evaluation on a separate message thread. (Article supplementary files/appendices, if any, were not uploaded.)

  12. Standardised prompts were applied to all articles individually.

Evaluation of results

  13. Output generated for all articles was checked and evaluated by a reviewer.

  14. A second reviewer independently evaluated 5 randomly selected articles.

  15. Results were classified as true positive/negative or false positive/negative by both reviewers and added into separate, individual confusion tables, summarised at the criterion/article level.

  16. The 2 reviewers convened to compare evaluations and discuss discordant criteria.

  17. The final confusion table after consensus agreement was tabulated using a range of performance metrics.

An overview of study procedures is illustrated in Figure 1 below.

Figure 1

Study roadmap and analytical approach.

Data sources

15 qualitative articles were retrieved from a scoping review conducted previously on patient-physician communication of health and risk information pertaining to cardiovascular diseases and diabetes10. A comprehensive database search was conducted for articles published between 1st Jan 2000 and 3rd October 2023. Of 8378 articles that were screened, 88 articles were reviewed, of which 30 articles were included, comprising 15 qualitative, 14 quantitative and 1 mixed-methods study. The PRISMA flow diagram, search terms and table of key characteristics of included studies can be found in additional files 1 to 3 of the referenced scoping review article.

Ethical considerations

This study is exempt from ethics approval, as only journal articles are included as data points for analysis. No human/patient identifiers or information on research subjects were collected.

Guiding prompts and model output

2-step prompt sequence

First, the original COREQ article with description of each criterion was uploaded into Claude as contextual information, followed by prompt instructions to create a 32 by 4 blank table. Claude is then instructed to populate the first and second columns of each row with a sequential number and to include the name of each criterion respectively. In the third column, a ‘Yes/No’ option is included to indicate whether a criterion is mentioned in an article. If the third column is indicated as ‘Yes’ in any given article, the model is tasked to provide evidence from the article justifying the presence of each criterion in the fourth column. If the third column is ‘No’, then ‘N.A.’ should be indicated by the model in the fourth column. The first prompt sequence is to be applied only once within each dialogue thread.
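As an illustration only, the expected structure of one row of the model’s output table and the ‘Yes’/‘No’ rule described above can be sketched as follows; the field and function names are ours, not drawn from the study’s prompts.

```python
from dataclasses import dataclass

@dataclass
class CriterionRow:
    """One row of the 32 x 4 assessment table returned by the model (illustrative schema)."""
    number: int          # column 1: sequential criterion number (1-32)
    criterion: str       # column 2: criterion name as worded in COREQ
    reported: str        # column 3: 'Yes' or 'No'
    justification: str   # column 4: verbatim evidence if 'Yes', otherwise 'N.A.'

def check_row(row: CriterionRow) -> bool:
    """Enforce the prompt rule: 'No' rows must carry 'N.A.'; 'Yes' rows must carry evidence."""
    if row.reported == "No":
        return row.justification == "N.A."
    return row.reported == "Yes" and row.justification not in ("", "N.A.")
```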

Each article is then uploaded to Claude individually for assessment. Prompts for each article include additional instructions to clarify the scope of extraction. Instructions include requesting that textual evidence in Column 4 be extracted and stated verbatim rather than paraphrased, to ensure that any potential ‘hallucinations’ generated can be verified in a transparent way. Hallucination refers to an LLM presenting incorrect information as if it were correct and true, a known occasional risk of such models11,12.

The 2nd prompt was used each time a new qualitative article was uploaded for assessment. The 2-prompt sequence was first tested on Claude 3 Opus, a precursor to Claude 3.5 Sonnet. Claude 3 Opus was used as it was the most advanced Claude model available during initial test-runs. Claude 3.5 Sonnet superseded Claude 3 Opus shortly after the initial testing of prompts was completed.
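The study applied both prompts through the Claude chat interface, uploading files into dialogue threads. As a hedged illustration of how the same 2-prompt sequence could be reproduced programmatically, the sketch below uses the Anthropic Messages API; the file names, variables and prompt wordings are placeholders rather than the study’s actual materials (see Figures 2 and 3 for the prompts used).

```python
import anthropic  # assumes the official Anthropic Python SDK is installed

COREQ_TEXT = open("coreq_criteria.txt").read()          # hypothetical file holding the 32 COREQ items
FIRST_PROMPT = "Create a 32 by 4 table ..."             # placeholder; full wording in Figure 2
SECOND_PROMPT = "Assess the uploaded article ..."       # placeholder; full wording in Figure 3
articles = [open(p).read() for p in ("article1.txt",)]  # hypothetical plain-text full texts

client = anthropic.Anthropic()                 # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20240620"           # June 2024 release referenced in the study

# Prompt 1: supply the COREQ criteria and request the blank 32 x 4 table
# (applied once per dialogue thread).
history = [{"role": "user", "content": COREQ_TEXT + "\n\n" + FIRST_PROMPT}]
reply = client.messages.create(model=MODEL, max_tokens=4096, messages=history)
history.append({"role": "assistant", "content": reply.content[0].text})

# Prompt 2: repeated for each article, with the article's full text appended.
for article_text in articles:
    history.append({"role": "user", "content": SECOND_PROMPT + "\n\n" + article_text})
    reply = client.messages.create(model=MODEL, max_tokens=4096, messages=history)
    history.append({"role": "assistant", "content": reply.content[0].text})
    print(reply.content[0].text)  # populated 32 x 4 table for this article
```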

Prompt adjustments

After applying the 2-step prompt procedure to Claude 3 Opus on a few qualitative articles and generating initial output, adjustments were made to the 2nd prompt due to persistent errors among a few criteria. For criteria 2 and 3 (‘credentials’ and ‘occupation’), a prompt qualifier was added to emphasize that credentials or occupation are not the same as the address of an institution, which Claude 3 Opus often conflated them with. For criterion 4 (‘gender’), a prompt was added to infer gender from a person’s name. Although such inferences are not necessarily true, as names can be gender-neutral, we wanted to test the inference ability of the model; Claude 3 Opus initially tended to state ‘no’ even when gender was mentioned.

Once prompts were finalised, the 2-step prompt sequence was applied to all 15 articles individually in Claude 3.5 Sonnet for evaluation. The first prompt was applied once, while the second prompt was applied repeatedly for each additional article uploaded. Prompts were applied with no change in wording so that differences in generated output could not be attributed to arbitrarily changing prompts. Probing of model results through additional prompts was also avoided to ensure successive outputs within the same dialogue thread were not affected.

The preliminary and final prompts applied are shown in Figures 2 and 3.

Figure 2

1st prompt sequence.

Figure 3

2nd prompt sequence.

Human evaluation and truth-value classification

After all articles were assessed, a reviewer (AC) manually checked the accuracy of the output generated by Claude and assigned each criterion to one of 4 truth-value categories:

  1. ‘True or False positive’ (TP or FP) – Claude correctly/wrongly classifies a criterion as mentioned in an article.

  2. ‘True or False negative’ (TN or FN) – Claude correctly/wrongly classifies a criterion as not mentioned in an article.

A description of truth-values and how each assessed result is assigned is described below in Table 1.

Table 1 Truth-value assignment by human reviewer.
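The assignment logic of Table 1 can be summarised in a small helper, shown here as a sketch; the boolean encoding of Claude’s ‘Yes/No’ output and the reviewer’s ground truth is an assumption on our part.

```python
def truth_value(claude_says_yes: bool, reviewer_says_yes: bool) -> str:
    """Map a Claude 'Yes/No' call and the reviewer's ground truth to TP/FP/TN/FN."""
    if claude_says_yes and reviewer_says_yes:
        return "TP"   # criterion reported, and Claude said 'Yes'
    if claude_says_yes and not reviewer_says_yes:
        return "FP"   # criterion not reported, but Claude said 'Yes'
    if not claude_says_yes and not reviewer_says_yes:
        return "TN"   # criterion not reported, and Claude said 'No'
    return "FN"       # criterion reported, but Claude said 'No'
```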

Inter-rater reliability and robustness checks

A second, independent reviewer (WT) evaluated 5 randomly selected articles to check for concordance/discordance in results and inter-rater reliability. Statistical tests were performed to ensure the robustness of results. Pre-consensus raw agreement between reviewers was high at 0.944 (Wilson’s CI 0.90–0.97)13. To confirm that results were not due to chance or to each reviewer’s positive or negative inclination, Cohen’s k was tabulated, generating a relatively high result of 0.880 (CI 0.80–0.96)14. Confidence intervals were obtained by bootstrapping articles using 2,000 resamples15. To adjust for sensitivity towards class imbalance and asymmetric category use between reviewers, prevalence- and bias-adjusted k (PABAK) as well as Gwet’s AC1 were tabulated, achieving similarly high results of 0.888 and 0.894 respectively. PABAK rescales percent agreement assuming categories are equally likely, while Gwet’s AC1 adjusts for chance by factoring in how raters actually use categories during evaluation16,17. A prevalence index (PI) of 0.256 indicates notable imbalance tilted towards positive cases, while a very low bias index (BI) of 0.02 indicates low systematic rater bias. A non-significant McNemar’s χ2 test (χ2 = 1.00, p = 0.317), used to evaluate whether there were systematic differences between the two reviewers’ ratings on paired categorical data, presents no evidence that one reviewer was more inclined than the other to indicate a criterion as being reported18.

A full documentation of inter-rater evaluation results and robustness tabulation can be found in supplementary information file 2.
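For readers wishing to reproduce these agreement statistics, a minimal sketch is given below, assuming two binary rating vectors (1 = criterion reported). It uses scikit-learn for Cohen’s k and closed-form expressions for PABAK, Gwet’s AC1 and the prevalence and bias indices; for brevity the bootstrap resamples individual ratings, whereas the study resampled articles, and McNemar’s test (available in statsmodels) is omitted.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(r1, r2):
    """Agreement statistics for two binary raters (1 = criterion reported, 0 = not)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    n = len(r1)
    a = np.sum((r1 == 1) & (r2 == 1))      # both say 'Yes'
    b = np.sum((r1 == 1) & (r2 == 0))
    c = np.sum((r1 == 0) & (r2 == 1))
    d = np.sum((r1 == 0) & (r2 == 0))      # both say 'No'
    po = (a + d) / n                       # raw (observed) agreement
    kappa = cohen_kappa_score(r1, r2)      # chance-corrected agreement
    pabak = 2 * po - 1                     # prevalence- and bias-adjusted kappa
    p_yes = (r1.mean() + r2.mean()) / 2    # average 'Yes' proportion across raters
    pe_gwet = 2 * p_yes * (1 - p_yes)
    ac1 = (po - pe_gwet) / (1 - pe_gwet)   # Gwet's AC1 for binary categories
    return {"po": po, "kappa": kappa, "pabak": pabak, "ac1": ac1,
            "prevalence_index": abs(a - d) / n, "bias_index": abs(b - c) / n}

def bootstrap_ci(r1, r2, stat="kappa", n_boot=2000, seed=0):
    """Percentile bootstrap CI with 2,000 resamples, as used in the study."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(r1), len(r1))
        vals.append(agreement_stats(r1[idx], r2[idx])[stat])
    return np.percentile(vals, [2.5, 97.5])
```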

Confusion matrix summary

Consensus results between the 2 reviewers were summarised in a confusion matrix table and tabulated at the criterion/article level (supplementary information file 1, Table 1). Numeric totals for each truth-value category (TP, TN, FP, FN) were summed, yielding a total of 480 values (32 criteria × 15 articles). The summary at the criterion level was further demarcated into the 3 domains categorised in COREQ. Confusion matrices allow for the tabulation of accuracy, precision, recall (sensitivity) and specificity scores, which in turn allow for the calculation of the F1 score, balanced accuracy and the Matthews Correlation Coefficient, as well as an associated range of other scores used for performance analysis.

Performance metrics

F1, balanced accuracy scores and the Matthews correlation coefficient

F1 scores were tabulated at the criterion and article level to understand the model’s performance in identifying positive cases. The score measures the harmonic mean between precision (the accuracy of positive predictions) and recall (the rate of true positive identification). Results range from 0 to 1, where 0 indicates no precision/recall, and 1 perfect precision and recall. The F1 score is a standardised metric commonly used to evaluate classification models applied to disease prediction and natural language processing in healthcare19,20.

Since the F1 score does not account for true negatives, we calculate balanced accuracy (BA) and the Matthews Correlation Coefficient (MCC) to provide a more balanced measure21,22. BA is the average of sensitivity (TP/(TP + FN)) and specificity (TN/(TN + FP)), while MCC is a fundamental discriminative metric that reflects true agreement under class imbalance; the present dataset is moderately rather than extremely imbalanced, with true negatives comprising 35.21% (169/480) of all cases (Table 2). Using BA and MCC provides a more meaningful basis for comparison across criteria, as both metrics balance criteria where a larger proportion of true negatives would otherwise produce a higher result via the F1 score.
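For reference, these primary and supporting metrics can all be derived from a single 2 × 2 confusion table, as in the sketch below; zero denominators are assumed to be handled by the Haldane–Anscombe correction described later.

```python
import math

def metrics(tp, fp, tn, fn):
    """Performance metrics derived from a 2 x 2 confusion table (non-zero denominators assumed)."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    f1  = 2 * precision * recall / (precision + recall)
    ba  = (recall + specificity) / 2          # balanced accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "f1": f1, "ba": ba, "mcc": mcc,
            "fpr": fp / (fp + tn), "fnr": fn / (fn + tp),
            "delta_ba_f1": ba - f1}
```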

Other performance metrics measured include the difference between F1 and BA scores, false positive rate (FPR), false negative rate (FNR), actual positives (n+) and actual negatives (n−). Explanations of all metrics and their definitions are provided in Table 3. It was decided that text results generated by Claude in the 32 × 4 table (column 4) would not be included for quantitative analysis, since embedding textual analysis in the context of a closed-set confusion matrix would present labelling-related challenges. An illustrative problem that may arise from labelling is described in Table A2, supplementary information file 3. Class imbalance and proportions for the present dataset are described in Table 2 below.

Table 2 Description of class imbalance over total cases, with percentages.
Table 3 List of key performance metrics and definitions used for evaluation.

Imputation and metric stability

To enable item-level comparability and avoid undefined metrics where denominators were zero, the Haldane–Anscombe correction was applied by adding 0.5 to each cell of a 2 × 2 confusion table (TP, FP, TN, FN) prior to computing sensitivity, specificity, BA, F1 scores and MCCs23. This approach allows undefined or unsupported values to gravitate towards more central, stable values and symmetrically reduces small-sample bias. Imputation was applied at the criterion and criterion domain levels, but not at the article level.
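A minimal sketch of the correction, reusing the metrics() helper sketched above; the example counts are hypothetical.

```python
def haldane_anscombe(tp, fp, tn, fn, k=0.5):
    """Add 0.5 to every cell of the 2 x 2 table so ratios with zero denominators remain defined."""
    return tp + k, fp + k, tn + k, fn + k

# usage with hypothetical counts: the corrected counts feed the metrics() sketch above
print(metrics(*haldane_anscombe(14, 0, 0, 1)))
```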

Results

To ensure a holistic understanding of Claude’s performance across all articles, we examine a range of key performance metrics, paying particular attention to criterion and criterion domain performance. Three complementary metrics were primarily used to measure performance: MCC, balanced accuracy (BA) and the F1 score are the principal metrics, supplemented by Δ(BA–F1) and FPR/FNR to identify error direction, while counts of actual positives (n+) and negatives (n−) were tabulated to gauge the interpretability of estimates.

Criteria were categorised using specific thresholds, then consolidated into performance clusters. Based on the overall results, 4 main clusters were identified, namely: (1) balanced criteria, (2) under-reported criteria, (3) mixed errors criteria, and (4) information limited criteria. Performance thresholds for each cluster are described in Table 4 below:

Table 4 Performance thresholds and error profile/explanation of each cluster.
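As an illustration of how such thresholds can be operationalised, the sketch below assigns a criterion to a cluster from its metrics; the cut-offs approximate those described in the Results text and are not the exact definitions in Table 4.

```python
def assign_cluster(m, n_pos, n_neg):
    """Illustrative cluster assignment; `m` is the dict returned by metrics() above.
    Thresholds approximate those described in the Results text, not the exact Table 4 values."""
    if n_pos == 0 or n_neg == 0:                         # one class entirely absent
        return "information limited"
    if (m["mcc"] >= 0.66 and abs(m["delta_ba_f1"]) < 0.05
            and m["fpr"] <= 0.25 and m["fnr"] <= 0.10):
        return "balanced"
    if m["fnr"] > m["fpr"] and m["delta_ba_f1"] >= 0.02:  # misses actual positives
        return "under-reported"
    return "mixed errors"
```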

Criterion level analysis

Cluster 1: balanced criteria

Based on a combination of assessed metrics and performance indicators, 6 out of 32 criteria (18.8%) displayed a balanced profile. Criteria with a balanced profile include ‘occupation’ (C3) (BA = 0.845, MCC = 0.660, FPR/FNR = 0.227/0.083), ‘experience and training’ (C5) (BA = 0.929, MCC = 0.858, FPR/FNR = 0.100/0.042), ‘participant knowledge of interviewer’ (C7) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050), ‘field notes’ (C20) (BA = 0.936, MCC = 0.871, FPR/FNR = 0.083/0.045), ‘data saturation’ (C22) (BA = 0.889, MCC = 0.768, FPR/FNR = 0.150/0.071), and ‘software’ (C27) (BA = 0.939, MCC = 0.879, FPR/FNR = 0.071/0.050). Criteria in this cluster have high discriminative scores (MCC ≥ 0.66), a very low absolute difference between BA and F1 scores (|Δ(BA–F1)| < 0.05), consistently low FPR (≤ 0.227) and FNR (≤ 0.083), and a good mix of actual positive (n+) and negative (n−) cases. A balanced profile suggests close alignment between criterion definitions and the range of description represented in the articles evaluated by Claude. The low FPR/FNR values confirm that the robustness of the BA/MCC scores is not due to model mislabelling.

Cluster 2: under-reported criteria

2 out of 32 criteria (6.3%) are grouped together as under-reported, due to a markedly higher FNR than FPR and Δ(BA–F1) ≥ +0.02, indicating misclassification of actual positive cases as negative. Criteria in this cluster include ‘relationship established’ (C6) (BA = 0.890, MCC = 0.758, FPR/FNR = 0.083/0.136) and ‘interviewer characteristics’ (C8) (BA = 0.841, MCC = 0.606, FPR/FNR = 0.125/0.192). Claude tends to be more conservative when evaluating these criteria, suggesting variable or implicit meaning in the criteria definitions.

Cluster 3: mixed errors criterion

9 out of 32 criteria (28.1%) were grouped into a mixed errors cluster, due to heterogeneous performance and results that do not align neatly with clusters 1 or 2. Criteria in this cluster may further be demarcated into 4 sub-clusters: (i) near balanced, (ii) under-reported inclined, (iii) over-reported inclined and (iv) ambiguous. 4 criteria can be classified as near balanced, due to high BA scores and MCC, but have 1 or 2 metrics that are marginally incongruent with the main indicators, such as a high FPR. Criteria in this sub-cluster include ‘gender’ (C4) (BA = 0.863, MCC = 0.653, FPR = 0.107), ‘non-participant’ (C13) (BA = 0.852, MCC = 0.739, FPR = 0.250), ‘transcripts returned’ (C23) (BA=, MCC=, FPR = 0.167) and ‘participant checking’ (C28) (BA = 0.899, MCC = 0.798, FPR = 0.167).

Criteria in the under-reported inclined sub-cluster have modest to relatively high BA scores and MCC, but also a high FNR. Criteria in this sub-cluster include ‘interviewer/facilitator’ (C1) (BA = 0.793, MCC = 0.653, FNR = 0.375), ‘setting of data collection’ (C14) (BA = 0.788, MCC = 0.575, FNR = 0.300), and ‘number of data coders’ (C24) (BA = 0.796, MCC = 0.640, FNR = 0.357). 1 criterion, ‘credentials’ (C2) (BA = 0.719, MCC = 0.479, FPR = 0.500), may be classified as over-reported inclined, which is similar to the under-reported inclined sub-cluster but with a high FPR instead. 1 criterion, ‘clarity of minor themes’ (C32) (BA = 0.678, MCC = 0.316, FPR/FNR = 0.269/0.375), is classified as ambiguous due to unstable performance across all indicators.

Cluster 4: information limited criteria

Almost half of all criteria (15 out of 32, 46.88%) assessed by Claude had results falling mostly or entirely in one class (positive or negative), leading to performance indicator values that cannot be sufficiently interpreted given the absence or near absence of one class of values. Criteria whose results fall into the positive class only include ‘methodological orientation and theory’ (C9), ‘sampling’ (C10), ‘sample size’ (C12), ‘description of sample’ (C16), ‘audio/visual recording’ (C19), ‘derivation of themes’ (C26), ‘quotations presented’ (C29), ‘data and findings consistent’ (C30), and ‘clarity of major themes’ (C31). All of these criteria had similar F1 scores of 0.969, reflecting a high prevalence of positive cases, with moderately high BA scores of 0.734 and modest MCCs of 0.469 after incorporating true negative results. Criteria whose results fall mostly within the positive class include ‘method of approach’ (C11) (BA = 0.758, MCC = 0.365), ‘interview guide’ (C17) (BA = 0.608, MCC = 0.297) and ‘duration’ (C21) (BA = 0.825, MCC = 0.549). The criterion whose results comprise the negative class only is ‘repeat interviews’ (C18) (BA = 0.734, MCC = 0.469), while criteria whose results consist mostly of negative classes are ‘presence of non-participants’ (C15) (BA = 0.825, MCC = 0.549) and ‘description of the coding tree’ (C25) (BA = 0.608, MCC = 0.297).

A full list of criteria grouped by clusters and sub-clusters, with the corresponding performance metrics, can be found in Figure 4 and Table 5.

Table 5 Criterion clustered by performance categories with summary of key performance metrics.
Figure 4

Criteria grouped by cluster (1 to 4), then ranked by MCC performance. F1 adjusted for prevalence = 0.5 + 0.5 × (F1 − p)/(1 − p), where p = prevalence and 0.5 = chance. MCC rescaled to the 0–1 range for comparability, where rescaled MCC = (MCC + 1)/2. Adjustments were made to ensure comparability between metrics, with 0.5 representing chance.
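The two adjustments stated in the caption can be expressed directly as:

```python
def f1_prevalence_adjusted(f1, p):
    """F1 rescaled against prevalence p, as in the Figure 4 caption; 0.5 = chance."""
    return 0.5 + 0.5 * (f1 - p) / (1 - p)

def mcc_rescaled(mcc):
    """MCC mapped from [-1, 1] onto [0, 1] so that 0.5 = chance."""
    return (mcc + 1) / 2
```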

Criterion domain level analysis

Results at the criterion domain level show the aggregate performance of criteria when grouped collectively with other related criteria, following the domain categories defined in COREQ. Information limited criteria were excluded from this analysis to allow for fair intra- and inter-domain comparisons. Confidence intervals for each domain median were obtained by bootstrapping the median of all evaluable criteria using 2,000 resamples (supplementary information file 1).
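A minimal sketch of this bootstrap is shown below, using the criterion-level MCC values reported above for domain 1 (C1–C8) as illustrative input; the resulting interval will not exactly match the reported CI, which was derived from the study’s underlying data.

```python
import numpy as np

def median_mcc_ci(mcc_values, n_boot=2000, seed=0):
    """Percentile bootstrap CI for a domain's median MCC over its evaluable criteria
    (2,000 resamples, as reported in the study)."""
    rng = np.random.default_rng(seed)
    meds = [np.median(rng.choice(mcc_values, size=len(mcc_values), replace=True))
            for _ in range(n_boot)]
    return np.median(mcc_values), np.percentile(meds, [2.5, 97.5])

# domain 1 (C1-C8) MCCs as reported in the criterion-level analysis
domain1_mcc = [0.653, 0.479, 0.660, 0.653, 0.858, 0.758, 0.879, 0.606]
print(median_mcc_ci(domain1_mcc))  # median is 0.656, matching the reported value
```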

Domain 2, ‘study design’, achieved the highest overall median MCC (0.768, CI 0.575–0.871), followed by domain 3 ‘analysis and findings’ (0.719, CI 0.316–0.879) and domain 1 ‘research team and reflexivity’ (0.656, CI 0.606–0.808). Domain 1 had the highest proportion of evaluable criteria, with all criteria (8/8, 100.0%) of evaluable quality, followed by domains 3 and 2, which had a moderate to low proportion of evaluable criteria per domain at 44.4% (4/9) and 33.3% (5/15) respectively. Although most criteria in domains 2 and 3 were not evaluable due to extreme class imbalance (all positive or negative only), the remaining criteria within each domain were still evaluated proficiently. These include criteria such as ‘field notes’ (C20) and ‘data saturation’ (C22) in domain 2, and ‘software’ (C27) in domain 3, which are the most straightforward, quantifiable criteria within these 2 domains.

The lower proportion of evaluable criteria within domains 2 and 3 suggests that a larger sample of articles may be needed to sufficiently gauge a model’s true performance within these domains, and to determine whether a model’s assessment of extremely imbalanced criteria reflects its true discriminative ability. Likewise, a larger sample would help establish whether imbalance towards one class can be attributed to model performance rather than to the tendency of a specific domain to integrate a mix of criteria with diverse characteristics.

Conversely, the high proportion of evaluable criteria within domain 1 suggests that most criteria within the ‘research team and reflexivity’ domain are consistently well described, with clear evaluable qualities for generative AI assessment, even though median MCC performance is relatively modest (0.656, CI 0.606–0.808) compared to the other 2 domains (0.768, CI 0.575–0.871; 0.719, CI 0.316–0.879 for domains 2 and 3 respectively). The confidence interval of domain 1 (CI width: 0.202) shows a narrower range than those of domains 2 (CI width: 0.293) and 3 (CI width: 0.563), indicating greater precision of estimates, in contrast to greater variability in performance and output results for domains 2 and 3. Domain 3 has the widest confidence interval, extending to MCC < 0.500, suggesting the possibility of estimates falling within a lower range, likely due to a larger proportion of criteria with more open-ended meanings that are susceptible to interpretation.

Median MCC results by domain, and numeric criterion totals by cluster within each domain can be found in Tables 6 and 7 respectively. The list of all criteria and corresponding domains can be found in Table 5.

Table 6 Median F1, BA scores and MCC at the criterion domain level (excluding information limited criteria).
Table 7 Number of criteria by performance category within each domain.

Article-level analysis

Analysis at the article level, excluding information limited criteria, displayed generally high scores across all key performance metrics, indicating strong model discrimination at the article level, with an overall mean F1 score of 0.904, BA of 0.911 and MCC of 0.827. Several articles achieved perfect scores (1.000) across all 3 metrics, while one article fared poorly because of a lack of positive cases. The typical article performed well, with an overall median F1 score of 0.875, BA of 0.929 and MCC of 0.789. Since analysis at the article level subsumes multiple types of criteria, each with unique performance characteristics, into a single overall score, it is not possible to fully appraise a model’s performance based on article-level results alone. Article-level analysis is reported in supplementary information file 1, Table 4.

Error analysis

Error analysis was tabulated quantitatively at the criterion and criterion domain level, excluding information limited criteria. The balanced error rate (BER), the average of the false positive and false negative rates, was tabulated to reflect the overall rate of misclassification. A higher rate indicates a model that is more error prone towards identifying either false positive or false negative cases. Results show domain 3 ‘analysis and findings’ to have the highest aggregate BER at 0.151 (FPR/FNR = 0.111/0.190), followed by domain 1 ‘research team and reflexivity’ at 0.091 (0.068/0.115) and domain 2 ‘study design’ at 0.071 (0.030/0.111). Although most individual criteria had a low to moderate BER of < 0.25, 3 criteria had elevated error rates: ‘description of the coding tree’ (C25), ‘clarity of minor themes’ (C32) and ‘credentials’ (C2) had the highest BERs, at 0.392 (FPR/FNR = 0.750/0.033), 0.322 (0.269/0.375) and 0.281 (0.500/0.063) respectively. While C25 and C2 were more prone towards FP errors, C32 was inclined towards FN errors.

BER results for each criterion are illustrated graphically in Figure 5, with full results in supplementary information file 1, Table 3.

Figure 5

Balanced error rate, grouped by domain. BER = 0.5 × (FPR + FNR), where values > 0.5 indicate error rates worse than chance. Information limited criteria are excluded.

Discussion

We undertake a rigorous, data-driven approach to evaluate the performance of Claude in assessing qualitative articles against a consensus-based, objective set of criteria, with clinical implications for evidence-based medicine. LLMs such as Claude can be used as evaluation assistants where consensus-based criteria have been clearly demarcated and adequately defined in standards or checklists, to accelerate research in health communication, medication adherence, patient literacy and other health domains24,25,26.

A range of quantitative metrics was used to identify areas where Claude performs well without extensive pre-prompting in assessing adherence to a comprehensive criteria list. Results reveal 4 key performance clusters, namely (a) balanced, (b) under-reported, (c) mixed errors and (d) information limited clusters, suggesting varied outcomes. Criteria that fall within the balanced cluster, such as ‘occupation’ (C3) and ‘data saturation’ (C22), are typically clearly defined, distinct and well reported, encapsulating performance consistency that can be extrapolated across a diverse set of articles. Criteria categorised in the near balanced sub-cluster, such as ‘gender’ (C4) and ‘participant checking’ (C28), require further prompt enclosure to allow a model to achieve balanced assessment levels. Enclosed prompt adjustments include specifying or suggesting where each criterion may be located within an article or how it is conventionally described, for example, stating within a prompting sequence how the “participant checking (C28) criterion is usually reported in the methods section…”, or providing information ‘flags’ commonly related to a criterion that allow a model to detect the criterion more sensitively. Information flags for C28 may include the following example phrase within a prompt: “Participants in a study are usually provided a summary of results and are asked for their feedback. Feedback refers to thoughts and opinions about the study that they have participated in.”
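As a hypothetical illustration of the prompt enclosure and information-flag adjustments described above, such guidance could be stored per criterion and appended to the base assessment prompt; the wording below is illustrative and not taken from the study’s prompts.

```python
# Hypothetical information flags for near balanced criteria; wording is illustrative only.
PROMPT_FLAGS = {
    "C28 participant checking": (
        "This criterion is usually reported in the methods section. Participants are "
        "usually provided a summary of results and asked for their feedback, i.e. thoughts "
        "and opinions about the study they have participated in."
    ),
    "C4 gender": (
        "Gender may be stated explicitly or inferred from an interviewer's name; treat an "
        "explicit statement as stronger evidence than an inferred one."
    ),
}

def augment_prompt(base_prompt: str, criterion_key: str) -> str:
    """Append the information flag for a given criterion to the base assessment prompt."""
    return base_prompt + "\n\nGuidance for " + criterion_key + ": " + PROMPT_FLAGS[criterion_key]
```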

For criteria that fall within the under-reported cluster or the under-reported inclined sub-cluster within the mixed errors group (where FNR > FPR), results suggest that multi-shot prompts that aim to clarify criterion definitions iteratively may be required to further parse the definitions provided in guidelines. Criteria with broad definitions include ‘relationship established’ (C6) and ‘interviewer characteristics’ (C8) in the under-reported cluster, and ‘setting of data collection’ (C14) and ‘number of data coders’ (C24) in the under-reported inclined sub-cluster. Paraphrasing or elaborating on primary definitions can allow the meaning of fundamental words to be explicated in detail so that polysemantic meanings are narrowed. This means interrogating the meaning of concepts such as ‘relationship’, ‘characteristics’, ‘setting’ and ‘data coders’ that are often syntactically determined and contextually dependent. In contrast, criteria that fall within the over-reported inclined or ambiguous sub-clusters, such as ‘credentials’ (C2) or ‘clarity of minor themes’ (C32), require prompt strategies that aim for exactness, through the use of multiple stated examples or hypothetical scenarios that can concretely explicate the preferred output that an LLM should generate for a given criterion.

Interestingly, even though effort was taken at the outset to define ‘credentials’ (C2) in this study (see Figure 3), where an illustration was provided of what an output should look like, output by Claude was lacklustre for this criterion, resulting in a low MCC score and a high BER inclined towards FP (FPR/FNR = 0.500/0.063). Such intractable criteria may require a combination of prompt approaches to produce optimal outcomes. Approaches include pre-requesting examples from an AI model to describe or expand upon a given conceptual term prior to prompt iteration, allowing for chain-of-thought reasoning, using multi-shot prompts, and avoiding common errors such as providing irrelevant, redundant or conflicting instructions27,28,29,30.

The substantial proportion of criteria categorized within the information limited cluster highlights that a larger number of articles is needed to confirm whether discriminative ability holds true for each criterion in this category. This means collating sufficient articles that trend in a balanced way towards both positive and negative cases for fair evaluation. In reality, gathering a balanced dataset may not be feasible for health and clinical research groups that are driven mainly by clinical hypotheses, since the evaluation of articles usually comes after a pool of articles has already been extracted from databases. Future research should consider a large-scale study dedicated solely to examining the performance of generative AI models in assessing articles, to explore both narrower and wider health domains and to check for comparative performance between domains.

Implications for clinical research

Results suggest that a customised or tailored strategy may be needed for researchers who plan to use AI models to assess whether research articles adhere to consensus-based standardised guidelines. Careful preparation should be given to the development of prompts and the use of performance metrics to ensure coherence between guidelines and output results. A balanced profile, indicated by consistent performance over a range of metrics, provides confirmation that a criterion can be evaluated by a model reliably. For criteria that indicate under-reported, over-reported or ambiguous outcomes, additional clarificatory or narrower, demarcating prompts may be required to achieve optimal outputs.

Categorisation into performance clusters can help clinical teams stratify levels of performance and determine where an AI model or prompt sequence needs fine-tuning. Researchers should be cognizant that current consensus-based or standardised reporting guidelines may not be developed in an ideal format for models to evaluate (e.g. STROBE for cross-sectional, observational studies; PRISMA for systematic reviews)31,32 and should thus tailor prompt approaches to typologies of criteria even as new AI-relevant guidelines such as PRISMA-AI are being developed33. The new AI guidelines plan to look specifically at the reporting of systematic reviews related to AI topics such as machine learning, deep learning and neural networks, but it is unclear whether this will include the utilisation of AI to check for adherence to standardised checklists or guidelines.

It is likely that new reporting checklists developed for generative AI-assisted scoping or systematic reviews will be required in the future, similar to the CONSORT-AI guidelines used for the reporting of AI systems as interventions34 or the TRIPOD-AI guidelines used for diagnostic or prognostic prediction models35. Reporting checklists will need to be developed and structured optimally based on real-world feedback from healthcare professionals and users, and be sensitive to how systematic, scoping or literature reviews are usually conducted. Checklists should ideally provide additional detailed guidance on text descriptions that may appear in varied forms across articles.

To the best of our knowledge, this is the first study to use a quantitative, data-driven approach to assess qualitative research articles’ adherence to a consensus-based, objective guideline using a generative AI model.

Limitations

One limitation of this study is the relatively small number and range of research articles assessed, which limits the scope of the results. Evaluating a larger pool of articles can provide a more robust understanding of how a generative AI model performs when presented with a more extensive or complex dataset, allowing for the generation of more precise estimates and a fuller test of a model’s discriminant abilities. A larger scale study involving comparisons between different models, using multiple criteria or standardised checklists as well as multi-modal benchmark tools simultaneously, can provide a deeper understanding of how generative AI outputs align with human inputs in different contexts, and help gauge which specific models are more reliable for research evaluation.

Although prompts were developed iteratively in this study at the initial stage using an approximation approach, a full-fledged, systematic ‘prompt engineering’ strategy would be ideal to test prompts extensively before they are applied to different articles. It is known that the quality of prompts can substantially affect the quality of outputs that a model generates36,37,38. One specific technique, substituting similar words and phrases iteratively based on an initial parsing of results until satisfactory wordings are attained39, would have enhanced the generation of model outputs and allowed for a more multi-faceted analysis.

An additional limitation is the use of binary quantitative results (TP, TN/FP, FN) extracted from a classification model (confusion matrix) as the main driver of model assessment ability. Text generated by Claude was not included in this study, which would have allowed for a more comprehensive performance analysis or a comparison between textual and quantitative results.

Conclusion

Although LLMs can support and accelerate the evaluation of qualitative research findings based on consensus-based guidelines, the quality of output depends significantly on prompts that are well calibrated and measured using a range of performance metrics. Near balanced criteria require prompt enclosure adjustments or information flags to achieve a more balanced performance, while under-reported criteria require paraphrasing or interrogation of key concepts so that polysemantic meanings are narrowed. Over-reported criteria require the use of concretely stated examples or hypothetical scenarios to achieve balance.

Conversely, criteria with limited information require a larger sample of articles for assessment to determine that performance is not primarily due to the propensity of truth-values to fall within one class (T/F). Segmenting criteria into performance clusters allows researchers to identify areas of incongruence, so that specific strategies to modify prompts can be applied for any given set of research articles. Customised approaches that are expertly crafted can allow for the rapid extraction of valuable insights from articles to inform patient-centred recommendations and practice guidelines.