Introduction

Metacognition is classically defined as knowing about knowing1. Within this broad construct, the term “metacognitive ability” refers more narrowly to the capacity to evaluate one’s decisions by distinguishing between correct and incorrect answers2,3. High metacognitive ability allows us to have high confidence when we are correct but low confidence when we are wrong. Conversely, low metacognitive ability impairs the capacity of confidence ratings to distinguish between instances when we are correct and instances when we are wrong. Metacognitive ability is thus a critical capacity in human beings, linked to our ability to learn4, make good decisions5, interact with others6, and know ourselves7. As such, it is essential that we have tools to precisely measure metacognitive ability in human participants.

Metacognitive ability is typically assumed to be a somewhat stable trait with meaningful variability across people2,8,9. Consequently, metacognitive ability has been correlated with other stable individual differences, such as brain structure10,11,12,13. While metacognitive ability is often assumed to be domain-general and to rely on shared neural substrates, this question remains hotly debated14,15,16,17. The construct of metacognitive ability is also thought to be distinct from other constructs such as task skill or bias, so it is often desirable to find metrics of metacognitive ability unrelated to these other constructs18.

Below, I first examine the properties that one may desire in a measure of metacognition and then review the known properties of existing measures of metacognitive ability. This brief overview demonstrates that there is little we firmly know about the properties of existing measures of metacognition. The rest of the paper aims to fill this gap by providing a comprehensive test of the critical properties of many common measures of metacognition.

Before one can evaluate a given measure of metacognition, it is first necessary to determine what properties are important or desirable. Since there is no existing list of desirable properties, I start by creating one here (Supplementary Table 1) and discuss each property below.

The most important property of any measure is that it is valid: namely, it should measure what it purports to measure19. Existing measures of metacognitive ability assess the degree to which confidence is associated with objective reality, thus making them face valid. Still, we lack a formal way of verifying the validity of existing measures. A related property is precision. I use the term “precision” following its definitions in the literature as “the ability to repeatedly measure a variable with a constant true score and obtain similar results”20, “the margin of error” in a measurement21, or “the spread of values that would be expected across multiple measurement attempts”22. Note that precision here does not refer to whether a measure is only affected by the construct of interest. Precision has been largely ignored in the context of measures of metacognition, and we currently lack methods to measure it. Here I develop a simple and intuitive method for assessing both the validity and the precision of metacognition measures. The method demonstrates that all existing measures of metacognition are valid but show some variation in precision.

Another critical property of measures of metacognition—one that is perhaps the most widely appreciated—is that such measures should be independent of various nuisance variables. Here a “nuisance variable” is any property of people’s behavior that is not directly related to their metacognitive ability.

The nuisance variable that has received the most attention is task performance. It is often desirable that a measure of metacognition not be affected by whether people happened to be performing an easy or a difficult task3,18. For example, in visual perception tasks with confidence, there is little reason to believe that the underlying metacognitive ability should be affected by stimulus contrast. Thus, one may want to measure the same metacognitive ability regardless of contrast level. Note that there are subtleties here. If difficulty is manipulated by introducing cognitive load or other task demands that may tax the metacognitive system, then one would not necessarily expect metacognitive ability to remain the same (though whether metacognitive ability is affected by working memory load remains a topic of debate23,24,25). Therefore, the logic here applies more readily to stimulus than to task manipulations. That said, even if one does not agree that metacognitive ability should be independent from task performance, examining how each measure depends on task performance is still informative, especially if there are meaningful differences between measures. Task performance can be computed as d’, a measure of sensitivity derived from signal detection theory (SDT).

A second nuisance variable is response bias, that is, the tendency to select one response category more than another18. For two-choice tasks, this variable can be quantified as the decision criterion, c, derived from SDT. Response bias is under strategic control in that participants can freely choose to select one stimulus category more often than the other. In fact, they consistently do so in response to experimental manipulations such as expectation or reward26. As such, measures of metacognitive ability should ideally remain independent of response bias.

The final nuisance variable is metacognitive bias, that is, the tendency of people to be biased towards the lower or upper ranges of the confidence scale27,28. This variable can be quantified simply as the average confidence across all trials. As with response bias, metacognitive bias is under strategic control in that participants can freely choose to use lower or higher confidence. As such, measures of metacognitive ability should ideally remain independent of metacognitive bias because we do not want to measure a different ability if people purposefully choose to use predominantly low or high confidence ratings3. The logic here is similar to the logic in SDT, where the measure of performance (d’) is designed to be mathematically independent from the measure of response bias (c)29. In the case of SDT, we interpret high d’ values as showing a high ability to perform the task even if the participant exhibits an extreme bias and, consequently, a low percentage of correct responses. Similarly, this paper, following the standard in the field18, adopts the perspective that measures of metacognitive ability should be independent of metacognitive bias.

Task performance, response bias, and metacognitive bias are arguably the primary nuisance variables that a measure of metacognitive ability should be independent of (Supplementary Table 2). They are also variables that can be measured in any design that also allows the measurement of metacognitive ability. It is possible to add more variables to this list (e.g., reaction time30) but the current paper only examines these three variables.
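For concreteness, the minimal sketch below shows how these three nuisance variables can be computed from trial-level data in a two-choice design. The function and variable names are illustrative, and the guard against extreme hit and false-alarm rates is one common convention rather than necessarily the correction used in this paper.

```python
import numpy as np
from scipy.stats import norm

def nuisance_variables(stimulus, response, confidence):
    """Compute d', criterion c, and mean confidence for a 2-choice task.

    stimulus, response : arrays coded as 0/1 for the two categories
    confidence         : array of confidence ratings (any scale)
    """
    stimulus = np.asarray(stimulus)
    response = np.asarray(response)

    # Hit and false-alarm rates, treating category 1 as the "signal"
    hit_rate = np.mean(response[stimulus == 1] == 1)
    fa_rate = np.mean(response[stimulus == 0] == 1)

    # Guard against rates of exactly 0 or 1 (illustrative correction)
    n1, n0 = np.sum(stimulus == 1), np.sum(stimulus == 0)
    hit_rate = np.clip(hit_rate, 0.5 / n1, 1 - 0.5 / n1)
    fa_rate = np.clip(fa_rate, 0.5 / n0, 1 - 0.5 / n0)

    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)              # task performance
    criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))   # response bias
    mean_conf = np.mean(confidence)                               # metacognitive bias

    return d_prime, criterion, mean_conf
```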

The final critical property of measures of metacognition is that they should be reliable. This property is critical for studies of individual differences. This paper examines both split-half and test–retest reliability.

Having reviewed the desirable properties of measures of metacognition, let us now turn our attention to the existing measures of metacognitive ability. One popular measure is the area under the Type 2 ROC function31, also known as AUC2. Other popular measures are the Goodman–Kruskal Gamma coefficient (or just Gamma), which is essentially a rank correlation between trial-by-trial confidence and accuracy32, and the Pearson correlation between trial-by-trial confidence and accuracy (known as Phi33). Another simple but less frequently used measure is the difference between the average confidence on correct trials and the average confidence on error trials (which I call ΔConf).
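As an illustration, the sketch below computes these four traditional measures from trial-level accuracy and confidence. AUC2 is obtained here via the Mann–Whitney U statistic (the probability that a randomly chosen correct trial receives higher confidence than a randomly chosen error trial, with ties split); the exact implementations used in this paper may differ in detail.

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

def traditional_measures(accuracy, confidence):
    """AUC2, Gamma, Phi, and delta-Conf from trial-level data.

    accuracy   : array of 0/1 (error/correct)
    confidence : array of confidence ratings
    """
    accuracy = np.asarray(accuracy)
    confidence = np.asarray(confidence)
    conf_c = confidence[accuracy == 1]   # confidence on correct trials
    conf_e = confidence[accuracy == 0]   # confidence on error trials

    # Type 2 AUC via the Mann-Whitney U statistic
    U = mannwhitneyu(conf_c, conf_e).statistic
    auc2 = U / (len(conf_c) * len(conf_e))

    # Goodman-Kruskal Gamma: concordant vs. discordant (correct, error) pairs
    diffs = conf_c[:, None] - conf_e[None, :]
    concordant, discordant = np.sum(diffs > 0), np.sum(diffs < 0)
    gamma = (concordant - discordant) / (concordant + discordant)

    # Phi: Pearson correlation between confidence and accuracy
    phi = pearsonr(confidence, accuracy)[0]

    # delta-Conf: difference in mean confidence between correct and error trials
    delta_conf = conf_c.mean() - conf_e.mean()

    return auc2, gamma, phi, delta_conf
```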

While all four of these traditional measures are intuitively appealing, they are all thought to be influenced by the primary task performance18. To address this issue, Maniscalco and Lau34 developed a new approach to measuring metacognitive ability where one can estimate the sensitivity, meta-d’, exhibited by the confidence ratings. Because meta-d’ is expressed in the units of d’, Maniscalco and Lau then reasoned that meta-d’ can be normalized by the observed d’ to obtain either a ratio measure (M-Ratio, equal to meta-d’/d’) or a difference measure (M-Diff, equal to meta-d’ − d’). These measures are often assumed to be independent of task performance18.
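Estimating meta-d’ itself requires fitting the Type 2 ROC (e.g., with the maximum-likelihood procedure of Maniscalco and Lau34), which is not reproduced here. Assuming such a fit is available, the normalization step is simply:

```python
def normalize_meta_d(meta_d, d_prime):
    """Normalize a fitted meta-d' by the observed d' (both values are assumed
    to come from an external meta-d' fit, e.g., Maniscalco & Lau's procedure)."""
    m_ratio = meta_d / d_prime   # ratio measure: meta-d'/d'
    m_diff = meta_d - d_prime    # difference measure: meta-d' - d'
    return m_ratio, m_diff
```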

The normalization introduced by Maniscalco and Lau34 has only been applied to the measure meta-d’ (resulting in the measures M-Ratio and M-Diff), but there is no theoretical reason why a conceptually similar correction cannot be applied to the traditional measures above. Consequently, here I develop eight new measures where one of the traditional measures of metacognitive ability is turned into either a ratio (AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) or a difference (AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) measure. The logic is that a given measure (e.g., AUC2) is computed once using the observed data (obtaining, e.g., \(\mathrm{AUC2}_{\mathrm{observed}}\)) and a second time using the predictions of SDT given the observed sensitivity and decision criterion (obtaining, e.g., \(\mathrm{AUC2}_{\mathrm{expected}}\)). One can then take either the ratio or the difference between the observed and the SDT-predicted quantities.
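A Monte Carlo sketch of this general correction scheme is given below. The expected value of a measure under SDT is approximated by simulating an equal-variance SDT observer with the observed d’ and criterion and assigning confidence from the distance to the criterion, binned so that the simulated confidence frequencies match the observed ones. This is an illustrative approximation; the exact way the SDT-expected values are computed in this paper may differ (e.g., analytically).

```python
import numpy as np

def sdt_expected_measure(measure_fn, d_prime, criterion, observed_conf,
                         n_sim=100_000, seed=0):
    """Approximate the value of a metacognition measure expected from an ideal
    equal-variance SDT observer with the observed d' and criterion.

    measure_fn    : function(accuracy, confidence) -> scalar (e.g., AUC2)
    observed_conf : the participant's confidence ratings, used only to match
                    the marginal distribution of confidence levels
    """
    rng = np.random.default_rng(seed)

    # Simulate the decision variable for the two stimulus categories
    stim = rng.integers(0, 2, n_sim)
    x = rng.normal(loc=np.where(stim == 1, d_prime / 2, -d_prime / 2), scale=1.0)
    resp = (x > criterion).astype(int)
    accuracy = (resp == stim).astype(int)

    # Assign confidence from the distance to the criterion, binned so that the
    # simulated confidence frequencies match the observed ones
    levels, counts = np.unique(observed_conf, return_counts=True)
    cum_props = np.cumsum(counts) / counts.sum()
    evidence = np.abs(x - criterion)
    cutoffs = np.quantile(evidence, cum_props[:-1])
    confidence = levels[np.searchsorted(cutoffs, evidence)]

    return measure_fn(accuracy, confidence)

def corrected_measures(measure_fn, accuracy, confidence, d_prime, criterion):
    """Ratio- and difference-corrected versions of a traditional measure."""
    observed = measure_fn(accuracy, confidence)
    expected = sdt_expected_measure(measure_fn, d_prime, criterion, confidence)
    return observed / expected, observed - expected
```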

Finally, one important limitation of all measures above is that they are not derived from a process model of metacognition. In other words, none of these measures are based on an explicit model of how confidence judgments may be corrupted. Recently, Shekhar and Rahnev27 developed a process model of metacognition—the lognormal meta noise model—that is based on SDT assumptions but with the addition of lognormally distributed metacognitive noise. This metacognitive noise corrupts the confidence ratings but not the initial decision and, in the model, takes the form of confidence criteria that are sampled from a lognormal distribution rather than having constant values. The metacognitive noise parameter (\(\sigma_{\text{meta}}\), referred to here as meta-noise) can then be used as a measure of metacognitive ability. A similar approach was taken by Boundy-Singer et al.35, who developed another process model of metacognition, CASANDRE, based on the notion that people are uncertain about the uncertainty in their internal representations. The second-order uncertainty parameter (meta-uncertainty) thus represents another possible measure of metacognitive ability.

This paper examines the properties of all 17 measures of metacognition introduced above (for a summary, see Table 1). First, however, I briefly review the previous literature on the properties of these measures.

Table 1 Measures of metacognition examined in the current paper

Given the importance of using measures with good psychometric properties, it is perhaps surprising that the published literature contains very little empirical investigation into the properties of the different measures of metacognition. For example, no paper to date has examined the precision of any existing measure. Several papers have relied exclusively on simulations to investigate some of the properties of measures of metacognition36,37. Such investigations are important but cannot substitute for empirical studies because it is a priori unknown how well the process models used to simulate data reflect empirical reality. Evans and Azzopardi38 empirically showed that a specific measure of metacognition, Kunimoto’s a’39, exhibits a strong dependence on response bias. Because Kunimoto’s a’ is built on incorrect distributional assumptions40, it is not investigated here. Finally, several older papers investigated the theoretical properties of several measures independently of any simulations or empirical data32, but this approach cannot be used to establish the empirical properties of the measures under consideration.

Only recently, Shekhar and Rahnev27 examined the dependence on both task performance and metacognitive bias for five measures: meta-d’, M-Ratio, AUC2, Phi, and meta-noise. They found that meta-d’, AUC2, and Phi strongly depend on task performance, but M-Ratio and meta-noise do not. On the other hand, meta-d’, M-Ratio, AUC2, and Phi have a complex dependence on metacognitive bias, while only meta-noise appeared independent of it. Guggenmos41 examined both the split-half reliability and the across-participant correlation between d’ and several measures of metacognition (meta-d’, M-Ratio, M-Diff, and AUC2), finding surprisingly low reliability and significant correlations with d’ for all measures. Relatedly, Kopcanova et al.14 examined the test-retest reliability of M-Ratio and also found low reliability values. Another paper developed a new technique to examine dependence on metacognitive bias and found that meta-d’ and M-Ratio are not independent of metacognitive bias28. Finally, Boundy-Singer et al.35 showed that meta-uncertainty appears to have high test–retest reliability and only a weak dependence on task performance and metacognitive bias.

As this brief overview demonstrates, most previous investigations only focused on a few measures of metacognition, only examined a few of the critical properties of interest, and often did not make use of empirical data. Here, I empirically examine each of the critical properties for all 17 measures of metacognition introduced above. To do so, I make use of six large datasets27,42,43,44,45,46 (Table 2) all made available on the Confidence Database47. All datasets involve 2-choice tasks because most measures of metacognition only apply to 2-choice tasks.

Table 2 Datasets used in the current paper

Overall, I find that no current measure of metacognitive ability is “perfect” in the sense of possessing all desirable properties. Nevertheless, the measures are not equivalent either, with many important differences emerging between them. Based on these results, I make recommendations for which measures of metacognition to use depending on the specific analysis goals.

Results

Here I assess the properties of 17 measures of metacognition. Specifically, I focus on each measure’s (1) validity and precision, (2) dependence on nuisance variables, and (3) reliability. To examine each of these properties, I use six existing datasets (Table 2) from the Confidence Database. For each property, I analyze the data from between one and three of the six datasets. In addition, I compute precision and reliabilities using 50, 100, 200, or 400 trials at a time to clarify how these measures behave for different amounts of underlying data.

Validity and precision

Perhaps the most important requirement for any measure is that it is both valid and precise19,20,21,22,48. In other words, a measure should reflect the quantity it purports to measure, and it should do so with a high level of quantitative accuracy. However, despite the importance of both criteria, there has been no formal method to assess the validity or precision of measures of metacognition.

Here I develop a simple method for assessing both properties. The method selects a small proportion of trials and, for those trials, decreases confidence by 1 point on each correct trial and increases confidence by 1 point on each incorrect trial. This manipulation artificially decreases the informativeness of the confidence ratings. A valid measure of metacognition should therefore show a drop when applied to these altered data. The size of the drop relative to the normal fluctuations of the measure quantifies the precision of the measure (i.e., if the drop is large relative to the background fluctuations, the measure has a high level of precision).
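A minimal sketch of this corruption-and-precision procedure is shown below. The confidence-scale bounds and the clipping of shifted ratings at those bounds are illustrative assumptions (the exact handling of ratings at the ends of the scale is not specified here), and `measure_fn` stands for any of the measures considered in this paper.

```python
import numpy as np

def corrupt_confidence(accuracy, confidence, fraction=0.02,
                       conf_min=1, conf_max=4, seed=0):
    """Corrupt a given fraction of trials: lower confidence by 1 on correct
    trials and raise it by 1 on error trials (clipped to an assumed scale)."""
    rng = np.random.default_rng(seed)
    confidence = np.asarray(confidence, dtype=float).copy()
    accuracy = np.asarray(accuracy)
    idx = rng.choice(len(confidence),
                     size=int(round(fraction * len(confidence))), replace=False)
    shift = np.where(accuracy[idx] == 1, -1, 1)
    confidence[idx] = np.clip(confidence[idx] + shift, conf_min, conf_max)
    return confidence

def precision(measure_fn, accuracy, confidence, bin_size=100, fraction=0.02):
    """Drop in the measure after corruption, in units of the SD of the
    uncorrupted measure across non-overlapping bins (larger = more precise)."""
    accuracy, confidence = np.asarray(accuracy), np.asarray(confidence)
    clean, corrupted = [], []
    for b in range(len(confidence) // bin_size):
        sl = slice(b * bin_size, (b + 1) * bin_size)
        acc, conf = accuracy[sl], confidence[sl]
        clean.append(measure_fn(acc, conf))
        corrupted.append(measure_fn(acc, corrupt_confidence(acc, conf, fraction)))
    clean, corrupted = np.array(clean), np.array(corrupted)
    return (clean.mean() - corrupted.mean()) / clean.std(ddof=1)
```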

To quantify the precision of existing measures of metacognition, one would ideally use a dataset with a very large number of trials coming from a single experimental condition because mixing conditions can strongly impact metacognitive scores49. Consequently, I selected the two datasets from the Confidence Database with the largest number of trials per participant that also had a single experimental condition: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In each case, I examined the results of altering 2, 4, and 6% of all trials and computed metacognitive scores using bins of 50, 100, 200, and 400 trials.

The results showed that all 17 measures are valid in that metacognitive scores decreased when confidence ratings were artificially corrupted (Fig. 1). The decrease in each measure was roughly a linear function of the percentage of trials corrupted. For example, in the Haddara dataset, the values of meta-d’ decreased from an average of 1.14 without any corruption to averages of 0.98, 0.84, and 0.72 when 2%, 4%, and 6% of trials were corrupted, respectively (for an average drop of about 0.14 for every 2% of trials corrupted). However, this drop is difficult to compare between measures because different measures are on different scales (e.g., meta-d’ normally takes values between 0 and \(\infty\), whereas AUC2 normally takes values between 0.5 and 1). Therefore, to obtain values that are easy to interpret and compare, one can normalize the average drop after corruption by the standard deviation (SD) of the observed values across different subsets of trials in the absence of any corruption. Because the SD value is larger for smaller bin sizes—reflecting the greater noisiness of each measure when few trials are used—the results show that larger bin sizes lead to greater precision of the measures (Fig. 1a). Indeed, across the 17 measures, corrupting 2% of the trials led to an average decrease of 0.35, 0.50, 0.70, and 1.04 SDs in the measured metacognitive ability value for bins of 50, 100, 200, and 400 trials, respectively.

Fig. 1: Validity and precision of each measure.
figure 1

Results of an artificial corruption of the confidence ratings where confidence for correct trials was decreased by 1, and confidence for incorrect trials was increased by 1. a Detailed results for the Haddara dataset (for detailed results on the Maniscalco dataset, see Supplementary Fig. 1). Each one of the 17 measures of metacognition showed a decrease with this manipulation. The plot shows the decrease in units of the standard deviation (SD) of the measure’s fluctuations across different bins. The decrease was computed for bin sizes of 50, 100, 200, and 400 trials, as well as for 2, 4, and 6% of trials being corrupted. b Normalized precision for all 17 measures in each of the two datasets (Haddara and Maniscalco). The precision values are normalized such that the average precision level of the first 16 measures equals 1 in each of the two datasets. As can be seen, meta-uncertainty has a substantially lower level of precision than the rest of the measures. The differences between the remaining measures are not always trivial but tend to be smaller.

This technique allows us to compare the precision of different measures. To simplify the comparison, I averaged the decreases across the four different bin sizes and the three levels of corruption (2, 4, and 6%; Fig. 1b,c). These analyses revealed that the precision scores were overall higher in the Haddara than in the Maniscalco dataset. This difference likely stems from variables such as sensitivity and metacognitive bias that vary across datasets. Therefore, the technique introduced here is useful for comparing different measures but is unlikely to be useful for comparing values across different datasets.

More importantly, most measures of metacognition showed comparable levels of precision (Fig. 1b,c). The one exception was the measure meta-uncertainty, which had a substantially lower average precision score in both the Haddara (meta-uncertainty: 0.37; average of other measures: 0.67; ratio = 0.56) and the Maniscalco datasets (meta-uncertainty: 0.30; average of other measures: 0.53; ratio = 0.58). Indeed, pairwise comparisons showed that, without multiple comparison correction, the precision for meta-uncertainty was significantly lower than that of every one of the other 16 measures in both datasets (p < 0.05 for all 32 comparisons). In the Haddara dataset, 15 of the 16 comparisons remained significant even after applying a very conservative Bonferroni correction for the existence of \(\frac{17\times 16}{2}=136\) pairwise comparisons; in the smaller Maniscalco dataset, no comparison remained significant after this conservative correction. This difference between meta-uncertainty and the remaining measures may stem from the noisiness of the process of estimating meta-uncertainty in the presence of relatively few trials. In fact, the original authors who introduced meta-uncertainty already warned about the dangers of trying to compute this variable using low trial numbers35.

The differences between the remaining measures were much smaller and, in some cases, inconsistent across the two datasets. The differences between all other pairs of measures were never significant (at p < 0.05, uncorrected) in both the Haddara and Maniscalco datasets. Nevertheless, there appear to be some small but consistent differences between measures, such that meta-d’, Gamma, Phi, Gamma-Diff, Phi-Diff, and meta-noise show above-average precision, whereas AUC2, ΔConf, and ΔConf-Diff show below-average precision (Fig. 1d). Overall, these analyses suggest that all measures of metacognition investigated here are valid, and that most have a comparable level of precision except for meta-uncertainty, which appears to be noisier than the remaining measures. Whether the differences between the remaining measures are meaningful remains to be demonstrated.

Dependence on nuisance variables

Beyond validity and precision, another important feature for good measures of metacognition is that they should not be influenced by nuisance variables. Here I examine three nuisance variables—task performance, metacognitive bias, and response bias—and test how much each of these variables affects each of the 17 measures of metacognition.

Dependence on task performance

The most widely recognized nuisance variable for measures of metacognition is task performance18. The reason that task performance is a nuisance variable is that an ideal measure of metacognition should not be affected by whether a participant happens to be given an easier or a more difficult task. That is, the participant’s estimated ability to provide informative confidence ratings should not change based on the difficulty of the object-level task that they are asked to perform. As mentioned earlier, this logic does not apply well to task manipulations, which is why I only examine stimulus manipulations here.

To quantify how task performance affects measures of metacognition, one needs datasets with multiple difficulty conditions and a large number of trials (either many participants or many trials per participant). I selected three datasets from the Confidence Database that meet these characteristics: Shekhar (3 difficulty levels, 20 participants, 2800 trials/sub, 56,000 total trials), Rouault1 (70 difficulty levels, 466 participants, 210 trials/sub, 97,860 total trials), and Rouault2 (many difficulty levels, 484 participants, 210 trials/sub, 101,640 total trials). Both Rouault datasets have a large range of difficulty levels, which I split into low and high via a median split. I then computed each measure separately for each difficulty level and compared them using t-tests.
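The comparison for one pair of difficulty levels can be sketched as follows. The per-participant scores are assumed to have been computed separately for the hardest and easiest levels, and the Cohen’s d convention used here (based on the paired differences) is one common choice rather than necessarily the one behind the values reported below.

```python
import numpy as np
from scipy.stats import ttest_rel

def performance_dependence(scores_hard, scores_easy):
    """Paired comparison of one metacognitive measure between the hardest and
    easiest difficulty levels (arrays of per-participant scores)."""
    scores_hard = np.asarray(scores_hard, dtype=float)
    scores_easy = np.asarray(scores_easy, dtype=float)

    # Two-sided paired t-test (uncorrected)
    t, p = ttest_rel(scores_easy, scores_hard)

    # Cohen's d from the paired differences; positive values indicate higher
    # metacognitive scores in the easier condition
    diff = scores_easy - scores_hard
    cohens_d = diff.mean() / diff.std(ddof=1)

    return t, p, cohens_d
```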

The results showed that all traditional measures that are not normalized in any way (i.e., meta-d’, AUC2, Gamma, Phi, and ΔConf) are strongly dependent on task performance: they all substantially increase as the task becomes easier (p < 0.001 for all five measures and three datasets; Fig. 2a; Supplementary Tables 3–5; see Supplementary Fig. 2 for the same plots as a function of difficulty level instead of d’ level). Critically, the increase across the five measures from the most difficult to the easiest condition had a very large effect size (Cohen’s d = 2.47, 2.29, 2.95, 1.34, and 1.81 for each of the five measures after averaging across the three datasets; Fig. 2b).

Fig. 2: Dependence of estimated metacognitive scores on task performance.
figure 2

a Estimated metacognitive ability for all 17 measures, as well as d’, criterion, and confidence for different difficulty levels in the Shekhar (n = 20), Rouault1 (n = 466), and Rouault2 (n = 484) datasets. Traditional measures of metacognition (top row) all showed a strong positive relationship with task performance, whereas all difference measures (third row) showed a strong negative relationship. Ratio measures (second row) and the two model-based measures (meta-noise and meta-uncertainty) performed much better but still showed weak relationships with task performance. Error bars showing SEM are displayed on both the x and y axes. Statistical results are based on uncorrected two-sided t-tests comparing the highest to lowest difficulty level within each dataset for each measure (see Supplementary Tables 3–5 for complete results). ***, p < 0.001; **, p < 0.01; *, p < 0.05; ns, not significant. b Effect sizes for dependence on task performance. Effect size (Cohen’s d) is plotted for each metric and dataset. As can be seen in the figure, non-normalized traditional measures (i.e., meta-d’, AUC2, Gamma, Phi, and ΔConf) show a strong positive relationship with task performance. Corrections with the ratio and difference methods reverse this relationship, with the ratio correction being clearly superior. The model-based metrics meta-noise and meta-uncertainty perform well too, with meta-uncertainty showing particularly low effect sizes.

Having established that these five measures strongly depend on task performance, I then examined whether normalizing them removes this dependence. The more popular method of normalization—the ratio method—indeed performed well. The average effect size (Cohen’s d) for M-Ratio, AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio was −0.18, −0.39, −0.11, −0.17, and −0.23, respectively. These are small effect sizes, except for AUC2-Ratio, which has a medium effect size. Nevertheless, it should be noted that the negative direction of the effect of task performance on metacognitive scores was consistent across all five measures and three datasets (with 9/15 tests being significant at p < 0.05; Supplementary Tables 3–5). Thus, while all ratio measures perform much better than the original metrics they are derived from, they tend to slightly overcorrect.

The five difference measures (M-Diff, AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) were much less effective in removing the dependence on task performance compared to their ratio counterparts. Indeed, they all exhibited an over-correction whereby easier conditions led to lower scores, with medium average Cohen’s d effect sizes (M-Diff: −0.58; AUC2-Diff: −0.49; Gamma-Diff: −0.39; Phi-Diff: −0.30; ΔConf-Diff: −0.55). Further, the relationship between task performance and the metacognitive score was significantly negative for all five measures and three datasets (p < 0.05 for all 15 tests; Supplementary Tables 3–5). These results demonstrate that the difference measures uniformly fail at their main purpose, which is to remove the dependence of metacognitive measures on task performance.

Finally, the two model-based measures (meta-noise and meta-uncertainty) showed relatively weak but still systematic relationships with task difficulty. Specifically, meta-noise decreased for easier conditions in all three datasets (average Cohen’s d = −0.29), whereas meta-uncertainty increased for easier conditions in all three datasets (average Cohen’s d = 0.06). Both effects were associated with relatively small Cohen’s d effect sizes that were comparable to what was observed for the ratio measures. As such, both model-based measures perform as well as the ratio measures in controlling for task performance. Given that meta-uncertainty corrected in the opposite direction of the other viable measures (the ratio measures and meta-noise) and had the lowest absolute Cohen’s d, studies that feature task performance confounds may benefit from performing analyses using both meta-uncertainty and at least one more measure.

Dependence on metacognitive bias

A less appreciated nuisance variable is metacognitive bias: the tendency to give low or high confidence ratings for a given level of performance. Metacognitive bias can be measured simply as the average confidence in a condition. Recently, Shekhar and Rahnev27 developed a method that involves recoding the original confidence ratings to examine how measures of metacognition depend on metacognitive bias. The method was further improved by Xue et al.28. The Xue et al. method consists of recoding confidence ratings so as to artificially induce a metacognitive bias toward lower or higher confidence ratings. Comparing the values obtained when a given measure of metacognition is applied to the recoded confidence ratings allows us to evaluate whether the measure is independent of metacognitive bias.
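To convey the logic of this test, the sketch below uses a deliberately simplified recoding that pushes all ratings one point toward the low or high end of an assumed scale; the actual recoding scheme of Xue et al.28 is more sophisticated and is not reproduced here.

```python
import numpy as np

def recode_confidence(confidence, direction, conf_min=1, conf_max=6):
    """Illustrative recoding toward the low or high end of the scale.

    direction : "low" or "high"
    """
    shift = -1 if direction == "low" else 1
    return np.clip(np.asarray(confidence) + shift, conf_min, conf_max)

# A measure that is independent of metacognitive bias should give similar
# values for measure_fn(accuracy, recode_confidence(conf, "low")) and
# measure_fn(accuracy, recode_confidence(conf, "high")).
```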

Similar to quantifying precision, quantifying how metacognitive bias affects measures of metacognition requires datasets with a very large number of trials coming from a single experimental condition. Consequently, I selected the same two datasets used to quantify precision, since they have the largest number of trials per participant while also featuring a single experimental condition: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In addition, I also used the Shekhar dataset (3 difficulty levels, 2800 trials per participant) but analyzed each difficulty level in isolation and then averaged the results across the three difficulty levels. For that dataset, the continuous confidence scale was first binned into six levels as in the original publication27.

The results demonstrated that meta-d’, AUC2, Phi, and ΔConf tend to increase with higher average confidence, whereas Gamma tends to decrease (Fig. 3a). The average (across the three datasets) Cohen’s d effect size was in the medium-to-large range for all five measures (meta-d’: 0.44; AUC2: 0.51; Gamma: −0.61; Phi: 0.81; ΔConf: 0.54; Fig. 3b). In other words, all five non-normalized measures of metacognition depend on metacognitive bias. All five ratio measures had a positive relationship with metacognitive bias but with smaller Cohen’s d effect sizes (M-Ratio: 0.27; AUC2-Ratio: 0.09; Gamma-Ratio: 0.001; Phi-Ratio: 0.23; ΔConf-Ratio: 0.42). Difference measures performed similarly to ratio measures (M-Diff: 0.43; AUC2-Diff: 0.10; Gamma-Diff: 0.24; Phi-Diff: 0.11; ΔConf-Diff: 0.34). Finally, the two model-based measures performed similarly to the ratio and difference measures and exhibited low-to-medium effect sizes that again went in opposite directions (meta-noise: −0.21; meta-uncertainty: 0.27). Note that the scores after recoding were similar to but slightly larger than the original metacognitive scores before recoding (Supplementary Fig. 3). Overall, researchers who want to control for metacognitive bias would appear to do best if they used AUC2-Ratio, Gamma-Ratio, AUC2-Diff, or Phi-Diff, as these all featured absolute effect sizes under 0.15. Nevertheless, given that meta-noise corrected in the opposite direction from the ratio and difference measures, it may be advisable for results obtained using one of those metrics to be reproduced with meta-noise.

Fig. 3: Dependence of estimated metacognitive scores on metacognitive bias.
figure 3

a Estimated metacognitive ability for all 17 measures, as well as d’, criterion, and confidence for data recoded to have lower or higher confidence in the Haddara (n = 70), Maniscalco (n = 22), and Shekhar (n = 20) datasets. Traditional measures of metacognition (top row) showed a medium-to-large positive relationship with metacognitive bias (except for Gamma, which showed a negative relationship). Ratio measures (second row) and the two model-based measures (meta-noise and meta-uncertainty) performed the best. Error bars show SEM. Statistical results are based on uncorrected two-sided t-tests comparing the high vs. low confidence recode within each dataset for each measure (see Supplementary Tables 6–8 for complete results). ***, p < 0.001; **, p < 0.01; *, p < 0.05; ns, not significant. b Effect sizes for dependence on metacognitive bias. Effect size (Cohen’s d) is plotted for each metric and dataset. As can be seen in the figure, all metrics except for Gamma and meta-noise have a mostly positive relationship with metacognitive bias (i.e., higher confidence leads to higher estimates of metacognition). The smallest absolute effect sizes (under 0.15) occurred for AUC2-Ratio, Gamma-Ratio, AUC2-Diff, and Phi-Diff, but many other measures exhibited effect sizes in the small-to-medium range.

Dependence on response bias

The final nuisance variable examined here is response bias. Response bias can be measured simply as the decision criterion c in signal detection theory. To understand how response bias affects measures of metacognition, one needs datasets where the response criterion is experimentally manipulated and confidence ratings are simultaneously collected. Very few such datasets exist and only a single such dataset is featured in the Confidence Database. The dataset—named here Locke44—features seven conditions with manipulations of both prior and reward. Rewards were manipulated by changing the payoff for correctly choosing category 1 vs. category 2 (e.g., R = 4:2 means that 4 vs. 2 points were given for correctly identifying categories 1 and 2, respectively), whereas priors were manipulated by informing participants about the probability of category 2 (e.g., P = 0.75 means that there was 75% probability of presenting category 2 and 25% probability of presenting category 1). The seven conditions were as follows: (1) P = 0.5, R = 3:3, (2) P = 0.75, R = 3:3, (3) P = 0.25, R = 3:3, (4) P = 0.5, R = 4:2, (5) P = 0.5, R = 2:4, (6) P = 0.75, R = 2:4, and (7) P = 0.25, R = 4:2. The Locke dataset included many trials per condition (700) but relatively few participants (N = 10) and collected confidence on a 2-point scale.

The results suggested that none of the 17 measures of metacognition are strongly influenced by response bias (Fig. 4a). Indeed, while a repeated measures ANOVA revealed a very strong effect of condition on response criterion (F(6,54) = 12.18, p < 0.001, \(\eta_p^2\) = 0.58), it showed no significant effect of condition on any of the measures of metacognition (all p’s > 0.13 for 17 tests; Supplementary Table 9). Critically, I computed the correlation between the estimated metacognitive ability for each of the 17 measures and the absolute value of the response criterion (i.e., |c|). The idea behind this analysis is to investigate whether more extreme response bias (either positive or negative) is associated with increases or decreases in estimated metacognitive ability. The results demonstrated that all correlation coefficients were very small (all r-values were between −0.04 and 0.21; Fig. 4b). There was a fair amount of uncertainty about these values, as seen by the wide error bars in Fig. 4b, so it is possible some of these relationships may be stronger than the current data suggest. Overall, these results should be interpreted with caution given the small sample size and the fact that a 2-point confidence scale may be noisier for estimating metacognitive scores. Nonetheless, these initial findings suggest that response bias may not have a large biasing effect on measures of metacognition.
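The two analyses just described can be sketched as follows, assuming a long-format table with illustrative column names (`subject`, `condition`, `measure`, `criterion`), one row per participant and condition.

```python
import numpy as np
from statsmodels.stats.anova import AnovaRM

def response_bias_analysis(df):
    """df: pandas DataFrame with columns 'subject', 'condition', 'measure',
    and 'criterion' (one row per subject x condition)."""
    # Repeated-measures ANOVA: does condition modulate the metacognitive measure?
    anova = AnovaRM(df, depvar="measure", subject="subject",
                    within=["condition"]).fit()

    # Within-participant correlation between the measure and |c| across the
    # conditions, averaged across participants
    df = df.assign(abs_c=df["criterion"].abs())
    corrs = df.groupby("subject").apply(
        lambda g: np.corrcoef(g["measure"], g["abs_c"])[0, 1]
    )
    return anova.anova_table, corrs.mean()
```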

Fig. 4: Dependence of estimated metacognitive scores on response bias.
figure 4

a Estimated metacognitive ability for all 17 measures, as well as d’, criterion, and confidence for the seven conditions in the Locke (n = 10) dataset. As expected, the condition strongly affected response criterion, c. Despite that, condition did not significantly modulate any of the 17 measures of metacognition. The seven conditions in the graph are arranged based on their average criterion values. Error bars show SEM. Statistical results are based on repeated measures ANOVAs testing for the effect of condition on each measure (see Supplementary Table 9 for complete results). ***, p < 0.001; ns, not significant. b Correlation with absolute response bias. Average correlation between estimated metacognitive ability and absolute response bias (i.e., |c|) for all 17 measures (n = 10). As can be seen from the figure, all relationships are relatively small, but there is still a fair amount of uncertainty around each value. Error bars show SEM.

Reliability

Measures of metacognition are often used in studies of individual differences to examine across-participant correlations between metacognitive ability and many different factors such as brain activity and structure10,11,50, metacognitive ability in other domains51,52, psychiatric symptom dimensions46, cognitive processes such as confidence leak12, etc. These types of studies require measures of metacognition to have high reliability. (Note that the reliability of a measure is enhanced by both high precision and a large spread of scores across participants, so both of these factors are important for between-subject analyses. In contrast, within-subject analyses only require high precision. Therefore, low reliability scores are not necessarily problematic for within-subject designs.)

Perhaps surprisingly, relatively little has been done to quantify the reliability of measures of metacognition (but see refs. 14,41). Here I examine split-half reliability (correlation between estimates obtained from odd vs. even trials) and test-retest reliability (correlation between estimates obtained on different days).

Split-half reliability

To examine split-half reliability for different sample sizes, one needs datasets with many trials per participant and a single condition (or a large number of trials per condition if multiple conditions are present). Consequently, I selected the same three datasets used to examine the dependence of measures of metacognition on metacognitive bias: Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant). As before, I analyzed each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For each dataset, I computed each measure of metacognition based on odd and even trials separately and correlated the two. To examine how split-half reliability depends on sample size, I performed the procedure above for bins of 50, 100, 200, and 400 trials separately. Because the datasets contained multiple bins of each size, I averaged the results across all bins of a given size.
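A sketch of this procedure for a single bin is shown below; as described above, the resulting across-participant correlation is then averaged over all non-overlapping bins of a given size. The data structure and function names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def split_half_reliability(measure_fn, data, bin_start=0, bin_size=100):
    """Across-participant correlation between a measure computed from odd- vs.
    even-numbered trials within one bin of `bin_size` trials per participant.

    data : list of (accuracy, confidence) array pairs, one per participant
    """
    half_odd, half_even = [], []
    for accuracy, confidence in data:
        acc = np.asarray(accuracy)[bin_start:bin_start + bin_size]
        conf = np.asarray(confidence)[bin_start:bin_start + bin_size]
        half_odd.append(measure_fn(acc[0::2], conf[0::2]))    # 1st, 3rd, 5th, ... trials
        half_even.append(measure_fn(acc[1::2], conf[1::2]))   # 2nd, 4th, 6th, ... trials
    return pearsonr(half_odd, half_even)[0]
```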

The results showed that measures of metacognition have good split-half reliability as long as the measures are computed using at least 100 trials (Fig. 5). Indeed, bin sizes of 100 trials produced split-half correlations of r > 0.837 for all 17 measures when averaged across the three datasets, with an average split-half correlation of r = 0.861. These numbers increased further for bin sizes of 200 (all r’s > 0.938, average r = 0.946) and 400 trials (all r’s > 0.961, average r = 0.965). Further, these numbers were only a little lower than the split-half correlations for d’ (100 trials: r = 0.913; 200 trials: r = 0.958; 400 trials: r = 0.970). However, the split-half correlations strongly diminished when the measures of metacognition were computed based on 50 trials, with an average r = 0.424 and no measure exceeding r = 0.6. It should be noted that, while performing better, d’ also had a relatively low split-half reliability of r = 0.685 when computed based on 50 trials. These results suggest that individual difference studies should employ 100 trials per participant at a minimum and that there is little benefit in terms of split-half reliability for using more than 200 trials.

Fig. 5: Split-half reliability of metacognitive scores.
figure 5

For each measure, correlations were computed between estimates based on odd vs. even trials for sample sizes of 50, 100, 200, and 400 trials. The figure shows that split-half correlations are high when at least 100 trials are used for computations but become unacceptably low when only 50 trials are used. The x-axis shows the results for three different datasets: Hadda (Haddara), Shekh (Shekhar), and Manis (Maniscalco).

Test-retest reliability

Split-half reliability is a useful measure of the intrinsic noise present in the across-subject correlations that can be expected in studies of individual differences. However, it does not account for fluctuations that could occur from day to day. These fluctuations can be examined by comparing measures of metacognition obtained on different days, thus estimating what is known as test-retest reliability. Such estimation requires datasets with multiple days of testing and a large number of trials per participant per day. Only one dataset in the Confidence Database meets these criteria: Haddara (6 days; 3000 total trials per participant; 70 participants). I examined test–retest reliability by computing both the intraclass correlation (ICC) and the Pearson correlation between all pairs of days and then averaging across the different pairs.
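A sketch of this computation is given below. It assumes a pandas DataFrame of per-day scores and the availability of the pingouin package for the ICC; the paper does not specify which ICC variant was used, so ICC2 (two-way random effects, absolute agreement) is shown as one common choice.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr
import pingouin as pg   # assumed available; provides intraclass_corr

def test_retest_reliability(scores):
    """scores: DataFrame with one row per participant and one column per day,
    each cell holding the metacognitive score estimated from that day's trials."""
    iccs, pearsons = [], []
    for day1, day2 in combinations(scores.columns, 2):
        pair = scores[[day1, day2]].dropna()

        # Pearson correlation between the two days
        pearsons.append(pearsonr(pair[day1], pair[day2])[0])

        # ICC for the pair of days (pingouin expects long format)
        long = (pair.rename_axis("participant").reset_index()
                    .melt(id_vars="participant", var_name="day", value_name="score"))
        icc = pg.intraclass_corr(data=long, targets="participant",
                                 raters="day", ratings="score")
        iccs.append(icc.set_index("Type").loc["ICC2", "ICC"])

    # Average across all pairs of days, as in the main analysis
    return np.mean(iccs), np.mean(pearsons)
```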

The results showed very low test–retest reliability values (Fig. 6). Even with 400 trials used for estimation, no measure of metacognition exceeded an average ICC reliability of 0.75, and none of the measures outside the five non-normalized and non-model-based measures (i.e., meta-d’, AUC2, Gamma, Phi, and ΔConf) reached an ICC reliability of 0.5, which is often considered the threshold for poor reliability. For example, the widely used measure M-Ratio had an average ICC reliability of r = 0.16 (for 50 trials), 0.23 (for 100 trials), 0.29 (for 200 trials), and 0.42 (for 400 trials). The measure with the highest test–retest correlation was ΔConf, with ICC reliability of 0.39 (for 50 trials), 0.53 (for 100 trials), 0.65 (for 200 trials), and 0.75 (for 400 trials). Notably, test-retest reliability was not much higher for d’ or criterion c compared to ΔConf (average difference of about 0.1) and was only robustly high for confidence (above 0.86 regardless of sample size). Similar test–retest correlation coefficients were obtained when Pearson correlation was computed instead of ICC (Fig. 6). These results are in line with the findings of Kopcanova et al.14 and suggest that correlations between measures of metacognition and measures that do not substantially fluctuate on a day-by-day basis (e.g., structural brain measures) are likely to be particularly noisy, such that very large sample sizes may be needed to find reliable results.

Fig. 6: Test–retest reliability of metacognitive scores.
figure 6

Test–retest correlations in the Haddara dataset (6 days, 500 trials per day, 70 participants) show generally low test-retest reliability. The upper panel shows ICC values, whereas the lower panel shows Pearson correlations. The test–retest reliability was low-to-moderate for the measures meta-d’, AUC2, Gamma, Phi, and ΔConf and very low for the remaining measures.

Across-subject correlations between different measures

Lastly, I examined how different measures are related to each other by performing across-subject correlations. Note that these analyses should be interpreted with extreme caution because the correlation between two measures could be driven by a third factor. For these analyses, I again used the Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant) datasets. As in previous analyses, I examined each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For each dataset, I computed each measure of metacognition based on all trials in the experiment and examined the across-subject correlations between different measures.

Overall, the 17 measures of metacognition showed medium-sized across-subject correlations with each other (average r = 0.49, 0.55, and 0.56 for the Haddara, Maniscalco, and Shekhar datasets, respectively; Supplementary Fig. 4). These analyses seemed to reveal three groups of measures. The first group consists of the five non-normalized measures (meta-d’, AUC2, Gamma, Phi, and ΔConf), which exhibited an average inter-measure correlation of 0.60 (r = 0.60, 0.63, and 0.58 in each dataset). The second group consists of the five ratio and five difference measures, which exhibited an average inter-measure correlation of 0.63 (r = 0.62, 0.62, and 0.63 in each dataset). The average correlation between the first two groups of measures was slightly weaker than the within-group correlations (r = 0.51 on average; r = 0.42, 0.55, and 0.55 in each dataset). Note that these results could be driven by the fact that all five non-normalized measures are strongly driven by d’, thus increasing the correlations between them. It may also be that the SDT-based normalization makes all ratio and difference measures similar to each other.

Finally, the third group of measures consists of the two model-based measures, which showed the strongest divergence from the rest of the measures. Specifically, meta-noise had an average correlation of 0.35 with the remaining measures (r = 0.35, 0.34, and 0.37 in each dataset) and meta-uncertainty had an average correlation of 0.44 with the remaining measures (r = 0.33, 0.45, and 0.53 in each dataset). The measures meta-noise and meta-uncertainty had a very weak correlation with each other (r = 0.15, 0.03, and 0.06 in each dataset). These results suggest that the two model-based measures may capture unique variance related to metacognitive ability.

Discussion

Despite substantial interest in developing good measures of metacognition, there has been surprisingly little empirical work on the psychometric properties of current measures. Here I investigate the properties of 17 measures of metacognition, including eight new variants. I develop a method for determining the validity and precision of a measure of metacognition and examine each measure’s dependence on nuisance variables as well as its split-half and test-retest reliability. The results paint a complex picture. No measure of metacognition is “perfect” in the sense of having the best psychometric properties across all criteria. Researchers therefore need to make informed decisions about which measures to use based on the empirical properties of the different measures. The results are summarized in Fig. 7.

Fig. 7: Summary of results.
figure 7

The figure lists the values obtained for each measure of metacognition for various criteria. Precision is the measure developed in this paper and the values listed are the average of the values in Fig. 1b, c. Higher precision values are better. For dependence on task performance and metacognitive bias, the figure lists the average Cohen’s d values reported in the paper. For dependence on response bias, the figure lists the average correlation between each measure of metacognition and the absolute value of response bias (\(|c|\)). A lower absolute value of these dependencies is better. The reported split-half reliability is the average value across datasets obtained for a bin size of 100, whereas the reported test-retest reliability (ICC) is the average value obtained for a bin size of 400. Higher reliability values are better. Color coding is meant as a general indicator but should be interpreted with caution. Green indicates very good properties, yellow indicates good properties, orange indicates problematic properties, and red indicates bad properties. Colors were assigned based on the following thresholds: 0.5 for precision, 0.3 and 1 for Cohen’s d, 0.5 for test–retest reliability. Green was not used in any of the columns regarding dependence on nuisance variables so as not to give the impression that any measure is certainly independent of any of the nuisance variables. The figure also lists several unique advantages and disadvantages of each measure discussed in the main text.

Validity and precision

I found that all 17 measures of metacognition examined here are valid. With the exception of meta-uncertainty, all measures seem to have a comparable level of precision. This result is rather surprising and suggests that precision may be limited by measurement error, such that it is unlikely that any new measure of metacognition can substantially exceed the precision level found for the first 16 measures here. Nevertheless, new measures can be noisier, and therefore it is critical to demonstrate their level of precision. Note that less precise measures can also appear to depend less on nuisance factors, not because of better psychometric properties but because of their noisiness.

Dependence on task performance

Task performance is arguably the most important and best-appreciated nuisance variable for measures of metacognition. As has been previously suspected18, the results here show that all traditional measures of metacognition are strongly dependent on task performance. However, the ratio method does a very good job of correcting for this dependence with M-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio showing only weak dependence on task performance. On the other hand, the difference method performed poorly in removing the dependence on task performance. The model-based measures meta-noise and meta-uncertainty also performed well.

Dependence on metacognitive bias

Previous research has shown that meta-d’ and M-Ratio are positively correlated with metacognitive bias such that a bias toward higher confidence also leads to higher values for these measures27,28. The current investigation replicated these previous results and showed that similar effects are observed for many other measures. Nevertheless, the dependence for M-Ratio was of low to medium effect size and comparable to that of newer measures such as meta-noise and meta-uncertainty.

Dependence on response bias

The results for response bias should be considered preliminary because they are based on a single dataset that consists of 10 participants. As such, the results should not be taken as strong evidence for an absence of dependence on response bias (hence, all measures are colored in yellow rather than green in Fig. 7). Yet, it does appear that any dependencies are unlikely to be particularly strong, at least for the range of response bias likely to occur in most experiments.

Alternative ways of quantifying dependence on nuisance variables

I quantified the dependence on nuisance factors by examining effect sizes (Cohen’s d and r-values). Alternative ways of examining the dependence on nuisance variables make it difficult to compare across measures. For example, the difference or ratio of raw values between easy and difficult conditions is not readily comparable across metrics with different ranges. The main limitation of the approach I adopted (examining effect sizes) is that noisier measures will have an advantage. In practice, the precision analysis found that 16 of the 17 measures examined here have a similar level of precision, and thus do not substantially differ in their noisiness. Nevertheless, it is possible that the relatively low dependence of meta-uncertainty on nuisance variables is in part due to its lower precision (higher noisiness).

Split-half reliability

Guggenmos41 recently examined many datasets in the Confidence Database and concluded that split-half reliability for M-Ratio is relatively poor (r ~ 0.7 for bin sizes between 400 and 600). (Note that the paper computes split-half reliability but calls it test-retest reliability.) One issue with the approach by Guggenmos is that many of the analyzed datasets in the Confidence Database feature a variety of conditions, manipulations, and sample sizes. These factors may reduce the observed split-half reliability. Indeed, focusing on a select number of large datasets with a single condition at a time, the current paper finds much higher split-half reliabilities (between 0.84 and 0.9 for a bin size of 100). These results suggest that for 100 or more trials per participant, one can expect reliable estimates of metacognition for every measure when using a single experimental condition. It is likely that studies that mix different conditions and estimate metacognitive scores across all of them would produce lower split-half reliability, in line with the results of Guggenmos. Note that bins of 50 trials produced unacceptably low reliabilities, so 100 trials should be considered a rough lower bound for the number of trials needed when estimating metacognition in studies of individual differences.

Test–retest reliability

One of the most striking results here is the very low test-retest reliabilities observed. Besides the five non-normalized measures (meta-d’, AUC2, Gamma, Phi, and ΔConf), no other measure showed test-retest reliability exceeding 0.5, even for sample sizes of 400 trials. However, these five non-normalized measures are strongly dependent on task performance, and thus their higher reliability may be partly (or wholly) due to the higher reliability of task performance itself (the test-retest reliability of d’ was 0.84 for a sample size of 400). Therefore, studies that match d’ for all participants may result in test-retest reliability values for these five measures of metacognition that are as low as for the remaining measures. Nevertheless, these results are based on a single dataset and should therefore be replicated before strong recommendations can be made. That said, the results are consistent with a recent paper that examined the test-retest reliability of M-Ratio in a sample of 25 participants14. Therefore, researchers who study individual differences in metacognition should be aware of the potentially low test-retest reliability of measures of metacognition, which may explain previous failures to find significant correlations between metacognitive abilities across domains.

Unique advantages and disadvantages of different measures

Several measures feature unique advantages and disadvantages (Fig. 7). For example, four of the ratio measures (M-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) become unstable for difficult conditions because they include division by variables (d’, expected Gamma, expected Phi, and expected ΔConf, respectively) that are very close to 0 in such conditions. These measures should therefore be used preferentially when performance levels are relatively high (e.g., one should aim for d’ values above 1, which roughly corresponds to accuracy values above 69%).
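The 69% figure can be verified directly: for an unbiased observer under equal-variance SDT, proportion correct equals \(\Phi(d'/2)\), where \(\Phi\) is the standard normal cumulative distribution function.

```python
from scipy.stats import norm

# Accuracy of an unbiased equal-variance SDT observer with d' = 1
print(norm.cdf(1 / 2))   # ~0.69, i.e., roughly 69% correct
```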

An advantage of AUC2, Gamma, Phi, and ΔConf is that they all work well with continuous confidence scales. All other measures rely on SDT-based computations that necessitate that continuous scales are binned before analyses. Such binning may lead to loss of information, but it is currently unclear how much signal may be lost by different binning methods.

The two model-based measures—meta-noise and meta-uncertainty—have unique advantages and disadvantages. Their main advantage is that all their underlying assumptions are explicitly known. Conversely, other measures must necessarily include hidden assumptions that are difficult to reveal without linking them to a process model of metacognition3. Another unique advantage of these measures is that they can in principle be applied much more flexibly. For example, when an experiment contains several conditions, other measures do not allow the estimation of a single measure of metacognition, and simply ignoring the different conditions can lead to inflated scores49. Conversely, both meta-noise and meta-uncertainty allow different conditions to be modeled as part of their underlying process models, and thus a single metacognitive score can be computed in a principled way across many conditions. A possible disadvantage of both measures is that they can only take positive values and therefore cannot be used in situations where metacognition may contain more information than the decision itself, such as when additional information arrives after the decision53,54.

Several measures showed dependence on nuisance variables that went in the opposite direction from most other measures (meta-uncertainty for task performance, as well as meta-noise and Gamma for metacognitive bias). As such, these measures may be especially useful to use when there is a concern that results may be driven by a specific nuisance variable. Unfortunately, it is currently difficult to determine why these measures show the opposite effects (or, for that matter, why most measures show the dependencies they show). Understanding the nature of these relationships will likely require further progress in developing well-fitting process models of metacognition55,56.

Is M-Ratio still the gold standard for measuring metacognition?

In the last decade, M-Ratio has become the dominant measure of metacognition due to its assumed better psychometric properties18,34,57. This status has naturally attracted greater scrutiny, and many recent papers have criticized some of the properties of M-Ratio27,28,37,41,58. However, while these criticisms are valid, such papers have rarely tested how alternative measures perform on the same tests. The results here demonstrate that, across all examined dimensions, no measure clearly outperforms M-Ratio. Three measures—meta-noise, Gamma-Ratio, and Phi-Ratio—showed very similar performance to M-Ratio, while all other measures appear inferior to M-Ratio in at least one critical dimension: they strongly depend on task performance (all five non-normalized measures, all five difference measures, and AUC2-Ratio), have low precision (meta-uncertainty), or show a strong dependence on metacognitive bias (ΔConf-Ratio). I see no strong argument in the present data for choosing either Gamma-Ratio or Phi-Ratio over M-Ratio, especially given how much more established M-Ratio is compared to Gamma-Ratio and Phi-Ratio. There are good arguments for using meta-noise in addition to M-Ratio as a way of controlling for metacognitive bias, given that the two measures depend on metacognitive bias in opposite directions. Similarly, meta-uncertainty can be used in addition to M-Ratio or meta-noise to control for task performance, given that it depends on task performance in the opposite direction from those two measures.

There are strong reasons for the field to transition to model-based measures of metacognition3 since model-based measures are uniquely positioned to properly capture the influence of metacognitive inefficiencies59. The measure meta-noise is especially promising given its good performance on the current tests and the fact that its associated model is a successful model of metacognition55. That said, meta-noise is currently only implemented in Matlab (see codes associated with the current paper) and is more computationally intensive than the other measures examined here. Thus, although meta-noise or other model-based measures of metacognition should eventually supplant M-Ratio, for the time being it is hard to justify abandoning M-Ratio as the gold standard for the field.

Limitations

The present work has several limitations. First, despite the attempt to be comprehensive, several measures of metacognition have been omitted, including recent model-based measures30,60, different variants of M-Ratio41, and legacy measures such as Kunimoto’s a’38. Nevertheless, the current work should make it much easier for researchers to establish the properties of other measures of metacognition and compare them to the ones examined here. Second, while I have attempted to use multiple large datasets for each analysis, two of the analyses only included a single dataset (dependence on response bias and test-retest reliability) and should be interpreted with caution. Even in cases where multiple datasets were used, it is clear that adding more datasets would alter the values in Fig. 7. As such, the values there should be understood as rough estimates that are bound to be improved upon by future work that analyzes additional large datasets. Third, all ratio and difference measures were computed using SDT with equal variance; computations assuming unequal variance may lead to different results. Fourth, the current analyses were conducted exclusively in the context of perception. Metacognition has been widely studied in the context of learning, memory, problem solving, etc1. While the results here are expected to generalize to these other domains, additional research is needed to confirm this. Fifth, most measures examined here only apply to 2-choice tasks and thus cannot be used for designs with estimation tasks, n-choice tasks, etc.

Recommendations

Based on the current set of results and findings from the greater literature, Table 3 lists recommendations for researchers interested in measuring metacognitive ability. The recommendations pertain to experimental design, analysis, and interpretation.

Table 3 Recommendations for metacognition researchers

Researchers interested in measuring metacognition precisely need to pay special attention to experimental design. They should use relatively easy tasks (while still avoiding ceiling effects) because ratio measures become unstable for low d’. They should also ideally use a single difficulty level to avoid the inflation that arises when multiple difficulty levels are combined49. Finally, researchers need to ensure adequate sample sizes. I recommend at least 400 trials per participant for individual differences research and at least 100 trials per participant for within-subject studies.

At the level of analysis, I recommend using more than one measure whenever possible, especially if the results could plausibly depend on task performance or metacognitive bias. Difference measures should not be assumed to properly correct for task performance. In cases where performance is very low and ratio measures are unstable, the results should be confirmed by examining both difference and non-normalized measures (since these two categories have opposite dependence on task performance). When multiple conditions are present, researchers should ideally use the model-based measures meta-noise or meta-uncertainty via custom modeling.

Finally, researchers should interpret findings of M-Ratio and other ratio measures being larger (or smaller) than 1 with caution. Traditionally, such findings have been interpreted as the metacognitive system having more (or less) signal than the decision-making system. However, many other factors can drive such results, such as the mixing of several difficulty levels49 or criterion (as opposed to signal) noise59, and some researchers have even questioned the separation of decision-making and metacognitive systems61.

Methods

Ethical regulations

The current study complies with all relevant ethical regulations. All analyses were performed on deidentified data from publicly available datasets and were thus exempt from Institutional Review Board review.

Datasets

To investigate the empirical properties of measures of metacognition, I used the datasets from the Confidence Database47 that are most appropriate for each individual analysis. This process resulted in the selection of six different datasets, briefly discussed below in alphabetical order. In each case, participants completed a 2-choice perceptual task and provided confidence ratings. For each dataset, I only considered trials from the main experiment and removed any staircase or practice trials that may have been included. In addition, I excluded participants whose accuracy was lower than 60% or higher than 95%, or who gave the same object-level or confidence response on more than 85% of trials. These exclusions were made because such participants can have unstable metacognitive scores. Overall, these criteria led to the exclusion of 58 out of 1091 participants (5.32% exclusion rate). Data were collected in a lab setting unless otherwise indicated.
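
As an illustration, the exclusion criteria above could be implemented along the following lines (MATLAB sketch; the vectors correct, response, and confidence are hypothetical placeholders for one participant’s trial-by-trial data):

% Hypothetical per-participant vectors (one entry per trial):
%   correct    - 1 for correct responses, 0 for errors
%   response   - object-level response (e.g., 1 or 2)
%   confidence - confidence rating on the task's scale
acc      = mean(correct);
sameResp = mean(response   == mode(response));     % proportion of most common response
sameConf = mean(confidence == mode(confidence));   % proportion of most common rating
exclude  = acc < 0.60 || acc > 0.95 || sameResp > 0.85 || sameConf > 0.85;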

Haddara dataset

The first dataset is named “Haddara_2022_Expt2” in the Confidence Database (simplified to “Haddara” here) and consists of 75 participants each completing 3350 trials over seven days. Because Day 1 consisted of a smaller number of trials (350) compared to Days 2–7 (500 trials each), I only analyzed the data from Days 2–7 (3000 trials total). All experimental details can be found in the original publication43. Briefly, the task was to determine the more frequent letter in a 7 × 7 display of X’es and O’s. Confidence was provided on a 4-point scale using a separate button press. The data collection was conducted online and half the participants received trial-by-trial feedback (all participants are considered jointly here). Five participants were excluded from this dataset (6.67% exclusion rate).

Locke dataset

The second dataset is named “Locke_2020” in the Confidence Database (simplified to “Locke” here) and consists of 10 participants each completing 4900 trials. All experimental details can be found in the original publication44. Briefly, the task was to determine whether a Gabor patch was tilted to the left or right of vertical. Confidence was provided on a 2-point scale using a separate button press. There were seven conditions with manipulations of both prior and reward. Rewards were manipulated by changing the payoff for correctly choosing category 1 vs. category 2 (e.g., R = 4:2 means that 4 vs. 2 points were given for correctly identifying categories 1 and 2, respectively), whereas priors were manipulated by informing participants about the probability of category 2 (e.g., P = 0.75 means that there was 75% probability of presenting category 2 and 25% probability of presenting category 1). The seven conditions were as follows: (1) P = 0.5, R = 3:3, (2) P = 0.75, R = 3:3, (3) P = 0.25, R = 3:3, (4) P = 0.5, R = 4:2, (5) P = 0.5, R = 2:4, (6) P = 0.75, R = 2:4, and (7) P = 0.25, R = 4:2. There was an equal number of trials (700) per condition. No participants were excluded from this dataset.

Maniscalco dataset

The third dataset is named “Maniscalco_2017_expt1” in the Confidence Database (simplified to “Maniscalco” here) and consists of 30 participants each completing 1000 trials. All experimental details can be found in the original publication45. Briefly, the task was to determine which of the two patches presented to the left and right of fixation contained a grating. A single difficulty condition was used throughout. Confidence was provided on a 4-point scale using a separate button press. Eight participants were excluded from this dataset (26.67% exclusion rate).

Rouault1 and Rouault2 datasets

The fourth and fifth datasets are named “Rouault_2018_Expt1” and “Rouault_2018_Expt2” in the Confidence Database (simplified to “Rouault1” and “Rouault2” here). They consist of 498 and 497 participants, respectively, each completing 210 trials. All experimental details can be found in the original publication that describes both datasets46. Briefly, the task was to determine which of the two squares presented to the left and right of fixation contained more dots and then rate confidence using a separate button press. The Rouault1 dataset had 70 difficulty conditions (where the difference in dot number between the two squares varied from 1 to 70) with 3 trials each. It collected confidence on an 11-point scale that goes from 1 (certainly wrong) to 11 (certainly correct). However, because the first six confidence ratings were used very infrequently, I combined them into a single rating, thus transforming the 11-point scale into a 6-point scale. On the other hand, Rouault2 used a continuously running staircase that adaptively modulated the difference in dots. It collected confidence on a 6-point scale that goes from 1 (guessing) to 6 (certainly correct), which is equivalent to the modified scale from Rouault1 and thus did not require additional modification. Data collection for both studies was conducted online. Thirty-two participants were excluded from Rouault1 and 13 participants were excluded from Rouault2 (6.43% and 2.62% exclusion rates, respectively).

Shekhar dataset

The final dataset is named “Shekhar_2021” in the Confidence Database (simplified to “Shekhar” here) and consists of 20 participants each completing 2800 trials. All experimental details can be found in the original publication27. Briefly, the task was to determine the orientation (left vs. right) of a Gabor patch presented at fixation. Participants indicated their confidence simultaneously with the perceptual decision using a single mouse click. Confidence was provided on a continuous scale (from 50 to 100) but was binned into six levels as in the original publication. The dataset featured three different difficulty levels (manipulated by changing the contrast of the Gabor patch), which were analyzed separately. No participants were excluded from this dataset.

Computation of each measure of metacognition

Previously proposed measures of metacognition

I computed a total of 17 measures of metacognition and provided Matlab code for their estimation (available at https://osf.io/y5w2d/). I first discuss nine of these measures that have been previously proposed: AUC2, Gamma, Phi, ΔConf, meta-d’, M-Ratio, M-Diff, meta-noise, and meta-uncertainty.

The first four of these measures have the longest history. AUC2 was first proposed in the 1950s31 and measures the area under the Type 2 ROC function, which plots Type 2 hit rate vs. Type 2 false alarm rate. Gamma, perhaps the most popular measure in the memory literature, is the Goodman–Kruskal Gamma coefficient, which is essentially a rank correlation between trial-by-trial confidence and accuracy32. Phi is conceptually similar to Gamma but measures the Pearson correlation between trial-by-trial confidence and accuracy33. Finally, ΔConf (my terminology) measures the difference between the average confidence on correct trials and the average confidence on error trials. ΔConf is perhaps the simplest and most intuitive measure of metacognition but is used very infrequently in the literature.
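
As an illustration, three of these measures can be computed directly from trial-by-trial data, as in the following sketch (MATLAB; conf and correct are hypothetical column vectors of confidence and accuracy, and corr requires the Statistics and Machine Learning Toolbox):

% conf    - trial-by-trial confidence ratings (column vector of doubles)
% correct - trial-by-trial accuracy (1 = correct, 0 = error; column vector of doubles)
Phi       = corr(conf, correct);                                 % Pearson correlation
DeltaConf = mean(conf(correct == 1)) - mean(conf(correct == 0)); % confidence gap

% Goodman-Kruskal Gamma via concordant/discordant pairs (simple O(n^2) version)
C = 0; D = 0;
n = numel(conf);
for i = 1:n-1
    for j = i+1:n
        s = sign(conf(i) - conf(j)) * sign(correct(i) - correct(j));
        C = C + (s > 0);    % concordant pair
        D = D + (s < 0);    % discordant pair
    end
end
Gamma = (C - D) / (C + D);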

The next three measures were developed by Maniscalco and Lau34. They devised a new approach to measuring metacognitive ability where one can estimate the sensitivity, meta-d’, exhibited by the confidence ratings. Because meta-d’ is expressed in the units of d’, Maniscalco and Lau reasoned that meta-d’ can be normalized by the observed d’ to obtain either a ratio measure (M-Ratio, equal to meta-d’/d’) or a difference measure (M-Diff, equal to meta-d’ − d’). These measures are often assumed to be independent of task performance18 but empirical work on this issue is scarce (though see41).
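
For concreteness, once meta-d’ and d’ have been estimated, the two derived measures follow directly. The sketch below assumes the publicly available fit_meta_d_MLE function by Maniscalco and Lau and the field names of its output struct; this particular function and its interface are an assumption here, not part of the present paper’s code:

% Assumption: fit_meta_d_MLE (Maniscalco & Lau) returns a struct whose fields
% meta_da and da hold meta-d' and d', respectively.
% nR_S1 and nR_S2 are the response counts per confidence level for each stimulus.
fit     = fit_meta_d_MLE(nR_S1, nR_S2);
M_Ratio = fit.meta_da / fit.da;      % meta-d'/d'
M_Diff  = fit.meta_da - fit.da;      % meta-d' minus d'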

Finally, recent years have seen a concerted effort to build measures of metacognition derived from explicit process models of metacognition. Two such measures examined here were developed by Shekhar and Rahnev27 and Boundy-Singer et al.35. Shekhar and Rahnev proposed the lognormal meta-noise model, an SDT model with the additional assumption of lognormally distributed metacognitive noise that affects the confidence criteria. The lognormal distribution was used because it avoids nonsensical situations where a confidence criterion crosses to the other side of the decision criterion. The metacognitive noise parameter (\({\sigma }_{{meta}}\), referred to here as meta-noise) can be used as a measure of metacognitive ability. Fitting the model to data is computationally expensive because it requires the evaluation of many double integrals that have no analytical solutions. Consequently, the fitting method from Shekhar and Rahnev27 takes substantially longer than the other measures examined here, making the measure less practical. To address this issue, I made substantial modifications to the original code, including many improvements in the efficiency of the algorithm and the creation of a lookup table so that values of the double integral do not need to be computed anew but can simply be loaded. These improvements reduce the computation time for meta-noise from minutes to a few seconds, making the measure easy to use in practical applications. The measure developed by Boundy-Singer et al.35, meta-uncertainty, is based on a different process model of metacognition, CASANDRE, which implements the notion that people are uncertain about the uncertainty in their internal representations. Specifically, meta-uncertainty quantifies the noise present in the estimation of the sensory noise and represents another possible measure of metacognition. The code for estimating meta-uncertainty was provided by Zoe Boundy-Singer.

New measures of metacognition

In addition to the already established measures mentioned above, I developed several new measures that conceptually follow the normalization procedure introduced by Maniscalco and Lau34. That normalization procedure has previously only been applied to the measure meta-d’ (to create M-Ratio and M-Diff), but there is no theoretical reason why a conceptually similar correction cannot be applied to other traditional measures of metacognition. Consequently, here I develop eight new measures where one of the traditional measures of metacognitive ability is turned into either a ratio (AUC2-Ratio, Gamma-Ratio, Phi-Ratio, and ΔConf-Ratio) or a difference (AUC2-Diff, Gamma-Diff, Phi-Diff, and ΔConf-Diff) measure. The logic is to compute an observed and an expected value for any given measure (e.g., AUC2), and then use the expected value to normalize the observed value. First, a measure is computed using the observed data, thus producing what may be called, e.g., AUC2observed. Critically, the measure is then computed again using the predictions of SDT given the observed sensitivity (d’) and criteria, thus obtaining what may be called, e.g., AUC2expected. One can then take either the ratio (e.g., AUC2observed/AUC2expected) or the difference (e.g., AUC2observed – AUC2expected) between the observed and the SDT-predicted quantities to create the new measures of metacognition.
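
In schematic form (MATLAB; AUC2_observed and AUC2_expected are hypothetical variables holding the observed value and the SDT-expected value whose computation is described next):

% Schematic construction of the new measures, here for AUC2
AUC2_Ratio = AUC2_observed / AUC2_expected;
AUC2_Diff  = AUC2_observed - AUC2_expected;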

I computed the SDT expectations in the following way. First, I estimated d’ using the formula:

$$d^{\prime}=z({HR})-z({FAR})$$
(1)

where HR is the observed hit rate and FAR is the observed false alarm rate. Then, I estimated the location of all confidence and decision criteria using the formula:

$${c}_{i}=-\frac{z(H{R}_{i})+z({FA}{R}_{i})}{2}$$
(2)

In the formula above, \(i\) goes from \(-(k-1)\) to \(k-1\), for confidence ratings collected on a k-point scale. Intuitively, one can think of the confidence ratings \(1, 2, \ldots, k\) for category 1 being recoded to \(-1, -2, \ldots, -k\), such that confidence goes from \(-k\) to \(k\) and simultaneously indicates the decision (negative confidence values indicating a decision for category 1; positive confidence values indicating a decision for category 2). \(HR_i\) and \(FAR_i\) are then simply the proportions of trials on which this rescaled confidence is higher than or equal to \(i\) when category 2 and category 1 are presented, respectively.

Once the values of d’ and \({c}_{i}\) are computed, they can be used to generate predicted \(H{R}_{i}\) and \({FA}{R}_{i}\) values (which would be slightly different from the empirically observed ones). The measures AUC2, Gamma, Phi, and ΔConf can then be straightforwardly computed based on the predicted \(H{R}_{i}\) and \({FA}{R}_{i}\) values, thus enabling the computation of the new ratio and difference measures.
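
A minimal sketch of this expected-value computation under equal-variance SDT (MATLAB; norminv and normcdf require the Statistics and Machine Learning Toolbox, and the vectors HRi and FARi as well as the variable k are hypothetical placeholders):

% HRi, FARi - observed cumulative hit and false-alarm rates, one entry per
%             criterion i = -(k-1), ..., k-1 (so the k-th element corresponds
%             to i = 0, the type-1 decision criterion)
% k         - number of points on the confidence scale (hypothetical variable)
dprime = norminv(HRi(k)) - norminv(FARi(k));    % Eq. (1) at the decision criterion
c      = -(norminv(HRi) + norminv(FARi)) / 2;   % Eq. (2), all 2k-1 criteria at once

% SDT-predicted rates: stimulus distributions N(+d'/2, 1) and N(-d'/2, 1)
HRi_pred  = 1 - normcdf(c,  dprime / 2, 1);
FARi_pred = 1 - normcdf(c, -dprime / 2, 1);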

Assessing validity and precision

Any measure of metacognition should be valid and precise19,22,48. However, there is no established method to assess either validity or precision of measures of metacognition. Here I developed a method to jointly assess validity and precision. The underlying idea is to artificially alter confidence to be less in line with accuracy and then assess how measures of metacognition change.

Specifically, the method corrupts confidence by decreasing confidence ratings for correct trials and increasing them for incorrect trials. For a given set of trials, the method loops over the trials starting from the first and (1) if the trial has a correct response and confidence higher than 1, it decreases the confidence on that trial by 1 point, and (2) if the trial has an incorrect response and confidence lower than the maximum (that is, k on a k-point scale), it increases the confidence on that trial by 1 point. If neither condition applies, the trial is skipped. The method then continues to corrupt subsequent trials in the same manner until a pre-set proportion of corrupted trials is reached. All measures of metacognition are then computed based on the corrupted confidence ratings. A given dataset is first split into \(n\) bins of a given number of trials, and the procedure above is performed separately for each bin. Finally, to compute a precision value that can be compared across different measures of metacognition, I use the following formula:

$$\mathit{precision}=\frac{1}{n}\sum_{i=1}^{n}\frac{\mathit{measureOrig}_{i}-\mathit{measureCorrupted}_{i}}{\mathit{SD}}$$
(3)

where \(\mathit{measureOrig}_{i}\) and \(\mathit{measureCorrupted}_{i}\) are the values of a specific measure computed on the original (uncorrupted) and corrupted confidence ratings, respectively, \(n\) is the number of bins analyzed, and \(\mathit{SD}\) is the standard deviation of all \(\mathit{measureOrig}_{i}\) for \(i=1,2,\ldots,n\). Positive values of \(\mathit{precision}\) indicate valid measures of metacognition, and higher values indicate more precise measures (i.e., measures that are more sensitive to the corruption in confidence relative to background fluctuations).
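
A minimal sketch of the corruption step and of Eq. (3) for a single bin (MATLAB; conf, correct, k, measureOrig, and measureCorrupted are hypothetical placeholders rather than variables from the paper’s code):

% Corrupt a pre-set proportion of trials in one bin by pushing confidence
% away from accuracy
propCorrupt = 0.04;                                % e.g., corrupt 4% of trials
nToCorrupt  = round(propCorrupt * numel(conf));
confCorr    = conf;
nDone       = 0;
for t = 1:numel(conf)
    if nDone >= nToCorrupt, break; end
    if correct(t) == 1 && confCorr(t) > 1          % correct trial: lower confidence
        confCorr(t) = confCorr(t) - 1;  nDone = nDone + 1;
    elseif correct(t) == 0 && confCorr(t) < k      % error trial: raise confidence
        confCorr(t) = confCorr(t) + 1;  nDone = nDone + 1;
    end
end

% Precision across n bins (Eq. 3): measureOrig and measureCorrupted are
% hypothetical n x 1 vectors of one measure computed on the original and
% corrupted confidence ratings of each bin
precision = mean((measureOrig - measureCorrupted) / std(measureOrig));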

I computed the precision of all 17 measures of metacognition for two datasets from the Confidence Database: Maniscalco (1 day; 1000 trials per participant) and Haddara (6 days; 3000 trials per participant). I separately examined the results of altering 2, 4, and 6% of all trials and computed metacognitive scores based on bins of 50, 100, 200, and 400 trials. I split the Maniscalco dataset into 20 bins of 50 trials, 10 bins of 100 trials, five bins of 200 trials, and two bins of 400 trials (by taking into consideration only the first 800 trials in this last case). I split the 500 trials from each of the six days in the Haddara dataset into 10 bins of 50 trials, five bins of 100 trials, two bins of 200 trials, and one bin of 400 trials (by taking into consideration only the first 400 trials for the 200- and 400-trial bins). Across the 6 days, this process resulted in 60 bins of 50 trials, 30 bins of 100 trials, 12 bins of 200 trials, and six bins of 400 trials.

Assessing dependence on task performance

To assess how task performance affects measures of metacognition, I examined whether each measure of metacognition changed across different difficulty levels in the same experiment. Specifically, I tested whether each of the 17 measures of metacognition increases or decreases for more difficult conditions. This process requires datasets with (1) several difficulty conditions and (2) a large number of trials. Consequently, I selected datasets from the Confidence Database that meet these two criteria but do not include any other manipulations. This resulted in the selection of three datasets: Shekhar (3 difficulty levels, 20 participants, 2800 trials/participant, 56,000 total trials), Rouault1 (70 difficulty levels, 466 participants, 210 trials/participant, 97,860 total trials), and Rouault2 (staircase-controlled difficulty, 484 participants, 210 trials/participant, 101,640 total trials). Because the two Rouault datasets included very few trials at each difficulty level, I instead used a median split to classify trials as easy vs. difficult. To perform statistical analyses and compute Cohen’s d, I conducted t-tests comparing the lowest and highest difficulty levels in each dataset. To avoid outlier values, for each difficulty level and each measure of metacognition, I excluded any values that deviated by more than 3*SD from the mean of that difficulty level. Finally, as a reference, I performed all the above analyses on the measures d’, c, and average confidence.
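
One way the outlier exclusion and effect-size computation could look for a single measure and dataset is sketched below (MATLAB; easyVals and hardVals are hypothetical vectors with one value per participant, ttest2 is used purely for illustration, and a simple pooled-SD Cohen’s d is assumed):

% easyVals, hardVals - hypothetical vectors holding one value of a given measure
% per participant for the easiest and hardest condition, respectively
keepE    = abs(easyVals - mean(easyVals)) <= 3 * std(easyVals);   % 3-SD outlier rule
keepH    = abs(hardVals - mean(hardVals)) <= 3 * std(hardVals);
easyVals = easyVals(keepE);
hardVals = hardVals(keepH);

[~, p]   = ttest2(easyVals, hardVals);                            % two-sample t-test
pooledSD = sqrt((var(easyVals) + var(hardVals)) / 2);             % simple pooled SD
cohens_d = (mean(easyVals) - mean(hardVals)) / pooledSD;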

Assessing dependence on metacognitive bias

To assess how metacognitive bias affects measures of metacognition, I applied the method developed by Xue et al.28. In this method, confidence ratings are recoded in two different ways so as to artificially induce a metacognitive bias towards lower or higher confidence ratings. Specifically, an n-point scale is transformed into an (n−1)-point scale in two ways. In the first recoding, the ratings from 2 to n are all decreased by one. In the second recoding, only the rating of n is decreased by one. The first recoding thus results in a bias towards lower confidence relative to the second recoding (see mean confidence values in the bottom right of Fig. 3a). A measure of metacognition can then be computed for the newly obtained confidence ratings. Comparing the values obtained for the two recodings allows the assessment of whether each measure of metacognition is independent of metacognitive bias.
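
A minimal sketch of the two recodings (MATLAB; conf is a hypothetical vector of ratings on an n-point scale):

% Recoding 1: shift ratings 2..n down by one (merges the bottom two ratings,
% biasing the resulting (n-1)-point scale towards lower confidence)
confLow = conf;
confLow(conf >= 2) = conf(conf >= 2) - 1;

% Recoding 2: only collapse the top rating n into n-1 (merges the top two
% ratings, keeping confidence on the resulting (n-1)-point scale relatively high)
confHigh = conf;
confHigh(conf == n) = n - 1;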

This process would ideally be applied to datasets with (1) a single experimental condition and (2) a large number of trials. Consequently, I selected the same two datasets used to quantify precision: Haddara (3000 trials per participant) and Maniscalco (1000 trials per participant). In addition, I also used the Shekhar dataset (3 difficulty levels, 2800 trials per participant) but analyzed each difficulty level in isolation and then averaged the results across the three difficulty levels. The values of each measure of metacognition for the two recodings were compared using a paired t-test.

Assessing dependence on response bias

To assess how response bias affects measures of metacognition, I compared the values of each measure of metacognition in conditions that differed in their decision criterion. To do so, I analyzed the Locke dataset—the only dataset in the Confidence Database where the response criterion is experimentally manipulated. I computed each measure of metacognition for each of the seven conditions in that dataset and conducted repeated measures ANOVAs to examine whether each measure of metacognition varied with the condition. In addition, to estimate an effect size for the relationship between response bias and each measure of metacognition, I computed the correlation between the estimated metacognitive ability and the absolute value of the response bias (i.e., |c|).

Assessing split-half reliability

To assess split-half reliability, I examined the correlation between the values obtained for different measures of metacognition on odd vs. even trials41. As with assessing precision, I estimated split-half correlations for different sample sizes so that researchers can make informed decisions about the sample sizes needed in future studies. Specifically, I used bin sizes of 50, 100, 200, and 400 trials. Note that a bin size of \(k\) here means that \(2k\) trials were examined, with both the odd and even trials having a sample size of \(k\). These computations are best performed using datasets with (1) a single condition and (2) a large number of trials per participant. Consequently, I selected the same three datasets used to examine the dependence of measures of metacognition on metacognitive bias: Haddara (3000 trials per participant), Maniscalco (1000 trials per participant), and Shekhar (3 difficulty levels, 2800 trials per participant). As before, I analyzed each difficulty level in the Shekhar dataset in isolation and then averaged the results across the three difficulty levels. For a bin size of \(k\), the computations were performed on as many non-overlapping bins of \(2k\) trials as possible. The obtained r-values were then z-transformed, averaged, and the resulting average z-value was transformed back to an r-value for reporting and plotting purposes.
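
A minimal sketch of the split-half computation for one bin and of the Fisher z-averaging of the resulting correlations (MATLAB; computeMeasure and rPerBin are hypothetical placeholders):

% conf, correct - trial-by-trial data for one bin of 2k trials (one participant);
% computeMeasure is a hypothetical stand-in for any of the 17 measures
mOdd  = computeMeasure(conf(1:2:end), correct(1:2:end));   % odd trials
mEven = computeMeasure(conf(2:2:end), correct(2:2:end));   % even trials

% rPerBin - hypothetical vector of split-half correlations (one r per bin,
% each computed across participants); average via the Fisher z-transform
rMean = tanh(mean(atanh(rPerBin)));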

Assessing test–retest reliability

To assess test-retest reliability, I examined the intraclass correlation (ICC) coefficients between the values obtained for different measures of metacognition on different days. I report the two-way absolute agreement ICC, named “A-1”62, computed using the code provided by Salarian (https://www.mathworks.com/matlabcentral/fileexchange/22099-intraclass-correlationcoefficient-icc). For ease of comparison with the results of Guggenmos41, in addition to the ICC, I also computed the Pearson correlation. As with split-half reliability, I estimated test-retest reliability for sample sizes of 50, 100, 200, and 400 trials. Because test-retest computations require data from multiple days and a large number of trials per participant per day, I selected the Haddara dataset, the only dataset in the Confidence Database that meets these criteria. I computed test-retest correlations between all pairs of days for as many non-overlapping bins as possible. Note that, unlike the split-half analyses, analyses with a bin size of \(k\) involved the selection of \(k\) trials from each day. As with the split-half analyses, the obtained correlation coefficients were z-transformed, averaged, and the resulting average z-value was transformed back to a correlation coefficient (ICC or r-value) for reporting and plotting purposes.

Statistical analyses and reporting

All conclusions in the paper are based on effect sizes (Cohen’s d, r, and ICC values). However, for completeness, I sometimes refer to the results of null-hypothesis statistical tests. As is standard practice, in cases where I report the results of multiple tests together, I only include the p-values. All remaining information, such as test statistics and degrees of freedom, can be obtained from the provided analysis codes. All p-values are based on two-tailed statistical tests. Analyses were performed using MATLAB 2024a (MathWorks).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.