Introduction

Web-based testing has become an essential tool in psychological research, with Amazon Mechanical Turk (MTurk) remaining the most popular platform. Recent keyword analyses indicate that MTurk is used in 30–40% of papers in top psychology journals1,2. This platform, along with others, has enabled large-scale testing of cognitive and perceptual abilities, offering benefits beyond mere convenience. Web-based recruitment allows for diverse geographic sampling and facilitates access to specific participant groups, such as twins3, clinical populations4, and extreme performance groups5. Compared to traditional recruitment methods, such as sampling from undergraduate populations, online recruitment is also more economical and provides greater demographic diversity.

However, several studies have identified potential pitfalls in online recruitment. Data quality issues have been noted with MTurk participants, particularly in contrast to platforms like Prolific that focus on providing high-quality participants for research6,7. These issues include high rates of inattentive responses8, meaningless data9, and fraudulent participation using IP masking10. Such factors can reduce the effectiveness of experimental manipulations10 and affect the mean and standard deviation of scale measurements11, although some studies do not report such effects12. While stricter exclusion criteria can mitigate these issues, their effect on experimental outcomes is often modest10.

In this study, we explore how recruitment pools influence performance on tests of face identity processing. This refers to tasks that involve comparing face images perceptually or remembering faces to later identify them. Both lab-based and online studies have demonstrated that face identity processing varies across the population13, from individuals with developmental prosopagnosia, who have significant deficits, to super-recognisers, who perform near-perfectly on standardised tests14. In applied contexts, assessing high performers can have real-world implications for identity verification tasks15,16, and identifying deficits is clinically important for understanding disorders of social cognition17. Theoretically, reliable tests can provide insights into the cognitive systems that support face perception13,18.

Initial web-based face processing tests were conducted via testmybrain.org, which has since tested over 2.5 million participants19. Performance on these tests is comparable to lab-based samples20, though this platform typically attracts motivated participants interested in learning about their cognitive abilities. Other self-selecting pools, such as those recruited via registries or targeted links (e.g., super-recogniser tests), often yield skewed performance, with participants scoring higher than average21,22.

Given this, researchers should also have access to relatively unbiased online samples like those provided by MTurk and Prolific. However, some studies have reported lower accuracy on face identity tasks for MTurk participants compared to lab-based samples14. For instance, MTurk participants scored 61.4% on the Cambridge Face Memory Test (CFMT) compared to 69.3% for undergraduate students23,24,25. Similarly, bespoke tests of face identity processing have shown lower accuracy for MTurk samples compared to student populations26.

Performance on face memory tests among Prolific participants has also been lower than normative scores established in lab-based testing. For example, accuracy on the CFMT ranged from 68 to 75% in recent studies27,28,29, which is well below the 80.4% found in lab-based samples30,31,32. This pattern is consistent for the Australian version of the test (CFMT-Aus), with Prolific participants scoring 71.5%, compared to 80.2% in lab-based samples33. Other Prolific studies have found mixed evidence of differences between lab and online samples: for example, accuracy on the Cambridge Face Memory Test—Extended Version (CFMT+ 14) was 2% lower in one study34, with other tests in the same study showing scores between 3 and 15% lower.

Test score variation in online settings is not limited to memory tasks but extends to perceptual matching tasks, such as the Glasgow Face Matching Test 2 (GFMT2). Normative data from MTurk participants for the GFMT2-Short subtest (GFMT2-S) showed 75% accuracy, consistent with subsequent studies on MTurk35. However, Prolific participants have performed slightly better, with reported accuracies of 80–82%34,36,37. Moreover, MTurk participants have shown lower performance on other face matching tasks compared to lab-based samples35,38.

In this study we recruited undergraduate students and three online samples to complete the GFMT2 and other standard tests of face identity processing ability (CFMT+ 14, Models Face Matching Test39). Details of these tests and participant groups can be found in the Methods section. To preview our results, we found that MTurk participants scored markedly lower on all tests. Because initial development of the GFMT2 relied heavily on MTurk data40, we then reassessed the psychometric properties of the GFMT2. We find robust psychometric properties of the GFMT2 subtests that are consistent with the original test publication40, including: high test–retest reliability, convergent validity with other face identity processing tests (GFMT41, CFMT+ 14, UNSW Face Test21) and diagnostic value in distinguishing between super-recognisers and standard participant groups. Updated normative scores for all GFMT2 subtests are provided in Table 1.

Table 1 Mean GFMT2 subtest percent correct scores across participant samples (standard deviations in parentheses).

Results

Stability of GFMT2-S and GFMT2-H test scores across four recruitment methods

We administered two GFMT2 subtests (GFMT2-S, GFMT2-H) to participants from three online recruitment pools (MTurk, Prolific, UNSW Face Research Registry) and a group of UNSW undergraduates tested in-person. Participants from Prolific also completed the GFMT2-Low. Participants scoring below chance or using the same response key throughout a test were excluded prior to any analysis. Demographic details of the final participant groups are shown in Table 1, with additional information about the participant groups in the “Methods” (see: “Participant cohorts, procedure, and exclusion criteria”).
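As a minimal sketch of how these pre-analysis exclusions could be applied to trial-level data (the column names `participant`, `response_key`, and `correct` are our assumptions, not the authors' pipeline):

```python
import pandas as pd

def basic_exclusions(trials: pd.DataFrame, chance: float = 0.5) -> pd.Series:
    """Flag participants who scored below chance or pressed the same
    response key on every trial of a test."""
    by_participant = trials.groupby("participant")
    below_chance = by_participant["correct"].mean() < chance   # accuracy below 50%
    same_key = by_participant["response_key"].nunique() == 1   # one key throughout
    return below_chance | same_key  # True = exclude

# usage: trials = trials[~trials["participant"].map(basic_exclusions(trials))]
```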

Normative test scores for the GFMT2 subtests are displayed in Table 1. We found substantial variation in group averages for both the GFMT2-S (74–89%) and GFMT2-H (67–82%). MTurk participants showed scores equivalent to those reported in40 and35. However, MTurk scores were lower than those of all other groups, suggesting MTurk may not provide generalisable normative measures. We found no demographic differences that could explain these results (see Supplementary Material 2; Figures S1-2 and Tables S2-4).

Accuracy for the Prolific and in-person samples was equivalent, but both were approximately 5 percentage points lower than for the UNSW Face Research Registry group. This elevated performance is consistent with the self-selected recruitment of this group, as evidenced by their scores on other standard tests of face recognition [CFMT+ : M = 78.8%, SD = 12.6%; t(712) = 20.0, p < 0.001, Cohen’s d = 0.75; GFMT: M = 92.8%, SD = 7.74%; t(761) = 41.0, p < 0.001, Cohen’s d = 1.48; UNSW Face Test: M = 64.7%, SD = 6.51%; t(995) = 18.90, p < 0.001, Cohen’s d = 0.92].
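These comparisons appear to be one-sample tests against published norms (our reading; the exact procedure is not stated). A sketch of that computation:

```python
import numpy as np
from scipy import stats

def compare_to_norm(scores: np.ndarray, norm_mean: float):
    """One-sample t-test of group scores against a published normative
    mean, with a one-sample Cohen's d."""
    t, p = stats.ttest_1samp(scores, popmean=norm_mean)
    d = (scores.mean() - norm_mean) / scores.std(ddof=1)
    return t, p, d
```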

Stability of CFMT+ and MFMT normative test scores across recruitment methods

To assess whether the lower performance of MTurk participants was specific to the GFMT2, we recruited three additional groups. We recruited two groups from MTurk—one that could use any device to complete the tests and another that was limited to participants using a desktop or laptop computer42. An additional group of Prolific participants was recruited and limited to using desktop or laptop computers. All groups completed the Cambridge Face Memory Test—Extended Version (CFMT+ 14), which measures the ability to remember unfamiliar faces, and the Models Face Matching Test (MFMT39), which measures the ability to perceptually match the identity of images showing faces of unfamiliar male models.

Table 2 shows average test scores. The top three rows show accuracy when the same general exclusion criteria as for the GFMT2 participant groups were applied. Consistent with the GFMT2 results, MTurk participants scored approximately 10 percentage points lower than Prolific participants, and there was no significant difference between participants who used any device and those who used only computers (see Supplementary Material 3 for statistical comparisons).

Table 2 CFMT+ and MFMT scores for different web-based participant samples and exclusion criteria.

For the tests in Table 2, we also included attention checks designed to assess participants’ engagement with the task. First, participants completed three practice trials using easily recognisable cartoon faces (The Simpsons characters). Additionally, free-text responses to demographic questions regarding country of birth and the number of countries lived in were manually reviewed. We then excluded participants who failed our attention check criteria, by either failing to correctly answer all three practice questions or providing inappropriate or incorrect responses to demographic questions (e.g., entering a year instead of a country).
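The review of free-text answers was manual; purely as an illustration, a simple pre-screen could flag obviously invalid country responses before human review (the rule below is our assumption, not the authors' procedure):

```python
import re

def plausible_country(response: str) -> bool:
    """Heuristic flag for free-text 'country of birth' answers: rejects
    empty strings and anything containing digits or other non-name
    characters (e.g., a year entered instead of a country)."""
    text = response.strip()
    return bool(text) and re.fullmatch(r"[A-Za-z .'\-]+", text) is not None

responses = ["Australia", "1987", "uk", ""]                    # example answers
flagged = [r for r in responses if not plausible_country(r)]   # ['1987', '']
```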

The effectiveness of including these attention checks is shown in the bottom three rows of Table 2. The attention checks led to a 7% improvement in CFMT+ scores, but little change in MFMT scores, despite removing over 60% of MTurk participants from the sample. Applying the same criteria to Prolific led to fewer exclusions (22%) and no change in average scores, suggesting that low engagement contributed to the lower scores of MTurk participants. For more details on these participant groups and exclusion criteria see “Methods” (“Participant cohorts, procedure and exclusion criteria”).

Table 3 Spearman’s correlations between performance on face recognition tests for UNSW Students. (n = 92; ***p < 0.001).

Psychometric properties of GFMT2 subtests

Given we developed the GFMT2 using MTurk participants, the observed discrepancies in MTurk results prompted us to reevaluate the psychometric properties of the GFMT2-S and GFMT2-H, including: (i) test–retest reliability, (ii) internal reliability, (iii) convergent validity, (iv) sensitivity to participant age, and (v) diagnostic value of subtests for identifying super-recognisers.

Test–retest reliability

High test–retest reliability for the GFMT2-S (r = 0.774) was reported in40 but was not measured for the GFMT2-H. In our study, 80 UNSW students and 713 UNSW Face Research Registry (UNSW-FRR) participants completed both tests twice, with UNSW students having a similar retest interval to40 (M = 7.3 days; SD = 2.16), and UNSW Face Research Registry participants having a longer interval (M = 50.0 days; SD = 6.08).

As shown in Fig. 1, test–retest correlations were high for both tests in both groups, with slightly lower correlations in the UNSW-FRR sample (GFMT2-S Spearman’s rho = 0.734; GFMT2-H Spearman’s rho = 0.675) than the UNSW student sample (GFMT2-S Spearman’s rho = 0.801; GFMT2-H Spearman’s rho = 0.714), presumably due to the longer retest interval. Nevertheless, high reliability over 7 weeks provides compelling evidence that individual differences persist over extended periods and can be reliably measured by GFMT2 subtests.

Fig. 1
figure 1

Correlations between accuracy at time 1 and time 2 for the GFMT2-S (A) and GFMT2-H (B). The orange shaded region around each regression line represents the 95% confidence interval of the regression estimate.
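A minimal sketch of this reliability computation on simulated accuracy data (the variable names and noise level are illustrative, not the study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
time1 = rng.uniform(0.6, 1.0, size=80)                        # session 1 accuracy
time2 = np.clip(time1 + rng.normal(0, 0.05, size=80), 0, 1)   # session 2 accuracy

rho, p = stats.spearmanr(time1, time2)  # rank-order test-retest reliability
```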

Internal reliability

We calculated internal reliability of the GFMT2 subtests, in addition to the CFMT+ and MFMT, using Cronbach’s alpha. This showed good internal reliability for all subtests in Prolific samples (GFMT2-S: 0.851; GFMT2-L: 0.738; GFMT2-H: 0.841). These scores were higher than for the CFMT+ in our Prolific sample (0.520) and comparable to the MFMT (0.888). We note, however, that this estimate of CFMT+ internal reliability is markedly lower than that reported in previous publications (e.g. 0.9 in22). Interestingly, despite the data quality issues associated with our MTurk sample noted above, the measures of internal reliability computed from this sample were consistently higher for all tests (GFMT2-S: 0.929; GFMT2-H: 0.910; CFMT+ : 0.838; MFMT: 0.929).
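For reference, Cronbach’s alpha can be computed directly from an item-level accuracy matrix; a minimal sketch:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants x n_items) matrix of
    item scores (here, 0/1 accuracy per trial)."""
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()  # summed per-item variance
    total_var = items.sum(axis=1).var(ddof=1)       # variance of total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)
```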

Convergent validity

We computed Spearman’s correlations between GFMT2 subtest scores and three standardised face identification tests (UNSWFT, CFMT+ , GFMT). Tables 3 and 4 show strong convergence between the GFMT2 and GFMT, with slightly lower correlations for memory-based tasks (CFMT+ , UNSWFT). Interestingly, for the UNSW students tested in-person this pattern is especially pronounced, with a significantly higher correlation between the GFMT2 subtests and the GFMT than between the GFMT2 and the CFMT+ [GFMT2-S: Fisher’s z-test = 2.66, p = 0.004; GFMT2-H: Fisher’s z-test = 3.20, p = 0.001]. These same contrasts were non-significant for the UNSW-FRR participants.

Table 4 Spearman’s correlations between performance on face recognition tests for UNSW Face Research Registry participants (n = 687; *** p < 0.001).

We speculate this difference relates to the relative homogeneity of the UNSW student sample in terms of age and testing conditions. Controlling for these sources of variation in accuracy may have provided greater precision in isolating variance attributable to the processing differences between the CFMT+ and GFMT tests.
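Because both correlations share the GFMT2 scores from the same sample, these Fisher’s z comparisons would typically use a dependent-correlations variant; a sketch following Meng, Rosenthal and Rubin (1992), which we assume rather than know to be the exact procedure used (the paper also does not state the tail of the test):

```python
import numpy as np
from scipy import stats

def compare_dependent_rs(r1: float, r2: float, r_x: float, n: int):
    """Compare corr(GFMT2, GFMT) with corr(GFMT2, CFMT+) measured in one
    sample; r_x is the correlation between GFMT and CFMT+ scores."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)           # Fisher r-to-z transform
    rbar_sq = (r1**2 + r2**2) / 2
    f = min(1.0, (1 - r_x) / (2 * (1 - rbar_sq)))
    h = (1 - f * rbar_sq) / (1 - rbar_sq)
    z = (z1 - z2) * np.sqrt((n - 3) / (2 * (1 - r_x) * h))
    return z, 2 * stats.norm.sf(abs(z))               # two-tailed p
```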

Sensitivity of test scores to participant age

We examined age-related accuracy patterns by aggregating test scores from all cohorts. Figure 2 shows a pattern consistent with prior research21,43, with peak accuracy estimated at age 36 for both GFMT2-S and GFMT2-H. A formula for age-correcting scores is provided in Supplementary Material 4.

Fig. 2
figure 2

Average accuracy for each participant age on the GFMT2-S (left) and GFMT2-H (right) where the size and shade of each data point shows the number of participants in each age group. A formula for age-correcting scores is provided in Supplementary Material 4.
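The age-correction formula itself is given in Supplementary Material 4; as a sketch of how a peak age could be estimated from such data, assuming a quadratic model of accuracy over age (simulated values, not the published data):

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.integers(18, 70, size=500)
acc = 0.85 - 0.00005 * (age - 36) ** 2 + rng.normal(0, 0.05, size=500)

b2, b1, b0 = np.polyfit(age, acc, deg=2)  # quadratic fit of accuracy on age
peak_age = -b1 / (2 * b2)                 # vertex of the fitted parabola
```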

Diagnostic value of the GFMT2 for identifying super-recognisers

The original GFMT has been used extensively to test staff in applied settings, with a view to selecting high performers for specialist face identity roles (44,45; for a review see46). It is also used as part of test batteries to identify ‘super-recognisers’ in research settings (e.g.,15,47). However, it suffers from high average accuracy, making it poorly suited for this purpose40. Because of this, we specifically designed a subtest of the GFMT2—the GFMT2-High—to be more challenging than the primary GFMT2-S subtest, and therefore better calibrated for identifying high performers.

We tested the effectiveness of the GFMT2-S and GFMT2-H in discriminating super-recognisers from a standard participant group. Super-recognisers were the sample of UNSW Face Research Registry super-recognisers featured in the convergent validity analyses above (n = 97; 58 female, 37 male, 2 prefer a different term; Mage = 39.3, SD = 9.69, 1 missing age), who scored at least 1.7 standard deviations above established norms on each of three standard face recognition tests: UNSW Face Test (UNSWFT21), Glasgow Face Matching Test (GFMT41), and the Cambridge Face Memory Test – Extended Version (CFMT+ 14). The standard participant group combined the Prolific participants and UNSW students who completed the GFMT2-S and GFMT2-H (n = 192; 107 female, 85 male; Mage = 30.8, SD = 15.6). We did not include the UNSW-FRR cohort here because they self-selected for participation in the tests and show above-average accuracy. For completeness, we report the same analysis using this group in Supplementary Material 5.

Super-recognisers scored substantially higher on the GFMT2-S (M = 95.7%, SD = 3.15%) and the GFMT2-H (M = 91.5%, SD = 4.81%) compared to the standard participant group (GFMT2-S: M = 83.8%, SD = 7.86%; GFMT2-H: M = 74.9%, SD = 9.79%). To quantify how well each GFMT2 subtest discriminated between super-recognisers and the standard participant group, we used the Area Under the Receiver Operating Characteristic curve (AUC), which measures the extent to which test scores can differentiate between two classes (i.e., super-recognisers vs standard participant group). The larger the AUC score, the better the face matching test is at differentiating between the groups. The GFMT2-H had slightly higher discriminative power (AUC = 0.9368) compared to the GFMT2-S (AUC = 0.9297). Combining the GFMT2-S and the GFMT2-H scores produced even higher diagnostic value (AUC = 0.949). Overall, this analysis shows high diagnostic value of the GFMT2 subtests for categorising super-recognisers, with the GFMT2-H providing an efficient alternative where short test sessions are desirable.
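A sketch of this AUC analysis on simulated scores matching the group means and SDs above (how the two subtests were combined is not stated; a simple sum is our assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = np.concatenate([np.ones(97), np.zeros(192)])  # 1 = super-recogniser

scores_s = np.concatenate([rng.normal(95.7, 3.15, 97), rng.normal(83.8, 7.86, 192)])
scores_h = np.concatenate([rng.normal(91.5, 4.81, 97), rng.normal(74.9, 9.79, 192)])

auc_s = roc_auc_score(y, scores_s)                    # GFMT2-S alone
auc_h = roc_auc_score(y, scores_h)                    # GFMT2-H alone
auc_combined = roc_auc_score(y, scores_s + scores_h)  # combined subtests
```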

Discussion

We found systematic differences in face identity processing test scores collected on Amazon Mechanical Turk (MTurk) compared with other web-based and in-person samples. On average, MTurk participants performed approximately 10 percentage points below mean accuracy established from other participant groups. This difference was observed for standard tests of both face memory and perceptual matching. Using stricter data-cleaning processes that screened for inattentive participants brought some average test scores closer to those observed in other cohorts, but at the cost of removing around 60% of the sample.

The reasons for this are uncertain, but the pattern is consistent with a more general decline in the quality of MTurk participant data since 201811. It is possible that some of this is attributable to non-human ‘bot’ participants48. However, our screening procedures for all testing included reCAPTCHA, and subsequent tests used manual checking of qualitative responses and catch trials that have been shown to be effective in identifying bots. This suggests that differences in accuracy are due to differences in human respondents. These differences might include motivation, task engagement, testing conditions, or underlying differences in the ability level of MTurk participants.

Overall test completion times showed that MTurk participants spent substantially longer than Prolific participants, which may suggest they were completing the task in parallel with other tasks or in multiple sessions (see Supplementary Material 6; Tables S5-6). The GFMT2 is self-paced, and response time is collected automatically in the desktop versions of the test, but trial-level response time data were not collected in our online study due to limitations of the Qualtrics survey software. Researchers may wish to monitor response times in future online testing, and so we have since created new online versions of the GFMT2 that collect these data and can be shared with researchers on request.

Other studies have found lower accuracy of MTurk participants on face identity processing tasks. In a recent study, GFMT2-S accuracy was 75%35, which is in line with the normative measures reported for the MTurk sample used in the initial GFMT2 publication40. This study also reported lower accuracy in MTurk participants for the Expertise in Facial Comparison Test than had been reported in a group of university students tested in-person (White et al. 2015). Interestingly,35 also reported higher scores for MTurk participants on the PI-2049, a self-report measure of face recognition difficulties. This suggests MTurk participants also believe they have poorer-than-average face recognition ability. MTurk workers score markedly higher than the overall population on scales that are associated with poor face recognition ability (see50 for a review), such as those measuring traits of social anxiety51 and autism52,53. This might suggest that poorer accuracy in MTurk participants is due to differences in this cohort that extend beyond simple motivation.

Our results also provide a more complete set of psychometric tools for researchers who use the GFMT2 to measure unfamiliar face matching ability. Normative data are now available for all subtests (GFMT2-S, GFMT2-H, GFMT2-L); however, our results clearly show that MTurk is not suitable for estimating normative accuracy on face identity processing tests. Normative test scores acquired from MTurk could lead to underdiagnosis of impaired ability and overestimates of the prevalence of people with extremely high ability. Given the relative heterogeneity of the Prolific sample, in addition to evidence of high task engagement, we suggest researchers adopt these scores as normative measures of test performance on the GFMT2-S (M = 82.9%; SD = 7.46%), GFMT2-H (M = 73.9%; SD = 9.56%) and GFMT2-L (M = 90.4%; SD = 6.67%). Where appropriate for diagnostic purposes, individual scores on the GFMT2-S and GFMT2-H can be adjusted for age using the formulas found in Supplementary Material 4.

We found the GFMT2-S and GFMT2-H were highly reliable and valid measures of unfamiliar face matching ability. Test–retest reliability remained high even when follow-up tests were conducted six weeks or more after the initial test. Convergent validity analyses show the tests correlate highly with other tests of face identity processing ability. There was also evidence of some discriminant validity with face memory tasks, which is consistent with studies showing face memory and matching tasks rely on somewhat different abilities (e.g.41,54,55,56,57). This suggests that GFMT2 subtests are a complementary tool for use alongside face memory tests to provide an overall picture of face identity processing ability.

We assessed the value of the GFMT2-S and GFMT2-H in discriminating super-recognisers—identified in our prior work (e.g., see21,47)—from standard performers. We found both tests were able to classify super-recognisers with high diagnostic accuracy, and so both will be useful in applied settings where super-recognisers are selected for specialist face identity processing roles45,58,59,60, or for theoretically motivated studies of the underlying perceptual mechanisms47,61,62,63,64,65,66. Although the best results were found when combining the GFMT2-S and the GFMT2-H, we also found the 40-item GFMT2-H was slightly better than the 80-item GFMT2-S at identifying super-recognisers, suggesting it is a more suitable tool for short test sessions aimed at finding super-recognisers.

Future studies could aim to determine optimal uses of the GFMT2-S and GFMT2-L for identifying individuals at the opposite end of the ability spectrum who have impairments in face identity processing. Given that impaired perceptual encoding of face identity appears to be a key deficit in developmental prosopagnosia32, it is important to include perceptual matching tasks in diagnostic testing. Our analysis of psychometric properties shows the GFMT2 is able to reliably measure—and specifically target—perceptual identity processing ability, supporting its inclusion in comprehensive test batteries of face identity processing ability.

More generally, our results show that normative scores on popular cognitive measures of face processing ability vary across different participant cohorts. Some of this variation is attributable to differences in participant demographics, for example age, and some can be removed by screening participant responses with quality control measures. But normative scores are also likely to covary with factors that we have not measured or controlled here. Further, our study has focussed on differences between lab-based testing of undergraduate students and online testing, but it is also likely that accuracy of undergraduate samples differs across different test sites. Together this suggests that researchers should view published normative scores on tests of face identity processing ability not as properties of the test alone, but in the context of the specific cohort that was tested.

Methods

The studies reported were all approved by UNSW Human Research Ethics Advisory Panel. As a condition of this ethical approval, informed consent was obtained from all subjects and all methods were performed in accordance with the relevant guidelines and regulations.

Tests of face identity processing ability

Glasgow Face Matching Test 2 (GFMT240)

The GFMT2 is an expanded version of the original GFMT41 designed to assess unfamiliar face matching ability. It comprises three subtests created from a pool of 300 pairs of face images: the 80-item GFMT2-Short (GFMT2-S), the GFMT2-High (GFMT2-H), and the GFMT2-Low (GFMT2-L). Each pair consists of a high-quality frontal image and either a same-identity or different-identity face. The difficulty of the pairs varies by the type of variation between images: rigid (e.g., head angle), non-rigid (e.g., expression), or subject-to-camera distance. For this study, we focus on the psychometric properties of the GFMT2-S and GFMT2-H, although we also provide normative test scores for the GFMT2-Low (GFMT2-L; see Table 1). Further details of the GFMT2 subtest development and example face pairs are provided in the original publication40.

Glasgow Face Matching Test (GFMT41)

The format and design of the original GFMT are the same as for the GFMT2, and we included the GFMT in this study to assess convergent validity of the GFMT2. Test images are frontal, with a neutral expression and consistent subject-to-camera distance, making the task somewhat easier than the GFMT2. The most widely used version of this test contains 40 items testing unfamiliar face matching ability (20 match, 20 non-match).

Cambridge Face Memory Test—Extended Version (CFMT+ 14)

The CFMT+ evaluates memory for unfamiliar faces. Participants learn a series of faces and later identify the learned faces from arrays containing distractors. As the test progresses, visual noise is added to obscure the faces, thus increasing task difficulty. In total there are 102 test items.

UNSW Face Test (UNSWFT21)

The UNSWFT is a challenging test of general face identity processing ability that was originally developed as a screening test for high-performing ‘super-recognisers’. Again, we included this in the battery of tests used to assess convergent validity of the GFMT2. It consists of two tasks which are completed in a fixed order. The first task is a standard recognition memory paradigm where participants memorise studio-quality face images and are later asked to recognise these faces in social-media-style photos. The second task is a match-to-sample sorting task where participants memorise a face image presented for 3 s. Immediately after the face image disappears from the screen, they must sort a ‘pile’ of new, unseen face images as either belonging or not belonging to the identity they just saw.

Models Face Matching Test (MFMT39)

The MFMT uses 90 pairs of face images of White male models, half of which are same-identity pairs. Participants judge whether each pair represents the same person or different people. The MFMT is designed to be a challenging test of face matching and so the face images are unconstrained and contain a lot of natural variability (e.g., head angle, lighting).

Participant cohorts, procedure and exclusion criteria

Amazon Mechanical Turk participant groups

Mechanical Turk was used for the initial GFMT2 test development and collection of normative test data in40. As in this prior work, we selected participants based on a screening procedure aimed at ensuring high-quality data. Eligible workers had to achieve a HIT approval rate of over 99%, indicating a strong track record of submitting high-quality work. Additionally, they were required to have completed more than 100 HITs to confirm sufficient experience with MTurk norms and expectations, and a reCAPTCHA security check was administered on Qualtrics to confirm human participation. This screening procedure was applied to all MTurk samples described below.
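For researchers replicating this screen, these worker requirements map onto MTurk’s system qualifications; a sketch using the AWS boto3 client (how the authors configured their HITs is not stated, so this is illustrative only):

```python
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

qualification_requirements = [
    {   # HIT approval rate over 99%
        "QualificationTypeId": "000000000000000000L0",  # PercentAssignmentsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [99],
    },
    {   # more than 100 approved HITs
        "QualificationTypeId": "00000000000000000040",  # NumberHITsApproved
        "Comparator": "GreaterThan",
        "IntegerValues": [100],
    },
]
# Passed to mturk.create_hit(..., QualificationRequirements=qualification_requirements)
```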

The first MTurk sample comprised 99 participants who completed the GFMT2-S and GFMT2-H. After excluding participants for pressing the same response key repeatedly (n = 2) or performing below chance (n = 6), 91 participants remained (Mage = 31.8, SD = 8.40; 28 female, 63 male). Performance of this group is presented in Table 1.

A second sample of 106 participants completed the CFMT+ and MFMT. This sample was not restricted by device type and could complete these two tests using any device, including mobile devices. Participants were excluded from analysis if they repeatedly pressed the same or different response buttons for at least one of the tests (n = 1) or performed below chance on the CFMT+ or the MFMT (n = 10), resulting in a final sample of 95 participants (Mage = 35.4, SD = 11.3; 31 female, 64 male). This sample is referred to as ‘Amazon MTurk (all devices)’ in Table 2.

The third sample of 98 MTurk participants completed the CFMT+ and MFMT using computers only (i.e. laptop and desktop PCs), with exclusions due to repeated pressing of the same button (n = 4) or below chance accuracy (n = 18) resulting in a final sample of 76 (Mage = 30.8, SD = 3.59; 21 female, 55 male). This sample is referred to as ‘Amazon MTurk (computer only)’ in Table 2.

In addition to the general exclusion criteria detailed above, we also identified groups of individuals who failed strict attention checks from both the Amazon MTurk (all devices) and Amazon MTurk (computer only) samples, to create subsamples including high-quality data only. These attention checks involved passing catch trials in both the CFMT+ and MFMT, as well as a strict manual review of free-text responses (e.g., country of birth, number of countries lived in). In the Amazon MTurk (all devices) sample, 27 participants failed the attention checks, resulting in a strict sample of 68 participants (Mage = 35.5, SD = 12.0; 22 female, 46 male). In the Amazon MTurk (computer only) sample, 41 participants failed the attention checks, resulting in a strict sample of 35 participants (Mage = 29.8, SD = 4.5; 12 female, 23 male). These samples are denoted by the label ‘ + attention check’ in Table 2.

Prolific participant groups

Prolific was used to establish normative data as it is a commonly used online research platform and is reported to have better data quality than MTurk on a range of measures (e.g., attention, reliability, comprehension, honesty6). All Prolific participant groups had to meet screening criteria, including using a computer device only and residing in the UK. The same basic exclusion criteria used for the Amazon MTurk samples were applied (i.e., removing participants who repeatedly pressed the same key or scored below chance).

One sample of 100 participants completed the GFMT2-S and GFMT2-H, with no participants failing basic exclusion criteria (Mage = 41.0, SD = 15.2; 47 female, 53 male). We recruited a separate sample of 100 Prolific participants to complete the GFMT2-L with no participants failing basic exclusion criteria (Mage = 43.6, SD = 13.5; 55 female, 44 male, 1 prefers a different term). Finally, a third group of 100 participants completed the CFMT+ and MFMT (Mage = 44.8, SD = 14.0; 46 female, 54 male). This sample is referred to as ‘Prolific (computer only)’ in Table 2. Additionally, we applied the same strict attention checks as used for the Amazon MTurk samples, resulting in the exclusion of 22 participants. This sample is referred to as ‘Prolific (computer only) + attention check’ in Table 2 (n = 78; Mage = 45.6, SD = 14.3; 38 female, 40 male).

UNSW Face Research Registry (UNSW-FRR; online testing group 1)

Research volunteers have the option to join this participant registry after completing the UNSW Face Test (www.unswfacetest.com). People typically complete the UNSW Face Test via weblinks posted on media coverage of our research on super-recognisers (e.g., see67). As a result, this cohort displays a self-selection bias towards higher-than-average scores on standardised tests of face identity processing ability21,22,68. While they are instructed to complete the study using a computer, there are no technical restrictions preventing the use of other devices.

The UNSW-FRR cohort comprised 1393 individuals who had responded to our recruitment email and completed both the GFMT2-S and GFMT2-H. Participants were excluded from analysis if they repeatedly pressed the same or different response buttons for at least one of the subtests (n = 3), or performed below chance on the GFMT2-S or GFMT2-H (n = 2), resulting in a final sample of 1388 participants (890 female, 480 male, 10 prefer a different term, 4 prefer not to answer, 4 individuals with missing data; Mage = 46.8, SD = 13.9, 8 individuals with missing age data).

To assess the test–retest reliability of the GFMT2-S and GFMT2-H, we contacted participants who completed these tests in the first session and invited them to repeat the test 6 weeks later, with 725 participants completing the tests a second time. Participants were excluded from test–retest analysis if they repeatedly pressed the same button (n = 1), or if they did not have a valid time 1 score (n = 11). This resulted in a final sample of 713 participants (454 female, 251 male, 4 prefer a different term; Mage = 48.4, SD = 14.0).

To assess convergent validity of the GFMT2-S and GFMT2-H, we used existing performance data on three other standardised tests of face identity processing ability (UNSWFT, CFMT+ , GFMT). Approximately half of the UNSW-FRR participants who completed the first test session (n = 703) had completed all three of these other tests. After excluding participants who performed below chance on at least one of these standardised tests (n = 16), this left a final sample of 687 participants (Mage = 46.7, SD = 12.6, 5 missing values; 440 female, 241 male, 4 prefer a different term, 1 preferred not to answer, 1 missing value).

UNSW students (in-person testing)

Ninety-four first-year undergraduate psychology students from UNSW Sydney participated in return for course credit. Participants completed the study under experimenter supervision in our research lab on a desktop computer. Two participants were excluded from analysis due to technical issues while completing the study, resulting in a final lab sample of 92 participants (Mage = 19.8, SD = 4.31; 60 female, 32 male). We also applied the general exclusion criteria to this sample (i.e., repeatedly pressing the same button and/or below-chance accuracy); however, all participants passed these checks.

To assess test–retest reliability and convergent validity, participants in this cohort were invited to complete a second testing session one week after the first. Eighty-four participants returned to complete the second session. Four participants were excluded from analysis because of technical issues (n = 2), repeatedly pressing the same response key for at least one of the tests (n = 1), or scoring below chance on at least one of the tests (n = 1). This resulted in a final sample of 80 participants (Mage = 20.0, SD = 4.55; 51 female, 29 male).

In each session the participants completed four tests: GFMT2-S, GFMT2-H, GFMT, and CFMT+ . The order in which these four tests were completed was counterbalanced across participants, but each participant completed the tests in the same order in test sessions 1 and 2; a sketch of one such assignment scheme follows.
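The exact counterbalancing scheme is not specified; one simple deterministic assignment that cycles through all 24 test orders while keeping a participant’s order fixed across sessions might look like this:

```python
from itertools import permutations

TESTS = ["GFMT2-S", "GFMT2-H", "GFMT", "CFMT+"]
ALL_ORDERS = list(permutations(TESTS))  # 24 possible test orders

def test_order(participant_id: int) -> tuple:
    """Derive the order from the participant ID, so sessions 1 and 2 use
    the same order while orders are balanced across participants."""
    return ALL_ORDERS[participant_id % len(ALL_ORDERS)]
```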