Introduction

Impairment in cognitive functioning is a core marker of psychotic disorders such as schizophrenia1, which often emerges prior to psychosis2. Poorer performance on objective neuropsychological tests is more common in individuals at clinical high risk (CHR) for psychosis than in community controls (CC) and individuals with recent-onset depression3,4. Medium to large impairments in all seven Measurement and Treatment Research to Improve Cognition in Schizophrenia domains (i.e., processing speed, attention, working memory, visual and verbal learning and memory, reasoning and problem solving, and social cognition)5 are common in schizophrenia1, while small to medium impairments are more typical of CHR cohorts2. Additionally, deficits in visuospatial ability, various measures of executive functioning, motor functioning, olfactory identification, and premorbid and current IQ, are observed in CHR cohorts2,6. Cognitive impairment is, on average, more severe in those who transition to psychosis versus those who do not, and it contributes unique variance from verbal learning and processing speed paradigms to multivariate prediction models of transition2,7,8. Among CHR individuals, cognitive impairment is also associated with other unfavorable clinical outcomes, such as poorer functioning9,10 and persistence of the CHR state11,12.

Accordingly, cognition is a key domain measured as part of the Accelerating Medicines Partnership Schizophrenia (AMP SCZ) initiative13,14. AMP SCZ consists of two data collection research networks, the Psychosis Risk Outcomes Network (ProNET) and the PREdiction SCIENTific Global Consortium (PRESCIENT). It also involves a third network, the Psychosis Risk Evaluation, Data Integration and Computational Technologies: Data Processing, Analysis and Coordination Center (PREDICT-DPACC), responsible for quality assurance and control (QA/QC) and upload of these data to the National Institute of Mental Health (NIMH) Data Archive (NDA). The AMP SCZ program aims to develop tools for predicting outcomes among CHR individuals to fast-track preventive and effective treatments13,14. AMP SCZ involves public-private partnerships between the NIMH and the Food and Drug Administration (FDA) in the United States, the European Medicines Agency, industry (e.g., pharmaceutical and life sciences), and non-profit and other organizations (e.g., Wellcome). Through AMP SCZ, we can determine the suitability of cognitive markers for use in future clinical trials.

The primary aim of this paper is to describe the rationale, processes, considerations, and final harmonization of the cognitive battery within AMP SCZ. Specifically, we describe the selection of cognitive domains, measures, and timepoints through a consensus-based process that carefully considered (1) theoretical, (2) psychometric, and (3) practical factors (Table 1). This process led to the selection of the final harmonized cognitive battery (Table 2). Providing a rationale for and description of the harmonized cognitive battery is important because, after passing through the quality assurance and control (QA/QC) and data flow pipelines of the DPACC, these data will be publicly available to the scientific community, open source and in perpetuity, for use in future studies. Furthermore, we hope researchers embarking on similar projects will consider using this battery so that future datasets can be harmonized with it.

Table 1 Core guiding principles for selection of AMP SCZ harmonized cognition battery.
Table 2 AMP SCZ Cognition Battery and assessment schedule.

Work group structure and process

The assessment domains, cognitive tests, timepoints, and final harmonized battery were decided via several working group discussions involving representatives from all stakeholder groups. The cognition core working group comprised 28 members (see Supplementary Material for member list) representing several key AMP SCZ partners, including NIMH, FDA, industry (Janssen, Otsuka), the research networks (ProNET and PRESCIENT), and the DPACC. Members included experts in cognition and psychosis/schizophrenia. The meetings were co-chaired by the Cognition Team co-leads KA (PRESCIENT) and WS (ProNET). The group’s goal was to prioritize and harmonize the cognition battery for AMP SCZ. The group convened over 10 meetings from November 2020 to February 2021 to discuss and decide on the final battery via consensus. The initial task was to compare the batteries and timepoints originally proposed in the PRESCIENT and ProNET protocols and then progress to harmonization. Numerous factors were considered, as described below (see Table 1), and members of the group were invited to raise questions and present their views, evidence, and supporting data organically as meetings progressed. Published literature, individuals (e.g., investigators involved in related studies), and organizations (e.g., cognitive test publishers/developers) external to the group were also consulted where necessary to assist with the decision-making process.

Theoretical considerations

Domains of interest

The initial phase of the AMP SCZ project (i.e., prediction of CHR outcome) does not focus on testing specific hypotheses15. Rather, the focus was on the identification and selection of cognitive variables that reflect clinically relevant neuropsychological impairments in psychotic disorders that could be targeted in clinical trials for the prevention of schizophrenia onset, reduction in symptom severity, and other important clinical and functional outcomes, such as quality of life and functioning. Thus, the Cognition Team considered a wide range of specific neuropsychological functions, as well as premorbid and overall intellectual ability (i.e., premorbid and current IQ estimates). It also considered domains known to be impaired in CHR or previously shown to be associated with transition to psychosis or other outcomes, such as poor functioning (as described in the Introduction).

Cognitive functioning measured at ascertainment of CHR status is consistently reported to contribute to multivariate prediction models of later transition to psychosis and functional outcome10,16,17,18,19. However, there is limited consistency across prediction models in the cognitive domain(s) or variables that are predictive (i.e., a lack of replication). This likely reflects variability in the cognitive tests used, combinations of multivariate predictors within different models, variation in samples recruited across studies, and remaining questions about which aspects of illness are best predicted by cognitive variables. Nevertheless, the inclusion of specific cognitive measures was useful in the North American Prodrome Longitudinal Study (NAPLS2) risk calculator, which included tests of verbal learning and processing speed (Hopkins Verbal Learning Test – Revised, BACS Symbol Coding). The utility of cognitive measures in risk calculators has been replicated in independent CHR cohorts7,8, and shown promise cross-culturally20. Thus, we considered it important to measure processing speed and verbal learning alongside the measurement of several other neuropsychological domains. Assessing a range of cognitive domains also places key findings within the broader context of cognitive function.

Domains out of scope

Novel computational behavioral neuroscience measures that may be specifically related to dopamine dysregulation (the longest-standing bio-etiologic theory of schizophrenia21) and vulnerability to psychosis were also considered by the Cognition Team. These measures included predictive coding22,23, aberrant salience24, and reward processing and learning (effort, time, and reward discounting)25. These constructs have shown associations with negative and positive symptoms in studies of established schizophrenia26,27,28. There has been relatively less research on these constructs in CHR and in relation to psychosis transition, with primarily cross-sectional studies involving small samples conducted to date27,29,30,31,32,33,34. Further, the limited number of longitudinal studies have shown inconsistent associations between aberrant salience and reward learning (e.g., jumping to conclusions) and transition to psychosis35,36,37. Thus, while some of these measures have shown promising preliminary evidence of a relationship with psychosis, including the CHR syndrome38, we concluded that this evidence remains largely cross-sectional, mixed, and not widely replicated with respect to CHR and its outcomes, especially compared with the well-established neuropsychological literature. Further, the additional time needed to administer experimental measures was considerable, and inclusion could not be justified given the constraints imposed by the need to minimize participant burden.

Age and stage of development

It was also important that the tests employed were sensitive to the detection of cognitive abilities, their inherent heterogeneity, and developmental changes across adolescence and young adulthood, specifically ages 12–30 years. Of note, there is an asynchronous peak in the development of cognitive abilities across the lifespan, with earlier peaks observed for fluid than for crystallized intelligence39. Fluid intelligence is a family of adaptive cognitive abilities, such as learning, organization, processing speed, and problem-solving, often used for coping with novel challenges. Crystallized intelligence is a family of rule- and knowledge-based cognitive abilities often used for coping with familiar, overlearned, verbal and knowledge-based challenges. Abilities such as working memory, planning, processing speed, and verbal and visual learning tend to peak during adolescence and early adulthood. In contrast, verbal comprehension (crystallized) skills continue to develop well into the 40s and 50s39,40,41. Intellectual capacity remains “elastic” during the teenage years42 and is thus an important variable to capture longitudinally in at-risk mental states, such as CHR. To minimize the burden but still capture potential change in specific and general cognitive functioning, we decided that specific cognitive domains relevant to CHR would be assessed more often throughout the study, whereas IQ (general cognitive ability) would be assessed twice, at baseline and final follow-up.

Measurement of IQ

With respect to IQ, fluid and crystallized intelligence trajectories have been shown to be impacted differentially over the lifespan in people diagnosed with schizophrenia. Early stable impairment may occur in verbal/crystallized IQ, possibly with a further decline in later adulthood, while a pattern of increasing decline (or attenuated gain) is seen in fluid reasoning from childhood to adulthood43,44,45. Measuring only fluid reasoning as a proxy for general IQ was considered a practical approach. Still, we believed doing so would run the risk of over- or underestimating impairment in one type of intelligence or the other. Both fluid and crystallized intelligence are thought to contribute to general intelligence (g)43. Moreover, knowledge about changes in fluid and crystallized components of IQ during the CHR period remains very limited46, supporting the rationale to measure both longitudinally during the CHR phase and following transition. The Adolescent Brain Cognitive Development (ABCD) consortium took a similar rationale and approach47.

An estimate of premorbid cognitive ability was also considered valuable, albeit acknowledging the limitations in its measurement in this younger population. The most common methods for estimating premorbid cognitive ability are performance on a single word reading task and past academic performance or achievement48,49. Single word reading/recognition accuracy is conventionally assessed to estimate premorbid IQ given its relative resistance to decline in conditions affecting the brain48. Nevertheless, estimating “premorbid IQ” in adolescents during a dynamic period of cognitive development and who are often still in school, is quite different from measuring the same construct in adults who have completed their education and experience a mental health condition later in life. Furthermore, a range of neurodevelopmental reading difficulties, particularly in phonological processing and word reading, are observed in people with psychosis50,51, potentially confounding estimates of premorbid IQ. Finally, tests of reading accuracy/recognition are not universal, and in some languages, not applicable; for example, in languages that have few irregular words (e.g., Mandarin). Thus, it was accepted that a harmonized approach to assessing premorbid IQ could not be achieved across all study sites. We will investigate the comparability of different reading tasks used in the study and the influence of education and cultural factors on premorbid and current IQ to advance knowledge in this field.

Timing of assessments

Given that drug discovery/novel therapeutics is a key goal of the AMP SCZ initiative15, there was an interest in understanding the potential impact of short- and long-term treatment and disease state on cognitive performance over time and how the cognitive measurements may be used in clinical trials to inform treatment and to demonstrate improvements following treatment. To date, relatively few longitudinal studies of CHR have involved the assessment of cognition at more than two timepoints52. Thus, regular assessment of cognition, particularly close to ascertainment, but also over the 24-month study period, was considered important for a more fine-grained and dynamic observation of cognitive changes and how these relate to other biomarkers, environmental and genetic factors, and therapeutic interventions. We decided on a more frequent assessment of cognition in the 1st year, given that a high proportion of transitions occur within the 1st year of ascertainment53. Assessments are conducted at all timepoints for both transition and non-transition cases. Conducting cognitive assessments immediately post-transition was considered problematic since it will not coincide with the cognitive assessments conducted with non-transitioned and community control participants and adds considerable burden for an unwell participant.

Psychometric considerations

Practice effects

A key psychometric consideration was the repeatability of the tests, in particular their resistance to practice effects under repeated administration. Practice effects are observed on many cognitive tests, particularly those assessing attention, memory, and executive function54,55. The strongest practice effects tend to be observed in the early phase of frequent testing and thereafter plateau55,56. One strategy to mitigate practice effects is to administer a pre-baseline ‘practice’ round of testing55; however, an argument against this is that the first repetition shows the largest practice effect, so removing it may reduce the measure’s sensitivity. Attenuated practice effects are also associated with disease risk factors, later cognitive decline, and neurodegeneration biomarkers in cognitive disorders such as Alzheimer’s disease57. They are also common in schizophrenia58. Nevertheless, there are some tests where the practice effect is so strong that longitudinal assessment is not useful. For example, highly novel executive functioning tests that require learning a rule to solve a novel problem, such as the Wisconsin Card Sorting Test or analogs, are less useful for detecting change over time because once a person discovers the rule for solving the task, their performance dramatically improves59. One mitigation strategy that minimizes the impact of practice on serial cognitive assessment is the use of alternate forms60. Where possible, we included tests where alternate forms could be administered, such as the Short Penn List Learning Test from the Penn Computerized Neurocognitive Battery (PennCNB). The PennCNB also reduces practice effects by limiting exposure to the material using short-duration measures. Moreover, some tasks, such as motor, sustained attention, and working memory tasks, are less susceptible to practice effects and do not require alternate forms.

Test difficulty

One potential issue identified during the consensus harmonization process was the noted variability in performance on word list learning tasks across the NAPLS (North American) consortium studies61,62 and the Australian/European studies such as PACE40063 and NEURAPRO10,52. The Hopkins Verbal Learning Test – Revised version (HVLT-R), used in the NAPLS studies, involves learning and memorizing 12 words. In contrast, the Rey Auditory Verbal Learning Test (RAVLT) or BACS list learning task used in the Australian/European studies involves learning 15 words. Baseline performance on the RAVLT in the Australian PACE400 cohort suggested that a 12-item list learning task might lead to ceiling effects for up to 30% of participants, and in the NEURAPRO multi-site international trial, around 10% of participants would score above the ceiling for the HVLT-R. Thus, for the AMP SCZ study, we chose a suitably difficult list learning task (16 words) from the PennCNB that was sensitive enough to capture a broad range of abilities.
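The ceiling-effect reasoning above can be illustrated with a small sketch: given scores from a longer list (e.g., 15 words), estimate the share of participants who would already reach or exceed the maximum of a shorter list (e.g., 12 words). The function name and the toy scores are hypothetical and are not study data; how the original cohorts defined the relevant trial score is not stated here, so this is only an assumed operationalization.

```python
def proportion_at_or_above(scores, ceiling):
    """Share of participants whose score reaches or exceeds `ceiling`.

    `scores` holds one list-learning score per participant from the longer
    list; `ceiling` is the maximum attainable score on the shorter list.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= ceiling) / len(scores)


# Hypothetical scores on a 15-word list (illustration only, not study data)
toy_scores = [9, 12, 14, 10, 13, 8, 11, 15, 7, 10]

# Share of this toy sample that a 12-word list could not differentiate
share_at_ceiling = proportion_at_or_above(toy_scores, ceiling=12)
```

A large share at or above the shorter list's maximum would suggest that the shorter task cannot separate higher-performing participants, which is the argument for the longer 16-word task.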

Practical considerations

Several practical factors were also considered when developing the battery. Some of these considerations also overlap with psychometric considerations.

Language and culture

There is considerable language and cultural variability across the research networks. AMP SCZ includes 43 data collection sites located in 13 countries across five continents. Eight different primary languages are spoken across the sites (English, Cantonese, Mandarin, Danish, French, German, Korean, and Spanish). This cultural and language variability posed a significant challenge for valid cognitive assessment, as most traditional cognitive measures were developed and normed in English-speaking Western/European countries. Test publishers are making strides in addressing this by including test items that are culturally relevant and establishing culturally appropriate norms. However, cross-cultural differences remain a potential bias in the measurement of cognitive functioning (for example, words may occur with different frequencies in different languages or have different meanings)64,65. Few cross-cultural cognitive tests or batteries are available, and those that exist were developed for assessing older populations and dementia65. To our knowledge, few if any previous studies of this magnitude have measured cognitive functioning simultaneously across so many countries, cultures, and languages. In AMP SCZ, a community control group recruited from all study sites completes the cognitive battery at baseline, which reduces the need to rely on norms that may be inappropriate, particularly if the controls are well-matched. It may also be feasible to compare existing PennCNB norms to norms generated by the community control group, particularly for English-speaking sites.

It is known that measuring cognition via tasks involving visual (rather than verbal) stimuli does not completely remove the potential for cultural bias in performance64. However, as a pragmatic means of meeting the study aims, we selected, where possible, visual-based measures that do not load heavily on language and are relatively easy to administer and score in different languages. Thus, we chose a repeated cognitive battery that predominantly includes tasks comprising pattern, shape, face, or numerical stimuli. The exception was the word list learning task, deemed essential because it is a well-accepted and widely used measure of verbal learning and memory within the psychosis field66 and is theoretically relevant, as described above. The other exceptions were the assessment of estimated premorbid and current IQ, which both involve language but were also deemed relevant, as described earlier. Nevertheless, since cognitive testing is sensitive to cultural/language differences, we will investigate how site and language differences within AMP SCZ may contribute to measurement variance in IQ and specific cognitive functions.

Tolerability

We also needed to strike a balance between comprehensive assessment across relevant cognitive domains and participant and assessor burden. Thus, our goal was to keep the repeatable battery to ~30 min, with supplementary global measures only included at baseline and final follow-up. For this reason, some longer tests (i.e., >10 min) of olfactory identification, problem solving, and several social cognition domains (i.e., theory of mind, social perception, attribution style) were not included in the core battery.

Remote testing

The Cognition Team had to consider how to proceed during the COVID-19 pandemic. Thus, remote test administration capability was critical to optimize data collection and participant engagement and minimize missing data. Online neuropsychological assessment paradigms show increasing evidence of reliability and validity both before and during the pandemic67. A range of tests/batteries that could be reliably administered online were explored in the harmonization process. In the final AMP SCZ battery, participants were able to complete remote testing from home, or if they did not have access to a computer, ‘remote’ testing within the research facility was conducted.

Mode of testing

The decision to use a computerized battery rather than more standard ‘paper-and-pencil’ measures was reinforced by several factors, including the timing precision of computerized measures, the standardization of test administration across study sites, fewer missing data, automated scoring, and the absence of scoring errors (e.g., arithmetic errors, out-of-range scores). As part of the decision-making process, we analyzed pilot data from a small group of participants in the NAPLS-3 study (n = 16), who received both the paper-and-pencil version of the Brief Assessment of Cognition in Schizophrenia (BACS)68 Symbol Coding and a PennCNB electronic version of the same test69. These data showed that both types of administration produced the same pattern of performance across the CC, the CHR participants who had not transitioned to psychosis, and the CHR participants who had transitioned to psychosis (i.e., CC > CHR non-transition > CHR transition), with similar effect sizes between the groups, providing additional validation for the use of the computerized measures.

AMP SCZ cognitive battery

The AMP SCZ cognitive battery is administered in a single session at each timepoint at a time of day that is suitable to the participant to minimize fatigue. It may be administered on the same or a different day to other assessment domains, as preferred by the participant to maximize data collection and adherence to the protocol.

Premorbid IQ

At English language sites, estimated premorbid IQ is assessed at baseline using the Wide Range Achievement Test – 5th Edition Word Reading subtest (WRAT5)70. However, some countries/languages do not measure single word reading accuracy/recognition in ways that allow for premorbid IQ estimates. As these tasks vary based on language, some sites use a local version of a single word reading/recognition task (e.g., French NART, Danish NART, Spanish Test de Acentuación de Palabras), the results of which can be harmonized with those from English-speaking sites. Other sites, whose languages use a different writing system (e.g., Korean, Cantonese, or Mandarin), cannot use a reading measure of premorbid IQ, and premorbid IQ is therefore not assessed at those sites.

Current IQ

Estimated current IQ is assessed at baseline and 104 weeks. At English language sites, the two-subtest version (Vocabulary and Matrix Reasoning) of the Wechsler Abbreviated Scale of Intelligence-Second Edition (WASI-II)71 is used, as it is normed across the study age range and was specifically developed for rapid assessment of IQ, particularly within research. Both the WRAT5 Reading task and WASI-II are administered online using Q-global, which is Pearson’s web-based system for administering cognitive assessments and was deemed ideal for assessing IQ remotely because the stimuli can be shown to participants onscreen via Zoom while the administrator records responses. The manual and norms tables can also be accessed online, which is economical for sites with multiple assessors. However, as the WASI-II is only available in English, non-English speaking sites administer, in their local language, the Vocabulary and Matrix Reasoning subtests of the Wechsler Adult Intelligence Scale–Fourth Edition (WAIS-IV)72 for participants aged ≥16 years and the Wechsler Intelligence Scale for Children–Fifth Edition (WISC-V)73 for participants aged <16 years. The primary outcome variable is IQ, with a mean of 100 and SD of 15. Individual subtest (i.e., Vocabulary and Matrix Reasoning) standard scores can also be derived. We note that, in this preliminary sample, obtained IQ scores were consistent with obtained WRAT-5 Reading scores (i.e., Premorbid IQ scores), and they were also well within the average range, though in the higher portion of it for both CC (112 ± 13.2) and CHR (109 ± 14.1). Similarly, they showed considerable variability, highlighting the need for future demographic analyses (e.g., site effects) to understand the generalizability of these findings.
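For readers less familiar with the IQ metric, a score on a scale with a mean of 100 and SD of 15 is a linear rescaling of a z-score relative to a normative sample. The sketch below shows only that arithmetic; the function names are hypothetical, and actual WASI-II, WAIS-IV, and WISC-V scores are derived from the publisher's norm tables rather than from this formula.

```python
def z_to_standard_score(z, mean=100.0, sd=15.0):
    """Express a z-score on the IQ metric (mean 100, SD 15)."""
    return mean + sd * z


def standard_score_to_z(score, mean=100.0, sd=15.0):
    """Express an IQ-metric score in SD units relative to the normative mean."""
    return (score - mean) / sd


# Group means reported above, expressed in normative SD units
cc_mean_z = standard_score_to_z(112)   # community controls
chr_mean_z = standard_score_to_z(109)  # clinical high risk
```

Expressed this way, both preliminary group means sit less than one normative SD above 100, consistent with the description of scores falling in the higher portion of the average range.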

Specific cognitive domains

The PennCNB74 is administered to assess specific cognitive domains at five timepoints: baseline and weeks 8, 26, 52, and 104. The PennCNB measures cognitive domains that are comparable to the MCCB, the BACS, and other cognitive batteries constructed by researchers to assess cognition in schizophrenia over the last 30–40 years. It is a reliable and valid suite of computerized cognitive tests that has been used globally and across the lifespan in both healthy individuals and various illnesses, including the psychosis spectrum and CHR mental states6,75,76,77,78,79. It was considered ideal for the aims of AMP SCZ because it addresses the theoretical, psychometric, and practical considerations described earlier. The PennCNB team worked closely with AMP SCZ to adapt their battery to the repeatability (i.e., alternate forms), language, training, and data capture requirements of the study.

For AMP SCZ, eight PennCNB tests measure the following cognitive domains (see Table 2; note that the Digit Symbol test is used to measure processing speed and, after its completion, memory for the Digit-Symbol pairings): verbal learning and memory (Short Penn List Learning Test; four alternate forms), sensorimotor ability (Motor Praxis Test), attention (Short Penn Continuous Performance Test), emotion recognition (Penn Emotion Recognition Test), working memory (Short Fractal N-Back Test), processing speed (Digit-Symbol Test; two alternate forms), relational memory (Digit-Symbol Recall; two alternate forms), visual memory (Short Visual Object Learning Test; two alternate forms), and motor speed (Short Computerized Finger-Tapping Test). This battery has been translated into eight non-English languages, including Cantonese, Mandarin, Danish, French, German, Korean, Italian, and Spanish, which means interpreters were not required, and the administration was uniform for all participants. Some items were changed if deemed culturally inconsistent, and all items were back-translated to ensure accuracy. The PennCNB takes 30–35 min. It is administered remotely via Zoom under the constant supervision of a trained assessor, who ensures that test results validly reflect the participant’s performance level. Assessors undergo standard training to address all aspects of test validity, ensuring that participants understand the task, are prepared and positioned to give their best effort, and are doing so in an environment that is free from distractions. Assessors provide all instructions and support as needed throughout and are trained in how to identify and report questionable test bouts. There are set procedures for training assessors and for training trainers, which include provision of automated and personal exposure to educational and case-specific material, with a certification process. These procedures have been useful in the present study, as in many other multi-center studies using the PennCNB across the globe, demonstrating high rates of acceptance and valid responding79,80,81,82,83,84. Numerous scores are generated from each test, including measures of speed and accuracy for most tests. A summary of the AMP SCZ cognition battery is presented in Table 2, and more detailed test descriptions are provided in Supplementary Table 1.

Training and quality assurance

Training and quality assurance in data collection are critical for cognition, as they are in other areas. All procedures and procedure changes are documented in a detailed Standard Operating Procedures manual. The PennCNB team has extensive experience and systems in place for the training and certification of assessors from around the globe, which is considered ideal for the cognition assessment needs of AMP SCZ. A system for training and certification of assessors in the IQ measures was developed by the cognition team co-leads, who are clinical neuropsychologists with considerable experience training and overseeing the assessment of IQ for research purposes. IQ and PennCNB data are continuously monitored by DPACC, and any errors identified are documented in a centralized tracker and relayed to the sites. The DPACC set up the ‘pipeline’ by which PennCNB data are transferred immediately from the PennCNB computers. Validation procedures for the PennCNB include auto-validation based on reaction time and accuracy, as well as manual validation based on rater comments. These comments are checked by the team leads as a form of QC. Moreover, the DPACC samples the IQ and WRAT-5 data entered in REDCap to check for errors stemming from manual data entry, including scores that were derived manually. Quality checks for IQ measures (including extreme scores) are also performed as part of the quarterly monitoring reports done by the DPACC clinical trial monitor. The team has bi-weekly meetings to discuss any QA/QC issues.

Preliminary baseline comparisons and estimated stability and reliability

Method

We performed a limited number of comparisons using R between the CHR and CC at baseline and between the baseline and 8-week timepoints. Our purpose was to determine whether our measures were performing in ways that were likely to be informative. Each participant provided oral and written informed consent. The project was approved by the governing institutional review board at each site and is registered at clinicaltrials.gov (NCT05905003). The variables in each test for baseline and longitudinal analyses were selected based on consensus between the cognition team leads (WS, KA). Accuracy was the measure used for all the assessments, including the correct or true positive responses, depending on the test. For intelligence quotient (IQ), the two-subtest Wechsler Abbreviated Scale of Intelligence, Second Edition (WASI-II) was used. The non-weighted composite score comprised the z-scored variables Short Penn Continuous Performance Test (SPCPTNL) true positive (TP), Penn Emotion Recognition Test (ER40) correct responses (CR), Short Fractal N-Back Test (SFNB) TP, Digit-Symbol Test (DIGSYM) CR, and Short Computerized Finger Tapping Test (SCTAP) dominant (hand). The Short Visual Object Learning Test (SVOLT) CR and the Short Penn List-Learning Test (SPLLT) were not included in the composite because they showed a different trajectory between the baseline and 8-week assessments, as described below.
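A minimal sketch of a non-weighted composite of this kind follows: each test variable is z-scored, and the z-scores are averaged per participant with equal weight. The reference sample for z-scoring (here, the pooled sample passed to the function), the use of the sample SD, the variable labels, and the toy values are all assumptions made for illustration; the study analyses were run in R, and their exact implementation is not described here.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize one test variable against the sample passed in (sample SD)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def unweighted_composite(scores_by_test):
    """Average the z-scored tests per participant, giving every test equal weight.

    `scores_by_test` maps a test label to a list of scores, with participants
    in the same order in every list.
    """
    z_by_test = [zscores(v) for v in scores_by_test.values()]
    return [mean(person) for person in zip(*z_by_test)]


# Hypothetical accuracy scores for three participants (not study data)
toy = {
    "SPCPTNL_TP": [50, 55, 60],
    "ER40_CR": [30, 34, 38],
    "SFNB_TP": [20, 25, 30],
    "DIGSYM_CR": [60, 70, 80],
    "SCTAP_DOM": [100, 110, 120],
}
composite = unweighted_composite(toy)
```

Because every test is placed on a common z-scale before averaging, tests with larger raw score ranges (e.g., finger tapping counts) do not dominate the composite.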

Follow-up assessments in cognition (and other domains assessed in the study) are among the primary goals of the study. Here, we assessed an interim portion of the CHR data, between the baseline and the next assessment. This was done for the tests that had follow-up data 8 weeks later (Table 2), as well as a composite score (as described above). We wanted to determine the initial stability and reliability of performance for each of the PennCNB tests in the absence of potential treatments as we consider the designs of future clinical trials to improve cognitive function or prevent its decline. We also wanted to determine whether cognitive performances would increase or decline without encountering floor or ceiling effects. An independent-samples t-test (age) and a chi-square test (sex) were used to compare demographic data between groups. Paired-samples t-tests and Wilcoxon signed-rank tests were used where appropriate to evaluate the group change between the two timepoints. Additionally, the intraclass correlation (ICC) was used to evaluate the reliability of the tests between baseline and the next assessment for CHR individuals, using the ICC function in the psych package in R v4.2. Due to a programming error affecting data acquisition from one test (SPLLT) at the 8-week follow-up, the follow-up data for that measure are not reported here. Baseline SPLLT data were not affected, however, and are included in group comparisons below.
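To make the test-retest ICC concrete, the sketch below computes a two-way random-effects, average-measures ICC (often written ICC(2,k)) from classical ANOVA mean squares, treating the two timepoints as the "raters". This is an illustration in Python rather than the study's R code; the psych package may estimate the variance components differently, so values need not match exactly. The function name and the toy scores are hypothetical.

```python
def icc2k(rows):
    """Two-way random-effects, average-measures ICC from ANOVA mean squares.

    `rows` holds one tuple per participant, with one score per session
    (e.g., (baseline, week 8)). Requires at least two participants and two
    sessions, and no missing values.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    subject_means = [sum(r) / k for r in rows]
    session_means = [sum(r[j] for r in rows) / n for j in range(k)]

    # Partition the total sum of squares into subjects, sessions, and residual
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_sessions = n * sum((m - grand) ** 2 for m in session_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_subjects - ss_sessions

    ms_subjects = ss_subjects / (n - 1)
    ms_sessions = ss_sessions / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_subjects - ms_error) / (
        ms_subjects + (ms_sessions - ms_error) / n
    )


# Hypothetical baseline and 8-week scores for six participants (not study data)
toy_pairs = [(40, 42), (45, 46), (50, 53), (55, 59), (60, 65), (65, 71)]
toy_icc = icc2k(toy_pairs)
```

In this form, a systematic shift between sessions (such as a practice-related gain) enters through the session mean square and lowers the coefficient, so the ICC reflects agreement across timepoints rather than rank-order consistency alone.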

Results

Sex did not differ between the groups. The CC were 1.8 years older (p < 0.001) than the CHR on average (Table 3). At baseline, the CHR group performed significantly lower than the CC group in WASI-II IQ (p = 0.01), WRAT-5 Reading (p = 0.045), ER40 (p = 0.01), DIGSYM (p < 0.001), SCTAP (p = 0.002), and the composite score (p < 0.001). No differences were observed in SPLLT, SPCPTNL, SFNB2, or SVOLT (Table 4). In the CHR group, there was a slight but significant increase in SPCPT (p < 0.001), ER40 (p < 0.001), SFNB2 (p = 0.02), DIGSYM (p = 0.001), and the composite score (p = 0.002) at the 8-week follow-up, and a decrease in SCTAP (p < 0.001) and SVOLT (p < 0.01) (Table 5). As for the ICC (Table 6), some of the tests demonstrated good reliability (0.75 ≤ ICC < 0.9) using the average random rater, including DIGSYM CR (ICC = 0.89), SPCPT TP (ICC = 0.79), ER40 CR (ICC = 0.75), and the composite score (ICC = 0.78). Others had moderate reliability (0.5 ≤ ICC < 0.75), such as the SPCPT (ICC = 0.69), SVOLT CR (ICC = 0.57), and SCTAP dominant (ICC = 0.50). One variable, the SFNB TP, showed lower stability (ICC = 0.47). The results are visualized in Figs. 1 and 2.
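The reliability bands used above can be written as a small lookup. The Python sketch below applies the two bands stated in the text (good: 0.75 ≤ ICC < 0.9; moderate: 0.5 ≤ ICC < 0.75) to a subset of the reported ICC values. The label "excellent" for values of 0.9 or higher is an assumption taken from common convention and is not stated in the text, and the function and variable names are hypothetical.

```python
def reliability_band(icc):
    """Map an ICC value to a descriptive reliability band."""
    if icc >= 0.9:
        return "excellent"  # assumed label; not used in the text
    if icc >= 0.75:
        return "good"       # 0.75 <= ICC < 0.9, as stated in the text
    if icc >= 0.5:
        return "moderate"   # 0.5 <= ICC < 0.75, as stated in the text
    return "low"            # below 0.5


# A subset of the ICC values reported in the Results.
reported = {
    "DIGSYM CR": 0.89,
    "Composite": 0.78,
    "ER40 CR": 0.75,
    "SVOLT CR": 0.57,
    "SCTAP dominant": 0.50,
    "SFNB TP": 0.47,
}

bands = {name: reliability_band(value) for name, value in reported.items()}
```

Note that ER40 CR (0.75) and SCTAP dominant (0.50) sit exactly on band boundaries, so their classification depends on treating the lower bound as inclusive, as the stated ranges do.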

Table 3 Demographic data.
Table 4 Baseline comparisons of the AMP SCZ Cognition Battery between the CHR and CC.
Table 5 Consistency of the AMP SCZ Cognition Battery in CHR between baseline and the following timepoint.
Table 6 Intraclass correlation in CHR between baseline and the following timepoint.
Fig. 1: Comparison between CHR and CC in the AMP SCZ Cognition Battery at baseline.

a current IQ, b verbal learning, c attention, d emotion recognition, e working memory, f processing speed, g visual memory, h motor speed, i estimated premorbid IQ, j PennCNB composite. WRAT-5 Wide Range Achievement Test, IQ Full Scale Intelligence Quotient, SPLLT Short Penn List Learning Test, SPCPTNL Short Penn Continuous Performance Test, ER40 Penn Emotion Recognition Test, SFNB Short Fractal N-Back Test, DIGSYM Digit-Symbol Test, SVOLT Short Visual Object Learning Test, SCTAP Short Computerized Finger Tapping Test. CHR Clinical High Risk, CC Community Control. Composite is the average of all tests except IQ, SPLLT, and SVOLT. Significance level “***“<0.001, “**“<0.01, “*“<0.05, “NS” > 0.05.

Fig. 2: Consistency and reliability of the AMP SCZ Cognition Battery between baseline and 8 weeks later for the CHR participants.

a attention, b emotion recognition, c working memory, d processing speed, e motor speed, f visual memory, g PennCNB composite. SPCPTNL Short Penn Continuous Performance Test, ER40 Penn Emotion Recognition Test, SFNB Short Fractal N-Back Test, DIGSYM Digit-Symbol Test, SVOLT Short Visual Object Learning Test, SCTAP Short Computerized Finger Tapping Test, CHR Clinical High Risk. Composite is the average of all tests except SVOLT. V1 Visit 1 (baseline), V2 Visit 2 (1 month after baseline). Significance level according to ICC, ““<0.001, ““<0.01, ““<0.05, “NS” >0.05. Significance level according to paired comparisons “***“<0.001, “**“<0.01, “*“<0.05, “NS” > 0.05, with blue color indicating an increase and red a decrease from baseline.

These interim assessments, made before the analysis of demographic contributory factors or identification of CHR who later transitioned to psychosis, suggest the following tentative observations:

1. Several individual tests, as well as the composite score, showed significantly higher performance in CC than in CHR at baseline (IQ, ER40, SCTAP, DIGSYM, and Composite), suggesting possible room for improvement in CHR through intervention.

2. ICC reliability varied, with most scores, including the Composite, showing good (DIGSYM, SPCPT, ER40, Composite) or moderate (SCTAP) reliability. One test showed low reliability (SFNB), and one test showed a significant and possibly meaningful decline (SVOLT). Notably, the SVOLT involved learning and included an alternate form at 8 weeks, raising the question of whether the baseline assessment created proactive interference in learning the new material at 8 weeks. This finding, if confirmed, may identify a sensitivity in CHR that could be amenable to improvement. The other low-reliability score (SFNB) showed a small decline statistically but was close to the moderate range. However, additional analyses may determine whether other factors contributed to the outcome, as hypothesized for the SVOLT.

3. ER40 and DIGSYM showed relatively good group separation and reliability, suggesting they may provide good targets for intervention.

4. The Composite, SPCPT, ER40, SFNB, and DIGSYM all showed a mixture of apparent practice and lack of practice effects in individual participants, suggesting problems with information processing but room for improvement with intervention.

These preliminary observations are encouraging in that they suggest several potential paradigms for treatment, such as using a composite score in addition to individual test performance scores, focusing on tests with high stability and good group separation from CC, and/or test performances with high levels of interference. These paradigms also highlight the importance of finding acceptable balances between cognitive vulnerability and behavioral stability.

Summary

The AMP SCZ cognition battery was designed to provide comprehensive coverage across a diverse array of domains previously demonstrated to be significant for the psychosis spectrum, including CHR. This novel cognitive assessment battery combines computerized and paper-and-pencil measures and provides sensitive, longitudinal (repeatable) measures used internationally across 43 recruitment sites. Although most sites are English-speaking, brief IQ estimates are measurable at English- and non-English-speaking sites due to the availability of the same or similar tests in different versions of the Wechsler intelligence scales. Similarly, the PennCNB tests are administered at all sites because many of the selected tests minimize language demands or use language in ways discrete enough for the stimuli to be translated (as in the SPLLT, which is a list-learning test). The cognitive battery is administered to CHR and CC participants across multiple languages and cultures while minimizing participant (and assessor) burden. The battery includes repeat assessment of both general and specific cognitive abilities before and over the period of transition to psychosis. It allows for the assessment of practice effects and of the effects of using alternate forms. Our initial observations of baseline and 8-week data suggest that CHR perform lower than CC on multiple measures. Moreover, CHR show a range of practice effects on several PennCNB measures and, in at least one learning test, a decline in performance associated with an alternate form at the 8-week follow-up. By contrast, CHR performances on several measures are reliable enough to combine them into a composite score. These performances show there is room for improvement. To our knowledge, this is the largest cognition data set for CHR and provides a unique opportunity to identify candidate cognitive variables of risk prediction and targets for novel therapeutics (see Box 1).

Future directions

The study of cognition in the development of psychotic disorders is important for at least two related reasons. First, it is an independent marker for current and subsequent functional outcomes, and together with other measures (e.g., clinical, demographic, psychosocial, and environmental), it aids in the prediction of transition to psychosis in CHR. Second, it is a potential target of intervention strategies. The potential for new knowledge from this study is substantial. Baseline analyses are still in the early stages but will include the effects of contributory factors on cognition. As noted previously, outside of the primary aims of the study, there is a unique opportunity to investigate the influence of demographic variables and of language and cultural factors on the assessment of cognition, and whether these contribute to measurement variance. AMP SCZ will obtain these cognitive measures in combination with a comprehensive set of multimodal variables, including environmental and genetic factors, detailed clinical assessments, and a range of biological domains, including neuroimaging, EEG, digital/actigraphy, and peripheral biomarkers. A comprehensive range of clinical (psychosis/non-psychosis) and functional outcomes will be measured. This overall dataset will provide an unprecedented opportunity to investigate and significantly advance knowledge of the determinants, specificity, and trajectory of cognition and associated outcomes in this population.