Introduction

Schizophrenia is a chronic and disabling psychiatric disorder characterized by disturbances in thought, perception, emotion, and behavior1,2. Schizophrenia has been posited to lead to cognitive impairments in several domains such as attention, memory, and visuospatial skills3,4,5. Despite decades of research, effective long-term management remains a challenge due to its heterogeneous nature and the variability in symptom presentation across individuals and contexts. Traditionally, clinical assessment of schizophrenia relies on infrequent, subjective evaluations conducted in clinical settings1,2. These assessments, while important, may fail to capture the dynamic fluctuations in cognitive impairment that occur in patients’ daily lives6,7,8.

In recent years, digital remote assessments offered on smartphones and computers have emerged as promising tools to address the limitations of traditional in-person cognitive tests9,10,11. Within schizophrenia research, digital remote assessments have been proposed as a means to enable the recruitment of larger and more diverse samples (e.g. from rural and remote areas) and of individuals who might have logistical (e.g. cost, transportation, availability of clinicians) or symptomatic (e.g. social avoidance or paranoia) issues that make in-person attendance difficult10,12,13,14.

Moreover, given that digital or smartphone-based assessment can be completed anytime and anywhere, such technology supports the advancement of brief assessments of cognition, otherwise known as ecological momentary assessments (EMAs)15,16. In the context of schizophrenia, where symptom expression is heterogeneous, multidimensional, and temporally dynamic, smartphone-based surveys and sensors offer an ecologically valid alternative to traditional cross-sectional neuropsychological assessments17,18,19,20.

Existing works have explored and demonstrated the preliminary utility of remote cognitive assessment as a research tool18,21,22,23. For instance, work with NeuroUX in a healthy general adult population has shown that remote cognitive assessments exhibit acceptable test-retest reliability and that factors such as an individual’s age and testing environment impact performance24,25. Although there are other apps in this space24,26,27, our work focuses on mindLAMP, as it offers a robust and extensible framework for integrating cognitive assessments and smartphone sensor data while also supporting care delivery22,28,29. mindLAMP cognitive assessments have been studied in schizophrenia9,30,31, Parkinson’s Disease32, mild/moderate cognitive impairment, and Alzheimer’s Disease19.

Given the relative nascency of this field, there is no predetermined or gold-standard manner of evaluating these digital cognitive assessments10,33. Whereas traditional assessments such as the MCCB tend to rely on either speed (e.g. the Trail Making Tests)34 or accuracy (e.g. BACS Symbol Coding)35 to score tasks, digital cognitive assessments have access to more specific and precise item-level data; mindLAMP, in particular, stores metadata such as duration for each user event, which generally corresponds to a tap on the phone screen. Within the existing literature, digital cognitive assessments are typically scored using quantitative performance metrics that map onto well-defined cognitive constructs, such as memory, attention, and processing speed10,33,36. Most research has focused on time-based or accuracy-based metrics. Within the time-based scoring paradigm, scores typically correspond to elapsed time34,37,38; examples include total completion time4, response time, and interresponse time (IRT) or latency38. Other works rely on accuracy-based metrics, where scores are determined by the rate of correct responses and are often directly mapped to standard neuropsychological outcomes (e.g. the number of correct matches in a symbol substitution task)39. However, existing work has shown that different scoring methods can yield significantly different results and outcomes10. A key gap in the literature is that no existing work has investigated composite metrics, which combine both speed and accuracy, to score digital cognitive assessments. Our work presents the first attempt to use a composite metric, the Rate-Correct Score (RCS), to score these assessments.

Currently, there is a lack of substantial research on the correlation between digital cognitive assessments and traditional in-person MCCB tests, or with affective states as measured using established clinical scales such as the Positive and Negative Syndrome Scale (PANSS). Building on previous work by Raje et al., which explored patient and clinician co-design of the cognitive assessments18, this study presents a global multisite pilot study of smartphone data in schizophrenia, drawing on data collected across geographically and culturally diverse regions3,22,40. This study aims to determine which metric is best suited to scoring the digital cognitive assessments and which cognitive assessments of the mindLAMP smartphone app may merit further study. The principal hypothesis of this paper is that scoring digital cognitive assessments as the number of correct responses per unit time will offer strong and significant correlations with MCCB scores. We have therefore opted for composite metrics, which account for both accuracy and time, as research has indicated that findings are stronger for composite scores10,41.

Results

Demographic and intersite comparison

Table 1 contains the demographic and clinical data for the study sample. There are no statistically significant differences among the study sites with regard to age (F = 0.131, p > 0.05), sex (χ2 = 9.567, p = 1), and education (χ2 = 46.952, p > 0.05). One participant in Boston did not complete the demographic survey.

Table 1 Demographic and clinical characteristics across study sites in the U.S. (Boston) and India (Bangalore and Bhopal).

Analysing the data across the three study sites, we found that the overall MCCB domain distributions differ significantly, with the two sites in India being more similar to one another than to the site in Boston. Table 2 presents the results of one-way ANOVAs for the MCCB domains: Working Memory (F = 0.940, p > 0.05), Verbal Learning (F = 0.876, p > 0.05), Visual Learning (F = 0.818, p > 0.05), and Social Cognition (F = 2.112, p > 0.05) did not differ significantly across the sites.

Table 2 One-way ANOVA Results for MCCB Domains.

Given that the MCCB scores for the Boston study site diverge significantly from those for the two sites in India, the MCCB analyses presented below primarily concern the two sites in India.

Engagement results

Of the 62 participants who enrolled in the study and downloaded mindLAMP across the three study sites, 6 dropped out and were excluded from all data analyses.

Throughout the study, in order of decreasing frequency, participants completed Cats and Dogs an average of 8.8 times (SD = 8.7; range = 1–33); Spatial Span an average of 8.7 times (SD = 8.0; range = 1–31); Balloon Risk an average of 7.4 times (SD = 7.9; range = 1–31); Symbol Digit Substitution an average of 7.0 times (SD = 7.8; range = 1–31); Spin the Wheel an average of 6.6 times (SD = 7.1; range = 1–31); Jewels A an average of 6.5 times (SD = 6.8; range = 1–31); Emotion Recognition an average of 6.4 times (SD = 6.4; range = 1–31); Jewels B an average of 5.5 times (SD = 6.2; range = 1–31); and Maze an average of 4.1 times (SD = 3.1; range = 1–13). Participants were assigned one or two cognitive assessments per day, such that each cognitive assessment was expected to be completed at least 4 times by the end of the study. The schedule for participants in Boston is shown in Table 3; the schedule for participants in India similarly assigned one or two assessments per day.

Table 3 Daily Schedule for mindLAMP Cognitive Assessments for Boston.
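
As an illustrative sketch only, summary engagement statistics of the kind reported above can be derived from a mindLAMP activity event export with pandas; the DataFrame and its column names (participant_id, activity) are hypothetical placeholders rather than the actual export schema.

```python
import pandas as pd

# Hypothetical export of completed mindLAMP assessments: one row per completion.
events = pd.DataFrame({
    "participant_id": ["p01", "p01", "p01", "p02", "p02", "p03"],
    "activity": ["Jewels A", "Jewels A", "Maze", "Jewels A", "Maze", "Jewels A"],
})

# Count completions per participant per assessment...
counts = (
    events.groupby(["activity", "participant_id"])
          .size()
          .rename("n_completions")
          .reset_index()
)

# ...then summarize across participants (mean, SD, and range, as reported above).
summary = counts.groupby("activity")["n_completions"].agg(["mean", "std", "min", "max"])
print(summary)
```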

Correlations between MCCB domains and mindLAMP cognitive assessment scores

Table 4 presents Bonferroni-corrected correlations between baseline MCCB domain scores (corrected for age, gender, and education) and baseline scoring metrics for the mindLAMP cognitive assessments. Of the five scoring metrics outlined in Section "Digital cognitive assessment scoring methods", the Rate-Correct Score has the strongest and most significant correlations with baseline MCCB domain scores. Importantly, the Rate-Correct Score for Jewels A correlates with both the Overall Composite Score (r = 0.597, p < 0.001) and the Overall Neurocognitive Composite Score (r = 0.537, p < 0.001). The Rate-Correct Score for Jewels A also correlates with Speed of Processing (r = 0.464, p < 0.05) and Working Memory (r = 0.454, p < 0.05). The Rate-Correct Score for Symbol Digit Substitution correlates with the Overall Composite Score (r = 0.532, p < 0.01), the Neurocognitive Composite Score (r = 0.530, p < 0.01), and Working Memory (r = 0.502, p < 0.05). The alternate score for Spatial Span, the number of correct responses, correlates with Attention/Vigilance (r = 0.534, p < 0.01), the Overall Composite Score (r = 0.505, p < 0.01), Speed of Processing (r = 0.499, p < 0.01), and the Neurocognitive Composite Score (r = 0.449, p < 0.05). Correlations between the mindLAMP assessments and the individual subtests of the MCCB can be found in Appendix B.

Table 4 Correlations between baseline MCCB domain scores and mindLAMP cognitive assessment scores for Bangalore and Bhopal.

Intraclass Correlation Coefficients (ICCs) for mindLAMP cognitive assessments

Table 5 presents ICCs assessing the test-retest reliability of the mindLAMP digital cognitive assessments. Test-retest reliability refers to the consistency of a test or measure over time; for traditional tests such as the MCCB, high reliability ensures that baseline and follow-up comparisons are meaningful42. As shown in Table 5, Balloon Risk (ICC = 0.664, 95% CI [0.570–0.764]), Jewels A (ICC = 0.568, 95% CI [0.466–0.684]), and Symbol Digit Substitution (ICC = 0.536, 95% CI [0.438–0.651]) exhibited moderate test-retest reliability. In contrast, the remaining assessments did not demonstrate acceptable test-retest reliability, possibly due to external factors discussed further in the Discussion. By way of comparison, in a NeuroUX study with 393 adults who completed each of the platform’s digital cognitive assessments five times over ten days, the ICCs ranged from 0.438 for a task testing visual working memory and processing speed to 0.912 for a task testing processing speed24.

Table 5 Intraclass Correlation Coefficients (ICCs) for RCS of mindLAMP cognitive assessments.

Mediation analysis

Using the data from the two sites in India, we performed an exploratory analysis to see whether sleep duration calculated from mindLAMP smartphone data mediates the relationship between mindLAMP survey scores for mood and anxiety and the Rate-Correct Scores for Jewels A and Jewels B. Intraindividual z scores (i.e. distance from the mean in units of standard deviation, calculated from each participant’s own distribution of data) were used for sleep, survey scores, and the Rate-Correct Scores. Given the small sample size, this analysis is primarily intended to suggest a future direction of interest for later studies.
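
A minimal sketch of the intraindividual z-scoring described above, assuming a long-format pandas DataFrame; the column names are hypothetical. Each observation is standardized against that participant's own mean and standard deviation.

```python
import pandas as pd

def intraindividual_z(df: pd.DataFrame, value_col: str,
                      id_col: str = "participant_id") -> pd.Series:
    """Standardize each observation against the participant's own distribution."""
    grouped = df.groupby(id_col)[value_col]
    return (df[value_col] - grouped.transform("mean")) / grouped.transform("std")

# Example with hypothetical sleep durations; the same helper would be applied
# to EMA survey scores and Rate-Correct Scores before the mediation models.
df = pd.DataFrame({
    "participant_id": ["p01", "p01", "p01", "p02", "p02", "p02"],
    "sleep_hours": [6.0, 7.5, 8.0, 5.0, 5.5, 6.5],
})
df["sleep_z"] = intraindividual_z(df, "sleep_hours")
```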

Anxiety EMA

Figure 1 visualizes the mediation diagrams with the anxiety EMA scores as the predictor. For Jewels A, none of the indirect (r = 0.026, p > 0.05), direct (r = 0.107, p > 0.05), or total (r = 0.133, p > 0.05) effects were significant, indicating that sleep did not act as a mediator. For Jewels B, none of the indirect (r = 0.012, p > 0.05), direct (r = 0.019, p > 0.05), or total (r = 0.031, p > 0.05) effects were significant. The moderating effect of sleep, however, approached significance (r = −0.18, p = 0.054).

Fig. 1: Mediation diagrams for Jewels A and Jewels B where the predictor is the anxiety EMA score, the outcome is the Rate-Correct Score (RCS), and the mediator is sleep. *p < 0.05.

Mood EMA

Figure 2 visualizes the mediation diagrams with the mood EMA scores as the predictor. For Jewels A, the indirect effect (r = 0.020, p > 0.05) was not significant, but the direct (r = 0.203, p < 0.05) and total (r = 0.223, p < 0.05) effects were. For Jewels B, none of the indirect (r = 0.004, p > 0.05), direct (r = 0.079, p > 0.05), or total (r = 0.083, p > 0.05) effects were significant; sleep did, however, appear to act as a moderator (r = −0.214, p = 0.05).

Fig. 2: Mediation diagrams for Jewels A and Jewels B where the predictor is the mood EMA score, the outcome is the Rate-Correct Score (RCS), and the mediator is sleep. *p < 0.05.

Discussion

In addition to examining the feasibility, validity, and cross-site comparability of smartphone data in assessing cognitive function in individuals diagnosed with schizophrenia, this study aimed to determine which cognitive assessments of the mindLAMP smartphone app may merit further study. The results suggest two key findings: first, the Rate-Correct Score43 correlates the most with scores on the MCCB; second, of the various mindLAMP cognitive assessments, Jewels A and Symbol Digit Substitution show the strongest correlations with MCCB measures of both domain-specific and overall cognition.

The Rate-Correct Score succinctly balances trade-offs in speed and accuracy43, two fundamental aspects of performance, by considering the number of correct responses per unit time: accuracy in isolation may be misleading if the time taken to complete a test is excessive; likewise, rapid completion of a test is unimpressive if the responses are incorrect. In this vein, it is worth noting that the Rate-Correct Score generally outperformed traditional scoring metrics based on accuracy or speed alone for relevant digital cognitive assessments such as Jewels A/B and Symbol Digit Substitution; traditional scoring metrics only appeared relevant for Spatial Span. This highlights the general importance of both speed and accuracy to properly scoring digital cognitive assessments, even where only one of the two is used to score their traditional counterparts, such as the Trail Making Tests and BACS Symbol Coding. As highlighted in the introduction, prior mobile app cognitive assessment research16,37,39 did not use composite metrics to score assessments, which may limit real-world ecological validity and hinder research progress within the field. Nonetheless, for some cognitive assessments, speed and accuracy can be irrelevant by design: for instance, a mobile version of the Iowa Gambling Task, which assesses risk-taking behaviour and requires users to maximize their score by preferentially selecting buttons that are more likely to award rather than detract points, does not concern itself directly with speed or accuracy, but rather with a form of pattern recognition. Such digital cognitive assessments, however, do not correspond to the cognitive domains under consideration in the current analysis.

In comparison with prior work on Intraclass Correlation Coefficient (ICC) analysis, our results are comparable with and complement the findings of Keefe et al.44,45, which showed that the composite scores have high test-retest reliability. By way of comparison, Keefe et al. (2011) showed that, for a population of 323 individuals with schizophrenia, the MCCB composite score had an ICC of 0.88; among the domains, Speed of Processing had the highest ICC (0.79) and Verbal Learning the lowest (0.58)46. For digital cognitive assessments, environmental variability such as distractions and fatigue can introduce noise. Moreover, test-retest reliability may also be more sensitive to mood, motivation, or even phone performance.

In the paper by Shvetz et al., the authors investigated the initial accessibility, validity, and reliability of the Jewels Trails Test by testing whether individuals with schizophrenia performed significantly worse than controls on both the in-lab Jewels Trails Test and the Trail Making Test16. Although the findings indicate promising validity, the authors also highlighted that remote cognitive tests such as Jewels A and B, mobile versions of the Trail Making Tests, remain to be validated for assessing actual cognitive impairment in schizophrenia, which is precisely the gap our study attempts to address. Moreover, the findings here complement recent studies investigating the validity of remote administration of the MCCB test47, which suggest that remote administration of some of the MCCB subtests may be a valid alternative to in-person testing. However, further research is necessary to determine why some tasks were comparatively more affected by administration format48.

By integrating cognitive assessments and smartphone sensor data across multiple sites, this study offers a comprehensive examination of how smartphone data can support scalable, global mental health research. While schizophrenia serves as a practical and well-studied use case due to its well-characterized cognitive impairments3,4, our broader aim is to demonstrate how such smartphone-based cognitive assessments can serve as a generalizable tool for evaluating cognition across a range of neuropsychiatric and neurodegenerative conditions. Cognitive decline is often subtle and difficult to detect in its early stages, particularly in disorders such as Alzheimer’s disease, Parkinson’s disease, or mild cognitive impairment5,49. By first validating these methods in a population where cognitive dysfunction is prominent and measurable4,50, and by providing promising exploratory mediation analysis of how smartphone data such as sleep relate EMA survey scores to performance on digital cognitive assessments, we lay the groundwork for extending this approach to other EMA use cases and populations where early and continuous cognitive monitoring may be even more critical.

Moreover, as our experiments relied on EMA-based methods, the results may have greater ecological validity. In traditional neuropsychological testing, such as the MCCB, participants complete tasks in lab settings with minimal real-world distractions. Such tasks may boast high construct validity51 (i.e., the test measures what it is designed to measure) but suffer from low ecological validity42 (i.e., the test does not reflect real-world performance and behaviour)52. This work therefore not only advances digital mental health for schizophrenia but also contributes a scalable, flexible framework for digital cognitive assessment across the diagnostic spectrum and across an individual’s lifespan. Exploratory mediation analysis of whether sleep mediates the relationship between EMA survey scores for mood and anxiety and the Rate-Correct Scores for Jewels A and Jewels B on mindLAMP suggests that sleep may, in fact, act as a moderator rather than a mediator.

Limitations of this work concern its generalizability, given our relatively small total sample size of 56. The remote digital cognitive assessments were each completed once weekly, which may limit the reliability or validity of our results. Nonetheless, we wish to emphasize that, because reliability (in classical test theory) is a function of variation in the population under study and the precision of the test, collecting and analyzing data from different sites is a step in the right direction. Future work can verify our findings by increasing the sample size across more sites. We postulate that, cognitively, patients across all three sites are similar; however, there may be differences in clinical symptomatology, which informed our choice to primarily conduct analyses using data from India (i.e., Bhopal and Bangalore). The low clinical severity and near absence of psychotic symptoms at the Boston site is another limitation of this paper, and the reason the analysis did not focus on these symptoms. While our sample was recruited from clinical populations, we did not conduct additional interviews to re-diagnose or confirm the clinical diagnosis.

In conclusion, our work evaluates the relationships between traditional MCCB tests, digital cognitive assessments, and affective state for patients with schizophrenia. We found that (1) the Rate-Correct Score shows the most utility in scoring digital cognitive assessments in a manner that renders them correlates of traditional paper-and-pencil tests, and (2) of the various mindLAMP cognitive assessments, Jewels A, a mobile version of the Trail Making Test Part A, has strong and significant correlations with MCCB scores for domain-specific and overall cognition. We have highlighted the key limitations of our work and encourage future researchers to further empirically validate and advance the use of smartphone-based cognitive assessments as a tool to monitor an individual’s cognitive and affective state, particularly in high-stakes applications such as schizophrenia.

Methods

The design of this study was informed by focus group discussions conducted in India and the U.S. in which participants provided feedback on their interactions with the study app, generally reporting favourably and noting the importance of receiving scores to interpret their performance on the app’s cognitive assessments; this feedback prompted the current investigation into different methods of scoring digital assessments22.

Participants

This multi-site 30-day observational study took place at three healthcare facilities: the Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; the National Institute of Mental Health and Neuro Sciences (NIMHANS) in Bangalore, India; and the All India Institute of Medical Sciences (AIIMS) and Sangath in Bhopal, India. As in prior research by this same team working with apps and psychosis18, inter-rater reliability for the PANSS was examined by having the research assistants administering the PANSS at each of the three sites rate five video-recorded clinical interviews. Intraclass correlations were excellent (>0.75) for PANSS Total and Positive scores and fair to good (>0.4) for the PANSS Negative score.

Participant recruitment began in September 2024 and concluded in March 2025. Across the three sites, a total of 56 participants took part in the study; the sample demographics are summarized in Table 1. The inclusion criteria consisted of being at least 18 years of age, having a diagnosis of schizophrenia or schizoaffective disorder, owning a smartphone capable of running the study app, and speaking the local language of the study site (i.e. English in the case of Boston, and English or Hindi in the case of Bhopal and Bangalore). The exclusion criteria consisted of any uncontrolled mental illness or any significant speech, sight, or hearing impairment that impacts the individual’s ability to operate a smartphone. Participants were recruited if they met the above inclusion criteria and were able to provide informed consent.

The study protocol was approved by each site’s IRB, and all participants signed an informed consent form prior to beginning the study. Participants met with a research assistant at the beginning of the study to complete the intake visit and again at the end of the study to complete the follow-up visit. Regular meetings were held with members of the teams across the study sites to ensure overall uniformity. At each study visit, participants completed a number of clinical assessments via REDCap, including the General Anxiety Disorder-7 (GAD-7), Patient Health Questionnaire-9 (PHQ-9), Positive and Negative Syndrome Scale (PANSS)54, Psychotic Symptom Rating Scales (PSYRATS), Calgary Depression Rating Scale for Schizophrenia (CDRSS), WHO Disability Assessment Schedule (WHODAS), Social Functioning Scale (SFS), Human Connectome Project Social Task (HCP), Pittsburgh Sleep Quality Index (PSQI), and the PhenX Access to Health Services, English Proficiency, Health Literacy, Occupational Prestige, and Social and Role Dysfunction in Psychosis and Schizophrenia measures. REDCap is a HIPAA-compliant online platform developed by Vanderbilt University that facilitates survey administration and data entry for research, supporting features such as branching logic and custom reporting53.

Participants also completed the MATRICS Consensus Cognitive Battery (MCCB)45,50, a gold-standard assessment measuring cognition across seven domains, from verbal learning to social cognition, via ten tests. During the intake study visit, the research assistant helped participants download the mindLAMP app to their phone and enable GPS permissions to allow smartphone sensor data collection; the research assistant explained to participants that they would be contacted for troubleshooting if their data quality proved to be low. For the 30 days of the study, participants engaged with various cognitive assessments and surveys on the mindLAMP app; on average, they were assigned two to three cognitive assessments and surveys per day and received app notifications at 6 pm each day; over the course of the study, participants were expected to complete each cognitive assessment at least 4 times. At the end of the study, participants met again with a research assistant to complete the same surveys on REDCap as at the beginning of the study, in addition to a system usability survey about their experience with the study app. Participants were compensated at the beginning and the end of the study.

Digital cognitive assessment—mindLAMP app

mindLAMP is accessible in nine languages (English, Spanish, Korean, Simplified Chinese, Traditional Chinese, Italian, French, German, and Hindi), and the app’s cognitive assessments have been co-designed with patient partners across a series of workshops, design rounds, and ongoing app updates7,22,55. Figure 3 presents the cognitive assessments featured on the mindLAMP app: (a) Balloon Risk (Balloon Analog Risk Task), (b) Cats and Dogs (Simple Memory Task), (c) Emotion Recognition, (d) Jewels A (Trail Making Test A), (e) Jewels B (Trail Making Test B), (f) Maze (Problem Solving Task), (g) Pop the Bubbles (Go/No-Go Task), (h) Spatial Span, (i) Spin the Wheel (Iowa Gambling Task), and (j) Symbol Digit Substitution.

Fig. 3: mindLAMP cognitive assessments.

a Balloon Risk (digital version of the Balloon Analog Risk Test)—a task in which the user attempts to inflate a balloon as many times as possible before it pops. b Cats and Dogs—a task in which the user is presented with an array of boxes and must remember to select the boxes covering either dogs or cats. c Emotion Recognition—a task in which the user is presented with a random sequence of 10 images and must identify the emotion represented. d Jewels A (digital version of the Trail Making Test Part A)—a task in which the user must select numbered jewels in ascending order. e Jewels B (digital version of the Trail Making Test Part B)—a task in which the user must select numbered jewels in ascending order, alternating between two different sets of jewels. f Maze—a task in which the user must tilt the phone in order to move a ball toward the exit. g Pop the Bubbles (digital version of a go/no-go task)—a task in which the user must tap on bubbles of a specified color. h Spatial Span—a task in which the user is presented with a sequence and must recreate the sequence in either the same or reverse order. i Spin the Wheel (digital version of the Iowa Gambling Task)—a task in which the user must select one of four buttons to spin wheels in an attempt to increase the starting balance. j Symbol Digit Substitution (digital version of a test from the Wechsler Adult Intelligence Scale)—a task in which, given a legend with symbols and numbers, the user must select the number corresponding to a given symbol.

Most tasks are customizable without the need for coding or changing the app. As an example, the faces used for the Emotion Recognition Task can be changed to display culturally relevant images; participants in the U.S. were presented with images taken from UPenn’s ER 40 Color Emotional Stimuli56, and participants in India were presented with images taken from the AIIMS Facial Toolbox for Emotion Recognition57.

Using the smartphone data collected by the app, it is possible to estimate sleep via an algorithm that combines GPS and phone screen state; our analyses primarily concern sleep duration and sleep quality, the latter roughly corresponding to the degree of fragmentation in the data.
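
The sleep algorithm itself combines GPS and screen-state streams as noted above; the sketch below is not that algorithm but a deliberately simplified screen-state-only heuristic, shown only to make the idea concrete: nightly sleep duration is approximated by the longest screen-off gap beginning in an assumed night window.

```python
import pandas as pd

def longest_night_gap(screen_on_times: pd.Series) -> float:
    """Crude sleep proxy: longest gap (hours) between screen-on events
    that starts during an assumed night window (21:00-11:00)."""
    t = screen_on_times.sort_values()
    gaps = t.diff().dropna()          # gap ending at each event
    gap_starts = t.shift(1).dropna()  # when each gap began
    at_night = (gap_starts.dt.hour >= 21) | (gap_starts.dt.hour < 11)
    night_gaps = gaps[at_night]
    return night_gaps.max().total_seconds() / 3600 if not night_gaps.empty else float("nan")

times = pd.Series(pd.to_datetime([
    "2025-01-01 20:15", "2025-01-01 23:05", "2025-01-02 07:30", "2025-01-02 09:00",
]))
print(longest_night_gap(times))  # ~8.4 h between 23:05 and 07:30
```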

In-person cognitive assessment—MCCB cognitive domains

For all participants, cognitive performance was assessed via the MCCB45,50 at two time points (baseline and after 30 days). The MCCB provides scores for seven cognitive domains computed from ten subtests (see Appendix A for specifics on the subtests):

  1. Speed of Processing (SoP) [Trail Making Test: Part A (TMT), Brief Assessment of Cognition in Schizophrenia: Symbol Coding (BACS SC), and Category Fluency: Animal Naming (Fluency)]

  2. Attention/Vigilance (AV) [Continuous Performance Test Identical Pairs (CPT-IP)]

  3. Working Memory (WM) [Letter Number Span (LNS) and Wechsler Memory Scale Spatial Span (WMS-III SS)]

  4. Verbal Learning and Memory (Vrbl Lrng) [Hopkins Verbal Learning Test-Revised (HVLT-R)]

  5. Visual Learning and Memory (Vis Lrng) [Brief Visuospatial Memory Test-Revised (BVMT-R)]

  6. Reasoning and Problem Solving (RPS) [Neuropsychological Assessment Battery Mazes (NAB Mazes)]

  7. Social Cognition (SC) [Mayer–Salovey–Caruso Emotional Intelligence Test (MSCEIT ME)].

The MCCB raw scores were converted to T scores according to the U.S. English norms and corrected for age, gender, and education, and the T scores were utilized as the primary measurement for analysis.

Table 6 provides a summary of how the different tests map to the cognitive assessments on mindLAMP.

Table 6 Mapping of MATRICS domains onto mindLAMP cognitive assessment.

Digital cognitive assessment scoring methods

In order to analyze the raw data from the cognitive assessments completed on the mindLAMP app, it was first necessary to implement scoring methods. Here, we adopt a composite approach as pioneered by Keefe et al.45,58, building on the work of Liesefeld and Janczyk, who outlined four scoring metrics that combine speed and accuracy for participant i in condition j59.

Inverse Efficiency Score (IES)

The Inverse Efficiency Score (Eq. 1) is the ratio of average response time to the proportion of correct responses:

$$\mathrm{IES}_{i,j}=\frac{\overline{RT}_{i,j}}{PC_{i,j}}$$
(1)

Rate-Correct Score (RCS)

The Rate-Correct Score (Eq. 2) is the ratio of the number of correct responses to total response time:

$$\mathrm{RCS}_{i,j}=\frac{NC_{i,j}}{\sum_{k=1}^{n_{i,j}}RT_{i,j,k}}$$
(2)

Linear Integrated Speed-Accuracy Score (LISAS)

The Linear Integrated Speed-Accuracy Score (Eq. 3) combines average response time with the proportion of incorrect responses, weighted by the ratio of their standard deviations:

$$\mathrm{LISAS}_{i,j}=\overline{RT}_{i,j}+\frac{S_{RT_{i,j}}}{S_{PE_{i,j}}}\cdot PE_{i,j}$$
(3)

Balanced Integration Score (BIS)

The Balanced Integration Score (Eq. 4) is the difference of the z-scores for the proportion of correct responses and average response time:

$$\mathrm{BIS}_{i,j}=z_{PC_{i,j}}-z_{RT_{i,j}}$$
(4)

where \(z_{x_{i,j}}=\frac{x_{i,j}-\overline{x}}{S_{x}}\)
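
For concreteness, a sketch of Eqs. (1)–(4) over per-trial data; the variable names are ours, and edge cases (e.g. perfect accuracy making S_PE zero, or zero correct responses for the IES) are ignored here.

```python
import numpy as np

# rt: per-trial response times (s); correct: per-trial 0/1 correctness flags.
def ies(rt: np.ndarray, correct: np.ndarray) -> float:
    """Inverse Efficiency Score (Eq. 1): mean RT over proportion correct."""
    return rt.mean() / correct.mean()

def rcs(rt: np.ndarray, correct: np.ndarray) -> float:
    """Rate-Correct Score (Eq. 2): correct responses per unit total response time."""
    return correct.sum() / rt.sum()

def lisas(rt: np.ndarray, correct: np.ndarray) -> float:
    """LISAS (Eq. 3): mean RT plus the error proportion weighted by S_RT / S_PE."""
    errors = 1.0 - correct
    return rt.mean() + (rt.std(ddof=1) / errors.std(ddof=1)) * errors.mean()

def bis(pc: np.ndarray, mean_rt: np.ndarray) -> np.ndarray:
    """BIS (Eq. 4): z-scored accuracy minus z-scored mean RT, standardized across the sample."""
    z = lambda x: (x - x.mean()) / x.std(ddof=1)
    return z(pc) - z(mean_rt)

rt = np.array([2.1, 1.8, 2.5, 2.0])
correct = np.array([1.0, 1.0, 0.0, 1.0])
print(ies(rt, correct), rcs(rt, correct), lisas(rt, correct))
```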

Alternate scoring

In addition to the above formulas, alternate scoring metrics informed by a review of the extant literature on cognitive assessments were implemented for specific cognitive assessments. In general, these alternate scores amount to considering either duration or accuracy exclusively: for Jewels A, Jewels B, and Maze, the amount of time to complete each level is recorded; for Emotion Recognition, Spatial Span, and Symbol Digit Substitution, the number of correct responses is recorded.
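
A minimal sketch of this single-dimension dispatch, under the same hypothetical per-trial inputs as above:

```python
import numpy as np

TIME_SCORED = {"Jewels A", "Jewels B", "Maze"}

def alternate_score(assessment: str, rt: np.ndarray, correct: np.ndarray) -> float:
    """Alternate scores: total completion time for the time-scored tasks,
    number of correct responses for the rest."""
    if assessment in TIME_SCORED:
        return float(rt.sum())
    return float(correct.sum())
```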

Statistical procedures

All analyses were done with Python (version 3.8.8). Significance was determined with a p-value threshold of 0.05, and multiple comparisons were accounted for with Bonferroni corrections.

Descriptive statistics were obtained for the sample’s demographics and clinical symptoms. One-way ANOVAs were performed to examine whether there were significant differences in participant age and in MCCB scores across the three study sites. Chi-square tests were performed to examine whether there were significant differences in sex and education.
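
A sketch of these omnibus tests with scipy, using placeholder data; the per-site arrays and the contingency table below are illustrative, not the study's values.

```python
import numpy as np
from scipy import stats

# One-way ANOVA on age across the three sites (placeholder values).
age_boston = np.array([34, 41, 29, 50])
age_bangalore = np.array([31, 38, 45, 27])
age_bhopal = np.array([36, 30, 42, 33])
f_stat, p_anova = stats.f_oneway(age_boston, age_bangalore, age_bhopal)

# Chi-square test of independence on a site-by-sex contingency table.
contingency = np.array([[10, 9],   # Boston: male, female
                        [12, 8],   # Bangalore
                        [9, 8]])   # Bhopal
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)
```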

Spearman’s correlations were calculated to assess the validity of the scoring metrics for mindLAMP’s digital cognitive assessments against MCCB domain scores. For these correlations, each participant’s first available scoring metric for each digital cognitive assessment was used.
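
A sketch of one such correlation with scipy, with a manual Bonferroni adjustment; all values and the comparison count are placeholders.

```python
import numpy as np
from scipy import stats

# Placeholder baseline scores for five participants.
rcs_jewels_a = np.array([0.42, 0.55, 0.31, 0.60, 0.47])
mccb_speed_of_processing = np.array([38.0, 52.0, 35.0, 58.0, 44.0])

rho, p = stats.spearmanr(rcs_jewels_a, mccb_speed_of_processing)

# Bonferroni: scale each p value by the number of comparisons in the table (capped at 1).
n_comparisons = 35  # illustrative count
p_corrected = min(p * n_comparisons, 1.0)
```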

Test-retest reliability of the digital cognitive assessments on mindLAMP was analyzed by calculating Intraclass Correlation Coefficients (ICCs) for the Rate-Correct Score for each assessment. To be included in this calculation, participants must have completed each assessment at least twice (this corresponded to 42 participants for Balloon Risk, 45 for Cats and Dogs, 42 for Jewels A, 41 for Jewels B, 44 for Spatial Span, 42 for Spin the Wheel, 45 for Symbol Digit Substitution, 36 for Emotion Recognition, and 17 for Maze). The ICC values were computed by invoking R’s ICCest function from Python and were interpreted according to existing guidelines in the literature60, with moderate reliability corresponding to an ICC between 0.50 and 0.75 and high reliability to values greater than 0.75.
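
The paper computes ICCs by calling R’s ICCest (from the ICC package) from Python; one way such a bridge might look, using rpy2, is sketched below. The rpy2 wiring and the result field names are our assumptions about the setup, and the data are placeholders.

```python
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

icc_pkg = importr("ICC")  # R package that provides ICCest

# Long-format repeated measures: one participant label and one RCS per completion.
participants = ro.FactorVector(["p01", "p01", "p02", "p02", "p03", "p03"])
scores = ro.FloatVector([0.42, 0.45, 0.31, 0.29, 0.55, 0.60])

result = icc_pkg.ICCest(participants, scores)
icc = result.rx2("ICC")[0]
ci_low, ci_high = result.rx2("LowerCI")[0], result.rx2("UpperCI")[0]
print(f"ICC = {icc:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```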

Spearman’s correlations were used to explore the relationship between sleep and performance on mindLAMP’s digital cognitive assessments. Mediation analysis was conducted using the statsmodels Python library to fit linear models via ordinary least squares, testing whether sleep mediated the relationship between EMA survey scores for anxiety and mood and the mindLAMP scoring metrics: the direct effect was modeled with the intraindividual z score of the EMA survey score as the predictor and the intraindividual z score of the Rate-Correct Score for a particular mindLAMP assessment as the outcome; the indirect effect was modeled with the intraindividual z score of sleep duration as the mediator.
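
A hedged sketch of the mediation model using the statsmodels Mediation class with OLS models, run on simulated data; the column names and effect sizes are invented for illustration and do not reproduce the study’s analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.mediation import Mediation

rng = np.random.default_rng(0)
n = 200
# Simulated intraindividual z scores (hypothetical structure).
ema_z = rng.normal(size=n)                                 # predictor: EMA score
sleep_z = 0.3 * ema_z + rng.normal(size=n)                 # mediator: sleep duration
rcs_z = 0.2 * ema_z + 0.1 * sleep_z + rng.normal(size=n)   # outcome: Rate-Correct Score
df = pd.DataFrame({"ema_z": ema_z, "sleep_z": sleep_z, "rcs_z": rcs_z})

outcome_model = sm.OLS.from_formula("rcs_z ~ ema_z + sleep_z", data=df)
mediator_model = sm.OLS.from_formula("sleep_z ~ ema_z", data=df)

med = Mediation(outcome_model, mediator_model, exposure="ema_z", mediator="sleep_z")
print(med.fit(n_rep=500).summary())  # indirect (ACME), direct (ADE), and total effects
```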