Abstract
Cognition in schizophrenia is difficult to assess in clinical settings due to the time required to administer traditional pen-and-paper tests, among other factors. Digital remote assessments completed on a smartphone offer an alternative that can reduce the burden on healthcare staff and patients, in addition to providing more nuanced cognitive profiles, especially when used in conjunction with smartphone data such as sleep. Building on previous work using the mindLAMP research app in international contexts, this paper presents a global multi-site pilot study to explore the validity of the app’s digital cognitive assessments as proxies for traditional in-person assessments such as the gold standard MATRICS Consensus Cognitive Battery (MCCB). Across one site in the U.S. (Boston) and two sites in India (Bangalore and Bhopal), a total of 56 participants with diagnoses of early-course schizophrenia or schizoaffective disorder were recruited between September 2024 and March 2025 to engage with the mindLAMP app for 30 days. Participants completed 2-3 different cognitive tasks and surveys each day; at the beginning and the end of this period, participants also took the MCCB and surveys related to their diagnosis. mindLAMP cognitive assessments were scored using different metrics that combine speed and accuracy, and correlation analyses were run on these metrics and MCCB domains. Of the scoring metrics used, the Rate-Correct Score (RCS) correlated most consistently with baseline MCCB domains corrected for age, gender, and education. Moderate test-retest reliability was observed across certain cognitive assessments, such as mobile versions of the Trails-Making Test A and Symbol Digit Substitution, consistent with previous research by Keefe et al.; poor test-retest reliability, in contrast, was observed across assessments such as Spatial Span.
Additionally, we conducted exploratory mediation analyses using sleep data to see if sleep mediates between the Ecological Momentary Assessment (EMA) survey scores and performance on select digital cognitive assessments on mindLAMP. Our results support the initial accessibility, validity and reliability of using smartphones to assess cognition in schizophrenia. Future research to develop additional smartphone-based cognitive tests, as well as with larger samples and in other psychiatric populations, is warranted.
Introduction
Schizophrenia is a chronic and disabling psychiatric disorder characterized by disturbances in thought, perception, emotion, and behavior1,2. Schizophrenia has been posited to lead to cognitive impairments in several domains such as attention, memory, and visuospatial skills3,4,5. Despite decades of research, effective long-term management remains a challenge due to its heterogeneous nature and the variability in symptom presentation across individuals and contexts. Traditionally, clinical assessment of schizophrenia relies on infrequent, subjective evaluations conducted in clinical settings1,2. These assessments, while important, may fail to capture the dynamic fluctuations in cognitive impairment that occur in patients’ daily lives6,7,8.
In recent years, digital remote assessments offered on smartphones and computers have emerged as promising tools to address the limitations of traditional in-person cognitive tests9,10,11. Within schizophrenia research, digital remote assessments have been proposed as a means to enable the recruitment of larger and more diverse samples (e.g. from rural and remote areas) and of individuals who might have logistical (e.g. cost, transportation, availability of clinicians) or symptomatic (e.g. social avoidance or paranoia) issues that make in-person attendance difficult10,12,13,14.
Moreover, given that digital or smartphone-based assessment can be completed anytime and anywhere, such technology supports the advancement of brief assessments of cognition, otherwise known as ecological momentary assessments (EMAs)15,16. In the context of schizophrenia, where symptom expression is heterogeneous, multidimensional, and temporally dynamic, smartphone-based surveys and sensors offer an ecologically valid alternative to traditional cross-sectional neuropsychological assessments17,18,19,20.
Existing works have explored and demonstrated the preliminary utility of remote cognitive assessment as a tool for research18,21,22,23. NeuroUX has shown, for instance, with a healthy general adult population that remote cognitive assessments exhibit acceptable test-retest reliability and that factors such as the individual’s age and testing environment impact their performance24,25. Although there are other apps in this space24,26,27, our work focuses on mindLAMP, as it offers a robust and extensible framework for integrating cognitive assessments and smartphone sensor data while also supporting care delivery22,28,29. mindLAMP cognitive assessments have been studied in schizophrenia9,30,31, Parkinson’s Disease32, mild/moderate cognitive impairment, and Alzheimer’s Disease19.
Given the relative nascency of this field, there is no predetermined or gold-standard manner to evaluate these digital cognitive assessments10,33. Whereas traditional assessments such as the MCCB tend to rely on either speed (e.g. the Trails-Making Tests)34 or accuracy (e.g. BACS Symbol Coding)35 to score tasks, digital cognitive assessments have access to more specific and precise item-level data; mindLAMP, in particular, stores metadata such as duration for each user event, which generally constitutes a tap on the phone screen. Within the existing literature, digital cognitive assessments are typically scored using quantitative performance metrics that map onto well-defined cognitive constructs, such as memory, attention, and processing speed10,33,36. Most research has focused primarily on time-based or accuracy-based metrics. Within the time-based scoring paradigm, scores are derived from response or completion times34,37,38: examples include total completion time4, response time, and interresponse time (IRT) or latency38. Other works rely on accuracy-based metrics, where scores are determined by the rate of correct responses and are often directly mapped to standard neuropsychological outcomes (e.g. number of correct matches in a symbol substitution task)39. However, existing work has shown that different scoring methods can yield significantly different results and outcomes10. A key gap in the literature is that none of the existing works have investigated composite metrics, which combine both speed and accuracy, to score digital cognitive assessments. Our work presents the first attempt to use a composite metric, the Rate-Correct Score (RCS), to score these assessments.
Currently, there is a lack of substantial research on the correlation between digital cognitive assessments and traditional in-person MCCB tests, or with affective states as measured using established clinical scales, such as the Positive and Negative Syndrome Scale (PANSS). Building on the previous work conducted by Raje et al., which explored patient and clinician co-design of the cognitive assessments18, this study presents a global multisite pilot study of smartphone data in schizophrenia, drawing on data collected across geographically and culturally diverse regions3,22,40. This study aims to determine which metric is best suited to score the digital cognitive assessments and which cognitive assessments of the mindLAMP smartphone app may merit further study. The principal hypothesis of this paper is that scoring digital cognitive assessments as the number of correct responses per unit time will offer strong and significant correlations with MCCB scores. In our work, we have opted to use composite metrics, which account for both accuracy and time, as research has indicated that findings are stronger for composite scores10,41.
Results
Demographic and intersite comparison
Table 1 contains the demographic and clinical data for the study sample. There were no statistically significant differences among the study sites with regard to age (F = 0.131, p > 0.5), sex (χ2 = 9.567, p = 1), or education (χ2 = 46.952, p > 0.5). One participant in Boston did not complete the demographic survey.
Analysing the data across the three study sites, we found that the MCCB domain distributions differ statistically, with the two sites in India being more similar to one another than to the site in Boston. Table 2 presents the results of a one-way ANOVA for the MCCB domains: Working Memory (F = 0.940, p > 0.05), Verbal Learning (F = 0.876, p > 0.05), Visual Learning (F = 0.818, p > 0.05), and Social Cognition (F = 2.112, p > 0.05) did not differ substantially across the sites.
Given that the MCCB scores for the Boston study site diverge significantly from the MCCB scores for the two sites in India, the analyses presented in continuation for the MCCB will primarily concern the two sites in India.
Engagement results
Of the 62 participants across the three study sites who enrolled in the study and downloaded mindLAMP, 6 dropped out and were excluded from all data analysis.
Throughout the study, in order of decreasing frequency, participants completed Cats and Dogs an average of 8.8 times (SD = 8.7; range = 1–33); Spatial Span an average of 8.7 times (SD = 8.0; range = 1–31); Balloon Risk an average of 7.4 times (SD = 7.9; range = 1–31); Symbol Digit Substitution an average of 7.0 times (SD = 7.8; range = 1–31); Spin the Wheel an average of 6.6 times (SD = 7.1; range = 1–31); Jewels A an average of 6.5 times (SD = 6.8; range = 1–31); Emotion Recognition an average of 6.4 times (SD = 6.4; range = 1–31); Jewels B an average of 5.5 times (SD = 6.2; range = 1–31); and Maze an average of 4.1 times (SD = 3.1; range = 1–13). Participants were assigned one or two cognitive assessments per day, so that they were expected to have completed each cognitive evaluation at least 4 times by the end of the study. The schedule for participants in Boston is shown in Table 3; the schedule for participants in India similarly has participants complete one or two assessments per day.
Correlations between MCCB domains and mindLAMP cognitive assessment scores
Table 4 presents correlations Bonferroni-corrected for multiple comparisons between baseline MCCB domain scores and baseline scoring metrics for the mindLAMP cognitive assessments. Of the five different scoring metrics outlined in Section "Digital Cognitive Assessment Scoring Methods", the Rate-Correct Score has the strongest and most significant correlations with baseline MCCB domain scores corrected for age, gender, and education. Importantly, the Rate-Correct Score for Jewels A correlates with both the Overall Composite Score (r = 0.597, p < 0.001) and the Overall Neurocognitive Composite Scores (r = 0.537, p < 0.001). The Rate-Correct Score for Jewels A also correlates with Speed of Processing (r = 0.464, p < 0.05), as well as Working Memory (r = 0.454, p < 0.05). The Rate-Correct Score for Symbol Digit Substitution correlates with both the Overall Composite Score (r = 0.532, p < 0.01) and the Neurocognitive Composite Scores (r = 0.530, p < 0.01), and with Working Memory (r = 0.502, p < 0.05). The alternate score for Spatial Span, which is the number of correct responses, correlates with Attention/Vigilance (r = 0.534, p < 0.01), Overall Composite Score (r = 0.505, p < 0.01), Speed of Processing (r = 0.499, p < 0.01), and the Neurocognitive Composite Score (r = 0.449, p < 0.05). Correlations between the mindLAMP assessments and the individual subtests of the MCCB can be found in Appendix B.
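As a concrete illustration of this procedure, the sketch below computes Pearson correlations between mindLAMP scoring metrics and MCCB domain scores, with p-values Bonferroni-corrected by the total number of comparisons. All function names, variable names, and toy data are hypothetical; this is not the study's actual analysis code.

```python
import numpy as np
from scipy import stats

def bonferroni_correlations(mindlamp_scores, mccb_domains, n_comparisons):
    """Pearson correlations between each mindLAMP metric and each MCCB
    domain. Inputs are dicts of name -> 1-D array over the same
    participants; p-values are multiplied by the total number of
    comparisons (Bonferroni) and capped at 1."""
    results = {}
    for m_name, m_vals in mindlamp_scores.items():
        for d_name, d_vals in mccb_domains.items():
            r, p = stats.pearsonr(m_vals, d_vals)
            results[(m_name, d_name)] = (r, min(p * n_comparisons, 1.0))
    return results

# Hypothetical toy data: one metric, one domain, 30 participants
rng = np.random.default_rng(0)
x = rng.normal(size=30)
res = bonferroni_correlations(
    {"Jewels A RCS": x + rng.normal(scale=0.5, size=30)},
    {"Speed of Processing": x},
    n_comparisons=35,
)
```

A corrected p-value below 0.05 after this adjustment corresponds to the significance thresholds reported in Table 4.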
Intraclass Correlation Coefficients (ICCs) for mindLAMP cognitive assessments
Table 5 presents ICCs for the mindLAMP digital cognitive assessments to assess test-retest reliability. Given that test-retest reliability refers to the consistency of a test or measure over time, within the context of traditional tests such as the MCCB, high reliability ensures that baseline and follow-up comparisons are meaningful42. As shown in Table 5, Balloon Risk (ICC = 0.664, 95% CI [0.570–0.764]), Jewels A (ICC = 0.568, 95% CI [0.466−0.684]), and Symbol Digit Substitution (ICC = 0.536, 95% CI [0.438−0.651]) exhibited moderate test-retest reliability. In contrast, the rest of the assessments did not demonstrate acceptable test-retest reliability. This may be due to external factors, which will be further discussed within the discussion section. By way of comparison, in a NeuroUX study in which 393 adults completed each of the platform’s digital cognitive assessments five times over the course of ten days, the ICCs ranged from 0.438 for a task testing visual working memory and processing speed to 0.912 for a task testing processing speed24.
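For readers unfamiliar with the computation, the following is a minimal sketch of a two-way random-effects, absolute-agreement, single-measurement ICC, i.e. ICC(2,1), a common choice for test-retest reliability. The specific ICC variant used in the study may differ, so this is illustrative only.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an (n_subjects x k_sessions) array of scores,
    computed from the two-way ANOVA mean squares."""
    data = np.asarray(ratings, dtype=float)
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)   # per-subject means
    col_means = data.mean(axis=0)   # per-session means

    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)  # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)  # between sessions
    sst = np.sum((data - grand) ** 2)
    sse = sst - k * np.sum((row_means - grand) ** 2) \
              - n * np.sum((col_means - grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual error

    return (msb - mse) / (msb + (k - 1) * mse + k * (msc - mse) / n)
```

Perfectly consistent repeated measurements yield an ICC of 1, and session-to-session noise pulls the value toward 0.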
Mediation analysis
Using the data from the two sites in India, we performed an exploratory analysis to see whether sleep duration calculated from mindLAMP smartphone data mediates between mindLAMP survey scores for mood and anxiety and the Rate-Correct Scores for Jewels A and Jewels B. Intraindividual z scores (i.e. distance from the mean in units of standard deviation calculated based on each participant’s distribution of data) were used for sleep, survey scores, and the Rate-Correct Scores. Given the small sample size, this analysis is primarily intended to suggest a future direction of interest for later studies.
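The decomposition underlying such an analysis can be sketched as a standard product-of-coefficients mediation fit with ordinary least squares; the function below is a hypothetical minimal implementation (no significance testing), not the study's actual code.

```python
import numpy as np

def mediation_effects(x, m, y):
    """Simple mediation decomposition (product of coefficients):
    x = predictor (e.g. EMA z-score), m = mediator (e.g. sleep z-score),
    y = outcome (e.g. RCS z-score).
    Returns (indirect, direct, total) effect estimates."""
    x, m, y = (np.asarray(v, dtype=float) for v in (x, m, y))
    ones = np.ones_like(x)

    # Path a: predictor -> mediator
    a = np.linalg.lstsq(np.column_stack([ones, x]), m, rcond=None)[0][1]
    # Paths c' (direct) and b (mediator -> outcome), fit jointly
    coef = np.linalg.lstsq(np.column_stack([ones, x, m]), y, rcond=None)[0]
    direct, b = coef[1], coef[2]
    # Total effect c from the reduced model
    total = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)[0][1]

    return a * b, direct, total
```

With OLS and intercepts, the indirect and direct effects sum exactly to the total effect, which is a useful sanity check on any implementation.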
Anxiety EMA
Figure 1 visualizes the mediation diagrams with the anxiety EMA scores as the predictor. For Jewels A, none of the indirect (r = 0.026, p > 0.05), direct (r = 0.107, p > 0.05), or total (r = 0.133, p > 0.05) effects were significant, so sleep did not appear to be a mediator. For Jewels B, none of the indirect (r = 0.012, p > 0.05), direct (r = 0.019, p > 0.05), or total (r = 0.031, p > 0.05) effects were significant. The moderating effect of sleep, however, approached significance (r = −0.18, p = 0.054).
Mediation diagrams for Jewels A and Jewels B where the predictor is the anxiety EMA score, the outcome is the Rate-Correct Score (RCS), and the mediator is sleep. *p < 0.05.
Mood EMA
Figure 2 visualizes the mediation diagrams with the mood EMA scores as the predictor. For Jewels A, the indirect effect (r = 0.020, p > 0.05) was insignificant, but the direct (r = 0.203, p < 0.05) and total (r = 0.223, p < 0.05) effects were significant. For Jewels B, none of the indirect (r = 0.004, p > 0.05), direct (r = 0.079, p > 0.05), or total (r = 0.083, p > 0.05) effects were significant; however, sleep did appear to be a moderator (r = −0.214, p = 0.05).
Mediation diagrams for Jewels A and Jewels B where the predictor is the mood EMA score, the outcome is the Rate-Correct Score (RCS), and the mediator is sleep. *p < 0.05.
Discussion
In addition to examining the feasibility, validity, and cross-site comparability of smartphone data in assessing cognitive function in individuals diagnosed with schizophrenia, this study aimed to determine which cognitive assessments of the mindLAMP smartphone app may merit further study. The results presented suggest the following two key findings: first, the Rate-Correct Score43 correlates the most with scores on the MCCB; second, of the various mindLAMP cognitive assessments, Jewels A and Symbol Digit Substitution have the strongest correlational evidence with MCCB measures of both domain-specific and overall cognition.
The Rate-Correct Score succinctly balances trade-offs in speed and accuracy43, two fundamental aspects of performance, by considering the number of correct responses per unit time: accuracy in isolation may be misleading if the time taken to complete a test is excessive; likewise, rapid completion of a test is unimpressive if the responses are incorrect. In this vein, it is worth noting that the Rate-Correct Score generally outperformed traditional scoring metrics based on accuracy or speed alone for relevant digital cognitive assessments such as Jewels A/B and Symbol Digit Substitution; traditional scoring metrics only appeared relevant for Spatial Span. This highlights the importance of combining speed and accuracy when scoring digital cognitive assessments, even though their traditional counterparts, such as the Trails-Making Tests and BACS Symbol Coding, do not combine them. As highlighted in the introduction, prior mobile app cognitive assessment research16,37,39 did not use composite metrics to score assessments, which may limit its real-world ecological validity and hinder research progress within the field. Nonetheless, for some cognitive assessments, speed and accuracy can be irrelevant by design: for instance, a mobile version of the Iowa Gambling Task, which assesses risk-taking behaviour and requires users to maximize their score by preferentially selecting buttons that are more likely to award rather than detract points, is concerned not with speed or accuracy but with a form of pattern recognition. Such digital cognitive assessments, however, do not correspond to the cognitive domains under consideration in the current analysis.
With respect to the Intraclass Correlation Coefficient (ICC) analysis, our results are comparable with and complement the findings of Keefe et al.44,45, which showed that the composite scores have high test-retest reliability. By way of comparison, Keefe et al. (2011) showed that, for a population of 323 individuals with schizophrenia, the MCCB composite score had an ICC of 0.88; among the domains, Speed of Processing had the highest ICC (0.79) and Verbal Learning the lowest (0.58)46. For digital cognitive assessments, environmental variability such as distractions and fatigue can introduce noise. Moreover, test-retest reliability may also be more sensitive to mood, motivation, or even phone performance.
In the paper by Shvetz et al., the authors investigated the initial accessibility, validity, and reliability of the Jewels Trails Test by testing whether individuals with schizophrenia performed significantly worse than controls on both the in-lab Jewels Trails Test and the Trails-Making Test16. Although the findings indicate promising validity, the authors also highlighted that remote cognitive tests such as Jewels A and B, mobile versions of the Trails-Making Tests, remain to be validated for assessing actual cognitive impairments in schizophrenia, which is precisely what our study attempts to address. Moreover, our findings complement recent studies investigating the validity of remote administration of the MCCB47, which suggest that remote administration of some of the MCCB subtests may be a valid alternative to in-person testing. However, further research is necessary to determine why some tasks were comparatively more affected by administration format48.
By integrating cognitive assessments and smartphone sensor data across multiple sites, this study offers a comprehensive examination of how smartphone data can support scalable, global mental health research. While schizophrenia serves as a practical and well-studied use case due to its well characterized cognitive impairments3,4, our broader aim is to demonstrate how such smartphone-based cognitive assessments can serve as a generalizable tool in the evaluation of cognition across a range of neuropsychiatric and neurodegenerative conditions. Cognitive decline is often subtle and difficult to detect in its early stages, particularly in disorders such as Alzheimer’s disease, Parkinson’s disease, or mild cognitive impairment5,49. By first validating these methods in a population where cognitive dysfunction is prominent and measurable4,50 and providing promising exploratory mediation analysis on how smartphone data such as sleep mediates between performance on digital cognitive assessments and EMA survey scores, we lay the groundwork for extending this approach to other EMA use cases and populations where early and continuous cognitive monitoring may be even more critical.
Moreover, as our experiments relied on EMA-based methods, the results might have greater ecological validity. In traditional neuropsychological testing, like the MCCB, participants complete tasks in lab settings with minimal real-world distractions. Such tasks may boast high construct validity51 (i.e., the test measures what it is designed to measure), but suffer from low ecological validity42 (i.e., the test does not reflect real-world performance and behaviour)52. This work, therefore, not only advances digital mental health for schizophrenia but also contributes a scalable, flexible framework for digital cognitive assessment across the diagnostic spectrum and across an individual’s lifespan. Exploratory mediation analysis looking at whether sleep mediates between the Rate-Correct Scores for Jewels A and Jewels B on mindLAMP and EMA survey scores for mood and anxiety suggests that sleep may, in fact, be a moderator.
Limitations of this work concern its generalizability, given our relatively small total sample size of 56. The remote digital cognitive assessments were each done once weekly, which may limit the reliability or validity of our results. Nonetheless, we wish to emphasize that, given that reliability (in classical test theory) is a function of variation in the population under study and the precision of the test, our efforts to collect, use, and analyze data from different sites are a step in the right direction. Future work can verify our findings by increasing the sample size across more sites. We postulate that, cognitively, all patients across all three sites are similar. However, there may be differences in clinical symptomatology, which informed our choice to primarily conduct experiments and analysis using data from India (i.e., Bhopal and Bangalore). The low clinical severity and relative absence of psychotic symptoms at the Boston site is another limitation of this paper, and the reason the analysis did not focus on these symptoms. While our sample was recruited from clinical populations, we did not conduct additional interviews to re-diagnose or confirm the clinical diagnosis.
In conclusion, our work evaluates the relationship between the traditional MCCB tests, digital cognitive assessments, and affective state for schizophrenia patients. We discovered that (1) the Rate-Correct Score shows the most utility in scoring digital cognitive assessments in a manner that renders them correlates of traditional paper-and-pencil tests; (2) of the various mindLAMP cognitive assessments, Jewels A, a mobile version of the Trails-Making Test A, has strong and significant correlations with MCCB scores for domain-specific and overall cognition. We have highlighted the key limitations of our work and encourage future researchers to further empirically validate and advance the use of smartphone-based cognitive assessments as a tool to monitor an individual’s affective state, particularly in high-stakes applications such as schizophrenia.
Methods
The design of this study was informed by focus group discussions conducted in India and the U.S., in which participants provided feedback on their interactions with the study app. Participants generally reported favourably on the app and noted the importance of receiving scores to interpret their performance on its cognitive assessments, which prompted the current investigation into different methods of scoring digital assessments22.
Participants
This multi-site 30-day observational study took place at three healthcare facilities: the Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; the National Institute of Mental Health and Neuro Sciences (NIMHANS) in Bangalore, India; and the All India Institute of Medical Sciences (AIIMS) and Sangath in Bhopal, India. As in prior research by this same team working with apps and psychosis18, inter-rater reliability for the PANSS was examined for the research assistants administering it at each of the three sites by having them rate five video-recorded clinical interviews. Intraclass correlations were excellent (>0.75) for the PANSS Total and Positive scores and fair to good (>0.4) for the PANSS Negative score.
Participant recruitment began in September 2024 and concluded in March 2025. Across the three sites, a total of 56 participants partook in the study; the sample demographics are summarized in Table 1. The inclusion criteria for participation in the study consisted of being at least 18 years of age, having a diagnosis of schizophrenia or schizoaffective disorder, owning a smartphone capable of running the study app, and speaking the local language of the study site (i.e. English in the case of Boston, and English or Hindi in the case of Bhopal and Bangalore). The exclusion criteria consisted of having any uncontrolled mental illness or any significant speech, sight, or hearing impairment impacting the individual’s ability to operate a smartphone. Participants were recruited if they met the above inclusion criteria and were able to provide informed consent.
The study protocol was approved by each site’s IRB, and all participants provided informed consent prior to beginning the study. Participants met with a research assistant at the beginning of the study to complete the intake visit and again at the end of the study to complete the follow-up visit. Regular meetings were held with members of the teams across the study sites to ensure overall uniformity. At each study visit, participants completed a number of clinical assessments via REDCap: the General Anxiety Disorder-7 (GAD-7), Patient Health Questionnaire-9 (PHQ-9), Positive and Negative Syndrome Scale (PANSS)54, Psychotic Symptom Rating Scales (PSYRATS), Calgary Depression Rating Scale for Schizophrenia (CDRSS), WHO Disability Assessment Schedule (WHODAS), Social Functioning Scale (SFS), Human Connectome Project Social Task (HCP), Pittsburgh Sleep Quality Index (PSQI), PhenX Access Health Services, PhenX English Proficiency, PhenX Health Literacy, PhenX Occupational Prestige, and PhenX Social and Role Dysfunction in Psychosis and Schizophrenia. REDCap is a HIPAA-compliant online platform developed by Vanderbilt University that facilitates survey administration and data entry for research, supporting features such as branching logic and custom reporting53.
Participants also completed the MATRICS Consensus Cognitive Battery (MCCB)45,50, a gold standard assessment that measures cognition across seven domains, from verbal learning to social cognition, via ten tests. During the intake study visit, the research assistant assisted participants with downloading the mindLAMP app to their phone and helped them enable GPS permissions to allow smartphone sensor data collection; the research assistant explained to participants that they would be contacted to help with troubleshooting if their data quality proved to be low. For the 30 days of the study, participants engaged with various cognitive assessments and surveys on the mindLAMP app; on average, they were assigned 2-3 cognitive assessments and surveys per day and received app notifications at 6 pm each day. Over the course of their time in the study, participants were expected to complete each cognitive assessment at least 4 times. At the end of the study, participants met again with a research assistant to complete the same surveys on REDCap as at the beginning of the study, in addition to a system usability survey about their experience with the study app. Participants were compensated at the beginning and the end of the study.
Digital cognitive assessment—MindLAMP App
mindLAMP is accessible in 9 languages (English, Spanish, Korean, Simplified Chinese, Traditional Chinese, Italian, French, German, and Hindi), and the app’s cognitive assessments have been co-designed with patient partners across a series of workshops, design rounds, and ongoing app updates7,22,55. Figure 3 presents the cognitive assessments featured on the mindLAMP app: (a) Balloon Risk (Balloon Analog Risk Task), (b) Cats and Dogs (Simple Memory Task), (c) Emotion Recognition, (d) Jewels A (Trail Making Test A), (e) Jewels B (Trail Making Test B), (f) Maze (Problem Solving Task), (g) Pop the Bubbles (Go/No-Go Task), (h) Spatial Span, (i) Spin the Wheel (Iowa Gambling Task), and (j) Symbol Digit Substitution.
a Balloon Risk (digital version of the Balloon Analog Risk Test)—a task in which the user attempts to inflate a balloon as many times as possible before it pops. b Cats and Dogs—a task in which the user is presented with an array of boxes and must remember to select the boxes covering either dogs or cats. c Emotion Recognition—a task in which the user is presented with a random sequence of 10 images and must identify the emotion represented. d Jewels A (digital version of the Trails-Making Test Version A)—a task in which the user must select numbered jewels in ascending order. e Jewels B (digital version of the Trails-Making Test Version B)—a task in which the user must select numbered jewels in ascending order, alternating between two different sets of jewels. f Maze—a task in which the user must tilt the phone in order to move a ball toward the exit. g Pop the Bubbles (digital version of a go/no-go task)—a task in which the user must tap on bubbles of a specified color. h Spatial Span—a task in which the user is presented with a sequence and must recreate the sequence in either the same or reverse order. i Spin the Wheel (digital version of the Iowa Gambling Task)—a task in which the user must select one of four buttons to spin wheels in an attempt to increase the starting balance. j Symbol Digit Substitution (digital version of a test from the Wechsler Adult Intelligence Scale)—a task in which, given a legend with symbols and numbers, the user must select the number corresponding to a given symbol.
Most tasks are customizable without the need for coding or changing the app. As an example, the faces used for the Emotion Recognition Task can be changed to display culturally relevant images; participants in the U.S. were presented with images taken from UPenn’s ER 40 Color Emotional Stimuli56, and participants in India were presented with images taken from the AIIMS Facial Toolbox for Emotion Recognition57.
Using the smartphone data collected by the app, it is possible to derive sleep data via an algorithm that combines GPS and phone screen state; the analyses primarily concern sleep duration and sleep quality, the latter roughly corresponding to the degree of fragmentation in the data.
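To give a flavour of how screen-state data can yield a sleep estimate, the sketch below takes the longest gap between screen-on events within a nightly window as the sleep period. This is a hypothetical simplification: the study's actual algorithm also incorporates GPS and is more involved, and the function name, window bounds, and data here are illustrative only.

```python
import numpy as np

def sleep_duration_hours(screen_on_times, night_start=21.0, night_end=11.0):
    """Estimate sleep duration as the longest gap between screen-on
    events within a nightly window. Times are hours since midnight of
    the focal night; events after midnight are mapped past 24.0 so the
    window [21:00, 11:00 next day] is contiguous."""
    window = sorted(t if t >= night_start else t + 24.0
                    for t in screen_on_times
                    if t >= night_start or t <= night_end)
    if len(window) < 2:
        return None  # not enough events to bracket a sleep period
    gaps = np.diff(window)
    return float(gaps.max())
```

For example, screen-on events at 22:00, 23:30, 07:00, and 08:00 would yield an estimated sleep duration of 7.5 hours (the gap between 23:30 and 07:00).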
In-person cognitive assessment—MCCB cognitive domains
For all participants, cognitive performance was assessed via the MCCB45,50 at two time points (baseline and after 30 days). The MCCB provides scores for seven cognitive domains computed from ten subtests (see Appendix A for specifics on the subtests):

1. Speed of Processing (SoP) [Trail Making Test: Part A (TMT), Brief Assessment of Cognition in Schizophrenia: Symbol Coding (BACS SC), and Category Fluency: Animal Naming (Fluency)]

2. Attention/Vigilance (AV) [Continuous Performance Test Identical Pairs (CPT-IP)]

3. Working Memory (WM) [Letter Number Span (LNS) and Wechsler Memory Scale Spatial Span (WMS III SS)]

4. Verbal Learning and Memory (Vrbl Lrng) [Hopkins Verbal Learning Test-Revised (HVLT-R)]

5. Visual Learning and Memory (Vis Lrng) [Brief Visuospatial Memory Test-Revised (BVMT-R)]

6. Reasoning and Problem Solving (RPS) [Neuropsychological Assessment Battery Mazes (NAB Mazes)]

7. Social Cognition (SC) [Mayer–Salovey–Caruso Emotional Intelligence Test (MSCEIT ME)]
The MCCB raw scores were converted to T scores according to the U.S. English norms and corrected for age, gender, and education, and the T scores were utilized as the primary measurement for analysis.
Table 6 provides a summary of how the different tests map to the cognitive assessments on mindLAMP.
Digital cognitive assessment scoring methods
In order to analyze the raw data from the cognitive assessments completed on the mindLAMP app, it was necessary to first implement scoring methods. Here, we adopt a composite approach pioneered by Keefe et al.45,58, building on the work of Liesefeld and Janczyk, who outlined four scoring metrics that combine speed and accuracy for participant i in condition j59.
Inverse Efficiency Score (IES)
The Inverse Efficiency Score (Eq. 1) is the ratio of average response time to the proportion of correct responses:
\({IES}_{i,j}=\frac{{\overline{RT}}_{i,j}}{{PC}_{i,j}}\)
Rate-Correct Score (RCS)
The Rate-Correct Score (Eq. 2) is the ratio of the number of correct responses to total response time:
\({RCS}_{i,j}=\frac{{N}_{{correct}_{i,j}}}{\sum {RT}_{i,j}}\)
Linear Integrated Speed-Accuracy Score (LISAS)
The Linear Integrated Speed-Accuracy Score (Eq. 3) accounts for the standard deviation of average response time and proportion of incorrect responses:
\({LISAS}_{i,j}={\overline{RT}}_{i,j}+\frac{{S}_{{RT}_{i,j}}}{{S}_{{PE}_{i,j}}}\cdot {PE}_{i,j}\)
Balanced Integration Score (BIS)
The Balanced Integration Score (Eq. 4) is the difference of the z-scores for the proportion of correct responses and average response time:
\({BIS}_{i,j}={Z}_{{PC}_{i,j}}-{Z}_{{\overline{RT}}_{i,j}}\)
where \({Z}_{{x}_{i,j}}=\frac{{x}_{i,j}-\bar{x}}{{S}_{x}}\)
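A minimal Python sketch of these four metrics, assuming trial-level response times and accuracy flags (variable names are illustrative, not the study's code). Note that LISAS is undefined when a participant makes no errors (S_PE = 0), and that the BIS z-scores are standardized across participants, so BIS takes participant-level summaries rather than trial data.

```python
import numpy as np

def ies_rcs_lisas(rts, correct):
    """IES, RCS, and LISAS (Eqs. 1-3) for one participant and condition,
    from trial-level response times (rts, in seconds) and accuracy flags."""
    rts = np.asarray(rts, dtype=float)
    err = 1.0 - np.asarray(correct, dtype=float)   # 1 where the trial was wrong
    mean_rt, pe = rts.mean(), err.mean()
    ies = mean_rt / (1.0 - pe)                     # Eq. 1: mean RT / prop. correct
    rcs = (len(rts) - err.sum()) / rts.sum()       # Eq. 2: n correct / total RT
    lisas = mean_rt + (rts.std(ddof=1) / err.std(ddof=1)) * pe   # Eq. 3
    return ies, rcs, lisas

def bis(mean_rts, pcs):
    """BIS (Eq. 4): z(prop. correct) - z(mean RT), standardized across
    the sample, so it operates on participant-level summaries."""
    z = lambda x: (np.asarray(x, float) - np.mean(x)) / np.std(x, ddof=1)
    return z(pcs) - z(mean_rts)
```

For example, `ies_rcs_lisas([1.0, 2.0, 1.5, 2.5], [1, 1, 0, 1])` scores four trials with one error: slower or less accurate performance raises IES and LISAS and lowers RCS.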
Alternate scoring
In addition to the above formulas, alternate scoring metrics informed by a review of the extant literature on cognitive assessments were implemented for specific cognitive assessments. In general, these alternate scores amount to considering either duration or accuracy exclusively: for Jewels A, Jewels B, and Maze, the amount of time to complete each level is recorded; for Emotion Recognition, Spatial Span, and Symbol Digit Substitution, the number of correct responses is recorded.
Statistical procedures
All analyses were done with Python (version 3.8.8). Significance was determined with a p-value threshold of 0.05, and multiple comparisons were accounted for with Bonferroni corrections.
Descriptive statistics were obtained for the sample’s demographics and clinical symptoms. A one-way ANOVA was performed to examine whether there were significant differences in participant age and in MCCB scores across the three study sites. A chi-square test was performed to examine whether there were significant differences in sex.
Spearman’s correlations were calculated to assess the validity of the scoring metrics for mindLAMP’s digital cognitive assessments against the MCCB domain scores. For these correlations, each participant’s first available scoring metric for each of mindLAMP’s digital cognitive assessments was used.
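This step can be sketched with synthetic data (the domain abbreviations follow the list above; the sample values, the assessment chosen, and all variable names are illustrative, not study results):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n = 56                                     # pilot sample size
rcs = rng.normal(0.5, 0.1, n)              # first available RCS per participant

# Synthetic MCCB T scores for the seven domains.
domains = {d: rng.normal(50, 10, n) for d in
           ("SoP", "AV", "WM", "Vrbl Lrng", "Vis Lrng", "RPS", "SC")}

m = len(domains)                           # number of comparisons
results = {}
for name, t_scores in domains.items():
    rho, p = spearmanr(rcs, t_scores)
    # Bonferroni: multiply each raw p-value by the number of tests (cap at 1).
    results[name] = (rho, min(p * m, 1.0))

for name, (rho, p_corr) in results.items():
    print(f"{name}: rho = {rho:+.2f}, corrected p = {p_corr:.3f}")
```

Spearman's rank correlation is a reasonable default here because neither the RCS values nor the T scores need be normally distributed in a pilot sample of this size.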
Test-retest reliability of the digital cognitive assessments on mindLAMP was analyzed by calculating Intraclass Correlation Coefficients (ICC) for the Rate-Correct Score metric for each assessment. To be included in this calculation, participants must have completed each assessment at least twice (this corresponded to 42 participants for Balloon Risk, 45 for Cats and Dogs, 42 for Jewels A, 41 for Jewels B, 44 for Spatial Span, 42 for Spin the Wheel, 45 for Symbol Digit Substitution, 36 for Emotion Recognition, and 17 for Maze). The ICC values were computed by invoking R’s ICCest function from Python and were interpreted according to existing guidelines in the literature60, with moderate reliability corresponding to ICC values between 0.50 and 0.75 and high reliability corresponding to ICC values greater than 0.75.
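The study invoked R’s ICCest from Python; as a self-contained illustration, a one-way random-effects ICC can also be computed directly from the ANOVA mean squares (a sketch under that model, not the study’s code; the score matrix below is hypothetical):

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC (ICC(1,1)) from a participants x
    sessions matrix of scores, via the ANOVA decomposition."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    msb = k * ((row_means - x.mean()) ** 2).sum() / (n - 1)      # between-subject
    msw = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-subject
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical RCS values for five participants at two completions:
# scores that barely change between sessions yield an ICC near 1.
scores = [[0.40, 0.42], [0.55, 0.50], [0.61, 0.66], [0.30, 0.35], [0.48, 0.47]]
print(round(icc_oneway(scores), 2))
```

Under the 0.50/0.75 guideline cited above, the resulting value would be read off as poor, moderate, or high reliability.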
Spearman’s correlations were used to explore the relationship between sleep and performance on mindLAMP’s digital cognitive assessments. Mediation analysis was conducted using the statsmodels Python library to fit linear models via ordinary least squares, testing whether sleep mediated the relationship between EMA survey scores for anxiety and mood and mindLAMP scoring metrics: the direct effect was modeled with the intraindividual z score of the EMA survey score as the predictor and the intraindividual z score of the Rate-Correct Score for a particular mindLAMP assessment as the outcome; the indirect effect was modeled with the intraindividual z score of sleep duration as the mediator.
Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Code availability
The computer codes used to generate the results presented in this paper are available from the corresponding author on reasonable request.
References
Hafner, H. et al. Iraos: an instrument for the assessment of onset and early course of schizophrenia. Schizophrenia Res. 6, 209–223 (1992).
Andreasen, N. C. Assessment issues and the cost of schizophrenia. Schizophrenia Bull. 17, 475–481 (1991).
Schaefer, J., Giangrande, E., Weinberger, D. R. & Dickinson, D. The global cognitive impairment in schizophrenia: consistent over decades and around the world. Schizophrenia Res. 150, 42–50 (2013).
Nuechterlein, K. H., Nasrallah, H. & Velligan, D. Measuring cognitive impairments associated with schizophrenia in clinical practice: Overview of current challenges and future opportunities. Schizophrenia Bull. 51, 401–421 (2025).
Campbell, J. et al. Quality of life in mild cognitive impairment and mild dementia associated with alzheimer’s disease: a systematic review. Neurol. Ther. 14.
Chen, K., Torous, J. & Cheong, J. The current state/trends in digital phenotyping for mental health research and care. Psychiatr. Clin. 49, S0193953X25000899 (2025).
Vaidyam, A. et al. Enabling research and clinical use of patient-generated health data (the mindlamp platform): digital phenotyping study. JMIR mHealth uHealth 10, e30557 (2022).
Cohen, A. S. et al. Digital phenotyping using multimodal data. Curr. Behav. Neurosci. Rep. 7, 212–220 (2020).
Liu, G., Henson, P., Keshavan, M., Onnela, J.-P. & Torous, J. Assessing the potential of longitudinal smartphone based cognitive assessment in schizophrenia: a naturalistic pilot study. Schizophrenia Res.: Cognition 17, 100144 (2019).
Horan, W. P., Moore, R. C., Belanger, H. G. & Harvey, P. D. Utilizing technology to enhance the ecological validity of cognitive and functional assessments in schizophrenia: an overview of the state-of-the-art. Schizophrenia Bull. Open sgae025 (2024).
Lavigne, K. M. et al. Remote cognitive assessment in severe mental illness: a scoping review. Schizophrenia 8, 14 (2022).
Lane, E. et al. Exploring current smartphone-based cognitive assessments in schizophrenia and bipolar disorder. Schizophrenia Res.: Cognition 37, 100309 (2024).
Daniel, D. G. et al. Remote assessment of negative symptoms of schizophrenia. Schizophrenia Bull. Open 4, sgad001 (2023).
Abdulla, S. et al. Community-based collaborative care for serious mental illness: a rapid qualitative evidence synthesis of health care providers’ experiences and perspectives. Community Ment. Health J. 1–13 (2025).
Moore, R. C., Swendsen, J. & Depp, C. A. Applications for self-administered mobile cognitive assessments in clinical research: A systematic review. Int. J. Methods Psychiatr. Res. 26, e1562 (2017).
Shvetz, C., Gu, F., Drodge, J., Torous, J. & Guimond, S. Validation of an ecological momentary assessment to measure processing speed and executive function in schizophrenia. npj Schizophrenia 7, 64 (2021).
Bladon, S. et al. A systematic review of passive data for remote monitoring in psychosis and schizophrenia. npj Digital Med. 8, 62 (2025).
Lane, E., Gray, L., Kimhy, D., Jeste, D. & Torous, J. Digital phenotyping of social functioning and employment in people with schizophrenia: pilot data from an international sample. Psychiatry Clin. Neurosci. (2025).
Hackett, K. et al. Mobility-based smartphone digital phenotypes for unobtrusively capturing everyday cognition, mood, and community life-space in older adults: Feasibility, acceptability, and preliminary validity study. JMIR Hum. Factors 11, e59974 (2024).
Lane, E. et al. Digital phenotyping in adults with schizophrenia: a narrative review. Curr. Psychiatry Rep. 25, 699–706 (2023).
Cohen, A. et al. Digital phenotyping correlates of mobile cognitive measures in schizophrenia: A multisite global mental health feasibility trial. PLOS Digital Health 3, e0000526 (2024).
Raje, A. et al. Designing smartphone-based cognitive assessments for schizophrenia: Perspectives from a multisite study. Schizophrenia Res.: Cognition 40, 100347 (2025).
Facchin, A., Cavicchiolo, E. & Chan, E. Neuropsychological testing: from psychometrics to clinical neuropsychology. 1549236 (2025).
Paolillo, E. W. et al. Characterizing performance on a suite of english-language neuroux mobile cognitive tests in a us adult sample: ecological momentary cognitive testing study. J. Med. Internet Res. 26, e51978 (2024).
Aldridge, V. K., Dovey, T. M. & Wade, A. Assessing test-retest reliability of psychological measures. Eur. Psychologist (2017).
Onnela, J.-P. et al. Beiwe: A data collection platform for high-throughput digital phenotyping. J. Open Source Softw. 6, 3417 (2021).
van Berkel, N., D’Alfonso, S., Kurnia Susanto, R., Ferreira, D. & Kostakos, V. Aware-light: A smartphone tool for experience sampling and digital phenotyping. Personal. Ubiquitous Comput. 27, 435–445 (2023).
Bilden, R. & Torous, J. Global collaboration around digital mental health: the lamp consortium. J. Technol. Behav. Sci. 7, 227–233 (2022).
Torous, J. et al. Creating a digital health smartphone app and digital phenotyping platform for mental health and diverse healthcare needs: an interdisciplinary and collaborative approach. J. Technol. Behav. Sci. 4, 73–85 (2019).
Cohen, A. et al. Relapse prediction in schizophrenia with smartphone digital phenotyping during covid-19: a prospective, three-site, two-country, longitudinal study. Schizophrenia 9, 6 (2023).
Lakhtakia, T. et al. Smartphone digital phenotyping, surveys, and cognitive assessments for global mental health: initial data and clinical correlations from an international first episode psychosis study. Digital Health 8, 20552076221133758 (2022).
Weizenbaum, E. L. et al. Smartphone-based neuropsychological assessment in parkinson’s disease: feasibility, validity, and contextually driven variability in cognition. J. Int. Neuropsychological Soc. 28, 401–413 (2022).
Tsiakiri, A. et al. Processing speed and attentional shift/mental flexibility in patients with stroke: a comprehensive review on the trail making test in stroke studies. Neurol. Int. 16, 210–225 (2024).
Li, Y., Ang, M. S., Yee, J. Y., See, Y. M. & Lee, J. Predictors of functioning in treatment-resistant schizophrenia: the role of negative symptoms and neurocognition. Front. Psychiatry 15, 1444843 (2024).
Bhargava, Y., Sumanth, S., Deshmukh, A. & Baths, V. A peek into the minds of game developers and neuroscience researchers collaborating on cognition assessing games. IEEE Access (2025).
Berretta, S. A. et al. Protocol for evaluation of itest, a novel blended intervention to enhance introspective accuracy in psychotic disorders. NPP—Digital Psychiatry Neurosci. 3, 5 (2025).
VanNostrand, M., Bae, M., Ramsdell, J. C. & Kasser, S. L. Information processing speed and disease severity predict real-world ambulation in persons with multiple sclerosis. Gait Posture 111, 99–104 (2024).
Waggestad, T. H. et al. New regression-based norms for the trail making test on norwegian older adults: understanding the effect of education. Clin. Neuropsychologist 1–24 (2025).
Priya, G. et al. Influence of auditory-based cognitive training on auditory resolution, executive function, and working memory skills in individuals with mild cognitive impairment–a pilot randomized controlled study. F1000Research 13, 1022 (2025).
Cohen, A. et al. Digital phenotyping data and anomaly detection methods to assess changes in mood and anxiety symptoms across a transdiagnostic clinical sample. Acta Psychiatr. Scandinavica 151, 388–400 (2025).
Koren, D., Seidman, L. J., Goldsmith, M. & Harvey, P. D. Real-world cognitive—and metacognitive—dysfunction in schizophrenia: a new approach for measuring (and remediating) more “right stuff”. Schizophrenia Bull. 32, 310–326 (2006).
Chaytor, N. & Schmitter-Edgecombe, M. The ecological validity of neuropsychological tests: A review of the literature on everyday cognitive skills. Neuropsychol. Rev. 13, 181–197 (2003).
Woltz, D. J. & Was, C. A. Availability of related long-term memory during and after attention focus in working memory. Mem. Cognition 34, 668–684 (2006).
Keefe, R. S. et al. Brief assessment of cognition in schizophrenia. Schizophrenia Res. (1999).
Keefe, R. S. et al. Norms and standardization of the brief assessment of cognition in schizophrenia (bacs). Schizophrenia Res. 102, 108–115 (2008).
Keefe, R. S. et al. Characteristics of the matrics consensus cognitive battery in a 29-site antipsychotic schizophrenia clinical trial. Schizophrenia Res. 125, 161–168 (2011).
Shen, J. et al. Data missingness in digital phenotyping: implications for clinical inference and decision-making. Preprint at medRxiv (2024).
Russell, M. T. et al. Validity of remote administration of the matrics consensus cognitive battery for individuals with severe mental illness. Schizophrenia Res.: Cognition 27, 100226 (2022).
Goldman, J. G., Jagota, P. & Matar, E. Managing cognitive impairment in parkinson’s disease: an update of the literature. Expert Rev. Neurother. (2025).
Nuechterlein, K. H. et al. The matrics consensus cognitive battery, part 1: test selection, reliability, and validity. Am. J. Psychiatry 165, 203–213 (2008).
Strauss, M. E. & Smith, G. T. Construct validity: Advances in theory and methodology. Annu. Rev. Clin. Psychol. 5, 1–25 (2009).
Stinson, L., Liu, Y. & Dallery, J. Ecological momentary assessment: a systematic review of validity research, Perspectives on Behavior. Science 45, 469–493 (2022).
Harris, P. A. et al. Research electronic data capture (redcap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 42, 377–381 (2009).
Kay, S. R., Fiszbein, A. & Opler, L. A. The positive and negative syndrome scale (panss) for schizophrenia. Schizophrenia Bull. 13, 261–276 (1987).
Vaidyam, A., Halamka, J. & Torous, J. Actionable digital phenotyping: a framework for the delivery of just-in-time and longitudinal interventions in clinical healthcare. Mhealth 5, 25 (2019).
Pinkham, A. E. et al. The other-race effect in face processing among african american and caucasian individuals with schizophrenia. Am. J. Psychiatry 165, 639–645 (2008).
Verma, R., Kalsi, N., Shrivastava, N. P. & Sheerha, A. Development and validation of the aiims facial toolbox for emotion recognition. Indian J. Psychological Med. 45, 471–475 (2023).
Hill, S. K. et al. Efficiency of the catie and bacs neuropsychological batteries in assessing cognitive effects of antipsychotic treatments in schizophrenia. J. Int. Neuropsychological Soc. 14, 209–221 (2008).
Liesefeld, H. R. & Janczyk, M. Combining speed and accuracy to control for speed-accuracy trade-offs (?). Behav. Res. Methods 51, 40–60 (2019).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Acknowledgements
This work was funded by a grant from the Wellcome Trust (John Torous PI).
Author information
Contributions
J.T. designed and oversaw the study. J.C., S.C., N.D., A.B., R.S., M.A., and Y.S. collected the data; J.C. and S.N. performed all data analyses. All authors contributed to and have approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Cite this article
Castillo, J., Cheong, J., Choudhary, S. et al. Mobile cognitive remote assessment of schizophrenia: a global multi-site pilot study. Schizophr 11, 144 (2025). https://doi.org/10.1038/s41537-025-00660-8