Abstract
Digital mental health assessments can effectively link young people to appropriate health services; however, current approaches often lack personalization, failing to recognize the complex and multidimensional needs of young people. A total of 1734 young people aged 12–25 years completed seven standardized measures (49 items) using a digital health assessment tool while receiving mental health care. A multidimensional computerized adaptive test (MCAT) was developed to predict scores on the seven standardized scales, spanning clinical symptoms, suicidality, functioning, and alcohol use. Different adaptive tests were simulated under various stopping criteria configurations, and ten-fold cross-validation was performed to determine the accuracy and efficiency of the multidimensional assessment. By administering a personalized subset of items to each individual, the average number of assessment items could be reduced by 69% while maintaining excellent agreement with full-length scores for suicidality (ICC = 0.96), anxiety (ICC = 0.92), and alcohol use (ICC = 0.91), and good agreement for psychological distress (ICC = 0.88), functioning (ICC = 0.86), psychosis (ICC = 0.78), and mania (ICC = 0.75). Estimated average assessment time decreased from 10.5 minutes to under 3.3 minutes (49 items reduced to an average of 15.3 items per person, with mean absolute agreement ICC = 0.87). This adaptive digital assessment can screen across key domains to identify mental health needs and complexity in youth mental health, enabling rapid decisions about treatment needs and care pathways.
Introduction
The proliferation of digital mental health tools is a promising response to the youth mental health crisis1,2. With most young people now having access to smart technologies, virtual front doors are becoming popular entry points to the health system3. This presents an opportunity for large-scale access to assessment and treatment4, necessitating smart approaches to identify who needs what type of mental health care5.
For digital assessments to effectively facilitate access to care, they must capture a broad range of needs without compromising engagement. Many young people have a diverse and fluctuating mix of needs ranging across clinical, psychosocial, and comorbidity domains6,7,8,9. The traditional one-size-fits-all approach to assessing multiple dimensions of needs is to administer multiple time-consuming, often overlapping (and potentially redundant) self-report assessments10,11, leading to burden and disengagement12,13,14. Conversely, services may opt for narrowly focused assessments to sustain engagement (i.e., only focusing on symptoms of anxiety and depression11), yet this undermines the detection of multidimensional treatment needs. Therefore, mental health services need time-efficient and clinically effective ways to identify which care pathways are most appropriate for each young person they assess5,15. Increasing the personalization of assessment, rather than using a pre-defined fixed set of items, promises to enhance usability while promoting multidimensional needs assessment16.
The overlap of clinical symptoms across mental health diagnoses17,18,19, in addition to their statistical associations with functioning and comorbid factors, allows for more flexible approaches to assessing dimensions of need. Using various forms of information and item response theory (IRT), researchers have created dynamic assessments that are personalized to an individual’s idiosyncratic pattern of responses20,21. Instead of providing everyone with the same fixed set of assessments, each question is tailored based on what information is known about the individual and which domains require further assessment. When enough certainty is achieved to make critical decisions (e.g., which services someone requires), the assessment concludes. Adaptive assessments using IRT and computerized adaptive testing (CAT) methodologies appear highly effective for reducing participant burden while maintaining accuracy22,23,24,25,26,27.
In this study, we use real-world data, collected from young people accessing a digital mental health platform, to validate a brief multidimensional needs assessment tool that predicts scores across seven standardized and widely used assessment scales. The selected measures assessed psychological distress28 (Kessler-10 [K-10]; 10 items), psychotic-like experiences29 (Prodromal Questionnaire [PQ16]; 16 items), mania-like experiences30 (Altman Self-Rating Mania Scale [ASRM]; 5 items), symptoms of anxiety31 (Overall Anxiety Severity and Impairment Scale [OASIS]; 5 items), suicidality32 (Suicidal Ideation Attributes Scale [SIDAS]; 5 items), alcohol use33 (Alcohol Use Disorders Identification Test [AUDIT-C]; 3 items), and social and occupational functioning34 (Work and Social Functioning [WSAS]; 5 items). We hypothesize that leveraging information from related items between scales will help estimate full-length scores efficiently. Approximating full-length scales may be useful for health services where comparability to currently deployed measurement frameworks is critical for clinical interpretation and stakeholder acceptance.
Results
Sample characteristics
There were 1734 young people aged 12–25 years who met the eligibility criteria for this study (1734/2304 = 75.26%). The mean age was 19.80 years (standard deviation [SD] = 2.78), with 71.8% female and 5.5% identifying as Aboriginal and/or Torres Strait Islander (Table 1). Demographic comparisons between the included and excluded samples are provided in Supplementary Table 1.
Scale diagnostics and assumption tests
Tests for local dependency and monotonicity are detailed in Supplementary Tables 2-4. Unidimensional models revealed 4 pairs with some local dependence (Q3 > 0.2) in the K10 (items 2&3, 2&5, 3&6, 5&6). These items were likely related due to their semantic similarity (items 2 and 3 ask about “nervousness”, and items 5 and 6 ask about being “restless or fidgety”). Global tests for local dependency revealed more pairs with Q3 > 0.2 across scales.
Taken together, there was some local dependency exhibited in the item bank. However, this was expected given the multidimensional structure of the model and the information overlap across the scales. Furthermore, the skip-logic employed in the SIDAS and AUDIT-C would inevitably lead to local dependency. Though this does violate the technical assumption of local independence, all items were included for the following reasons:
1. Some local dependency is expected in multidimensional item response theory (MIRT) models35.
2. The local dependency we detected reflects real-life clustering.
3. The aim of our study was to preserve the fidelity of the legacy scales rather than to refine an item bank for new scale development.
Monotonicity is confirmed when the probability of endorsing higher-scoring response categories on an item increases with higher latent trait scores. All items except the third item from the ASRM (ASRM_3; assessing need for sleep) passed technical checks for monotonicity (Loevinger’s H coefficient > 0.3; Supplementary Table 4). The step response function for ASRM_3 showed no substantive departure from monotonicity, and the item was thus retained to preserve coverage of the mania trait within the ASRM scale (Supplementary Fig. 1).
Model fit and selection
The correlated multidimensional model converged within a 0.001 tolerance after 691 Metropolis-Hastings Robbins-Monro (MHRM) iterations, with a standard error of 0.147. The root mean square error of approximation was 0.057, approaching the cutoff for excellent fit (0.05) and falling within the ‘good’ fit range (0.05–0.08).
Factor structure and item characteristics
The proportion of common variance explained by each domain ranged from 4.6% (Mania/ASRM) to 15.8% (Psychosis/PQ16) (Supplementary Table 5). As expected, several domains overlapped, as indicated by their factor correlations. The highest factor correlations were observed between functioning and psychological distress (r = 0.75) and between functioning and anxiety (r = 0.70) (Supplementary Table 6).
Item discrimination and intercept parameters are not publicly available due to intellectual property review, consistent with practices for proprietary digital health tools. There were some outliers among the item parameters from the SIDAS and AUDIT-C scales, likely due to the skip logic applied to items in these scales, contributing to steeper item characteristic curves. In the case of AUDIT-C_3, which is derived from a short 3-item scale, local dependence may have inflated the shared variance between items.
Multidimensional computerized adaptive test (MCAT) simulations
To conduct cross-validation with ten folds, we randomly divided the sample of 1734 into 10 approximately equal groups. In the initial full-length MCAT simulation (with no stopping criteria), the mean θj standard error of measurement (SEM, for person j) for mania and psychosis did not fall below 0.4 (i.e., above 0.84 reliability), indicating that less information was gained for these latent traits. Therefore, a suitable stopping criterion was when ∆θj (the absolute change in each person's θj estimate between successive item responses) fell below a predefined value. This approach is common in multidimensional adaptive tests, where SEM may not converge below a threshold because of individuals with scores toward the upper and lower bounds of full-length scales25.
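The correspondence between SEM and reliability quoted above follows from the standard IRT relation, under the convention that θ is scaled to unit variance:

\[\text{reliability}=1-\mathrm{SEM}^{2}=1-0.4^{2}=0.84\]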
Optimal stopping criteria
Observing the standard error of measurement for each domain in the full-length MCAT (Supplementary Fig. 2) informed a three-dimensional grid search comprising the minimum number of items (range: 2–9, to ensure enough information is collected before the assessment ends prematurely), the maximum number of items (range: 9–30, to reflect when additional items may not meaningfully improve measurement), and the change-in-θj threshold (∆θj: 0.01, 0.03, or 0.05, based on previous CAT research25). The upper limit of 30 items was selected because SEM curves plateaued for all domains by this point, except for psychosis, which displayed no clear inflection point (Supplementary Fig. 2).
The simulated assessment that produced intraclass correlation coefficients (ICCs) ≥ 0.75 for all domains, with the lowest mean number of items used per person, comprised the following stopping criteria (a code sketch of this composite rule follows the list):
i. a minimum of 7 items per assessment AND
ii. (a maximum of 25 items per assessment OR
iii. ∆θj < 0.05 between successive items)
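A minimal sketch of this rule in R, assuming ∆θj is evaluated per dimension and aggregated as the maximum (the exact aggregation is an implementation detail not specified here):

```r
# Composite stopping rule: minimum items AND (maximum items OR stable thetas)
should_stop <- function(n_items, delta_theta,
                        min_items = 7, max_items = 25, delta_cut = 0.05) {
  # delta_theta: largest absolute change in the theta estimates between the
  # two most recent item responses, across the seven dimensions
  n_items >= min_items && (n_items >= max_items || delta_theta < delta_cut)
}

should_stop(n_items = 10, delta_theta = 0.02)  # TRUE: estimates have stabilized
should_stop(n_items = 5,  delta_theta = 0.01)  # FALSE: below the item minimum
```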
With these criteria, the mean number of items used per person was 15.3 (SD = 5.39, median = 14) across all ten folds, representing a 68.78% reduction in item usage from the original 49-item battery. Timestamp data revealed that participants took on average 10 min and 40 s to complete all 49 items (Supplementary Table 7), meaning the adaptive test would take on average 3 min and 18 s to complete (saving on average 7 min and 22 s per person, with an estimated range between 91 s [7-item assessment] and 5 min and 26 s [25-item assessment]). Approximately one in ten (11.01%) individuals reached the maximum item limit (25 items) before ∆θj < 0.05 was observed between successive items. Post-hoc comparisons showed no meaningful demographic or clinical profile differences between those who reached the item limit and those who did not (Supplementary Table 8).
Approximately one-sixth of the items (16.33%; 8/49) were never presented to any individual in the simulation, 19/49 (38.78%) were used for less than 4.5% of the sample, and over half (51.02%; 25/49) were used for less than 15.3% of the sample. The SIDAS_1 item (assessing the presence of suicidal thoughts) was the first item presented to all individuals in the sample. Following this, the average item order (indicating where in the adaptive assessment the item is presented to the individual) for AUDITC_3 (assessing binge drinking behavior), OASIS_4 (assessing the impact of anxiety on functioning), and ASRM_1 (assessing the extent to which one feels more or less cheerful than usual) was 2.38, 3.23, and 4.41, respectively. These items were presented earliest in the assessment, reflecting their high information gain across the latent space and thus their power in discriminating across multiple needs (Fig. 1a). The probability of item selection also reflected informativeness across test simulations (Fig. 1b). Items in the adaptive assessment most commonly came from the Kessler-10 (distress; average items per test = 2.69, SD = 0.12), followed by the OASIS (anxiety; 2.64, SD = 0.10) and the Work and Social Functioning (WSAS) scale (functioning; 2.09, SD = 0.10). The fewest items on average came from the AUDIT-C scale (alcohol; 1.70, SD = 0.04).
In (a), the x-axis indicates the average position at which each item was selected across all simulated assessments. Lower selection order (i.e., closer to 0) reflects earlier selection and higher informativeness. In (b), probability indicates how often each item was selected during the adaptive tests. Detailed item information is provided in Supplementary Note 1. K10 Kessler-10, PQ16 Prodromal Questionnaire, ASRM Altman Self-Rating Mania Scale, OASIS Overall Anxiety Severity and Impairment Scale, AUDITC Alcohol Use Disorders Identification Test, WSAS Work and Social Functioning, SIDAS Suicidal Ideation Attributes Scale.
Mean percent error for each scale ranged from 3.94% (SIDAS, representing on average an error of 1.97/50 points on the scale) to 11.17% (PQ-16, representing an error of 1.79/16 points) (Table 2). There was excellent agreement (ICC ≥ 0.90) between predicted and full-length scores for suicidality (ICC = 0.96), anxiety (ICC = 0.92), and alcohol use (ICC = 0.91), and good agreement (ICC ≥ 0.75) for distress (ICC = 0.88), functioning (ICC = 0.86), psychosis (ICC = 0.78), and mania (ICC = 0.75). Pearson’s correlations supported these findings: correlations between full-length scale scores and predicted estimates could be interpreted as “very strong” (≥ 0.80) for suicidality (r = 0.96), anxiety (r = 0.94), alcohol use (r = 0.93), distress (r = 0.88), functioning (r = 0.88), and psychosis (r = 0.81), while mania score estimates showed a correlation of r = 0.79 with the full-length mania score (considered a “strong” relationship). These findings suggest overall strong agreement between estimated and full-length scores, though estimates were weaker towards the scales’ minimum and maximum possible scores (Fig. 2). A post-hoc sensitivity analysis showed minimal performance differences between younger adolescents (12–17 years) and young adults (18–25 years), with a mean domain-level ICC difference of 0.01 (range: 0.002–0.036).
Each panel displays the mean MCAT-estimated score (y-axis) against the rounded true total score (x-axis), binned for visualization. Absolute differences between predicted and true mean scores are indicated by vertical lines. Color indicates the average number of items administered by the MCAT for individuals in that bin, with circle size reflecting the number of individuals per bin. The red dashed line marks perfect prediction. Accuracy (ICC, intraclass correlation coefficient; MAE, mean absolute error; and Pearson’s r) and mean item usage are reported in each panel.
Alternative stopping criteria
Hyperparameter optimization revealed that higher absolute agreement between estimated and true scale scores could be achieved with alternative stopping criteria (Table 2; Supplementary Figs. 3 and 4). For example, ICCs ≥ 0.80 for all domains could be achieved with an average of 19.5 items per assessment (a 60% reduction in items and a reduction of approximately 6 min and 25 s; average ICC = 0.89). The maximum agreement in the hyperparameter space required a mean of 24.8 items per assessment (all ICCs ≥ 0.84, mean ICC = 0.92). Finally, an ultra-brief screener (all ICCs ≥ 0.70) required an average of 10.7 items per assessment (a 78% reduction in items and a reduction of 8 min and 20 s; average ICC = 0.83).
Discussion
Using a large cohort of young people presenting for mental health care, this study demonstrates the accuracy and efficiency of an adaptive digital assessment spanning seven mental health-related questionnaires that cover clinical symptoms (anxiety, distress, psychosis, mania), functioning, suicidality, and comorbid alcohol use. On average, only 15/49 items were required to accurately estimate an individual’s full-length scores across the seven validated and widely used assessment measures. This translates to a 69% reduction in item usage, reducing average completion time from 10.5 min to approximately 3.3 min. The absolute agreement between MCAT-predicted scores and out-of-sample full-length scores ranged from “excellent” to “good” (mean ICC = 0.87, range: 0.75–0.96), with the strongest accuracy for suicidality (ICC = 0.96, r = 0.96), anxiety (ICC = 0.92, r = 0.94), alcohol use (ICC = 0.91, r = 0.93), distress (ICC = 0.88, r = 0.88), and functioning (ICC = 0.86, r = 0.88), and good, though comparatively lower, agreement for psychosis (ICC = 0.78, r = 0.81) and mania (ICC = 0.75, r = 0.79). Replicating standardized scale scores with far fewer items shows that dynamic, high-quality mental health assessment is feasible at scale. When embedded in broader digital systems, the tool could direct users to appropriate care pathways or to other forms of assessment (e.g., face-to-face, or delivered by large language models).
As virtual front doors are increasingly in demand among youth seeking mental health care36,37, digital assessments must cover a broad range of domains while sustaining engagement and personalization. Rather than deploying several separate questionnaires, the MCAT approach tailors questions to the individual so that accuracy is maintained despite a substantially briefer assessment. This methodology has previously been studied for the detection of specific diagnoses21,23. By reproducing estimates across several reputable scales, our tool measures treatment needs across clinical, psychosocial, and comorbidity domains and could be implemented at a population level to direct people to care. For example, Australia requires a standardized health measure to formalize mental health treatment plans, which presents an opportunity to briefly screen for multiple types of treatment needs, either during a clinical appointment or independently, at scale. However, it is difficult to match people with the right services using one unidimensional measure (e.g., the K-10 does not measure the need for psychosocial services). Multidimensional screening could instead identify those who require services other than specialized clinical care, which is in high demand, and thus increase access for those who require multidisciplinary and higher-intensity care38.
Combining adaptive screening with other assessment pathways positions these tools as crucial identifiers of those with complex or atypical mental health needs. These young people are vulnerable to care misallocation due to unclear profiles or because they need treatment that exceeds what is typically offered by primary care services10,11. In our study, approximately one in ten individuals (11.01%) reached the maximum item limit (25 items) before the model achieved the certainty threshold. Although this could be viewed as a “failure” of the model, we interpret it as an indicator of an atypical profile that warrants more detailed or in-person assessment to increase certainty about these individuals' needs. These atypical cases may be attributed to high complexity, as they do not follow the patterns observed in most young people and may therefore be more difficult to match to appropriate care. As such, the dynamic assessment is not only predicting full-scale scores but also estimating certainty around how these scores should be interpreted. For example, individuals who reach the maximum number of items without model certainty could be directed to the original full-length assessment or a comprehensive interview, following a stage-based approach to assessment for those exhibiting uncertain profiles or who meet threshold criteria5. Alternatively, this group could be directed to large language models that ask tailored, open-ended questions to broaden the assessment, which is then interpreted by clinical staff39. In this way, services’ dynamic assessments could act as a digital entry point that detects atypicality and complexity as part of a tiered, triage assessment model.
Mania-like and psychosis-like experiences are crucial components of assessment yet are often overlooked in real-world settings5. Several factors may have contributed to the difficulty in predicting these full-length scores in this study. Relying on self-report to identify psychosis- and mania-like experiences is difficult because a lack of insight can be a feature of these states40; accurate identification therefore often requires a professional to observe nuanced features. An alternative methodological explanation is that because MCAT is optimized towards the mean, predicting states that sit towards the maximum or minimum possible score may limit assessment accuracy. Additionally, the sleep item from the ASRM did not display a strictly monotonic relationship with the ASRM score, which may have introduced minor imprecision in full-length score estimates. Nevertheless, MCAT-predicted scores were most accurate for suicidality (r = 0.96, ICC = 0.96), and the SIDAS_1 item (assessing the presence of suicidal thoughts) provided the most information to the model across all seven domains. This demonstrates the importance of assessing suicidality to understand complex clinical profiles, yet 25% of services in an international transdiagnostic youth service sample do not assess for suicidality at intake10,11.
Beyond the intake process, MCAT could improve digitally-enhanced routine outcome monitoring and measurement-based mental health care (MBC)41. These treatment approaches rely on frequent data collection to assess one’s illness trajectory or the impact of treatment (e.g., monitoring for side effects from new medication). MBC hinges on regular assessment, which is necessary for young people who do not obviously fall into the category of early illness (likely requiring low-intensity self-management) or more severe presentations (clearly needing specialized mental health care). A brief assessment tool could be a scalable, low-burden solution to mitigating barriers to MBC, such as inattention, low motivation, and poor engagement41,42.
Ongoing innovation in, and the wide adoption of, artificial intelligence signal the future popularization of dynamic, digital assessments. MCAT could be optimized with reinforcement learning43 or combined machine-learning approaches (e.g., a dual approach with model trees and linear models)44. These methods may improve item selection and thus model efficiency. Future work should also investigate how large language models could integrate with psychometric testing via ensemble learning, potentially allowing for user flexibility while producing more robust results. In either future approach, complex probabilistic inferences are made across a multidimensional latent space, necessitating cautious interpretation of item selection and estimation.
Some limitations of this study should be noted. Naturalistic study designs may introduce sampling biases that are difficult to quantify. For example, data were restricted to those who used the platform and were unavailable for the population who received care at these services but did not use the platform. As this study involved the use of technology, groups with lower digital literacy or cognitive difficulties may have been less likely to take part. Furthermore, the item bank showed some local dependence, likely due to the skip logic employed by certain scales. However, given that model fit was appropriate and estimates aligned accurately with full-length scores, this is unlikely to impact the validity of the results. Although cross-validation was used, testing this assessment on new datasets would clarify the generalizability of these findings to different samples. For example, although post-hoc analysis demonstrated similar results between the young adolescent and young adult subgroups, some items included in the assessment may not be entirely relevant to younger adolescents (i.e., 12–15 years). Relatedly, incorporating informative priors (e.g., age, gender) may enhance the efficiency of the assessment, particularly given the broad range of needs unique to various ages across the youth service population. For selecting stopping criteria, the grid search was performed on out-of-sample data, though fully nested resampling was not conducted due to the computationally expensive process of re-estimating high-dimensional MIRT models and performing MCAT simulations within each sample. Therefore, absolute agreement may be slightly optimistic45. Yet, there were many comparably accurate and efficient stopping criteria configurations (Supplementary Figs. 3 and 4), which indicates minimal tuning bias. Live testing of the assessment in a clinical setting would therefore be a valuable next step for this work, presenting the opportunity for a real-world implementation study to evaluate the usability of this kind of assessment, as well as its impact on care decisions and health service efficiency.
Overall, this study supports the use of adaptive digital assessments in youth mental health services. For services conducting multidimensional assessment, MCAT offers a low-burden and personalized approach to achieving accurate score estimates across several symptom profiles, as well as functioning, suicidality, and alcohol use. Combining an adaptive screener with other assessment pathways is a promising avenue for future research that accommodates poor model convergence (i.e., estimate uncertainty) in young people with statistically complex profiles. The MCAT should therefore be viewed not merely as a tool, but as an accessible digital entry point into a health care system that identifies atypical needs and complexity and redirects young people to further, tailored assessment.
Methods
All data for this study were collected through a quality assurance process facilitated by the University of Sydney research team and approved by the Human Research Ethics Committee. According to the National Health and Medical Research Council’s National Statement on Ethical Conduct in Human Research (2023), this research was deemed lower risk and eligible for exemption from ethics review. Participants provided consent at registration to use the platform. All data are non-identifiable to protect the privacy of participants.
Study design
This cross-sectional study used post-hoc simulations to validate a digital adaptive assessment.
Participants
The sample was screened for eligibility from 2304 young people aged 12–25 years who attended one of 12 Australian mental health services between November 2018 and September 2024 (11 headspace services and one private practice in the metropolitan area of Sydney). Individuals completed a comprehensive self-report assessment using the Innowell Platform46, a digital health technology accessed through the internet via a smartphone, tablet, or computer.
Item bank/measures
This study utilized the Innowell Platform to collect individuals’ data across seven standardized and validated self-report measures related to mental health treatment planning9,46. These measures were selected for this study as they have strong psychometric properties for the youth population and correspond to clinical, psychosocial, and medical care pathways in the multidisciplinary youth mental health services from which these data were collected. Therefore, these scales provide a form of multidimensional assessment aligned with real-world treatment planning.
The selected measures assessed psychological distress28 (K-10), psychotic-like experiences29 (PQ16), mania-like experiences30 (ASRM), symptoms of anxiety31 (OASIS), suicidality32 (SIDAS), alcohol use33 (AUDIT-C), and social and occupational functioning34 (WSAS). Each of the seven measures is designed to assess one distinct construct, yielding an item bank of 49 items. Notably, data from the Quick Inventory of Depressive Symptomatology (QIDS) were available, but these items violated monotonicity assumptions and were thus excluded from further analysis (Supplementary Table 9).
There were 33 items scored using a Likert scale structure (e.g., “On a scale of 0-10, how much do you agree with the following statement”). The remaining 16 items (from the PQ-16) were binary Yes/No questions. The SIDAS and AUDIT-C questionnaires were administered with skip logic. For the SIDAS, a response of 0 on the first item (assessing the presence of suicidal thoughts) triggered logical imputation of 0 for the remaining four items, which are used to assess the nature of suicidal thoughts. Similarly, a response of 0 on the first item of the AUDIT-C (assessing alcohol consumption) resulted in automatic scores of 0 for items 2 and 3, to reflect the individual’s non-drinking behavior. A more detailed description of each measure is provided in Supplementary Note 1.
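A minimal sketch of the skip logic described above, assuming hypothetical column names for the item responses:

```r
# Logical imputation for skip-logic scales; column names are hypothetical
apply_skip_logic <- function(dat) {
  # SIDAS: a 0 on item 1 (no suicidal thoughts) -> items 2-5 imputed as 0
  no_ideation <- dat$SIDAS_1 == 0
  dat[no_ideation, c("SIDAS_2", "SIDAS_3", "SIDAS_4", "SIDAS_5")] <- 0
  # AUDIT-C: a 0 on item 1 (no alcohol consumption) -> items 2-3 imputed as 0
  no_drinking <- dat$AUDITC_1 == 0
  dat[no_drinking, c("AUDITC_2", "AUDITC_3")] <- 0
  dat
}
```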
Inclusion criteria
Participants were included in this study if they completed the 49 items from the seven measures at one time point during their care.
Item response theory (IRT)
IRT models the relationship between item responses and an underlying construct (i.e., a latent trait for person j, denoted θj). This framework uses non-linear, monotonic mathematical functions to estimate the probability of an individual’s specific item response, given θj47,48. Instead of summing item scores to generate individual totals (as in Classical Test Theory), IRT models patterns of responses across multiple items.
For Likert scales, which are common in mental health assessment, IRT models (such as the Graded Response Model; GRM) can estimate the probability of responding to item i in or above category k, given an individual’s latent trait level (θj), with θj assumed to follow a normal distribution with mean = 0 and standard deviation (SD) = 1.
Textbox 1: Graded response model (GRM)
\[P({Y}_{ij}\ge k\mid {\theta }_{j})=\frac{1}{1+\exp \left(-{a}_{i}({\theta }_{j}-{b}_{ik})\right)}\]
\({Y}_{ij}\): Response to item i for person j
\({a}_{i}\): Discrimination parameter for item i
\({b}_{ik}\): Difficulty parameter for category k of item i, where bik increases with k
\({\theta }_{j}\): Latent trait level (e.g., anxiety) for person j
In the GRM, we can derive the likelihood of responding to an item’s score using cumulative probabilities.
Textbox 2. Probability of observed response category
\[P({Y}_{ij}=k\mid {\theta }_{j})=P({Y}_{ij}\ge k\mid {\theta }_{j})-P({Y}_{ij}\ge k+1\mid {\theta }_{j})\]
where \(P({Y}_{ij}\ge 1\mid {\theta }_{j})=1\) and \(P({Y}_{ij}\ge K+1\mid {\theta }_{j})=0\) for an item with K response categories.
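A short R sketch of this computation for a single item, with illustrative (not estimated) parameter values:

```r
# GRM category probabilities for one item; a and b values are illustrative
grm_category_probs <- function(theta, a, b) {
  # Boundary (cumulative) probabilities P(Y >= k), padded with 1 and 0
  p_cum <- c(1, plogis(a * (theta - b)), 0)
  # P(Y = k) = P(Y >= k) - P(Y >= k + 1)
  -diff(p_cum)
}

grm_category_probs(theta = 0.5, a = 1.8, b = c(-1, 0, 1))  # sums to 1
```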
Extending beyond the unidimensional space, θj can be considered not one but several traits that may or may not be correlated. In unidimensional IRT models, the covariation between item responses is explained by a single underlying latent trait, whereas multidimensional IRT (MIRT) models explain the covariation between item responses through multiple latent traits49. This is useful in mental health assessment because of the inter-relationships among symptoms and other relevant treatment factors (e.g., functioning, alcohol use).
This study aims to solve for θj – a latent trait vector representing multiple dimensions (i.e., multiple mental health assessment scores representing different symptoms and other clinically relevant factors).
Textbox 3. Multidimensional graded response model
\[P({Y}_{ij}\ge k\mid {{\boldsymbol{\theta }}}_{j})=\frac{1}{1+\exp \left(-({{\boldsymbol{\alpha }}}_{i}^{\top }{{\boldsymbol{\theta }}}_{j}+{d}_{ik})\right)}\]
\({Y}_{ij}\): Response to item i for person j
\({{\boldsymbol{\alpha }}}_{i}\): Discrimination parameters for item i
\({d}_{ik}\): Intercept parameter for category k of item i, where dik decreases as k increases
\({{\boldsymbol{\theta }}}_{j}\): Latent trait level (e.g., anxiety) for person j across m dimensions, where \({{\boldsymbol{\theta }}}_{j}=({\theta }_{j1},{\theta }_{j2},\ldots ,{\theta }_{jm})\)
Multidimensional computerized adaptive testing (MCAT)
The CAT methodology utilizes IRT to create personalized and dynamic assessments. This process involves calibrating an item bank to identify how well each item aligns with each latent trait. Once the item bank is calibrated and parameters are estimated, the CAT can be developed. During a CAT, when an individual answers a question, the algorithm updates the latent trait estimates (θj) before presenting the new question that will maximize information gain. In MCAT, the Fisher information matrix can be used to quantify the information gained from each item, capturing information about the targeted trait as well as the covariance between traits50. The test continues until a predefined level of precision is achieved across all latent trait estimates (e.g., minimizing standard error) or the predefined maximum item limit is reached.
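One common formalization of this determinant-based selection from the optimal-design literature cited above50 (notation ours; the exact weighting used in software implementations may differ) selects the next item \(i^{*}\) from the remaining items \(R\), given the administered set \(S\), as:

\[{i}^{*}={\arg \max }_{i\in R}\,\det \left({{\boldsymbol{\Sigma }}}^{-1}+\sum _{l\in S}{{\bf{I}}}_{l}({\hat{{\boldsymbol{\theta }}}}_{j})+{{\bf{I}}}_{i}({\hat{{\boldsymbol{\theta }}}}_{j})\right)\]

where \({{\bf{I}}}_{l}\) is the Fisher information matrix of item l evaluated at the current trait estimates and \({{\boldsymbol{\Sigma }}}^{-1}\) is the prior precision matrix contributed by the posterior.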
Model specification
Each item was assigned to one domain based on its corresponding legacy scale, but domain scores were estimated concurrently by allowing for inter-domain correlations. Thus, the model comprised seven latent traits, each representing a dimension of θj that corresponded to one of the assessed scales. Covariation among latent traits was permitted so that information from individual items could spread multidimensionally via the covariance matrix. Because we modeled several domains with a mix of item types, the MHRM algorithm was used to estimate item parameters. This correlated multidimensional structure was compared with a bifactor model (Supplementary Fig. 5). The bifactor model failed to converge after 5000 cycles using the Expectation-Maximization algorithm and was therefore discarded from further analyses; the correlated multidimensional model was used for MCAT simulations.
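A minimal sketch of this specification with the mirt package, assuming a hypothetical 49-column response frame `item_data` whose column order follows the scale listing above (the item index ranges are illustrative placeholders):

```r
library(mirt)

# Correlated seven-factor model: each item loads on its legacy-scale domain,
# and all pairwise factor covariances are freely estimated via COV
spec <- mirt.model("
  Distress    = 1-10
  Psychosis   = 11-26
  Mania       = 27-31
  Anxiety     = 32-36
  Suicidality = 37-41
  Alcohol     = 42-44
  Functioning = 45-49
  COV = Distress*Psychosis*Mania*Anxiety*Suicidality*Alcohol*Functioning")

# MHRM handles the high-dimensional integration; 'graded' fits the GRM
# (the dichotomous PQ-16 items reduce to the 2PL as a special case)
mod <- mirt(item_data, model = spec, itemtype = "graded", method = "MHRM")
```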
Assumptions of multidimensional item response theory (MIRT)
Two key assumptions of MIRT were tested on the 49 items: monotonicity (as θ increases, the probability of endorsing higher item responses increases) and local dependence (item responses should be independent after accounting for latent traits). Monotonicity checks were conducted by inspecting whether each item’s Loevinger’s H coefficient was above 0.3. Local dependence was assessed using Yen’s Q3, which calculates the correlation between item residuals after accounting for the latent traits51. We inspected Q3 values both locally (by fitting separate unidimensional models for each domain) and globally (using the complete model).
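A sketch of both checks, assuming `item_data` is the 49-column response frame from above and the K-10 occupies columns 1-10 (a hypothetical layout; in practice the checks run per domain and on the complete model):

```r
library(mirt)
library(mokken)  # provides Loevinger's H scalability coefficients

# Monotonicity for one scale: per-item H values below 0.3 are flagged
H <- coefH(item_data[, 1:10], se = FALSE, nice.output = FALSE)
weak_items <- which(H$Hi < 0.3)

# Local dependence within one domain: unidimensional fit, then Yen's Q3
k10_mod <- mirt(item_data[, 1:10], model = 1, itemtype = "graded")
Q3 <- residuals(k10_mod, type = "Q3")
flagged_pairs <- which(abs(Q3) > 0.2 & lower.tri(Q3), arr.ind = TRUE)
```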
Simulated assessments using multidimensional computerized adaptive testing
Using the preferred model structure, we then conducted post-hoc MCAT simulations to estimate the performance and efficiency of the dynamic assessment. The simulations demonstrate how the model would perform if it were administered to real participants. To reduce the risk of overfitting, we conducted ten-fold cross-validation. In each fold, the model was trained on 90% of the data and the assessment was tested on the remaining 10% of (unseen) data. This process was repeated 10 times, so that each simulated participant was included in the test set exactly once.
An individual’s latent trait estimates were updated using expected a posteriori (EAP) estimation. Items were selected using the Determinant Posterior Rule, which selects as the next item the one that maximizes the determinant of the posterior Fisher information matrix50; that is, the item that, given an individual’s current θj estimates, would achieve the greatest reduction in uncertainty across all latent traits. At the start of each assessment, a multivariate standard normal distribution was assumed for θj, resulting in trait estimates of 0 before any items were administered.
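A sketch of one simulated assessment with the mirtCAT package, replaying a pre-collected response pattern through the fitted model `mod` from above; `responses` is a hypothetical 49-element response vector for one person, and the design values match the reported optimal configuration:

```r
library(mirtCAT)

res <- mirtCAT(mo = mod,
               local_pattern = responses,   # replay pre-collected answers
               method = "EAP",              # expected a posteriori updates
               criteria = "Drule",          # determinant-based item selection
               design = list(min_items = 7,
                             max_items = 25,
                             delta_thetas = rep(0.05, 7)))  # per-dimension stop

res$thetas          # final estimates for the seven latent traits
res$items_answered  # which items were administered, and in what order
```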
To evaluate efficiency and accuracy of the MCAT, we first conducted a simulated assessment on the entire item bank (number of items = 49). The mean SEM for θj (for all estimated traits) was observed at each item to understand model performance and tailor stopping criteria for adaptive simulated assessments.
Using the full item bank simulation, hyper-parameter optimization was conducted with three stopping rules (justified by popular use in CAT simulation studies)22,25,52:
1. A minimum item threshold (i.e., requiring a set number of items before stopping was permitted)
2. A maximum item limit (i.e., concluding the assessment at a specific number of items)
3. An absolute change in θj rule (i.e., the assessment ends when the absolute change in θj estimates between successive item responses falls below a predefined threshold), known herein as ∆θj
The minimum item threshold prevents the assessment from prematurely concluding before sufficient coverage across each domain, while the maximum item limit constrains response burden by stopping the assessment when additional items are unlikely to meaningfully improve estimate precision (e.g., for extreme cases). The ∆θj rule acts as a convergence criterion that monitors the stability of estimates and identifies when further information is unlikely to significantly change estimated values.
Given that the objective of this study was to shorten the questionnaire battery while maintaining absolute agreement with the true full-length scores, hyperparameters were optimized using a grid search. This process searches for the stopping criteria that produce the minimum required items per assessment (on average across the sample) such that all domain-level estimates achieve ICCs ≥ 0.75 with full-length scores, indicating “good” or “excellent” agreement53. The hyperparameter space was informed by the full item bank simulation and previous CAT simulation studies22,25,52. Comparing multiple stopping criteria demonstrates the robustness of results under different hyperparameter configurations, while also highlighting the relationship between the average number of items used and the absolute agreement between true and estimated assessment scores.
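A minimal sketch of this grid search, where `simulate_fold()` is a hypothetical wrapper around the MCAT simulation that returns per-domain ICCs and item counts for a given configuration:

```r
# Stopping-criteria grid reported above: 8 x 22 x 3 = 528 configurations
grid <- expand.grid(min_items   = 2:9,
                    max_items   = 9:30,
                    delta_theta = c(0.01, 0.03, 0.05))

results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(g) {
  out <- simulate_fold(grid[g, ])           # ten-fold CV under this config
  data.frame(grid[g, ],
             mean_items = mean(out$n_items),
             min_icc    = min(out$icc))     # worst domain-level agreement
}))

# Keep configurations where every domain reaches ICC >= 0.75, then take the
# one administering the fewest items on average
ok   <- subset(results, min_icc >= 0.75)
best <- ok[which.min(ok$mean_items), ]
```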
Evaluation metrics
Efficiency and accuracy metrics for each MCAT simulation were recorded as an average across the 10 folds. Efficiency was measured by the mean and median number of items administered across the whole assessment. The time taken to complete each assessment scale was used to convert item counts into an estimated time to complete the MCAT. Absolute agreement was measured by two-way mixed-effects ICCs (i.e., the degree to which the individual’s estimated score aligned with their full-length score), the Pearson correlation coefficient (r; measuring the linear relationship between the individual’s estimated score and their full-length score), and mean absolute error (i.e., the average absolute difference between estimated and true scores, represented as both raw values and percentage error relative to the domain’s maximum score).
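A sketch of these metrics for one domain, assuming hypothetical vectors `est` (MCAT-estimated scores) and `true` (full-length scores) and a scalar `max_score`; the single-measures agreement ICC from irr takes the same value under the random- and mixed-effects definitions:

```r
library(irr)

icc_val <- icc(cbind(est, true), model = "twoway",
               type = "agreement", unit = "single")$value  # absolute agreement
r_val   <- cor(est, true)                                  # Pearson's r
mae     <- mean(abs(est - true))                           # mean absolute error
mae_pct <- 100 * mae / max_score  # percent error relative to the scale maximum
```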
Domain- and item-level findings are based on the simulation results determined by the optimal stopping criteria configuration (minimum ICC ≥ 0.75).
Statistical software
The mirt and mirtCAT packages in R (version 4.2.1) were used for fitting the multidimensional IRT models and conducting the MCAT simulations54,55.
Data availability
The data that support the findings of this study are available from Innowell Pty Ltd, but restrictions apply to the availability of these data, which were used under licence for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Innowell Pty Ltd.
Code availability
The underlying code for this study is under formal intellectual property review by the University of Sydney and is therefore not publicly available.
References
Tutun, S. et al. An AI-based decision support system for predicting mental health disorders. Inf. Syst. Front. 25, 1261–1276 (2023).
McGorry, P. D. et al. The Lancet Psychiatry Commission on youth mental health. Lancet. Psychiatry 11, 731–774 (2024).
McMahon, B. & McInerney, D. Right care, right place, first time: how AI is improving national virtual front doors. NEJM AI 2 (2025).
Bucci, S., Schwannauer, M. & Berry, N. The digital revolution and its impact on mental health care. Psychol. Psychother. 92, 277–297 (2019).
Karcher, N. R. et al. Youth mental health screening and linkage to care. Psychiatr. Serv. 74, 727–736 (2023).
Capon, W. et al. A multidimensional approach for differentiating the clinical needs of young people presenting for primary mental health care. Compr. Psychiatry 126, 152404 (2023).
Capon, W. et al. Characterising variability in youth mental health service populations: a detailed and scalable approach using digital technology. Australas. Psychiatry 31, 295–301 (2023).
Capon, W. et al. Matching needs to services: Development of a service needs index for determining care pathways in youth mental health. Aust. N. Z. J. Psychiatry 59, 776–785 (2025).
Hickie, I. B. et al. Right care, first time: a highly personalised and measurement-based care model to manage youth mental health. Med. J. Aust. 211, S3–S46 (2019).
Mughal, S. et al. The needs of most or the most in need? How integrated and transdiagnostic youth services assess and operationalize mental health needs. Psychiatr. Serv. 76, 997–1017 (2025).
Capon, W. et al. What gets measured gets managed: an analysis of how international integrated and transdiagnostic youth mental health service models identify treatment needs. Under review (2025).
Martel, R. M. et al. YouthCHAT as a primary care E-screening tool for mental health issues among Te Tai Tokerau Youth: protocol for a co-design study. JMIR Res. Protoc. 8, e12108 (2019).
Stewart, S. L. et al. The interRAI Child and Youth Suite of Mental Health Assessment Instruments: an integrated approach to mental health service delivery. Front. Psychiatry 13, 710569 (2022).
Gorban, C. et al. Building mutually beneficial collaborations between digital navigators, mental health professionals, and clients: naturalistic observational case study. JMIR Ment. Health 11, e58068 (2024).
Thabrew, H. et al. Comparison of YouthCHAT, an electronic composite psychosocial screener, with a clinician interview assessment for young people: randomized controlled trial. J. Med. Internet Res. 21, e13911 (2019).
Chong, M. K. et al. Personalized and collaborative use of digital measurement-based care tools enhances engagement among young adults: a mixed-methods study. BMC Health Serv. Res. 25, 752 (2025).
Forbes, M. K. et al. Elemental psychopathology: distilling constituent symptoms and patterns of repetition in the diagnostic criteria of the DSM-5. Psychol. Med. 54, 886–894 (2024).
Spiller, T. R. et al. Unveiling the structure in mental disorder presentations. JAMA Psychiatry 81, 1101–1107 (2024).
Newson, J. J., Pastukh, V. & Thiagarajan, T. C. Poor separation of clinical symptom profiles by DSM-5 disorder criteria. Front. Psychiatry 12, 775762 (2021).
Martin-Key, N. A. et al. The current state and validity of digital assessment tools for psychiatry: systematic review. JMIR Ment. Health 9, e32824 (2022).
Gibbons, R. D. & deGruy, F. V. Without wasting a word: extreme improvements in efficiency and accuracy using computerized adaptive testing for mental health disorders (CAT-MH). Curr. Psychiatry Rep. 21, 67 (2019).
Giordano, A. et al. Applying multidimensional computerized adaptive testing to the MSQOL-54: a simulation study. Health Qual. Life Outcomes 21, 61 (2023).
Gibbons, R. D. et al. Computerized adaptive tests for rapid and accurate assessment of psychopathology dimensions in youth. J. Am. Acad. Child Adolesc. Psychiatry 59, 1264–1273 (2020).
Bass, M., Morris, S. & Neapolitan, R. Utilizing multidimensional computer adaptive testing to mitigate burden with patient reported outcomes. AMIA Annu Symp Proc. 5, 320–328 (2015).
Sunderland, M., Batterham, P., Carragher, N., Calear, A. & Slade, T. Developing and validating a computerized adaptive test to measure broad and specific factors of internalizing in a community sample. Assessment 26, 1030–1045 (2019).
Loe, B. S., Stillwell, D. & Gibbons, C. Computerized adaptive testing provides reliable and efficient depression measurement using the CES-D scale. J. Med. Internet Res. 19, e302 (2017).
Sunderland, M. et al. Comparing scores from full length, short form, and adaptive tests of the social interaction anxiety and social phobia scales. Assessment 27, 518–532 (2020).
Kessler, R. C. et al. Short screening scales to monitor population prevalences and trends in non-specific psychological distress. Psychol. Med. 32, 959–976 (2002).
Ising, H. K. et al. The validity of the 16-item version of the Prodromal Questionnaire (PQ-16) to screen for ultra high risk of developing psychosis in the general help-seeking population. Schizophr. Bull. 38, 1288–1296 (2012).
Altman, E. G., Hedeker, D., Peterson, J. L. & Davis, J. M. The Altman Self-Rating Mania Scale. Biol. Psychiatry 42, 948–955 (1997).
Norman, S. B., Cissell, S. H., Means-Christensen, A. J. & Stein, M. B. Development and validation of an Overall Anxiety Severity and Impairment Scale (OASIS). Depress. Anxiety 23, 245–249 (2006).
van Spijker, B. A. et al. The suicidal ideation attributes scale (SIDAS): community-based validation study of a new scale for the measurement of suicidal ideation. Suicide Life Threat Behav. 44, 408–419 (2014).
Bush, K., Kivlahan, D. R., McDonell, M. B., Fihn, S. D. & Bradley, K. A. The AUDIT alcohol consumption questions (AUDIT-C): an effective brief screening test for problem drinking. Ambulatory Care Quality Improvement Project (ACQUIP). Alcohol Use Disorders Identification Test. Arch. Intern Med. 158, 1789–1795 (1998).
Mundt, J. C., Marks, I. M., Shear, M. K. & Greist, J. H. The Work and Social Adjustment Scale: a simple measure of impairment in functioning. Br. J. Psychiatry 180, 461–464 (2002).
Zein, R. A. & Akhtar, H. Getting started with the graded response model: an introduction and tutorial in R. Int. J. Psychol. 60, e13265 (2025).
Ganga, R. N., Santa, K., Ali, M. & Smith, G. The impact of a digital referral platform to improve access to child and adolescent mental health services: a prospective observational study with real-world data. Int. J. Environ. Res. Public Health 21, 1318 (2024).
Sin, J. et al. Digital interventions for screening and treating common mental disorders or symptoms of common mental illness in adults: systematic review and meta-analysis. J. Med. Internet Res. 22, e20581 (2020).
Skinner, A., Occhipinti, J. A., Prodan, A., Song, Y. J. C. & Hickie, I. B. Bi-stability and critical transitions in mental health care systems: a model-based analysis. Int. J. Ment. Health Syst. 17, 5 (2023).
Kjell, O. N. E., Kjell, K. & Schwartz, H. A. Beyond rating scales: with targeted evaluation, large language models are poised for psychological assessment. Psychiatry Res. 333, 115667 (2024).
Ghaemi, S. N. & Pope, H. G. Jr. Lack of insight in psychotic and affective disorders: a review of empirical studies. Harv. Rev. Psychiatry 2, 22–33 (1994).
Graham, A. K. et al. Validation of the computerized adaptive test for mental health in primary care. Ann. Fam. Med. 17, 23–30 (2019).
Bassi, E. M. et al. Perceptions of mental health providers of the barriers and facilitators of using and engaging youth in digital mental-health-enabled measurement based care. Digit. Health 10, 20552076241253093 (2024).
Li, J., Gibbons, R. & Ročková, V. Deep computerized adaptive testing. Preprint at arXiv https://doi.org/10.48550/arXiv.2502.19275 (2025).
Colledani, D., Barbaranelli, C. & Anselmi, P. Fast, smart, and adaptive: using machine learning to optimize mental health assessment and monitor change over time. Sci. Rep. 15, 6492 (2025).
Bischl, B. et al. Hyperparameter optimization: foundations, algorithms, best practices, and open challenges. WIREs Data Min. Knowl. Discov. 13, e1484 (2023).
Iorfino, F. et al. A digital platform designed for youth mental health services to deliver personalized and measurement-based care. Front. Psychiatry 10, 595 (2019).
Cappelleri, J. C., Jason Lundy, J. & Hays, R. D. Overview of classical test theory and item response theory for the quantitative assessment of items in developing patient-reported outcomes measures. Clin. Ther. 36, 648–662 (2014).
Reise, S. P. & Waller, N. G. Item response theory and clinical measurement. Annu. Rev. Clin. Psychol. 5, 27–48 (2009).
Cai, L., Choi, K., Hansen, M. & Harrell, L. Item Response Theory. Annu. Rev. Stat. Appl. 3, 297–321 (2016).
Mulder, J. & van der Linden, W. J. Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika 74, 273–296 (2009).
Christensen, K. B., Makransky, G. & Horton, M. Critical values for Yen’s Q(3): identification of local dependence in the Rasch model using residual correlations. Appl. Psychol. Meas. 41, 178–194 (2017).
Haley, S. M., Coster, W. J., Andres, P. L., Kosinski, M. & Ni, P. Score comparability of short forms and computerized adaptive testing: simulation study with the activity measure for post-acute care. Arch. Phys. Med. Rehabil. 85, 661–666 (2004).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Chalmers, R. P. mirt: a multidimensional item response theory package for the R environment. J. Stat. Softw. 48, 1–29 (2012).
Chalmers, R. P. Generating adaptive and non-adaptive test interfaces for multidimensional item response theory applications. J. Stat. Softw. 71, 1–38 (2016).
Acknowledgements
W.C. extends his gratitude to Samuel McLeod for guidance related to cross-validation. The funding sources of this study had no role in the design, data collection, data analysis, or reporting of this study. This work was supported by the Medical Research Future Fund Applied Artificial Intelligence in Health Care grant [MRFAI000097]. W.C. was supported by the Australian Government Research Training Program (RTP) Scholarship. J.J.C. was supported by an NHMRC Emerging Leadership Fellowship (2008197). I.B.H. was supported by an NHMRC Research Fellowship (511921). H.M.L. is supported by the Bill & Patricia Richie Foundation and a philanthropic donor affected by mental ill-health who wishes to remain anonymous. F.I. was supported by an NHMRC EL1 Investigator Grant (GNT2018157).
Author information
Authors and Affiliations
Contributions
W.C. and F.I. conceptualized the project. W.C. conceptualized and conducted the analysis. M.V. reviewed the analysis and provided ongoing statistical guidance. All authors interpreted the results and provided insights. W.C. wrote the initial draft of the manuscript and created the figures, tables, and supplementary material. All authors (W.C., I.B.H., M.V., H.M.L., L.J.B., J.J.C., E.M.S., F.I.) provided useful and significant changes to the first draft of the manuscript.
Corresponding author
Ethics declarations
Competing interests
I.B.H. is a Professor of Psychiatry and the Co-Director of Health and Policy, Brain and Mind Centre, University of Sydney. He has led major public health and health service development in Australia, particularly focusing on early intervention for young people with depression, suicidal thoughts and behaviors and complex mood disorders. He is active in the development through codesign, implementation and continuous evaluation of new health information and personal monitoring technologies to drive highly-personalized and measurement-based care. He holds a 3.2% equity share in Innowell Pty Ltd., which is focused on the digital transformation of mental health services. E.M.S. is a Principal Research Fellow at the Brain and Mind Centre, University of Sydney, a Consultant Psychiatrist and Adjunct Clinical Professor at the School of Medicine, University of Notre Dame. She previously served as the Discipline Leader for Adult Mental Health at Notre Dame until January 2025. In addition, she is a member of Medibank’s Medical and Mental Health Reference Groups. A/Prof Scott has also delivered educational seminars on the clinical management of depressive disorders, receiving honoraria from pharmaceutical companies including Servier, Janssen, and Eli Lilly. Moreover, she has contributed to a national advisory board for Pfizer’s antidepressant Pristiq and served as the National Coordinator for an antidepressant trial sponsored by Servier. All other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Capon, W., Hickie, I.B., Varidel, M. et al. Validating an adaptive digital assessment of youth mental health needs: a cross-sectional study. npj Digit. Med. 9, 173 (2026). https://doi.org/10.1038/s41746-026-02374-2