Introduction

Schizophrenia is a severe mental disorder characterized by disturbances across multiple domains, such as thinking, perception, self-experience, cognition, volition, affect, and behavior, and is frequently associated with significant social and occupational impairments [1, 2]. Globally, schizophrenia affects approximately 23 million people (around 1 in 345), with higher age-specific prevalence among adults (around 1 in 233) [3]. The illness follows a chronic, relapsing course with substantial functional impairment and premature mortality. Meta-analytic estimates indicate 13–15 years of potential life lost, with the pooled expected age at death being approximately 60 years in men and 68 years in women [4]. Relapse remains common even with treatment; in prospective first-episode cohorts, the five-year cumulative relapse rate reaches approximately 82% [5]. The early phases of the illness carry elevated risks of self-harm and suicide, with lifetime suicide mortality of approximately 5% [6, 7] and suicidal ideation rates of approximately 35% [8]. These patterns require care that extends beyond acute symptom control.

Psychiatric (psychosocial) rehabilitation, defined by the World Health Organization as a process facilitating opportunities for individuals with mental disorders to achieve optimal independent functioning by strengthening personal competencies and addressing environmental barriers, has emerged as essential for improving functioning and quality of life [9, 10]. Reflecting the needs of patients with mental health disorders, international frameworks emphasize comprehensive, integrated, community-based mental health and social care with longitudinal, measurement-based assessment and adaptation [11,12,13,14]. Best-practice guidelines commonly organize these principles into three core components: (i) comprehensive assessment and individualized care planning, which should be delivered in recovery-oriented, community-based services to support autonomy and participation [12, 13], with ongoing measurement-based monitoring to inform treatment adjustments [14, 15]; (ii) medication management and adherence support, including consideration of long-acting injectable antipsychotics when appropriate [15]; and (iii) evidence-based psychosocial interventions, such as cognitive remediation [16], psychoeducation (e.g., family-based models) [17], and social skills training [18]. In routine clinical workflows, these principles are operationalized through regular follow-up visits [13], family psychoeducation [17], social skills training [18], medication management and adherence support, and measurement-based relapse-prevention planning [15]. Randomized evidence from low-resource settings shows that community-based rehabilitation improves schizophrenia outcomes [19].

Despite the standardization of psychiatric rehabilitation, its implementation remains uneven worldwide [20]. Substantial challenges include resource constraints such as low mental health budgets, workforce shortages, hospital-centric spending [21], and medication-related adverse effects that undermine treatment adherence and tolerability [22, 23]. Moreover, deterioration detection and timely interventions are hampered by the absence of routine, measurement-based assessments [15] and relapse-prevention planning (e.g., early warning sign monitoring) [24]. Coverage data clearly show such implementation gaps: only approximately 29% of people with psychosis receive specialist mental health care globally [3], and approximately one-third of adults with serious mental illnesses in the United States of America have received no mental health treatment in the past year [25]. Such deficiencies in accessibility, quality, and continuity of care, alongside the aforementioned high relapse burden, motivate the search for scalable, remotely deliverable complements to routine rehabilitation [20].

The rapid maturation of digital mental health technologies has opened new pathways for addressing these gaps [26]. Mobile health applications [27], telepsychiatry [28], therapist-guided Internet-delivered cognitive behavioral therapy [29], videoconference-delivered cognitive behavioral therapy [30], automated virtual reality-delivered psychological therapy (some marketed as DTx) [31], and wearable-enabled monitoring [32] now offer practical complements to community care. These technological applications support real-time self-monitoring [27], scalable skills-based interventions [29, 30], remote access [28], and continuous physiological/behavioral monitoring [32]. They capture data through active patient inputs (e.g., ePRO/EMA on smartphones) [33] and passive sensing within a digital phenotyping framework (device logs and onboard sensors) [34].

Artificial intelligence (AI), referring to probabilistic computational methods that learn from data to support prediction/decision-making under uncertainty, has been increasingly applied to analyze these datasets [35]. Related AI approaches encompass several major paradigms, such as supervised learning for outcome prediction from labeled data, unsupervised learning for structure discovery in data without labels, reinforcement learning for sequential decision-making from interactions, and self-supervised learning that derives supervisory signals directly from raw data [35, 36]. Model families range from interpretable approaches (e.g., logistic regression and decision trees) to deep neural networks, with the latter encompassing convolutional architectures for images, recurrent and transformer architectures for sequential data, and graph neural networks for relational structures [35, 37]. Large language models (LLMs; i.e., transformer-based foundation models pretrained on massive text corpora) exhibit strong capabilities in language understanding, generation, and emergent reasoning, enabling applications to process clinical narratives and patient–provider communication [38]. Advanced training strategies for LLMs include multimodal learning to integrate heterogeneous sources, transfer learning to adapt models across domains, and federated learning to enable collaborative training while preserving data locality and privacy [39, 40]. Rigorous LLM deployment requires attention to predictive performance, robustness under distribution shift, principled uncertainty quantification, and governance that advances transparency, fairness, privacy, and security [41, 42].

Notwithstanding the rehabilitation-oriented capabilities of AI and digital therapeutics and the growing related research, the literature on AI in schizophrenia remains predominantly concentrated on pathophysiology and diagnosis. For example, supervised models trained on routine electronic health records (EHRs) forecast diagnostic progression for schizophrenia or bipolar disorder [43]. A recurrent neural-network model trained on multi-system EHR data identified individuals at risk of first-episode psychosis up to 12 months before the index event [44]. In neuroimaging, large multisite analyses show that machine learning pipelines can extract reproducible image-derived markers [45]; deep learning graph-neural networks that fuse structural and functional MRI (fMRI) further automate feature discovery, achieving 83% cross-validated accuracy while highlighting circuit-level biomarkers [46]; hypothesis-driven fMRI biomarkers also quantify disease-relevant physiology, such as a cross-validated striatal-dysfunction index that discriminates schizophrenia from controls and relates to antipsychotic response [47]. Multimodal fusion with genomic/transcriptomic data both improves discrimination and helps localize disease-relevant circuits [48], while imaging–transcriptomic maps link MRI phenotypes and fMRI signal amplitude to the cortical expression of interneuron markers [49] and to spatial patterns of schizophrenia risk-gene expression [50].

Meanwhile, rehabilitation-targeted AI applications (e.g., focusing on functional assessment, longitudinal symptom/risk monitoring, medication management, psychosocial skills training, and community reintegration) have received comparatively less attention than diagnostic/prognostic AI applications and remain under-synthesized [51]. A comprehensive synthesis of the AI-based rehabilitation field is particularly critical because psychiatric rehabilitation poses challenges beyond algorithmic performance, requiring context-aware deployment, integration with care pathways, and attention to implementation barriers [52]. Patients commonly raise concerns about data privacy [53] and the possibility that intensive passive monitoring could exacerbate anxiety or paranoia [54]. Clinicians likewise warn that overly intrusive sensing can strain the therapeutic alliance and that recommendations must be sensitive to the clinical context to be actionable [55]. These concerns intersect with technical demands for explainable systems [56] and for high-quality, reliable data, especially in consideration of issues such as label scarcity in psychiatry [57], the limited ecological validity of many functional outcomes [58], device and platform heterogeneity in smartphone/wearable data collection [59], and performance degradation from distribution shifts [35].

To be clear, some prior reviews have synthesized AI applications, including schizophrenia-focused scoping reviews. The research gap we highlight is that the concentration of past reviews on diagnosis and acute-phase management [60,61,62] has left rehabilitation processes underexplored. This gap is compounded by broader, cross-diagnostic overviews of digital/AI approaches rarely providing analyses aligned with schizophrenia rehabilitation targets [63, 64], particularly negative symptoms [65] and community/social participation [66], which require tailored intervention strategies. This systematic review aimed to address these gaps by examining AI applications in schizophrenia rehabilitation management. We analyzed the technical and practical applications of AI models across core rehabilitation domains, including symptom monitoring, medication management, risk management, functional training, and psychosocial support.

Methods

This was a systematic scoping review. We chose this design owing to the significant heterogeneity in objectives, technologies, and evaluation metrics across the included studies; a scoping approach enables synthesis of research results, evaluation of AI implementation in schizophrenia rehabilitation, and identification of key values and challenges. This study was reported following the PRISMA-ScR guidelines [67].

Search strategy

Search sources

We conducted two database searches (Round 1, January 15–31, 2025; Round 2, October 1–15, 2025, following reviewer feedback) across four databases: PubMed (clinical and rehabilitation literature), Web of Science, IEEE Xplore, and the ACM Digital Library (AI-focused computing and engineering venues).

Eligible records spanned January 1, 2012, through October 31, 2025. The 2012 start date reflects the emergence of modern deep learning (e.g., AlexNet) [68], the subsequent acceleration of AI’s development toward natural language processing and computer vision [69], and the sparsity of AI-related mental health literature before this period [64, 70]. We conducted backward and forward citation chasing to improve completeness.

Search terms

We developed search terms under the guidance of two mental health rehabilitation experts, covering target population, AI technologies, and rehabilitation contexts: (“artificial intelligence” OR “AI” OR “machine learning” OR “deep learning” OR “neural networks” OR “natural language processing” OR “computer vision” OR “computational intelligence” OR “data mining” OR “predictive modeling” OR “reinforcement learning”) AND (“schizophrenia” OR “schizophrenic” OR “schizoaffective disorder” OR “psychosis” OR “psychotic disorder” OR “severe mental illness”) AND (“rehabilitation” OR “recovery” OR “management” OR “care” OR “medication adherence” OR “drug compliance” OR “pharmacological management” OR “medication tracking” OR “medication optimization” OR “relapse prevention” OR “risk assessment” OR “risk prediction” OR “violence prediction” OR “crisis management” OR “cognitive training” OR “social skills training” OR “life skills development” OR “functional recovery” OR “skill-building interventions” OR “symptom tracking” OR “symptom monitoring” OR “behavioral monitoring” OR “therapeutic intervention” OR “emotional support” OR “therapy engagement” OR “psychological well-being”).
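For reproducibility, the three concept blocks above can be assembled into the final Boolean string programmatically; the sketch below is illustrative only, with each term list abbreviated (the full lists are given in the query above):

```python
# Concept blocks from the search strategy (term lists abbreviated here;
# the complete lists appear in the Search terms section).
ai_terms = ["artificial intelligence", "machine learning", "deep learning"]
population_terms = ["schizophrenia", "psychosis", "severe mental illness"]
rehab_terms = ["rehabilitation", "recovery", "medication adherence"]

def or_block(terms):
    """Quote each term and join with OR inside parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The three blocks are combined with AND, mirroring the strategy above.
query = " AND ".join(or_block(b) for b in [ai_terms, population_terms, rehab_terms])
```

Generating the string this way keeps the block structure auditable and makes it trivial to adapt the query to database-specific syntax.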

Study eligibility criteria

Operational boundary of “Rehabilitation” for schizophrenia

In schizophrenia, psychiatric rehabilitation denotes a recovery-oriented, person-centered, longitudinal framework that enables the development of the skills and securing of the environmental supports required to live, learn, and work in the community with the least professional assistance [71]. Currently, this framework integrates evidence-based pharmacological and psychosocial care (e.g., structured symptom monitoring, medication management, proactive risk management, skills-based functional training, and psychosocial support) to maintain stability, prevent relapse, and promote community participation [14]. By contrast, diagnosis is a categorical, operational process that establishes case identification using syndromic criteria and duration thresholds, designed primarily for reliability and clinical utility rather than prescribing specific treatment pathways [72]. Field studies of the ICD-11 diagnostic guidelines similarly emphasize the clinical utility of diagnosis for communication and decision-making rather than uniform intervention protocols [73]. Accordingly, this review focuses on rehabilitation and management approaches that prioritize community functioning and quality-of-life outcomes and go beyond symptom remission.

Rehabilitation application domains

To systematically categorize AI applications within this schizophrenia rehabilitation framework, we operationalized seven core domains reflecting contemporary rehabilitation clinical practice [14]. Each included study was mapped to one or more domains shown in the following list.

  a. Symptom monitoring: continuous, structured assessment of positive, negative, affective, and related functional symptoms in real-world settings via clinician ratings, patient-reported outcomes, ecological momentary assessment [74], and passive sensing [75]. This structured assessment aims to detect fluctuations and early warning signs of relapse [76] and guide timely interventions.

  b. Medication management: a systematic, long-term process aimed at optimizing antipsychotic therapy [14], preventing relapse, and minimizing harm, including drug selection/titration, adherence assessment and support, adverse effect monitoring/management [77], long-acting injectable scheduling [78], and shared decision-making.

  c. Risk management: ongoing assessment, risk formulation, and collaborative management targeting high-impact adverse outcomes (suicide/self-harm [79], violence/victimization [80], and relapse/crisis/hospitalization), integrating early warning monitoring [81], safety planning, and stepped, cross-setting responses.

  d. Functional training: training-based, skills-focused interventions that build the enduring capacities needed for community functioning (e.g., neurocognition, social cognition, activities of daily living, instrumental activities of daily living, and vocational skills) through repeated practice and coached learning (e.g., cognitive remediation [82], social-cognition training [83], and individual placement and supported employment [84]).

  e. Psychosocial support: structured educational, therapeutic, and social network interventions that enhance coping, family and peer involvement, service engagement, and community integration (e.g., family psychoeducation [85], cognitive behavioral therapy for psychosis [86], and peer support [87]).

  f. Physical health/lifestyle management (pre-specified, zero-hit in this review): structured, multicomponent interventions addressing cardiometabolic risks to help close the mortality gap [88] and improve functioning and quality of life, including those combining physical activity and diet/weight management [89], smoking cessation [90], and routine metabolic screening [88].

  g. Service organization/care coordination (pre-specified, zero-hit in this review): team- and pathway-level models that orchestrate medication, psychosocial intervention, and vocational/educational support to deliver integrated, continuous rehabilitation in routine services. Examples include coordinated specialty care for first-episode psychosis [91], assertive community treatment [92], and intensive/structured case management [93].

None of the included studies mapped to domains f or g. Therefore, although these domains were retained for completeness, they were omitted from our domain-level quantitative synthesis.

Inclusion criteria

To ensure relevance to rehabilitation and methodological rigor, studies were included if they met all the criteria below.

  a. Population: adults or adolescents with clinician-confirmed schizophrenia-spectrum disorders (DSM-5/DSM-5-TR or ICD-10/ICD-11). Studies could include broader serious mental illness diagnoses (e.g., schizoaffective disorder or bipolar disorder with psychotic features), provided that schizophrenia-spectrum disorders constituted a primary analytic group or clearly defined subgroup. Studies conducted in hospitals were eligible only if the AI function targeted post-discharge management or community reintegration outcomes.

  b. Intervention/AI function: an AI system (per Organisation for Economic Co-operation and Development/International Organization for Standardization definitions) that infers from inputs to produce predictions/recommendations/decisions/content in service of a rehabilitation task in any of the seven core domains (see Section 2.2.2); eligible paradigms included supervised/unsupervised/semi-supervised learning, deep learning/foundation models/LLMs, reinforcement learning, probabilistic models, and knowledge-based/expert systems [6, 7, 8, 59].

  c. Outcomes: rehabilitation-relevant endpoints (e.g., relapse/hospitalization, treatment adherence, functioning/participation, and social/role outcomes) or model performance explicitly tied to a rehabilitation management task (e.g., treatment adherence prediction that triggers case management).

  d. Designs: randomized controlled trials/quasi-experimental, prospective/retrospective observational, and model development/validation studies. Qualitative or mixed-methods implementation studies were eligible when AI functionality operated within a rehabilitation workflow; diagnostics-only designs were not eligible.

  e. Setting: community, home-based, supported accommodation, inpatient-to-community transition, or inpatient and digital health settings aligned with sustained rehabilitation care (e.g., inpatient data used to support post-discharge management or longitudinal relapse prevention).

Exclusion criteria

To focus specifically on rehabilitation, we excluded studies that met any of the following criteria:

  a. focused on diagnostics (e.g., screening, case finding, and differential diagnosis) or cross-sectional case–control classifiers (e.g., schizophrenia vs. healthy controls) without linkage to rehabilitation;

  b. addressed pathophysiology/biomarkers (e.g., discovery neuroimaging) or theoretical simulations without rehabilitative implications;

  c. evaluated acute-phase treatment only (e.g., pharmacologic or symptom-focused psychotherapy) without functional/community outcomes or explicit rehabilitation goals;

  d. were limited to custodial/forensic settings with no stated pathway to community living;

  e. relied exclusively on modalities infeasible for continuous community monitoring or at-home/routine deployment (e.g., fMRI-only and lab-grade electroencephalogram-only protocols); and

  f. were editorials, reviews, proposals, posters, conference abstracts, other non-original research, or non-English publications.

Operationalization for cross-sectional and classification studies

Given the prevalence of cross-sectional case–control designs (e.g., schizophrenia vs. healthy controls) in the AI literature, we established explicit operationalization criteria to assess whether such studies qualified as rehabilitation-oriented. This helped us distinguish diagnostic research from rehabilitation-applicable studies by addressing the inherent ambiguity of binary classification paradigms. All baseline eligibility requirements below had to be met.

  a. Confirmed diagnosis: used real-world data from individuals with clinician-confirmed schizophrenia spectrum disorders (per ICD/DSM or equivalent diagnostic criteria), excluding samples based solely on self-reported diagnoses or clinical high-risk populations.

  b. Community applicability: data collection methods were feasible for sustained use in community, home, or outpatient settings (e.g., smartphone sensors, wearables, speech/text, and EHR data), and thus did not rely exclusively on research-grade neuroimaging (e.g., fMRI) or laboratory-only modalities (e.g., research-grade electroencephalogram) without a plausible pathway to routine deployment.

  c. Beyond pure diagnostics: explicitly discussed or proposed rehabilitation management applications beyond solely reporting classification accuracy for “schizophrenia vs. healthy controls” discrimination.

At least one of the following rehabilitation-orientation signals needed to be present:

  a. Rehabilitation-anchored constructs: the model or features were explicitly linked to rehabilitation-relevant dimensions, enabling translation to management priorities. Examples include symptom scales (Brief Negative Symptom Scale/Clinical Assessment Interview for Negative Symptoms), social cognition measures, sleep/circadian patterns, functional/participation assessments (Personal and Social Performance/UCSD Performance-based Skills Assessment/Quality of Life Scale/WHO Disability Assessment Schedule), medication adherence or side effects, and/or safety/risk indicators.

  b. Change sensitivity or re-test evidence: presented evidence (even if preliminary) of response to intervention, pharmacological challenges, or repeated measurement, indicating potential utility for longitudinal monitoring or treatment-response tracking.

  c. Actionability and interpretability: the features or outputs had interpretable clinical meaning and could plausibly inform rehabilitation care actions (e.g., “elevated negative symptom indices → prompt follow-up, social work engagement, or behavioral activation”), even if decision thresholds were not yet quantified.

If at least one of the following was found in the re-review, the study was excluded:

  a. Diagnostics-only orientation: focused exclusively on diagnostic discrimination without establishing any rehabilitation-related linkage or management application.

  b. Insufficient real-world utility: external validity or applicability was prohibitively low (e.g., excessive false-positive rates and clearly non-deployable workflows), precluding feasible use in rehabilitation management.

  c. Non-compliant population or modality: primarily enrolled unconfirmed/self-disclosed cases or clinical high-risk-only samples or relied on data collection methods lacking community-setting feasibility.

Study selection

Zotero automatically filtered and removed duplicates from search results. Two independent reviewers (first and second authors) conducted title/abstract screening, followed by a full-text review of potentially eligible records. Disagreements were resolved through discussion, and unresolved cases were adjudicated by a third expert. Following the initial screening phases, all preliminarily eligible studies underwent a secondary operationalization review to ensure consistent application of the rehabilitation-oriented inclusion criteria, with cross-sectional or case–control designs subjected to stricter criteria (see Section 2.2.5). This secondary review was conducted in November 2025 in response to reviewer feedback, emphasizing clearer rehabilitation boundaries. Interrater agreement for study selection was substantial (Cohen’s κ = 0.78 for title/abstract screening; κ = 0.82 for full-text review; κ = 0.70 for the operationalization review; Fig. 1).
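The reported agreement statistics can be recomputed directly from the two reviewers' decision lists. A minimal sketch of Cohen's κ (the decision vectors below are hypothetical, not the review's actual data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater_a)
    # Observed agreement: share of records with matching decisions.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in ca | cb)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical include/exclude decisions for eight screened records.
a = ["inc", "inc", "exc", "exc", "inc", "exc", "exc", "exc"]
b = ["inc", "inc", "exc", "exc", "exc", "exc", "exc", "inc"]
kappa = cohens_kappa(a, b)
```

Because κ corrects observed agreement for the agreement expected by chance, it is more conservative than raw percent agreement, which is why values in the 0.61–0.80 range are conventionally read as "substantial."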

Fig. 1: PRISMA 2020 flow diagram for study selection.

Two database search rounds (Round 1: January 2012–January 2025; Round 2: January–October 2025) yielded 627 unique records after de-duplication. Across title/abstract screening and full-text review, 520 records were excluded because they were non-original or non-empirical publications, out-of-scope in terms of population or AI use, or did not address rehabilitation-oriented management. The remaining 107 reports then underwent a secondary operationalization review focusing on rehabilitation-oriented criteria (Methods 2.2.5), leading to the exclusion of 24 reports and a final cohort of 83 studies.

Data extraction

Two reviewers (first and second authors) independently extracted data using a standardized Microsoft Excel template. A pilot extraction of 20 articles refined the procedure and resolved discrepancies. The extracted information comprised the following: (1) bibliographic details (first author, year, country/region, and World Bank income level); (2) population and study design (target condition and phase, center structure [single-center, multicenter, or nationwide/healthcare system], setting, sample size and composition, and observation window or follow-up); (3) task specification (concise task phrase and task family of classification/regression/sequence/time-to-event) and rehabilitation domains using the task–domains framework (domain labels were drawn from the seven rehabilitation domains in Section 2.2.2); (4) technology paradigm (feature engineering-driven supervised learning; sequence and event-time modeling; representation learning and multimodal deep learning; prescriptive policy learning), recording the model used for primary inferences when multiple were compared; (5) data sources (modalities and whether passively or actively collected) and engagement pattern (passive sensing, nudge, conversational, or none); (6) outcome definition (proxy vs. clinical/functional endpoints) and time horizon (a concrete duration or an explicit window such as same-visit, short, mid, mid-to-long, and long term); (7) performance and outcomes captured in a task-aware manner, including classification metrics (area under the receiver operating characteristic curve [AUC], accuracy, sensitivity/specificity, and, where reported, precision/recall), regression metrics (mean absolute error and root mean squared error), time-to-event metrics (concordance indices, also known as C-index, or time-dependent AUC), early warning metrics (e.g., sensitivity and specificity at pre-specified prediction horizons), and task-appropriate metrics for prescriptive/just-in-time adaptive interventions, reinforcement-learning systems, or LLM-guided interventions, which were summarized narratively owing to heterogeneous definitions; (8) validation, interpretability, and implementation signals, including validation level (cross-validation, hold-out, and external), calibration and/or uncertainty reporting (yes/no), interpretability class (feature-level, local-explanation, rule-based, or none), closed-loop action (yes/no) with an action-delivery label distinguishing recognition-only systems from those that directly triggered patient- or clinician-facing support or training, safety guardrails for deployment or LLM/reinforcement-learning use (yes/no), and supplementary quality indicators where available (e.g., randomized controlled evaluations, clinician benchmarking, patient user testing, or fairness and algorithmic-bias assessments); (9) a data pre-processing and feature engineering summary sufficient for reproducibility and interpretation (e.g., aggregation windows, selection procedures such as mRMR or embedded regularization, top-k important features, human-readable rules, or learned policy tables); and (10) for cross-sectional or baseline proof-of-concept studies, an explicit justification of rehabilitation relevance aligned with the operationalization criteria (see Section 2.2.5).

For each study and task family, when multiple models, thresholds, time points, or subscales were reported, we abstracted all available performance metrics but designated a single prespecified “primary” estimate for cross-study descriptive summaries, prioritizing held-out or external test performance on the primary endpoint. Metrics were summarized in a task-aware fashion (i.e., classification, regression, sequence/time-to-event, early warning, and prescriptive tasks were not pooled across task families), and medians and interquartile ranges were computed for homogeneous metric families (e.g., AUC, accuracy, sensitivity/specificity, mean absolute error, root mean squared error, and R²). Metrics expressed on different scales (e.g., percentage root mean squared error on bounded ecological momentary assessment scales) were reported narratively, but were not included in pooled medians; for early warning models, sensitivity/specificity summaries were restricted to studies that reported both.
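The task-aware pooling described above can be sketched with the standard library alone; the metric values below are hypothetical, and pooling is confined to homogeneous metric families, never crossing task families:

```python
import statistics

def median_iqr(values):
    """Median and interquartile range (inclusive quartile method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"median": statistics.median(values), "q1": q1, "q3": q3}

# Hypothetical primary estimates, kept within homogeneous metric families.
primary_metrics = {
    "classification_auc": [0.72, 0.78, 0.80, 0.85, 0.90],
    "regression_mae": [0.8, 1.0, 1.2],
}
summary = {family: median_iqr(vals) for family, vals in primary_metrics.items()}
```

Keeping each family in its own list enforces the rule that, for example, AUCs and mean absolute errors are never combined into one pooled median.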

To ensure comparability, we grouped methods into four technology paradigms. First, feature engineering-driven supervised learning (typically static classification/regression), such as handcrafted or statistical features with logistic regression, support vector machines, random forests, and other tree-based models. Second, sequence and event-time modeling, that is, models that make explicit use of temporal order or survival time, such as hidden Markov models, recurrent neural networks, temporal convolutional networks, and time-series transformers (TS-Transformers), as well as Cox proportional hazards models, random survival forests, and deep survival models. Third, representation learning and multimodal deep learning, including self-supervised/contrastive pretraining and multimodal fusion across speech/text/sensing/electronic medical records. Fourth, prescriptive policy learning, ranging from prediction to action and including contextual bandits, reinforcement learning, and dynamic treatment regimes with offline counterfactual evaluation (e.g., inverse propensity scoring, doubly robust estimation, or fitted Q-evaluation). All extracted data were systematically organized according to AI model type and rehabilitation domain (Table 1). Discrepancies were resolved through discussion, and unresolved cases were adjudicated by a third expert.

Table 1 AI for Schizophrenia Rehabilitation (2012–2025): Tasks, Data, Models, Performance, and Implementation Readiness.

Results

Study selection

The systematic search and selection process is illustrated in Fig. 1. Two search rounds yielded a combined total of 627 records after deduplication (Round 1, 561 records from January 2012 to January 2025; Round 2, 66 records from January to October 2025). In Round 1, following title/abstract screening and full-text review of 561 records, 89 studies met the eligibility criteria. The 472 excluded records were ineligible for the following reasons: (i) diagnostics-only focus without rehabilitation linkage (e.g., case–control classifiers for schizophrenia vs. healthy controls discrimination; 198 studies, 41.9%); (ii) acute-phase treatment without functional/community outcomes (96 studies, 20.3%); (iii) pathophysiology/biomarker discovery without rehabilitation management implications (74 studies, 15.7%); (iv) non-original research (i.e., reviews, editorials, protocols, and conference abstracts; 58 studies, 12.3%); (v) exclusive reliance on non-deployable modalities (e.g., fMRI-only or research-grade electroencephalogram data; 31 studies, 6.6%); and (vi) custodial/forensic settings without community pathways (15 studies, 3.2%).

In Round 2, 66 additional records were identified. Following the same screening process as in Round 1, 18 studies met the eligibility criteria. The 48 excluded records showed a distribution similar to that in Round 1, as follows: diagnostics-only focus (19 studies, 39.6%), acute-phase treatment only (10 studies, 20.8%), pathophysiology/biomarkers (8 studies, 16.7%), non-original research (6 studies, 12.5%), non-deployable modalities (3 studies, 6.3%), and custodial/forensic settings (2 studies, 4.2%).

All 107 studies (89 from Round 1 and 18 from Round 2) underwent an additional operationalization review. Among these, 42 studies employed cross-sectional or case–control designs requiring stricter operationalization criteria, and 24 were excluded because they (i) lacked confirmed clinical diagnoses or community-deployable methods (9 studies, 37.5%) or (ii) demonstrated no rehabilitation-orientation signals despite initial inclusion (15 studies, 62.5%). This secondary review, conducted in November 2025, resulted in a final cohort of 83 studies spanning January 2012 to October 2025.

Study characteristics

The final cohort comprised 83 studies published between 2012 and October 2025 (Fig. 2). Publication trends demonstrate a marked acceleration in recent years, as early studies from 2012 to 2019 represented only 9.6% (8/83) of the corpus, whereas 2020 to 2023 accounted for 49.4% (41/83), and 2024 to October 2025 contributed 41.0% (34/83).

Fig. 2: Publication trends by year.

Annual counts of included studies (N = 83) from 2012 to October 2025. Bars show annual counts with numeric labels; the superimposed line depicts the temporal trend. Data for 2025 include publications through October 31, 2025.

Studies originated predominantly from high-income countries (Fig. 3). The United States of America contributed the largest share (31/83 studies, 37.3%), followed by China (including Taiwan and Hong Kong; 10/83, 12.0%), and the United Kingdom (5/83, 6.0%). Italy also accounted for 5/83 studies (6.0%). Additional contributions were from South Korea (4/83, 4.8%), France (4/83, 4.8%), Canada (3/83, 3.6%), Spain (3/83, 3.6%), the Netherlands (3/83, 3.6%), Germany (2/83, 2.4%), Poland (2/83, 2.4%), Greece (2/83, 2.4%), and Singapore (2/83, 2.4%). Single-country contributions were observed for Japan, Turkey, India, and Denmark (one study each), and three multicountry or regional trials were coded as International or European consortia.

Fig. 3: Global distribution.

Bubble map of included studies by primary country (N = 83). Circle size and the numeric label denote the number of studies. The China group includes mainland China (n = 4), Taiwan (n = 5), and Hong Kong SAR (n = 1). Multinational or regional consortia (International/Europe, n = 3) are not assigned to a single country on the map.

Most studies were conducted in community or outpatient settings (57 studies, 68.7%), with smaller proportions in inpatient settings (10 studies, 12.0%), mixed settings leveraging nationwide claims or health system data (4 studies, 4.8%), or settings not clearly reported (12 studies, 14.5%). Approximately half of the studies employed multicenter designs (42 studies, 50.6%), with single-center studies accounting for 47.0% (39 studies) and nationwide or health-system analyses for 2.4% (2 studies).

Population and sample size

All 83 studies provided sample size information (range, 5–87 182 participants; median, 160 participants). Regarding sample size distribution, 4.8% (4 studies) enrolled fewer than 20 participants, 33.7% (28 studies) enrolled 20–100 participants, 38.6% (32 studies) enrolled 101–500 participants, 7.2% (6 studies) enrolled 501–1000 participants, and 15.7% (13 studies) enrolled more than 1000 participants (Fig. 4). Studies with smaller sample sizes (<100 participants) accounted for 38.6% (32 studies) of the corpus, whereas those with 100 or more participants represented 61.4% (51 studies).

Fig. 4: Sample size distribution.

Sample sizes ranged from 5 to 87 182 (median 160). Overall, 32/83 (38.6%) studies enrolled <100 participants and 51/83 (61.4%) enrolled ≥100, with darker shades indicating larger sample-size categories.

Most studies focused on patients with schizophrenia in a clinically stable or chronic phase (45 studies, 54.2%), followed by mixed populations or other diagnostic categories (13 studies, 15.7%), acute inpatients or hospitalized patients (10 studies, 12.0%), patients with first-episode or early psychosis (9 studies, 10.8%), recently discharged patients (5 studies, 6.0%), and treatment-resistant schizophrenia (1 study, 1.2%). Twenty studies (24.1%) included healthy or matched control groups, whereas the remaining 63 (75.9%) exclusively examined patient populations. Several studies incorporated individuals with schizoaffective disorder, bipolar disorder with psychotic features, or broader serious mental illness categories alongside patients with schizophrenia (Fig. 5).

Fig. 5: Clinical population categories.

Categories (mutually exclusive) are: stable/chronic schizophrenia (45, 54.2%), mixed or other diagnoses (13, 15.7%), acute inpatients/hospitalized (10, 12.0%), first‑episode/early psychosis (9, 10.8%), recently discharged (5, 6.0%), and treatment‑resistant schizophrenia (1, 1.2%). Abbreviation: TRS, treatment‑resistant schizophrenia.

Among the 83 studies, 46 (55.4%) were longitudinal studies with specified follow-up or repeated monitoring, while 37 (44.6%) employed cross-sectional or single time-point assessments. For longitudinal studies that explicitly reported follow-up durations (42 studies), follow-up lengths varied considerably as follows: 1 study (2.4%) had a follow-up period of less than 1 week, 4 (9.5%) ranged from 1 week to 1 month, 7 (16.7%) ranged from 1 to 3 months, 7 (16.7%) ranged from 3 to 6 months, 16 (38.1%) ranged from 6 to 12 months, and 7 (16.7%) extended beyond one year (up to 12–17 years).

Data sources and user engagement patterns

The 83 included studies used diverse data-collection methodologies. Active data collection was predominant (56.6%; 47 studies), acquiring data through structured clinical interviews, standardized symptom scales, cognitive task assessments, or patient self-reports. Passive collection comprised 38.6% (32 studies), leveraging sensor-based devices, electronic health record (EHR) systems, or social media platforms to capture patient behavioral data. Moreover, 4.8% (4 studies) employed combined approaches integrating both active and passive collection methods to achieve data complementarity.

Regarding user engagement, most studies adopted no-engagement designs (68.7%; 57 studies), wherein data were collected without providing real-time feedback or interventions to patients. Passive sensing constituted 21.7% (18 studies), continuously monitoring patients (e.g., physiological indicators, activity patterns, and behavioral characteristics) via smartphones/wearables. Conversational engagement (e.g., natural language-processing-driven virtual assistants or therapeutic dialogue systems) and nudge-based engagement (e.g., medication reminders or symptom self-assessment prompts through mobile applications) each accounted for 4.8% (4 studies).

Regarding data modality, speech and text data were the most prevalent (22.9%; 19 studies), comprising clinical interview transcripts and voice recordings analyzed with natural language-processing techniques. EHRs served as data sources in 14 studies (16.9%), encompassing structured diagnostic codes, prescription information, and unstructured clinical narrative notes. Smartphone-based multimodal sensing ranked next, with 12 studies (14.5%) capturing patients’ mobility trajectories, social interactions, sleep patterns, and screen use behaviors. Wearable device data were relatively scarce, adopted by four studies (4.8%), including wrist-worn accelerometers, smartwatches, or heart-rate monitoring devices.

Outcome measures and temporal horizons

The included studies demonstrated marked heterogeneity in outcome selection and temporal horizons. Proxy endpoints predominated across the literature (e.g., diagnostic classification accuracy, symptom scale scores, medication adherence rates, treatment-response indicators, and social functioning assessments), appearing in 67 studies (80.7%), whereas clinical endpoints (e.g., relapse events, hospital readmissions, symptomatic remission, functional remission, and long-term mortality) were evaluated in 21 studies (25.3%). More specifically, 62 studies (74.7%) exclusively employed proxy endpoints, 16 studies (19.3%) focused solely on clinical endpoints, and 5 studies (6.0%) incorporated both types.

Regarding temporal horizons, concurrent models utilizing data from a single assessment time point represented the most common approach, accounting for 34 studies (41.0%). Short-term investigations of up to three months were employed in 24 studies (28.9%), typically targeting symptom fluctuations, early relapse detection, or medication adherence monitoring. Medium-term investigations spanning 3–12 months represented 16 studies (19.3%), focusing on treatment-response trajectories, functional outcomes, and sustained adherence patterns. Long-term investigations extending beyond 12 months comprised 9 studies (10.8%), addressing outcomes such as multi-year relapse risk, treatment-resistance development, mortality prediction, and chronic disease incidence.

Application domains and task landscape

Across the five rehabilitation management domains that appeared in the included studies, symptom monitoring emerged as the predominant application area (Fig. 6), encompassing 48 studies (57.8%). Symptom monitoring tasks clustered into seven distinct task categories, as follows: diagnostic classification (9 studies) leveraged speech, language, or multimodal features to distinguish patients with schizophrenia from healthy controls [94,95,96,97,98,99,100,101,102]; symptom scale prediction (14 studies) employed machine learning to estimate Positive and Negative Syndrome Scale, Brief Psychiatric Rating Scale, or ecological momentary assessment scores [102,103,104,105,106,107,108,109,110,111,112,113,114,115]; negative symptom quantification (4 studies) automated the assessment of blunted affect, alogia, anhedonia, avolition, and asociality using wearable sensors or speech analysis [116,117,118,119]; cognitive function evaluation (5 studies) detected formal thought disorder or predicted memory performance [120,121,122,123,124]; social functioning assessment (4 studies) utilized smartphone GPS, passive sensing, or facial affect recognition to estimate social isolation, loneliness, and interpersonal competence [125,126,127,128]; quality of life prediction (2 studies) estimated subjective well-being or functional outcomes [129, 130]; and clinical phenotyping (7 studies) delineated prognostic subgroups or disease stages, including subtype classification [99, 124, 131,132,133,134,135]. Task categories are not mutually exclusive, and thus the counts may sum to more than the number of studies per domain.

Fig. 6: Domains and task categories of AI applications in schizophrenia rehabilitation.

(a) Distribution of the 83 included studies across rehabilitation management domains: symptom monitoring (48 studies), medication management (19), risk management (16), functional training (1), and psychosocial support (3). (b) Symptom‑monitoring task categories among the 48 studies in this domain: diagnostic classification (9 studies), symptom scale prediction (14), negative symptom quantification (4), cognitive function evaluation (5), social functioning assessment (4), quality‑of‑life prediction (2), and clinical phenotyping (7). (c) Medication‑management task categories among 19 studies: adherence monitoring and prediction (7 studies), treatment response and resistance stratification (8), dosage optimization and toxicity prediction (2), pharmacovigilance for non‑psychiatric adverse events (2), and individualized drug selection (1). (d) Risk‑management task categories among 16 studies: relapse prediction (9 studies), hospitalization risk assessment (3), violence‑related classification (3), comorbidity risk prediction (1), and mortality prediction (1). Bars represent the number of studies per domain or task category; domains and task categories are not mutually exclusive, and individual studies can contribute to more than one category.

Medication management constituted the second-largest domain with 19 studies (22.9%), spanning five core tasks: adherence monitoring and prediction (7 studies) used smartphone-based visual verification, pharmacokinetic modeling, or claims data to forecast treatment continuation [136,137,138,139,140,141,142]; treatment response and resistance stratification (8 studies) predicted symptomatic remission, treatment-resistant schizophrenia status, or clozapine responsiveness [143,144,145,146,147,148,149,150]; dosage optimization and toxicity prediction (2 studies) recommended therapeutic dose ranges or forecasted adverse metabolic effects [151, 152]; pharmacovigilance for non-psychiatric adverse events (2 studies) included monitoring prolactin elevation and medication-sequence–linked hospitalization risks [152, 153]; and individualized drug selection (1 study) generated personalized treatment rules based on baseline characteristics [154].

Risk management applications appeared in 16 studies (19.3%), comprising five task categories: relapse prediction (9 studies) developed early warning systems for psychotic exacerbation with prediction windows ranging from one week to two years using digital phenotyping, Internet search behavior, or smartphone passive sensing [155,156,157,158,159,160,161,162,163]; hospitalization risk assessment (3 studies) forecasted readmissions or prolonged inpatient stays [156, 164, 165]; violence-related classification (3 studies) covered aggression-risk prediction or victimization event detection [166,167,168]; comorbidity risk prediction (1 study) estimated type 2 diabetes onset [169]; and mortality prediction (1 study) modeled all-cause death using EHR data [170].

For functional training, only one study (1.2%) identified response trajectories to social cognition training and predicted individualized treatment benefits [171]. Psychosocial support interventions comprised three studies (3.6%): one analyzed therapeutic dialogue patterns in virtual-reality avatar therapy [172], one predicted optimal referral pathways to cognitive behavioral therapy or vocational training [173], and one provided policy recommendation prototypes using offline reinforcement learning [174].

Technological approaches and model architectures

The included studies employed four primary technological paradigms: feature engineering-driven supervised learning (53 studies, 63.9%), representation learning-driven modeling (20 studies, 24.1%), sequence and event-time modeling (7 studies, 8.4%), and prescriptive policy learning (3 studies, 3.6%).

For feature engineering-driven supervised learning, random forest was the most frequently adopted algorithm (24 studies), often used for intrinsic feature-importance profiling and ensemble-based generalization [95, 98,99,100,101, 103, 117, 123, 126, 127, 129, 130, 140, 143, 146, 149, 150, 159, 161, 164, 166, 168, 171, 173]. Gradient boosting variants (e.g., XGBoost and gradient boosting machines; 18 studies) were commonly applied to structured tabular data and high-dimensional feature spaces [103, 106, 108, 117, 122, 123, 126, 130, 139, 140, 143, 145, 149, 150, 152, 168, 169, 173]. Support vector machines (18 studies) were usually applied in small-sample settings and frequently used for speech-acoustic classification, but also appeared in higher-dimensional risk-prediction pipelines [94, 96, 104, 105, 109, 110, 112, 123, 127, 134, 136, 146, 147, 159, 163, 164, 166, 168]. Logistic regression (15 studies) was often used for clinical nomogram construction or as a baseline comparator [96, 123, 130, 139,140,141, 146, 148,149,150, 162, 166, 168, 169, 171]. Regularization techniques, such as least absolute shrinkage and selection operator/elastic net (12 studies), were implemented to select high-dimensional predictors and mitigate overfitting [109, 112, 116, 117, 128, 141,142,143, 150, 166, 168, 169].

Among representation learning methods, transformer architectures (4 studies; e.g., BERT/BioBERT and Whisper) were used to process clinical narrative text, therapy-dialogue content, and automatic speech recognition outputs [120, 165, 167, 175]. Convolutional neural networks (4 studies) were applied to model visual inputs for medication adherence verification, painting-based symptom assessment, and accelerometry-based human-activity recognition [102, 107, 118, 137]. Recurrent architectures (3 studies; e.g., long short-term memory/gated recurrent unit/vanilla recurrent neural networks) were used to capture temporal dependencies in smartphone sensor streams, ecological momentary assessment trajectories, and multimodal relapse predictions [107, 113, 157]. Autoencoder frameworks (3 studies) were used for unsupervised anomaly detection in relapse early warning systems and for dimensionality reduction in mortality risk modeling [155, 157, 170]. Two studies reported large language model (LLM)-augmented pipelines for zero-shot symptom severity scoring or feature extraction from unstructured EHRs [119, 165].

In sequence and event-time modeling, hidden Markov models (1 study) were used to identify latent symptom state transitions from ecological momentary assessment sequences [132]. Cox proportional-hazards regression and random survival forests (1 study) were applied to model time-to-relapse following medication discontinuation [158]. Autoregressive integrated moving average (ARIMA) models and Gaussian-process anomaly detection (1 study) were implemented to model irregular temporal patterns in relapse prediction systems [156]. Trajectory clustering with fuzzy methods (1 study) was used to stratify first-episode psychosis patients into prognostic phenotypes [133]. Recurrent networks with long short-term memory or gated recurrent unit cells (3 studies) were used to forecast multi-day mental state fluctuations from digital phenotyping data [114, 115, 153].

For prescriptive policy learning, one study applied targeted minimum loss-based individualized treatment rules to recommend optimal antipsychotic selection using baseline clinical features [154]. Two studies deployed offline reinforcement learning (i.e., batch-constrained Q-learning and deep deterministic policy gradient algorithms) for psychotherapy strategy recommendations and simulated inner speech training policies in cognitive remediation contexts [124, 174].

Model performance and predictive efficacy

Model performance metrics varied substantially across task categories. To avoid inappropriate cross-domain comparisons, metrics are reported separately for classification, regression, event-time, and early warning task applications. For classification tasks, 38 studies reported AUC metrics [94,95,96,97,98,99,100,101, 104, 107, 112,113,114, 117, 120, 123, 127, 130, 139,140,141, 143,144,145,146,147,148,149, 153, 155, 159, 164,165,166, 168,169,170, 173], the median of which was 0.79 (interquartile range [IQR]: 0.71–0.86) with a range of 0.59–1.00. The median accuracy was 79.0% (IQR: 66.2–86.9%), with a range of 31.4–99.0%. Four symptom monitoring studies achieved AUC ≥ 0.90, including schizophrenia vs. healthy control discrimination (AUC = 0.99) [94], negative symptom severity classification (AUC = 1.00) [104], diagnostic classification using symptom subtyping (AUC = 0.92) [99], and schizophrenia classification using temporal features (AUC = 0.95) [101]. These models typically drew on feature engineering from speech acoustics or multimodal behavioral markers; in risk-management applications, deep neural architectures with self-attention also achieved AUC ≈ 0.90 (e.g., long-stay hospitalization prediction [165]).

Regarding regression tasks, studies predicted continuous clinical scale scores, symptom trajectories, or functional outcomes using diverse error metrics. Among studies reporting mean absolute error [103, 106, 108, 126, 152, 157], the median was 2.17 (range, 0.05–7.79) across different measurement scales, including Brief Psychiatric Rating Scale subscales, social functioning dimensions, and prolactin concentrations. Across five studies that reported absolute root mean squared error values for clinical scales [102, 110, 129, 151, 152], the median root mean squared error was 13.30 (range, 0.06–85.23) across quality of life indices, Positive and Negative Syndrome Scale total scores, and pharmacokinetic predictions; an additional study reported a relative root mean squared error of 12% on 0–3 ecological momentary assessment symptom scales [109]. The median R² was 0.63 (range, 0.14–0.92) [107, 122, 128, 151], reaching 0.92 in clozapine pharmacokinetic dose concentration modeling [151] and 0.74 in symptom severity prediction from multimodal wearable data streams [107]. Pearson correlation coefficients for symptom scale predictions were generally moderate to high, often approximately 0.4–0.9 [95, 96, 103, 105, 106, 121, 122, 126, 176], with some Positive and Negative Syndrome Scale reconstruction models achieving very high correlations (up to r ≈ 0.99) [111]. These metrics span heterogeneous scales, and the counts reflect only the studies that reported each metric; they should therefore be interpreted with caution.
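The regression metrics compared in this paragraph reduce to a few formulas; the following sketch computes mean absolute error, root mean squared error, R², and Pearson's r on toy symptom-scale values (all numbers illustrative, not from the reviewed studies).

```python
# Minimal stdlib implementation of the regression metrics discussed above;
# toy values only, for illustration.
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [p - y for y, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - ss_res / ss_tot  # coefficient of determination
    mean_p = sum(y_pred) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(y_true, y_pred))
    sd_y = math.sqrt(ss_tot)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    r = cov / (sd_y * sd_p)  # Pearson correlation
    return mae, rmse, r2, r

# Hypothetical observed vs. predicted symptom-scale scores.
print(regression_metrics([10, 20, 30, 40], [12, 18, 33, 41]))
```

Because MAE and RMSE are expressed in the units of the underlying scale, medians pooled across heterogeneous scales (as in the studies above) are descriptive summaries rather than directly comparable quantities.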

For event-time modeling, two studies reported C-indices of 0.71–0.78, covering post-discontinuation relapse [158] and all-cause mortality prediction [170]. In both studies, event-time models outperformed baseline-only comparators; for instance, in Brandt et al. [158], the C-index improved from 0.60 for baseline-only covariates to 0.70–0.71 for regularized Cox and random survival forest models.
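The C-index reported by these studies is the fraction of comparable subject pairs that the model ranks correctly, i.e., the subject who experiences the event earlier receives the higher risk score. A minimal implementation of Harrell's concordance index, with illustrative variable names and toy data, is:

```python
# Harrell's concordance index (C-index) for event-time models; variable
# names and data are illustrative, not from the reviewed studies.

def c_index(times, events, risks):
    """times  - observed follow-up times
    events - 1 if the event occurred, 0 if censored
    risks  - model risk scores (higher = earlier expected event)"""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i's event precedes time j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties count half
    return concordant / comparable

# Perfectly ranked toy data: earlier events receive higher risk scores.
print(c_index([2, 4, 6], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the 0.71–0.78 values above represent moderate but clinically usable discrimination.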

Among early warning systems, six studies implemented relapse early warning models (mostly evaluated offline/retrospectively) with prediction horizons ranging from 1 week to 30 days [155, 157, 159, 161,162,163]. The median sensitivity was 31.5% (range, 0.6–66.2%), and the median specificity was 88.0% (range, 71.0–99.7%). One system achieved 66.2% recall at 6.3% precision using balanced random forests on smartphone-sensor clusters [161]. Another attained 99.7% specificity with 0.6% sensitivity via one-class support vector machines [162]. Anomaly-rate increases of approximately 108% [157] and 112% (×2.12) [162] were observed in pre-relapse windows. Prediction windows of three to four weeks were most common (overall range, 1–30 days).
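The combination of moderate sensitivity, high specificity, and very low precision reported for these systems follows directly from confusion-matrix arithmetic at low event prevalence. The sketch below uses hypothetical counts, chosen only to illustrate the mechanism, not figures from the cited studies.

```python
# Operating-point arithmetic for an early warning system; all counts are
# hypothetical, chosen to illustrate how low event prevalence drives
# precision down even at high specificity.

def operating_point(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # recall: pre-relapse windows caught
    specificity = tn / (tn + fp)   # negative windows correctly passed
    precision = tp / (tp + fp)     # alerts that are true pre-relapse windows
    return sensitivity, specificity, precision

# Suppose 2% of monitoring windows precede a relapse (20 of 1000); the model
# flags 13 of them but also 6% of the 980 negative windows.
sens, spec, prec = operating_point(tp=13, fn=7, fp=59, tn=921)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} precision={prec:.2f}")
```

Even at 94% specificity, the false positives from the large negative class swamp the true positives, yielding precision below 20%; this is the arithmetic reason such models are better treated as triage signals than as standalone decision gates.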

Validation rigor, interpretability, and implementation readiness

Regarding validation protocols, most studies relied on cross-validation (e.g., k-fold, leave-one-subject-out, and Monte Carlo), and a subset used hold-out splits. Four studies reported external or cross-dataset evaluations, including independent cohort or cross-trial datasets and leave-one-site-out or temporal holdout designs [112, 114, 131, 150]. One study achieved 68.0% balanced accuracy on external-validation data spanning three independent trials [150]. For calibration and uncertainty quantification, five studies reported some form of probability calibration or predictive uncertainty handling using Monte Carlo dropout [113], fuzzy-logic confidence stratification for uncertainty-aware decisions [114], and Brier scores and/or calibration plots, sometimes combined with bootstrap internal validation [140, 141, 144]. Most other studies provided no such information.
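The calibration quantities mentioned here (Brier scores and calibration plots) reduce to simple computations; the following sketch, on toy predictions with illustrative values, shows the Brier score and the binned predicted-vs-observed summary underlying a reliability (calibration) plot.

```python
# Brier score and binned reliability summary; data are toy values.

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def reliability_bins(y_true, y_prob, n_bins=5):
    """Mean predicted probability vs. observed event rate per probability
    bin: the raw material of a calibration plot."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((y, p))
    out = []
    for b in bins:
        if b:
            obs = sum(y for y, _ in b) / len(b)
            pred = sum(p for _, p in b) / len(b)
            out.append((pred, obs))
    return out

y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(brier_score(y_true, y_prob))
print(reliability_bins(y_true, y_prob))
```

A lower Brier score indicates jointly better discrimination and calibration; a well-calibrated model produces bin pairs lying near the diagonal (predicted ≈ observed).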

Regarding interpretability, such mechanisms were relatively common; the most frequently reported were feature-level approaches (e.g., random forest importance, Shapley additive explanations, permutation importance, and least absolute shrinkage and selection operator coefficients). Local or case-level explanation methods (5 studies) provided instance-specific rationales using Shapley additive explanations, counterfactuals, policy-trajectory visualizations, or LLM-generated justifications [119, 145, 153, 170, 174]. Rule-based interpretability (3 studies) employed decision trees or fuzzy-logic rule sets [97, 114, 160]. A subset of studies (14/83, 16.9%), often applying deep or computer-vision models, reported no explicit interpretability mechanisms [113, 115, 118, 120, 125, 138, 139, 151, 162, 164, 167, 169, 175, 176].

For closed-loop implementation, three studies documented systems wherein AI predictions triggered direct clinical actions: weekly symptom forecasts that automatically triggered clinical outreach [106]; a randomized controlled trial in which AI-based adherence verification with real-time alerts improved adherence rates (94.7 vs. 64.4%; p < 0.001) and symptom outcomes [138]; and a partial closed loop in which computer-vision-flagged medication behaviors prompted counselor-mediated interventions [137]. Most studies operated in recognition-only mode, generating predictions without automated action pathways.

Regarding safety guardrails and quality signals, none of the studies employing reinforcement learning or LLMs reported safety constraints [119, 174]. Supplementary quality signals appeared in one randomized controlled trial [138], one clinician benchmark (n = 24 raters) [113], one user-testing study (n = 7) [163], and one algorithmic-bias probe across demographic subgroups [165].

Discussion

This systematic scoping review adopted a rehabilitation- rather than diagnosis-centered approach, focusing on the actionable value chain (i.e., from monitoring to decision support, intervention, follow-up, and audit) of AI in community and long-term schizophrenia rehabilitation management settings. This value chain framework reflects established measurement-based care principles [14, 15] and implementation science models for digital mental health [177, 178], wherein continuous monitoring informs clinical decisions, triggers timely interventions, enables systematic follow-up, and supports quality auditing cycles. Notably, the publication volume in this area has increased steeply in recent years, underscoring both the timeliness of this evidence base and the immaturity of its implementation layer. We also explicitly delineated the boundary between “rehabilitation” and “pure monitoring/prediction” in our methods, including only studies in which AI functions demonstrated a clear pathway to rehabilitation goals (e.g., functional improvement, relapse prevention, medication management, or social participation).

Based on the 83 included studies published between 2012 and October 2025, the AI literature for schizophrenia rehabilitation management appears to be developing rapidly, as more than 90% of the studies were published from 2020 onwards, yet its implementation layer remains immature. Most studies engaged in symptom monitoring (57.8%), medication management (22.9%), and risk management (19.3%), while there was a notable scarcity of studies focused on functional training and psychosocial support (i.e., the areas most proximal to rehabilitation outcomes; 1.2 and 3.6%, respectively). The evidence structure likewise skewed toward “identification and characterization”: surrogate endpoints dominated (67/83, 80.7%), external validation was rare (4/83, 4.8%), calibration and uncertainty reporting were insufficient (5/83, 6.0%), and closed-loop implementation was uncommon (3/83, 3.6%). Methodologically, active data collection predominated, yet 68.7% of systems adopted a “no-engagement” design without real-time feedback or intervention. Conversational and nudge-based systems together accounted for <10% of the corpus, and speech/text, EHR, and smartphone sensing were the dominant data modalities, with wearable-only systems remaining uncommon. Most systems thus remain limited to discrimination, and a critical transition toward executable, auditable, and sustainable schizophrenia rehabilitation closed loops is still required.

These application gaps reflect the bottleneck effect of the rehabilitation value chain. Functional training and psychosocial support studies require long-term, repeated, and contextualized measurement of behavioral change with actionable labels [84, 179], as these domains rely on high-quality process data and granular task decomposition. Given such implementation complexities, it may be unsurprising that both research categories are markedly underrepresented in the current ecosystem. In the mental health literature, cross-diagnostic digital interventions and just-in-time adaptive interventions provide methodological inspiration for “moving from identification to action” [180,181,182]. Based on our findings, we suggest that translating the current evidence into stable benefits within schizophrenia contexts will require reconstructing the data and intervention units around rehabilitation goals, ensuring that algorithmic outputs correspond one-to-one with executable action scripts [183, 184].

At the aggregate performance level, and without conflating tasks, classification models yielded an overall median AUC of 0.79 and accuracy of 79%, with a minority of symptom monitoring studies (i.e., predominantly relying on acoustic voice features, multimodal behavioral markers, or self-attention architectures) achieving AUC ≥ 0.90. Relapse prediction models exhibited a typical profile of low sensitivity and high specificity (median sensitivity, 31.5%; specificity, 88%), suggesting that they are better suited as upstream triage signals than as standalone decision gates. Two studies showed approximately doubled anomaly rates within the prediction windows [157, 162], although overall capture rates remained limited. For schizophrenia rehabilitation clinical practice, the significance of performance metrics hinges on whether they can deliver quantifiable data to promote early engagement, reduce relapse, and enhance participation [185]. Therefore, subsequent research should link surrogate endpoints with clinical endpoints (e.g., relapse, rehospitalization, functioning, and quality of life) and employ decision curve analysis to bind prediction thresholds to specific actions and resource allocation [186, 187]. These research efforts may help translate model optimization into real-world outcome improvements.
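Decision curve analysis, recommended here for binding prediction thresholds to actions, rests on the net-benefit formula NB = TP/n - (FP/n) * pt/(1 - pt), where pt is the threshold probability at which a clinician would act. A toy sketch with hypothetical data:

```python
# Net-benefit calculation underlying decision curve analysis; data and
# threshold are hypothetical, for illustration only.

def net_benefit(y_true, y_prob, pt):
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    # False positives are discounted by the odds of the action threshold
    return tp / n - (fp / n) * pt / (1 - pt)

def treat_all_benefit(y_true, pt):
    """Reference strategy: act on every patient regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)

y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.1]
pt = 0.5  # act when predicted relapse risk exceeds 50%
print(net_benefit(y_true, y_prob, pt), treat_all_benefit(y_true, pt))
```

A model adds clinical value at a given threshold only when its net benefit exceeds both the treat-all and treat-none (NB = 0) reference strategies, which is what ties the statistical threshold to a concrete resource-allocation decision.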

Regarding methodological maturity, most studies employed cross-validation or internal holdout, whereas few provided external or cross-dataset validation, reported calibration and uncertainty, conducted user studies, or implemented closed loops. For risk communication and action thresholds, discrimination is merely the starting point: calibration determines the credibility of communicated risks, and uncertainty presentation pinpoints when to trigger human review [188, 189]. Importantly, distributional drift and subgroup disparities may rapidly erode effectiveness across disease stages and service contexts [190, 191]. Therefore, external validation, calibration curves/Brier scores, confidence intervals, and subgroup robustness should be routinely reported in future studies. At the design level, systems should embed “abstain/requires review” mechanisms and online drift-monitoring strategies [191,192,193], enabling an automatic downgrade to a human–AI collaboration mode when uncertainty escalates.

Furthermore, a pronounced mismatch exists between interpretability and safety requirements. Feature engineering-driven models generally provide global or local explanations, whereas deep learning, LLM, and reinforcement learning applications largely lack transparency, and only a few studies using such opaque methods offered counterfactual or Shapley additive explanation-based case-level evidence. In rehabilitation settings, the accountability requirements for specific actions demand an auditable chain of “prediction–explanation–action” [56, 194, 195]: the model must be able to explain why follow-up was triggered at a given moment, which factors drove medication adjustments, and how thresholds self-adapted for the same patient across different stages. Particularly in reinforcement learning and LLM applications, safety constraints and alignment mechanisms remain unestablished [196,197,198], with governance lagging behind algorithmic complexity.

Clinical integration and reimbursement/literacy constitute the true thresholds for AI’s scaled deployment in schizophrenia rehabilitation management. Only three studies achieved closed loops in which predictions directly triggered actions, while most systems remained in identification mode. In community contexts, it is essential to clarify “who sees what signal when, follows which script to take what action, and who is responsible for tracking and auditing” [177, 178]. The absence of corresponding reimbursement mechanisms and workload accounting can render proactive outreach unsustainable [199, 200], and patient and team digital literacy directly affect adherence and interpretation quality [201]. These implementation-layer complexities—role ambiguity, reimbursement gaps, and literacy barriers—reveal that algorithmic performance metrics (e.g., AUC and accuracy) measure what a system can achieve under controlled conditions but remain silent on whether it will be adopted, integrated, and sustained in routine care workflows. Previous studies have predominantly evaluated AI effectiveness through technical benchmarks, leaving questions of reach, feasibility, and service-level impact largely unaddressed. Therefore, rather than relying on technical metrics alone to assess AI deployment, we recommend employing implementation science frameworks such as RE-AIM to assess reach, adoption, and maintenance, and conducting “AI-in-the-loop” pragmatic trials that evaluate service key performance indicators (e.g., follow-up completion rates, relapse intervals, and functional improvement) [202,203,204].
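The service key performance indicators proposed above are simple to compute once a service logs visits and relapse dates; the function name and inputs below are illustrative assumptions for the sketch, not fields from any existing system.

```python
from statistics import median


def service_kpis(visits_scheduled, visits_completed, relapse_gaps_days):
    """Two illustrative service-level endpoints for an AI-in-the-loop
    pragmatic trial: follow-up completion rate and the median interval
    (in days) between successive relapses."""
    completion = visits_completed / visits_scheduled if visits_scheduled else 0.0
    gap = median(relapse_gaps_days) if relapse_gaps_days else None
    return {"follow_up_completion": completion,
            "median_relapse_interval_days": gap}
```

The point of such endpoints is that they move with adoption and workflow fit, not only with model discrimination, which is exactly what RE-AIM-style evaluation asks for.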

Equity and generalizability issues are also concerning. The evidence base is concentrated in high-income countries, with minimal representation from low- and middle-income countries (only one study from India), and only one included study explicitly probed algorithmic bias across demographic subgroups [165]. This entails not only out-of-domain mismatches in device/data ecosystems and care models but also potential cultural biases in goals and measurements. For instance, functional recovery is operationalized differently across cultures, such as independent employment and living in Western cultures versus family role restoration and caregiver burden reduction in Eastern cultures, whereas mainstream functional metrics exhibit limited sensitivity to the latter [205, 206]. Medication management models are likewise highly context-dependent, as divergences in drug availability, follow-up frequency, and metabolic monitoring resources directly impact the validity of adherence prediction and risk assessment [24]. Therefore, local recalibration, preregistered subgroup reporting, and quantification of performance degradation in cross-domain deployment should become standard components of transfer protocols (e.g., following the TRIPOD+AI and PROBAST+AI guidance on external validation and reporting) [207, 208], combined with evidence of effectiveness erosion from distributional drift and model underspecification [193].
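Local recalibration of an imported risk model is commonly done by refitting an intercept and slope on the logit of its scores against local outcomes (Cox-style logistic recalibration). The sketch below fits those two parameters by plain gradient descent on the log loss; it is a toy under the assumption of binary outcomes and probabilities strictly between 0 and 1, not production code.

```python
import math


def recalibrate(probs, outcomes, lr=0.1, steps=2000):
    """Fit p' = sigmoid(a + b * logit(p)) on local data so an imported
    model's risk scores match local event rates. Returns (a, b);
    a shifts overall risk, b rescales over/under-confidence."""
    def logit(p):
        return math.log(p / (1 - p))

    z = [logit(p) for p in probs]
    a, b = 0.0, 1.0  # identity mapping as the starting point
    for _ in range(steps):
        ga = gb = 0.0
        for zi, yi in zip(z, outcomes):
            pi = 1 / (1 + math.exp(-(a + b * zi)))
            ga += pi - yi          # gradient of log loss w.r.t. a
            gb += (pi - yi) * zi   # gradient of log loss w.r.t. b
        a -= lr * ga / len(z)
        b -= lr * gb / len(z)
    return a, b
```

For example, if a transferred model outputs 0.5 everywhere but the local event rate is 0.25, the fitted intercept pulls the recalibrated probability toward 0.25 while the slope stays at 1, which is the kind of degradation-and-correction evidence a transfer protocol should report.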

Regarding ethics and governance, passive sensing and high-frequency monitoring may exacerbate feelings of constant surveillance and paranoid content [48, 209]. Risk stratification outputs, if not contextualized through communication, can readily produce labeling effects and therapeutic pessimism [210, 211]. Involuntary treatment and forensic contexts further require explicit delineation of algorithmic signal boundaries and procedural safeguards [212, 213]. Current research predominantly remains at the minimal threshold of obtaining informed consent, whereas we recommend operationalizing governance requirements into four actionable standards: dynamic consent with minimum necessary data collection, purpose limitation with withdrawal/portability rights, subgroup fairness reporting with bias monitoring, and intervention safety switches in closed-loop scenarios. In high-autonomy systems such as reinforcement learning agents and LLMs, we suggest integrating red-teaming, adversarial examples, and privilege-escalation interception across the training-to-deployment pipeline, with human–AI decision logs recorded for post-hoc auditing, consistent with the previously cited LLM clinical evaluation and mitigation recommendations [197].

Regarding actionable recommendations, in clinical practice, algorithmic outputs should be embedded into a “measurement–feedback–intensification” closed loop (measurement-based care), with preset thresholds and action scripts (e.g., “alert → phone follow-up within 48 h → escalate to an in-person visit or medication adjustment if necessary”) [214], human review triggered in scenarios of elevated uncertainty or complex comorbidity, and thresholds and scripts dynamically calibrated through case audits and outcome feedback, forming a “learning rehabilitation system” [215]. In development and operation, external validation and calibration, uncertainty quantification with abstention, cross-domain transfer with recalibration toolkits, and support for edge and low-bandwidth deployment under energy constraints should be designated as minimum viable configurations [215,216,217]. Service key performance indicators should serve as primary evaluation dimensions, ensuring that technology aligns with the rehabilitation goals of “fewer relapses, better engagement, improved quality of life,” and real-world service efficiency and workload accounting should become regular evaluation metrics [218, 219].
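A preset threshold-and-script loop of the kind described above can be sketched as a small rule table. The thresholds, action steps, and deadlines below are placeholders that a service would calibrate through case audits and outcome feedback, not recommendations from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class Action:
    step: str
    deadline_hours: int  # 0 means no fixed deadline


def triage(risk, uncertain, high=0.7, moderate=0.4):
    """Map a risk score (plus an abstain flag) to a scripted action.
    All thresholds and scripts are illustrative placeholders."""
    if uncertain:
        return Action("human review", 24)
    if risk >= high:
        return Action("phone follow-up, escalate to in-person visit if needed", 48)
    if risk >= moderate:
        return Action("schedule routine follow-up", 168)
    return Action("continue monitoring", 0)
```

Logging each `(risk, action, outcome)` triple is what turns this static rule table into the “learning rehabilitation system” above: audits compare outcomes across thresholds and adjust the cut-points over time.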

This scoping review had several limitations. First, the search and inclusion scope (selected databases and English-language literature only) may have resulted in omissions, and disciplinary intersections and non-standard terminology may have widened search blind spots. Second, the included studies exhibited substantial heterogeneity in methodology, data sources, participant populations, and outcome specifications, precluding direct comparisons and quantitative synthesis. Third, the existing evidence predominantly features surrogate endpoints and short-to-medium-term follow-ups, with nearly 40% of studies enrolling fewer than 100 participants and only seven studies reporting follow-ups beyond one year, accompanied by limited external validation, calibration/uncertainty reporting, and real-world implementation documentation, all of which affect inferential strength and generalizability. Fourth, the research was geographically concentrated and clustered around a few device/platform ecosystems, leaving cross-context transferability and local adaptability yet to be validated. Fifth, our operationalized criteria for determining “readiness for application,” while enhancing relevance for rehabilitation, may have introduced selection bias.

Overall, AI has demonstrated feasibility across several key components of schizophrenia rehabilitation management, although current evidence is insufficient to support conclusions regarding unified effect sizes. The primary contribution of this review lies in providing an application landscape and evaluative criteria centered on rehabilitation goals, distinguishing technologies with mere identification capabilities from tools that can be integrated into service pathways. Future research should adopt patient-centered outcomes and service performance as primary endpoints; conduct prospective, multi-center, and cross-context validation and recalibration; standardize the reporting of calibration, confidence intervals, and subgroup performance; and advance executable and auditable clinical integration within interoperability and governance frameworks. Only through rigorous translation from signal generation to service-level execution can AI substantively reduce relapse risk, enhance engagement, and improve quality of life in schizophrenia contexts.