Abstract
Oropharyngeal dysphagia affects over half of neurological and oncological populations, yet rehabilitation is constrained by a global therapist shortage that human–AI collaboration has not demonstrably addressed. Here we report a systematic review of 31 studies (1012 participants; PROSPERO: CRD420251115997) evaluating AI-augmented swallowing rehabilitation in adults with oropharyngeal dysphagia, or in healthy volunteers testing systems designed for clinical application. We synthesised findings by aetiology and collaboration mode, assessing risk of bias and certainty of evidence (Grading of Recommendations, Assessment, Development and Evaluation, GRADE). AI-augmented interventions produce short-term gains in functional oral intake and physiological measures (GRADE moderate/low certainty), but these effects attenuate within weeks of cessation, and adherence declines sharply once clinician supervision is withdrawn. NASSS framework analysis reveals a central paradox: the adopter domain—digital literacy, cognitive impairment, interface usability—is the dominant implementation barrier (61.3% rated high), meaning the populations with the greatest need face the steepest barriers to adoption. AI algorithm performance is rated at very low certainty, with validation largely confined to healthy volunteers. These findings support advancement to pragmatic trials for supervised post-stroke rehabilitation but underscore that evidence for other aetiologies, unsupervised settings, and sustained outcomes remains insufficient.
Introduction
Safe swallowing requires sub-second coordination of over thirty craniocervical muscle pairs1,2—yet its rehabilitation remains critically under-resourced worldwide. Oropharyngeal dysphagia affects up to 30% of community-dwelling older adults and exceeds 50% in most neurological and oncological populations studied, including stroke, Parkinson’s disease, age-related frailty and head and neck cancer (HNC)3,4,5,6,7,8. Regardless of aetiology, dysphagia independently raises the risk of aspiration pneumonia, malnutrition, and death—post-stroke dysphagia alone confers a more than fourfold increase in pneumonia risk9,10, and aspiration pneumonia remains the leading cause of death in advanced Parkinson’s disease11,12. Because swallowing dysfunction frequently persists or progresses beyond the acute phase13,14, rehabilitation needs are chronic and escalating—accounting for an estimated US$4.3–7.1 billion in excess dysphagia-related inpatient costs per year in the United States15,16, a burden set to intensify as the global population aged 60 and older approaches 2.1 billion by 205017. Intensive swallowing rehabilitation promotes neuroplastic recovery and functional improvement18,19, but delivering it at adequate intensity hinges on a specialist workforce whose numbers fall far short of global need—and the deficit is widening.
In the United States, speech-language pathologist employment is projected to grow by 15% between 2024 and 2034—well above the occupational average—yet still fall short of demand20. In low- and middle-income countries the deficit is orders of magnitude larger: the WHO estimates fewer than ten skilled rehabilitation practitioners per million population, and only 17% of low-income countries have even one speech–language therapist per million21,22. The consequence is that a large proportion of patients worldwide cannot access swallowing rehabilitation at sufficient intensity. Technology may help close this gap—but only if it augments, rather than supplants, the clinical expertise on which safe practice depends.
Human–AI collaboration in rehabilitation embodies this principle, positioning AI not as an autonomous decision-maker but as computational support—real-time physiological monitoring, pattern recognition across multiple signal streams, and individualised dosage adaptation—while clinicians retain contextual judgement, oversight of aspiration risk, and the therapeutic relationship23,24,25. We define a human–AI collaborative rehabilitation system as one integrating: at least one AI-enhanced component, such as adaptive parameter adjustment, multi-parameter pattern recognition, personalised protocol generation, or algorithm-driven real-time feedback and risk alerting; and at least one form of human clinical involvement, such as treatment plan formulation or approval, intervention supervision, parameter adjustment, or exception management. This definition spans a spectrum of collaboration intensity, from continuous clinician oversight with AI-augmented assessment to semi-autonomous AI operation under periodic clinical review. Swallowing lends itself to such collaboration: it generates multimodal physiological signals—electromyographic activity, lingual pressure, laryngeal excursion, deglutition acoustics26,27,28—that encode the rapid biomechanical sequences governing airway protection and bolus transit. These sequences unfold on millisecond timescales, beyond the reach of unaided clinical observation2 but amenable to computational analysis. Preliminary validation indicates that systems built on these signals achieve acceptable detection accuracy in specific populations and improve training precision in controlled settings29,30,31.
Whether these early capabilities translate into real-world clinical benefit remains unclear. Systematic reviews show that specific swallowing interventions yield favourable group-level effects on impairment4,32,33,34, yet a Cochrane review—assessing functional endpoints—found no demonstrable reduction in mortality or long-term disability, with substantial inter-individual response heterogeneity that conventional clinical variables do not adequately explain35. This divergence reflects a persistent challenge in dysphagia research: physiological surrogates and functional outcomes do not reliably co-vary, so that gains in parameters such as muscle activation amplitude or lingual pressure do not consistently predict improvement in swallowing safety or oral intake. For AI-augmented systems, which must select optimisation targets from accessible physiological signals, this dissociation carries direct design consequences. Marked variation in baseline physiology across aetiologies compounds this problem: algorithms validated in post-stroke cohorts—from which most current evidence derives—cannot be assumed to transfer to neurodegenerative or oncological populations, which remain substantially under-represented.
Demonstrating effectiveness is only the first translational hurdle. Implementation science shows that most AI systems performing well under controlled conditions fail to achieve routine adoption, for reasons rooted in the fit between technology design, user capacity, and care environment2,36. In swallowing rehabilitation, this challenge arises under conditions that distinguish it from other digitally augmented therapies. The defining impairments of the target population—neurological, cognitive, sensory—are precisely those that compromise technology interaction37, creating a fundamental tension between who most needs the system and who can most readily operate it. The stakes compound this tension: inadequate system performance during swallowing training can precipitate aspiration—a potentially fatal event—imposing fault-tolerance requirements that exceed those of most rehabilitation scenarios. The populations with the greatest need for these technologies—those in the most resource-constrained settings—face the steepest structural barriers to accessing them. Without deliberate countermeasures, these technologies risk entrenching the very inequities they were designed to alleviate38.
These interdependent challenges remain unaddressed within a unified analytical framework. Previous systematic reviews examined individual technology types in isolation—surface electromyography (sEMG) biofeedback, gamified interfaces, mobile health platforms—yielding technology-specific conclusions that cannot account for why interventions efficacious in controlled trials have consistently failed to enter routine practice. This limitation is not merely additive; it is structural. Effectiveness and implementation complexity are coupled: features that enhance efficacy under supervision—high-fidelity physiological sensing, real-time adaptive algorithms—simultaneously raise barriers to deployment through calibration burden, digital-literacy requirements, and hardware costs. Conversely, design choices that simplify adoption—smartphone-only delivery, reduced clinician involvement—risk attenuating efficacy. Disentangling these trade-offs demands a framework that evaluates both dimensions jointly. The Non-adoption, Abandonment, Scale-up, Spread, and Sustainability (NASSS) framework39, which maps implementation complexity across seven interacting domains from condition characteristics to long-term sustainability (Fig. 1), offers a mature lens for this analysis40; what has been absent is its integration with effectiveness synthesis in a single review.
Barriers are mapped onto seven NASSS domains: D1 (Condition), D2 (Technology), D3 (Value Proposition), D4a (Adopters: Patients), D4b (Adopters: Therapists), and D5–7 (Organisation, Wider System, and Embedding). Surrounding panels detail domain-specific barriers synthesised from the included studies. Directional annotations between domains indicate cross-domain cascading interactions. Solid arrows denote primary influence pathways; dashed curved arrows denote cross-domain interactions; the dashed boundary at the bottom indicates additional complexity amplification in low- and middle-income country contexts.
Here we report a systematic review with two converging objectives: to synthesise effectiveness evidence for human–AI collaborative swallowing rehabilitation across aetiologies, technology types, and collaboration modes; and to apply the NASSS framework to identify the systemic barriers separating controlled efficacy from routine adoption. We developed a taxonomy of human–AI collaboration modes to characterise how clinical and computational tasks are allocated across interventions, and examined whether mode of collaboration is associated with differences in outcomes and implementation complexity. In doing so, we sought to advance the field beyond the question of whether these technologies work, towards an account of how, for whom, and under what conditions they can be embedded in the rehabilitation of one of the most physiologically demanding sensorimotor functions in human medicine.
Results
Study selection
The systematic search across PubMed/MEDLINE, Embase, the Cochrane Library, and Web of Science yielded 5236 records. Following removal of 3914 duplicates, 1322 unique records underwent title and abstract screening. Of these, 1112 were excluded as clearly not meeting the inclusion criteria, leaving 210 records for full-text assessment. Supplementary search strategies—including reference list searching, forward citation tracking, trial registry searching, and expert contact—identified 28 potentially relevant records, of which 12 proceeded to full-text assessment after screening. In total, 222 full texts were assessed (210 from database searching and 12 from supplementary searches), of which 191 were excluded for the following reasons: non-relevant interventional studies (n = 66), diagnostic AI systems without therapeutic integration (n = 33), conference abstracts with incomplete data (n = 30), non-relevant populations (n = 15), non-collaborative study designs lacking human–AI interaction (n = 42), and non-relevant outcomes (n = 5). Ultimately, 31 studies met all inclusion criteria and were included in the review. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram illustrates the complete study selection process (Fig. 2).
PRISMA flow diagram of the search and study selection strategy.
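The record counts reported in the study selection paragraph can be checked arithmetically. The short sketch below is only an illustrative consistency check using the published counts; the variable and function names are assumptions introduced here and are not part of the review's methods.

```python
# Counts as reported in the study selection paragraph.
RECORDS_IDENTIFIED = 5236
DUPLICATES_REMOVED = 3914
EXCLUDED_AT_SCREENING = 1112
SUPPLEMENTARY_FULL_TEXTS = 12

# Reasons for exclusion at full-text assessment, as reported.
FULL_TEXT_EXCLUSIONS = {
    "non-relevant interventional studies": 66,
    "diagnostic AI systems without therapeutic integration": 33,
    "conference abstracts with incomplete data": 30,
    "non-relevant populations": 15,
    "non-collaborative study designs": 42,
    "non-relevant outcomes": 5,
}


def prisma_flow():
    """Recompute each stage of the flow from the reported counts."""
    screened = RECORDS_IDENTIFIED - DUPLICATES_REMOVED
    database_full_texts = screened - EXCLUDED_AT_SCREENING
    full_texts = database_full_texts + SUPPLEMENTARY_FULL_TEXTS
    excluded = sum(FULL_TEXT_EXCLUSIONS.values())
    return {
        "screened": screened,
        "database_full_texts": database_full_texts,
        "full_texts_assessed": full_texts,
        "full_texts_excluded": excluded,
        "included": full_texts - excluded,
    }


if __name__ == "__main__":
    for stage, count in prisma_flow().items():
        print(f"{stage}: {count}")
```

Each recomputed stage can be compared directly with the figures in the text (1322 screened, 222 full texts assessed, 191 excluded, 31 included).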
Study characteristics
The included studies were published between 2016 and 2026: five (16.1%) during 2016–2019 and 26 (83.9%) during 2020–2026. Studies originated from diverse geographical regions, with 15 from the Asia-Pacific (48.4%; mainland China, n = 6; South Korea, n = 4; Taiwan, n = 3; Hong Kong and Japan, n = 1 each), six from North America (all from the United States), seven from Europe (Italy, n = 3; Turkey, n = 2; the United Kingdom and the Netherlands, n = 1 each), and three (9.7%) from multinational collaborations. Descriptively, facial recognition and gamified systems were more prevalent among Asia-Pacific studies, whereas sEMG biofeedback predominated among North American and European studies.
The principal study designs comprised randomised controlled trials (RCTs; n = 13, 41.9%; including nine full-scale RCTs, two pilot RCTs, and two feasibility RCTs), non-randomised interventional studies (n = 8, 25.8%; including six quasi-experimental designs, one case-control study, and one longitudinal study), technology development and validation studies (n = 8, 25.8%), and qualitative studies (n = 2, 6.5%). Sample sizes ranged from 2 to 113 participants (median = 23.0; interquartile range [IQR]: 12.25–58.75). Study settings spanned the full care continuum: acute hospitals and inpatient rehabilitation facilities (n = 15, 48.4%), research laboratories (n = 6, 19.4%), home-based or home-extension settings (n = 4, 12.9%), community centres (n = 4, 12.9%), and hybrid hospital-to-home transitional models (n = 2, 6.5%). Detailed study characteristics, intervention technologies, and outcomes are summarised in Table 1.
Twenty-seven studies (87.1%) recruited independent samples. Four studies had potential sample overlap: two by Zhang et al. validated the same facial recognition system across different settings41,42, and two by Kim H et al. employed the same application with identical sample sizes (n = 11), raising the possibility of shared participants43,44. Findings were analysed at the study level rather than by pooling participant data.
Population characteristics
Among the 26 studies reporting clinical populations, a total of 1,012 participants were enrolled. The distribution by aetiology was as follows: post-stroke dysphagia, 13 studies (41.9%); Parkinson’s disease, five (16.1%); age-related swallowing decline, four (12.9%); HNC, one; intensive care unit (ICU)-acquired dysphagia, one; neurogenic dysphagia, one; unclassified dysphagia, one; and healthy volunteers, five (16.1%). Participant age ranged from young adults (17.5–35 years) in validation studies to older populations (mean age 65–73 years) in clinical intervention trials. Sex distribution varied by aetiology and clinical setting, with the highest proportion of male participants observed in stroke studies (up to 84.2%) and the lowest in studies of community-dwelling older adults (9.1%).
Technology systems and intervention characteristics
The 31 included studies employed five principal categories of AI-enhanced technologies. sEMG biofeedback systems (n = 11, 35.5%)29,45,46,47,48,49,50,51,52,53,54 captured muscular activation patterns during swallowing via submental electrode placement; AI algorithms provided threshold-based detection, real-time signal processing, and visual feedback; technical complexity ranged from simple threshold-triggered devices to multi-sensor flexible electrode arrays. AI-driven computer vision and facial recognition technologies (n = 5, 16.1%)41,42,55,56,57 employed contactless, real-time tracking of facial or lingual movements; principal approaches included MediaPipe 468-point three-dimensional facial landmark detection, ensemble of regression trees (ERT) 68-point detection, and Google Teachable Machine-based image classification. Mobile health (mHealth) application platforms (n = 10, 32.3%)43,44,56,57,58,59,60,61,62,63 delivered interventions via smartphone or tablet applications, with functionality encompassing exercise video guidance, real-time feedback, progress tracking, and automated reminders. Machine learning and deep learning algorithms (n = 7, 22.6%)30,41,42,55,56,57,64 were deployed across a broad range of applications, from electroencephalography (EEG) signal classification for brain–computer interfaces to convolutional neural network (CNN)-based wearable swallowing detection. Three studies applied pre-trained or custom algorithms within clinical interventions41,42,57; the remaining four were confined to algorithm validation or acceptance testing in non-clinical or single-session settings. Other technologies included an ontology-based clinical decision support system (n = 1)65, an infra-red lingual motion tracking system (n = 1)66, and a lingual pressure biofeedback device (n = 1)67.
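To make the notion of threshold-based sEMG detection concrete, the following is a minimal sketch of one common approach: rectify the signal, smooth it with a trailing moving average, and mark the spans where the smoothed envelope meets or exceeds a threshold. It is not the algorithm of any included study; the function name, the window length, and the example values are assumptions chosen purely for illustration.

```python
def detect_activation_events(samples_uv, threshold_uv, window=5):
    """Return (start, end) sample-index pairs where the smoothed,
    rectified signal is at or above threshold_uv.

    The envelope is a trailing moving average over `window` samples;
    `end` is the first index at which the envelope drops below threshold.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    # Full-wave rectification.
    rectified = [abs(value) for value in samples_uv]

    # Trailing moving-average envelope.
    envelope = []
    running_sum = 0.0
    for i, value in enumerate(rectified):
        running_sum += value
        if i >= window:
            running_sum -= rectified[i - window]
        envelope.append(running_sum / min(i + 1, window))

    # Threshold crossing: open an event on upward crossing, close on downward.
    events = []
    start = None
    for i, level in enumerate(envelope):
        if level >= threshold_uv and start is None:
            start = i
        elif level < threshold_uv and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(envelope)))
    return events


if __name__ == "__main__":
    # Hypothetical trace: rest, a burst of +/-50 uV activity, rest.
    trace = [0] * 10 + [50, -50] * 5 + [0] * 10
    print(detect_activation_events(trace, threshold_uv=25))
```

A real device would additionally need filtering, artefact rejection, and a minimum event duration, none of which are modelled here.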
Eleven studies (35.5%)31,41,42,47,49,50,52,54,55,56,68 incorporated gamification elements—visual rewards and real-time feedback—into rehabilitation training across multiple sensing modalities.
These technologies employed diverse feedback modalities. Visual biofeedback was adopted in 30 studies (96.8%), delivered through real-time waveform displays, animated game interfaces, progress dashboards, and LED colour-coded status indicators. Auditory feedback supplemented visual modalities in six studies (19.4%). One study (3.2%) implemented haptic feedback. Multimodal feedback (≥2 modalities) was employed in seven studies (22.6%), predominantly in visual–auditory combinations.
Intervention dose parameters differed considerably between clinical settings. In acute rehabilitation, sessions typically lasted 45–60 min, were delivered 5–7 sessions per week (high-intensity), and comprised 10–20 total sessions. In community and home-based settings, sessions lasted 15–30 min, were delivered 2–3 sessions per week (moderate frequency), and comprised 50–120 total sessions. Seven sEMG-based studies employed individualised progressive threshold adjustment, with initial thresholds set at 50–70% of baseline peak amplitude and increasing to 80–150%46,49,50,51,53,54,67.
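The progressive threshold adjustment described above can be sketched as a simple schedule. The included studies report only the starting and final ranges (50–70% and 80–150% of baseline peak amplitude), not how the target moved between them, so the linear ramp below is an assumption, as are the function name, the default fractions, and the example values.

```python
def target_threshold(baseline_peak_uv, session, total_sessions,
                     start_frac=0.6, end_frac=1.0):
    """Return the sEMG target amplitude (in microvolts) for a session.

    Assumes a linear ramp from start_frac to end_frac of the baseline
    peak amplitude across sessions 1..total_sessions. This schedule is
    hypothetical; the source reports only the start and end ranges.
    """
    if total_sessions < 1:
        raise ValueError("total_sessions must be at least 1")
    if not 1 <= session <= total_sessions:
        raise ValueError("session must lie between 1 and total_sessions")

    if total_sessions == 1:
        fraction = start_frac
    else:
        progress = (session - 1) / (total_sessions - 1)
        fraction = start_frac + progress * (end_frac - start_frac)
    return fraction * baseline_peak_uv


if __name__ == "__main__":
    # Hypothetical patient with a 40 uV baseline peak over 10 sessions.
    for s in (1, 5, 10):
        print(s, round(target_threshold(40.0, s, 10), 1))
```

In practice a clinician or an adaptive rule might instead raise the target only after a run of successful swallows; the linear form is used here only because it is the simplest schedule consistent with the reported ranges.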
Human–AI collaboration modes
Using the human–AI collaboration coding system (see ‘Methods’), the 31 studies were classified into five collaboration modes (Table 2). The distribution was as follows: Mode A, AI-augmented direct supervision (n = 11, 35.5%); Mode B, supervised autonomous practice (n = 5, 16.1%); Mode C, hybrid periodic consultation (n = 7, 22.6%); Mode D, AI-driven clinical decision validation (n = 3, 9.7%); Mode E, autonomous AI operation (n = 5, 16.1%).
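As a quick arithmetic check, the mode distribution and the pooled groupings used in the exploratory analysis (Modes A and B; Modes D and E) can be recomputed from the reported counts. The names in this sketch are illustrative assumptions.

```python
# Study counts per collaboration mode, as reported in the text.
MODE_COUNTS = {"A": 11, "B": 5, "C": 7, "D": 3, "E": 5}
TOTAL_STUDIES = 31


def mode_percentages(counts=MODE_COUNTS, total=TOTAL_STUDIES):
    """Percentage of studies in each mode, rounded to one decimal place."""
    return {mode: round(100 * n / total, 1) for mode, n in counts.items()}


def pooled_count(modes, counts=MODE_COUNTS):
    """Number of studies across a group of modes, e.g. 'AB' or 'DE'."""
    return sum(counts[mode] for mode in modes)


if __name__ == "__main__":
    print(mode_percentages())
    print("A+B:", pooled_count("AB"), "D+E:", pooled_count("DE"))
```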
Among human roles, H1 (Initial setup; n = 31, 100%), H2 (Assessment; n = 30, 96.8%), and H3 (Supervision; n = 29, 93.5%) were assigned in the vast majority of studies, whereas H4 (Treatment adjustment) remained under human control in 17 studies (54.8%). Among AI roles, A3 (Feedback delivery; n = 30, 96.8%) and A1 (Detection; n = 26, 83.9%) were the most prevalent, followed by A4 (Adaptation; n = 12, 38.7%) and A5 (Decision support; n = 3, 9.7%). AI roles were concentrated at the signal detection and feedback layers, whereas treatment adjustment and clinical decision-making remained predominantly under human control.
Effectiveness of human–AI collaborative interventions
Among the 26 studies reporting clinical outcomes, effect directions were predominantly favourable (Fig. 3). However, given the predominance of small samples, short follow-up periods, and feasibility-oriented designs, favourable effect directions should not be equated with robust evidence of clinical effectiveness. The following synthesis distinguishes four outcome tiers that carry different evidentiary weight: (i) clinical swallowing function—direct patient-relevant benefit and the most informative tier; (ii) physiological surrogate measures—which do not invariably translate into functional improvement; (iii) patient-reported and adherence metrics—reflecting acceptability rather than efficacy; and (iv) algorithm performance—reflecting technical capability under controlled validation conditions, typically with healthy volunteers. This hierarchy is maintained throughout the reporting below.
Each bar represents one study, categorised by outcome domain (A: clinical swallowing; B: physiology/strength; C: quality of life [QoL]/patient-reported outcome measures [PROMs]; D: adherence/feasibility) and study design (RCT vs. non-RCT). Bar height encodes sample size (tall: n ≥ 60; medium: 30 ≤ n < 60; short: n < 30), bar colour encodes the effect direction (green: statistically significant positive effect; yellow: positive trend or mixed results; grey: no difference), and bar pattern fill encodes risk of bias (solid: low risk; hatched: some concerns or moderate risk; open: high risk).
Swallowing function outcomes
Five studies reported oropharyngeal swallowing function as a primary endpoint. Zhang et al. (two RCTs; n = 84 and n = 26; some concerns)41,42, using an AI-augmented video game (AI-VG) system, reported significant improvements over controls in Gugging Swallowing Screen (GUSS) scores (P < 0.05); the larger trial additionally demonstrated significant improvement in Standardised Swallowing Assessment (SSA) scores (P = 0.006). Alyanak et al. (RCT; n = 33; some concerns) reported significantly greater improvement in Dysphagia Outcome and Severity Scale (DOSS) scores (P = 0.004)54. Park et al. (RCT; n = 37; low risk of bias) observed within-group improvements in Videofluoroscopic Dysphagia Scale (VDS) scores in both arms (P < 0.05)68, although the difference between groups was not statistically significant. In a three-arm comparison (conventional therapy plus transcranial direct current stimulation [tDCS] as control, plus sEMG biofeedback, and plus gamified training), Hou et al. (RCT; n = 90; some concerns) reported a stepwise increase in clinical response rates (60, 76.7, and 90%)50.
The Functional Oral Intake Scale (FOIS) was the most widely reported outcome measure (n = 12)31,41,42,45,46,50,51,52,54,58,67,68. Seven studies31,41,42,45,50,58,67—six RCTs (one at low risk of bias, five with some concerns) and one case-control study—demonstrated short-term superiority over controls; the remaining five reported within-group improvements only, which, in the absence of between-group differences, provide limited evidence of intervention-specific effects. Three studies incorporating longer-term follow-up (6 weeks to 3 months) consistently showed that between-group differences were no longer significant (P > 0.05)51,58,67. Li et al. additionally reported a nasogastric tube removal rate of 80%31.
Three studies targeting age-related swallowing decline and Parkinson’s disease assessed oral motor function. Chan et al. (quasi-experimental; n = 70 completers; moderate risk) reported significant objective improvements in bite force (P < 0.001), masticatory efficiency (P = 0.002), and oral diadochokinetic rate (P < 0.05), although the subjective screening tool (Eating Assessment Tool-10 [EAT-10]) did not reach significance57. Jung et al. (RCT; n = 76; some concerns) reported increased salivary flow (+0.71 g min−1; P < 0.05), whilst the subjective oral health assessment (General Oral Health Assessment Index [GOHAI]) was non-significant62. Battel et al. (within-subject feasibility; n = 10; moderate risk) observed significant improvements in the severity and frequency of salivary residue in patients with Parkinson’s disease (P < 0.05)52.
Grading of Recommendations, Assessment, Development and Evaluation (GRADE) certainty: Moderate (downgraded for imprecision due to small sample sizes).
Physiological and biomechanical outcomes
Four studies evaluated lingual muscle activation. Chan et al. reported significant improvements in tongue pressure (P < 0.001) and lingual endurance (P = 0.004)57. Jung et al. demonstrated increased anterior tongue pressure (+4.48 kPa; P < 0.05); posterior tongue pressure and buccal pressure did not reach significance62. Kim et al. (single-arm; n = 8 completers) showed an increase in swallowing tongue pressure from 17.5 to 26.5 kPa (P = 0.046); the effect was not maintained at 12-week follow-up43. Krekeler et al. (RCT; n = 19; some concerns) found no statistically significant difference in maximum isometric pressure, although effect sizes were large (anterior tongue pressure: Cohen’s d = 0.95; posterior tongue pressure: d = 0.96)67.
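A large Cohen's d alongside a non-significant P value is typical of small trials, because d standardises the mean difference without reference to sample size. The sketch below computes d with the conventional pooled standard deviation; whether the cited trial used exactly this variant is not stated in the text, and the input values are hypothetical numbers chosen only to keep the arithmetic easy to follow.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")

    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


if __name__ == "__main__":
    # Hypothetical change scores, not data from any included study.
    intervention = [2, 4, 6]
    control = [1, 3, 5]
    print(cohens_d(intervention, control))
```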
With respect to submental muscle activation, Hou et al. (RCT; n = 90) revealed a progressive increase in sEMG peak amplitude across the control (22 μV), sEMG biofeedback (30 μV), and gamified (44 μV) groups (P < 0.05), with a corresponding reduction in swallowing duration (1.52 s, 1.32 s, and 1.09 s; P < 0.05)50. Kim et al. (device validation; 30 healthy volunteers and 1 patient with Parkinson’s disease) quantified the effects of specific swallowing manoeuvres on submental muscle activity48. Jansen et al. (single-arm feasibility; n = 20 ICU patients) observed sEMG changes that did not reach statistical significance (52 to 57 μV); physiological changes showed no significant correlation with clinical outcomes (FOIS, PAS)49.
Concerning hyoid–laryngeal biomechanics, Li et al. (case–control; n = 20), employing accelerometer-guided game-based biofeedback, showed significantly greater hyoid displacement in the intervention group (intervention group: 11.37 to 14.45 mm, P = 0.002; control group: 12.84 to 13.35 mm, non-significant)31.
Seven studies assessed swallowing safety through aspiration and pharyngeal residue outcomes. Three reported significantly lower aspiration rates in the intervention group compared with the control group (P < 0.05)45,46,51. Alyanak et al. reported within-group improvement in liquid Penetration–Aspiration Scale (PAS) scores (P = 0.026) and superiority over controls for semi-solid consistencies (P = 0.031)54. Park et al. reported within-group PAS improvements in both arms (P < 0.05) without a significant between-group difference68. Battel et al. demonstrated sustained improvement in salivary and solid residue in patients with Parkinson’s disease at three-month follow-up (P < 0.05)52. Krekeler et al. reported a substantial reduction in residue (d = 1.2), whereas PAS improvement remained non-significant (P > 0.05)67.
GRADE certainty: Low (downgraded for risk of bias owing to predominant use of non-randomised designs, inconsistency due to heterogeneous measurement methods, and imprecision due to small sample sizes).
Quality of life and psychological outcomes
Three studies reported significant improvements in swallowing-related quality of life (Swallowing Quality of Life [SWAL-QOL] or Dysphagia Handicap Index [DHI] scores; P < 0.05)41,54,58. Six studies reported favourable trends that fell short of statistical significance (P > 0.05)43,51,52,57,62,67. Starmer et al. found that treatment adherence was associated with better MD Anderson Dysphagia Inventory (MDADI) scores in patients with HNC, although physiology-related quality of life did not differ significantly between groups59.
Psychological outcomes showed favourable trends across several studies43,56,68, including enhanced participant motivation, engagement, and self-efficacy. Su et al. reported that flow state experience scores were highest at moderate game difficulty56. Benfield et al. noted lower mood in the biofeedback group when patients were repeatedly confronted with failure-indicating feedback51.
GRADE certainty: Moderate, downgraded for inconsistency (instrument heterogeneity: SWAL-QOL, DHI, MDADI).
Patient engagement and adherence
Short-term adherence was generally high (72.7–100%, median > 80%)41,43,46,51,52,57; five studies reported 100% adherence31,49,50,56,60, though three of the five49,50,56 were conducted under direct supervision; durations ranged from a single session to approximately eight weeks. Srp et al. reported a decline from 100% to 50% between intensive and maintenance phases60; Chan et al. and Krekeler et al. reported attrition rates of 38% and 41%57,67, respectively, in studies with extended durations.
GRADE certainty: Moderate, downgraded for inconsistency (variable metrics: session completion rates, repetition counts, login frequency).
Safety profile and temporal sustainability
Of the 31 included studies, only 15 (48.4%) provided any form of adverse event reporting; the remaining 16 did not, representing a substantial gap in safety evidence. Among reporting studies, no treatment-related serious adverse events were recorded, and minor issues were limited to mild skin irritation, transient cervical discomfort, xerostomia, and fatigue. However, the absence of serious events should be interpreted in light of incomplete reporting: most studies relied on spontaneous reporting rather than systematic active monitoring, and no study reported near-miss aspiration events or instrumentally confirmed subclinical aspiration. GRADE certainty: Moderate (incomplete reporting).
Beyond safety, the temporal sustainability of treatment effects warrants explicit consideration. Nine studies incorporated post-intervention follow-up. Among these, effect attenuation was a consistent finding: between-group FOIS differences were no longer significant at 6–12 weeks51,58,67; tongue pressure gains at 8 weeks were not maintained at 12 weeks43; and adherence declined substantially during maintenance phases57,60,67.
Exploratory analysis of potential effect modifiers
Results were stratified by human–AI collaboration mode. In Modes A and B (continuous or frequent clinician involvement; n = 16; including 8 RCTs, 50.0%), nine studies reported significant improvements in swallowing function or functional oral intake31,41,42,45,46,50,51,52,68, two reported physiological measure improvements31,50, and adherence rates were consistently high (83.3–100%). In Mode C (hybrid periodic consultation; n = 7; including 4 RCTs, 57.1%), six studies reported functional or physiological improvements57,58,59,60,62,67, with more variable adherence rates (37–100%) and effect attenuation at longer-term follow-up in three studies58,60,67. Modes D and E (high AI autonomy; n = 8; including 1 RCT, 12.5%) were predominantly focused on technology validation, with three studies reporting clinical functional outcomes30,56.
Results were also stratified by dysphagia aetiology. Among post-stroke dysphagia studies (n = 13; 10 RCTs), 10 reported FOIS improvements31,41,42,45,46,50,51,58,67,68, five reported aspiration-related improvements45,46,51,54,68, and three reported physiological improvements31,50,67. Among Parkinson’s disease studies (n = 5; 0 RCTs), three reported improvements in swallowing function or frequency30,52,60 and two were device validation studies48,63. Among studies of age-related swallowing decline (n = 4; 1 RCT), three reported tongue pressure improvements43,57,62, although sample sizes were uniformly very small.
Results were further stratified by technology type. Among sEMG biofeedback studies (n = 11), five reported functional improvements45,46,50,52,54 and six reported enhanced physiological activity46,48,49,50,52,54. Among facial recognition studies (n = 5), three reported functional improvements41,42,57 and user acceptance was consistently favourable. Among mHealth application studies (n = 10), five reported short-term adherence rates exceeding 70%43,44,56,57,60, whilst three demonstrated declining adherence during the maintenance phase43,57,60.
With respect to study design, 12 of the 13 RCTs reported significant within-group improvements41,42,45,46,50,51,54,56,58,62,67,68 and eight reported significant between-group differences41,42,45,46,50,54,58,67. The 18 non-randomised, validation, and qualitative studies—rated moderate-to-high risk of bias for interventional and validation designs (n = 16) and low risk for qualitative designs (n = 2)—predominantly reported feasibility, usability, and physiological outcomes but provided limited clinical effectiveness evidence.
AI system performance, usability and clinician experience
Eight studies evaluated AI algorithm performance. The ADAM wearable sensor (validated in 58 healthy adults and 20 patients with Parkinson’s disease during single laboratory sessions) achieved a swallowing detection sensitivity of 95% and specificity of 99% (F1 = 0.89), and correlated strongly with speech-language therapist visual observation (r = 0.92)30. Lee et al. achieved 100% accuracy for liquid swallowing detection47, and Kim et al. (30 healthy adults and 1 patient with Parkinson’s disease) reported signal validation results exceeding 0.9548. Aslan et al. employed ensemble machine learning for EEG-based motor imagery classification, achieving 99.8% accuracy—substantially outperforming single algorithms (k-nearest neighbours: 79.4%; CNN combined with continuous wavelet transform: 83%); this analysis was conducted offline rather than in real time64. The Swallowscope achieved a thickened liquid swallowing detection accuracy of 75.9%61. The tongue–machine interface system demonstrated a click accuracy of 70% and an information transfer rate of 130 bits min−171. The ontology-based clinical decision support system received a clinician approval rate exceeding 66.7%65. GRADE certainty: Very low (risk of bias—predominantly healthy-volunteer validation; inconsistency—performance ranging from 70% to 99.8%; indirectness—controlled laboratory conditions; imprecision—including single-patient pilot testing).
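Sensitivity, specificity, and F1 summarise different parts of a detector's confusion matrix, which is why they need not agree numerically. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from any included study, and the function name is an assumption introduced for illustration.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary-detection metrics from confusion-matrix counts.

    tp/fn: true swallows detected/missed; tn/fp: non-swallow windows
    correctly rejected/falsely flagged.
    """
    sensitivity = tp / (tp + fn)   # recall on true swallows
    specificity = tn / (tn + fp)   # correct rejection of non-swallows
    precision = tp / (tp + fp)     # share of detections that are real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Two hypothetical recordings with identical sensitivity and specificity
    # but a different ratio of swallow to non-swallow windows.
    balanced = detection_metrics(tp=95, fp=10, tn=990, fn=5)
    sparse = detection_metrics(tp=95, fp=100, tn=9900, fn=5)
    print(round(balanced["f1"], 3), round(sparse["f1"], 3))
```

With sensitivity and specificity held fixed, F1 falls as swallows become rarer relative to non-swallow windows, because false positives then make up a larger share of all detections. Performance figures obtained in short laboratory sessions may therefore not carry over unchanged to continuous monitoring.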
Fourteen studies30,41,42,43,47,48,49,51,55,56,57,60,63,65 analysed system usability, acceptance, and satisfaction. User-reported benefits included objective data supporting clinical diagnosis and decision-making; enhanced engagement and autonomy through gamified human–AI interaction and wearable device feedback; user-friendly interface design; and soft, stretchable wearable materials. Healthcare professionals’ interaction with the systems was notably shaped by perceived system confidence—when professionals disagreed with AI judgements, they tended to selectively override swallowing risk predictions.
The impact of these technologies on clinical workflows was examined in four studies. Positive effects included improved team dynamics, enhanced patient communication, and strengthened objective evidence for clinical decision-making. Identified concerns included incorrect execution of intervention recommendations, over-reliance on AI, and potential interference with clinical judgement. No study examined effects on patient empowerment.
Implementation barriers: NASSS framework analysis
Systematic mapping of implementation barriers to the NASSS framework domains revealed domain-specific complexity patterns (Table 3).
Condition complexity (Domain 1) was rated moderate-to-high in 21 studies (67.7%). Principal challenges included comorbidity interactions, trajectory unpredictability, and severity heterogeneity. ICU-acquired dysphagia exhibited the highest condition complexity.
Technology complexity (Domain 2) showed a three-tier distribution: nine studies (29.0%) documented high, 14 (45.2%) moderate, and eight (25.8%) low technology complexity. Technological barriers encompassed calibration requirements (individualised threshold setting for sEMG systems; personalised difficulty progression in gamified interventions); hardware dependencies (availability of dedicated equipment; sEMG device material flexibility and aesthetics; sensor placement precision; cable fragility in early prototypes); algorithmic limitations (non-interpretability of deep learning systems); and interoperability gaps (absence of electronic health record integration; platform-specific restrictions).
The value proposition (Domain 3) remained largely unquantified. Only two studies reported device costs: the tongue–machine interface system (TMIS; <AUD 500) and the flexible sensor patch (approximately US$13.92 plus US$52 unit cost). No study conducted a formal cost-effectiveness analysis comparing AI-enhanced with conventional rehabilitation.
The adopter system (Domain 4) emerged as the most frequently identified challenge domain (19 studies rated as high complexity; 61.3%). Patient-level barriers included insufficient digital literacy; smartphone ownership and internet access requirements; sensory and motor limitations affecting interface interaction; conflicts between intervention recommendations and personal habits; sustained motivational challenges; and physical constraints. Kim et al. found that educational attainment exceeding 10 years, combined with prior technology experience, predicted successful mHealth adoption44. Healthcare professional-level barriers included training requirements, the time burden of device set-up, workflow integration challenges, and variable technology acceptance.
Organisational, wider system, and embedding factors (Domains 5–7) presented a mixed picture. Organisational barriers were reported as moderate-to-high complexity in 17 studies (54.8%), primarily relating to infrastructure requirements and workflow integration. Wider system barriers were notably under-reported: only two studies (6.5%) addressed regulatory considerations, and none discussed reimbursement policy frameworks. Embedding and sustainability exhibited high complexity in eight studies (25.8%), with concerns including the risk of technological obsolescence; ongoing maintenance requirements; dependence on manufacturer support; and limited evidence for long-term outcome maintenance.
No study achieved consistently low complexity across all NASSS domains.
Risk of bias and quality of evidence
RCTs (n = 13) were assessed using the Cochrane Risk of Bias 2.0 (RoB 2.0); 11 (84.6%) were judged as having ‘some concerns’, with only two receiving an overall rating of ‘low risk’. The most prevalent methodological limitation was the inability to blind participants and therapists to digital intervention allocation. Two studies implemented assessor blinding54,68. Additional concerns included unclear allocation concealment (n = 5, 38.5%), incomplete outcome data due to attrition (n = 3, 23.1%), and potential selective outcome reporting (n = 2, 15.4%).
Non-randomised interventional studies (n = 8) were assessed using the Risk of Bias in Non-randomised Studies – of Interventions (ROBINS-I), with all rated as ‘moderate’ overall risk of bias. Three were judged as having serious confounding bias owing to within-subject or uncontrolled designs that inadequately accounted for temporal confounders such as spontaneous neurological recovery and disease progression, and a further five were rated as having moderate confounding risk. Qualitative studies (n = 2) were assessed using the Critical Appraisal Skills Programme (CASP), both receiving a ‘low risk of bias’ rating. Technology validation studies (n = 8) were assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2); five were rated as ‘moderate’ risk and three as ‘high’ risk. The principal sources of bias were patient selection bias—arising from reliance on healthy volunteers, single-patient device validation, or extremely small clinical samples (n ≤ 4)—and concerns regarding applicability to target clinical populations. The above risk-of-bias ratings were integrated into the effectiveness synthesis through study-level bias annotations alongside key findings. Detailed domain-specific ratings and overall risk-of-bias judgements for each study are provided in Supplementary Tables 1–5.
Turning to the certainty of evidence, swallowing function outcomes were rated moderate certainty, downgraded for imprecision owing to small sample sizes. Physiological measures were rated low certainty, downgraded for imprecision on the same grounds. Adverse events were rated moderate certainty, downgraded for incomplete reporting across the majority of included studies. Quality of life outcomes were rated moderate certainty, downgraded for inconsistency arising from instrument heterogeneity. Treatment adherence was rated moderate certainty, downgraded for inconsistency owing to variable operational definitions across studies. System usability was rated low certainty, downgraded for study design limitations and indirectness. AI algorithm performance was rated very low certainty, downgraded across all four GRADE domains (risk of bias, inconsistency, indirectness, and imprecision).
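The relationship between downgrading decisions and the final certainty rating can be sketched as a simple step-down rule. This is a minimal illustration, assuming every outcome starts at 'high' and that each downgrade removes one or two levels; the function name and the number of levels assigned to each domain are assumptions, not the review's recorded judgements.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def grade_certainty(downgrades: dict[str, int], start: str = "high") -> str:
    """Step certainty down by the total number of levels, flooring at 'very low'.

    `downgrades` maps a GRADE domain to the number of levels removed (0, 1 or 2).
    """
    for domain, levels in downgrades.items():
        if levels not in (0, 1, 2):
            raise ValueError(f"{domain}: downgrade must be 0, 1 or 2 levels")
    index = LEVELS.index(start) - sum(downgrades.values())
    return LEVELS[max(index, 0)]


# One domain, one level (assumed): high -> moderate.
print(grade_certainty({"imprecision": 1}))

# Two domains, one level each (assumed): high -> low.
print(grade_certainty({"study design limitations": 1, "indirectness": 1}))

# Four domains, one level each (assumed): floors at very low.
print(grade_certainty({
    "risk of bias": 1,
    "inconsistency": 1,
    "indirectness": 1,
    "imprecision": 1,
}))
```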
Discussion
The central finding of this review is a discrepancy between what these technologies achieve within individual sessions and what the field most needs them to do: reduce dependence on specialist availability. Current systems appear to enhance session-level precision—quantifying physiological parameters beyond human perceptual thresholds and delivering feedback calibrated to patient performance in real time—but have not demonstrably freed rehabilitation delivery from this constraint. Across the 26 studies reporting clinical outcomes, gains were concentrated under direct or frequent clinician supervision, in institutional settings, and over short follow-up periods; evidence from unsupervised, community-based, or long-term contexts remains sparse. Where supervision was reduced or withdrawn, adherence declined and functional gains eroded. As currently configured, these technologies augment what specialists achieve; they do not yet extend rehabilitation to those who lack access to one.
The distribution of collaboration modes offers a partial explanation for this pattern. Modes requiring continuous or frequent clinician involvement (A–C) accounted for 74.2% of studies (n = 23) and produced a higher proportion of positive clinical outcomes than high-autonomy modes (D and E, n = 8). This observation is hypothesis-generating rather than confirmatory, for reasons that substantially limit causal inference. Modes A and B included a markedly higher proportion of RCTs (50.0% versus 12.5%), enrolled larger samples, and predominantly addressed post-stroke dysphagia in institutional settings with established rehabilitation infrastructure, the clinical context most favourable to demonstrating treatment effects. Seven of the eight Mode D and E studies were technology validation or feasibility designs, with small samples, abbreviated follow-up, and protocols not intended to detect clinical effectiveness. The apparent superiority of higher-supervision modes may therefore largely be an artefact of study maturity rather than evidence for a genuine supervision–outcome relationship. Even if the association proves partly genuine, it need not imply that supervision is the active ingredient; an alternative interpretation is that human–AI collaboration operates as a distributed cognitive system23,25 in which value arises not from either component’s solo performance but from their integration: AI capturing signals below human perceptual thresholds, and clinicians exercising contextual reasoning beyond current algorithmic capacity24. This hypothesis remains untested and would require designs that hold population, setting, and rigour constant while varying the mode of collaboration.
A related and potentially more consequential challenge concerns the relationship between what these systems measure and what matters clinically: the persistent dissociation between physiological surrogates and functional outcomes. This dissociation is well established in dysphagia research, but it acquires particular significance when adaptive algorithms must select optimisation targets from accessible physiological signals. Jansen et al., for example, reported that sEMG amplitude changes in ICU patients were non-significant and uncorrelated with functional outcomes (FOIS, PAS), even as patients showed clinical improvement. The ICU setting, multi-system confounding, and absence of a control group limit the inferential strength of that particular study, but the pattern it illustrates—signal improvement without corresponding functional change—recurs across the included literature. The underlying measurement problem is substantial: submental sEMG conflates contributions from the suprahyoid muscles, geniohyoid, and anterior belly of the digastric, and patients may achieve functional gains through compensatory strategies recruiting muscle groups outside the sensor’s pick-up volume. Aetiology-specific variability compounds these measurement limitations: motor fluctuations in Parkinson’s disease produce large within-patient variation in sEMG baselines across sessions69, whilst tissue oedema and fibrosis during HNC radiotherapy systematically alter sensor–muscle coupling characteristics70. In the absence of a widely accepted framework for swallowing digital biomarkers71,72, the optimisation target of any adaptive algorithm is itself indeterminate—the system may learn to maximise a signal that does not reliably index the clinical change it is designed to produce. 
Multi-sensor fusion architectures capable of capturing compensatory mechanisms across muscle groups and tracking multi-dimensional recovery trajectories represent a plausible technical direction73, but none has been validated in clinical dysphagia populations.
The performance of the algorithms themselves warrants separate scrutiny, but must be interpreted with considerable caution: GRADE certainty is very low, downgraded across all four domains; most reported metrics derive from healthy-volunteer validation under controlled conditions without external replication; and no included study directly compared algorithmic approaches within the same clinical population. Against this backdrop, rule-based threshold detection and deep learning approaches each showed different strengths in different contexts. Threshold-based systems offer interpretability, real-time tunability, and the low latency that biofeedback-driven motor learning demands51—but with limited detection accuracy, as the Swallowscope’s template-matching algorithm illustrates (75.9% detection accuracy)61. Deep learning models achieved higher accuracy under controlled validation (sensitivity up to 95% for the ADAM sensor), but at the cost of interpretability and clinical trust—a trade-off reflected in multiple reports of clinicians selectively overriding predictions they could not explain24,74. Together, these findings suggest that the clinically decisive criteria for swallowing AI are not captured by accuracy metrics alone: whether feedback arrives within the motor-learning temporal window, whether decision logic is interpretable to and trusted by clinicians55,65, and whether the system accommodates inter-individual dynamic variability. Conceptually, a layered architecture—edge-level rule engines providing real-time interpretable feedback, cloud-based deep learning for cross-session pattern recognition75—may bridge the gap between these complementary strengths, though clinical evaluation of any such architectures in dysphagia rehabilitation remains at an early stage.
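To make the interpretability argument concrete, the following is a minimal sketch of a generic rule-based threshold detector of the kind contrasted with deep learning above. It is not the algorithm of any included system (the Swallowscope, for instance, uses template matching); the function names, the baseline-plus-k-standard-deviations rule, and the sample values are all assumptions for illustration.

```python
from statistics import mean, stdev


def calibrate_threshold(rest_samples: list[float], k: float = 3.0) -> float:
    """Individualised threshold: resting baseline plus k standard deviations."""
    return mean(rest_samples) + k * stdev(rest_samples)


def detect_events(envelope: list[float], threshold: float,
                  min_samples: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) index pairs where the signal stays above threshold
    for at least `min_samples` consecutive samples; `end` is exclusive."""
    events = []
    start = None
    for i, value in enumerate(envelope):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_samples:
                events.append((start, i))
            start = None
    # Close an event that is still open when the recording ends.
    if start is not None and len(envelope) - start >= min_samples:
        events.append((start, len(envelope)))
    return events


# Hypothetical resting and task envelopes (arbitrary units).
rest = [1.0, 1.2, 0.8, 1.1, 0.9]
task = [1.0, 1.1, 5.0, 6.0, 5.5, 1.0, 4.0, 1.0]

threshold = calibrate_threshold(rest)
print(detect_events(task, threshold))
```

Every decision in this rule can be read off two numbers (the threshold and the minimum duration), which is the property clinicians can inspect and adjust in real time; the price is that brief or low-amplitude swallows, such as the single-sample burst in the example, go undetected.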
Adopter-related barriers emerged as the domain of highest implementation complexity in the NASSS analysis (61.3% of studies rated high), ahead of technology and organisational factors. This finding is subject to reporting bias—researchers describe barriers they encounter more readily than those they do not assess—and our ratings for domains not explicitly addressed in individual studies rest on inference rather than direct report. The pattern suggests that the principal bottleneck is not technological maturity but technology–user fit76. In the dysphagia population, this fit is uniquely difficult to achieve. Post-stroke visual neglect impairs perception of visual feedback; hemiplegia restricts touchscreen operation; and aphasia—affecting approximately 35.6% of patients with post-stroke dysphagia77—undermines comprehension of instructions and reporting of adverse events. In Parkinson’s disease, ‘on–off’ motor fluctuations produce large within- and between-session variation in the same patient’s ability to interact with the system, demanding interfaces capable of state-aware adaptation. Patients with HNC face the compounding burden of xerostomia, mucositis, and fatigue, which may preclude tolerance of sensor placement on irradiated skin or within the oral cavity. These are not problems that digital literacy training alone can resolve; they require reconceptualisation of interaction modalities at the design level—multimodal feedback channels, adaptive difficulty modulation, and caregiver-mediated transitional modes.
Temporal patterns in studies with extended follow-up reveal a further challenge. Effect attenuation was a consistent finding: between-group advantages in functional oral intake, physiological gains, and adherence rates all diminished or disappeared within 6–12 weeks of intervention cessation. Whether these patterns reflect genuine treatment decay, insufficient cumulative dose, or floor effects from underpowered samples cannot be determined from current evidence. What is clear is that current protocols treat the supervised-to-autonomous transition as a fixed temporal event rather than a response to the patient’s functional readiness. Future protocols should consider dynamic transition mechanisms triggered by predefined functional thresholds (e.g., PAS ≤ 2 and FOIS improvement ≥1 point on consecutive assessments), supplemented by scheduled expert review checkpoints. Testing such mechanisms would require sequential multiple assignment randomised trials comparing threshold-triggered against fixed-interval transitions78.
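A threshold-triggered transition of the kind proposed above could be expressed as a simple decision rule. The sketch below is illustrative only: the function and class names are hypothetical, FOIS improvement is assumed to be measured against the pre-intervention baseline, and 'consecutive assessments' is assumed to mean the two most recent ones.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Assessment:
    pas: int   # Penetration-Aspiration Scale: 1 (best) to 8 (worst)
    fois: int  # Functional Oral Intake Scale: 1 (worst) to 7 (best)


def ready_for_transition(baseline_fois: int,
                         assessments: Sequence[Assessment],
                         pas_max: int = 2,
                         fois_gain_min: int = 1,
                         consecutive: int = 2) -> bool:
    """True when the most recent `consecutive` assessments all meet both
    functional thresholds; otherwise supervision continues."""
    if len(assessments) < consecutive:
        return False
    recent = assessments[-consecutive:]
    return all(
        a.pas <= pas_max and (a.fois - baseline_fois) >= fois_gain_min
        for a in recent
    )


# Hypothetical patient with a baseline FOIS of 4.
history = [Assessment(pas=3, fois=5),
           Assessment(pas=2, fois=5),
           Assessment(pas=1, fois=6)]
print(ready_for_transition(baseline_fois=4, assessments=history))
```

Under this rule a single poor assessment resets the count, so the transition tracks sustained functional readiness instead of elapsed time; scheduled expert review checkpoints would sit alongside it, not inside it.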
Beyond effectiveness and adoption, several translational gaps constrain the pathway from trial to practice. Health-economic evidence is virtually absent. Only two of 31 studies reported device costs, and none conducted a formal cost-effectiveness analysis. Given the adherence attenuation documented above, the actual cost per clinically effective session is likely to exceed initial projections substantially60. Comparative economic modelling of delivery scenarios—therapist-led, AI-enhanced intensive, and AI-supported home maintenance—using adherence-adjusted quality-adjusted life years (QALYs) would address a critical evidence gap, but would itself require more robust adherence and outcome data than are currently available.
Even where value can be demonstrated, the regulatory pathway remains undefined. AI-assisted swallowing systems likely qualify as Software as a Medical Device (SaMD), yet current United States Food and Drug Administration (FDA) and international frameworks—designed primarily for diagnostic AI—offer no dedicated classification for therapeutic rehabilitation systems79,80. In the absence of such classification, liability for aspiration events during AI-assisted training remains unresolved among algorithm developers, clinicians, and device manufacturers. These regulatory uncertainties are compounded by algorithmic opacity. When clinicians cannot interpret an AI-generated recommendation, selective override is a rational response—but one that progressively undermines the collaborative premise on which these systems rest81. To improve transparency, we suggest that future studies adopt minimum reporting standards for human–AI task allocation as a rehabilitation-specific supplement to CONSORT-AI and SPIRIT-AI82,83, encompassing AI component functionality, human role responsibilities, collaboration mode, adaptation mechanisms, and exception-handling protocols (see Supplementary Note 1 for a proposed reporting template).
The cascading design decisions identified across the included studies—from sensing architecture through algorithm selection and collaboration mode to interface adaptation—were rarely addressed as an integrated sequence. Drawing on patterns observed across the 31 included studies, we developed a preliminary conceptual framework organising system design decisions into seven sequential layers (Fig. 4). The framework is an analytical scaffold, not a validated protocol; whether this layered structure reflects actual development practice and whether alternative architectures would yield different effectiveness–implementation trade-offs are open empirical questions.
The framework guides system developers through seven sequential decision layers: (1) clinical context definition (aetiology, severity, care setting); (2) sensing system architecture; (3) AI algorithm selection calibrated to task complexity; (4) human–AI collaboration mode design across a five-mode continuum (Modes A–E); (5) multimodal interaction and feedback design; (6) implementation readiness assessment mapped to the NASSS framework; and (7) evaluation framework design. Key decision parameters are listed beneath each layer. A continuous iterative loop enables post-deployment refinement through user feedback collection, performance data analysis, and barrier identification.
A final interpretive challenge concerns the mechanistic basis of the observed effects. Over 60% of included interventions invoked neuroplasticity or motor learning theory as their therapeutic rationale, yet there is no consensus on which mechanism operates in any given clinical context. Different studies attributed the effects of sEMG biofeedback variously to cortical reorganisation, motor skill acquisition, or attenuation of compensatory patterns84,85—explanations that, taken at face value, imply fundamentally different optimal training parameter configurations. No included study attempted to isolate the independent contributions of individual intervention components; most adopted an all-in-one integration strategy that precludes identification of active ingredients. For AI-enhanced rehabilitation, this theoretical ambiguity has direct design consequences: an algorithm that simultaneously embeds a neuroplasticity-driven high-intensity repetition module and a motor-learning-driven progressive feedback fading module may generate contradictory instructions without the capacity to recognise the conflict. Future research should adopt factorial designs to disentangle component effects86, micro-randomised trials to evaluate and optimise adaptive algorithm decision rules87, and hybrid effectiveness–implementation designs to address efficacy and real-world adoption in parallel88.
These findings should be interpreted in light of several methodological limitations, ordered by their estimated impact on validity. First, substantial heterogeneity precluded meta-analysis; narrative synthesis with vote counting, employed here as a preliminary descriptive overview, accords equal weight to all studies regardless of sample size, precision, or rigour, and harvest plots only partially compensate by encoding these attributes alongside effect direction. Second, publication bias likely inflates the evidence base, as null-result studies face greater publication difficulty and commercial developers lack incentive to report unsuccessful systems. Third, the conceptual scope of this review—spanning effectiveness, implementation, and a novel collaboration taxonomy—is broad relative to the empirical evidence available. The Modes A–E framework and task ratio metric were inductively derived during data extraction rather than prespecified, introducing subjectivity despite substantial inter-rater reliability; task ratios should be regarded as approximate ordinal indicators rather than precise measurements. Fourth, whilst risk-of-bias assessments were integrated into the synthesis through study-level annotations and harvest plots, the narrative format constrains the precision with which study quality can be differentially weighted. In particular, the predominance of studies rated at ‘some concerns’ or ‘moderate risk’ means that most findings rest on an evidence base of uncertain internal validity. Fifth, restriction to English-language publications may have excluded relevant studies, particularly from Asia-Pacific regions that contributed nearly half of the included work. Sixth, most included studies were conducted in controlled research environments; generalisability to larger, more diverse populations has not been established. Finally, NASSS complexity ratings for unreported domains were inferred, potentially underestimating true implementation barriers.
This review identifies converging preliminary evidence to suggest that, for post-stroke dysphagia in institutional settings with adequate clinical oversight, human–AI collaborative rehabilitation is ready to advance from proof-of-concept to pragmatic evaluation. The pathway to broader translation is considerably more complex. Evidence is concentrated in a single aetiology and in short-term outcomes; effect attenuation at follow-up is the rule rather than the exception; adopter barriers, not technology limitations, constitute the primary implementation bottleneck; and the physiological signals on which current algorithms are trained may not index the functional recovery that ultimately matters. Advancing the field will require component-isolation designs to identify active ingredients, inclusive interface architectures responsive to the cognitive, sensory, and motor diversity of the target population, health-economic evaluation that accounts for adherence attenuation, and regulatory classification pathways for therapeutic AI. The current evidence is promising for a specific population under specific conditions, but remains preliminary for most aetiologies and settings—an honest reckoning with this distance between evidence and aspiration is the necessary foundation for responsible clinical translation.
Methods
This systematic review was conducted and reported in accordance with the PRISMA 2020 statement. Given the anticipated heterogeneity in intervention designs, study populations, and outcome measures, we concurrently applied the Synthesis Without Meta-analysis (SWiM) guideline to enhance the transparency and rigour of the narrative synthesis. The review protocol was prospectively registered on the PROSPERO database (CRD420251115997).
Two post-registration enhancements were implemented after preliminary data extraction of the first ten studies revealed that no pre-existing framework adequately captured the diversity of human–AI task allocation in swallowing rehabilitation: (a) an operational definition for human–AI collaboration grounded in the augmented intelligence framework, and (b) an inductively derived collaboration coding system (Modes A–E, roles H1–H6 and A1–A6). Both elements were developed inductively during data extraction rather than prespecified in the protocol. The coding system was pilot-tested on a random subset of five studies, refined through structured discussion, and applied to the full dataset only after inter-reviewer consensus was reached. Because these categories were derived from the included studies, they should be evaluated as post hoc analytical constructs rather than independent classificatory instruments. All inclusion criteria and quality assessment procedures remained unchanged.
Eligibility criteria
Inclusion and exclusion criteria were formulated using the Population, Intervention, Comparator, and Outcome (PICO) framework.
For the population, eligible participants were adult patients (≥18 years) with oropharyngeal dysphagia confirmed by instrumental assessment (videofluoroscopic swallowing study [VFSS] or fibreoptic endoscopic evaluation of swallowing [FEES]) or by a validated clinical screening tool. No restrictions were imposed on aetiology (neurogenic, structural, or age-related) or clinical setting (acute hospital, rehabilitation facility, outpatient clinic, community, or home). Studies evaluating system safety or feasibility in healthy volunteers were eligible provided that the technology under investigation was designed for clinical application in dysphagia rehabilitation. Oesophageal dysphagia, paediatric populations, and animal studies were excluded.
For the intervention, as no consensus definition of ‘human–AI collaboration’ exists in the literature, we formulated the following operational definition grounded in the augmented intelligence framework: a human–AI collaborative rehabilitation system is an intervention that integrates AI-enhanced components with human clinical judgement, with the aim of extending—rather than replacing—clinicians’ assessment capabilities, decision-making efficiency, or service coverage through technological means. Included interventions were required to satisfy both of the following criteria:
Criterion A—human clinician involvement (at least one of the following): formulation or approval of the treatment plan; supervision of the intervention process (real-time, periodic, or remote); adjustment of treatment parameters; or management of exceptional circumstances.
Criterion B—AI functional component (at least one of the following): adaptive parameter adjustment informed by patient performance (e.g., dynamic threshold calibration, individualised difficulty progression); pattern recognition or classification derived from multi-parameter combinations; personalised treatment protocol generation or parameter recommendation; or algorithm-based real-time feedback generation or risk alerting.
For borderline cases, explicit adjudication rules were applied. Simple fixed-threshold sEMG biofeedback systems (i.e., triggering feedback when the signal exceeds a preset threshold, with no adaptive adjustment) did not by themselves meet the algorithmic complexity required under Criterion B. However, where the clinician was continuously involved in threshold setting, monitoring, and dynamic adjustment during training (thereby satisfying Criterion A), such a system was included on the grounds that clinician–device synergy met the operational definition of human–AI collaboration. Systems operating entirely autonomously with fixed parameters and without clinician involvement were excluded. For studies satisfying Criterion A but with equivocal Criterion B status (e.g., mobile applications providing only preset video guidance), two reviewers independently assessed whether algorithm-driven individualisation or adaptivity was present; disagreements were resolved by a third reviewer. Pure telemedicine (video consultation without algorithm-driven signal processing or individualised feedback) and pure electrical stimulation (without a biofeedback loop) were excluded.
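The intervention criteria and adjudication rules above can be summarised as a decision procedure. The sketch below is one reading of those rules, not the reviewers' actual screening tool: the field names, role labels, and the 'dual_review' outcome label are hypothetical, and the coding of each flag is assumed to come from a human reviewer.

```python
from dataclasses import dataclass

# Labels paraphrase Criterion A and Criterion B; the strings are assumptions.
CRITERION_A = frozenset({"plan_approval", "supervision",
                         "parameter_adjustment", "exception_management"})
CRITERION_B = frozenset({"adaptive_adjustment", "pattern_recognition",
                         "protocol_generation", "realtime_feedback_or_alerting"})


@dataclass(frozen=True)
class CodedSystem:
    clinician_roles: frozenset = frozenset()
    ai_functions: frozenset = frozenset()
    fixed_threshold_only: bool = False
    clinician_continuously_adjusts: bool = False
    criterion_b_equivocal: bool = False
    pure_telemedicine: bool = False
    pure_electrical_stimulation: bool = False


def screen(system: CodedSystem) -> str:
    """Return 'include', 'exclude', or 'dual_review' for a coded intervention."""
    if system.pure_telemedicine or system.pure_electrical_stimulation:
        return "exclude"
    meets_a = bool(system.clinician_roles & CRITERION_A)
    # Fixed-threshold feedback alone does not count towards Criterion B.
    meets_b = (bool(system.ai_functions & CRITERION_B)
               and not system.fixed_threshold_only)
    if not meets_a:
        return "exclude"
    if meets_b:
        return "include"
    if system.fixed_threshold_only and system.clinician_continuously_adjusts:
        return "include"  # clinician-device synergy rule
    if system.criterion_b_equivocal:
        return "dual_review"  # two reviewers, third resolves disagreement
    return "exclude"


adaptive_game = CodedSystem(clinician_roles=frozenset({"supervision"}),
                            ai_functions=frozenset({"adaptive_adjustment"}))
print(screen(adaptive_game))
```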
For comparators, eligible studies could employ conventional swallowing rehabilitation without technological augmentation, alternative digital interventions, waiting-list controls, or usual care. Studies without a control group were eligible for the purposes of descriptive synthesis and implementation barrier analysis.
For outcomes, primary outcomes comprised swallowing safety (penetration, aspiration, pharyngeal residue), swallowing efficiency (oral and pharyngeal transit times), functional oral intake capacity, and nutritional status. Secondary outcomes encompassed physiological intermediate measures, including hyoid displacement, pharyngeal pressure generation, neuromuscular activation patterns, and dose–response data.
For study types, RCTs, quasi-experimental studies, system development studies reporting feasibility or usability data, and qualitative studies of eligible interventions were included. Case reports, conference abstracts without full data, editorials, and purely observational studies were excluded.
Information sources and search strategy
The search strategy was developed collaboratively by two researchers (Y.W.W. and L.S.F.) in consultation with a medical librarian and underwent independent peer review in accordance with the Peer Review of Electronic Search Strategies (PRESS) guideline89. A comprehensive search was conducted across four electronic databases (PubMed/MEDLINE, Embase via Ovid, the Cochrane Library, and Web of Science) from database inception to 20 December 2025. No language restrictions were applied at the search stage; however, owing to the language capabilities of the review team, only studies published in English were included at the screening stage. This pragmatic constraint and its potential implications are discussed among the limitations in the Discussion.
The search strategy combined terms across three conceptual domains (terms within each concept linked by OR; concepts linked by AND). The complete database-specific search strings are provided in Supplementary Table 6; search strings for the remaining databases were adapted in accordance with platform-specific syntax.
Concept 1 — Swallowing disorders
dysphagia, deglutition disorder*, swallowing disorder*, swallowing dysfunction, oropharyngeal dysphagia, deglutition impairment*
Concept 2—Rehabilitation/therapy
rehabilitation, therapy, training, intervention, exercise
Concept 3—Augmented intelligence technologies
artificial intelligence, machine learning, deep learning, neural network*, biofeedback, smart*, intelligent*, digital*, wearable*, game*, gamif*, mobile health, mHealth, eHealth, telerehabilitation, computer-assisted, algorithm*
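The Boolean structure described above (OR within a concept, AND between concepts) can be assembled programmatically. The sketch below produces a generic query string from the three term lists; it is not the database-specific syntax reported in the supplementary material, and the quoting of multi-word phrases is an assumption.

```python
CONCEPTS = {
    "swallowing_disorders": [
        "dysphagia", "deglutition disorder*", "swallowing disorder*",
        "swallowing dysfunction", "oropharyngeal dysphagia",
        "deglutition impairment*",
    ],
    "rehabilitation_therapy": [
        "rehabilitation", "therapy", "training", "intervention", "exercise",
    ],
    "augmented_intelligence": [
        "artificial intelligence", "machine learning", "deep learning",
        "neural network*", "biofeedback", "smart*", "intelligent*",
        "digital*", "wearable*", "game*", "gamif*", "mobile health",
        "mHealth", "eHealth", "telerehabilitation", "computer-assisted",
        "algorithm*",
    ],
}


def or_block(terms: list[str]) -> str:
    """Join one concept's terms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts: dict[str, list[str]]) -> str:
    """Link the per-concept OR blocks with AND."""
    return " AND ".join(or_block(terms) for terms in concepts.values())


query = build_query(CONCEPTS)
print(query)
```

Keeping the term lists as data makes it straightforward to adapt the same concepts to each platform's field tags and truncation rules without retyping the terms.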
Grey literature—defined here as research output not published through conventional peer-reviewed commercial channels, including trial registrations, regulatory documents, and unpublished datasets—was sought through the following supplementary strategies: searching of ClinicalTrials.gov and the WHO International Clinical Trials Registry Platform (ICTRP) to identify completed but as yet unpublished trials; hand-searching of reference lists of included studies and relevant systematic reviews; forward citation tracking via Google Scholar; and direct contact with principal research groups in the field (n = 5) to enquire about unpublished or ongoing studies. Conference abstracts identified through database searching were screened but excluded if full data were not available.
Study selection
All retrieved records were imported into Rayyan systematic review management software for deduplication and management. Two reviewers (Y.W.W. and L.S.F.) independently conducted two-stage screening against the predefined eligibility criteria. In Stage 1 (title and abstract screening), all records meeting preliminary criteria or of uncertain eligibility were advanced to full-text review (inter-rater agreement: κ = 0.82). In Stage 2 (full-text review), reasons for exclusion were recorded according to predefined categories (inter-rater agreement: κ = 0.89). Disagreements at both stages were resolved through discussion; where consensus could not be reached, a third reviewer (D.Y.F.) adjudicated.
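The agreement statistics above can be reproduced from paired screening decisions. The sketch below assumes that the reported κ is Cohen's kappa for two raters, which the text does not state explicitly; the function name `cohens_kappa` and the ten example decisions are hypothetical and are not data from this review.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters who labelled the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of records")
    n = len(rater_a)
    # Observed agreement: proportion of records given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical title/abstract decisions for ten records (illustration only).
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))
```

In this toy example the reviewers disagree on one record in ten, so observed agreement is 0.90 and chance agreement is 0.54.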
Data extraction
A standardised data extraction form was developed and pilot-tested on a random sample of five studies; extraction items were refined on the basis of pilot-test results. Complete extraction datasets for all 31 studies are provided in Supplementary Tables 7–10. Two reviewers independently extracted the following data: study characteristics (authors, year of publication, country, study design, clinical setting, sample size, and follow-up duration); population details (dysphagia aetiology and specific diagnosis, severity grading, age, sex distribution, and relevant comorbidities, with particular attention to cognitive impairment and digital literacy-related information); intervention details (AI technology type, hardware components and sensor specifications, biofeedback modality, and treatment parameters including session duration, frequency, total course duration, and intensity progression protocol); outcome data (effect estimates and 95% confidence intervals where reported, and within-group and between-group changes reported as mean ± standard deviation or median and interquartile range); and implementation barrier information (barriers and facilitators mapped to the seven NASSS framework domains; see Analytical Frameworks).
For missing or ambiguous data, corresponding authors were contacted by e-mail, with a maximum of two follow-up attempts at two-week intervals. Data that remained unavailable were annotated as ‘not reported’.
Analytical frameworks
Two analytical frameworks were applied to the included studies. Each is described below, together with its development process, operational rules, and known limitations.
The first framework is a human–AI collaboration coding system developed to characterise how clinical and computational tasks are allocated across interventions. It categorises human roles into six types (H1: Initial setup; H2: Assessment; H3: Supervision; H4: Treatment adjustment; H5: Clinical decision; H6: Validation) and AI roles into six types (A1: Detection; A2: Tracking; A3: Feedback delivery; A4: Adaptation; A5: Decision support; A6: Data management). Full operational definitions and coding examples for each role are documented in Supplementary Note 2.
Each intervention was classified into one of five collaboration modes (Mode A–E) using a sequential decision tree that resolves classification through five hierarchical questions: (1) whether the clinician is continuously present during each session; (2) whether AI provides continuous monitoring during independent practice; (3) the pattern of clinician supervision during independent practice; (4) whether AI is the primary generator of clinical recommendations; and (5) whether AI operates independently after initial setup. The five modes are: Mode A, AI-augmented direct supervision; Mode B, supervised autonomous practice; Mode C, hybrid periodic consultation; Mode D, AI-driven clinical decision validation; Mode E, autonomous AI operation. The complete decision tree, worked classification examples, and boundary criteria for adjacent modes (A/B, B/C, C/D) are documented in Supplementary Note 2.
The AI-to-human task ratio was estimated for each intervention by: (a) enumerating all identifiable task components within the intervention protocol; (b) estimating the relative weight of each component on the basis of reported participation frequency and time allocation; (c) attributing each component as ‘human-led’, ‘AI-led’, or ‘shared’; and (d) calculating a weighted task ratio. Two reviewers independently generated estimates, which were then averaged; where estimates diverged by more than 15 percentage points, structured discussion was conducted. Because most original studies did not explicitly report time allocation, task ratios should be regarded as approximate ordinal indicators rather than precise measurements. Study-level classification rationale, task decomposition, independent reviewer estimates, and disagreement resolution records are provided in Supplementary Table 11.
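Steps (a) to (d) and the reviewer-averaging rule can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how 'shared' components enter the ratio, so the sketch credits half of a shared component to AI; the class `TaskComponent`, the functions `ai_share` and `combine`, and all example weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskComponent:
    name: str
    weight: float  # relative weight from reported frequency and time allocation
    lead: str      # "human", "ai", or "shared"


# Assumption (not stated in the text): a shared component counts half towards AI.
AI_CREDIT = {"human": 0.0, "shared": 0.5, "ai": 1.0}


def ai_share(components: list[TaskComponent]) -> float:
    """Weighted percentage (0-100) of the protocol attributed to AI."""
    total = sum(c.weight for c in components)
    if total <= 0:
        raise ValueError("component weights must sum to a positive value")
    return 100.0 * sum(c.weight * AI_CREDIT[c.lead] for c in components) / total


def combine(estimate_1: float, estimate_2: float, threshold: float = 15.0):
    """Average two reviewers' estimates; flag divergence above the threshold."""
    needs_discussion = abs(estimate_1 - estimate_2) > threshold
    return (estimate_1 + estimate_2) / 2, needs_discussion


# Hypothetical decompositions of one intervention by two reviewers.
reviewer_1 = [
    TaskComponent("H1: Initial setup", 1.0, "human"),
    TaskComponent("A3: Feedback delivery", 3.0, "ai"),
    TaskComponent("H4/A4: Progression adjustment", 2.0, "shared"),
]
reviewer_2 = [
    TaskComponent("H1: Initial setup", 2.0, "human"),
    TaskComponent("A3: Feedback delivery", 3.0, "ai"),
    TaskComponent("H4: Treatment adjustment", 1.0, "human"),
]

r1, r2 = ai_share(reviewer_1), ai_share(reviewer_2)
mean_share, flagged = combine(r1, r2)
print(f"reviewer 1: {r1:.1f}%, reviewer 2: {r2:.1f}%, mean: {mean_share:.1f}%, discuss: {flagged}")
```

Because the weights are themselves estimates, the resulting percentages are best read as the approximate ordinal indicators described above rather than as precise measurements.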
Inter-rater agreement for mode classification was substantial (κ = 0.74; 95% CI: 0.68–0.84). Initial disagreements arose in six studies at adjacent mode boundaries (B/C boundary, n = 3; C/D boundary, n = 2; A/B boundary, n = 1); all were resolved through structured discussion referencing the boundary criteria.
The second framework draws on the NASSS model to assess implementation complexity. Two reviewers independently extracted and mapped reported barriers and facilitators to the seven NASSS framework domains. Complexity ratings (low, moderate, high) for each domain were assigned based on the following criteria: low—no or minimal barriers described; moderate—barriers described with identified solutions or workarounds; high—multiple unresolved barriers reported, or study authors explicitly identified the domain as a major implementation challenge. Where studies did not explicitly address an NASSS domain, the rating was based on barriers inferable from the methods and discussion sections and annotated as ‘inferred’. Inter-rater agreement for domain-level complexity ratings was κ = 0.79 (95% CI: 0.71–0.87).
Quality assessment
Two reviewers independently assessed the methodological quality of each included study using design-specific risk-of-bias tools. RCTs were assessed using the RoB 2.0 tool90. Non-randomised interventional studies were assessed using the ROBINS-I tool91. Technology validation studies were assessed using the QUADAS-2 tool92, with signalling questions adapted to reflect AI system-specific methodological considerations. Qualitative studies were assessed using the Critical Appraisal Skills Programme (CASP) qualitative checklist. Each domain was rated according to the tool-specific categories (e.g., low risk, some concerns, or high risk for RoB 2.0). Disagreements were resolved through discussion; where consensus was not reached, a third reviewer was consulted.
Certainty of evidence was assessed using the GRADE93 approach across the following key outcome domains: swallowing function, physiological measures, quality of life, treatment adherence, system usability, adverse events, and AI algorithm performance. Certainty ratings were classified as high, moderate, low, or very low. The GRADE approach was designed primarily for intervention effectiveness evidence; the certainty of evidence for implementation outcomes (such as NASSS complexity distributions) was not formally rated.
Data synthesis
Substantial heterogeneity in intervention designs, technology systems, population characteristics, and outcome measures precluded statistical pooling of effect sizes. A narrative synthesis was conducted in accordance with the SWiM guideline, structured along three dimensions: (a) primary aetiology; (b) human–AI collaboration mode (Modes A–E); and (c) technology type.
Effect direction was classified for each grouping and outcome domain as follows: where statistical test results were reported, P < 0.05 served as the threshold for a positive effect; where only descriptive data were available, an effect exceeding the recognised minimal clinically important difference was classified as positive; studies not reporting sufficient data were annotated as ‘direction uncertain’.
Vote counting—tallying the number of studies reporting positive, null, and negative effects—was used solely as a preliminary descriptive overview of effect directions. Vote counting accords equal weight to all studies regardless of sample size, precision, or risk of bias; it captures only the direction, not the magnitude, of effects; and it cannot distinguish genuine treatment effects from patterns arising from small-study effects or publication bias. Harvest plots were employed to partially compensate for these limitations by encoding sample size, study design, and risk-of-bias classification alongside effect direction.
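The effect-direction rule and the vote count can be sketched together in Python. Only the rules for 'positive' and 'direction uncertain' are stated in the text; how null and negative effects were assigned is an assumption here (a significant result against the intervention, or a descriptive deterioration beyond the minimal clinically important difference, is treated as negative, and everything else as null). The class `StudyResult`, its fields, and the four example studies are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudyResult:
    study: str
    favours_intervention: Optional[bool] = None  # direction of the tested effect
    p_value: Optional[float] = None
    change: Optional[float] = None  # descriptive change, positive = improvement
    mcid: Optional[float] = None    # minimal clinically important difference


def effect_direction(result: StudyResult) -> str:
    """Classify one study as positive, null, negative, or direction uncertain."""
    if result.p_value is not None and result.favours_intervention is not None:
        if result.p_value < 0.05:
            # Assumption: significant results against the intervention are negative.
            return "positive" if result.favours_intervention else "negative"
        return "null"
    if result.change is not None and result.mcid is not None:
        if result.change > result.mcid:
            return "positive"
        # Assumption: deterioration beyond the MCID is negative, otherwise null.
        return "negative" if result.change < -result.mcid else "null"
    return "direction uncertain"


# Hypothetical studies for illustration only.
results = [
    StudyResult("S1", favours_intervention=True, p_value=0.01),
    StudyResult("S2", favours_intervention=True, p_value=0.20),
    StudyResult("S3", change=3.0, mcid=2.0),
    StudyResult("S4"),
]
votes = Counter(effect_direction(r) for r in results)
print(dict(votes))
```

The tally weights every study equally, which is why it serves only as a descriptive overview and is read alongside the harvest plots.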
We further explored potential effect modifiers by comparing the distribution of effect directions across collaboration modes, technology types, population subgroups, and study designs. These exploratory analyses are subject to ecological bias, as subgroups were not randomly formed and differed systematically in population characteristics, intervention type, study design, clinical setting, sample size, and follow-up duration. Apparent differences across subgroups may therefore reflect confounding rather than genuine effect modification. These caveats apply to all subgroup comparisons reported and are not repeated therein. The robustness of synthesis conclusions was assessed in conjunction with GRADE certainty-of-evidence ratings and NASSS complexity distributions.
Data availability
The datasets generated and/or analysed during the current study are available within the article and its supplementary information files, including complete extraction datasets, coding documentation, and classification materials (Supplementary Tables 7–10, Supplementary Table 11, and Supplementary Note 2).
Code availability
Not applicable. No custom computer code was developed for this systematic review.
References
Hamdy, S. et al. Recovery of swallowing after dysphagic stroke relates to functional reorganization in the intact motor cortex. Gastroenterology 115, 1104–1112 (1998).
Sasegbon, A. & Hamdy, S. The anatomy and physiology of normal and abnormal swallowing in oropharyngeal dysphagia. Neurogastroenterol. Motil. 29, https://doi.org/10.1111/nmo.13100 (2017).
Clavé, P. & Shaker, R. Dysphagia: current reality and scope of the problem. Nat. Rev. Gastroenterol. Hepatol. 12, 259–270 (2015).
Labeit, B. et al. Dysphagia after stroke: research advances in treatment interventions. Lancet Neurol. 23, 418–428 (2024).
Labeit, B. et al. The assessment of dysphagia after stroke: state of the art and future directions. Lancet Neurol. 22, 858–870 (2023).
Doan, T. N. et al. Prevalence and methods for assessment of oropharyngeal dysphagia in older adults: a systematic review and meta-analysis. J. Clin. Med. 11, https://doi.org/10.3390/jcm11092605 (2022).
Ribeiro, M. et al. The prevalence of oropharyngeal dysphagia in adults: a systematic review and meta-analysis. Dysphagia 39, 163–176 (2024).
Yang, W. et al. Review of prophylactic swallowing interventions for head and neck cancer. Int. J. Nurs. Stud. 123, 104074 (2021).
Liang, J. et al. Predictors of dysphagia screening and pneumonia among patients with acute ischaemic stroke in China: findings from the Chinese Stroke Center Alliance (CSCA). Stroke Vasc. Neurol. 7, 294–301 (2022).
Banda, K. J. et al. Prevalence of dysphagia and risk of pneumonia and mortality in acute stroke patients: a meta-analysis. BMC Geriatr. 22, 420 (2022).
Chua, W. Y., Wang, J. D. J., Chan, C. K. M., Chan, L. L. & Tan, E. K. Risk of aspiration pneumonia and hospital mortality in Parkinson disease: A systematic review and meta-analysis. Eur. J. Neurol. 31, e16449 (2024).
Won, J. H., Byun, S. J., Oh, B. M., Park, S. J. & Seo, H. G. Risk and mortality of aspiration pneumonia in Parkinson’s disease: a nationwide database study. Sci. Rep. 11, 6597 (2021).
Balcerak, P., Corbiere, S., Zubal, R. & Kägi, G. Post-stroke dysphagia: prognosis and treatment-a systematic review of RCT on interventional treatments for dysphagia following subacute stroke. Front. Neurol. 13, 823189 (2022).
Fiorella, M. L. et al. Dysphagia and dysarthria in neurodegenerative diseases: a multisystem network approach to assessment and management. Audiol. Res. 16, https://doi.org/10.3390/audiolres16010009 (2026).
Attrill, S., White, S., Murray, J., Hammond, S. & Doeltgen, S. Impact of oropharyngeal dysphagia on healthcare cost and length of stay in hospital: a systematic review. BMC Health Serv. Res. 18, 594 (2018).
Patel, D. A. et al. Economic and survival burden of dysphagia among inpatients in the United States. Dis. Esophagus 31, 1–7 (2018).
United Nations Department of Economic and Social Affairs. World Population Ageing 2023 (United Nations, 2024).
Cheng, I., Scarlett, H., Zhang, M. & Hamdy, S. Preconditioning human pharyngeal motor cortex enhances directional metaplasticity induced by repetitive transcranial magnetic stimulation. J. Physiol. 598, 5213–5230 (2020).
Michou, E. et al. Targeting unlesioned pharyngeal motor cortex improves swallowing in healthy individuals and after dysphagic stroke. Gastroenterology 142, 29–38 (2012).
U.S. Bureau of Labor Statistics. Occupational Outlook Handbook: Speech-Language Pathologists, <https://www.bls.gov/ooh/healthcare/speech-language-pathologists.htm> (2024).
World Health Organization, Rehabilitation 2030: A Call for Action — The Need to Scale Up Rehabilitation, <https://www.who.int/docs/default-source/documents/health-topics/rehabilitation/call-for-action/need-to-scale-up-rehab-july2018.pdf> (2017).
World Health Organization, World Report on Hearing, <https://wfdeaf.org/wp-content/uploads/9789240020481-eng.pdf> (2021).
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022).
Vaccaro, M., Almaatouq, A. & Malone, T. When combinations of humans and AI are useful: a systematic review and meta-analysis. Nat. Hum. Behav. 8, 2293–2303 (2024).
Shin, B. et al. Automatic Clinical assessment of swallowing behavior and diagnosis of silent aspiration using wireless multimodal wearable electronics. Adv. Sci. 11, e2404211 (2024).
Riebold, B., Seidl, R. O. & Schauer, T. Electromyography- and bioimpedance-based detection of swallow onset for the control of dysphagia treatment. Sensors 24, https://doi.org/10.3390/s24206525 (2024).
Konoike, Y. et al. Parameter analysis using swallowing sounds shows differences in bolus volume, bolus viscosity, sex, and age. Sci. Rep. 15, 30639 (2025).
Shieh, W. Y., Wang, C. M., Ju, Y. Y. & Cheng, H. K. Multi-sensor respiratory-swallow telecare system for safe feeding in different trunk inclinations: system development and clinical application. Sensors 23, https://doi.org/10.3390/s23020642 (2023).
Xu, S. et al. Digital health technology for Parkinson’s disease with comprehensive monitoring and artificial intelligence-enabled haptic biofeedback for bulbar dysfunction. J. Parkinson's Dis. 15, 630–645 (2025).
Li, C. M. et al. Swallowing training combined with game-based biofeedback in poststroke dysphagia. PM R. 8, 773–779 (2016).
Lee, C. L. et al. Efficacy of swallowing rehabilitative therapies for adults with dysphagia: a network meta-analysis of randomized controlled trials. Geroscience 47, 2047–2065 (2025).
Bengisu, S., Demir, N. & Krespi, Y. Effectiveness of Conventional Dysphagia Therapy (CDT), Neuromuscular Electrical Stimulation (NMES), and Transcranial Direct Current Stimulation (tDCS) in acute post-stroke dysphagia: a comparative evaluation. Dysphagia 39, 77–91 (2024).
Alamer, A., Melese, H. & Nigussie, F. Effectiveness of neuromuscular electrical stimulation on post-stroke dysphagia: a systematic review of randomized controlled trials. Clin. Interv. Aging 15, 1521–1531 (2020).
Bath, P. M., Lee, H. S. & Everton, L. F. Swallowing therapy for dysphagia in acute and subacute stroke. Cochrane Database Syst. Rev. 10, CD000323 (2018).
Han, R. et al. Randomised controlled trials evaluating artificial intelligence in clinical practice: a scoping review. Lancet Digit Health 6, e367–e373 (2024).
Jarvis, K., Thetford, C., Turck, E., Ogley, K. & Stockley, R. C. Understanding the barriers and facilitators of Digital Health Technology (DHT) implementation in neurological rehabilitation: an integrative systematic review. Health Serv. Insights 17, 11786329241229917 (2024).
Woolley, K. E. et al. Mapping inequities in digital health technology within the World Health Organization’s European region using PROGRESS PLUS: scoping review. J. Med. Internet Res. 25, e44181 (2023).
Greenhalgh, T. et al. Beyond adoption: a new framework for theorizing and evaluating nonadoption, abandonment, and challenges to the scale-up, spread, and sustainability of health and care technologies. J. Med. Internet Res. 19, e367 (2017).
Shin, H. D. et al. The NASSS (Non-Adoption, Abandonment, Scale-Up, Spread and Sustainability) framework use over time: a scoping review. PLOS Digit. Health 4, e0000418 (2025).
Zhang, B. et al. Effect of artificial intelligence-based video-game system on dysphagia in patients with stroke: a randomized controlled trial. Clin. Nutr. 45, 81–90 (2025).
Zhang, B. et al. Face recognition-driven video game for dysphagia rehabilitation in stroke patients: a pilot randomized controlled trial. Arch. Phys. Med. Rehabil. 106, 342–350 (2025).
Kim, H. et al. Implementation of a home-based mhealth app intervention program with human mediation for swallowing tongue pressure strengthening exercises in older adults: longitudinal observational study. JMIR Mhealth Uhealth 8, e22080 (2020).
Kim, H. et al. User-dependent usability and feasibility of a swallowing training mHealth App for older adults: mixed methods pilot study. JMIR Mhealth Uhealth 8, e19585 (2020).
Wang, L. et al. Respiratory-swallow coordination training using bimodal signal biofeedback for patients with post-stroke dysphagia: a randomized controlled trial. Ann. Med. 58, 2607218 (2026).
Nordio, S. et al. Biofeedback as an adjunctive treatment for post-stroke dysphagia: a pilot-randomized controlled trial. Dysphagia 37, 1207–1216 (2022).
Lee, Y. et al. Soft electronics enabled ergonomic human-computer interaction for swallowing training. Sci. Rep. 7, 46697 (2017).
Kim, M. K. et al. Flexible submental sensor patch with remote monitoring controls for management of oropharyngeal swallowing disorders. Sci. Adv. 5, eaay3210 (2019).
Jansen, M. et al. Kangaroo stimulation game in tracheostomized intensive care-related dysphagia: interventional feasibility study. JMIR Serious Games 13, e60685 (2025).
Hou, M. et al. Efficacy of game training combined with surface electromyography biofeedback on post-stroke dysphagia. Geriatr. Nurs. 55, 255–262 (2024).
Benfield, J. K., Hedstrom, A., Everton, L. F., Bath, P. M. & England, T. J. Randomized controlled feasibility trial of swallow strength and skill training with surface electromyographic biofeedback in acute stroke patients with dysphagia. J. Oral. Rehabil. 50, 440–451 (2023).
Battel, I. & Walshe, M. An intensive neurorehabilitation programme with sEMG biofeedback to improve swallowing in idiopathic Parkinson’s disease (IPD): a feasibility study. Int. J. Lang. Commun. Disord. 58, 813–825 (2023).
Bahia, M. M., Carpenter, J., Rogers, K. & Cherney, L. R. Early feasibility and efficacy of a novel skill-based training program for poststroke dysphagia. Arch. Rehabil. Res. Clin. Transl. 7, 100535 (2025).
Alyanak, B., İnanır, M., Sade, S. I. & Kablanoğlu, S. Efficacy of game-based EMG-biofeedback therapy in post-stroke dysphagia: a randomized controlled trial. Dysphagia 40, 1289–1301 (2025).
Zhang, B. et al. Technology acceptance of the video game-based swallowing function training system among healthcare providers and dysphagia patients: a qualitative study. Digit Health 10, 20552076241284830 (2024).
Su, K. C., Wu, K. C., Chou, K. R. & Huang, C. H. Tongue muscle training app for middle-aged and older adults incorporating flow-based gameplay: design and feasibility pilot study. JMIR Serious Games 13, e53045 (2025).
Chan, R. S. M. et al. Human-AI collaboration improves adults’ oral biomechanical functions: a multi-centre, self-controlled clinical trial. J. Dent. 150, 105354 (2024).
Wang, Z., Dai, X. & Wu, C. Effect of an individualized digital coaching program on swallowing function in stroke patients. Acta Neurol. Belg. 123, 963–969 (2023).
Starmer, H. M. et al. Head and neck virtual coach: a randomized control trial of mobile health as an adjunct to swallowing therapy during head and neck radiation. Dysphagia 38, 847–855 (2023).
Srp, M. et al. mHealth-assisted expiratory muscle strength training in Parkinson’s disease patients: A proof-of-concept study. J. Parkinson's Dis. 14, 1623–1630 (2024).
Kuramoto, N., Jayatilake, D., Hidaka, K. & Suzuki, K. Smartphone-based swallowing monitoring and feedback device for mealtime assistance in nursing homes. In Proc. 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 5781–5784, (IEEE, 2016).
Jung, E. S., Choi, Y. Y. & Lee, K. H. Smartphone-based combined oral and whole-body exercise programme aimed at improving oral functions: a randomized clinical trial. Int. J. Dent. Hyg. 22, 905–912 (2024).
Kantarcigil, C. et al. Patient perceptions on a wearable sensor technology for swallowing: a qualitative study with patients with Parkinson’s disease. Dysphagia https://doi.org/10.1007/s00455-025-10910-7 (2025).
Aslan, S. G. & Yilmaz, B. In Proc. 32nd European Signal Processing Conference (EUSIPCO), 1388–1391 (IEEE, 2024).
Spoladore, D. et al. A Knowledge-based Decision Support System for recommending safe recipes to individuals with dysphagia. Comput. Biol. Med. 171, 108193 (2024).
Khan, M. M., Smart, S., Bogaardt, H., Ahmed Zubairi, J. & Yanushkevich, S. Design of a portable device: toward assisting in tongue-strengthening exercises and dysphagia management. IEEE Access 12, 84893–84906 (2024).
Krekeler, B. N. et al. Effects of device-facilitated lingual strengthening therapy on dysphagia related outcomes in patients post-stroke: a randomized controlled trial. Dysphagia 38, 1551–1567 (2023).
Park, J. S., Lee, G. & Jung, Y. J. Effects of game-based chin tuck against resistance exercise vs head-lift exercise in patients with dysphagia after stroke: an assessor-blind, randomized controlled trial. J. Rehabil. Med. 51, 749–754 (2019).
Moreau, C. et al. Overview on wearable sensors for the management of Parkinson’s disease. NPJ Parkinsons Dis. 9, 153 (2023).
Shammas-Toma, M. et al. Wearable technologies in head and neck oncology: scoping review. JMIR Mhealth Uhealth 13, e72372 (2025).
Wong, D. W. et al. Current technological advances in dysphagia screening: systematic scoping review. J. Med. Internet Res. 27, e65551 (2025).
Donohue, C., Mao, S., Sejdić, E. & Coyle, J. L. Tracking hyoid bone displacement during swallowing without videofluoroscopy using machine learning of vibratory signals. Dysphagia 36, 259–269 (2021).
Song, Y., Yun, I., Giovanoli, S., Easthope, C. A. & Chung, Y. Multimodal deep ensemble classification system with wearable vibration sensor for detecting throat-related events. NPJ Digit. Med. 8, 14 (2025).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Greenhalgh, T. et al. Analysing the role of complexity in explaining the fortunes of technology programmes: empirical application of the NASSS framework. BMC Med. 16, 66 (2018).
Song, W., Wu, M., Wang, H., Pang, R. & Zhu, L. Prevalence, risk factors, and outcomes of dysphagia after stroke: a systematic review and meta-analysis. Front. Neurol. 15, 1403610 (2024).
Lei, H., Nahum-Shani, I., Lynch, K., Oslin, D. & Murphy, S. A. A “SMART” design for building individualized treatment sequences. Annu. Rev. Clin. Psychol. 8, 21–48 (2012).
Warraich, H. J., Tazbaz, T. & Califf, R. M. FDA perspective on the regulation of artificial intelligence in health care and biomedicine. JAMA 333, 241–247 (2025).
Muehlematter, U. J., Bluethgen, C. & Vokinger, K. N. FDA-cleared artificial intelligence and machine learning-based medical devices and their 510(k) predicate networks. Lancet Digit Health 5, e618–e626 (2023).
Ghassemi, M., Oakden-Rayner, L. & Beam, A. L. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health 3, e745–e750 (2021).
Liu, X., Cruz Rivera, S., Moher, D., Calvert, M. J. & Denniston, A. K. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit Health 2, e537–e548 (2020).
Rivera, S. C., Liu, X., Chan, A. W., Denniston, A. K. & Calvert, M. J. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. BMJ 370, m3210 (2020).
Cabib, C. et al. Neurorehabilitation strategies for poststroke oropharyngeal dysphagia: from compensation to the recovery of swallowing function. Ann. N. Y. Acad. Sci. 1380, 121–138 (2016).
Anderson, C. et al. The perturbation paradigm modulates error-based learning in a highly automated task: outcomes in swallowing kinematics. J. Appl. Physiol. 119, 334–341 (2015).
Collins, L. M., Dziak, J. J., Kugler, K. C. & Trail, J. B. Factorial experiments: efficient tools for evaluation of intervention components. Am. J. Prev. Med. 47, 498–504 (2014).
Liu, X., Deliu, N. & Chakraborty, B. Microrandomized trials: developing just-in-time adaptive interventions for better public health. Am. J. Public Health 113, 60–69 (2023).
Curran, G. M., Bauer, M., Mittman, B., Pyne, J. M. & Stetler, C. Effectiveness-implementation hybrid designs: combining elements of clinical effectiveness and implementation research to enhance public health impact. Med. Care 50, 217–226 (2012).
McGowan, J. et al. PRESS peer review of electronic search strategies: 2015 guideline statement. J. Clin. Epidemiol. 75, 40–46 (2016).
Sterne, J. A. C. et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ 366, l4898 (2019).
Sterne, J. A. et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ 355, i4919 (2016).
Whiting, P. F. et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann. Intern. Med. 155, 529–536 (2011).
Guyatt, G. H. et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 336, 924–926 (2008).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (82302965) and the Medical Science and Technology Key Program (LHGJ20230095). The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.
Author information
Contributions
Y.W.W. and L.S.F. conceived and designed the study. Y.W.W. and L.S.F. performed the literature search, screening, and data extraction. D.Y.F. adjudicated disagreements during study selection and quality assessment. C.M.R. and Y.F.N. contributed to risk of bias assessment and data verification. Z.F. and Z.J. contributed to data synthesis and visualisation. Y.W.W. provided methodological guidance on the NASSS framework analysis. X.X.X. and L.Y.Q. supervised the study and provided critical revisions of the manuscript. Y.W.W. drafted the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, W., Li, S., Du, Y. et al. Human–AI collaboration for dysphagia rehabilitation from effectiveness to implementation complexity: a systematic review. npj Digit. Med. 9, 404 (2026). https://doi.org/10.1038/s41746-026-02729-9
DOI: https://doi.org/10.1038/s41746-026-02729-9






