Introduction

Outdoor mobility is essential for community participation and quality of life (QOL) in people with stroke (PwS)1. However, many PwS report difficulties walking outdoors2, and navigating uneven surfaces poses particular challenges, increasing the risk of falls and mobility limitations3. Uneven surfaces present unpredictable perturbations that place greater demands on the neuromuscular system, making it more challenging to maintain gait stability4,5, defined as the capacity to sustain steady walking despite minor disturbances or errors in control6. As PwS often exhibit reduced adaptability, particularly when exposed to such perturbations7, the assessment of gait stability in uneven-surface walking is of critical clinical relevance. Yet, this aspect remains underexplored.

Recent advances in wearable sensing have allowed for a detailed analysis of trunk acceleration during walking using inertial measurement units (IMUs) attached to the lower back8. Gait stability features can be derived from these signals and categorized into linear and non-linear metrics. Linear analysis commonly uses the root mean square (RMS) of acceleration to quantify the magnitude of variability. PwS exhibited greater trunk RMS values on uneven surfaces than healthy controls, indicating reduced stability. However, the RMS captures only the variability magnitude and not temporal structure. Nonlinear metrics address this issue by characterizing signal dynamics and time-dependent patterns. These include the harmonic ratio (HR) for smoothness, short-term Lyapunov exponent (sLE) for local dynamic stability, recurrence quantification analysis (RQA) for periodicity, and sample entropy (SampEn) for the regularity of the signal. These measures complement linear metrics. In unpredictable environments, such as uneven surfaces, assessing both variability and temporal organization is crucial. The integration of linear and nonlinear metrics offers a more comprehensive understanding of gait stability in PwS. To interpret these indicators effectively, their interrelationships must be clarified using an integrative analysis.

Previous studies that assessed gait stability on uneven surfaces using IMU sensors primarily evaluated each acceleration-derived metric independently4,9,10. However, analyzing multiple features separately can lead to redundancy, increased processing time, and over-representation of correlated metrics, potentially hindering their clinical interpretation. Uneven surfaces expose individuals to unpredictable perturbations, to which PwS respond heterogeneously. These responses are often coupled nonlinearly, which limits the explanatory power of traditional linear statistics. Machine learning (ML) addresses these challenges by extracting complex patterns from high-dimensional data and identifying informative features¹²,¹³. ML can uncover latent relationships missed by conventional methods and has been increasingly used to distinguish individuals with neurological disorders from healthy controls using gait data14,15. Furthermore, combining ML-based automated gait classification with IMU-derived data allows for the rapid and clinically meaningful evaluation of gait abnormalities in individuals with motor impairments16,17. A comparative analysis of multiple ML models has been proposed to identify the most accurate classifiers18,19, with robust results emerging from a consensus across diverse algorithms. Models range from interpretable “glass box” approaches (e.g., logistic regression (LR)) to less transparent “black box” models (e.g., support vector classifier (SVC), random forest (RF)) ²⁰. Additionally, sparse partial least squares discriminant analysis (sPLS-DA), which integrates feature selection, can be useful for improving model interpretability21. Clinical datasets often suffer from limited and imbalanced samples, which can degrade model performance²². 
To address this, data augmentation techniques such as the Synthetic Minority Over-sampling Technique (SMOTE)²⁵, Generative Adversarial Networks (GAN) ²⁶, and Conditional Tabular GAN (ctGAN)²⁷ have shown promise in improving model accuracy and revealing hidden structures ²³,²⁴. Comparing augmentation methods with ML models helps identify optimal combinations²⁸. Based on these strategies, this study combined data augmentation and ML to identify acceleration-based features that characterize decreased stability in PwS during uneven-surface walking. Standardizing environmental conditions for uneven-surface walking is challenging, making such assessments uncommon in clinical settings. In contrast, gait parameters such as speed, trunk acceleration, muscle activity, and joint angles during even-surface walking are increasingly accessible in practice owing to recent technological advances29. Predicting uneven-surface gait stability using these parameters may help guide rehabilitation and improve outdoor mobility. ML regression models can estimate values that normally require specialized environments³⁰ and can be used to assess disease severity in neurological conditions31. These include linear (e.g., linear regression and support vector regression (SVR)) and nonlinear methods (e.g., RF and XGBoost). Interpretation tools such as partial dependence plots (PDPs) and SHapley Additive explanations (SHAP), including SHAP + PDP, reveal influential predictors and feature interactions32. Applying these techniques to even-surface gait data may help identify key contributors to uneven-surface performance, thereby supporting safer and more adaptive outdoor walking.

The primary aim of this study was to analyze trunk acceleration during uneven-surface walking using multiple machine learning models and evaluate how RMS and nonlinear metrics (HR, RQA, SampEn, and sLE) distinguish PwS from HC. The secondary aim was to predict key acceleration features and uneven-surface gait speed from even-surface gait parameters, including speed, trunk acceleration, electromyography (EMG), and joint angles. By integrating multiple stability metrics derived from wearable sensors with explainable machine learning, this study captures gait instability on uneven surfaces from multiple perspectives and enables prediction based on clinically accessible even-surface walking data, rather than merely describing surface-specific characteristics. This approach may contribute to the development of interpretable digital biomarkers that support the formulation of individualized rehabilitation strategies aimed at improving outdoor mobility in PwS.

Results

Participant characteristics in the HC and PwS groups

The demographic and clinical characteristics of the participants are presented in Table 1. No significant differences were found between the two groups in age, height, weight, or Body Mass Index (BMI) (Welch’s tests, all p > 0.05), or in sex distribution (Fisher’s exact test, p > 0.05). The descriptive statistics for all the gait parameters are shown in Supplementary Table S1.

Table 1 Participant demographic characteristics and comparison of variables between the healthy controls (HC) and people with stroke (PwS) groups.

Step 1. Stroke detection

Glass box model and black box model

Feature selection results

Among the acceleration-based features during uneven-surface walking, no pairwise correlations exceeded 0.9 (Supplementary Fig. S1). The Boruta algorithm selected nine features (Supplementary Fig. S2a), whereas the LASSO algorithm selected 11 features (Supplementary Fig. S2b). Eight features—RMS in the anterior–posterior (RMS_AP), mediolateral (RMS_ML), and vertical (RMS_VT) directions; HR in the anterior–posterior direction (HR_AP); SampEn in the anterior–posterior direction (SampEn_AP); %determinism in RQA in the mediolateral direction (RQA_det_ML); and sLE in the anterior–posterior direction (sLE_AP)—were selected by both methods and used as inputs for subsequent supervised ML algorithms.

Supervised ML classification metrics

Classification performance was evaluated across all combinations of data augmentation strategies and ML models based on the mean values from 50 repeated stratified random samplings. Figure 1 shows the average Receiver Operating Characteristic (ROC) curves for the top-performing combinations: RF with GAN (N = 1000), LR with ctGAN (N = 200), and SVC with ctGAN (N = 1000). A full comparison of the classification metrics for all 36 combinations is presented in Supplementary Table S2.
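As a minimal sketch of how an area under the ROC curve (AUC) can be averaged over repeated test splits, the following uses the rank-based (Mann-Whitney) formulation. The function names and the toy labels and scores are hypothetical and are not taken from the study's pipeline, which is not described at this level of detail.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(runs):
    """Average AUC over repeated splits; `runs` is a list of (labels, scores)."""
    return mean(roc_auc(labels, scores) for labels, scores in runs)


# Hypothetical toy data for two repeated splits (1 = PwS, 0 = HC).
runs = [
    ([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]),
    ([1, 0, 1, 0], [0.8, 0.3, 0.7, 0.1]),
]
print(mean_auc(runs))
```

In practice the same averaging would be applied to the 50 stratified test splits, one (labels, scores) pair per split.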

Fig. 1

Average ROC curves for top-performing models (50 repeated stratified samplings). RF, Random Forest; LR, Logistic Regression; SVC, Support Vector Classifier.

Model interpretation

Model interpretation was performed for the top-performing models. For LR using ctGAN (N = 200), odds ratios were calculated (Table 2). SHAP values were computed for the RF with GAN (N = 1000) and SVC with ctGAN (N = 1000) (Fig. 2). RMS_VT, a linear acceleration metric, showed the highest importance in both the LR and RF models. Among the nonlinear metrics, SampEn_AP was more important than RMS_ML and RMS_AP in LR, whereas HR_AP was particularly influential in the RF model. In the SVC model, HR_AP demonstrated the highest overall importance, exceeding that of all RMS features.
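For readers less familiar with odds ratios, a logistic regression coefficient β maps to an odds ratio of exp(β), and a Wald confidence interval follows from the coefficient's standard error. The sketch below illustrates this conversion only. The function name and the example values are hypothetical and are not taken from Table 2.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (odds ratio, lower bound, upper bound) for a logistic coefficient.

    The interval is a Wald interval: exp(beta +/- z * se), with z = 1.96
    giving an approximate 95% confidence level.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# Hypothetical coefficient and standard error for a z-scored feature.
or_, lower, upper = odds_ratio_ci(beta=0.7, se=0.25)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the features were z-scored, an odds ratio computed this way would describe the change in odds per one standard deviation of the feature.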

Table 2 Odds ratio in logistic regression (N = 200).
Fig. 2

SHAP Value Plot. The x-axis represents the SHAP value of each feature, indicating its contribution to the predicted class. In this visualization, SHAP values are displayed separately for HC (left) and PwS (right), where positive values indicate movement of the model output toward correctly predicting the displayed class (e.g., positive SHAP values on the HC panel increase the likelihood of being classified as HC). The color scale indicates feature magnitude (red = high, blue = low). For example, a high-value feature (red) located on the positive side of the HC plot indicates high feature values increase classification toward HC, whereas a similar pattern in the PwS plot reflects increased likelihood of PwS classification. (a) RandomForest (GAN(N = 1000)); (b) Support Vector Classifier (ctGAN(N = 1000)).

Model with integrated feature selection

The sPLS-DA model was used to classify the PwS and HC groups. Component 1, which included RMS_VT, RMS_AP, HR_AP, and RMS_ML, achieved a mean AUC of 0.953 (SD = 0.002). Component 2, consisting of SampEn_AP, RQA_rec_ML, SampEn_ML, and RQA_det_AP, did not contribute to further improvement (mean AUC = 0.953, SD = 0.006), indicating that Component 1 alone was sufficient to achieve high classification accuracy. The distributions of the samples projected onto Components 1 and 2 are shown in Supplementary Fig. S3. Component 1, which explained 27% of the variance, contributed the most to group separation, whereas Component 2 (10%) contributed less because of the greater overlap between groups. Although the 95% confidence ellipses partially overlapped, overall separation was evident. Table 3 lists the features used in the model and their corresponding variable importance in projection (VIP) scores. Among the linear metrics, RMS_VT had the highest score, whereas HR_AP, a nonlinear metric, also showed a high score.

Table 3 Variable importance in projection (VIP) scores of the sPLS-DA model.

Step 2. Stroke uneven-surface prediction

Acceleration indices on uneven surface walking to be predicted

Based on the results of the multiple ML classification models in Step 1, three acceleration indices were identified as the key features distinguishing PwS from HC: RMS_VT, SampEn_AP, and HR_AP. In addition to these indices, uneven-surface gait speed was included as a target variable to be predicted using the regression models in Step 2.

Feature selection results

No significant correlations were observed among the baseline data. Furthermore, no combination of gait parameters measured during even-surface walking showed correlation coefficients exceeding 0.9 (Supplementary Fig. S4). Supplementary Fig. S5 illustrates the results of feature selection using Boruta and LASSO for each outcome. Specifically, Boruta and LASSO selected 13 and 15 features, respectively, for the uneven-surface gait speed (S5Aa, S5Ab); 12 and 16 features for RMS_VT (S5Ba, S5Bb); 5 and 11 features for SampEn_AP (S5Ca, S5Cb); and 10 and 6 features for HR_AP (S5Da, S5Db). The optimal regularization parameters used in LASSO are indicated by the corresponding log(λ) values. Features selected by both methods are summarized in Supplementary Table S3 and were used for subsequent supervised machine learning regression.

Supervised ML regression metrics

The performance of supervised machine learning regression models was evaluated for predicting gait speed, RMS_VT, SampEn_AP, and HR_AP during uneven-surface walking, based on 50 repeated runs with different random seeds. Among the regression models for gait speed, Linear Regression achieved the highest R² value (0.903), while RF was the most accurate among nonlinear models (R² = 0.860) (Fig. 3a). For RMS_VT, RF outperformed all the other models (R² = 0.630), whereas SVR showed the best performance among the linear models (R² = 0.575) (Fig. 3b). In contrast, the predictive performance of SampEn_AP was limited across models, with the Elastic Net (R² = 0.393) and RF (R² = 0.373) showing relatively better performances (Fig. 3c). For HR_AP, the Elastic Net (R² = 0.607) and RF (R² = 0.584) models demonstrated moderate accuracy (Fig. 3d). Notably, RMS_VT was the only outcome in which the nonlinear models consistently outperformed linear models. A complete summary of regression performance metrics is provided in Supplementary Table S4 with corresponding scatter plots shown in Supplementary Fig. S6.
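The R² values reported here follow the usual coefficient of determination, one minus the ratio of residual to total sum of squares. A minimal sketch is shown below. The function name and the toy values are hypothetical, and averaging over repeated runs is assumed to be a simple mean.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical uneven-surface gait speeds (m/s) and model predictions.
actual = [0.6, 0.8, 1.0, 1.2]
predicted = [0.65, 0.75, 1.05, 1.15]
print(round(r_squared(actual, predicted), 3))
```

A model that always predicts the mean of the observed values scores 0 under this definition, which is a useful baseline when reading the moderate values for SampEn_AP.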

Fig. 3

Scatter plot. Scatter plots of predicted versus actual values for four outcome variables using the best-performing linear and nonlinear regression models. Each panel shows the prediction performance for one outcome variable: (a) Gait speed, (b) RMS_VT, (c) SampEn_AP, and (d) HR_AP. RMS, Root mean square; VT, Vertical; SampEn, Sample entropy; AP, Anterior-posterior; HR, Harmonic ratio.

Model interpretation

We conducted model interpretation using SHAP and PDP for the models that showed the highest performance for each target variable (Supplementary Table S4). In the prediction of uneven surface gait speed and HR_AP, both linear models (Linear Regression and Elastic Net, respectively) and nonlinear models (RF) identified their corresponding even-surface parameters (i.e., even-surface gait speed and HR_AP) as the most influential features through SHAP analysis (Fig. 4). For RMS_VT, in the nonlinear model (RF), which showed superior predictive accuracy, the even-surface gait speed was more important than the even-surface RMS_VT.

Fig. 4

SHAP Value Plot for outcome. These SHAP value plots illustrate the even-surface walking parameters that contribute to predicting each outcome during uneven-surface walking. (a) Gait speed; (b) RMS_VT; (c) SampEn_AP; (d) HR_AP. RMS, Root mean square; VT, Vertical; SampEn, Sample entropy; AP, Anterior-posterior; HR, Harmonic ratio.

In the prediction of SampEn_AP, both the Elastic Net and RF models revealed that even-surface SampEn_AP was a key predictor. Additionally, in the RF model, the ankle dorsiflexion angle at initial contact (Ang_IC_ankle) emerged as one of the top-ranked features.

PDP and SHAP + PDP plots were used to assess the contribution of each even-surface parameter to the prediction of uneven-surface outcomes. In the linear models, all relationships appeared approximately linear, as shown in Supplementary Fig. S7. In contrast, Fig. 5 illustrates the results from the nonlinear (RF) models, highlighting more complex relationships.

Fig. 5

Partial dependence plot analysis and SHAP partial dependence plot analysis for gait parameters on uneven surface. Partial dependence plots (PDPs) and SHAP dependence plots for key even-surface gait features predicting uneven-surface gait outcomes using the Random Forest model. Each row corresponds to a target variable: (a) Gait speed, (b) RMS_VT, (c) SampEn_AP, and (d) HR_AP. For each variable, the left panel shows the PDP, and the right panel shows the SHAP dependence plot with individual data points colored by feature value. Plots are presented for the most influential same-name feature (top row) and an additional relevant predictor (bottom row) identified through SHAP analysis. RMS, Root mean square; VT, Vertical; SampEn, Sample entropy; AP, Anterior-posterior; HR, Harmonic ratio.

For gait speed, participants with an even-surface walking speed below 0.8 m/s tended to show greater reductions on uneven surfaces. Similarly, the RMS_VT prediction model indicated a plateau in the RMS_VT values when the even-surface walking speed was below 0.8 m/s. For SampEn_AP, both excessively large and small values of Ang_IC_ankle were associated with a higher SampEn_AP, suggesting a U-shaped dependency. As for HR_AP, the uneven-surface HR_AP plateaued when the even-surface HR_AP exceeded approximately 1.5.
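The plateau and U-shaped patterns described above are read from partial dependence curves. As a minimal sketch of how such a curve is obtained, the following forces one feature to each grid value for every observation and averages the model's predictions. The `toy_model` function, its kink at 1.0, and the data rows are arbitrary illustrations, not the study's Random Forest or its data.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is set to each value in `grid`.

    `predict` takes a single feature row (list of floats) and returns a float.
    The input rows are copied, so the original data are left unchanged.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Arbitrary stand-in model whose output saturates once feature 0 reaches 1.0."""
    first, second = row
    return min(first, 1.0) + 0.1 * second


rows = [[0.5, 1.0], [1.5, 3.0]]
grid = [0.5, 1.0, 1.5]
print(partial_dependence(toy_model, rows, feature_index=0, grid=grid))
```

A flat segment in the returned curve, as produced here beyond the kink, is what would be read as a plateau in a PDP.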

Discussion

In this study, we applied a two-step ML approach to analyze trunk acceleration during uneven-surface walking. Step 1 identified key features distinguishing PwS from HC using data augmentation and classification models. Step 2 used regression models to predict these features and the uneven-surface gait speed from the even-surface walking parameters.

Step 1 identified RMS_VT, SampEn_AP, and HR_AP as key features distinguishing PwS from HC. In regression analysis, PwS with even-surface gait speed < 0.8 m/s showed greater speed decline and elevated RMS_VT on uneven surfaces. SampEn_AP was influenced by ankle dorsiflexion at the initial contact, and HR_AP was predicted by its even-surface value. These findings highlight that multidimensional, noninvasive sensor data can detect gait instability in PwS and may contribute to the advancement of individualized gait assessment aimed at supporting outdoor mobility.

Among the trunk acceleration metrics for uneven-surface walking, RMS_VT (linear), SampEn_AP, and HR_AP (nonlinear) were particularly effective in distinguishing PwS from HC. RMS_VT was highly important across the models: a significant odds ratio in LR (Table 2), high SHAP values in RF (Fig. 2), and a high VIP score in sPLS-DA (Table 3). Features consistently selected across models are considered robust digital biomarkers33,34,35. This supports prior findings of elevated RMS in PwS during uneven-surface walking⁹ and suggests that increased RMS_VT may reflect greater postural demands or an impaired response to perturbations36. Although acceleration-based measures can be affected by anthropometric factors such as body size and segment mass, no group differences in these characteristics were observed in our dataset, suggesting minimal confounding from morphology. For the nonlinear indices, elevated SampEn_AP contributed to PwS classification in both the LR and SVC models (Fig. 2). Although a higher sample entropy typically reflects adaptive motor control in healthy individuals8, it co-occurs with increased RMS in PwS. This suggests that an elevated SampEn_AP in PwS may indicate instability or irregular trunk acceleration rather than flexible control during uneven-surface walking.

A lower HR_AP was a key predictor of PwS in both the SHAP analysis of the SVC model and the VIP scores from the sPLS-DA. HR reflects waveform periodicity, and previous studies have reported that HR decreases on uneven surfaces, even in healthy individuals4. In PwS, HR is typically lower than that in HC37, consistent with our findings. Notably, HR_AP was more important than RMS in the SVC model, suggesting that disturbances in the harmonic organization or periodic structure of the trunk acceleration waveform may capture instability more sensitively than changes in waveform magnitude alone.

Gait speed during uneven-surface walking was highly predictable from even-surface parameters, with strong performance in both linear (Linear Regression, R² = 0.903) and nonlinear (RF, R² = 0.860) models (Fig. 3). SHAP analysis identified even-surface gait speed as the most influential predictor (Fig. 4), and PDP plots showed that PwS walking below 0.8 m/s exhibited greater speed reductions on uneven surfaces (Fig. 5). For RMS_VT, moderate prediction accuracy was achieved with SVR (R² = 0.575) and RF (R² = 0.630), with RF performing marginally better. SHAP analysis indicated that even-surface gait speed had a greater influence than RMS_VT itself (Fig. 4), and PDP plots showed consistently high RMS_VT in PwS walking below 0.8 m/s (Fig. 5).

These findings suggest that PwS with an even-surface gait speed below 0.8 m/s are likely to show both reduced speed and increased trunk acceleration variability (high RMS_VT) on uneven surfaces, indicating difficulty in adaptation. This aligns with prior reports of further speed reduction outdoors in slower PwS³⁸. The 0.8 m/s threshold is widely used to classify gait capacity39,40, and our results support its validity for predicting gait speed and stability under uneven conditions. Importantly, this threshold has long served as a functional benchmark for community ambulation41, and our IMU-based ML analysis quantitatively reinforces its relevance as a mobility indicator in uneven-surface contexts.

The predictive accuracy of SampEn_AP was limited in both linear (Elastic Net, R² = 0.393) and nonlinear (RF, R² = 0.373) models, warranting cautious interpretation. SHAP analysis identified Ang_IC_ankle during even-surface walking as a relatively important predictor. The SHAP + PDP plots showed a U-shaped relationship, with SampEn_AP increasing when dorsiflexion was either too low or too high, indicating nonlinearity (Fig. 5). Increased dorsiflexion generally supports gait stability42; however, in PwS, reduced tibialis anterior (TA) activity can make ankle dorsiflexion difficult43. Thus, decreased dorsiflexion at the IC may lead to decreased stability on uneven surfaces. However, in cases where ankle dorsiflexion is markedly large despite reduced TA activity, it may reflect compensatory overexertion and a lack of motor control flexibility44. Paradoxically, these conditions may result in decreased stability on uneven surfaces. Given the modest accuracy of the SampEn_AP model, this interpretation should be viewed cautiously. This limited performance may be partly explained by characteristics of entropy measures—SampEn is inherently sensitive to non-stationarity and noise in gait signals. In addition, the analysis intentionally used ten gait cycles, reflecting both safety and fatigue constraints during uneven-surface walking in PwS and our aim to assess whether stability-related biomarkers can be extracted from short, clinically realistic data segments. Prior entropy-based gait studies have shown that approximately ten strides can yield stable entropy estimates45, and that entropy values become largely independent of signal length once the time series exceeds roughly 750 data points46—well below the ~ 1,180 samples contained in each segment of our study (Supplementary Table S6). Nonetheless, shorter recordings may increase estimation variability, and future work should verify SampEn-based stability using longer continuous trials.

For HR_AP during uneven-surface walking, both the linear (Elastic Net, R² = 0.607) and nonlinear (RF, R² = 0.584) models showed moderate predictive accuracy (Fig. 3). The SHAP analysis identified even-surface HR_AP as a key predictor (Fig. 4). In the RF model, SHAP and PDP plots showed that HR_AP on uneven surfaces increased proportionally when even-surface values were low (especially < 1.5), but plateaued above ~ 1.5 (Fig. 5), suggesting a limited additional benefit. Thus, improving HR_AP during even-surface walking may enhance gait regularity on uneven surfaces, particularly in individuals with low baseline values. As HR_AP reflects trunk coordination47, interventions targeting trunk function may improve stability in complex environments. The identified threshold (~ 1.5) aligns with previous reports showing mean HR_AP values of ~ 2.0 in healthy individuals and < 1.5 in PwS37, suggesting a physiologically meaningful cutoff value. Therefore, HR_AP may serve as a practical sensor-based indicator for rehabilitation and outdoor stability monitoring.

This study is the first to use ML with SHAP and PDP analyses to clarify the differences in gait stability indicators between HC and PwS during uneven-surface walking and their associations with even-surface gait parameters. By employing multiple models, including glass box, black box, and feature-selective approaches, we enhanced the robustness of our findings. Integrating both linear and nonlinear indicators provided a more nuanced understanding of clinically described “reduced stability.” Linking even-surface metrics to instability on uneven surfaces supports the translation of laboratory-based gait data into real-world contexts and may inform individualized rehabilitation strategies for outdoor mobility in PwS.

This single-center study included a modest number of HC (n = 39), whereas a priori power analysis and previous IMU-based work suggested that approximately 100 participants per group would be desirable for robust inference. Although class imbalance was addressed using SMOTE, GAN, and ctGAN, these augmentation techniques were applied solely to stabilise model training and do not substitute for real participant data or increase statistical power. Therefore, the limited HC sample may constrain generalisability, and external validation using larger real-world datasets is warranted. Nonetheless, this study offers proof-of-concept for real-world digital biomarkers and supports future multicenter studies. Furthermore, feature selection (Boruta and LASSO) was conducted prior to train–test splitting, which may introduce a risk of information leakage because the selected feature set was informed by the full dataset, including samples later used for testing. Although this workflow is commonly seen in applied clinical machine learning studies14,18, the more rigorous strategy would involve performing feature selection within the training folds and applying the selected features to the held-out test data. Future studies should incorporate fold-wise feature selection to further minimise bias and enhance methodological robustness. Lastly, the predictive accuracy of SampEn_AP was limited, which may reflect several factors. Entropy-based metrics are inherently sensitive to noise and non-stationarity in gait signals, and SampEn was calculated from relatively short recordings (ten gait cycles), conditions that may increase variability. In addition, SampEn was derived only from the paretic side, whereas post-stroke gait asymmetry suggests that bilateral information may be necessary to fully capture stability characteristics. 
Accordingly, SampEn-based interpretations should be viewed cautiously, and future studies should verify entropy-derived stability using longer recordings and bilateral representations of gait.

This study used ML to analyze acceleration-based gait stability indices during uneven-surface walking, identifying key features that distinguish PwS from HC. The classification models achieved an accuracy of over 95%, with RMS_VT, SampEn_AP, and HR_AP as the key discriminators. In regression, PwS with even-surface gait speed < 0.8 m/s showed slower speeds and higher RMS_VT on uneven surfaces, indicating poor adaptability. SampEn_AP was influenced by ankle dorsiflexion, and HR_AP was influenced by its even-surface value, both of which showed nonlinear patterns. These findings highlight the utility of ML-based acceleration analysis for assessing gait stability and adaptation in PwS. Future studies should validate these results in larger cohorts and inform targeted rehabilitation strategies.

Methods

Participants

A cross-sectional study was conducted at the authors’ institution involving 71 PwS (63.8 ± 7.4 years; stroke onset: median 62.0 days, Interquartile Range 52.0) and 39 age-matched community-dwelling HC (65.6 ± 7.4 years). The exclusion criteria were as follows: (1) inability to walk independently, even with a single cane; (2) presence of bilateral brain lesions; (3) Mini-Mental State Examination (MMSE) score below 24; (4) history of orthopedic disorders; and (5) cerebellar lesions. Written informed consent was obtained from all participants prior to enrollment. All procedures were approved by the ethics committee of the authors’ institution and were conducted in accordance with the Declaration of Helsinki.

Experimental setup and procedures

The participants walked three round trips on a 10-meter even-surface walkway and a 10-meter uneven-surface walkway, each with 2-meter buffer zones. A physical therapist supervised all the walking tasks to ensure safety. Standardized footwear (W503, MARIANNU Co. Ltd., Japan) was provided. The use of canes was permitted if needed; however, lower limb orthoses were not allowed.

During walking, trunk acceleration was recorded using a triaxial wireless accelerometer placed at the third lumbar vertebra (L3) to evaluate stability (see Supplementary Fig. S8). In addition, sagittal plane videos and surface EMG signals were collected simultaneously. EMG data were obtained from the paretic side in PwS and the right side in HC, targeting five lower limb muscles: the TA, soleus (SOL), rectus femoris, biceps femoris (BF), and gluteus medius (GM)48. Additional technical details regarding the surface design, sensor settings, and data acquisition are provided in Supplementary Table S5.

Clinical evaluation

The severity of lower limb motor impairment was assessed using the Fugl-Meyer Assessment (FMA)49. Balance ability was evaluated using the Berg Balance Scale (BBS)50. Before the task, the participants rated their confidence in walking on an uneven surface using a Likert scale (0 = no confidence, 10 = complete confidence) based on the modified Gait Efficacy Scale51. Isometric knee extensor strength on the non-paretic side was measured using a handheld dynamometer and normalized to the body weight52.

Gait cycle detection

Gait speed was calculated from the time required to traverse a 10-meter walkway, based on synchronized video recordings53. The first and last three gait cycles were excluded to eliminate the acceleration and deceleration effects. Initial contact and toe-off were identified using the anteroposterior axis of a shank-mounted accelerometer54 and verified using a video. Ten gait cycles were extracted for analysis.
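The cycle selection described above can be sketched as follows. The function names are hypothetical, and taking the first ten steady-state cycles is an assumption, since the text does not state which ten cycles were extracted.

```python
def gait_speed(distance_m, duration_s):
    """Mean walking speed (m/s) over a timed walkway."""
    return distance_m / duration_s


def select_steady_cycles(initial_contacts, skip=3, n_cycles=10):
    """Return (start, end) times for `n_cycles` steady-state gait cycles.

    A gait cycle runs from one initial contact to the next on the same limb.
    The first and last `skip` cycles are dropped to exclude acceleration and
    deceleration. Taking the first `n_cycles` of the remainder is an assumption.
    """
    cycles = list(zip(initial_contacts[:-1], initial_contacts[1:]))
    steady = cycles[skip:len(cycles) - skip]
    if len(steady) < n_cycles:
        raise ValueError("not enough steady-state cycles")
    return steady[:n_cycles]


# Hypothetical initial-contact times (s), one contact per second.
contacts = [float(i) for i in range(20)]
cycles = select_steady_cycles(contacts)
print(len(cycles), cycles[0], cycles[-1])
print(gait_speed(10.0, 12.5))
```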

Gait stability analysis

All stability metrics, both linear and nonlinear, were computed using the gait cycles remaining after exclusion of the first and last three strides to avoid transient acceleration and deceleration effects. To evaluate gait stability on even and uneven surfaces, we analyzed trunk acceleration in the anterior-posterior (AP), mediolateral (ML), and vertical (VT) directions using MATLAB R2021b (MathWorks Inc., Natick, MA, USA). Linear stability was quantified using the RMS values normalized by squared gait speed55. Nonlinear stability metrics included the HR28, SampEn45,56, RQA56,57,58, and sLE59,60,61,62, capturing smoothness, irregularity, periodicity, and local dynamic stability, respectively. The detailed computation procedures are provided in Supplementary Table S6.

Biomechanical parameters analysis

Joint angles (hip, knee, and ankle) were calculated using OpenPose (v1.7.0), a markerless motion capture system based on video recordings63, validated against optical motion capture64. Signals were low-pass filtered using a zero-lag fourth-order Butterworth filter (6 Hz)65 and time-normalized to 100 points per gait cycle. Peak angles were extracted from the paretic limb in the PwS group and the right limb of the HC group. The EMG signals were bandpass-filtered (20–500 Hz), mean-centered, rectified, and low-pass-filtered at 10 Hz66. Each signal was normalized to the individual’s maximum amplitude and averaged over 10 strides from three trials67. Co-contraction indices (CIs) were computed from the overlap between TA and SOL (shank) and rectus femoris–BF (thigh) activity66, averaged separately for the stance and swing phases across the conditions. Preprocessing was performed according to the SENIAM guidelines.
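The time normalization step can be sketched with simple linear interpolation, as below. The function name is hypothetical, and linear interpolation is an assumption, since the text does not specify the resampling method. The Butterworth filtering step is not shown because it would require a signal-processing library.

```python
def time_normalize(samples, n_points=100):
    """Linearly resample one gait cycle to `n_points` equally spaced points.

    The first and last output points coincide with the first and last input
    samples, so every cycle spans 0 to 100% regardless of its duration.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if n_points < 2:
        raise ValueError("need at least two output points")
    last = len(samples) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)  # fractional index into `samples`
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return resampled


# Hypothetical joint-angle trace (degrees) for one cycle of 137 frames.
cycle = [float(i) for i in range(137)]
normalized = time_normalize(cycle)
print(len(normalized), normalized[0], normalized[-1])
```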

Step 1. Stroke detection

This step aimed to identify trunk acceleration features during uneven surface walking that differentiated PwS from HC based on 19 acceleration-based stability indicators.

Glass box and black box models

Data normalization

Preprocessing is essential in machine learning (ML), particularly for imbalanced datasets68. Outliers were detected using the interquartile range (IQR) and replaced with median values, owing to the robustness of the IQR to extreme and non-normal data. Skewed features were normalized using a power transformation, and class distributions were visualized to assess the imbalance. After cleaning and transformation, all features were standardized using z-score normalization to accommodate algorithms sensitive to feature magnitude69.
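The outlier replacement and standardization steps can be sketched as follows. The 1.5 × IQR fence is a common convention and is assumed here, since the multiplier is not stated in the text; the power transformation is omitted from the sketch, and the function names and data are hypothetical.

```python
import numpy as np

def replace_outliers_with_median(x, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.

    k = 1.5 is an assumed convention, not a value given in the source.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    is_outlier = (x < q1 - k * iqr) | (x > q3 + k * iqr)
    cleaned = x.copy()
    cleaned[is_outlier] = np.median(x)
    return cleaned

def zscore(x):
    """Standardize to zero mean and unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=0)

# Hypothetical feature column with one extreme value
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
cleaned = replace_outliers_with_median(feature)
standardized = zscore(cleaned)
```

Here the value 100 lies above the upper fence and is replaced by the column median (3.5) before standardization.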

Feature selection

Feature selection was performed in two steps. First, variables with pairwise correlations ≥ 0.9 were reduced to minimize multicollinearity70. Subsequently, both the Boruta algorithm71 and LASSO72 were applied to the data. The features selected by both methods were deemed robust and retained as key discriminative variables73.
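The two-step logic can be sketched as follows. The greedy rule for deciding which member of a highly correlated pair to keep, the use of absolute correlations, and all feature names are assumptions for illustration; the Boruta and LASSO outputs are represented by hypothetical sets rather than computed.

```python
import numpy as np

def drop_highly_correlated(X, names, threshold=0.9):
    """Keep a feature only if |r| < threshold with every feature already kept.

    Keeping the first feature of each correlated pair is an assumed rule.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Hypothetical data: the second column is a near-duplicate of the first
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
names = ["rms_ml", "rms_ml_dup", "sampen_ap"]
kept = drop_highly_correlated(X, names)

# Step 2: retain only the features chosen by both selectors (hypothetical outputs)
boruta_selected = {"rms_ml", "sampen_ap"}
lasso_selected = {"rms_ml"}
consensus = [f for f in kept if f in boruta_selected and f in lasso_selected]
```

Taking the intersection of the two selectors is conservative: a feature is retained only when both a wrapper method and a penalized regression agree on it.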

Machine learning algorithms

The dataset with the selected features was split into training (80%) and testing (20%) sets using stratified random sampling. To address class imbalance and enhance model stability, three data augmentation methods were applied to the training set: SMOTE, GAN, and ctGAN. To avoid data leakage, the augmented instances were not shared across the cross-validation folds.

To investigate the differences in trunk acceleration features during uneven-surface walking between PwS and HC, an appropriate sample size was estimated. Prior studies have reported effect sizes (Cohen’s d = 0.65–0.75) for RMS metrics during uneven walking9, and assuming a more conservative, moderate effect size of 0.5, a power analysis (α = 0.05, power = 0.8) indicated that 64 participants in each group would be required. Furthermore, previous IMU-based research28 suggested that approximately 100 participants per group would be desirable for more stable classifier training. In the present study, the final real-world sample comprised 71 PwS and 39 HC. To stabilize model training under this imbalance, data augmentation was used to increase the minority class up to 100 samples and, additionally, to generate an extended set of 1000 synthetic instances for robustness analyses. These augmented samples were used solely for model training and do not increase statistical power or substitute for real participant data.
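The figure of 64 participants per group can be reproduced with a standard approximation for a two-sided, independent-samples t-test. The choice of test and the small-sample correction term (z²/4) are assumptions, because the text does not state how the power analysis was performed; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample t-test.

    Uses the normal approximation plus a z^2/4 correction for the t
    distribution; both the test type and the correction are assumptions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    return ceil(n_normal + z_alpha ** 2 / 4)

required = n_per_group(0.5)  # effect size assumed in the text
```

With d = 0.5, α = 0.05, and power = 0.8, this approximation gives 64 per group, matching the reported requirement.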

SMOTE generates synthetic samples by interpolating existing instances of the minority class and is widely used to improve the classification performance of imbalanced datasets74. GAN and ctGAN are deep learning models that generate synthetic samples via adversarial training between the generator and discriminator75. GANs have been shown to improve deep learning performance, particularly in clinical applications with limited data76. ctGAN extends this by enabling conditional generation based on class labels and is optimized for tabular data77, showing improved predictive performance78. The architecture of the model is shown in Supplementary Table S7.
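The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the imblearn implementation used in the study; the function name, the neighbor count k = 5, and the random data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                  # random minority sample
        j = neighbors[i, rng.integers(k)]    # one of its k nearest neighbors
        lam = rng.random()                   # interpolation weight in [0, 1)
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# Hypothetical minority class: 39 samples with 4 features, raised to 100 in total
X_min = np.random.default_rng(1).normal(size=(39, 4))
synthetic = smote_sketch(X_min, n_new=61)
```

Because each synthetic point lies on a line segment between two real minority samples, the generated data stay within the range of the observed minority class.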

We developed six ML models: two glass-box models (LR and decision tree [DT]) and four black-box models (SVC, XGBoost, RF, and k-nearest neighbors [KNN]). These models were selected to capture both linear and nonlinear decision boundaries, while balancing interpretability (glass-box models) with predictive performance (black-box models). The hyperparameters were optimized using a 5-fold cross-validated grid search (Supplementary Table S8). The models were evaluated on the test set using the ROC AUC, sensitivity, specificity, F1 score, and Brier score. Taken together, this framework enabled systematic exploration of multiple augmentation–model pairings, which we considered essential for identifying robust classifier configurations rather than assuming that any single augmentation strategy or algorithm would perform best.
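For reference, the evaluation metrics listed above can be computed from predicted probabilities as sketched below. The 0.5 decision threshold, the coding of PwS as the positive class (1), the function name, and the example values are assumptions for illustration.

```python
import numpy as np

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, F1, Brier score, and ROC AUC for binary labels.

    Assumes the positive class is coded 1 and both classes are present.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "brier": float(np.mean((y_prob - y_true) ** 2)),
        "roc_auc": float(auc),
    }

# Hypothetical test-set labels and predicted probabilities
metrics = classification_metrics([1, 1, 1, 0, 0], [0.9, 0.8, 0.4, 0.3, 0.6])
```

The Brier score complements the threshold-based metrics because it penalizes poorly calibrated probabilities even when the predicted class is correct.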

To obtain robust performance estimates and limit the influence of any single random split, data augmentation and ML model training were repeated 50 times with different random seeds (random state = 42), and the average performance metrics were computed79. Based on these results, the top-performing models from the glass-box and black-box categories were selected for interpretation.
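One possible way to organize such repeated runs is sketched below. Treating 42 as a base seed from which the 50 per-run seeds are derived is an assumption made for this illustration, not something the text states; `run_once` and `toy_run` are hypothetical stand-ins for one full augmentation, training, and evaluation cycle.

```python
import numpy as np

def repeated_scores(run_once, n_repeats=50, base_seed=42):
    """Call run_once(seed) with n_repeats seeds derived from base_seed and
    return the mean and sample standard deviation of the returned metric."""
    seeds = np.random.SeedSequence(base_seed).generate_state(n_repeats)
    scores = np.array([run_once(int(s)) for s in seeds])
    return scores.mean(), scores.std(ddof=1)

def toy_run(seed):
    """Hypothetical placeholder: returns a noisy score instead of training a model."""
    rng = np.random.default_rng(seed)
    return 0.85 + 0.02 * rng.standard_normal()

mean_score, sd_score = repeated_scores(toy_run)
```

Deriving all seeds from one base value keeps the whole procedure reproducible while still varying the random split and augmentation across runs.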

Model interpretation was conducted using odds ratios for LR and SHAP values80 for the black-box models. SHAP quantifies each feature’s contribution to the model output by averaging its marginal effect across all feature combinations, thereby providing both importance and directional insight.
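The averaging over feature combinations can be made concrete with an exact Shapley computation on a toy model. This brute-force sketch is for illustration only and is not how the SHAP package computes its estimates; the toy model, the zero baseline used for "absent" features, and all names are assumptions.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all subsets of the remaining features."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical model with an interaction term, explained at x = (1, 3, 2)
x = (1.0, 3.0, 2.0)
baseline = (0.0, 0.0, 0.0)  # assumed value for features that are "absent"

def model(v):
    return 2 * v[0] + v[1] + v[0] * v[2]

def value_fn(present):
    masked = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(masked)

phi = shapley_values(value_fn, n_features=3)
```

The contributions sum to the difference between the prediction at x and the baseline prediction (7 in this toy case), and the interaction term is split equally between the two features involved.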

Model with embedded feature selection

We also used sparse partial least squares discriminant analysis (sPLS-DA)21, which, unlike the two-step feature selection approach described above, performs variable selection and classification in a single step. As with the other models, the data were z-score-normalized beforehand. Model performance was evaluated using the ROC AUC, and interpretation was based on variable importance in projection (VIP) scores81.

Step 2. Prediction of uneven-surface gait in stroke

Demographics, physical function, and even-surface gait parameters, including speed, trunk acceleration, paretic limb EMG, and joint kinematics, were used as input features to develop ML regression models for predicting gait speed and the key stability indicators identified in Step 1 during uneven-surface walking.

Data normalization

Even-surface parameters for PwS were preprocessed as in Step 1: outliers (IQR-based) were replaced with medians, and then power transformation and z-score normalization were applied.

Feature selection

Feature selection was performed in two steps, as in Step 1. First, variables with pairwise correlations ≥ 0.9 were reduced to minimize multicollinearity70. Subsequently, both the Boruta algorithm71 and LASSO72 were applied to the data. The features selected by both methods were deemed robust and retained as key predictive variables82.

Machine learning algorithm

After feature selection, the dataset was divided into training (80%) and testing (20%) sets using stratified random sampling. Six ML models were developed: three linear (Linear Regression, SVR, and Elastic Net) and three nonlinear (RF, XGBoost, and KNN) models. The hyperparameters were optimized using a grid search with 5-fold cross-validation (Supplementary Table S9). Model performance on the test set was evaluated using R², root mean squared error (RMSE), mean absolute error (MAE), and mean squared error (MSE). To ensure robustness, reduce reporting bias, and limit overfitting, model training was repeated 50 times with different random seeds (random state = 42), and the average metrics were reported79. Based on overall performance, the best model in each group (linear and nonlinear) was selected. SHAP, partial dependence plots (PDP), and combined SHAP + PDP plots were used to interpret individual feature contributions and their effects on the predictions. A visual summary of the entire two-step workflow is provided in Supplementary Fig. S9.
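The four regression metrics can be written out explicitly as in the following sketch. The function name and the example values are hypothetical and serve only to show how the quantities relate to one another.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, RMSE, MAE, and MSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_res = float(np.sum(resid ** 2))                     # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(resid))),
        "mse": mse,
    }

# Hypothetical observed and predicted values
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```

RMSE and MAE are expressed in the units of the predicted variable, whereas R² is unitless and can be negative on a test set when the model performs worse than predicting the mean.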

Statistical analysis

Age, height, weight, and BMI were compared between the PwS and HC groups using Welch’s t-tests, and sex distribution was examined using Fisher’s exact test. Statistical analyses were performed using R software (version 4.1.2), with the significance level set at p < 0.05. All ML-related analyses were conducted using Python version 3.12.7 (with the seaborn, sklearn, imblearn, TensorFlow, SDV, CTGAN, Optuna, and SHAP packages) and R version 4.3.3 (with the mixOmics, Boruta, and glmnet packages)21.