Introduction

Gait recognition, as a non-invasive and efficient biometric technology, has shown unique value in a variety of applications, particularly in clinical healthcare, where it offers a fresh perspective for the early diagnosis and monitoring of movement disorders such as Parkinson's disease (PD)1. PD is a prevalent chronic neurological condition that impairs patients' motor functions, and its progression is intricately linked to changes in gait patterns. The primary clinical manifestations of the disease include resting tremor, bradykinesia and reduced movement, increased muscle tone, and impaired postural stability2. Traditional diagnostic approaches depend predominantly on manual assessments, which are not only time-consuming and labor-intensive for both doctors and patients but also susceptible to subjective bias. Consequently, the development of an automated recognition method for diagnosing and monitoring PD patients is of paramount importance3.

In recent years, the rapid advancement of deep learning has opened up new horizons for automated gait recognition. Convolutional neural networks (CNNs)4, a cornerstone of deep learning, have demonstrated remarkable performance in image recognition owing to their robust automatic feature extraction capabilities. In the specific domain of gait recognition, numerous innovative deep learning frameworks have been proposed, including GaitSet5, GaitPart6, and GaitBase7. However, gait recognition for Parkinson's patients differs from the traditional task. Historically, gait recognition has been used primarily for pedestrian identification8, whereas distinguishing Parkinson's patients is a categorical decision, shifting the task from fine-grained identification to coarse-grained classification. The gait characteristics of Parkinson's patients diverge markedly from those of the general population, and capturing these subtle differences and classifying them accurately presents a significant research challenge. Existing gait recognition frameworks, tailored to specific datasets and often used for experimental research8, struggle to capture the nuances of abnormal gait features in patients and fail to distinguish these features accurately from those of non-patients. Furthermore, existing research on computer vision-based Parkinson's gait recognition faces several challenges, such as difficulty in extracting abnormal gait features from patients9, limited recognition accuracy2, and issues with data diversity and imbalance9. The exploration of intelligent auxiliary discrimination methods could therefore significantly aid physicians in remote diagnosis.

To address the limitations of existing gait recognition architectures in feature capture, sampling flexibility, and generalization ability for Parkinson’s disease (PD) gait analysis, we propose an enhanced hybrid architecture based on GaitBase, integrating four specialized expert subnetworks and adaptive feature fusion mechanisms. The new architecture overcomes the fixed-kernel sampling constraints of traditional models by introducing Linear Deformable Convolution (LDConv) in the Local Motion Expert, enabling dynamic adjustment of convolution kernel shapes to capture irregular gait patterns. Additionally, the multi-scale attention mechanism (EMA module) in the Global-Info Expert enhances cross-dimensional feature interaction, while the Contour Expert employs Sobel convolution and Transposed Convolution to extract high-resolution edge features for gait dynamics. The Spatio-Temporal Expert integrates LSTM with adaptive temporal pooling to model long-term gait temporal dependencies, forming a comprehensive framework for PD gait recognition.

Through comparative studies and ablation experiments on the CASIA-B and OU-MVLP datasets, we validate the effectiveness of the improved model and demonstrate its superior performance in gait feature extraction and recognition tasks. To further assess the model's practical value in Parkinson's disease recognition, we collaborated with the Third People's Hospital of Fujian Province to collect gait video data from 44 Parkinson's disease patients and 42 healthy controls. We expanded the number of frame samples using data augmentation techniques and extracted gait contours with foreground-background separation algorithms10 to train the enhanced model. A series of experimental results confirms that the model performs exceptionally well in Parkinson's gait recognition tasks, achieving high accuracy and robustness. Our contributions in this paper are as follows:

1. We propose GRE-MFF-GaitBase (Gait Recognition Expert with Multi-Focus Fusion-enhanced GaitBase), a hybrid architecture integrating four specialized expert subnetworks, namely the Global-Info Expert, Spatio-Temporal Expert, Local Motion Expert, and Contour Expert, into the backbone GaitBase. This design enhances feature extraction and gait recognition accuracy by enabling dynamic adaptation to irregular gait patterns.

2. We conducted extensive experiments on the CASIA-B and OU-MVLP datasets to demonstrate the superiority of GRE-MFF-GaitBase compared to state-of-the-art methods.

3. We investigated the transfer learning capabilities of the improved model in recognizing gait patterns of Parkinson's disease (PD) patients and achieved effective recognition on our customized PD gait dataset, which holds promise as a new method for the auxiliary detection of Parkinson's disease.

The remainder of this paper is organized as follows: Section 2 reviews related research in gait recognition and Parkinson's disease diagnosis. Section 3 introduces the relevant datasets and the dataset processing techniques. Section 4 presents the enhancements made to the hybrid architecture. Section 5 reports the experimental results and their analysis. Section 6 discusses the limitations of our work. Finally, Section 7 summarizes the research findings and offers perspectives for future work.

Related works

Early research on gait recognition in Parkinson's disease centered primarily on quantitative data analysis. A prevalent approach involved collecting gait kinematics data from patients using various wearable sensor devices and conducting quantitative analysis with the aid of data analysis tools. For instance, Camps et al.11 utilized motion sensors to gather motion data, which was then classified using deep learning techniques. Chen et al.12 employed lower-limb pressure sensors to capture pressure changes during a patient's walk, enabling the detection of abnormal gaits. Additionally, Mazilu et al.13 employed supervised machine learning to detect freezing of gait (FOG) in Parkinson's patients by analyzing acceleration data from different body parts.

In the realm of vision-based gait feature acquisition, researchers rely on cameras to capture motion videos and extract gait features or quantitative parameters. For example, the Microsoft Kinect14 somatosensory system, built around infrared projectors, color cameras, and infrared depth cameras, collects motion signals and can identify 25 human joint points. This method offers significant advantages and excellent real-time performance, accurately capturing subjects' gait trajectories in complex environments; however, the equipment is precise and costly, demanding high standards for instruments, location, and operators. Jung et al.15 conducted gait analysis based on 2D video, successfully obtaining gait parameters of Parkinson's patients through video tracking. Yet this method yields relatively coarse gait information and lacks the precision of sensor-based measurements. Furthermore, Wang et al.9 proposed an LSTM-based feature learning method for detecting freezing of gait in Parkinson's disease, a task characterized by time-series data.

As deep learning continues to make strides in computer vision, numerous researchers have explored human recognition from a non-contact machine vision perspective, with the goal of applying these techniques to real-world scenarios. Gait recognition technology has garnered significant interest due to its notable advantages: it can learn the shape features of a target from appearance, achieve non-contact detection of the human body, and maintain functionality under low-resolution conditions, offering high accuracy and convenience. Traditional gait recognition encompasses three key technologies8: gait segmentation, feature extraction, and gait comparison. Given its sensitivity to appearance changes, current research focuses primarily on gait feature extraction and gait time-series modeling. For instance, the GaitSet framework introduced by Chao et al.5 innovatively treats gait sequences as sets and compresses frame-level spatial feature sequences using the maximum function, which is both simple and effective. The GaitPart architecture proposed by Fan et al.6 employs a frame-level part feature extractor (FPFE) and a micro-motion capture module (MCM) to model temporal dependencies and better extract gait data characteristics. The CSTL architecture put forward by Huang et al.16 concentrates on temporal characteristics across three scales and obtains motion representations based on temporal context information. GaitGraph17 utilizes human skeleton posture as a representation of gait, arguing that this approach extracts gait features more clearly than traditional contour images; it enhances gait recognition performance by directly estimating robust skeleton poses from RGB images and leveraging the powerful spatio-temporal modeling capabilities of Graph Convolutional Networks (GCNs). The GaitBase architecture proposed by Fan et al.7 is renowned for its simplicity and strong performance: it uses a ResNet-like network as the backbone and takes gait contours as input, transforming gait features into a 3D feature map for more effective extraction. Castro et al.18 proposed AttenGait, a novel gait recognition model that uses a trainable attention mechanism and supports multiple rich modalities to enhance gait feature extraction. Furthermore, Ye et al.19 introduced BigGait, a gait recognition framework based on large vision models (LVMs); it employs a Gait Representation Extractor (GRE) module to transform general LVM features into effective gait representations and uses a multi-branch structure to extract features, significantly outperforming existing methods.

Gait dataset of patients with Parkinson’s disease

Fig. 1. Dataset pre-processing flow.

The construction process of the Parkinson's gait dataset for this paper is depicted in Fig. 1. The process comprises three main steps: data acquisition, pre-processing, and creation of patient gait masks. In partnership with the Third People's Hospital of Fujian Province, we utilized the video capture facilities of the gait laboratory to systematically collect gait video data from 44 patients with Parkinson's disease and 42 healthy controls over a period of approximately one year. Each gait video sample is approximately 100 seconds in duration.

Our data processing begins with filtering the dataset frame by frame to eliminate corrupted data. To compensate for the limited number of filtered samples, we then apply data augmentation techniques, including horizontal flipping, grayscale conversion, and noise addition, to expand the dataset. Examples of the augmented images are presented in Fig. 2.
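As an illustration of this step, the following minimal OpenCV sketch applies the three augmentations named above to a single frame; the noise level (sigma) is an illustrative assumption not specified in the paper.

```python
import cv2
import numpy as np

def augment(frame: np.ndarray) -> list:
    """Produce the three augmented variants described above."""
    flipped = cv2.flip(frame, 1)                    # horizontal flip
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # grayscale conversion
    # additive Gaussian noise; sigma=10 is an assumed, illustrative value
    noise = np.random.normal(0, 10, frame.shape)
    noisy = np.clip(frame.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    return [flipped, gray, noisy]
```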

Fig. 2. Examples of augmented image samples.

For data processing, we follow the public standardized gait dataset CASIA-B20, a classic open-source gait dataset comprising 124 individuals under three walking conditions: normal walking (NM), walking with a bag (BG), and walking in a coat (CL). The dataset is collected from multiple viewpoints and provides a standard gait mask dataset, as illustrated in Fig. 3. After examining the original gait video data, we focused solely on the normal walking (NM) condition and restricted the viewing angle to \(90^{\circ }\). The \(90^{\circ }\) viewpoint is commonly used in clinical gait analysis because it provides a lateral (sagittal) view that captures essential gait characteristics, such as stride length, cadence, and posture, which are critical for Parkinson's disease assessment; this perspective is particularly valuable for identifying the gait abnormalities associated with Parkinson's disease. This approach also aligns with the structure of the CASIA-B dataset, which categorizes data by walking condition and view angle, providing a comprehensive resource for gait recognition research.

Fig. 3. CASIA-B dataset: upper (NM), middle (BG), lower (CL).

In the phase of gait contour extraction, we employ the conventional method of foreground and background separation. The essence of this approach is to isolate moving foreground objects while either retaining or discarding the static background. Within the OpenCV21 tool library, a range of foreground-background separation techniques are available, including MOG222, KNN23, and others. For this study, we utilize the MOG222 algorithm to distinguish Parkinson’s patients from the background. Following separation, we further process the extracted samples through operations such as erosion, dilation, and binarization to construct the data samples as depicted in Fig. 4. The corresponding sample labels are presented in Table 1.
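To make this step concrete, the sketch below shows one way to implement the described pipeline with OpenCV's MOG222 background subtractor followed by binarization, erosion, and dilation; the history, variance threshold, kernel size, and input file name are illustrative assumptions.

```python
import cv2

# Hedged sketch of the silhouette-extraction pipeline described above.
cap = cv2.VideoCapture("patient_gait.mp4")  # hypothetical input path
mog2 = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                          detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog2.apply(frame)                                    # foreground mask
    _, mask = cv2.threshold(fg, 127, 255, cv2.THRESH_BINARY)  # drop shadows, binarize
    mask = cv2.erode(mask, kernel)                            # remove speckle noise
    mask = cv2.dilate(mask, kernel)                           # restore silhouette body
    # `mask` is now one binary gait silhouette frame, as in Fig. 4
cap.release()
```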

Fig. 4. Gait dataset after preprocessing (partial).

Table 1 Partial sample labels.

Methodology

The fundamental principle of our model is to refine and augment the feature extraction capabilities of the backbone of the widely used GaitBase7 framework. To achieve this, we develop a comprehensive hybrid architecture encompassing multiple pivotal components. The input gait data first undergoes augmentation, followed by an initial feature transformation using Linear Deformable Convolution (LDConv). A routing layer (Router) then distributes the processed feature maps to four parallel, specialized focus-fusion expert subnetworks, dedicated respectively to global information, spatio-temporal dynamics, local motion patterns, and contour features. After feature extraction, the model consolidates these features through a horizontally partitioned fully connected layer (SeparateFC) and employs a batch normalization bottleneck (BNNeck) to stabilize training. Through this multi-perspective, multi-level expert collaboration, the architecture aims at a deep and effective representation of gait data.

The overall architecture of GRE-MFF-GaitBase (Gait Recognition Expert with Multi-Focus Fusion-enhanced GaitBase) is shown in Fig. 5.

Fig. 5. Model architecture: DA denotes Data Augmentation, SeparateFC the horizontally partitioned fully connected layer, LDConv Linear Deformable Convolution, and BNNeck the batch normalization neck.
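To make the routing-and-fusion flow concrete, the following PyTorch sketch shows one plausible realization of the Router and the SeparateFC head; the softmax gating over expert outputs and the stand-in expert blocks are illustrative assumptions, not the authors' exact design (the experts themselves are detailed in the following subsections).

```python
import torch
import torch.nn as nn

def expert(channels):
    # stand-in for one specialized expert subnetwork
    return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                         nn.BatchNorm2d(channels), nn.ReLU())

class RouterFusion(nn.Module):
    """Route a shared feature map to four parallel experts and fuse
    their outputs with learned softmax gates (gating form assumed)."""
    def __init__(self, channels, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([expert(channels) for _ in range(num_experts)])
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, num_experts))

    def forward(self, x):                        # x: (N, C, H, W)
        w = torch.softmax(self.gate(x), dim=1)   # (N, E) routing weights
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (N, E, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)

class SeparateFC(nn.Module):
    """Horizontal partitioning followed by part-wise FC layers,
    consistent with GaitBase-style heads; assumes H % num_parts == 0."""
    def __init__(self, channels, num_parts=16, out_dim=256):
        super().__init__()
        self.num_parts = num_parts
        self.fcs = nn.ModuleList([nn.Linear(channels, out_dim)
                                  for _ in range(num_parts)])

    def forward(self, f):                        # f: (N, C, H, W)
        n, c, h, w = f.shape
        # max-pool each horizontal strip down to a (N, C) vector
        parts = f.view(n, c, self.num_parts, h // self.num_parts, w)
        parts = parts.max(-1).values.max(-1).values          # (N, C, P)
        return torch.stack([fc(parts[:, :, p])
                            for p, fc in enumerate(self.fcs)], dim=2)
```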

Gaitbase framework

The GaitBase7 framework, proposed by Fan et al.7 in 2023, is a baseline model for gait recognition. The model consists of four main components: a Data Augmentation (DA) module, a ResNet-like backbone, Temporal Pooling (TP), and BNNeck. GaitBase uses a ResNet-like network as its backbone, composed mainly of two-dimensional convolutional modules and four residual blocks. Each input gait silhouette frame is processed through this backbone and the Temporal Pooling (TP) module to output a 3D feature map with height, width, and channel dimensions7, representing a set-level understanding of the input gait sequence. This 3D feature map is then horizontally divided into several parts, each of which is pooled into a feature vector and mapped to a metric space through a separate fully connected layer. Finally, BNNeck7 is employed to adjust the feature space, with a combination of triplet loss and cross-entropy loss serving as the loss function for training. The formulas are as follows:

$$\begin{aligned} Loss_{triplet}&=\sum _{i=1}^{N}\max \left( 0,\,dist(\alpha _i,p_i)-dist(\alpha _i,n_i)+margin\right) \end{aligned}$$
(1)
$$\begin{aligned} Loss_{cross-entropy}&=-\sum _{i=1}^{M}y_i\log (\hat{y}_i) \end{aligned}$$
(2)
$$\begin{aligned} Loss_{all}&=Loss_{triplet}+Loss_{cross-entropy} \end{aligned}$$
(3)

\(Loss_{triplet}\) is the triplet loss function, where \(\alpha _{i}\) is the feature vector of the anchor sample, \(p_{i}\) that of the positive sample, and \(n_{i}\) that of the negative sample; \(dist(x, y)\) denotes the distance between two samples, typically the Euclidean distance between their feature vectors. \(Loss_{cross-entropy}\) is the cross-entropy loss function, where \(y_{i}\) is the one-hot encoding of the true label and \(\hat{y}_i\) is the probability distribution predicted by the model. The overall loss of GaitBase is the sum of \(Loss_{triplet}\) and \(Loss_{cross-entropy}\).
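For reference, a compact PyTorch sketch of Eqs. (1)-(3) is given below; the batch-hard triplet mining strategy is an assumption, since the paper does not state how triplets are selected, and the batch is assumed to contain at least two identities.

```python
import torch
import torch.nn.functional as F

def combined_loss(embeddings, logits, labels, margin=0.2):
    """Batch-hard triplet loss (Eq. 1) plus cross-entropy (Eq. 2),
    summed as in Eq. (3). Mining strategy is an assumption."""
    # pairwise Euclidean distances between all embeddings in the batch
    d = torch.cdist(embeddings, embeddings)            # (N, N)
    same = labels[:, None] == labels[None, :]          # positive-pair mask
    d_ap = (d * same.float()).max(dim=1).values        # hardest positive
    d_an = d.masked_fill(same, float("inf")).min(dim=1).values  # hardest negative
    loss_triplet = F.relu(d_ap - d_an + margin).mean()          # Eq. (1)
    loss_ce = F.cross_entropy(logits, labels)                   # Eq. (2)
    return loss_triplet + loss_ce                               # Eq. (3)
```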

Gait recognition expert with multi-focus fusion

We design the internal structures of the four expert subnetworks, namely the Global-Info Expert, Spatio-Temporal Expert, Local Motion Expert, and Contour Expert, separately. Each expert network receives the feature maps handed down from the Router and focuses on extracting the information of interest to it. The structure of the Gait Recognition Expert with Multi-Focus Fusion (GRE-MFF) is detailed in Fig. 6.

Fig. 6. The structure of the Gait Recognition Expert with Multi-Focus Fusion (GRE-MFF).

Global-info expert

The primary role of the Global-Info Expert is to extract stable and globally representative information from the input data while minimizing distraction from local details or noise. To this end, we adopt the Efficient Multi-Scale Attention (EMA) module24, which reconstructs and aggregates information from specific channel dimensions into the batch dimension. This allows multi-scale feature learning across spatial dimensions, fundamentally enhancing the capability to process features of varying spatial resolutions. This methodology fosters a more comprehensive, three-dimensional understanding of images, which is a critical source of global information.

Moreover, the EMA module facilitates the integration of outputs from diverse processing branches through cross-dimensional feature interaction, thereby capturing detailed pixel-level spatial relationships. This cross-scale and cross-channel interaction reinforces the coherence and global consistency of feature representation, culminating in a more unified and robust interpretation of image content. The detailed internal structure of EMA is depicted in Fig. 7.

Fig. 7. Efficient Multi-Scale Attention (EMA) module architecture: C is the number of channels, H and W are the height and width, and G is the number of groups.

As shown in Fig. 7, the EMA module first partitions the input features into G groups (\(X_1\) to \(X_G\)) along the channel dimension, so that each group retains dimensions (C/G, H, W). In the parallel subnetwork, two branches then process these groups: one applies adaptive average pooling (X-Avg Pool) across the spatial dimensions and aggregates the results via a \(1 \times 1\) convolution, while the other uses a \(3 \times 3\) convolution to preserve fine-grained details; both branches output sigmoid-weighted maps that reweight channel-wise features to enhance discriminative representations. Subsequently, the cross-spatial learning module introduces dual cross-scale interactions: one branch normalizes the group features (Group Norm) and aggregates them via average pooling to generate spatial attention through sigmoid activation, while the other computes spatial attention directly from the downsampled features; the resulting attention maps are matrix-multiplied to model long-range spatial dependencies and capture both local and global spatial correlations.
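The sketch below follows the published EMA reference design24 to illustrate the grouping, dual-branch weighting, and cross-spatial matrix multiplication described above; the kernel sizes and group count are taken from that reference implementation and may differ from the configuration used in this paper (channels are assumed divisible by the group count).

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Sketch of Efficient Multi-Scale Attention24."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # X-Avg Pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # X-Avg Pool along height
        self.gn = nn.GroupNorm(cg, cg)
        self.conv1x1 = nn.Conv2d(cg, cg, 1)
        self.conv3x3 = nn.Conv2d(cg, cg, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        gx = x.reshape(b * self.g, c // self.g, h, w)   # channel groups -> batch dim
        # 1x1 branch: directional pooling, shared 1x1 conv, sigmoid gates
        xh = self.pool_h(gx)                            # (bg, cg, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)        # (bg, cg, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(gx)                           # 3x3 branch keeps local detail
        # cross-spatial learning: each branch's global descriptor attends
        # to the other branch's spatial map via matrix multiplication
        y1 = torch.softmax(x1.mean((2, 3)), dim=1).unsqueeze(1)          # (bg, 1, cg)
        y2 = torch.softmax(x2.mean((2, 3)), dim=1).unsqueeze(1)
        z1 = torch.matmul(y1, x2.reshape(b * self.g, c // self.g, -1))   # (bg, 1, h*w)
        z2 = torch.matmul(y2, x1.reshape(b * self.g, c // self.g, -1))
        attn = (z1 + z2).reshape(b * self.g, 1, h, w).sigmoid()
        return (gx * attn).reshape(b, c, h, w)
```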

Spatio-temporal expert

Fig. 8. The structure of the Spatio-Temporal Expert.

The core of gait recognition is precisely capturing and identifying the evolving pattern of an individual's body posture over time during walking. This process integrates spatial information, including the three-dimensional coordinates of joints and the geometric relationships between body parts, as well as temporal information, such as the inherent rhythm of the gait cycle and the sequential dynamics of postures. The Spatio-Temporal Expert module, shown in Fig. 8, is designed to process and extract these complex features embedded in the spatial and temporal dimensions.

In the initial phase of feature processing, the input data is routed through the Temporal Pooling (TP) module25 for basic temporal feature compression. Considering the inherent variability of gait sequence lengths, the TP module is optimized to address this challenge. For fixed-length sequences, it performs straightforward temporal pooling; for variable-length sequences, it takes a more sophisticated approach: it calculates the starting position of each subsequence for segmentation, pools each segment, and finally concatenates the results. This adaptable mechanism enables the TP module to handle variable-length inputs, greatly enhancing the model's generalization capability. The pooling operations also serve as practical tools for feature extraction and dimensionality reduction, condensing key temporal features and filtering out redundancy. Subsequently, a Long Short-Term Memory (LSTM) network9 receives the feature sequences from the TP module. Since gait patterns often involve complex changes over multiple time steps, the LSTM's strength in capturing long-term temporal dependencies makes it well suited to learning and representing the evolving gait dynamics.
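A minimal sketch of this TP-plus-LSTM pipeline is given below, assuming segment-wise max pooling over evenly spaced boundaries; the segment count and pooling operator are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SpatioTemporalExpert(nn.Module):
    """Segment-wise temporal max pooling followed by an LSTM (sketch)."""
    def __init__(self, feat_dim, hidden_dim, num_segments=4):
        super().__init__()
        self.num_segments = num_segments
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        # x: (N, T, C) frame-level features; assumes T >= num_segments
        n, t, _ = x.shape
        bounds = torch.linspace(0, t, self.num_segments + 1).long().tolist()
        # pool each temporal segment, then concatenate the compressed sequence
        pooled = torch.stack(
            [x[:, bounds[i]:bounds[i + 1]].max(dim=1).values
             for i in range(self.num_segments)], dim=1)   # (N, S, C)
        _, (h_n, _) = self.lstm(pooled)                    # long-term dependencies
        return h_n[-1]                                     # (N, hidden_dim)
```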

The collaboration between the TP module and the LSTM allows the Spatio-Temporal Expert to understand and exploit spatio-temporal gait data comprehensively and deeply. This not only strengthens the model's handling of different gait sequence lengths but also improves its ability to capture long-term temporal dependencies, significantly enhancing the overall robustness and adaptability of the model.

Local motion expert

The core design objective of the Local Motion Expert is to extract and encode the motion information in local regions of image sequences, aiming to accurately capture the subtle motion differences and inherent rhythms presented by specific body parts (e.g. legs and arms) in gait recognition tasks.

To achieve this, we select the LDConv module25, which possesses deformable sampling capabilities. This choice overcomes the limitations of traditional convolutional kernels, whose fixed sizes and shapes make them less effective at adapting to local deformations caused by non-rigid objects or non-uniform motion. The internal structure of LDConv, shown in Fig. 9, builds an offset learning mechanism on top of traditional convolution operations. The module first extracts core features of the input feature map through a basic convolutional layer. A crucial offset prediction branch is then introduced, which either shares the input with the basic convolutional layer or is connected directly to its output. The role of this branch is to predict an offset for each original sampling point: if the basic convolutional kernel has size \(k \times k\), a pair of offsets \((o_i, o_j)\) must be predicted for each original sampling point \((i, j)\), so the number of output channels of this branch is usually set to \(2 \times k \times k\), corresponding to one offset in each of the x and y directions per sampling point. This branch learns and captures the deformation patterns in local regions of the input feature map. Since the computed sampling positions may be non-integer coordinates, bilinear interpolation is used to obtain the pixel values at these positions from the input feature map. Finally, after normalization and an activation function, the output feature map of the module is obtained.

By learning and dynamically adjusting the offsets, LDConv enables the convolutional kernel’s sampling area to adapt according to local motion information. This characteristic allows the receptive field to better fit the actual shape and boundaries of local moving objects, rather than being confined to a preset fixed square area. Therefore, the feature maps extracted by this expert module retain traditional spatial features and implicitly integrate key information of local motion, ultimately generating motion context-aware feature representations.
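To ground this description, the following PyTorch sketch implements the offset-prediction-plus-bilinear-sampling idea using F.grid_sample; the stride-1 setting and the \(1 \times 1\) aggregation convolution are simplifying assumptions, not the exact LDConv25 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Sketch of LDConv-style offset learning with bilinear sampling."""
    def __init__(self, in_c, out_c, k=3):
        super().__init__()
        self.k = k
        # one (dy, dx) offset pair per original sampling point -> 2*k*k channels
        self.offset_pred = nn.Conv2d(in_c, 2 * k * k, kernel_size=3, padding=1)
        self.project = nn.Conv2d(in_c * k * k, out_c, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_c)

    def forward(self, x):
        n, c, h, w = x.shape
        k2 = self.k * self.k
        off = self.offset_pred(x).view(n, k2, 2, h, w)    # learned offsets
        # regular kernel grid, e.g. {-1, 0, 1}^2 for k=3
        base = torch.stack(torch.meshgrid(
            torch.arange(self.k) - self.k // 2,
            torch.arange(self.k) - self.k // 2, indexing="ij"), -1).view(k2, 2).to(x)
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pix = torch.stack((ys, xs), 0).to(x)              # (2, H, W) pixel coords
        samples = []
        for p in range(k2):
            # absolute position = pixel + kernel point + learned offset
            pos = pix.unsqueeze(0) + base[p].view(1, 2, 1, 1) + off[:, p]
            # normalize to [-1, 1] for grid_sample (x first, then y)
            gx = pos[:, 1] / (w - 1) * 2 - 1
            gy = pos[:, 0] / (h - 1) * 2 - 1
            grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
            samples.append(F.grid_sample(x, grid, align_corners=True))
        out = torch.cat(samples, dim=1)                   # (N, C*k*k, H, W)
        return F.relu(self.bn(self.project(out)))         # normalize + activate
```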

Fig. 9. The structure of LDConv; N is the convolution kernel size. The module dynamically adjusts the sampling grid of the convolution kernel by learning offsets.

Contour expert

In current mainstream gait recognition tasks, the contour variation of human gaits serves as a key feature for distinguishing different gait types. Accurately capturing the edge information of human joint movements in gait sequences and preserving subtle contour changes in walking dynamics remain significant challenges. To address this, we designed a Contour Feature Expert for efficient extraction of gait contour information. Composed of Sobel convolution and Transposed Convolution (deconvolution) modules, the main architecture is shown in Fig. 10. This module compensates for the insufficiency of traditional convolutions in extracting detailed edge information from gait sequences, providing high-resolution contour semantic information for subsequent spatio-temporal feature integration.

Fig. 10. The structure of the Contour Expert.

The Contour Expert employs a two-stage architecture. First, the Sobel convolution extracts initial edge features, utilizing two \(3 \times 3\) kernels for horizontal (Sobel-x) and vertical (Sobel-y) gradients. The Sobel-x kernel is defined as:

$$\begin{aligned} \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \end{aligned}$$
(4)

and the Sobel-y kernel as:

$$\begin{aligned} \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \end{aligned}$$
(5)

The gradient magnitude is computed by \(G = \sqrt{G_x^2 + G_y^2}\), generating a low-resolution edge feature map. Subsequently, the Transposed Convolution stage restores edge features to the original size via learnable upsampling. The output size is calculated by:

$$\begin{aligned} \text {Output size} = (W - 1) \times S + K - 2P \end{aligned}$$
(6)

where \(W\) denotes the input size, \(S\) the stride, \(K\) the kernel size (typically \(3 \times 3\)), and \(P\) the padding. This operation flexibly controls the upsampling ratio through the stride and padding, avoiding the feature blurring of fixed interpolation. Both stages employ differentiable operations, supporting end-to-end training and ensuring consistency between contour feature extraction and the subsequent gait recognition task.
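A minimal sketch of this two-stage design follows: fixed Sobel kernels (Eqs. (4)-(5)) applied depthwise, the gradient magnitude \(G\), and a learnable ConvTranspose2d whose output size obeys Eq. (6); the kernel size 4 / stride 2 / padding 1 choice, which doubles the resolution exactly, is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContourExpertSketch(nn.Module):
    """Fixed Sobel gradients followed by learnable upsampling (sketch)."""
    def __init__(self, channels):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.],
                                [-2., 0., 2.],
                                [-1., 0., 1.]])
        # the Sobel-y kernel of Eq. (5) is the transpose of Eq. (4)
        self.register_buffer("kx", sobel_x.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("ky", sobel_x.t().contiguous()
                             .view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels
        # Eq. (6) with K=4, S=2, P=1: (W-1)*2 + 4 - 2 = 2W (exact doubling)
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=4,
                                     stride=2, padding=1)

    def forward(self, x):
        # depthwise Sobel filtering, one kernel per channel
        gx = F.conv2d(x, self.kx, padding=1, groups=self.channels)
        gy = F.conv2d(x, self.ky, padding=1, groups=self.channels)
        g = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)   # gradient magnitude G
        return self.up(g)                           # learnable upsampling
```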

The module enables dynamic contour capture via Sobel gradients to pinpoint instantaneous human contour changes, providing geometric feature foundations for gait pattern discrimination. Transposed Convolution enhances details by upsampling edge features to preserve fine-grained dynamics like joint angles and foot trajectory, proving effective for contour-altering scenarios such as carrying objects. Its cross-scale adaptability handles gait sequences of varying resolutions by restoring abstract edges to high-resolution contours, complementing Spatio-Temporal Experts to boost complex gait representation.

Experimental results

Dataset

CASIA-B is a classic large-scale multi-view gait recognition dataset released by the Chinese Academy of Sciences20. It contains gait data of 124 individuals captured from 11 viewpoints under three walking conditions (NM, BG, CL). Each individual has 110 sequences, comprising approximately 9,000 gait silhouette images. In the experiments, the first 74 individuals were used as the training set and the remaining 50 as the test set.

The OU-MVLP5 dataset, released by Osaka University in 2018, is one of the largest datasets for gait recognition. It contains gait data from 10,307 subjects (5,114 males and 5,193 females) aged 2 to 87 years, collected indoors with 7 network cameras at 14 different angles; each subject provides 28 sequences. The dataset is similar to CASIA-B, but it includes only a single walking condition, making the recognition task relatively simple.

Table 2 The basic characteristics of the participants in the self-constructed dataset.

In addition, our self-built dataset of gait data from Parkinson's patients and healthy controls includes gait walking data of 44 Parkinson's patients and 42 healthy individuals captured at a \(90^{\circ }\) viewpoint. The basic characteristics of the participants are shown in Table 2. Following the normal walking (NM) condition of the CASIA-B dataset, we set up four NM sequences for each individual, resulting in approximately 2,000 gait silhouette images. In the experiments, we followed the conventional 8:2 training-to-testing ratio, randomly selecting 69 individuals for the training set and 17 for the test set.

Comparative experiment

In this paper, we use the CASIA-B and OU-MVLP5 datasets to conduct comparative experiments on our proposed hybrid architecture model, thereby validating its superiority. We used a grid search over the training parameters and determined the best parameter combination on the CASIA-B dataset, detailed in Table 3.

Table 3 Hyperparameter settings for the CASIA-B dataset.

Initially, we train on the CASIA-B dataset for 10,000 epochs using the original GaitBase model as a starting point, and on the OU-MVLP dataset for 80,000 epochs. We then select several strong baseline models, including GaitSet5, GaitPart6, and GaitGraph17, and conduct comparative experiments under the same environmental conditions specified in this paper. GaitSet5 is a flexible and fast cross-view gait recognition network that innovatively treats gait sequences as sets and uses the max function to compress frame-level spatial feature sequences. GaitPart6 explores the local details of input silhouettes and models temporal dependencies through a micro-motion capture module. GaitGraph17 captures the structured information of gait through graph neural networks, better handling complex relationships in gait data.

Table 4 Comparison results from different viewpoints under three walking conditions.
Table 5 Comparison results from different viewpoints on OU-MVLP.

The experimental data in Tables 4 and 5 demonstrate that our new architecture achieves significant performance improvements over several current mainstream baseline models. Notably, on the CL subset of CASIA-B, which has the highest recognition difficulty, our model outperforms existing models in recognition accuracy across multiple dimensions, approaching the performance of the state-of-the-art model AttenGait. Specifically, at the \(36^{\circ }\) acute view, the accuracy improves by 3.80% compared with GaitGL and by 1.50% compared with AttenGait; at the \(90^{\circ }\) view, the accuracy improves by 6.10% compared with GaitGL and by 2.20% compared with AttenGait; at the \(144^{\circ }\) obtuse view, the accuracy shows a 5.30% improvement over GaitGL. Additionally, the model's performance on the large-scale OU-MVLP dataset also exceeds that of existing models; for instance, at the \(30^{\circ }\) acute view, the accuracy is enhanced by 2.70% relative to GaitGL. These findings robustly validate the superior performance and robustness of our model across different viewing angles.

Beyond these numerical improvements, it is crucial to consider the clinical significance of these results. The accuracy improvement on the \(36^{\circ }\) acute angle data, for instance, translates to a higher probability of correctly identifying early signs of gait abnormalities associated with Parkinson’s disease in scenarios where patients are viewed from a relatively common angle in clinical settings. This can potentially lead to earlier intervention and better patient outcomes. Similarly, the improvements at \(144^{\circ }\) and \(180^{\circ }\), though seemingly modest, are significant in contexts where patients may present at unconventional angles due to various factors, ensuring the robustness of the diagnostic aid across diverse real-world situations.

Furthermore, while performance exhibits some variance across viewing angles, this is to be expected given the inherent challenges of gait analysis from non-ideal perspectives, since different viewpoints capture different gait features. For example, the side view shows leg motion more clearly, while the front or back view focuses more on the movement of the upper body. Illumination conditions and the degree of occlusion of body parts also vary across viewpoints, which may affect the accuracy of feature extraction. Nevertheless, our model consistently outperforms the other baselines across most angles, demonstrating its stability and reliability as a tool for assisting Parkinsonian gait assessment.

Ablation experiment

Furthermore, in this study we conduct ablation experiments on the proposed hybrid architecture model to substantiate the superiority of the improved model. We first train on the CASIA-B dataset for 10,000 epochs based on the original GaitBase model. Subsequently, we ablate each module of the enhanced model in turn to analyze how much each of the four designed expert modules contributes to the optimization of the original model. To ensure the reliability and fairness of the experiments, we maintain the same training parameters as for the original model throughout, consistent with the parameters used in the comparative experiments. The ablation results are presented in Table 6; all metrics are taken from the best experimental outcomes under the aforementioned parameter configurations.

Table 6 Ablation results.

The ablation experiments demonstrate that the improved model, by integrating the four expert subnetworks, achieves significant performance gains on the CASIA-B dataset, with recognition accuracies for normal walking (NM), walking with a bag (BG), and walking in a coat (CL) increasing by 5.9%, 6.4%, and 11.4%, respectively, compared to the original GaitBase. Among the ablated modules, removing the Global-Info Expert causes the most pronounced performance drop, reducing NM accuracy from 98.7% to 93.6% and CL accuracy from 89.2% to 82.5%. This module employs the Efficient Multi-Scale Attention (EMA) mechanism to reconstruct and aggregate global features across spatial dimensions, enabling cross-dimensional feature interaction to capture pixel-level spatial dependencies and ensure global representation consistency, which is critical for guiding feature matching in complex scenarios such as CL. Ablating the Spatio-Temporal Expert decreases BG accuracy to 93.4%, reflecting its role in modeling gait dynamics via adaptive temporal pooling (TP) and LSTM: TP handles variable-length sequences through segment-based pooling, while the LSTM captures the long-term temporal dependencies essential for recognizing gait pattern shifts induced by bags. The Local Motion Expert, utilizing LDConv for dynamic kernel adaptation, improves local feature extraction (e.g., joint micro-movements in scissor gaits), and its removal reduces CL accuracy to 84.9%. The Contour Expert, via Sobel and Transposed Convolution, preserves high-resolution edge features, and its ablation drops CL accuracy to 85.1% due to lost structural detail.

In the four-expert subnetwork architecture, the complementary interaction between LDConv (Local Motion Expert) and EMA (Global-Info Expert) forms a pivotal synergistic mechanism. The Local Motion Expert employs LDConv to dynamically adjust convolutional kernels, enabling adaptive capture of irregular gait features such as scissor steps or spastic patterns. However, such adaptive extraction may introduce noise in complex scenarios, which is mitigated by the Global-Info Expert's EMA module: its multi-scale attention weighting smooths the features produced by LDConv, stabilizing representations and enhancing sensitivity to long-term gait dynamics. Conversely, EMA's cross-dimensional feature fusion provides a robust global context for LDConv, allowing it to focus on discriminative local patterns without interference from short-term fluctuations. This synergy is orchestrated by the Router module, which distributes features to the parallel experts and integrates their outputs. The results validate that the integration of LDConv's adaptive local extraction and EMA's global feature smoothing creates a mutually reinforcing loop, enhancing both the flexibility and the stability of gait recognition within the four-expert framework.

Transfer learning for Parkinson’s patients

In this study, leveraging our proposed hybrid model architecture and the custom-built gait dataset of Parkinson's disease patients and healthy individuals, we investigate the transfer learning task of distinguishing the gait of these two groups. Given the challenges of acquiring data from Parkinson's disease patients, our dataset contains a limited number of individuals: although data augmentation expands the image samples, the number of unique individuals remains 86.

In light of the potential impact on model generalization of the small sample size and fixed acquisition perspective of our self-constructed dataset, we incorporated data augmentation, dropout, weight decay (L2 regularization), and early stopping into our training regimen. Furthermore, we employed transfer learning to exploit the knowledge embedded in the model pre-trained on the large-scale CASIA-B dataset, thereby improving generalization.

During the training phase, we adhere to the conventional 8:2 ratio for splitting the dataset into training and testing sets, utilizing 69 random samples for training and 17 random samples for testing. Consequently, we make minor downward adjustments to the parameters and training epochs. We conduct 5,000 rounds of training on the training set using our proposed new architecture model. The performance fluctuations and changes in the loss function based on our proposed model on the training set are depicted in Figs. 11 and 12.
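The following sketch outlines one plausible form of this training regimen: loading CASIA-B pre-trained weights, fine-tuning with weight decay, and early stopping on a held-out split. The checkpoint path, learning rate, patience, and the assumption that the model returns class logits are all illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def fine_tune(model, train_loader, val_loader, epochs=50, patience=5):
    """Fine-tune a CASIA-B pre-trained model with L2 regularization
    and early stopping; names and hyperparameters are illustrative."""
    model.load_state_dict(torch.load("gaitbase_casiab_pretrained.pt"))  # hypothetical path
    opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9,
                          weight_decay=5e-4)          # weight decay = L2 penalty
    best_acc, best_state, stale = 0.0, None, 0
    for _ in range(epochs):
        model.train()
        for silhouettes, labels in train_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(silhouettes), labels)
            loss.backward()
            opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for silhouettes, labels in val_loader:
                correct += (model(silhouettes).argmax(1) == labels).sum().item()
                total += labels.numel()
        acc = correct / total
        if acc > best_acc:                            # early-stopping bookkeeping
            best_acc, best_state, stale = acc, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:                     # stop when validation stalls
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_acc
```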

Fig. 11. Line chart of accuracy change (light colors indicate actual values; dark colors show smoothed curves).

Fig. 12. Line chart of loss function change (light colors indicate actual values; dark colors show smoothed curves).

The model achieves an accuracy of 84.32% on the training set, with a loss value of 1.227. To validate the trained model, we use the held-out test set and introduce a binary confusion matrix, detailed in Table 7. During inference, a secondary discriminator is integrated to categorize the classified identity tags into two groups, Parkinson's disease patients and non-patients, as illustrated in the confusion matrix. Additionally, we calculate key performance metrics of the model, including accuracy, recall, F1 score, and the PR and ROC curves, every 1,000 training rounds. The test outcomes are presented in Figs. 13 and 14.
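For reproducibility, the snippet below shows how these test-set metrics and curves can be computed with scikit-learn; the label and score arrays are illustrative placeholders for the secondary discriminator's outputs (1 = PD patient, 0 = control).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             confusion_matrix, precision_recall_curve,
                             roc_curve, auc)

# placeholder outputs of the secondary discriminator
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.6, 0.1, 0.3])
y_pred = (y_score >= 0.5).astype(int)          # binarize at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred))
print("F1 score:", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

prec, rec, _ = precision_recall_curve(y_true, y_score)  # PR curve points
fpr, tpr, _ = roc_curve(y_true, y_score)                # ROC curve points
print("AUC:", auc(fpr, tpr))
```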

Table 7 Confusion matrix.
Fig. 13. Variation of accuracy, recall, and F1 score on the test set.

Fig. 14. Variation of the PR and ROC curves on the test set.

Based on the calculation results from the test set, the model demonstrates remarkable performance metrics. During training, the model achieves its highest accuracy of 0.96 at 5000 iterations. Across iterations 1000 to 5000, the model maintains an average accuracy of 0.712 with a standard deviation of 0.205, indicating performance stability. Additionally, the model attains a maximum recall of 0.90 and a highest F1 score of 0.95, highlighting its effectiveness. The evolution of the confusion matrix, which reflects these outcomes, is depicted in Fig. 15.

Fig. 15. Variation of the confusion matrix (from left to right).

Furthermore, to demonstrate the superiority of our proposed model, we conduct comparative experiments on our self-built Parkinson's gait dataset using existing deep learning frameworks. In addition to the advanced gait recognition frameworks GaitSet5 and GaitPart6, we also include the traditional C3D26 and R3D27 network models for comparison. C3D captures spatio-temporal information in video frames through 3D convolutional kernels and performs well in video classification tasks. R3D improves upon ResNet by introducing residual connections into 3D convolutions, significantly enhancing performance in video classification. To ensure fairness, we use the same training and test sets, train the models under identical epoch and batch settings, and evaluate them on the same test set. The comparative results are presented in Table 8.

Table 8 Parkinson’s disease / non-disease identification task comparative experimental results.

The experimental results indicate that our new hybrid architecture outperforms several existing deep learning architectures in the task of Parkinson’s gait recognition, thereby validating the effectiveness of our model.

Interpretability of Parkinsonian gait

After review by professional physicians, most of the patient image samples in our dataset could be classified into gait contour images related to scissor gait, spastic gait, and foot-drop gait, as shown in Fig. 16. After preliminary observation and statistical analysis of the Parkinson's patient gait dataset, the distribution of the main gait types is as shown in Table 9.

Table 9 The distribution of the main gait types.

The scissor gait, known medically as a diplegic gait, refers to a walking pattern in which the patient's legs cross inward during walking, resembling the opening and closing of scissors; it is rather common among Parkinson's patients and is mainly caused by muscle stiffness and declining motor control. The spastic gait, known medically as a hemiplegic gait, refers to a walking pattern in which the patient's leg muscles suddenly contract during walking, leading to an incoherent gait with pauses or dragging; it is likewise common among Parkinson's patients, mainly resulting from muscle spasms and deteriorating motor coordination. The foot-drop gait, known medically as a neuropathic (steppage) gait, refers to a walking pattern in which the patient's toes do not lift adequately during walking, causing the sole to drag on or kick the ground; it too is common among Parkinson's patients, mainly due to weakness of the foot muscles or impaired nerve control.

We used the same experimental configuration and comparison models as in the preceding subsection to ensure a fair comparison. The specific experimental results are shown in Table 10.

Fig. 16. Three common abnormal gait profiles summarized from the self-built dataset.

Table 10 The classification results of the three types of abnormal gait.

Limitations

The self-built Parkinson’s disease (PD) dataset used in this study has several limitations. Firstly, the dataset is relatively small in size, comprising gait video data from only 44 PD patients and 42 healthy controls. This limited sample size may restrict the generalizability of our findings to a broader population. Secondly, the dataset exhibits restricted variability, as it was collected under controlled clinical settings with a fixed \(90^{\circ }\) viewpoint. Real-world scenarios often involve more complex and dynamic environments, such as varying lighting conditions, different walking surfaces, and multiple viewpoints. Additionally, due to the inherent homogeneity of the small-scale dataset, there is a risk of overfitting when applying other high-performing large-scale models. In future work, we will be committed to collecting data under more diverse and complex scenario conditions to enhance the robustness and applicability of the model in real-world applications.

This study solely relies on silhouette-based gait recognition, which has inherent limitations. While silhouettes provide a useful representation of gait patterns, they exclude other key clinical indicators relevant to Parkinson’s disease assessment, such as freezing episodes and tremors. These indicators typically require data from multiple sources, like wearable sensors and physiological measurements. Omitting such data may limit our model’s ability to comprehensively capture Parkinson’s symptoms. Moreover, silhouette extraction is sensitive to background variations and occlusions, which can affect the accuracy of gait feature extraction. Future research will explore multimodal data analysis to complement silhouette-based methods and provide a more comprehensive assessment of Parkinson’s disease.

Conclusion

In this work, we propose GRE-MFF-GaitBase (Gait Recognition Expert with Multi-Focus Fusion-enhanced GaitBase), a hybrid architecture integrating four specialized expert subnetworks, the Global-Info Expert, Spatio-Temporal Expert, Local Motion Expert, and Contour Expert, into the GaitBase backbone. By enabling dynamic adaptation to irregular gait patterns, this design enhances feature extraction and significantly improves the model's accuracy in gait recognition tasks. We conducted extensive comparative and ablation experiments on the well-known CASIA-B and OU-MVLP datasets, demonstrating the model's robustness and superiority from various perspectives. Furthermore, through transfer learning on our self-built gait contour dataset of Parkinson's patients and non-patients, we achieved effective identification of Parkinson's patients, further validating the model's effectiveness.

In future work, we intend to compile a comprehensive gait dataset encompassing various acquisition angles (e.g., lateral, frontal, and multi-camera synchronous captures) and diverse environmental conditions (indoor and outdoor settings, varying ground materials, and different lighting scenarios). This initiative is designed to increase the model's practical utility and its ability to handle the complexities of real-world applications. Additionally, we will investigate the model's effectiveness in clinical trials and assess its application potential in actual medical settings. This study not only offers a novel technical approach for the automatic diagnosis of Parkinson's disease but also paves the way for the application of gait recognition technology in other domains.