Introduction

Pedestrian detection, tracking, and recognition are central issues in computer vision. While early studies primarily addressed single-camera scenarios, the focus has shifted toward multi-camera systems, where person re-identification (re-ID)1 plays a critical role in associating pedestrian trajectories across non-overlapping views. Compared with single-camera settings, multi-camera deployments present severe variations in lighting, viewpoint, and background, as well as blind zones between non-overlapping fields of view, making consistent trajectory association and identity matching challenging to achieve.

Person re-ID methodologies can be broadly classified into image-based and video-based approaches2. Image-based re-ID involves matching a single query frame against a set of gallery images, frequently employing supervised learning techniques to project features into a shared embedding space3,4. Although effective for static appearances, this approach is limited in its ability to leverage temporal cues and is less robust in the presence of occlusion or motion blur. Conversely, video-based re-ID encodes both spatial appearance and temporal dynamics from sequences of frames5, thereby facilitating more comprehensive identity representations.

A prevalent strategy in video-based re-ID involves the aggregation of spatial features extracted by Convolutional Neural Networks (CNNs) over time through average or max pooling6, though this method can be susceptible to noisy frames as noted in5. To enhance temporal modeling, recurrent neural networks (RNNs) have been integrated with CNN backbones, as demonstrated by the Siamese Recurrent Convolutional Network (SRC) developed by McLaughlin et al.5, which captures sequential dependencies between frames. However, standard RNNs are prone to vanishing gradient issues in long sequences. Gated architectures such as Long Short-Term Memory (LSTM)7 and Gated Recurrent Unit (GRU)8 mitigate this problem by incorporating memory gates, with LSTM offering greater capacity for large datasets and GRU providing faster convergence with fewer parameters, rendering GRU particularly effective in small-scale or real-time applications9. These recurrent designs have become foundational for video-based re-ID, forming the basis for our proposed compact multi-level Siamese CNN–GRU framework.

The Siamese network architecture10, which consists of two weight-sharing subnetworks, has been extensively utilized in person re-ID for similarity learning, particularly when training data is limited3,11,12,13. By projecting paired inputs into a shared embedding space and optimizing a verification loss, Siamese models effectively enforce small intra-class distances and large inter-class margins. This pairwise learning paradigm mitigates overfitting and enhances parameter efficiency, which is beneficial when per-identity samples are scarce. However, traditional Siamese frameworks often depend solely on high-level semantic features from deep CNN layers, neglecting low-level spatial cues, such as textures from hats, shoes, or carried objects, that remain consistent across camera views and are especially useful in challenging scenarios14.

To address these limitations, the proposed Multi-level Similarity Perception Siamese Recurrent Convolutional Network (MSP-SRC) incorporates a compact CNN backbone for spatial feature extraction alongside a GRU module for temporal modeling. As depicted in Fig. 1, an auxiliary pooling branch from the early convolutional layer (Pool1) captures low-level similarities, while GRU outputs are aggregated via temporal mean pooling to form high-level sequence embeddings that are less biased toward terminal frames. Both branches are trained concurrently using a combined identification–verification objective15, originally proposed in DeepID2 for face recognition, enabling the network to learn discriminative, noise-resilient representations for sequence-level matching. This multi-level design preserves fine-grained spatial details while simultaneously modeling long-range temporal dependencies, offering a balanced and robust solution for video-based re-ID.

Fig. 1

Architecture of the proposed MSP-SRC framework. (a) Training phase: The Siamese network processes an input pair \(\:({s}_{i},{s}_{j})\) of dimension \(\:T\times\:H\times\:W\times\:5\) (RGB + Optical Flow). It comprises a Shared CNN Backbone for spatial feature extraction and a GRU Temporal Module for sequence modeling. Low-level cues are captured via spatial pooling, while high-level temporal features are aggregated via Temporal Mean Pooling. (b) Testing phase: Probe and Gallery sequences are encoded into fixed-dimensional embeddings \(\:\left({\mathbb{R}}^{e}\right)\) for similarity matching using Euclidean Distance.

Recent transformer-based16,17,18,19,20,21,22 and attention-intensive architectures have achieved state-of-the-art performance in video-based person re-ID by modeling long-range temporal dependencies and refining part-level feature alignment. While effective in controlled benchmarks, these models typically incur high computational costs, large memory footprints, and longer inference latencies, which may limit their applicability in scenarios with constrained resources or limited training data. In contrast, the MSP-SRC framework emphasizes a balanced trade-off between recognition accuracy and architectural compactness, avoiding heavy attention modules while retaining both low-level spatial cues and high-level temporal contexts. Although on-device optimization is beyond the scope of this study, potential deployment in resource-constrained environments is discussed in the conclusion as future work.

The remainder of this paper is organized as follows: Sect. 2 reviews recent developments in video-based person re-ID, with an emphasis on Siamese architectures and multi-level similarity modeling. Section 3 describes the proposed MSP-SRC framework. Section 4 reports the experimental results and provides in-depth discussions of the findings. Section 5 concludes the paper and outlines directions for future research.

Related works

Person re-ID approaches are generally divided into image-based and video-based methods. Image-based methods operate on single frames and learn discriminative spatial features but cannot exploit temporal cues, making them vulnerable to occlusion, pose changes, and resolution loss. Video-based methods process sequences of consecutive frames to capture temporal cues, motion patterns, and cross-view consistency, yielding richer identity representations. However, they also face challenges such as noise, frame redundancy, and sequence misalignment, necessitating models that can jointly extract spatial and temporal features with strong discriminative power. The proposed MSP-SRC is designed for video-based re-ID, integrating low-level spatial cues with high-level temporal dynamics to improve recognition under challenging conditions.

Image-based person Re-ID

Building upon the foundational work of Gheissari et al.23, image-based re-ID involves matching a single image against a gallery under a closed-world assumption, utilizing feature extraction and metric learning to model pedestrian appearance. Traditional descriptors, such as color histograms and texture features, were frequently applied to segmented body parts (head, torso, legs)24,25, with similarity computed via learned metrics including KISSME26, LMNN27, ITML28, and LDML29. Methods like LOMO-XQDA30 enhanced efficiency through cross-view dimensionality reduction, while classifier-based approaches employed SVM or AdaBoost for discriminative matching31,32,33. Despite their contributions, handcrafted pipelines exhibit limited generalizability to large-scale, unconstrained environments.

The advent of deep learning marked a shift towards CNN-based frameworks such as DeepReID4 and PersonNet34, which learn hierarchical representations directly from data. Siamese CNNs3 improved discrimination with pairwise losses but underutilized label information, leading to the development of classification-based designs with domain-guided dropout35 and hybrid variants incorporating temporal or local feature matching modules11,12,36. Transformer-based architectures, including TransReID16, AAformer17, NFormer18, and PSTR19, further enhanced robustness through global context modeling and part-level alignment, although their computational demands pose challenges for scalability in resource-constrained scenarios.

Video-based person Re-ID

In contrast to image-based re-ID, which depends on single-frame appearance features, video-based methodologies leverage temporal cues across frame sequences to enhance robustness against occlusion, motion blur, and viewpoint variation. Initial studies developed sequence-level appearance models by aggregating handcrafted descriptors such as color histograms, covariance descriptors, or local keypoints across frames37,38,39,40. Techniques including geometric transformations40, Conditional Random Fields (CRFs)41, and part-based spatiotemporal modeling42 were introduced to ensure temporal consistency. Motion-based descriptors, such as HOG3D43 and Gait Energy Image44, were utilized to capture dynamic patterns, with gait periodicity employed for temporal alignment45,46,47.

With the rise of deep learning, CNN-based video re-ID pipelines have emerged. Baseline approaches aggregate CNN-extracted frame features into a sequence descriptor via average or max pooling6, which is computationally straightforward but susceptible to noisy or low-quality frames, as noted in5. Recurrent designs integrate CNNs with temporal learners to capture ordering information, as exemplified by the Siamese Recurrent Convolutional Network (SRC)5, Convolutional LSTM networks48, and GRU-based architectures13. Additional enhancements include Fisher Vector encoding48, temporal pyramid pooling49, and recurrent feature aggregation50.

Recent advancements extend beyond simple recurrence. Graph-based methods model frame-to-frame or part-to-part relations for spatiotemporal consistency, such as skeleton-based dynamic hypergraph networks51 and multi-granularity graph pooling52. Attention and transformer architectures capture long-range dependencies and part-level correspondences, as demonstrated in DenseIL20, MSTAT21, CAViT22, and enhanced video transformers53. These models improve temporal reasoning and fine-grained alignment but often incur high computational costs.

Further research has addressed domain shifts and modality gaps. Single-task joint learning54 and multi-domain learning55 aim to enhance cross-view generalization. Visible–infrared video re-ID with spatiotemporal and modality alignment56 bolsters robustness in low-light and cross-modal settings. Large-scale benchmarks such as MARS6 and MEVID57 reflect the trend toward realistic, noisy data, while self-supervised representation learning58 enhances generalization under limited labels.

Within this context, the proposed MSP-SRC follows a compact Siamese CNN–RNN design philosophy, integrating a lightweight CNN backbone with GRU-based temporal modeling. Its core novelty lies in multi-level similarity aggregation, which jointly leverages low-level spatial cues and high-level temporal embeddings for identity discrimination. This design mitigates sensitivity to redundant/noisy frames and partial misalignment across tracklets while retaining a small model footprint suitable for deployment on standard video benchmarks.

Temporal aggregation beyond video-based person Re-ID

Recent progress in broader video understanding has proposed temporal aggregation mechanisms that, although developed for different objectives than identity discrimination, may offer complementary insights for future video person re-ID research. For example, referring atomic video action recognition introduces temporally grounded modeling of a specified target and emphasizes semantic-aware temporal reasoning under multi-person interference59,60. Diffusion-based referring human action segmentation further highlights holistic–partial temporal modeling to cope with long-range ambiguity and complex interactions in crowded scenes61. In addition, few-shot adaptation for activity recognition across diverse domains stresses temporal representation robustness under distribution shifts and limited supervision62. While these studies do not directly address cross-camera identity matching, their temporal aggregation principles (e.g., multi-trajectory sequence modeling, holistic–partial cue integration, and robustness-oriented temporal learning) may be explored as future extensions for person re-ID. Importantly, the present work targets a different design point by proposing a compact CNN–GRU framework with multi-level similarity aggregation, explicitly prioritizing low parameter count and low FLOPs for resource-constrained deployment.

Siamese-based person Re-ID

Siamese networks, consisting of two subnetworks with shared weights, have emerged as a fundamental paradigm in person re-ID for learning discriminative embeddings with limited per-identity samples3,10. While initial image-based variants exhibited strong generalization capabilities in low-data scenarios11,12,35, their principles have been extended to video-based re-ID to incorporate temporal modeling. As elaborated in Sect. 2.2, recurrent Siamese designs—such as the SRC5, and architectures augmented with LSTM or GRU13,48—continue to serve as competitive sequence-matching baselines. However, these designs often emphasize high-level temporal embeddings while underutilizing complementary spatial details, which can reduce robustness in challenging conditions.

A primary limitation of traditional Siamese frameworks is their dependence on high-level semantic features extracted from deeper network layers, which often results in the neglect of fine-grained spatial cues, such as accessory textures or localized patterns, that remain consistent across camera views and facilitate recognition under occlusion or viewpoint changes. To address this issue, multi-level similarity perception approaches have been explored, introducing auxiliary branches that pool features from earlier convolutional layers alongside deeper temporal representations. Related strategies have been applied in video-based re-ID; for example, MG‑RAFA63 uses attention-guided aggregation of multi-granularity spatiotemporal features, and semantic–time fusion frameworks integrate multi-stage features with inter-frame attention to reduce redundancy. These designs inform the dual-branch scheme of the MSP-SRC: the low-level branch preserves detailed spatial cues, whereas the high-level branch captures temporal dependencies. Both are trained jointly under combined identification–verification objectives15, enhancing the discriminability and robustness to noisy frames.

The proposed MSP-SRC adheres to this paradigm, integrating a Pool1-based low-level similarity branch with a GRU-based high-level branch. Mean temporal pooling mitigates end-of-sequence bias, and joint loss optimization preserves complementary spatial and temporal cues. This dual-branch recurrent Siamese design addresses the typical weakness of discarding low-level details while maintaining computational efficiency for standard video-based re-ID benchmarks. Compared with existing CNN–RNN and CNN–GRU-based video re-ID frameworks that typically form a single high-level sequence embedding supervised by either classification or verification losses, MSP-SRC therefore differs in three key aspects: it explicitly preserves early-layer spatial features through a low-level similarity branch, employs a combined identification–verification objective inspired by DeepID215, and adopts a deliberately compact CNN–GRU configuration tailored to medium-scale video benchmarks and resource-constrained deployment.

Proposed MSP-SRC methodology

Building upon the design motivations in Sect. 2.3, the proposed MSP-SRC adopts a dual-branch Siamese framework for joint spatial–temporal modeling in video-based person re-ID. The following subsections present the architectural overview, input representation, spatial and temporal modeling components, multi-level similarity perception module, and optimization strategy.

Overall framework and input representation

The proposed MSP-SRC is a video-based person re-ID framework that jointly preserves low-level spatial cues and high-level temporal dynamics. Building on the SRC5, it employs a deeper CNN backbone to enhance spatial feature representation and integrates an auxiliary pooling branch to capture multi-level similarities. Temporal modeling is handled by a GRU, which balances efficiency with the capacity to model long-range dependencies.

The framework processes paired pedestrian sequences, each represented by five channels: three for RGB color and two for dense optical flow. The RGB channels, extracted from detected pedestrian bounding boxes, encode static appearance attributes such as clothing color, texture, and body shape. The optical flow channels store horizontal and vertical motion components derived from pixel displacement between consecutive frames, capturing short-term dynamics such as gait rhythm and movement direction. By jointly modeling appearance and motion cues, the framework encodes rich spatiotemporal features that enhance robustness to occlusion, pose variation, and illumination changes.
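The five-channel input described above can be sketched as follows. This is a minimal pure-Python illustration with toy values; the function names (`fuse_channels`, `make_clip`) are ours, not from the authors' code, and real inputs would come from pedestrian detections and a dense optical flow estimator.

```python
# Sketch of assembling the T x H x W x 5 input clip (3 RGB channels plus
# 2 optical-flow channels per pixel), as described for MSP-SRC.

def fuse_channels(rgb_frame, flow_frame):
    """Concatenate an HxWx3 RGB frame with an HxWx2 flow field -> HxWx5."""
    H, W = len(rgb_frame), len(rgb_frame[0])
    return [[rgb_frame[i][j] + flow_frame[i][j] for j in range(W)]
            for i in range(H)]

def make_clip(rgb_frames, flow_frames):
    """Stack T fused frames into a T x H x W x 5 clip."""
    return [fuse_channels(r, f) for r, f in zip(rgb_frames, flow_frames)]

# Toy example: T=2 frames of size H=2, W=2.
T, H, W = 2, 2, 2
rgb = [[[[0.1, 0.2, 0.3] for _ in range(W)] for _ in range(H)] for _ in range(T)]
flow = [[[[0.0, 0.0] for _ in range(W)] for _ in range(H)] for _ in range(T)]
clip = make_clip(rgb, flow)
assert len(clip) == T and len(clip[0]) == H and len(clip[0][0]) == W
assert len(clip[0][0][0]) == 5   # 3 RGB + 2 flow channels per pixel
```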

Within each Siamese branch, the CNN backbone extracts frame-level feature maps, with early-layer outputs pooled to form low-level sequence features that retain fine-grained visual details. Final-layer features are processed by the GRU to propagate temporal dependencies, and the hidden states are aggregated via temporal mean pooling to produce high-level sequence representations. The resulting multi-level embeddings from the two branches are concatenated and projected into a shared feature space for similarity computation. The model is trained with a combined identification–verification loss to enforce small intra-class distances and large inter-class margins. The overall architecture is illustrated in Fig. 1.

Spatial feature extraction via compact CNN backbone

Following the design in5, the CNN backbone of MSP-SRC, illustrated in Fig. 2, consists of three convolutional layers that progressively extract spatial features from pedestrian images. The detailed layer configuration is summarized in Table 1. The convolutional mapping from input image \(\:x\) to output feature vector \(\:f\) is expressed as \(\:f=C\left(x\right)\), where \(\:C\left(\bullet\:\right)\) denotes the convolutional transformation.

Since MSP-SRC is designed for video-based person re-ID, the input consists of sequential pedestrian images. A sequence \(\:s\) comprising T consecutive bounding box frames is represented as \(\:s=\{{s}^{\left(1\right)},{s}^{\left(2\right)},\dots\:{,s}^{\left(T\right)}\}\). Each frame \(\:{s}^{\left(t\right)}\) denotes the pedestrian image at time step t. After passing through the CNN backbone, each frame is encoded into a feature vector \(\:{f}^{\left(t\right)}=C\left({s}^{\left(t\right)}\right).\) Because all frames are processed with shared CNN parameters, the network ensures consistent feature extraction across the entire sequence. The resulting feature vectors \(\:{\left\{{f}^{\left(t\right)}\right\}}_{t=1}^{T}\) are projected into a lower-dimensional space before being forwarded to the recurrent module for temporal modeling. To reduce overfitting, dropout regularization is applied during this process64.

Fig. 2

Detailed CNN Backbone Architecture (Spatial Module). The network consists of three convolutional blocks with specific kernel and filter configurations. Input: A single frame of dimension \(\:128\times\:64\times\:5\). Layers: The network progressively extracts features using 16, 32, and 32 filters (\(\:5\times\:5\)), with spatial dimensions reduced via max-pooling (\(\:64\times\:32\to\:32\times\:16\to\:16\times\:8\)). Output: The final feature map is flattened into a 4,096-dimensional vector (\(\:{f}_{t}\)), which serves as the input to the temporal module.

Table 1 Detailed CNN architecture of MSP-SRC.

As detailed in Table 1, the CNN backbone employs three convolutional layers with progressively increasing receptive fields. The first layer captures fine-grained local patterns, such as textures and edges; the second layer aggregates mid-level semantics, such as body parts and poses; and the third layer encodes abstract identity-related features. This shallow yet structured design strikes a balance between efficiency and discriminative power; while deeper backbones could yield stronger abstractions, they would also incur higher computational costs. In video-based re-ID scenarios characterized by a limited number of training identities, a compact backbone serves as an effective regularization strategy, matching model capacity to the available supervision and mitigating the overfitting often observed with over-parameterized deep networks. This choice is also consistent with prior video-based person re-ID architectures on medium-scale benchmarks, which typically employ three- or four-layer CNN backbones trained from scratch. The three-layer configuration therefore keeps the MSP-SRC backbone compact, with limited depth and channel width. Crucially, this shallow architecture preserves fine-grained low-level spatial cues—such as clothing textures and accessories—that are explicitly leveraged by the multi-level similarity perception module (Sect. 3.4) to enhance discrimination, whereas such details might be attenuated in deeper semantic abstractions.
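As a sanity check on these dimensions, the arithmetic behind the 4,096-dimensional output in Fig. 2 can be sketched as follows. We assume "same"-padded convolutions (the paper does not state padding explicitly), so only the three 2×2 max-pooling stages reduce spatial size:

```python
# Output size of the backbone in Table 1 / Fig. 2: a 128x64x5 frame passes
# through three conv blocks (16, 32, 32 filters), each followed by 2x2
# max-pooling, ending at 16x8x32 = 4096 features per frame.

def backbone_output_dim(h=128, w=64, filters=(16, 32, 32)):
    for _ in filters:           # conv keeps H x W; 2x2 pooling halves both
        h, w = h // 2, w // 2
    return h * w * filters[-1]  # flatten the final feature map

assert backbone_output_dim() == 4096   # matches the 4,096-d f_t in Fig. 2
```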

Temporal modeling via GRU-based recurrent module

RNNs are particularly adept at modeling sequential data due to their feedback connections, which enable the retention of information across time steps—an ability that conventional CNNs, limited by fixed input-output mappings, do not possess. At each time step, the RNN updates its hidden state by integrating the current input with the accumulated temporal context. During training, the recurrent structure is unfolded into a feedforward network so that gradients can be propagated across time steps via backpropagation through time65. As depicted in Fig. 3, the recurrent module in MSP-SRC is instantiated as a GRU-based temporal memory unit, which facilitates information flow across pedestrian sequences. For notational clarity, we first recall the basic ungated RNN formulation implemented as a SimpleRNN cell, and then discuss its gated extensions, including LSTM and GRU.

Fig. 3

Temporal Modeling via GRU Unfolding. The diagram illustrates the recurrent processing of the sequence. Input Features (\(\:{f}_{t}\)) extracted from the CNN backbone are fed into GRU Cells (\(\:{h}_{t}\)) at each time step \(\:(t-1,t,\dots\:T)\). Unlike standard RNNs that only utilize the final state, the sequence of hidden states is forwarded to Temporal Mean Pooling to generate a robust sequence-level representation.

Let \(\:{f}^{\left(t\right)}\) denote the CNN-extracted feature vector from the input frame \(\:{s}^{\left(t\right)}\). The recurrent update is defined as:

$$\:{o}_{high}^{\left(t\right)}={W}_{i}{f}^{\left(t\right)}+{W}_{s}{r}^{(t-1)}$$
(1)
$$\:{r}^{\left(t\right)}=\tanh\left({o}_{high}^{\left(t\right)}\right)$$
(2)

where:

\(\:{o}_{high}^{\left(t\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the intermediate sequence-level feature,

\(\:{f}^{\left(t\right)}\in\:{\mathbb{R}}^{N\times\:1}\) is the CNN feature at time t,

\(\:{r}^{\left(t\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the hidden state at time t,

\(\:{r}^{\left(t-1\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the hidden state from the previous step,

\(\:{W}_{i}\in\:{\mathbb{R}}^{e\times\:N}\) projects the CNN feature into a lower-dimensional embedding,

\(\:{W}_{s}\in\:{\mathbb{R}}^{e\times\:e}\) transforms the recurrent state, and the initial hidden state \(\:{r}^{\left(0\right)}\) is initialized to zero.

Because \(\:{W}_{i}\) is rectangular (e < N), it effectively reduces dimensionality, ensuring computational efficiency. This recurrent update process is depicted in Fig. 4, which outlines the architecture of the employed recurrent unit.

Given an input feature sequence \(\:{\left\{{x}_{t}\right\}}_{t=1}^{T}\), a SimpleRNN cell updates its hidden state \(\:{h}_{t}\) by combining the current input with the previous hidden state:

$$\:{h}_{t}=\varphi\:({W}_{x}{x}_{t}+{W}_{h}{h}_{t-1}+{b}_{h})$$
(3)


where \(\:{W}_{x}\) and \(\:{W}_{h}\) denote the input-to-hidden and hidden-to-hidden weight matrices, \(\:{b}_{h}\) is a bias term, and \(\:\varphi\:(\bullet\:)\) is a non-linear activation function. The unfolded computation graph over time is illustrated schematically in Fig. 4. This SimpleRNN formulation provides a minimal recurrent baseline and serves both to introduce the notation for temporal updates and to act as a baseline temporal model in the ablation study (Sect. 4.3).
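To make Eq. (3) concrete, the following pure-Python sketch runs the SimpleRNN recurrence over a short sequence. Weights are random toy values and the dimensions are tiny; this illustrates the update rule only and is not the authors' implementation.

```python
import math
import random

# SimpleRNN update of Eq. (3): h_t = tanh(W_x x_t + W_h h_{t-1} + b_h).

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def simple_rnn(xs, Wx, Wh, bh):
    """Run the recurrence over a sequence, returning all hidden states."""
    h = [0.0] * len(bh)                 # h_0 initialised to zero
    states = []
    for x in xs:
        pre = [a + b + c
               for a, b, c in zip(matvec(Wx, x), matvec(Wh, h), bh)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

random.seed(0)
N, e, T = 4, 3, 5                       # input dim, hidden dim, seq length
Wx = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(e)]
Wh = [[random.uniform(-0.5, 0.5) for _ in range(e)] for _ in range(e)]
bh = [0.0] * e
xs = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(T)]
states = simple_rnn(xs, Wx, Wh, bh)
assert len(states) == T                 # one hidden state per time step
assert all(-1.0 < v < 1.0 for h in states for v in h)  # tanh-bounded
```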

Fig. 4

Schematic Diagram of the SimpleRNN cell. The basic ungated RNN formulation shown here is used to introduce the notation for recurrent state updates over time. In this work, SimpleRNN additionally serves as a baseline temporal model in the ablation study (Sect. 4.3), while the final MSP-SRC framework adopts a GRU unit as the recurrent module (Sect. 3.3.2).

LSTM

RNNs are constrained by the vanishing gradient problem when extended over lengthy sequences, which impairs their capacity to capture long-range dependencies. To mitigate this issue, LSTM networks incorporate a memory cell together with three gating mechanisms, specifically the forget gate, the input gate, and the output gate. These gates selectively retain historical information, integrate new inputs, and generate outputs, thereby enabling LSTMs to model long-term temporal dependencies more effectively than traditional RNNs. Despite their efficacy, LSTMs are characterized by a substantial number of parameters and relatively high computational demands.

GRU

To improve efficiency, GRUs simplify the LSTM architecture by merging the input and forget gates into an update gate and combining the cell state with the hidden state. This design reduces parameters while retaining the ability to capture temporal dependencies. The GRU is defined as:

$$\:{z}^{t}=\sigma\left({W}_{z}\left[{h}_{t-1},{f}_{t}\right]\right)$$
(4)
$$\:{r}^{t}=\sigma\left({W}_{r}\left[{h}_{t-1},{f}_{t}\right]\right)$$
(5)
$$\:{\tilde{h}}^{t}=\tanh\left({W}_{h}\left[{r}^{t}\odot\:{h}_{t-1},{f}_{t}\right]\right)$$
(6)
$$\:{h}^{t}=\left(1-{z}^{t}\right)\odot\:{h}_{t-1}+{z}^{t}\odot\:{\tilde{h}}^{t}$$
(7)

where \(\:{z}^{t}\) and \(\:{r}^{t}\) denote the update and reset gates, \(\:{\tilde{h}}^{t}\) is the candidate hidden state, \(\:{h}^{t}\) is the final hidden state at time t, and the gate products are applied element-wise. Compared with LSTM, GRU achieves similar or better accuracy with fewer parameters and faster convergence, making it a favorable choice for video-based person re-ID tasks. Accordingly, MSP-SRC instantiates the temporal module as a GRU-based recurrent unit in the final architecture, whereas SimpleRNN and LSTM are only used as baseline variants in the ablation study (Sect. 4.3) to assess the impact of recurrent capacity on sequence modeling. In addition, the temporal module is restricted to a single GRU layer with temporal mean pooling, which shortens the backpropagation path and, together with the gating mechanism, helps to alleviate vanishing gradients and temporal drift for the sequence lengths considered in this work.
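As a concrete illustration of Eqs. (4)–(7), the sketch below runs a GRU cell step by step in pure Python. Weights are random toy values; biases are omitted, matching the equations above, and `[h, f]` is implemented as list concatenation. This is an illustrative sketch, not the paper's code.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, f_t, Wz, Wr, Wh):
    """One GRU update following Eqs. (4)-(7); products are element-wise."""
    hf = h_prev + f_t                                   # [h_{t-1}, f_t]
    z = [sigmoid(a) for a in matvec(Wz, hf)]            # update gate
    r = [sigmoid(a) for a in matvec(Wr, hf)]            # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + f_t
    h_tilde = [math.tanh(a) for a in matvec(Wh, gated)]  # candidate state
    return [(1 - zi) * hi + zi * hti                    # interpolation
            for zi, hi, hti in zip(z, h_prev, h_tilde)]

random.seed(1)
e, N, T = 3, 4, 6            # hidden dim, CNN feature dim, sequence length
rand_mat = lambda rows, cols: [[random.uniform(-0.5, 0.5)
                                for _ in range(cols)] for _ in range(rows)]
Wz, Wr, Wh = (rand_mat(e, e + N) for _ in range(3))
h = [0.0] * e
seq = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(T)]
hidden_states = []
for f_t in seq:
    h = gru_step(h, f_t, Wz, Wr, Wh)
    hidden_states.append(h)
assert len(hidden_states) == T
assert all(-1.0 < v < 1.0 for hs in hidden_states for v in hs)
```

Note that the interpolation in Eq. (7) keeps every hidden coordinate inside (−1, 1): each step is a convex combination of the previous state and a tanh-bounded candidate, which is part of what makes the gated update numerically stable over long sequences.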

Multi-level similarity perception module

As identified in previous research14, Siamese-based person re-ID methodologies are typically categorized into two paradigms: (1) the direct computation of similarities on feature maps through global or local matching, and (2) the learning of compact feature embeddings optimized using pairwise or triplet loss functions. Although both paradigms demonstrate competitive performance, they frequently overlook low-level spatial cues that remain consistent across different views. To address this shortcoming, the MSP-SRC introduces a multi-level similarity perception mechanism. Specifically, an auxiliary branch extracts low-level features from the initial layers of the CNN, while high-level temporal features are aggregated from the recurrent module across all time steps, yielding complementary representations that jointly enhance similarity estimation.

Low-level similarity via feature pooling

Low-level features, derived from the earlier convolutional layers, are dense and informative in capturing fine-grained spatial details. Pooling operations applied to these features effectively capture regional patterns that complement the abstract representations obtained in deeper layers. Qualitative visualizations in Fig. 5 further illustrate the complementary roles of low-level and high-level features. Conv1 responses (low-level) strongly activate on fine-grained details such as the overall silhouette, clothing textures, and accessories (e.g., backpacks, shoes), capturing local appearance patterns that are highly informative for distinguishing visually similar pedestrians. Conv2 responses (mid-level) emphasize larger body regions and part configurations, while Conv3 responses (high-level) focus on more abstract semantic body regions and suppress some high-frequency texture details. This progression indicates that high-level features alone may abstract away critical local textures necessary for identity discrimination. Consequently, the proposed feature pooling branch operating on early-layer activations is essential to recover and preserve these discriminative low-level cues, ensuring robust matching even when high-level semantic patterns are similar across different identities.

Fig. 5

Visualization of feature activation maps across different CNN layers. The first column shows the input image. Subsequent columns display activation maps from Conv1 (low-level), Conv2 (mid-level), and Conv3 (high-level). Conv1 retains fine-grained identity cues such as clothing textures, contours of carried accessories (e.g., bags), and local color patterns, whereas Conv3 extracts more abstract semantic body regions. This qualitative progression from low-level to high-level responses illustrates the complementary roles of different layers and supports the need for the proposed multi-level aggregation, which preserves discriminative low-level details that might otherwise be attenuated in deeper abstractions.

Formally, for a sequence \(\:\text{s}=\left\{{s}^{\left(1\right)},{s}^{\left(2\right)},\dots\:,{s}^{\left(T\right)}\right\},\) each frame \(\:{s}^{\left(t\right)}\:\) is encoded by the CNN into an embedding vector \(\:{o}_{low}^{\left(t\right)}=l\left({s}^{\left(t\right)}\right)\), where \(\:l\left(\bullet\:\right)\:\)denotes the CNN and projection operation. The resulting set of vectors \(\:\left\{{o}_{low}^{\left(1\right)},{o}_{low}^{\left(2\right)},\dots\:,{o}_{low}^{\left(T\right)}\right\}\) is aggregated into a single low-level representation \(\:{v}_{{\text{s}}_{low}}\) using:

A. Mean pooling:

$$\:{v}_{{\text{s}}_{low}}=\frac{1}{T}\sum\:_{t=1}^{T}{o}_{low}^{\left(t\right)}$$
(8)

B. Max pooling:

$$\:{v}_{{\text{s}}_{low}}^{i}=\max\left({o}_{low}^{\left(1\right),i},{o}_{low}^{\left(2\right),i},\dots\:,{o}_{low}^{\left(T\right),i}\right),\:\:\:i\in\:\left\{1,2,\dots\:,e\right\}$$
(9)

The consolidated low-level representation is denoted as \(\:L\left(\text{s}\right)={v}_{{\text{s}}_{low}}\). Figure 6 illustrates the pooling module. This feature pooling operation converts sequences of arbitrary length into fixed-size embeddings, facilitating direct similarity comparison between sequences without requiring explicit frame alignment.
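Eqs. (8)–(9) amount to simple element-wise aggregation over the frame axis, as the following toy sketch shows. Both variants collapse a T × e sequence into a single e-dimensional vector, regardless of T:

```python
# Mean and max pooling over a sequence of per-frame embeddings o_low^(t),
# per Eqs. (8)-(9). Toy values; e = 2 feature dimensions, T = 3 frames.

def mean_pool(seq):
    T = len(seq)
    return [sum(frame[i] for frame in seq) / T for i in range(len(seq[0]))]

def max_pool(seq):
    return [max(frame[i] for frame in seq) for i in range(len(seq[0]))]

o_low = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
assert mean_pool(o_low) == [2.0, 2.0]   # per-dimension average over T
assert max_pool(o_low) == [3.0, 4.0]    # per-dimension maximum over T
```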

Fig. 6

Feature pooling branch.

High-level similarity via temporal pooling

Although RNNs capture sequential dependencies, they exhibit two main limitations for re-ID: (1) an overemphasis on terminal frames due to recursive updates, and (2) difficulty in modeling temporal information at multiple scales. To address these issues, the MSP-SRC incorporates a temporal pooling layer following the RNN. Instead of relying solely on the final hidden state, this layer aggregates outputs across all time steps, facilitating a balanced representation of temporal dependencies across the entire sequence. The mathematical formulation of the pooling strategies is presented below.

Given the set of recurrent outputs \(\:\left\{{o}_{high}^{\left(1\right)},{o}_{high}^{\left(2\right)},\dots\:{o}_{high}^{\left(T\right)}\right\}\), the aggregated high-level representation \(\:{v}_{{\text{s}}_{high}}\) is computed as:

A. Mean pooling:

$$\:{v}_{{\text{s}}_{high}}=\frac{1}{T}\sum\:_{t=1}^{T}{o}_{high}^{\left(t\right)}$$
(10)

B. Max pooling:

$$\:{v}_{{\text{s}}_{high}}^{i}=\text{max}\left({o}_{high}^{\left(1\right),i},{o}_{high}^{\left(2\right),i},\dots\:,{o}_{high}^{\left(T\right),i}\right)$$
(11)

The final high-level representation is denoted as \(\:R\left(\text{s}\right)={v}_{{\text{s}}_{high}}.\) This temporal pooling operation converts sequences of arbitrary length into fixed-size embeddings, facilitating direct similarity comparison between sequences without requiring explicit frame alignment.
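The high-level branch can be sketched as follows, assuming a bias-free single-layer GRU with randomly initialized weights (the function name, weight layout, and toy sizes are illustrative, not the paper's exact configuration). The key point is that all recurrent outputs are kept and mean-pooled as in Eq. (10), rather than keeping only the final hidden state.

```python
import numpy as np

def gru_forward(x_seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Minimal (bias-free) GRU forward pass over a sequence.

    x_seq: (T, N) inputs; W* map the input space (N), U* the hidden space (e).
    Returns all hidden states (T, e), i.e. the o_high^(t) of Eqs. (10)-(11).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = np.zeros(Wz.shape[0])
    outputs = []
    for x in x_seq:
        z = sigmoid(Wz @ x + Uz @ h)               # update gate
        r = sigmoid(Wr @ x + Ur @ h)               # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
        h = (1.0 - z) * h + z * h_tilde
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
N, e, T = 6, 4, 5  # illustrative input dim, hidden dim, sequence length
weights = [rng.normal(scale=0.1, size=s) for s in
           [(e, N), (e, e), (e, N), (e, e), (e, N), (e, e)]]
o_high = gru_forward(rng.normal(size=(T, N)), *weights)
v_high = o_high.mean(axis=0)  # Eq. (10): temporal mean pooling over all steps
```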

Joint identification–verification optimization strategy

Siamese training with contrastive loss

The MSP-SRC framework adopts a Siamese architecture composed of two weight-sharing subnetworks. Each training sample consists of a pair of pedestrian sequences \(\:\left({\text{s}}_{i},{\text{s}}_{j}\right)\), which are independently processed to produce sequence-level embeddings in a shared feature space. This design enables direct similarity computation via Euclidean distance.

To enforce discriminability, a contrastive loss minimizes intra-class distances while maintaining a margin between inter-class embeddings. Specifically, given the extracted low-level and high-level embeddings:

$$\:{v}_{i\_low}=L\left({\text{s}}_{i}\right),\:\:{v}_{i\_high}=R\left({\text{s}}_{i}\right)$$
(12)
$$\:{v}_{j\_low}=L\left({\text{s}}_{j}\right),\:\:{v}_{j\_high}=R\left({\text{s}}_{j}\right)$$
(13)

the intra-class loss for positive pairs \(\:(i=j)\) is

$$\:{E}^{intra}\left({v}_{i},{v}_{j}\right)=\frac{1}{2}{\left\|{v}_{i}-{v}_{j}\right\|}^{2}$$
(14)

while the inter-class loss for negative pairs \(\:(i\ne\:j)\) is

$$\:{E}^{inter}\left({v}_{i},{v}_{j}\right)=\frac{1}{2}{\left[\text{max}\left(m-\left\|{v}_{i}-{v}_{j}\right\|,0\right)\right]}^{2}$$
(15)

The overall Siamese loss is therefore:

$$\:E\left({v}_{i},{v}_{j}\right)=\left\{\begin{array}{c}\frac{1}{2}{\left\|{v}_{i}-{v}_{j}\right\|}^{2}\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:,\:i=j\\\:\frac{1}{2}{\left[\text{max}\left(m-\left\|{v}_{i}-{v}_{j}\right\|,0\right)\right]}^{2}\:\:,\:i\ne\:j\end{array}\right.$$
(16)

As defined in (13), the high-level sequence embedding \(\:R\left(s\right)\) is computed by applying temporal mean pooling to the sequence of GRU outputs \(\:{\left\{{h}_{t}\right\}}_{t=1}^{T}\) over all time steps. Unlike conventional RNN-based video re-ID models that derive the sequence representation solely from the final hidden state, this design prevents the representation from drifting toward the last few frames and encourages \(\:R\left(s\right)\) to summarize a global temporal consensus rather than a tail-dominated state. At inference, similarity between unseen pedestrian sequences is directly measured by Euclidean distance; smaller distances indicate higher identity likelihood.
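A minimal sketch of the contrastive objective of Eqs. (14)–(16), with Euclidean distance between the two sequence embeddings; the margin value used below is illustrative, as the paper does not state it here.

```python
import numpy as np

def siamese_contrastive_loss(v_i, v_j, same_identity, margin=2.0):
    """Contrastive loss of Eq. (16): pull positive pairs together,
    push negative pairs beyond a margin m (margin value is illustrative)."""
    d = np.linalg.norm(np.asarray(v_i, dtype=float) - np.asarray(v_j, dtype=float))
    if same_identity:                        # Eq. (14): intra-class term
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2   # Eq. (15): inter-class term

# Positive pair at distance 1 incurs loss 0.5; a negative pair already
# farther apart than the margin incurs zero loss.
assert siamese_contrastive_loss([0, 0], [1, 0], True) == 0.5
assert siamese_contrastive_loss([0, 0], [3, 0], False, margin=2.0) == 0.0
```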

Joint identification and verification loss

To jointly optimize classification and verification, MSP-SRC employs a combined loss inspired by DeepID215. This objective integrates a verification term based on contrastive distance with an identification term based on cross-entropy classification.

For identification, the recurrent feature vector \(\:v=R\left(s\right)\) is fed into a SoftMax classifier:

$$\:I\left(v\right)=P\left(q=c|v\right)=\frac{\text{exp}\left({W}_{c}v\right)}{{\sum\:}_{k=1}^{K}\text{exp}\left({W}_{k}v\right)}$$
(17)

where \(\:K\) is the number of identity classes, \(\:q\) is the ground-truth label, and \(\:W\) is the SoftMax weight matrix whose rows \(\:{W}_{c}\) and \(\:{W}_{k}\) hold the parameters for class \(\:c\) and class \(\:k\), respectively.

The overall loss for a pair \(\:\left({\text{s}}_{1},{\text{s}}_{2}\right)\) is formulated as:

$$\:Q\left({\text{s}}_{1},{\text{s}}_{2}\right)=E\left(L\left({\text{s}}_{1}\right),L\left({\text{s}}_{2}\right)\right)+E\left(R\left({\text{s}}_{1}\right),R\left({\text{s}}_{2}\right)\right)+I\left(R\left({\text{s}}_{1}\right)\right)+I\left(R\left({\text{s}}_{2}\right)\right)$$
(18)

Here, \(\:E\left(\cdot\:,\cdot\:\right)\) denotes the contrastive loss as previously defined, and \(\:I\left(\cdot\:\right)\) represents the classification loss. The identification and verification components are combined with a fixed 1:1 weighting (i.e., no loss-weight hyperparameter λ is introduced or tuned in this study), balancing classification accuracy with embedding separability. This follows the DeepID2 training strategy15, avoids tuning an additional hyperparameter on the relatively small PRID-2011 and iLIDS-VID training sets, and improves reproducibility under the adopted evaluation protocol.
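The joint objective of Eq. (18) can be sketched as below, assuming mean-pooled low-level embeddings \(L(\text{s})\), GRU-based high-level embeddings \(R(\text{s})\), and a bias-free SoftMax classifier; all sizes, names, and the margin are illustrative placeholders.

```python
import numpy as np

def softmax_ce(logits, label):
    """Cross-entropy identification loss I(.) for one sample (Eq. 17)."""
    z = logits - logits.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def contrastive(v1, v2, same, m=2.0):
    """Verification loss E(.,.) of Eq. (16); margin m is illustrative."""
    d = np.linalg.norm(v1 - v2)
    return 0.5 * d ** 2 if same else 0.5 * max(m - d, 0.0) ** 2

def joint_loss(L1, L2, R1, R2, W, label1, label2, same):
    """Eq. (18): verification on low- and high-level embeddings plus
    identification on the recurrent embeddings, with fixed 1:1 weights."""
    return (contrastive(L1, L2, same) + contrastive(R1, R2, same)
            + softmax_ce(W @ R1, label1) + softmax_ce(W @ R2, label2))

rng = np.random.default_rng(1)
e, K = 8, 10                   # illustrative embedding size and identity count
W = rng.normal(size=(K, e))    # SoftMax weight matrix
L1, L2, R1, R2 = (rng.normal(size=e) for _ in range(4))
q = joint_loss(L1, L2, R1, R2, W, label1=3, label2=3, same=True)
```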

In this joint objective, the multi-level similarity perception module combines spatial and temporal cues through the complementary roles of \(\:L\left(s\right)\) and \(\:R\left(s\right)\). As defined in (12) and (13), \(\:L\left(s\right)\) denotes the low-level sequence embedding obtained by pooling early convolutional feature maps, thereby emphasizing fine-grained spatial appearance patterns such as textures, color configurations, and accessories. In contrast, \(\:R\left(s\right)\) represents the high-level sequence embedding produced by the GRU followed by temporal mean pooling, which captures semantic body structure and aggregated temporal dynamics. In (18), the verification terms \(\:E\left(L\left({\text{s}}_{1}\right),L\left({\text{s}}_{2}\right)\right)\) and \(\:E\left(R\left({\text{s}}_{1}\right),R\left({\text{s}}_{2}\right)\right)\) encourage both the low-level spatial branch and the high-level temporal branch to learn discriminative similarity structures for positive and negative pairs, while the identification terms \(\:I\left(R\left({\text{s}}_{1}\right)\right)\) and \(\:I\left(R\left({\text{s}}_{2}\right)\right)\) regularize the recurrent branch toward class-separable representations. By optimizing these components jointly, the framework aligns complementary low-level and high-level embeddings within a single multi-level similarity space, rather than relying on a single-branch representation.

Results and discussion

This section presents a comprehensive experimental evaluation of the proposed MSP-SRC framework with four objectives: (1) validating its effectiveness across multiple publicly available video-based person re-ID benchmarks, (2) comparing its performance with representative state-of-the-art methods, (3) analyzing the contribution of individual components through ablation and module-wise studies, and (4) assessing computational complexity for a fair efficiency comparison. The evaluation follows standard practices in the person re-ID community to ensure reproducibility and consistency with prior work.

The remainder of this section is organized as follows: Sect. 4.1 introduces the datasets, and Sect. 4.2 outlines the evaluation protocols. Sections 4.3 and 4.4 investigate the impact of input modalities, pooling strategies, and architectural configurations. Section 4.5 examines the discriminative capacity of low- and high-level features, while Sect. 4.6 explores the effect of varying sequence lengths. Section 4.7 presents ablation experiments quantifying the importance of each core module, and Sect. 4.8 compares MSP-SRC with recent state-of-the-art approaches, with emphasis on Siamese-based architectures and compact CNN–RNN designs. In addition to accuracy-oriented comparisons, a quantitative computational complexity analysis is presented in Sect. 4.8.1.

Video-based dataset selection

This study evaluates the proposed MSP-SRC framework on two publicly accessible video-based person re-ID benchmarks, PRID-201138 and iLIDS-VID66. These datasets offer a practical balance between sequence diversity and annotation reliability and are widely adopted in prior Siamese RNN-based research, providing manually annotated, relatively clean tracklets that allow the architectural contributions of the proposed CNN–RNN backbone and Multi-Level Similarity Perception (MSP) module to be isolated and analyzed under controlled conditions.

In contrast, the large-scale MARS benchmark67 is constructed from automatically generated tracklets produced by a detector–tracker pipeline, which introduces substantial label noise, ID switches, misalignments, and trajectory fragmentation.

While MARS is highly valuable for end-to-end detection–tracking–re-ID pipelines, such noise can obscure the specific effects of feature modeling within a Siamese framework trained in a pairwise manner. For this reason, the present study focuses on PRID-2011 and iLIDS-VID as canonical video-based benchmarks and leaves the extension of MSP-SRC to MARS and cross-domain evaluation protocols to future work.

PRID-2011 consists of pedestrian sequences captured by two static, non-overlapping cameras. Of the 245 identities appearing in both views, 200 are typically selected and evenly divided into training and testing sets. Sequence lengths range from 5 to 675 frames, with an average of 100.

The iLIDS-VID dataset comprises 300 identities, each represented by sequences from two disjoint camera views. Sequence lengths vary from 23 to 192 frames (average 73), with additional challenges such as occlusion, background clutter, and viewpoint variation.

Evaluation metrics

This study evaluates all methods using the Cumulative Match Characteristic (CMC) curve, with a particular focus on Rank-1 accuracy, which reflects the percentage of query sequences where the correct identity is retrieved at the top rank. During evaluation, each sequence is encoded into a feature vector, and pairwise Euclidean distances are computed to form a distance matrix. The ranked retrieval results are then used to construct the CMC curve. Following the standard protocol for PRID-2011 and iLIDS-VID38,66, the identities in each dataset are randomly partitioned into training and testing sets with a 50%/50% split, and this procedure is repeated for 10 independent trials. The CMC statistics (Rank-1, Rank-5, Rank-10, Rank-20) reported in Sect. 4.8 correspond to the mean accuracy over these 10 runs. A more detailed analysis of variance across runs and formal significance tests is left for future work. Since the adopted datasets typically provide only one ground-truth match per query, mean Average Precision (mAP) is not reported in this study.
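The CMC computation described above can be sketched as follows, assuming one ground-truth gallery match per query (as in the single-shot PRID-2011/iLIDS-VID protocol); the function name and toy distance matrix are illustrative.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Cumulative Match Characteristic from a query x gallery distance matrix,
    assuming a single ground-truth match per query."""
    dist = np.asarray(dist, dtype=float)
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for qi, qid in enumerate(query_ids):
        order = np.argsort(dist[qi])            # gallery sorted by distance
        rank = np.where(gallery_ids[order] == qid)[0][0]
        if rank < max_rank:
            hits[rank] += 1
    return np.cumsum(hits) / len(query_ids)     # cmc[k-1] = Rank-k accuracy

# Toy 3-query example: queries 0 and 2 match at rank 1, query 1 only at rank 3.
dist = [[0.1, 0.9, 0.8],
        [0.2, 0.5, 0.3],
        [0.7, 0.6, 0.1]]
cmc = cmc_curve(dist, query_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
# cmc[0] is the Rank-1 accuracy
```

In practice the reported curves are the element-wise mean of such CMC vectors over the 10 random splits.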

Comparison of RNN variants and depths

This subsection compares different RNN variants and network depths to assess their impact on sequence modeling performance. Previous research9 indicates that LSTM and GRU generally outperform SimpleRNN in modeling sequential dependencies. To substantiate this claim, this study assessed SimpleRNN, LSTM, and GRU with depths ranging from one to four layers, employing 16-frame sequences for consistency.

As illustrated in Tables 2, 3 and 4, deeper neural networks exhibited a tendency toward overfitting. SimpleRNN demonstrated modest but stable improvements up to three layers, but its performance declined when extended to four layers. In contrast, both LSTM and GRU experienced reductions in accuracy with increased depth, suggesting limited benefit from additional layers on relatively small datasets. To supplement the tabulated data, Fig. 7 illustrates the impact of increasing recurrent depth on Rank-1 accuracy for the three variants. The trends confirm that SimpleRNN achieves marginal gains up to three layers but deteriorates beyond this point. Both LSTM and GRU show a decline with increased depth, with LSTM degrading more rapidly due to its higher parameterization. GRU consistently outperformed the other variants at shallow depths, highlighting its advantageous balance between model complexity and effective sequence modeling.

Table 2 Comparison of SimpleRNN depths.
Table 3 Comparison of LSTM depths.
Table 4 Comparison of GRU depths.
Table 5 Comparison of the best-performing RNN variants.

To ensure a fair comparison, the optimal configuration of each model was selected and summarized in Table 5. The results indicate that GRU achieved the highest Rank-1 accuracy across both PRID-2011 and iLIDS-VID datasets, while LSTM consistently underperformed. The observed performance differences can be attributed to model complexity. LSTM, with its extensive parameterization, requires larger training data and thus failed to generalize effectively given the relatively small datasets and short sequences used in this study. Conversely, GRU’s simplified gating mechanism reduces the parameter count and facilitates more efficient training, rendering it more adept at capturing the short-term dynamics, such as pose shifts, viewpoint changes, and illumination variations, that predominate in these benchmarks. Consequently, the results underscore that GRU offers the most favorable balance between representational capacity and efficiency for video-based person re-identification. Based on these findings, GRU is adopted as the recurrent module in subsequent experiments.

From a computational viewpoint, increasing recurrent depth also increases the number of trainable parameters and the cost of each forward and backward pass, so selecting a single-layer GRU as the temporal module offers the most favorable trade-off between recognition accuracy and recurrent complexity on PRID-2011 and iLIDS-VID.
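The parameter gap between the three recurrent cells can be made explicit with a standard per-layer count (one input matrix, one recurrent matrix, and one bias per gate or candidate); the sizes below are illustrative, not the exact MSP-SRC configuration.

```python
def rnn_params(variant, n_in, n_hidden):
    """Per-layer trainable parameters of common recurrent cells (with biases).

    SimpleRNN has 1 candidate transform, GRU has 3 (2 gates + candidate),
    LSTM has 4 (3 gates + candidate).
    """
    gates = {"simple": 1, "gru": 3, "lstm": 4}[variant]
    return gates * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

# Illustrative sizes: 128-d input features, 128-d hidden state.
n = e = 128
counts = {v: rnn_params(v, n, e) for v in ("simple", "gru", "lstm")}
# GRU needs 3/4 of the LSTM parameter budget, SimpleRNN only 1/4.
```

This fixed 4:3:1 ratio (LSTM : GRU : SimpleRNN) is consistent with the observation that LSTM's larger parameterization is harder to fit on small re-ID datasets.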

Fig. 7

Impact of recurrent depth on person re-ID accuracy. Rank-1 accuracy across 1–4 layers for SimpleRNN, LSTM, and GRU on (a) PRID-2011 and (b) iLIDS-VID. The visualization illustrates that GRU achieves the best accuracy at shallow depths, while LSTM suffers from pronounced degradation as layers increase.

Impact of multi-level feature aggregation

Recognition models have traditionally focused on high-level abstractions derived from deeper network layers. Nonetheless, earlier convolutional layers capture low-level cues, such as textures associated with hats, backpacks, or other local attributes, that provide stable and discriminative identity information. Integrating these fine-grained spatial details with high-level temporal representations produces more robust embeddings for person re-ID. Furthermore, introducing a pooling branch for low-level features offers auxiliary supervision, which facilitates gradient flow and accelerates convergence. These behaviors are consistent with the activation patterns observed in Fig. 5, where shallow-layer feature maps encode richer texture and accessory information, whereas deeper-layer maps concentrate on semantic body regions. The ablation results in Table 6 show that using only low-level or only high-level features leads to degraded Rank-1 performance, whereas combining both branches with the identification objective yields the best accuracy, confirming that low-level and high-level features provide complementary evidence for identity discrimination.

As demonstrated in Table 6, the combination of identity classification and multi-level similarity consistently achieves the highest accuracy on the iLIDS-VID and PRID-2011 benchmarks. In contrast, restricting the model to high-level features alone results in reduced performance, indicating that recurrent modules by themselves are insufficient to learn highly discriminative representations. These findings consistently underscore the effectiveness of multi-level feature aggregation in enhancing both model stability and recognition accuracy.

Quantitatively, the model shows high sensitivity to the feature combination strategy. This ablation can also be viewed as a zero-weight empirical check: removing a similarity branch is equivalent to assigning its contribution weight to zero while keeping the remaining training protocol unchanged. As shown in Table 6, assigning zero weight to either the low-level or the high-level similarity branch (i.e., using “ident + low” or “ident + high” only) reduces Rank-1 accuracy from 57% to 49% and 51% on iLIDS-VID, and from 76% to 54% and 68% on PRID-2011. Removing both similarity branches (“ident alone”) leads to a significant degradation, with Rank-1 dropping to 31% and 55% on iLIDS-VID and PRID-2011, respectively. These ablations confirm that concatenating identification, low-level, and high-level features—rather than relying on single-branch representations—is crucial for maximizing discrimination.

Table 6 CMC Rank-1 accuracy for different feature combination strategies.

To further assess the robustness of the proposed MSP-SRC, this study conducted 10 independent trials with random training/testing splits. The standard deviations for Rank-1 accuracy were 4.18% for PRID-2011 and 2.49% for iLIDS-VID. These variance measures indicate that the framework yields consistent performance across repeated runs, supporting its stability and reproducibility.

Pooling strategies for sequence embedding

Pooling strategies are essential for the consolidation of frame-level representations into concise sequence embeddings. As elaborated in Sect. 4.4, pooling branches were utilized to harness both low-level and high-level information, circumventing the recurrent module. To further investigate temporal aggregation, we assessed alternative pooling strategies applied to recurrent outputs, thereby ensuring that all time steps contribute to the final sequence representation. This approach addresses a common limitation of recurrent models, which often disproportionately emphasize terminal frames, even when informative cues are present earlier in the sequence.

Table 7 illustrates that mean pooling consistently outperforms max pooling across both low-level and high-level feature branches on the PRID-2011 and iLIDS-VID datasets. The advantage is particularly notable in the high-level branch, where stable averaging across frames mitigates the adverse effects of noise and occlusion. These findings validate mean pooling as a more reliable strategy for generating discriminative sequence embeddings, ultimately enhancing recognition accuracy.

Table 7 Comparison of pooling strategies for low- and high-level feature aggregation.

Effect of sequence length on re-ID accuracy

Building upon the prior analysis of feature aggregation and pooling strategies, this subsection examines the impact of sequence length on re-ID performance. As articulated in Sect. 4.3, recurrent networks necessitate adequate temporal context to effectively capture discriminative sequence features. This implies that augmenting the number of frames per sequence during inference could potentially enhance re-ID accuracy.

To investigate this effect, experiments were conducted using the iLIDS-VID dataset, wherein the sequence lengths of both query and gallery sets were varied. The lengths were adjusted to 1, 2, 4, 8, 16, and up to 128 frames. For sequences shorter than the target length, all available frames were utilized.

The results, as depicted in Table 8, reveal a distinct positive correlation between sequence length and recognition accuracy. Extending either the query or gallery length enhances Rank-1 accuracy, with the most significant improvements observed when both are increased. Notably, fixing the query length to a single frame while extending the gallery from 1 to 128 frames results in a 13% improvement in Rank-1 accuracy. Conversely, fixing the gallery length to a single frame while varying the query length yields only a 4% gain. These findings suggest that gallery sequence length exerts a more substantial influence on performance than query length.

To provide a clearer understanding of the results in Table 8, Fig. 8 illustrates the effect of varying sequence lengths for both the query and the gallery. As shown in Fig. 8a, maintaining a constant query sequence length while increasing the gallery length leads to consistent improvements in Rank-1 accuracy, confirming that longer gallery sequences capture more comprehensive temporal cues. Conversely, Fig. 8b shows that extending the query sequence length while keeping the gallery length constant yields relatively smaller gains, indicating that gallery diversity has a greater influence on recognition performance. Together, these visualizations highlight the asymmetric contribution of sequence length between queries and galleries, underscoring the importance of sufficient temporal information in the gallery for robust person re-identification.

Table 8 Rank-1 accuracy for different combinations of query and gallery sequence lengths (iLIDS-VID).

However, longer sequences increase the computational cost approximately linearly, since more frames must be processed by the convolutional backbone and propagated through the GRU. As a result, when deploying MSP-SRC on latency- or memory-constrained platforms, a moderate sequence length is preferred to balance the accuracy gains from longer temporal windows against the overhead of processing very long sequences.

This observation highlights the significance of temporal context and provides an empirical basis for the ablation studies in Sect. 4.7, where the contributions of core modules are systematically evaluated.

Fig. 8

Effect of sequence length on person re-ID accuracy. (a) Rank-1 accuracy for fixed query lengths with varying gallery lengths. (b) Rank-1 accuracy for fixed gallery lengths with varying query lengths. The visualizations confirm that extending gallery sequences yields greater performance improvements than extending query sequences, underscoring the dominant role of gallery temporal diversity in robust recognition.

Ablation study

Experiments were conducted on the PRID-2011 and iLIDS-VID datasets. As summarized in Table 9, four design aspects were systematically evaluated: the recurrent unit, multi-level feature aggregation, temporal pooling strategy, and sequence length during inference.

First, the GRU unit was replaced with SimpleRNN and LSTM under identical training conditions. The Rank-1 accuracy decreased from 76% to 55% on PRID-2011 and from 57% to 31% on iLIDS-VID when employing SimpleRNN, and further declined to 52% and 27%, respectively, with LSTM. These findings suggest that GRU offers a more advantageous balance between model capacity and convergence, particularly in data-constrained environments.

Second, removing the low-level feature branch while retaining only the high-level pathway reduced accuracy by 10% on PRID-2011 and by 3% on iLIDS-VID. This demonstrates that early-layer cues, such as textures and accessories, complement higher-level abstractions and are essential for robust identity discrimination.

Table 9 Ablation study of MSP-SRC on PRID-2011 and iLIDS-VID datasets.

Third, substituting mean pooling with max pooling consistently led to degraded performance. For example, Rank-1 accuracy decreased from 76% to 66% on PRID-2011 when max pooling was applied to high-level features, confirming the superior robustness of mean pooling against temporal noise and outlier frames.

Finally, reducing the gallery sequence to a single frame resulted in marked performance degradation, dropping to 63% on PRID-2011 and 46% on iLIDS-VID, compared to the 128-frame baseline. This highlights the necessity of temporal diversity for capturing comprehensive appearance patterns.

In summary, the ablation experiments confirm that GRU, multi-level aggregation, mean pooling, and longer sequence inputs each contribute significantly to the discriminative capacity of MSP-SRC. Their integration collectively enhances both model stability and recognition accuracy.

Comparison with state-of-the-art Siamese and conventional methods

To further substantiate the effectiveness of the proposed MSP-SRC framework, its performance was evaluated against both traditional video-based person re-ID methods and contemporary Siamese-based models. The assessments were conducted using the PRID-2011 and iLIDS-VID datasets, with CMC Rank-1 accuracy serving as the primary metric. Several classical methods, such as SRC5, LSTM48, LSTM+KISSME48, GRU13, RFA50, AMOC68, CNN + Euc6, and CNN+XQDA6, have shown competitive results on video-based person re-ID benchmarks. However, these methods often rely on deep or multi-stage architectures, including multi-branch attention modules (e.g., RFA and AMOC) or complex recurrent units (e.g., LSTM), which significantly increase model complexity and computational demands. For example, LSTM introduces multiple gating mechanisms, nearly doubling the parameter count compared to GRU, while AMOC utilizes cascaded CNNs with optical flow estimation, resulting in higher memory usage and inference latency. Despite these efforts, their recognition accuracy remains inferior to that of MSP-SRC, as depicted in Table 10. For transparency, it is clarified that the baseline numbers in Table 10 are quoted from the corresponding original papers under their reported settings, whereas MSP-SRC is evaluated under the standard protocol adopted throughout this work; therefore, results should be interpreted together with potential differences in evaluation protocols and input settings.

The MSP-SRC framework was also compared with several recent Siamese-based methods. Wang69 employs two identical CNN branches followed by LSTM modules. Zhang70 achieves slightly higher accuracy by explicitly modeling intra-video differences with a dual-path structure, although this design introduces significant computational overhead. Wu71 incorporates an attention mechanism that emphasizes salient regions and frames, yielding improved performance on iLIDS-VID under cluttered conditions but at the cost of additional complexity and latency. Li72 utilizes deep LSTM layers within a Siamese framework, which performs effectively on larger datasets but requires substantial training resources and memory.

As summarized in Table 10 and visualized in Fig. 9, MSP-SRC consistently outperforms conventional and recurrent baselines on both PRID-2011 and iLIDS-VID. Figure 9 plots the average CMC curves over 10 independent trials, comparing MSP-SRC against ten representative methods, including classical recurrent models (LSTM+KISSME48, GRU13, RFA50), conventional baselines (CNN+XQDA6, AMOC68), and recent Siamese variants (Wang69, Wu71, Zhang70, Li72). The curves reveal that MSP-SRC maintains a distinct performance advantage over classical approaches across the entire retrieval list (Rank-1 to Rank-20). Specifically, on the PRID-2011 dataset, MSP-SRC achieves performance comparable to recent Siamese methods70,72, while significantly outperforming baselines such as AMOC68 and CNN+XQDA6. On the challenging iLIDS-VID dataset, although MSP-SRC’s Rank-1 accuracy is slightly lower than that of some complex attention-based Siamese models70,71, this marginal difference must be viewed in light of the computational efficiency analyzed in Sect. 4.8.1. While methods such as Zhang70 and Li72 typically rely on deep backbones or heavy attention mechanisms, MSP-SRC is explicitly designed as a lightweight framework with orders of magnitude fewer parameters. Consequently, MSP-SRC not only consistently surpasses standard recurrent (e.g., GRU13, RFA50) and multi-stream baselines (e.g., AMOC68), but also establishes a highly favorable accuracy–efficiency trade-off suitable for resource-constrained deployment.

Table 10 Comparison with traditional and state-of-the-art Siamese-based methods.
Fig. 9

Average CMC curves over 10 independent trials on (a) PRID-2011 and (b) iLIDS-VID datasets. The curves compare MSP-SRC (red solid line) with a broad set of state-of-the-art methods, including Li72, Zhang70, Wu71, Wang69, and conventional baselines such as AMOC68, CNN+XQDA6, and RFA50. MSP-SRC demonstrates robust performance across all ranks, significantly outperforming classical baselines and achieving competitive accuracy against recent Siamese approaches.

Computational complexity analysis

To substantiate the efficiency claims of MSP-SRC, model complexity is analyzed in terms of both parameter counts and floating-point operations (FLOPs). Conventional image-based person re-ID pipelines typically adopt deep CNN backbones such as ResNet-50, which contains approximately 25.6 million parameters73. Recent transformer-based approaches, including TransReID16, commonly employ ViT-Base backbones with substantially larger parameter budgets (on the order of \(\:{10}^{8}\) parameters)74, leading to increased memory footprint and computational demand. Moreover, video-based baselines may further increase complexity through multi-stream designs that jointly process appearance and motion cues (e.g., optical flow) using cascaded CNN encoders68.

In contrast, MSP-SRC is deliberately constructed with a compact three-layer CNN backbone (Table 1) with 5 × 5 kernels and channel widths of 5→16→32→32, followed by a single GRU layer and lightweight multi-level pooling branches. As summarized in Table 11, the overall trainable parameter budget remains on the order of \(\:{10}^{6}\), at least one order of magnitude smaller than ResNet-50-based systems and one to two orders of magnitude smaller than the ViT-Base backbones used in transformer-based re-ID frameworks. This reduction in model size, while maintaining competitive Rank-1 accuracy on PRID-2011 and iLIDS-VID, positions MSP-SRC at a distinctly different point on the accuracy–efficiency trade-off curve compared with recent transformer- and attention-driven person re-ID models.

Beyond parameter counts, FLOPs are reported as a hardware-agnostic proxy for computational cost. Following a common convention, one multiply–accumulate (MAC) is counted as 2 FLOPs (one multiplication + one addition). For convolution layers, bias additions are explicitly included as one add per output activation. Specifically, for a 2D convolution with output size \(\:{H}_{out}\times\:{W}_{out}\), channels \(\:{C}_{in}\to\:{C}_{out}\), and kernel size \(\:K\times\:K\), FLOPs are computed as:

$$\:{FLOPs}_{conv}=2\times\:{H}_{out}\times\:{W}_{out}\times\:{C}_{out}\times\:\left({C}_{in}\times\:{K}^{2}\right)+{H}_{out}\times\:{W}_{out}\times\:{C}_{out}.$$
(19)

For the GRU, the dominant projection cost is approximated as \(\:{FLOPs}_{GRU}\approx\:6\times\:(N\times\:e+{e}^{2})\), where N is the input feature dimension and e is the hidden-state dimension. Under the operating input resolution of 128 × 64 × 5 (3 RGB + 2 optical-flow channels), MSP-SRC requires approximately 0.115 GFLOPs per frame per Siamese branch, derived via a layer-by-layer theoretical calculation. Table 11 further reports representative FLOPs of standard backbones (e.g., ResNet-50 and ViT-Base) using literature-reported values under the widely adopted single-crop 224 × 224 setting, which should be interpreted together with the corresponding resolution notes.
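The counting conventions above can be reproduced directly. The sketch below implements Eq. (19) and the GRU approximation; the first-layer evaluation assumes 'same' padding (so the 128 × 64 spatial size is preserved), which is an assumption on the stride/padding details given in Table 1.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """Eq. (19): 2 FLOPs per multiply-accumulate, plus one bias add
    per output activation."""
    macs = h_out * w_out * c_out * (c_in * k * k)
    bias_adds = h_out * w_out * c_out
    return 2 * macs + bias_adds

def gru_flops(n_in, e):
    """Dominant GRU projection cost: FLOPs ~ 6 * (N*e + e^2),
    i.e. 3 gate/candidate transforms at 2 FLOPs per MAC."""
    return 6 * (n_in * e + e * e)

# First MSP-SRC convolution under the stated input resolution: 128 x 64
# spatial size, 5 input channels (3 RGB + 2 optical flow), 5x5 kernel,
# 16 output maps. 'Same' padding is assumed for the output size.
f1 = conv_flops(128, 64, 5, 16, 5)  # ~32.9 MFLOPs for this layer
```

Summing such per-layer terms over the full backbone and the GRU is how the quoted ~0.115 GFLOPs per frame per branch figure is obtained.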

Table 11 Comparison of model complexity in terms of parameter counts and FLOPs. Baseline values are from the original papers16,73,75. MSP-SRC parameters follow Table 1, and MSP-SRC FLOPs are computed theoretically at 128 × 64 × 5. TE-TransReID FLOPs are estimated from its MobileNetV2 and truncated ViT-Small (4*) branches (see table notes).

From a deployment perspective, a parameter budget on the order of 1–2 million implies a weight storage footprint of only a few megabytes under 32-bit floating-point representation, and substantially lower under quantization. In addition, MSP-SRC relies on a shallow CNN backbone (maximum width 32) and a single GRU layer, yielding modest activation memory with complexity linear in the sequence length, in contrast to the quadratic token-interaction cost of transformer self-attention. These characteristics collectively support the feasibility of deploying MSP-SRC in resource-constrained environments, while a detailed device-specific latency benchmark is left for future work.
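The storage figures above follow from a one-line conversion; `weight_storage_mb` is a hypothetical helper name, and the parameter counts below are simply the 1–2 million band quoted in the text.

```python
def weight_storage_mb(n_params, bits_per_weight=32):
    """Weight storage footprint in megabytes for a given
    parameter count and numeric precision (bits per weight)."""
    return n_params * bits_per_weight / 8 / 1e6


print(weight_storage_mb(1_000_000))     # 4.0 MB at FP32
print(weight_storage_mb(2_000_000))     # 8.0 MB at FP32
print(weight_storage_mb(2_000_000, 8))  # 2.0 MB under 8-bit quantization
```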

Beyond the above quantitative and complexity analyses, the following subsection provides a qualitative diagnostic discussion and a failure-mode taxonomy to further clarify the limitations and boundary conditions of the proposed framework.

Qualitative discussion and failure-case analysis

Qualitative inspection is informative for understanding the boundary conditions of video-based person re-ID systems, particularly under challenging surveillance factors. In addition to the quantitative metrics reported above, this subsection summarizes typical failure modes observed on the PRID-2011 and iLIDS-VID benchmarks and provides a diagnostic analysis to clarify the limitations and boundary conditions of MSP-SRC and representative baselines.

  1. Heavy/long-term occlusion and fragmented tracklets:

    Severe occlusion (e.g., by other pedestrians or scene structures) can lead to fragmented tracklets where discriminative cues are intermittently lost. In such cases, temporal aggregation (e.g., GRU-based integration) must rely on a limited subset of informative frames. If the sequence is dominated by non-informative or partially occluded observations, the final embedding may lose the fine-grained anchors required for identity discrimination.

  2. Viewpoint changes and pose-induced appearance deformation:

    Large viewpoint shifts significantly alter the visibility of body parts and clothing textures. Although the proposed multi-level similarity aggregation exploits complementary cues across feature hierarchies to mitigate local mismatches, extreme transitions (e.g., side view to back view) may still cause inconsistent correspondences, especially when the tracklet provides only a limited stable appearance profile.

  3. Near-duplicate clothing and identity ambiguity:

    When different identities share highly similar clothing colors and textures (high inter-identity / inter-class similarity), appearance-based discrimination becomes intrinsically ambiguous. This issue is amplified in crowded scenes where the learned embeddings may exhibit reduced inter-class margins for a subset of identities, yielding unstable retrieval ranks even with temporal cues.

  4. Illumination shifts and background dominance:

    Abrupt illumination changes between non-overlapping camera views can distort color statistics and reduce the relative contribution of person-centric cues. This may lead to embeddings that partially encode background context, resulting in spurious similarities between tracklets recorded under similar lighting or camera conditions.

  5. Degraded motion cues:

    While the inclusion of optical-flow channels can enhance temporal modeling, motion cues are sensitive to image quality. Blur, compression artifacts, and low frame rates may degrade optical-flow reliability. In such scenarios, the motion channels may introduce inconsistent or distractive signals during temporal integration, weakening the overall spatiotemporal representation.

Overall, these failure modes highlight the inherent difficulty of video-based person re-ID in unconstrained environments. This diagnostic analysis delineates the practical boundaries of the proposed framework and motivates future research toward more robust modality fusion and occlusion-aware temporal modeling.

Feature-space visualization (e.g., t-SNE/UMAP) can offer additional qualitative insights into embedding clustering and overlap. In this revision, such visualizations are not included because per-tracklet 128-D sequence embeddings were not preserved in a visualization-ready format in the archived runs, and re-generating them requires rebuilding the original inference environment. As part of future work, embedding-space visualization will be conducted for MSP-SRC and representative baselines under the same test protocol to further analyze clustering and separability.

Conclusions

Reflecting on the results and discussions presented in Sect. 4, this study demonstrates that the MSP-SRC framework effectively combines multi-level feature aggregation with GRU-based temporal modeling to achieve robust video-based person re-ID. Evaluations on the PRID-2011 and iLIDS-VID datasets consistently indicate improvements over conventional and Siamese-based baselines, underscoring the contributions of low-level spatial cues, recurrent modeling, and mean temporal pooling. Beyond performance gains, the findings emphasize the importance of compact architectures that balance accuracy with efficiency, particularly in resource-constrained environments.

Looking forward, several key directions emerge to further advance this framework. First, future research will extend MSP-SRC to large-scale benchmarks such as MARS and cross-domain protocols to assess scalability and generalization beyond the datasets considered in this study. Second, extending the multi-level similarity perception mechanism to cross-modal scenarios, including RGB–infrared video, gait-based cues, and text–video descriptions, represents a promising avenue for leveraging complementary modalities under a unified framework. Third, adapting MSP-SRC to unsupervised and weakly supervised regimes via pseudo-labeling or self-supervised contrastive objectives would facilitate deployment in large camera networks where exhaustive annotation is impractical. Fourth, cross-backbone generalizability will be investigated by replacing the compact CNN encoder with lightweight transformer backbones (and CLIP-pretrained ViT variants where feasible), while retaining the proposed multi-level similarity aggregation and GRU-based temporal integration, to quantify the accuracy–efficiency trade-off under the same evaluation protocol. Finally, regarding real-world deployment, future work will focus on specific optimizations such as pruning and quantization, alongside systematic benchmarking of latency, throughput, and energy efficiency on embedded edge platforms. In parallel, hybrid architectures that integrate lightweight temporal self-attention into the compact CNN–GRU backbone will be investigated to enhance long-range temporal modeling while preserving the favorable accuracy–efficiency trade-off established in this work. Additionally, future work will explore whether advanced temporal aggregation paradigms from broader video understanding (e.g., multi-trajectory sequence modeling and holistic–partial temporal cue integration) can be adapted to video person re-ID while preserving the lightweight, low-parameter and low-FLOPs design philosophy of MSP-SRC.