Introduction

Pedestrian detection, tracking, and recognition are central issues in computer vision. While early studies primarily addressed single-camera scenarios, the focus has shifted toward multi-camera systems, where person re-identification (re-ID)1 plays a critical role in associating pedestrian trajectories across non-overlapping views. Compared with single-camera settings, multi-camera deployments present severe variations in lighting, viewpoint, and background, as well as blind zones between non-overlapping fields of view, making consistent trajectory association and identity matching challenging to achieve.

Person re-ID methodologies can be broadly classified into image-based and video-based approaches2. Image-based re-ID involves matching a single query frame against a set of gallery images, frequently employing supervised learning techniques to project features into a shared embedding space3,4. Although effective for static appearances, this approach is limited in its ability to leverage temporal cues and is less robust in the presence of occlusion or motion blur. Conversely, video-based re-ID encodes both spatial appearance and temporal dynamics from sequences of frames5, thereby facilitating more comprehensive identity representations.

A prevalent strategy in video-based re-ID involves the aggregation of spatial features extracted by Convolutional Neural Networks (CNNs) over time through average or max pooling6, though this method can be susceptible to noisy frames as noted in5. To enhance temporal modeling, recurrent neural networks (RNNs) have been integrated with CNN backbones, as demonstrated by the Siamese Recurrent Convolutional Network (SRC) developed by McLaughlin et al.5, which captures sequential dependencies between frames. However, standard RNNs are prone to vanishing gradient issues in long sequences. Gated architectures such as Long Short-Term Memory (LSTM)7 and Gated Recurrent Unit (GRU)8 mitigate this problem by incorporating memory gates, with LSTM offering greater capacity for large datasets and GRU providing faster convergence with fewer parameters, rendering GRU particularly effective in small-scale or real-time applications9. These recurrent designs have become foundational for video-based re-ID, forming the basis for our proposed compact multi-level Siamese CNN–GRU framework.

The Siamese network architecture10, which consists of two weight-sharing subnetworks, has been extensively utilized in person re-ID for similarity learning, particularly when training data is limited3,11,12,13. By projecting paired inputs into a shared embedding space and optimizing a verification loss, Siamese models effectively enforce small intra-class distances and large inter-class margins. This pairwise learning paradigm mitigates overfitting and enhances parameter efficiency, which is beneficial when per-identity samples are scarce. However, traditional Siamese frameworks often depend solely on high-level semantic features from deep CNN layers, neglecting low-level spatial cues, such as textures from hats, shoes, or carried objects, that remain consistent across camera views and are especially useful in challenging scenarios14.

To address these limitations, the proposed Multi-level Similarity Perception Siamese Recurrent Convolutional Network (MSP-SRC) incorporates a compact CNN backbone for spatial feature extraction alongside a GRU module for temporal modeling. As depicted in Fig. 1, an auxiliary pooling branch from the early convolutional layer (Pool1) captures low-level similarities, while GRU outputs are aggregated via temporal mean pooling to form high-level sequence embeddings that are less biased toward terminal frames. Both branches are trained concurrently using a combined identification–verification objective15, originally proposed in DeepID2 for face recognition, enabling the network to learn discriminative, noise-resilient representations for sequence-level matching. This multi-level design preserves fine-grained spatial details while simultaneously modeling long-range temporal dependencies, offering a balanced and robust solution for video-based re-ID.

Fig. 1

Architecture of the proposed MSP-SRC framework. (a) Training phase: The Siamese network processes an input pair \(\:({s}_{i},{s}_{j})\) of dimension \(\:T\times\:H\times\:W\times\:5\) (RGB + Optical Flow). It comprises a Shared CNN Backbone for spatial feature extraction and a GRU Temporal Module for sequence modeling. Low-level cues are captured via spatial pooling, while high-level temporal features are aggregated via Temporal Mean Pooling. (b) Testing phase: Probe and Gallery sequences are encoded into fixed-dimensional embeddings \(\:\left({\mathbb{R}}^{e}\right)\) for similarity matching using Euclidean Distance.

Recent transformer-based16,17,18,19,20,21,22 and attention-intensive architectures have achieved state-of-the-art performance in video-based person re-ID by modeling long-range temporal dependencies and refining part-level feature alignment. While effective in controlled benchmarks, these models typically incur high computational costs, large memory footprints, and longer inference latencies, which may limit their applicability in scenarios with constrained resources or limited training data. In contrast, the MSP-SRC framework emphasizes a balanced trade-off between recognition accuracy and architectural compactness, avoiding heavy attention modules while retaining both low-level spatial cues and high-level temporal contexts. Although on-device optimization is beyond the scope of this study, potential deployment in resource-constrained environments is discussed in the conclusion as future work.

The remainder of this paper is organized as follows: Sect. 2 reviews recent developments in video-based person re-ID, with an emphasis on Siamese architectures and multi-level similarity modeling. Section 3 describes the proposed MSP-SRC framework. Section 4 reports the experimental results and provides in-depth discussions of the findings. Section 5 concludes the paper and outlines directions for future research.

Related works

Person re-ID approaches are generally divided into image-based and video-based methods. Image-based methods operate on single frames and learn discriminative spatial features but cannot exploit temporal cues, making them vulnerable to occlusion, pose changes, and resolution loss. Video-based methods process sequences of consecutive frames to capture temporal cues, motion patterns, and cross-view consistency, yielding richer identity representations. However, they also face challenges such as noise, frame redundancy, and sequence misalignment, necessitating models that can jointly extract spatial and temporal features with strong discriminative power. The proposed MSP-SRC is designed for video-based re-ID, integrating low-level spatial cues with high-level temporal dynamics to improve recognition under challenging conditions.

Image-based person Re-ID

Building upon the foundational work of Gheissari et al.23, image-based re-ID involves matching a single image against a gallery under a closed-world assumption, utilizing feature extraction and metric learning to model pedestrian appearance. Traditional descriptors, such as color histograms and texture features, were frequently applied to segmented body parts (head, torso, legs)24,25, with similarity computed via learned metrics including KISSME26, LMNN27, ITML28, and LDML29. Methods like LOMO-XQDA30 enhanced efficiency through cross-view dimensionality reduction, while classifier-based approaches employed SVM or AdaBoost for discriminative matching31,32,33. Despite their contributions, handcrafted pipelines exhibit limited generalizability to large-scale, unconstrained environments.

The advent of deep learning marked a shift towards CNN-based frameworks such as DeepReID4 and PersonNet34, which learn hierarchical representations directly from data. Siamese CNNs3 improved discrimination with pairwise losses but underutilized label information, leading to the development of classification-based designs with domain-guided dropout35 and hybrid variants incorporating temporal or local feature matching modules11,12,36. Transformer-based architectures, including TransReID16, AAformer17, NFormer18, and PSTR19, further enhanced robustness through global context modeling and part-level alignment, although their computational demands pose challenges for scalability in resource-constrained scenarios.

Video-based person Re-ID

In contrast to image-based re-ID, which depends on single-frame appearance features, video-based methodologies leverage temporal cues across frame sequences to enhance robustness against occlusion, motion blur, and viewpoint variation. Initial studies developed sequence-level appearance models by aggregating handcrafted descriptors such as color histograms, covariance descriptors, or local keypoints across frames37,38,39,40. Techniques including geometric transformations40, Conditional Random Fields (CRFs)41, and part-based spatiotemporal modeling42 were introduced to ensure temporal consistency. Motion-based descriptors, such as HOG3D43 and Gait Energy Image44, were utilized to capture dynamic patterns, with gait periodicity employed for temporal alignment45,46,47.

With the rise of deep learning, CNN-based video re-ID pipelines have emerged. Baseline approaches aggregate CNN-extracted frame features into a sequence descriptor via average or max pooling6, which is computationally straightforward but susceptible to noisy or low-quality frames, as noted in5. Recurrent designs integrate CNNs with temporal learners to capture ordering information, as exemplified by the Siamese Recurrent Convolutional Network (SRC)5, Convolutional LSTM networks48, and GRU-based architectures13. Additional enhancements include Fisher Vector encoding48, temporal pyramid pooling49, and recurrent feature aggregation50.

Recent advancements extend beyond simple recurrence. Graph-based methods model frame-to-frame or part-to-part relations for spatiotemporal consistency, such as skeleton-based dynamic hypergraph networks51 and multi-granularity graph pooling52. Attention and transformer architectures capture long-range dependencies and part-level correspondences, as demonstrated in DenseIL20, MSTAT21, CAViT22, and enhanced video transformers53. These models improve temporal reasoning and fine-grained alignment but often incur high computational costs.

Further research has addressed domain shifts and modality gaps. Single-task joint learning54 and multi-domain learning55 aim to enhance cross-view generalization. Visible–infrared video re-ID with spatiotemporal and modality alignment56 bolsters robustness in low-light and cross-modal settings. Large-scale benchmarks such as MARS6 and MEVID57 reflect the trend toward realistic, noisy data, while self-supervised representation learning58 enhances generalization under limited labels.

Within this context, the proposed MSP-SRC follows a compact Siamese CNN–RNN design philosophy, integrating a lightweight CNN backbone with GRU-based temporal modeling. Its core novelty lies in multi-level similarity aggregation, which jointly leverages low-level spatial cues and high-level temporal embeddings for identity discrimination. This design mitigates sensitivity to redundant/noisy frames and partial misalignment across tracklets while retaining a small model footprint suitable for deployment on standard video benchmarks.

Temporal aggregation beyond video-based person Re-ID

Recent progress in broader video understanding has proposed temporal aggregation mechanisms that, although developed for different objectives than identity discrimination, may offer complementary insights for future video person re-ID research. For example, referring atomic video action recognition introduces temporally grounded modeling of a specified target and emphasizes semantic-aware temporal reasoning under multi-person interference59,60. Diffusion-based referring human action segmentation further highlights holistic–partial temporal modeling to cope with long-range ambiguity and complex interactions in crowded scenes61. In addition, few-shot adaptation for activity recognition across diverse domains stresses temporal representation robustness under distribution shifts and limited supervision62. While these studies do not directly address cross-camera identity matching, their temporal aggregation principles (e.g., multi-trajectory sequence modeling, holistic–partial cue integration, and robustness-oriented temporal learning) may be explored as future extensions for person re-ID. Importantly, the present work targets a different design point by proposing a compact CNN–GRU framework with multi-level similarity aggregation, explicitly prioritizing low parameter count and low FLOPs for resource-constrained deployment.

Siamese-based person Re-ID

Siamese networks, consisting of two subnetworks with shared weights, have emerged as a fundamental paradigm in person re-ID for learning discriminative embeddings with limited per-identity samples3,10. While initial image-based variants exhibited strong generalization capabilities in low-data scenarios11,12,35, their principles have been extended to video-based re-ID to incorporate temporal modeling. As elaborated in Sect. 2.2, recurrent Siamese designs—such as the SRC5, and architectures augmented with LSTM or GRU13,48—continue to serve as competitive sequence-matching baselines. However, these designs often emphasize high-level temporal embeddings while underutilizing complementary spatial details, which can reduce robustness in challenging conditions.

A primary limitation of traditional Siamese frameworks is their dependence on high-level semantic features extracted from deeper network layers, which often results in the neglect of fine-grained spatial cues, such as accessory textures or localized patterns, that remain consistent across camera views and facilitate recognition under occlusion or viewpoint changes. To address this issue, multi-level similarity perception approaches have been explored, introducing auxiliary branches that pool features from earlier convolutional layers alongside deeper temporal representations. Related strategies have been applied in video-based re-ID; for example, MG‑RAFA63 uses attention-guided aggregation of multi-granularity spatiotemporal features, and semantic–time fusion frameworks integrate multi-stage features with inter-frame attention to reduce redundancy. These designs inform the dual-branch scheme of the MSP-SRC: the low-level branch preserves detailed spatial cues, whereas the high-level branch captures temporal dependencies. Both are trained jointly under combined identification–verification objectives15, enhancing the discriminability and robustness to noisy frames.

The proposed MSP-SRC adheres to this paradigm, integrating a Pool1-based low-level similarity branch with a GRU-based high-level branch. Mean temporal pooling mitigates end-of-sequence bias, and joint loss optimization preserves complementary spatial and temporal cues. This dual-branch recurrent Siamese design addresses the typical weakness of discarding low-level details while maintaining computational efficiency for standard video-based re-ID benchmarks. Compared with existing CNN–RNN and CNN–GRU-based video re-ID frameworks that typically form a single high-level sequence embedding supervised by either classification or verification losses, MSP-SRC therefore differs in three key aspects: it explicitly preserves early-layer spatial features through a low-level similarity branch, employs a combined identification–verification objective inspired by DeepID215, and adopts a deliberately compact CNN–GRU configuration tailored to medium-scale video benchmarks and resource-constrained deployment.

Proposed MSP-SRC methodology

Building upon the design motivations in Sect. 2.3, the proposed MSP-SRC adopts a dual-branch Siamese framework for joint spatial–temporal modeling in video-based person re-ID. The following subsections present the architectural overview, input representation, spatial and temporal modeling components, multi-level similarity perception module, and optimization strategy.

Overall framework and input representation

The proposed MSP-SRC is a video-based person re-ID framework that jointly preserves low-level spatial cues and high-level temporal dynamics. Building on the SRC5, it employs a deeper CNN backbone to enhance spatial feature representation and integrates an auxiliary pooling branch to capture multi-level similarities. Temporal modeling is handled by a GRU, which balances efficiency with the capacity to model long-range dependencies.

The framework processes paired pedestrian sequences, each represented by five channels: three for RGB color and two for dense optical flow. The RGB channels, extracted from detected pedestrian bounding boxes, encode static appearance attributes such as clothing color, texture, and body shape. The optical flow channels store horizontal and vertical motion components derived from pixel displacement between consecutive frames, capturing short-term dynamics such as gait rhythm and movement direction. By jointly modeling appearance and motion cues, the framework encodes rich spatiotemporal features that enhance robustness to occlusion, pose variation, and illumination changes.
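The five-channel input described above can be sketched as follows. This is a minimal pure-Python illustration with toy values; the function names (`fuse_channels`, `make_clip`) are ours, not from the authors' code, and real inputs would come from pedestrian detections and a dense optical flow estimator.

```python
# Sketch of assembling the T x H x W x 5 input clip (3 RGB channels plus
# 2 optical-flow channels per pixel), as described for MSP-SRC.

def fuse_channels(rgb_frame, flow_frame):
    """Concatenate an HxWx3 RGB frame with an HxWx2 flow field -> HxWx5."""
    H, W = len(rgb_frame), len(rgb_frame[0])
    return [[rgb_frame[i][j] + flow_frame[i][j] for j in range(W)]
            for i in range(H)]

def make_clip(rgb_frames, flow_frames):
    """Stack T fused frames into a T x H x W x 5 clip."""
    return [fuse_channels(r, f) for r, f in zip(rgb_frames, flow_frames)]

# Toy example: T=2 frames of size H=2, W=2.
T, H, W = 2, 2, 2
rgb = [[[[0.1, 0.2, 0.3] for _ in range(W)] for _ in range(H)] for _ in range(T)]
flow = [[[[0.0, 0.0] for _ in range(W)] for _ in range(H)] for _ in range(T)]
clip = make_clip(rgb, flow)
assert len(clip) == T and len(clip[0]) == H and len(clip[0][0]) == W
assert len(clip[0][0][0]) == 5   # 3 RGB + 2 flow channels per pixel
```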

Within each Siamese branch, the CNN backbone extracts frame-level feature maps, with early-layer outputs pooled to form low-level sequence features that retain fine-grained visual details. Final-layer features are processed by the GRU to propagate temporal dependencies, and the hidden states are aggregated via temporal mean pooling to produce high-level sequence representations. The resulting multi-level embeddings from the two branches are concatenated and projected into a shared feature space for similarity computation. The model is trained with a combined identification–verification loss to enforce small intra-class distances and large inter-class margins. The overall architecture is illustrated in Fig. 1.

Spatial feature extraction via compact CNN backbone

Following the design in5, the CNN backbone of MSP-SRC, illustrated in Fig. 2, consists of three convolutional layers that progressively extract spatial features from pedestrian images. The detailed layer configuration is summarized in Table 1. The convolutional mapping from input image \(\:x\) to output feature vector \(\:f\) is expressed as \(\:f=C\left(x\right)\), where \(\:C\left(\bullet\:\right)\) denotes the convolutional transformation.

Since MSP-SRC is designed for video-based person re-ID, the input consists of sequential pedestrian images. A sequence \(\:s\) comprising T consecutive bounding box frames is represented as \(\:s=\{{s}^{\left(1\right)},{s}^{\left(2\right)},\dots\:{,s}^{\left(T\right)}\}\). Each frame \(\:{s}^{\left(t\right)}\) denotes the pedestrian image at time step t. After passing through the CNN backbone, each frame is encoded into a feature vector \(\:{f}^{\left(t\right)}=C\left({s}^{\left(t\right)}\right).\) Because all frames are processed with shared CNN parameters, the network ensures consistent feature extraction across the entire sequence. The resulting feature vectors \(\:{\left\{{f}^{\left(t\right)}\right\}}_{t=1}^{T}\) are projected into a lower-dimensional space before being forwarded to the recurrent module for temporal modeling. To reduce overfitting, dropout regularization is applied during this process64.

Fig. 2

Detailed CNN Backbone Architecture (Spatial Module). The network consists of three convolutional blocks with specific kernel and filter configurations. Input: A single frame of dimension \(\:128\times\:64\times\:5\). Layers: The network progressively extracts features using 16, 32, and 32 filters (\(\:5\times\:5\)), with spatial dimensions reduced via max-pooling (\(\:64\times\:32\to\:32\times\:16\to\:16\times\:8\)). Output: The final feature map is flattened into a 4,096-dimensional vector (\(\:{f}_{t}\)), which serves as the input to the temporal module.

Table 1 Detailed CNN architecture of MSP-SRC.

As detailed in Table 1, the CNN backbone employs three convolutional layers with progressively increasing receptive fields. The first layer captures fine-grained local patterns, such as textures and edges; the second layer aggregates mid-level semantics, such as body parts and poses; and the third layer encodes abstract identity-related features. This shallow yet structured design strikes a balance between efficiency and discriminative power; while deeper backbones could yield stronger abstractions, they would also incur higher computational costs. In video-based re-ID scenarios characterized by a limited number of training identities, a compact backbone serves as an effective regularization strategy, matching model capacity to the available supervision and mitigating the overfitting often observed with over-parameterized deep networks. This choice is also consistent with prior video-based person re-ID architectures on medium-scale benchmarks, which typically employ three- or four-layer CNN backbones trained from scratch. The three-layer configuration therefore keeps the MSP-SRC backbone compact, with limited depth and channel width. Crucially, this shallow architecture preserves fine-grained low-level spatial cues—such as clothing textures and accessories—that are explicitly leveraged by the multi-level similarity perception module (Sect. 3.4) to enhance discrimination, whereas such details might be attenuated in deeper semantic abstractions.
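As a sanity check on these dimensions, the arithmetic behind the 4,096-dimensional output in Fig. 2 can be sketched as follows. We assume "same"-padded convolutions (the paper does not state padding explicitly), so only the three 2×2 max-pooling stages reduce spatial size:

```python
# Output size of the backbone in Table 1 / Fig. 2: a 128x64x5 frame passes
# through three conv blocks (16, 32, 32 filters), each followed by 2x2
# max-pooling, ending at 16x8x32 = 4096 features per frame.

def backbone_output_dim(h=128, w=64, filters=(16, 32, 32)):
    for _ in filters:           # conv keeps H x W; 2x2 pooling halves both
        h, w = h // 2, w // 2
    return h * w * filters[-1]  # flatten the final feature map

assert backbone_output_dim() == 4096   # matches the 4,096-d f_t in Fig. 2
```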

Temporal modeling via GRU-based recurrent module

RNNs are particularly adept at modeling sequential data due to their feedback connections, which enable the retention of information across time steps—an ability that conventional CNNs, limited by fixed input-output mappings, do not possess. At each time step, the RNN updates its hidden state by integrating the current input with the accumulated temporal context. During training, the recurrent structure is unfolded into a feedforward network so that gradients can be propagated across time steps via backpropagation through time65. As depicted in Fig. 3, the recurrent module in MSP-SRC is instantiated as a GRU-based temporal memory unit, which facilitates information flow across pedestrian sequences. For notational clarity, we first recall the basic ungated RNN formulation implemented as a SimpleRNN cell, and then discuss its gated extensions, including LSTM and GRU.

Fig. 3

Temporal Modeling via GRU Unfolding. The diagram illustrates the recurrent processing of the sequence. Input Features (\(\:{f}_{t}\)) extracted from the CNN backbone are fed into GRU Cells (\(\:{h}_{t}\)) at each time step \(\:(t-1,t,\dots\:T)\). Unlike standard RNNs that only utilize the final state, the sequence of hidden states is forwarded to Temporal Mean Pooling to generate a robust sequence-level representation.

Let \(\:{f}^{\left(t\right)}\) denote the CNN-extracted feature vector from the input frame \(\:{s}^{\left(t\right)}\). The recurrent update is defined as:

$$\:{o}_{high}^{\left(t\right)}={W}_{i}{f}^{\left(t\right)}+{W}_{s}{r}^{(t-1)}$$
(1)
$$\:{r}^{\left(t\right)}=\tanh\left({o}_{high}^{\left(t\right)}\right)$$
(2)

where:

\(\:{o}_{high}^{\left(t\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the intermediate sequence-level feature,

\(\:{f}^{\left(t\right)}\in\:{\mathbb{R}}^{N\times\:1}\) is the CNN feature at time t,

\(\:{r}^{\left(t\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the hidden state at time t,

\(\:{r}^{\left(t-1\right)}\in\:{\mathbb{R}}^{e\times\:1}\) is the hidden state from the previous step,

\(\:{W}_{i}\in\:{\mathbb{R}}^{e\times\:N}\) projects the CNN feature into a lower-dimensional embedding,

\(\:{W}_{s}\in\:{\mathbb{R}}^{e\times\:e}\) transforms the recurrent state, and the initial hidden state \(\:{r}^{\left(0\right)}\) is initialized to zero.

Because \(\:{W}_{i}\) is rectangular (e < N), it effectively reduces dimensionality, ensuring computational efficiency. This recurrent update process is depicted in Fig. 4, which outlines the architecture of the employed recurrent unit.

Given an input feature sequence \(\:{\left\{{x}_{t}\right\}}_{t=1}^{T}\), a SimpleRNN cell updates its hidden state \(\:{h}_{t}\) by combining the current input with the previous hidden state:

$$\:{h}_{t}=\varphi\:({W}_{x}{x}_{t}+{W}_{h}{h}_{t-1}+{b}_{h})$$
(3)


where \(\:{W}_{x}\) and \(\:{W}_{h}\) denote the input-to-hidden and hidden-to-hidden weight matrices, \(\:{b}_{h}\) is a bias term, and \(\:\varphi\:(\bullet\:)\) is a non-linear activation function. The unfolded computation graph over time is illustrated schematically in Fig. 4. This SimpleRNN formulation provides a minimal recurrent baseline and serves both to introduce the notation for temporal updates and to act as a baseline temporal model in the ablation study (Sect. 4.3).
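To make Eq. (3) concrete, the following pure-Python sketch runs the SimpleRNN recurrence over a short sequence. Weights are random toy values and the dimensions are tiny; this illustrates the update rule only and is not the authors' implementation.

```python
import math
import random

# SimpleRNN update of Eq. (3): h_t = tanh(W_x x_t + W_h h_{t-1} + b_h).

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def simple_rnn(xs, Wx, Wh, bh):
    """Run the recurrence over a sequence, returning all hidden states."""
    h = [0.0] * len(bh)                 # h_0 initialised to zero
    states = []
    for x in xs:
        pre = [a + b + c
               for a, b, c in zip(matvec(Wx, x), matvec(Wh, h), bh)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

random.seed(0)
N, e, T = 4, 3, 5                       # input dim, hidden dim, seq length
Wx = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(e)]
Wh = [[random.uniform(-0.5, 0.5) for _ in range(e)] for _ in range(e)]
bh = [0.0] * e
xs = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(T)]
states = simple_rnn(xs, Wx, Wh, bh)
assert len(states) == T                 # one hidden state per time step
assert all(-1.0 < v < 1.0 for h in states for v in h)  # tanh-bounded
```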

Fig. 4

Schematic Diagram of the SimpleRNN cell. The basic ungated RNN formulation shown here is used to introduce the notation for recurrent state updates over time. In this work, SimpleRNN additionally serves as a baseline temporal model in the ablation study (Sect. 4.3), while the final MSP-SRC framework adopts a GRU unit as the recurrent module (Sect. 3.3.2).

LSTM

RNNs are constrained by the vanishing gradient problem when extended over lengthy sequences, which impairs their capacity to capture long-range dependencies. To mitigate this issue, LSTM networks incorporate a memory cell together with three gating mechanisms, specifically the forget gate, the input gate, and the output gate. These gates selectively retain historical information, integrate new inputs, and generate outputs, thereby enabling LSTMs to model long-term temporal dependencies more effectively than traditional RNNs. Despite their efficacy, LSTMs are characterized by a substantial number of parameters and relatively high computational demands.

GRU

To improve efficiency, GRUs simplify the LSTM architecture by merging the input and forget gates into an update gate and combining the cell state with the hidden state. This design reduces parameters while retaining the ability to capture temporal dependencies. The GRU is defined as:

$$\:{z}^{t}=\sigma\left({W}_{z}\left[{h}_{t-1},{f}_{t}\right]\right)$$
(4)
$$\:{r}^{t}=\sigma\left({W}_{r}\left[{h}_{t-1},{f}_{t}\right]\right)$$
(5)
$$\:{\tilde{h}}^{t}=\tanh\left({W}_{h}\left[{r}^{t}\odot\:{h}_{t-1},{f}_{t}\right]\right)$$
(6)
$$\:{h}^{t}=\left(1-{z}^{t}\right)\odot\:{h}_{t-1}+{z}^{t}\odot\:{\tilde{h}}^{t}$$
(7)

where \(\:{z}^{t}\) and \(\:{r}^{t}\) denote the update and reset gates, \(\:{\tilde{h}}^{t}\) is the candidate hidden state, \(\:{h}^{t}\) is the final hidden state at time t, and the gate products are applied element-wise. Compared with LSTM, GRU achieves similar or better accuracy with fewer parameters and faster convergence, making it a favorable choice for video-based person re-ID tasks. Accordingly, MSP-SRC instantiates the temporal module as a GRU-based recurrent unit in the final architecture, whereas SimpleRNN and LSTM are only used as baseline variants in the ablation study (Sect. 4.3) to assess the impact of recurrent capacity on sequence modeling. In addition, the temporal module is restricted to a single GRU layer with temporal mean pooling, which shortens the backpropagation path and, together with the gating mechanism, helps to alleviate vanishing gradients and temporal drift for the sequence lengths considered in this work.
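As a concrete illustration of Eqs. (4)–(7), the sketch below runs a GRU cell step by step in pure Python. Weights are random toy values; biases are omitted, matching the equations above, and `[h, f]` is implemented as list concatenation. This is an illustrative sketch, not the paper's code.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, f_t, Wz, Wr, Wh):
    """One GRU update following Eqs. (4)-(7); products are element-wise."""
    hf = h_prev + f_t                                   # [h_{t-1}, f_t]
    z = [sigmoid(a) for a in matvec(Wz, hf)]            # update gate
    r = [sigmoid(a) for a in matvec(Wr, hf)]            # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + f_t
    h_tilde = [math.tanh(a) for a in matvec(Wh, gated)]  # candidate state
    return [(1 - zi) * hi + zi * hti                    # interpolation
            for zi, hi, hti in zip(z, h_prev, h_tilde)]

random.seed(1)
e, N, T = 3, 4, 6            # hidden dim, CNN feature dim, sequence length
rand_mat = lambda rows, cols: [[random.uniform(-0.5, 0.5)
                                for _ in range(cols)] for _ in range(rows)]
Wz, Wr, Wh = (rand_mat(e, e + N) for _ in range(3))
h = [0.0] * e
seq = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(T)]
hidden_states = []
for f_t in seq:
    h = gru_step(h, f_t, Wz, Wr, Wh)
    hidden_states.append(h)
assert len(hidden_states) == T
assert all(-1.0 < v < 1.0 for hs in hidden_states for v in hs)
```

Note that the interpolation in Eq. (7) keeps every hidden coordinate inside (−1, 1): each step is a convex combination of the previous state and a tanh-bounded candidate, which is part of what makes the gated update numerically stable over long sequences.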

Multi-level similarity perception module

As identified in previous research14, Siamese-based person re-ID methodologies are typically categorized into two paradigms: (1) the direct computation of similarities on feature maps through global or local matching, and (2) the learning of compact feature embeddings optimized using pairwise or triplet loss functions. Although both paradigms demonstrate competitive performance, they frequently overlook low-level spatial cues that remain consistent across different views. To address this shortcoming, the MSP-SRC introduces a multi-level similarity perception mechanism. Specifically, an auxiliary branch extracts low-level features from the initial layers of the CNN, while high-level temporal features are aggregated from the recurrent module across all time steps, yielding complementary representations that jointly enhance similarity estimation.

Low-level similarity via feature pooling

Low-level features, derived from the earlier convolutional layers, are dense and informative in capturing fine-grained spatial details. Pooling operations applied to these features effectively capture regional patterns that complement the abstract representations obtained in deeper layers. Qualitative visualizations in Fig. 5 further illustrate the complementary roles of low-level and high-level features. Conv1 responses (low-level) strongly activate on fine-grained details such as the overall silhouette, clothing textures, and accessories (e.g., backpacks, shoes), capturing local appearance patterns that are highly informative for distinguishing visually similar pedestrians. Conv2 responses (mid-level) emphasize larger body regions and part configurations, while Conv3 responses (high-level) focus on more abstract semantic body regions and suppress some high-frequency texture details. This progression indicates that high-level features alone may abstract away critical local textures necessary for identity discrimination. Consequently, the proposed feature pooling branch operating on early-layer activations is essential to recover and preserve these discriminative low-level cues, ensuring robust matching even when high-level semantic patterns are similar across different identities.

Fig. 5

Visualization of feature activation maps across different CNN layers. The first column shows the input image. Subsequent columns display activation maps from Conv1 (low-level), Conv2 (mid-level), and Conv3 (high-level). Conv1 retains fine-grained identity cues such as clothing textures, contours of carried accessories (e.g., bags), and local color patterns, whereas Conv3 extracts more abstract semantic body regions. This qualitative progression from low-level to high-level responses illustrates the complementary roles of different layers and supports the need for the proposed multi-level aggregation, which preserves discriminative low-level details that might otherwise be attenuated in deeper abstractions.

Formally, for a sequence \(\:\text{s}=\left\{{s}^{\left(1\right)},{s}^{\left(2\right)},\dots\:,{s}^{\left(T\right)}\right\},\) each frame \(\:{s}^{\left(t\right)}\:\) is encoded by the CNN into an embedding vector \(\:{o}_{low}^{\left(t\right)}=l\left({s}^{\left(t\right)}\right)\), where \(\:l\left(\bullet\:\right)\:\)denotes the CNN and projection operation. The resulting set of vectors \(\:\left\{{o}_{low}^{\left(1\right)},{o}_{low}^{\left(2\right)},\dots\:,{o}_{low}^{\left(T\right)}\right\}\) is aggregated into a single low-level representation \(\:{v}_{{\text{s}}_{low}}\) using:

A. Mean pooling:

$$\:{v}_{{\text{s}}_{low}}=\frac{1}{T}\sum\:_{t=1}^{T}{o}_{low}^{\left(t\right)}$$
(8)

B. Max pooling:

$$\:{v}_{{\text{s}}_{low}}^{i}=\max\left({o}_{low}^{\left(1\right),i},{o}_{low}^{\left(2\right),i},\dots\:,{o}_{low}^{\left(T\right),i}\right),\:\:\:i\in\:\left\{1,2,\dots\:,e\right\}$$
(9)

The consolidated low-level representation is denoted as \(\:L\left(\text{s}\right)={v}_{{\text{s}}_{low}}\). Figure 6 illustrates the pooling module. This feature pooling operation converts sequences of arbitrary length into fixed-size embeddings, facilitating direct similarity comparison between sequences without requiring explicit frame alignment.
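Eqs. (8)–(9) amount to simple element-wise aggregation over the frame axis, as the following toy sketch shows. Both variants collapse a T × e sequence into a single e-dimensional vector, regardless of T:

```python
# Mean and max pooling over a sequence of per-frame embeddings o_low^(t),
# per Eqs. (8)-(9). Toy values; e = 2 feature dimensions, T = 3 frames.

def mean_pool(seq):
    T = len(seq)
    return [sum(frame[i] for frame in seq) / T for i in range(len(seq[0]))]

def max_pool(seq):
    return [max(frame[i] for frame in seq) for i in range(len(seq[0]))]

o_low = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
assert mean_pool(o_low) == [2.0, 2.0]   # per-dimension average over T
assert max_pool(o_low) == [3.0, 4.0]    # per-dimension maximum over T
```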

Fig. 6

Feature pooling branch.

High-level similarity via temporal pooling

Although RNNs capture sequential dependencies, they exhibit two main limitations for re-ID: (1) an overemphasis on terminal frames due to recursive updates, and (2) difficulty in modeling temporal information at multiple scales. To address these issues, the MSP-SRC incorporates a temporal pooling layer following the RNN. Instead of relying solely on the final hidden state, this layer aggregates outputs across all time steps, facilitating a balanced representation of temporal dependencies across the entire sequence. The mathematical formulation of the pooling strategies is presented below.

Given the set of recurrent outputs \(\:\left\{{o}_{high}^{\left(1\right)},{o}_{high}^{\left(2\right)},\dots\:{o}_{high}^{\left(T\right)}\right\}\), the aggregated high-level representation \(\:{v}_{{\text{s}}_{high}}\) is computed as:

A. Mean pooling:

$$\:{v}_{{\text{s}}_{high}}=\frac{1}{T}\sum\:_{t=1}^{T}{o}_{high}^{\left(t\right)}$$
(10)

B. Max pooling:

$$\:{v}_{{\text{s}}_{high}}^{i}=\text{max}\left({o}_{high}^{\left(1\right),i},{o}_{high}^{\left(2\right),i},\dots\:,{o}_{high}^{\left(T\right),i}\right)$$
(11)

The final high-level representation is denoted as \(\:R\left(\text{s}\right)={v}_{{\text{s}}_{high}}.\) This temporal pooling operation converts sequences of arbitrary length into fixed-size embeddings, facilitating direct similarity comparison between sequences without requiring explicit frame alignment.
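The high-level branch can be sketched as follows, assuming a bias-free single-layer GRU with randomly initialized weights (the function name, weight layout, and toy sizes are illustrative, not the paper's exact configuration). The key point is that all recurrent outputs are kept and mean-pooled as in Eq. (10), rather than keeping only the final hidden state.

```python
import numpy as np

def gru_forward(x_seq, Wz, Uz, Wr, Ur, Wh, Uh):
    """Minimal (bias-free) GRU forward pass over a sequence.

    x_seq: (T, N) inputs; W* map the input space (N), U* the hidden space (e).
    Returns all hidden states (T, e), i.e. the o_high^(t) of Eqs. (10)-(11).
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = np.zeros(Wz.shape[0])
    outputs = []
    for x in x_seq:
        z = sigmoid(Wz @ x + Uz @ h)               # update gate
        r = sigmoid(Wr @ x + Ur @ h)               # reset gate
        h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
        h = (1.0 - z) * h + z * h_tilde
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
N, e, T = 6, 4, 5  # illustrative input dim, hidden dim, sequence length
weights = [rng.normal(scale=0.1, size=s) for s in
           [(e, N), (e, e), (e, N), (e, e), (e, N), (e, e)]]
o_high = gru_forward(rng.normal(size=(T, N)), *weights)
v_high = o_high.mean(axis=0)  # Eq. (10): temporal mean pooling over all steps
```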

Joint identification–verification optimization strategy

Siamese training with contrastive loss

The MSP-SRC framework adopts a Siamese architecture composed of two weight-sharing subnetworks. Each training sample consists of a pair of pedestrian sequences \(\:\left({\text{s}}_{i},{\text{s}}_{j}\right)\), which are independently processed to produce sequence-level embeddings in a shared feature space. This design enables direct similarity computation via Euclidean distance.

To enforce discriminability, a contrastive loss minimizes intra-class distances while maintaining a margin between inter-class embeddings. Specifically, given the extracted low-level and high-level embeddings:

$$\:{v}_{i\_low}=L\left({\text{s}}_{i}\right),\:\:{v}_{i\_high}=R\left({\text{s}}_{i}\right)$$
(12)
$$\:{v}_{j\_low}=L\left({\text{s}}_{j}\right),\:\:{v}_{j\_high}=R\left({\text{s}}_{j}\right)$$
(13)

the intra-class loss for positive pairs \(\:(i=j)\) is

$$\:{E}^{intra}\left({v}_{i},{v}_{j}\right)=\frac{1}{2}{\left\|{v}_{i}-{v}_{j}\right\|}^{2}$$
(14)

while the inter-class loss for negative pairs \(\:(i\ne\:j)\) is

$$\:{E}^{inter}\left({v}_{i},{v}_{j}\right)=\frac{1}{2}{\left[\text{max}\left(m-\left\|{v}_{i}-{v}_{j}\right\|,0\right)\right]}^{2}$$
(15)

The overall Siamese loss is therefore:

$$\:E\left({v}_{i},{v}_{j}\right)=\left\{\begin{array}{c}\frac{1}{2}{\left\|{v}_{i}-{v}_{j}\right\|}^{2}\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:\:,\:i=j\\\:\frac{1}{2}{\left[\text{max}\left(m-\left\|{v}_{i}-{v}_{j}\right\|,0\right)\right]}^{2}\:\:,\:i\ne\:j\end{array}\right.$$
(16)

As defined in (13), the high-level sequence embedding \(\:R\left(s\right)\) is computed by applying temporal mean pooling to the sequence of GRU outputs \(\:{\left\{{h}_{t}\right\}}_{t=1}^{T}\) over all time steps. Unlike conventional RNN-based video re-ID models that derive the sequence representation solely from the final hidden state, this design prevents the representation from drifting toward the last few frames and encourages \(\:R\left(s\right)\) to summarize a global temporal consensus rather than a tail-dominated state. At inference, similarity between unseen pedestrian sequences is directly measured by Euclidean distance; smaller distances indicate higher identity likelihood.
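A minimal sketch of the contrastive objective of Eqs. (14)–(16), with Euclidean distance between the two sequence embeddings; the margin value used below is illustrative, as the paper does not state it here.

```python
import numpy as np

def siamese_contrastive_loss(v_i, v_j, same_identity, margin=2.0):
    """Contrastive loss of Eq. (16): pull positive pairs together,
    push negative pairs beyond a margin m (margin value is illustrative)."""
    d = np.linalg.norm(np.asarray(v_i, dtype=float) - np.asarray(v_j, dtype=float))
    if same_identity:                        # Eq. (14): intra-class term
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2   # Eq. (15): inter-class term

# Positive pair at distance 1 incurs loss 0.5; a negative pair already
# farther apart than the margin incurs zero loss.
assert siamese_contrastive_loss([0, 0], [1, 0], True) == 0.5
assert siamese_contrastive_loss([0, 0], [3, 0], False, margin=2.0) == 0.0
```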

Joint identification and verification loss

To jointly optimize classification and verification, MSP-SRC employs a combined loss inspired by DeepID215. This objective integrates a verification term based on contrastive distance with an identification term based on cross-entropy classification.

For identification, the recurrent feature vector \(\:v=R\left(s\right)\) is fed into a SoftMax classifier:

$$\:I\left(v\right)=P\left(q=c|v\right)=\frac{\text{exp}\left({W}_{c}v\right)}{{\sum\:}_{k=1}^{K}\text{exp}\left({W}_{k}v\right)}$$
(17)

where \(\:K\) is the number of identity classes, \(\:q\) is the ground-truth label, and \(\:W\) is the SoftMax weight matrix whose rows \(\:{W}_{c}\) and \(\:{W}_{k}\) hold the parameters for class \(\:c\) and class \(\:k\), respectively.

The overall loss for a pair \(\:\left({\text{s}}_{1},{\text{s}}_{2}\right)\) is formulated as:

$$\:Q\left({\text{s}}_{1},{\text{s}}_{2}\right)=E\left(L\left({\text{s}}_{1}\right),L\left({\text{s}}_{2}\right)\right)+E\left(R\left({\text{s}}_{1}\right),R\left({\text{s}}_{2}\right)\right)+I\left(R\left({\text{s}}_{1}\right)\right)+I\left(R\left({\text{s}}_{2}\right)\right)$$
(18)

Here, \(\:E\left(\cdot\:,\cdot\:\right)\) denotes the contrastive loss as previously defined, and \(\:I\left(\cdot\:\right)\) represents the classification loss. The identification and verification components are combined with a fixed 1:1 weighting (i.e., no loss-weight hyperparameter λ is introduced or tuned in this study), balancing classification accuracy with embedding separability. This follows the DeepID2 training strategy15, avoids tuning an additional hyperparameter on the relatively small PRID-2011 and iLIDS-VID training sets, and improves reproducibility under the adopted evaluation protocol.
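The joint objective of Eq. (18) can be sketched as below, assuming mean-pooled low-level embeddings \(L(\text{s})\), GRU-based high-level embeddings \(R(\text{s})\), and a bias-free SoftMax classifier; all sizes, names, and the margin are illustrative placeholders.

```python
import numpy as np

def softmax_ce(logits, label):
    """Cross-entropy identification loss I(.) for one sample (Eq. 17)."""
    z = logits - logits.max()                 # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def contrastive(v1, v2, same, m=2.0):
    """Verification loss E(.,.) of Eq. (16); margin m is illustrative."""
    d = np.linalg.norm(v1 - v2)
    return 0.5 * d ** 2 if same else 0.5 * max(m - d, 0.0) ** 2

def joint_loss(L1, L2, R1, R2, W, label1, label2, same):
    """Eq. (18): verification on low- and high-level embeddings plus
    identification on the recurrent embeddings, with fixed 1:1 weights."""
    return (contrastive(L1, L2, same) + contrastive(R1, R2, same)
            + softmax_ce(W @ R1, label1) + softmax_ce(W @ R2, label2))

rng = np.random.default_rng(1)
e, K = 8, 10                   # illustrative embedding size and identity count
W = rng.normal(size=(K, e))    # SoftMax weight matrix
L1, L2, R1, R2 = (rng.normal(size=e) for _ in range(4))
q = joint_loss(L1, L2, R1, R2, W, label1=3, label2=3, same=True)
```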

In this joint objective, the multi-level similarity perception module combines spatial and temporal cues through the complementary roles of \(\:L\left(s\right)\) and \(\:R\left(s\right)\). As defined in (12) and (13), \(\:L\left(s\right)\) denotes the low-level sequence embedding obtained by pooling early convolutional feature maps, thereby emphasizing fine-grained spatial appearance patterns such as textures, color configurations, and accessories. In contrast, \(\:R\left(s\right)\) represents the high-level sequence embedding produced by the GRU followed by temporal mean pooling, which captures semantic body structure and aggregated temporal dynamics. In (18), the verification terms \(\:E\left(L\left({\text{s}}_{1}\right),L\left({\text{s}}_{2}\right)\right)\) and \(\:E\left(R\left({\text{s}}_{1}\right),R\left({\text{s}}_{2}\right)\right)\) encourage both the low-level spatial branch and the high-level temporal branch to learn discriminative similarity structures for positive and negative pairs, while the identification terms \(\:I\left(R\left({\text{s}}_{1}\right)\right)\) and \(\:I\left(R\left({\text{s}}_{2}\right)\right)\) regularize the recurrent branch toward class-separable representations. By optimizing these components jointly, the framework aligns complementary low-level and high-level embeddings within a single multi-level similarity space, rather than relying on a single-branch representation.

Results and discussion

This section presents a comprehensive experimental evaluation of the proposed MSP-SRC framework with four objectives: (1) validating its effectiveness across multiple publicly available video-based person re-ID benchmarks, (2) comparing its performance with representative state-of-the-art methods, (3) analyzing the contribution of individual components through ablation and module-wise studies, and (4) assessing computational complexity for a fair efficiency comparison. The evaluation follows standard practices in the person re-ID community to ensure reproducibility and consistency with prior work.

The remainder of this section is organized as follows: Sect. 4.1 introduces the datasets, and Sect. 4.2 outlines the evaluation protocols. Sections 4.3 and 4.4 investigate the impact of input modalities, pooling strategies, and architectural configurations. Section 4.5 examines the discriminative capacity of low- and high-level features, while Sect. 4.6 explores the effect of varying sequence lengths. Section 4.7 presents ablation experiments quantifying the importance of each core module, and Sect. 4.8 compares MSP-SRC with recent state-of-the-art approaches, with emphasis on Siamese-based architectures and compact CNN–RNN designs. In addition to accuracy-oriented comparisons, a quantitative computational complexity analysis is presented in Sect. 4.8.1.

Video-based dataset selection

This study evaluates the proposed MSP-SRC framework on two publicly accessible video-based person re-ID benchmarks, PRID-201138 and iLIDS-VID66. These datasets offer a practical balance between sequence diversity and annotation reliability and are widely adopted in prior Siamese RNN-based research, providing manually annotated, relatively clean tracklets that allow the architectural contributions of the proposed CNN–RNN backbone and Multi-Level Similarity Perception (MSP) module to be isolated and analyzed under controlled conditions.

In contrast, the large-scale MARS benchmark67 is constructed from automatically generated tracklets produced by a detector–tracker pipeline, which introduces substantial label noise, ID switches, misalignments, and trajectory fragmentation.

While MARS is highly valuable for end-to-end detection–tracking–re-ID pipelines, such noise can obscure the specific effects of feature modeling within a Siamese framework trained in a pairwise manner. For this reason, the present study focuses on PRID-2011 and iLIDS-VID as canonical video-based benchmarks and leaves the extension of MSP-SRC to MARS and cross-domain evaluation protocols to future work.

PRID-2011 consists of pedestrian sequences captured by two static, non-overlapping cameras. Of the 245 identities appearing in both views, 200 are typically selected and evenly divided into training and testing sets. Sequence lengths range from 5 to 675 frames, with an average of 100.

The iLIDS-VID dataset comprises 300 identities, each represented by sequences from two disjoint camera views. Sequence lengths vary from 23 to 192 frames (average 73), with additional challenges such as occlusion, background clutter, and viewpoint variation.

Evaluation metrics

This study evaluates all methods using the Cumulative Match Characteristic (CMC) curve, with a particular focus on Rank-1 accuracy, which reflects the percentage of query sequences where the correct identity is retrieved at the top rank. During evaluation, each sequence is encoded into a feature vector, and pairwise Euclidean distances are computed to form a distance matrix. The ranked retrieval results are then used to construct the CMC curve. Following the standard protocol for PRID-2011 and iLIDS-VID38,66, the identities in each dataset are randomly partitioned into training and testing sets with a 50%/50% split, and this procedure is repeated for 10 independent trials. The CMC statistics (Rank-1, Rank-5, Rank-10, Rank-20) reported in Sect. 4.8 correspond to the mean accuracy over these 10 runs. A more detailed analysis of variance across runs and formal significance tests is left for future work. Since the adopted datasets typically provide only one ground-truth match per query, mean Average Precision (mAP) is not reported in this study.
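The CMC computation described above can be sketched as follows, assuming one ground-truth gallery match per query (as in the single-shot PRID-2011/iLIDS-VID protocol); the function name and toy distance matrix are illustrative.

```python
import numpy as np

def cmc_curve(dist, query_ids, gallery_ids, max_rank=20):
    """Cumulative Match Characteristic from a query x gallery distance matrix,
    assuming a single ground-truth match per query."""
    dist = np.asarray(dist, dtype=float)
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for qi, qid in enumerate(query_ids):
        order = np.argsort(dist[qi])            # gallery sorted by distance
        rank = np.where(gallery_ids[order] == qid)[0][0]
        if rank < max_rank:
            hits[rank] += 1
    return np.cumsum(hits) / len(query_ids)     # cmc[k-1] = Rank-k accuracy

# Toy 3-query example: queries 0 and 2 match at rank 1, query 1 only at rank 3.
dist = [[0.1, 0.9, 0.8],
        [0.2, 0.5, 0.3],
        [0.7, 0.6, 0.1]]
cmc = cmc_curve(dist, query_ids=[0, 1, 2], gallery_ids=[0, 1, 2], max_rank=3)
# cmc[0] is the Rank-1 accuracy
```

In practice the reported curves are the element-wise mean of such CMC vectors over the 10 random splits.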

Comparison of RNN variants and depths

This subsection compares different RNN variants and network depths to assess their impact on sequence modeling performance. Previous research9 indicates that LSTM and GRU generally outperform SimpleRNN in modeling sequential dependencies. To substantiate this claim, this study assessed SimpleRNN, LSTM, and GRU with depths ranging from one to four layers, employing 16-frame sequences for consistency.

As illustrated in Tables 2, 3 and 4, deeper neural networks exhibited a tendency toward overfitting. SimpleRNN demonstrated modest but stable improvements up to three layers, but its performance declined when extended to four layers. In contrast, both LSTM and GRU experienced reductions in accuracy with increased depth, suggesting limited benefit from additional layers on relatively small datasets. To supplement the tabulated data, Fig. 7 illustrates the impact of increasing recurrent depth on Rank-1 accuracy for the three variants. The trends confirm that SimpleRNN achieves marginal gains up to three layers but deteriorates beyond this point. Both LSTM and GRU show a decline with increased depth, with LSTM degrading more rapidly due to its higher parameterization. GRU consistently outperformed the other variants at shallow depths, highlighting its advantageous balance between model complexity and effective sequence modeling.

Table 2 Comparison of SimpleRNN depths.
Table 3 Comparison of LSTM depths.
Table 4 Comparison of GRU depths.
Table 5 Comparison of the best-performing RNN variants.

To ensure a fair comparison, the optimal configuration of each model was selected and summarized in Table 5. The results indicate that GRU achieved the highest Rank-1 accuracy across both PRID-2011 and iLIDS-VID datasets, while LSTM consistently underperformed. The observed performance differences can be attributed to model complexity. LSTM, with its extensive parameterization, requires larger training data and thus failed to generalize effectively given the relatively small datasets and short sequences used in this study. Conversely, GRU’s simplified gating mechanism reduces the parameter count and facilitates more efficient training, rendering it more adept at capturing the short-term dynamics, such as pose shifts, viewpoint changes, and illumination variations, that predominate in these benchmarks. Consequently, the results underscore that GRU offers the most favorable balance between representational capacity and efficiency for video-based person re-identification. Based on these findings, GRU is adopted as the recurrent module in subsequent experiments.

From a computational viewpoint, increasing recurrent depth also increases the number of trainable parameters and the cost of each forward and backward pass, so selecting a single-layer GRU as the temporal module offers the most favorable trade-off between recognition accuracy and recurrent complexity on PRID-2011 and iLIDS-VID.
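The parameter gap between the three recurrent cells can be made explicit with a standard per-layer count (one input matrix, one recurrent matrix, and one bias per gate or candidate); the sizes below are illustrative, not the exact MSP-SRC configuration.

```python
def rnn_params(variant, n_in, n_hidden):
    """Per-layer trainable parameters of common recurrent cells (with biases).

    SimpleRNN has 1 candidate transform, GRU has 3 (2 gates + candidate),
    LSTM has 4 (3 gates + candidate).
    """
    gates = {"simple": 1, "gru": 3, "lstm": 4}[variant]
    return gates * (n_in * n_hidden + n_hidden * n_hidden + n_hidden)

# Illustrative sizes: 128-d input features, 128-d hidden state.
n = e = 128
counts = {v: rnn_params(v, n, e) for v in ("simple", "gru", "lstm")}
# GRU needs 3/4 of the LSTM parameter budget, SimpleRNN only 1/4.
```

This fixed 4:3:1 ratio (LSTM : GRU : SimpleRNN) is consistent with the observation that LSTM's larger parameterization is harder to fit on small re-ID datasets.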

Fig. 7

Impact of recurrent depth on person re-ID accuracy. Rank-1 accuracy across 1–4 layers for SimpleRNN, LSTM, and GRU on (a) PRID-2011 and (b) iLIDS-VID. The visualization illustrates that GRU achieves the best accuracy at shallow depths, while LSTM suffers from pronounced degradation as layers increase.

Impact of multi-level feature aggregation

Recognition models have traditionally focused on high-level abstractions derived from deeper network layers. Nonetheless, earlier convolutional layers capture low-level cues, such as textures associated with hats, backpacks, or other local attributes, that provide stable and discriminative identity information. Integrating these fine-grained spatial details with high-level temporal representations produces more robust embeddings for person re-ID. Furthermore, introducing a pooling branch for low-level features offers auxiliary supervision, which facilitates gradient flow and accelerates convergence. These behaviors are consistent with the activation patterns observed in Fig. 5, where shallow-layer feature maps encode richer texture and accessory information, whereas deeper-layer maps concentrate on semantic body regions. The ablation results in Table 6 show that using only low-level or only high-level features leads to degraded Rank-1 performance, whereas combining both branches with the identification objective yields the best accuracy, confirming that low-level and high-level features provide complementary evidence for identity discrimination.

As demonstrated in Table 6, the combination of identity classification and multi-level similarity consistently achieves the highest accuracy on the iLIDS-VID and PRID-2011 benchmarks. In contrast, restricting the model to high-level features alone results in reduced performance, indicating that recurrent modules by themselves are insufficient to learn highly discriminative representations. These findings consistently underscore the effectiveness of multi-level feature aggregation in enhancing both model stability and recognition accuracy.

Quantitatively, the model shows high sensitivity to the feature combination strategy. This ablation can also be viewed as a zero-weight empirical check: removing a similarity branch is equivalent to assigning its contribution weight to zero while keeping the remaining training protocol unchanged. As shown in Table 6, assigning zero weight to either the low-level or the high-level similarity branch (i.e., using “ident + low” or “ident + high” only) reduces Rank-1 accuracy from 57% to 49% and 51% on iLIDS-VID, and from 76% to 54% and 68% on PRID-2011. Removing both similarity branches (“ident alone”) leads to a significant degradation, with Rank-1 dropping to 31% and 55% on iLIDS-VID and PRID-2011, respectively. These ablations confirm that concatenating identification, low-level, and high-level features—rather than relying on single-branch representations—is crucial for maximizing discrimination.

Table 6 CMC Rank-1 accuracy for different feature combination strategies.

To further assess the robustness of the proposed MSP-SRC, this study conducted 10 independent trials with random training/testing splits. The standard deviations for Rank-1 accuracy were 4.18% for PRID-2011 and 2.49% for iLIDS-VID. These variance measures indicate that the framework yields consistent performance across repeated runs, supporting its stability and reproducibility.

Pooling strategies for sequence embedding

Pooling strategies are essential for the consolidation of frame-level representations into concise sequence embeddings. As elaborated in Sect. 4.4, pooling branches were utilized to harness both low-level and high-level information, circumventing the recurrent module. To further investigate temporal aggregation, we assessed alternative pooling strategies applied to recurrent outputs, thereby ensuring that all time steps contribute to the final sequence representation. This approach addresses a common limitation of recurrent models, which often disproportionately emphasize terminal frames, even when informative cues are present earlier in the sequence.

Table 7 illustrates that mean pooling consistently outperforms max pooling across both low-level and high-level feature branches on the PRID-2011 and iLIDS-VID datasets. The advantage is particularly notable in the high-level branch, where stable averaging across frames mitigates the adverse effects of noise and occlusion. These findings validate mean pooling as a more reliable strategy for generating discriminative sequence embeddings, ultimately enhancing recognition accuracy.

Table 7 Comparison of pooling strategies for low- and high-level feature aggregation.

Effect of sequence length on re-ID accuracy

Building upon the prior analysis of feature aggregation and pooling strategies, this subsection examines the impact of sequence length on re-ID performance. As articulated in Sect. 4.3, recurrent networks necessitate adequate temporal context to effectively capture discriminative sequence features. This implies that augmenting the number of frames per sequence during inference could potentially enhance re-ID accuracy.

To investigate this effect, experiments were conducted using the iLIDS-VID dataset, wherein the sequence lengths of both query and gallery sets were varied. The lengths were adjusted to 1, 2, 4, 8, 16, and up to 128 frames. For sequences shorter than the target length, all available frames were utilized.

The results, as depicted in Table 8, reveal a distinct positive correlation between sequence length and recognition accuracy. Extending either the query or gallery length enhances Rank-1 accuracy, with the most significant improvements observed when both are increased. Notably, fixing the query length to a single frame while extending the gallery from 1 to 128 frames results in a 13% improvement in Rank-1 accuracy. Conversely, fixing the gallery length to a single frame while varying the query length yields only a 4% gain. These findings suggest that gallery sequence length exerts a more substantial influence on performance than query length.

To provide a clearer understanding of the results in Table 8, Fig. 8 illustrates the effect of varying sequence lengths for both the query and the gallery. As shown in Fig. 8a, maintaining a constant query sequence length while increasing the gallery length leads to consistent improvements in Rank-1 accuracy, confirming that longer gallery sequences capture more comprehensive temporal cues. Conversely, Fig. 8b shows that extending the query sequence length while keeping the gallery length constant yields relatively smaller gains, indicating that gallery diversity has a greater influence on recognition performance. Together, these visualizations highlight the asymmetric contribution of sequence length between queries and galleries, underscoring the importance of sufficient temporal information in the gallery for robust person re-identification.

Table 8 Rank-1 accuracy for different combinations of query and gallery sequence lengths (iLIDS-VID).

However, longer sequences increase the computational cost approximately linearly, since more frames must be processed by the convolutional backbone and propagated through the GRU. As a result, when deploying MSP-SRC on latency- or memory-constrained platforms, a moderate sequence length is preferred to balance the accuracy gains from longer temporal windows against the overhead of processing very long sequences.

This observation highlights the significance of temporal context and provides an empirical basis for the ablation studies in Sect. 4.7, where the contributions of core modules are systematically evaluated.

Fig. 8

Effect of sequence length on person re-ID accuracy. (a) Rank-1 accuracy for fixed query lengths with varying gallery lengths. (b) Rank-1 accuracy for fixed gallery lengths with varying query lengths. The visualizations confirm that extending gallery sequences yields greater performance improvements than extending query sequences, underscoring the dominant role of gallery temporal diversity in robust recognition.

Ablation study

Experiments were conducted on the PRID-2011 and iLIDS-VID datasets. As summarized in Table 9, four design aspects were systematically evaluated: the recurrent unit, multi-level feature aggregation, temporal pooling strategy, and sequence length during inference.

First, the GRU unit was replaced with SimpleRNN and LSTM under identical training conditions. The Rank-1 accuracy decreased from 76% to 55% on PRID-2011 and from 57% to 31% on iLIDS-VID when employing SimpleRNN, and further declined to 52% and 27%, respectively, with LSTM. These findings suggest that GRU offers a more advantageous balance between model capacity and convergence, particularly in data-constrained environments.

Second, removing the low-level feature branch while retaining only the high-level pathway reduced accuracy by 10% on PRID-2011 and by 3% on iLIDS-VID. This demonstrates that early-layer cues, such as textures and accessories, complement higher-level abstractions and are essential for robust identity discrimination.

Table 9 Ablation study of MSP-SRC on PRID-2011 and iLIDS-VID datasets.

Third, substituting mean pooling with max pooling consistently led to degraded performance. For example, Rank-1 accuracy decreased from 76% to 66% on PRID-2011 when max pooling was applied to high-level features, confirming the superior robustness of mean pooling against temporal noise and outlier frames.

Finally, reducing the gallery sequence to a single frame resulted in marked performance degradation, dropping to 63% on PRID-2011 and 46% on iLIDS-VID, compared to the 128-frame baseline. This highlights the necessity of temporal diversity for capturing comprehensive appearance patterns.

In summary, the ablation experiments confirm that GRU, multi-level aggregation, mean pooling, and longer sequence inputs each contribute significantly to the discriminative capacity of MSP-SRC. Their integration collectively enhances both model stability and recognition accuracy.

Comparison with state-of-the-art Siamese and conventional methods

To further substantiate the effectiveness of the proposed MSP-SRC framework, its performance was evaluated against both traditional video-based person re-ID methods and contemporary Siamese-based models. The assessments were conducted using the PRID-2011 and iLIDS-VID datasets, with CMC Rank-1 accuracy serving as the primary metric. Several classical methods, such as SRC5, LSTM48, LSTM+KISSME48, GRU13, RFA50, AMOC68, CNN + Euc6, and CNN+XQDA6, have shown competitive results on video-based person re-ID benchmarks. However, these methods often rely on deep or multi-stage architectures, including multi-branch attention modules (e.g., RFA and AMOC) or complex recurrent units (e.g., LSTM), which significantly increase model complexity and computational demands. For example, LSTM introduces multiple gating mechanisms, nearly doubling the parameter count compared to GRU, while AMOC utilizes cascaded CNNs with optical flow estimation, resulting in higher memory usage and inference latency. Despite these efforts, their recognition accuracy remains inferior to that of MSP-SRC, as depicted in Table 10. For transparency, it is clarified that the baseline numbers in Table 10 are quoted from the corresponding original papers under their reported settings, whereas MSP-SRC is evaluated under the standard protocol adopted throughout this work; therefore, results should be interpreted together with potential differences in evaluation protocols and input settings.

The MSP-SRC framework was also compared with several recent Siamese-based methods. Wang69 employs two identical CNN branches followed by LSTM modules. Zhang70 achieves slightly higher accuracy by explicitly modeling intra-video differences with a dual-path structure, although this design introduces significant computational overhead. Wu71 incorporates an attention mechanism that emphasizes salient regions and frames, yielding improved performance on iLIDS-VID under cluttered conditions but at the cost of additional complexity and latency. Li72 utilizes deep LSTM layers within a Siamese framework, which performs effectively on larger datasets but requires substantial training resources and memory.

As summarized in Table 10 and visualized in Fig. 9, MSP-SRC consistently outperforms conventional and recurrent baselines on both PRID-2011 and iLIDS-VID. Figure 9 plots the average CMC curves over 10 independent trials, comparing MSP-SRC against ten representative methods, including classical recurrent models (LSTM+KISSME48, GRU13, RFA50), conventional baselines (CNN+XQDA6, AMOC68), and recent Siamese variants (Wang69, Wu71, Zhang70, Li72). The curves reveal that MSP-SRC maintains a distinct performance advantage over classical approaches across the entire retrieval list (Rank-1 to Rank-20). Specifically, on the PRID-2011 dataset, MSP-SRC achieves performance comparable to recent Siamese methods70,72, while significantly outperforming baselines such as AMOC68 and CNN+XQDA6. On the challenging iLIDS-VID dataset, although MSP-SRC’s Rank-1 accuracy is slightly lower than that of some complex attention-based Siamese models70,71, this marginal difference must be viewed in light of the computational efficiency analyzed in Sect. 4.8.1. While methods such as Zhang70 and Li72 typically rely on deep backbones or heavy attention mechanisms, MSP-SRC is explicitly designed as a lightweight framework with orders of magnitude fewer parameters. Consequently, MSP-SRC not only consistently surpasses standard recurrent (e.g., GRU13, RFA50) and multi-stream baselines (e.g., AMOC68), but also establishes a highly favorable accuracy–efficiency trade-off suitable for resource-constrained deployment.

Table 10 Comparison with traditional and state-of-the-art Siamese-based methods.
Fig. 9

Average CMC curves over 10 independent trials on (a) PRID-2011 and (b) iLIDS-VID datasets. The curves compare MSP-SRC (red solid line) with a broad set of state-of-the-art methods, including Li72, Zhang70, Wu71, Wang69, and conventional baselines such as AMOC68, CNN+XQDA6, and RFA50. MSP-SRC demonstrates robust performance across all ranks, significantly outperforming classical baselines and achieving competitive accuracy against recent Siamese approaches.

Computational complexity analysis

To substantiate the efficiency claims of MSP-SRC, model complexity is analyzed in terms of both parameter counts and floating-point operations (FLOPs). Conventional image-based person re-ID pipelines typically adopt deep CNN backbones such as ResNet-50, which contains approximately 25.6 million parameters73. Recent transformer-based approaches, including TransReID16, commonly employ ViT-Base backbones with substantially larger parameter budgets (on the order of \(\:{10}^{8}\) parameters)74, leading to increased memory footprint and computational demand. Moreover, video-based baselines may further increase complexity through multi-stream designs that jointly process appearance and motion cues (e.g., optical flow) using cascaded CNN encoders68.

In contrast, MSP-SRC is deliberately constructed with a compact three-layer CNN backbone (Table 1) with 5 × 5 kernels and channel widths of 5→16→32→32, followed by a single GRU layer and lightweight multi-level pooling branches. As summarized in Table 11, the overall trainable parameter budget remains on the order of \(\:{10}^{6}\), at least one order of magnitude smaller than ResNet-50-based systems and one to two orders of magnitude smaller than the ViT-Base backbones used in transformer-based re-ID frameworks. This reduction in model size, while maintaining competitive Rank-1 accuracy on PRID-2011 and iLIDS-VID, positions MSP-SRC at a distinctly different point on the accuracy–efficiency trade-off curve compared with recent transformer- and attention-driven person re-ID models.

Beyond parameter counts, FLOPs are reported as a hardware-agnostic proxy for computational cost. Following a common convention, one multiply–accumulate (MAC) is counted as 2 FLOPs (one multiplication + one addition). For convolution layers, bias additions are explicitly included as one add per output activation. Specifically, for a 2D convolution with output size \(\:{H}_{out}\times\:{W}_{out}\), channels \(\:{C}_{in}\to\:{C}_{out}\), and kernel size \(\:K\times\:K\), FLOPs are computed as:

$$\:{FLOPs}_{conv}=2\times\:{H}_{out}\times\:{W}_{out}\times\:{C}_{out}\times\:\left({C}_{in}\times\:{K}^{2}\right)+{H}_{out}\times\:{W}_{out}\times\:{C}_{out}.$$
(19)

For the GRU, the dominant projection cost is approximated as \(\:{FLOPs}_{GRU}\approx\:6\times\:(N\times\:e+{e}^{2})\), where N is the input feature dimension and e is the hidden-state dimension. Under the operating input resolution of 128 × 64 × 5 (3 RGB + 2 optical-flow channels), MSP-SRC requires approximately 0.115 GFLOPs per frame per Siamese branch, derived via a layer-by-layer theoretical calculation. Table 11 further reports representative FLOPs of standard backbones (e.g., ResNet-50 and ViT-Base) using literature-reported values under the widely adopted single-crop 224 × 224 setting, which should be interpreted together with the corresponding resolution notes.
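The counting conventions above can be reproduced directly. The sketch below implements Eq. (19) and the GRU approximation; the first-layer evaluation assumes 'same' padding (so the 128 × 64 spatial size is preserved), which is an assumption on the stride/padding details given in Table 1.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """Eq. (19): 2 FLOPs per multiply-accumulate, plus one bias add
    per output activation."""
    macs = h_out * w_out * c_out * (c_in * k * k)
    bias_adds = h_out * w_out * c_out
    return 2 * macs + bias_adds

def gru_flops(n_in, e):
    """Dominant GRU projection cost: FLOPs ~ 6 * (N*e + e^2),
    i.e. 3 gate/candidate transforms at 2 FLOPs per MAC."""
    return 6 * (n_in * e + e * e)

# First MSP-SRC convolution under the stated input resolution: 128 x 64
# spatial size, 5 input channels (3 RGB + 2 optical flow), 5x5 kernel,
# 16 output maps. 'Same' padding is assumed for the output size.
f1 = conv_flops(128, 64, 5, 16, 5)  # ~32.9 MFLOPs for this layer
```

Summing such per-layer terms over the full backbone and the GRU is how the quoted ~0.115 GFLOPs per frame per branch figure is obtained.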

Table 11 Comparison of model complexity in terms of parameter counts and FLOPs. Baseline values are from the original papers16,73,75. MSP-SRC parameters follow Table 1, and MSP-SRC FLOPs are computed theoretically at 128 × 64 × 5. TE-TransReID FLOPs are estimated from its MobileNetV2 and truncated ViT-Small (4*) branches (see table notes).

From a deployment perspective, a parameter budget on the order of 1–2 million implies a weight storage footprint of only a few megabytes under 32-bit floating-point representation, and substantially lower under quantization. In addition, MSP-SRC relies on a shallow CNN backbone (maximum width 32) and a single GRU layer, yielding modest activation memory with complexity linear in the sequence length, in contrast to the quadratic token-interaction cost of transformer self-attention. These characteristics collectively support the feasibility of deploying MSP-SRC in resource-constrained environments, while a detailed device-specific latency benchmark is left for future work.
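The storage figures above follow from a one-line conversion; `weight_storage_mb` is a hypothetical helper name, and the parameter counts below are simply the 1–2 million band quoted in the text.

```python
def weight_storage_mb(n_params, bits_per_weight=32):
    """Weight storage footprint in megabytes for a given
    parameter count and numeric precision (bits per weight)."""
    return n_params * bits_per_weight / 8 / 1e6


print(weight_storage_mb(1_000_000))     # 4.0 MB at FP32
print(weight_storage_mb(2_000_000))     # 8.0 MB at FP32
print(weight_storage_mb(2_000_000, 8))  # 2.0 MB under 8-bit quantization
```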

Beyond the above quantitative and complexity analyses, the following subsection provides a qualitative diagnostic discussion and a failure-mode taxonomy to further clarify the limitations and boundary conditions of the proposed framework.

Qualitative discussion and failure-case analysis

Qualitative inspection is informative for understanding the boundary conditions of video-based person re-ID systems, particularly under challenging surveillance factors. In addition to the quantitative metrics reported above, this subsection summarizes typical failure modes observed on the PRID-2011 and iLIDS-VID benchmarks and provides a diagnostic analysis to clarify the limitations and boundary conditions of MSP-SRC and representative baselines.

  1. Heavy/long-term occlusion and fragmented tracklets:

    Severe occlusion (e.g., by other pedestrians or scene structures) can lead to fragmented tracklets where discriminative cues are intermittently lost. In such cases, temporal aggregation (e.g., GRU-based integration) must rely on a limited subset of informative frames. If the sequence is dominated by non-informative or partially occluded observations, the final embedding may lose the fine-grained anchors required for identity discrimination.

  2. Viewpoint changes and pose-induced appearance deformation:

    Large viewpoint shifts significantly alter the visibility of body parts and clothing textures. Although the proposed multi-level similarity aggregation exploits complementary cues across feature hierarchies to mitigate local mismatches, extreme transitions (e.g., side view to back view) may still cause inconsistent correspondences, especially when the tracklet provides only a limited stable appearance profile.

  3. Near-duplicate clothing and identity ambiguity:

    When different identities share highly similar clothing colors and textures (high inter-identity / inter-class similarity), appearance-based discrimination becomes intrinsically ambiguous. This issue is amplified in crowded scenes where the learned embeddings may exhibit reduced inter-class margins for a subset of identities, yielding unstable retrieval ranks even with temporal cues.

  4. Illumination shifts and background dominance:

    Abrupt illumination changes between non-overlapping camera views can distort color statistics and reduce the relative contribution of person-centric cues. This may lead to embeddings that partially encode background context, resulting in spurious similarities between tracklets recorded under similar lighting or camera conditions.

  5. Degraded motion cues:

    While the inclusion of optical-flow channels can enhance temporal modeling, motion cues are sensitive to image quality. Blur, compression artifacts, and low frame rates may degrade optical-flow reliability. In such scenarios, the motion channels may introduce inconsistent or distractive signals during temporal integration, weakening the overall spatiotemporal representation.

Overall, these failure modes highlight the inherent difficulty of video-based person re-ID in unconstrained environments. This diagnostic analysis delineates the practical boundaries of the proposed framework and motivates future research toward more robust modality fusion and occlusion-aware temporal modeling.

Feature-space visualization (e.g., t-SNE/UMAP) can offer additional qualitative insights into embedding clustering and overlap. In this revision, such visualizations are not included because per-tracklet 128-D sequence embeddings were not preserved in a visualization-ready format in the archived runs, and re-generating them requires rebuilding the original inference environment. As part of future work, embedding-space visualization will be conducted for MSP-SRC and representative baselines under the same test protocol to further analyze clustering and separability.

Conclusions

Reflecting on the results and discussions presented in Sect. 4, this study demonstrates that the MSP-SRC framework effectively combines multi-level feature aggregation with GRU-based temporal modeling to achieve robust video-based person re-ID. Evaluations on the PRID-2011 and iLIDS-VID datasets consistently indicate improvements over conventional and Siamese-based baselines, underscoring the contributions of low-level spatial cues, recurrent modeling, and mean temporal pooling. Beyond performance gains, the findings emphasize the importance of compact architectures that balance accuracy with efficiency, particularly in resource-constrained environments.

Looking forward, several key directions emerge to further advance this framework. First, future research will extend MSP-SRC to large-scale benchmarks such as MARS and cross-domain protocols to assess scalability and generalization beyond the datasets considered in this study. Second, extending the multi-level similarity perception mechanism to cross-modal scenarios, including RGB–infrared video, gait-based cues, and text–video descriptions, represents a promising avenue for leveraging complementary modalities under a unified framework. Third, adapting MSP-SRC to unsupervised and weakly supervised regimes via pseudo-labeling or self-supervised contrastive objectives would facilitate deployment in large camera networks where exhaustive annotation is impractical. Fourth, cross-backbone generalizability will be investigated by replacing the compact CNN encoder with lightweight transformer backbones (and CLIP-pretrained ViT variants where feasible), while retaining the proposed multi-level similarity aggregation and GRU-based temporal integration, to quantify the accuracy–efficiency trade-off under the same evaluation protocol. Finally, regarding real-world deployment, future work will focus on specific optimizations such as pruning and quantization, alongside systematic benchmarking of latency, throughput, and energy efficiency on embedded edge platforms. In parallel, hybrid architectures that integrate lightweight temporal self-attention into the compact CNN–GRU backbone will be investigated to enhance long-range temporal modeling while preserving the favorable accuracy–efficiency trade-off established in this work. Additionally, future work will explore whether advanced temporal aggregation paradigms from broader video understanding (e.g., multi-trajectory sequence modeling and holistic–partial temporal cue integration) can be adapted to video person re-ID while preserving the lightweight, low-parameter and low-FLOPs design philosophy of MSP-SRC.