Abstract
Visible-infrared person re-identification (VI-ReID) plays a crucial role in cross-modal surveillance by matching individuals between visible and infrared imagery. Although progress has been made, existing methods struggle with several limitations: (1) modality heterogeneity hinders robust cross-spectral feature alignment; (2) attention mechanisms often lack adaptability to dynamic cross-modal contexts; (3) loss functions are combined heuristically without optimal weighting. To overcome these limitations, we introduce DASF-AKSA, a novel framework featuring Dynamic Adaptive Synergistic Fusion and Adaptive Kernel Selection Attention. The DASF module enables content-aware cross-modal fusion through learnable channel switching, while AKSA employs parallel 1D convolutions to capture multi-scale channel contexts with minimal computational overhead. We further design a Quadruple Balance-Optimized Loss framework with weights derived from gradient characteristic analysis, systematically balancing identification, center triplet, supervised contrastive, and margin-based MMD losses for stable multi-objective optimization. Experiments on the SYSU-MM01, RegDB, and LLCM datasets show that our method consistently outperforms competitive baselines. On RegDB in particular, the DASF-AKSA network reaches 96.20% Rank-1 accuracy and 92.12% mAP, surpassing most existing approaches. Ablation studies, gradient visualization, Grad-CAM heatmaps under low-light conditions, and cross-architecture generalization tests confirm the contribution of each component.
Introduction
Person re-identification (ReID) tackles the problem of matching pedestrian images across non-overlapping camera views, forming a core technology for modern intelligent surveillance systems1,2. While single-modality ReID has achieved strong results in well-lit environments, 24/7 operational demands introduce a more difficult setting: aligning visible (RGB) images captured in daytime with infrared (IR) images acquired at night. This practical need has positioned Visible-Infrared Person Re-Identification (VI-ReID) as a critical but particularly challenging research area3,4.
Visible-Infrared Person Re-Identification (VI-ReID) emerged as a distinct research area around 2017, largely driven by the introduction of benchmark datasets like SYSU-MM013 and RegDB5. Initial approaches3,6 often relied on simple adaptations of single-modality networks, such as zero-padding infrared images or using basic channel concatenation for cross-modal matching. These methods, however, yielded limited performance (\(\sim\)40% Rank-1 on SYSU-MM013). This outcome underscored the central challenge in VI-ReID: bridging the significant modality gap between visible and infrared images. This gap stems from their different imaging mechanisms—visible images record reflected light, whereas infrared images capture emitted thermal radiation—resulting in distinctly different visual appearances for the same person.
Between 2018 and 2020, dual-stream architectures emerged as a dominant strategy in VI-ReID to tackle the modality gap7,8,9. These models process visible and infrared inputs through separate subnetworks before mapping both into a shared embedding space. Methods like AlignGAN9 used adversarial learning for pixel-level alignment, while AGW4 integrated attention mechanisms to reweight informative features. By explicitly learning modality-invariant representations through contrastive or metric learning, these models raised performance substantially (\(\sim\)70% Rank-1 on RegDB4). However, these approaches shared a common limitation: in emphasizing modality-invariant representations, they often excessively suppressed modality-specific characteristics. This led to the loss of discriminative identity cues unique to each modality—such as distinct thermal signatures in infrared images—weakening the overall feature distinctiveness.
Between 2020 and 2022, the field shifted focus toward modality-specific information, employing disentanglement strategies to explicitly separate and utilize both shared and unique features across modalities10,11,12,13,14,15,16. Representative methods such as Hi-CMD10 with hierarchical cross-modality disentanglement and cm-SSFT11 with shared-specific feature transfer achieved competitive performance (\(\sim\)60% Rank-1 on the more challenging SYSU-MM0117). Despite these improvements, a critical limitation persisted: these methods relied on static fusion rules—fixed concatenation or element-wise addition—which lacked the adaptability needed to handle diverse pedestrian appearances, varying illumination, and camera viewpoint changes18,19.
To overcome the rigidity of static fusion rules, recent VI-ReID research has explored dynamic adaptive mechanisms that modulate feature integration according to input content. For instance, MUN19 introduces an auxiliary modality to dynamically unify cross-modal features, while IDKL18 purifies modality-specific representations to extract implicit discriminative knowledge. Subsequent efforts have also integrated large-scale foundation models20,21,22, utilizing visual-language pre-training via prompt learning and textual alignment to achieve strong performance. Despite these advances, fundamental architectural limitations remain inadequately addressed. Specifically, current methods still exhibit: (1) limited fusion granularity, lacking fine-grained channel-wise adaptive switching; (2) a single-scale processing tendency, ignoring complementary cues across different resolutions; and (3) rigid receptive fields in attention modules, restricting dynamic adaptation to cross-modal contexts.
To address these persistent challenges, we introduce a framework centered on three key innovations. First, our Dynamic Adaptive Synergistic Fusion (DASF) module enables fine-grained channel switching through learnable per-layer thresholds (\(\tau _l\)), allowing input-aware cross-modal interaction. Unlike prior dynamic methods18,19 that operate primarily at the feature-map level, DASF integrates channel-level switching with spatial attention to achieve more adaptive feature fusion.
Our second contribution, the Adaptive Kernel Selection Attention (AKSA), tackles the single-scale limitation of existing methods by adaptively blending outputs from parallel 1D convolutional pathways with kernels \(k \in \{3,5,7\}\). This design enables dynamic, input-dependent adjustment of receptive fields along the channel dimension, effectively capturing multi-scale contextual patterns critical for robust feature representation.
The third cornerstone of our framework is a Quadruple Balance-Optimized Loss (QBOL), a multi-objective function grounded in a principled weighting scheme. The weights—\(\alpha =0.5\) for identification, \(\beta =0.3\) for triplet metric learning, \(\gamma =0.05\) for supervised contrastive learning, and \(\delta =0.15\) for modality alignment—were determined through rigorous gradient analysis and sensitivity experiments (see Tables 7 and 8). This balanced configuration promotes stable convergence and ensures that the objectives of identification, metric learning, and modality alignment optimize in a complementary fashion, thereby resolving the persistent issue of ad-hoc loss function combination in VI-ReID.
Our initial experiments revealed a critical insight: architectural improvements alone (DASF+AKSA) achieved only modest gains across all datasets (Table 5). This indicated that while our modules provided more powerful feature representations, the optimization process itself remained a bottleneck due to the conventional practice of combining loss functions through heuristic weighting. The integration of QBOL proved transformative—the complete model demonstrates substantial improvements across all three benchmarks, which confirms that our architectural innovations and optimization strategy work synergistically to unlock the full potential of the proposed framework.
Evaluations across three standard benchmarks—SYSU-MM01, RegDB, and LLCM—attest to the effectiveness of our method, which establishes new state-of-the-art performance. On RegDB, the model achieves 96.20% Rank-1 accuracy and 92.12% mAP; under the challenging all-search mode of SYSU-MM01, it reaches 68.72% Rank-1 and 64.52% mAP; and on the recently introduced low-light LLCM dataset, it attains 58.40% Rank-1 and 62.29% mAP. Ablation studies confirm the contribution of each component, while integration tests demonstrate that our DASF module consistently enhances existing architectures, highlighting its strong generalizability. As illustrated in Fig. 6, the model reliably retrieves correct matches across the visible-infrared modality gap, even under challenging conditions.
The main contributions of this work are summarized as follows:
(1) We propose the Dynamic Adaptive Synergistic Fusion (DASF) module, which enables content-aware cross-modal interaction through dual-stage channel switching and spatial attention fusion, effectively addressing the limitation of fixed fusion strategies.
(2) We develop the Adaptive Kernel Selection Attention (AKSA) mechanism, which dynamically weights and combines features from parallel 1D convolutional pathways, capturing both fine-grained and global channel contexts and improving the model's adaptability to diverse cross-modal scenarios.
(3) We design the Quadruple Balance-Optimized Loss (QBOL) framework with empirically validated weight allocation, ensuring balanced learning across identification, metric learning, and modality alignment objectives.
(4) We achieve competitive performance on multiple VI-ReID benchmarks and provide comprehensive analyses and generalization studies to verify the effectiveness and versatility of our approach.
The remainder of this paper is organized as follows. The next section reviews related work on visible-infrared person re-identification (VI-ReID), attention mechanisms, multi-scale learning, and loss function design. The Methodology section details the proposed approach, and the Experiments section presents experimental results and analysis. The final section concludes the paper and discusses future research directions.
Related works
Visible-infrared person re-identification
Since Wu et al.3 introduced the SYSU-MM01 dataset and created the first benchmark for visible-infrared person re-identification (VI-ReID), this field has gained significant research attention. The fundamental difficulty arises from the inherent heterogeneity between imaging modalities: visible (RGB) images capture reflected light, whereas infrared (IR) sensors detect emitted thermal radiation. This difference produces a substantial modality gap that complicates direct feature matching4,10.
Initial approaches often relied on domain adaptation to learn shared representations. For example, Ye et al.7 designed a dual-path network that processes RGB and infrared inputs separately and integrates features at multiple levels, highlighting the value of preserving modality-specific properties while learning common features. Wang et al.9 proposed AlignGAN, which employs adversarial training to learn modality-invariant representations, effectively narrowing the distribution gap between RGB and infrared modalities. These studies established a foundational framework using separate feature extractors per modality while bringing their embedding spaces closer.
Recent advances in VI-ReID focus on refined cross-modal alignment strategies. Lu et al.11 introduced a cross-modality shared-specific feature transfer method (cm-SSFT) that preserves modality-specific characteristics while enhancing shared representation learning. Zhang et al.23 designed a Relation-Aware Global Attention (RGA) module to explicitly model part-wise relationships and alleviate spatial misalignment. The release of larger datasets such as RegDB5 and LLCM24 has accelerated progress, with current state-of-the-art methods exceeding 95% Rank-1 accuracy on RegDB and 65% on the more challenging SYSU-MM01. Emerging trends include hierarchical cross-modal decoupling10, dual-level discrepancy reduction25, modality restitution and compensation architectures26, and video-based frameworks that incorporate temporal cues27.
Attention mechanisms for cross-modal learning
Attention mechanisms have significantly advanced cross-modal feature learning in VI-ReID through adaptive feature selection. Early spatial attention approaches focused on identifying distinctive body regions while suppressing background interference. For instance, You et al.28 introduced a cross-modal attention network that employs semantic graph embedding to emphasize identity-related regions, improving matching by highlighting semantically important areas. Building on this, Park et al.29 proposed a relational module that models pairwise relationships between cross-modal local features, enabling more detailed correspondence learning.
Channel attention has proven equally valuable in cross-modal scenarios. Both Wang et al.25 and Li et al.30 introduced dual-attention modules integrating spatial and channel mechanisms to improve feature discrimination. Li et al.’s approach, tailored specifically for VI-ReID, uses channel attention to recalibrate feature channels, amplifying informative ones for better cross-modal matching. Further enhancing such systems, methods like asymmetric co-teaching31 and heterogeneous center loss32 have been incorporated as complementary strategies. Despite these advances, a reliance on static attention limits their ability to fully model dynamic cross-modal interactions.
More recently, cross-modal attention mechanisms have evolved to better model dependencies between modalities. For instance, Chen et al.33 employ self- and cross-attention to capture long-range context, while Guo et al.34 dynamically weight pixel pairs across modalities to compensate for modality gaps. Despite this progress, a reliance on fixed operations often limits their adaptability. Newer designs like position-aware cross-modal attention35 are now providing more flexible pathways.
Dynamic fusion and modality fairness
Dynamic fusion has gained traction in VI-ReID for its input-adaptive integration of cross-modal features, moving beyond uniform fusion rules. Ye et al.17 proposed a Dynamic Dual-Attentive Aggregation Graph (DDAG), which employs graph-based aggregation and global statistics to dynamically weight feature maps. In contrast, Channel Augmented Joint learning (CAJ)36 enriches representations via channel-level augmentation but relies on predefined, non-learnable fusion strategies. Yu et al.19 introduced a Modal Unifying Network (MUN) that harmonizes cross-modal representations by generating auxiliary modalities, though it operates mainly at the modality level and lacks finer, channel-wise adaptation. While these methods all emphasize adaptive fusion, they differ markedly in granularity—from feature-map and channel to modality level—and in learning mechanisms, which include graph aggregation, fixed augmentation, and auxiliary modality generation.
To ensure modality fairness in cross-modal fusion—where neither visible nor infrared imaging receives undue preference—several strategies have emerged. Wu et al.37 introduced a modality-balanced triplet loss in MPANet to explicitly balance discriminative power across modalities. Zhang et al.38 proposed the Feature-level Modality Compensation Network (FMCNet), compensating for modality-specific characteristics directly within feature representations. More recently, De et al.39 developed an Efficient Bilateral Cross-Modality Cluster Matching (EBCMCM) method, aligning cluster structures across modalities to promote equitable and accurate matching.
Multi-scale feature learning and hierarchical supervision
Multi-scale feature learning is considered essential for robust VI-ReID, as different network depths capture varying levels of semantic information. Conventional methods often use only the final convolutional layer, risking the loss of fine-grained details available in earlier layers4. To address this, hierarchical supervision strategies that incorporate intermediate features have gained popularity.
Hao et al.40 pioneered multi-scale feature fusion in VI-ReID, demonstrating the value of integrating features from different network depths to capture both low-level textures and high-level semantics. This direction was further advanced by Feng et al.8, who introduced a modality-specific framework that extracts and combines multi-scale features across layers to improve cross-modal matching.
Hierarchical knowledge distillation provides a framework for multi-scale learning in VI-ReID. To enhance cross-modal representations, Wei et al.41 proposed a syncretic modality collaborative learning (SMCL) model, which constructs a synthetic modality to guide modality-invariant feature learning. This idea is extended in mutual distillation methods: Fu et al.42 introduced Mutual Distillation learning for Person Re-identification (MDPR), enabling mutual guidance between hard local and soft multi-granular branches, while Gao et al.43 similarly developed deep mutual distillation (DMD) for unsupervised ReID using iteratively refined pseudo-labels. Together, these works demonstrate the role of hierarchical supervision in advancing multi-scale feature learning within distillation-based VI-ReID.
Recent work by Zhou et al.44 introduces a full-scale feature learning framework that leverages multi-scale attention to enrich representations. Still, prevailing multi-scale fusion strategies remain largely static and fail to adapt to distinct modality characteristics. Emerging methods, including patch mixing cross-modal learning45 and high-order structure learning46, now offer more refined fusion mechanisms.
Dynamic convolution represents a significant advance beyond static multi-scale architectures, enabling input-adaptive feature extraction through mechanisms such as parameter generation in Dynamic Convolution47 and dynamic routing in ODConv48. These methods have shown compelling results on general vision tasks and are increasingly applied to single-modality person re-identification49, where adaptive kernels help maintain discriminative power across varying viewpoints and lighting conditions. However, their practical adoption is often hindered by substantial computational overhead—typically increasing FLOPs by 2–3\(\times\)—and considerable training complexity. These limitations become especially critical in visible-infrared person re-identification, where modality heterogeneity, smaller dataset sizes, and real-world deployment requirements impose strict efficiency constraints. Consequently, while dynamic convolution maximizes adaptability through intensive parameter generation, the VI-ReID domain calls for a different design philosophy—one that strategically balances representational flexibility with computational practicality and training stability.
Loss function design for VI-ReID
Loss function design is crucial for learning discriminative cross-modal representations in VI-ReID. While cross-entropy-based identity loss forms the foundation in most approaches, it alone cannot overcome the significant modality gap that hinders cross-modal embedding learning50.
Triplet loss and ranking loss are commonly adopted to strengthen metric learning. Zhao et al.51 introduced a hard triplet loss for cross-modal scenarios that selects the most challenging positive and negative samples within a batch to form more effective training pairs. Liu et al.52 proposed a center triplet loss that uses class centers as anchors, reducing outlier effects and improving stability. Zhu et al.53 developed a hetero-center triplet loss that maintains separate modality-specific centers while promoting cross-modal alignment, directly addressing modality gaps. When multiple loss functions are combined, proper balancing becomes critical: recent work on end-to-end person search54 has demonstrated that gradient magnitudes can differ by over 6\(\times\) across different objectives, leading to suboptimal training dynamics if losses are naively weighted.
Contrastive learning proves effective in VI-ReID by learning from positive and negative pairs. Sun et al.55 incorporated textual cues to locate homogeneous features, proposing a Text-Image Alignment (TIA) module and a Local-Global Image Match (LGIM) module, reinforced by a Changeable Cross-modality Alignment Loss (CCAL). Qian et al.56 employed supervised contrastive learning for cross-modal re-identification, surpassing conventional triplet loss. However, most contrastive methods use uniform loss weights across layers, which may cause optimization conflicts. Emerging approaches like dense contrastive learning57 provide more refined similarity metrics.
To narrow the modality gap, domain alignment losses have been developed. Dai et al.6 employed adversarial learning to produce modality-invariant features through min-max optimization, while Zhang et al.58 designed a dual alignment framework using mutual information maximization and pseudo-label alignment between original and style-transferred images to reduce camera and domain shifts. However, these methods typically combine losses without a principled weighting strategy, often resulting in suboptimal training dynamics59. Recent solutions include consistency-driven feature regularization59 and multi-scale semantic correlation mining60. Parallel developments in multi-view learning61,62 further demonstrate the effectiveness of tensor-based frameworks that dynamically balance multi-source inputs according to their quality and informativeness, reinforcing the value of principled loss weighting for multi-objective optimization.
Methodology
Overall framework
Problem formulation
The goal of visible-infrared person re-identification (VI-ReID) is to match pedestrian images across visible and infrared modalities. Specifically, consider a set of visible images \({\mathscr {V}} = \{v_i\}_{i=1}^{N_v}\) and infrared images \({\mathscr {I}} = \{r_j\}_{j=1}^{N_r}\), each associated with identity labels \({\mathscr {Y}}^v = \{y_i^v\}_{i=1}^{N_v}\) and \({\mathscr {Y}}^r = \{y_j^r\}_{j=1}^{N_r}\). The objective is to learn a mapping function \(f: {\mathscr {X}} \rightarrow {\mathscr {F}}\) that embeds images from both modalities into a common feature space \({\mathscr {F}}\), where \({\mathscr {X}} = {\mathscr {V}} \cup {\mathscr {I}}\)3,4. The resulting feature representations should satisfy the following cross-modality matching criterion:
where \(d(\cdot , \cdot )\) represents a distance measure in the feature space (e.g., Euclidean or cosine distance); \(v_l\) denotes a visible image with identity \(y_l^v\); \(r_j\) and \(r_k\) are infrared images labeled \(y_j^r\) and \(y_k^r\), respectively. The inequality requires that for a given visible image \(v_l\), its feature lies closer to the feature of any infrared image \(r_j\) of the same identity than to that of any infrared image \(r_k\) of a different identity.
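One way to write this criterion explicitly, using the notation defined above, is:

```latex
d\big(f(v_l),\, f(r_j)\big) \;<\; d\big(f(v_l),\, f(r_k)\big),
\qquad \text{for all } j, k \ \text{with}\ y_j^r = y_l^v \ \text{and}\ y_k^r \ne y_l^v .
```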
Overview of network architecture
Our DASF-AKSA framework (Fig. 1) employs a dual-stream architecture with modality-specific and cross-modal pathways. Two ResNet-50 backbones independently extract visible features \(F_v^{(l)}\) and infrared features \(F_r^{(l)}\) at four stages (\(l \in \{1,2,3,4\}\)). These modality-specific branches serve as the primary feature extractors and remain active during both training and inference.
Training Phase. Given paired RGB-IR images, four DASF modules (inserted at each stage) generate fused features \(F_{\text {fused}}^{(l)}\) via: (1) learnable channel switching with per-stage thresholds \(\tau _l\); (2) AKSA-based multi-scale attention; (3) spatial attention. An FPN unifies multi-scale fused features into pyramid levels P0–P4. The network is optimized via QBOL, which supervises modality-specific features; the DASF-generated fused features, computed during forward propagation, serve as auxiliary gradient pathways during backpropagation—guiding both branches toward cross-modal invariance through gradient-based knowledge transfer without requiring explicit supervision.
Inference Phase. Only modality-specific branches are activated. For a query image \(I_q\) (e.g., infrared), features are extracted as \(f_q = \text {Backbone}_{\text {IR}}(I_q)\) without invoking DASF. Cross-modal matching between \(f_q\) and gallery features \(f_g\) is performed via cosine similarity in the learned common embedding space. This design addresses a key challenge: leveraging paired cross-modal data during training while maintaining single-modality input compatibility at test time.
The Quadruple Balance-Optimized Loss (QBOL) balances identification (\(L_{id}\)), center triplet (\(L_{tri}\)), supervised contrastive (\(L_{sup}\)), and margin-based MMD (\(L_{mmd}\)) losses with carefully calibrated weights \(\alpha , \beta , \gamma , \delta\) (where \(\alpha + \beta + \gamma + \delta = 1\)).
Network Architecture Overview. During training, the framework processes paired RGB-IR images through two modality-specific ResNet-50 backbones that extract stage-specific features \(F_v^{(l)}\) and \(F_r^{(l)}\) (\(l \in \{1,2,3,4\}\)). DASF modules (containing AKSA mechanisms) generate fused features at each stage, unified by an FPN. The network is optimized via QBOL (\(\alpha L_{id} + \beta L_{tri} + \gamma L_{sup} + \delta L_{mmd}\)) that supervises modality-specific features, while DASF-generated fused features serve as auxiliary gradient pathways during backpropagation. At inference, only modality-specific branches are activated—DASF and FPN are bypassed for single-modality feature extraction.
Dynamic adaptive synergistic fusion (DASF) module
Architecture overview
The DASF module serves as a training-time auxiliary component that facilitates cross-modal knowledge transfer by generating fused features from paired RGB-IR inputs. While inactive during inference, DASF provides gradient-based regularization during training—its fused features are computed in the forward pass and participate in the computational graph, enabling gradient flow during backpropagation that guides modality-specific branches toward cross-modal invariance. As depicted in Fig. 2, DASF operates in two stages: (1) channel switching with AKSA-based adaptive kernel selection, enabling dynamic cross-modal feature exchange; (2) spatial attention fusion, emphasizing discriminative regions. At layer l, DASF receives \(F_v^{(l)}, F_r^{(l)} \in {\mathbb {R}}^{B \times C \times H \times W}\) from modality-specific branches and outputs fused features \(F_{\text {fused}}^{(l)} \in {\mathbb {R}}^{B \times C \times H \times W}\).
DASF Module Architecture. During training, DASF receives RGB and IR features from modality-specific branches and generates fused features via two stages: (1) Channel Switching with AKSA adaptively exchanges cross-modal information at the channel level, producing modality-switched features \({\bar{F}}_v\) and \({\bar{F}}_r\); (2) Spatial Attention fuses the switched features through channel-wise average/max pooling and element-wise multiplication, producing the final fused output \(F_{\text {fused}}\). This fused representation provides gradient-based regularization, guiding modality-specific branches toward cross-modal invariance through backpropagation. At inference, DASF is inactive.
Channel switching with adaptive Kernel selection
The channel switching mechanism represents a substantial improvement over conventional fusion approaches by incorporating our novel Adaptive Kernel Selection Attention (AKSA) module. As illustrated in Fig. 3, this component processes input features from both modalities through three parallel pathways that collectively determine optimal fusion strategies. The detailed structure of the channel and position enhancement branches is shown in Fig. 4.
AKSA overview. The input feature maps are reshaped into a sequence and processed in parallel by a Channel Enhancement Branch and a Position Enhancement Branch. Their outputs are concatenated and passed through a lightweight gate network that adaptively fuses the two streams into a single feature. A subsequent global aggregation produces the weight tensor used by the following channel switching.
Detailed structure of the Channel Enhancement Branch and the Position Enhancement Branch in the AKSA module. The Channel Enhancement Branch implements a Multi-scale Channel Attention (MCA) pathway that aggregates global channel context and applies multi-scale filters to reweight informative channels. The Position Enhancement Branch implements a Positional Attention (PA) pathway that produces a position-aware map to emphasize discriminative regions. The two branch outputs are then concatenated and fused by a lightweight gating unit to form the final enhanced feature.
As shown in Fig. 3, the initial step converts the input feature maps into a sequential representation. Specifically, given the input features from the visible stream \(F_v \in {\mathbb {R}}^{B \times C \times H \times W}\) and those from the infrared stream \(F_r \in {\mathbb {R}}^{B \times C \times H \times W}\), each is transformed into a sequence representation via a reshape operation:
where \(S = H \times W\) represents the total number of spatial positions, \(B\) is the batch size, \(C\) denotes the number of channels, while \(H\) and \(W\) are the height and width of the feature map. The variable \(X\) refers to a generic feature map from either modality, and \({\tilde{X}}\) denotes the reshaped feature with spatial dimensions flattened into a sequence. Reshaping the input in this way allows efficient handling of spatial information as sequential data, which supports the subsequent use of attention mechanisms.
The adaptive kernel selection module is a key component of the channel enhancement branch and is responsible for generating weights through a refined process. Its detailed structure is shown in Fig. 4. Input sequences first undergo adaptive average pooling to produce a global context vector. This vector is then processed by two successive fully connected layers, with a ReLU activation inserted between them. The first layer reduces channel dimensionality from \(C\) to \(C/r\), where \(r\) is a reduction ratio (e.g., 16), forming a bottleneck, while the second layer projects features into a 3-dimensional space, corresponding to the three kernel choices. A softmax activation is applied to produce normalized weights representing the importance of each kernel:
where \(w_3\), \(w_5\), and \(w_7\) denote the weights assigned to kernel sizes 3, 5, and 7, respectively. These values, normalized via softmax, reflect the optimal blend of receptive fields for the given input.
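A minimal PyTorch-style sketch of this kernel-weight generator is given below; the class and variable names (e.g., KernelSelector) are illustrative and not taken from any released implementation.

```python
import torch
import torch.nn as nn

class KernelSelector(nn.Module):
    """Produces softmax weights (w3, w5, w7) over the three candidate kernel sizes."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)              # global context over the sequence
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 3),         # one logit per kernel size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, S) flattened feature sequence; returns (B, 3) normalized kernel weights
        context = self.pool(x).squeeze(-1)               # (B, C) global context vector
        return torch.softmax(self.fc(context), dim=-1)
```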
Fig. 4 also shows the parallel processing pathway of the multi-scale channel attention (MCA) module within the channel enhancement branch, utilizing convolutional branches. Firstly, global average pooling (GAP) computes channel-wise statistics:
where \(G\) contains spatially averaged features. After transposing \(G\), three separate 1D convolutions are applied:
where \(G^T\) is the transposed form of \(G\) to match the convolution dimensions, and \(f_3\), \(f_5\), \(f_7\) are the resulting features for each kernel. The channel attention map is formed by fusing these multi-scale outputs using the adaptive weights:
where \(\sigma\) is the sigmoid function, \(\odot\) denotes element-wise multiplication, and \(A_c\) highlights salient channels.
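Continuing the sketch above (imports as before, and reusing the hypothetical KernelSelector), the multi-scale channel attention pathway could be implemented along these lines; the ECA-style 1D convolutions act along the channel axis of the pooled statistics.

```python
class MultiScaleChannelAttention(nn.Module):
    """Blends 1D convolutions with kernels 3, 5, 7 using the adaptive kernel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.selector = KernelSelector(channels, reduction)
        self.convs = nn.ModuleList([
            nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
            for k in (3, 5, 7)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, S) -> channel attention map A_c: (B, C, 1)
        w = self.selector(x)                               # (B, 3) adaptive kernel weights
        g = x.mean(dim=-1, keepdim=True)                   # GAP over positions: (B, C, 1)
        g_t = g.transpose(1, 2)                            # (B, 1, C): convolve over channels
        branches = [conv(g_t) for conv in self.convs]      # f_3, f_5, f_7, each (B, 1, C)
        fused = sum(w[:, i].view(-1, 1, 1) * branches[i] for i in range(3))
        return torch.sigmoid(fused.transpose(1, 2))        # A_c in (0, 1), shape (B, C, 1)
```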
As detailed in Fig. 4, the Positional Attention (PA) module within the position enhancement branch operates in parallel to the Multi-scale Channel Attention (MCA) module within the channel enhancement branch to compute a spatial/sequence attention map \(P\) that highlights discriminative regions. In our implementation (see Fig. 4), PA consists of two lightweight 1D convolutions with a ReLU in between:
where the first convolution reduces input \({\tilde{X}}\)’s channel dimension from \(C\) to \(C/g\) ( \(g\) is a grouping factor for efficiency), followed by ReLU non-linearity; the second convolution produces a single-channel output, which is normalized via sigmoid activation \(\sigma\) to generate the final spatial attention weights. \(B\) is the batch size and \(S = H \times W\) is the total spatial positions. The output \(P\) assigns each spatial location a value in \([0, 1]\), with higher values indicating more important regions for re-identification.
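A corresponding sketch of the positional-attention branch is shown below (imports as in the earlier sketch); the kernel size of the two 1D convolutions is not specified in the text, so pointwise (kernel size 1) convolutions are assumed here.

```python
class PositionalAttention(nn.Module):
    """Two lightweight 1D convolutions producing a per-position weight map P in [0, 1]."""
    def __init__(self, channels: int, g: int = 4):
        super().__init__()
        self.reduce = nn.Conv1d(channels, channels // g, kernel_size=1)   # C -> C/g
        self.project = nn.Conv1d(channels // g, 1, kernel_size=1)         # -> single channel

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, S) -> position attention map P: (B, 1, S)
        h = torch.relu(self.reduce(x))
        return torch.sigmoid(self.project(h))
```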
Our AKSA model uses a dual-branch strategy to refine channel and spatial information separately (as shown in Fig. 3). The channel enhancement branch modulates the input sequence via element-wise multiplication with the channel attention map:
which yields a channel-enhanced feature map \(X_c\) that amplifies semantically meaningful channels. The position enhancement branch applies spatial recalibration:
which produces a position-enhanced feature map \(X_p\) that emphasizes discriminative spatial regions. These two representations complement each other: \(X_c\) enhances channel-wise contrast, while \(X_p\) preserves spatial structure.
The two feature maps are concatenated and fed into a gating network that learns a dynamic fusion ratio:
where \(\alpha\) represents the gating coefficients that balance the two enhanced representations. The final output of AKSA module is a weighted combination:
AKSA achieves adaptivity by learning input-dependent weights that combine outputs from fixed multi-scale kernels, differing from dynamic convolution47 where kernel parameters are generated per instance. This design offers greater training stability on VI-ReID’s limited cross-modal data and maintains computational efficiency suitable for real-time surveillance.
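Putting the pieces together, the AKSA forward pass can be sketched as follows, reusing the two hypothetical branch modules above. The gating unit is shown as a single pointwise convolution followed by a sigmoid, which is one plausible realization of the lightweight gate rather than the exact published design.

```python
class AKSA(nn.Module):
    """Adaptive Kernel Selection Attention: channel and positional branches with gated fusion."""
    def __init__(self, channels: int, reduction: int = 16, g: int = 4):
        super().__init__()
        self.mca = MultiScaleChannelAttention(channels, reduction)
        self.pa = PositionalAttention(channels, g)
        self.gate = nn.Conv1d(2 * channels, channels, kernel_size=1)  # assumed gate form

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, S) flattened feature sequence
        x_c = x * self.mca(x)                      # channel-enhanced features X_c
        x_p = x * self.pa(x)                       # position-enhanced features X_p
        alpha = torch.sigmoid(self.gate(torch.cat([x_c, x_p], dim=1)))  # gating coefficients
        return alpha * x_c + (1.0 - alpha) * x_p   # weighted combination of the two branches
```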
A channel switching mechanism then uses learned thresholding to select features (see Fig. 2). Global aggregation is performed by averaging over spatial dimensions:
Channel-wise masks are generated by comparing aggregated channel statistics with learned per-layer thresholds \(\tau _l\). Let \(W_v, W_r \in {\mathbb {R}}^{B \times C \times 1}\) be the channel statistics from visible and infrared streams (e.g., mean over spatial dimensions). These statistics are first normalized per-stage: \({\hat{W}}_v = (W_v - \mu _l)/\sigma _l\), \({\hat{W}}_r = (W_r - \mu _l)/\sigma _l\), where \((\mu _l, \sigma _l)\) are running statistics for stage \(l\). Binary masks are then computed as:
where \({\textbf{1}}[\cdot ]\) is an indicator function returning 1 if the condition holds, 0 otherwise. During training, differentiable soft masks are used instead (see next section). The masks determine cross-modal feature switching, producing fused feature maps \({\bar{F}}_v\) and \({\bar{F}}_r\).
Learnable channel-switch threshold \(\tau _l\).
The channel switching mechanism relies on a learned threshold \(\tau _l\) to generate binary masks for cross-modal feature selection. Let \(W\!\in \!{\mathbb {R}}^{B\times C\times 1}\) be the aggregated per-channel statistics at a fusion stage (e.g., mean over spatial positions). To make the binary masks differentiable during training, we compute a normalized statistic \({\hat{W}}\) and a soft mask S:
where \((\mu _l,\sigma _l)\) are per-stage running statistics computed using LayerNorm-style normalization to stabilize threshold learning across different feature scales, \(\tau _l\) is a learnable scalar shared across channels of stage l, and T is a temperature parameter that controls the softness of the thresholding operation.
The feature switching operation exchanges channels across modalities based on the computed masks. The switching formula is identical for both training and inference, but uses different mask types:
where \(\alpha _v, \alpha _r\) denote the mask values: soft continuous masks \(S_v, S_r \in [0,1]\) during training, or hard binary masks \(M_v, M_r \in \{0,1\}\) at inference. When \(\alpha _v = 1\), the visible stream fully switches to infrared features \(F_r\); when \(\alpha _v = 0\), it preserves its original features \(F_v\). During training, we use soft masks for differentiability:
where \(\sigma (\cdot )\) is the sigmoid function and T is a temperature parameter. At inference, hard binary masks \(M_v = {\textbf{1}}[{\hat{W}}_v < \tau _l]\) and \(M_r = {\textbf{1}}[{\hat{W}}_r < \tau _l]\) replace soft masks for computational efficiency.
The threshold \(\tau _l\) is initialized to 0 (in the normalized space where \({\hat{W}}\) has zero mean) and learned end-to-end during training. We anneal the temperature T linearly from 2.0 to 0.5 over the course of training to progressively sharpen the soft masks toward binary decisions, thereby reducing the train-inference mismatch.
For stability, we apply several regularization techniques: (1) weight decay (\(5\times 10^{-4}\)) on \(\tau _l\) to prevent excessive growth; (2) global gradient clipping with norm 5.0 to stabilize training dynamics; (3) a short warm-up period (10 epochs) to allow the model to adapt to the switching mechanism gradually; and (4) a light sparsity prior \({\mathscr {L}}_{\text {sparsity}}=|\text {mean}(S)-\rho |\) with \(\rho =0.5\) and coefficient 0.01, added to the total loss to prevent degenerate solutions where all channels are switched on or off.
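For concreteness, the threshold-based switching with soft/hard masks and the sparsity prior can be sketched as below. This is a minimal illustration: per-stage running statistics \((\mu _l, \sigma _l)\) are replaced by batch statistics, and the function name is hypothetical.

```python
import torch

def channel_switch(f_v, f_r, tau, temperature=2.0, training=True):
    """Cross-modal channel switching with a learnable per-stage threshold tau (scalar)."""
    # Per-channel statistics: mean over spatial positions, shape (B, C).
    w_v = f_v.mean(dim=(2, 3))
    w_r = f_r.mean(dim=(2, 3))
    # Normalize (the full model uses per-stage running statistics instead).
    w_v = (w_v - w_v.mean()) / (w_v.std() + 1e-5)
    w_r = (w_r - w_r.mean()) / (w_r.std() + 1e-5)
    if training:   # soft, differentiable masks S = sigmoid((tau - W_hat) / T)
        a_v = torch.sigmoid((tau - w_v) / temperature)
        a_r = torch.sigmoid((tau - w_r) / temperature)
    else:          # hard binary masks M = 1[W_hat < tau] at inference
        a_v = (w_v < tau).float()
        a_r = (w_r < tau).float()
    a_v = a_v[:, :, None, None]                    # (B, C, 1, 1) for broadcasting
    a_r = a_r[:, :, None, None]
    # alpha = 1: take the other modality's channel; alpha = 0: keep the original channel.
    f_v_sw = a_v * f_r + (1.0 - a_v) * f_v
    f_r_sw = a_r * f_v + (1.0 - a_r) * f_r
    # Light sparsity prior |mean(S) - rho| with rho = 0.5.
    sparsity = (a_v.mean() - 0.5).abs() + (a_r.mean() - 0.5).abs()
    return f_v_sw, f_r_sw, sparsity
```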
We empirically compared three different granularities for \(\tau\):
Global \(\tau\): A single threshold value shared across all fusion stages and channels, computed as \(\tau = \text {mean}({\hat{W}})\) using the normalized channel statistics. This approach is simple but suffers from scale mismatch across different feature resolutions, leading to suboptimal performance in early stages.
Per-channel \(\tau _c\): Individual thresholds for each channel within each fusion stage, offering maximum flexibility. However, this introduces 8192 additional parameters in total (4 stages \(\times\) 2048 channels each), leading to severe overfitting and training instability in small-batch scenarios.
Per-layer \(\tau _l\): Our chosen approach uses a single learnable threshold per fusion stage, shared across all channels within that stage. This design balances adaptability with stability, enabling stage-specific switching behavior while preserving parameter efficiency. As shown in Table 6, this design achieves the best performance across all three datasets.
The stability advantages of per-layer \(\tau _l\) stem from several factors: (1) parameter efficiency (only 4 learnable scalars) reduces overfitting risk; (2) stage-specific adaptation allows the model to learn different switching behaviors for different feature resolutions; (3) gradient flow remains stable across training epochs.
The learned thresholds reveal a clear hierarchical progression across network depths, aligning with the intrinsic hierarchy of cross-modal features. Within the normalized space where \({\hat{W}}\) centers near zero, these layer-specific thresholds regulate cross-modal channel switching. This pattern resonates with established cross-modal learning principles4: shallow layers capture modality-specific patterns (e.g., color versus thermal texture), where selective preservation helps retain discriminative local cues; deeper layers encode more modality-invariant semantics (e.g., body structure), making cross-modal fusion—such as combining spatial details from RGB with illumination-robust thermal features—increasingly beneficial. As shown in our ablation studies, such data-driven, layer-wise adaptation is critical to achieving optimal performance.
Spatial attention and feature fusion
As outlined in Fig. 2, after channel switching, the spatial attention module guides the final integration of modality-specific features. First, the switched features from both modalities are concatenated along the channel dimension:
where \([\cdot ; \cdot ]\) represents concatenation across channels.
Design rationale: MCA vs PA.
The dual-branch design targets two orthogonal axes. MCA applies multi-scale channel-wise filtering on global channel statistics to enhance channel selectivity. In contrast, PA operates on the spatial/sequence axis with lightweight convolutions/projections to produce positional weights that emphasize discriminative regions. A gating unit fuses the two enhanced paths. This separation of concerns (channel vs position) makes the branches complementary rather than redundant.
The spatial attention mechanism uses two pooling operations to gather complementary spatial information. Average pooling across channels captures mean activation values:
where \(\text {CAP}(\cdot )\) refers to channel-wise average pooling, computing the mean per spatial position across all channels, and \(\sigma\) is the sigmoid activation. Similarly, max pooling highlights the most prominent features:
where \(\text {CMP}(\cdot )\) denotes channel-wise max pooling, which takes the maximum value per spatial location across channels. Both outputs are passed through a sigmoid to generate attention weights between 0 and 1, aiding gradient stability during training.
The attention weights are applied via element-wise multiplication. The concatenated features are modulated using both weight maps:
where \(\odot\) denotes element-wise multiplication. This two-part attention design helps the network focus on relevant spatial regions and ignore less useful areas, improving the discriminative quality of the features.
The final fusion combines the two modalities by averaging. The attended feature tensor is split back into visible and infrared components: \(F_{\text {attended}}[:,:C,:,:]\) for visible and \(F_{\text {attended}}[:, C:,:,:]\) for infrared. The unified representation is then formed by:
where averaging ensures two modalities contribute equally while keeping the output dimension consistent for later stages.
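For reference, this fusion stage maps to a few lines of PyTorch-style code (an illustrative sketch following the description above):

```python
import torch

def spatial_attention_fuse(f_v_sw, f_r_sw):
    """Fuse switched visible/infrared features via channel-wise avg/max spatial attention."""
    f_cat = torch.cat([f_v_sw, f_r_sw], dim=1)                    # (B, 2C, H, W)
    a_avg = torch.sigmoid(f_cat.mean(dim=1, keepdim=True))        # channel-wise average pooling
    a_max = torch.sigmoid(f_cat.max(dim=1, keepdim=True).values)  # channel-wise max pooling
    f_att = f_cat * a_avg * a_max                                 # modulate with both weight maps
    c = f_v_sw.size(1)
    return 0.5 * (f_att[:, :c] + f_att[:, c:])                    # average the two halves -> (B, C, H, W)
```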
This fusion method supports effective information sharing: each modality compensates for the other’s limitations while retaining its unique advantages. The spatial attention mechanism makes the fusion process adaptive to input content, narrowing the modality gap through refined feature integration.
Quadruple balance-optimized loss (QBOL) framework
Component loss functions
Our proposed Quadruple Balance-Optimized Loss (QBOL) framework integrates four complementary loss functions with systematically calibrated weights to ensure balanced optimization across different objectives.
Identification Loss \({\mathscr {L}}_{id}\) ensures that the model learns discriminative features for identity recognition. It is formulated as a classification loss over the defined identity classes:
which can be explicitly written using the softmax function as:
where \(N_{id}\) is the number of training samples, c denotes the ground-truth identity of sample i, \({\mathscr {S}}\) represents the softmax function, and \(W^{kl}\) is the weight matrix of the fully connected layer for the (l, k)-th branch, with \(W_c^{kl}\) specifically denoting the weight vector corresponding to identity c.
Center Triplet Loss \({\mathscr {L}}_{tri}\) is employed to enhance feature discriminability by pulling features closer to their class centers while pushing them away from other class centers. It is defined as:
where \(d(\cdot ,\cdot )\) is the Euclidean distance metric, \(f_i\) represents the feature embedding of the i-th sample, \(c_{y_i}\) is the cluster center of the ground-truth class of sample i, \(c_j\) denotes the cluster center of the j-th class, and m is the margin hyperparameter.
Supervised Contrastive Loss \({\mathscr {L}}_{sup}\) is utilized to improve the compactness of intra-class features and the separation of inter-class features in the embedding space. It is formulated as:
where \(P(i) = \{j \mid y_j = y_i, j \ne i\}\) is the set of samples sharing the same identity with sample i (excluding i itself), A(i) includes all samples in the batch except i, \(\cdot\) denotes the dot product, and \(T_{sup}\) is a temperature parameter for supervised contrastive learning.
Margin-Based Maximum Mean Discrepancy Loss \({\mathscr {L}}_{mmd}\) is applied to reduce the distribution discrepancy between features from visible and infrared modalities. It is defined as:
where \(\text {MMD}^2(\cdot ,\cdot )\) computes the squared maximum mean discrepancy with an RBF kernel between visible and infrared feature distributions, \(F_v\) and \(F_r\) represent the feature sets from visible and infrared modalities in the current batch, respectively, and \(\epsilon\) is a margin threshold.
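As a hedged illustration, the margin-based MMD term can be realized as a hinge on the squared RBF-kernel MMD, as sketched below; the kernel bandwidth, the single-bandwidth biased estimator, and the margin value shown are assumptions rather than settings reported here.

```python
import torch

def margin_mmd_loss(f_v, f_r, margin=0.1, sigma=1.0):
    """Hinge on squared RBF-kernel MMD between visible (N_v, d) and infrared (N_r, d) features."""
    def rbf(a, b):
        dist2 = torch.cdist(a, b, p=2).pow(2)          # pairwise squared Euclidean distances
        return torch.exp(-dist2 / (2.0 * sigma ** 2))  # single-bandwidth RBF kernel

    # Biased MMD^2 estimator, kept simple for illustration.
    mmd2 = rbf(f_v, f_v).mean() + rbf(f_r, f_r).mean() - 2.0 * rbf(f_v, f_r).mean()
    return torch.clamp(mmd2 - margin, min=0.0)         # zero once the gap falls within the margin
```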
QBOL optimization objective
Our training objective supervises modality-specific features extracted from the dual-stream backbones via the Quadruple Balance-Optimized Loss (QBOL) framework:

\({\mathscr {L}}_{\text {QBOL}} = \alpha {\mathscr {L}}_{id} + \beta {\mathscr {L}}_{tri} + \gamma {\mathscr {L}}_{sup} + \delta {\mathscr {L}}_{mmd},\)

where \(\alpha = 0.5\), \(\beta = 0.3\), \(\gamma = 0.05\), \(\delta = 0.15\) (empirically validated via ablation studies). All four loss components operate on the pooled features from modality-specific branches, which are used for inference. A light sparsity regularization \({\mathscr {L}}_{\text {sparsity}}=|\text {mean}(S)-\rho |\) (coefficient 0.01, \(\rho =0.5\)) stabilizes DASF channel switching, yielding the complete training objective: \({\mathscr {L}}_{\text {train}} = {\mathscr {L}}_{\text {QBOL}} + 0.01 \cdot {\mathscr {L}}_{\text {sparsity}}\).
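In code, the complete training objective amounts to the weighted sum stated above plus the sparsity term (a sketch; the four component losses are assumed to be computed elsewhere):

```python
def qbol_total_loss(l_id, l_tri, l_sup, l_mmd, l_sparsity,
                    alpha=0.5, beta=0.3, gamma=0.05, delta=0.15, lam_sparse=0.01):
    """Quadruple Balance-Optimized Loss plus the light sparsity regularizer."""
    l_qbol = alpha * l_id + beta * l_tri + gamma * l_sup + delta * l_mmd
    return l_qbol + lam_sparse * l_sparsity
```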
Theoretical basis for weight allocation
The principled weighting scheme represents QBOL’s core contribution, distinct from the heuristic weight combinations commonly employed in VI-ReID4. Such heuristic combinations often result in imbalanced training dynamics, where a single loss can dominate the gradient updates due to intrinsic scale disparities. QBOL mitigates this by allocating weights through systematic analysis of gradient characteristics and training behavior.
Gradient magnitude analysis. Different loss components naturally produce gradients at vastly different scales due to their inherent mathematical properties. The identification loss \({\mathscr {L}}_{id}\) typically exhibits relatively small gradients due to softmax saturation when features are not yet well-separated, while supervised contrastive loss \({\mathscr {L}}_{sup}\) generates substantially larger gradients as it operates on dense pairwise similarities across all batch samples. Without proper weighting, scale disparities cause certain objectives to dominate parameter updates, marginalizing the contribution of others that may be crucial for overall performance. Our weight allocation is designed to balance effective gradients, ensuring that \(\alpha \Vert \nabla {\mathscr {L}}_{id}\Vert\) and \(\gamma \Vert \nabla {\mathscr {L}}_{sup}\Vert\) achieve comparable magnitudes despite their different raw scales.
Training dynamics consideration. The center triplet loss \({\mathscr {L}}_{tri}\) exhibits temporal instability in early epochs due to dynamic center updates—class centers shift rapidly as feature representations evolve, causing high gradient variance. Assigning moderate weight (\(\beta =0.3\)) provides a consistent metric learning signal without introducing excessive oscillation. The margin-based MMD loss \({\mathscr {L}}_{mmd}\) operates on distribution-level alignment, which inherently converges slower than instance-level objectives. The weight \(\delta =0.15\) ensures continuous modality gap reduction throughout training without interfering with identity-specific discrimination learning in later stages.
Complementarity and synergy. The four objectives target orthogonal aspects: \({\mathscr {L}}_{id}\) (classification), \({\mathscr {L}}_{tri}\) (metric structure), \({\mathscr {L}}_{sup}\) (embedding compactness), and \({\mathscr {L}}_{mmd}\) (modality alignment). Our empirically validated weights (\(\alpha + \beta = 0.8\) for identity-related objectives, \(\gamma + \delta = 0.2\) for embedding refinement) reflect the finding that classification and metric learning should dominate early training to establish discriminative representations, while contrastive and alignment losses provide continuous refinement. This structured allocation prevents the common failure mode where uniform weights (\(\alpha =\beta =\gamma =\delta =0.25\)) lead to unstable convergence due to conflicting gradient directions in early epochs (see ablation in Table 7).
In summary, QBOL’s novelty lies not in the individual loss functions, but in the principled weighting scheme derived from gradient characteristics analysis and training dynamics understanding. This ensures stable, balanced optimization that leverages the complementary strengths of each objective—a contribution validated by consistent improvements over heuristic combinations across all three benchmarks (Table 7).
Implementation details
Our model is trained for 120 epochs using the SGD optimizer with standard VI-ReID settings (4 identities \(\times\) 4 images \(\times\) 2 modalities per batch). We employ multi-step learning rate decay with linear warm-up, standard data augmentation (flip, crop, erasing), and stability measures including gradient clipping and per-loss monitoring. The AKSA module uses reduction ratio \(r=16\) and group factor \(g=4\), with DASF modules inserted at the four ResNet stages. Detailed training hyperparameters, augmentation strategies, and architectural configurations are provided in the Experiments section.
Experiments
Computational complexity and efficiency
To address concerns about the computational overhead introduced by the parallel convolutional pathways in AKSA and the four-stage DASF deployment, we provide a detailed complexity analysis. Table 1 compares our method with representative VI-ReID approaches using input size \(1 \times 3 \times 288 \times 144\) (single forward pass, FLOPs computed as \(2\times\)MACs). All parameter counts exclude task-specific classifier heads for fair comparison.
Efficiency-Accuracy Trade-off Analysis. Table 1 compares the computational complexity and performance of recent VI-ReID methods. Lightweight approaches such as AGW (70.56M params, 10.36G FLOPs) and LbA (70.53M params, 10.35G FLOPs) achieve 70.05% and 74.17% accuracy with minimal computational overhead. In contrast, methods like SSRL (84.33M params, 20.72G FLOPs) and DEEN (74.85M params, 16.72G FLOPs) attain higher accuracy (93.64% and 91.10%) at the cost of substantially increased computation—SSRL requires \(2\times\) the FLOPs of AGW, while DEEN requires 1.6\(\times\).
Our approach achieves 96.20% accuracy with 11.99G FLOPs, requiring 15.7% more computation than AGW while delivering a +26.15% accuracy gain. Compared to SSRL, our method achieves +2.56% higher accuracy while using 42.1% fewer FLOPs (11.99G vs 20.72G), and outperforms DEEN by +5.10% with 28.3% lower computational cost (11.99G vs 16.72G). This efficiency stems from our design: rather than adding computationally expensive feature enhancement modules, DASF and AKSA introduce relatively lightweight adaptive mechanisms that modulate feature representations at multiple scales.
Practical Implications. The parameter count (97.14M) reflects our feature-rich architecture with FPN-based multi-scale fusion, which is essential for capturing discriminative cross-modal representations but primarily affects memory footprint rather than inference speed. For practical deployment, our method offers an attractive balance: compared to DEEN and SSRL which achieve similar accuracy ranges (91-94%) at 16-21G FLOPs, our approach delivers superior performance (96.20%) at significantly lower computational cost (11.99G FLOPs), enabling real-time inference on standard GPU hardware. This positions our framework as a practical solution for accuracy-critical VI-ReID applications where computational efficiency matters.
Experimental settings
Datasets
We evaluate our method on three standard benchmarks for cross-modal person re-identification.
SYSU-MM013 is a large-scale RGB-IR dataset containing 491 identities captured by 6 cameras (4 visible and 2 infrared), with 287,628 visible images and 15,792 infrared images. The training set includes 395 identities, with 22,258 visible and 11,909 infrared images. Evaluation is performed under two protocols: (1) All-Search, which matches infrared queries against all visible gallery images, and (2) Indoor-Search, where both queries and gallery are restricted to indoor scenes. Performance is reported under both single-shot and multi-shot settings.
RegDB5 contains 412 identities, each with 10 visible and 10 thermal images, yielding 4,120 images per modality. We follow the standard random split of 206 identities for training. Results are reported for both visible-to-infrared (V2I) and infrared-to-visible (I2V) retrieval directions.
LLCM24 focuses on low-light re-identification, with 713 identities captured under varied lighting conditions. It consists of 46,767 visible and 21,157 near-infrared (NIR) images. The dataset introduces challenges such as significant illumination changes and cluttered backgrounds, offering a rigorous test for cross-modality matching.
Evaluation metrics
Following standard VI-ReID evaluation protocols3,4, we report Rank-1 accuracy (%) and mean Average Precision (mAP, %) as evaluation metrics. All results represent the best performance from 10 independent runs, with improvements measured against top-performing baseline methods in each category.
Implementation details
Training schedule. We train for 120 epochs using SGD with initial learning rate 0.1, momentum 0.9, and weight decay \(5\times 10^{-4}\). We apply a linear warmup for 10 epochs, then a multi-step decay at epochs 40, 80, and 100, each decaying the learning rate by \(\times 0.1\) (i.e., \(0.1\!\rightarrow \!0.01\!\rightarrow \!0.001\)). All experiments are run with a fixed random seed for reproducibility, with cuDNN benchmark enabled for efficiency.
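The stated schedule corresponds to the following PyTorch-style sketch; the placeholder model and the LambdaLR formulation are illustrative choices, not the exact training script.

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 395)  # placeholder standing in for the DASF-AKSA network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up over the first 10 epochs, then x0.1 decay at epochs 40, 80, and 100."""
    if epoch < 10:
        return (epoch + 1) / 10.0
    return 0.1 ** sum(epoch >= m for m in (40, 80, 100))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(120):
    # ... one training epoch over paired RGB-IR mini-batches ...
    scheduler.step()
```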
Batch construction. Each mini-batch contains 4 identities \(\times\) 4 images \(\times\) 2 modalities \(=\) 32 images, following the standard VI-ReID protocol4. Crucially, for each identity, we sample 4 visible and 4 infrared images simultaneously, ensuring paired cross-modal inputs that enable DASF to generate fused features for auxiliary supervision. This pairing is only required during training—at test time, features are extracted independently per modality without pairing constraints.
Data augmentation. We resize all images to \(288\!\times \!144\) and apply the following augmentations: (i) random horizontal flip with \(p=0.5\); (ii) random padding-crop with 10-pixel padding; (iii) random erasing with \(p=0.5\), erasing area ratio in [0.02, 0.4] and aspect ratio in [0.3, 3.3]. We do not apply color jitter to infrared images to avoid distribution distortion. Unless otherwise noted, we do not employ MixUp or CutMix.
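With torchvision, these augmentations can be expressed as the following pipeline (a sketch using the stated parameters; the ImageNet normalization statistics are an assumption):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(10),                                   # 10-pixel padding
    transforms.RandomCrop((288, 144)),                    # padding-crop back to input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.4), ratio=(0.3, 3.3)),
])
```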
Architecture specifics. Within each DASF module, the AKSA mechanism uses reduction ratio \(r=16\) and group factor \(g=4\) for efficient channel and positional attention computation. Four DASF modules (DASF-0 to DASF-3, each containing an AKSA mechanism) are inserted after the four ResNet-50 backbone stages that produce feature maps at resolutions \(72\!\times \!36\), \(36\!\times \!18\), \(18\!\times \!9\), and \(9\!\times \!5\) with 256, 512, 1024, and 2048 channels respectively. The Feature Pyramid Network (FPN) unifies multi-scale features into five pyramid levels (P0–P4) at 256 channels each via \(1\!\times \!1\) lateral projections, top-down upsampling, and element-wise addition, enriching each level with both high-level semantics and fine-grained spatial details.
Stability measures. To ensure stable training, losses are computed in the sequential order: classification \(\rightarrow\) metric learning \(\rightarrow\) modality alignment. We adopt global gradient clipping (norm 5.0) and per-loss monitoring for balanced optimization. For the learnable channel-switch threshold \(\tau _l\), we use soft masks \(S = \sigma \left( \frac{\tau _l - {\hat{W}}}{T}\right)\) during training (with temperature T annealed from 2.0\(\rightarrow\)0.5) and hard binary masks \(M = {\textbf{1}}[{\hat{W}} < \tau _l]\) at inference. Additional regularization techniques include: (1) weight decay (\(5\times 10^{-4}\)) on \(\tau _l\); (2) global gradient clipping (norm 5.0); (3) 10-epoch warm-up; and (4) sparsity prior \({\mathscr {L}}_{\text {sparsity}}\) .
Inference and matching
Single-Modality Feature Extraction. At test time, we extract features using only modality-specific branches, completely bypassing DASF. For a query image \(I_q\) from modality \(m_q\) (e.g., infrared) and gallery image \(I_g\) from modality \(m_g\) (e.g., visible), we independently compute \(f_q = \text {Backbone}_{m_q}(I_q) \in {\mathbb {R}}^d\) and \(f_g = \text {Backbone}_{m_g}(I_g) \in {\mathbb {R}}^d\) where \(d{=}2048\). No cross-modal pairing or fusion is performed during test time.
Normalization and Distance. We apply L2 normalization to project features onto the unit hypersphere:
and compute cross-modal distance via negative cosine similarity:
Despite being extracted from separate branches, \(f_q\) and \(f_g\) share a common embedding space learned via: (i) shared classification heads enforcing semantic alignment; (ii) cross-modal triplet loss pulling same-identity pairs closer; (iii) gradient-based regularization from DASF-generated fused features during training, which implicitly guides both branches toward cross-modal invariance through backpropagation. For retrieval, we rank gallery images by ascending distance \(d\). For batched queries \(\{{{\tilde{f}}}_{q_i}\}_{i=1}^{N_q}\) and gallery \(\{\tilde{f}_{g_j}\}_{j=1}^{N_g}\), the distance matrix is computed as \(D_{ij} = -\,{\tilde{f}}_{q_i}^{\top } {\tilde{f}}_{g_j}\), for \(i=1,\dots ,N_q\) and \(j=1,\dots ,N_g\).
Guided by this matrix, the retrieval process is conducted consistently across all three datasets (SYSU-MM01, RegDB, and LLCM) and for both visible\(\rightarrow\)infrared and infrared\(\rightarrow\)visible matching directions.
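A minimal sketch of this matching step is given below, assuming per-modality feature matrices have already been extracted by the respective branches.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def match(query_feats, gallery_feats):
    """Rank gallery images for each query by negative cosine similarity.

    query_feats: (N_q, 2048) and gallery_feats: (N_g, 2048) tensors extracted
    independently per modality (no pairing at test time).
    """
    q = F.normalize(query_feats, dim=1)     # project onto the unit hypersphere
    g = F.normalize(gallery_feats, dim=1)
    dist = -q @ g.t()                       # D_ij = -cos(f_qi, f_gj)
    ranking = dist.argsort(dim=1)           # ascending distance per query row
    return dist, ranking
```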
Experimental results and analysis
Performance on SYSU-MM01
As shown in Table 2, our method achieves competitive performance on the challenging SYSU-MM01 dataset. Under the All-Search mode, our model attains 68.72% Rank-1 accuracy and 64.52% mAP, outperforming recent methods including MMM (+2.82% Rank-1, +2.72% mAP) and CIA (+0.53% Rank-1). Notably, our CNN-based approach surpasses transformer-based models CMTR (+3.27% Rank-1) and PMT (+1.19% Rank-1), demonstrating that well-designed convolutional architectures with effective attention mechanisms remain competitive against computationally heavier transformer frameworks.
The improvement is even clearer in the Indoor-Search mode, where we report 74.06% Rank-1 and 78.66% mAP. This corresponds to a Rank-1 increase of +0.78% over CIA (73.28%), +2.40% over PMT (71.66%), and +3.76% over MMM (70.30%). The stable performance across both outdoor and indoor settings underscores the capability of our DASF module in adapting to diverse environmental conditions.
Performance on RegDB
On the RegDB dataset (Table 3), our method achieves strong cross-modality matching performance, reaching 96.20% Rank-1 / 92.12% mAP in visible-to-infrared (V2I) mode. These results exceed the previous best CNN-based method CSVI by +4.79% Rank-1 / +6.98% mAP and the leading transformer-based method MIP by +4.22% Rank-1 / +4.01% mAP. In the infrared-to-visible (I2V) direction, we attain 95.28% Rank-1 / 90.83% mAP, again substantially outperforming CSVI (+5.22% / +6.97%) and MIP (+4.96% / +3.24%).
This strong performance on RegDB stems from several key elements: the DASF module successfully aligns cross-modal features even under significant domain shift through learnable per-layer thresholds \(\tau _l\); the AKSA mechanism integrates both local details and broader context through multi-scale processing with dynamic kernel selection; and the QBOL framework’s principled gradient-balanced optimization helps avoid overfitting to a single modality. The consistent gains across both V2I and I2V tasks further verify the symmetric design of our framework.
Performance on LLCM
On the challenging LLCM dataset featuring low-light conditions, our method achieves 57.46% Rank-1 accuracy and 61.05% mAP in V2I mode, and 49.98% Rank-1 with 57.16% mAP in I2V mode (Table 4). These results represent improvements of +0.96% Rank-1 and +1.25% mAP over CAJ in V2I mode, and +1.18% Rank-1 and +0.56% mAP over CAJ in I2V mode. The relatively larger performance gap in the more challenging I2V direction (+3.58% Rank-1 over AGW) demonstrates our method’s robustness in handling asymmetric cross-modal matching under adverse lighting conditions.
LLCM low-light scenario visualization analysis
To provide intuitive evidence for our model’s robustness under harsh low-light conditions, we conduct comprehensive visualization analysis on the LLCM dataset using Gradient-weighted Class Activation Mapping (Grad-CAM)77. The visualization reveals which regions the model focuses on when extracting discriminative features from low-quality images, and demonstrates how different components contribute to the model’s attention mechanism.
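For reference, a minimal Grad-CAM sketch is shown below; the choice of target layer and the identity-classification head producing the logits are interface assumptions rather than details reported in the paper.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, identity_index):
    """Minimal Grad-CAM sketch: weight target-layer activations by the spatially
    pooled gradients of the chosen identity score, then ReLU and upsample."""
    feats, grads = {}, {}

    def save_feat(module, inp, out):
        feats["a"] = out                       # activations of the target layer

    def save_grad(module, grad_in, grad_out):
        grads["g"] = grad_out[0]               # gradient of the score w.r.t. activations

    h1 = target_layer.register_forward_hook(save_feat)
    h2 = target_layer.register_full_backward_hook(save_grad)

    model.eval()
    logits = model(image.unsqueeze(0))         # (1, num_identities), assumed classifier output
    logits[0, identity_index].backward()
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)           # GAP of gradients per channel
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
```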
Feature Attention Visualization. Figure 5 presents Grad-CAM visualizations for several randomly selected challenging LLCM samples under varying low-light conditions. Each sample shows the original image alongside its Grad-CAM heatmap overlay, where red/warm regions indicate high activation on discriminative body parts (head, torso, legs) and blue/cool regions represent low activation on background or irrelevant areas.
The visualizations reveal three key insights: (1) Robust region localization—despite severe underexposure and low contrast in some difficult samples, the model consistently focuses on identity-relevant body regions rather than being distracted by lighting artifacts or background clutter. (2) Adaptive discrimination—the model identifies discriminative features even when fine-grained details are obscured (e.g., the first sample in the top row (leftmost pair) is heavily occluded and motion-blurred); this robustness validates the effectiveness of DASF’s channel-level adaptive fusion and AKSA’s multi-scale receptive field selection in harsh low-light scenarios. (3) Background suppression—across all samples, the model’s activation appears limited on surrounding environmental elements (e.g., lane lines, ground textures, vehicles), with background regions tending toward cool tones. This selective attention suggests that QBOL’s balanced optimization helps mitigate overfitting to spurious background correlations, as the model focuses more consistently on discriminative body regions.
Notably, even in extreme low-light conditions challenging for human observers, our model maintains robust matching capability. This resilience originates from our framework’s synergistic design: DASF preserves discriminative modality-specific features through channel switching, while AKSA captures both local textures and global structures via multi-scale attention.
LLCM Feature Attention Visualization via Grad-CAM. Grad-CAM heatmap visualizations on the LLCM dataset demonstrating our model’s attention patterns under low-light conditions. The figure displays eight challenging query samples (2 rows \(\times\) 4 columns). For each sample: (left) original RGB image, (right) Grad-CAM heatmap overlay with color-coded attention weights. Red/warm regions indicate high activation on discriminative body parts (head, torso, legs); blue/cool regions represent low activation on background or irrelevant areas.
Overall, our approach demonstrates consistent improvements across all three benchmark datasets, with particularly remarkable gains on RegDB where we achieve approximately +5% Rank-1 and +7% mAP improvements over the listed previous best results. The hybrid attention mechanism in DASF effectively addresses cross-modal alignment challenges while maintaining computational efficiency, making it suitable for practical applications.
Visualization of top retrieved examples
Figure 6 presents retrieval results for eight randomly selected queries from the SYSU-MM01 and RegDB datasets. In some cases, even human observers would find it difficult to verify matches based solely on color and body shape information, particularly in examples like ranks 5 and 6 in the sixth row of SYSU-MM01 all-search results. This demonstrates both the difficulty and practical importance of VI-ReID for nighttime surveillance applications. Our method achieves accurate identification for most query images, successfully retrieving correct matches while filtering out incorrect identities from the gallery. These visualizations provide qualitative evidence for the effectiveness of our proposed network architecture.
Top-10 retrieval examples from multiple datasets and modalities. For each query (left), the ten most similar gallery images are ranked left to right. Green borders indicate correct matches; red borders indicate incorrect matches. These results illustrate the model’s ability to match identities across modalities, even under challenging conditions like low illumination in the LLCM dataset.
Ablation study
Component contribution analysis
Table 5 presents a comprehensive ablation study analyzing the individual contributions of each component in our framework. The baseline configuration, which employs a standard dual-stream ResNet-50 architecture without our proposed modules, achieves 68.65% Rank-1 accuracy and 63.68% mAP on SYSU-MM01, 94.12% Rank-1 and 90.11% mAP on RegDB, and 58.28% Rank-1 with 61.27% mAP on LLCM.
Replacing the standard ECA attention78 with our AKSA module brings consistent improvements across all datasets. On RegDB V2I, AKSA provides +1.61% Rank-1 and +2.17% mAP gains over the ECA variant (95.28%/90.73% vs 93.67%/88.56%), validating the effectiveness of dynamic kernel selection and multi-scale processing. The QBOL framework also contributes significantly, particularly on the challenging LLCM dataset, where the loss improvement alone yields +0.95% Rank-1 and +1.54% mAP gains over the baseline.
The complete DASF-AKSA-QBOL framework achieves optimal performance through synergistic interactions between components. The improvements are most pronounced on RegDB, where all components work together to address the significant modality gap, resulting in 96.20% Rank-1 accuracy and 92.12% mAP.
Component-wise Ablation Visualization on LLCM. Grad-CAM heatmap comparison across five model configurations for six representative LLCM query images under low-light conditions, where each row shows attention maps from a different model configuration. Red/warm colors indicate high activation on discriminative regions; blue/cool colors represent low activation.
Visualization of Component Contributions. To provide intuitive evidence for these quantitative improvements, Fig. 7 presents comparative Grad-CAM visualizations across the five configurations in Table 5 on several randomly selected LLCM low-light samples. The progression reveals distinct contributions of each component through attention pattern analysis:
(1) Baseline exhibits scattered focus on background clutter (e.g., road markings, ground textures), indicating poor discrimination between foreground and background.
(2) Baseline+DASF(ECA) provides marginal improvement, with attention still dispersed across irrelevant regions. While the basic channel attention mechanism in ECA offers some modulation capability, it lacks the adaptivity to dynamically adjust receptive fields for multi-scale feature capture, resulting in persistent background interference and misidentification driven by overly prominent redundant features.
(3) Baseline+DASF(AKSA) enhances feature activation levels on body parts through adaptive multi-scale receptive fields. However, background suppression remains inadequate—redundant features in surrounding areas still show strong activations that may cause matching errors. This limitation reveals that architectural innovations alone, without proper optimization constraints, cannot fully eliminate false-positive activations.
(4) Baseline+QBOL similarly achieves high activation on discriminative regions through balanced loss weighting. The carefully calibrated loss combination encourages the model to focus on identity-relevant features while maintaining metric structure. Yet, this configuration fails to fully suppress background interference, as the loss balancing alone cannot address the architectural limitation of fixed receptive fields and static fusion strategies.
(5) Full Model (DASF+AKSA+QBOL) achieves optimal balance by not only focusing sharply on key body parts (head, torso, legs) with stable activation levels, but also effectively suppressing background features. The synergistic effect of combining adaptive fusion (DASF), multi-scale attention (AKSA), and balanced optimization (QBOL) enables the model to prevent misidentification caused by overly prominent redundant features. Specifically, AKSA provides adaptive receptive fields to capture discriminative regions at appropriate scales, while QBOL ensures that these regions receive consistent emphasis throughout training without gradient conflicts. This dual mechanism—architectural adaptivity plus optimization stability—produces the most balanced attention patterns with minimal background interference.
This visualization provides intuitive evidence for the quantitative ablation results in Table 5, demonstrating that architectural innovations (DASF+AKSA) alone are insufficient without principled loss balancing (QBOL). The full model’s ability to maintain stable activation levels on body parts while suppressing background clutter explains the substantial performance gains (+2.12% Rank-1, +1.02% mAP over Baseline+DASF(AKSA)) and confirms that the three components work synergistically rather than additively.
Threshold \(\tau\) Ablation
To thoroughly investigate the channel switching threshold design, we conducted comprehensive ablation studies examining different granularities and adaptivity mechanisms. Table 6 compares four configurations: two fixed (non-learnable) baselines and two adaptive (learnable) approaches with different granularities.
The global \(\tau\) baseline computes a single threshold across all stages and channels using \(\tau = \text {mean}({\hat{W}})\). While simple, this non-normalized approach suffers from scale mismatch across different fusion stages. The uniform per-layer baseline uses a manually chosen fixed threshold (\(\tau _l=0.4\)) for all four stages in the normalized space, which improves upon the global baseline (+0.38% Rank-1) but lacks layer-specific adaptation. The per-channel \(\tau _c\) method assigns individual learnable thresholds to each channel (8192 parameters total), but introduces severe overfitting in small-batch training scenarios.
Our chosen per-layer \(\tau _l\) strategy uses a single learnable threshold per fusion stage (4 parameters total), achieving the best performance (96.20% Rank-1, 92.12% mAP) by balancing adaptability and stability. The learned thresholds from our approach exhibit a clear hierarchical pattern: \(\tau _1=0.287\), \(\tau _2=0.399\), \(\tau _3=0.466\), and \(\tau _4=0.575\), showing progressively higher values at deeper stages. In the normalized space where \({\hat{W}}\) centers around 0, these higher thresholds at deeper layers (e.g., \(\tau _4=0.575\) vs. \(\tau _1=0.287\)) increase the proportion of channels satisfying the switching condition \({\hat{W}} < \tau _l\), thereby promoting more aggressive cross-modal integration where modality-invariant semantic information is most beneficial.
Fixed vs. Adaptive Thresholds. Fixed uniform thresholds (\(\tau _l=0.4\) across all layers, 95.23% Rank-1) underperform our adaptive approach by −0.97% Rank-1 and −1.67% mAP, confirming that layer-specific learning is essential. The learned hierarchy (\(\tau _1=0.287 \rightarrow \tau _4=0.575\)) implements a depth-aware strategy—conservative shallow fusion transitioning to aggressive deep exchange—that fixed thresholds cannot replicate, aligning with CNN feature hierarchies for optimal cross-modal fusion.
QBOL weight analysis
The QBOL weight allocation (\(\alpha =0.5\) for ID loss, \(\beta +\gamma +\delta =0.5\) for auxiliary losses) balances classification dominance with metric learning refinement. This 50-50 split preserves the ID loss’s core supervisory role4 while preventing metric objectives from overwhelming discrimination7,36. The remaining weight is distributed among center triplet (\(\beta\)), supervised contrastive (\(\gamma\)), and margin-based MMD (\(\delta\)) losses, ensuring \(\beta + \gamma + \delta = 0.5\) to maintain balanced optimization.
We performed a sensitivity analysis over the QBOL auxiliary loss terms under this fixed budget, testing multiple weight combinations as presented in Table 7.
As shown in Table 7, the uniform weighting baseline (\(\beta =\gamma =\delta \approx 0.167\), given constraint \(\alpha =0.5\)) achieves only 67.23% Rank-1 and 62.78% mAP, thereby confirming that naive equal allocation among auxiliary losses leads to suboptimal performance. The optimal QBOL configuration \((\beta =0.3, \gamma =0.05, \delta =0.15)\) achieves the best performance with 68.72% Rank-1 and 64.52% mAP, yielding improvements of +1.49% Rank-1 and +1.74% mAP over uniform weighting. This setup prioritizes metric learning via center triplet loss, while using margin-based MMD loss to moderately regularize modality alignment. The supervised contrastive loss is assigned a smaller weight in the QBOL framework to minimize interference during optimization. Performance drops occur with other weight combinations, especially when the margin-based MMD weight is too low (\(\delta =0.1\)), underscoring the role of explicit modality alignment in QBOL.
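Under these weights, the QBOL objective reduces to a simple weighted sum; in the sketch below the four individual loss modules are assumed callables, not the paper's exact implementations.

```python
# Weighted QBOL combination with the best configuration from Table 7.
ALPHA, BETA, GAMMA, DELTA = 0.5, 0.3, 0.05, 0.15   # alpha, beta, gamma, delta

def qbol_loss(logits, feats, labels, modality, id_loss, ct_loss, supcon_loss, mmd_loss):
    return (ALPHA * id_loss(logits, labels)          # identification
            + BETA  * ct_loss(feats, labels)         # center triplet
            + GAMMA * supcon_loss(feats, labels)     # supervised contrastive
            + DELTA * mmd_loss(feats, modality))     # margin-based MMD alignment (uses modality split)
```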
To further justify the QBOL framework’s \(\alpha = 0.5\) setting, we adjusted the ID loss weight while keeping the ratios between the auxiliary losses unchanged (\(\beta :\gamma :\delta = 6:1:3\)). This ensured that the contribution balance among auxiliary objectives within QBOL remained consistent, allowing us to cleanly isolate the impact of ID loss weighting.
The QBOL weight ablation results, summarized in Table 8, confirm the effectiveness of our selected weights.
Experimental results demonstrate that the QBOL configuration with \(\alpha = 0.5\) delivers the best performance, while setting the weight higher (0.6) or lower (0.4) leads to a decline in both Rank-1 accuracy and mAP. This outcome matches the expected trade-off: the identification loss must be prominent enough to ensure discriminative power, yet leave adequate room for auxiliary losses to refine the feature space without undermining core recognition.
From a gradient perspective within the QBOL framework, the ID loss offers strong per-sample signals that enforce immediate discrimination. In contrast, metric losses such as center triplet and supervised contrastive losses contribute batch-level gradients that enhance the overall structure of the feature space. Meanwhile, the margin-based MMD loss acts at the distribution level, aligning modalities without distorting instance-level distinctions.
Tests show that lowering the ID weight in QBOL to 0.4 reduces discrimination capability, especially for challenging samples requiring well-defined decision boundaries. On the other hand, increasing it to 0.6 skews the emphasis toward classification, weakening the robustness and generalizability of the learned features.
Our QBOL framework demonstrates a key insight: balanced loss weighting is essential for VI-ReID’s multi-task nature. The empirically validated 50%-50% split between identification loss (\(\alpha\)) and auxiliary objectives (\(\beta +\gamma +\delta\)) provides a robust configuration that transfers effectively across different architectures and datasets.
Gradient dynamics validation. To provide empirical evidence for the gradient-based rationale outlined in the Methodology section, we measured gradient norms of each loss component during training on SYSU-MM01 (All-Search mode, full model configuration) and visualized the results in Fig. 8. Specifically, we computed the L2 norm of gradients with respect to the final embedding layer parameters at the end of each training epoch, averaged over all batches within that epoch. Our measurements during early training (epochs 1–20) reveal stark scale differences: the identification loss \({\mathscr {L}}_{id}\) exhibits relatively small gradients with mean \(\Vert \nabla {\mathscr {L}}_{id}\Vert \approx 0.32\), attributable to softmax saturation when features are not yet well-separated. In contrast, supervised contrastive loss \({\mathscr {L}}_{sup}\) generates substantially larger gradients with mean \(\Vert \nabla {\mathscr {L}}_{sup}\Vert \approx 2.18\), as it operates on dense pairwise similarities across all batch samples—resulting in an approximately 6.8\(\times\) scale disparity (i.e., \(2.18/0.32 \approx 6.8\)). Panel (a) of Fig. 8 confirms this inherent imbalance: without proper weighting, \({\mathscr {L}}_{sup}\) would dominate parameter updates, marginalizing the crucial role of \({\mathscr {L}}_{id}\) in classification performance.
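The per-loss measurement can be sketched as follows; `embedding_params` is an assumed handle for the final embedding-layer parameters, and the per-epoch averaging over batches is omitted for brevity.

```python
import torch

def per_loss_grad_norms(losses, embedding_params):
    """L2 norm of each loss component's gradient w.r.t. the final embedding-layer
    parameters, computed separately per loss.

    losses: dict of scalar loss tensors, e.g. {"id": L_id, "tri": L_tri, "sup": L_sup, "mmd": L_mmd}.
    embedding_params: list of the embedding layer's parameter tensors.
    """
    norms = {}
    for name, loss in losses.items():
        grads = torch.autograd.grad(loss, embedding_params,
                                    retain_graph=True, allow_unused=True)
        flat = torch.cat([g.reshape(-1) for g in grads if g is not None])
        norms[name] = flat.norm(p=2).item()
    return norms
```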
Our QBOL weighting scheme effectively balances gradient contributions across objectives. As shown in Panel (b), the weighted gradients converge to a narrow range (0.05–0.15), with \(\alpha \Vert \nabla {\mathscr {L}}_{id}\Vert \approx 0.5 \times 0.28 = 0.14\) and \(\gamma \Vert \nabla {\mathscr {L}}_{sup}\Vert \approx 0.05 \times 1.59 = 0.08\), ensuring proportional influence from each loss. Panel (c) highlights the early-stage instability of \({\mathscr {L}}_{tri}\), where gradient variance reaches std \(\approx 0.77\) in epochs 1–20 due to dynamic center updates—justifying the use of \(\beta =0.3\) to maintain stable metric learning. Meanwhile, \({\mathscr {L}}_{mmd}\) stabilizes after epoch 40, with its gradient norm settling around \(\approx 0.54\), supporting the choice of \(\delta =0.15\) to sustain modality alignment without overwhelming discrimination. The transition from unbalanced raw gradients to balanced effective gradients, visualized in Panel (d), underscores the necessity of our weighting strategy. These empirical results validate our theoretical analysis and explain why the configuration \((\alpha =0.5, \beta =0.3, \gamma =0.05, \delta =0.15)\) performs best in Tables 7 and 8.
QBOL Loss Gradient Dynamics: Empirical Validation of Weight Allocation. Gradient characteristics during training on SYSU-MM01 (epochs 1–120). (a) Raw gradient magnitudes of four loss components over training epochs. (b) Weighted effective gradients after applying QBOL weights (\(\alpha =0.5, \beta =0.3, \gamma =0.05, \delta =0.15\)). (c) Gradient variance analysis for \({\mathscr {L}}_{tri}\) with \(\pm 1\) std envelope. (d) Bar chart comparing raw vs. weighted gradient magnitudes in early epochs.
AKSA Kernel size selection
Design Rationale. Our selection of kernel sizes {3,5,7} for the 1D convolutions aims to capture multi-scale contextual patterns across channels, which correspond to different semantic levels in person Re-ID. Discriminative features often arise from channel interactions at varying scales: localized dependencies (small kernels), intermediate correlations (medium kernels), and broader channel contexts (large kernels). Drawing on multi-scale learning principles79,80, we employ consecutive odd-sized kernels to maintain symmetric padding—preserving sequence length—while enabling incremental growth in receptive field coverage. This configuration also ensures computational efficiency; expanding beyond kernel size 7 yields diminishing returns, with a size-9 kernel increasing FLOPs by roughly 29% relative to size 7.
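The multi-scale channel branch implied by this design can be sketched as below; the sketch covers only the parallel 1D convolutions over the channel dimension and a learned mixing of their outputs, not the full AKSA gating or positional attention.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    """Illustrative sketch: parallel 1D convolutions with kernel sizes {3, 5, 7}
    applied to pooled channel descriptors, combined via learned mixing weights."""
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for k in kernel_sizes])
        self.mix = nn.Parameter(torch.zeros(len(kernel_sizes)))   # per-branch mixing logits

    def forward(self, x):                         # x: (B, C, H, W)
        y = x.mean(dim=(2, 3)).unsqueeze(1)       # global average pool -> (B, 1, C)
        w = torch.softmax(self.mix, dim=0)        # dynamic weighting of the branches
        att = sum(wi * conv(y) for wi, conv in zip(w, self.convs))   # (B, 1, C)
        return x * torch.sigmoid(att).transpose(1, 2).unsqueeze(-1)  # channel reweighting
```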
Ablation results in Table 9 (RegDB V2I) confirm the rationale behind our kernel selection. While two-kernel setups attain 95.54–95.87% Rank-1—with {3,5} offering the best trade-off between fine-grained and mid-range channel modeling—the three-kernel set {3,5,7} reaches 96.20% Rank-1 and 92.12% mAP, a +0.33% Rank-1 gain. This improvement underscores the benefit of adding larger kernels to capture broader channel contexts. Further expanding to {3,5,7,9}, however, reduces performance to 96.04% Rank-1 and 91.89% mAP, suggesting that overly large kernels introduce redundancy and cause over-smoothing. Thus, {3,5,7} delivers the optimal multi-scale channel representation without sacrificing efficiency or accuracy.
Table 10 shows that a reduction ratio of 16 yields the optimal balance between channel compression and representational capacity, while a group number of 4 ensures adequate parallelism and avoids excessive feature fragmentation.
Generalization verification
To demonstrate the generalizability of our DASF module, we integrated it into several existing architectures. As shown in Table 11, adding DASF to AGW improves performance by +2.10% Rank-1 and +1.80% mAP on SYSU-MM01 All-Search mode, while integration with DDAG brings even more substantial gains of +3.17% Rank-1 and +2.36% mAP. Similarly, Table 12 shows that on RegDB, DASF integration improves AGW by +2.13% Rank-1 and +2.14% mAP in V2I mode, and enhances DDAG by +3.20% Rank-1 and +2.66% mAP.
Beyond CNN-based architectures, our module also benefits transformer-based methods. When integrated with CMTR, DASF provides improvements of +0.65% Rank-1 and +0.59% mAP on RegDB, while integration with SPOT yields +0.87% Rank-1 and +0.83% mAP gains on SYSU-MM01, confirming its compatibility with diverse architectural paradigms. These consistent improvements across different backbones and methodologies confirm that our proposed attention mechanisms capture fundamental aspects of cross-modal representation learning that are not adequately addressed by existing approaches.
The consistent performance gains across different architectures and datasets demonstrate that our DASF module provides a general-purpose solution for cross-modal feature fusion that can enhance existing methods without requiring extensive architectural modifications.
Conclusion
This paper addresses persistent challenges in visible-infrared person re-identification (VI-ReID) through a holistic framework that systematically overcomes key limitations in existing methods. We first introduce the Dynamic Adaptive Synergistic Fusion (DASF) module to resolve rigid fusion strategies, enabling content-aware cross-modal interaction via dual-stage channel switching and spatial attention fusion. This dynamic modulation of feature flow substantially enhances cross-modal alignment. Building on this adaptive foundation, we develop the Adaptive Kernel Selection Attention (AKSA) mechanism to overcome static attention designs. AKSA utilizes parallel 1D convolutional pathways with fixed kernel sizes and dynamic output weighting, effectively capturing multi-scale contextual information along the channel dimension while combining both fine-grained and global channel contexts. Initial experiments revealed that these architectural improvements alone proved insufficient for optimal performance due to suboptimal loss balancing. This critical observation motivated our Quadruple Balance-Optimized Loss (QBOL) framework, which implements a principled weighting scheme to systematically integrate identification loss, center triplet loss, supervised contrastive loss, and margin-based MMD loss. The QBOL framework ensures stable convergence and harmonious cooperation among competing learning objectives, thereby fully realizing the potential of our architectural innovations. Extensive evaluations on SYSU-MM01, RegDB, and LLCM datasets demonstrate competitive performance, with particularly strong results on the RegDB benchmark. Ablation studies confirm each component’s contribution and their synergistic effects, while generalization tests show our DASF module consistently enhances various existing architectures.
Future directions. Building upon our competitive results in image-based VI-ReID, future work will focus on three key expansions: (1) extending DASF to video sequences by integrating 3D convolutions or Transformer-based temporal attention, adapting channel switching to generate spatio-temporal masks and employing learnable weighted pooling to prioritize informative frames while managing computational load via efficient attention approximations and frame sampling; (2) generalizing the QBOL framework to multi-task settings such as end-to-end person search (detection + re-identification) by dynamically adjusting loss weights through gradient statistics monitoring and scaling to n components via gradient-based meta-learning or Bayesian optimization; (3) pursuing real-time deployment on embedded GPUs (e.g., NVIDIA Jetson) targeting >30 FPS through model compression techniques including knowledge distillation, pruning of AKSA parallel pathways, INT8 quantization, and optimization with TensorRT and ONNX export, all while maintaining competitive accuracy.
Data Availability
The SYSU-MM013, RegDB5, and LLCM24 datasets used in this paper are publicly available. They can be accessed at https://github.com/wuancong/SYSU-MM01, http://dm.dongguk.edu/link.html, and https://github.com/ZYK100/LLCM respectively.
References
Zheng, L. et al. Person re-identification in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3346–3355. https://doi.org/10.1109/CVPR.2017.357 (2017).
Ahmed, E., Jones, M. & Marks, T. K. An improved deep learning architecture for person re-identification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3908–3916. https://doi.org/10.1109/CVPR.2015.7299016 (2015).
Wu, A., Zheng, W.-S., Yu, H.-X., Gong, S. & Lai, J. Rgb-infrared cross-modality person re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), 5390–5399. https://doi.org/10.1109/ICCV.2017.575 (2017).
Ye, M. et al. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 44, 2872–2893. https://doi.org/10.1109/TPAMI.2021.3054775 (2022).
Nguyen, D. T., Hong, H. G., Kim, K. W. & Park, K. R. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors https://doi.org/10.3390/s17030605 (2017).
Dai, P., Ji, R., Wang, H., Wu, Q. & Huang, Y. Cross-modality person re-identification with generative adversarial training. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, 677–683 (AAAI Press, 2018).
Ye, M., Lan, X., Li, J. & Yuen, P. C. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18 (AAAI Press, 2018).
Feng, Z., Lai, J. & Xie, X. Learning modality-specific representations for visible-infrared person re-identification. IEEE Trans. Image Process. 29, 579–590. https://doi.org/10.1109/TIP.2019.2928126 (2020).
Wang, G. et al. Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 3622–3631. https://doi.org/10.1109/ICCV.2019.00372 (2019).
Choi, S., Lee, S., Kim, Y., Kim, T. & Kim, C. Hi-cmd: Hierarchical cross-modality disentanglement for visible-infrared person re-identification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10254–10263. https://doi.org/10.1109/CVPR42600.2020.01027 (2020).
Lu, Y. et al. Cross-modality person re-identification with shared-specific feature transfer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13376–13386. https://doi.org/10.1109/CVPR42600.2020.01339 (2020).
Pu, N., Chen, W., Liu, Y., Bakker, E. M. & Lew, M. S. Dual gaussian-based variational subspace disentanglement for visible-infrared person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, 2149–2158 (Association for Computing Machinery, New York, NY, USA, 2020). https://doi.org/10.1145/3394171.3413673.
Fu, C. et al. Cm-nas: Cross-modality neural architecture search for visible-infrared person re-identification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 11803–11812. https://doi.org/10.1109/ICCV48922.2021.01161 (2021).
Zhao, Z., Liu, B., Chu, Q., Lu, Y. & Yu, N. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, 3520–3528. https://doi.org/10.1609/aaai.v35i4.16466 (2021).
Hu, W., Liu, B., Zeng, H., Hou, Y. & Hu, H. Adversarial decoupling and modality-invariant representation learning for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol. 32, 5095–5109. https://doi.org/10.1109/TCSVT.2022.3147813 (2022).
Jiang, K. et al. Cross-modality transformer for visible-infrared person re-identification. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIV, 480–496 (Springer-Verlag, Berlin, Heidelberg, 2022). https://doi.org/10.1007/978-3-031-19781-9_28.
Ye, M., Shen, J., Crandall, D. J., Shao, L. & Luo, J. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII, 229–247 (Springer-Verlag, Berlin, Heidelberg, 2020). https://doi.org/10.1007/978-3-030-58520-4_14.
Ren, K. & Zhang, L. Implicit discriminative knowledge learning for visible-infrared person re-identification. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 393–402. https://doi.org/10.1109/CVPR52733.2024.00045 (2024).
Yu, H., Cheng, X., Peng, W., Liu, W. & Zhao, G. Modality unifying network for visible-infrared person re-identification. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11151–11161 (2023).
Chen, H., Jiao, B., Wang, W. & Wang, P. Dynamic textual prompt for rehearsal-free lifelong person re-identification. arXiv:2411.06023 (2024).
Du, Y., Lei, C., Zhao, Z., Dong, Y. & Su, F. Video-based visible-infrared person re-identification with auxiliary samples. IEEE Trans. Inf. Forensics Secur. 19, 1313–1325. https://doi.org/10.1109/TIFS.2023.3337972 (2024).
Jiang, Y. et al. L2rw+: A comprehensive benchmark towards privacy-preserved visible-infrared person re-identification. arXiv:2503.12232 (2025).
Zhang, Z., Lan, C., Zeng, W., Jin, X. & Chen, Z. Relation-aware global attention for person re-identification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3183–3192. https://doi.org/10.1109/CVPR42600.2020.00325 (2020).
Zhang, Y. & Wang, H. Diverse embedding expansion network and low-light cross-modality benchmark for visible-infrared person re-identification. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2153–2162. https://doi.org/10.1109/CVPR52729.2023.00214 (2023).
Wang, Z., Wang, Z., Zheng, Y., Chuang, Y.-Y. & Satoh, S. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 618–626. https://doi.org/10.1109/CVPR.2019.00071 (2019).
Zhang, Y., Yan, Y., Li, J. & Wang, H. Mrcn: a novel modality restitution and compensation network for visible-infrared person re-identification. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23 (AAAI Press, 2023). https://doi.org/10.1609/aaai.v37i3.25459.
Lin, X. et al. Learning modal-invariant and temporal-memory for video-based visible-infrared person re-identification. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 20941–20950. https://doi.org/10.1109/CVPR52688.2022.02030 (2022).
You, R. et al. Cross-modality attention with semantic graph embedding for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 12709–12716. https://doi.org/10.1609/aaai.v34i07.6964 (2020).
Park, H., Lee, S., Lee, J. & Ham, B. Learning by aligning: Visible-infrared person re-identification using cross-modal correspondences. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 12026–12035. https://doi.org/10.1109/ICCV48922.2021.01183 (2021).
Li, Y. & Chen, Y. Infrared-visible cross-modal person re-identification via dual-attention collaborative learning. Signal Process. Image Commun. 109, 116868. https://doi.org/10.1016/j.image.2022.116868 (2022).
Yang, F. et al. Asymmetric co-teaching for unsupervised cross-domain person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 12597–12604. https://doi.org/10.1609/aaai.v34i07.6950 (2020).
Han, C., Pan, P., Zheng, A. & Tang, J. Cross-modality person re-identification based on heterogeneous center loss and non-local features. Entropy https://doi.org/10.3390/e23070919 (2021).
Chen, C. et al. Structure-aware positional transformer for visible-infrared person re-identification. IEEE Trans. Image Process. 31, 2352–2364. https://doi.org/10.1109/TIP.2022.3141868 (2022).
Guo, Y. et al. Visible-infrared person re-identification with region-based augmentation and cross modality attention. Sci. Rep. 15, 18225. https://doi.org/10.1038/s41598-025-01979-z (2025).
Mishra, R. K., Mondal, A. & Mathew, J. Nystromformer based cross-modality transformer for visible-infrared person re-identification. Sci. Rep. 15, 16224. https://doi.org/10.1038/s41598-025-01226-5 (2025).
Ye, M., Ruan, W., Du, B. & Shou, M. Z. Channel augmented joint learning for visible-infrared recognition. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 13547–13556. https://doi.org/10.1109/ICCV48922.2021.01331 (2021).
Wu, Q. et al. Discover cross-modality nuances for visible-infrared person re-identification. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4328–4337. https://doi.org/10.1109/CVPR46437.2021.00431 (2021).
Zhang, Q., Lai, C., Liu, J., Huang, N. & Han, J. Fmcnet: Feature-level modality compensation for visible-infrared person re-identification. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7339–7348. https://doi.org/10.1109/CVPR52688.2022.00720 (2022).
Cheng, D. et al. Efficient bilateral cross-modality cluster matching for unsupervised visible-infrared person reid. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23, 1325–1333 (Association for Computing Machinery, New York, NY, USA, 2023). https://doi.org/10.1145/3581783.3612073.
Hao, Y., Wang, N., Li, J. & Gao, X. Hsme: hypersphere manifold embedding for visible thermal person re-identification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’19/IAAI’19/EAAI’19 (AAAI Press, 2019). https://doi.org/10.1609/aaai.v33i01.33018385.
Wei, Z., Yang, X., Wang, N. & Gao, X. Syncretic modality collaborative learning for visible infrared person re-identification. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 225–234. https://doi.org/10.1109/ICCV48922.2021.00029 (2021).
Fu, H., Cui, K., Wang, C., Qi, M. & Ma, H. Mutual distillation learning for person re-identification. IEEE Trans. Multimed. 26, 8981–8995. https://doi.org/10.1109/TMM.2024.3384677 (2024).
Gao, X., Chen, Z., Wei, J., Wang, R. & Zhao, Z. Deep mutual distillation for unsupervised domain adaptation person re-identification. IEEE Trans. Multimed. 27, 1059–1071. https://doi.org/10.1109/TMM.2024.3459637 (2025).
Zhou, K., Yang, Y., Cavallaro, A. & Xiang, T. Learning generalisable omni-scale representations for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 44, 5056–5069. https://doi.org/10.1109/TPAMI.2021.3069237 (2022).
Qian, Z., Lin, Y. & Du, B. Visible-infrared person re-identification via patch-mixed cross-modality learning. Pattern Recognit. 157, 110873. https://doi.org/10.1016/j.patcog.2024.110873 (2025).
Qiu, L. et al. High-order structure based middle-feature learning for visible-infrared person re-identification. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’24/IAAI’24/EAAI’24 (AAAI Press, 2024). https://doi.org/10.1609/aaai.v38i5.28259.
Chen, Y. et al. Dynamic convolution: Attention over convolution kernels. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11027–11036. https://doi.org/10.1109/CVPR42600.2020.01104 (2020).
Li, C., Zhou, A. & Yao, A. Omni-dimensional dynamic convolution. In The Tenth International Conference on Learning Representations (OpenReview.net, 2022).
Akbaba, E. E., Gurkan, F. & Gunsel, B. Boosting person re-identification feature extraction via dynamic convolution. Pattern Anal. Appl. 27, 80. https://doi.org/10.1007/s10044-024-01294-9 (2024).
Lin, R., Wang, R., Zhang, W., Wu, A. & Bi, Y. Joint modal alignment and feature enhancement for visible-infrared person re-identification. Sensors https://doi.org/10.3390/s23114988 (2023).
Zhao, C. et al. Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification. IEEE Trans. Multimed. 22, 3180–3195. https://doi.org/10.1109/TMM.2020.2972125 (2020).
Liu, H., Tan, X. & Zhou, X. Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification. IEEE Trans. Multimed. 23, 4414–4425. https://doi.org/10.1109/TMM.2020.3042080 (2021).
Zhu, Y. et al. Hetero-center loss for cross-modality person re-identification. Neurocomputing 386, 97–109. https://doi.org/10.1016/j.neucom.2019.12.100 (2020).
Cai, B., Wang, H., Yao, M. & Fu, X. Focus more on what? Guiding multi-task training for end-to-end person search. IEEE Trans. Circuits Syst. Video Technol. 35, 7266–7278. https://doi.org/10.1109/TCSVT.2025.3540089 (2025).
Sun, R., Huang, G., Wang, X., Du, Y. & Zhang, X. Text-augmented multi-modality contrastive learning for unsupervised visible-infrared person re-identification. Image Vis. Comput. 152, 105310. https://doi.org/10.1016/j.imavis.2024.105310 (2024).
Qian, Y. & Tang, S.-K. Multi-scale contrastive learning with hierarchical knowledge synergy for visible-infrared person re-identification. Sensors https://doi.org/10.3390/s25010192 (2025).
Sun, H. et al. Not all pixels are matched: Dense contrastive learning for cross-modality person re-identification. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, 5333–5341. https://doi.org/10.1145/3503161.3547970 (Association for Computing Machinery, New York, NY, USA, 2022).
Zhang, C. et al. Improving domain-adaptive person re-identification by dual-alignment learning with camera-aware image generation. IEEE Trans. Circuits Syst. Video Technol. 31, 4334–4346. https://doi.org/10.1109/TCSVT.2020.3047095 (2021).
Chen, X., Yan, Y., Xue, J.-H., Wang, N. & Wang, H. Consistency-driven feature scoring and regularization network for visible-infrared person re-identification. Pattern Recognit. 159, 111131. https://doi.org/10.1016/j.patcog.2024.111131 (2025).
Hua, X. et al. Mscmnet: Multi-scale semantic correlation mining for visible-infrared person re-identification. Pattern Recognit. 159, 111090. https://doi.org/10.1016/j.patcog.2024.111090 (2025).
Wang, H. et al. Tensor completion framework by graph refinement for incomplete multi-view clustering. IEEE Trans. Multimed. https://doi.org/10.1109/TMM.2025.3613125 (2025).
Yao, M., Wang, H., Chen, Y. & Fu, X. Between/within view information completing for tensorial incomplete multi-view clustering. IEEE Trans. Multimed. 27, 1538–1550. https://doi.org/10.1109/TMM.2024.3521771 (2025).
Zheng, A. et al. Visible-infrared person re-identification via specific and shared representations learning. Vis. Intell. 1, 29. https://doi.org/10.1007/s44267-023-00032-9 (2023).
Chen, Y., Wan, L., Li, Z., Jing, Q. & Sun, Z. Neural feature search for RGB-infrared person re-identification. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 587–597. https://doi.org/10.1109/CVPR46437.2021.00065 (2021).
Yang, Y., Hu, W. & Hu, H. Progressive cross-modal association learning for unsupervised visible-infrared person re-identification. IEEE Trans. Inf. Forensics Secur. 20, 1290–1304. https://doi.org/10.1109/TIFS.2025.3527356 (2025).
Ye, M., Wu, Z. & Du, B. Dual-level matching with outlier filtering for unsupervised visible-infrared person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 47, 3815–3829. https://doi.org/10.1109/TPAMI.2025.3541053 (2025).
Shi, J. et al. Multi-memory matching for unsupervised visible-infrared person re-identification. In Leonardis, A. et al. (eds.) Computer Vision—ECCV 2024 456–474 (Springer Nature Switzerland, Cham, 2025).
Gong, J., Zhao, S. & Lam, K.-M. Interaction and alignment for visible-infrared person re-identification. In 2022 26th International Conference on Pattern Recognition (ICPR), 2253–2259. https://doi.org/10.1109/ICPR56361.2022.9956505 (2022).
Wang, G.-A. et al. Cross-modality paired-images generation for RGB-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 12144–12151. https://doi.org/10.1609/aaai.v34i07.6894 (2020).
Liu, J., Wang, J., Huang, N., Zhang, Q. & Han, J. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circuits Syst. Video Technol. 32, 7226–7240. https://doi.org/10.1109/TCSVT.2022.3168999 (2022).
Liang, T., Jin, Y., Liu, W. & Li, Y. Cross-modality transformer with modality mining for visible-infrared person re-identification. IEEE Trans. Multimed. 25, 8432–8444. https://doi.org/10.1109/TMM.2023.3237155 (2023).
Lu, H., Zou, X. & Zhang, P. Learning progressive modality-shared transformers for effective visible-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 37, 1835–1843. https://doi.org/10.1609/aaai.v37i2.25273 (2023).
Huang, N., Xing, B., Zhang, Q., Han, J. & Huang, J. Co-segmentation assisted cross-modality person re-identification. Inf. Fusion 104, 102194. https://doi.org/10.1016/j.inffus.2023.102194 (2024).
Qian, Y. & Tang, S.-K. Pose attention-guided paired-images generation for visible-infrared person re-identification. IEEE Signal Process. Lett. 31, 346–350. https://doi.org/10.1109/LSP.2024.3354190 (2024).
Wu, R. et al. Enhancing visible-infrared person re-identification with modality- and instance-aware adaptation learning. IEEE Trans. Circuits Syst. Video Technol. 35, 8086–8103. https://doi.org/10.1109/TCSVT.2025.3560118 (2025).
Zhang, Y., Kong, L., Li, H. & Wen, J. Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning. arXiv:2507.12942 (2025).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), 618–626. https://doi.org/10.1109/ICCV.2017.74 (2017).
Wang, Q. et al. Eca-net: Efficient channel attention for deep convolutional neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11531–11539. https://doi.org/10.1109/CVPR42600.2020.01155 (2020).
Szegedy, C. et al. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9. https://doi.org/10.1109/CVPR.2015.7298594 (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/CVPR.2016.90 (2016).
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62562058 and 62441213, and the College Students’ Innovation and Entrepreneurship Training Program of China under Grant 202410755062.
Author information
Contributions
Shuli Cheng and Linjie Sha conceived the experiments, Linjie Sha conducted the experiments, and Shuxian Liu analysed the results. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.