Introduction

Person re-identification (Re-ID) is a computer vision task that aims to identify the same individual across images captured at different times, from varying viewpoints, and by different cameras. The technology is widely used in security monitoring, intelligent transportation, and smart retail [1-3]. Methods such as QAConv-GS, AKA, SPT, FastReID, and CLIP-ReID have demonstrated strong performance for person re-identification in the visible spectrum [4-8]. In practical scenarios, however, poor lighting, complex environments, weather variations, and day-night transitions hinder re-identification based solely on visible images. Infrared and thermal cameras are largely unaffected by these factors and can still image the human body. Many researchers have therefore explored cross-modal person re-identification between the visible and infrared modalities, proposing methods such as YYDS + CMKR, DEN, NFS, and FMCNet [9-12], which aim to learn feature representations that are invariant across modalities and camera viewpoints.

Fig. 1

Visualization of modal differences. The diamonds represent the identity feature representations in the visible light modality, and the triangles represent the identity feature representations in the infrared modality. In the overlap of the schematics for both modalities, the white triangle region indicates shared invariant features of the same person across different modalities, while the non-overlapping regions show the differences in identity feature representations between the visible light and infrared modalities.

The discrepancy between visible and infrared images is a central challenge in this domain. Visible images are formed mainly by light reflected from objects, whereas infrared images depend on the thermal radiation objects emit, so the two modalities differ substantially in color, texture, and contrast [13-15]. As shown in Fig. 1, although visible and infrared images both contain identity information about the same person, their identity feature representations differ while still sharing common parts (represented by the white triangle region in Fig. 1 and referred to as shared features in this paper). The main challenge is to learn these shared features effectively. We divide existing methods into two categories: non-generative and generative.

Among generative methods, GECNet uses a grayscale transformation to make visible and infrared images more similar [16]. Other methods employ GANs for modality transformation, converting images from one modality into the other to minimize the modality gap [17-19]. In this setting, VIS-IR image pairs are often essential for guiding modality generation, but some mainstream datasets, such as SYSU-MM01, do not provide them. The absence of VIS-IR image pairs can introduce noise into the generated images, and generative methods also have notable limitations in practical applications.

Non-generative approaches embed features from different modalities into a unified feature space [11,20,21]. However, this process may discard features of the visible and infrared modalities, including details that are critical for person re-identification. Most existing methods focus on reducing the modality gap, for example the bidirectional modality information interaction network with its Dynamic Aggregation (DA) module [54] and the modality shared-specific features cooperative separation network [42]; in Fig. 1 this corresponds to enlarging the overlapping white triangular region. Although these approaches achieve promising results, they ignore feature loss at deeper network stages and fail to fully utilize the shared features.

Fig. 2

HIW-Net module function diagram.

The HIW network proposed in this paper not only increases the shared features but also focuses on their effective utilization. It consists of a third-order primitive feature interaction module (TPFI) and a diversified feature mining module based on wavelet convolution (wtDFM). First, to reduce the differences between modalities, the TPFI module lets features from the two modalities interact along the channel and spatial dimensions. At the same time, to reduce the feature loss that can occur during network extraction, the primitive, low-order, and high-order features are aggregated separately in the channel and spatial dimensions. The low-order and primitive features compensate the high-order features, ensuring sufficient extraction of shared features while avoiding the parameter redundancy of aggregating every stage. The wtDFM module then applies wavelet convolution branches with different receptive fields to the features produced by TPFI. This improves the utilization of shared features and mines diversified features, including shape, color, texture, and more abstract attributes (as shown in Fig. 2).

As shown in Fig. 2, to address the insufficient feature extraction and underutilization of features in existing visible-infrared person re-identification methods, the proposed HIW network introduces two novel modules, TPFI and wtDFM. In addition, we draw inspiration from SGIEL [22], in which shape, as a modality-invariant shared feature, is key information for person re-identification. SGIEL uses orthogonal projections to erase shape information in one projection, forcing the network to learn features other than shape. This paper takes the opposite approach: only shape information is retained, forcing the network to strengthen its learning of shape features. In the first pass, the network receives images containing all the information; in the second pass, it receives images containing only shape information. The shape loss from the second pass is weighted against the loss from the first pass and fed back into training, so the network learns to better exploit shape features.

Fig. 3

Comparison of shape diagrams from the RegDB_shape dataset generated by SCHP and manually annotated diagrams from this paper.

In addition, the low quality of the RegDB dataset prevents existing human parsing networks from producing usable shape maps [23,25]. To address this, this paper uses SAM to annotate the dataset and construct the RegDB_shape dataset [26]. As shown in Fig. 3, the newly created dataset resolves the problem of incomplete person silhouettes and offers valuable support for research in person re-identification.

The main contributions are as follows:

1) This paper proposes a novel third-order primitive feature interaction module (TPFI) that minimizes inter-modal differences through channel- and spatial-dimension interactions while mitigating shared-feature loss through third-order feature aggregation.

2) Wavelet convolution is introduced to achieve diversified feature mining. The proposed wtDFM module utilizes wavelet convolution branches with different kernel sizes to fully exploit shared features.

3) The HIW network integrates the strengths of TPFI and wtDFM to enhance cross-modal shared-feature representation and maximize the utilization of shared features. In addition, the shape loss and modality loss are weighted to form the total loss and enhance the utilization of shape features.

4) This paper creates the RegDB_shape dataset to advance research in person re-identification.

5) Extensive experiments demonstrate that the HIW network outperforms other networks on the mainstream SYSU-MM01 and RegDB datasets.

Related work

Current visible-infrared person re-identification (VI-ReID) methods can be categorized into generative and non-generative approaches [55]. Generative models primarily employ generative adversarial networks (GANs) or encoder-decoder modules to convert between the two modalities or to create intermediate ones. Some also require the fusion of visible and infrared images; the SIHP network [30], for example, uses corresponding visible and infrared images to explore complex interactions in image fusion. However, generative methods require VIS-IR image pairs, which limits their practical applicability. This paper therefore focuses on non-generative methods.

This paper builds on the baseline AGW method for cross-modal person re-identification. AGW incorporates a Non-local Attention module into ResNet50, implemented primarily with \(1 \times 1\) convolution kernels. The module lets the model capture long-range dependencies in images by computing a weighted sum over features at all positions, enhancing feature representation. Generalized-mean pooling (GeMPooling), a learnable pooling layer, provides a continuous transition between max pooling and average pooling. GeMPooling adjusts its behavior via a learnable parameter \({p_k}\): it approaches max pooling as \({p_k}\) tends to infinity and reduces to average pooling when \({p_k}=1\). Specifically, GeMPooling computes the generalized mean of each feature channel; increasing \({p_k}\) raises the contrast of the pooled feature map and emphasizes more salient image features. AGW also adopts the Weighted Regularization Triplet loss, an improvement over the traditional triplet loss. This loss retains the benefits of optimizing the relative distances between positive and negative pairs while avoiding an additional margin parameter; it optimizes the positive set P and negative set N of each batch with a weighted regularization scheme, improving the model's discrimination ability [28]. However, AGW also has drawbacks: it merely concatenates visible and infrared features, leaving a large modality gap, and it performs no diversity mining on the features, which are fed directly into the triplet loss, leading to their underutilization.
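As a rough illustration of GeMPooling described above, the following is a minimal PyTorch sketch; the initial value of \({p_k}\) and the clamping epsilon are illustrative choices, not values taken from AGW.

```python
# Minimal sketch of generalized-mean (GeM) pooling.
# Assumptions: initial p = 3.0 and eps = 1e-6 are common defaults, not AGW's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeMPooling(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable p_k
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, C); generalized mean over each channel.
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)

# p -> infinity approximates max pooling; p = 1 recovers average pooling.
feat = torch.randn(4, 2048, 24, 8)
pooled = GeMPooling()(feat)  # shape (4, 2048)
```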

To obtain more effective features, this paper aggregates features learned at different stages, a strategy validated by the ANN network proposed by Zhu et al. [29]. ANN introduces the Asymmetric Fusion Non-local Block (AFNB), which integrates features across different levels while accounting for long-range dependencies. This significantly enhances performance and demonstrates the effectiveness of multi-level feature aggregation in tasks such as semantic segmentation and classification. However, full-stage feature aggregation may lead to feature redundancy and high computational cost.

To reduce modality differences, the features extracted from visible and infrared images can interact along the channel and spatial dimensions. The SCSN network, for example, uses a Residual Dual Attention Module (RDAM) composed of a Channel-wise Attention Module (CAM) and a Residual Spatial Attention Module (RSAM). In addition, SCSN mines diverse salient features through cascaded Salient Feature Extraction (SFE) units: each SFE suppresses the salient features learned in the previous cascade stage, adaptively extracts additional potential salient features, and integrates them into the final representation. However, a person's unique features are finite, and excessive suppression may drive the network toward non-robust features, producing redundant feature vectors and diluting the salient features [28].

After extracting sufficient features, the network must also make full use of them and strengthen diversity mining. SGIEL forcibly removes body-shape-related features, compelling the VI-ReID model to extract additional modality-shared features for recognition. Through orthogonal decomposition of the feature space, shape-related features are captured in one subspace while shape-erased features are mapped to its orthogonal complement, enhancing feature diversity [22]. However, this method relies heavily on prior knowledge of body shape, so inaccurate or incomplete shape information can adversely affect feature extraction.

The DEEN network [32] uses different dilation rates to simulate different receptive fields and extract multi-scale features, but small details may be overlooked: dilated convolution introduces gaps between kernel elements, which can cause information loss and ambiguity. Moreover, although dilated convolution enlarges the receptive field, it significantly increases computational cost; as the dilation rate grows, each kernel element covers a larger input area, which raises computational complexity and reduces GPU efficiency.

Inspired by the above methods, and positioned between the full-stage aggregation of ANN and the second-order aggregation of SCSN, this paper proposes third-order primitive feature interaction (TPFI). With ResNet50 as the backbone, the features obtained after a simple initial convolution of the image are taken as the primitive features. From the third stage onwards, these primitive features are fed into the channel-wise and spatial feature interactions. TPFI compensates for the shared features that would otherwise be lost during deep feature extraction, and at the cost of a small amount of additional computation it significantly reduces this loss.

To enhance the diversity of shared features, this paper builds a wavelet convolution module based on the Wavelet Transform (WT), which achieves a large receptive field without excessive parameterization [33]. Wavelet convolution better captures the low-frequency information in an image and thus strengthens the response to shape. This complements the idea of combining shape loss and modality loss into the total loss, further enhancing the network's extraction of, and response to, shape features.

Method

Fig. 4

The structure of HIW-Net. The FI module performs interactions across the channel and spatial dimensions, aggregating high-order and low-order features to reduce modality differences and feature loss. TPFI, built on FI, additionally aggregates the primitive features (Fv and the features after FI concatenation), further decreasing feature loss in deep networks. The wtConv module extracts image features of different frequencies from a frequency-domain perspective. The total loss is the weighted sum of the Shape Loss and the Modality Loss; both are the sum of \({L_{ce}}\), \({L_{tri}}\), \({L_{ort}}\) and \({L_{cpm}}\). The Shape Loss uses shape images derived from the visible and infrared images, while the Modality Loss uses the visible and infrared images directly.

Model architecture

Figure 4 outlines the framework of the High-order Interaction and Wavelet Convolution Network for visible-infrared person re-identification (HIW-Net). It uses ResNet50 as the backbone and consists of two passes.

In the first pass, visible and infrared images are input. The preprocessing stage of ResNet50 extracts the primitive features. Then, for each layer, the module aggregates that layer's input (low-order features), its output (high-order features), and the primitive features. In the first stage, because the layer's input is itself the primitive feature, basic feature interaction (FI) is used; the other stages use third-order primitive feature interaction (TPFI), which involves interaction in both the channel and spatial dimensions. For the RegDB dataset, only the first two stages of ResNet50 are used, and for the SYSU-MM01 dataset, the first three stages are used. The wavelet convolution module then splits the aggregated features into different frequency bands and processes each band with convolutions, using branches with different wavelet convolution kernel sizes to capture more varied feature information. Finally, the model loss, which is the sum of \({L_{ce}}\), \({L_{tri}}\), \({L_{ort}}\) and \({L_{cpm}}\), is calculated from these diverse features.

The second pass takes the shape images of visible and infrared images (containing only human shape features) as input. The calculation method is the same as the first pass, yielding the shape loss.

The total loss is the weighted sum of model loss and shape loss.

Third-order primitive feature interaction

Figure 4 illustrates the structure of the third-order primitive feature interaction module. In this paper, features preceding a layer are called the low-order feature \({f_l}\), features following the layer are called the high-order feature \({f_h}\), and features resulting from the initial image convolution, batch normalization (BN), ReLU, and max-pooling operations are defined as the primitive feature \({f_p}\).

Channel interaction: For the channel interaction between the primitive feature \({f_p}\) and the high-order feature \({f_h}\), three convolutions \(\varphi _{g}^{p}\), \(\varphi _{t}^{p}\), \(\varphi _{v}^{p}\) preprocess the primitive and high-order features, generating three compact outputs \(\varphi _{g}^{p}\left( {{f_p}} \right)\), \(\varphi _{t}^{p}\left( {{f_h}} \right)\), \(\varphi _{v}^{p}\left( {{f_p}} \right)\) with the same feature dimensions as the primitive feature. The feature maps are flattened in the last dimension, and matrix multiplication followed by softmax computes the channel similarity of \({f_p}\) and \({f_h}\), yielding the channel similarity matrix:

$$M_{C}^{p}({C_p} \times {C_p})=soft\hbox{max} (\varphi _{g}^{p}({f_p}) \times \varphi _{t}^{p}({f_h})),$$
(1)

Then, the preprocessed primitive feature \(\varphi _{v}^{p}\left( {{f_p}} \right)\) and the channel similarity matrix \(M_{C}^{p}\) are multiplied to achieve feature interaction across stages, and a \(1 \times 1\) convolution \(\varphi _{w}^{{ph}}\) restores the interaction features to the same shape as the high-order feature \({f_h}\):

$$f_{p}^{h}=\varphi _{w}^{{ph}}(M_{C}^{p} \times \varphi _{v}^{p}({f_p})),$$
(2)

The same method is used to perform channel interaction between the low-order feature \({f_l}\) and the high-order feature \({f_h}\), obtaining:

$$f_{l}^{h}=\varphi _{w}^{{lh}}(M_{C}^{l} \times \varphi _{v}^{l}({f_l})),$$
(3)

Finally, we utilize matrix addition to perform feature aggregation along the channel dimension in the third-order primitive feature interaction module:

$$f_h^C = {f_h} + f_l^h + f_p^h$$
(4)
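Below is a minimal PyTorch sketch of the channel interaction in Eqs. (1)-(4). The module name ChannelInteraction, the use of \(1 \times 1\) convolutions for every \(\varphi\), and the bilinear interpolation used to align spatial sizes are assumptions made for illustration; the paper's implementation details may differ.

```python
# Hedged sketch of the channel interaction of TPFI (Eqs. (1)-(4)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelInteraction(nn.Module):
    def __init__(self, c_src: int, c_high: int):
        super().__init__()
        self.phi_g = nn.Conv2d(c_src, c_src, 1)   # preprocess source feature (f_p or f_l)
        self.phi_t = nn.Conv2d(c_high, c_src, 1)  # preprocess high-order feature f_h
        self.phi_v = nn.Conv2d(c_src, c_src, 1)   # value branch on the source feature
        self.phi_w = nn.Conv2d(c_src, c_high, 1)  # restore to f_h's channel dimension

    def forward(self, f_src: torch.Tensor, f_h: torch.Tensor) -> torch.Tensor:
        b, _, h, w = f_h.shape
        # Align the source feature's spatial size with f_h (assumed step).
        f_src = F.interpolate(f_src, size=(h, w), mode='bilinear', align_corners=False)
        g = self.phi_g(f_src).flatten(2)          # (B, C_src, HW)
        t = self.phi_t(f_h).flatten(2)            # (B, C_src, HW)
        v = self.phi_v(f_src).flatten(2)          # (B, C_src, HW)
        # Channel similarity matrix M_C (Eq. (1)): (B, C_src, C_src).
        m_c = torch.softmax(torch.bmm(g, t.transpose(1, 2)), dim=-1)
        # Interaction and restoration (Eqs. (2)/(3)).
        out = torch.bmm(m_c, v).view(b, -1, h, w)
        return self.phi_w(out)

# Third-order channel aggregation (Eq. (4)): f_h^C = f_h + f_l^h + f_p^h.
f_p = torch.randn(2, 64, 96, 32)    # primitive feature
f_l = torch.randn(2, 256, 96, 32)   # low-order feature (layer input)
f_h = torch.randn(2, 512, 48, 16)   # high-order feature (layer output)
f_h_c = f_h + ChannelInteraction(256, 512)(f_l, f_h) + ChannelInteraction(64, 512)(f_p, f_h)
```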

Spatial interaction: We then perform spatial interaction among the channel-aggregated high-order feature \(f_{h}^{C}\), the low-order feature \({f_l}\), and the primitive feature \({f_p}\). This operation is similar to the channel interaction; the difference is that, in the feature preprocessing stage, the features are flattened along the channel dimension rather than the spatial dimensions, so that the spatial sizes of the preprocessed low-order feature \({f_l}\) and primitive feature \({f_p}\) are consistent with the high-order feature. Finally, we obtain the output of the third-order primitive feature interaction:

$${f_{TPFI}}=f_{h}^{C}+\psi _{w}^{{lh}}(M_{S}^{l} \times \psi _{p}^{l}({f_l}))+\psi _{w}^{{ph}}(M_{S}^{p} \times \psi _{v}^{p}({f_p})),$$
(5)

where, \(\psi\) denotes a \(1 \times 1\) convolution in the spatial interaction, and \({M_S}\) represents the spatial similarity matrix.

Diverse feature mining with wavelet convolution

Parameter efficiency and multi-frequency emphasis: For an l-level decomposition and a \({\text{k}} \times {\text{k}}\) kernel, the parameter count of wtConv [34] scales as \(\left( {l \cdot 4 \cdot c \cdot {k^2}} \right)\), whereas its effective receptive field (ERF) grows as \(\left( {{2^l} \cdot k} \right)\). For example, a 3-level wtConv with \(3 \times 3\) kernels achieves a \(24 \times 24\) ERF using only 117 parameters (108 parameters plus bias terms) per channel, compared with 576 parameters for a \(24 \times 24\) standard convolution.
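As a worked example of these formulas for a single channel (taking \(c=1\), \(l=3\), \(k=3\)):

$$l \cdot 4 \cdot c \cdot {k^2}=3 \cdot 4 \cdot 1 \cdot {3^2}=108,\quad {2^l} \cdot k={2^3} \cdot 3=24,$$

compared with \({24^2}=576\) parameters per channel for a standard \(24 \times 24\) convolution.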

By focusing convolutions on low-frequency subbands, wtConv enhances shape bias and robustness to high-frequency noise. This aligns with the observation that low frequencies encode structural information, while high frequencies correspond to textures.

Wavelet transform integration with CNNs: The wtConv serves as a drop-in replacement for depth-wise convolutions in existing architectures. Its implementation requires no architectural modifications, ensuring compatibility with standard training pipelines and downstream tasks.


This paper uses \(3 \times 3\) and \(5 \times 5\) wtConv branches to extract features. The network then combines the features from the different branches to obtain the final feature:

$${f^*} = {\theta _{1 \times 1}}({F_{\operatorname{Re} LU}}(\phi _{wtConv}^3({f_{TPFI}}) + \phi _{wtConv}^5({f_{TPFI}}) + {f_{TPFI}}))$$
(6)

where, \({\theta _{1 \times 1}}\) represents a convolution with kernel size 1, which restores the dimension to match \({f_{TPFI}}\).
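For concreteness, here is a hedged sketch of the wtDFM branch fusion of Eq. (6) built on a simplified one-level Haar wavelet convolution (the wtConv of [34] uses a multi-level decomposition). The class names, the Haar filters, and the depthwise design are illustrative assumptions; only the \(3 \times 3\)/\(5 \times 5\) branch combination follows the paper.

```python
# Hedged sketch: one-level Haar wavelet convolution + Eq. (6) branch fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels):
    # Four orthonormal 2x2 Haar filters (LL, LH, HL, HH), repeated per channel.
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    bank = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    return bank.repeat(channels, 1, 1, 1)               # (4C, 1, 2, 2)

class SimpleWTConv(nn.Module):
    """Depthwise conv in the Haar wavelet domain plus a spatial depthwise conv."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        self.register_buffer('wt', haar_filters(channels))
        self.channels = channels
        pad = kernel_size // 2
        self.subband_conv = nn.Conv2d(4 * channels, 4 * channels, kernel_size,
                                      padding=pad, groups=4 * channels)
        self.base_conv = nn.Conv2d(channels, channels, kernel_size,
                                   padding=pad, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.channels
        lowhigh = F.conv2d(x, self.wt, stride=2, groups=c)                 # DWT
        lowhigh = self.subband_conv(lowhigh)                               # conv on subbands
        recon = F.conv_transpose2d(lowhigh, self.wt, stride=2, groups=c)   # inverse DWT
        return self.base_conv(x) + recon

class WTDFM(nn.Module):
    """f* = theta_1x1(ReLU(wtConv_3(f) + wtConv_5(f) + f)) as in Eq. (6)."""
    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = SimpleWTConv(channels, 3)
        self.branch5 = SimpleWTConv(channels, 5)
        self.theta = nn.Conv2d(channels, channels, 1)

    def forward(self, f_tpfi: torch.Tensor) -> torch.Tensor:
        return self.theta(F.relu(self.branch3(f_tpfi) + self.branch5(f_tpfi) + f_tpfi))

feat = torch.randn(2, 512, 48, 16)
out = WTDFM(512)(feat)   # same shape as the input
```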

Loss functions

This section describes several loss functions for network training. It introduces shape images, performs two similar computations, and calculates the total loss via a simple weighted sum.

Identity classification loss: the identity loss \({L_{id}}\) uses the cross-entropy loss function [51], which helps the model learn to distinguish pedestrians' identity features and increases inter-class differences. It classifies the output feature \({f^*}\) to mitigate the negative impact of modality differences, ensuring the network correctly identifies different identities:

$${L_{ce}} = - \frac{1}{N}\sum\nolimits_{i = 1}^N {\sum\nolimits_{c = 1}^C {{y_{i,c}}\log ({p_{i,c}})} }$$
(7)

where, N represents the number of samples, C the number of classes, \({y_{i,c}}\) the one-hot encoded labels, and \({p_{i,c}}\) the predicted probabilities.

Triplet loss: To learn a cross-modal shared metric space, this paper adopts the triplet loss [49]. It pulls anchor-positive pairs (same-identity cross-modal samples) closer and pushes negative pairs (different-identity samples) apart, enhancing the discriminative power of the feature space:

$${L_{tri}} = \frac{1}{N}\sum\nolimits_{i = 1}^N {\max (d({a_i},{p_i}) - d({a_i},{n_i}) + \alpha ,0)}$$
(8)

where, \(d\left( \cdot \right)\) denotes the Euclidean distance, \(\alpha\) is the margin hyperparameter, \({a_i}\) represents the anchor sample, \({p_i}\) the positive sample, and \({n_i}\) the negative sample.
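A minimal sketch of Eq. (8) using PyTorch's built-in triplet loss; the margin value of 0.3 and the randomly generated features are placeholders, not values from the paper.

```python
# Minimal sketch of the triplet loss in Eq. (8).
import torch
import torch.nn.functional as F

anchor = torch.randn(32, 2048)    # e.g. visible features
positive = torch.randn(32, 2048)  # same identity, infrared modality
negative = torch.randn(32, 2048)  # different identity

l_tri = F.triplet_margin_loss(anchor, positive, negative, margin=0.3, p=2)
```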

Orthogonality loss: To ensure feature diversity, this paper uses an orthogonality loss [50] to constrain the feature vectors of different branches to be orthogonal, reducing feature redundancy. The orthogonality loss is given by:

$${L_{ort}} = \sum\limits_{m = 1}^{i - 1} {\sum\limits_{n = m + 1}^i {(f_ + ^{mT}f_ + ^n)} }$$
(9)

where, m and n index the m-th and n-th features generated from \({f_{TPFI}}\), respectively.
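A minimal sketch of Eq. (9), interpreting it as the sum of pairwise inner products between features from different branches; the batch averaging is an assumption.

```python
# Hedged sketch of the orthogonality loss in Eq. (9).
import torch

def orthogonality_loss(branch_feats):
    # branch_feats: list of (B, D) feature tensors, one per branch.
    loss = branch_feats[0].new_zeros(())
    for m in range(len(branch_feats) - 1):
        for n in range(m + 1, len(branch_feats)):
            # Inner product per sample, averaged over the batch (assumption).
            loss = loss + (branch_feats[m] * branch_feats[n]).sum(dim=1).mean()
    return loss

f1 = torch.randn(32, 2048)
f2 = torch.randn(32, 2048)
l_ort = orthogonality_loss([f1, f2])
```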

Center-guided pair mining loss: The center-guided pair mining (CPM) loss, proposed by Zhang et al. [32], is designed for multi-branch network structures that generate diversified features. It narrows intra-class distances across modalities, reducing the VIS-IR modality gap, while widening the distances between generated and original features to mine diverse cross-modal clues. Crucially, it ensures that inter-class distances exceed intra-class ones:

$$L({f_v},{f_n},f_{{v+}}^{i})={[D(f_{n}^{j},f_{{v+}}^{{i,j}}) - D(f_{v}^{j},f_{{v+}}^{{i,j}}) - D(f_{v}^{j},f_{v}^{k})]_+},$$
(10)

where, \({f_v}\) and \({f_n}\) are the VIS and IR features from the TPFI block, and \(f_{{v+}}^{i}\) is the feature generated from the i-th branch of \({f_v}\). \(D\left( { \cdot , \cdot } \right)\) is the Euclidean distance between two features, j and k are different identities in a mini-batch, and \({\left[ x \right]_+}=\max \left( {x,0} \right)\).

In Eq. (10), \(f_{n}^{j}\) denotes the infrared feature of identity j, \(f_{v}^{j}\) the visible feature of identity j, and \(f_{{v+}}^{{i,j}}\) the feature generated from \(f_{v}^{j}\) in branch i. The first term reduces the distance between the newly generated visible features and the original infrared features, decreasing the modality difference. The second term increases the distance between the newly generated visible features and the original visible features, prompting the network to learn diverse features. The third term ensures that the intra-class distance is smaller than the inter-class distance.

Similarly, for the feature generated by \({f_n}\) in branch i, the loss function that needs to be satisfied is:

$$L({f_v},{f_n},f_{{n+}}^{i})={[D(f_{v}^{j},f_{{n+}}^{{i,j}}) - D(f_{n}^{j},f_{{n+}}^{{i,j}}) - D(f_{n}^{j},f_{n}^{k})]_+},$$
(11)

Therefore, the final CPM loss can be expressed as:

$${L_{cpm}}=L({f_v},{f_n},f_{{v+}}^{i})+L({f_v},{f_n},f_{{n+}}^{i})$$
(12)
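A hedged sketch of Eqs. (10)-(12) for a single identity pair (j, k) and one generated branch i; in practice the loss is accumulated over identities and branches within a mini-batch.

```python
# Hedged sketch of the center-guided pair mining (CPM) loss, Eqs. (10)-(12).
import torch

def euclid(a, b):
    return torch.norm(a - b, p=2)

def cpm_term(f_other_mod, f_same_mod, f_gen, f_same_neg):
    # [D(other, gen) - D(same, gen) - D(same, neg)]_+
    return torch.clamp(euclid(f_other_mod, f_gen)
                       - euclid(f_same_mod, f_gen)
                       - euclid(f_same_mod, f_same_neg), min=0)

# Identity-j features in each modality, a negative identity k, and generated features.
f_v_j, f_n_j = torch.randn(2048), torch.randn(2048)   # VIS / IR features of identity j
f_v_k, f_n_k = torch.randn(2048), torch.randn(2048)   # VIS / IR features of identity k
f_vplus_j = torch.randn(2048)                         # generated from f_v_j in branch i
f_nplus_j = torch.randn(2048)                         # generated from f_n_j in branch i

# Eq. (10) + Eq. (11) -> Eq. (12).
l_cpm = (cpm_term(f_n_j, f_v_j, f_vplus_j, f_v_k)
         + cpm_term(f_v_j, f_n_j, f_nplus_j, f_n_k))
```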

Total loss: This paper employs four loss functions, and the network undergoes two computational passes. In the first pass, visible and infrared images are input, and the network computes the sum of the four loss functions to obtain the modality loss:

$${L_{ml}}=L_{{id}}^{m}+L_{{tri}}^{m}+L_{{ort}}^{m}+L_{{cpm}}^{m},$$
(13)

where, \(L_{*}^{m}\) represents the loss calculated using modal images.

In the second pass, the network takes shape images, derived from visible and infrared images, as input. It calculates the sum of the four loss functions to obtain the shape loss.

$${L_{sl}}=L_{{id}}^{s}+L_{{tri}}^{s}+L_{{ort}}^{s}+L_{{cpm}}^{s},$$
(14)

where, \(L_{*}^{S}\) represents the loss calculated using shape images.

Finally, the network is trained by minimizing the weighted sum of the modality and shape losses:

$${L_{total}}=\lambda {L_{ml}}+(1 - \lambda ){L_{sl}}$$
(15)

where, \(\lambda\) denotes the weight that adjusts the loss ratio to control the proportion of shape-enhanced features in the network. Experiments show that the network performs best when \(\lambda = 0.9\).
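A minimal sketch of the two-pass loss combination in Eqs. (13)-(15); the individual loss values below are placeholders standing in for \({L_{ce}}\), \({L_{tri}}\), \({L_{ort}}\) and \({L_{cpm}}\) computed on the modality images (first pass) and on the shape images (second pass).

```python
# Minimal sketch of the total-loss weighting in Eqs. (13)-(15).
import torch

lam = 0.9  # lambda in Eq. (15); the paper reports best results at 0.9

# First pass: visible + infrared images -> modality loss (Eq. (13)).
l_ce_m, l_tri_m, l_ort_m, l_cpm_m = (torch.rand(()) for _ in range(4))  # placeholders
l_ml = l_ce_m + l_tri_m + l_ort_m + l_cpm_m

# Second pass: shape images of the same samples -> shape loss (Eq. (14)).
l_ce_s, l_tri_s, l_ort_s, l_cpm_s = (torch.rand(()) for _ in range(4))  # placeholders
l_sl = l_ce_s + l_tri_s + l_ort_s + l_cpm_s

# Eq. (15): total loss used for back-propagation.
l_total = lam * l_ml + (1 - lam) * l_sl
```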

Experiments

This paper conducted experiments on two public datasets, SYSU-MM01 and RegDB, to evaluate the effectiveness of the proposed method and compare it with recent approaches, demonstrating its superiority. Additionally, ablation experiments were conducted to assess the contributions of TPFI, wtDFM, and ShapeLoss.

Dataset and evaluation protocol

The RegDB dataset contains 412 people; each person has 10 visible images and 10 corresponding thermal images that vary in body posture, capture distance, and lighting conditions. Of the 412 people, 254 are female and 158 are male; 156 were photographed from the front and the remaining 256 from other angles. The images in this dataset are small and of poor clarity, and the RGB and thermal images of each identity are in one-to-one correspondence [34]. The SYSU-MM01 dataset contains images of 491 people captured by 4 RGB cameras and 2 infrared cameras, totaling 30,071 RGB images and 15,792 infrared images. For testing, it supports two evaluation settings: all-search mode and indoor-search mode. The query set contains 3,803 images captured by IR cameras 3 and 6 in both settings, while the gallery set in all-search mode includes all visible images from the four RGB cameras; in indoor-search mode, the gallery set includes images only from the two indoor RGB cameras.

Shape dataset: The datasets are converted into shape maps. The SCHP network is used to create shape maps for the SYSU-MM01 dataset [23]. However, due to the low resolution and poor image quality of the RegDB dataset, existing human parsing networks produce unsatisfactory results [24,25]. This paper manually annotates the RegDB dataset to create the first high-quality RegDB shape dataset.

To evaluate the performance of the network, this paper uses two metrics: Rank and mean Average Precision (mAP). The Rank indicator, particularly Rank-1, is a key metric for person re-identification; it gives the proportion of queries for which the correct sample is ranked first, so a higher Rank-1 value signifies better performance [35]. The mAP is another widely used metric that measures the model's average performance across all queries and is computed by averaging the Average Precision (AP) of each query. mAP accounts for both precision and recall, providing a more comprehensive assessment of re-identification performance [36].

Implementation details

This paper uses a single RTX 4090 GPU for the experiments. Input images are preprocessed with horizontal flipping, random cropping, and random erasing, and then fed into the network, which uses ResNet50 as its backbone. For the RegDB dataset, the fourth stage of ResNet50 is removed, the proposed TPFI module is added to the first and second stages, and the wtDFM module is added to the second stage. For the SYSU-MM01 dataset, the TPFI module is added after each of the first three stages, and the wtDFM module is added after the third stage. During the first 10 warm-up epochs, the learning rate increases from 0.01 to 0.1; it remains at 0.1 from epoch 10 to 20, decreases to 0.01 from epoch 20 to 80 and to 0.001 from epoch 80 to 120, and is further reduced to 0.0001 beyond epoch 120, for a total of 150 training epochs. The network is initialized with ImageNet pre-trained weights.
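A hedged sketch of the learning-rate schedule described above; linear warm-up over the first 10 epochs is assumed, and the schedule is written as a per-epoch lookup rather than a specific PyTorch scheduler class.

```python
# Hedged sketch of the step schedule described in the text.
def learning_rate(epoch: int) -> float:
    if epoch < 10:                       # warm-up: 0.01 -> 0.1 (linear, assumed)
        return 0.01 + (0.1 - 0.01) * epoch / 10
    elif epoch < 20:
        return 0.1
    elif epoch < 80:
        return 0.01
    elif epoch < 120:
        return 0.001
    else:                                # up to 150 training epochs in total
        return 0.0001

# Example: apply to an optimizer each epoch (optimizer is a placeholder).
# for g in optimizer.param_groups:
#     g['lr'] = learning_rate(epoch)
```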

Comparison with state-of-the-art methods

This paper compares the proposed method with existing state-of-the-art VI-ReID methods, including FMCNet [12], SGIEL [22], DEEN [32], PMT [37], AGMNet [38], LCNL [39], MCJA [40], DCPLNet [41], MPMN [42], PMCM [43], CSDN [44], MIP [45], AGCC [46], RCC [47], DNS [48], MSCMNet [52], and LAReViT [53]. Extensive experiments demonstrate that the proposed HIW-Net achieves superior or comparable performance on both the SYSU-MM01 and RegDB datasets.

RegDB. As shown in Table 1, in the VIS-to-IR mode of RegDB, HIW-Net achieves a Rank-1 accuracy of 94.88%, the best performance, while its mAP is comparable to the best. In the IR-to-VIS mode, it achieves a Rank-1 accuracy of 94.32%, again the best performance, while its mAP of 87.18% is the second best.

SYSU-MM01. As shown in Table 2, in the all-search mode of SYSU-MM01, HIW-Net achieves the best Rank-1, Rank-10, and Rank-20 performance of 85.05%, 97.67%, and 99.67%, which are 8.23%, 0.07%, and 0.36% higher than the second-best results, respectively. In the indoor-search mode, it achieves the best Rank-1, Rank-10, and mAP of 87.50%, 99.32%, and 87.75%, surpassing the second-best results by 3.29%, 0.32%, and 0.92%, respectively. In the remaining metrics, its performance is comparable to the best results.

The triplet loss employed by HIW-Net emphasizes the distance relationships between hard sample pairs, pushing images of the target person toward the front of the ranking and improving the Rank metric. Although the network effectively captures the shape and other key features of the target person, it does not sufficiently distinguish between non-target people. This lack of fine-grained discrimination allows non-target people to appear relatively high in the ranking, which affects the mAP.

Table 1 Comparison of the proposed method with state-of-the-art approaches on the RegDB dataset. Bold values indicate the best performance, while underlined values represent the second-best results.
Table 2 Comparison of the proposed method with state-of-the-art approaches on the SYSU-MM01 dataset. Bold values indicate the best performance, while underlined values represent the second-best results.

Ablation studies and analyses

In this section, we conduct ablation studies and analyses to evaluate the effectiveness of each component of the proposed High-order Interaction and Wavelet Convolution Network (HIW). All experiments were conducted on the RegDB dataset under the same baseline in the IR-to-VIS mode, with the wavelet convolution kernel sizes and shape-loss weight kept consistent. The results are shown in Table 3.

Adding each module alone increases Rank-1 by more than 2.6% and mAP by more than 18%, demonstrating each module's effectiveness. When the wtDFM module and the shape dataset are added together, Rank-1 increases by 5.43% and mAP by 20.36%, showing that the two are complementary, as both enhance the focus on shape features. Using all three components at once boosts Rank-1 by 6.85% and mAP by 23.34%.

Table 3 The impact of each component on HIW-Net. All experiments are conducted in the IR-to-VIS mode, with wavelet convolution kernel sizes of \(3 \times 3\) and \(5 \times 5\), and the shape loss and modality loss coefficients set to 0.1 and 0.9, respectively.

The effectiveness of third-order primitive feature interaction (TPFI): As shown in Table 3, adding the TPFI module to the baseline improves Rank-1 by 3.33% and mAP by 20.38%. The visualizations in Fig. 6 show that features from a general second-order interaction module (Fig. 6b) lose some chest, abdomen, and foot features compared with the primitive features (Fig. 6a). In contrast, Fig. 6c, which uses the TPFI module (Fig. 5) to incorporate primitive features into the deep network, reduces this feature loss.

Fig. 5

Third-order Primitive Feature Interaction.

Fig. 6

a represents the primitive feature map, b depicts the second-order feature interaction map, and c illustrates the third-order primitive feature interaction (TPFI) map.

The effectiveness of diverse feature mining with wavelet convolution (wtDFM): As shown in Table 3, adding the wtDFM module to the baseline boosts Rank-1 by 2.68% and mAP by 18.16%. The visualization in Fig. 7b, which uses the wtDFM module, captures more features such as the chest, abdomen, and legs and focuses more on the human silhouette than Fig. 7a, which does not use the module.

To determine the optimal wavelet convolution kernel sizes for the wtDFM module, this section experiments with kernel sizes of \(3 \times 3\) and \(5 \times 5\), \(3 \times 3\) and \(7 \times 7\), and \(5 \times 5\), as shown in Table 4. The network achieves the best performance with the \(3 \times 3\) and \(5 \times 5\) combination.

Fig. 7

a shows the feature map of the module without wtDFM, while b depicts the feature map of the module with wtDFM.

Table 4 The impact of the wavelet convolution kernel size on HIW-Net. All experiments set the shape loss coefficient to 0.1 and the modality loss coefficient to 0.9.

The effectiveness of the shape dataset: As shown in Table 3, training the baseline with the addition of the shape dataset increases Rank-1 by 2.60% and mAP by 19.19%. The weight coefficients of the modality loss and shape loss are adjusted to find the best combination; as shown in Table 5, the network achieves the best performance when \({\alpha _s}=0.1\) and \({\alpha _m}=0.9\). Figure 8 visualizes the feature maps for \({\alpha _s}=0.1\), \({\alpha _s}=0.2\), and \({\alpha _s}=0.3\), showing that as the weight coefficient of the shape loss increases, the network places greater focus on shape features.

Table 5 The impact of the weight coefficients \({\alpha _s}\) and \({\alpha _m}\) on HIW-Net. All experiments used wavelet convolution kernel sizes of \(3 \times 3\) and \(5 \times 5\).
Fig. 8

a represents the feature map for \({\alpha _s}=0.1\), b the feature map for \({\alpha _s}=0.2\), and c the feature map for \({\alpha _s}=0.3\).

Visualization Analyses:

Fig. 9

Visualization of intra-class and inter-class feature distances.

Fig. 10

t-SNE visualization results of the baseline and HIW-Net.

To investigate the effectiveness of HIW-Net, we visualize the intra-class and inter-class distances on the SYSU-MM01 dataset in Fig. 9. HIW-Net achieves a larger gap between intra-class and inter-class distances, where \({d_1}<{d_2}\); thus, HIW-Net can effectively reduce the modality discrepancy between the VIS and IR images. As shown in Fig. 10, the t-SNE visualization of the learned identity features reveals that the baseline model's projections for the same identity are scattered and hard to distinguish. In contrast, HIW-Net, leveraging TPFI and wtDFM, extracts more comprehensive and diverse features, enabling it to effectively distinguish identities and aggregate features of the same person.

Fig. 11

Some Rank-10 retrieval results obtained by the baseline and the proposed HIW-Net on the SYSU-MM01 dataset.

To further demonstrate the effectiveness of HIW-Net, Fig. 11 shows some Rank-10 retrieval results of HIW-Net on the SYSU-MM01 dataset, where red boxes indicate incorrect matches. The results show that HIW-Net achieves better person re-identification performance than the baseline.

Conclusion

This paper addresses visible-infrared person re-identification by studying how to reduce feature loss during feature extraction and mine features more effectively while reducing modality differences. The proposed HIW-Net, which consists of third-order primitive feature interaction (TPFI) and a diversified feature mining module based on wavelet convolution (wtDFM), addresses these challenges. In addition, a shape-loss weighting strategy is introduced to enhance the network's attention to shape features. This paper also creates the RegDB_shape dataset by manually annotating the low-quality RegDB dataset to generate person shape maps. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that the proposed HIW-Net outperforms existing methods.