Introduction

Feature matching has long been an important research direction in computer vision, especially feature matching between multimodal images1,2,3. Multimodal images include optical images, near-infrared (NIR) images, depth images, synthetic aperture radar (SAR) images, short-wave infrared (SWIR) images, and so on. Many image processing tasks in computer vision build on feature matching, such as image fusion4,5,6, image matching7,8, image retrieval9, image classification10,11 and image embedding12. Because images of different modalities contain complementary information, an efficient and robust feature matching method for multimodal images is needed. The existing mainstream feature matching methods can be broadly categorized into two types: detector-based methods and detector-free methods.

Detector-based local feature matching methods rely on detecting keypoints and describing them with local descriptors, which may be handcrafted or learned with convolutional neural networks. Depending on how the descriptors are generated, detector-based methods can be grouped into three main categories: handcrafted feature descriptor methods, area-based descriptor generation methods, and learning-based feature descriptor methods. Handcrafted feature descriptor methods13,14,15 rely on human-designed local feature descriptors built with classical computer vision techniques, and establish correspondences based on spatial geometry. Area-based descriptor generation methods16,17,18 achieve image feature matching by measuring the similarity between pixels after a domain transformation, which works well when the images contain few details. However, their performance degrades when there are strong transformations between image pairs. Learning-based feature descriptor methods19,20,21 learn feature information through neural networks to generate discriminative descriptors. Detector-based methods are generally two-stage approaches: a keypoint detector first detects keypoints, and feature descriptors are then extracted at the keypoint locations. Because the subsequent matching depends directly on correct keypoint detection, traditional detector-based methods tend to suffer from large errors.

Detector-free local feature matching methods tend to produce better matching results than detector-based methods. Unlike the two-stage detector-based pipeline, they are one-stage methods: instead of relying on a keypoint detector, they select keypoints through pixel-level dense matching. The dense matching strategy reduces the matching errors caused by keypoint detection. The Transformer was initially proposed in the field of natural language processing (NLP)22. Owing to its excellent global modeling capability, more and more researchers have incorporated the Transformer into detector-free local feature matching methods. Numerous studies23,24,25 have shown the importance of dense global attention interactions in feature matching. However, such interactions do not distinguish between critical and non-critical regions. Attention interactions in non-critical regions inevitably introduce noise that degrades the subsequent matching. In contrast, concentrating on key local features makes feature matching far more effective. At the same time, because of the information discrepancy between the pair of images to be matched, carrying out information interaction on unimportant features wastes computational resources. Therefore, if the global information interaction can be focused on critical features while local information interaction is preserved, matching efficiency and accuracy can be greatly improved.

Fig. 1

Comparison of the feature matching results of FmCFA and other methods. In each image, the SAR image is displayed on the left, and the optical image is shown on the right.

To address the aforementioned challenges, we propose a novel multimodal image feature matching method, FmCFA. FmCFA integrates both local and global attention interactions effectively. We introduce a new attention mechanism, CF-Attention, which focuses attention interactions on critical features, thereby reducing the computational load of global attention while enhancing matching efficiency and accuracy. Additionally, we propose the CFa-Block, which combines window attention, CF-Attention, and cross-attention. The CFa-Block facilitates coarse-grained information interaction between features. As shown in Fig. 1, extensive experiments demonstrate that FmCFA outperforms existing methods in terms of matching accuracy on multimodal image datasets.

Related work

Because multimodal images carry complementary information, designing an excellent feature matching method for multimodal images is especially important for many computer vision studies and applications, and the feature matching technique is its most crucial part. Recent studies categorize feature matching methods into two types: detector-based methods and detector-free methods.

Detector-based local feature matching methods

Detector-based methods are the classical methods for feature matching, which can be roughly classified into three categories according to the different ways of extracting feature descriptors: handcrafted feature descriptor methods, area-based descriptor generation methods and learning-based feature descriptor methods.

  1. Handcrafted feature descriptor methods: As early as 2004, a classical handcrafted descriptor called the scale-invariant feature transform13 was proposed and used for feature matching. In 2019, Fu et al.14 generated a feature descriptor by combining texture and structural information corresponding to normalized feature vectors. In 2021, Cheng et al.15 proposed a binary feature descriptor used to retrieve missing features during image feature matching.

  2. Area-based descriptor generation methods: A feature descriptor based on dense local self-similarity, together with a similarity metric defined on this descriptor, was proposed by Ye et al.16. In 2020, Xiong et al.17 generated a feature descriptor based on ranked local self-similarity, which captures local features with high image discriminability. In 2021, a novel local feature descriptor based on the discrete cosine transform was proposed by Gao et al.18; it maintains local information efficiently and compactly and thus achieved good results.

  3. Learning-based feature descriptor methods: In 2020, the classical SuperGlue19 was proposed. SuperGlue performs attention-based contextual aggregation with a graph neural network and estimates the assignment by solving a differentiable optimal transport problem. In 2021, Ma et al.20 processed multispectral images with a generative adversarial network under a regularization condition to address the feature discrepancy between multispectral images. In 2022, GLMNet21 solved the graph matching problem with a graph learning network while proposing a new constrained regularization loss on top of Laplacian-sharpened graph convolution.

Detector-free local feature matching methods

Detector-free methods are one-stage feature matching methods that use pixel-wise dense matching. In 2020, GOCor26 was proposed by Truong et al. as a fully differentiable replacement for the feature correlation layer that effectively learns spatial matching priors. In 2021, Truong et al.27 computed the dense flow field between two images and expressed the accuracy and reliability of the predictions through a pixel-wise confidence map; they further parameterized the prediction distribution as a constrained mixture model to better model flow predictions and outliers. The DualRC-Net28 proposed by Li et al. introduces a dual-resolution correspondence network and obtains pixel-level correspondences in a coarse-to-fine manner, which improves the accuracy and precision of matching. In 2021, Sun et al.23 proposed LoFTR, a Transformer-based local feature matching method. LoFTR first establishes pixel-level dense matching on a coarse feature map to obtain coarse matching results, which are then refined on a fine feature map to obtain the final matches.

Vision transformer

The rise of the Transformer model began in the field of Natural Language Processing (NLP) and, due to its exceptional ability to capture global information, has led to the development of numerous transformer-based approaches in various areas of computer vision, including but not limited to semantic segmentation29,30, object detection31,32, and image classification33,34. Similarly, many researchers have applied Transformers to the field of feature matching. For example, LoFTR23 utilizes self-attention and cross-attention mechanisms to perform coarse-grained local transformer interactions on intra-image and inter-image pairs. FeMIP24 leverages the Vision Transformer for multimodal image feature matching and has achieved impressive results across multiple modality datasets. While these methods employ Transformers to address image feature matching, they mainly focus on extracting attention from global features. In contrast, our proposed FmCFA method distinguishes key regions in multimodal images and applies Transformers to capture critical attention on these important features. This approach not only reduces computational cost but also enhances both matching efficiency and accuracy.

Fig. 2

The overview of FmCFA. \(x_{q}\) denotes the query image and \(x_{r}\) denotes the reference image. During feature extraction, feature maps of 1/8 and 1/2 sizes are generated. The 1/8 size features are used for coarse matching, while the 1/2 size features are used for fine matching. In the coarse matching phase, the Critical Features Interaction module is applied to enhance attention on coarse-level features, which are then further refined by matching them with the 1/2 size features for fine matching. Finally, the resulting matched features are used to produce the final feature matching.

Methods

A general overview of our proposed FmCFA is shown in Fig. 2. FmCFA is a one-stage detector-free feature matching method whose input is a pair of images denoted as \(x_q\) and \(x_r\), where \(x_q\) denotes the query image and \(x_r\) denotes the reference image; they are images of different modalities. First, the feature extractor (“Feature extraction”) extracts initial features from \(x_q\) and \(x_r\), producing a coarse-grained feature map at 1/8 of the original size and a fine-grained feature map at 1/2. The initial 1/8 coarse-grained feature map is then passed into our proposed critical feature interaction module (“Critical feature attention block”), which consists of several critical feature attention blocks stacked together. To achieve critical-region attention interaction while effectively reducing the computation of global attention, the critical feature attention block is designed as a feature fusion augmentation block combining window attention, CF-Attention (“Critical feature attention”) and cross-attention. After the critical feature interaction module, attention-enhanced coarse-level features are obtained. We then implement dense matching on the coarse-grained feature map to obtain coarse matching results. Finally, on the fine-grained feature map, pixel-level refinement regression is performed in regions where the coarse matching probability exceeds a given threshold.

Preliminary

Attention

The attention mechanism excels at capturing long-distance dependencies. Given an input X, the query (Q), key (K) and value (V) matrices are generated by linear transformations of X. The attention output is then computed from these three matrices:

$$\begin{aligned} Q&= XW_Q \nonumber \\ K&= XW_K \nonumber \\ V&= XW_V \end{aligned}$$
(1)
$$\begin{aligned} Attention(Q,K,V)&=Softmax(\frac{(QK^T)}{\sqrt{d}})V \end{aligned}$$
(2)

where \(W_Q\), \(W_K\) and \(W_V\) are learned weight matrices used to transform the input sequence X into a matrix of queries, keys and values. d is the dimensionality of the queries and keys.
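The computation in Eqs. (1) and (2) can be written compactly in PyTorch. The sketch below is illustrative only: the module name, the single-head formulation and the absence of dropout are our simplifications, not details of FmCFA.

```python
# Minimal sketch of Eqs. (1)-(2): projections, scaled dot-product, softmax.
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)  # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W_V
        self.scale = dim ** -0.5                    # 1 / sqrt(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) -> Q, K, V: (B, N, C)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return attn @ v                             # (B, N, C)
```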

Linear attention

Although the attention mechanism has excellent global modeling capability, it introduces a large computational complexity. The linear attention mechanism was proposed in the Linear Transformer35; it achieves comparably good results at a lower computational complexity. Linear attention is formulated as follows:

$$\begin{aligned} \text{ LinearAttention } (Q, K, V)=\frac{\phi (Q)\left( \phi (K)^{T} V\right) }{\phi (Q) \sum _{j=1}^{N} \phi \left( k_{j}\right) } \end{aligned}$$
(3)

where \(k_{j}\) is the jth key vector in K.

$$\begin{aligned} \phi (x) = elu(x) + 1 \end{aligned}$$
(4)

where \(elu(\cdot )\) denotes the exponential linear unit activation function.
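A possible PyTorch realisation of Eqs. (3) and (4) is sketched below. The small stabilising constant eps and the (B, N, C) shape convention are our assumptions for readability.

```python
# Minimal sketch of linear attention (Eqs. (3)-(4)) with phi(x) = elu(x) + 1.
import torch
import torch.nn.functional as F


def linear_attention(q, k, v, eps: float = 1e-6):
    # q, k, v: (B, N, C)
    q = F.elu(q) + 1                                                # phi(Q)
    k = F.elu(k) + 1                                                # phi(K)
    kv = torch.einsum("bnc,bnd->bcd", k, v)                         # phi(K)^T V
    z = 1.0 / (torch.einsum("bnc,bc->bn", q, k.sum(dim=1)) + eps)   # denominator
    return torch.einsum("bnc,bcd,bn->bnd", q, kv, z)                # (B, N, C)
```

Because the key-value product is computed first, the cost grows linearly with the number of tokens N instead of quadratically, which is why this form is used for the coarse-level interactions.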

Window attention

Window attention allows the model to focus on more localized information when processing the input x. The model considers only the relationships between positions inside the same window rather than computing global self-attention. The local windows partition the image uniformly in a non-overlapping manner, and self-attention is performed within each window:

$$\begin{aligned} x_{w}= \text{ WinPartition } (x), x_{w} \in R^{(B \times nw ) \times w \times w \times c} \end{aligned}$$
(5)

where B denotes the batch size, H and W denote the height and width of the input x, C denotes the number of channels and w denotes the window size. We define \(nw=\frac{H \times W}{w \times w}\). The function WinPartition performs a non-overlapping partition of the input feature map with window size w\(\times\)w, so the shape of the output feature map is \((B \times nw ) \times w \times w \times C\).

$$\begin{aligned} x_{w}^{\prime }= \text{ WinMerge } \left( \operatorname {Attention}\left( x_{w}\right) \right) , x_{w}^{\prime } \in R^{B \times H \times W \times C} \end{aligned}$$
(6)

where the function WinMerge merges the set of windows with shape \((B \times nw ) \times w \times w \times C\) back into a \(B \times H \times W \times C\) feature map.

The above window attention process can be summarized as:

$$\begin{aligned} x_{w}^{\prime }= \text{ WindowAttention } (x)= \text{ WinMerge } ( \text{ Attention } ( \text{ WinPartition } (x))) \end{aligned}$$
(7)
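The partition/attention/merge pipeline of Eqs. (5)–(7) can be sketched as below. It reuses the SelfAttention sketch above and assumes H and W are divisible by the window size w; the function names mirror the notation in the equations.

```python
# Minimal sketch of Eqs. (5)-(7): WinPartition -> Attention -> WinMerge.
import torch


def win_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    # x: (B, H, W, C) -> (B * nw, w, w, C), nw = (H * W) / (w * w)
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w, w, C)


def win_merge(xw: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # xw: (B * nw, w, w, C) -> (B, H, W, C)
    w, C = xw.shape[1], xw.shape[-1]
    B = xw.shape[0] // ((H // w) * (W // w))
    x = xw.view(B, H // w, W // w, w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


def window_attention(x: torch.Tensor, w: int, attn) -> torch.Tensor:
    # attn operates on token sequences of shape (B', N, C), e.g. SelfAttention
    B, H, W, C = x.shape
    xw = win_partition(x, w).reshape(-1, w * w, C)   # tokens per window
    xw = attn(xw).reshape(-1, w, w, C)               # per-window self-attention
    return win_merge(xw, H, W)
```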

Feature extraction

In the initial feature extraction phase, we adopt a UNet-like36 network structure that uses the intermediate layers of a ResNet37 to extract features. Before being fed to the network, the image pairs are resized to 320\(\times\)320; the inputs are denoted \(x_q\) and \(x_r\). Feature maps of 1/8 size, denoted \(x_q^c\) and \(x_r^c\), are extracted by the intermediate layers of the ResNet37. The fine-grained feature maps, denoted \(x_q^f\) and \(x_r^f\), are then derived from the coarse-grained ones through a series of convolution, up-sampling and fusion operations.
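The extractor can be approximated by the sketch below: a ResNet-style downsampling path yields the 1/8 coarse map, which is upsampled and fused with an early 1/2 map. The channel widths and layer counts are illustrative assumptions, not the exact configuration used in FmCFA.

```python
# Rough sketch of a UNet-like extractor producing 1/8 and 1/2 feature maps.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseFineExtractor(nn.Module):
    def __init__(self, c_fine: int = 64, c_coarse: int = 256):
        super().__init__()
        self.stem = nn.Sequential(                      # 1/2 resolution
            nn.Conv2d(3, c_fine, 7, stride=2, padding=3),
            nn.BatchNorm2d(c_fine), nn.ReLU())
        self.down = nn.Sequential(                      # 1/2 -> 1/8 resolution
            nn.Conv2d(c_fine, 128, 3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, c_coarse, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_coarse), nn.ReLU())
        self.fuse = nn.Conv2d(c_coarse + c_fine, c_fine, 3, padding=1)

    def forward(self, x: torch.Tensor):
        f2 = self.stem(x)                               # (B, c_fine, H/2, W/2)
        f8 = self.down(f2)                              # (B, c_coarse, H/8, W/8)
        up = F.interpolate(f8, size=f2.shape[-2:], mode="bilinear", align_corners=False)
        fine = self.fuse(torch.cat([up, f2], dim=1))    # fused 1/2 feature map
        return f8, fine                                 # coarse (1/8), fine (1/2)
```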

Fig. 3

Critical areas (orange) and noncritical areas (green).

Critical feature attention block

Due to the information variability of multimodal images, global Transformer interaction inevitably exchanges information between useless features. Such interaction between useless features wastes computational resources and negatively affects the subsequent matching process. As shown in Fig. 3, the orange areas represent critical features and the green areas represent non-critical features. The features in the orange regions are obviously rich in information, and attention interaction on such features benefits keypoint finding and matching. In contrast, the features within the green regions carry little critical information; this useless information hinders keypoint extraction and matching. Therefore, focusing the attention interaction on critical features as much as possible improves the accuracy and efficiency of matching on multimodal images.

Fig. 4

The overview of the critical feature attention block (CFa-Block). In the CFa-Block, window attention is first applied to the 1/8-size feature map. CF-Attention (including the critical features residual, CFR) is then applied to obtain critical feature attention. Following this, cross-attention is computed between the images of the two modalities to generate the coarse match.

Besides, although some methods perform attention interactions at the coarse level, these interactions span the entire coarse feature map and lack an understanding of the information within a given region. Images are usually composed of various local features, such as edges, textures and corners, which are crucial for object recognition and matching. Moreover, an object usually consists of multiple local parts, and the relative positions and relationships between these parts matter for matching. Local information interaction within a region therefore helps the model understand the relationships between different parts and achieve better matching results.

Based on the above viewpoint, we propose a new attention interaction block called the CFa-Block, designed so that the model can focus global information interaction on critical features while retaining the ability to interact with local information. We apply it within the coarse-level feature transformer module.

The inputs to the critical feature interaction module are the coarse-grained feature maps (\(x_q^c \in R^{B \times H \times W \times C}\) and \(x_r^c \in R^{B \times H \times W \times C}\)). Assume that the size of the partition window is w and the number of critical features selected inside each window is k. \(x_q^c\) and \(x_r^c\) are first partitioned into non-overlapping windows, and self-attention interactions are performed within each window:

$$\begin{aligned} \begin{aligned} x_q^{cw} = WindowAttention(x_q^c) \\ x_r^{cw} = WindowAttention(x_r^c) \end{aligned} \end{aligned}$$
(8)

where \(x_q^{cw} \in R^{B \times H \times W \times C}\) and \(x_r^{cw} \in R^{B \times H \times W \times C}\). nw denotes the number of windows.

Fig. 5

The process of CF-attention.

The outputs \(x_q^{cw}\) and \(x_r^{cw}\) of the window attention are then passed through CF-Attention. CF-Attention, the critical feature attention mechanism, focuses the attention interactions on the selected critical features to avoid introducing noise from unimportant features. We describe CF-Attention in detail in “Critical feature attention”.

$$\begin{aligned} \begin{aligned} x_q^{cf} = CF\text{- }Attention(x_q^{cw}) \\ x_r^{cf} = CF\text{- }Attention(x_r^{cw}) \end{aligned} \end{aligned}$$
(9)

where \(x_q^{cf} \in R^{B \times ( nw \times k) \times C}\) and \(x_r^{cf} \in R^{B \times ( nw \times k) \times C}\).

Finally, \(x_q^{cf}\) and \(x_r^{cf}\), after format conversion, perform information interaction between the image pair based on cross-attention. For cross-information interaction between image pairs, Q is generated from one image, while K and V are obtained from the other. Cross-attention interactions are also realised by linear attention:

$$\begin{aligned} \begin{aligned} x_q^e = LinearAttention(Q_q,K_r,V_r) \\ x_r^e = LinearAttention(Q_r,K_q,V_q) \end{aligned} \end{aligned}$$
(10)

where \(Q_q\), \(K_q\) and \(V_q\) are obtained from \(x_q^{cf}\) by linear projection. \(Q_r\), \(K_r\) and \(V_r\) are obtained from \(x_r^{cf}\) by linear projection. This process of cross-attention interaction can be summarised as follows:

$$\begin{aligned} x_q^e,x_r^e = CrossAttention(x_q^{cf},x_r^{cf}) \end{aligned}$$
(11)

Here, we do not restrict cross-attention to critical features but keep global cross-information interaction over the whole coarse feature map. Because of the large information difference between image pairs of different modalities, combining only the critical features disrupts the feature and location information to some extent. In this case, implementing cross-attention only on the important features leads to information disorder, which is detrimental to the cross-information interaction between images of different modalities and to the subsequent matching process.

All of the above processes can be summarised as follows:

$$\begin{aligned} \begin{aligned} x_q^e, x_r^e= CrossAttention(CF\text{- }Attention(WindowAttention(x_q^c,x_r^c))) \end{aligned} \end{aligned}$$
(12)

We use CFa-Block to represent all the above operations:

$$\begin{aligned} \begin{aligned} x_q^e, x_r^e= CFa\text{- }Block(x_q^c,x_r^c) \end{aligned} \end{aligned}$$
(13)

\(x_q^e\) and \(x_r^e\) represent the final output. The overall flow of the CFa-Block is shown in Fig. 4. Our coarse-level feature transformer module iterates the CFa-Block N times.
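Putting the pieces together, one iteration of the CFa-Block (Eqs. (12)–(13)) can be outlined as follows. The three attention stages are passed in as callables, and the function name is ours; the sketch only shows the composition, not the exact implementation.

```python
# Schematic composition of one CFa-Block iteration (Eqs. (12)-(13)).
def cfa_block(xq_c, xr_c, window_attn, cf_attn, cross_attn):
    # Per-image local interaction inside non-overlapping windows (Eq. 8)
    xq_cw, xr_cw = window_attn(xq_c), window_attn(xr_c)
    # Global interaction restricted to the selected critical features (Eq. 9)
    xq_cf, xr_cf = cf_attn(xq_cw), cf_attn(xr_cw)
    # Cross-image interaction over the whole coarse map (Eqs. 10-11)
    xq_e, xr_e = cross_attn(xq_cf, xr_cf)
    return xq_e, xr_e
```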

Critical feature attention

Our proposed CF-Attention aims to enable the model to focus attention on critical features, thereby improving matching accuracy. The overall flow of CF-Attention is shown in Fig. 5 and is described in detail in this section.

Critical features selection (CFS)

The process of selecting critical features in CF-Attention is shown in Fig. 6. The input to CF-Attention is the output of the window attention, denoted \(x_q^{cw} \in R^{(B \times nw) \times w \times w \times C}\) and \(x_r^{cw} \in R^{(B \times nw) \times w \times w \times C}\). First, average pooling is performed within each window to obtain its representative feature:

$$\begin{aligned} \begin{aligned} x_{q1} = AvgPool(x_q^{cw}) \\ x_{r1} = AvgPool(x_r^{cw}) \end{aligned} \end{aligned}$$
(14)

where \(x_{q1} \in R^{(B \times nw) \times 1 \times C}\) and \(x_{r1} \in R^{(B \times nw) \times 1 \times C}\). \(x_{q1}\) and \(x_{r1}\) denote the representative features of each window of \(x_q^{cw}\) and \(x_r^{cw}\), respectively.

Fig. 6

The descriptive diagram of the process of selecting critical features.

Next, the similarity between every feature within each window and that window’s representative feature is computed, and the similarity scores are normalised to obtain the similarity matrix of each window:

$$\begin{aligned} \begin{aligned} S_q = Norm(CosSim(x_{q1},x_q^{cw})) \\ S_r = Norm(CosSim(x_{r1},x_r^{cw})) \end{aligned} \end{aligned}$$
(15)

Based on the similarity scores, we select the k index locations with the highest scores. According to these index positions, k features are selected as the critical features inside each window.

$$\begin{aligned} \begin{aligned} k_q,index_q = TopK(Sort(S_q)) \\ k_r,index_r = TopK(Sort(S_r)) \end{aligned} \end{aligned}$$
(16)

Note that the shapes of \(k_q\) and \(k_r\) are \(R^{(B \times nw) \times k \times C}\). The above process is summarised as:

$$\begin{aligned} \begin{aligned} k_q = CFS(x_q^{cw}) \\ k_r = CFS(x_r^{cw}) \end{aligned} \end{aligned}$$
(17)
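A possible PyTorch sketch of the selection step (Eqs. (14)–(17)) is given below. The use of softmax as the score normalisation is our assumption, since the paper only specifies that the scores are normalised; the function name is ours.

```python
# Minimal sketch of critical feature selection (CFS): average-pooled
# representative per window, cosine similarity to it, and a top-k pick.
import torch
import torch.nn.functional as F


def cfs(x_cw: torch.Tensor, k: int):
    # x_cw: (B * nw, w, w, C) window-attention output
    Bn, w, _, C = x_cw.shape
    tokens = x_cw.reshape(Bn, w * w, C)
    rep = tokens.mean(dim=1, keepdim=True)              # AvgPool: (B*nw, 1, C)
    sim = F.cosine_similarity(tokens, rep, dim=-1)      # CosSim: (B*nw, w*w)
    sim = F.softmax(sim, dim=-1)                        # Norm (softmax assumed)
    _, index = sim.topk(k, dim=-1)                      # top-k indices per window
    critical = torch.gather(tokens, 1, index.unsqueeze(-1).expand(-1, -1, C))
    return critical, index                              # (B*nw, k, C), (B*nw, k)
```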

Critical features interaction (CFI)

The critical features inside each window (\(k_q\) and \(k_r\)) are combined to form all the critical features of each image; the shapes of \(k_q\) and \(k_r\) are transformed into \(R^{B \times (nw \times k) \times C}\). All of these critical features then interact across the important regions through linear attention:

$$\begin{aligned} \begin{aligned} k_{q2} = CFI(k_q) = LinearAttention(Q_{k_q},K_{k_q},V_{k_q}) \\ k_{r2} = CFI(k_r) = LinearAttention(Q_{k_r},K_{k_r},V_{k_r}) \end{aligned} \end{aligned}$$
(18)

\(Q_{k_q}\), \(K_{k_q}\), and \(V_{k_q}\) are obtained from \(k_q\) by linear projection. Relatively, \(Q_{k_r}\), \(K_{k_r}\), and \(V_{k_r}\) are generated according to \(k_r\). \(k_{q2} \in R^{B \times ( nw \times k) \times C}\) and \(k_{r2} \in R^{B \times ( nw \times k) \times C}\) are the outputs of CF-Attention.
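The interaction step (Eq. (18)) can be sketched as below. It reuses the linear_attention sketch from the preliminaries; the projection layers and the class name are our assumptions.

```python
# Minimal sketch of critical feature interaction (CFI), Eq. (18).
import torch
import torch.nn as nn


class CFI(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, k_feat: torch.Tensor, batch: int) -> torch.Tensor:
        # k_feat: (B * nw, k, C) critical features per window -> (B, nw * k, C)
        C = k_feat.shape[-1]
        tokens = k_feat.reshape(batch, -1, C)
        q, key, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)
        return linear_attention(q, key, v)              # (B, nw * k, C)
```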

Critical features residual (CFR)

The outputs \(k_{q2}\) and \(k_{r2}\) of CFI are residualised with the inputs (\(x_q^{cw}\) and \(x_r^{cw}\)): based on the index positions, \(k_{q2}\) and \(k_{r2}\) are fused back into the input features. This process is denoted as CFR:

$$\begin{aligned} \begin{aligned} x_q^{c2} = CFR(x_q^{cf},x_q^{cw},index_q) \\ x_r^{c2} = CFR(x_r^{cf},x_r^{cw},index_r) \end{aligned} \end{aligned}$$
(19)

where the shapes of \(x_q^{cf}\) and \(x_r^{cf}\) have been transformed into \(R^{(B \times nw) \times k \times C}\).

Finally, all of the above processes can be summarised as:

$$\begin{aligned} \begin{aligned} x_q^{c2} = CF\text{- }Attention(x_q^{cw}) = CFR(CFI(CFS(x_q^{cw}))) \\ x_r^{c2} = CF\text{- }Attention(x_r^{cw}) = CFR(CFI(CFS(x_r^{cw}))) \end{aligned} \end{aligned}$$
(20)
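The residual write-back (Eqs. (19)–(20)) amounts to scattering the interacted critical features back to their original positions inside each window and adding them to the window-attention output; the sketch below follows that reading, and the function name is ours.

```python
# Minimal sketch of the critical features residual (CFR), Eqs. (19)-(20).
import torch


def cfr(k2: torch.Tensor, x_cw: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    # k2: (B * nw, k, C) interacted critical features, reshaped back per window
    # x_cw: (B * nw, w, w, C) window-attention output; index: (B * nw, k)
    Bn, w, _, C = x_cw.shape
    tokens = x_cw.reshape(Bn, w * w, C).clone()
    idx = index.unsqueeze(-1).expand(-1, -1, C)
    tokens.scatter_add_(1, idx, k2)                     # residual fusion at index positions
    return tokens.reshape(Bn, w, w, C)
```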

Coarse level matching module

We use the same policy gradient-based reward and punishment strategy as FeMIP24 for training supervision of the coarse matching module. A series of correspondences are established based on the augmented features \(x_q^e\) and \(x_r^e\) after attention interaction. A bidirectional softmax operation is then used to obtain the probability of nearest neighbour matching in both directions, which can be defined as:

$$\begin{aligned} P(i \leftrightarrow j) = softmax(S(i, \cdot ))_{j} \cdot softmax(S(\cdot , j))_{i} \end{aligned}$$
(21)

where S is a confidence matrix based on the correspondence and P represents the matching probability.
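Eq. (21) is a dual-softmax: the confidence matrix is normalised along both directions and the two probabilities are multiplied element-wise. A one-line sketch (the function name is ours):

```python
# Minimal sketch of the bidirectional softmax in Eq. (21).
import torch


def match_probability(S: torch.Tensor) -> torch.Tensor:
    # S: (B, Nq, Nr); P[b, i, j] = softmax over row i at j  *  softmax over column j at i
    return torch.softmax(S, dim=2) * torch.softmax(S, dim=1)
```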

Further, if the sample is a positive sample in the ground-truth matrix and the matching probability P is greater than a given threshold u, we consider the match correct and give it a positive reward \(\alpha\). In contrast, if the sample is a negative sample and the matching probability P is less than u, we give it no reward. In all other cases we consider the match incorrect and give a negative reward \(\beta\). The value of \(\beta\) is set according to the number of training epochs (n) so that the network trains smoothly:

$$\begin{aligned} \beta = {\left\{ \begin{array}{ll} 0, & \text {if } \text {epoch}< n \\ -0.01 \cdot (\text {epoch} - n), & \text {if } n \le \text {epoch} < n + 25 \\ -0.25, & \text {if } \text {epoch} \ge n + 25 \end{array}\right. } \end{aligned}$$
(22)

The policy gradient formula is as follows:

$$\begin{aligned} \nabla _{\theta }E_{M_{q \leftrightarrow r}}R_{(M_{q \leftrightarrow r})}&= E_{x_q^e,x_r^e} \sum _{i,j}[P(i \leftrightarrow j | x_q^e,x_r^e) \cdot r(i \leftrightarrow j) \cdot \nabla _{\theta }\phi _{ij}] \end{aligned}$$
(23)
$$\begin{aligned} \phi _{ij}&= \log P(i \leftrightarrow j | x_q^e,x_r^e) + \log P(x_q^e,i | x_q^c,\theta _{x}) + \log P(x_r^e,i | x_r^c,\theta _{x}) \end{aligned}$$
(24)

where \(M_{q \leftrightarrow r}\) represents the correspondence and \(P(i \leftrightarrow j | x_q^e,x_r^e)\) is the distribution of matches between \(x_q^e\) and \(x_r^e\). \(\nabla _{\theta }\phi _{ij}\) is the gradient of the log-probability of the action sequences, and \(\phi _{ij}\) represents the three action sequences: generating \(x_q^e\) from \(x_q\), generating \(x_r^e\) from \(x_r\), and establishing a match between \(x_q^e\) and \(x_r^e\).

Finally, the loss function for coarse level matching can be expressed as:

$$\begin{aligned} L_1 = \nabla _{\theta }E_{M_{q \leftrightarrow r}}R_{(M_{q \leftrightarrow r})} , p(i \leftrightarrow j) > t \end{aligned}$$
(25)

where t is the given confidence threshold.

Fine level regression re-fine

The coarse matching results are based on the 1/8 coarse feature maps, which limits their accuracy. Therefore, we apply a regression step on the higher-resolution feature maps so that our method can perform accurate pixel-level matching. We employ the same implementation approach as FeMIP24.

Specifically, for each pair of coarse-level matching results, we map their coordinates onto the fine-level feature maps \({x_q^f, x_r^f}\) and extract two sets of local feature windows. These windows are then used to blend the coarse-level and fine-level feature maps, resulting in fused feature maps denoted as \(x_q^{fc},x_r^{fc}\).

Subsequently, the fused feature maps \(x_q^{fc},x_r^{fc}\) are passed through an interaction module, which comprises self-attention and cross-attention mechanisms. This interaction module refines the feature maps and produces the refined feature maps \(x_{qA}^{fc},x_{rA}^{fc}\).

Ultimately, the refined fine-level matching results are obtained through regression using a fully connected layer. We determine the outcome of fine regression, denoted as \((\nabla x, \nabla y)\), by calculating the difference between the predicted coordinates \((i_p, j_p)\) and the actual coordinates \((i_a, j_a)\).

The loss of the fine regression re-fine module can be defined as:

$$\begin{aligned} L_2 = \frac{1}{m} \sum _{i=1}^{m} \left\| (i_p + \nabla x, j_p + \nabla y) - (i_a, j_a) \right\| ^2 \end{aligned}$$
(26)

where m is the number of feature points, and \(\nabla x\) and \(\nabla y\) are the horizontal and vertical coordinate offsets obtained from the fine regression, respectively.

Our total loss is defined as:

$$\begin{aligned} L_{total} = \delta _1 L_1 + \delta _2 L_2 \end{aligned}$$
(27)

where \(\delta _1\) and \(\delta _2\) denote the weights of the two losses \(L_1\) and \(L_2\), respectively.
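A direct transcription of Eqs. (26) and (27) is sketched below. The coarse loss \(L_1\) is assumed to be computed by the policy-gradient procedure above, and the argument names are ours; both weights default to 1, matching the setting reported in the experiments.

```python
# Minimal sketch of the fine-level regression loss (Eq. 26) and total loss (Eq. 27).
import torch


def fine_loss(pred: torch.Tensor, offset: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, offset, gt: (m, 2) predicted coordinates, regressed offsets, ground truth
    return ((pred + offset - gt) ** 2).sum(dim=-1).mean()


def total_loss(l1: torch.Tensor, l2: torch.Tensor,
               delta1: float = 1.0, delta2: float = 1.0) -> torch.Tensor:
    # Weighted sum of the coarse and fine losses
    return delta1 * l1 + delta2 * l2
```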

Experiment

Implementation details

We use AdamW with CosineAnnealingLR to train the model. The learning rate is \(1 \times 10^{-3}\) and the batch size is 4. The value of t is taken to be 0.2. \(\delta _1\) and \(\delta _2\) are both set to 1. The training is performed on an Nvidia RTX 3090 GPU.

Dataset

We selected five multimodal datasets to validate our model. These multimodal image types include RGB, depth, optical, near-infrared, synthetic aperture radar (SAR), and short-wave infrared (SWIR) images.

  1. SEN12MS: The SEN12MS dataset38 provides multiple modalities for each scene, so we perform validation between many different modality combinations on this dataset. The SEN12MS dataset contains RGB, SAR, NIR, and SWIR images. We selected 16,000 pairs of images for training and 600 pairs for testing.

  2. Optical-SAR: The Optical-SAR dataset39 is an important benchmark for multimodal image feature matching. It includes a variety of scenes: plains, harbors, airports, cities, islands, and rivers. The image size is 512\(\times\)512. We select 16,000 pairs of images for training and 500 pairs for testing.

  3. WHU-OPT-SAR: The WHU-OPT-SAR dataset40, collected in Hubei Province, China, contains the following scene categories: farmland, deep forest, road, village, water and city. Since the original images are too large, we crop them to generate 4400 pairs of images, from which we randomly selected 3800 pairs for training, 300 pairs for validation, and the remaining 300 pairs for testing.

  4. RGB-NIR Scene: The RGB-NIR Scene dataset41 contains 477 image pairs consisting of RGB and NIR images across 9 scene categories: countryside, street, field, forest, mountain, water, city, indoor and old building. We choose 400 pairs of images for training and 48 pairs for testing.

  5. NYU-Depth V2: The NYU-Depth V2 dataset42 is captured from video sequences of indoor scenes; 1449 pairs of RGB and depth images are densely labeled. We select 1049 pairs for training and 400 pairs for testing.

Baseline and metrics

To demonstrate the effectiveness of our proposed method for feature matching on multimodal image datasets, we compared it with several other excellent methods: FeMIP24, LoFTR23, LNIFT43, HardNet44, TFeat45, MatchosNet46, and MatchNet47. We train and test the different methods in the same environment with the same training and testing sets, and evaluate them with the same metrics, described as follows:

Homography estimation

The homography is estimated with OpenCV, using RANSAC for robust estimation. A match is counted as correct based on the reprojection error between the estimated homography and the ground truth. We compute the average reprojection error of the four image corners and report the area under the cumulative error curve (AUC)48 at different corner error thresholds (@3 px, @5 px and @10 px).
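This protocol can be sketched as follows with OpenCV. The ground-truth homography H_gt and the matched keypoint arrays are assumed inputs, the RANSAC threshold is left at the OpenCV default, and the function name is ours.

```python
# Hedged sketch of the homography evaluation: RANSAC estimation from matches,
# then mean reprojection error of the four image corners vs. the ground truth.
import cv2
import numpy as np


def corner_error(kpts_q: np.ndarray, kpts_r: np.ndarray,
                 H_gt: np.ndarray, h: int, w: int) -> float:
    # kpts_q, kpts_r: (N, 2) matched keypoints; H_gt: (3, 3) ground-truth homography
    if len(kpts_q) < 4:
        return float("inf")                              # not enough matches to estimate
    H_est, _ = cv2.findHomography(kpts_q, kpts_r, cv2.RANSAC)
    if H_est is None:
        return float("inf")
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
    diff = cv2.perspectiveTransform(corners, H_est) - cv2.perspectiveTransform(corners, H_gt)
    return float(np.linalg.norm(diff, axis=-1).mean())   # average corner reprojection error
```

The AUC at each threshold is then obtained by accumulating these per-pair errors over the test set.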

Mean matching accuracy (MMA)

MMA is a method that assesses the quality of feature matching between pairs of images. It does this by only considering feature matches that are mutual nearest neighbors. In MMA evaluation, the x-axis represents the matching threshold in pixels, while the y-axis shows the average accuracy of matching for image pairs. A correct match is one where the reprojected error of the homography estimate is below the specified matching threshold. MMA calculates the average percentage of correct matches for various pixel error thresholds and provides an overall score based on the average performance across all image pairs.
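Under a known ground-truth homography, MMA for a single image pair can be computed as below; the threshold list is illustrative and the function name is ours.

```python
# Hedged sketch of MMA for one image pair: fraction of mutual-nearest-neighbour
# matches whose reprojection error falls below each pixel threshold.
import cv2
import numpy as np


def mma(kpts_q: np.ndarray, kpts_r: np.ndarray, H_gt: np.ndarray,
        thresholds=(1, 3, 5, 10)) -> dict:
    # kpts_q, kpts_r: (N, 2) mutual-nearest-neighbour matches; H_gt: (3, 3) ground truth
    proj = cv2.perspectiveTransform(
        kpts_q.reshape(-1, 1, 2).astype(np.float32), H_gt).reshape(-1, 2)
    err = np.linalg.norm(proj - kpts_r, axis=-1)         # reprojection error per match
    return {t: float((err < t).mean()) for t in thresholds}
```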

Table 1 Results of homography estimation on the SEN12MS dataset.
Table 2 Homography estimation on multimodal datasets.

Homography estimation experiments

In the homography estimation experiments, we report the AUC of the corner error at 3, 5, and 10 pixel thresholds.

Based on the SEN12MS dataset, we select four modality image pairs (NIR-RGB, NIR-SWIR, SAR-NIR, SAR-SWIR) for the homography estimation experiments. As shown in Table 1, FmCFA achieves the best results on all four modality combinations.

In addition, we conducted homography estimation tests on the Optical-SAR, RGB-NIR, WHU-OPT-SAR, and NYU-Depth V2 datasets. As shown in Table 2, under these datasets, FmCFA has significant improvements compared to other methods at 3, 5, and 10 pixel thresholds.

Mean matching accuracy experiment

Based on the SEN12MS dataset, we select four modality image pairs (NIR-RGB, NIR-SWIR, SAR-NIR, SAR-SWIR) to perform the mean matching accuracy experiments. The experimental results are shown in Fig. 7, where the horizontal axis is the pixel threshold and the vertical axis is the ratio of correct matches. As the figure shows, FmCFA, FeMIP and LoFTR are the most competitive methods, and in most cases FmCFA achieves better results.

Next, we conducted mean matching accuracy experiments on several other multimodal datasets: RGB-NIR Scene, NYU-Depth V2, Optical-SAR, and WHU-OPT-SAR. The results are shown in Fig. 8. FmCFA, FeMIP, and LoFTR are again the most competitive methods, and FmCFA achieves better matching accuracy in most cases at both low and high thresholds.

However, as shown in Figs. 7 and 8, FmCFA still exhibits some limitations when handling feature matching between near-infrared (NIR) images and optical RGB images. Specifically, when the pixel threshold is low, FmCFA does not achieve optimal results in NIR-RGB image feature matching. This is because the texture of infrared images is less pronounced, making it challenging for FmCFA to extract sufficient key features. In the future, we will focus on improving the extraction of key features from infrared images and aim to further enhance the performance of FmCFA.

Fig. 7

Experimental results on mean matching accuracy based on the SEN12MS dataset.

Fig. 8

Mean matching accuracy experiments on the RGB-NIR Scene (a), NYU-Depth V2 (b), Optical-SAR (c), and WHU-OPT-SAR (d) datasets.

Multimodal image matching visualization

We compare FmCFA with an excellent method, FeMIP24. We demonstrate the matching performance on four modality image pairs (NIR-RGB, NIR-SWIR, SAR-NIR, and SAR-SWIR) of SEN12MS, as shown in Fig. 9. Compared with FeMIP, FmCFA produces more and more accurate matching connections, demonstrating its better matching ability.

Fig. 9

Visualize feature matching under the SEN12MS dataset. (a) NIR-RGB, (b) NIR-SWIR, (c) SAR-NIR, (d) SAR-SWIR.

Fig. 10

Visualize feature matching under different datasets. (a) Optical-SAR, (b) WHU-OPT-SAR, (c) NYU-Depth V2, (d) RGB-NIR Scene.

In addition, we also show the visual matching performance on four other datasets: NYU-Depth V2, Optical-SAR, RGB-NIR Scene, and WHU-OPT-SAR. As shown in Fig. 10, FmCFA again achieves better matching performance.

Table 3 Ablation experiments of FmCFA.

Ablation experiment

We compare several ways of selecting critical features in ablation studies. The compared criteria include the feature vector modulus, the feature vector variance, average pooling and maximum pooling. The comparison is carried out with a homography estimation experiment on the Optical-SAR dataset. As shown in Table 3, all of these approaches achieve good results, but average pooling performs best.

Conclusion

This paper proposes a new feature matching method, FmCFA, for multimodal image datasets. We propose a new attention interaction mechanism (CF-Attention) that concentrates interactions on critical features, and apply it in the CFa-Block. We have conducted a large number of experiments on datasets of various modalities, in which FmCFA achieved excellent matching results. This further demonstrates the effectiveness and robustness of FmCFA.