Introduction

With the rapid expansion of the internet, the volume of multimodal data has grown substantially in recent years. This surge has led to the development of various applications, including intelligent search engines and multimedia data management systems, designed to process and analyze vast amounts of multimodal information 1,2. Among these advancements, cross-modal retrieval, which enables searches across different data types such as images, text, videos, and audio, has emerged as a key area of research and innovation 3,4,5,6. Traditional content-based image retrieval (CBIR) methods establish semantic similarity connections but are restricted to single-modality scenarios. In contrast, cross-modal retrieval involves retrieving semantically related items in one modality (e.g., text) using a query from a different modality (e.g., image). This paper focuses on multi-labeled cross-modal retrieval, which plays a crucial role in various applications such as multimedia retrieval and e-commerce 7,8,9.

Bai et al. 10 used graph neural networks for cross-modal retrieval by directly optimizing the binary codes to minimize the quantization loss and retrieval error. Cross-modal retrieval, the task of aligning and retrieving semantically similar data from different modalities such as images and text, has garnered increasing attention in the era of multimodal data. Cross-modal attention and shared-space generation mechanisms are the common solutions for cross-modal retrieval 11,12,13. These methods transform low-dimensional feature vectors into high-dimensional feature vectors so that semantically related data points have similar feature code bits. Much of the ongoing research focuses on reducing the semantic gap between diverse modalities with distinct data distributions 14,15. Many cross-modal attention mechanisms have been proposed. Lin et al. 16 proposed probability-based semantics-preserving hashing (SePH), which generates a single unique hash code while considering the semantic consistency between different modality views. In Ref.17, the authors used label consistent matrix factorization hashing (LCMFH). In Ref.18, a Modal-Adversarial Hybrid Transfer Network (MHTN) was introduced, and in Ref.19, semantic correlation maximization was employed for cross-modal retrieval. All of these methods extract features independently of the training process, and hence may not achieve satisfactory performance in many practical applications. More recently, deep convolutional neural networks (DCNNs) have been used to extract fine-grained features from data, which significantly improved the retrieval capability of various frameworks 20,21,22,23,24. While these advancements have predominantly focused on single-modality retrieval, modern applications demand systems capable of bridging the semantic gap between different modalities, such as text and images, for effective cross-modal retrieval. To address this issue, researchers have integrated text and visual features into a shared embedding space, laying the foundation for cross-modal retrieval systems. Guo et al. 25 proposed Prompts-in-The-Loop (PiTL), a weakly supervised method to pre-train VL-models for cross-modal retrieval tasks. Chen et al. 26 introduced UNITER, a Universal Image-Text Representation learned through large-scale pre-training over four image-text datasets for efficient cross-modal retrieval. In Ref.27, the researchers proposed a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model. Previous cross-modal retrieval methods faced limitations in feature extraction and alignment. CNN-based models lacked global context, while LSTM and Word2Vec struggled with contextual understanding. Traditional fusion techniques, such as concatenation and MLPs, failed to model inter-modal relationships effectively. These approaches lacked attention mechanisms, leading to poor retrieval accuracy. While earlier approaches such as the Graph Attention Network (GAT) 28 have been employed for feature fusion in cross-modal retrieval, they primarily operate on static and homogeneous graphs. GAT utilizes self-attention to assign weights to neighboring nodes, enabling localized feature aggregation. Jin et al. 29 proposed an end-to-end Graph Attention Network Hashing (GAT-H) framework for cross-modal retrieval, which integrates modality-specific encoders and a graph attention mechanism to learn compact hash codes across image and text modalities. However, such models fall short in capturing evolving cross-modal relationships.

To address the above-mentioned problems, and inspired by vision-language models (VLMs) 30, we propose a novel approach that uses a GCNN in the attention mechanism over features extracted with ViT 31 for vision and BERT 32 for language, to enhance cross-modal feature alignment. A shared-space mechanism is designed with these fine-grained features to address the lack of fine-grained semantic alignment between local image regions and text tokens. The shared-space model, called the dynamic Cross-Modal Feature Graph (CMFG), models mutual contextual relationships by constructing a heterogeneous graph whose nodes represent image and text features. Unlike static approaches, our model dynamically updates cross-modal relationships using K-nearest neighbor (KNN) based neighborhood selection across the two modalities during training. This allows the graph structure to adapt to evolving feature representations, enabling context-aware alignment and reducing the semantic gap. This methodology may overcome the limits of conventional fusion methods by allowing the network to explicitly reason about the semantic relationships between visual and textual elements.

The model specific feature extractors used are:

  • ViT: Vision Transformer, used for image feature extraction. These transformers succeed in capturing effective features from images through the self-attention mechanism.

  • BERT: Bidirectional Encoder Representations from Transformers, used for text feature extraction. It provides rich representations of the semantic information in textual descriptions.

Building on these feature extraction mechanisms, we propose a shared embedding space where features from the two different modalities are mapped into an amalgamated representation.

  • GCNNs with attention mechanisms model the relationships between visual and textual features, emphasizing semantically relevant regions and words.

  • The attention mechanism prioritizes the most important cross-modal interactions, allowing the model to focus on meaningful correspondences while ignoring irrelevant noise.

  • This shared embedding space ensures that the model can effectively bridge the semantic gap between text and images, making it possible to retrieve semantically aligned images based on textual queries with high precision.

Our proposed framework offers the following contributions:

  • Using ViT and BERT for robust and scalable feature extraction to ensure high-quality visual and textual embeddings.

  • A new mechanism called dynamic Cross-Modal Feature Graph (Dynamic CMFG) is introduced, in which the graph structure is adaptively built during training using K-nearest neighbor (KNN) search over the shared latent feature space. This enables the model to dynamically capture mutual contextual relationships across modalities rather than relying on fixed or manually defined relations.

  • The dynamic CMFG incorporates attention-based edge weighting that enhances the semantic alignment between image and text features, allowing context-aware information propagation across the two modalities.

The remainder of the paper is organized as follows. "Related works" reviews the related literature on ViT and BERT. "Framework overview" provides a detailed overview of the proposed framework, including its architecture, training process, and evaluation methodology. "Experiments" compares the performance of the proposed method against state-of-the-art approaches to demonstrate its effectiveness in cross-modal retrieval tasks, and "Conclusion" concludes the manuscript.

Related works

Vision transformer (ViT)

The Vision Transformer (ViT), introduced by Dosovitskiy et al. 33, is a recent architecture built on the foremost building blocks of transformer architectures. Unlike CNNs, which depend on convolutional operations to extract local features, ViT operates on sequences of image patches, treating an image as a series of non-overlapping patches and processing them as tokens. ViT first divides each image into N fixed non-overlapping patches, flattens each patch, and projects it into a higher-dimensional space using a learnable linear projection. The process involved in ViT is described by the following equations:

Patch embedding

The given image \(X \in {\mathbb{R}}^{W \times H \times C}\) (H-height, W-width and C-number of channels)

Number of patches \(N=\frac{W \times H}{{P}^{2}}\) (P-patch size)

\({x}_{i} \in {\mathbb{R}}^{{P}^{2}\cdot C}\) (\({x}_{i}-\) flattened vector of the ith patch)

Project the patches into a D-dimensional embedding space using a learnable matrix \(E \in {\mathbb{R}}^{({P}^{2}\cdot C) \times D}\)

\({z}_{i}=E\cdot {x}_{i}\) (\({z}_{i}\in {\mathbb{R}}^{D}\) embedding of the ith patch vector)

Adding positional embeddings

Positional embeddings are used to maintain the positional relationship among the patches. These embeddings restore the spatial context by encoding the location of each patch within the image.

$${Z}_{E}=\left[{z}_{1},{z}_{2},\dots ,{z}_{N}\right]+P$$

where \({Z}_{E}\in {\mathbb{R}}^{N \times D}\) is the input to the transformer.
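For concreteness, the patch-embedding and positional-embedding steps above can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation with hypothetical parameter names (img_size, patch_size, embed_dim), not the exact ViT code used in the experiments.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects them to D dimensions."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2       # N = (W*H) / P^2
        self.patch_size = patch_size
        # Learnable projection E mapping a flattened patch (P^2 * C) to D dimensions
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)
        # Positional embeddings P restore the spatial context of each patch
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # Cut the image into P x P patches and flatten each one
        patches = x.unfold(2, P, P).unfold(3, P, P)            # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        z = self.proj(patches)                                 # z_i = E . x_i, shape (B, N, D)
        return z + self.pos_embed                              # Z_E = [z_1, ..., z_N] + P
```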

Multi-headed self-attention mechanism (MSA)

This is a core part of the transformer architecture. It enables the model to focus on the most relevant parts of the input sequence when making predictions. In ViT, MSA applies a self-attention mechanism to capture the relationships among different regions of an image. For the patch embeddings \(X \in {\mathbb{R}}^{N \times D}\), where N is the number of patches and D is the embedding dimension, the self-attention mechanism works as follows:

Each token embedding is linearly projected into three different vectors: Query (Q), Key (K), and Value (V).

$$Q=X{W}_{Q} ,K=X{W}_{K}, V=X{W}_{V}$$

where \({W}_{Q},{W}_{K},{W}_{V} \in {\mathbb{R}}^{D \times {d}_{k}}\) are learnable weight matrices and \({d}_{k}\) is the dimension of the query/key space.

Attention scores

Attention scores between all pairs of tokens are computed as the dot product of the query and key vectors, scaled by \(\sqrt{{d}_{k}}\) to prevent large values that can lead to unstable gradients.

$$A=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right),$$

where \(A\in {\mathbb{R}}^{N \times N}\) is the attention matrix, representing the attention weights between each pair of tokens.

The attention weights are used to compute a weighted sum of the value vectors, as

$$\text{Self-attention output}=AV$$
$$\text{Multi-head output}=\mathrm{Concat}({head}_{1},{head}_{2},\dots ,{head}_{h}){W}_{o}$$

where,

  • \({head}_{i}=\text{Self-Attention}({Q}_{i},{K}_{i},{V}_{i})\).

  • \({W}_{o}\in {\mathbb{R}}^{(h\cdot {d}_{k}) \times D}\) is a learnable weight matrix.
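The multi-head self-attention computation above can be illustrated with the following simplified sketch (PyTorch, no dropout or masking); parameter names such as embed_dim and num_heads are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim=768, num_heads=12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.d_k = embed_dim // num_heads
        # W_Q, W_K, W_V (shared across heads via a single projection) and output W_o
        self.w_q = nn.Linear(embed_dim, embed_dim)
        self.w_k = nn.Linear(embed_dim, embed_dim)
        self.w_v = nn.Linear(embed_dim, embed_dim)
        self.w_o = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                   # x: (B, N, D)
        B, N, D = x.shape
        # Project and split into heads: (B, h, N, d_k)
        q = self.w_q(x).view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        # A = softmax(Q K^T / sqrt(d_k)), shape (B, h, N, N)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = attn @ v                                      # weighted sum of values
        out = out.transpose(1, 2).reshape(B, N, D)          # concatenate the heads
        return self.w_o(out)                                # multiply by W_o
```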

BERT

BERT is a transformer-based model architecture for NLP tasks that extracts bidirectional context 34. For each text input T, the BERT tokenizer converts the text into tokens and applies WordPiece tokenization to further break them down into sub-words. Each sub-word is then mapped to its corresponding ID in BERT’s vocabulary to produce raw word features. These tokens are represented as embeddings that combine token embeddings \(\left({E}_{Token}\right)\), segment embeddings \(\left({E}_{Segment}\right)\), and positional embeddings \(\left({E}_{Position}\right)\).

$${E}_{input}={E}_{Token}+{E}_{Segment}+{E}_{Position}$$

The embedded input is processed through multiple transformer encoder layers, where each layer uses multi-head self-attention to compute relationships between tokens. The attention scores are calculated as:

$$Attention\_Score(Q,K,V)=softmax\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$

The final output from BERT provides contextualized representations for each token, with the [CLS] token representing the overall input sequence:

$${H}_{[CLS]}=BERT({E}_{input})$$

These representations are highly effective for tasks like text feature extraction in cross-modal retrieval, where textual embeddings from BERT are aligned with visual embeddings from models like ViT.
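As an illustrative sketch, the contextualized [CLS] representation can be obtained from a pre-trained BERT encoder as below; the Hugging Face transformers library and the bert-base-uncased checkpoint are assumptions made for the example.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; any BERT variant with a [CLS] token would behave similarly.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def extract_text_features(sentences):
    """Return the contextualized [CLS] embedding H_[CLS] for each sentence."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # last_hidden_state: (B, L, 768); position 0 corresponds to the [CLS] token
    return outputs.last_hidden_state[:, 0, :]

cls_embeddings = extract_text_features(["a dog playing on the beach at sunset"])
print(cls_embeddings.shape)   # torch.Size([1, 768])
```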

Graph analysis

GCNNs are effective at representing data as graphs and have a wide range of applications 35,36,37. Xi et al. 38 worked on a semi-supervised model for hyperspectral image classification using a cross-scale graph prototyping network. The authors designed a new self-branch attention mechanism to put more focus on critical features produced by multiple branches. Research on GCNNs has attracted many researchers due to their ability to effectively model non-Euclidean data structures, such as social networks, molecular graphs, and multimodal relationships, by capturing intricate dependencies and relational patterns among nodes through graph-based learning. Recent works include a low back-alignment spatial GCN for image classification 39, a spatial-temporal GCN for emotion detection and classification 40, and a quantum-based subgraph CNN that captures the global topological structure and local connectivity within a graph 41. Saho et al. 44 used canonical correlation analysis (CCA) to design the latent space. CCA pursues linear projections of two modalities such that the correlation between their projected components is maximized. It learns transformation matrices for each modality to map their features into a common latent space where semantically aligned pairs are close in terms of correlation. However, it is limited by its linearity and hence unable to capture complex non-linear relations between image and text features. In Ref.45, the authors used VSE++, which improves earlier visual-semantic embedding models by incorporating hard negative mining during training. It uses a triplet ranking loss that enforces matched image-text pairs to be closer in the embedding space than mismatched ones. Hard negatives, those that are closest to the query but incorrect, are prioritized to improve discriminative power. Although VSE++ improves retrieval via hard negative mining, it relies on a static embedding space and lacks the capacity to model deeper cross-modal contextual interactions dynamically.

Recent advancements have leveraged GCNNs for cross-modal retrieval by modeling interactions between modalities in graph structures. In such frameworks, nodes represent features extracted from images and text, while edges capture semantic relationships. GCNNs propagate information across these nodes, enabling better alignment and representation in a shared embedding space. For example, approaches integrating GCNNs with attention mechanisms improve the ability to capture fine-grained interdependencies, which is critical for aligning textual and visual modalities.

The ability of GCNNs to capture structured relationships between features makes them an ideal choice for tasks requiring fine-grained alignment between text and visual data. By integrating GCNNs into the attention mechanism, our framework leverages their strengths to improve semantic alignment and retrieval performance in cross-modal tasks.

Framework overview

The shared embedding space plays a vital role in cross-modal image retrieval by supporting an integrated representation of features from different modalities. In this space, both image and text features are mapped into a common vector space where their semantic relationships can be directly compared. This allows an image to be retrieved using a textual query, an image query, or both, providing full interaction across modalities. The overall framework is shown in Fig. 1.

Fig. 1
figure 1

Overview of the proposed CMFG. ViT for image features and BERT for text features, with the proposed shared-space embedding and attention mechanism over image and text features for cross-modal retrieval.

To achieve this, we introduce a new attention-based approach called the cross-modal feature graph (CMFG), shown in Fig. 2, in two variants: a static CMFG (SCMFG) and a dynamic CMFG (DCMFG). The CMFGs are implemented as follows:

Fig. 2
figure 2

A sample graph constructed with only ten attributes of ViT and BERT features.

Static CMFG

Generally, a graph is represented as a set of nodes and edges, \(G=\{N,E\}\). The edge set E is defined as \(E=\{({n}_{i},{n}_{j})|{n}_{i},{n}_{j}\in N\}\), connecting pairs of nodes in the node set N. An adjacency matrix \(A\in {\{\text{0,1}\}}^{N \times N}\) can be used to define the connections in the graph, where \(A\left({n}_{i},{n}_{j}\right)=1\) indicates the presence of an edge between the nodes \({n}_{i}\) and \({n}_{j}.\)

CMFG: Let \(G=\{I,T,E\}\) be a bipartite graph, where I and T are the sets of image and text feature nodes and E is the set of edges that represent relationships between nodes of I and T. Let \(I=\{{i}_{1},{i}_{2},\dots ,{i}_{{K}_{I}}\}\) be the set of \({K}_{I}\) image feature vectors extracted using a Vision Transformer (ViT), where each \({i}_{j}\in {\mathbb{R}}^{d}\), and let \(T=\{{t}_{1},{t}_{2},\dots ,{t}_{{K}_{T}}\}\) be the set of \({K}_{T}\) text feature vectors extracted using BERT, where each \({t}_{j}{\in {\mathbb{R}}}^{d}\).

The edge weights E represent the relationships between nodes, \(E=\{{e}_{it}|i\in I, t\in T\}\), where \(i\) and \(t\) are nodes from the two different modalities and each edge weight \({e}_{it}\) is computed using an attention mechanism. The attention score is given by

$${e}_{it}=softmax\left(\frac{{Q}_{i}.{K}_{t}^{T}}{\sqrt{{d}_{k}}}\right)$$

where \({Q}_{i}={W}_{Q}{h}_{i}\) is the query vector for node i, \({K}_{t}={W}_{K}{h}_{t}\) is the key vector for node t, \({W}_{Q},{W}_{K} \in {\mathbb{R}}^{d \times {d}_{k}}\) are learnable weight matrices, \({h}_{i}\) and \({h}_{t}\) are the feature vectors for nodes i and t, respectively, and \({d}_{k}\) is the dimension of the query and key vectors.

After constructing the CMFG, as shown in Fig. 2, the features of each node are updated by aggregating information from its neighbors. Using Graph Convolutional Networks (GCNNs), the feature update for node i is:

$${h}_{i}^{\left(k+1\right)}=\rho \left(\sum_{t\in \mathcal{M}(i)}{e}_{it} {W}^{k}{h}_{t}^{k}\right)$$

Here, \(\mathcal{M}(i)\) is the set of neighbors of node i, \({W}^{k}\) is the learnable weight matrix at layer k, and \(\rho\) is a non-linear activation function (ReLU here).

After K GCNN layers, the final image and text node features are represented in a common shared embedding space as

$${H}_{I}^{K}=\left\{{h}_{i}^{K} \,|\, i\in I\right\}, \quad {H}_{T}^{K}=\left\{{h}_{t}^{K} \,|\, t\in T\right\}$$

The CMFG approach propagates information between nodes through weighted edges. By treating similarity scores as edge weights, the CMFG captures nuanced interactions between image and text features, enabling the model to align them effectively in the shared embedding space. The CMFG’s ability to model cross-modal relationships in a structured and interpretable way establishes it as a powerful tool for efficient and accurate cross-modal retrieval.
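A condensed sketch of one static CMFG layer is given below, assuming image and text features have already been projected to a common dimension d; the edge weights e_it are the scaled dot-product attention scores and the node update follows the GCNN aggregation rule above. The symmetric text-node update via the transposed attention matrix is a simplification for brevity, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaticCMFGLayer(nn.Module):
    """One propagation layer of the static cross-modal feature graph (sketch)."""
    def __init__(self, dim=512, d_k=64):
        super().__init__()
        self.w_q = nn.Linear(dim, d_k, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, d_k, bias=False)   # W_K
        self.w = nn.Linear(dim, dim)                 # W^k used in the feature update
        self.d_k = d_k

    def forward(self, h_img, h_txt):                 # (K_I, d), (K_T, d)
        # Attention scores e_it between every image node i and text node t
        q = self.w_q(h_img)                          # (K_I, d_k)
        k = self.w_k(h_txt)                          # (K_T, d_k)
        e = F.softmax(q @ k.t() / self.d_k ** 0.5, dim=-1)   # (K_I, K_T)
        # Update image nodes by aggregating their text neighbours: rho(sum_t e_it W h_t)
        h_img_new = F.relu(e @ self.w(h_txt))
        # Simplified symmetric update for text nodes using the transposed attention
        h_txt_new = F.relu(e.t() @ self.w(h_img))
        return h_img_new, h_txt_new
```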

Dynamic CMFG

The above CMFG formulation establishes relationships between text and image features through a predefined adjacency structure. However, such fixed structures fail to capture evolving semantic relationships during the training process. To overcome this limitation, we introduce a Dynamic Cross-Modal Feature Graph (DCMFG), which updates node connections and edge weights by dynamically refining neighboring relations. The dynamic CMFG refers to the ability of the graph structure to adaptively capture and update relationships between image and text features during the learning process. The difference between the proposed dynamic CMFG and static models is illustrated in Fig. 3. It depicts six feature nodes from the image and text modalities, highlighting their alignment through attention weights in Fig. 3a, and showcasing dynamic connectivity based on KNN in Fig. 3b,c.

Fig. 3
figure 3

Comparison between standard Graph Attention Network (GAT) and the proposed Cross-Modal Feature Graph (CMFG). (a) In GAT, all nodes are of the same modality (image) with fixed local edges. (b,c) In CMFG, image and text nodes are dynamically connected based on semantic similarity using KNN and attention mechanism from iteration to iteration.

In this framework, each image node \(i\) and text node \(t\) does not rely solely on fixed edges; instead, its connectivity is determined by the neighboring context. The neighboring context is defined using KNN (K-Nearest Neighbor) search to determine the top \(b\) closest neighbors within the same modality:

$$N\left(I,i,b\right)=\left\{w \mid w\in I \text{ and } w \text{ is one of the top-}b\text{ nearest neighbors of } i\right\}$$
$$N\left(T,t,b\right)=\left\{w \mid w\in T \text{ and } w \text{ is one of the top-}b\text{ nearest neighbors of } t\right\}$$

For the datasets used, the top b = 15 nearest neighbors were found to be optimal, with the value selected based on empirical observations and the dimensionality of the feature vectors. Using these two equations, the edges are defined dynamically to ensure that semantically related nodes within the same modality influence each other.
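The intra-modal neighbor selection N(I, i, b) and N(T, t, b) can be sketched as below, assuming cosine similarity over L2-normalized features and the b = 15 setting mentioned above.

```python
import torch
import torch.nn.functional as F

def knn_neighbors(features, b=15):
    """Return, for each node, the indices of its top-b nearest neighbors
    within the same modality (cosine similarity, self excluded)."""
    x = F.normalize(features, dim=-1)            # (K, d)
    sim = x @ x.t()                              # pairwise cosine similarity
    sim.fill_diagonal_(float("-inf"))            # a node is not its own neighbor
    return sim.topk(b, dim=-1).indices           # (K, b)

# Example usage with assumed feature tensors:
# img_neighbors = knn_neighbors(h_img, b=15)     # N(I, i, b) for ViT image features
# txt_neighbors = knn_neighbors(h_txt, b=15)     # N(T, t, b) for BERT text features
```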

Given an initial adjacency matrix \({G}_{s}\), the dynamic CMFG defines the updated connectivity as follows:

$${G}_{s}\left(i,t\right)=\begin{cases}1, & \text{if } \exists\, w\in N\left(I,i,b\right) \text{ such that } {G}_{s}\left(w,t\right)=1\\ 0, & \text{otherwise}\end{cases}$$

where \(N(I,i,b)\) is a neighbor selection function that identifies the top b closest nodes in the same modality for a given node \(i\). Similarly, the text-to-image connectivity is defined as

$${G}_{s}\left(t,i\right)=\begin{cases}1, & \text{if } \exists\, w\in N\left(T,t,b\right) \text{ such that } {G}_{s}\left(i,w\right)=1\\ 0, & \text{otherwise}\end{cases}$$

This confirms that a node from one modality is connected to a node from the other modality only when they share mutual contextual relationships through their nearest neighbors within the same modality. Once the mutual relations are identified, we further refine the graph connections using an attention mechanism, which assigns refined importance scores to the edges.
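The two connectivity rules can be realized as a boolean update of the adjacency matrix; the sketch below assumes an initial image-to-text adjacency G_s of shape (K_I, K_T) and the neighbor index matrices produced by the KNN step, and folds both directions into a single matrix for brevity.

```python
import torch

def update_connectivity(G_s, img_neighbors, txt_neighbors):
    """Dynamic CMFG connectivity: node i links to t if any intra-modal neighbor w of i
    already links to t, and symmetrically for text nodes (kept in one matrix here)."""
    G = G_s.bool()                                        # (K_I, K_T)
    # G_new(i, t) = 1 if there exists w in N(I, i, b) with G(w, t) = 1
    img_side = G[img_neighbors].any(dim=1)                # (K_I, b, K_T) -> (K_I, K_T)
    # G_new(t, i) = 1 if there exists w in N(T, t, b) with G(i, w) = 1
    txt_side = G.t()[txt_neighbors].any(dim=1).t()        # (K_I, K_T)
    return (img_side | txt_side).float()
```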

Each edge weight between an image node i and a text node t is dynamically adjusted as follows:

$${e}_{i,t}^{(t+1)}=\alpha {e}_{i,t}^{(t)}+\left(1-\alpha \right).softmax\left(\frac{{Q}_{i}{K}_{t}^{T}}{\sqrt{{d}_{k}}}\right)$$

Here, \({Q}_{i}={W}_{q}{h}_{i}\) (query) and \({K}_{t}={W}_{k}{h}_{t}\) (key) are feature transformations of the image and text nodes. The most essential connections receive higher attention scores according to the softmax mechanism, and \({d}_{k}\) is a scaling factor that stabilizes the learning process. The term \(\alpha\) is a momentum term controlling how much past information is retained (similar to an exponential moving average).

This step ensures that, if two nodes are connected based on mutual contextual relations (neighboring relations among nodes), the model highlights the stronger connections by learning which relations are meaningful.

Finally, the dynamically weighted edges influence the message-passing process in the graph, where node embeddings are updated based on their most relevant neighbors. The updated node representation aggregates information from its connected nodes, weighted by the learned attention scores. This approach allows the CMFG to continuously adapt, ensuring that retrieval performance improves as the graph refines its structure throughout training. By integrating mutual contextual relations for dynamic edge formation and attention-based weighting for refined importance scoring, the proposed model effectively captures cross-modal relationships, making the retrieval process more robust and accurate.

The final embedding update is performed using the dynamically adjusted edge weights:

$${h}_{i}^{(t+1)}=\rho \left(\sum_{t\in N(i)}{e}_{it}^{(t+1)}{W}^{(t)}{h}_{t}^{(t)}\right)$$

This equation ensures the following (a code sketch of the full dynamic update step follows the list):

  • Nodes aggregate information from their most relevant neighbors.

  • Edge weights dynamically adjust based on learned importance scores.

  • The graph structure continuously evolves, ensuring more accurate cross-modal retrieval.
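Putting these pieces together, one dynamic CMFG iteration can be sketched as follows: attention scores are computed only over connected node pairs, blended with the previous edge weights through the momentum term \(\alpha\), and then used for attention-weighted message passing. The module layout and default values (d_k = 64, \(\alpha\) = 0.9) are illustrative assumptions, not the settings used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCMFGLayer(nn.Module):
    """Sketch of one dynamic CMFG iteration: EMA edge-weight update + message passing."""
    def __init__(self, dim=512, d_k=64, alpha=0.9):
        super().__init__()
        self.w_q = nn.Linear(dim, d_k, bias=False)
        self.w_k = nn.Linear(dim, d_k, bias=False)
        self.w = nn.Linear(dim, dim)
        self.d_k, self.alpha = d_k, alpha

    def forward(self, h_img, h_txt, e_prev, G_s):
        # Attention scores between image and text nodes, restricted to connected pairs
        scores = self.w_q(h_img) @ self.w_k(h_txt).t() / self.d_k ** 0.5
        scores = scores.masked_fill(G_s == 0, -1e9)        # mask non-edges before softmax
        attn = F.softmax(scores, dim=-1)
        # Momentum (EMA-style) update: e^(t+1) = alpha * e^(t) + (1 - alpha) * attn
        e_new = self.alpha * e_prev + (1 - self.alpha) * attn
        # Message passing: h_i^(t+1) = rho( sum_{t in N(i)} e_it^(t+1) W h_t^(t) )
        h_img_new = F.relu(e_new @ self.w(h_txt))
        return h_img_new, e_new
```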

Experiments

In this section, the experimental setup, datasets description, evaluation metrics, and implementation details are discussed. Complete results and analyses of the experiments are also explained.

Datasets

Two popular cross-modal (image-text) databases, NUS-WIDE 42 and MIRFLICKR-25K 43, are used in the experiments. NUS-WIDE contains 269,647 images with text descriptions, and MIRFlickr-25K contains 25,015 image-text pairs. These datasets are widely used to test cross-modal systems and come with preprocessed image and text data. The images are already appropriately sized, and the associated text tags are cleaned and filtered.

NUS-WIDE: This dataset is designed for cross-modal retrieval and multi-label classification and consists of large-scale web images. The images were taken from Flickr, and each image is accompanied by textual tags drawn from 81 semantic categories. Each image has multi-label annotations and can belong to multiple concepts such as “beach”, “sunset”, “sports” or “animal”. The dataset is widely used for image-to-text and text-to-image retrieval, making it valuable for deep learning, multimodal learning, and semantic understanding. Due to its real-world noisy text annotations, preprocessing is often required for optimal performance.

MIRFLICKR-25K: This dataset is well suited for evaluating cross-modal retrieval models, offering a real-world and diverse collection of images and text tags, and serves as a strong benchmark for testing multi-modal deep learning architectures in image-text retrieval tasks. It consists of 25,000 images collected from Flickr, each accompanied by metadata, including textual tags and annotations. The images cover various categories such as landscapes, objects, and people, and each image is labeled with user-generated tags. These tags describe image content and serve as textual representations for cross-modal learning. Some images have additional manually curated annotations for better semantic understanding. The dataset includes 15,000 training images and 10,000 test images.

The evaluation process is conducted using Google Colab Pro, which provides access to high-performance GPUs. The experiments are run on a Colab Pro instance equipped with an NVIDIA Tesla T4 or A100 GPU, Ubuntu-based environment, Intel Xeon processor, high-speed cloud storage, and 100 GB of RAM.

Performance analysis

A comparative analysis against several state-of-the-art methods was conducted to validate the superiority of the proposed attention-guided multimodal graph network (PM) over existing methods. These approaches include CCA 44, VSE++ 45, PiTL 25, UNITER 26, ALBEF 46, GAT-H 29, and SimCLR 27. Of these, CCA uses handcrafted features, VSE++ and PiTL use CNN and RNN models for cross-modal feature extraction, UNITER is a transformer-based model, ALBEF is a fusion-based model, and SimCLR is a self-supervised method for cross-modal retrieval. Retrieval performance metrics, cross-modal alignment metrics, and graph-based attention performance are the three fundamental evaluation protocols used to assess the performance of the PM. For a text query and an image query, the respective top 5 retrieved results for the static and dynamic CMFG are shown in Fig. 4a,b respectively.

Fig. 4
figure 4

(a) Retrieved results for a text query and retrieved images from MIRFLICKR-25K for static and dynamic CMFG. (b) Retrieved results for an image query and retrieved text sequences from MIRFLICKR-25K for static and dynamic CMFG.

Retrieval performance metrics

The mean average precision (mAP) is a typical performance indicator, which measures retrieval accuracy across queries. For example, for a query set Q, the mAP can be calculated as follows:

$$mAP\left(Q\right)= \frac{1}{N}\sum_{j=1}^{N}P(j)$$

where N is the number of samples in the query set Q and P(j) is the average precision, which is defined for a query q as follows:

$$P\left(q\right)=\frac{1}{|{N}_{Pos}|}\sum_{i=1}^{M}{P}_{q}(i){\alpha }_{q}(i)$$

\({P}_{q}\left(i\right)\) measures the precision of the top-i retrieved samples. Here, \({\alpha }_{q}(i)=1\) indicates that the ith retrieved sample is a true neighbor of the query q, and \({\alpha }_{q}\left(i\right)=0\) indicates that it is not. The retrieval set consists of M instances, and the value of \(|{N}_{Pos}|\) indicates the number of true neighbors in the retrieval set.
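A minimal sketch of the mAP computation defined above, assuming that each query is represented by a binary relevance vector over its ranked retrieval list (1 for a true neighbor, 0 otherwise).

```python
import numpy as np

def average_precision(relevance):
    """P(q): mean of precision@i taken at every rank i where a true neighbor appears."""
    relevance = np.asarray(relevance, dtype=float)        # alpha_q(i) in ranked order
    if relevance.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_i = np.cumsum(relevance) / ranks         # P_q(i)
    return float((precision_at_i * relevance).sum() / relevance.sum())

def mean_average_precision(all_relevance):
    """mAP(Q): average of P(q) over all queries in the query set Q."""
    return float(np.mean([average_precision(r) for r in all_relevance]))

# Example: two queries with their top-5 ranked relevance lists
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 1, 1, 0]]))
```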

The experimental results for the two retrieval tasks, Image \(\to\) Text and Text \(\to\) Image (retrieving text from images and images from text), are presented in Table 1 for the two benchmark datasets. The top-5 and top-10 retrieval results for both image and text queries on the NUS-WIDE and MIRFLICKR-25K datasets are shown in Figs. 5 and 6. The corresponding precision and recall curves for R@5 and R@10 are also illustrated in these figures.

Table 1 Comparative results for cross-modal retrieval on NUS-WIDE and MIRFLICKR-25K for the proposed method at top retrieval R@5 & R@10.
Fig. 5
figure 5

P–R graph for 2 datasets for image-to-text retrieval and text-to-image retrieval for the top 5 outcomes.

Fig. 6
figure 6

P–R graph for 2 datasets for image-to-text retrieval and text-to-image retrieval for the top 10 outcomes.

Cumulative matching characteristic (CMC) curve

The CMC curve measures the probability that the correct match appears within the top-K retrieved results. It is commonly used in cross-modal retrieval (e.g., image-to-text) and re-identification tasks as shown in Fig. 7.

Fig. 7
figure 7

CMC curve for top-10 ranks for image-to-text retrieval outcomes.

For a given q and a set of retrieved outcomes R, the CMC value at rank k is defined as:

$$CMC\left(k\right)=\frac{1}{N}\sum_{i=1}^{N}1({rank}_{i}\le k)$$

where N is the total number of queries and \(1({rank}_{i}\le k)\) is an indicator function that returns 1 if the correct match is found within the top-k retrieved results, and 0 otherwise.
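The CMC values can be computed directly from the rank at which the correct match appears for each query; the sketch below assumes these ranks are already available.

```python
import numpy as np

def cmc_curve(correct_ranks, max_rank=10):
    """CMC(k) = fraction of queries whose correct match appears within the top-k results."""
    ranks = np.asarray(correct_ranks)          # rank_i of the correct match per query (1-based)
    return [float((ranks <= k).mean()) for k in range(1, max_rank + 1)]

# Example: five queries whose correct matches were found at these ranks
print(cmc_curve([1, 3, 2, 7, 1], max_rank=10))
```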

Future work may extend the proposed model by integrating advanced graph-based methods such as graph clustering 47 or fuzzy-entropy-based community detection 48, which may offer improved neighbor selection and better capture global graph structures compared to traditional KNN approaches.

Conclusion

In this work, a dynamic graph model, DCMFG, an adaptive model for cross-modal image-text retrieval, is introduced. This model ensures dynamic edge-weight updates in the graph structure based on mutual contextual relations for intra-modal and cross-modal dependencies. The proposed framework leverages ViT and BERT for image and text features and a GCNN for structured feature propagation. Additionally, an attention-based weighting mechanism refines feature interactions, improving retrieval accuracy. To enhance the robustness of the dynamic CMFG, we integrate neighborhood-based adaptive edge construction, where KNN is used to identify mutual contextual relationships between image and text features. The graph structure is updated during training, allowing the model to capture evolving feature dependencies. This leads to a more discriminative and semantically improved shared embedding space, ensuring effective retrieval performance. Experimental results on benchmark datasets showed a significant performance improvement over state-of-the-art methods in terms of precision and recall. In summary, the proposed Dynamic CMFG establishes a novel approach for graph-based cross-modal retrieval, offering a powerful and interpretable solution that bridges the semantic gap between vision and language through structured, adaptive learning.