Introduction

Visual understanding, which is analogous to human visual perception, is a major topic in computer vision research. An essential aspect of this understanding involves resolving the intricate relationships between visual information and textual associations. Image captioning is a fundamental task in visual understanding, where human-comprehensible textual information is derived from analyzing visual data. Current image captioning research primarily focuses on attention mechanisms, commonly employing region proposal networks to obtain region features for attention modeling. In addition, the field of visual-linguistic multi-modality has seen significant advances relevant to image captioning, with notable contributions such as Contrastive Language-Image Pre-Training (CLIP)1. CLIP is trained with a contrastive loss on extensive datasets to align visual features with textual descriptions, thereby establishing meaningful and correlated visual-text representations.

Dense captioning, a subcategory of image captioning, predicts diverse captions from a given input image instead of being limited to a single caption outcome2,3. By selecting various objects in the image, dense captioning can capture aspects of multiple situations. However, even though many dense captions are available, the difficulty lies in choosing the truly meaningful ones. In this paper, we introduce a novel approach called RefCap (Image Captioning with Referent Objects Attributes) for generating meaningful captions that align with user preferences in referring expression image captioning. This approach enhances visual understanding by incorporating object relation descriptors. In the proposed RefCap model, after a region proposal network detects various objects, our approach selectively focuses on the related objects based on the user's input prompt. This enables RefCap to predict a specific caption that corresponds to the object and referring expression provided by the user, thus providing more targeted caption generation instead of a range of possible outcomes.

Visual grounding (VG), also referred to as referring expression comprehension, is an intriguing research area in the field of computer vision4. VG methods aim to identify objects in an image that correspond to a user’s query. For example, if the user inputs a query such as “the man on the right,” the visual grounding module identifies and highlights the specified referent, and then the captioning module generates the corresponding expression. In our RefCap model, we utilize the localized objects obtained from the VG task’s results and establish their relationships.

Figure 1

Comparison between RefCap and other approaches: (a) image captioning, (b) dense captioning, and (c) RefCap. The given image is sampled from the COCO2014 train dataset. The \(\#\) denotes the image index in the dataset.

Figure 2 illustrates the pipeline of our RefCap model, which combines four main computer vision algorithms: object detection, visual grounding, scene graph generation, and image caption generation. A description of each pipeline network's role is provided in the "Proposed method" section. Compared to other visual understanding tasks, namely image captioning and dense captioning, our RefCap model differs as shown in Figure 1. The image captioning task derives the single most descriptive piece of textual information about a given image, while the dense captioning task can express more diverse information about the image. In our RefCap model, by contrast, the user first selects the object of interest in a given image, and the model then derives a corresponding description of that object. The RefCap model comprises the following steps:

  1. The word-level encoder encodes the user prompt, which is then passed to the Transformer along with the output of the object detection task.

  2. The VG task provides localization information about the target query, which is then used to construct object-level relations with a given object.

  3. Finally, we perform image captioning (IC) using the constructed relations to derive textual information about the object relationships.

Compared to other visual understanding tasks, such as image captioning and dense captioning, our RefCap model has the following key differences:

  • Image captioning generates a single textual description of a given image.

  • Dense captioning generates multiple textual descriptions of different objects in a given image.

  • RefCap generates a textual description of a user-specified object in a given image.

Related work

In the past few years, the fields of image captioning and visual-linguistic multi-modality research have advanced significantly. In this section, we briefly review the related literature on image captioning, visual-linguistic models, and referring expressions, and discuss the limitations of existing approaches.

Image captioning

In the early stages of image captioning, approaches primarily relied on pre-defined caption templates or matching image regions to textual descriptions. Hodosh et al. introduced methods such as template matching and retrieval-based image captioning5. Gan et al. used the object detection results as input to the LSTM to generate the caption6. However, these methods faced limitations in terms of flexibility and the ability to generate diverse, contextually relevant captions.

With the advent of deep learning, the focus shifted towards using neural networks for image captioning. A representative work is the Show and Tell model by Vinyals et al., which introduced an encoder-decoder framework using a convolutional neural network (CNN) as an image encoder and a recurrent neural network (RNN) as a caption generator7. This approach significantly improved the quality of generated captions by learning rich visual representations and capturing temporal dependencies in language.

Recently, researchers have also studied temporal information-based caption generation approaches. For example, Tu et al. investigated how to derive a representative caption that expresses a video’s temporal causal relationship and how to find changes in multiple images8,9.

Visual-linguistic models

To enhance visual-linguistic understanding, recent research efforts have explored incorporating attention mechanisms into image captioning models. Xu et al. introduced an attention mechanism for image captioning, allowing the model to selectively focus on different image regions while generating captions10. Similarly, Lu et al. 11 introduced the adaptive attention model, which dynamically adjusts the attention weights based on the input image and the generated caption. These attention-based models achieved better alignment between the generated words and the visual content, leading to improved caption quality11,12.

Some studies integrate the extensive prior knowledge of visual-linguistic models, such as ClipCap13 and SMALLCAP14, which combine pre-trained CLIP and GPT-2 models to build lightweight captioning models that require only minimal fine-tuning.

Referring expression generation

In the field of visual understanding research, the generation of referring expressions has gained increasing attention. In this context, Kazemzadeh et al. presented the ReferItGame dataset, which contains referring expressions for objects in images15. Mao et al. proposed a language-guided attention model to generate referring expressions, which makes the model attend to the relevant image regions described in the expression16. More recently, referring expression image captioning has become a crucial research topic in the visual-linguistic multi-modality research field. Hu et al. proposed an end-to-end referring expression image captioning model that incorporates explicit supervision of referred objects17. In their model, user-specified object keywords are used as a prefix to generate specific captions focused on the target object, improving the alignment between the referred object and the generated captions.

Visual scene graph generation

The goal of the visual scene graph generation (SGG) task is to detect the relationships between objects. The representative dataset is Visual Genome, presented by Krishna et al. 18. The Visual Genome dataset provides localized bounding boxes for objects, along with attributes such as color, size, and location. Furthermore, objects are interconnected through relationships. Understanding the contextual relationships between objects in an image and its corresponding language is useful for downstream visual-linguistic understanding tasks. In the SGG task, the encoder-decoder method is generally utilized. Xiao et al. generated the scene graph using spatial and semantic attention modules and their fusion19. Cong et al. proposed a one-stage approach that predicts relations directly with a triplet decoder, and they also provide abundant demonstrations20.

There is also a study that attempts image captioning through the SGG task. Yang et al. 21 presented a method that uses a homogeneous network to convert a scene graph into a caption. They used a scene graph to represent the objects in the image and generated a caption via an encoder-decoder structure.

Proposed method

In this section, we present an image captioning model called RefCap using referent object relationships. As shown in Figure 2, RefCap requires user prompts to initiate the captioning process. Subsequently, it employs Visual Selection (VS) and Image Captioning (IC) tasks to derive a textual description of the selected object and its corresponding referent. To gain a better understanding of RefCap’s functionality, we provide a detailed explanation in the following subsections.

Figure 2

The RefCap model generates a caption for the object specified by the user prefix, using the same image as in Figure 1.

Visual grounding

For the visual grounding task, both visual and linguistic features are used to compute embedding vectors as input. Given an image as input, the visual branch is composed of a stack of six encoder layers. Each encoder layer of the visual branch includes a multi-head self-attention layer and a feed-forward network (FFN), and positional encoding is added at each encoder layer. Meanwhile, the linguistic branch utilizes 12 encoder layers with a pre-trained BERT model. A [CLS] token is appended at the beginning and a [SEP] token at the end of the token sequence, and the resulting tokens are used as input to the linguistic transformer. To merge these visual-linguistic tokens, a linear projection maps them to the same dimension, and a learnable regression token [REG] is added for bounding box prediction. The visual-linguistic fusion tokens are then fed into a visual-linguistic transformer with six encoder layers. For the loss function, we sum the differences between the predicted and ground-truth boxes across all stages to obtain the total loss:

$$\begin{aligned} L = \alpha _1L_1 + \alpha _gL_{giou}, \end{aligned}$$
(1)

where \(\alpha _1\) and \(\alpha _g\) are hyper-parameters. \(L_1\) is the \(\ell _1\) loss and \(L_{giou}\) is the generalized intersection over union (GIoU) loss proposed by Rezatofighi et al. 22. As a result, the bounding box corresponding to the user prefix is predicted.
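The following is a minimal PyTorch-style sketch of Eq. (1), assuming boxes in \((x_1, y_1, x_2, y_2)\) format; the weight values for \(\alpha_1\) and \(\alpha_g\) are illustrative assumptions rather than the values used in our experiments.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def grounding_loss(pred_boxes, gt_boxes, alpha_1=5.0, alpha_g=2.0):
    """pred_boxes, gt_boxes: (N, 4) tensors in (x1, y1, x2, y2) format."""
    l1 = F.l1_loss(pred_boxes, gt_boxes, reduction="mean")
    # generalized_box_iou returns an (N, N) matrix; its diagonal pairs each
    # predicted box with its corresponding ground-truth box.
    giou = torch.diagonal(generalized_box_iou(pred_boxes, gt_boxes))
    l_giou = (1.0 - giou).mean()
    return alpha_1 * l1 + alpha_g * l_giou
```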

Referent objects selection

Following the Visual Grounding (VG) task, our model establishes relationships between referent objects. To construct these relationships, we require additional object locations. By utilizing a CNN-based object detection algorithm, we can easily obtain localized objects. The image features are then linearized to form the input vector, and a subset of these features is fed into the transformer encoder layer.

For every possible relationship between the objects, we compute the probability of their relationships. Considering that there are \(m(m-1)\) potential relationships among m objects, we generate \(m(m-1)\) pairwise features. Thus, the relationships between the i-th subject and the j-th object can be denoted as \(R_{i\rightarrow {j}}\in R\), where \(i \ne j\). These relationships form a directed graph represented by a collection of subject-predicate-object triplets.
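As an illustration of this bookkeeping, the following sketch enumerates the \(m(m-1)\) ordered subject-object pairs from \(m\) detected object features; the concatenation-based pair feature is an assumption made only to keep the example concrete.

```python
import torch

def pairwise_relation_features(obj_feats):
    """obj_feats: (m, d) tensor of per-object features."""
    m, _ = obj_feats.shape
    pairs, index = [], []
    for i in range(m):          # subject index
        for j in range(m):      # object index
            if i == j:
                continue        # no self-relations, hence m * (m - 1) pairs
            pairs.append(torch.cat([obj_feats[i], obj_feats[j]], dim=-1))
            index.append((i, j))
    return torch.stack(pairs), index   # (m*(m-1), 2d) features, (i, j) pairs
```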

The triplet prediction can be expressed as:

$$\begin{aligned} \hat{R} = <\hat{y}_{\text {sub}}, \hat{c}_{\text {pred}}, \hat{y}_{\text {obj}}>, \end{aligned}$$
(2)

where each subject and object contain class and bounding box labels denoted as \(\hat{y} = <\hat{c}, \hat{b}>\). Using the ground truth triplet \(R = <y_{\text {sub}}, c_{\text {pred}}, y_{\text {obj}}>\), the triplet cost is computed using the cost function \(c_m\) introduced in RelTR20.

$$\begin{aligned} c_{tri}=c_m(\hat{y}_{sub},y_{sub})+c_m(\hat{c}_{pred},c_{pred})+c_m(\hat{y}_{obj},y_{obj}). \end{aligned}$$
(3)

Given the triplet cost \(c_{tri}\), we obtain the triplet loss \(L_{tri}\) as:

$$\begin{aligned} L_{tri}=L_{sub} + L_{obj} + L_{pred}, \end{aligned}$$
(4)

where each term is the cross-entropy loss between the predicted class and the corresponding ground-truth class.
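A minimal sketch of Eq. (4) is given below; it shows only the classification terms, with logit and label shapes chosen for illustration (box-related costs are handled by the matching cost \(c_m\) of RelTR).

```python
import torch.nn.functional as F

def triplet_loss(sub_logits, obj_logits, pred_logits,
                 sub_labels, obj_labels, pred_labels):
    """*_logits: (N, num_classes) tensors; *_labels: (N,) class indices."""
    l_sub = F.cross_entropy(sub_logits, sub_labels)     # L_sub
    l_obj = F.cross_entropy(obj_logits, obj_labels)     # L_obj
    l_pred = F.cross_entropy(pred_logits, pred_labels)  # L_pred
    return l_sub + l_obj + l_pred                       # L_tri
```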

Following the steps of RelTR, relationships such as <Obj-relation-BG> and <Obj-no relation-Obj> are pruned. The resulting dense object relationships provide abundant descriptions of subject-object relations; however, unnecessary relations still remain, so we prune additional edges to retain only the meaningful ones. To select the remaining referent object relationships, we apply the following criteria (a sketch of this selection procedure is given after the list):

  1. The initial object \(y_{init}\) from the VG task is the root node.

  2. Duplicated relationships must not be included.

  3. A subject can take part in an additional relationship if it is \(y_{init}\) or was predicted as the object \(y_{obj}\) in a previous step.

  4. An object can take part in an additional relationship if it is \(y_{init}\) or was predicted as the subject \(y_{sub}\) in a previous step.

  5. Both subjects and objects can each have multiple relationships, as long as the above criteria are not violated.
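The sketch below illustrates how these criteria can be applied in practice: starting from the grounded object as the root node, triplets are kept only if they attach to the current sub-graph, duplicates are skipped, and nodes may participate in several relations. The data structures, the confidence ordering, and the relation budget are assumptions for illustration.

```python
def select_referent_relations(triplets, root, max_relations=10):
    """triplets: iterable of (subject_id, predicate, object_id) tuples,
    assumed to be sorted by confidence; root: object id from the VG task."""
    kept, seen, nodes = [], set(), {root}        # criterion 1: root node
    for sub, pred, obj in triplets:
        if (sub, pred, obj) in seen:
            continue                             # criterion 2: no duplicates
        # criteria 3-4: the triplet must attach to the current sub-graph
        # through either its subject or its object
        if sub in nodes or obj in nodes:
            kept.append((sub, pred, obj))
            seen.add((sub, pred, obj))
            nodes.update((sub, obj))             # criterion 5: nodes may
        if len(kept) >= max_relations:           # join several relations
            break
    return kept
```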

Finally, we perform image captioning (IC) using the constructed relations to derive textual information about the object relationships.

Image captioning

The final step of our model generates the target caption by incorporating the structure derived from the selected referent objects. Before the captioning step, we concatenate the outputs of the SGG and VG tasks. However, the predicted triplet embeddings \(\hat{R}=<\hat{y}_{sub},\hat{c}_{pred},\hat{y}_{obj}>\) and the visual features carry different information and have different lengths. Therefore, we project both the triplet embeddings and the visual features to features of the same dimension. The concatenated features are fed into the encoder of the transformer network23. The encoder is composed of a stack of six encoder layers, and each encoder layer includes a multi-head self-attention layer and an FFN, similar to our VG task. Following the transformer23, the attention is calculated by:

$${\text{Attention}}(\textbf{Q},\textbf{K},\textbf{V})={\text {softmax}} \left( \frac{\textbf{Q}\textbf{K}^{\textrm{T}}}{\sqrt{d_K}}\right) \textbf{V}, $$
(5)

where \(\textbf{Q},\textbf{K},\textbf{V}\) are the query, key, and value of the attention module, and \(d_K\) is the feature dimension. The multi-head attention is calculated by:

$$\begin{aligned} \text {Multihead}(\textbf{Q},\textbf{K},\textbf{V})=\text {Concat}(h_1,h_2,...,h_n)W^\text {o}, \end{aligned}$$
(6)

where \(h_i=\text {Attention}(\textbf{Q}W^q_i,\textbf{K}W^k_i,\textbf{V}W^v_i)\) and the projections \(W\) are the parameter matrices of each head. The encoded features are passed to the linguistic decoder for captioning. The linguistic decoder is likewise composed of a stack of six decoder layers. By decoding the features, we finally obtain a caption that corresponds to the user-specified object.
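For reference, a minimal sketch of the scaled dot-product attention of Eq. (5) is shown below; in practice, the encoder-decoder stack and the multi-head projections of Eq. (6) can be realized with torch.nn.MultiheadAttention or torch.nn.Transformer.

```python
import math
import torch

def attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d_k) tensors."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, L_q, L_k)
    return torch.softmax(scores, dim=-1) @ V           # (batch, L_q, d_k)
```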

The objective function for image captioning consists of two terms: cross-entropy loss and Self-Critical Sequence Training (SCST) loss24. The cross-entropy loss is defined as follows:

$$\begin{aligned} L_{CE}=-{\sum ^{T}_{t=1} \log (p_\theta (y^{*}_{t}|y^{*}_{1},...,y^{*}_{t-1}))}, \end{aligned}$$
(7)

where \(y^{*}_{t}\) is the ground-truth token at step \(t\), and \(\theta \) denotes the model parameters. The SCST loss minimizes the negative expectation of the CIDEr score, which is a metric for evaluating the quality of image captions:

$$\begin{aligned} L_{SCST}=-E_{y_{1:T}\sim {}p_\theta }[r(y_{1:T})], \end{aligned}$$
(8)

where \(r\) denotes the CIDEr reward function. The gradient of the SCST loss can be approximated as:

$$\begin{aligned} \nabla _\theta L_{SCST}\approx -(r(y_{1:T})-r(\hat{y}_{1:T}))\nabla _\theta \log p_\theta (y_{1:T}), \end{aligned}$$
(9)

where \(r(y_{1:T})\) is the CIDEr score of the sampled caption and \(r(\hat{y}_{1:T})\) is the CIDEr score of the greedy decoding of the model.
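A compact sketch of the SCST objective in Eqs. (8) and (9) is given below; the CIDEr rewards of the sampled and greedily decoded captions, as well as the sampled token log-probabilities, are assumed to be computed elsewhere.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """sample_logprobs: (batch, T) log p_theta of the sampled tokens;
    sample_reward, greedy_reward: (batch,) CIDEr scores r(y), r(y_hat)."""
    advantage = sample_reward - greedy_reward    # r(y) - r(y_hat)
    # negative expected reward, approximated with the self-critical baseline
    return -(advantage.unsqueeze(1) * sample_logprobs).sum(dim=1).mean()
```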

Experimental results

Implementation details

In our experimental results, we present a comprehensive evaluation of each module, incorporating both quantitative and qualitative assessments. Our RefCap model consists of four main modules, namely:

  (i) Object detection: This module enables the system to search for relevant visual content based on user queries.

  (ii) Visual grounding: The visual grounding module aims to establish a connection between textual queries and specific objects or regions in the visual content.

  (iii) Scene graph generation: This module generates a structured representation of the relationships between objects in the scene, capturing their interactions and contextual information.

  (iv) Image captioning: The image captioning module generates descriptive captions that accurately convey the content and context of the visual input.

Each of these modules plays a crucial role in our RefCap model, and we provide detailed descriptions and evaluations for each module in the following sections.

Object detection

For object detection, we utilize a Faster R-CNN model25 pretrained on the ImageNet dataset, with ResNet-10126 as the backbone architecture. This model is then fine-tuned on the Visual Genome dataset to perform visual grounding and scene graph generation tasks.

To reduce the dimensionality of the object features, which are initially 2048-dimensional vectors, we project them down to \(d_K=512\). This reduction is followed by a ReLU activation and a dropout layer to enhance the model's performance. During training, we employ the SGD optimizer with an initial learning rate of \(1\times 10^{-2}\).
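A sketch of this feature-reduction head and optimizer setup is shown below; the dropout rate is an assumption, since it is not specified above.

```python
import torch.nn as nn
import torch.optim as optim

projection = nn.Sequential(
    nn.Linear(2048, 512),   # 2048-d detector features -> d_K = 512
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.1),      # dropout rate assumed; not specified in the text
)
optimizer = optim.SGD(projection.parameters(), lr=1e-2)  # initial learning rate
```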

Table 1 Comparison of RefCap to state-of-the-art methods on the ReferItGame, RefCOCO, and RefCOCO+ datasets.

Visual grounding

To evaluate the Visual Grounding task, we conduct experiments on two datasets: ReferItGame15 and RefCOCO31. These datasets consist of images that contain objects referred to in the referring expressions. Each object may have one or multiple referring expressions associated with it.

We split each dataset into three subsets: a 70% training set, a 20% test set, and a 10% validation set. The input image size is standardized to \(640 \times 640\), and the maximum expression length is set to 10 tokens, including the [CLS] and [SEP] tokens. The shorter maximum expression length is chosen because the inference process only requires the keywords related to the target object.
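The expression preprocessing can be sketched as follows, assuming the HuggingFace BertTokenizer for the linguistic branch; the query string is only an example.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer(
    "the man on the right",   # example user query
    max_length=10,            # includes the [CLS] and [SEP] tokens
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
# encoded["input_ids"] and encoded["attention_mask"] feed the linguistic branch
```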

These evaluation setups allow us to assess the performance of the Visual Grounding task on different datasets and validate the effectiveness of our approach.

Scene graph generation

The Visual Genome dataset is used for evaluating scene graph generation. It consists of 108K images with 34K object categories, 68K attribute categories, and 42K relationship categories. We select the 150 most frequent object categories and the 50 most frequent relationship categories. The attribute categories are not kept separately but are merged into the relationship categories, so each image has object and relationship (with attribute) categories in its scene graph. During inference, the criteria described in the referent objects selection section are applied to prune unrelated relationships.
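The category filtering described above can be sketched as follows; the annotation format (subject, predicate, object name triples) is an assumption made for illustration.

```python
from collections import Counter

def top_categories(annotations, num_objects=150, num_predicates=50):
    """annotations: iterable of (subject_name, predicate_name, object_name)."""
    obj_counts, pred_counts = Counter(), Counter()
    for sub, pred, obj in annotations:
        obj_counts.update([sub, obj])
        pred_counts.update([pred])
    objects = [name for name, _ in obj_counts.most_common(num_objects)]
    predicates = [name for name, _ in pred_counts.most_common(num_predicates)]
    return objects, predicates
```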

Image captioning

For the image captioning task, we use the widely adopted COCO Entities dataset32. The dataset contains diverse caption annotations with an abundant combination of objects and their bounding boxes, which makes it well suited to RefCap and its sub-graph construction.

Quantitative evaluation

We first evaluate the visual grounding task of RefCap on the ReferItGame15, RefCOCO31, and RefCOCO+31 datasets, comparing it to other state-of-the-art methods including Maximum Mutual Information (MMI)16, Variational Context (VC)27, Modular Attention Network (MAttNet)28, Single-Stage Grounding (SSG)29, and Real-time Cross-modality Correlation Filtering (RCCF)30. Note that the accuracy of the RefCOCO and RefCOCO+ datasets is based on TestA only. Table 1 shows that our RefCap model is competitive with other state-of-the-art methods.

Table 2 Evaluation with four metrics.

We also evaluate the visual scene graph generation of RefCap with grounded objects on the Visual Genome dataset. The desired output of our scene graph is a sub-graph of the entire graph, meaning that the result does not need to include every relationship. We therefore evaluate how well the sub-graph represents the target object and whether the output includes the ground truth. We adopt the four metrics (PredCls, PhrCls, SGGen, SGGen+) presented by Yang et al. 33 for evaluating our scene graph generation, adapted to our setting. We modified the ground-truth data to reference the target object: each image contains multiple target objects and their referent relationships, so we compare only the relations of the target object against the modified ground truth. The performance of sub-graph generation by RefCap is shown in Table 2.

We finally evaluate our image captioning task. We employ conventional metrics (BLEU34, METEOR35, ROUGE36, and CIDEr37) to measure the quality of the predicted captions on the COCO Entities dataset. Table 3 shows the results of evaluating the predicted caption on COCO Entities.

Table 3 Evaluation of RefCap on the image captioning task on the COCO Entities dataset.

Qualitative evaluation

Figure 3 shows examples of our entire RefCap model. When a few keywords are typed by the user, RefCap detects the corresponding object, builds relationships with related objects, and produces the caption result. Unlike traditional captioning methods, RefCap shows the caption result for the user's desired target. Our RefCap can provide caption results not only for images with a single object but also for images with multiple objects.

Figure 3

Examples of the RefCap model. The given images are selected from the COCO2014 test dataset. The \(\#\) denotes the image index in the dataset. Incorrect caption results are highlighted in red.

Ablation study

In this section, we analyze the impact of each hyperparameter on each module of the model. First, we explore the effect of the prefix length on the visual grounding (VG) task. As summarized in Fig. 4, performance tends to improve as the prefix length increases. However, continually increasing the prefix length slows down processing because it increases the number of model parameters. Therefore, RefCap uses a prefix length of 15, which balances performance and processing time.

Figure 4
figure 4

Effect of prefix length on the image captioning performance of RefCap.

In a separate experiment, we examined how the scene graph generator (SGG) affects caption generation performance. Table 4 shows the accuracy of different combinations of subject, predicate, and object. The results indicate that using all three components achieves the best outcome. Notably, the combination of object and predicate is superior to subject and object alone, because the target's properties are difficult to represent with the class label alone.

Table 4 Evaluation of scene graph generation on the Visual Genome dataset.

Research plan

In this paper, we introduce a novel image captioning model, RefCap, which leverages referent object attributes to generate more specific and tailored captions. However, our model has several limitations. First, it consists of a combination of several pipeline networks, which makes it complex and sensitive to the performance of each individual network. To address this, we plan to develop an end-to-end model in our future work. We look forward to sharing our progress in future publications.

Discussion and conclusion

The main idea of this paper is to predict a meaningful caption from a user-selected prefix. By exploiting object relationships for image captioning, our method can predict caption results more accurately and concretely. As a result, users of our method can obtain results that better correspond to their prefixes. We also conducted quantitative and qualitative evaluations, experimenting with various datasets for each module, and both evaluations yielded encouraging results. Moreover, our RefCap can provide multiple caption results from a single image based on user input. We hope that this convergence of the object detection and image captioning tasks will provide insight into the future of computer vision and multi-modality research.