Abstract
In recent years, fine-grained image search has been an area of interest within the computer vision community. Many current works follow deep feature learning paradigms, which generally exploit the activations of pre-trained convolutional layers as representations and learn a low-dimensional embedding. This embedding is usually learned by defining loss functions based on local structure, such as the triplet loss. However, triplet loss requires an expensive sampling strategy. In addition, softmax-based loss (when the problem is treated as a classification task) converges faster than triplet loss but suffers from early saturation. To this end, a novel approach is proposed to enhance fine-grained representation learning by injecting noise into both the inputs and the features. At the input, the image is corrupted with noise, and the goal is to reduce the distance, in the embedding space, between the L2-normalized features of the input image and its noisy version relative to other instances. Concurrently, noise injection in the features acts as regularization, facilitating the acquisition of generalized features and mitigating model overfitting. The proposed approach is tested on three public datasets, Oxford Flowers-17, Cub-200-2011 and Cars-196, and achieves better retrieval results than other existing methods. In addition, we also tested our approach in the zero-shot setting and obtained favorable results compared to prior methods on Cars-196 and Cub-200-2011.
Introduction
Image retrieval has been studied for decades, has yielded significant results, and remains a challenging topic. The challenge is to obtain images visually related to the query sample by analyzing its visual characteristics, either through low-level semantics (such as shape, texture, and color) or higher-level semantics (such as bags of visual words or neural codes)1. Prior content-based image retrieval (CBIR) methods work well for databases with large inter-class variance compared to databases with small inter-class variance (see Fig. 1). However, real-life scenarios require fine-grained search, that is, locating images that correspond to the exact sub-category of the query. For instance, when a user queries an image (say a bike or flower image), the user needs to access/retrieve images in the same fine-level category as the query (i.e., images corresponding to the same model of bike or the same flower species)2. In such a setting, retrieval becomes a complex and challenging task because it is arduous to distinguish between various models of cars or bikes, various species of flowers, or different breeds of dogs. The reason is that they share visual appearance at the global level and can only be distinguished by focusing on the critical parts of the object, such as the texture of a bird's feathers, the body color of a dog, or the shape of a bike's headlight. Therefore, the major challenge of this problem is to produce strong representations that capture these subtle details and distinguish nearly identical categories. Fine-grained search can be used for various purposes, including but not limited to surveillance, evaluation of climate change, intelligent retail, monitoring of biodiversity and ecosystems, and intelligent transportation.
Comparison of image databases. {Dataset Source: corel_images [https://www.kaggle.com/datasets/elkamel/corel-images], Oxford Flowers-1714; https://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html and Cars-19615; https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset?datasetId=30084&sortBy=dateCreated&select=cars_test}.
Learning effective descriptors plays an important role in the fine-grained image retrieval (FGIR) domain. When good features are exploited, a retrieval algorithm places similar images at the beginning of a ranked list and dissimilar ones at the end. Since2, FGIR has drawn growing research attention in the computer vision community. Despite recent progress, FGIR remains an open problem for commercial and cataloging applications. With recent developments in deep learning3,4,5, methods built upon convolutional neural network (CNN) features have become the mainstream of fine-grained search. However, these features are learned from the coarse domain; exploiting them directly is not feasible since they cannot capture the fine details of the object. Instead, low-dimensional features are learned on top of CNN features using the so-called deep metric learning (DML) approach, which aims to learn a low-dimensional metric space (or embedding space) in which similar instances lie close together and dissimilar ones far apart. A large body of work exists in this area using contrastive loss6, triplet loss7,8, and quadruplet loss9,10, with most methods following the triplet loss. However, the triplet loss relies on mining strategies7,8,11,12,13 to converge quickly, which requires extra computation. On the other hand, softmax generally converges faster than the triplet loss but suffers from early saturation, converging to poor local minima. Furthermore, learning embeddings with larger networks risks overfitting on small datasets. In this paper, we aim to overcome these issues by proposing a noise-invariant feature learning approach. In this approach, the model is trained with auxiliary noise injected at two positions: the input layer and the final layer of the deep network. By introducing noise at the input layer, the model learns noise-invariant features by maximizing the similarity between an image instance and its corresponding noisy version. Meanwhile, the noise added at the final layer, in conjunction with the softmax cross-entropy loss, serves as a form of regularization by generating augmented features within the embedding space. In the former case, we employ a contrastive learning approach, where positives are formed by injecting noise into images, while other samples serve as negatives. In the latter case, the induced noise prevents softmax from early saturation and allows the continued propagation of gradients computed on noise-augmented features, thereby helping to reduce overfitting on small datasets.
The following are our key contributions:
1) We propose a noise-invariant feature embedding learning method optimized with softmax. This removes the costly sampling process required when training DML with triplet loss, and it also alleviates the early-saturation problem of softmax-based learning.
2) This is done by injecting noise into both the input layer and the last layer of the deep network during training. The primary objective, grounded in contrastive learning, maximizes the similarity between an image instance and its corresponding noisy version. The secondary objective, relying on softmax cross-entropy, operates on augmented features generated within the embedding space and serves as a form of regularization.
3) Analysis on three fine-grained datasets shows that our approach achieves better results than the state of the art.
The rest of the paper is structured as follows: existing related works are explored in Section “Related Work”. The proposed approach is detailed in Section “Methodology”. Section “Experiments” discusses the experimental settings and analyzes the outcome results. Section “Conclusion” concludes the paper.
Related Work
Following the success of CNNs3, deep learning techniques have also spurred research in image retrieval1. For instance, Babenko et al.16 employed a pre-trained CNN, fine-tuned it on the target images, and used its responses for image representation and retrieval. In17, a feature aggregation method was presented that exploits sum pooling on deep features to generate compact descriptors. Further, Mohedano et al.18 exploited a bag-of-words model with CNN features, whereas in19, CNN features with VLAD are exploited for image search. Reference20 aggregated convolutional features with sum pooling after weighting them across channels and spatial locations. In addition, Yang et al.21 presented an image retrieval technique based on a cross-batch reference feature learning strategy. Tolias et al.22 presented an approach that generates compact features by encoding multiple locations with the activations of a convolutional layer. Shakarami et al.23 presented a fusion-based descriptor for image retrieval, which includes LBP, HOG, and CNN features. Although these methods work well at the coarse level, fine-grained localization is required as an initial step for fine-grained images. Using the deep learning paradigm, some efforts have also been made on fine-grained image tasks. For instance, reference24 utilized convolutional kernels for both selecting and representing object parts. Watkins et al.25 suggested a two-stage learning scheme (localization learning followed by classification using the detected location) for fine-grained classification by exploring ResNet architectures. Zhou et al.26 explored label hierarchies with rich relationships through bipartite-graph labels with VGG-Net4 for fine-grained classification. In27, the authors deployed a pre-trained VGG-164 for object localization and selected its deep descriptors by removing noise or background. Zheng et al.28 suggested the centralized ranking loss and trained the CNN with weakly supervised object localization; they then employed a CNN response map with contours to precisely extract the features. Kumar et al.29 explored ResNet185 for the FGIR task, fine-tuning it on the target dataset and using its activations for retrieval. Zhu et al.30 proposed a relation-based convolutional descriptor that encodes local subtle features for FGIR. Further, some efforts have been made in the direction of learning embeddings. For instance,6 used the pair-wise loss and7 the triplet loss for learning image embeddings with a CNN backbone. Subsequently, Song et al.31 exploited every pair in the minibatch to obtain hard negatives. Sohn et al.32 extended the triplet loss7,8 into the N-pair loss, which applies a softmax cross-entropy loss on pair-wise similarity values within the batch. Song et al.33 presented a clustering loss for embedding learning that considers the global structure of the embedding space. Huang et al.10 exploited quadruplets and mined hard examples in an end-to-end network with a PDDM block for similarity evaluation. Zheng et al.34 proposed a softmax loss with a normalize-scale layer for FGIR. The ranked list loss35 accounts for both positive and negative data within a batch, aiming to clearly differentiate the positive and negative sets. Reinforcement-learning-based sampling was proposed in36. Roth et al.13 also explored policy-adapted sampling via reinforcement learning for triplet losses. Further, Zheng et al.37 explored hard negative mining via a generative approach.
Duan et al.38 proposed a multilevel similarity-based metric loss that explores global-, local-, and channel-level similarity. Sanakoyeu et al.39 explored a divide-and-conquer approach that iteratively splits the embedding space to learn different features.
However, most of these methods rely on sampling strategies that make model training computationally expensive. In contrast, we adopt a simple strategy for learning fine-grained features via a noise-assisted learning approach, which strengthens the representational capacity of the base network without requiring any sampling strategy.
Methodology
The outline of the proposed method is depicted in Fig. 2. First, a minibatch of images is randomly sampled and corrupted with noise. Pairs of noisy and natural images are then fed to a Siamese network, while a minibatch of natural images is fed to two standalone networks. The Siamese network is responsible for making the features noise-invariant, while the other two networks are responsible for learning class-discriminative features. All networks are jointly trained with the common goal of learning feature representations for fine-grained image retrieval.
Consider the training images \(\left\{ {x_{1} ,x_{2} , \ldots ,x_{m} } \right\} \in X\) with associated labels \(y_{i} \in Y\) in the minibatch. Let \(f_{p}\) and \(f_{n}\) be the L2-normalized feature embeddings of a positive instance \(x_{p}\) and a negative instance \(x_{n}\) with respect to instance \(x_{i}\), such that \(y_{i} = y_{p}\) and \(y_{i} \ne y_{n}\). These positives and negatives are selected from the minibatch during training. Let \(\left( \cdot \odot \cdot \right)\) denote the cosine similarity function, with \(\odot\) the dot product of L2-normalized vectors. To enforce compactness among same-class instances and separation among different-class instances in the embedding space, the class discrimination loss inspired by40 can be given as:
where \(P(i)\) is the set of positive indices and \(N(i)\) the set of negative indices for the ith instance.
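For concreteness, a supervised contrastive form consistent with this notation and with ref. 40 is sketched below; the temperature \(\tau\) and the normalization by \(|P(i)|\) are assumptions and may differ slightly from the exact formulation:

$$L_{D} = - \sum_{i=1}^{m} \frac{1}{\left| P(i) \right|} \sum_{p \in P(i)} \log \frac{\exp \left( f_{i} \odot f_{p} / \tau \right)}{\sum_{j \in P(i) \cup N(i)} \exp \left( f_{i} \odot f_{j} / \tau \right)}$$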
Noise-invariant feature learning for FGIR
To improve the feature representation capability of a network, noise can help the deep CNN learn better representations for fine-grained images. The noisy labels used for feature learning in prior publications41,42 require a large dataset with noisy labels for network training. Instead of using noisy labels, we optimize the network by injecting noise at the input layer and at a higher layer of the CNN. Specifically, at each training iteration, noise is sampled from a zero-mean Gaussian distribution and injected into the input images as well as into the activations of the last layer of the deep CNN (the output of the average pooling layer in our case; refer to Fig. 2).
Let \(\xi_{i}^{I} \in N\left( {0,\delta_{i}^{2} } \right)\) be the noise sampled from the zero-mean Gaussian distribution. The noise is injected into each sample selected for the minibatch as \(\tilde{x}_{i} = x_{i} + \xi_{i}^{I}\). Let \({\text{f}}_{\text{i}}\) and \({\widetilde{\text{f}}}_{\text{i}}\) be the L2-normalized feature embeddings of \(x_{i}\) and \(\tilde{x}_{i}\). For all instances \(x_{i} \in X\), the objective is to maximize \(\left( {{\text{f}}_{{\text{i}}} \odot {\tilde{\text{f}}}_{{\text{i}}} } \right)\).
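As a minimal illustration of this step, the PyTorch-style sketch below adds zero-mean Gaussian noise to a minibatch and produces L2-normalized embeddings for the clean and noisy views; the function names, the noise scale sigma, and the split into a backbone and an embedding head are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def noisy_view(x, sigma=0.1):
    """Inject zero-mean Gaussian noise into a minibatch of images.
    x: (B, C, H, W); sigma is an assumed noise standard deviation."""
    return x + torch.randn_like(x) * sigma          # x_tilde = x + xi^I

def embed(backbone, head, x):
    """Backbone followed by an embedding head and L2 normalization."""
    return F.normalize(head(backbone(x)), dim=1)    # L2-normalized features

# Clean and noisy embeddings from the weight-sharing (Siamese) branches:
# f       = embed(backbone, head, x)
# f_tilde = embed(backbone, head, noisy_view(x))
```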
Given a Siamese network, we compute the probability of the noisy sample \(\tilde{x}_{i}\) being classified as the ith image as:
The loss40 associated with (2) is given as:
The Siamese network in this approach excels at learning embeddings for fine-grained representation by comparing and distinguishing pairs of inputs. In our approach, it is used to create a meaningful embedding space that brings similar images closer together. Here, the loss LN compacts the distance between \(\left( {f_{i} ,\tilde{f}_{i} } \right)\) pairs, making the features noise-invariant. It also minimizes \(\text{exp}\left({\text{f}}_{\text{j}}\odot {\widetilde{\text{f}}}_{\text{i}}\right)\) for all other instances \(j \ne i\), separating the noisy embedding from the embeddings of instances other than its clean counterpart.
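A plausible instantiation of Eqs. (2) and (3), consistent with the discussion above and with the contrastive formulation of ref. 40 (the temperature \(\tau\) and the sum over the minibatch are assumptions), is:

$$p\left( i \mid \tilde{x}_{i} \right) = \frac{\exp \left( f_{i} \odot \tilde{f}_{i} / \tau \right)}{\sum_{j=1}^{m} \exp \left( f_{j} \odot \tilde{f}_{i} / \tau \right)}, \qquad L_{N} = - \sum_{i=1}^{m} \log p\left( i \mid \tilde{x}_{i} \right)$$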
We also adopt a multi-class classification task to further optimize the network; however, softmax suffers from early saturation due to overfitting on smaller datasets. To overcome this, we inject Gaussian noise into the output of the final layer of the network (the average pooling layer in our case), so that at each step the loss penalizes the noisy feature for predicting a low score.
Let \(\xi_{i}^{F} \in N\left( {0,\delta_{i}^{2} } \right)\) be the noise and let \(Z_{i}\) represent the deep CNN's last-layer normalized43 activations for input image i; the noisy response is then \(\tilde{Z}_{i} = Z_{i} + \xi_{i}^{F}\). With a K-way softmax applied through the fully connected layer \(FZ = w_{z} \tilde{Z}_{i} + b_{z}\), the probability distribution of a model parameterized by \(\phi\) over the classes is given as:
With the goal of maximizing this probability (4), the loss to minimize is:
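As a sketch consistent with these definitions (the absence of additional scaling on the logits is an assumption), Eqs. (4) and (5) take the standard softmax cross-entropy form over the noisy logits \(FZ\):

$$p_{\phi}\left( k \mid \tilde{Z}_{i} \right) = \frac{\exp \left( FZ_{k} \right)}{\sum_{k^{\prime}} \exp \left( FZ_{k^{\prime}} \right)}, \qquad L_{S} = - \sum_{i=1}^{m} \log p_{\phi}\left( y_{i} \mid \tilde{Z}_{i} \right)$$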
The total loss is given as:
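Assuming the three terms are combined by simple summation (any weighting factors are not specified here), the total objective can be written as:

$$L = L_{D} + L_{N} + L_{S}$$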
Minimizing L means minimizing all three losses LD, LN and LS. First, Eq. (1) can be reformulated as
Now, examining L, minimizing Eq. (7) necessitates maximizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\text{f}}_{\text{p}}/\tau \right)\) and minimizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\text{f}}_{\text{n}}/\tau \right)\). Given that the features are L2-normalized, maximizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\text{f}}_{\text{p}}/\tau \right)\) amounts to maximizing the cosine similarity between \({\text{f}}_{\text{i}}\) and \({\text{f}}_{\text{p}}\), aligning the features of the original sample and its positive counterpart. Similarly, minimizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\text{f}}_{\text{n}}/\tau \right)\) amounts to decreasing the cosine similarity between \({\text{f}}_{\text{i}}\) and \({\text{f}}_{\text{n}}\), pushing the features of the original sample away from its negative counterparts. This results in compactness of similar samples and separation of dissimilar samples in the embedding space. Turning to LN, minimizing it necessitates maximizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\widetilde{\text{f}}}_{\text{i}}/\tau \right)\) and minimizing \(\text{exp}\left({\text{f}}_{\text{j}}\odot {\widetilde{\text{f}}}_{\text{i}}/\tau \right)\). Maximizing \(\text{exp}\left({\text{f}}_{\text{i}}\odot {\widetilde{\text{f}}}_{\text{i}}/\tau \right)\) aligns the features of the original sample \({\text{f}}_{\text{i}}\) and its noisy counterpart \({\widetilde{\text{f}}}_{\text{i}}\), yielding a noise-invariant feature embedding. Similarly, minimizing \(\text{exp}\left({\text{f}}_{\text{j}}\odot {\widetilde{\text{f}}}_{\text{i}}/\tau \right)\) pushes \({\widetilde{\text{f}}}_{\text{i}}\) away from the features of the other instances \({\text{f}}_{\text{j}}\), further ensuring separation of dissimilar samples in the embedding space. Finally, minimizing LS further enhances the noise-invariance property and the class separability.
The overall steps of our approach are summarized in Algorithm 1.
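To make the joint objective concrete, the PyTorch-style sketch below outlines one training step following the description above; the helper names (e.g., model.pooled_features, classifier), the noise scales, the temperature, and the equal weighting of the three losses are assumptions for illustration, not the exact implementation of Algorithm 1.

```python
import torch
import torch.nn.functional as F

def train_step(model, classifier, batch_x, batch_y, optimizer,
               sigma_in=0.1, sigma_feat=0.1, tau=0.1):
    """One joint update with L_D (class discrimination), L_N (noise invariance)
    and L_S (softmax cross-entropy on noise-augmented features)."""
    # Input-level noise: create the noisy view of the minibatch.
    x_noisy = batch_x + torch.randn_like(batch_x) * sigma_in

    # L2-normalized embeddings of the clean and noisy views (shared weights).
    f = F.normalize(model(batch_x), dim=1)
    f_noisy = F.normalize(model(x_noisy), dim=1)

    # L_D: supervised contrastive term over clean embeddings (Eq. (1) sketch).
    eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    sim = (f @ f.t()) / tau
    exp_sim = sim.exp().masked_fill(eye, 0.0)              # self excluded
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    pos_mask = ((batch_y.unsqueeze(0) == batch_y.unsqueeze(1)) & ~eye).float()
    loss_d = -((log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

    # L_N: classify each noisy embedding as its own clean counterpart (Eqs. (2)-(3)).
    logits_n = (f_noisy @ f.t()) / tau
    targets = torch.arange(f.size(0), device=f.device)
    loss_n = F.cross_entropy(logits_n, targets)

    # L_S: classification on feature-level noise-augmented activations
    # (model.pooled_features is an assumed hook returning the avg-pool output).
    z = model.pooled_features(batch_x)
    z_noisy = z + F.normalize(torch.randn_like(z), dim=1) * sigma_feat
    loss_s = F.cross_entropy(classifier(z_noisy), batch_y, label_smoothing=0.1)

    loss = loss_d + loss_n + loss_s
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```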
Training details
We used ResNet18 (R18)5 as the backbone. For a good initialization, we initialize the R18 parameters with weights pre-trained on ImageNet48; the dense layers' weights are initialized as in5. The embedding size is set to 256, and Adam with a weight decay of 10e-4 is used for network training. The learning rate and mini-batch size are set to 10e-4 and 64, respectively. We first sample 8 classes randomly and then sample 8 instances per class. For each sample, a noisy counterpart is created for the Siamese network. We use the following data augmentation operations: after randomly sampling a mini-batch of training images, each image is first resized so that its shorter side is 256 while preserving the aspect ratio, which maintains the original shape of the object. A 224 × 224 crop is then taken from a random location within the image. Next, the crop is rotated by an angle in the range (-15, 15) degrees, followed by a center crop to maintain the same spatial size. Finally, color augmentation is applied with probability 0.5, followed by horizontal flipping with probability 0.5. For color augmentation, we employ the method of44, which generates realistic synthetic images; using44, we randomly select one image out of 10 generated images for each image of the minibatch. For LS (Eq. 5), we utilize label smoothing on the target probabilities within the cross-entropy to better tackle overfitting. This entails setting the probability of the correct class to 1 - φ with φ = 0.1, while assigning φ/(cl - 1) as the probability for all other classes, where cl is the number of classes. The sampled noise is also L2-normalized before being added to the feature.

For inference, we first rescale the image so that its shorter side is 224 and sample three crops of the network's input size (a center crop and a crop from each of the two sides) before feeding them to the network. The feature vectors of all crops are then averaged to produce the feature representation of the image. For matching, we compute the cosine similarity between the L2-normalized features of the gallery set and the query.
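For clarity, the inference procedure described above can be sketched as follows (PyTorch-style; the exact crop placement and the helper names are assumptions rather than the authors' code).

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def image_descriptor(model, img, crop=224):
    """Average the embeddings of three crops (a center crop plus one crop from
    each end of the longer dimension) of an image whose shorter side is `crop`."""
    _, h, w = img.shape                                    # img: (C, H, W) tensor
    crops = [TF.center_crop(img, [crop, crop]),
             TF.crop(img, 0, 0, crop, crop),               # first end
             TF.crop(img, h - crop, w - crop, crop, crop)] # opposite end
    with torch.no_grad():
        feats = [F.normalize(model(c.unsqueeze(0)), dim=1) for c in crops]
    return F.normalize(torch.stack(feats).mean(dim=0), dim=1)   # (1, d)

def rank_gallery(query_feat, gallery_feats):
    """Rank gallery images by cosine similarity to the query; features are
    L2-normalized, so the dot product equals cosine similarity."""
    sims = (gallery_feats @ query_feat.t()).squeeze(1)
    return torch.argsort(sims, descending=True)
```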
Experiments
This section first discusses the dataset settings and evaluation measures, then reports the FGIR results and analyzes the effect of noise injection on retrieval performance. Finally, we also test our approach in the zero-shot setting.
Datasets and evaluation setting
The experiments are conducted on two datasets, Oxford Flowers-1714 and Cars-19615. Oxford Flowers-17 consists of 17 fine-grained categories with 1360 flower images. Cars-196 consists of 196 fine-grained classes of car models with 16,185 images. Since Oxford Flowers-17 is a small dataset containing 80 images per category, we conduct the experiment on five randomly selected splits of the dataset, where each split consists of three sets: training, gallery and query, as depicted in Table 1. As a result, there are 680, 425 and 255 images in the training, gallery and query sets, respectively. In the case of the Cars-196 dataset, we conduct the experiment on the standard training/testing split, i.e., 8,144/8,041 images for training/testing. Note that retrieval is performed on the testing set by treating every image as a query, and the retrieved images are evaluated after excluding the query image itself. MATLAB and an NVIDIA Tesla K40c GPU are used to perform the experiments. To assess retrieval performance, we use mean average precision (mAP) as described in27.
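Under this protocol, the evaluation can be sketched as follows; this is a simplified mAP computation assuming precomputed L2-normalized features and integer class labels, not the exact evaluation script of ref. 27.

```python
import numpy as np

def mean_average_precision(feats, labels):
    """Leave-one-out retrieval mAP: every test image serves as a query and is
    excluded from its own ranked list. feats: (N, d) L2-normalized, labels: (N,)."""
    sims = feats @ feats.T
    aps = []
    for i in range(len(labels)):
        order = np.argsort(-sims[i])
        order = order[order != i]                      # exclude the query itself
        rel = (labels[order] == labels[i]).astype(float)
        if rel.sum() == 0:
            continue
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec_at_k * rel).sum() / rel.sum())
    return float(np.mean(aps))
```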
Results and analysis
Results on Oxford Flowers-17 under FGIR setting
A comparative analysis of the proposed method against the state of the art is presented, with results (mAPs) reported in Table 2. It can be seen that handcrafted features perform poorly, with mAPs of 0.101 (LBP59) and 0.112 (HOG58), as they are unable to capture the subtle differences in fine-grained images; these methods were not designed with such subtle details in mind. Deep CNN descriptors, however, show a large improvement over handcrafted ones. For instance, pre-trained ResNet18 descriptors achieve 0.513 mAP, which is around +0.4 mAP higher than handcrafted features. With fine-tuning on the target dataset, performance is further enhanced, with mAPs of 0.877 (Yang et al.21) and 0.928 (Kumar et al.29). With 0.946 mAP, the proposed approach achieves better results than the others, which confirms the importance of noise injection when training the network on small datasets. Further, mAP@K is depicted in Fig. 3, where we can see that our method gradually improves over the fine-tuned R1829 as K increases.
Top k mAP comparison between29 and our approach.
Moreover, Tables 3 and 4 report the category-wise performance on Flowers-17 along with a comparative analysis against the state of the art. From the results, we observe that the methods of45,46 and47 perform much better than HOG and LBP, and29 further improves over these methods in 13 classes. Our method outperforms29 in thirteen classes.
Results on Cars-196 under FGIR setting
Further, we compare our method with the SOTA on Cars-196, with results reported in Tables 5 and 6. Compared with the baselines in Table 5, our method achieves 80.2% mAP, which is 3.7% higher than the 76.5% of Kumar et al.29 and far ahead of LBP and HOG. This mainly owes to the effective learning of the image representation through intensive augmentation in the form of noise. Along with LBP (0.007 mAP) and HOG (0.010 mAP), the pretrained ResNet18 responses also perform poorly, with an mAP of 0.041. This implies that for a larger number of fine-grained classes (compared to Flowers-17), the pretrained ResNet18 is unable to distinguish them; having been trained on the ImageNet dataset48, it has learned to focus on the global structure of objects rather than their subtle details. Furthermore, in terms of top-1 and top-5 mAP, Table 6 shows that our method consistently outperforms SPoC17, CroW20, R-MAC22, Wei et al.27 and Kumar et al.29, with 86.14% top-1 mAP and 81.62% top-5 mAP.
Ablation study
Effect of noise induced on retrieval performance
We conduct experiments on Cars-196 to assess the impact of the injected noise on retrieval performance. The findings, presented as mAP at top-k, are shown in Table 7, where the proposed approach performs well compared to the other settings, e.g., 86.14% (with all losses) vs. 84.98% \(({\text{with}}\;L_{N} \;{\text{and}}\;L_{D} )\), 84.12% \(({\text{with}}\;L_{N} \;{\text{and}}\;L_{S} )\) and 82.61% \(({\text{with}}\;L_{D} )\) for top-1 mAP. This also indicates that injecting noise at both ends benefits the learning of generalizable features. Figure 4 further visualizes the performance under the different settings.
Fine-grained recognition
In this ablation study, we analyze the effect of our approach on recognition accuracy. For this, we use the Cars-196 dataset with the standard protocol for training and testing. We set the minibatch size to 64, the learning rate to 0.0001, and the data augmentation settings as discussed in Section “Training details”. The results in terms of recognition accuracy are reported in Table 8, which shows the boost in accuracy obtained with our approach.
Zero shot learning
Next, we test the generalization of our method in the zero-shot setting, namely whether the proposed method helps to find discriminative features even for images of unseen classes. Following the settings in34, we conduct the experiment on the Cars-196 and Cub-200-201149 datasets, where the first half of the classes is used to train the network and the remaining half for testing. We conduct the zero-shot learning experiments using PyTorch with a maximum of 40 epochs. We implement our method on two base networks: ResNet18 (R18) and ResNet50 (R50). First, we analyze the effectiveness of the proposed method on Cub-200-2011 and Cars-196 using the following experimental setting (R18, embedding size = 512, learning rate = 0.002, gamma = 0.1 every 15 epochs, batch size = 240 with 12 samples per class); the results are reported in Table 9, where we can see that including \(L_{N}\) and \(L_{S}\) tends to increase retrieval performance, confirming that the noise in \(L_{N}\) helps incorporate intra-class variance while the noise in \(L_{S}\) serves as a form of regularization.
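For reference, recall@k in this retrieval setting is typically computed as in the sketch below, assuming precomputed L2-normalized test features; the exact evaluation code following ref. 34 may differ.

```python
import numpy as np

def recall_at_k(feats, labels, ks=(1, 2, 4, 8)):
    """Recall@k: fraction of queries whose k nearest neighbours (query excluded)
    contain at least one image of the same class. feats: (N, d) L2-normalized."""
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)                 # never retrieve the query itself
    ranked = np.argsort(-sims, axis=1)              # nearest first
    hits = labels[ranked] == labels[:, None]        # relevance of each ranked item
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```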
Further, we analyze the effect of the embedding size on retrieval performance (recall@k), depicted in Fig. 5, and the effect of noise in LS on Cub-200-2011 with our approach, shown in Fig. 6. In Figs. 7 and 8, we additionally depict the retrieval results for a randomly picked query from each dataset.
Findings on the Cars-196 dataset. Correctly retrieved instances are indicated by green bounding boxes and incorrect ones by red bounding boxes. Dataset Source: https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset?datasetId=30084&sortBy=dateCreated&select=cars_test.
Findings on the Cub-200-2011 dataset. Correctly retrieved instances are indicated by green bounding boxes and incorrect ones by red bounding boxes. Dataset Source: https://www.vision.caltech.edu/datasets/cub_200_2011/.
In Table 10, we can also see that our method achieves better results compared to baseline methods such as EPSHN50 and NormSoftmax51 (EPSHN50 is based on a contrastive learning approach and NormSoftmax51 on a classification approach). For ResNet50 and ResNet101, we set the batch size to 144 with 24 samples per class. As per Table 10, our method consistently achieves better recall@k than the SOTA on the Cars-196 and Cub-200-2011 datasets. However, a few methods perform better than the proposed method on Cub-200-2011, which can be seen as a limitation of our method in the context of this small dataset. For the SOP31 dataset, our model consistently achieves better results than the others in Table 11. We can also see that, compared to the baseline methods50,51, the proposed method improves performance on all three datasets. This study confirms that our approach is able to generalize over unseen classes. We also show that with the ResNet101 model the proposed method improves even further.
Conclusion
In this paper, a noise-assisted feature learning approach for FGIR is proposed, which alleviates the expensive sampling process of triplet learning and the early-saturation problem of softmax-based learning. The deep CNN is jointly trained with a multi-loss objective that handles both class-discriminative learning and noise-invariant learning. The Oxford Flowers-17 and Cars-196 datasets are considered to validate our approach, where it achieves significant gains over existing schemes. Under the zero-shot setting, we achieved competitive results on the Cars-196, Cub-200-2011 and SOP datasets. The proposed approach exhibits great potential and can be explored in various industrial applications such as clothing retrieval, face retrieval, biomedical image retrieval, and landmark retrieval. The main limitation may be the training time compared to normal CNN training, which needs to be explored for larger networks. A second limitation is that the loss of the proposed method primarily emphasizes a global perspective; this could be addressed by incorporating local attention mechanisms to capture subtle features more effectively. In subsequent work, we plan to leverage various deep CNN variants and vision transformers to extend our approach to larger datasets. The applicability of these techniques can also be evaluated in the medical field, utilizing both supervised and unsupervised learning techniques for potential advancements.
Data availability
All images used in Figures 1, 7, and 8 are sourced from publicly available datasets intended for research purposes. Therefore, permission for their use is not required. The data that support the findings of this study and publicly available datasets are available at https://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html; https://www.vision.caltech.edu/datasets/cub_200_2011/; https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset?datasetId=30084&sortBy=dateCreated&select=cars_test.
References
Zhou, W., Li, H. & Tian, Q. Recent advance in content-based image retrieval: A literature survey (2017).
Xie, L., Wang, J., Zhang, B. & Tian, Q. Fine-grained image search. IEEE Trans. Multimed. 17, 636–647 (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Adv. Neural Inf. Process. Syst. 1097–1105 (2012).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In 3rd Int. Conf. Learn. Represent. ICLR 2015 - Conf. Track Proc. (2015).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 770–778 (2016).
Bell, S. & Bala, K. Learning visual similarity for product design with convolutional neural networks. In ACM Trans. Graph. (2015).
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B. & Wu, Y. Learning fine-grained image similarity with deep ranking. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1386–1393 (2014).
Schroff, F., Kalenichenko, D. & Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 815–823 (2015).
Chen, W., Chen, X., Zhang, J. & Huang, K. Beyond triplet loss: A deep quadruplet network for person re-identification. In Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition 403–412 (CVPR, 2017).
Huang, C., Loy, C. C. & Tang, X. Local similarity-aware deep feature embedding. In Adv. Neural Inf. Process. Syst. 1270–1278 (2016).
Manmatha, R., Wu, C. Y., Smola, A. J. & Krahenbuhl, P. Sampling matters in deep embedding learning. In Proc. IEEE Int. Conf. Comput. Vis. 2840–2848 (2017).
Ge, W., Huang, W., Dong, D. & Scott, M. R. Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV) 269–285 (2018).
Roth, K., Milbich, T., Ommer, B. Pads: Policy-adapted sampling for visual similarity learning. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 6567–6576 (2020).
Nilsback, M. E. & Zisserman, A. A visual vocabulary for flower classification. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1447–1454 (2006).
Krause, J., Stark, M., Deng, J., Fei-Fei, L. 3D object representations for fine-grained categorization. In Proc. IEEE Int. Conf. Comput. Vis. 554–561 (2013).
Babenko, A., Slesarev, A., Chigorin, A. & Lempitsky, V. Neural codes for image retrieval. In European Conference on Computer Vision 584–599 (2014).
Yandex, A. B. & Lempitsky, V. Aggregating local deep features for image retrieval. In Proc. IEEE Int. Conf. Comput. Vis. 1269–1277 (2015).
Mohedano, E., Mcguinness, K., O’Connor, N. E., Salvador, A., Marqués, F. & Giró-I-nieto, X. Bags of local convolutional features for scalable instance search. In ICMR 2016 - Proc. 2016 ACM Int. Conf. Multimed. Retr. 327–331 (2016).
Ng, J. Y. H., Yang, F. & Davis, L. S. Exploiting local features from deep networks for image retrieval. In IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. 53–61 (2015).
Kalantidis, Y., Mellina, C. & Osindero, S. Cross-dimensional weighting for aggregated deep convolutional features. In European Conference on Computer Vision 685–701 (2016).
Yang, H. F., Lin, K. & Chen, C. S. Cross-batch reference learning for deep classification and retrieval. In MM 2016 - Proc. 2016 ACM Multimed. Conf. 1237–1246 (2016).
Tolias, G., Sicre, R. & Jégou, H. Particular object retrieval with integral max-pooling of CNN activations. In 4th Int. Conf. Learn. Represent. ICLR 2016 - Conf. Track Proc. (2016).
Shakarami, A. & Tarrah, H. An efficient image descriptor for image classification and CBIR. Optik 214, 164833 (2020).
Zhang, X., Xiong, H., Zhou, W., Lin, W. & Tian, Q. Picking deep filter responses for fine-grained image recognition. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1134–1142 (2016).
Watkins, R., Pears, N. & Manandhar, S. Vehicle classification using ResNets, localisation and spatially-weighted pooling (2018).
Zhou, F. & Lin, Y. Fine-grained image classification by exploring bipartite-graph labels. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1124–1133 (2016).
Wei, X. S., Luo, J. H., Wu, J. & Zhou, Z. H. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Trans. Image Process. 26, 2868–2881 (2017).
Zheng, X., Ji, R., Sun, X., Wu, Y., Huang, F. & Yang, Y. Centralized ranking loss with weakly supervised localization for fine-grained object retrieval. In IJCAI Int. Jt. Conf. Artif. Intell. 1226–1233 (2018).
Kumar, V., Tripathi, V. & Pant, B. Content based fine-grained image retrieval using convolutional neural network. In 2020 7th Int. Conf. Signal Process. Integr. Networks 1120–1125 (SPIN, 2020).
Zhu, Y., Cao, G., Yang, Z. & Xiufan, Lu. Learning relation-based features for fine-grained image retrieval. Pattern Recogn. 140, 109543 (2023).
Song, H. O., Xiang, Y., Jegelka, S. & Savarese, S. Deep metric learning via lifted structured feature embedding. In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 4004–4012 (2016).
Sohn, K. Improved deep metric learning with multi-class N-pair loss objective. In Adv. Neural Inf. Process. Syst. 1857–1865 (2016).
Song, H. O., Jegelka, S., Rathod, V. & Murphy, K. Deep metric learning via facility location. In Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition 2206–2214 (CVPR, 2017).
Zheng, X., Ji, R., Sun, X., Zhang, B., Wu, Y. & Huang, F. Towards optimal fine grained retrieval via decorrelated centralized loss with normalize-scale layer. In Proc. AAAI Conf. Artif. Intell. Vol. 33, 9291–9298 (2019).
Wang, X., Hua, Y., Kodirov, E., Hu, G., Garnier, R. & Robertson, N. M. Ranked list loss for deep metric learning. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 5207–5216 (2019).
Duan, Y., Chen, L., Lu, J. & Zhou, J. Deep embedding learning with discriminative sampling policy. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 4964–4973 (2019).
Zheng, W., Lu, J. & Zhou, J. Hardness-aware deep metric learning. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3214–3228. https://doi.org/10.1109/TPAMI.2020.2980231 (2021).
Duan, C. et al. Multilevel similarity-aware deep metric learning for fine-grained image retrieval. IEEE Trans. Industr. Inf. 19(8), 9173–9182. https://doi.org/10.1109/TII.2022.3227721 (2023).
Sanakoyeu, A., Ma, P., Tschernezki, V. & Ommer, B. Improving deep metric learning by divide and conquer. IEEE Trans. Pattern Anal. Mach. Intell. 44(11), 8306–8320 (2022).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning 1597–1607 (PMLR, 2020).
Rodner, E., Simon, M., Fisher, R. B. & Denzler, J. Fine-grained recognition in the noisy wild: Sensitivity analysis of convolutional neural networks approaches. In: Br. Mach. Vis. Conf. 2016 60.1–60.13 (BMVC, 2016).
Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J. & Fei-Fei, L. The unreasonable effectiveness of noisy data for fine-grained recognition. In Lect. Notes Comput. Sci. (Including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) 301–320 (2016).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
Afifi, M. & Brown, M. What else can fool deep learning? Addressing color constancy errors on deep neural network performance. In Proc. IEEE Int. Conf. Comput. Vis. 243–252 (2019).
Yang, J., Yu, K., Gong, Y. & Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In Proc. IEEE Int. Conf. Comput. Vis. 1794–1801 (2009).
Gao, S., Tsang, I. W. H. & Ma, Y. Learning category-specific dictionary and shared dictionary for fine-grained image categorization. IEEE Trans. Image Process. 23, 623–634 (2014).
Ahmed, K. T., Ummesafi, S. & Iqbal, A. Content based image retrieval using image features information fusion. Inf. Fusion 51, 76–99 (2019).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The caltech-ucsd birds-200-2011 dataset (2011).
Xuan, H., Stylianou, A. & Pless, R. Improved embeddings with easy positive triplet mining. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA 2474–2482 (2020).
Zhai, A. & Wu, H.-Y. Classification is a strong baseline for deep metric learning. arXiv [cs.CV]. http://arxiv.org/abs/1811.1264 (2018).
Zhao, J.-M. & Lian, Q.-S. Multi-centers SoftMax reciprocal average precision loss for deep metric learning. Neural Comput. Appl. 35(16), 11989–11999 (2023).
Yan, J., Luo, L., Deng, C. & Huang, H. Adaptive hierarchical similarity metric learning with noisy labels. IEEE Trans. Image Process. 32, 1245–1256 (2023).
Yang, B., Sun, H., Li, F. W., Chen, Z., Cai, J. & Song, C. HSE: Hybrid species embedding for deep metric learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision 11047–11057 (2023).
Chan, P. P., Li, S., Deng, J. & Yeung, D. S. Multi-proxy based deep metric learning. Inf. Sci. 643, 119120 (2023).
Jiang, X., Yao, Y., Dai, X., Shen, F., Nie, L. & Shen, H. T. Anti-collapse loss for deep metric learning. IEEE Trans. Multimed. (2024).
Yang, L., Wang, P. & Zhang, Y. Stop-gradient softmax loss for deep metric learning. In Proceedings of the AAAI Conference on Artificial Intelligence Vol. 37, 3164–3172 (2023).
Dalal, N. & Triggs, B. Histograms of oriented gradients for human detection. In Proc. - 2005 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognition 886–893 (CVPR, 2005).
Ojala, T., Pietikäinen, M. & Mäenpää, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24, 971–987 (2002).
Author information
Authors and Affiliations
Contributions
V.K.: write original draft; V.T. and B.P.: Supervision; P.S. and M.D.: writing, review and editing; A.B.: validation and analysis.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kumar, V., Tripathi, V., Pant, B. et al. Learning optimal image representations through noise injection for fine-grained search. Sci Rep 15, 15560 (2025). https://doi.org/10.1038/s41598-025-97528-9