Abstract
The ability to continuously perceive new concepts from extremely limited samples is innate in human beings. Few-shot incremental learning emulates this ability by constructing an intelligent learning mechanism whose goal is to gradually identify novel categories from only a few given instances. The crux of few-shot incremental learning is to yield a model with the generalization ability to highlight the dominant object in an image and smooth the appearance of the object by leveraging prior knowledge. In light of this, we propose a Spatially Aware Global and Local Perspectives (SGLP) approach to tackle the few-shot incremental learning problem. To enhance the semantic representation of features, we build relationship information over the spatial features in a global scope and encourage the model to attend to the dominant regions of the features. Furthermore, assuming that the current and surrounding information of an image have a similar appearance, we design a smoothing operation on the spatial features by applying a simple Gaussian kernel in a local scope. Extensive experiments on benchmarks demonstrate the superiority and effectiveness of the proposed approach.
Introduction
Deep neural networks (DNNs) have achieved tremendous improvements in various computer vision tasks, such as image classification1, semantic segmentation2 and object detection3, over the past years. The success of DNNs in visual understanding hinges on the crucial assumption that each class of images possesses a large quantity of labeled training data. In contrast, human beings have the ability to gradually perceive new concepts by exploiting extremely few labeled instances. To emulate this ability, few-shot incremental learning has recently aroused the attention and interest of researchers; it aims to continuously recognize novel categories when confronted with scarce instances. Few-shot incremental learning thus emerges as a crucial pursuit, mirroring the remarkable human capacity to acquire new concepts gradually with minimal labeled instances.
Current methodologies for tackling the few-shot learning challenge primarily center around two fundamental aspects: meta-learning4,5,6,7,8,9,10,11,12 and generalization learning13,14,15,16,17,18,19,20,21. Meta-learning is a machine learning technique that aims to train algorithms to learn how to learn efficiently. This process typically involves two distinct phases: meta-training and meta-testing. In the meta-training phase, the model is immersed in a diverse array of tasks, assimilating a broad problem-solving framework that can be readily applied to uncharted challenges. In the subsequent meta-testing phase, the model encounters a novel task, harnessing the acquired wisdom from meta-training to swiftly acclimate to the newfound task and effectively surmount it. Nonetheless, within the realm of few-shot learning tasks, the essence of meta-learning training resides in optimizing models through the random selection of sample data from the base-class dataset (i.e., the source domain), as shown in Fig. 1a. This training approach, constrained by the limited number of classes per task, can consequently restrict the model's performance. To alleviate these constraints, few-shot approaches13,14,15,16,17,18,19,20,21 based on the generalization learning strategy have become the prevailing trend. The purpose of these approaches is to train a strong baseline model by leveraging the full set of base-class categories. This strategy facilitates better generalization to novel categories, as illustrated in Fig. 1b. This stimulates our interest in exploring whether generalization can effectively promote few-shot incremental learning, recognizing new concepts while reducing the forgetting of base-class knowledge.
Illustration of different strategies in the few-shot learning task. (a) Meta-task sampler: Sampling from the base class to create the support and query sets is used to construct meta-learning tasks; (b) Base-task sampler: Sampling from the base class to create the base sample set is used to construct full-class supervised learning; (c) Base-task sampler with spatial feature enhancer: Sampling from the base class to create the base sample set is used to construct full-class supervised learning with spatial feature enhancer.
To inherit the advantages of few-shot learning in generalization performance, few-shot incremental learning approaches22,23,24,25,26,27 draw inspiration from the characteristics of generalization learning to generate a powerful base-class model for better adaptation to continuously added categories. Based on the full-class supervision strategy, in this paper, we propose Spatially Aware Global and Local Perspectives (SGLP) for the few-shot incremental learning task. Different from previous few-shot incremental learning approaches22,24,25,26, whose core idea is to construct incremental correlations to ensure the continuous learning ability of the model, the proposed SGLP approach focuses on improving the feature representation by incorporating a spatial feature enhancer (refer to Fig. 1c) to powerfully enhance the generalization ability of the model and assist few-shot incremental learning. To be specific, the spatial feature enhancer conducts a relationship construction in the global scope and a smoothing operation in the local scope to improve the feature representation. The global perspective utilizes an attention mechanism to capture long-range dependencies and relationships within features, as opposed to focusing solely on local information. Its primary purpose is to help the model understand global patterns and connections in feature maps. Under the local perspective, the proximity of neighboring pixels around a feature is taken as a prerequisite. This perspective employs a simple Gaussian kernel to smooth the proximity relationships of the central pixel, thereby enhancing the influence of the effective region. Comprehensive experimental analyses on various benchmark datasets illustrate the superior and effective performance of our proposed approach, and ablation studies also clarify the utility of its different ingredients.
In summary, the main contributions in this study are listed as follows:
- We present a framework that leverages spatial-aware global and local perspectives to address the few-shot incremental learning challenge.
- We introduce a feature semantic enhancement strategy by constructing long-range relationships in a global scope.
- We introduce a feature smoothing operation by establishing proximity relationships in a local scope.
- Comprehensive experiments on multiple benchmarks (i.e., MiniImageNet4, CIFAR-FS8 and CUB-200–201128), consisting of experimental comparisons, ablations, and visualizations, verify the effectiveness of our approach and its different ingredients.
Related work
In this section, we present the related works including few-shot learning, few-shot incremental learning, and attention mechanism, which are most relevant to our work.
Few-shot learning
Few-shot learning aims to train a model on base classes and enable it to recognize novel classes with scarce instances. Existing approaches to few-shot learning are mainly divided into two families: meta-learning-based approaches4,5,6,7,8,9,10,11,12 and generalization learning-based approaches13,14,15,16,17,18,19,20,21,29,30,31. The fundamental concept of meta-learning-based approaches is to create support and query sets from the base classes using a meta-task sampler and then facilitate relationship learning between them.
Pioneering the field, Finn et al.5 proposed MAML, introducing a groundbreaking concept in few-shot learning that focuses on versatile initialization for rapid adaptation to new tasks with limited data. Building upon this, Sung et al.7 furthered the field by explicitly modeling local relationships between objects, enhancing discriminative abilities. Zintgraf et al.11 presented fast context adaptation via meta-learning, introducing a technique for swift context adaptation. In Ref.12, Chen et al. proposed a meta-learning framework that seamlessly integrates self-supervised learning; this core technique refines representations and boosts adaptability to new classes. This innovation complements self-supervised approaches, collectively enabling models to quickly adjust to new tasks and demonstrating a holistic progression in the meta-learning landscape. Recently, generalization learning-based approaches have shown promising performance through full-class supervision, marking a trend in few-shot learning research. For example, Hou et al.14 introduced an attention mechanism to enhance the embedding model and generate clear feature maps, facilitating the generalization of the model; this benefits from a combination of a dense global classification loss and a few-shot loss. Further, Tian et al.15 achieved remarkable performance by employing a straightforward baseline with a linear classifier, showcasing its effectiveness in few-shot learning. To investigate the role of data augmentation, Rizve et al.17 leveraged both invariant and equivariant representations to boost model performance by large margins. Moreover, Refs.18,19,20,21 also exhibit promising outcomes through various full-class supervision strategies. Based on the above analysis, we employ the full-class supervision strategy to address the few-shot incremental learning task.
Few-shot incremental learning
The goal of few-shot incremental learning is to emulate the cognitive proficiency observed in humans for swift adaptation, empowering machine learning models to adeptly adjust to novel tasks or categories within the constraints of limited samples. Tao et al.32 first introduced the concept of few-shot class-incremental learning by incorporating a neural gas for preserving topology in the embedding space. Building upon this pioneering work32, subsequent studies adapt existing class-incremental learning methods to address the challenges. Additionally, Refs.33,34,35,36 leverage word vectors to mitigate the inherent difficulties stemming from data scarcity in the few-shot incremental learning task. Another group of prevalent approaches22,23,24,25,38,39 concentrates on meta-training with base-class data, simulating test scenarios by sampling fake incremental episodes. These methods typically achieve significant results, making them effective for many real-world scenarios and demonstrating their applicability to arbitrary pre-trained models. Furthermore, a majority of these approaches24,25,39 draw inspiration from the concept of full-class supervision and freeze the parameters of the meta-trained model to explicitly preserve base knowledge, enhancing the model's adaptability to novel concepts. In a similar manner, our approach preserves the foundational state of the model, enabling effective adaptation to novel knowledge by continually evolving classifiers.
Attention mechanism
In the realm of deep neural networks, the attention mechanism has exerted a profound influence on diverse computer vision endeavors, including but not limited to image categorization, object localization, and semantic segmentation40,41,42,43,44,45,46,47. In Ref.42, Hu et al. introduced the conventional Squeeze-and-Excitation channel attention mechanism, providing seamless integration into established model frameworks. Additionally, Woo et al.43 proposed an innovative Convolutional Block Attention Module that incorporates maximal pooling to compress the spatial dimension of input features, thereby extracting diverse representations alongside average pooling. Huang et al.45 extended Ref.43 and introduced an innovative criss-cross attention module designed to capture contextual information from full-image dependencies in a more efficient and effective manner. In Ref.44, non-local neural networks introduced a paradigm shift in neural architectures, leveraging non-local operations for efficient capture of long-range dependencies; this approach enhances information integration across spatial and temporal scales, demonstrating effectiveness in various computer vision tasks.
Attention mechanisms have also been extended beyond static images to address spatiotemporal modeling in video understanding and dynamic perception. Zhou et al.48 introduced a Regional Attention module within a 3D network for RGB-D gesture recognition, effectively integrating spatial and temporal cues to enhance robustness and cross-modal alignment. Tan et al.49 proposed the Temporal Attention Unit (TAU), a lightweight module that models frame-wise temporal dependencies with low computational cost, showing strong performance in spatiotemporal prediction tasks. Further advances include Yuan et al.50, who developed UNIST, a prompt-driven attention model for urban spatiotemporal prediction, and Cai et al.51, who introduced GraphTAN for capturing temporal dynamics in graph-structured data. Xiang et al.52 applied a synchronization-based spatiotemporal attention mechanism to EEG-based seizure prediction, highlighting the versatility of attention across domains. These advances in attention mechanisms provide valuable foundations that inspire and inform the design of our proposed approach.
Methodology
In this section, we start by presenting the problem formulation of few-shot incremental learning and further introduce the proposed approach in detail.
Problem formulation
Consider an incremental learning paradigm conducted over a sequence of sessions \(S = \{s_{1}, s_{2},..., s_{n}\}\), where each session \(s_{i}\) represents a distinct temporal stage. The model is exposed to a set of base classes \(\mathcal {C}_{base}\) and incrementally encounters new classes denoted by \(\mathcal {C}^{i}_{inc}\) in each session \(s_{i}\). The base classes are shared across sessions, ensuring continuity in the learning process. Let \(D^{i}_{base}\) represent the dataset comprising instances from the base classes available in session \(s_{i}\), and \(D^{i}_{inc}\) denote the dataset for the few-shot incremental classes introduced in the same session. The term “few-shot” underscores the limited availability of samples for each incremental class.
The model parameters at the end of session \(s_{i}\) are denoted by \(\theta ^{i}\), and the knowledge retained from the base classes up to this point is represented by \(\mathcal {K}_{base}^{i}\). Similarly, \(\mathcal {K}_{inc}^{i}\) captures the model's knowledge about the few-shot incremental classes. The primary objective of few-shot incremental learning is to develop a model capable of continual adaptation and knowledge retention. Specifically, the model must demonstrate effective learning from both the shared base classes and the newly introduced few-shot incremental classes. The performance of the model is assessed based on its ability to accurately classify instances from both base and incremental classes, thus reflecting its adaptability to evolving knowledge scenarios. In essence, few-shot incremental learning addresses the nuanced challenge of incrementally introducing and learning from new classes, emphasizing the scarcity of available samples for the latter, all within the framework of an incremental learning setting across multiple sessions.
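For concreteness, the following sketch enumerates a session layout under this protocol, using the MiniImageNet split adopted later in the paper (60 base classes followed by eight 5-way 5-shot sessions); the class indices and dictionary structure are illustrative assumptions, not part of the original formulation.

```python
# Illustrative session layout for few-shot incremental learning, assuming the
# MiniImageNet protocol described later in the paper. Class indices are
# hypothetical placeholders.
base_classes = list(range(60))       # C_base, fully labeled in session s_0
n_sessions, way, shot = 8, 5, 5      # eight incremental sessions s_1..s_8

sessions = []
for i in range(n_sessions):
    new_classes = list(range(60 + i * way, 60 + (i + 1) * way))  # C_inc^i
    sessions.append({
        "session": i + 1,
        "new_classes": new_classes,  # 5 novel classes per session
        "shots_per_class": shot,     # only 5 labeled images each
    })

# After session s_i the model must classify all classes seen so far:
# 60 base classes plus 5 * i incremental classes.
```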
Data initialization
Few-shot incremental learning requires the trained model to learn new knowledge while reducing the forgetting of old knowledge. Data initialization is an effective means to enrich the diversity of data, prevent overfitting of the model, and enhance feature representation by sampling from the base-task sampler and applying data augmentation, as shown in Fig. 2.
The framework of the proposed Spatial-aware Global and Local Perspectives approach. Data initialization obtains the base sample set via the base-task sampler and generates diverse images by data augmentation as the input of the model. To enhance the semantic representation of features, the spatial feature enhancer builds relationship information of the spatial features in the global and local scopes, which encourages the model to pay attention to the dominant regions in the features. Based on the learned model above, the incremental learning module incrementally updates the parameters of the classifier to effectively recognize new category data.
To be specific, given the base classes \(\mathcal {C}_{base}\), a base-task sampler is utilized to yield the base sample set \(\{(\textbf{X}_{i}, \textbf{y}_{i})\}_{i=1}^{N}\) (i.e., the mini-batch set), where N denotes the number of samples in the set \(\{\textbf{X}_{i}\}_{i=1}^{N}\) and \(\textbf{y}_{i}\) is the corresponding true label of \(\textbf{X}_{i}\). Owing to random sampling from the base classes, the base sample set covers all the different categories in the base classes. In contrast to meta-learning training53,54, which is limited to the sampled categories, full-class sampling can train models using all categories of the base classes, thereby enhancing the feature representation capability. With the base sample set \(\{\textbf{X}_{i}\}_{i=1}^{N}\), a data augmentation operation is then employed to enrich the representation of the input and avoid overfitting of the model to the data as follows:

\(\textbf{X}_{i}^{aug} = \texttt {Aug}(\textbf{X}_{i}). \quad (1)\)
In Eq. (1), the abbreviation \(\texttt {Aug}\) stands for data augmentation, and \(\textbf{X}_{i}^{aug}\) represents the augmented data. Data augmentation, which includes operations such as rotation, cropping, flipping, and brightness adjustment, is used to improve the performance of the model. Therefore, through data initialization, the model can better learn the underlying distribution of the data and generalize to new data.
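As a minimal sketch of this initialization step, the following pipeline applies the RandomHorizontalFlip and Normalization operations mentioned in the implementation details; the normalization statistics are the common ImageNet values and are an assumption rather than a detail reported in the paper.

```python
from torchvision import transforms

# A sketch of Aug(.) in Eq. (1), assuming the augmentations named in the
# implementation details; the mean/std values are assumed ImageNet statistics.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Given a PIL image X_i drawn by the base-task sampler, the augmented tensor
# X_i_aug = augment(X_i) serves as the model input.
```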
Spatial feature enhancer
Spatial-aware global perspective
After yielding \(\textbf{X}_{i}^{aug}\) by the data initialization in Section 3.2, the general strategy is to utilize a standard feature extractor \(\mathcal {F(*)}\) to generate the representation33,34,35. Although this strategy plays a significant role in obtaining high-level information, it fails to learn relationships from a global perspective in terms of spatial location, which weakens the representation of effective semantic information. Motivated by this, we present a spatial feature enhancer to strengthen the representation of the extracted feature, as shown in Fig. 2. The spatial feature enhancer designs a feature enhancement process from global to local perception. The spatial-aware global perspective learns long-range dependencies between features in an image: each spatial feature is compared with all other spatial features in the image, and the contribution of each feature to the enhanced output is weighted by their similarity.
Specifically, the extracted feature \(\textbf{X}_{i}^{ext}\) is encoded by the standard feature extractor, i.e., \(\textbf{X}_{i}^{ext} = \mathcal {F}(\textbf{X}_{i}^{aug})\). To improve the feature representation, the spatial-aware global perspective utilizes each pixel feature to build relationships in a global scope, as shown in Fig. 3. Given the pixel features \(\{\textbf{x}_{i,j}\}_{j=1}^{M}\) of \(\textbf{X}_{i}^{ext}\) (M denotes the number of all pixel features in \(\textbf{X}_{i}^{ext}\)), the enhanced feature with the spatial-aware global perspective is expressed as follows:

\(\widetilde{\textbf{x}}_{i,j} = \frac{1}{C}\sum _{k=1}^{M} \left( \mathcal {I}(\textbf{x}_{i,j})^{T}\, \mathcal {J}(\textbf{x}_{i,k})\right) \mathcal {K}(\textbf{x}_{i,k}). \quad (2)\)
In Eq. (2), \(\mathcal {I}\), \(\mathcal {J}\), and \(\mathcal {K}\) indicate three linear embedding functions with different learned parameters. C and T denote the normalization factor and the transpose operation, respectively. Building upon the enhanced feature \(\widetilde{\textbf{x}}_{i,j}\), the merged feature can be obtained by

\(\hat{\textbf{x}}_{i,j} = W_{\hat{\textbf{x}}}\, \widetilde{\textbf{x}}_{i,j} + \textbf{x}_{i,j}, \quad (3)\)
where \(W_{\hat{\textbf{x}}}\) is a weight matrix learned during the training of the model. By concatenating all pixel features \(\hat{\textbf{x}}_{i,j}\), the representation based on the spatial-aware global perspective can be formed as \(\hat{\textbf{X}}_{i}^{global}= \{\hat{\textbf{x}}_{i,j} \}_{j=1}^{M}\). The generated feature \(\hat{\textbf{X}}_{i}^{global}\) has a richer semantic representation, reflecting the main characteristics of the image.
The spatial-aware global perspective enhances the feature representation of \(\textbf{X}_{i}^{ext}\) and enriches the semantic information by weighting pixel features in a global scope. "Conv" is the \(1\times 1\) convolution network. \(\oplus\) and \(\otimes\) denote the addition and multiplication operations, respectively.
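To make the global perspective concrete, the following PyTorch sketch implements Eqs. (2)–(3) in the spirit of the standard non-local block44 that this design builds on; the softmax realization of the normalization factor C, the halved embedding dimension, and all layer names are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatialAwareGlobalPerspective(nn.Module):
    """A minimal sketch of Eqs. (2)-(3), assuming a non-local-style block."""

    def __init__(self, channels: int):
        super().__init__()
        # Three 1x1 convolutions play the roles of the embeddings I, J, K.
        self.embed_i = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.embed_j = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.embed_k = nn.Conv2d(channels, channels // 2, kernel_size=1)
        # W merges the enhanced feature back to the original dimension.
        self.merge = nn.Conv2d(channels // 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.embed_i(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        k = self.embed_j(x).flatten(2)                  # (b, c/2, hw)
        v = self.embed_k(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        # Pairwise similarities between every pixel and all other pixels,
        # normalized so the weights over positions sum to one (factor C).
        attn = torch.softmax(q @ k, dim=-1)             # (b, hw, hw)
        enhanced = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        # Residual merge: weighted global context is added to the input.
        return x + self.merge(enhanced)
```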
Spatial-aware local perspective
Beyond the global spatial awareness above, we further enhance features from a local spatial perspective, based on the assumption that adjacent pixels have similar semantic information. Similar pixels in adjacent positions should be close and exhibit a smooth distribution. However, previous approaches24,25,39 fail to effectively consider the relationship between neighboring pixel features, which can easily lead to noise or non-smooth values55. To mitigate the adverse effects of noise on the feature representation \(\hat{\textbf{X}}_{i}^{global}\), we introduce a simple operation that leverages the Gaussian kernel \(\mathcal {K}\) to smooth the spatial feature within a local scope, as shown in Fig. 2. To be specific, the smoothed feature can be represented as:

\(\hat{\textbf{X}}_{i}^{local} = \mathcal {G}(\hat{\textbf{X}}_{i}^{global}) = \mathcal {K} * \hat{\textbf{X}}_{i}^{global}, \quad (4)\)
where \(\mathcal {G}(\cdot )\) denotes the Gaussian filtering operator, which convolves the Gaussian kernel with the feature \(\hat{\textbf{X}}_{i}^{global}\). The Gaussian kernel \(\mathcal {K}\) consists of nine fixed values with \(3\times 3\times 1\) size and a standard deviation \(\sigma\), and its expression is as follows:

\(\mathcal {K}(u, v) = \frac{1}{2\pi \sigma ^{2}}\, e^{-\frac{u^{2}+v^{2}}{2\sigma ^{2}}}, \quad (5)\)

where \((u, v)\) denotes the offset from the kernel center.
By applying the Gaussian filtering operator to \(\hat{\textbf{X}}_{i}^{global}\) in Eq. (4), the yielded representation \(\hat{\textbf{X}}_{i}^{local}\) is smoother. Although Gaussian filtering smooths the representation, it still preserves the main features, maintaining the contours and edges of the image. Furthermore, global average pooling (GAP) is leveraged to generate the vectorized representation of \(\hat{\textbf{X}}_{i}^{local}\), which can be formulated by:

\(\hat{\textbf{x}}_{i}^{GAP} = \texttt {GAP}(\hat{\textbf{X}}_{i}^{local}). \quad (6)\)
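A minimal sketch of the local perspective follows, covering the Gaussian smoothing of Eqs. (4)–(5) and the pooling of Eq. (6); the value of \(\sigma\) and the depthwise-convolution realization are assumptions, since the paper only specifies a fixed \(3\times 3\times 1\) kernel with a standard deviation.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_smooth_and_pool(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Sketch of Eqs. (4)-(6): smooth each channel with a fixed 3x3 Gaussian
    kernel, then apply global average pooling. sigma=1.0 is an assumption."""
    # Build the 3x3 kernel K(u, v) = exp(-(u^2+v^2)/(2*sigma^2)) / (2*pi*sigma^2).
    coords = torch.tensor([-1.0, 0.0, 1.0])
    uu, vv = torch.meshgrid(coords, coords, indexing="ij")
    kernel = torch.exp(-(uu ** 2 + vv ** 2) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    kernel = kernel / kernel.sum()                       # normalize the weights

    b, c, h, w = x.shape
    kernel = kernel.to(x).expand(c, 1, 3, 3)             # one kernel per channel
    # Depthwise convolution applies the same smoothing to every channel, Eq. (4).
    smoothed = F.conv2d(x, kernel, padding=1, groups=c)
    return smoothed.mean(dim=(2, 3))                     # GAP vector, Eq. (6)
```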
To determine the classification of the feature \(\hat{\textbf{x}}_{i}^{GAP}\), a classifier followed by a Softmax function is adopted to obtain the probability \(\hat{\textbf{p}}_{i}\) of \(\hat{\textbf{x}}_{i}^{GAP}\), which is represented as:

\(\hat{\textbf{p}}_{i} = \texttt {Softmax}(\mathcal {C}(\hat{\textbf{x}}_{i}^{GAP})), \quad (7)\)
where \(\mathcal {C}\) is a classifier with a fully connected layer whose parameters are learned. \(\texttt {Softmax}\) denotes the normalized exponential function and yields the class probability of the image \(\textbf{X}_{i}\). To train the parameters of the model, a cross-entropy loss function \(\mathcal {L}_{ce}\) is leveraged to enforce the constraint between the true label \(\textbf{y}_{i}\) and the predicted probability \(\hat{\textbf{p}}_{i}\) in a supervised manner as follows:

\(\mathcal {L}_{ce} = -\frac{1}{N}\sum _{i=1}^{N} \textbf{y}_{i} \log \hat{\textbf{p}}_{i}. \quad (8)\)
By employing the spatial-aware global and local perspectives above, the model can reduce the impact of noise and effectively enhance the representation ability of features.
Incremental learning module
After the parameters of the feature extractor \(\mathcal {F}\) and classifier \(\mathcal {C}\) are trained, the next task is to conduct few-shot incremental learning. In the incremental learning stage, to alleviate the computational burden, the parameters of the feature extractor \(\mathcal {F}(\cdot )\) and the spatial feature enhancer are fixed. Recently, with the popularity of self-attention structures, various architectures have been proposed in succession, such as multi-head attention56, the transformer57, and the graph attention network58. Harnessing the self-attention principle, a progressive update strategy, akin to that in22, dynamically learns classifiers within each session \(s_{i}\), enabling continuous recognition of novel concepts. Let \(\mathcal {W}_{i}\in \mathbb {R}^{N_{i} \times C}\) denote the parameters of the classifier in session \(s_{i}\), where \(N_{i}\) represents the number of classes in session \(s_{i}\) and C is the dimension of the feature \(\hat{\textbf{x}}_{i}^{GAP}\). Note that \(s_{0}\) denotes the training stage on the base classes, where the classifier utilizes parameters represented by \(\mathcal {W}_{0}\). \(\mathcal {W}_{0}\) possesses the learned weight vectors of all base classes, which can be represented as:

\(\mathcal {W}_{0} = \{w_{1}^{0}, w_{2}^{0}, \ldots , w_{N_{0}}^{0}\}, \quad (9)\)
where \({w}_{i}^{0}\) is the corresponding weight vector of category i. As the sessions \(s_{i}\) proceed (i.e., incremental learning), the classifier weight parameters constantly grow and are updated. Therefore, the classifiers accumulated over the previous sessions can be described as:

\(\mathcal {W}_{i} = \{\mathcal {W}_{0}, \mathcal {W}_{1}, \ldots , \{w_{1}^{i}, w_{2}^{i}, \ldots , w_{N_{i}}^{i}\}\}, \quad (10)\)
where \(N_{i}\) denotes the number of classifier weights for the i-th session. Following the idea of the graph attention network59, each classifier weight \({w}_{N_{i}}^{i}\) in \(\mathcal {W}_{i}\) can be seen as a node in a graph structure. By building the self-attention relationship in \(\mathcal {W}_{i}\), a correlation coefficient \(c_{j,k}\) can be calculated by:

\(c_{j,k} = \langle \texttt {proj}_{1}(w^{j}), \texttt {proj}_{2}(w^{k}) \rangle , \quad (11)\)
where \(\texttt {proj}_{1}(\cdot )\) and \(\texttt {proj}_{2}(\cdot )\) indicate different projection functions that conduct linear transformations of \({w}^{j}\) and \({w}^{k}\), and \(\langle \cdot , \cdot \rangle\) denotes the inner product between two nodes. By normalizing the correlation coefficient, the coefficient factor can be expressed as:

\(\alpha _{j,k} = \frac{\exp (c_{j,k})}{\sum _{l=1}^{|\mathcal {W}|} \exp (c_{j,l})}. \quad (12)\)
In Eq. (12), \(|\mathcal {W}|\) is the current number of all nodes in \(\mathcal {W}_{i}\). With the coefficient factor, the weight parameter \(w^{j'}\) of each node can be updated by:

\(w^{j'} = \sum _{l=1}^{|\mathcal {W}|} \alpha _{j,l}\, \textbf{M} w^{l}, \quad (13)\)
where \(\textbf{M}\) indicates a weight matrix that performs the linear transformation of \(w^{l}\). Through self-attention operations within the graph structure, the classifiers \(\mathcal {W}_{i}\) are updated, resulting in the following representation:

\(\mathcal {W}_{i}' = \{w^{1'}, w^{2'}, \ldots , w^{|\mathcal {W}|'}\}. \quad (14)\)
Through the incremental learning module above, the model dynamically updates the classifiers learned in the current session together with the previously learned classifiers, and then combines them to make predictions over all classes.
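The following sketch illustrates one plausible realization of the classifier update in Eqs. (11)–(14), treating every classifier weight as a graph node and updating all nodes with a single-head self-attention pass; the projection dimensions and module structure are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ClassifierGraphUpdate(nn.Module):
    """A minimal sketch of Eqs. (11)-(14), assuming a single-head
    graph-attention update over the stacked classifier weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj1 = nn.Linear(dim, dim)                  # proj_1 for node w^j
        self.proj2 = nn.Linear(dim, dim)                  # proj_2 for node w^k
        self.transform = nn.Linear(dim, dim, bias=False)  # weight matrix M

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (|W|, C) -- every classifier weight seen so far is a node.
        q, k = self.proj1(weights), self.proj2(weights)
        coeff = q @ k.t()                       # correlations c_{j,k}, Eq. (11)
        alpha = torch.softmax(coeff, dim=-1)    # normalized factors, Eq. (12)
        # Each node aggregates linearly transformed neighbors, Eqs. (13)-(14).
        return alpha @ self.transform(weights)

# Usage sketch: concatenate the frozen base-session weights with the weights
# of the new session's classes, update them jointly, then classify over all
# classes seen so far.
# all_weights = torch.cat([w_base, w_new], dim=0)
# updated = ClassifierGraphUpdate(dim=all_weights.size(1))(all_weights)
```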
Experiments
This section introduces the experimental settings in detail and then demonstrates the results of comparison experiments and various ablations.
Experimental settings
Datasets
We evaluate our proposed approach on three standard few-shot incremental learning benchmarks: MiniImageNet4, CIFAR-FS8 and CUB-200–201128.
MiniImageNet is a subset of ImageNet60 and was first introduced by Ref.61 for the few-shot learning task. It consists of 60,000 images from 100 categories, where each category has 600 images. Following previous works22,32, we utilize 60 and 40 categories for base-class training and incremental learning, respectively. The 40 newly introduced categories are evenly distributed across 8 sessions, with 5 classes allocated to each session. Within these incremental sessions, each class is represented by 5 training images. All images are resized to \(84\times 84\) resolution as the input of the model.
CIFAR-FS constitutes a subset of the CIFAR-100 dataset8, encompassing 100 categories and a total of 60,000 images at \(32\times 32\) resolution. Each class has 500 training images and 100 testing images. Adhering to the data partitioning defined in Refs.22,32, we designate 60 classes as base classes and 40 classes as novel classes. The 40 novel classes are further stratified into 8 separate incremental sessions, with each session configured as a 5-way 5-shot classification task.
CUB-200–2011 is a fine-grained dataset of bird species, consisting of 11,788 images distributed among 200 subcategories. In accordance with the data partitions stipulated in prior works22,32, the 200 classes are bifurcated into 100 base classes and 100 novel classes. The 100 novel classes are further stratified into 10 discrete sessions, with each session structured as a 10-way 5-shot task. The image dimensions across the dataset are standardized to \(224\times 224\) pixels. As illustrated in Fig. 4, we also provide a limited selection of sample images from these datasets for reference.
Implementation details
To ensure a fair comparison, we follow the experimental settings of previous works22,26,34,62. Specifically, we use ResNet-20 as the backbone for experiments on CIFAR-FS, and ResNet-18 for experiments on MiniImageNet and CUB-200–2011. For all benchmark datasets presented above, we leverage the classification accuracy to quantitatively evaluate the performance of our proposed approach. During the training phase, all data from the source and target domains is involved in the training process and employed to optimize the model parameters. The optimal model is determined by the best classification performance on the validation set and used to evaluate the classification accuracy on the unlabeled target domain. To avoid overfitting of the model to the data, all input images are augmented by RandomHorizontalFlip and Normalization operations to generate diverse representations63,64,65,66,67,68,69, as shown in Fig. 5. The evaluation mechanism of the classification accuracy can be formulated as follows:

\(Acc = \frac{|\{\textbf{X}_{u} \in \mathcal {T}_{u} : \textbf{Y}_{u} = \widehat{\textbf{Y}}_{u}\}|}{|\mathcal {T}_{u}|}, \quad (15)\)
where \(\textbf{Y}_{u}\) and \(\widehat{\textbf{Y}}_{u}\) denote the predicted and ground-truth labels of the unlabeled image \(\textbf{X}_{u}\) in the target domain \(\mathcal {T}_{u}\), respectively, and \(|\cdot |\) indicates the number of elements. By applying the evaluation mechanism of Eq. (15), we obtain the final generalization performance of the model.
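As a small illustration of Eq. (15), the following helper computes the per-session accuracy from model logits; it is a generic sketch rather than the authors' evaluation code.

```python
import torch

def session_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Sketch of Eq. (15): the fraction of unlabeled target images whose
    predicted label matches the ground truth."""
    preds = logits.argmax(dim=-1)                   # predicted labels Y_u
    return (preds == labels).float().mean().item()  # |correct| / |T_u|
```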
Comparison with the previous models
To evaluate the effectiveness of our proposed approach, we conduct a comprehensive comparison with existing few-shot incremental learning approaches across three benchmark datasets (MiniImageNet, CIFAR-FS, and CUB-200–2011), detailed in Tables 1, 2, and 3. The results in Tables 1 and 2 (representing the coarse-grained datasets) consistently demonstrate the superior performance of our approach in the 5-way 5-shot setting across all sessions. This trend, evident in our quantitative analysis, substantiates the excellence of our proposed approach. Furthermore, when examining the results on the fine-grained dataset in Table 3, our approach consistently delivers commendable performance across all sessions. This robust performance on both coarse-grained and fine-grained datasets showcases the versatility and efficacy of our proposed approach. In summary, our comparative analysis across multiple datasets affirms the robustness and superiority of the proposed approach. It excels in scenarios involving coarse-grained datasets and maintains competitive performance on fine-grained datasets. These findings underscore the potential of the spatial-aware global and local perspectives for advancing few-shot incremental learning methodologies.
Ablation analyses
In this section, comprehensive ablation analyses are conducted to verify the effectiveness of the different modules. We first assess the influence of the data augmentation operation on the model. The spatial-aware global and local perspectives are then verified for their utility, respectively. Finally, visualizations are displayed to explain the efficiency of the proposed approach.
Influence of data augmentation
Data augmentation avoids overfitting of the model to the dataset and increases the diversity of the training instances, aiming to improve the generalization performance of the model. As mentioned in Section 4.3, the RandomHorizontalFlip and Normalization operations are utilized to augment the input image. Several augmented examples are shown in Fig. 5, from which we can observe that the augmented versions exhibit more prominent local characteristics. To quantitatively verify the influence of the augmentation operation, Fig. 6 reports the performance with and without data augmentation on the MiniImageNet dataset. The experimental result in Fig. 6 shows that the approach with data augmentation achieves a significant boost. This phenomenon indicates that conducting simple transformations on the input data can enrich the data, reduce overfitting of the model, and thus effectively enhance its generalization. In addition, the convergence and the accuracy of the model with increasing iterations in the training stage are illustrated in Fig. 7.
The training stage yields the convergence of the loss function and the accuracy of the model on the base classes. (a) The convergence of the model on the training set as the number of epochs increases. (b) The convergence of the model on the test set as the number of epochs increases. (c) The accuracy of the model on the training set as the number of epochs increases. (d) The accuracy of the model on the test set as the number of epochs increases.
Influence of spatial-aware global perspective
As described in Section 3.3, the spatial-aware global perspective (SAGP) builds long-range dependencies between spatial features. The mechanism involves evaluating every spatial feature against all other spatial features and weighting the contribution of each feature to the enhanced output according to their similarity. To assess the influence of the spatial-aware global perspective on the model, we conduct the ablation analysis in Table 4. Table 4 adopts the MiniImageNet dataset with the 5-way 5-shot setting as the evaluation benchmark, and the results demonstrate that SAGP achieves a significant improvement in all sessions. This suggests that establishing long-range dependencies between features is advantageous for enhancing feature quality.
Influence of spatial-aware local perspective
In addition to the global perspective within the spatial scope, the impact of the spatial-aware local perspective (SALP) is also demonstrated in Table 4. As discussed in Section 3.3, SALP assumes that adjacent pixels convey similar semantic information and applies the simple Gaussian filtering operator \(\mathcal {G}(\cdot )\) to smooth local features. The results in Table 4 reveal a significant improvement by SALP over the baseline model on the benchmark dataset. The consistent findings across all sessions underscore the efficacy of the SALP module. Moreover, the last row of Table 4 indicates that the collaborative action of the SAGP and SALP modules achieves superior performance compared to each module in isolation. This observation suggests that the proposed approach effectively enhances feature representation and improves the model's generalization toward incremental novel classes.
Influence of different shots
To explore the effect of the number of support examples in few-shot incremental learning, we evaluate the model under 1-shot and 5-shot settings on the CIFAR-FS dataset, as shown in Fig. 8. The results demonstrate that the 5-shot setting consistently yields better performance than the 1-shot setting across all incremental steps. Increasing the number of support examples significantly enhances classification accuracy, suggesting that richer class representations help mitigate the challenges of incremental learning. Nevertheless, the performance gap between the two settings indicates that the problem of catastrophic forgetting remains substantial in the few-shot regime, even with additional support samples.
Visualization
To qualitatively explain the effectiveness of the model, the visualization of feature maps is demonstrated in Fig. 9. As shown in Fig. 9, the top, middle and bottom rows show the original images, the augmented versions and the feature maps, respectively. It can be observed that despite the appearance distortion caused by data augmentation operations, the model effectively focuses on the main regions of the image. Even when the image scene is complex and contains multiple objects, the model can still accurately locate the target region for the task. The visualization results in Fig. 9, where the model accurately focuses on the informative regions of the feature maps, serve as a validation and interpretation of the effectiveness of the proposed approach.
Conclusion and future work
In this paper, we present a straightforward yet effective approach, termed Spatial-aware Global and Local Perspectives (SGLP), to tackle the few-shot incremental learning problem. The spatial-aware global perspective establishes relationships among spatial features globally, encouraging the model to emphasize the dominant representation. Meanwhile, the spatial-aware local perspective operates under the assumption that current and surrounding image information share similar appearances, and a Gaussian filtering operation within a local scope is employed to refine the spatial features. Comprehensive experiments across various benchmarks demonstrate that the proposed approach yields competitive results, elucidating the efficacy of its distinct components.
For future work, an interesting direction involves exploring few-shot incremental learning through the integration of multiple modalities. Investigating how incorporating information from diverse sources, such as images and text, can enhance the adaptability and robustness of few-shot learning models is a key aspect. Exploring dedicated architectures and fusion methods tailored for multi-modal scenarios is also a valuable avenue for future research, aiming to improve generalization in situations with limited labeled data.
Data availability
The datasets analysed during the current study are available at: https://github.com/icoz69/CEC-CVPR2021?tab=readme-ov-file.
References
Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J. & Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 4804–4814, (2022).
Wang, W., Sun, G. & Van Gool, L. Looking beyond single images for weakly supervised semantic segmentation learning. IEEE Trans. Pattern Anal. Mach. Intell. (2022).
Du, X., Wang, X., Gozum, G. & Li, Y. Unknown-aware object detection: Learning what you don’t know from videos in the wild. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 13678–13688 (2022).
Ravi, S. & Larochelle, H. Optimization as a model for few-shot learning. In: International Conference on Learning Representations (2017).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017).
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H. & Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In: Int. Conf. Learn. Rep. (2018).
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. & Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition pp. 1199–1208 (2018).
Bertinetto, L., Henriques, J. F., Torr, P., Vedaldi, A. Meta-learning with differentiable closed-form solvers. In: International Conference on Learning Representations (2018).
Hao, F., He, F., Cheng, J., Wang, L., Cao, J. & Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 8460–8469 (2019).
Wu, Z., Li, Y., Guo, L. & Jia, K. Parn: Position-aware relation networks for few-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 6659–6667 (2019).
Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K. & Whiteson, S. Fast context adaptation via meta-learning. In: International Conference on Machine Learning pp. 7693–7702 (PMLR, 2019).
Chen, D. et al. Self-supervised learning for few-shot image classification. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1745–1749 (IEEE, 2021).
Wu, H., Zheng, Z., Wang, H., Wang, W. & Yang, Z. Few-Shot Incremental Learning with Context-Aware Spatial Enhancement for Image Recognition. IEEE Access (2025).
Hou, R., Chang, H., Ma, B., Shan, S. & Chen, X. Cross attention network for few-shot classification. Adv. Neural Inf. Process. Syst. 4003–4014 (2019).
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B. & Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV pp. 266–282 (Springer, 2020).
Zhang, C., Cai, Y., Lin, G. & Shen, C. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 12203–12213 (2020).
Rizve, M. N., Khan, S., Khan, F. S. & Shah, M. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 10836–10846 (2021).
Wu, H. et al. CLCFE: complementary loss coupling for feature-enhanced few-shot fine-grained visual recognition. Appl. Intell. 55, 742 (2025).
Zheng, Z. et al. SGE: Semantic-guided Generalization Enhancement for few-shot learning. Knowl.-Based Syst. 323, 113761 (2025).
Zheng, Z., Feng, X., Yu, H. & Gao, M. Cooperative density-aware representation learning for few-shot visual recognition. Neurocomputing 471, 208–218 (2022).
Zheng, Z. et al. Iccl: Independent and correlative correspondence learning for few-shot image classification. Knowl.-Based Syst. 266, 110412 (2023).
Zhang, C., Song, N., Lin, G., Zheng, Y., Pan, P. & Xu, Y. Few-shot incremental learning with continually evolved classifiers. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 12455–12464 (2021).
Shi, G., Chen, J., Zhang, W., Zhan, L.-M. & Wu, X.-M. Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Adv. Neural. Inf. Process. Syst. 34, 6747–6761 (2021).
Zhou, D.-W., Wang, F.-Y., Ye, H.-J., Ma, L., Pu, S. & Zhan, D.-C. Forward compatible few-shot class-incremental learning. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 9046–9056 (2022).
Zhou, D.-W., Ye, H.-J., Ma, L., Xie, D., Pu, S. & Zhan, D.-C. Few-shot class-incremental learning by sampling multi-phase tasks. IEEE Trans. Pattern Anal. Mach. Intell. (2022).
Xu, X. et al. Multi-feature space similarity supplement for few-shot class incremental learning. Knowl.-Based Syst. 265, 110394 (2023).
Ji, Z., Hou, Z., Liu, X., Pang, Y. & Li, X. Memorizing complementation network for few-shot class-incremental learning. IEEE Trans. Image Process. 32, 937–948 (2023).
Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The caltech-ucsd birds-200-2011 dataset (2011).
Wu, H., Zhao, Y. & Li, J. Selective, structural, subtle: Trilinear spatial-awareness for few-shot fine-grained visual recognition. In: 2021 IEEE International Conference on Multimedia and Expo (ICME) pp. 1–6 (IEEE, 2021).
Zheng, Z., Feng, X., Yu, H., Li, X. & Gao, M. Unsupervised few-shot image classification via one-vs-all contrastive learning. Appl. Intell. 53(7), 7833–7847 (2023).
Wu, H., Zhao, Y. & Li, J. Invariant and consistent: Unsupervised representation learning for few-shot visual recognition. Neurocomputing 520, 1–14 (2023).
Tao, X., Hong, X., Chang, X., Dong, S., Wei, X. & Gong Y. Few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 12183–12192 (2020).
Chen, K. & Lee, C.-G. Incremental few-shot learning via vector quantization in deep embedded space. In: Int. Conf. Learn. Rep. (2020).
Dong, S. et al. Few-shot class-incremental learning via relation knowledge distillation. Proc. AAAI Conf. Artif. Intell. 35, 1255–1263 (2021).
Mazumder, P., Singh, P. & Rai, P. Few-shot lifelong learning. Proc. AAAI Conf. Artif. Intell. 35, 2337–2345 (2021).
Cheraghian, A., Rahman, S., Fang, P., Roy, S. K., Petersson, L. & Harandi, M. Semantic-aware knowledge distillation for few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 2534–2543 (2021).
Cheraghian, A., Rahman, S., Ramasinghe, S., Fang, P., Simon, C., Petersson, L. & Harandi, M. Synthesized feature based few-shot class-incremental learning on a mixture of subspaces. In: Proc. IEEE/CVF international conference on computer vision pp. 8661–8670 (2021).
Chi, Z., Gu, L., Liu, H., Wang, Y., Yu, Y. & Tang, J. Metafscil: A meta-learning approach for few-shot class incremental learning. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 14166–14175 (2022).
Zhu, K., Cao, Y., Zhai, W., Cheng, J. & Zha, Z.-J. Self-promoted prototype refinement for few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 6801–6810 (2021).
Wu, H., Fu, K., Zhao, Y., Song, H. & Li, J. Joint self-supervised and reference-guided learning for depth inpainting. Comput. Vis. Media 8(4), 597–612 (2022).
Niu, S.-Z., Wu, H., Yu, Z.-F., Zheng, Z.-J. & Yu, G.-H. Total generalized variation minimization based on projection data for low-dose CT reconstruction. J. South. Med. Univ. 37(12), 1585–1591 (2017).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In: Proc. IEEE conference on computer vision and pattern recognition pp. 7132–7141 (2018).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In: Proc. European conference on computer vision (ECCV) pp. 3–19 (2018).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In: Proc. IEEE conference on computer vision and pattern recognition pp. 7794–7803 (2018).
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y. & Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In: Proc. IEEE/CVF international conference on computer vision pp. 603–612 (2019).
Shen, L., Tao, H., Ni, Y., Wang, Y. & Stojanovic, V. Improved yolov3 model with feature map cropping for multi-scale road object detection. Meas. Sci. Technol. 34(4), 045406 (2023).
Wang, Y. et al. Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 79, 104206 (2023).
Zhou, B., Li, Y. & Wan, J. Regional attention with architecture-rebuilt 3d network for rgb-d gesture recognition. Proc. AAAI Conf. Artif. Intell. 35, 3563–3571 (2021).
Tan, C., Gao, Z., Wu, L., Xu, Y., Xia, J., Li, S. & Li, S. Z. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 18770–18782 (2023).
Yuan, Y., Ding, J., Feng, J., Jin, D. & Li, Y. Unist: A prompt-empowered universal model for urban spatio-temporal prediction. In: Proc. 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining pp. 4095–4106 (2024).
Cai, Y.-J., Cai, H.-C., Zhang, C.-Y., Chen, C. P. & Tang, Q.-X. Graphtan: Temporal attention network for learning graph-level embedding. IEEE Trans. Comput. Soc. Syst. (2025).
Xiang, J. et al. Synchronization-based graph spatio-temporal attention network for seizure prediction. Sci. Rep. 15(1), 4080 (2025).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. pp. 4077–4087 (2017).
Li, W., Wang, L., Xu, J., Huo, J., Gao, Y. & Luo, J. Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition pp. 7260–7268 (2019).
Huang, S., Yang, W., Wang, L., Zhou, L. & Yang, M. Few-shot unsupervised domain adaptation with image-to-class sparse similarity encoding. In: Proc. 29th ACM International Conference on Multimedia pp. 677–685 (2021).
Li, J., Wang, X., Tu, Z. & Lyu, M. R. On the diversity of multi-head attention. Neurocomputing 454, 14–24 (2021).
Han, K. et al. Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021).
He, L., Bai, L., Yang, X., Du, H. & Liang, J. High-order graph attention network. Inf. Sci. 630, 222–234 (2023).
Velickovic, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015).
Cai, Q., Pan, Y., Yao, T., Yan, C. & Mei, T. Memory matching networks for one-shot image recognition. In: Proc. IEEE conference on computer vision and pattern recognition pp. 4080–4088 (2018).
Liu, H. et al. Few-shot class-incremental learning via entropy-regularized data-free replay. In: European Conference on Computer Vision pp. 146–162 (Springer, 2022).
Zheng, Z., Feng, X., Yu, H., Li, X. & Gao, M. Bdla: Bi-directional local alignment for few-shot learning. Appl. Intell. 53(1), 769–785 (2023).
Bao, Y. et al. E2cl: An efficient and effective classification learning for pneumonia detection in chest x-rays. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 35–40 (IEEE, 2024).
Zheng, Z. et al. Cross-domain few-shot chest x-ray recognition. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 224–229 (IEEE, 2024).
Wu, H. et al. Dara: Distribution-aware representation alignment for semi-supervised domain adaptation in image classification. J. Supercomput. 81(2), 1–37 (2025).
Wu, H. et al. Vlce: Unified vision-language collaborative enhancement for facial expression recognition. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 94–99 (IEEE, 2024).
Zhang, C., Hu, C., Xie, J., Wu, H. & Zhang, J. Wcal: Weighted and center-aware adaptation learning for partial domain adaptation. Eng. Appl. Artif. Intell. 130, 107740 (2024).
Zheng, Z. et al. MERGE: multimodal-enhanced representation and guided ensemble for pneumonia recognition in chest X-ray images. J. Supercomput. 81, 1–25 (2025).
Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. icarl: Incremental classifier and representation learning. In: Proc. IEEE conference on Computer Vision and Pattern Recognition pp. 2001–2010 (2017).
Zou, Y., Zhang, S., Li, Y. & Li, R. Margin-based few-shot class-incremental learning with class-level overfitting mitigation. Adv. Neural. Inf. Process. Syst. 35, 27267–27279 (2022).
Li, Y., Zhu, H., Ma, J., Xiang, C. & Vadakkepat, P. Incremental few-shot learning via implanting and consolidating. Neurocomputing 559, 126800 (2023).
Castro, F. M., Marín-Jiménez, M. J., Guil, N., Schmid, C. & Alahari, K. End-to-end incremental learning. In: Proc. European conference on computer vision (ECCV) pp. 233–248 (2018).
Zhao, H., Fu, Y., Kang, M., Tian, Q., Wu, F. & Li, X. Mgsvf: Multi-grained slow vs. fast framework for few-shot class-incremental learning. IEEE Trans. Pattern Anal. Mach. Intell. (2021).
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant No. 12071104 and Grant No. 62261002), the Natural Science Foundation of Zhejiang Province (Grant No. LD19A010002 and Grant No. LY21F010001), the Jiangxi Double Thousand Plan (Grant No. jxsq2019201061), the Science and Technology Program of Jiangxi Province (Grant No. 20192BCB23019 and Grant No. 20202BBE53024), the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. 230056), and the Zhejiang Provincial Natural Science Foundation of China (Grant No. LQN25F030002).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wu, H., Zheng, Z., Lv, L. et al. A spatially aware global and local perspective approach for few-shot incremental learning. Sci Rep 15, 21903 (2025). https://doi.org/10.1038/s41598-025-08323-5