Abstract
The ability to continuously perceive new concepts from extremely limited samples is innate in human beings. Few-shot incremental learning emulates this ability by constructing an intelligent learning mechanism whose goal is to gradually identify novel categories from only a few given instances. The crux of few-shot incremental learning is to yield a model with the generalization ability to highlight the dominant object in an image and smooth the appearance of the object by leveraging prior knowledge. In light of this, we propose a Spatially Aware Global and Local Perspectives (SGLP) approach to tackle the few-shot incremental learning problem. To enhance the semantic representation of features, we build relationship information over the spatial features in a global scope and encourage the model to attend to the dominant regions of the features. Furthermore, assuming that the current and surrounding information of an image have a similar appearance, we design a smoothing operation on the spatial features by applying a simple Gaussian kernel in a local scope. Extensive experiments on benchmarks demonstrate the superiority and effectiveness of the proposed approach.
Introduction
Deep neural networks (DNNs) have achieved tremendous improvements in various computer vision tasks, such as image classification1, semantic segmentation2 and object detection3, over the past years. The success of DNNs in visual understanding hinges on the crucial assumption that each class of images possesses a large quantity of labeled training data. In contrast, human beings have the ability to gradually perceive new concepts by exploiting extremely few labeled instances. To emulate this ability, few-shot incremental learning has recently aroused the attention and interest of researchers; it aims to continuously recognize novel categories when confronted with scarce instances. Few-shot incremental learning thus emerges as a crucial pursuit, mirroring the remarkable human capacity to acquire new concepts gradually with minimal labeled instances.
Current methodologies for tackling the few-shot learning challenge primarily center around two fundamental aspects: meta-learning4,5,6,7,8,9,10,11,12 and generalization learning13,14,15,16,17,18,19,20,21. Meta-learning is a machine learning technique that aims to train algorithms to learn how to learn efficiently. This process typically involves two distinct phases: meta-training and meta-testing. In the meta-training phase, the model is immersed in a diverse array of tasks, assimilating a broad problem-solving framework that can be readily applied to uncharted challenges. In the subsequent meta-testing phase, the model encounters a novel task, harnessing the acquired wisdom from meta-training to swiftly acclimate to the newfound task and effectively surmount it. Nonetheless, within the realm of few-shot learning tasks, the essence of meta-learning training resides in optimizing models through the random selection of sample data from the base-class dataset (i.e., the source domain), as shown in Fig. 1a. This training approach, constrained by the limited number of classes per task, can consequently restrict the model's performance. To alleviate these constraints, few-shot approaches13,14,15,16,17,18,19,20,21 based on the generalization learning strategy have become the prevailing trend. The purpose of these approaches is to train a strong baseline model by leveraging the full set of base-class categories. This strategy facilitates better generalization to novel categories, as illustrated in Fig. 1b. This stimulates our interest in exploring whether generalization can effectively promote few-shot incremental learning, recognizing new concepts while reducing the forgetting of base-class knowledge.
Illustration of different strategies in the few-shot learning task. (a) Meta-task sampler: Sampling from the base class to create the support and query sets is used to construct meta-learning tasks; (b) Base-task sampler: Sampling from the base class to create the base sample set is used to construct full-class supervised learning; (c) Base-task sampler with spatial feature enhancer: Sampling from the base class to create the base sample set is used to construct full-class supervised learning with spatial feature enhancer.
To inherit the advantages of few-shot learning in generalization performance, few-shot incremental learning approaches22,23,24,25,26,27 draw inspiration from the characteristics of generalization learning to generate a powerful base-class model for better adaptation to continuously added categories. Based on the full-class supervision strategy, in this paper, we propose Spatially Aware Global and Local Perspectives (SGLP) for the few-shot incremental learning task. Different from previous few-shot incremental learning approaches22,24,25,26, whose core idea is to construct incremental correlations to ensure the continuous learning ability of the model, the proposed SGLP approach focuses on improving the feature representation by incorporating a spatial feature enhancer (refer to Fig. 1c) to powerfully enhance the generalization ability of the model and assist few-shot incremental learning. To be specific, the spatial feature enhancer conducts a relationship construction in the global scope and a smoothing operation in the local scope to improve the feature representation. The global perspective utilizes an attention mechanism to capture long-range dependencies and relationships within features, as opposed to focusing solely on local information. Its primary purpose is to help the model understand global patterns and connections in feature maps. Under the local perspective, the proximity of neighboring pixels around a feature is taken as a prerequisite. This perspective employs a simple Gaussian kernel to smooth the proximity relationships of the central pixel, thereby enhancing the influence of the effective region. Comprehensive experimental analyses on various benchmark datasets illustrate the superior and effective performance of our proposed approach, and ablation studies also clarify the utility of its different ingredients.
In summary, the main contributions in this study are listed as follows:
- We present a framework that leverages spatial-aware global and local perspectives to address the few-shot incremental learning challenge.
- We introduce a feature semantic enhancement strategy by constructing long-range relationships in a global scope.
- We introduce a feature smoothing operation by establishing proximity relationships in a local scope.
- Comprehensive experiments on multiple benchmarks (i.e., MiniImageNet4, CIFAR-FS8 and CUB-200–201128), consisting of experimental comparisons, ablations, and visualizations, verify the effectiveness of our approach and its different ingredients.
Related work
In this section, we present the related works including few-shot learning, few-shot incremental learning, and attention mechanism, which are most relevant to our work.
Few-shot learning
Few-shot learning aims to train a model on base classes and enable it to recognize novel classes with scarce instances. Existing approaches to few-shot learning are mainly divided into two families: meta-learning-based approaches4,5,6,7,8,9,10,11,12 and generalization learning-based approaches13,14,15,16,17,18,19,20,21,29,30,31. The fundamental concept of meta-learning-based approaches is to create support and query sets from the base classes using a meta-task sampler and then facilitate relationship learning between them.
Pioneering the field, Finn et al.5 proposed MAML, introducing a groundbreaking concept in few-shot learning that focuses on versatile initialization for rapid adaptation to new tasks with limited data. Building upon this, Sung et al.7 furthered the field by explicitly modeling local relationships between objects, enhancing discriminative abilities. Zintgraf et al.11 presented fast context adaptation via meta-learning, introducing a technique for swift context adaptation. In Ref.12, Chen et al. proposed a meta-learning framework that seamlessly integrates self-supervised learning; this core technique refines representations and boosts adaptability to new classes. This innovation complements self-supervised approaches, collectively enabling models to quickly adjust to new tasks and demonstrating a holistic progression in the meta-learning landscape. Recently, generalization learning-based approaches have shown promising performance through full-class supervision, marking a trend in few-shot learning research. For example, Hou et al.14 introduced an attention mechanism to enhance the embedding model and generate clear feature maps, facilitating the generalization of the model; this benefits from a combination of a dense global classification loss and a few-shot loss. Further, Tian et al.15 achieved remarkable performance by employing a straightforward baseline with a linear classifier, showcasing its effectiveness in few-shot learning. To investigate the role of data augmentation, Rizve et al.17 leveraged both invariant and equivariant representations to boost model performance by large margins. Moreover, Refs.18,19,20,21 also exhibit promising outcomes through various full-class supervision strategies. Based on the above analysis, we employ the full-class supervision strategy to address the few-shot incremental learning task.
Few-shot incremental learning
The goal of few-shot incremental learning is to emulate the cognitive proficiency observed in humans for swift adaptation, empowering machine learning models to adeptly adjust to novel tasks or categories within the constraints of limited samples. Tao et al.32 first introduced the concept of few-shot class-incremental learning by incorporating a neural gas for preserving topology in the embedding space. Building upon this pioneering work32, subsequent studies adapt existing class-incremental learning methods to address the challenges. Additionally, Refs.33,34,35,36 leverage word vectors to mitigate the inherent difficulties stemming from data scarcity in the few-shot incremental learning task. Another group of prevalent approaches22,23,24,25,38,39 concentrates on meta-training with base-class data, simulating test scenarios by sampling fake incremental episodes. These methods typically achieve significant results, making them effective for many real-world scenarios and demonstrating their applicability to arbitrary pre-trained models. Furthermore, a majority of these approaches24,25,39 draw inspiration from the concept of full-class supervision and freeze the parameters of the meta-trained model to explicitly preserve base knowledge, enhancing the model's adaptability to novel concepts. In a similar manner, our approach preserves the foundational state of the model, enabling effective adaptation to novel knowledge by continually evolving classifiers.
Attention mechanism
In the realm of deep neural networks, the attention mechanism has exerted a profound influence on diverse computer vision endeavors, including but not limited to image categorization, object localization, and semantic segmentation40,41,42,43,44,45,46,47. In Ref.42, Hu et al. introduced the conventional Squeeze-and-Excitation channel attention mechanism, providing seamless integration into established model frameworks. Additionally, Woo et al.43 proposed an innovative Convolutional Block Attention Module that incorporates maximal pooling to compress the spatial dimension of input features, thereby extracting diverse representations alongside average pooling. Huang et al.45 extended Ref.43 and introduced an innovative criss-cross attention module designed to capture contextual information from full-image dependencies in a more efficient and effective manner. In Ref.44, non-local neural networks introduced a paradigm shift in neural architectures, leveraging non-local operations for efficient capture of long-range dependencies; this approach enhances information integration across spatial and temporal scales, demonstrating effectiveness in various computer vision tasks.
Attention mechanisms have also been extended beyond static images to address spatiotemporal modeling in video understanding and dynamic perception. Zhou et al.48 introduced a Regional Attention module within a 3D network for RGB-D gesture recognition, effectively integrating spatial and temporal cues to enhance robustness and cross-modal alignment. Tan et al.49 proposed the Temporal Attention Unit (TAU), a lightweight module that models frame-wise temporal dependencies with low computational cost, showing strong performance in spatiotemporal prediction tasks. Further advances include Yuan et al.50, who developed UNIST, a prompt-driven attention model for urban spatiotemporal prediction, and Cai et al.51, who introduced GraphTAN for capturing temporal dynamics in graph-structured data. Xiang et al.52 applied a synchronization-based spatiotemporal attention mechanism to EEG-based seizure prediction, highlighting the versatility of attention across domains. These advances in attention mechanisms provide valuable foundations that inspire and inform the design of our proposed approach.
Methodology
In this section, we start by presenting the problem formulation of few-shot incremental learning and further introduce the proposed approach in detail.
Problem formulation
Consider an incremental learning paradigm conducted over a sequence of sessions \(S = \{s_{1}, s_{2},..., s_{n}\}\), where each session \(s_{i}\) represents a distinct temporal stage. The model is exposed to a set of base classes \(\mathcal {C}_{base}\) and incrementally encounters new classes denoted by \(\mathcal {C}^{i}_{inc}\) in each session \(s_{i}\). The base classes are shared across sessions, ensuring continuity in the learning process. Let \(D^{i}_{base}\) represent the dataset comprising instances from the base classes available in session \(s_{i}\), and \(D^{i}_{inc}\) denote the dataset for the few-shot incremental classes introduced in the same session. The term “few-shot” underscores the limited availability of samples for each incremental class.
The model parameters at the end of session \(s_{i}\) are denoted by \(\theta ^{i}\), and the knowledge retained from the base classes up to this point is represented by \(\mathcal {K}_{base}^{i}\). Similarly, \(\mathcal {K}_{inc}^{i}\) captures the model's knowledge about the few-shot incremental classes. The primary objective of few-shot incremental learning is to develop a model capable of continual adaptation and knowledge retention. Specifically, the model must demonstrate effective learning from both the shared base classes and the newly introduced few-shot incremental classes. The performance of the model is assessed based on its ability to accurately classify instances from both base and incremental classes, thus reflecting its adaptability to evolving knowledge scenarios. In essence, few-shot incremental learning addresses the nuanced challenge of incrementally introducing and learning from new classes, emphasizing the scarcity of available samples for the latter, all within the framework of an incremental learning setting across multiple sessions.
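For concreteness, the following sketch enumerates a session layout under this protocol, using the MiniImageNet split adopted later in the paper (60 base classes followed by eight 5-way 5-shot sessions); the class indices and dictionary structure are illustrative assumptions, not part of the original formulation.

```python
# Illustrative session layout for few-shot incremental learning, assuming the
# MiniImageNet protocol described later in the paper. Class indices are
# hypothetical placeholders.
base_classes = list(range(60))       # C_base, fully labeled in session s_0
n_sessions, way, shot = 8, 5, 5      # eight incremental sessions s_1..s_8

sessions = []
for i in range(n_sessions):
    new_classes = list(range(60 + i * way, 60 + (i + 1) * way))  # C_inc^i
    sessions.append({
        "session": i + 1,
        "new_classes": new_classes,  # 5 novel classes per session
        "shots_per_class": shot,     # only 5 labeled images each
    })

# After session s_i the model must classify all classes seen so far:
# 60 base classes plus 5 * i incremental classes.
```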
Data initialization
Few-shot incremental learning requires the trained model to learn new knowledge while reducing the forgetting of old knowledge. Data initialization is an effective means to enrich the diversity of data, prevent overfitting of the model, and enhance feature representation by sampling from the base-task sampler and applying data augmentation, as shown in Fig. 2.
The framework of the proposed Spatial-aware Global and Local Perspectives approach. Data initialization obtains the base sample set via the base-task sampler and generates diverse images by data augmentation as the input of the model. To enhance the semantic representation of features, the spatial feature enhancer builds relationship information of the spatial features in the global and local scopes, which encourages the model to pay attention to the dominant regions in the features. Based on the learned model above, the incremental learning module incrementally updates the parameters of the classifier to effectively recognize new category data.
To be specific, given the base classes \(\mathcal {C}_{base}\), a base-task sampler is utilized to yield the base sample set \(\{(\textbf{X}_{i}, \textbf{y}_{i})\}_{i=1}^{N}\) (i.e., the mini-batch set), where N denotes the number of samples in the set \(\{\textbf{X}_{i}\}_{i=1}^{N}\) and \(\textbf{y}_{i}\) is the corresponding true label of \(\textbf{X}_{i}\). Owing to random sampling from the base classes, the base sample set covers all the different categories in the base classes. In contrast to meta-learning training53,54, which is limited to the sampled categories, full-class sampling can train models using all categories of the base classes, thereby enhancing the feature representation capability. With the base sample set \(\{\textbf{X}_{i}\}_{i=1}^{N}\), a data augmentation operation is then employed to enrich the representation of the input and avoid overfitting of the model to the data as follows:

\(\textbf{X}_{i}^{aug} = \texttt {Aug}(\textbf{X}_{i}). \quad (1)\)
In Eq. (1), the abbreviation \(\texttt {Aug}\) stands for data augmentation, and \(\textbf{X}_{i}^{aug}\) represents the augmented data. Data augmentation, which includes operations such as rotation, cropping, flipping, and brightness adjustment, is used to improve the performance of the model. Therefore, through data initialization, the model can better learn the underlying distribution of the data and generalize to new data.
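As a minimal sketch of this initialization step, the following pipeline applies the RandomHorizontalFlip and Normalization operations mentioned in the implementation details; the normalization statistics are the common ImageNet values and are an assumption rather than a detail reported in the paper.

```python
from torchvision import transforms

# A sketch of Aug(.) in Eq. (1), assuming the augmentations named in the
# implementation details; the mean/std values are assumed ImageNet statistics.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Given a PIL image X_i drawn by the base-task sampler, the augmented tensor
# X_i_aug = augment(X_i) serves as the model input.
```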
Spatial feature enhancer
Spatial-aware global perspective
After yielding \(\textbf{X}_{i}^{aug}\) by the data initialization in Section 3.2, the general strategy is to utilize a standard feature extractor \(\mathcal {F(*)}\) to generate the representation33,34,35. Although this strategy plays a significant role in obtaining high-level information, it fails to learn relationships from a global perspective in terms of spatial location, which weakens the representation of effective semantic information. Motivated by this, we present a spatial feature enhancer to strengthen the representation of the extracted feature, as shown in Fig. 2. The spatial feature enhancer designs a feature enhancement process from global to local perception. The spatial-aware global perspective learns long-range dependencies between features in an image: each spatial feature is compared with all other spatial features in the image, and the contribution of each feature to the enhanced output is weighted by their similarity.
Specifically, the extracted feature \(\textbf{X}_{i}^{ext}\) is encoded by the standard feature extractor, i.e., \(\textbf{X}_{i}^{ext} = \mathcal {F}(\textbf{X}_{i}^{aug})\). To improve the feature representation, the spatial-aware global perspective utilizes each pixel feature to build relationships in a global scope, as shown in Fig. 3. Given the pixel features \(\{\textbf{x}_{i,j}\}_{j=1}^{M}\) of \(\textbf{X}_{i}^{ext}\) (M denotes the number of all pixel features in \(\textbf{X}_{i}^{ext}\)), the enhanced feature with the spatial-aware global perspective is expressed as follows:

\(\widetilde{\textbf{x}}_{i,j} = \frac{1}{C}\sum _{k=1}^{M} \left( \mathcal {I}(\textbf{x}_{i,j})^{T}\, \mathcal {J}(\textbf{x}_{i,k})\right) \mathcal {K}(\textbf{x}_{i,k}). \quad (2)\)
In Eq. (2), \(\mathcal {I}\), \(\mathcal {J}\), and \(\mathcal {K}\) indicate three linear embedding functions with different learned parameters. C and T denote the normalization factor and the transpose operation, respectively. Building upon the enhanced feature \(\widetilde{\textbf{x}}_{i,j}\), the merged feature can be obtained by

\(\hat{\textbf{x}}_{i,j} = W_{\hat{\textbf{x}}}\, \widetilde{\textbf{x}}_{i,j} + \textbf{x}_{i,j}, \quad (3)\)
where \(W_{\hat{\textbf{x}}}\) is a weight matrix learned during the training of the model. By concatenating all pixel features \(\hat{\textbf{x}}_{i,j}\), the representation based on the spatial-aware global perspective can be formed as \(\hat{\textbf{X}}_{i}^{global}= \{\hat{\textbf{x}}_{i,j} \}_{j=1}^{M}\). The generated feature \(\hat{\textbf{X}}_{i}^{global}\) has a richer semantic representation, reflecting the main characteristics of the image.
The spatial-aware global perspective enhances the feature representation of \(\textbf{X}_{i}^{ext}\) and enriches the semantic information by weighting pixel features in a global scope. "Conv" is the \(1\times 1\) convolution network. \(\oplus\) and \(\otimes\) denote the addition and multiplication operations, respectively.
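To make the global perspective concrete, the following PyTorch sketch implements Eqs. (2)–(3) in the spirit of the standard non-local block44 that this design builds on; the softmax realization of the normalization factor C, the halved embedding dimension, and all layer names are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SpatialAwareGlobalPerspective(nn.Module):
    """A minimal sketch of Eqs. (2)-(3), assuming a non-local-style block."""

    def __init__(self, channels: int):
        super().__init__()
        # Three 1x1 convolutions play the roles of the embeddings I, J, K.
        self.embed_i = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.embed_j = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.embed_k = nn.Conv2d(channels, channels // 2, kernel_size=1)
        # W merges the enhanced feature back to the original dimension.
        self.merge = nn.Conv2d(channels // 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.embed_i(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        k = self.embed_j(x).flatten(2)                  # (b, c/2, hw)
        v = self.embed_k(x).flatten(2).transpose(1, 2)  # (b, hw, c/2)
        # Pairwise similarities between every pixel and all other pixels,
        # normalized so the weights over positions sum to one (factor C).
        attn = torch.softmax(q @ k, dim=-1)             # (b, hw, hw)
        enhanced = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        # Residual merge: weighted global context is added to the input.
        return x + self.merge(enhanced)
```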
Spatial-aware local perspective
Beyond the global spatial awareness above, we further enhance features from a local spatial perspective, based on the assumption that adjacent pixels have similar semantic information. Similar pixels in adjacent positions should be close and exhibit a smooth distribution. However, previous approaches24,25,39 fail to effectively consider the relationship between neighboring pixel features, which can easily lead to noise or non-smooth values55. To mitigate the adverse effects of noise on the feature representation \(\hat{\textbf{X}}_{i}^{global}\), we introduce a simple operation that leverages the Gaussian kernel \(\mathcal {K}\) to smooth the spatial feature within a local scope, as shown in Fig. 2. To be specific, the smoothed feature can be represented as:

\(\hat{\textbf{X}}_{i}^{local} = \mathcal {G}(\hat{\textbf{X}}_{i}^{global}) = \mathcal {K} * \hat{\textbf{X}}_{i}^{global}, \quad (4)\)
where \(\mathcal {G}(\cdot )\) denotes the Gaussian filtering operator, which convolves the Gaussian kernel with the feature \(\hat{\textbf{X}}_{i}^{global}\). The Gaussian kernel \(\mathcal {K}\) consists of nine fixed values with \(3\times 3\times 1\) size and a standard deviation \(\sigma\), and its expression is as follows:

\(\mathcal {K}(u, v) = \frac{1}{2\pi \sigma ^{2}}\, e^{-\frac{u^{2}+v^{2}}{2\sigma ^{2}}}, \quad (5)\)

where \((u, v)\) denotes the offset from the kernel center.
By applying the Gaussian filtering operator to \(\hat{\textbf{X}}_{i}^{global}\) in Eq. (4), the yielded representation \(\hat{\textbf{X}}_{i}^{local}\) is smoother. Although Gaussian filtering smooths the representation, it still preserves the main features, maintaining the contours and edges of the image. Furthermore, global average pooling (GAP) is leveraged to generate the vectorized representation of \(\hat{\textbf{X}}_{i}^{local}\), which can be formulated by:

\(\hat{\textbf{x}}_{i}^{GAP} = \texttt {GAP}(\hat{\textbf{X}}_{i}^{local}). \quad (6)\)
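A minimal sketch of the local perspective follows, covering the Gaussian smoothing of Eqs. (4)–(5) and the pooling of Eq. (6); the value of \(\sigma\) and the depthwise-convolution realization are assumptions, since the paper only specifies a fixed \(3\times 3\times 1\) kernel with a standard deviation.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_smooth_and_pool(x: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Sketch of Eqs. (4)-(6): smooth each channel with a fixed 3x3 Gaussian
    kernel, then apply global average pooling. sigma=1.0 is an assumption."""
    # Build the 3x3 kernel K(u, v) = exp(-(u^2+v^2)/(2*sigma^2)) / (2*pi*sigma^2).
    coords = torch.tensor([-1.0, 0.0, 1.0])
    uu, vv = torch.meshgrid(coords, coords, indexing="ij")
    kernel = torch.exp(-(uu ** 2 + vv ** 2) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
    kernel = kernel / kernel.sum()                       # normalize the weights

    b, c, h, w = x.shape
    kernel = kernel.to(x).expand(c, 1, 3, 3)             # one kernel per channel
    # Depthwise convolution applies the same smoothing to every channel, Eq. (4).
    smoothed = F.conv2d(x, kernel, padding=1, groups=c)
    return smoothed.mean(dim=(2, 3))                     # GAP vector, Eq. (6)
```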
To determine the classification of the feature \(\hat{\textbf{x}}_{i}^{GAP}\), a classifier followed by a Softmax function is adopted to obtain the probability \(\hat{\textbf{p}}_{i}\) of \(\hat{\textbf{x}}_{i}^{GAP}\), which is represented as:

\(\hat{\textbf{p}}_{i} = \texttt {Softmax}(\mathcal {C}(\hat{\textbf{x}}_{i}^{GAP})), \quad (7)\)
where \(\mathcal {C}\) is a classifier with a fully connected layer whose parameters are learned. \(\texttt {Softmax}\) denotes the normalized exponential function and yields the class probability of the image \(\textbf{X}_{i}\). To train the parameters of the model, a cross-entropy loss function \(\mathcal {L}_{ce}\) is leveraged to enforce the constraint between the true label \(\textbf{y}_{i}\) and the predicted probability \(\hat{\textbf{p}}_{i}\) in a supervised manner as follows:

\(\mathcal {L}_{ce} = -\frac{1}{N}\sum _{i=1}^{N} \textbf{y}_{i} \log \hat{\textbf{p}}_{i}. \quad (8)\)
By employing the spatial-aware global and local perspectives above, the model can reduce the impact of noise and effectively enhance the representation ability of features.
Incremental learning module
After the parameters of the feature extractor \(\mathcal {F}\) and classifier \(\mathcal {C}\) are trained, the next task is to conduct few-shot incremental learning. In the incremental learning stage, to alleviate the computational burden, the parameters of the feature extractor \(\mathcal {F}(\cdot )\) and the spatial feature enhancer are fixed. Recently, with the popularity of self-attention structures, various architectures have been proposed in succession, such as multi-head attention56, the transformer57, and the graph attention network58. Harnessing the self-attention principle, a progressive update strategy, akin to that in22, dynamically learns classifiers within each session \(s_{i}\), enabling continuous recognition of novel concepts. Let \(\mathcal {W}_{i}\in \mathbb {R}^{N_{i} \times C}\) denote the parameters of the classifier in session \(s_{i}\), where \(N_{i}\) represents the number of classes in session \(s_{i}\) and C is the dimension of the feature \(\hat{\textbf{x}}_{i}^{GAP}\). Note that \(s_{0}\) denotes the training stage on the base classes, where the classifier utilizes parameters represented by \(\mathcal {W}_{0}\). \(\mathcal {W}_{0}\) possesses the learned weight vectors of all base classes, which can be represented as:

\(\mathcal {W}_{0} = \{w_{1}^{0}, w_{2}^{0}, \ldots , w_{N_{0}}^{0}\}, \quad (9)\)
where \({w}_{i}^{0}\) is the corresponding weight vector of category i. As the sessions \(s_{i}\) proceed (i.e., incremental learning), the classifier weight parameters constantly grow and are updated. Therefore, the classifiers accumulated over the previous sessions can be described as:

\(\mathcal {W}_{i} = \{\mathcal {W}_{0}, \mathcal {W}_{1}, \ldots , \{w_{1}^{i}, w_{2}^{i}, \ldots , w_{N_{i}}^{i}\}\}, \quad (10)\)
where \(N_{i}\) denotes the number of classifier weights for the i-th session. Following the idea of the graph attention network59, each classifier weight \({w}_{N_{i}}^{i}\) in \(\mathcal {W}_{i}\) can be seen as a node in a graph structure. By building the self-attention relationship in \(\mathcal {W}_{i}\), a correlation coefficient \(c_{j,k}\) can be calculated by:

\(c_{j,k} = \langle \texttt {proj}_{1}(w^{j}), \texttt {proj}_{2}(w^{k}) \rangle , \quad (11)\)
where \(\texttt {proj}_{1}(\cdot )\) and \(\texttt {proj}_{2}(\cdot )\) indicate different projection functions that conduct linear transformations of \({w}^{j}\) and \({w}^{k}\), and \(\langle \cdot , \cdot \rangle\) denotes the inner product between two nodes. By normalizing the correlation coefficient, the coefficient factor can be expressed as:

\(\alpha _{j,k} = \frac{\exp (c_{j,k})}{\sum _{l=1}^{|\mathcal {W}|} \exp (c_{j,l})}. \quad (12)\)
In Eq. (12), \(|\mathcal {W}|\) is the current number of all nodes in \(\mathcal {W}_{i}\). With the coefficient factor, the weight parameter \(w^{j'}\) of each node can be updated by:

\(w^{j'} = \sum _{l=1}^{|\mathcal {W}|} \alpha _{j,l}\, \textbf{M} w^{l}, \quad (13)\)
where \(\textbf{M}\) indicates a weight matrix that performs the linear transformation of \(w^{l}\). Through self-attention operations within the graph structure, the classifiers \(\mathcal {W}_{i}\) are updated, resulting in the following representation:

\(\mathcal {W}_{i}' = \{w^{1'}, w^{2'}, \ldots , w^{|\mathcal {W}|'}\}. \quad (14)\)
Through the incremental learning module above, the model dynamically updates the classifiers learned in the current session together with the previously learned classifiers, and then combines them to make predictions over all classes.
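The following sketch illustrates one plausible realization of the classifier update in Eqs. (11)–(14), treating every classifier weight as a graph node and updating all nodes with a single-head self-attention pass; the projection dimensions and module structure are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class ClassifierGraphUpdate(nn.Module):
    """A minimal sketch of Eqs. (11)-(14), assuming a single-head
    graph-attention update over the stacked classifier weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj1 = nn.Linear(dim, dim)                  # proj_1 for node w^j
        self.proj2 = nn.Linear(dim, dim)                  # proj_2 for node w^k
        self.transform = nn.Linear(dim, dim, bias=False)  # weight matrix M

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (|W|, C) -- every classifier weight seen so far is a node.
        q, k = self.proj1(weights), self.proj2(weights)
        coeff = q @ k.t()                       # correlations c_{j,k}, Eq. (11)
        alpha = torch.softmax(coeff, dim=-1)    # normalized factors, Eq. (12)
        # Each node aggregates linearly transformed neighbors, Eqs. (13)-(14).
        return alpha @ self.transform(weights)

# Usage sketch: concatenate the frozen base-session weights with the weights
# of the new session's classes, update them jointly, then classify over all
# classes seen so far.
# all_weights = torch.cat([w_base, w_new], dim=0)
# updated = ClassifierGraphUpdate(dim=all_weights.size(1))(all_weights)
```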
Experiments
This section introduces the experimental settings in detail and then demonstrates the results of comparison experiments and various ablations.
Experimental settings
Datasets
We evaluate our proposed approach on three standard few-shot incremental learning benchmarks: MiniImageNet4, CIFAR-FS8 and CUB-200–201128.
MiniImageNet is a subset of ImageNet60 and was first introduced by Ref.61 for the few-shot learning task. It consists of 60,000 images from 100 categories, where each category has 600 images. Following previous works22,32, we utilize 60 and 40 categories for base-class training and incremental learning, respectively. The 40 newly introduced categories are evenly distributed across 8 sessions, with 5 classes allocated to each session. Within these incremental sessions, each class is represented by 5 training images. All images are resized to \(84\times 84\) resolution as the input of the model.
CIFAR-FS constitutes a subset of the CIFAR-100 dataset8, encompassing 100 categories and a total of 60,000 images at \(32\times 32\) resolution. Each class has 500 training images and 100 testing images. Adhering to the data partitioning defined in Refs.22,32, we designate 60 classes as base classes and 40 classes as novel classes. The 40 novel classes are further stratified into 8 separate incremental sessions, with each session configured as a 5-way 5-shot classification task.
CUB-200–2011 is a fine-grained dataset of bird species, consisting of 11,788 images distributed among 200 subcategories. In accordance with the data partitions stipulated in prior works22,32, the 200 classes are bifurcated into 100 base classes and 100 novel classes. The 100 novel classes are further stratified into 10 discrete sessions, with each session structured as a 10-way 5-shot task. The image dimensions across the dataset are standardized to \(224\times 224\) pixels. As illustrated in Fig. 4, we also provide a limited selection of sample images from these datasets for reference.
Implementation details
To ensure a fair comparison, we follow the experimental settings of previous works22,26,34,62. Specifically, we use ResNet-20 as the backbone for experiments on CIFAR-FS, and ResNet-18 for experiments on MiniImageNet and CUB-200–2011. For all benchmark datasets presented above, we leverage the classification accuracy to quantitatively evaluate the performance of our proposed approach. During the training phase, all data from the source and target domains is involved in the training process and employed to optimize the model parameters. The optimal model is determined by the best classification performance on the validation set and used to evaluate the classification accuracy on the unlabeled target domain. To avoid overfitting of the model to the data, all input images are augmented by RandomHorizontalFlip and Normalization operations to generate diverse representations63,64,65,66,67,68,69, as shown in Fig. 5. The evaluation mechanism of the classification accuracy can be formulated as follows:

\(Acc = \frac{|\{\textbf{X}_{u} \in \mathcal {T}_{u} : \textbf{Y}_{u} = \widehat{\textbf{Y}}_{u}\}|}{|\mathcal {T}_{u}|}, \quad (15)\)
where \(\textbf{Y}_{u}\) and \(\widehat{\textbf{Y}}_{u}\) denote the predicted and ground-truth labels of the unlabeled image \(\textbf{X}_{u}\) in the target domain \(\mathcal {T}_{u}\), respectively, and \(|\cdot |\) indicates the number of elements. By applying the evaluation mechanism of Eq. (15), we obtain the final generalization performance of the model.
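As a small illustration of Eq. (15), the following helper computes the per-session accuracy from model logits; it is a generic sketch rather than the authors' evaluation code.

```python
import torch

def session_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Sketch of Eq. (15): the fraction of unlabeled target images whose
    predicted label matches the ground truth."""
    preds = logits.argmax(dim=-1)                   # predicted labels Y_u
    return (preds == labels).float().mean().item()  # |correct| / |T_u|
```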
Comparison with the previous models
To evaluate the effectiveness of our proposed approach, we conduct a comprehensive comparison with existing few-shot incremental learning approaches across three benchmark datasets (MiniImageNet, CIFAR-FS, and CUB-200–2011), detailed in Tables 1, 2, and 3. The results in Tables 1 and 2 (representing the coarse-grained datasets) consistently demonstrate the superior performance of our approach in the 5-way 5-shot setting across all sessions. This trend, evident in our quantitative analysis, substantiates the excellence of our proposed approach. Furthermore, when examining the results on the fine-grained dataset in Table 3, our approach consistently delivers commendable performance across all sessions. This robust performance on both coarse-grained and fine-grained datasets showcases the versatility and efficacy of our proposed approach. In summary, our comparative analysis across multiple datasets affirms the robustness and superiority of the proposed approach. It excels in scenarios involving coarse-grained datasets and maintains competitive performance on fine-grained datasets. These findings underscore the potential of the spatial-aware global and local perspectives for advancing few-shot incremental learning methodologies.
Ablation analyses
In this section, comprehensive ablation analyses are conducted to verify the effectiveness of the different modules. We first assess the influence of the data augmentation operation on the model. The spatial-aware global and local perspectives are then verified for their utility, respectively. Finally, visualizations are displayed to explain the efficiency of the proposed approach.
Influence of data augmentation
Data augmentation avoids overfitting of the model to the dataset and increases the diversity of the training instances, aiming to improve the generalization performance of the model. As mentioned in Section 4.3, the RandomHorizontalFlip and Normalization operations are utilized to augment the input image. Several augmented examples are shown in Fig. 5, from which we can observe that the augmented versions exhibit more prominent local characteristics. To quantitatively verify the influence of the augmentation operation, Fig. 6 reports the performance with and without data augmentation on the MiniImageNet dataset. The experimental result in Fig. 6 shows that the approach with data augmentation achieves a significant boost. This phenomenon indicates that conducting simple transformations on the input data can enrich the data, reduce overfitting of the model, and thus effectively enhance its generalization. In addition, the convergence and the accuracy of the model with increasing iterations in the training stage are illustrated in Fig. 7.
The training stage yields the convergence of the loss function and the accuracy of the model on the base classes. (a) The convergence of the model on the training set as the number of epochs increases. (b) The convergence of the model on the test set as the number of epochs increases. (c) The accuracy of the model on the training set as the number of epochs increases. (d) The accuracy of the model on the test set as the number of epochs increases.
Influence of spatial-aware global perspective
As described in Section 3.3, the spatial-aware global perspective (SAGP) builds long-range dependencies between spatial features. The mechanism involves evaluating every spatial feature against all other spatial features and weighting the contribution of each feature to the enhanced output according to their similarity. To assess the influence of the spatial-aware global perspective on the model, we conduct the ablation analysis in Table 4. Table 4 adopts the MiniImageNet dataset with the 5-way 5-shot setting as the evaluation benchmark, and the results demonstrate that SAGP achieves a significant improvement in all sessions. This suggests that establishing long-range dependencies between features is advantageous for enhancing feature quality.
Influence of spatial-aware local perspective
In addition to the global perspective within the spatial scope, the impact of the spatial-aware local perspective (SALP) is also demonstrated in Table 4. As discussed in Section 3.3, SALP assumes that adjacent pixels convey similar semantic information and applies the simple Gaussian filtering operator \(\mathcal {G}(\cdot )\) to smooth local features. The results in Table 4 reveal a significant improvement by SALP over the baseline model on the benchmark dataset. The consistent findings across all sessions underscore the efficacy of the SALP module. Moreover, the last row of Table 4 indicates that the collaborative action of the SAGP and SALP modules achieves superior performance compared to each module in isolation. This observation suggests that the proposed approach effectively enhances feature representation and improves the model's generalization toward incremental novel classes.
Influence of different shots
To explore the effect of the number of support examples in few-shot incremental learning, we evaluate the model under 1-shot and 5-shot settings on the CIFAR-FS dataset, as shown in Fig. 8. The results demonstrate that the 5-shot setting consistently yields better performance than the 1-shot setting across all incremental steps. Increasing the number of support examples significantly enhances classification accuracy, suggesting that richer class representations help mitigate the challenges of incremental learning. Nevertheless, the performance gap between the two settings indicates that the problem of catastrophic forgetting remains substantial in the few-shot regime, even with additional support samples.
Visualization
To qualitatively explain the effectiveness of the model, the visualization of feature maps is demonstrated in Fig. 9. As shown in Fig. 9, the top, middle and bottom rows show the original images, the augmented versions and the feature maps, respectively. It can be observed that despite the appearance distortion caused by data augmentation operations, the model effectively focuses on the main regions of the image. Even when the image scene is complex and contains multiple objects, the model can still accurately locate the target region for the task. The visualization results in Fig. 9, where the model accurately focuses on the informative regions of the feature maps, serve as a validation and interpretation of the effectiveness of the proposed approach.
Conclusion and future work
In this paper, we present a straightforward yet effective approach, termed Spatial-aware Global and Local Perspectives (SGLP), to tackle the few-shot incremental learning problem. The spatial-aware global perspective establishes relationships among spatial features globally, encouraging the model to emphasize the dominant representation. Meanwhile, the spatial-aware local perspective operates under the assumption that current and surrounding image information share similar appearances, and a Gaussian filtering operation within a local scope is employed to refine the spatial features. Comprehensive experiments across various benchmarks demonstrate that the proposed approach yields competitive results, elucidating the efficacy of its distinct components.
For future work, an interesting direction involves exploring few-shot incremental learning through the integration of multiple modalities. Investigating how incorporating information from diverse sources, such as images and text, can enhance the adaptability and robustness of few-shot learning models is a key aspect. Exploring dedicated architectures and fusion methods tailored for multi-modal scenarios is also a valuable avenue for future research, aiming to improve generalization in situations with limited labeled data.
Data availability
The datasets analysed during the current study are available at: https://github.com/icoz69/CEC-CVPR2021?tab=readme-ov-file.
References
Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J. & Feichtenhofer, C. Mvitv2: Improved multiscale vision transformers for classification and detection. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 4804–4814, (2022).
Wang, W., Sun, G. & Van Gool, L. Looking beyond single images for weakly supervised semantic segmentation learning. IEEE Trans. Pattern Anal. Mach. Intell. (2022).
Du, X., Wang, X., Gozum, G. & Li, Y. Unknown-aware object detection: Learning what you don’t know from videos in the wild. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 13678–13688 (2022).
Ravi, S. & Larochelle, H. Optimization as a model for few-shot learning. In: International Conference on Learning Representations (2017).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017).
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H. & Zemel, R. S. Meta-learning for semi-supervised few-shot classification. In: Int. Conf. Learn. Rep. (2018).
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H. & Hospedales, T. M. Learning to compare: Relation network for few-shot learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition pp. 1199–1208 (2018).
Bertinetto, L., Henriques, J. F., Torr, P., Vedaldi, A. Meta-learning with differentiable closed-form solvers. In: International Conference on Learning Representations (2018).
Hao, F., He, F., Cheng, J., Wang, L., Cao, J. & Tao, D. Collect and select: Semantic alignment metric learning for few-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 8460–8469 (2019).
Wu, Z., Li, Y., Guo, L. & Jia, K. Parn: Position-aware relation networks for few-shot learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 6659–6667 (2019).
Zintgraf, L., Shiarli, K., Kurin, V., Hofmann, K. & Whiteson, S. Fast context adaptation via meta-learning. In: International Conference on Machine Learning pp. 7693–7702 (PMLR, 2019).
Chen, D. et al. Self-supervised learning for few-shot image classification. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1745–1749 (IEEE, 2021).
Wu, H., Zheng, Z., Wang, H., Wang, W. & Yang, Z. Few-Shot Incremental Learning with Context-Aware Spatial Enhancement for Image Recognition. IEEE Access (2025).
Hou, R., Chang, H., Ma, B., Shan, S. & Chen, X. Cross attention network for few-shot classification. Adv. Neural Inf. Process. Syst. 4003–4014 (2019).
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B. & Isola, P. Rethinking few-shot image classification: A good embedding is all you need? In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV pp. 266–282 (Springer, 2020).
Zhang, C., Cai, Y., Lin, G. & Shen, C. Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 12203–12213 (2020).
Rizve, M. N., Khan, S., Khan, F. S. & Shah, M. Exploring complementary strengths of invariant and equivariant representations for few-shot learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 10836–10846 (2021).
Wu, H. et al. CLCFE: complementary loss coupling for feature-enhanced few-shot fine-grained visual recognition. Appl. Intell. 55, 742 (2025).
Zheng, Z. et al. SGE: Semantic-guided Generalization Enhancement for few-shot learning. Knowl.-Based Syst. 323, 113761 (2025).
Zheng, Z., Feng, X., Yu, H. & Gao, M. Cooperative density-aware representation learning for few-shot visual recognition. Neurocomputing 471, 208–218 (2022).
Zheng, Z. et al. Iccl: Independent and correlative correspondence learning for few-shot image classification. Knowl.-Based Syst. 266, 110412 (2023).
Zhang, C., Song, N., Lin, G., Zheng, Y., Pan, P. & Xu, Y. Few-shot incremental learning with continually evolved classifiers. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 12455–12464 (2021).
Shi, G., Chen, J., Zhang, W., Zhan, L.-M. & Wu, X.-M. Overcoming catastrophic forgetting in incremental few-shot learning by finding flat minima. Adv. Neural. Inf. Process. Syst. 34, 6747–6761 (2021).
Zhou, D.-W., Wang, F.-Y., Ye, H.-J., Ma, L., Pu, S. & Zhan, D.-C. Forward compatible few-shot class-incremental learning. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 9046–9056 (2022).
Zhou, D.-W., Ye, H.-J., Ma, L., Xie, D., Pu, S. & Zhan, D.-C. Few-shot class-incremental learning by sampling multi-phase tasks. IEEE Trans. Pattern Anal. Mach. Intell. (2022).
Xu, X. et al. Multi-feature space similarity supplement for few-shot class incremental learning. Knowl.-Based Syst. 265, 110394 (2023).
Ji, Z., Hou, Z., Liu, X., Pang, Y. & Li, X. Memorizing complementation network for few-shot class-incremental learning. IEEE Trans. Image Process. 32, 937–948 (2023).
Wah, C., Branson, S., Welinder, P., Perona, P. & Belongie, S. The caltech-ucsd birds-200-2011 dataset (2011).
Wu, H., Zhao, Y. & Li, J. Selective, structural, subtle: Trilinear spatial-awareness for few-shot fine-grained visual recognition. In: 2021 IEEE International Conference on Multimedia and Expo (ICME) pp. 1–6 (IEEE, 2021).
Zheng, Z., Feng, X., Yu, H., Li, X. & Gao, M. Unsupervised few-shot image classification via one-vs-all contrastive learning. Appl. Intell. 53(7), 7833–7847 (2023).
Wu, H., Zhao, Y. & Li, J. Invariant and consistent: Unsupervised representation learning for few-shot visual recognition. Neurocomputing 520, 1–14 (2023).
Tao, X., Hong, X., Chang, X., Dong, S., Wei, X. & Gong Y. Few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 12183–12192 (2020).
Chen, K. & Lee, C.-G. Incremental few-shot learning via vector quantization in deep embedded space. In: Int. Conf. Learn. Rep. (2020).
Dong, S. et al. Few-shot class-incremental learning via relation knowledge distillation. Proc. AAAI Conf. Artif. Intell. 35, 1255–1263 (2021).
Mazumder, P., Singh, P. & Rai, P. Few-shot lifelong learning. Proc. AAAI Conf. Artif. Intell. 35, 2337–2345 (2021).
Cheraghian, A., Rahman, S., Fang, P., Roy, S. K., Petersson, L. & Harandi, M. Semantic-aware knowledge distillation for few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 2534–2543 (2021).
Cheraghian, A., Rahman, S., Ramasinghe, S., Fang, P., Simon, C., Petersson, L. & Harandi, M. Synthesized feature based few-shot class-incremental learning on a mixture of subspaces. In: Proc. IEEE/CVF international conference on computer vision pp. 8661–8670 (2021).
Chi, Z., Gu, L., Liu, H., Wang, Y., Yu, Y. & Tang, J. Metafscil: A meta-learning approach for few-shot class incremental learning. In: Proc. IEEE/CVF conference on computer vision and pattern recognition pp. 14166–14175 (2022).
Zhu, K., Cao, Y., Zhai, W., Cheng, J. & Zha, Z.-J. Self-promoted prototype refinement for few-shot class-incremental learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 6801–6810 (2021).
Wu, H., Fu, K., Zhao, Y., Song, H. & Li, J. Joint self-supervised and reference-guided learning for depth inpainting. Comput. Vis. Media 8(4), 597–612 (2022).
Niu, S.-Z., Wu, H., Yu, Z.-F., Zheng, Z.-J. & Yu, G.-H. Total generalized variation minimization based on projection data for low-dose CT reconstruction. J. South. Med. Univ. 37(12), 1585–1591 (2017).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In: Proc. IEEE conference on computer vision and pattern recognition pp. 7132–7141 (2018).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In: Proc. European conference on computer vision (ECCV) pp. 3–19 (2018).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In: Proc. IEEE conference on computer vision and pattern recognition pp. 7794–7803 (2018).
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y. & Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In: Proc. IEEE/CVF international conference on computer vision pp. 603–612 (2019).
Shen, L., Tao, H., Ni, Y., Wang, Y. & Stojanovic, V. Improved yolov3 model with feature map cropping for multi-scale road object detection. Meas. Sci. Technol. 34(4), 045406 (2023).
Wang, Y. et al. Arrhythmia classification algorithm based on multi-head self-attention mechanism. Biomed. Signal Process. Control 79, 104206 (2023).
Zhou, B., Li, Y. & Wan, J. Regional attention with architecture-rebuilt 3d network for rgb-d gesture recognition. Proc. AAAI Conf. Artif. Intell. 35, 3563–3571 (2021).
Tan, C., Gao, Z., Wu, L., Xu, Y., Xia, J., Li, S. & Li, S. Z. Temporal attention unit: Towards efficient spatiotemporal predictive learning. In: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 18770–18782 (2023).
Yuan, Y., Ding, J., Feng, J., Jin, D. & Li, Y. Unist: A prompt-empowered universal model for urban spatio-temporal prediction. In: Proc. 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining pp. 4095–4106 (2024).
Cai, Y.-J., Cai, H.-C., Zhang, C.-Y., Chen, C. P. & Tang, Q.-X. Graphtan: Temporal attention network for learning graph-level embedding. IEEE Trans. Comput. Soc. Syst. (2025).
Xiang, J. et al. Synchronization-based graph spatio-temporal attention network for seizure prediction. Sci. Rep. 15(1), 4080 (2025).
Snell, J., Swersky, K. & Zemel, R. Prototypical networks for few-shot learning. Adv. Neural Inf. Process. Syst. pp. 4077–4087 (2017).
Li, W., Wang, L., Xu, J., Huo, J., Gao, Y. & Luo, J. Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition pp. 7260–7268 (2019).
Huang, S., Yang, W., Wang, L., Zhou, L. & Yang, M. Few-shot unsupervised domain adaptation with image-to-class sparse similarity encoding. In: Proc. 29th ACM International Conference on Multimedia pp. 677–685 (2021).
Li, J., Wang, X., Tu, Z. & Lyu, M. R. On the diversity of multi-head attention. Neurocomputing 454, 14–24 (2021).
Han, K. et al. Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021).
He, L., Bai, L., Yang, X., Du, H. & Liang, J. High-order graph attention network. Inf. Sci. 630, 222–234 (2023).
Velickovic, P. et al. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
Russakovsky, O. et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015).
Cai, Q., Pan, Y., Yao, T., Yan, C. & Mei, T. Memory matching networks for one-shot image recognition. In: Proc. IEEE conference on computer vision and pattern recognition pp. 4080–4088 (2018).
Liu, H. et al. Few-shot class-incremental learning via entropy-regularized data-free replay. In: European Conference on Computer Vision pp. 146–162 (Springer, 2022).
Zheng, Z., Feng, X., Yu, H., Li, X. & Gao, M. Bdla: Bi-directional local alignment for few-shot learning. Appl. Intell. 53(1), 769–785 (2023).
Bao, Y. et al. E2cl: An efficient and effective classification learning for pneumonia detection in chest x-rays. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 35–40 (IEEE, 2024).
Zheng, Z. et al. Cross-domain few-shot chest x-ray recognition. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 224–229 (IEEE, 2024).
Wu, H. et al. Dara: Distribution-aware representation alignment for semi-supervised domain adaptation in image classification. J. Supercomput. 81(2), 1–37 (2025).
Wu, H. et al. Vlce: Unified vision-language collaborative enhancement for facial expression recognition. In: 2024 5th International Conference on Computers and Artificial Intelligence Technology (CAIT) pp. 94–99 (IEEE, 2024).
Zhang, C., Hu, C., Xie, J., Wu, H. & Zhang, J. Wcal: Weighted and center-aware adaptation learning for partial domain adaptation. Eng. Appl. Artif. Intell. 130, 107740 (2024).
Zheng, Z. et al. MERGE: multimodal-enhanced representation and guided ensemble for pneumonia recognition in chest X-ray images. J. Supercomput. 81, 1–25 (2025).
Rebuffi, S.-A., Kolesnikov, A., Sperl, G. & Lampert, C. H. icarl: Incremental classifier and representation learning. In: Proc. IEEE conference on Computer Vision and Pattern Recognition pp. 2001–2010 (2017).
Zou, Y., Zhang, S., Li, Y. & Li, R. Margin-based few-shot class-incremental learning with class-level overfitting mitigation. Adv. Neural. Inf. Process. Syst. 35, 27267–27279 (2022).
Li, Y., Zhu, H., Ma, J., Xiang, C. & Vadakkepat, P. Incremental few-shot learning via implanting and consolidating. Neurocomputing 559, 126800 (2023).
Castro, F. M., Marín-Jiménez, M. J., Guil, N., Schmid, C. & Alahari, K. End-to-end incremental learning. In: Proc. European conference on computer vision (ECCV) pp. 233–248 (2018).
Zhao, H., Fu, Y., Kang, M., Tian, Q., Wu, F. & Li, X. Mgsvf: Multi-grained slow vs. fast framework for few-shot class-incremental learning. IEEE Trans. Pattern Anal. Mach. Intell. (2021).
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant No. 12071104 and Grant No. 62261002), the Natural Science Foundation of Zhejiang Province (Grant No. LD19A010002 and Grant No. LY21F010001), the Jiangxi Double Thousand Plan (Grant No. jxsq2019201061), the Science and Technology Program of Jiangxi Province (Grant No. 20192BCB23019 and Grant No. 20202BBE53024), the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. 230056), and the Zhejiang Provincial Natural Science Foundation of China (Grant No. LQN25F030002).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wu, H., Zheng, Z., Lv, L. et al. A spatially aware global and local perspective approach for few-shot incremental learning. Sci Rep 15, 21903 (2025). https://doi.org/10.1038/s41598-025-08323-5