Abstract
Zero-shot learning methods recognize objects of unseen categories. By transferring knowledge from the seen classes to describe the unseen classes, deep learning models can recognize unseen categories. However, relying solely on a small labeled data set of seen classes and limited semantic relationships leads to a significant domain shift, which hinders classification performance. To tackle this problem, we propose a transductive zero-shot learning method based on a knowledge graph and a graph convolutional network. We first build a knowledge graph, where each node represents a category encoded by its semantic embedding. With a shallow graph convolutional network, i.e., one with a small number of layers, we learn the classifier for each category, supervised by the visual classifiers of the seen categories. During testing, a clustering strategy, the Double Filtering Module with the Hungarian algorithm, is applied to the unseen samples, and the learned classifiers are then used to predict their categories. In the transductive setting, the unseen categories with higher classification accuracy are assigned pseudo annotations and can be associated with the seen categories to progressively update the model parameters. We validate the proposed model on three data sets, where it outperforms other state-of-the-art methods, achieving 47.36% accuracy on AWA2, 30.69% on ImageNet50, and 18.87% on ImageNet100, a 4–10% improvement over existing methods.
Introduction
The past few decades have witnessed the rapid development of machine learning and deep neural networks, which have achieved great performance in natural language processing and computer vision tasks, such as object recognition1, detection2,3, and segmentation4,5. Such success undoubtedly comes from the continuous improvement of processing algorithms and computing performance, but also from the availability of large, well-annotated training data sets. Manual annotation is tedious and time-consuming. With the ever-growing quantity of accessible multimedia data and an ever-expanding number of object categories, annotating all objects is unrealistic. Moreover, in some applications, unexpected object classes may occur. For example, on a construction site where robots are deployed, the robots need to recognize different objects, which may be useful objects or obstacles, and may have been seen or unseen previously. To recognize an object of an unseen category, traditional supervised learning methods require adding a large number of samples of this category and retraining the entire model. To solve this problem, a new object recognition approach, dedicated to unseen categories, is required, which can be supported by the semantic relationships existing between seen and unseen categories. This task is called Zero-Shot Learning (ZSL)6,7,8, which only requires annotated images of seen categories in the source domain, and can recognize images of unseen categories in the target domain.
The mainstream technique to recognize unfamiliar or unseen categories is to transfer knowledge from seen categories to the unseen categories. Seen and unseen categories usually share a common high-dimensional vector space, i.e., the semantic space. The most commonly used semantic space is based on semantic attributes, in which each category can be represented by an attribute vector. The seen and unseen categories are also related in the visual feature space. Traditional zero-shot learning methods associate the semantic space and the visual space by learning a compatible projection function to project from one space to the other, or two compatible projection functions to project both spaces into a common latent embedding space. However, these methods hardly retain the structural information during the semantic embedding process.
Fig. 1: Comparison between the inductive and transductive settings applied to knowledge graph-based zero-shot learning. In the inductive setting, only the seen categories are used during training. In the transductive setting, not only the seen categories, but also some of the unseen categories, can be employed during training.
Another paradigm of zero-shot learning is to exploit explicit knowledge graphs. In this paradigm, the relationships between objects are provided by the knowledge graph, where each node represents a category, and the nodes are connected directly to their descendants/ancestors. Using this information, the classifiers for all categories, including the unseen categories, can be learned. As a typical example of knowledge graph-based deep neural networks, deep graph convolutional networks (GCNs)9 are trained to form a classifier for each category by regressing the weight vectors of the visual classifiers, which are learned from the seen categories using a deep convolutional neural network (CNN). Although GCNs can rely on the idea of message passing to transfer knowledge from a node to its neighbors, they follow an inductive way of learning, which only employs annotated data from the seen categories to train the models, as shown in Fig. 1, often leading to the projection domain shift problem10. In Fig. 1, the green and red nodes in the graph represent the seen and unseen categories, respectively. The light stars indicate samples that are available during training, while the dark stars indicate that no samples are available.
An effective approach to dealing with the domain shift problem is transductive zero-shot learning10, which employs unlabeled data belonging to unseen categories in the training phase. Transductive zero-shot learning assumes that the visual features of all unseen images, together with the semantic information of their categories in the form of attributes, are provided in advance, as shown in Fig. 1. In this way, zero-shot learning can fully leverage the existing information on these unseen categories.
Therefore, in this paper, we propose a transductive strategy based on a knowledge graph and graph convolutional networks (TGCNZ) to deal with zero-shot learning tasks. Firstly, we build a knowledge graph, in which each node corresponds to an object category and is represented by its semantic embedding, and the edge between two nodes represents their relationship. Then, we train a one-layer GCN to transfer knowledge between different categories and output the classifiers of both the seen and unseen classes. We also propose a Double Filtering Module, based on the Hungarian algorithm, which optimizes clustering and classification simultaneously, to sort out the samples of unseen categories and assign them pseudo labels. Following the transductive setting, these samples can be used to train the visual classifiers of the unseen categories. After a small number of iterations, the performance of the model can be effectively improved. To validate the effectiveness of the proposed method, we conducted experiments on three data sets, including a traditional zero-shot learning data set and two simplified subsets of ImageNet. Compared with other knowledge graph-based methods, the proposed method demonstrates superior performance on these data sets.
In summary, the contributions of this paper are as follows. (1) We propose a transductive strategy, based on a knowledge graph and graph convolutional networks, to deal with zero-shot learning tasks. Unlike previous works, which heavily rely on the structure of a knowledge graph, our proposed method can effectively alleviate the domain shift problem. (2) A Double Filtering Module, based on the Hungarian algorithm, is proposed to sort out the unseen samples, which are assigned pseudo labels. (3) The proposed method outperforms state-of-the-art knowledge graph-based zero-shot learning methods.
The rest of this paper is organized as follows: Section “Related work” reviews related works on zero-shot learning, knowledge graphs, GCNs, and transductive zero-shot learning. Section “The proposed approach” details our methodology. Section “Experiments” presents experiments on three representative data sets and comparisons to state-of-the-art methods. Section “Conclusion” concludes the paper.
Related work
In this section, we first briefly discuss zero-shot learning methods, which can be roughly divided into two categories. Then, we introduce the utilization of knowledge graphs and graph convolutional networks in zero-shot learning. Finally, the transductive zero-shot learning methods are discussed.
Zero shot learning
Zero-shot learning11,12 aims to transfer the knowledge obtained from the seen categories to the unseen categories, so that new instances can be recognized, where the test unseen categories are disjoint from the training categories13,14. It only requires annotated images of the seen classes in the source domain, and the semantic correlation with the unseen classes in the target domain. The target and source domains often share a common semantic space, where the relationship between the seen classes and the unseen classes is defined. Most of the popular zero-shot learning methods can be divided into two main categories.
The first category is based on semantic embeddings, representing each class with a learnt vector representation. The semantic embedding can then be projected to the visual feature space15. Frome et al.16 proposed a framework, named DeViSE, to train the projection function, from image to word embedding, using a convolutional neural network with a transformation layer. Instead of obtaining the word embedding directly through convolutional neural networks, Norouzi et al.17 proposed another method, named ConSE, which embeds an image by combining a convolutional neural network employed for image classification with a word embedding model. Kodirov et al.18 proposed a different method, which learns the projection function based on the encoder-decoder paradigm. Different from DeViSE and ConSE, which are based on regression with deep neural networks, they applied linear projection mappings in both the encoder and the decoder.
There are also some other zero-shot learning methods19,20. Li et al.21 proposed a novel convolutional variational autoencoder-generative adversarial network (CVAE-GAN), which was designed with a discriminator, and a generator consisting of two encoders and one decoder. A novel transformer-based method22 was used for noise suppression during pseudo label assignment.
Knowledge graph and graph convolutional networks
The second category of zero-shot learning methods is based on distilling knowledge from a knowledge graph, developed for object recognition23. Several methods for object recognition, based on knowledge graphs, have been proposed24,25,26. The Graph Convolutional Network27 was first designed in28 to perform semi-supervised object classification. Salakhutdinov et al.29 proposed a method which shares the category representation between the different object classifiers using WordNet30. These representations share the statistical strength of outputs from related objects for those objects with few training examples. The constructed knowledge graph can be used to model the relationships between different categories. Wang et al.31 proposed an approach to distill information of categories via both semantic embeddings and knowledge graphs. They exploited the word embedding of each seen category, and their relationships with each other, to learn a visual classifier for the unseen categories without any training examples. In their work, a deeper Graph Convolutional Network, with six layers, is trained to form classifiers for all categories by regressing real-valued weight vectors. However, recent work32 has indicated that GCNs suffer from the Laplacian smoothing problem: smoothing can ease classification for shallow GCNs, but, as the network depth increases, the hidden representations may be over-smoothed, fusing the features together and finally degrading the classification performance. In this regard, Kampffmeyer et al.33 proposed a dense connectivity scheme with a hierarchical structure, in which each node can only influence its immediate ancestors and descendants. They also designed a weighting scheme that shares distance-based weights across layers to increase the flexibility of the model. Inspired by this design, we employ a one-layer GCN, rather than deeper GCNs, in our proposed framework.
Transductive zero shot learning
The transductive learning strategy was proposed to deal with the domain shift problem in zero-shot learning. Previous work in various areas, such as object segmentation34,35, few-shot learning36, and hyperspectral image classification37, took advantage of transductive learning to transfer knowledge from the test data in the training phase38. Transductive learning assumes that the semantic information of unseen categories and all unseen images are available in advance. Different methods, such as domain adaptation39 and label propagation40, were further proposed to make better use of the additional information from unseen categories. Recently, Zhang et al.41 indicated that the visual features of images from unseen categories can be divided into different clusters, even though their labels are unknown. Wan et al.42 proposed three different types of visual structure constraints for learning the projection function in transductive zero-shot learning. In the work of Guo et al.43, in addition to learning the projection between semantic embeddings and visual features, samples in the training data whose attributes are closest to those of the unseen categories are selected and assigned pseudo labels of these unseen categories, to train the corresponding category classifiers. In this way, the projection function can be further optimized. Rohrbach et al.44 established a K-nearest-neighbors graph for the test data in the semantic embedding space, and applied a label transferring algorithm to the graph to predict labels. Zhang et al.44 modeled the test data using a Gaussian distribution model, and iteratively classified the test data of the same cluster into the same unknown class. Yu et al.45 sorted out unseen data based on reliability, assigned them pseudo labels similar to43, and gradually input reliable samples during training to iteratively refine the model. Moreover, a fast training strategy, which represents each seen category with its visual patterns, was proposed. Different from these previous approaches, we adopt the unseen classifiers learned by Graph Convolutional Networks to filter the unseen samples, which are assigned pseudo unseen labels in the transductive setting.
The proposed approach
In this paper, we propose an efficient method for zero-shot learning, based on distilling information of categories from semantic embeddings and knowledge graph representations, enhanced with a transductive strategy based on a Double Filtering Module, to tackle the domain shift problem. In the following, we first present the Graph Convolutional Networks used to train the classifiers for both the seen and unseen categories. Then, we introduce the Double Filtering Module and the transductive zero-shot learning strategy (TGCNZ) proposed for zero-shot learning.
Graph convolutional networks for zero-shot learning
We first formalize the problem of zero-shot learning and introduce how Graph Convolutional Networks can be used for this task. In a zero-shot learning setting, denote by \(\textbf{C}\) the set of all classes, comprising the set of training classes \(\textbf{C}_{tr}\) and the set of test classes \(\textbf{C}_{te}\). Although the training classes and the test classes are disjoint, i.e., \(\textbf{C}_{tr}{\cap \textbf{C}}_{te}=\emptyset\), they are correlated in a common semantic space, where an \(\textbf{S}\)-dimensional semantic representation vector is given for each class. A set of training data from the seen categories, \(\textbf{D}_{tr}=\{(X_i^s,\ c_i^s),\ i=1,...,N_s\}\), is also provided, where \(X_i^s\) denotes the i-th training image and \(c_i^s\in \textbf{C}_{tr}\) denotes the corresponding class label of this image. The goal of zero-shot learning is to predict the label \(c_i^u\in \textbf{C}_{te}\) of the unseen data in the test set \(\textbf{D}_{te}=\{X_i^u,\ i=1,\ldots ,N_u\}\).
The Graph Convolutional Network can be represented by a function \(F(\cdot )\), whose input and output are the word embedding of an entity X and the SoftMax classification results of the entity, denoted as F(X), respectively. The output of the i-th entity is a C-dimensional SoftMax probability vector, denoted as \(F_i(X)\). The weights of \(F(\cdot )\) are trained by backpropagation with the SoftMax loss. In a GCN, the convolutional operations rely on the adjacency graph, which defines the connections between neighboring nodes. The convolutional operation of each layer is represented as follows:

\(\hat{\textbf{A}}\textbf{X}^\prime \textbf{W},\)

where \(\hat{\textbf{A}}\) is an \(n\times n\) normalized matrix of the binary adjacency graph \(\textbf{A}\) of the layer, and n is the number of nodes in the graph. \(\textbf{X}^\prime\) is the input feature matrix from the previous layer, with dimensions \(n\times k\). \(\textbf{W}\) is the \(k\times c\) weight matrix of the layer, where c denotes the number of output channels. Each convolutional layer is followed by a non-linear activation function. The propagation rule for the \((l+1)\)-th layer \(\textbf{H}^{(l+1)}\) is defined as:

\(\textbf{H}^{(l+1)}=\sigma \left( \hat{\textbf{D}}^{-\frac{1}{2}}\hat{\textbf{A}}\hat{\textbf{D}}^{-\frac{1}{2}}\textbf{H}^{(l)}\textbf{W}^{(l)}\right) ,\)

where \(\hat{\textbf{D}}\) is the diagonal degree matrix of \(\hat{\textbf{A}}\), with entries \(\hat{\textbf{D}}_{ii}=\sum _{j}\hat{\textbf{A}}_{ij}\), and \(\sigma (\cdot )\) denotes the ReLU activation function, introducing non-linearity to the model. In the last convolutional layer, the number of output channels c equals the number of label classes C.
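To make the propagation rule concrete, the following is a minimal PyTorch sketch of one such graph-convolutional layer. It is an illustration only, assuming a dense binary adjacency matrix to which self-loops are added before normalization; a layer used for regressing classifier weights would simply omit the ReLU.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolutional layer: H^(l+1) = ReLU(D^-1/2 A_hat D^-1/2 H^(l) W^(l))."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: dense n x n binary adjacency matrix of the knowledge graph.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)                  # entries of D^-1/2
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return torch.relu(a_norm @ h @ self.weight)              # sigma(A_norm H W)
```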
Fig. 2: The structure of the proposed method. The green dotted box is the Graph Convolutional Networks used for zero-shot learning. Images from the seen classes are input to a pre-trained CNN to train visual classifiers for the seen categories. The green and red cylinders in the knowledge graph represent the seen and unseen categories, respectively. The grey dotted box is the Double Filtering Module, which includes a classification branch, a clustering branch, and a label projection unit. The yellow line denotes the transductive procedure, where the unseen samples with pseudo labels can be used in the training process.
In our zero-shot learning setting, the input of our framework, as shown in Fig. 2, is the set of both seen and unseen classes, together with their semantic-embedding vectors, \(\textbf{X}=\left\{ x_i\right\} _{i=1}^n\). The purpose of our model is to predict the visual classifiers for all the seen and unseen categories, \(\textbf{W}=\left\{ w_i\right\} _{i=1}^n\). Each visual classifier is trained as a logistic regression model on the deep features extracted by a fixed pre-trained Convolutional Neural Network, so the classifier \(w_i\) of category i is a weight vector of the same dimension, D, as the visual features. In our setting, the samples \(\textbf{D}_{tr}\) from the seen categories are used to estimate the weight vectors of their corresponding categories. For the unseen categories, we estimate their corresponding weight vectors using their semantic embedding vectors as input.
The knowledge graph encodes explicit relationships among all the categories, which are exploited for learning the visual classifiers of novel classes. Each node in the knowledge graph represents a semantic category. Let n be the number of nodes in the graph, including \(N_s\) nodes belonging to the seen classes \(\textbf{C}_{tr}\) and \(N_u\) nodes from the unseen classes \(\textbf{C}_{te}\). If two nodes have a semantic relationship, they are linked to each other. The relationships between all the nodes are represented by an \(n\times n\) binary adjacency matrix, \(\textbf{A}\).
As shown in the green dotted box in Fig. 2, images from the seen categories are input to the Convolutional Neural Networks to train their corresponding classifiers. The weights in the last layer for these seen categories, shown as the green cylinders, are interpreted as the supervised information for the Graph Convolutional Network. Our task is to predict the weights for all the unseen classes, shown as the red cylinders. Therefore, we take the knowledge graph, with the word embedding vector of each node, as the input of the Graph Convolutional Network to predict the classifiers, i.e., the weights of the last fully connected layer, for all of the seen and unseen categories. The weights of the learned classifiers \(\textbf{W}_{1...N_s}\) of the seen categories from the Convolutional Neural Networks are set as the ground truth for the GCN. The corresponding weights of the predicted classifiers from the GCN are denoted as \(\hat{\textbf{W}}_{1...N_u}\). The mean squared error (MSE) metric is used as the loss function to train the model.
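The following sketch illustrates this supervision under assumed, illustrative names: a `gcn` module as in the earlier layer sketch, the word-embedding matrix `x`, the adjacency matrix `adj`, and the CNN classifier weights `w_seen` of the seen-class nodes indexed by `seen_idx`. Only the seen rows of the GCN output enter the MSE loss; gradients nevertheless flow through the whole graph, so the unseen classifiers are shaped by their graph neighbors.

```python
import torch
import torch.nn.functional as F

def train_step(gcn, optimizer, x, adj, w_seen, seen_idx):
    """One training step. `gcn` maps the word embeddings `x` over the graph
    `adj` to predicted classifier weights for all n category nodes; `w_seen`
    holds the ground-truth CNN classifier weights of the seen categories."""
    optimizer.zero_grad()
    w_hat = gcn(x, adj)                         # predicted classifiers for all nodes
    loss = F.mse_loss(w_hat[seen_idx], w_seen)  # supervise only the seen nodes
    loss.backward()
    optimizer.step()
    return loss.item()
```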
In fact, although the joint use of graph convolutional networks and the knowledge graph is considered an advanced method to tackle the domain shift problem in zero-shot learning, the amount of unseen, non-annotated samples remains large. Transductive learning exploits these unlabelled images from the unseen categories to improve the zero-shot learning performance. Therefore, in this paper, we propose a novel method, which combines the graph convolutional networks and a knowledge graph with a transductive strategy.
Transductive zero-shot learning strategy
In the transductive setting, unlabelled unseen samples are available during the training process. Although these samples have neither labels nor corresponding semantic information, they contain rich visual features, which can be extracted by the pre-trained CNNs. Owing to the discriminative capacity of CNN features, these visual features can be easily divided into different clusters. Therefore, most previous transductive learning-based methods focused on the alignment of these clusters with the unseen categories. In our method, the transductive process becomes simpler, because we can obtain the trained classifiers for both the seen and unseen categories. These classifiers can filter the samples from the unseen categories and assign them corresponding pseudo labels. The samples with pseudo labels can then be used to train the classifiers for the related unseen categories through backpropagation. To achieve this, we propose a Double Filtering Module.
Double filtering module
As shown in the grey dotted box in Fig. 2, the Double Filtering Module consists of a classification filter, a clustering filter, and a label projection unit, based on the Hungarian algorithm. We cluster the unseen samples in the test data set into \(N_u\) clusters. In the meantime, we use the classifiers of unseen categories to classify these unseen samples and obtain the SoftMax probability results.
We use a pre-trained CNN to extract the features \(f(x_i^{u})\) of the images from the unseen categories, \(x_i^{u},\ i\in \left\{ 1,\ ...,n_{u}\right\}\). Then, based on the classifiers, we can predict the class label of each image, taking the class with the highest predicted probability as its pseudo label. Although, during testing, we cannot calculate the accuracy of these classification results, we can still obtain the class labels together with their probability values \(P_{x_i^{u}}\), which serve as one of the most important criteria for the Double Filtering Module.
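A minimal sketch of this classification branch follows, assuming the features and the GCN-predicted unseen classifiers are given as tensors; the bias term of each classifier is omitted for brevity.

```python
import torch

def classify_unseen(features: torch.Tensor, w_unseen: torch.Tensor):
    """features: (n_u, D) visual features of the unseen images from the fixed
    pre-trained CNN; w_unseen: (N_u, D) classifier weights predicted by the GCN."""
    logits = features @ w_unseen.t()     # (n_u, N_u) class scores
    probs = torch.softmax(logits, dim=1)
    conf, pseudo = probs.max(dim=1)      # P_{x_i^u} and the pseudo label per sample
    return pseudo, conf
```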
Another main branch in the Double Filtering Module performs clustering. The visual features of the unseen images can be separated into different clusters. Clustering is an unsupervised machine learning technique that has found practical application in complex computer vision tasks. Recently, Van Gansbeke et al.46 proposed SCAN (Semantic Clustering by Adopting Nearest neighbors), a clustering algorithm that extracts semantically meaningful nearest neighbors of images using a self-supervised method, and uses these nearest neighbors, jointly with the images, to train the clustering process. In our framework, we adopt the SCAN method to separate the unseen samples into \(N_{u}\) clusters, where each cluster is denoted as \(G_i\).
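The interface of the clustering branch can be sketched as follows. Since SCAN requires its own self-supervised training pipeline, k-means is used here purely as an illustrative stand-in that maps features to \(N_u\) cluster ids.

```python
from sklearn.cluster import KMeans

def cluster_unseen(features_np, n_unseen: int, seed: int = 0):
    """Cluster the unseen-image features into N_u groups (features in,
    one cluster id G_i per sample out); k-means stands in for SCAN here."""
    km = KMeans(n_clusters=n_unseen, n_init=10, random_state=seed)
    return km.fit_predict(features_np)
```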
The ideal clustering case is that the overwhelming majority of the images in a cluster belong to the same category. However, limited by the imperfections of both the classifiers and the clustering algorithm, experiments have shown that each cluster cannot simply be assigned directly to a certain category. For example, as shown in Fig. 3, the unseen samples, represented as circles, are divided into five clusters, \(G_1,G_2,G_3,G_4\), and \(G_5\), which contain different numbers of samples. Each circle represents a sample from an unseen class, where its color represents the pseudo category it belongs to, and its size represents the SoftMax probability value. Obviously, it is not easy to accurately label each cluster. In this regard, we propose a label projection unit based on the Hungarian algorithm47.
The Hungarian algorithm48 is a combinatorial optimization algorithm that solves the assignment problem in polynomial time. The goal is to find an optimal assignment matrix \(\textbf{P} \in \{0,1\}^{N \times M}\), where N is the number of unseen-class samples, M represents the number of unseen categories, and \(P_{ij}=1\) indicates that sample \(x_i^{u}\) is assigned to category \(c_j\). The assignment minimizes the total matching cost, \(\sum _{i=1}^{N}\sum _{j=1}^{M}P_{ij}C_{ij}\), where \(C_{ij}\) denotes the cost of assigning sample \(x_i^{u}\) to category \(c_j\), subject to the constraint that each sample is assigned to exactly one category, i.e., \(\sum _{j=1}^{M}P_{ij}=1\) for all i.
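In practice, the matching can also be applied at the cluster level, pairing the \(N_u\) clusters one-to-one with the \(N_u\) pseudo categories. The following sketch uses SciPy's Hungarian solver, with the assumed cost of pairing a cluster and a category being the negative number of cluster members carrying that pseudo label:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters(cluster_ids, pseudo_labels, n_unseen: int):
    """cluster_ids, pseudo_labels: NumPy arrays with one entry per unseen sample.
    Minimizing the total cost maximizes the cluster-category agreement."""
    cost = np.zeros((n_unseen, n_unseen))
    for g, c in zip(cluster_ids, pseudo_labels):
        cost[g, c] -= 1.0
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return dict(zip(rows, cols))              # cluster id -> category id
```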
Based on the maximum matching mechanism of the Hungarian algorithm, we project the classification results \(C_i^{pseudo}\) of the unseen samples and the clusters \(G_i\) obtained from clustering to a common space, and then perform the Hungarian matching process and sample filtering in this space. As shown in Step 2 in Fig. 3, the proportion of the most numerous category \(C_{G_i}\) in each cluster \(G_i\) is denoted as \(P_{G_i}\). In Step 3, we obtain the projection between the clusters and the pseudo categories through the Hungarian matching.
Then, in Step 4, we first filter the clusters with the Top-k \(P_{G_i}\). Secondly, a threshold \(\lambda\) is set to filter the samples in each retained cluster \(G_i\), based on their probability values. As shown in Fig. 3, \(G_1\) and \(G_3\) are selected, while the samples with a low classification probability, or not belonging to their assigned categories, are removed. Finally, we obtain the pseudo labels of the images belonging to the unseen categories, \((x_i^{pseudo},C_i^{pseudo})\). The procedure of the Double Filtering Module is shown in Algorithm 1.
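Putting the two branches together, a sketch of the filtering step is given below. It is an illustration under the assumption that \(P_{G_i}\) is measured against the matched category, which coincides with the most numerous one whenever the matching is consistent; `top_k` and `lam` play the roles of k and \(\lambda\).

```python
import numpy as np

def double_filter(pseudo_labels, conf, cluster_ids, mapping, top_k: int, lam: float):
    """Keep samples from the top-k clusters (ranked by the proportion P_{G_i})
    whose pseudo label agrees with the cluster's matched category and whose
    SoftMax confidence exceeds the threshold. Inputs are NumPy arrays, plus
    the cluster -> category dict returned by match_clusters."""
    purity = {}
    for g, c in mapping.items():
        members = np.where(cluster_ids == g)[0]
        purity[g] = float((pseudo_labels[members] == c).mean()) if len(members) else 0.0
    kept = sorted(purity, key=purity.get, reverse=True)[:top_k]
    keep = [i for i in range(len(pseudo_labels))
            if cluster_ids[i] in kept
            and pseudo_labels[i] == mapping[cluster_ids[i]]
            and conf[i] >= lam]
    return np.asarray(keep)   # indices of samples granted pseudo labels
```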
The filtered images \((x_i^{pseudo},C_i^{pseudo})\) are used in the training process to train the visual classifiers of the unseen categories \(C_i^{pseudo}\). The additional information provided by these filtered samples can further alleviate the domain shift problem. As shown in Fig. 2, \(\textbf{W}_u\), the weights of the pseudo ground-truth classifiers of the filtered unseen categories, can be trained through backpropagation.
Experiments
In this section, we conduct experiments to evaluate the performance of the TGCNZ model, and compare it with state-of-the-art knowledge graph-based zero-shot learning methods on three representative data sets. In our experiments, we utilize the same constructed knowledge graph on the three data sets to make a fair comparison.
Data sets
AWA2: The Animals with Attributes 2 (AWA2) data set49 was established on the original AWA data set50, and is a traditional zero-shot learning data set. It consists of 37,322 images from 50 animal categories, each described by 85 attribute features. In our experiments, 40 classes are set as seen classes for training, and 10 classes are set as unseen classes for testing.
ImageNet50 and ImageNet100: In previous works, the large-scale ImageNet data set51 is commonly used for zero-shot learning, with test splits denoted as “2-hops”, “3-hops”, and “All”16. However, in the transductive setting, the model is retrained after obtaining the images with pseudo labels from the unseen categories, which means that training on a larger data set requires more training time. Therefore, in our experiments, we consider the smaller subsets, ImageNet50 and ImageNet100, which have 50 and 100 classes, respectively, as proposed in46. Table 1 shows an overview of the three data sets used in our experiments.
Setup
We utilize the ResNet-5052 convolutional model, without the pre-training step, to train the classifiers for the seen categories. The number of weights in the last layer of ResNet-50 is 2049 (2048 weights and 1 bias). To obtain the semantic embedding of each category as the input of the Graph Convolutional Networks, we utilize the GloVe text model53, trained on the Wikipedia data set. For the knowledge graph, we utilize the sub-graph of WordNet30, which has about 30K nodes. A larger knowledge graph means that more semantic information can be shared between different categories and more accurate results can be achieved, but more computing resources are needed. However, the purpose of this paper is not to compete with other methods for higher recognition results by using a larger knowledge graph. Instead, our aim is to combine transductive learning with knowledge graph-based zero-shot learning, so that we can make full use of the unlabelled images and achieve promising performance with a knowledge graph of a certain size. Therefore, for each data set, we extract the smallest connected sub-graph from WordNet, in order to make sure that each compared method captures the same prior knowledge from the knowledge graph. We first use a deeper Graph Convolutional Network, whose numbers of output channels are as follows: 2048 \(\rightarrow\) 2048 \(\rightarrow\) 1024 \(\rightarrow\) 1024 \(\rightarrow\) 512. Then, we compare the effectiveness of different numbers of layers in our experiments. We utilize the Adam optimizer54 to train the GCN for 3000 epochs in each experiment, with a learning rate of 0.001 and a weight decay of 0.0005. The proposed TGCNZ model is implemented in PyTorch55 on a GTX 2080Ti GPU.
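For reference, the reported optimization setup corresponds to the following sketch, reusing the illustrative `gcn`, `x`, `adj`, `w_seen`, `seen_idx`, and `train_step` objects from the earlier training example:

```python
import torch

# Reported setup: Adam, learning rate 0.001, weight decay 0.0005, 3000 epochs.
optimizer = torch.optim.Adam(gcn.parameters(), lr=1e-3, weight_decay=5e-4)
for epoch in range(3000):
    loss = train_step(gcn, optimizer, x, adj, w_seen, seen_idx)
```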
In our TGCNZ method, we set two iterations, whose models are denoted as TGCNZ_1 and TGCNZ_2, respectively. At every iteration, in the Double Filtering Module, the clustering operation is performed only once, and the classifying operation is performed twice using the updated classifiers, which means that we also use the Hungarian algorithm twice to obtain updated images with pseudo labels.
Main results
The quantitative results of the different methods on the three data sets are shown in Table 2. Compared to previous knowledge graph-based methods, such as GCNZ31, SGCN33, and DGP33, our proposed method outperforms them by a large margin, achieving, for instance, almost 20% relative improvement in Top-5 accuracy on the AWA2 data set. We observe that, even with one iteration, TGCNZ_1 outperforms the other methods on AWA2 and ImageNet100. We report the harmonic mean (H) in Tables 2 and 4 to evaluate the seen-unseen trade-off. Our method achieves H = 41.83% on ImageNet50 and H = 30.74% on ImageNet100, outperforming all baselines. Although we obtain H = 50.75% on AWA2, which is lower than DGP with H = 52.9%, changing the clustering algorithm from SCAN to GATCluster yields H = 55.38% on AWA2, better than the other baselines. A bar chart comparing the Top-5 accuracies is shown in Fig. 5.
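Here H denotes the standard harmonic mean of the seen- and unseen-class accuracies49, which penalizes models that sacrifice one side of the trade-off for the other:

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """H = 2 * a_s * a_u / (a_s + a_u), the usual seen-unseen trade-off metric."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

print(harmonic_mean(0.60, 0.40))  # 0.48
```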
The qualitative results of TGCNZ are shown in Fig. 4. The images are from the unseen classes. We compare the results of the proposed method with those of three other methods, as well as the ResNet pre-trained with the seen classes from the training set. Obviously, ResNet is not able to recognize the images from unseen classes. As shown in Fig. 4, the other compared methods cannot recognize the unicycle and totem pole images, while TGCNZ includes the correct classes among its top-5 results.
Model analysis
Analysis of word embeddings. To measure the sensitivity of the knowledge graph-based methods to word embeddings, we investigate different word embeddings, including GloVe, FastText56, and word2vec. Word vectors of different dimensions extracted by GloVe, namely GloVe300, GloVe200, GloVe100, and GloVe50, are compared separately. The FastText 300-dimensional word vectors are trained on the Wikipedia data set and Common Crawl, and the word2vec vectors are trained on Google News. All the results are shown in Table 3. We can see that, on the whole, GloVe300 performs better than the other word embeddings. Therefore, we utilize GloVe300 in our experiments.
Analysis of clustering methods. In the Double Filtering Module, we cluster the unseen samples in the test set into \(N_u\) clusters. In this experiment, we compared the performance of the module using three different clustering methods, namely K-means, GATCluster57, and SCAN, on the three data sets. As shown in Table 4, the performance of SCAN far exceeds the other two methods, except for the results of TGCNZ_2 on AWA2 (Table 5). Furthermore, we compare these clustering methods combined with our proposed method. The results are shown in Table 6, where SCAN performs better on ImageNet50 and ImageNet100. The results indicate that the performance is affected by both the classification ability and the clustering ability.
Analysis of filter thresholds. The threshold k is used in the Double Filtering Module. We analyze the performance of the module with different values of k on the different data sets. As shown in Table 7, the recognition results on AWA2 decrease as the value of k increases. In contrast, the results on ImageNet50 improve as k increases. The same trend is observed on ImageNet100, although the effect of k on the performance is very slight.
Analysis of iterations. In our proposed method, more iterations mean that more unseen samples with pseudo labels are screened out, which is more conducive to improving the recognition accuracy on unseen categories. As shown in Table 8, when the number of iterations is set to three, the accuracy increases noticeably. Considering the computational resources and time consumption, we choose two iterations in our experiments.
Analysis of the number of layers. We perform experiments on the three data sets to verify that shallow GCNs, i.e., those with a small number of layers, perform better for zero-shot learning. The results are shown in Table 5. All the hidden layers have a dimensionality of 2048 with 0.5 dropout, as in33. The results show that the one-layer GCN performs the best.
Conclusion
In this paper, we propose a transductive zero-shot learning method based on a Knowledge Graph and a Graph Convolutional Network (GCN) to tackle the domain shift problem, which hinders recognition performance. We first learn a knowledge graph, where each node represents a category encoded by its semantic embedding. With a one-layer graph convolution, we learn the classifier of each seen and unseen category, supervised by the visual classifiers of the seen categories. During testing, a clustering strategy, based on our proposed Double Filtering Module with the Hungarian algorithm, is applied to the unseen samples, whose categories are then predicted using the learned classifiers. In the transductive setting, the unseen samples with higher classification accuracy are assigned pseudo annotations, and can be associated with the seen categories to progressively update the model parameters. We validate the proposed model on three data sets, and the proposed method achieves competitive performance compared with other knowledge graph-based methods. In future work, we will exploit visual-semantic correlations to better capture evolving category dependencies in real-world scenarios, since our knowledge graph relies on static semantic relationships. We will also extend the framework to handle cross-modal shifts by integrating adversarial domain adaptation, enhancing robustness in heterogeneous environments.
Data availability
All data generated or analysed during this study are included in this published article and can be downloaded publicly from the link https://www.image-net.org/.
References
Yang, Y., Sun, X., Dong, J., Lam, K.-M. & Zhu, X. X. Attention-convnet network for ocean-front prediction via remote sensing SST images. IEEE Trans. Geosci. Remote Sens. https://doi.org/10.1109/TGRS.2024.3496660 (2024).
Li, Q. et al. Developing a microscopic image dataset in support of intelligent phytoplankton detection using deep learning. ICES J. Mar. Sci. 77, 1427–1439 (2020).
Zhao, H., Sun, X., Dong, J., Chen, C. & Dong, Z. Highlight every step: Knowledge distillation via collaborative teaching. IEEE Trans. Cybern. 52, 2070–2081. https://doi.org/10.1109/TCYB.2020.3007506 (2022).
Lv, Q., Sun, X., Chen, C., Dong, J. & Zhou, H. Parallel complement network for real-time semantic segmentation of road scenes. IEEE Trans. Intell. Transp. Syst. (2021).
Sun, X. et al. Gaussian dynamic convolution for efficient single-image segmentation. IEEE Trans. Circuits Syst. Video Technol. 32, 2937–2948. https://doi.org/10.1109/TCSVT.2021.3096814 (2022).
Miao, Z. et al. Multi-modal language models in bioacoustics with zero-shot transfer: A case study. Sci. Rep. 15, 7242. https://doi.org/10.1038/s41598-025-89153-3 (2025).
Larochelle, H., Erhan, D. & Bengio, Y. Zero-data learning of new tasks. In AAAI 1, 3 (2008).
Ren, J., Zhao, Y., Zhang, W. & Sun, C. Zero-shot incremental learning using spatial-frequency feature representations. Sci. Rep. 15, 10932. https://doi.org/10.1038/s41598-024-83649-0 (2025).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Comput. Sci. (2014).
Fu, Y., Hospedales, T. M., Xiang, T. & Gong, S. Transductive multi-view zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 37, 2332–2345 (2015).
Yu, Y., Ji, Z., Guo, J. & Zhang, Z. Zero-shot learning via latent space encoding. IEEE Trans. Cybern. 49, 3755–3766 (2018).
Ji, Z., Yu, X., Yu, Y., Pang, Y. & Zhang, Z. Semantic-guided class-imbalance learning model for zero-shot image classification. IEEE Trans. Cybern. (2021).
Wang, C., Yu, Z., Long, Z., Zhao, H. & Wang, Z. A few-shot diabetes foot ulcer image classification method based on deep ResNet and transfer learning. Sci. Rep. 14, 29877. https://doi.org/10.1038/s41598-024-80691-w (2024).
Lampert, C. H., Nickisch, H. & Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 36, 453–465 (2014).
Huang, C., Loy, C. C. & Tang, X. Local similarity-aware deep feature embedding. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 1270–1278 (2016).
Frome, A. et al. Devise: A deep visual-semantic embedding model. Adv. Neural Inf. Process. Syst. 26, 2121–2129 (2013).
Norouzi, M. et al. Zero-shot learning by convex combination of semantic embeddings. In 2nd International Conference on Learning Representations, ICLR 2014 (2014).
Kodirov, E., Xiang, T. & Gong, S. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3174–3183 (2017).
Li, C. et al. Small data challenges for intelligent prognostics and health management: A review. Artif. Intell. Rev. 57, 1–52 (2024).
Zhang, X. et al. A personalized federated meta-learning method for intelligent and privacy-preserving fault diagnosis. Adv. Eng. Inf. 62, 102781 (2024).
Li, C. et al. A zero-shot fault detection method for UAV sensors based on a novel CVAE-GAN model. IEEE Sens. J. 24, 16 (2024).
Wang, H. et al. A novel transformer-based few-shot learning method for intelligent fault diagnosis with noisy labels under varying working conditions. Reliab. Eng. Syst. Saf. 251, 110400 (2025).
Leksut, J. T., Zhao, J. & Itti, L. Learning visual variation for object recognition. Image Vis. Comput. 98, 103912 (2020).
Fergus, R., Bernal, H., Weiss, Y. & Torralba, A. Semantic label sharing for learning with many categories. In European Conference on Computer Vision, 762–775 (Springer, 2010).
Marino, K., Salakhutdinov, R. & Gupta, A. The more you know: Using knowledge graphs for image classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 20–28 (2017).
Sun, X. et al. Network embedding via deep prediction model. IEEE Trans. Big Data 9, 455–470. https://doi.org/10.1109/TBDATA.2022.3194643 (2023).
Chen, Z., Xu, J., Peng, T. & Yang, C. Graph convolutional network-based method for fault diagnosis using a hybrid of measurement and prior knowledge. IEEE Trans. Cybern. (2021).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Salakhutdinov, R. & Hinton, G. E. Semantic hashing. Int. J. Approx. Reason. 50, 969–978. https://doi.org/10.1016/j.ijar.2008.11.006 (2009).
Miller, G. A. Wordnet: A lexical database for English. Commun. ACM 38, 39–41 (1995).
Wang, X., Ye, Y. & Gupta, A. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6857–6866 (2018).
Li, Q., Han, Z. & Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32 (2018).
Kampffmeyer, M. et al. Rethinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11487–11496 (2019).
Zhang, Y., Wu, Z., Peng, H. & Lin, S. A transductive approach for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6949–6958 (2020).
Chen, C., Sun, X., Hua, Y., Dong, J. & Xv, H. Learning deep relations to promote saliency detection. Proc. AAAI Conf. Artif. Intell. 34, 10510–10517 (2020).
Qiao, L. et al. Transductive episodic-wise adaptive metric for few-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3603–3612 (2019).
Hong, D. et al. Graph convolutional networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. (2020).
Yu, Y. et al. Transductive zero-shot learning with a self-training dictionary approach. IEEE Trans. Cybern. 48, 2908–2919 (2018).
Kodirov, E., Xiang, T., Fu, Z. & Gong, S. Unsupervised domain adaptation for zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, 2452–2460 (2015).
Xiaojin, Z. & Zoubin, G. Learning from labeled and unlabeled data with label propagation. Tech. Rep., Technical Report CMU-CALD-02–107, Carnegie Mellon University (2002).
Zhang, Z. & Saligrama, V. Zero-shot recognition via structured prediction. In European Conference on Computer Vision, 533–548 (Springer, 2016).
Wan, Z. et al. Transductive zero-shot learning with visual structure constraint. arXiv preprint arXiv:1901.01570 (2019).
Guo, Y., Ding, G., Han, J. & Gao, Y. Zero-shot recognition via direct classifier learning with transferred samples and pseudo labels. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31 (2017).
Rohrbach, M., Ebert, S. & Schiele, B. Transfer learning in a transductive setting. Adv. Neural Inf. Process. Syst. 26, 46–54 (2013).
Yu, Y., Ji, Z., Guo, J. & Pang, Y. Transductive zero-shot learning with adaptive structural embedding. IEEE Trans. Neural Netw. Learn. Syst. 29, 4116–4127 (2017).
Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M. & Van Gool, L. Scan: Learning to classify images without labels. In European Conference on Computer Vision, 268–285 (Springer, 2020).
Mills-Tettey, G. A., Stentz, A. & Dias, M. B. The dynamic hungarian algorithm for the assignment problem with changing costs. Robotics Institute, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-07-27 (2007).
Bruff, D. The assignment problem and the Hungarian method. Notes for Math. 20, 5 (2005).
Xian, Y., Lampert, C. H., Schiele, B. & Akata, Z. Zero-shot learning—A comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2251–2265. https://doi.org/10.1109/TPAMI.2018.2857768 (2019).
Lampert, C. H., Nickisch, H. & Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 951–958. https://doi.org/10.1109/CVPR.2009.5206594 (2009).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Pennington, J., Socher, R. & Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543 (2014).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Paszke, A. et al. Automatic differentiation in PyTorch (Tech. Rep., 2017).
Joulin, A. et al. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
Niu, C., Zhang, J., Wang, G. & Liang, J. Gatcluster: Self-supervised Gaussian-attention network for image clustering. In European Conference on Computer Vision, 735–751 (Springer, 2020).
Acknowledgements
This work was supported by a research grant from the Science and Technology Development Fund, Macao SAR No.0006/2024/RIA1, and the National Natural Science Foundation of China (U1706218, 61971388).
Author information
Authors and Affiliations
Contributions
Xin Sun, Qiong Li and Junyu Dong wrote the main manuscript text. Xin Sun and Qiong Li conducted the experiments. Xin Sun designed the research. All authors contributed to the article and approved the submitted version.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, Q., Sun, X. & Dong, J. Transductive zero-shot learning via knowledge graph and graph convolutional networks. Sci Rep 15, 28708 (2025). https://doi.org/10.1038/s41598-025-13612-0