Introduction

The electrocardiogram (ECG) is a widely employed tool for monitoring cardiovascular diseases in clinical settings. However, the interpretation of ECGs can be subjective and may vary among observers. The integration of computer-assisted diagnostic support can therefore enhance the accuracy and consistency of diagnoses, improving the overall quality of the healthcare system, particularly when combined with the expertise of clinical professionals. Consequently, research on ECG detection is important for the advancement of intelligent medical treatment and has attracted growing attention in the field of machine learning1,2,3,4,5,6.

An ECG records cardiac electrical activity through electrodes on the body surface. Each wave’s characteristics provide critical clinical information. A standard ECG beat includes the P wave, QRS complex, and T wave, which have temporal and spatial dependencies. Owing to rapid advances in computing resources and the availability of extensive experimental datasets, a variety of deep learning models, such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), have been successfully applied to ECG detection and recognition. However, they struggle to capture complex neighborhood information, which limits their ability to parse the structure of the signal. Graph learning, with its capacity to represent complex physical processes and irregular relationships, can enhance the accuracy of ECG signal detection and classification by treating signal segments as nodes.

In recent years, there has been a notable emergence of graph-based methods7,8,9,10 within the field of ECG signal detection. These approaches have systematically validated the dependencies among signals by modeling variables as nodes in a graph structure and employing graph edges to characterize interactions between variables. This has significantly propelled the development of graph-based ECG anomaly detection technology. Such methods are fundamentally rooted in the framework of graph neural networks (GNNs)11, which have demonstrated remarkable achievements in graph representation learning. However, experimental observations highlight significant limitations in the current graph construction strategies. For example, in our preliminary studies on the AF dataset12, it was observed that KNN-based adjacency matrices (a) fail to effectively capture long-range semantic correlations, such as misclassifying arrhythmia patterns with similar local features but distinct global morphologies, and (b) demonstrate performance degradation under noisy conditions due to erroneous connections between dissimilar nodes. These issues arise because KNN-based methods:

  • Rely on local proximity and thus neglect global label semantics, for example by conflating classes that share overlapping local features yet have distinct clinical meanings.

  • Generate density-biased edges, which may result in redundant connections within high-density regions and insufficient coverage in sparsely represented areas.

  • Lack robustness to noise, as the similarity based on Euclidean distance is highly susceptible to disruption by artifacts.

The quality of an initial graph structure is significantly influenced by the connections among its nearest neighboring nodes. However, performing nearest neighbor search on large-scale datasets incurs prohibitively high computational complexity. To address this challenge, hashing has been recognized as an efficient solution13,14,15. It employs compact binary codes to represent high-dimensional data and enables low-complexity nearest neighbor search via Hamming distance calculations16,17. Furthermore, the binary encoding strategy in hashing can inherently suppress feature noise18. Drawing inspiration from this, we leverage hashing for semantic reconstruction and introduce a novel graph structure learning framework. This framework connects graph nodes based on semantic hash similarity, rather than exclusively relying on nearest neighbor relationships. Employing semantic hash similarity for constructing graph structures provides several advantages: (1) Compared to the nearest neighbor approach, semantic hashing is more effective in capturing global semantic similarity between nodes, rather than focusing solely on local structural relationships. (2) Hash codes assess similarity through efficient bit operations, such as calculating the Hamming distance, which offers superior computational efficiency compared to traditional methods like Euclidean distance or cosine similarity. (3) Semantic hashing demonstrates robustness against noise and redundant information by filtering out insignificant details when mapping data into a low-dimensional space via a hash function. The primary contributions can be summarized as follows:

  • We propose a concise semantic hash graph learning framework, which initially acquires the hash representation of signals and employs Hamming similarity to alleviate the rigid similarity in the label space, and simultaneously addresses the issue of incomplete or redundant adjacency in similar graphs.

  • We propose a hash learning optimization algorithm that enriches the supervised hash representations. Additionally, it generates hash representations for unseen signals using simple hash functions, constructs a global hash dictionary by combining the hashes of training and unseen signals, and establishes a global graph structure based on Hamming similarity.

  • By evaluating the generated graph structures using a classic fast GCN, the results demonstrate that the obtained graphs outperform current popular baseline methods, particularly in scenarios involving large intervals.

Related work

Graph-based representation and recognition of ECG signals

Owing to its powerful capability to represent complex data structures, the graph-based approach (GA) has gained significant popularity in the realm of ECG representation and detection. By modeling the interdependencies among signals, this method can uncover the underlying relationship patterns that exist between different signals. This attribute proves to be highly valuable for anomaly detection.

For example, Arora et al.7 proposed a multimodal graph neural network (GNN) to capture intra- and inter-ECG dependencies. Yang et al.8 transformed the ECG from the time domain to the graph domain through the Graph Fourier Transform (GFT); they then converted the GFT results into graph bispectra to extract graph integral bispectra for arrhythmia recognition. Chen et al.9 proposed local subgraph learning and signal-enhanced graph learning for learning ECG representations. For the multiperceptive region, Chen et al.10 proposed a spatio-temporal graph convolutional contraction network for arrhythmia recognition. Zhang et al.19 introduced a novel spatio-temporal residual GCN for the diagnosis of coronary heart disease by segmenting the ECG signals into small, single-channel blocks and converting these blocks into graph nodes. Iqbal et al.20 integrated a double-layer graph convolutional layer with a spatial attention mechanism to capture local and global dependencies, thereby achieving more effective classification of arrhythmias.

The aforementioned methods have improved the detection efficiency of ECG through the use of graph-based approaches. However, the predefined graphs utilized in these methods focus solely on the similarity between neighboring nodes, thereby neglecting global semantic correlations. Additionally, the construction of such graphs involves quadratic complexity, rendering them inefficient for large-scale datasets.

Graph structure learning

The quality of a graph-based model is directly influenced by the quality of the graph. Rapidly constructing a high-quality graph is crucial for advancing graph-based learning. Graph structure learning (GSL)21,22,23 can significantly enhance the quality of graphs by reconstructing or refining the original graph topology, which has recently garnered significant research attention24,25,26.

The two most popular methods for constructing graphs are k-nearest neighbors (kNN graphs)27 and \(\epsilon\) proximity thresholding (\(\epsilon\)-graphs)28. A kNN graph connects each node to its k nearest neighbors, whereas an \(\epsilon\)-graph links two nodes when the distance between them is less than the preset threshold \(\epsilon\). Nevertheless, these two approaches not only entail significant computational complexity but also fail to handle sparse data effectively. Furthermore, similar nodes may possess distinct labels, so the initial graph constructed from neighboring nodes may also incorporate a significant amount of noise29. Consequently, numerous graph regularization methods have been proposed to address these challenges, including sparse constraints30, smoothness constraints31, low-rank constraints32,33, matrix factorization34,35 and so on. However, the constraints employed in these methods are imposed solely on the learner, while the graph structure is treated as a predefined condition. Fundamentally, these approaches still fall into the categories of kNN graphs or \(\epsilon\)-graphs.
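To make the two baseline constructions concrete, the following is a minimal NumPy sketch rather than the implementation used in the cited works. It assumes Euclidean distance, a symmetrized kNN rule, and samples stored as rows; the function names are illustrative only. Both routines build the full pairwise distance matrix, which is the quadratic cost discussed above.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between the rows of X (n x d)."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def knn_graph(X, k):
    """Symmetric kNN adjacency: i~j if j is among i's k nearest, or vice versa."""
    D = pairwise_sq_dists(X)
    np.fill_diagonal(D, np.inf)              # exclude self-loops
    idx = np.argsort(D, axis=1)[:, :k]       # k nearest neighbours per node
    A = np.zeros(D.shape)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, idx.ravel()] = 1.0
    return np.maximum(A, A.T)                # symmetrize (an assumption)

def eps_graph(X, eps):
    """Epsilon-graph: connect i and j when their distance is below eps."""
    D = np.sqrt(pairwise_sq_dists(X))
    A = (D < eps).astype(float)
    np.fill_diagonal(A, 0.0)
    return A
```

For three one-dimensional points at 0, 1 and 10, `knn_graph(X, 1)` links the isolated point at 10 to its nearest neighbour even though it is far away, while `eps_graph(X, 2.0)` leaves it disconnected, which illustrates the density bias of the former and the coverage problem of the latter.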

Unlike models that solely focus on local neighborhoods, the Transformer architecture36 facilitates message propagation between all nodes, thereby uncovering novel relationships. Kreuzer et al.37 proposed a learnable position encoding mechanism that employs an additional Transformer encoder to transform the Laplacian spectrum of the graph. Dwivedi et al.38 proposed decoupling structural embeddings from positional embeddings, allowing both to be updated in tandem with other parameters. These methods, which leverage kernel functions for node representation or encode both local and global characteristics of the graph structure through an additional diffusion kernel, do enhance the structural optimization of the graph. However, they all rely on deep learning techniques and entail considerable training time and computational expense.

Graph learning with hash-based approaches

Owing to its low retrieval time complexity (\(\mathscr {O}(1)\)) and minimal storage requirements, hash coding has garnered significant attention from researchers in recent years39,40. Several methods integrate hash-based techniques with graph learning. For instance, Lu et al.39 introduced the Asymmetric Transfer Hashing (ATH) framework to address the cross-domain heterogeneous search challenge. They optimized domain distribution and feature discrepancies by employing a dual-domain hash function and an adaptive bipartite graph. Huang et al.41 employed a graph embedding learning approach to generate hash codes suitable for model replication detection through the design of a fusion quantization and triplet loss hash network. The Hamming Spatial Graph Convolutional Networks (HS-GCN)42 model incorporated the concept of Hamming similarity into its framework, thereby enhancing the representation of both users and items. Wang et al.43 employed anchor graphs to capture the intrinsic neighborhood structure and subsequently developed a graph-based hashing model.

In contrast to existing methods that primarily focus on learning hash codes utilizing graph structures, this paper proposes a novel approach where hash coding is employed to infer graph structures. This method not only reduces the computational complexity associated with the initial adjacency matrix but also mitigates the impact of graph noise, thereby providing a semantically enriched initial graph for graph-based deep learning models.

Materials and methods

The focus of this section lies on the proposed architecture for similarity reconstruction based on semantic hash. Once the global graph is generated, it is forwarded to fast GCN for efficient ECG detection. The entire process is depicted in Fig. 1. The section provides a comprehensive exposition of each component in the proposed approach, encompassing its formulation and solution process. Additionally, we elucidate the algorithm’s complexity and convergence properties.

Fig. 1

Basic framework of the proposed ECG recognition based on semantic hash similarity graph learning. The entire framework comprises three modules: the hash encoding module, the graph reconstruction module, and the FastGCN module. The hash encoding module is primarily responsible for performing hash encoding on ECG signals to construct a global hash dictionary. The graph reconstruction module performs a weighted reconstruction of the semantic similarity graph by integrating the dynamic similarity graph with the hash similarity graph. Lastly, the reconstructed semantic similarity graph is input into the FastGCN network to facilitate ECG retrieval.

Preliminary and notations

Throughout the paper, vectors (matrices) are denoted by lowercase (uppercase) bold letters, e.g., \({\textbf {b}}\) and \({\textbf {X}}\). \(\Vert {\textbf {X}}\Vert _F=\sqrt{\sum _{i=1}^p\sum _{j=1}^{q}x_{ij}^2}\) is the Frobenius norm of a matrix \({\textbf {X}}=(x_{ij})_{p\times q}\) and \(\Vert {\textbf {b}}\Vert _2=\sqrt{\sum _{i=1}^{p}b_i^2}\) is used to denote the \(\ell _2\)-norm of a vector \({\textbf {b}}\in \mathscr {R}^p\). Notation \(\mathscr {S}_n^p\) denotes an orthogonal feasible region and \(tr({\textbf {X}})\) denotes the trace of matrix \({\textbf {X}}\). We use \({\textbf {B}}=[{\textbf {b}}_1,\ldots ,{\textbf {b}}_n]^T\in \{-1,1\}^{n\times k}\) to denote the hash dictionary. The ECG training dataset is denoted as \(\mathscr {D}=\{{\textbf {x}}_i,y_i\}_{i=1}^n\) with n samples, where \({\textbf {x}}_i\in \mathscr {R}^d\) and \(y_i\in \{1,2,\ldots , c\}\subset \mathscr {N}\) are the ith sample and its corresponding label, respectively.

Let \({\textbf {G}}=\{{\textbf {V}},{\textbf {E}},{\textbf {X}}\}\) denote a graph, where \({\textbf {V}}=\{v_1,\ldots , v_n\}\) represents a sequence of nodes with \(|{\textbf {V}}|=n\) and \({\textbf {E}}\) represents the set of edges connecting them. Edges depict the relationships between nodes and can also be represented as an adjacency matrix \({\textbf {A}}=[a_{ij}]\in \mathscr {R}^{n\times n}\), with \(a_{ij}\) denoting the relationship between nodes \(v_i\) and \(v_j\).

Additionally, the optimization process employs the following notations: the i-th column of matrix \({\textbf {X}}\in \mathscr {R}^{p\times q}\) is denoted by \({\textbf {x}}_i\). \({\textbf {X}}_{\overline{i}}=[{\textbf {x}}_1,\cdots ,{\textbf {x}}_{i-1},{\textbf {x}}_{i+1},\ldots ,{\textbf {x}}_q]\in \mathscr {R}^{p\times {(q-1)}}\) means that the i-th column of \({\textbf {X}}\) is removed. \({\textbf {X}}_{i,v}=[{\textbf {x}}_1,\cdots ,{\textbf {x}}_{i-1},{\textbf {v}},{\textbf {x}}_{i+1},\ldots ,{\textbf {x}}_q]\in \mathscr {R}^{p\times q}\) denotes \({\textbf {X}}\) with its i-th column replaced by a given vector \({\textbf {v}}\). And \(\otimes\) denotes the element-wise product. Finally, \(\mathscr {B}({\textbf {c}},r)=\{{\textbf {x}}\in \mathscr {R}^p|\Vert {\textbf {x}}-{\textbf {c}}\Vert _2\le r\}\) is a p-dimensional hypersphere, where \({\textbf {c}}\in \mathscr {R}^p\) is the center and r is the radius.

Semantic hash similarity graph learning

Problem formulation

The binary-coded hash representation of data ensures consistent similarity between data and semantic affinity. Building upon this foundation, a novel framework has recently emerged in the field of hash learning, aiming to bridge the spatial gap between semantic space and Hamming space by distilling discrete hash codes from semantic affinity. The next step will involve the creation of a hash dictionary.

The approach specifically utilizes one-hot label vectors for constructing semantic affinity and subsequently employs discrete symmetric matrix factorization to generate discriminative compact hash codes. Given any two label vectors \({\textbf {y}}_i,{\textbf {y}}_j \in \{0,1\}^{c}\), their similarity can be represented by \(S_{ij}=\langle {\textbf {y}}_i,{\textbf {y}}_j\rangle\) or, in the multi-label case, \(S_{ij}=\frac{\langle {\textbf {y}}_i,{\textbf {y}}_j\rangle }{\Vert {\textbf {y}}_i\Vert \Vert {\textbf {y}}_j\Vert }\). The affinity matrix can thus be expressed as \({\textbf {S}}={\textbf {Y}}^T{\textbf {Y}}\in \mathscr {R}^{n\times n}\), where the i-th column of \({\textbf {Y}}\) is \({\textbf {y}}_i\). The symmetric decomposition model seeks binary hash row vectors \({\textbf {b}}_i \in \{-1,1\}^k\) of code length k that satisfy \({\textbf {b}}_i{\textbf {b}}_j^T\approx kS_{ij}\) as closely as possible; in matrix form, \({\textbf {B}} {\textbf {B}}^ T\approx k{\textbf {S}}\). The resulting objective function can be written as follows:

$$\begin{aligned} \min \limits _{{\textbf {B}}\in \{-1,1\}^{n\times k}}\mathscr {F}=\Vert k{\textbf {S}}-{\textbf {BB}}^ T\Vert _F^2, s.t.\; {\textbf {B}}^T{\textbf {B}}=n{\textbf {I}}_k. \end{aligned}$$
(1)

Equation 1 represents the classical objective function for learning hash codes based on semantics44,45,46,47. However, due to the impact of the binary discrete constraint inherent in hashing, optimizing this function has consistently been a challenging problem in the field. Currently, there are primarily two strategies for optimizing Eq. 1: introducing intermediate variables to preserve the discreteness of hash coding44,45, or employing a semi-relaxation strategy for optimization46,47.
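As a small illustration of Eq. 1, the sketch below builds the affinity matrix from one-hot labels and evaluates the objective for a candidate code matrix. It is only a sketch under stated assumptions: labels are taken as 0-based integers (the paper indexes classes from 1), \({\textbf {Y}}\) is stored as a \(c\times n\) matrix, and the helper names are hypothetical.

```python
import numpy as np

def one_hot(labels, c):
    """Return Y in {0,1}^(c x n) with one column per sample.

    Assumption: labels are 0-based integers in [0, c).
    """
    labels = np.asarray(labels)
    Y = np.zeros((c, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def eq1_objective(B, Y):
    """Value of ||k S - B B^T||_F^2 with S = Y^T Y and B in {-1,1}^(n x k)."""
    k = B.shape[1]
    S = Y.T @ Y
    return np.linalg.norm(k * S - B @ B.T, "fro") ** 2
```

With two classes of two samples each and \(k=2\), assigning the code \((1,1)\) to one class and \((1,-1)\) to the other gives \({\textbf {B}}{\textbf {B}}^T=k{\textbf {S}}\) exactly and also satisfies \({\textbf {B}}^T{\textbf {B}}=n{\textbf {I}}_k\), so the objective is zero; assigning the same code to every sample leaves a residual on all cross-class pairs.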

Unlike the aforementioned methods, we propose an equivalent objective function to overcome the optimization difficulty of Eq. 1. Re-examining Eq. 1, we find that its aim is \(k{\textbf {S}}\approx {\textbf {BB}}^T\), i.e., \(k{\textbf {Y}}^T{\textbf {Y}}\approx {\textbf {BB}}^T\). Achieving \(\sqrt{k}{{\textbf {Y}}}^T\approx {\textbf {B}}\) would serve as a viable alternative to this objective. However, the label space and the Hamming space have different dimensions, which makes this objective infeasible as stated. To address this, we introduce an orthogonal rotation factor \({\textbf {R}}\in \mathscr {R}^{c\times k }\) to bridge the dimensional difference, i.e., \(\sqrt{k}{{\textbf {Y}}}^T\approx {\textbf {B}} {\textbf {R}}^T\). As long as \({\textbf {R}}^T {\textbf {R}}\equiv {\textbf {I}}_k\), we have \(({\textbf {B}} {\textbf {R}}^T)({\textbf {B}} {\textbf {R}}^T)^T={\textbf {B}} {\textbf {R}}^T{\textbf {R}} {\textbf {B}}^T={\textbf {B}} {\textbf {B}}^T\approx \sqrt{k}{{\textbf {Y}}}^T(\sqrt{k}{{\textbf {Y}}}^T)^T=k{\textbf {Y}}^T{\textbf {Y}}\). It is worth noting that this equivalence holds only if the rotation factor remains orthogonal. In other words, the subsequent optimization process must ensure that the rotation factor consistently resides on the Stiefel manifold. To sum up, Eq. 1 can be equivalently expressed as:

$$\begin{aligned} \min \limits _{{\textbf {B}}\in \{-1,1\}^{n\times k},{\textbf {R}}\in \mathscr {R}^{c\times k }}\mathscr {F}=\Vert \widetilde{{\textbf {Y}}}^T-{\textbf {BR}}^T\Vert _F^2, s.t.\; {\textbf {B}}^T{\textbf {B}}=n{\textbf {I}}_k, {\textbf {R}}^T {\textbf {R}}\equiv {\textbf {I}}_k. \end{aligned}$$
(2)

where \(\widetilde{{\textbf {Y}}}=\sqrt{k} {\textbf {Y}}\). The subsequent optimization then focuses on \({\textbf {R}}\). To approximate Eq. 1 effectively, \({\textbf {R}}\) must remain orthogonal at every subsequent iteration, so that it stays within the orthogonal feasible region.

Optimization

Due to the indefinite magnitudes of c and k, the strategy based on Stiefel manifold is not applicable to Eq. 2. Therefore, we adopt a two-step optimization approach. Firstly, we use the proximal linearized approximation of the augmented Lagrangian algorithm48,49 to minimize the function value. Then, we utilize column-wise block coordinate descent method to handle orthogonal constraints. Eq. 2 w.r.t. \({\textbf {R}}\) can be formulated as:

$$\begin{aligned} \min \limits _{{\textbf {R}}\in \mathscr {R}^{c\times k }}\mathscr {F}({\textbf {R}})=tr({\textbf {RB}}^T{\textbf {BR}}^T)-2tr(\widetilde{{\textbf {Y}}} {\textbf {BR}}^T), s.t.\; {\textbf {R}}^T {\textbf {R}}\equiv {\textbf {I}}_k. \end{aligned}$$
(3)
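For completeness, the step from Eq. 2 to Eq. 3 can be spelled out by expanding the Frobenius norm as a trace; the expansion below uses only the definitions given above:

$$\begin{aligned} \Vert \widetilde{{\textbf {Y}}}^T-{\textbf {BR}}^T\Vert _F^2&=tr\big ((\widetilde{{\textbf {Y}}}^T-{\textbf {BR}}^T)^T(\widetilde{{\textbf {Y}}}^T-{\textbf {BR}}^T)\big )\\&=tr(\widetilde{{\textbf {Y}}}\widetilde{{\textbf {Y}}}^T)-2tr(\widetilde{{\textbf {Y}}} {\textbf {BR}}^T)+tr({\textbf {RB}}^T{\textbf {BR}}^T). \end{aligned}$$

The two cross terms coincide because \(tr({\textbf {RB}}^T\widetilde{{\textbf {Y}}}^T)=tr(\widetilde{{\textbf {Y}}} {\textbf {BR}}^T)\), and \(tr(\widetilde{{\textbf {Y}}}\widetilde{{\textbf {Y}}}^T)\) does not depend on \({\textbf {R}}\), so dropping it leaves the two terms that appear in Eq. 3.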

The Lagrangian function of Eq. 3 can be expressed as follows, where \(\Lambda\) represents the Lagrangian multiplier matrix for the orthogonal constraint:

$$\begin{aligned} \mathscr {L}=\mathscr {F}({\textbf {R}})-\frac{1}{2}tr(\Lambda ({\textbf {R}}^T {\textbf {R}}-{\textbf {I}}_k)). \end{aligned}$$
(4)

Then the gradient of Eq. 4 with respect to \({\textbf {R}}\) is \(\nabla _{{\textbf {R}}}\mathscr {L}=\nabla \mathscr {F}({\textbf {R}})-{\textbf {R}}\Lambda\). Setting \(\nabla _{{\textbf {R}}}\mathscr {L}=0\) and combining it with the first-order optimality conditions (substationarity: \(({\textbf {I}}_c-{\textbf {R}} {\textbf {R}}^T)\nabla \mathscr {F}({\textbf {R}})=0\); symmetry: \({\textbf {R}}^T\nabla \mathscr {F}({\textbf {R}})=\nabla \mathscr {F}({\textbf {R}})^T{\textbf {R}}\); feasibility: \({\textbf {R}}^T{\textbf {R}}={\textbf {I}}_k\)), we have \(\Lambda ={\textbf {R}}^T \nabla \mathscr {F}({\textbf {R}})=\nabla \mathscr {F}({\textbf {R}})^T{\textbf {R}}\), \(\nabla \mathscr {F}({\textbf {R}})={\textbf {R}}\Lambda\) and \(\nabla \mathscr {F}({\textbf {R}})-{\textbf {R}}\nabla \mathscr {F}({\textbf {R}})^T{\textbf {R}}=0\).

In order to uphold the orthogonality constraint, we incorporate an additional penalty term for orthogonality. Thus, Eq. 4 can be reformulated as follows:

$$\begin{aligned} \mathscr {L}_{\beta }=\mathscr {F}({\textbf {R}})-\frac{1}{2}tr(\Lambda ({\textbf {R}}^T {\textbf {R}}-{\textbf {I}}_k))+\frac{\beta }{4}\Vert {\textbf {R}}^T {\textbf {R}} -{\textbf {I}}_k\Vert _F^2. \end{aligned}$$
(5)

where \(\beta\) is the penalty regulation coefficient. Then the gradient \(\nabla _{{\textbf {R}}}\mathscr {L}_{\beta }\) can be written as:

$$\begin{aligned} \nabla _{{\textbf {R}}}\mathscr {L}_{\beta }({\textbf {R}},\Lambda )=\nabla \mathscr {F}({\textbf {R}})-{\textbf {R}}\Lambda -\beta {\textbf {R}}({\textbf {I}}_k-{\textbf {R}}^T {\textbf {R}}). \end{aligned}$$
(6)
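The gradient in Eq. 6 can be checked numerically against Eq. 5. The sketch below is a sanity check, not part of the proposed algorithm; it assumes a symmetric multiplier \(\Lambda\) (as enforced later through \(\Psi\)), uses \(\nabla \mathscr {F}({\textbf {R}})=2{\textbf {R}}{\textbf {B}}^T{\textbf {B}}-2\widetilde{{\textbf {Y}}}{\textbf {B}}\), which follows from differentiating Eq. 3, and compares the closed form with central finite differences. All function names are illustrative.

```python
import numpy as np

def grad_F(R, B, Yt):
    """Gradient of F(R) = tr(R B^T B R^T) - 2 tr(Yt B R^T); Yt is c x n."""
    return 2.0 * R @ (B.T @ B) - 2.0 * Yt @ B

def L_beta(R, Lam, B, Yt, beta):
    """Augmented Lagrangian of Eq. 5."""
    E = R.T @ R - np.eye(R.shape[1])
    F = np.trace(R @ B.T @ B @ R.T) - 2.0 * np.trace(Yt @ B @ R.T)
    return F - 0.5 * np.trace(Lam @ E) + 0.25 * beta * np.sum(E ** 2)

def grad_L_beta(R, Lam, B, Yt, beta):
    """Closed-form gradient of Eq. 6 (valid for symmetric Lam)."""
    k = R.shape[1]
    return grad_F(R, B, Yt) - R @ Lam - beta * R @ (np.eye(k) - R.T @ R)

def finite_diff_grad(f, R, h=1e-6):
    """Central finite-difference gradient of a scalar function f at R."""
    G = np.zeros_like(R)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            Rp, Rm = R.copy(), R.copy()
            Rp[i, j] += h
            Rm[i, j] -= h
            G[i, j] = (f(Rp) - f(Rm)) / (2.0 * h)
    return G
```

Evaluating both gradients at a random, not necessarily orthogonal \({\textbf {R}}\) exercises the penalty term as well, since \({\textbf {I}}_k-{\textbf {R}}^T{\textbf {R}}\) is then nonzero.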

For the current iteration \({\textbf {R}}^{(v)}\), the proximal linearized augmented Lagrangian function is defined as:

$$\begin{aligned} \left\{ \begin{array}{l} \Lambda ^{(v)}=\Psi (\nabla \mathscr {F}({\textbf {R}}^{(v)})^T{\textbf {R}}^{(v)}), \\ \widetilde{\mathscr {L}}_{\beta }({\textbf {R}})=tr({\textbf {G}}^T({\textbf {R}}-{\textbf {R}}^{(v)}))+\frac{\eta }{2}\Vert {\textbf {R}}-{\textbf {R}}^{(v)}\Vert _F^2, \\ {\textbf {R}}^{(v+1)}={\arg \min }_{{\textbf {R}}}\widetilde{\mathscr {L}}_{\beta }({\textbf {R}}). \end{array} \right. \end{aligned}$$
(7)

where \({\textbf {G}}=\nabla _{{\textbf {R}}}\mathscr {L}_{\beta }({\textbf {R}}^{(v)},\Lambda ^{(v)})\), \(\frac{1}{\eta }\) is the gradient step size, and \(\Psi ({\textbf {A}})=\frac{{\textbf {A}}+{\textbf {A}}^T}{2}\) is used to ensure the symmetry required by the first-order optimality conditions.

For redundant spherical constraints in the orthogonal feasible region, we add an additional term for \(\nabla \mathscr {F}({\textbf {R}})\), i.e., \(\nabla \mathscr {F}({\textbf {R}})={\textbf {R}}\Lambda +{\textbf {R}} {\textbf {D}}\), where \({\textbf {D}}\) is a diagonal matrix determined by the Lagrangian multiplier. Therefore, the iterative formula of \(\Lambda ^{(v)}\) in Eq. 7 can be expressed as:

$$\begin{aligned} \Lambda ^{(v)}=\Psi (\nabla \mathscr {F}({\textbf {R}}^{(v)})^T{\textbf {R}}^{(v)})+\Xi ^{(v)}. \end{aligned}$$
(8)

where \(\Xi =Diag({\textbf {R}}^T\nabla _{{\textbf {R}}}\mathscr {L}_{\beta }({\textbf {R}},\Psi (\nabla \mathscr {F}({\textbf {R}})^T{\textbf {R}})))\).

The updating rule of variable \({\textbf {R}}\) provided by Eq. 7 does not guarantee that each iterate stays within the orthogonal feasible region. To address this issue, we adopt a column-wise block coordinate descent method to carry out the second optimization step on Eq. 7. Concretely, we fix the other \(k-1\) columns of \({\textbf {R}}\) and treat the i-th column as the variable. The subproblem of Eq. 7 can be expressed as:

$$\begin{aligned} \min \limits _{{\textbf {r}}\in \mathscr {R}^c}\widetilde{\mathscr {L}}^i_{\beta ,{\textbf {R}}}({\textbf {r}}), s.t.\; \Vert {\textbf {r}}\Vert _2=1, {\textbf {R}}_{\overline{i}}^T {\textbf {r}}=0. \end{aligned}$$
(9)

where \(\widetilde{\mathscr {L}}^i_{\beta ,{\textbf {R}}}({\textbf {r}})=\widetilde{\mathscr {L}}_\beta ({\textbf {R}}_{i,r})\). The constraint \({\textbf {R}}_{\overline{i}}^T {\textbf {r}}={\textbf {0}}\) indicates the solution \({\textbf {r}}^*\in Null({\textbf {R}}_{\overline{i}})=\{{\textbf {r}}\in \mathscr {R}^c|{\textbf {R}}_{\overline{i}}^T {\textbf {r}}={\textbf {0}}\}\), which is equivalent to \({\textbf {r}}=({\textbf {I}}_c-{\textbf {R}}_{\overline{i}} {\textbf {R}}_{\overline{i}}^T){\textbf {r}}\).

Therefore, Eq. 9 can be simplified as:

$$\begin{aligned} \min \limits _{{\textbf {r}}\in \mathscr {R}^c}f_i({\textbf {r}})=\widetilde{\mathscr {L}}^i_{\beta ,{\textbf {R}}}(({\textbf {I}}_c-{\textbf {R}}_{\overline{i}} {\textbf {R}}_{\overline{i}}^T){\textbf {r}}), s.t.\; \Vert {\textbf {r}}\Vert _2=1, {\textbf {r}}\in Null({\textbf {R}}_{\overline{i}}). \end{aligned}$$
(10)

For the i-th column \({\textbf {r}}_i\), the gradient of Eq. 10 can be expressed as:

$$\begin{aligned} \nabla f_i({\textbf {r}}_i)=({\textbf {I}}_c-{\textbf {R}}_{\overline{i}} {\textbf {R}}_{\overline{i}}^T)\nabla \widetilde{\mathscr {L}}^i_{\beta ,{\textbf {R}}}(({\textbf {I}}_c-{\textbf {R}}_{\overline{i}} {\textbf {R}}_{\overline{i}}^T){\textbf {r}}_i) =({\textbf {I}}_c-{\textbf {R}}_{\overline{i}} {\textbf {R}}_{\overline{i}}^T)\nabla f_i({\textbf {r}}_i). \end{aligned}$$
(11)

From Eq. 11, it follows that \(\nabla f_i({\textbf {r}}_i)\in Null({\textbf {R}}_{\overline{i}})\). Therefore, both \({\textbf {r}}_i\) and \(\nabla f_i({\textbf {r}}_i)\) lie in \(Null({\textbf {R}}_{\overline{i}})\). With \(Null({\textbf {R}}_{\overline{i}})\) defined as above, this set is the orthogonal complement of the column space of \({\textbf {R}}_{\overline{i}}\), i.e., \((Col({\textbf {R}}_{\overline{i}}))^{\bot }=Null({\textbf {R}}_{\overline{i}})\), so any point in the span of \({\textbf {r}}_i\) and \(\nabla f_i({\textbf {r}}_i)\), i.e., \(Span\{{\textbf {r}}_i,\nabla f_i({\textbf {r}}_i)\}\), satisfies the orthogonal constraint. Let \(\widetilde{{\textbf {r}}_i}={\textbf {r}}_i -\iota \nabla f_i({\textbf {r}}_i)\in Span\{{\textbf {r}}_i,\nabla f_i({\textbf {r}}_i)\}\); normalizing it gives the orthogonal feasible solution:

$$\begin{aligned} {\textbf {r}}_i^{(v+1)}=\frac{{\textbf {r}}_i^{(v)} -\iota \nabla f_i({\textbf {r}}_i^{(v)})}{\Vert {\textbf {r}}_i^{(v)} -\iota \nabla f_i({\textbf {r}}_i^{(v)})\Vert _2}. \end{aligned}$$
(12)

where \(\iota =\frac{1}{\eta }\) is the step size.
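The column-wise update of Eqs. 10 to 12 can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: the gradient of the linearized surrogate with respect to the i-th column is taken to be the i-th column of \({\textbf {G}}\), and the remaining columns of \({\textbf {R}}\) are already orthonormal so that \({\textbf {I}}_c-{\textbf {R}}_{\overline{i}}{\textbf {R}}_{\overline{i}}^T\) is an orthogonal projector. The function name `update_column` is hypothetical.

```python
import numpy as np

def update_column(R, G, i, iota):
    """One projected, normalized gradient step on column i of R (Eq. 12).

    R    : c x k matrix whose other columns are assumed orthonormal
    G    : c x k gradient of the augmented Lagrangian at the current iterate
    iota : step size, i.e. 1 / eta
    """
    R_rest = np.delete(R, i, axis=1)                # R with column i removed
    P = np.eye(R.shape[0]) - R_rest @ R_rest.T      # projector onto Null(R_rest)
    grad = P @ G[:, i]                              # projected gradient (Eq. 11)
    r_new = P @ R[:, i] - iota * grad               # stays inside Null(R_rest)
    r_new = r_new / np.linalg.norm(r_new)           # restore unit length
    R_out = R.copy()
    R_out[:, i] = r_new
    return R_out
```

Because both the current column and the projected gradient lie in the same subspace, the updated matrix keeps \({\textbf {R}}^T{\textbf {R}}={\textbf {I}}_k\) up to round-off, whatever the step size. The sketch does not guard against the degenerate case in which the step cancels the column exactly.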

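The optimization of the hash dictionary can be resolved immediately once the rotation factor \({\textbf {R}}\) is optimized. Returning to Eq. 3 with \({\textbf {R}}\) fixed, the term \(tr({\textbf {RB}}^T{\textbf {BR}}^T)\) is constant under the constraint \({\textbf {B}}^T{\textbf {B}}=n{\textbf {I}}_k\), so minimizing \(\mathscr {F}\) with respect to \({\textbf {B}}\) is equivalent to maximizing: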

$$\begin{aligned} \max \mathscr {F}({\textbf {B}})=tr(\widetilde{{\textbf {Y}}} {\textbf {BR}}^T)=tr({\textbf {B}}^T\widetilde{{\textbf {Y}}}^T{\textbf {R}}). \end{aligned}$$
(13)

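Since every entry of \({\textbf {B}}\) is \(\pm 1\), the trace \(tr({\textbf {B}}^T\widetilde{{\textbf {Y}}}^T{\textbf {R}})\) is bounded above by the sum of the absolute values of the entries of \(\widetilde{{\textbf {Y}}}^T{\textbf {R}}\), and the bound is attained when each entry of \({\textbf {B}}\) takes the sign of the corresponding entry of \(\widetilde{{\textbf {Y}}}^T{\textbf {R}}\). This yields a concise closed form for the hash dictionary: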

$$\begin{aligned} {\textbf {B}}=sgn(\widetilde{{\textbf {Y}}}^T{\textbf {R}}). \end{aligned}$$
(14)
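A minimal sketch of the closed-form update in Eq. 14 follows. The only choice not fixed by the text is how to treat zero entries, for which the sign function is undefined in \(\{-1,1\}\); mapping them to \(+1\) is an assumption made here.

```python
import numpy as np

def update_B(Yt, R):
    """B = sgn(Yt^T R), with Yt = sqrt(k) * Y of shape c x n and R of shape c x k."""
    M = Yt.T @ R                          # n x k real-valued scores
    # Assumption: ties at exactly zero are mapped to +1 so that B stays binary.
    return np.where(M >= 0, 1.0, -1.0)
```

For any score matrix, the resulting \({\textbf {B}}\) attains the entry-wise upper bound of the trace in Eq. 13, which can be confirmed by comparing `np.trace(B.T @ M)` with `np.abs(M).sum()`.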

Unseen signal hash representation

Linking the original feature space to the Hamming space requires an explicit hash function, which serves as a practical tool for generating hash codes for previously unseen signal data. For simplicity, we use a linear hash function to establish this link.

We minimize the following square loss:

$$\begin{aligned} \mathscr {H}({\textbf {P}})=\Vert {\textbf {B}}^T-{\textbf {P}} {\textbf {X}}\Vert _F^2+\lambda \Vert {\textbf {P}}\Vert _F^2. \end{aligned}$$
(15)

where \(\lambda\) is a regularization parameter used to prevent overfitting. Setting the gradient of Eq. 15 with respect to \({\textbf {P}}\) to zero yields \({\textbf {P}}={\textbf {B}}^T{\textbf {X}}^T({\textbf {X}} {\textbf {X}}^T+\lambda {\textbf {I}}_d)^{-1}\). Given an unseen signal dataset \({\textbf {X}}_{te}\in \mathscr {R}^{d\times m}\), we can efficiently produce its hash representation \({\textbf {B}}_{te}^T=sgn({\textbf {P}} {\textbf {X}}_{te})\) using this linear hash function.
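The closed-form projection and the out-of-sample encoding can be sketched as below. The sketch assumes \({\textbf {X}}\in \mathscr {R}^{d\times n}\) with one signal per column and \(\lambda >0\); it solves a linear system instead of forming the inverse explicitly, and maps zero scores to \(+1\), both of which are implementation choices rather than details given in the text.

```python
import numpy as np

def fit_hash_projection(B, X, lam):
    """P = B^T X^T (X X^T + lam I_d)^(-1), with B of shape n x k and X of shape d x n."""
    d = X.shape[0]
    A = X @ X.T + lam * np.eye(d)        # symmetric positive definite for lam > 0
    # P A = B^T X^T is equivalent to A P^T = X B because A is symmetric.
    return np.linalg.solve(A, X @ B).T   # k x d

def hash_unseen(P, X_te):
    """Return B_te (m x k) for unseen signals X_te (d x m)."""
    return np.where(P @ X_te >= 0, 1.0, -1.0).T
```

At the returned \({\textbf {P}}\), the stationarity condition \(({\textbf {P}}{\textbf {X}}-{\textbf {B}}^T){\textbf {X}}^T+\lambda {\textbf {P}}=0\) of Eq. 15 holds up to round-off.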

The integration of the training hash dictionary and the unseen hash representation enables the acquisition of the global hash representation \({\textbf {B}}_{g}^T=[{\textbf {B}}^T, {\textbf {B}}_{te}^T]\). The global graph similarity can be computed by evaluating the Hamming similarity of each hash code. The more intuitive procedure is summarized in Algorithm 1.
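One way to turn the global hash dictionary into a similarity graph is sketched below. The text does not give the exact normalization of the Hamming similarity, so the form \(1-d_H/k\), which maps identical codes to 1 and complementary codes to 0, is an assumption. For codes in \(\{-1,1\}^k\) the Hamming distance follows from the inner product as \(d_H=(k-\langle {\textbf {b}}_i,{\textbf {b}}_j\rangle )/2\).

```python
import numpy as np

def hamming_similarity(Bg):
    """Pairwise Hamming similarity of +/-1 codes.

    Bg : (n + m) x k matrix stacking training and unseen codes.
    Returns an (n + m) x (n + m) matrix with entries in [0, 1].
    """
    k = Bg.shape[1]
    inner = Bg @ Bg.T                  # equals k - 2 * hamming_distance
    return (k + inner) / (2.0 * k)     # equals 1 - hamming_distance / k
```

In this sketch the global dictionary would be formed by stacking the two code matrices, for example `Bg = np.vstack([B, B_te])`, and a single matrix product then yields all pairwise similarities.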

Algorithm 1

Semantic Hash Similarity Graph

Dynamic similarity graph

The aforementioned semantic hash similarity graph can be derived efficiently from the label space. However, it fails to consider the structural relationships among the original signals. Hence, structural similarity must be incorporated to construct the final similarity structure. Previous approaches for constructing similarity matrices typically rely on the inner product or Euclidean distance, which are not suitable for paired time series because cascade effects between the series can introduce temporal lag. The Dynamic Time Warping (DTW) distance50 is a robust measure of signal similarity that accounts for temporal variations. Therefore, we utilize the DTW distance to construct a similarity structure over the original signals.

Formally, given any two signal sequences \({\textbf {x}}_{i}, {\textbf {x}}_{j}\), their similarity \({\textbf {A}}_D^{ij}\) is computed utilizing the unbounded distance of DTW.

$$\begin{aligned} {\textbf {A}}_D^{ij}=exp\left( -\frac{DTW({\textbf {x}}_{i}, {\textbf {x}}_{j},\omega )}{\epsilon }\right) . \end{aligned}$$
(16)

where \(\epsilon\) is a hyper-parameter, \(\omega\) is local constraint window size, and \({\textbf {A}}_D\) is referred to as dynamic similarity graph. The ultimate global graph structure is formed by the weighted summation of the semantic similarity graph and the dynamic structural similarity graph, i.e., \({\textbf {A}}=\kappa {\textbf {A}}_D+(1-\kappa ){\textbf {A}}_H\). The above hyper-parameters are set to \(\epsilon =0.5\), \(\omega =3\) and \(\kappa =0.3\) in order to minimize parameter interdependence and reduce algorithm debugging time.
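Eq. 16 and the weighted fusion can be sketched as follows. This is an illustrative implementation under several assumptions: signals are univariate, the local cost is the absolute difference, \(\omega\) is interpreted as a Sakoe-Chiba band half-width (widened when the two lengths differ by more than \(\omega\)), and the DTW distance is used without normalization. The default values mirror the hyper-parameters quoted above; the function names are hypothetical.

```python
import numpy as np

def dtw_distance(x, y, window):
    """DTW distance between 1-D sequences x and y within a band of half-width window."""
    n, m = len(x), len(y)
    w = max(window, abs(n - m))              # the band must reach cell (n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed local cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dynamic_similarity_graph(signals, eps=0.5, window=3):
    """A_D[i, j] = exp(-DTW(x_i, x_j, window) / eps), as in Eq. 16."""
    n = len(signals)
    A = np.ones((n, n))                      # DTW(x, x) = 0, so the diagonal is 1
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(signals[i], signals[j], window)
            A[i, j] = A[j, i] = np.exp(-d / eps)
    return A

def fuse_graphs(A_D, A_H, kappa=0.3):
    """Global graph A = kappa * A_D + (1 - kappa) * A_H."""
    return kappa * A_D + (1.0 - kappa) * A_H
```

As a small check of the warping behaviour, the sequences `[0, 1, 2]` and `[0, 1, 1, 2]` have DTW distance 0 under this cost, since the repeated sample is absorbed by the alignment, whereas their Euclidean distance is not even defined because the lengths differ.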

Convergence analysis and complexity discussion

Next we discuss the convergence of Algorithm 1. Algorithm 1 iterates over two variables, of which the hash dictionary does not play a crucial role in minimizing the objective, because learning the hash dictionary essentially amounts to selecting the best vertex among the vertices of a k-dimensional hypercube. Therefore, we need only discuss the iterative convergence of the rotation factor \({\textbf {R}}\). The experimental section provides empirical validation, while the appendix presents the theoretical analysis.

The complexity analysis of Algorithm 1 is outlined below. Initializing \({\textbf {R}}\) and \({\textbf {B}}\) takes \(\mathscr {O}(nk+ck)\). Each iteration to update \({\textbf {R}}\) and \({\textbf {B}}\) requires calculation costs of \(\mathscr {O}(ck+4ck^2)\) and \(\mathscr {O}(cnk)\), respectively. Calculating \({\textbf {P}}\) needs \(\mathscr {O}(nd^2+d^3+knd+kd^2)\). The generation of \({\textbf {B}}_{te}\) requires \(\mathscr {O}(kmd)\). The computation cost for the ultimate generation of the global semantic graph structure \({\textbf {A}}\) amounts to \(\mathscr {O}((m+n)^2)\).

Experiments and discussions

To validate the effectiveness of the constructed hash similarity graph, a deep graph model framework is required. In this study, we employ the FastGCN model proposed by Yang et al.51. FastGCN treats the graph vertices as independent and identically distributed (i.i.d.) samples drawn from a probability distribution, which allows the loss gradient to be estimated by sampling for parameter updates. Following this, the full GCN architecture is utilized to compute embeddings for the newly introduced vertices.
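
The sampling idea can be sketched as follows. This is an illustrative Monte Carlo estimate of the aggregation \({\textbf {A}}{\textbf {H}}\) only, without weights or nonlinearity. It assumes vertices are drawn with probability proportional to the squared column norm of \({\textbf {A}}\), a choice commonly associated with FastGCN; the function names are hypothetical and this is not the code used in the experiments.

```python
import numpy as np

def importance_distribution(A):
    """Sampling probabilities q(u) proportional to the squared norm of column u of A."""
    col_norm_sq = np.sum(A ** 2, axis=0)
    return col_norm_sq / col_norm_sq.sum()

def sampled_propagation(A, H, q, t, rng):
    """Unbiased estimate of A @ H from t vertices drawn i.i.d. (with replacement) from q:
    (1/t) * sum_j A[:, u_j] H[u_j, :] / q(u_j)."""
    idx = rng.choice(A.shape[0], size=t, replace=True, p=q)
    return (A[:, idx] / q[idx]) @ H[idx] / t
```

Because each sampled term is divided by its sampling probability, the expectation of the estimate equals \({\textbf {A}}{\textbf {H}}\); a larger sample size t reduces the variance at higher cost.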

Experiment setting

  1. (1)

    Datasets:

    • MIT-BIH arrhythmia database52: Each record was continuously acquired for about 30 minutes at a sampling rate of 360 samples per second. Each cycle is annotated with a reference value based on the R-peak, which serves as the analytical ground truth. For this study, a subset of 10,000 ECG fragments from lead II was selected and divided into four classes: normal heartbeat (N), left bundle branch block heartbeat (L), right bundle branch block heartbeat (R), and ventricular premature beat (V). A training set of 8,000 randomly chosen samples was created, comprising 7,000 training instances and 1,000 validation instances. The remaining 2,000 samples were allocated to the test set.

    • The AF ECG dataset12: It was obtained from the 2017 PhysioNet/CinC Challenge and comprises 8528 ECG segments ranging in duration from 30 to 60 seconds. The segments are divided into four groups: normal sinus rhythm (N), atrial fibrillation (A), other cardiac rhythm (O), and noise segment (\(\sim\)). For training, we randomly selected 7000 signals, of which 500 were held out for validation; the remaining 1528 signals were used as the test set.

  2. (2)

    Compared methods: The detection performance of the constructed semantic hash similarity graph in the FastGCN network is verified by conducting experimental comparisons with several commonly used intelligent algorithms. These algorithms include machine learning methods such as long short-term memory based on CNN (CNN-LSTM)53, bidirectional LSTM (BiLSTM)54, one-dimensional convolutional neural networks (1DCNN)3, attentive recurrent neural networks (ARNN)4, GCN55, CNN-BiLSTM-ATT2, ST-ReGE19 and MPR-STSGCN10.

    Fig. 2 Comparison results of all methods for each class of ACC on the dataset MIT.

    Fig. 3 Comparison results of all methods for each class of ACC on the dataset AF.

    Fig. 4 Convergence of Algorithm 1 on two datasets.

    Fig. 5 Comparison results between \(\text {FastGCN}_{H}\) and \(\text {FastGCN}_{D}\) on two datasets.

    Fig. 6 The visualization comparison of semantic similarity graphs and label similarity graphs on MIT.

  3. (3)

    Metrics: The metrics employed in our study include the accuracy for each category, average accuracy (Acc), Precision, Recall (sensitivity), and Macro-F1 score based on the AAMI guidelines. They are defined as follows:

    $$\begin{aligned} \left\{ \begin{array}{lr} Acc=\frac{TP+TN}{TP+FP+FN+TN}, \\ Precision=\frac{TP}{TP+FP}, \\ Recall=\frac{TP}{TP+FN}, \\ F_{1,ma}=\frac{2Precision*Recall}{Precision+Recall}=\frac{2TP}{2TP+FP+FN}. \end{array} \right. \end{aligned}$$
    (17)

    where the abbreviations TP, FP, TN, and FN respectively stand for true positive, false positive, true negative, and false negative.

  4. (4)

    Experimental protocol: Algorithm 1 involves three parameters \(\beta\), \(\iota\) and \(\lambda\), which are set to \(\min \left( \max (10^{-2}, 10^{-3}\Vert {\textbf {B}}^T{\textbf {B}}\Vert _F^2),10^5\right)\), \(10^{-4}\) and \(10^{-1}\), respectively. The hash code length k is set to 32. For FastGCN, the number of epochs is set to 30, the number of hidden units to 256, the ADAM learning rate to \(10^{-2}\), the batch size to 200, and the resample size to 400. Since FastGCN has only two layers, we kernelize the samples in advance to increase its nonlinear capacity. In addition, to facilitate the reproducibility of the FastGCN experiments, we provide the two datasets along with their corresponding semantic hash graphs (https://pan.baidu.com/s/19-APd8zAYim2hBr6YvCutg?pwd=7neo). All experiments are performed on a workstation with an Intel(R) Core(TM) i7-9700 CPU@3.00GHz, an NVIDIA GeForce RTX 3070, and 64 GB RAM.
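
For reference, the metrics of Eq. (17) can be computed per class in a one-vs-rest manner from a confusion matrix, with the Macro-F1 score taken as the unweighted mean of the per-class F1 values. The sketch below is illustrative; the function names and the convention of returning 0 when a denominator is 0 are assumptions made for this example.

```python
import numpy as np

def _safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def per_class_metrics(y_true, y_pred, n_classes):
    """One-vs-rest Acc, Precision, Recall and F1 per class, plus the Macro-F1 score."""
    cm = np.zeros((n_classes, n_classes), dtype=int)  # rows: true class, columns: predicted class
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    f1 = _safe_div(2 * tp, 2 * tp + fp + fn)
    return {
        "acc": _safe_div(tp + tn, tp + fp + fn + tn),
        "precision": _safe_div(tp, tp + fp),
        "recall": _safe_div(tp, tp + fn),
        "f1": f1,
        "macro_f1": float(f1.mean()),
    }
```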

Analysis of experimental results

  1. (1)

    MIT-BIH: The Acc, Precision, Recall and F1 score results of all methods on MIT-BIH are listed in Table 1, and the per-class Acc of all methods is compared in Fig. 2. The results indicate that the class features in this dataset are clearly discernible, so all methods achieve relatively high detection performance. Our method is superior to the other methods in terms of Acc, Precision and F1 score, while its Recall is slightly lower than that of ARNN. All the baselines are deep learning methods, and their uniformly satisfactory results suggest that deep architectures are effective at extracting critical signal features. However, it should be noted that deep networks require longer training time. ARNN achieves the second-best performance among all methods, which suggests that an attention mechanism in time-series networks is well suited to analyzing ECG signals. FastGCN, despite having only two layers, outperforms the other baselines in recognition results. This indirectly suggests that the semantic hash similarity graph we constructed is rich in semantic information, which greatly facilitates efficient detection by the FastGCN model.

    Table 1 Comparison results of various methods on the two datasets. Bold text indicates the best result and italic text indicates the second-best result.
Fig. 7 The visualization comparison of semantic similarity graphs and label similarity graphs on AF.

Fig. 8 Sensitivity of parameter \(\kappa\) on two datasets.

Fig. 9 Sensitivity of \(\lambda\) and k on two datasets.

Fig. 10 The comparison of the noise resistance capabilities of SHSG and KNN on two datasets.

  2. (2)

    AF dataset: The baseline comparison results are summarized in Table 1, while Fig. 3 displays the accuracy (ACC) for each class. Table 1 shows that our method outperforms all baselines in terms of ACC, recall, and F1 score. This also demonstrates the effectiveness of the proposed semantic hash graph learning framework in mitigating the inherent rigidity within the label space, as well as in addressing incomplete or redundant adjacency within the similarity graph. However, the precision of our method is comparatively weaker. The absence of an attention module in the FastGCN model may contribute to this, and it may also reflect the class imbalance of the dataset.

The per-class results of each method vary significantly because the classes in this dataset contain very different numbers of samples, as illustrated by Fig. 3. The recognition rates for normal rhythm (N) and atrial fibrillation (A) are the highest, while those for the other two classes are unsatisfactory. Our method performs better on the first and fourth classes, while CNN-BiLSTM-ATT leads significantly on the second class but performs poorly on the fourth. The figure also shows that our method, ST-ReGE and MPR-STSGCN achieve relatively high recognition accuracy in every class, indicating that these three methods are less sensitive to the strong class imbalance of this dataset. A potential reason for the uneven results of attention-based models is that the parameters of the attention module are updated solely through loss backpropagation from labels and predicted values, without any additional supervision information. Consequently, its supervision is constrained, rendering it susceptible to label overfitting.

Convergence analysis and training time

The convergence of Algorithm 1 on the two datasets is investigated, and the results are depicted in Fig. 4. Convergence is remarkably rapid (fewer than 5 iterations), primarily because the hash dictionary is extracted from the label space rather than from semantic associations. Although optimizing the orthogonal rotation factor is more complex, the factor is small, so the optimization is fast. The convergence curve on MIT-BIH also shows that the objective function value does not strictly decrease. The primary reason is that updating the hash codes amounts to selecting the optimal vertex of the k-dimensional hypercube, which does not necessarily reduce the objective function value and may even increase it. Furthermore, the rotation factor \({\textbf {R}}\) is much smaller than \({\textbf {B}}\), so its contribution to the objective value is much less significant than that of \({\textbf {B}}\). Therefore, we only need to ensure that the algorithm converges normally. In the appendix, we give a rigorous proof that updating the rotation factor \({\textbf {R}}\) guarantees a reduction of the objective value.

The training time and retrieval time of all baselines are also examined. For a fair comparison, the number of epochs for each network is set to 30. The training time and retrieval time of all methods are listed in Table 2, which shows that our method is faster than the other deep learning approaches. It is noteworthy that constructing the global similarity graph with the semantic hash similarity requires only a few seconds, which is an acceptable overhead.

Table 2 Comparison of training time and retrieval time on the two datasets. Bold text indicates the best result and italic text indicates the second-best result.

Ablation analysis

The contribution of the semantic hash similarity graph to performance is further analyzed through ablation experiments on the two datasets. For convenience, \(\text {FastGCN}_{H}\) denotes the model with the combined semantic hash graph and \(\text {FastGCN}_{D}\) denotes its ablated version. The \(\text {FastGCN}_{D}\) version is obtained by setting \(\kappa =1\) in our method, so that only the dynamic similarity graph \({\textbf {A}}_D\) is used.

The experimental comparison results of the two methods on the two datasets are presented in Fig. 5. The findings clearly demonstrate that the incorporation of semantic hash graph significantly enhances the performance of FastGCN, thereby confirming the presence of abundant semantic information in the learned graph for effective class separation.

To make it more intuitive, we selected the first 100 signals from each of the two datasets and compared their semantic similarity graphs and label similarity graphs, as shown in Figs. 6 and 7. The extracted semantic similarity graphs fit the label similarities very well.

Parameter sensitivity

The above experiments are conducted after preconfiguring all parameters. Subsequently, we examine the impact of parameter variations on the outcomes using two original datasets. The focus of our research lies not in extensive exploration of network parameters, but rather in examining the impact of these parameters within the generated semantic hash graph model on the outcomes.

The weight parameter \(\kappa\) between the semantic similarity graph and the dynamic structural similarity graph exerts the most significant impact on the outcome. We varied this parameter and report the results in Fig. 8. The figure indicates that \(\kappa\) is best chosen in the interval [0.1, 0.5] on MIT-BIH and in the interval [0.1, 0.6] on AF.

For the penalty parameter \(\beta\), we have set a general adjustment range, which we do not investigate further here. The experiments show that varying \(\iota\) has no noticeable impact on the results, which remained stable throughout. However, for the sake of convergence, a smaller step size is preferable, such as \(\iota \le 10^{-3}\). These two parameters only affect the algorithm’s convergence speed, without significantly affecting the outcome.

The regularization parameter \(\lambda\) is used solely to prevent the linear projection function from overfitting. All experiments in this paper use \(\lambda =10^{-1}\), although the sensitivity experiment for \(\lambda\) in Fig. 9 shows that \(\lambda =2\times 10^{-1}\) is the best choice on both datasets.

The length of the hash code plays a crucial role in capturing signal semantics. A longer hash code length typically encompasses richer semantic information. As illustrated in Fig. 9, increasing the hash code length k enhances the semantic similarity graph, which in turn significantly boosts the retrieval performance of the FastGCN network. However, longer hash codes come at the cost of increased computational complexity. Despite Fig. 9 indicating that \(k=128\) yields the optimal results, this study opts for \(k=32\) to strike a balance between retrieval effectiveness and computational efficiency.

Noise sensitivity

Furthermore, to verify the noise resistance of our graph construction method, we add Gaussian white noise with a signal-to-noise ratio (SNR) of 10 to each of the two datasets. We first compare, within the FastGCN framework, the graphs constructed from the original signals by the KNN method and by our proposed method. We then repeat the comparison for the two graph structures built from the noisy signals. The experimental results are shown in Fig. 10. On MIT-BIH, our results dropped by 3% under noise, while the KNN-based results decreased by 11%. On AF, our results dropped by 3 percentage points under noise, whereas the KNN-based results dropped by 12 percentage points. The results demonstrate that our graph learning method is capable of mitigating noise. This is primarily attributed to the fact that hash coding can effectively diminish the impact of noise.
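
A minimal sketch of the noise injection step is given below. It assumes that the SNR of 10 is expressed in decibels and that the noise power is set relative to the mean power of each signal; neither detail is stated above, and the function name is hypothetical.

```python
import numpy as np

def add_white_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean white Gaussian noise so that the expected SNR equals `snr_db` (in dB).
    Assumption: signal power is measured as the mean squared amplitude of the signal."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```

Under this convention, an SNR of 10 dB means the noise carries one tenth of the signal power.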

Conclusion

Existing graph structure learning faces a dual dilemma: it over-relies on local topological relations while ignoring global semantic similarity, and it struggles to capture node association features in sparse data scenarios. To address this, this paper proposes a Semantic Hashing Similarity Graph (SHSG) learning framework based on a supervisory mechanism. The framework systematically models the internal feature associations and cross-sample semantic similarity of ECG signals through a multi-level semantic fusion strategy, and provides a strongly discriminative initial graph structure for deep graph neural networks. SHSG first constructs a compact hash representation of the supervised signals based on semantic consistency constraints from label supervision. Second, a lightweight linear hash function is designed to generate a generalized representation of unobserved signals. Then, the hash space embeddings of the training and test samples are integrated to construct a globally traceable hash dictionary. Finally, the graph topology is generated based on Hamming similarity. To verify the effectiveness of the proposed method, experiments are carried out on the two-layer FastGCN architecture. The experimental results confirm the dual advantages of the proposed method in terms of feature characterization ability and computational efficiency.

Next, we will focus on the construction of unsupervised semantic similarity graphs and combine them with more advanced graph neural network (GNN) architectures, such as graph attention networks and graph isomorphism networks, to further enhance feature extraction and learning capabilities.