Introduction

Context and motivation

Cardiovascular diseases (CVDs) remain one of the leading global health challenges, contributing to millions of deaths annually and placing immense pressure on healthcare infrastructures. Early detection and timely intervention are critical, especially given the progressive nature of cardiac conditions. Traditionally, clinicians utilize auscultation—listening to heart sounds via a stethoscope—to identify abnormalities such as murmurs, extra heart sounds, and irregular rhythms1. While auscultation is deeply embedded in clinical practice, its accuracy relies significantly on clinician expertise and environmental conditions, making subtle murmurs challenging to detect consistently, particularly by less experienced clinicians or in noisy environments.

Recent advancements in machine learning (ML) and artificial intelligence (AI) present significant opportunities for automating aspects of the diagnostic process, enhancing reproducibility and consistency. Among diagnostic modalities in cardiology, the phonocardiogram (PCG) has emerged as an essential tool for AI-based analysis. PCGs offer rich temporal and spectral information regarding heart sounds, critical for identifying structural or functional heart anomalies. Initial computational approaches involving manual feature extraction combined with traditional classifiers like Support Vector Machines (SVMs) or Hidden Markov Models (HMMs) showed potential but required considerable domain expertise and struggled with scalability across diverse datasets2.

In recent years, deep learning methods, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have significantly advanced the accuracy of PCG-based classification, successfully detecting conditions such as mitral valve prolapse and hypertrophic cardiomyopathy. Despite these advancements, many AI-based methods remain opaque or “black-box” in nature, limiting clinical acceptance due to a lack of transparency and explainability in decision-making processes3. Clinicians require clearly interpretable models to justify their diagnostic decisions confidently, especially when identifying heart murmurs indicative of existing cardiac conditions.

Furthermore, for practical clinical deployment, AI models must exhibit robust performance across varied recording environments, including differences in microphone quality, sampling rates, and ambient noise levels encountered in clinical settings such as emergency rooms or outpatient clinics. The critical intersection of interpretability, reproducibility, and robust performance across diverse conditions remains a vital focus, highlighting the need for frameworks that clearly identify heart murmurs through transparent AI-driven analysis4.

Problem statement

The integration of AI in cardiac auscultation has demonstrated improved diagnostic accuracy; however, clinical mistrust remains significant due primarily to poor reproducibility of results across diverse clinical environments. Variability stemming from differences in recording equipment, ambient noise, patient demographics, and inconsistent model performance across distinct datasets severely limits clinical adoption and utility. Consequently, addressing reproducibility through robust, transparent, and consistently reliable AI frameworks is essential for meaningful integration into routine clinical practice.

Significance

The development of an interpretable, high-performance AI framework for heart sound diagnosis has significant implications for various stakeholders in the healthcare ecosystem:

  • Clinicians and healthcare providers: Transparent AI models enable rapid verification of results, helping clinicians to correlate the highlighted segments with established auscultation knowledge or patient histories. This accelerates diagnostic workflows, particularly in high-demand clinical environments.

  • Patients: Timely and accurate detection of cardiac conditions enhances the management of chronic heart diseases, reduces hospital readmissions, and significantly improves patients’ quality of life. Clear explanations of diagnoses also foster patient engagement and adherence to treatment plans.

  • Healthcare systems and policymakers: Interpretable AI solutions could relieve pressure on specialized cardiology services, reduce associated diagnostic costs, and establish uniform standards of care across healthcare settings.

In summary, interpretability is a crucial factor in transforming AI models from experimental tools into reliable clinical decision support systems (CDSS), bridging the gap between research and clinical practice5.

Objectives and contributions

This research introduces an attention-based Transformer architecture designed specifically for multi-class PCG classification, with the primary goal of enhancing both classification accuracy and model interpretability. By focusing on the why and how behind the model’s decisions, we aim to increase clinician acceptance and improve patient safety.

Specifically, we aim to develop a robust diagnostic pipeline capable of processing time-frequency representations of PCG signals (e.g., spectrograms or Mel-frequency cepstral coefficients) and providing confidence scores for conditions such as normal heart sounds, valvular stenosis, and regurgitation, and to integrate Grad-CAM (Gradient-weighted Class Activation Mapping) for post-hoc explainability, enabling clinicians to visualize which audio frames or frequency bands most influence the classification outcome. The contributions of this work are threefold:

  1. Adapting the transformer for PCGs: We redesign the Transformer architecture’s positional encodings, attention modules, and feedforward sublayers to better handle the unique temporal and spectral characteristics of heart sounds, improving the sensitivity to murmurs and other pathological cues.

  2. Incorporating explainability mechanisms: Beyond self-attention, we integrate Grad-CAM to provide localized visual evidence of anomalies, making the decision-making process more transparent and traceable.

  3. Extensive validation and benchmarking: We evaluate the model on several datasets, including the HeartWave dataset, supplemented by other publicly available sources like PhysioNet. Comparisons with common baselines, such as CNN-RNN hybrids, highlight improvements in both performance and interpretability.

Ultimately, this work aims to present a model that clinicians can understand, trust, and use confidently in real-world settings. By combining advanced deep learning techniques with explainable AI mechanisms, we offer a solid foundation for the next generation of transparent AI solutions in cardiovascular diagnostics.

Related work

This section reviews the evolution of heart sound analysis, from traditional signal-processing methods to contemporary deep learning frameworks6, with a focus on the challenges and opportunities related to interpretability in clinical contexts. The discussion is divided into four parts: traditional approaches, modern deep learning solutions, explainable AI (XAI) in healthcare7, and current gaps and future opportunities for research.

Traditional approaches to heart sound analysis

Earlier efforts in automated phonocardiogram (PCG) analysis heavily relied on hand-engineered features to represent the nuanced time-frequency structure of heart sounds. Researchers often employed wavelet transforms to isolate transient or non-stationary components indicative of cardiac anomalies such as murmurs, ejection clicks, or extra systolic sounds. The wavelet transform’s capacity for multi-resolution analysis proved effective in pinpointing short-duration events in a longer recording. For instance, abrupt energy spikes in specific frequency sub-bands could signal systolic murmurs characteristic of conditions like aortic stenosis or mitral regurgitation8.

Another widely adopted technique involved Mel-Frequency Cepstral Coefficients (MFCCs), adapted from speech processing9. By mapping frequencies to a perceptual (Mel) scale, MFCCs captured essential spectral features aligned with human auditory sensitivities. Although initially designed to recognize phonemes in spoken language, MFCCs provided an efficient representation of heart sound signals, distinguishing normal S1 and S2 segments from those containing unusual acoustic signatures.

Despite the relative success of these approaches, they depended greatly on the skill and assumptions of researchers. Feature selection, thresholding, and pre-segmentation heuristics required detailed knowledge of cardiac auscultation principles and thorough experimentation across multiple datasets. Moreover, noise sensitivity was a recurring obstacle: real-world PCG recordings often include stethoscope friction, patient motion, or ambient clinical sounds, necessitating elaborate filtering pipelines. Thus, transferring a carefully tuned algorithm from one environment (e.g., a quiet lab) to another (e.g., a busy hospital ward) frequently led to performance drops10.

Beyond the noise issue, scalability also emerged as a limitation. Many foundational studies were tested on relatively small or proprietary datasets, sometimes featuring a few hundred recordings. Such constraints hindered cross-validation, obscured the influence of hyperparameter choices, and complicated comparisons across different research groups. Consequently, while traditional methods demonstrated the potential of machine-driven heart sound diagnosis, they also underscored the need for more data-driven, noise-robust, and scalable strategies, paving the way for the next generation of techniques centered around deep learning11.

Deep learning in heart sound diagnosis

With the advent of increased computational power and growing availability of larger PCG repositories, deep learning approaches began to displace traditional feature-engineering pipelines. Convolutional Neural Networks (CNNs)12 were among the first to gain traction, leveraging architectures that treat spectrograms or scalograms of heart sounds similarly to images. In this paradigm, trainable convolution kernels automatically learn low-level patterns (e.g., short bursts corresponding to S1 or S2) and higher-level, class-discriminative features (e.g., murmurs or abnormal frequency bands). This end-to-end structure dramatically reduced the reliance on hand-tuned wavelets or MFCC parameters, showcasing better adaptability to new data sources and varied noise levels.

Subsequently, Recurrent Neural Networks (RNNs)13 were explored to address the sequential nature of heart sounds. Long Short-Term Memory (LSTM)14 and Gated Recurrent Unit (GRU)15 architectures proved particularly effective in capturing longer-term dependencies, such as the interval between S1 and S2 or the transition from systole to diastole. By combining CNN-based local feature extraction with RNN-based temporal modeling, hybrid CNN-RNN frameworks emerged, providing a structured way to account for both spectral details and evolving heartbeat patterns. Many studies reported significant gains in classification accuracy, especially for tasks involving multi-class discrimination of various valvular diseases or rare congenital anomalies.

More recently, Transformers16 adapted from natural language processing have begun to make inroads into heart sound analysis. Their multi-head self-attention mechanism allows the model to “attend” to different segments of the PCG signal in parallel, potentially identifying murmurs or extra heart sounds across multiple beats17. Notably, Partovi et al.18 conducted a comprehensive survey of deep learning models for heart sound analysis and benchmarked numerous attention-based, convolutional, and recurrent approaches. Their study emphasized that while CNNs and autoencoders often achieved high performance in specific case studies, the generalizability and reproducibility of results varied widely due to dataset inconsistencies and evaluation mismatches. The authors recommended standardized datasets and interpretability-aware design principles for future model development. This work offers critical context for the integration of attention mechanisms and highlights the necessity of robust validation for clinical deployment.

Nevertheless, deep learning approaches collectively represent a major leap forward in heart sound classification19. They reduce the overhead of manual feature crafting, provide improved performance under diverse acoustic conditions, and pave the way for holistic models that incorporate additional data streams such as patient demographics or concurrent ECG signals. Yet, as these architectures become more complex, the interpretability of their outputs—a vital concern in clinical environments—has emerged as a key research priority20.

Connections between heart murmurs and PCG

Heart murmurs represent audible vibrations generated by turbulent blood flow, and their characteristics provide crucial diagnostic information. These murmurs manifest in PCG signals as distinct time-frequency patterns associated with specific cardiac phases. For example, systolic murmurs such as those seen in aortic stenosis or mitral regurgitation appear between the first (S1) and second (S2) heart sounds and exhibit high-frequency components. In contrast, diastolic murmurs, like those associated with aortic regurgitation or mitral stenosis, follow S2 and typically display lower-frequency energy over longer durations.

Partovi et al.18 offer a detailed analysis of murmur types and their acoustic signatures within PCG recordings. They highlight that systolic murmurs are often sharper and shorter, whereas diastolic murmurs tend to be more prolonged and subtle. The review emphasizes that accurate murmur classification requires robust segmentation of heart cycles and careful preservation of signal fidelity during preprocessing. Furthermore, congenital anomalies such as ventricular septal defects (VSD) or patent ductus arteriosus (PDA) produce continuous murmurs that span both systolic and diastolic phases, distinguishable in PCG by sustained high-amplitude regions. This mapping between clinical murmurs and their PCG features forms a foundation for developing interpretable deep learning systems.

Explainable AI in healthcare

Alongside growing accuracy, the need for trustworthy and clinically interpretable predictions has gained momentum in recent years. Healthcare practitioners often request evidence or rationales for an algorithm’s conclusion, particularly if a diagnosis could lead to invasive procedures or significant treatment changes21. Explainable AI (XAI)22 seeks to address this demand, introducing techniques that help demystify the decision-making process of complex neural networks:

  • LIME (Local Interpretable Model-Agnostic Explanations): Generates a simplified surrogate model around a specific instance to approximate the influence of individual input features. For heart sound data, LIME has been adapted to highlight important frames or spectral bins contributing to a predicted label.

  • SHAP (SHapley Additive exPlanations): Attributes each feature’s contribution based on cooperative game theory, offering consistent and theoretically grounded explanations. When applied to PCG signals, SHAP can quantify how certain frequency components or time segments shift a prediction toward a pathological class.

  • Grad-CAM (Gradient-Weighted Class Activation Mapping): Creates heatmaps overlaid on time-frequency representations, indicating which regions the network identifies as key for classification. This technique, widely used in image analysis, has been extended to one-dimensional or spectrogram-based heart sound inputs, helping clinicians see whether the model’s focus aligns with suspected murmurs.

In practice, however, these methods can require domain-specific adaptation23. A typical spectrogram overlay might not inherently convey whether a murmur is diastolic or systolic. Clinicians might prefer an explanation that marks an abnormal S2 split or a midsystolic click, clarifying how the model’s attention correlates with known pathophysiological events. Thus, the granularity and clinical relevance of XAI outputs remain pivotal, demanding further research on how to refine these tools to fit cardiologists’ existing mental models of heart sound interpretation24.

Time growing neural networks (TGNNs) in heart sound analysis

Time Growing Neural Networks (TGNNs) have been widely employed over the past decade for cardiovascular disease classification tasks25. TGNNs are designed to model temporal growth patterns in sequential data, effectively capturing evolving features across time segments such as systolic and diastolic phases in heart sound signals. By dynamically expanding their architecture, TGNNs can adapt to variable-length cardiac cycles and focus separately on physiologically relevant intervals.

TGNNs have demonstrated strong performance in discriminating various cardiac abnormalities, leveraging their ability to explicitly model temporal growth and changes within the cardiac cycle. However, these models often rely on sequential processing and may lack the ability to globally attend to all time points simultaneously, potentially limiting their capacity to capture long-range dependencies and complex interactions between systolic and diastolic events.

Moreover, TGNNs typically provide limited interpretability, as their dynamic structure and evolving weights are harder to visualize and correlate directly with clinical features compared to attention mechanisms. This presents challenges in clinical adoption, where explainability is crucial.

Our work builds on these insights by employing a Transformer-based attention mechanism that enables flexible, parallel modeling of the entire cardiac cycle, capturing both local murmur-level features and global cycle-level context with integrated explainability.

Research gaps and challenges

Despite the strides achieved in classification accuracy, multiple research gaps persist. First, multi-class classification, covering a wide spectrum of valvular disorders, arrhythmic events, and congenital anomalies, continues to pose challenges. Many published studies reduce tasks to a binary problem (normal vs. abnormal) or limit themselves to a handful of prevalent conditions. Expanding the range of conditions tackled by deep networks can improve their utility in general clinical practice, but it requires more comprehensive datasets and rigorous generalization strategies.

Second, efforts to tailor XAI methods specifically for heart sound data remain in their infancy. While saliency maps or feature attributions provide a starting point, bridging the gap between these outputs and clinically interpretable markers, such as the shape of a murmur or the ratio of S1 to S2 intervals, still lacks systematic solutions. Aligning explanations with medical knowledge, for instance by matching Grad-CAM hotspots to annotated systolic phases, may substantially boost confidence among cardiologists.

Third, robustness in the face of noise and demographic heterogeneity warrants deeper exploration. Real-world recordings vary in patient age, body habitus, and comorbidities, introducing patterns that may not appear in controlled datasets. Systems that can adapt, or at least detect potential mismatches, could avert misclassifications and guide clinicians toward secondary confirmatory tests.

Given these open questions, this work proposes an attention-centric framework designed to accommodate multi-class PCG classification while delivering interpretable insights. By leveraging the global reach of Transformer-like architectures and refining XAI outputs for domain specificity, we aim to address the dual imperatives of accuracy and clinical trustworthiness, ultimately bridging current gaps and laying a foundation for more robust, transparent cardiovascular diagnostics.

Dataset description

This section provides an overview of the heart sound datasets employed for cardiovascular disease (CVD) analysis, focusing on the coverage of normal and abnormal conditions, annotation detail, and demographic diversity. Each dataset highlights different clinical or technical scenarios, offering a rich testing ground for machine learning models aimed at automated auscultation. Two tables summarize the characteristics and notable limitations of these repositories, reflecting variations in sampling rates, device types, and labeling precision.

HeartWave dataset

The HeartWave dataset consists of 1,353 heart sound recordings, each belonging to one of nine clinical classes26. These classes capture both normal heart sounds (S1, S2) and prevalent pathologies such as aortic stenosis, mitral regurgitation, and pulmonary stenosis. Table 1 outlines key aspects, including the number of recordings, demographic spread, average recording duration, and annotation quality.

Table 1 HeartWave dataset summary.

As summarized in Table 1, HeartWave’s distinguishing features include detailed cardiologist annotations that precisely label systole, diastole, and any additional heart sound events (e.g., S3, S4). Where murmurs are identified, severity is graded from 1 to 6, aligning with standard clinical practice correlating murmur loudness to possible lesion significance. In addition, the dataset captures auscultation locations (e.g., mitral, aortic, pulmonary), enabling targeted analysis of region-specific pathologies. Taken together, these elements make HeartWave an excellent basis for investigating multi-class classification and advanced explainability strategies.

Heart sound repositories

Beyond HeartWave, several open-access databases offer valuable breadth and variety for evaluating algorithmic generalization. Table 2 compares these repositories, highlighting their key attributes and any known limitations. Some feature pediatric cohorts, while others emphasize specific valvular diseases or artifact-heavy recordings. Such differences allow researchers to probe model robustness and cross-demographic adaptability.

Table 2 Comparative overview of open-access heart sound datasets.

CirCor DigiScope27 stands out for its extensive pediatric collection, offering timing and pitch annotations that can expose age-related diagnostic patterns. PhysioNet/CinC 201628 is widely used for normal-versus-abnormal classification, yet exhibits higher noise levels and a narrower disease spectrum. Pascal Datasets29 A and B capture data with elevated sampling frequencies (44 kHz), encouraging exploration of high-resolution spectral details, although their relatively small sample counts constrain broad pathological analysis. Meanwhile, the GitHub open-access repository30 collects data from varied sources, mostly focusing on valve-specific abnormalities such as aortic stenosis or mitral valve prolapse, at an 8 kHz sampling rate. Finally, the Heart Sounds Shenzhen dataset31 focuses on mild, moderate, or severe categorization of valvular heart disease, but lacks attention to congenital or rarer anomalies. Taken together, these datasets reveal a broad continuum of clinical contexts, noise profiles, and labeling protocols:

  • Multi-class complexity: HeartWave and GitHub highlight multi-class and valvular-specific distinctions, while PhysioNet focuses on a simpler normal/abnormal paradigm.

  • Demographic variation: CirCor emphasizes pediatric recordings, contrasting with the adult-centric orientation in HeartWave, Shenzhen, and most Pascal data.

  • Noise vs. Fidelity: PhysioNet includes notable ambient interference, whereas Pascal can achieve higher fidelity but with fewer recordings. HeartWave lies in a middle ground, balancing clinical realism and moderate noise levels.

These differences underscore the necessity of evaluating classification algorithms under multiple acoustic conditions and disease distributions. The synergy of HeartWave’s rich annotation and broad pathology coverage with other specialized repositories helps ensure that models developed are not narrowly tuned to a single patient population or recording device. By merging these complementary sources, the study fosters a more nuanced understanding of how machine learning systems respond to diverse clinical settings, ultimately reinforcing the pursuit of accurate and interpretable CVD detection across varied patient profiles32.

Heart abnormalities and representative PCG patterns

To aid interpretability and dataset transparency, we summarize below the key heart abnormalities included across the datasets used in this study. Each abnormality is associated with its typical murmur type and corresponding PCG signature.

Table 3 Heart abnormalities used in this study and their PCG characteristics.
Fig. 1: Representative PCG waveforms for different cardiac conditions. Each subplot illustrates the characteristic time-domain morphology associated with a specific abnormality.

Figure 1 presents time-domain PCG signal excerpts from multiple datasets, annotated to indicate murmur regions and S1/S2 markers. This visual comparison complements the table and facilitates a clearer understanding of the diagnostic landscape our model must navigate.

Proposed framework

This section details our end-to-end pipeline for automated heart sound analysis using the proposed Explainable HeartSound Transformer (EHST). The framework is designed to process diverse PCG (phonocardiogram) data from various sources (e.g., HeartWave, CirCor DigiScope, PhysioNet/CinC 2016, Pascal A/B, GitHub Open Access, and Heart Sounds Shenzhen) while ensuring consistent data handling, robust model training, and transparent interpretability.

EHST comprises five main components: data input and segmentation, data preprocessing and feature extraction, Transformer-based encoding, classification, and explainability. Figure 2 presents a conceptual overview of the pipeline. As shown in Table 3, different heart abnormalities can be identified by their unique PCG characteristics.

The overall architecture of EHST is illustrated in Fig. 2.

Fig. 2: Architecture of the proposed approach.

Data preprocessing

The process begins with the acquisition of raw PCG signals, which are then segmented into individual heartbeats by detecting characteristic peaks (S1 and S2). Next, the segmented signals are preprocessed by applying noise removal, normalization, and transformation into time-frequency representations such as spectrograms or Mel-Frequency Cepstral Coefficients (MFCCs). Data augmentation is applied to account for inter-device and inter-patient variability. The preprocessed segments are then passed to the Transformer-based encoder, which utilizes multi-head self-attention to learn rich temporal and spectral representations. These representations are pooled and fed into fully connected layers for final classification. Finally, explainability modules (e.g., Grad-CAM and attention visualization) are integrated to provide clinical insights into the model’s decisions.

Robust preprocessing is critical to convert raw PCG signals into model-ready features while preserving clinically relevant information. Our pipeline explicitly performs segmentation before time-frequency (TF) transformation to align the data with physiological cardiac cycles, which enhances clinical interpretability.

Segmentation and peak detection

Raw PCG signals \(x_{raw}(t)\) are segmented into individual heartbeats by detecting characteristic peaks corresponding to the first (S1) and second (S2) heart sounds. For datasets with pre-annotated peaks (e.g., HeartWave, CirCor DigiScope), these annotations are directly utilized. For datasets lacking full annotations (e.g., PhysioNet, Pascal B), we employ an automated peak detection method based on the wavelet transform:

$$\begin{aligned} \mathcal {P} = \left\{ p \mid \text {corr}\left( \Psi (x_{raw}), \Psi _{\text {template}}\right) > \delta \right\} , \end{aligned}$$
(1)

where \(\Psi (\cdot )\) denotes the wavelet transform of the signal, \(\Psi _{\text {template}}\) is a canonical wavelet template, and \(\delta\) is an empirically chosen threshold (typically between 0.6 and 0.8). The detected peaks \(\mathcal {P}\) define the boundaries for segmenting complete heartbeats.
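For concreteness, a minimal Python sketch of this template-correlation step is shown below. The Morlet wavelet, the scale range, the 200 ms minimum peak spacing, and the pre-computed template file (s1_template.npy) are illustrative assumptions rather than the exact implementation used here.

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

def detect_heartbeat_peaks(x_raw, fs, delta=0.7):
    """Sketch of Eq. (1): keep locations whose wavelet-domain correlation
    with a canonical template exceeds the threshold delta (0.6-0.8)."""
    # Wavelet transform of the raw PCG signal (Morlet scales chosen for illustration)
    scales = np.arange(1, 64)
    coeffs, _ = pywt.cwt(x_raw, scales, "morl", sampling_period=1.0 / fs)
    envelope = np.abs(coeffs).mean(axis=0)            # scale-averaged energy envelope

    # Canonical wavelet template of a clean S1/S2 complex (hypothetical pre-computed file)
    template = np.load("s1_template.npy")
    template = (template - template.mean()) / (template.std() + 1e-8)
    env_norm = (envelope - envelope.mean()) / (envelope.std() + 1e-8)

    # Per-sample correlation against the template
    corr = np.correlate(env_norm, template, mode="same") / len(template)

    # Retain peaks above delta, at least 200 ms apart (refractory assumption)
    peaks, _ = find_peaks(corr, height=delta, distance=int(0.2 * fs))
    return peaks
```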

Window length adaptation

Because heartbeats vary in length—especially due to diastolic duration changes across populations—each segmented heartbeat is adaptively adjusted in window length to normalize its duration. For instance, pediatric datasets (e.g., CirCor DigiScope) have shorter windows reflecting faster heart rates, while adult datasets (e.g., HeartWave) allow longer windows up to 1 second per cycle. This adaptive windowing reduces variability that could otherwise degrade TF representation quality.

Time-frequency transformation

Each normalized heartbeat segment is transformed into a time-frequency representation. For typical sampling frequencies (e.g., 2–4 kHz in HeartWave and Shenzhen), we compute the Short-Time Fourier Transform (STFT) using a Hamming window of length 20 ms with 50% overlap:

$$\begin{aligned} X(m, \omega ) = \sum _{n=-\infty }^\infty x(n) \, w(n - m) \, e^{-j \omega n}, \end{aligned}$$
(2)

where w(n) is the Hamming window function, m denotes the time shift, and \(\omega\) is the frequency variable. For higher sampling rates (e.g., Pascal datasets), we either downsample or adjust the window parameters to preserve frequency resolution within the 20–800 Hz band of clinical relevance. Alternatively, Mel-Frequency Cepstral Coefficients (MFCCs) are computed for moderate sampling frequencies to capture spectral features effectively.
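A short SciPy sketch of this TF step follows; the 20 ms Hamming window and 50% overlap come from the text, while restricting the output to the 20–800 Hz band and returning the magnitude spectrogram are choices made for illustration.

```python
import numpy as np
from scipy.signal import stft

def heartbeat_spectrogram(segment, fs):
    """STFT of one normalized heartbeat (Eq. 2): 20 ms Hamming window, 50% overlap."""
    nperseg = int(0.020 * fs)                        # 20 ms window
    noverlap = nperseg // 2                          # 50% overlap
    freqs, times, X = stft(segment, fs=fs, window="hamming",
                           nperseg=nperseg, noverlap=noverlap)
    band = (freqs >= 20) & (freqs <= 800)            # clinically relevant band
    return freqs[band], times, np.abs(X[band, :])    # magnitude spectrogram
```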

Normalization and data augmentation

Following TF transformation, features are normalized using z-score normalization:

$$\begin{aligned} \textbf{F}_{\text {norm}} = \frac{\textbf{F} - \mu }{\sigma }, \end{aligned}$$
(3)

where \(\textbf{F}\) denotes the extracted feature matrix, and \(\mu\), \(\sigma\) are the mean and standard deviation computed over the training data. To improve model robustness, data augmentation methods—such as noise injection, time-stretching, and random cropping—are applied to simulate inter-device and inter-patient variability.
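The sketch below illustrates the normalization of Eq. (3) together with the three augmentations mentioned above; the noise level, stretch range, and crop length are assumed values for demonstration only.

```python
import numpy as np

def zscore_normalize(features, mu, sigma):
    """Eq. (3): z-score normalization using training-set statistics mu, sigma."""
    return (features - mu) / (sigma + 1e-8)

def augment_waveform(x, fs, rng=None):
    """Illustrative waveform-level augmentation: noise injection,
    time-stretching, and random cropping."""
    rng = rng or np.random.default_rng()

    # Additive Gaussian noise (amplitude chosen for illustration)
    x_aug = x + rng.normal(0.0, 0.1 * x.std(), size=x.shape)

    # Naive time-stretch by linear resampling with a factor in [0.9, 1.1]
    factor = rng.uniform(0.9, 1.1)
    idx = np.linspace(0, len(x_aug) - 1, int(len(x_aug) * factor))
    x_aug = np.interp(idx, np.arange(len(x_aug)), x_aug)

    # Random crop back to one nominal cycle (assumed 1 s here)
    crop = min(len(x_aug), int(fs))
    start = rng.integers(0, len(x_aug) - crop + 1)
    return x_aug[start:start + crop]
```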

Preprocessing order

Segmenting before applying TF transformations ensures that feature extraction aligns with physiologically meaningful units (complete cardiac cycles). While diastolic variability causes variable segment lengths, adaptive windowing mitigates potential disruptions in TF resolution. Careful tuning of TF parameters (window length, overlap) further preserves temporal and spectral fidelity. This approach enhances the clinical interpretability of the features and supports reliable classification.

The entire preprocessing procedure is summarized in Algorithm 1.

Algorithm 1: Preprocessing Algorithm for Multiple PCG Datasets.

Transformer-based feature extractor

EHST leverages a multi-head Transformer encoder to learn rich, contextual representations from the preprocessed PCG data. Given a heartbeat segment represented as a sequence of T frames, the Transformer encoder applies self-attention to capture both local (murmur-level) and global (cycle-level) dependencies.

Multi-head self-attention and positional encoding

Each frame is first projected into an embedding space:

$$\begin{aligned} \textbf{e}_t = \textbf{F}_t \textbf{W}_e + \textbf{b}_e + \text {PE}(t), \end{aligned}$$
(4)

where \(\text {PE}(t)\) denotes the positional encoding. We adopt a sinusoidal positional encoding:

$$\begin{aligned} \begin{aligned} \textbf{PE}(t,2i)&= \sin \Bigl (\frac{t}{10000^{2i/d_{model}}}\Bigr ),\\ \textbf{PE}(t,2i+1)&= \cos \Bigl (\frac{t}{10000^{2i/d_{model}}}\Bigr ). \end{aligned} \end{aligned}$$
(5)

The multi-head self-attention mechanism then computes attention scores between all pairs of frames, which are aggregated to form a refined representation:

$$\begin{aligned} \alpha _{t,\tau }^{(h)} = \frac{\exp \Bigl (\frac{(\textbf{Q}_t^{(h)})^\top \textbf{K}_\tau ^{(h)}}{\sqrt{d_k}}\Bigr )}{\sum _{\tau '=1}^{T}\exp \Bigl (\frac{(\textbf{Q}_t^{(h)})^\top \textbf{K}_{\tau '}^{(h)}}{\sqrt{d_k}}\Bigr )}, \end{aligned}$$
(6)

where \(\textbf{Q}^{(h)}\), \(\textbf{K}^{(h)}\), and \(\textbf{V}^{(h)}\) are the query, key, and value matrices for head h, respectively.
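The following PyTorch sketch shows how Eqs. (4)–(6) can be realized with a standard Transformer encoder; the embedding size, number of heads, and layer count are placeholder values, not the tuned hyperparameters reported in Table 4.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Sinusoidal positional encoding of Eq. (5)."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                         # x: (batch, T, d_model)
        return x + self.pe[: x.size(1)]

class PCGEncoder(nn.Module):
    """Frame embedding (Eq. 4) followed by a multi-head Transformer encoder (Eq. 6)."""
    def __init__(self, n_feat, d_model=128, n_heads=4, n_layers=4, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(n_feat, d_model)   # F_t W_e + b_e
        self.pos = SinusoidalPE(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames):                    # frames: (batch, T, n_feat)
        return self.encoder(self.pos(self.embed(frames)))   # (batch, T, d_model)
```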

Optional cross-attention for clinical data

For datasets that provide additional clinical variables (e.g., demographics), cross-attention is incorporated. Here, the PCG embeddings act as queries while clinical vectors serve as keys and values. This helps contextualize the learned features with patient-specific information.

The Transformer encoding procedure is summarized in Algorithm 2.

Algorithm 2: Transformer Encoder for PCG Frames.

Classification module

The output of the Transformer encoder is a sequence of hidden states, which is then aggregated into a single vector for each heartbeat. We adopt average pooling:

$$\begin{aligned} \textbf{h}_{\text {pool}} = \frac{1}{T}\sum _{t=1}^{T}\textbf{H}_t^{(L)}. \end{aligned}$$
(7)

The pooled vector is fed into fully connected layers with ReLU activations, followed by a softmax layer to generate class probability distributions:

$$\begin{aligned} \textbf{z}= & \textbf{W}_{out}\,\text {ReLU}(\textbf{h}_{\text {pool}}\textbf{W}_1 + \textbf{b}_1) + \textbf{b}_{out}, \end{aligned}$$
(8)
$$\begin{aligned} \hat{p}_c= & \frac{\exp (z_c)}{\sum _{i=1}^{C}\exp (z_i)}, \quad c=1,\dots ,C. \end{aligned}$$
(9)
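A minimal PyTorch sketch of Eqs. (7)–(9) is given below; the hidden width of the fully connected layer is an assumed value.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average pooling (Eq. 7), one ReLU hidden layer (Eq. 8), softmax (Eq. 9)."""
    def __init__(self, d_model, n_classes, d_hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, n_classes)

    def forward(self, hidden_states):             # (batch, T, d_model)
        h_pool = hidden_states.mean(dim=1)                    # Eq. (7)
        z = self.out(torch.relu(self.fc1(h_pool)))            # Eq. (8): logits
        # Eq. (9); when training with cross-entropy, the raw logits z would
        # typically be returned and the softmax applied inside the loss.
        return torch.softmax(z, dim=-1)
```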

Algorithm 3 summarizes the classification process.

Algorithm 3: Classification Algorithm.

Explainability module

A key strength of EHST is its transparent decision-making, achieved through integrated explainability modules.

Grad-CAM for spectrogram visualization

When processing spectrogram inputs, we apply Grad-CAM to generate heatmaps that highlight critical frequency-time regions contributing to a class prediction. Let \(\textbf{A} \in \mathbb {R}^{F \times T \times K}\) denote a higher-level activation map. The importance of channel k for class c is computed as:

$$\begin{aligned} \alpha _k^{(c)} = \frac{1}{Z} \sum _{f,t} \frac{\partial \hat{p}_c}{\partial A_k(f,t)}, \end{aligned}$$
(10)

and the final Grad-CAM heatmap is given by:

$$\begin{aligned} H_c(f,t) = \text {ReLU}\Bigl (\sum _{k} \alpha _k^{(c)} \, A_k(f,t)\Bigr ). \end{aligned}$$
(11)

This visualization enables clinicians to verify that EHST focuses on relevant segments, such as systolic or diastolic murmur intervals.
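The Grad-CAM computation of Eqs. (10)–(11) can be sketched as follows, assuming the activation maps have been retained in the autograd graph during the forward pass.

```python
import torch

def grad_cam_heatmap(activations, class_prob):
    """Sketch of Eqs. (10)-(11).

    activations : tensor of shape (K, F, T), an intermediate feature map kept
                  in the autograd graph of the forward pass.
    class_prob  : scalar predicted probability for the target class c.
    """
    # Eq. (10): channel weights = global average of the gradients
    grads = torch.autograd.grad(class_prob, activations, retain_graph=True)[0]
    alpha = grads.mean(dim=(1, 2))                                   # (K,)

    # Eq. (11): weighted sum over channels followed by ReLU
    heatmap = torch.relu((alpha[:, None, None] * activations).sum(dim=0))
    return heatmap / (heatmap.max() + 1e-8)                          # (F, T), scaled for display
```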

Attention weight visualization

For the self-attention layers, we average attention scores across all heads:

$$\begin{aligned} \beta _{t,\tau } = \frac{1}{N_{\text {heads}}}\sum _{h=1}^{N_{\text {heads}}}\alpha _{t,\tau }^{(h)}. \end{aligned}$$
(12)

Plotting \(\beta _{t,\tau }\) as a matrix or as row-sum plots reveals the temporal regions the model deems most critical, offering another layer of interpretability.
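A small helper corresponding to Eq. (12), assuming the per-head attention maps are returned by the encoder as a tensor of shape (n_heads, T, T):

```python
import torch

def average_attention(attn_per_head):
    """Eq. (12): average the per-head attention maps into a single (T, T) matrix."""
    beta = attn_per_head.mean(dim=0)
    frame_importance = beta.sum(dim=0)   # aggregate attention received by each frame
    return beta, frame_importance
```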

Algorithm 4 outlines the process of generating explainable outputs using both Grad-CAM and attention visualization.

Algorithm 4: Explainable AI Generation.

By integrating these modules, EHST not only achieves high diagnostic performance but also provides transparent, clinically interpretable insights that can support and augment traditional cardiac auscultation.

Performance metrics

This section details the evaluation framework for our Transformer-based heart sound classification system. We describe the loss functions employed to address class imbalance, present the hyperparameter configurations for our model and ten baseline approaches, and outline a comprehensive set of metrics used to assess both classification performance and interpretability.

Loss functions

Addressing class imbalance

Heart sound datasets are often imbalanced, with normal recordings typically outnumbering pathological cases. To prevent the model from favoring majority classes, we adopt loss functions that assign higher penalties to misclassifications of minority classes. This is essential to improve the model’s sensitivity to less frequent but clinically significant conditions.

Weighted cross-entropy

For a classification task with \(C\) classes, let \(\textbf{y} \in \{0,1\}^C\) denote the one-hot encoded label and \(\hat{\textbf{p}} \in \mathbb {R}^C\) the predicted probability distribution. The weighted cross-entropy loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {WCE}} = - \sum _{c=1}^{C} w_c \, y_c \, \log (\hat{p}_c), \end{aligned}$$
(13)

where \(w_c\) is inversely proportional to the frequency of class \(c\), thus penalizing errors on underrepresented classes more heavily. This loss function ensures that the model’s performance does not suffer when dealing with classes that are inherently less frequent in the dataset.
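In PyTorch, the weighted loss of Eq. (13) can be instantiated as below; the balanced inverse-frequency weighting shown is one common convention, and the loss is applied to the logits of Eq. (8).

```python
import numpy as np
import torch
import torch.nn as nn

def make_weighted_ce(train_labels, n_classes):
    """Weighted cross-entropy (Eq. 13) with weights inversely proportional to
    class frequency; the balanced scaling below is one common choice."""
    counts = np.bincount(train_labels, minlength=n_classes).astype(float)
    weights = counts.sum() / (n_classes * np.maximum(counts, 1.0))
    # Note: nn.CrossEntropyLoss expects raw logits, not softmax outputs.
    return nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```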

Focal loss

In scenarios of severe imbalance, focal loss further reduces the relative loss for well-classified examples and emphasizes harder samples. It is defined as:

$$\begin{aligned} \mathcal {L}_{\text {focal}} = - \sum _{c=1}^{C} \alpha _c \, y_c \, (1 - \hat{p}_c)^{\gamma } \, \log (\hat{p}_c), \end{aligned}$$
(14)

where \(\gamma > 0\) is the focusing parameter and \(\alpha _c\) are class-specific weights. In our experiments, weighted cross-entropy is applied for moderate imbalance, whereas focal loss is used when minority classes are particularly sparse.
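A compact focal-loss sketch following Eq. (14) is given below; gamma = 2 is a common default rather than the value tuned in our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss (Eq. 14) for integer targets.

    logits : (batch, C) raw scores, targets : (batch,) class indices,
    alpha  : (C,) class-specific weights, gamma : focusing parameter.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-probability of the true class
    pt = log_pt.exp()
    loss = -alpha[targets] * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```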

Hyperparameter tuning

Transformer-based model configuration

Table 4 summarizes the default hyperparameters for our Transformer-based classifier, determined through grid searches and pilot studies across multiple heart sound datasets.

Table 4 Default hyperparameters for the proposed transformer-based classifier.

Baseline model hyperparameters

Table 5 outlines typical hyperparameter settings for ten common baseline models spanning CNN, RNN, and hybrid architectures. These serve as benchmarks for comparing our proposed approach.

Table 5 Hyperparameter summaries for ten common baseline models.

Regularization is achieved through dropout (0.1–0.3) and weight decay (\(10^{-5}\)–\(10^{-4}\)). Hyperparameter optimization is performed via manual/random search followed by systematic methods such as Bayesian optimization, with early stopping based on validation F1-score or loss.

Evaluation metrics

In this section, we describe the comprehensive set of evaluation metrics employed to assess both the classification performance and model interpretability. These metrics allow us to evaluate the model on multiple fronts, ensuring robustness, reliability, and clinical relevance.

Classification metrics

Accuracy:

$$\begin{aligned} \text {Accuracy} = \frac{\text {Total Correct Predictions}}{\text {Total Samples}}. \end{aligned}$$

While accuracy is a commonly used metric, it can be misleading in imbalanced datasets where a model can achieve high accuracy by predominantly predicting the majority class. Therefore, accuracy must be considered alongside other metrics, especially in scenarios with rare conditions.

Precision, recall, and F1-score:

For each class \(c\), with \(TP_c\), \(FP_c\), and \(FN_c\) denoting true positives, false positives, and false negatives respectively:

$$\begin{aligned} \text {Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \text {Recall}_c = \frac{TP_c}{TP_c + FN_c}, \end{aligned}$$

where precision evaluates the proportion of correct positive predictions, and recall measures the ability to correctly identify positive instances. The F1-score combines these two metrics as the harmonic mean:

$$\begin{aligned} \text {F1}_c = 2 \times \frac{\text {Precision}_c \times \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c}. \end{aligned}$$

Macro-F1 computes the F1-score independently for each class and then takes the average, while weighted-F1 adjusts for class imbalance by considering the support (the number of true instances for each class).

Area under the ROC curve (AUC):

The AUC is computed for each class using a one-vs.-rest approach. An AUC value above 0.9 generally indicates robust discriminative performance. This metric is particularly useful for understanding model behavior in differentiating between normal and abnormal heart sounds, even in the presence of noisy data or limited samples.

Confusion matrix:

The confusion matrix provides detailed insights into the class-wise performance of the model, showing true positives, false positives, false negatives, and true negatives. It is a useful tool for identifying potential areas for improvement, such as misclassifications between similar classes (e.g., between normal and mild cases).

Matthews correlation coefficient (MCC):

The MCC measures the quality of binary and multi-class classifications and provides a balanced evaluation even for imbalanced datasets:

$$\begin{aligned} \text {MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$

An MCC score close to +1 indicates excellent performance, 0 indicates no better than random predictions, and -1 indicates total disagreement between prediction and ground truth.
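All of the above metrics are available in scikit-learn; the helper below gathers them for one cross-validation fold (per-class probabilities are needed for the one-vs.-rest AUC).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

def summarize_fold(y_true, y_pred, y_prob):
    """Collects the classification metrics described above for one fold;
    y_prob holds per-class probabilities for the one-vs.-rest AUC."""
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "macro_f1":    f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "auc_ovr":     roc_auc_score(y_true, y_prob, multi_class="ovr"),
        "mcc":         matthews_corrcoef(y_true, y_pred),
        "confusion":   confusion_matrix(y_true, y_pred),
    }
```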

Explainability metrics

Explainability is essential for building trust in AI models in clinical settings. We evaluate the transparency of our model using several metrics, including Grad-CAM visualizations and SHAP (SHapley Additive exPlanations).

Overlap with expert annotations:

For datasets with annotated murmur segments, we measure the degree of alignment between the model’s Grad-CAM or attention heatmaps and the expert annotations. We define the overlap ratio as follows:

$$\begin{aligned} \text {OverlapRatio} = \frac{\sum _{(f,t)\,\in \,\text {annotation}} H_c(f,t)}{\sum _{(f,t)} H_c(f,t)}, \end{aligned}$$

where \(H_c(f,t)\) denotes the importance map for class \(c\), and the numerator sums over the frequency bins \(f\) and time steps \(t\) that fall within the annotated region.

Intersection-over-Union (IoU):

IoU measures the spatial agreement between the model’s highlighted areas and expert annotations. By thresholding \(H_c(f,t)\) to form a binary mask, we compute:

$$\begin{aligned} \text {IoU} = \frac{|\text {Mask} \cap \text {Annotated Region}|}{|\text {Mask} \cup \text {Annotated Region}|}. \end{aligned}$$

This metric quantifies how well the model identifies the correct regions in the time-frequency space corresponding to pathological heart sounds, such as systolic or diastolic murmurs.
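Both alignment metrics reduce to a few NumPy operations once the heatmap \(H_c\) and a boolean annotation mask share the same time-frequency grid; the 0.5 threshold below is an illustrative choice.

```python
import numpy as np

def overlap_ratio(heatmap, annotation_mask):
    """Fraction of total heatmap mass that falls inside the expert-annotated region."""
    return float(heatmap[annotation_mask].sum() / (heatmap.sum() + 1e-8))

def iou(heatmap, annotation_mask, threshold=0.5):
    """IoU between the thresholded heatmap and the annotated murmur region."""
    pred_mask = heatmap >= threshold
    intersection = np.logical_and(pred_mask, annotation_mask).sum()
    union = np.logical_or(pred_mask, annotation_mask).sum()
    return float(intersection / (union + 1e-8))
```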

SHAP analysis:

In addition to Grad-CAM, we use SHAP to provide further insight into how different features contribute to model decisions. SHAP values decompose the prediction for a particular instance into contributions from each feature, quantifying their individual impact. This allows clinicians to understand which time-frequency components or other features (e.g., patient demographic data) are most influential in driving predictions.

Training workflow

Our training workflow is designed to optimize both predictive performance and interpretability. It includes the following steps:

  1. Model initialization: Configure the Transformer architecture, including the depth, number of attention heads, and feed-forward dimensions. The loss function is selected based on the dataset characteristics (e.g., weighted cross-entropy for imbalanced datasets or focal loss for harder-to-detect classes).

  2. Hyperparameter setup: Initialize key hyperparameters, including the learning rate (e.g., \(1 \times 10^{-3}\)), batch size (8–32), and dropout rate, and set up a dynamic scheduler such as ReduceLROnPlateau for adaptive learning rates.

  3. Iterative training:

    • Forward pass: Compute predictions for each batch and evaluate the loss function based on the true labels.

    • Backward pass: Backpropagate the loss and update model weights using optimizers such as Adam or AdamW.

    • Learning rate adjustment: Adjust the learning rate when validation metrics plateau, ensuring that overfitting is avoided and convergence is achieved.

  4. Validation and early stopping: Monitor validation metrics (F1-score, accuracy, AUC) on a held-out set, and apply early stopping if validation metrics show no improvement after a predefined number of epochs.

  5. Final testing and explainability analysis: After training, evaluate the model using classification metrics (accuracy, F1, AUC) and generate Grad-CAM or attention maps. For datasets with murmur annotations, compute overlap and IoU metrics.

  6. Benchmarking: Compare the performance and interpretability of our model against baseline models using consistent evaluation criteria, as shown in Table 5.

This training and evaluation protocol ensures that our system not only delivers high-accuracy predictions but also provides transparent and clinically interpretable insights for heart sound classification.
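The workflow above can be condensed into the following PyTorch training-loop sketch; the optimizer settings, scheduler patience, and the `evaluate` helper returning a validation F1-score are assumptions made for illustration.

```python
import torch

def train_ehst(model, loaders, loss_fn, evaluate, n_epochs=100, patience=10, lr=1e-3):
    """Condensed sketch of the training workflow; `evaluate` is an assumed helper
    returning the validation F1-score, and all hyperparameters are placeholders."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", patience=3)
    best_f1, epochs_without_gain = 0.0, 0
    for epoch in range(n_epochs):
        model.train()
        for x, y in loaders["train"]:
            opt.zero_grad()
            loss = loss_fn(model(x), y)        # forward pass and loss
            loss.backward()                    # backward pass
            opt.step()                         # weight update
        val_f1 = evaluate(model, loaders["val"])
        sched.step(val_f1)                     # reduce LR when the metric plateaus
        if val_f1 > best_f1:
            best_f1, epochs_without_gain = val_f1, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:   # early stopping
                break
    return model
```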

Experimental setup

This section details the experimental design used to evaluate our Transformer-based heart sound classification framework. We describe the datasets, baseline models, cross-validation protocols, and testing scenarios, all tailored to assess the model’s performance under real-world conditions.

Datasets employed

Our experiments encompass seven repositories: six publicly available/open-access datasets and our in-house HeartWave dataset. These datasets exhibit diverse patient demographics, sampling frequencies, recording devices, and pathology definitions. Tables 6 and 7 summarize the key characteristics.

Open-access datasets

Table 6 provides an overview of the open-access datasets used:

  • CirCor DigiScope: Contains 5,282 recordings with detailed murmur annotations for pediatric heart sounds; however, adult data are limited, posing a domain-shift challenge.

  • PhysioNet/CinC 2016: Comprises 2,575 normal and 655 abnormal recordings at 2 kHz, characterized by significant noise and limited representation of rare CVDs.

  • Pascal Datasets A and B: Feature high-frequency recordings (44.1 kHz and 44 kHz, respectively) suitable for artifact detection, albeit with small sample sizes and restricted pathology diversity.

  • GitHub Open Access: Focuses on four valvular conditions (AS, MS, MVP, MR) with 1,000 recordings, but lacks demographic variety.

  • Heart Sounds Shenzhen (HSS): Contains 845 recordings classified as normal, mild, or severe, providing a progression-based perspective on valvular disease.

Table 6 Summary of datasets for heart sound analysis.

HeartWave dataset

In addition to open-access repositories, we use the HeartWave dataset, which consists of 1,353 high-quality heart sound recordings sampled at 2–4 kHz. The dataset includes nine classes covering both normal and pathological conditions, with expert annotations for S1, S2, murmurs, and extra heart sounds. This dataset is pivotal for evaluating classification accuracy and ensuring that model predictions align with clinically relevant features.

Table 7 HeartWave dataset summary.

Unified preprocessing and comparison

To ensure comparability across datasets with varying sampling rates and acquisition devices, all data are processed through a unified preprocessing pipeline. First, datasets recorded at higher frequencies (e.g., the Pascal dataset) are downsampled or resampled to a common range of 2–4 kHz. Next, the recordings are segmented into individual heartbeats using robust peak detection methods, such as amplitude thresholding or wavelet-based techniques, which leverage available annotations for accurate segmentation. Each extracted heartbeat is then transformed into a time-frequency representation—using either Short-Time Fourier Transform (STFT) or Mel-Frequency Cepstral Coefficients (MFCC)—to capture both spectral and temporal features essential for subsequent analysis. To further enhance signal quality, a band-pass filter (20–800 Hz) combined with wavelet denoising is applied, reducing ambient noise and other artifacts. Finally, data augmentation techniques, including time-stretching, noise injection, and random cropping, are employed to improve the generalizability of the model. The preprocessed data from this pipeline are then used as inputs to the Transformer-based model, ensuring that performance comparisons across different datasets are both fair and consistent.
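A sketch of the resampling and band-pass stages of this unified pipeline is given below (the wavelet-denoising and augmentation steps are omitted for brevity); the 4th-order Butterworth design and the 2 kHz target rate are illustrative choices consistent with the 20–800 Hz band and 2–4 kHz range stated above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def standardize_recording(x, fs_in, fs_out=2000):
    """Resample a recording to a common rate and band-pass it to 20-800 Hz."""
    up, down = fs_out, int(fs_in)
    g = np.gcd(up, down)
    x = resample_poly(x, up // g, down // g)          # rational-rate resampling
    b, a = butter(4, [20, 800], btype="bandpass", fs=fs_out)
    return filtfilt(b, a, x)                          # zero-phase band-pass filtering
```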

Baseline models

To benchmark our proposed Transformer-based model, we compare it against ten established baseline models that employ various architectures for feature extraction and classification:

  • CNN-Baseline: Uses stacked convolutional layers followed by a fully connected classifier to capture local features33.

  • RNN-Baseline: Employs recurrent architectures (LSTM or GRU) to model the sequential dependencies in PCG signals.

  • CNN-RNN Hybrid: Combines convolutional layers for local feature extraction with RNNs for temporal modeling.

  • CRNN-Attention: A hybrid model incorporating an attention mechanism to focus on diagnostically relevant segments.

  • TCN Model: Utilizes Temporal Convolutional Networks with dilated convolutions to capture multi-scale temporal dependencies.

  • RNN-Transformer: Merges RNNs with a Transformer block to leverage both sequential and attention-based modeling.

  • Multi-Task CNN: Performs joint classification and segmentation using a shared CNN architecture.

  • DenseNet-Style CNN: Features densely connected convolutional blocks for enhanced feature reuse.

  • Transformer-Lite: A simplified Transformer model with 1–2 self-attention layers, optimized for resource-constrained environments.

  • Wavelet-CNN: Integrates wavelet transforms with CNNs to robustly extract features from low-SNR signals.

All baseline models are trained using standard hyperparameters (learning rates of \(10^{-4}\) to \(10^{-3}\), batch sizes of 8–32, and moderate dropout). For imbalanced datasets, weighted cross-entropy or focal loss is employed to mitigate class imbalance.

Cross-validation

Robust performance estimates are obtained by employing cross-validation (CV) with stratified sampling to preserve the class distributions within each dataset34. For larger datasets such as HeartWave (1,353 samples) and Shenzhen (845 samples), a 5-fold CV strategy is adopted, where one fold is held out for testing and the remaining folds are used for training. In contrast, for smaller datasets such as Pascal A/B, a 10-fold CV approach is used to maximize the training data available in each fold and to secure multiple test splits, thus ensuring more reliable performance evaluation. The model performance is assessed using mean accuracy, F1-scores, AUC, and confusion matrices, which collectively provide a comprehensive and robust set of metrics for comparing model performance across different datasets.
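Stratified splits of this kind can be generated with scikit-learn as follows; the fixed random seed is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(labels, n_splits=5, seed=42):
    """Stratified CV splits preserving class proportions
    (n_splits=5 for larger datasets, n_splits=10 for Pascal A/B)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy_X = np.zeros(len(labels))      # only the labels drive the stratification
    return list(skf.split(dummy_X, labels))
```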

Results and analysis

This section provides a comprehensive evaluation of the proposed Explainable HeartSound Transformer (EHST) for multi-class heart sound diagnosis. We assess its performance across multiple datasets and compare it with several state-of-the-art baseline models. Our analysis includes overall classification performance, class-wise breakdowns, rare-class detection, ablation studies, confusion matrix analysis, and interpretability via explainability metrics such as Grad-CAM and SHAP. Additionally, further analyses address segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency.

Overall multi-dataset performance

We evaluated EHST on six datasets: HeartWave, CirCor DigiScope, PhysioNet/CinC, Pascal (A+B), GitHub (Valvular), and Shenzhen (HSS). Performance metrics including accuracy, precision, recall, macro-F1 score, Matthews Correlation Coefficient (MCC), and Area Under the ROC Curve (AUC) were computed and compared with ten baseline models. Table 8 summarizes the results across all six datasets.

Table 8 Overall performance across six datasets (EHST vs. Baselines).

The overall results in Table 8 indicate that EHST consistently achieves higher performance than the baseline models. For example, on the HeartWave dataset, EHST attains an accuracy of 96.7% and a macro-F1 score of 95.5%, compared to 94.8% and 93.1% respectively for the best baseline. Similarly, in CirCor DigiScope and PhysioNet CinC, EHST outperforms the baselines by 3–5% in accuracy and 2–4% in F1 score. Additionally, the AUC values remain above 0.90 across all datasets, which confirms the model’s robust ability to distinguish between normal and abnormal heart sounds even in noisy and imbalanced settings. The consistently high Matthews Correlation Coefficient (MCC), typically around 0.91–0.94 (not shown in this table), further corroborates EHST’s strong performance in dealing with imbalanced data.

Fig. 3: Grouped bar chart of accuracy across datasets.

Figure 3 presents a grouped bar chart comparing the accuracy of EHST with the Mean Baselines and Best Baseline across the six datasets. It is evident from the chart that EHST achieves higher accuracy values on each dataset. For instance, on the HeartWave dataset, EHST reaches an accuracy of 96.7% compared to 94.8% for the best baseline. This visual comparison highlights EHST’s significant improvement in overall classification accuracy, reinforcing its suitability for clinical applications where high accuracy is crucial.

Fig. 4: Line plot of F1 score across datasets.

The line plot in Fig. 4 illustrates the F1 scores for EHST, Mean Baselines, and Best Baseline methods across six datasets. The plot clearly shows that EHST consistently achieves higher F1 scores on all datasets compared to the baseline methods. The trend line for EHST lies above those for both the mean and best baselines, indicating a superior balance between precision and recall. This consistent performance across datasets underscores EHST’s capability to reliably distinguish between classes even in challenging, imbalanced settings.

Fig. 5: Box plot of AUC distribution across methods.

The box plot in Fig. 5 depicts the distribution of AUC values for EHST, Mean Baselines, and Best Baseline methods over several cross-validation folds. The vertical extent of each box represents the interquartile range (IQR) of AUC scores, with the median indicated by a horizontal line. EHST exhibits a higher median AUC with a narrower IQR compared to the baseline methods, which implies not only high discriminative power but also low variability across folds. Consistently, AUC values remain above 0.90 across all datasets, underscoring EHST’s robust ability to differentiate between normal and abnormal heart sounds, even in the presence of noise and imbalanced data.

Overall, EHST demonstrates improvements of 3–5% in accuracy and 2–4% in macro-F1 score over baseline methods, underscoring its strong discriminative power and clinical applicability.

Class-wise breakdown on HeartWave

Table 9 presents the detailed class-wise performance (precision, recall, and F1 score) for each of the nine heart sound classes in the HeartWave dataset, evaluated using 5-fold cross-validation.

Table 9 Class-specific performance on HeartWave (9 Classes, 5-Fold CV).

The class-wise analysis shows that EHST achieves high F1 scores for normal heart sounds (97.9%) and maintains F1 scores above 90% for minority classes such as congenital anomalies and miscellaneous rare conditions. This indicates effective handling of class imbalance through the weighted loss function and attention mechanisms, ensuring that both common and rare pathologies are accurately detected.

Rare-class detection across multiple datasets

Rare-class detection performance was measured using micro-F1 scores for underrepresented classes across selected datasets. Table 10 presents these results, demonstrating that EHST outperforms the best baseline by 2–3% in detecting rare conditions.

Table 10 Rare-class performance across selected datasets (Micro-F1 for underrepresented Classes).

Figure 6 shows a dumbbell plot that compares the rare-class F1 scores for EHST and the best baseline across four datasets: Pascal (A), Pascal (B), GitHub, and Shenzhen (HSS). In this plot, each dataset is represented by two markers—one for EHST (displayed in tomato) and one for the best baseline (displayed in steelblue). A vertical line connects the markers for each dataset, clearly illustrating the performance gap between EHST and the best baseline. The plot clearly demonstrates that EHST consistently outperforms the best baseline across all four datasets, with improvements ranging from approximately 2% to 3% in F1 score. For example, on the Pascal (A) dataset, EHST achieves an F1 score of 86.4% compared to 82.9% for the best baseline, as shown by the gap between the markers. Similar trends are observed for the Pascal (B), GitHub, and Shenzhen datasets, highlighting EHST’s robust performance in detecting underrepresented classes. The relatively narrow gap between the upper and lower bounds (represented by the connecting lines) also suggests that the model’s performance is consistent across cross-validation folds.

Fig. 6 Rare-class F1 score comparison for EHST vs. best baseline.

This analysis confirms that EHST consistently outperforms baselines in rare-class detection, which is critical for clinical applications where underrepresented pathologies must be reliably identified.
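
The rare-class scores reported above correspond to micro-averaged F1 restricted to the underrepresented labels; a minimal scikit-learn sketch (with illustrative labels) is shown below.

```python
from sklearn.metrics import f1_score

# Ground-truth and predicted labels from one cross-validation fold (illustrative values).
y_true = [0, 0, 1, 2, 3, 3, 4, 4, 0, 1, 2, 4]
y_pred = [0, 0, 1, 2, 3, 4, 4, 4, 0, 1, 1, 4]

# Suppose classes 3 and 4 are the underrepresented (rare) classes.
rare_labels = [3, 4]

# Micro-F1 restricted to the rare labels: pooled TP/FP/FN over those classes only.
rare_micro_f1 = f1_score(y_true, y_pred, labels=rare_labels, average="micro")
print(f"rare-class micro-F1: {rare_micro_f1:.3f}")
```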

Ablation studies and confusion matrix

Ablation studies were performed on the HeartWave dataset (using 5-fold CV) to quantify the contributions of self-attention and Grad-CAM. Table 11 shows that removing self-attention results in a substantial performance drop, whereas removing Grad-CAM mainly affects interpretability.

Table 11 Ablation: impact of removing self-attention and Grad-CAM (HeartWave, 5-Fold CV).

Removing the self-attention module leads to a 2.7% drop in F1 score, highlighting its importance for identifying critical features such as murmurs. While removing Grad-CAM has minimal impact on classification performance, it diminishes the model’s interpretability—an essential factor in clinical applications (Fig. 7).

Fig. 7 Grouped bar chart of ablation study metrics for EHST (lightcoral: accuracy; lightseagreen: F1 score; lightsteelblue: AUC).

The chart shows that removing the self-attention mechanism causes a marked drop in both accuracy and F1 score, from 96.7% to 94.2% and from 95.5% to 92.3%, respectively. Removing the Grad-CAM module alone has a negligible effect on these quantitative metrics, as indicated by the nearly identical values. When both components are removed, performance declines further, underscoring the importance of the self-attention mechanism for robust classification. The AUC remains comparatively stable (97% for the full EHST, falling to 93% when both modules are removed), and all configurations retain AUC values above 90%, indicating strong overall discriminative ability.

The confusion matrix for the Shenzhen dataset is presented in Table 12. This matrix indicates that misclassifications predominantly occur between the Mild and Severe classes, which may be attributed to overlapping acoustic characteristics.

Table 12 Confusion matrix for Shenzhen dataset (EHST, 3 classes).

The strong diagonal of the confusion matrix indicates that EHST is highly accurate for Normal and Severe cases. However, some misclassifications occur between Mild and Severe, suggesting that further refinement of training data or feature extraction methods could help improve discrimination between these classes.
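
A confusion matrix such as Table 12 is obtained directly from the fold-level predictions; the following sketch assumes three illustrative label arrays for the Normal/Mild/Severe task.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Normal", "Mild", "Severe"]

# Illustrative ground-truth and predicted labels for the 3-class Shenzhen task.
y_true = ["Normal", "Normal", "Mild", "Severe", "Mild", "Severe", "Normal", "Mild"]
y_pred = ["Normal", "Normal", "Mild", "Severe", "Severe", "Severe", "Normal", "Mild"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows: true class, columns: predicted class

# Per-class recall from the diagonal, useful for spotting Mild/Severe confusion.
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, np.round(recall, 3))))
```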

Explainability metrics and SHAP analysis

To evaluate the interpretability of EHST, we use both Grad-CAM and SHAP analyses. Table 13 shows the overlap ratio and Intersection-over-Union (IoU) for systolic and diastolic murmur annotations on the HeartWave dataset. Additionally, Table 14 lists the top five features ranked by mean SHAP value.

Table 13 SHAP-based explainability metrics on HeartWave murmur annotations.
Table 14 Top 5 features ranked by mean SHAP value for EHST on HeartWave dataset.
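
The overlap ratio and IoU in Table 13 compare thresholded attention maps against expert murmur annotations; a minimal numpy sketch of both metrics is given below (the masks and the binarization threshold are illustrative assumptions).

```python
import numpy as np

def overlap_and_iou(attention_map, annotation_mask, threshold=0.5):
    """Overlap ratio and IoU between a thresholded attention map and an expert mask."""
    pred = attention_map >= threshold          # binarize the (normalized) attention map
    gt = annotation_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    overlap_ratio = intersection / max(gt.sum(), 1)   # fraction of the annotation covered
    iou = intersection / max(union, 1)
    return overlap_ratio, iou

# Illustrative 1-D time-axis masks over a heartbeat segment.
attn = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.0])
anno = np.array([0,   0,   1,   1,   1,   1,   0,   0])
print(overlap_and_iou(attn, anno))   # (0.75, 0.75) for this toy example
```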

In addition to these tables, we employ a beeswarm plot to visualize the SHAP values for each feature across the entire HeartWave dataset. Figure 8 displays the beeswarm plot, which illustrates the distribution of SHAP values per feature. Each point represents a single instance; the position along the horizontal axis indicates the impact on the model output, while the color reflects the feature’s value. This visualization helps in understanding not only which features are most influential but also whether higher or lower feature values push the model output in a particular direction.

Fig. 8 Beeswarm plot of SHAP values for EHST on the HeartWave dataset.

The explainability metrics show that EHST improves overlap and IoU by 5–8% compared to baseline models, indicating superior alignment between the model’s attention maps and expert annotations. The beeswarm plot further corroborates these findings by highlighting that features such as the murmur frequency band, S1 amplitude, and S2 duration have the greatest impact on the predictions. These insights provide a clear, interpretable understanding of the model’s decision-making process, ensuring that clinicians can trust the automated diagnoses. Overall, the combination of quantitative metrics and visualizations like the beeswarm plot confirms the clinical relevance and transparency of EHST.
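
As a point of reference, the mean-SHAP ranking in Table 14 and the beeswarm view in Fig. 8 follow the standard SHAP workflow; the sketch below applies it to a tree-ensemble surrogate with illustrative feature names rather than the full EHST model.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["murmur_band_energy", "s1_amplitude", "s2_duration",
                 "spectral_centroid", "zero_crossing_rate"]   # illustrative names

# Synthetic stand-in for tabular summary features of heart-sound segments.
X = rng.normal(size=(400, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer is the standard SHAP explainer for tree ensembles.
explainer = shap.TreeExplainer(clf)
sv = explainer.shap_values(X)
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]   # SHAP values for the positive class

# Rank features by mean absolute SHAP value (as in Table 14).
mean_abs = np.abs(sv_pos).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {val:.4f}")

# Beeswarm-style summary plot (as in Fig. 8).
shap.summary_plot(sv_pos, X, feature_names=feature_names)
```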

Statistical validation with A-Test

To rigorously evaluate the statistical significance and robustness of EHST’s performance compared to baseline models, we employed the non-parametric A-Test35. The A-Test quantifies the probability that a randomly selected observation from one distribution (EHST results) will be greater than a randomly selected observation from another distribution (baseline results):

$$\begin{aligned} A_{12} = P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2), \end{aligned}$$
(15)

where \(X_1\) and \(X_2\) are the distributions of accuracy, macro F1-score, or AUC obtained from cross-validation folds. An \(A_{12}\) value of 0.5 indicates no difference, whereas values approaching 0 or 1 imply a strong effect size in favor of one method. Conventionally, thresholds of \(A > 0.71\) or \(A < 0.29\) are considered large effects, \(0.64 \le A \le 0.71\) or \(0.29 \le A \le 0.36\) medium, and \(0.56 \le A \le 0.64\) or \(0.36 \le A \le 0.44\) small36.
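
Equation (15) can be evaluated directly on the per-fold metric values; a minimal implementation of the Vargha-Delaney A statistic is sketched below, with illustrative fold scores.

```python
import numpy as np

def a12(x1, x2):
    """Vargha-Delaney A statistic: P(X1 > X2) + 0.5 * P(X1 = X2), estimated over all pairs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    greater = (x1[:, None] > x2[None, :]).mean()
    ties = (x1[:, None] == x2[None, :]).mean()
    return greater + 0.5 * ties

# Illustrative macro-F1 scores per cross-validation fold.
ehst_folds = [0.955, 0.948, 0.951, 0.957, 0.953]
baseline_folds = [0.921, 0.930, 0.925, 0.918, 0.927]

a = a12(ehst_folds, baseline_folds)
print(f"A12 = {a:.2f}")   # 1.0 here: every EHST fold exceeds every baseline fold (large effect)
```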

Table 15 presents the A-Test results comparing EHST with the best-performing baseline models across all datasets. The reported values consistently lie far from 0.5 (in the 0.1–0.2 range), which corresponds to a large effect size under the conventions above and confirms that the performance differences between EHST and the baselines are robust and statistically meaningful.

Table 15 A-Test scores comparing EHST with best baseline models across datasets. Values close to 0 or 1 imply large effect sizes.

The results confirm that EHST significantly outperforms the baseline models across all datasets and evaluation metrics, further validating the robustness of the proposed framework.

Segmentation method analysis

To evaluate the impact of the segmentation approach on performance, we compared manual annotations with automated peak detection on the HeartWave dataset. As shown in Table 16, the accuracy and macro F1 scores obtained using automated peak detection are nearly equivalent to those derived from manual annotation. This close performance indicates that the automated segmentation pipeline is robust and reliable, reducing the need for labor-intensive manual labeling while still maintaining high-quality input for the EHST model.

Table 16 Performance comparison for segmentation methods on HeartWave.

These results confirm that the automated method effectively captures the essential heart sound segments with minimal loss in performance, making it a viable option for scaling up the data preparation process.
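
The automated route can be approximated with standard signal-processing primitives; the sketch below shows one way to locate candidate S1/S2 peaks from a band-limited amplitude envelope (the filter band, smoothing window, and thresholds are assumptions rather than the exact pipeline parameters).

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def detect_heart_sound_peaks(pcg, fs, min_interval_s=0.25):
    """Return sample indices of candidate S1/S2 peaks in a PCG recording."""
    # Band-pass to the typical heart-sound band (assumes fs of at least ~1 kHz).
    b, a = butter(4, [25 / (fs / 2), 400 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, pcg)

    # Rectify and smooth to obtain an amplitude envelope.
    win = max(int(0.02 * fs), 1)
    envelope = np.convolve(np.abs(filtered), np.ones(win) / win, mode="same")

    # Peaks separated by at least min_interval_s and above a relative amplitude threshold.
    peaks, _ = find_peaks(envelope,
                          distance=int(min_interval_s * fs),
                          height=0.3 * envelope.max())
    return peaks

# Illustrative usage on a synthetic 5 s signal sampled at 2 kHz.
fs = 2000
t = np.arange(0, 5, 1 / fs)
pcg = 0.05 * np.random.randn(t.size)
for beat_start in np.arange(0.2, 4.8, 0.8):          # synthetic "S1" bursts every 0.8 s
    idx = (t > beat_start) & (t < beat_start + 0.05)
    pcg[idx] += np.sin(2 * np.pi * 60 * t[idx])
print(detect_heart_sound_peaks(pcg, fs)[:5])
```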

Window length sensitivity

We examined the sensitivity of EHST to different window lengths used during segmentation on the HeartWave dataset. As summarized in Table 17, the model’s performance varies slightly with different window lengths. A window length of 1.0 s produced the highest accuracy (96.7%) and macro F1 (95.5%), suggesting that this duration provides an optimal balance between capturing sufficient temporal dynamics and minimizing noise.

Table 17 Performance variation with different window lengths on HeartWave.

The marginal differences observed imply that while shorter windows might not capture the full extent of a heartbeat cycle, longer windows could introduce additional noise. Therefore, a 1.0 s window appears to be the optimal setting for the EHST model.
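
In practice, the window-length setting only controls how much signal around each detected peak is passed to the model; a brief sketch, continuing the peak-detection example above with the 1.0 s setting, is shown below.

```python
import numpy as np

def extract_windows(pcg, fs, peaks, window_s=1.0):
    """Extract fixed-length windows centred on detected peaks; edge peaks are skipped."""
    half = int(window_s * fs / 2)
    windows = [pcg[p - half:p + half] for p in peaks
               if p - half >= 0 and p + half <= len(pcg)]
    return np.stack(windows) if windows else np.empty((0, 2 * half))

# e.g. windows = extract_windows(pcg, fs, detect_heart_sound_peaks(pcg, fs), window_s=1.0)
```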

Data augmentation strategy

To assess the impact of data augmentation, we evaluated EHST on the PhysioNet/CinC dataset using various augmentation strategies. Table 18 compares the performance without any augmentation, with individual strategies (noise injection, time-stretching, random cropping), and with a combination of all three techniques. The combined strategy results in the highest accuracy (90.3%) and macro F1 (88.9%), demonstrating that the integration of multiple augmentation methods effectively enhances model robustness by simulating realistic variability and reducing overfitting.

Table 18 Effect of data augmentation strategies on PhysioNet/CinC performance.

This analysis confirms that using a combination of augmentation strategies best simulates the diverse acoustic conditions encountered in clinical settings, thereby improving the model’s generalizability.
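
The three augmentation strategies can be expressed as simple waveform transforms; the sketch below gives minimal numpy versions, where the time-stretch is a naive resampling stand-in for a phase-vocoder stretch and all parameter ranges are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, snr_db=20.0):
    """Inject Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def time_stretch(x, rate=1.1):
    """Naive time-stretch by linear resampling (rate > 1 shortens, rate < 1 lengthens)."""
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def random_crop(x, crop_len):
    """Randomly crop a fixed-length excerpt from the waveform."""
    start = rng.integers(0, len(x) - crop_len + 1)
    return x[start:start + crop_len]

def augment(x):
    """Combined strategy: noise injection + time-stretch + crop to 90% of the stretched length."""
    x = add_noise(x, snr_db=rng.uniform(15, 30))
    x = time_stretch(x, rate=rng.uniform(0.9, 1.1))
    return random_crop(x, crop_len=int(0.9 * len(x)))
```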

While the combined augmentation strategy significantly improves validation metrics by simulating diverse acoustic conditions, it may also introduce distributional shifts that impact performance on real-world, unseen data. This potential domain shift is a known challenge in PCG and other biomedical signal processing tasks.

To minimize adverse effects, our augmentation parameters were carefully selected to reflect realistic variations encountered in clinical practice. Furthermore, we validated our model on multiple independent datasets with varied noise profiles and demographic distributions to assess robustness beyond the training set.

Nonetheless, there remains a trade-off between increasing training data diversity and preserving fidelity to clinical conditions. Future work will explore adaptive and domain-adversarial augmentation techniques aimed at enhancing model generalizability across heterogeneous clinical environments.

Demographic-based performance

We further analyzed EHST’s performance on the HeartWave dataset across different demographic groups. Table 19 presents the accuracy and macro F1 scores for pediatric (Age < 18) and adult (Age \(\ge\) 18) subgroups. The results show consistent performance between the two groups, with adults achieving slightly higher scores. This consistency indicates that EHST generalizes well across diverse patient populations, which is critical for clinical deployment in varied settings.

Table 19 Performance by demographic subgroups on HeartWave.

These findings demonstrate that the model’s performance remains robust regardless of age, suggesting its effectiveness in both pediatric and adult clinical environments.
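
The subgroup breakdown in Table 19 amounts to stratifying the fold-level predictions by age before computing the metrics; a minimal pandas sketch with illustrative data is shown below.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(1)

# Illustrative per-recording results: age, ground-truth and predicted class indices.
n = 500
df = pd.DataFrame({
    "age": rng.integers(1, 90, size=n),
    "y_true": rng.integers(0, 9, size=n),
})
df["y_pred"] = np.where(rng.random(n) < 0.9, df["y_true"], rng.integers(0, 9, size=n))

df["group"] = np.where(df["age"] < 18, "pediatric (<18)", "adult (>=18)")
for group, sub in df.groupby("group"):
    acc = accuracy_score(sub["y_true"], sub["y_pred"])
    macro_f1 = f1_score(sub["y_true"], sub["y_pred"], average="macro")
    print(f"{group}: accuracy={acc:.3f} macro-F1={macro_f1:.3f}")
```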

Overall, these detailed analyses across segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency provide a comprehensive picture of EHST’s performance and robustness in diverse clinical scenarios. The results demonstrate that EHST not only excels in classification performance but also generalizes well across different data conditions and remains computationally efficient for practical use.

Discussion

Across six datasets, EHST achieved higher accuracy, macro-F1 score, and AUC values than the baseline models, as shown in Table 8. In particular, on the HeartWave dataset, EHST attained higher class-specific performance, detecting both common and rare pathologies and showing improved sensitivity for underrepresented classes, as reflected in elevated micro-F1 scores. While these results indicate notable performance gains, they are specific to the datasets and experimental conditions evaluated.

Ablation studies suggest that the self-attention mechanism and integrated explainability modules contribute meaningfully to classification accuracy and model transparency. The combination of Grad-CAM, attention visualization, and SHAP yielded clinically interpretable outputs, aligning model attention with known pathophysiological features in the evaluated datasets.

In terms of computational efficiency, EHST exhibited training and inference times comparable to strong baseline architectures, including CNN-RNN hybrids and Transformer variants. Optimized model depth and parameter sharing in the multi-head attention layers contributed to a balance between complexity and scalability. For example, training times on the HeartWave dataset were similar to those of the best-performing baseline CNN-RNN model, and inference latency was within real-time constraints under the tested conditions.

The model’s performance remained stable across variations in segmentation method, window length, data augmentation strategy, and demographic subgroup, suggesting potential for broader applicability. However, real-world deployment may involve additional variability in recording devices, patient populations, and environmental noise, which were not exhaustively represented in the current datasets.

Overall, within the scope of the datasets and metrics considered, EHST demonstrated consistent performance advantages over the ten baseline models evaluated, while providing interpretable outputs without marked loss in efficiency. These findings support EHST’s potential for use in scalable, explainable heart sound analysis, pending further validation in prospective and more heterogeneous clinical settings.

Comparison with time growing neural networks (TGNNs)

Time Growing Neural Networks (TGNNs) offer a unique approach to modeling cardiac cycles by incrementally adapting their architecture to capture systolic and diastolic variations25. While TGNNs excel at representing temporal growth patterns, they process sequences sequentially, which can limit the ability to capture global dependencies and complex interactions spanning multiple heartbeats.

In contrast, our proposed EHST utilizes a multi-head self-attention mechanism that simultaneously attends to all time points within a heartbeat segment. This parallel attention allows the model to flexibly learn relationships across both systolic and diastolic phases without explicit segmentation or architectural growth, leading to richer temporal and spectral representations.
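
To make this contrast concrete, the parallel attention described here can be illustrated with a standard multi-head self-attention layer applied across all time frames of a segment; the block below is a minimal PyTorch sketch with assumed dimensions, not the full EHST architecture.

```python
import torch
import torch.nn as nn

class HeartbeatAttentionBlock(nn.Module):
    """Minimal self-attention block over the time frames of one heartbeat segment."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time_frames, d_model); every frame attends to every other frame,
        # so systolic and diastolic regions interact in a single parallel step.
        attended, weights = self.attn(x, x, x)
        return self.norm(x + attended), weights   # weights: (batch, frames, frames)

# Illustrative usage: 8 segments, 40 spectrogram frames each, 64-dimensional embeddings.
frames = torch.randn(8, 40, 64)
block = HeartbeatAttentionBlock()
out, attn_weights = block(frames)
print(out.shape, attn_weights.shape)   # torch.Size([8, 40, 64]) torch.Size([8, 40, 40])
```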

Additionally, EHST integrates explainability tools such as Grad-CAM and attention visualization, providing clinicians with interpretable heatmaps linked to physiologically meaningful heart sound components. This level of transparency is often lacking in TGNN architectures, making our method more suitable for clinical applications where interpretability is critical.

Empirically, EHST demonstrates consistent improvements in classification accuracy and robustness over baseline models, including those based on TGNNs, across multiple datasets, further validating the advantages of attention-based modeling for heart sound diagnosis.

Limitations

Despite the promising performance of EHST, several limitations remain. The model shows difficulty in accurately distinguishing between mild and severe cases, likely due to overlapping acoustic features and the limited number of borderline examples in the training data. While EHST performed well under controlled conditions, its robustness to real-world noise, device variability, and differences in recording environments has yet to be fully established. These factors, along with potential domain shifts when applied to new patient populations, may affect performance in practical deployments. The datasets used in this study do not comprehensively represent all demographic groups, age ranges, or rare cardiac pathologies, underscoring the need for greater diversity in training data to improve generalizability.

Extensive prospective clinical trials and longitudinal studies are required to validate EHST’s effectiveness, reliability, and interpretability in routine clinical workflows. Additionally, integration with multimodal data sources, such as ECG and echocardiography, could further enhance diagnostic coverage. Future work will also focus on refining the model’s interpretability mechanisms and applying advanced data augmentation and domain adaptation strategies to improve noise resilience and adaptability across heterogeneous clinical settings. Ultimately, optimizing EHST for robust, transparent, and scalable deployment remains a key objective before real-world adoption.

Conclusion

In this study, we proposed a Transformer-based framework for multi-class heart sound diagnosis that combines attention mechanisms with Grad-CAM visualizations to balance predictive performance and interpretability. Evaluated across six public datasets and the in-house HeartWave dataset, the model achieved higher accuracy, macro-F1 score, and AUC than baseline models under the given experimental conditions, with gains observed for both common and underrepresented classes. Explainability tools, including Grad-CAM, attention weight visualization, and SHAP analysis, produced outputs consistent with expert-annotated features, supporting potential clinical relevance. The framework also showed stable performance across segmentation methods, window lengths, and demographic subgroups. However, these findings are limited to retrospective datasets, and real-world variability in devices, patient populations, and noise conditions was not fully represented. Prospective clinical validation, broader demographic coverage, and integration with multimodal data such as ECG remain important next steps to confirm generalizability and ensure reliable deployment in diverse clinical environments.