Introduction

Context and motivation

Cardiovascular diseases (CVDs) remain one of the leading global health challenges, contributing to millions of deaths annually and placing immense pressure on healthcare infrastructures. Early detection and timely intervention are critical, especially given the progressive nature of cardiac conditions. Traditionally, clinicians utilize auscultation—listening to heart sounds via a stethoscope—to identify abnormalities such as murmurs, extra heart sounds, and irregular rhythms1. While auscultation is deeply embedded in clinical practice, its accuracy relies significantly on clinician expertise and environmental conditions, making subtle murmurs challenging to detect consistently, particularly by less experienced clinicians or in noisy environments.

Recent advancements in machine learning (ML) and artificial intelligence (AI) present significant opportunities for automating aspects of the diagnostic process, enhancing reproducibility and consistency. Among diagnostic modalities in cardiology, the phonocardiogram (PCG) has emerged as an essential tool for AI-based analysis. PCGs offer rich temporal and spectral information regarding heart sounds, critical for identifying structural or functional heart anomalies. Initial computational approaches involving manual feature extraction combined with traditional classifiers like Support Vector Machines (SVMs) or Hidden Markov Models (HMMs) showed potential but required considerable domain expertise and struggled with scalability across diverse datasets2.

In recent years, deep learning methods, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have significantly advanced the accuracy of PCG-based classification, successfully detecting conditions such as mitral valve prolapse and hypertrophic cardiomyopathy. Despite these advancements, many AI-based methods remain opaque or “black-box” in nature, limiting clinical acceptance due to a lack of transparency and explainability in decision-making processes3. Clinicians require clearly interpretable models to justify their diagnostic decisions confidently, especially when identifying heart murmurs indicative of existing cardiac conditions.

Furthermore, for practical clinical deployment, AI models must exhibit robust performance across varied recording environments, including differences in microphone quality, sampling rates, and ambient noise levels encountered in clinical settings such as emergency rooms or outpatient clinics. The critical intersection of interpretability, reproducibility, and robust performance across diverse conditions remains a vital focus, highlighting the need for frameworks that clearly identify heart murmurs through transparent AI-driven analysis4.

Problem statement

The integration of AI in cardiac auscultation has demonstrated improved diagnostic accuracy; however, clinical mistrust remains significant due primarily to poor reproducibility of results across diverse clinical environments. Variability stemming from differences in recording equipment, ambient noise, patient demographics, and inconsistent model performance across distinct datasets severely limits clinical adoption and utility. Consequently, addressing reproducibility through robust, transparent, and consistently reliable AI frameworks is essential for meaningful integration into routine clinical practice.

Significance

The development of an interpretable, high-performance AI framework for heart sound diagnosis has significant implications for various stakeholders in the healthcare ecosystem:

  • Clinicians and healthcare providers: Transparent AI models enable rapid verification of results, helping clinicians to correlate the highlighted segments with established auscultation knowledge or patient histories. This accelerates diagnostic workflows, particularly in high-demand clinical environments.

  • Patients: Timely and accurate detection of cardiac conditions enhances the management of chronic heart diseases, reduces hospital readmissions, and significantly improves patients’ quality of life. Clear explanations of diagnoses also foster patient engagement and adherence to treatment plans.

  • Healthcare systems and policymakers: Interpretable AI solutions could relieve pressure on specialized cardiology services, reduce associated diagnostic costs, and establish uniform standards of care across healthcare settings.

In summary, interpretability is a crucial factor in transforming AI models from experimental tools into reliable clinical decision support systems (CDSS), bridging the gap between research and clinical practice5.

Objectives and contributions

This research introduces an attention-based Transformer architecture designed specifically for multi-class PCG classification, with the primary goal of enhancing both classification accuracy and model interpretability. By focusing on the why and how behind the model’s decisions, we aim to increase clinician acceptance and improve patient safety.

Specifically, we aim to develop a robust diagnostic pipeline capable of processing time-frequency representations of PCG signals (e.g., spectrograms or Mel-frequency cepstral coefficients) and providing confidence scores for conditions such as normal heart sounds, valvular stenosis, and regurgitation, and to integrate Grad-CAM (Gradient-weighted Class Activation Mapping) for post-hoc explainability, enabling clinicians to visualize which audio frames or frequency bands most influence the classification outcome. The contributions of this work are threefold:

  1. Adapting the transformer for PCGs: We redesign the Transformer architecture’s positional encodings, attention modules, and feedforward sublayers to better handle the unique temporal and spectral characteristics of heart sounds, improving the sensitivity to murmurs and other pathological cues.

  2. Incorporating explainability mechanisms: Beyond self-attention, we integrate Grad-CAM to provide localized visual evidence of anomalies, making the decision-making process more transparent and traceable.

  3. Extensive validation and benchmarking: We evaluate the model on several datasets, including the HeartWave dataset, supplemented by other publicly available sources like PhysioNet. Comparisons with common baselines, such as CNN-RNN hybrids, highlight improvements in both performance and interpretability.

Ultimately, this work aims to present a model that clinicians can understand, trust, and use confidently in real-world settings. By combining advanced deep learning techniques with explainable AI mechanisms, we offer a solid foundation for the next generation of transparent AI solutions in cardiovascular diagnostics.

Related work

This section reviews the evolution of heart sound analysis, from traditional signal-processing methods to contemporary deep learning frameworks6, with a focus on the challenges and opportunities related to interpretability in clinical contexts. The discussion is divided into four parts: traditional approaches, modern deep learning solutions, explainable AI (XAI) in healthcare7, and current gaps and future opportunities for research.

Traditional approaches to heart sound analysis

Earlier efforts in automated phonocardiogram (PCG) analysis heavily relied on hand-engineered features to represent the nuanced time-frequency structure of heart sounds. Researchers often employed wavelet transforms to isolate transient or non-stationary components indicative of cardiac anomalies such as murmurs, ejection clicks, or extra systolic sounds. The wavelet transform’s capacity for multi-resolution analysis proved effective in pinpointing short-duration events in a longer recording. For instance, abrupt energy spikes in specific frequency sub-bands could signal systolic murmurs characteristic of conditions like aortic stenosis or mitral regurgitation8.

Another widely adopted technique involved Mel-Frequency Cepstral Coefficients (MFCCs), adapted from speech processing9. By mapping frequencies to a perceptual (Mel) scale, MFCCs captured essential spectral features aligned with human auditory sensitivities. Although initially designed to recognize phonemes in spoken language, MFCCs provided an efficient representation of heart sound signals, distinguishing normal S1 and S2 segments from those containing unusual acoustic signatures.

Despite the relative success of these approaches, they depended greatly on the skill and assumptions of researchers. Feature selection, thresholding, and pre-segmentation heuristics required detailed knowledge of cardiac auscultation principles and thorough experimentation across multiple datasets. Moreover, noise sensitivity was a recurring obstacle: real-world PCG recordings often include stethoscope friction, patient motion, or ambient clinical sounds, necessitating elaborate filtering pipelines. Thus, transferring a carefully tuned algorithm from one environment (e.g., a quiet lab) to another (e.g., a busy hospital ward) frequently led to performance drops10.

Beyond the noise issue, scalability also emerged as a limitation. Many foundational studies were tested on relatively small or proprietary datasets, sometimes featuring a few hundred recordings. Such constraints hindered cross-validation, obscured the influence of hyperparameter choices, and complicated comparisons across different research groups. Consequently, while traditional methods demonstrated the potential of machine-driven heart sound diagnosis, they also underscored the need for more data-driven, noise-robust, and scalable strategies, paving the way for the next generation of techniques centered around deep learning11.

Deep learning in heart sound diagnosis

With the advent of increased computational power and growing availability of larger PCG repositories, deep learning approaches began to displace traditional feature-engineering pipelines. Convolutional Neural Networks (CNNs)12 were among the first to gain traction, leveraging architectures that treat spectrograms or scalograms of heart sounds similarly to images. In this paradigm, trainable convolution kernels automatically learn low-level patterns (e.g., short bursts corresponding to S1 or S2) and higher-level, class-discriminative features (e.g., murmurs or abnormal frequency bands). This end-to-end structure dramatically reduced the reliance on hand-tuned wavelets or MFCC parameters, showcasing better adaptability to new data sources and varied noise levels.

Subsequently, Recurrent Neural Networks (RNNs)13 were explored to address the sequential nature of heart sounds. Long Short-Term Memory (LSTM)14 and Gated Recurrent Unit (GRU)15 architectures proved particularly effective in capturing longer-term dependencies, such as the interval between S1 and S2 or the transition from systole to diastole. By combining CNN-based local feature extraction with RNN-based temporal modeling, hybrid CNN-RNN frameworks emerged, providing a structured way to account for both spectral details and evolving heartbeat patterns. Many studies reported significant gains in classification accuracy, especially for tasks involving multi-class discrimination of various valvular diseases or rare congenital anomalies.

More recently, Transformers16 adapted from natural language processing have begun to make inroads into heart sound analysis. Their multi-head self-attention mechanism allows the model to “attend” to different segments of the PCG signal in parallel, potentially identifying murmurs or extra heart sounds across multiple beats17. Notably, Partovi et al.18 conducted a comprehensive survey of deep learning models for heart sound analysis and benchmarked numerous attention-based, convolutional, and recurrent approaches. Their study emphasized that while CNNs and autoencoders often achieved high performance in specific case studies, the generalizability and reproducibility of results varied widely due to dataset inconsistencies and evaluation mismatches. The authors recommended standardized datasets and interpretability-aware design principles for future model development. This work offers critical context for the integration of attention mechanisms and highlights the necessity of robust validation for clinical deployment.

Nevertheless, deep learning approaches collectively represent a major leap forward in heart sound classification19. They reduce the overhead of manual feature crafting, provide improved performance under diverse acoustic conditions, and pave the way for holistic models that incorporate additional data streams such as patient demographics or concurrent ECG signals. Yet, as these architectures become more complex, the interpretability of their outputs—a vital concern in clinical environments—has emerged as a key research priority20.

Connections between heart murmurs and PCG

Heart murmurs represent audible vibrations generated by turbulent blood flow, and their characteristics provide crucial diagnostic information. These murmurs manifest in PCG signals as distinct time-frequency patterns associated with specific cardiac phases. For example, systolic murmurs such as those seen in aortic stenosis or mitral regurgitation appear between the first (S1) and second (S2) heart sounds and exhibit high-frequency components. In contrast, diastolic murmurs, like those associated with aortic regurgitation or mitral stenosis, follow S2 and typically display lower-frequency energy over longer durations.

Partovi et al.18 offer a detailed analysis of murmur types and their acoustic signatures within PCG recordings. They highlight that systolic murmurs are often sharper and shorter, whereas diastolic murmurs tend to be more prolonged and subtle. The review emphasizes that accurate murmur classification requires robust segmentation of heart cycles and careful preservation of signal fidelity during preprocessing. Furthermore, congenital anomalies such as ventricular septal defects (VSD) or patent ductus arteriosus (PDA) produce continuous murmurs that span both systolic and diastolic phases, distinguishable in PCG by sustained high-amplitude regions. This mapping between clinical murmurs and their PCG features forms a foundation for developing interpretable deep learning systems.

Explainable AI in healthcare

Alongside growing accuracy, the need for trustworthy and clinically interpretable predictions has gained momentum in recent years. Healthcare practitioners often request evidence or rationales for an algorithm’s conclusion, particularly if a diagnosis could lead to invasive procedures or significant treatment changes21. Explainable AI (XAI)22 seeks to address this demand, introducing techniques that help demystify the decision-making process of complex neural networks:

  • LIME (Local Interpretable Model-Agnostic Explanations): Generates a simplified surrogate model around a specific instance to approximate the influence of individual input features. For heart sound data, LIME has been adapted to highlight important frames or spectral bins contributing to a predicted label.

  • SHAP (SHapley Additive exPlanations): Attributes each feature’s contribution based on cooperative game theory, offering consistent and theoretically grounded explanations. When applied to PCG signals, SHAP can quantify how certain frequency components or time segments shift a prediction toward a pathological class.

  • Grad-CAM (Gradient-Weighted Class Activation Mapping): Creates heatmaps overlaid on time-frequency representations, indicating which regions the network identifies as key for classification. This technique, widely used in image analysis, has been extended to one-dimensional or spectrogram-based heart sound inputs, helping clinicians see whether the model’s focus aligns with suspected murmurs.

In practice, however, these methods can require domain-specific adaptation23. A typical spectrogram overlay might not inherently convey whether a murmur is diastolic or systolic. Clinicians might prefer an explanation that marks an abnormal S2 split or a midsystolic click, clarifying how the model’s attention correlates with known pathophysiological events. Thus, the granularity and clinical relevance of XAI outputs remain pivotal, demanding further research on how to refine these tools to fit cardiologists’ existing mental models of heart sound interpretation24.

Time growing neural networks (TGNNs) in heart sound analysis

Time Growing Neural Networks (TGNNs) have been widely employed over the past decade for cardiovascular disease classification tasks25. TGNNs are designed to model temporal growth patterns in sequential data, effectively capturing evolving features across time segments such as systolic and diastolic phases in heart sound signals. By dynamically expanding their architecture, TGNNs can adapt to variable-length cardiac cycles and focus separately on physiologically relevant intervals.

TGNNs have demonstrated strong performance in discriminating various cardiac abnormalities, leveraging their ability to explicitly model temporal growth and changes within the cardiac cycle. However, these models often rely on sequential processing and may lack the ability to globally attend to all time points simultaneously, potentially limiting their capacity to capture long-range dependencies and complex interactions between systolic and diastolic events.

Moreover, TGNNs typically provide limited interpretability, as their dynamic structure and evolving weights are harder to visualize and correlate directly with clinical features compared to attention mechanisms. This presents challenges in clinical adoption, where explainability is crucial.

Our work builds on these insights by employing a Transformer-based attention mechanism that enables flexible, parallel modeling of the entire cardiac cycle, capturing both local murmur-level features and global cycle-level context with integrated explainability.

Research gaps and challenges

Despite the strides achieved in classification accuracy, multiple research gaps persist. First, multi-class classification, covering a wide spectrum of valvular disorders, arrhythmic events, and congenital anomalies, continues to pose challenges. Many published studies reduce tasks to a binary problem (normal vs. abnormal) or limit themselves to a handful of prevalent conditions. Expanding the range of conditions tackled by deep networks can improve their utility in general clinical practice, but it requires more comprehensive datasets and rigorous generalization strategies.

Second, efforts to tailor XAI methods specifically for heart sound data remain in their infancy. While saliency maps or feature attributions provide a starting point, bridging the gap between these outputs and clinically interpretable markers, such as the shape of a murmur or the ratio of S1 to S2 intervals, still lacks systematic solutions. Aligning explanations with medical knowledge, for instance by matching Grad-CAM hotspots to annotated systolic phases, may substantially boost confidence among cardiologists.

Third, robustness in the face of noise and demographic heterogeneity warrants deeper exploration. Real-world recordings vary in patient age, body habitus, and comorbidities, introducing patterns that may not appear in controlled datasets. Systems that can adapt, or at least detect potential mismatches, could avert misclassifications and guide clinicians toward secondary confirmatory tests.

Given these open questions, this work proposes an attention-centric framework designed to accommodate multi-class PCG classification while delivering interpretable insights. By leveraging the global reach of Transformer-like architectures and refining XAI outputs for domain specificity, we aim to address the dual imperatives of accuracy and clinical trustworthiness, ultimately bridging current gaps and laying a foundation for more robust, transparent cardiovascular diagnostics.

Dataset description

This section provides an overview of the heart sound datasets employed for cardiovascular disease (CVD) analysis, focusing on the coverage of normal and abnormal conditions, annotation detail, and demographic diversity. Each dataset highlights different clinical or technical scenarios, offering a rich testing ground for machine learning models aimed at automated auscultation. Two tables summarize the characteristics and notable limitations of these repositories, reflecting variations in sampling rates, device types, and labeling precision.

HeartWave dataset

The HeartWave dataset consists of 1,353 heart sound recordings, each belonging to one of nine clinical classes26. These classes capture both normal heart sounds (S1, S2) and prevalent pathologies such as aortic stenosis, mitral regurgitation, and pulmonary stenosis. Table 1 outlines key aspects, including the number of recordings, demographic spread, average recording duration, and annotation quality.

Table 1 HeartWave dataset summary.

As summarized in Table 1, HeartWave’s distinguishing features include detailed cardiologist annotations that precisely label systole, diastole, and any additional heart sound events (e.g., S3, S4). Where murmurs are identified, severity is graded from 1 to 6, aligning with standard clinical practice correlating murmur loudness to possible lesion significance. In addition, the dataset captures auscultation locations (e.g., mitral, aortic, pulmonary), enabling targeted analysis of region-specific pathologies. Taken together, these elements make HeartWave an excellent basis for investigating multi-class classification and advanced explainability strategies.

Heart sound repositories

Beyond HeartWave, several open-access databases offer valuable breadth and variety for evaluating algorithmic generalization. Table 2 compares these repositories, highlighting their key attributes and any known limitations. Some feature pediatric cohorts, while others emphasize specific valvular diseases or artifact-heavy recordings. Such differences allow researchers to probe model robustness and cross-demographic adaptability.

Table 2 Comparative overview of open-access heart sound datasets.

CirCor DigiScope27 stands out for its extensive pediatric collection, offering timing and pitch annotations that can expose age-related diagnostic patterns. PhysioNet/CinC 201628 is widely used for normal-versus-abnormal classification, yet exhibits higher noise levels and a narrower disease spectrum. Pascal Datasets29 A and B capture data with elevated sampling frequencies (44 kHz), encouraging exploration of high-resolution spectral details, although their relatively small sample counts constrain broad pathological analysis. Meanwhile, the GitHub open-access repository30 collects data from varied sources, mostly focusing on valve-specific abnormalities such as aortic stenosis or mitral valve prolapse, at an 8 kHz sampling rate. Finally, the Heart Sounds Shenzhen dataset31 focuses on mild, moderate, or severe categorization of valvular heart disease, but lacks attention to congenital or rarer anomalies. Taken together, these datasets reveal a broad continuum of clinical contexts, noise profiles, and labeling protocols:

  • Multi-class complexity: HeartWave and GitHub highlight multi-class and valvular-specific distinctions, while PhysioNet focuses on a simpler normal/abnormal paradigm.

  • Demographic variation: CirCor emphasizes pediatric recordings, contrasting with the adult-centric orientation in HeartWave, Shenzhen, and most Pascal data.

  • Noise vs. Fidelity: PhysioNet includes notable ambient interference, whereas Pascal can achieve higher fidelity but with fewer recordings. HeartWave lies in a middle ground, balancing clinical realism and moderate noise levels.

These differences underscore the necessity of evaluating classification algorithms under multiple acoustic conditions and disease distributions. The synergy of HeartWave’s rich annotation and broad pathology coverage with other specialized repositories helps ensure that models developed are not narrowly tuned to a single patient population or recording device. By merging these complementary sources, the study fosters a more nuanced understanding of how machine learning systems respond to diverse clinical settings, ultimately reinforcing the pursuit of accurate and interpretable CVD detection across varied patient profiles32.

Heart abnormalities and representative PCG patterns

To aid interpretability and dataset transparency, we summarize below the key heart abnormalities included across the datasets used in this study. Each abnormality is associated with its typical murmur type and corresponding PCG signature.

Table 3 Heart abnormalities used in this study and their PCG characteristics.
Fig. 1: Representative PCG waveforms for different cardiac conditions. Each subplot illustrates the characteristic time-domain morphology associated with a specific abnormality.

Figure 1 presents time-domain PCG signal excerpts from multiple datasets, annotated to indicate murmur regions and S1/S2 markers. This visual comparison complements the table and facilitates a clearer understanding of the diagnostic landscape our model must navigate.

Proposed framework

This section details our end-to-end pipeline for automated heart sound analysis using the proposed Explainable HeartSound Transformer (EHST). The framework is designed to process diverse PCG (phonocardiogram) data from various sources (e.g., HeartWave, CirCor DigiScope, PhysioNet/CinC 2016, Pascal A/B, GitHub Open Access, and Heart Sounds Shenzhen) while ensuring consistent data handling, robust model training, and transparent interpretability.

EHST comprises five main components: data input and segmentation, data preprocessing and feature extraction, Transformer-based encoding, classification, and explainability. Figure 2 presents a conceptual overview of the pipeline. As shown in Table 3, different heart abnormalities can be identified by their unique PCG characteristics.

The overall architecture of EHST is illustrated in Fig. 2.

Fig. 2: Architecture of the proposed approach.

Data preprocessing

The process begins with the acquisition of raw PCG signals, which are then segmented into individual heartbeats by detecting characteristic peaks (S1 and S2). Next, the segmented signals are preprocessed by applying noise removal, normalization, and transformation into time-frequency representations such as spectrograms or Mel-Frequency Cepstral Coefficients (MFCCs). Data augmentation is applied to account for inter-device and inter-patient variability. The preprocessed segments are then passed to the Transformer-based encoder, which utilizes multi-head self-attention to learn rich temporal and spectral representations. These representations are pooled and fed into fully connected layers for final classification. Finally, explainability modules (e.g., Grad-CAM and attention visualization) are integrated to provide clinical insights into the model’s decisions.

Robust preprocessing is critical to convert raw PCG signals into model-ready features while preserving clinically relevant information. Our pipeline explicitly performs segmentation before time-frequency (TF) transformation to align the data with physiological cardiac cycles, which enhances clinical interpretability.

Segmentation and peak detection

Raw PCG signals \(x_{raw}(t)\) are segmented into individual heartbeats by detecting characteristic peaks corresponding to the first (S1) and second (S2) heart sounds. For datasets with pre-annotated peaks (e.g., HeartWave, CirCor DigiScope), these annotations are directly utilized. For datasets lacking full annotations (e.g., PhysioNet, Pascal B), we employ an automated peak detection method based on the wavelet transform:

$$\begin{aligned} \mathcal {P} = \left\{ p \mid \text {corr}\left( \Psi (x_{raw}), \Psi _{\text {template}}\right) > \delta \right\} , \end{aligned}$$
(1)

where \(\Psi (\cdot )\) denotes the wavelet transform of the signal, \(\Psi _{\text {template}}\) is a canonical wavelet template, and \(\delta\) is an empirically chosen threshold (typically between 0.6 and 0.8). The detected peaks \(\mathcal {P}\) define the boundaries for segmenting complete heartbeats.
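For concreteness, a minimal Python sketch of this template-correlation step is shown below. The Morlet wavelet, the scale range, the 200 ms minimum peak spacing, and the pre-computed template file (s1_template.npy) are illustrative assumptions rather than the exact implementation used here.

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

def detect_heartbeat_peaks(x_raw, fs, delta=0.7):
    """Sketch of Eq. (1): keep locations whose wavelet-domain correlation
    with a canonical template exceeds the threshold delta (0.6-0.8)."""
    # Wavelet transform of the raw PCG signal (Morlet scales chosen for illustration)
    scales = np.arange(1, 64)
    coeffs, _ = pywt.cwt(x_raw, scales, "morl", sampling_period=1.0 / fs)
    envelope = np.abs(coeffs).mean(axis=0)            # scale-averaged energy envelope

    # Canonical wavelet template of a clean S1/S2 complex (hypothetical pre-computed file)
    template = np.load("s1_template.npy")
    template = (template - template.mean()) / (template.std() + 1e-8)
    env_norm = (envelope - envelope.mean()) / (envelope.std() + 1e-8)

    # Per-sample correlation against the template
    corr = np.correlate(env_norm, template, mode="same") / len(template)

    # Retain peaks above delta, at least 200 ms apart (refractory assumption)
    peaks, _ = find_peaks(corr, height=delta, distance=int(0.2 * fs))
    return peaks
```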

Window length adaptation

Because heartbeats vary in length—especially due to diastolic duration changes across populations—each segmented heartbeat is adaptively adjusted in window length to normalize its duration. For instance, pediatric datasets (e.g., CirCor DigiScope) have shorter windows reflecting faster heart rates, while adult datasets (e.g., HeartWave) allow longer windows up to 1 second per cycle. This adaptive windowing reduces variability that could otherwise degrade TF representation quality.

Time-frequency transformation

Each normalized heartbeat segment is transformed into a time-frequency representation. For typical sampling frequencies (e.g., 2–4 kHz in HeartWave and Shenzhen), we compute the Short-Time Fourier Transform (STFT) using a Hamming window of length 20 ms with 50% overlap:

$$\begin{aligned} X(m, \omega ) = \sum _{n=-\infty }^\infty x(n) \, w(n - m) \, e^{-j \omega n}, \end{aligned}$$
(2)

where w(n) is the Hamming window function, m denotes the time shift, and \(\omega\) is the frequency variable. For higher sampling rates (e.g., Pascal datasets), we either downsample or adjust the window parameters to preserve frequency resolution within the 20–800 Hz band of clinical relevance. Alternatively, Mel-Frequency Cepstral Coefficients (MFCCs) are computed for moderate sampling frequencies to capture spectral features effectively.
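A short SciPy sketch of this TF step follows; the 20 ms Hamming window and 50% overlap come from the text, while restricting the output to the 20–800 Hz band and returning the magnitude spectrogram are choices made for illustration.

```python
import numpy as np
from scipy.signal import stft

def heartbeat_spectrogram(segment, fs):
    """STFT of one normalized heartbeat (Eq. 2): 20 ms Hamming window, 50% overlap."""
    nperseg = int(0.020 * fs)                        # 20 ms window
    noverlap = nperseg // 2                          # 50% overlap
    freqs, times, X = stft(segment, fs=fs, window="hamming",
                           nperseg=nperseg, noverlap=noverlap)
    band = (freqs >= 20) & (freqs <= 800)            # clinically relevant band
    return freqs[band], times, np.abs(X[band, :])    # magnitude spectrogram
```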

Normalization and data augmentation

Following TF transformation, features are normalized using z-score normalization:

$$\begin{aligned} \textbf{F}_{\text {norm}} = \frac{\textbf{F} - \mu }{\sigma }, \end{aligned}$$
(3)

where \(\textbf{F}\) denotes the extracted feature matrix, and \(\mu\), \(\sigma\) are the mean and standard deviation computed over the training data. To improve model robustness, data augmentation methods—such as noise injection, time-stretching, and random cropping—are applied to simulate inter-device and inter-patient variability.
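The sketch below illustrates the normalization of Eq. (3) together with the three augmentations mentioned above; the noise level, stretch range, and crop length are assumed values for demonstration only.

```python
import numpy as np

def zscore_normalize(features, mu, sigma):
    """Eq. (3): z-score normalization using training-set statistics mu, sigma."""
    return (features - mu) / (sigma + 1e-8)

def augment_waveform(x, fs, rng=None):
    """Illustrative waveform-level augmentation: noise injection,
    time-stretching, and random cropping."""
    rng = rng or np.random.default_rng()

    # Additive Gaussian noise (amplitude chosen for illustration)
    x_aug = x + rng.normal(0.0, 0.1 * x.std(), size=x.shape)

    # Naive time-stretch by linear resampling with a factor in [0.9, 1.1]
    factor = rng.uniform(0.9, 1.1)
    idx = np.linspace(0, len(x_aug) - 1, int(len(x_aug) * factor))
    x_aug = np.interp(idx, np.arange(len(x_aug)), x_aug)

    # Random crop back to one nominal cycle (assumed 1 s here)
    crop = min(len(x_aug), int(fs))
    start = rng.integers(0, len(x_aug) - crop + 1)
    return x_aug[start:start + crop]
```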

Preprocessing order

Segmenting before applying TF transformations ensures that feature extraction aligns with physiologically meaningful units (complete cardiac cycles). While diastolic variability causes variable segment lengths, adaptive windowing mitigates potential disruptions in TF resolution. Careful tuning of TF parameters (window length, overlap) further preserves temporal and spectral fidelity. This approach enhances the clinical interpretability of the features and supports reliable classification.

The entire preprocessing procedure is summarized in Algorithm 1.

Algorithm 1: Preprocessing Algorithm for Multiple PCG Datasets.

Transformer-based feature extractor

EHST leverages a multi-head Transformer encoder to learn rich, contextual representations from the preprocessed PCG data. Given a heartbeat segment represented as a sequence of T frames, the Transformer encoder applies self-attention to capture both local (murmur-level) and global (cycle-level) dependencies.

Multi-head self-attention and positional encoding

Each frame is first projected into an embedding space:

$$\begin{aligned} \textbf{e}_t = \textbf{F}_t \textbf{W}_e + \textbf{b}_e + \text {PE}(t), \end{aligned}$$
(4)

where \(\text {PE}(t)\) denotes the positional encoding. We adopt a sinusoidal positional encoding:

$$\begin{aligned} \begin{aligned} \textbf{PE}(t,2i)&= \sin \Bigl (\frac{t}{10000^{2i/d_{model}}}\Bigr ),\\ \textbf{PE}(t,2i+1)&= \cos \Bigl (\frac{t}{10000^{2i/d_{model}}}\Bigr ). \end{aligned} \end{aligned}$$
(5)

The multi-head self-attention mechanism then computes attention scores between all pairs of frames, which are aggregated to form a refined representation:

$$\begin{aligned} \alpha _{t,\tau }^{(h)} = \frac{\exp \Bigl (\frac{(\textbf{Q}_t^{(h)})^\top \textbf{K}_\tau ^{(h)}}{\sqrt{d_k}}\Bigr )}{\sum _{\tau '=1}^{T}\exp \Bigl (\frac{(\textbf{Q}_t^{(h)})^\top \textbf{K}_{\tau '}^{(h)}}{\sqrt{d_k}}\Bigr )}, \end{aligned}$$
(6)

where \(\textbf{Q}^{(h)}\), \(\textbf{K}^{(h)}\), and \(\textbf{V}^{(h)}\) are the query, key, and value matrices for head h, respectively.
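The following PyTorch sketch shows how Eqs. (4)–(6) can be realized with a standard Transformer encoder; the embedding size, number of heads, and layer count are placeholder values, not the tuned hyperparameters reported in Table 4.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Sinusoidal positional encoding of Eq. (5)."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                         # x: (batch, T, d_model)
        return x + self.pe[: x.size(1)]

class PCGEncoder(nn.Module):
    """Frame embedding (Eq. 4) followed by a multi-head Transformer encoder (Eq. 6)."""
    def __init__(self, n_feat, d_model=128, n_heads=4, n_layers=4, dropout=0.1):
        super().__init__()
        self.embed = nn.Linear(n_feat, d_model)   # F_t W_e + b_e
        self.pos = SinusoidalPE(d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames):                    # frames: (batch, T, n_feat)
        return self.encoder(self.pos(self.embed(frames)))   # (batch, T, d_model)
```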

Optional cross-attention for clinical data

For datasets that provide additional clinical variables (e.g., demographics), cross-attention is incorporated. Here, the PCG embeddings act as queries while clinical vectors serve as keys and values. This helps contextualize the learned features with patient-specific information.

The Transformer encoding procedure is summarized in Algorithm 2.

Algorithm 2: Transformer Encoder for PCG Frames.

Classification module

The output of the Transformer encoder is a sequence of hidden states, which is then aggregated into a single vector for each heartbeat. We adopt average pooling:

$$\begin{aligned} \textbf{h}_{\text {pool}} = \frac{1}{T}\sum _{t=1}^{T}\textbf{H}_t^{(L)}. \end{aligned}$$
(7)

The pooled vector is fed into fully connected layers with ReLU activations, followed by a softmax layer to generate class probability distributions:

$$\begin{aligned} \textbf{z}= & \textbf{W}_{out}\,\text {ReLU}(\textbf{h}_{\text {pool}}\textbf{W}_1 + \textbf{b}_1) + \textbf{b}_{out}, \end{aligned}$$
(8)
$$\begin{aligned} \hat{p}_c= & \frac{\exp (z_c)}{\sum _{i=1}^{C}\exp (z_i)}, \quad c=1,\dots ,C. \end{aligned}$$
(9)
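A minimal PyTorch sketch of Eqs. (7)–(9) is given below; the hidden width of the fully connected layer is an assumed value.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Average pooling (Eq. 7), one ReLU hidden layer (Eq. 8), softmax (Eq. 9)."""
    def __init__(self, d_model, n_classes, d_hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, n_classes)

    def forward(self, hidden_states):             # (batch, T, d_model)
        h_pool = hidden_states.mean(dim=1)                    # Eq. (7)
        z = self.out(torch.relu(self.fc1(h_pool)))            # Eq. (8): logits
        # Eq. (9); when training with cross-entropy, the raw logits z would
        # typically be returned and the softmax applied inside the loss.
        return torch.softmax(z, dim=-1)
```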

Algorithm 3 summarizes the classification process.

Algorithm 3: Classification Algorithm.

Explainability module

A key strength of EHST is its transparent decision-making, achieved through integrated explainability modules.

Grad-CAM for spectrogram visualization

When processing spectrogram inputs, we apply Grad-CAM to generate heatmaps that highlight critical frequency-time regions contributing to a class prediction. Let \(\textbf{A} \in \mathbb {R}^{F \times T \times K}\) denote a higher-level activation map. The importance of channel k for class c is computed as:

$$\begin{aligned} \alpha _k^{(c)} = \frac{1}{Z} \sum _{f,t} \frac{\partial \hat{p}_c}{\partial A_k(f,t)}, \end{aligned}$$
(10)

and the final Grad-CAM heatmap is given by:

$$\begin{aligned} H_c(f,t) = \text {ReLU}\Bigl (\sum _{k} \alpha _k^{(c)} \, A_k(f,t)\Bigr ). \end{aligned}$$
(11)

This visualization enables clinicians to verify that EHST focuses on relevant segments, such as systolic or diastolic murmur intervals.
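The Grad-CAM computation of Eqs. (10)–(11) can be sketched as follows, assuming the activation maps have been retained in the autograd graph during the forward pass.

```python
import torch

def grad_cam_heatmap(activations, class_prob):
    """Sketch of Eqs. (10)-(11).

    activations : tensor of shape (K, F, T), an intermediate feature map kept
                  in the autograd graph of the forward pass.
    class_prob  : scalar predicted probability for the target class c.
    """
    # Eq. (10): channel weights = global average of the gradients
    grads = torch.autograd.grad(class_prob, activations, retain_graph=True)[0]
    alpha = grads.mean(dim=(1, 2))                                   # (K,)

    # Eq. (11): weighted sum over channels followed by ReLU
    heatmap = torch.relu((alpha[:, None, None] * activations).sum(dim=0))
    return heatmap / (heatmap.max() + 1e-8)                          # (F, T), scaled for display
```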

Attention weight visualization

For the self-attention layers, we average attention scores across all heads:

$$\begin{aligned} \beta _{t,\tau } = \frac{1}{N_{\text {heads}}}\sum _{h=1}^{N_{\text {heads}}}\alpha _{t,\tau }^{(h)}. \end{aligned}$$
(12)

Plotting \(\beta _{t,\tau }\) as a matrix or as row-sum plots reveals the temporal regions the model deems most critical, offering another layer of interpretability.
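A small helper corresponding to Eq. (12), assuming the per-head attention maps are returned by the encoder as a tensor of shape (n_heads, T, T):

```python
import torch

def average_attention(attn_per_head):
    """Eq. (12): average the per-head attention maps into a single (T, T) matrix."""
    beta = attn_per_head.mean(dim=0)
    frame_importance = beta.sum(dim=0)   # aggregate attention received by each frame
    return beta, frame_importance
```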

Algorithm 4 outlines the process of generating explainable outputs using both Grad-CAM and attention visualization.

Algorithm 4: Explainable AI Generation.

By integrating these modules, EHST not only achieves high diagnostic performance but also provides transparent, clinically interpretable insights that can support and augment traditional cardiac auscultation.

Performance metrics

This section details the evaluation framework for our Transformer-based heart sound classification system. We describe the loss functions employed to address class imbalance, present the hyperparameter configurations for our model and ten baseline approaches, and outline a comprehensive set of metrics used to assess both classification performance and interpretability.

Loss functions

Addressing class imbalance

Heart sound datasets are often imbalanced, with normal recordings typically outnumbering pathological cases. To prevent the model from favoring majority classes, we adopt loss functions that assign higher penalties to misclassifications of minority classes. This is essential to improve the model’s sensitivity to less frequent but clinically significant conditions.

Weighted cross-entropy

For a classification task with \(C\) classes, let \(\textbf{y} \in \{0,1\}^C\) denote the one-hot encoded label and \(\hat{\textbf{p}} \in \mathbb {R}^C\) the predicted probability distribution. The weighted cross-entropy loss is defined as:

$$\begin{aligned} \mathcal {L}_{\text {WCE}} = - \sum _{c=1}^{C} w_c \, y_c \, \log (\hat{p}_c), \end{aligned}$$
(13)

where \(w_c\) is inversely proportional to the frequency of class \(c\), thus penalizing errors on underrepresented classes more heavily. This loss function ensures that the model’s performance does not suffer when dealing with classes that are inherently less frequent in the dataset.
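In PyTorch, the weighted loss of Eq. (13) can be instantiated as below; the balanced inverse-frequency weighting shown is one common convention, and the loss is applied to the logits of Eq. (8).

```python
import numpy as np
import torch
import torch.nn as nn

def make_weighted_ce(train_labels, n_classes):
    """Weighted cross-entropy (Eq. 13) with weights inversely proportional to
    class frequency; the balanced scaling below is one common choice."""
    counts = np.bincount(train_labels, minlength=n_classes).astype(float)
    weights = counts.sum() / (n_classes * np.maximum(counts, 1.0))
    # Note: nn.CrossEntropyLoss expects raw logits, not softmax outputs.
    return nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```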

Focal loss

In scenarios of severe imbalance, focal loss further reduces the relative loss for well-classified examples and emphasizes harder samples. It is defined as:

$$\begin{aligned} \mathcal {L}_{\text {focal}} = - \sum _{c=1}^{C} \alpha _c \, y_c \, (1 - \hat{p}_c)^{\gamma } \, \log (\hat{p}_c), \end{aligned}$$
(14)

where \(\gamma > 0\) is the focusing parameter and \(\alpha _c\) are class-specific weights. In our experiments, weighted cross-entropy is applied for moderate imbalance, whereas focal loss is used when minority classes are particularly sparse.
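A compact focal-loss sketch following Eq. (14) is given below; gamma = 2 is a common default rather than the value tuned in our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Focal loss (Eq. 14) for integer targets.

    logits : (batch, C) raw scores, targets : (batch,) class indices,
    alpha  : (C,) class-specific weights, gamma : focusing parameter.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log-probability of the true class
    pt = log_pt.exp()
    loss = -alpha[targets] * (1.0 - pt) ** gamma * log_pt
    return loss.mean()
```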

Hyperparameter tuning

Transformer-based model configuration

Table 4 summarizes the default hyperparameters for our Transformer-based classifier, determined through grid searches and pilot studies across multiple heart sound datasets.

Table 4 Default hyperparameters for the proposed transformer-based classifier.

Baseline model hyperparameters

Table 5 outlines typical hyperparameter settings for ten common baseline models spanning CNN, RNN, and hybrid architectures. These serve as benchmarks for comparing our proposed approach.

Table 5 Hyperparameter summaries for ten common baseline models.

Regularization is achieved through dropout (0.1–0.3) and weight decay (\(10^{-5}\)–\(10^{-4}\)). Hyperparameter optimization is performed via manual/random search followed by systematic methods such as Bayesian optimization, with early stopping based on validation F1-score or loss.

Evaluation metrics

In this section, we describe the comprehensive set of evaluation metrics employed to assess both the classification performance and model interpretability. These metrics allow us to evaluate the model on multiple fronts, ensuring robustness, reliability, and clinical relevance.

Classification metrics

Accuracy:

$$\begin{aligned} \text {Accuracy} = \frac{\text {Total Correct Predictions}}{\text {Total Samples}}. \end{aligned}$$

While accuracy is a commonly used metric, it can be misleading in imbalanced datasets where a model can achieve high accuracy by predominantly predicting the majority class. Therefore, accuracy must be considered alongside other metrics, especially in scenarios with rare conditions.

Precision, recall, and F1-score:

For each class \(c\), with \(TP_c\), \(FP_c\), and \(FN_c\) denoting true positives, false positives, and false negatives respectively:

$$\begin{aligned} \text {Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \text {Recall}_c = \frac{TP_c}{TP_c + FN_c}, \end{aligned}$$

where precision evaluates the proportion of correct positive predictions, and recall measures the ability to correctly identify positive instances. The F1-score combines these two metrics as the harmonic mean:

$$\begin{aligned} \text {F1}_c = 2 \times \frac{\text {Precision}_c \times \text {Recall}_c}{\text {Precision}_c + \text {Recall}_c}. \end{aligned}$$

Macro-F1 computes the F1-score independently for each class and then takes the average, while weighted-F1 adjusts for class imbalance by considering the support (the number of true instances for each class).

Area under the ROC curve (AUC):

The AUC is computed for each class using a one-vs.-rest approach. An AUC value above 0.9 generally indicates robust discriminative performance. This metric is particularly useful for understanding model behavior in differentiating between normal and abnormal heart sounds, even in the presence of noisy data or limited samples.

Confusion matrix:

The confusion matrix provides detailed insights into the class-wise performance of the model, showing true positives, false positives, false negatives, and true negatives. It is a useful tool for identifying potential areas for improvement, such as misclassifications between similar classes (e.g., between normal and mild cases).

Matthews correlation coefficient (MCC):

The MCC measures the quality of binary and multi-class classifications and provides a balanced evaluation even for imbalanced datasets:

$$\begin{aligned} \text {MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$

An MCC score close to +1 indicates excellent performance, 0 indicates no better than random predictions, and -1 indicates total disagreement between prediction and ground truth.
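All of the above metrics are available in scikit-learn; the helper below gathers them for one cross-validation fold (per-class probabilities are needed for the one-vs.-rest AUC).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

def summarize_fold(y_true, y_pred, y_prob):
    """Collects the classification metrics described above for one fold;
    y_prob holds per-class probabilities for the one-vs.-rest AUC."""
    return {
        "accuracy":    accuracy_score(y_true, y_pred),
        "macro_f1":    f1_score(y_true, y_pred, average="macro"),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
        "auc_ovr":     roc_auc_score(y_true, y_prob, multi_class="ovr"),
        "mcc":         matthews_corrcoef(y_true, y_pred),
        "confusion":   confusion_matrix(y_true, y_pred),
    }
```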

Explainability metrics

Explainability is essential for building trust in AI models in clinical settings. We evaluate the transparency of our model using several metrics, including Grad-CAM visualizations and SHAP (SHapley Additive exPlanations).

Overlap with expert annotations:

For datasets with annotated murmur segments, we measure the degree of alignment between the model’s Grad-CAM or attention heatmaps and the expert annotations. We define the overlap ratio as follows:

$$\begin{aligned} \text {OverlapRatio} = \frac{\sum _{(f,t)\,\in \,\text {annotation}} H_c(f,t)}{\sum _{(f,t)} H_c(f,t)}, \end{aligned}$$

where \(H_c(f,t)\) denotes the importance map for class \(c\), and the numerator sums over the frequency bins \(f\) and time steps \(t\) that fall within the annotated region.

Intersection-over-Union (IoU):

IoU measures the spatial agreement between the model’s highlighted areas and expert annotations. By thresholding \(H_c(f,t)\) to form a binary mask, we compute:

$$\begin{aligned} \text {IoU} = \frac{|\text {Mask} \cap \text {Annotated Region}|}{|\text {Mask} \cup \text {Annotated Region}|}. \end{aligned}$$

This metric quantifies how well the model identifies the correct regions in the time-frequency space corresponding to pathological heart sounds, such as systolic or diastolic murmurs.
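Both alignment metrics reduce to a few NumPy operations once the heatmap \(H_c\) and a boolean annotation mask share the same time-frequency grid; the 0.5 threshold below is an illustrative choice.

```python
import numpy as np

def overlap_ratio(heatmap, annotation_mask):
    """Fraction of total heatmap mass that falls inside the expert-annotated region."""
    return float(heatmap[annotation_mask].sum() / (heatmap.sum() + 1e-8))

def iou(heatmap, annotation_mask, threshold=0.5):
    """IoU between the thresholded heatmap and the annotated murmur region."""
    pred_mask = heatmap >= threshold
    intersection = np.logical_and(pred_mask, annotation_mask).sum()
    union = np.logical_or(pred_mask, annotation_mask).sum()
    return float(intersection / (union + 1e-8))
```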

SHAP analysis:

In addition to Grad-CAM, we use SHAP to provide further insight into how different features contribute to model decisions. SHAP values decompose the prediction for a particular instance into contributions from each feature, quantifying their individual impact. This allows clinicians to understand which time-frequency components or other features (e.g., patient demographic data) are most influential in driving predictions.

Training workflow

Our training workflow is designed to optimize both predictive performance and interpretability. It includes the following steps:

  1. Model initialization: Configure the Transformer architecture, including the depth, number of attention heads, and feed-forward dimensions. The loss function is selected based on the dataset characteristics (e.g., weighted cross-entropy for imbalanced datasets or focal loss for harder-to-detect classes).

  2. Hyperparameter setup: Initialize key hyperparameters, including the learning rate (e.g., \(1 \times 10^{-3}\)), batch size (8–32), and dropout rate, and set up a dynamic scheduler such as ReduceLROnPlateau for adaptive learning rates.

  3. Iterative training:

    • Forward pass: Compute predictions for each batch and evaluate the loss function based on the true labels.

    • Backward pass: Backpropagate the loss and update model weights using optimizers such as Adam or AdamW.

    • Learning rate adjustment: Adjust the learning rate when validation metrics plateau, ensuring that overfitting is avoided and convergence is achieved.

  4. Validation and early stopping: Monitor validation metrics (F1-score, accuracy, AUC) on a held-out set, and apply early stopping if validation metrics show no improvement after a predefined number of epochs.

  5. Final testing and explainability analysis: After training, evaluate the model using classification metrics (accuracy, F1, AUC) and generate Grad-CAM or attention maps. For datasets with murmur annotations, compute overlap and IoU metrics.

  6. Benchmarking: Compare the performance and interpretability of our model against baseline models using consistent evaluation criteria, as shown in Table 5.

This training and evaluation protocol ensures that our system not only delivers high-accuracy predictions but also provides transparent and clinically interpretable insights for heart sound classification.
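The workflow above can be condensed into the following PyTorch training-loop sketch; the optimizer settings, scheduler patience, and the `evaluate` helper returning a validation F1-score are assumptions made for illustration.

```python
import torch

def train_ehst(model, loaders, loss_fn, evaluate, n_epochs=100, patience=10, lr=1e-3):
    """Condensed sketch of the training workflow; `evaluate` is an assumed helper
    returning the validation F1-score, and all hyperparameters are placeholders."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="max", patience=3)
    best_f1, epochs_without_gain = 0.0, 0
    for epoch in range(n_epochs):
        model.train()
        for x, y in loaders["train"]:
            opt.zero_grad()
            loss = loss_fn(model(x), y)        # forward pass and loss
            loss.backward()                    # backward pass
            opt.step()                         # weight update
        val_f1 = evaluate(model, loaders["val"])
        sched.step(val_f1)                     # reduce LR when the metric plateaus
        if val_f1 > best_f1:
            best_f1, epochs_without_gain = val_f1, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:   # early stopping
                break
    return model
```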

Experimental setup

This section details the experimental design used to evaluate our Transformer-based heart sound classification framework. We describe the datasets, baseline models, cross-validation protocols, and testing scenarios, all tailored to assess the model’s performance under real-world conditions.

Datasets employed

Our experiments encompass seven repositories: six publicly available/open-access datasets and our in-house HeartWave dataset. These datasets exhibit diverse patient demographics, sampling frequencies, recording devices, and pathology definitions. Tables 6 and 7 summarize the key characteristics.

Open-access datasets

Table 6 provides an overview of the open-access datasets used:

  • CirCor DigiScope: Contains 5,282 recordings with detailed murmur annotations for pediatric heart sounds; however, adult data are limited, posing a domain-shift challenge.

  • PhysioNet/CinC 2016: Comprises 2,575 normal and 655 abnormal recordings at 2 kHz, characterized by significant noise and limited representation of rare CVDs.

  • Pascal Datasets A and B: Feature high-frequency recordings (44.1 kHz and 44 kHz, respectively) suitable for artifact detection, albeit with small sample sizes and restricted pathology diversity.

  • GitHub Open Access: Focuses on four valvular conditions (AS, MS, MVP, MR) with 1,000 recordings, but lacks demographic variety.

  • Heart Sounds Shenzhen (HSS): Contains 845 recordings classified as normal, mild, or severe, providing a progression-based perspective on valvular disease.

Table 6 Summary of datasets for heart sound analysis.

HeartWave dataset

In addition to open-access repositories, we use the HeartWave dataset, which consists of 1,353 high-quality heart sound recordings sampled at 2–4 kHz. The dataset includes nine classes covering both normal and pathological conditions, with expert annotations for S1, S2, murmurs, and extra heart sounds. This dataset is pivotal for evaluating classification accuracy and ensuring that model predictions align with clinically relevant features.

Table 7 HeartWave dataset summary.

Unified preprocessing and comparison

To ensure comparability across datasets with varying sampling rates and acquisition devices, all data are processed through a unified preprocessing pipeline. First, datasets recorded at higher frequencies (e.g., the Pascal dataset) are downsampled or resampled to a common range of 2–4 kHz. Next, the recordings are segmented into individual heartbeats using robust peak detection methods, such as amplitude thresholding or wavelet-based techniques, which leverage available annotations for accurate segmentation. Each extracted heartbeat is then transformed into a time-frequency representation—using either Short-Time Fourier Transform (STFT) or Mel-Frequency Cepstral Coefficients (MFCC)—to capture both spectral and temporal features essential for subsequent analysis. To further enhance signal quality, a band-pass filter (20–800 Hz) combined with wavelet denoising is applied, reducing ambient noise and other artifacts. Finally, data augmentation techniques, including time-stretching, noise injection, and random cropping, are employed to improve the generalizability of the model. The preprocessed data from this pipeline are then used as inputs to the Transformer-based model, ensuring that performance comparisons across different datasets are both fair and consistent.
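A sketch of the resampling and band-pass stages of this unified pipeline is given below (the wavelet-denoising and augmentation steps are omitted for brevity); the 4th-order Butterworth design and the 2 kHz target rate are illustrative choices consistent with the 20–800 Hz band and 2–4 kHz range stated above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def standardize_recording(x, fs_in, fs_out=2000):
    """Resample a recording to a common rate and band-pass it to 20-800 Hz."""
    up, down = fs_out, int(fs_in)
    g = np.gcd(up, down)
    x = resample_poly(x, up // g, down // g)          # rational-rate resampling
    b, a = butter(4, [20, 800], btype="bandpass", fs=fs_out)
    return filtfilt(b, a, x)                          # zero-phase band-pass filtering
```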

Baseline models

To benchmark our proposed Transformer-based model, we compare it against ten established baseline models that employ various architectures for feature extraction and classification:

  • CNN-Baseline: Uses stacked convolutional layers followed by a fully connected classifier to capture local features33.

  • RNN-Baseline: Employs recurrent architectures (LSTM or GRU) to model the sequential dependencies in PCG signals.

  • CNN-RNN Hybrid: Combines convolutional layers for local feature extraction with RNNs for temporal modeling.

  • CRNN-Attention: A hybrid model incorporating an attention mechanism to focus on diagnostically relevant segments.

  • TCN Model: Utilizes Temporal Convolutional Networks with dilated convolutions to capture multi-scale temporal dependencies.

  • RNN-Transformer: Merges RNNs with a Transformer block to leverage both sequential and attention-based modeling.

  • Multi-Task CNN: Performs joint classification and segmentation using a shared CNN architecture.

  • DenseNet-Style CNN: Features densely connected convolutional blocks for enhanced feature reuse.

  • Transformer-Lite: A simplified Transformer model with 1–2 self-attention layers, optimized for resource-constrained environments.

  • Wavelet-CNN: Integrates wavelet transforms with CNNs to robustly extract features from low-SNR signals.

All baseline models are trained using standard hyperparameters (learning rates of \(10^{-4}\) to \(10^{-3}\), batch sizes of 8–32, and moderate dropout). For imbalanced datasets, weighted cross-entropy or focal loss is employed to mitigate class imbalance.

Cross-validation

Robust performance estimates are obtained by employing cross-validation (CV) with stratified sampling to preserve the class distributions within each dataset34. For larger datasets such as HeartWave (1,353 samples) and Shenzhen (845 samples), a 5-fold CV strategy is adopted, where one fold is held out for testing and the remaining folds are used for training. In contrast, for smaller datasets such as Pascal A/B, a 10-fold CV approach is used to maximize the training data available in each fold and to secure multiple test splits, thus ensuring more reliable performance evaluation. The model performance is assessed using mean accuracy, F1-scores, AUC, and confusion matrices, which collectively provide a comprehensive and robust set of metrics for comparing model performance across different datasets.
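Stratified splits of this kind can be generated with scikit-learn as follows; the fixed random seed is an illustrative choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(labels, n_splits=5, seed=42):
    """Stratified CV splits preserving class proportions
    (n_splits=5 for larger datasets, n_splits=10 for Pascal A/B)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    dummy_X = np.zeros(len(labels))      # only the labels drive the stratification
    return list(skf.split(dummy_X, labels))
```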

Results and analysis

This section provides a comprehensive evaluation of the proposed Explainable HeartSound Transformer (EHST) for multi-class heart sound diagnosis. We assess its performance across multiple datasets and compare it with several state-of-the-art baseline models. Our analysis includes overall classification performance, class-wise breakdowns, rare-class detection, ablation studies, confusion matrix analysis, and interpretability via explainability metrics such as Grad-CAM and SHAP. Additionally, further analyses address segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency.

Overall multi-dataset performance

We evaluated EHST on six datasets: HeartWave, CirCor DigiScope, PhysioNet/CinC, Pascal (A+B), GitHub (Valvular), and Shenzhen (HSS). Performance metrics including accuracy, precision, recall, macro-F1 score, Matthews Correlation Coefficient (MCC), and Area Under the ROC Curve (AUC) were computed and compared with ten baseline models. Table 8 summarizes the results across all six datasets.

Table 8 Overall performance across six datasets (EHST vs. Baselines).

The overall results in Table 8 indicate that EHST consistently achieves higher performance than the baseline models. For example, on the HeartWave dataset, EHST attains an accuracy of 96.7% and a macro-F1 score of 95.5%, compared to 94.8% and 93.1% respectively for the best baseline. Similarly, in CirCor DigiScope and PhysioNet CinC, EHST outperforms the baselines by 3–5% in accuracy and 2–4% in F1 score. Additionally, the AUC values remain above 0.90 across all datasets, which confirms the model’s robust ability to distinguish between normal and abnormal heart sounds even in noisy and imbalanced settings. The consistently high Matthews Correlation Coefficient (MCC), typically around 0.91–0.94 (not shown in this table), further corroborates EHST’s strong performance in dealing with imbalanced data.

Fig. 3: Grouped bar chart of accuracy across datasets.

Figure 3 presents a grouped bar chart comparing the accuracy of EHST with the Mean Baselines and Best Baseline across the six datasets. It is evident from the chart that EHST achieves higher accuracy values on each dataset. For instance, on the HeartWave dataset, EHST reaches an accuracy of 96.7% compared to 94.8% for the best baseline. This visual comparison highlights EHST’s significant improvement in overall classification accuracy, reinforcing its suitability for clinical applications where high accuracy is crucial.

Fig. 4: Line plot of F1 score across datasets.

The line plot in Fig. 4 illustrates the F1 scores for EHST, Mean Baselines, and Best Baseline methods across six datasets. The plot clearly shows that EHST consistently achieves higher F1 scores on all datasets compared to the baseline methods. The trend line for EHST lies above those for both the mean and best baselines, indicating a superior balance between precision and recall. This consistent performance across datasets underscores EHST’s capability to reliably distinguish between classes even in challenging, imbalanced settings.

Fig. 5: Box plot of AUC distribution across methods.

The box plot in Fig. 5 depicts the distribution of AUC values for EHST, Mean Baselines, and Best Baseline methods over several cross-validation folds. The vertical extent of each box represents the interquartile range (IQR) of AUC scores, with the median indicated by a horizontal line. EHST exhibits a higher median AUC with a narrower IQR compared to the baseline methods, which implies not only high discriminative power but also low variability across folds. Consistently, AUC values remain above 0.90 across all datasets, underscoring EHST’s robust ability to differentiate between normal and abnormal heart sounds, even in the presence of noise and imbalanced data.

Overall, EHST demonstrates improvements of 3–5% in accuracy and 2–4% in macro-F1 score over baseline methods, underscoring its strong discriminative power and clinical applicability.

Class-wise breakdown on HeartWave

Table 9 presents the detailed class-wise performance (precision, recall, and F1 score) for each of the nine heart sound classes in the HeartWave dataset, evaluated using 5-fold cross-validation.

Table 9 Class-specific performance on HeartWave (9 Classes, 5-Fold CV).

The class-wise analysis shows that EHST achieves high F1 scores for normal heart sounds (97.9%) and maintains F1 scores above 90% for minority classes such as congenital anomalies and miscellaneous rare conditions. This indicates effective handling of class imbalance through the weighted loss function and attention mechanisms, ensuring that both common and rare pathologies are accurately detected.

Rare-class detection across multiple datasets

Rare-class detection performance was measured using micro-F1 scores for underrepresented classes across selected datasets. Table 10 presents these results, demonstrating that EHST outperforms the best baseline by 2–3% in detecting rare conditions.

Table 10 Rare-class performance across selected datasets (Micro-F1 for underrepresented Classes).

Figure 6 shows a dumbbell plot that compares the rare-class F1 scores for EHST and the best baseline across four datasets: Pascal (A), Pascal (B), GitHub, and Shenzhen (HSS). In this plot, each dataset is represented by two markers—one for EHST (displayed in tomato) and one for the best baseline (displayed in steelblue). A vertical line connects the markers for each dataset, clearly illustrating the performance gap between EHST and the best baseline. The plot clearly demonstrates that EHST consistently outperforms the best baseline across all four datasets, with improvements ranging from approximately 2% to 3% in F1 score. For example, on the Pascal (A) dataset, EHST achieves an F1 score of 86.4% compared to 82.9% for the best baseline, as shown by the gap between the markers. Similar trends are observed for the Pascal (B), GitHub, and Shenzhen datasets, highlighting EHST’s robust performance in detecting underrepresented classes. The relatively narrow gap between the upper and lower bounds (represented by the connecting lines) also suggests that the model’s performance is consistent across cross-validation folds.

Fig. 6 Rare-class F1 score comparison for EHST vs. best baseline.

This analysis confirms that EHST consistently outperforms baselines in rare-class detection, which is critical for clinical applications where underrepresented pathologies must be reliably identified.
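
The rare-class scores reported above correspond to micro-averaged F1 restricted to the underrepresented labels; a minimal scikit-learn sketch (with illustrative labels) is shown below.

```python
from sklearn.metrics import f1_score

# Ground-truth and predicted labels from one cross-validation fold (illustrative values).
y_true = [0, 0, 1, 2, 3, 3, 4, 4, 0, 1, 2, 4]
y_pred = [0, 0, 1, 2, 3, 4, 4, 4, 0, 1, 1, 4]

# Suppose classes 3 and 4 are the underrepresented (rare) classes.
rare_labels = [3, 4]

# Micro-F1 restricted to the rare labels: pooled TP/FP/FN over those classes only.
rare_micro_f1 = f1_score(y_true, y_pred, labels=rare_labels, average="micro")
print(f"rare-class micro-F1: {rare_micro_f1:.3f}")
```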

Ablation studies and confusion matrix

Ablation studies were performed on the HeartWave dataset (using 5-fold CV) to quantify the contributions of self-attention and Grad-CAM. Table 11 shows that removing self-attention results in a substantial performance drop, whereas removing Grad-CAM mainly affects interpretability.

Table 11 Ablation: impact of removing self-attention and Grad-CAM (HeartWave, 5-Fold CV).

Removing the self-attention module leads to a 2.7% drop in F1 score, highlighting its importance for identifying critical features such as murmurs. While removing Grad-CAM has minimal impact on classification performance, it diminishes the model’s interpretability—an essential factor in clinical applications (Fig. 7).

Fig. 7 Grouped bar chart of ablation study metrics for EHST (lightcoral: accuracy; lightseagreen: F1 score; lightsteelblue: AUC).

The chart shows that removing the self-attention mechanism causes a marked drop in both accuracy and F1 score, from 96.7% to 94.2% and from 95.5% to 92.3%, respectively. Removing the Grad-CAM module alone has a negligible effect on these quantitative metrics, as indicated by the nearly identical values. When both components are removed, performance declines further, underscoring the importance of the self-attention mechanism for robust classification. The AUC remains comparatively stable (97% for the full EHST, falling to 93% when both modules are removed), and all configurations retain AUC values above 90%, indicating strong overall discriminative ability.

The confusion matrix for the Shenzhen dataset is presented in Table 12. This matrix indicates that misclassifications predominantly occur between the Mild and Severe classes, which may be attributed to overlapping acoustic characteristics.

Table 12 Confusion matrix for Shenzhen dataset (EHST, 3 classes).

The strong diagonal of the confusion matrix indicates that EHST is highly accurate for Normal and Severe cases. However, some misclassifications occur between Mild and Severe, suggesting that further refinement of training data or feature extraction methods could help improve discrimination between these classes.
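
A confusion matrix such as Table 12 is obtained directly from the fold-level predictions; the following sketch assumes three illustrative label arrays for the Normal/Mild/Severe task.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["Normal", "Mild", "Severe"]

# Illustrative ground-truth and predicted labels for the 3-class Shenzhen task.
y_true = ["Normal", "Normal", "Mild", "Severe", "Mild", "Severe", "Normal", "Mild"]
y_pred = ["Normal", "Normal", "Mild", "Severe", "Severe", "Severe", "Normal", "Mild"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows: true class, columns: predicted class

# Per-class recall from the diagonal, useful for spotting Mild/Severe confusion.
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, np.round(recall, 3))))
```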

Explainability metrics and SHAP analysis

To evaluate the interpretability of EHST, we use both Grad-CAM and SHAP analyses. Table 13 shows the overlap ratio and Intersection-over-Union (IoU) for systolic and diastolic murmur annotations on the HeartWave dataset. Additionally, Table 14 lists the top five features ranked by mean SHAP value.

Table 13 SHAP-based explainability metrics on HeartWave murmur annotations.
Table 14 Top 5 features ranked by mean SHAP value for EHST on HeartWave dataset.
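
The overlap ratio and IoU in Table 13 compare thresholded attention maps against expert murmur annotations; a minimal numpy sketch of both metrics is given below (the masks and the binarization threshold are illustrative assumptions).

```python
import numpy as np

def overlap_and_iou(attention_map, annotation_mask, threshold=0.5):
    """Overlap ratio and IoU between a thresholded attention map and an expert mask."""
    pred = attention_map >= threshold          # binarize the (normalized) attention map
    gt = annotation_mask.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    overlap_ratio = intersection / max(gt.sum(), 1)   # fraction of the annotation covered
    iou = intersection / max(union, 1)
    return overlap_ratio, iou

# Illustrative 1-D time-axis masks over a heartbeat segment.
attn = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.2, 0.1, 0.0])
anno = np.array([0,   0,   1,   1,   1,   1,   0,   0])
print(overlap_and_iou(attn, anno))   # (0.75, 0.75) for this toy example
```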

In addition to these tables, we employ a beeswarm plot to visualize the SHAP values for each feature across the entire HeartWave dataset. Figure 8 displays the beeswarm plot, which illustrates the distribution of SHAP values per feature. Each point represents a single instance; the position along the horizontal axis indicates the impact on the model output, while the color reflects the feature’s value. This visualization helps in understanding not only which features are most influential but also whether higher or lower feature values push the model output in a particular direction.

Fig. 8 Beeswarm plot of SHAP values for EHST on the HeartWave dataset.

The explainability metrics show that EHST improves overlap and IoU by 5–8% compared to baseline models, indicating superior alignment between the model’s attention maps and expert annotations. The beeswarm plot further corroborates these findings by highlighting that features such as the murmur frequency band, S1 amplitude, and S2 duration have the greatest impact on the predictions. These insights provide a clear, interpretable understanding of the model’s decision-making process, ensuring that clinicians can trust the automated diagnoses. Overall, the combination of quantitative metrics and visualizations like the beeswarm plot confirms the clinical relevance and transparency of EHST.
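
As a point of reference, the mean-SHAP ranking in Table 14 and the beeswarm view in Fig. 8 follow the standard SHAP workflow; the sketch below applies it to a tree-ensemble surrogate with illustrative feature names rather than the full EHST model.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["murmur_band_energy", "s1_amplitude", "s2_duration",
                 "spectral_centroid", "zero_crossing_rate"]   # illustrative names

# Synthetic stand-in for tabular summary features of heart-sound segments.
X = rng.normal(size=(400, len(feature_names)))
y = (X[:, 0] + 0.5 * X[:, 1] - 0.3 * X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# TreeExplainer is the standard SHAP explainer for tree ensembles.
explainer = shap.TreeExplainer(clf)
sv = explainer.shap_values(X)
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]   # SHAP values for the positive class

# Rank features by mean absolute SHAP value (as in Table 14).
mean_abs = np.abs(sv_pos).mean(axis=0)
for name, val in sorted(zip(feature_names, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {val:.4f}")

# Beeswarm-style summary plot (as in Fig. 8).
shap.summary_plot(sv_pos, X, feature_names=feature_names)
```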

Statistical validation with A-Test

To rigorously evaluate the statistical significance and robustness of EHST’s performance compared to baseline models, we employed the non-parametric A-Test35. The A-Test quantifies the probability that a randomly selected observation from one distribution (EHST results) will be greater than a randomly selected observation from another distribution (baseline results):

$$\begin{aligned} A_{12} = P(X_1 > X_2) + 0.5 \cdot P(X_1 = X_2), \end{aligned}$$
(15)

where \(X_1\) and \(X_2\) are the distributions of accuracy, macro F1-score, or AUC obtained from cross-validation folds. An \(A_{12}\) value of 0.5 indicates no difference, whereas values approaching 0 or 1 imply a strong effect size in favor of one method. Conventionally, thresholds of \(A > 0.71\) or \(A < 0.29\) are considered large effects, \(0.64 \le A \le 0.71\) or \(0.29 \le A \le 0.36\) medium, and \(0.56 \le A \le 0.64\) or \(0.36 \le A \le 0.44\) small36.
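
Equation (15) can be evaluated directly on the per-fold metric values; a minimal implementation of the Vargha-Delaney A statistic is sketched below, with illustrative fold scores.

```python
import numpy as np

def a12(x1, x2):
    """Vargha-Delaney A statistic: P(X1 > X2) + 0.5 * P(X1 = X2), estimated over all pairs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    greater = (x1[:, None] > x2[None, :]).mean()
    ties = (x1[:, None] == x2[None, :]).mean()
    return greater + 0.5 * ties

# Illustrative macro-F1 scores per cross-validation fold.
ehst_folds = [0.955, 0.948, 0.951, 0.957, 0.953]
baseline_folds = [0.921, 0.930, 0.925, 0.918, 0.927]

a = a12(ehst_folds, baseline_folds)
print(f"A12 = {a:.2f}")   # 1.0 here: every EHST fold exceeds every baseline fold (large effect)
```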

Table 15 presents the A-Test results comparing EHST with the best-performing baseline models across all datasets. The reported values consistently lie far from 0.5 (in the 0.1–0.2 range), which corresponds to a large effect size under the conventions above and confirms that the performance differences between EHST and the baselines are robust and statistically meaningful.

Table 15 A-Test scores comparing EHST with best baseline models across datasets. Values close to 0 or 1 imply large effect sizes.

The results confirm that EHST significantly outperforms the baseline models across all datasets and evaluation metrics, further validating the robustness of the proposed framework.

Segmentation method analysis

To evaluate the impact of the segmentation approach on performance, we compared manual annotations with automated peak detection on the HeartWave dataset. As shown in Table 16, the accuracy and macro F1 scores obtained using automated peak detection are nearly equivalent to those derived from manual annotation. This close performance indicates that the automated segmentation pipeline is robust and reliable, reducing the need for labor-intensive manual labeling while still maintaining high-quality input for the EHST model.

Table 16 Performance comparison for segmentation methods on HeartWave.

These results confirm that the automated method effectively captures the essential heart sound segments with minimal loss in performance, making it a viable option for scaling up the data preparation process.
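
The automated route can be approximated with standard signal-processing primitives; the sketch below shows one way to locate candidate S1/S2 peaks from a band-limited amplitude envelope (the filter band, smoothing window, and thresholds are assumptions rather than the exact pipeline parameters).

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def detect_heart_sound_peaks(pcg, fs, min_interval_s=0.25):
    """Return sample indices of candidate S1/S2 peaks in a PCG recording."""
    # Band-pass to the typical heart-sound band (assumes fs of at least ~1 kHz).
    b, a = butter(4, [25 / (fs / 2), 400 / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, pcg)

    # Rectify and smooth to obtain an amplitude envelope.
    win = max(int(0.02 * fs), 1)
    envelope = np.convolve(np.abs(filtered), np.ones(win) / win, mode="same")

    # Peaks separated by at least min_interval_s and above a relative amplitude threshold.
    peaks, _ = find_peaks(envelope,
                          distance=int(min_interval_s * fs),
                          height=0.3 * envelope.max())
    return peaks

# Illustrative usage on a synthetic 5 s signal sampled at 2 kHz.
fs = 2000
t = np.arange(0, 5, 1 / fs)
pcg = 0.05 * np.random.randn(t.size)
for beat_start in np.arange(0.2, 4.8, 0.8):          # synthetic "S1" bursts every 0.8 s
    idx = (t > beat_start) & (t < beat_start + 0.05)
    pcg[idx] += np.sin(2 * np.pi * 60 * t[idx])
print(detect_heart_sound_peaks(pcg, fs)[:5])
```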

Window length sensitivity

We examined the sensitivity of EHST to different window lengths used during segmentation on the HeartWave dataset. As summarized in Table 17, the model’s performance varies slightly with different window lengths. A window length of 1.0 s produced the highest accuracy (96.7%) and macro F1 (95.5%), suggesting that this duration provides an optimal balance between capturing sufficient temporal dynamics and minimizing noise.

Table 17 Performance variation with different window lengths on HeartWave.

The marginal differences observed imply that while shorter windows might not capture the full extent of a heartbeat cycle, longer windows could introduce additional noise. Therefore, a 1.0 s window appears to be the optimal setting for the EHST model.
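
In practice, the window-length setting only controls how much signal around each detected peak is passed to the model; a brief sketch, continuing the peak-detection example above with the 1.0 s setting, is shown below.

```python
import numpy as np

def extract_windows(pcg, fs, peaks, window_s=1.0):
    """Extract fixed-length windows centred on detected peaks; edge peaks are skipped."""
    half = int(window_s * fs / 2)
    windows = [pcg[p - half:p + half] for p in peaks
               if p - half >= 0 and p + half <= len(pcg)]
    return np.stack(windows) if windows else np.empty((0, 2 * half))

# e.g. windows = extract_windows(pcg, fs, detect_heart_sound_peaks(pcg, fs), window_s=1.0)
```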

Data augmentation strategy

To assess the impact of data augmentation, we evaluated EHST on the PhysioNet/CinC dataset using various augmentation strategies. Table 18 compares the performance without any augmentation, with individual strategies (noise injection, time-stretching, random cropping), and with a combination of all three techniques. The combined strategy results in the highest accuracy (90.3%) and macro F1 (88.9%), demonstrating that the integration of multiple augmentation methods effectively enhances model robustness by simulating realistic variability and reducing overfitting.

Table 18 Effect of data augmentation strategies on PhysioNet/CinC performance.

This analysis confirms that using a combination of augmentation strategies best simulates the diverse acoustic conditions encountered in clinical settings, thereby improving the model’s generalizability.
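
The three augmentation strategies can be expressed as simple waveform transforms; the sketch below gives minimal numpy versions, where the time-stretch is a naive resampling stand-in for a phase-vocoder stretch and all parameter ranges are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, snr_db=20.0):
    """Inject Gaussian noise at a target signal-to-noise ratio in dB."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def time_stretch(x, rate=1.1):
    """Naive time-stretch by linear resampling (rate > 1 shortens, rate < 1 lengthens)."""
    n_out = int(len(x) / rate)
    return np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)

def random_crop(x, crop_len):
    """Randomly crop a fixed-length excerpt from the waveform."""
    start = rng.integers(0, len(x) - crop_len + 1)
    return x[start:start + crop_len]

def augment(x):
    """Combined strategy: noise injection + time-stretch + crop to 90% of the stretched length."""
    x = add_noise(x, snr_db=rng.uniform(15, 30))
    x = time_stretch(x, rate=rng.uniform(0.9, 1.1))
    return random_crop(x, crop_len=int(0.9 * len(x)))
```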

While the combined augmentation strategy significantly improves validation metrics by simulating diverse acoustic conditions, it may also introduce distributional shifts that impact performance on real-world, unseen data. This potential domain shift is a known challenge in PCG and other biomedical signal processing tasks.

To minimize adverse effects, our augmentation parameters were carefully selected to reflect realistic variations encountered in clinical practice. Furthermore, we validated our model on multiple independent datasets with varied noise profiles and demographic distributions to assess robustness beyond the training set.

Nonetheless, there remains a trade-off between increasing training data diversity and preserving fidelity to clinical conditions. Future work will explore adaptive and domain-adversarial augmentation techniques aimed at enhancing model generalizability across heterogeneous clinical environments.

Demographic-based performance

We further analyzed EHST’s performance on the HeartWave dataset across different demographic groups. Table 19 presents the accuracy and macro F1 scores for pediatric (Age < 18) and adult (Age \(\ge\) 18) subgroups. The results show consistent performance between the two groups, with adults achieving slightly higher scores. This consistency indicates that EHST generalizes well across diverse patient populations, which is critical for clinical deployment in varied settings.

Table 19 Performance by demographic subgroups on HeartWave.

These findings demonstrate that the model’s performance remains robust regardless of age, suggesting its effectiveness in both pediatric and adult clinical environments.
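
The subgroup breakdown in Table 19 amounts to stratifying the fold-level predictions by age before computing the metrics; a minimal pandas sketch with illustrative data is shown below.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(1)

# Illustrative per-recording results: age, ground-truth and predicted class indices.
n = 500
df = pd.DataFrame({
    "age": rng.integers(1, 90, size=n),
    "y_true": rng.integers(0, 9, size=n),
})
df["y_pred"] = np.where(rng.random(n) < 0.9, df["y_true"], rng.integers(0, 9, size=n))

df["group"] = np.where(df["age"] < 18, "pediatric (<18)", "adult (>=18)")
for group, sub in df.groupby("group"):
    acc = accuracy_score(sub["y_true"], sub["y_pred"])
    macro_f1 = f1_score(sub["y_true"], sub["y_pred"], average="macro")
    print(f"{group}: accuracy={acc:.3f} macro-F1={macro_f1:.3f}")
```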

Overall, these detailed analyses across segmentation methods, window length sensitivity, data augmentation strategies, demographic-based performance, and computational efficiency provide a comprehensive picture of EHST’s performance and robustness in diverse clinical scenarios. The results demonstrate that EHST not only excels in classification performance but also generalizes well across different data conditions and remains computationally efficient for practical use.

Discussion

Across six datasets, EHST achieved higher accuracy, macro-F1 score, and AUC values than the baseline models, as shown in Table 8. In particular, on the HeartWave dataset, EHST attained higher class-specific performance, detecting both common and rare pathologies and showing improved sensitivity for underrepresented classes, as reflected in elevated micro-F1 scores. While these results indicate notable performance gains, they are specific to the datasets and experimental conditions evaluated.

Ablation studies suggest that the self-attention mechanism and integrated explainability modules contribute meaningfully to classification accuracy and model transparency. The combination of Grad-CAM, attention visualization, and SHAP yielded clinically interpretable outputs, aligning model attention with known pathophysiological features in the evaluated datasets.

In terms of computational efficiency, EHST exhibited training and inference times comparable to strong baseline architectures, including CNN-RNN hybrids and Transformer variants. Optimized model depth and parameter sharing in the multi-head attention layers contributed to a balance between complexity and scalability. For example, training times on the HeartWave dataset were similar to those of the best-performing baseline CNN-RNN model, and inference latency was within real-time constraints under the tested conditions.

The model’s performance remained stable across variations in segmentation method, window length, data augmentation strategy, and demographic subgroup, suggesting potential for broader applicability. However, real-world deployment may involve additional variability in recording devices, patient populations, and environmental noise, which were not exhaustively represented in the current datasets.

Overall, within the scope of the datasets and metrics considered, EHST demonstrated consistent performance advantages over the ten baseline models evaluated, while providing interpretable outputs without marked loss in efficiency. These findings support EHST’s potential for use in scalable, explainable heart sound analysis, pending further validation in prospective and more heterogeneous clinical settings.

Comparison with time growing neural networks (TGNNs)

Time Growing Neural Networks (TGNNs) offer a unique approach to modeling cardiac cycles by incrementally adapting their architecture to capture systolic and diastolic variations25. While TGNNs excel at representing temporal growth patterns, they process sequences sequentially, which can limit the ability to capture global dependencies and complex interactions spanning multiple heartbeats.

In contrast, our proposed EHST utilizes a multi-head self-attention mechanism that simultaneously attends to all time points within a heartbeat segment. This parallel attention allows the model to flexibly learn relationships across both systolic and diastolic phases without explicit segmentation or architectural growth, leading to richer temporal and spectral representations.
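
To make this contrast concrete, the parallel attention described here can be illustrated with a standard multi-head self-attention layer applied across all time frames of a segment; the block below is a minimal PyTorch sketch with assumed dimensions, not the full EHST architecture.

```python
import torch
import torch.nn as nn

class HeartbeatAttentionBlock(nn.Module):
    """Minimal self-attention block over the time frames of one heartbeat segment."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time_frames, d_model); every frame attends to every other frame,
        # so systolic and diastolic regions interact in a single parallel step.
        attended, weights = self.attn(x, x, x)
        return self.norm(x + attended), weights   # weights: (batch, frames, frames)

# Illustrative usage: 8 segments, 40 spectrogram frames each, 64-dimensional embeddings.
frames = torch.randn(8, 40, 64)
block = HeartbeatAttentionBlock()
out, attn_weights = block(frames)
print(out.shape, attn_weights.shape)   # torch.Size([8, 40, 64]) torch.Size([8, 40, 40])
```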

Additionally, EHST integrates explainability tools such as Grad-CAM and attention visualization, providing clinicians with interpretable heatmaps linked to physiologically meaningful heart sound components. This level of transparency is often lacking in TGNN architectures, making our method more suitable for clinical applications where interpretability is critical.

Empirically, EHST demonstrates consistent improvements in classification accuracy and robustness over baseline models, including those based on TGNNs, across multiple datasets, further validating the advantages of attention-based modeling for heart sound diagnosis.

Limitations

Despite the promising performance of EHST, several limitations remain. The model shows difficulty in accurately distinguishing between mild and severe cases, likely due to overlapping acoustic features and the limited number of borderline examples in the training data. While EHST performed well under controlled conditions, its robustness to real-world noise, device variability, and differences in recording environments has yet to be fully established. These factors, along with potential domain shifts when applied to new patient populations, may affect performance in practical deployments. The datasets used in this study do not comprehensively represent all demographic groups, age ranges, or rare cardiac pathologies, underscoring the need for greater diversity in training data to improve generalizability.

Extensive prospective clinical trials and longitudinal studies are required to validate EHST’s effectiveness, reliability, and interpretability in routine clinical workflows. Additionally, integration with multimodal data sources, such as ECG and echocardiography, could further enhance diagnostic coverage. Future work will also focus on refining the model’s interpretability mechanisms and applying advanced data augmentation and domain adaptation strategies to improve noise resilience and adaptability across heterogeneous clinical settings. Ultimately, optimizing EHST for robust, transparent, and scalable deployment remains a key objective before real-world adoption.

Conclusion

In this study, we proposed a Transformer-based framework for multi-class heart sound diagnosis that combines attention mechanisms with Grad-CAM visualizations to balance predictive performance and interpretability. Evaluated across six public datasets and the in-house HeartWave dataset, the model achieved higher accuracy, macro-F1 score, and AUC than baseline models under the given experimental conditions, with gains observed for both common and underrepresented classes. Explainability tools, including Grad-CAM, attention weight visualization, and SHAP analysis, produced outputs consistent with expert-annotated features, supporting potential clinical relevance. The framework also showed stable performance across segmentation methods, window lengths, and demographic subgroups. However, these findings are limited to retrospective datasets, and real-world variability in devices, patient populations, and noise conditions was not fully represented. Prospective clinical validation, broader demographic coverage, and integration with multimodal data such as ECG remain important next steps to confirm generalizability and ensure reliable deployment in diverse clinical environments.