Introduction

Knee osteoarthritis (OA) is a worldwide public health concern. A 2021 Global Burden of Disease (GBD) study estimated that in 2020, 595 million individuals—7.6% of the global population—suffered from OA and predicted an additional 75% increase in knee-OA cases by 20501. Direct medical expenses related to OA currently consume as much as 1% to 2.5% of gross national product in the United States, United Kingdom, Canada, and Australia2. Excessive body weight is one of the most crucial contributing factors to this issue3. The GBD study links approximately 20% of all new OA cases to high body-mass index (BMI)1. Additionally, a study involving more than 15,000 adults found that every 1 kg m−2 rise in BMI increases the likelihood of knee-OA development or radiographic advancement by 5%, while weight loss results in equivalent reductions in risk4. Beyond BMI-related factors, specific sporting activities and competitive athletic participation have also been shown to influence OA risk. Recent systematic reviews indicate that certain sports, particularly at elite competitive levels, significantly elevate the risk of knee OA later in life5,6,7. Hence, detection and monitoring of OA with accessible and reliable diagnostic tools, especially within high-risk populations such as athletes and individuals with increased BMI, are crucial for sports medicine and public health.

Currently, clinical diagnosis of OA relies on compatible symptoms and/or radiographic changes. Symptoms, such as knee pain, are non-specific for knee OA, and traditional radiography, i.e., x-rays, is insensitive for the early disease process8,9. MRI is a more sensitive method for the detection of changes associated with knee OA, but the well-recognized lack of specificity of these findings for clinical symptoms or incident or progressive knee OA draws their clinical relevance into question10. Therefore, the current diagnostic landscape for knee OA is caught between two unsatisfying extremes: techniques that are low-risk and inexpensive (e.g., physical examination, x-rays) but insufficiently sensitive or specific, and MRI, which is highly sensitive but not specific, as well as costly and time-consuming. As a result, there is still no readily available, low-cost, low-risk method that can detect knee OA early and reliably. An inexpensive, non-invasive tool that can be integrated into a routine clinical visit would therefore fill the critical gap and could markedly improve early detection and subsequent management of OA.

Knee acoustic emissions (KAEs), the sounds that are generated by articular movement, have attracted substantial interest as a non-invasive biomarker for the detection and tracking of OA progression. Clinically, the sounds produced from the knees of patients with OA differ from those of healthy patients. Multiple studies have demonstrated that the mechanical properties of joints, which are changed by the degeneration of cartilage in OA, are associated with changes in the acoustic signatures generated during joint motion11,12. These findings highlight the clinical potential for KAEs to act as an objective, repeatable, real-time assessment of joint health without invasive procedures or operator-intensive imaging tools.

Current research on KAEs for the diagnosis of OA relies heavily on common statistical models and naive machine learning approaches. These models are easy to interpret - using hand-crafted features such as the number of KAEs and their amplitude - but are easily confused by artifacts and confounding variables and may overlook subtle patterns that are indicative of early stages of OA13. In contrast, deep learning methods, which are capable of learning complex representations directly from raw acoustic data, offer the potential to significantly enhance diagnostic accuracy and sensitivity. However, data sparsity makes the application of deep learning to KAEs challenging. Data are often collected at single recording sites, one knee at a time, with each new study using dissimilar hardware and collection frameworks. As a result, datasets are often insufficient for training state-of-the-art deep learning algorithms, and existing deep learning models in KAE research are typically trained and tested on small datasets, which limits their generalizability and clinical reliability. To address the shortcomings of both approaches, it is imperative not only to develop algorithms that address data sparsity but also to retain interpretability by incorporating explainable artificial intelligence (XAI) approaches, such that predictions are made on clinically relevant acoustic features and not on irrelevant parts of the recordings. Therefore, establishing robust deep learning models that are interpretable and trained stably remains an important problem toward achieving reliable OA detection and monitoring. Figure 1 contrasts the conventional OA diagnostic workflow (left panels)—where clinicians combine symptoms, physical examination, and imaging to localize joint pathology—with our proposed KAE-based deep learning pipeline (right panels), in which knee sounds are analyzed by a neural network and XAI methods highlight the time-frequency regions that drive the model’s decision, analogous to how image-based XAI would highlight the pixels corresponding to, for example, a dog in a photograph.

Fig. 1: Conceptual overview illustrating the contrast between traditional medical diagnosis and machine learning-based interpretation of knee acoustic emissions.
figure 1

The left panels represent the conventional clinical workflow, where diagnostic reasoning is based on symptoms, imaging, and anatomical cues. The right panels show the proposed AI-based approach, where knee sounds are analyzed by a machine learning model, and the decision process is made transparent via explainable AI techniques. The illustrative analogy (e.g., highlighting a dog in an image) emphasizes how the model focuses on acoustically meaningful regions.

To date, KAE studies have not systematically explored transfer learning to alleviate data sparsity, nor have they used modern XAI techniques to verify that model decisions are based on physiologically meaningful acoustic patterns rather than noise or confounders such as BMI. Motivated by these gaps, this work addresses the following questions: (i) can deep transfer learning applied to time-frequency representations of KAEs improve OA-versus-healthy classification robustness and accuracy under limited data? (ii) can XAI provide clinically meaningful insight into which time-frequency regions drive the model’s predictions? and (iii) how does the model perform in a clinically realistic cohort that includes high-BMI but clinically healthy participants, and to what extent are its predictions biased by BMI? Therefore, the main contributions of this work are as follows.

  • We use a KAE dataset, which is the first to include a clinically meaningful number of healthy yet high-BMI participants. This composition allows us to understand whether the KAEs from high-BMI participants bias our decisions on OA detection.

  • We introduce an end-to-end OA detection pipeline that transforms raw KAEs into time-frequency spectrograms and utilizes a deep neural network for OA classification from KAEs. To our knowledge, this is the first approach to use deep neural networks for this problem. We also demonstrate that our approach outperforms conventional machine learning approaches on the same dataset.

  • To enhance model robustness and stability, we use transfer learning, a method commonly employed to improve performance, robustness, and training stability. Specifically, we adapt certain parts of a model previously trained for another task to our current problem. Additionally, we thoroughly analyze how using different portions of the pretrained model affects performance and stability.

  • We utilize XAI to generate saliency maps both in time and frequency domains to demonstrate that the network concentrates on acoustically and physiologically plausible regions rather than on recording artifacts. This enhances the interpretability and clinical credibility of the model’s decisions.

Altogether, our proposed approach marks a significant advancement by being the first to integrate deep learning with explainable artificial intelligence for knee osteoarthritis detection from acoustic emissions. Furthermore, by evaluating our method on a clinically representative population that closely mirrors real-world scenarios, our study ensures improved robustness, interpretability, and clinical relevance, significantly advancing applicability in practical healthcare settings.

Methods

Related work

Previous approaches to OA classification using KAE data predominantly relied on hand-crafted feature extraction combined with conventional machine learning algorithms. Researchers have explored statistical descriptors such as entropy, signal energy, zero-crossing rate, and spectral centroids, among other time and frequency domain features, to characterize the differences in acoustic emissions between healthy and osteoarthritic joints. These features are then typically input to classifiers, including support vector machines, random forests, Gaussian mixture models, k-nearest neighbors, or shallow multi-layer perceptrons to discriminate between disease states14,15,16,17,18,19,20. While these methods offer some interpretability and are well-suited for small and controlled datasets, their reliance on manual feature engineering often results in sub-optimal and dataset-specific representations that limit generalizability and robustness across heterogeneous populations and recording devices. Moreover, these hand-crafted features may fail to capture subtle, complex patterns in the KAEs that are critical for accurate OA detection, underscoring the need for more data-driven and automated feature-learning approaches.

In recent years, deep neural networks have emerged as transformative tools in biomedical research due to their ability to automatically learn hierarchical and task-relevant representations from raw and minimally processed data. Unlike traditional methods that depend heavily on expert-driven feature engineering, deep learning models, specifically residual networks (ResNets)21, are capable of uncovering intricate structures and latent dynamics inherent in complex biomedical signals, including imaging, electrophysiological, and audio data22,23.

Originally developed to address image classification challenges, ResNets have gained prominence in biomedical research due to their superior capacity to effectively train very deep models and mitigate the vanishing gradient problem encountered in convolutional models. Unlike conventional convolutional neural networks (CNNs), which directly learn hierarchical representations via convolutional operations, ResNets introduce skip connections enabling the network to capture residual mappings of the input data. Formally, convolutional operations within these blocks can be described mathematically as,

$${{\bf{S}}}^{c}(i,j)=({\bf{I}}\ast {{\bf{K}}}^{c})(i,j)+{{\bf{B}}}^{c}(i,j)=\mathop{\sum }\limits_{m=0}^{M-1}\mathop{\sum }\limits_{n=0}^{N-1}{\bf{I}}(i+m,j+n)\,{{\bf{K}}}^{c}(m,n)+{{\bf{B}}}^{c}(i,j),$$
(1)

where \({\bf{I}}\) is the input matrix and \(c\in \{1,\ldots ,C\}\) indexes the different kernels the model learns simultaneously. \({{\bf{K}}}^{c}\) and \({{\bf{B}}}^{c}(i,j)\) are the convolutional kernel of size M × N and the bias term associated with kernel c, respectively. \({{\bf{S}}}^{c}(i,j)\) is the scalar output at spatial position (i, j) associated with kernel c.
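For illustration, the discrete convolution with bias in Eq. (1) corresponds directly to a standard 2D convolutional layer in a deep learning framework. The minimal sketch below uses PyTorch; the input size and number of kernels are arbitrary examples and not the configuration used in this work.

```python
import torch
import torch.nn as nn

# Illustrative only: one convolutional layer implementing Eq. (1) for C = 8 kernels.
# The 1-channel 64x64 input and 3x3 kernels are arbitrary, hypothetical choices.
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, bias=True)

I = torch.randn(1, 1, 64, 64)   # input matrix I (batch, channel, height, width)
S = conv(I)                      # S^c(i, j) = (I * K^c)(i, j) + B^c(i, j) for each kernel c
print(S.shape)                   # torch.Size([1, 8, 62, 62])
```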

In each layer of a ResNet, instead of only learning a direct input-output mapping, the network explicitly learns a residual mapping expressed mathematically as,

$${{\bf{S}}}_{\ell }={f}_{\ell }({{\bf{I}}}_{\ell })+{{\bf{I}}}_{\ell }$$
(2)

where \({{\bf{I}}}_{\ell }\) is the input to block ℓ, and \({f}_{\ell }({{\bf{I}}}_{\ell })\) represents the nonlinear transformation learned by a sequence of convolutional, batch normalization, and activation layers associated with block ℓ. This architecture allows deeper networks to be trained effectively while alleviating the vanishing-gradient problem encountered in plain CNNs.
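As a minimal sketch of Eq. (2), the basic block below follows the standard ResNet design of two convolution–batch-normalization pairs plus an identity skip connection; the channel count is an arbitrary example, and strided/projection shortcuts are omitted for brevity.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Sketch of Eq. (2): S_l = f_l(I_l) + I_l, with f_l = conv -> BN -> ReLU -> conv -> BN."""
    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual mapping f_l(x) plus the identity (skip) connection
        return self.relu(self.f(x) + x)

block = BasicResidualBlock(channels=8)
out = block(torch.randn(1, 8, 32, 32))   # spatial size is preserved
```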

ResNet variants have been widely adopted in biomedical applications, achieving strong results not only in image classification tasks but also in the analysis of biomedical signals24. In the domain of sound signal classification, a common approach involves converting the raw audio or biosignal data into time-frequency representations such as spectrograms or Mel-frequency cepstral coefficients, which can then be fed into pre-trained or custom ResNet architectures for feature extraction and classification25,26. This technique leverages the convolutional structure of ResNet to extract spatial and temporal patterns from the spectrogram, facilitating robust recognition of complex acoustic signatures found in biological data. The proven effectiveness of ResNets in classifying heart and lung sounds highlights their versatility and strong potential for advancing automated OA diagnosis from KAEs.

Despite the remarkable success of deep convolutional neural networks in various biomedical domains, their performance often hinges on the availability of large, diverse labeled datasets—a condition rarely met in specialized biomedical applications such as KAE analysis. Insufficient data can result in overfitting and suboptimal generalization, limiting the practical utility of these models in clinical settings24,27. In other domains, data-efficient deep learning has been explored through active learning, and synthetic data generation to alleviate annotation and data scarcity constraints28,29. To address similar challenges in our setting, transfer learning has emerged as a powerful strategy, wherein knowledge learned from large-scale datasets (often from related domains, such as general audio or image datasets) is leveraged to improve performance on target tasks with limited data30. By fine-tuning pre-trained networks on smaller, domain-specific datasets, transfer learning not only accelerates convergence and enhances stability but also improves feature robustness and reduces the risk of overfitting, even in data-scarce scenarios24,31. This approach has been widely adopted across the literature, enabling more accurate and reproducible results in diverse biomedical tasks ranging from medical image analysis to biosignal classification.

Another critical aspect of advancing biomedical research is to build confidence among clinicians and the broader medical community that deep neural networks are not merely extracting noise, but are genuinely learning and utilizing meaningful semantic patterns from data. This need for trust has driven the demand for greater transparency and interpretability in predictive models. In response, XAI methods (e.g. GradCAM32, GradCAM++33) have emerged as essential tools for providing insights into the decision-making processes of complex neural networks, thereby enabling researchers and clinicians to validate model reliability, mitigate potential biases, and foster greater trust in automated systems34,35. Techniques such as saliency maps, layer-wise relevance propagation, and gradient-based attribution have been widely adopted to identify the most informative regions of input data that contribute to a specific prediction, whether in medical imaging, genomics, or physiological signal analysis36,37. The adoption of XAI in the biomedical community is accelerating, as these methods not only facilitate regulatory compliance and clinical acceptance but also aid in scientific discovery by uncovering novel biomarkers and patterns within complex datasets38. As a result, explainable AI is becoming a cornerstone for ensuring transparency and interpretability in state-of-the-art biomedical machine learning pipelines.

Among the growing suite of explainable AI methods, Full-Gradient (FullGrad) stands out for its comprehensive approach to attribution, as it combines both the neuron gradients with respect to the input and the bias contributions throughout the entire network39. Unlike conventional gradient-based techniques that may overlook important contextual information present in deep residual architectures, FullGrad captures relevant patterns from all layers, thereby producing more robust, interpretable, and fine-grained attribution maps. Formally, for a network defined as,

$${\mathcal{F}}(x)={f}_{L}\circ {f}_{L-1}\circ \cdots \circ {f}_{1}(x),$$
(3)

where \({f}_{\ell }\), ℓ = 1, …, L, denote the individual layers. The FullGrad saliency map (i.e. \({\mathcal{G}}({\bf{I}})\)) for this network can be formulated as,

$${\mathcal{G}}({\bf{I}})=\left|{\bf{I}}\odot \frac{\partial {\mathcal{F}}({\bf{I}})}{\partial {\bf{I}}}\right|+\mathop{\sum }\limits_{\ell =1}^{L}\mathop{\sum }\limits_{c=1}^{{C}_{\ell }}\psi \left(\left|{{\bf{B}}}_{\ell }^{(c)}\odot \frac{\partial {\mathcal{F}}({\bf{I}})}{\partial {{\bf{B}}}_{\ell }^{(c)}}\right|\right)$$
(4)

where I denotes the input spectrogram to the network, and \({\mathcal{F}}({\bf{I}})\) represents the scalar network output. The operator ⊙ indicates element-wise multiplication, while \(\frac{\partial {\mathcal{F}}({\bf{I}})}{\partial {\bf{I}}}\) is the gradient of the output with respect to the input. \({{\bf{B}}}_{\ell }^{(c)}\) refers to the bias term for kernel c in layer ℓ (cf. Eq. (1)), with ℓ = 1, …, L and c = 1, …, \({C}_{\ell }\). \(\frac{\partial {\mathcal{F}}({\bf{I}})}{\partial {{\bf{B}}}_{\ell }^{(c)}}\) is the gradient of the output with respect to this bias term. The function ψ( ⋅ ) denotes an up-sampling operator that projects the bias-level attributions to the spatial resolution of the input; the element-wise absolute values ensure a non-negative saliency map. The first and second terms in Eq. (4) account for direct and indirect contributions within deep neural networks, respectively. Thus, FullGrad provides more reliable visualizations of the spectrotemporal regions in input signals that drive model predictions. Recent studies indicate that these characteristics align particularly well with ResNet architectures due to FullGrad’s reduced sensitivity to the specificity of individual input features and its comprehensive capability to identify all relevant attributes involved in the decision-making process40. As a result, explanations derived from FullGrad exhibit robustness against minor perturbations, including subtle adversarial noise, while effectively preserving structural features that strongly influence the model’s output. This enhanced interpretability makes FullGrad particularly valuable for interpreting complex deep models like ResNets, an essential requirement in application domains that demand trustworthy explanations and actionable insights, such as biomedical signal analysis.
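To make the first (input-gradient) term of Eq. (4) concrete, the hedged sketch below shows how that term could be computed for a model with a scalar (sigmoid) output. The bias-gradient terms additionally require registering hooks on every layer that carries a bias, as done in open-source FullGrad implementations, and are omitted here for brevity.

```python
import torch

def input_gradient_saliency(model: torch.nn.Module, spectrogram: torch.Tensor) -> torch.Tensor:
    """Sketch of |I ⊙ dF(I)/dI|, the direct-contribution term of Eq. (4).

    `spectrogram` is a (1, 1, freq, time) tensor; `model` is assumed to end in a
    single scalar output. This is NOT the full FullGrad attribution.
    """
    model.eval()
    x = spectrogram.clone().requires_grad_(True)   # I
    output = model(x)                              # F(I)
    output.sum().backward()                        # dF(I)/dI
    saliency = (x * x.grad).abs().squeeze()        # element-wise |I ⊙ dF/dI|
    return saliency.detach()
```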

Given the established correlation between KAEs and osteoarthritis, as well as the limitations of traditional hand-crafted feature-based approaches, there is a compelling need for more advanced, data-driven frameworks that can robustly generalize across clinical and real-world settings. Deep transfer learning with architectures such as ResNet addresses the data scarcity issue and enables powerful hierarchical representations of complex acoustic signals like KAEs. However, to ensure clinical adoption and trustworthiness, integrating explainable AI methods—specifically, FullGrad—provides the crucial transparency needed for model predictions by revealing the acoustically relevant regions influencing diagnostic outcomes. In contrast to prior data-efficient deep learning approaches that focus on video surveillance or image-based disease detection and do not target KAE-based OA diagnosis or detailed spectrotemporal explanations28,29, our work adapts transfer learning-based ResNet models directly to KAE spectrograms and couples them with FullGrad attribution. By uniting these recent advances in deep learning and model interpretability, we present a deep learning framework for more accurate, interpretable, and scalable OA classification from KAE data.

Data collection

This study used a wearable acoustic sensing device design previously developed by Teague et al.41. The system consists of a portable, battery-powered, low-cost unit containing a custom-designed printed circuit board (PCB), a microcontroller, and four piezoelectric contact microphones (BU-23173-000, Knowles, IL, USA). These microphones were chosen because of their validated robustness under clinical noise42, high bandwidth, and scalable size. Microphones were placed medially and laterally to the superior and inferior edges of the patella, following prior literature. Positioning was performed by clinicians to ensure consistent anatomical placement across participants. The contact microphone configuration and data acquisition procedure were identical to those of studies such as Nichols et al.20 and Richardson et al.43, ensuring comparability with previous work.

Each subject was asked to perform a set of scripted maneuvers, including flexion-extension (FE), sit-to-stand (STS), and a short walking trial. These trials were designed following the general protocol described by Nichols et al.20, and performed at a standardized pace of approximately 0.25 Hz using a visual timer. All of the recordings were collected in clinical examination rooms or lab environments, which are not isolated from noise, thus capturing real-world conditions that enhance the practical relevance and usability of our approach in typical clinical settings.

Although we collected all three types of maneuvers, we only included the unloaded FE trials in our analysis. This choice was made to minimize the influence of external loading and soft tissue-related acoustic variability, which are more pronounced in STS and walking tasks. By focusing on unloaded FE, we ensured more consistent knee-generated acoustic signatures under controlled conditions. During each session, data from all four microphones were collected simultaneously.

The dataset includes 86 knees from 52 participants (Table 1). Clinical experts labeled 49 knees as healthy and 37 as having osteoarthritis (OA). Knees labeled as OA were classified based on a Kellgren-Lawrence grade of 2 or higher.

Table 1 Cohort-level demographics

An important point to highlight is that a considerable portion of the dataset includes knees from individuals classified as obese (BMI > 30), totaling 31 participants. This makes our work one of the few studies that focus on high-BMI populations in the context of KAEs, alongside studies like44,45. Most previous works either did not include or discarded data from obese participants46. This exclusion stems from the way body mass increases knee loading and alters how sound propagates through soft tissue: these effects change both the propagation and consistency of the acoustic signals, reduce signal clarity, and make high-BMI cases considerably harder to interpret with traditional methods. Nevalainen et al.19 observed that BMI significantly affected both the predictive performance of AE-based models and the robustness of signal acquisition in obese individuals.

Including these high-BMI subjects, spread across the three cohorts, is crucial because obesity is highly prevalent and strongly linked to OA incidence and progression, and such patients are common in clinical settings. We therefore view this multi-cohort composition as a key strength of our study and a core part of our contribution, as it enables us to evaluate whether the proposed methods remain effective in a more challenging and realistic scenario.

Signal preprocessing

KAE signals are collected with a sampling frequency of 46.875 kHz. A band-pass filter between 250 Hz and 4.5 kHz is applied to remove low-frequency movement artifacts and focus on the frequency ranges where the knee acoustic signatures are present.
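As an illustrative sketch (not the authors' exact implementation), the band-pass step can be realized with a zero-phase Butterworth filter; the filter order below is an assumption, since only the 250 Hz–4.5 kHz pass-band is specified.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_kae(signal: np.ndarray, fs: float, low: float = 250.0,
                 high: float = 4500.0, order: int = 4) -> np.ndarray:
    """Zero-phase band-pass filter for a raw KAE recording.

    `fs` is the sampling rate of the recording; the 4th-order Butterworth
    design is our assumption, as only the pass-band edges are reported.
    """
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)   # filtfilt avoids phase distortion of transient bursts
```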

After filtering, we removed the initial inactive segments to avoid including non-informative signal regions caused by sensor settling or brace adjustment. The onset of voluntary movement was manually annotated using the goniometer signal, based on visual inspection of the knee angle trajectory. Specifically, the start of the first clear FE cycle was labeled by identifying the earliest point at which consistent angular displacement began.

This labeling was performed by an expert observer using synchronized plots of the goniometer and acoustic signals. The approach ensured that only segments corresponding to deliberate movement were retained, improving the temporal alignment between knee motion and associated acoustic emissions.

Following the removal of inactive segments, the signal was converted into a time-frequency representation using Short Time Fourier Transforms (STFT), allowing the classifier to extract temporally localized spectral features relevant to knee acoustic emissions.

We have used four different window and hop-length configurations in our experiments, denoted as N128-H16, N128-H32, N64-H16, and N32-H16. Here, the notation “N#-H#" indicates the STFT parameters, with “N" referring to the window length and “H" representing the hop size, both expressed in samples. Given our sampling frequency of 16 kHz, these configurations correspond to window lengths of 8 ms (128 samples), 4 ms (64 samples), and 2 ms (32 samples), and to hop increments of 2 ms (32 samples) and 1 ms (16 samples). By analyzing spectrograms generated using these varied temporal and spectral resolutions, we aim to investigate how different time-frequency granularities affect the effectiveness of extracting distinctive acoustic-emission patterns from knee sounds relevant for OA assessment.

These choices, in general, align with the nature of the knee sounds themselves, which do not exhibit precise tonal frequencies but instead appear as broadband or band-limited bursts. By favoring high temporal resolution, our approach is more consistent with both the phenomenology of knee acoustic emissions and the analysis strategies adopted in prior OA studies.

After computing the STFT as

$${\bf{X}}(m,{\omega }_{k})=\mathop{\sum }\limits_{n=-\infty }^{\infty }x[n]\cdot w[n-m]\cdot {e}^{-j{\omega }_{k}n},$$
(5)

where x[n] is the KAE signal, w[n] is a window function (Hanning window in our case), m denotes the center time index of the window, and ωk is the k-th frequency bin, we retained only the real-valued part (i.e. \(\Re \left\{{\bf{X}}(m,{\omega }_{k})\right\}\)) of the resulting spectrograms to reduce the complexity while preserving temporal-spectral structure. To compress the dynamic range of the signal and reduce the influence of high-magnitude outliers, the spectrograms that will be provided to the deep learning models were log-scaled as,

$${{\bf{X}}}_{\log }(m,k)=\log \left(1+| \Re \left\{{\bf{X}}(m,{\omega }_{k})\right\}| \right).$$
(6)

To emphasize periods of knee acoustic emission and suppress regions dominated by noise or silence, we introduced an envelope-based weighting scheme. Specifically, for each KAE recording, we first computed the absolute value (rectified) signal and identified its local amplitude peaks. An envelope function was then derived by performing cubic spline interpolation between these peaks, resulting in a smoothly varying envelope signal xe[n]. Subsequently, this continuous envelope was segmented into overlapping frames consistent with the STFT spectrogram framing parameters and averaged within each frame, yielding a frame-based envelope vector e[m] as:

$${x}_{e}^{{\prime} }[m]=\frac{1}{N}\mathop{\sum }\limits_{n=mH}^{mH+N-1}{x}_{e}[n],$$
(7)

where N and H represent the frame (window) and hop lengths (in samples), respectively.

Next, the frame-based envelope values were normalized to zero mean and unit standard deviation and rectified at zero to ensure non-negativity:

$${e}_{+}[m]=\max \left(0,\,\frac{{x}_{e}^{{\prime} }[m]-{\mu }_{e}}{{\sigma }_{e}}\right),$$
(8)

with μe and σe denoting the mean and standard deviation of \({x}_{e}^{{\prime} }[m]\), respectively.

Finally, to enhance temporal regions associated with meaningful knee acoustic events, we weighted each time frame of the log-spectrogram \({{\bf{X}}}_{\log }(m,k)\) by the corresponding normalized envelope frame value e+[m], yielding the final weighted spectrogram representation \({\widetilde{{\bf{X}}}}_{\log }(m,k)\),

$${\widetilde{{\bf{X}}}}_{\log }(m,k)={{\bf{X}}}_{\log }(m,k)\cdot {e}_{+}[m],\,\forall k.$$
(9)

This envelope weighting step emphasizes spectro-temporal regions that correspond to genuine acoustic emission events. The resulting weighted spectrograms, \({\widetilde{{\bf{X}}}}_{\log }(m,k)\), were subsequently provided as input representations to the deep neural network.
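The envelope-weighting scheme of Eqs. (7)–(9) can be sketched as follows; the peak-detection settings and boundary handling are our assumptions, since the text specifies only peak-interpolated cubic splines, frame averaging matched to the STFT framing, z-scoring, and rectification.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.interpolate import CubicSpline

def envelope_weight(x: np.ndarray, X_log: np.ndarray, n_win: int, hop: int) -> np.ndarray:
    """Weight a log-spectrogram X_log (freq x frames) by a frame-averaged amplitude envelope."""
    rect = np.abs(x)                                        # rectified signal
    peaks, _ = find_peaks(rect)                             # local amplitude peaks
    env = CubicSpline(peaks, rect[peaks])(np.arange(len(x)))  # smooth envelope x_e[n]

    n_frames = X_log.shape[1]
    e = np.array([env[m * hop: m * hop + n_win].mean()      # Eq. (7): frame averaging
                  for m in range(n_frames)])
    e_plus = np.maximum(0.0, (e - e.mean()) / e.std())      # Eq. (8): z-score, then rectify
    return X_log * e_plus[np.newaxis, :]                    # Eq. (9): weight every frequency bin
```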

Model architecture, training, and interpretability with explainable AI

Given the demonstrated effectiveness of CNNs across a wide range of domains—including, but not limited to, biomedical applications—we adopted a CNN-based architecture as the baseline model in our study. As our main classification model among these CNNs, we used ResNet-1821. It strikes a good balance between depth and efficiency, which makes it practical for many use cases, particularly biomedical and audio classification applications47,48,49.

The original ResNet-18 model expects a 3-channel input. We adapted the model for our single-channel spectrograms by decreasing the number of input channels to 1. Additionally, the original ResNet-18 ends with a fully connected layer designed for 1000-class classification implemented as a 1000-dimensional softmax output. Since our task is binary classification (OA vs. Healthy), we replaced this layer with a new classification head consisting of a global average pooling layer followed by a single fully connected neuron with sigmoid activation. This configuration enables the network to aggregate spatially distributed feature responses into a scalar probability, making it suitable for binary decision-making.
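The architectural changes described above can be sketched in PyTorch as follows; whether the first convolution is re-initialized from scratch (as here) or derived from the pretrained RGB weights is an assumption not specified in the text.

```python
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained ResNet-18.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Single-channel spectrogram input instead of 3-channel RGB
# (re-initialized here; an alternative is averaging the pretrained RGB filters).
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the 1000-way softmax classifier with one sigmoid neuron.
# ResNet-18 already applies global average pooling before `fc`.
model.fc = nn.Sequential(nn.Linear(model.fc.in_features, 1), nn.Sigmoid())
```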

Since the number of samples in our dataset is limited, we employed a transfer learning strategy by reusing a ResNet-18 model whose weights were pre-trained on the ImageNet dataset50,21. Rather than updating all of the parameters, we gradually unfreeze selected parts of the model and train only those for our problem. This approach reduces variance by decreasing the number of trainable parameters while preserving the general, low-level feature extractors learned from large-scale image data (ImageNet). Given our dataset size, this strategy offers a favorable trade-off between model capacity and generalization. Figure 2 illustrates the overview of our methodology. Table 2 details the proportion of trainable parameters for different numbers of unfrozen layers, while Table 3 specifies which layers are incrementally unfrozen during training.
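A hedged sketch of the gradual-unfreezing idea is given below; the stage grouping and the choice of k are illustrative, whereas Table 3 defines the exact unfreezing schedule used in the experiments.

```python
import torch.nn as nn
from torchvision import models

def set_trainable(model: nn.Module, k_unfrozen: int) -> None:
    """Freeze every parameter, then unfreeze the last `k_unfrozen` stages.

    The stage grouping below (layer1..layer4 plus the classification head) is an
    illustrative simplification of the schedule in Table 3.
    """
    for p in model.parameters():
        p.requires_grad = False
    stages = [model.layer1, model.layer2, model.layer3, model.layer4, model.fc]
    for stage in stages[len(stages) - k_unfrozen:]:
        for p in stage.parameters():
            p.requires_grad = True

# Example: keep ImageNet features in the early layers, fine-tune the deeper stages.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, 1)
set_trainable(resnet, k_unfrozen=3)   # trains layer3, layer4, and the new head
```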

Fig. 2: Overview of the proposed classification and explainability pipeline for knee acoustic emission (KAE) analysis.
figure 2

The sound signal is first band-pass filtered and transformed into a spectrogram using STFT. This spectrogram is then fed into a partially fine-tuned ResNet-18, where only the final residual blocks are updated. During inference, gradient-based attribution (FullGrad) is applied to produce frequency-time saliency maps that highlight the regions contributing most to the model’s decision, enhancing explainability.

Table 2 Trainable and non-trainable parameters and percentage of trainable parameters for different # of unfrozen layers
Table 3 ResNet-18 — When does a layer become trainable?

The proposed deep transfer learning models were trained using the binary cross-entropy loss and optimized with the Adam optimization algorithm. We explored several hyperparameter configurations to identify the optimal training setup. Initial learning rates for training ranged from 1 × 10−5 to 1 × 10−3. A learning rate scheduler ("ReduceLROnPlateau") was applied to further ensure effective convergence; specifically, the learning rate was reduced by a factor of 0.5 whenever the validation F1-score failed to improve over 20 consecutive epochs. Training runs used varying numbers of epochs, ranging from 150 to 600, determined by the initial learning rate and the number of trainable network layers. Due to class imbalance in the dataset (37 OA versus 49 healthy knees), class weighting was integrated directly into the binary cross-entropy loss function to ensure unbiased model decision-making.
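A minimal configuration sketch is shown below. As assumptions: the sigmoid is folded into BCEWithLogitsLoss for numerical stability, and the class weighting is expressed via its pos_weight argument (computed from the reported 49 healthy vs. 37 OA knees); the authors' exact weighting implementation may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(num_classes=1)                      # stand-in for the adapted ResNet-18
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([49 / 37]))  # up-weight the minority (OA) class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-6)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                       factor=0.5, patience=20)
# After each validation epoch:
# scheduler.step(val_f1)   # halve the learning rate if validation F1 stalls for 20 epochs
```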

For robust and reliable performance evaluation, experiments were repeated across multiple random seeds for both data shuffling (seeds 0, 5, 10, 16) and model initialization (seeds 100, 200). For each combination of spectrogram and model configuration, all training runs were conducted across the full cross-product of these seed values.

To ensure robust and unbiased assessment of model performance, we used a subject-wise hold-out validation strategy. Each individual may have multiple visits and multiple recordings for both knees; therefore, all data from a given subject were assigned exclusively to either the training, validation, or test set to prevent data leakage. The dataset was partitioned into 60% training, 20% validation, and 20% test subsets.
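One way to realize such a subject-wise 60/20/20 split with scikit-learn is sketched below; `X`, `y`, and `subject_ids` are hypothetical arrays of spectrograms, labels, and per-recording participant identifiers, and the authors' exact splitting code is not specified.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def subject_wise_split(X: np.ndarray, y: np.ndarray, subject_ids, seed: int = 0):
    """Return train/validation/test indices such that no subject spans two partitions."""
    groups = np.asarray(subject_ids)
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_val_idx, test_idx = next(outer.split(X, y, groups=groups))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)  # 0.25 * 0.8 = 0.2
    tr, va = next(inner.split(X[train_val_idx], y[train_val_idx],
                              groups=groups[train_val_idx]))
    return train_val_idx[tr], train_val_idx[va], test_idx
```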

For benchmarking, we compared the proposed deep transfer learning framework against two widely-used traditional machine learning models: Random Forest and a simple z-score based classifier, which predicts sample class membership by comparing the z-scored mean and standard deviation of each class in the training set. This comparative analysis enables a direct evaluation of the performance improvements offered by our approach in relation to established baseline methods.

As discussed in Related Work Section, the FullGrad algorithm has demonstrated superior performance and reliability compared to other state-of-the-art XAI approaches. Therefore, we adopted FullGrad to interpret our deep transfer learning model and to provide clear insights into the decision-making process. Specifically, we applied FullGrad to generate saliency maps highlighting the most influential time-frequency regions of the knee acoustic emissions that guided the model’s osteoarthritis classification decisions. These visual explanations enabled us to verify that the model’s predictions were based solely on physiologically meaningful features rather than artifacts or background noise, thus enhancing the trustworthiness and transparency of the proposed deep learning framework in clinical and practical applications.

Results

Classification performance

The classification performance of the proposed ResNet-based model with transfer learning (ResNet-TL) was benchmarked against three alternative approaches: a ResNet18 model trained from scratch, the Z-Score classifier, and a Random Forest classifier. Figure 3 summarizes the distribution of the obtained test accuracies for these models across multiple independent runs, involving different random data partitions and model initializations. Employing transfer learning clearly improved the classification accuracy and robustness: the best-performing ResNet-TL configuration (spectrogram parameters: FFT size = 128, hop length = 16; learning rate = 10−5; weight decay = 5 × 10−6; five unfrozen convolutional layer blocks) achieved a mean accuracy of 88.9% (σ = 3.9%), surpassing the corresponding best-performing ResNet18 baseline trained from scratch (87.5%, σ = 2.4%), as well as the Z-Score (83.7%, σ = 2.2%) and Random Forest (77.6%, σ = 6.4%) benchmark classifiers. The Random Forest classifier exhibited the lowest accuracy and greatest variance, highlighting its sensitivity to different experimental and data conditions.

Fig. 3: Comparison of the test accuracy distributions obtained with the four evaluated classifiers.
figure 3

The mean (μ) and standard deviation (σ) of the accuracies are reported above every box. ResNet with transfer learning (ResNet-TL) delivers the best performance (μ = 88.9%), followed by a ResNet trained from scratch (μ = 87.5%), the Z-Score baseline (μ = 83.7%), and the Random-Forest baseline (μ = 77.6%), which also shows the largest variance.

The detailed accuracy means and standard deviations across all explored hyperparameter configurations, including learning rates, weight decays, spectrogram parameters, and the number of unfrozen convolutional layer blocks, are provided in Table 4. Additionally, the effect of envelope weighting is analyzed in Table 5.

Table 4 Accuracy means and standard deviations grouped by Learning Rate and Weight Decay, spectrogram parameters, and number of unfrozen layer blocks (indexed from 0 to 6)
Table 5 Accuracy means and standard deviations without applying envelope weighting, grouped by learning rate, weight decay, spectrogram parameters, and the number of unfrozen layer blocks (indexed from 0 to 6)

Due to our adopted experimental design, where subject-wise splits are defined through multiple random data seeds and each subject has variable numbers of visits and recordings (left and right knees), slight differences in the number of samples in each test set were unavoidable. To illustrate per-class classification capability clearly, we present one representative result from a high-performing model configuration in Table 6. This particular evaluation involved 19 test samples (11 healthy and 8 osteoarthritic knees), and the model successfully classified all healthy knees correctly (accuracy 100%) while correctly classifying 87.5% of osteoarthritic knees. Overall, for this specific evaluation, the classification accuracy was 90%, indicating strong discriminative performance on both classes and highlighting the clinical potential of the proposed ResNet-TL framework.

Table 6 Per-class classification performance

To further evaluate the robustness of the proposed method to body habitus, we stratified the test knees into two subgroups based on participant BMI: low BMI (BMI ≤30) and high BMI (BMI > 30). Table 7 reports the mean classification accuracies (and standard deviations) for all evaluated models within these two subgroups. The ResNet-TL model maintained high performance in both groups, with only a modest decrease in accuracy for high-BMI subjects (91.7% ± 5.5 vs. 85.8% ± 2.9). The ResNet18 model trained from scratch achieved the same accuracy as ResNet-TL in the low-BMI group (91.7% ± 5.5), but its performance dropped more markedly and showed higher variance in the high-BMI subgroup (80.8% ± 8.3). In contrast, the Z-Score baseline showed a substantial degradation in accuracy in the high-BMI subgroup (96.2% ± 3.8 vs. 67.8% ± 6.2), while the Random Forest classifier remained consistently lower-performing in both groups. These results suggest that the proposed ResNet-TL framework is comparatively robust to the confounding effects of higher BMI, which is particularly relevant for real-world clinical populations.

Table 7 Model accuracy (%) for low vs high BMI subjects

FullGrad results

To interpret the frequency regions guiding the decisions of our proposed ResNet-based transfer learning model, we analyzed the model’s attention distributions generated by the FullGrad method. Figure 4 illustrates the averaged frequency activation profiles for healthy and osteoarthritic (OA) knee recordings across multiple model seeds. These quantitative FullGrad results were obtained using spectrogram configurations trained with the same optimal learning rate (10−5) and weight decay (5 × 10−6) as identified previously in Classification Performance Subsection, covering all related spectrogram parameterizations. Across all tested spectrogram configurations, we observed a consistent pattern indicating clear separation between the Healthy and OA classes, with osteoarthritic knees consistently showing stronger mean activation values compared with healthy knees. The activation disparity was particularly evident at lower frequencies (up to approximately 2000 Hz), providing evidence that the low-frequency acoustic features predominantly drive the model’s classification decisions. Additionally, the shaded areas, indicating standard deviations, remained relatively small, demonstrating robust consistency in the activation patterns across multiple experimental runs.

Fig. 4: Aggregated frequency activation profiles for Healthy and OA classes obtained using the FullGrad on the ResNet18 model.
figure 4

The solid lines represent the mean activation values across multiple runs, while the shaded areas indicate one standard deviation around the means. Notably, the difference in activation between Healthy and OA classes is more pronounced at lower frequencies, indicating stronger class-specific distinctions in this range.

Qualitative examples of the FullGrad visualization approach are shown in Fig. 5. These qualitative heat maps illustrate the specific frequency-time regions of the spectrograms that contributed most significantly to the neural network predictions.

Fig. 5: Qualitative visualization of the FullGrad CAM heatmaps overlaid on original spectrograms for samples from three different collections.
figure 5

The left column shows the original spectrograms, the middle column displays the corresponding CAM heatmaps indicating regions of high model activation, and the right column presents the combined overlay of the heatmaps on the original spectrograms. The CAM amplitude colorbar on the right quantifies the intensity of activations, with brighter areas representing stronger model focus. This visualization highlights the frequency-time regions that the ResNet18 model utilizes for classification across diverse sample collections.

Discussion

The primary quantitative findings from our study demonstrate that the proposed ResNet model, leveraging transfer learning from ImageNet (ResNet-TL), achieved approximately 89% mean classification accuracy with consistently low variance (4%). This approach clearly outperformed traditional feature-based machine learning baselines and showed robust generalization across 2 random seeds, 4 spectrogram configurations, and 4 subject splits. To our knowledge, this is the first deep-learning-based method to attain high and reliable performance on a challenging multi-cohort KAE dataset, particularly one with a considerable number of obese individuals. By combining deep transfer learning, optimized spectrogram parameters, a novel signal-envelope weighting scheme, and modern explainable AI techniques, this work addresses both data sparsity and interpretability in KAE-based OA classification. These results represent a significant advance in the context of KAE-based OA diagnostics, highlighting deep learning’s potential for capturing subtle acoustic patterns linked specifically to OA pathology, even when there are acoustic signals associated with higher BMI.

Our detailed performance analysis for different hyperparameter combinations is provided in Table 4. These results highlight the influence of the four spectrogram configurations—N128-H16, N128-H32, N32-H16, and N64-H16—and how their behaviors differed with changing learning rates. With a relatively large learning rate of 10−3, all four settings performed best when only a small part of the model was trained, with accuracies ranging from 0.82 to 0.83. However, as more layers became trainable, the N128-H32 configuration showed a sharp decrease in performance, possibly due to its coarser temporal resolution, suggesting more susceptibility to overfitting when trained aggressively. At the intermediate learning rate of 10−4, performance improved significantly for all four configurations; notably, N64-H16 achieved the highest accuracy (0.88) with 2 layers unfrozen, while N128-H32 and N32-H16 both obtained their optimal results (0.85) in the fully trainable setting. N128-H16’s performance peaked at 0.86 when 1 layer was unfrozen, though it was lower (0.79) when all layers were fine-tuned. Finally, at the smallest learning rate of 10−5, the N128-H16 configuration achieved the overall highest mean accuracy (0.89) with five unfrozen residual blocks, closely followed by N128-H32 (0.87), N32-H16 (0.86), and N64-H16 (0.85). These results collectively indicate that spectrogram parameter selection interacts significantly with both learning rate and the number of unfrozen layers. Therefore, carefully tuning the resolution together with the learning rate and the number of unfrozen layers is crucial for achieving optimal and robust diagnostic accuracy.

Apart from selecting the right spectrogram configuration, number of unfrozen layers, and learning rate, our envelope weighting approach significantly improved classification accuracy by selectively emphasizing time frames corresponding to genuine acoustic emission events. The ablation results in Table 5 demonstrate that without envelope weighting, accuracies decreased considerably—often by approximately 10–20%—across all spectrogram settings and hyperparameter combinations. By directly enhancing acoustic patterns related to joint movement, our envelope weighting helped the deep learning model effectively distinguish OA-related segments from irrelevant or noisy regions, offering a clear advantage compared to traditional handcrafted-feature methods.

Interpretability analysis using FullGrad saliency maps provided important evidence that our deep-learning models consistently rely on physiologically meaningful acoustic regions. As shown in Fig. 4, FullGrad activations consistently focused on low-frequency regions (below approximately 2 kHz), which encompass most of the signal’s energy as shown in Fig. 6 and have previously been proposed as clinically relevant for detecting cartilage-related changes51. Although activation values decreased for higher-frequency bands in all spectrogram configurations, the coarser frequency resolutions associated with smaller FFT window lengths (e.g., N32, N64) led to less pronounced activation drops. Additionally, low standard deviation around activations indicates stability and consistency in the model attributions. In this first study, we therefore concentrated on frequency-domain aggregation of FullGrad, which can be computed consistently for all subjects independent of the availability of synchronized goniometer data. Such interpretability not only provides confidence that the network’s decisions are rooted in biologically meaningful cartilage acoustic emissions rather than sensor artifacts or noise but also highlights the potential of frequency-domain explanations for clinical translation. In this sense, the explicit validation of the model’s decision process via explainable AI is a key strength of the present work.

Fig. 6: Power spectral density (PSD) comparison between healthy and osteoarthritis (OA) knee joint sounds.
figure 6

Each subject’s PSD was normalized to their maximum amplitude (0 dB) and averaged within groups. Solid lines represent group means and shaded regions indicate ± 1 standard deviation.

Our multi-cohort study included a substantial proportion of participants with high BMI (31 out of 52 subjects), allowing us to test whether acoustic methods can reliably detect knee osteoarthritis even when body weight affects sound transmission. Consistent with this goal, the BMI-stratified analysis in Table 7 showed that the ResNet-TL model preserved high accuracy in the high-BMI subgroup, with only a modest reduction compared to the low-BMI subgroup and a smaller performance drop and variance than the other approaches. This aspect is particularly important since heavier patients are frequently encountered in clinical practice yet often excluded from knee acoustics research. Taken together with the use of a clinically realistic cohort and systematic comparison against conventional machine learning baselines, these findings demonstrate that end-to-end deep learning with KAE can perform well under realistic clinical conditions rather than only in highly curated experimental settings.

The limitations of our study include a modest number of participants and testing only one type of knee movement (unloaded flexion-extension). Although we mitigated sampling variability and overfitting by using multiple subject-wise data splits, random seeds, and hyperparameter configurations, these strategies cannot fully substitute for a larger and more diverse sample, and no formal a priori power analysis was performed given the exploratory nature of this work and the lack of established effect-size estimates for KAE-based deep learning. Other maneuvers such as sit-to-stand and level walking were available only for a subset of cohorts and often contained problematic or noisy recordings; including them would have substantially reduced the amount of usable data and compromised comparability across subjects. We therefore restricted our analysis to unloaded flexion-extension as the most consistently recorded and standardized task in the current multi-cohort dataset. Additionally, while our frequency analysis with FullGrad clearly showed which frequency patterns were important to the models, our time-based analysis was less conclusive. This is because only 22 out of 52 participants had synchronized goniometer recordings, so performing a fully cycle-aligned saliency analysis would have required discarding more than half of the dataset; for the same reason, we did not use the goniometer signal as an explicit input or supervision signal. Furthermore, deep neural networks tend, by nature, to focus on the most distinctive parts of their inputs. In our case, the models do not always focus on all of the FE cycles, but rather concentrate on a subset of cycles or short time windows that are particularly indicative. As a result, aligning saliency maps to flexion phases and averaging them across subjects can yield patchy or seemingly inconsistent temporal patterns, even when the underlying cues are physiologically meaningful. Although our qualitative saliency maps (Fig. 5) visually indicated correspondence with regions containing acoustic emissions, we could not quantitatively measure the similarity between saliency maps and their corresponding spectrograms or specific movement phases. Finally, data collection was conducted at a single center using a specific hardware setup, which may limit generalizability across different devices and recording environments.

Taken together, our findings point toward a practical pathway for developing KAE analytics that can be used as part of clinical care. A compact version of the proposed model could be integrated into a point-of-care device for use in clinical environments or even potentially at home. Point-of-care KAE assessment could facilitate earlier diagnosis while avoiding radiation exposure and more costly diagnostic technologies, and could potentially inform risk of progression or guide clinical care in the future. The proposed computational pipeline (band-pass filtering, STFT generation, and a single forward pass through a ResNet-18 network) is compatible with real-time or near-real-time execution on modest hardware and could be implemented on a low-cost wearable system using contact microphones and a microcontroller. Because the method is non-invasive and does not involve ionizing radiation, it could be used repeatedly and would complement, rather than replace, standard imaging modalities such as radiographs or MRI. To reach these goals, several research steps remain. First, larger longitudinal datasets that capture a wider range of daily activities (e.g., sit-to-stand, level walking) are needed to test temporal stability and to extend our analysis beyond unloaded flexion-extension. Second, multi-center data collection across different devices and environments will be essential to establish robustness and generalizability. Moreover, explainability methods must evolve toward fine-grained temporal attribution, perhaps by collecting synchronized goniometer data for all participants, synchronizing saliency analysis with goniometer data, and training on individual flexion-extension cycles. In addition, future work should systematically compare the proposed envelope weighting scheme with alternative preprocessing strategies, such as wavelet-based denoising or other time-frequency enhancement methods, to quantify their impact on both classification performance and interpretability while ensuring that subtle OA-related acoustic events are preserved. Ultimately, integrating KAE-based deep learning into multimodal clinical decision-support systems, alongside imaging and clinical data, could enable earlier and more accessible OA assessment and support at-home monitoring of knee health.