Introduction

Knee osteoarthritis (KOA) is a globally prevalent degenerative joint disorder that progressively deteriorates the articular cartilage and associated tissues, ultimately leading to functional disability and a diminished quality of life. The conventional diagnostic approach relies on the Kellgren–Lawrence (K-L) grading system based on X-ray imaging, which, despite its widespread clinical adoption, remains inherently subjective and often lacks the sensitivity to detect subtle anatomical changes during the early phases of the disease. This limitation highlights the growing demand for objective, reproducible, and sensitive diagnostic modalities that can facilitate early intervention and accurate assessment of disease progression. Deep learning (DL) techniques, particularly those applied to medical imaging, have emerged as promising solutions to this diagnostic gap in musculoskeletal applications. Magnetic Resonance Imaging (MRI), with its superior soft tissue contrast, has become instrumental in characterising in vivo joint anatomy and pathological alterations. MRI-derived 3D reconstructions, for instance, offer the potential to assess structural features such as cartilage strain—an important biomarker of joint degeneration—by measuring variations in tissue thickness under load. However, the creation of such 3D models is typically reliant on labour-intensive manual segmentation, making large-scale and consistent analysis challenging1. To address this, convolutional neural networks (CNNs), particularly U-Net architectures and their variants, have been widely employed for automating segmentation tasks in knee MRI data. Despite these advances, accurate delineation of intricate knee structures such as cartilage, meniscus, and ligaments remains a complex undertaking. Contributing factors include low inter-tissue contrast, imaging artefacts, and the intricate morphology of smaller anatomical features2. To enhance segmentation fidelity, recent methods have integrated anatomical priors and shape modelling. One such approach incorporated a prior-based 3D U-Net augmented with an Average Shape Model (ASM) and subpixel guidance, which improved segmentation performance for femoral and tibial cartilage. The anterior cruciate ligament (ACL), a structure commonly affected in athletic injuries, poses particular challenges due to its thin, elongated form and varying MRI appearance. An innovative semi-supervised framework, DCLU-Net, addressed this by combining pseudo-labelling and radiomic feature extraction for concurrent ACL segmentation and classification, significantly reducing annotation burden while maintaining diagnostic accuracy3.

Efforts to generalise automated segmentation extend beyond the knee. For instance, in diagnosing temporomandibular joint (TMJ) disorders, a 3D U-Net-based pipeline was developed to segment mandibular structures in cone-beam CT (CBCT) images. This approach not only reduced inter-observer variability but also accelerated the diagnostic workflow4,5. Similarly, in orthopaedic radiography, CNN-based models have been trained to automatically detect fixation zones in revision total knee arthroplasty (rTKA) from postoperative X-rays, replacing subjective expert assessments with reproducible evaluations6. Beyond MRI and X-ray imaging, ultrasound (US) has attracted interest due to its cost-effectiveness and accessibility, especially for early-stage KOA detection. However, its use remains limited by poor boundary definition and operator dependency. To mitigate these limitations, a UNet3+ architecture enhanced with attention mechanisms and edge-aware loss functions was proposed for segmenting the meniscus from ultrasound images, marking the first automated system capable of quantitatively assessing meniscal protrusion7. In MRI-based diagnostics, the segmentation task is further constrained by the scarcity of annotated datasets. To tackle this, unsupervised domain adaptation (UDA) has been employed. A recent source-free UDA strategy bypassed the need for access to source data during adaptation by using pseudo-label generation and uncertainty-aware learning, resulting in robust performance across different scanners and sequences8. To further enhance the precision of segmenting small and irregular cartilage structures, the PA-UNet architecture was introduced, featuring intra-channel and intra-patch attention modules along with a feature aggregation component, delivering superior segmentation accuracy on complex MRI datasets9.

Furthermore, 3D US imaging has been leveraged for assessing synovitis—another common KOA manifestation often under-recognised in standard imaging. A modified U-Net was applied to 2D slices to reconstruct synovial volumes, enabling volumetric assessment while reducing operator-induced variability10. More recently, multi-task learning (MTL) frameworks such as OA_MTL and RES_MTL have gained prominence by enabling simultaneous segmentation and OA grading on 3D MRI data11,12,13. These frameworks leverage the inherent correlation between structural delineation and disease severity prediction to improve both accuracy and computational efficiency. For instance, the OA_MTL model incorporates an encoder–decoder backbone with residual modules and depthwise separable convolutions to reduce parameter count while preserving discriminative spatial features. Similarly, the RES_MTL model, which extends residual connectivity within the MTL paradigm, demonstrated improved gradient propagation and faster convergence during training. The widespread burden of KOA, its impact on quality of life, and the inadequacy of traditional diagnostic methods emphasise the necessity for accurate, early, and scalable diagnostic systems14. MRI’s rich anatomical detail remains underutilised primarily due to the manual workload required. The rapid evolution of DL offers a promising direction to overcome these constraints, especially when segmentation and classification are unified within a single intelligent framework. Nonetheless, several challenges persist: existing models often treat classification and segmentation separately, thus forfeiting shared spatial knowledge; many are not optimised for generalisation across imaging modalities or sequences; and interpretability often remains low, reducing clinical trust and applicability. Recent literature reflects efforts to address these issues through integrated and interpretable DL solutions. For instance, AutoDDH leveraged spatial and channel attention fused with positional encoding to perform simultaneous grading and segmentation for developmental dysplasia of the hip in ultrasound scans, offering enhanced interpretability in data-constrained settings15. For KOA detection in resource-limited environments, DIKOApp applied YOLOv5 for region localisation and a hybrid CNN for classification, supplemented by preprocessing methods such as CLAHE and GridMask to improve robustness16. An Inception-ResNet V2-based model further enhanced objectivity in OA grading by integrating targeted image sharpening techniques17. Graph-based architectures have also shown promise. DMA-GCN applied dense graph connections and adaptive learning to precisely segment cartilage from knee MRIs, achieving high overlap with ground truth18. OA-MEN, combining ResNet and MobileNet with multi-scale strategies, yielded competitive classification performance with minimal computational overhead19,20. For precise localisation, a CentreNet-based model integrated DenseNet201 and pixel-wise voting, excelling in early KOA detection while maintaining a lightweight design via knowledge distillation21. Further enhancing 3D segmentation, Attention Res-UNet incorporated ResNet50 and attention gates, with additions like focal Tversky loss to improve knee bone boundary delineation22. For explainable classification, over 1,000 radiomic and geometric features were extracted from X-rays, refined to six optimal features and classified using a lightweight XGBoost model with rule-based interpretation23. MPFCNet integrated HRNet with large-kernel attention modules for superior five-part knee segmentation, surpassing transformer and U-Net variants in accuracy24.

Moving into biomechanical applications, 3D Swin UNETR combined with mesh refinement (Laplacian smoothing and Coherent Point Drift) enabled automatic cartilage modelling for finite element analysis, with public release ensuring reproducibility25. Classification improvements were also evident in ConvNeXt-based models that employed GELU activations and data augmentations to attain high accuracy, particularly in advanced KOA grades, as validated using TOPSIS26. Ordinal learning methods, such as DaViTOrd, employed Vision Transformers with CORN loss to respect the ordered nature of KOA grades and demonstrated robust generalisation via EigenCAM visualisation27. Anatomical modelling was further advanced using DenseVNet to segment thirteen knee structures from multi-sequence MRIs, crucial for surgical planning28. Mimicking clinical decision hierarchies, a tiered KOA classification pipeline segmented joint features before applying feature-based classifiers, aligning more closely with real-world diagnostic workflows29. For prognostic applications, a predictive model combining MRI and radiograph features accurately estimated total knee replacement risk over nine years with an AUC of 0.90 (ref. 30). Segmentation speed was improved through a semi-automated pipeline pairing 3D Swin UNETR with Statistical Shape Models, reducing annotation time while preserving spatial accuracy31. Outside the knee, ensemble models combining CNNs and Vision Transformers demonstrated high diagnostic accuracy in TMJ segmentation, validating their generalisability. In surgical contexts, CEL-UNet used dual decoders and edge detection for CT-based segmentation, ensuring sub-millimetre accuracy suitable for preoperative planning and implant design. Lastly, a Bayesian U-Net assessed muscle composition in knee arthroplasty, revealing that fatty degeneration had a stronger correlation with patient function than muscle volume, shifting focus toward more relevant biomarkers in rehabilitation32,33.

In response to these findings and challenges, this work proposes a unified deep learning framework tailored for 3D knee MRI analysis in total knee replacement planning. The core contributions include a dual-path framework for knee osteoarthritis analysis that integrates 3D MRI-based segmentation and X-ray–based severity grading. A Gaussian-guided 3D Channel–Spatial Feature Attention UNet (CSFA-UNet) enables robust volumetric knee MRI segmentation, while discriminative features from knee X-ray images are extracted using Atrous Spatial Pyramid Pooling and refined via the Desert Scorpion Feature Selector. The refined features are classified using a Spiking Transformer Network to facilitate accurate Kellgren–Lawrence grading. Falcon Hunting Optimisation (FHO) is employed to fine-tune hyperparameters, and the system is validated using both the OAI-ZIB MRI dataset (507 scans) for segmentation and a KL-graded X-ray dataset (1,650 samples) for classification. This holistic approach addresses anatomical precision, task integration, interpretability, and cross-modality generalisation, offering a scalable solution for KOA diagnosis and surgical planning. Current clinical workflows for diagnosing KOA are predominantly reliant on subjective evaluation of imaging data, typically focusing on either the classification of disease severity or the segmentation of anatomical structures in isolation. Such task-specific methodologies often demand considerable domain expertise and manual effort, while lacking interoperability across diagnostic stages. This compartmentalisation contributes to inefficiencies in both clinical decision-making and research pipelines.

In addition, standard grading systems such as the K-L scale exhibit limited sensitivity to early-stage morphological changes and offer minimal prognostic value for surgical planning or intervention risk assessment. To overcome these challenges, there is an urgent need for a unified deep learning framework capable of concurrently performing accurate segmentation and robust classification on three-dimensional MRI volumes. This integrated approach must be designed to accommodate the complexities of real-world clinical data, including noise, variability in imaging protocols, and significant class imbalance. An effective solution should incorporate advanced attention mechanisms to enhance contextual learning, biologically inspired network components to capture both local anatomical detail and global structural patterns, and feature selection strategies that mitigate redundancy and improve model generalisability. Furthermore, the adoption of efficient optimisation techniques is essential for fine-tuning model parameters, ensuring that the system remains reliable, scalable, and interpretable, particularly in high-stakes diagnostic and surgical planning scenarios. In light of these considerations, the present study introduces a comprehensive deep learning architecture, the 3D CSFA-UNet, that integrates segmentation and classification within a single intelligent pipeline. The following sections detail the development of the proposed framework, including datasets, architectural design and optimisation strategies, followed by experimental evaluation, performance analysis, and comparative discussion. Finally, key findings, clinical implications, and future directions are summarised in the conclusion. Table 1 compares various state-of-the-art works.

Table 1 Comparison of various state-of-the-art works.

Materials and methods

The proposed system is a unified deep learning pipeline specifically designed to address the complex challenge of analysing 3D knee MRI scans for total knee replacement planning. The proposed framework achieves unification at the architectural and methodological levels rather than through explicit fusion of MRI and X-ray data. Two parallel, structurally symmetric pathways are employed, consisting of a 3D MRI-based segmentation module and an X-ray–based osteoarthritis classification module, each optimized for its respective imaging modality. In the MRI pathway, Gaussian-guided preprocessing followed by a 3D Channel–Spatial Feature Attention UNet produces accurate anatomical segmentation. The processed X-ray radiographs are enhanced through Atrous Spatial Pyramid Pooling, and the resulting features are refined with the Desert Scorpion Feature Selector. Classification is performed using a Spiking Transformer Network for Kellgren–Lawrence grading. Such design coherence aligns feature representation, learning dynamics, and diagnostic objectives, thereby achieving true architectural unification between 3D MRI-based structural analysis and X-ray–based disease grading within a single integrated OA diagnostic framework. This framework integrates a series of tailored modules, each targeting a specific hurdle in volumetric medical image processing, from noise suppression and anatomical segmentation to intelligent feature handling and final classification.

The process initiates with a Gaussian Guided Filtering mechanism, which enhances the visibility of subtle anatomical boundaries by reducing image noise while preserving fine details. Following this, the 3D CSFA-UNet model is employed to perform high-precision segmentation; this architecture extends the conventional U-Net by incorporating a novel Channel-Spatial Feature Attention block, which selectively emphasizes the most relevant spatial and channel-wise information. The processed X-ray radiographs are given to the Atrous Spatial Pyramid Pooling module, which allows the network to capture contextual information across multiple scales. To handle the high dimensionality of the extracted features, a recently developed metaheuristic algorithm, the Desert Scorpion Feature Selector, is applied to isolate the most informative feature subsets. These refined features are then classified using the Spiking Transformer Network, which integrates spatiotemporal attention with biologically inspired neural encoding. Finally, the Falcon Hunting Optimization algorithm is utilized to fine-tune the classifier’s parameters, ensuring maximum performance. Altogether, this modular framework forms a coherent and powerful system for intelligent interpretation of knee MRI volumes. Figure 1 illustrates the overall architecture of the proposed deep learning framework, showcasing the sequential flow from 3D knee MRI input through preprocessing, segmentation, feature enrichment, optimized feature selection, and classification using a spiking transformer, culminating in the final segmentation and diagnostic output.

Fig. 1

Block diagram of the proposed deep learning pipeline for 3D knee MRI segmentation and X-ray grading.

Data acquisition and preprocessing

Dataset description

Two separate datasets were incorporated to tackle the distinct tasks of anatomical segmentation and clinical classification: (a) the segmentation dataset (OAI-ZIB) and (b) the classification dataset (X-ray-based KL grades).

(a) Segmentation dataset—OAI-ZIB:

A total of 507 3D DESS knee MRI scans form this dataset34, each meticulously annotated to delineate the femur, tibia, femoral cartilage, tibial cartilage, and meniscus. The scans were captured in the sagittal plane with voxel dimensions close to 0.36 × 0.36 × 0.7 mm³, ensuring high spatial fidelity. Subjects represent all Kellgren–Lawrence grades (0 through 4), with approximate counts of 60 (grade 0), 77 (grade 1), 61 (grade 2), 151 (grade 3), and 158 (grade 4). This variety supports robust learning across different disease severities. Sample 3D knee MRI slices from the OAI-ZIB dataset are shown in Fig. 2.

Fig. 2

Sample 3D knee MRI slices from the OAI-ZIB dataset.

(b) Classification dataset—X-ray based KL grades:

For the classification module, 1,650 knee radiographs labeled with Kellgren–Lawrence (KL) grades 0–4 were used35. Although per-class counts are not explicitly stated in the source, the dataset consistently adheres to the five-grade KL system used to assess osteoarthritis severity. Each image is assigned to one of the following categories: KL 0: Normal, KL 1: Doubtful, KL 2: Mild, KL 3: Moderate, and KL 4: Severe. These well-defined class labels permit effective training and evaluation of the classification stage. The combination of a richly annotated 3D MRI set and a broadly labeled X-ray corpus ensures that our framework addresses both structural delineation and clinical staging comprehensively. Representative knee X-ray images from the KL-grade classification dataset are shown in Fig. 3.

Fig. 3

Representative knee X-ray images from the KL-grade classification dataset.

Although segmentation and grading are performed on MRI and X-ray data respectively, this design reflects their complementary diagnostic purposes. MRI supports precise morphological mapping, while X-ray grading aligns with the standardized K-L protocol, forming a clinically interpretable and computationally efficient diagnostic bridge. In this work, datasets corresponding to different imaging modalities are collected from independent patient populations and processed separately, and the proposed design is consistent with standard practice in medical imaging research36,37.

Gaussian guided filtering (GGF)

In the preprocessing stage, Gaussian Guided Filtering (GGF) is applied to the raw MRI volumes to enhance anatomical visibility. This technique is particularly effective in medical imaging contexts where boundary precision is crucial and noise levels can obscure fine tissue details.

The GGF method is based on the principle of edge-aware smoothing, where a guidance image I is used to control the filtering of a target image P. The core assumption is that, within a local window ωk​, the filtered output Q is modelled as a linear transformation of the guidance image which is given by Eq. (1):

$$\:{Q}_{i}={a}_{k}{I}_{i}+{b}_{k},\:\forall\:i\in\:{\omega\:}_{k}$$
(1)

Here: \(\:{Q}_{i}\)​ is the filtered pixel value at location i, \(\:{I}_{i}\) is the corresponding pixel in the guidance image, \(\:{a}_{k}\)​ and \(\:{b}_{k}\)​ are linear coefficients estimated per window. These coefficients are derived by minimizing the following cost function over the window ωk which is given by Eq. (2)​:

$$\:E\left({a}_{k},{b}_{k}\right)=\sum\:_{i\in\:{\omega\:}_{k}}\left({\left({a}_{k}{I}_{i}+{b}_{k}-{P}_{i}\right)}^{2}+\epsilon{a}_{k}^{2}\right)$$
(2)

Where: \(\:{P}_{i}\)​ is the pixel in the input image to be filtered, ϵ is a regularization parameter that controls smoothing strength.

The Gaussian component is introduced by applying a spatial weight w (i, k) to each term in the summation, defined by a Gaussian kernel which is given by Eq. (3):

$$\:w\left(i,k\right)=\text{e}\text{x}\text{p}\left(-\frac{{||i-k||}^{2}}{2{\sigma\:}_{s}^{2}}\right)$$
(3)

Here, σs is the spatial standard deviation, dictating the influence of neighbouring pixels based on proximity.

The final output is computed using the weighted mean of ak​ and bk​, fused across overlapping windows. This edge-aware averaging ensures that anatomical structures such as cartilage borders and bone contours are preserved while homogeneous regions undergo effective denoising. In this research, typical parameter values were set empirically based on validation performance: the spatial standard deviation σs​ was chosen as 2.5, and the regularization constant ϵ was fixed at 0.01 to balance fidelity and smoothness. The application of GGF significantly improves contrast near soft tissue interfaces, allowing subsequent segmentation algorithms to operate with enhanced precision.
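The listing below is a minimal NumPy/SciPy sketch of this filtering scheme, assuming self-guidance (the MRI volume acts as its own guidance image) and the parameter values reported above; the function name gaussian_guided_filter and the synthetic test volume are illustrative only, not the exact implementation used in this work.

```python
# Minimal sketch of Gaussian-guided filtering on a 3D MRI volume (self-guided).
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_guided_filter(P, I=None, sigma_s=2.5, eps=0.01):
    """Edge-aware smoothing: Q = a*I + b with Gaussian-weighted local statistics."""
    I = P if I is None else I
    mean_I = gaussian_filter(I, sigma_s)
    mean_P = gaussian_filter(P, sigma_s)
    corr_II = gaussian_filter(I * I, sigma_s)
    corr_IP = gaussian_filter(I * P, sigma_s)
    var_I = corr_II - mean_I ** 2           # Gaussian-weighted local variance of I
    cov_IP = corr_IP - mean_I * mean_P      # local covariance between I and P
    a = cov_IP / (var_I + eps)              # minimiser of the regularised cost in Eq. (2)
    b = mean_P - a * mean_I
    mean_a = gaussian_filter(a, sigma_s)    # fuse coefficients across overlapping windows
    mean_b = gaussian_filter(b, sigma_s)
    return mean_a * I + mean_b              # per-voxel linear model of Eq. (1)

# Example: denoise a synthetic volume
volume = np.random.rand(32, 64, 64).astype(np.float32)
filtered = gaussian_guided_filter(volume)
```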

Segmentation

Architecture of 3D CSFA-UNet

Segmentation forms the bedrock of the proposed system, providing the critical step of isolating anatomical structures in 3D MRI volumes that are essential for total knee replacement (TKR) planning. Unlike conventional 2D models, which fail to account for volumetric dependencies, or standard 3D U-Nets that treat all extracted features uniformly, the 3D CSFA-UNet combines volumetric encoding with adaptive attention mechanisms. This design enables the model to not only understand the depth context inherent to MRI scans but also dynamically prioritize informative spatial and channel features. By integrating the Channel-Spatial Feature Attention (CSFA) modules, this network overcomes traditional limitations in modeling highly variable and low-contrast musculoskeletal tissues as shown in Fig. 4.

Fig. 4

Architectural layout of the proposed 3D CSFA-UNet model.

The encoder pathway, which initiates the segmentation process, is composed of successive hierarchical blocks that systematically compress the spatial dimensions while amplifying feature abstraction. Each encoder block begins with a 3D convolution using kernels of size 3 × 3 × 3, followed by batch normalization and ReLU activation. These 3D kernels are applied across all three spatial axes—depth (D), height (H), and width (W)—capturing local volumetric patterns essential for anatomical understanding. The output at position (i, j, k) for feature map f is computed as given by Eq. (4):

$$\:{Y}_{(i,j,k)}^{\left(f\right)}=\sum\:_{c=1}^{C}\sum\:_{u=-1}^{1}\sum\:_{v=-1}^{1}\sum\:_{w=-1}^{1}{W}_{f,c,u,v,w}\cdot \:{X}_{c,i+u,j+v,k+w}+{b}_{f}$$
(4)

In this formulation, X is the input tensor with C channels, \(\:{W}_{f,c,u,v,w}\)​ denotes the learnable kernel weight, and bf​ is the bias associated with channel f. Following convolution, the resulting activation maps are normalized using batch normalization to stabilize the learning dynamics, which is given by Eq. (5) :

$$\:\widehat{x}=\frac{x-{\mu\:}_{\varvec{B}}}{\sqrt{{\sigma\:}_{B}^{2}+\epsilon}},\:\:BN\left(x\right)=\gamma\:\widehat{x}+\beta\:$$
(5)

Here, µB ​ and σB​ represent the batch mean and variance, while γ and β are learnable scaling and shifting parameters. After normalization, ReLU is applied element-wise as given by Eq. (6):

$$\:f\left(x\right)=\text{m}\text{a}\text{x}(0,x)$$
(6)

This introduces non-linearity and enforces sparsity, allowing only positive activations to propagate. Each encoder stage is followed by a 3D max pooling operation that reduces the spatial resolution by half, using a window size and stride of 2 × 2 × 2. The pooled output at location (i, j, k) is given by Eq. (7):

$$\:{P}_{i,j,k}=\underset{u,v,w\in\:(0,s)}{\text{max}}{X}_{i\cdot \:2+u,\:j\cdot \:2+v,k\cdot \:2+w}$$
(7)

With each downsampling operation, the feature channel count is typically doubled, enabling the model to learn increasingly abstract and expressive representations while reducing computational cost. After each encoder block, a CSFA module is inserted to enhance feature representation through attention mechanisms.
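A minimal PyTorch sketch of one such encoder stage is given below; the module name EncoderBlock, the channel widths, and the single-patch example are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Two 3x3x3 conv + BN + ReLU layers followed by 2x2x2 max pooling (Eqs. 4-7)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.conv(x)            # full-resolution features kept for the skip connection
        return self.pool(skip), skip   # halved spatial resolution, doubled abstraction

# Example: one MRI patch, channels rising from 1 to 32 at the first stage
x = torch.randn(1, 1, 32, 64, 64)
down, skip = EncoderBlock(1, 32)(x)    # down: (1, 32, 16, 32, 32)
```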

Fig. 5

Internal structure of the Channel-Spatial Feature Attention (CSFA) module.

The CSFA module first applies channel-wise attention to recalibrate the importance of each feature map as shown in Fig. 5, with respective equations from 8 to 14. Two separate pooling operations—global average and max pooling—are applied spatially:

$$\:{F}_{avg}=AvgPool3D\left(X\right),\:\:{F}_{max}=MaxPool3D\left(X\right)$$
(8)

Both outputs are passed through a shared multilayer perceptron:

$$\:MLP\left(z\right)={W}_{2}\cdot \:ReLU({W}_{1}\cdot \:z)$$
(9)

The attention weights for each channel are then calculated using a sigmoid gate:

$$\:{M}_{c}=\sigma\:\left(MLP\left({F}_{avg}\right)+MLP\left({F}_{max}\right)\right)$$
(10)

The resulting map \(\:{M}_{c}\in\:{\mathbb{R}}^{C\times\:1\times\:1}\) is used to rescale the input features:

$$\:{X}^{{\prime\:}}={M}_{c}\otimes\:\text{X}$$
(11)

Here, \(\:\otimes\:\) denotes element-wise multiplication. Next, spatial attention is computed over the channel-refined tensor X′ by pooling along the channel dimension:

$$\:{F}_{s}=\left[AvgPool\left({X}^{{\prime\:}}\right);MaxPool\left({X}^{{\prime\:}}\right)\right]$$
(12)

A 3D convolution with kernel size 7 × 7 × 7 is applied:

$$\:{M}_{s}=\sigma\:\left({Conv3D}_{7\times\:7\times\:7}\left({F}_{s}\right)\right),\:\:{X}^{{\prime\:}{\prime\:}}={M}_{s}\otimes\:{X}^{{\prime\:}}$$
(13)

This final output X′′ is passed forward in the encoder pipeline or fed into the decoder depending on the architecture stage.
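The following PyTorch sketch shows how Eqs. (8)-(13) can be composed into one module; the reduction ratio, the use of 1 × 1 × 1 convolutions for the shared MLP, and the module name CSFA are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSFA(nn.Module):
    """Channel-Spatial Feature Attention sketch following Eqs. (8)-(13)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # shared MLP for channel attention, implemented with 1x1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv3d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7x7 convolution over the pooled channel maps for spatial attention
        self.spatial = nn.Conv3d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # channel attention, Eqs. (8)-(11)
        avg = self.mlp(F.adaptive_avg_pool3d(x, 1))
        mx = self.mlp(F.adaptive_max_pool3d(x, 1))
        mc = torch.sigmoid(avg + mx)               # M_c broadcast over D, H, W
        x = x * mc
        # spatial attention, Eqs. (12)-(13)
        fs = torch.cat([x.mean(dim=1, keepdim=True),
                        x.max(dim=1, keepdim=True).values], dim=1)
        ms = torch.sigmoid(self.spatial(fs))       # M_s over D x H x W
        return x * ms

# Example usage on an encoder feature map
feat = torch.randn(1, 64, 16, 32, 32)
refined = CSFA(64)(feat)
```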

The decoder pathway symmetrically mirrors the encoder using transposed convolutions to upsample the compressed feature maps. At each level, skip connections are introduced by concatenating the corresponding encoder output with the decoder’s upsampled feature map. These encoder-derived skip features are also passed through CSFA modules to maintain consistency in feature refinement:

$$\:{X}_{dec}=Concat(Up\left({X}_{prev}\right),\:\:CSFA\left({X}_{enc}\right))$$
(14)

Each decoder stage includes transposed convolutions with kernel size 2 × 2 × 2, followed by batch normalization, ReLU, and two standard 3D convolutions. The final segmentation prediction is produced by a 1 × 1 × 1 convolution, which outputs voxel-wise class scores.

Loss function and training details

To account for class imbalance and structural subtlety in medical images, a hybrid loss function is used that combines Dice loss and Categorical Cross-Entropy. The overall objective which is given by Eq. (15) is defined as:

$$\:{\mathcal{L}}_{total}=\alpha\:\cdot \:{\mathcal{L}}_{Dice}+\left(1-\alpha\:\right)\cdot \:{\mathcal{L}}_{CE}$$
(15)

Dice loss, which focuses on overlap accuracy between predicted and actual labels which is given by Eq. (16), is computed as:

$$\:{\mathcal{L}}_{Dice}=1-\frac{2{\sum\:}_{i}{p}_{i}{g}_{i}+\epsilon}{{\sum\:}_{i}{p}_{i}^{2}+{\sum\:}_{i}{g}_{i}^{2}+\epsilon}$$
(16)

Here, \(\:{p}_{i}\) and \(\:{g}_{i}\)​ represent the predicted and ground truth values, respectively, and ϵ is a small constant (\(10^{-6}\)) to prevent division by zero. Cross-Entropy loss handles per-class prediction error which is given by Eq. (17):

$$\:{\mathcal{L}}_{CE}=-\sum\:_{c=1}^{C}{y}_{c}log\left({\widehat{y}}_{c}\right)$$
(17)

Where \(\:{y}_{c}\)​ is the true label and \(\:{\widehat{y}}_{c}\)​ is the predicted probability for class c. The coefficient α = 0.7 was empirically determined to emphasize Dice loss due to its superior handling of foreground-background imbalance in voxel-wise tasks.
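A compact PyTorch sketch of this hybrid objective is given below; the soft (probability-based) Dice formulation over one-hot targets and the tensor shapes are assumptions consistent with Eqs. (15)-(17), not the exact training code.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, alpha=0.7, eps=1e-6):
    """L_total = alpha * Dice loss + (1 - alpha) * cross-entropy (Eqs. 15-17).
    logits: (B, C, D, H, W) raw scores; target: (B, D, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3, 4))
    denom = (probs ** 2).sum(dim=(0, 2, 3, 4)) + (one_hot ** 2).sum(dim=(0, 2, 3, 4))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()   # soft Dice per class, averaged
    return alpha * dice + (1 - alpha) * ce

# Example: 6 classes (background + 5 knee structures)
logits = torch.randn(2, 6, 16, 32, 32)
labels = torch.randint(0, 6, (2, 16, 32, 32))
loss = hybrid_loss(logits, labels)
```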

Feature enrichment and dimensionality reduction

Atrous Spatial pyramid pooling (ASPP)

Atrous Spatial Pyramid Pooling (ASPP) is employed as an intermediate module before classification. The primary objective of ASPP is to capture multi-scale contextual dependencies by applying convolution operations with various dilation rates. This enables the network to perceive both fine-grained and coarse-level details without increasing the number of parameters or reducing resolution through additional downsampling.

The ASPP module comprises a parallel configuration of five branches, each performing a distinct operation as shown in Fig. 6. The first branch applies a 1 × 1 convolution, which serves to preserve local spatial detail and acts as a point-wise projection which is given by Eq. (18):

$$\:{F}_{1}={W}_{1}*X$$
(18)

Here, W1​ represents the kernel for the point-wise convolution, and X is the input feature map.

The next three branches involve dilated (atrous) convolutions with kernel sizes of 3 × 3 and dilation rates set to 6, 12, and 18 respectively. Dilated convolution introduces a spacing parameter between kernel elements, allowing the filter to cover a larger receptive field without increasing kernel size or reducing feature resolution. The dilated convolution operation which is given by Eq. (19) is defined as:

$$\:Y\left[i\right]=\sum\:_{k}X[i+r\cdot \:k]\cdot \:W\left[k\right]$$
(19)

Where r is the dilation rate, W[k] are the convolution weights, and X is the input feature map. These rates enable the ASPP to extract features from increasingly larger contexts: r = 6 captures intermediate spatial patterns, r = 12 focuses on broader anatomical zones, r = 18 accounts for full-joint level structures.

Fig. 6

Schematic representation of the Atrous Spatial Pyramid Pooling (ASPP) module.

The fifth and final branch performs global average pooling, reducing the entire feature map to a single contextual vector. This output is then passed through a convolutional layer and bilinearly upsampled to match the spatial dimensions of the other branches which is given by Eq. (20):

$$\:{F}_{global}=Upsample\left(Conv\left(GAP\left(X\right)\right)\right)$$
(20)

All five outputs are then concatenated along the channel axis and passed through another 1 × 1 convolution to fuse the aggregated context which is given by Eq. (21):

$$\:{F}_{out}={Conv}_{1\times\:1}\left(Concat\left({F}_{1},{F}_{6},{F}_{12},{F}_{18},{F}_{global}\right)\right)$$
(21)

This fusion effectively integrates localized detail and long-range dependencies into a unified feature volume, making the enriched representation more robust for downstream classification or further feature selection. By preserving resolution and adding scale-invariant context, ASPP improves the model’s sensitivity to anatomical boundaries and variations across joint structures.
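The sketch below shows how the five branches can be assembled in PyTorch; because the module operates on X-ray feature maps in this pipeline, 2D convolutions and an illustrative output width of 256 channels are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Five-branch ASPP: 1x1 conv, three dilated 3x3 convs (r = 6, 12, 18), global pooling."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, out_ch, 1)                                  # Eq. (18)
        self.b6 = nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6)           # Eq. (19), r = 6
        self.b12 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12)        # r = 12
        self.b18 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18)        # r = 18
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)                           # Eq. (21)

    def forward(self, x):
        h, w = x.shape[-2:]
        g = F.interpolate(self.gap(x), size=(h, w), mode='bilinear',
                          align_corners=False)                                 # Eq. (20)
        out = torch.cat([self.b1(x), self.b6(x), self.b12(x), self.b18(x), g], dim=1)
        return self.fuse(out)

# Example on an X-ray feature map
feats = torch.randn(1, 512, 28, 28)
context = ASPP(512)(feats)      # (1, 256, 28, 28)
```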

Desert scorpion feature selector (DSFS)

While ASPP significantly enhances semantic richness, it also increases the feature dimensionality, leading to redundancy and computational inefficiency. To mitigate this, the Desert Scorpion Feature Selector (DSFS) is applied as a dimensionality reduction step. DSFS is a novel metaheuristic optimization algorithm inspired by the desert scorpion’s nocturnal hunting behavior, where exploration and exploitation are adaptively balanced to locate optimal prey—mirroring the feature selection process in high-dimensional spaces as shown in Fig. 7.

Fig. 7

Flowchart of the Desert Scorpion Feature Selector.

The core objective of DSFS is to identify a subset of features that maximizes class separability while minimizing redundancy. Each solution within the population is represented as a binary vector \(S\in{\{0,1\}}^{n}\), where \({S}_{i}=1\) indicates that the i-th feature is selected. The population evolves across generations using operators such as positional updating, social attraction, and environment-based adaptation.

The fitness function F(S) guiding the selection process is typically defined as a combination of classification accuracy and feature compactness, as given by Eq. (22):

$$\:F\left(S\right)={\uplambda\:}\cdot \:\text{A}\text{c}\text{c}\left(\text{S}\right)-(1-{\uplambda\:})\cdot \:\frac{\left|S\right|}{n}$$
(22)

Where: Acc(S) is the classification accuracy achieved using the selected features, |S| is the number of selected features, n is the total number of features, and λ is a weighting coefficient (set to 0.8) controlling the trade-off.

The DSFS algorithm simulates randomized foraging movements, where scorpions evaluate environmental cues (analogous to inter-feature dependency and class relevance) to iteratively refine their positions (i.e., selected feature subsets). Convergence is guided by selecting the subset that yields the highest fitness score across iterations.
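A wrapper-style sketch of this search is given below; the logistic-regression surrogate classifier, the bit-flip and attraction probabilities, and the function names are illustrative stand-ins for the scorpion movement operators, with only the fitness of Eq. (22) taken directly from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, lam=0.8):
    """Eq. (22): lam * Acc(S) - (1 - lam) * |S| / n, with -inf for empty subsets."""
    if mask.sum() == 0:
        return -np.inf
    acc = cross_val_score(LogisticRegression(max_iter=500),
                          X[:, mask.astype(bool)], y, cv=3).mean()
    return lam * acc - (1 - lam) * mask.sum() / mask.size

def dsfs_select(X, y, pop=10, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    swarm = rng.integers(0, 2, size=(pop, n))                 # random initial subsets
    scores = np.array([fitness(m, X, y) for m in swarm])
    best = swarm[scores.argmax()].copy()
    for _ in range(iters):
        for i in range(pop):
            cand = swarm[i].copy()
            cand[rng.random(n) < 0.1] ^= 1                    # exploratory foraging (bit flips)
            attract = rng.random(n) < 0.2                     # social attraction toward the best subset
            cand[attract] = best[attract]
            f_new = fitness(cand, X, y)
            if f_new > scores[i]:                             # greedy acceptance of better subsets
                swarm[i], scores[i] = cand, f_new
        best = swarm[scores.argmax()].copy()
    return best.astype(bool)

# Example on synthetic features
X = np.random.rand(120, 40)
y = np.random.randint(0, 2, 120)
selected = dsfs_select(X, y)
X_reduced = X[:, selected]
```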

In this framework, DSFS is employed immediately after ASPP. It processes the high-dimensional enriched feature maps and outputs a reduced representation containing only the most informative and non-redundant features. These optimized features are then forwarded to the spiking transformer classifier for final decision-making.

Together, ASPP and DSFS form a powerful duo: one enriches contextual detail and the other distils it into a lean, high-impact representation. This ensures the overall model remains both expressive and computationally efficient, significantly improving classification accuracy and inference speed in volumetric medical imaging.

Classification

Spiking transformer networks

To classify the features derived from X-ray images, a biologically inspired yet highly expressive architecture, the Spiking Transformer Network, is adopted. This model integrates the temporal dynamics of Spiking Neural Networks (SNNs) with the spatial modeling power of transformers, enabling robust encoding of both spatial configurations and temporal spike patterns. The architecture of the Spiking Transformer Network is shown in Fig. 8.

The process begins with spike encoding, where the continuous-valued feature maps, selected post-ASPP and DSFS, are translated into temporally sparse spike trains. This is accomplished through a Spiking Tokenizer, which applies multiple layers of Spiking Convolutional Neural Networks (SCNNs) interleaved with Leaky Integrate-and-Fire (LIF) neuron models. Each SCNN layer captures local patterns, while LIF neurons accumulate membrane potential over time and fire discrete spikes based on a threshold model which is given by Eq. (23):

$$\:{\tau\:}_{m}\frac{dV\left(t\right)}{dt}=-V\left(t\right)+I\left(t\right)$$
(23)

where V(t) denotes the membrane voltage, τm​ is the membrane time constant, and I(t) is the synaptic input current. The LIF unit emits a spike when V(t)≥θ, then resets.
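A discrete-time sketch of this LIF update is shown below; the timestep-loop formulation, the hard reset, and the tensor shapes are assumptions, and in practice a surrogate gradient would be needed during training because the thresholding step is non-differentiable.

```python
import torch

def lif_forward(inputs, tau_m=2.0, theta=1.0, v_reset=0.0):
    """Discrete-time LIF update (Eq. 23): V <- V + (I - V)/tau_m, spike when V >= theta.
    inputs: (T, B, C, H, W) synaptic currents over T timesteps."""
    v = torch.zeros_like(inputs[0])
    spikes = []
    for i_t in inputs:                        # iterate over timesteps
        v = v + (i_t - v) / tau_m             # leaky integration of the membrane potential
        s = (v >= theta).float()              # fire when the threshold is crossed
        v = v * (1.0 - s) + v_reset * s       # hard reset of the neurons that spiked
        spikes.append(s)
    return torch.stack(spikes)                # binary spike train, same shape as inputs

# Example: encode a feature map over 4 timesteps
currents = torch.rand(4, 1, 8, 16, 16)
spike_train = lif_forward(currents)
```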

Fig. 8

Architecture of the spiking transformer network.

To refine the spike stream, Adaptive Graph Filtering (AGF) is applied between the SCNN layers, dynamically reweighting spatio-temporal dependencies. The encoded output, now in spike format, is then passed to the Spiking Transformer Block.

Within the transformer, each token undergoes self-attention via a Graph Attention Network (GAT), which computes contextual embeddings using key (K), query (Q), and value (V) vectors which is given by Eq. (24):

$$\:Attention\left(Q,K,V\right)=softmax\left(\frac{{QK}^{T}}{\sqrt{{d}_{k}}}\right)V$$
(24)

This allows the model to assign importance scores to features across spatial dimensions. Unlike traditional transformers, these attention heads are spike-aware and time-driven, meaning each attention map evolves over multiple timesteps, preserving motion-like encoding from the input sequence.
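A minimal sketch of such time-stepped attention is given below, applying Eq. (24) independently at every timestep; the tensor layout (T, N, d) and the shared Q/K/V tensors in the example are illustrative assumptions rather than the graph-attention formulation used in the full model.

```python
import torch

def spiking_attention(q, k, v):
    """Scaled dot-product attention (Eq. 24) evaluated independently per timestep.
    q, k, v: (T, N, d) spike-derived token embeddings over T timesteps."""
    d_k = q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    attn = torch.softmax(scores, dim=-1)       # token-to-token importance at each timestep
    return torch.matmul(attn, v)

# Example: 4 timesteps, 10 tokens, 32-dimensional embeddings
q = k = v = torch.rand(4, 10, 32)
out = spiking_attention(q, k, v)
```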

The output of the attention module is passed through multiple Graph Convolutional Networks (GCNs) for neighbourhood aggregation, allowing the network to further capture the spatial topology of MRI structures which is given by Eq. (25):

$$\:{H}^{(l+1)}=\sigma\:\left({\stackrel{\sim}{D}}^{-\raisebox{1ex}{$1$}\!\left/\:\!\raisebox{-1ex}{$2$}\right.}\stackrel{\sim}{A}{\stackrel{\sim}{D}}^{-\raisebox{1ex}{$1$}\!\left/\:\!\raisebox{-1ex}{$2$}\right.}{H}^{\left(l\right)}{W}^{\left(l\right)}\right)$$
(25)

Here, \(\:\stackrel{\sim}{A}\) is the adjacency matrix with self-loops, \(\:\stackrel{\sim}{D}\:\)is the degree matrix, \(\:{H}^{\left(l\right)}\) is the feature matrix at layer l, and \(\:{W}^{\left(l\right)}\) is a learnable weight matrix.
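The propagation rule of Eq. (25) can be sketched in a few lines of PyTorch, as below; the dense-matrix formulation, the choice of ReLU for σ, and the random example graph are assumptions for illustration only.

```python
import torch

def gcn_layer(H, A, W):
    """One GCN propagation step (Eq. 25): sigma(D^-1/2 (A + I) D^-1/2 H W), with sigma = ReLU."""
    A_hat = A + torch.eye(A.shape[0])                 # adjacency matrix with self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    D_inv_sqrt = torch.diag(d_inv_sqrt)               # normalised degree matrix
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Example: 5 tokens with 8-dim features projected to 16 dims
A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.t()) > 0).float()                         # make the toy graph symmetric
H = torch.randn(5, 8)
W = torch.randn(8, 16)
H_next = gcn_layer(H, A, W)
```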

At the end of the temporal encoding path, a spiking summation module integrates time-stepped spike outputs, producing a static feature representation which is then passed into a fully connected (FC) classification head. This final layer maps the representation to a probability distribution across diagnostic classes, completing the end-to-end classification process.

Hyperparameter optimization with Falcon hunting optimization (FHO)

To further elevate classification performance, the Falcon Hunting Optimization (FHO) algorithm is utilized for hyperparameter tuning. FHO is a nature-inspired metaheuristic modeled on the hunting behavior of falcons—specifically their strategy of ascending to gain visibility and diving to strike prey. This translates into an optimization process that balances exploration (searching the hyperparameter space) and exploitation (refining promising configurations) as shown in Fig. 9.

Fig. 9

Flowchart of the Falcon Hunting Optimization (FHO) algorithm.

In the context of this classification pipeline, FHO is applied to tune multiple key hyperparameters of the spiking transformer model, including: Number of transformer blocks L, Number of attention heads H, Learning rate η, SCNN kernel size k, Spike threshold θ, and Membrane time constant \(\:{\tau\:}_{m}\)​.

Each solution candidate is a vector S = [L, H, η, k, θ, τm], and the fitness function guiding selection which is given by Eq. (26) is defined as:

$$\:F\left(S\right)=\alpha\:\cdot \:{Acc}_{val}\left(S\right)-\left(1-\alpha\:\right)\cdot \:{Loss}_{val}\left(S\right)$$
(26)

Where Accval​ is the validation accuracy, Lossval ​ is the cross-entropy loss on validation data, and α (set to 0.9) prioritizes accuracy over loss minimization.

The algorithm simulates falcon movement through controlled position updates which is given by Eq. (27):

$$\:{S}_{new}={S}_{old}+r\cdot \:\left({S}_{best}-{S}_{old}\right)+\beta\:\cdot \:randn\left(\right)$$
(27)

Where r and β are scaling constants, and randn() introduces stochasticity for exploration. Across multiple iterations, the population of candidate solutions converges towards an optimal set of hyperparameters, which are then used to retrain the classifier for final evaluation. FHO’s adaptive search mechanism ensures that the network configuration is not only high-performing but also well-generalized, making it a critical component in the overall classification pipeline.
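A schematic sketch of this search loop is given below; the population size, scaling constants, and toy objective are illustrative, with only the position update of Eq. (27) and the maximisation of the fitness in Eq. (26) taken from the text (integer-valued hyperparameters such as L, H, and k would be rounded before evaluation in practice).

```python
import numpy as np

def fho_search(evaluate, bounds, pop=8, iters=30, r=0.6, beta=0.1, seed=0):
    """Falcon position update (Eq. 27): S_new = S_old + r*(S_best - S_old) + beta*randn().
    `evaluate` returns the fitness of Eq. (26) for a hyperparameter vector (higher is better)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    flock = rng.uniform(lo, hi, size=(pop, len(bounds)))   # initial falcon positions
    fit = np.array([evaluate(s) for s in flock])
    best = flock[fit.argmax()].copy()
    for _ in range(iters):
        for i in range(pop):
            new = flock[i] + r * (best - flock[i]) + beta * rng.standard_normal(len(bounds))
            new = np.clip(new, lo, hi)                      # keep candidates inside the search space
            f_new = evaluate(new)
            if f_new > fit[i]:                              # accept only improving dives
                flock[i], fit[i] = new, f_new
        best = flock[fit.argmax()].copy()
    return best

# Example with a toy surrogate objective over [L, H, lr, k, theta, tau_m]
bounds = [(2, 8), (2, 12), (1e-4, 1e-2), (3, 7), (0.5, 2.0), (1.0, 5.0)]
toy_eval = lambda s: -np.sum((s - np.array([4, 8, 1e-3, 5, 1.0, 2.0])) ** 2)
best_cfg = fho_search(toy_eval, bounds)
```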

Result and discussion

The core objective is to critically evaluate the performance of the proposed deep learning framework by integrating both quantitative and qualitative analyses. This begins with the visualization of the confusion matrix to understand how effectively the model classifies varying grades of knee osteoarthritis, followed by a detailed per-class metric evaluation assessing accuracy, precision, recall, F1-score, and specificity. The section then explores the model’s generalization capability using overall performance curves, including ROC analysis, which elucidates the discriminative power of the classifier across all severity levels. Comparative studies with conventional and state-of-the-art methods are conducted to benchmark the proposed model’s effectiveness, using bar charts and ANOVA-based statistical significance testing to validate architectural improvements. Furthermore, the segmentation accuracy is analysed with respect to the number of feature points and subpixel refinement levels, examining trends across metrics like DSC, IoU, ASD, and Hd95. Computational efficiency and memory usage are also reviewed to ensure practical feasibility. Lastly, extensive cross-validation and ablation studies are performed to verify robustness and stability across configurations and datasets, reinforcing the system’s clinical applicability and diagnostic reliability.

Experimental configuration

The experimental setup for implementing and evaluating the proposed deep learning framework was configured on a high-performance computing environment equipped with an NVIDIA RTX A6000 GPU featuring 48 GB of dedicated VRAM, an AMD Ryzen Threadripper PRO 3975WX 32-core processor, and 256 GB of DDR4 RAM to handle intensive 3D volumetric computations. The system operated on Ubuntu 22.04 LTS, ensuring compatibility with CUDA and cuDNN libraries for GPU acceleration. Software dependencies included Python 3.9 with libraries such as PyTorch 2.0.1 for model development, MONAI for medical imaging preprocessing, NumPy and SciPy for scientific computations, and Matplotlib for visualization. All training and validation procedures were conducted using the PyTorch Lightning framework for optimized GPU utilization and reproducibility. Additionally, evaluation metrics and statistical validations were carried out using Scikit-learn and Statsmodels, ensuring consistent and standardized experimental procedures. Table 2 shows the well-tuned set of hyperparameters that guided the model’s training and evaluation.

Table 2 Optimized hyperparameters used for model training and evaluation.

Experimental results

The confusion matrix provides a detailed view of the model’s diagnostic precision across different stages of knee osteoarthritis. Each row corresponds to the true class while columns denote the predicted outcomes, making it easier to assess where misclassifications occur.

Fig. 10

Confusion matrix showing classification performance across all Kellgren–Lawrence grades.

As shown in Fig. 10, the classifier demonstrates exceptionally high accuracy for all grades, with Grade 0 (Healthy), Grade 2 (Minimal), and Grade 4 (Severe) achieving predictive accuracies of 98.92%, 99.16%, and 100% respectively. These results highlight the model’s strength in identifying distinct clinical patterns in extreme conditions—either normal or severe joint degeneration. Grades 1 (Doubtful) and 3 (Moderate), being inherently less distinct and often visually overlapping in radiographs, exhibit minor misclassification. For instance, 0.86% of Grade 1 cases were predicted as Grade 0, and 1.16% of Grade 3 instances were misclassified as Grade 1, indicating that mild degenerative transitions remain more challenging.

The bidirectional misclassification between Grade 0 and Grade 1 (0.86% and 0.58%) arises from the subtle morphological overlap between normal joints and early osteoarthritic patterns. Early-stage OA often shows minimal joint-space narrowing or tiny osteophytes that appear nearly identical to healthy structures under variable X-ray contrast or illumination, leading to occasional labelling inconsistencies even among radiologists, as noted in prior K–L grading studies. The Desert Scorpion Feature Selector (DSFS), while reducing feature redundancy, may inadvertently exclude fine high-frequency edge details vital for detecting early degeneration, and the Spiking Transformer’s emphasis on global spatial coherence can marginally suppress localized Grade 1 features. These factors collectively contribute to the observed symmetric confusion between early grades. However, since Grades 0 and 1 are clinically interpreted as “no or doubtful OA,” this misclassification remains within acceptable diagnostic limits. The evaluation of individual KL grades reveals a high level of diagnostic reliability across all osteoarthritis severity stages. As observed in Table 3, the model maintains a uniform accuracy of 99.15% across all classes, indicating that the classification is not biased toward any particular grade. Precision values range from 98.75% to 98.90%, demonstrating consistent confidence in the model’s predictions with minimal false positives. Recall, which measures the ability to correctly identify each class, fluctuates slightly but remains within a narrow high-performing range of 99.05% to 99.20%, reflecting strong sensitivity. The F1-Scores, derived from the harmonic mean of precision and recall, confirm balanced performance with values above 98.90% for all grades, showing that neither precision nor recall dominates. Specificity scores above 99.15% further illustrate the model’s capacity to correctly dismiss non-class instances, which is particularly critical in clinical decision-making to avoid unnecessary interventions.

Table 3 Per-class performance metrics for all five Kellgren–Lawrence (KL) grades.
Fig. 11

Overall performance metrics of the proposed classification framework.

The overall evaluation results highlight the exceptional performance of the proposed deep learning framework. With an accuracy of 99.15%, the model consistently identifies correct classes across the dataset, while maintaining a high recall of 99.11%, indicating minimal false negatives. As observed in Fig. 11, precision reaches 98.82%, meaning that the instances predicted as positive are overwhelmingly correct, thereby reducing false positives. The F1-score, which balances both precision and recall, stands at 99.04%, reinforcing the system’s robustness even under imbalanced or ambiguous class conditions. Notably, specificity measures at 99.17%, underscoring the model’s capability to correctly reject negative instances, a vital trait in medical diagnosis where false alarms must be minimized. The ROC analysis offers an in-depth perspective on the classification model’s discriminative capability across all severity levels of knee osteoarthritis. As illustrated in Fig. 12, the Area Under the Curve (AUC) scores reflect near-perfect classification for Grade 1 and Grade 4, achieving AUCs of 1.00 and 0.98 respectively, indicating flawless and near-flawless sensitivity-specificity trade-offs. Similarly, Grade 3 and Grade 0 also attain excellent AUC values of 0.98 and 0.97, showing the model’s strong ability to differentiate these conditions from the rest. Grade 2, while slightly lower with an AUC of 0.88, still falls within an acceptable range, though it reveals the relative difficulty in classifying early-stage osteoarthritis due to its subtle radiographic changes. These performance trends highlight that the model maintains high fidelity in recognizing advanced and healthy states, while intermediate grades with less pronounced features remain more susceptible to misclassification.

Fig. 12

ROC curves for multi-class classification of knee osteoarthritis severity levels.

The comparative evaluation across various model architectures clearly highlights the superior performance of the proposed framework in all major classification metrics. While traditional CNN and 3D-ResNet architectures demonstrate competent accuracy levels, the transformer-based and proposed models achieve noticeable gains, particularly in precision and specificity. As presented in Fig. 13, the proposed approach yields the highest scores in every metric, achieving 99.15% accuracy, 98.82% precision, 99.11% recall, 99.04% F1-score, and 99.17% specificity. These improvements suggest that the integration of CSFA-driven segmentation, ASPP-based context enhancement, DSFS optimization, and spiking transformer classification creates a highly synergistic effect. In contrast, baseline models exhibit more variance between recall and precision, indicating potential instability in class sensitivity.

Fig. 13

Comparative bar chart showcasing evaluation metrics across different classification models.

The statistical validation of the proposed framework demonstrates that each architectural refinement contributes significantly to performance improvement. As shown in Table 4, the overall ANOVA result with a p-value of 1.82e⁻⁷ and F-statistic of 15.237 confirms that the differences among all model variants are highly significant, thus rejecting the null hypothesis of equal means across models. Specifically, the transition from the baseline U-Net to CSFA-UNet yields a substantial improvement (F = 9.651, p = 0.0142), indicating the effectiveness of incorporating channel-spatial attention.

Table 4 Statistical comparison of model performance across successive architectural enhancements using ANOVA and pairwise F-tests.

The subsequent inclusion of the ASPP module also proves statistically impactful (F = 6.372, p = 0.0317), validating the importance of multi-scale context enrichment. Adding DSFS for feature selection produces an even stronger effect (F = 11.840, p = 0.0078), highlighting its role in eliminating redundancy and enhancing relevance. The integration of the spiking transformer also shows a statistically significant benefit over its predecessor (F = 5.628, p = 0.0416), while the final refinement—though showing improvement—falls just short of conventional significance levels (F = 4.015, p = 0.0675), suggesting that most of the gains were already consolidated in earlier stages.

Fig. 14

Visualization of the 3D knee MRI segmentation pipeline: row 1 shows the input, row 2 the ground-truth masks, row 3 the predicted segmentation, and row 4 the extracted anatomical regions.

The visual output in Fig. 14 provides a comprehensive demonstration of the proposed segmentation workflow applied to 3D knee MRI data. The top row illustrates the original input slices, showcasing the raw volumetric knee structures with typical anatomical complexity and noise artifacts. The second row presents the binary and multi-class mask annotations used as ground truth references for supervised training. In the third row, the segmentation results generated by the proposed 3D CSFA-UNet model are overlaid with distinct color codes, effectively distinguishing between different anatomical regions such as femur, tibia, and cartilage tissues. This clearly exhibits the model’s capability to handle spatial variability and fine-grained tissue separation. The fourth row highlights the segmented anatomical regions precisely extracted from the original scans, validating the network’s effectiveness in isolating complex musculoskeletal features. The consistency and fidelity between the predicted segmentations and their corresponding anatomical counterparts affirm the model’s high localization accuracy and structural awareness, crucial for pre-surgical planning and diagnostic clarity.

A comparative evaluation between Falcon Hunting Optimization (FHO) and conventional optimizers (AdamW and SGD) revealed that FHO achieved the highest classification accuracy (99.15%), outperforming AdamW (98.73%) and SGD (98.42%). This improvement stems from FHO’s dynamic exploration-exploitation adaptation, enabling efficient navigation of the hyperparameter space and reducing overfitting tendencies in spiking transformer training. The performance of the proposed framework with FHO and standard optimizers is presented in Table 5. Each optimizer tuned the same hyperparameters of the Spiking Transformer (number of heads, learning rate, spike threshold, and membrane time constant) under the same training protocol of 100 epochs with batch size 8 on the OAI-ZIB MRI and X-ray grading datasets.

Table 5 Performance comparison of proposed system with various optimizers.

FHO consistently outperformed both AdamW and SGD across all metrics, converging faster and exhibiting the lowest variance, indicating improved stability and generalization. Unlike gradient-based optimizers that follow local gradient directions and can be trapped in flat minima, FHO employs a population-based adaptive mechanism balancing exploration and exploitation. This dynamic balance likely contributes to improved convergence and accuracy in the Spiking Transformer.

The evaluation of segmentation performance in relation to varying feature point quantities reveals a direct correlation between feature richness and model accuracy. As indicated in Table 6, increasing the number of selected feature points leads to consistent improvements across all key performance indicators. The Dice Similarity Coefficient (DSC) rises from 93.20% with no feature points to 98.10% at 15 points, demonstrating enhanced overlap with ground truth masks. Similarly, the Intersection over Union (IoU) improves steadily from 88.10% to 96.26%, suggesting more comprehensive spatial agreement between predictions and actual anatomical boundaries. Alongside these gains, distance-based metrics such as Average Surface Distance (ASD) and the 95th percentile Hausdorff Distance (Hd95) show a marked decline—from 1.85 mm to 0.45 mm for ASD, and from 5.60 mm to 1.85 mm for Hd95. These reductions indicate more precise boundary localization and fewer extreme segmentation errors.

Table 6 Impact of increasing feature points on segmentation metrics.

The results indicate that enhancing subpixel resolution during segmentation significantly improves both accuracy and spatial precision. As demonstrated in Table 7, advancing from subpixel level 1 to level 4 yields notable gains in Dice Similarity Coefficient (DSC), which increases from 94.20% to 98.10%, and Intersection over Union (IoU), which rises from 89.50% to 96.26%. These improvements suggest a tighter alignment between predicted and true anatomical structures.

Table 7 Effect of subpixel refinement levels on segmentation metrics.

Correspondingly, the Average Surface Distance (ASD) and 95th percentile Hausdorff Distance (Hd95) show a consistent decline, moving from 1.25 mm to 0.45 mm for ASD and from 4.80 mm to 1.85 mm for Hd95, highlighting enhanced boundary accuracy and reduced outlier errors. However, this refinement comes at the cost of increased computational time, which grows from 0.90 s at level 1 to 1.70 s at level 4. This trade-off between precision and latency suggests that subpixel tuning must be balanced based on the specific application’s need for speed versus segmentation accuracy. Figure 15 presents a comprehensive evaluation of segmentation quality as influenced by subpixel refinement levels in the proposed 3D CSFA-UNet architecture. Increasing the subpixel level from 1.0 to 4.0 leads to consistent improvement across multiple metrics: Dice Similarity Coefficient (DSC) increases from 94.20% to 98.10%, and Intersection over Union (IoU) rises from 89.50% to 96.26%, indicating superior overlap between predicted and ground truth segmentations. Simultaneously, distance-based metrics such as Average Surface Distance (ASD) and the 95th percentile Hausdorff Distance (Hd95) show marked reductions, suggesting improved boundary adherence, with ASD dropping from 1.25 mm to 0.45 mm and Hd95 decreasing from 4.80 mm to 1.85 mm. Furthermore, the training curves for accuracy and loss demonstrate stable convergence behavior, with higher subpixel levels yielding better validation accuracy and lower generalization error. This convergence pattern confirms the stability and efficiency of the network’s learning process.

Fig. 15

Quantitative analysis of segmentation performance across different subpixel levels showing (a) DSC, (b) IoU, (c) ASD, (d) Hd95, (e) accuracy, and (f) loss trends over training epochs.

Collectively, these trends validate that finer subpixel tuning significantly enhances segmentation fidelity and learning robustness, reinforcing the architectural advantages of the proposed attention-guided model. A comprehensive analysis of model performance and resource consumption illustrates the superiority of the proposed framework across several critical dimensions. As observed in Table 8, the baseline 3D U-Net, while lightweight with 14.2 million parameters and low memory requirements (3.5 GB), achieved a comparatively modest accuracy of 97.48%. The 3D ResNet variant slightly improved accuracy to 97.85% at the expense of increased complexity and training time. Transformer-based models, including ViT and Swin Transformer, demonstrated notable gains in accuracy—98.60% and 98.70% respectively—but also required significantly more memory (up to 6.1 GB) and longer training durations exceeding 8 h. In contrast, the proposed framework struck an effective balance between model efficiency and predictive power. With 22.3 million parameters and 5.3 GB of memory usage, it achieved the highest classification accuracy of 99.15% while maintaining lower testing time per sample (0.31 s) and relatively moderate training overhead (8 h). This highlights the architecture’s optimal trade-off between complexity and performance, confirming its viability for high-stakes medical imaging tasks where both precision and speed are crucial.

Table 8 Comparative evaluation of different deep learning architectures.
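The complexity figures compared in Table 8 (parameter count and per-sample inference time) can be measured with a short profiling routine such as the sketch below. This is a generic PyTorch illustration, not the authors' benchmarking script; the input shape and the assumption of an available CUDA device are placeholders.

```python
# Illustrative sketch: measuring parameter count (millions) and average
# per-sample inference time for any candidate 3D network.
import time
import torch

def profile(model, input_shape=(1, 1, 64, 128, 128), device="cuda", runs=20):
    # input_shape is a hypothetical MRI patch size; pass device="cpu" if no GPU.
    model = model.to(device).eval()
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(3):                     # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return n_params, (time.time() - t0) / runs

# params_m, sec_per_sample = profile(my_3d_unet)   # hypothetical model instance
```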

An evaluation of segmentation effectiveness across different architectures clearly reveals the dominant performance of the proposed model in all measured criteria. As seen in Table 9, the proposed framework achieves a Dice Similarity Coefficient (DSC) of 98.10%, mean Intersection-over-Union (mIoU) of 96.26%, precision of 98.02%, and recall of 98.07%, significantly outperforming conventional and state-of-the-art segmentation techniques.

Table 9 Comparative performance of segmentation techniques.

Traditional Fully Convolutional Networks (FCN) yield the lowest scores across all metrics, reflecting their limitations in capturing complex spatial features. U-Net variants, including U-Net++, Attention U-Net, and the improved U-Net, show progressive gains, with Attention U-Net reaching 96.92% recall but still trailing in the remaining metrics. DeepLabV3+ performs robustly, especially in mIoU and precision, yet remains approximately 5% below the proposed method’s DSC.

The results of five-fold cross-validation reveal exceptional consistency and reliability in the classification capability of the proposed model. Each fold shows tightly clustered performance, with accuracy ranging narrowly between 99.10% and 99.19% and a mean of 99.15%. Recall remains robust across folds, averaging 99.11%, which confirms the model’s capacity to correctly identify true positives across varying data splits. Precision and F1-score exhibit similarly minimal variance, indicating strong generalizability, while specificity, which evaluates true-negative performance, maintains a mean of 99.17%, demonstrating the model’s ability to reliably reject negative cases. These results, presented in Table 10, clearly highlight the robustness of the classification pipeline.

Table 10 Five-fold cross-validation results of the proposed classification model.
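The fold-wise metrics summarised in Table 10 follow the usual stratified cross-validation protocol. The sketch below illustrates how such a table could be produced; it is a simplified stand-in for the authors' pipeline, the `build_and_train` callable is hypothetical, and the specificity calculation shown assumes a binary label setting (per-class averaging would be needed for multi-grade K-L labels).

```python
# Illustrative sketch: stratified five-fold cross-validation reporting
# accuracy, precision, recall, F1, and specificity.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def cross_validate(features, labels, build_and_train, n_splits=5, seed=42):
    # build_and_train(X_tr, y_tr) is assumed to return a fitted classifier.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for tr_idx, te_idx in skf.split(features, labels):
        clf = build_and_train(features[tr_idx], labels[tr_idx])
        y_pred = clf.predict(features[te_idx])
        y_true = labels[te_idx]
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary case
        fold_scores.append({
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
            "specificity": tn / (tn + fp),
        })
    return {k: np.mean([s[k] for s in fold_scores]) for k in fold_scores[0]}
```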

A progressive evaluation of model configurations underscores the cumulative benefit of each architectural enhancement introduced throughout the framework. The baseline 3D U-Net achieves an initial accuracy of 97.48%. Integrating the Channel-Spatial Feature Attention (CSFA) mechanism raises accuracy to 98.40%. Augmenting the model with Atrous Spatial Pyramid Pooling (ASPP) yields a further improvement to 98.85%, reflecting enhanced contextual awareness. Adding the Desert Scorpion Feature Selector (DSFS) refines feature quality by eliminating redundancy, pushing accuracy to 99.00%, and incorporating the Spiking Transformer Network (STN) contributes temporal-spatial modeling, lifting the score to 99.12%. The final proposed model, which combines all of these elements, attains the highest performance with 99.15% accuracy, alongside corresponding gains in precision, recall, F1 score, and specificity. These results, consolidated in Table 11, show that each added component plays a vital role and that together they form a highly effective diagnostic tool for knee analysis.

A critical comparison with existing research shows that the proposed framework not only matches but surpasses contemporary state-of-the-art approaches in both segmentation and classification. Previous works by Liu et al. and Yeoh et al. reported Dice Similarity Coefficients (DSC) of 87.27% and 91.5%, respectively, while Aibinder et al. and Kulseng et al. achieved higher segmentation scores of 98.05% and 96.7%. The proposed model attains a DSC of 98.10%, placing it at the forefront of the referenced studies. For classification, Mahum et al. and Tariq et al. recorded accuracies of 99.14% and 93.00%, respectively; the current framework slightly exceeds the highest previous benchmark with a peak accuracy of 99.15%, demonstrating superior generalization across both imaging and diagnostic tasks.

Table 11 Comparative analysis of performance metrics across model configurations (ablation study).

The baseline 3D U-Net focuses only on spatial encoding and lacks any mechanism to highlight clinically relevant structures, which limits its ability to capture subtle anatomical variations in knee MRI. The CSFA-3D U-Net improves feature discrimination through channel and spatial attention but still operates at a single receptive scale, resulting in incomplete contextual understanding. Incorporating ASPP enhances multi-scale feature capture, yet without feature optimization, redundant and less informative descriptors persist. Adding DSFS refines the feature space by retaining only the most discriminative attributes, improving performance while still relying on conventional static classifiers. Including the Spiking Transformer Network (STN) introduces temporal sensitivity and relational awareness between spatial regions, leading to further improvement. The proposed framework achieves the best performance because it unifies all of these components (CSFA, ASPP, DSFS, and STN) under a coherent architecture fine-tuned by Falcon Hunting Optimization (FHO). This combination enables precise segmentation, efficient feature selection, and context-aware classification within a single end-to-end pipeline, eliminating redundancy and maximizing generalization. Consequently, the proposed system demonstrates the highest accuracy, precision, and robustness among all model configurations.
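To make the roles of the attention and multi-scale modules concrete, the sketch below shows a minimal 3D channel-spatial attention block followed by a 3D atrous spatial pyramid pooling block, written in the spirit of the CSFA and ASPP components described above. It is an illustrative reconstruction only; the channel reduction ratio, kernel sizes, and dilation rates are assumptions and the published model may differ.

```python
# Minimal sketch of channel-spatial attention and 3D ASPP blocks
# (illustrative, not the authors' exact layer configuration).
import torch
import torch.nn as nn

class ChannelSpatialAttention3D(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(           # squeeze-and-excite style
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(            # 7x7x7 map over pooled stats
            nn.Conv3d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_gate(x)                          # channel reweighting
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_gate(pooled)                  # spatial reweighting

class ASPP3D(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv3d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv3d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        # Parallel dilated convolutions capture context at multiple scales.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```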

The final model includes an additional component, Falcon Hunting Optimization (FHO), which adaptively fine-tunes key Spiking Transformer hyperparameters. This tuning provides more stable convergence and a modest accuracy gain by reducing variance and avoiding poor local minima. In conventional multi-task learning (MTL) frameworks, performance degradation often arises from gradient interference between segmentation and classification objectives, leading to suboptimal task optimization. The proposed framework mitigates this issue through the Desert Scorpion Feature Selector (DSFS) and the Spiking Transformer Network (STN), which decouple redundant representations while maintaining semantic consistency between tasks. Furthermore, whereas existing MTL models are predominantly designed for single-modality MRI data, our framework extends to cross-modal learning by jointly utilizing MRI-based segmentation and X-ray-based OA grading, thereby achieving stronger generalization across imaging modalities and better reflecting real-world clinical workflows. In addition, traditional MTL approaches rely on fixed learning schedules that often hinder convergence stability; the proposed FHO instead adapts key hyperparameters dynamically during training, yielding faster convergence (61 epochs compared to 85–100 epochs for baseline MTL models) and improved optimization efficiency. Collectively, these design choices enable the proposed framework to achieve superior accuracy, stability, and computational efficiency compared with existing multi-task learning methods.
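For readers unfamiliar with metaheuristic hyperparameter tuning, the sketch below gives a deliberately simplified stand-in for the FHO step: a plain random search over a small Spiking Transformer search space. The falcon-inspired update rules of FHO are not reproduced here, and both the search space and the `evaluate` callable are hypothetical placeholders.

```python
# Simplified stand-in for the FHO tuning step: random search over
# hypothetical Spiking Transformer hyperparameters.
import random

SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3],
    "heads": [4, 8],
    "spike_threshold": [0.5, 0.75, 1.0],
}

def tune(evaluate, n_trials=20, seed=0):
    # evaluate(config) is assumed to train briefly and return validation accuracy.
    rng = random.Random(seed)
    best_cfg, best_acc = None, -1.0
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        acc = evaluate(cfg)
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```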

Table 12 Comparative experimental results (MTL models).

Table 12 presents the comparative performance of different models, where the proposed framework achieves the highest scores across all metrics. While conventional architectures such as U-Net and V-Net perform reasonably well, their segmentation precision remains lower than that of the multi-task learning models. Enhanced variants such as RES_MTL_C and OA_MTL_C show further gains from channel and spatial feature integration. The proposed model surpasses all baselines with a Dice score of 98.10%, IoU of 96.26%, classification accuracy of 99.15%, and F1 score of 99.04%, demonstrating its robustness and reliability in accurately segmenting and classifying knee osteoarthritis.

Table 13 Comparative evaluation of segmentation and classification performance with existing state-of-the-art methods.

Table 13 summarises the comparative evaluation of segmentation and classification performance against state-of-the-art methods. The proposed framework demonstrates substantial improvement, achieving 99.15% accuracy and a Dice similarity coefficient of 98.10%. While earlier models by Yeoh et al. and Liu et al. reported accuracies below 83%, later works such as Goswami et al. and Mahum et al. offered notable advancements. Nevertheless, the proposed approach consistently outperforms existing methods, combining high segmentation precision with superior classification accuracy, thereby establishing its effectiveness for automated knee osteoarthritis analysis.

Although the proposed system does not require simultaneous acquisition of MRI and X-ray data for every patient, it employs two imaging modalities serving complementary but independent roles, which may limit applicability in strictly single-modality clinical settings. The X-ray–based Kellgren–Lawrence grading pipeline operates exclusively on knee radiographs, with feature representations extracted from preprocessed X-ray images, refined using the Desert Scorpion Feature Selector, and classified by the Spiking Transformer Network. The MRI and X-ray datasets were obtained from independent cohorts curated for modality-specific analysis; while this enables robust learning of imaging characteristics within each modality, a unified multi-center dataset containing paired MRI and X-ray data would support more comprehensive population-level validation and represents an important direction for future work. In addition, the dual-path architecture introduces additional computational demands compared to single-model multi-task learning approaches; therefore, future work will explore lightweight architectural designs and model optimization strategies to further reduce inference time and improve deployment efficiency.

Conclusion

The proposed deep learning framework establishes a robust and unified pipeline for the automated analysis of 3D knee MRI scans, addressing both segmentation and classification with a high degree of accuracy and clinical reliability. The system cohesively integrates multiple specialized modules, beginning with Gaussian Guided Filtering for denoising and edge enhancement, followed by the 3D CSFA-UNet, whose attention mechanisms capture intricate spatial dependencies. The Atrous Spatial Pyramid Pooling (ASPP) module allows the model to recognize anatomical features at multiple scales, while the Desert Scorpion Feature Selector (DSFS) refines the high-dimensional output into a compact, informative feature space. The selected features are then passed through a Spiking Transformer Network, which captures both spatial topology and temporal encoding, further strengthened by Falcon Hunting Optimization (FHO) for hyperparameter tuning. The effectiveness of this comprehensive approach is reflected in its performance: a classification accuracy of 99.15% and a segmentation Dice Similarity Coefficient (DSC) of 98.10%, together with consistently strong results across Intersection over Union (IoU), Average Surface Distance (ASD), 95th percentile Hausdorff Distance (Hd95), precision, recall, and specificity. These outcomes affirm the model’s suitability for high-stakes medical tasks such as surgical planning and osteoarthritis grading. Looking ahead, the framework could be extended to other anatomical regions or modalities such as CT or ultrasound, and embedding explainable AI strategies would enhance clinician trust and transparency, fostering broader acceptance in clinical workflows.