Introduction

Knee osteoarthritis (OA) is the most common degenerative joint disease and leads to significant disability in the elderly, with a global prevalence of around 23% in individuals aged over 40 years1. Total knee arthroplasty (TKA) is recommended for advanced knee OA based on its capacity to reduce pain and improve functional status and quality of life2. Despite extensive development and optimization of TKA techniques, an estimated 15–20% of patients remain dissatisfied after surgery3. Identifying patients who are at risk of dissatisfaction will support better clinical decision making prior to surgery. By proposing alternative management strategies, reducing outpatient attendances, and minimizing unwarranted surgical intervention for this group of patients, we can provide more cost-effective and patient-centered care for patients with knee OA.

The outcome of TKA is commonly assessed objectively with patient-reported outcome measures (PROMs) to measure baseline function preoperatively and patient improvement postoperatively. Several PROMs have been developed and validated worldwide, including Knee Society knee and function scores (KSS)4, Short Form-36 Health Survey (SF-36)5 and Oxford Knee Score (OKS)6. These PROMs have corresponding minimal clinically important difference (MCID) thresholds that can be used to determine if a patient achieves perceivable benefits or remains dissatisfied postoperatively.

Machine learning has been applied to predict patient dissatisfaction after TKA. Several studies have used clinical data to predict dissatisfaction by applying tree-based models, achieving an area under the receiver operating characteristic curve (AUC) ranging from 0.60 to 0.95 (ref. 7). However, these studies were limited by small sample sizes, heterogeneous MCID cutoffs and follow-up periods, and reliance on a single functional or mental health questionnaire. Most notably, no studies have included anterior-posterior (AP) knee radiographs, which surgeons use to determine disease severity and to counsel patients prior to surgery.

The aim of this study was to develop and compare the performance of ML models using image only (AP knee radiographs), clinical data only, and multimodal data to predict postoperative dissatisfaction, defined as not meeting PROM MCID thresholds after TKA at 6-month and 2-year follow-ups, respectively. Specifically, the input included one AP knee radiograph and clinical data per knee collected at the preoperative assessment visit of the corresponding patient, and the output was the probability that the patient would not achieve MCID, i.e., would be dissatisfied with the surgical outcome, as measured by various PROMs at follow-up visits after TKA.

Results

We included a total of 7224 knee radiographs from 5720 subjects with corresponding clinical data available at baseline and follow-up visits for model development (see Supplementary Figure 1). Details of baseline characteristics are shown in Table 1. The baseline mean age was 67.0 (standard deviation, SD = 7.6) years, and 21% (N = 1513) were male. The majority of our patients were Chinese (87%, N = 6288), followed by Malay (6.9%, N = 500), Indian (5.1%, N = 370) and other races (0.9%, N = 66). Seventy-nine percent were overweight to severely obese, and prevalent comorbidities included hypertension (62%, N = 4501), hyperlipidaemia (49%, N = 3515) and diabetes (18%, N = 1332). The participants had a mean follow-up of 1.9 years (SD = 0.4).

Table 1 Baseline characteristics of the patients in training and testing dataset

In the internal testing set, the image-only model achieved fair AUCs for the prediction at 2-year follow-up: not achieving MCID in KSS 0.577 [95% confidence interval (95%CI) 0.540–0.612], SF-PCS 0.542 (0.510–0.574), SF-MCS 0.486 (0.455–0.519), and OKS 0.617 (0.557–0.677). Similar model performance was observed in the prediction of 6-month outcomes (see Supplementary Table 1, Supplementary Figure 2).

The clinical-data only model outperformed the image-only model for the prediction of not achieving MCID at 2-year follow-up: KSS 0.888 (0.866–0.909), SF-PCS 0.836 (0.812–0.860), SF-MCS 0.833 (0.812–0.854), and OKS 0.806 (0.753–0.859) (DeLong test, all p < 0.001); as well as at 6-month follow-up: KSS 0.895 (0.873–0.913), SF-PCS 0.835 (0.810–0.859), SF-MCS 0.830 (0.807–0.852), and OKS 0.809 (0.763–0.853) (DeLong test, all p < 0.001) (Fig. 1, Table 2).

Fig. 1: Receiver operating characteristic curves for the internal testing set on predicting patient dissatisfaction at 2-year follow-up.

ROC curves of image-data only model, clinical-data only model and multimodal model for predicting each PROM MCID: a Not achieve KSS MCID. b Not achieve SF-PCS MCID. c Not achieve SF-MCS MCID. d Not achieve OKS MCID. KSS Knee Society knee and function scores, MCS mental component score, OKS Oxford Knee Score, PCS physical component score, SF Short Form-36 Health Survey.

Table 2 Model performance for the internal testing set on predicting patient dissatisfaction at 2-year follow-up

When image features were concatenated with clinical data in the multimodal model, performance was comparable to that of the clinical-data only model, with no statistically significant difference at 2-year follow-up: KSS 0.891 (0.870–0.911), SF-PCS 0.832 (0.808–0.857), SF-MCS 0.835 (0.811–0.856), and OKS 0.816 (0.768–0.863) (DeLong test, all p > 0.05) (Fig. 1, Table 2). Similar results were observed in the prediction of 6-month outcomes: KSS 0.898 (0.876–0.916), SF-PCS 0.832 (0.808–0.856), SF-MCS 0.835 (0.812–0.857), and OKS 0.802 (0.753–0.848) (DeLong test, all p > 0.05) (see Supplementary Table 1, Supplementary Figure 2).

To interpret the results from the image-only model, we randomly selected cases from the testing set in which the patient underwent bilateral TKA (see Supplementary Figure 3). Although models incorporating images could identify pathological changes such as narrowed joint space and osteophytes, they were unable to use the presence or severity of these features to predict patient dissatisfaction. For example, a subject demonstrating similar severity of radiographic features in both knees was satisfied with only one side, while the contralateral side, which presented higher preoperative PROMs, resulted in dissatisfaction after TKA.

Since the clinical-data only model achieved performance equivalent to that of the multimodal model, we evaluated feature importance in the clinical-data only model to determine the most important contributors. Of the 60 input features, the preoperative score emerged as the most important input variable for predicting postoperative dissatisfaction. The top three features for each PROM were as follows: KSS: preoperative KSS score, preoperative SF-MCS score and height; SF-PCS: preoperative SF-PCS score, age and preoperative SF-36 Q2; SF-MCS: preoperative SF-MCS score, preoperative SF-36 Q6 and preoperative SF-PCS score; OKS: preoperative OKS score, preoperative terminal knee extension and age. The direction of association between each input variable and the outcomes at 2-year follow-up is shown in Fig. 2, with similar relationships observed at 6-month follow-up (see Supplementary Figure 4).

Fig. 2: Feature importance plots on predicting patient dissatisfaction at 2-year follow-up.

SHAP explanation of clinical-data only model for predicting each PROM MCID: a Not achieve KSS MCID. b Not achieve SF-PCS MCID. c Not achieve SF-MCS MCID. d Not achieve OKS MCID. KSS Knee Society knee and function scores, MCID minimal clinically important difference, MCS mental component score, OKS Oxford Knee Score, PCS physical component score, PREOP preoperative, rome terminal flexion, roms terminal extension.

Discussion

To our knowledge, we present the first and largest interpretable machine learning study comparing three machine learning models that utilized AP knee radiographs only, clinical data only, and multimodal data to predict postoperative dissatisfaction using validated PROMs. Our machine learning models were built, validated, and tested in 5720 subjects at 6-month and 2-year follow-ups, respectively. Among the three models, both the clinical-data only and multimodal models obtained excellent performance and outperformed the image-only model, as measured by AUC and its 95% CI. This indicates that incorporating AP knee radiographs, which surgeons commonly use to monitor disease progression and to counsel patients on TKA candidacy, adds only limited value in predicting postoperative dissatisfaction. We also identified that preoperative function scores are the most important clinical features that correlate with postoperative dissatisfaction.

A recent review has summarized the existing evidence and developments of machine learning models in the prediction of postoperative satisfaction7, showing that the weaknesses of previous studies included inconsistent performance (AUC 0.60–0.95), limited sample sizes (around 400–6400), varied clinical outcomes, reliance on clinical data alone as input, and heterogeneous follow-up periods. For instance, one cross-sectional NHS-based study involving 34,100 observations reported a J-statistic (sensitivity + specificity − 1) of 0.31 for predicting postoperative OKS score and 0.57 for quality of life8, while another study following 6480 patients who received TKA demonstrated an AUC of 0.6–0.89 at the 2-year visit (ref. 9). Our findings address the issues of inconsistent model performance, limited sample size, and heterogeneous follow-up periods, offering valuable and interpretable evidence by incorporating AP knee radiographs for identifying patients who are unlikely to benefit from TKA, thereby supporting precise management in clinical settings.

Over 700,000 TKAs are performed every year in the United States10, and around 15–20% of patients are dissatisfied postoperatively for unknown reasons. The recommendation for TKA relies fundamentally on a medical history of chronic mechanical knee pain that limits function, a physical examination suggesting internal derangement of the knee, and severe pathological changes on preoperative knee radiographs. Radiographs remain the most widely used tool to assess the severity of structural damage such as joint space narrowing and osteophytes. However, the relationship between radiographic findings and PROMs has been inconsistent in previous studies. One study identified that an increased Kellgren–Lawrence (KL) score was associated with worse general health, physical function and pain11, while two other studies found a weak or no relationship between KL grading and PROMs12,13, indicating that the severity of radiographic features cannot fully reflect a patient’s functional, mental or quality of life status. Our study, from a machine learning standpoint, shows that although the model can identify pathological patterns such as narrowed joint space in the AP knee images, these patterns are inadequately associated with clinical outcomes as measured by PROMs after TKA.

Although standard knee radiographs include the AP view, lateral view, and Merchant view, most orthopedic machine learning studies to date have mainly used the AP view for joint segmentation, early detection and severity classification, with only two studies exploring the feasibility of using the lateral view for early detection of knee OA14,15. Emerging evidence has shown that patellofemoral joint space narrowing, assessed by a lateral or Merchant view, is associated with poorer PROMs16. In concordance with previous findings, our study suggests that although the machine learning models were capable of learning certain pathological patterns from the AP knee radiograph, the image alone offered only limited information for predicting patient dissatisfaction after TKA, highlighting that surgeons should not rely solely on AP knee radiographs when indicating patients for TKA. One possible explanation is that the two-dimensional bony knee structure in the AP view, without information on the actual degree of cartilage loss, is incapable of reflecting preoperative function17, which is highly associated with postoperative dissatisfaction. Future studies incorporating the lateral or Merchant view are warranted to improve the accuracy of model prediction.

Postoperative dissatisfaction is a multifactorial outcome that correlates with patient expectation, medical history, female sex, ethnicity, and functional status18. Recent studies have raised concerns that patients’ mental health before surgery may be negatively associated with postoperative dissatisfaction. In a longitudinal prospective study following over 1200 patients who received TKA over a mean of 5 years, patients with a higher preoperative mental health score presented better physical function scores at pre- and post-operative visits19. In another pilot randomized controlled trial, preoperative psychological intervention improved patient satisfaction at 2-year follow-up after TKA among patients with depression20. Our results showed that preoperative mental health status ranked among the top four features in predicting dissatisfaction measured by KSS, SF-PCS, and SF-MCS, indicating that psychological distress may play an important role in patient dissatisfaction. Assessment of both functional and mental health status, coupled with timely intervention, may be beneficial in the clinical management of knee OA.

Several clinical rating systems have been widely used to describe the outcome of TKA, among which KSS is the most utilized scoring system21. The KSS objectively assesses functional status, whereas the SF-36 measures quality of life, and the OKS evaluates patients’ opinions of their outcome, which correlate more with pain and less with functional status22. In addition, the concepts of measuring satisfaction, expectation fulfillment, patient acceptable symptom state (PASS) and MCID are not equivalent. Unfortunately, none of these PROMs can comprehensively address the changes in pain, physical function, and mental status after TKA. Future studies should be directed toward designing a comprehensive scoring system in order to understand patient-oriented outcomes.

Our study has identified the top-performing features for predicting patient dissatisfaction as measured by validated PROMs, namely preoperative PROM scores, range of motion, and age. These clinical parameters could form the basis of machine learning models used in either primary care or specialist care to select or indicate patients for TKA. These novel models, rather than a single-view AP knee radiograph, can be used to predict patient dissatisfaction and counsel patients on the probability of a favorable postoperative outcome. Careful patient selection of this kind could improve overall outcomes for TKA patients and enhance the cost-effectiveness of the procedure.

Our study has some limitations. First, we performed internal validation within one institution only; further external validation with multi-ethnicity cohorts is desirable. Second, our study may have selection bias since we excluded patients with a surgical history. Third, our model is a classifier and cannot estimate the exact postoperative PROM scores. Fourth, the models did not take into account postoperative radiographic assessments, in which poorly positioned components may affect outcomes.

In summary, we developed and evaluated three machine learning models utilizing image data only, clinical data only, and multimodal data to predict patient dissatisfaction at 6-month and 2-year follow-ups. This machine learning study is the first to evaluate whether incorporating AP knee radiographs can help predict patient dissatisfaction after TKA. We showed that both the clinical-data only and multimodal models presented excellent performance in predicting postoperative dissatisfaction. Our model could serve as a valuable evaluation tool to identify patients with knee OA who will not benefit from TKA, in order to achieve precision medicine in clinical settings. Future work incorporating multi-view knee radiographs to enhance model performance is warranted.

Methods

Study design and participants

In this longitudinal, retrospective, single-center study, we trained, validated, and tested machine learning models using AP knee radiographs only, clinical data only, and multimodal data for predicting postoperative dissatisfaction from a retrospective joint replacement registry in Singapore General Hospital, Singapore. Patients who underwent either TKA or minimally-invasive TKA (MIS-TKA) for diagnoses of knee OA from 2006 to 2016 were included. Exclusion criteria were: (1) prior surgery of the knee; (2) underwent non-TKA procedure; (3) incomplete clinical data.

We collected anonymized preoperative clinical assessments, PROMs, and weight-bearing AP knee radiographs at baseline, with PROMs repeated at 6 months and 2 years postoperatively. Ethics approval was obtained from the SingHealth centralized institutional review board (Reference number: CIRB 2019/2878). Informed consent was waived by the ethics committee since all the data were extracted anonymously from routine clinical practice. We followed the TRIPOD guideline to report our study.

Data preparation

For clinical data, all continuous variables were normalized, and categorical variables were one-hot encoded.
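The encoding step can be illustrated with a minimal sketch. The text does not state which normalization was applied, so z-score standardization is assumed here, and the toy ages and sex labels are hypothetical values rather than study data. In practice the mean and standard deviation would presumably be estimated on the training set only and reused for the testing set.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize numbers to zero mean and unit variance (assumed normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """One-hot encode category labels; returns (sorted categories, encoded rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical toy records, not study data.
ages = [60.0, 70.0, 80.0]
sexes = ["F", "M", "F"]

scaled_age = zscore(ages)
sex_categories, sex_rows = one_hot(sexes)

# One feature row per knee: scaled continuous value followed by indicator columns.
features = [[a] + row for a, row in zip(scaled_age, sex_rows)]
```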

For each AP knee radiograph in digital imaging and communications in medicine (DICOM) format, we extracted a region of interest of the knee joint using the BoneFinder® tool (http://bone-finder.com/)23. Subsequently, we used contrast-limited adaptive histogram equalization24 to enhance the image contrast. We horizontally flipped all right knee radiographs so that they shared the same orientation as the left knee radiographs, and rescaled all images to 224 × 224 pixels.
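The last two preprocessing steps, flipping and rescaling, can be sketched as follows. This is an illustration only: region-of-interest extraction and contrast-limited adaptive histogram equalization are omitted, nearest-neighbour sampling is an assumption (the interpolation method is not stated), and the function names and the 2 × 2 toy image are hypothetical.

```python
def flip_horizontal(image):
    """Mirror a 2D image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def resize_nearest(image, out_h, out_w):
    """Rescale a 2D image to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def prepare(image, side, size=224):
    """Flip right knees to the left-knee orientation, then rescale to size x size."""
    if side == "right":
        image = flip_horizontal(image)
    return resize_nearest(image, size, size)


# Hypothetical 2 x 2 toy image standing in for a cropped knee radiograph.
toy = [[0, 1], [2, 3]]
right_prepared = prepare(toy, "right")
left_prepared = prepare(toy, "left")
```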

We split the dataset into training and testing sets (80:20) at the patient level, so that no patient contributed samples to both sets. We used balanced sampling to handle the class imbalance between dissatisfied and satisfied patients.
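A patient-level split can be sketched as below: whole patients, rather than individual knees, are assigned to one set, so that both knees of a bilateral case always land on the same side. The function name, the fixed random seed and the identifiers are hypothetical, and the resulting knee-level ratio is only approximately 80:20.

```python
import random


def split_by_patient(knee_to_patient, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    patients = sorted(set(knee_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = round(len(patients) * test_fraction)
    test_patients = set(patients[:n_test])
    train = [k for k, p in knee_to_patient.items() if p not in test_patients]
    test = [k for k, p in knee_to_patient.items() if p in test_patients]
    return train, test


# Hypothetical identifiers: 10 patients, the first three with both knees operated.
knee_to_patient = {}
for i in range(10):
    knee_to_patient[f"P{i}_L"] = f"P{i}"
    if i < 3:
        knee_to_patient[f"P{i}_R"] = f"P{i}"

train_knees, test_knees = split_by_patient(knee_to_patient)
```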

Outcomes

The main goal of this study was to develop and compare the performance of machine learning models using image data only, clinical data only, and multimodal data to predict postoperative dissatisfaction. Several PROMs were recorded by physiotherapists preoperatively and at 6 months and 2 years postoperatively: KSS (ref. 4) [ranging from 0 (extreme knee problem) to 100 (no knee problem)], SF-36 (ref. 5), and OKS (ref. 6) [ranging from 0 (most severe symptoms) to 48 (least severe symptoms)]. The SF-36 score was aggregated into a physical component summary (PCS) [ranging from 0 (worst health status) to 100 (best health status)] and a mental component summary (MCS) [ranging from 0 (worst health status) to 100 (best health status)].

We calculated the difference of each PROM between follow-up visits and preoperative assessment, and dichotomized patients into two groups: dissatisfied (did not achieve MCID) versus satisfied (achieved MCID) according to each PROM MCID. We used published MCID values as the cutoff: KSS (34.5 points)25, SF-36 (PCS: 10 points and MCS: 10 points)26, and OKS (5 points)27. The primary outcomes were patient dissatisfaction quantified by each MCID at 2-year follow-up, and the secondary outcomes were patient dissatisfaction at 6-month follow-up.
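The labelling rule can be written out as a short sketch using the MCID cutoffs quoted above. Treating a change exactly equal to the MCID as "achieved" is an assumption, since the handling of the boundary is not stated, and the function name and example scores are hypothetical.

```python
# MCID cutoffs as quoted in the text (points of improvement from baseline).
MCID = {"KSS": 34.5, "SF-PCS": 10.0, "SF-MCS": 10.0, "OKS": 5.0}


def is_dissatisfied(prom, preop, followup):
    """Return True when the improvement from baseline falls short of the MCID.

    Assumption: a change exactly equal to the MCID counts as achieved.
    """
    return (followup - preop) < MCID[prom]


# Hypothetical scores for illustration only.
examples = [
    ("OKS", 20, 24),     # +4 points, below the 5-point cutoff
    ("OKS", 20, 25),     # +5 points, meets the cutoff
    ("KSS", 40, 80),     # +40 points, above the 34.5-point cutoff
    ("SF-MCS", 55, 50),  # score worsened
]
labels = [is_dissatisfied(prom, pre, post) for prom, pre, post in examples]
```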

We used the area under the receiver operating characteristic curve (AUC) as the main model-discriminative performance metric. We also report other clinically relevant metrics: F1 score, precision, and recall.
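For reference, these metrics can be computed from predicted probabilities as in the sketch below. The AUC is calculated as the proportion of dissatisfied–satisfied pairs in which the dissatisfied case receives the higher score, with ties counted as one half. The 0.5 decision threshold for precision, recall and F1 is an assumption, as the operating point is not stated, and the labels and scores are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(y_true, scores, threshold=0.5):
    """Precision, recall and F1 for the positive (dissatisfied) class."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = dissatisfied) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]

model_auc = auc(y_true, scores)
precision, recall, f1 = precision_recall_f1(y_true, scores)
```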

Model development and test

We developed three machine learning models to predict patient dissatisfaction after TKA: (a) a model that used knee radiograph only (Fig. 3a), (b) a model that utilized clinical data only (Fig. 3b), (c) a model that integrated both knee radiograph and clinical data for its predictions (Fig. 3c).

Fig. 3: Overview of the proposed machine learning models.

a The image-only model is fed with AP knee radiographs. b The clinical-data only model is fed with clinical data. c The multimodal model incorporates information from both AP knee radiographs and clinical data. CNN convolutional neural network, MLP multilayer perceptron, XGB Extreme Gradient Boosting.

For the image-only model, we used a convolutional neural network (CNN) to extract features for classification. ConvNeXt-Tiny28 (the best model among ResNet101, ResNeXt, ConvNeXt, and Vision Transformer based on the training set), pretrained on the ImageNet dataset (http://www.image-net.org/), was used as the CNN backbone.

We trained the clinical-data only model using the Extreme Gradient Boosting (XGB) algorithm29 [the best model among XGB, Random Forest, Support Vector Machine, decision tree and multilayer perceptron (MLP) based on the training set]. The input features comprised demographics, medical history, preoperative assessments and preoperative PROM scores (see Supplementary Table 2). We applied 5-fold internal cross validation (4 folds for training and 1 fold for validation, in turn) in the training set for hyperparameter tuning. The hyperparameters optimized were the number of estimators, maximum depth, learning rate, and subsample strategy (see Supplementary Table 3). We selected the hyperparameters and epoch that achieved the highest average AUC on the validation folds. Once the optimal hyperparameter values were selected, the final XGB model was trained on the whole training set and then evaluated on the testing set.
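The tuning loop can be sketched generically as follows. This is not the study code: the grid values are hypothetical (the actual search space is given in Supplementary Table 3), and `toy_score` is a placeholder standing in for "fit an XGB model on the training folds and return the AUC on the validation fold". In a patient-level setting the folds would presumably be formed over patients rather than over individual knees.

```python
from itertools import product


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each fold serves once as validation."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, val_idx


def grid_search(n_samples, grid, fit_and_score, k=5):
    """Return the setting with the highest mean validation score across folds."""
    best_params, best_score = None, float("-inf")
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [
            fit_and_score(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Hypothetical grid values for the four hyperparameters named in the text.
grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
    "subsample": [0.8, 1.0],
}


def toy_score(params, train_idx, val_idx):
    """Placeholder for fitting a model on train_idx and scoring it on val_idx."""
    return 0.8 - 0.01 * abs(params["max_depth"] - 3) - abs(params["learning_rate"] - 0.05)


best_params, best_score = grid_search(20, grid, toy_score)
```

After selection, the chosen setting would be refitted once on the whole training set before evaluation on the held-out testing set.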

For the multimodal model, after the same CNN architecture was trained to extract image features, we concatenated the image features with clinical features and used them to train an XGB classifier via a joint training approach. Specifically, the CNN was jointly trained with MLP-extracted clinical features to extract complementary image features for further prediction. We used the Adam optimizer to update each CNN architecture. In the image-only and multimodal models, we used the same 5-fold cross validation as previously described within the training set for hyperparameter tuning of the CNN (learning rate, weight decay, batch size, and best epoch) and XGB (the same hyperparameters searched in the clinical-data only model), respectively. The model was trained for 10 epochs. We used the cross-entropy loss function, a learning rate of 5 × 10⁻⁵, a batch size of 32, a weight decay of 0, and balanced class sampling for training.

Our models were implemented using PyTorch and xgboost. We conducted all the training on a machine equipped with an NVIDIA A100 SXM4 Tensor Core graphics processing unit with 80 GB of available video random-access memory.

Model interpretation

We used SHAP30 to identify the key clinical features and gradient-weighted class activation mapping (Grad-CAM)31 to visualize where the model is focusing when making predictions on patient dissatisfaction.

Statistical analysis

To estimate the confidence interval (CI) for each performance metric, we used an n-out-of-n bootstrap with replacement, repeated 1000 times. We estimated the 95% CI using the 2.5 and 97.5 percentiles of the empirical distribution of the corresponding metric. We compared performance among the three models, (1) image-only, (2) clinical-data only and (3) multimodal, using DeLong’s method as well as the overlap of the 95% CIs of the mean AUC of each model. P values < 0.05 were considered statistically significant.
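The percentile bootstrap described above can be sketched as follows for the AUC. Two details are assumptions rather than reported choices: resamples containing only one class are redrawn (the AUC is undefined for them), and the percentiles are taken by nearest rank on the sorted bootstrap values. The simulated labels and scores are hypothetical, and DeLong's test is not shown.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, metric, n_boot=1000, seed=0):
    """95% percentile CI from n-out-of-n resampling with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            # Assumption: redraw resamples that contain a single class.
            continue
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical simulated data: positives receive slightly higher scores on average.
rng = random.Random(1)
y_true = [i % 2 for i in range(100)]
scores = [0.3 * y + rng.random() for y in y_true]

point_estimate = auc(y_true, scores)
ci_low, ci_high = bootstrap_ci(y_true, scores, auc)
```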

Data analysis and modeling were performed using Python 3.9.17 (Python Software Foundation, Wilmington, DE, USA), the Anaconda Distribution (Anaconda, Inc., Austin, TX, USA), and R software 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria, 2022).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.