Introduction

The advent of deep learning (DL)-based image segmentation has allowed for the fully automated, accurate, and rapid analysis of MSK structures from medical images1,2,3,4,5,6,7,8,9,10. These models assist in extracting the structure’s shape and estimating diagnostic image biomarkers, such as volume and muscle density, for assessing muscle atrophy and fatty degeneration11,12. These biomarkers can be used for the diagnosis of muscle pathologies, such as muscular dystrophy13, cachexia14, and sarcopenia15. The models were tested on the rotator cuff8, chest3, hip and thigh7, and abdominal muscles1,2,4,9,10,16,17. However, multiple issues exist in those studies that limit the reliable application in large-scale databases.

  • Several studies addressed the segmentation of the muscles in only a single or a few two-dimensional (2D) CT slices1,18,19, which do not reflect the three-dimensional (3D) properties of the muscles and depend on the subjective selection of the slices20.

  • The 3D muscle segmentation has also been attempted8,10,21; however, only a few muscles were addressed. A recent study22 addressed the 3D segmentation of 27 hip and thigh muscles in CT images; however, the model was tested on a small database consisting of 12 cases, and the average accuracy was lower than that reported in a previous study targeting similar muscles7.

  • Current muscle segmentation approaches assess the model’s accuracy only in cross-sectional area or volume estimation. However, muscle density, which can be quantified based on the mean Hounsfield units (HU) in the CT image23, has shown higher correlations with muscle strength and functions11,24. This necessitates the accuracy assessment of muscle density estimation, as well.

  • Even though some studies attempted the analysis of large-scale databases21,25, no rigid criteria were applied for segmentation failure detection. In other words, it is not clear how to determine whether the automatic predictions can be safely adopted, possibly corrected with moderate efforts, or better excluded for a reliable downstream analysis.

Our group has developed a segmentation tool, i.e., Bayesian UNet with Monte-Carlo dropout sampling (MCDS), that outputs the model’s uncertainty, a.k.a predictive uncertainty in addition to the target segmentations using multiple predictions from random dropouts of the model weights (i.e. dropout samples) at the inference time7. The model was validated on a database of 20 cases of hip osteoarthritis (hip OA) patients. It has shown high accuracy in segmenting 19 hip and thigh muscles as well as the possibility of predicting the segmentation accuracy in unannotated CT images based on the predictive uncertainty. In the future, we want to leverage this tool to segment large-scale databases of CT images collected from many health centers26, and analyze the impact of the demographic and disease factors in the Japanese population. These databases include large variations from the training data, such as manufacturer/scanner, imaging conditions, and disease variations, which may lead to segmentation failure due to the domain shift problem27,28,29,30. Furthermore, even though the number of dropout samples was found to affect the accuracy of MCDS-based approaches31, the impact of this parameter on the MSK segmentation was not investigated. Fitzpatrick et al. reported the automated volumetric and demographic analysis of the iliopsoas muscle segmented from magnetic resonance (MR) images of 5,000 subjects25. Their database was collected from the UK Biobank database32, which, in contrast to ours, has a unified imaging scanner and protocol that mitigates the domain shift problem. The predictive uncertainty was also addressed in previous studies to predict the segmentation accuracy in unannotated images7,15,33,34,35; however, the analysis was limited to small databases, and no quantitative criteria were applied for the detection of the segmentation failures for the down-stream analysis.

In this study, we report the preparations conducted to employ the model for muscle segmentation in the large-scale database. In particular, a larger fully annotated database consisting of 50 cases of hip OA patients acquired by two CT scanners has been prepared. In addition, the model’s capacity has been increased to account for the enlarged training database. The major contributions of this work are as follows:

  • Investigating the segmentation accuracy and volume/intensity prediction in 22 MSK structures from four databases of CT images acquired from multiple manufacturers/scanners with various disease conditions, and patient positioning, i.e., standing and supine positions.

  • Assessing the accuracy of the predictive uncertainty as a predictor of the segmentation accuracy under various imaging conditions and disease variations and suggesting quantitative criteria for detecting segmentation failures.

  • Showcasing the capability of the predictive uncertainty and suggested criteria in detecting segmentation failures at a large database of > 2,500 volumetric CT images of hip OA patients, and investigating the impact of the disease status on the predictive uncertainty.

Materials and methods

CT images and annotations

In this study, databases of CT images from multiple manufacturers/scanners, disease status, and patient positioning were used. Table 1 summarizes the characteristics of the databases (DBs) used in this study. DB#1 included pre-operational images from 50 unilateral hip OA patients (mean age: 61.4 ± 13.0 yrs, min: 30 yrs, max: 86 yrs; 44 females, 6 males) acquired by two scanners from different generations by the same manufacturer (HiSpeed ”old” (N=20) and Optima CT660 ”new” (N=30), GE Healthcare, Milwaukee, WI). The images were resampled so the slice interval became 1.0 mm throughout the entire volume. The disease severity was assessed using Crowe36 and Kellgren and Lawrence (KL)37 grading, in which higher grades indicate higher disease severity. The affected sides were those with KL,Crowe>1. This database was used for the internal validation of the model accuracy, for investigating the impact of the training and inference settings, and for disease status.

Table 1 CT image characteristics.

The three databases DB#2-4 were for subjects without hip OA from institutions different from that of DB#1, and were used for external validation of the model accuracy. DB#2 was collected from a public database38, including 18 cases (age anonymized; 13 females, 5 males) with soft tissue sarcoma acquired by a scanner from the same manufacturer as DB#1 but a different model (Discovery ST, GE Healthcare, Milwaukee, WI). DB#3 included images for 10 subjects (mean age: 50.1 ±7.6 yrs, min: 41 yrs, max: 64 yrs; 10 males) who were scanned for the diagnosis of colorectal cancer using a scanner from a different manufacturer (Supria, Hitachi Medical, Tokyo, Japan). DB#4 included images of 20 healthy volunteers (mean age: 65.1 ± 6.3 yrs, min: 55 yrs, max: 76). The images were acquired for the volunteers in the supine and standing positions. The supine images were acquired with a 320-row detector CT scanner (Aquilion ONE, Canon Medical Systems Corporation, Otawara, Japan), while the standing images were acquired with an upright 320-row detector CT (prototype TSX-401R, Canon Medical Systems Corporation, Otawara, Japan)39.

DB#5 included pre-operational images for uni/bilateral hip OA patients collected from the same institution as DB#1. This database was used to showcase the usability of the predictive uncertainty in failure detection and the impact of disease status in a large-scale setting. The original database contained 9,260 CT images, acquired with hip-to-knee and whole lower limb imaging protocols for total hip arthroplasty (THA) surgery. Digitally reconstructed radiographs (DRRs) were constructed from each CT volume. The DRR was used to visually confirm the presence of metal implants from hip-to-knee, and truncate the original volume below the knee level if the image covered the whole limb. The process ended up with 2,579 CT volumes (mean age: 61.8 ± 15.2 yrs, min: 13 yrs, max: 98 yrs; 2,062 females, 497 males). The affected and unaffected sides in each CT image were assigned based on an automatic grading model40. Particularly, a cubic region of interest (ROI) centered at the hip center on each side was extracted using a CNN-based landmark detection tool. A DRR was constructed for each side, and was input to a 7-class disease severity classification model with a vision transformer (ViT) architecture41. A pre-trained model with a classification accuracy of .962 was used to predict the disease severity, and the sides with KL,Crowe>1 were considered affected, and unaffected otherwise.

Figure 1 shows an example of the target structures including 19 muscles and three bones, whereas Table 2 lists the structures names, abbreviations used in text, and the visualization colors. A collaborative group consisting of a health science researcher with a medical physics background, computer science researchers, and orthopedic surgeons specializing in musculoskeletal imaging created and validated the ground-truth (GT) labels of all the target structures in DB#1 and the GMED muscle in DBs#2-4. The annotations of the 50 cases in DB#1 passed through multiple annotation and validation cycles. The annotations were first created using a pre-trained model7, and the automated segmentations were corrected using 3D Slicer42.

Fig. 1
figure 1

Segmentation labels of the bones and muscles

Table 2 Target structures.

Overall scheme

Figure 2 shows the overall scheme of validating the segmentation model for the automated assessment of bones and muscles in CT images. The CT image was input to the model, where each axial slice was processed to segment the bones and muscles. Each bone and muscle were extracted from the concatenated volume of all slices for qualitative, i.e., muscle density visualization, and quantitative, i.e., volume and mean HU assessments. Besides the bone/muscle labels, the structure-wise predictive uncertainty was computed based on MCDS.7,31 The databases and proposed scheme were used to tackle the following research questions:

  • DB#1: What is the model’s performance (i.e., segmentation and muscle/bone assessment accuracy) under variations the training settings (i.e., number of UNet encoder/decoder layers and number of training cases), and inference settings (number of dropout samples and estimation method of predictive uncertainty)?

  • DB#2–4: How would the model performance change if applied to external databases of CT images from multiple manufacturers/scanners, with various disease conditions and patient positioning (standing and supine positions)?

  • DB#5: Can the predictive uncertainty be used for segmentation failure detection in large-scale unannotated databases?

  • DB#1,5: What is the impact of the disease status on the model performance and predictive uncertainty?

Fig. 2
figure 2

Overall scheme for validation of musculoskeletal segmentation model for automated assessment of bones and muscles in CT images with uncertainty estimation

Figure 3 summarizes the research questions in this study with the corresponding databases.

Fig. 3
figure 3

Summary of the research questions tackled in the study with the corresponding databases and methodologies used in the experiments. Sect: section numbers in the paper, ROI: region-of-interest, Hip OA: hip osteoarthritis, GT: ground-truth (annotation), DB: database, N: number of cases

Image segmentation

In this study, a cascaded 2D Bayesian U-Net model, which outputs the predicted structure labels with pixel-wise predictive uncertainty maps, was used7. The baseline model architecture consisted of an encoder and decoder composed of multiple down/upsampling layers (hereinafter called layers for simplicity). Each encoder layer consisted of two basic blocks each consisting of padding-convolution-activation blocks, followed by a dropout block. The decoder layer consisted of an upconvolution block, whose output was concatenated with the corresponding encoder layer’s output, followed by a basic block. The model’s output feature map was input to a softmax layer to obtain the voxel-wise class probabilities. Two modifications were made to the baseline architecture: 1) increasing the depth of the model (i.e., using six encoder/five decoder layers instead of five encoder/four decoder layers), and 2) adding a batch normalization layer43 to the basic convolutional blocks, which stabilizes the training of large neural network models and improves the overall performance33. Similar to7, at the inference time, the mean and the variance of 10 MC dropout samples were used to obtain the output label and voxel-wise uncertainty map, respectively. The structure-wise predictive uncertainty was computed as the average of the voxel-wise uncertainty map within the segmented label.

To investigate the impact of the modified model and larger annotated database, including 50 cases, the performance was compared with the baseline model consisting of five encoder/four decoder layers and validated on 20 or 50 cases. For simplicity, the aforementioned models were termed (5layers,20), (5layers,50), and (6layers,50). The parameters were 10 M and 44 M for the 5layers and 6layers models, respectively.

Muscle/bone assessment

The labels predicted by the segmentation model were used to assess each structure’s volume and muscle density. The volume was computed as a multiplication of the number of voxels by the size of each voxel in centimeter cubes (cc) normalized by the subject’s height. The muscle density was computed as the mean of the intensity or CT values in HU within the segmented label. Structures on the right and left sides of the patient’s body were assessed separately based on a postprocessing using connected component analysis (CCA). An additional watershed algorithm and CCA were used to separate the right and left sides when connected (e.g., the connection of right and left hemi-pelvises at the pubic symphysis).

A transfer function was used to comprehensively convert the HU values into scalar muscle density to visualize lean muscle and intramuscular fat. HU values less than -30 HU were considered fat, values within the range [30,30] were considered muscle/fat composite, and values larger than 30 HU were considered lean muscle23. Color and opacity transfer functions were used to visualize the transformed image (see Fig. 2, right).

Evaluation metrics

The segmentation accuracy, and accuracy in estimating the structures volume and density, i.e., mean HU, were evaluated. The segmentation accuracy was evaluated using the Dice coefficient (DC) and average symmetric surface distance (ASD). DC assesses the overlap between the GT and predicted labels. ASD assesses the surface distance, i.e., surface error, to assess the presence of small yet distant false positive structures. The predicted volume and mean HU accuracy were evaluated using the absolute difference between the quantities measured at the GT and predicted labels. The volume error (average volume error [AVE]) was computed as a percentage relative to the GT volume. The intensity error (average intensity error [AIE]) was reported as the average of absolute differences between the mean HUs of the GT and predicted labels.

The accuracy of the predictive uncertainty for detecting inaccurate (correctable with moderate human effort) or failed (correctable with notable efforts) segmentations was investigated. For each structure, a threshold based on the standard deviation of DC was determined to consider the segmentation inaccurate or failed. To make the threshold setting more statistically robust against outliers, the median absolute deviation (MAD) was used. Particularly, a threshold of MedianDC+1.4826k*MADDC was used, where k was set as -2 or -3 for inaccurate or failed segmentations, respectively. The area under the receiver operating characteristic (AUROC) curves of the predictive uncertainty based on the DC threshold was used to assess the detection accuracy. The AUROCs of the predictive uncertainty from both 5layers,20 and 6layers,50 were computed. Linear regression lines were computed between DC (dependent) and the predictive uncertainty (independent) for each structure and the averages of all structures combined.

Statistical analysis

The concordance correlation coefficient (CCC)44 was used to assess the agreement between the GT and predicted volume and mean HU. The Pearson correlation coefficient (ρ) assessed the linear relationship between the predictive uncertainty and DC. To investigate the statistical significance of the differences between paired measurements, the Shapiro test was first used to assess the normality of the different distributions. Student’s t-test was used when normality was found. Otherwise, the Wilcoxon signed-rank test was used. A probability of p = 0.05 was considered significant in all tests. Bonferroni correction was used when multiple comparisons between the models or databases were made.

Implementation details

The proposed approach was developed and validated in Python and Keras45,46. The segmentation models (5layers,20), (5layers,50), and (6layers,50) were trained and validated on DB#1 based on 5-fold cross-validation. For a matched comparison among the models, the remaining 30 out of the 50 cases used in training the (5layers,20) model were used in the inference phase. Models with 5 and 6 layers were retrained on all the images in DB#1 and were used to predict the labels in DBs#2-5. The quantitative validation of DBs#2-4 was limited to the GMED muscle, whereas the average of the predictive uncertainty in all structures was used in DB#5. The predictive uncertainty thresholds used for failure detection in DB#5 were derived based on the linear regression lines computed in DB#1.

The segmentation model training and inference were performed on a Linux-based cluster of servers with graphical processing units (GPUs; Nvidia Corporation, Santa Clara, CA, USA). Similar to the previous study7, the (5layers,20) model was trained for 150k iterations with a batch size of 3, whereas the models 5layers,50 and 6layers,50 were trained for 200k iterations due to the increased training data and model capacity. The inference time per volume (approximately 500 CT slices) was approximately 3 minutes.

Results

Segmentation accuracy and predictive uncertainty

The improved model 6layers,50 has shown overall improvements with respect to all evaluation metrics. Figure 4 shows the segmentation accuracy, predictive uncertainty, and volume/mean HU accuracy of the three models. Each point represents the average metric value of all structures in a single subject. The accuracy of the 6layers,50 model was significantly higher than that of the 5layers,20 in terms of all metrics. The average DC of the 6layers,50 model was .945±.015, with an average increase of 1.2% at all structures compared with 5layers,20 (p<.017). An average improvement of approximately 0.4 mm was observed in ASD (p<.017). The improvement by 6layers,50 model was statistically significant in most MSK structures, as shown in Supplementary Figs. A.1 (DC), A.2 (ASD).

Fig. 4
figure 4

Distributions of the segmentation accuracy (a), predictive uncertainty (b), and volume/mean HU accuracy (c) of the bones and muscles (averaged on all structures) by each model applied to DB#1 (N = 50). Horizontal lines in the boxes represent the medians, while blue boxes represent the means. Detailed values are depicted in Supplementary Figs. A.1-A.5. DC: Dice coefficient, ASD: average symmetric surface distance, AVE: average volume error, AIE: average intensity error, n.s.: not significant, *: p < 0.017, Student’s t-test or Wilcoxon signed rank sum test with Bonferroni correction

The box plots in Fig. 4(b) show that the uncertainty proportionally decreased with the improved segmentation accuracy in Fig. 4(a) regarding the number of training cases and model depth. Scatter plots of DC versus the predictive uncertainty for each model are depicted in Supplementary Fig. A.6(a). Linear relationships with strong correlations were obtained between the segmentation accuracy and the predictive uncertainty by all the models. A strong correlation of ρ=-.79 was obtained in the 6layers,50 model. This emphasizes the usability of the predictive uncertainty as a predictor of the segmentation accuracy, which supports the findings by the previous studies7,33.

Figure 5 shows the relationship between the average DC and average predictive uncertainty of all structures in each patient in DB#1 and the corresponding ROC curve for failure detection. The predictive uncertainty of both models (5layers,20 and 6layers,50) yielded high AUROCs (≥.95) in detecting inaccurate and failed segmentations. Table 3 shows the AUROCs of each structure. The median AUROCs of all structures by 6layers,50 for detecting inaccurate and failed segmentations were .979 and .959, respectively. The OE and OI muscles had the lowest accuracy. Supplementary Fig. A.9 shows the detailed results of each structure. Supplementary Figure A.11 shows scatter plots of the predictive uncertainty by the two models in DB#5. Based on the thresholds computed in DB#1 (Fig. 5), three representative cases were visualized. The improved segmentation by the 6layers,50 model can be observed in the three cases and the scatter plot. The representative case of the failed segmentation exhibits unusual positioning of the hip, possibly due to the patient’s discomfort as a result of the disease.

Fig. 5
figure 5

Receiver operating characteristic (ROC) curves of the inaccurate and failed segmentation detection in DB#1 (N = 50) using the predictive uncertainty. Thresholds were determined based on the median absolute deviations (σ) of the DC

Table 3 Accuracy (area under receiver operator curve (AUROC)) of the segmentation failure detection based on the predictive uncertainty and segmentation accuracy (Dice coefficient; DC) of all structures DB#1 (N = 50). σ indicates the threshold computed based on the median absolute deviation of DC and used for the detection of inaccurate (-2σ) and failed (-3σ) segmentations. ”All structures average” indicates using the average of predictive uncertainties of all structures for failure detection. The structures’ abbreviations are listed in Table 2.

Relationship between segmentation accuracy/predictive uncertainty and disease stage

Figure 6 shows the distributions of the evaluation metrics and predictive uncertainty at the internal DB#1 (a) and predictive uncertainty in the large-scale database DB#5 (b) in terms of the hip OA disease status (unaffected vs. affected) in each body side. The model 6layers,50 showed statistically significant improvement (p<.01) in all the structure groups in DB#1. The proportional relationship between the accuracy and predictive uncertainty can also be observed in all structure groups, where smaller uncertainty was accompanied by increasing accuracy. The unaffected sides significantly showed higher accuracy in the bones than the affected ones in DB#1. All groups had a similar tendency in DB#5, where the affected sides had higher predictive uncertainty. In addition to the sensitivity to the variations in the positioning, as shown in Fig. A.11, this shows a possible impact of the disease status on the performance of the segmentation model in large-scale databases.

Fig. 6
figure 6

Distributions of the accuracy evaluation metrics and predictive uncertainty of the three MSK structure groups, i.e., thigh (left) and hip (middle) muscles and bones (right), in terms of the disease status of body sides in hip OA patients in internal validation DB#1 (a) and large-scale predictive uncertainty analysis in DB#5) (b). N: number of cases. n.s.: not significant, *: p < 0.004. (Based on Shapiro’s normality test, the hypothesis test was performed using either the Wilcoxon signed-rank test or the Student’s t-test. Bonferroni correction was used for the multiple comparisons.)

Validation on a multi-manufacturer/scanner database

Figure 7 shows the evaluation metrics and predictive uncertainty of GMED muscle segmented at the databases DB#1-4 from multiple manufacturers/scanners and disease variations (see Table. 1). Representative cases (5th (blue filled triangle) and 95th (red filled upside down triangle) quantiles of the predictive uncertainty visualized in Supplementary Figs.A.7 and A.8) are depicted. Statistically significant improvements in the DC, ASD, and AIE were observed in the four databases using the model 6layers,50. The predictive uncertainty was obviously related to the accuracy metrics, where low uncertainty cases mostly had high accuracy metric values and vice versa.

Fig. 7
figure 7

Comparison between segmentation model accuracy (a, c) and predictive uncertainty (b) of the GMED muscle in the multi-manufacturer/scanner databases DB#1(N = 50), DB2(N = 18), DB#3(N = 10) and DB#4(N = 20). DC: Dice coefficient, ASD: Average symmetric surface distance, AVE: Average volume error, AIE: Average intensity error, su: supine, st: standing, n.s.: not significant, *: p < 0.01. (Based on Shapiro’s normality test, the hypothesis tests were performed using either the Wilcoxon signed-rank test or the Student’s t-test with Bonferroni correction). The triangles indicate the cases corresponding to the 5th (blue filled triangle) and 95th (red filled upside down triangle) quantiles of the predictive uncertainty visualized in A.7 and A.8.

Table 4 summarizes the means and SDs of the evaluation metrics in the four databases. Using the 6layers,50 model, the segmentation accuracy at a predictive uncertainty of 5 × 10−4 was > 0.90 (DC) and approximately less than 2.00 mm (ASD) in all the databases. Notably, a sub-HU accuracy was obtained in predicting the mean HU in the four databases. Overall improvements were observed in all the muscles by the 6layers,50 model (see Supplementary Fig. A.7). At the GMED muscle, the 5layers,20 model failed to capture the boundaries with the GMAX muscle, which reduced its segmentation accuracy, whereas the errors were less in the 6layers,50 model results. The PSOAS muscle’s lower part was undersegmented in multiple instances by the 5layers,20 model. Noteworthy, only a slight degradation in accuracy was observed at standing compared with the supine positioning in DB#4.

Table 4 Comparison between evaluation metrics of the GMED muscle segmentation in four databases. DC: Dice coefficient, ASD: average symmetric surface distance, AIE: average intensity error, AVE: average volume error, p: p-value of the difference between 5 and 6layers models (Student’s t-test if normal distribution, Wilcoxon signed rank test otherwise, with Bonferroni correction), n.s.: not significant.

Table 5 summarizes the predictive uncertainty and its correlations with the segmentation accuracy (DC) of the GMED muscle in the four databases. With the four databases combined, both models yielded strong correlations, where the average PCCs for the 5layers,20 and 6layers,50 models were -.85 and -.60, respectively. Table 6 shows the AUROCs of failure detection at the four databases. The median AUROCs by the 6layers,50 were .963 and .995 for inaccurate (-2σ) and failed (-3σ) segmentation detection, respectively.

Table 5 Predictive uncertainty (mean ± standard deviation ”std”) and correlation (Pearson correlation coefficient, ρ) with Dice coefficient of the GMED muscle in four databases.
Table 6 Accuracy (area under receiver operator curve (AUROC)) of the segmentation failure detection based on the predictive uncertainty and segmentation accuracy (Dice coefficient; DC) of the GMED muscle in the four databases. σ indicates the threshold computed based on the median absolute deviation of DC and used for the detection of inaccurate (-2σ) and failed (-3σ) segmentations.

Muscle/bone assessment

Table 7 compares the volume and mean HU prediction applied to the GT and auto segmentations in DB#1 obtained from the 6layers,50 model. The measurements of the unaffected and affected hip OA sides were reported separately. Most structures exhibited substantial agreement between the GT and auto measurements on both sides (ρ.95). The PI muscle showed weak agreement in the volume and HU measurements, whereas the TFL muscle showed weak agreement only in mean HU. In both the volume and mean HU predictions, the unaffected side has shown a slightly larger MAE than the affected side. MAE of the predicted volumes at the bones and muscles for the affected and unaffected sides was 1.77±1.06 cc/m2 and 1.89±1.17 cc/m2, respectively. The MAE of the mean HU for the affected and unaffected sides was 1.46±0.95 HU and 1.38±0.89 HU, respectively. Notably, a sub-HU MAE was obtained at the GMAX, GMED, AD, BF, RF, SM, VLVI, VM muscles on the affected and unaffected sides.

Table 7 Comparison between affected and unaffected sides of the muscles and bones in DB#1 in terms of normalized volume and mean HU using ground truth (GT) and auto (Auto) segmented labels.

Figure 8 shows representative case (median DC) segmentations with muscle histograms and 3D volume rendering of muscle density of the GMAX and GMED muscles. High reproducibility of the GT-based histograms and muscle density visualizations could be observed. In particular, the auto segmentations could comprehensively reproduce lean muscle (red) and fat (yellow) portions.

Fig. 8
figure 8

Ground-truth (GT) and predicted (Auto) segmentations of the unaffected (Un.) and affected (Aff.) sides of a representative hip OA case (median DC in Fig. 4) with diagnostic biomarkers, histograms, and muscle density visualizations of the gluteus maximus (GMAX) and gluteus medius (GMED) muscles.

Impact of the training and inference settings

Number of training images

Table 8 shows the impact of the number of training images on the segmentation accuracy (DC, ASD) of the 6layers model applied to DB#1. The highest accuracy was obtained when 40 cases were used. However, no statistically significant differences were observed in comparison to 30 cases. Compared with cases fewer than 30 cases, statistically significant differences were observed. In addition, strong correlations between DC and the predictive uncertainty were obtained in all numbers of training cases (see Supplementary Fig.A.6,b). This emphasizes the generalizability of the predictive uncertainty as a predictor of the segmentation accuracy regardless of the number of training images.

Table 8 Impact of the number of training cases on the segmentation accuracy of the 6-layers model. DC: Dice coefficient, ASD: Average symmetric surface distance, n.s.: not significant, *: p < 0.05, **: p < 0.01.

Number of dropout samples.

Table 9 shows the impact of the number of training samples on the segmentation accuracy by the 6layers model applied to DB#1. No improvement was observed by increasing the samples to larger than 10 samples. This indicates that 10 samples are sufficient to obtain a stable performance by the model.

Table 9 Impact of the number of dropout samples on the segmentation accuracy.

Estimation method of the predictive uncertainty.

Table 10 shows the correlation between the uncertainty, estimated using entropy and MCDS-based variance, and the segmentation accuracy (DC). Both uncertainties were estimated using the 6layers model applied to DB#1. The entropy was computed from the probability maps of a single sample (deterministic setting) following the previous studies15,33 whereas the MCDS was computed from the variance of 10 dropout samples. The entropy produces stronger correlations with the segmentation accuracy (mean: -.71 vs. -.69); however, the usage of a single sample led to a decreased segmentation accuracy (DC: .941±.014 vs. .947±.013; as shown in Table 9).

Table 10 Comparison between predictive uncertainty and segmentation accuracy (Dice coefficient) using Monte Carlo Dropout Sampling (MCDS, 10 dropout samples) and entropy. ρ: Pearson correlation coefficient.

Discussion

This study validated a DL model for the segmentation of MSK structures with uncertainty estimation in clinical CT images. The novelty of this work is that it showed the usability of the predictive uncertainty for predicting the MSK segmentation accuracy and detecting segmentation failures in databases of CT images from multi-manufacturers/scanners and with disease and positioning variations, such as supine and standing, and with different scales, including a large-database with 2,579 CTs. This showed the possibility of using the predictive uncertainty as a tool for detecting the failed segmentation in unannotated CT images. The study also exhibited the potential of the 6layers,50 model in producing accurate segmentations for assessing the muscle/bone volume and mean intensity, with DC > 0.90 in almost all the muscles and > 0.95 in the bones (see Supplementary Fig. A.1). The validation on the external databases has shown high generalizability of the model’s performance, where a DC > 0.95 and an AIE < 1 HU were obtained in evaluating the GMED segmentations, and the predictive uncertainty could detect the cases with segmentation failures.

Systematic improvements were observed using the 6layers,50 model at all the structure groups, regardless of the disease status. However, the PI muscle showed the smallest DC of 0.845 ± 0.091, with the largest ASD, AIE, and AVE (see Supplementary Figs. A.1, A.2, A.4, A.5). The degraded accuracy could be interpreted by the location of this muscle among various bony, abdominal, and vascular structures, making it challenging for automated segmentation. 3D segmentation models47 might improve the segmentation accuracy of this muscle as they better involve the volumetric relationships with the surrounding structures.

The predictive uncertainty was investigated in several studies to predict the segmentation accuracy in medical images.7,15,33,34,35. Nowak et al. investigated the predictive uncertainty (entropy) in segmenting skeletal muscles in lumbar-level CT slices from dual centers with CT scanners from multiple manufacturers. Their study showed the applicability of the predictive uncertainty on the data from both centers; however, it was only applied to 2D CT slices, with the muscles combined into a single label. Mehtrash et al. investigated the predictive uncertainty (entropy) based on ensemble models. The method was validated on multiple structures at MRIs, and strong correlations between the predictive uncertainty and segmentation accuracy were reported. However, both studies did not address the segmentation of individual muscles or bones and did not investigate the impact of practically important factors, such as disease condition or numbers of training data on the segmentation accuracy15,33. In our experiments, we attempted to use the entropy of single samples and observed slightly improved correlations with the segmentation accuracy. However, the segmentation accuracy has decreased. Indeed, larger numbers of 10 samples seem to improve the overall accuracy (See Table 9). Compared with the ensemble approach33, the MCDS approach showed a good balance between the segmentation accuracy, computation time, and accuracy of the predictive uncertainty.

Compared with the baseline model7, this study showed a potential improvement when increasing the depth of the segmentation model and the number of training data. Increasing the training data to larger than 20 cases improved overall, as shown in Fig. 4. Other studies have also investigated the segmentation of thigh muscles from CT images48,49,50. However, the number of cases was smaller, making the comparison invalid. Recently, Kim et al. attempted a 3D UNETR47 for the segmentation of the full thigh muscles22. The model was trained on a larger dataset (60 cases) and tested on 12 cases; however, the dataset included only patients with hip fractures, and it showed lower accuracy (DC=0.84; ASSD=1.419±0.91 mm). These comparisons collectively emphasize the higher accuracy of the improved model and the uniqueness of our fully annotated database (DB#1) and validation of external databases (DB#2-4) regarding the number of cases and the diversity of disease, patient positioning, and imaging conditions.

The assessment of the volume and intensity of the muscles and bones are among the ultimate goals of automated MSK image segmentation. In particular, the mean HU measured at abdominal muscles has shown a higher potential to predict age-related adverse outcomes compared with the muscle area9. To our knowledge, this is the first study to investigate the accuracy of these measurements in automatically segmented hip and thigh MSK structures in CT images. High accuracy of the volumes and mean HU of most muscles and bones in HOA patients was obtained. Furthermore, the validation experiment on the four databases showed the robustness of the improved model in the segmentation of the GMED muscle with respect to the multi-manufacturer/scanners and disease variations. These findings indicate the potential usability of the segmentation model for hip-to-knee MSK assessments in clinical routines. The rapid inference time (3 min) of the entire CT volume adds to the model’s practicality for adoption in surgical planning or musculoskeletal simulation platforms. Furthermore, the muscle-wise density visualization depicted in Fig. 7 would help in the rapid and comprehensive assessment of muscle quality under several conditions, such as hip OA, cancer, sarcopenia, and obesity23.

On the other hand, MSK segmentation approaches in magnetic resonance images (MRIs) are attracting attention due to patient safety and high soft tissue contrast51,52, and the possibility of quantifying the muscle/fat composition using special sequences, such as Dixon53. However, MRIs usually require a long scanning time, represent various characteristics based on the acquisition sequence, and cover limited fields of view (FOVs). This necessitates integrating multiple acquisitions and registration processing to assess the whole knee-to-hip52, which could be limited to a few research-purposed databases25.

This study has the following limitations. The 2D segmentation model, even though it has a rapid inference time, does not capture the 3D information of neighboring structures, which affects the segmentation of small structures, such as the PI muscle. State-of-the-art 3D models, such as nnUNet54 or swin UNETR55, may potentially improve the segmentation of the small muscles; however, those models are known for their higher computational cost and longer inference time than their 2D counterparts56. This study did not investigate those models because it aimed to explore the potential improvements in the baseline 2D model and its predictive uncertainty, leveraging its fast inference time for large-scale analysis. Nevertheless, future studies should investigate the potential of 3D models to improve the segmentation accuracy of small structures. Furthermore, the small hip muscles (OI and OE muscles) showed low AUROC in failure detection based on the predictive uncertainty. The usage of the auto segmentations of those muscles requires attention in unannotated databases.

The failure detection approach and improved model (6layers,50) create a basis for several future directions in our research. The model’s potential in analyzing the disease progression of individual bones and muscles in large-scale databases of unannotated CTs will be investigated. Cases with segmentation failures could be detected based on the predictive uncertainty and excluded or refined by human annotators for downstream MSK analyses. Furthermore, the extension of the segmentation model to predict the MSK structures in other regions, such as the abdomen and back muscles, is currently under development. Furthermore, a few muscles in the hip, such as the quadratus femoris and Gemelli muscles, were not addressed, besides combining several muscles, such as the adductors, into a single label due to the challenging boundary definition. These structures will be addressed in our future work by involving higher-resolution images, such as from photon-counting CTs. This study focused on hip OA as a target MSK pathology. As the model has shown possible dependency on the disease severity in the large-scale analysis, we plan to further investigate its performance on other muscle pathologies, such as muscular dystrophy13, cachexia14, and sarcopenia15.

Conclusions

This study validated a DL model for MSK segmentation with uncertainty estimation in clinical CT images. The improved model (6layers,50) allowed for the automated, rapid, and accurate assessment of the volume and density of the hip and thigh bones and muscles from clinical CT images. The study has shown an impact of the disease severity on the model’s performance, and the usability of the predictive uncertainty as a tool for predicting the segmentation accuracy and failure detection in individual MSK structures at unannotated CT image databases. The high segmentation and muscle volume/density estimation accuracy, along with the high accuracy in failure detection, exhibited the model’s reliability for the analysis of individual MSK structures in large-scale CT databases.