Introduction

Sarcopenia—the age-associated decline in skeletal muscle quantity and function—impairs balance and gait in older adults and is strongly linked to falls and hip fractures, with downstream risks of institutionalization and mortality. Consensus statements from European and Asian working groups provide operational criteria, yet comparative evaluations show that definitions differ in predicting falls1,2,3. Sarcopenia frequently coexists with osteoporosis (osteosarcopenia) and is enriched among older adults with a history of falling4,5. With ageing populations, standardized and scalable measures of muscle status are needed to support epidemiologic phenotyping and cross-cohort harmonization.

Multiple expert groups have proposed diagnostic frameworks, most prominently the European Working Group on Sarcopenia in Older People (EWGSOP2) and the Asian Working Group for Sarcopenia (AWGS)1,2. Despite broad agreement that muscle strength is central and that muscle mass (typically by DEXA or bioimpedance) provides complementary information, differences remain in operational cut-offs, measurement modalities, and staging schemes3. In practice, this heterogeneity impairs comparability across studies and limits the transportability of algorithms, underlining the need for mass metrics that are both physiologically meaningful and technically reproducible at scale.

The Global Leadership Initiative in Sarcopenia (GLIS) recently advanced the field by articulating a conceptual definition that disentangles components (muscle mass, muscle strength, and muscle-specific strength) from downstream outcomes such as physical performance6. In the literature exploring GLIS-inspired operationalizations, muscle-specific strength has often been computed as the ratio of strength to mass, with the latter approximated by abdominal CT at the third lumbar vertebra (L3)7,8,9. This “CT-SMI” approach, which uses a single axial slice to estimate cross-sectional muscle area, has gained traction because it correlates with whole-body lean mass and is opportunistically available in many clinical settings10.

However, reliance on a single L3 slice has inherent limitations. First, it samples trunk musculature rather than the muscle groups that drive locomotion; selective atrophy of the thighs—highly relevant to gait and balance—may therefore be under-represented11,12. Second, a one-slice measurement is susceptible to variability from slice positioning and body habitus, and to bias in the presence of regional disproportionality9. Third, cross-sectional area alone cannot capture the three-dimensional distribution of muscle bulk across the length of the limb8. These constraints motivate volumetric strategies that better reflect true muscle burden in the lower extremities.

Manual delineation of individual muscles across many slices is prohibitively time-consuming. However, recent advances in medical image analysis enable accurate, automated, and scalable segmentation of musculoskeletal structures. Deep-learning pipelines can recover consistent three-dimensional masks from neck-to-knee acquisitions and yield volumetric readouts within minutes, supporting population-scale quantification with quality control procedures to ensure reliability13,14. These automated approaches are particularly valuable for studying muscle changes associated with mobility and aging.

Mobility limitations, falls, and the clinical construct of locomotive syndrome are largely mediated by lower-limb dysfunction5. To understand these conditions better, prior epidemiologic and biomechanical studies have linked measures of thigh muscle size and architecture with walking speed, balance impairment, and hip-fracture risk15,16. In this context, the thigh muscle region merits greater emphasis, and lower-limb volumetry represents a biologically relevant mass metric to be evaluated alongside established approaches.

To address this need for accurate thigh muscle quantification, our group has previously developed fully automated segmentation models for 27 pelvic–thigh musculoskeletal structures on Whole Thigh CT (from L3 to knee joint), achieving a mean Dice similarity coefficient of 0.89 at the individual-muscle level across all 27 labels17. Building on this per-muscle performance, we now adapt these methods to Dixon MRI. With cross-modality pipelines now mature, this approach enables consistent thigh volumetry at scale.

To validate and apply this approach in a large-scale epidemiological context, we leverage the UK Biobank (UKB) dataset. The UK Biobank is a prospective cohort of ~500,000 adults with extensive baseline phenotyping and linkage to health records18. For traits relevant to muscle and function, UKB includes dynamometer-based grip strength, anthropometry (height, weight, waist/hip), bioimpedance analysis (whole-body composition) at baseline, and blood and urine biomarkers. Its imaging study includes standardized neck-to-knee Dixon MRI in a large imaging subset (tens of thousands to date) and whole-body DEXA in another subset19,20.

Using these rich multimodal data, we aim to develop a functionally oriented muscle-mass biomarker—muscle-specific muscle volume (MSMV)—derived from neck-to-knee Dixon MRI via automated segmentation of 27 anatomically defined thigh muscles, addressing aspects not captured by conventional muscle mass or muscle-specific strength indices.

Results

The study population was drawn from the UK Biobank, a large-scale prospective cohort. As shown in Table 1, we analyzed cross-sectional data from 37,004 participants selected from an initial pool of 64,524 individuals based on age (≥50 years) and the availability of complete datasets for grip strength, Dixon MRI, and DEXA scans.

Table 1 Demographic information of inclusion criteria in UK Biobank

The cohort comprised 17,920 men and 19,084 women, with a mean age of 64.5 ± 7.9 years. As shown in Table 1, significant sex differences were observed in physical and muscular characteristics. Men were, on average, taller (176.0 vs. 162.7 cm) and heavier (83.7 vs. 69.0 kg), and exhibited greater maximum grip strength (39.3 vs. 24.1 kg) and a higher ALMI (8.18 vs. 6.53 kg/m²) compared to women. Based on the EWGSOP2 criteria, 1.13% of the total sample had confirmed sarcopenia, with a slightly higher prevalence in men (1.20%) than in women (1.06%).

Figures 1 and 2 present the performance of our automated MRI muscle segmentation model, based on the UNETR architecture, which achieved an overall mean DSC of 0.896 across 27 distinct pelvic–thigh musculoskeletal structure classes. Large muscle groups demonstrated the highest segmentation accuracy: gluteus maximus (DSC = 0.956), multifidus (0.937), abdominal oblique (0.936), and vastus lateralis (0.935). The quadriceps complex (vastus lateralis, medialis, and intermedius) maintained DSC above 0.899, while the hamstring group (biceps femoris, semimembranosus, and semitendinosus) achieved scores exceeding 0.910. Smaller, deeper muscles showed lower performance, with obturator externus representing the most challenging structure (DSC = 0.797). The mean IoU across all classes was 0.814, with precision and recall metrics both averaging above 0.90.
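For readers less familiar with the overlap metrics reported above, the following is a minimal sketch of how a per-class Dice similarity coefficient and IoU can be computed from a predicted and a reference label map. It is not the evaluation code used in this study; the function name `dice_and_iou` and the flattened-list representation are illustrative assumptions, and a production pipeline would normally operate on array volumes.

```python
from __future__ import annotations

from typing import Sequence


def dice_and_iou(pred: Sequence[int], truth: Sequence[int], label: int) -> tuple[float, float]:
    """Return (DSC, IoU) for one label over two flattened label maps.

    `pred` and `truth` hold one integer class label per voxel (hypothetical
    layout; any consistent flattening of the 3D volume works).
    """
    if len(pred) != len(truth):
        raise ValueError("label maps must contain the same number of voxels")

    # Voxel-wise confusion counts for the requested label.
    tp = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    fp = sum(1 for p, t in zip(pred, truth) if p == label and t != label)
    fn = sum(1 for p, t in zip(pred, truth) if p != label and t == label)

    union = tp + fp + fn
    if union == 0:
        # Label absent from both maps: treated as perfect agreement here,
        # which is a convention, not something stated in the paper.
        return 1.0, 1.0

    dsc = 2 * tp / (2 * tp + fp + fn)
    iou = tp / union
    return dsc, iou


if __name__ == "__main__":
    # Toy six-voxel example with labels 0 (background), 1 and 2.
    predicted = [1, 1, 1, 0, 2, 2]
    reference = [1, 1, 0, 0, 2, 1]
    for lab in (1, 2):
        print(lab, dice_and_iou(predicted, reference, lab))
```

For a single class the two metrics are tied by IoU = DSC / (2 − DSC). Because that relation is nonlinear, the class-averaged values (0.896 and 0.814) are not expected to satisfy it exactly.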

Fig. 1: Representative examples of automated pelvic–thigh musculoskeletal structures segmentation on Dixon MRI.

Three-dimensional visualization of automated muscle segmentation results from the UNETR model. The model achieved a mean DSC of 0.896 across all muscle classes, with particularly high accuracy for larger muscle groups (Dice >0.93).

Fig. 2: Segmentation performance metrics across individual thigh muscles.

Quantitative evaluation of the UNETR segmentation model performance for each of the 27 pelvic–thigh musculoskeletal structures. a Per-class Dice similarity coefficient (DSC) values; the red dashed line indicates the mean DSC (0.896). b Per-class Intersection over Union (IoU) values for the same structures; the red dashed line indicates the mean IoU (0.814). c Precision–recall scatter plot by class, with point color denoting the corresponding DSC, illustrating the trade-off between over- and under-segmentation; larger superficial muscles cluster at higher precision/recall, whereas smaller deep muscles show comparatively lower values. d Heatmap summarizing per-class DSC, IoU, precision, and recall, providing an integrated view of model performance across muscle and bony classes (including femur and iliac).

Figure 3 validates our automated MRI muscle segmentation model against DEXA gold standard measurements in 37,004 participants. Figure 3a (TMV vs DEXA Legs Lean Mass) demonstrates exceptional concordance between TMV and DEXA legs lean mass (r = 0.936, p < 0.001), while Fig. 3b (TVI vs ALMI) shows robust correlation between height-adjusted TVI and ALMI (r = 0.885, p < 0.001).
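As an illustration of the two steps behind this comparison, the sketch below height-normalizes a measurement and computes a Pearson correlation coefficient. The text does not state the TVI formula; dividing by height squared is an assumption made here by analogy with ALMI (ALM divided by height squared), and the function names and toy values are hypothetical.

```python
import math
from typing import Sequence


def height_adjusted_index(value: float, height_m: float) -> float:
    """Divide a mass or volume by height squared (assumed form of TVI and ALMI)."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return value / height_m ** 2


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical values for four participants, not study data.
    tmv_cm3 = [11800.0, 9400.0, 8100.0, 6900.0]
    legs_lean_kg = [19.5, 16.2, 14.0, 12.1]
    heights_m = [1.80, 1.72, 1.65, 1.58]

    tvi = [height_adjusted_index(v, h) for v, h in zip(tmv_cm3, heights_m)]
    print("r(TMV, legs lean mass) =", round(pearson_r(tmv_cm3, legs_lean_kg), 3))
    print("TVI values:", [round(v, 1) for v in tvi])
```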

Fig. 3: Validation of MRI-derived muscle measurements against DEXA gold standard.

Concordance analysis between automated MRI segmentation and DEXA measurements in 37,004 participants. a Scatter plot showing correlation between thigh muscle volume (TMV) from MRI and DEXA legs lean mass (lean soft tissue of the entire lower extremities) (r = 0.936, p < 0.001). b Correlation between height-adjusted thigh volume index (TVI) and DEXA appendicular lean mass index (ALMI) (r = 0.885, p < 0.001).

Figure 4 presents age- and sex-stratified distributions of muscle mass metrics. Figure 4a (DEXA Appendicular Lean Mass by Age and Sex) demonstrates ALMI declining from median values of 8.5 kg/m² (men, 40–49 y) to 7.5 kg/m² (men, 80+) and 6.8 kg/m² (women, 40–49 y) to 6.4 kg/m² (women, 80+), representing declines of 0.25 and 0.13 kg/m²/decade, respectively. The proportion below sarcopenia cutoffs (7.0 kg/m² men, 5.5 kg/m² women) increased from <5% to 30% in men and 10% to 35% in women across the age span, with a critical transition at age 70 where >20% fell below thresholds.

Fig. 4: Age- and sex-stratified distributions of muscle mass metrics.

Box plots showing muscle mass distributions across age groups for men (blue) and women (red). a DEXA-derived appendicular lean mass index (ALMI) with EWGSOP2 sarcopenia cutoff thresholds indicated by dashed lines (7.0 kg/m² for men, 5.5 kg/m² for women). b Absolute thigh muscle volume (TMV) from MRI segmentation. c Height-adjusted thigh volume index (TVI). Box plots display median (center line), interquartile range (IQR; box limits showing 25th-75th percentiles), whiskers extending to 1.5×IQR, and outliers as individual points. Numbers below each box indicate sample size. All metrics show progressive decline with aging, with increasing variance in older age groups.

Figure 4b (TMV by Age and Sex) reveals more pronounced absolute TMV losses, with median values decreasing from 12,158 cm³ to 9,550 cm³ in men (21% decline) and 8,358 cm³ to 6,894 cm³ in women (19% decline). Men maintained 40–50% higher absolute TMV throughout aging, though this gap narrowed due to greater absolute losses. Figure 4c shows that height-adjusted TVI partially normalized sex differences, with men maintaining only 15–20% higher values (0.38–0.34 cm³/m² vs 0.30–0.26 cm³/m²). Both sexes demonstrated similar relative TVI declines of ~13% over 40 years, suggesting height normalization accounts for much of the apparent sex difference in muscle loss.

The complementary nature of MRI-derived thigh metrics (Fig. 4b, c) to traditional DEXA measurements (Fig. 4a) provides enhanced granularity for sarcopenia assessment. Increasing variance with age, particularly evident in 70+ groups across all panels, suggests heterogeneous aging trajectories that may distinguish successful from pathological muscle aging, with implications for targeted intervention strategies.

Figure 5 and Table 2 present the associations between muscle balance patterns and clinical outcomes through tertile analysis of posterior-anterior (P/A) ratio and left-right (L/R) asymmetry in 24,670 participants. Figure 5a (P/A Ratio) demonstrates that participants in the highest P/A ratio tertile (>0.48) compared to the lowest tertile (<0.41) had significantly increased odds of adverse outcomes. The highest tertile showed increased odds of weak grip strength (OR 1.60, 95% CI 1.45–1.77, p < 0.001; 1,083 vs 698 events), sarcopenia (OR 1.42, 95% CI 1.13–1.78, p = 0.003; 179 vs 127 events), all-cause mortality (OR 1.49, 95% CI 1.23–1.81, p < 0.001; 257 vs 174 deaths), and falls in the last year (OR 1.12, 95% CI 1.05–1.20, p = 0.001; 2,373 vs 2,158 events). Low muscle mass defined by DEXA ALMI showed no association with P/A ratio (OR 1.02, 95% CI 0.93–1.11, p = 0.68; 1,096 vs 1,076 events).
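To make the tertile contrast concrete, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% confidence interval from event counts in two groups. This is not the adjusted model behind the reported estimates, which control for age, sex, BMI, and physical activity. The group size used in the example (24,670 split into three equal tertiles) is an assumption, since per-tertile denominators are not given in the text, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(events_hi: int, n_hi: int, events_lo: int, n_lo: int,
                    z: float = 1.96) -> tuple:
    """Crude odds ratio (high vs low group) with a Wald confidence interval.

    Returns (odds_ratio, ci_lower, ci_upper). `z` = 1.96 gives a 95% interval.
    """
    a, b = events_hi, n_hi - events_hi   # events / non-events, high tertile
    c, d = events_lo, n_lo - events_lo   # events / non-events, low tertile
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")

    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts (Woolf method).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    tertile_n = 24670 // 3  # assumed equal tertile size, not reported in the text
    # Weak grip strength: 1,083 events (highest tertile) vs 698 (lowest tertile).
    or_, lo, hi = odds_ratio_wald(1083, tertile_n, 698, tertile_n)
    print(f"crude OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Because no covariates enter this calculation, the crude value should be read only as a plausibility check against the adjusted odds ratios, not as a reproduction of them.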

Fig. 5: Association between muscle distribution patterns and clinical outcomes.

Forest plots showing odds ratios (OR) with 95% confidence intervals for adverse outcomes comparing high versus low tertiles of muscle balance metrics. a Posterior/anterior (P/A) muscle ratio associations, where high P/A ratio indicates relatively preserved posterior muscles but diminished anterior muscles. b Left/right (L/R) asymmetry associations based on DEXA measurements. ORs are adjusted for age, sex, BMI, and physical activity level. Numbers indicate events in high/low tertile groups. The vertical dashed line represents OR = 1.0 (no association). P/A ratio showed significant associations with all outcomes except low muscle mass, while L/R asymmetry showed no significant associations.

Table 2 Association Between Muscle Balance Patterns and Clinical Outcomes in UK Biobank

Figure 5b (L/R Asymmetry) reveals that L/R asymmetry measured by DEXA showed no significant associations with any clinical outcome. ORs for high versus low asymmetry tertiles were 0.95 (95% CI 0.87–1.04) for low muscle mass (1,057 vs 1,106 events), 0.93 (95% CI 0.85–1.03) for weak grip strength (838 vs 893 events), 0.86 (95% CI 0.68–1.08) for sarcopenia (133 vs 155 events), 0.91 (95% CI 0.75–1.11) for mortality (196 vs 215 deaths), and 1.02 (95% CI 0.95–1.09) for falls (2,240 vs 2,207 events). The median L/R asymmetry was 3.1% (IQR 1.4–4.9%), with 20% of participants exceeding 5% asymmetry. Adjustment for age, sex, BMI, and physical activity level did not alter these associations. The differential associations between P/A ratio and clinical outcomes, contrasted with the null findings for L/R asymmetry, indicate that sagittal plane muscle balance represents a distinct risk factor for functional decline independent of coronal plane symmetry.

In sex-stratified models, the associations were directionally similar in men (adjusted OR 0.76, 95% CI 0.66–0.89) and women (0.84, 0.73–0.96), with no significant sex interaction (p = 0.339), suggesting consistent effects across sexes.

Discussion

This study presents a comprehensive automated segmentation framework for quantifying TMV from Dixon MRI in 37,004 UK Biobank participants, achieving exceptional performance (mean DSC = 0.896) across 27 distinct muscle groups. Our findings demonstrate strong concordance with DEXA-derived measurements while revealing novel insights into muscle composition patterns that predict functional decline and mortality. These results address critical limitations in current sarcopenia assessment methods and provide a scalable solution for population-level muscle phenotyping.

To contextualize these results against current clinical entry-point measures, DEXA-derived appendicular lean mass remains the most practical entry-point measure for sarcopenia: scanners are inexpensive, whole-body scans take only a few minutes, radiation dose is low, and guideline-based cut-offs are well established. However, as a two-dimensional projection, DEXA cannot separate anterior and posterior thigh compartments, is insensitive to sagittal-plane imbalances in muscle distribution, and mixes muscle with adjacent soft tissues along the X-ray path. In contrast, the neck-to-knee Dixon MRI protocol used here acquires co-registered water and fat volumes over the entire thigh in approximately 6–7 min without ionizing radiation, and our automated three-dimensional segmentation converts these into anatomically resolved metrics (TMV, TVI, and P/A ratio) that, in this cohort, captured compartment-specific risk for weakness, sarcopenia, mortality, and falls beyond that provided by ALMI alone.

Beyond DEXA, it is also important to consider widely used opportunistic imaging surrogates. The prevailing CT-based skeletal muscle index (SMI) measured at the L3 vertebral level has gained widespread adoption due to its opportunistic availability and correlation with whole-body muscle mass8,9. However, our findings highlight fundamental limitations of this single-slice approach that may compromise its utility for sarcopenia assessment and mobility prediction. First, the L3-SMI samples trunk musculature rather than the locomotor muscles that directly mediate gait, balance, and fall risk. As Goodpaster et al.21 demonstrated, preferential atrophy of thigh muscles precedes functional decline by several years, yet this critical regional loss remains invisible to L3-based measurements. Our data showing stronger associations between thigh muscle patterns and functional outcomes support this locomotor-centric view of sarcopenia pathophysiology.

Second, the reliance on cross-sectional area from a single axial slice introduces substantial measurement variability. Slice positioning errors of even 1-2 cm can alter SMI values by 5–10% due to the tapering anatomy of paraspinal muscles22. Our volumetric approach, integrating information across the entire thigh length, eliminates this positional dependency while capturing the full three-dimensional muscle distribution. This methodological advance is particularly relevant given the heterogeneous patterns of muscle loss we observed, where some individuals showed preferential proximal versus distal atrophy patterns that would be missed by any single-slice approach.

Third, and perhaps most critically, the L3-SMI approach conflates anatomically and functionally distinct muscle groups into a single metric. Our compartment-specific analysis reveals that posterior-to-anterior muscle ratio independently predicts adverse outcomes. Specifically, individuals in the highest P/A ratio tertile (indicating relatively preserved posterior muscles but diminished anterior muscles) demonstrated a 60% increased odds of weak grip strength (OR 1.60, p < 0.001), 42% increased odds of sarcopenia (OR 1.42, p = 0.003), 49% increased odds of mortality (OR 1.49, p < 0.001), and 12% increased odds of falls (OR 1.12, p = 0.001) compared to those in the lowest tertile. Because these are odds ratios, they describe relative odds rather than absolute risk: the odds of weak grip strength in the highest tertile were 1.6 times those in the lowest tertile, a clinically meaningful difference.

Notably, this increased risk occurs despite no significant association between P/A ratio and DEXA-defined low muscle mass (OR 1.02, p = 0.68), suggesting that muscle distribution may be more important than absolute quantity. This finding aligns with biomechanical principles, as the anterior compartment muscles (quadriceps) are crucial for knee stability, stair climbing, and rising from chairs—activities directly tested in grip strength and functional assessments. While posterior muscles (hamstrings and gluteals) contribute to forward propulsion, their relative preservation in the setting of quadriceps atrophy may indicate a maladaptive compensation pattern that ultimately compromises overall function16. The inability of L3-SMI to capture these functionally relevant distribution patterns may explain its modest predictive value for mobility outcomes in recent validation studies.

Against this backdrop, the clinical implications of distribution- and composition-aware phenotyping become clearer. The differential associations between muscle distribution patterns and clinical outcomes underscore the importance of moving beyond simple quantity metrics toward composition-aware assessments. Our finding that high P/A ratios predict adverse outcomes independent of total muscle mass has immediate implications for sarcopenia phenotyping and intervention targeting. As Kirk et al.4 emphasized in their comprehensive review of osteosarcopenia, the syndrome encompasses not merely muscle loss but dysregulated muscle quality and distribution. Our data provide quantitative support for this conceptual framework by demonstrating that individuals with preserved total muscle mass but altered distribution patterns exhibit functional impairments comparable to those with overt sarcopenia.

This composition-centric view aligns with findings from Linge et al., who analyzed 40,178 UK Biobank participants using the AMRA platform and demonstrated that adverse muscle composition—defined by the combination of low fat-tissue free muscle volume and high muscle fat infiltration—was a strong independent predictor of all-cause mortality, with hazard ratios exceeding 2.0 even after adjustment for grip strength and BMI. While their approach focused on intramuscular fat as a marker of metabolic dysfunction, our P/A ratio captures a complementary dimension: the selective vulnerability of anterior compartment muscles that directly mediate locomotor function23. Together, these findings suggest that sarcopenia phenotyping should incorporate both metabolic (fat infiltration) and biomechanical (compartment distribution) dimensions of muscle quality.

In line with this biomechanical dimension, the stronger predictive value of sagittal plane imbalance (P/A ratio) compared to coronal plane asymmetry (L/R ratio) offers mechanistic insights into fall pathophysiology. Falls in older adults predominantly occur in the sagittal plane during activities like sit-to-stand transitions, stair navigation, and recovery from forward perturbations5. The posterior muscle groups, particularly the gluteus maximus and hamstrings, generate the hip extension torque necessary for these activities. Our observation that individuals in the highest P/A ratio tertile have 49% higher mortality risk suggests this pattern may reflect broader neuromuscular dysfunction beyond simple disuse atrophy.

Importantly, the preservation of predictive associations after adjustment for physical activity levels indicates that muscle distribution patterns capture intrinsic biological aging processes not fully modifiable by exercise24,25. This interpretation is consistent with histological and imaging studies showing selective age-related atrophy of type II fibers and heterogeneous, muscle-specific trajectories of muscle loss across the lower limb. Our findings add an imaging-based, whole-limb perspective to this literature, while deliberately avoiding specific inferences about fiber-type composition of individual muscles from Dixon MRI alone.

Building on these clinical and mechanistic considerations, our volumetric approach enables several advances in sarcopenia phenotyping that address limitations identified in recent consensus statements1,6. First, the strong correlation between MRI-derived TVI and ALMI (r = 0.885) validates our method against the current gold standard while providing substantially greater anatomical detail. The ability to parse 27 individual muscles allows investigation of differential atrophy patterns that may define sarcopenia subtypes with distinct etiologies and therapeutic responses.

The validity of our MRI-based sarcopenia associations is further supported by comparison with large-scale BIA studies. Jauffret et al. 26 examined 387,025 UK Biobank participants using bioimpedance-derived skeletal muscle index and reported that both pre-sarcopenic and sarcopenic participants had significantly elevated fracture risk (adjusted HR 1.20–1.30) independent of heel ultrasound parameters. Our sarcopenia-related odds ratios (OR 1.42 for sarcopenia, OR 1.12 for falls) show comparable magnitude, validating cross-modality consistency despite fundamentally different measurement approaches. However, while BIA-based studies establish epidemiological associations, they cannot explain why sarcopenia increases fall risk. Our compartment-specific analysis addresses this mechanistic gap: selective quadriceps atrophy—reflected in elevated P/A ratios—directly compromises the knee extension torque required for sit-to-stand transitions and perturbation recovery, providing an anatomical basis for targeted rehabilitation interventions.

Second, the height-normalized TVI shows remarkably consistent age-related decline rates between sexes (~13% over 40 years), contrasting with the apparent sex disparity in absolute muscle loss. This finding suggests that much of the reported sex difference in sarcopenia prevalence may reflect anthropometric scaling rather than differential biological aging.

Third, the Dixon fat–water separation that underlies our segmentation also yields quantitative fat-fraction maps, which could address a critical gap in current sarcopenia definitions focused solely on lean mass11,12. While we focused on lean muscle volume and distribution in this initial validation, the same acquisition is well suited for future derivation of intramuscular fat and other fat-fraction–based indices. Previous studies have reported that individuals with high intramuscular fat despite preserved muscle volume exhibit functional impairments comparable to those with low muscle mass, supporting the inclusion of muscle quality metrics in next-generation sarcopenia criteria.

These phenotyping advances also align with emerging conceptual frameworks. The GLIS framework emphasizes muscle-specific strength (the ratio of strength to muscle mass) as a key pathophysiological metric6. Our findings reveal important considerations for operationalizing this concept. Using grip strength as the numerator and thigh muscle volume as the denominator may seem anatomically mismatched, yet we observed strong associations between this ratio and functional outcomes. This apparent paradox likely reflects grip strength’s role as a biomarker of global neuromuscular function rather than isolated forearm capacity.

More fundamentally, our discovery that muscle distribution predicts outcomes independent of total mass challenges the assumption that muscle-specific strength can be reduced to a simple ratio. Consider two individuals with identical grip strength and total thigh volume but different P/A ratios: our data suggest the individual with higher P/A ratio (relatively less anterior muscle) will have worse functional outcomes despite equivalent muscle-specific strength by conventional calculation. This finding indicates that the denominator of muscle-specific strength equations must incorporate distribution information to achieve optimal predictive validity.
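The two-individual thought experiment can be written out numerically. The sketch below is illustrative only: the compartment volumes are invented, total volume is simplified to anterior plus posterior, and P/A is taken as posterior volume divided by anterior volume, since the exact grouping of the 27 structures into compartments is not specified in this section. The tertile cut points (0.41 and 0.48) are those reported in the Results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThighProfile:
    """Hypothetical per-person summary; field names are illustrative."""
    grip_kg: float
    anterior_cm3: float   # e.g. quadriceps volume
    posterior_cm3: float  # e.g. hamstrings and gluteals volume

    @property
    def total_cm3(self) -> float:
        return self.anterior_cm3 + self.posterior_cm3

    @property
    def specific_strength(self) -> float:
        """Conventional muscle-specific strength: strength / muscle volume."""
        return self.grip_kg / self.total_cm3

    @property
    def pa_ratio(self) -> float:
        """Posterior-to-anterior volume ratio (assumed definition)."""
        return self.posterior_cm3 / self.anterior_cm3


def pa_tertile(ratio: float, low_cut: float = 0.41, high_cut: float = 0.48) -> str:
    """Assign a tertile label using the cut points reported in the Results."""
    if ratio < low_cut:
        return "lowest"
    if ratio > high_cut:
        return "highest"
    return "middle"


if __name__ == "__main__":
    person_a = ThighProfile(grip_kg=30.0, anterior_cm3=5000.0, posterior_cm3=2000.0)
    person_b = ThighProfile(grip_kg=30.0, anterior_cm3=4700.0, posterior_cm3=2300.0)
    for name, p in (("A", person_a), ("B", person_b)):
        print(name, f"{p.specific_strength:.5f} kg/cm3",
              f"P/A={p.pa_ratio:.2f}", pa_tertile(p.pa_ratio))
```

Both profiles share the same strength-to-volume ratio, yet they fall into opposite P/A tertiles, which is the information a single ratio discards.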

The compartment-specific approach also enables anatomically matched strength-mass ratios. Future studies could pair our thigh muscle volumes with lower extremity strength measures (knee extension/flexion, hip abduction) to derive true regional muscle-specific strength metrics. Such measurements would better reflect the mechanical coupling between muscle tissue and force generation, potentially improving sensitivity for detecting pre-clinical sarcopenia.

From an implementation and deployment standpoint, automated segmentation using transformer-based architectures (UNETR) represents a methodological advance with immediate translational potential. The mean DSC of 0.896 across 27 muscles exceeds the 0.85 threshold considered clinically acceptable for treatment planning in radiation oncology, suggesting sufficient accuracy for phenotyping applications14. The superior performance on larger muscles (DSC > 0.93 for quadriceps, gluteals) that contribute most to total volume ensures robust total muscle quantification even if smaller muscle boundaries are imperfectly delineated.

The computational efficiency of our pipeline—processing a complete thigh volume in under 2 min—enables population-scale deployment. Applied to the full UK Biobank imaging cohort, this approach could generate muscle phenotypes for >100,000 individuals, creating unprecedented opportunities for genetic and epidemiological discovery. The method’s reliance on Dixon MRI, now standard in population imaging protocols, ensures broad applicability without specialized sequences.

Despite these strengths, several limitations merit consideration. First, our cross-sectional design precludes causal inference regarding muscle patterns and outcomes. Longitudinal analysis of repeat imaging visits will establish whether distribution changes precede or follow functional decline. Second, while we validated against DEXA-derived lean mass, comparison with muscle biopsy findings would strengthen claims about muscle quality assessment. Third, our UK Biobank cohort, while large, may not represent sarcopenia patterns in non-European populations or clinical samples with advanced frailty. Fourth, our segmentation pipeline is restricted to the pelvic–thigh region and does not include lower-leg musculature, despite the recognized relevance of calf muscle mass in ambulation and its incorporation into EWGSOP2 and AWGS2019 criteria. This limitation reflects the UK Biobank neck-to-knee Dixon MRI protocol, which lacks continuous knee-to-ankle coverage. Future extensions of our framework to whole-body or dedicated lower-leg Dixon acquisitions would enable quantification of gastrocnemius and soleus volumes and allow direct comparison with calf-based indices recommended by international consensus groups.

Future work should explore several promising directions. Integration of fat fraction data could yield composite metrics incorporating both quantity and quality dimensions. Machine learning approaches might identify optimal combinations of muscle volumes that maximize outcome prediction. Genome-wide association studies (GWAS) of muscle-specific phenotypes could reveal susceptibility loci that remain undetected when whole-body lean mass is used as the primary trait. Most importantly, intervention studies should test whether exercise programs targeting the anterior compartment (quadriceps) can normalize P/A ratios and reduce fall risk.

Overall, this study establishes automated MRI-based thigh muscle segmentation as a powerful tool for sarcopenia research and clinical assessment. By moving beyond single-slice, single-number metrics to comprehensive volumetric phenotyping, we reveal that muscle distribution patterns predict functional decline and mortality independent of total muscle mass. These findings challenge current sarcopenia definitions focused solely on quantity and support evolution toward composition-aware criteria. As population imaging cohorts expand globally, the methods presented here offer a scalable pathway to precision medicine in sarcopenia, enabling risk stratification and treatment selection based on individual muscle phenotypes rather than population averages.

Methods

Study design and participants

This cross-sectional analysis utilized baseline imaging data from the UK Biobank, a population-based prospective cohort study. UK Biobank recruited 502,492 adults aged 40–69 years from 22 assessment centers across the United Kingdom between 2006 and 2010. Participants underwent comprehensive baseline assessments including sociodemographic questionnaires, physical measurements, and biological sampling.

From 2014 onwards, a subset of participants was invited for multimodal imaging based on geographic proximity to imaging centers and willingness to travel. By 2023, approximately 85,000 participants had completed at least one imaging visit. Our analysis focused on participants who underwent neck-to-knee Dixon MRI as part of the standardized imaging protocol. Inclusion criteria were: (1) age ≥50 years at imaging to ensure adequate representation across the aging spectrum, (2) completed neck-to-knee Dixon MRI, (3) DEXA scan performed within 2 years of MRI, and (4) grip strength measurement available from baseline or imaging visit. We excluded participants with incomplete imaging coverage, motion artifacts preventing accurate segmentation, or missing key covariates.

From 64,524 participants with Dixon MRI, we restricted the sample to those meeting the above criteria and with complete DEXA, grip strength, and demographic data, excluding 27,521 individuals in total. The final analytical cohort comprised 37,004 participants (17,920 men, 19,084 women).

All measurements were obtained during single imaging visits except DEXA scans, which were performed separately but typically within 6 months of MRI. This study was conducted using data from the UK Biobank resource under Application ID 622629. UK Biobank has ethical approval as a Research Tissue Bank from the North West Multi-centre Research Ethics Committee (REC reference: 21/NW/0157; IRAS project ID: 299116), which permits the use of stored data and samples for health-related research in the public interest. All UK Biobank participants provided written informed consent at recruitment for the use of their data and linkage to health records in approved research studies. Under this framework, the present analysis of de-identified UK Biobank data did not require additional project-specific ethics approval or consent.

Sarcopenia classification

Sarcopenia was defined according to the EWGSOP2 consensus as low grip strength (<27 kg for men, <16 kg for women) and low appendicular lean mass index (ALMI <7.0 kg/m² for men, <5.5 kg/m² for women)1. Grip strength was measured twice per hand with a calibrated Jamar dynamometer, and the highest value was retained. Appendicular lean mass (ALM, kg) was obtained directly from whole-body dual-energy X-ray absorptiometry (DEXA) scans performed in the UK Biobank imaging centers on GE-Lunar iDXA systems (enCORE software v17). The lean soft-tissue masses of both arms and both legs (fields 23263–23266) were summed to yield ALM. In this DEXA implementation, “legs lean mass” denotes the total lean soft tissue of the entire lower extremities (thigh, lower leg, and foot), not isolated thigh muscle volume. Quality assurance included daily calibration with a stepped phantom and quarterly cross-calibration across scanners. Standing height (m), recorded via UK Biobank field 50, was used to calculate ALMI as ALM divided by height squared. Participants who met neither EWGSOP2 criterion were classified as non-sarcopenic; those with low strength but normal ALMI were designated “probable sarcopenia” and included only in sensitivity analyses.
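The classification rule can be illustrated with a minimal Python sketch. It applies the cut-offs quoted above; the function names, the "M"/"F" sex coding, and the label "low mass only" (a combination the text does not name) are assumptions made for illustration, not part of the study code.

```python
def appendicular_lean_mass(arm_left, arm_right, leg_left, leg_right):
    """Sum the lean soft-tissue mass (kg) of the four limbs to obtain ALM."""
    return arm_left + arm_right + leg_left + leg_right


def classify_sarcopenia(sex, grip_kg, alm_kg, height_m):
    """Apply the EWGSOP2 cut-offs quoted in the text.

    sex is "M" or "F" (hypothetical coding). The label "low mass only"
    is an illustrative placeholder for a case the text does not name.
    """
    grip_cut = 27.0 if sex == "M" else 16.0
    almi_cut = 7.0 if sex == "M" else 5.5
    almi = alm_kg / height_m ** 2  # kg/m^2
    low_strength = grip_kg < grip_cut
    low_mass = almi < almi_cut
    if low_strength and low_mass:
        return "sarcopenia"
    if low_strength:
        return "probable sarcopenia"
    if low_mass:
        return "low mass only"
    return "non-sarcopenic"
```

For example, a man with a grip strength of 25 kg, an ALM of 20 kg, and a height of 1.75 m has an ALMI of about 6.5 kg/m² and falls below both thresholds.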

Whole Thigh MRI acquisition

Dixon MRI exploits the chemical-shift difference (~3.5 ppm, roughly 220 Hz at 1.5 T) between fat and water protons to generate co-registered water, fat and proton-density-fat-fraction (PDFF) volumes. We used the UK Biobank Dixon protocol on 1.5 T Siemens MAGNETOM Aera scanners at four imaging centers. The sequence was a three-dimensional spoiled gradient-echo acquisition with a repetition time of 6.53 ms and dual echo times of 2.39 ms (opposed-phase) and 4.77 ms (in-phase). A 10° flip angle and a receiver bandwidth of 960 Hz pixel⁻¹ were employed15,27,28.

Images were reconstructed on a 224 × 224 matrix, giving a native in-plane resolution of 2.0 mm across a 448 × 448 mm field of view; 320 contiguous 3.0 mm slices were collected, spanning from the neck to the knees. Parallel imaging with GRAPPA (acceleration factor 2) kept the total acquisition time to approximately 6.5 min per participant.

At each of six consecutive table positions (chest, abdomen, pelvis, hip, mid-thigh, and thigh-to-knee), the scanner produced four spatially aligned magnitude images (opposed-phase, in-phase, water-only, and fat-only), resulting in 24 compressed DICOM series per subject.

To provide input for the pelvic–thigh musculoskeletal segmentation model, a continuous three-dimensional (3D) volume was computationally generated by fusing four separate, overlapping axial Dixon MRI stations from the UK Biobank dataset. These stations collectively spanned the region from the pelvis to the knee, encompassing the pelvis, hip, mid-thigh, and thigh-to-knee areas. A custom Python-based algorithm systematically reconstructed and merged these stations to produce a single, cohesive image for each participant.

The process began by reconstructing each of the four stations into a distinct 3D volume. Within each station, the individual 2D DICOM slices were sorted anatomically in the inferior-to-superior direction based on their z-axis coordinate, as specified in the “ImagePositionPatient” metadata tag. These sorted slices were then stacked to form a 3D numpy array. Voxel spacing was defined using the “PixelSpacing” tag for the in-plane (X, Y) dimensions, while the through-plane (Z) spacing was robustly calculated as the median distance between consecutive slices to ensure accuracy. Following reconstruction, the 3D stations were sorted in superior-to-inferior anatomical order. A consistency check was performed to resolve any spatially duplicated stations, retaining only the most recently acquired data based on the “SeriesTime” tag.
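The per-station reconstruction can be sketched as follows. This is a simplified illustration rather than the study pipeline: each slice is represented as a plain tuple holding the values of the “ImagePositionPatient” and “PixelSpacing” tags plus the pixel array, so that the example runs without reading DICOM files. The function name and the tuple layout are assumptions. The sketch relies on the DICOM patient coordinate convention in which z increases toward the head.

```python
import numpy as np


def stack_station(slices):
    """Stack 2D slices of one station into a 3D array.

    slices: list of (image_position_patient, pixel_spacing, pixels) tuples,
    a hypothetical stand-in for the corresponding DICOM tags and pixel data.
    Returns the volume (z, row, column) and its spacing in the same order.
    """
    if len(slices) < 2:
        raise ValueError("need at least two slices to estimate z spacing")
    # Ascending z in patient coordinates runs from inferior to superior.
    ordered = sorted(slices, key=lambda s: float(s[0][2]))
    volume = np.stack([np.asarray(s[2]) for s in ordered], axis=0)
    z = np.array([float(s[0][2]) for s in ordered])
    # The median gap is insensitive to an occasional irregular slice position.
    dz = float(np.median(np.diff(z)))
    row_spacing, col_spacing = (float(v) for v in ordered[0][1])
    return volume, (dz, row_spacing, col_spacing)
```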

The core fusion process involved creating a single high-resolution global grid that encompassed the entire spatial extent of all four validated stations. Each station was then precisely resampled onto this common grid using trilinear interpolation.

The value P at a target point with normalized coordinates \(({x}_{d},\,{y}_{d},\,{z}_{d})\) within a voxel is estimated using trilinear interpolation, defined as the weighted average of the values \({C}_{{ijk}}\) at the eight surrounding corner points \((i,{j},{k})\) of the voxel, where \(i,{j},{k}\in \{\mathrm{0,1}\}\).

The formula is:

$$P\left({x}_{d},\,{y}_{d},{z}_{d}\right)={\sum }_{i=0}^{1}{\sum }_{j=0}^{1}{\sum }_{k=0}^{1}{C}_{{ijk}}{w}_{i}\left({x}_{d}\right){w}_{j}\left({y}_{d}\right){w}_{k}({z}_{d})$$
(1)

Where the weights for each axis are calculated by linear interpolation:

$${w}_{i}\left(d\right)={d}^{i}{\left(1-d\right)}^{1-i}$$
(2)

This can be expanded into the full summation form

$$P\left({x}_{d},\,{y}_{d},\,{z}_{d}\right)={\sum }_{i=0}^{1}{\sum }_{j=0}^{1}{\sum }_{k=0}^{1}{C}_{ijk}\,{x}_{d}^{i}{\left(1-{x}_{d}\right)}^{1-i}\,{y}_{d}^{j}{\left(1-{y}_{d}\right)}^{1-j}\,{z}_{d}^{k}{\left(1-{z}_{d}\right)}^{1-k}$$
(3)
  • \(P\left({x}_{d},\,{y}_{d},{z}_{d}\right)\) is the interpolated value at the target point.

  • \({C}_{{ijk}}\) is the known value at the corner \((i,{j},{k})\) of the voxel.

  • \({x}_{d},\,{y}_{d},\,{z}_{d}\) are the normalized distances (ranging from 0 to 1) of the target point from the corner \((0,\,0,\,0)\) along each respective axis.
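
As a minimal sketch of Eqs. (1) and (2) for a single voxel, the weights can be written out per axis and contracted against the eight corner values. The function name is an assumption, and the study itself ran this step on the GPU, which is not reproduced here.

```python
import numpy as np


def trilinear(corners, xd, yd, zd):
    """Interpolate inside one voxel.

    corners[i, j, k] holds C_ijk for i, j, k in {0, 1};
    xd, yd, zd are the normalized distances in [0, 1].
    """
    c = np.asarray(corners, dtype=float)
    # w_0(d) = 1 - d and w_1(d) = d, as in Eq. (2).
    wx = np.array([1.0 - xd, xd])
    wy = np.array([1.0 - yd, yd])
    wz = np.array([1.0 - zd, zd])
    return float(np.einsum("ijk,i,j,k->", c, wx, wy, wz))
```

At a corner the result reduces to that corner's value, and for corner values that vary linearly along each axis the interpolation is exact.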

In the overlapping regions between adjacent stations, voxel intensities were blended using a simple average to guarantee a seamless transition. This entire pipeline was implemented in Python, leveraging libraries such as pydicom and numpy. To significantly reduce processing time, the computationally demanding interpolation step was accelerated using NVIDIA CUDA via the numba library. The final continuous 3D volume was exported in the Nearly Raw Raster Data (NRRD) format, which preserves the image data along with its complete spatial information (origin, spacing and orientation) and metadata in a single file.
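The averaging in overlap regions can be illustrated by accumulating a sum and a coverage count on the global grid. The sketch below assumes each station has already been resampled onto that grid and marks voxels outside its coverage with NaN; this encoding and the function name are illustrative choices, not details taken from the study code.

```python
import numpy as np


def fuse_stations(resampled):
    """Average stations voxel-wise wherever they overlap.

    resampled: list of equally shaped float arrays on the global grid,
    with NaN where a station provides no data (an assumed convention).
    Voxels covered by no station are set to 0.
    """
    stack = np.stack([np.asarray(r, dtype=float) for r in resampled], axis=0)
    covered = ~np.isnan(stack)
    total = np.where(covered, stack, 0.0).sum(axis=0)
    count = covered.sum(axis=0)
    fused = np.zeros(total.shape, dtype=float)
    # Divide only where at least one station contributes.
    np.divide(total, count, out=fused, where=count > 0)
    return fused
```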

Ground Truth Annotation for MRI Muscle Segmentation

Ground truth annotations for the pelvic–thigh musculoskeletal segmentation model were created from the fused Dixon MRI volumes. Expert anatomists manually delineated 27 distinct pelvic–thigh musculoskeletal structures using 3D Slicer software (version 5.2.2), a widely adopted platform for medical image annotation29. The annotation protocol followed standardized anatomical guidelines to ensure consistency across raters.

Anterior compartment muscles (n = 5): sartorius, rectus femoris, vastus lateralis, vastus medialis, and vastus intermedius. Gluteal region muscles (n = 8): gluteus maximus, gluteus medius, gluteus minimus, piriformis, obturator internus, obturator externus, pectineus, and tensor fasciae latae. Medial compartment muscles (n = 5): adductor magnus, adductor longus, adductor brevis, gracilis, and quadratus femoris. Posterior compartment muscles (n = 3): semimembranosus, semitendinosus, and biceps femoris. Core muscles (n = 4): multifidus, iliopsoas, abdominal oblique, and rectus abdominis. Bone structures (n = 2): iliac bone and femur. Additionally, subcutaneous and intermuscular adipose tissue can be annotated within the same framework to enable future analyses of fat-related muscle quality, although these measures were not analyzed in the present work. The annotation masks were saved in NRRD format, preserving spatial information including voxel spacing, orientation matrices, and origin coordinates essential for accurate volumetric quantification.

Image Pre-processing

The preprocessing pipeline was implemented following standardized medical imaging protocols to ensure robust and reproducible segmentation results. Several critical transformations were applied systematically to both training and validation datasets.

Intensity Normalization: Intensity windowing specific to soft tissue visualization was applied, mapping voxel intensities from the range [80, 450] to normalized values [0.0, 1.0]. This window was optimized for muscle tissue contrast, in line with established practices in medical image preprocessing14,30.
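A minimal sketch of the windowing step is given below. It rescales the [80, 450] window linearly to [0, 1]; clipping values outside the window to the nearest bound is an assumption, as is the function name.

```python
import numpy as np


def window_intensity(volume, lower=80.0, upper=450.0):
    """Linearly map [lower, upper] to [0, 1], clipping values outside it."""
    scaled = (np.asarray(volume, dtype=np.float64) - lower) / (upper - lower)
    return np.clip(scaled, 0.0, 1.0)
```

With these bounds, an intensity of 265 (the window midpoint) maps to 0.5.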

Spatial Resampling: All volumes underwent resampling to achieve consistent voxel spacing of (1.5, 1.5, 2.0) mm. Trilinear interpolation was employed for image volumes to preserve intensity continuity, while nearest-neighbor interpolation was utilized for segmentation masks to maintain label integrity. This standardization protocol ensures uniform spatial resolution across heterogeneous MRI acquisitions, as recommended by recent medical imaging benchmarks31.

Patch-based Sampling Strategy: The training pipeline utilized the RandCropByPosNegLabel transformation from the MONAI framework to extract three-dimensional patches of dimensions (96, 96, 96) voxels. Four patches were sampled per volume during each training iteration, maintaining a 1:1 ratio between positive samples (containing target muscle tissue) and negative samples (background regions). This balanced sampling strategy addresses the class imbalance inherent in medical image segmentation tasks32.
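The idea behind balanced patch sampling can be shown with a simplified NumPy sketch. It is not the MONAI transform itself: it draws each patch centre from foreground or background voxels with equal probability and shifts the patch so that it stays inside the volume. The function and parameter names are assumptions made for illustration.

```python
import numpy as np


def sample_patch_starts(label, patch_size=(96, 96, 96), num_samples=4,
                        pos_fraction=0.5, rng=None):
    """Return the start index of each sampled patch.

    A centre voxel is drawn from the foreground (label > 0) with probability
    pos_fraction, otherwise from the background.
    """
    rng = np.random.default_rng() if rng is None else rng
    label = np.asarray(label)
    size = np.asarray(patch_size)
    shape = np.asarray(label.shape)
    if np.any(shape < size):
        raise ValueError("volume is smaller than the requested patch")
    foreground = np.argwhere(label > 0)
    background = np.argwhere(label == 0)
    starts = []
    for _ in range(num_samples):
        want_pos = rng.random() < pos_fraction
        pool = foreground if want_pos else background
        if len(pool) == 0:  # fall back when one class is missing
            pool = background if want_pos else foreground
        centre = pool[rng.integers(len(pool))]
        # Shift the patch so it lies fully inside the volume.
        start = np.clip(centre - size // 2, 0, shape - size)
        starts.append(tuple(int(v) for v in start))
    return starts
```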

UNETR Architecture for 3D Muscle Segmentation

We employed the UNETR (U-Net Transformers) architecture for 3D muscle segmentation, leveraging its hybrid design that combines Vision Transformers for global context modeling with convolutional decoders for precise spatial localization33. The transformer encoder processes 3D patches of size (96, 96, 96) voxels through 12 self-attention layers with hidden dimension 768, capturing long-range anatomical dependencies crucial for distinguishing morphologically similar muscle groups. The CNN decoder follows a U-Net-like architecture with skip connections from transformer layers at multiple resolutions (1/2, 1/4, 1/8, and 1/16), employing instance normalization and residual blocks for stable training. The model outputs 29 channels corresponding to 27 distinct pelvic–thigh musculoskeletal structures segments plus background, enabling comprehensive multi-class segmentation.

Training configuration

Model training was conducted on NVIDIA RTX 4090 and A6000 GPUs using PyTorch 2.0 and MONAI 1.2.0 frameworks. We employed a hybrid loss function combining Dice and Cross-Entropy losses (L_total = L_Dice + L_CE) to address class imbalance in multi-muscle segmentation. The AdamW optimizer was configured with learning rate 2 × 10⁻⁵, weight decay 1 × 10⁻⁵, and batch size 2, with hyperparameters determined through systematic grid search. Training proceeded for 12,000 epochs with validation every 2 epochs and early stopping based on validation performance. Automatic mixed precision (AMP) training was used to accelerate computation, together with gradient accumulation to effectively increase the batch size despite the memory constraints imposed by 3D volumetric data.
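To make the combined objective concrete, the following NumPy sketch evaluates L_total = L_Dice + L_CE on flattened voxels. It is a didactic illustration, not the framework loss used for training: the soft Dice formulation, the smoothing constant, the inclusion of the background class, and the function name are all assumptions.

```python
import numpy as np


def dice_ce_loss(logits, target, eps=1e-5):
    """Soft Dice loss plus cross-entropy.

    logits: array of shape (C, N) with raw class scores per voxel.
    target: integer array of shape (N,) with the true class per voxel.
    """
    logits = np.asarray(logits, dtype=float)
    target = np.asarray(target, dtype=int)
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    onehot = np.eye(logits.shape[0])[target].T  # shape (C, N)
    # Cross-entropy: mean negative log-probability of the true class.
    ce = -np.mean(np.log(prob[target, np.arange(target.size)] + 1e-12))
    # Soft Dice per class, averaged over classes.
    inter = (prob * onehot).sum(axis=1)
    denom = prob.sum(axis=1) + onehot.sum(axis=1)
    dice = 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))
    return float(dice + ce)
```

The Dice term rewards overlap per class regardless of class size, while the cross-entropy term supplies a per-voxel gradient, which is the usual motivation for summing the two.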

Performance evaluation metrics of AI segmentation model

To comprehensively assess segmentation performance, we employed multiple complementary metrics that capture different aspects of segmentation quality, following established evaluation protocols in medical image segmentation34,35. Each metric was computed on a per-class basis and aggregated across all 27 muscle segments.

The Dice similarity coefficient (DSC) and Intersection over Union (IoU) were calculated to quantify volumetric overlap between predicted and ground truth segmentations. For each muscle class c:

$${DS}{C}_{c}=\frac{2\left|{P}_{c}\cap {G}_{c}\right|}{\left|{P}_{c}\right|+\left|{G}_{c}\right|}$$
(4)
$${Io}{U}_{c}=\frac{\left|{P}_{c}\cap {G}_{c}\right|}{\left|{P}_{c}\cup {G}_{c}\right|}$$
(5)

Here Pc and Gc represent the sets of voxels predicted and labeled as class c, respectively. These metrics are related by: DSC = 2×IoU / (1 + IoU). The mean values across all muscle classes were computed as:

$${Mean\; DSC}=\frac{1}{n}{\sum }_{c}{DS}{C}_{c}\quad {\rm{and}}\quad {Mean\; IoU}=\frac{1}{n}{\sum }_{c}{Io}{U}_{c}$$
(6)

where n = 27 muscle classes. Classes absent from the ground truth (TP = 0, FN = 0) were excluded from averaging to prevent artificial inflation of performance metrics, as recommended by recent segmentation challenges36.
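The overlap metrics and the exclusion rule can be sketched in a few lines of NumPy. The function name and return layout are assumptions; the computation follows Eqs. (4) to (6), skipping any class with no voxels in the ground truth.

```python
import numpy as np


def overlap_metrics(pred, truth, classes):
    """Per-class DSC and IoU plus their means over classes present in truth."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    per_class = {}
    for c in classes:
        p, g = pred == c, truth == c
        if not g.any():  # class absent from the ground truth: excluded
            continue
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        dsc = 2.0 * inter / (p.sum() + g.sum())
        iou = inter / union
        per_class[c] = (float(dsc), float(iou))
    if not per_class:
        return per_class, float("nan"), float("nan")
    mean_dsc = float(np.mean([v[0] for v in per_class.values()]))
    mean_iou = float(np.mean([v[1] for v in per_class.values()]))
    return per_class, mean_dsc, mean_iou
```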

Precision and recall were computed to evaluate the model’s discriminative capability:

$${Precisio}{n}_{c}=\frac{T{P}_{c}}{(T{P}_{c}+F{P}_{c})}$$
(7)
$${Recal}{l}_{c}=\frac{T{P}_{c}}{(T{P}_{c}+F{N}_{c})}$$
(8)

Where TPc, FPc, and FNc represent true positives, false positives, and false negatives for class c, respectively. These metrics provide complementary insights into the model’s tendency toward over- or under-segmentation of specific muscle groups37.
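A short sketch of Eqs. (7) and (8) for one class follows. Returning NaN when a denominator is zero is an illustrative choice, and the function name is an assumption.

```python
import numpy as np


def precision_recall(pred, truth, c):
    """Voxel-wise precision and recall for class c."""
    p = np.asarray(pred) == c
    g = np.asarray(truth) == c
    tp = int(np.sum(p & g))    # predicted c, labeled c
    fp = int(np.sum(p & ~g))   # predicted c, labeled otherwise
    fn = int(np.sum(~p & g))   # labeled c, predicted otherwise
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

Low precision with high recall points to over-segmentation of a muscle, and the reverse pattern points to under-segmentation.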

Statistical analysis

The concordance between MRI-derived muscle measurements and DEXA gold standard assessments was evaluated using multiple statistical approaches to ensure robust validation of our automated segmentation method. Pearson correlation coefficients were calculated to assess the linear relationships between paired measurements, specifically examining thigh muscle volume (TMV) versus DEXA legs lean mass (n = 37,004) and thigh volume index (TVI) versus ALMI (n = 37,004).

The correlation coefficient was computed as \(r=\sum ({x}_{i}-\bar{x})({y}_{i}-\bar{y})/\sqrt{\sum {({x}_{i}-\bar{x})}^{2}\times \sum {({y}_{i}-\bar{y})}^{2}}\), where xi and yi represent paired MRI and DEXA measurements for participant i, and x̄ and ȳ denote the respective sample means. To quantify the uncertainty in our correlation estimates, 95% confidence intervals were constructed using Fisher’s z-transformation: \({\rm{z}}=0.5\times \mathrm{ln}[(1+{\rm{r}})/(1-{\rm{r}})]\), with standard error \({\rm{SE}}({\rm{z}})=1/\surd ({\rm{n}}-3)\).

The confidence intervals were then back-transformed to the correlation scale. Correlations exceeding 0.8 were interpreted as strong, indicating excellent concordance between measurement modalities.
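
The correlation and its confidence interval can be reproduced with the following sketch. The function name and the default critical value for a 95% interval are assumptions; the sketch requires n > 3 and |r| < 1, since the transformation is undefined at r = ±1.

```python
import math
import numpy as np


def pearson_with_ci(x, y, z_crit=1.959964):
    """Pearson r with a Fisher z-based confidence interval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    r = float((dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum()))
    z = math.atanh(r)                  # equals 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(x.size - 3)
    # Back-transform the interval limits to the correlation scale.
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return r, (lower, upper)
```

Because the standard error depends only on n, a cohort of 37,004 pairs gives SE(z) of about 0.005, so the intervals around the reported correlations are very narrow.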

To characterize age-related patterns of muscle loss, participants were stratified into decade-based age groups (50–59, 60–69, 70–79, and ≥80 years); an additional 40–49 year stratum was included only for descriptive visualization in Fig. 4. This stratification enabled examination of both linear trends and potential non-linear patterns in muscle changes across the aging spectrum. Within each age stratum, descriptive statistics including median, interquartile range, and distribution parameters were calculated separately for men and women to account for known sex differences in muscle mass and aging trajectories. The rate of muscle decline was quantified using linear regression models fitted separately for each sex: Yij = β0 + β1 × Ageij + εij, where Yij represents the muscle metric (ALMI, TMV, or TVI) for participant i of sex j, β0 represents the intercept, β1 represents the annual rate of change, and εij represents the random error term assumed to follow a normal distribution with mean zero and constant variance.

The decline rate per decade was calculated as 10 × β1, providing clinically interpretable estimates of muscle loss over ten-year periods. The 95% confidence intervals for these estimates were derived from the standard error of the regression coefficient: CI = 10 × β1 ± 1.96 × 10 × SE(β1). Additionally, to examine the proportion of participants meeting sarcopenia criteria across age groups, we calculated the percentage falling below established EWGSOP2 thresholds (ALMI <7.0 kg/m² for men, <5.5 kg/m² for women) within each age stratum. The Cochran-Armitage test for trend was applied to assess whether the proportion with sarcopenia increased linearly with age category, with the test statistic calculated as Z = Σwi(pi-p̄)/√[p̄(1-p̄)Σwi²/ni], where wi represents the weight for age group i, pi represents the proportion with sarcopenia in group i, ni represents the number of participants in group i, and p̄ represents the overall proportion.
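
The per-decade decline and its interval can be sketched with ordinary least squares in NumPy. The function name is an assumption, and the sketch fits one sex at a time, as in the model described above.

```python
import numpy as np


def decline_per_decade(age, y):
    """Fit y = b0 + b1 * age and return 10 * b1 with its 95% CI."""
    age = np.asarray(age, dtype=float)
    y = np.asarray(y, dtype=float)
    n = age.size
    design = np.column_stack([np.ones(n), age])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = float(resid @ resid) / (n - 2)           # residual variance
    se_b1 = np.sqrt(sigma2 / ((age - age.mean()) ** 2).sum())
    estimate = 10.0 * float(beta[1])
    half_width = 1.96 * 10.0 * float(se_b1)
    return estimate, (estimate - half_width, estimate + half_width)
```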

The associations between muscle balance patterns and adverse clinical outcomes were investigated using multivariable logistic regression models, with participants stratified into tertiles based on the distribution of posterior/anterior (P/A) ratio and left/right (L/R) asymmetry. Tertile cutpoints were determined using the 33rd and 67th percentiles of each distribution, with P/A ratio tertiles defined as low (<0.41), middle (0.41–0.48), and high (>0.48). For the primary analysis, we compared the highest versus lowest tertile, excluding the middle tertile to maximize contrast between groups and increase statistical power to detect associations. The logistic regression model for each clinical outcome was specified as:

$${logit}\left({p}_{i}\right)={\mathrm{ln}}\left[\frac{{p}_{{i}}}{1-{p}_{{i}}}\right]={{\rm{\beta }}}_{0}+\,{{\rm{\beta }}}_{1}\times {{\rm{I}}}_{\mathrm{high},{\rm{i}}}+{{\rm{\beta }}}_{2}\times {\mathrm{Age}}_{{\rm{i}}}+{{\rm{\beta }}}_{3}\times {\mathrm{Sex}}_{{\rm{i}}}$$
(9)

where \({p}_{i}\) represents the probability of the outcome for participant i, \({{\rm{I}}}_{{\rm{high}},{\rm{i}}}\) is an indicator variable coded as 1 for high tertile membership and 0 for low tertile (reference category), \({{\rm{Age}}}_{{\rm{i}}}\) represents age in years (continuous), \({{\rm{Sex}}}_{{\rm{i}}}\) is coded as 1 for male and 0 for female, BMIi represents body mass index in kg/m² (continuous), and PAi represents physical activity level in MET-hours/week (categorical).

Five clinical outcomes were examined: (1) low muscle mass defined by ALMI below EWGSOP2 sex-specific cutoffs, (2) weak grip strength using established thresholds (<27 kg for men, <16 kg for women), (3) confirmed sarcopenia according to EWGSOP2 criteria requiring both low muscle mass and weak grip strength, (4) all-cause mortality ascertained through linkage with national death registries, and (5) history of falls in the past year derived from UK Biobank Field 2296 where values ≥ 2 indicate fall occurrence. Odds ratios (OR) were calculated as OR = exp(β1), representing the multiplicative increase in odds of the outcome for high versus low tertile membership. The 95% confidence intervals were obtained using the profile likelihood method, which provides more accurate coverage than Wald-based intervals, particularly for smaller sample sizes or when ORs deviate substantially from 1.0. Statistical significance was assessed using Wald χ² tests with the test statistic calculated as (β1/SE(β1))², where SE represents the standard error. P-values less than 0.05 were considered statistically significant. Given the exploratory nature of this analysis and the biological plausibility of the examined associations, no correction for multiple comparisons was applied, though all p-values are reported to allow readers to apply their preferred adjustment method if desired.
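
The tertile contrast, odds ratio, and Wald statistic can be illustrated with a self-contained NumPy sketch. It fits the logistic model by Newton-Raphson and reports OR = exp(β1) with the Wald χ² value; it does not compute the profile-likelihood intervals used in the study. All function names are assumptions, and covariates such as age and sex would be passed as additional columns of X.

```python
import numpy as np


def tertile_contrast(values):
    """Return (keep_mask, high_indicator) for a highest-vs-lowest contrast."""
    values = np.asarray(values, dtype=float)
    lo_cut, hi_cut = np.percentile(values, [33, 67])
    low, high = values < lo_cut, values > hi_cut
    return low | high, high.astype(float)


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Newton-Raphson logistic fit; returns (beta, SE), intercept first."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, design.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse information matrix at the final fit.
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    hessian = design.T @ (design * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, se


def odds_ratio(beta, se, index=1):
    """OR = exp(beta) and Wald chi-square = (beta / SE)^2 for one term."""
    return float(np.exp(beta[index])), float((beta[index] / se[index]) ** 2)
```

With a single binary predictor the fitted odds ratio equals the cross-product ratio of the 2 × 2 table, which offers a quick sanity check of the fit.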