Abstract
Radiographic measurement of patient-specific spinopelvic mobility is standard of care in total hip arthroplasty (THA). Current manual methods are time-consuming, subjective, and error-prone. This study utilized deep learning (DL) to classify functional positions and measure pelvic tilt (PT), sacral slope (SS), and lumbar lordotic angle (LLA). A DL pipeline integrating image classification, vertebra detection, and landmark detection was developed using data from an international joint registry, comprising 52,772 images for classification, 9,875 for object detection, and 25,249 for landmark detection. Performance was evaluated using area under the curve (AUC), F1 score, and mean absolute error (MAE). Accuracy was compared to annotations by three expert engineers and validated by two senior engineers and a surgeon. Radiographs were processed in 1.96 ± 0.04 s, achieving precision, recall, receiver operating characteristic AUC (ROC-AUC), and precision-recall AUC (PR-AUC) metrics above 0.994. Anatomical landmark predictions resulted in errors of PT: 1.6° ± 2.1°, SS: 3.3° ± 2.6°, and LLA: 4.2° ± 3.2°. There was no significant difference in PT and LLA between expert engineers and the DL pipeline, and a 0.5° difference in SS (p = 0.043). Clinical validation showed no difference in landmark rejection rates (p > 0.05). We developed and clinically validated a DL pipeline that accurately measures patient-specific spinopelvic mobility from lateral functional radiographs, providing a scalable method for routine characterization in THA patients.
Introduction
Patients with unfavorable spinopelvic mobility or significant sagittal spinal deformities in total hip arthroplasty (THA) present an elevated risk of instability1,2,3, dislocation1,2,3,4, and revision surgery1,2. The use of radiographic assessments is routine care for identifying at-risk patients early in the treatment process5. Sagittal radiographs in standing, seated and contra-lateral step-up functional positions allow analysis of spinopelvic sagittal mobility. Typical analysis involves landmarking key anatomic features of the lumbar spine and pelvis to calculate the sacral slope (SS), pelvic tilt (PT), and lumbar lordotic angle (LLA)4,6,7,8,9,10. These parameters allow evaluation of dislocation risk as well as impingement and prosthetic joint mechanics1,11,12,13. Utilizing these measurements, surgeons can optimize prosthetic component orientation and selection, enhancing joint functionality and reducing the likelihood of postoperative complications1,12,13.
Measurement accuracy, however, is impacted by observer expertise, imaging quality and patient-specific anatomical factors14,15,16,17. Furthermore, the lead time required to process radiographs by vendors using expert engineers poses a risk to routine analysis. The development of automated systems using deep learning (DL) offers a promising solution to enhance both accuracy and efficiency. DL algorithms have demonstrated remarkable success in image recognition tasks using medical imaging18,19,20 and, when trained on sufficiently large datasets, report a high degree of precision21,22,23,24,25,26,27, potentially surpassing the accuracy of manual measurements28. Previous works have used convolutional neural networks (CNNs) to auto-landmark spine and pelvis radiographs22,23,24,29, with some models applying successive networks to refine performance26,27. However, existing models often lack a dataset that originates from diverse imaging centers and machines, and spans various geographical locations, limiting their applicability24,25,26. Moreover, these models frequently do not analyze multiple functional positions and lack rigorous clinical validation22,23,24,26. A DL pipeline that classifies patient functional position and measures spinopelvic mobility in lateral X-rays, using a large-scale multicenter international medical imaging database, has not been previously developed.
The purpose of this study was to investigate a method to rapidly generate spinopelvic measurements in hip arthroplasty. Specifically, we aim to: (1) Train a DL model that will classify input sagittal functional pelvic radiographs based on functional position; (2) Train a DL model to identify lumbar and pelvic landmarks allowing calculation of SS, PT and LLA; (3) Develop a DL pipeline that provides end-to-end spinopelvic mobility measurements indistinguishable from expert engineers, with clinical validation by an orthopaedic surgeon and two senior engineers.
Materials and methods
Image dataset and landmarks
Preoperative and postoperative lateral functional imaging was retrospectively extracted from an international joint replacement registry (CorinRegistry, Corin, UK). Ethics approval was obtained prior to the start of this study (AU ethics: Bellberry: 2020-08-764-A-2, USA IRB: WCGIRB: 120190312). All study methods were carried out in accordance with relevant guidelines and regulations, and a waiver of informed consent was approved by the ethics committee for this research. The registry is a cloud-based big-data ecosystem that passively de-identifies imaging and other data from preoperative planning and postoperative analysis processes for research purposes. It encompasses data from over 38,000 total hip arthroplasties performed between January 2017 and September 2023, collected from 391 imaging centers across 11 countries (Australia, United States, United Kingdom, France, Hungary, New Zealand, Austria, Japan, Italy, Germany and Belgium, Table S1). The lateral functional imaging consisted of standing, flexed seated and contra-lateral step-up radiographs captured either for preoperative THA planning30 or post-operative implant placement analysis. All three lateral functional X-rays were taken with the referred side positioned closest to the imaging detector and furthest from the X-ray source (Figure S1). In the flexed seated X-ray, the femur was positioned to be horizontal and parallel to the floor, as was the contralateral femur in the step-up X-ray. During the step-up, the contralateral leg was raised while the patient bore weight on the affected leg. For the seated position, the patient flexed forward as much as possible. Inclusion criteria were: undergoing or having undergone THA, lateral radiographs previously landmarked for spinopelvic analysis, and all landmarks visible on the image. All imaging in the registry has previously been evaluated by expert engineers and quality-checked by senior engineers to ensure suitable image quality and correct functional positions. The maximum available imaging set per patient includes standing, flexed-seated, and contralateral step-up lateral x-rays. Cases with one or more missing images were included in this analysis if the images were acceptable for preoperative planning or postoperative analysis. There were no exclusion criteria for this analysis.
The number of images used to train, validate and test each model, the proportion of preoperative versus postoperative imaging, and the number of imaging centers contributing data are shown in Table 1. The differences in total dataset sizes used to train each model are due to labelled data availability constraints. Given the variability in the execution of imaging protocols across 391 centers worldwide, we first trained a deep learning classification model to categorize lateral functional radiographs into one of three functional positions. While imaging centers are provided with standardized protocols and instructed to include the appropriate series description in the DICOM metadata (0008,103e), adherence to these naming conventions is inconsistent. As a result, relying solely on DICOM tags is not feasible, necessitating an automated classification approach. The DICOM images were preprocessed by converting to JPG format and rescaling pixel intensities between 0 and 255. This conversion facilitates standardized input handling in the model training, validation, and testing stages. We developed an X-ray processing pipeline consisting of several DL models to identify PT, SS, and LLA from lateral functional imaging (Fig. 1). PT is defined as the angle formed by a line drawn from the midpoint of the two anterior superior iliac spine (ASIS) landmarks to the center of the pubic symphysis, and a vertical reference line. SS is defined as the angle between the superior surface of the S1 sacral endplate and a horizontal line. The LLA is determined by the angle between the superior endplates of the S1 and L1 vertebrae, which quantifies the degree of lordosis of the lower spine31.
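As a concrete illustration of the preprocessing step, the sketch below converts a DICOM file to an 8-bit JPG with intensities rescaled to 0–255. It is a minimal example assuming pydicom and Pillow, not the registry's production code, and the file names are hypothetical.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_jpg(dicom_path: str, jpg_path: str) -> None:
    """Minimal sketch: read a DICOM, rescale intensities to 0-255, save as JPG."""
    pixels = pydicom.dcmread(dicom_path).pixel_array.astype(np.float32)
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels *= 255.0 / pixels.max()        # rescale to the 0-255 range described above
    Image.fromarray(pixels.astype(np.uint8)).save(jpg_path, format="JPEG")

# Hypothetical file names for illustration only
dicom_to_jpg("lateral_stand.dcm", "lateral_stand.jpg")
```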
Landmarks defined by expert engineers through a manual x-ray annotation process across the three functional positions (standing, contralateral step-up, and flexed seated) were used as ground truth for training the landmark detection model. These ground truth landmarks were identified using RadiAnt DICOM Viewer software (Medixant, Poznan, Poland) and independently verified by senior engineers accredited for quality control, each with a minimum of two years' experience in annotation and having undergone a practical assessment to demonstrate competency. To ensure accuracy during 2D landmarking, corresponding 3D landmarks captured on patient computed tomography (CT) scans were utilized. Both expert and senior engineers visually inspected the annotated landmarks, comparing their placement against 3D-rendered CT scans with corresponding 3D landmark points, which were obtained using Simpleware ScanIP (Synopsys, Sunnyvale, CA). To assist with identifying ground truth landmarks, expert engineers manually manipulated the 3D volume rendering to align with the X-ray view, ensuring accurate spatial correspondence. Additionally, they applied adjustable filters to enhance visualization, allowing for the creation of a realistic X-ray replica that closely matched the radiographic appearance. The landmark prediction algorithm outputs confidence maps, with the pixel coordinates of the highest confidence values selected as the predicted landmarks. These predicted landmarks were then used to calculate spinopelvic mobility parameters, with angular values compared against the ground truth annotations to assess accuracy.
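To make the landmark-to-angle step concrete, the following sketch shows how a confidence map can be reduced to a peak pixel coordinate and how PT, SS and LLA can then be derived from landmark coordinates. The helper names, coordinates, and simplified sign handling are illustrative assumptions, not the study's implementation.

```python
import numpy as np

def landmark_from_confidence_map(conf_map: np.ndarray) -> tuple[int, int]:
    """Return the (x, y) pixel coordinates of the highest-confidence value."""
    row, col = np.unravel_index(np.argmax(conf_map), conf_map.shape)
    return int(col), int(row)

def line_angle(p1, p2, reference: str) -> float:
    """Unsigned angle (deg) between the line p1-p2 and a horizontal or vertical reference."""
    dx, dy = abs(p2[0] - p1[0]), abs(p2[1] - p1[1])
    to_horizontal = float(np.degrees(np.arctan2(dy, dx)))
    return 90.0 - to_horizontal if reference == "vertical" else to_horizontal

# Hypothetical pixel coordinates for one radiograph (not study data)
asis_mid, pubic_symphysis = (410.0, 520.0), (395.0, 700.0)
s1_ant, s1_post = (350.0, 610.0), (470.0, 560.0)   # superior S1 endplate
l1_ant, l1_post = (330.0, 260.0), (440.0, 250.0)   # superior L1 endplate

pt = line_angle(asis_mid, pubic_symphysis, "vertical")    # pelvic tilt vs. vertical line
ss = line_angle(s1_ant, s1_post, "horizontal")            # sacral slope vs. horizontal line
lla = abs(line_angle(s1_ant, s1_post, "horizontal")
          - line_angle(l1_ant, l1_post, "horizontal"))    # angle between the two endplates
```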
Model pipeline
An image-processing pipeline was developed utilizing deep learning techniques to classify functional radiographs (Vision Transformer, ViT32), detect vertebrae (object detection, YOLOv833), automate anatomical landmark detection (convolutional neural network, CNN34), and derive the spinopelvic measurements (Fig. 2).
Flow diagram illustrating the deep learning pipeline for analyzing functional lateral X-ray images. X-rays are processed using a vision transformer (ViT) model for image classification. For pelvis landmark detection, a convolutional neural network (CNN) is applied directly to identify key landmarks in isolation. For lumbar landmark detection, the YOLOv8 (You Only Look Once, version 8) model is first used to isolate individual vertebrae as tiles. The L5 tile is processed by the CNN to accurately detect lumbar landmarks. Finally, the landmarks are used to calculate PT, SS and LLA.
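A schematic of how these stages could be chained per image is sketched below. The wrapper functions (classify_position, detect_vertebra_tiles, detect_landmarks, compute_angles) are hypothetical stand-ins for the ViT, YOLOv8 and CNN models described above, not the authors' code.

```python
def process_lateral_xray(image):
    """Hypothetical per-image driver mirroring the flow in Fig. 2."""
    position = classify_position(image)                 # "stand", "step_up" or "flex_seated" (ViT)
    pelvic_pts = detect_landmarks(image, model="pelvis_cnn")        # ASIS x2, pubic symphysis, S1 endplate
    tiles = detect_vertebra_tiles(image)                # YOLOv8 boxes cropped to one tile per vertebra
    lumbar_pts = detect_landmarks(tiles["L5"], model="lumbar_cnn")  # per the caption, the L5 tile feeds the CNN
    pt, ss, lla = compute_angles(pelvic_pts, lumbar_pts)
    return {"position": position, "PT": pt, "SS": ss, "LLA": lla}
```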
The development of these models was structured with a division of the imaging into training, validation, and testing groups, following a 70:20:10 split. Data augmentation was applied online during training using PyTorch and the Albumentations library to enhance model generalization. A combination of intensity, geometric, and noise-based transformations was probabilistically applied. Brightness and contrast adjustments were applied randomly, with brightness limited to ± 40% (30% probability) and contrast limited to ± 70% (40% probability). Geometric transformations included independent random shifting (range: ±10%, probability: 10%), scaling (range: ±20%, probability: 10%), and rotation (range: ±45°, probability: 80%). Horizontal flipping was applied with a 50% probability. Noise augmentation included Gaussian noise (variance range: 50–500, probability: 85%) and blurring with a kernel size of 14–20 pixels (15% probability). Each augmentation was applied stochastically to simulate real-world imaging variations and improve model robustness.
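One possible Albumentations composition consistent with the policy described above is sketched below. It is an approximation rather than the exact training configuration: brightness/contrast and shift/scale/rotate are split into separate transforms to mirror their independent probabilities, and the blur kernel range is restricted to odd sizes within the reported 14–20 px range.

```python
import albumentations as A

train_transform = A.Compose(
    [
        A.RandomBrightnessContrast(brightness_limit=0.4, contrast_limit=0.0, p=0.30),  # brightness ±40%
        A.RandomBrightnessContrast(brightness_limit=0.0, contrast_limit=0.7, p=0.40),  # contrast ±70%
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0, rotate_limit=0, p=0.10),  # shift ±10%
        A.ShiftScaleRotate(shift_limit=0.0, scale_limit=0.2, rotate_limit=0, p=0.10),  # scale ±20%
        A.ShiftScaleRotate(shift_limit=0.0, scale_limit=0.0, rotate_limit=45, p=0.80), # rotation ±45°
        A.HorizontalFlip(p=0.5),
        A.GaussNoise(var_limit=(50, 500), p=0.85),
        A.Blur(blur_limit=(15, 19), p=0.15),   # odd kernel sizes approximating the 14-20 px range
    ],
    # keypoint_params is relevant for the landmark model so landmarks move with the image;
    # it would be omitted for the classification and detection models.
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)
# Usage: augmented = train_transform(image=img, keypoints=landmark_xy_list)
```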
Models were trained using Python version 3.10.10. The vision transformer was developed using Microsoft Azure's AutoML platform. For object detection the YOLOv8 model used the Ultralytics 8.0.227 package33,35, and for landmark detection we employed the PyTorch package version 2.0.036. Data augmentations were performed using the Albumentations package version 1.3.1 (https://github.com/albumentations-team/albumentations)37. Model parameters were initialised with randomization (Table 1).
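For illustration, a hedged sketch of how a vertebra detector might be trained with the Ultralytics 8.0.x API is shown below; the dataset YAML, image size and epoch count are assumptions rather than the study's settings.

```python
from ultralytics import YOLO

# Building from a model config gives randomly initialised weights, consistent with the
# randomized initialisation described above; "vertebrae.yaml" is a hypothetical dataset config.
model = YOLO("yolov8m.yaml")
model.train(data="vertebrae.yaml", imgsz=640, epochs=100)

# Inference on one image returns the predicted vertebra bounding boxes
boxes = model("lateral_stand.jpg")[0].boxes
```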
Model testing and statistical analysis
Receiver Operating Characteristic Area Under the Curve (ROC-AUC) analysis was performed on the image classification algorithm. The micro- and macro-averaged AUC, sensitivity, specificity, and F1 scores (harmonic mean of precision and recall) were calculated. Precision-Recall AUC (PR-AUC) analysis was performed on the vertebra detection algorithm and the AUC calculated.
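The sketch below illustrates how these metrics can be computed with scikit-learn on placeholder predictions; the arrays are invented for demonstration and are not study data.

```python
import numpy as np
from sklearn.metrics import auc, f1_score, precision_recall_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

# Placeholder classifier outputs (classes: 0 = stand, 1 = step-up, 2 = flexed seated)
y_true = np.array([0, 1, 2, 1, 0, 2])
y_prob = np.array([[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.05, 0.90],
                   [0.20, 0.70, 0.10], [0.80, 0.10, 0.10], [0.10, 0.20, 0.70]])
y_bin = label_binarize(y_true, classes=[0, 1, 2])

macro_auc = roc_auc_score(y_bin, y_prob, average="macro")   # macro-averaged ROC-AUC
micro_auc = roc_auc_score(y_bin, y_prob, average="micro")   # micro-averaged ROC-AUC
macro_f1 = f1_score(y_true, y_prob.argmax(axis=1), average="macro")

# PR-AUC for the vertebra detector, treating each predicted box as a binary match/no-match
det_is_match = np.array([1, 1, 0, 1, 0, 1])
det_confidence = np.array([0.95, 0.90, 0.40, 0.85, 0.20, 0.88])
precision, recall, _ = precision_recall_curve(det_is_match, det_confidence)
pr_auc = auc(recall, precision)
```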
The mean absolute error (MAE) between the predicted spinopelvic measurements and the ground truth measurements (PT, SS and LLA) captured by the expert engineers during preoperative planning was calculated. MAE values for spinopelvic measurements were grouped by functional position in the x-ray and compared using Wilcoxon rank-sum tests. Robustness of the landmarking models was assessed by comparing the PT, SS and LLA MAE based on the number of images obtained from each imaging center. Imaging centers were subdivided into 4 groups (1–10, 11–50, 51–100 and > 100 images received). To compare differences between imaging center volumes for each functional position, a Kruskal-Wallis test was performed, with post-hoc Dunn's tests if the Kruskal-Wallis test returned a significant difference.
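A minimal sketch of this error analysis, using simulated placeholder errors and SciPy's rank-based tests, is given below; Dunn's post-hoc test is noted in a comment rather than implemented, since it requires an additional package (e.g. scikit-posthocs).

```python
import numpy as np
from scipy import stats

# Simulated placeholders for |predicted - ground truth| angle errors, grouped by position
rng = np.random.default_rng(0)
err_stand, err_stepup, err_seated = (np.abs(rng.normal(0, s, 500)) for s in (1.5, 2.0, 3.0))

mae_stand = err_stand.mean()                               # MAE for one position
w_stat, p_pair = stats.ranksums(err_stand, err_seated)     # Wilcoxon rank-sum, two positions

# Robustness across imaging-center volume bins (1-10, 11-50, 51-100, >100 images):
# Kruskal-Wallis omnibus test first; if significant, follow with post-hoc Dunn's tests
# (e.g. scikit-posthocs' posthoc_dunn).
bin_errors = [err_stand[:50], err_stand[50:200], err_stand[200:350], err_stand[350:]]
h_stat, p_kw = stats.kruskal(*bin_errors)
```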
To prospectively assess the accuracy of the DL pipeline, a power analysis was first performed: using a two-sample, two-sided means power analysis, we calculated the number of cases needed with equal groups, alpha = 0.05, power = 0.8, a standard deviation of 2.5° and a mean difference of 2°. Groups of n = 26 were required. To account for any case failures, we extended this to n = 30. A prospective set of 30 cases was chosen from a pool of images not used during training, validation, or testing. These 30 cases were randomly selected, non-consecutive, routine preoperative THA cases. The cases were selected to represent a wide range of image quality and difficulty in landmark identification. Each case contained three functional positions (stand, step-up, flexed seated) that were annotated by three expert engineers twice, in two rounds spaced two weeks apart, similar to Wakelin et al.38 and Raynauld et al.39. To provide an unbiased comparison between the DL algorithm and the expert engineers, the engineers did not have access to the 3D CT model to assist in landmarking the radiographs for these 30 cases. The interobserver agreement for the expert engineers was evaluated using the intraclass correlation coefficient (ICC). Wilcoxon rank-sum tests compared the expert engineers with each other and compared the MAE between the pooled expert engineers and the DL pipeline results.
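The stated sample-size calculation can be reproduced approximately with statsmodels, as sketched below under the assumptions given in the text (Cohen's d = 2/2.5 = 0.8).

```python
from statsmodels.stats.power import TTestIndPower

# Two-sample, two-sided means power analysis: mean difference 2 deg, SD 2.5 deg,
# alpha = 0.05, power = 0.8, equal group sizes.
n_per_group = TTestIndPower().solve_power(effect_size=2.0 / 2.5, alpha=0.05, power=0.8,
                                          ratio=1.0, alternative="two-sided")
print(round(n_per_group))   # ~26 cases per group, matching the value reported above
```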
To clinically validate the DL pipeline, two senior engineers and a fellowship-trained hip arthroplasty surgeon with over 10 years of clinical experience conducted a blinded quality control (QC) check on the prospective set of 90 images (30 cases) landmarked by three expert engineers and the DL pipeline (270 images total), Fig. 3. A two-sample proportion power analysis was performed with α = 0.05, 1 − β = 0.8, a sampling ratio of 1:1 and an estimated rejection rate of 5% in the expert engineer-landmarked cohort compared to 15% in the comparative cohort: a minimum of 140 images analysed per group was required. The two senior engineers performed a QC check on (1) all DL pipeline-landmarked images (90 each = 180 total), and (2) all manually landmarked images by the three expert engineers (90 × 3 = 270, 135 each). The surgeon QC-checked all DL pipeline-landmarked images (90 total), each paired with one expert engineer-landmarked image selected randomly from the 3 expert engineers (90 total). The senior engineers and surgeon classified images as correctly or incorrectly landmarked, and if incorrectly landmarked, indicated which landmarks required changing. The pass-fail rates for images annotated by expert engineers versus those by the DL pipeline were compared using Chi-square tests. Fisher's exact test was used to compare differences in landmark rejection between the senior engineers and the surgeon. A critical p value of 0.05 was used in all cases. All statistical testing was performed in R 4.1.2 (R Project, Vienna, Austria).
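For the proportion-based power analysis and the categorical comparisons, the sketch below uses a standard two-proportion normal-approximation sample-size formula together with SciPy's tests; the contingency-table counts are illustrative, not the study's results.

```python
import math
from scipy.stats import chi2_contingency, fisher_exact, norm

# Two-proportion sample size: 5% vs. 15% rejection, two-sided alpha = 0.05, power = 0.8, 1:1 ratio
p1, p2, p_bar = 0.05, 0.15, 0.10
z_a, z_b = norm.ppf(0.975), norm.ppf(0.80)
n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
     + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p1 - p2) ** 2
print(math.ceil(n))   # 141, i.e. roughly the 140 images per group reported above

# Pass/fail comparisons as described, with illustrative (non-study) counts: [pass, fail]
chi2, p_chi, _, _ = chi2_contingency([[258, 12], [170, 10]])   # expert engineer vs. DL images
odds, p_fisher = fisher_exact([[172, 8], [83, 7]])             # senior engineers vs. surgeon rejections
```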
Results
Model training and convergence
All models successfully completed training and were able to process all test set images. Total processing time for sequential classification, object detection, and landmark detection was 1.96 ± 0.04 s per image using an Azure virtual machine (Standard_NC4as_T4_v3) with an Nvidia Tesla T4 GPU.
The convergence of estimated landmarks from the sequential 6-stage convolutional model from epoch 0–50 is shown as a heat map in Fig. 4. At epoch 0 the heatmap is broad and undifferentiated. As the epochs advance, the heatmap focus narrows with an increase in intensity over the true landmark regions. By epoch 50, the model has converged.
Image classification and object detection
High ROC-AUC and PR-AUC values were achieved for both the image classification and vertebrae detection models. The classification model demonstrated excellent performance (Figure S2), achieving micro and macro-average AUCs, F1 scores, sensitivity, and specificity all ≥ 0.998 (Table 2). For vertebrae detection, the Precision-Recall (PR) curve (Figure S2) highlighted a similarly strong performance, with an AUC-PR of 0.998, precision of 0.994, and recall of 0.994 (Table 2). Misclassifications between standing and step-up positions were rare (0.23%) and typically occurred when the vertical femur in step-up images was obscured by a limited field of view, creating the appearance of a standing posture (Figure S3). Similarly, misclassifications between flexed seated and step-up positions (0.06%) arose due to overlapping femurs in seated poses and the absence of forward flexion, which closely resembled a stepping action (Figure S3).
Clinical measurements: DL pipeline versus ground truth
The overall error and the error broken down by functional position for the predicted PT, SS and LLA are shown in Fig. 5 (see also Figure S4 for boxplots with outliers and Table S2 for summary statistic values). PT was predicted most accurately, with the lowest MAE (MAE ± SD: 1.6°±2.1°); however, PT error also depended on functional position, with Stand reporting the lowest error (1.2°±1.3°) and Seated the highest (2.3°±3.2°), p < 0.0001. SS and LLA reported greater mean errors of 3.3°±2.6° and 4.2°±3.2°, respectively, for which functional position did not impact model accuracy.
Boxplots illustrating the distribution of errors (°) between deep learning (DL)-predicted and ground truth spinopelvic mobility angles. Mean Absolute Error ± SD values are displayed inside each boxplot (n = 2,524 X-ray images from 1,609 THA procedures). (a) All positions pooled. (b), (c), and (d) represent pelvic tilt (PT), sacral slope (SS), and lumbar lordotic angle (LLA), respectively, subdivided by X-ray functional positions: Stand, Step-up, and Seated. Significance: * (p < 0.05), ** (p < 0.01), *** (p < 0.001), **** (p < 0.0001).
MAE across PT, SS and LLA is not impacted by the origin of imaging (Fig. 6, see also Figure S5 for plot with outliers). Imaging centers which contributed low volumes of imaging (≤ 10 images) showed no increase in error compared to those which contributed high volumes of imaging (> 100 images).
The average spinopelvic error (°) across PT, SS and LLA for bins containing the number of images contributed by imaging centers, where n is the number of imaging centers per bin (n = 2,524 X-ray images from 1,609 THA procedures). No significant differences observed between imaging center volume within each functional position.
DL pipeline versus expert engineers
Comparing MAE among the three expert engineers, who each landmarked the spinopelvic measurements twice across the same 90 radiographs, MAE was lowest for PT (1.0°±1.1° to 1.8°±1.9°), followed by SS (2.3°±1.9° to 2.5°±2.0°) and LLA (2.6°±2.1° to 3.1°±2.4°), Table 3. High ICC values (≥ 0.99) were reported across all spinopelvic angles, indicating excellent reliability; however, small differences of up to 0.8° were found between expert engineers for PT and LLA.
The DL pipeline did not report a difference in error (p > 0.05) compared to the expert engineers in landmarking PT and LLA, Table 3. However, for SS, the DL MAE was 0.5° higher at 2.9°±2.1°, compared to the expert engineers’ average of 2.4°±2.0° (p = 0.043).
Clinical validation
Neither the senior engineers nor the experienced surgeon rejected the DL landmarks at a higher rate than expert engineer-generated landmarks (p > 0.05), indicating the DL pipeline performed equivalently to an expert engineer, Table 4. The senior engineers rejected landmarks between 4.4% and 8.9% of the time, while the surgeon rejected landmarks 7.8% of the time, with no significant differences in landmark rejection frequency between the surgeon and senior engineers (Table 5). Examples of surgeon-rejected X-rays with landmarks placed by both expert engineers and the DL pipeline are provided in Figure S6.
Discussion
The primary outcomes of our study are: (1) A classification model was developed which accurately classified lateral radiographs based on functional position; (2) A vertebra object detection model and key point detection model were developed to accurately and rapidly identify anatomic landmarks for calculating SS, PT and LLA; (3) Clinical validation showed the DL pipeline performed spinopelvic mobility measurements as well as expert engineers. This is the first study to our knowledge to develop and clinically validate an end-to-end DL pipeline for spinopelvic mobility characterization, applying DL models to accurately classify and landmark lateral x-ray imaging across three functional positions using a large international medical imaging registry.
Our DL pipeline performance for PT (1.6°±2.1°) is similar to previous studies using deep learning algorithms24,25,26, which reported average PT errors of 2.1 ± 6.3°, 2.7 ± 2.5°, and 1.1 ± 1.7°, respectively. For SS, our model demonstrated an error of 3.3°±2.6°, comparable to the errors of 4.8 ± 6.9°, 5.0 ± 3.4°, and 3.5 ± 4.8° reported in the same studies. Although our model exhibited a slightly higher LLA error of 4.2° ± 3.2°, possibly due to error propagation in the sacral slope and L1 endplate landmarks, it remains comparable to previously reported LLA errors of 4.3 ± 4.1°, 2.9 ± 2.3°, and 5.1 ± 6.3°24,26,40.
Among the evaluated measurements (PT, SS, LLA), we consider PT the most critical as it directly impacts cup position planning in our preoperative planning workflow, with each degree of PT change resulting in about a 0.7° change in functional cup anteversion in standing41. We consider an error of 2.5-3° in standing PT to be clinically significant as this would result in approximately 2° of error in functional cup position planning. A low planning error threshold is desirable to minimize compounding of errors when considering the precision of intraoperative delivery42. This threshold also agrees with Imai et al. who consider a 2.5° measurement error for standing PT to be clinically relevant43. The accuracy of PT measurements was dependent on functional position in which the error was lowest in the standing position (1.2°±1.3°), increased in the step-up position (1.5°±1.7°), and was the highest in the flexed seated position (2.3°±3.2°) and significantly different to standing PT (p < 0.001) (Fig. 5B). The increasing error in PT measurements from stand to step-up and seated positions may be attributed to occlusion of the pubic symphysis landmark by anatomy and hardware, particularly when the femurs are horizontal. Although the seated position reports statistically greater error of 1.1° compared to standing, this difference is unlikely to be clinically significant.
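As a quick arithmetic check of this threshold, assuming the ~0.7° of functional anteversion change per degree of PT cited above:

```latex
\Delta\mathrm{AV}_{\mathrm{func}} \approx 0.7 \times \Delta\mathrm{PT}
\;\Rightarrow\; 0.7 \times 2.5^{\circ} \approx 1.8^{\circ}
\quad\text{and}\quad 0.7 \times 3^{\circ} \approx 2.1^{\circ}
```

i.e. a 2.5–3° standing PT error translates to roughly the 2° of functional cup planning error referenced in the text.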
The prospective landmarking by expert engineers showed small differences between engineers and high intraclass correlations (≥ 0.99), indicating high reliability of the manual measurement process. When comparing these errors to the DL pipeline, no difference was observed in PT or LLA error, indicating comparable accuracy between the DL model and the gold standard in a prospective series. However, a significant difference in SS error was observed, in which the DL pipeline reported 0.5° greater error than the expert engineers. This difference is unlikely to be clinically relevant.
Clinical validation of landmark accuracy by both senior engineers and an experienced orthopaedic surgeon showed no difference in rejection rate between expert engineer and DL-generated landmarks. This validation indicates that our DL pipeline performs as well as expert engineers. Senior engineers rejected the DL-annotated landmarks in 4.4–8.9% of cases, whereas the surgeon rejected them at a rate of 7.8%. Both the senior engineers and the surgeon were instructed to review the landmarks for accuracy; however, the engineers had prior training, experience and understanding of the subsequent use of the landmarks to calculate spinopelvic parameters, exposure to 3D renders from CT imaging, and the clinical relevance of small modifications to landmark placement. Despite the surgeon's performance being comparable with that of the senior engineers in our study, surgeons in general may not be consistently reliable assessors of radiographic landmarks. Kleeman-Forsthuber et al.14 highlighted significant discrepancies in the accuracy of PT and SS measurements when comparing those made by surgeons to those captured by expert engineers. The study revealed a poor correlation between inexperienced surgeon measurements and validated landmarks obtained through commercial software used by expert engineers, with ICCs ranging from −0.197 to −0.004 for PT and −0.188 to −0.022 for SS. Evaluating spinal deformity is not a standard procedure for most arthroplasty surgeons, who generally have limited expertise in analyzing sagittal spinopelvic radiographs14,44. This notable disparity not only highlights the potential limitations of relying exclusively on surgeon evaluations for determining landmark accuracy in radiographic analyses, but also underscores the importance of incorporating assessments by expert engineers or employing DL-generated landmarks in clinical settings, using models trained on ground truth data captured by experts.
One major advantage of DL algorithms is the improved processing efficiency compared to manual methods, with our pipeline completing this analysis within 1.96 s per image. In comparison, Korez et al.25, who employed deep learning for sagittal x-ray analysis, reported a processing time of 1 s per image. Schwartz et al.24 utilized a segmentation approach, achieving a processing time of 0.82 s. Jang et al.22 achieved a processing time of 2.2 s, generating six measurements in anteroposterior (AP) x-rays. These studies, however, did not include image classification in processing calculations. In contrast, manual measurements using SurgiMap Spine software take approximately 75 s per image25, and estimates for manual landmarking using the protocol defined in this study are approximately 120 s per image. We would therefore expect time savings of several minutes in image processing per case after implementing our DL pipeline, even while maintaining a visual QC check, which we believe is critical to identify any outliers.
The DL pipeline presented in this study has several potential impacts on routine clinical assessment: (1) reduced variability and subjectivity; (2) maintained objectivity in evaluating spinopelvic parameters with complex anatomy; (3) improved scalability and efficiency. Access to these measurements is required for tailoring surgical approaches to individual patient anatomy, optimizing implant placement, and achieving balanced spinal-pelvic alignment. Novikov et al.45 found over 50% of early THA revisions were avoidable, of which 48% were due to poor acetabular component positioning. Routine and accessible spinopelvic measurements for planning THA may reduce these avoidable revisions by reducing poor implant placement and selection.
The size of our dataset is a key strength. In addition to the number of images, these data are sourced from a large number of imaging centers, such that the models are exposed to a wide range of focal lengths, x-ray voltages, currents, fields of view, demographics, image quality and surgical indications, resulting in a comprehensive training dataset. In contrast to our study, previous models26,27,46 have employed smaller datasets, ranging from 2,090 to 10,923 images. The limited size of these datasets restricts the scope of analysis and generalizability of results. Our dataset is predominantly made up of preoperative images. While this could be viewed as a limitation due to underrepresentation of post-operative scenarios, it is partly mitigated by the inclusion of preoperative cases with contralateral implants. Furthermore, the consistency of errors across imaging centers indicates a highly robust model which can render accurate predictions on a diverse imaging dataset.
Our study has limitations. The images used are not calibrated; as such, only angular measurements could be produced, which prevents us from calculating landmark placement error in millimetres. Although some post-operative imaging was included in the dataset, we did not explicitly test the model on post-operative imaging; characterization of our model pipeline for post-operative analysis was outside the scope of this study. Although our DL pipeline showed similar performance to expert engineers, some landmarks were still rejected by the senior engineers and surgeon. The pipeline therefore requires further refinement before unsupervised clinical use. The ground truth values used to train the models were generated from expert engineer landmarks QC'd by senior engineers; although carefully reviewed, these landmarks cannot be guaranteed to be error-free due to the subjective nature of anatomic landmarking. We used two senior engineers and an orthopaedic surgeon to validate our algorithm, and while surgeons may have expertise in these measurements, a fellowship-trained musculoskeletal-specialized radiologist may have provided a more accurate assessment. The image quality limit of the DL pipeline has not been tested in this study; it is unknown how accuracy degrades as image quality or patient functional position deviates beyond the training dataset. Our dataset included a comprehensive cohort of approximately 14,141 THA procedures, encompassing all available imaging without exclusion based on image quality or patient characteristics. This cohort is therefore likely to include patients with spinal disorders as well as hip disease. We did not distinguish between patients with and without spinal disorders and therefore cannot assess whether this would impact the accuracy of the model.
To conclude, using a series of deep learning models, we developed and clinically validated an image processing pipeline that can rapidly and accurately estimate patient-specific spinopelvic mobility from lateral functional radiographs. This pipeline is a scalable method to automatically characterize patient-specific mobility in all THA patients.
Data availability
The data that support the findings of this study are proprietary to Corin and were used under internal licensing agreements for the current study. Due to privacy and confidentiality restrictions, these data are not publicly available. However, certain aspects of the data or derived data may be available from the corresponding author on reasonable request with ethical approval and permission from Corin.
References
Sharma, A. K. & Vigdorchik, J. M. The hip-Spine relationship in total hip arthroplasty: How to execute the plan. J. Arthroplasty 36, S111–S120. https://doi.org/10.1016/j.arth.2021.01.008 (2021).
DelSole, E. M., Vigdorchik, J. M., Schwarzkopf, R., Errico, T. J. & Buckland, A. J. Total hip arthroplasty in the spinal deformity population: Does degree of sagittal deformity affect rates of safe zone placement, instability, or revision? J. Arthroplasty 32, 1910–1917. https://doi.org/10.1016/j.arth.2016.12.039 (2017).
Heckmann, N. et al. Late dislocation following total hip arthroplasty: Spinopelvic imbalance as a causative factor. J. Bone Joint Surg. Am. 100, 1845–1853. https://doi.org/10.2106/JBJS.18.00078 (2018).
Vigdorchik, J. M. et al. Otto Aufranc Award: A simple hip-spine classification for total hip arthroplasty: Validation and a large multicentre series. Bone Joint J. 103-B, 17–24. https://doi.org/10.1302/0301-620X.103B7.BJJ-2020-2448.R2 (2021).
Behery, O. A., Poultsides, L. & Vigdorchik, J. M. Modern imaging in planning a personalized hip replacement and evaluating the spino-pelvic relationship in prosthetic instability. In Personalized Hip Knee Joint Replace. 143–156. https://doi.org/10.1007/978-3-030-24243-5_13 (2020).
Le Huec, J. C., Aunoble, S., Philippe, L. & Nicolas, P. Pelvic parameters: Origin and significance. Eur. Spine J. 20(Suppl 5), 564–571. https://doi.org/10.1007/s00586-011-1940-1 (2011).
Vrtovec, T. et al. A review of methods for evaluating the quantitative parameters of sagittal pelvic alignment. Spine J. 12, 433–446. https://doi.org/10.1016/j.spinee.2012.02.013 (2012).
Noshchenko, A., Hoffecker, L., Cain, C. M. J., Patel, V. V. & Burger, E. L. Spinopelvic parameters in asymptomatic subjects without spine disease and deformity: A systematic review with meta-analysis. Clin. Spine Surg. 30, 392–403. https://doi.org/10.1097/BSD.0000000000000533 (2017).
Le Huec, J. C., Thompson, W., Mohsinaly, Y., Barrey, C. & Faundez, A. Sagittal balance of the spine. Eur. Spine J. 28, 1889–1905. https://doi.org/10.1007/s00586-019-06083-1 (2019).
Vigdorchik, J. M. et al. Prevalence of risk factors for adverse spinopelvic mobility among patients undergoing total hip arthroplasty. J. Arthroplasty 36, 2371–2378. https://doi.org/10.1016/j.arth.2020.12.029 (2021).
Sultan, A. A. et al. The impact of spino-pelvic alignment on total hip arthroplasty outcomes: A critical analysis of current evidence. J. Arthroplasty 33, 1606–1616. https://doi.org/10.1016/j.arth.2017.11.021 (2018).
Heckmann, N. D. & Lieberman, J. R. Spinopelvic biomechanics and total hip arthroplasty: A primer for clinical practice. J. Am. Acad. Orthop. Surg. 29, e888–e903. https://doi.org/10.5435/JAAOS-D-20-00953 (2021).
Vigdorchik, J. M. et al. High prevalence of spinopelvic risk factors in patients with post-operative hip dislocations. J. Arthroplasty 38, 706–712. https://doi.org/10.1016/j.arth.2022.05.016 (2023).
Kleeman-Forsthuber, L. T. et al. Reliability of spinopelvic measurements that May influence the cup position in total hip arthroplasty. J. Arthroplasty 35, 3758–3764. https://doi.org/10.1016/j.arth.2020.06.056 (2020).
Krupinski, E. A. Current perspectives in medical image perception. Atten. Percept. Psychophys. 72, 1205–1217. https://doi.org/10.3758/APP.72.5.1205 (2010).
Hicks, G. E., George, S. Z., Nevitt, M. A., Cauley, J. A. & Vogt, M. T. Measurement of lumbar lordosis: Inter-rater reliability, minimum detectable change and longitudinal variation. J. Spinal Disord. Tech. 19, 501–506. https://doi.org/10.1097/01.bsd.0000210116.94273.ad (2006).
Hong, J. Y. et al. Reliability analysis for radiographic measures of lumbar lordosis in adult scoliosis: A case-control study comparing 6 methods. Eur. Spine J. 19, 1551–1557. https://doi.org/10.1007/s00586-010-1422-x (2010).
Litjens, G. et al. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88. https://doi.org/10.1016/j.media.2017.07.005 (2017).
Shen, D., Wu, G. & Suk, H. I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248. https://doi.org/10.1146/annurev-bioeng-071516-044442 (2017).
Steele, J. R. et al. Deep learning phenotype automation and cohort analyses of 1,946 knees using the coronal plane alignment of the knee classification. J. Arthroplasty 38, e211. https://doi.org/10.1016/j.arth.2023.02.055 (2023).
Jang, S. J. et al. Deep-learning automation of preoperative radiographic parameters associated with early periprosthetic femur fracture after total hip arthroplasty. J. Arthroplasty 39, 1191–1198. https://doi.org/10.1016/j.arth.2023.11.021 (2024).
Jang, S. J. et al. Leg-length discrepancy variability on standard anteroposterior pelvis radiographs: An analysis using deep learning measurements. J. Arthroplasty 38, 2017–2023. https://doi.org/10.1016/j.arth.2023.03.006 (2023).
Jang, S. J. et al. John Charnley Award: Deep learning prediction of hip joint center on standard pelvis radiographs. J. Arthroplasty 37, S400–S407. https://doi.org/10.1016/j.arth.2022.03.033 (2022).
Schwartz, J. T. et al. Deep learning automates measurement of spinopelvic parameters on lateral lumbar radiographs. Spine (Phila Pa 1976) 46, E671–E678. https://doi.org/10.1097/BRS.0000000000003830 (2021).
Korez, R., Putzier, M. & Vrtovec, T. A deep learning tool for fully automated measurements of sagittal spinopelvic balance from X-ray images: Performance evaluation. Eur. Spine J. 29, 2295–2305. https://doi.org/10.1007/s00586-020-06406-7 (2020).
Yeh, Y. C. et al. Deep learning approach for automatic landmark detection and alignment analysis in whole-spine lateral radiographs. Sci. Rep. 11, 7618. https://doi.org/10.1038/s41598-021-87141-x (2021).
Cina, A. et al. 2-step deep learning model for landmarks localization in spine radiographs. Sci. Rep. 11, 9482. https://doi.org/10.1038/s41598-021-89102-w (2021).
Nagpal, K. et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ. Digit. Med. 2 https://doi.org/10.1038/s41746-019-0112-2 (2019).
Lochel, J. et al. Deep learning algorithm for fully automated measurement of sagittal balance in adult spinal deformity. Eur. Spine J. 33, 4119–4124. https://doi.org/10.1007/s00586-023-08109-1 (2024).
Sharma, A. K. et al. Low dislocation rates with the use of patient specific safe zones in total hip arthroplasty. J. Orthop. 27, 41–48. https://doi.org/10.1016/j.jor.2021.08.009 (2021).
Behery, O. A., Vasquez-Montes, D., Cizmic, Z., Vigdorchik, J. M. & Buckland, A. J. Can flexed-seated and single-leg standing radiographs be useful in preoperative evaluation of lumbar mobility in total hip arthroplasty? J. Arthroplasty 35, 2124–2130. https://doi.org/10.1016/j.arth.2020.03.035 (2020).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv Preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929 (2020).
Jocher, G., Chaurasia, A. & Qiu, J. YOLO by ultralytics v. 8.0.227 https://github.com/ultralytics/ultralytics (2023). https://doi.org/10.5281/zenodo.3908559
Wei, S. E., Ramakrishna, V., Kanade, T. & Sheikh, Y. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4724–4732. https://doi.org/10.1109/CVPR.2016.511 (2016).
Sohan, M., Sai Ram, T. & Rami Reddy, C. V. A review on YOLOv8 and its advancements. In Data Intelligence and Cognitive Informatics. (eds I. Jeena Jacob, Selwyn Piramuthu, & Przemyslaw Falkowski-Gilski) 529–545 (Springer, Singapore). https://doi.org/10.1007/978-981-99-7962-2_39
Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural. Inf. Process. Syst. 32 https://doi.org/10.48550/arXiv.1912.01703 (2019).
Buslaev, A. et al. Albumentations: Fast and flexible image augmentations. Information 11 https://doi.org/10.3390/info11020125 (2020).
Wakelin, E. A. et al. Accurate determination of post-operative 3D component positioning in total knee arthroplasty: The AURORA protocol. J. Orthop. Surg, Res. 13, 1–10. https://doi.org/10.1186/s13018-018-0957-0 (2018).
Raynauld, J. P. et al. Reliability of a quantification imaging system using magnetic resonance images to measure cartilage thickness and volume in human normal and osteoarthritic knees. Osteoarthr. Cartil. 11, 351–360. https://doi.org/10.1016/S1063-4584(03)00029-3 (2003).
Aubert, B., Vazquez, C., Cresson, T., Parent, S. & de Guise, J. A. Toward automated 3D spine reconstruction from biplanar radiographs using CNN for statistical spine model fitting. IEEE Trans. Med. Imaging 38, 2796–2806. https://doi.org/10.1109/TMI.2019.2914400 (2019).
Lembeck, B., Mueller, O., Reize, P. & Wuelker, N. Pelvic tilt makes acetabular cup navigation inaccurate. Acta Orthop. 76, 517–523. https://doi.org/10.1080/17453670510041501 (2005).
Okazaki, T. et al. Accuracy of robotic arm-assisted versus computed tomography-based navigation in total hip arthroplasty using the direct anterior approach: A retrospective study. BMC Musculoskelet. Disord 25, 787. https://doi.org/10.1186/s12891-024-07891-3 (2024).
Imai, N., Ito, T., Suda, K., Miyasaka, D. & Endo, N. Pelvic flexion measurement from lateral projection radiographs is clinically reliable. Clin. Orthop. Relat. Res. 471, 1271–1276. https://doi.org/10.1007/s11999-012-2700-1 (2013).
Iwasa, M. et al. Errors in the radiographic measurement of pelvic incidence. J. Orthop. Res. 41, 1266–1272. https://doi.org/10.1002/jor.25477 (2023).
Novikov, D. et al. Can some early revision total hip arthroplasties be avoided? Bone Joint J. 101-B, 97–103. https://doi.org/10.1302/0301-620X.101B6.BJJ-2018-1448.R1 (2019).
Zhang, K., Xu, N., Guo, C. & Wu, J. MPF-net: An effective framework for automated cobb angle estimation. Med. Image Anal. 75, 102277. https://doi.org/10.1016/j.media.2021.102277 (2022).
Acknowledgements
The authors would like to sincerely thank Tom Van der Veen for his invaluable contributions to the early development of our deep learning model. We also acknowledge the dedicated efforts of Linden Bromwich, Thomas Ban, Nirosha Pillalamarri, and Matthew Todd in supporting data acquisition and preparation, which were critical to this study’s success. Their contributions have significantly enriched this work.
Author information
Authors and Affiliations
Contributions
GHS, EAW, and CP conceptualized and developed the methodology for the study. GHS wrote the software code and developed the DL models, engineered/processed/analyzed the data and wrote the original draft of the manuscript. JK performed the QC activities. EAW, CP, and JP edited the manuscript. All authors (GHS, EAW, CP, JP, JK, SG) interpreted the data and discussed the results and reviewed and approved the original and revised manuscripts.
Corresponding author
Ethics declarations
Competing interests
GHS and EAW were employees of Corin when this work was completed. JP and CP are current employees of Corin and hold stock options. SG declares the following: Corin U.S.A. (Paid consultant; Paid presenter or speaker; Research support); DePuy, a Johnson & Johnson Company (Paid consultant; Paid presenter or speaker); Journal of Bone and Joint Surgery – British (Editorial or governing board); Journal of the American Academy of Orthopaedic Surgeons (Editorial or governing board); Stryker (Paid consultant; Research support). JK declares the following: Conformis (Stock or stock options); Histogenics (Stock or stock options); Neogenomics (Stock or stock options); Nuvasive (Stock or stock options); Vericel (Stock or stock options).
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Smith, G.H., Gupta, S., Kavolus, J. et al. Clinical validation of a deep learning tool for characterizing spinopelvic mobility in total hip arthroplasty. Sci Rep 15, 34343 (2025). https://doi.org/10.1038/s41598-025-16824-6
DOI: https://doi.org/10.1038/s41598-025-16824-6