Abstract
We aimed to compare doctors with a convolutional neural network (CNN) that we trained using our own novel method for the Lateral Pillar Classification (LPC) of Legg–Calve–Perthes disease (LCPD). Artificial intelligence (AI) applications in medicine frequently require thousands of training images. Since we did not have enough real patient radiographs to train a CNN, we devised a novel way to obtain them: we trained the CNN model on data created by modifying normal hip radiographs. No real patient radiograph was ever used during the training phase. We tested the CNN model on 81 hips with LCPD. First, we assessed the interobserver reliability of the whole system and then the reliability of the CNN alone. Second, a consensus list was used to compare the results of 11 doctors with those of the CNN model. Percentage agreement and interobserver analysis revealed that the CNN had good reliability (ICC = 0.868). The CNN achieved a classification accuracy of 76.54% and outperformed 9 of the 11 doctors. The CNN trained with the aforementioned method can already provide better results than most of the doctors. In the future, as training data evolve and improve, we anticipate that AI will perform significantly better than physicians.
Introduction
LCPD is an idiopathic avascular necrosis of the femoral head in the pediatric population. Treatment choice mainly depends on predicting the prognosis of femoral head sphericity and the congruency of the hip joint. The indication for containment treatment is determined by prognostic prediction, and several prognostic classification systems have been proposed for this purpose. Herring et al.1 described the LPC in 1992 to predict the prognostic severity of disease at the fragmentation phase. According to the LPC, the height of the lateral pillar is normal in group A, at least 50% of the lateral pillar height is preserved in group B, and less than 50% is maintained in group C.
There are many studies in the literature on the interobserver reliability and prognostic value of the LPC. Many reported moderate to high interobserver reliability and high prognostic value2,3,4,5,6,7,8,9,10,11, but, as an opposing view, Agus et al.12 reported superior reliability for the Salter-Harris and Catterall classifications over the LPC, finding poor reliability in both intra- and inter-observer agreement. Liggieri et al.13 stated that the degree of interobserver concordance was far from ideal for the three classification systems, including the LPC, and advised complementary systems for staging. Some authors criticized this system for the change between the first and last classifications10,11,14. According to Price15, the LPC may only be predictive in retrospect, after the opportunity for containment has passed, and there is still a need for prospective indicators to guide treatment choice. Given these criticisms, the LPC may not suffice for staging every patient, and the prognosis may then differ from expectations. We anticipate that a CNN can be a sharper prospective indicator for both classification and prognosis.
There are studies in the literature in which AI-based systems are used to address proximal femur bone shape deformation in LCPD patients. Proximal femur bone shape deformity rates were determined using the patient's healthy femoral head in a 2021 study by Memis et al. using an image registration approach. Experiments on Magnetic Resonance Imaging (MRI) data from 13 patients with LCPD showed that the algorithm yields very promising results16; however, this technique is not applicable to patients with bilateral LCPD. In a subsequent 2021 study by the same research group, a method for rapidly and accurately extracting bilateral proximal femurs from bilateral hip joint MRI images using random subsampled points was presented17. Bugeja et al. presented a new automated three-dimensional (3D) pipeline for segmentation and measurement of cam volume, surface area, and height using MRI data from 97 patients with femoroacetabular impingement (FAI). Their method consists of two main parts: a 3D U-Net with focused shape modeling (FSM) to segment the proximal femur, and identification of the cam bone mass to obtain the patient's anatomical information18. In another study using the U-Net architecture, Memis et al. performed multiform proximal femur and femoral head bone segmentation from low-quality MRI sections19. These studies do not include any classification method; they provide only segmentation. In 2021, Memiş et al. used a Support Vector Machine (SVM), one of the most widely used Machine Learning (ML) algorithms, to classify the Waldenstrom stages of LCPD in MRI images and reported a 62% success rate20. These studies were all carried out on MRI scans; there is no research on the LPC that uses AI with X-ray images.
In another study on LCPD, the objective was not to stage the disease but to identify whether it was present, and the study was reported to be a great success21.
When the studies in the literature are examined, it is seen that they generally focus on proximal femur bone shape segmentation. The small number of classification studies is due to the low prevalence of LCPD and the resulting difficulty of obtaining the data required to train DL networks. Within the scope of this study, data that could represent LCPD patients were derived by modifying data from healthy individuals, allowing us to train a DL network that could assist orthopedists with the LPC. We aimed to show how a CNN performs in terms of both reliability and percentage agreement. The flowchart describing the training and testing processes is given in Fig. 1.
A flowchart of training and test progress (a) All normal and LCPD hips collected (b) Obtaining normal pediatric hip radiographs (328 images) that are representative of the real patient group in terms of age and gender distribution (c) Randomly selecting 130 images from the 328 normal images to create group A. (d, e) 114 images for group B and 84 images for group C modified by computer engineers under the guidance of six orthopedists (f) Augmenting the modified images to 1312 for training (520 in Class A, 456 in Class B, and 336 in Class C). (g) Designing the CNN’s architecture and saving the CNN model for the test process (h) Obtaining real patients’ (LCPD) hip X-ray images (81 images). (i) Labelling LCPD patients’ hip radiographs with a unanimous decision by 6 senior orthopedists (Class A: 29, Class B: 31, Class C: 21). (j) Applying the trained CNN model to test images and transferring the obtained classification results to the evaluation stage. (k) Testing the reliability of the proposed method and the decisions of the six doctors.
Then, three more orthopaedists and two radiologists were added to the study, and the results of a total of nine orthopaedists and two radiologists were compared to those of CNN in terms of their classification accuracy.
The following sections provide more information about the proposed methods and the study's results.
Materials and methods
LCPD dataset
The study was approved by the local ethics committee of the Istanbul University Faculty of Medicine (No. 15/01/2020-003) and conducted in compliance with the 1964 Helsinki Declaration. The Ethics Committee of the Istanbul University Faculty of Medicine waived the requirement for informed consent because all data were anonymized and de-identified. Clinical records of LCPD patients on follow-up at two institutions between 2008 and 2016 were reviewed, and 118 patient hip radiographs were retrieved. Radiographs were excluded if they were not in the fragmentation phase. Hip radiographs taken in the neutral position from a distance of 120 cm in the full anteroposterior plane were used. Eighty-one radiographs were available for this study (72 males and 9 females; ages 4–12 years; mean age 9.46; 45 right hips and 36 left hips). Only four males had bilateral LCPD. The time after onset was 4–6 weeks. First, six orthopaedists with more than five years of pediatric orthopaedic experience made the classification individually. Then, a definitive list (a consensus list) was made by the unanimous decision of these six orthopaedists. We used the classic LPC rather than the modified one because we had only four real patient images classified as group B/C by the doctors, which is not enough data to obtain a reliable result. Reliability tests were conducted using these six orthopaedists’ individual results and the CNN’s results. After the reliability tests, we added five more independent doctors, including two radiologists, for a percentage agreement test; they labeled the 81 images individually as class A, B, or C. Finally, we compared the classification success rates of all participants using the percentage agreement test. Since all of the doctors were already familiar with the LPC, none of them were given a tutorial at the start of this study.
The data set, described in Table 1, contains few samples due to the low prevalence of LCPD in the general population. A large number of images from various classes must be employed to train CNN architectures, and the quantity of radiographs was insufficient, so the authors devised a method for creating a training dataset. In this method, 328 radiographic images of healthy hips were obtained from patients between the ages of 4 and 12. 130 randomly chosen radiographs from the 328 images were labeled group A. Two computer engineers used Adobe Photoshop® software to modify the femoral heads of the remaining 198 images under the guidance of six orthopedists, producing 114 group B and 84 group C labeled images according to the LPC (Fig. 2). Since modified images were created for training, all of the real LCPD radiographs were reserved to test the generated CNN model.
An illustration of how to produce a class B femoral head (a) Normal hip radiograph with a blue line encircling the femoral head (b) The height of the lateral column of the normal femoral head was reduced up to 50%. The head has been converted to group B. (c) The newly created head is highlighted in red. (d) The highlighted head is segmented from the image and assigned as the final training data.
Data Augmentation
In our study, there were 328 images of the femoral head for training, but the number of training samples was not enough to obtain acceptable results. According to Wong et al.22, augmentation has a significant impact on improving performance and reducing overfitting issues. Training data was augmented by horizontal flipping, zooming, and shearing of every image. At the end of the process, the number of training images increased to 1312 (520 for group A, 456 for group B, and 336 for group C).
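As an illustration, the three augmentation operations can be sketched with plain NumPy array manipulation on grayscale arrays. This is a minimal sketch, not the exact pipeline used in the study (which relied on Keras utilities), and the zoom crop size and shear amount are assumptions for demonstration.

```python
import numpy as np

def augment(image):
    """Generate simple augmented variants of a 2D grayscale image array:
    horizontal flip, central zoom (crop), and a crude horizontal shear."""
    variants = []
    # Horizontal flip: mirror the image left-to-right.
    variants.append(image[:, ::-1])
    # Zoom: crop the central region (resizing back up is omitted here).
    h, w = image.shape
    dh, dw = h // 10, w // 10
    variants.append(image[dh:h - dh, dw:w - dw])
    # Shear: shift each row proportionally to its index (with wrap-around).
    sheared = np.empty_like(image)
    for i in range(h):
        sheared[i] = np.roll(image[i], i // 4)
    variants.append(sheared)
    return variants
```

In practice, a library routine would interpolate and pad rather than crop or wrap, but the geometric idea behind each transform is the same.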
Preprocessing
The femoral heads used in the testing and training processes were segmented from the real patient and modified hip radiographs. During this process, the determination of femoral head boundaries and the segmentation of these regions were performed under the supervision of six orthopedists (Fig. 3). After segmentation, other preprocessing approaches such as histogram equalization, sharpening, and morphological operations were applied to enhance the visual quality of the images. However, these preprocessing techniques had no significant impact on the CNN's performance, so they were not used after segmentation.
Deep learning (DL) and proposed CNN architecture
DL is an ML method that has recently attracted the attention of the medical sciences as well as many other fields. Unlike traditional ML methods, a DL algorithm learns features directly from the data via its multilayered structure23. To efficiently extract features from input data, a DL network computes its internal parameters in the forward pass and then iteratively adjusts them during backpropagation. These forward and backpropagation procedures improve the network's learning capacity and provide a means to learn complex patterns in computer vision and other domains24. DL can be investigated through four main network structures: Deep Reinforcement Learning, Deep Boltzmann Machines, Recurrent Neural Networks (RNNs), and CNNs. Systems that employ deep reinforcement learning interact with their surroundings and learn from them25. Deep Boltzmann Machines process data by learning a probability density function over the inputs26. RNNs are special in that they can operate on a series of vectors over time, which means they can relate earlier knowledge to the current input. Owing to several advantages, such as its similarity to the human visual processing system, the CNN is employed in this study. It is widely used for image classification and is also proficient in learning and extracting features from 2D data27.
CNNs are a common type of deep learning neural network that is used for image classification applications. The network is composed of layers, each of which has a specific function. The fundamental layers of a CNN consist of the following:
Convolutional layer Convolution operations are carried out by this layer on the input image to aid in the feature extraction process. Each of the filters used by the convolutional layer on top of the image is designed to detect a certain feature in the image.
Activation layer This layer applies an activation function to the output of the convolutional layer, which is often the rectified linear unit (ReLU) function in image classification. The ReLU function enhances the network's capacity to learn complicated features by introducing non-linearity into the system.
Pooling layer Pooling is used to minimize the spatial dimensions of the feature maps, which reduces the computational cost of the network and increases the robustness of the features to slight changes in an object's position within an image.
Fully connected layer All the neurons in one layer are linked to the neurons in the next layer by this layer. Its purpose is to learn non-linear combinations of features from the previous layer.
Output layer This layer generates the network's final output, which could be a class label or a collection of class probabilities for an image classification task.
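To make the roles of the convolutional, activation, and pooling layers concrete, their core operations can be sketched in a few lines of NumPy. This is an illustrative toy implementation (single channel, unpadded convolution), not the code used in the study.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Unpadded 2D convolution (cross-correlation, as CNNs implement it):
    slide the kernel over the image and sum the elementwise products."""
    kh, kw = kernel.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """ReLU activation: zero out negative responses."""
    return np.maximum(x, 0)

def max_pool(x, k=2):
    """k x k max pooling: keep the strongest response in each block."""
    h, w = x.shape[0] // k, x.shape[1] // k
    return x[:h * k, :w * k].reshape(h, k, w, k).max(axis=(1, 3))
```

Chaining `max_pool(relu(conv2d_valid(img, kernel)))` reproduces, in miniature, one convolution–activation–pooling block of the kind stacked six times in the proposed architecture.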
Training process
Figure 4 demonstrates the proposed neural network design, which consists of six convolutional layers, each followed by an activation and a pooling layer. After the activation layers have normalized the output of the convolutional layers, the pooling layers use a maximum pooling operation with a kernel size of 2 × 2 to reduce the dimension of the data. Employing 3 × 3 filters, the first three convolutional layers generate 128 feature maps, while the following three convolutional layers generate 256 feature maps. The architecture also has a flatten layer and two fully connected layers. The activation function for all activation layers and the first fully connected layer is the ReLU function, which also serves as a solution to the vanishing gradient problem28. Only the last fully connected layer uses the Softmax activation function, because the class of the input is determined in this layer. Between the first and second fully connected layers, randomly chosen neuron connections are dropped out at a ratio of 0.2 to avoid overfitting29. While creating the CNN model, an optimizer is implemented to minimize the error between the predicted output and the true output. Due to its capacity to manage the complexity of deep learning and its ability to adaptively adjust the learning rate for each parameter, the proposed network implements the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.00130. Hyper-parameters such as the number of convolutional layers, activation functions, optimizer, dropout rate, learning rate, and batch size were determined by a grid search algorithm, and the hyper-parameters that gave the best results were used in the study. Determining hyper-parameters with the grid search algorithm is a widely used method in deep learning studies.
Determining the hyper-parameters algorithmically among various candidate values enables the deep neural network to achieve higher performance on different problems and contributes to building a model suited to the problem (Fig. 4).
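Since the padding scheme of the convolutions is not stated, the following sketch traces the spatial size of the feature maps through the six conv + pool blocks under the assumption of unpadded ('valid') 3 × 3 convolutions and 2 × 2 max pooling; with 'same' padding the intermediate sizes would differ.

```python
def trace_shapes(input_size=224, blocks=(128, 128, 128, 256, 256, 256),
                 kernel=3, pool=2):
    """Trace the spatial size of the feature maps through six conv+pool
    blocks (assumed 'valid' convolutions, 2x2 max pooling)."""
    size, shapes = input_size, []
    for filters in blocks:
        size = size - (kernel - 1)   # a valid convolution shrinks by kernel-1
        size = size // pool          # max pooling halves the dimension
        shapes.append((size, size, filters))
    return shapes
```

Under these assumptions a 224 × 224 input shrinks to a 1 × 1 × 256 map before the flatten layer, which keeps the fully connected layers small.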
Once the architecture of the network was established, images with dimensions of 224 × 224 × 3 were fed to the first convolutional layer in batches of 32, and the training process was started with the 1312 images labeled by six doctors. The number of epochs was set to 1000, a large value, to avoid underfitting, and an early stopping criterion was applied to halt training and avoid overfitting once the validation accuracy declined. Training terminated automatically between 416 and 482 epochs, each epoch took around 30 s, and the weights with the highest validation accuracy were recorded. The early stopping criterion was tuned over different patience values, and a patience of 50 was chosen to prevent overfitting. Radiographic images of real patients, referred to as “test images”, were never used in the training process; they were used only during testing.
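The early stopping rule described above can be sketched as a small helper. The function name and exact bookkeeping are illustrative assumptions (in Keras this is the `EarlyStopping` callback), while the patience value matches the one reported (50).

```python
def early_stopping_epoch(val_accuracies, patience=50):
    """Return the epoch at which training stops: when validation accuracy
    has not improved for `patience` consecutive epochs. The weights from
    the best epoch are the ones kept."""
    best, best_epoch = float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stop here; best_epoch's weights are saved
    return len(val_accuracies) - 1  # never triggered: trained to the end
```

This is why training halted between epochs 416 and 482 despite the nominal limit of 1000 epochs.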
Evaluation of metrics
After the training of the model was completed, the network parameters were recorded, and the network was made ready for experimental tests. Various evaluation metrics, such as accuracy, precision, specificity, recall (sensitivity), and F1 score, were used to evaluate the performance of the trained CNN during the tests. Accuracy is the most common evaluation metric for CNN classification systems because it is easy to understand and interpret; however, when dealing with unbalanced data, the F1 score is an effective option for evaluating the classification model. We also report the other evaluation metrics to support our classification results. The accuracy of the system was 83.33% on validation and 76.54% on test images; the F1 score on test images was 77.13% according to the consensus list.
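For reference, these per-class metrics can be derived directly from a confusion matrix (rows = ground truth, columns = predictions) as follows; the 2 × 2 matrix in the test is hypothetical, not the study's actual matrix, which appears in Table 2.

```python
def accuracy(cm):
    """Overall accuracy: correct predictions (the diagonal) over all cases."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def class_metrics(cm, c):
    """Per-class metrics from a square confusion matrix `cm`
    (rows = ground truth, cols = predictions), for class index `c`."""
    total = sum(sum(row) for row in cm)
    tp = cm[c][c]                       # true positives for class c
    fn = sum(cm[c]) - tp                # class-c cases predicted elsewhere
    fp = sum(row[c] for row in cm) - tp # other cases predicted as class c
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)             # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}
```

Computing the metrics per class, as in Table 2, makes any bias toward an over-represented class visible in a way a single accuracy figure cannot.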
At this stage, to evaluate the performance of the CNN against doctors by percentage agreement, three more orthopaedists and two radiologists who were completely unaware of the study participated and independently classified the test group. A success ranking was determined after the results were obtained.
For statistical analysis, Fleiss' kappa, weighted kappa, and the intraclass correlation coefficient (ICC) were computed using the SPSS software package (version 21, IBM, New York, USA). A p value < 0.05 was considered statistically significant31,32,33,34.
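Although the analysis was run in SPSS, Fleiss' kappa is straightforward to compute directly from a subjects × categories count table. A minimal sketch (plain Python, no library assumptions):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table `counts` of shape (subjects x categories),
    where counts[i][j] is how many raters put subject i in category j."""
    n_sub = len(counts)
    n_rat = sum(counts[0])                      # raters per subject
    # Observed per-subject agreement P_i, averaged into P-bar.
    p_i = [(sum(c * c for c in row) - n_rat) / (n_rat * (n_rat - 1))
           for row in counts]
    p_bar = sum(p_i) / n_sub
    # Chance agreement P_e from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_sub * n_rat)
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

The statistic corrects raw agreement for the agreement expected by chance, which is why a table of perfectly consistent ratings yields a kappa of 1.0.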
Hardware and software used in the experimental studies
Experimental studies carried out within the scope of this work were executed on an AMD Athlon X940 Black Edition processor, 16 GB of DDR3 RAM, and an MSI GTX 1070 graphics card. Python 3.6.5 was used as the programming language; the Keras library was used to create the DL architecture, and Keras and NumPy were used for image preprocessing. Evaluation metric values were calculated by the authors using the accuracy, precision, specificity, sensitivity, and F1 score equations.
Ethics and consent to participate
The study protocol was approved by the institutional review board of the local ethics committee (No. 15-01-2020/003). No patient consent was required because of the retrospective nature of the study. Patient data were anonymized in the paper.
Results
According to the classification results, the confusion matrix shown in Table 2 was created. All accuracy, precision, sensitivity, and specificity results for every class and for the whole system were calculated from the information in the confusion matrix. In Table 2, the columns represent the classification results produced by the model, while the rows represent the ground truths of the data. The accuracy, precision, specificity, sensitivity, and F1 score metrics represent the values obtained for each class, and the values in the last column represent the overall result. The reason for using evaluation metrics other than accuracy is to evaluate the performance of the model in each class separately and to show that the model is not biased toward any one class.
The most successful model achieved an accuracy of 83.33% on validation, 76.54% on test images according to the consensus list, and approximately 99.5% on training data. The precision, specificity, sensitivity, and F1 score results are shown in Table 2.
The results of the six orthopaedists and the CNN were analyzed using the Fleiss kappa coefficient overall and the weighted kappa between individual observers to determine interobserver reliability.
The ICC values of each orthopaedist were above 0.8, indicating good or excellent reliability. The ICC of CNN was 0.868, which is one of the highest in the literature. The kappa value for interobserver reliability between CNN and orthopaedists was 0.433–0.607 (Table 3).
We found an overall Fleiss kappa of 0.532, which indicates moderate agreement among all the reviewers, including the CNN. The kappa scores for groups A, B, and C separately were 0.677, 0.354, and 0.565, respectively. The highest interobserver agreement was between DR1 and DR2, and between DR1 and the consensus. The CNN had the third highest agreement with the consensus among the six orthopaedists (Table 4).
To compare the CNN's success with that of the doctors, the rate at which images were correctly labeled by each doctor according to the consensus list was compared with the CNN's success rate. The CNN's classification success rate was greater than that of 9 out of 11 doctors. The CNN classified 62 of 81 (76.5%) images correctly: taken separately, 27/29 (93.1%) in group A, 18/31 (58.1%) in group B, and 17/21 (80.9%) in group C (Table 5).
The distribution of LPC obtained by each doctor and CNN network is given in detail in Fig. 5.
Discussion
In this study, the LPC was performed using DL algorithms. Controversial results regarding the interobserver reliability of the LPC have been reported in the literature; therefore, a DL algorithm that can assist orthopedists was first developed, and then reliability and performance tests were carried out. In this context, hip radiographs representing various stages of LCPD were created using the aforementioned method, and the CNN model was trained on these radiographs. The trained model was tested on 81 hip radiographs belonging to real patients, and the test results of eleven doctors were compared with those of the CNN model.
Several clinical studies of CNNs have been proposed in the medical sciences since 2012. The initial AI applications in medicine have usually focused on image recognition tasks, such as diabetic retinopathy, mammographic lesions, and skin cancer recognition. With varying degrees of success, CNN was used in orthopedics for recognition and segmentation tasks such as fractures of various bones, meniscal tears, knee cartilage lesions, anterior cruciate ligament tears, vertebral body localization and segmentation, bone age assessment, and degenerative spine interpretation35. While various DL models demonstrate expert-level performance even now, they also have the potential to exceed expert physicians in the future. Olczak et al.36 compared five openly available DL networks to identify fracture, laterality, body part, and exam view and published at least 90% accuracy for laterality, body part, and exam view and 83% accuracy for fracture identification. Chung et al.37 trained a deep CNN to detect and classify the proximal humerus fractures using 1891 images. They concluded that CNN showed superior performance over general physicians and orthopedists and similar performance to orthopedists specializing in the shoulder. In complex 3- and 4-part fractures, CNN showed superior performance. Helm et al.38 reviewed the literature and reported that AI helps orthopedic surgeons better serve their patients. Urakawa et al.39 reported in their study that AI was better than five orthopedists for detecting hip fractures. In their systematic review, Langerhuizen et al.40 reported that AI reflected near-perfect prediction for fracture detection in five studies. AI outperformed human examiners for the detection and classification of hip and proximal femur fractures. One study showed equivalent performance for wrist, hand, and ankle fractures.
The accuracy and precision of our CNN model were 76.54% and 76.91%, respectively, compared with the 62% accuracy and 52% precision of the support vector machine classifier applied to the Waldenstrom classification of LCPD20.
LPC has some advantages, including reproducibility, ease of application (requires only an AP radiograph), and high correlation with the outcome. According to many authors, LPC has the highest interclass reliability among the three classification systems6,7,8,9,10,11.
Recent studies demonstrated highly variable reliability; the ICC for LPC was reported as 0.388–0.59612, and as 0.70–0.8541, so the ICC of CNN in this study was one of the highest previously reported. CNN's weighted kappa was compatible with previously reported values of 0.7227, 0.71–0.7941, 0.65–0.7042, 0.56–0.7043, and 0.5109.
Many authors studied the prognostic value of this classification, and a high association was found with the final outcome2,3,4,5. Patients with Group A and Group B below eight years of age are treated symptomatically; patients with Group C and the Group B/C border above eight years of age need surgical containment. However, there is a debate at this point. Some authors thought that there were changes in the groups between the first and last classifications. Lappin et al.11 stated that 75% of group A and 30% of group B cases required upgrading within seven months of initial symptoms. Meurer et al.14 reported the need for upgrading in 30% of cases. Park et al.10 stated that 45% of cases needed upgrades between their initial and final grades. Thus, the initial and final grades of the hip can be different, and this difference may postpone the containment treatment to preserve the shape of the femoral head, especially in patients under eight years of age. Even Akgun et al.40 suggested the use of measurements instead of estimates, especially in borderline cases when LPC was to be used as the prognostic indicator. We predict that CNN will be a more accurate predictive indicator in terms of classification and predicting prognosis, especially in the future, as underlined by Price15.
Our study has some limitations. First, test and training data should never be mixed if the CNN's performance is to be assessed accurately. Similar studies in the literature used many real patient images: Chung et al.37 used 515 normal and 1376 fractured proximal humerus images, and Urakawa et al.39 used a total of 3346 hip images to train their CNN. Considering that the incidence of LCPD ranges from 0.4 to 29 per 100,000 children < 15 years of age44, the probability of finding data close to the numbers in those studies is almost negligible in a two-center study. Thus, we did not have enough labeled radiographs to use separately for both testing and training, and we devised the present method to overcome this problem: the images obtained by modifying normal hip radiographs were used only to train the CNN, while real patients' images were used only for testing. Second, our method only changed the height of the femoral head, particularly in the lateral pillar; the femoral head fragmentation seen in the natural course of LCPD was not present in our images derived from normal hip radiographs. Although we believe this disadvantage may have negatively affected our results, our CNN still performed better than 9 out of 11 doctors, which is very promising. We believe that if we had as many real patient radiographs as the images we produced for training, the performance of the CNN could surpass us all. In addition, our study demonstrates that it is possible to train a CNN and obtain good results using data produced by the aforementioned method, particularly in studies where patient-specific data are insufficient.
Conclusion
It is widely known that large quantities of data are needed during the training of a CNN to ensure successful outcomes. This research presents a novel method for training deep neural networks when data are limited, as with rare medical conditions like LCPD, by employing modified data during training. With this technique, in which only healthy data are used after modification, AI systems can be employed for many rare diseases. Moreover, this is the first study to perform the LPC on X-ray images of LCPD patients. The results indicate that the classification accuracy of the developed model is superior to that of most of the doctors who participated in the study. Since the modified data were constructed with a predetermined pattern structure, we predict that real pattern structures will yield superior results, as they will reflect not just the changes in head height but also the other structural alterations that occur in the femoral head during the disease.
Data availability
The data used during the current study are available from the corresponding author on reasonable request.
Abbreviations
- LCPD: Legg–Calve–Perthes disease
- LPC: Lateral pillar classification
- AI: Artificial intelligence
- CNN: Convolutional neural network
- DL: Deep learning
- ML: Machine learning
- ANN: Artificial neural network
- MRI: Magnetic resonance imaging
- ReLU: Rectified linear unit
- TP: True positive
- TN: True negative
- FP: False positive
- FN: False negative
- ICC: Intraclass correlation coefficient
References
Herring, J. A., Neustadt, J. B., Williams, J. J., Early, J. S. & Browne, R. H. The lateral pillar classification of Legg-Calvé-Perthes disease. J. Pediatric Orthop. 12, 143–150 (1992).
Wiig, O., Terjesen, T. & Svenningsen, S. Prognostic factors and outcome of treatment in Perthes’ disease: A prospective study of 368 patients with five-year follow-up. J. Bone Joint Surg. British 90, 1364–1371 (2008).
Herring, J. A., Kim, H. T. & Browne, R. Legg-Calvé-Perthes disease: Part II: Prospective multicenter study of the effect of treatment on outcome. JBJS 86, 2121–2134 (2004).
Ritterbusch, J. F., Shantharam, S. S. & Gelinas, C. Comparison of lateral pillar classification and Catterall classification of Legg-Calvé-Perthes’ disease. J. Pediatr. Orthop. 13, 200–202 (1993).
Huhnstock, S. et al. Radiographic classifications in Perthes disease: Interobserver agreement and association with femoral head sphericity at 5-year follow-up. Acta Orthop. 88, 522–529 (2017).
Mahadeva, D., Chong, M., Langton, D. J. & Turner, A. M. Reliability and reproducibility of classification systems for Legg-Calvé-Perthes disease: A systematic review of the literature. Acta Orthopædica Belgica 76, 48 (2010).
Sambandam, S. N., Gul, A., Shankar, R. & Goni, V. Reliability of radiological classifications used in Legg–Calve–Perthes disease. J. Pediatric Orthop. B 15, 267–270 (2006).
Kollitz, K. M. & Gee, A. O. Classifications in brief: The Herring lateral pillar classification for Legg-Calvé-Perthes disease. Clin. Orthop. Relat. Res. 471, 2068–2072 (2013).
Podeszwa, D. A., Stanitski, C. L., Stanitski, D. F., Woo, R. & Mendelow, M. J. The effect of pediatric orthopaedic experience on interobserver and intraobserver reliability of the Herring lateral pillar classification of Perthes disease. J. Pediatric Orthop. 20, 562–565 (2000).
Park, M. S., Chung, C. Y., Lee, K. M., Kim, T. W. & Sung, K. H. Reliability and stability of three common classifications for Legg-Calvé-Perthes disease. Clin. Orthop. Relat. Res. 470, 2376–2382 (2012).
Lappin, K., Kealey, D. & Cosgrove, A. Herring classification: How useful is the initial radiograph?. J. Pediatric Orthop. 22, 479–482 (2002).
Agus, H., Kalenderer, O., Eryanlmaz, G. & Ozcalabi, I. T. Intraobserver and interobserver reliability of Catterall, Herring, Salter-Thompson and Stulberg classification systems in Perthes disease. J. Pediatric Orthop. B 13, 166–169 (2004).
Liggieri, A. C., Tamanaha, M. J., Abechain, J. J., Ikeda, T. M. & Dobashi, E. T. Intra and interobserver concordance between the different classifications used in Legg-Calvé-Perthes disease. Revista Brasileira de Ortopedia 50, 680–685 (2015).
Meurer, A., Schwitalle, M., Humke, T., Rosendahl, T. & Heine, J. Vergleich der prognostischen Aussagefähigkeit der Catterall-und Herring-Klassifikation bei Patienten mit Morbus Perthes. Z. Orthop. Ihre Grenzgeb. 137, 168–172 (1999).
Price, C. T. The lateral pillar classification for Legg-Calve-Perthes disease. J. Pediatric Orthop. 27, 592–593 (2007).
Memiş, A., Varlı, S. & Bilgili, F. A novel approach for computerized quantitative image analysis of proximal femur bone shape deformities based on the hip joint symmetry. Artif. Intell. Med. 115, 102057 (2021).
Memiş, A., Varlı, S. & Bilgili, F. Fast and accurate registration of the proximal femurs in bilateral hip joint images by using the random sub-sample points. IRBM 43, 130–141 (2022).
Bugeja, J. M. et al. Automated volumetric and statistical shape assessment of cam-type morphology of the femoral head-neck region from 3D magnetic resonance images. arXiv preprint arXiv:2112.02723 (2021).
Memiş, A., Varlı, S. & Bilgili, F. Semantic segmentation of the multiform proximal femur and femoral head bones with the deep convolutional neural networks in low quality MRI sections acquired in different MRI protocols. Comput. Med. Imaging Graph. 81, 101715 (2021).
Memiş, A., Varlı, S. & Bilgili, F. Automatic classification of the Waldenstrom stages of Legg-Calve-Perthes disease from the 2D total proximal femur shape deformity. In 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA) (pp. 1–6). IEEE (2021).
Davison, A. K., Cootes, T. F., Perry, D. C., Luo, W., Medical Student Annotation Collaborative, & Lindner, C. Perthes disease classification using shape and appearance modelling. In Computational Methods and Clinical Applications in Musculoskeletal Imaging: 6th International Workshop, MSKI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers 6 (pp. 86–98). (2018).
Wong, S. C., Gatt, A., Stamatescu, V. & McDonnell, M. D. Understanding data augmentation for classification: When to warp? In 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (pp. 1–6). IEEE. https://doi.org/10.1109/DICTA.2016.7797091 (2016).
Chartrand, G. et al. Deep learning: A primer for radiologists. Radiographics 37, 2113–2131 (2017).
Liu, X. Y., Fang, Y., Yang, L., Li, Z., & Walid, A. High-performance tensor decompositions for compressing and accelerating deep neural networks. In Tensors for Data Processing (pp. 293–340). Academic Press (2022). https://doi.org/10.1016/B978-0-12-824447-0.00018-2
Arulkumaran, K., Deisenroth, M. P., Brundage, M. & Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 34, 26–38 (2017).
Srivastava, N. & Salakhutdinov, R. R. Multimodal learning with deep Boltzmann machines. Adv. Neural Inf. Process. Syst. 25 (2012).
Alom, M. Z. et al. A state-of-the-art survey on deep learning theory and architectures. Electronics 8, 292 (2019).
Ide, H., & Kurita, T. Improvement of learning for CNN with ReLU activation by sparse regularization. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 2684–2691). IEEE (2017) https://doi.org/10.1109/IJCNN.2017.7966185
Baldi, P. & Sadowski, P. The dropout learning algorithm. Artif. Intell. 210, 78–122 (2014).
Şen, S. Y., & Özkurt, N. Convolutional neural network hyperparameter tuning with adam optimizer for ECG classification. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU) (pp. 1–6). IEEE (2020). https://doi.org/10.1109/ASYU50717.2020.9259896
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).
Cohen, J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213 (1968).
Koo, T. K. & Li, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163 (2016).
Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378 (1971).
Chea, P. & Mandell, J. C. Current applications and future directions of deep learning in musculoskeletal radiology. Skeletal Radiol. 49, 183–197 (2020).
Olczak, J. et al. Artificial intelligence for analyzing orthopedic trauma radiographs. Acta Orthop. 88, 581–586 (2017).
Chung, S. W. et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop. 89, 468–473. https://doi.org/10.1080/17453674.2018.1453714 (2018).
Helm, J. M. et al. Machine learning and artificial intelligence: Definitions, applications, and future directions. Curr. Rev. Musculoskelet. Med. 13, 69–76 (2020).
Urakawa, T. et al. Detecting intertrochanteric hip fractures with orthopedist-level accuracy using a deep convolutional neural network. Skeletal Radiol. 48, 239–244 (2019).
Langerhuizen, D. W. G. et al. What are the applications and limitations of artificial intelligence for fracture detection and classification in orthopaedic trauma imaging? A systematic review. Clin. Orthop. Relat. Res. 477, 2482–2491. https://doi.org/10.1097/CORR.0000000000000848 (2019).
Herring, J. A., Kim, H. T. & Browne, R. Legg-Calvé-Perthes disease: Part I: Classification of radiographs with use of the modified lateral pillar and Stulberg classifications. JBJS 86, 2103–2120 (2004).
Akgun, R. et al. The accuracy and reliability of estimation of lateral pillar height in determining the Herring grade in Legg-Calve-Perthes disease. J. Pediatr. Orthop. 24, 651–653 (2004).
Wiig, O., Terjesen, T. & Svenningsen, S. Inter-observer reliability of radiographic classifications and measurements in the assessment of Perthes’ disease. Acta Orthop. Scand. 73, 523–530 (2002).
Loder, R. T. & Skopelja, E. N. The epidemiology and demographics of Legg-Calvé-Perthes' disease. Int. Sch. Res. Not. https://doi.org/10.5402/2011/504393 (2011).
Funding
No financial support was received for this study.
Author information
Authors and Affiliations
Contributions
Concept – Z.S., Y.S., S.K., Y.A.K., M.T., T.S., A.S.A., F.B., C.S.; Design – Z.S., T.K., T.I.; Supervision – Z.S., M.T., S.K., T.S., F.B., C.S.; Materials – Z.S., Y.S., S.K., Y.A.K., A.S.A., F.B., C.S.; Data Collection and/or Processing – Z.S., Y.S., M.T., S.K., T.S., F.B., C.S.; Analysis and/or Interpretation – Z.S., Y.S., M.T., S.K., T.S., F.B., C.S.; Literature Review – Z.S., Y.S., M.T., S.K., T.S., F.B., C.S.; Writing – Z.S., Y.S., S.K., Y.A.K., M.T., T.S., A.S.A., F.B., C.S.; Critical Review – Z.S., Y.S., S.K., Y.A.K., M.T., T.S., A.S.A., F.B., C.S.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Soydan, Z., Saglam, Y., Key, S. et al. An AI based classifier model for lateral pillar classification of Legg–Calve–Perthes. Sci Rep 13, 6870 (2023). https://doi.org/10.1038/s41598-023-34176-x
This article is cited by
- Femoral head vascular status in early-stage Legg–Calvé–Perthes disease assessed by contrast-enhanced magnetic resonance imaging: comparison with the contralateral side. Orphanet Journal of Rare Diseases (2025)
- Deep learning-assisted screening and diagnosis of scoliosis: segmentation of bare-back images via an attention-enhanced convolutional neural network. Journal of Orthopaedic Surgery and Research (2025)
- Artificial intelligence and pediatric imaging data: ethical strategies for learning and collaboration. Pediatric Radiology (2025)
- Ethical guidance for reporting and evaluating claims of AI outperforming human doctors. npj Digital Medicine (2024)