Introduction

The human spine is a complex structure that plays a vital role in the movement, protection, and support of the body. It is made up of 33 individual vertebrae: 7 cervical vertebrae in the neck, 12 thoracic vertebrae in the upper back, 5 lumbar vertebrae in the lower back, 5 sacral vertebrae fused together to form the sacrum, and 4 coccygeal vertebrae fused together to form the coccyx1,2. The vertebrae have a complex structure and function, making it very important to follow proper spine biomechanics to avoid unwanted effects on the body3,4. Figure 1 illustrates the entire spinal column, including its five regions.

Fig. 1

Human spine anatomy: C1–C7 cervical vertebrae, T1–T12 thoracic vertebrae, L1–L5 lumbar vertebrae4.

As reported by the American Journal of Public Health in 20165, spinal cord injury is the leading cause of paralysis after stroke. Lower back pain affects 80% of the population4, making it the primary reason Americans visit doctors3 and accounting for over $50 billion per year in medical care. In 2013, the World Health Organization (WHO) reported that more than 400,000 people experience spinal injuries and deformities every year4. These statistics highlight the significant impact that spinal health has on individuals and society as a whole.

According to the literature, spondylosis is most commonly found among the elderly and middle-aged populations. However, with the popularity of electronic devices and prolonged sitting for education and work, its prevalence is increasing among younger populations as well. Degenerative spinal diseases occur across populations and can cause compression or traction of nerve roots, leading to radicular symptoms such as back pain or leg pain. Additionally, these conditions may cause deformities, most commonly at the C4-C7 and L4-S1 levels6. Spinal deformities can be diagnosed, and correlated with pain symptoms, by conducting imaging tests that determine the condition of the vertebral column. However, this process can be difficult, and clinicians sometimes need manual methods or computer-aided diagnostic tools to reach a decision.

Localization of the vertebrae that make up the spine is key to diagnostic tasks such as calculating Cobb angles, detecting vertebral fractures, and identifying other abnormalities7. Vertebra localization from X-ray images is tedious, as it involves manually determining the corner points. The irregular shape and structure of the vertebrae, which vary among individuals, make it difficult to specify their exact locations in X-ray images.

Automated methods are very helpful for large-scale screening because they are faster and more accurate than manually identifying each vertebra. These methods use a set of 68 points, or landmarks, for vertebrae identification. Automatic vertebral landmark localization has been studied for many years, but it remains challenging due to the significant ambiguity and variability in X-ray images8. Detecting vertebral pathologies makes it possible to stop the progression of spine-related illnesses at the initial stages of treatment, while also providing doctors with vital information for creating a treatment plan.

Artificial Intelligence (AI) provides doctors and clinicians with a powerful tool for quicker and more accurate disease diagnosis9. Among the various AI techniques, Convolutional Neural Networks (CNNs) are particularly effective for image segmentation; U-Net and fully convolutional networks are examples of CNN models that have shown remarkable success in this task10,11. These segmentation architectures are based on an encoder-decoder structure, which helps the models learn to encode and decode images more accurately, and such deep learning models have been widely utilized in medical image analysis and classification12,13. However, CNNs do not capture long-range dependencies and global context in images, which can limit segmentation accuracy. This limitation can be overcome by using transformers to capture global dependencies14.

In developing countries such as Pakistan, there is a severe shortage of doctors relative to the patient population, making it critical to promote research in this area. The purpose of this study is to develop an application that medical professionals can use to improve the accuracy of spine curvature diagnosis. However, due to limited resources in the country, only a few researchers are currently working on biomedical applications, leaving information inaccessible to many people.

We have organized our paper as follows: In Section 2, we provide a comprehensive review of the literature on vertebral identification, localization, and classification. Section 3 covers the models and techniques that can be used to accurately locate vertebrae, along with their strengths and weaknesses. In Section 4, we present our findings and provide a detailed comparison of the different methods we tested. Finally, in Section 5, we summarize the conclusions of our study and provide suggestions for future research.

Literature review

As the spine is a crucial and sensitive structure in the human body, it is susceptible to various musculoskeletal disorders that can affect the spinal cord. In clinical practice, physical examinations are used to diagnose spine disorders and abnormalities; they include observation, palpation, and functional movement tests. Additionally, neurological evaluations are conducted, which involve assessing symptoms, the severity of pain, muscle strength and weakness, changes in bladder patterns, sensation, and motor function. Radiography is one of the most popular and reliable means of diagnosing spinal problems and assessing the severity of spinal conditions, and imaging tests such as X-rays, CT scans, and MRI can provide more accurate diagnoses.

To date, numerous researchers have developed automated systems to diagnose various diseases using artificial intelligence (AI), particularly in the L5 to S1 or L4 to L5 areas of the spine. This area is often subjected to heavy mechanical stress, which can lead to slips. While many approaches have been used in the past for vertebra localization, segmentation, and classification, there still exists a gap in the literature regarding the latest diagnostic methods. Scientists continue to explore new ways to diagnose spine problems with excellent accuracy.

The literature suggests that deep learning enables automated spine image analysis that readily identifies injured or diseased parts of the spine. Researchers have recently applied landmark identification techniques to spinal images15,16,17,18,19,20,21,22,23,24,25. In MVC-Net (Multi-View Convolution Net)25, features from AP (anterior-posterior) and LAT (lateral) X-rays were combined into multi-view convolution layers, which were then used to extract global spinal information. Building on this idea, MVE-Net (Multi-View Extrapolation Net)16 proposed a function to speed up convergence and obtain more accurate results. AEC-Net (Adaptive Error Correction Net)17 used a kernel-based landmark detector on the border features of spines to predict landmarks. Bayat et al.26 developed a residual corrector component for landmark identification.

Safari et al.27 established a semi-manual method for estimating the Cobb angle. The ROI is extracted via contrast stretching of the X-ray; manual landmarking establishes the curvature of the spine, to which a fifth-order polynomial curve is fitted. The final step is to calculate the morphologic curve and approximate the angle from the curve's intersection sites. According to the paper, the agreement obtained between the angles is 0.8. In28, a new regression method is presented for estimating the angle from spinal X-ray images. The approach consists of two modules: the first calculates Cobb angles from landmarks, while the second employs curve properties in a direct approach to Cobb calculation. An error correction network then extrapolates the output of both modules to resolve the variation in Cobb angles between the two networks. The results were measured on 581 spinal anterior-posterior X-ray images, with an error of 4.90° in the Cobb angle. In29, Kang Cheol Kim et al. presented a method to detect scoliosis using X-ray images, also discussing the limitations of labor-intensive manual methods, which are time-consuming and tedious. Their technique has three main parts: a localization map is used in the first; the slope of each vertebra is estimated in the second using the vertebral-tilt field; and the Cobb angle is determined in the third using the vertebral centroids. It achieves a Cobb angle CMAE of 3.51° and a SMAPE of 7.84%.

The two main families of methods for vertebral landmark localization are regression-based and heatmap-based approaches. The heatmap approach is an effective way of localizing vertebral landmarks: a heatmap is typically generated from posterior-anterior X-ray images using a convolutional neural network (CNN), and local maximum responses in the heatmap determine the vertebral landmarks. Zhang et al.30 predict the four corner points (top left, top right, bottom left, and bottom right) in four different heatmaps with the help of four different fully connected layers. By analyzing the predicted heatmaps, landmark coordinates can be obtained indirectly. Yi et al.31 first mapped the 17 center points of the spine onto heatmaps and then regressed the corner points based on their numerical coordinates using shape regression. Although these heatmap-based methods are successful, they share a common flaw: vertebra-like structures become dilated and landmarks go missing due to dilution and false positives. These inaccurate predictions yield improper clinical parameters, which cause misdiagnosis.

The transformer, initially put forth in the field of natural language processing, has lately flourished in a variety of computer vision fields, including landmark localization. Most, if not all, transformer-based approaches follow the same pipeline, generating heatmaps from which landmark coordinates are obtained. By incorporating a transformer into a CNN, Yang et al.32 implicitly captured the long-range spatial correlations between human body components, which were subsequently decoded into heatmaps for human pose estimation.

Zhao et al.33 combined the strengths of the graph convolution network and the transformer to characterize a set of previously localized 2D landmarks and map them into their equivalent 3D counterparts; their research focused on 3D landmark localization. These transformer-based approaches need a lot of training data and are also expensive to run, because they must compute similarity scores for each pair of landmarks. Tao and Zheng34 developed a transformer to detect vertebrae in 3D CT images. However, localizing vertebrae in 2D X-ray images is harder than in 3D CT images because of the tissues superimposed on the bones. How well transformers can localize vertebral landmarks is still being studied and has not yet been fully demonstrated.

Another approach to landmark localization is regression-based, which utilizes an end-to-end CNN model to directly regress the coordinates. Sun et al.35 put forward a structured support vector regression method that enables numerous outputs to simultaneously share comparable sparsity patterns. Transformers are often considered a more flexible alternative to CNNs for various visual tasks, even though they were originally designed for sequence-to-sequence modeling in natural language processing (NLP). Several visual benchmarks show that deep neural networks with transformers perform better than those without. In medical image segmentation, transformers are used for their powerful representation capabilities; for example, Chen et al. and Wang et al. used transformers to extract global context36,37,38,39,40,41.

It is evident from the literature reviewed above that vertebrae localization and segmentation is an important area for extracting the ROI for deformity analysis using neural networks. One of the major gaps in the treatment of scoliosis is a method for classifying shape-based differences in curvature, i.e., assessing normal, C-shaped, and S-shaped curvatures. Recent developments in neural networks allow object detection with a clearly defined bounding box indicating the region of interest, in contrast to traditional neural networks that perform semantic segmentation. This combines a modified form of image classification with localization, predicting both the object's location and its class. The main difference is that the detector sees the entire image at once and generates region proposals based on that information; by collecting contextual information, it prevents false positives and is faster. The fundamental goal of these studies is to relieve doctors of manual labelling work. Researchers have used a number of image processing and machine learning techniques to identify different lumbar deformities through the analysis of spine images. Recently, deep learning methods have improved the accuracy of medical diagnoses, but they have difficulty determining disease severity when scans contain low-resolution images or when bones are not visible. The analysis of lumbar deformities relies on accurate vertebra positions to determine deformity severity, which can be difficult if there is even a small variation in the positions of the vertebrae.

In this paper, we propose a technique to localize and segment the vertebrae of the spine using a network of transformers. The transformer network uses information from the encoding network (which encodes the high-level features of the bones), the decoding network (which decodes the fine-grained features of the bones), and the global dependencies network (which takes into account how bones are connected). This study introduces a new method for calculating spinal curvature that performs as well as experts. A structured comparison of various existing techniques is presented in tabular form, highlighting key differences; this format enables a succinct and logical presentation of the comparative analysis, making it easier to comprehend the gaps between different approaches. Table 1 summarizes the literature.

Table 1 A summarized overview of existing techniques related to spine analysis.

Bearing in mind all of these shortcomings and difficulties, the main contributions of this research work are as follows.

  1. This study proposes a unified transformer-based framework to extract vertebrae and identify spine deformities.

  2. A detailed statistical analysis has been conducted to identify the best set of features for determining the type of scoliosis.

  3. The proposed network combines self-supervised learning, a self-attention mechanism, and feature fusion to provide reliable information for diagnosis and accurate analysis of the spine.

Materials

This section details the materials. For the study, we used the Accurate Automated Spinal Curvature Estimation MICCAI 2019 (AASCE MICCAI 2019) challenge dataset, which contains a total of 609 anterior-posterior X-ray images covering the thoracic-lumbar region of the spine. The dataset is openly accessible and is divided into a 60-20-20% split: 365 training, 122 validation, and 122 test images. Each X-ray image has 68 points that correspond to 17 vertebrae; every vertebra has four points corresponding to its four corners (top left, top right, bottom left, bottom right). The input image size is \(2048 \times 1024\). One of the preliminary steps in training deep learning models is data annotation and labelling, which was performed under the supervision of a professional radiologist. Figure 2 shows a sample dataset image: Fig. 2(A) shows the original image, Fig. 2(B) the bounding boxes of the seventeen vertebrae, Fig. 2(C) the polygons, and Fig. 2(D) the mask.
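Since every image carries its 68 landmarks in a fixed order (17 vertebrae × 4 corners), the annotation can be grouped with a few lines of NumPy. The sketch below is illustrative only: the file name and the one-(x, y)-pair-per-row layout are assumptions, not the dataset's documented format.

```python
import numpy as np

def load_vertebra_corners(landmark_file: str) -> np.ndarray:
    """Group 68 landmarks into an array of shape (17, 4, 2):
    17 vertebrae x 4 corners x (x, y)."""
    points = np.loadtxt(landmark_file)  # assumed layout: one "x y" pair per row
    assert points.shape == (68, 2), "expected 68 landmarks per image"
    # Assumed corner order per vertebra: top-left, top-right,
    # bottom-left, bottom-right.
    return points.reshape(17, 4, 2)

# corners = load_vertebra_corners("image_001.txt")  # hypothetical file name
# print(corners[0])  # four corners of the topmost vertebra
```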

Fig. 2

Sample images from dataset42. (A) Original image (B) Bounding boxes of seventeen vertebrae (C) Polygons of seventeen vertebrae (D) Mask of the thoracic-lumbar region of the spine.

Methodology

This section details the proposed framework, which consists of four key stages: vertebrae localization, spine segmentation, center profile generation, and shape analysis, as discussed below. A sketch of the end-to-end flow follows the stage list, and the whole methodology for this study is shown in Fig. 3.

  • Vertebrae Localization: We leverage a Detection Transformer (DETR) to predict bounding boxes for each vertebra within an image. This approach effectively locates individual vertebrae.

  • Spine Segmentation: To accurately isolate the spinal region, we employ a SegFormer Transformer for image segmentation. This step refines the region of interest for further analysis.

  • Center Profile Generation: Based on the predicted bounding boxes and the segmented mask, the centroids of each vertebra are calculated. This information is then used to generate the central profile of the spine, representing its curvature.

  • Shape Analysis: Features are extracted from the generated spinal profile. These features are then employed to categorize the spine into three classes: normal, single-bend (C-shaped), and double-bend (S-shaped).
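To make the data flow between the four stages concrete, the following minimal sketch wires them together. The stage functions are passed in as callables because each one is detailed in its own subsection below; the signatures are illustrative assumptions, not a fixed API.

```python
from typing import Callable
import numpy as np

def analyze_spine(
    image: np.ndarray,
    localize: Callable[[np.ndarray], np.ndarray],    # Stage 1: DETR boxes, one per vertebra
    segment: Callable[[np.ndarray], np.ndarray],     # Stage 2: SegFormer binary spine mask
    center_profile: Callable[[np.ndarray, np.ndarray], np.ndarray],  # Stage 3
    classify: Callable[[np.ndarray], str],           # Stage 4: features -> class label
) -> str:
    """Run the four-stage pipeline; returns 'normal', 'C-shaped', or 'S-shaped'."""
    boxes = localize(image)
    mask = segment(image)
    profile = center_profile(boxes, mask)
    return classify(profile)
```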

Fig. 3

Proposed methodology. The subimages of spines are taken from dataset42.

Vertebrae localization

The first phase of our proposed methodology is accurate vertebrae localization, which is crucial for spinal analysis in clinical settings. However, automatic methods face challenges due to variations in image quality, vertebrae size, and spinal curvature. Traditional object detection methods often rely on region proposal networks (RPNs) that struggle with these variations. We make use of a recent deep learning approach called DETR (Detection Transformer) for vertebrae localization. DETR is a transformer-based architecture that directly predicts bounding boxes for objects in an image, bypassing the need for RPNs. This makes DETR well-suited for vertebrae localization owing to its modeling of long-range dependencies (LRD) and its set-based prediction. Figure 4 shows the architecture of DETR, which consists of three main components:

Fig. 4

Architecture of detection transformer. The spine image shown is from dataset42.

  • CNN Backbone: A pre-trained convolutional neural network (CNN) is used as the backbone for feature extraction. The CNN extracts hierarchical features from the input image, capturing local details and spatial information. In our case, we use ResNet50 to extract and generate the feature maps from the input images.

  • Transformer Encoder-Decoder: The extracted features are fed into a transformer encoder-decoder architecture.

    • Encoder: The encoder stack utilizes self-attention layers to analyze the relationships between features from different image regions. This allows the model to learn contextual information about the entire image.

    • Decoder: The decoder uses a set of learnable object queries and attention mechanisms to predict bounding boxes and class labels. The decoder attends to relevant features from the encoder output based on the object queries, resulting in accurate bounding box predictions for each vertebra.

Unlike grid-based object detectors like YOLO, which predict bounding boxes and class probabilities independently for each grid cell, DETR leverages a transformer-based architecture for simultaneous class and bounding box prediction. This is achieved through a set-based prediction method.

DETR makes use of a set-based prediction approach. It uses a predetermined number of object queries and associates them with bounding boxes through a decoder network. Each bounding box's coordinates (x, y, width, and height) are predicted relative to a reference set of positional encodings, which provide location information within the image. The key elements for this are:

  • pred_boxes: Represent the predicted bounding box coordinates (x, y, width, height) for each object query.

  • pos_embeddings: Provide positional information for each object query relative to the image.

The predicted bounding boxes are computed using the DETR decoder function given in Eq. (1).

$$\begin{aligned} pred\_boxes = DETR\_Decoder(features, pos\_embeddings) \end{aligned}$$
(1)

Where features represents the output features from the transformer encoder, capturing contextual information from the input image, while pos_embeddings provide positional information for each object query.

  • Prediction feed-forward networks (FFNs): The final prediction in DETR is computed by a 3-layer perceptron with a Rectified Linear Unit (ReLU) activation function and a hidden dimension d, followed by a linear projection layer. The feed-forward network (FFN) predicts the normalized center coordinates, height, and width of the bounding box relative to the input image. The linear layer then uses a softmax function to predict the class label, indicating the object category of the detected object.
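As a concrete illustration, the snippet below runs a DETR detector through the Hugging Face transformers API. It is a minimal sketch rather than our exact training code: the checkpoint shown is the public COCO-pretrained facebook/detr-resnet-50, which would still have to be fine-tuned on the vertebra annotations before it detects vertebrae; the file name is hypothetical, and the 0.9 threshold matches the confidence score used in the next section.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=1,                  # a single "vertebra" class for fine-tuning
    ignore_mismatched_sizes=True,  # replace the 91-class COCO prediction head
)

image = Image.open("spine_xray.png").convert("RGB")  # hypothetical file name
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep only detections above the 0.9 confidence threshold.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for box, score in zip(results["boxes"], results["scores"]):
    print(f"vertebra at {box.tolist()} (score {score:.2f})")
```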

Center profile generation

This section details the center profile generation process. Vertebra localization was performed using the detection transformer, with the confidence score threshold set to 0.9 to obtain accurate predicted boxes. From vertebra localization we have a predicted bounding box for each vertebra, and we calculate the center point of each vertebra using Eq. (2).

$$\begin{aligned} Center(x,y) = \left( \frac{x_{min} + x_{max}}{2} , \frac{y_{min} + y_{max}}{2} \right) \end{aligned}$$
(2)

Where \((x_{min}, y_{min})\) is the top-left corner and \((x_{max}, y_{max})\) is the bottom-right corner of the vertebra's bounding box. The predicted and ground-truth centroids are shown in Fig. 5, where red and blue indicate centroids calculated from ground-truth and predicted vertebrae, respectively.
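A direct NumPy rendering of Eq. (2), with the 0.9 confidence filter applied first, might look as follows; the array shapes are assumptions for illustration.

```python
import numpy as np

def vertebra_centers(boxes: np.ndarray, scores: np.ndarray,
                     conf_threshold: float = 0.9) -> np.ndarray:
    """Eq. (2) centroids for boxes above the confidence threshold.

    boxes  : (N, 4) array of [x_min, y_min, x_max, y_max]
    scores : (N,) detector confidence scores
    """
    kept = boxes[scores >= conf_threshold]
    cx = (kept[:, 0] + kept[:, 2]) / 2.0
    cy = (kept[:, 1] + kept[:, 3]) / 2.0
    centers = np.stack([cx, cy], axis=1)
    # Sort top to bottom so the centers trace the spine's center profile.
    return centers[np.argsort(centers[:, 1])]
```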

Fig. 5

Vertebrae centroids on dataset images42. (A) Red dots show the centroids calculated from ground truth bounding boxes. (B) Blue dots show the centroids calculated from predicted bounding boxes.

Vertebra segmentation was performed using the SegFormer transformer. The center profile of the spine for the segmented images was generated using a morphological thinning operation, as shown in Fig. 6.
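As one possible implementation, scikit-image provides a standard morphological thinning via skeletonize. The sketch below assumes a binary mask from the segmentation model and is illustrative, not our exact post-processing code.

```python
import numpy as np
from skimage.morphology import skeletonize

def spine_center_profile(mask: np.ndarray) -> np.ndarray:
    """Thin a binary spine mask to a one-pixel-wide center line.

    mask : 2-D 0/1 array predicted by the segmentation model.
    Returns the (row, col) coordinates of the thinned curve, top to bottom.
    """
    skeleton = skeletonize(mask.astype(bool))
    rows, cols = np.nonzero(skeleton)
    order = np.argsort(rows)  # order points from the top of the image down
    return np.stack([rows[order], cols[order]], axis=1)
```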

Fig. 6

Spine curvature created through morphological thinning on predicted segmentation mask.

Spine segmentation

This section details the spine segmentation process, a crucial step for computer-aided diagnosis (CAD) systems in identifying spinal conditions like lordosis, kyphosis, scoliosis, and degenerative disc disease. Accurate segmentation also provides essential input for subsequent shape analysis and classification modules. While medical images offer high contrast for bone structures, precisely identifying the intricate patterns and boundaries of vertebrae remains a challenge. To address this, we make use of the recent SegFormer Transformer architecture.

SegFormer

SegFormer’s ability to combine the strengths of Transformers and CNNs makes it well-suited for semantic segmentation. This fusion empowers the model to capture both local details (through CNNs) and global context (through Transformers), leading to superior segmentation accuracy. A key innovation lies in its use of self-attention mechanisms within Transformers, enabling the capture of complex relationships between image regions43. This translates to high segmentation accuracy while maintaining efficiency.

SegFormer’s architecture offers several advantages for spine segmentation. The key innovation lies in its hierarchical Transformer encoder, which extracts features at multiple scales. This allows the model to capture both fine-grained details (local information) and broader contextual information (global information) from the input image, leading to more accurate segmentation. Additionally, SegFormer eliminates the need for positional encoding, a technique commonly used in Transformers that can suffer from performance drops when the testing resolution differs from the training resolution; this elimination improves efficiency and robustness. Finally, SegFormer employs a lightweight Multi-Layer Perceptron (MLP) decoder that effectively combines information from various encoder layers. This simple design promotes efficient segmentation using Transformers. Figure 7 shows the detailed architecture of the SegFormer transformer.

Fig. 7

Architecture of SegFormer transformer.

Working of SegFormer

The SegFormer architecture takes an image of size \(H \times W \times 3\) and first partitions it into patches of dimensions \(4 \times 4\). These patches then serve as input to the hierarchical Transformer encoder, which extracts multi-level features at 1/4, 1/8, 1/16, and 1/32 of the original image resolution. The encoder thus produces feature maps at four distinct scales, each with the input dimensions reduced by a factor of 4, 8, 16, or 32; the output consists of four feature maps with dimensions (H/k, W/k, C), where C is the embedding dimension. This arrangement capitalizes on the self-attention mechanism inherent in Transformers, which tends to capture local features in early layers and global features later in the network. SegFormer then leverages these outputs to integrate both local and global information for segmentation. Each SegFormer block contains a sequence of smaller blocks:

  1. An efficient self-attention layer

  2. A mix-feed forward network (Mix-FFN)

  3. An overlapped patch merging layer

The resulting multi-level features are then forwarded to the All-MLP decoder for the prediction of the segmentation mask. The All-MLP decoder proposed in SegFormer operates in four key steps.

  1. It takes multi-level features from the SegFormer encoder and consolidates them into a unified layer, adjusting their dimensions using an MLP.

  2. Then, it upscales and combines these features, preserving their dimensionality.

  3. Following that, it integrates the features through another MLP layer, ensuring dimensional coherence.

  4. Ultimately, it employs a final MLP layer to produce the final segmentation mask, delineating the spatial distribution of the various objects within the image, with dimensions corresponding to the original image size scaled down by a factor of 4.

This segmentation mask is generated at a resolution of \(H/4 \times W/4 \times N_{cls}\), where \(N_{cls}\) denotes the number of categories within the dataset. Table 3 shows the parameters used for training the SegFormer architecture.
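For illustration, the sketch below loads and applies a SegFormer model through the Hugging Face transformers API, starting from the checkpoint named in the Experimental setup section. The two-label (background/spine) head and the file name are assumptions, and a checkpoint fine-tuned on the AASCE data would be needed to produce meaningful masks.

```python
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

checkpoint = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(checkpoint)
model = SegformerForSemanticSegmentation.from_pretrained(
    checkpoint,
    num_labels=2,                  # assumed binary task: background / spine
    ignore_mismatched_sizes=True,  # replace the 150-class ADE20K head
)

image = Image.open("spine_xray.png").convert("RGB")  # hypothetical file name
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, num_labels, H/4, W/4)

# Upsample the H/4 x W/4 prediction back to the input resolution.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
mask = upsampled.argmax(dim=1)[0].numpy()  # predicted spine mask
```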

Shape analysis

The analysis of spinal column deformities is crucial in the context of spine curvature assessment. Traditional research on scoliosis has identified two significant types of sideways curvature, known as S-shape and C-shape. As the disease progresses, the severity of these curvatures increases, often requiring surgical intervention. Timely detection of abnormal curvatures is vital, as it enables prompt treatment and aids in preventing further deterioration.

For the shape analysis, we used the center profile of the spine. Five features were calculated from the center profile generated through the vertebrae localization and segmentation results: Feature 1, the absolute difference between extreme center points (ADBECP); Feature 2, the segment-wise mean of extreme points (SWMEP); Feature 3, the segment-wise standard deviation of extreme points (SWSDEP); Feature 4, the mean absolute gradient magnitude (MAGM); and Feature 5, the mean absolute gradient phase (MAGP). All of these features are motivated by clinical findings. ADBECP reflects the global curvature of the spine since, in scoliosis, the spine deviates laterally. SWMEP provides a localized curvature profile for spine segments and helps detect which section of the spine deviates from normal. SWSDEP measures the variability of extreme points, suggesting irregular curvature within a segment. MAGM, the mean of the absolute gradient magnitude, reflects the rate of change in curvature along the spine. Finally, MAGP measures the average angular direction of curvature, providing insight into the directional flow of the spine's curvature and helping to differentiate single-curve from double-curve scoliosis. The calculations related to these features are as follows.

Absolute difference between extreme center points (ADBECP)

Extreme points are the maximum and minimum x-coordinates of the center profile. The absolute difference of these extreme points is calculated using Eq. (3),

$$\begin{aligned} \text{Absolute difference} = |\max (\text{center-profile}_x) - \min (\text{center-profile}_x)| \end{aligned}$$
(3)

Where \(\text{center-profile}_x\) denotes the x-coordinates of the center points of the spine. Figure 8 shows the deviation of the spine from the mean position. A normal spine is not deviated to either side (Fig. 8-A), a C-shaped spine is deviated to one side, either left or right (Fig. 8-B), and an S-shaped spine is deviated to both sides, left and right (Fig. 8-C).
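In code, Eq. (3) reduces to a one-line NumPy computation over the x-coordinates of the center profile:

```python
import numpy as np

def adbecp(center_profile_x: np.ndarray) -> float:
    """Eq. (3): absolute difference between the extreme x-coordinates of the
    center profile. Larger values indicate stronger lateral deviation."""
    return float(np.abs(center_profile_x.max() - center_profile_x.min()))
```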

Fig. 8

Deviation of spine from mean position shown on dataset images42. (A) Normal spine: minimal deviation from the mean position. (B) C-shaped: one-side deviation, either left or right. (C) S-shaped: two-side deviation, both left and right.

Segment-wise mean of extreme points (SWMEP)

Each image in the AASCE MICCAI 2019 challenge dataset has 17 vertebrae. We divide the predicted vertebrae into four segments, vertebrae 1-4, 5-8, 9-12, and 13-17, as shown in Fig. 9, in order to capture the local-level variation of the spine.

For each segment, the absolute difference of the extreme points (max and min) was calculated, since the extreme points differ for each segment. This yields four extreme-point difference values per image, whose mean is calculated using Eq. (4),

$$\begin{aligned} \text{ Mean } = \frac{\sum _{seg =1}^{4} seg(max-min) }{\text{ Total } \text{ segments }} \end{aligned}$$
(4)

Where seg1 is vertebrae 1-4, seg2 is vertebrae 5-8, seg3 is vertebrae 9-12, and seg4 is vertebrae 13-17. The deviation of the extreme points from the mean position for each segment is shown in Fig. 9-C.

Fig. 9

Segment-wise extreme points shown on dataset images42. (A) Original image with predicted center profile, showing deviation of the extreme points from the mean position. (B) Division of the original image into four segments. (C) Deviation of each segment's extreme points from the mean position.

Segment wise standard deviation of extreme points (SWSDEP)

The absolute difference of the extreme points (max and min) was calculated for all four segments. From these values, the standard deviation was calculated to capture the local-level variation of the spine.
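Since SWMEP and SWSDEP share the same per-segment extreme-point differences, both can be computed from one helper. The sketch below assumes the 17 vertebra-center x-coordinates are ordered from top to bottom:

```python
import numpy as np

# Segment boundaries from the paper: vertebrae 1-4, 5-8, 9-12, 13-17.
SEGMENTS = [(0, 4), (4, 8), (8, 12), (12, 17)]

def segment_ranges(center_x: np.ndarray) -> np.ndarray:
    """max - min of the x-coordinates within each of the four segments."""
    return np.array([center_x[a:b].max() - center_x[a:b].min()
                     for a, b in SEGMENTS])

def swmep(center_x: np.ndarray) -> float:
    """Eq. (4): segment-wise mean of the extreme-point differences."""
    return float(segment_ranges(center_x).mean())

def swsdep(center_x: np.ndarray) -> float:
    """Segment-wise standard deviation of the extreme-point differences."""
    return float(segment_ranges(center_x).std())
```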

Mean absolute gradient magnitude (MAGM)

The gradient represents the rate of change, which can vary between positive and negative values. In cases of normal curvature, deviations are expected to be minimal, indicating little to no change along the x-axis. For C-shaped curves, the magnitude tends to be slightly higher, while for S-shaped curves it is even more pronounced. Eq. (5) gives the formula for calculating this value.

$$\begin{aligned} |G| = \sqrt{G_x^2 + G_y^2} = \sqrt{\left( \frac{\partial f}{\partial x}\right) ^2 + \left( \frac{\partial f}{\partial y}\right) ^2} \end{aligned}$$
(5)

Mean absolute gradient phase (MAGP)

In addition to indicating the magnitude of change, the gradient also reveals the direction of change in the image intensities, known as the phase. As with the magnitude feature, minimal changes in the phase angle are expected for normal images; a smaller value is observed for C-shaped curves, while the most significant change in phase is observed for S-shaped curves. The gradient direction is determined using Eq. (6).

$$\begin{aligned} \text {Phase} = \tan ^{-1}\left( \frac{G_y}{G_x}\right) \end{aligned}$$
(6)
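Eqs. (5) and (6) can be evaluated with NumPy's finite-difference gradient. The sketch below applies them to the (x, y) points of the center profile, which is one plausible reading of how the two features are derived; the exact discretization in our pipeline may differ.

```python
import numpy as np

def gradient_features(profile: np.ndarray) -> tuple:
    """Return (MAGM, MAGP) for an (N, 2) array of (x, y) center-profile points.

    MAGM: mean absolute gradient magnitude, Eq. (5).
    MAGP: mean absolute gradient phase, Eq. (6).
    """
    gx = np.gradient(profile[:, 0])         # finite-difference d(x)/ds
    gy = np.gradient(profile[:, 1])         # finite-difference d(y)/ds
    magnitude = np.sqrt(gx ** 2 + gy ** 2)  # Eq. (5)
    phase = np.arctan2(gy, gx)              # Eq. (6)
    return float(np.abs(magnitude).mean()), float(np.abs(phase).mean())
```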

Curvature classification

This section details the curvature classification process. We extracted the relevant features from the spine images and used them to classify spine conditions with different machine learning algorithms: Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Artificial Neural Networks (ANNs).

Random Forest (RF): This ensemble method proved most effective in our experiments, achieving the highest classification accuracy. RF addresses overfitting risks and handles missing data well by using multiple decision trees.

Support Vector Machine (SVM): SVM excels at finding optimal hyperplanes to separate data points belonging to different classes. It can be further enhanced with kernel functions to manage non-linear data distributions, particularly beneficial for scoliosis classification with potentially complex patterns.

K-Nearest Neighbors (KNN): This simple yet powerful algorithm classifies data points based on their similarity to labeled neighbors. KNN offers a reliable approach for scoliosis classification when appropriate parameters are chosen.

While Artificial Neural Networks (ANNs) are powerful classifiers, their data requirements can be limiting for scoliosis datasets due to potential size restrictions and data collection challenges.
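A minimal scikit-learn sketch of this classification stage is shown below. The 60-40 split and the Random Forest hyperparameters (max_depth=16, 250 estimators) follow the values reported later in the paper; the feature matrix here is a random placeholder standing in for the real SWMEP/SWSDEP/MAGP values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(609, 3))     # placeholder SWMEP/SWSDEP/MAGP features
y = rng.integers(0, 3, size=609)  # 0 = normal, 1 = C-shaped, 2 = S-shaped

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=250, max_depth=16, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```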

Experimentation and results

In this section we discuss the experimental setup, the evaluation parameters, and the results of the proposed framework. This section includes the findings of the study and provides a comprehensive analysis of the experiments conducted.

Experimental setup

The model was trained on a PyTorch framework using a system equipped with 128 GB RAM and a combination of two Nvidia GPUs: RTX 3090 Ti and RTX 2070. The system also utilized a 16-core, 32-thread CPU with a base clock speed of 4.5 GHz.

For vertebrae localization using the transformer, the AASCE MICCAI 2019 challenge dataset was divided into a 60-20-20% split (60% for training, 20% for validation, and 20% for testing). We trained the transformer model at a learning rate of 1e-4 for up to 170 epochs, stopping once the validation loss stopped decreasing. The trained model was then evaluated on the test set to predict the seventeen vertebrae in each image. A summary of the training parameters is given in Table 2. The input image size was fixed to \(2048 \times 1024\).

Table 2 Parameters for training detection transformer.

For the SegFormer model, pretrained SegFormer-b0-finetuned-ade-512-512 weights were used as the starting point for training on the dataset. The training parameters for the SegFormer architecture are summarized in Table 3. For the curvature classification we divided the dataset into a 60-40 split (60% for training and 40% for testing).

Table 3 Parameters for training of SegFormer architecture.

Evaluation parameters

Different results were evaluated using the following parameters.

Mean Average Precision (mAP)

It is determined by finding the average precision (AP) of every class and then dividing the sum by the number of classes. mAP is calculated using Eq. (7)

$$\begin{aligned} \text{ mAP } = \frac{1}{N} \sum _{i =1}^{N} AP_i \end{aligned}$$
(7)

Where N is the total number of classes and \(AP_i\) is the average precision of the \(i\)-th class.

Intersection over union (IOU)

It measures the overlap between the ground-truth and predicted labels. The localization model predicts bounding boxes, and the acceptance or rejection of each bounding box depends on its IOU and confidence score; predicted boxes under the threshold value are rejected. IOU is calculated using Eq. (8)

$$\begin{aligned} IOU = \frac{ \text{ Area } \text{ of } \text{ Overlap }}{ \text{ Area } \text{ of } \text{ Union }} \end{aligned}$$
(8)
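For axis-aligned bounding boxes in (x_min, y_min, x_max, y_max) form, Eq. (8) can be computed directly:

```python
def iou(box_a, box_b):
    """Eq. (8) for two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)       # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                          # area of union
    return inter / union if union > 0 else 0.0
```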

Accuracy

It is the ratio of correct predictions to the total number of predictions. It can be calculated using Eq. (9)

$$\begin{aligned} Accuracy = \frac{ {True_{positive} + True_{negative}}}{ True_{positive} + False_{positive} + True_{negative} + False_{negative}} \times 100 \end{aligned}$$
(9)

Where, \(True_{positive}\) indicates true positive prediction, \(True_{negative}\) indicates true negative prediction, \(False_{positive}\) indicates false positive prediction and \(False_{negative}\) indicate false negative prediction.

Dice score

The Dice score, often referred to as the Dice coefficient or Dice similarity coefficient, serves as a metric for assessing the similarity between the predicted and ground-truth samples. It is calculated using Eq. (10).

$$\begin{aligned} \text {DiceScore} = \frac{2 \times T_{\text {positive}}}{2 \times T_{\text {positive}} + F_{\text {positive}} + F_{\text {negative}}} \end{aligned}$$
(10)

Where, \(T_{\text {positive}}\) indicates true positive predictions, \(F_{\text {positive}}\) indicates false positive predictions, and \(F_{\text {negative}}\) indicates false negative predictions.

Precision

Precision evaluates the proportion of accurately predicted positive instances within the total positive predictions made by a model. It can be calculated using Eq. (11).

$$\begin{aligned} \text {Precision} = \frac{T_{\text {positive}}}{T_{\text {positive}} + F_{\text {positive}}} \end{aligned}$$
(11)

Where, \(T_{\text {positive}}\) indicates true positive predictions and \(F_{\text {positive}}\) indicates false positive predictions.

Mean Absolute Error (MAE)

MAE computes the average absolute difference between the predicted and actual values across all data points. We first calculated the Euclidean distance between each predicted point and its ground-truth point, then took the absolute value and averaged over all points. It is calculated using Eq. (12).

$$\begin{aligned} \text {MAE} = \frac{1}{N} \sum _{n=1}^{N} |x_n - y_n| \end{aligned}$$
(12)

where \(N\) is the total number of samples, \(x_n\) is the ground-truth value, and \(y_n\) is the predicted value.

Symmetric Mean Absolute Percentage Error (SMAPE)

We assess the accuracy of the vertebrae localization using SMAPE, following the AASCE Challenge, where accurate vertebrae localization serves as the established measurement criterion for evaluating shape curvature. The SMAPE metric determines the accuracy of the landmarks by contrasting the identified landmark positions with the ground-truth landmarks. SMAPE is computed over all vertebrae, as indicated by Eq. (13).

$$\begin{aligned} \text {SMAPE} = \frac{1}{n} \sum _{i=1}^{n} \frac{|y_i - \hat{y}_i|}{\left( \frac{|y_i| + |\hat{y}_i|}{2}\right) } \times 100\% \end{aligned}$$
(13)

In Eq. (13), \(y_i\) denotes the actual value, \(n\) denotes the total number of vertebrae predictions, and \(\hat{y}_i\) denotes the predicted value.
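Both landmark-error metrics reduce to a few lines of NumPy. The sketch below mirrors Eqs. (12) and (13), taking flat arrays of ground-truth and predicted landmark values:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Eq. (12): mean absolute error over all landmark values."""
    return float(np.mean(np.abs(y_true - y_pred)))

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Eq. (13): symmetric mean absolute percentage error, in percent."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_true - y_pred) / denom) * 100.0)
```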

Shapley Additive Explanations (SHAP)

The fundamental principle of SHAP is to allocate a contribution value to every feature associated with a predicted result. Individual features can have varying effects on a prediction, either positive or negative44. In addition, SHAP considers every possible combination of features when determining how each feature contributes to the prediction. The SHAP value for feature \(j\) in the context of spine X-ray analysis can be described by Eq. (14).

$$\begin{aligned} \text {SHAP}_j = \phi _0 + \sum _{k=1}^{p} \phi _k \cdot x_k \end{aligned}$$
(14)

where \(\phi _0\) represents the baseline value, \(\phi _k\) are the SHAP coefficients for each feature \(k\), and \(x_k\) are the feature values.
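The per-class SHAP values visualized in Fig. 12 can be reproduced with the shap library's tree explainer. The sketch below assumes clf and X_test from the classification sketch earlier; depending on the shap version, shap_values is a list of per-class arrays or a single 3-D array.

```python
import shap

# Explain the fitted Random Forest over the held-out features.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)  # one array per class

# Summary plot of per-feature contributions, as in Fig. 12.
shap.summary_plot(shap_values, X_test,
                  feature_names=["SWMEP", "SWSDEP", "MAGP"])
```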

Ablation studies

We executed comprehensive experiments to validate our proposed framework. Our localization model achieved a mAP of 0.96 at an IOU threshold of 0.5 on the AASCE MICCAI 2019 dataset, which means that the DETR model can predict all 17 vertebrae in an image with a confidence score of 0.9. DETR results were examined at different threshold values; Table 4 shows the mAP for each IOU value.

Segmentation was performed with the SegFormer transformer. We achieved a mean accuracy of 0.97, a mean IOU of 0.94, a Dice coefficient of 0.93, and a mask precision of 0.91, indicating a good segmentation mask for each spine image. The high Dice score shows that the ground-truth and predicted vertebrae differ very little, and the high precision indicates a large share of true positives among all positive predictions. Table 5 shows the SegFormer model's results on the test dataset.

Table 4 Variation in mAP value with different IOU thresholds using DETR trained model.
Table 5 SegFormer model result on test dataset.

The primary challenge in determining center-point-based curvature is accurate vertebra detection during localization and spine mask creation during segmentation. The center profile of the spine is the key input to feature extraction, as all features are calculated from it. To fully evaluate the efficiency of our proposed model, several ablation studies were conducted with respect to localization, segmentation, features, and classifiers. For localization we used the DETR encoder-decoder structure, which performed very well. To compare the transformer model against deep learning baselines, we also performed vertebrae localization with YOLOv5, YOLOv3, and the Generalized Hough Transform; the comparison is shown in Table 6. The transformer-based DETR model outperforms YOLOv3, YOLOv5, and the Generalized Hough Transform (GHT) in vertebrae localization due to its ability to model global contextual relationships across the entire image, which is particularly advantageous in capturing the spatial alignment of vertebrae along the spine. Additionally, unlike CNN-based detectors that rely on predefined anchor boxes, DETR formulates detection as a set prediction problem, allowing it to localize vertebrae more accurately without bias toward fixed-scale object assumptions.

Table 6 Comparison of different object detection models result for vertebrae localization.

For segmentation we used the SegFormer transformer, which is well known for its architecture and computational efficiency. To compare the transformer model against deep learning baselines, we surveyed the literature for well-known segmentation models and compared their results with our approach on the AASCE dataset. The comparison is shown in Table 7; our approach works better than the other models, with a Dice coefficient of 0.93 and a mask precision of 0.91.

Table 7 Comparison of different segmentation models result for spine segmentation.

In the segmentation ablation study (Table 7), the transformer-based model demonstrated superior performance over U-Net, Mask R-CNN, and FCN-8 due to its self-attention mechanism, which effectively captures long-range dependencies and preserves spatial consistency along the curved spinal structures. Unlike conventional encoder-decoder architectures that may struggle with complex anatomical variations, the transformer excels at modeling global contextual information, enabling more precise boundary delineation and robust segmentation across diverse spinal profiles. Furthermore, when assessing the effectiveness of a machine learning model, it is crucial to determine the significance of individual features and their combined impact on the model's accuracy. This enables us to identify the most valuable features for making precise predictions and to eliminate or modify features that do not contribute significantly to the model's performance. To assess the features, accuracy was measured for each feature individually as well as for different combinations of features using the random forest classifier, as shown in Table 8.

Features 1-5 clearly play a vital role in the detection of normal, C-shaped, and S-shaped spines. The overall accuracy using all five features is 98.9%, as shown in Table 8. The second-best feature set, SWMEP, SWSDEP, and MAGP, achieves an accuracy of 98.3%, and we use it for further analysis. While all five features showed robust outcomes, we chose to concentrate on the second-most effective set in order to optimize the model for situations where a somewhat smaller feature set is advantageous, such as real-time applications or settings with constrained computational resources.

Table 8 Result of shape analysis for individual and combination of features using Random Forest classifier.

Visualization and interpretation

This section details the visualization and interpretation process. Figure 10 shows the vertebrae localization results using DETR: Fig. 10(A) shows the original input image, Fig. 10(B) the ground-truth bounding boxes, and Fig. 10(C) the predicted bounding boxes from the DETR model.

Fig. 10

Visual Results of Vertebrae localization using Detection Transformer on dataset images42. (A) Original-Image (B) Ground-truth Bounding Boxes (C) Predicted Bounding Boxes.

Figure 11 shows the spine segmentation results using the SegFormer transformer: Fig. 11(A) shows the original input image, Fig. 11(B) the ground-truth mask, and Fig. 11(C) the predicted mask from the segmentation model.

Fig. 11

Visual Results of Spine Segmentation using SegFormer on dataset images42. (a) Original-Image (b) Original Mask (c) Predicted Mask.

The performance of different classifiers varies for curvature classification, with the Random Forest classifier yielding the best results, as depicted in Table 10. Three features, the segment-wise mean of extreme points (SWMEP), the segment-wise standard deviation of extreme points (SWSDEP), and the mean absolute gradient phase (MAGP), and three classes (Class 0 (Normal), Class 1 (C-shaped), Class 2 (S-shaped)) were included in the SHAP analysis for the proposed model, as indicated by Fig. 12. SHAP values explain how each feature impacts the model's output across the various classes; the length of each bar indicates the extent of that feature's contribution to the predictions associated with that class.

Fig. 12

Features based SHAP analysis.

  1. SWMEP displays the highest contribution to Class 2 while also contributing significantly to Class 1, but its influence on Class 0 is comparatively minimal.

  2. SWSDEP exhibits a balanced effect across Class 1 and Class 2, with a strong impact on Class 0.

  3. MAGP has a strong impact on Class 0, but its impact on the other classes is relatively small.

Each dot represents a single observation (instance). The position along the x-axis represents the SHAP value, and color represents the feature value, changing from blue (low feature value) to pink (high feature value).

  1. SWSDEP exhibits a wide range of SHAP values, indicating its dual role in the model's predictions, both positive and negative, depending on the instance. The color variation implies that both low and high values of SWSDEP can strongly influence the model's output.

  2. MAGP illustrates that high values (pink) tend to push the predictions in one direction, while low values (blue) move them in the opposite direction.

  3. SWMEP manifests a more centralized distribution of SHAP values, indicating that its effect on the model output remains relatively stable across instances.

To assess the efficacy of the classification model in predicting multiple classes, we examine the ROC curves in Fig. 13. The ROC curve for Class N is positioned very close to the top-left corner, signifying exceptional performance in differentiating Class N from the others. Its AUC of 0.99 is remarkably high, indicating that the model can accurately identify instances of Class N while effectively minimizing false positives. In comparison, the ROC curve for Class S is less steep and further from the top-left corner; the model is less successful in differentiating Class S, as indicated by its AUC of 0.94, with fewer true positives or more false positives than Class N. Class C exhibits strong performance, with a ROC curve similarly situated near the top-left corner and an AUC of 0.98, closely aligning with Class N. Overall, the elevated AUC values for Classes N and C (0.99 and 0.98, respectively) show that the model can accurately differentiate these classes from the others, while performance is marginally lower for Class S (AUC 0.94). The ROC curves and their respective AUC values suggest that the classification model performs commendably across all classes, particularly Classes N and C, with a slight decline for Class S.

Fig. 13

ROC for Multi-Class.

Comparison with state-of-the-art(SOTA) methods

To validate our study, we performed a comparison with state-of-the-art (SOTA) techniques, conducting a comprehensive quantitative analysis on the AASCE2019 dataset. The methods compared are LaNet45, VlteNet46, TsNet47, MmaNet48, Seg4reg+22, and B-Spline49. We adhered to the SMAPE and MAE metrics to ensure a fair comparison. As illustrated in Table 9, our proposed framework yields promising results for spine curvature prediction; the last row of Table 9 summarizes the experimental results. Table 9 also lists an MAE of 2.7 for center point localization using our proposed approach. The results indicate a notable improvement in the ability of our technique to accurately trace the center points across the whole dataset: the Mean Absolute Error (MAE) decreases for the whole spine compared with other methodologies. The enhanced localization of central points, particularly throughout the complete spinal area, results in a smaller Symmetric Mean Absolute Percentage Error (SMAPE) of 4.37, as depicted in Table 9. We note that the feed-forward network (FFN) within the proposed transformer architecture contributes to this reduction in SMAPE. The SMAPE value for Seg4Reg+ is 7.32, while LaNet, VlteNet, TsNet, MmaNet, and B-Spline attain respective scores of 4.51, 5.44, 6.87, 7.28, and 8.28. The MAE of our proposed approach is slightly better than that of the other methodologies. We argue that the superior performance of our model is best evidenced by the SMAPE metric, as SMAPE is the metric used in the AASCE challenge.

Table 9 Comparison with state-of-the-art methods on the public AASCE dataset.

For the classification of the data, we used different classifiers (SVM, RF, KNN, and ANN). Table 10 shows the results for each classifier. Random Forest performs best on the AASCE MICCAI 2019 dataset, with an accuracy of 98.3% at a maximum depth of 16 and 250 estimators.

Table 10 Results of shape analysis using different classifiers.

The literature shows that researchers have worked on both hand-crafted and deep learning features for classifying the spine as normal, C-shaped, or S-shaped. Table 11 compares these features; from it we can see that the hand-crafted features perform better than the deep learning and hybrid features.

Table 11 Comparison of shape analysis result with state-of-the-art models.

Discussion

Overall, this study was addressed in four parts: vertebrae localization, segmentation, center profile generation, and shape analysis. Since the proposed approach involves four stages, experiments were evaluated separately on the AASCE MICCAI 2019 dataset. First, for vertebrae localization we used the detection transformer to localize the 68 corner points; this step focuses on every single vertebra, protecting the localization process from being disrupted by adjacent vertebrae. Second, we used SegFormer to segment the spine. Third, the center points of the 17 vertebrae were localized and a center profile of the spine was generated, using the center point technique for localization and morphological thinning for segmentation. Our model thus addresses all steps with decreasing difficulty. In the final step, the shape analysis, we used the spine profile to calculate the features and classify the data into normal, single-bend (C-shaped), and double-bend (S-shaped) spines.

In addition, to further assess performance and generalizability, we conducted experiments on three features, evaluated in terms of the location errors of the predicted vertebra centroids and the detection rates of vertebrae localization. Features 1, 2, and 3 play a vital role in the detection of normal, C-shaped, and S-shaped spines, where Feature 1 is the segment-wise mean of extreme points (SWMEP), Feature 2 is the segment-wise standard deviation of extreme points (SWSDEP), and Feature 3 is the mean absolute gradient phase (MAGP). To identify the ROI for vertebra shape analysis, the centroids of the predicted vertebrae were used. If an ROI contains only a partial vertebra, the centroid location was predicted wrongly and information is lost. The evaluation indicators for the whole-vertebra ROI were the location errors (LE) and detection rates (DR): if the LE is small and the DR is 100%, the whole vertebra is contained in the ROI; by contrast, if the LE is too large (DR of 95%), some valid information is lost and the ROI contains only a partial vertebra.

Furthermore, in each step we designed a multi-step network to identify the object correctly and to decrease the difficulty level, inspired by the DETR technique. DETR is a neural network architecture for object detection: it passes features through self-attention layers and uses the resulting weighted sums to predict the bounding box of each object in the image, along with its class and a confidence score. The confidence score represents how sure the model is that an object is contained within the bounding box. For our model we chose a confidence threshold of 0.9 so that we retain only the bounding boxes closest to the actual objects; a promising performance was thereby achieved with only a few parameters. For vertebral localization we compared our method with existing state-of-the-art methods, and after extensive and comprehensive experiments our method performed well. We also demonstrated the effectiveness of our methodology in updating the bounding boxes rather than moving them independently and uncontrollably.
To prove the effectiveness of the proposed method, we also performed vertebrae localization with the YOLOv5 architecture, which gives a mAP of 0.94 at an IOU threshold of 0.5, while DETR gives a mAP of 0.96 at the same threshold; Table 6 compares DETR, YOLOv5, YOLOv3, and the Generalized Hough Transform. For the classification of the data, we used different classifiers (SVM, RF, KNN & ANN); Random Forest performs best on the AASCE MICCAI 2019 dataset, with an accuracy of 98.3%. It has been reported in the literature that researchers have concentrated on spinal abnormalities and on diagnosing them using segmentation and regression, and different ways to segment vertebrae with deep learning are available. However, the clinical factors examined in the literature and our conversations with subject-matter experts led us to shift the focus to spine morphology. The entire process in this study was based on localization: localization of the vertebrae was used to define the vertebra positions and center points, alongside segmentation of the vertebrae. According to the findings, the transformer technique successfully located vertebrae with a mAP of up to 0.96, an impressive contribution. The center points are subsequently calculated from the localization and segmentation findings, and the center-point results are examined using the mean values of all the points. It is important to note that we chose the transformer-based DETR technique for vertebrae localization; owing to the different nature of the procedure, a direct comparison with the literature is not appropriate in this case. Our method is as accurate as more advanced methods and also has high reliability, meaning it is likely to remain accurate when used in experiments; we found it highly reliable based on our experimental results. We intend to classify the severity of the disease based on the Cobb method in future work.

We acknowledge that our study did not utilize a diverse dataset, which represents a limitation of the research. The primary reason for testing our algorithms on a single dataset is the unavailability of publicly accessible, diverse datasets; to address this constraint, we plan to evaluate our algorithm on a locally curated dataset in future research. As with all machine learning models, the DETR model may exhibit biases that affect prediction accuracy, particularly in tasks such as vertebrae localization. These biases may arise if the training dataset lacks sufficient anatomical variation, diverse imaging conditions, or representation across different patient demographics; as a result, the model may struggle to generalize to unseen data. Furthermore, the evaluation of model performance is influenced by the choice of assessment metrics (e.g., the IOU threshold for bounding box overlap): if the metrics disproportionately penalize or favor certain prediction types, accuracy scores can be skewed and misrepresent actual model performance. We also aim to enhance the model so that it can support the diagnosis of various spinal disorders. Additionally, to improve the reliability of our experimental results, factors such as illumination conditions should be carefully considered during result interpretation.

Conclusion

To summarize, considerable effort has been put into creating an automated system that can analyze spinal images from the perspective of a neurosurgeon, including an automated analysis of the S-shaped and C-shaped separations of spinal disorders. In this paper, a four-stage method was created for vertebrae localization, segmentation, center profile generation, and classification. First, for vertebrae localization we proposed a detection transformer to localize the 68 corner points. Second, we used SegFormer to segment the spine. Third, the center points of the 17 vertebrae were localized and a center profile of the spine was generated. Finally, for the shape analysis, we used the spine profile to calculate features and classify the data into normal, single-bend (C-shaped), and double-bend (S-shaped) spines. On the AASCE MICCAI 2019 dataset, the mAP was 0.96 at an IOU threshold of 0.5 for the predicted vertebra centroids, and the detection rates for all testing cases were good, demonstrating that all ROIs could be used to classify the data. For the shape analysis, features were calculated on the identified ROIs, and the three-class problem was addressed with the Random Forest (RF) classification method. We divided our dataset into a 60-40 split, 60% for training and 40% for testing, with the ROIs as input. The experimental findings on the AASCE MICCAI 2019 dataset, a mAP of 0.96 at an IOU threshold of 0.5, show that we localized the vertebrae efficiently and successfully. To prove the effectiveness of the proposed method, we also performed vertebrae localization with the YOLOv5 architecture, which gives a mAP of 0.94 at an IOU threshold of 0.5, compared with 0.96 for DETR. For the classification of the data, we used different classifiers (SVM, RF, KNN & ANN); Random Forest performs best on the AASCE MICCAI 2019 dataset, with an accuracy of 98.3%. Furthermore, our comparison with state-of-the-art methods indicates that we obtained good mAP results compared to the YOLOv5 architecture.

Our work does, however, have significant limitations. First, we did not apply our method to the complicated circumstances of clinical practice; instead, we simply used frontal images from a public database. Moreover, sagittal alignment is becoming an increasingly significant factor in clinical outcomes, and researchers need to focus more on it. Although our study was restricted to frontal spine images, our technique can still handle the fluctuations and significant ambiguity of the public dataset. The severity of scoliosis is not yet reported in our research. An accurate diagnosis also requires the latest medical technology and expert knowledge in the field, which is a further shortcoming of our method. The effectiveness of our model suggests that it will perform well when tested against high-quality internal (local) images in future studies. Based on the proposed framework, the automated extraction of spinal center profiles and associated features could assist radiologists and orthopedic specialists by providing objective and quantifiable indicators of spinal curvature, potentially improving the early detection and classification of scoliosis.
This could support treatment planning decisions such as monitoring curve progression, determining the need for bracing, or preparing for surgical intervention. Moreover, integrating this tool into Picture Archiving and Communication Systems or Electronic Medical Records could streamline its accessibility during routine diagnostic processes.