Abstract
Within the scope of this investigation, we carried out experiments to assess the potential of the Vision Transformer (ViT) in the field of medical image analysis. The diagnosis of osteoporosis from X-ray images is a substantial classification problem, which we addressed with Vision Transformer models. To provide a basis for comparison, we conducted a parallel analysis in which we solved the same problem using traditional convolutional neural networks (CNNs), well-known and widely used techniques for image classification. The findings of our research led us to conclude that ViT is capable of achieving superior outcomes compared to CNN. Furthermore, given access to a sufficient quantity of training data, both methods are likely to arrive at even better solutions to this critical problem.
Introduction
Deep learning is a specific branch of machine learning that uses artificial neural networks (ANNs) to acquire knowledge from extensive datasets and execute intricate operations1. Artificial neural networks are computational models that draw inspiration from the structure and functionality of biological neurons. They are composed of interconnected nodes organized into layers, which collectively interpret information that is transmitted to subsequent layers. The nodes inside each layer possess varying characteristics, such as connections, architectures, operations, and parameters, which are contingent upon the specific task and dataset at hand. The parameters are typically initialized randomly and subsequently adjusted via a process known as learning or training.
While the concept of deep learning dates back to the 1980s, it has only been in recent years that it has seen widespread adoption and success, thanks to advances in big data, processing power, and algorithm development. Computer vision, natural language processing (NLP), speech recognition, recommender systems, autonomous vehicles, and many more challenges have all benefited from the use of deep learning. Image generation, machine translation, game playing, and natural language understanding are other areas where deep learning has demonstrated meaningful outcomes.
Deep learning is not only transforming the fields of science and technology, but also the fields of art, culture, and society. It is creating new forms of art, including style transfer2, neural style3, and deep dream4, which blend different styles and genres. It is creating new forms of culture, such as neural rap5, neural karaoke6, and neural doodle7, which combine different languages and expressions. It is creating new forms of society, such as social bots8, chatbots9, and virtual assistants10, which interact with humans and other machines. With such advancements, a revolution can be expected in the future of many scientific fields, and the medical sciences are no exception.
The field of medicine stands out as one where deep learning could have a significant and positive impact. Clinical and biomedical disciplines are concerned with all aspects of health care, including the identification, evaluation, and management of illness. Data of several kinds are used in the medical sciences, including medical images, electronic health records (EHRs), genome sequences, and pharmacological compounds. Detection, diagnosis, prognosis, prediction, classification, segmentation, generation, and synthesis are just a sampling of the many tasks that would benefit from objectivity and automation in the medical sciences.
Deep learning can provide excellent answers for numerous issues and opportunities in medical sciences. Using the strength of artificial neural networks (ANNs), deep learning can learn from raw or unstructured data with little to no feature engineering or preprocessing. Even without substantial dimensionality reduction or regularization, deep learning can handle large-scale or high-dimensional data. Deep learning can also learn new representations or functions that are highly abstract or sophisticated without requiring a large amount of training data or explicit assumptions11.
In medicine, deep learning has several potential uses and applications such as:
-
Diagnostic imaging: Medical imaging diagnostics can be performed using deep learning, which can interpret multiple biomedical image types such as X-rays, MRI scans, and CT scans. The algorithms are able to recognize any potential risk and highlight any abnormalities in the images. In the process of cancer detection, deep learning is used extensively. Machine learning and deep learning were the driving forces behind the recent breakthroughs in computer vision12,13.
-
Electronic health records: Deep learning allows for the analysis of electronic health records (EHRs) that include both structured and unstructured data, such as clinical notes, laboratory test results, diagnoses, and prescriptions, at remarkable speeds while maintaining a high level of accuracy. Additionally, mobile devices such as smartphones and wearable technology offer valuable information regarding patient behaviors and mobility. Through mobile applications, these devices can transform such data into inputs for deep learning models that monitor medical risk factors14.
-
Genomics: The analysis of genomic sequences, which yields information about genotype/phenotype correlations and evolution of organisms, can be performed by deep learning. The identification of genes that are associated with diseases or traits can be assisted by deep learning. Recent studies show that deep learning can also aid in the design of novel genes or the editing of existing genes for therapeutic purposes.
-
Drug development: Regarding the pharmaceutical industry, deep learning can aid in both the development of new pharmacotherapies and the improvement of existing ones. Drug molecules with the appropriate properties or functionalities can be designed with the use of deep learning. Predicting drug interactions and adverse effects is another area where deep learning has proven useful15,16.
The aforementioned uses represent only a minor fraction (the proverbial ‘tip of the iceberg’) of what deep learning algorithms can accomplish and potentially contribute to the broader field of medical image analysis. Deep learning has been remarkably successful in medical image analysis, and this achievement has coincided with a period of drastically expanded utilization of electronic medical records and diagnostic imaging. With deep learning, significant hierarchical linkages within the data can be discovered algorithmically, rather than through the time-consuming manual construction of features, and such hierarchical discovery is very helpful in the era of biomedical big data.
Examples of important work in the field of deep learning-based medical image analysis include:
-
Image classification: Classifying medical images into appropriate groups
-
Localization: Pinpointing an object in an image
-
Detection: Identifying instances of predefined semantic objects in digital media.
-
Segmentation: Partitioning digital images into numerous segments so that they can be simplified and/or their representation can be altered to make them more meaningful and easier to evaluate
-
Registration: Aligning or generating correspondences between data. Mapping one image to another is one such example17,18.
Medical image analysis using a Deep Learning Approach (DLA) is a rapidly expanding area of study. DLA has seen extensive usage in medical imaging for illness diagnosis. This method, which relies on a convolution process, has effectively been applied in the field of medical visual pattern identification. Insightful studies of DLA and progress in artificial neural network development have yielded intriguing applications in medical imaging. One type of deep learning model, the Convolutional Neural Network (CNN), has been successfully used in conjunction with backpropagation to carry out automatic recognition tasks19. In addition, deep learning architectures have been applied to data augmentation in medical imaging, a practice involving three distinct types of deep generative models: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs)20.
Beyond the significant achievements and continued promise, progress in medical image processing using deep learning faces ongoing challenges, such as:
-
Data availability: The paucity of well-annotated large-scale datasets is a major obstacle to this line of study21. This is a typical problem in the field of medical imaging, where strict privacy and confidentiality regulations restrict access to patient information.
-
Adversarial attacks: Deep neural networks and their designs are vulnerable to adversarial perturbations and noise in medical imaging data.
-
Trustability: Users and patients may be hesitant to utilize deep learning technologies in healthcare due to concerns about validity and security.
-
Ethical and privacy issues: Ethical and privacy issues relating to medical data are also key challenges that need to be addressed22.
-
Data quality and variations: Variations in the quality of the data as well as the input present considerable issues. These differences can result from the use of different imaging methods, differing scanner parameter settings, or patient motion.
-
Overfitting: Deep learning models often have a propensity to overfit, which is especially problematic when there is a limited amount of training data available23.
Deep learning's transformative impact extends into the field of medical sciences, including the detection and diagnosis of osteoporosis, a condition characterized by weakened bones and an increased risk of fractures. Osteoporosis is a prevalent and debilitating condition, especially among the elderly, leading to significant morbidity, mortality, and healthcare costs. Early diagnosis is crucial because it enables timely intervention, which can prevent fractures and their associated complications. Traditional methods of diagnosing osteoporosis, such as dual-energy X-ray absorptiometry (DXA), are effective but can be limited by accessibility and cost. X-ray imaging is more widely available and less expensive, making it a valuable tool for osteoporosis screening if enhanced by advanced diagnostic techniques.
Recent advancements have seen the integration of deep learning techniques, such as Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs), to enhance the accuracy and efficiency of osteoporosis detection from X-ray images. These advanced models can automatically analyze X-ray images, identifying subtle patterns and variations indicative of osteoporosis with higher precision than traditional methods. By comparing ViTs and CNNs, researchers aim to determine the most effective approach for early and accurate osteoporosis diagnosis, potentially improving patient outcomes through timely and targeted treatment strategies. The early detection of osteoporosis using deep learning can facilitate preventive measures, such as lifestyle modifications, pharmacotherapy, and monitoring, thereby reducing the risk of fractures and improving the quality of life for patients. The application of deep learning in this area exemplifies its broader potential to revolutionize medical diagnostics and enhance the quality of healthcare, emphasizing the importance of innovative technologies in managing and treating osteoporosis.
CNN and ViT
In the realm of deep learning algorithms, convolutional neural networks are the most widely used tools for processing images. CNNs are artificial neural networks primarily used to solve difficult image-driven pattern recognition tasks, and they represent a remarkable example of ANN design. CNNs consist of one or more convolutional layers, typically accompanied by pooling layers, fully connected layers, and normalization layers. They are a version of multilayer perceptrons specifically designed to require only limited preprocessing.
The architectural design of Convolutional Neural Networks (CNNs) is specifically tailored to exploit the inherent 2D structure present in input images or other 2D inputs, such as speech signals. Translation-invariant characteristics are obtained through the utilization of local connections and tied weights, followed by a pooling process. CNNs employ a comparatively minimal amount of pre-processing in comparison to alternative image categorization algorithms. This means that the network learns the filters that previously required manual design in traditional techniques. The significant advantage of this functionality is independence from pre-existing knowledge and human labor in the process of feature design. The name "convolutional neural network" reflects the network's use of the mathematical convolution operation: CNNs employ convolution instead of conventional matrix multiplication in at least one of their layers24. Figure 1 is an illustration of a CNN's function.
An illustration of how a CNN block functions. Each CNN layer consists of several “filters”. Each filter can be represented as an n × n matrix of numbers that are learned during the training process. As the CNN layer operates on the image, at each step that n × n matrix is multiplied element-wise with the values of an n × n piece of the image and “summarizes” that piece with a single number, as can be seen in the figure. The filter then moves on, and the size of each further movement is set by the “stride” parameter. The “padding” parameter ensures that when the filter starts from the upper left corner of the image, the whole image is covered; in practice, it simply adds pixels with the value 0 to the borders of the image25.
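To make the stride and padding terminology concrete, the following minimal TensorFlow sketch applies a single 3 × 3 filter to a dummy grayscale image; the input size, filter count, and stride here are illustrative choices, not the settings of the models used in this study.

```python
import tensorflow as tf

# A dummy grayscale "image": batch of 1, 224 x 224 pixels, 1 channel.
image = tf.random.uniform((1, 224, 224, 1))

# One convolutional layer: a 3 x 3 filter, stride 2, and zero-padding
# ("same"), so the filter covers the whole image including its borders.
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=3, strides=2,
                              padding="same")

feature_map = conv(image)
# Each output value "summarizes" one 3 x 3 piece of the input; stride 2
# halves the spatial resolution.
print(feature_map.shape)  # (1, 112, 112, 1)
```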
CNNs have long been regarded as the predominant and uncontested algorithmic approach for image processing. More recently, however, Vision Transformers (ViTs) have been gaining prominence due to their potential to outperform CNNs on a number of different vision tasks. The self-attention mechanism is the essential idea that underpins the tremendous success of Vision Transformers. In contrast to CNNs, which focus largely on the nearby aspects of an image, ViTs are able to zero in on the long-distance connections that exist within the image. In comparison to CNNs, ViTs are endowed with greater versatility and capability in feature detection.
A ViT's architecture consists of a stack of transformer blocks, each containing a self-attention mechanism and a feed-forward neural network. A sequence of image patches serves as the input to a ViT, and the transformer blocks process this sequence one after another26,27.
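The following Keras sketch shows the two ingredients just described — patch embedding and a single transformer block with self-attention plus a feed-forward network. The dimensions are illustrative toy values, far smaller than those of the pre-trained ViTs used in this study, and positional embeddings and the classification token are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

patch_size, embed_dim, num_heads = 16, 64, 4

inputs = layers.Input(shape=(224, 224, 3))

# Patch embedding: a strided convolution cuts the image into 16 x 16 patches
# and projects each patch to an embed_dim-dimensional token.
patches = layers.Conv2D(embed_dim, patch_size, strides=patch_size)(inputs)
tokens = layers.Reshape((-1, embed_dim))(patches)  # (batch, 196, 64)

# One transformer block: self-attention over all patch tokens, then a
# feed-forward network, each with layer normalization and a residual link.
x = layers.LayerNormalization()(tokens)
attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
x = layers.Add()([tokens, attn])
y = layers.LayerNormalization()(x)
ffn = layers.Dense(4 * embed_dim, activation="gelu")(y)
out = layers.Add()([x, layers.Dense(embed_dim)(ffn)])

model = tf.keras.Model(inputs, out)
model.summary()
```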
CNNs, on the other hand, have been the standard model for visual data for a considerable amount of time. They are made up of convolutional layers followed by pooling layers, and they are exceptionally good at finding local characteristics in images28. However, they lack the ViTs' ability to zero in on long-range associations within the image27.
ViTs and CNNs have both demonstrated outstanding outcomes in a variety of vision applications, despite the fact that they operate in quite different ways. For example, one study compared vision transformers and convolutional neural networks (CNNs) in the field of digital pathology; the results showed that ViTs performed marginally better than CNNs for tumor detection in three out of four different types of tissue. Although ViTs have demonstrated encouraging results, they require a greater investment of time and effort to train than CNNs do. In addition, for ViTs to outperform CNNs, the tasks they are asked to complete may need to be difficult enough that they can make the most of their limited inductive bias27. In Fig. 2 the function of a ViT can be seen.
An overview of a ViT architecture. As can be seen, the ViT starts by “patching” the input images, i.e., dividing them into separate regions, and feeds those patches to its attention mechanism29.
In this research, we used four ViT variants alongside four well-known CNN variants to solve a classification problem and then compared their results.
Osteoporosis and X-ray image classification
In adults, bone mineral loss through resorption is balanced by the deposition of new bone mineral to maintain strength. Excess resorption lowers bone mineral density, making bones brittle and fracture-prone. Osteoporosis is fundamentally a disorder of this bone remodeling, so understanding how remodeling is regulated helps prevent and treat the disease30,31. Architecture and geometry make bones both light and strong. Long tubular bones have a strong cortical layer surrounding softer trabecular bone, a combination strong, light, and flexible enough to handle high-impact exercise without breaking. Vertebrae likewise pair trabecular bone with a dense cortical layer; when loaded briefly, they compress and re-expand. Because skeletons must grow, heal, and adapt, bone remodeling matters: with age, daily remodeling causes the cortical layer to resorb minerals and the bone cavity to lose trabecular bone and enlarge32. Some compensation comes from mineral layer growth outside the cortical layer. Continuous remodeling and bone microarchitecture therefore strongly impact the development of osteoporosis (Fig. 3).
Young adults with wider femurs may fracture their hips later in life due to thinner cortical layers, since later resorption is more frequent in thinner layers. Monocyte-derived osteoclasts and mesenchymal-derived osteoblasts, which arise from different cell lineages, govern bone resorption and deposition. Highly active ion channels in osteoclast cell membranes release protons into the extracellular space to lower pH and break down bone33. This environment supports osteoclast-derived proteolytic enzymes such as cathepsin K that remove bone matrix and induce mineral loss. Osteoblasts create large amounts of collagen type I and negatively charged non-collagenous proteins to form hydroxyapatite, the major new bone material. Cell-to-cell communication between these two cell types controls bone synthesis, maintenance, and loss. Osteoclast activation begins bone resorption during remodeling. After a brief “reversal” period in the resorption “pit,” progressive waves of osteoblasts build new bone matrix. Since bone production takes longer than resorption, remodeling may cause net bone loss34. To communicate, precursors, osteoclasts, and osteoblasts generate “signaling” molecules. Signaling chemicals, hormones, diet, and exercise are being studied as means of changing bone physiology at the cellular level. Multiple hormonal and endocrine factors tightly control bone growth and homeostasis. Bone growth and maintenance require estrogen, parathyroid hormone, and estrogen converted from testosterone. Bone cells are directly affected by estrogen via osteoblast and osteoclast receptors. This interaction starts a complex cell cycle that increases osteoblast activity and disrupts osteoblast-osteoclast communication. Bone resorption and formation are coordinated by osteoblasts activating osteoclasts35. Estrogen is delivered into the nucleus via the ERα receptor, activating certain genes. ERRα and ERα receptors in osteoblasts may govern bone cells. Recent evidence suggests SHBG, which lets estrogen enter cells, may also contribute. Natural estrogen reaches the bloodstream far from bone, also affecting the uterus and breast. Other locally produced signaling molecules greatly affect bone physiology. In particular, PGE2 stimulates both bone growth and resorption. Bone cells make PGE2 from arachidonic acid. In mechanically stressed animals, reducing prostaglandin production by COX2 enzymes slows bone development36. Exercise-induced bone growth may therefore require PGE2, and nonsteroidal anti-inflammatory COX-2 drugs may increase fracture risk. Leukotrienes also regulate bone remodeling: as osteoclasts and osteoblasts rebuild mouse bone, these arachidonic acid-derived compounds diminish bone density37.
Signals from outside bone cells act through cell surface receptors that notify the cell nucleus to activate or deactivate activity-regulating genes. These include BMP receptors, which boost bone growth and are present on osteoblast precursors. Since mice lacking LRP5 develop severe osteoporosis, the Wnt receptor low-density lipoprotein (LDL)-related protein 5 (LRP5) may also be critical for bone formation. BMP receptors and LRP5 may stimulate osteoblasts, but how is unknown. Sclerostin binding to the osteoblasts' LRP5/6 receptor suppresses Wnt signaling, reducing bone formation; PTH and mechanical stress lower osteoblast sclerostin, and sclerostin antibodies expand and strengthen bone. The cell surface receptor RANK and its partner RANKL trigger precursor cells to become osteoclasts38,39,40. Bone-rebuilding osteoblasts and osteoclasts create various signaling molecules, including RANKL. A disturbed equilibrium between RANKL and osteoprotegerin can cause osteoporosis: in animal studies, osteoprotegerin increases bone mass, and its loss causes osteoporosis and fractures. Human osteoporosis can be treated with RANKL inhibitors (Table 1). The cell surface receptors DAP12 and FcRγ work together to assist in osteoclast development and activation; mice lacking both DAP12 and FcRγ exhibit severe osteopetrosis, marked by increased bone density. These two cell surface receptors recruit ITAM adaptor proteins to increase intracellular calcium, and studies have demonstrated that both the RANK/RANKL and ITAM pathways activate osteoclasts. Both routes activate NFATc1, which precursor cells need in order to become active, bone-resorbing osteoclasts. Environmental and genetic factors affect osteoblast and osteoclast gene expression and osteoporosis41,42,43.
Diagnosis
Seniors have an increased risk of osteoporosis, although osteopenia or osteoporosis can begin earlier. Since osteoporosis has no symptoms, a complete medical history, including recent fractures and risk factors (see risk factors and individuals at high fracture risk), should determine whether a BMD test is warranted. Many healthcare systems recommend a fracture risk assessment before a BMD test. DXA is the most common BMD test method, though other diagnostic equipment provides additional information. The pathophysiological definition of osteoporosis includes diminished bone mass and other bone fragility characteristics as key risk factors for fractures. In clinical practice, bone mass alone predicts fracture risk, so it can be used to diagnose and track patients. BMD is bone mineral density per unit volume or area53.
Normal X-rays can only show spine fractures; hence BMD examinations require specific equipment. The most prominent BMD test is DXA (dual-energy X-ray absorptiometry), a densitometric technique that can be measured in vivo and has been validated by many fracture risk studies. DXA uses hydroxyapatite (bone mineral) and soft tissue as reference materials to assess the attenuation of a low-radiation X-ray beam through the body at two photon energies, detecting even small amounts of bone loss. Software detects the bone edges. For osteoporosis risk, DXA is most commonly used to evaluate bone density in the proximal femur (including the femoral neck) and the lumbar spine (L1–L4). DXA is also used to diagnose vertebral fractures via vertebral fracture assessment (VFA) and to compute the trabecular bone score (TBS) from lateral spine images54,55.
The WHO defines osteoporosis thresholds for DXA scans in terms of SD-based T- and Z-scores. The proposed reference range uses femoral neck measurements from the NHANES III reference database in Caucasian women aged 20–29 to calculate the T-score. Osteoporosis is diagnosed when a person's femoral neck T-score for BMD is 2.5 SDs or more below the reference value obtained from bone density measurements in young healthy females (see Table 2 and Fig. 4)56,57. Osteoporosis is also commonly diagnosed by the lumbar spine T-score. The Z-score reference value comes from bone density data adjusted for age and sex: it expresses how many SDs an individual's BMD departs from the expected mean for his/her age and sex group. It is usually used to assess children and adolescents58,59.
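As a worked illustration of these definitions, the short Python sketch below computes a T-score and applies the WHO cut-offs; the reference mean and SD are made-up placeholder values, since real reference data come from sources such as NHANES III.

```python
def t_score(bmd: float, young_adult_mean: float, young_adult_sd: float) -> float:
    """How many SDs a patient's BMD lies from the young-adult reference mean."""
    return (bmd - young_adult_mean) / young_adult_sd

def who_category(t: float) -> str:
    """WHO DXA categories based on the femoral-neck T-score."""
    if t <= -2.5:
        return "osteoporosis"
    if t < -1.0:
        return "osteopenia"
    return "normal"

# Placeholder reference values; real ones come from the NHANES III database.
t = t_score(bmd=0.55, young_adult_mean=0.86, young_adult_sd=0.12)
print(round(t, 2), who_category(t))  # -2.58 osteoporosis
```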
Comparing DXA results from the same patient scanned on machines from different brands is risky, since manufacturers modify key parameters such as voltages, filtering, edge detection, and assumed soft tissue thickness. Consider scanning patients with the same equipment when results are to be compared. DXA bone density testing for osteoporosis is quantitative, non-invasive, inexpensive, and convenient. However, it is osteoporotic fractures that are clinically significant. BMD is a diagnostic and risk factor evaluation technique that helps doctors stratify fracture risk. DXA technology has had unexpected consequences (confusing patients and some healthcare professionals), so the use of BMD for diagnosis must be differentiated from its use for risk assessment. Bone mineral density measurements indicating osteopenia or osteoporosis may not always predict fracture: fragility fracture patients rarely have BMD T-scores below − 2.5, and most fracture patients have osteopenia rather than BMD-defined osteoporosis60,61. Many non-skeletal factors affect fracture risk (see risk factors). DXA only measures BMD, so pathologic osteoporosis variables are excluded. Despite this limitation, DXA vertebral or hip BMD T-scores can detect fragility fracture risk. Preventing bone fractures and their socioeconomic impact requires early osteoporosis diagnosis. Lifestyle adjustments and therapy may help osteopenic and osteoporotic patients prevent fractures. A position paper by leading practitioners advised secondary preventive assessment for all older patients with fractures, including lifestyle, non-pharmacological, and pharmaceutical interventions to lower fracture risk. Because of DXA's limitations in assessing fragility fracture risk, the FRAX® calculator combines BMD with other, partly BMD-independent risk factors. Density testing assesses bone quantity, not quality. Hence, a low-cost tool to diagnose this condition early is needed (Table 2)62,63.
Data
During the course of our studies, we noted a significant dearth of openly available public data pertaining to X-ray imaging of osteoporosis. As previously noted, the absence of sufficient data is a significant obstacle to the advancement of deep learning algorithms in addressing medical challenges. Upon conducting an extensive search, we successfully identified a dataset comprising osteoporosis knee X-ray images, which has been previously utilized in a published study64.
As is clear from Table 3, the public dataset we accessed is not balanced. The class "osteopenic subjects" accounts for the vast majority of the data, whereas the other two classes make up only a minute portion of the total. Therefore, it is to be expected that any classifier built on these data will contain some degree of bias. Figure 4 displays three examples of the dataset images.
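Although no imbalance mitigation was applied in this study, the skew described above can be quantified, and optionally countered, with a few lines of Python; the class counts below are placeholders, not the actual figures from Table 3.

```python
import numpy as np

# Placeholder class counts in the spirit of Table 3 (not the actual figures);
# class ordering assumed: 0 = normal, 1 = osteopenic, 2 = osteoporotic.
y = np.array([0] * 75 + [1] * 260 + [2] * 37)
counts = np.bincount(y, minlength=3)
print(dict(enumerate(counts)))  # {0: 75, 1: 260, 2: 37}

# Inverse-frequency class weights, a standard mitigation (not applied here),
# which Keras accepts via model.fit(..., class_weight=weights).
weights = {c: len(y) / (3 * n) for c, n in enumerate(counts)}
print(weights)
```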
Methods and experiments
In total, four experiments were conducted utilizing four variants of the Vision Transformer (ViT). All of these models were employed for "feature extraction," wherein a pre-trained version of the model is used to extract a vector of image features, which is subsequently used for classification. The problem we tried to solve was a three-class classification problem. In Table 4, the Vision Transformer algorithms used in this study and their features are shown. From this point forward, the results of our experiments will be presented. It is important to note that, in all instances, the parameters used to initialize functions and classes are the default settings specified in the relevant software documentation, unless otherwise explicitly stated. The batch size used for the experiments is 32, and each experiment was run for 10 epochs.
We adopted the Adam optimizer implemented in Tensorflow, with its default learning rate of 0.001, for all trainings. The categorical crossentropy loss function in Tensorflow, with its default parameter values, was used for the training phase of all experiments. No data augmentation technique was applied to the data.
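The training configuration just described maps onto the following Keras calls; the tiny model and random data here are stand-ins for the pre-trained feature extractors and the X-ray dataset, included only so the sketch runs end to end.

```python
import numpy as np
import tensorflow as tf

# Stand-ins: a trivial classification head over 384-dim feature vectors and
# random data, in place of the pre-trained extractors and X-ray images.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(384,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
x_train = np.random.rand(100, 384).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 3, 100), 3)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # default rate
    loss=tf.keras.losses.CategoricalCrossentropy(),           # default params
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=32, epochs=10)
```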
For each model, the size of the feature vector it outputs and its number of parameters are listed in Table 4. When analyzing the model designated as "vit_s16_fe," several noteworthy aspects may be inferred from the name. To facilitate image analysis, every Vision Transformer (ViT) model partitions its input images into distinct regions known as "patches"; the value "16" in "s16" signifies that this particular model divides its input images into patches of 16 × 16 pixels, while the letter "s" denotes the "small" architecture. The model designated as "vit_b32_fe," which corresponds to model number 3, does the same with 32 × 32-pixel patches; the letter "b" represents the "base" architecture. The abbreviation "fe" refers to "feature extractor". The letter "r" is the abbreviation for "residual": models having "r" in their names exploit residual blocks in their architectures, a type of deep learning building block usually used in conjunction with CNN architectures65,66.
It is anticipated that a larger feature vector will provide a more comprehensive representation of the data instance, specifically in the context of images. Moreover, it is expected that a model with a greater number of parameters will possess an increased capacity to discern and comprehend the underlying patterns within the given data, provided that an ample amount of data is available for training the algorithm67.
We used a pre-trained version of each of these models for feature extraction, a technique called Transfer Learning (TL). TL is one of the most essential technologies in the era of artificial intelligence and deep learning. It aims to make the most of previously acquired expertise by applying it to other fields. Transfer learning, pre-training and fine-tuning, domain adaptation, domain generalization, and meta-learning are just a few of the subjects that have piqued the interest of the academic and application community throughout the years68. TL arose in response to the availability of ample labeled data in related domains: the approach leverages existing data to address the challenges associated with collecting fresh annotated data. The objective of transfer learning is to enhance the performance of a target learner through the utilization of additional related source data69. In the case of this study, the ViT models have been pre-trained on an image dataset called ImageNet.
ImageNet is a comprehensive dataset consisting of more than one million images, categorized into 1000 distinct classes (https://www.image-net.org/download.php). It is commonly utilized as a standard for evaluating the efficacy of image classification algorithms and also for pretraining and transferring the learned patterns to other classification problems70. A full explanation of ImageNet can be found in71.
Each one of the models listed in Table 4 has been trained on a smaller, 1000-class version of the original ImageNet. The models were obtained from Tensorflow Hub (https://tfhub.dev). TensorFlow Hub serves as a comprehensive collection of pre-trained machine learning models that are readily available for utilization and deployment. The experiments were implemented using the Python programming language, version 3.10.12, together with the Tensorflow framework, version 2.14.0. Tensorflow, developed by Google, is a prevalent programming tool utilized for the implementation of deep learning algorithms. The dataset was partitioned into two distinct sets: a training set, 80 percent of the data, exclusively utilized during the training phase, and a held-out test set, to which the algorithm was not exposed during training and which is employed for evaluating the model's performance. As noted above, the Adam optimization method was utilized with its default parameter settings.
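A sketch of how such a feature-extraction pipeline can be assembled is given below; the TensorFlow Hub handle shown is illustrative (the exact handles for the models in Table 4 may differ), and the images and labels are synthetic stand-ins.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the knee X-ray images and one-hot labels.
x = np.random.rand(64, 224, 224, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, 64), 3)

# 80/20 split into a training set and a held-out test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Illustrative TF-Hub handle for a ViT feature extractor; the frozen backbone
# maps each image to a fixed-length feature vector, and only the 3-class
# softmax head on top is trained.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    hub.KerasLayer("https://tfhub.dev/sayakpaul/vit_s16_fe/1", trainable=False),
    tf.keras.layers.Dense(3, activation="softmax"),
])
```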
The findings presented in Table 5 indicate that an increase in the number of parameters is associated with a decline in performance. This observation suggests that the models may be overfitting due to the limited amount of available data1. Figure 5 is a visualization of Table 5.
To make sure that the obtained results are reliable, we repeated the experiments, this time using standard 5-fold cross validation. The results of this experiment are presented in Table 6. The mean accuracy improves by 0.2 points as the models get bigger, starting from the smallest ViT; performance then drops significantly for ViT_b32_fe. The last model performed better than the others, which indicates a significant need for a larger dataset of these types of images.
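A minimal sketch of this 5-fold protocol follows, assuming features have already been extracted by a frozen backbone and using synthetic stand-in data.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold

def build_classifier():
    """A fresh softmax head on top of pre-extracted feature vectors."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(384,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Synthetic stand-ins for the extracted features and one-hot labels.
x = np.random.rand(60, 384).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, 60), 3)

accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(x):
    model = build_classifier()  # re-initialized for every fold
    model.fit(x[train_idx], y[train_idx], batch_size=32, epochs=10, verbose=0)
    accuracies.append(model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])

print(f"mean accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```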
In Fig. 6, the Box and Whisker plots of the cross validation experiments on the ViTs are shown.
In Fig. 7, the confusion matrix for each model is shown. All models performed best at categorizing the class labeled 1, which corresponds to the "osteopenic subjects" category. As previously noted, this is the class with the highest number of examples in our dataset, and the models tend to classify most instances as belonging to it. This bias in the models can be attributed to the imbalanced nature of the dataset. The performance of all models was least successful in classifying "osteoporotic subjects," which constitutes the smallest subset of cases in the dataset.
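For reference, a confusion matrix of the kind shown in Fig. 7 can be computed as below; the labels are invented toy values chosen to reproduce the majority-class bias just described, and the class ordering is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented toy labels (ordering assumed: 0 = normal, 1 = osteopenic,
# 2 = osteoporotic) chosen to illustrate the majority-class bias.
y_true = np.array([1, 1, 0, 2, 1, 1, 0, 1, 2, 1])
y_pred = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1])

# Rows are true classes, columns are predictions; a biased model piles
# predictions into the majority column (class 1 here).
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
```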
In order to facilitate comparative analysis, we have also performed identical experiments with four commonly employed CNNs. In Table 7, the mentioned CNN algorithms along with their features are shown.
The VGG16 model is a convolutional neural network architecture introduced by K. Simonyan and A. Zisserman of the University of Oxford in their research article "Very Deep Convolutional Networks for Large-Scale Image Recognition". The VGG19 model is a modified version of the VGG model, characterized by its 19 weight layers: 16 convolutional layers and 3 fully connected layers, along with 5 max pooling layers and 1 softmax layer. The model comprises a total of 143,667,240 parameters, also known as weights. This model serves as the foundation for numerous scholarly investigations within this discipline72. ResNet50 is a version of the ResNet model containing 48 convolution layers. The ResNet model is extensively employed and serves as a fundamental framework for numerous scholarly investigations in this domain73. ResNetRS101 can be considered a modified version of the ResNet architecture.
The experiment was conducted using pre-trained instances of each model from Tensorflow. Every model in the study has undergone pre-training using the ImageNet dataset. Furthermore, the Adam optimizer has been utilized for the purpose of optimization.
The outcomes of the CNN experiments are presented in Table 8. An identical pattern is evident in this experiment as well: efficacy degrades as the number of parameters increases, due to the overfitting phenomenon.
The results in Table 8 are visualized in Fig. 8.
Just like the case with the ViTs, we conducted a separate experiment for the CNNs using 5-fold cross validation. The results of that experiment are presented in Table 9. The cross validation results show that accuracy improves with the increase in model size (with the exception of VGG19) (Fig. 9).
In Fig. 10, the Box and Whisker plots of the cross validation experiments on the CNNs are shown.
The confusion matrices of the Convolutional Neural Network (CNN) models are depicted in Fig. 9. Similar to the ViT experiment, the models exhibited their best performance on the class with the largest number of cases in the dataset, namely "osteopenic subjects," while demonstrating inferior performance on the class with the smallest number of instances, namely "osteoporotic subjects." Once again, the imbalanced dataset has resulted in bias toward the majority class.
When the outcomes of the two sets of experiments are compared, it is possible to conclude that the ViT models performed more successfully than the CNNs, despite drawbacks such as an insufficient amount of data and imbalanced data. The best performing ViT model, vit_s16, outperformed the top performing CNN model, VGG16. With access to a larger quantity of data, further and more comprehensive comparisons would naturally be possible. Even so, it is clear within the parameters of this study that ViTs have greater diagnostic potential than the traditional CNNs. Visual comparisons of the models belonging to the two families (ViTs and CNNs), in terms of their accuracies, are presented in Fig. 11; it is immediately apparent that the ViTs come out on top.
For the sake of ensuring the robustness of our comparison, we performed a t-test between the results of the corresponding models.
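A sketch of such a test on per-fold accuracies is shown below, using scipy; the accuracy values are illustrative, not the study's actual numbers.

```python
import numpy as np
from scipy import stats

# Per-fold accuracies from 5-fold cross validation (illustrative values,
# not the study's actual numbers).
vit_acc = np.array([0.81, 0.79, 0.83, 0.80, 0.82])
cnn_acc = np.array([0.76, 0.78, 0.74, 0.77, 0.75])

t_stat, p_value = stats.ttest_ind(vit_acc, cnn_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05: significant gap
```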
Table 10 clearly shows that, with the exception of row 3, the ViT consistently outperformed its CNN counterpart in all other comparisons.
We also examined the power of these two types of models via another method: heatmap visualizations that evaluate their effectiveness in localizing disease in medical images. The main goal was to assess the precision and comprehensibility with which each model identifies problematic regions. The heatmaps obtained showed that the ViT s16 model outperformed the VGG16 model in reliably identifying the affected areas. The heatmaps generated by the ViT s16 model offer a higher level of accuracy and clarity in identifying specific regions of pathology, highlighting its potential as a more efficient tool for detecting diseases in medical imaging applications. The results indicate that the structure of the ViT s16, which utilizes self-attention, provides notable benefits in medical image analysis compared to the conventional convolutional approach of VGG16. Figures 12 and 13 show the result of this comparison for the two models.
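The heatmap technique is not named above, so the sketch below uses a Grad-CAM-style map over the last convolutional layer of VGG16 as one common way to produce such visualizations; the input image is a random stand-in for an X-ray.

```python
import numpy as np
import tensorflow as tf

# VGG16 with ImageNet weights; expose the last conv layer and the predictions.
model = tf.keras.applications.VGG16(weights="imagenet")
grad_model = tf.keras.Model(
    model.input, [model.get_layer("block5_conv3").output, model.output]
)

image = np.random.rand(1, 224, 224, 3).astype("float32")  # stand-in X-ray

with tf.GradientTape() as tape:
    conv_out, preds = grad_model(image)
    score = tf.reduce_max(preds, axis=1)  # score of the top predicted class

# Spatially averaged gradients weight each feature map; ReLU keeps only the
# regions that push the score up, yielding a coarse localization heatmap.
grads = tape.gradient(score, conv_out)
weights = tf.reduce_mean(grads, axis=(1, 2))
heatmap = tf.nn.relu(tf.einsum("bijc,bc->bij", conv_out, weights))
heatmap /= tf.reduce_max(heatmap) + 1e-8  # normalize to [0, 1] for overlay
print(heatmap.shape)  # (1, 14, 14): upsample and overlay on the input image
```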
Conclusion and discussion
In this study, we examined the ability of Vision Transformer models, a new technique in the field of deep-learning-based image analysis. We used these models to solve a difficult and essential image classification problem: the diagnosis of osteoporotic individuals from X-ray images. Specifically, we addressed the considerable challenge of determining whether or not subjects have osteoporosis, which is critical for proper clinical diagnosis and treatment.
Osteoporosis poses a substantial health concern due to its asymptomatic nature, frequently eluding detection until the manifestation of a fracture. This phenomenon exerts a significant impact on a vast number of individuals across the globe, with a notable predilection towards the female demographic aged 50 years and above. The pathological condition induces a state of diminished structural integrity and fragility within the skeletal system, thereby augmenting the susceptibility to fractures, primarily in the regions encompassing the hip, spine, and wrist. The occurrence of these fractures has the potential to result in the manifestation of persistent pain, long-term impairment, diminished self-sufficiency, and in extreme cases, mortality.
The substantial nature of the economic burden associated with osteoporosis is noteworthy. The estimated annual expenditure for the treatment of fractures associated with osteoporosis amounts to billions of dollars. Moreover, the affliction has the potential to induce a reduction in labor participation, a decline in overall efficiency, and a surge in medical expenditures. The implementation of preventive measures and the timely identification of osteoporosis are of utmost importance in effectively addressing this health concern. Modifications in one's lifestyle, encompassing the incorporation of consistent physical activity, adherence to a nutritious dietary regimen, and the abstention from both smoking and excessive alcohol intake, have been observed to exhibit potential in mitigating the onset of the aforementioned ailment. The utilization of bone density testing as a means of screening for osteoporosis can additionally facilitate the early identification of the ailment and enable timely intervention to mitigate the occurrence of fractures.
Vision Transformers may be the next important player in the field of medical image analysis using deep learning. The findings of this study demonstrated that they have triumphed over the traditional convolutional neural networks in the problem that we were attempting to tackle with this investigation. However, there are still a number of fundamental challenges that prevent the deep learning algorithm from being improved in this particular topic and, more generally speaking, in the field of medicine. The paucity of data that has been thoroughly cleaned and tagged appears to be the primary obstacle, which is a very significant issue that future studies will need to resolve. Our study revealed that there is a significant deficiency in the amount of publicly available X-ray data for osteoporosis patients. Expansion of our training set will support the development of more robust deep learning algorithms for X-rays in elderly patients.
Data availability
The dataset analysed during the current study is available in the Mendeley Data Knee Osteoporosis repository, https://data.mendeley.com/datasets/fxjm8fb6mw/2.
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521(7553), 436–444 (2015).
Gatys, L. A., Ecker, A. S. & Bethge, M. (eds) Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
Johnson, J., Alahi, A. & Fei-Fei, L. (eds) Perceptual losses for real-time style transfer and super-resolution. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part II 14 (Springer, 2016).
Mordvintsev, A., Olah, C. & Tyka, M. Inceptionism: Going deeper into neural networks (2015).
Potash, P., Romanov, A. & Rumshisky, A. (eds) Ghostwriter: Using an LSTM for automatic rap lyric generation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015).
Angelini, O., Moinet, A., Yanagisawa, K. & Drugman, T. Singing synthesis: With a little help from my attention. arXiv preprint arXiv:1912.05881 (2019).
Champandard, A. J. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768 (2016).
Vinyals, O. & Le, Q. A neural conversational model. arXiv preprint arXiv:1506.05869 (2015).
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 27 (2014).
Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A. & Jurafsky, D. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547 (2017).
Hu, C. et al. Trustworthy multi-phase liver tumor segmentation via evidence-based uncertainty. Eng. Appl. Artif. Intell. 133, 108289 (2024).
Wang, J., Zhu, H., Wang, S.-H. & Zhang, Y.-D. A review of deep learning on medical image analysis. Mob. Netw. Appl. 26, 351–380 (2021).
Zhou, L., Sun, X., Zhang, C., Cao, L. & Li, Y. LiDAR-based 3D glass detection and reconstruction in indoor environment. IEEE Trans. Instrum. Meas. 73, 3375965 (2024).
Miotto, R., Wang, F., Wang, S., Jiang, X. & Dudley, J. T. Deep learning for healthcare: Review, opportunities and challenges. Brief. Bioinform. 19(6), 1236–1246 (2017).
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M. & Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 23(6), 1241–1250 (2018).
Gu, Y., Hu, Z., Zhao, Y., Liao, J. & Zhang, W. MFGTN: A multi-modal fast gated transformer for identifying single trawl marine fishing vessel. Ocean Eng. 303, 117711 (2024).
Ker, J., Rao, J. & Lim, T. Deep learning applications in medical image analysis. IEEE Access 6, 9375–9389. https://doi.org/10.1109/ACCESS.2017.2788044 (2018).
Qi, F. et al. Glass makes blurs: Learning the visual blurriness for glass surface detection. IEEE Trans. Ind. Inform. 20(4), 6631–6641 (2024).
Puttagunta, M. & Ravi, S. Medical image analysis based on deep learning approach. Multimed. Tools Appl. 80, 24365–24398. https://doi.org/10.1007/s11042-021-10707-4 (2021).
Kebaili, A., Lapuyade-Lahorgue, J. & Ruan, S. Deep learning approaches for data augmentation in medical imaging: A review. J. Imaging. 9(4), 81 (2023).
Altaf, F., Islam, S. M., Akhtar, N. & Janjua, N. K. Going deep in medical image analysis: Concepts, methods, challenges, and future directions. IEEE Access 7, 99540–99572 (2019).
Dhar, T., Dey, N., Borra, S. & Sherratt, R. S. Challenges of deep learning in medical image analysis—Improving explainability and trust. IEEE Trans. Technol. Soc. 4(1), 68–75 (2023).
Saraf, V. et al. (eds) Deep Learning Challenges in Medical Imaging (Springer, 2020).
O'Shea, K. & Nash, R. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458 (2015).
Wang, Z. J. et al. CNN explainer: Learning convolutional neural networks with interactive visualization. IEEE Trans. Vis. Comput. Graph. 27(2), 1396–1406 (2020).
Islam, K. Recent advances in vision transformer: A survey and outlook of recent work. arXiv preprint arXiv:2203.01536 (2022).
Deininger, L., Stimpel, B., Yuce, A., Abbasi-Sureshjani, S., Schönenberger, S., Ocampo, P. et al. A comparative study between vision transformers and CNNs in digital pathology. arXiv preprint arXiv:2206.00389 (2022).
Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C. & Dosovitskiy, A. Do vision transformers see like convolutional neural networks?. Adv. Neural Inf. Process. Syst. 34, 12116–12128 (2021).
Khan, A., Rauf, Z., Sohail, A., Rehman, A., Asif, H., Asif, A. et al. A survey of the vision transformers and its CNN-transformer based variants. arXiv preprint arXiv:2305.09880 (2023).
La Manna, F. et al. Molecular profiling of osteoprogenitor cells reveals FOS as a master regulator of bone non-union. Gene 874, 147481 (2023).
Chen, S. et al. lncRNA Xist regulates osteoblast differentiation by sponging miR-19a-3p in aging-induced osteoporosis. Aging Dis. 11(5), 1058–1068 (2020).
Johnston, C. B. & Dagar, M. Osteoporosis in older adults. Med. Clin. North Am. 104(5), 873–884 (2020).
Šromová, V., Sobola, D. & Kaspar, P. A brief review of bone cell function and importance. Cells 12(21), 2576 (2023).
Gao, Y., Patil, S. & Jia, J. The development of molecular biology of osteoporosis. Int. J. Mol. Sci. 22(15), 8182 (2021).
Gerosa, L. & Lombardi, G. Bone-to-brain: A round trip in the adaptation to mechanical stimuli. Front. Physiol. 12, 623893 (2021).
Guajardo-Correa, E. et al. Estrogen signaling as a bridge between the nucleus and mitochondria in cardiovascular diseases. Front. Cell Dev. Biol. 10, 968373 (2022).
Sauerschnig, M. et al. Effect of COX-2 inhibition on tendon-to-bone healing and PGE2 concentration after anterior cruciate ligament reconstruction. Eur. J. Med. Res. 23(1), 1 (2018).
Halloran, D., Durbano, H. W. & Nohe, A. Bone morphogenetic protein-2 in development and bone homeostasis. J. Dev. Biol. 8(3), 19 (2020).
Razavi, Z.-S. et al. Advancements in tissue engineering for cardiovascular health: A biomedical engineering perspective. Front. Bioeng. Biotechnol. 12, 1385124 (2024).
Hatami, S., Tahmasebi Ghorabi, S., Mansouri, K., Razavi, Z. & Karimi, R. A. The role of human platelet-rich plasma in burn injury patients: A single center study. Canon J. Med. 4(2), 41–45 (2023).
Xu, J., Yu, L., Liu, F., Wan, L. & Deng, Z. The effect of cytokines on osteoblasts and osteoclasts in bone remodeling in osteoporosis: A review. Front. Immunol. 14, 1222129 (2023).
Jiang, Z., Han, X., Zhao, C., Wang, S. & Tang, X. Recent advance in biological responsive nanomaterials for biosensing and molecular imaging application. Int. J. Mol. Sci. 23(3), 1923 (2022).
Razavi, Z., Soltani, M., Pazoki-Toroudi, H. & Chen, P. CRISPR-microfluidics nexus: Advancing biomedical applications for understanding and detection. Sens. Actuators A Phys. 376, 115625 (2024).
Diegel, C. R. et al. Inhibiting WNT secretion reduces high bone mass caused by Sost loss-of-function or gain-of-function mutations in Lrp5. Bone Res. 11(1), 47 (2023).
Molaei, A. et al. Pharmacological and medical effect of modified skin grafting method in patients with chronic and severe neck burns. J. Med. Chem. Sci. 5, 369–375 (2022).
Marcadet, L., Bouredji, Z., Argaw, A. & Frenette, J. The roles of RANK/RANKL/OPG in cardiac, skeletal, and smooth muscles in health and disease. Front. Cell Dev. Biol. 10, 903657 (2022).
Mahmoudvand, G., Karimi Rouzbahani, A., Razavi, Z. S., Mahjoor, M. & Afkhami, H. Mesenchymal stem cell therapy for non-healing diabetic foot ulcer infection: New insight. Front. Bioeng. Biotechnol. 11, 1158484 (2023).
Wang, R. N. et al. Bone Morphogenetic Protein (BMP) signaling in development and human diseases. Genes Dis. 1(1), 87–105 (2014).
Otaghvar, H. et al. A brief report on the effect of COVID 19 pandemic on patients undergoing skin graft surgery in a burns hospital from March 2019 to March 2020. J. Case Rep. Med. Hist. https://doi.org/10.54289/JCRMH2200138 (2022).
Wein, M. N. & Kronenberg, H. M. Regulation of bone remodeling by parathyroid hormone. Cold Spring Harb. Perspect. Med. 8(8), a031237 (2018).
Taheripak, G. et al. SIRT1 activation attenuates palmitate induced apoptosis in C2C12 muscle cells. Mol. Biol. Rep. 51(1), 354 (2024).
Hatami, S., Tahmasebi Ghorabi, S., Ahmadi, P., Razavi, Z. & Karimi, R. A. Evaluation the effect of Lipofilling in Burn Scar: A cross-sectional study. Canon J. Med. 4(3), 78–82 (2023).
LeBoff, M. S. et al. The clinician’s guide to prevention and treatment of osteoporosis. Osteoporos. Int. 33(10), 2049–2102 (2022).
Høiberg, M. P., Rubin, K. H., Hermann, A. P., Brixen, K. & Abrahamsen, B. Diagnostic devices for osteoporosis in the general population: A systematic review. Bone 92, 58–69 (2016).
Yu, Y. et al. Targeting loop3 of sclerostin preserves its cardiovascular protective action and promotes bone formation. Nat. Commun. 13(1), 4241 (2022).
Zhang, Y. et al. Association between serum soluble α-klotho and bone mineral density (BMD) in middle-aged and older adults in the United States: A population-based cross-sectional study. Aging Clin. Exp. Res. 35(10), 2039–2049 (2023).
Han, X. et al. Multifunctional TiO2/C nanosheets derived from 3D metal–organic frameworks for mild-temperature-photothermal-sonodynamic-chemodynamic therapy under photoacoustic image guidance. J. Colloid Interface Sci. 621, 360–373 (2022).
Song, Z. H. et al. Effects of PEMFs on Osx, Ocn, TRAP, and CTSK gene expression in postmenopausal osteoporosis model mice. Int. J. Clin. Exp. Pathol. 11(3), 1784–1790 (2018).
Yeap, S. S. et al. Different reference ranges affect the prevalence of osteoporosis and osteopenia in an urban adult Malaysian population. Osteoporos. Sarcopenia 6(4), 168–172 (2020).
Yu, W. et al. Comparison of differences in bone mineral density measurement with 3 hologic dual-energy x-ray absorptiometry scan modes. J. Clin. Densitom. 24(4), 645–650 (2021).
Gai, Y. et al. Rational design of bioactive materials for bone hemostasis and defect repair. Cyborg Bionic Syst. (Washington, DC). 4, 0058 (2023).
Al-Hashimi, L., Klotsche, J., Ohrndorf, S., Gaber, T. & Hoff, P. Trabecular bone score significantly influences treatment decisions in secondary osteoporosis. J. Clin. Med. 12(12), 4147 (2023).
Bahadori, S., Williams, J. M., Collard, S. & Swain, I. Can a purposeful walk intervention with a distance goal using an activity monitor improve individuals’ daily activity and function post total hip replacement surgery. A randomized pilot trial. Cyborg Bionic Syst. 4, 0069 (2023).
Wani, I. M. & Arora, S. Osteoporosis diagnosis in knee X-rays by transfer learning based on convolution neural network. Multimed. Tools Appl. 82(9), 14193–14217 (2023).
He, K., Zhang, X., Ren, S. & Sun, J. (eds) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
Veit, A., Wilber, M. J. & Belongie, S. Residual networks behave like ensembles of relatively shallow networks. Adv. Neural Inf. Process. Syst. 29 (2016).
Sarker, I. H. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput. Sci. 2(6), 420 (2021).
Wang, J. & Chen, Y. Introduction to Transfer Learning: Algorithms and Practice (Springer Nature, 2023).
Farahani, A., Pourshojae, B., Rasheed, K. & Arabnia, H. R. (eds) A concise review of transfer learning. 2020 International Conference on Computational Science and Computational Intelligence (CSCI) (IEEE, 2020).
Recht, B. et al. (eds) Do Imagenet Classifiers Generalize to Imagenet. International Conference on Machine Learning (PMLR, 2019).
Deng, J. et al. (eds) Imagenet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2009).
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization 63–72 (2021).
Author information
Contributions
A.S (Ali Sarmadi): Conceptualization, Investigation, Methodology, Formal analysis, Visualization, Validation, Writing-original draft; Z.S.R (Zahra Sadat Razavi): Conceptualization, Investigation, Methodology, Formal analysis, Visualization, Validation, Writing-original draft; A.V.W (Andre J. van Wijnen): Formal analysis, Writing—review and editing; M.S (Madjid Soltani): Conceptualization, Supervision, Writing—review and editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sarmadi, A., Razavi, Z.S., van Wijnen, A.J. et al. Comparative analysis of vision transformers and convolutional neural networks in osteoporosis detection from X-ray images. Sci Rep 14, 18007 (2024). https://doi.org/10.1038/s41598-024-69119-7