Abstract
Pediatric wrist fractures are common skeletal injuries in clinical practice; however, due to the ongoing development of children’s bones, fracture characteristics are complex and often prone to misdiagnosis or missed diagnosis. Moreover, traditional diagnostic methods rely heavily on the physician’s experience, which may compromise efficiency and accuracy, especially in environments with limited medical resources. To address this issue, this study proposes an improved deep learning detection method based on YOLO11s, named Kid-YOLO, for the automatic detection of pediatric wrist fractures in X-ray images. By introducing the C3k2-WTConv module and Focaler-MPDIoU loss function, the model improves multi-scale feature extraction, bounding box localization accuracy, and handling of class imbalance. The C3k2-WTConv module, which combines wavelet transform and convolution operations, effectively enhances the model’s ability to detect subtle fractures and complex patterns. The Focaler-MPDIoU loss function improves performance in detecting rare targets by dynamically adjusting sample weight distribution and optimizing prediction box positioning. Experiments were conducted on the publicly available GRAZPEDWRI-DX dataset after data cleaning. The results show that, compared with the YOLO11 model, the improved model achieves a 3.2% increase in precision, a 1.6% increase in recall, a 1.8% improvement in mAP@50, and a 3.2% improvement in mAP@50–95. Furthermore, this study developed an AI-assisted diagnostic system with an integrated graphical user interface, capable of efficiently performing image loading, fracture detection, and result visualization, thereby providing physicians with a reliable diagnostic tool. In the future, this method is expected to be applied to a broader range of medical imaging analysis tasks, offering new technical support for precision medicine.
Introduction
Pediatric wrist fractures are among the most common fracture types in clinical practice, particularly in the context of sports activities and accidents1. Given the limited physical coordination and high activity levels in children, the incidence of wrist fractures is notably elevated. X-rays, being the most widely used diagnostic tool, are crucial for diagnosing pediatric fractures2. However, in clinical practice, early diagnosis of pediatric wrist fractures remains challenging. Children’s bones are still developing, and fracture presentations are complex. The fracture line may be subtle or even partially obscured, leading to misinterpretations as growth plates or cartilage. Interpreting X-rays for pediatric fractures demands highly specialized skills and expertise3. Moreover, due to the scarcity of radiology resources in primary care settings and emergency departments, misdiagnoses and missed diagnoses are frequent. Failure to accurately diagnose or appropriately manage fractures can lead to long-term detrimental effects on bone growth and functional recovery in children.
In recent years, artificial intelligence (AI) technology has seen widespread application in medical image analysis4, particularly deep learning-based object detection models5,6, which demonstrate significant potential. By training deep learning models with large datasets of labeled images, automated diagnostic systems can rapidly identify fracture locations and types in X-rays, alleviating the diagnostic workload of clinicians and enhancing diagnostic efficiency. Notably, the YOLO series of models7,8,9,10,11,12,13,14,15,16, renowned for their exceptional detection speed and high accuracy, have formed the basis for numerous studies in medical image detection17,18,19,20. Tristan Till et al. employed the YOLOv7 model to detect hand fracture radiographs and optimized performance by adjusting training image sizes and model configuration files; however, no further structural improvements to the model were introduced21. Chun-Tse Chien et al. trained a YOLOv9 model on the GRAZPEDWRI-DX dataset and applied data augmentation techniques to expand the training set, thereby enhancing model performance22. Although their approach achieved superior results compared with other models, no modifications were made to the network architecture. In contrast, Rui-Yang Ju et al. integrated the ResCBAM module into YOLOv8, which substantially improved detection accuracy and robustness on the GRAZPEDWRI-DX pediatric wrist fracture dataset while maintaining real-time performance20. Notably, the latest YOLO11 model demonstrates further performance gains over YOLOv8. These studies highlight the continuous evolution of YOLO-based models for pediatric fracture detection, while also underscoring the need for further architectural optimization to fully exploit their potential. As the latest iteration of the YOLO models, YOLO11 further optimizes both detection speed and accuracy, making it especially well-suited for the automatic detection and diagnosis of complex medical images. 
However, research on the application of YOLO11 in the automatic detection of pediatric wrist fractures remains limited, and its effectiveness in extracting subtle fracture features and detecting multi-scale targets warrants further investigation.
In addition to the algorithm itself, data plays a critical role in influencing fracture detection outcomes23. Existing data on pediatric wrist fractures is not only limited but also plagued by issues such as class imbalance and redundant annotations. Such data characteristics can result in overfitting to the majority class, while the model’s performance in detecting fractures in the minority class remains inadequate, thereby elevating the risks of missed or incorrect diagnoses. Addressing these challenges requires optimizing data processing and enhancing detection algorithms.
To address the aforementioned challenges, this study proposes a deep learning-based AI assisted diagnostic approach for pediatric wrist fractures and develops a comprehensive automatic detection system to aid doctors in making faster and more accurate diagnoses. The main contributions of this study are as follows:
1. This study performs comprehensive data preprocessing on the publicly available pediatric wrist fracture X-ray dataset GRAZPEDWRI-DX, eliminating irrelevant sample categories, mitigating overfitting risks, and enhancing the model’s ability to extract fracture-specific features.
2. Building on the latest YOLO11s model, this study significantly enhances detection efficiency and accuracy for pediatric wrist fracture detection through structural optimization and refinement strategies.
3. A novel C3k2-WTConv module is introduced to optimize multi-scale feature extraction in pediatric wrist fracture images. This module integrates wavelet transform and convolution operations, effectively improving the ability to capture complex patterns while avoiding the issue of excessive parameters.
4. To tackle the class imbalance problem in the dataset, this study employs the Focaler-MPDIoU loss function, which combines the advantages of Focaler-IoU24 and MPDIoU25, optimizing both classification and localization performance in object detection tasks. The incorporation of MPDIoU introduces a direct constraint on the vertex distances of bounding boxes, enabling the model to adjust the predicted box positions and sizes more accurately during optimization. Focaler-IoU dynamically adjusts the model’s focus on difficult-to-classify samples, mitigating overfitting to simpler samples, and enhancing the model’s detection capability for rare classes, thereby significantly reducing both missed detection and false positive rates.
5. This study develops a comprehensive AI-assisted diagnostic system, integrating the optimized YOLO11 model, capable of rapidly performing automatic detection of pediatric wrist fractures and providing clinicians with a reliable auxiliary diagnostic tool.
The remainder of this paper is organized as follows: section “Materials and methods” describes the materials and methods, section “Experiment and results” presents the experimental results, section “Discussion” discusses the findings, and section “Conclusions” concludes with limitations and future directions.
Materials and methods
This section begins by presenting the dataset and the corresponding preprocessing procedures applied to ensure data quality and consistency. Subsequently, the architecture of the YOLO11 network is introduced, with a detailed description of the structure and functionality of its individual modules. Finally, the section outlines the improved network architecture, highlighting the modifications implemented to enhance detection performance.
Dataset cleaning
This study employs the publicly available GRAZPEDWRI-DX dataset, specifically curated for pediatric wrist fracture detection. The dataset comprises 20,327 high-quality X-ray images, each meticulously annotated by professional radiologists. The annotations specify the target categories and their precise bounding box locations within the images. The dataset encompasses common wrist pathology categories along with other related categories, classified into nine categories in total.
In the original dataset, certain categories are unrelated to fracture detection, namely “text” and “foreignbody.” Including these categories could interfere with model training by causing the model to extract irrelevant features, thereby degrading fracture detection accuracy. To mitigate this issue, the dataset underwent rigorous cleaning and preprocessing: the annotations and bounding boxes of the “text” and “foreignbody” categories were removed from the annotation files, while the remaining annotations in each file were preserved, and the class indices of the remaining categories were re-mapped to form the final dataset employed in this study. Although this study removed the “text” and “foreignbody” categories to reduce noise, these categories may still provide auxiliary information in clinical traumatology. Following the data cleaning process, the dataset was partitioned into training, validation, and test sets with a 6:2:2 ratio. This study adopted strict patient-level partitioning, ensuring that images from the same patient were not included in both training and test sets to prevent data leakage. The distribution of the data is depicted in Fig. 1.
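The cleaning step described above can be sketched as follows. The class indices used here (2 for “foreignbody”, 8 for “text”) are assumptions for illustration only; the actual mapping should be taken from the dataset’s configuration file before reuse.

```python
# Drop boxes of removed classes from YOLO-format labels ("class cx cy w h"
# per line) and compact the remaining class indices. The ids 2 and 8 are
# assumed, illustrative values for "foreignbody" and "text".

REMOVED = {2, 8}
KEPT = [c for c in range(9) if c not in REMOVED]
REMAP = {old: new for new, old in enumerate(KEPT)}  # compact remaining ids

def clean_label_file(lines):
    """Remove boxes of unwanted classes and remap the remaining indices."""
    cleaned = []
    for line in lines:
        parts = line.split()
        cls = int(parts[0])
        if cls in REMOVED:
            continue                      # discard this bounding box
        cleaned.append(" ".join([str(REMAP[cls])] + parts[1:]))
    return cleaned

# Example: one fracture box (assumed id 3) and one "text" box (assumed id 8)
labels = ["3 0.52 0.48 0.10 0.06", "8 0.90 0.05 0.15 0.04"]
print(clean_label_file(labels))           # the "text" box is removed, id 3 -> 2
```

In practice the same filter would be applied to every `.txt` annotation file, keeping files whole even when all their boxes are removed, consistent with the preprocessing described above.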
YOLO11 algorithm
The YOLO11 algorithm is the latest iteration in the YOLO series, building upon the high efficiency and accuracy of previous versions while further enhancing feature extraction, detection accuracy, and computational efficiency. The architecture of YOLO11 primarily comprises three components: the Backbone, Neck, and Head, as illustrated in Fig. 2. The backbone network, through its hierarchical architecture, progressively constructs multi-scale feature representations from raw pixels, preserving fine-grained details while abstracting higher-level semantic information. The neck module fuses and enhances the features extracted by the backbone, optimizing detection performance for both small and large objects through multi-scale feature integration. Finally, the detection head transforms these multi-scale feature representations into the final predictions via coordinate regression and classification.
The subsequent sections will offer a detailed overview of the key technical elements of the YOLO11 algorithm.
Data augmentation
Data augmentation plays a pivotal role in the training process of YOLO11. Specifically, during the data preprocessing stage, YOLO11 employs the Mosaic augmentation technique, which enhances the training dataset by concatenating four images into one large composite image. The procedure involves the following steps: First, a random center point is selected from the four images, and then the images are positioned in the four quadrants of the composite image. For each image, the model extracts its target labels (including categories and bounding boxes) and adjusts the label coordinates based on its position in the composite image. Finally, the adjusted labels and images are integrated to generate a Mosaic image. This technique enhances the diversity of training samples and significantly boosts the robustness and generalization capability of the object detection model, particularly when dealing with objects of varying sizes and complex backgrounds. The key advantage of the Mosaic augmentation technique lies in its ability to significantly improve the model’s adaptability to diverse scenes and objects. In the context of pediatric wrist fracture X-ray images, this technique enhances the model’s sensitivity to complex fracture shapes and diverse fracture locations. Figure 3 illustrates the results of Mosaic augmentation. To ensure stable convergence and maintain optimal performance, Mosaic augmentation was applied only during the early stages of training and deliberately disabled in the final 20 epochs. This strategy allows the model to benefit from data diversity introduced by augmentation while stabilizing learning in the later phase to better fit the underlying data distribution.
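The label-adjustment step of Mosaic can be sketched as follows. This is a simplified illustration that assumes each source image exactly fills its quadrant; the actual implementation additionally scales, pads, and clips boxes, and the pixel compositing is omitted here.

```python
# Re-express normalized YOLO boxes (xc, yc, w, h) of four images, placed in
# the quadrants around a chosen center (cx, cy) of an S x S canvas, in
# canvas coordinates. Simplifying assumption: each image fills its quadrant.

def mosaic_labels(quadrant_boxes, cx, cy, S):
    """quadrant_boxes: list of 4 box lists, in order TL, TR, BL, BR."""
    # (x offset, y offset, width, height) of each quadrant on the canvas
    regions = [(0, 0, cx, cy), (cx, 0, S - cx, cy),
               (0, cy, cx, S - cy), (cx, cy, S - cx, S - cy)]
    out = []
    for boxes, (ox, oy, qw, qh) in zip(quadrant_boxes, regions):
        for xc, yc, w, h in boxes:
            # scale into the quadrant, shift by its offset, renormalize to S
            out.append(((ox + xc * qw) / S, (oy + yc * qh) / S,
                        w * qw / S, h * qh / S))
    return out

# A box centered in the top-left quadrant of a 640x640 canvas with center
# (320, 320) stays centered in that quadrant: roughly (0.25, 0.25, 0.1, 0.1)
print(mosaic_labels([[(0.5, 0.5, 0.2, 0.2)], [], [], []], 320, 320, 640))
```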
Backbone network
The backbone network of YOLO11 incorporates the newly designed C3k2 module, aimed at enriching the model’s gradient flow. The C3k2 module is essentially an enhancement of the C2f module introduced in YOLOv8, boosting the network’s expressive capacity by introducing a more efficient Bottleneck structure. In YOLO11, when the c3k parameter is set to False, the C3k2 module reverts to the traditional C2f module; when set to True, the Bottleneck structure is replaced by the C3k module. This optimization significantly enhances the efficiency and accuracy of feature extraction. The C3k2 module boosts the model’s feature extraction capability by integrating multi-level features, particularly striking a balance between low-level features and high-level semantic information. This design enhances the model’s ability to extract fine-grained features from complex images, especially when detecting small fracture regions. The optimization of the C3k2 module significantly strengthens the model’s diagnostic capability, especially when handling subtle fractures.
Neck network
In the neck network of YOLO11, the model integrates the C3k2 module with the C2PSA (Channel and Spatial Attention) module26. The C2PSA module introduces a spatial attention mechanism27 that facilitates selective focus on multi-scale image features. By adaptively adjusting both the channel and spatial dimensions of the feature maps, C2PSA improves the model’s ability to process multi-scale information.
The spatial attention mechanism enables YOLO11 to focus on the most critical areas of the image, minimizing background interference, particularly when processing images with complex structures or background noise, such as X-ray images of children’s wrist fractures. The C2PSA module dynamically adjusts attention weights for features at various scales, optimizing the flow of feature information and thereby enhancing the accuracy and robustness of object detection.
Head network
The head network of YOLO11 features a decoupled detection head, which independently processes regression and classification tasks, further enhancing detection performance.
To reduce computational complexity, YOLO11 incorporates Depthwise Separable Convolutions (DWConv) in the classification branch, replacing traditional convolution operations28. This substitution substantially reduces the computational load and parameter count without compromising accuracy, enabling YOLO11 to sustain low computational cost while maintaining detection quality.
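The parameter saving behind this choice can be checked with simple arithmetic: a standard k × k convolution requires k·k·Cin·Cout weights, while a depthwise separable convolution requires k·k·Cin (depthwise) + Cin·Cout (pointwise), bias terms omitted.

```python
# Parameter-count comparison between a standard convolution and a
# depthwise separable convolution (depthwise + 1x1 pointwise).

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def dwconv_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
std = conv_params(k, c_in, c_out)        # 589824
dws = dwconv_params(k, c_in, c_out)      # 67840
print(f"standard: {std}, separable: {dws}, ratio: {std / dws:.1f}x")
```

For a typical 3 × 3 layer with 256 channels in and out, the separable variant uses roughly an order of magnitude fewer weights, which is the efficiency gain the head design exploits.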
Loss function
YOLO11 employs a composite loss function that incorporates both classification and regression losses.
For classification tasks, YOLO11 uses BCELoss (Binary Cross Entropy Loss) to compute the difference between predicted class probabilities and the ground truth labels.
The BCELoss function is defined in Eq. (1):

$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right) \right] \quad (1)$$

where N represents the number of samples, and \(y_i\) and \(\hat{y}_i\) denote the true label and predicted probability for the i-th sample, respectively.
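As a numerical sanity check, Eq. (1) can be evaluated directly on a toy batch:

```python
import math

# Direct evaluation of the binary cross-entropy loss on a two-sample batch.
def bce_loss(y_true, y_pred):
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_pred)) / n

# Confident predictions on correct labels give a small loss:
print(round(bce_loss([1, 0], [0.9, 0.1]), 4))  # 0.1054
```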
For regression tasks, YOLO11 utilizes DFL Loss and CIoU29 loss. Distribution Focal Loss (DFL) aims to reduce the impact of simple background regions on model training, thus focusing on hard-to-detect targets and improving model performance in complex scenarios. The DFL is shown in Eq. (2):

$$\mathrm{DFL}\left(S_i, S_{i+1}\right) = -\left[\left(y_{i+1} - y\right)\log S_i + \left(y - y_i\right)\log S_{i+1}\right] \quad (2)$$
where \({S_i}\) and \({S_{i+1}}\) represent the probabilities of the i-th and \(\left( {i+1} \right)\)-th values in the discretized probability distribution predicted by the model, and y denotes the true continuous target value. \({y_i}\) and \({y_{i+1}}\) represent the left and right integer points of y mapped onto the discretized distribution, which correspond to the lower and upper bounds of y.
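A direct evaluation of Eq. (2), with values chosen purely for illustration:

```python
import math

# DFL distributes the continuous target y over its two neighboring integer
# bins y_i and y_{i+1}, weighting each bin's log-probability by its
# distance to y.
def dfl(s_i, s_i1, y):
    y_i = math.floor(y)       # left integer neighbor of y
    y_i1 = y_i + 1            # right integer neighbor of y
    return -((y_i1 - y) * math.log(s_i) + (y - y_i) * math.log(s_i1))

# Target y = 2.4 between bins 2 and 3, with most mass correctly near bin 2:
print(round(dfl(0.6, 0.4, 2.4), 4))  # 0.673
```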
The CIoU loss is used to optimize the regression accuracy of the bounding box, further improving the accuracy of the detection box, as shown in Eq. (3):

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \quad (3)$$
Here, \(\rho^2\left(b, b^{gt}\right)\) represents the squared Euclidean distance between the centers of the predicted and ground truth boxes, and \(c^2\) represents the squared length of the diagonal of the smallest enclosing box that contains both the predicted and ground truth boxes. \(\mathcal{L}\) denotes the loss function. IoU represents the Intersection over Union, defined as the ratio of the intersection area between the predicted box and the ground-truth box to their union area. The v term represents the aspect ratio difference, as shown in Eq. (4):

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (4)$$
where w and h are the width and height of the predicted box, and \({w^{gt}}\) and \({h^{gt}}\) are the width and height of the ground truth box.
The dynamic balancing factor \(\alpha\) is shown in Eq. (5):

$$\alpha = \frac{v}{\left(1 - IoU\right) + v} \quad (5)$$
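Equations (3)–(5) can be combined into a minimal reference implementation for axis-aligned boxes given as (x1, y1, x2, y2); this is an illustrative sketch, not the optimized routine used inside YOLO11.

```python
import math

def iou(a, b):
    # intersection width/height, clipped at zero for disjoint boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def ciou_loss(p, g):
    i = iou(p, g)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    pc = ((p[0] + p[2]) / 2, (p[1] + p[3]) / 2)
    gc = ((g[0] + g[2]) / 2, (g[1] + g[3]) / 2)
    rho2 = (pc[0] - gc[0]) ** 2 + (pc[1] - gc[1]) ** 2
    cw = max(p[2], g[2]) - min(p[0], g[0])
    ch = max(p[3], g[3]) - min(p[1], g[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio term v (Eq. 4) and balancing factor alpha (Eq. 5)
    v = (4 / math.pi ** 2) * (math.atan((g[2] - g[0]) / (g[3] - g[1]))
                              - math.atan((p[2] - p[0]) / (p[3] - p[1]))) ** 2
    alpha = v / ((1 - i) + v) if v > 0 else 0.0
    return 1 - i + rho2 / c2 + alpha * v

# A perfect prediction gives zero loss; a shifted one is penalized:
print(ciou_loss((10, 10, 50, 50), (10, 10, 50, 50)))  # 0.0
```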
WTConv
WTConv30 combines wavelet transform (WT) and convolution operations to extract multi-scale features from the image, while avoiding the issue of excessive parameters. The wavelet transform decomposes the frequency information of the image, while convolution layers process the low-frequency and high-frequency components, thereby enhancing the network’s ability to capture complex patterns. This module not only processes multi-scale information and multi-level features but also applies scale adjustment, further optimizing feature representation. The flowchart of WTConv is shown in Fig. 4.
The Haar wavelet transform is employed as the core transformation method for WTConv due to its high computational efficiency and its ability to effectively decompose an image into both low-frequency and high-frequency components. The Haar wavelet transform utilizes four filters, as depicted in Formula (6):

$$f_{LL}=\frac{1}{2}\begin{bmatrix}1&1\\1&1\end{bmatrix},\quad f_{LH}=\frac{1}{2}\begin{bmatrix}1&-1\\1&-1\end{bmatrix},\quad f_{HL}=\frac{1}{2}\begin{bmatrix}1&1\\-1&-1\end{bmatrix},\quad f_{HH}=\frac{1}{2}\begin{bmatrix}1&-1\\-1&1\end{bmatrix} \quad (6)$$
\({f_{LL}}\) is a low-pass filter, while the remaining filters are high-pass filters. \({f_{LL}}\) is employed to extract the image’s low-frequency components, while \({f_{LH}}\), \({f_{HL}}\), and \({f_{HH}}\) extract high-frequency details in the horizontal, vertical, and diagonal directions, respectively.
The WT operation is illustrated in Formula (7):

$$\left[X_{LL}, X_{LH}, X_{HL}, X_{HH}\right] = \mathrm{Conv}\left(\left[f_{LL}, f_{LH}, f_{HL}, f_{HH}\right], X\right) \quad (7)$$

where the convolution is applied with a stride of 2, halving the spatial resolution of each component.
X denotes the input image, while \({X_{LL}}\) represents the low-frequency component, capturing the global features and smooth regions of the image. \({X_{LH}}\) represents the horizontal high-frequency component, highlighting the details along the horizontal axis. \({X_{HL}}\) represents the vertical high-frequency component, emphasizing the details along the vertical axis. \({X_{HH}}\) represents the diagonal high-frequency component, capturing the details along the diagonal axis.
The inverse wavelet transform (IWT) operation formula is given in Formula (8):

$$Y = \mathrm{Conv\text{-}transposed}\left(\left[f_{LL}, f_{LH}, f_{HL}, f_{HH}\right], \left[X_{LL}, X_{LH}, X_{HL}, X_{HH}\right]\right) \quad (8)$$
IWT utilizes a transposed convolution operation to reassemble the low-frequency and high-frequency components into comprehensive image features. The output Y encompasses both global information and detailed local features.
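The forward and inverse Haar transforms described above can be written out explicitly for a single-channel image. The nested-list arithmetic below is equivalent to the stride-2 (transposed) convolutions used in WTConv, and perfect reconstruction can be checked directly.

```python
# One level of the Haar wavelet transform and its inverse, applied to a
# single-channel image stored as nested lists.

def haar_wt(x):
    half = len(x) // 2
    ll = [[0.0] * half for _ in range(half)]
    lh = [[0.0] * half for _ in range(half)]
    hl = [[0.0] * half for _ in range(half)]
    hh = [[0.0] * half for _ in range(half)]
    for i in range(half):
        for j in range(half):
            a, b = x[2*i][2*j], x[2*i][2*j+1]
            c, d = x[2*i+1][2*j], x[2*i+1][2*j+1]
            ll[i][j] = (a + b + c + d) / 2   # low-frequency average
            lh[i][j] = (a - b + c - d) / 2   # horizontal detail
            hl[i][j] = (a + b - c - d) / 2   # vertical detail
            hh[i][j] = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

def haar_iwt(ll, lh, hl, hh):
    half = len(ll)
    y = [[0.0] * (2 * half) for _ in range(2 * half)]
    for i in range(half):
        for j in range(half):
            L, H, V, D = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            y[2*i][2*j]     = (L + H + V + D) / 2
            y[2*i][2*j+1]   = (L - H + V - D) / 2
            y[2*i+1][2*j]   = (L + H - V - D) / 2
            y[2*i+1][2*j+1] = (L - H - V + D) / 2
    return y

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
# IWT(WT(x)) reproduces the input exactly:
print(haar_iwt(*haar_wt(img)) == [[float(v) for v in row] for row in img])
```

In WTConv the same decomposition is applied per channel, with small convolutions operating on each frequency component before the inverse transform reassembles them.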
Through the aforementioned process, the WTConv module effectively integrates multi-scale feature extraction with efficient parameter control, thereby enhancing the feature learning capacity for object detection tasks, particularly in medical image fracture detection. It not only improves the network’s ability to perceive multi-level information but also guarantees the model’s lightweight design through optimized computational efficiency, making it well-suited for detection tasks that demand both real-time performance and high precision.
C3k2-WTConv
The C3k2 module is a convolutional residual module employed in the backbone network of the YOLO11 algorithm, primarily consisting of stacked Bottleneck submodules. The C3k2 module enhances feature extraction efficiency by reducing the number of parameters, while leveraging residual connections to alleviate the gradient vanishing problem in deep networks. This study introduces an enhanced C3k2-WTConv module, which substitutes the convolution operation in the Bottleneck submodule of the C3k2 module with WTConv, integrating the multi-scale feature extraction benefits of wavelet transform with the robust feature learning capabilities of convolution, thus further improving the module’s feature extraction performance. The structure diagram of the C3k2-WTConv module is illustrated in Fig. 5.
C3k2-WTConv structure diagram. The baseline C3k2 module, composed of stacked Bottleneck residual submodules, improves feature extraction efficiency while reducing parameters. In the enhanced version, the convolution in the Bottleneck is replaced with WTConv, which integrates wavelet transform for multi-scale feature decomposition with convolution for robust feature learning.
MPDIoU
In object detection tasks, the degree of match between the predicted bounding box and the ground truth box plays a crucial role in the accuracy of object localization. Traditional Intersection over Union (IoU)25 loss functions primarily measure the match between the predicted and ground truth boxes based on their overlap ratio. As an improved variant, the CIoU loss introduces constraints on the center point distance and aspect ratio, further optimizing the position and shape of the bounding box. However, CIoU still lacks sensitivity to the overall placement of detection boxes within the image, particularly when dealing with multi-scale objects, and may not provide sufficient optimization.
This study employs the MPDIoU loss function. MPDIoU is an enhanced version of the IoU loss, which improves the localization accuracy of the bounding box by considering both the overlap and distance information between the predicted and ground truth boxes. This method not only accounts for the overlap between the predicted and ground truth boxes but also incorporates the average predicted distance relative to the image, aiming to optimize the position and size of the box, thereby improving detection accuracy. The core idea of the MPDIoU loss function is to add a vertex distance constraint, related to the overall scale of the image, on top of IoU. It calculates the Euclidean distance between the two key vertices of the predicted and ground truth boxes and normalizes this distance to the height and width of the image. This design is particularly suited for complex detection scenarios in medical imaging, where there are large variations in object sizes and numerous targets in the image’s peripheral regions. The MPDIoU loss function is shown in Eq. (9):

$$\mathcal{L}_{MPDIoU} = 1 - \mathrm{MPDIoU} \quad (9)$$
The MPDIoU calculation is given by Eq. (10):

$$\mathrm{MPDIoU} = IoU - \frac{d_1^2}{h^2 + w^2} - \frac{d_2^2}{h^2 + w^2} \quad (10)$$
where h and w represent the height and width of the input image, and \(d_1^2\) and \(d_2^2\) are defined in Eqs. (11) and (12):

$$d_1^2 = \left(x_1^{prd} - x_1^{gt}\right)^2 + \left(y_1^{prd} - y_1^{gt}\right)^2 \quad (11)$$

$$d_2^2 = \left(x_2^{prd} - x_2^{gt}\right)^2 + \left(y_2^{prd} - y_2^{gt}\right)^2 \quad (12)$$
Here, \(\left(x_1^{prd}, y_1^{prd}\right)\) and \(\left(x_2^{prd}, y_2^{prd}\right)\) denote the top-left and bottom-right corner coordinates of the predicted bounding box, while \(\left(x_1^{gt}, y_1^{gt}\right)\) and \(\left(x_2^{gt}, y_2^{gt}\right)\) denote the corresponding corners of the ground truth box.
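The MPDIoU computation admits a compact reference implementation for boxes given as (x1, y1, x2, y2) corners; the sketch below is illustrative.

```python
# MPDIoU for axis-aligned boxes, with (w, h) the input-image width and
# height used to normalize the corner distances.

def mpdiou(pred, gt, w, h):
    ix = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    iy = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    iou = inter / (area(pred) + area(gt) - inter)
    d1 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2  # top-left corners
    d2 = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2  # bottom-right corners
    return iou - d1 / (h ** 2 + w ** 2) - d2 / (h ** 2 + w ** 2)

# A perfect prediction scores 1.0; shifting it pulls MPDIoU below plain IoU:
print(mpdiou((100, 100, 200, 200), (100, 100, 200, 200), 640, 640))  # 1.0
```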
Focaler-MPDIoU
In object detection tasks, the imbalanced distribution of data classes and the characteristics of rare samples present significant challenges to model training. This imbalance manifests as a significantly larger number of majority class samples compared to minority class samples, causing the model to focus on optimizing majority class samples while neglecting the feature learning of minority class samples. This phenomenon is particularly pronounced in wrist fracture detection in this study, as samples from certain fracture categories are extremely scarce, yet their clinical diagnostic value is critically important.
To address this issue, this study proposes an improved loss function, Focaler-MPDIoU, which combines the advantages of Focaler-IoU and MPDIoU to optimize both object classification accuracy and bounding box localization accuracy simultaneously. Specifically, Focaler-IoU is an IoU-based weighted loss function that applies weights to the IoU values of different samples, enhancing the model’s focus on hard-to-regress samples. MPDIoU, on the other hand, optimizes the location and size of the predicted bounding boxes by combining IoU and the average predicted distance of the bounding box, allowing the model to localize objects more accurately. By combining these two methods, Focaler-MPDIoU can improve both object classification and localization accuracy, even in the presence of class imbalance and rare samples.
The mathematical formula for Focaler-IoU (\(IoU^{focaler}\)) is shown in Eq. (13):

$$IoU^{focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \leq IoU \leq u \\ 1, & IoU > u \end{cases} \quad (13)$$

where \(d, u \in [0, 1]\). The values of d and u can be adjusted to focus on different regression samples. Its loss function is given by Eq. (14):

$$\mathcal{L}_{Focaler\text{-}IoU} = 1 - IoU^{focaler} \quad (14)$$
By incorporating the MPDIoU loss into the above equation, the loss function for Focaler-MPDIoU is defined as in Eq. (15):

$$\mathcal{L}_{Focaler\text{-}MPDIoU} = \mathcal{L}_{MPDIoU} + IoU - IoU^{focaler} \quad (15)$$
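A minimal sketch of the combined loss follows; the thresholds d = 0.0 and u = 0.95 are illustrative choices, not the settings used in this study.

```python
# Piecewise Focaler mapping of IoU, plus the combined Focaler-MPDIoU loss,
# which adds the difference (IoU - IoU^focaler) to the MPDIoU loss.

def focaler_iou(iou, d=0.0, u=0.95):
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)

def focaler_mpdiou_loss(iou, mpdiou, d=0.0, u=0.95):
    return (1 - mpdiou) + iou - focaler_iou(iou, d, u)

# With d = 0, an easy sample (high IoU) has its loss reduced relative to
# the plain MPDIoU loss of 1 - 0.88 = 0.12:
print(round(focaler_mpdiou_loss(0.9, 0.88), 4))  # 0.0726
```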
Methods proposed in this study
In this study, we propose an improved detection model named Kid-YOLO, which is built upon the YOLO11 architecture with two main enhancements. First, in the backbone network, the standard C3k2 module is replaced by the C3k2-WTConv module. By integrating wavelet transform into the Bottleneck structure, this module captures both low- and high-frequency information, thereby improving multi-scale feature representation and enhancing the extraction of subtle wrist fracture features. Second, in the bounding box regression branch, the original CIoU loss is replaced with Focaler-MPDIoU, which significantly improves detection precision and localization accuracy, especially for small or rare fracture cases. Together, these improvements strengthen the overall detection capability of the proposed Kid-YOLO model. The complete architecture of Kid-YOLO is illustrated in Fig. 6.
Experiment and results
Experimental environment
This study uses a 16-core Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10 GHz with an NVIDIA RTX 4090 GPU, CUDA 12.4 as the acceleration environment, and PyTorch as the deep learning framework. The YOLO11s model is selected as the baseline, with the following hyperparameters: an input image size of 640 × 640, 300 training epochs, 8 data-loading workers, an initial learning rate of 0.01, and SGD as the optimizer.
Evaluation metric
To comprehensively assess the model’s performance in pediatric wrist fracture detection, this study employs the following standard evaluation metrics: precision, recall, and mean average precision (mAP), to quantify the model’s classification and localization capabilities.
Precision is given by Eq. (16):

$$Precision = \frac{TP}{TP + FP} \quad (16)$$
where TP is the true positive, referring to the regions correctly identified as fractures; FP is the false positive, referring to regions incorrectly predicted as fractures. In the context of fracture detection, high precision implies fewer false positives, which minimizes the risk of misdiagnosis and unnecessary treatments, thus ensuring the model’s reliability in clinical settings.
Recall is given by Eq. (17):

$$Recall = \frac{TP}{TP + FN} \quad (17)$$
where FN refers to false negatives, i.e., the fracture regions that were not detected.
Mean Average Precision (mAP) is one of the most widely used evaluation metrics in object detection tasks, particularly for multi-class detection. It measures the model’s detection accuracy for different categories and calculates the overall evaluation by averaging the precision for all categories, as shown in Eq. (18):

$$mAP = \frac{1}{K}\sum_{k=1}^{K} AP_k \quad (18)$$

where K is the total number of categories, and \(AP_k\) is the Average Precision for the k-th category.
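These metrics can be evaluated on toy counts (the AP values below are made-up numbers for illustration, not results from this paper):

```python
# Precision/recall from TP/FP/FN counts, and mAP as the unweighted mean of
# per-class Average Precision values.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def mean_ap(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)

tp, fp, fn = 80, 20, 40
print(precision(tp, fp), round(recall(tp, fn), 3))  # 0.8 0.667
print(round(mean_ap([0.9, 0.6, 0.3]), 3))           # 0.6
```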
Comparative experiment
In this study, to demonstrate the enhanced performance of the modified YOLO11s model in fracture detection, comparisons were conducted with various object detection models, and the results are presented in Table 1.
The experimental results indicate that, compared to other leading object detection algorithms, YOLO11s exhibits notable advantages in pediatric wrist fracture detection, outperforming widely used algorithms, including the YOLO series, Faster R-CNN, and SSD, across key metrics such as Precision, Recall, and mean Average Precision (mAP@50 and mAP@50–95). Furthermore, the Kid-YOLO model proposed in this study further improves upon YOLO11s, achieving gains across all metrics, which indicates that Kid-YOLO better handles the complex features of pediatric wrist fracture X-ray images.
In addition, we compared the detection performance of each category between the YOLO11s model and the Kid-YOLO model to further evaluate their respective strengths and limitations across different target classes. As shown in Table 2, the proposed Kid-YOLO model achieves higher mAP across almost all categories compared with YOLO11s. These results support the claim that Kid-YOLO enhances minority class detection, which is of particular clinical relevance.
Ablation experiment
To assess the impact of each improved module on the performance of the YOLO11 algorithm, this study conducted a series of ablation experiments. By systematically removing or adding key modules, the contribution of each module to the model’s performance was evaluated. The results of the ablation experiments are shown in Table 3.
The experimental results demonstrate that incorporating the C3k2-WTConv and Focaler-MPDIoU modules into the YOLO11s model enhances all four evaluation metrics. The model achieves optimal performance when both modules are integrated simultaneously. To assess the effectiveness of the MPDIoU loss function in improving target box localization accuracy and enhancing the detection of rare samples, this study further designed ablation experiments utilizing various IoU loss functions. Experiments were conducted on the Kid-YOLO model using IoU, CIoU29, DIoU29, SIoU31, and GIoU29 loss functions, with the results presented in Table 4.
The experimental results demonstrate that in Kid-YOLO, the four detection metrics using Focaler-MPDIoU outperform those using other types of IoU. This further validates the effectiveness of Focaler-MPDIoU in wrist fracture detection.
Visual results
To further assess the practical performance of Kid-YOLO in pediatric wrist fracture detection, this study compares the model’s detection results with the fracture regions manually annotated by professional doctors. This comparison validates the model’s detection capability, with the visualization results presented in Fig. 7.
Automatic fracture detection system
After completing model training and evaluation, we developed an automatic detection system for pediatric wrist fractures with a graphical user interface (GUI) based on the Kid-YOLO model to enhance clinical usability. The interface, implemented in PyQt6, is designed with five main functions: (1) Open – browse and load an X-ray image, which is displayed on the left panel; (2) Detection – perform automatic fracture detection, with the results shown on the right panel; (3) Save – export the detection results and save them to a specified folder; (4) Clear – remove the currently displayed images from the interface; and (5) Close – exit the program. These functions provide clinicians with an intuitive workflow for loading, analyzing, and documenting X-ray examinations. As illustrated in Fig. 8, the system demonstrates accurate detection in most cases, supporting doctors with an efficient and reliable tool for auxiliary diagnosis.
Interface of the pediatric wrist fracture automatic detection system. The interface includes five main functions: Open loads an X-ray image (displayed on the left panel); Detection runs automatic fracture detection and displays results on the right panel; Save exports annotated images to a specified folder; Clear removes all images from the interface; and Close terminates the program. This design enables doctors to easily visualize and manage fracture detection results for clinical use.
Discussion
This study addresses the clinical challenges in pediatric wrist fracture detection by proposing Kid-YOLO, an improved model based on YOLO11s, and by developing a practical AI-assisted diagnostic system. Using the publicly available GRAZPEDWRI-DX dataset, this research enhances the dataset’s relevance through data cleaning, thereby laying a solid foundation for improving model performance. By introducing the C3k2-WTConv module and the Focaler-MPDIoU loss function, the model is optimized in terms of feature extraction, object box localization, and the issue of class imbalance, resulting in significant improvements in detection performance.
To highlight the contributions of the proposed Kid-YOLO model, we conducted a comparative evaluation with existing detection algorithms, including Faster R-CNN, SSD, YOLOv3-tiny, YOLOv5s, YOLOv6s, YOLOv7-tiny, YOLOv8s, YOLOv10s, and YOLO11s. Kid-YOLO achieves the best overall performance across all metrics, with a precision of 76.2%, recall of 61.5%, mAP@50 of 59.4%, and mAP@50–95 of 39.5%. Compared with the strong baseline YOLO11s, Kid-YOLO improves precision by 4.2%, recall by 1.6%, mAP@50 by 1.8%, and mAP@50–95 by 3.2%. The per-class analysis further highlights the advantages of the proposed Kid-YOLO model. While improvements are observed across all categories, the most significant gains are found in rare fracture types, which are traditionally difficult to detect due to their low representation in training datasets.
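As a toy illustration of the reported metrics (using hypothetical boxes, not data from GRAZPEDWRI-DX), precision and recall at a 0.5 IoU threshold can be computed by greedily matching confidence-ranked predictions to ground-truth boxes:

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall_at_50(preds, gts):
    """preds: list of (confidence, box); gts: list of ground-truth boxes."""
    matched = set()
    tp = 0
    for _, box in sorted(preds, key=lambda p: -p[0]):
        # Greedily match each prediction to the best unmatched ground truth
        best_j, best_iou = -1, 0.5
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) >= best_iou:
                best_j, best_iou = j, iou(box, gt)
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

mAP@50 then averages the precision over the precision-recall curve per class, and mAP@50–95 repeats this over IoU thresholds from 0.5 to 0.95 in steps of 0.05.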
The C3k2-WTConv module significantly improves the model’s ability to perform multi-scale learning for complex fracture features, while the Focaler-MPDIoU loss function effectively optimizes both object box localization and the detection of rare categories. The fusion of these two components yields the best performance. Additionally, the study evaluates Kid-YOLO with different IoU loss functions, further validating the practicality of Focaler-MPDIoU in pediatric wrist fracture detection.
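The wavelet-domain processing that underlies the C3k2-WTConv module can be illustrated with a single-level Haar decomposition: the input is split into low- and high-frequency subbands, each subband is filtered, and the result is reconstructed. The sketch below uses scalar per-band weights as stand-ins for the small learned convolutions of the actual WTConv layer, so it is a simplified analogy rather than the module itself.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar transform: splits x (even H and W) into LL, LH, HL, HH subbands."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2."""
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    a[:, 0::2] = ll + lh
    a[:, 1::2] = ll - lh
    d = np.empty_like(a)
    d[:, 0::2] = hl + hh
    d[:, 1::2] = hl - hh
    x = np.empty((a.shape[0] * 2, a.shape[1]))
    x[0::2, :] = a + d
    x[1::2, :] = a - d
    return x

def wtconv_like(x, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weight each Haar subband, then reconstruct. In WTConv the scalar
    weights are replaced by small learned convolutions per subband."""
    bands = haar_dwt2(x)
    return haar_idwt2(*(w * b for w, b in zip(weights, bands)))
```

With unit weights the transform reconstructs the input exactly; attenuating the high-frequency bands (e.g. `weights=(1, 0.5, 0.5, 0.5)`) smooths fine detail, which hints at why filtering in the wavelet domain gives the module sensitivity to both coarse structure and subtle fracture lines.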
This study employs visual analysis to validate the performance of the Kid-YOLO model in pediatric wrist fracture detection. The model is capable of accurately detecting various types of pediatric wrist fractures, demonstrating its substantial potential for practical clinical application as a highly efficient and reliable diagnostic tool for physicians. Furthermore, the graphical user interface-based diagnostic system developed in this study is intuitive and streamlined, enabling physicians to efficiently complete wrist fracture diagnosis tasks, thus providing a feasible solution for clinical use.
Despite the substantial progress achieved in pediatric wrist fracture detection, several limitations and avenues for improvement remain. This study relied exclusively on the GRAZPEDWRI-DX dataset, which contains limited samples in certain fracture categories, potentially constraining model accuracy. As the model was trained and validated on a single institutional dataset, there is an inherent risk of overfitting, which may restrict its generalizability to broader clinical populations. Furthermore, variability in imaging quality and acquisition devices—including differences in resolution, exposure parameters, and hardware characteristics—can introduce biases and compromise robustness. To ensure reliable clinical deployment, rigorous cross-device and cross-protocol evaluations are needed. Beyond technical issues, the integration of AI into pediatric imaging also raises important ethical and medico-legal concerns, such as algorithmic transparency, potential bias against underrepresented subgroups, and the delineation of responsibility in clinical decision-making. The question of medico-legal liability in AI-assisted pediatric diagnosis remains a pressing challenge that requires multidisciplinary engagement among clinicians, ethicists, and policymakers.
Looking ahead, future research should focus on several key directions to enhance both scientific rigor and clinical applicability. Multicenter validation involving diverse patient populations and imaging environments is crucial to confirm robustness and reduce overfitting risks. Seamless integration with Picture Archiving and Communication Systems (PACS) will be pursued to facilitate adoption in routine clinical workflows, requiring careful attention to interoperability standards, data security, and clinician-friendly interface design. In addition, semi-supervised and self-supervised learning strategies provide a promising pathway to exploit large-scale unlabeled pediatric imaging data, thereby reducing annotation costs, enriching feature representation, and potentially improving diagnostic accuracy in real-world settings. Expanding data collection to include a broader range of high-quality pediatric fracture images—particularly rare fracture types—will further enhance model generalizability. Finally, large-scale prospective clinical trials are essential to evaluate system stability, reliability, and clinical utility, thereby bridging the gap between laboratory performance and real-world implementation.
Conclusions
Overall, this study provides an innovative solution for the automatic detection of pediatric wrist fractures through an improved deep learning model and diagnostic support system. It not only significantly enhances detection accuracy but also offers a new direction for the application of AI technologies in medical image analysis. In the future, with the expansion of data scale, continuous optimization of algorithms, and further clinical validation, AI-assisted diagnostic technology is expected to become a crucial tool for fracture diagnosis, further advancing the development of precision medicine and aiding in the healthy growth of children.
Data availability
The data that support the findings of this study can be requested from the corresponding author.
References
Little, J. T., Klionsky, N. B., Chaturvedi, A., Soral, A. & Chaturvedi, A. Pediatric distal forearm and wrist injury: an imaging review. Radiographics 34, 472–490 (2014).
Reavey, P. L. & Hammert, W. C. Examination of the wrist. Plast. Reconstr. Surg. 147, 284e–294e (2021).
Bruno, F. et al. The acutely injured wrist. Radiol. Clin. N. Am. 57, 943–955 (2019).
Zhou, L. et al. Artificial intelligence in medical imaging of the liver. World J. Gastroenterol. 25, 672 (2019).
Liu, W. et al. SSD: single shot multibox detector. In Computer Vision – ECCV 2016 (eds. Leibe, B. et al.) 21–37 (Springer International Publishing, 2016).
Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision 1440–1448 (2015).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 779–788 (2016).
Redmon, J. & Farhadi, A. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7263–7271 (2017).
Li, C. et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022).
Wang, C., Bochkovskiy, A. & Liao, H. M. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7464–7475 (2023).
Wang, C., Yeh, I. & Mark Liao, H. YOLOv9: learning what you want to learn using programmable gradient information. In European Conference on Computer Vision 1–21 (Springer, 2025).
Khanam, R. & Hussain, M. YOLOv11: an overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725 (2024).
Yang, Z., Shen, Y. & Shen, Y. Football referee gesture recognition algorithm based on YOLOv8s. Front. Comput. Neurosci. 18, 1341234 (2024).
Zhao, L. & Li, S. Object detection algorithm based on improved YOLOv3. Electronics 9, 537 (2020).
Wu, D., Lv, S., Jiang, M. & Song, H. Using channel pruning-based YOLO V4 deep learning algorithm for the real-time and accurate detection of Apple flowers in natural environments. Comput. Electron. Agric. 178, 105742 (2020).
Wang, A. et al. YOLOv10: real-time end-to-end object detection. In Advances in Neural Information Processing Systems 37, 107984–108011 (2024).
Ahmed, A., Imran, A. S., Manaf, A., Kastrati, Z. & Daudpota, S. M. Enhancing wrist abnormality detection with yolo: analysis of state-of-the-art single-stage detection models. Biomed. Signal. Process. 93, 106144 (2024).
Ahmed, A. & Manaf, A. Pediatric wrist fracture detection in X-rays via YOLOv10 algorithm and dual label assignment system. arXiv preprint arXiv:2407.15689 (2024).
Ju, R. & Cai, W. Fracture detection in pediatric wrist trauma X-ray images using YOLOv8 algorithm. Sci. Rep. 13, 20077 (2023).
Ju, R., Chien, C. & Chiang, J. YOLOv8-ResCBAM: YOLOv8 based on an effective attention module for pediatric wrist fracture detection. (eds. Mahmud, M. et al.) 403–416 (Springer Nature, 2025).
Till, T., Tschauner, S., Singer, G., Lichtenegger, K. & Till, H. Development and optimization of AI algorithms for wrist fracture detection in children using a freely available dataset. Front. Pediatr. 11 (2023).
Chien, C. T., Ju, R. Y., Chou, K. Y. & Chiang, J. S. YOLOv9 for fracture detection in pediatric wrist trauma X-ray images. Electron. Lett. 60, e13248 (2024).
Nagy, E., Janisch, M., Hržić, F., Sorantin, E. & Tschauner, S. A pediatric wrist trauma x-ray dataset (GRAZPEDWRI-DX) for machine learning. Sci. Data. 9, 222 (2022).
Zhang, H. & Zhang, S. Focaler-IoU: more focused intersection over union loss. arXiv preprint arXiv:2401.10525 (2024).
Zhou, D. et al. IoU loss for 2D/3D object detection. In International Conference on 3D Vision (3DV) 85–94 (IEEE, 2019).
Mou, L. et al. CS-Net: channel and spatial attention network for curvilinear structure segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, Part I 721–730 (Springer, 2019).
Tootell, R. B. et al. The retinotopy of visual spatial attention. Neuron 21, 1409–1422 (1998).
Chollet, F. Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1251–1258 (2017).
Zheng, Z. et al. Distance-IoU loss: faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence 12993–13000 (2020).
Finder, S. E., Amoyal, R., Treister, E. & Freifeld, O. Wavelet convolutions for large receptive fields. In European Conference on Computer Vision 363–380 (Springer, 2025).
Wang, X. & Song, J. ICIoU: improved loss based on complete intersection over union for bounding box regression. IEEE Access 9, 105686–105695 (2021).
Funding
This research was funded by the National Natural Science Foundation of China, grant numbers 11372223, 11102135; Tianjin Natural Science Foundation, grant numbers 17JCZDJC36000, 18JCZDJC35900.
Author information
Authors and Affiliations
Contributions
D.L. performed the experiments and simulations; Z.Y. conducted background research, revised the materials and methods section, and supervised D.L.; Z.Y., Q.M. and C.B. supervised D.L. and corrected the paper. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, D., Yang, Z., Bao, C. et al. Artificial intelligence-based method for detecting wrist fractures in children. Sci Rep 15, 38555 (2025). https://doi.org/10.1038/s41598-025-22419-y