Introduction

In recent years, the demand for precise and automated analysis of medical images has grown significantly, particularly in fields like forensic medicine and dentistry, where tasks such as age estimation and gender classification play crucial roles1,2,3. Cone Beam Computed Tomography (CBCT), which provides detailed three-dimensional imaging of craniofacial structures, has become a valuable tool in these fields because it captures volumetric data that traditional 2D methods cannot4,5,6. However, the high dimensionality and complexity of CBCT data introduce considerable computational challenges; designing models that are both efficient and accurate, especially for tasks requiring nuanced interpretation of the structural features that encode age and gender, remains a key obstacle7,8.

Convolutional Neural Networks (CNNs) have transformed medical image analysis, particularly with 2D images, due to their capacity to learn relevant features automatically and outperform traditional manual methods in speed and accuracy9,10.

However, applying CNNs to 3D CBCT images poses additional challenges, including significantly higher computational demands and increased risk of overfitting due to the greater number of parameters and limited size of annotated 3D datasets in medical imaging11,12. These factors can reduce model generalizability, especially when training data lack diversity or are imbalanced. Although several studies have explored 3D CNNs and transformer-based architectures for volumetric image analysis13,14, such models often require extensive computational resources and regularization techniques to avoid overfitting, which may not be practical in forensic or clinical workflows.

To address these challenges, multi-task learning (MTL) frameworks have emerged as a promising solution15. MTL allows a single model to handle multiple related tasks, such as age estimation and gender classification, by sharing feature extraction layers across tasks, reducing computational overhead and enhancing generalization16. CBCT scans capture overlapping anatomical features, such as mandibular development and sinus morphology, that are relevant to both age and gender, making MTL particularly effective here. In this study, age estimation is modeled as a regression task and gender classification as a binary classification task. Sharing feature representations across the two tasks improves performance and generalization while reducing overfitting, which is especially beneficial in data-limited forensic CBCT contexts17.
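The joint objective used in this setting (Mean Squared Error for the age head and Binary Cross-Entropy for the gender head, as specified in the Methods) can be sketched on toy values; the labels, predictions, and the task weight `lam` below are illustrative assumptions, not the study's data:

```python
import numpy as np

# Toy mini-batch: labels and head outputs are made-up illustrative values
age_true = np.array([12.0, 17.5, 21.0])
age_pred = np.array([11.4, 18.2, 19.8])
sex_true = np.array([1.0, 0.0, 1.0])        # 1 = male, 0 = female (arbitrary coding)
sex_prob = np.array([0.9, 0.2, 0.7])        # sigmoid outputs of the gender head

mse = np.mean((age_true - age_pred) ** 2)   # regression loss for the age task
eps = 1e-7                                  # guards against log(0)
bce = -np.mean(sex_true * np.log(sex_prob + eps)
               + (1.0 - sex_true) * np.log(1.0 - sex_prob + eps))

lam = 1.0                                   # task-balancing weight (assumed value)
total = mse + lam * bce                     # single objective trained end to end
```

Backpropagating `total` through a shared backbone is what lets the two heads regularize each other.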

While most prior studies rely on 2D images lacking anatomical depth18,19, our approach uses panoramic slices reconstructed from CBCT scans to balance spatial detail and computational efficiency4,12. This method preserves key 3D features in a 2D format compatible with CNNs, enabling accurate age and gender estimation with reduced resource demands. CBCT offers valuable forensic cues—such as tooth eruption, pulp size, and mandibular morphology—supporting both tasks more effectively than traditional radiographs.

In forensic applications, accurate age estimation is critical for identifying individuals and determining the legal status of minors versus adults, while gender classification is essential for both medical diagnostics and forensic investigations2,9,20. In CBCT imaging, several dental and craniofacial features provide biologically grounded cues for age and gender estimation. These include tooth development stages, pulp chamber volume, eruption patterns, and mandibular morphology. Such features exhibit clear age-related trends and sexual dimorphism, making them ideal for forensic analysis. Our model effectively captures these patterns through attention-guided learning and visual explanations21. Traditional methods are often time-consuming, subjective, and rely heavily on manual expertise22. By incorporating CNNs and multi-task learning, our framework offers a more efficient, objective, and scalable solution. Processing panoramic slices from CBCT scans also enables a streamlined workflow that retains the interpretability and accuracy of 3D data while reducing computational costs23.

A major challenge in adopting machine learning models for medical and forensic applications is the need for interpretability, as clinicians and forensic experts must trust the model’s decisions24. Explainable AI (XAI) techniques, such as Grad-CAM (Gradient-weighted Class Activation Mapping), generate visual explanations by highlighting regions in an image that contribute most to the model’s predictions25. In contexts like forensic medicine, this transparency is essential to align the model’s predictions with expert knowledge, ensuring that decisions are trustworthy and verifiable26.
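At its core, Grad-CAM weights each feature map of the final convolutional layer by the spatially averaged gradient of the target score, sums the weighted maps, and applies a ReLU. A minimal NumPy sketch using random stand-in tensors (no trained network involved):

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, W = 4, 8, 8
A = rng.random((K, H, W))              # stand-in feature maps (last conv layer)
grads = rng.random((K, H, W)) - 0.2    # stand-in gradients d(score)/dA

alpha = grads.mean(axis=(1, 2))        # channel weights: global-average-pooled grads
cam = np.maximum((alpha[:, None, None] * A).sum(axis=0), 0.0)  # ReLU(sum_k alpha_k * A_k)
cam /= cam.max() + 1e-8                # normalize to [0, 1] for heatmap overlay
```

In practice the heatmap is then upsampled to the input resolution and overlaid on the radiograph.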

To further improve interpretability, we integrate attention mechanisms within our architecture. Attention enables the model to focus on the most relevant areas of the CBCT image, such as the mandible and maxilla, which are crucial for both age estimation and gender classification27. This prioritization of key regions aligns the model’s decision-making process with anatomical importance, making its predictions more reliable. Combining Attention with Grad-CAM yields precise visual explanations, enhancing performance and providing deeper insights into the model’s reasoning for forensic experts and clinicians28.

In this study, we propose a novel multi-task learning framework for age estimation and gender classification using CBCT images. By integrating attention mechanisms with Grad-CAM visualizations (Attention + Grad-CAM), our approach enhances both accuracy and transparency. Using panoramic slices reconstructed from CBCT scans, we retain the spatial richness of 3D data while reducing computational costs. Our results demonstrate that this method provides a robust solution for forensic applications, balancing high accuracy with the transparency required for clinical and forensic use.

Related work

Deep learning has driven significant progress in medical imaging, particularly in forensic applications like age estimation and gender classification29,30. However, despite advancements, key research gaps remain, especially in the integration of 3D CBCT images and attention mechanisms for improving interpretability. This section critically reviews recent studies, highlighting both their contributions and limitations, while identifying areas for future exploration.

Age and gender estimation in forensic applications using deep learning

Estimating age and gender accurately is crucial in forensic medicine, where deep learning has the potential to improve both efficiency and precision. Studies on 2D imaging, such as the work by Vila-Blanco et al.19, have shown that deep learning models can achieve a Mean Absolute Error (MAE) of 0.97 years for age estimation and 91.82% accuracy in gender classification using panoramic radiographs. Similarly, Büyükçakır et al.18 leveraged EfficientNet-B4 on a dataset of 3,896 orthopantomograms (OPGs), achieving an impressive MAE of 0.562 years. These studies reveal the strength of deep learning on 2D images, particularly with large datasets, but raise questions about the generalizability of such models to the smaller, imbalanced, or more diverse datasets often seen in forensic applications. Park et al.30 introduced a multi-task model based on EfficientNet-B3 and CBAM, achieving promising results (MAE = 2.93 years; accuracy = 99.2%), though their framework was limited to 2D images without CBCT integration. Ozlu Ucan et al.31 developed a hybrid pipeline combining 2D and 1D CNNs with a Modified Genetic–Random Forest algorithm, reporting near-perfect performance (R² = 0.999), but focused exclusively on age estimation. Kokomoto et al.32 proposed a two-stage deep learning approach (MAE = 0.261 years), yet their model also operated solely on 2D radiographs and did not address gender classification. In contrast, our study leverages CBCT-derived panoramic slices, enabling joint age and gender estimation through a multi-task learning framework. Additionally, we incorporate interpretability tools, namely attention and Grad-CAM, to enhance model transparency, which is particularly critical in forensic applications.

Age estimation using 3D CBCT remains challenging due to volumetric complexity and computational demands. Pham et al.33 reported a high MAE of 5.15 years, while Hou et al.34 reduced this to 1.64 years using neural architecture search (NAS), albeit at high computational cost. Gender classification has shown strong results with 2D CNNs, as demonstrated by Atas35 and Rajee and Mythili16, though Bu et al.12 noted performance drops in younger subjects. Venema et al.9 showed CNN adaptability using humerus images. Still, adapting 2D models to CBCT requires handling data imbalance and complexity. Talib et al.36 further highlighted the importance of dataset integrity by using transfer learning to filter out artificial radiographs.

In summary, while deep learning models have shown strong performance on 2D data, adapting them to effectively utilize 3D CBCT and address challenges like generalizability and data imbalance remains underexplored. Future research should prioritize scalable CBCT-based solutions for real-world forensic applications. A comparative overview of existing methods and our proposed approach is provided in Table 1, highlighting their respective strengths, limitations, and contributions.

Integration of CBCT images in multi-task learning

While CBCT images provide detailed anatomical information, their use in multi-task learning (MTL) frameworks—where both age estimation and gender classification can be performed simultaneously—remains limited. Fujimoto et al.7 demonstrated the potential of using 3D CBCT images for age estimation by focusing on alveolar bone features, achieving promising results in clinical settings. However, this study did not explore the application of multi-task learning, where shared features could be used to improve both age and gender predictions.

Vila-Blanco et al.19 addressed some of the challenges in processing CBCT images by developing a multi-task deep learning framework for automatic tooth and root canal segmentation. Their model, combining DentalNet and PulpNet, efficiently segmented complex anatomical structures, significantly reducing manual segmentation time and achieving superior performance on clinical datasets. This study demonstrates the efficacy of shared feature learning in dental imaging tasks but does not extend its findings to multi-task models that could handle both age estimation and gender classification.

Table 1 Comparative analysis of related work and proposed method.

da Silva et al.29 introduced SDetNet, a CBCT-based model with anatomy-guided attention for sex classification via frontal sinus analysis. While effective, it did not include age estimation or multi-task learning. In contrast, our model addresses both tasks simultaneously. Although multi-task learning with 3D CBCT data holds promise, it requires balancing accuracy with computational efficiency.

Attention mechanisms and explainability

Attention mechanisms have become increasingly important in improving the interpretability of deep learning models, particularly in medical imaging. Joo et al.37 developed a VAE-based model for age estimation, incorporating attention maps that highlighted the most relevant regions of teeth for age prediction. This approach improved the model’s explainability, which is crucial in forensic applications where the decision-making process must be transparent.

In a different approach, Hou et al.38 introduced a Multi-scale Aggregation Attention Block (MAB) within their Teeth U-Net model, which was designed to enhance teeth boundary detection in panoramic X-ray images. The model achieved a segmentation accuracy of 98.53%, illustrating the power of attention mechanisms in boosting model performance and improving interpretability.

Despite these advancements, attention mechanisms in 3D CBCT data remain underexplored. Applying attention layers to 3D data could not only improve model transparency but also enhance performance in tasks that require detailed anatomical analysis, such as age estimation and gender classification. Future work should focus on developing attention mechanisms tailored to 3D data, addressing the unique challenges posed by the spatial complexity of CBCT images.

Materials and methods

This retrospective study was approved by the institutional ethics committee (IR.SUMS.REC.1402.215). All CBCT scans were originally acquired for diagnostic purposes unrelated to this research. Informed consent was obtained from all patients, including permission for the anonymized use of imaging data in future research. No additional imaging was performed specifically for this study, and all methods complied with institutional and international guidelines.

This study employs an advanced deep learning framework for age estimation and gender classification using CBCT imaging, leveraging multi-task learning and attention mechanisms. The methodology follows a structured workflow (as depicted in Fig. 1) that includes data preprocessing, model development, optimization, and evaluation. A complete breakdown of the methodology is provided in Appendix A1.

Fig. 1

Overview of the proposed method for age estimation and gender classification from CBCT images.

Data acquisition and preprocessing

CBCT scans were retrospectively collected and preprocessed to ensure standardization and enhance interpretability. The preprocessing pipeline involved panoramic slice reconstruction, region-of-interest (ROI) extraction, and contrast enhancement to highlight critical anatomical features. Advanced data augmentation strategies, including geometric transformations (rotation, scaling), noise injection, and histogram equalization, were applied to improve model generalizability.
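Two of the listed augmentations, noise injection and histogram equalization, are compact enough to sketch directly in NumPy; the random array below stands in for a panoramic slice, and the noise level is an assumed illustrative value:

```python
import numpy as np

rng = np.random.default_rng(42)
slice_img = rng.integers(0, 256, size=(64, 128)).astype(np.uint8)  # stand-in slice

# Noise injection: additive Gaussian noise, clipped back to the valid 8-bit range
noisy = np.clip(slice_img + rng.normal(0.0, 10.0, slice_img.shape), 0, 255).astype(np.uint8)

# Histogram equalization: remap intensities through the normalized cumulative histogram
hist = np.bincount(slice_img.ravel(), minlength=256)
cdf = hist.cumsum().astype(np.float64)
cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())
equalized = (cdf[slice_img] * 255).astype(np.uint8)
```

Geometric transformations such as rotation and scaling are typically delegated to an image-processing library rather than written by hand.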

Model architecture and training

A multi-task deep learning framework was implemented, integrating state-of-the-art architectures (ResNet50, DenseNet121, InceptionV3) optimized for forensic radiology. Feature extraction was enhanced using a Convolutional Block Attention Module (CBAM) to prioritize clinically relevant structures, while Grad-CAM visualization ensured model interpretability. Training was optimized using the Adam optimizer, learning rate decay, early stopping, and cross-validation across multiple CBCT datasets. The loss functions were Mean Squared Error (MSE) for age estimation and Binary Cross-Entropy (BCE) for gender classification.
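The channel-attention half of a CBAM block can be illustrated schematically: average- and max-pooled channel descriptors pass through a shared two-layer MLP, and their sigmoid-activated sum re-weights the feature channels. A NumPy sketch with random weights (the reduction ratio of 4 and the weight initialization are illustrative assumptions, not the study's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
C, H, W = 8, 16, 16
feat = rng.random((C, H, W))                     # feature maps from a conv block

avg_desc = feat.mean(axis=(1, 2))                # (C,) global average pooling
max_desc = feat.max(axis=(1, 2))                 # (C,) global max pooling

W1 = rng.standard_normal((C // 4, C)) * 0.1      # shared MLP, reduction ratio 4
W2 = rng.standard_normal((C, C // 4)) * 0.1
mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)     # bottleneck with ReLU

ch_att = sigmoid(mlp(avg_desc) + mlp(max_desc))  # per-channel weights in (0, 1)
refined = feat * ch_att[:, None, None]           # suppress or emphasize channels
```

The full CBAM block follows this with an analogous spatial-attention step over the channel-pooled maps.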

Evaluation and explainability

Model performance was rigorously evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), accuracy, precision, recall, and AUC. Explainability techniques, including Grad-CAM and attention heatmaps, were used to validate the model’s decision-making process by highlighting key craniofacial features. Additionally, clinical validation was conducted by forensic experts to assess real-world applicability and reliability.
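All of these metrics can be computed directly from predictions; the following self-contained NumPy sketch uses made-up values (illustrative only, not the study's results):

```python
import numpy as np

# Age regression metrics on a hypothetical test fold
age_true = np.array([8.0, 12.0, 16.0, 21.0])
age_pred = np.array([8.5, 11.0, 16.4, 19.5])
mae = np.mean(np.abs(age_true - age_pred))
rmse = np.sqrt(np.mean((age_true - age_pred) ** 2))

# Gender classification metrics at a 0.5 decision threshold
y_true = np.array([1, 0, 1, 1, 0, 0])
score = np.array([0.8, 0.3, 0.6, 0.4, 0.2, 0.7])   # predicted probabilities
y_pred = (score >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
accuracy = np.mean(y_pred == y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC via the rank (Mann-Whitney) formulation: P(score_pos > score_neg)
pos, neg = score[y_true == 1], score[y_true == 0]
auc = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
```

The rank-based AUC here is the Mann–Whitney formulation; library implementations such as scikit-learn's `roc_auc_score` compute the same quantity.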

For a comprehensive breakdown of data preprocessing techniques, hyperparameter tuning, model architecture variations, and evaluation metrics, refer to Appendix A1, where detailed methodological explanations are provided.

Results

The proposed multi-task deep learning model demonstrated high accuracy and robustness in age estimation and gender classification tasks using CBCT images. The model’s performance was evaluated across various architectures, data augmentation strategies, and attention mechanisms, highlighting the impact of interpretability techniques on forensic radiology.

Key findings

  • Age estimation performance: The model achieved strong predictive accuracy, as shown in Table 2, with significant improvements when incorporating attention mechanisms. A detailed comparison of actual and predicted age values across the dataset is illustrated in Fig. 2.

Table 2 Performance of the model for age estimation across architectures.
Fig. 2

Comparison of actual and predicted age across the entire dataset.

  • Performance across age subgroups and gender: The model’s accuracy varied based on age group and gender, with performance trends depicted in Figs. 3 and 4.

Fig. 3

Model performance analysis across different age subgroups.

Fig. 4

Performance across gender.

  • Impact of attention mechanisms: Tables 3 and 4 present a comparative analysis of InceptionV3 with and without attention mechanisms, demonstrating the enhanced feature extraction capabilities of the model when attention mechanisms were included.

  • Model interpretability: The integration of Grad-CAM and CBAM attention mechanisms significantly enhanced visual interpretability, as shown in Figs. 5 and 6, highlighting key craniofacial and dental regions.

Table 3 Results of InceptionV3 without attention mechanism.
Table 4 Results of InceptionV3 with attention mechanism.

For a detailed breakdown of experimental results, statistical comparisons, and additional performance metrics, please refer to Appendix A2, where comprehensive evaluations of different architectures, training configurations, and hyperparameter optimizations are provided.

Discussion

This study advances previous research by implementing a multi-task learning framework on CBCT-derived panoramic reconstructions instead of conventional 2D radiographs. It introduces a hybrid loss-balancing strategy tailored for imbalanced forensic datasets and integrates CBAM with Grad-CAM to enhance model interpretability. These innovations contribute both practical value and methodological novelty to deep learning in forensic radiology. The proposed model demonstrated robust performance, achieving a low MAE of 1.08 years and an R² of 0.93 in age estimation, particularly excelling in younger age groups. In gender classification, it achieved an accuracy of 95.3% and an AUC of 0.97. These results indicate that the model effectively leverages CBCT data, especially with the inclusion of multi-scale CNN features and attention mechanisms that focus on clinically relevant regions, enhancing interpretability and generalization.

Fig. 5

Attention + Grad-CAM Maps highlighting key dental regions for age and gender estimation.

Fig. 6

Comparison of Grad-CAM and Attention + Grad-CAM heatmaps for dental X-ray interpretation.

The attention mechanism, combined with Grad-CAM, significantly enhances the interpretability of the model by providing visual insights into the model’s focus areas, such as the mandible and maxillary regions. The model’s capacity to produce interpretable heatmaps allows practitioners to see which anatomical regions are prioritized, making the model more reliable in clinical and forensic settings.

Our model demonstrates strong performance when compared to recent studies in age and gender prediction. Vila-Blanco et al.19 and Büyükçakır et al.18 reported slightly lower MAEs using 2D panoramic radiographs; however, their approaches lack the spatial depth and contextual detail provided by CBCT imaging. Our method leverages CBCT-derived panoramic slices, which preserve key anatomical features while reducing computational complexity. In contrast to Pham et al.33, who used full 3D CBCT volumes and observed a high error rate (MAE = 5.15 years), our slice-based strategy enables more accurate predictions (MAE = 1.08 years). Furthermore, while Joo et al.37 and Park et al.30 incorporated attention mechanisms, they lacked external interpretability techniques like Grad-CAM, which our model integrates to enhance clinical transparency. Finally, unlike Ozlu Ucan et al.31, whose hybrid method was limited to age estimation on 2D data, our model supports both age and gender classification in a unified, interpretable multi-task framework.

While 3D CBCT scans provide rich anatomical detail, processing full volumes is computationally demanding and prone to overfitting with limited data. By using reconstructed 2D panoramic slices, our method retains essential spatial features while enabling efficient, interpretable modeling. Compared to 3D approaches like Pham et al.33, which showed high error rates, our 2D strategy offers a practical balance between accuracy and feasibility for forensic use.

The integration of attention mechanisms, particularly Attention + Grad-CAM, has proven effective in enhancing both performance and interpretability. By guiding the model to focus on specific craniofacial structures, attention mechanisms contribute to more accurate age and gender predictions. This focus is particularly beneficial for complex anatomical variations in older age groups, where the model’s ability to prioritize key features like the mandible and molars becomes essential.

As presented in Tables 3 and 4 and visualized in Figs. 3 and 4, the model exhibited its highest accuracy for age estimation in the 7–10 and 15–18 age groups, with slightly higher errors in older individuals (19–23), likely due to greater anatomical variability. Gender classification was consistently accurate across sexes, though male subjects showed marginally higher precision and recall. These findings highlight the model’s robustness and generalizability across age and gender subgroups, reinforcing its applicability in diverse forensic contexts.

Attention mechanisms also provide visual explanations of the model’s focus areas, allowing practitioners to better understand its decision-making process. This transparency is essential in clinical and forensic applications, where interpretability builds trust and facilitates model validation. By visually highlighting relevant features, the model aids practitioners in understanding the basis for predictions, making it a valuable tool for medical and legal assessments.

The findings have practical implications for both clinical and forensic applications:

InceptionV3 performs better due to its multi-scale architecture, which captures both fine and broad features simultaneously. This is especially useful in CBCT images, where anatomical cues vary in size and location. Its parallel convolutional paths give it an advantage over more linear backbones such as ResNet50 and VGG16 in handling such structural complexity.

Forensic applications: The model’s accuracy in age estimation and gender classification could aid forensic experts in identification processes, expediting age and gender determination in legal cases. The integration of attention mechanisms further enhances model trustworthiness, allowing experts to visually validate the model’s focus areas and ensuring alignment with established anatomical markers.

Clinical applications: In clinical settings, the model can help automate diagnostic workflows, especially in scenarios where dental and craniofacial analysis is required for treatment planning or patient evaluation. This is particularly beneficial in pediatric cases for growth assessment and orthodontic planning, as well as in reconstructive surgery. By providing accurate, interpretable outputs, the model can reduce diagnostic errors, speed up analysis, and ultimately contribute to improved patient outcomes.

Although the dataset used in this study is relatively large compared to previous CBCT-based research, it remains modest in size when considered against typical deep learning requirements, and it lacks sufficient representation of age groups beyond 23 years. This limited age range may restrict the model’s ability to generalize across broader forensic populations with greater anatomical variability. Furthermore, all CBCT scans were obtained from a single center using a specific scanner and imaging protocol. While this standardization minimizes internal variability, it introduces potential institutional bias: the model’s performance may decline when applied to images from different centers, scanners, ethnic groups, or clinical contexts. Finally, the lack of external validation on independent datasets remains a major limitation; despite robust internal cross-validation, independent testing is essential to confirm the model’s stability, generalizability, and suitability for real-world forensic or clinical deployment. Future research should expand the dataset to include a broader range of age groups and diverse ethnic backgrounds, and should involve collaboration with forensic institutions to collect case-specific CBCT data and evaluate the model’s robustness across broader, more varied populations. Additional improvements in attention mechanisms and interpretability techniques, such as layer-wise relevance propagation (LRP) or integrated gradients, could further enhance the model’s focus on clinically significant areas.

Moreover, adapting the model for additional clinical and forensic applications, including the detection of dental anomalies, disease classification, or facial reconstruction, could expand its utility. Future models could also consider incorporating non-binary or diverse gender categories, making the system more inclusive and suitable for a wider range of clinical scenarios.

Although Grad-CAM visualizations were primarily performed on the InceptionV3 architecture due to its superior overall performance, future work could explore comparative interpretability across multiple architectures to better understand how different models attend to forensic-relevant anatomical regions.

Conclusion

This study introduced an explainable multi-task deep learning model for simultaneous age estimation and gender classification using CBCT-derived panoramic slices. By leveraging the InceptionV3 architecture with integrated attention mechanisms and Grad-CAM visualizations, the model achieved strong predictive performance (MAE = 1.08 years for age; 95.3% accuracy for gender) along with high interpretability. Compared to traditional 2D radiograph-based models, our approach benefits from CBCT’s anatomical richness while maintaining computational efficiency through 2D reconstruction. These findings highlight the model’s potential in forensic and clinical contexts where both accuracy and transparency are essential. Future work should focus on expanding the dataset, validating across external populations, and exploring transformer-based architectures for enhanced feature learning.