Introduction

Transverse maxillary hypoplasia (TMH) represents a common orthodontic condition affecting 13% to 23% of children and up to 30% of adults1. This condition is characterized by inadequate transverse maxillary dimensions, resulting in posterior crossbites, dental crowding, and compromised masticatory function2,3. Beyond dental complications, TMH significantly impacts facial aesthetics and nasal airway patency, with airway narrowing increasing the risk of sleep-disordered breathing and substantially compromising patient quality of life4,5.

Maxillary expansion represents the primary therapeutic intervention for TMH, utilizing lateral orthopedic forces to open the midpalatal suture (MPS) and stimulate new bone formation. The timing and method of expansion are critically dependent on accurate assessment of MPS maturation status6,7. Patients with patent or partially fused sutures are ideal candidates for rapid palatal expansion (RPE), which can achieve 7–10 mm of maxillary expansion with minimal skeletal side effects during the mixed and early permanent dentition phases. Conversely, individuals with advanced suture maturation require alternative approaches such as mini-screw assisted expansion (MSE) or surgically-assisted rapid palatal expansion (SARPE)8,9. This fundamental treatment dichotomy makes precise MPS staging essential for optimal therapeutic outcomes and patient safety10,11,12.

Current clinical practice predominantly uses Angelieri’s five-stage classification system (stages A-E) to evaluate cone-beam computed tomography (CBCT) images13. While this system provides a standardized assessment framework, it suffers from significant limitations that compromise clinical decision-making. Inter-examiner reliability varies considerably, with reported weighted kappa values ranging from 0.3 to 0.8, indicating substantial inconsistency in diagnostic interpretation14,15. These variations are particularly pronounced among less-experienced clinicians, especially when distinguishing between stages C and D—a differentiation of paramount clinical importance as stage C represents the optimal window for conventional expansion while stage D typically requires invasive surgical approaches. The subjective nature of manual staging, combined with time-intensive evaluation processes, further limits the standardization of care delivery across different clinical settings.

Recent advances in artificial intelligence have shown promise for automated MPS assessment, offering potential solutions to these diagnostic challenges. Gao et al.16 developed texture-based algorithms for analyzing suture ossification patterns, while Zhu et al.17 achieved 75.37% diagnostic accuracy using deep learning architectures. Tang et al.18 further explored advanced neural network approaches combining Vision Transformers with convolutional neural networks to improve generalization across different patient populations. However, current automated approaches have critical limitations preventing clinical implementation. The reported accuracy levels fall short of the diagnostic reliability required for confident treatment planning decisions. Most significantly, existing single-modality approaches exhibit insufficient precision in distinguishing stage C from adjacent stages B and D—the very distinction that is paramount for determining conventional versus surgical expansion approaches.

Substantial clinical evidence supports the integration of multiple diagnostic indicators for comprehensive MPS assessment. Patient demographics, particularly age and gender, significantly influence suture maturation patterns, with females typically exhibiting earlier fusion compared to males19,20,21. Cervical vertebral maturation (CVM) demonstrates strong correlations with MPS development, where early CVM stages (CS1–CS2) correspond to patent sutures (stages A–B) and advanced CVM stages (CS5–CS6) align with complete suture fusion (stage E)14,22,23. Similarly, mandibular third molar (MTM) calcification patterns show significant correlation with MPS maturation progression24. However, no single clinical indicator adequately captures the complexity of suture maturation, necessitating comprehensive multimodal approaches that can integrate diverse yet complementary biological markers.

This study introduces DeepMSM, a novel multimodal deep learning framework for automated midpalatal suture assessment. The system systematically integrates CBCT images with routinely available clinical indicators, including patient demographics, CVM staging, and MTM development. By leveraging these complementary data sources through advanced artificial intelligence techniques, DeepMSM aims to overcome the limitations of current manual staging approaches while providing standardized, reproducible, and clinically reliable assessments across diverse clinical settings. The framework specifically enhances accuracy in the clinically critical C–D stage differentiation, potentially transforming treatment planning precision for maxillary expansion therapy and advancing the field of intelligent orthodontic diagnostics.

The main contributions of this study are as follows:

  • Novel Multimodal Clinical Assessment Framework: We present a comprehensive automated system that systematically integrates multiple clinical indicators routinely available in orthodontic practice (CBCT imaging, patient demographics, CVM staging, and MTM development) to achieve high diagnostic accuracy (93.75%) in MPS staging, with particular strength in the clinically critical C–D stage differentiation.

  • Evidence-Based Validation of Clinical Indicators: Through comprehensive analysis of 200 patients, we establish the relative diagnostic value and optimal integration of multiple clinical parameters in MPS assessment, providing evidence-based guidance for clinical decision-making and advancing understanding of craniofacial maturation patterns.

  • Clinically Applicable Diagnostic Tool: Our system demonstrates consistent diagnostic performance that surpasses typical clinician accuracy while utilizing only routinely acquired clinical data, offering a practical solution for standardizing MPS evaluation across different clinical settings and practitioner experience levels, ultimately improving treatment planning reliability.

Methodology

Study design and participants

Study population

CBCT images and lateral cephalometric radiographs (LCR) images were collected from orthodontic patients seeking treatment between January 2020 and December 2024. All imaging studies were originally performed for routine clinical diagnostic purposes. This retrospective study was approved by the Ethics Committee of the Affiliated Stomatological Hospital of Fujian Medical University (approval number: [2024] FJMU Oral Ethics Review No. 94). All experiments were performed in accordance with relevant guidelines and regulations for human subjects research. Given the retrospective nature of this study using anonymized radiographic data originally acquired for routine clinical purposes, the Ethics Committee granted a waiver of informed consent in accordance with institutional guidelines.

Inclusion Criteria: Patients were included if they met the following criteria: (1) Age between 7 and 36 years at the time of imaging; (2) Availability of both CBCT and lateral cephalometric radiographs acquired within a 1-month interval; (3) No history of orthodontic treatment, orthognathic surgery, or maxillofacial trauma; (4) Absence of systemic diseases known to affect bone metabolism (e.g., osteoporosis, hyperparathyroidism, corticosteroid therapy); (5) No severe skeletal deformities, cleft lip and palate, or other craniofacial anomalies that could affect normal suture development.

Exclusion Criteria: Patients were excluded for: (1) Poor image quality preventing adequate visualization of the midpalatal suture region in CBCT images; (2) Inadequate visualization of C2–C4 cervical vertebrae in lateral cephalometric radiographs; (3) Absence of evaluable MTM; (4) Presence of severe dental pathology affecting MTM assessment, including extensive caries, periapical lesions, or significant root resorption.

Imaging acquisition protocol

All CBCT images were acquired using a standardized protocol with the following parameters: 120 kVp, 5 mAs, field of view 16 \(\times\) 13 cm, and voxel size 0.25 mm. Patients were positioned according to manufacturer recommendations with the Frankfort horizontal plane parallel to the floor and the midsagittal plane perpendicular to the floor.

All LCR were obtained using conventional radiographic equipment with standardized patient positioning protocols. All radiographs were taken with consistent magnification factors and exposure parameters to ensure image quality and reproducibility.

The imaging protocol utilized clinically indicated lateral cephalograms and CBCT scans, with all modalities derived from standard diagnostic procedures without additional exposure, strictly following ALARA (As Low As Reasonably Achievable) principles and AAOMR guidelines43 for radiation protection.

Dataset characteristics

The final study cohort comprised 200 patients (75 males, 125 females) with a mean age of \(16.65 \pm 6.03\) years (range: 7–36 years). The dataset was randomly stratified into training (60%, n=120), validation (20%, n=40), and testing (20%, n=40) sets with balanced representation of MPS maturation stages across all subsets.

The dataset included three imaging modalities: (1) midpalatal suture regions from CBCT (MPS-CBCT), (2) mandibular third molar regions from CBCT (MTM-CBCT), and (3) cervical vertebrae regions from lateral cephalograms (CVM-LCR). Clinical variables included patient demographics (age, gender) and maturation staging assessments (CVM and MTM stages). The dataset was randomly partitioned at the patient level to ensure that all data modalities for a single patient belonged exclusively to either the training, validation, or testing set, thereby preventing any risk of information leakage.
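
A minimal sketch of how such a patient-level, stage-stratified 60/20/20 split could be produced is shown below; the DataFrame and column names (patient_id, mps_stage) are illustrative assumptions rather than the authors' code.

```python
# Illustrative patient-level, stage-stratified 60/20/20 split (not the authors' code).
# Assumes one row per patient, so all modalities for a patient stay in one subset.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_patients(df: pd.DataFrame, seed: int = 42):
    # Carve out the 60% training portion, stratified by MPS stage.
    train_df, rest_df = train_test_split(
        df, train_size=0.60, stratify=df["mps_stage"], random_state=seed
    )
    # Split the remaining 40% evenly into validation and test sets.
    val_df, test_df = train_test_split(
        rest_df, train_size=0.50, stratify=rest_df["mps_stage"], random_state=seed
    )
    return train_df, val_df, test_df
```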

Statistical analysis was performed using SPSS software (version 26.0, IBM Corp.) with the significance level set at \(p < 0.05\). Detailed statistical analyses are reported in the Statistical analysis subsection of the Results.

Image analysis and clinical assessment

Image processing and region extraction

All CBCT DICOM datasets were processed using Ez3D Plus imaging software (Vatech, Korea) following standardized protocols. Three anatomical regions were systematically extracted for each patient:

Midpalatal Suture Region (MPS-CBCT): Coronal cross-sections were obtained at the palatal plane passing through the posterior nasal spine, visualizing the complete suture anatomy from anterior to posterior nasal spine for consistent anatomical orientation and morphological assessment.

Mandibular Third Molar Region (MTM-CBCT): Cross-sectional images were extracted along the long axis of the MTM. When bilateral mandibular third molars were present, the tooth with the more advanced developmental stage was selected for analysis.

Cervical Vertebral Region (CVM-LCR): Lateral cephalometric radiographs (LCR) were processed to extract C2–C4 cervical vertebrae regions, emphasizing inferior border characteristics and concavity patterns essential for maturation staging.

All extracted images underwent standardized preprocessing, including resizing to 224 \(\times\) 224 pixels and intensity normalization for consistent computational analysis and compatibility with ResNet-based visual backbones.
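
As a rough illustration, the resizing and normalization step might look like the following torchvision pipeline; the grayscale-to-3-channel conversion and the normalization statistics are assumptions, not reported parameters.

```python
# Illustrative preprocessing of the extracted region images (assumed values).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # replicate to 3 channels for ResNet input
    transforms.Resize((224, 224)),                 # standardize spatial size
    transforms.ToTensor(),                         # scale pixel intensities to [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # assumed normalization statistics
])
```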

Clinical staging assessment

Clinical staging assessments were performed by two orthodontists (each with over 10 years of clinical experience) and one senior orthodontic specialist (over 30 years of experience), all possessing extensive expertise in craniofacial radiographic interpretation. The two orthodontists independently performed the staging assessments, with any discrepant cases adjudicated by the senior specialist. Final staging determinations required consensus from at least two of the three evaluators.

Figure 1

Midpalatal suture maturation stages based on Angelieri’s five-stage classification system (A–E). Representative CBCT coronal sections showing the five stages of midpalatal suture development: Stage A shows a straight high-density sutural line; Stage B displays a scalloped appearance with occasional parallel lines; Stage C presents two parallel scalloped lines with small separations; Stage D demonstrates fusion in the posterior region with the suture visible anteriorly; Stage E shows complete fusion with no visible suture.

Figure 2

Cervical vertebral maturation stages based on Baccetti’s classification method (CS1–CS6). Lateral cephalometric radiographs illustrating the six stages of cervical vertebral development, with CS1–CS6 showing progressive morphological changes in the C2–C4 vertebral bodies, including concavity development at the inferior borders and overall modifications of vertebral shape used for skeletal maturity assessment.

Figure 3

Mandibular third molar development stages based on Demirjian’s classification system (A–H). CBCT cross-sectional images demonstrating the eight stages of mandibular third molar development, with Stages A–H showing progression from initial crown mineralization to complete root formation with a closed apex, used for dental maturity assessment.

MPS Maturation Staging (MPS Stage): As shown in Figure 1, MPS maturation was classified using Angelieri’s five-stage system13: Stage A (straight high-density sutural line), Stage B (scalloped appearance with occasional parallel lines), Stage C (two parallel scalloped lines with small separations), Stage D (fusion in posterior region with visible suture anteriorly), and Stage E (complete fusion with no visible suture).

Cervical Vertebral Maturation (CVM Stage): As shown in Figure 2, CVM staging followed Baccetti’s six-stage method25, evaluating morphological characteristics of C2–C4 vertebral bodies, including concavity presence, depth, and extent at inferior borders, and overall vertebral shape modifications.

Mandibular Third Molar Development (MTM Stage): As shown in Figure 3, MTM development was assessed using Demirjian’s eight-stage classification system26, evaluating crown formation, root development, mineralization patterns, and apical closure status from Stage A (initial crown mineralization) to Stage H (complete root formation with closed apex).

Design of DeepMSM

Overall architecture

As shown in Figure 4, the DeepMSM (Deep Learning for Midpalatal Suture Maturation) framework represents a novel multimodal artificial intelligence system specifically designed for automated MPS staging in clinical practice. The system architecture integrates multiple data sources routinely available in orthodontic practice: three imaging modalities (MPS-CBCT, MTM-CBCT, CVM-LCR) and tabular variables (patient age, gender, CVM stage, MTM stage).

The framework employs a hierarchical integration approach, utilizing deep convolutional neural networks for image feature extraction and multilayer perceptron networks for tabular data processing. These modality-specific features are subsequently integrated through attention-based fusion mechanisms to generate comprehensive multimodal representations for final classification.

Figure 4

Schematic overview of the proposed DeepMSM architecture. Three imaging modalities (MPS-CBCT, MTM-CBCT, and CVM-LCR) are each passed through a shared ResNet-50 encoder pre-trained on the RadImageNet dataset27, while an MLP-based encoder processes the tabular features (age, gender, CVM stage, MTM stage). The resulting feature vectors are concatenated and passed into a classification head that outputs the MPS maturation stage \(\{A, B, C, D, E\}\).

Image feature extraction

Enhanced ResNet-50 Backbone: Image features are extracted using a ResNet-50 architecture28 enhanced with a Convolutional Block Attention Module (CBAM)29 and initialized with RadImageNet pre-trained weights27. Each imaging modality is processed through identical encoder networks:

$$\begin{aligned} \textbf{z}_{\alpha } = f_{\theta }(I_{\alpha }), \quad \alpha \in \{\textrm{mps}, \textrm{mtm}, \textrm{cvm}\} \end{aligned}$$
(1)

where \(I_{\alpha }\) represents the input image for modality \(\alpha\) and \(\textbf{z}_{\alpha } \in \mathbb {R}^{128}\) is the extracted feature vector.
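
A minimal sketch of a shared encoder consistent with Eq. (1) is given below: a ResNet-50 trunk whose pooled features are projected to 128 dimensions. Loading RadImageNet weights and inserting CBAM blocks are omitted for brevity; the output dimensionality follows the text, while the other details are assumptions.

```python
# Sketch of the shared image encoder f_theta producing z_alpha in R^128 (Eq. 1).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ImageEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        backbone = resnet50(weights=None)  # RadImageNet pre-trained weights would be loaded here
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])  # conv stages + global avgpool
        self.proj = nn.Linear(2048, feat_dim)  # project pooled 2048-d features to 128-d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x).flatten(1)   # (B, 2048)
        return self.proj(h)            # (B, 128) = z_alpha

encoder = ImageEncoder()
z_mps = encoder(torch.randn(2, 3, 224, 224))  # the same encoder is reused for the MTM and CVM crops
```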

Attention Mechanism: The CBAM attention module29 enhances feature representation through sequential channel and spatial attention:

$$\begin{aligned} M_c(F)&= \sigma \left( \textrm{MLP}(\textrm{AvgPool}(F)) + \textrm{MLP}(\textrm{MaxPool}(F)) \right) \end{aligned}$$
(2)
$$\begin{aligned} M_s(F)&= \sigma \left( \textrm{Conv}\left( [ \textrm{AvgPool}(F); \, \textrm{MaxPool}(F) ]\right) \right) \end{aligned}$$
(3)
$$\begin{aligned} F''&= M_s(M_c(F) \otimes F) \otimes (M_c(F) \otimes F) \end{aligned}$$
(4)

where \(M_c\) and \(M_s\) represent channel and spatial attention modules, \(\otimes\) denotes element-wise multiplication, and \(\sigma\) is the sigmoid activation function.
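
A compact PyTorch sketch of a CBAM block following Eqs. (2)–(4) is shown below; the reduction ratio and 7×7 spatial kernel follow the original CBAM formulation and are assumptions here rather than reported settings.

```python
# Minimal CBAM sketch: channel attention from pooled descriptors through a shared
# MLP (Eq. 2), spatial attention from a convolution over pooled channel maps (Eq. 3),
# applied sequentially as in Eq. (4).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = F.shape
        # Channel attention M_c(F) = sigmoid(MLP(AvgPool F) + MLP(MaxPool F))
        avg = self.mlp(F.mean(dim=(2, 3)))
        mx = self.mlp(F.amax(dim=(2, 3)))
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        F_prime = m_c * F
        # Spatial attention M_s = sigmoid(Conv([AvgPool; MaxPool] over channels))
        spatial = torch.cat([F_prime.mean(dim=1, keepdim=True),
                             F_prime.amax(dim=1, keepdim=True)], dim=1)
        m_s = torch.sigmoid(self.conv(spatial))
        return m_s * F_prime  # F'' in Eq. (4)
```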

Tabular data processing

Tabular variables are processed through a unified MLP architecture with embedding layers for categorical variables:

$$\begin{aligned} \textbf{x}_{\textrm{tabular}}&= [\tilde{T}_{\textrm{age}}, \textrm{Embed}(T_{\textrm{gender}}), \textrm{Embed}(T_{\textrm{cvm}}), \textrm{Embed}(T_{\textrm{mtm}})] \end{aligned}$$
(5)
$$\begin{aligned} \textbf{z}_{\textrm{tabular}}&= \textrm{MLP}(\textbf{x}_{\textrm{tabular}}) \end{aligned}$$
(6)

where \(\tilde{T}_{\textrm{age}} = \frac{T_{\textrm{age}} - \mu _{\textrm{age}}}{\sigma _{\textrm{age}}}\) represents normalized age, and \(\textrm{Embed}(\cdot )\) denotes learnable embedding functions for categorical variables.
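
A sketch of how the tabular branch of Eqs. (5)–(6) could be realized follows; the embedding and hidden-layer sizes are illustrative assumptions.

```python
# Sketch of the tabular encoder: z-scored age concatenated with learnable
# embeddings for gender, CVM stage, and MTM stage, then an MLP (Eqs. 5-6).
import torch
import torch.nn as nn

class TabularEncoder(nn.Module):
    def __init__(self, out_dim: int = 128, emb_dim: int = 8):
        super().__init__()
        self.gender_emb = nn.Embedding(2, emb_dim)   # male / female
        self.cvm_emb = nn.Embedding(6, emb_dim)      # CS1-CS6
        self.mtm_emb = nn.Embedding(8, emb_dim)      # Demirjian stages A-H
        self.mlp = nn.Sequential(
            nn.Linear(1 + 3 * emb_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, age_z, gender, cvm, mtm):
        # Eq. (5): concatenate normalized age with categorical embeddings.
        x = torch.cat([age_z.unsqueeze(1),
                       self.gender_emb(gender),
                       self.cvm_emb(cvm),
                       self.mtm_emb(mtm)], dim=1)
        return self.mlp(x)  # Eq. (6): z_tabular
```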

Multimodal fusion strategy

Hierarchical Attention-Based Fusion: Image features are first combined through attention-weighted fusion:

$$\begin{aligned} w_{\alpha }&= \frac{\exp (\textbf{v}^T \tanh (W_{\alpha } \textbf{z}_{\alpha }))}{\sum _{\beta } \exp (\textbf{v}^T \tanh (W_{\beta } \textbf{z}_{\beta }))} \end{aligned}$$
(7)
$$\begin{aligned} \textbf{z}_{\textrm{img}}&= \sum _{\alpha \in \{\textrm{mps}, \textrm{mtm}, \textrm{cvm}\}} w_{\alpha } \textbf{z}_{\alpha } \end{aligned}$$
(8)

Image and tabular features are then projected to a common space with gating mechanisms:

$$\begin{aligned} \hat{\textbf{z}}_{\textrm{img}}&= W_{\textrm{img}} \textbf{z}_{\textrm{img}} \odot \sigma (W_{g1} \textbf{z}_{\textrm{img}} + \textbf{b}_{g1}) \end{aligned}$$
(9)
$$\begin{aligned} \hat{\textbf{z}}_{\textrm{tabular}}&= W_{\textrm{tabular}} \textbf{z}_{\textrm{tabular}} \odot \sigma (W_{g2} \textbf{z}_{\textrm{tabular}} + \textbf{b}_{g2}) \end{aligned}$$
(10)
$$\begin{aligned} \textbf{z}_{\textrm{fused}}&= [\hat{\textbf{z}}_{\textrm{img}}, \hat{\textbf{z}}_{\textrm{tabular}}] \end{aligned}$$
(11)

where \(\odot\) represents element-wise multiplication and \([\cdot , \cdot ]\) denotes concatenation.
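
The fusion of Eqs. (7)–(11) might be implemented as in the following sketch; for brevity a single attention projection is shared across modalities (the text defines per-modality weights \(W_{\alpha }\)), and all dimensions are assumptions.

```python
# Sketch of hierarchical attention-based fusion: attention weights over the three
# image feature vectors (Eqs. 7-8), gated projections (Eqs. 9-10), concatenation (Eq. 11).
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.attn_proj = nn.Linear(dim, dim)           # shared stand-in for W_alpha
        self.attn_vec = nn.Linear(dim, 1, bias=False)  # v^T
        self.img_proj, self.img_gate = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.tab_proj, self.tab_gate = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, z_mps, z_mtm, z_cvm, z_tab):
        z_imgs = torch.stack([z_mps, z_mtm, z_cvm], dim=1)          # (B, 3, dim)
        scores = self.attn_vec(torch.tanh(self.attn_proj(z_imgs)))  # Eq. (7), unnormalized
        w = torch.softmax(scores, dim=1)                            # normalize over modalities
        z_img = (w * z_imgs).sum(dim=1)                             # Eq. (8)
        z_img_hat = self.img_proj(z_img) * torch.sigmoid(self.img_gate(z_img))  # Eq. (9)
        z_tab_hat = self.tab_proj(z_tab) * torch.sigmoid(self.tab_gate(z_tab))  # Eq. (10)
        return torch.cat([z_img_hat, z_tab_hat], dim=1)             # Eq. (11): z_fused
```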

Classification and training strategy

Classification Head: The fused representation is processed through a multi-layer classification network:

$$\begin{aligned} \textbf{h}^{(1)}&= \text {ReLU}(W^{(1)}\textbf{z}_{\textrm{fused}} + \textbf{b}^{(1)}) \end{aligned}$$
(12)
$$\begin{aligned} \textbf{h}^{(2)}&= \text {Dropout}(\text {ReLU}(W^{(2)}\textbf{h}^{(1)} + \textbf{b}^{(2)}), p=0.3) \end{aligned}$$
(13)
$$\begin{aligned} \textbf{z}_{\textrm{logits}}&= W^{(3)}\textbf{h}^{(2)} + \textbf{b}^{(3)} \end{aligned}$$
(14)

Class probabilities are computed using softmax: \(p(y=c|\textbf{x}) = \frac{\exp (z_c)}{\sum _{c'=1}^5 \exp (z_{c'})}\) for MPS stages \(c \in \{A, B, C, D, E\}\).
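
A minimal head matching Eqs. (12)–(14) is sketched below; the dropout probability follows the text, while the hidden-layer widths are assumed.

```python
# Sketch of the classification head: two hidden layers with ReLU, dropout p=0.3,
# then logits over the five MPS stages (Eqs. 12-14).
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(),                      # h^(1); 256 = dim of z_fused
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.3),    # h^(2)
    nn.Linear(64, 5),                                    # logits for stages A-E
)
# Class probabilities follow via softmax, e.g. torch.softmax(classifier(z_fused), dim=1).
```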

Progressive Training Strategy: A three-stage training approach was implemented to optimize model performance. In the first stage, image encoder components were trained while tabular data processing components remained frozen (learning rate: 0.001). The second stage integrated tabular data processing while maintaining frozen image features (learning rate: 0.0005). Finally, the third stage performed end-to-end fine-tuning of all system parameters (learning rate: 0.0001). This progressive approach ensures stable convergence and optimal utilization of the limited training dataset.
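
The freezing pattern and learning rates above could be scheduled as in the following sketch; the module names and the assignment of the fusion and classification components to each stage are illustrative assumptions.

```python
# Sketch of the three-stage progressive training schedule (assumed component grouping).
import torch
import torch.nn as nn

# Placeholder container standing in for the assembled DeepMSM network.
model = nn.ModuleDict({
    "image_encoder": nn.Linear(10, 10),
    "tabular_encoder": nn.Linear(10, 10),
    "fusion": nn.Linear(10, 10),
    "classifier": nn.Linear(10, 5),
})

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# (trainable parts, frozen parts, learning rate) for each of the three stages.
stages = [
    (["image_encoder"], ["tabular_encoder", "fusion", "classifier"], 1e-3),
    (["tabular_encoder", "fusion", "classifier"], ["image_encoder"], 5e-4),
    (list(model.keys()), [], 1e-4),
]

for trainable, frozen, lr in stages:
    for name in frozen:
        set_trainable(model[name], False)
    for name in trainable:
        set_trainable(model[name], True)
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    # ... run the training loop for this stage with `optimizer` ...
```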

Weighted cross-entropy loss is employed to address class imbalance:

$$\begin{aligned} \mathscr {L} = -\sum _{i=1}^N w_{y_i} \log (p(y_i|\textbf{x}_i)) \end{aligned}$$
(15)

where \(w_c = \frac{N}{5 \cdot N_c}\) with N representing the total number of samples, \(N_c\) being the number of samples in stage c, and \(c \in \{A, B, C, D, E\}\).
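
The class weights of Eq. (15) map directly onto PyTorch's weighted cross-entropy, as in this sketch.

```python
# Sketch of the weighted cross-entropy in Eq. (15): w_c = N / (5 * N_c).
import torch
import torch.nn as nn

def make_weighted_ce(stage_labels: torch.Tensor) -> nn.CrossEntropyLoss:
    counts = torch.bincount(stage_labels, minlength=5).float()  # N_c per stage A-E
    weights = stage_labels.numel() / (5.0 * counts)             # w_c = N / (5 * N_c)
    return nn.CrossEntropyLoss(weight=weights)

# Usage: loss_fn = make_weighted_ce(train_labels); loss = loss_fn(logits, targets)
```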

Clinical integration and interpretability

DeepMSM is designed for seamless integration into clinical workflows, utilizing only routinely acquired diagnostic data without requiring additional examinations or specialized equipment. The system provides interpretable outputs through attention visualization techniques, allowing clinicians to understand the basis for automated predictions and maintain clinical oversight of the diagnostic process.

Results

Study population characteristics

A total of 200 participants were enrolled in this study, comprising 75 males (mean age \(15.12 \pm 4.85\) years) and 125 females (mean age \(17.57 \pm 6.47\) years). The overall mean age was \(16.65 \pm 6.03\) years, reflecting the study’s focus on adolescents and young adults undergoing orthodontic evaluation. The age and gender distribution across different developmental stages is presented in Table 1.

Table 1 Age and gender distribution of participants.

The distribution shows a predominance of participants in the 12–17 age group (47%), followed by the 18–29 age group (34%), consistent with the typical demographics of orthodontic patients seeking maxillary expansion treatment.

Statistical analysis

MPS stage distribution patterns

Figure 5 illustrates the distribution of MPS stages across key clinical predictors, demonstrating systematic developmental relationships that support multimodal prediction approaches.

Figure 5

Distribution of MPS stages across clinical predictors. Bar charts showing the distribution of midpalatal suture stages across (a) age groups, demonstrating age-related progression patterns; (b) cervical vertebral maturation stages, showing developmental correspondence; and (c) mandibular third molar stages, revealing correlation with dental development.

Age-related distribution revealed distinct patterns with early MPS stages (A and B) predominantly occurring in younger participants (Age <12: 83% stages A–B; Age 12–17: 61% stages A–B), while advanced stages (D and E) showed higher frequencies in older age groups (Age 18–29: 54% stages D–E; Age \(\ge\)30: 75% stages D–E).

CVM stage correlations demonstrated strong developmental correspondence, with early MPS stages (A–B) primarily aligning with early CVM stages (CS1–CS3: 78% stages A–B), and later MPS stages (D–E) correlating with advanced CVM stages (CS4–CS6: 69% stages D–E).

MTM stage correlations showed that MPS-A corresponded primarily to early MTM stages (A–C: 85%), while MPS-D and MPS-E were predominantly associated with advanced MTM stages (F–H: 71%).

Inter-variable correlation analysis

Spearman rank correlation analysis revealed significant relationships among all variables (\(p < 0.05\)), as shown in Figure 6. The strongest correlation was observed between age and MTM stage (\(\rho = 0.74\)), followed by strong correlations between MTM stage and MPS stage (\(\rho = 0.63\)), MTM stage and CVM stage (\(\rho = 0.63\)), age and CVM stage (\(\rho = 0.54\)), and CVM stage and MPS stage (\(\rho = 0.52\)). A moderate correlation was found between age and MPS stage (\(\rho = 0.48\)).
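
For reference, pairwise Spearman coefficients and p-values of this kind can be computed as in the brief sketch below; the DataFrame column names are assumed, not the study's variable labels.

```python
# Illustrative computation of a Spearman correlation matrix with p-values.
import pandas as pd
from scipy.stats import spearmanr

def spearman_matrix(df: pd.DataFrame,
                    cols=("age", "mps_stage", "cvm_stage", "mtm_stage")):
    rho, pval = spearmanr(df[list(cols)])  # full correlation and p-value matrices
    return (pd.DataFrame(rho, index=cols, columns=cols),
            pd.DataFrame(pval, index=cols, columns=cols))
```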

Figure 6

Heatmap of the Spearman correlation matrix among MPS stage, MTM stage, CVM stage, and age. All correlations are statistically significant (p < 0.05), with correlation strength indicated by color intensity and numerical values: \(|\rho | \ge 0.5\) indicates a strong correlation and \(0.3 \le |\rho | < 0.5\) a moderate correlation.

These correlations validate the biological rationale for multimodal integration while indicating that individual predictors provide complementary rather than redundant information for MPS stage determination.

Model performance comparison

Table 2 presents comprehensive performance metrics across all evaluated models, revealing clear advantages of multimodal integration. DeepMSM achieved the highest diagnostic performance with 93.75% accuracy and 93.81% F1-score, demonstrating superior capability compared to all other evaluated approaches.

Semi-multimodal models utilizing both CBCT and LCR achieved moderate performance, with ResNet50-SM28 reaching 81.25% accuracy, EfficientNet-SM30 achieving 75%, and ResNet18-SM28 obtaining 73.75%. These dual-modality approaches demonstrated the complementary value of combining three-dimensional suture morphology with cervical vertebral maturation indicators, yet remained insufficient for reliable clinical implementation.

Single-modality approaches showed considerable variation in performance across different data sources and architectures. Among CBCT-only models, ResNet50-SG28 achieved the highest accuracy of 71.25%, followed by EfficientNet-SG30 at 70% and ResNet18-SG28 at 56.25%. The tabular data-only baseline model (MLP-SG) achieved only 47.50% accuracy, demonstrating that traditional clinical indicators alone are inadequate for reliable MPS staging. This limited performance of the MLP-SG baseline underscores the complexity of craniofacial maturation patterns that cannot be captured through demographic and developmental parameters without accompanying imaging data.

Manual assessment by orthodontic residents demonstrated substantial challenges, with accuracy rates of only 40.11% and 30.36%, highlighting the inherent difficulty of subjective MPS evaluation even with complete access to imaging and tabular information. The performance hierarchy clearly demonstrates the incremental value of multimodal data integration, with DeepMSM’s 93.75% accuracy representing a 12.5 percentage-point improvement over the best semi-multimodal approach and a 46.25 percentage-point improvement over the tabular baseline (MLP-SG). Most significantly, the automated framework achieved more than double the accuracy of manual assessment, underscoring the substantial clinical value of standardized, objective diagnostic approaches.

Table 2 Comparative model performance.

Detailed performance analysis of DeepMSM

Stage-specific performance

Figure 7 presents the confusion matrix and detailed classification metrics, revealing stage-specific model performance patterns.

Figure 7

DeepMSM classification performance analysis. (a) Confusion matrix normalized by row percentages, showing the distribution of predicted labels for MPS stages A–E. (b) Precision, recall, and F1-scores for each stage, demonstrating excellent performance across all maturation stages, particularly for the clinically critical stages C and D.

DeepMSM demonstrated excellent performance for extreme maturation stages, with Stage A achieving precision of 97%, recall of 95%, and F1-score of 96%, while Stage E reached precision of 96%, recall of 98%, and F1-score of 97%. These results reflect the distinct morphological characteristics of early and complete suture maturation stages, which facilitate reliable automated identification.

For the clinically critical intermediate stages, the system maintained strong diagnostic accuracy across all parameters. Stage B demonstrated precision of 92%, recall of 89%, and F1-score of 91%, while Stage C achieved precision of 91%, recall of 93%, and F1-score of 92%. Most importantly, Stage D showed precision of 94%, recall of 91%, and F1-score of 93%, confirming the model’s capability to accurately identify this crucial treatment decision threshold.

Analysis of misclassification patterns revealed that 87% of errors occurred between adjacent developmental stages, a finding that aligns with clinical expectations given the gradual nature of biological maturation processes. This error distribution is clinically acceptable, as even experienced orthodontists face inherent challenges in distinguishing transitional phases where morphological features overlap between consecutive maturation stages.

Clinical decision-making impact

The model’s ability to accurately distinguish between Stages C and D (92% and 93% F1-scores respectively) is particularly significant for clinical practice, as this differentiation directly determines treatment approach selection between conventional rapid maxillary expansion and surgically-assisted expansion protocols.

Model interpretability and clinical relevance

Model interpretability and validation

Gradient-weighted Class Activation Mapping (Grad-CAM) analysis was employed to visualize and validate the anatomical regions contributing to model decision-making processes (Figure 8). The analysis demonstrated that DeepMSM consistently focused on clinically relevant anatomical structures across all imaging modalities. For cervical vertebral images (CVM-LCR), the model primarily concentrated on C2–C4 vertebral body morphology, particularly the inferior border concavity patterns that are fundamental to established CVM staging criteria. In midpalatal suture images (MPS-CBCT), attention was concentrated on the posterior midpalatal suture regions and surrounding bone density variations, which correspond to the anatomical features used by clinicians for manual MPS assessment. For mandibular third molar images (MTM-CBCT), the model emphasized root development patterns and apical closure status, aligning with the key parameters in Demirjian’s classification system.
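
A generic Grad-CAM routine under the usual formulation (gradients of the predicted class, pooled over space, weighting the final convolutional activations) is sketched below; this is an assumed illustrative implementation, not the authors' visualization code, and `target_layer` denotes whichever convolutional stage is hooked (for a ResNet-50 encoder, typically the last residual block).

```python
# Illustrative Grad-CAM sketch (assumed implementation).
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx=None):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    logits = model(image)                                   # (1, num_classes)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                         # gradients w.r.t. hooked activations
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize heatmap to [0, 1]
```

In a DeepMSM-style setup, the call would be repeated for each imaging branch (MPS-CBCT, MTM-CBCT, CVM-LCR) to produce per-modality heatmaps such as those in Figure 8.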

Figure 8

Grad-CAM visualizations of the CVM-LCR, MTM-CBCT, and MPS-CBCT modalities. Heatmaps indicate the discriminative regions associated with automated decision-making: CVM-LCR attention centers on cervical vertebral morphology, MTM-CBCT on root development patterns, and MPS-CBCT on posterior suture regions. Red represents high diagnostic relevance and blue low relevance, confirming that the model attends to clinically established anatomical landmarks.

These activation patterns provide strong evidence that DeepMSM has learned to identify and utilize the same anatomical features that experienced orthodontists rely upon for manual assessment, supporting the model’s biological validity and clinical applicability.

Clinical validation insights

The systematic attention to anatomically relevant structures demonstrates that DeepMSM has learned clinically meaningful features rather than spurious correlations, supporting its potential for reliable clinical deployment and integration into standard orthodontic diagnostic workflows.

Discussion

This study introduces the DeepMSM framework, a multimodal deep learning system that addresses the critical challenge of accurately assessing midpalatal suture (MPS) maturation. By integrating CBCT, LCR, and clinical data, the model achieved a high accuracy of 93.75%, demonstrating strong performance in differentiating between the clinically pivotal Stages C and D (92% and 93% F1-scores, respectively). This capability directly informs the crucial decision between conventional rapid palatal expansion (RPE) and surgically-assisted rapid palatal expansion (SARPE), promising to enhance treatment planning in orthodontics.

Our results highlight the limitations of current assessment methods, which rely on subjective radiographic interpretation and suffer from significant inter-examiner variability, with reported agreement rates often below 80%13,15,34. Manual assessments by orthodontic residents in our study yielded accuracies between 30.36% and 40.11%, a finding consistent with benchmarks reported by Chhatwani et al.34, which showed a clear expertise gradient in MPS classification. DeepMSM substantially outperforms these manual baselines by leveraging a multimodal approach that synthesizes three-dimensional anatomical insights from CBCT with skeletal and dental maturation indicators. This integration provides a more comprehensive and objective assessment than any single modality alone. The model’s misclassifications were predominantly between adjacent stages (87%), which is clinically understandable given the gradual nature of biological maturation.

The clinical implications of DeepMSM are twofold. First, it offers a standardized and reproducible tool that can reduce the risk of inappropriate treatment: accurate identification of Stage C can prevent unnecessary surgeries, while precise Stage D identification helps avoid failed conventional expansion attempts. Second, the model serves as a valuable educational and trust-building instrument. Our Grad-CAM visualizations confirm that DeepMSM focuses on clinically relevant anatomical landmarks, such as C2–C4 vertebral morphology and posterior suture characteristics, which fosters clinician trust and acceptance. This transparency also provides an objective guide for trainees, helping to standardize diagnostic approaches and shorten the learning curve associated with this complex assessment.

Despite its promising performance, this study has several limitations that must be acknowledged. First, the single-center, retrospective design necessitates comprehensive external validation to confirm its generalizability across diverse populations, imaging protocols, and clinical settings. Second, the cross-sectional nature of the data limits insights into individual maturation trajectories, a gap that could be addressed by future longitudinal research. Third, our study’s age distribution (mainly 12–29 years) underrepresents older patients who often require MARPE or SARPE due to complete suture fusion35,40,41, and future multi-center studies should include more mature cases. Finally, our use of 2D images, while computationally efficient, could be enhanced by full 3D architectures in future work.

Future research will focus on addressing these limitations through prospective, multi-center clinical trials to establish the real-world clinical utility of DeepMSM. Technologically, future iterations of the framework may incorporate advances in AI to further enhance performance. Critically, successful deployment of DeepMSM requires seamless and responsible integration into clinical workflows. While the system is designed for compatibility with existing PACS and EHR systems, its implementation must be guided by robust data governance and security practices. As recent literature on medical imaging AI emphasizes, addressing legal, ethical, and cybersecurity risks, including patient consent, data minimization, and data security, is paramount for the safe deployment of such technologies in clinical settings42.

In summary, DeepMSM represents significant progress toward objective, evidence-based orthodontic assessment. By providing standardized and accurate MPS staging, it serves as an essential decision support tool that complements clinical expertise, with the potential to significantly improve the reliability and safety of maxillary expansion therapy.

Conclusion

This study developed and validated DeepMSM, a multimodal deep learning framework for automated midpalatal suture maturation assessment. The framework achieved 93.75% accuracy by integrating CBCT imaging, lateral cephalometric radiographs, and clinical indicators, substantially outperforming single- and dual-modality approaches (47.50%–81.25%) and manual assessment (30.36%–40.11%).

The framework’s ability to accurately distinguish between Stages C and D (92-93% F1-scores) directly addresses the critical clinical challenge in treatment selection between conventional and surgically-assisted maxillary expansion. Grad-CAM analysis confirmed that the model focuses on anatomically relevant structures, supporting its clinical validity.

Study limitations include single-center validation and the need for broader population testing. Future research should focus on multi-center validation and prospective clinical trials to evaluate real-world implementation. Additionally, the systematic exploration and integration of newer generation encoders represents an important direction for future work, as our modular framework design specifically supports such upgrades to benefit from advances in backbone architectures.

DeepMSM provides clinicians with a standardized, objective diagnostic tool that has the potential to improve treatment outcomes for patients requiring maxillary expansion therapy.