Introduction

Congenital malformations are a leading cause of pregnancy arrest1 and perinatal lethality in developed countries2, and the most frequent malformation is that of the heart3. While genetics is thought to play a major role in cardiac developmental defects, the specific genetic causes are known only for a fifth of the patients4, and consist mainly of monogenic mutations associated with syndromes or cardiomyopathies. Polygenic inheritance or gene-environment interactions are believed to underlie a large part of the other cases, including structural heart defects. However, owing to the rarity of any given combination of mutations in human pathology, dissecting the specific contribution of the hundreds of genes involved in heart development is a daunting task.

The mouse is a valuable model to study heart morphogenesis owing to the relative proximity of heart structures and developmental patterns between humans and mice5. Genetically-modified mouse strains contribute to exploring genotype-phenotype relationships underlying congenital defects. Data collected from these strains have been used to build up extensive databases such as the Mouse Genome Informatics (MGI) database6 or the International Mouse Phenotyping Consortium (IMPC) database7.

Several imaging techniques enable the precise capture and identification of 3D cardiovascular structures. These include micro-computed tomography (µCT), magnetic resonance imaging (MRI), and optical coherence tomography. Thanks to technical improvements, mouse embryogenesis is now characterized at earlier stages and higher throughput8. However, extracting meaningful insights from the vast amounts of data generated by these techniques has become a new challenge for embryologists, as the examination of individual 3D images is a complex and time-consuming process. For example, manual segmentation of organs or tumors by clinicians requires on the order of 10 min per scan9,10,11; detailed segmentation of a mouse heart takes about 30 min, and phenotyping takes between 20 min and 1 h per heart and requires extensive anatomical training and knowledge of the medical nomenclature of heart defects.

Therefore, automating the analysis of 3D cardiac imaging data holds the potential to increase the throughput of anatomic characterizations. A reliable automated segmentation of the heart volume would allow the extraction of detailed quantitative measurements of the heart structure, and an automated screen of normal versus malformed hearts could speed up and potentially improve the identification of organ anomalies, while adequate visualization tools could highlight variations in morphological features characteristic of diseased samples.

Although methods for automated segmentation and/or classification of congenital heart defects (CHD) in mice have been reported before12,13,14,15,16,17,18,19, they have several limitations. First, segmentation and analysis algorithms, when publicly available, typically require coding skills to be used, and hence call for more user-friendly alternatives14,15. Secondly, previous work on analysing morphological phenotypes of developing mice in µCT data has leveraged non-linear registration of CT-scans towards a population-averaged image, and/or separate analyses of voxel intensities, tensor-based morphometry, and comparison of whole-organ volumes with wild-type mice12,14,15. The use of non-linear registration makes morphological analysis potentially susceptible to errors caused by anatomical structures unrelated to the heart. The reliance on handcrafted diagnostic features (i.e., features defined by a human expert) leaves open the possibility that more pertinent, less intuitive features can be extracted from the data using machine learning techniques, potentially enabling better diagnosis. Deep learning methods, including convolutional neural networks (CNNs)20,21, iteratively refine and optimize features for a given task. Therefore, image features learned by CNNs can emerge as new phenotypic descriptors that have not been explicitly defined. Additionally, by jointly analysing multiple features, neural networks may enable the diagnosis of subtle phenotypes involving several features whose departure from normality may be insufficient for reliable diagnosis when considered separately.

In this study, we describe MouseCHD, an approach based on 3D CNNs that can automatically segment heart volumes from a µCT-scan of the thorax and abdomen of mouse fetuses and can classify cardiac anatomy as normal or abnormal. We validate the segmentation and diagnosis tools quantitatively on perinatal mice, at embryonic day (E) 18.5 and at birth (P0). Moreover, we provide these methods as a plugin for the widely used Napari platform22, allowing users to apply our heart segmentation and classification tools without programming experience. Importantly, this plugin can also be used to fine-tune the classification model on the user’s own data. The proposed tools, along with our new dataset, should be instrumental for fast and objective screening of mouse samples for the study of heart morphogenesis and congenital cardiac defects.

Results

Overview

As a brief overview, we implemented, trained, and tested a computational pipeline (MouseCHD) for analysing μCT scans of mouse fetuses, consisting of two main modules: (i) a module for heart segmentation and (ii) a diagnosis module for identification of malformed hearts. The segmentation module extracts heart regions from thoracic and abdominal μCT scans, while the diagnosis (classification) module predicts the presence or absence of CHD based on segmented hearts. We first assess the segmentation and then the diagnostic performance of MouseCHD on an initial cohort of μCT scans of control and mutant mice. Next, we evaluate the model’s robustness on a ‘prospective’ cohort obtained later and a ‘divergent’ cohort that contains significant differences in phenotypes. We then assess the effect of fine-tuning the model on the divergent data. Finally, we present a user-friendly plugin that enables easy application of our segmentation and diagnosis modules, as well as their retraining on other datasets.

Initial cohort of control and mutant mice

We first assembled a cohort (hereafter called ‘initial cohort’) containing 139 μCT scans of 139 individual mice (Fig. 1, Methods). This cohort comprised 56 mice at E18.5 and 83 mice at birth (P0). Out of the 139 mice, 38 (27%) suffered from CHD (Fig. 1b). These CHDs belonged to five major non-exclusive categories: apex malposition (dextrocardia, mesocardia), atrial situs defects (right isomerism, left isomerism or situs inversus of atrial appendages), septal defects (complete atrioventricular septal defects), ventricle malposition (L-LOOP), and malposition of the great arteries (double outlet right ventricle, transposition of the great arteries), as shown in the UpSet plot of Fig. 1b (one scan with CHD was excluded from the UpSet plot only because the type of anomaly was difficult to interpret). Among the 37 remaining hearts, 36 displayed at least two malformations. Septal defects were present in all 38 abnormal hearts. See Supplementary Fig. 1 for examples of annotated phenotypes and Supplementary Data 1 for a comprehensive list of samples and their annotations.

Fig. 1: Micro-computed tomography (µCT) images of mouse fetuses and phenotyping of congenital heart defects.
figure 1

a μCT scans of the thorax and abdomen of mouse fetuses at 18.5 days of development (E18.5, left) and at birth (P0, right) displayed as 3D renderings (top) or stacks of 2D slices (bottom). Blue bounding boxes show the set of 2D thoracic slices containing the heart. Red bounding boxes are framed around the heart. The corresponding 2D image stacks are shown with matching color frames below. Yellow contours on each slice outline the automatically segmented heart. b UpSet plot53 showing the distribution of five types of heart malformations for n = 37 mice with heterotaxy. SD septal defects, MGA malposition of the great arteries, ASD atrial situs defects, AM apex malposition, VM ventricle malposition. Horizontal bars show the total number of cases for each type of malformation. Because many mice have multiple malformations, the total (n = 129) exceeds the number of samples (n = 37). Vertical bars show the number of cases that exhibit the combination of malformations shown in the corresponding matrix column below. For example, 5 cases have both SD and MGA but none of the other malformations. Out of the 37 cases, 36 cases have at least two anomalies, and 8 cases exhibit all 5 types of anomalies simultaneously. c Contingency table for diagnostic label (CHD or normal) and developmental stage (E18.5 or P0) for n = 139 fetuses. Scale bars in (a): 0.5 mm for 3D renderings (top), 4 mm (left) or 1.5 mm (right) for 2D slices (bottom). Source data is available in Supplementary Data 1.

Hearts at different developmental stages naturally differ in size and morphology23 (Fig. 1a); developmental stage is therefore a potential confounder for the classification of heart anomalies. To mitigate this risk, we analysed scans with similar proportions of CHD and normal hearts at both stages (CHD-to-normal ratio of 15/41 = 0.37 for E18.5 and 23/60 = 0.38 for P0) (Fig. 1c).

Heart segmentation module based on nnU-Net

We implemented a module to isolate the heart from the thoracic and abdominal scans after image preprocessing (Fig. 2a; Methods). Our 3D segmentation module is based on nnU-Net, a self-configuring neural network framework built on the U-Net24 architecture, which has demonstrated high performance and robustness for a large number of data types25. The nnU-Net automatically sets up policies for data preprocessing, network architectures, training hyperparameters, and postprocessing, based on data fingerprints. The framework automatically splits training scans into five folds and applies a cross-validation strategy to train five distinct models; the final segmentations are obtained by averaging the probabilities predicted by these five models. Since the heart is a contiguous entity, we further improved the segmentation by keeping only the largest connected component, thereby removing fragmented areas.
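The connected-component cleanup can be sketched as follows, using scipy.ndimage on a toy binary mask (the function name and example mask are illustrative, not the pipeline's actual code):

```python
# Keep only the largest connected component of a binary heart mask,
# mirroring the postprocessing step described above.
import numpy as np
from scipy import ndimage

def largest_connected_component(mask: np.ndarray) -> np.ndarray:
    """Return a binary mask keeping only the largest 3D connected component."""
    labeled, n = ndimage.label(mask)                     # label connected components
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))  # voxel count per component
    largest = int(np.argmax(sizes)) + 1                  # component labels start at 1
    return (labeled == largest).astype(mask.dtype)

# Toy example: one large blob (the "heart") plus one isolated fragment.
mask = np.zeros((10, 10, 10), dtype=np.uint8)
mask[2:7, 2:7, 2:7] = 1   # main blob, 125 voxels
mask[8, 8, 8] = 1         # spurious fragment
clean = largest_connected_component(mask)
```

The same idea applies regardless of image size; only the largest blob survives, so disconnected over-segmentation artifacts are removed.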

Fig. 2: Deep learning-based framework to identify abnormal heart morphology.
figure 2

a A nnU-Net is used to segment hearts from μCT scans. The 3D renderings show the entire μCT scan on the left, and a bounding box around the segmented heart on the right (with 5 voxel padding along the three axes). This bounding box is used to define inputs for the classification module in (c). b From the stack of slices in this bounding box, we extract five distinct 3D images by sampling every fifth slice, as illustrated (each color corresponds to a distinct 3D image), leading to five times more 3D training images. c The diagnosis module is a 3D CNN that takes 3D images resized to 64 × 64 × 32 voxels as input and comprises four convolutional blocks, a 3D max-pooling layer, a global average pooling layer, a dense (fully connected, FC) layer with 512 neurons and an output layer with two neurons (encoding the predicted probabilities of hearts having CHD, p(CHD) or being normal, p(normal)). Each convolutional block comprises one 3D convolutional layer and a 3D batch normalization layer. Scale bars: 0.5 mm in (a), 1.5 mm in (b).

For training and evaluation, the nnU-Net requires a segmentation ground-truth, which was defined manually by an embryologist on 40 μCT scans by outlining 6261 slices in total; these 40 scans contained samples from both E18.5 and P0 stages, and contained both normal and CHD hearts (Methods). Table 1 details the categories and developmental stages for training and testing data in the initial cohort. This data split guaranteed that all stages and phenotypes were represented in both training and testing data.

Table 1 Data split for training and quantitatively testing automated heart segmentation (initial cohort)

For a training dataset of n = 22 scans, training took about 24 h on a GPU with 40 GB RAM (NVIDIA A100-SXM4-40GB) or 46 GB RAM (NVIDIA A40 46GB) and 2 CPUs with 500 GB RAM each. Once trained, the model performed segmentation within 2–4 min per sample on a 16 GB RAM GPU.

Robust heart segmentation from mouse thoracic and abdominal μCT scans

The segmentation of the heart serves two purposes: (i) it enables a quantitative analysis of heart volume and shape, and (ii) it facilitates the subsequent identification of malformed hearts by restricting the analysed images to regions containing the heart.

We evaluated the performance of segmentation in three ways (Fig. 3). First, we quantitatively compared automated heart segmentations to manual heart segmentations on 12 μCT scans (Methods) (Fig. 3a). The nnU-Net segmentation achieved a Dice coefficient of 96.5%, a recall of 96.1%, and a precision of 95.9%, indicating highly accurate segmentations (Fig. 3b), and comparing favorably to alternative methods17,18 (Supplementary Note 1 and Supplementary Figs. 2, 3).
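The reported metrics are standard voxel-wise overlap measures; a minimal numpy sketch (illustrative names, not the evaluation code used in the paper):

```python
# Dice coefficient, recall, and precision between a predicted and a
# manual (ground-truth) binary segmentation mask.
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # correctly predicted heart voxels
    fp = np.logical_and(pred, ~truth).sum()   # predicted but not in ground truth
    fn = np.logical_and(~pred, truth).sum()   # missed ground-truth voxels
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "recall": tp / (tp + fn),        # fraction of true heart voxels recovered
        "precision": tp / (tp + fp),     # fraction of predicted voxels that are correct
    }

# Toy masks: 9 true voxels, prediction misses 1 (fn = 1) and adds 2 (fp = 2).
truth = np.zeros((4, 4, 4), dtype=np.uint8)
truth[1:4, 1:4, 1] = 1
pred = truth.copy()
pred[1, 1, 1] = 0
pred[0, 0, 0] = 1
pred[0, 0, 1] = 1
m = segmentation_metrics(pred, truth)
```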

Fig. 3: Evaluation of deep learning-based heart segmentation from μCT scans.
figure 3

a Heart segmentation results (evaluated by an expert as perfect), shown on a single slice of a μCT scan: original image (no overlay), manual segmentation (blue overlay), automated segmentation by the nnU-Net (yellow overlay), and overlap between automated and manual segmentations (red overlay). b Segmentation performance metrics for the five models trained by nnU-Net on five data folds. Dice coefficient, recall, and precision are shown (see Methods). Ensemble corresponds to the pixel-wise average over the five folds. c Horizontal bars show the nnU-Net segmentation performance evaluated by an expert: an embryologist visually assessed heart segmentations from n = 111 mice (28 hearts used for training were excluded) and labelled 96 hearts (86%) as perfectly segmented. d, e Vertical bars show the number of perfectly (pale green) or imperfectly (dark grey) segmented hearts according to stage (d) or label (CHD or normal) (e) in the same data set of n = 111 mice. The proportion of perfectly segmented hearts was not significantly associated with developmental stage (Fisher test p = 1), and was marginally significantly associated with the presence of CHD (Fisher test p = 0.048). f Boxplots compare the heart volumes as measured from automated segmentations in n = 139 mice for the two developmental stages. Green dots correspond to normal hearts, red dots to hearts with CHD. Scale bar in (a): 4 mm. Source data is available in Supplementary Data 1.

Second, an embryologist qualitatively assessed the segmentations for the complete dataset of 139 μCT scans and labeled them as either perfectly segmented, over-segmented, or under-segmented (Methods). Ignoring the 28 scans used for training, and thus considering only the remaining 111 scans, we found that 86% of hearts (n = 96) were perfectly segmented, while 14% (n = 15) were under- or over-segmented (Fig. 3c). For these imperfect segmentations, the discrepancies between automated and manual segmentations remained minor (see Fig. 3a and Supplementary Fig. 4 for examples), and were mainly restricted to the great vessels connecting to the heart. The percentage of perfectly segmented scans did not depend on the developmental stage, since it was the same (86%) for both E18.5 and P0 (Fisher exact test p = 1.00) (Fig. 3d). Percentages of perfectly segmented scans for normal and CHD hearts were 90% and 74%, respectively, a difference of marginal significance (Fisher test p = 0.048) (Fig. 3e).

Third, we used our automated segmentation to calculate heart volumes in the entire dataset. Since the heart grows in size during gestation26,27, accurate segmentation should yield an increase in volume between the E18.5 stage and birth (P0). This was indeed the case, since segmented heart volumes at birth (16.41 ± 3.29 mm3) were significantly larger than at E18.5 (12.11 ± 3.22 mm3) (Mann-Whitney U test: n1 = 56, n2 = 83, p = \(1.31\times10^{-10}\)) (Fig. 3f), a volume increase of 35.5% (95% confidence interval: 26%–44.7%).
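The volume comparison can be reproduced in outline with scipy; the volumes below are simulated from the reported means and standard deviations, not the actual measurements:

```python
# Two-sided Mann-Whitney U test comparing heart volumes at E18.5 vs. P0,
# on synthetic volumes drawn to match the reported summary statistics.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
vol_e185 = rng.normal(12.11, 3.22, 56)   # simulated E18.5 volumes (mm^3)
vol_p0 = rng.normal(16.41, 3.29, 83)     # simulated P0 volumes (mm^3)

stat, p = mannwhitneyu(vol_e185, vol_p0, alternative="two-sided")
increase = (vol_p0.mean() - vol_e185.mean()) / vol_e185.mean() * 100  # % growth
```

With group differences of this magnitude, the test rejects the null hypothesis decisively, in line with the reported p-value.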

Together, these three distinct validations demonstrate high-quality segmentation of mouse hearts from μCT scans regardless of developmental stages and cardiac anomalies.

A 3D CNN for CHD diagnosis

To classify hearts into normal or abnormal (CHD), we developed a dedicated 3D CNN model that uses the nnU-Net-based segmentation described above to crop out the heart from the fetus body and mask out non-heart voxels (Fig. 2a). Due to differences in heart volumes, the 3D images had different sizes (193 ± 25, 154 ± 21, 165 ± 20 voxels; averages ± standard deviations). We resized all images using linear interpolation to a common size of 64 × 64 × 32 voxels. To help compensate for the low number of training images, we took one out of every five slices along the z-axis to form a 3D image from each μCT scan, then repeated this process four times, each time shifting the starting slice by one, thus artificially increasing the number of training images five-fold (at the expense of a five-fold reduction of axial resolution) (Fig. 2b). For training, these images were re-interpolated linearly along the z-axis to obtain volumes with isotropic voxels. For testing, however, the trained model was applied to the entire set of slices contained in the cropped heart region, without subsampling.
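The slice-subsampling augmentation amounts to taking five interleaved sub-volumes along z; a minimal numpy sketch (the resizing and interpolation steps are omitted):

```python
# Split a (z, y, x) volume into `step` interleaved sub-volumes, one per
# z-offset, as in the five-fold augmentation described above.
import numpy as np

def subsample_axial(volume: np.ndarray, step: int = 5) -> list:
    """Each offset yields a distinct 3D image at 1/step axial resolution."""
    return [volume[offset::step] for offset in range(step)]

vol = np.arange(40 * 8 * 8).reshape(40, 8, 8)  # toy volume with 40 z-slices
subs = subsample_axial(vol)
# 5 sub-volumes of 8 slices each; together they cover every original slice.
```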

Our 3D CNN consisted of four 3D convolutional blocks, one fully connected layer, and an output layer with two neurons (Fig. 2c). The first three 3D convolutional blocks were each composed of a 3D convolutional layer followed by a 3D batch normalization layer. The last convolutional block additionally included a max pooling layer (pool size = 2) between the 3D convolutional and 3D batch normalization layers. The output of the last convolutional layer underwent a global average pooling, followed by a fully connected layer with 512 neurons, and finally an output layer with two neurons and a softmax function to calculate the probabilities of the heart being normal or having CHD (Fig. 2c).
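A sketch of this architecture in PyTorch (the paper does not specify the framework, channel counts, kernel sizes, or activation functions, so those are assumptions here):

```python
# Re-sketch of the described 3D CNN: four conv blocks (conv + batch norm,
# with a max-pool of size 2 inside the last block), global average pooling,
# a 512-neuron dense layer, and two softmax outputs.
import torch
import torch.nn as nn

class CHDClassifier(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):   # channel counts assumed
        super().__init__()
        blocks, in_ch = [], 1
        for i, out_ch in enumerate(channels):
            blocks.append(nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1))
            if i == len(channels) - 1:                 # last block: pool before BN
                blocks.append(nn.MaxPool3d(kernel_size=2))
            blocks.append(nn.BatchNorm3d(out_ch))
            blocks.append(nn.ReLU(inplace=True))       # activation assumed
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.gap = nn.AdaptiveAvgPool3d(1)             # global average pooling
        self.fc = nn.Linear(channels[-1], 512)
        self.out = nn.Linear(512, 2)                   # p(CHD), p(normal)

    def forward(self, x):                              # x: (batch, 1, 32, 64, 64)
        x = self.gap(self.features(x)).flatten(1)
        return torch.softmax(self.out(torch.relu(self.fc(x))), dim=1)

model = CHDClassifier().eval()
probs = model(torch.randn(1, 1, 32, 64, 64))           # two class probabilities
```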

For a training dataset of n = 111 scans, training took about 2 h on a GPU NVIDIA Geforce RTX 4080 (with 16 GB RAM) or a GPU NVIDIA A40 (with 46 GB RAM). Once trained, the model performed a diagnosis in a few seconds per sample.

Accurate diagnosis of malformed hearts

We next assessed the performance of our 3D CNN for detecting cardiac malformations (Fig. 4). Because of the low number of available μCT scans, especially for hearts with CHD (n = 38 out of a total of n = 139), we evaluated the model using five-fold cross-validation, whereby the dataset was split into five stratified folds, allowing us to train five models on four folds each and to test each model on the remaining fold (Table 2). We stratified the data so as to ensure that the ratios of hearts with or without CHD and from the E18.5 or P0 developmental stages were similar between all folds. Visual comparisons of the pixel intensity histograms within vs. across folds, or in CHD vs. normal hearts, did not reveal any obvious difference between these groups, and comparisons of the images using dimensionality reduction (t-SNE) did not show any correlation that might bias the evaluation (Supplementary Fig. 5a–c).
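Such stratification can be sketched with scikit-learn by stratifying on a joint stage-label class; the counts follow Fig. 1c, but the code is illustrative, not the study's pipeline:

```python
# Five stratified folds preserving both the CHD/normal ratio and the
# E18.5/P0 ratio, via a combined stage_label class.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 139 samples with the cohort's (stage, label) counts from Fig. 1c.
y = np.array(["E18.5_CHD"] * 15 + ["E18.5_normal"] * 41 +
             ["P0_CHD"] * 23 + ["P0_normal"] * 60)
X = np.zeros((len(y), 1))                       # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in skf.split(X, y)]
```

Each fold then contains close to one fifth of every stage/label combination, so the CHD proportion is nearly constant across folds.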

Fig. 4: Evaluation of deep learning-based CHD diagnosis for mouse fetal hearts.
figure 4

Diagnostic performance on the initial cohort of n = 139 mice. a Diagnostic performance for the detection of malformed (CHD) hearts as measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC) of five models/folds (each shown in a different color). The AUC ranges from 0.89 to 1.00, and the overall AUC (black line) is 0.97 (see Methods). The confusion matrix corresponds to the ensemble of five folds with AUC = 0.97. b Performance metrics for diagnostic performance in five-fold cross-validation. The CHD detection model achieved an average AUC, sensitivity, specificity, and balanced accuracy of 96 ± 4%, 93 ± 7%, 96 ± 4%, and 94 ± 5%, respectively, as assessed by five-fold cross-validation. c, d Boxplots compare the heart volumes computed from the automated segmentations for CHD vs normal cases (c) and for correct vs. incorrect classifications (d). Green dots correspond to normal hearts, red dots to hearts with CHD. e Contingency table counting the number of samples with correct or incorrect predictions according to the developmental stage. Source data is available in Supplementary Data 1.

Table 2 Data split for training and testing the heart anomaly diagnosis module

Figure 4a shows the sensitivity versus specificity tradeoff (ROC curve) for each of the five models/folds (colored curves) as well as the average ROC curve (black) (Methods). The AUC ranged from 0.89 to 1.00 for the five models, with an average of 0.96. At the chosen threshold (p = 0.5), the average sensitivity was 93%, the average specificity was 96%, and the balanced accuracy was 94% (Fig. 4b), indicating good classification performance.
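The ROC/AUC evaluation at the fixed 0.5 threshold can be sketched as follows, with synthetic predicted probabilities (not the model's actual outputs):

```python
# AUC plus threshold-based sensitivity, specificity, and balanced accuracy,
# computed on simulated p(CHD) scores for 38 CHD and 101 normal hearts.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = np.array([1] * 38 + [0] * 101)            # 1 = CHD, 0 = normal
p_chd = np.clip(np.where(y_true == 1,
                         rng.normal(0.85, 0.15, y_true.size),
                         rng.normal(0.15, 0.15, y_true.size)), 0, 1)

auc = roc_auc_score(y_true, p_chd)                 # threshold-free ranking metric
y_pred = (p_chd >= 0.5).astype(int)                # fixed threshold p = 0.5
sensitivity = (y_pred[y_true == 1] == 1).mean()
specificity = (y_pred[y_true == 0] == 0).mean()
balanced_accuracy = (sensitivity + specificity) / 2
```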

For comparison, we also trained and tested two other diagnosis models on the entire μCT scans without using the heart segmentation (Supplementary Fig. 6; Methods). The first model was trained using the same 3D CNN architecture, but using the entire scans as input rather than only the segmented heart regions. This model only achieved a sensitivity of 63%, a specificity of 90%, a balanced accuracy of 77%, and an AUC of 0.77 (Supplementary Fig. 6a). The second model was trained with attention-based deep multiple instance learning (AttMIL)28 on heart slices (Supplementary Fig. 6c). This AttMIL model achieved a sensitivity of 78%, a specificity of 95%, a balanced accuracy of 86.5%, and an AUC of 0.92 (Supplementary Fig. 6a). Thus, in both cases, classification performance was lower than with segmentation, likely because the models learned confounding factors located outside the heart rather than features intrinsic to the heart, which makes up only ~10% of the whole μCT scan volume (Supplementary Fig. 6b, d). This shows the importance of locating the heart and the benefit of performing heart segmentation for the diagnosis pipeline. Our model (MouseCHD) achieved classification performance similar to that of DenseNet121, a widely used medical classification architecture, when retrained on our own dataset29 (Supplementary Note 2 and Supplementary Fig. 7).

Among potential confounders of disease diagnosis are heart volumes and developmental stage. However, heart volumes did not differ significantly between scans with and without CHD (14.60 ± 3.95 mm3 vs. 14.70 ± 3.86 mm3, respectively; Mann-Whitney U-test p = 0.9, Fig. 4c) or between correctly and incorrectly classified scans (14.72 ± 3.85 mm3 vs. 13.85 ± 4.35 mm3, respectively; Mann-Whitney U-test p = 0.56; Fig. 4d). Likewise, the proportion of correct classifications did not significantly differ between E18.5 (52/56, 92.86%) and P0 (80/83, 96.39%) hearts (Fisher test p = 0.44) (Fig. 4e). Thus, our model appears to have learned to identify heart malformations irrespective of heart size or developmental stage.
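The stage-vs-correctness comparison can be checked directly from the Fig. 4e counts with a Fisher exact test:

```python
# Fisher exact test on correct/incorrect classifications per stage
# (52/56 correct at E18.5, 80/83 correct at P0).
from scipy.stats import fisher_exact

#          correct  incorrect
table = [[52, 4],   # E18.5
         [80, 3]]   # P0
odds_ratio, p = fisher_exact(table)   # two-sided by default
```

The resulting p-value is well above 0.05, consistent with the reported non-significant association (p = 0.44).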

The cross-validation analysis above demonstrated the robustness of our preprocessing and training strategy by evaluating five separate models trained on different subsets of the data. Each of these models achieved strong and consistent performance metrics. To create the final model, we used the same preprocessing and training pipeline employed during cross-validation, but instead of splitting the data into folds, we used the entire dataset of n = 139 scans for training. This ensures that the model benefits from the full diversity of the dataset, potentially further enhancing its generalization capability.

Robust diagnosis on prospective and divergent cohorts

To assess our model’s performance in real-world settings, we next tested it on two cohorts obtained exclusively after model training: a ‘prospective’ cohort and a ‘divergent’ cohort (Fig. 5). Supplementary Table 1 presents a comprehensive overview of the diverse mouse lines, genotypes, stages and cardioplegia treatments employed in the initial, prospective, and divergent cohorts, and Table 3 provides statistics on the two latter cohorts.

Fig. 5: Evaluation of CHD diagnosis on prospective and divergent cohorts.
figure 5

Diagnostic performance on the prospective cohort (n = 18 mice) and the divergent cohort (n = 80 mice). a Thoracic and abdominal μCT scan of a mouse fetus at 17.5 days of development (E17.5) from the Ccdc40 line, shown as 3D rendering (left) and stacks of 2D slices (right). Blue bounding box shows the set of 2D thoracic slices containing the heart. The red bounding box is framed around the heart. The corresponding 2D image stacks are shown with matching color frames. b Boxplots compare the heart volumes at different stages (E17.5: n = 40; E18.5: n = 107, and P0: n = 90). Heart volumes differ significantly between stages P0 and E18.5 (Mann-Whitney U-test p = \(2.1\times10^{-13}\)), but not between E17.5 (Ccdc40 line) and E18.5 (all other lines) (Mann-Whitney U-test p = 0.293), because they both correspond to one day before birth. c Diagnostic performance of the model trained on the initial cohort for the detection of malformed hearts (CHD), assessed by the receiver operating characteristic (ROC) curve. The area under the curve (AUC) for prospective cohort data (green) and divergent cohort data (blue) is 1.0 and 0.81, respectively, and the overall AUC (prospective and divergent cohort data, black line) is 0.87. The confusion matrix shows the counts of samples for each predicted and ground truth class, for the combined prospective and divergent cohort data. d, e Confusion matrices of the model trained on the initial cohort using test scans from the prospective cohort (d), and the divergent cohort (e). f Performance of the model on n = 40 divergent cohort scans before (left) and after (right) finetuning on n = 40 distinct scans from the divergent cohort and the n = 18 scans from the prospective cohort. Finetuning increased sensitivity from 71.4% to 78.6%, specificity from 80.8% to 88.5%, balanced accuracy from 76.1% to 83.5%, and AUC from 0.813 to 0.915. g Confusion matrix of the finetuned model, as tested on the remaining half of the divergent cohort (n = 40). 
Scale bars in (a), from left to right: 0.5 mm, 4 mm, 1.5 mm. Source data for (b–e) is available in Supplementary Data 1.

Table 3 Prospective and divergent cohorts

The prospective cohort consisted of 18 samples (12 CHD and 6 normal) acquired at the E18.5 stage. It comprised a subset of the mouse lines and genotypes of the initial cohort, used the same cardioplegia treatment, and introduced no novel conditions, allowing us to evaluate the model on new samples without retraining (Methods, Supplementary Table 1). The distribution of cardiac phenotypes in this dataset did not significantly differ from the initial cohort used for model training (Supplementary Fig. 8). Evaluation of our model on this dataset again showed excellent performance, with a balanced accuracy of 96%, a sensitivity of 92%, and perfect specificity and AUC of 100% (Fig. 5c, d). These findings validate our model’s adaptability to novel data in the absence of changes in mouse lines, developmental stages, genotypes, or cardioplegia treatment.

For a more stringent test, we next turned to the divergent cohort, consisting of 80 samples (29 CHD and 51 normal), which, unlike the prospective cohort, contained many differences relative to the training data. These included the presence of different mouse lines, and application of a very different cardioplegia treatment (Methods; Supplementary Table 1). Importantly, the distribution of phenotypes in this dataset was significantly different from the initial cohort (Fisher test: p = 0.01), with the inclusion of novel phenotypes (in particular situs inversus totalis) that were absent from the initial cohort and hence unseen by the model (Supplementary Fig. 8). Despite all these differences, when tested on the n = 80 scans of the divergent cohort, our model achieved fair performance, much better than random classification (balanced accuracy: 75%, sensitivity: 69%, specificity: 80%, AUC: 81%) (Fig. 5c, e).
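These metrics follow directly from confusion-matrix counts; the TP/FN/TN/FP split below is inferred from the reported cohort sizes (29 CHD, 51 normal) and the 69%/80% sensitivity/specificity, so it is a reconstruction rather than the published table:

```python
# Divergent-cohort metrics recomputed from reconstructed confusion-matrix counts.
tp, fn = 20, 9    # CHD hearts: detected / missed (tp + fn = 29)
tn, fp = 41, 10   # normal hearts: correctly cleared / false alarms (tn + fp = 51)

sensitivity = tp / (tp + fn)                         # fraction of CHD detected
specificity = tn / (tn + fp)                         # fraction of normals cleared
balanced_accuracy = (sensitivity + specificity) / 2  # average of the two
```

Rounding recovers the reported 69% sensitivity, 80% specificity, and 75% balanced accuracy.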

Nevertheless, the significant drop in performance compared to the initial and prospective cohorts (balanced accuracy of 75% vs. 94–96%) prompted us to look for patterns associated with incorrect classification. To this aim, we used gradient-weighted class activation mapping (Grad-CAM)30, a technique that highlights the image regions that most strongly contribute to the classification output. We found that in a subset of scans, these activation maps highlighted hyperdense regions partially or fully filling the heart chambers. These structures likely correspond to sedimented hemoglobin resembling the “hematocrit effect” found in human CT-scans31 and therefore constitute artifacts potentially causing misclassifications (Supplementary Fig. 9a–c). Indeed, the μCT scans where Grad-CAM highlighted hyperdense regions (which we manually labeled as ‘artifacts’ in a manner blinded to the predicted and ground truth diagnoses) were more prone to be misclassified (11 misclassified out of 52 (21.15%) vs. 16 out of 185 (8.65%); Fisher test p = 0.023), and were found more frequently in the divergent cohort (39%) than in the initial and prospective cohorts (14% and 11%, respectively) (Supplementary Fig. 9d). Removing scans labeled as ‘artifacts’ from the test data increased the balanced accuracy to 79% (Supplementary Fig. 9d), suggesting that artifacts induced by hyperdense structures explain some but not all of the lower performance on the divergent cohort.
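The core Grad-CAM computation (channel-wise gradient weights, weighted sum of activation maps, ReLU) can be sketched on synthetic arrays; a real application would take activations and gradients from the trained network's last convolutional layer:

```python
# Grad-CAM combination step: global-average-pool the gradients to get one
# weight per channel, form the weighted sum of activation maps, then ReLU.
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """activations, gradients: (channels, z, y, x) from the last conv layer."""
    weights = gradients.mean(axis=(1, 2, 3))          # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    return np.maximum(cam, 0)                         # keep only positive evidence

rng = np.random.default_rng(2)
acts = rng.random((8, 4, 8, 8))       # toy feature maps
grads = rng.normal(size=(8, 4, 8, 8)) # toy gradients of the CHD score
cam = grad_cam(acts, grads)           # 3D localization map, upsampled in practice
```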

Finetuning improves performance on the divergent cohort

The lower performance on the divergent cohort compared to the initial and the prospective cohorts, even when discarding obvious artifacts, calls for an improved model. Because of the many differences between the divergent and the initial cohort, one can expect improved performance on the divergent cohort after finetuning the model to account for these differences. In order to avoid any bias in the testing data, we divided the divergent cohort evenly into two halves of n = 40 scans each, ensuring similar distributions of genotypes, mouse lines, cardioplegia treatment, phenotypes, and predictions from the pretrained model (see Supplementary Table 2 for details). We then finetuned our model (pretrained on the initial cohort) on one half of the divergent cohort, combined with all n = 18 scans of the prospective cohort (totaling n = 58 scans), and tested it on the other half (n = 40 scans) of the divergent cohort. For comparison, we also evaluated the model pretrained on the initial cohort on the same n = 40 test scans of the divergent cohort.

Finetuning led to improved performance on all metrics: balanced accuracy increased from 76% to 84%, sensitivity from 71% to 79%, specificity from 81% to 88%, and the AUC jumped from 81% to 91% (Fig. 5f, g). This performance was also an improvement compared to a model trained from scratch on the same set of n = 58 scans, which only achieved sensitivity, specificity and AUC values of 78%, 69% and 78%, respectively, thus illustrating the benefit of pretraining (compare D2 to D1 in Supplementary Fig. 10). We acknowledge that these performance metrics still did not reach those reported for the initial cohort, when the model was trained on n = 111 scans of the same dataset (average sensitivity, specificity and AUC of 93%, 96% and 96%, respectively; Fig. 4b). However, these metrics were similar to or better than those obtained on the initial cohort when trained on only n = 58 scans, since in this case sensitivity, specificity and AUC equaled 72%, 79% and 77%, respectively (compare D1 to I2 in Supplementary Fig. 10). Thus, when finetuned on the divergent cohort, our model is at least as good at diagnosing CHD in the divergent cohort as a model trained from scratch on the same number of scans in the initial cohort at diagnosing CHD in the initial cohort.

These results demonstrate the versatility of our tool, as it can be easily retrained to bolster diagnostic performance for diverse datasets.

Easy-to-use and retrainable Napari plugin to visualize and analyse heart defects

In order to empower embryologists and others to reutilize and improve our pipeline without requiring coding skills, we built it into a user-friendly plugin for Napari, a widely used and open source visualization and image analysis platform22 (Fig. 6). Our MouseCHD plugin is available on both the Napari plugin repository and GitHub, along with sample data (https://github.com/hnguyentt/mousechd-napari). After importing μCT scans in NIfTI32, DICOM33, or NRRD34 format, users can choose between three different tasks: (i) heart segmentation, (ii) CHD diagnosis, and (iii) retraining of the diagnosis model (Fig. 6a, b). The output of the segmentation module is a 3D mask of the heart (Fig. 6c), while the output of the diagnosis module is a predicted probability for the heart to be normal or to have CHD. Thanks to Napari’s built-in features, users can easily visualize images in 2D or 3D and interact with them, e.g., by rotating or adjusting contrast and transparency (Fig. 6d). This facilitates a thorough inspection of heart morphology and segmentation results. Note that the heart segmentation module requires at least one graphics processing unit (GPU). For users who do not possess local GPUs, we provide instructions on how to run the plugin using remote high-performance computing clusters or cloud services. Detailed guidelines are available at: https://github.com/hnguyentt/mousechd-napari/blob/master/docs/server_setup.md.

Fig. 6: Napari plugin for heart segmentation and diagnosis of heart malformation.
figure 6

a Plugin screenshot. The Napari graphical user interface is divided into four panels: (1) layer control to adjust visualization parameters such as opacity, contrast, and color maps, for each layer; (2) list of all visualization layers, which can be individually displayed or hidden using the eye icon; (3) canvas for image visualization; (4) the “MouseCHD” plugin panel lets users choose between three tasks: heart segmentation, CHD diagnosis, or re-training of the diagnosis model. The diagnosis result is shown on the bottom right as two probabilities for CHD (red) and normal (green). The canvas (3) displays the input image, the heart segmentation, and the Grad-CAM activation map, as three distinct layers in 3D. Users can rotate the image in 3D. b Sample image, which can be loaded by choosing “File” → “Open Sample” → “mouse3dscan-napari: micro-CT scan (3D grayscale)”. c Heart segmentation result displayed in 3D in the canvas (3) when the user chooses the segmentation task. d 2D views: users can click on the viewer button (cube icon in the magenta box) and use the slider to select 2D image slices. e Re-training: the user provides the data folder containing CHD and normal images organized into two sub-folders: “CHD” and “Normal”. The plugin then automatically preprocesses the data, segments hearts, and re-trains the CHD diagnosis model. The plugin allows users to monitor the training process through dynamic visualization of key performance metrics – such as accuracy, recall and precision – plotted against training epochs using TensorBoard. Users can launch TensorBoard by clicking the “Run TensorBoard” button and following the provided link. Scale bars in (a–d): 1 mm.

In addition, we implemented the Grad-CAM visualization technique, which highlights the image regions used by the model for diagnosis (Fig. 6a, d). Preliminary inspection by an embryologist (A.D.) indicates that Grad-CAM highlights septal regions between ventricles and atria in 30 out of 35 true positive CHD cases (86%) (Supplementary Fig. 11). These regions align closely with the phenotyping annotations for septal defects, as illustrated in Supplementary Fig. 1b. Since atrioventricular septal defects are a common feature of all CHD cases in our initial dataset (Fig. 1b), the model appears to correctly identify these defects for CHD prediction. In addition, Grad-CAM visualizations can also guide further improvements of the classification model itself. For example, when we applied the AttMIL model to the heart slices, Grad-CAM tended to highlight regions inside the mouse lungs, instead of the heart (Supplementary Fig. 6d). This is relevant to our dataset because the heterotaxy syndrome includes lung defects (isomerism) with full penetrance in Nodal mutants5,35. Yet, lung defects cannot be taken as predictive of cardiac defects. Such observations led us to introduce the heart segmentation module, which proved essential to achieving satisfactory classification performance based on cardiac features only.

Most importantly, while MouseCHD provides direct access to our best-performing diagnosis model, it also enables users to retrain the model on their own data, potentially using categories of CHD not present in our dataset (Fig. 6e). This retraining capability comes with tools to monitor training progress and will make it easy for embryologists to generate tailored diagnosis models that are adapted to their own specific data, and that will likely outperform the model we provide on those data.

Thus, MouseCHD offers researchers and clinicians in cardiology several ready-to-use analysis and visualization tools for studying heart morphology and anomalies in mouse fetuses.

Discussion

We presented MouseCHD, a fully automated pipeline to diagnose heart malformations in mouse fetuses using 3D µCT-scans. Our pipeline includes deep learning-based segmentation of the heart and a 3D-CNN model to detect heart structural anomalies. We provide the pipeline as a user-friendly tool within Napari that will enable researchers to quickly and easily analyse 3D images and identify cardiac defects.

The automated heart segmentation achieved a Dice score of 96% when compared with manual segmentations by a human expert. While implemented to facilitate subsequent diagnosis, this segmentation tool is time-saving and already enables visual insights into heart morphology and quantitative measurements such as the total heart volume.
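For instance, total heart volume follows directly from the voxel count of the segmentation mask and the 20 μm isotropic resolution (a minimal numpy sketch, independent of the plugin's own implementation):

```python
import numpy as np

def heart_volume_mm3(mask: np.ndarray, voxel_size_um: float = 20.0) -> float:
    """Total heart volume from a binary 3D segmentation mask.

    Assumes isotropic voxels (20 um in this study); any non-zero voxel
    is counted as belonging to the heart.
    """
    voxel_mm3 = (voxel_size_um / 1000.0) ** 3  # one voxel in mm^3
    return float(np.count_nonzero(mask) * voxel_mm3)

# Toy mask with 100*100*100 = 1e6 foreground voxels
mask = np.zeros((100, 100, 200), dtype=np.uint8)
mask[:, :, :100] = 1
print(heart_volume_mm3(mask))  # 1e6 * (0.02 mm)^3 ≈ 8.0 mm^3
```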

Our diagnostic module detected CHD with a sensitivity, specificity, and AUC of 92%, 96%, and 97%, respectively, in the initial cohort, despite having been trained on a fairly limited number of µCT scans (~80 normal and ~30 CHD cases). The model performed even better (sensitivity: 92%, specificity: 100%, AUC: 100%) on a small prospective cohort acquired later and containing samples corresponding to a subset of the conditions in the initial cohort. The balanced accuracy (95% and 96% for the initial and prospective cohorts) was on par with that reported in related studies on human imaging data of the heart, despite the much larger size of human organs and the use of very different imaging techniques and ground truth data36,37,38,39. When tested on a divergent cohort containing differences in genotypes, phenotypes, and cardioplegia treatment, our model still performed much better than chance (sensitivity: 69%, specificity: 80%, AUC: 81%), indicating a reasonable degree of robustness of the learned features to these differences. Importantly, finetuning the model on a subset of the divergent cohort led to markedly improved performance (sensitivity: 79%, specificity: 88%, AUC: 91%), underscoring our framework’s adaptability to new data distributions and its utility for practical applications, especially in institutions that lack the resources to train models on large datasets from scratch. Finally, our method includes an implementation of Grad-CAM, a tool to visually highlight the image regions that underlie the diagnosis of CHD. Although more work is needed to validate this tool’s ability to directly highlight cardiac malformations, it may serve as a basis for visual explanations of CHD diagnosis by deep learning in the future.

In line with the growing importance of deep learning in medical image analysis and diagnosis of human cardiac conditions40, including CHD41, our approach uses CNNs that automatically learn complex and highly predictive features. This approach stands in contrast with previous studies that use handcrafted features12,14,15 for automated phenotyping of mouse fetuses from µCT data. The striking performance of deep learning-based 2D image classification relies strongly on the availability of large annotated image databases, such as ImageNet42, which includes more than 14 million 2D images scraped from the Internet. By contrast, databases of 3D medical images, such as The Cancer Imaging Archive (TCIA43), currently only contain thousands of images. As a consequence, there is still a dearth of easily accessible pretrained 3D classification models for medical imaging data in general and for cardiology in particular.

To address this gap, we developed a comprehensive open-access pipeline tailored specifically to CHD detection in mice. While this pipeline builds on established methods, including nnU-Net for segmentation25, 3D-CNNs for classification, and Grad-CAM30 activation maps, it integrates these components into a dedicated tool for CHD detection, along with 3D visualization and an interface that makes it easy to retrain the entire pipeline on the user’s own datasets. This is particularly important, since our study demonstrated that existing segmentation models struggled when used out of the box but performed well after retraining. Furthermore, while existing frameworks like MONAI provide powerful tools for medical imaging tasks, their implementation often requires coding expertise. By contrast, our user-friendly Napari plugin does not require any programming skills, thus making it accessible to a broad range of researchers and practitioners and enabling biologists to automate a personalised classification of their own cohorts. This could stimulate the growth of annotated 3D-cardiac image databases and thereby allow the training of models with further improved diagnostic accuracy.

A limitation of our study is its reliance on a single mouse disease (heterotaxy), developmental window, and imaging technique. While our analysis of a divergent cohort indicates transferability of the learned features, it also showed reduced performance in the absence of retraining. Therefore, an obvious follow-up is to increase the diversity of the training data, for example, using external cohorts such as IMPC44. A model pretrained on a larger, diverse dataset like IMPC can then still benefit from finetuning on local data (such as the µCT-scans used in this study) to further improve diagnostic accuracy. Our Napari plugin will facilitate such retraining. Beyond the supervised learning approaches with 3D-CNNs implemented here, follow-up work may also benefit from more recent methodological advances, such as vision transformers or self-supervised methods that use pretext tasks to extract relevant features from similar unlabelled data, e.g., publicly available 3D µCT-scan images from unrelated cohorts45. Additionally, more elaborate image pre-processing may improve the performance and robustness of the model, as suggested by previous studies that emphasized the importance of spatial registration12 and image normalization to address varying voxel intensities across different µCT imaging settings14. Finally, it will be important to extend this study towards distinguishing between different types of cardiac defects or to identify the association of cardiac defects with overall fetal growth and sex. The methods and open source software tools described here46,47, along with the new dataset that we make available48, are expected to stimulate such developments. Even without awaiting such improvements, we believe that our method’s ability to automate the analysis of cardiac morphology in mice will accelerate the identification of mechanisms of congenital heart defects.

Methods

Animal model

The different mouse lines used in this study are listed in Supplementary Table 1. E0.5 was defined as noon on the day of vaginal plug detection. Both male and female fetuses were collected and used randomly for experiments. Animals were housed in individually ventilated cages containing tube shelters and nesting material, maintained at 21 °C and 50% humidity, under a 12 h light/dark cycle, with food and water ad libitum, in the Laboratory of Animal Experimentation and Transgenesis of the Structure Fédérative de Recherches (SFR) Necker, Imagine Campus, Paris. We have complied with all relevant ethical regulations for animal use. Animal procedures were approved by the ethical committees of Institut Pasteur, Université Paris Cité, and the French Ministry of Research and conform to the guidelines from Directive 2010/63/EU of the European Parliament on the protection of animals used for scientific purposes.

Fetus and neonate collection

Pregnant females were euthanized by cervical dislocation. Fetuses or neonates were recovered at 17.5 or 18.5 days of development (E17.5, E18.5) or at birth (P0), as listed in Supplementary Data 1, and euthanized by decapitation. Blood filling was reduced by a 5 min incubation in warm Hanks’ Balanced Salt Solution (HBSS). We used two alternative cardioplegia solutions to arrest and relax the heart. We used the classical cardioplegia solution of cold 250 mM KCl for the initial and prospective cohorts. However, when this solution was applied whole mount (decapitated body), we noticed that the cardiac muscle was occasionally not fully relaxed. This does not impact the diagnosis of the cardiac structure or the heart’s outer volume, but leads to more variable ventricular wall thickness in controls. In the divergent cohort acquired later, we therefore used a more efficient cardioplegia solution of 110 mM NaCl, 16 mM KCl, 16 mM MgCl2, 1.5 mM CaCl2, and 10 mM NaHCO3 (ref. 49) (see also Supplementary Table 1). The body was fixed in 4% paraformaldehyde for 24 h at 4 °C.

Micro-Computed Tomography

The thoracic skin and the left arm were removed to improve penetration of the contrast agent and as a landmark for the left side of the fetus, respectively. Samples were stained in 100% Lugol over 72 h. Micro-computed tomography (μCT) images of the thorax and abdomen were acquired on a Micro-Computed Tomography Quantum FX (Perkin Elmer), within a field of exposure of 10 mm diameter. For each sample, the dataset comprised 512 images with an isotropic resolution of 20 × 20 × 20 μm. Figure panels displaying μCT data in 2D show sections through the imaged volume along the most appropriate plane, and thus do not necessarily correspond to optical slices.

Initial cohort and diagnosis

In our initial cohort, we used mouse fetal samples with heterotaxy, a disease that includes defects in all cardiac segments. The cohort combines Hoxb1Cre/+; Nodalflox/Nul conditional mutants, Rpgrip1l-/- mutants, which have a fully penetrant phenotype (implying that 100% of mutant hearts are abnormal)5,35,50, as well as littermate wild-type or heterozygous controls, and Ift20Nul/+; NodalNul/+ double heterozygotes with no cardiac phenotype. Samples were generated, genotyped, and imaged as described in refs. 5,35. Heart defects were diagnosed independently by two experts (an embryologist – A.D. or A.O.S., and a clinician – L. H. or S.B.) on μCT and high-resolution episcopic microscopy (HREM) images5, using the segmental approach51 and the International Paediatric and Congenital Cardiac Code (IPCCC) and International Classification of Diseases (ICD-11) Code. Phenotypic annotations were controlled and supervised by S.M.

Heart segmentation with nnU-Net and manual segmentation

In order to avoid technical confounders, we standardized raw μCT scans to the same axial view (Left – Posterior – Superior: LPS) and to isotropic voxels of size 20 × 20 × 20 μm. This preprocessing took about 25 min per scan using two CPUs with 200 GB RAM each. After this preprocessing, we trained a 3D nnU-Net25 segmentation model (3d_fullres) using manually segmented scans as ground truth data. To define this ground truth, the heart was manually outlined using Imaris (Bitplane) by an embryologist (V.B.) in each 2D slice containing the heart. This was done in a subset of μCT scans consisting of 23 scans with normal hearts (12 mice at E18.5 and 11 at P0) and 17 scans with CHD (10 mice at E18.5 and 7 at P0), totaling 40 scans and 6261 slices. From these 40 scans, 28 scans served for training the nnU-Net, while 12 scans were kept for testing, as detailed in Table 1.

Qualitative assessment of the segmentations was performed by an embryologist (A.D.) using Fiji, comparing the outlines of the segmentation with the µCT-scan. Imperfect heart segmentation corresponds either to over-segmentation (i.e., inclusion of non-cardiac regions), or under-segmentation (i.e., exclusion of heart regions).

Diagnosis model and training procedure

Our 3D CNN for CHD diagnosis uses 3D images of size 64 × 64 × 32 voxels as input, obtained after interpolation as described in the Results section ‘A 3D CNN for CHD diagnosis’. To diversify the training data, we applied Gaussian noise and random crops to these standardized 3D images on the fly during training.
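The on-the-fly augmentation step can be sketched in numpy as follows; the noise level and crop size used here are illustrative placeholders, not the values used for training:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(volume: np.ndarray, noise_std: float = 0.05,
            crop_shape=(56, 56, 28)) -> np.ndarray:
    """Additive Gaussian noise followed by a random crop, padded back to
    the original 64 x 64 x 32 model input size."""
    noisy = volume + rng.normal(0.0, noise_std, size=volume.shape)
    # pick a random crop corner within the valid range
    corner = [rng.integers(0, s - c + 1) for s, c in zip(volume.shape, crop_shape)]
    crop = noisy[corner[0]:corner[0] + crop_shape[0],
                 corner[1]:corner[1] + crop_shape[1],
                 corner[2]:corner[2] + crop_shape[2]]
    # zero-pad the crop back to the model input size
    pads = [(0, s - c) for s, c in zip(volume.shape, crop_shape)]
    return np.pad(crop, pads)

vol = rng.random((64, 64, 32))
out = augment(vol)
print(out.shape)  # (64, 64, 32)
```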

We used the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 to optimize the model over a maximum of 100 epochs. To address class imbalance, we used a class-weighted cross-entropy loss function

\(L=-\left[{w}_{1}\,y\log (p)+{w}_{0}\left(1-y\right)\log \left(1-p\right)\right]\) with weights \({w}_{0}=\left({n}_{0}+{n}_{1}\right)/2{n}_{0}\), \({w}_{1}=\left({n}_{0}+{n}_{1}\right)/2{n}_{1}\), where \({n}_{0}\) and \({n}_{1}\) are the numbers of normal and CHD samples, respectively, \(y\) is the ground truth label (0 or 1), and \(p\) the predicted probability. To avoid overfitting, we stopped training when the validation loss stopped improving for 8 consecutive epochs.
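The class-weighted loss above can be expressed in a few lines of numpy (a toy illustration of the formula, not the training code itself):

```python
import numpy as np

def weighted_bce(y: int, p: float, n0: int, n1: int) -> float:
    """Class-weighted binary cross-entropy with
    w0 = (n0+n1)/(2*n0) and w1 = (n0+n1)/(2*n1)."""
    w0 = (n0 + n1) / (2 * n0)
    w1 = (n0 + n1) / (2 * n1)
    return -(w1 * y * np.log(p) + w0 * (1 - y) * np.log(1 - p))

# With ~80 normal and ~30 CHD scans, the minority (CHD) class is upweighted:
n0, n1 = 80, 30
print(round((n0 + n1) / (2 * n1), 2))  # w1 ≈ 1.83
print(round((n0 + n1) / (2 * n0), 2))  # w0 ≈ 0.69
```

Note that when the classes are balanced (n0 = n1), both weights equal 1 and the loss reduces to the standard cross-entropy.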

To further evaluate the role of segmentation in our pipeline, we also conducted two classification experiments without the segmentation step as indicated in the Results section ‘Accurate diagnosis of malformed hearts’ (Supplementary Fig. 6a). In the first experiment, we used the same model architecture as described in this section, but fed the entire scan as input instead of the segmented heart. To ensure equivalent receptive fields and aspect ratios, the input size of the whole-scan model was adjusted from 512 × 512 × 512 to 190 × 190 × 95, a downscaling factor of ~2.7 (for the x and y dimensions) that approximately matches the downscaling factor of 2.5−3.5 used for the segmented-heart model. In the second experiment, we employed an attention-based deep multiple instance learning (AttMIL) framework28, which processes individual heart slices (axial view) of size 224 × 224 as input. We adapted AttMIL by utilizing ResNet-50 as the feature extractor, followed by an attention-pooling module that quantifies the relative importance of each heart slice. The detailed architecture of AttMIL is shown in Supplementary Fig. 6c.

Performance metrics for segmentation

To quantitatively evaluate heart segmentation accuracy, we used 12 μCT scans with manual segmentations as mentioned in the Methods section ‘Heart segmentation with nnU-Net and manual segmentation’ above and considered them as ground truth (Fig. 3a). We quantified the agreement between the automated and manual segmentation using the Dice coefficient, recall and precision, defined below, each of which would reach 100% for perfect segmentations:

$${\rm{Dice}}=\frac{2\,{\it{overlap}}}{{\it{prediction}}+{\it{groundtruth}}}$$
$${\rm{recall}}=\frac{{\it{overlap}}}{{\it{groundtruth}}}$$
$${\rm{precision}}=\frac{{\it{overlap}}}{{\it{prediction}}}$$

where \({\it{prediction}}\) is the number of voxels in the automatically segmented (predicted) 3D masks, \({\it{groundtruth}}\) is the number of voxels in the manually defined 3D segmentation masks (ground truth), and \({\it{overlap}}\) is the number of voxels in the intersection of the automated and manual segmentations.
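These three voxel-count metrics can be computed from a pair of binary masks as follows (a minimal numpy sketch):

```python
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Dice, recall and precision between two binary 3D masks,
    following the voxel-count definitions above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    overlap = np.count_nonzero(pred & gt)
    n_pred, n_gt = np.count_nonzero(pred), np.count_nonzero(gt)
    return {
        "dice": 2 * overlap / (n_pred + n_gt),
        "recall": overlap / n_gt,
        "precision": overlap / n_pred,
    }

# Toy example: the prediction over-segments the ground truth by 25%
gt = np.zeros((10, 10, 10), dtype=bool); gt[:4] = True      # 400 voxels
pred = np.zeros((10, 10, 10), dtype=bool); pred[:5] = True  # 500 voxels
m = seg_metrics(pred, gt)
print(m)  # dice ≈ 0.889, recall = 1.0, precision = 0.8
```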

Performance metrics for diagnosis

To evaluate the performance of the diagnosis models, we used four metrics: sensitivity, specificity, area under the curve (AUC), and balanced accuracy. Each metric provides complementary information about the model’s ability to distinguish between CHD and normal scans. Sensitivity (true positive rate) measures the model’s ability to correctly identify CHD cases and is defined as \({\rm{sensitivity}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}\), where \({\rm{TP}}\) are true positives (CHD scans classified as CHD) and \({\rm{FN}}\) are false negatives (CHD scans classified as normal). Specificity (true negative rate) quantifies the model’s ability to correctly identify normal cases, and is defined as \({\rm{specificity}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FP}}}\), where \({\rm{TN}}\) are true negatives (normal scans classified as normal) and \({\rm{FP}}\) are false positives (normal scans classified as CHD). AUC quantifies the model’s overall ability to differentiate between CHD and normal cases. It is derived from the receiver operating characteristic (ROC) curve, which plots sensitivity against 1 – specificity at various thresholds (applied to the predicted probability of the heart to have CHD). An AUC of 1 represents perfect classification, while 0.5 indicates random classification. Note that, except for computing the ROC curve and the AUC, we used a fixed threshold of p = 0.5, equivalent to predicting the class with the highest probability. The balanced accuracy is the average of the sensitivity and the specificity, and is particularly useful for imbalanced datasets. We also report confusion matrices, which count the number of samples in each ground truth class and each predicted class (for perfect classification, non-diagonal elements are zero).
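The threshold-based metrics above can be computed directly from labels and predicted probabilities (a self-contained sketch; the AUC additionally requires sweeping the threshold, e.g., with sklearn.metrics.roc_auc_score):

```python
import numpy as np

def diagnosis_metrics(y_true, p_chd, threshold=0.5):
    """Sensitivity, specificity, balanced accuracy and confusion matrix
    from ground-truth labels (1 = CHD, 0 = normal) and predicted CHD
    probabilities, using the fixed threshold p = 0.5 described above."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_chd) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # CHD classified as CHD
    fn = np.sum((y_pred == 0) & (y_true == 1))  # CHD classified as normal
    tn = np.sum((y_pred == 0) & (y_true == 0))  # normal classified as normal
    fp = np.sum((y_pred == 1) & (y_true == 0))  # normal classified as CHD
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": (sens + spec) / 2,
            "confusion": [[tn, fp], [fn, tp]]}

# Toy example with 3 CHD and 4 normal scans
y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.8, 0.3, 0.1, 0.2, 0.6, 0.4]
m = diagnosis_metrics(y, p)
print(m["sensitivity"], m["specificity"])  # 0.666..., 0.75
```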

Divergent cohort

The divergent cohort introduced entirely novel combinations of mouse lines, genotypes, cardioplegia treatments, and developmental stages relative to the initial cohort used to train the initial model. Specifically, we used mouse fetal samples from a new line, Ccdc40lnks/lnks mutants52, as well as littermate wild-type controls. Ccdc40lnks/lnks mutants develop heterotaxy with partial penetrance, as well as situs inversus totalis, while 70% of Ccdc40lnks/lnks mutants have no detectable heart phenotype. Thus, the divergent cohort contains a new phenotype of situs inversus totalis absent from the initial and prospective cohorts. The divergent cohort also contained Hoxb1Cre/+; Nodalflox/Nul conditional mutants as in the initial cohort, but these samples (like all others in the divergent cohort) were treated with a very different cardioplegia method, which is more efficient at relaxing the cardiac muscle within the endogenous thoracic cavity and thus provides more standardized ventricular wall thickness in controls (see Methods section ‘Fetus and neonate collection’ above and Supplementary Table 1). Note that the divergent cohort included samples at the E17.5 stage (Fig. 5a), which was not represented in the initial and prospective cohorts. However, E17.5 for the Ccdc40 mouse line is equivalent to E18.5 for the other lines, because this stage also corresponds to one day before birth: Ccdc40lnks/lnks mutants develop faster than Nodal mutants and are born one day earlier. Indeed, we did not detect any significant difference in heart volumes between these two groups (11.49 ± 1.95 mm3 for E17.5 Ccdc40lnks/lnks vs. 12.35 ± 3.42 mm3 at E18.5 for the remaining mouse lines; Mann-Whitney U test: n1 = 40, n2 = 107, p = 0.293; Fig. 5b).
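The volume comparison above can be reproduced in outline with scipy's Mann-Whitney U test. The values below are simulated from the reported group means and SDs, not the actual measurements:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Hypothetical heart volumes (mm^3) mimicking the two stage groups
e175 = rng.normal(11.49, 1.95, size=40)   # E17.5 Ccdc40lnks/lnks
e185 = rng.normal(12.35, 3.42, size=107)  # E18.5, remaining lines

stat, p = mannwhitneyu(e175, e185, alternative="two-sided")
print(round(p, 3))
```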

Activation map of classification model

We used Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight features in the μCT images that underlie classification as normal or CHD30. In the resulting activation maps, hot colors represent regions that are more important for predicting a specific class (CHD or normal) than regions with cold colors. In our implementation of Grad-CAM, we first computed the feature maps of the last convolutional layer (A), then calculated the gradients of the predicted class score with respect to these feature maps. Next, these gradients were global-average-pooled to determine the importance weights (w). Finally, the Grad-CAM map was computed as the linear combination of the feature maps (A) weighted by (w), followed by application of the ReLU function (i.e., setting all negative values to zero).
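These steps reduce to a few array operations. The following numpy sketch assumes the feature maps A and the gradients have already been extracted from the network (the array shapes are illustrative):

```python
import numpy as np

def grad_cam_3d(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM on 3D feature maps, following the steps described above.

    feature_maps: A, shape (C, D, H, W) -- last conv layer activations.
    gradients:    dy_c/dA, same shape -- gradients of the class score.
    Returns a (D, H, W) activation map (before upsampling to image size).
    """
    # importance weights w: global average pooling of the gradients per channel
    w = gradients.mean(axis=(1, 2, 3))                    # shape (C,)
    # weighted linear combination of the feature maps, then ReLU
    cam = np.tensordot(w, feature_maps, axes=([0], [0]))  # shape (D, H, W)
    return np.maximum(cam, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4, 4, 2))
dA = rng.standard_normal((8, 4, 4, 2))
cam = grad_cam_3d(A, dA)
print(cam.shape, cam.min() >= 0)  # (4, 4, 2) True
```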

Statistics and reproducibility

All sample numbers (n) indicated in the text refer to biological replicates, i.e., different embryos. We considered three cohorts, labelled ‘initial’ (n = 139 mice), ‘prospective’ (n = 18), and ‘divergent’ (n = 80). Each μCT scan corresponds to a distinct mouse embryo. Group allocation was based on PCR genotyping. Investigators were blinded during imaging and phenotype analysis. One sample was excluded from the UpSet plot only (Fig. 1b) because the tissue was too degraded to interpret the type of heart anomaly. Descriptive statistics of quantitative variables are given as mean ± standard deviation (SD) unless otherwise stated. We used Fisher’s exact test to test for dependence between labels (CHD or normal) and stages, and to test for the dependence of segmentation and classification results on labels or stages. Differences in quantitative variables between two groups were tested by Mann-Whitney U tests, without accounting for litters as a potential source of variability. Differences were considered statistically significant for p-values below 0.05 in two-sided hypothesis tests. Effect sizes were described by the ratio of means between groups, with a bootstrapped 95% confidence interval computed using the arch package.
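A percentile-bootstrap confidence interval for the ratio of means can be sketched in plain numpy (the study itself used the arch package; this is an illustrative equivalent):

```python
import numpy as np

def bootstrap_ratio_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for the
    ratio of means mean(a) / mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)  # resample group a
        rb = rng.choice(b, size=b.size, replace=True)  # resample group b
        ratios[i] = ra.mean() / rb.mean()
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy data whose true ratio of means is 11.5 / 5.75 = 2.0
a = np.array([10.0, 11.0, 12.0, 13.0])
b = np.array([5.0, 5.5, 6.0, 6.5])
lo, hi = bootstrap_ratio_ci(a, b)
print(lo < 2.0 < hi)  # the observed ratio lies inside the interval
```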

Statistical analyses were implemented with the scipy package (version 1.9.1) within Python 3.9.13.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.