Introduction

Congenital malformations are a leading cause of pregnancy arrest1 and perinatal lethality in developed countries2, and the most frequent malformation is that of the heart3. While genetics is thought to play a major role in cardiac developmental defects, the specific genetic causes are known only for a fifth of the patients4, and consist mainly of monogenic mutations associated with syndromes or cardiomyopathies. Polygenic inheritance or gene-environment interactions are believed to underlie a large part of the other cases, including structural heart defects. However, owing to the rarity of any given combination of mutations in human pathology, dissecting the specific contribution of the hundreds of genes involved in heart development is a daunting task.

The mouse is a valuable model to study heart morphogenesis owing to the relative proximity of heart structures and developmental patterns between humans and mice5. Genetically-modified mouse strains contribute to exploring genotype-phenotype relationships underlying congenital defects. Data collected from these strains have been used to build up extensive databases such as the Mouse Genome Informatics (MGI) database6 or the International Mouse Phenotyping Consortium (IMPC) database7.

Several imaging techniques enable the precise capture and identification of 3D cardiovascular structures. These include micro-computed tomography (µCT), magnetic resonance imaging (MRI), and optical coherence tomography. Thanks to technical improvements, mouse embryogenesis is now characterized at earlier stages and higher throughput8. However, extracting meaningful insights from the vast amounts of data generated by these techniques has become a new challenge for embryologists, as the examination of individual 3D images is a complex and time-consuming process. For example, manual segmentation of organs or tumors by clinicians requires on the order of 10 min per scan9,10,11; detailed segmentation of a mouse heart takes about 30 min, and phenotyping takes between 20 min and 1 h per heart and requires extensive anatomical training and knowledge of the medical nomenclature of heart defects.

Therefore, automating the analysis of 3D cardiac imaging data holds the potential to increase the throughput of anatomic characterizations. A reliable automated segmentation of the heart volume would allow the extraction of detailed quantitative measurements of the heart structure, and an automated screen of normal versus malformed hearts could speed up and potentially improve the identification of organ anomalies, while adequate visualization tools could highlight variations in morphological features characteristic of diseased samples.

Although methods for automated segmentation and/or classification of congenital heart defects (CHD) in mice have been reported before12,13,14,15,16,17,18,19, they have several limitations. First, segmentation and analysis algorithms, when publicly available, typically require coding skills to be used, and hence call for more user-friendly alternatives14,15. Secondly, previous work on analysing morphological phenotypes of developing mice in µCT data has leveraged non-linear registration of CT-scans towards a population-averaged image, and/or separate analyses of voxel intensities, tensor-based morphometry, and comparison of whole-organ volumes with wild-type mice12,14,15. The use of non-linear registration makes morphological analysis potentially susceptible to errors caused by anatomical structures unrelated to the heart. The reliance on handcrafted diagnostic features (i.e., features defined by a human expert) leaves open the possibility that more pertinent, less intuitive features can be extracted from the data using machine learning techniques, potentially enabling better diagnosis. Deep learning methods, including convolutional neural networks (CNNs)20,21, iteratively refine and optimize features for a given task. Therefore, image features learned by CNNs can emerge as new phenotypic descriptors that have not been explicitly defined. Additionally, by jointly analysing multiple features, neural networks may enable the diagnosis of subtle phenotypes involving several features whose departure from normality may be insufficient for reliable diagnosis when considered separately.

In this study, we describe MouseCHD, an approach based on 3D CNNs that can automatically segment heart volumes from a µCT-scan of the thorax and abdomen of mouse fetuses and can classify cardiac anatomy as normal or abnormal. We validate the segmentation and diagnosis tools quantitatively on perinatal mice, at embryonic day (E) 18.5 and at birth (P0). Moreover, we provide these methods as a plugin for the widely used Napari platform22, allowing users to apply our heart segmentation and classification tools without programming experience. Importantly, this plugin can also be used to fine-tune the classification model on the user’s own data. The proposed tools, along with our new dataset, should be instrumental for fast and objective screening of mouse samples for the study of heart morphogenesis and congenital cardiac defects.

Results

Overview

As a brief overview, we implemented, trained, and tested a computational pipeline (MouseCHD) for analysing μCT scans of mouse fetuses, consisting of two main modules: (i) a module for heart segmentation and (ii) a diagnosis module for identification of malformed hearts. The segmentation module extracts heart regions from thoracic and abdominal μCT scans, while the diagnosis (classification) module predicts the presence or absence of CHD based on segmented hearts. We first assess the segmentation and then the diagnostic performance of MouseCHD on an initial cohort of μCT scans of control and mutant mice. Next, we evaluate the model’s robustness on a ‘prospective’ cohort obtained later and a ‘divergent’ cohort that contains significant differences in phenotypes. We then assess the effect of fine-tuning the model on the divergent data. Finally, we present a user-friendly plugin that enables easy application of our segmentation and diagnosis modules, as well as their retraining on other datasets.

Initial cohort of control and mutant mice

We first assembled a cohort (hereafter called ‘initial cohort’) containing 139 μCT scans of 139 individual mice (Fig. 1, Methods). This cohort comprised 56 mice at E18.5 and 83 mice at birth (P0). Out of the 139 mice, 38 (27%) suffered from CHD (Fig. 1b). These CHDs belonged to five major non-exclusive categories: apex malposition (dextrocardia, mesocardia), atrial situs defects (right isomerism, left isomerism or situs inversus of atrial appendages), septal defects (complete atrioventricular septal defects), ventricle malposition (L-LOOP), and malposition of the great arteries (double outlet right ventricle, transposition of the great arteries), as shown in the UpSet plot of Fig. 1b (one scan with CHD was excluded from the UpSet plot only because the type of anomaly was difficult to interpret). Among the 37 remaining hearts, 36 displayed at least two malformations. Septal defects were present in all 38 abnormal hearts. See Supplementary Fig. 1 for examples of annotated phenotypes and Supplementary Data 1 for a comprehensive list of samples and their annotations.

Fig. 1: Micro-computed tomography (µCT) images of mouse fetuses and phenotyping of congenital heart defects.
figure 1

a μCT scans of the thorax and abdomen of mouse fetuses at 18.5 days of development (E18.5, left) and at birth (P0, right) displayed as 3D renderings (top) or stacks of 2D slices (bottom). Blue bounding boxes show the set of 2D thoracic slices containing the heart. Red bounding boxes are framed around the heart. The corresponding 2D image stacks are shown with matching color frames below. Yellow contours on each slice outline the automatically segmented heart. b UpSet plot53 showing the distribution of five types of heart malformations for n = 37 mice with heterotaxy. SD septal defects, MGA malposition of the great arteries, ASD atrial situs defects, AM apex malposition, VM ventricle malposition. Horizontal bars show the total number of cases for each type of malformation. Because many mice have multiple malformations, the total (n = 129) exceeds the number of samples (n = 37). Vertical bars show the number of cases that exhibit the combination of malformations shown in the corresponding matrix column below. For example, 5 cases have both SD and MGA but none of the other malformations. Out of the 37 cases, 36 cases have at least two anomalies, and 8 cases exhibit all 5 types of anomalies simultaneously. c Contingency table for diagnostic label (CHD or normal) and developmental stage (E18.5 or P0) for n = 139 fetuses. Scale bars in (a): 0.5 mm for 3D renderings (top), 4 mm (left) or 1.5 mm (right) for 2D slices (bottom). Source data is available in Supplementary Data 1.

Hearts at different developmental stages naturally differ in size and morphology23 (Fig. 1a); developmental stage is therefore a potential confounder for the classification of heart anomalies. To mitigate this risk, we analysed scans with similar proportions of CHD and normal hearts at both stages (CHD-to-normal ratio of 15/41 = 0.37 for E18.5 and 23/60 = 0.38 for P0) (Fig. 1c).

Heart segmentation module based on nnU-Net

We implemented a module to isolate the heart from the thoracic and abdominal scans after image preprocessing (Fig. 2a; Methods). Our 3D segmentation module is based on nnU-Net, a self-configuring neural network framework built on the U-Net24 architecture, which has demonstrated high performance and robustness for a large number of data types25. The nnU-Net automatically sets up policies for data preprocessing, network architectures, training hyperparameters, and postprocessing, based on data fingerprints. The framework automatically splits training scans into five folds and applies a cross-validation strategy to train five distinct models; the final segmentations are obtained by averaging the probabilities predicted by these five models. Since the heart is a contiguous entity, we further improved the segmentation by keeping only the largest connected component, thereby removing fragmented areas.
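The connected-component cleanup can be sketched as follows, using scipy.ndimage on a toy binary mask (the function name and example mask are illustrative, not the pipeline's actual code):

```python
# Keep only the largest connected component of a binary heart mask,
# mirroring the postprocessing step described above.
import numpy as np
from scipy import ndimage

def largest_connected_component(mask: np.ndarray) -> np.ndarray:
    """Return a binary mask keeping only the largest 3D connected component."""
    labeled, n = ndimage.label(mask)                     # label connected components
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, range(1, n + 1))  # voxel count per component
    largest = int(np.argmax(sizes)) + 1                  # component labels start at 1
    return (labeled == largest).astype(mask.dtype)

# Toy example: one large blob (the "heart") plus one isolated fragment.
mask = np.zeros((10, 10, 10), dtype=np.uint8)
mask[2:7, 2:7, 2:7] = 1   # main blob, 125 voxels
mask[8, 8, 8] = 1         # spurious fragment
clean = largest_connected_component(mask)
```

The same idea applies regardless of image size; only the largest blob survives, so disconnected over-segmentation artifacts are removed.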

Fig. 2: Deep learning-based framework to identify abnormal heart morphology.
figure 2

a A nnU-Net is used to segment hearts from μCT scans. The 3D renderings show the entire μCT scan on the left, and a bounding box around the segmented heart on the right (with 5 voxel padding along the three axes). This bounding box is used to define inputs for the classification module in (c). b From the stack of slices in this bounding box, we extract five distinct 3D images by sampling every fifth slice, as illustrated (each color corresponds to a distinct 3D image), leading to five times more 3D training images. c The diagnosis module is a 3D CNN that takes 3D images resized to 64 × 64 × 32 voxels as input and comprises four convolutional blocks, a 3D max-pooling layer, a global average pooling layer, a dense (fully connected, FC) layer with 512 neurons and an output layer with two neurons (encoding the predicted probabilities of hearts having CHD, p(CHD) or being normal, p(normal)). Each convolutional block comprises one 3D convolutional layer and a 3D batch normalization layer. Scale bars: 0.5 mm in (a), 1.5 mm in (b).

For training and evaluation, the nnU-Net requires a segmentation ground-truth, which was defined manually by an embryologist on 40 μCT scans by outlining 6261 slices in total; these 40 scans contained samples from both E18.5 and P0 stages, and contained both normal and CHD hearts (Methods). Table 1 details the categories and developmental stages for training and testing data in the initial cohort. This data split guaranteed that all stages and phenotypes were represented in both training and testing data.

Table 1 Data split for training and quantitatively testing automated heart segmentation (initial cohort)

For a training dataset of n = 22 scans, training took about 24 h on a GPU with 40 GB RAM (NVIDIA A100-SXM4-40GB) or 46 GB RAM (NVIDIA A40 46GB) and 2 CPUs with 500 GB RAM each. Once trained, the model performed segmentation within 2–4 min per sample on a 16 GB RAM GPU.

Robust heart segmentation from mouse thoracic and abdominal μCT scans

The segmentation of the heart serves two purposes: (i) it enables a quantitative analysis of heart volume and shape, and (ii) it facilitates the subsequent identification of malformed hearts by restricting the analysed images to regions containing the heart.

We evaluated the performance of segmentation in three ways (Fig. 3). First, we quantitatively compared automated heart segmentations to manual heart segmentations on 12 μCT scans (Methods) (Fig. 3a). The nnU-Net segmentation achieved a Dice coefficient of 96.5%, a recall of 96.1%, and a precision of 95.9%, indicating highly accurate segmentations (Fig. 3b), and comparing favorably to alternative methods17,18 (Supplementary Note 1 and Supplementary Figs. 2, 3).
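The reported metrics are standard voxel-wise overlap measures; a minimal numpy sketch (illustrative names, not the evaluation code used in the paper):

```python
# Dice coefficient, recall, and precision between a predicted and a
# manual (ground-truth) binary segmentation mask.
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # correctly predicted heart voxels
    fp = np.logical_and(pred, ~truth).sum()   # predicted but not in ground truth
    fn = np.logical_and(~pred, truth).sum()   # missed ground-truth voxels
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "recall": tp / (tp + fn),        # fraction of true heart voxels recovered
        "precision": tp / (tp + fp),     # fraction of predicted voxels that are correct
    }

# Toy masks: 9 true voxels, prediction misses 1 (fn = 1) and adds 2 (fp = 2).
truth = np.zeros((4, 4, 4), dtype=np.uint8)
truth[1:4, 1:4, 1] = 1
pred = truth.copy()
pred[1, 1, 1] = 0
pred[0, 0, 0] = 1
pred[0, 0, 1] = 1
m = segmentation_metrics(pred, truth)
```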

Fig. 3: Evaluation of deep learning-based heart segmentation from μCT scans.
figure 3

a Heart segmentation results (evaluated by an expert as perfect), shown on a single slice of a μCT scan: original image (no overlay), manual segmentation (blue overlay), automated segmentation by the nnU-Net (yellow overlay), and overlap between automated and manual segmentations (red overlay). b Segmentation performance metrics for the five models trained by nnU-Net on five data folds. Dice coefficient, recall, and precision are shown (see Methods). Ensemble corresponds to the pixel-wise average over the five folds. c Horizontal bars show the nnU-Net segmentation performance evaluated by an expert: an embryologist visually assessed heart segmentations from n = 111 mice (28 hearts used for training were excluded) and labelled 96 hearts (86%) as perfectly segmented. d, e Vertical bars show the number of perfectly (pale green) or imperfectly (dark grey) segmented hearts according to stage (d) or label (CHD or normal) (e) in the same data set of n = 111 mice. The proportion of perfectly segmented hearts was not significantly associated with developmental stage (Fisher test p = 1), and was marginally significantly associated with the presence of CHD (Fisher test p = 0.048). f Boxplots compare the heart volumes as measured from automated segmentations in n = 139 mice for the two developmental stages. Green dots correspond to normal hearts, red dots to hearts with CHD. Scale bar in (a): 4 mm. Source data is available in Supplementary Data 1.

Second, an embryologist qualitatively assessed the segmentations for the complete dataset of 139 μCT scans and labeled them as either perfectly segmented, over-segmented, or under-segmented (Methods). Ignoring the 28 scans used for training, and thus considering only the remaining 111 scans, we found that 86% of hearts (n = 96) were perfectly segmented, while 14% (n = 15) were under- or over-segmented (Fig. 3c). For these imperfect segmentations, the discrepancies between automated and manual segmentations remained minor (see Fig. 3a and Supplementary Fig. 4 for examples), and were mainly restricted to the great vessels connecting to the heart. The percentage of perfectly segmented scans did not depend on the developmental stage, since it was the same (86%) for both E18.5 and P0 (Fisher exact test p = 1.00) (Fig. 3d). Percentages of perfectly segmented scans for normal and CHD hearts were 90% and 74%, respectively, a difference of marginal significance (Fisher test p = 0.048) (Fig. 3e).

Third, we used our automated segmentation to calculate heart volumes in the entire dataset. Since the heart grows in size during gestation26,27, accurate segmentation should yield an increase in volume between the E18.5 stage and birth (P0). This was indeed the case, since segmented heart volumes at birth (16.41 ± 3.29 mm3) were significantly larger than at E18.5 (12.11 ± 3.22 mm3) (Mann-Whitney U test: n1 = 56, n2 = 83, p = \(1.31\times10^{-10}\)) (Fig. 3f), a volume increase of 35.5% (95% confidence interval: 26%–44.7%).
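The volume comparison can be reproduced in outline with scipy; the volumes below are simulated from the reported means and standard deviations, not the actual measurements:

```python
# Two-sided Mann-Whitney U test comparing heart volumes at E18.5 vs. P0,
# on synthetic volumes drawn to match the reported summary statistics.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
vol_e185 = rng.normal(12.11, 3.22, 56)   # simulated E18.5 volumes (mm^3)
vol_p0 = rng.normal(16.41, 3.29, 83)     # simulated P0 volumes (mm^3)

stat, p = mannwhitneyu(vol_e185, vol_p0, alternative="two-sided")
increase = (vol_p0.mean() - vol_e185.mean()) / vol_e185.mean() * 100  # % growth
```

With group differences of this magnitude, the test rejects the null hypothesis decisively, in line with the reported p-value.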

Together, these three distinct validations demonstrate high-quality segmentation of mouse hearts from μCT scans regardless of developmental stages and cardiac anomalies.

A 3D CNN for CHD diagnosis

To classify hearts into normal or abnormal (CHD), we developed a dedicated 3D CNN model that uses the nnU-Net-based segmentation described above to crop out the heart from the fetus body and mask out non-heart voxels (Fig. 2a). Due to differences in heart volumes, the 3D images had different sizes (193 ± 25, 154 ± 21, 165 ± 20 voxels; averages ± standard deviations). We resized all images using linear interpolation to a common size of 64 × 64 × 32 voxels. To help compensate for the low number of training images, we took one out of every five slices along the z-axis to form a 3D image from each μCT scan, then repeated this process four times, each time shifting the starting slice by one, thus artificially increasing the number of training images five-fold (at the expense of a five-fold reduction of axial resolution) (Fig. 2b). For training, these images were re-interpolated linearly along the z-axis to obtain volumes with isotropic voxels. For testing, however, the trained model was applied to the entire set of slices contained in the cropped heart region, without subsampling.
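The slice-subsampling augmentation amounts to taking five interleaved sub-volumes along z; a minimal numpy sketch (the resizing and interpolation steps are omitted):

```python
# Split a (z, y, x) volume into `step` interleaved sub-volumes, one per
# z-offset, as in the five-fold augmentation described above.
import numpy as np

def subsample_axial(volume: np.ndarray, step: int = 5) -> list:
    """Each offset yields a distinct 3D image at 1/step axial resolution."""
    return [volume[offset::step] for offset in range(step)]

vol = np.arange(40 * 8 * 8).reshape(40, 8, 8)  # toy volume with 40 z-slices
subs = subsample_axial(vol)
# 5 sub-volumes of 8 slices each; together they cover every original slice.
```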

Our 3D CNN consisted of four 3D convolutional blocks, one fully connected layer, and an output layer with two neurons (Fig. 2c). The first three 3D convolutional blocks were each composed of a 3D convolutional layer followed by a 3D batch normalization layer. The last convolutional block additionally included a max pooling layer (pool size = 2) between the 3D convolutional and 3D batch normalization layers. The output of the last convolutional layer underwent a global average pooling, followed by a fully connected layer with 512 neurons, and finally an output layer with two neurons and a softmax function to calculate the probabilities of the heart being normal or having CHD (Fig. 2c).
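A sketch of this architecture in PyTorch (the paper does not specify the framework, channel counts, kernel sizes, or activation functions, so those are assumptions here):

```python
# Re-sketch of the described 3D CNN: four conv blocks (conv + batch norm,
# with a max-pool of size 2 inside the last block), global average pooling,
# a 512-neuron dense layer, and two softmax outputs.
import torch
import torch.nn as nn

class CHDClassifier(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):   # channel counts assumed
        super().__init__()
        blocks, in_ch = [], 1
        for i, out_ch in enumerate(channels):
            blocks.append(nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1))
            if i == len(channels) - 1:                 # last block: pool before BN
                blocks.append(nn.MaxPool3d(kernel_size=2))
            blocks.append(nn.BatchNorm3d(out_ch))
            blocks.append(nn.ReLU(inplace=True))       # activation assumed
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.gap = nn.AdaptiveAvgPool3d(1)             # global average pooling
        self.fc = nn.Linear(channels[-1], 512)
        self.out = nn.Linear(512, 2)                   # p(CHD), p(normal)

    def forward(self, x):                              # x: (batch, 1, 32, 64, 64)
        x = self.gap(self.features(x)).flatten(1)
        return torch.softmax(self.out(torch.relu(self.fc(x))), dim=1)

model = CHDClassifier().eval()
probs = model(torch.randn(1, 1, 32, 64, 64))           # two class probabilities
```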

For a training dataset of n = 111 scans, training took about 2 h on a GPU NVIDIA Geforce RTX 4080 (with 16 GB RAM) or a GPU NVIDIA A40 (with 46 GB RAM). Once trained, the model performed a diagnosis in a few seconds per sample.

Accurate diagnosis of malformed hearts

We next assessed the performance of our 3D CNN for detecting cardiac malformations (Fig. 4). Because of the low number of available μCT scans, especially for hearts with CHD (n = 38 out of a total of n = 139), we evaluated the model using five-fold cross-validation, whereby the dataset was split into five stratified folds, allowing us to train five models on four folds each and to test each model on the remaining fold (Table 2). We stratified the data so as to ensure that the ratios of hearts with or without CHD and from the E18.5 or P0 developmental stages were similar between all folds. Visual comparisons of the pixel intensity histograms within vs. across folds, or in CHD vs. normal hearts, did not reveal any obvious difference between these groups, and comparisons of the images using dimensionality reduction (t-SNE) did not show any correlation that might bias the evaluation (Supplementary Fig. 5a–c).
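Such stratification can be sketched with scikit-learn by stratifying on a joint stage-label class; the counts follow Fig. 1c, but the code is illustrative, not the study's pipeline:

```python
# Five stratified folds preserving both the CHD/normal ratio and the
# E18.5/P0 ratio, via a combined stage_label class.
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 139 samples with the cohort's (stage, label) counts from Fig. 1c.
y = np.array(["E18.5_CHD"] * 15 + ["E18.5_normal"] * 41 +
             ["P0_CHD"] * 23 + ["P0_normal"] * 60)
X = np.zeros((len(y), 1))                       # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = [test_idx for _, test_idx in skf.split(X, y)]
```

Each fold then contains close to one fifth of every stage/label combination, so the CHD proportion is nearly constant across folds.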

Fig. 4: Evaluation of deep learning-based CHD diagnosis for mouse fetal hearts.
figure 4

Diagnostic performance on the initial cohort of n = 139 mice. a Diagnostic performance for the detection of malformed (CHD) hearts as measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC) of five models/folds (each shown in a different color). The AUC ranges from 0.89 to 1.00, and the overall AUC (black line) is 0.97 (see Methods). The confusion matrix corresponds to the ensemble of five folds with AUC = 0.97. b Performance metrics for diagnostic performance in five-fold cross-validation. The CHD detection model achieved an average AUC, sensitivity, specificity, and balanced accuracy of 96 ± 4%, 93 ± 7%, 96 ± 4%, and 94 ± 5%, respectively, as assessed by five-fold cross-validation. c, d Boxplots compare the heart volumes computed from the automated segmentations for CHD vs normal cases (c) and for correct vs. incorrect classifications (d). Green dots correspond to normal hearts, red dots to hearts with CHD. e Contingency table counting the number of samples with correct or incorrect predictions according to the developmental stage. Source data is available in Supplementary Data 1.

Table 2 Data split for training and testing the heart anomaly diagnosis module

Figure 4a shows the sensitivity versus specificity tradeoff (ROC curve) for each of the five models/folds (colored curves) as well as the average ROC curve (black) (Methods). The AUC ranged from 0.89 to 1.00 for the five models, with an average of 0.96. At the chosen threshold (p = 0.5), the average sensitivity was 93%, the average specificity was 96%, and the balanced accuracy was 94% (Fig. 4b), indicating good classification performance.
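The ROC/AUC evaluation at the fixed 0.5 threshold can be sketched as follows, with synthetic predicted probabilities (not the model's actual outputs):

```python
# AUC plus threshold-based sensitivity, specificity, and balanced accuracy,
# computed on simulated p(CHD) scores for 38 CHD and 101 normal hearts.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = np.array([1] * 38 + [0] * 101)            # 1 = CHD, 0 = normal
p_chd = np.clip(np.where(y_true == 1,
                         rng.normal(0.85, 0.15, y_true.size),
                         rng.normal(0.15, 0.15, y_true.size)), 0, 1)

auc = roc_auc_score(y_true, p_chd)                 # threshold-free ranking metric
y_pred = (p_chd >= 0.5).astype(int)                # fixed threshold p = 0.5
sensitivity = (y_pred[y_true == 1] == 1).mean()
specificity = (y_pred[y_true == 0] == 0).mean()
balanced_accuracy = (sensitivity + specificity) / 2
```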

For comparison, we also trained and tested two other diagnosis models on the entire μCT scans without using the heart segmentation (Supplementary Fig. 6; Methods). The first model was trained using the same 3D CNN architecture, but using the entire scans as input rather than only the segmented heart regions. This model only achieved a sensitivity of 63%, a specificity of 90%, a balanced accuracy of 77%, and an AUC of 0.77 (Supplementary Fig. 6a). The second model was trained with attention-based deep multiple instance learning (AttMIL)28 on heart slices (Supplementary Fig. 6c). This AttMIL model achieved a sensitivity of 78%, a specificity of 95%, a balanced accuracy of 86.5%, and an AUC of 0.92 (Supplementary Fig. 6a). Thus, in both cases, classification performance was lower than with segmentation, likely because the models learned confounding factors located outside the heart rather than features intrinsic to the heart, which makes up only ~10% of the whole μCT scan volume (Supplementary Fig. 6b, d). This shows the importance of locating the heart and the benefit of performing heart segmentation for the diagnosis pipeline. Our model (MouseCHD) achieved classification performance similar to that of DenseNet121, a widely used medical classification architecture, when retrained on our own dataset29 (Supplementary Note 2 and Supplementary Fig. 7).

Among potential confounders of disease diagnosis are heart volumes and developmental stage. However, heart volumes did not differ significantly between scans with and without CHD (14.60 ± 3.95 mm3 vs. 14.70 ± 3.86 mm3, respectively; Mann-Whitney U-test p = 0.9, Fig. 4c) or between correctly and incorrectly classified scans (14.72 ± 3.85 mm3 vs. 13.85 ± 4.35 mm3, respectively; Mann-Whitney U-test p = 0.56; Fig. 4d). Likewise, the proportion of correct classifications did not significantly differ between E18.5 (52/56, 92.86%) and P0 (80/83, 96.39%) hearts (Fisher test p = 0.44) (Fig. 4e). Thus, our model appears to have learned to identify heart malformations irrespective of heart size or developmental stage.
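The stage-vs-correctness comparison can be checked directly from the Fig. 4e counts with a Fisher exact test:

```python
# Fisher exact test on correct/incorrect classifications per stage
# (52/56 correct at E18.5, 80/83 correct at P0).
from scipy.stats import fisher_exact

#          correct  incorrect
table = [[52, 4],   # E18.5
         [80, 3]]   # P0
odds_ratio, p = fisher_exact(table)   # two-sided by default
```

The resulting p-value is well above 0.05, consistent with the reported non-significant association (p = 0.44).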

The cross-validation analysis above demonstrated the robustness of our preprocessing and training strategy by evaluating five separate models trained on different subsets of the data. Each of these models achieved strong and consistent performance metrics. To create the final model, we used the same preprocessing and training pipeline employed during cross-validation, but instead of splitting the data into folds, we used the entire dataset of n = 139 scans for training. This ensures that the model benefits from the full diversity of the dataset, potentially further enhancing its generalization capability.

Robust diagnosis on prospective and divergent cohorts

To assess our model’s performance in real-world settings, we next tested it on two cohorts obtained exclusively after model training: a ‘prospective’ cohort and a ‘divergent’ cohort (Fig. 5). Supplementary Table 1 presents a comprehensive overview of the diverse mouse lines, genotypes, stages and cardioplegia treatments employed in the initial, prospective, and divergent cohorts, and Table 3 provides statistics on the two latter cohorts.

Fig. 5: Evaluation of CHD diagnosis on prospective and divergent cohorts.
figure 5

Diagnostic performance on the prospective cohort (n = 18 mice) and the divergent cohort (n = 80 mice). a Thoracic and abdominal μCT scan of a mouse fetus at 17.5 days of development (E17.5) from the Ccdc40 line, shown as 3D rendering (left) and stacks of 2D slices (right). Blue bounding box shows the set of 2D thoracic slices containing the heart. The red bounding box is framed around the heart. The corresponding 2D image stacks are shown with matching color frames. b Boxplots compare the heart volumes at different stages (E17.5: n = 40; E18.5: n = 107, and P0: n = 90). Heart volumes differ significantly between stages P0 and E18.5 (Mann-Whitney U-test p = \(2.1\times10^{-13}\)), but not between E17.5 (Ccdc40 line) and E18.5 (all other lines) (Mann-Whitney U-test p = 0.293), because they both correspond to one day before birth. c Diagnostic performance of the model trained on the initial cohort for the detection of malformed hearts (CHD), assessed by the receiver operating characteristic (ROC) curve. The area under the curve (AUC) for prospective cohort data (green) and divergent cohort data (blue) is 1.0 and 0.81, respectively, and the overall AUC (prospective and divergent cohort data, black line) is 0.87. The confusion matrix shows the counts of samples for each predicted and ground truth class, for the combined prospective and divergent cohort data. d, e Confusion matrices of the model trained on the initial cohort using test scans from the prospective cohort (d), and the divergent cohort (e). f Performance of the model on n = 40 divergent cohort scans before (left) and after (right) finetuning on n = 40 distinct scans from the divergent cohort and the n = 18 scans from the prospective cohort. Finetuning increased sensitivity from 71.4% to 78.6%, specificity from 80.8% to 88.5%, balanced accuracy from 76.1% to 83.5%, and AUC from 0.813 to 0.915. g Confusion matrix of the finetuned model, as tested on the remaining half of the divergent cohort (n = 40). 
Scale bars in (a), from left to right: 0.5 mm, 4 mm, 1.5 mm. Source data for (b–e) is available in Supplementary Data 1.

Table 3 Prospective and divergent cohorts

The prospective cohort consisted of 18 samples (12 CHD and 6 normal) acquired at the E18.5 stage. It comprised a subset of the mouse lines and genotypes of the initial cohort, used the same cardioplegia treatment, and introduced no novel conditions, allowing us to evaluate the model on new samples without retraining (Methods, Supplementary Table 1). The distribution of cardiac phenotypes in this dataset did not significantly differ from the initial cohort used for model training (Supplementary Fig. 8). Evaluation of our model on this dataset again showed excellent performance, with a balanced accuracy of 96%, a sensitivity of 92%, and perfect specificity and AUC of 100% (Fig. 5c, d). These findings validate our model’s adaptability to novel data in the absence of changes in mouse lines, developmental stages, genotypes, or cardioplegia treatment.

For a more stringent test, we next turned to the divergent cohort, consisting of 80 samples (29 CHD and 51 normal), which, unlike the prospective cohort, contained many differences relative to the training data. These included the presence of different mouse lines, and application of a very different cardioplegia treatment (Methods; Supplementary Table 1). Importantly, the distribution of phenotypes in this dataset was significantly different from the initial cohort (Fisher test: p = 0.01), with the inclusion of novel phenotypes (in particular situs inversus totalis) that were absent from the initial cohort and hence unseen by the model (Supplementary Fig. 8). Despite all these differences, when tested on the n = 80 scans of the divergent cohort, our model achieved fair performance, much better than random classification (balanced accuracy: 75%, sensitivity: 69%, specificity: 80%, AUC: 81%) (Fig. 5c, e).
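These metrics follow directly from confusion-matrix counts; the TP/FN/TN/FP split below is inferred from the reported cohort sizes (29 CHD, 51 normal) and the 69%/80% sensitivity/specificity, so it is a reconstruction rather than the published table:

```python
# Divergent-cohort metrics recomputed from reconstructed confusion-matrix counts.
tp, fn = 20, 9    # CHD hearts: detected / missed (tp + fn = 29)
tn, fp = 41, 10   # normal hearts: correctly cleared / false alarms (tn + fp = 51)

sensitivity = tp / (tp + fn)                         # fraction of CHD detected
specificity = tn / (tn + fp)                         # fraction of normals cleared
balanced_accuracy = (sensitivity + specificity) / 2  # average of the two
```

Rounding recovers the reported 69% sensitivity, 80% specificity, and 75% balanced accuracy.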

Nevertheless, the significant drop in performance compared to the initial and prospective cohorts (balanced accuracy of 75% vs. 94–96%) prompted us to look for patterns associated with incorrect classification. To this aim, we used gradient-weighted class activation mapping (Grad-CAM)30, a technique that highlights the image regions that most strongly contribute to the classification output. We found that in a subset of scans, these activation maps highlighted hyperdense regions partially or fully filling the heart chambers. These structures likely correspond to sedimented hemoglobin resembling the “hematocrit effect” found in human CT-scans31 and therefore constitute artifacts potentially causing misclassifications (Supplementary Fig. 9a–c). Indeed, the μCT scans where Grad-CAM highlighted hyperdense regions (which we manually labeled as ‘artifacts’ in a manner blinded to the predicted and ground truth diagnoses) were more prone to be misclassified (11 misclassified out of 52 (21.15%) vs. 16 out of 185 (8.65%); Fisher test p = 0.023), and were found more frequently in the divergent cohort (39%) than in the initial and prospective cohorts (14% and 11%, respectively) (Supplementary Fig. 9d). Removing scans labeled as ‘artifacts’ from the test data increased the balanced accuracy to 79% (Supplementary Fig. 9d), suggesting that artifacts induced by hyperdense structures explain some but not all of the lower performance on the divergent cohort.
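The core Grad-CAM computation (channel-wise gradient weights, weighted sum of activation maps, ReLU) can be sketched on synthetic arrays; a real application would take activations and gradients from the trained network's last convolutional layer:

```python
# Grad-CAM combination step: global-average-pool the gradients to get one
# weight per channel, form the weighted sum of activation maps, then ReLU.
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """activations, gradients: (channels, z, y, x) from the last conv layer."""
    weights = gradients.mean(axis=(1, 2, 3))          # one importance weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    return np.maximum(cam, 0)                         # keep only positive evidence

rng = np.random.default_rng(2)
acts = rng.random((8, 4, 8, 8))       # toy feature maps
grads = rng.normal(size=(8, 4, 8, 8)) # toy gradients of the CHD score
cam = grad_cam(acts, grads)           # 3D localization map, upsampled in practice
```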

Finetuning improves performance on the divergent cohort

The lower performance on the divergent cohort compared to the initial and the prospective cohorts, even when discarding obvious artifacts, calls for an improved model. Because of the many differences between the divergent and the initial cohort, one can expect improved performance on the divergent cohort after finetuning the model to account for these differences. In order to avoid any bias in the testing data, we divided the divergent cohort evenly into two halves of n = 40 scans each, ensuring similar distributions of genotypes, mouse lines, cardioplegia treatment, phenotypes, and predictions from the pretrained model (see Supplementary Table 2 for details). We then finetuned our model (pretrained on the initial cohort) on one half of the divergent cohort, combined with all n = 18 scans of the prospective cohort (totaling n = 58 scans), and tested it on the other half (n = 40 scans) of the divergent cohort. For comparison, we also evaluated the model pretrained on the initial cohort on the same n = 40 test scans of the divergent cohort.

Finetuning led to improved performance on all metrics: balanced accuracy increased from 76% to 84%, sensitivity from 71% to 79%, specificity from 81% to 88%, and the AUC jumped from 81% to 91% (Fig. 5f, g). This performance was also an improvement compared to a model trained from scratch on the same set of n = 58 scans, which only achieved sensitivity, specificity and AUC values of 78%, 69% and 78%, respectively, thus illustrating the benefit of pretraining (compare D2 to D1 in Supplementary Fig. 10). We acknowledge that these performance metrics still did not reach those reported for the initial cohort, when the model was trained on n = 111 scans of the same dataset (average sensitivity, specificity and AUC of 93%, 96% and 96%, respectively; Fig. 4b). However, these metrics were similar to or better than those obtained on the initial cohort when trained on only n = 58 scans, since in this case sensitivity, specificity and AUC equaled 72%, 79% and 77%, respectively (compare D1 to I2 in Supplementary Fig. 10). Thus, when finetuned on the divergent cohort, our model is at least as good at diagnosing CHD in the divergent cohort as a model trained from scratch on the same number of scans in the initial cohort at diagnosing CHD in the initial cohort.

These results demonstrate the versatility of our tool, as it can be easily retrained to bolster diagnostic performance for diverse datasets.

Easy-to-use and retrainable Napari plugin to visualize and analyse heart defects

In order to empower embryologists and others to reutilize and improve our pipeline without requiring coding skills, we built it into a user-friendly plugin for Napari, a widely used and open source visualization and image analysis platform22 (Fig. 6). Our MouseCHD plugin is available on both the Napari plugin repository and GitHub, along with sample data (https://github.com/hnguyentt/mousechd-napari). After importing μCT scans in NIfTI32, DICOM33, or NRRD34 format, users can choose between three different tasks: (i) heart segmentation, (ii) CHD diagnosis, and (iii) retraining of the diagnosis model (Fig. 6a, b). The output of the segmentation module is a 3D mask of the heart (Fig. 6c), while the output of the diagnosis module is a predicted probability for the heart to be normal or to have CHD. Thanks to Napari’s built-in features, users can easily visualize images in 2D or 3D and interact with them, e.g., by rotating or adjusting contrast and transparency (Fig. 6d). This facilitates a thorough inspection of heart morphology and segmentation results. Note that the heart segmentation module requires at least one graphics processing unit (GPU). For users who do not possess local GPUs, we provide instructions on how to run the plugin using remote high-performance computing clusters or cloud services. Detailed guidelines are available at: https://github.com/hnguyentt/mousechd-napari/blob/master/docs/server_setup.md.

Fig. 6: Napari plugin for heart segmentation and diagnosis of heart malformation.
figure 6

a Plugin screenshot. The Napari graphical user interface is divided into four panels: (1) layer control to adjust visualization parameters such as opacity, contrast, and color maps, for each layer; (2) list of all visualization layers, which can be individually displayed or hidden using the eye icon; (3) canvas for image visualization; (4) the “MouseCHD” plugin panel lets users choose between three tasks: heart segmentation, CHD diagnosis, or re-training of the diagnosis model. The diagnosis result is shown on the bottom right as two probabilities for CHD (red) and normal (green). The canvas (3) displays the input image, the heart segmentation, and the Grad-CAM activation map, as three distinct layers in 3D. Users can rotate the image in 3D. b Sample image, which can be loaded by choosing “File” → “Open Sample” → “mouse3dscan-napari: micro-CT scan (3D grayscale)”. c Heart segmentation result displayed in 3D in the canvas (3) when the user chooses the segmentation task. d 2D views: users can click on the viewer button (cube icon in the magenta box) and use the slider to select 2D image slices. e Re-training: the user provides the data folder containing CHD and normal images organized into two sub-folders: “CHD” and “Normal”. The plugin then automatically preprocesses the data, segments hearts, and re-trains the CHD diagnosis model. The plugin allows users to monitor the training process through dynamic visualization of key performance metrics – such as accuracy, recall and precision – plotted against training epochs using TensorBoard. Users can launch TensorBoard by clicking the “Run TensorBoard” button and following the provided link. Scale bars in (a–d): 1 mm.

In addition, we implemented the Grad-CAM visualization technique, which highlights the image regions used by the model for diagnosis (Fig. 6a, d). Preliminary inspection by an embryologist (A.D.) indicates that Grad-CAM highlights septal regions between ventricles and atria in 30 out of 35 true positive CHD cases (86%) (Supplementary Fig. 11). These regions align closely with the phenotyping annotations for septal defects, as illustrated in Supplementary Fig. 1b. Since atrioventricular septal defects are a common feature of all CHD cases in our initial dataset (Fig. 1b), the model appears to correctly identify these defects for CHD prediction. In addition, Grad-CAM visualizations can also guide further improvements of the classification model itself. For example, when we applied the AttMIL model to the heart slices, Grad-CAM tended to highlight regions inside the mouse lungs, instead of the heart (Supplementary Fig. 6d). This is relevant to our dataset because the heterotaxy syndrome includes lung defects (isomerism) with full penetrance in Nodal mutants5,35. Yet, lung defects cannot be taken as predictive of cardiac defects. Such observations led us to introduce the heart segmentation module, which proved essential to achieving satisfactory classification performance based on cardiac features only.

Most importantly, while MouseCHD provides direct access to our best-performing diagnosis model, it also enables users to retrain the model on their own data, potentially using categories of CHD not present in our dataset (Fig. 6e). This retraining capability comes with tools to monitor training progress and will make it easy for embryologists to generate tailored diagnosis models that are adapted to their own specific data, and that will likely outperform the model we provide on those data.

Thus, MouseCHD offers researchers and clinicians in cardiology several ready-to-use analysis and visualization tools for studying heart morphology and anomalies in mouse fetuses.

Discussion

We presented MouseCHD, a fully automated pipeline to diagnose heart malformations in mouse fetuses using 3D µCT-scans. Our pipeline includes deep learning-based segmentation of the heart and a 3D-CNN model to detect heart structural anomalies. We provide the pipeline as a user-friendly tool within Napari that will enable researchers to quickly and easily analyse 3D images and identify cardiac defects.

The automated heart segmentation achieved a Dice score of 96% when compared with manual segmentations by a human expert. While implemented to facilitate subsequent diagnosis, this segmentation tool is time-saving and already enables visual insights into heart morphology and quantitative measurements such as the total heart volume.
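For instance, total heart volume follows directly from the voxel count of the segmentation mask and the 20 μm isotropic resolution (a minimal numpy sketch, independent of the plugin's own implementation):

```python
import numpy as np

def heart_volume_mm3(mask: np.ndarray, voxel_size_um: float = 20.0) -> float:
    """Total heart volume from a binary 3D segmentation mask.

    Assumes isotropic voxels (20 um in this study); any non-zero voxel
    is counted as belonging to the heart.
    """
    voxel_mm3 = (voxel_size_um / 1000.0) ** 3  # one voxel in mm^3
    return float(np.count_nonzero(mask) * voxel_mm3)

# Toy mask with 100*100*100 = 1e6 foreground voxels
mask = np.zeros((100, 100, 200), dtype=np.uint8)
mask[:, :, :100] = 1
print(heart_volume_mm3(mask))  # 1e6 * (0.02 mm)^3 ≈ 8.0 mm^3
```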

Our diagnostic module detected CHD with a sensitivity, specificity, and AUC of 92%, 96%, and 97%, respectively, in the initial cohort, despite having been trained on a fairly limited number of µCT scans (~80 normal and ~30 CHD cases). The model performed even better (sensitivity: 92%, specificity: 100%, AUC: 100%) on a small prospective cohort acquired later and containing samples corresponding to a subset of the conditions in the initial cohort. The balanced accuracy (95% and 96% for the initial and prospective cohorts) was on par with that reported in related studies on human imaging data of the heart, despite the much larger size of human organs and the use of very different imaging techniques and ground truth data36,37,38,39. When tested on a divergent cohort containing differences in genotypes, phenotypes, and cardioplegia treatment, our model still performed much better than chance (sensitivity: 69%, specificity: 80%, AUC: 81%), indicating a reasonable degree of robustness of the learned features to these differences. Importantly, finetuning the model on a subset of the divergent cohort led to markedly improved performance (sensitivity: 79%, specificity: 88%, AUC: 91%), underscoring our framework’s adaptability to new data distributions and its utility for practical applications, especially in institutions that lack the resources to train models on large datasets from scratch. Finally, our method includes an implementation of Grad-CAM, a tool to visually highlight the image regions that underlie the diagnosis of CHD. Although more work is needed to validate this tool’s ability to directly highlight cardiac malformations, it may serve as a basis for visual explanations of CHD diagnosis by deep learning in the future.

In line with the growing importance of deep learning in medical image analysis and diagnosis of human cardiac conditions40, including CHD41, our approach uses CNNs that automatically learn complex and highly predictive features. This approach stands in contrast with previous studies that use handcrafted features12,14,15 for automated phenotyping of mouse fetuses from µCT data. The striking performance of deep learning-based 2D image classification relies strongly on the availability of large annotated image databases, such as ImageNet42, which includes more than 14 million 2D images scraped from the Internet. By contrast, databases of 3D medical images, such as The Cancer Imaging Archive (TCIA43), currently only contain thousands of images. As a consequence, there is still a dearth of easily accessible pretrained 3D classification models for medical imaging data in general and for cardiology in particular.

To address this gap, we developed a comprehensive open-access pipeline tailored specifically to CHD detection in mice. While this pipeline builds on established methods, including nnU-Net for segmentation25, 3D-CNNs for classification, and Grad-CAM30 activation maps, it integrates these components into a dedicated tool for CHD detection, along with 3D visualization and an interface that makes it easy to retrain the entire pipeline on the user’s own datasets. This is particularly important, since our study demonstrated that existing segmentation models struggled when used out of the box but performed well after retraining. Furthermore, while existing frameworks like MONAI provide powerful tools for medical imaging tasks, their implementation often requires coding expertise. By contrast, our user-friendly Napari plugin does not require any programming skills, thus making it accessible to a broad range of researchers and practitioners and enabling biologists to automate a personalised classification of their own cohorts. This could stimulate the growth of annotated 3D-cardiac image databases and thereby allow the training of models with further improved diagnostic accuracy.

A limitation of our study is its reliance on a single mouse disease (heterotaxy), developmental window, and imaging technique. While our analysis of a divergent cohort indicates transferability of the learned features, it also showed reduced performance in the absence of retraining. Therefore, an obvious follow-up is to increase the diversity of the training data, for example, using external cohorts such as IMPC44. A model pretrained on a larger, diverse dataset like IMPC can then still benefit from finetuning on local data (such as the µCT-scans used in this study) to further improve diagnostic accuracy. Our Napari plugin will facilitate such retraining. Beyond the supervised learning approaches with 3D-CNNs implemented here, follow-up work may also benefit from more recent methodological advances, such as vision transformers or self-supervised methods that use pretext tasks to extract relevant features from similar unlabelled data, e.g., publicly available 3D µCT-scan images from unrelated cohorts45. Additionally, more elaborate image pre-processing may improve the performance and robustness of the model, as suggested by previous studies that emphasized the importance of spatial registration12 and image normalization to address varying voxel intensities across different µCT imaging settings14. Finally, it will be important to extend this study towards distinguishing between different types of cardiac defects or to identify the association of cardiac defects with overall fetal growth and sex. The methods and open source software tools described here46,47, along with the new dataset that we make available48, are expected to stimulate such developments. Even without awaiting such improvements, we believe that our method’s ability to automate the analysis of cardiac morphology in mice will accelerate the identification of mechanisms of congenital heart defects.

Methods

Animal model

The different mouse lines used in this study are listed in Supplementary Table 1. E0.5 was defined as noon on the day of vaginal plug detection. Both male and female fetuses were collected and used randomly for experiments. Animals were housed in individually ventilated cages containing tube shelters and nesting material, maintained at 21 °C and 50% humidity, under a 12 h light/dark cycle, with food and water ad libitum, in the Laboratory of Animal Experimentation and Transgenesis of the Structure Fédérative de Recherches (SFR) Necker, Imagine Campus, Paris. We have complied with all relevant ethical regulations for animal use. Animal procedures were approved by the ethical committees of Institut Pasteur, Université Paris Cité, and the French Ministry of Research and conform to the guidelines from Directive 2010/63/EU of the European Parliament on the protection of animals used for scientific purposes.

Fetus and neonate collection

Pregnant females were euthanized by cervical dislocation. Fetuses or neonates were recovered at 17.5 or 18.5 days of development (E17.5, E18.5) or at birth (P0), as listed in Supplementary Data 1, and euthanized by decapitation. Blood filling was reduced by a 5 min incubation in warm Hanks’ Balanced Salt Solution (HBSS). We used two alternative cardioplegia solutions to arrest and relax the heart. We used the classical cardioplegia solution of cold 250 mM KCl for the initial and prospective cohorts. However, when this solution was applied whole mount (decapitated body), we noticed that the cardiac muscle was occasionally not fully relaxed. This does not impact the diagnosis of the cardiac structure or the heart’s outer volume, but leads to more variable ventricular wall thickness in controls. In the divergent cohort acquired later, we therefore used a more efficient cardioplegia solution of 110 mM NaCl, 16 mM KCl, 16 mM MgCl2, 1.5 mM CaCl2, and 10 mM NaHCO3 (ref. 49) (see also Supplementary Table 1). The body was fixed in 4% paraformaldehyde for 24 h at 4 °C.

Micro-Computed Tomography

The thoracic skin and the left arm were removed to improve penetration of the contrast agent and as a landmark for the left side of the fetus, respectively. Samples were stained in 100% Lugol over 72 h. Micro-computed tomography (μCT) images of the thorax and abdomen were acquired on a Micro-Computed Tomography Quantum FX (Perkin Elmer), within a field of exposure of 10 mm diameter. For each sample, the dataset comprised 512 images with an isotropic resolution of 20 × 20 × 20 μm. Figure panels displaying μCT data in 2D show sections through the imaged volume along the most appropriate plane, and thus do not necessarily correspond to optical slices.

Initial cohort and diagnosis

In our initial cohort, we used mouse fetal samples with heterotaxy, a disease that includes defects in all cardiac segments. The cohort combines Hoxb1Cre/+; Nodalflox/Nul conditional mutants, Rpgrip1l-/- mutants, which have a fully penetrant phenotype (implying that 100% of mutant hearts are abnormal)5,35,50, as well as littermate wild-type or heterozygous controls, and Ift20Nul/+; NodalNul/+ double heterozygotes with no cardiac phenotype. Samples were generated, genotyped, and imaged as described in refs. 5,35. Heart defects were diagnosed independently by two experts (an embryologist – A.D. or A.O.S., and a clinician – L. H. or S.B.) on μCT and high-resolution episcopic microscopy (HREM) images5, using the segmental approach51 and the International Paediatric and Congenital Cardiac Code (IPCCC) and International Classification of Diseases (ICD-11) Code. Phenotypic annotations were controlled and supervised by S.M.

Heart segmentation with nnU-Net and manual segmentation

In order to avoid technical confounders, we standardized raw μCT scans to the same axial view (Left – Posterior – Superior: LPS) and to isotropic voxels of size 20 × 20 × 20 μm. This preprocessing took about 25 min per scan using two CPUs with 200 GB RAM each. After this preprocessing, we trained a 3D nnU-Net25 segmentation model (3d_fullres) using manually segmented scans as ground truth data. To define this ground truth, the heart was manually outlined using Imaris (Bitplane) by an embryologist (V.B.) in each 2D slice containing the heart. This was done in a subset of μCT scans consisting of 23 scans with normal hearts (12 mice at E18.5 and 11 at P0) and 17 scans with CHD (10 mice at E18.5 and 7 at P0), totaling 40 scans and 6261 slices. From these 40 scans, 28 scans served for training the nnU-Net, while 12 scans were kept for testing, as detailed in Table 1.

Qualitative assessment of the segmentations was performed by an embryologist (A.D.) using Fiji, comparing the outlines of the segmentation with the µCT-scan. Imperfect heart segmentation corresponds either to over-segmentation (i.e., inclusion of non-cardiac regions), or under-segmentation (i.e., exclusion of heart regions).

Diagnosis model and training procedure

Our 3D CNN for CHD diagnosis uses 3D images of size 64 × 64 × 32 voxels as input, obtained after interpolation as described in the Results section ‘A 3D CNN for CHD diagnosis’. To diversify the training data, we applied Gaussian noise and random crops to these standardized 3D images on the fly during training.
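The on-the-fly augmentation step can be sketched in numpy as follows; the noise level and crop size used here are illustrative placeholders, not the values used for training:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(volume: np.ndarray, noise_std: float = 0.05,
            crop_shape=(56, 56, 28)) -> np.ndarray:
    """Additive Gaussian noise followed by a random crop, padded back to
    the original 64 x 64 x 32 model input size."""
    noisy = volume + rng.normal(0.0, noise_std, size=volume.shape)
    # pick a random crop corner within the valid range
    corner = [rng.integers(0, s - c + 1) for s, c in zip(volume.shape, crop_shape)]
    crop = noisy[corner[0]:corner[0] + crop_shape[0],
                 corner[1]:corner[1] + crop_shape[1],
                 corner[2]:corner[2] + crop_shape[2]]
    # zero-pad the crop back to the model input size
    pads = [(0, s - c) for s, c in zip(volume.shape, crop_shape)]
    return np.pad(crop, pads)

vol = rng.random((64, 64, 32))
out = augment(vol)
print(out.shape)  # (64, 64, 32)
```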

We used the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01 to optimize the model over a maximum of 100 epochs. To address class imbalance, we used a class-weighted cross-entropy loss function

\(L=-\left[{w}_{1}\,y\log (p)+{w}_{0}\left(1-y\right)\log \left(1-p\right)\right]\) with weights \({w}_{0}=\left({n}_{0}+{n}_{1}\right)/2{n}_{0}\), \({w}_{1}=\left({n}_{0}+{n}_{1}\right)/2{n}_{1}\), where \({n}_{0}\) and \({n}_{1}\) are the numbers of normal and CHD samples, respectively, \(y\) is the ground truth label (0 or 1), and \(p\) the predicted probability. To avoid overfitting, we stopped training when the validation loss stopped improving for 8 consecutive epochs.
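The class-weighted loss above can be expressed in a few lines of numpy (a toy illustration of the formula, not the training code itself):

```python
import numpy as np

def weighted_bce(y: int, p: float, n0: int, n1: int) -> float:
    """Class-weighted binary cross-entropy with
    w0 = (n0+n1)/(2*n0) and w1 = (n0+n1)/(2*n1)."""
    w0 = (n0 + n1) / (2 * n0)
    w1 = (n0 + n1) / (2 * n1)
    return -(w1 * y * np.log(p) + w0 * (1 - y) * np.log(1 - p))

# With ~80 normal and ~30 CHD scans, the minority (CHD) class is upweighted:
n0, n1 = 80, 30
print(round((n0 + n1) / (2 * n1), 2))  # w1 ≈ 1.83
print(round((n0 + n1) / (2 * n0), 2))  # w0 ≈ 0.69
```

Note that when the classes are balanced (n0 = n1), both weights equal 1 and the loss reduces to the standard cross-entropy.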

To further evaluate the role of segmentation in our pipeline, we also conducted two classification experiments without the segmentation step as indicated in the Results section ‘Accurate diagnosis of malformed hearts’ (Supplementary Fig. 6a). In the first experiment, we used the same model architecture as described in this section, but fed the entire scan as input instead of the segmented heart. To ensure equivalent receptive fields and aspect ratios, the input size of the whole-scan model was adjusted from 512 × 512 × 512 to 190 × 190 × 95, a downscaling factor of ~2.7 (for the x and y dimensions) that approximately matches the downscaling factor of 2.5−3.5 used for the segmented-heart model. In the second experiment, we employed an attention-based deep multiple instance learning (AttMIL) framework28, which processes individual heart slices (axial view) of size 224 × 224 as input. We adapted AttMIL by utilizing ResNet-50 as the feature extractor, followed by an attention-pooling module that quantifies the relative importance of each heart slice. The detailed architecture of AttMIL is shown in Supplementary Fig. 6c.

Performance metrics for segmentation

To quantitatively evaluate heart segmentation accuracy, we used 12 μCT scans with manual segmentations as mentioned in the Methods section ‘Heart segmentation with nnU-Net and manual segmentation’ above and considered them as ground truth (Fig. 3a). We quantified the agreement between the automated and manual segmentation using the Dice coefficient, recall and precision, defined below, each of which would reach 100% for perfect segmentations:

$${\rm{Dice}}=\frac{2\,{\it{overlap}}}{{\it{prediction}}+{\it{groundtruth}}}$$
$${\rm{recall}}=\frac{{\it{overlap}}}{{\it{groundtruth}}}$$
$${\rm{precision}}=\frac{{\it{overlap}}}{{\it{prediction}}}$$

where \({\it{prediction}}\) is the number of voxels in the automatically segmented (predicted) 3D masks, \({\it{groundtruth}}\) is the number of voxels in the manually defined 3D segmentation masks (ground truth), and \({\it{overlap}}\) is the number of voxels in the intersection of the automated and manual segmentations.
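These three voxel-count metrics can be computed from a pair of binary masks as follows (a minimal numpy sketch):

```python
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Dice, recall and precision between two binary 3D masks,
    following the voxel-count definitions above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    overlap = np.count_nonzero(pred & gt)
    n_pred, n_gt = np.count_nonzero(pred), np.count_nonzero(gt)
    return {
        "dice": 2 * overlap / (n_pred + n_gt),
        "recall": overlap / n_gt,
        "precision": overlap / n_pred,
    }

# Toy example: the prediction over-segments the ground truth by 25%
gt = np.zeros((10, 10, 10), dtype=bool); gt[:4] = True      # 400 voxels
pred = np.zeros((10, 10, 10), dtype=bool); pred[:5] = True  # 500 voxels
m = seg_metrics(pred, gt)
print(m)  # dice ≈ 0.889, recall = 1.0, precision = 0.8
```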

Performance metrics for diagnosis

To evaluate the performance of the diagnosis models, we used four metrics: sensitivity, specificity, area under the curve (AUC), and balanced accuracy. Each metric provides complementary information about the model’s ability to distinguish between CHD and normal scans. Sensitivity (true positive rate) measures the model’s ability to correctly identify CHD cases and is defined as \({\rm{sensitivity}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FN}}}\), where \({\rm{TP}}\) are true positives (CHD scans classified as CHD) and \({\rm{FN}}\) are false negatives (CHD scans classified as normal). Specificity (true negative rate) quantifies the model’s ability to correctly identify normal cases, and is defined as \({\rm{specificity}}=\frac{{\rm{TN}}}{{\rm{TN}}+{\rm{FP}}}\), where \({\rm{TN}}\) are true negatives (normal scans classified as normal) and \({\rm{FP}}\) are false positives (normal scans classified as CHD). AUC quantifies the model’s overall ability to differentiate between CHD and normal cases. It is derived from the receiver operating characteristic (ROC) curve, which plots sensitivity against 1 – specificity at various thresholds (applied to the predicted probability of the heart to have CHD). An AUC of 1 represents perfect classification, while 0.5 indicates random classification. Note that, except for computing the ROC curve and the AUC, we used a fixed threshold of p = 0.5, equivalent to predicting the class with the highest probability. The balanced accuracy is the average of the sensitivity and the specificity, and is particularly useful for imbalanced datasets. We also report confusion matrices, which count the number of samples in each ground truth class and each predicted class (for perfect classification, non-diagonal elements are zero).
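The threshold-based metrics above can be computed directly from labels and predicted probabilities (a self-contained sketch; the AUC additionally requires sweeping the threshold, e.g., with sklearn.metrics.roc_auc_score):

```python
import numpy as np

def diagnosis_metrics(y_true, p_chd, threshold=0.5):
    """Sensitivity, specificity, balanced accuracy and confusion matrix
    from ground-truth labels (1 = CHD, 0 = normal) and predicted CHD
    probabilities, using the fixed threshold p = 0.5 described above."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(p_chd) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # CHD classified as CHD
    fn = np.sum((y_pred == 0) & (y_true == 1))  # CHD classified as normal
    tn = np.sum((y_pred == 0) & (y_true == 0))  # normal classified as normal
    fp = np.sum((y_pred == 1) & (y_true == 0))  # normal classified as CHD
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {"sensitivity": sens, "specificity": spec,
            "balanced_accuracy": (sens + spec) / 2,
            "confusion": [[tn, fp], [fn, tp]]}

# Toy example with 3 CHD and 4 normal scans
y = [1, 1, 1, 0, 0, 0, 0]
p = [0.9, 0.8, 0.3, 0.1, 0.2, 0.6, 0.4]
m = diagnosis_metrics(y, p)
print(m["sensitivity"], m["specificity"])  # 0.666..., 0.75
```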

Divergent cohort

The divergent cohort introduced entirely novel combinations of mouse lines, genotypes, cardioplegia treatments, and developmental stages relative to the initial cohort used to train the initial model. Specifically, we used mouse fetal samples from a new line, Ccdc40lnks/lnks mutants52, as well as littermate wild-type controls. Ccdc40lnks/lnks mutants develop heterotaxy with partial penetrance, as well as situs inversus totalis, while 70% of Ccdc40lnks/lnks mutants have no detectable heart phenotype. Thus, the divergent cohort contains a new phenotype of situs inversus totalis absent from the initial and prospective cohorts. The divergent cohort also contained Hoxb1Cre/+; Nodalflox/Nul conditional mutants as in the initial cohort, but these samples (like all others in the divergent cohort) were treated with a very different cardioplegia method, which is more efficient at relaxing the cardiac muscle within the endogenous thoracic cavity and thus provides more standardized ventricular wall thickness in controls (see Methods section ‘Fetus and neonate collection’ above and Supplementary Table 1). Note that the divergent cohort included samples at the E17.5 stage (Fig. 5a), which was not represented in the initial and prospective cohorts. However, E17.5 for the Ccdc40 mouse line is equivalent to E18.5 for the other lines, because this stage also corresponds to one day before birth: Ccdc40lnks/lnks mutants develop faster than Nodal mutants and are born one day earlier. Indeed, we did not detect any significant difference in heart volumes between these two groups (11.49 ± 1.95 mm3 for E17.5 Ccdc40lnks/lnks vs. 12.35 ± 3.42 mm3 at E18.5 for the remaining mouse lines; Mann-Whitney U test: n1 = 40, n2 = 107, p = 0.293; Fig. 5b).
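The volume comparison above can be reproduced in outline with scipy's Mann-Whitney U test. The values below are simulated from the reported group means and SDs, not the actual measurements:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# Hypothetical heart volumes (mm^3) mimicking the two stage groups
e175 = rng.normal(11.49, 1.95, size=40)   # E17.5 Ccdc40lnks/lnks
e185 = rng.normal(12.35, 3.42, size=107)  # E18.5, remaining lines

stat, p = mannwhitneyu(e175, e185, alternative="two-sided")
print(round(p, 3))
```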

Activation map of classification model

We used Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight features in the μCT images that underlie classification as normal or CHD30. In the resulting activation maps, hot colors represent regions that are more important for predicting a specific class (CHD or normal) than regions with cold colors. In our implementation of Grad-CAM, we first computed the feature maps of the last convolutional layer (A), then calculated the gradients of the predicted class score with respect to these feature maps. Next, these gradients were global-average-pooled to determine the importance weights (w). Finally, the Grad-CAM map was computed as the linear combination of the feature maps (A) weighted by (w), followed by application of the ReLU function (i.e., setting all negative values to zero).
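These steps reduce to a few array operations. The following numpy sketch assumes the feature maps A and the gradients have already been extracted from the network (the array shapes are illustrative):

```python
import numpy as np

def grad_cam_3d(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM on 3D feature maps, following the steps described above.

    feature_maps: A, shape (C, D, H, W) -- last conv layer activations.
    gradients:    dy_c/dA, same shape -- gradients of the class score.
    Returns a (D, H, W) activation map (before upsampling to image size).
    """
    # importance weights w: global average pooling of the gradients per channel
    w = gradients.mean(axis=(1, 2, 3))                    # shape (C,)
    # weighted linear combination of the feature maps, then ReLU
    cam = np.tensordot(w, feature_maps, axes=([0], [0]))  # shape (D, H, W)
    return np.maximum(cam, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4, 4, 2))
dA = rng.standard_normal((8, 4, 4, 2))
cam = grad_cam_3d(A, dA)
print(cam.shape, cam.min() >= 0)  # (4, 4, 2) True
```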

Statistics and reproducibility

All sample numbers (n) indicated in the text refer to biological replicates, i.e., different embryos. We considered three cohorts, labelled ‘initial’ (n = 139 mice), ‘prospective’ (n = 18), and ‘divergent’ (n = 80). Each μCT scan corresponds to a distinct mouse embryo. Group allocation was based on PCR genotyping. Investigators were blinded during imaging and phenotype analysis. One sample was excluded from the UpSet plot only (Fig. 1b) because the tissue was too degraded to interpret the type of heart anomaly. Descriptive statistics of quantitative variables are given as mean ± standard deviation (SD) unless otherwise stated. We used Fisher’s exact test to test for dependence between labels (CHD or normal) and stages, and to test for the dependence of segmentation and classification results on labels or stages. Differences in quantitative variables between two groups were tested by Mann-Whitney U tests, without accounting for litters as a potential source of variability. Differences were considered statistically significant for p-values below 0.05 in two-sided hypothesis tests. Effect sizes were described by the ratio of means between groups, with a bootstrapped 95% confidence interval computed using the arch package.
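A percentile-bootstrap confidence interval for the ratio of means can be sketched in plain numpy (the study itself used the arch package; this is an illustrative equivalent):

```python
import numpy as np

def bootstrap_ratio_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for the
    ratio of means mean(a) / mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        ra = rng.choice(a, size=a.size, replace=True)  # resample group a
        rb = rng.choice(b, size=b.size, replace=True)  # resample group b
        ratios[i] = ra.mean() / rb.mean()
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Toy data whose true ratio of means is 11.5 / 5.75 = 2.0
a = np.array([10.0, 11.0, 12.0, 13.0])
b = np.array([5.0, 5.5, 6.0, 6.5])
lo, hi = bootstrap_ratio_ci(a, b)
print(lo < 2.0 < hi)  # the observed ratio lies inside the interval
```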

Statistical analyses were implemented with the scipy package (version 1.9.1) within Python 3.9.13.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.