Introduction

Physics-based simulations of cardiovascular interventions such as endovascular stent expansion or heart valve implantation can help optimize device design and deployment, especially in challenging anatomies1. These “virtual interventions” can be modelled on a patient-specific digital twin, which is a computational replication of a real anatomy derived from medical imaging2,3,4. Virtual interventions have been shown to model the mechanical and hemodynamic consequences of implanting heart valves5,6, atrial appendage occluders7, and coronary stents8,9, as well as the electrophysiological consequences of cardiac ablation10. Applied to a cohort of digital twins, virtual interventions enable in silico trials of medical devices11, in which their safety and efficacy can be assessed within a digital environment. Such trials can act as digital evidence for regulatory agencies, reducing the exorbitant cost and failure rates involved with bringing a device to market12,13. Virtual interventions also enable the simulation of hypothetical scenarios, such as implanting alternative devices or modeling different physiological conditions within the same patient1. This experimental framework provides mechanistic insight regarding what factors concerning device design and physiology critically influence deployment. Such insights can influence both regulatory and development processes, enhancing future designs and guiding recruitment for clinical trials1,14.

In contrast, our ability to extract mechanistic insight involving alternative anatomic variants is highly limited. Specifically, we delineate three phenomena critical to device development and regulatory evaluation that digital twin frameworks are unable to properly address. First, the uniqueness of each digital twin complicates the assessment of uncertainty in device performance attributable to scale-specific anatomic variation. Small-scale anatomic features can be highly influential on both hemodynamics and biomechanics; examples include coronary plaque rupture being influenced by thin fibrous caps15, ventricular trabeculae influencing cardiac hemodynamics16, and coronary branches affecting blood flow through the aortic root17. Second, due to the complex correlations between local anatomic features within digital twin cohorts, it remains difficult to disentangle the causal relationships and interaction effects exerted by localized anatomic regions on device failure. Localized anatomic features are widely known to interact in influencing cardiovascular physics; examples include the interactions between lipid and calcium in determining plaque rupture risk2,4, the effect of mitral valve pathology on aortic valve replacements18, and the effect of aortic valve replacements on coronary flow19. Lastly, the reliance on digital twin cohorts for in silico trials can compromise device evaluation on less common or pathological anatomic shapes11,14. Accordingly, current digital twin paradigms are unable to fully or precisely explore anatomic space, limiting the broader applicability of virtual interventions for device development and regulatory review.

To address such issues, generative models of virtual anatomies have been proposed, but they typically struggle to produce outputs that are both realistic and controllable. The gold-standard method is principal component analysis (PCA), which has traditionally been used to generate virtual cohorts for biomechanical and hemodynamic simulations20. Despite its utility, PCA is unable to accurately model the highly nonlinear anatomic variation inherent to human anatomy. As such, there has been rising interest in deep learning approaches for producing virtual anatomies. State-of-the-art deep learning architectures for this purpose have been variational autoencoders (VAEs) and generative adversarial networks (GANs), which exhibit improved performance compared to PCA21,22,23. While such architectures have demonstrated the ability to produce variations of anatomy by exploring their latent space21, current approaches remain limited in their ability to precisely edit patient-specific models. This is because such methods represent anatomic shape in terms of global shape vectors, which are not expressive enough to control anatomic variation at different spatial scales or within localized regions while keeping others constant. To overcome this limitation, a previous study by Kong et al. found that representing anatomy in terms of a higher-dimensional and spatially extended latent grid enabled higher expressiveness but decreased generation quality under an auto-decoder paradigm24.

In contrast, diffusion models are a novel class of generative models that can synthesize 2D and 3D medical images with high quality and diversity25,26,27. However, their use in generating virtual anatomy in the form of anatomic label maps is still in its infancy. Preliminary studies used unconditional diffusion models to produce multi-label 2D segmentations of the brain and retinal fundus vasculature, respectively, in order to train downstream computer vision algorithms28,29. The ability of diffusion models to flexibly edit natural images is also well characterized. For example, diffusion models can create variations of natural images through a perturb-denoise process, partially corrupting a seed image and restoring it through iterative denoising30. The level of added noise can control whether the model synthesizes global or local features31. Furthermore, diffusion models can be used to locally in-paint regions within an image by specifying a spatially extended mask32,33,34,35,36, either by directly replacing the masked portion during each denoising step or by using the gradient of a masked similarity loss. While these techniques have been used in the context of medical images for anomaly detection37,38 and data augmentation39, their use in modifying virtual anatomy has not been studied.

Lastly, research into generative models for virtual anatomy is hampered by the lack of appropriate evaluation frameworks to assess the quality of synthetic cohorts for in silico trials. For example, the Fréchet inception distance (FID)40 is difficult to use for evaluating generative models of virtual anatomies, as no standard pre-trained network for 3D anatomic segmentations is available. Moreover, point cloud-based metrics delineate 3D shape quality and diversity but do not measure the interpretable morphological metrics necessary to understand device performance, nor do they measure topological correctness, a critical factor in ensuring compatibility with numerical simulation. Recent studies attempt to address this by visualizing the 1D distributions of clinically relevant morphological variables such as tissue volumes23,41, but they neither examine the multi-dimensional relationships between morphological metrics nor investigate morphological bias due to imbalanced data distributions.

In this study, we develop an experimental framework to study how latent diffusion models (LDMs) can act as a controllable source of anatomic variants for in silico trials to fulfill two main functionalities. The first functionality centers on the controlled synthesis of informative anatomies through editing digital twins, which we term “digital siblings”. As opposed to a digital twin, which is a computational replication of a patient-specific anatomy, a digital sibling would resemble the corresponding twin, but exhibit subtle differences in anatomic form. Comparative simulation studies using twins and their siblings would yield insight regarding how scale-specific and region-specific anatomic variation can influence simulated deployment. The second functionality revolves around virtual cohort augmentation by creating digital siblings from a curated subpopulation of digital twins. This would enrich virtual cohorts with specified anatomic attributes, addressing issues related to cohort imbalance and diversity. We accordingly develop a latent diffusion model to generate 3D cardiac label maps and introduce a novel experimental framework to study the synthesis of anatomic variation (Fig. 1). We first characterize the baseline performance of the model through generating de-novo cardiac label maps (Fig. 2). We then investigate two methods to generate digital siblings with diffusion models: (1) perturbational editing of cardiac digital twins to enable scale-specific variation; and (2) localized editing of cardiac digital twins to enable region-specific variation. In our experimental framework, we select various digital twins to act as “seed” volumes and produce several digital siblings through editing. We then apply this procedure over different hyperparameters and seed characteristics to study how generative editing can alter the morphological and topological attributes of digital twins. 
Lastly, we study how such editing methods can be used to augment virtual cohorts with less common anatomic features. Our main contributions and insights are as follows:

  1. We develop and train a latent diffusion model to generate 3D cardiac label maps and introduce a novel experimental framework to study how generative editing techniques can produce scale- and region-specific variants of digital twins.

  2. We demonstrate that latent diffusion models can introduce topological violations during generation and editing, where the number of violations is influenced by editing methodology and seed characteristics.

  3. We find that dataset imbalance induces a bias within the generation process towards common anatomic features. This anatomic bias extends to scale- and region-specific editing. The degree and spatial distribution of this bias is influenced by editing hyperparameters and seed characteristics.

  4. We demonstrate that this anatomic bias can be leveraged to enhance virtual cohort diversity in two manners. Virtual cohort augmentation with scale-specific variation can help explore less populated spaces within the anatomic distribution bounded by the training set. Similarly, augmentation with region-specific variation can augment the cohort with anatomic forms outside the anatomic distribution.

Fig. 1: We study the ability of diffusion models to generate digital siblings for virtual interventions and augment in silico trials.

Top row: We unconditionally generate latent codes \((\bar{{\bf{z}}})\) which are decoded (D) into cardiac label maps (\(\bar{{\bf{x}}}\)). Middle row: We encode (E) patient-specific digital twins (x) into a latent space (z) and apply a partial perturb-denoise process to achieve scale-specific variations (\({\bar{{\bf{x}}}}_{\psi }\)). Bottom row: We locally edit pre-specified tissues to achieve region-specific variations (\({\bar{{\bf{x}}}}_{{\bf{m}}}\)).

Fig. 2: Schematic for the forward and reverse diffusion process.

The decoded cardiac label maps for several intermediately noised latent representations zσ. During training, a neural denoiser learns to approximate the incremental reverse process at each noise level σ. During sampling, the network is recursively applied to produce de-novo cardiac label maps.

Results

Unconditional sampling of virtual anatomies

We conducted a sensitivity analysis of cohort quality with respect to the number of sampling steps and the cohort size, and used a minimum of 50 sampling steps and a cohort size of 120 (Supplementary Figs. 7 and 8). We then sampled 360 label maps with the diffusion model over 50 steps for analysis and visualization. We also trained a baseline generative VAE that samples global latent vectors, which are decoded into cardiac label maps. Example label maps can be seen in Fig. 3. We further visualize the reconstructed and synthetic cardiac label maps in 3D within Supplementary Figs. 5 and 6. The scatterplot (Fig. 4) and the difference heatmap (Fig. 5) show the morphological distribution of the synthetic anatomies generated by the diffusion model on a global and local scale respectively. Both figures demonstrate that unconditional sampling tends to generate mean-sized cardiac label maps but fails to sample rarer anatomic configurations on the periphery of the distribution. This is especially the case for the baseline VAE, which learns a much more constrained distribution concentrated around the anatomic mean. This bias also exists on a local level, as seen in the difference heatmap Pdiff in Fig. 5. The heatmaps for each individual chamber are shown in Supplementary Fig. 4.

Fig. 3: Example 2D slices from 3D cardiac label maps.

Top left: digital twin label maps from the training set. Top right: unconditionally generated label maps generated by the diffusion model. Bottom left: perturbational edits of a single cardiac digital twin over various sampling ratios. Bottom right: localized edits of cardiac digital twins over various tissue masks. Bottom row has a white outline of the edited twin for perturbational edits (left) and an outline of the edited tissue region for localized edits (right).

Fig. 4: Unconditional generation captures common anatomic variations but fails to capture outliers.

Diagonal plot shows the 2D morphological distribution (shown as a scatterplot) exhibited by real cohorts and synthetic cohorts generated by unconditional sampling from a generative VAE and diffusion model. Marginal plots show the equivalent 1D morphological distributions for the virtual anatomy cohorts, visualized as a kernel density estimate plot.

Fig. 5: The distribution of label maps synthesized by the diffusion model exhibits spatially varying differences against that of real label maps.

Spatial occupancy heatmaps show the distribution of real (Preal) and synthetic (Psynthetic) label maps, as well as the difference in occupation (Pdiff). Heatmaps are masked out where Preal or Psynthetic are zero. Real or synthetic bias correspond to increased relative occupancy by real or synthetic anatomies respectively.

Table 1 shows the morphological and 3D shape-based metrics for the generative VAE and diffusion model. Our diffusion model is able to sample from a wider distribution of anatomy due to its expressive latent grid, with higher recall and coverage values. However, the generative VAE exhibits higher morphological precision, MMD, and 1-NNA values, likely due to sampling common anatomies near the center of the distribution. Table 2 indicates that the primary source of topological violations stems from the initial segmentations used to train the generative models. Violations in the real dataset stem from the segmentation network used to create the original dataset, in which small clusters of misclassified tissues contribute to the number of topological violations (Fig. 3 and Supplementary Fig. 2).

Table 1 Morphological and shape based metrics comparing a baseline generative VAE and diffusion model for unconditional cardiac generation
Table 2 Topological violations exhibited by real, reconstructed real, and synthetic cohorts respectively

Scale specific variation through perturbational editing

We select four seed label maps that represent different types of cardiac anatomy: a seed with a large LV and RV (LR), a seed with a small LV and RV (LR), a seed with a large LV but mean-sized RV (LR), and a seed with a mean-sized LV and RV (LR). For each seed, we generate synthetic anatomies with varying sampling ratios, corresponding to ψ = [0.35, 0.50, 0.65, 0.8, 1], leading to a total of 20 virtual cohorts of 120 anatomies each. Example label maps can be seen in Fig. 3.

Figure 6 shows that the cohorts generated by perturbational editing are increasingly biased towards the most common anatomies with increasing noise. Figure 7 further shows that the amount of injected noise corresponds to spatial scale, as the bias exhibited by the spatial heatmap Pdiff expands with increasing noise. Table 3 demonstrates that the topological quality of the sampled label maps can degrade when editing outlier twins, as can be seen when perturbationally editing seed LR with a sampling ratio ψ of 0.35. This is because the seed occupies a sparsely populated region of the anatomic distribution. A visualization of the topological violations exhibited after perturbational editing can be found in Supplementary Fig. 3.

Fig. 6: Perturbationally editing seed cardiac label maps (star marker) with increasing levels of injected noise ψ produces cohorts that are biased towards the most common anatomies (blue contour).

Each scatterplot corresponds to a different seed label map, showing multiple cohorts synthesized by editing the same seed with different sampling ratios (ψ). For improved visual clarity, scatterplots are supplemented with kernel density estimate plots, and the number of data points displayed per cohort is reduced by half.

Fig. 7: Perturbationally editing seed cardiac label maps (columns) with increasing levels of injected noise ψ (rows) enables scale-specific variation.

Difference heatmaps Pdiff show spatially varying discrepancies between the seed and synthetic cohorts generated by perturbationally editing various seed label maps.

Table 3 Topological violations exhibited by each cohort produced by perturbationally editing various seed label maps for different sampling ratios ψ

Region specific variation through localized editing

For each of the previously mentioned seeds, we specify two masks designed to edit the RV and LV respectively. The myocardium was not included in either tissue mask, allowing it to vary with each ventricular chamber. This process resulted in eight synthetic cohorts of 120 anatomies each. Example label maps can be seen in Fig. 3.

Figure 8 shows that the 1D distributions of edited ventricular volumes are biased towards the most common values of the real cohort. This can be seen most prominently with seed LR, where the edited LVs have a substantially lower volume than the seed label map. From the spatial difference heatmaps Pdiff visualized in Fig. 9, we further observe that localized editing can change individual chambers while holding others constant, where the edited chambers are biased towards a mean anatomic shape. With the exception of editing the RV of seed LR, locally editing the seed label maps increased the percentage of topological violations compared to the seeds, as can be seen in Table 4. A visualization of the topological violations exhibited after localized editing can be found in Supplementary Fig. 3, and a comparison of our replacement-based inpainting method to guidance-based inpainting can be found in Supplementary Fig. 1.

Fig. 8: Localized editing of seed cardiac label maps (star marker) produces cohorts with region-specific variation that is biased towards those of the most common anatomies.

Each scatterplot corresponds to a different seed, showing multiple cohorts synthesized by locally editing the same seed label map with different tissue masks m.

Fig. 9: Locally editing seed cardiac label maps (columns) with different tissue masks m (rows) enables region-specific variation.

Difference heatmaps Pdiff show spatially varying discrepancies between the seed and synthetic cohorts generated by locally editing 4 seed label maps.

Table 4 Topological violations exhibited by each cohort produced by localized editing of various seed label maps with different tissue masks

Virtual cohort augmentation through selective editing

In this experiment we compare and contrast three strategies that can augment virtual cohorts with rare anatomies to improve dataset imbalance and diversity. In this case, we aim to enrich a target cohort with rare patient-specific cardiac label maps distinguished by an RV volume larger than a threshold value of 115 ml. Our first strategy is to unconditionally sample 7200 label maps and discard all outputs with RV volumes below the threshold. In our second strategy, we utilize the bias inherent to perturbational editing and modify digital twins from the target cohort to create digital sibling cohorts. Half of the digital twins received a large perturbation (ψ = 0.5) and the other half received a small perturbation (ψ = 0.35). Following the editing process, digital siblings with an RV volume below the threshold were excluded. Our third strategy leverages the bias inherent to localized editing, in which half of the target cohort was locally edited to have different LV shapes, while the other half was edited to have different RV shapes. Similarly, outputs that did not meet the RV volume threshold were excluded. All three strategies resulted in filtered cohorts of 140 anatomies each. The evaluation metrics, namely Fréchet morphological distance, morphological precision, and morphological recall, were computed against the target cohort consisting of 50 cardiac label maps from the training set as the reference standard.
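As an aside, a Fréchet distance between two cohorts is commonly computed as the 2-Wasserstein distance between Gaussian fits of their feature vectors. The sketch below illustrates this standard formulation on generic feature matrices; the specific morphological features used (e.g. chamber volumes) and any details particular to the Fréchet morphological distance reported here are assumptions of this illustration.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Squared Frechet (2-Wasserstein) distance between Gaussian fits of two
    feature matrices of shape (n_samples, n_features). A generic sketch; the
    choice of morphological features is up to the evaluator."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary residue
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

The distance is zero for identical cohorts and grows with both mean shifts and covariance mismatches between the two feature distributions.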

Figure 10 demonstrates that unconditional generation does not fully explore the peripheries of the target cohort distribution, where it can be seen that the largest RV volumes within the target cohort (black stars) are not represented. In contrast, perturbational editing excels at filling sparsely populated peripheries of the distribution (producing anatomies with both ventricles enlarged). Table 5 reinforces these insights, demonstrating that augmentation through perturbational editing enhances diversity, exhibiting higher recall and COV values than unconditional generation. Augmenting cohorts with localized editing yields cardiac label maps with morphological features that conform to the distribution of individual morphological metrics but deviate from the multidimensional distribution, producing anatomies with only a single large ventricle. Table 5 shows that localized editing results in increased diversity metrics. Furthermore, both editing-based strategies yield similar or better fidelity metrics, such as precision and MMD, when compared to unconditional sampling. Table 5 also demonstrates that virtual cohorts produced by all augmentation strategies exhibit similar topological quality.

Fig. 10: Scatterplots demonstrating three augmentation strategies for a target cohort of real cardiac label maps distinguished by right ventricle volumes larger than a minimum threshold (dashed lines).

The first strategy uses unconditional generation, while the second and third strategies apply generative editing to a cohort of seed label maps. All generated cohorts underwent filtering to ensure a minimum right ventricular volume.

Table 5 Comparison of various metrics across different virtual cohort augmentation strategies

Discussion

In this study we developed an experimental framework to investigate how generative diffusion models of human anatomy can be integrated into virtual intervention workflows through the precision editing of digital twins. This novel paradigm is designed to facilitate the generation of mechanistic insights for device development as well as digital evidence for regulatory assessment. Specifically, we trained a diffusion model on a dataset of 3D cardiac label maps and leveraged the model to edit digital twins under various hyperparameters. By examining the 3D shape, morphological attributes and topological quality of the label maps post-editing, we find that diffusion model-based editing techniques can generate insightful morphological variants of digital twins for virtual interventions. Perturbational editing can produce scale-specific variations of digital twins, which can isolate the sensitivity of device deployment to both small and large-scale variations. In contrast, localized editing can produce region-specific variations of digital twins, which can elucidate the localized effect of anatomic features on device deployment. Such insights can streamline the development of novel medical devices and provide a more comprehensive assessment of device performance for regulatory agencies.

While the integration of generative editing with virtual interventions has the potential to produce mechanistic insight and augment in silico trials, these techniques should be employed with caution. For example, we find that generative editing can produce anatomies with topologically incorrect features, such as connected atria or multiple left ventricle components, which induce non-physiological phenomena within numerical simulations of cardiovascular physics. Moreover, we demonstrate that diffusion models exhibit a bias towards generating the more common anatomic features within the dataset, a bias that extends to diffusion model-based editing techniques. Anatomic variants with low morphological plausibility can induce inaccuracies in the regulatory assessment of device safety and fail to capture possible failure modes. As such, methods that evaluate and control anatomic bias will be critical to the integration of generative artificial intelligence within workflows regarding device development and regulatory review. We nevertheless demonstrate that such anatomic bias can be leveraged to enhance the digital evidence produced by in silico trials. This is achieved by augmenting cohorts with digital siblings, thereby improving factors critical to regulatory approval, such as cohort balance and diversity. Specifically, we found that perturbational editing can fill the sparsely populated regions within the anatomic distribution, potentially improving device assessment for realistic anatomies. Similarly, localized editing can expand the space of plausible anatomies that can be probed with virtual interventions, enabling the assessment of possible failure modes, at the expense of decreased anatomic realism.

However, while our experimental framework can derive novel insights regarding the morphological and topological behaviour of generative editing for virtual interventions, it exhibits a number of limitations. First, it does not quantitatively analyze morphology on multiple scales, instead measuring global level metrics such as 3D shape, volumes and axis lengths. Second, the influence of the diffusion model architecture or sampling methodology on generative editing was not explored. Lastly, the validity of visualizing spatial heatmaps depends on spatial correspondence between anatomic features, and would not apply to anatomies that have a variable topology such as organs with multi-component inclusions. All of these limitations present exciting directions for future work on evaluation metrics and experimental frameworks regarding the generative editing of digital twins for device development and regulation.

Methods

Dataset

We used the TotalSegmentator dataset42, consisting of 1204 Computed Tomography (CT) images, each segmented into 104 bodily tissues. We filtered out all patient label maps that did not have complete and adequate-quality segmentations for all four cardiac chambers. This resulted in a dataset of 512 3D cardiac label maps, where each label map consisted of 6 tissues: aorta (Ao), myocardium (Myo), right ventricle (RV), left ventricle (LV), right atrium (RA), and left atrium (LA). All cardiac label maps were cropped and resampled to a size of 7 × 128 × 128 × 128, with an isotropic voxel spacing of 1.4 mm. We then reoriented each cardiac segmentation so that the axis between the LV and LA centroids is aligned with the positive z-axis. Lastly, we rigidly registered all segmentations to a reference label map using the methodology described by Avants et al.43.
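The reorientation step amounts to finding the rotation that maps the LV-to-LA centroid direction onto the positive z-axis, which Rodrigues' formula provides in closed form. The sketch below is illustrative (the function name and interface are ours, not from the original pipeline); in practice the input vector would be the LA centroid minus the LV centroid computed from a label map.

```python
import numpy as np

def rotation_to_z(v):
    """Rotation matrix aligning direction v with the positive z-axis,
    via Rodrigues' rotation formula."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    target = np.array([0.0, 0.0, 1.0])
    axis = np.cross(v, target)                 # rotation axis (unnormalized)
    s, c = np.linalg.norm(axis), float(v @ target)
    if s < 1e-12:
        # v is already (anti-)parallel to z; for the antiparallel case,
        # rotate 180 degrees about the x-axis.
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])   # cross-product matrix of the axis
    return np.eye(3) + K + K @ K * ((1.0 - c) / s**2)
```

The resulting matrix can then be applied to the label map grid by resampling, before rigid registration to the reference anatomy.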

Latent diffusion model training

We employed a latent diffusion model (LDM), consisting of a variational autoencoder (VAE) and a denoising diffusion model. The VAE encodes cardiac label maps x into latent representations z, which can be decoded into label maps \(\bar{{\bf{x}}}\). The diffusion model is trained in the latent space of the trained autoencoder. We represent the probability distribution of cardiac anatomy by pdata(z) and consider the joint distribution p(zσ; σ) obtained through a forward diffusion process, in which i.i.d. Gaussian noise of standard deviation σ is added to the data, such that at σ = σmax the data is indistinguishable from Gaussian noise. The driving principle of diffusion models is to sample pure Gaussian noise and approximate the reverse diffusion process by using a neural network to sequentially denoise the latent representations zσ over noise levels σ0 = σmax > σ1 > ⋯ > σN = σmin, such that the final denoised latents correspond to the clean data distribution. Following Karras et al.44, we represent the reverse diffusion process as the solution to the following stochastic differential equation

$$d{{\bf{z}}}_{\sigma }=-2\sigma {\nabla }_{{\bf{z}}}\log p({{\bf{z}}}_{\sigma };\sigma )\,dt+\sqrt{2\sigma }d{\bf{w}}$$
(1)

where the score function \({\nabla }_{{{\bf{z}}}_{\sigma }}\log p({{\bf{z}}}_{\sigma };\sigma )\) denotes the direction in which the rate of change of the log probability density function is greatest and dw is the standard Wiener process. Since the data distribution is not analytically tractable, we train a neural network to approximate the score function. We start with clean latent representations z and model a forward diffusion process that produces intermediately noised latents zσ = z + n where \({\bf{n}} \sim {\mathcal{N}}({\bf{0}},{\sigma }^{2}{\bf{I}})\), parameterized by a noise level σ. The diffusion model is parameterized as a function Fθ, encapsulated within a denoiser Dθ, that takes as input an intermediately noised latent zσ and a noise level σ to predict the clean data z:

$${D}_{\theta }({{\bf{z}}}_{{\mathbf{\sigma }}};\sigma )={c}_{{\rm{skip}}}(\sigma )\,{{\bf{z}}}_{{\mathbf{\sigma }}}+{c}_{{\rm{out}}}(\sigma )\,{F}_{\theta }({c}_{{\rm{in}}}(\sigma )\,{{\bf{z}}}_{{\mathbf{\sigma }}};\,{c}_{{\rm{noise}}}(\sigma )),$$
(2)

where cskip controls the skip connections that allow Fθ to predict the noise n at low σ and the training data z at high σ. This parametrization has been shown to improve convergence speed and performance44. The variables cout and cin scale the output and input magnitudes to be within unit variance, and the constant cnoise maps the noise level σ to a conditioning input for the network44. The denoiser output is related to the score function through the relation \({\nabla }_{{{\bf{z}}}_{\sigma }}\log p({{\bf{z}}}_{\sigma };\sigma )=\left({D}_{\theta }({{\bf{z}}}_{\sigma };\sigma )-{{\bf{z}}}_{\sigma }\right)/{\sigma }^{2}\), and Fθ is chosen to be a 3D U-net with both convolutional and self-attention layers, similar to previous approaches28,30,44,45. The loss L is then specified based on the agreement between the denoiser output and the original training data:

$$L={{\mathbb{E}}}_{\sigma ,{\bf{z}},{\bf{n}}}[\lambda (\sigma )| | {D}_{\theta }({{\bf{z}}}_{\sigma };\sigma )-{\bf{z}}| {| }_{2}^{2}],$$
(3)

such that the loss weighting λ(σ) = 1/cout(σ)2 ensures an effective loss weight that is uniform across all noise levels, and σ is sampled from a log-normal distribution with a mean of 1 and standard deviation of 1.2.
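For concreteness, the preconditioning of Eq. (2) and a Monte Carlo sample of the loss in Eq. (3) can be sketched as below. The coefficient choices (including the value of σdata and the form of cnoise) follow Karras et al.44 and are assumptions of this sketch; the placeholder network F stands in for the 3D U-net.

```python
import numpy as np

SIGMA_DATA = 0.5  # assumed data standard deviation (a tunable hyperparameter)

def preconditioning(sigma, sigma_data=SIGMA_DATA):
    """Preconditioning coefficients used in Eq. (2), per Karras et al."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    c_noise = 0.25 * np.log(sigma)  # maps sigma to a conditioning input
    return c_skip, c_out, c_in, c_noise

def denoiser(F, z_sigma, sigma):
    """The denoiser D_theta of Eq. (2), wrapping a raw network F."""
    c_skip, c_out, c_in, c_noise = preconditioning(sigma)
    return c_skip * z_sigma + c_out * F(c_in * z_sigma, c_noise)

def edm_loss(F, z, rng, log_mean=1.0, log_std=1.2):
    """One Monte Carlo sample of the weighted loss of Eq. (3); we interpret
    the stated mean of 1 and standard deviation of 1.2 as the parameters of
    the normal distribution over ln(sigma)."""
    sigma = float(np.exp(rng.normal(log_mean, log_std)))
    z_sigma = z + sigma * rng.standard_normal(z.shape)  # forward-noised latent
    _, c_out, _, _ = preconditioning(sigma)
    lam = 1.0 / c_out**2                                # lambda(sigma) = 1 / c_out^2
    return lam * float(np.sum((denoiser(F, z_sigma, sigma) - z)**2))
```

Note how cskip dominates at low σ (the denoiser passes the input through) and cout dominates at high σ (the network output carries the prediction), matching the behaviour described above.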

Once the denoiser has been sufficiently trained, we define a specific noise level schedule governing the reverse process, in which the initial noise level, σ, starts at \({\sigma }_{\max }\) and decreases to \({\sigma }_{\min }\):

$${\sigma }_{i}={\left({\sigma }_{\max }^{\frac{1}{\rho }}+\frac{i}{N-1}\left({\sigma }_{\min }^{\frac{1}{\rho }}-{\sigma }_{\max }^{\frac{1}{\rho }}\right)\right)}^{\rho }$$
(4)

where ρ, σmin and σmax are hyperparameters that were set to 3, 2e−3, and 80 respectively. We specifically leverage a stochastic variant of the solver detailed in Karras et al.44 to sequentially denoise the latent representations zσ and solve the reverse diffusion process detailed in Eq. (1) (Fig. 1).
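Eq. (4) can be implemented directly; a minimal sketch with the stated hyperparameter values:

```python
import numpy as np

def noise_schedule(n, sigma_min=2e-3, sigma_max=80.0, rho=3.0):
    """Noise levels per Eq. (4): a decreasing sequence whose first entry is
    sigma_max and whose last entry is sigma_min."""
    i = np.arange(n)
    return (sigma_max**(1 / rho)
            + i / (n - 1) * (sigma_min**(1 / rho) - sigma_max**(1 / rho)))**rho
```

Smaller ρ spaces the levels more evenly across magnitudes, while larger ρ concentrates steps near σmin.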

Latent diffusion model implementation

We trained the variational autoencoder with an MSE reconstruction loss and a KL divergence loss with a relative weight of 1e−6. We modified the architecture from Rombach et al.45 to ensure compatibility with 3D voxel grids and adjusted the number of channels to [64, 128, 192]. We augmented our data with random scaling (0.5–1.5), rotations (0–180°), and translations (0–20 voxels) in each direction. For the denoising diffusion model, we modified the original architecture specified by Rombach et al.45 to ensure compatibility with 3D voxel grids and adjusted the model channels to [64, 128, 192]. We used the Adam optimizer46 for the VAE and diffusion model, with learning rates of 1e−4 and 2.5e−5 respectively.

Perturbational editing

To create digital siblings by perturbational editing, we first encode a seed cardiac label map xseed into the latent representation zseed. Instead of sampling from pure Gaussian noise, we recursively apply the denoiser using the intermediately noised latent zσ as the starting point (Fig. 1) to produce \({\bar{{\bf{z}}}}_{\psi }\). The latent \({\bar{{\bf{z}}}}_{\psi }\) is then decoded into the cardiac label map \({\bar{{\bf{x}}}}_{\psi }\) using the autoencoder. The intermediate step i < N is a hyperparameter that determines how much of the sampling process is recomputed. We express this hyperparameter in terms of the sampling ratio ψ = (N − i)/N in our experiments, such that ψ = 0 is equivalent to the reconstruction of the original label map, and ψ = 1 is equivalent to the unconditional generation of cardiac label maps.
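The procedure can be sketched as follows; `denoise_step` is a hypothetical placeholder for one step of the stochastic solver, and the index mapping follows ψ = (N − i)/N:

```python
import numpy as np

def perturbational_edit(z_seed, sigmas, psi, denoise_step, rng):
    """Re-noise the seed latent to an intermediate level sigma_i and re-run
    only the tail of the reverse process (a sketch; `denoise_step` stands in
    for one solver step)."""
    N = len(sigmas)
    i = int(round((1.0 - psi) * N))   # psi = (N - i)/N from the text
    if i >= N:                        # psi = 0: pure reconstruction of the seed
        return z_seed.copy()
    z = z_seed + sigmas[i] * rng.standard_normal(z_seed.shape)
    for j in range(i, N - 1):
        z = denoise_step(z, sigmas[j], sigmas[j + 1])
    return z
```

At ψ = 1 the seed is noised to σmax and the full reverse process is recomputed, recovering unconditional generation.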

Localized editing

To create digital siblings by localized editing, we first encode a seed cardiac label map xseed into the latent representation zseed. To better preserve the masked region, we set zseed as the mean prediction of the encoder without sampling from the Gaussian prior. A tissue-based mask, m, denoting which cardiac tissues are to be preserved, is created and downsampled to the same size as the latent representation. The mask is then dilated twice to ensure that tissue interfaces remain stable during editing. The sampling process is similar to that of unconditional sampling, with the addition of an update step that replaces the masked portion of the intermediately denoised image with an equivalently corrupted latent representation belonging to the seed label map:

$${{\bf{z}}}_{\sigma }={\bf{m}}\odot ({{\bf{z}}}^{{\rm{seed}}}+{\bf{n}}(\sigma ))+(1-{\bf{m}})\odot {{\bf{z}}}_{\sigma }$$
(5)

At the end of sampling, the denoised latent \({\bar{{\bf{z}}}}_{{\bf{m}}}\) is then decoded into the cardiac label map \({\bar{{\bf{x}}}}_{{\bf{m}}}\) through the decoder (Fig. 1).
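Eq. (5) corresponds to a one-line update inside the sampling loop; a minimal sketch:

```python
import numpy as np

def masked_update(z_sigma, z_seed, m, sigma, rng):
    # Eq. (5): pin the preserved region (m = 1) to an equivalently noised
    # copy of the seed latent; leave the edited region (m = 0) untouched.
    n = sigma * rng.standard_normal(z_seed.shape)
    return m * (z_seed + n) + (1.0 - m) * z_sigma
```

Because the preserved region is re-corrupted to the current noise level at every step, it stays statistically consistent with the region being regenerated.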

Shape evaluation

To evaluate virtual cohorts in terms of 3D shape, we use point cloud-based metrics as proposed by Yang et al.47. These metrics include (1) minimum matching distance (MMD), which measures shape fidelity, (2) coverage (COV), which measures shape diversity, and (3) 1-nearest-neighbor accuracy (1-NNA), which measures distributional similarity. To convert label maps into point clouds, we group the main cardiac chambers and myocardium into a single shape and use marching cubes48 to obtain a 3D surface mesh from which we randomly sample a point cloud of 1024 points. To calculate the shape metrics, we first define the similarity between point clouds in terms of Chamfer distance (CD) and earth mover’s distance (EMD) as follows:

$$\,\text{CD(X, Y)}\,=\sum _{x\in X}\mathop{\min }\limits_{y\in Y}\parallel x-y{\parallel }_{2}^{2}+\sum _{y\in Y}\mathop{\min }\limits_{x\in X}\parallel x-y{\parallel }_{2}^{2},$$
(6)
$$\,\text{EMD(X, Y)}\,=\mathop{\min }\limits_{\psi :X\to Y}\sum _{x\in X}\parallel x-\psi (x){\parallel }_{2},$$
(7)

where X and Y are two point clouds with the same number of points and ψ is a bijection between them. Given a set of generated (Sg) and real (Sr) point clouds, we measure shape fidelity through MMD as follows:

$${\rm{MMD}}({S}_{g},{S}_{r})=\frac{1}{| {S}_{r}| }\sum _{Y\in {S}_{r}}\mathop{\min }\limits_{X\in {S}_{g}}D(X,Y),$$
(8)

where lower values indicate the generated shapes are of higher fidelity. To measure shape diversity, we compute COV as the fraction of point clouds in the real set that are the nearest neighbor of at least one point cloud in the generated set. Mathematically, we compute COV as follows:

$${\rm{COV}}({S}_{g},{S}_{r})=\frac{\left| \left\{\arg {\min }_{Y\in {S}_{r}}D(X,Y)| X\in {S}_{g}\right\}\right| }{| {S}_{r}| },$$
(9)

where higher values indicate better diversity or coverage in terms of 3D shape. Finally, we use 1-NNA to compare the distributions of real and generated shapes. To do this, we let \({S}_{X}={S}_{r}\cup {S}_{g}-\{X\}\) and let NX be the nearest neighbor of X in SX. 1-NNA is the leave-one-out accuracy of the 1-NN classifier:

$$\begin{array}{l}\,\text{1-NNA}\,({S}_{g},{S}_{r})=\\ \frac{{\sum }_{X\in {S}_{g}}{\mathbb{I}}[{N}_{X}\in {S}_{g}]+{\sum }_{Y\in {S}_{r}}{\mathbb{I}}[{N}_{Y}\in {S}_{r}]}{| {S}_{g}| +| {S}_{r}| },\end{array}$$
(10)

where \({\mathbb{I}}\) is the indicator function. A value close to 50% implies that Sg and Sr are sampled from the same distribution.
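Eqs. (6)–(10) can be sketched with brute-force nearest-neighbor searches (Chamfer distance only; the EMD variant additionally requires a bijective matching solver):

```python
import numpy as np

def chamfer(X, Y):
    # Eq. (6): symmetric sum of squared nearest-neighbor distances.
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def mmd(Sg, Sr, D=chamfer):
    # Eq. (8): for each real cloud, distance to its closest generated cloud.
    return np.mean([min(D(X, Y) for X in Sg) for Y in Sr])

def cov(Sg, Sr, D=chamfer):
    # Eq. (9): fraction of real clouds that are the nearest match of some generated cloud.
    matched = {min(range(len(Sr)), key=lambda j: D(X, Sr[j])) for X in Sg}
    return len(matched) / len(Sr)

def one_nna(Sg, Sr, D=chamfer):
    # Eq. (10): leave-one-out accuracy of a 1-NN classifier on the merged set.
    S = [(X, "g") for X in Sg] + [(Y, "r") for Y in Sr]
    correct = 0
    for i, (X, lab) in enumerate(S):
        j = min((k for k in range(len(S)) if k != i), key=lambda k: D(X, S[k][0]))
        correct += S[j][1] == lab
    return correct / len(S)
```

For the cohort sizes used here the quadratic cost is acceptable; larger studies would typically use a KD-tree or GPU batching.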

Lastly, to analyze anatomic bias on a local scale, a voxel-wise mean was computed over all virtual anatomies within a cohort. This results in a spatial heat map P of size 7 × 128 × 128 × 128 for the real and synthetic cohorts. The inverse of the background channel was chosen for further visualization.
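A minimal sketch of this heat-map computation, assuming the cohort is stacked into a one-hot array with the background in channel 0 (the channel ordering is an assumption):

```python
import numpy as np

def cohort_heatmap(label_maps):
    """Voxel-wise occupancy heat map over a cohort. `label_maps` is assumed
    to be an (M, 7, D, H, W) one-hot array with the background in channel 0."""
    P = label_maps.mean(axis=0)   # per-channel occupancy frequency, shape (7, D, H, W)
    foreground = 1.0 - P[0]       # inverse of the background channel for visualization
    return P, foreground
```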

Morphological evaluation

To assess the morphological quality of a virtual cohort, we represent each virtual anatomy in terms of a 12-dimensional morphological feature vector. For each cardiac label map, we calculate the volume, major axis length, and minor axis length for the LV, RV, LA, and RA. Two of these metrics (LV and RV volumes) were further chosen to plot the global morphological distribution of each cohort. To calculate measures of morphological fidelity and diversity, we adapt the improved precision and recall metrics defined for generative image models49. Our key idea is to form explicit non-parametric representations of the manifolds of real and generated data within morphological space, rather than the feature space of a neural classifier.
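A sketch of how such chamber features could be extracted from a binary mask; the 4√λ ellipsoid-fit convention for the axis lengths mirrors standard region-property definitions and is an assumption here:

```python
import numpy as np

def chamber_features(mask, voxel_size=1.0):
    """Volume and principal axis lengths of one chamber mask (a sketch;
    axis lengths are approximated from the eigenvalues of the
    voxel-coordinate covariance, as in common region-property routines)."""
    coords = np.argwhere(mask).astype(float) * voxel_size
    volume = len(coords) * voxel_size**3
    evals = np.sort(np.linalg.eigvalsh(np.cov(coords.T)))[::-1]
    major = 4.0 * np.sqrt(evals[0])   # ellipsoid-fit major axis length
    minor = 4.0 * np.sqrt(evals[2])   # ellipsoid-fit minor axis length
    return volume, major, minor
```

Applying this to the LV, RV, LA, and RA masks yields the 12-dimensional feature vector described above.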

Following Kynkaanniemi et al.49, we embed our real and synthetic anatomy (in the form of multi-label segmentations) into morphological feature space. We denote the morphological feature vectors of the real and generated anatomies by φr and φg, respectively, and the corresponding sets of morphological feature vectors by Φr and Φg.

For each set of feature vectors Φ ∈ {Φr, Φg}, we estimate the corresponding manifold in the feature space. We obtain the estimate by forming a hypersphere around each feature vector with a radius equal to the distance to its kth nearest neighbor. Together, these hyperspheres define a volume in morphological feature space that serves as an estimate of the true manifold. To determine whether a given sample φ is located within this volume, we define a binary function

$$f(\varphi ,\Phi )=\left\{\begin{array}{ll}1,\quad &{\rm{if}}\,\parallel \varphi -{\varphi }^{{\prime} }{\parallel }_{2}\le \parallel {\varphi }^{{\prime} }-{{\rm{NN}}}_{k}({\varphi }^{{\prime} },\Phi ){\parallel }_{2}\\ \quad &\,{\rm{for}}\, {\rm{at}}\,{\rm{least}}\,{\rm{one}}\,\,{\varphi }^{{\prime} }\in \Phi \\ 0,\quad &{\rm{otherwise}}\end{array}\right.$$
(11)

where \({{\rm{NN}}}_{k}({\varphi }^{{\prime} },\Phi )\) returns the kth nearest feature vector of \({\varphi }^{{\prime} }\) from set Φ. As such, f(φ, Φr) provides information on whether an individual anatomy is morphologically realistic, whereas f(φ, Φg) determines if an anatomy could be reproduced by the diffusion model. We can now define our metrics as

$${\rm{precision}}({\Phi }_{r},{\Phi }_{g})=\frac{1}{| {\Phi }_{g}| }\sum _{{\varphi }_{g}\in {\Phi }_{g}}f({\varphi }_{g},{\Phi }_{r})$$
(12)
$${\rm{recall}}({\Phi }_{r},{\Phi }_{g})=\frac{1}{| {\Phi }_{r}| }\sum _{{\varphi }_{r}\in {\Phi }_{r}}f({\varphi }_{r},{\Phi }_{g}),$$
(13)

where precision denotes the fraction of morphologically ‘realistic’ anatomies in the generated dataset, while recall denotes the fraction of real anatomies that could have been generated by the diffusion model.
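Eqs. (11)–(13) can be sketched directly in morphological feature space; k = 3 is an illustrative choice, not necessarily the value used:

```python
import numpy as np

def knn_radii(Phi, k):
    # Radius of each point's hypersphere = distance to its k-th nearest neighbor
    # (column 0 of the sorted distances is the point itself, at distance 0).
    d = np.linalg.norm(Phi[:, None, :] - Phi[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]

def in_manifold(phi, Phi, radii):
    # Eq. (11): phi lies inside at least one hypersphere of the manifold estimate.
    return np.any(np.linalg.norm(Phi - phi, axis=-1) <= radii)

def precision_recall(Phi_r, Phi_g, k=3):
    # Eqs. (12)-(13): fraction of generated samples on the real manifold, and vice versa.
    rr, rg = knn_radii(Phi_r, k), knn_radii(Phi_g, k)
    precision = np.mean([in_manifold(p, Phi_r, rr) for p in Phi_g])
    recall = np.mean([in_manifold(p, Phi_g, rg) for p in Phi_r])
    return precision, recall
```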

Lastly, we implement a variant of the Frechet Inception Distance50, which we call “Frechet Morphological Distance” (FMD). The key difference is that instead of using the features of a pretrained Inception v3 model, we utilize morphological features. Given the sets of real and generated morphological feature vectors Φr and Φg, we calculate the means (μr, μg) and covariance matrices (Σr, Σg) and compute FMD as follows:

$$\begin{array}{l}{\rm{FMD}}(\mu ,{\mu }^{{\prime} },\Sigma ,{\Sigma }^{{\prime} })\,=\,\parallel \mu -{\mu }^{{\prime} }{\parallel }_{2}^{2}\\\qquad\qquad\qquad\qquad\quad +{\rm{tr}}(\Sigma +{\Sigma }^{{\prime} })\\\qquad\qquad\qquad\qquad\quad-2{\rm{tr}}\left({\left(\Sigma {\Sigma }^{{\prime} }\right)}^{\frac{1}{2}}\right),\end{array}$$
(14)

where a lower FMD indicates the morphological distribution of real and generated anatomies are similar.
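Eq. (14) can be computed without an explicit matrix square root by noting that tr((ΣΣ′)1/2) equals the sum of the square roots of the eigenvalues of ΣΣ′:

```python
import numpy as np

def fmd(mu, mu2, Sigma, Sigma2):
    # Eq. (14): Frechet distance between two Gaussians fitted to morphological features.
    diff = mu - mu2
    # tr((Sigma @ Sigma2)^(1/2)) = sum of square roots of the eigenvalues of Sigma @ Sigma2;
    # small negative eigenvalues from numerical error are clipped to zero.
    eigvals = np.linalg.eigvals(Sigma @ Sigma2)
    covmean_trace = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    return float(diff @ diff + np.trace(Sigma + Sigma2) - 2.0 * covmean_trace)
```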

Topological evaluation

To study how well anatomic constraints and compatibility with numerical simulation are respected, we assess the topological quality of each label map. Clinically, topological defects such as a septal defect between the right and left hearts can have a significant effect on electrophysiology51 and hemodynamics52. Specifically, for each generated anatomy we evaluate 12 different topological violations and calculate the percentage of topological violations exhibited by the cohort. We assess three types of topological violations. The first five metrics check for the correct number of connected components for the Myo, LV, RV, LA, and RA channels. The next five metrics assess the required adjacency relations between the following tissues: LV & Ao, LV & Myo, LV & LA, RV & Myo, RV & RA. The final two metrics examine the absence of adjacency relations between the LV & RV as well as the LA & RA. Multi-component topological violations were found by determining the presence of critical voxels as described in Gupta et al.53, while the number of connected components was assessed by the method described by Silversmith et al.54. For computational efficiency, the label maps were subsampled by a factor of two before calculating all topological violations.
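As an illustration of the pairwise adjacency checks, two tissues can be tested for a shared face-connected interface by comparing shifted copies of their binary masks (6-connectivity assumed; this is a sketch, not the cited critical-voxel method53):

```python
import numpy as np

def face_adjacent(a, b):
    """True if binary masks a and b share a face-connected interface
    (6-connectivity; labels are assumed mutually exclusive)."""
    for axis in range(a.ndim):
        lo = [slice(None)] * a.ndim
        hi = [slice(None)] * a.ndim
        lo[axis] = slice(0, -1)   # voxels paired with their +1 neighbor along `axis`
        hi[axis] = slice(1, None)
        if np.any(a[tuple(lo)] & b[tuple(hi)]) or np.any(a[tuple(hi)] & b[tuple(lo)]):
            return True
    return False
```

A required-adjacency violation is then `not face_adjacent(lv, myo)`, while a forbidden-adjacency violation is `face_adjacent(lv, rv)`.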

Generative autoencoder baseline

To establish a baseline for comparison with our diffusion model approach, we trained a generative VAE that encodes the cardiac label map into a global latent vector and decodes it back into voxel space. We generate new label maps with the generative VAE by sampling the global latent from a Gaussian distribution and using it as input to the decoder. The generative VAE architecture is similar to our reconstructive VAE, except we use 6 downsampling blocks to reduce the 128 × 128 × 128 voxel resolution to 4 × 4 × 4 before flattening the latent grid and feeding it as input to a fully connected layer to produce a global latent vector of size 128. We also change the number of channels in the encoder and decoder to [64,128,196,256,256,512]. We trained the VAE with the same reconstruction and KL divergence loss, but increased the KL loss term to 1e−3 to more strongly enforce a Gaussian distribution on the global latent vectors for improved sampling.