Deep learning for vessel segmentation and flow analysis to identify clusters associated with adverse outcomes in a Fontan patient registry

Yao, Tina; Clair, Nicole St.; Gong, Madeline; Miller, Gabriel F.; Quail, Michael; Moledina, Shahin; Dorfman, Adam L.; Fogel, Mark A.; Krishnamurthy, Rajesh; Lam, Christopher Z.; Robinson, Joshua D.; Slesnick, Timothy C.; Weigand, Justin; Steeden, Jennifer A.; Rathod, Rahul H.; Muthurangu, Vivek

doi:10.1038/s41598-026-40738-6

Article
Open access
Published: 04 March 2026

Tina Yao¹^na1,
Nicole St. Clair²^na1,
Madeline Gong²,
Gabriel F. Miller²,
Michael Quail¹,
Shahin Moledina¹,
Adam L. Dorfman³,
Mark A. Fogel⁴,
Rajesh Krishnamurthy⁵,
Christopher Z. Lam⁶,
Joshua D. Robinson⁷,
Timothy C. Slesnick⁸,
Justin Weigand⁹,
Jennifer A. Steeden¹,
Rahul H. Rathod²,
Vivek Muthurangu¹ &
FORCE Investigators

Scientific Reports volume 16, Article number: 11956 (2026)

Abstract

We introduce a deep learning framework comprising two models for automated segmentation (DCS) and large-scale deep temporal clustering (DTC) within a registry of single ventricle patients. The DCS model performs simultaneous classification and segmentation of velocity-encoded phase-contrast magnetic resonance (PCMR) data for five individual blood vessels: the left and right pulmonary arteries, aorta, superior vena cava, and inferior vena cava. Trained, validated and tested on 260 cardiac MRI exams (each containing 5 PCMR scans), it demonstrated a median Dice score of 0.91 on 50 unseen test exams. Integrated into a fully automated pipeline, the DCS model processed over 4500 registry exams without manual intervention, reaching 98% classification accuracy and 90% segmentation accuracy in cases with all five vessels present. Flow curves obtained from successful segmentations were used to train the DTC model, which performs deep temporal clustering to uncover unique flow patterns. Survival analysis showed that the resulting clusters were significantly associated with the risk of mortality or transplantation and with liver disease, highlighting the clinical relevance of the proposed framework.

Introduction

Time-varying signals can provide important insights into cardiac pathophysiology, capturing dynamic processes that static measurements do not fully characterize. For instance, it is well-recognized that certain blood flow patterns are associated with specific disease processes (e.g. abnormal early/atrial filling ratio in diastolic dysfunction)¹. One of the most accurate methods of measuring time-varying blood flow is velocity-encoded phase-contrast magnetic resonance imaging (PCMR), a technique that is heavily used in the evaluation of congenital heart disease (CHD)^{2,3,4,5,6,7,8,9}. However, usually only time-averaged PCMR metrics (e.g. net forward volume) are reported, neglecting much of the information encoded into the time-varying signal. More sophisticated methods of leveraging the time-varying nature of PCMR data have been developed, but access to large amounts of data is required to validate their clinical relevance.

A potential source of large amounts of PCMR data is the Fontan Outcomes Registry using CMR Examinations (FORCE, http://www.forceregistry.org)¹⁰, which contains > 4500 cardiovascular MR exams performed in patients with functionally single ventricles and the Fontan circulation. These patients are of particular interest in our study because they are characterized by highly abnormal flow patterns^11,12,13,14. The registry does contain PCMR images, but they are unprocessed and only time-averaged metrics like net forward volume are stored. Thus, investigation of time-varying flow requires the PCMR images to be reprocessed (segmented). However, manual segmentation would be too time-consuming, as well as being prone to inter- and intra-observer variability¹⁵.

Deep learning (DL) offers the possibility of automatically processing PCMR images, but there are two significant challenges specific to our application: (i) images in the FORCE registry are heterogeneous due to differing CMR protocols/hardware at the contributing sites (common to all registries) and highly variable and complex anatomy (specific to Fontan patients), and (ii) there are multiple distinct vessels that must first be identified and then segmented if processing is to be performed without human input. Our solution is a joint Deep Classification + Segmentation model (DCS) that uses a modified UNet3+ architecture (which we have shown works with other images in the FORCE registry¹⁶) to simultaneously classify and segment multiple vessels.

Once large amounts of PCMR images are processed, novel methods for analyzing time-varying flow data (e.g. temporal clustering) can be developed and validated. Clustering attempts to group patients based on similar characteristics in some N-dimensional space and has been used to discover new phenotypes in several cardiovascular diseases^17,18,19. Although conventional temporal clustering (e.g. dynamic time warping) has been used to analyze time-varying signals^20,21,22,23, Deep Temporal Clustering (DTC)²⁴ has been shown to have superior performance.

The overall aim of our study is to combine DL-based classification/segmentation with DL-based temporal clustering to investigate the time-varying blood flow in the Fontan circulation. The specific aims were: (i) Develop and validate a DCS model for classification and segmentation of five different phase-contrast flow planes: aorta (Ao), superior vena cava (SVC), inferior vena cava (IVC), left pulmonary artery (LPA) and right pulmonary artery (RPA), (ii) Integrate the DCS model into an automated pipeline to process the whole of the FORCE registry data (> 4500 datasets) and evaluate segmentation quality, (iii) Perform Deep Temporal Clustering (DTC) on extracted time-varying flow curves from the FORCE registry to identify novel flow-based phenotypes, and (iv) Perform time-to-event analysis to assess the association of these phenotypes with key clinical outcomes, including death/transplantation and liver disease.

Results

Study overview

Figure 1 shows the pipeline consisting of two DL models (full details in the Methods section), one for deep classification/segmentation (DCS) and one for deep temporal clustering (DTC). The DCS model was a multi-class UNet3+ architecture with full-scale skip connections, deep supervision, and a novel tunable input based on the DICOM series description. The model was trained, validated and tested on 260 manually segmented exams from the FORCE registry (training/validation/testing split of 185/25/50 exams; each exam contains five 2D+time PCMR series; data demographics are in Supplementary Table 1). The DCS model was then integrated into a fully automated processing pipeline, with segmentation results rated by an expert. The resulting flow curves were then used as input to the DTC model. The DTC model was based on a temporal autoencoder backbone that generated a latent space representation, which was further optimized for clustering by minimizing the Kullback–Leibler (KL) divergence. Finally, flow-curve-based clusters were associated with clinical outcomes (death/transplantation and liver disease).
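The KL-based refinement of the latent space can be illustrated with a small sketch. It assumes the soft-assignment and target-distribution form commonly used in deep embedded clustering (a Student's t kernel on the distance between latent vectors and cluster centroids). The similarity metric and hyperparameters of the DTC model are not given in this section, so the function names, the Euclidean distance and `alpha` are illustrative assumptions only.

```python
import math

def soft_assign(latents, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of sample i to cluster j (assumed kernel)."""
    q = []
    for z in latents:
        row = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(z, mu))  # squared Euclidean distance
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # normalise so each row sums to 1
    return q

def target_distribution(q):
    """Sharpened targets p[i][j] proportional to q[i][j]^2 / f[j], renormalised per sample."""
    k = len(q[0])
    f = [sum(row[j] for row in q) for j in range(k)]  # soft cluster sizes
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over samples and clusters."""
    return sum(pi * math.log(pi / qi)
               for prow, qrow in zip(p, q)
               for pi, qi in zip(prow, qrow) if pi > 0)

# Toy 2-D latent vectors forming two obvious groups (made-up values).
latents = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]
q = soft_assign(latents, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

Minimizing `loss` with respect to the encoder weights and centroids pushes each `q` row towards its sharper target `p`, which is the sense in which the clustering layer "enforces more confident predictions".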

Deep classification + segmentation model

Classification performance

The DCS model achieved an overall classification accuracy of 97% across all vessels in the 50 unseen test exams (250 PCMR series). Classification accuracy for the Ao, SVC and IVC was 100%, and for LPA and RPA was 94% (Supplementary Fig. 1A).

The added value of the tunable layer that leverages series description information (entered by the technologist at the time of scanning) was tested by training a vanilla model without the tunable layer. This model achieved 94% accuracy across all vessels (compared to 97% with the tunable layer). The robustness of the tunable layer was evaluated by modifying the tunable layer input at inference in 2 ways: (i) removing series descriptions, which resulted in 94% accuracy; and (ii) assigning incorrect series descriptions in all cases, resulting in 90% accuracy.

Full confusion matrix results are shown in Supplementary Fig. 1, suggesting that the tunable layer increases classification accuracy without making the DCS model overly sensitive to erroneous or missing data.

Segmentation performance

Median Dice score across all test vessels (N = 250, Supplementary Fig. 2) was 0.91 (IQR: 0.86–0.93). Vessel-specific Dice scores were: Ao - 0.93 (IQR: 0.91–0.95), IVC - 0.93 (IQR: 0.89–0.95), SVC - 0.89 (IQR: 0.85–0.91), LPA - 0.90 (IQR: 0.87–0.93), and RPA - 0.88 (IQR: 0.84–0.92). The Ao and IVC Dice scores were statistically higher than other vessels (p < 0.007).
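For reference, the Dice score compares a predicted mask with a ground-truth mask as twice the overlap divided by the total number of foreground pixels. The sketch below is a minimal stdlib illustration on toy binary masks; the function name and the convention for two empty masks are assumptions, not details taken from the study.

```python
def dice_score(pred, truth):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    pred_flat = [v for row in pred for v in row]
    truth_flat = [v for row in truth for v in row]
    intersection = sum(p * t for p, t in zip(pred_flat, truth_flat))
    total = sum(pred_flat) + sum(truth_flat)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Toy 2 x 3 masks: the prediction includes one extra pixel.
pred = [[0, 1, 1],
        [0, 1, 1]]
truth = [[0, 1, 1],
         [0, 0, 1]]
score = dice_score(pred, truth)  # 2 * 3 / (4 + 3)
```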

Time-varying flow curves calculated from manual and DL segmentation are shown in Fig. 2 demonstrating good agreement over time for all vessels (further segmentation examples for each vessel in Supplementary Figs. 3–7). Comparison of net forward volumes calculated from these curves is shown in Fig. 3 with clinically acceptable limits of agreement and strong intraclass correlations (ICC > 0.95). However, DL segmentation did produce slightly higher net forward volumes than manual segmentations (2.2–6.7%), which were significant for the RPA, LPA, IVC, and SVC and trended for the Ao (p = 0.06).
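The step from a segmentation to a flow curve and a net forward volume can be sketched as follows. This is an illustration under stated assumptions: through-plane velocity in cm/s, a fixed pixel area in cm² (so flow is in mL/s), evenly spaced frames covering one cardiac cycle, and simple rectangle-rule integration. The study's exact processing (for example background correction or interpolation) is not described in this section, and all names are hypothetical.

```python
def flow_curve(velocity, mask, pixel_area_cm2):
    """Flow per frame (mL/s): masked sum of velocity (cm/s) times pixel area (cm^2)."""
    curve = []
    for vel_frame, mask_frame in zip(velocity, mask):
        total = sum(v * m
                    for vrow, mrow in zip(vel_frame, mask_frame)
                    for v, m in zip(vrow, mrow))
        curve.append(total * pixel_area_cm2)
    return curve

def net_forward_volume(curve, rr_interval_s):
    """Integrate flow over one cardiac cycle (rectangle rule, evenly spaced frames)."""
    dt = rr_interval_s / len(curve)
    return sum(curve) * dt

def percent_difference(dl_value, manual_value):
    """Signed difference of the DL-derived value relative to the manual value, in percent."""
    return 100.0 * (dl_value - manual_value) / manual_value

# Toy example: two frames of a 2 x 2 image, vessel in the top row (made-up values).
velocity = [[[10, 20], [0, 0]],
            [[30, 40], [0, 0]]]
mask = [[[1, 1], [0, 0]],
        [[1, 1], [0, 0]]]
curve = flow_curve(velocity, mask, pixel_area_cm2=0.01)
volume = net_forward_volume(curve, rr_interval_s=1.0)
bias = percent_difference(volume * 1.05, volume)
```

A mask that is slightly too large includes extra low-velocity pixels at the vessel edge, which is one plausible way a small positive bias in net forward volume can arise.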

The correlation between Dice accuracy and percentage accuracy of flow compared to the ground truth is shown in Supplementary Fig. 8. It shows that the better the segmentation, the closer the derived flow is to the ground truth flow.

Pipeline performance

At the time of the study, the FORCE registry contained 4881 CMR exams (3369 patients) with at least one PCMR series. Registry demographics are reported in Supplementary Table 2, with a median age at time of scanning being 15.8 (IQR: 11.3–21.9) years, similar proportions of patients with extracardiac and lateral tunnels, and the most common underlying diagnosis being hypoplastic left heart syndrome (HLHS).

The DCS model was integrated into a cloud-based automated pipeline (8 vCPUs, 120 GB RAM) that processed the whole registry (excluding the 260 exams used for model development). The processing time was 127 ± 80 s per exam, which included: (i) Sorting through all series in an exam (~31 series per exam), (ii) Extracting the PCMR series (~11 series per exam, noting that FORCE registry exams can include additional flow planes beyond the five the model was trained on, such as pulmonary veins or repeat scans), (iii) Running the joint model on all PCMR series (2.9 s per series on the cloud instance, compared with 218 ms when run locally on an NVIDIA RTX A6000 GPU), (iv) Storing flow curves and segmentation masks for the LPA, RPA, Ao, SVC and IVC, and (v) Creating a GIF showing segmentation results for quality assurance. The total time to process the whole registry was ~170 h, with no human interaction required during this time.
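A back-of-envelope check of these timings, using only the figures quoted above, is sketched below. It assumes every exam outside the 260 development exams was processed at the mean rate, which is a simplification given the reported spread of ± 80 s.

```python
exams_in_registry = 4881          # exams with at least one PCMR series
development_exams = 260           # excluded from the pipeline run
mean_seconds_per_exam = 127
pcmr_series_per_exam = 11         # approximate
inference_seconds_per_series = 2.9

exams_processed = exams_in_registry - development_exams
total_hours = exams_processed * mean_seconds_per_exam / 3600
inference_seconds_per_exam = pcmr_series_per_exam * inference_seconds_per_series
inference_share = inference_seconds_per_exam / mean_seconds_per_exam
```

Under these assumptions the estimate comes to roughly 163 h, of the same order as the reported ~170 h, with model inference accounting for about a quarter of the per-exam time.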

Out of the 4881 exams, 2902 exams had PCMR data for all five vessels. In these exams, acceptable classification/segmentation (as assessed by a human expert, see methods) was achieved in 90% of all vessels. Individual rates were: LPA – 91%, RPA – 82%, Ao – 83%, SVC – 96%, IVC – 97% (Supplementary Fig. 9). Segmentation success was significantly higher in the SVC and IVC than in the LPA, RPA, and Ao (p < 0.001), while the LPA showed better segmentation success than the RPA and Ao (p < 0.001).

Failures were mainly due to inaccurate segmentation, with vessel misclassification only occurring in the LPA (~ 50% of failures) and RPA (~ 21% of failures). Failures were associated with poor image quality (p < 0.03 for the RPA and SVC) and certain anatomical/morphological features (Table 1). These included: (i) Similarly sized neo- and native aortas (bilateral aortas), which resulted in 51% successful aortic segmentations compared to 87% in the rest of the population (Supplementary Fig. 10 for example), (ii) Bilateral SVCs with only 90% successful SVC segmentations compared to 97% in single SVCs (Supplementary Fig. 10 shows a bilateral SVC failure), and also significant effects on segmentation success for the LPA (87% vs 92%) and RPA (75% vs 83%), and (iii) Heterotaxy, which affected segmentation success across all vessels except the Ao, particularly in the LPA (71% vs 94% in non-heterotaxy) and RPA (59% vs 85% in non-heterotaxy).

Table 1 Percentage of successful vessel segmentations for each vessel evaluated in the FORCE registry pipeline (N = 2902). Results are stratified by clinical and imaging variables. Values in bold denote statistical significance.

Deep temporal clustering model

Separate deep temporal clustering models were trained for the combined LPA/RPA flow curves (DTC_PA), and the combined SVC/IVC flow curves (DTC_VC). For both models, exams were only included if vessel segmentations were rated acceptable. Furthermore, aortic segmentations had to be rated as acceptable as aortic flow was used to identify and remove exams that were pulse-gated. This resulted in 1943 LPA/RPA flow curves and 2286 SVC/IVC flow curves being included in the clustering models.

Clustering

The optimal number of clusters, k, was determined using a sensitivity analysis with values from 3–8 (Supplementary Table 3). The DTC_PA and DTC_VC models did not converge with more than six and four clusters respectively. For values of k that did produce stable clusters, the optimal k was based on maximizing the temporal silhouette score and demonstrating significant differences for death/transplantation and liver outcomes where priority was given to death/transplantation. Based on these criteria, the optimum number of clusters was k = 5 for DTC_PA (silhouette score = 0.89, significant association with death/transplantation) and k = 4 for DTC_VC (silhouette score = 0.90, significant association with death/transplantation and liver disease).
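As an illustration of the selection criterion, the sketch below computes a standard silhouette score on toy flow curves. It is not the study's implementation: the "temporal silhouette score" used here is not defined in this section, so plain Euclidean distance between curves is an assumption, as are the function names and the toy data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(curves, labels):
    """Mean silhouette over samples; needs >= 2 clusters, singletons contribute 0."""
    clusters = sorted(set(labels))
    scores = []
    for i, (ci, li) in enumerate(zip(curves, labels)):
        same = [euclidean(ci, cj)
                for j, (cj, lj) in enumerate(zip(curves, labels))
                if lj == li and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        other_means = []
        for other in clusters:
            if other == li:
                continue
            d = [euclidean(ci, cj) for cj, lj in zip(curves, labels) if lj == other]
            other_means.append(sum(d) / len(d))
        b = min(other_means)  # mean distance to nearest other cluster
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Toy flow curves (three time points each) forming two obvious groups.
curves = [[1.0, 2.0, 1.0], [1.1, 2.1, 1.0], [5.0, 6.0, 5.0], [5.1, 6.0, 5.2]]
good = silhouette_score(curves, [0, 0, 1, 1])
bad = silhouette_score(curves, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping or misassigned clusters. In a sensitivity analysis, a score of this kind would be computed for each candidate k that produced stable clusters.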

Supplementary Fig. 11 presents t-distributed stochastic neighbor embedding (t-SNE) plots of the clusters before and after joint optimization with the clustering layer that enforces more confident predictions, demonstrating that the DTC method generates more distinct clusters compared to simple k-means on the latent space, which is similar to Principal Component Analysis (PCA)-based methods²⁵.

Pulmonary artery (PA) clusters

Fig. 4A shows the mean flow curves for patients in each of the five PA clusters and Table 2 shows the key differences in patient demographics between the clusters (full demographic information in Supplementary Tables 4 and 5). The PA cluster characteristics are summarized as follows:

1. Normal Distribution, High Flow (PA_Norm-High): This group had normally distributed branch PA flow, with higher flow towards the right pulmonary artery (42% LPA / 58% RPA) and high total pulmonary blood flow (2.91 L/min/m²). This was the youngest group (median age 13.7 years, p < 0.001 vs. all except PA_RPA-Norm) and was 65% male. This group had EDV_i of 101.5 mL/m², ESV_i of 47.5 mL/m² and EF of 53%. The patients in this cluster also had the highest aortic flow (3.5 L/min/m², p < 0.001) and lowest aorto-pulmonary collateral flow (18%, p < 0.04) compared to the other clusters. Fontan types were evenly split between lateral tunnel (46%) and extracardiac conduit (46%) and there was 6% heterotaxy.
2. Normal Distribution, Low Flow (PA_Norm-Low): This group had normal distribution of branch PA flow (45% LPA / 55% RPA), but low overall total pulmonary blood flow (1.42 L/min/m²). This was the oldest group (median age 17.0 years, p < 0.002 vs. PA_Norm-High and PA_RPA-Norm) and was 53% male. This group also had the highest ventricular volumes (EDV_i 105.6 mL/m², p < 0.001 vs. PA_Bal-Norm; ESV_i 54.0 mL/m², p < 0.002 vs. PA_Norm-High and PA_Bal-Norm) and the lowest EF (49%, p < 0.03). In addition, they had the lowest aortic flow (2.5 L/min/m², p < 0.001) and highest aorto-pulmonary collateral flow (25%, p < 0.001 vs. PA_Norm-High and PA_Bal-Norm). Furthermore, this group had the greatest proportion of extracardiac conduits (48%) as well as the highest prevalence of heterotaxy (12%, p < 0.008 vs. PA_Norm-High and PA_RPA-Norm).
3. Diastolic-Dominant, Normal Flow (PA_Dia-Norm): In this group, a greater proportion of branch PA flow occurred during diastole, even though the total amount was normal (1.96 L/min/m²). This group was generally older (median age 16.6 years, p < 0.02 vs. PA_Norm-High and PA_RPA-Norm) and 60% male. This group had EDV_i of 102.8 mL/m², ESV_i of 51.3 mL/m² and EF of 51%. This group also had an aortic flow rate of 2.8 L/min/m² and a collateral flow of 23%. This group had 51% lateral tunnel, 43% extracardiac conduits and 7% heterotaxy.
4. RPA-Dominant, Normal Flow (PA_RPA-Norm): In this group, flow was predominantly to the RPA (70%), with normal total flow (2.34 L/min/m²). This group was younger than average (median age 14.5 years, p < 0.002 vs. all except PA_Norm-High) and 65% were male. This group had EDV_i of 102.5 mL/m², ESV_i of 49.8 mL/m² and EF of 52%. They also had an aortic flow rate of 3.2 L/min/m² and collateral flow of 23%. This group had 53% lateral tunnel and 43% extracardiac conduits and the least heterotaxy (3%, p < 0.001 vs. PA_Norm-Low and PA_Bal-Norm).
5. Balanced Distribution, Normal Flow (PA_Bal-Norm): This group had a balanced branch PA flow distribution where flow toward both pulmonary arteries was near-equal (52% LPA / 48% RPA) with normal total PA flow (2.18 L/min/m²). This group was older than the average (mean age 16.1 years) and 57% male. The ventricular volumes in this group were the lowest (EDV_i 94.5 mL/m², p < 0.002 and ESV_i 44.4 mL/m², p < 0.005), with an EF of 53%. This group had an aortic flow rate of 2.9 L/min/m² and a collateral flow of 19%. This group also had the lowest proportion of lateral tunnels (38%) with a low proportion of extracardiac conduits (44%) and higher proportion of other Fontan types (18%). This group also had 11% heterotaxy.

Table 2 Demographics for the patients in the 5 clusters identified by DTC_PA. Values in bold denote statistical significance.

Vena caval (VC) clusters

Fig. 4B shows the mean flow curves for patients in each of the four VC clusters and Table 3 shows key demographic differences between the clusters (full demographic information in Supplementary Tables 6 and 7). The characteristics of the VC clusters are summarized as follows:

1. IVC-Dominant, Normal Flow (VC_IVC-Norm): This group had a higher proportion of flow from the IVC (72%) with an average amount of total vena caval flow (2.32 L/min/m²). This was the oldest group (median age 20.6 years, 66% adult, p < 0.001), and 59% were male. This group had EDV_i of 98.8 mL/m², ESV_i of 47.5 mL/m², and EF of 51%. This group had an aortic flow of 2.8 L/min/m² and collateral flow of 20%. This group also had the highest prevalence of lateral tunnel procedures (56%), lowest extracardiac conduits (32%) and 10% heterotaxy.
2. SVC-Dominant, High Flow (VC_SVC-High): Greater proportion of flow from the SVC (56%) with high total vena caval flow of 2.83 L/min/m². This was the youngest group (median age 8.2 years, 95% pediatric, p < 0.001), and 63% were male. This group had an EDV_i of 98.6 mL/m² and ESV_i of 46.7 mL/m², as well as the highest EF (53%, p = 0.001 vs. VC_Norm-Norm). This group also had the highest aortic flow rate (3.7 L/min/m², p < 0.001) and one of the highest collateral flows (24%, p < 0.001 vs. VC_IVC-Norm and VC_Norm-High). This group also had the highest prevalence of extracardiac conduit procedures (65%), lowest lateral tunnel (27%) and generally higher prevalence of heterotaxy compared to the other clusters (12%, p = 0.02 vs. VC_Norm-High).
3. Normal Distribution, High Flow (VC_Norm-High): Normal flow distribution (37% SVC / 63% IVC) with high total flow (2.86 L/min/m²). This group was predominantly younger (median age 13.5 years, p < 0.001) and 64% male. This group had EDV_i of 101.4 mL/m², ESV_i of 47.3 mL/m² and EF of 52%. This group had an average aortic flow rate of 3.3 L/min/m² and the lowest collateral flow (18%, p < 0.008). This group had 49% extracardiac conduits and lateral tunnel procedures of 43%, as well as the lowest prevalence of heterotaxy (7%, p < 0.02 vs. VC_SVC-High and VC_Norm-Norm).
4. Normal Distribution, Normal Flow (VC_Norm-Norm): Normal flow distribution (37% SVC / 63% IVC) with normal amount of flow (2.15 L/min/m²). This group was predominantly older (median age 17.2 years) and 57% male. This group had EDV_i of 99.7 mL/m² and ESV_i of 50.2 mL/m² and the lowest EF (51%, p < 0.02 vs. VC_SVC-High and VC_Norm-High). They also had the lowest aortic flow rate (2.7 L/min/m² vs. all except VC_IVC-Norm) and one of the highest collateral flows (24%, p < 0.002 vs. VC_IVC-Norm and VC_Norm-High). This group had a similar amount of lateral tunnel and extracardiac conduits (44% and 43%) as well as the highest prevalence of heterotaxy (14%, p = 0.001 vs. VC_Norm-High).

Table 3 Demographics for the patients in the 4 clusters identified by DTC_VC. Values in bold denote statistical significance.

Cluster transition analysis

Supplementary Fig. 12 illustrates cluster transitions among patients with multiple scans, showing changes from the first to the last scan, as well as transitions between consecutive scans. In the PA model, 323 patients had multiple scans with 424 individual cluster transitions. Of these, 48% of patients remained in the same cluster between their first and last scan, while 52% remained in the same cluster between consecutive scans.

In the VC model, 390 patients had multiple scans, corresponding to 524 separate transitions. While 40% of patients remained in the same VC cluster from their first to last scan, 48% remained in the same cluster between consecutive scans.
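The two stability measures reported above (first-to-last scan and consecutive scans) can be computed from per-patient cluster histories as in the sketch below. The data layout, function name and toy labels are assumptions for illustration. A "transition" is taken here to mean any pair of consecutive scans, whether or not the cluster changed.

```python
def cluster_stability(history):
    """history maps patient id -> list of cluster labels ordered by scan date."""
    multi = {pid: labels for pid, labels in history.items() if len(labels) > 1}
    same_first_last = sum(labels[0] == labels[-1] for labels in multi.values())
    pairs = [(a, b) for labels in multi.values() for a, b in zip(labels, labels[1:])]
    same_consecutive = sum(a == b for a, b in pairs)
    return {
        "patients": len(multi),
        "transitions": len(pairs),
        "first_last_stable": same_first_last / len(multi),
        "consecutive_stable": same_consecutive / len(pairs),
    }

# Toy histories with made-up labels; p3 has a single scan and is ignored.
history = {
    "p1": ["A", "A", "B"],
    "p2": ["C", "C"],
    "p3": ["B"],
    "p4": ["A", "B", "A"],
}
summary = cluster_stability(history)
```

Patient `p4` shows why the two measures can differ: the first and last scans agree even though both consecutive pairs involve a change.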

These findings indicate that patients can develop distinct flow profiles over time. However, as patients with multiple scans represent a relatively small subset of the overall cohort, further analysis is required to draw conclusions regarding their outcome trajectories.

Cluster outcome analysis

Given that patients have multiple scans and their cluster membership can change over time, we performed time-varying time-to-event analysis (time-varying Cox regression²⁶ adjusted for age, sex, indexed aortic flow rate, and ejection fraction (EF), and Simon-Makuch plots, an extension of the Kaplan–Meier method for time-varying groups^27,28). We investigated two outcomes: (i) death or transplantation as a composite outcome (death/transplantation), and (ii) the first diagnosis of Fontan-associated liver disease.
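Time-varying analyses of this kind are usually fed data in a counting-process (start, stop, event) layout, where each scan opens an interval during which the patient belongs to the cluster assigned at that scan. The sketch below builds such rows for one patient. The time origin (years since first scan), the function name and the example values are assumptions, since the exact data preparation is not described in this section.

```python
def counting_process_rows(scans, end_time, event):
    """Build (start, stop, cluster, event) intervals for one patient.

    scans: list of (time_in_years, cluster) tuples sorted by time.
    end_time: time of the outcome or of censoring.
    event: True if the outcome occurred at end_time.
    """
    rows = []
    for idx, (start, cluster) in enumerate(scans):
        last = idx == len(scans) - 1
        stop = end_time if last else scans[idx + 1][0]
        # The event flag is set only on the interval in which the outcome occurred.
        rows.append((start, stop, cluster, int(event and last)))
    return rows

# Hypothetical patient: changes cluster at 3.5 years, outcome at 6 years.
rows = counting_process_rows(
    scans=[(0.0, "VC_SVC-High"), (3.5, "VC_Norm-Norm")],
    end_time=6.0,
    event=True,
)
```

Rows in this form, with covariates such as age, sex, indexed aortic flow and EF added per interval, are the kind of input accepted by time-varying Cox implementations such as `CoxTimeVaryingFitter` in the lifelines package. The software used in the study is not stated in this section.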

The Simon-Makuch plot for death/transplantation in the PA clusters is shown in Fig. 4A, with hazard ratios for pairwise cluster comparisons presented in Supplementary Fig. 13. The PA_Bal-Norm group had a significantly lower risk of death/transplantation compared to the PA_Norm-Low group (HR = 0.33, p = 0.007). There was no significant difference in liver disease observed among these clusters.

Figure 4B shows Simon-Makuch plots for death/transplantation and liver disease for the vena caval clusters, with pairwise hazard ratios in Supplementary Fig. 14. The patients in the VC_SVC-High cluster had a significantly lower risk of death/transplantation compared to the patients in the VC_IVC-Norm (HR = 0.34, p = 0.023) and VC_Norm-Norm clusters (HR = 0.31, p = 0.010). The VC_SVC-High cluster also had a significantly lower risk of liver disease compared to the VC_IVC-Norm (HR = 0.58, p = 0.027) and the VC_Norm-Norm clusters (HR = 0.56, p = 0.018).

Comparison to PCA-based clustering

We used PCA-based k-means clustering to generate clusters (Supplementary Fig. 15) with the same number of clusters as the DTC method (Fig. 4) for direct comparison. The figures show that centroid flow patterns from both methods were similar, suggesting that the DTC method for generating a latent space representation is robust and the flow patterns observed are replicable. Furthermore, the DTC method generates more distinct clusters, as shown by higher silhouette scores (PA: 0.89, VC: 0.90) compared to the PCA method (PA: 0.12, VC: 0.20).

More importantly, the two methods differ in their ability to create clusters with prognostic association. The DTC method finds clusters that have significant differences in death/transplantation for both PA and VC vessels (p = 0.007, p < 0.023, respectively). Conversely, the PCA-based clusters only just reach significance for PAs (p = 0.049) and are non-significant for VCs. Nevertheless, for liver disease, both the DTC and PCA methods find similar significant differences between VC clusters.

Discussion

This is the first study, to our knowledge, to develop a deep learning model for the simultaneous classification and segmentation of multiple vessels (imaged using PCMR) and flow-based clustering of Fontan patients. The key findings were: (i) The joint deep classification and segmentation model (DCS) demonstrated high accuracy in classifying and segmenting major blood vessels, (ii) Incorporating the DCS model into an automated pipeline allowed rapid processing of the complete FORCE registry (> 4500 exams), providing robust flow measurement for about 90% of vessels, (iii) The deep temporal clustering (DTC) approach identified distinct flow dynamics between patient clusters, and (iv) These clusters showed significant associations with major clinical outcomes, including death or transplantation and liver disease. Our unified approach helps unlock the full potential of phase-contrast MR (PCMR) images stored in the FORCE registry by identifying novel physiologically distinct groups with prognostic significance. Importantly, our approach could be easily adapted to other CMR registries that contain PCMR images (e.g. The Indicator Cohort or PVDOMICS^29,30).

Deep learning PCMR segmentation

Manual core-lab segmentation of the PCMR images in the FORCE registry (> 4500 exams) would take ~2000 person-hours (~50 working weeks of non-stop segmentation). This is not tractable, which is why we developed a DL-based processing pipeline that processed the entire registry without human interaction in ~170 h (even when run on a cloud CPU). Importantly, we demonstrated high Dice scores compared to manual segmentation, as well as strong agreement between flow curves derived from manual and DL segmentations. This is despite the challenging and complex anatomy of Fontan patients³¹. We believe this is primarily due to the underlying UNet3+ architecture³², which allows better multi-scale feature fusion and provides a richer understanding of both fine details and global context. Although we have previously used a UNet3+ to successfully segment short-axis images in the FORCE registry¹⁶, extensive modifications were made to optimize the architecture for flow segmentation. One of the most important modifications was the use of a single network to perform both classification and segmentation. Previous studies that have aimed to segment multiple vessels used separate classifiers and vessel-specific networks, which adds complexity to training and inference, while limiting information sharing between tasks³³. In contrast, our simpler joint approach leverages the UNet3+ architecture and also introduces a novel tunable input based on series descriptions. This input provides contextual information added by technologists during CMR scans without hard-coding, thereby improving classification accuracy while remaining robust to missing (~7% of all exams) or incorrect descriptions. The network was also modified to accept 2D+time data by using 3D convolutions. This enabled feature learning between consecutive frames and enforced temporal consistency. Finally, the imaginary component of the complex signal was also included as an additional input to our model, which leveraged velocity information to improve accuracy without having to manage the high levels of noise present in phase images³³.
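One way to see why the imaginary component is less noisy than the raw phase image is sketched below. It assumes the complex signal is reconstructed as magnitude × exp(i·phase), with phase = π·velocity/VENC. How the model input was actually derived is not specified in this section, so the function name and numerical values are illustrative.

```python
import math

def imaginary_component(magnitude, velocity, venc):
    """Imaginary part of M * exp(i * phi), assuming phi = pi * velocity / venc."""
    phase = math.pi * velocity / venc
    return magnitude * math.sin(phase)

# Hypothetical pixel values with VENC = 150 cm/s.
vessel_pixel = imaginary_component(magnitude=100.0, velocity=75.0, venc=150.0)
static_pixel = imaginary_component(magnitude=100.0, velocity=0.0, venc=150.0)
air_pixel = imaginary_component(magnitude=2.0, velocity=75.0, venc=150.0)
```

In low-signal regions such as air, the phase is essentially random, but the small magnitude suppresses it, whereas flowing blood with high magnitude retains its velocity-dependent signal.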

Our models demonstrated 90% success across the FORCE registry, which includes data from various scanners, clinical sites and complex anatomies, providing very large amounts of data for flow-based clustering. Nevertheless, there were certain situations in which the joint classification + segmentation model performed less well. This included specific anatomies; for instance, heterotaxy seemed to confuse the model because the LPA can look like the normal RPA plane and vice versa.

Although the aorta had the highest Dice score in the test set (0.93), it had one of the lowest segmentation success rates in the full pipeline (83%). This discrepancy cannot be explained solely by differences in the proportion of bilateral aorta cases, as these were similar in the pipeline cohort (12%) and the test set (10%). This suggests that the pipeline cohort likely included greater variability or combinations of variability that are difficult to quantify or were not explicitly analyzed. For example, another main difference is that the test set was drawn from hospitals used for model training, whereas the pipeline involved inference on data from multiple previously unseen centers with potentially different underlying imaging characteristics.

If these cases were underrepresented in the training data, future performance could improve by including more examples of uncommon anatomies or unusual imaging protocols.

Temporal clustering analysis

Deriving clinically useful biomarkers from time-varying data requires temporal dimensionality reduction. This is often achieved using simple methods (such as reporting peak or mean values), and for PCMR data, usually only net forward volumes are reported^34,35. However, these simplistic approaches do not extract all the potentially useful physiological information from flow data. Principal component analysis offers a more sophisticated approach by decomposing temporal flow curves into weighted principal components (PCs) that account for the majority of variance in the population. This method has already been applied to flow in the Fontan circulation, where poorer outcomes were associated with certain flow patterns (e.g. diastolic dominant PA flow)^36,37,38. Furthermore, it is possible to combine PCA with k-means clustering to perform temporal clustering^39,40,41,42. In our study, we demonstrated that although PCA-based clustering produced similar clusters to DTC, the DTC method produced more distinct clusters (as shown by higher silhouette scores) with more significant associations with outcomes. The poorer performance of PCA-based clustering is potentially related to the PCA being constrained by the linear summation of orthogonal PCs, which may limit the expressivity of the method. Therefore, we opted to use DTC, a method that uses a temporal autoencoder to generate a latent space that can capture more complex temporal patterns compared to using PC decomposition. This method also employs a joint optimization step to maximize the certainty of cluster assignment. Importantly, our DCS pipeline provided >20× more data than previous studies on blood flow in Fontan patients^36,37,38, yielding a dataset large enough for deep learning to reliably learn complex, high-dimensional temporal patterns without overfitting.

Deep temporal clustering analysis of the FORCE dataset revealed distinct flow patterns that in some cases were associated with clinically relevant outcomes. Patients in the PA_Norm-Low group (normal distribution, low flow) had an increased risk of death/transplantation. This might be unsurprising, but our analysis was corrected for indexed aortic flow rate and EF, suggesting that this was not simply the result of poor perfusion and cardiac function. Interestingly, this group did have the highest systemic to pulmonary collateral flow, and one possibility is that they have higher pulmonary vascular resistance, which would explain their higher mortality. It is also possible that low pulsatility and diastolic-dominance in these patients reflect an additional adverse physiology that contributes to higher mortality. A more surprising finding is that the PA_Bal-Norm (balanced distribution, normal flow) group had the best outcome. This group had slightly greater flow to the left lung, which could not be fully explained by heterotaxy (although heterotaxy was slightly overrepresented). An intriguing explanation for these findings is that Fontan anatomy with equal or slightly greater left lung flow is associated with better hemodynamics and thus, improved outcome. This could be investigated by leveraging DL-based anatomical segmentation and flow field estimation⁴³ to investigate power loss and TCPC resistance⁴⁴.

We also found that patients in the VC_Norm-Norm group (normal distribution and normal flow) have an increased risk of death/transplantation. Conversely, dominant SVC flow was associated with a lower risk of death/transplantation and liver disease. These patients were younger, which is expected as the SVC/IVC flow ratio reduces with age. Nevertheless, the association with outcome remained even after correction for age, suggesting that other mechanisms resulted in better prognosis in these patients. One possibility is that increased splanchnic circulation is associated with worse mortality. This idea is supported by the fact that the cluster with dominant IVC flow also had higher mortality, and a potential mechanism is increased liver disease in both these groups. Another possible explanation is that higher SVC flow leads to less flow collision between the vena caval flow streams, resulting in greater hemodynamic efficiency. Further investigation of flow distribution is required to better understand the reason for these associations, which is achievable using the comprehensive data available in the FORCE registry.

We have demonstrated the potential of deep temporal clustering for the analysis of time-varying signals at scale. This method could easily be applied to multiple time-varying signals in the cardiovascular imaging space, including flow in other pathologies, myocardial strain, and ventricular/atrial volumetric curves. However, use in large registries does require a joint segmentation-clustering approach.

Limitations

The main limitation of our study was that segmentation was not successful in all cases, meaning that human review for all outputs of the DCS model was necessary. However, this can be done in a matter of seconds per exam, which is feasible even when reviewing thousands of exams.

Another issue is that our tunable input relies on a curated data dictionary to extract tokens from series descriptions. This dictionary was developed using descriptions from the current FORCE registry, so it may not perform well on new datasets with different naming conventions. This could be remedied by using a large language model (LLM) to represent the series descriptions as an embedding that is fed into the MLP of the tunable layer in place of the current one-hot encoding, with end-to-end training of both the UNet3+ and the LLM. However, we demonstrated that our model does have high classification accuracy, and an LLM may be an unnecessary complication.

Another limitation is that we cannot assess the bias that segmentation failures introduce into the clustering results, as only successful segmentations were used during training and inference of the clustering model. Furthermore, we measured the number of transitions in cluster membership that patients had between scans, and these findings indicate that patients can develop distinct flow profiles over time. However, as patients with multiple scans represent a relatively small subset of the overall cohort, further analysis is required to draw conclusions regarding their outcome trajectories.
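As an illustration of how cluster transitions between scans could be counted, the sketch below tallies changes in cluster membership per patient across chronologically ordered scans. The record layout, the patient identifiers, the dates and the function name `count_transitions` are hypothetical; only the cluster labels are taken from this study, and this is not necessarily how the analysis was carried out.

```python
from collections import defaultdict

def count_transitions(scans):
    """Count cluster changes between consecutive scans for each patient.

    `scans` is an iterable of (patient_id, scan_date, cluster) tuples, with
    scan_date given as an ISO 8601 string so that string order is date order.
    Patients with fewer than two scans are omitted.
    """
    by_patient = defaultdict(list)
    for patient_id, scan_date, cluster in scans:
        by_patient[patient_id].append((scan_date, cluster))

    transitions = {}
    for patient_id, records in by_patient.items():
        if len(records) < 2:
            continue
        ordered = [cluster for _, cluster in sorted(records, key=lambda r: r[0])]
        # A transition is any pair of consecutive scans with different labels.
        transitions[patient_id] = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return transitions

# Hypothetical records for illustration only.
example_scans = [
    ("P1", "2018-06-12", "PA_Norm-Low"),
    ("P1", "2015-03-01", "PA_Bal-Norm"),
    ("P1", "2021-01-20", "PA_Norm-Low"),
    ("P2", "2016-09-05", "PA_Bal-Norm"),
    ("P2", "2019-11-30", "PA_Bal-Norm"),
    ("P3", "2017-02-14", "PA_Norm-Low"),
]
transition_counts = count_transitions(example_scans)
print(transition_counts)
```

Here the hypothetical patient P1 changes cluster once, P2 never changes, and P3 is excluded because a single scan cannot show a transition.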

One known issue with PCMR is background phase error, particularly with older exams. We did not perform background phase correction because software methods that fit parabolic planes to the image are highly sensitive to the amount/segmentation of static tissue. In clinical use, these corrections are checked and discarded if they are clearly incorrect. However, our pipeline is designed not to require significant human input, and background phase correction would be difficult under these conditions. While the lack of background phase correction can result in flow offsets, our DTC method has been trained to cluster based on the shape of the flow curve as well as absolute flow values. Thus, we believe that clustering is partially robust to phase offsets. Furthermore, it is recognized that background phase errors are more prevalent with older scanners. If clustering were heavily influenced by background phase, one might expect a difference in scan date between clusters. However, Supplementary Tables 4–7 show no significant differences in scan date for any PA clusters. It should be noted that VC_SVC-High cluster scans were acquired significantly earlier than those of the other groups. However, patients in this cluster were also significantly younger, which may explain why the scans were older, rather than bias from background phase effects.
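The distinction between curve shape and absolute flow can be illustrated with a small sketch. It assumes, as a simplification, that an uncorrected background phase error appears as a constant additive offset on the flow curve. The flow values, the 5 mL/s offset and the function names are hypothetical, and the code is not part of the DTC model.

```python
import math

def net_forward_volume(flow_ml_s, rr_s):
    """Rectangle-rule integral of a uniformly sampled flow curve over one RR interval."""
    dt = rr_s / len(flow_ml_s)
    return sum(flow_ml_s) * dt

def mean_centred(flow_ml_s):
    """Curve shape with its mean removed; a constant offset cancels out."""
    mean = sum(flow_ml_s) / len(flow_ml_s)
    return [f - mean for f in flow_ml_s]

# Hypothetical flow curve (mL/s) over an RR interval of 0.8 s.
flow = [10.0, 40.0, 80.0, 60.0, 30.0, 15.0, 10.0, 5.0]
rr = 0.8
offset = 5.0  # assumed constant flow offset from background phase (mL/s)
flow_offset = [f + offset for f in flow]

volume_true = net_forward_volume(flow, rr)
volume_biased = net_forward_volume(flow_offset, rr)
shape_unchanged = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(mean_centred(flow), mean_centred(flow_offset))
)

print("volume without offset (mL):", volume_true)
print("volume with offset (mL):", volume_biased)
print("shape unchanged:", shape_unchanged)
```

Under this assumption the net volume shifts by the offset multiplied by the RR interval, whereas the mean-centred shape is identical, which is consistent with features driven by curve shape being less sensitive to such offsets than absolute flow values.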

Finally, we were unable to assess the effect of different types of acquisition (e.g. free breathing, breath-hold, and real-time) and known artefacts (e.g. due to stents) on segmentation accuracy and cluster assignment. This is because these data were not available in the FORCE registry, but the high segmentation success rate does suggest good overall generalizability. Furthermore, although the models were validated on heterogeneous data from multiple scanners, protocols, and sites, they have not been tested on patients beyond the single-ventricle population and may perform less reliably on more typical cardiac anatomy. Nevertheless, the methods developed in this study could be applied to other datasets.

Conclusion

In this study, we introduce a unified framework for flow segmentation and clustering. Our deep classification segmentation (DCS) model processes multiple blood vessels using PCMR images from different sites, scanners, and types of single-ventricle physiology. We validated the DCS model on a large dataset (2902 exams, 14,510 phase-contrast series), achieving an average success rate of 90% across all five vessels. By leveraging the flow curve data, we used DTC to identify clusters and analyze curve characteristics, linking them to clinical outcomes. We believe our method can provide new insights into Fontan physiology and potentially provide new methods of treatment. In addition, our methodology could easily be modified for other large cardiovascular datasets that contain time-varying signals (e.g. PCMR, strain data, or even ECG data). In a clinical environment, we envisage an automated system, applied directly after scanning, which segments all five vessels, to provide a full flow profile, and classifies the patients into higher or lower risk, without human input.

Methods

This was a multicenter study approved by the Institutional Review Boards or research ethics committees at each participating institution or via a reliance Institutional Review Board agreement with Boston Children’s Hospital. The study proposal and this manuscript were approved by the FORCE Data Governance and Publications Committee. The study involved no direct participation of the patients. The complete list of FORCE Investigator co-authors and affiliations is enumerated in the Authors’ Contribution section.