Introduction

Echocardiography provides a comprehensive assessment of cardiac anatomy and physiology and is one of the most widely available imaging modalities in cardiology given its non-radiative, safe, and low-cost nature1,2,3. Cardiac chamber quantification, especially of the left ventricle (LV), is one of the fundamental tasks of echocardiography studies in current practice4, and the results can directly affect clinical decisions such as the management of heart failure, valvular heart disease, and chemotherapy-induced cardiomyopathy5,6,7,8. Although LV chamber quantification is performed by trained sonographers or physicians, it is subject to intra- and inter-observer variability, which can reach 7–13% across studies9,10,11,12. Many artificial intelligence (AI) applications have been developed to address this essential task and to minimize such variation3,13,14,15,16,17,18.

While established AI systems are potential solutions, training segmentation models requires large amounts of data with corresponding expert-defined annotations, making them challenging and costly to implement19. In recent years, transformers, a neural network architecture whose self-attention mechanism enables models to efficiently capture complex dependencies and relationships, have revolutionized fields from natural language processing to computer vision20,21,22. Vision transformers (ViT)20 are transformers designed specifically for images; they have shown impressive performance operating on simple image patches and have become a popular choice for building foundation models that can be fine-tuned for various applications23. Building on this success, Meta AI introduced the “Segment Anything Model” (SAM), a large vision foundation model trained on diverse datasets that can adapt to specific tasks. The model achieves “zero-shot” segmentation, i.e., it segments user-specified objects across different data sources without requiring any task-specific training data24.

However, while the zero-shot performance of SAM on natural image datasets has been promising24, its performance on more complex images, such as medical images, has not been fully investigated. One study tested SAM on several medical image datasets, including ultrasound (though not echocardiography specifically), and found that zero-shot performance was suboptimal25. Recently, MedSAM was introduced as a universal tool for medical image segmentation; however, ultrasound and echocardiography were underrepresented in its training set despite more than 1 million medical images being used26. This prompted a core research question: how can we effectively develop a segmentation model for a specific medical image segmentation task? Is it necessary to build a universal medical image segmentation model, or can a generic foundation model be fine-tuned with a relatively small, task-specific dataset to achieve comparable performance? In this context, we aimed to study a novel strategy of fine-tuning a foundation segmentation model (SAM) with representative echocardiography examples and to compare its effectiveness with MedSAM and a state-of-the-art segmentation model trained on a domain-specific dataset (EchoNet)3.

We hypothesized that fine-tuning the foundation segmentation model with this strategy could achieve similar or superior performance on LV segmentation compared to MedSAM or EchoNet. We further designed a human annotation study on a subset of images to investigate whether the model can enhance the efficiency of LV segmentation. Within the echocardiography domain, images obtained from different institutions and modalities (with different image qualities) were used to evaluate the generalization capability of SAM on external evaluation sets.

Results

SAM’s zero-shot performance on echocardiography and POCUS

We first tested the zero-shot performance of SAM. Overall, zero-shot performance on the EchoNet-Dynamic test set had a mean Dice similarity coefficient (DSC) of 0.863 ± 0.053. In terms of individual cardiac phases, performance on end-diastolic frames was better than on end-systolic frames (mean DSC 0.878 ± 0.040 vs. 0.849 ± 0.060) (Table 1). Supplementary Table 1 summarizes the zero-shot performance on the EchoNet-Dynamic training and validation sets. On the Mayo Clinic dataset, the mean DSC was 0.882 ± 0.036 and 0.861 ± 0.043 for TTE and POCUS, respectively. On the CAMUS dataset, we observed a mean DSC of 0.866 ± 0.039 and 0.852 ± 0.048 on A4C and A2C views, respectively (Table 1). When compared to the ground truth left ventricular ejection fraction (LVEF), the calculated LVEF had a mean absolute error (MAE) of 11.67%, 6.28%, and 6.38% on the EchoNet, Mayo-TTE, and Mayo-POCUS data, respectively (Supplementary Table 2). When evaluated by Hausdorff Distance (HD) and Average Surface Distance (ASD), MedSAM was superior across most of the datasets (Table 2).

Table 1 Comparison of Model Performance Based on Dice Similarity Coefficient (DSC) and Intersection over Union (IoU)
Table 2 Comparison of Model Performance Based on Hausdorff Distance (HD) and Average Surface Distance (ASD)

SAM’s fine-tuned performance on echocardiography and POCUS

Compared to zero-shot inference, fine-tuning generally improved the performance of SAM, with a mean DSC of 0.911 ± 0.045 (SAM112) on the EchoNet-Dynamic test set (Table 1 and Supplementary Table 1). A similar improvement was observed on the Mayo TTE data, with an overall mean DSC of 0.902 ± 0.032. In contrast, no significant improvement was observed on the POCUS data (DSC 0.857 ± 0.047), although performance was numerically higher when evaluated against the second observer's ground truth (DSC 0.876 ± 0.038), as summarized in Table 3. The EchoNet model had a significant performance drop, especially on the Mayo-POCUS and CAMUS A2C datasets (Table 1). When compared to the ground truth LVEF, the calculated LVEF had an MAE of 7.52%, 5.47%, and 6.70% on the EchoNet test, Mayo-TTE, and Mayo-POCUS data, respectively (Supplementary Table 2).

Table 3 Zero-shot vs. Fine-tuned SAM112 performance on TTE and POCUS (against the second observer)

Qualitatively, poor-performing cases often featured suboptimal images, such as weak LV endocardial borders or off-axis views (Fig. 1, Panels c and f), highlighting the importance of image quality. On a representative POCUS case, fine-tuned SAM112 predicted masks that were more consistent with LV geometry (Fig. 2).

Fig. 1: Qualitative performance of fine-tuned SAM on representative cases against ground truth on the EchoNet-dynamic test dataset.
figure 1

From top to bottom: 97.5th to 2.5th percentile of DSC. Panels a, b, and c are end-diastolic frames, and Panels d, e, and f are end-systolic frames. Many of the poor-performing cases had suboptimal image quality, such as weak LV endocardial borders or off-axis views (Panels c and f), suggesting the importance of good input image quality for model performance. Additionally, end-diastolic frames usually have better delineation of borders than end-systolic frames, which is consistent with the model performance (end-diastolic slightly better than end-systolic).

Fig. 2: Zero-shot and fine-tuned SAM performance on a representative POCUS case.
figure 2

Panel a: end-diastolic frame; Panel b: end-systolic frame. From left to right: the ground truth, zero-shot, and fine-tuned masks, with bounding boxes (green) and masks (blue) overlaid on the original POCUS image. Fine-tuned masks were visually more consistent with the anticipated left ventricular geometry. Note that POCUS images generally had worse quality than transthoracic echocardiography images.

Comparison of SAM, MedSAM, EchoNet, and U-Net models

Comparisons were made between fine-tuned SAM112 (fine-tuned using 112 × 112 images), MedSAM, EchoNet, and U-Net. Under the same setting, the inference throughput was about 7.3 images/s for SAM, 5.0 images/s for MedSAM, 153.3 images/s for EchoNet, and 26 images/s for U-Net. Note that EchoNet takes video input, so its throughput was averaged over the number of frames in each video.

When evaluated by DSC and IoU, EchoNet demonstrated a similar level of performance to fine-tuned SAM112 on the EchoNet test set (SAM112 vs. EchoNet: DSC 0.911 ± 0.045 vs. 0.915 ± 0.047, p < 0.0001) and the Mayo TTE dataset (DSC 0.902 ± 0.032 vs. 0.893 ± 0.090, p < 0.0001). However, EchoNet significantly underperformed on the Mayo POCUS and CAMUS datasets, with a performance drop ranging from 5–25% (Table 1). MedSAM showed about 2–5% worse performance than fine-tuned SAM112 across all datasets (all p < 0.0001), except for the CAMUS dataset, which was part of MedSAM's training data. U-Net performed worse than fine-tuned SAM112 across all datasets (all p < 0.0001) (Table 1). When evaluated by HD and ASD, MedSAM had the best performance across most of the datasets, while fine-tuned SAM112 had the best performance on the EchoNet test set; these were followed by zero-shot SAM, EchoNet, and then U-Net (Supplementary Table 3).

Further fine-tuned SAM1024 (fine-tuned using 1024 × 1024 images, as MedSAM was) achieved essentially the same performance as fine-tuned MedSAM, with DSCs of 0.935 and 0.936, respectively. Both fine-tuned models also demonstrated similar performance across the different datasets, except for the CAMUS dataset, which is part of MedSAM's training set (Table 4). We also observed that, after this fine-tuning, both SAM1024 and MedSAM performed slightly worse on the Mayo TTE and POCUS datasets than fine-tuned SAM112 and base MedSAM.

Table 4 Comparison of fine-tuned SAM and MedSAM

Human Annotation Study

To test SAM's potential to enhance clinical workflows, we designed a human annotation study using SAM112. We observed that AI assistance significantly improved efficiency, decreasing annotation time by 50% (p < 0.0001; Fig. 3a); this held true for both expert-level and medical student-level annotators (Fig. 3b, c). The AI-assisted workflow maintained the quality of left ventricle segmentation, as measured by DSC (Fig. 3d–f). Qualitative assessments indicated an improvement in performance for inexperienced annotators, which may not be completely captured by the DSC scores (Supplementary Fig. 1).

Fig. 3: SAM-assisted workflow decreased the time for annotation while maintaining the quality of echocardiography image segmentation.
figure 3

a With SAM's assistance, the average annotation time significantly decreased by 50%. b Among experienced annotators, the workflow decreased annotation time by 39.2%. c It also reduced annotation time by 66.0% for an inexperienced annotator. d–f No significant difference was observed in segmentation quality when measured by the Dice similarity coefficient. ***p < 0.0001.

Discussion

The major contributions of this work are: (1) presenting a data-efficient and cost-effective strategy for training an LV segmentation model for echocardiography images based on SAM and leveraging its generalization capability for POCUS images, and (2) showing the potential of optimizing clinical workflows through a human-in-the-loop approach, in which SAM significantly improved efficiency while maintaining annotation quality.

Echocardiography, like other ultrasound modalities, is generally considered a more challenging imaging modality because of its operator dependency and low signal-to-noise ratio3,27,28,29. Additionally, objects on ultrasound/echocardiography images often have weak borders or are obscured by artifacts, which poses specific challenges for echocardiography segmentation tasks3,25. While the image/frame-based approach of SAM does not consider consecutive inter-frame changes and spatio-temporal features of the echocardiogram, fine-tuned SAM (both SAM112 and SAM1024) achieved superior frame-level segmentation performance on all datasets compared to the video-based EchoNet model3. Importantly, the EchoNet-Dynamic dataset contains cases with suboptimal image quality and imperfect human labels (Fig. 1)3,30, which can limit model performance.

In terms of generalization capability, the fine-tuned SAM112 model demonstrated robust performance on unseen TTE (including the A2C view in CAMUS) and POCUS data, with only about a 1% and 5% drop in performance, respectively. In contrast, there was a significant drop in EchoNet's and U-Net's performance on the POCUS dataset (Table 1). Similarly, when evaluated by border-sensitive metrics such as HD and ASD, the foundation models (SAM and MedSAM) generated more consistent borders than EchoNet and U-Net (Supplementary Table 3). This again demonstrates the advantage of leveraging the generalization capabilities of foundation models23 in building AI solutions for the rapidly growing use of POCUS in cardiac imaging31. While evaluation on a larger POCUS dataset is required to better assess inter-observer variation, we demonstrated a strategy to fine-tune foundation models using readily available and relatively high-quality TTE data, knowing that POCUS images usually come with larger variations in operator skill level, image quality, and scanning modality31,32,33,34.

In a head-to-head comparison, fine-tuned SAM112 outperformed base MedSAM across almost all datasets when evaluated by DSC and IoU (Table 1), likely because cardiac ultrasound images were underrepresented in MedSAM's training set of over 1 million images. When fine-tuned under the same setting, SAM1024 and fine-tuned MedSAM achieved the same level of performance and generalization capability on this specialized LV segmentation task (Table 4). Our study supports a data-efficient foundation model training strategy that uses representative examples for specific tasks in medical subspecialties, providing a generalizable solution that overcomes the challenge of large-scale, high-quality dataset collection in this field19,35. Importantly, starting from SAM also avoided the computational resources consumed by MedSAM's training process, which required 20 A100 (80 GB) GPUs26.

SAM's interactive capability allows its potential integration into research or clinical workflows in echocardiography labs to create high-quality segmentation masks at scale23,25. We demonstrated the potential of a SAM-based, human-in-the-loop workflow for LV segmentation, significantly reducing annotation time while maintaining segmentation quality across users with different experience levels (Fig. 3; Supplementary Fig. 1). Compared with fully automatic models3, an interactive approach ensures that segmentations can be directly prompted or modified to the interpreter's satisfaction24,25. This process also has the potential to build the trust of human users and facilitate integration36. While future studies are needed to assess its real impact on clinical practice, the interactive functionality of SAM could be especially helpful in challenging cases or when fully automatic models fail to predict segmentations accurately25.

This study is limited by its retrospective nature and could be subject to selection bias. Since SAM is an image-based model, its performance was evaluated on pre-selected end-diastolic and end-systolic frames rather than the beat-to-beat assessment of the video as proposed by the EchoNet-Dynamic model. However, we still demonstrated superior frame-level performance with fine-tuned SAM. While adapters are another potential direction to use SAM for echocardiography segmentation37, we have not specifically explored this approach in the current paper. Since this manuscript focuses on exploring effective strategies for adapting a generic foundation model like SAM, we did not modify the architecture, and as such, the study does not include substantial technical innovations. The balance of performance and generalization capability was not comprehensively explored. Additionally, how SAM can be integrated into the current echocardiography lab workflow and its real-world effects will need to be validated in a prospective setting, which is beyond the scope of the study.

In conclusion, fine-tuning SAM on a relatively small yet representative echocardiography dataset resulted in superior performance compared to the state-of-the-art echocardiography segmentation model, EchoNet, as well as base MedSAM, and achieved the same performance as fine-tuned MedSAM. We demonstrated a data-efficient and cost-effective approach for handling specialized, complex echocardiography images, as well as SAM's potential to enhance current echocardiography workflows by improving efficiency through a human-in-the-loop design.

Methods

Population and Data Curation

EchoNet-Dynamic

The EchoNet-Dynamic dataset is publicly available (https://echonet.github.io/dynamic/); details of the dataset have been described previously3. In brief, the dataset contains 10,030 apical-4-chamber (A4C) TTE videos acquired at Stanford Health Care between 2016 and 2018. The raw videos were preprocessed to remove patient identifiers and downsampled by cubic interpolation into standardized 112 × 112-pixel videos. Videos were randomly split into 7465, 1277, and 1288 patients for the training, validation, and test sets, respectively3. The mean left ventricular ejection fraction of the EchoNet-Dynamic dataset was 55.8 ± 12.4%, 55.8 ± 12.3%, and 55.5 ± 12.2% for the training, validation, and test sets, respectively3. Patient characteristics were not available for this public dataset. Cases without ground truth labels were excluded from this analysis (five from the training set, one from the test set) (Supplementary Table 3).

Mayo Clinic

A dataset from the Mayo Clinic (Rochester, MN) including 99 randomly selected patients (58 TTE studies from 2017–2018 and 41 point-of-care ultrasound (POCUS) studies from 2022) was used as an external validation dataset. The A4C videos were reviewed by a clinical sonographer (RW) and a cardiologist (JMF), who selected and segmented the end-diastolic and end-systolic frames; 52 of the 99 cases (33 TTE and 19 POCUS) were traced by both annotators. The tracings were done manually on commercially available software (MD.ai, Inc., NY). In the Mayo Clinic dataset (n = 99), the mean age was 47.5 ± 17.8 years, 58 (58.6%) patients were male, and coronary artery disease, hypertension, and diabetes were present in 20 (20.2%), 42 (42.4%), and 15 (15.2%) patients, respectively. The dataset contains A4C views from 58 TTE and 41 POCUS cases; the LVEF was 61.7 ± 7.5% for TTE and 63.2 ± 11.9% for POCUS cases.

CAMUS

The Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset contains 500 cases from the University Hospital of St Etienne (France) with detailed tracings on both A4C and A2C views38. The CAMUS cohort (n = 500) had a mean age of 65.1 ± 14.4 years, 66% male, and a mean LVEF of 44.4 ± 11.9%.

Segment Anything Model (SAM), MedSAM, and U-Net

The Segment Anything Model (SAM) is an image segmentation foundation model trained on a dataset of 11 million images and 1.1 billion masks24. It can generate object masks from input prompts such as points or boxes. SAM's promptable design enables zero-shot transfer to new image distributions and tasks, achieving competitive or superior performance compared with fully supervised methods24. In brief, the model comprises a VisionEncoder, PromptEncoder, MaskDecoder, and Neck module, which collectively process image embeddings, point embeddings, and contextualized masks to predict accurate segmentation masks24. MedSAM is a fine-tuned version of the base SAM, trained on over 1.5 million medical image-mask pairs spanning 10 modalities and 30 cancer types, with trained weights publicly available26. The U-Net architecture was selected as a representative of commonly used segmentation deep learning models; in our study, the U-Net39 leveraged a ResNet-18 backbone40.
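
As an illustration of this setup, the minimal sketch below shows one way to instantiate the compared models from publicly released packages (the segment-anything package for SAM-style checkpoints and segmentation_models_pytorch for the U-Net baseline); checkpoint paths are placeholders, and this is not the exact code used in this study.

```python
# Minimal sketch of instantiating the compared models; checkpoint paths are
# placeholders and this is not the exact code used in this study.
from segment_anything import sam_model_registry, SamPredictor
import segmentation_models_pytorch as smp

# SAM ViT-base with its public checkpoint; MedSAM weights load the same way
# because MedSAM shares the SAM ViT-base architecture.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)  # promptable (box/point) inference interface

# U-Net baseline with a ResNet-18 encoder and a single foreground class (LV).
unet = smp.Unet(encoder_name="resnet18", in_channels=3, classes=1)
```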

Data Preprocessing

Each EchoNet-Dynamic video (112 × 112 pixels, in AVI format) was exported into individual frames without further resizing. The end-diastolic and end-systolic frames of each case, which corresponded to the human expert-traced, frame-level ground truth segmentation coordinates in the dataset, were extracted. Ground truth segmentation masks were generated from these coordinates and saved at the same size (112 × 112 pixels). The labeled frames of Mayo TTE and POCUS images were exported from the MD.ai platform, underwent similar preprocessing to remove patient identifiers, and were then horizontally flipped and resized to 112 × 112 pixels3. The CAMUS images were rotated by 270 degrees and resized to be consistent with the EchoNet format. When input to the SAM model, all raw images were resized with the built-in function “ResizeLongestSide”24.
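
The sketch below illustrates this preprocessing: expert-traced LV coordinates are rasterized into a binary mask and the frame is resized with SAM's built-in transform. The file name and contour coordinate format are assumptions for illustration only, not our released code.

```python
# Illustrative preprocessing sketch (not the exact code used in this study);
# the frame path and the contour coordinate format are assumptions.
import cv2
import numpy as np
from segment_anything.utils.transforms import ResizeLongestSide

def mask_from_trace(contour_xy: np.ndarray, size: int = 112) -> np.ndarray:
    """Rasterize expert-traced LV contour coordinates into a binary mask."""
    mask = np.zeros((size, size), dtype=np.uint8)
    cv2.fillPoly(mask, [contour_xy.astype(np.int32)], 1)
    return mask

# SAM resizes the longest image side to its 1024-pixel input resolution.
transform = ResizeLongestSide(1024)
frame = cv2.cvtColor(cv2.imread("frame_ED.png"), cv2.COLOR_BGR2RGB)
frame_resized = transform.apply_image(frame)
```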

Zero-shot Evaluation

The original SAM ViT-base model (model type “vit_b”, checkpoint “sam_vit_b_01ec64.pth”) without modification or fine-tuning was used to segment the datasets, referred to as zero-shot learning in the machine learning literature24,26. The larger versions of SAM (ViT-Large and ViT-Huge) were not used because they did not offer significant performance improvement despite higher computational demands24. Bounding box coordinates of each left ventricle segmentation tracing were generated from the ground truth segmentations and used as the prompt for SAM24. Of note, while SAM supports both bounding box and point prompts, our study focused exclusively on the bounding box prompt due to the incomplete and relatively weak borders in echocardiography images. Additionally, since the left atrium and LV are connected structures, point prompts may result in masks that incorrectly link the two chambers.
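
The sketch below illustrates this zero-shot evaluation step: a tight bounding box is derived from the ground-truth mask and passed to SAM as the prompt. It reuses the predictor and the preprocessed frame and mask from the sketches above; variable names are illustrative, and this is not our exact evaluation code.

```python
# Zero-shot inference sketch: box prompt derived from the ground-truth LV mask.
# Assumes `predictor`, `frame`, and `gt_mask` from the earlier sketches.
import numpy as np

def box_from_mask(mask: np.ndarray) -> np.ndarray:
    """Tight (x0, y0, x1, y1) bounding box around a binary ground-truth mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

predictor.set_image(frame)                      # RGB uint8 frame
box_prompt = box_from_mask(gt_mask)             # prompt from the expert tracing
masks, scores, _ = predictor.predict(box=box_prompt, multimask_output=False)
pred_mask = masks[0]                            # binary LV mask at original size
```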

Model Fine-tuning

The SAM ViT-base model was fine-tuned following the procedure described by Ma et al.41. We used the training set cases (n = 7460) of EchoNet-Dynamic as our customized dataset without data augmentation. The same bounding box prompt was used, as described above. We used a customized loss function, defined as the unweighted sum of the Dice loss and the cross-entropy loss41,42. The Adam optimizer43 was used (weight decay = 0) with an initial learning rate of 2e-5 (decreased to 3e-6 over 27 epochs) and a batch size of 8. The model was fine-tuned on a node of the Stanford AI Lab cluster with a 24 GB NVIDIA RTX TITAN GPU. U-Net training followed the same procedure except that no bounding box prompts were used; the best-performing U-Net model was reached at epoch 23. No data augmentation was performed.
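
A hedged sketch of this training objective and optimizer configuration is shown below. It follows the general recipe of Ma et al.41 (unweighted Dice plus cross-entropy loss, Adam, initial learning rate 2e-5, weight decay 0) and reuses the sam model from the loading sketch above; the choice of which SAM components to update is an assumption, and this is not the exact training code.

```python
# Sketch of the fine-tuning objective: unweighted sum of Dice loss and binary
# cross-entropy loss, optimized with Adam (weight decay 0, initial lr 2e-5).
# Which SAM components are updated (here, the mask decoder) is an assumption.
import torch
import torch.nn as nn
import monai

dice_loss = monai.losses.DiceLoss(sigmoid=True, squared_pred=True, reduction="mean")
bce_loss = nn.BCEWithLogitsLoss(reduction="mean")

def segmentation_loss(pred_logits: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Unweighted sum of Dice and cross-entropy losses on predicted mask logits."""
    return dice_loss(pred_logits, gt) + bce_loss(pred_logits, gt.float())

optimizer = torch.optim.Adam(sam.mask_decoder.parameters(), lr=2e-5, weight_decay=0)
```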

To ensure a fair comparison between SAM and MedSAM, we used MedSAM's fine-tuning setup with 1024 × 1024 images, and both models were trained for 70 epochs. The two versions of fine-tuned SAM models are referred to as SAM112 and SAM1024, respectively. The training time was about 25 h for SAM112 and 40 h for SAM1024; fine-tuning MedSAM required about 50 h, and U-Net required 16 min.

Validation and Generalization

The fine-tuned SAM was tested on the test set of the EchoNet-Dynamic dataset. To test the generalization capacity of SAM, we used external validation samples from the CAMUS dataset (both A2C and A4C views)38 and the Mayo Clinic dataset, which includes A4C views from both TTE and POCUS devices.

Human Annotation Study

We conducted a human annotation study to evaluate the gains in segmentation efficiency and quality when using the fine-tuned SAM. Three annotators (two experts and one medical student) independently annotated 100 echocardiography images randomly selected from the EchoNet-Dynamic test set; each annotator labeled the 100 images both with and without SAM assistance. In the SAM-assisted workflow, annotators were instructed to draw minimal bounding boxes to prompt the model. In the regular workflow, annotators manually traced the endocardial contour, as per standard clinical practice. Annotators were allowed to refine their annotations until they were satisfied. We recorded and compared the time required to complete the annotations and the annotation quality (DSC against ground truth) under both workflows.

Statistical Model Performance Evaluation

Model segmentation performance was evaluated directly against human ground truth labels using the Intersection over Union (IoU), Dice similarity coefficient (DSC), Hausdorff Distance (HD), and Average Surface Distance (ASD)44. IoU and DSC assess the overlap of segmented areas, while HD and ASD evaluate the alignment of segmentation borders. The formulas are provided in Supplementary Note 1. Depending on the normality of the distributions, two-tailed paired t-tests or Wilcoxon tests were conducted to assess the statistical significance of differences in accuracy between models (zero-shot vs. fine-tuned SAM, EchoNet vs. fine-tuned SAM, and MedSAM vs. fine-tuned SAM), as well as of the annotation time and quality metrics in the human annotation study (AI-assisted vs. non-AI-assisted), with p < 0.05 considered significant.
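
For reference, the overlap metrics follow their standard definitions; a minimal implementation for binary masks is sketched below (this is not the evaluation code used in the study).

```python
# Standard-definition reference implementations of DSC and IoU for binary masks.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|)."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU = |A ∩ B| / |A ∪ B|."""
    intersection = np.logical_and(pred, gt).sum()
    return intersection / np.logical_or(pred, gt).sum()
```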

Ethical review and approval

All studies were performed in accordance with the Declaration of Helsinki. The EchoNet-Dynamic dataset is publicly available and de-identified; its use was approved by the Stanford University Institutional Review Board and underwent data privacy review through a standardized workflow by the Center for Artificial Intelligence in Medicine and Imaging (AIMI) and the University Privacy Office. The CAMUS dataset is also publicly available, released under the approval of the University Hospital of St Etienne (France) ethics committee after full anonymization. The use of the Mayo Clinic dataset was approved by the institutional review board (protocol #22-010944); only patients who provided informed consent for minimal-risk retrospective studies were included, and thus the requirement for additional informed consent was waived.