Introduction

Cardiovascular diseases represent the most significant global health challenge: they affect millions of people worldwide, claim millions of lives annually, and place a heavy strain on healthcare systems and families alike. In this critical context, the accurate and timely assessment of cardiac function is not merely a diagnostic step but the cornerstone of modern cardiology. Central to this assessment is the Left Ventricle (LV), the heart’s primary pumping chamber, responsible for circulating blood throughout the body. Two fundamental metrics provide insight into its health: the Ejection Fraction (EF), which quantifies pumping efficiency, and LV wall thickness, an indicator of structural integrity. Both parameters are routinely evaluated using echocardiography, an imaging modality that is widely accessible due to its non-invasive nature, cost-effectiveness, and real-time capabilities1,2.

However, despite its utility, the manual interpretation of echocardiograms presents several significant challenges for clinicians. The process is inherently laborious, requiring meticulous tracing of ventricular borders, which is both time-consuming and subject to the individual physician’s judgment. This subjectivity can lead to inter-observer variability, where different experts may arrive at different measurements for the same patient. Furthermore, the dynamic nature of the heart in real-time imaging sequences makes the task susceptible to human error, a problem that is often compounded by image artifacts such as acoustic noise, which can obscure anatomical boundaries and complicate accurate assessment3,4.

In this regard, automated Artificial Intelligence (AI) based systems can offer a powerful solution to overcome these obstacles5,6,7,8,9. By automatically performing left ventricle segmentation and identifying key anatomical points, these tools can provide measurements that are not only rapid but also objective and reproducible1,10,11. This automation has the potential to streamline clinical workflows, reduce the burden on echocardiologists, and provide consistent, reliable data that is crucial for both routine patient care and large-scale clinical research. Despite these advances, current AI approaches often provide a fragmented analysis of cardiac health12,13,14,15. Most models are trained for isolated tasks, such as segmentation for functional assessment or keypoint detection for structural measurement. This “siloed” approach fails to leverage the intrinsic physiological link between cardiac function and structure, overlooking the potential for a model to learn more robust, synergistic features.

To address these gaps, we introduce a novel mixed-type Multi-Task Learning (MTL) framework. We specifically distinguish our approach from simpler MTL paradigms (e.g., combining multiple classification tasks) by designing a model that co-learns two fundamentally different types of tasks concurrently: (1) a dense, pixel-level segmentation task and (2) a sparse, heatmap-based keypoint localization task. This approach encourages the model to learn a shared, synergistic representation of cardiac anatomy, as the knowledge gained from segmentation can inform keypoint localization and vice versa. Our model architecture employs a powerful EfficientNet backbone to act as a shared feature encoder, which then feeds into two specialized, parallel heads: (1) a U-Net-style decoder that reconstructs a precise segmentation mask for EF calculation, and (2) a convolutional head that predicts spatial heatmaps for localizing the anatomical keypoints needed for wall thickness measurement. By integrating these tasks, our framework provides a holistic and efficient solution for a comprehensive LV assessment from a single analysis. We validate this framework on three large-scale public datasets: EchoNet-Dynamic, CAMUS, and EchoNet-LVH. Our results demonstrate that the model achieves state-of-the-art performance for both segmentation and keypoint localization, confirming the effectiveness of the unified MTL strategy.

While state-of-the-art frameworks have demonstrated the ability to assess both EF and wall thickness16,17,18, they often rely on a black-box approach or complex multi-layer segmentation. This study proposes and validates a novel, lightweight MTL framework that specifically unifies LV segmentation with heatmap-based anatomical keypoint detection (for wall thickness) in a single, interpretable model. The main contribution of this paper can be summarized as follows:

  • A novel MTL framework is proposed that simultaneously performs LV segmentation (for EF) and keypoint detection (for wall thickness) within a single, unified model, which provides a more holistic and clinically relevant evaluation than prior MTL models that focus on tasks like segmentation plus EF regression or multi-boundary segmentation.

  • An effective and reproducible preprocessing pipeline is presented for extracting analysis-ready keyframes (end-systole and end-diastole) from clinical echocardiography videos, simplifying the data preparation workflow.

  • The proposed model is validated on three large-scale public datasets, demonstrating state-of-the-art performance and confirming the viability of the MTL approach for clinical use.

The remainder of this paper is organized as follows. Section 2 reviews the related work in automated echocardiography analysis, covering both segmentation and keypoint detection techniques. Section 3 details our proposed Multi-Task Learning framework, including the model architecture, the datasets used, and the implementation specifics. In Sect. 4, we present the experimental results, providing a quantitative and qualitative discussion of the model’s performance and comparing it against established benchmarks. Finally, Sect. 5 concludes the paper with a summary of our findings and a discussion of future research directions.

Related work

The application of deep learning to echocardiography analysis has seen significant progress, with researchers primarily focusing on two key tasks: LV segmentation and anatomical keypoint detection. Accurate segmentation of the LV cavity is fundamental for quantitative analysis of cardiac function, particularly for the calculation of Left Ventricular Ejection Fraction (LVEF)1. Recent years have witnessed remarkable advancements in automated echocardiographic analysis for detecting cardiac abnormalities using deep learning techniques7,19,20,21,22. Cardiac abnormalities encompass a wide spectrum of conditions, including regional wall motion abnormalities, congenital heart defects, and valvular disorders. Accurate detection and quantification of these abnormalities are essential for proper cardiac function assessment, risk stratification, and clinical decision-making. For instance, regional wall motion abnormalities often indicate ischemic heart disease or myocardial infarction, while congenital defects such as septal anomalies can significantly impact hemodynamics. Valvular disorders, including stenosis or regurgitation, require precise evaluation to guide surgical or interventional treatment planning.

Several studies have demonstrated the effectiveness of deep learning approaches in identifying these conditions. For example, Sanjeevi et al.23 employed 3D convolutional networks on the HMC-QU dataset to detect wall motion abnormalities, achieving an F1-score of 0.77. This study highlighted the importance of capturing spatial-temporal dynamics across multiple echocardiographic frames to accurately assess ventricular motion. Nova et al.24 leveraged a U-Net architecture for segmenting cardiac chambers, specifically targeting the identification of septal defects. Their model reported a Dice score of 97%, illustrating the capacity of convolutional encoder-decoder networks to delineate fine anatomical structures in ultrasound images. Meanwhile, Vafaeezadeh et al.25 applied Inception-ResNet-v2 for mitral valve motion classification, achieving an accuracy of 69%, which emphasizes both the potential and the challenges of using deep convolutional features for dynamic valve assessment. Collectively, these works demonstrate that tailored network architectures, whether 2D, 3D, or hybrid, can capture specific structural or functional aspects of the heart.

A substantial body of research has also concentrated on assessing global cardiac function, primarily through segmentation of the left ventricle (LV) and estimation of ejection fraction (EF), one of the most critical clinical parameters for evaluating cardiac performance. Belfilali et al.26 proposed a transfer learning approach using VGG19 with pre-trained ImageNet weights, achieving a Dice similarity of 93% and a Hausdorff distance of 4 mm on the CAMUS dataset. This approach illustrates the value of leveraging pre-trained networks to overcome the limited availability of annotated echocardiographic data. Similarly, Sfakianakis et al.27 utilized a U-Net-based model to achieve Dice scores of 0.96 and 0.955 for end-diastolic and end-systolic frames, respectively, while obtaining Pearson correlation coefficients exceeding 0.97 for EF estimation. Their study emphasizes the importance of precise temporal frame selection and accurate endocardial boundary delineation for functional assessment.

Beyond conventional CNN-based approaches, Li et al.28 introduced EchoEFNet, a multi-task architecture combining a ResNet50 encoder with atrous spatial pyramid pooling to simultaneously perform segmentation and EF estimation. The model achieved Dice scores of 0.965 on CMUEcho and 0.934 on CAMUS, with Jaccard indices above 0.87, highlighting the advantage of multi-scale context aggregation in capturing both local and global anatomical features. Other advanced architectures, including DPSN (Li et al.29), ResDUnet (Amer et al.30), and PLANet (Liu et al.21), integrated multi-scale feature pyramids, residual connections, and attention mechanisms, demonstrating Dice scores consistently above 0.94 and strong EF correlations up to 0.88. These studies collectively show a shift toward networks that exploit hierarchical feature representations and emphasize clinically relevant regions through attention mechanisms.

In addition, transformer-based models such as IFT-Net (Zhao et al.31) and hybrid spatio-temporal frameworks like MCLAS have further advanced the state-of-the-art. These models leverage self-attention mechanisms to capture long-range spatial dependencies and temporal correlations across cardiac cycles, resulting in Dice scores exceeding 0.95 and robust EF estimation, particularly on heterogeneous datasets with varying image quality and acquisition protocols. The incorporation of temporal dynamics and anatomical priors into these models addresses limitations of purely spatial CNNs, providing more consistent and physiologically plausible predictions.

In particular, Wei et al.32 proposed a semi-supervised multi-task framework called MCLAS, which introduces a collaborative learning mechanism to enhance feature extraction. The model utilizes a 3D encoder–decoder architecture to capture spatio-temporal semantic features. Evaluation on the CAMUS dataset demonstrated superior performance, with Dice similarity scores of 0.931 for the systolic phase and 0.950 for the diastolic phase, highlighting the effectiveness of the proposed approach.

Collectively, these studies highlight a clear evolution in automated echocardiographic analysis—from traditional CNN-based segmentation toward advanced multi-task, attention-driven, and spatio-temporal models that integrate both structural and functional information. Despite these notable advancements, several challenges remain. Poor image quality due to patient movement, acoustic shadowing, or limited probe positioning continues to hinder reliable feature extraction. Inter-patient anatomical variability, including differences in heart size, shape, and pathology presentation, complicates model generalization. Moreover, the scarcity of large, high-quality annotated datasets limits the potential for fully supervised learning, often necessitating the use of transfer learning, data augmentation, or semi-supervised approaches. Addressing these challenges will be critical for translating research advances into clinical practice, enabling automated echocardiography systems to provide reliable, real-time support for cardiologists worldwide.

The other critical task for LV assessment is the detection of specific anatomical keypoints, which is necessary for measuring wall thickness and diagnosing conditions such as LV hypertrophy14,26. Left ventricular hypertrophy (LVH) is a structural alteration of the myocardium often associated with hypertension, hypertrophic cardiomyopathy, and cardiac amyloidosis, and, if untreated, it can progress to severe cardiac dysfunction. Accurate assessment of LV wall thickness is essential for diagnosing LVH, and recent studies have increasingly adopted deep learning for automated keypoint detection and measurement. One such framework combined a modified DeepLabv3 architecture for keypoint detection with a 3D ResNet for video-based classification on the EchoNet-LVH and SHC datasets. It automatically localized critical anatomical landmarks to measure septal and posterior wall thickness, achieving mean absolute errors of 1.2 mm (interventricular septum) and 1.4 mm (posterior wall), while predicting underlying etiologies such as amyloidosis (AUC 0.83), hypertrophic cardiomyopathy (AUC 0.98), and aortic stenosis (AUC 0.89). Similarly, Li et al.12 proposed an automated system for LV thickness measurement across six echocardiographic views, leveraging ResNeXt-101 for view classification and parallel models for thickness estimation. Their multi-view late-fusion strategy improved performance, achieving accuracies of 0.93 for hypertrophic cardiomyopathy, 0.90 for amyloidosis, and 0.92 for hypertension-related LVH. Yu et al.33 combined a ResNet-based classifier with UNet++ for segmentation-driven measurements on a cohort of 724 patients, reporting an overall LVH detection accuracy of 92.4%, with cause-specific accuracies exceeding 88%. Moreover, Duan et al.13 developed MENN, a CNN-based approach for keypoint detection and LV segmentation in both B-mode and motion-mode echocardiography on murine datasets, achieving a Dice score of 0.956 across four cardiac regions. Other frameworks, such as Chang et al.’s34 CNN-LSTM pipeline and Madani et al.’s35 VGG16- and GAN-based semi-supervised models, further demonstrate the trend toward hybrid architectures that integrate spatial and temporal cues for robust keypoint localization and wall thickness prediction. Collectively, these works highlight the evolution from manual measurements toward fully automated LVH assessment using landmark detection and segmentation-assisted pipelines, although challenges such as view dependency, anatomical variability, and limited labeled data remain.

The move toward multi-task learning has been explored by several other groups, though with different task combinations. For example, both EchoEFNet28 and EFNet36 proposed frameworks that pair LV segmentation with the direct regression of the EF value. Other approaches have used MTL as a regularizer, such as Monkam et al.37, who simultaneously segmented the endocardium and epicardium, or MUF-Net38, which paired segmentation with an auxiliary boundary-detection task. These methods validate the power of MTL. However, they do not address the unification of segmentation with the specific anatomical keypoint detection required for structural wall thickness assessment.

Recently, highly comprehensive, state-of-the-art frameworks have been developed that also provide combined LV assessment. PanEcho18, for instance, is a powerful multi-task, multi-view model that uses direct regression to predict a wide array of parameters, including LVEF and wall thickness. Other systems, such as EchoNet-Measurements17 and the commercial Us2.ai16, are primarily segmentation-based, typically deriving wall thickness by segmenting both the endocardium and epicardium.

Our work differs from the existing approaches by proposing a lightweight, interpretable model that specifically unifies U-Net based segmentation (for function) with heatmap-based keypoint detection (for structure). This offers a more direct and computationally efficient alternative to full epicardial segmentation for wall thickness assessment. We hypothesize that by training a single model to perform both segmentation and keypoint detection simultaneously, the model can learn a more robust and holistic representation of cardiac anatomy, leading to a more accurate and efficient clinical assessment.

Materials and methods

Dataset description

Our proposed model is trained and validated using three large, publicly available echocardiography datasets. Each dataset provides unique views and annotations, allowing for a comprehensive evaluation of our multi-task framework.

EchoNet-LVH

The EchoNet-LVH dataset is an open-source collection designed for the automated measurement of left ventricular wall thickness to assess for hypertrophy11. It contains 12,000 echocardiography videos in the parasternal long-axis (PLAX) view, acquired from patients as part of their routine clinical care at the Stanford cardiology clinic between 2008 and 2020. For each video, expert cardiologists have provided annotations for the end-systole and end-diastole frames, including measurements for the Interventricular Septum (IVS), Left Ventricular Internal Dimension (LVID), and Left Ventricular Posterior Wall (LVPW). The data includes the specific frame numbers for each measurement and the corresponding keypoint coordinates (X1, Y1, X2, Y2) needed for our keypoint detection task. The details of the EchoNet-LVH dataset are summarized in Table 1, and Fig. 1 shows a sample image of this dataset.

Fig. 1
figure 1

Sample image from the EchoNet-LVH dataset (PLAX view). This image is shown at its original cropped resolution prior to model resizing.

Table 1 Summary of the EchoNet-LVH dataset.

EchoNet-Dynamic

The EchoNet-Dynamic dataset consists of 10,030 echocardiography videos in the apical-4-chamber (A4C) view10. These videos were acquired at Stanford University Hospital between 2016 and 2018 using a variety of ultrasound machines. The dataset is labeled by human experts with values for Ejection Fraction (EF), End-Systolic Volume (ESV), and End-Diastolic Volume (EDV). A separate annotation file specifies the exact frame numbers corresponding to end-systole and end-diastole for each video. Crucially for our segmentation task, this file also provides the ground truth coordinates (X1, Y1, X2, Y2) that trace the border of the left ventricle in these keyframes. All videos have been cropped to a 112 × 112 pixel resolution to remove information outside the imaging sector. The details of the EchoNet-Dynamic dataset are summarized in Table 2, and Fig. 2 shows a sample image of this dataset.

Fig. 2
figure 2

Sample images from the EchoNet-Dynamic dataset (A4C view). These images are shown at the 112 × 112 pixel resolution used for model input.

Table 2 Summary of the EchoNet-Dynamic dataset.

CAMUS

The CAMUS dataset contains echocardiography examinations from 500 patients, acquired at the University Hospital of St Etienne, France39. It is intentionally heterogeneous, including images of varying quality and multiple pathological cases. The data is split into a training set of 450 patients and a testing set of 50 patients. For each patient, data from both end-systole and end-diastole phases are provided with ground truth segmentation masks for the LV cavity. While the dataset includes multiple views, we utilized only the apical-4-chamber (A4C) images in this study to maintain consistency with EchoNet-Dynamic. The original ground truth masks were multi-class and were converted to binary masks during our preprocessing stage. The details of the CAMUS dataset are summarized in Table 3, and Fig. 3 shows a sample image of this dataset.

Fig. 3
figure 3

Sample image of the CAMUS dataset.

Table 3 Summary of the CAMUS dataset.

Preprocessing details

A multi-stage preprocessing pipeline was implemented to convert the raw source data into a clean, analysis-ready format suitable for training our deep learning models. These steps were crucial for standardizing the inputs and handling inconsistencies across the different datasets.

For the video-based datasets, EchoNet-Dynamic and EchoNet-LVH, a key preprocessing step was to extract only the two most clinically relevant static frames: end-systole (ES) and end-diastole (ED). Processing the entire video for each case would be computationally expensive and inefficient, as the ground truth annotations only apply to these specific moments in the cardiac cycle. This extraction was performed using the explicit frame numbers provided in the annotation files accompanying each dataset. This approach provided two significant benefits. Firstly, it simplified the problem from a complex video analysis task to a more manageable image-based one, reducing model complexity and training time. Secondly, it effectively doubled the number of training samples available from each video, which helps improve model training and generalization.
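As a concrete illustration, the keyframe-selection step can be sketched as a small helper that pulls only the annotated frames out of a decoded clip. This is a minimal sketch under our own naming assumptions; the actual video decoding (e.g., with OpenCV) is omitted:

```python
import numpy as np

def extract_keyframes(video, annotations):
    """Select only the annotated end-diastole (ED) and end-systole (ES)
    frames from a decoded clip.

    video:       (T, H, W) array of grayscale frames.
    annotations: mapping of phase name to frame index, e.g. {"ED": 12, "ES": 47},
                 taken from the dataset's annotation file (field names here
                 are illustrative, not the dataset's actual column names).
    Indices falling outside the clip are skipped, mirroring the later
    curation of incomplete samples.
    """
    keyframes = {}
    for phase, idx in annotations.items():
        if 0 <= idx < len(video):
            keyframes[phase] = video[idx]
    return keyframes
```

Applied per video, this yields up to two training images per clip, matching the doubling of samples described above.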

For the segmentation task, it was essential to have consistent, binary ground truth masks. In this regard, for the EchoNet-Dynamic dataset, the ground truth was provided as a series of coordinates tracing the LV border. We used these points to generate a binary segmentation mask for each keyframe. This process involved creating a polygon from the coordinates, using cubic interpolation to ensure a smooth and anatomically plausible contour. For the CAMUS dataset, the dataset already provided segmentation masks; however, they were not in a binary format. To prevent model errors and ensure compatibility, all masks from this dataset were converted to a standard binary representation (pixel values of 0 or 255).
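The coordinate-to-mask conversion can be illustrated with a minimal polygon rasterizer. This is a simplified stand-in for our pipeline: it uses a plain even-odd fill and omits the cubic-interpolation smoothing step, and the function name is our own:

```python
import numpy as np

def polygon_to_mask(points, height, width):
    """Rasterize a closed LV contour, given as a list of (x, y) vertices,
    into a binary mask (0 / 255) using the even-odd crossing rule."""
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs.ravel() + 0.5  # sample at pixel centers
    py = ys.ravel() + 0.5
    inside = np.zeros(px.shape, dtype=bool)
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        with np.errstate(divide="ignore", invalid="ignore"):
            # x-coordinate where this edge crosses each pixel's scanline
            # (horizontal edges yield inf/nan and never register a crossing)
            xint = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        crosses = ((y1 <= py) != (y2 <= py)) & (px < xint)
        inside ^= crosses
    return inside.reshape(height, width).astype(np.uint8) * 255
```

The 0/255 output matches the binary representation used for the CAMUS masks after conversion.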

A final curation step was performed to handle missing or incomplete data. For the EchoNet-Dynamic dataset, 6 videos had insufficient measurement data and were therefore excluded from the study, resulting in a final set of 2,048 curated images and masks. The EchoNet-LVH dataset (originally 12,000 videos) required more extensive cleaning. Our process involved extracting the two relevant keyframes (ED and ES) from each video, which would ideally yield approximately 24,000 images. However, there were numerous instances of video files missing corresponding measurement data in the annotations file, and vice versa. After excluding all such incomplete samples, a final dataset of 17,435 curated keyframes was prepared for the keypoint detection task. The summary of the preprocessing steps is provided in Table 4.

Table 4 Summary of preprocessing steps.

Proposed multi-task learning framework

To address the limitations of previous approaches that analyze cardiac function and structure in isolation, we propose an MTL40 framework. The core principle of this framework is to train a single, unified deep learning model to simultaneously perform two related tasks from a single echocardiogram: (1) Left Ventricle Segmentation, to delineate the LV cavity, enabling the calculation of volume and EF; and (2) Anatomical Keypoint Detection, to localize the specific points required for measuring the interventricular septum (IVS) and posterior wall (LVPW) thickness. As depicted in Fig. 4, our proposed architecture is based on a shared encoder-decoder design: a shared encoder backbone bifurcates into two separate heads, one for segmentation and one for keypoint detection, which are described in the following.

Fig. 4
figure 4

Schematic structure of the proposed MTL framework.

Shared encoder backbone

The backbone of our proposed model is a pre-trained EfficientNet-B041, which serves as a powerful and efficient shared encoder. We selected the B0 variant as it provides an optimal balance between computational efficiency (low parameter count and FLOPs) and high feature extraction performance, making it ideal for our unified MTL framework. EfficientNet’s key innovation lies in its compound scaling method. Unlike other approaches that arbitrarily scale network dimensions such as depth (number of layers), width (number of channels), or image resolution, EfficientNet scales all three dimensions uniformly using a single, fixed compound coefficient. This balanced scaling ensures that as the network gets larger, its accuracy and efficiency grow predictably and optimally. The architecture itself is built upon Mobile Inverted Bottleneck Blocks (MBConv), which are computationally efficient building blocks enhanced with integrated Squeeze-and-Excitation (SE) blocks. The SE block performs channel-wise feature recalibration, allowing the network to learn the relative importance of different feature channels and selectively emphasize the most informative ones. The structure of EfficientNet is illustrated in Fig. 5.

This backbone takes a 112 × 112 pixel echocardiogram image as input and processes it through these convolutional blocks to generate a rich, hierarchical set of feature maps. In the early layers, the model captures low-level features such as edges and textures. As the data progresses through the deeper layers of the network, these are combined to form more complex representations corresponding to recognizable anatomical shapes and structures of the left ventricle.

By leveraging a backbone pre-trained on the large-scale ImageNet dataset, the model begins with a robust understanding of general visual patterns, which significantly accelerates the training process and improves its ability to learn the specific, fine-grained features relevant to echocardiographic images. These extracted features, capturing both low-level and high-level anatomical context, provide a shared foundation for both the segmentation and keypoint detection heads, enabling the synergistic learning central to our MTL approach.
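A skeletal PyTorch sketch of this shared-encoder, two-head layout is shown below. This is an illustrative stand-in only: a tiny convolutional stack replaces the real pre-trained EfficientNet-B0, skip connections are omitted, and all class names and channel sizes are hypothetical:

```python
import torch
import torch.nn as nn

class MultiTaskLVNet(nn.Module):
    """Toy sketch of the MTL layout: one shared encoder, two task heads."""

    def __init__(self, n_keypoints=4):
        super().__init__()
        # Stand-in for the pre-trained EfficientNet-B0 encoder: 112x112 -> 28x28
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # U-Net-style decoder head producing a 1-channel LV probability mask
        self.seg_head = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),
            nn.Conv2d(8, 1, 1), nn.Sigmoid(),
        )
        # Lightweight convolutional head predicting one heatmap per keypoint
        self.kp_head = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),
            nn.Conv2d(8, n_keypoints, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.encoder(x)  # shared representation for both tasks
        return self.seg_head(feats), self.kp_head(feats)
```

Both heads read the same feature maps, so gradients from either task update the shared encoder, which is the mechanism behind the synergistic learning described above.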

Fig. 5
figure 5

Structure of the EfficientNet block.

Segmentation head

The first task-specific branch of our MTL framework is the segmentation head, which is responsible for delineating the LV endocardial border to produce a precise segmentation mask. This head is designed as a U-Net-style decoder42, taking the high-level feature maps from the shared EfficientNet encoder and progressively upsampling them to reconstruct a full-resolution prediction.

Each stage of the decoder consists of an Upsampling or Transpose Convolution layer, which doubles the spatial dimensions of the feature map, followed by a series of Conv2D layers with Batch Normalization and ReLU activation functions to refine the features.

The key to this architecture’s high performance in segmentation tasks is its use of skip connections. These connections feed the feature maps from the corresponding stages of the encoder and concatenate them with the upsampled feature maps in the decoder. This fusion of deep, semantic information (the “what”) with shallow, high-resolution spatial details (the “where”) is crucial for recovering the precise, sharp boundaries of the LV cavity, which can be lost during the downsampling process.

The final layer of this head is a 1 × 1 convolutional layer with a sigmoid activation function. This produces a single-channel, grayscale image with the same dimensions as the input (112 × 112), where each pixel value ranges from 0 to 1, representing the probability of that pixel belonging to the LV cavity. The structure of the segmentation head is shown in Fig. 6.

Fig. 6
figure 6

Structure of segmentation head.

Keypoint detection head

The second task-specific branch of our MTL framework is the keypoint detection head, which is responsible for localizing the specific anatomical landmarks required for wall thickness measurements. This head is designed as a lightweight convolutional network43 that transforms the shared features from the encoder into spatial probability maps.

Instead of directly regressing the (x, y) coordinates for each point, which can be unstable and less robust to slight anatomical variations, this head is trained to predict a stack of 2D heatmaps. Each heatmap in the stack corresponds to a single anatomical keypoint. It functions as a probability distribution over the image, where the intensity of each pixel represents the likelihood of the keypoint being present at that location.

The architecture consists of a few Conv2D blocks, each with Batch Normalization and ReLU activation, which refine the features for the localization task. The final layer is a 1 × 1 convolutional layer with N_k filters (where N_k is the number of keypoints to be detected) and a sigmoid activation function. This produces a stack of N_k heatmaps, each with the same dimensions as the input image. At inference time, no thresholding is applied: the final coordinate for each keypoint is determined by applying a standard argmax operation to find the (x, y) pixel location of the maximum intensity value within its corresponding heatmap. This heatmap-based approach provides a richer supervisory signal during training and is generally more robust than direct coordinate regression. The structure of the keypoint detection head is depicted in Fig. 7.
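The inference-time decoding step can be sketched as follows (a minimal example; the function name is ours):

```python
import numpy as np

def heatmaps_to_coords(heatmaps):
    """Decode a stack of (N_k, H, W) heatmaps into N_k (x, y) coordinates
    by taking the argmax of each map, as done at inference time."""
    coords = []
    for hm in heatmaps:
        # flat argmax -> (row, col) = (y, x)
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        coords.append((int(x), int(y)))
    return coords
```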

Fig. 7
figure 7

Structure of keypoint detection head.

Training methodology

By training the model on both segmentation and keypoint detection concurrently, it is encouraged to learn a shared, synergistic representation of cardiac anatomy. We hypothesize that knowledge of the overall LV shape (from segmentation) can regularize and improve the accuracy of keypoint localization, and vice versa. To achieve this, the model was co-trained on the union of all three datasets (CAMUS, EchoNet-Dynamic, and EchoNet-LVH) simultaneously. The model is trained end-to-end by optimizing a combined loss function, \(L_{total}\), which is a weighted sum of the individual losses from each head (1). During training, batches contained a mix of samples from all tasks. When a segmentation sample was processed, the loss was calculated only for the segmentation head (with the keypoint loss set to zero), and vice versa. This ensures the shared encoder is updated by both tasks concurrently.

$$L_{total} = \alpha \cdot L_{segmentation} + (1 - \alpha) \cdot L_{keypoint}$$
(1)

In this equation, \(L_{segmentation}\) is the loss for the segmentation task, a combination of Dice Loss and Binary Cross-Entropy (BCE). The Dice Loss, which was found to be highly effective in the initial segmentation experiments, is well-suited to handle the class imbalance between the LV cavity and the background, while BCE ensures fine-grained, pixel-level accuracy. \(L_{keypoint}\) is the loss for the keypoint detection task: the Mean Squared Error (MSE) between the model’s predicted heatmaps and the ground-truth heatmaps, which are generated by placing a 2D Gaussian kernel at the target coordinate of each keypoint. \(\alpha\) is a hyperparameter between 0 and 1 that balances the contribution of each task to the total loss; its value was tuned during hyperparameter optimization.
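A numpy sketch of these loss components, shown with equal task weighting for illustration (all function names are ours; the actual implementation operates on framework tensors inside the training loop):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss: robust to foreground/background class imbalance
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    # Pixel-wise binary cross-entropy (probabilities clipped for stability)
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def gaussian_heatmap(cx, cy, h, w, sigma=2.0):
    # Ground-truth heatmap: 2D Gaussian centered on the keypoint (cx, cy)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def total_loss(seg_pred, seg_gt, hm_pred, hm_gt, alpha=0.5):
    # Weighted sum of the two task losses, as in Eq. (1)
    l_seg = dice_loss(seg_pred, seg_gt) + bce_loss(seg_pred, seg_gt)
    l_kp = float(((hm_pred - hm_gt) ** 2).mean())  # MSE on heatmaps
    return alpha * l_seg + (1 - alpha) * l_kp
```

In the mixed-batch scheme described above, whichever task loss has no ground truth for a given sample is simply set to zero before the weighted sum.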

Implementation and results

Evaluation metrics

To quantitatively assess the performance of our proposed model, we used a set of standard, well-established metrics, evaluating each task head separately. The accuracy of the LV segmentation mask was evaluated using three metrics: the Dice Similarity Coefficient (DSC), the Jaccard Index (IoU), and the Hausdorff Distance (HD). The Mean Absolute Error (MAE) was used to evaluate the accuracy of the anatomical keypoint localization. The details of these metrics are as follows.

  • Dice Similarity Coefficient (DSC): This is a spatial overlap index that measures the similarity between the predicted mask \(A\) and the ground truth mask \(B\). It is calculated as \(DSC = \frac{2\,|A \cap B|}{|A| + |B|}\), with values ranging from 0 (no overlap) to 1 (perfect overlap).

  • Jaccard Index (IoU): Also known as Intersection over Union, this metric also quantifies the overlap between the predicted and ground truth masks. It is defined as \(IoU = \frac{|A \cap B|}{|A \cup B|}\).

  • Hausdorff Distance (HD): Unlike overlap-based metrics, the Hausdorff Distance measures the distance between the boundaries of the predicted and ground truth masks. It quantifies the maximum distance from a point on one contour to the closest point on the other, providing a measure of boundary localization error. Lower values indicate better performance.

  • Mean Absolute Error (MAE): This metric calculates the average Euclidean distance between the predicted keypoint coordinates and the ground truth coordinates. It provides a direct measure of the localization error in pixels, with lower values indicating higher accuracy.
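The four metrics can be computed with NumPy and SciPy along the following lines. This is a sketch: for simplicity the Hausdorff distance is taken over all foreground pixels rather than extracted boundary contours, and distances are in pixels (the paper reports HD in mm after scaling).

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dsc(a, b):
    """Dice Similarity Coefficient between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def iou(a, b):
    """Jaccard Index (Intersection over Union) between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def hausdorff(a, b):
    """Symmetric Hausdorff distance, here over all foreground pixels."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])

def keypoint_mae(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth (x, y) points."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```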

Implementation details and hyperparameters

All experiments were conducted on a high-performance computing system designed to handle the computational demands of our deep learning framework. The system was equipped with an Intel® Core™ i9 CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3080 Ti GPU to accelerate model training. The experiments were run on a Linux-based x64 operating system, which provided a stable environment for the Python-based deep learning libraries used.

To identify the optimal hyperparameters for the proposed model, we employed Optuna, an advanced optimization framework. This approach automates the tuning process by systematically exploring the hyperparameter space using techniques like Bayesian optimization. After conducting 20 trials, the optimal configuration was determined. The search space for this optimization included the optimizer (Adam, RMSprop), learning rate (log-uniform distribution from \(10^{-5}\) to \(10^{-2}\)), and the loss weighting factor (uniform distribution from 0.1 to 0.9). The final set of hyperparameters used for training our model is detailed in Table 5. To assess practical applicability, we calculated the model’s efficiency. The final MTL model (using the EfficientNet-B0 backbone) has approximately 1.7 million parameters. Total training for 20 epochs took approximately 6.5 h on the specified NVIDIA GeForce RTX 3080 Ti GPU. It is worth noting that we strictly adhered to the official, pre-defined patient-level splits (Train/Validation/Test) provided by the creators of each dataset (CAMUS, EchoNet-Dynamic, EchoNet-LVH). This ensures our results are reproducible and directly comparable to other benchmark studies.

Table 5 Summary of implementation details and hyperparameters.

Performance analysis

This section presents a comprehensive evaluation of our proposed MTL framework. We assessed the model’s performance on its two core tasks, including LV segmentation and anatomical keypoint detection, using the three public datasets described previously. The quantitative results are supplemented with a qualitative analysis and an ablation study to demonstrate the specific benefits of our MTL approach.

Segmentation task performance

The performance of the segmentation head was evaluated on the CAMUS and EchoNet-Dynamic datasets using DSC, IoU, and HD metrics. The results, detailed in Table 6, are presented separately for the end-diastolic (ED) and end-systolic (ES) frames.

On the CAMUS dataset, which is known for its heterogeneity and inclusion of challenging clinical cases, our model showed exceptional robustness. It achieved a DSC of 0.951 and an IoU of 0.912 for the end-diastolic frames. This high degree of overlap is clinically significant, as the accurate delineation of the LV cavity at end-diastole is a critical prerequisite for calculating end-diastolic volume (EDV) and, consequently, ejection fraction. The model’s performance remained strong for the more challenging end-systolic frames, which have a smaller cavity size, achieving a DSC of 0.938. Furthermore, the low Hausdorff Distance (3.24 mm for ED) indicates that the predicted contour closely aligns with the ground-truth boundary, minimizing the risk of measurement errors.

Similarly, the model’s performance was strong on the large-scale EchoNet-Dynamic dataset, where it obtained a DSC of 0.931 for ED frames and 0.909 for ES frames. A key finding on this dataset is the exceptionally low HD, with values of 1.02 mm for ED and 1.07 mm for ES. This indicates a highly precise alignment between the predicted and ground-truth boundaries, which is crucial for accurate volume calculations. This consistency in producing reliable segmentations across two different datasets, with varying acquisition protocols and patient populations, underscores the generalization capability of our framework and confirms that the segmentation head can reliably serve as the foundation for a fully automated assessment of systolic function.
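The downstream use of these segmentations is volume and EF estimation. The paper does not state which volume formula is applied; the single-plane area–length method shown here is one standard choice and is given purely as a hedged illustration, with areas and lengths assumed to be already converted to physical units.

```python
import math

def area_length_volume(area_cm2, length_cm):
    """Single-plane area-length LV volume estimate: V = 8*A^2 / (3*pi*L), in mL."""
    return 8.0 * area_cm2 ** 2 / (3.0 * math.pi * length_cm)

def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction (%) from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml
```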

Table 6 Segmentation performance on CAMUS and EchoNet-Dynamic datasets.

Keypoint detection task performance

In addition to segmentation, our framework was evaluated on its ability to perform precise anatomical keypoint detection, a task essential for measuring LV wall thickness and assessing structural abnormalities like hypertrophy. This evaluation was conducted on the EchoNet-LVH dataset, which provides expert annotations for the specific landmarks required to measure the Interventricular Septum (IVS), the Left Ventricular Posterior Wall (LVPW), and the Left Ventricular Internal Dimension (LVID). The accuracy of the keypoint detection head was quantified using the MAE metric. The results, detailed in Table 7, are presented separately for the end-diastolic (ED) and end-systolic (ES) frames.

Specifically, the model achieved an overall average MAE of approximately 1.13 pixels across all structural measurements (pooling both end-diastolic and end-systolic phases). Notably, it demonstrated exceptionally high precision in measuring LVID, achieving an MAE of just 0.3446 pixels in the ES phase. This sub-pixel accuracy is critical, as even minor errors in landmark placement can lead to significant inaccuracies in wall thickness and dimensional measurements. This high level of performance confirms the model’s capability to reliably identify the specific anatomical points needed for a quantitative assessment of LV structure, providing a robust tool for the automated detection of conditions such as left ventricular hypertrophy.
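Pixel-space MAE of this kind is typically measured after decoding each predicted heatmap back to a coordinate. A minimal argmax decoder is sketched below; the paper does not specify its decoding scheme, so this (and the tensor layout) is an assumption. Sub-pixel refinements such as a soft-argmax over the heatmap are common alternatives.

```python
import torch

def heatmaps_to_coords(heatmaps):
    """Decode each keypoint as the argmax location of its heatmap.

    heatmaps: tensor of shape [B, K, H, W] -> coordinates [B, K, 2] as (x, y) in pixels.
    """
    b, k, h, w = heatmaps.shape
    idx = heatmaps.reshape(b, k, -1).argmax(dim=-1)   # flat index of the peak
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    return torch.stack([xs, ys], dim=-1)
```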

Table 7 Keypoint detection performance on the EchoNet-LVH dataset.

Qualitative and visual results

To complement the quantitative metrics, a qualitative analysis was performed to assess the performance of our MTL framework visually. Figure 8 presents a visual comparison between the model’s predictions and the ground-truth annotations for representative samples from the EchoNet-Dynamic and CAMUS datasets for the segmentation task.

Fig. 8

Visual comparison of the model’s predictions against ground-truth annotations for the segmentation task. For each row, the panels represent: (A) Input Image, (B) Ground-Truth Mask, and (C) Predicted Mask.

The figure is organized to display the original input echocardiogram, the expert-provided ground truth, and the corresponding segmentation mask generated by our model. The segmentation results demonstrate that the model can generate precise and anatomically plausible contours for the LV cavity, closely matching the ground truth in both the CAMUS and EchoNet-Dynamic datasets. Notably, the model maintains high performance even in cases with imaging artifacts and low contrast, which are common challenges in clinical practice. However, a closer inspection also reveals potential failure modes. For instance, in the CAMUS sample shown in Fig. 8, the model’s predicted apex is slightly blunted or flattened compared to the ground-truth mask, a common difficulty in regions of apical signal dropout. Similarly, while generally accurate, keypoint localization can be challenged by significant acoustic shadowing that obscures anatomical boundaries.

Figure 9 presents a visual comparison between the model’s predictions and the ground-truth annotations for representative samples from the EchoNet-LVH dataset, encompassing keypoint detection. As can be seen, the predicted landmarks on samples from the EchoNet-LVH dataset show a strong alignment with the ground-truth coordinates. The visual evidence confirms the low MAE reported in the quantitative analysis, illustrating the model’s accuracy in localizing the specific points required for measuring wall thickness and internal dimensions.

Fig. 9

Visual comparison of the model’s predictions against ground-truth annotations for the keypoint detection task on the EchoNet-LVH dataset. (A) Original input image. (B) Ground-truth keypoints (visualized as lines for IVS, LVID, LVPW). (C) Predicted keypoints generated by our MTL model.

Overall, these visual results substantiate the quantitative findings, confirming that our unified framework can robustly and accurately perform both segmentation and keypoint detection, making it a reliable tool for automated echocardiographic analysis.

Ablation study

To definitively validate our central hypothesis that a unified MTL framework provides a synergistic benefit over isolated, single-task approaches, we conducted a comprehensive ablation study. The primary objective of this study was to isolate and quantify the contribution of the MTL strategy itself by comparing our full model against two baseline models, each trained to perform only one of the constituent tasks. For this purpose, we configured and trained two specialized baseline models:

  1. Segmentation-Only Model: This model utilized the same architecture as our proposed framework but was trained exclusively on the segmentation datasets (CAMUS and EchoNet-Dynamic) using only the segmentation loss function.

  2. Keypoint-Only Model: Similarly, this model shared the same architecture but was trained solely on the keypoint detection dataset (EchoNet-LVH) using only the keypoint loss function.

The performance of these single-task models was then compared against our complete MTL framework. The results, summarized in Table 8, unequivocally demonstrate the superiority of the multi-task approach. Our full MTL model not only outperformed the “Segmentation-Only” baseline in segmentation accuracy (achieving a higher average DSC score) but also surpassed the “Keypoint-Only” model in localization precision (achieving a lower average MAE).

This improvement across both tasks confirms that the model benefits from learning a shared, synergistic representation of cardiac anatomy. The contextual information gained from understanding the overall shape of the LV cavity (from the segmentation task) acts as a powerful regularizer, improving the accuracy of localizing specific anatomical landmarks. Conversely, the precise feature localization learned from the keypoint task helps refine the segmentation boundaries. This study provides strong evidence that our unified MTL framework is more effective and robust than addressing these clinically linked problems in isolation. To formally validate this, we performed a Wilcoxon signed-rank test comparing the paired results on the test set. The improvements from the MTL framework were found to be statistically significant for both segmentation (average DSC increase, p = 0.006) and keypoint detection (average MAE reduction, p = 0.002).
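The Wilcoxon signed-rank test used here operates on paired per-case scores and makes no normality assumption. The sketch below shows the comparison with SciPy; the score values are invented for illustration and are not the paper’s actual test-set results.

```python
from scipy.stats import wilcoxon

# Paired per-case DSC scores: single-task baseline vs. MTL model.
# These numbers are fabricated for illustration only.
dsc_single = [0.92, 0.93, 0.90, 0.94, 0.91, 0.93, 0.92, 0.90]
dsc_mtl = [0.94, 0.95, 0.93, 0.95, 0.93, 0.94, 0.94, 0.92]

# Two-sided test on the paired differences; a small p-value indicates the
# MTL improvement is unlikely to be due to chance.
stat, p = wilcoxon(dsc_single, dsc_mtl)
print(f"Wilcoxon statistic={stat}, p={p:.4f}")
```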

Table 8 Ablation study on the effectiveness of the MTL framework.

Discussion

In this study, we introduced an MTL framework for the automated assessment of left ventricular structure and function from echocardiographic images. In contrast to previous approaches that treated LV segmentation (for functional assessment) and keypoint detection (for structural assessment) as separate problems, our proposed model addresses both tasks concurrently within a single, unified network. The quantitative and qualitative results demonstrate the efficacy and robustness of this approach.

Our segmentation results, particularly the achievement of a DSC score of 0.951 on the challenging CAMUS dataset and a very low HD of 1.02 mm on the EchoNet-Dynamic dataset, indicate the model’s high precision in delineating the LV endocardial borders. This level of accuracy, which is competitive with the best-performing single-task models, is essential for the reliable calculation of clinical parameters such as EF. Similarly, in the keypoint detection task, our model achieved a consistently low localization error on the standardized 112 × 112 input images, notably achieving an MAE of 0.3446 pixels for measuring the LVID. This high degree of sub-pixel precision on the processed images is a strong indicator of the model’s ability to learn the precise anatomical landmarks.

To contextualize our results within the existing literature, Table 9 compares the performance of our framework against several other state-of-the-art models. As can be seen, this table is organized by task, presenting benchmarks for both LV segmentation on the CAMUS and EchoNet-Dynamic datasets, and keypoint detection on the EchoNet-LVH dataset. This allows for a direct comparison of our model’s performance on each individual task against specialized, single-task architectures. Clearly, our MTL approach achieves competitive results. To properly contextualize our contribution, it is important to compare it with other state-of-the-art systems that also employ mixed-type MTL. Comprehensive frameworks like PanEcho18 achieve this combined assessment using direct regression, while EchoNet-Measurements17 and Us2.ai16 rely on full segmentation (both endo- and epicardium). Furthermore, other MTL models like EchoEFNet28 and EFNet36 have paired segmentation with EF regression.

Our framework’s novelty lies in its specific, interpretable methodology. To our knowledge, ours is the first lightweight MTL framework to validate unifying two distinct geometric task types: U-Net segmentation (a dense prediction task) with heatmap-based keypoint detection (a sparse localization task). This specific task-pairing is novel and offers a computationally efficient alternative to the aforementioned regression-based or full-segmentation approaches. Therefore, a direct comparison of MTL performance with models like EchoEFNet (which solve a different task-pairing) is not straightforward.

Table 9 Performance comparison of the proposed framework with other studies.

The most significant finding of this research, confirmed by our ablation study, is the clear superiority of the MTL framework over single-task models. Our results showed that training on both tasks simultaneously improved the performance of each. This phenomenon, known as synergistic learning, occurs because the model develops a more comprehensive and robust understanding of cardiac anatomy by concurrently learning the overall shape of the ventricle (via segmentation) and the precise location of anatomical landmarks (via keypoint detection). This shared knowledge acts as a form of regularization, helping the model to extract more relevant and powerful features.

Despite the promising results, our study has certain limitations that must be addressed before clinical translation. First, the model was evaluated on public datasets; for ultimate validation, it needs to be assessed on clinical data from various medical centers with different imaging protocols to ensure generalizability and avoid any potential bias from the training sets.

Second, the current framework is focused on 2D static keyframes (ED and ES), neglecting the rich temporal information within the full cardiac cycle. This approach was necessary due to the nature of annotations in the public datasets, but it limits the model’s ability to assess dynamic features like regional wall motion. As noted, future work will focus on incorporating temporal data.

Third, while the datasets used are noted for their heterogeneity and inclusion of pathological cases, we did not perform an explicit analysis of class balance (e.g., normal vs. hypertrophic). The model’s strong performance on these challenging datasets suggests robustness to this inherent imbalance, but future validation should explicitly test performance across different pathological severities.

Fourth, our evaluation focused on geometric proxy metrics (DSC, HD, and pixel-based MAE) rather than the final downstream clinical metrics (e.g., Ejection Fraction correlation, or wall thickness error in mm). Our reported sub-pixel MAE (average ~ 1.13 pixels) demonstrates high precision on the standardized 112 × 112 image, but a direct conversion to millimeters (mm) is not straightforward, as it requires patient-specific pixel/mm scaling factors not consistently available in the datasets. Therefore, a direct comparison with studies reporting error in mm, such as Duffy et al.11, is not appropriate. A future study must validate the direct correlation and agreement between our model’s automated outputs and expert-derived clinical parameters using original-resolution images.

Finally, our study did not include a cross-dataset evaluation to assess generalization. A full cross-dataset validation of the entire MTL model is inherently challenging, as the public datasets for segmentation (CAMUS, EchoNet-Dynamic) and keypoint detection (EchoNet-LVH) are acquired from different echocardiographic views (A4C vs. PLAX, respectively). Future work should focus on validating the framework on multi-view, multi-task datasets or performing domain adaptation to bridge this gap.

Conclusion

The accurate and reproducible assessment of left ventricular function and structure is critical for cardiovascular diagnosis, yet the manual analysis of echocardiograms remains a significant clinical challenge due to its time-consuming nature and inter-observer variability. In this research, we addressed this challenge by introducing and validating a novel MTL framework for the comprehensive and automated assessment of the left ventricle in echocardiographic images. This framework successfully integrates two critical clinical tasks, left ventricle segmentation for systolic function assessment and keypoint detection for wall thickness measurement, into a single, unified model.

Our experimental results on three large public datasets demonstrated that the proposed approach achieves state-of-the-art performance in both tasks. More importantly, our ablation study proved that the concurrent learning of these two tasks, due to synergistic learning, leads to improved performance in both domains. This study presents a novel, integrated solution that unifies the geometric assessment of LV function (via segmentation) and structure (via keypoint detection). While other comprehensive systems exist that perform this combined assessment using regression or full epicardial segmentation, our lightweight MTL framework is the first to validate this specific and interpretable task-pairing (segmentation + keypoint heatmaps). This approach has the potential to optimize clinical workflows, reduce the workload of cardiologists, and, by providing fast, accurate, and reproducible measurements, significantly contribute to the early diagnosis and better management of cardiovascular diseases.

Future work will focus on extending this framework to incorporate temporal information from entire video sequences, enabling the analysis of dynamic features such as regional wall motion abnormalities. Furthermore, we plan to adapt the model for 3D echocardiography to provide a more complete geometric and volumetric assessment. Validating this extended framework on large-scale, multi-center clinical datasets will be a crucial next step to ensure its robustness and generalizability before clinical adoption.