Abstract
Current AI approaches for cardiac diagnosis require condition-specific supervised learning with extensive labeled datasets, leading to fundamental scalability barriers. We developed an ECG-CLIP model, applying contrastive multimodal learning to enable zero-shot cardiac diagnosis from 12-lead ECGs using natural language supervision. Trained on 800,034 ECG-text pairs from MIMIC-IV-ECG, ECG-CLIP was evaluated on 18 cardiac conditions without condition-specific training. The model achieved superior performance for rhythm abnormalities (AUROC > 0.90) compared to morphological conditions. External validation demonstrated robust AUROC rank consistency (ρ = 0.934), including remarkable zero-shot performance for pediatric patients despite no pediatric training cases. Direct comparison showed ECG-CLIP approached supervised models while providing broader diagnostic coverage. Demographic analysis revealed U-shaped age-dependent performance and condition-specific sex-age patterns. By eliminating dependence on labeled data, ECG-CLIP enables diagnosis of various cardiac conditions via text-based queries. This paradigm shift from rigid task-specific models to flexible unified systems addresses critical deployment barriers, potentially expanding global access to expert-level ECG interpretation.
Introduction
The 12-lead electrocardiogram (ECG) remains a cornerstone of cardiovascular diagnosis, providing critical insights into cardiac rhythms, conduction abnormalities, and structural pathologies1,2,3. Despite its widespread clinical application, accurate ECG interpretation demands substantial expertise and exhibits concerning inter-observer variability, with documented error rates ranging from 25% to 31%, even among experienced clinicians4,5. This challenge is increasingly exacerbated by the global shortage of cardiac specialists, creating a pressing need for reliable automated interpretation systems that could expand access to expert-level diagnosis6.
Current artificial intelligence (AI) approaches for ECG analysis rely predominantly on supervised learning paradigms that require extensive condition-specific labeled datasets7,8,9,10. These systems operate as rigid, task-specific classifiers requiring individual development, validation, and maintenance for each cardiac condition. The resulting ecosystem presents substantial clinical deployment limitations: each new diagnostic capability demands separate dataset curation, model development, and regulatory approval11,12, creating significant barriers to comprehensive cardiac care automation13,14. This approach becomes particularly problematic for rare cardiac conditions where sufficient labeled data is challenging to obtain15.
The recent success of the CLIP (Contrastive Language-Image Pre-training) framework has demonstrated the transformative potential of contrastive multimodal learning for developing flexible, zero-shot capabilities16,17,18. CLIP’s approach of learning joint representations between paired modalities through contrastive learning enables a single model to perform classification tasks for various concepts describable in natural language, eliminating the need for task-specific labeled datasets. However, the application of contrastive multimodal learning to physiological signals and clinical text remains underexplored in medical diagnosis. ECGs as physiological time-series present unique challenges for multimodal learning: complex temporal dynamics encoded in precise timing relationships and waveform morphologies19, specialized clinical terminology requiring domain-specific understanding20, and life-critical accuracy demands that are essential for medical applications21,22.
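The core of the CLIP framework referenced above is a symmetric contrastive (InfoNCE) objective that pulls matched pairs together and pushes mismatched pairs apart in a shared embedding space. As an illustrative minimal sketch only (not the authors' training code, which would use learned ECG and text encoders and typically a trainable temperature), the objective over a batch of paired embeddings can be written as:

```python
import numpy as np

def clip_contrastive_loss(ecg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    ecg_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    The 0.07 temperature is the common CLIP default, used here illustratively.
    """
    # L2-normalize so dot products become cosine similarities
    ecg = ecg_emb / np.linalg.norm(ecg_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = ecg @ txt.T / temperature  # (batch, batch) similarity matrix

    # Cross-entropy with the diagonal (true pair) as the target, both directions
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss makes each ECG embedding most similar to its own report among all reports in the batch (and vice versa), which is what later enables retrieval and zero-shot diagnosis via text queries.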
Despite these challenges, we hypothesized that contrastive multimodal learning principles could be applied to address the unique characteristics of physiological signals. To test this hypothesis, we developed an ECG-CLIP model, applying established contrastive multimodal learning principles to enable zero-shot cardiac diagnosis from ECG signals. Our approach learns joint representations between ECG-text pairs of 12-lead ECG signals and summary reports, allowing a single model to diagnose a diverse range of cardiac conditions describable in natural language without requiring condition-specific training data. By enabling this paradigm shift from multiple narrow, condition-specific models to a single flexible system, our work establishes a foundation for scalable medical AI systems that could transform cardiovascular care delivery.
Results
Dataset characteristics and demographics
We developed and evaluated ECG-CLIP, a contrastive learning model that aligns ECG signals with their summary reports, using the MIMIC-IV-ECG dataset23,24 for development and a publicly available large-scale 12-lead ECG database for arrhythmia study (hereafter the "LSAS-ECG dataset")25,26 for external validation (Table 1). The MIMIC-IV-ECG dataset comprised 800,034 ECG-text pairs from 161,352 unique patients, distributed across training (638,376 pairs from 129,081 patients), validation (80,974 pairs from 16,135 patients), and internal testing (80,684 pairs from 16,136 patients) sets. The external validation LSAS-ECG dataset contained 45,151 ECG recordings from unique patients. In all subsequent statistical analyses, data were analyzed at the level of individual ECG recordings.
Demographics showed notable differences between datasets (Table 1). MIMIC-IV-ECG had a mean age of 64.3 ± 17.1 years (range 12–101 years), with 55.0–56.7% of recordings from patients aged 45–74 years and 29.7–30.7% from those aged 75+ years. LSAS-ECG was younger (mean 58.2 ± 19.7 years, range 0–89 years) with 58.4% aged 45–74 years, 21.1% aged 75+ years, and notably included pediatric cases (0–11 years). Sex distribution was relatively balanced (males: 50.7–51.6% MIMIC-IV-ECG, 56.4% LSAS-ECG). Missing demographic data was ≤0.6%.
Care setting analysis of the MIMIC-IV-ECG dataset revealed 408,760 (51.1%) ECG recordings from inpatient admissions and 391,274 (48.9%) from encounters outside inpatient admissions. Among inpatients, 149,301 (36.5%) had ICU stays while 259,388 (63.5%) were admitted without ICU care; 71 recordings (<0.1%) had missing ICU status records. Clinical characterization of inpatient recordings demonstrated significantly higher comorbidity burden among patients with ICU stays compared to those without ICU care (Supplementary Table 1), including higher prevalence of cardiovascular conditions, renal dysfunction, and critical metabolic conditions.
We extracted 18 cardiac conditions from MIMIC-IV-ECG and 14 from LSAS-ECG (Supplementary Tables 2 and 3). Prevalence varied substantially, from <1% for rare conditions like supraventricular tachycardia (SVT) to >10% for common arrhythmias like atrial fibrillation (AF) (Table 1). Notable differences included higher prevalence of AF (21.8% vs 11.7–12.4%) and sinus bradycardia (SB; 36.7% vs 12.3–12.5%) in LSAS-ECG, but more ST-T changes (STTC) in MIMIC-IV-ECG (18.2–18.9% vs 2.6%). Condition prevalence stratified by demographics showed age-related trends and balanced sex distribution (Supplementary Fig. 1).
We evaluated two initialization strategies for model training: (1) partially utilizing weights from CLIP, and (2) full random initialization following standard best practices. The CLIP-initialized approach demonstrated modestly superior overall performance compared with random initialization (Supplementary Fig. 2). Therefore, subsequent results are based on the CLIP-initialized model.
Retrieval and semantic clustering evaluation
To assess the quality of learned ECG-text embeddings, we evaluated bidirectional retrieval performance and semantic clustering capabilities. Analysis stratified by summary report template frequency revealed distinct performance tiers: rare templates (frequency = 1) achieved the best retrieval performance, with Recall@1 of 0.267 ± 0.007, Recall@5 of 0.513 ± 0.008, Recall@10 of 0.693 ± 0.008, and mean reciprocal rank (MRR) of 0.394 ± 0.007. Common templates (frequency 2–10) demonstrated intermediate performance (Recall@1: 0.197 ± 0.007, Recall@5: 0.513 ± 0.008, Recall@10: 0.665 ± 0.008, MRR: 0.331 ± 0.006). Frequent templates (frequency > 10) showed the lowest performance (Recall@1: 0.057 ± 0.003, Recall@5: 0.188 ± 0.006, Recall@10: 0.278 ± 0.006, MRR: 0.115 ± 0.004).
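The retrieval metrics above follow standard definitions: Recall@K is the fraction of queries whose true match ranks in the top K, and MRR averages the reciprocal rank of the true match. As a minimal sketch (assuming, as is conventional, a query-by-candidate similarity matrix whose diagonal holds the ground-truth pairs), they can be computed as:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """ECG-to-text retrieval metrics from a similarity matrix.

    sim: (n_queries, n_candidates); the true match for query i is candidate i.
    Returns {K: Recall@K} and the mean reciprocal rank (MRR).
    """
    n = sim.shape[0]
    true_scores = sim[np.arange(n), np.arange(n)][:, None]
    # Rank of the true candidate = 1 + number of candidates scored strictly higher
    ranks = 1 + (sim > true_scores).sum(axis=1)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    mrr = float(np.mean(1.0 / ranks))
    return recall, mrr
```

For text-to-ECG retrieval, the same function is applied to the transposed similarity matrix.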
To evaluate whether the model learned clinically meaningful semantic relationships, we analyzed embedding clustering by comparing cosine similarities within ECG groups sharing identical clinical reports against random ECG pairs. ECGs sharing identical clinical reports exhibited significantly higher embedding similarities compared to random ECG pairs (0.847 ± 0.001 vs. 0.581 ± 0.001, p < 0.001), representing a 45.8% improvement over random. This finding was statistically significant across all independent runs (100/100 runs, p < 0.05) with a very large effect size (Cohen’s d = 2.58 ± 0.02).
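The clustering comparison above rests on two standard quantities: cosine similarity between embedding pairs and Cohen's d with a pooled standard deviation. As an illustrative sketch (index pairs and embedding matrices are hypothetical placeholders for the within-report and random pairs described in the text):

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d effect size with pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = np.var(group_a, ddof=1), np.var(group_b, ddof=1)
    pooled_sd = np.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (np.mean(group_a) - np.mean(group_b)) / pooled_sd

def pairwise_cosine(emb, pairs):
    """Cosine similarities for a list of (i, j) index pairs into an embedding matrix."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.array([e[i] @ e[j] for i, j in pairs])
```

A d of roughly 2.6, as reported, means the two similarity distributions are separated by more than two and a half pooled standard deviations.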
Cardiac condition detection performance
Internal testing performance across 18 cardiac conditions revealed substantial metric-dependent variations. Discrimination metrics showed area under the receiver operating characteristic curve (AUROC) ranging from 0.474–0.985 (mean 0.860 ± 0.145) and area under the precision-recall curve (AUPRC) from 0.116–0.832 (0.500 ± 0.255). Classification metrics demonstrated wider variability: accuracy 0.272–0.992 (0.783 ± 0.218), sensitivity 0.667–0.960 (0.820 ± 0.067), specificity 0.121–0.992 (0.762 ± 0.258), precision 0.089–0.789 (0.400 ± 0.246), and F1-score 0.160–0.864 (0.499 ± 0.233) (Table 2, Supplementary Figs. 3–6).
Sensitivity remained consistently high (>0.75 for 17/18 conditions), while precision varied dramatically (0.089–0.789). Morphological conditions achieved particularly low precision: left ventricular hypertrophy (LVH, 0.148), STTC (0.194), prolonged QT interval (PQTI, 0.089), and low QRS voltages (LowQRS, 0.114). Rhythm abnormalities maintained high specificity (>0.94 in 6/7 conditions), while morphological conditions showed reduced specificity, particularly STTC (0.121). Category-wise F1 scores ranked: general ECG classification (0.626 ± 0.142), rhythm abnormalities (0.580 ± 0.247), conduction disorders (0.526 ± 0.199), and morphological conditions (0.233 ± 0.070) (Supplementary Table 4).
External validation across 14 conditions maintained similar metric patterns (Table 2, Supplementary Figs. 3–5), with AUROC (0.558–0.985, 0.856 ± 0.133), AUPRC (0.024–0.956, 0.312 ± 0.313), accuracy (0.334–0.982, 0.793 ± 0.195), sensitivity (0.353–0.973, 0.727 ± 0.193), specificity (0.320–0.990, 0.794 ± 0.199), precision (0.021–0.934, 0.245 ± 0.300), and F1-score (0.041–0.953, 0.298 ± 0.301) (Supplementary Table 5). The comprehensive radar plot analysis revealed distinct performance clustering by cardiac condition categories (Fig. 1). Rhythm abnormalities (AF, ST, SB) exhibited consistently superior performance with AUROC values > 0.94 and robust generalization between internal and external validation sets. Conduction disorders (AVB, 1dAVb, RBBB, LBBB) demonstrated intermediate performance with moderate external validation consistency. In contrast, morphological conditions (LVH, STTC, PQTI, LowQRS) showed consistently poor performance across all metrics, with substantial performance gaps in AUPRC (Fig. 1b) and F1-score metrics (Fig. 1f). Strong rank consistency was observed for discrimination and specificity metrics (Spearman correlation coefficient, AUROC: ρ = 0.934, accuracy: ρ = 0.934, specificity: ρ = 0.932, all p < 0.001), indicating that conditions maintaining high diagnostic performance in the internal dataset generally sustained superior relative performance in the external validation dataset. AUPRC (ρ = 0.662, p = 0.010) and F1-score (ρ = 0.741, p = 0.002) showed moderate rank consistency, while sensitivity demonstrated weaker preservation of relative diagnostic difficulty ordering (ρ = 0.189, p = 0.517) between internal and external datasets (Supplementary Fig. 7).
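Rank consistency above is quantified by Spearman's ρ, the Pearson correlation of the per-condition metric rankings rather than the raw metric values. A minimal sketch for the no-ties case (real analyses would typically use a library routine that also handles ties and returns a p-value):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: Pearson correlation of ranks."""
    def ranks(v):
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(1, len(v) + 1)  # rank 1 = smallest value
        return r
    rx, ry = ranks(np.asarray(x, float)), ranks(np.asarray(y, float))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

A ρ near 1 (as for AUROC here) means conditions that were easiest internally remained easiest externally, even if absolute metric values shifted.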
a area under receiver operating characteristic curve (AUROC), b area under precision-recall curve (AUPRC), c accuracy, d sensitivity, e specificity, and f F1 score. In all panels, blue lines represent internal testing performance on the MIMIC-IV-ECG dataset, while red lines represent external validation performance on the LSAS-ECG dataset. Each radar plot displays 14 cardiac conditions arranged circumferentially: AF atrial fibrillation, SVT supraventricular tachycardia, ST sinus tachycardia, SB sinus bradycardia, PVC premature ventricular complexes, PAC premature atrial complexes, AVB atrioventricular block, 1dAVb first-degree atrioventricular block, RBBB right bundle branch block, LBBB left bundle branch block, LVH left ventricular hypertrophy, STTC ST-T changes, PQTI prolonged QT interval, and LowQRS low QRS voltages. Performance values range from 0 to 1.0 on the radial axis, with higher values indicating better performance.
Comparison with supervised models
We compared ECG-CLIP against two established supervised models6,10 on six cardiac conditions where direct comparison was feasible, as these models cannot evaluate conditions beyond their original scope. ECG-CLIP demonstrated competitive zero-shot performance compared to these supervised models (Fig. 2, Supplementary Tables 6, 7). On internal testing, supervised models showed modest but consistent AUROC improvements averaging +0.030 (supervised model 1) and +0.028 (supervised model 2), with condition-specific differences ranging from +0.009 (RBBB) to +0.093 (1dAVb). More substantial differences emerged in AUPRC metrics, where supervised models achieved average improvements of +0.199 and +0.183, respectively, with the largest disparities observed for 1dAVb (+0.468 and +0.442).
a area under the receiver operating characteristic curve (AUROC) performance comparison on internal testing across six cardiac conditions. b AUROC performance comparison on external validation. c area under the precision-recall curve (AUPRC) performance comparison on internal testing. d AUPRC performance comparison on external validation. Light blue bars represent ECG-CLIP zero-shot performance, coral bars represent supervised model 1 (Ribeiro et al. (Nature Communications, 2020)), and light green bars represent supervised model 2 (Lu et al. (European Heart Journal Digital Health, 2024)). Error bars represent 95% confidence intervals calculated by 1000 bootstraps. Cardiac conditions: AF atrial fibrillation, ST sinus tachycardia, SB sinus bradycardia, RBBB right bundle branch block, LBBB left bundle branch block, 1dAVb first-degree atrioventricular block.
External validation revealed condition-specific advantages. While supervised models maintained superiority for common arrhythmias (ST, SB, RBBB, ranging from +0.009 to +0.032), ECG-CLIP demonstrated markedly superior generalization for complex conditions: LBBB (+0.101 and +0.073 AUROC) and 1dAVb (+0.046 and +0.089 AUROC). These improvements enabled ECG-CLIP to achieve better overall external AUROC performance (+0.020 and +0.012). For AUPRC on external validation, supervised models retained advantages across most conditions (+0.031 to +0.221), except for 1dAVb where ECG-CLIP substantially outperformed both models (+0.213 and +0.227).
Demographic stratification analysis
Age-stratified analysis revealed condition-dependent variations across four groups: 0–11 (external only), 12–44, 45–74, and ≥75 years (Fig. 3a and c). Most conditions (13/18) maintained stable discrimination (AUROC difference <0.1), though atrioventricular block (AVB), PQTI, and borderline ECG showed substantial age-related disparities. PQTI demonstrated a pronounced age-related pattern, with superior performance in older patients (≥75 years: AUROC 0.890, 95% CI 0.883–0.897) compared to younger patients (12–44 years: AUROC 0.670, 95% CI 0.655–0.685). Superior AUPRC performance in middle-aged and older patients (≥45 years) was observed for AF, premature ventricular complex (PVC), and right bundle branch block (RBBB), while younger patients (12–44 years) demonstrated better performance for sinus tachycardia (ST), sinus bradycardia (SB), and left bundle branch block (LBBB). Consistently suboptimal performance was observed for morphological conditions (LVH, STTC, PQTI, LowQRS), which demonstrated AUPRC < 0.4 across all age groups, with particularly poor performance in younger patients. Sex-stratified analysis demonstrated minimal disparities, with all 18 conditions showing AUROC differences <0.1 between males and females (Fig. 3b and d). Only LBBB and LVH showed slightly superior female performance on AUPRC metrics.
a area under the receiver operating characteristic curve (AUROC) performance across 18 cardiac conditions stratified by age groups (12–44, 45–74, and ≥75 years). b AUROC performance stratified by sex (male vs. female). c area under the precision-recall curve (AUPRC) performance across cardiac conditions stratified by age groups. d AUPRC performance stratified by sex. Stars indicate the demographic group with the highest performance when the difference between groups is >0.1; triangles indicate the group with the lowest performance when the difference is >0.1. Cardiac conditions are abbreviated as: AF atrial fibrillation, SVT supraventricular tachycardia, ST sinus tachycardia, SB sinus bradycardia, PVC premature ventricular complexes, PAC premature atrial complexes, PR pacemaker rhythm, AVB atrioventricular block, 1dAVb first-degree atrioventricular block, RBBB right bundle branch block, LBBB left bundle branch block, LVH left ventricular hypertrophy, STTC ST-T changes, PQTI prolonged QT interval, LowQRS low QRS voltages.
External validation of demographic patterns (Supplementary Fig. 8) confirmed these findings across 14 conditions with additional insights for pediatric patients. Robust cross-age performance was maintained for rhythm abnormalities, with AF, SVT, ST, and SB achieving AUROC > 0.85 across all age groups, including pediatric patients (0–11 years). Notable age-related disparities (>0.1 AUROC difference) were observed for SB, with reduced pediatric performance, and premature atrial complex (PAC), with diminished performance in older patients (≥75 years). Most conduction disorders showed no age-related AUROC differences, while morphological conditions showed universal age-related disparities (>0.1 AUROC difference) with complex age-dependent patterns: pediatric patients demonstrated optimal performance for LVH and LowQRS detection, while showing reduced capability for STTC and PQTI. AUPRC external validation revealed that pediatric patients exhibited inferior performance for AF, SB, PVC, and most conduction disorders, while patients ≥45 years achieved superior AF performance (AUPRC > 0.85). Sex-stratified external validation maintained balanced performance between male and female patients, with minimal sex-related disparities (<0.1 AUROC difference) across all 14 conditions. Age- and sex-stratified performances for all conditions are detailed in Supplementary Tables 8–23.
Care intensity stratification analysis
Model performance demonstrated remarkable consistency across care settings, with only slight AUROC differences across the 18 conditions: outside-inpatient vs. inpatient without ICU (0.008 ± 0.031), outside-inpatient vs. inpatient with ICU (0.026 ± 0.049), and inpatient without ICU vs. inpatient with ICU (0.018 ± 0.024). Similar consistency was observed for AUPRC. Performance showed a slight gradient across care intensity levels: outside-inpatient recordings achieved marginally superior performance compared with inpatients without ICU stay, which in turn performed slightly better than inpatients with ICU stays (Fig. 4, Supplementary Table 24, Supplementary Figs. 9–13).
a area under the receiver operating characteristic curve (AUROC) performance across 18 cardiac conditions stratified by care intensity groups (inpatient with ICU stay, inpatient without ICU stay, and outside inpatient). b area under the precision-recall curve (AUPRC) performance stratified by care intensity. Cardiac conditions are abbreviated as: AF atrial fibrillation, SVT supraventricular tachycardia, ST sinus tachycardia, SB sinus bradycardia, PVC premature ventricular complexes, PAC premature atrial complexes, PR pacemaker rhythm, AVB atrioventricular block, 1dAVb first-degree atrioventricular block, RBBB right bundle branch block, LBBB left bundle branch block, LVH left ventricular hypertrophy, STTC ST-T changes, PQTI prolonged QT interval, LowQRS low QRS voltages.
Model confusion patterns
Systematic false positive analysis revealed distinct condition-specific confusion patterns (Fig. 5a). AF emerged as the predominant false positive across multiple rhythm abnormalities—SVT, ST, SB, PVC, and PR all exhibited AF as their most frequent false positive, consistently accompanied by STTC co-occurrence. Conduction abnormalities showed AF (with STTC) as the primary confusion for AVB, while RBBB and LBBB demonstrated STTC as their main false positive. Among morphological conditions, both LVH and STTC showed SB as the most frequent false positive with AVB co-occurrence, while PQTI and LowQRS demonstrated STTC as the primary confusion source with frequent AF co-occurrence. Three bidirectional confusion pairs dominated: AF-STTC (AF + STTC ↔ STTC + AF), ST-PAC (ST + PAC ↔ PAC + ST), and ST-STTC (ST + STTC ↔ STTC + ST), generating false positives across heterogeneous condition categories spanning rhythm abnormalities, conduction disorders, and morphological variants. External validation confirmed the generalizability of AF over-detection patterns, with SB and ST emerging as supplementary major contributors to the confusion matrix (Supplementary Fig. 14a).
a False positive confusion matrix showing the most frequent misclassifications for each cardiac condition. Rows represent false positive conditions and columns represent true positive conditions. Circles indicate confusion frequency ranking (orange = most frequent, purple = second most frequent, blue = third most frequent); triangles represent co-occurring false positive conditions. b Model confusion risk ranking across sex-age demographic groups for each target condition, showing the top 10 highest-risk subgroups. Numbers represent risk rank (1 = highest risk, 10 = lowest risk among top 10); colors indicate risk level (dark red = ranks 1–3, orange = ranks 4–7, yellow = ranks 8–10). Cardiac condition abbreviations as in Fig. 3.
Given that AF emerged as the predominant false positive for SB detection and the critical clinical implications of this confusion, where AF typically requires anticoagulation therapy while SB generally does not, we conducted a detailed post-hoc analysis to understand the underlying model confusion mechanisms. We analyzed ECG summary parameter measurements available in the MIMIC-IV-ECG database and clinical characteristics across three groups in the internal testing set: AF cases misclassified as SB (n = 838), correctly classified SB cases (n = 9875), and correctly identified AF cases (n = 9146). The misclassified AF cases demonstrated heart rates (53.4 ± 13.2 bpm) similar to correctly classified SB cases (53.7 ± 4.7 bpm) but significantly different from typical AF cases (94.7 ± 24.9 bpm, all p < 0.001) (Supplementary Table 25). Examination of the original ECG reports revealed that 662 of the 838 cases (79%) in the misclassified AF group were specifically documented as "atrial fibrillation with slow ventricular response" or "atrial flutter with slow ventricular response". Clinical characteristics analysis demonstrated significant differences between misclassified AF and correctly classified SB groups, including older age (76.2 ± 13.3 vs 65.2 ± 16.2 years, p < 0.001), significantly higher prevalence of comorbidities, and increased digoxin usage before ECG recordings (8.0% vs 1.3%, p < 0.001) (Supplementary Table 26).
Demographic confusion risk assessment of internal testing revealed striking U-shaped patterns across age-sex subgroups, with the highest diagnostic errors in elderly (≥70 years) and young (<30 years) populations (Fig. 5b). The male 90+ subgroup exhibited maximum risk for ST, SB, and PQTI detection, while the female 90+ subgroup showed elevated risk for SVT, PQTI, and LowQRS. Younger cohorts demonstrated comparable vulnerability: the male 20–30 subgroup showed the highest risk for PVC, AF, and AVB, while the female 20–30 subgroup exhibited maximum risk for AF and PR. Condition-specific patterns revealed that rhythm abnormalities (AF, SVT, SB, ST) concentrated risk in both elderly (>70 years) and young populations (16–30 years), while ectopic beats (PVC, PAC) predominantly affected younger patients (<50 years).
Among sex-age subgroups aged 70+ years, confusion risk was distributed similarly between males and females for AF, SVT, ST, SB, PR, STTC, and PQTI. Notable sex differences emerged in younger populations (16–50 years), where females showed a higher risk for AF and SVT with exclusive vulnerability for LowQRS, while males demonstrated exclusive risk for SB and PAC. The 50–70 age range emerged as a relative “safe zone,” demonstrating consistently lower confusion risk across most conditions for both sexes.
The external validation dataset supplemented pediatric sex-age interaction subgroups (0–12 years), revealing widespread model confusion risk across multiple conditions. The female 0–1 subgroup demonstrated moderate-to-high vulnerabilities (ranks 3–5 for AF, SVT, ST, and PQTI). Notable high-risk patterns emerged in early childhood: the male 3–5 subgroup showed rank 1 risk for both SB and AVB, while the male 5–8 subgroup exhibited rank 1 risk for PAC and LBBB (rank 2 for SB). The female 5–8 subgroup also demonstrated rank 1 risk for PQTI. Across all pediatric conditions evaluated, confusion risk was distributed relatively equally between male (13 conditions) and female (15 conditions) subgroups (Supplementary Fig. 14b).
Pediatric age stratification analysis
Recognizing the substantial physiological differences across pediatric developmental stages, we performed a finer-grained age-stratification analysis within patients <12 years. Condition prevalence across age groups was highly sparse, except for ST, whose prevalence ranged from 0.68% in infants (<1 year) to 73.9% in preschoolers (3–4 years) and then decreased to 35.6% in older children (8–11 years) (Supplementary Table 27). No patients were available in the 1–2 year age group for analysis.
Based on the confusion matrix analysis across pediatric age groups (Supplementary Figs. 15–28), most conditions exhibited high specificity (>0.90) across all age groups. A notable exception was ST, which showed substantially higher prevalence than other conditions: exceeding 30% in preschoolers (73.9%), school-age children (5–7 years, 45.4%), and older children (35.6%). ST detection exhibited moderate age-related performance variation, with sensitivity ranging from 0.548 in preschoolers to 0.900 in older children, and specificity varying from 0.731 in preschoolers to 0.984 in older children; in the older children group, the model performed well, with 0.900 sensitivity, 0.984 specificity, and 0.968 precision. AF maintained consistently high specificity across all age groups (0.992–1.0) and achieved 0.895 sensitivity in infants, the only group containing positive AF cases. SB demonstrated stable performance with specificity >0.988 in all groups. Several conditions, including PVC, RBBB, and LBBB, showed zero positive cases across most pediatric age groups, limiting sensitivity assessment.
Prompt optimization
Based on the above performance evaluation, we identified that morphological abnormalities, particularly STTC and LowQRS, demonstrated markedly suboptimal detection performance compared to rhythm abnormalities across both internal and external validation datasets. These challenging results prompted us to explore whether more anatomically specific prompt descriptions could improve diagnostic accuracy for these morphological conditions.
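In CLIP-style zero-shot classification, a prompt change alters only the text embedding that the ECG embedding is compared against, so prompt optimization requires no retraining. A minimal sketch of this scoring step follows; the prompt strings and two-prompt softmax formulation are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def zero_shot_score(ecg_emb, pos_text_emb, neg_text_emb):
    """Probability that an ECG matches the positive prompt over the negative one.

    Embeddings would come from the trained ECG and text encoders; e.g. a standard
    prompt like "ST-T changes" vs. an anatomically specific optimized variant could
    each be scored this way against the same negative prompt (hypothetical examples).
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    s_pos = cos(ecg_emb, pos_text_emb)
    s_neg = cos(ecg_emb, neg_text_emb)
    # Softmax over the two prompt similarities
    return float(np.exp(s_pos) / (np.exp(s_pos) + np.exp(s_neg)))
```

Comparing a standard prompt and an optimized prompt then amounts to recomputing these scores with a different positive text embedding and re-evaluating the resulting metrics, which is what the experiments below do.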
For STTC diagnosis, optimized anatomically specific prompts showed substantial improvements during internal testing (Fig. 6a, Supplementary Table 28). The optimized prompts achieved notable gains across multiple metrics, with particularly strong improvements in AUROC (+0.368), AUPRC (+0.367), accuracy (+0.453), specificity (+0.585), and F1 score (+0.195). The mean improvement across all metrics was 0.287 ± 0.219. However, sensitivity showed a decline (–0.146), indicating a trade-off between specificity and sensitivity. External validation results for STTC were more modest but generally positive (Fig. 6b). Optimized prompts maintained improvements in AUROC (+0.218), AUPRC (+0.038), accuracy (+0.231), specificity (+0.239), and F1 score (+0.060), with a mean improvement of 0.115 ± 0.102. Similar to internal testing, sensitivity decreased (–0.020), suggesting consistent behavior across validation sets.
a, b ST-T changes (STTC) detection performance during internal testing and external validation across seven evaluation metrics. c, d low QRS voltages (LowQRS) detection performance during internal testing and external validation across seven evaluation metrics. Pink bars represent standard prompt performance, and brown bars represent optimized prompt performance. Error bars indicate 95% confidence intervals calculated by 1000 bootstraps. Performance values range from 0 to 1.0 for all metrics including area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), accuracy, sensitivity, specificity, precision, and F1 score.
Optimized prompts for LowQRS detection demonstrated consistent but smaller improvements in internal testing (Fig. 6c, Supplementary Table 28). The lead-specific optimized prompts achieved gains in AUROC (+0.073), AUPRC (+0.036), accuracy (+0.161), specificity (+0.178), and F1 score (+0.057), with a mean improvement of 0.076 ± 0.065. Sensitivity again showed a small decline (–0.015). External validation for LowQRS showed mixed results (Fig. 6d). While accuracy (+0.295) and specificity (+0.306) improved substantially, AUPRC decreased slightly (–0.001) and sensitivity declined more notably (–0.180). The mean improvement was 0.068 ± 0.161, indicating greater variability in performance gains.
Discussion
This study demonstrates the potential for leveraging natural language supervision to enable zero-shot cardiac diagnosis from 12-lead ECG signals, representing a paradigm shift from condition-specific labeled datasets to flexible diagnostic capabilities. Traditional supervised approaches require extensive manual labeling for each cardiac condition, creating scalability bottlenecks that limit AI deployment to well-represented conditions with sufficient training examples. Our ECG-CLIP framework addresses these constraints by learning diagnostic capabilities from summary reports that naturally accompany ECG recordings, enabling diagnosis of cardiac conditions describable in natural language. Motivated by the success of CLIP’s contrastive multimodal learning framework16,17,18, we investigated whether contrastive learning between physiological signals and text could be applied to diagnosis of cardiac conditions, with results suggesting that contrastive multimodal learning principles could be effectively applied to ECG-based diagnosis.
The frequency-dependent retrieval performance reveals a fundamental trade-off in our approach. Superior performance on rare templates indicates strong capability for identifying unique clinical patterns, precisely where retrieval systems provide maximum clinical value. However, reduced performance on frequent templates presents a significant limitation, as these represent the majority of clinical encounters. This performance gap likely reflects two factors: rare templates contain distinctive lexical patterns that facilitate discrimination, while frequent templates often differ only by minor variations within standardized language. Our sentence-level analysis supports this interpretation, revealing extreme standardization (0.1% diversity ratio) compared to moderate report-level diversity (22.2%), indicating that reports vary primarily through recombination of highly standardized phrases rather than novel linguistic expressions.
Our results reveal distinct performance patterns across cardiac abnormality categories that provide important insights into the capabilities and limitations of language-supervised approaches. The superior performance for rhythm abnormalities (AF, ST, SB, PVC) compared to morphological conditions (LVH, STTC, PQTI, LowQRS) suggests fundamental differences in how these pathophysiological features are represented in both text and ECG signals. The strong performance for rhythm abnormalities likely stems from the model’s ability to capture temporal patterns through alignment between signal dynamics and textual descriptions. This aligns with recent work showing that temporal relationships in physiological signals can be effectively encoded through language supervision27,28. Rhythm disturbances manifest as distinctive temporal sequences that can be effectively described linguistically: irregular intervals for atrial fibrillation, slow rates for sinus bradycardia, or premature complexes for ectopic beats. These patterns translate well into natural language descriptions that capture the essential diagnostic features, enabling robust contrastive learning between signal and text. The impressive performance for rare conditions like SVT demonstrates particular clinical utility, as obtaining large labeled datasets for uncommon arrhythmias remains challenging in traditional supervised approaches29.
Conversely, morphological conditions depend on subtle amplitude and waveform variations that may be inadequately captured in routine clinical descriptions. Ventricular hypertrophy, STTC, and QRS abnormalities require precise quantitative measurements and contextual interpretation30. While the CLIP architecture effectively aligns global patterns with descriptive text, it may be insufficient for capturing the fine-grained features necessary for detecting these subtle morphological variations.
External validation results demonstrate robust generalization across different patient populations and institutional settings, with strong correlations for discrimination metrics (AUROC rank consistency ρ = 0.934) between internal and external datasets. This generalization capability suggests that contrastive learning helps models develop robust feature representations that transfer effectively across diverse clinical environments, which is a critical requirement for real-world deployment31. The zero-shot generalization to pediatric patients revealed both promising adaptability and considerable age-dependent performance variations despite the absence of pediatric cases in training data. The variability observed in ST detection exemplifies the physiological complexity across pediatric developmental stages, with sensitivity ranging from 0.548 in preschoolers to 0.900 in older children, and specificity varying from 0.731 in preschoolers to 0.984 in older children. Notably, the model achieved the best performance in older children (sensitivity: 0.900, specificity: 0.984, precision: 0.969), suggesting better alignment with more stable cardiac physiology in this age group. Beyond ST, most other cardiac conditions showed sparse occurrence across pediatric age groups, with many conditions demonstrating zero true positives in specific developmental stages, limiting comprehensive performance assessment. These findings demonstrate that while zero-shot pediatric generalization is achievable, the substantial physiological variability across developmental stages and the sparse nature of pediatric cardiac conditions necessitate more refined, age-specific analytical approaches and specialized validation protocols before clinical deployment in pediatric populations32,33,34,35.
Competitive performance compared to supervised models2,10 demonstrates the potential of zero-shot capabilities to match traditional approaches while offering greater flexibility. Direct comparison was conducted on six cardiac conditions (AF, ST, SB, RBBB, LBBB, and 1dAVb) using established supervised models retrained on our dataset. ECG-CLIP achieved superior overall AUROC performance on external validation (+0.020 and +0.012 versus supervised models) while simultaneously providing diagnostic capabilities across 18 conditions with a single model. For the remaining 12 conditions evaluated in our study, performance comparisons with the published supervised approaches were not appropriate, as these models were not designed to handle conditions beyond their original scope, precisely illustrating the limitation that our approach addresses.
Demographic stratification revealed complex age-dependent performance patterns approximating a U-shaped distribution for several conditions. Elderly patients (70+ years) exhibited the highest concentration of diagnostic errors, with the male 90+ subgroup showing maximum risk for ST, SB, and PQTI detection. This pattern likely reflects the increased complexity of ECG interpretation in elderly populations, where multiple comorbidities, medications, and age-related cardiac changes can obscure diagnostic features36. Younger patients (<50 years) also exhibited elevated confusion risk levels, with the male 20–30 subgroup placing within the top three risk ranks for AF, SVT, PVC, AVB, and RBBB, and the female 20–30 subgroup demonstrating rank 1 risk for both AF and PR. This U-shaped age-performance relationship likely reflects the physiological complexity at both extremes: age-related cardiovascular changes and polypharmacy effects in elderly patients37, and developmental cardiac variations in younger patients38.
Additionally, the analysis demonstrates overall fairness with minimal sex-related disparities (AUROC differences <0.1 for most conditions across age groups). However, condition-specific sex differences emerged that warrant clinical attention. Notable sex-specific vulnerabilities included females showing higher confusion risk for AF and SVT in younger age groups (16–50 years) and exclusive risk patterns for LowQRS detection, while males demonstrated exclusive risk for SB and PAC in the same age range. Among elderly patients (70+ years), confusion risk was generally balanced between sexes for most conditions. These condition-specific sex differences suggest that while the model maintains overall sex equity, certain cardiac conditions may benefit from sex-stratified interpretation39,40, particularly in younger patient populations41.
The consistent performance across different care intensity levels provides important insights into the robustness and generalizability of our approach. The modest overall performance variations between outside inpatient, inpatient without ICU, and inpatient with ICU settings suggest that the contrastive learning framework develops stable feature representations that are robust to varying clinical acuity levels and patient complexity. This finding is particularly important given that ICU patients typically present with multiple comorbidities, complex medication regimens, and altered physiological states that can substantially affect ECG patterns42,43.
Systematic confusion pattern analysis reveals specific misclassification tendencies with important clinical implications. AF emerged as the most common false positive across multiple rhythm abnormalities, reflecting shared electrophysiological features—irregular R-R intervals that appear in various rhythm disturbances. The AF to SB confusion pattern provides valuable insights into our model’s learned representations and diagnostic limitations. The predominant misclassification of slow ventricular response AF as SB suggests that the model has effectively learned fundamental rhythm characteristics, particularly R wave patterns and RR interval features, enabling it to distinguish slow-rate from fast-rate conditions. This heart rate-centric approach mirrors early-stage clinical reasoning, where practitioners often rely on readily apparent temporal patterns. However, our findings reveal that the model may underemphasize subtle morphological features such as P wave patterns, limiting its ability to distinguish low-rate AF from SB. This limitation reflects a broader challenge in automated ECG interpretation: while deep learning models excel at capturing obvious temporal and amplitude patterns, they may struggle with the nuanced morphological features that expert cardiologists use to differentiate complex arrhythmias. The frequent bidirectional confusion between AF and STTC further indicates clinicians should exercise caution when these conditions appear together, as AF can cause secondary repolarization abnormalities44. For morphological conditions, both LVH and STTC detection frequently resulted in SB false positives, suggesting certain ECG features remain challenging to distinguish reliably due to overlapping voltage and repolarization patterns45,46.
The improved performance achieved through prompt optimization for STTC and LowQRS reveals a unique advantage of natural language supervision. Unlike supervised models, where performance improvements require new labeled data and complete retraining, our approach enables optimization through refined clinical descriptions. This flexibility allows the model to adapt to evolving clinical terminology and diagnostic criteria without architectural changes or additional data collection. Our results suggest that prompt effectiveness depends on linguistic diversity within the training corpus. The model successfully responds to alternative formulations when these patterns are adequately represented during training. For instance, we observed performance improvements using more specific electrophysiological descriptions compared to standard diagnostic terms. This finding opens avenues for systematic prompt optimization strategies, such as developing multiple inference variants that capture different clinical perspectives: standardized nomenclature (“atrial fibrillation”), mechanistic descriptions (“irregular rhythm without P waves”), or severity gradations (“paroxysmal” versus “persistent atrial fibrillation”). While these initial results are promising, comprehensive validation is needed to establish optimal prompt engineering strategies across different clinical contexts. The ability to enhance diagnostic performance through language refinement represents a fundamental departure from traditional AI systems that require retraining for updates47,48 and could enable continuous improvement aligned with evolving clinical practice through prompt adaptation rather than model redeployment49,50.
The practical advantages for clinical deployment directly address the scalability limitations of traditional supervised learning. While supervised approaches can handle multi-class classification in a single model, they require labeled training data for each condition and cannot diagnose conditions beyond their predefined classes51. Adding new conditions necessitates collecting labeled datasets, retraining the entire model, and revalidation, a process that becomes prohibitive for rare conditions or emerging diagnostic criteria52. In contrast, our single model provides diagnostic capabilities for a diverse range of conditions describable in text without requiring condition-specific training data. This zero-shot capability enables immediate adaptation to new diagnostic criteria or rare conditions without model retraining or data collection. For instance, if new cardiac biomarkers or diagnostic categories emerge, ECG-CLIP can theoretically evaluate them immediately through appropriate text prompts, while supervised models would require months or years of data collection and retraining. This flexibility is particularly valuable in addressing the long tail of rare cardiac conditions, where assembling sufficient labeled training data remains infeasible53.
Several important limitations must be acknowledged. Data and reference standard limitations include machine-generated ECG interpretation labels rather than systematic expert physician consensus, meaning our performance metrics represent comparisons against existing automated standards rather than definitive clinical truth. The high linguistic standardization (22.2% report-level diversity) constrains model exposure to varied clinical expression patterns, potentially limiting generalizability to institutions with diverse reporting practices.
Model performance limitations reveal fundamental challenges in detecting morphological abnormalities compared to rhythm abnormalities, suggesting current text supervision may be insufficient for comprehensive cardiac diagnosis. Systematic confusion patterns, particularly AF over-detection across multiple rhythm categories including SB, indicate limitations in distinguishing subtle rhythm characteristics. The framework lacks interpretability mechanisms essential for clinical trust and safe deployment in high-stakes medical environments.
Validation limitations include potential biases from specific institutional data sources that may not represent diverse populations or healthcare systems with different ECG acquisition protocols. While external validation demonstrated promising pediatric generalization despite adult-only training, substantial age-dependent performance variations highlight the complexity of applying adult-trained models directly to pediatric populations. Our validation approach was retrospective and conducted by the same investigative team using datasets from two hospital systems, which may not fully capture the complexities encountered in diverse clinical environments.
Despite these limitations, several promising enhancement opportunities have been identified that could address these challenges and advance the clinical applicability of contrastive learning approaches in cardiac diagnosis. Technical enhancements could potentially benefit from extended temporal resolution, such as 30-second recordings, which might improve SB detection, as standard 10-second captures may inadequately represent certain rhythm abnormalities. Enhanced semantic supervision through condition-specific morphological descriptors could potentially guide models toward clinically relevant waveform characteristics that current standardized reports might not adequately capture. Multi-shot learning and advanced prompting strategies may offer potential for improving morphological condition detection, building on our preliminary prompt engineering results.
Clinical integration improvements include incorporating comprehensive clinical context from electronic health records to provide more patient-specific information for model learning and increase report specificity beyond standardized templates, as our analysis demonstrated superior performance for rare versus standardized report templates. Systematic expert validation frameworks could assess the clinical relevance of diagnostic errors and guide evidence-based optimization strategies.
Validation priorities would benefit from prospective, independent, multi-center trials across diverse healthcare institutions to evaluate real-time clinical performance and help establish more comprehensive evidence for safe deployment. Future studies might include a systematic expert review of model outputs and broader assessment across more diverse patient populations and clinical settings.
In conclusion, this study demonstrates the potential of contrastive multimodal learning principles to enable diagnostic capabilities for a diverse range of cardiac conditions documented in clinical practice. ECG-CLIP represents a potential paradigm shift from multiple condition-specific models to a unified system adaptable through textual prompts, with the theoretical capability to interpret various cardiac conditions documented in corresponding reports. Our findings reveal important performance patterns: superior capability for rhythm abnormalities and conduction disorders compared to morphological conditions, U-shaped age-related performance with reduced accuracy in young and elderly patients, overall sex equity when age is not considered, and consistent performance across varying care intensity levels, though condition-specific sex-age subgroup biases emerged that warrant careful consideration during clinical implementation. External validation across different populations demonstrated robust generalization, while zero-shot performance in pediatric patients, though promising, highlights the need for age-specific validation and clinical safeguards before pediatric deployment. By potentially reducing dependency on condition-specific labeled datasets, this approach may help address fundamental scalability challenges in cardiovascular AI deployment, offering a potential pathway toward more accessible and adaptable cardiac diagnostic systems.
Methods
Study design
This was a contrastive multimodal learning-based diagnostic study for zero-shot cardiac condition detection from 12-lead ECG signals. The study design, implementation, and reporting followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence (TRIPOD + AI) statement54 to ensure transparency and reproducibility of our AI model development and evaluation process.
Datasets
The ECG-CLIP model was trained on the MIMIC-IV-ECG dataset, a publicly available dataset of 12-lead ECGs with corresponding summary reports23,24. The ECGs were collected from Beth Israel Deaconess Medical Center (BIDMC) emergency department, hospital (including the ICU), and outpatient care centers. The dataset contained 800,035 12-lead ECGs, each sampled at 500 Hz for 10 seconds and identified by a unique study_id. The dataset provided summary reports for these ECGs generated by ECG machines from various FDA-cleared manufacturers. One ECG lacked a corresponding report and was excluded from analysis, resulting in 800,034 ECG-text pairs of 12-lead ECG signals and summary reports (unique study_id) from 161,352 unique patients (unique subject_id). The presence of multiple studies per patient indicates longitudinal data collection for some individuals. These machine-generated reports exhibit high linguistic standardization at multiple levels: Report-level analysis identified 177,730 unique report templates among 800,034 total reports (22.2% diversity ratio), while sentence-level analysis revealed 2,859 unique sentences/phrases among 2,831,408 total sentences (0.1% diversity ratio).
To assess the quality of machine-generated ECG interpretations, we conducted a systematic validation analysis comparing ECG report-derived labels with clinical discharge diagnoses from the MIMIC-IV-ECG-Ext-ICD database55,56 for inpatient cases. This analysis revealed substantial agreement between ECG report-driven labels and physician-adjudicated discharge diagnoses, with match scores ≥0.84 for all 12 ICD-10 defined cardiac conditions, and 9 out of 12 achieving match scores ≥0.90. Detailed methodology and results of this quality assessment were provided in Supplementary (Dataset quality assessment section).
This dataset was then divided into training, validation, and internal testing sets in an 8:1:1 ratio based on unique patients to prevent data leakage. The training set contained 638,376 ECG-text pairs from 129,081 patients for training the model parameters. The validation set contained 80,974 ECG-text pairs from 16,135 patients for optimal checkpoint and decision threshold selection. The internal testing set included 80,684 ECG-text pairs from 16,136 patients for performance evaluation on data from the same distribution as the training set.
Demographic information (age and sex) for each ECG recording was extracted from MIMIC-IV-ECG-Ext-ICD with a signed data use agreement and careful matching of study_id between MIMIC-IV-ECG and MIMIC-IV-ECG-Ext-ICD databases. Based on individual ECG signal (study_id), patient ages at the time of each ECG ranged from 12 to 101 years (mean ± standard deviation (SD): 64.3 ± 17.1 years), with 56.0% of them from patients aged 45–74 years. ECGs from male subjects comprised 51.1% of the dataset. Detailed age and sex distributions for the training, validation, and internal testing sets were presented in Table 1. In addition, for inpatient ECG recordings, we extracted clinical characteristics including comorbidities, medications, and laboratory values from the MIMIC-IV-3.1 clinical database57. Detailed extraction methodologies were provided in the Supplementary Clinical Characteristics Extraction section.
For external validation, we utilized a publicly available large-scale 12-lead ECG database for arrhythmia study (hereafter “LSAS-ECG dataset”)25,26, constructed by Ningbo First Hospital of Zhejiang University, China, to assess model generalizability across different institutions and populations. This dataset contained 45,152 unique 12-lead ECGs with 10-second duration sampled at 500 Hz, with each ECG from a different unique patient. Cardiac conditions were labeled by licensed physicians using a systematic approach: initial labeling by one physician, secondary validation by another, and final adjudication by a senior physician in cases of disagreement. One patient lacking diagnostic label information was excluded from the analysis. Demographic data were extracted from ECG header files. Based on individual ECGs, patient ages ranged from 0 to 89 years (mean ± SD: 58.3 ± 19.7 years), with 58.4% of them from patients aged 45–74 years, slightly younger than the MIMIC-IV-ECG cohort. ECGs from male patients represented 56.4% of this external validation dataset (Table 1).
ECG signal preprocessing
Signal preprocessing began with noise reduction using a fourth-order Butterworth bandpass filter (0.5–50 Hz). The filtered signals were downsampled from 500 Hz to 100 Hz, yielding a 12 × 1000-dimensional array for each ECG recording. Lead-wise min-max normalization was applied to standardize the signal amplitudes. To ensure model compatibility, missing leads were handled with an imputation strategy: any missing leads were imputed with the mean of the available leads rather than zeros, preserving the temporal continuity of the signals and preventing training instabilities. In cases where all leads were lost, the signals were imputed with small-magnitude Gaussian noise (μ = 0, σ = 10⁻⁶) to maintain numerical stability during model inference. Preprocessed signals were stored in HDF5 format for efficient data handling during training.
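A minimal sketch of this pipeline is shown below; the function name and the use of FFT-based resampling are illustrative choices, not taken from the original implementation:

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample

def preprocess_ecg(signal, fs_in=500, fs_out=100, low=0.5, high=50.0):
    """Bandpass-filter, downsample, and min-max normalize a 12-lead ECG."""
    # Fourth-order Butterworth bandpass (0.5-50 Hz), applied zero-phase
    b, a = butter(4, [low, high], btype="band", fs=fs_in)
    filtered = filtfilt(b, a, signal, axis=1)
    # Downsample 500 Hz -> 100 Hz, yielding a 12 x 1000 array
    n_out = int(signal.shape[1] * fs_out / fs_in)
    down = resample(filtered, n_out, axis=1)
    # Lead-wise min-max normalization (guard against flat leads)
    mn = down.min(axis=1, keepdims=True)
    mx = down.max(axis=1, keepdims=True)
    rng = np.where(mx - mn > 0, mx - mn, 1.0)
    return (down - mn) / rng

ecg = np.random.randn(12, 5000)   # 10 s of 12-lead signal at 500 Hz
out = preprocess_ecg(ecg)
print(out.shape)  # (12, 1000)
```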
Text preprocessing
Text preprocessing included two sequential stages: cleaning and tokenization. The cleaning stage removed special characters, dashes, parentheses, HTML tags, and anomalous text patterns where present. All text was converted to lowercase, leading/trailing whitespace was removed, and multiple spaces were consolidated into single spaces. Tokenization employed CLIP’s byte pair encoding with a vocabulary size of 49,408. Text sequences exceeding the model’s context length were truncated to (context_length - 1) tokens, reserving space for special tokens. Each sequence began with the <|startoftext|> token and ended with the <|endoftext|> token; when truncation was required, the last token was replaced with the <|endoftext|> token to ensure proper sequence termination.
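A sketch of the cleaning and truncation logic, assuming CLIP's standard special-token IDs (49406/49407); the regular expressions are illustrative simplifications of the cleaning rules described above:

```python
import re

SOT, EOT = 49406, 49407          # CLIP <|startoftext|> / <|endoftext|> IDs
CONTEXT_LENGTH = 77

def clean_report(text):
    """Cleaning stage: strip markup and special characters, lowercase,
    collapse whitespace. Regexes are illustrative, not the exact rules."""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags
    text = re.sub(r"[-()\[\]{}*_#]", " ", text)   # dashes, parentheses, etc.
    return re.sub(r"\s+", " ", text.lower()).strip()

def pack_tokens(body_ids, context_length=CONTEXT_LENGTH):
    """Wrap token IDs with special tokens, truncate, and zero-pad."""
    tokens = [SOT] + list(body_ids) + [EOT]
    if len(tokens) > context_length:
        tokens = tokens[:context_length]
        tokens[-1] = EOT          # replace last token to terminate properly
    return tokens + [0] * (context_length - len(tokens))

print(clean_report("Atrial  Fibrillation -- (rapid) <b>rate</b>"))
# atrial fibrillation rapid rate
```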
Linguistic standardization analysis
We assessed standardization at both report and sentence levels. Report-level diversity was calculated as the ratio of unique complete report templates to total reports. Sentence-level analysis parsed individual report segments from structured machine output (report_0 through report_17 columns), treating each non-empty segment as a distinct sentence/phrase unit, then calculated the ratio of unique normalized sentences to total sentences.
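The diversity ratio is simply the count of unique items over the total; a toy example with hypothetical reports (not drawn from MIMIC-IV-ECG) illustrates both levels:

```python
def diversity_ratio(items):
    """Unique items divided by total items."""
    return len(set(items)) / len(items)

# Hypothetical report texts for illustration only
reports = ["sinus rhythm. normal ecg.",
           "sinus rhythm. normal ecg.",
           "atrial fibrillation. rapid ventricular response.",
           "sinus bradycardia."]
print(diversity_ratio(reports))  # 0.75

# Sentence-level: split reports into normalized sentence units
sentences = [s.strip() for r in reports for s in r.split(".") if s.strip()]
print(round(diversity_ratio(sentences), 2))  # 0.71
```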
Model architecture and training
We implemented the CLIP contrastive learning framework, modifying the encoder architecture to process physiological signals: a modified Vision Transformer for ECG signal encoding and a Transformer for text encoding. The text encoder consists of 12 layers with a width of 512 and 8 attention heads, processing text through byte-pair encoding with a maximum context length of 77 tokens, consistent with the original CLIP configuration.
Architecture modifications were implemented to accommodate 12-lead ECG temporal signals. A fundamental modification involved replacing the original 2D convolutional operations designed for image processing with 1D convolutional operations optimized for temporal signal processing. The original patch embedding layer of the Vision Transformer was replaced with a custom front-end consisting of three 1D convolutional layers. The first convolutional layer processed input ECG signals (12 × 1000 dimensions) using a patch size of 32 (kernel size = 32, stride = 32), followed by two additional convolutional layers (kernel size = 3, stride = 2) equipped with batch normalization and ReLU activation. The subsequent transformer architecture (attention blocks, layer normalization, and projection layers) remained identical to the original CLIP ViT-B/32 structure.
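A PyTorch sketch of this front-end follows; the padding of the two stride-2 layers and the channel width of 768 (ViT-B/32's embedding width) are assumptions not specified above:

```python
import torch
import torch.nn as nn

class ECGFrontEnd(nn.Module):
    """Sketch of the 1D convolutional front-end replacing the ViT-B/32
    patch embedding. Padding and channel widths are illustrative."""
    def __init__(self, in_leads=12, width=768):
        super().__init__()
        self.conv1 = nn.Conv1d(in_leads, width, kernel_size=32, stride=32)  # patch size 32
        self.conv2 = nn.Conv1d(width, width, kernel_size=3, stride=2, padding=1)
        self.bn2 = nn.BatchNorm1d(width)
        self.conv3 = nn.Conv1d(width, width, kernel_size=3, stride=2, padding=1)
        self.bn3 = nn.BatchNorm1d(width)
        self.act = nn.ReLU()

    def forward(self, x):                         # x: (B, 12, 1000)
        x = self.conv1(x)                         # (B, width, 31)
        x = self.act(self.bn2(self.conv2(x)))     # (B, width, 16)
        x = self.act(self.bn3(self.conv3(x)))     # (B, width, 8)
        return x.permute(0, 2, 1)                 # (B, tokens, width) for the transformer

tokens = ECGFrontEnd()(torch.randn(2, 12, 1000))
print(tokens.shape)  # torch.Size([2, 8, 768])
```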
The training framework employed contrastive learning to align ECG signals with corresponding summary reports in a shared embedding space. For each batch of N ECG-text pairs, the model computed cosine similarity scores across all N × N possible combinations of ECG and text embeddings after L2 normalization. The training objective utilized symmetric cross-entropy loss applied bidirectionally, from ECG-to-text and text-to-ECG directions, with the final loss calculated as the average of both directional losses.
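The symmetric objective can be written in a few lines; this is a sketch in which `logit_scale` appears as a fixed scalar rather than the learned temperature parameter:

```python
import torch
import torch.nn.functional as F

def clip_loss(ecg_emb, txt_emb, logit_scale):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix,
    averaged over the ECG-to-text and text-to-ECG directions."""
    ecg = F.normalize(ecg_emb, dim=-1)       # L2 normalization
    txt = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * ecg @ txt.t()     # (N, N) similarity scores
    targets = torch.arange(len(logits))      # matched pairs lie on the diagonal
    loss_e2t = F.cross_entropy(logits, targets)
    loss_t2e = F.cross_entropy(logits.t(), targets)
    return (loss_e2t + loss_t2e) / 2

loss = clip_loss(torch.randn(16, 512), torch.randn(16, 512), 100.0)
print(loss.item() > 0)  # True
```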
A two-phase training strategy was implemented to optimize model performance while preserving pre-trained knowledge. Phase 1 employed selective parameter training where all pre-trained ViT-B/32 weights were frozen, with only the newly added convolutional layers and their associated batch normalization parameters being optimized. The three new convolutional layers were initialized using Kaiming normal initialization with ‘fan_out’ mode to account for ReLU nonlinearity, while all other pre-trained weights were preserved from the original CLIP model. This phase utilized an SGD optimizer with a learning rate of 0.001, a momentum of 0.9, and a batch size of 32, training for 30 epochs. Model checkpoints were saved every 1000 training steps and evaluated on the validation set to monitor convergence.
Phase 2 initiated comprehensive fine-tuning using the final checkpoint from the 30-epoch Phase 1 training as initialization, where all model parameters were made trainable. This phase employed a more sophisticated training configuration with AdamW optimizer (learning rate 0.0001, weight decay 0.01) and reduced batch size of 16 for enhanced convergence stability. Advanced learning rate scheduling was implemented, beginning with a linear warmup phase over the first 500 training steps (start factor 0.1), followed by cosine annealing for the remaining epochs. Additional training stabilization techniques included gradient norm clipping (max norm 1.0) and validation monitoring with loss values. Phase 2 training proceeded for 10 epochs, with checkpoints saved every 1000 training steps. Post-training model selection was performed by evaluating all saved checkpoints on the validation set, with the checkpoint achieving the minimum validation loss selected as the final optimal model.
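The Phase 2 optimization recipe above can be sketched with PyTorch's built-in schedulers; the stand-in model and the total step count are illustrative:

```python
import torch

model = torch.nn.Linear(512, 512)   # stand-in for the full ECG-CLIP model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

warmup_steps, total_steps = 500, 10_000   # total_steps is illustrative
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        # Linear warmup over the first 500 steps (start factor 0.1)
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                          total_iters=warmup_steps),
        # Cosine annealing for the remaining steps
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                   T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)

for step in range(3):                # abbreviated training loop
    loss = model(torch.randn(4, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad clipping
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```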
In addition, we trained a model with an identical architecture but using random initialization with standard best practices (Xavier uniform for linear layers, Kaiming normal for convolutional layers, normal distribution for embedding layers). Considering the model was trained from scratch, we set more epochs (30) while keeping all other configurations identical.
Experiments were conducted using PyTorch 2.3.1 with CUDA 11.8 acceleration on an NVIDIA GeForce RTX 4070Ti SUPER GPU. The training time of the model with all parameters trainable was approximately 2.5 hours per epoch.
Applying ECG-CLIP to cardiac condition detection
Although ECG-CLIP was trained to encode ECG signals and summary reports as semantically meaningful vector embeddings, the model was subsequently adapted for cardiac condition detection tasks to evaluate its zero-shot performance capabilities. The trained ECG-CLIP model enables flexible detection through contrastive similarity computation between ECG signal embeddings and textual label embeddings within the shared latent space.
The detection process leverages the dual-encoder architecture, where ECG signals are processed through the modified 1D convolutional front-end and transformer layers to generate signal embeddings, while textual labels are encoded through the transformer-based text encoder to produce semantic embeddings. The model computes cosine similarity matrices between ECG and text embeddings, with learned temperature scaling automatically applied through the trained logit_scale parameter to generate calibrated similarity scores. Final detection decisions are derived through softmax normalization of the similarity logits, yielding probability distributions across candidate labels.
We employed textual prompts formatted as positive condition descriptors (e.g., “atrial fibrillation or atrial flutter”) paired with the negative prompt “other.” We evaluated 18 cardiac conditions on the internal testing set (MIMIC-IV-ECG) and externally validated 14 conditions using the LSAS-ECG dataset. For MIMIC-IV-ECG, condition labels were extracted from summary reports and binarized (presence=1, absence=0) to establish reference points for performance evaluation. The mapping between extracted report text and cardiac conditions, and their corresponding positive prompts for inference, is detailed in Supplementary Table 1. For LSAS-ECG, ground truth labels were extracted from ECG header files recorded using SNOMED-CT diagnostic codes, with condition-to-code mappings provided in Supplementary Table 3.
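A sketch of the inference step, using toy embeddings in place of the encoder outputs; the helper name `zero_shot_probability` is ours, not from the released code:

```python
import torch
import torch.nn.functional as F

def zero_shot_probability(ecg_emb, prompt_embs, logit_scale):
    """Softmax over similarity logits between one ECG embedding and the
    positive/negative prompt embeddings; returns P(positive prompt)."""
    ecg = F.normalize(ecg_emb, dim=-1)
    prompts = F.normalize(prompt_embs, dim=-1)
    logits = logit_scale * prompts @ ecg      # one logit per candidate prompt
    return torch.softmax(logits, dim=0)[0].item()

# Toy embeddings standing in for encoder outputs (illustrative only)
ecg = torch.randn(512)
prompts = torch.stack([ecg + 0.1 * torch.randn(512),  # positive prompt near the ECG
                       torch.randn(512)])              # negative prompt "other"
p_af = zero_shot_probability(ecg, prompts, logit_scale=100.0)
print(0.0 <= p_af <= 1.0)  # True
```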
Performance evaluation
Retrieval metrics
We evaluated the quality of learned ECG-text embeddings through bidirectional retrieval tasks comprising signal-to-text and text-to-signal retrieval. To ensure statistical robustness and eliminate sampling bias, we conducted a multi-sampling evaluation framework with 100 independent runs, each utilizing different random seeds. Given the standardized nature of clinical documentation identified in our corpus analysis, we stratified the retrieval analysis by template frequency patterns. We categorized unique complete clinical reports (templates) into three groups based on their occurrence frequency in the internal testing set: rare templates (frequency = 1, n = 22,722 ECGs), common templates (frequency = 2–10, n = 15,851 ECGs), and frequent templates (frequency > 10, n = 42,111 ECGs). For each frequency category and run, we randomly sampled 3,000 ECGs to ensure adequate representation while maintaining computational efficiency. We assessed performance using standard information retrieval metrics: Recall@k (k ∈ {1, 5, 10}), measuring the proportion of queries where the correct match appears in the top-k retrievals, and mean reciprocal rank (MRR), quantifying the average inverse rank of the first correct match58,59.
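Given the rank of the correct match for each query, Recall@k and MRR reduce to simple aggregates; the ranks below are hypothetical:

```python
import numpy as np

def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears in the top-k."""
    return float(np.mean(np.asarray(ranks) <= k))

def mean_reciprocal_rank(ranks):
    """Average inverse rank of the first correct match."""
    return float(np.mean(1.0 / np.asarray(ranks)))

# Ranks of the correct item for five hypothetical queries (1 = best)
ranks = [1, 3, 2, 10, 1]
print(recall_at_k(ranks, 1))   # 0.4
print(recall_at_k(ranks, 5))   # 0.8
print(round(mean_reciprocal_rank(ranks), 3))  # 0.587
```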
Semantic clustering evaluation
To assess whether our model learned clinically meaningful semantic relationships, we quantified embedding clustering quality by analyzing cosine similarities within groups of ECGs sharing identical clinical reports. We identified these groups through exact string matching of summary reports in the internal testing set. For each run, we randomly sampled 100 identical-report groups, limiting each group to a maximum of 50 ECGs to ensure computational feasibility. We computed pairwise cosine similarities among all group members and compared these intra-group similarities against a baseline of similarities computed between 8,000 randomly paired ECGs per run. This approach aims to test whether ECGs with identical clinical interpretations cluster more tightly in embedding space than random pairs.
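The comparison reduces to mean pairwise cosine similarity within a group versus across random pairs; the synthetic embeddings below stand in for model outputs:

```python
import numpy as np

def mean_pairwise_cosine(embs):
    """Mean pairwise cosine similarity within a group of embeddings."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = e @ e.T
    iu = np.triu_indices(len(e), k=1)   # exclude self-similarities
    return float(sim[iu].mean())

rng = np.random.default_rng(0)
center = rng.normal(size=128)
group = center + 0.1 * rng.normal(size=(20, 128))   # "identical-report" cluster
random_pairs = rng.normal(size=(20, 128))           # random-pair baseline
print(mean_pairwise_cosine(group) > mean_pairwise_cosine(random_pairs))  # True
```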
Overall metrics
We employed a comprehensive metrics suite that captures multiple dimensions of model performance on both internal testing and external validation sets. Discrimination metrics included AUROC for overall discrimination capability across all decision thresholds and AUPRC for performance assessment under class imbalance conditions, particularly critical for rare cardiac conditions. Classification performance was evaluated through accuracy, sensitivity, specificity, precision, and F1 score.
Comparative analysis
To assess performance relative to supervised methods, we implemented two established ECG interpretation models representing the evolution of deep learning approaches in this field: Ribeiro et al.2 using a residual network architecture for unified 12-lead ECG processing, and Lu et al.10 employing an isolation-integration strategy with class weighting for imbalanced datasets. These models were selected as methodologically complementary supervised approaches with demonstrated cardiologist-level performance that could evaluate overlapping cardiac conditions with our method. Comparison was limited to six cardiac conditions (AF, ST, SB, RBBB, LBBB, and 1dAVb) representing major categories of cardiac abnormalities (rhythm disorders and conduction blocks), as these supervised models cannot diagnose conditions beyond their original scope. This limitation illustrates the fundamental scalability constraint that our zero-shot approach addresses: to the best of our knowledge, no existing supervised model can simultaneously diagnose the full range of 18 cardiac conditions evaluated by ECG-CLIP. Both baseline models were trained on our MIMIC-IV-ECG training set using identical data preprocessing, train/validation/test splits, and evaluation metrics. Training procedures followed the original publication specifications with hyperparameter optimization on our validation set.
Demographic stratification
To evaluate the algorithmic fairness of ECG-CLIP and identify potential performance disparities, we conducted a comprehensive analysis stratified by age and sex subgroups for each condition detection task. Age stratification included 12–44, 45–74, and 75+ years subgroups for internal testing, with an additional 0–11 years pediatric subgroup for external validation. These broad age-stratified groups align with established clinical cardiovascular risk frameworks10 and provide sufficient statistical power for performance metric calculations. Sex stratification comprised male and female subgroups.
Care intensity stratification
To assess model generalizability across different clinical care settings, we stratified the internal testing set into three care intensity groups based on clinical context availability and ICU status: (1) inpatient with ICU stay, (2) inpatient without ICU stay, and (3) outside inpatient encounters.
Confusion pattern analysis
We conducted systematic error analysis to identify potential clinical risks and model limitations. False positive analysis involved identifying non-target conditions most frequently misclassified as the target condition and analyzing co-occurrence patterns among confused conditions. To quantify demographic-specific confusion patterns, we developed a confusion risk assessment framework using physiologically-informed fine age bins: 0–1, 1–3, 3–5, 5–8, 8–12, 12–16, 16–20, 20–30, 30–40, 40–50, 50–60, 60–70, 70–80, 80–90, and 90+ years. This granular stratification, based on established pediatric cardiology developmental stages and known clinical complexity patterns across the lifespan, was designed to identify high-risk demographic subgroups requiring heightened clinical vigilance during AI deployment33. Each age bin was further stratified by sex, creating up to 30 distinct demographic subgroups for analysis.
Pediatric age stratification analysis
Recognizing the substantial physiological differences across pediatric developmental stages, we implemented finer age stratification within patients <12 years using physiologically-informed age bins: 0–1, 1–3, 3–5, 5–8, and 8–12 years, based on established pediatric cardiology developmental stages33. Age data in the LSAS-ECG dataset were recorded as integer years without monthly precision, limiting finer subdivision within critical developmental periods. Given the relatively sparse occurrence of cardiac conditions within these fine age subdivisions (Supplementary Table 25), we employed confusion matrices to provide a transparent visualization of model behavior. This approach enabled direct examination of FP and FN patterns for each condition across pediatric age groups despite limited sample sizes in certain age-condition combinations.
Prompt optimization
Based on initial performance evaluation, we observed that morphological abnormalities, particularly STTC and LowQRS, demonstrated suboptimal detection performance compared to rhythm abnormalities. To explore whether more anatomically specific descriptions could improve diagnostic accuracy for these challenging conditions, we conducted an exploratory proof-of-concept prompt optimization analysis. This investigation was motivated by the presence of detailed anatomical specifications in the training data reports for these specific conditions. Optimized prompts were developed: for STTC, we expanded “st-t changes” to “st-t changes such as inferior st-t changes, lateral st-t changes, extensive st-t changes, septal st-t changes”; for LowQRS, we expanded “low qrs voltages” to “low qrs voltages such as low qrs voltages in precordial leads, low qrs voltages in limb leads, generalized low qrs voltages”. Performance comparison between standard and optimized prompts was conducted using identical evaluation procedures to isolate the impact of prompt specificity.
Statistical analysis
All statistical analyses were performed using Python 3.9 with NumPy 1.26.3, SciPy 1.13.1, and scikit-learn 1.6.1. Continuous demographic and clinical characteristics were presented as mean ± standard deviation, and categorical variables as frequency (percentage). Parametric or non-parametric tests were selected according to data distribution characteristics. Prior to comparative analysis, normality of continuous variables was assessed using the Shapiro-Wilk test, with subsequent Student's t-tests or Mann-Whitney U tests for pairwise comparisons, as appropriate. Comparisons of categorical variables employed either Pearson's chi-square test or Fisher's exact test, depending on expected cell frequencies in the respective 2 × 2 contingency tables; Fisher's exact test was specifically chosen when any expected frequency was <5 to maintain statistical validity. Statistical significance was defined as a p-value < 0.05.
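The test-selection logic above can be sketched with SciPy as follows. This is a hedged re-implementation from the description, not the authors' analysis code, and the example tables are assumptions.

```python
import numpy as np
from scipy import stats

def compare_continuous(x, y, alpha=0.05):
    """Shapiro-Wilk normality check on both samples, then an unpaired
    t-test if both appear normal, otherwise a Mann-Whitney U test."""
    normal = (stats.shapiro(x).pvalue > alpha
              and stats.shapiro(y).pvalue > alpha)
    if normal:
        return "t-test", stats.ttest_ind(x, y).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(
        x, y, alternative="two-sided").pvalue

def compare_categorical(table):
    """Pearson chi-square, falling back to Fisher's exact test when any
    expected cell frequency is below 5."""
    res = stats.chi2_contingency(table)
    if (res.expected_freq < 5).any():
        return "Fisher's exact", stats.fisher_exact(table).pvalue
    return "chi-square", res.pvalue
```

For example, `compare_categorical(np.array([[2, 3], [3, 2]]))` selects Fisher's exact test, since every expected cell frequency is 2.5.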
Performance metrics were calculated according to standard definitions: accuracy (proportion of correct predictions), sensitivity (true positive rate), specificity (true negative rate), precision (positive predictive value), and F1-score (harmonic mean of precision and sensitivity). AUROC and AUPRC were computed using trapezoidal integration.
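With scikit-learn (an assumed toolchain consistent with the Statistical analysis section), both areas can be obtained by trapezoidal integration of the underlying curves via `sklearn.metrics.auc`:

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

def auroc_auprc(y_true, y_prob):
    """AUROC and AUPRC via trapezoidal integration of the ROC and
    precision-recall curves."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    return auc(fpr, tpr), auc(recall, precision)

# A perfectly separable toy example: both areas equal 1.0.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9])
auroc, auprc = auroc_auprc(y_true, y_prob)
```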
The optimal decision threshold for each cardiac condition was determined through a sensitivity-constrained approach on the validation set. This methodology prioritized the detection of positive cases, reflecting the critical importance of comprehensive cardiac screening in clinical practice. Two primary considerations informed this approach: first, the clinical implications of undetected cardiac conditions warrant particular attention, as delayed diagnosis may lead to disease progression and compromised patient outcomes60; second, the substantial class imbalance inherent in our dataset necessitated careful threshold calibration. The threshold selection procedure employed a two-stage optimization strategy. Initially, all candidate thresholds achieving a minimum sensitivity of 0.8 were identified. Subsequently, the threshold yielding the maximum F1-score within this constrained set was selected, thereby optimizing overall classification performance while maintaining robust detection capabilities. Where no threshold satisfied the sensitivity constraint, the algorithm defaulted to the threshold that maximized the F1-score without constraints.
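The two-stage selection can be sketched as follows. This is a minimal re-implementation from the description above, not the authors' code, and the toy labels and scores are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_true, y_prob, min_sensitivity=0.8):
    """Sensitivity-constrained threshold selection: among thresholds
    reaching the sensitivity floor, maximize F1; if none qualifies,
    fall back to the unconstrained F1 maximizer."""
    best = fallback = (None, -1.0)           # (threshold, F1)
    for t in np.unique(y_prob):
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        sens = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = f1_score(y_true, y_pred, zero_division=0)
        if f1 > fallback[1]:
            fallback = (t, f1)               # unconstrained fallback
        if sens >= min_sensitivity and f1 > best[1]:
            best = (t, f1)                   # constrained optimum
    return best[0] if best[0] is not None else fallback[0]

# Perfectly separable toy scores: the selected threshold is 0.5.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9])
threshold = select_threshold(y_true, y_prob)
```

In practice this procedure would be run once per cardiac condition on validation-set probabilities, and the resulting thresholds applied unchanged to the test sets.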
Confidence intervals (95% CI) were calculated using bootstrap resampling with 1000 iterations, sampling with replacement while maintaining original class proportions. Random seeds were fixed to ensure reproducibility. To evaluate rank consistency between internal and external validation metrics, Spearman rank correlation coefficients (ρ) were computed with two-tailed p-values, considering p < 0.05 as statistically significant. This approach assesses whether conditions that are challenging to diagnose in one clinical setting remain relatively challenging in another, focusing on the preservation of relative diagnostic difficulty ordering across datasets.
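The class-stratified bootstrap and the rank-consistency check can be sketched as follows; the synthetic labels, scores, and the per-condition AUROC lists are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import spearmanr

def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, seed=42):
    """95% CI for AUROC via bootstrap resampling with replacement
    within each class, preserving original class proportions."""
    rng = np.random.default_rng(seed)        # fixed seed for reproducibility
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == 0)
    scores = []
    for _ in range(n_boot):
        idx = np.concatenate([rng.choice(pos, size=len(pos)),
                              rng.choice(neg, size=len(neg))])
        scores.append(roc_auc_score(y_true[idx], y_prob[idx]))
    return np.percentile(scores, [2.5, 97.5])

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], [70, 30])
y_prob = 0.3 * y_true + 0.7 * rng.random(100)    # imperfect classifier
lo, hi = bootstrap_auroc_ci(y_true, y_prob)

# Rank consistency between hypothetical internal and external AUROCs.
rho, p = spearmanr([0.95, 0.90, 0.85, 0.80], [0.93, 0.88, 0.86, 0.79])
```

A high Spearman ρ across the 18 conditions, as computed in the last line, indicates that conditions remain comparably easy or hard to diagnose across clinical settings even if absolute performance shifts.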
The confusion risk score for each demographic (sex-age) subgroup was calculated as a weighted combination of false positive and false negative rates using the formula: confusion risk score = (false positive rate × 0.4) + (false negative rate × 0.6), where rates were computed within each specific demographic stratum. Subgroups were ranked by confusion risk score, with rank 1 indicating the highest risk.
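The score and ranking follow directly from the stated formula; the subgroup rates below are hypothetical values for illustration only.

```python
def confusion_risk_score(fpr, fnr):
    """Weighted confusion risk: false negatives weighted 0.6,
    false positives 0.4, per the screening-oriented rationale."""
    return 0.4 * fpr + 0.6 * fnr

# Hypothetical (FP rate, FN rate) per sex-age subgroup.
subgroups = {
    "F 0-1": (0.10, 0.30),
    "M 40-50": (0.05, 0.10),
    "F 80-90": (0.20, 0.25),
}
scores = {name: confusion_risk_score(*rates)
          for name, rates in subgroups.items()}
ranked = sorted(scores, key=scores.get, reverse=True)   # rank 1 = highest risk
```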
Data availability
The data utilized in this study are publicly available from PhysioNet: https://physionet.org/content/mimic-iv-ecg/1.0/ (MIMIC-IV-ECG); https://physionet.org/content/mimic-iv-ecg-ext-icd-labels/1.0.0/ (MIMIC-IV-ECG-Ext-ICD); https://physionet.org/content/mimiciv/3.1/ (MIMIC-IV v3.1); https://physionet.org/content/ecg-arrhythmia/1.0.0/ (A large-scale 12-lead electrocardiogram database for arrhythmia study).
Code availability
The code generated and used during the current study is available from the corresponding authors on reasonable request.
References
Maron, B. J. et al. Assessment of the 12-lead ECG as a screening test for detection of cardiovascular disease in healthy general populations of young people (12–25 Years of Age): a scientific statement from the American Heart Association and the American College of Cardiology. Circulation 130, 1303–1334 (2014).
Ribeiro, A. H. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat. Commun. 11, 1760 (2020).
Yagi, R., Mori, Y., Goto, S., Iwami, T. & Inoue, K. Routine electrocardiogram screening and cardiovascular disease events in adults. JAMA Intern. Med. 184, 1035–1044 (2024).
Brosnan, M. et al. Modest agreement in ECG interpretation limits the application of ECG screening in young athletes. Heart Rhythm 12, 130–136 (2015).
Cook, D. A., Oh, S. & Pusic, M. V. Accuracy of physicians’ electrocardiogram interpretations: a systematic review and meta-analysis. JAMA Intern. Med. 180, 1461–1471 (2020).
Fye, W. B. Introduction: the origins and implications of a growing shortage of cardiologists. J. Am. Coll. Cardiol. 44, 221–232 (2004).
Attia, Z. I. et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet 394, 861–867 (2019).
Lee, M. S. et al. Artificial intelligence applied to electrocardiogram to rule out acute myocardial infarction: the ROMIAE multicentre study. Eur. Heart J. ehaf004, 1917–1929 (2025).
Lin, C. et al. Artificial intelligence–powered rapid identification of ST-elevation myocardial infarction via electrocardiogram (ARISE) — a pragmatic randomized controlled trial. NEJM AI. 1 (2024).
Lu, L. et al. Decoding 2.3 million ECGs: interpretable deep learning for advancing cardiovascular diagnosis and mortality risk stratification. Eur. Heart J. Digit. Health 5, 247–259 (2024).
Benjamens, S., Dhunnoo, P. & Meskó, B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit. Med. 3, 118 (2020).
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Attia, Z. I., Harmon, D. M., Behr, E. R. & Friedman, P. A. Application of artificial intelligence to the electrocardiogram. Eur. Heart J. 42, 4717–4730 (2021).
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36 (2019).
Vrudhula, A. et al. Impact of case and control selection on training artificial intelligence screening of cardiac amyloidosis. JACC Adv. 3, 100998 (2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 8748–8763 (2021).
Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
Christensen, M., Vukadinovic, M., Yuan, N. & Ouyang, D. Vision–language foundation model for echocardiogram interpretation. Nat. Med. 30, 1481–1488 (2024).
Kligfield, P. Recommendations for the standardization and interpretation of the electrocardiogram: part I. Circulation 115, 1306–1324 (2007).
Balady, G. J. et al. COCATS 4 Task Force 3: training in electrocardiography, ambulatory electrocardiography, and exercise testing. J. Am. Coll. Cardiol. 65, 1763–1777 (2015).
de Hond, A. A. H. et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit. Med. 5, 2 (2022).
Sokol, K., Fackler, J. & Vogt, J. E. Artificial intelligence should genuinely support clinical reasoning and decision making to bridge the translational gap. NPJ Digit. Med. 8, 345 (2025).
Gow, B. et al. MIMIC-IV-ECG: diagnostic electrocardiogram matched subset (version 1.0). PhysioNet https://doi.org/10.13026/4nqg-sb35 (2023).
Goldberger, A. et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000).
Zheng, J., Guo, H. & Chu, H. A large scale 12-lead electrocardiogram database for arrhythmia study (version 1.0.0). PhysioNet. https://doi.org/10.13026/wgex-er52 (2022).
Zheng, J. et al. Optimal multi-stage arrhythmia classification approach. Sci. Rep. 10, 2898 (2020).
Zhou, J., Duan, Y., Chang, Y., Wang, Y. & Lin, C. BELT: bootstrapped EEG-to-language training by natural language supervision. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 3278–3288 (2024).
Ito, A., Dohi, K. & Kawaguchi, Y. CLaSP: learning concepts for time-series signals from natural language supervision. Preprint at https://arxiv.org/abs/2411.08397 (2024).
Bleijendaal, H. et al. Clinical applicability of artificial intelligence for patients with an inherited heart disease: a scoping review. Trends Cardiovasc. Med. 33, 274–282 (2023).
Katoh, T. et al. Expert consensus document on automated diagnosis of the electrocardiogram: The task force on automated diagnosis of the electrocardiogram in Japan part 3: Inappropriate internal processing in the automatic diagnosis of electrocardiograms and its problems. J. Arrhythmia 40, 17 (2023).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. ICML 149, 1597–1607 (2020).
Dickinson, D. F. The normal ECG in childhood and adolescence. Heart 91, 1626–1630 (2005).
Rijnbeek, P. R., Witsenburg, M., Schrama, E., Hess, J. & Kors, J. A. New normal limits for the paediatric electrocardiogram. Eur. Heart J. 22, 702–711 (2001).
Schwartz, P. J. et al. Guidelines for the interpretation of the neonatal electrocardiogram. A task force of the European Society of Cardiology. Eur. Heart J. 23, 1329–1344 (2002).
Somani, S. et al. Deep learning and the electrocardiogram: review of the current state-of-the-art. Europace 23, 1179–1191 (2021).
Mossie, A. et al. Prevalence and associated factors of preoperative abnormal electrocardiography among older surgical patients in southern Ethiopia: Multicenter cross-sectional study. BMC Geriatr. 24, 905 (2024).
Rautaharju, P. M. et al. AHA/ACCF/HRS recommendations for the standardization and interpretation of the electrocardiogram: part IV. J. Am. Coll. Cardiol. 53, 982–991 (2009).
Sharma, S. et al. International recommendations for electrocardiographic interpretation in athletes. Eur. Heart J. 39, 1466–1480 (2018).
Bai, W. A population-based phenome-wide association study of cardiac and aortic structure and function. Nat. Med. 26, 1654–1662 (2020).
Surawicz, B. & Parikh, S. R. Prevalence of male and female patterns of early ventricular repolarization in the normal ECG of males and females from childhood to old age. J. Am. Coll. Cardiol. 40, 1870–1876 (2002).
Papadakis, M. et al. The prevalence, distribution, and clinical outcomes of electrocardiographic repolarization patterns in male athletes of African/Afro-Caribbean origin. Eur. Heart J. 32, 2304–2313 (2011).
Sandau, K. E. et al. Update to practice standards for electrocardiographic monitoring in hospital settings: a scientific statement from the American Heart Association. Circulation 136, e273–e344 (2017).
Fordyce, C. B. et al. Prevention of complications in the cardiac intensive care unit: a scientific statement from the American Heart Association. Circulation 142, e379–e406 (2020).
Magnani, J. W. et al. P wave duration and risk of longitudinal atrial fibrillation in persons ≥60 years old (from the Framingham Heart Study). Am. J. Cardiol. 107, 917–921 (2011).
Hancock, E. W. et al. AHA/ACCF/HRS recommendations for the standardization and interpretation of the electrocardiogram: Part V. J. Am. Coll. Cardiol. 53, 992–1002 (2009).
Pewsner, D. et al. Accuracy of electrocardiography in diagnosis of left ventricular hypertrophy in arterial hypertension: systematic review. BMJ 335, 711 (2007).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
Wei, J. et al. Finetuned language models are zero-shot learners. Preprint at https://doi.org/10.48550/arXiv.2109.01652 (2022).
Brown, T. B. et al. Language models are few-shot learners. NeurIPS 33, 1877–1901 (2020).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
Shah, N. H., Milstein, A. & Bagley, S. C. Making machine learning models clinically useful. JAMA 322, 1351–1352 (2019).
Lee, J. et al. Deep learning for rare disease: a scoping review. J. Biomed. Inform. 135, 104227 (2022).
Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024).
Strodthoff, N., Lopez Alcaraz, J. M. & Haverkamp, W. MIMIC-IV-ECG-Ext-ICD: diagnostic labels for MIMIC-IV-ECG (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/hdyc-1h77 (2024).
Strodthoff, N., Alcaraz, J. M. L. & Haverkamp, W. Prospects for artificial intelligence-enhanced electrocardiogram as a unified screening tool for cardiac and non-cardiac conditions: an explorative study in emergency care. Eur. Heart J. Digit. Health 5, 454–460 (2024).
Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1. https://doi.org/10.1038/s41597-022-01899-x (2023).
Chen, Z. et al. Multi-modal masked autoencoders for medical vision-and-language pre-training. MICCAI, 679–689 (2022).
Vendrow, E. et al. INQUIRE: A natural world text-to-image retrieval benchmark. Adv. Neural Inf. Process. Syst. 37, 126500–126514 (2024).
Cummins, R. O. & Hazinski, M. F. Guidelines based on fear of type II (false-negative) errors: why we dropped the pulse check for lay rescuers. Circulation 102, I377–I379 (2000).
Acknowledgements
This study was supported by Japan Science and Technology Agency (JST) Japan-Taiwan Collaborative Research Program, Grant number JPMJKB231, JSPS KAKENHI Grant number JP21K10287 and JP25K1597, and The University of Aizu, Grant number L-502.
Author information
Authors and Affiliations
Contributions
X.Zhou, W.C., and X.Zhu contributed to the overall study design and manuscript preparation. T.L., H.H., K.N., and S.L. contributed to technical and material support. X.Zhou and T.L. contributed to data analysis. X.Zhou, W.C., and X.Zhu contributed to manuscript writing, preparation, and submission. X.Zhou, T.L., H.H., K.N., S.L., W.C., and X.Zhu contributed to manuscript review and revision and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, X., Li, T., Hayama, H. et al. Diagnosis of cardiac conditions from 12-lead electrocardiogram through natural language supervision. npj Digit. Med. 8, 697 (2025). https://doi.org/10.1038/s41746-025-02074-3