Artificial intelligence-enabled analysis of handheld single-lead electrocardiograms to predict incident atrial fibrillation: an analysis of the VITAL-AF randomized trial

Khurshid, Shaan; Friedman, Sam F.; Al-Alusi, Mostafa A.; Kany, Shinwan; Sommers, Thomas; Anderson, Christopher D.; Ho, Jennifer E.; McManus, David D.; Borowsky, Leila H.; Ashburner, Jeffrey M.; Lubitz, Steven A.; Atlas, Steven J.; Maddah, Mahnaz; Singer, Daniel E.; Ellinor, Patrick T.

doi:10.1038/s41746-025-02164-2

Article
Open access
Published: 26 November 2025

Shaan Khurshid^1,2,3^na1,
Sam F. Friedman⁴^na1,
Mostafa A. Al-Alusi^1,2,5,
Shinwan Kany^2,6,7,
Thomas Sommers⁸,
Christopher D. Anderson^9,10,11,
Jennifer E. Ho^1,12,
David D. McManus¹³,
Leila H. Borowsky¹⁴,
Jeffrey M. Ashburner¹⁴,
Steven A. Lubitz^1,2,3,
Steven J. Atlas¹⁴,
Mahnaz Maddah⁴,
Daniel E. Singer¹⁴ &
…
Patrick T. Ellinor^1,2,3

npj Digital Medicine volume 8, Article number: 776 (2025)

Abstract

Whether artificial intelligence (AI) analysis of single-lead ECG (1 L ECG) can predict incident atrial fibrillation (AF) is unknown. In the VITAL-AF trial (ClinicalTrials.gov NCT03515057, registered 2/24/2021) of primary care patients aged ≥65 years undergoing handheld 1 L ECG screening, we tested three AI approaches to incident AF prediction, and compared the best model to the CHARGE-AF risk score. In a test set of 4,221 individuals, a published AI model trained using single standard ECG leads (“1 L ECG-AI”) provided similar 2-year AF discrimination to models trained with VITAL-AF data. In the full VITAL-AF sample of 15,694 individuals without prevalent AF (2-year incident AF 3.1%), 1 L ECG-AI with age/sex (1 L ECG-AI AS) had comparable discrimination (area under the receiver operating characteristic curve [AUROC] 0.695 [0.637–0.742]; average precision [AP] 0.060 [0.050–0.078]) to CHARGE-AF (AUROC 0.679 [0.623–0.730]; AP 0.062 [0.052–0.080]; AUROC p = 0.46, AP p = 0.92). Net reclassification improvement was favorable versus age ≥65 years (0.27 [0.22–0.32]). 1 L ECG-AI may increase efficiency and reach of AF screening.

Introduction

Atrial fibrillation (AF) is a common arrhythmia associated with substantial preventable morbidity^1,2,3,4,5,6. Since AF may be asymptomatic and unrecognized, there is great interest in mass screening to detect undiagnosed AF and enable timely preventive interventions^7,8,9. However, contemporary assessments of AF screening have generally relied on the simple age threshold of ≥65 years endorsed by guidelines from the European Society of Cardiology and Cardiac Society of Australia and New Zealand^10,11, leading to screening of many individuals at low AF risk and failure to demonstrate meaningful improvements in key outcomes (e.g., stroke, mortality)^12,13,14. Therefore, there is a critical unmet need for a more targeted approach to AF screening focused on individuals at higher risk of undiagnosed AF^15,16.

To this end, it is well-recognized that future AF risk can be estimated with reasonable accuracy. Composite risk factor scores, such as the well-validated Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF) score¹⁷, consistently demonstrate moderate predictive value for incident AF^18,19. However, uptake of clinical AF scores is limited by cumbersome calculation and potential misclassification of score components²⁰. More recently, studies have reported the ability to estimate AF risk using deep learning applied to a single 12-lead electrocardiogram (ECG)^21,22,23, offering a novel mechanism for accurate and efficient AF risk estimation. Although 12-lead ECGs can be obtained within minutes in most clinic settings, acquisition typically requires trained staff and specialized equipment. In contrast, single-lead ECG tracings (1 L ECG) are increasingly available from consumer wearable and handheld devices, and therefore possess the potential to substantially extend the reach of AF risk estimation. Although limited surrogate evaluations (e.g., restricting to one lead of a 12-lead ECG) have suggested that 1 L ECG may retain AF risk information²¹, whether real-world 1 L ECG tracings can predict AF remains unknown.

Here, we leverage a unique resource of over 16,000 individuals with >35,000 handheld 1 L ECG tracings taken prospectively as part of the VITAL-AF randomized clinical trial of AF screening embedded within eight primary care practices within the Massachusetts General Hospital (MGH) network. We compare several approaches to AI-enabled AF risk estimation: a published convolutional neural network (CNN) trained using single standard ECG leads (1 L ECG-AI)²¹, 1 L ECG-AI fine-tuned in VITAL-AF (1 L ECG-AI Fine-Tuned), and a CNN trained exclusively within VITAL-AF (1 L VITAL). We then assess the performance of 1 L ECG-AI in the full VITAL-AF sample, quantify relations between ECG-AI-based and clinical risk signals, and establish the potential for 1 L ECG-based AF risk estimation to extend the reach of risk-informed AF screening well beyond the traditional clinic setting.

Results

Identifying an optimal method for AF risk estimation using 1 L ECG

The current analysis comprised ~16,000 participants of the VITAL-AF trial, a randomized trial of AF screening in primary care patients aged 65 years and older, who were subsequently split into a development set and test set (Table 1 and Fig. 1). Here, we compared three deep learning-based approaches to AF risk estimation: 1 L ECG-AI, 1 L ECG-AI Fine-Tuned, and 1 L VITAL. 1 L ECG-AI was developed fully outside VITAL-AF, whereas 1 L ECG-AI Fine-Tuned and 1 L VITAL utilized the development set of the VITAL-AF sample for fine-tuning and training, respectively. Each model performed inference using the 10-s segment of the 30-s 1 L ECG tracing predicted to have the least tracing noise (see Methods and Supplementary Figs. 1, 2). Patients with prevalent AF were included in model training but not evaluation (Fig. 1). ECG-AI model architectures are shown in Supplementary Fig. 3.

Table 1 Analysis of sample characteristics

Models were assessed in the VITAL-AF Test Set comprising 4,221 randomly selected individuals from VITAL-AF not used in any aspect of model training or fine-tuning and without prevalent AF (age 74 ± 7 years, 61% women, Fig. 1). Detailed baseline characteristics are shown in Table 1. At 2 years, there were 119 incident AF events (cumulative incidence 3.2% [95% CI 2.6–3.9]). Discrimination of 2-year AF was comparable using 1 L ECG-AI (area under receiver operating characteristic curve [AUROC] 0.672 [95% CI 0.576–0.753]) and 1 L VITAL (0.667 [0.554–0.757]), which were numerically favorable to 1 L ECG-AI Fine-Tuned (0.654 [0.546–0.757]) (Fig. 2). Trends were similar according to 2-year average precision (1 L ECG-AI 0.061 [0.048–0.082], 1 L ECG-AI Fine-Tuned 0.058 [0.046–0.078], 1 L VITAL 0.081 [0.056–0.12]) (Supplementary Fig. 4).

**Fig. 2: Model comparison in VITAL-AF Test Set.**

Performance of 1 L ECG-AI in the VITAL-AF Full Inference Set

Given that 1 L ECG-AI provided comparable or better performance than the other models despite being trained completely outside VITAL-AF, we further evaluated this model in the VITAL-AF Full Inference Set: the complete VITAL-AF cohort members without prevalent AF who underwent ≥1 screening 1 L ECG (n = 15,694, age 74 ± 7 years, 58% women, Fig. 1 and Table 1).

At 2 years, there were 411 incident AF events (cumulative incidence 3.1% [95% CI 2.7–3.4]). 1 L ECG-AI had consistent discrimination of 2-year incident AF risk (AUROC 0.666 [0.603–0.721]; AP 0.053 [0.045–0.068]). With age and sex added to 1 L ECG-AI (“1 L ECG-AI AS”), discrimination improved (0.695 [0.637–0.742]; AP 0.060 [0.050–0.077]) and was comparable to the validated 11-component CHARGE-AF clinical risk score (0.679 [0.625–0.729]; 0.062 [0.052–0.080], p = 0.46 for AUROC, p = 0.92 for AP) (Table 2 and Fig. 3).
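The two discrimination metrics reported above can be illustrated with a minimal sketch. It assumes a simple binary setting (AF within 2 years, yes or no) with no censoring, which is a simplification: the study evaluates time-to-event outcomes, so its estimates are not reproduced by this code. The function names `auroc` and `average_precision` are illustrative and do not come from the study's codebase.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC via the Mann-Whitney rank statistic (ties get average ranks)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    # Average 1-based rank of each score within the sorted unique values.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k at which a case appears.

    Assumption: tied scores are ordered arbitrarily rather than grouped.
    """
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="mergesort")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits].mean()

# Toy example: two cases, two non-cases.
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(auroc(y, s), average_precision(y, s))
```

For a model with no discriminative ability, average precision is approximately the event prevalence, so AP values near 0.06 should be read against the 3.1% two-year incidence rather than against 0.5.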

**Fig. 3: Model performance summary in VITAL-AF full inference set.**

Table 2 Model metrics for 2-year incident AF in VITAL-AF full inference set

Since the AI models were trained on individuals with prevalent AF but evaluated only on individuals without prevalent AF, recalibration to the baseline hazard of the VITAL-AF full inference set was performed prior to assessing model calibration. Although calibration was reasonable for each model, absolute risk estimates were particularly accurate for 1 L ECG-AI AS (integrated calibration index [ICI] 0.0048 [0.0016–0.080], where lower values indicate lower average error), whereas both CHARGE-AF (ICI 0.015 [0.012–0.018]) and 1 L ECG-AI (ICI 0.022 [0.019–0.023]) showed a tendency to overestimate observed AF incidence at the high end of the predicted risk distribution (Table 2 and Fig. 3).

Combining clinical risk factors with AI signals

Combining CHARGE-AF and 1 L ECG-AI resulted in modest numerical improvement in discrimination (AUROC 0.703 [0.650–0.751]; AP 0.063 [0.054–0.078]) (Table 2 and Fig. 3). Absolute risk estimates were well-calibrated (ICI 0.0058 [0.0022–0.0094]). In a Cox proportional hazards model including terms for both 1 L ECG-AI and CHARGE-AF, both terms were independent predictors of incident AF (Fig. 3). Consistent with a complementary relation, the cumulative risk of AF was highest among individuals classified as high risk (i.e., 2-year AF risk ≥3%) according to both ECG-AI and CHARGE-AF, followed by high risk according to one model only, followed by individuals not at high risk according to either model (Fig. 4). Individuals at high risk for AF using 1 L ECG-AI only were generally younger and had a lower comorbidity burden compared to individuals at high risk using CHARGE-AF (Supplementary Table 1). The correlation between 1 L ECG-AI and CHARGE-AF was moderate (r = 0.44 [95% CI 0.43–0.45]).

**Fig. 4: Cumulative risk of AF stratified by 1 L ECG-AI AS.**

1 L ECG-AI AS to stratify risk of longitudinal AF

Given the favorable performance of 1 L ECG-AI AS with the requirement for only one 1 L ECG, age, and sex, this model was selected for further evaluation as a potential tool to stratify risk of AF. Categories of 1 L ECG-AI AS predicted risk (i.e., <1%, 1–2%, and ≥3% to approximate tertiles) effectively separated longitudinal AF incidence (Fig. 4). Two-year AF incidence was markedly higher with 1 L ECG-AI AS in the top 5% (8.0% [5.7–10.2]) versus bottom 5% (0.89% [0.23–1.55]) (Supplementary Fig. 5).

Use of 1 L ECG-AI AS 2-year predicted risk ≥3% rather than the age threshold of ≥65 years endorsed in certain guidelines^10,11 and used to select the VITAL-AF sample¹⁰ would result in favorable net reclassification (NRI 0.27 [95% CI 0.22–0.32]), driven by an appreciable degree of appropriate non-case reclassification (i.e., deferring screening of 9,941 non-cases, NRI− 64.2% [63.3–65.0%]), but at the cost of some unfavorable case reclassification (i.e., failing to screen 149 cases, NRI+ −37.1% [−42.1% to −32.2%]) (Table 3 and Supplementary Table 2). Reclassification was also favorable using 1 L ECG-AI compared to additional guideline-based criteria for screening¹⁰ including age ≥65 years with elevated stroke risk^24,25 (NRI 0.18 [0.12–0.23]), and age ≥75 years (0.078 [0.024–0.13]) (Table 3 and Supplementary Table 2).
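The relation between the overall NRI and its case and non-case components can be made concrete with a small sketch. It uses the standard two-category definition (NRI among cases plus NRI among non-cases) and plugs in the component proportions reported above; the function name and argument names are illustrative assumptions, not the study's code. Because the reference strategy screens everyone aged ≥65 years, no one can be reclassified upward, so both "up" proportions are zero.

```python
def net_reclassification(case_up, case_down, noncase_up, noncase_down):
    """Two-category NRI from reclassification proportions.

    case_up / case_down: proportion of cases moved into / out of the
    "screen" category by the new rule. noncase_up / noncase_down: the
    same proportions among non-cases.
    """
    nri_cases = case_up - case_down            # favorable if positive
    nri_noncases = noncase_down - noncase_up   # favorable if positive
    return nri_cases, nri_noncases, nri_cases + nri_noncases

# Reported proportions: 37.1% of cases and 64.2% of non-cases fall
# below the 3% predicted-risk threshold and would not be screened.
nri_pos, nri_neg, nri = net_reclassification(
    case_up=0.0, case_down=0.371, noncase_up=0.0, noncase_down=0.642
)
print(round(nri_pos, 3), round(nri_neg, 3), round(nri, 2))
```

The reported percentages differ slightly from the crude ratios of the quoted counts (149/411 and 9,941/15,283), presumably because the published estimates account for censoring over the 2-year window.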

Table 3 Reclassification using 1 L ECG-AI AS versus age- and stroke risk-based criteria for AF screening

Decision curve analysis demonstrates that use of 1 L ECG-AI AS rather than screening all individuals ≥65 years would result in net benefit across a wide range of thresholds used to select screening candidates (Supplementary Fig. 6), and lead to substantial reductions in individuals screened (Supplementary Fig. 7) while maintaining constant net benefit.
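As a rough illustration of the quantity plotted in a decision curve, the sketch below computes net benefit at a given risk threshold using the standard formula (true positives per person minus false positives per person weighted by the threshold odds), alongside the "screen everyone" reference strategy. It assumes binary outcomes without censoring, and the function names and toy data are hypothetical rather than taken from the study.

```python
import numpy as np

def net_benefit(y_true, y_risk, threshold):
    """Net benefit of screening only those with predicted risk >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    selected = np.asarray(y_risk, dtype=float) >= threshold
    n = len(y_true)
    tp = np.sum(selected & y_true)    # screened individuals who develop AF
    fp = np.sum(selected & ~y_true)   # screened individuals who do not
    return tp / n - fp / n * threshold / (1.0 - threshold)

def net_benefit_screen_all(y_true, threshold):
    """Net benefit of screening every individual, evaluated at the same threshold."""
    prevalence = np.mean(np.asarray(y_true, dtype=bool))
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Toy cohort of 10 people, 2 of whom develop AF.
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
risk = [0.5, 0.4, 0.05, 0.05, 0.3, 0.05, 0.05, 0.05, 0.05, 0.05]
print(net_benefit(y, risk, 0.2), net_benefit_screen_all(y, 0.2))
```

In this toy cohort the model-guided strategy screens three people instead of ten and has higher net benefit at the 20% threshold than screening everyone.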

Secondary and subgroup analyses

Saliency maps demonstrated that the 1 L ECG P wave and surrounding regions had the greatest effect on 1 L ECG-AI AF risk estimates (Fig. 5). Among 81 incident AF cases with Holter, patch, or event monitoring available within 6 months of incident AF diagnosis, 66 (81.5%) had evidence of paroxysmal AF while 15 (18.5%) had findings consistent with persistent AF. Although estimates had limited precision, AF discrimination using 1 L ECG-AI AS was higher for persistent AF (AUROC 0.717 [0.573–0.840]) versus paroxysmal AF (0.601 [0.511–0.698]). Among individuals with incident AF and available measurements on or after AF diagnosis, there was a weak positive correlation between 1 L ECG-AI AS and NTproBNP (r = 0.14, 95% CI 0.01–0.27) and between 1 L ECG-AI AS and left atrial diameter (r = 0.17, 95% CI 0.05–0.28). At the ≥3% AF risk threshold, there was no difference in the sensitivity of 1 L ECG-AI AS among individuals with NTproBNP that was elevated (70.3%, 95% CI 61.1–78.2) versus non-elevated (66.4%, 95% CI 56.5–75.0), or left atrial diameter that was enlarged (66.7%, 95% CI 57.9–74.5) or non-enlarged (61.7%, 95% CI 53.1–69.6). Model discrimination was generally consistent across shorter time windows (e.g., 1 L ECG-AI AS 6-month AUROC 0.685 [95% CI 0.641–0.730]) (Supplementary Table 3). All models had lower discrimination within strata of age, but 1 L ECG-AI AS had the best relative preservation in performance (i.e., age 65–69: 1 L ECG-AI AS 2-year AUROC 0.627 [0.505–0.730] vs. CHARGE-AF 0.577 [0.473–0.686]) (Supplementary Table 4). All models performed similarly among men versus women (Supplementary Table 5), and among individuals with a class I indication for oral anticoagulation based on stroke risk using the CHA₂DS₂VASc score (Supplementary Table 6).
Among 3,479 individuals with a 12-lead ECG performed within 3 years of the baseline visit, discrimination was moderately higher using a previously validated 12-lead ECG model (AUROC 0.812 [0.736–0.876]) versus 1 L ECG-AI (0.772 [0.687–0.848]). In a linear mixed model assessing intra- versus inter-individual variability, we found that 1 L ECG-AI had an intraclass correlation coefficient of 0.65, and within-tracing correlation was high (r = 0.87 for the least noise window versus last 10-s window). Among the 411 incident AF cases, time to AF diagnosis was similar among individuals with 1 L ECG-AI AS 2-year predicted risk ≥3% (“true positives”, n = 262) (median 272 days [quartile-1: 143, quartile-3: 414]) versus individuals with 1 L ECG-AI AS 2-year predicted risk <3% (“false negatives”, n = 149) (270 days [157, 405]). Individuals with 1 L ECG-AI AS 2-year predicted risk ≥3% had a higher rate of ECG (107.6 per 100 person-years [95% CI 105.5–109.7]) and Holter, patch, or event monitor (7.68 [7.12–8.24]) utilization prior to an AF diagnosis compared to individuals with 1 L ECG-AI AS 2-year predicted risk <3% (ECG: 52.6 [51.5–53.7]; monitor: 4.71 [4.37–5.05]).

Discussion

Here, we developed and tested three distinct approaches to deep learning-based estimation of future AF risk using a unique resource of over 35,000 real-world handheld 1 L ECG tracings collected prospectively in over 16,000 primary care patients enrolled in a large AF screening trial. Utilizing a test set of more than 4200 individuals, we found that 1 L ECG-AI, a model previously trained using over 450,000 single-lead tracings from standard 12-lead ECGs and applied to 1 L ECGs in a transfer learning approach, provided comparable performance to models trained or fine-tuned using VITAL-AF data. In the full VITAL-AF sample, a combination of 1 L ECG-AI with age and sex (1 L ECG-AI AS) discriminated AF risk favorably relative to the validated 11-component CHARGE-AF clinical risk factor score, had better calibration to observed AF risk, and led to favorable reclassification of risk versus the age threshold of ≥65 years currently endorsed in AF screening guidelines¹⁰.

Our findings support and extend prior analyses suggesting the potential for deep learning to predict future AF using 1 L ECG tracings. Prior work has focused on surrogates for true 1 L ECG (i.e., one lead of a standard 12-lead ECG)²¹ or classification of concurrent AF rather than true AF prediction^26,27. Using the AliveCor Kardia device (the same device used in the current study), Raghunath et al. developed a model capable of distinguishing concurrent paroxysmal AF on the basis of 1 L ECG tracings showing sinus rhythm²⁷. More recently, Gadaleta et al. developed a model capable of discriminating very short-term future AF (i.e., 14 days) using clinical-grade 1 L ECG patch monitors obtained for a variety of clinical indications and including individuals whose AF was known²⁶. Our work substantially extends prior analyses by demonstrating the ability of AI-enabled analysis to discriminate AF up to 2 years in the future using real-world handheld 1 L ECGs obtained prospectively within the specific population in which AF risk estimation is most meaningful (i.e., older individuals with no known AF diagnosis despite receiving regular primary care).

Our findings establish the feasibility of incident AF prediction using handheld 1 L ECG, and provide specific support for the role of transfer learning in the adaptation of ECG-based AI models to the mobile 1 L ECG modality. A variety of prior studies have suggested retained predictive performance of standard ECG-based models when applied to a single lead of the 12-lead ECG as a surrogate for a true 1 L ECG. In real-world use, however, 1 L ECG tracings offer unique characteristics (e.g., more noise, differing sampling frequency, variable duration)^21,28. Here, we observed substantially improved model performance when utilizing a strategy of selectively applying inference to the 10-second segment of the 30-s tracings classified as having the least tracing noise. Moreover, we found that 1 L ECG-AI, a pre-trained AF risk estimation model developed using single leads of a standard 12-lead ECG (originally trained using over 450,000 tracings) provided comparable performance to a model trained de novo using true 1 L ECGs from the smaller VITAL-AF Development Set, even though 1 L ECG-AI had no prior exposure to handheld 1 L ECG tracings. Furthermore, no meaningful improvement was observed from fine-tuning 1 L ECG-AI in the VITAL-AF sample. Our findings are consistent with a number of recent studies showcasing the value of transfer learning in domains as diverse as network biology²⁹ and natural language³⁰, in which models trained using comparably large sample sizes in external datasets to perform related tasks offer favorable performance to models trained de novo in regimes where sample sizes are more limited (e.g., handheld 1 L ECG)^31,32. Importantly, saliency maps demonstrated that, consistent with clinical expectations and similar to the 12-lead ECG-AI model on which it is based²¹, 1 L ECG-AI AF risk estimates were heavily influenced by the 1 L ECG P wave and surrounding regions, demonstrating the ability of 1 L ECG-AI to recognize key features of the 1 L ECG waveform.

Our observations provide evidence that a robust appraisal of the expected performance of deep learning models can only be obtained by evaluating models in the specific clinical context of their intended use. The overall discrimination of 1 L ECG-AI (AUROC ~0.69 with age and sex) is lower than the discrimination reported for the 12-lead ECG-based detection of concurrent AF (AUROC ~0.8–0.9)^22,33 and estimation of future AF (AUROC ~0.75–0.85)^21,23. We suspect there are two primary factors accounting for lower performance. First, there is clearly information loss when moving from 12-lead ECG to a single-lead ECG of the same format (e.g., AUROC ~0.75–0.85 for 12-lead models versus ~0.72 for 1 L ECG-AI prior to transfer to VITAL-AF), followed by further loss when applied to real-world 1 L ECG tracings (e.g., AUROC ~ 0.66 for 1 L ECG-AI after transfer to VITAL-AF). Second, when compared to prior retrospective assessments of ECGs, which are subject to indication bias on account of clinical acquisition, our analysis represents handheld 1 L ECGs obtained prospectively and unselectively. Indeed, in a secondary analysis focused on individuals with a 12-lead ECG performed for clinical indications within the preceding three years, not only did the 12-lead model achieve higher discrimination (AUROC 0.81) than the 1 L ECG model (AUROC 0.77), but the 1 L ECG model had substantially higher AUROC than it did in the overall sample. The key contribution of population characteristics is also supported by substantially lower performance of the CHARGE-AF clinical risk score (AUROC 0.68) compared to multiple prior retrospective validations (AUROC ~0.7–0.8)^18,19. On balance, our findings establish the feasibility of future AF risk estimation using handheld 1 L ECG, and additionally serve as an important demonstration that expected model performance may differ substantively when models are applied to the specific clinical settings in which implementation is intended.

Our results highlight the potential for 1 L ECG to extend the efficiency and reach of AF screening efforts. Despite providing a path to earlier AF detection and prompt initiation of preventive interventions (e.g., oral anticoagulation, lifestyle modification), recent screening efforts have failed to demonstrate substantive gains in AF diagnosis or improvements in hard outcomes such as stroke or mortality¹⁶. One major limitation has been the use of simple guideline-based age thresholds (e.g., ≥65 years)¹⁰, whose application leads to screening many individuals at relatively low risk of AF. A risk-informed approach may be more efficient^15,34, and with evidence demonstrating limited uptake of AF risk scores such as CHARGE-AF on account of cumbersome calculation and potential misclassification of score inputs, the ability to estimate AF risk with comparable accuracy using only age, sex, and a single handheld 1 L ECG has particular value³⁵. Compared to screening all individuals aged ≥65 years¹⁰, 1 L ECG-AI AS exhibited highly favorable reclassification (0.27), driven by a 64% increase in specificity (i.e., appropriate down-classification of nearly 10,000 individuals who did not develop AF within 2 years). Of course, deferring screening of non-high-risk individuals also leads to a decrease in sensitivity (37.1%), and future work is warranted to reduce false negatives and identify optimal thresholds to balance AF screening yield and efficiency. Importantly, decision curve analyses demonstrated that 1 L ECG-AI AS would lead to substantial reductions in individuals screened while retaining a constant net benefit across a wide range of thresholds. Conversely, methods such as 1 L ECG-AI AS may also facilitate identification of high-risk younger individuals who would otherwise be missed by traditional age criteria, a possibility we could not evaluate in our current analysis, which included only individuals ≥65 years. 
Similarly, consistent with prior data demonstrating a complementary nature of clinical risk factors and AI signals²¹, 1 L ECG-AI AS demonstrated the greatest AF risk discrimination in individuals with relatively lower comorbidity burden, suggesting a potential role for combined approaches leveraging both clinical and AI criteria to select AF screening candidates. Of note, there was only a weak positive correlation between 1 L ECG-AI AS and either NTproBNP or left atrial size among individuals with incident AF, suggesting that 1 L ECG-AI risk estimates provide non-overlapping value to common AF-related biomarkers, which are also less well-suited for population screening. Although the precision of estimates was limited, we did observe higher performance for incident persistent AF, suggesting that 1 L ECG-AI AS may be particularly useful for discriminating risk of developing persistent or high-burden AF, which may be more clinically actionable. Ultimately, future work is warranted to identify optimal methods for AF risk estimation, which should consider the setting in which risk stratification is intended (e.g., clinic versus home), as well as potential tradeoffs in performance (e.g., 12-lead ECG versus 1 L ECG). Overall, our findings suggest that 1 L ECG may possess utility not only for AF screening, but also for AF risk stratification, wherein individuals at elevated AF risk may be considered for more intensive screening³⁶ (e.g., repeated monitoring with 1 L ECG, application of clinical-grade monitors) and interventions to prevent AF onset altogether (e.g., alcohol cessation⁶, weight loss³⁷). Future trials of AF screening guided by 1 L ECG-based AI are warranted.

Our study should be interpreted in the context of its design. First, our 1 L ECG-based model comparison was performed in a holdout test set of VITAL-AF. We note that prospectively acquired 1 L ECG tracings from a trial population are a unique resource, highlighting a key strength of our study while also limiting options for external validation. We also note that the 1 L ECG-AI model ultimately selected for detailed analyses in the VITAL-AF Full Inference Set was developed using a completely different modality (single leads of a standard 12-lead ECG) in non-overlapping individuals outside of VITAL-AF. Nevertheless, as 1 L ECG datasets become increasingly available in the future, it will be important to assess model performance in other healthcare systems and using other 1 L ECG devices. Although we suspect that our noise-window method should be useful for analyses using other 1 L ECG devices prone to noise and acquisition artifact (e.g., smartwatch ECG), it will be important to quantify potential impacts of differences in acquisition characteristics including ECG vector, sampling frequency (e.g., 300 Hz using AliveCor versus 512 Hz using Apple Watch), and tracing duration, and assess the degree to which recalibration or fine-tuning may be needed. Second, our estimates of model metrics have limited precision due to limitations in sample size and modest event rates in the smaller VITAL-AF Test Set. Nevertheless, we submit that the estimates are sufficient to support the performance of 1 L ECG-AI as comparable to or better than the comparison models, justifying its use in subsequent analyses in the larger VITAL-AF Full Inference Set. Third, the current follow-up is limited to two years. Fourth, incident AF was defined using a combination of a validated electronic health record-based algorithm and manual validation.
Although misclassification of undiagnosed AF remains possible with our design, we submit that several factors (e.g., enrollment of patients engaged in routine primary care, use of 1 L ECG screening, which has demonstrated reasonable positive predictive value³⁸) serve to limit its extent. Fifth, tracings were obtained by trained medical assistants and likely represent higher-quality tracing acquisition than regular consumer use. Sixth, our models performed inference using 10-s 1 L ECG windows, to mirror the shape of a standard 12-lead ECG and to facilitate inference of our existing 1 L ECG-AI model in a transfer learning context. Future assessment of de novo models trained using longer windows is warranted. Seventh, the absence of uniform protocolized rhythm monitoring introduces potential ascertainment bias, and we did observe that individuals with higher estimated risk according to 1 L ECG-AI AS had higher rates of ECG and Holter, patch, or event monitoring during follow-up. Eighth, the generalizability of our findings may be limited by sample specificity (e.g., predominantly White population from the New England region of the United States).

In summary, we developed 1 L ECG-AI, a model capable of discriminating risk of 2-year incident AF using real-world 1 L ECG tracings obtained prospectively in the context of a large randomized trial of AF screening among primary care patients with no known AF diagnosis. When combined with age and sex, 1 L ECG-AI offered comparable performance to the validated 11-component CHARGE-AF clinical risk score and demonstrated potential to improve AF screening efficiency compared to the simple age threshold of ≥65 years endorsed in current guidelines. Future work is needed to establish the clinical utility of AF screening guided by AI-enabled analysis of mobile 1 L ECG tracings.

Methods

Trial design and analysis sample

The design, conduct, primary outcome results, protocol, and statistical analysis plan of the VITAL-AF trial (ClinicalTrials.gov NCT03515057, registered 2/24/2021; https://clinicaltrials.gov/study/NCT03515057) have been published previously^12,39. Briefly, VITAL-AF recruited patients from 16 primary care practices within the MGH practice-based research network. VITAL-AF was a pragmatic cluster randomized trial, in which practices were randomized in a 1:1 ratio to AF screening versus usual care, and assessed a primary outcome of new AF diagnosis at 1 year. The trial enrolled patients between July 31, 2018 and October 8, 2019. Patients were eligible for inclusion if they were aged ≥65 years and attended an outpatient clinic appointment at a participating primary care practice with a primary care physician, nurse practitioner, or physician’s assistant. Given the pragmatic design of VITAL-AF, no further selection criteria were applied, and in particular, individuals were not excluded based on prior AF status. In the current analysis, we focused specifically on participants who had ≥1 1 L ECG screening performed. Given prior observations demonstrating that inclusion of tracings among patients with prevalent AF (including tracings showing AF) can improve performance for incident AF risk estimation²¹, tracings taken among individuals with prevalent AF were included in model training. However, all model evaluation was performed among individuals without prevalent AF at the time of screening (see below, Fig. 1). Participants provided informed consent to participate, and the research protocol was approved by the Mass General Brigham Institutional Review Board (2017P000562). This study adheres to the STARD⁴⁰, EHRA AI⁴¹, and CONSORT⁴² reporting guidelines (Supplementary Materials).

1 L ECG acquisition

Eligible and consenting individuals visiting intervention practices were offered AF screening with the AliveCor Kardia (AliveCor, US) 1 L ECG at each encounter at the time of routine vital sign assessment (i.e., prior to meeting with the primary care physician, nurse practitioner, or physician’s assistant). The 1 L ECG was administered by medical assistants who received dedicated training in the use of the Kardia device prior to study start, as well as monthly refreshers. Since multiple tracings were obtained for some individuals, model training opportunistically considered all tracings as distinct examples, but for model evaluation, only the earliest 1 L ECG per individual was used (with the earliest tracing employed to maximize available follow-up).

AF model development and evaluation

The study sample was randomly split into a development set (i.e., training and validation, ~80%) and a test set (~20%) (Table 1 and Fig. 1). In the test set, we evaluated three distinct approaches to AF risk estimation using 1 L ECG. First, we transferred a contemporary version of a previously published ECG-based convolutional neural network model to estimate AF risk (ECG-AI, AUROC 0.82 for 5-year incident AF in Massachusetts General Hospital)²¹, which in this case was trained using single leads of a standard 12-lead ECG among individuals outside the VITAL-AF study. While the AliveCor device most closely resembles lead I of the 12-lead ECG, we observed models trained from other leads also had discriminative power when transferred (e.g., a lead II only model achieved a c-index of 0.65 using 1 L ECG, as compared to 0.67 for the lead I only model). To leverage this information, we adopted a novel training strategy which uses all leads by selecting a different one at random in every training batch. Over the course of optimization, this single lead model learns from all 12 leads, and this strategy outperformed lead-specific models when generalizing to AliveCor handheld 1 L ECGs. Model architecture was largely unchanged from the published version, and as before, the model is multi-task, outputting not only the survival probability for incident AF (primary task), but also the auxiliary tasks of age regression, sex classification, survival probability for death, and presence of AF on the ECG²¹. The input of 5000 voltage timepoints is processed through a one-dimensional (1D) convolution in 64 channels, followed by 3 densely connected 1D convolutional blocks. All convolutional kernels have a width of 71, use the Mish activation function and are followed by a max-pooling layer. The convolutional blocks have widths of 64, 48, and 32 channels. The entire architecture consists of 8,653,337 trainable parameters. 
Optimization was performed using ADAM stochastic gradient descent, with an initial learning rate of 2e-4. The optimizer and backpropagation are implemented by the TensorFlow (v2.13.1) machine learning framework and the Broad Institute Machine Learning for Health (ML4H) (v0.0.13) model factory (https://github.com/broadinstitute/ml4h). Model convergence was determined by an early stopping criterion of no improvement in validation loss after 32 epochs, with a learning rate decay of 0.5 for every eight epochs without validation loss improvement. The model was trained for 8 h using an Nvidia V100 (Santa Clara, CA) graphics processing unit. This model was termed “1 L ECG-AI.”
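The random-lead training strategy described above can be illustrated with a minimal NumPy sketch. It assumes a batch of 12-lead ECGs stored as an array of shape (batch, 5000, 12) and draws one lead index per training batch. The function name `sample_single_lead_batch` and the array layout are illustrative assumptions and are not taken from the ML4H code base.

```python
import numpy as np

N_LEADS = 12        # standard 12-lead ECG
N_SAMPLES = 5000    # 10 s at 500 Hz, matching the input size in the text


def sample_single_lead_batch(batch_12lead, rng):
    """Pick one lead at random for the whole batch (hypothetical helper).

    batch_12lead: array of shape (batch, N_SAMPLES, N_LEADS)
    Returns the single-lead batch of shape (batch, N_SAMPLES, 1)
    and the index of the lead that was drawn.
    """
    lead = int(rng.integers(0, N_LEADS))
    # Slice with lead:lead + 1 to keep a trailing channel axis of size 1
    return batch_12lead[:, :, lead:lead + 1], lead


rng = np.random.default_rng(0)
ecgs = rng.normal(size=(4, N_SAMPLES, N_LEADS))  # synthetic stand-in data

seen = set()
for _ in range(200):  # 200 simulated training batches
    x, lead = sample_single_lead_batch(ecgs, rng)
    seen.add(lead)
```

Over many batches every lead index is drawn, so a single-lead network trained this way is exposed to all 12 leads while always receiving an input of shape (5000 × 1).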

Second, we evaluated a version of 1 L ECG-AI, which was fine-tuned in the VITAL-AF Derivation Set. Here, we used the same 1 L ECG-AI architecture and ADAM optimizer but with a simplified single-task (survival probability for AF) output. To prevent large changes in model weights early in training in the context of fine-tuning, a reduced initial learning rate (2e-5) was employed.

Third, we trained a de novo CNN in the VITAL-AF Derivation Set. This model architecture mirrored 1 L ECG-AI except that the auxiliary tasks included classification of the automated rhythm interpretation from the AliveCor device, manual adjudication of rhythm by the cardiologist overreaders, readability, sex classification, and age regression.

As described in detail previously²¹, all models utilized a loss function incorporating survival time and censoring in order to output an estimated longitudinal incidence of AF. All models utilized learning rate decay. Full model architectures are provided in Supplementary Fig. 3.
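The learning rate decay and early stopping rules stated for 1 L ECG-AI (initial rate 2e-4, halving for every eight epochs without validation loss improvement, stopping after 32 such epochs) can be sketched in plain Python as follows. This is one reading of the schedule under stated assumptions; the function `run_schedule` and the synthetic validation losses are hypothetical, and the text does not specify how the rule was implemented in TensorFlow.

```python
def run_schedule(val_losses, lr=2e-4, decay=0.5,
                 decay_patience=8, stop_patience=32):
    """Return the learning rate in effect after each epoch.

    Assumption: the rate is multiplied by `decay` each time the number of
    consecutive epochs without improvement reaches a multiple of
    `decay_patience`, and training stops at `stop_patience` such epochs.
    """
    best = float("inf")
    stale = 0          # consecutive epochs without a new best validation loss
    history = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale % decay_patience == 0:
                lr *= decay
        history.append(lr)
        if stale >= stop_patience:
            break      # early stopping
    return history


# Synthetic example: two improving epochs, then a long plateau
lrs = run_schedule([1.0, 0.9] + [0.95] * 40)
```

For the fine-tuned model, the same sketch would be called with `lr=2e-5`. Keras provides `ReduceLROnPlateau` and `EarlyStopping` callbacks with similar behavior, although the text does not say whether they were used.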

Each model took as input a 10-s segment of the full 30-s handheld 1 L ECG tracing as a uniformly-shaped input tensor of dimension (5000 ×1). A 10-s window was chosen to match the shape and sampling rate of a standard 12-lead ECG, and to facilitate inference using 1 L ECG-AI, which was trained using 10-s tracings. Linear interpolation resampled the 300 Hz frequency of the AliveCor to the 500 Hz frequency typical of 12-lead ECGs. In training, models were inputted with random 10-s windows sampled from the full 30-s 1 L ECG tracing. Different random samples were used in each training epoch, thereby exposing the models to the full 1 L ECG tracing.
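A minimal sketch of this preprocessing is given below: it draws a random 10-s window from a 30-s tracing sampled at 300 Hz and linearly interpolates it to 500 Hz, yielding a (5000 × 1) array. The function names and the order of operations (windowing before resampling) are assumptions for illustration, and the synthetic sine wave stands in for a real tracing.

```python
import numpy as np

SRC_HZ, DST_HZ = 300, 500   # AliveCor and 12-lead sampling rates from the text
WINDOW_S = 10               # window length in seconds


def random_window(trace, rng, fs=SRC_HZ, window_s=WINDOW_S):
    """Draw a random contiguous window of window_s seconds."""
    n = fs * window_s
    start = int(rng.integers(0, len(trace) - n + 1))
    return trace[start:start + n]


def resample_linear(window, src_hz=SRC_HZ, dst_hz=DST_HZ):
    """Linearly interpolate a window from src_hz to dst_hz."""
    n_out = int(round(len(window) / src_hz * dst_hz))
    t_src = np.arange(len(window)) / src_hz
    t_dst = np.arange(n_out) / dst_hz
    # np.interp holds the final value for the few target times that fall
    # after the last source sample
    return np.interp(t_dst, t_src, window)


rng = np.random.default_rng(0)
t = np.arange(30 * SRC_HZ) / SRC_HZ
trace = np.sin(2 * np.pi * 1.2 * t)          # synthetic 30-s tracing
model_input = resample_linear(random_window(trace, rng)).reshape(-1, 1)
```

Calling `random_window` with a fresh draw in each epoch reproduces the idea that, over training, the model is exposed to the full 30-s tracing.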

Given that the 1 L ECG tracings were 30 s in duration and commonly had noise at the beginning of acquisition, we trained a separate convolutional neural network to detect the contiguous 10-s window with the least noise (i.e., the highest readability as determined by human adjudicators, see Supplementary Figs. 1, 2). Using the same convolutional neural network backbone as the 1 L VITAL model, we trained a binary classification model using the readability label. The model had high discriminative capability with an AUROC of 0.934 and a mean precision of 0.997 in a held-out test set of 3008 1 L ECG traces. The predicted minimum noise segment was then used as the input to each AF prediction model to perform inference. This noise minimization approach resulted in substantial improvement in model performance compared to the use of the first or last 10 s of the tracing for inference (Supplementary Fig. 2).
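The window selection step can be sketched as a sliding-window search that scores each candidate 10-s segment and keeps the best one. In the sketch below, `readability_score` is a crude stand-in (an assumption) for the trained readability classifier, and the 1-s stride is also an assumption, since the text does not state how candidate windows were enumerated.

```python
import numpy as np

FS = 300             # AliveCor sampling rate (Hz)
WINDOW = 10 * FS     # 10-s window in samples


def readability_score(window):
    """Placeholder for the readability CNN: smoother windows score higher."""
    return -float(np.mean(np.abs(np.diff(window))))


def least_noisy_window(trace, score_fn=readability_score, stride=FS):
    """Return (start index, window) of the highest-scoring 10-s window."""
    starts = range(0, len(trace) - WINDOW + 1, stride)
    scores = [score_fn(trace[s:s + WINDOW]) for s in starts]
    start = starts[int(np.argmax(scores))]
    return start, trace[start:start + WINDOW]


# Synthetic tracing with noise confined to the first 5 s of acquisition
rng = np.random.default_rng(0)
t = np.arange(30 * FS) / FS
trace = np.sin(2 * np.pi * 1.2 * t)
trace[:5 * FS] += rng.normal(scale=0.5, size=5 * FS)

start, window = least_noisy_window(trace)
```

In this synthetic case the selected window begins after the noisy onset, which mirrors the motivation given for avoiding the first 10 s of the tracing.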

Saliency mapping

To assess the behavior of 1 L ECG-AI, we created saliency maps, which highlight the sections of the 1 L ECG where the smallest changes in input voltage lead to the greatest changes in predicted AF risk. Saliency is defined as the gradient of the model output with respect to an input 1 L ECG. Efficient computation is possible with the same backpropagation machinery used in model training, except that during training the gradient is of the loss function rather than the model output, and it is taken with respect to the model weights rather than the model input. Both cases rely on the chain rule and the automatic differentiation capabilities of the Python package TensorFlow. An exemplar 1 L ECG waveform is overlaid on the ECG saliencies. Saliency was generated using a random sample of 64 tracings from the VITAL-AF sample.
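To make the definition concrete, the sketch below computes saliency for a toy model whose gradient can be written by hand. The logistic model `toy_risk` is purely an assumption for illustration; for the actual network the same gradient would be obtained by automatic differentiation rather than by a closed-form expression. A finite-difference check confirms that the gradient matches the change in output for a small change in one input sample.

```python
import numpy as np


def toy_risk(x, w, b):
    """Toy stand-in for the network: logistic function of a weighted sum."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def saliency(x, w, b):
    """Gradient of the toy model output with respect to each input sample."""
    p = toy_risk(x, w, b)
    # Chain rule: d sigmoid(z)/dz = p * (1 - p), and dz/dx = w
    return p * (1.0 - p) * w


rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)        # synthetic "voltage" samples
w = rng.normal(size=n) / n    # synthetic weights
b = -1.0
grad = saliency(x, w, b)

# Finite-difference check on a single input sample
i, eps = 7, 1e-6
x_plus = x.copy()
x_plus[i] += eps
fd = (toy_risk(x_plus, w, b) - toy_risk(x, w, b)) / eps
```

Samples with large absolute gradient are those where a small voltage change moves the output most, which is what a saliency map displays along the waveform.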

Clinical factors and outcomes

The prediction target for each model was incident AF during the 2-year study period. Incident AF was identified in a staged manner as follows: (1) candidate AF events were identified using the presence of ≥1 International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) code for AF or atrial flutter or a 12-lead electrocardiogram with AF or atrial flutter diagnosis, then (2) the medical record was manually adjudicated for the presence of AF (prevalent, incident, or absent) by two research nurses with consensus resolution of discrepant adjudications and cardiologist resolution of unresolved discrepancies. Adjudicators were unaware of the AliveCor 1 L ECG result, including any 1 L ECG-based AF risk estimates. Among 301 records that were double-reviewed to assess inter-rater agreement, agreement was high regardless of whether uncertain adjudications (n = 16) were excluded (94.0%), included and counted as agreement (94.4%), or included and counted as disagreement (89.0%). For comparison between AI-based AF risk and clinical risk factors, we calculated the CHARGE-AF score, a validated risk factor-based AF prediction tool, for all individuals^17,19,43. Baseline age, sex, race, height, weight, and blood pressure values were obtained from the electronic health record⁴⁴. Anti-hypertensive use was determined using medication lists⁴³. Tobacco use was categorized as present or absent. Race was classified as white or non-white, as performed previously using CHARGE-AF^43,45. The presence of heart failure, diabetes, and myocardial infarction was ascertained using previously validated diagnostic and procedural codes^43,46. A small fraction of individuals with missing tobacco use (1%) were considered non-smokers, and mean imputation was applied for trivial missingness in vital signs data (0.1%). Clinical factor definitions are provided in Supplementary Table 7.

Statistical analysis

Model discrimination was compared by calculation of inverse probability of censoring-weighted AUROC⁴⁷. Since AUROC may be insensitive to differences in discrimination among models in the setting of relatively uncommon outcomes such as AF, we additionally calculated time-dependent AP¹⁶. We plotted the corresponding ROC and AP curves. The outcome of incident AF was estimated at 2 years (i.e., the maximum available follow-up of the VITAL-AF trial at the time of this analysis). AUROC and AP values were compared using bootstrapping with 500 iterations, which was used to calculate 95% confidence intervals and perform pairwise Z-testing.
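The bootstrap comparison can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain binary-outcome AUROC rather than the inverse probability of censoring-weighted, time-dependent estimator used in the analysis, the exact form of the Z-test is assumed (mean bootstrap difference divided by its bootstrap standard deviation), and the data are simulated.

```python
import math
import numpy as np


def auroc(y, score):
    """Rank-based (Mann-Whitney) AUROC for a binary outcome.

    Assumes no tied scores between cases and non-cases. Duplicated
    individuals in a bootstrap sample share a label, so their ties do not
    change the rank sum of the cases.
    """
    ranks = np.argsort(np.argsort(score)) + 1
    n_pos = int(np.sum(y == 1))
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_compare(y, s1, s2, n_boot=500, seed=0):
    """Bootstrap the AUROC difference (s1 minus s2) with a 95% CI and Z-test."""
    rng = np.random.default_rng(seed)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, len(y), size=len(y))
        yb = y[idx]
        if yb.min() == yb.max():      # resample must contain both classes
            continue
        diffs.append(auroc(yb, s1[idx]) - auroc(yb, s2[idx]))
    diffs = np.array(diffs)
    ci = np.percentile(diffs, [2.5, 97.5])
    z = diffs.mean() / diffs.std(ddof=1)
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p value
    return diffs.mean(), ci, z, p


# Simulated data: one informative score and one uninformative score
rng = np.random.default_rng(42)
y = (rng.random(2000) < 0.1).astype(int)
s_informative = y + rng.normal(size=2000)
s_noise = rng.normal(size=2000)
diff, ci, z, p = bootstrap_compare(y, s_informative, s_noise)
```

The same resampling loop could be applied to AP by swapping the metric function.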

Given that 1 L ECG-AI provided comparable or better performance in the Test Set compared to the other models despite being trained completely outside the VITAL-AF sample, we performed a dedicated evaluation of 1 L ECG-AI in the VITAL-AF Full Inference Sample (i.e., all VITAL-AF participants with at least one 1 L ECG tracing and no prevalent AF, Fig. 1). Since age and sex are readily available demographic factors, we additionally assessed 1 L ECG-AI with the incorporation of age and sex as additional input variables (“1 L ECG-AI AS”). We then compared 1 L ECG-AI and 1 L ECG-AI AS to the CHARGE-AF clinical risk score, and a combination of 1 L ECG-AI and the CHARGE-AF clinical risk score. Combination models were developed using Cox proportional hazards regression with the covariate weights (i.e., age, sex, and ECG-AI for 1 L ECG-AI AS and ECG-AI and CHARGE-AF for the CHARGE-AF + ECG-AI model) obtained within individuals aged ≥65 years in the original ECG-AI development set after excluding individuals included in VITAL-AF. As performed previously, 1 L ECG-AI probabilities were logit-transformed for inclusion in the Cox models^21,48. Model discrimination was compared as outlined above. The ability to stratify risk of incident AF using 1 L ECG-AI AS and CHARGE-AF was assessed by plotting the Kaplan–Meier cumulative risk of AF across strata of high risk according to each of the two models, with high risk corresponding to ≥3% 2-year predicted AF risk (approximate top tertile). The ability to stratify more extreme AF risk using 1 L ECG-AI AS was assessed similarly, except using strata defined by the bottom 5% of risk, top 5% of risk, and middle 90% of risk.
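The logit transformation and the two stratification schemes described above can be illustrated with a short sketch. The function names, the clipping constant `eps`, and the evenly spaced example risks are assumptions for illustration; the Cox model weights themselves are not reproduced here.

```python
import numpy as np


def logit(p, eps=1e-12):
    """Logit transform of a probability, clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


def high_risk(risk, threshold=0.03):
    """Flag predicted 2-year AF risk at or above the 3% threshold."""
    return np.asarray(risk) >= threshold


def extreme_risk_strata(risk):
    """Label each individual as bottom 5%, middle 90%, or top 5% of risk."""
    risk = np.asarray(risk)
    lo, hi = np.quantile(risk, [0.05, 0.95])
    return np.where(risk < lo, "bottom 5%",
                    np.where(risk > hi, "top 5%", "middle 90%"))


risk = np.linspace(0.001, 0.2, 1000)   # synthetic predicted risks
strata = extreme_risk_strata(risk)
flags = high_risk(risk)
```

Each stratum label can then be passed as the grouping variable to a Kaplan–Meier estimator to obtain the cumulative risk curves.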

We assessed calibration using: (1) adaptive hazard regression⁴⁹ curves of predicted versus observed AF risk, and (2) integrated calibration index (ICI), the average prediction error weighted by the empirical risk distribution⁴⁹. Since the AI models were trained on individuals with prevalent AF but evaluated only on individuals without prevalent AF²¹, recalibration to the baseline hazard of the VITAL-AF Full Inference Set was performed prior to assessing model calibration⁵⁰. For this analysis, each model score was converted to a predicted probability of AF using the equation \(1 - s_{0}^{\exp\left(\sum \beta X - \sum \beta Y\right)}\), where \(s_{0}\) is the average AF-free survival probability at 2 years in VITAL-AF, \(\sum \beta X\) is the individual’s score value, and \(\sum \beta Y\) is the average score in VITAL-AF. Model weights and parameters are given in Supplementary Table 8.
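The recalibration equation above translates directly into code. In the sketch below, the value 0.969 for the baseline survival and the example scores are illustrative assumptions only; the actual parameters are those reported in Supplementary Table 8.

```python
import numpy as np


def recalibrated_risk(score, s0, mean_score):
    """Predicted 2-year AF probability: 1 - s0 ** exp(score - mean_score).

    score:      individual's linear predictor (sum of beta * X)
    s0:         average AF-free survival probability at 2 years
    mean_score: average linear predictor in the sample (sum of beta * Y)
    """
    return 1.0 - s0 ** np.exp(np.asarray(score, dtype=float) - mean_score)


s0 = 0.969                                # illustrative value only
scores = np.array([-1.0, 0.0, 1.0])       # hypothetical linear predictors
risks = recalibrated_risk(scores, s0, mean_score=0.0)
```

An individual whose score equals the sample average receives a predicted risk of exactly 1 − s0, and risk rises monotonically with the score.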

The potential effect of implementing 1 L ECG-AI AS to select screening candidates, as opposed to the guideline-based age threshold of ≥65 years (i.e., all VITAL-AF participants), was assessed by calculating 2-year time-dependent reclassification indices⁵¹. We also assessed additional guideline-based criteria for selecting screening candidates¹⁰ (i.e., (a) age ≥65 years with ≥1 additional stroke risk factor defined using the CHA₂DS₂-VASc score^24,25, and (b) age ≥75 years). For these analyses, individuals with predicted 1 L ECG-AI AS risk ≥3% were considered high risk. Given that optimal risk thresholds for AF screening remain unclear, we additionally performed decision curve analyses^52,53, in which we compared the expected net benefit of screening using 1 L ECG-AI AS across a range of plausible thresholds used to define elevated AF risk (versus no screening or screening all individuals). We additionally quantified the number of AF screenings which may be avoided while maintaining a constant net benefit.
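The net benefit quantity at the core of decision curve analysis can be sketched as follows, using the standard definition (true positives minus false positives weighted by the odds of the threshold probability, per individual). This is a simplified binary-outcome version under stated assumptions; the analysis itself used 2-year time-to-event data, and the tiny example cohort is synthetic.

```python
import numpy as np


def net_benefit(y, risk, pt):
    """Net benefit of screening those with predicted risk >= pt."""
    y = np.asarray(y)
    flagged = np.asarray(risk) >= pt
    n = len(y)
    tp = np.sum(flagged & (y == 1))
    fp = np.sum(flagged & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)


def net_benefit_screen_all(y, pt):
    """Net benefit of screening everyone (screening no one is 0 by definition)."""
    prev = np.mean(y)
    return prev - (1 - prev) * pt / (1 - pt)


# Synthetic cohort: 2 cases among 10 individuals, with a perfectly ranking model
y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])
risk = np.array([0.5, 0.01, 0.01, 0.01, 0.4, 0.01, 0.01, 0.01, 0.01, 0.01])
nb_model = net_benefit(y, risk, pt=0.03)
nb_all = net_benefit_screen_all(y, pt=0.03)
```

Evaluating both functions over a grid of thresholds and plotting them against `pt` gives the decision curve for the model alongside the screen-all reference strategy.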

In secondary analyses, we assessed model discrimination for incident AF at 6 months and 1 year. We assessed model performance across subgroups of age (i.e., age 65–69, 70–79, and ≥80 to approximate tertiles of the age distribution), sex, and the presence of a class I indication for oral anticoagulation based on stroke risk as defined using the CHA₂DS₂-VASc score (i.e., ≥2 for men and ≥3 for women). We compared the time to AF diagnosis among incident AF cases with 1 L ECG-AI AS risk ≥3% (“true positives”) versus 1 L ECG-AI AS risk <3% (“false negatives”). To classify AF type (paroxysmal versus persistent), we inspected reports of the subset of AF cases with Holter, patch, or event monitoring available within 6 months of incident AF diagnosis. We assessed model performance for paroxysmal and persistent AF, respectively, excluding individuals with incident AF of the other type, or with an unclassifiable type (i.e., no monitoring data). To assess relations between 1 L ECG-AI AS performance and common AF-related biomarkers, we assessed the correlation between 1 L ECG-AI AS and (a) NTproBNP and (b) left atrial anteroposterior size on echocardiography, among individuals with an available measurement taken within 7 days before or following an incident AF diagnosis. We additionally assessed the sensitivity of 1 L ECG-AI AS at the ≥3% risk threshold across strata of NTproBNP (i.e., above the age-adjusted reference range) and left atrial diameter (>40 mm). To assess the relative information loss using 1 L ECG versus standard 12-lead ECG, we compared AF discrimination using a contemporary version of a previously validated 12-lead ECG-AI AF risk estimation algorithm²¹ among the subset of individuals not included in the training set of the 12-lead model and with an available 12-lead ECG performed within 3 years of the baseline visit. 
To assess the behavior of 1 L ECG-AI across tracings, we fit a linear mixed model on intra- versus inter-individual 1 L ECG-AI inferences, and assessed the within-tracing correlation across varying 10-s windows. To quantify whether 1 L ECG-AI AS risk may associate with subsequent rhythm monitoring, we quantified the person-time rates of (i) 12-lead ECGs, and (ii) Holter, event, or patch monitors performed during the study period and prior to any incident AF diagnosis. We considered two-sided p values <0.05 statistically significant. Analyses were performed using Python v3.8⁵⁴ and R v4.0⁵⁵.

Data availability

VITAL-AF trial data contain protected health information and cannot be shared publicly.

Code availability

The ECG-AI model serving as the foundation for the models evaluated in the current analysis is available at https://github.com/broadinstitute/ml4h/tree/master/model_zoo/ECG2AF. Scripts underlying the statistical analysis are available at https://github.com/shaankhurshid/1l_ecg_ai.git.

References

Wolf, P. A., Abbott, R. D. & Kannel, W. B. Atrial fibrillation: a major contributor to stroke in the elderly. The Framingham study. Arch. Intern. Med. 147, 1561–1564 (1987).
Corley, S. D. et al. Relationships between sinus rhythm, treatment, and survival in the Atrial Fibrillation Follow-Up Investigation of Rhythm Management (AFFIRM) Study. Circulation 109, 1509–1513 (2004).
Carlisle, M. A., Fudim, M., DeVore, A. D. & Piccini, J. P. Heart failure and atrial fibrillation, like fire and fury. JACC Heart Fail 7, 447–456 (2019).
Diener, H.-C., Hart, R. G., Koudstaal, P. J., Lane, D. A. & Lip, G. Y. H. Atrial fibrillation and cognitive function: JACC review topic of the week. J. Am. Coll. Cardiol. 73, 612–619 (2019).
Middeldorp, M. E. et al. PREVEntion and regReSsive Effect of weight-loss and risk factor modification on atrial fibrillation: the REVERSE-AF study. Europace 20, 1929–1935 (2018).
Voskoboinik, A. et al. Alcohol abstinence in drinkers with atrial fibrillation. N. Engl. J. Med. 382, 20–28 (2020).
Lip, G. Y. H., Nieuwlaat, R., Pisters, R., Lane, D. A. & Crijns, H. J. G. M. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest 137, 263–272 (2010).
Ruff, C. T. et al. Comparison of the efficacy and safety of new oral anticoagulants with warfarin in patients with atrial fibrillation: a meta-analysis of randomised trials. Lancet 383, 955–962 (2014).
Stroke Prevention in Atrial Fibrillation Study. Final results. Circulation 84, 527–539 (1991).
Hindricks, G. et al. 2020 ESC Guidelines for the diagnosis and management of atrial fibrillation developed in collaboration with the European Association of Cardio-Thoracic Surgery (EACTS). Eur. Heart J. https://doi.org/10.1093/eurheartj/ehaa612 (2020).
NHFA CSANZ Atrial Fibrillation Guideline Working Group et al. National heart foundation of Australia and the Cardiac Society of Australia and New Zealand: Australian Clinical Guidelines for the diagnosis and management of atrial fibrillation 2018. Heart Lung Circ. 27, 1209–1266 (2018).
Lubitz, S. A. et al. Screening for atrial fibrillation in older adults at primary care visits: the VITAL-AF randomized controlled trial. Circulation https://doi.org/10.1161/CIRCULATIONAHA.121.057014 (2022).
Uittenbogaart, S. B. et al. Detecting and diagnosing atrial fibrillation (D2AF): study protocol for a cluster randomised controlled trial. Trials 16, 478 (2015).
Svendsen, J. H. et al. Implantable loop recorder detection of atrial fibrillation to prevent stroke (The LOOP Study): a randomised controlled trial. Lancet 398, 1507–1516 (2021).
Ashburner, J. M., Khurshid, S., Atlas, S. J., Singer, D. E. & Lubitz, S. A. Point-of-care screening for atrial fibrillation: where are we, and where do we go next? Cardiovasc. Digit. Health J. 2, 294–297 (2021).
Khurshid, S., Healey, J. S., McIntyre, W. F. & Lubitz, S. A. Population-based screening for atrial fibrillation. Circ. Res. 127, 143–154 (2020).
Alonso, A. et al. Simple risk model predicts incidence of atrial fibrillation in a racially and geographically diverse population: the CHARGE-AF consortium. J. Am. Heart Assoc. 2, e000102 (2013).
Khurshid, S. et al. Performance of atrial fibrillation risk prediction models in over 4 million individuals. Circ. Arrhythm. Electrophysiol. 14, e008997 (2021).
Christophersen, I. E. et al. A comparison of the CHARGE-AF and the CHA2DS2-VASc risk scores for prediction of atrial fibrillation in the Framingham Heart Study. Am. Heart J. 178, 45–54 (2016).
Khurshid, S. Clinical perspectives on the adoption of the artificial intelligence-enabled electrocardiogram. J. Electrocardiol. 81, 142–145 (2023).
Khurshid, S. et al. Electrocardiogram-based deep learning and clinical risk factors to predict atrial fibrillation. Circulation https://doi.org/10.1161/CIRCULATIONAHA.121.057480 (2021).
Yuan, N. et al. Deep learning of electrocardiograms in sinus rhythm from US veterans to predict atrial fibrillation. JAMA Cardiol. 8, 1131–1139 (2023).
Raghunath, S. et al. Deep neural networks can predict new-onset atrial fibrillation from the 12-lead electrocardiogram and help identify those at risk of AF-related stroke. Circulation https://doi.org/10.1161/CIRCULATIONAHA.120.047829 (2021).
Joglar, J. A. et al. 2023 ACC/AHA/ACCP/HRS guideline for the diagnosis and management of atrial fibrillation: a report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation 149, e1–e156 (2024).
Engdahl, J., Andersson, L., Mirskaya, M. & Rosenqvist, M. Stepwise screening of atrial fibrillation in a 75-year-old population: implications for stroke prevention. Circulation 127, 930–937 (2013).
Gadaleta, M. et al. Prediction of atrial fibrillation from at-home single-lead ECG signals without arrhythmias. npj Digit. Med. 6, 229 (2023).
Raghunath, A. et al. Artificial intelligence–enabled mobile electrocardiograms for event prediction in paroxysmal atrial fibrillation. Cardiovasc. Digit. Health J. 4, 21–28 (2023).
Khunte, A. et al. Detection of left ventricular systolic dysfunction from single-lead electrocardiography adapted for portable and wearable devices. npj Digit. Med. 6, 124 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H. & Luo, Y. A comparative study of pretrained language models for long clinical text. J. Am. Med Inf. Assoc. 30, 340–347 (2023).
Diamant, N. et al. Patient contrastive learning: A performant, expressive, and practical approach to electrocardiogram modeling. PLoS Comput. Biol. 18, e1009862 (2022).
Khurshid, S. et al. Deep learned representations of the resting 12-lead electrocardiogram to predict V̇O₂ at peak exercise. Eur. J. Prev. Cardiol. 31, 252–262 (2024).
Attia, Z. I. et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet 394, 861–867 (2019).
Khurshid, S. & Singh, J. P. Keep your fingers on the PULsE: artificial intelligence to guide atrial fibrillation screening. Eur. Heart J. Digit Health 3, 205–207 (2022).
Ashburner, J. M. et al. Impact of a clinical atrial fibrillation risk estimation tool on cardiac rhythm monitor utilization following acute ischemic stroke: a prepost clinical trial. Am. Heart J. 284, 57–66 (2025).
Steinhubl, S. R. et al. Effect of a home-based wearable continuous ECG monitoring patch on detection of undiagnosed atrial fibrillation: the mSToPS randomized clinical trial. JAMA 320, 146–155 (2018).
Pathak, R. K. et al. Long-term effect of goal-directed weight management in an atrial fibrillation cohort: a long-term follow-up study (LEGACY). J. Am. Coll. Cardiol. 65, 2159–2169 (2015).
Khurshid, S. et al. Performance of single-lead handheld electrocardiograms for atrial fibrillation screening in primary care: the VITAL-AF trial. JACC Adv. 2, 100616 (2023).
Ashburner, J. M. et al. Design and rationale of a pragmatic trial integrating routine screening for atrial fibrillation at primary care visits: the VITAL-AF trial. Am. Heart J. 215, 147–156 (2019).
Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
Svennberg, E. et al. State of the art of artificial intelligence in clinical electrophysiology in 2025: a scientific statement of the European Heart Rhythm Association (EHRA) of the ESC, the Heart Rhythm Society (HRS), and the ESC Working Group on E-Cardiology. Europace 27, euaf071 (2025).
Hopewell, S. et al. CONSORT 2025 statement: updated guideline for reporting randomised trials. BMJ 389, e081123 (2025).
Hulme, O. L. et al. Development and validation of a prediction model for atrial fibrillation using electronic health records. JACC Clin. Electrophysiol. 5, 1331–1341 (2019).
Khurshid, S. et al. Cohort design and natural language processing to reduce bias in electronic health records research. npj Digit. Med. 5, 47 (2022).
Alonso, A. et al. Prediction of atrial fibrillation in a racially diverse cohort: the multi-ethnic study of atherosclerosis (MESA). J Am Heart Assoc. 5, e003077 (2016).
Wang, E. Y. et al. Initial precipitants and recurrence of atrial fibrillation. Circ. Arrhythm. Electrophysiol. 13, e007716 (2020).
Uno, H., Tian, L., Cai, T., Kohane, I. S. & Wei, L. J. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat. Med. 32, 2430–2442 (2013).
Christopoulos, G. et al. Artificial intelligence-electrocardiography to predict incident atrial fibrillation: a population-based study. Circ. Arrhythm. Electrophysiol. 13, e009355 (2020).
Austin, P. C., Harrell, F. E. & Klaveren, D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat. Med. 39, 2714–2742 (2020).
Demler, O. V., Paynter, N. P. & Cook, N. R. Tests of calibration and goodness-of-fit in the survival setting. Stat. Med. 34, 1659–1680 (2015).
Pencina, M. J., D’Agostino, R. B. & Steyerberg, E. W. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat. Med. 30, 11–21 (2011).
Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26, 565–574 (2006).
Pencina, M. J. et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat. Med. 27, 157–172 (2008).
Python Core Team. Python: a dynamic, open source programming language. Python Software Foundation. https://www.python.org/ (2015).
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing Vienna, Austria. https://www.R-project.org/ (2015).

Acknowledgements

This investigator-initiated study was funded by the Bristol Myers Squibb–Pfizer Alliance. This work was also supported by NIH grants R01HL092577, R01HL157635 (Ellinor); K23HL169839 (Khurshid); T32HL007208 (Al-Alusi); 3OT2OD035404-01S3, 1R01NS134597, 1UG3HG014379-01 (Maddah); American Heart Association (Dallas, Texas) 18SFRN34110082, 961045 (Ellinor, Maddah); 23CDA1050571 (Khurshid); and from the European Union MAESTRIA 965286 (Ellinor). Dr. Lubitz previously received support from NIH grants R01HL139731 and R01HL157635, and American Heart Association 18SFRN34250007. Dr. Kany received the Walter Benjamin Fellowship from the Deutsche Forschungsgemeinschaft (521832260).

Author information

These authors contributed equally: Shaan Khurshid, Sam F. Friedman.

Authors and Affiliations

Cardiovascular Research Center, Heart and Vascular Institute, Mass General Brigham, Boston, MA, USA
Shaan Khurshid, Mostafa A. Al-Alusi, Jennifer E. Ho, Steven A. Lubitz & Patrick T. Ellinor
Cardiovascular Disease Initiative, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
Shaan Khurshid, Mostafa A. Al-Alusi, Shinwan Kany, Steven A. Lubitz & Patrick T. Ellinor
Telemachus and Irene Demoulas Family Foundation Center for Cardiac Arrhythmias, Heart and Vascular Institute, Mass General Brigham, Boston, MA, USA
Shaan Khurshid, Steven A. Lubitz & Patrick T. Ellinor
Data Sciences Platform, Broad Institute of Harvard and the Massachusetts Institute of Technology, Cambridge, MA, USA
Sam F. Friedman & Mahnaz Maddah
Division of Cardiology, Heart and Vascular Institute, Mass General Brigham, Boston, MA, USA
Mostafa A. Al-Alusi
Department of Cardiology, University Heart and Vascular Center Hamburg-Eppendorf, Hamburg, Germany
Shinwan Kany
German Center for Cardiovascular Research (DZHK), Partner Site Hamburg/Kiel/Lübeck, Hamburg, Germany
Shinwan Kany
Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
Thomas Sommers
Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
Christopher D. Anderson
Henry and Allison McCance Center for Brain Health, Massachusetts General Hospital, Boston, MA, USA
Christopher D. Anderson
Department of Neurology, Brigham and Women’s Hospital, Boston, MA, USA
Christopher D. Anderson
Cardiology Division, Beth Israel Deaconess Medical Center, Boston, MA, USA
Jennifer E. Ho
Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
David D. McManus
Division of General Internal Medicine, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, USA
Leila H. Borowsky, Jeffrey M. Ashburner, Steven J. Atlas & Daniel E. Singer

Contributions

S. Khurshid and S.F.F. contributed equally and are co-first authors. S. Khurshid and S.F.F. conceived of the study. S. Khurshid, S.F.F. and T.S. contributed to study design, modeling, and statistical analysis. S. Khurshid and S.F.F. drafted the manuscript. M.A.A.-A., S. Kany, T.S., C.D.A., J.E.H., D.D.M., L.H.B., J.M.A., S.A.L., S.J.A., M.M., D.E.S. and P.T.E. performed critical reviews. All authors discussed the results, contributed to the final work, and have provided final approval of the completed version.

Corresponding author

Correspondence to Patrick T. Ellinor.

Ethics declarations

Competing interests

Dr. Lubitz is employed at Novartis Institutes for Biomedical Research and has received research support from Bristol Myers Squibb/Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier, and IBM, and has consulted for Bristol Myers Squibb/Pfizer, Blackstone Life Sciences, and Invitae. Dr. Ellinor receives sponsored research support from Bayer AG, IBM Research, Bristol Myers Squibb, Pfizer and Novo Nordisk; he has also served on advisory boards or consulted for Bayer AG. Dr. Ho has received sponsored research support from Bayer AG and research supplies from EcoNugenics, Inc. Dr. Singer has received research support from the Eliot B. and Edith C. Shoolman Fund of Massachusetts General Hospital and Bristol Myers Squibb, and has consulted for Bristol Myers Squibb, Fitbit (Google), Medtronic, and Pfizer. Dr. Atlas has received sponsored research support from Bristol Myers Squibb/Pfizer and American Heart Association (18SFRN34250007) and has consulted for Boehringer Ingelheim, Bristol Myers Squibb, Pfizer, Premier and Fitbit (Google). Dr. Khurshid receives sponsored research support from Bayer AG. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

1l_ecg_af_supp_092425 (PDF)

CONSORT_2025 (PDF)

ehra_checklist (PDF)

STARD-2015-checklist (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Khurshid, S., Friedman, S.F., Al-Alusi, M.A. et al. Artificial intelligence-enabled analysis of handheld single-lead electrocardiograms to predict incident atrial fibrillation: an analysis of the VITAL-AF randomized trial. npj Digit. Med. 8, 776 (2025). https://doi.org/10.1038/s41746-025-02164-2

Received: 18 July 2025
Accepted: 09 November 2025
Published: 26 November 2025
Version of record: 22 December 2025
DOI: https://doi.org/10.1038/s41746-025-02164-2
