Abstract
Whether artificial intelligence (AI) analysis of single-lead ECG (1 L ECG) can predict incident AF is unknown. In the VITAL-AF trial (ClinicalTrials.gov NCT03515057, registered 2/24/2021) of primary care patients aged ≥65 years undergoing handheld 1 L ECG screening, we tested three AI approaches to incident AF prediction, and compared the best model to the CHARGE-AF risk score. In a test set of 4,221 individuals, a published AI model trained using single standard ECG leads (“1 L ECG-AI”) provided similar 2-year AF discrimination to models trained with VITAL-AF data. In the full VITAL-AF sample of 15,694 individuals without prevalent AF (2-year incident AF 3.1%), 1 L ECG-AI with age/sex (1 L ECG-AI AS) had comparable discrimination (area under the receiver operating characteristic curve [AUROC] 0.695 [0.637–0.742]; average precision [AP] 0.060 [0.050–0.078]) to CHARGE-AF (AUROC 0.679 [0.623–0.730]; AP 0.062 [0.052–0.080]; AUROC p = 0.46, AP p = 0.92). Net reclassification improvement was favorable versus age ≥65 years (0.27 [0.22–0.32]). 1 L ECG-AI may increase efficiency and reach of AF screening.
Introduction
Atrial fibrillation (AF) is a common arrhythmia associated with substantial preventable morbidity1,2,3,4,5,6. Since AF may be asymptomatic and unrecognized, there is great interest in mass screening to detect undiagnosed AF and enable timely preventive interventions7,8,9. However, contemporary assessments of AF screening have generally relied on the simple age threshold of ≥65 years endorsed by guidelines from the European Society of Cardiology and Cardiac Society of Australia and New Zealand10,11, leading to screening of many individuals at low AF risk and failure to demonstrate meaningful improvements in key outcomes (e.g., stroke, mortality)12,13,14. Therefore, there is a critical unmet need for a more targeted approach to AF screening focused on individuals at higher risk of undiagnosed AF15,16.
To this end, it is well-recognized that future AF risk can be estimated with reasonable accuracy. Composite risk factor scores, such as the well-validated Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF) score17, consistently demonstrate moderate predictive value for incident AF18,19. However, uptake of clinical AF scores is limited by cumbersome calculation and potential misclassification of score components20. More recently, studies have reported the ability to estimate AF risk using deep learning applied to a single 12-lead electrocardiogram (ECG)21,22,23, offering a novel mechanism for accurate and efficient AF risk estimation. Although 12-lead ECGs can be obtained within minutes in most clinic settings, acquisition typically requires trained staff and specialized equipment. In contrast, single-lead ECG tracings (1 L ECG) are increasingly available from consumer wearable and handheld devices, and therefore possess the potential to substantially extend the reach of AF risk estimation. Although limited surrogate evaluations (e.g., restricting to one lead of a 12-lead ECG) have suggested that 1 L ECG may retain AF risk information21, whether real-world 1 L ECG tracings can predict AF remains unknown.
Here, we leverage a unique resource of over 16,000 individuals with >35,000 handheld 1 L ECG tracings taken prospectively as part of the VITAL-AF randomized clinical trial of AF screening embedded within eight primary care practices within the Massachusetts General Hospital (MGH) network. We compare several approaches to AI-enabled AF risk estimation: a published convolutional neural network (CNN) trained using single standard ECG leads (1 L ECG-AI)21, 1 L ECG-AI fine-tuned in VITAL-AF (1 L ECG-AI Fine-Tuned), and a CNN trained exclusively within VITAL-AF (1 L VITAL). We then assess the performance of 1 L ECG-AI in the full VITAL-AF sample, quantify relations between ECG-AI-based and clinical risk signals, and establish the potential for 1 L ECG-based AF risk estimation to extend the reach of risk-informed AF screening well beyond the traditional clinic setting.
Results
Identifying an optimal method for AF risk estimation using 1 L ECG
The current analysis comprised ~16,000 participants of the VITAL-AF trial, a randomized trial of AF screening in primary care patients aged 65 years and older, who were subsequently split into a development set and test set (Table 1 and Fig. 1). Here, we compared three deep learning-based approaches to AF risk estimation: 1 L ECG-AI, 1 L ECG-AI Fine-Tuned, and 1 L VITAL. 1 L ECG-AI was developed fully outside VITAL-AF, whereas 1 L ECG-AI Fine-Tuned and 1 L VITAL utilized the development set of the VITAL-AF sample for fine-tuning and training, respectively. Each model performed inference using the 10-s segment of the 30-s 1 L ECG tracing predicted to have the least tracing noise (see Methods and Supplementary Figs. 1, 2). Patients with prevalent AF were included in model training but not evaluation (Fig. 1). ECG-AI model architectures are shown in Supplementary Fig. 3.
Depicted is an overview of the current two-part study. In the first part, we compared three approaches to incident AF risk estimation using single-lead ECG (1 L ECG): (a) a published 12-lead ECG-based convolutional neural network (CNN) trained using only one standard ECG lead (1 L ECG-AI), (b) 1 L ECG-AI with additional fine-tuning within a subset of the VITAL-AF sample (1 L ECG-AI Fine-Tuned), and (c) a de novo CNN trained solely within a subset of the VITAL-AF data (1 L VITAL). Each of these models was compared in the VITAL-AF Test Set, which only included individuals without prevalent AF and who were excluded from model training or fine-tuning. In the second part, given the favorable performance of 1 L ECG-AI, which was not trained using any VITAL-AF data, we performed a dedicated evaluation of this model in the full VITAL-AF sample without prevalent AF (VITAL-AF Full Inference Set), where we compared performance to the validated CHARGE-AF clinical risk score17. Original figure created using Adobe Illustrator.
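The least-noise window selection described above (inference on the 10-s segment of each 30-s tracing predicted to have the least noise) can be sketched as a sliding-window search. This is a minimal illustration rather than the study's implementation: the noise score used here (`diff_energy`, the mean squared first difference) is a hypothetical stand-in for the noise predictor described in the Methods, and the 1-s step size, the 300 Hz sampling rate of the toy tracing, and all function names are assumptions.

```python
import math
import random


def diff_energy(window):
    """Mean squared first difference: a crude, hypothetical proxy for tracing noise."""
    return sum((b - a) ** 2 for a, b in zip(window, window[1:])) / (len(window) - 1)


def least_noise_window(signal, fs, window_s=10.0, step_s=1.0, noise_score=diff_energy):
    """Return (start_index, window) for the candidate window with the lowest noise score."""
    win = int(round(window_s * fs))
    step = int(round(step_s * fs))
    if win > len(signal):
        raise ValueError("tracing is shorter than one window")
    best_start, best_score = 0, float("inf")
    for start in range(0, len(signal) - win + 1, step):
        score = noise_score(signal[start:start + win])
        if score < best_score:
            best_start, best_score = start, score
    return best_start, signal[best_start:best_start + win]


def demo_tracing(fs=300, seed=0):
    """Toy 30-s tracing: a 1 Hz sine that is noise-free only between seconds 10 and 20."""
    rng = random.Random(seed)
    signal = []
    for i in range(30 * fs):
        clean = math.sin(2 * math.pi * i / fs)
        noisy = not (10 * fs <= i < 20 * fs)
        signal.append(clean + (rng.gauss(0, 0.2) if noisy else 0.0))
    return signal


start, window = least_noise_window(demo_tracing(), fs=300)
print(start / 300, len(window))  # start time in seconds, number of samples selected
```

Passing the scoring function as a parameter keeps the search independent of how noise is estimated, so a learned noise classifier could be substituted without changing the window logic.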
Models were assessed in the VITAL-AF Test Set comprising 4221 randomly selected individuals from VITAL-AF not used in any aspect of model training or fine-tuning and without prevalent AF (age 74 ± 7 years, 61% women, Fig. 1). Detailed baseline characteristics are shown in Table 1. At 2 years, there were 119 incident AF events (cumulative incidence 3.2% [95% CI 2.6–3.9]). Discrimination of 2-year AF was comparable using 1 L ECG-AI (area under receiver operating characteristic curve [AUROC] 0.672 [95% CI 0.576–0.753]) and 1 L VITAL (0.667 [0.554–0.757]), which were numerically favorable to 1 L ECG-AI Fine-Tuned (0.654 [0.546–0.757]) (Fig. 2). Trends were similar according to 2-year average precision (1 L ECG-AI 0.061 [0.048–0.082], 1 L ECG-AI Fine-Tuned 0.058 [0.046–0.078], 1 L VITAL 0.081 [0.056–0.12]) (Supplementary Fig. 4).
Depicted are receiver operating characteristic curves for 2-year incident AF in the VITAL-AF test set according to three alternative approaches to 2-year AF risk estimation using deep learning of the 1 L ECG: a previously published 12-lead ECG-based convolutional neural network (CNN) trained using only single standard ECG leads (1 L ECG-AI, green), 1 L ECG-AI with additional fine-tuning within a subset of the VITAL-AF sample (1 L ECG-AI Fine-Tuned, purple), and a de novo CNN trained solely within a subset of the VITAL-AF data (1 L VITAL, magenta).
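For readers less familiar with the two discrimination metrics reported above, the following is a minimal sketch of AUROC and average precision for a binary outcome. It is a simplification under stated assumptions: it treats 2-year AF as a fixed binary label and ignores censoring, which the study's time-to-event estimates may handle differently, and the function names are illustrative. One useful point of reference is that a non-informative model has an expected average precision close to the event rate, so values near 0.06 should be read against a 2-year incidence of roughly 3%.

```python
def auroc(scores, labels):
    """Probability that a random case outranks a random non-case (ties count as 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each case (tied scores are not merged)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one case")
    return total / hits


# Toy data: four individuals, two of whom develop AF.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auroc(toy_scores, toy_labels), average_precision(toy_scores, toy_labels))
```

The pairwise AUROC loop is quadratic in sample size and is written for clarity; a rank-based formulation would be preferable for cohorts of the size analyzed here.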
Performance of 1 L ECG-AI in the VITAL-AF Full Inference Set
Given that 1 L ECG-AI provided comparable or better performance than the other models despite being trained completely outside VITAL-AF, we further evaluated this model in the VITAL-AF Full Inference Set: all VITAL-AF cohort members without prevalent AF who underwent ≥1 screening 1 L ECG (n = 15,694, age 74 ± 7 years, 58% women, Fig. 1 and Table 1).
At 2 years, there were 411 incident AF events (cumulative incidence 3.1% [95% CI 2.7–3.4]). 1 L ECG-AI had consistent discrimination of 2-year incident AF risk (AUROC 0.666 [0.603–0.721]; AP 0.053 [0.045–0.068]). With age and sex added to 1 L ECG-AI (“1 L ECG-AI AS”), discrimination improved (0.695 [0.637–0.742]; AP 0.060 [0.050–0.077]) and was comparable to the validated 11-component CHARGE-AF clinical risk score (0.679 [0.625–0.729]; 0.062 [0.052–0.080], p = 0.46 for AUROC, p = 0.92 for AP) (Table 2 and Fig. 3).
Depicted is a summary of 1 L ECG-AI model performance in the VITAL-AF full inference set. Panels a, b depict discrimination of 2-year incident AF using the receiver operating characteristic curve (panel a) and the precision-recall curve (panel b). The y-axis of panel b is truncated at a precision of 0.4 to facilitate visualization of differences in model performance. Panel c displays univariable (top half) and mutually-adjusted (bottom half) hazard ratios for incident AF in Cox proportional hazards models, including terms for CHARGE-AF and 1 L ECG-AI (logit-transformed, see text). Panel d depicts calibration of each model, where the x-axis plots predicted 2-year AF risk, and the y-axis plots observed 2-year AF incidence, and the hashed diagonal line represents a perfectly calibrated model. Each curve is labeled by the integrated calibration index49, a measure of average model error, where lower values indicate more accurate absolute risk estimates.
Since the AI models were trained on samples that included individuals with prevalent AF but evaluated only on individuals without prevalent AF, recalibration to the baseline hazard of the VITAL-AF full inference set was performed prior to assessing model calibration. Although calibration was reasonable for each model, absolute risk estimates were particularly accurate for 1 L ECG-AI AS (integrated calibration index [ICI] 0.0048 [0.0016–0.0080], where lower values indicate lower average error), whereas both CHARGE-AF (ICI 0.015 [0.012–0.018]) and 1 L ECG-AI (ICI 0.022 [0.019–0.023]) showed a tendency to overestimate observed AF incidence at the high end of the predicted risk distribution (Table 2 and Fig. 3).
Combining clinical risk factors with AI signals
Combining CHARGE-AF and 1 L ECG-AI resulted in modest numerical improvement in discrimination (AUROC 0.703 [0.650–0.751]; AP 0.063 [0.054–0.078]) (Table 2 and Fig. 3). Absolute risk estimates were well-calibrated (ICI 0.0058 [0.0022–0.0094]). In a Cox proportional hazards model including terms for both 1 L ECG-AI and CHARGE-AF, both terms were independent predictors of incident AF (Fig. 3). Consistent with a complementary relation, the cumulative risk of AF was highest among individuals classified as high risk (i.e., 2-year AF risk ≥3%) according to both ECG-AI and CHARGE-AF, followed by high risk according to one model only, followed by individuals not at high risk according to either model (Fig. 4). Individuals at high risk for AF using 1 L ECG-AI only were generally younger and with lower comorbidity burden compared to individuals at high risk using CHARGE-AF (Supplementary Table 1). The correlation between 1 L ECG-AI and CHARGE-AF was moderate (r = 0.44 [95% CI 0.43–0.45]).
Depicted is the cumulative risk of AF across strata of predicted risk using 1 L ECG-AI AS. Panel a plots cumulative risk across categories of 1 L ECG-AI AS risk, and Panel b plots cumulative risk across strata of both ECG-AI AS and CHARGE-AF. In both panels, high AF risk is defined as 2-year AF risk ≥3% (approximating the top tertile of risk). The number at risk across each stratum over time is depicted below each plot.
1 L ECG-AI AS to stratify risk of longitudinal AF
Given the favorable performance of 1 L ECG-AI AS with the requirement for only one 1 L ECG, age, and sex, this model was selected for further evaluation as a potential tool to stratify risk of AF. Categories of 1 L ECG-AI AS predicted risk (i.e., <1%, 1–3%, and ≥3% to approximate tertiles) effectively separated longitudinal AF incidence (Fig. 4). Two-year AF incidence was markedly higher with 1 L ECG-AI AS in the top 5% (8.0% [5.7–10.2]) versus bottom 5% (0.89% [0.23–1.55]) (Supplementary Fig. 5).
Use of 1 L ECG-AI AS 2-year predicted risk ≥3% rather than the age threshold of ≥65 years endorsed in certain guidelines10,11 and used to select the VITAL-AF sample10 would result in favorable net reclassification (NRI 0.27 [95% CI 0.22–0.32]), driven by an appreciable degree of appropriate non-case reclassification (i.e., deferring screening of 9,941 non-cases, NRI− 64.2% [63.3–65.0%]), but at the cost of some unfavorable case reclassification (i.e., failing to screen 149 cases, NRI+ −37.1% [−42.1% to −32.2%]) (Table 3 and Supplementary Table 2). Reclassification was also favorable using 1 L ECG-AI compared to additional guideline-based criteria for screening10 including age ≥65 years with elevated stroke risk24,25 (NRI 0.18 [0.12–0.23]), and age ≥75 years (0.078 [0.024–0.13]) (Table 3 and Supplementary Table 2).
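The arithmetic behind the net reclassification improvement can be made explicit with a short sketch. For two categories (screen versus do not screen), NRI is the sum of a case component (net proportion of cases moved up to "screen") and a non-case component (net proportion of non-cases moved down to "do not screen"). When the reference rule screens everyone, as with the age ≥65 years threshold in this sample, cases can only move down and non-cases can only move down, which is why the case component is negative and the non-case component is positive. The function below is an illustrative, censoring-naive implementation on toy data; the reported components may have been estimated with time-to-event methods, which is an assumption on our part.

```python
def categorical_nri(old_high, new_high, event):
    """Two-category NRI of a new rule versus an old rule.

    Returns (nri_cases, nri_noncases, nri_total); censoring is ignored.
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, ev in zip(old_high, new_high, event):
        if ev:
            n_e += 1
            if new and not old:
                up_e += 1
            elif old and not new:
                down_e += 1
        else:
            n_n += 1
            if new and not old:
                up_n += 1
            elif old and not new:
                down_n += 1
    nri_cases = (up_e - down_e) / n_e
    nri_noncases = (down_n - up_n) / n_n
    return nri_cases, nri_noncases, nri_cases + nri_noncases


# Toy cohort of 10: the old rule screens everyone, the new rule screens 4.
old_rule = [True] * 10
events = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
new_rule = [True, True, False, True, True, False, False, False, False, False]
print(categorical_nri(old_rule, new_rule, events))

# Consistency check on the reported components: 64.2% + (-37.1%) = 27.1%.
print(round(0.642 + (-0.371), 3))
```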
Decision curve analysis demonstrated that use of 1 L ECG-AI AS rather than screening all individuals ≥65 years would result in net benefit across a wide range of thresholds used to select screening candidates (Supplementary Fig. 6), and lead to substantial reductions in individuals screened (Supplementary Fig. 7) while maintaining constant net benefit.
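The net benefit quantity underlying a decision curve can be illustrated with a few lines of code. At a risk threshold p_t, net benefit is the true-positive fraction minus the false-positive fraction weighted by the odds p_t / (1 − p_t); the "screen all" strategy is the special case in which everyone is classified as positive. The sketch below is a simplified binary-outcome version on toy data (no censoring adjustment), and the function names are illustrative rather than drawn from the study's code.

```python
def net_benefit(pred_risk, event, threshold):
    """Net benefit of screening individuals whose predicted risk is >= threshold."""
    n = len(event)
    tp = sum(1 for r, e in zip(pred_risk, event) if r >= threshold and e)
    fp = sum(1 for r, e in zip(pred_risk, event) if r >= threshold and not e)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_screen_all(event, threshold):
    """Net benefit of screening everyone, evaluated at the same threshold."""
    prevalence = sum(1 for e in event if e) / len(event)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy cohort: four individuals, one incident AF case, threshold of 3%.
toy_risk = [0.01, 0.02, 0.05, 0.10]
toy_event = [0, 0, 0, 1]
print(net_benefit(toy_risk, toy_event, 0.03))
print(net_benefit_screen_all(toy_event, 0.03))
```

In this toy cohort the model-guided strategy screens two individuals instead of four and still attains a higher net benefit than screening everyone, which mirrors the qualitative pattern described above.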
Secondary and subgroup analyses
Saliency maps demonstrated that the 1 L ECG P wave and surrounding regions had the greatest effect on 1 L ECG-AI AF risk estimates (Fig. 5). Among 81 incident AF cases with Holter, patch, or event monitoring available within 6 months of incident AF diagnosis, 66 (81.5%) had evidence of paroxysmal AF while 15 (18.5%) had findings consistent with persistent AF. Although estimates had limited precision, AF discrimination using 1 L ECG-AI AS was higher for persistent AF (AUROC 0.717 [0.573–0.840]) versus paroxysmal AF (0.601 [0.511–0.698]). Among individuals with incident AF and available measurements on or after AF diagnosis, there was a weak positive correlation between 1 L ECG-AI AS and NTproBNP (r = 0.14, 95% CI 0.01–0.27) and between 1 L ECG-AI AS and left atrial diameter (r = 0.17, 95% CI 0.05–0.28). At the ≥3% AF risk threshold, there was no difference in the sensitivity of 1 L ECG-AI AS among individuals with NTproBNP that was elevated (70.3%, 95% CI 61.1–78.2) versus non-elevated (66.4%, 95% CI 56.5–75.0), or left atrial diameter that was enlarged (66.7%, 95% CI 57.9–74.5) or non-enlarged (61.7%, 95% CI 53.1–69.6). Model discrimination was generally consistent across shorter time windows (e.g., 1 L ECG-AI AS 6-month AUROC 0.685 [95% CI 0.641–0.730]) (Supplementary Table 3). All models had lower discrimination within strata of age, but 1 L ECG-AI AS had the best relative preservation in performance (i.e., age 65–69: 1 L ECG-AI AS 2-year AUROC 0.627 [0.505–0.730] vs. CHARGE-AF 0.577 [0.473–0.686]) (Supplementary Table 4). All models performed similarly among men versus women (Supplementary Table 5), and among individuals with a class I indication for oral anticoagulation based on stroke risk using the CHA2DS2VASc score (Supplementary Table 6).
Among 3,479 individuals with a 12-lead ECG performed within 3 years of the baseline visit, discrimination was moderately higher using a previously validated 12-lead ECG model (AUROC 0.812 [0.736–0.876]) versus 1 L ECG-AI (0.772 [0.687–0.848]). In a linear mixed model assessing intra- versus inter-individual variability, we found that 1 L ECG-AI had an intraclass correlation coefficient of 0.65, and within-tracing correlation was high (r = 0.87 for the least noise window versus last 10-s window). Among the 411 incident AF cases, time to AF diagnosis was similar among individuals with 1 L ECG-AI AS 2-year predicted risk ≥3% (“true positives”, n = 262) (median 272 days [quartile-1: 143, quartile-3: 414]) versus individuals with 1 L ECG-AI AS 2-year predicted risk <3% (“false negatives”, n = 149) (270 days [157, 405]). Individuals with 1 L ECG-AI AS 2-year predicted risk ≥3% had a higher rate of ECG (107.6 per 100 person-years [95% CI 105.5–109.7]) and Holter, patch, or event monitor (7.68 [7.12–8.24]) utilization prior to an AF diagnosis compared to individuals with 1 L ECG-AI AS 2-year predicted risk <3% (ECG: 52.6 [51.5–53.7]; monitor: 4.71 [4.37–5.05]).
Depicted is a saliency map of 1 L ECG-AI demarcating regions of the 1 L ECG waveform having the greatest influence on atrial fibrillation (AF) risk predictions. Blue shades depict the magnitude of the gradient of predicted AF risk with respect to the 1 L ECG waveform amplitude, where darker shades illustrate regions of the waveform exerting greater salience, or influence on AF risk predictions. Maps were generated using a random sample of 64 tracings from the VITAL-AF sample. An exemplar 1 L ECG is overlaid on the saliency map.
Discussion
Here, we developed and tested three distinct approaches to deep learning-based estimation of future AF risk using a unique resource of over 35,000 real-world handheld 1 L ECG tracings collected prospectively in over 16,000 primary care patients enrolled in a large AF screening trial. Utilizing a test set of more than 4200 individuals, we found that 1 L ECG-AI, a model previously trained using over 450,000 single-lead tracings from standard 12-lead ECGs and applied to 1 L ECGs in a transfer learning approach, provided comparable performance to models trained or fine-tuned using VITAL-AF data. In the full VITAL-AF sample, a combination of 1 L ECG-AI with age and sex (1 L ECG-AI AS) discriminated AF risk favorably relative to the validated 11-component CHARGE-AF clinical risk factor score, had better calibration to observed AF risk, and led to favorable reclassification of risk versus the age threshold of ≥65 years currently endorsed in AF screening guidelines10.
Our findings support and extend prior analyses suggesting the potential for deep learning to predict future AF using 1 L ECG tracings. Prior work has focused on surrogates for true 1 L ECG (i.e., one lead of a standard 12-lead ECG)21 or classification of concurrent AF rather than true AF prediction26,27. Using the AliveCor Kardia device (the same device used in the current study), Raghunath et al. developed a model capable of distinguishing concurrent paroxysmal AF on the basis of 1 L ECG tracings showing sinus rhythm27. More recently, Gadaleta et al. developed a model capable of discriminating very short-term future AF (i.e., 14 days) using clinical-grade 1 L ECG patch monitors obtained for a variety of clinical indications and including individuals whose AF was known26. Our work substantially extends prior analyses by demonstrating the ability of AI-enabled analysis to discriminate AF up to 2 years in the future using real-world handheld 1 L ECGs obtained prospectively within the specific population in which AF risk estimation is most meaningful (i.e., older individuals with no known AF diagnosis despite receiving regular primary care).
Our findings establish the feasibility of incident AF prediction using handheld 1 L ECG, and provide specific support for the role of transfer learning in the adaptation of ECG-based AI models to the mobile 1 L ECG modality. A variety of prior studies have suggested retained predictive performance of standard ECG-based models when applied to a single lead of the 12-lead ECG as a surrogate for a true 1 L ECG. In real-world use, however, 1 L ECG tracings offer unique characteristics (e.g., more noise, differing sampling frequency, variable duration)21,28. Here, we observed substantially improved model performance when utilizing a strategy of selectively applying inference to the 10-second segment of the 30-s tracings classified as having the least tracing noise. Moreover, we found that 1 L ECG-AI, a pre-trained AF risk estimation model developed using single leads of a standard 12-lead ECG (originally trained using over 450,000 tracings) provided comparable performance to a model trained de novo using true 1 L ECGs from the smaller VITAL-AF Development Set, even though 1 L ECG-AI had no prior exposure to handheld 1 L ECG tracings. Furthermore, no meaningful improvement was observed from fine-tuning 1 L ECG-AI in the VITAL-AF sample. Our findings are consistent with a number of recent studies showcasing the value of transfer learning in domains as diverse as network biology29 and natural language30, in which models trained using comparably large sample sizes in external datasets to perform related tasks offer favorable performance to models trained de novo in regimes where sample sizes are more limited (e.g., handheld 1 L ECG)31,32. Importantly, saliency maps demonstrated that, consistent with clinical expectations and similar to the 12-lead ECG-AI model on which it is based21, 1 L ECG-AI AF risk estimates were heavily influenced by the 1 L ECG P wave and surrounding regions, demonstrating the ability of 1 L ECG-AI to recognize key features of the 1 L ECG waveform.
Our observations provide evidence that a robust appraisal of the expected performance of deep learning models can only be obtained by evaluating models in the specific clinical context of their intended use. The overall discrimination of 1 L ECG-AI (AUROC ~0.69 with age and sex) is lower than the discrimination reported for the 12-lead ECG-based detection of concurrent AF (AUROC ~0.8–0.9)22,33 and estimation of future AF (AUROC ~0.75–0.85)21,23. We suspect there are two primary factors accounting for lower performance. First, there is clearly information loss when moving from 12-lead ECG to a single-lead ECG of the same format (e.g., AUROC ~0.75–0.85 for 12-lead models versus ~0.72 for 1 L ECG-AI prior to transfer to VITAL-AF), followed by further loss when applied to real-world 1 L ECG tracings (e.g., AUROC ~ 0.66 for 1 L ECG-AI after transfer to VITAL-AF). Second, when compared to prior retrospective assessments of ECGs, which are subject to indication bias on account of clinical acquisition, our analysis represents handheld 1 L ECGs obtained prospectively and unselectively. Indeed, in a secondary analysis focused on individuals with a 12-lead ECG performed for clinical indications within the preceding three years, not only did the 12-lead model achieve higher discrimination (AUROC 0.81) than the 1 L ECG model (AUROC 0.77), but the 1 L ECG model had substantially higher AUROC than it did in the overall sample. The key contribution of population characteristics is also supported by substantially lower performance of the CHARGE-AF clinical risk score (AUROC 0.68) compared to multiple prior retrospective validations (AUROC ~0.7–0.8)18,19. On balance, our findings establish the feasibility of future AF risk estimation using handheld 1 L ECG, and additionally serve as an important demonstration that expected model performance may differ substantively when models are applied to the specific clinical settings in which implementation is intended.
Our results highlight the potential for 1 L ECG to extend the efficiency and reach of AF screening efforts. Despite providing a path to earlier AF detection and prompt initiation of preventive interventions (e.g., oral anticoagulation, lifestyle modification), recent screening efforts have failed to demonstrate substantive gains in AF diagnosis or improvements in hard outcomes such as stroke or mortality16. One major limitation has been the use of simple guideline-based age thresholds (e.g., ≥65 years)10, whose application leads to screening many individuals at relatively low risk of AF. A risk-informed approach may be more efficient15,34, and with evidence demonstrating limited uptake of AF risk scores such as CHARGE-AF on account of cumbersome calculation and potential misclassification of score inputs, the ability to estimate AF risk with comparable accuracy using only age, sex, and a single handheld 1 L ECG has particular value35. Compared to screening all individuals aged ≥65 years10, 1 L ECG-AI AS exhibited highly favorable reclassification (0.27), driven by a 64% increase in specificity (i.e., appropriate down-classification of nearly 10,000 individuals who did not develop AF within 2 years). Of course, deferring screening of non-high-risk individuals also leads to a decrease in sensitivity (37.1%), and future work is warranted to reduce false negatives and identify optimal thresholds to balance AF screening yield and efficiency. Importantly, decision curve analyses demonstrated that 1 L ECG-AI AS would lead to substantial reductions in individuals screened while retaining a constant net benefit across a wide range of thresholds. Conversely, methods such as 1 L ECG-AI AS may also facilitate identification of high-risk younger individuals who would otherwise be missed by traditional age criteria, a possibility we could not evaluate in our current analysis, which included only individuals ≥65 years. 
Similarly, consistent with prior data demonstrating a complementary nature of clinical risk factors and AI signals21, 1 L ECG-AI AS demonstrated the greatest AF risk discrimination in individuals with relatively lower comorbidity burden, suggesting a potential role for combined approaches leveraging both clinical and AI criteria to select AF screening candidates. Of note, there was only a weak positive correlation between 1 L ECG-AI AS and either NTproBNP or left atrial size among individuals with incident AF, suggesting that 1 L ECG-AI risk estimates provide non-overlapping value to common AF-related biomarkers, which are also less well-suited for population screening. Although the precision of estimates was limited, we did observe higher performance for incident persistent AF, suggesting that 1 L ECG-AI AS may be particularly useful for discriminating risk of developing persistent or high-burden AF, which may be more clinically actionable. Ultimately, future work is warranted to identify optimal methods for AF risk estimation, which should consider the setting in which risk stratification is intended (e.g., clinic versus home), as well as potential tradeoffs in performance (e.g., 12-lead ECG versus 1 L ECG). Overall, our findings suggest that 1 L ECG may possess utility not only for AF screening, but also for AF risk stratification, wherein individuals at elevated AF risk may be considered for more intensive screening36 (e.g., repeated monitoring with 1 L ECG, application of clinical-grade monitors) and interventions to prevent AF onset altogether (e.g., alcohol cessation6, weight loss37). Future trials of AF screening guided by 1 L ECG-based AI are warranted.
Our study should be interpreted in the context of design. First, our 1 L ECG-based model comparison was performed in a holdout test set of VITAL-AF. We note that prospectively acquired 1 L ECG tracings from a trial population are a unique resource, highlighting a key strength of our study while also limiting options for external validation. We also note that the 1 L ECG-AI model ultimately selected for detailed analyses in VITAL-AF Full Inference Set was developed using a completely different modality (single leads of a standard 12-lead ECG) in non-overlapping individuals outside of VITAL-AF. Nevertheless, as 1 L ECG datasets become increasingly available in the future, it will be important to assess model performance in other healthcare systems and using other 1 L ECG devices. Although we suspect that our noise-window method should be useful for analyses using other 1 L ECG devices prone to noise and acquisition artifact (e.g., smartwatch ECG), it will be important to quantify potential impacts of differences in acquisition characteristics including ECG vector, sampling frequency (e.g., 300 Hz using AliveCor versus 512 Hz using Apple Watch), and tracing duration, and assess the degree to which recalibration or fine-tuning may be needed. Second, our estimates of model metrics have limited precision due to limitations in sample size and modest event rates in the smaller VITAL-AF Test Set. Nevertheless, we submit estimates are sufficient to support the performance of 1 L ECG-AI as comparable to or better than the comparison models, justifying its use in subsequent analyses in the larger VITAL-AF Full Inference Set. Third, the current follow-up is limited to two years. Fourth, incident AF was defined using a combination of a validated electronic health record-based algorithm and manual validation. 
Although misclassification of undiagnosed AF remains possible with our design, we submit that several factors (e.g., enrollment of patients engaged in routine primary care, use of 1 L ECG screening, which has demonstrated reasonable positive predictive value38) serve to limit its extent. Fifth, tracings were obtained by trained medical assistants and likely represent higher-quality tracing acquisition than regular consumer use. Sixth, our models performed inference using 10-s 1 L ECG windows, to mirror the shape of a standard 12-lead ECG and to facilitate inference of our existing 1 L ECG-AI model in a transfer learning context. Future assessment of de novo models trained using longer windows is warranted. Seventh, the absence of uniform protocolized rhythm monitoring introduces potential ascertainment bias, and we did observe that individuals with higher estimated risk according to 1 L ECG-AI AS had higher rates of ECG and Holter, patch, or event monitoring during follow-up. Eighth, the generalizability of our findings may be limited by sample specificity (e.g., predominantly White population from the New England region of the United States).
In summary, we developed 1 L ECG-AI, a model capable of discriminating risk of 2-year incident AF using real-world 1 L ECG tracings obtained prospectively in the context of a large randomized trial of AF screening among primary care patients with no known AF diagnosis. When combined with age and sex, 1 L ECG-AI offered comparable performance to the validated 11-component CHARGE-AF clinical risk score and demonstrated potential to improve AF screening efficiency compared to the simple age threshold of ≥65 years endorsed in current guidelines. Future work is needed to establish the clinical utility of AF screening guided by AI-enabled analysis of mobile 1 L ECG tracings.
Methods
Trial design and analysis sample
The design, conduct, primary outcome results, protocol, and statistical analysis plan of the VITAL-AF trial (ClinicalTrials.gov NCT03515057, registered 2/24/2021; https://clinicaltrials.gov/study/NCT03515057) have been published previously12,39. Briefly, VITAL-AF recruited patients from 16 primary care practices within the MGH practice-based research network. VITAL-AF was a pragmatic cluster randomized trial, in which practices were randomized in a 1:1 ratio to AF screening versus usual care, and assessed a primary outcome of new AF diagnosis at 1 year. The trial enrolled patients between July 31, 2018 and October 8, 2019. Patients were eligible for inclusion if they were aged ≥65 years and attended an outpatient clinic appointment at a participating primary care practice with a primary care physician, nurse practitioner, or physician’s assistant. Given the pragmatic design of VITAL-AF, no further selection criteria were applied, and in particular, individuals were not excluded based on prior AF status. In the current analysis, we focused specifically on participants who had ≥1 1 L ECG screening performed. Given prior observations demonstrating that inclusion of tracings among patients with prevalent AF (including tracings showing AF) can improve performance for incident AF risk estimation21, tracings taken among individuals with prevalent AF were included in model training. However, all model evaluation was performed among individuals without prevalent AF at the time of screening (see below, Fig. 1). Participants provided informed consent to participate, and the research protocol was approved by the Mass General Brigham Institutional Review Board (2017P000562). This study adheres to the STARD40, EHRA AI41, and CONSORT42 reporting guidelines (Supplementary Materials).
1 L ECG acquisition
Eligible and consenting individuals visiting intervention practices were offered AF screening with the AliveCor Kardia (AliveCor, US) 1 L ECG at each encounter at the time of routine vital sign assessment (i.e., prior to meeting with the primary care physician, nurse practitioner, or physician’s assistant). The 1 L ECG was administered by medical assistants who received dedicated training in the use of the Kardia device prior to study start, as well as monthly refreshers. Since multiple tracings were obtained for some individuals, model training opportunistically considered all tracings as distinct examples, but for model evaluation, only the earliest 1 L ECG per individual was used (with the earliest tracing employed to maximize available follow-up).
AF model development and evaluation
The study sample was randomly split into a development set (i.e., training and validation, ~80%) and a test set (~20%) (Table 1 and Fig. 1). In the test set, we evaluated three distinct approaches to AF risk estimation using 1 L ECG. First, we transferred a contemporary version of a previously published ECG-based convolutional neural network model to estimate AF risk (ECG-AI, AUROC 0.82 for 5-year incident AF in Massachusetts General Hospital)21, which in this case was trained using single leads of a standard 12-lead ECG among individuals outside the VITAL-AF study. While the AliveCor device most closely resembles lead I of the 12-lead ECG, we observed models trained from other leads also had discriminative power when transferred (e.g., a lead II only model achieved a c-index of 0.65 using 1 L ECG, as compared to 0.67 for the lead I only model). To leverage this information, we adopted a novel training strategy which uses all leads by selecting a different one at random in every training batch. Over the course of optimization, this single lead model learns from all 12 leads, and this strategy outperformed lead-specific models when generalizing to AliveCor handheld 1 L ECGs. Model architecture was largely unchanged from the published version, and as before, the model is multi-task, outputting not only the survival probability for incident AF (primary task), but also the auxiliary tasks of age regression, sex classification, survival probability for death, and presence of AF on the ECG21. The input of 5000 voltage timepoints is processed through a one-dimensional (1D) convolution in 64 channels, followed by 3 densely connected 1D convolutional blocks. All convolutional kernels have a width of 71, use the Mish activation function and are followed by a max-pooling layer. The convolutional blocks have widths of 64, 48, and 32 channels. The entire architecture consists of 8,653,337 trainable parameters. 
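The random-lead training strategy described above can be sketched as follows. This is a minimal illustration only: it assumes each 12-lead ECG is stored as a mapping from lead name to a list of voltages, and the function name `sample_single_lead_batch` and the data layout are hypothetical rather than taken from the ML4H code base.

```python
import random

LEADS = ["I", "II", "III", "aVR", "aVL", "aVF",
         "V1", "V2", "V3", "V4", "V5", "V6"]


def sample_single_lead_batch(ecgs, rng):
    """Pick one lead at random and return that lead from every ECG in the batch.

    ecgs: list of dicts mapping lead name -> list of voltages (hypothetical layout).
    rng:  random.Random instance, so that the draw is reproducible.
    """
    lead = rng.choice(LEADS)
    batch = [ecg[lead] for ecg in ecgs]
    return lead, batch


# Toy batch: two ECGs with 5 samples per lead (real inputs have 5000 samples).
toy_ecgs = [{name: [float(i)] * 5 for name in LEADS} for i in range(2)]
rng = random.Random(0)
seen = set()
for _ in range(200):  # 200 simulated training batches
    lead, batch = sample_single_lead_batch(toy_ecgs, rng)
    seen.add(lead)
```

Because a new lead is drawn for each batch, a single-lead network is exposed to many different leads over the course of training, which is the property the text relies on.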
Optimization was performed using ADAM stochastic gradient descent, with an initial learning rate of 2e-4. The optimizer and backpropagation are implemented by the TensorFlow (v2.13.1) machine learning framework and the Broad Institute Machine Learning for Health (ML4H) (v0.0.13) model factory (https://github.com/broadinstitute/ml4h). Model convergence was determined by an early stopping criterion of no improvement in validation loss after 32 epochs, with a learning rate decay of 0.5 for every eight epochs without validation loss improvement. The model was trained for 8 h using an Nvidia V100 (Santa Clara, CA) graphics processing unit. This model was termed “1 L ECG-AI.”
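The stopping and decay rules above can be illustrated by replaying a list of validation losses. The sketch below is a simplified reading of those rules (halve the learning rate after every eight consecutive epochs without improvement, stop after 32); the function name `simulate_schedule` is hypothetical, and the exact semantics of the framework callbacks used in the study may differ.

```python
def simulate_schedule(val_losses, lr=2e-4, decay=0.5,
                      decay_patience=8, stop_patience=32):
    """Replay per-epoch validation losses and return (stop_epoch, final_lr, best_loss).

    The learning rate is multiplied by `decay` after every `decay_patience`
    consecutive epochs without a new best loss; training stops once
    `stop_patience` such epochs have accumulated.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale % decay_patience == 0:
                lr *= decay
            if stale >= stop_patience:
                return epoch, lr, best
    return len(val_losses), lr, best


# Toy curve: improves for 10 epochs, then plateaus at a worse value.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 40
stop_epoch, final_lr, best_loss = simulate_schedule(losses)
```

In this toy run the plateau begins at epoch 11, so the learning rate is halved four times (to 1.25e-5) and training stops at epoch 42.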
Second, we evaluated a version of 1 L ECG-AI, which was fine-tuned in the VITAL-AF Derivation Set. Here, we used the same 1 L ECG-AI architecture and ADAM optimizer but with a simplified single-task (survival probability for AF) output. To prevent large changes in model weights early in training in the context of fine-tuning, a reduced initial learning rate (2e-5) was employed.
Third, we trained a de novo CNN in the VITAL-AF Derivation Set. This model architecture mirrored 1 L ECG-AI except that the auxiliary tasks included classification of the automated rhythm interpretation from the AliveCor device, manual adjudication of rhythm by the cardiologist overreaders, readability, sex classification, and age regression.
As described in detail previously21, all models utilized a loss function incorporating survival time and censoring in order to output an estimated longitudinal incidence of AF. All models utilized learning rate decay. Full model architectures are provided in Supplementary Fig. 3.
Each model took as input a 10-s segment of the full 30-s handheld 1 L ECG tracing as a uniformly-shaped input tensor of dimension (5000 ×1). A 10-s window was chosen to match the shape and sampling rate of a standard 12-lead ECG, and to facilitate inference using 1 L ECG-AI, which was trained using 10-s tracings. Linear interpolation resampled the 300 Hz frequency of the AliveCor to the 500 Hz frequency typical of 12-lead ECGs. In training, models were inputted with random 10-s windows sampled from the full 30-s 1 L ECG tracing. Different random samples were used in each training epoch, thereby exposing the models to the full 1 L ECG tracing.
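The preprocessing described above (resampling from 300 Hz to 500 Hz by linear interpolation, then drawing a random 10-s window) can be sketched in a few lines. This is an illustrative pure-Python version under the stated sampling rates; the helper names `resample_linear` and `random_window` are hypothetical and do not correspond to the study code.

```python
import random


def resample_linear(signal, src_hz, dst_hz):
    """Linearly interpolate `signal` from src_hz to dst_hz, preserving duration."""
    n_out = int(round(len(signal) * dst_hz / src_hz))
    out = []
    for i in range(n_out):
        t = i * src_hz / dst_hz  # position measured in source samples
        lo = min(int(t), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)  # hold the last sample at the edge
        frac = t - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


def random_window(signal, width, rng):
    """Return (start index, contiguous window of `width` samples)."""
    start = rng.randrange(len(signal) - width + 1)
    return start, signal[start:start + width]


raw = [float(i) for i in range(30 * 300)]    # 30 s at 300 Hz = 9000 samples
resampled = resample_linear(raw, 300, 500)   # 30 s at 500 Hz = 15000 samples
start, window = random_window(resampled, 10 * 500, random.Random(0))
```

A 30-s tracing therefore yields 15,000 samples after resampling, and each 10-s window contains the 5000 samples expected by the model input.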
Given that the 1 L ECG tracings were 30 s in duration and commonly had noise at the beginning of acquisition, we trained a separate convolutional neural network to detect the contiguous 10-s window with the least noise (i.e., the highest readability as determined by human adjudicators, see Supplementary Figs. 1, 2). Using the same convolutional neural network backbone as the 1 L VITAL model, we trained a binary classification model using the readability label. The model had high discriminative capability with an AUROC of 0.934 and a mean precision of 0.997 in a held-out test set of 3008 1 L ECG traces. The predicted minimum noise segment was then used as the input to each AF prediction model to perform inference. This noise minimization approach resulted in substantial improvement in model performance compared to the use of the first or last 10 s of the tracing for inference (Supplementary Fig. 2).
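The window selection step above amounts to scoring every candidate 10-s window and keeping the best one. The sketch below illustrates that search only. In the study the score came from a trained convolutional readability classifier; here a simple roughness measure (`roughness`) stands in as a hypothetical noise proxy, and the stride is an arbitrary illustrative choice.

```python
def roughness(window):
    """Hypothetical noise proxy: mean absolute sample-to-sample difference."""
    return sum(abs(b - a) for a, b in zip(window, window[1:])) / (len(window) - 1)


def least_noisy_window(signal, width, stride, noise_score=roughness):
    """Return (start, score) of the contiguous window with the lowest noise score."""
    best_start, best_score = 0, float("inf")
    for start in range(0, len(signal) - width + 1, stride):
        score = noise_score(signal[start:start + width])
        if score < best_score:  # strict inequality keeps the earliest best window
            best_start, best_score = start, score
    return best_start, best_score


# Toy tracing: a noisy first third (alternating +1/-1) followed by a flat segment.
noisy = [(-1.0) ** i for i in range(100)]
clean = [0.0] * 200
start, score = least_noisy_window(noisy + clean, width=100, stride=10)
```

In the toy tracing the search skips the noisy onset and returns the first fully clean window, mirroring the observation that noise was common at the beginning of acquisition.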
Saliency mapping
To assess the behavior of 1 L ECG-AI, we created saliency maps, which highlight the sections of the 1 L ECG where the smallest changes in input voltage lead to the greatest changes in predicted AF risk. Saliency is defined as the gradient of the model output with respect to an input 1 L ECG. Efficient computation is possible with the same backpropagation machinery used in model training, except that during training the gradient is of the loss function rather than the model output, and it is taken with respect to the model weights rather than the model input. Both cases rely on the chain rule and the automatic differentiation capabilities of the Python package “TensorFlow”. An exemplar 1 L ECG waveform is overlaid on the ECG saliencies. Saliency was generated using a random sample of 64 tracings from the VITAL-AF sample.
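The definition of saliency as an input gradient can be made concrete with a toy example. The sketch below approximates the gradient by central finite differences instead of automatic differentiation, and uses a small logistic model (`toy_risk`) as a hypothetical stand-in for the network; neither is the method or model used in the study.

```python
import math


def toy_risk(x, w):
    """Hypothetical stand-in for a risk model: logistic function of a weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def saliency(model, x, eps=1e-5):
    """Central-difference estimate of d(model output) / d(each input sample)."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += eps
        dn[i] -= eps
        grads.append((model(up) - model(dn)) / (2 * eps))
    return grads


w = [0.0, 2.0, 0.0, -1.0]   # only samples 1 and 3 influence the output
x = [0.1, 0.2, 0.3, 0.4]    # toy "tracing" of four voltage samples
sal = saliency(lambda v: toy_risk(v, w), x)
```

Samples with zero weight receive zero saliency, while the sign and magnitude of the remaining entries show which input changes would raise or lower the predicted risk.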
Clinical factors and outcomes
The prediction target for each model was incident AF during the 2-year study period. Incident AF was identified in a staged manner as follows: (1) candidate AF events were identified using the presence of ≥1 International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) code for AF or atrial flutter or 12-lead electrocardiogram with AF or atrial flutter diagnosis, then (2) the medical record was manually adjudicated for the presence of AF (prevalent, incident, or absent) by two research nurses with consensus resolution of discrepant adjudications and cardiologist resolution of unresolved discrepancies. Adjudicators were unaware of the AliveCor 1 L ECG result, including any 1 L ECG-based AF risk estimates. Among 301 records that were double-reviewed to assess inter-rater agreement, agreement was high regardless of whether uncertain adjudications (n = 16) were excluded (94.0%), included and counted as agreement (94.4%), or included and counted as disagreement (89.0%). For comparison between AI-based AF risk and clinical risk factors, we calculated the CHARGE-AF score, a validated risk factor-based AF prediction tool, for all individuals17,19,43. Baseline age, sex, race, height, weight, and blood pressure values were obtained from the electronic health record44. Anti-hypertensive use was determined using medication lists43. Tobacco use was categorized as present or absent. Race was classified as white or non-white, as performed previously using CHARGE-AF43,45. The presence of heart failure, diabetes, and myocardial infarction was ascertained using previously validated diagnostic and procedural codes43,46. A small fraction of individuals with missing tobacco use (1%) were considered non-smokers, and mean imputation was applied for trivial missingness in vital signs data (0.1%). Clinical factor definitions are provided in Supplementary Table 7.
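The three inter-rater agreement figures reported above differ only in how the 16 uncertain adjudications are handled. The short check below makes the arithmetic explicit. It assumes 268 concordant adjudications among the 285 records without an uncertain adjudication; this count is inferred from the reported percentages and is not stated in the text.

```python
def agreement_rates(n_total, n_uncertain, n_concordant):
    """Agreement under three ways of handling uncertain adjudications."""
    n_certain = n_total - n_uncertain
    return {
        "excluded": n_concordant / n_certain,
        "as_agreement": (n_concordant + n_uncertain) / n_total,
        "as_disagreement": n_concordant / n_total,
    }


# 301 double-reviewed records, 16 uncertain; 268 concordant is an inferred value.
rates = agreement_rates(n_total=301, n_uncertain=16, n_concordant=268)
pct = {name: round(100 * value, 1) for name, value in rates.items()}
```

Under that assumption the three percentages round to 94.0, 94.4, and 89.0, matching the reported values.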
Statistical analysis
Model discrimination was compared by calculation of inverse probability of censoring-weighted AUROC47. Since AUROC may be insensitive to differences in discrimination among models in the setting of relatively uncommon outcomes such as AF, we additionally calculated time-dependent AP16. We plotted the corresponding ROC and AP curves. The outcome of incident AF was estimated at 2 years (i.e., the maximum available follow-up of the VITAL-AF trial at the time of this analysis). AUROC and AP values were compared using 500-iteration bootstrapping, which was used to calculate 95% confidence intervals and perform pairwise Z-testing.
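The bootstrap comparison can be sketched as below. This is a deliberately simplified illustration: it uses a plain binary-outcome AUROC without inverse probability of censoring weights, a percentile interval for the difference, and a Z statistic formed as the observed difference divided by the bootstrap standard deviation, which is one common construction and not necessarily the one used in the study. All names and the toy data are hypothetical.

```python
import math
import random


def auroc(scores, labels):
    """Probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_compare(a, b, labels, n_boot=500, seed=0):
    """Return (observed AUROC difference, 95% percentile CI, two-sided p value)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample individuals
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # AUROC needs both cases and non-cases
            diffs.append(auroc([a[i] for i in idx], y)
                         - auroc([b[i] for i in idx], y))
    diffs.sort()
    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    if sd == 0:
        raise ValueError("bootstrap differences have zero variance")
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    observed = auroc(a, labels) - auroc(b, labels)
    p_value = math.erfc(abs(observed / sd) / math.sqrt(2))  # two-sided normal test
    return observed, ci, p_value


# Toy data: 15 cases and 45 non-cases scored by two hypothetical models.
rng = random.Random(1)
labels = [1 if i < 15 else 0 for i in range(60)]
model_a = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]
model_b = [rng.gauss(0.5 if y else 0.0, 1.0) for y in labels]
observed, ci, p_value = bootstrap_compare(model_a, model_b, labels)
```

Resampling the same individuals for both models keeps the comparison paired, so the interval reflects the correlation between the two sets of scores.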
Given that 1 L ECG-AI provided comparable or better performance in the Test Set compared to the other models despite being trained completely outside the VITAL-AF sample, we performed a dedicated evaluation of 1 L ECG-AI in the VITAL-AF Full Inference Sample (i.e., all VITAL-AF participants with ≥1 1 L ECG tracing and no prevalent AF, Fig. 1). Since age and sex are readily available demographic factors, we additionally assessed 1 L ECG-AI with the incorporation of age and sex as additional input variables (“1 L ECG-AI AS”). We then compared 1 L ECG-AI and 1 L ECG-AI AS to the CHARGE-AF clinical risk score, and a combination of 1 L ECG-AI and the CHARGE-AF clinical risk score. Combination models were developed using Cox proportional hazards regression with the covariate weights (i.e., age, sex, and ECG-AI for 1 L ECG-AI AS and ECG-AI and CHARGE-AF for the CHARGE-AF + ECG-AI model) obtained within individuals aged ≥65 years in the original ECG-AI development set after excluding individuals included in VITAL-AF. As performed previously, 1 L ECG-AI probabilities were logit-transformed for inclusion in the Cox models21,48. Model discrimination was compared as outlined above. The ability to stratify risk of incident AF using 1 L ECG-AI AS and CHARGE-AF was assessed by plotting the Kaplan–Meier cumulative risk of AF across strata of high risk according to each of the two models, with high risk corresponding to ≥3% 2-year predicted AF risk (approximate top tertile). The ability to stratify more extreme AF risk using 1 L ECG-AI AS was assessed similarly, except using strata defined by the bottom 5% of risk, top 5% of risk, and middle 90% of risk.
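The construction of the combined score and the risk strata can be sketched as follows. The coefficients in `betas` are hypothetical placeholders chosen only for illustration (the study weights are reported in Supplementary Table 8), and the helper names are not taken from the study code.

```python
import math


def logit(p):
    """Log-odds transform applied to the ECG-AI probability before Cox modeling."""
    return math.log(p / (1.0 - p))


def linear_predictor(ecg_ai_prob, age, male, betas):
    """Cox linear predictor combining logit(ECG-AI), age, and sex."""
    return (betas["ecg_ai_logit"] * logit(ecg_ai_prob)
            + betas["age"] * age
            + betas["male"] * male)


def extreme_strata(risks):
    """Label each risk as bottom 5%, middle 90%, or top 5% by rank."""
    n = len(risks)
    order = sorted(range(n), key=lambda i: risks[i])
    k = max(1, round(0.05 * n))
    labels = ["middle 90%"] * n
    for i in order[:k]:
        labels[i] = "bottom 5%"
    for i in order[-k:]:
        labels[i] = "top 5%"
    return labels


# Hypothetical coefficients, for illustration only.
betas = {"ecg_ai_logit": 1.0, "age": 0.05, "male": 0.3}
lp = linear_predictor(ecg_ai_prob=0.5, age=70, male=1, betas=betas)

toy_risks = [i / 1000 for i in range(100)]            # 0.000 ... 0.099
high_risk = [r >= 0.03 for r in toy_risks]            # the >=3% threshold
strata = extreme_strata(toy_risks)
```

The logit transform maps probabilities onto an unbounded scale, which is why a probability of 0.5 contributes zero to the linear predictor in this example.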
We assessed calibration using: (1) adaptive hazard regression49 curves of predicted versus observed AF risk, and (2) integrated calibration index (ICI), the average prediction error weighted by the empirical risk distribution49. Since the AI models were trained on individuals with prevalent AF but evaluated only on individuals without prevalent AF21, recalibration to the baseline hazard of the VITAL-AF Full Inference Sample was performed prior to assessing model calibration50. For this analysis, each model score was converted to a predicted probability of AF using the equation: \(1-{s}_{0}^{\exp \left(\sum \beta X-\sum \beta Y\right)}\), where \({s}_{0}\) is the average AF-free survival probability at 2 years in VITAL-AF, \(\sum \beta X\) is the individual’s score value, and \(\sum \beta Y\) is the average score in VITAL-AF. Model weights and parameters are given in Supplementary Table 8.
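The recalibration equation above translates directly into code. In the sketch below the function name `recalibrated_risk` is hypothetical, and the values of the baseline survival (0.969) and mean score (1.2) are illustrative placeholders, not the study parameters.

```python
import math


def recalibrated_risk(score, mean_score, s0):
    """Predicted 2-year AF risk: 1 - s0 ** exp(score - mean_score).

    score:      individual's linear predictor (sum of beta * X)
    mean_score: average linear predictor in the sample (sum of beta * Y)
    s0:         average AF-free survival probability at 2 years
    """
    return 1.0 - s0 ** math.exp(score - mean_score)


# Illustrative values only.
risk_at_mean = recalibrated_risk(score=1.2, mean_score=1.2, s0=0.969)
risk_above_mean = recalibrated_risk(score=2.2, mean_score=1.2, s0=0.969)
```

An individual whose score equals the sample average is assigned a risk of exactly 1 − s0, and risk increases monotonically with the score.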
The potential effect of implementing 1 L ECG-AI AS to select screening candidates, as opposed to the guideline-based age threshold of ≥65 years (i.e., all VITAL-AF participants), was assessed by calculating 2-year time-dependent reclassification indices51. We also assessed additional guideline-based criteria for selecting screening candidates10 (i.e., (a) age ≥65 years with ≥1 additional stroke risk factor defined using the CHA2DS2-VASc score24,25, and (b) age ≥75 years). For these analyses, individuals with predicted 1 L ECG-AI AS risk ≥3% were considered high risk. Given that optimal risk thresholds for AF screening remain unclear, we additionally performed decision curve analyses52,53, in which we compared the expected net benefit of screening using 1 L ECG-AI AS across a range of plausible thresholds used to define elevated AF risk (versus no screening or screening all individuals). We additionally quantified the number of AF screenings which may be avoided while maintaining a constant net benefit.
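The quantities underlying a decision curve can be illustrated with the standard net benefit formula, NB = TP/n − (FP/n) × pt/(1 − pt), where pt is the risk threshold. The sketch below is a simplification that treats the outcome as binary and ignores censoring, whereas the analysis described above was time-dependent; the toy data and function names are hypothetical.

```python
def net_benefit(risks, outcomes, pt):
    """Net benefit of screening only individuals with predicted risk >= pt."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= pt and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= pt and y == 0)
    return tp / n - fp / n * pt / (1.0 - pt)


def net_benefit_screen_all(outcomes, pt):
    """Net benefit of screening everyone, evaluated at the same threshold."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * pt / (1.0 - pt)


def screenings_avoided_per_100(nb_model, nb_all, pt):
    """Net reduction in screenings per 100 people at equal net benefit."""
    return (nb_model - nb_all) / (pt / (1.0 - pt)) * 100.0


# Toy cohort of five individuals, two of whom develop AF.
risks = [0.01, 0.02, 0.04, 0.05, 0.10]
outcomes = [0, 0, 0, 1, 1]
pt = 0.03
nb_model = net_benefit(risks, outcomes, pt)
nb_all = net_benefit_screen_all(outcomes, pt)
avoided = screenings_avoided_per_100(nb_model, nb_all, pt)
```

In this toy cohort the model captures both cases while sparing two of the three non-cases, which corresponds to 40 screenings avoided per 100 individuals relative to screening everyone.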
In secondary analyses, we assessed model discrimination for incident AF at 6 months and 1 year. We assessed model performance across subgroups of age (i.e., age 65–69, 70–79, and ≥80 to approximate tertiles of the age distribution), sex, and the presence of a class I indication for oral anticoagulation based on stroke risk as defined using the CHA2DS2-VASc score (i.e., ≥2 for men and ≥3 for women). We compared the time to AF diagnosis among incident AF cases with 1 L ECG-AI AS risk ≥3% (“true positives”) versus 1 L ECG-AI AS risk <3% (“false negatives”). To classify AF type (paroxysmal versus persistent), we inspected reports of the subset of AF cases with Holter, patch, or event monitoring available within 6 months of incident AF diagnosis. We assessed model performance for paroxysmal and persistent AF, respectively, excluding individuals with incident AF of the other type, or with an unclassifiable type (i.e., no monitoring data). To assess relations between 1 L ECG-AI AS performance and common AF-related biomarkers, we assessed the correlation between 1 L ECG-AI AS and (a) NTproBNP and (b) left atrial anteroposterior size on echocardiography, among individuals with an available measurement taken within 7 days before or following an incident AF diagnosis. We additionally assessed the sensitivity of 1 L ECG-AI AS at the ≥3% risk threshold across strata of NTproBNP (i.e., above the age-adjusted reference range) and left atrial diameter (>40 mm). To assess the relative information loss using 1 L ECG versus standard 12-lead ECG, we compared AF discrimination using a contemporary version of a previously validated 12-lead ECG-AI AF risk estimation algorithm21 among the subset of individuals not included in the training set of the 12-lead model and with an available 12-lead ECG performed within 3 years of the baseline visit. 
To assess the behavior of 1 L ECG-AI across tracings, we fit a linear mixed model on intra- versus inter-individual 1 L ECG-AI inferences, and assessed the within-tracing correlation across varying 10-s windows. To quantify whether 1 L ECG-AI AS risk may associate with subsequent rhythm monitoring, we quantified the person-time rates of (i) 12-lead ECGs, and (ii) Holter, event, or patch monitors performed during the study period and prior to any incident AF diagnosis. We considered two-sided p values <0.05 statistically significant. Analyses were performed using Python v3.854 and R v4.055.
Data availability
VITAL-AF trial data contain protected health information and cannot be shared publicly.
Code availability
The ECG-AI model serving as the foundation for the models evaluated in the current analysis is available at https://github.com/broadinstitute/ml4h/tree/master/model_zoo/ECG2AF. Scripts underlying the statistical analysis are available at https://github.com/shaankhurshid/1l_ecg_ai.git.
References
Wolf, P. A., Abbott, R. D. & Kannel, W. B. Atrial fibrillation: a major contributor to stroke in the elderly. The Framingham study. Arch. Intern. Med. 147, 1561–1564 (1987).
Corley, S. D. et al. Relationships between sinus rhythm, treatment, and survival in the Atrial Fibrillation Follow-Up Investigation of Rhythm Management (AFFIRM) Study. Circulation 109, 1509–1513 (2004).
Carlisle, M. A., Fudim, M., DeVore, A. D. & Piccini, J. P. Heart failure and atrial fibrillation, like fire and fury. JACC Heart Fail 7, 447–456 (2019).
Diener, H.-C., Hart, R. G., Koudstaal, P. J., Lane, D. A. & Lip, G. Y. H. Atrial fibrillation and cognitive function: JACC review topic of the week. J. Am. Coll. Cardiol. 73, 612–619 (2019).
Middeldorp, M. E. et al. PREVEntion and regReSsive Effect of weight-loss and risk factor modification on atrial fibrillation: the REVERSE-AF study. Europace 20, 1929–1935 (2018).
Voskoboinik, A. et al. Alcohol abstinence in drinkers with atrial fibrillation. N. Engl. J. Med. 382, 20–28 (2020).
Lip, G. Y. H., Nieuwlaat, R., Pisters, R., Lane, D. A. & Crijns, H. J. G. M. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest 137, 263–272 (2010).
Ruff, C. T. et al. Comparison of the efficacy and safety of new oral anticoagulants with warfarin in patients with atrial fibrillation: a meta-analysis of randomised trials. Lancet 383, 955–962 (2014).
Stroke Prevention in Atrial Fibrillation Study. Final results. Circulation 84, 527–539 (1991).
Hindricks, G. et al. 2020 ESC Guidelines for the diagnosis and management of atrial fibrillation developed in collaboration with the European Association of Cardio-Thoracic Surgery (EACTS). Eur. Heart J. https://doi.org/10.1093/eurheartj/ehaa612 (2020).
NHFA CSANZ Atrial Fibrillation Guideline Working Group et al. National heart foundation of Australia and the Cardiac Society of Australia and New Zealand: Australian Clinical Guidelines for the diagnosis and management of atrial fibrillation 2018. Heart Lung Circ. 27, 1209–1266 (2018).
Lubitz, S. A. et al. Screening for atrial fibrillation in older adults at primary care visits: the VITAL-AF randomized controlled trial. Circulation https://doi.org/10.1161/CIRCULATIONAHA.121.057014 (2022).
Uittenbogaart, S. B. et al. Detecting and diagnosing atrial fibrillation (D2AF): study protocol for a cluster randomised controlled trial. Trials 16, 478 (2015).
Svendsen, J. H. et al. Implantable loop recorder detection of atrial fibrillation to prevent stroke (The LOOP Study): a randomised controlled trial. Lancet 398, 1507–1516 (2021).
Ashburner, J. M., Khurshid, S., Atlas, S. J., Singer, D. E. & Lubitz, S. A. Point-of-care screening for atrial fibrillation: where are we, and where do we go next?. Cardiovasc. Digit Health J. 2, 294–297 (2021).
Khurshid, S., Healey, J. S., McIntyre, W. F. & Lubitz, S. A. Population-based screening for atrial fibrillation. Circ. Res. 127, 143–154 (2020).
Alonso, A. et al. Simple risk model predicts incidence of atrial fibrillation in a racially and geographically diverse population: the CHARGE-AF consortium. J. Am. Heart Assoc. 2, e000102 (2013).
Khurshid, S. et al. Performance of atrial fibrillation risk prediction models in over 4 million individuals. Circ. Arrhythm. Electrophysiol. 14, e008997 (2021).
Christophersen, I. E. et al. A comparison of the CHARGE-AF and the CHA2DS2-VASc risk scores for prediction of atrial fibrillation in the Framingham Heart Study. Am. Heart J. 178, 45–54 (2016).
Khurshid, S. Clinical perspectives on the adoption of the artificial intelligence-enabled electrocardiogram. J. Electrocardiol. 81, 142–145 (2023).
Khurshid, S. et al. Electrocardiogram-based deep learning and clinical risk factors to predict atrial fibrillation. Circulation https://doi.org/10.1161/CIRCULATIONAHA.121.057480 (2021).
Yuan, N. et al. Deep learning of electrocardiograms in sinus rhythm from US veterans to predict atrial fibrillation. JAMA Cardiol. 8, 1131–1139 (2023).
Raghunath, S. et al. Deep neural networks can predict new-onset atrial fibrillation from the 12-lead electrocardiogram and help identify those at risk of AF-related stroke. Circulation https://doi.org/10.1161/CIRCULATIONAHA.120.047829 (2021).
Joglar, J. A. et al. 2023 ACC/AHA/ACCP/HRS guideline for the diagnosis and management of atrial fibrillation: a report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation 149, e1–e156 (2024).
Engdahl, J., Andersson, L., Mirskaya, M. & Rosenqvist, M. Stepwise screening of atrial fibrillation in a 75-year-old population: implications for stroke prevention. Circulation 127, 930–937 (2013).
Gadaleta, M. et al. Prediction of atrial fibrillation from at-home single-lead ECG signals without arrhythmias. npj Digit. Med. 6, 229 (2023).
Raghunath, A. et al. Artificial intelligence–enabled mobile electrocardiograms for event prediction in paroxysmal atrial fibrillation. Cardiovasc. Digit. Health J. 4, 21–28 (2023).
Khunte, A. et al. Detection of left ventricular systolic dysfunction from single-lead electrocardiography adapted for portable and wearable devices. npj Digit. Med. 6, 124 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023).
Li, Y., Wehbe, R. M., Ahmad, F. S., Wang, H. & Luo, Y. A comparative study of pretrained language models for long clinical text. J. Am. Med Inf. Assoc. 30, 340–347 (2023).
Diamant, N. et al. Patient contrastive learning: A performant, expressive, and practical approach to electrocardiogram modeling. PLoS Comput. Biol. 18, e1009862 (2022).
Khurshid, S. et al. Deep learned representations of the resting 12-lead electrocardiogram to predict at peak exercise. Eur. J. Prev. Cardiol. 31, 252–262 (2024).
Attia, Z. I. et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet 394, 861–867 (2019).
Khurshid, S. & Singh, J. P. Keep your fingers on the PULsE: artificial intelligence to guide atrial fibrillation screening. Eur. Heart J. Digit Health 3, 205–207 (2022).
Ashburner, J. M. et al. Impact of a clinical atrial fibrillation risk estimation tool on cardiac rhythm monitor utilization following acute ischemic stroke: a prepost clinical trial. Am. Heart J. 284, 57–66 (2025).
Steinhubl, S. R. et al. Effect of a home-based wearable continuous ECG monitoring patch on detection of undiagnosed atrial fibrillation: the mSToPS randomized clinical trial. JAMA 320, 146–155 (2018).
Pathak, R. K. et al. Long-term effect of goal-directed weight management in an atrial fibrillation cohort: a long-term follow-up study (LEGACY). J. Am. Coll. Cardiol. 65, 2159–2169 (2015).
Khurshid, S. et al. Performance of single-lead handheld electrocardiograms for atrial fibrillation screening in primary care: the VITAL-AF trial. JACC Adv. 2, 100616 (2023).
Ashburner, J. M. et al. Design and rationale of a pragmatic trial integrating routine screening for atrial fibrillation at primary care visits: the VITAL-AF trial. Am. Heart J. 215, 147–156 (2019).
Bossuyt, P. M. et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 351, h5527 (2015).
Svennberg, E. et al. State of the art of artificial intelligence in clinical electrophysiology in 2025: a scientific statement of the European Heart Rhythm Association (EHRA) of the ESC, the Heart Rhythm Society (HRS), and the ESC Working Group on E-Cardiology. Europace 27, euaf071 (2025).
Hopewell, S. et al. CONSORT 2025 statement: updated guideline for reporting randomised trials. BMJ 389, e081123 (2025).
Hulme, O. L. et al. Development and validation of a prediction model for atrial fibrillation using electronic health records. JACC Clin. Electrophysiol. 5, 1331–1341 (2019).
Khurshid, S. et al. Cohort design and natural language processing to reduce bias in electronic health records research. npj Digit. Med. 5, 47 (2022).
Alonso, A. et al. Prediction of atrial fibrillation in a racially diverse cohort: the multi-ethnic study of atherosclerosis (MESA). J Am Heart Assoc. 5, e003077 (2016).
Wang, E. Y. et al. Initial precipitants and recurrence of atrial fibrillation. Circ. Arrhythm. Electrophysiol. 13, e007716 (2020).
Uno, H., Tian, L., Cai, T., Kohane, I. S. & Wei, L. J. A unified inference procedure for a class of measures to assess improvement in risk prediction systems with survival data. Stat. Med. 32, 2430–2442 (2013).
Christopoulos, G. et al. Artificial intelligence-electrocardiography to predict incident atrial fibrillation: a population-based study. Circ. Arrhythm. Electrophysiol. 13, e009355 (2020).
Austin, P. C., Harrell, F. E. & Klaveren, D. Graphical calibration curves and the integrated calibration index (ICI) for survival models. Stat. Med. 39, 2714–2742 (2020).
Demler, O. V., Paynter, N. P. & Cook, N. R. Tests of calibration and goodness-of-fit in the survival setting. Stat. Med. 34, 1659–1680 (2015).
Pencina, M. J., D’Agostino, R. B. & Steyerberg, E. W. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat. Med. 30, 11–21 (2011).
Vickers, A. J. & Elkin, E. B. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26, 565–574 (2006).
Pencina, M. J. et al. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat. Med. 27, 157–172 (2008).
Python Core Team. Python: a dynamic, open source programming language. Python Software Foundation. https://www.python.org/ (2015).
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing Vienna, Austria. https://www.R-project.org/ (2015).
Acknowledgements
This investigator-initiated study was funded by the Bristol Myers Squibb–Pfizer Alliance. This work was also supported by NIH grants R01HL092577, R01HL157635 (Ellinor); K23HL169839 (Khurshid); T32HL007208 (Al-Alusi); 3OT2OD035404-01S3, 1R01NS134597, 1UG3HG014379-01 (Maddah); American Heart Association (Dallas, Texas) 18SFRN34110082, 961045 (Ellinor, Maddah); 23CDA1050571 (Khurshid); and from the European Union MAESTRIA 965286 (Ellinor). Dr. Lubitz previously received support from NIH grants R01HL139731 and R01HL157635, and American Heart Association 18SFRN34250007. Dr. Kany received the Walter Benjamin Fellowship from the Deutsche Forschungsgemeinschaft (521832260).
Author information
Contributions
S. Khurshid and S.F.F. contributed equally and are co-first authors. S. Khurshid and S.F.F. conceived of the study. S. Khurshid, S.F.F. and T.S. contributed to study design, modeling, and statistical analysis. S. Khurshid and S.F.F. drafted the manuscript. M.A.A.-A., S. Kany, T.S., C.D.A., J.E.H., D.D.M., L.H.B., J.M.A., S.A.L., S.J.A., M.M., D.E.S. and P.T.E. performed critical reviews. All authors discussed the results, contributed to the final work, and have provided final approval of the completed version.
Ethics declarations
Competing interests
Dr. Lubitz is employed at Novartis Institutes for Biomedical Research and has received research support from Bristol Myers Squibb/Pfizer, Boehringer Ingelheim, Fitbit, Medtronic, Premier, and IBM, and has consulted for Bristol Myers Squibb/Pfizer, Blackstone Life Sciences, and Invitae. Dr. Ellinor receives sponsored research support from Bayer AG, IBM Research, Bristol Myers Squibb, Pfizer and Novo Nordisk; he has also served on advisory boards or consulted for Bayer AG. Dr. Ho has received sponsored research support from Bayer AG and research supplies from EcoNugenics, Inc. Dr. Singer has received research support from the Eliot B. and Edith C. Shoolman Fund of Massachusetts General Hospital and Bristol Myers Squibb, and has consulted for Bristol Myers Squibb, Fitbit (Google), Medtronic, and Pfizer. Dr. Atlas has received sponsored research support from Bristol Myers Squibb/Pfizer and American Heart Association (18SFRN34250007) and has consulted for Boehringer Ingelheim, Bristol Myers Squibb, Pfizer, Premier and Fitbit (Google). Dr. Khurshid receives sponsored research support from Bayer AG. The remaining authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Khurshid, S., Friedman, S.F., Al-Alusi, M.A. et al. Artificial intelligence-enabled analysis of handheld single-lead electrocardiograms to predict incident atrial fibrillation: an analysis of the VITAL-AF randomized trial. npj Digit. Med. 8, 776 (2025). https://doi.org/10.1038/s41746-025-02164-2