Main

In recent years, rapid developments in ambulatory electrocardiogram (ECG) technology have enabled markedly increased use of ambulatory ECG monitoring. At the same time, the importance of detecting brief, infrequent arrhythmias, particularly atrial fibrillation (AF), has been recognized1,2. Longer ECG recording duration and higher recording frequency lead to higher detection rates of arrhythmia3,4,5,6, and extended ECG monitoring is recommended for patients with syncope7,8 and for individuals in whom screening for AF to prevent new-onset or recurrent stroke could be beneficial9. The number of patients who may benefit from rhythm monitoring is also growing, particularly with evidence that patients with short-duration subclinical AF10 may benefit from anticoagulation1,11. With the increasing availability of lower-cost devices, longer-term monitoring capabilities and the emergence of direct-to-consumer devices that provide irregular pulse notifications and record single-lead ECG intermittently, a deluge of heart rhythm monitoring data now requires analysis12,13. Given the worldwide shortages of healthcare workers14, this increased workload may overburden human ECG technician resources, possibly reducing the quality of heart rhythm annotations15,16,17,18 and leading to misdiagnosis, delayed treatment and adverse patient outcomes.

While it has been widely predicted that artificial intelligence (AI) will replace humans in some areas19, the nearest examples in healthcare are in mammography, where AI can replace a second physician reader20,21,22,23, and in pathology, where AI tools improve pathologist accuracy and efficiency24,25. Implementation of an AI model that uses ECGs to alert physicians to high-risk hospitalized patients was recently shown to reduce mortality26, and several machine learning-based models that use ECG data to predict arrhythmia have been developed27,28. AI holds considerable promise for arrhythmia diagnostics because it can rapidly analyze large amounts of data at low cost, provide consistent annotations without risk of mental fatigue and deliver results in near real time29. Previous studies indicate that AI algorithms can be trained to detect and accurately classify arrhythmias on resting and ambulatory ECG recordings30,31, but no study has evaluated the role of AI in performing the scanning and technical annotation of ambulatory ECGs and providing results that can then be forwarded to physicians for review. Because AI-only reporting would mean that large amounts of ECG data would never be seen by a healthcare professional, such an AI model would need to have an excellent negative predictive value for critical arrhythmias without generating unacceptable rates of false-positive annotations requiring physician review.

We designed the DeepRhythmAI for autonoMous Analysis of RhyThm INvestigatIon (DRAI MARTINI) study to test the DeepRhythmAI model for direct-to-physician reporting of ambulatory ECG data. The aim was to report the performance of the DeepRhythmAI model compared to technician analysis of ambulatory ECG data, including absolute rates of false-negative and false-positive detections for both the AI model and ECG technicians.

Results

The study population consisted of 14,606 patients (mean age = 65.5 ± 10 years, 42.8% males), who were monitored for a mean of 14 ± 10 days (Extended Data Fig. 1). Monitoring indications were provided through the device for 14,596 patients and are reported in Extended Data Table 1. The most common monitoring indications were palpitations, syncope, dizziness and examination for AF.

Critical arrhythmias

For the primary endpoint of false-negative findings of critical arrhythmia (all instances of the arrhythmia missed for the full recording), the AI model had superior sensitivity (98.6% (95% confidence interval (CI) = 97.7–99.4%) versus 80.3% (95% CI = 77.3–83.3%); Table 1). This category includes ≥30 s of AF, ≥30 s of supraventricular tachycardia (SVT), sinus arrest/asystole events lasting ≥3.5 s, third-degree atrioventricular (AV) block of any duration and ≥10 s of ventricular tachycardia (VT) at ≥120 beats per minute. The AI model analysis had 3.2 false negatives per 1,000 patients, compared to 44.3 per 1,000 for technicians (Fig. 1), resulting in a relative risk (RR) of a false-negative finding of critical arrhythmia of 14.1 (95% CI = 10.4–19.0) for technician analysis compared to DeepRhythmAI model analysis. Extended Data Table 2 reports these results for individual arrhythmias. The lower false-negative rate with the AI model was observed in both males and females (Extended Data Fig. 2). In a sensitivity analysis in which misclassifications between critical arrhythmias were not considered AI or technician false negatives, the results were largely unchanged: 2.3 false-negative findings per 1,000 patients for the AI model and 39.4 per 1,000 patients for technicians (RR = 16.9 (95% CI = 12.0–23.9); Extended Data Fig. 3). The RR for false-negative findings over the full recording increased with increasing monitoring duration (RR = 7.8 (95% CI = 3.1–19.8) for 1–2 days of monitoring, RR = 9.1 (95% CI = 3.9–21.1) for 3–7 days of monitoring and RR = 17.9 (95% CI = 11.9–26.9) for ≥8 days of monitoring). Overall, the negative predictive value for critical arrhythmias was 99.9% (95% CI = 99.9–100%) for the AI model compared to 99.1% (95% CI = 98.9–99.2%) for technicians, and the AI model had superior negative predictive values for all individual critical arrhythmia classes (Table 1). The AI model's detection rates of true-positive VTs, SVTs, asystoles and third-degree AV blocks were substantially higher than those of the technicians, and the AI model detected numerically more AF events (Fig. 2). Episode durations for false-negative events are reported in Extended Data Table 3.

Table 1 Performance of DeepRhythmAI and ECG technicians compared to the consensus panel of cardiologists for critical arrhythmias
Fig. 1: False-negative critical arrhythmias per 1,000 patients by AI and technician analysis.
figure 1

Error bars represent 95% CIs derived using bootstrapping. AVB, AV block.

Fig. 2: True-positive critical arrhythmias per 1,000 patients by AI and technician analysis.
figure 2

Error bars represent 95% CIs derived using bootstrapping.

DeepRhythmAI model analysis resulted in more false-positive findings of asystole, third-degree AV block and ≥10 s VT (Fig. 3). In sensitivity analyses in which misclassifications between critical arrhythmias were not considered false positives, the total false-positive event rate over the full recordings was 6.3% for the AI model and 2.3% for technicians (Extended Data Fig. 4), corresponding to 12 (interquartile range (IQR) = 6–74) false-positive events per 1,000 patient days of recording for the AI model and 5 (IQR = 2–153) per 1,000 patient days of recording for technicians. Panel classifications of patients for whom strips were extracted are reported in Fig. 4. The durations of false-positive detections by the AI model and technicians are reported in Extended Data Table 3.

Fig. 3: False-positive critical arrhythmias per 1,000 patients by AI and technician analysis.
figure 3

Error bars represent 95% CIs derived using bootstrapping.

Fig. 4: Diagnoses of patients with critical arrhythmias by DeepRhythmAI and ECG technicians.
figure 4

Sankey diagram showing arrhythmic event durations for critical arrhythmias as detected by each of the two methods. Cardiologist panel annotations are used to classify DeepRhythmAI and ECG technician annotations into TP, FP or FN. For FP and FN detections, we also report whether these were annotated by the cardiologist panels as another critical arrhythmia class or as a noncritical arrhythmia/noise or NSR. TP, true positives; FP, false positives; FN, false negatives; NSR, normal sinus rhythm.

Full confusion matrix statistics for individual critical arrhythmias for both the AI model and technicians compared to panel annotations are reported in Table 1. DeepRhythmAI model analysis was superior in terms of sensitivity but had lower specificity for ≥10 s VT, asystole and third-degree AV block. The AI model analysis had a positive predictive value similar to that of technicians for AF and sustained SVT but lower positive predictive values for sustained VT, third-degree AV block and asystole. The overall F1 score, which is the harmonic mean of positive predictive value and sensitivity, was similar for the AI model and technicians. However, the F1 score for the AI model was superior for sustained SVT, and the F1 score for technicians was better for VT.
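For readers unfamiliar with the metric, the F1 score combines positive predictive value and sensitivity as their harmonic mean; a minimal Python sketch with illustrative values only (not taken from Table 1):

```python
def f1_score(ppv: float, sensitivity: float) -> float:
    """F1 score: harmonic mean of positive predictive value and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)

# Illustrative values only, not study results.
print(round(f1_score(0.90, 0.95), 3))  # 0.924
```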

Noncritical arrhythmias

Noncritical arrhythmias included premature atrial complexes and premature ventricular complexes, second-degree AV block, pauses of 2.0–3.5 s, VT episodes <10 s, idioventricular/accelerated idioventricular rhythms, SVT episodes <30 s and ectopic atrial rhythm. Results for these rhythm classes are reported in Table 2. The AI model had superior sensitivity for all noncritical arrhythmias and a superior F1 score for pauses and idioventricular/accelerated idioventricular rhythms but lower specificity for all noncritical arrhythmias except SVT episodes <30 s and ectopic atrial rhythm.

Table 2 Performance of DeepRhythmAI and ECG technicians compared to the consensus panel of cardiologists for noncritical arrhythmias

Discussion

This large, carefully adjudicated analysis demonstrates that the DeepRhythmAI model could safely replace technician interpretation of ambulatory ECG recordings, with excellent sensitivity for critical arrhythmias and a modest increase in false-positive detections. The DeepRhythmAI model had a negative predictive value for critical arrhythmias that exceeded 99.9% and, compared to technicians, resulted in 17 times fewer patients with a missed diagnosis of a critical arrhythmia. This came at the cost of 2.4 times more false-positive detections, which for critical arrhythmias occurred once every 6 recordings for the AI model and once every 14 recordings for technicians. Considering that the DeepRhythmAI model's performance exceeds the benchmarks of 99% negative predictive value and 70% positive predictive value that guidelines have recommended for accepting a single high-sensitivity troponin to rule out major adverse cardiovascular events32,33,34,35,36, we consider DeepRhythmAI model-only analysis to be safe for the analysis of ambulatory ECG data.

The current study differs fundamentally from previous studies of AI for arrhythmia classification in that we evaluated the use of AI as the only reader for the majority of the health data, with physician confirmation only of AI model-selected episodes. This approach may be necessary to manage the rising volume of ECG data that will need to be accurately adjudicated without missing critical events. The sample size in terms of annotated strips in this study is 6–16 times larger than in previous studies30,31, and the patient-population negative predictive value and absolute false-positive and false-negative rates for AI-only analysis have not been reported previously. These data are necessary to determine whether an AI model can safely be used for direct-to-physician reporting and have not been shown in previous studies evaluating AI for arrhythmia diagnostics. Direct-to-physician reporting of ambulatory ECG results could unburden strained healthcare systems and appropriately expand access, resulting in more equitable access to testing and subsequent care. We used a large, unselected clinical patient population to estimate how the use of DeepRhythmAI model analysis instead of ECG technician analysis would affect accurate detection and false-positive rates, using the beat-to-beat classification of a large and representative sample of arrhythmic events. Owing to our sampling strategy, the measures of sensitivity that we report are not directly comparable to the sensitivity reported for selected rhythm strips in previous studies. We report as false negatives only patients in whom a diagnosis was missed for the full duration of the recording (that is, 14 ± 10 days of monitoring), arguably a more relevant evaluation metric. With this in mind, the AI model we evaluated had better sensitivity for all critical arrhythmias that were evaluated both in this study and in previous studies: a study assessing a deep neural network architecture for rhythm classification of single-lead ECGs31, a study evaluating a convolutional neural network for rhythm classification of 12-lead ECGs30 and a study comparing a deep neural network with physician over-reading of the full ECG to an electrophysiologist review of a traditional Holter system37. While the technician sensitivity in this study is low, this finding is in line with previous studies that show low average accuracy in ECG interpretation for technicians38.

The large difference in false-negative findings between DeepRhythmAI model and technician analysis could depend on factors related to the algorithms and on causes of human error. The higher rate of technician false negatives is likely due in part to limitations of feature-based algorithms compared to AI models, but because technician work also includes scanning the ECG manually and assessing heart rate trends, there could also be effects of time pressure, information overload15,17 and other factors related to limits in human perception and memory16,18, which do not affect AI models. Thus, as the volume of data requiring analysis increases, the AI model increasingly outperforms technician interpretation, giving consistent annotations that are not subject to fatigue. Rhythm analysis by technicians depends on correctly identifying and retaining in memory a large number of visual features; for example, a single capture beat in a wide complex tachycardia is pathognomonic for VT, but human working memory has a fixed upper limit, and high information loads, such as in the analysis of ambulatory ECG recordings, can lead to reduced accuracy and decision quality15,16.

Some limitations in the study design should be considered. First, the technicians, but not the physician panels or the AI model, had access to clinical information such as monitoring indication, age and sex, which may have introduced a bias in favor of the technicians. At the same time, while the technicians performed their analysis during paid clinical work hours, the cardiologist panels performed their analysis as part of a research protocol, and therefore the panel annotations do not exactly represent a clinical workflow. Panel cardiologists may have been either more or less careful than they would have been with clinical patients, which could have introduced misclassification bias. We did not differentiate between second-degree AV block types 1 and 2, and we do not report subgroups by monitoring indication. Because monitoring indications were entered through the device, the absence of a reported indication should not be interpreted as a lack of that indication. The false-negative events in the study were patients in whom all episodes of arrhythmia were missed for the entire recording duration by one method but at least one episode was detected by the other. While we consider this a robust method for false-negative estimation, it is possible that additional arrhythmic events went undetected by both the AI model and technicians. If any arrhythmias were missed by both methods, this would imply a lower sensitivity and negative predictive value for both technicians and the AI model but would not affect the results showing superior sensitivity and negative predictive value for the AI model compared to technicians. It is also important to point out that, while the technicians were aided by a Food and Drug Administration-approved algorithm and also performed a manual review and reannotation of the data, the use of a different algorithm may have yielded different results. The underlying ECG data were recorded by a device providing leads II and III; however, the use of devices with nonstandard lead configurations and single-lead recording is becoming more prevalent. The results cannot be generalized to other AI algorithms, and the DeepRhythmAI model may perform differently on other signals, although, in view of the accuracy that the DeepRhythmAI model demonstrated in this study, the model could be tested on other ECG recording signals in the future. Finally, while we used an unselected patient population and extracted a large representative sample of relevant arrhythmic episodes for evaluation, some evaluation metrics that we report, such as the negative predictive value, depend on the population prevalence of arrhythmia, which may differ between populations and may change over time.

Direct-to-physician reporting of ambulatory ECG recordings (leads II and III) using the DeepRhythmAI model would result in 17 times fewer missed diagnoses of critical arrhythmias than usual care with technician annotation, with a negative predictive value exceeding 99.9%. This would come at the cost of seven extra false-positive findings per 1,000 patient days of recording. AI analysis may substantially reduce labor costs and could potentially report results in near real time.

Methods

Data source

The source population for this study is an unselected patient population of 14,606 individuals, consisting of a random sample of patients who had been monitored in the United States for clinical indications between 2016 and 2019. Recording durations varied from 1 to 31 days. The dataset consisted of 211,010 days of ambulatory monitoring collected from these patients using PocketECG (Medicalgorithmics). PocketECG is a full-disclosure ECG device with a limb lead configuration (leads II and III) and a sampling rate of 300 samples per second. The device can record and transmit ECG signals for up to 31 days. The patients were referred by 1,079 different physicians from 166 clinics, and the recordings were analyzed in clinical practice at an independent diagnostic testing facility by one of 167 certified ECG technicians working with a feature-based algorithm using adaptive beat morphology template generation and comparison, so that each QRS complex in the recording was annotated beat-to-beat by the ECG technician. ECG technician work was extensive and included a review of the whole ECG recording and verification of all events detected by the algorithm, including pauses and asystoles, all bradycardia events, all missed heartbeats or second- and third-degree AV blocks, all ventricular and supraventricular arrhythmias and all episodes detected as AF. In this process, artifacts and electrode dysfunction were reannotated. The technicians also inspected all regions of the recording marked with a ‘patient-triggered symptom’ flag and reviewed the recording at the times of the fastest, slowest and average minute-by-minute heart rate. They were aided in this process by software that allowed them to manually inspect heart rate trends for irregularities, filter beats by heart rate and group beats into morphologies. At the end of the review, episodes were selected for inclusion in a report to physicians.

Before inclusion in the study, all data were anonymized, and the Ethics Review Board of Sweden therefore waived the need for approval (decision 2019-03227) and did not consider informed consent necessary.

DeepRhythmAI

The DeepRhythmAI model (v3.1; Medicalgorithmics) is a proprietary mixed network ensemble for rhythm classification. The network performs QRS and noise detection, beat classification and rhythm identification using several algorithms based on convolutional neural networks and transformer architectures with custom-built components39,40,41,42. The main network components for QRS detection and rhythm classification were pretrained on 1,716,141 5-min-long ECG strips and fine-tuned on 60,549 ≤30 s ECG strips. These were extracted from 69,706 anonymized clinical long-term recordings. Internal validation of the algorithm was performed using 15,188 ≤30 s strips from 12,330 additional, separate patient recordings. A high-level flowchart of the algorithm is presented in Extended Data Fig. 5. Preprocessing involves selecting the desired ECG channels from the input data, scaling the signal amplitude according to the input analog-to-digital conversion values and resampling to a frequency of 300 Hz. A deep learning model then extracts signal features and predicts the probability of QRS complex presence and signal readability39. This output, together with the preprocessed signal, is passed to an ensemble combining models of two architectures. The first is intended for the analysis of information from a wide context and has a hybrid architecture of convolutional neural network and transformer encoder layers40. The second is a pure-transformer implementation based on Vision Transformer41, allowing for superior interpretation of the signal within a relatively narrow window. Additionally, a specialized classifier was developed for the detection of asystole events.
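As an illustration of the preprocessing step only (the production pipeline is proprietary), a minimal Python sketch, assuming the raw signal is a NumPy array with a known sampling rate and analog-to-digital gain; the function name, parameters and units are assumptions:

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def preprocess_ecg(raw: np.ndarray, fs_in: int, adc_gain_mv: float,
                   channels: list[int]) -> np.ndarray:
    """Illustrative preprocessing: channel selection, amplitude scaling and
    resampling to the model's 300 Hz input rate (parameters are assumptions)."""
    sig = raw[channels, :].astype(np.float64)   # keep the desired ECG leads
    sig *= adc_gain_mv                          # ADC counts -> millivolts
    fs_out = 300
    g = gcd(fs_out, fs_in)
    return resample_poly(sig, fs_out // g, fs_in // g, axis=1)
```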

The QRS complex detector uses custom residual modules inspired by MobileNetV2 (ref. 42). Each module consists of three one-dimensional convolutional layers: a pointwise convolution to expand the feature dimension; a convolutional layer with a kernel length of 3 and variable dilation rates; and a pointwise convolution to reduce the feature dimension to its original size. The dilation rate doubles in each residual module during the first half of the model and then progressively decreases to a rate of 1 at the output layer. A final linear layer converts the output features into probabilities of QRS complex presence and signal readability for each sample. Thresholding and morphological operations are subsequently applied to extract QRS positions and identify nondiagnostic ranges. The wide-context architecture comprises a series of submodules. Initially, features are extracted from heart rate trends, calculated from the QRS detections, using the same architecture as the QRS detector (excluding the final linear layer). Another submodule extracts features for each sample of the preprocessed ECG signal using residual modules from the QRS detector but with a fixed dilation rate progression. The signal is downsampled using strided convolutional layers. Subsequently, windows of downsampled features are extracted, and two-dimensional strided convolutional layers are applied, resulting in features for each beat. The resulting features are processed using transformer encoder layers, augmented by an additional convolutional layer inserted between the linear layers in the fully connected blocks. Finally, the features are converted to logits for each QRS complex class using two pointwise convolutional layers.
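A minimal PyTorch sketch of the kind of dilated, MobileNetV2-style one-dimensional residual module described here; the channel counts, expansion factor and dilation schedule are assumptions, not the proprietary implementation:

```python
import torch
import torch.nn as nn

class InvertedResidual1D(nn.Module):
    """Pointwise expansion, dilated convolution (kernel length 3) and pointwise
    reduction back to the original feature dimension, with a residual connection."""

    def __init__(self, channels: int, expansion: int, dilation: int):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),          # expand features
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3,
                      dilation=dilation, padding=dilation),      # dilated conv, length preserved
            nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Conv1d(hidden, channels, kernel_size=1),          # reduce to original size
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)

# Dilation doubles through the first half of the stack, then decreases back to 1.
dilations = [1, 2, 4, 8, 4, 2, 1]
backbone = nn.Sequential(*[InvertedResidual1D(32, 4, d) for d in dilations])
head = nn.Conv1d(32, 2, kernel_size=1)  # per-sample logits: QRS presence, signal readability
```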

The signal-detail architecture is based on transformer encoder layers that process ECG signals split into patches. A linear layer embeds each patch. The transformer layers process the embedded patches, and logits for each QRS complex class are calculated using a linear layer. Only the patches containing QRS complexes are selected for predictions. The asystole filter module shares the same architecture as the wide-context model but is trained with hyperparameters and a dataset tailored to the asystole detection task.
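A minimal PyTorch sketch of such a patch-based transformer classifier; the patch length, model dimensions and class count are assumptions, and positional embeddings are omitted for brevity:

```python
import torch
import torch.nn as nn

class SignalDetailTransformer(nn.Module):
    """ECG split into fixed-length patches, each linearly embedded, processed by
    transformer encoder layers, with per-patch beat-class logits from a linear head."""

    def __init__(self, patch_len: int = 30, d_model: int = 128,
                 n_layers: int = 4, n_heads: int = 8, n_classes: int = 5):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, ecg: torch.Tensor, qrs_patch_idx: torch.Tensor) -> torch.Tensor:
        # ecg: (batch, samples); reshape into non-overlapping patches
        b, n = ecg.shape
        patches = ecg[:, : n - n % self.patch_len].reshape(b, -1, self.patch_len)
        logits = self.head(self.encoder(self.embed(patches)))  # (batch, n_patches, n_classes)
        # keep predictions only for the patches that contain a QRS complex
        return logits[:, qrs_patch_idx, :]
```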

We used the same dataset for training the QRS complex and noise detector and the main components of the heartbeat classification ensemble (three wide-context models and three signal-detail models). Data augmentation techniques tailored to each of these tasks, such as noise artifact generation or synthesis of heartbeats with rare features, were used to enhance training dataset diversity and mitigate overfitting. In addition, a classifier specializing in the interpretation of asystole events was developed by training a single model with the wide-context architecture on a carefully selected set of 11,670 strips with asystole or sinus arrest and 20,292 strips with noise or electrode dysfunction. The training process for this model combined supervised and self-supervised learning methods. The ensemble output is averaged across models or, for heartbeats with an RR interval greater than the sinus arrest threshold of 2 s, replaced by the asystole filter model output, to provide the probabilities of the QRS complex classes. Finally, the heartbeat types that are the final output of the DeepRhythmAI model are translated to heart rhythm types. Optimization was performed using the AdamW algorithm. Models were internally evaluated by measuring the root mean squared error metric based on sensitivity, precision and F1 score calculated from predictions and ground truth for the internal validation/test strips, following the methodology provided by the International Electrotechnical Commission 60601-2-47 standard43.
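A minimal Python sketch of the combination rule described above (averaging ensemble probabilities and substituting the asystole filter output for beats preceded by an RR interval >2 s); the array shapes and names are assumptions:

```python
import numpy as np

SINUS_ARREST_THRESHOLD_S = 2.0  # RR intervals above this use the asystole filter output

def combine_beat_probabilities(ensemble_probs: np.ndarray,
                               asystole_probs: np.ndarray,
                               rr_intervals_s: np.ndarray) -> np.ndarray:
    """ensemble_probs: (n_models, n_beats, n_classes); asystole_probs and the
    returned array: (n_beats, n_classes); rr_intervals_s: (n_beats,)."""
    combined = ensemble_probs.mean(axis=0)               # average across ensemble members
    long_rr = rr_intervals_s > SINUS_ARREST_THRESHOLD_S
    combined[long_rr] = asystole_probs[long_rr]          # replace for long-RR beats
    return combined
```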

The ECG recordings used in this study had never been presented to the DeepRhythmAI model or to any AI model from which the DeepRhythmAI model was derived. As part of the study protocol, we analyzed the entire raw ECG signal from these recordings using the DeepRhythmAI model to provide detection and beat-to-beat classification of all heartbeats.

Definition of critical and noncritical arrhythmias

Selection of representative arrhythmic episodes

Our strip selection method was designed not to introduce any bias toward ECG signals with less baseline noise or toward arrhythmic events with typical ECG presentations. We did this through automation: randomly ordered individual recordings were searched by an algorithm for the presence of arrhythmic events of each rhythm class, and 34-s strips containing arrhythmia events according to the AI model annotations, the ECG technician annotations or both were selected, at a maximum of one per method and arrhythmia class per patient. The automated selection script ran until 500 strips had been selected for each of the critical arrhythmias and 250 strips for each of the noncritical rhythm classes, or until all recordings had been searched and no more arrhythmias were found. The number of individual recordings that had to be searched to yield the strips for each rhythm class was considered the source population size for that class. The strip selection is described in Extended Data Fig. 6. In addition to the critical and noncritical rhythm classes, we included sinus rhythm, sinus bradycardia and unreadable signals due to noise or electrode dysfunction, both to evaluate the AI model performance for these signals and to ensure that the physician annotators would be presented with a differentiated sample in which they did not know which strips would contain critical arrhythmias. In total, we selected 5,245 strips, of which 2,240 were critical arrhythmias; after errors in uploading ten of these to the annotation platform, 5,235 strips remained, of which 2,236 were critical arrhythmias.
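A schematic Python sketch of this selection loop; `find_events` and `extract_strip` are hypothetical helpers, and the structure illustrates the procedure described above rather than the actual study script:

```python
import random
from collections import defaultdict

def select_strips(recordings, critical_classes, noncritical_classes,
                  find_events, extract_strip):
    """Targets: 500 strips per critical class, 250 per noncritical class,
    at most one strip per method and rhythm class per patient."""
    targets = {c: 500 for c in critical_classes}
    targets.update({c: 250 for c in noncritical_classes})
    selected = defaultdict(list)
    searched = defaultdict(int)   # recordings searched per class = source population size

    for recording in random.sample(recordings, len(recordings)):   # random order
        for rhythm_class, target in targets.items():
            if len(selected[rhythm_class]) >= target:
                continue
            searched[rhythm_class] += 1
            for method in ("ai", "technician"):
                events = find_events(recording, rhythm_class, method)
                if events:                                          # one strip per method
                    selected[rhythm_class].append(
                        extract_strip(recording, events[0], seconds=34))
        if all(len(selected[c]) >= t for c, t in targets.items()):
            break
    return selected, searched
```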

Consensus panel annotations

All 34-s strips were annotated beat-to-beat by 17 panels consisting of three expert annotators each: ≥2 board-certified cardiologists, additionally including board-certified clinical physiologists (n = 2) or final-year cardiology residents. The physicians on the panels performed the annotation independently of the AI and technician annotations and were blinded to the strip selection criteria. Strips were randomly distributed among panels, presented in random order and annotated using a custom-built software platform in which QRS complex tags, without beat type classifications, as detected by the AI model, were present. We used DeepRhythmAI model-detected QRS complexes for strips detected by both the AI model and the technicians to minimize bias; technicians in clinical practice may not have corrected QRS tags for all instances of arrhythmia, and using different methodology for different strips could have resulted in unblinding. The QRS tags were highly concordant: for QRS complexes that resulted in technician false negatives, there was a 98% overlap between the AI model and feature-based algorithm QRS positions. Physician annotators were asked to identify the beat type for each QRS complex according to an annotation manual (Supplementary Note), correct any mistaken QRS position placements, add any missed QRS complexes and mark areas that were unreadable due to poor signal or electrode dysfunction. Each physician annotated the entire strip beat by beat, and all discrepancies at the beat level were resolved by panel consensus. The resulting gold-standard annotations were compared to the beat-to-beat annotations of the AI model and technicians according to prespecified acceptance criteria: we considered arrhythmic events concordant with the panel annotation in the case of ≥80% overlap in beat type and duration for all sustained tachyarrhythmias and ≥90% overlap in duration for asystole events and pauses. For second- or third-degree AV block, we considered the presence of any such event within the strip to be a concordant annotation, and for ECG technicians, we also considered annotation of an unspecified ‘missed beat’ to be a concordant annotation for second-degree AV block. Single ectopic atrial and ventricular beats were considered concordant within ±45 samples (150 ms). Noise annotations were considered concordant if they overlapped the panel annotation by at least 80% in duration. Minor discrepancies between the AI/technician annotations and the consensus panel annotations at the beat-to-beat level were thus allowed, for example, low numbers of supraventricular beats or beats of unknown type within AF episodes.
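A simplified Python sketch of these duration-overlap acceptance criteria; the event representation (start/end times in seconds) and class names are assumptions, and the beat-type component of the overlap check is omitted:

```python
def duration_overlap_fraction(event, reference):
    """Fraction of the reference (panel) annotation's duration that the
    AI/technician event covers; events are (start_s, end_s) tuples."""
    overlap = max(0.0, min(event[1], reference[1]) - max(event[0], reference[0]))
    return overlap / (reference[1] - reference[0])

def is_concordant(event, reference, rhythm_class):
    # >=80% overlap for sustained tachyarrhythmias, >=90% for asystole events and pauses
    threshold = 0.9 if rhythm_class in ("asystole", "pause") else 0.8
    return duration_overlap_fraction(event, reference) >= threshold

def ectopic_beat_concordant(beat_s, reference_s, tolerance_s=0.150):
    # single ectopic beats accepted within +/-150 ms (45 samples at 300 Hz)
    return abs(beat_s - reference_s) <= tolerance_s
```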

Statistics

The primary analysis compares the frequencies of false-negative, true-positive and false-positive critical arrhythmias per 1,000 individual patients over the full duration of the recordings for technicians and the AI model, along with full confusion matrix statistics for AI model and technician performance compared to panel annotations. As a result of the sampling strategy, false negatives were reported only in patients in whom all instances of an arrhythmia type were missed for the entire duration of the recording. True-positive events were defined as episodes detected by the AI model or technician with correct annotations according to the independent gold-standard consensus panel annotation. Descriptive statistics are reported as mean ± s.d. CIs were derived using bootstrapping with 1,000 replications. Definitions of the confusion matrix statistics are reported in Extended Data Table 4. We also performed subanalyses in which misclassifications of critical arrhythmias were not considered false-negative or false-positive events, because these events would still have been reported to physicians. In these analyses, we also did not consider second-degree AV block to be a false-positive finding. For the analyses of total false-positive and false-negative findings of critical arrhythmias, the prevalence of all arrhythmias was weighted to the full population size according to the proportion of the population queried. Nonoverlapping CIs were considered evidence of the superiority of one method over the other. All analyses were performed in Python, except for the calculations of RR, which were done in Stata version 17.0 for Mac using two-sided Fisher’s exact P values. Analyses were performed by L.S.J. and G.J., with involvement from the steering group, according to prespecified plans. The study steering group (L.S.J., J.S.H., A.P.B. and A.M.) met regularly throughout the conduct of the study without the presence of Medicalgorithmics employees.
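As an illustration of how such bootstrap CIs can be obtained for a per-1,000-patient rate, a minimal Python sketch (not the study code; the function and variable names are assumptions):

```python
import numpy as np

def bootstrap_rate_ci(is_event: np.ndarray, n_boot: int = 1000, seed: int = 0):
    """Percentile bootstrap (1,000 replications) for an event rate per 1,000
    patients; is_event is a 0/1 array with one entry per patient."""
    rng = np.random.default_rng(seed)
    n = len(is_event)
    rate = 1000 * is_event.mean()
    boot = [1000 * rng.choice(is_event, size=n, replace=True).mean()
            for _ in range(n_boot)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return rate, lo, hi
```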

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.