Introduction

Wearable devices provide personal health monitoring and their clinical role in supporting health-care delivery is growing swiftly1. They have enabled longitudinal assessment of physiology at scale due to their measurement of health metrics such as heart rate, blood oxygen saturation, and cardiorespiratory fitness2,3. This has allowed early detection of respiratory illness, prediction of cardiovascular risk, and population-level assessment of physical activity2,4,5. Given the current emphasis on personalised medicine and digital phenotyping, there is a growing need for accurate consumer devices that enable the remote capture of digital biomarkers and biometrics6.

Compared with traditional methods, wearable devices offer continuous measurement that may facilitate identification of trends in health status and preventative care7,8. Yet, without validation, wearable device measurements may misguide assessment and treatment, potentially resulting in misrepresentations of health or delayed interventions.

Apple Watch (Apple Inc., California) is the most widely owned wearable device worldwide, with over 100 million users6, and measures several health metrics that have been associated with cardiovascular and all-cause mortality when assessed using criterion methods8,9,10,11,12,13,14,15,16,17. However, its measurement accuracy is not well-established. Existing literature indicates that accuracy is dependent on the individual metric, as well as the measurement conditions18. Previously, Apple Watch heart rate measurements have shown strong agreement with criterion measures, but factors such as exercise intensity, movement pattern, and skin contact affect accuracy19,20,21,22. Conversely, energy expenditure estimates have demonstrated low levels of agreement22,23,24, and sensitivity and specificity for atrial fibrillation detection range widely between studies25.

This heterogeneity permeates the current literature. Variation in study protocols and criterion methods renders comparative analysis of validation studies challenging. Prior systematic reviews and meta-analyses have included a small number of studies, many of which validated Apple Watch software and hardware that has since been discontinued23,24,26. Over the past five years, there has been a substantial increase in validation research; however, a contemporary literature synthesis including all health metrics has not been conducted. The yearly update cycle of Apple Watch, and swift advances in the machine learning algorithms which underpin its measurements, accentuate this issue27.

A continuously updated synthesis of Apple Watch metrics is required. To address this, our review was designed as a living study to provide an up-to-date evaluation of the device’s measurement accuracy, in accordance with the analytical validation component of the V3 framework28. We defined health metrics as any health-related physiological, behavioural, or environmental metric measured natively by Apple Watch. Our aim was to better understand the competencies and boundaries of Apple Watch in clinical and personal health contexts. Our objectives in this systematic review and meta-analysis were to: (1) identify all Apple Watch health metrics that have been validated in primary research studies, (2) evaluate the measurement accuracy of each metric, and (3) identify gaps in the current research.

Results

Following the removal of duplicates, 1202 records were identified. After title and abstract screening, 221 full texts were assessed for eligibility (PRISMA flow diagram, Fig. 1). Articles excluded following full-text review are listed in the Supplementary Information (pp. 18–32). Overall, 82 studies (430,052 participants) were included in this systematic review. Additional results, which include synthesis of hypertension notification, heart rate variability, sound exposure, and Six-Minute Walk Test distance estimation, along with funnel plots, are provided in Supplementary Note 1.

Fig. 1: PRISMA flow diagram.

Fourteen health metrics from all Apple Watch models through to Series 9 and Ultra 2 were validated. Fifty-seven percent of all participants were male, and the median sample size was 44. Information on total sample size was available for 81 of the 82 studies, and the male–female split was available for 75 studies. Heart rate was the most frequently validated metric (38 studies), whereas only one study each assessed hypertension notification, sound exposure, and heart rate variability. Study characteristics, including criterion methods and sample demographics, are listed in Table 1.

Table 1 Characteristics of included studies

Risk of bias

Overall, across 95 metric-level assessments, 13 (14%) were classified as ‘low’ risk of bias, 29 (30%) as ‘some concerns’, and 53 (56%) as ‘high’. Domain 1 (participants) and Domain 4 (statistical analysis) were most frequently rated as high risk. Twenty-six (27%) did not appropriately select participants to represent the target population (Domain 1), and 20 (21%) used inappropriate statistical analysis. This included complete exclusion of unsuccessful measurements, use of unsuitable statistical measures of agreement (e.g., t-tests), inadequate reporting of missing data, or failure to account for repeated measures. By contrast, Domain 3 (reference standard) was predominantly rated as low risk (85/95 [89%]). Validation protocols, criterion methods, and time intervals between assessments were mostly appropriate. Detailed risk of bias assessment for each metric is provided as a supplementary file, with narrative synthesis in the Supplementary Information (pp. 2–3).

Heart rate

Thirty-eight studies (1855 participants; 66% male) validated heart rate measurements from all Apple Watch models through Series 9 and Ultra 2. Agreement with criterion measures was strongest at rest, whereas it was lower during exercise involving irregular movement patterns and among individuals with arrhythmia29. Mean difference for resting heart rate ranged from -2.47 bpm to 3.61 bpm, and MAPE ranged from 1.69% to 7.2%30,31. During exercise, 10/11 (91%) studies reported MAPE lower than 10%21,30,32,33,34,35,36,37,38,39. MAPE tended to rise as intensity increased, although a decrease was noted in three studies32,34,38.

Meta-analysis of heart rate, with resting and exercise conditions combined, included 22 studies (n = 1247)29,30,31,32,33,34,37,38,40,41,42,43,44,45,46,47,48,49,50. The pooled mean bias (MB) was low, although limits of agreement (LoA) indicated measurement variability (-0.27 bpm [95% CI -0.72–0.17]; LoA -7.19 to 6.64; τ2 0.53; Fig. 2). For resting heart rate, we found that Apple Watch measurements were higher than criterion measures (MB 0.21 bpm [95% CI -0.65–1.07]; LoA -8.14 to 8.56; τ2 0.67; Fig. 3A). During exercise, Apple Watch underestimated heart rate (MB -0.63 bpm [95% CI -1.37–0.12]; LoA -6.86 to 5.60; τ2 0.93; Fig. 3B).

Fig. 2: Forest plot for heart rate under all conditions.

The red dashed line represents the pooled mean bias; the blue dashed lines represent the pooled limits of agreement (-7.19 to 6.64).

Fig. 3: Forest plots of heart rate at rest and during exercise.

A Forest plot for heart rate at rest. The pooled mean bias (0.21 bpm) and limits of agreement (-8.14 to 8.56) are represented by the dashed red and blue lines, respectively. B Forest plot for heart rate during exercise (mean bias -0.63; limits of agreement -6.86 to 5.60).

Six studies (16%) were rated as ‘low’ risk of bias, 11 (29%) as ‘some concerns’, and 21 (55%) as ‘high’. To examine the robustness of our findings, we conducted sensitivity analysis excluding studies at high risk of bias. The pooled mean bias and limits of agreement were comparable to our primary analysis (MB -0.50 bpm [95% CI -1.47–0.47]; LoA -7.54 to 6.53; 13 studies; Fig. S1).

To compare findings across Apple Watch models, we performed exploratory subgroup analysis according to the generation of optical heart rate sensor: first-generation (Apple Watch models up to Series 3), second-generation (Series 4–5 and all SE models), and third-generation (Series 6 onwards, including Ultra models). Compared to our primary analysis, we found narrower limits of agreement for the third-generation sensor (LoA -3.68 to 2.59; 8 studies; Fig. S2), but wider limits of agreement for the first- and second-generation sensors. Mean bias was comparable across all analyses. Further detail is provided in Supplementary Note 1.

Atrial fibrillation detection

Seventeen studies validated atrial fibrillation detection (n = 422,654; 57% male): two evaluated PPG-based detection from tachograms (Irregular Rhythm Notification)25,50, and the remainder assessed the ECG app. Sensitivity and specificity ranged widely between studies (19%–100% and 66%–100%, respectively). Six of the 15 studies that calculated sensitivity reported values higher than 80%51,52,53,54,55,56, and six fell in the range of 65% to 90%50,57,58,59,60,61. Sensitivity and specificity substantially improved when inconclusive ECG tracings were excluded51,53,56,59,60,62. The rate of inconclusive tracings was between 15 and 25% in several studies52,53,54,55,60,63. Thirteen studies were rated as ‘high’ risk of bias and four as ‘some concerns’.

Eleven studies (n = 3144) were included in meta-analysis of atrial fibrillation detection, all of which validated the ECG app51,52,53,55,56,57,59,60,62,63,64. Pooled sensitivity was 0.79 (95% CI 0.61–0.90), and pooled specificity was 0.91 (95% CI 0.81–0.96). The overall Zhou and Dendukuri I2 indicated moderate heterogeneity (55%). The area under the curve suggested strong discriminative ability (0.93; Fig. 4). Exploratory subgroup analysis examining the influence of hardware and software version is presented in the Supplementary Information (p. 6).

Fig. 4: Summary Receiver Operating Characteristic Curve for atrial fibrillation detection.

ECG waveform morphology

Seven studies (n = 535, 68% male) compared the amplitude and duration of Apple Watch ECG recordings to 12-lead ECG46,49,65,66,67,68,69. QT interval was the most frequently assessed segment (five studies)42,65,67,68,69. Four studies reported that Apple Watch underestimated QT interval duration, although limits of agreement were relatively wide42,65,67,68. Many of these studies evaluated different segments of the ECG waveform, restricting comparison.

Blood oxygen saturation

Blood oxygen saturation (SpO2) measurements were validated in Series 6 through Series 8, and six studies included patient cohorts42,45,70,71,72,73. Seven studies reported overall mean difference <1% SpO2, indicating good measurement accuracy, particularly in normoxic ranges42,70,72,73,74,75,76. However, limits of agreement approximating ±5% SpO2 were reported in multiple studies, indicating variability in measurements45,70,72,74,75,76,77. Measurement error tended to increase as SpO2 decreased. All five studies that assessed SpO2 in both hypoxic and normoxic ranges found stronger agreement with criterion measures in normoxic ranges72,74,75,76,77. Apple’s white paper reported accuracy root mean square (Arms) within the limits (<3.5%) defined by the US Food and Drug Administration (FDA) for medical pulse oximeters across the entire range of 70–100% SpO2. Two additional studies also reported Arms within these limits across the range of 80–100%75,76. By contrast, two studies reported wide limits of agreement for hypoxic ranges, reflecting variability in accuracy72,77.

Nine studies (n = 969) were included in meta-analysis of blood oxygen saturation. Pooled mean bias indicated that Apple Watch underestimated SpO2, although limits of agreement demonstrated variability (MB -0.04% [95% CI -0.42–0.35], LoA -4.01 to 3.94; τ2 0.13; Fig. 5). Our exploratory subgroup analysis found overestimation and wider limits of agreement for measurements obtained in hypoxic ranges (MB 0.43% [95% CI -3.85–4.71]; LoA -8.35 to 9.21; Supplementary Information p. 7).

Fig. 5: Forest plot of blood oxygen saturation measurement accuracy.

The pooled mean bias (-0.04% SpO2) and limits of agreement (-4.01 to 3.94) are represented by the dashed red and blue lines, respectively.

Energy expenditure

Margins of error for energy expenditure estimates were often large, both during exercise and at rest (8 studies; n = 270; 63% male). There was considerable variation between and within individual studies. Participants were predominantly young physically active adults, and five of the eight studies assessed Apple Watch Series 2 or older. All six studies that calculated MAPE reported values of 20% or higher in at least one test condition31,32,36,39,78,79. Overall, MAPE ranged from 9.71% (running) to 151.66% (walking). No distinct trend in measurement error by exercise intensity could be observed.

Step count and wheelchair push count

Three studies validated step count from Apple Watch First Generation and Series 1. In the largest study (n = 71), a small underestimation and strong correlation were found; however, moderate correlation and wide limits of agreement were reported in each of the other studies80. There was no distinct trend in accuracy based on walking or running speed80,81. Notably, no study included sedentary periods or seated activities that involved arm movements in their validation.

Fig. 6: Graphical abstract.

Demonstrating included metrics, inclusion requirement for device wear, risk of bias ratings, and meta-analysis results. bpm beats per minute, LoA limits of agreement. Icons adapted from Phosphor Icons, used under the MIT License.

Three studies evaluated wheelchair push count. Apple Watch overestimated overall wheelchair push count in two studies82,83, and underestimated in the other84. However, margins of error varied substantially, even within studies. MAPE ranged from 1% to 21% for Series 183,84, and was 9.2% for Series 482.

VO2 max estimation

One study (n = 30) compared VO2 max estimates to indirect calorimetry and found that Apple Watch underestimated VO2 max, noting a clinically significant mean difference (-6.07 mL/kg/min) and wide limits of agreement85.

Sleep stage classification and sleep apnoea detection

Three studies validated sleep stage classification (n = 221)86,87,88. Overall, they found good differentiation between sleep and wake states, but moderate-to-poor differentiation between physiologically similar sleep stages. Two studies reported sensitivity for binary sleep-wake classification ≥97%, however, they also reported low accuracy for classification of deep sleep, with a tendency to misclassify it as light sleep86,87. Robbins and colleagues (n = 29, Series 8) found that Apple Watch significantly underestimated deep sleep, and overestimated light sleep86. For sleep apnoea detection, Apple’s clinical validation study found higher specificity (98.5% [95% CI 98.0–99.0]) than sensitivity (66.3% [95% CI 62.2–70.3]). Fig. 6 provides a graphical overview of this review's results.

Discussion

This systematic review and meta-analysis evaluated the accuracy of 14 health metrics from Apple Watch to inform its use in personal health monitoring and clinical settings. We found that accuracy varied by metric, measurement conditions, and physiological characteristics, highlighting the need to interpret accuracy in the context of each metric’s intended use.

The pooled mean bias for heart rate was low (-0.27 bpm [95% CI -0.72–0.17]), although limits of agreement were moderately wide (-7.19 to 6.64 bpm). The pooled limits of agreement demonstrated measurement variability of ~±7 bpm and reflected agreement across a broad population by incorporating both within- and between-study variability, as described by Tipton & Shuster. In line with Bland and Altman’s recommendations, the limits of agreement are the key measure for determining whether Apple Watch is a suitable alternative to current measurement methods. We observed sufficient accuracy to quantify exercise intensity among healthy adults, although moderate misestimation may occur in some cases, particularly among individuals with cardiac disease. Our subgroup analyses showed substantially lower variability for measurements obtained with the third-generation optical sensor (LoA -3.68 to 2.59) compared to older generations. This indicated that accuracy was both population- and condition-dependent.

For blood oxygen saturation, we also found low mean bias (-0.04% [95% CI -0.42–0.35]), but the pooled limits of agreement (-4.01 to 3.94) suggested that Apple Watch may, in certain instances, misclassify individuals in hypoxic ranges as being in normoxic ranges. Across individual studies and in our subgroup analysis, we identified greater variability and lower agreement among patients in hypoxaemia. However, two studies found that, in healthy adults, Apple Watch met the standards set by the FDA and International Organization for Standardization (ISO) for medical grade pulse oximetry when hypoxaemia was induced. These findings indicate that Apple Watch may serve as a useful adjunct to traditional pulse oximetry, although its accuracy is limited in hypoxic ranges.

For atrial fibrillation detection, Apple Watch was more specific than sensitive (pooled sensitivity 0.79 [95% CI 0.61–0.90]). The pooled specificity (0.91 [95% CI 0.81–0.96]) indicated that a notification of atrial fibrillation likely reflects its true presence, suggesting notification warrants further clinical investigation. Both sensitivity and specificity ranged widely between studies, however, and in several, 15–25% of measurements were inconclusive, representing a notable rate of unsuccessful assessment.

The error of energy expenditure estimates was often large and varied considerably, both within and between studies. The mean difference for VO2 max (-6.07 mL/kg/min) was clinically significant, as a 3.5 mL/kg/min increase has been associated with a risk ratio of 0.89 for all-cause mortality89. We observed moderate accuracy for sleep overall, with good classification between sleep and wake states — sufficient for personal health monitoring — but differentiation between physiologically similar sleep stages was poor. There was also moderate accuracy for step count, wheelchair push count, and hypertension notification, although fewer than four studies were included for each metric. A number of metrics are yet to be validated, including respiratory rate, wrist temperature, and measures of sedentary behaviour.

There are important distinctions between our findings and previous systematic reviews and meta-analyses, although we report similar results for certain metrics22,23,24,90,91,92,93,94,95,96. A prior meta-analysis, which pooled multiple effect estimates from single studies — a method that is not recommended97 — found a similar mean bias but wider limits of agreement for heart rate (-0.12 bpm; LoA −11.06 to 10.81)22. Notably, the authors included several studies that we deemed ineligible for our review, primarily due to the validity of criterion methods and lack of adherence to manufacturer guidelines for device wear. Elsewhere, low and moderate agreement have been identified for energy expenditure and step count, respectively22,23,24. Many of these previous systematic reviews, however, included fewer than five studies and exclusively assessed old Apple Watch software and hardware23,24. Only two prior meta-analyses have evaluated atrial fibrillation detection. The first pooled just three studies using a fixed-effects model, which does not appropriately account for heterogeneity93, while the second meta-analysis pooled results from multiple manufacturers’ devices92.

We found that Apple Watch’s measurement accuracy broadly aligns with that of other wearable devices. Across manufacturers, error margins for energy expenditure estimates are often large98, whereas heart rate measurements typically exhibit stronger agreement with criterion measures26. For heart rate and blood oxygen saturation, Apple Watch showed stronger agreement with criterion measures than Garmin, Fitbit, and Withings devices23,24,71,99. For sleep, however, agreement with polysomnography was lower for Apple Watch than for Whoop, Fitbit, and Garmin88,100.

Three factors particularly impact measurement accuracy. The first is the metric’s measurement method. Metrics such as step count, VO2 max, and energy expenditure require inputs from multiple sensors, combined through sensor fusion27. When they are combined, error from individual inputs may compound101,102. In contrast, metrics like heart rate and SpO2 are obtained directly from photoplethysmography (PPG), requiring less derivation. The second is environmental and mechanical interference: movement, moisture, and skin contact impact motion sensor measurements and the clarity of PPG waveforms27,103,104. This is one source of inaccurate heart rate measurements during high-intensity exercise with irregular movement patterns29. The third is physiology: factors including blood perfusion and individual variation in heart rate response to exercise affect measurements18. Low blood perfusion, due to low body temperature or physiological traits, can lead to inaccuracy, especially given the PPG sensor’s reliance on pulsatile arterial blood, which accounts for a minority of blood in the tissue at the wrist27. Algorithms that are ill-suited to an individual’s physiology may also lead to inaccuracy. Given the sensitivity of PPG waveforms and sensor measurements to these factors, the machine learning algorithms that interpret them are increasingly important, and recent literature has shown improved accuracy due to algorithmic developments alone105.
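The compounding of error described above can be illustrated with first-order error propagation. The sketch below is a simplified illustration only, assuming independent inputs combined multiplicatively; it is not Apple's actual (proprietary) sensor-fusion algorithm:

```python
import math

def combined_relative_error(*rel_errors):
    # First-order propagation: for a product (or quotient) of independent
    # inputs, relative errors add in quadrature. Multiplicative combination
    # is an illustrative assumption, not the device's actual fusion method.
    return math.sqrt(sum(e ** 2 for e in rel_errors))

# Two inputs each carrying 5% relative error yield a combined error
# exceeding either input's individual error (~7%).
fused_error = combined_relative_error(0.05, 0.05)
```

Under these assumptions, a metric derived from several noisy inputs carries at least as much relative error as its noisiest input, consistent with the larger error margins observed for derived metrics such as energy expenditure.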

To determine whether accuracy is adequate, the measurement’s intended use must be considered. For clinical use, thresholds corresponding to clinically important change may guide interpretation. For instance, a 10 bpm increase in resting heart rate has been associated with a 9% increase in all-cause mortality risk16, whereas a 3.5 mL/kg/min increase in VO2 max and a 1000-step increase in daily step count have both been associated with decreased all-cause mortality risk89,106. Accuracy that permits detection of clinically meaningful change — within thresholds identified by large epidemiological studies and meta-analyses, or those stipulated by regulatory bodies such as the FDA, ISO, and European Union107,108,109 — may be deemed adequate. For personal health and fitness monitoring, however, wider margins of error may suffice to provide high-level trends over time in physiological and behavioural health metrics. In population-level research trials, where scale may attenuate individual error, such measurements could provide researchers with insight into associations and risk stratification across groups. The required accuracy, therefore, should be guided by the measurement’s use and by validation among the intended measurement population.

We recognise that our results are contingent on the characteristics of our included studies, particularly given the variability in accuracy across participant cohorts and measurement conditions. A greater proportion of trials involving cardiac populations or exercise involving erratic movement patterns, for instance, may have produced different results. Methodological rigour was also inconsistent: adherence to validation guidelines, such as INTERLIVE’s expert statements, was low110,111,112,113, statistical procedures were sometimes inadequately described, and inconclusive measurements were excluded from certain analyses. In addition, few studies conducted free-living validation, which best reflects typical use, likely due to challenges obtaining criterion measures.

Consequently, our study has several limitations. First, statistical and methodological heterogeneity prevented meta-analysis of energy expenditure, and restricted subgroup analyses. We were unable to conduct subgroup analysis by body mass index or skin tone as it was infrequently reported. Additionally, we could not precisely differentiate between the impact of hardware and software on accuracy due to the proprietary nature of updates to the foreground heart rate and SpO2 algorithms, as well as the limited number of studies evaluating each Apple Watch model. Second, the generalisability of our findings was restricted due to the bias towards physically active individuals and males among participants. The variation in sex balance between metrics, coupled with limited validation among older adults and those with comorbidities, accentuates this restriction. Third, many studies were at high risk of bias. While we conducted sensitivity analyses excluding these studies for heart rate, this was not feasible for blood oxygen saturation and atrial fibrillation; fewer than five studies were rated as ‘low’ or ‘some concerns’ for these metrics, and the marked imbalance between groups would have limited the validity and interpretability of any formal analysis. Fourth, few studies were included for metrics such as step count and sleep. This was due to our stringent approach to criterion method validity and adherence to manufacturer guidelines for device wear. Fifth, many studies assessed Apple Watch models that have since been discontinued. Nevertheless, several studies validated measurements from the most recent optical heart rate sensor and algorithms, as they are not updated with each new Apple Watch model.

The main strength of this study is its breadth and meta-analyses. It is the first to synthesise all health metrics from Apple Watch that have currently been validated, and it provides the most comprehensive meta-analyses to date of heart rate, atrial fibrillation detection, and blood oxygen saturation. We gave ample consideration to the validity of criterion methods and ensured that Apple Watch was validated in the manner it was designed to be worn. We did not consider research-grade wearables as valid criterion methods for step count or energy expenditure due to the conflicting evidence on their validity98,114,115. A rigorous search and screening process was implemented, comprising nine databases and four reviewers, and to reduce publication bias, grey literature was included. This study is designed as a living systematic review and meta-analysis to ensure that the evidence synthesis does not become outdated quickly as Apple Watch evolves. An updated search will be conducted yearly to integrate new studies and new metrics, and data will be published in an open-access format.

The clinical applications of wearable devices are expanding. There is growing recognition that wearable devices may improve preventative care and management of chronic disease2,102. Major organisations, including the American Heart Association and the British Heart Foundation, are conducting large research trials to inform the integration of wearable data in cardiovascular care2,116,117,118. Moreover, the development of digital biomarkers, together with emerging metrics such as hypertension notification, aims to translate wearable measurements into clinically actionable data that support disease management and assessment. Clear interpretation of these data may provide agency to patients, allowing them to better manage their condition in partnership with their healthcare professional, ultimately reducing health-care cost and burden102,119,120,121,122.

Future research should examine the longitudinal relationships of Apple Watch metrics with markers of health and disease, as well as validating measurements taken at single time-points. Clearer understanding of measurement precision and reliability will enable more accurate interpretation of trends in health metrics over time. Validation studies that include older adults, patient populations, and metrics related to vital signs — such as respiratory rate and wrist temperature — are needed. As software and hardware advance, and new metrics are developed, continued validation across diverse cohorts and conditions is required to inform the capabilities and limitations of Apple Watch.

This systematic review and meta-analysis demonstrated the variation in measurement accuracy between Apple Watch health metrics, as well as the influence of measurement condition and individual physiology. We identified good agreement for heart rate overall, whereas error for energy expenditure estimates was often inconsistent and large. Wide limits of agreement for SpO2 indicated measurement variability, and we found moderate accuracy for sleep and step count. As a ubiquitous consumer device, Apple Watch provides the general population with assessment of activity, physiology, and cardiovascular function that may otherwise be inaccessible. Despite inaccuracies, the continuous nature of these measurements may offer unique health insights, and further research exploring their use in public health is warranted.

Methods

This systematic review and meta-analysis was conducted and reported as per PRISMA guidelines123. The protocol was prospectively registered in PROSPERO (CRD42023481841; www.crd.york.ac.uk/PROSPERO/view/CRD42023481841).

Search strategy and selection criteria

We searched PubMed, SPORTDiscus, Embase, IEEE Xplore, Web of Science, Scopus, CINAHL and the Cochrane Library from inception to September 24, 2025. Keywords, Medical Subject Headings (MeSH), and synonyms related to Apple Watch and its measurement accuracy were included. To identify additional studies and grey literature, a hand search was undertaken across Google Scholar, the Apple Health website, and the US Food and Drug Administration 510(k) database. The university’s Research Engagement Librarian was involved throughout the development of the search strategy, which was peer-reviewed prior to implementation. Details of the tailored search strategy for each database are reported in Supplementary Note 2.

We included primary research studies which compared any health metric from Apple Watch to a validated criterion measure. Descriptions of valid criterion measures are available in the Supplementary Information (pp. 11–12). Studies investigating metrics not intended to be measured by Apple Watch, or in populations in which they were not intended for use, were excluded; for example, recording ECG with Apple Watch placed at the ankle, or blood oxygen saturation assessment in neonates. Measurements were required to be taken in accordance with manufacturer guidelines. Studies in which multiple devices were worn on one wrist were excluded due to potential measurement interference caused by improper device placement, photoplethysmographic light impedance from adjacent devices, and motion sensor disruption, among other factors. Grey literature, including conference abstracts and unpublished white papers, was also included. There were no restrictions placed on demographic or language.

Three authors (RL, B.O.’G., M.B.) independently screened titles, abstracts, and full texts, with two authors per citation. Disagreements were resolved by consensus. The study selection process was carried out using Covidence (Veritas Health Innovation Ltd). This study was designed as a living systematic review. Searches will be updated every 12 months, or earlier if major Apple Watch hardware or software updates occur. Newly identified studies will be screened and incorporated using the same methodology. Updates will be disseminated via the Open Science Framework (osf.io/v5d3k).

Outcomes

The primary outcome was the agreement between measurements from Apple Watch and the criterion method for each health metric. This included pooled mean bias, Bland-Altman limits of agreement, sensitivity, and specificity for metrics that were meta-analysed. We extracted measures of agreement across all populations and conditions, including varied exercise intensities and clinical cohorts (e.g., cardiovascular disease). Measures of effect included mean difference, sensitivity and specificity, mean absolute percentage error (MAPE), Bland-Altman limits of agreement, and correlation coefficients.
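For illustration, these measures of agreement can be computed from paired device and criterion readings as follows. This is a minimal sketch; the function name and sample values are hypothetical and not drawn from any included study:

```python
import math

def agreement_stats(device, criterion):
    # Paired differences (device minus criterion).
    diffs = [d - c for d, c in zip(device, criterion)]
    n = len(diffs)
    mean_diff = sum(diffs) / n  # mean difference (bias)
    # Sample SD of the differences, used for Bland-Altman 95% limits of agreement.
    sd = math.sqrt(sum((x - mean_diff) ** 2 for x in diffs) / (n - 1))
    loa = (mean_diff - 1.96 * sd, mean_diff + 1.96 * sd)
    # Mean absolute percentage error relative to the criterion.
    mape = 100 * sum(abs(d - c) / c for d, c in zip(device, criterion)) / n
    return mean_diff, mape, loa
```

For example, `agreement_stats([62, 59, 61], [60, 60, 60])` yields a small positive bias with a MAPE of roughly 2%.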

Data extraction

Two reviewers (RL, M.B.) independently extracted data in duplicate using a pilot-tested extraction form in Microsoft Excel. Extracted data were then compared and merged following consensus. This included data on participant demographics, criterion method, validation protocol, and statistical analysis. In the case of missing or unclear information, authors were contacted via email, and one follow-up was sent to those who did not respond. Where required, we back-calculated statistics necessary for meta-analysis, if sufficient data were available124.

Risk of bias assessment

An adapted version of the COSMIN checklist (COnsensus-based Standards for the selection of health Measurement INstruments) was used to assess risk of bias. COSMIN defines standards for evaluating the methodological quality of studies validating health measurement instruments and is implemented by the expert-led ‘Towards Intelligent Health and Well-Being Network of Physical Activity Assessment’ (INTERLIVE) consortium110,125. The modified tool includes four domains: participants, index measure, reference standard, and statistical analysis. Each domain includes multiple items with three possible answers (‘yes’, ‘unclear’, or ‘no’), and ratings were assigned in accordance with the checklist’s recommendations. Studies with at least one ‘no’ or more than two ‘unclear’ ratings were categorised as ‘high’ risk, while those with one ‘unclear’ item were designated as ‘some concerns’. Studies with ‘yes’ in all domains were classified as ‘low’ risk. Where studies validated more than one metric, risk of bias was assessed individually for each. Three authors (R.L., B.O’G., M.B.) independently assessed risk of bias and disagreements were resolved by consensus.
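The categorisation rule can be expressed as a short function. This sketch is ours rather than COSMIN code; in particular, treating the boundary case of exactly two 'unclear' ratings (with no 'no') as 'some concerns' is an assumption:

```python
def classify_risk(ratings):
    # ratings: list of item-level answers, each 'yes', 'unclear', or 'no'.
    n_no = ratings.count("no")
    n_unclear = ratings.count("unclear")
    if n_no >= 1 or n_unclear > 2:
        return "high"
    if n_unclear >= 1:
        # Assumption: 1-2 'unclear' answers and no 'no' -> 'some concerns'.
        return "some concerns"
    return "low"
```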

Statistical analysis

Meta-analysis of heart rate and blood oxygen saturation was conducted in accordance with the framework developed by Tipton & Shuster126. A random-effects model with inverse variance weighting was used to account for heterogeneity between trials127. Pooled Bland-Altman limits of agreement and mean bias were calculated. Subgroup meta-analyses were conducted for heart rate measured at rest and during exercise. To prevent unit-of-analysis errors, only one estimate per study per condition was included in meta-analyses, in line with the approach described by Borenstein and colleagues97. Where studies reported multiple mean difference values, they were pooled prior to meta-analysis, accounting for variance. If the standard deviation of the differences was not reported, it was back-calculated by rearranging the formula used to compute 95% limits of agreement128. Details of the formulae for back-calculation and the methods for pooling mean differences are provided in the Supplementary Information (p. 15).
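The back-calculation of the standard deviation of differences, together with a simplified random-effects pooling, can be sketched in Python. The pooling shown is the common DerSimonian-Laird estimator, used here for illustration; the full Tipton & Shuster procedure for pooling limits of agreement involves additional steps not reproduced:

```python
import math

def sd_from_loa(lower, upper):
    # 95% limits of agreement = mean bias +/- 1.96 * SD of differences,
    # so the SD can be recovered from the width of the LoA interval.
    return (upper - lower) / (2 * 1.96)

def pool_mean_bias(biases, std_errors):
    # DerSimonian-Laird random-effects pooling (simplified sketch).
    w = [1 / se ** 2 for se in std_errors]
    fixed = sum(wi * b for wi, b in zip(w, biases)) / sum(w)
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, biases))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(biases) - 1)) / c)  # between-study variance
    w_re = [1 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * b for wi, b in zip(w_re, biases)) / sum(w_re)
    return pooled, tau2
```

For example, `sd_from_loa(-7.19, 6.64)` recovers an SD of roughly 3.5 bpm from the pooled heart rate limits of agreement reported above.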

Pooled sensitivity and specificity for atrial fibrillation detection were calculated using bivariate meta-analysis with the Reitsma model (mada package)129. Diagnostic accuracy contingency tables were back-calculated when not reported, in accordance with previously described methods (Supplementary Information p. 14)124. We evaluated statistical heterogeneity by estimating the degree of between-study variability using the Tau² statistic130,131. Analyses were conducted in R version 4.5.1 (The R Foundation for Statistical Computing, Vienna) with RStudio (version 2025.09.0 + 387) and in Python 3.13.
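As an illustration of the contingency-table back-calculation, counts can be recovered from reported sensitivity, specificity, and reference-standard group sizes (rounding to whole participants). This sketch is ours and does not reproduce the exact procedure of the cited method:

```python
def contingency_from_accuracy(sensitivity, specificity, n_pos, n_neg):
    # n_pos: participants with atrial fibrillation per the reference standard;
    # n_neg: participants without. Counts are rounded to whole numbers.
    tp = round(sensitivity * n_pos)   # true positives
    fn = n_pos - tp                   # false negatives
    tn = round(specificity * n_neg)   # true negatives
    fp = n_neg - tn                   # false positives
    return tp, fn, tn, fp
```

For example, a hypothetical study reporting sensitivity 0.79 and specificity 0.91 in 100 positive and 200 negative participants yields TP = 79, FN = 21, TN = 182, FP = 18.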