Introduction

Recent years have seen a proliferation of research proclaiming the utility of quantum machine learning (QML) algorithms for analyzing classical data in many sectors, e.g. finance, cybersecurity, logistics, pharmaceuticals, energy, minerals, and healthcare. With the increasing digitization of health data, the growth of electronic health and medical records1 paves the way for the use of algorithmic techniques - quantum or classical - for analyzing this data. Potential digital health applications could include clinical decision support, clinical predictive health and health monitoring, public health applications and improving health services delivery and data fusion2,3,4,5. The potential for use-case discovery for QML in healthcare6 and biomedical7 applications is found to be compelling in previous systematic reviews. Other broader reviews on quantum computing for health, biology and lifesciences8,9,10,11,12,13,14,15 hypothesize the potential utility of QML algorithms or quantum subroutines in health, but none of these works are rigorous systematic reviews (and thus reproducible). Indeed, across all of these standard and systematic reviews, we find that the strength of the current evidence base even under mildly realistic operating conditions is not examined.

Characterizing the role of QML algorithms applied to real-world classical data is nuanced and a challenging question in applications development but also in fundamental QML theory16,17. Quantum advantage refers to asymptotic reduction in computational resources (or some other metric18) required by quantum algorithms when compared to classical counterparts, i.e. resources are saved as problem size scales to infinity. Empirical quantum advantage19 colloquially refers to finite-sized simulations or experiments using quantum over classical algorithms to perform a task, where one assumes any desired resource savings will scale to larger problems, e.g. in qubit number, high-dimensional or highly structured datasets. However, for classical datasets of arbitrary structure such as those encountered in healthcare settings, there is no known theoretically provable quantum advantage18. Instead, the field relies on mostly empirical analysis of QML performance for a variety of pseudo-real-world data, where performance differentials between quantum and classical methods on these smaller problems constitute evidence for testing empirical quantum advantage. Most computational analysis of scaling behavior assumes ideal operating conditions and it is unknown if QML methods will retain any benefits in realistic operating settings, such as on near-term noisy quantum hardware. In some cases, the role of quantum algorithms for solving inference tasks has been entirely replaced by equivalent classical capability, in a process known as dequantization (e.g.20,21).

In this work, we undertake a systematic literature review of QML applications in digital health between 2015 and 2024. As typical in medical research settings, a systematic literature review is a standard methodological approach for assessing the strength of evidence for proposed interventions in clinical contexts and public health22. Based on existing evidence in literature, we use the SPICE framework23 to ask: In developing digital health technologies, could quantum machine learning algorithms potentially outperform existing classical methods in efficacy or efficiency? A systematic review was conducted in line with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses)24 (Supplementary Note 2) detailed in Methods. Our methodology assesses the strength of the evidence and dominant trends associated with using QML algorithms for digital health, including assessing the extent to which performance robustness of proposed QML algorithms has been characterized.

Our current-state analysis reveals that the empirical evidence for QML in digital health cannot conclusively address our research question. We find that numerous studies had to be excluded due to a lack of technical rigor in their analysis of QML algorithms. The majority of eligible studies use only ideal simulations of QML algorithms, thereby excluding the resource overhead incurred for error-mitigated or error-corrected algorithms required for noisy quantum hardware. Of high quality studies, nearly all QML algorithms are found to be linear quantum models, and therefore represent a small subset of general QML. Most use-cases in digital health focussed on providing clinical support, and no studies considered health service delivery or public health applications. Only two synthesized studies used electronic health records for quantum machine learning applications, while the remaining studies repeatedly gravitated towards a handful of open-source health databases. Finally, 13 studies used quantum hardware demonstrations and separated into two classes: either algorithms for a gate-based, universal quantum computer using up to 20 qubits, or quantum annealers using O(100) qubits. Whether potential advantages of QML can be retained in the presence of noise is largely unaddressed in all studies.

We devote the remainder of this Introduction to providing an overview of quantum machine learning for those unfamiliar with this domain. We will also briefly discuss performance metrics, properties of different families of quantum machine learning algorithms, techniques for encoding data into quantum states, and data pre-processing. Quantum algorithms refer to a broad category of algorithms, for which it is desired that quantum computing hardware will be required to perform some of the computations. We distinguish these quantum algorithms from quantum-inspired classical algorithms that use insights from quantum mechanics to perform computations on classical computers. Quantum machine learning algorithms are a subset of quantum algorithms. For the scope of this review, a quantum machine learning algorithm takes as input a classical dataset, and an inference problem is defined on the classical dataset.

Much of the literature we encountered in our review discussed the potential benefits of using quantum machine learning techniques in lieu of classical methods to analyze health data. The terminology used to communicate these benefits is often ill-defined, e.g. quantum ‘speed-up’, ‘utility’ or ‘advantage’ are used interchangeably. In QML, computational ‘advantage’ accrues when a QML algorithm can reduce the number of operations required to solve this inference problem as the size of the problem becomes asymptotically large. Here, the problem size is typically associated with features of the input data e.g. with input data dimension. From a computer science perspective, algorithms can either improve on the number of queries or samples required (sample complexity) or the number of parallelizable quantum operations (time complexity, or runtime). When quantum algorithms enable improvements in complexity, this is sometimes referred to as ‘quantum advantage’, while ‘speed-up’ is often reserved only for reduction in time complexity. An additional metric of memory complexity quantifies the size or type of data structures required to efficiently store and recall intermediary information during computation. While memory complexity is typically not discussed in the literature for quantum algorithms, subroutines such as QRAM may play an analogous role. A comparison of computational costs of selected classical vs. quantum algorithms for ideal mathematical regimes can be found in ref. 11, but these were not encountered for real-world health data in our review.

Quantum algorithms separate into two different categories in this review: gate-based quantum models, or quantum annealing. This categorization can broadly reflect the difference between digital and universal vs. analog and non-universal quantum computing. While we provide a high-level summary of classes of quantum algorithms that were encountered in this review, it cannot be construed as a comprehensive overview of quantum machine learning (see for example, refs. 25,26). Background quantum notation and a fuller discussion is provided in Supplementary Note 1. The majority of studies in the review focussed on quantum algorithms designed for gate-based universal quantum computers. These algorithms include quantum kernel methods (including quantum support vector machines), quantum neural networks, quantum convolutional neural networks, and quantum deep learning. We summarize the quantum computational steps in these protocols by considering how outputs are generated from inputs in Fig. 1 by representing these steps as quantum circuits.

Fig. 1: Common models in quantum machine learning.
figure 1

Quantum circuit depictions of linear vs. non-linear embedding in quantum models27. Horizontal wires represent qubits where input states are shown as ket \(| \cdot \left.\right\rangle\) symbols; temporal order of computations progresses from left to right. Boxed quantum gates (blue, orange) are reversible rotations, or `unitary gates’, of quantum states, i.e. U = U−1. If data encoding (blue) is separable from variational gates (orange), then the quantum model is linear. ‘Circuit size’ refers to the number of qubits, while ‘circuit depth’ represents the number of time-steps required to run the full circuit assuming that quantum operations on disjoint qubits have been parallelized. ac Measurements (msmts.) are pushed to the end; quantum circuit can be summarized by a unitary operation. d Mid-circuit measurement outcomes change quantum operations ‘on the fly’ (pink). Circuits with tunable θ (blue gates) can be broadly referred to as variational (VQC) or parameterized (PQC) quantum circuits.

In the circuit visualization of Fig. 1, inputs to a quantum algorithm are qubit states denoted with ket-notation \(| \cdot \left.\right\rangle\) and boxed operations denote quantum gates. These gates are associated with reversible, logical operations performed on quantum states. The circuit is terminated with measurements of a quantum state which yield probabilistic outcomes, ‘0’ or ‘1’, where probabilities are determined by the quantum circuit. Suppose for some input quantum state, ρ0, the average output of a quantum computation is given by f(x, θ) where (x, θ) define classical inputs to a quantum algorithm. Here, ρ0 represents an input state, such as all qubits in their ground (zero) state; x represents one sample of real data with dimension d, \(x\in {{\mathbb{R}}}^{d}\) for a dataset with N samples, and tunable free parameters, θ, that parameterize the circuit. One encodes data x into quantum states using a parameterized quantum gate, denoted U(x). Meanwhile, free parameters, θ, implement classically optimized or trained quantum gates, V(θ). With these assumptions, the desired output information required from the circuit is typically given with reference to an observable quantity, \(\hat{O}\). This output information is inherently statistical, i.e. one must infer average information about \(\hat{O}\) from a statistical ensemble of ‘0’ or ‘1’ measurements obtained by repeatedly preparing and measuring the same quantum circuit Ns number of times. Therefore to extract information about \(\hat{O}\), we build up an ensemble of quantum measurements by repeatedly running a quantum circuit Ns number of times for a single instance of x, and repeating for different choices of x.

A quantum machine learning algorithm typically consists of input data (x-dependent) and tunable (θ-dependent) quantum operations. Using Supplementary Note 1, we can write the general output of a QML algorithm as,

$$f(x,\theta ):= {\rm{Tr}}\left[U(x,\theta ){\rho }_{0}{U}^{\dagger }(x,\theta )\hat{O}\right]=\langle {\rho }_{x,\theta },\hat{O}\rangle ,$$
(1)

where the data (x-dependent) and tunable (θ-dependent) components of the quantum state ρx,θ cannot be separated. In equation (1), U(x, θ) represents a parameterized quantum gate which depends on data x and tunable parameters θ. The output of a QML algorithm thus computes the overlap between information in the quantum state ρx,θ = U(x, θ)ρ0U(x, θ), and the desired output \(\hat{O}\), using an inner product. In contrast, linear quantum models allow us to separate the x-dependent quantum operations and θ-dependent quantum operations within the inner product27. In these models, we perform data encoding operations followed by tunable gates V(θ). As shown in Fig. 1(a), a linear quantum neural network (QNN) can be expressed by,

$$f(x,\theta ):= {\rm{Tr}}\left[V(\theta )U(x){\rho }_{0}{U}^{\dagger }(x){V}^{\dagger }(\theta )\hat{O}\right]=\langle {\rho }_{x},{\hat{O}}_{\theta }\rangle .$$
(2)

In equation (2), θ can take the form of any other classical parameters that are not x; data encoding is expressed by ρx U(x)ρ0U(x), and the parameterized neural net is expressed as \({\hat{O}}_{\theta }:= {V}^{\dagger }(\theta )\hat{O}V(\theta )\). We note that the embedding U(x) can be nonlinear transformation of the input data, x. However, the terminology ‘linear’ quantum model refers to the linearity of the model with respect to the embedding, i.e. data-dependent and parameterized components of the quantum algorithm can be separated as shown above27.

With this structure, we can additionally describe many other types of quantum machine learning algorithms. For example, we can omit θ entirely, and recover sophisticated algorithms that focus on data encoding procedures. In quantum kernel methods (QKMs), θ is replaced by training data, and the algorithm output f during prediction represents a linear combination of all training samples. Sometimes the action of ρ, U(x) or V(θ) is non-trivially restricted to some subset of quantum states. Quantum convolutional neural networks (QCNNs), quantum generative adversarial networks, quantum causal modeling, quantum transformers, and quantum deep reinforcement learning all have regimes in which they reduce to linear quantum models of the form in Eq. (2) as discussed in Supplementary Note 1.

Meanwhile, quantum annealing algorithms assume a very specific type of quantum computing hardware, namely adiabatic computers, (e.g. D-Wave) to solve specific learning tasks. Adiabatic quantum computers can approximately solve computationally hard (i.e. ‘NP-hard’) problems28 including approximately solving combinatorial optimization problems. The main class of problems encountered in this review relates to quadratic unconstrained binary optimization (QUBO). Examples of QUBO optimization problems include regression, classification, and data compression tasks. Classical, quantum and hybrid annealers can all approximately solve QUBO optimization problems29, or be used to draw samples from particular types of probability distributions (e.g. Boltzmann distributions)30. While more general forms of adiabatic quantum computing than annealing techniques do exist, we did not encounter any within our included literature, and for this reason have not included a discussion of this form of learning.

Quantum algorithms for QUBO formulations have provable advantage over classical counterparts in some regimes. Quantum QUBO algorithms for optimizing support vector machines (SVMs) and balanced k-means clustering have better computational complexities compared to classical counterparts, while quantum algorithms for QUBO formulations of regression have equivalent computational complexity to classical algorithms28. For this limited class of problems, quantum adiabatic computers, such as D-Wave 2X processors, can access ≈ 1000 qubits, which is an order of magnitude larger than O(100) qubit processors for universal non-annealing quantum computers developed by IBM and Google. We also note that it is possible to realize quantum annealing tasks on gate-based quantum computers, e.g. ref. 31, and therefore our classification represents one choice of a non-exclusive method for framing the discussion of QML algorithms.

So far we have introduced quantum machine learning algorithms in generality without reference to the dataset under consideration. However, characteristics of classical data and the representation of this data in a quantum algorithm can affect potential attainability of computational advantage in solving inference tasks32,33. Data encoding describes the process of representing classical data as quantum states, such as the choice of a data encoder, U(x), in Fig. 1. Data encoding is required for both annealing and non-annealing quantum algorithms. Ideally, data encoders must be efficient in computational complexity in both circuit size (number of qubits) and circuit depth (number of parallel operations). There are a number of ways to embed classical data x in quantum states, as summarized in Table 1. For continuous variable inputs, one may use binary representation of data to finite precision τ and encode using discrete methods such as basis encoding, as reported in Table 1. The growth of the number of computations required for encoding is mathematically expressed in \({\mathcal{O}}(g(n))\)-notation to express an upper bound g(n) on the number of operations as the argument n goes to infinity, ignoring constant multiplicative or additive factors. As an example from Table 1, angle-encoding can be prepared in constant depth but scales linearly with number of qubits. The trade-off is switched for amplitude encoding, which in general scales linearly with runtime and logarithmically with qubit number.

Table 1 Comparison of data encoding strategies

Hardware-specific considerations can change implementation details of a quantum algorithm. The decomposition of required quantum operations to the native set of quantum gates available on hardware may change the number of operations, e.g. replacing one 2-qubit gate with a decomposition involving several single and 2-qubit gates. Similarly, hardware implementation of any continuous variable often also incurs finite precision. In most cases, these changes are multiplicative or additive with problem size. These multiplicative or additive changes do not affect the overall asymptotic scaling behavior of the encoder. Some data encoders are not intended as a near-term, implementable strategy. For example, quantum random access memories (QRAM)34 use an additional m = O(dτ) ancillary qubits to randomly access superpositions of basis-encoded states in favorable logarithmic \(O(\log (m))\) time. However, robust QRAMs remain extremely challenging to implement on hardware35. Finally, the parallel unary encoder assumes specific hardware capabilities that affect the complexity of data encoding, and we return to this point in the Discussion.

Distinct from the choice of data encoding strategy, data pre-processing is concerned with using classical techniques to clean up, rescale, compress, or transform data. Since data encoding is expensive in quantum resources, and may impact performance, raw data is pre-processed before encoding data in quantum states. Data pre-processing can have many goals, e.g. to compress raw data, identify key features, or address missing values. For most near-term demonstrations of QML, it is well known that dimensionality reduction of classical datasets is often required to encode data into small or intermediate scale quantum circuits. However, the potential impact of data pre-processing on comparisons of quantum vs. classical algorithm performance is not addressed in literature.

We now turn to presenting our key results and the methodology for our review. The structure of this document is as follows. We begin by conducting meta-analysis and synthesizing the empirical evidence for all eligible studies in Results. We comment on the extent to which this evidence base addresses our research questions, and discuss limitations and future outlook in Discussion. Details of our systematic review methodology, including a study quality assessment framework and comprehensive search and screening criteria, is outlined in Methods.

Results

Results are presented in two stages. Firstly, we depict results of the screening process and the study quality appraisal, which has led to a focus on 16 studies for final synthesis. Secondly, we summarize synthesized evidence and discuss the extent to which our original research question is addressed.

Characterization of synthesized studies

Our systematic review is summarized by a PRISMA diagram in Fig. 2a. Our searches identified 4915 distinct studies. A total of 313 studies passed title and abstract screening and went through full-text screening, of which 169 met eligibility criteria. Inter-rater reliability as measured by Cohen’s kappa was substantial for title and abstract screening (0.72) and moderate for full text screening (0.48). According to the distribution of exclusion reasons shown in Fig. 2b, the most frequent cause for full-text exclusion during screening was distinguishing between genuinely quantum algorithms designed to run on quantum hardware, vs. classical computation invoking ideas, insights or jargon from quantum physics. If the health setting of physics-centric QML studies is not explicit (e.g. for instance36), then these studies will not be returned in search nor pass title and abstract screening for any systematic review.

Fig. 2: PRISMA diagram.
figure 2

PRISMA diagram extracted from Covidence60. a Overview of systematic review from screening to synthesis. Of 4915 studies returned from search in 5 databases, 93.6% of abstracts are deemed ineligible based on digital health setting and quantum intervention criteria. During full text review, 144 studies are excluded, and exclusion reasons are given in (b). Data is extracted (red box) for 169 eligible studies; additional quality assessment and selection criteria is applied to yield a total of 16 studies for final synthesis. b Frequency count of exclusion reasons for full text review.

To address issues of technical rigor, we approach data extraction in two steps: first, the application of the quality assessment criteria, and secondly, narrowing the focus to studies that investigated realistic operating conditions either via noisy simulations or by testing algorithms on real quantum hardware. The distribution of quality scores after consensus is reported in Table 2. Only 6 of 169 studies led to non-trivial and unresolvable differences in scoring criteria between two independent reviewers, indicating 96.4% consensus rate for quality assessment scoring. Borderline studies arise when a quality score for two of the following three concerns remains unresolved by reviewers: insufficient performance analysis of classical pre-processing before data input to quantum algorithm, insufficient performance analysis of scalability using qubit numbers > O(1), and/or insufficient performance analysis of choice of data encoding strategy.

Table 2 Study quality assessment criteria and scoring

The resulting metadata for extracted (synthesized) studies is shown in the top (bottom) row of Fig. 3. Of all eligible studies in Fig. 3 (top row), 138 of 169 (81.7%) use only simulations of quantum machine learning applications for digital health without testing on hardware. Where simulations are the only evidence base in a study, only 7 out of 138 studies use some form of noisy simulations, while the remaining 131 studies use only ideal simulations. When restricting to synthesized studies in Fig. 3 (bottom row), a greater proportion of studies do appear to test quantum algorithms on actual quantum hardware (refer Fig. 3b vs. (f)).

Fig. 3: High-level characteristics of eligible and synthesized studies.
figure 3

Data extraction for eligible studies in Fig. 2, with n = 169, 16 for number of studies extracted (ad) or synthesized (eh), respectively. Synthesized studies exclude eligible studies that do not meet our quality criteria or rely solely on ideal simulations. a, e Histogram of number of studies by publication year; eligible studies show rapid increase in number by year in (a) while synthesized studies appear more uniformly distributed in (e). b, f Percentage of studies that mentioned using quantum hardware using either health or non-health datasets. ‘None’ in (b) indicates usage of both ideal and noisy simulations, while ‘None’ in (f) indicates only noisy simulations of quantum computers. c, g Percentage of studies summarized by digital health setting, broadly characterized as aiding clinical diagnosis (‘Diagnosis’), enabling predictive health (‘Predictive’), and ‘Other’. d, h Number of studies providing open access to software code and input datasets. The absence of code/data availability statements, statements ‘upon reasonable request’, broken links or unusable repositories are all categorized as ‘No’ above.

The growth in the number of eligible articles on quantum algorithms for digital health seems almost exponential in Fig. 3a. These applications are broadly categorized into diagnosis, predictive health, and ‘other’ in Fig. 3c. ‘Diagnosis’ refers to an application that identifies, characterizes or labels current health data with the aim of supporting a clinical diagnosis e.g. classification of medical images or time varying signals. ‘Predictive’ refers to an application that predicts future health information based on current health data, not necessarily to support formulation of a clinical diagnosis e.g. predicting drug efficacy or disease/risk factors. All remaining applications are grouped under ‘other` e.g. generating synthetic ECG signals based on EHR/EMR data. In Fig. 3d, we observe that the majority of studies did not enable code and data accessibility which are typically both required to enable tests of reproducibility. For the final set of synthesized studies corresponding to Fig. 3h, all available datasets and code are summarized in Supplementary Note 3.

Empirical evidence from synthesized studies

Nearly all of our synthesized studies were concerned with a learning task of performing a clinical diagnosis or a clinical prediction based on classical datasets. From a clinical perspective, all studies rationalized quantum algorithm design by citing other empirical literature. Any empirical rationale for the choice of quantum intervention did not necessarily refer back to comparable clinical settings: in most cases, it appeared that the matching between health datasets and quantum interventions was either ad-hoc, or one tried all possible quantum interventions in order to empirically discover the best performing models for a fixed health dataset. No QML applications were focussed on health service delivery, public health, and consumer health monitoring applications. Only one study, Qu (2023)37, focussed on health-data analytics applications, namely, that QGANs may be beneficial in ameliorating issues of model collapse for synthetic data generation in digital health applications, but these are untested at scale in both simulations and quantum hardware. No studies were related to improving efficacy or efficiency of health service delivery, e.g. optimization problems for patient flow, or operational cost-down in hospitals.

We find that quantum kernel methods and quantum annealing techniques dominate our synthesized evidence in Fig. 4 (top). The choice of quantum intervention typically then informs the choice of classical comparator (middle) within each study, and hence the distributions of quantum and classical algorithms are correlated. Finally we note that datasets (bottom) are not particularly clinically diverse and factorize into private and open-source datasets. While EHR and hospital data are often private, the remaining datasets are all open-source. Most empirical evidence does not use electronic health records, but gravitates to a handful of open-source health databases. We thus find that the diversity of applications investigated empirically is limited.

Fig. 4: Overview of quantum interventions, classical comparators and datasets for synthesized evidence.
figure 4

Distribution of quantum interventions, classical comparators and health data setting for all 16 synthesized studies (columns), including studies with borderline or weak consensus (shaded gray columns). For each study, we indicate primary quantum interventions (top), classical comparators (middle), and digital health datasets (bottom). Quantum interventions deployed on non-digital health datasets are excluded. Interventions or comparators (rows) combine similar multiple-choice data extraction entries. QNNs exclude QCNNs and no VQCs / PQCs were found to contain mid-circuit measurements or adaptive gates. All sixteen studies, including any available codebases and datasets, are compiled in Supplementary Note 3.

In Fig. 5, we report performance metrics from synthesized studies comparing quantum interventions with classical machine learning counterparts for different digital health applications. The choice of quantum algorithms again separates into two groups: annealing vs. gate-based techniques. Indeed, quantum annealing studies focus on digital health tasks that can be mapped to a QUBO problem, and are able to scale to problem sizes at least an order of magnitude larger in qubit number than non-annealing quantum hardware. On the other hand, gate-based non-annealing quantum hardware accommodates a broader range of QML algorithms, as shown by the remaining rows in Fig. 5, and a broader range of hardware platforms, such as trapped ions (IonQ) and superconducting qubits (Rigetti, IBM). However, hardware experiments in many instances are almost outdated e.g. IBM quantum processors (56.2% of synthesized studies) are Falcon models or older despite the availability of processors with 100+ qubits since 2022.

Fig. 5: Summary of quantum vs. classical performance metrics for synthesized evidence.
figure 5

Selected test scores (accuracy, AUC and squared error [0, 1]) reported for quantum interventions deployed under realistic operating conditions using health data. Ranges reflected min and max test scores attained over different experimental configurations reported in each study; error bars for each test score are shown if originally reported. Synthesized studies with strong (weak) consensus are shaded in green (gray). Best possible scores for accuracy, fidelity and AUC metrics (unity) differ from squared error metrics (zero). Performance benchmarks that cannot be easily compared or interpreted alongside other metrics, or are not consistently reported, e.g. F1 scores and precision, are excluded for readability. NA Not applicable. NR Not reported. Est. Estimates of numerical values implied from visual/graphical data.

We compare quantum vs. classical machine learning by reporting selected metrics from sixteen synthesized studies in the remaining columns of Fig. 5. Of these, five studies do not provide sufficient information to address our review question. Das (2023)38, Kawaguchi (2023)39 and Qu (2023)37 do not report numerical metrics for quantum experiments, while Yano (2020)40 and Landman (2022)41 report numerical quantum benchmarks but do not report a numerical classical comparator. The remaining 11 studies reported in Fig. 5 reveal major issues in facilitating the comparison between quantum vs. classical interventions. There are three scientifically concerning flaws:

  1. 1.

    No empirical evidence of performance scaling: All quantum computing demonstrations, even in simulations, have not been carried out at scale. Leaving aside the issue of quantum advantage for classical datasets, empirical investigations on universal, gate-based quantum computers have not investigated performance as a function of increasing problem size or qubit number e.g. to O(100) qubits. Even for these small-scale experiments on universal, gate-based platforms, only Krunic (2022)19 plotted trend lines of performance vs. problem size / qubit number to establish empirical scaling behavior. In all other cases, including annealing applications, algorithmic performance scaling was not established in ideal or noisy simulations or prior to running on quantum hardware.

  2. 2.

    Limited reporting of statistical uncertainties: All studies provided limited or no discussion of how statistical fluctuations in test scores should be interpreted. Only Kazdaghli (2024) estimates and reports sample error bars for test score values42, while Krunic (2022) proposed a technique to contextualize fluctuations in performance score to the underlying configuration of experiments using PTRI metrics19 but in lieu of uncertainty analysis. In the absence of error bars, any differences in classical vs. quantum performance appeared to be statistically equivalent fluctuations for a range of configurations.

  3. 3.

    Lack of noise characterization and impact of quantum hardware: Most studies recognize the significantly large deterioration between ideal and actual quantum hardware performance due to the effect of noise. Despite this, studies compared quantum hardware performance mostly only with ideal simulations, rather than using noisy simulations or secondary data to provide insight into algorithm performance on hardware. Only two synthesized studies38,43 used noisy simulations to compare to hardware results in their analysis. When running algorithms on quantum hardware, only two studies38,44 explicitly considered error mitigation. Of these, only one study used an application-agnostic error mitigation technique and distinguished between raw vs. mitigated results to contextualize the impact of noise44. All studies failed to take data to characterize performance of the underlying quantum hardware while running QML experiments. Consequently, these studies offer almost no insight into whether fluctuations in classical vs. quantum QML performance are entirely dominated by drift in performance of underlying quantum hardware.

In summary of evidence presented in all synthesized studies, we deem that the performance differentials between quantum and classical machine learning metrics for digital health reported in Fig. 5 are negligible. Not only is empirical evidence difficult to synthesize and interpret, but the tabulated performance scores show no clear, consistent, statistically significant trend to support any empirical claims of quantum utility in digital health across a range of hardware platforms.

Discussion

We have discussed until this point why a meta-analysis of empirical evidence in synthesized studies is insufficient for claiming empirical quantum utility for quantum machine learning in digital health. This absence of empirical evidence may be understandable in a relatively new field where applications development may temporally lag new insights in quantum machine learning theory and new hardware capabilities. We now consider these observations on research methodology and themes below.

Even in a discipline that must rely on heuristics and empirical investigations, the majority of studies claim empirical quantum advantage but do not take into account realistic operating conditions in their analysis. The absence of noise characterization or noisy simulations to explain deviations of quantum hardware experiments from idealized conditions is particularly surprising. In 14 out of 16 studies, hardware results were compared to ideal simulations without any noise characterization or the use of noise simulations to contextualize results. Of the two studies that used noise simulations, these simulations were limited to simple noise models. For example, in Qu (2023)37, QGANs are used for synthetic heartbeat data generation. Ideal QGANs converged to accuracies ranging from 87.7% – 90.9% for different types of heartbeat data. Standard noise simulations of bit-flip, phase-flip, amplitude damping and depolarizing noise at moderately strong levels reduced the range of accuracies to ≈ 75% – 90%, where each noise model is individually considered. However, in realistic settings, these noise models are inadequate—at the very least, requiring a mixture of different error types. While similar noise simulations were used as evidence to show noise-robustness of QML methods, the limited nature of these noise models would not reflect realistic operating conditions.

Indeed the field appears to lack empirical comparisons of quantum annealing vs. gate-based QML in regimes where quantum annealing is anticipated to have provable quantum advantage, e.g. for specific learning tasks such as binary classification. Four studies formulated learning tasks as QUBO problems and used a quantum annealer. Two of these studies focussed on classification tasks using a support vector machine29,30, which could be easily compared with a universal gate-based computer. The remaining two studies focussed on areas such as linear regression45 and data compression46 for which there does not appear to be provable computational advantage for quantum annealing. Since D-Wave architectures have been available for some time before newer quantum processors, two of these four studies represent our oldest publications dating back to 2018. All annealing studies are also subject to study considerations above, and our review did not find strong evidence of quantum annealers outperforming either newer gate-based, universal quantum computers or classical counterparts.

Only one synthesized study used electronic health records as opposed to generic digital health data. In Krunic (2022)19, electronic health records were used to perform kernel-based prediction of six-month persistence of rheumatoid arthritis patients on biologic therapies. Both quantum and classical kernels were compared for different configurations of number of features and number of samples of training data. The study offers weak evidence of empirical quantum advantage when the configuration space is restricted to small dimensional datasets with a low number of features. Aside from directly using electronic health records, Kazdaghli (2024)42 focussed on using quantum interventions for data imputation in clinical data, applicable to the analysis of electronic health records, but also other types of clinical data, such as those used in clinical trials. Meanwhile, some eligible but not synthesized studies discussed the use of quantum algorithms for securely pooling health data in federated learning applications47.

Some of the quantum algorithms encountered in this review cited significant improvements in data-encoding compared to typical approaches outlined in the Introduction. Efficient image processing tasks are pursued using quantum transformers in Cherrat (2024)48 and Landman (2022)41, while data imputation is pursued in Kazdaghli (2024)42. While these applications in digital health differ, the underlying technologies in Landman (2022)41, Cherrat (2024)48, and Kazdaghli (2024)42 all rely on methods in ref. 49 and appear to inherit favorable resource scaling from assuming specific hardware capabilities that do not exist generally. The underlying data encoders assume hardware can implement entangling gates on overlapping sets of qubits in parallel (as opposed to sequentially). This hardware capability is so-far only shown for small-scale trapped ions50 and it is not expected to scale to systems with large qubit number.

Meanwhile health data consists of both continuous and discrete data and Yano (2020)40 fills an existing gap in literature by looking at encoding of discrete variable data into quantum VQCs using Quantum Random Access Coding (QRAC). The authors argue that \(O({\log }_{2}(d\tau ))/2\) improvement in circuit size complexity can be attained for discrete variable inputs, suggesting a two-fold improvement over amplitude scaling in Table 1. Nevertheless, the critical challenge for amplitude encoding strategies in QML is that linear runtime complexity can prohibit accessing super-polynomial advantage and this barrier is not addressed by the paper.

Nearly all quantum algorithms were linear quantum models. Some theoretical evidence shows that linear quantum models will require exponentially more qubits than non-linear models27, and heuristic evidence shows that certain types of linear quantum models will not be useful for the analysis of classical datasets17. Even broadening to a larger pool of 169 eligible studies, non-linear quantum models were not encountered. Of our synthesized studies, seven studies used linear quantum kernel methods including Moradi (2022, 2023)43,44, Yano (2020)40, Aswiga (2024)51, Krunic (2022)19 and Kawaguchi (2023)39. For non-kernel methods, the underlying technologies for Nirula (2021)52, Qu (2023)37 and Das (2023)38 can be recast in linear form. Finally, the quantum transformers and data encoding strategies that yield favorable scaling properties in Cherrat (2024)48, Landman (2022)41, and Kazdaghli (2024)42 use methods developed in ref. 49. Aside from a variant proposed in Cherrat (2024), the data encoders and neural networks leveraged by these studies all appear to be described by the framework of linear quantum models. Indeed, the observed absence of clear, consistent performance trends in the empirical meta-analysis of the previous section could in part be explained by the underlying linear quantum models used for many of the studies. After publishing our pre-print, we were made aware of ref. 53 as an improvement of quantum methods in Landman 2022 and Cherrat 2024, consisting of a non-linear model in Fig. 1d. While ref. 53 fails our inclusion criteria, even its inclusion would not affect the overall conclusions of our review.

Despite the fact that all quantum models were trained by a supervised learning problem, no study explicitly characterized their optimization landscape. It is well known that optimization of supervised QML algorithms can be plagued by exponentially vanishing gradients (barren plateaus)54, exponential concentration of kernel values55, or exponentially concentrated local minima56. However only two out of sixteen synthesized studies mentioned optimization challenges associated with their proposed methods for supervised quantum machine learning. Here, Cherrat (2024)48 and Landman (2022)41 stated that their proposed QML methods’ structures may avoid barren plateaus. All studies failed to provide a systematic characterization of their empirical optimization landscape, and the resources utilized by their chosen optimization protocol in practice. Meanwhile no substantial improvements are found in reducing shot number requirements for QML applications considered in this review.

Finally, classical data preprocessing tasks are highly discretionary and impact on QML is poorly understood. There are two areas where data preprocessing is frequently used in QML: feature selection for kernel methods, and dimensionality reduction for data encoding. In feature selection, both the number of features43, and statistical significance of features were established using statistical tests44 to aid kernel design. Meanwhile, dimensionality reduction is required to encode data on quantum hardware with limited qubit numbers, e.g. by cropping, PCA or LDA. However, the impact of dimensionality reduction on QML performance is unaddressed. For example, reducing images to 2n length, where the number of qubits n is small, risks creating duplication in training and testing datasets if two different full-sized images become identical after dimensionality reduction. Other preprocessing tasks include re-scaling, using statistical summaries, or transforming data, e.g. using Haralick features57 or Fourier methods, but there has been no characterization of the impact of these methods on investigations of empirical quantum advantage.

Our review highlights that the language of quantum advantage, empirical quantum utility, speed-up, or resource efficiencies are poorly defined and frequently abused notions in literature. QML applications development could benefit from guidelines on what robust quantum vs. classical comparisons look like. Even leaving aside the issue of how to select the best classical comparator, comparing computational cost improvements enabled by quantum algorithms can be a difficult task. As discussed in the Introduction, computational costs are theoretically quantified by sample complexity number of queries or time complexity (number of sequential operations). As examples, for sample complexity, one must ensure that information contained in each query or sample must be comparable across algorithms. Meanwhile for time complexity, operations contain assumptions about hardware capabilities and these assumptions are not always explicitly stated nor consistent. As elucidated by our discussion on quantum transformers in our review, studies assume that groups of quantum operations can be parallelized. This assumption is not hardware-agnostic nor scale-agnostic: certain quantum operations may be parallelized in certain small-scale architectures but not in others. A comprehensive review on the approaches for benchmarking quantum performance is thus of immediate urgency and interest.

In the absence of theoretical assurances on complexity, we have seen in Fig. 5 how empirical studies use performance metrics such as fidelity or accuracy to argue for the ‘utility’ of quantum algorithms in information processing tasks. We have discussed how arguments of empirical utility or advantage must demonstrate both scalability and robustness of performance. Since approximately simulating 100 qubits can be within reach of classical computers, we argue that characterizing properties of QML algorithms as a function of system size is more important than reporting any single figure of merit at some arbitrary choice of system size. Secondly, relying solely on ideal simulations offers no insight into robustness, and one simple test of robustness is to understand and mitigate the impact of noise. Thirdly, we find that the choice of performance metrics in empirical studies is diverse, often ad-hoc, and limits how to perform meta-analysis of evidence in the field.

We summarize these and other considerations to specify minimum requirements for the robust analysis of quantum algorithms on classical datasets. Our proposed qualitative framework is presented in Table 3. This framework is complementary to the quality analysis framework used in this review. In our framework, the minimum requirements outlined in Table 3 (column 3) can be immediately met. However, many of the ideal requirements in Table 3 (column 4) may require new research due to the nascency of the field. Indeed, the field has made some progress towards these ideal requirements: challenges of technical reproducibility in QML are discussed in ref. 16 and quantum algorithm performance has been linked closely with hardware benchmarking in ref. 58. However, research gaps continue to exist, for example, in the lack of a principled approach to link quantum algorithms to structures in classical data, to select appropriate performance metrics, to use noise characterization tools to set performance expectations and benchmark algorithm performance on hardware, and/or to compare quantum methods with a repository of best-in-class classical benchmarks for industry subdomains. Further development and testing of the proposed framework is both urgent and important to urge better empirical evidence in our nascent field. Finally, we note that our review methodology is future-proof: by changing only the search period, we may provide a systematic update on the quality of research evidence for QML in digital health.

Table 3 Proposed framework for QML study design

To conclude, digital health aims to transform access, affordability and quality of healthcare. As classical machine learning methods in health approach commercialization, we find an exponentially growing number of studies advocating the use of QML in health. Our work is the first systematic review that examines the strength of empirical evidence to support these claims using a database of 4915 studies. We find most applications are focussed on clinical decision support and comparatively little attention is given to health service delivery and public health use-cases. Of eligible studies, we appraise study quality yielding 16 robust studies which analyze QML applications in realistic operating environments. Despite this, we find that synthesized empirical evidence does not establish clear trends in performance benefits of QML algorithms over classical methods. Even leaving aside the issue of classical comparators, this synthesized evidence additionally does not establish scalability or robustness of QML performance. To this end, we propose minimal requirements for empirical studies for claiming empirical advantage for QML algorithms. We reiterate that enabling meaningful use-case discovery for QML in digital health requires new research to rationalize the choice of QML structures on classical datasets, define appropriate benchmarks, and establish performance scaling under realistic operating conditions. An update to the search period of our review enables us to systematically track changes in this evidence base in the near future.

Methods

Our systematic review is registered on PROSPERO (ID: CRD42024562024)59. Screening and data extraction were performed in Covidence60. Commonly used nomenclature encountered in this review is summarized in Supplementary Note 4.

Search strategy

Our search strategy is formed by decomposing our research question into elements of the SPICE framework23, as summarized in Table 4. Only articles published after 2015 were included, as the first commercially-available quantum computer was made accessible in 201661 and digitization of health information into electronic records1 is relatively recent. Hence both factors prohibit meaningful applications development prior to this date. Search syntax was refined by trial and error on PubMed (Table 4) in consultation with a health research librarian, and adapted to other databases (Embase, Scopus, arXiv and IEEE, refer Supplementary Note 5). Key articles were identified as litmus tests to sense check database-specific search term strategies. Searches were conducted from 10 May to 10 June 2024.

Table 4 Global search strategy for all databases

Inclusion/exclusion criteria

The eligibility criteria for the screening process is summarized in Table 5. Our study setting prioritized digital health data sources that consist of electronic medical records (EMRs) or electronic health records (EHRs). EMRs represent a real-time patient health record that collects, stores, and displays clinical information as the foundation of a digital hospital as opposed to an EHR which displays summarized patient information to the consumer in the community and across multiple health care providers. The terms “EMR” and “EHR” may be used interchangeably in some countries. Since health data is subject to strict privacy and security legislation, we also consider data that could be reasonably considered to be in an EHR or EMR, thereby permitting the inclusion of open source and published health datasets that are typically used for proof-of-principle results in both classical and quantum ML. While EHR/EMR data typically includes medical imaging, laboratory data, time-varying signals and patient information, we also include genomics data and biomarkers when used in a context where they supplement a patient’s EHR or EMR for diagnosis or predictive health applications. A notable exclusion is textual search or analysis of digital or handwritten clinical notes, as these would imply looking at an entirely different class of algorithms that have little or no overlap with unstructured data analysis of non-textual health datasets listed above.

Table 5 Eligibility criteria for screening

Our criteria also prioritized QML algorithms that were genuinely intended to be run on quantum computing hardware, and at least aspired to demonstrate some kind of advantageous scaling property as the number of qubits is increased. In Table 5, we list the sheer number of algorithms that are classical computations with a nominal usage of the word ‘quantum’. This list was added to throughout screening as new terms were encountered. Many studies technically obfuscated the distinction between QML algorithms and classical computations that use quantum mechanical theory or other insights. For instance, in medical imaging we exclude quantum mechanical corrections to classical algorithms which help to reduce noise in reconstructing images from raw sensor data. Finally, we exclude quantum algorithms unlikely to arise in the context of analyzing classical digital health data, for example: quantum sensing, quantum cryptography, and quantum algorithms for genome pattern matching, genomic sequence alignment, or molecular and chemistry simulations.

Screening

Two independent researchers conducted title and abstract screening of all search results: one reviewer had a health background, while the other had a physics background. Full text review was performed by a total of three reviewers. For consistency, one reviewer participated for all screening stages including both abstract and full text screening. Conflicts were resolved through internal discussion or by involving a third reviewer’s opinion.

Data extraction and study quality appraisal

Study characteristics were extracted for all included studies. Additionally, a study quality appraisal was performed to form consensus-based decisions about including or excluding particular studies based on robustness62. These appraisals are typically implemented during data extraction and prior to narrative synthesis62. Our study quality assessment criteria analyses the rigor with which QML algorithms were investigated16 and we do not include a myriad of other potential benchmarks, e.g. for clinical robustness. At least two reviewers independently scored eligible studies, and the maximum score over both reviewers was selected during consensus formation. Attributes of low vs. high quality studies with were compared with respect to our criteria. Full data extraction template is enclosed in Supplementary Note 6 and extracted data as well as underlying analysis code for data extraction is available online63.