Introduction

With the adoption of the Digital Healthcare Act (DVG)3 in 2019, Germany became the first country in the world to systematically integrate digital health applications (DiGAs) into its statutory healthcare system, aiming to improve patient care and foster innovation. Since 2020, approved DiGAs can be prescribed to patients as a form of treatment, with the full cost reimbursed by their statutory health insurance. Unlike lifestyle apps, approved DiGAs are classified as medical devices, categorized as risk class I or IIa according to the Medical Device Regulation (MDR) or the Medical Device Directive (MDD); the latter was applicable under transitional regulations until the MDR came into full effect on May 26, 2021. With the introduction of the Digital Act (DigiG) in 2023, the scope was expanded to also allow certain medical devices that are categorized as risk class IIb to qualify as DiGA, provided that they are used in combination with remote medical supervision1.

DiGAs are specifically designed to support patient care in recognizing, monitoring, treating, or alleviating diseases, injuries, and disabilities2. The approval process for DiGAs is entirely new: applications undergo strict review by the Federal Institute for Drugs and Medical Devices (BfArM). After verifying that a DiGA fulfills basic approval requirements, such as safety, functionality, data protection, information security, and interoperability, the BfArM assesses whether the manufacturer provides sufficient evidence for so-called positive healthcare effects and grants either provisional or permanent approval. Provisional approval is granted for DiGAs of risk class I or IIa if preliminary evidence of positive healthcare effects is provided. The concept of positive healthcare effects was introduced by the Digital Healthcare Act (DVG)3, which categorized these as either medical benefits or patient-relevant improvements of structures and processes. Medical benefits encompass four categories of effects focused on improvements in a patient’s condition: improvement of health status, improvement of quality of life, reduction of disease duration, and prolongation of survival. Patient-relevant improvements of structures and processes encompass nine categories of effects focused on supporting the health behavior of patients or integrating processes between patients and healthcare providers. These categories are adherence, facilitating access to care, patient safety, health literacy, patient autonomy, coping with illness-related difficulties in everyday life, reduction of therapy-related expenses and burdens for patients and their relatives, alignment of treatment with guidelines and recognized standards, and coordination of treatment procedures2.

During a 12-month period of provisional approval, the manufacturer is obliged to provide further evidence of positive healthcare effects; otherwise, the approval is withdrawn. Under specific circumstances, the 12-month period may be extended by decision of the BfArM. If a manufacturer provides sufficient evidence of positive healthcare effects, permanent approval of the digital health application as a DiGA is granted at the discretion of the BfArM. To be deemed sufficient evidence, approval studies must meet specific criteria outlined in the BfArM’s fast-track guidelines2. At the time of this research, and to date, the version of the fast-track guidelines dated December 28, 2023, has been in effect. According to these guidelines, the criteria to be met by approval studies include: at minimum a retrospective comparative study design; a sufficient sample size (not further specified); reporting of rates and reasons for drop-outs; conduct in Germany; and a comparison group that reflects the reality of healthcare provision2. After being granted permanent approval, manufacturers have 12 months to make their approval studies publicly available.

Thus, the DiGA approval procedure differs significantly from the established HTA process for assessing the benefits and harms of medicinal products in Germany, which is regulated by the German Medicines Market Reorganization Act (AMNOG)4. Under AMNOG, drugs that have demonstrated efficacy and safety, i.e., a positive risk-benefit ratio, can be prescribed immediately upon market entry but must subsequently undergo an additional benefit assessment. This assessment compares the new treatment with a standard therapy, a so-called appropriate comparative therapy; both this comparator and the outcomes against which the added benefit is assessed are predefined by the Federal Joint Committee (G-BA). The process lasts six months and forms the basis for price negotiations, with the final price reflecting the assessed additional benefit. In contrast, the DiGA procedure does not involve the predefinition of appropriate comparative therapies or outcomes by authorities. Moreover, the reimbursement price for DiGAs is not linked to the extent of additional benefit demonstrated. What both procedures share is the goal of making innovative and safe health interventions with a positive risk-benefit ratio rapidly accessible to patients.

Since the DiGA approval process is entirely new, there is ongoing discussion about the methodological quality of approval studies and the positive healthcare effects they demonstrate.

These discussions take place against the backdrop of a constantly evolving DiGA landscape, with new DiGAs continuously receiving provisional or permanent approval, while others are removed. As of June 16, 2025, the DiGA directory5 lists 44 DiGAs with permanent and 14 DiGAs with provisional approval. Another 11 DiGAs had been delisted: six because no evidence of a positive healthcare effect could be provided, one because the clinical study for approval could not be completed, and four at the request of the manufacturer5. Since their introduction, the number of DiGA prescriptions has steadily increased, with over 1 million prescriptions issued by the end of 20246. Approximately 81% of these were activated by patients, leading to cumulative expenditures of €234 million by the statutory health insurance funds between September 1, 2020, and December 31, 20246. In 2024, the average price per DiGA prescription was €541, with prices per DiGA varying significantly, ranging from €119 to €2077 per use, which is typically limited to 90 days but in some cases extends up to 365 days6.

Despite these developments, knowledge about the overall risk of bias (RoB) of approval studies remains limited, as the BfArM does not provide public information on its evaluation of approval studies. Thus, it is unknown whether, how, and with which tools RoB assessments of approval studies are carried out by the BfArM as part of the DiGA approval process.

A comprehensive overview of the available published evidence for permanently approved DiGAs, therefore, appears both timely and necessary.

So far, two systematic reviews have conducted RoB assessments of a limited number of approval studies: Kolominsky-Rabas et al. assessed the approval studies for six permanently approved DiGAs from the categories “psychology” and “nervous system”7. Applying the revised Cochrane RoB tool, they found an overall high RoB7. Lantzsch et al.8 conducted a review and RoB assessment of eleven approval studies for eight of the ten DiGAs that had been permanently approved as of February 2022. Using the same RoB tool as Kolominsky-Rabas et al., they also found an overall high RoB8. However, for the DiGA approval studies included in both systematic reviews, the RoB assessments differed across several RoB domains.

Beyond their RoB assessments, however, neither systematic review gives a comprehensive overview of the characteristics of, and evidence for, the positive healthcare effects measured in DiGA approval studies. Internationally, it is standard procedure to systematically review existing intervention studies on digital therapeutics and to compare their effects and their risk of bias using tools such as the RoB 29,10,11,12,13,14.

Against this background, the two-fold objective of this systematic review was, for all DiGA approval studies that were published by March 15, 2024,

  (1) to provide a comprehensive overview of the characteristics of and evidence for positive healthcare effects, and

  (2) to assess the RoB applying the revised Cochrane RoB tool (RoB 2)15.

Results

Results of the search process

Based on information in the DiGA directory for the 33 DiGAs that were permanently approved as of March 15, 2024, we were able to identify 23 approval studies for 21 DiGAs (Fig. 1). Two studies each were identified for the DiGAs deprexis and Kalmeda. For deprexis, the manufacturer listed two approval studies16,17 in the DiGA directory. For Kalmeda, a clinical study report18 was found, published on the manufacturer’s website in 2022, alongside a journal article19 named as an approval study on the manufacturer’s website in March 2024. Both were included in this review; the clinical study report18 had also been assessed in a previous review8.

Fig. 1: PRISMA flow chart.

The figure outlines the screening and selection process of DiGA approval studies included in the systematic review, following the PRISMA 2020 guidelines.

Approval studies could not be identified for all 33 DiGAs, as manufacturers have 12 months to publish such studies after permanent approval is granted. As of March 15, 2024, this period had not yet expired for 12 permanently approved DiGAs, and their approval studies had not yet been published. Of the 23 identified approval studies for 21 DiGAs, nine were retrieved using the information on author, title, and source provided by the DiGA directory (see Fig. 1). A further nine studies could be retrieved from the manufacturers’ websites, and the remaining five were identified via Google Scholar. Two additional studies were found via manual searches; these were excluded from the review, as it could not be verified that they were official approval studies: their participant numbers, details on outcomes, or study registration numbers did not match the information in the DiGA directory20,21. One identified study was an unpublished draft for a journal article; it was excluded due to its uncitable format and participant numbers that differed from the DiGA directory. In addition, 11 published study protocols22,23,24,25,26,27,28,29,30,31,32 corresponding to approval studies for 11 DiGAs were found via Google Scholar by applying the study registration numbers from DRKS or ClinicalTrials.gov, which are named in the DiGA directory for all approved DiGAs. The two approval studies included in this review for Kalmeda refer to the same study protocol28.

Description of included studies

Table 1 depicts the characteristics of included approval studies. Data in this table are excerpts from the comprehensive data extraction sheet, available as Supplementary Data 1. Table 1 is available as an editable and filterable Excel file in Supplementary Data 2.

Table 1 Study and population characteristics of included studies (ordered by DiGA category, then alphabetically by DiGA name)

The approval studies for all identified DiGAs were conducted between August 2012 and February 2022. For 21 of the approval studies, the study design was a parallel, two-arm randomized controlled trial. One study was a three-arm randomized controlled trial33, and one was a cluster-randomized controlled trial34. The majority of approval studies—13 of 23—included in this systematic review are from the DiGA directory category “psychology.”16,17,33,35,36,37,38,39,40,41,42,43,44 Other DiGA directory categories for which approval studies were identified are “ears,”18,19 “genitals, kidneys and urinary tract,”45 “hormones and metabolism,”46,47,48 “muscles, bones and joints,”34,49 “nervous system,”50 and “other”51. The typical intervention was self-guided, web-based, and included elements from cognitive behavioral therapy (CBT), with content delivered through interactive exercises and multimedia.

In one study38, the intervention group was composed of sub-groups of participants with three different indications. The intervention was tailored in an indication-specific manner, and outcomes were measured separately. Thus, the approval study was counted as three interventions, and information is listed for each group separately in the tables.

Reminders for maintaining motivation and adherence were applied in 11 interventions, and were optional in three others. The intervention duration ranged from 1.5 to 12 months, with a mean duration of 3.8 months. The control group received care-as-usual in 21 studies, with access provided to the DiGA after a mean waiting period of 5.0 months. In two studies, participants of the control group were never given access to the DiGA34,39.

Sample sizes varied widely, with the number of participants ranging from 56 to 1245 and a mean of N = 321 (intervention and control groups combined). The majority of study participants were female, with shares ranging from 47.2% to 90.7% (mean 68.8%) in the intervention groups (excluding the two studies for DiGAs that were designed exclusively for women35 or men45, respectively). The mean age of the intervention groups ranged from 29.4 to 57.4 years (mean 42.1 years). In control groups, the share of female participants ranged from 44.1% to 91.8% (mean 67.3%; again excluding the two approval studies for DiGAs designed exclusively for women35 or men45, respectively). The mean age in the control groups ranged from 28.0 to 57.3 years (mean 41.7 years).

Overview of outcomes and effects

Table 2 shows the primary outcomes, measurement time points, and measurement instruments used in the studies, as well as the effects and effect sizes found (available as an editable and filterable Excel file in Supplementary Data 3). Information on secondary outcomes is contained in the data extraction sheet (see Supplementary Data 1).

Table 2 Primary outcomes and effects of interventions (ordered by DiGA category, then alphabetically by DiGA name)

The 23 approval studies included between one and three primary outcomes, totaling 29 across all studies (mean: 1.2; median: 1). In addition, the studies examined between 0 and 13 secondary outcomes (total: 120; mean: 4.8; median: 4) and 0–6 other outcomes (total: 25; mean: 1; median: 0). The categorization of outcomes as “primary”, “secondary” or “other” (including “explorative”/“additional”/“other”) was extracted from the studies themselves. In counting outcomes, the study by Zurowski et al.38 was recorded three times, as the intervention was tailored to three distinct indications of participants in the intervention group, and these three interventions were assessed separately.

The 29 primary outcomes, which are decisive for the permanent approval of a DiGA, were assigned to different outcome domains within either the “medical benefit” or “patient-relevant improvements of structures and processes” categories. As some of the sub-categories are new in terms of proof of benefit and are not comprehensively defined by the BfArM fast-track guidelines, the assignment was performed according to the best knowledge of the authors of this systematic review. Of the 29 primary outcomes, 28 were assigned to the “medical benefit” outcome domain, specifically within the subcategories of “improvement of health” (25) or “improvement of quality of life” (3). One primary outcome was assigned to the “patient-relevant improvements of structures and processes” outcome domain, specifically within the subcategory “improvement of patient autonomy.”45

The choice of the primary outcomes was not explicitly justified in the approval studies. Thus, the implicit rationales were identified for all primary outcomes as part of the review. For 12 primary outcomes, only one rationale was identified; for 14 primary outcomes, two rationales were identified; and for three primary outcomes, three rationales were assigned. Thus, a total of 49 implicit rationales were assigned to the 29 primary outcomes. Implicit rationales identified were classified according to the categories in the methods section: high prevalence of condition (n = 19), high burden or impact on quality of life (n = 16), clinical relevance/core symptom (n = 8), improvement of care (n = 3), and supported by meta-analysis or research evidence (n = 3).

For the assessment of the 29 primary outcomes, 34 outcome measurement instruments were applied. It is striking that, except for one, all outcomes were measured by patient-reported outcome measures (PROMs). The use of these instruments was justified in the studies by their validity, reliability, sensitivity, or their common use.

However, even when approval studies shared a similar primary outcome, they did not necessarily measure it using the same instruments. For example, quality of life was assessed in three approval studies41,44,45 using three different outcome measurement instruments: the Quality of Life Questionnaire for Patients on Long-Term Medication (QoLMed)45, the WHO Well-Being Index-5 (WHO-5)41, and the Short Form Health Survey 12-item version (SF-12)44. While the QoLMed and SF-12 measure broader aspects of quality of life, the WHO-5 specifically focuses on psychological well-being, a key and specific component of quality of life. Primary outcomes were commonly measured directly at the end of the intervention. Drop-outs before the time of post-intervention measurement varied considerably across approval studies, with drop-out rates ranging from 4% to 53.3% (mean 21.7%; median 22%) in intervention groups and from 0% to 47% (mean 11.8%; median 8.6%) in control groups. In most studies, the drop-out rate before the post-intervention measurement was higher in the intervention group than in the control group. It should also be noted that eight studies did not conduct a follow-up measurement36,39,40,41,42,45,48,49.

All approval studies found a statistically significant between-group effect in terms of primary outcome measurement. Approval studies that reported effect sizes according to Cohen’s d disclosed predominantly significant medium or large between-group effects for primary outcomes. The largest effect size for a primary outcome in terms of Cohen’s d was found for the improvement of insomnia severity (d = 1.79; p < 0.001)43, while the smallest was reported for the reduction of alcohol consumption, as measured by the Quantity/Frequency Index (QFI) (d = 0.34; p < 0.001)36.

Several other studies reported effects as relative or absolute changes, such as weight loss in kg or changes in instrument scores. In these cases, we could not assess effect sizes if study authors did not classify them. For example, without knowledge of the literature on obesity, it is not possible to tell whether a weight loss of 2.9 kg over the intervention period would constitute a small, medium, or large effect.
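For orientation, Cohen’s d is the standardized mean difference between groups; assuming the study authors followed the conventional benchmarks from Cohen (1988) when labeling effects, the quantity and its customary reading are:

$$
d = \frac{\bar{x}_{\mathrm{int}} - \bar{x}_{\mathrm{ctrl}}}{s_{\mathrm{pooled}}},
\qquad
s_{\mathrm{pooled}} = \sqrt{\frac{(n_{\mathrm{int}}-1)\,s_{\mathrm{int}}^{2} + (n_{\mathrm{ctrl}}-1)\,s_{\mathrm{ctrl}}^{2}}{n_{\mathrm{int}}+n_{\mathrm{ctrl}}-2}}
$$

with |d| ≈ 0.2 conventionally read as a small, 0.5 as a medium, and 0.8 as a large effect. By these benchmarks, the d = 1.79 reported for insomnia severity is a very large effect, while the d = 0.34 for alcohol consumption falls between small and medium.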

In addition to primary outcomes, the 23 included studies also examined a total of 120 secondary outcomes and 25 other outcomes.

Of the 120 secondary outcomes examined, 95 were assigned to the “medical benefit” outcome domain, specifically to the subcategories “improvement of health” (79), “improvement of quality of life” (15), and “prolongation of survival” (1). A further 12 secondary outcomes were assigned to the “patient-relevant improvements of structures and processes” outcome domain, specifically to the subcategories “adherence” (2), “patient safety” (2), “patient autonomy” (3), “health literacy” (1), “coping with illness-related difficulties in everyday life” (3), and “reduction of therapy-related expenses and burdens for patients and their relatives” (1). The remaining 13 secondary outcomes were categorized as “other” outcomes, for example, “acceptance of diabetes” or “working ability”. For the measurement of the 120 secondary outcomes, 154 outcome measurement instruments were applied. Some instruments, such as the SF-12, were counted as two measurement instruments in cases where, for example, results for mental and physical health were analysed separately. The 25 other outcomes were measured by 29 outcome measurement instruments; in one additional case, the outcome measurement instrument was not specified. Three of these 25 other outcomes were not measured by questionnaires but by freely developed items and free text39. For 68 secondary outcomes and three other outcomes, a statistically significant between-group effect was shown. Of the 25 other outcomes, four were measured only for the intervention group, and 17 were not included in the main analysis, meaning a significant between-group effect at post-assessment could not be assessed.

Risk of bias assessment

The RoB was assessed for all 23 included studies. All studies were evaluated to have an overall high RoB. The RoB results are presented for the 22 randomized parallel-group trials in Fig. 2 and for the cluster-randomized parallel-group trial in Fig. 334. Risks relating to specific domains are presented below. The detailed RoB2 assessment is available in the Supplementary Information.

Fig. 2: Results of the RoB2 assessment for 22 randomized parallel-group trials (ordered by DiGA category, then chronologically).

The figure presents the RoB2 judgments for five bias domains and the overall risk of bias for each of the 22 included randomized parallel-group DiGA approval studies. Risk levels are color-coded as follows: green (low risk), yellow (some concerns), red (high risk). Figure 2 shows the studies ordered by DiGA category from the DiGA directory, as in Tables 1 and 2, then according to the chronological appearance of the study.

Fig. 3: Results of the RoB2 assessment for cluster-randomized parallel-group trials.

The figure presents the RoB2 judgments for five bias domains and the overall risk of bias for the included cluster-randomized parallel-group DiGA approval study. Risk levels are color-coded as follows: green (low risk), yellow (some concerns), red (high risk).

Regarding bias arising from the randomization process, the RoB was rated as low for 10 studies. However, some concerns were identified in 11 trials, and one study was assessed as having a high RoB40. Specifically, 19 trials used adequate randomization methods to generate the allocation sequence, and 11 studies appropriately concealed the allocation. Three studies did not clearly report randomization19,38,48, and 10 studies did not adequately describe allocation concealment methods. One study failed to adequately conceal allocation40.

With regard to bias due to deviations from the intended interventions, the RoB was rated as low for two studies41,45 and with some concerns for 18 studies. Two studies were assessed as having a high RoB18,47. A high RoB was primarily attributed to participants being aware of the intervention or to inadequate reporting of the likely influence of important, non-protocol interventions and the balance of these factors across study groups.

In terms of bias due to missing outcome data, the RoB was assessed as low for eight studies16,36,39,42,43,45,48,51, with the other 14 studies assessed as having a high RoB. High RoB was typically due to drop-out rates exceeding 5%, combined with a lack of analyses to provide evidence that the results were not biased by missing outcome data.

Regarding the bias in measurement of the outcome, the RoB was assessed as high for 21 studies. This was due to the predominant use of PROMs, which implies that outcomes may have been influenced by awareness of the intervention received. For one study47, the RoB was assessed as low because the primary outcome was measured objectively by study personnel during visits through various clinical parameters such as body weight, height, body composition, and BMI, with only the secondary outcome—quality of life—being patient-reported.

Concerning bias in the selection of reported results, the RoB was assessed as low for eight16,18,35,36,37,41,46,51 out of 22 studies. Low bias stemmed from all results of analyses and outcome measurements being reported, and analyses being conducted according to a pre-specified analysis plan outlined in the study protocol22,23,24,25,27,28,29,30. Some concerns persisted for one study39 for which a study protocol was available32, but discrepancies between the protocol and measured outcomes were identified. The RoB was also rated as some concerns for eight further studies, for which results of all analyses and outcome measurements were reported, but no study protocol could be identified17,19,38,43,44,45,49,50. Five studies were assessed as having a high RoB due to selective reporting of outcome measurements, i.e., not reporting results for all measured outcomes, and/or analyses of the data33,40,42,47,48.

In addition to the parallel-group trials, the sole cluster-randomized parallel-group trial included in this review, published by Priebe et al.34, was assessed separately and found to have an overall high RoB (Fig. 3). Methodological flaws were identified in the three RoB domains bias due to deviations from the intended interventions, bias due to missing outcome data, and bias in selection of the reported results.

In Fig. 4, the results of the RoB assessment across all RoB domains are visualized for the 23 included studies. As the figure shows, across all studies, the main sources of high RoB are in the domains bias due to missing outcome data and bias in measurement of the outcome. The domain bias due to deviation from intended interventions is a source of some concerns, while the randomization process was appropriately conducted in the majority of studies.

Fig. 4: Overview of RoB across all included studies and domains.

The figure displays the percentage distribution of RoB2 ratings (low risk, some concerns, high risk) across the five bias domains and overall for all 23 included studies.

Discussion

The objective of this systematic review of DiGA approval studies was two-fold: first, to give an overview of the study characteristics, measurement instruments, outcomes, and positive healthcare effects relating to these tools; and, second, to assess the RoB of studies submitted by manufacturers to the BfArM for permanent approval for their DiGAs.

In total, 23 approval studies for 21 DiGAs published as of March 15, 2024, were included in this review.

The results revealed substantial differences among the studies. This was first of all evident with regard to intervention design and study characteristics, including sample sizes, drop-out rates, measurement times, and intervention durations. These variations between interventions reflected the flexibility granted to manufacturers in designing interventions within the German DiGA approval process. Another source of variability between interventions was in the choice and measurement of primary, secondary, and additional outcomes. Despite this, an overview of the outcomes revealed that all approval studies reported statistically significant and predominantly medium to large between-group effects for their primary outcomes. In contrast, the results for secondary outcomes were more variable.

The RoB assessment showed that the overall RoB was high for all approval studies, albeit with variation across different domains.

Thus, the main conclusions of this systematic review are that DiGA approval studies are difficult to compare and that evidence provided for the positive healthcare effects of DiGAs should be critically evaluated, as results are prone to bias.

An overview of measured outcomes was not conducted in previous reviews of DiGA approval studies7,8. Thus, no comparison of findings is possible. A discussion of the effects reported by approval studies in comparison to previous research is also impossible; first, due to the wide range of outcomes examined by approval studies, and second, because of the novelty and international singularity of DiGA. The comparability with effects found for other mobile health applications is questionable.

In assessing the RoB, our systematic review added to the literature by evaluating several approval studies that had not been included in former reviews. The overall RoB assessment we present is consistent with the two former systematic reviews that examined several DiGA approval studies published before 20227,8. Both reviews also found a high overall RoB across all examined DiGA approval studies, with varying RoB across different domains. Comparing our findings on the underlying methodological issues in the additional studies with the findings of these two previous reviews, it can be concluded that the RoB in newer approval studies does not systematically differ from that of earlier studies. Neither the domain-specific RoB assessments nor the overall RoB ratings showed systematic improvement over time. This lack of improvement may in part be explained by the fact that the methodological requirements outlined in the DiGA fast-track guidelines have remained largely unchanged since their introduction.

Although not specifically required by DiGA guidelines2, all approval studies were conducted using a (cluster-)randomized controlled trial design, considered the gold standard for clinical evidence.

Across approval studies, there were substantial differences in sample sizes and drop-out rates, with very small and very large examples of each. This variation may affect the interpretability and comparability of the study results. However, as emphasized in the DiGA fast-track guidelines, both sample sizes and drop-out rates should only be evaluated in comparison with similar interventions2. Thus, a small sample size or a high drop-out rate is not necessarily unfavorable. In several approval studies, the observed drop-out rates were higher in the intervention than in the control group. Assessing the causes and implications of this phenomenon is more difficult and does not allow for definitive conclusions.

Possible explanations for drop-outs in the intervention group might be that the app does not meet expectations, for example, with regard to an observed positive healthcare effect, or does not align with participants’ daily routines. However, the literature suggests that various factors underlie high drop-out rates in digital interventions. A meta-analysis by Torous et al.52 dealing with drop-out rates in clinical trials of smartphone apps for depressive symptoms found no significant differences in drop-out rates between intervention and waitlist control groups. Torous et al. identify in-app mood monitoring and the provision of human feedback as key factors in reducing attrition rates, while features of intervention design, such as the inclusion of waitlist or placebo controls, clinical vs. non-clinical populations, and therapeutic approaches to depression, had no effect on drop-out rates.

Beyond clinical trials, high drop-out rates in the intervention group also indicate a challenge to translate DiGA into real-world healthcare settings, particularly regarding adherence and utilization following prescription. Reasons for low adherence may include insufficient user support, lack of individualization, usability issues, and poor integration into routine care53,54. Ultimately, low utilization may negatively impact the overall effectiveness of DiGAs.

Some interventions had a relatively short duration of three months or less, and in some cases, no follow-up assessment was conducted. This is in line with the DiGA fast-track guidelines from the BfArM; however, such a short intervention duration and a lack of follow-up are questionable study design elements. In practice, DiGAs are often prescribed multiple times or for a longer duration, as they mostly address chronic diseases. Thus, it may be appropriate to include obligatory analysis of long-term positive healthcare effects of DiGAs as part of the approval process.

Overall, the outlined divergence of approval studies is likely the result of the far-from-strict BfArM DiGA fast-track guidelines, which leave manufacturers a large scope for study design, especially when compared to more strictly regulated processes such as the AMNOG procedure. This leads to low comparability both between DiGA approval studies in general and between DiGAs designed for the same indication.

The reporting of outcomes and the outcome measurement instruments applied by approval studies revealed further marked differences. This is not only attributable to the variety of indications addressed, but also to the fact that approval studies employed different outcomes and measurement instruments even for similar indications. Finally, the diversity of outcomes and measurement instruments makes it difficult to compare approval studies and their reported positive healthcare effects. One way to enhance comparability would be to mandate the use of core outcome sets (COS)55 in approval studies. For example, for the indications depression and anxiety, a COS is available56, which comprises four general treatment outcomes: symptom burden, functioning, disease progression, and treatment sustainability, as well as potential side effects of treatments. This COS could have been applied in the approval studies of DiGA for depression and/or anxiety. However, specific COS for digital interventions have yet to be developed and validated.

Registers might also play a valuable role in enabling comparable, cross-DiGA measurement of predefined outcomes, ideally based on standardized outcome sets. A recent publication by Albrecht et al.57 illustrates this approach by combining overarching and indication-specific outcomes within the DiGAReal registry for rheumatology patients.

As outlined above, the overall RoB for all approval studies was assessed as high, with different underlying causes. One cause was the suboptimal reporting quality of the included studies. Missing information or imprecise wording sometimes means that a RoB cannot be ruled out; for example, if it is not clear whether an allocation sequence in the randomization process was concealed or not, the possibility that it was not must be considered. Related to this, study protocols were often not publicly available, meaning it was not possible to assess whether data analyses were carried out according to a pre-specified analysis plan. A second cause of poor RoB ratings was the use of inadequate statistical models to deal with missing data due to drop-outs, reflected in the high RoB scores for missing outcome data. Another cause of high RoB was the frequent use of PROMs, such as those assessing improvements in quality of life, combined with the absence of blinding of outcome assessors.

While poor RoB ratings due to suboptimal reporting quality, missing study protocols, and inadequate statistical analyses can be easily avoided through better adherence to the CONSORT reporting guidelines—as already suggested in the DiGA fast-track guidelines—or through the publication of study protocols, the use of PROMs is a more challenging issue.

The use of PROMs to assess DiGA outcomes is an obvious and necessary choice, as the intended effects of DiGA—such as improved quality of life—are often difficult or impossible to measure objectively. Additionally, DiGA interventions are conducted outside of clinical settings and are predominantly self-guided. Consequently, employing a waitlist study design seems reasonable, as the use of “sham” apps for control groups would be difficult to implement and likely methodologically inadequate7. Thus, overcoming the bias arising from participants’ awareness of their treatment allocation remains challenging. Nevertheless, it may be worth discussing whether the RoB2 tool’s classification of PROM usage—an essential part of patient-centered care—as a potential source of bias is always justified or appropriate.

These findings must be interpreted in light of certain strengths and limitations of the present review.

Our review provides an overview of study characteristics, measurements, outcomes, and the effects reported by DiGA approval studies, as well as an assessment of their RoB. Unlike previous reviews7,8, it also processes the reported effects, representing significant added value.

The insights gained from this review are valuable for both scientific and practical purposes. For the scientific audience, a comprehensive overview of the positive healthcare effects of DiGA and the associated RoB in the supporting evidence fills a knowledge gap. In practice, this information can offer context and guidance for DiGA prescribers. Also, by highlighting methodological gaps and inconsistencies, it establishes a framework to aid manufacturers and stakeholders in improving the design of future approval studies and the DiGA approval process, particularly a refinement of the fast-track guidelines.

This review also has limitations. We only included DiGA approval studies that demonstrated a positive healthcare effect on the basis of which the DiGA received permanent approval. Studies that failed to demonstrate such an effect, with the result that the DiGA did not receive permanent approval, could not be considered, as these studies are not publicly available for further analysis. The BfArM provides only very limited information on the reasons why permanent approval was not granted to provisionally approved and withdrawn DiGAs (see introduction). Moreover, it is not known how many applications for direct permanent approval were rejected by the BfArM.

Identifying all relevant approval studies was challenging. Although the search strategy was adjusted to address this issue—by incorporating various search methods, such as the DiGA directory, manufacturers’ websites, MEDLINE, and a manual search via Google Scholar—it is possible that not all published approval studies for the 33 DiGAs permanently approved as of March 15, 2024, were identified and included in this review. Retrieval of DiGA approval studies could be facilitated by requiring manufacturers to provide a link to the relevant study in the DiGA directory. Regarding data extraction, it must be noted that not all required information could be extracted from the studies due to the sometimes poor reporting quality. Ambiguities could potentially have been resolved through communication with the study authors; however, we refrained from contacting authors, for the reasons set out below.

Notably, the RoB 2 guidance recommends contacting study authors to request study protocols that are not publicly accessible58. In the DiGA approval process, however, the publication of approval studies and study registration is mandatory; it would therefore be consistent and appropriate if corresponding study protocols were also published as part of this regulatory transparency. Consequently, we refrained from contacting study authors to request unpublished protocols. In cases where no protocol was publicly available, we documented this in the RoB assessment, and the rating in the corresponding section was “no information”. This applied to 11 included DiGA studies. To enhance the clarity of evidence for the positive healthcare effects of DiGAs, it could be beneficial for the BfArM to mandate adherence to reporting guidelines for future approval studies.

Furthermore, the categorization of primary and secondary outcomes—as either “medical benefit” or “patient-relevant improvement of structures and processes”—had to be conducted to the best of our knowledge, due to the lack of clear definitions in the DiGA fast-track guidelines.

In several studies, the statistical significance or effect sizes were not clearly specified for reported outcomes. Consequently, data extraction required the interpretation of the available information to the best of our knowledge. Effect sizes were reported in Table 2 as indicated by the authors of the approval studies. This is because the studies used a wide range of outcomes and a variety of measurement instruments. Independent interpretation and contextualization of the effect sizes would require detailed knowledge of all these instruments and related literature, which was not feasible within the scope of this review.

The quality assessment was conducted across outcomes, taking into consideration the diversity of outcomes in the studies and the lack of comparability between them. Further analyses, such as meta-analysis, could not be conducted due to the diversity of outcomes and conditions applied in the approval studies. Thus, we were limited to a narrative presentation of the results.

The findings of our systematic review have several implications for practice and future research. The RoB assessment, in particular, clearly suggests that the evidence presented for the positive healthcare effects of DiGAs in approval studies should be considered with caution. However, a review of the DiGA fast-track guidelines reveals that the included studies do indeed fulfill the criteria for permanent approval, the granting of which is a discretionary decision made by the BfArM.

Making the permanent approval of DiGAs a discretionary decision may be considered reasonable at first glance, taking into account the diversity of DiGA interventions. However, given the divergence between scientific standards for high-quality studies and the approval criteria for DiGAs established by BfArM guidelines, a discussion on how to narrow this gap is imperative.

As the BfArM continues to grant permanent approval for DiGAs, improving the methodological quality of approval studies is important: it would help to ensure trust in, and the quality of, DiGAs, which is crucial given the planned approval of certified DiGAs as medical devices of risk class IIb.

Aiming to improve the methodological quality of DiGA approval studies, the results of this systematic review imply several possible measures that could be adopted by the BfArM and manufacturers.

First, adaptations to the criteria for permanent approval of DiGA could be key to improving the quality of manufacturers’ studies. As methodological problems of early approval studies do not differ from those of later ones, a connection to the fast-track guidelines appears evident.

Thus, to address poor reporting quality, the prospective publication of study protocols and reporting according to CONSORT guidelines should be mandatory. To improve comparability between approval studies and the effectiveness of DiGAs, the BfArM should demand clear rationales for outcome selection and recommend adherence to COS. Also, a minimum intervention duration and the conduct of a follow-up should be recommended, as DiGAs often address chronic diseases and are thus prescribed multiple times and for more than 90 days at a time.

To address the RoB resulting from the inadequate use of statistical models for dealing with missing data, e.g., last observation carried forward (LOCF), the BfArM could provide details of recommended models in its guidelines. These criteria should be communicated transparently to DiGA manufacturers via the DiGA fast-track guidelines to ensure clarity and consistency in expectations. The consequences of inadequate reporting quality and methodological flaws should also be clearly communicated.
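To illustrate the concern, the following is a minimal, hypothetical sketch (in Python; the scores and participant labels are invented for illustration and are not taken from any approval study) of how LOCF can distort a post-intervention estimate when drop-outs are not missing at random:

```python
# Minimal sketch: last observation carried forward (LOCF) keeps each
# dropout's last recorded score, masking any post-dropout deterioration.
import numpy as np
import pandas as pd

# Hypothetical symptom scores (lower = better) at three time points;
# NaN marks measurements missing after drop-out.
scores = pd.DataFrame(
    {
        "baseline": [20.0, 22.0, 19.0, 21.0],
        "mid":      [15.0, 18.0, np.nan, 16.0],
        "post":     [12.0, np.nan, np.nan, 14.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# LOCF: carry each participant's last observed value forward in time.
locf = scores.ffill(axis=1)
print(locf["post"].mean())        # 15.75 -- dropouts are treated as stable

# A completers-only view of the same data reaches a different estimate.
completers = scores.dropna()
print(completers["post"].mean())  # 13.0
```

Approaches such as multiple imputation or mixed-effects models make weaker assumptions about participants who drop out than simply carrying their last value forward, which is why fixed guidance on acceptable models could be valuable here.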

Second, incorporating real-world evidence as a mandatory component of the approval process could help strengthen the evidence base by enabling the monitoring of DiGA performance. The introduction of an application-accompanying performance measurement (in German “anwendungsbegleitende Erfolgsmessung”, AbEM) for DiGA in Germany from 202659 might be a valuable step in this direction.

Third, the BfArM should increase the transparency of the approval process. As permanent approval of a DiGA is a discretionary decision by the BfArM, the criteria used to evaluate the quality of DiGA approval studies must be publicly available. It would also be reasonable for the BfArM to apply standardized instruments, such as the RoB tool, for evaluating approval studies as part of its decision-making process. To further enhance transparency, the BfArM should also publish more detailed information on rejected applications for permanent approval and the rationales behind these decisions. In doing so, the BfArM could enhance trust in its decisions and provide an example of best practice for manufacturers designing future DiGA approval studies. Relatedly, the BfArM should also be transparent about how it handles violations of, or exceptions to, the 12-month publication deadline for approval studies of permanently approved DiGAs. According to our research, as of June 2, 2025, the 12-month deadline had expired for five permanently approved DiGAs. Clear communication on whether and how such delays affect the approval status would strengthen accountability and provide important guidance for manufacturers and international stakeholders alike.

Beyond the recommendations for improving the approval process in Germany, this systematic review offers important insights and valuable lessons for the international context. As Germany has been a pioneer in establishing reimbursable digital health applications, other countries, including France, Belgium, and Austria, have already modeled their processes after the German fast-track procedure60, and more may follow in the future. An important takeaway from this review is therefore the early implementation of adequate mechanisms in the approval process to systematically identify methodological weaknesses in approval studies for DiGAs. Ensuring transparency around these mechanisms could enhance international comparability and foster the adoption of best practices globally.

Taken together, the findings of this systematic review highlight that DiGA approval studies exhibit potential for methodological improvement and should be closely monitored. In an update of this systematic review, we will review approval studies published after March 15, 2024. While the focus of the present review was on outcomes measured post-intervention, the focus of the next review will be on follow-up measurements and evidence for the long-term positive healthcare effects of DiGAs.

Methods

The systematic review was registered prospectively with PROSPERO at https://www.crd.york.ac.uk/prospero/ (CRD42023460497) and adheres to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines to ensure comprehensive and transparent reporting61. The PRISMA checklist is available in the Supplementary Information.

Study inclusion and exclusion

This systematic review includes all studies that were submitted by DiGA manufacturers to the BfArM to present evidence for positive healthcare effects of a DiGA and that are (a) published in either German or English and (b) publicly available as a full text as of March 15, 2024, in any format (study reports, journal articles). In addition, we included all publicly available corresponding study protocols.

Search strategy and selection process

According to the DiGA directory (https://diga.bfarm.de), a total of 33 DiGAs had been granted permanent approval by the BfArM as of March 15, 2024. Hence, a search for approval studies of these 33 DiGAs was conducted. As a first step, the DiGA directory was consulted. As the directory comprises a registry where manufacturers are expected to identify the approval studies relevant to their DiGA, we expected to retrieve all approval studies from this source. However, approval studies were cited for less than half of the permanently approved DiGAs. In most cases, the listing was either left blank or contained references to the manufacturers’ websites. Thus, the search strategy was adjusted to include additional sources. If the approval study was not cited in the DiGA directory, it was searched for first on the manufacturer’s website. If the study could not be identified from the website, searches were conducted in MEDLINE using PubMed and manually in Google Scholar using the study registration number. The search for approval studies was repeated monthly, beginning on September 15, 2023, and concluding on March 15, 2024, to identify newly published approval studies.
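As a hypothetical illustration of the MEDLINE step (our sketch, not the authors’ actual search script; the registration number below is a placeholder), a trial registration number taken from the DiGA directory can be looked up via the NCBI E-utilities:

```python
# Minimal sketch: query PubMed for a trial registration number using the
# NCBI E-utilities esearch endpoint. "DRKS00000000" is a placeholder, not
# a real registration number from the DiGA directory.
import requests

reg_id = "DRKS00000000"
params = {
    "db": "pubmed",
    # [si] = Secondary Source ID, the PubMed field holding registry numbers;
    # searching title/abstract as well catches papers that only cite the ID.
    "term": f'"{reg_id}"[si] OR "{reg_id}"[tiab]',
    "retmode": "json",
}
resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi", params=params
)
print(resp.json()["esearchresult"]["idlist"])  # PubMed IDs of matching records
```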

To ensure that identified studies were the relevant approval study for a given DiGA, details were checked against available information from the DiGA directory, study registries, and manufacturers’ websites on study characteristics such as sample size, period of intervention, and trial registration numbers. In addition, study protocols corresponding to the approval studies were searched for via the German Clinical Trials Register (DRKS) and ClinicalTrials.gov, manufacturers’ websites, and manually on Google Scholar using the study registration number published in the DiGA directory5. As all approval studies were to be included in this review, no title/abstract screening and no full-text screening were performed. Hence, a discussion and agreement between the reviewers on the selection of eligible studies was not required.

Data extraction

After the identification of approval studies and corresponding study protocols, data extraction was conducted by one researcher (KS) and verified for accuracy and completeness by a second (MS). Any disagreements between the reviewers were resolved through discussion or with the assistance of a third reviewer (SD).

Data extraction was performed using a data extraction sheet in tabular form, developed by two reviewers (KS, MS) and tested on two randomly selected included studies to ensure that all relevant information of the study was extracted.

The data extraction sheet comprised categories based on the population, intervention, control, outcome, and study design (PICOS) scheme, as well as complementary information deemed relevant. The categories were (1) citation details, (2) study design, (3) recruitment and inclusion/exclusion criteria, (4) population characteristics and sample size, (5) country of intervention, (6) the allocation of the study participants to the intervention and control group(s), (7) intervention or treatment applied to intervention and control group(s), (8) drop-outs, (9) methods of analysis, (10) outcomes, (11) outcome measurement instruments, (12) measurement times, (13) effect of intervention, (14) effect size.
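For concreteness, the extraction record implied by these categories might be represented as follows (a minimal sketch of our own; the field names are illustrative, not the authors’ actual column headers):

```python
# Illustrative sketch of a per-study extraction record based on the
# PICOS-derived categories listed above; field names are our invention.
from dataclasses import dataclass, field

@dataclass
class ApprovalStudyRecord:
    citation: str                  # (1) citation details
    study_design: str              # (2) e.g., "parallel two-arm RCT"
    recruitment: str               # (3) recruitment and in-/exclusion criteria
    population: str                # (4) population characteristics, sample size
    country: str                   # (5) country of intervention
    allocation: str                # (6) allocation to intervention/control group(s)
    treatments: str                # (7) intervention and control conditions
    drop_outs: str                 # (8) rates and reasons
    analysis_methods: str          # (9) statistical methods
    outcomes: list[str] = field(default_factory=list)             # (10)
    instruments: list[str] = field(default_factory=list)          # (11)
    measurement_times: list[str] = field(default_factory=list)    # (12)
    effect: str = ""               # (13) direction/significance of effect
    effect_size: str = ""          # (14) e.g., Cohen's d, as reported
```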

Information on the rationale for outcome selection was also collected from each study, with a focus on whether core outcome sets were applied, as proposed by Williamson et al.55 These rationales were categorized according to a system developed by KS, MS, and SD as part of the review process. Overall, a distinction was made between implicit and explicit rationales for outcome selection.

A rationale was considered implicit if it matched one of the following patterns:

  (1) The condition or symptom being measured is very common, making it essential to focus on reducing its impact (high prevalence of condition).

  (2) The outcome directly reflects the core symptoms or conditions being treated, and reducing these symptoms is a key goal of the intervention (clinical relevance/core symptom).

  (3) The outcome was chosen because provision of healthcare for treatment of this condition or symptom could be improved, including better access, more efficient treatment, or overcoming barriers in traditional care (improvement of care).

  (4) The outcome was selected because the condition or symptom being measured has a substantial negative effect on patients’ daily functioning and overall well-being (high burden or impact on quality of life).

  (5) The outcome was selected based on strong empirical support from previous studies, systematic reviews, or meta-analyses that have demonstrated its relevance in evaluating treatment across multiple research settings (supported by meta-analysis or research evidence).

In contrast, a rationale was considered explicit if either the relevance of the outcome was determined in a structured manner—such as by conducting a Delphi survey or focus group with patients or experts—or the outcome was chosen with reference to a core outcome set.

With regard to treatment effects, this review focused on the reporting of primary outcomes as identified by study authors, measured at post-intervention. This is because the proof of a long-term positive healthcare effect for a primary outcome is not mandatory for the permanent approval of a DiGA by the BfArM. Consequently, follow-up measurements are not obligatory and are likely not included in all studies.

To assess effect sizes, we took classifications from the approval studies when available; otherwise, we assessed effect sizes according to standard statistical literature. If study authors reported relative improvements but did not specify an effect size, no categorization was made, as expert knowledge would be necessary to make such a determination.

Risk of bias assessments

As of March 15, 2024, the studies submitted to the BfArM were either randomized controlled or cluster-randomized controlled trials. Thus, for the RoB assessment, the RoB2 tool for randomized controlled trials15 (August 22, 2019 version) and the adapted RoB2 tool for cluster-randomized controlled trials (March 18, 2020 version) were applied15,58,62.

The Cochrane RoB2 tool is structured into five domains of bias that focus on different aspects of trial design, conduct, and reporting. These five domains are (1) RoB arising from the randomization process, (2) RoB due to deviation from the intended intervention, (3) RoB due to missing outcome data, (4) RoB in measurement of the outcome, and (5) RoB in selection of the reported result. For each domain, the RoB2 provides a fixed set of signaling questions to collect information on the possible RoB, and an algorithm maps the answers to a domain-level judgment; an overall RoB is then determined at the study level. One of three possible ratings is given for each domain and for the study overall: “low,” “some concerns,” or “high.”
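The study-level aggregation can be summarized as follows (a minimal sketch reflecting our reading of the RoB2 guidance, not official tool code; the count at which several “some concerns” judgments substantially lower confidence is left to assessor judgment in the guidance and is parameterized here purely for illustration):

```python
# Minimal sketch of the RoB2 study-level aggregation rule: any "high"
# domain makes the study high risk, and "some concerns" in several
# domains can also be escalated to an overall "high" rating.
from enum import Enum

class Rob(Enum):
    LOW = 1
    SOME_CONCERNS = 2
    HIGH = 3

def overall_rob(domains: list[Rob], escalate_threshold: int = 3) -> Rob:
    """Aggregate the five domain judgments into an overall RoB2 rating.

    `escalate_threshold` is our illustrative stand-in for the guidance
    wording that multiple "some concerns" judgments may substantially
    lower confidence in the result.
    """
    if Rob.HIGH in domains:
        return Rob.HIGH
    concerns = sum(d is Rob.SOME_CONCERNS for d in domains)
    if concerns >= escalate_threshold:
        return Rob.HIGH
    return Rob.SOME_CONCERNS if concerns else Rob.LOW

# Example: a single high-risk domain dominates the overall judgment.
print(overall_rob([Rob.LOW, Rob.SOME_CONCERNS, Rob.HIGH, Rob.LOW, Rob.LOW]).name)
```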

In the RoB assessment, the effect of assignment to intervention was of particular interest. The RoB2 was used to assess the risk associated with the outcome(s), outcome measure(s), and timepoint(s)58. We anticipated that outcomes in approval studies would be heterogeneous because of variations in scope, target groups, addressed conditions, and overall aims of DiGAs. Thus, for consistency, the RoB assessment was conducted per study instead of per outcome.

Previous RoB assessments had been conducted for some approval studies as part of previously published systematic reviews7,8; however, because those assessments differed in some domains, the studies concerned were re-assessed in the present review as well.

The RoB in the identified studies was assessed independently by two researchers (KS, SD), using the RoB2 Excel tool provided on the riskofbias.info website. As the literature suggests that training is necessary to apply the tool correctly and to improve inter-rater agreement63, three studies were initially selected for calibration in order to develop a shared understanding of the RoB2 tool. Results of the assessment were discussed by the two researchers, and consensus was reached. Disagreements between the two researchers were observed in domain D3 (“bias due to missing outcome data”), due to the complexity of the statistical analysis models used, and in domain D5 (“bias in selection of the reported result”), where the assessment requires a particularly high degree of interpretation regarding the quality and completeness of reporting.