Introduction

Digital health software products (DHSPs) are increasingly used by patients, caregivers, and healthcare professionals in the delivery of care to manage, maintain, or improve health1. DHSPs may be software applications built for a general-purpose computing platform, standalone products, extensions to another standalone product, or companions to hardware sensors.

Despite the rapid uptake of DHSPs and the recognition of their potential value in healthcare, many question how the quality of these technologies is assessed and are calling for more evidence-based evaluation frameworks2,3. Borges et al. recently identified lack of infrastructure and technical support, impact on clinician workload, inadequate training, and perception of usefulness as barriers to healthcare providers' willingness to adopt DHSPs4. Clinicians and patients struggle to identify the clinical utility of DHSPs5,6. Some clinicians have shared that they are hesitant to recommend DHSPs to patients because they lack the knowledge to identify which ones are effective or can be trusted7. This disconnect may stem in part from clinicians' limited knowledge of medical product approvals in general; in a recent US survey, only 17% of physicians indicated some level of understanding of the FDA's device approval process8.

The current process of establishing that DHSPs are built according to best practices is haphazard and fragmented9. Numerous standards, guidances, and audits exist10, covering domains such as evidence (e.g., FDA de novo, 510(k)), privacy and security (e.g., HITRUST, SOC2), and usability (e.g., WCAG, HFE/UE report), but none covers the full spectrum of what users may consider important. Moreover, where regulatory requirements do not exist, it is left to developers to decide whether to pursue industry evaluations, which can be very expensive and time consuming. Indeed, many consumer health and wellness DHSPs are not regulated because they do not meet the definition of a medical device, leaving the decision to audit a product up to the developer and leaving consumers to ascertain the quality and trustworthiness of a product, often without the expertise to do so. As a result, many potential adopters of DHSPs—especially large healthcare systems, insurers, and payers—develop their own bespoke evaluation flows, leading to further fragmentation11.

To that end we undertook to develop a single framework to establish that products are built according to best practices and achieve a common baseline of acceptability, to speed the adoption of valuable digital health technologies (DHTs) and build trust among buyers and end-users across the healthcare landscape. To do so, we conducted a 2-phase iterative study.

In phase 1, we interviewed and surveyed subject matter experts (SMEs), developed a needs assessment, and identified the high-level components of an evaluation framework (EF). In phase 2, we cataloged the current state of science and regulatory guidances to create an evidence-based EF that comprises multiple domains of interest, each of which is composed of a set of evaluation criteria and associated benchmarks. The ultimate goal was to develop an EF that DHSP adopters can use to ensure that their products reflect best practices and meet a quality bar that instills trust.

Data and methods

This mixed-methods study collected evidence to develop an EF for a common baseline of acceptability for DHSPs. Evidence was collected in phases that informed each other. All methods were carried out in accordance with relevant guidelines and regulations, and the experimental protocols were approved by the Advarra institutional review board (Advarra Pro00073478). Informed consent to use anonymized answers for the purposes of this work was obtained from all participants prior to interviews, focus groups, and surveys.

Needs assessment

The needs assessment elucidated the need for a framework to evaluate the quality of DHSPs (SFig 1). We conducted interviews and focus groups with SMEs representing stakeholders from across the healthcare ecosystem. From August to October 2023, we recruited 164 potential participants from DiMe's network of digital health experts, including regulatory agencies, healthcare providers, DHSP developers, patient advocacy organizations, medical societies, life science companies, payers, and investor organizations, of which 79 (48%) agreed to participate. We specifically focused on English-speaking individuals in mid- to senior-level positions; 40% were women, and the participants' organizations were headquartered 76% in the US, 7% in Europe, and 4% elsewhere, with 13% having a global footprint.

All interviews (n = 50) and focus groups (n = 29 participants) were held via Zoom except one focus group which was held in person. Interviews lasted 45 min and included 1 participant, 1 study facilitator, and 1 note-taker. Interviews were semi-structured and conducted in two cohorts. In the first cohort, 35 interviews were conducted to identify the high-level topics (i.e., domains) a comprehensive framework should include. These domains were validated by the second cohort of interviews (n = 15). In addition, both cohorts were asked to discuss key trends in how DHSPs are evaluated across the industry, and to identify participants’ greatest needs and priorities for an evaluation program for their specific stakeholder group.

Focus groups consisted of 8–13 participants and lasted 1 h; 1 group consisted only of DHSP adopters (n = 9, healthcare providers, payers, and other stakeholders that aggregate product information and recommendations), 1 consisted only of DHSP developers (n = 8), and the remaining group, which met in person, consisted of adopters, developers, and industry associations and investors (n = 13). An online workspace12 was used during the focus groups; this allowed participants to share additional comments during the discussion. Participants in all 3 focus groups were asked the following questions: (1) What EFs, certifications, or standards are you aware of or currently using?, and (2) What value would an EF for DHSPs bring to healthcare? In the adopters-only focus group, we also asked how developers could make the process of choosing a DHSP more efficient. In the developers-only group, we asked how adopters and regulators could make their work easier.

Survey

From the information gleaned from the interviews and focus groups, we developed a 76-question survey organized around the identified domains (SFig 1). From October to November 2023, we shared the survey broadly with the digital medicine community through DiMe's partner email list and Slack community.

We evaluated content validity13 with SMEs to ensure that the survey questions represented concepts that are essential for assessing the quality of DHSPs; experts (n = 15) representing developers, adopters, investors, and regulators reviewed the validity of the survey questions. Based on recommendations14, a panel of ≥ 5 participants is sufficient to assess the quantified content validity ratio (CVR). The content validity index (CVI) was then compared with the Lawshe critical value15; a critical value of 0.99 was needed for validation. A total of 17 questions were removed as a result of this process, leaving 59.
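For reference, Lawshe's content validity ratio for a single item is conventionally computed as CVR = (n_e − N/2) / (N/2), where n_e is the number of panelists rating the item "essential" and N is the panel size; for example, an item rated essential by 14 of 15 panelists yields CVR = (14 − 7.5)/7.5 ≈ 0.87.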

Evaluation framework development

We conducted a literature review, guided by findings from the needs assessment, to benchmark how quality has been assessed for DHSPs (SFig 2). Its aim was to identify quality assessment criteria that would meet the needs of both developers seeking to differentiate their products as trustworthy and adopters identifying which products are worthy of further consideration.

We set the inclusion and exclusion criteria (SFig 3) to identify publications that included recommendations for assessing DHSPs within the parameters set by the needs assessment. We screened 4504 English-language titles and abstracts published between January 2020 and December 2023 in PubMed, Google Scholar, and Publish or Perish16 against the inclusion and exclusion criteria, and extracted data from 1551 publications that met these criteria. Titles and abstracts were independently screened by 5 researchers. A subset of 10% was randomly selected for auditing by 1 researcher who was blind to the initial screening judgment. Cases of disagreement were discussed and resolved by consensus of all 5 researchers. Using findings from the needs assessment, we applied a combination of deductive and inductive thematic analysis to the extracted data17. Five researchers coded data for domains, intended users, and product types; a DHSP discussed in the publication was assigned to one of 3 intended user groups (patients or consumers only, patients and clinical teams, or clinical or administrative staff only). We assigned each publication to 1 or more domains and 1 or more product types, based on the information provided about the DHSP. Additionally, the researchers applied inductive analysis to identify descriptive themes around quality assessments within the three domains18.
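As an illustration of the audit step, the sketch below draws a 10% random audit sample and computes simple percent agreement between the initial screeners and the blinded auditor. It is a minimal sketch assuming simple random sampling and an include/exclude decision per record; the study does not report its tooling, and all function and variable names are hypothetical.

```python
import random

def draw_audit_sample(screened_ids, fraction=0.10, seed=2023):
    """Randomly select a fraction of screened records for blinded re-screening.

    Illustrative only: the study reports a 10% random audit but does not
    describe how the sample was drawn. `screened_ids` is a list of record IDs.
    """
    rng = random.Random(seed)
    k = max(1, round(len(screened_ids) * fraction))
    return rng.sample(screened_ids, k)

def percent_agreement(initial_decisions, audit_decisions):
    """Proportion of audited records where the auditor's include/exclude call
    matches the initial screener's call (both are dicts: record ID -> bool)."""
    shared = [rid for rid in audit_decisions if rid in initial_decisions]
    if not shared:
        return float("nan")
    agreements = sum(initial_decisions[rid] == audit_decisions[rid] for rid in shared)
    return agreements / len(shared)

# Hypothetical usage:
# audit_ids = draw_audit_sample(list(initial_decisions), fraction=0.10)
# agreement = percent_agreement(initial_decisions, audit_decisions)
```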

Second, we conducted a landscape analysis, following the WHO guide19, to comprehensively identify frameworks, guidances, and standards that apply to DHSPs. The goal of this analysis was to extract recommendations and best practices that could be used to design a comprehensive EF for DHSPs. The sources were identified through web searches, guided by knowledge from DiMe and regulatory experts on which sources are considered important by the field for evaluating DHSPs. The identified sources were labeled as a certification, framework, guideline, industry standard, regulatory guidance, or tool. Next, the identified sources were assigned to at least one of the domains. During interviews (see below), additional sources were identified iteratively until saturation was reached.

Findings from the literature review and landscape analysis served as the basis for the EF. We developed domain-specific criteria groups, criteria, and benchmarks that could be used to evaluate the quality of DHSPs. From March to April 2024, we interviewed 49 additional SMEs, including patient representatives, to validate the criteria and benchmarks. Again, we engaged stakeholders from across the healthcare industry. Interviews were held via Zoom. We shared the criteria groups and criteria with participants before the interviews; during interviews, we reviewed each benchmark in the context of the domains and criteria and collected feedback. We asked participants to evaluate the relevance of each criterion and benchmark.

Finally, in May 2024, we conducted user testing interviews with DHSP developers to validate the feasibility of attesting a product against the identified criteria and benchmarks. Saturation was reached after 9 interviews.

Results

Needs assessment

The first cohort of SME interviews identified usability, equity and inclusion, clinical and technical evidence, market and end-user evidence, and privacy and security as the major high-level topics, i.e., domains. Additionally, these interviews identified 4 stakeholder groups that would stand to benefit from a comprehensive EF: DHSP adopters (including healthcare systems, payers, and patients), DHSP developers, regulators, and industry associations.

For each interview in the second cohort, we sorted feedback by domain and employed an inductive approach for thematic analysis. The feedback received quickly reached saturation, as the participants shared similar needs and challenges that could be addressed with an EF. We moved to analysis after 15 interviews.

The domains were validated according to how many responses corresponded to a given domain or how many comments included a detail relevant to that domain. For adopters, usability received the most comments, followed by evidence, equity & inclusion, and privacy & security. Respondents did not separate market and end-user evidence from clinical and technical evidence, but instead spoke of evidence as a single domain. Developer responses ranked the domains in a slightly different order: evidence, usability, privacy & security, and equity & inclusion. This analysis confirmed that the domains identified in the first cohort covered the areas that adopters and developers focus on when vetting or designing DHSPs.

We used an inductive approach to review responses within each domain to identify themes and key trends for evaluation of DHSPs, spanning responses from adopters and developers (STable 1). This analysis also identified workflow integration, outcomes, and the overall business model as relevant context for assessing the quality of DHSPs; SMEs preferred that the domains serve as the central organizing unit for the EF.

The top three domains that both adopters and developers wanted to prioritize were evidence, usability, and privacy & security, with equity & inclusion as a 4th theme that applies to all domains. The themes for evidence varied slightly; adopters prioritized evidence vetted by clinicians and supporting workflow integrations, whereas developers prioritized evidence to support clinical claims and ROI. For usability, both stakeholder groups prioritized demonstrating knowledge of user needs for and value of using the DHSP. For privacy & security, adopters prioritized clearly defined measures without specifying a particular method, whereas developers stated that the reference standards of HITRUST20, SOC 2 Type II21, and HIPAA22 should be in place (STable 1).

We developed content for focus group discussions around the finding that three domains are foundational. We began these discussions with prompts to learn about the need for an EF for developers and adopters. We then asked participants to assess each domain in the context areas: outcomes, equity & inclusion, workflow, and business model. We also asked participants to identify any gaps they believe exist in the ability of adopters to efficiently evaluate DHSPs. Participants were aware of, or were currently using, numerous frameworks, certifications, and standards for the privacy & security domain (such as HITRUST, HIPAA, SOC 2 Type II, SMART on FHIR, HITECH, SaMD, NCQQ, VA/DoD, FedRAMP, ONC, KLAS reports, GDPR)20,21,22,23,24,25,26,27,28,29,30.

Fig. 1 Representation of SME answers to the questions "In what domain does good exist?" and "In which domains is there room for improvement?" regarding evaluating DHSPs for quality and trust. Respondents: evidence, n = 36; privacy & security, n = 18; usability, n = 26.

We then presented the adopter and developer focus-group participants with a grid depicting the domains and context areas. When asked, "What value would an EF for DHSPs bring to healthcare?", participants gave answers that we grouped thematically (Fig. 2; STable 4).

Fig. 2 Thematically grouped responses to the question "What value would an evaluation framework for DHSPs bring to healthcare?". (left) Total times a theme was identified (after thematic analysis) in the DHSP adopter focus group; (right) the same data for the DHSP developer focus group. In order of appearance: improved efficiencies, improved data quality and evidence, generate multistakeholder agreements, improved safety, improved transparency leading to improved consumer confidence, improved equity, improved reputation (of product or company).

Across the needs assessment activities, participants were asked, for each domain, whether they believed "good exists and there is very little need for improvement" or "the current state is insufficient and there are many opportunities for improvement" (Fig. 1 and STable 2).

Survey development and deployment

We synthesized findings from the interviews and focus groups to design a survey aimed at quantifying the impact of the gaps identified in the focus groups. Specific questions were developed around the three domains.

The survey was sent to the DiMe community and completed by 93 participants: 45 adopters, 32 developers, and 16 regulators, investors, and industry association representatives. The most common roles represented were executive leadership, research, data science, and analytics (STable 5). All 3 groups ranked "clinical outcomes clearly defined" as the top criterion when evaluating a new DHSP (Table 1). All 3 groups also ranked "easier to tell which products are fit for my purpose" as the most valuable aspect of such a framework. The evidence domain was ranked as most important when evaluating DHSPs, and "outcomes" was ranked far above equity & inclusion or workflow integration when respondents were asked about context.

Table 1 Survey responses.

Evaluation framework development

Several outcomes from the needs assessment were used for the next research phase. This included condensing to 3 domains—evidence, privacy/security, and usability—for 4 stakeholder groups: adopters, developers, regulators, and industry associations. The theme of equity was ubiquitous and woven throughout the domains. The thematic analysis and survey results provided content for criteria to evaluate each domain. We also organized DHSPs into 3 user types based on the intended user group: patients or consumers only, patients and clinical teams, and clinical teams or administrators only.

For the literature review, we screened the titles and abstracts of 4504 unique publications. We reviewed the full text of 1551 (34%), of which 1053 (68%) provided recommendations or best practices to evaluate DHSP quality (SFig 2). We extracted information from these publications that could inform DHSP quality within the defined domains and identified the intended user group for the subject DHSP.

Through inductive thematic analysis, we identified three recommendations for improving quality for the evidence domain: engage a variety of stakeholders when developing or validating a DHSP, conduct a research study for evidence generation, and build consensus around evidence guidelines (STable 3). Within the domain of privacy & security, there was a call for more comprehensive guidelines specific to DHSPs, and more resources and information on data privacy for end-users. The themes we identified for the usability domain were also focused on providing more information and conducting user testing.

The landscape analysis identified 160 professional sources that we aligned with the three domains and product types and from which we extracted recommendations or best practices that could inform an EF (STable 6): 47 regulatory guidances, 32 frameworks, 32 guidelines, 34 industry standards, and 15 tools. Of these, 92 provided information relevant to the evidence domain, 81 related to privacy and security, and 62 related to usability.

Data from the needs assessment, survey, literature review, and landscape analysis were integrated into an EF organized around each domain. For each domain, we identified and defined criteria groups as the first level of organization (Table 2). We further divided each criteria group into criteria and associated benchmarks.
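As a minimal sketch of this hierarchy, the structure of domains, criteria groups, criteria, and benchmarks can be represented as nested records; the class names and example content below are illustrative only and are not taken from Table 2.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    name: str                                             # a single evaluation criterion
    benchmarks: List[str] = field(default_factory=list)   # attestable benchmarks

@dataclass
class CriteriaGroup:
    name: str                                             # first level of organization within a domain
    criteria: List[Criterion] = field(default_factory=list)

@dataclass
class Domain:
    name: str                                             # "evidence", "privacy & security", or "usability"
    criteria_groups: List[CriteriaGroup] = field(default_factory=list)

# Hypothetical example instance:
usability = Domain(
    name="usability",
    criteria_groups=[
        CriteriaGroup(
            name="user research",
            criteria=[
                Criterion(
                    name="user needs identified",
                    benchmarks=["Usability testing conducted with representative end-users"],
                )
            ],
        )
    ],
)
```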

Table 2 The evaluation framework with domains, criteria groups, criteria, and associated benchmarks.

We interviewed SMEs to validate and refine this initial version of the EF. We asked them to approach their review of the framework through the lens of the stakeholder group most closely tied to their role. After interviewing 49 SMEs (16 adopters, 21 developers, and 12 industry association representatives) and observing saturation with the feedback, we moved to analysis and synthesis to integrate the feedback in the framework.

SMEs from all stakeholder groups expressed that the framework was comprehensive and contained important details. They recommended approaches to attesting to the benchmarks that ranged from federal regulations, such as FDA 510(k) clearance, to a document summarizing the work conducted or processes in place. They thought that the evidence domain could benefit most from an EF, as very few requirements exist for what good evidence should look like, especially for products that are not subject to FDA oversight. The primary feedback for privacy & security was that many privacy and security regulations exist for DHSPs. For usability, the primary feedback was that the usability criteria are important and should be addressed; however, very few developers give this area the attention it needs. For each domain, several criteria groups were collapsed, and criteria and benchmarks were combined to reflect the SMEs' feedback. We then validated the new criteria and benchmarks against industry standards, regulatory guidances, frameworks, and tools.

Finally, we conducted user testing with developers who were working on commercial products that fit ≥ 1 user group. We collected feedback from 9 developers, across the three domains, before reaching saturation. They indicated that the criteria and benchmarks were informative and applicable to their products, and suggested minor edits for evidence and privacy & security. The primary concern with usability was that the benchmarks were more detailed than those required by industry standards or regulatory bodies.

Discussion

The current disorganized processes for assessing the quality and trustworthiness of DHSPs can delay the development and deployment of DHTs, and might not reflect best practices with regard to evidence, privacy and security, usability, and equity and inclusion. Though the public and private sectors offer fragmented assessments focused on discrete domains of importance for evaluating DHSPs, no industry-driven, non-partisan effort exists that unifies them and provides market guidance to DHSP developers and adopters. An evidence-based EF would harmonize future efforts. This mixed-methods study created an EF that offers a comprehensive set of benchmarks for ensuring the high quality of DHSPs.

Despite the existence of many regulatory guidances, frameworks, and industry standards, the SMEs we interviewed asked for additional specificity as to what "good" would look like for each domain we identified. Findings from the literature review echoed feedback from SMEs that an EF with clearly defined criteria and benchmarks is needed. Currently, the onus is on developers, adopters, and end-users to identify relevant evaluation criteria and apply them. This is highly problematic: there are more than 6000 hospitals in the US31 and more than 350,000 DHSPs1 for them to choose from, with no standard approach to evaluation. The size of this systemic challenge is staggering, and the implications for equitable access to care are deeply concerning. Our framework provides comprehensive and clear parameters of quality and allows developers to attest their products against them so that adopters can more quickly identify and adopt fit-for-purpose DHSPs.

Many SMEs shared that very little consideration goes into evaluating usability for diverse populations and different end-users. This is consistent with research showing large inequities in access to and uptake of DHTs32,33. Many developers stated that they rely on convenience sampling when conducting user testing. Similarly, developers shared that little effort is dedicated to accessibility, and both developers and adopters reported that they would like to see a bigger push to prioritize inclusive design in DHSP development.

The SMEs' feedback also indicated a desire for greater transparency throughout the DHSP development and deployment phases. Diverse stakeholders should be included early in the DHSP development process10,34. There are many opportunities for developers to be more transparent, including disclosing how evidence is generated to support the DHSP claims, providing terms and conditions for privacy and security that are easy for end-users to understand, and providing details on usability testing, among other aspects.

The EF is based on front-line voices and needs, and its modularity ensures it can be readily updated to stay current with the evolving needs of stakeholders. Its broad scope improves upon other evaluation tools that take a cost-reduction35,36,37 or profit-driven approach38.

To retain the viability of this EF going forward, we intend to maintain a transparent process through which new benchmarks, standards, and criteria can be considered and incorporated. Our expectations for such evolutions are that they (a) are good for DHSP adopters and developers, (b) enhance or maintain the relevance of this EF, and (c) are clear and attestable. This will allow the current EF to accommodate important evolutions in this quickly changing landscape, including new guidelines developed after this effort was initially launched and new technological developments such as AI and interoperability.

The study has a few limitations. Having greater sample sizes for interviews, focus groups, and surveys is often preferable but not always achievable. Most of the SMEs interviewed for the needs assessment held senior executive or leadership roles, which may have skewed the results. We mitigated this by including mid-level participants in the cohort of SMEs that participated in the EF development phase.

Most of the benchmarks and guidelines leveraged to develop this EF are U.S.-focused. Though we included several standards with global or non-U.S. reach (e.g., GDPR), future extensions of the EF may more intentionally incorporate considerations related to other markets, such as the work happening on the European Digital Health Technology Assessment framework (EDiHTA)39.

Finally, one limitation of inductive thematic analysis is that it relies on the researchers’ subjective assessments. We attempted to mitigate this by discussing the identified themes with DiMe internal experts to reach a consensus.

In conclusion, the proposed evidence-based EF spans multiple domains of trust and value, harmonizes best practices, stands on the shoulders of well-respected and established work, and eases the effective adoption of high-quality, trustworthy DHSPs. It also becomes a bedrock upon which future iterations can be built.