Introduction

Based on current data, 1 in 31 children in the United States is diagnosed with autism by age eight1. While early intervention is associated with the greatest benefits, many children experience multi-year delays to diagnosis. Despite reliable diagnosis being possible by 18 months2, the average age of diagnosis is currently five years3. For girls, delays are even greater, with an average age of diagnosis of 5.6 years3. Over-reliance on a dwindling specialist workforce4,5 has contributed to delayed evaluations, as has routine use of time-intensive assessments irrespective of case complexity6,7. A recent survey of autism specialty centers across the U.S.8 found that nearly two-thirds of specialty centers (61%) have wait times longer than 4 months. Of that group, 25% have waitlists of more than half a year, and 21% report waitlists of more than a year, or waitlists so full that they can no longer take new referrals. The same survey found that in the majority of centers (83%), evaluations take more than three hours, with evaluations extending up to 8 hours in a quarter of centers.

There is a growing call to expand the pool of clinicians able to conduct evaluations, as well as a recognized need to streamline the evaluation process itself, so that more children can be diagnosed equitably and accurately early on9,10. Multiple randomized controlled trials show that timely access to targeted early interventions leads to significantly greater cognitive, linguistic, and functional gains for children with autism, compared to lack of treatment, delayed treatment, or non-targeted treatment11. Even minor delays to treatment initiation have been shown to negatively impact outcomes, for example, starting therapies at 27 months versus 18 months of age12.

In response to this need for streamlined early diagnosis, Canvas Dx was developed and validated prospectively to empower a broader pool of clinicians to act rapidly upon first developmental concerns13. The first FDA-authorized diagnostic for autism of any kind14, Canvas Dx uses AI technology built on data from thousands of diverse children at risk for, and diagnosed with, developmental delays including autism. Device inputs were designed to capture the behavioral, executive functioning, and language and communication features maximally predictive of autism. Consistent with best-practice recommendations that evaluation for autism include both caregiver and clinician input, as well as direct observation of the child15, Canvas Dx integrates data from multiple sources (see Fig. 1) in its machine learning algorithm.

Fig. 1
figure 1

After downloading the Canvas Dx App on their smartphone, the child’s caregiver answers a brief question set about the child’s behavior and development (5 min). The caregiver also uploads two brief (1.5–5 min) videos of their child playing via the App. Videos undergo analysis and feature extraction. The child’s clinician answers a set of questions via the Canvas Dx clinician web portal (10 min). All inputs are fed through the machine learning algorithm. An output of positive, negative or indeterminate for autism is returned, along with an auto-generated detailed report mapping challenges to DSM-5 criteria relevant to autism diagnosis. The image depicts actors, not real study participants. Image copyright Cognoa Inc.

The device provides a positive or negative autism prediction in the majority of cases, as well as a detailed report for each child that helps identify developmental strengths and challenges, and maps data to DSM-5 autism criteria to better inform next steps. In cases where there is insufficient information to confidently provide a diagnostic prediction or rule out with high accuracy, the device produces an ‘indeterminate’ output. This diagnostic abstention mechanism allows for safer uncertainty management in cases where misclassification risks are the highest16. Explainable AI and the management of uncertainty have become central to AI in healthcare16,17. Arbitrary cut-offs that result in a binary classification are subject to error at the edge cases, particularly in the field of autism, where ambiguous presentations or multiple co-occurring conditions increase misclassification risk in binary screeners18,19. Having an indeterminate range or abstention feature may support greater clinician accuracy when evaluating complex autism cases: just as a clinician is able to say “I don’t know” when uncertain, AI-based devices are likely to operate more safely and transparently when they are not forced to produce a binary prediction in all cases20.
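This abstention mechanism can be sketched as a simple three-way decision rule; the threshold values below are hypothetical placeholders for illustration, not the actual Canvas Dx cut-offs.

```python
# Three-way output sketch: scores inside the abstention band return
# "indeterminate" rather than forcing a binary call. Threshold values
# here are hypothetical, not the actual Canvas Dx cut-offs.
def classify(score, neg_threshold=0.3, pos_threshold=0.7):
    """Map a risk score in [0, 1] to 'negative', 'indeterminate', or 'positive'."""
    if score <= neg_threshold:
        return "negative"
    if score >= pos_threshold:
        return "positive"
    return "indeterminate"

print(classify(0.5))  # "indeterminate": the device abstains on this case
```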

Based on clinical trial data13, in a study environment with an underlying autism prevalence of 29%, the device achieved a Positive Predictive Value (PPV) for autism of 80.8% (95% CI, 70.3–88.8) and a Negative Predictive Value (NPV) of 98.3% (95% CI, 90.6–100.0). Given examples of interventions failing to perform with equal accuracy outside of clinical trial settings21, and the underperformance of AI models in real-world settings in particular22, the purpose of this analysis was to determine how Canvas Dx is performing in real world settings, and to learn more about its impact on age of diagnosis, as well as the characteristics of device prescribers and patient users. Analysis of AI model performance in real-world contexts is a critical step towards ensuring safe and impactful clinical adoption22.

Methods

A de-identified aggregate data analysis of the initial 254 Canvas Dx prescriptions fulfilled in clinical settings post-market authorization was conducted to determine: what proportion of children received a determinate device output (positive or negative for autism); device PPV, NPV, sensitivity, and specificity compared to clinical reference standard; and key prescriber and patient characteristics. Real world performance metrics were then compared to previously published clinical trial device performance.

Sample: All patients who were prescribed Canvas Dx and completed all inputs needed to get a diagnostic result were included in this analysis. All patients were in the intended use population of the device, children 18 to 72 months of age with caregiver or health provider concern for developmental delay.

Ethics: The de-identified real world aggregate data analysis (PR015) was determined exempt by Advarra IRB. The previously published Canvas Dx clinical study protocol referenced in this analysis, and informed consent forms were reviewed and approved by a centralized Institutional Review Board (IntegReview IRB). Protocol Number: Q170886. IntegReview IRB granted approval of the study (protocol version 1.0) on 19 July 2019. IntegReview was subsequently purchased by Advarra IRB. Informed consent was obtained from all caregivers whose children participated in the clinical study. This study was registered on ClinicalTrials.gov (NCT04151290) prior to study initiation. All clinical study methods were carried out in accordance with relevant guidelines and regulations.

Real world data analysis

Clinical reference standard procedure

As part of its obligation to conduct continuous algorithmic performance monitoring, the device manufacturer tracks Canvas Dx performance against a panel of blinded, independent, board-certified child and adolescent psychiatrists, child neurologists, developmental-behavioral pediatricians, or child psychologists with more than 5 years’ experience in diagnosing autism. Two specialists, blinded to the device results and to the diagnostic call of their peer, evaluate the device inputs and determine if autism and/or other neurodevelopmental conditions are present based on DSM-5 criteria. In cases where the two specialists disagree, a third specialist (also blinded) reviews the data, and the majority decision determines the clinical reference standard diagnosis.
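The adjudication procedure above amounts to a two-reviewer majority vote with a blinded tiebreaker; a minimal sketch, where `True` denotes an autism-positive call:

```python
# Majority-vote adjudication sketch: two blinded specialists call the case
# independently; a third (also blinded) is consulted only on disagreement.
# True = autism present per DSM-5 criteria, False = absent.
def reference_standard(first, second, tiebreaker=None):
    if first == second:          # unanimous: no tiebreaker needed
        return first
    if tiebreaker is None:
        raise ValueError("disagreement requires a third blinded reviewer")
    return tiebreaker            # the third call creates a 2-1 majority
```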

Statistical analysis of device performance

The determinate rate was calculated as the proportion of prescriptions for which the device predicted positive or negative for autism, as opposed to abstaining. Because the device is not a binary classifier, abstention cases were analyzed separately from determinate cases. For determinate cases, PPV, NPV, sensitivity and specificity were calculated with the clinical reference standard consensus diagnosis for each case used as the true label. The corresponding 95% confidence intervals were generated for each metric. Fisher’s Exact Test was used to determine whether there was a statistically significant difference in device performance across biological sex and age groups for each of these metrics. As abstention cases represent neither a correct nor incorrect classification, sensitivity and specificity are not reported in the indeterminate sample. Instead, we calculated the percentage of indeterminate cases that received a positive or negative reference standard autism diagnosis, as well as the percentage indicated as being at risk for other neurodevelopmental conditions. These analyses were conducted on the indeterminate group as a whole, as well as on subsets of the indeterminate group stratified into low, moderate, and high autism risk. These risk groupings were derived by examining the distribution of positive and negative reference standard diagnoses across the range of device scores within the indeterminate zone. Score ranges that resulted in the lowest and highest observed prevalence of autism were assigned to the low- and high-risk groups respectively, and the middle range group was selected to maximize the separation in autism prevalence across the three categories.
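The four determinate-case metrics follow directly from the confusion counts. As a concreteness check, the sketch below uses the real world determinate-case counts reported later in this analysis (TP = 110, FP = 9, FN = 1, TN = 40); the confidence interval method used in the paper is not reproduced here.

```python
# Determinate-case performance metrics from confusion counts. The example
# counts are the real world determinate-case counts reported in this
# analysis (TP = 110, FP = 9, FN = 1, TN = 40).
def metrics(tp, fp, fn, tn):
    return {
        "ppv": tp / (tp + fp),          # precision of positive outputs
        "npv": tn / (tn + fn),          # precision of negative outputs
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

m = metrics(tp=110, fp=9, fn=1, tn=40)
print({k: round(v * 100, 2) for k, v in m.items()})
# -> {'ppv': 92.44, 'npv': 97.56, 'sensitivity': 99.1, 'specificity': 81.63}
```

These point estimates match the real world values reported in the Results.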

Analysis of decision thresholds

To examine the impact of decision thresholds on performance, the PPV, NPV, sensitivity, and specificity of the device were calculated for a range of decision thresholds resulting in determinate rates between 20% and 100%. The range of decision thresholds was selected by adjusting both the positive and negative threshold boundaries from the true device thresholds to achieve specific determinate rates. The determinate rates at which each performance metric becomes significantly different from the real world device performance were calculated.
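This sweep can be sketched as follows. For simplicity the abstention band below is widened symmetrically around a hypothetical score midpoint, whereas the actual analysis adjusted the positive and negative boundaries independently; the scores and labels are synthetic.

```python
# Threshold-sweep sketch: widen the abstention band to trade determinate
# rate against accuracy. The band grows symmetrically around a
# hypothetical midpoint of 0.5; scores and labels are synthetic.
def sweep(scores, labels, band_half_widths, center=0.5):
    results = []
    for w in band_half_widths:
        neg_t, pos_t = center - w, center + w
        det = [(s, y) for s, y in zip(scores, labels) if s <= neg_t or s >= pos_t]
        tp = sum(1 for s, y in det if s >= pos_t and y == 1)
        fp = sum(1 for s, y in det if s >= pos_t and y == 0)
        tn = sum(1 for s, y in det if s <= neg_t and y == 0)
        fn = sum(1 for s, y in det if s <= neg_t and y == 1)
        results.append({
            "determinate_rate": len(det) / len(scores),
            "ppv": tp / (tp + fp) if tp + fp else None,
            "npv": tn / (tn + fn) if tn + fn else None,
        })
    return results

# Wider bands abstain more often, shrinking the determinate rate.
res = sweep([0.1, 0.2, 0.6, 0.9], [0, 0, 1, 1], band_half_widths=[0.1, 0.3])
```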

Comparison to clinical trial data

Calculated real world performance metrics were then compared to clinical trial data to ensure that there was no degradation in device performance between clinical and real world settings, using Fisher’s Exact Test. Full details of the methodology used to derive the clinical trial performance metrics are described in previously published work23.

Results

Real world data analysis

Prescriber characteristics

At the time of data analysis, 100 unique prescribers had Canvas Dx prescriptions fulfilled. Prescribers were located in 20 different states, across 40 practices. The greatest numbers of prescriptions were generated in California (68), Virginia (43), and Florida (42). A breakdown of prescriber qualifications is included in Fig. 2.

Fig. 2
figure 2

Prescriber qualifications.

Patient characteristics

Based on clinical reference standard determination, the underlying autism prevalence in the sample was 54.7% (139/254). Over a quarter of the sample, 29.13% (74/254), were female. The median age of children evaluated with Canvas Dx was 37.2 months (range: 17.1–71.8 months). The median age of children who received a positive output was 33.7 months (range: 17.1–69.7 months).

Table 1 presents the demographic and clinical characteristics of the full study population, the population with a Negative ASD reference standard, and the population with a Positive ASD reference standard. Fisher’s Exact Test was used to assess whether there were statistically significant differences between the positive and negative ASD groups for each characteristic.

Table 1 Patient characteristics stratified by reference standard diagnosis.

Device performance

Nearly two-thirds of users (62.99%; 95% CI, 57.05–68.93%) received a determinate result. For determinate cases, compared to the reference standard, Canvas Dx had an NPV of 97.56% (95% CI, 92.84–100.0%) and a PPV of 92.44% (95% CI, 87.69–97.19%). Sensitivity and specificity were 99.1% (95% CI, 97.34–100.0%) and 81.63% (95% CI, 70.79–92.47%), respectively. Autism prevalence rates in the indeterminate group are displayed in Table 3. Data regarding the prescribing clinician’s final diagnosis were available for 41.1% of the 95 indeterminate cases. In the majority of these cases (76.9%), the prescribing clinician agreed with the reference standard diagnosis (21 positive cases and 9 negative cases). For the 23.1% of cases with disagreement between the prescribing clinician and the reference standard, the majority received a clinician-positive diagnosis and a negative reference standard (6 cases), while the rest received a clinician-negative diagnosis and a positive reference standard (3 cases).

Table 2 presents a contingency table comparing the reference standard diagnosis (Positive or Negative for ASD) to the device result (Positive, Indeterminate, or Negative). Counts reflect the number of cases falling into each combination of reference standard and device outcome.

Table 2 Contingency table.

Table 3 presents the percentage of individuals within the indeterminate device result group who received an autism diagnosis or had at least one documented risk factor for a neurodevelopmental condition other than autism. The data are stratified by autism risk level assigned within the indeterminate group: low, moderate, and high.

Table 3 Indeterminate autism risk group analysis.

Device performance by biological sex

For determinate cases there were no statistically significant differences in device performance between males and females at the 0.05 p value level. The rate at which the device produced a determinate versus indeterminate result also did not differ significantly between sexes at the 0.05 p value level (see Table 4).

Table 4 Device performance by biological sex.

Device performance by age

There were no statistically significant differences in NPV, sensitivity, specificity or determinate rate between the over 48 months of age and the under 48 months of age groups. The device had a statistically significant difference in PPV performance between age groups, with cases under 48 months of age achieving superior PPV (see Table 5).

Table 5 Device performance by age group.

Impacts of threshold adjustments on device performance

Fig. 3
figure 3

Fig. 3 illustrates the Best Device Performance line, which represents the theoretical determinate rate at which all accuracy metrics are maximized.

Figure 3 Impact of adjusting abstention thresholds: this figure demonstrates the change in PPV, NPV, sensitivity, and specificity as the abstention thresholds are adjusted to allow for a range of determinate rates. The Best Device Performance line represents the theoretical determinate rate at which all accuracy metrics are maximized. The Selected Determinate Rate line represents the current real world device performance with the abstention thresholds used in this study. The Significant Determinate Increase line represents the point at which the determinate rate becomes statistically significantly improved over the current real world device determinate rate. All other lines represent the point at which an accuracy metric statistically significantly decreases from real world performance.

Real world device performance comparison to clinical trial results

The demographic composition of our real world and clinical trial samples are included in Table 6.

Table 6 Clinical trial population characteristics vs. real world population characteristics.

Across all cases, PPV improved to a significant degree in real world performance. This improvement was driven by statistically significant improvements to PPV in the female and under-48-months-of-age demographics. Real world PPV performance for the male and over-48-months-of-age demographics was equivalent to clinical trial performance. Real world NPV performance was equivalent to clinical trial performance across all demographics. The real world determinate rate was significantly improved compared to the clinical trial determinate rate across all demographics (see Table 7). The sample of real world patients reflects the composition of the clinical trial sample for age and sex, though the real world patient sample had a significantly higher autism prevalence. This increased prevalence may drive some of the significant improvements to PPV, and the decreases in NPV.

Table 7 Clinical trial device performance vs. real world performance.
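The dependence of PPV and NPV on prevalence follows directly from Bayes’ rule. A sketch with purely illustrative sensitivity and specificity values (not the device’s actual operating characteristics) shows PPV rising, and NPV falling slightly, as prevalence increases:

```python
# PPV/NPV as a function of prevalence at fixed sensitivity/specificity
# (Bayes' rule). The 0.95/0.85 values are illustrative only, not the
# device's actual operating characteristics.
def ppv_npv(sens, spec, prevalence):
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

low = ppv_npv(sens=0.95, spec=0.85, prevalence=0.29)   # trial-like prevalence
high = ppv_npv(sens=0.95, spec=0.85, prevalence=0.55)  # real-world-like prevalence
# PPV is higher at the higher prevalence; NPV is slightly lower.
```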

Discussion

Principal results

In this analysis of real-world Canvas Dx use, the device provided highly accurate positive and negative outputs for autism that aligned with the specialist reference standard in the majority of cases. In a patient population with an autism prevalence of 54.7%, Canvas Dx had a high NPV (97.56%) and PPV (92.44%), providing a determinate output for 62.99% of children. Children in this analysis were provided a positive output more than 2 years (26.3 months) earlier than the current average age of autism diagnosis in the United States3. This finding highlights the substantial waitlist reductions that could be made by streamlining evaluations and recruiting a broader range of clinicians to participate in the autism evaluation process. Currently, the U.S. has only 758 developmental-behavioral pediatricians for 19 million children with developmental or learning challenges4, and 11 child and adolescent psychiatrists for every 100,000 children5. By empowering more clinicians to participate in autism evaluations, Canvas Dx can help to support definitive early action for a greater subset of children. Earlier answers, in turn, may enable initiation of targeted interventions during the critical early years of high brain neuroplasticity, when they have the greatest impact.

While device performance was consistent across biological sex for all metrics and across age groups for most metrics, PPV performance differed between older and younger age groups. Comparison of these real world results to clinical trial results23 suggests that this difference in PPV performance is due to substantially improved device performance in the younger age group, rather than degraded performance in the older age group. While girls comprised only 29.13% of the sample analyzed here, they represented 30.0% of children who received a determinate result, indicating proportional representation in determinate results across sexes. This finding is of critical importance given the existing inequities in autism diagnosis for girls in the U.S.3,24.

Economic and societal impacts

Robust data across numerous published studies support both the short- and long-term health and economic benefits of diagnosing children with autism earlier, so that treatments can begin in the critical neurodevelopmental window where they have the greatest impact25. A U.S. analysis of the potential medical and residential cost savings that could be realized with earlier initiation of evidence-based therapies for children with autism projects annual cost savings in excess of $23.8 billion, with savings of ~$8.5 billion and $2.6 billion in Federal and State Medicaid spending, respectively26. Canadian lifetime cost-effectiveness modeling per person with autism, based on eliminating the current 32-month wait time for intensive behavioral intervention (IBI) initiation, found substantial savings to government ($53,000 per person) and society ($267,000 per person)27.

Cost savings are realized not only in the post-diagnostic period, but also through reduction of unnecessary or untargeted treatments and poorly managed symptoms in the period between first concern and eventual diagnosis. A large US claims analysis28 of ~9,000 children with autism, for example, found that the mean all-cause medical cost per child was roughly twice as high for those with a longer time from first concern to diagnosis compared with those with a shorter delay ($5,268 vs. $2,525 per child in the younger age cohort and $5,570 vs. $2,265 per child in the older age cohort). Children who had a longer delay to diagnosis also experienced a greater number of both all-cause and autism-related health care visits compared with children who had a shorter delay. For example, the mean and median numbers of office or home visits were between 1.5x and 2x greater among children who had a longer time from concern to diagnosis28.

Limitations

Only data captured as part of routine device use were available for the real world analysis; therefore, we were unable to comment on subjective patient and provider experiences, satisfaction measures, or longitudinal diagnostic stability. Similarly, information on patient race/ethnicity and socio-economic status is not collected as part of routine clinical device use, so we could not conduct covariate analysis on these features. Pivotal trial results, however, did point to equitable device performance across race/ethnicity and socio-economic status23. More information on device performance across these covariates is currently being collected as part of a primary care integration study29.

In 37% of cases, the device abstained from making an autism prediction or rule out. As Fig. 3 demonstrates, adjusting determinate thresholds impacts both abstention and accuracy. Restricting determinate outputs to the 63.0% of cases with sufficient certainty prevents the degradation of device performance that is seen when adjusting abstention thresholds to allow for larger determinate rates. Increasing the determinate rate to 72.0% results in a statistically significant improvement in determinate rate over current real world performance (Fisher’s Exact Test p value 0.047) without any statistically significant decrease in accuracy metrics. The determinate rate can be further increased to 81.4% while maintaining statistically equivalent accuracy metrics. At this point, PPV drops significantly (Fisher’s Exact Test p value 0.039), and specificity decreases to a clinically significant degree though it maintains statistical equivalence. At this point, the number of indeterminates decreases from 95 to 49 cases, while the number of False Positives increases from 9 to 22 cases and the number of False Negatives increases from 1 to 4 cases. The number of True Positives increases from 110 to 120 cases, and True Negatives increase from 40 to 59 cases. The determinate rate can then be increased up to 94.7% before both PPV and sensitivity drop statistically significantly (Fisher’s Exact Test p value 0.042). While both NPV and specificity remain statistically equivalent to current real world performance, both metrics experience clinically significant decreases. Specificity and NPV performance are statistically maintained up to a 100% determinate rate. The real world device PPV remains statistically superior or equivalent to clinical trial performance at a 100% determinate rate.

All four metrics can achieve 100% performance, but this can only be realized by lowering the determinate rate to 20.9%. Though the numbers of False Positives and False Negatives decrease to 0, the number of indeterminate cases rises from 95 to 198. True Positives decrease from 110 to 40 cases, and True Negatives decrease from 40 cases to 15. Restricting the determinate rate to cases with even higher certainty would further improve device performance, but with the trade-off of providing fewer children with a determinate result. At a determinate rate of 52.97%, the number of children provided with a determinate result would be significantly decreased (Fisher’s Exact Test p value 0.047). The selected thresholds for this device therefore balance maximizing accuracy metrics against providing a determinate result to as many children as possible.

While allowing for a 37% abstention rate is arguably a limitation of the device, it aligns with calls from clinicians and statisticians alike to consider machine learning abstention in complex edge cases16,17. Abstention in such cases may represent a preferred method for addressing high uncertainty because it both minimizes misclassification and highlights challenging cases that may need further investigation18,19,20,30. This is particularly critical for conditions such as autism, where consequences of misclassification include a potential failure to receive treatment during the window of peak brain neuroplasticity. As demonstrated in Fig. 3, the selected Canvas Dx abstention thresholds were chosen to preserve device performance while providing determinate results to as many cases as can be classified with high certainty, though device performance would remain clinically useful at much lower abstention rates. For indeterminate cases, clinicians are still given access to the full Canvas Dx detailed report that includes DSM-5 patient-specific mapping. In this real world analysis we observed that, in the majority of indeterminate cases where the prescriber rendered a diagnostic call or rule out, it aligned with the blinded reference standard call. While this analysis demonstrates high device accuracy in real world settings, and an earlier average age of autism diagnosis with related potential cost savings, its full impact will likely not be felt until payors clarify how reimbursement will be achieved through comprehensive medical policy coverage. The AAP leadership’s recent prioritization of advocacy efforts to ensure primary care providers throughout the country can have their autism diagnoses recognized9 suggests that an acceleration of clinical adoption may occur in the near future.

Conclusions

This analysis of 254 Canvas Dx uses highlighted device accuracy, feasibility, and utility across a variety of real-world contexts. Reducing the proportion of children requiring specialty referral and time-intensive evaluations is a critical step towards tackling diagnostic delays and getting children into the right services sooner. Future longitudinal research quantifying the extent of pre- and post-diagnostic cost savings associated with early streamlined diagnosis is recommended.