Introduction

Based on current data, 1 in 31 children in the United States is diagnosed with autism by age eight1. While early intervention is associated with the greatest benefits, many children experience multi-year delays to diagnosis. Despite reliable diagnosis being possible by 18 months2, the average age of diagnosis is currently five years3. For girls, delays are even greater, with an average age of diagnosis of 5.6 years3. Over-reliance on a dwindling specialist workforce4,5 has contributed to delayed evaluations, as has routine use of time-intensive assessments irrespective of case complexity6,7. A recent survey of autism specialty centers across the U.S.8 found that nearly two-thirds of specialty centers (61%) have wait times longer than 4 months. Of that group, 25% have waitlists of more than half a year, and 21% report waitlists of more than a year, or waitlists so full that they can no longer take new referrals. The same survey found that in the majority of centers (83%), evaluations take more than three hours, with evaluations extending up to 8 hours in a quarter of centers.

There is a growing call to expand the pool of clinicians able to conduct evaluations, as well as a recognized need to streamline the evaluation process itself, so that more children can be diagnosed equitably and accurately early on9,10. Multiple randomized controlled trials show that timely access to targeted early interventions leads to significantly greater cognitive, linguistic, and functional gains for children with autism, compared to lack of treatment, delayed treatment, or non-targeted treatment11. Even minor delays to treatment initiation have been shown to negatively impact outcomes, for example, starting therapies at 27 months versus 18 months of age12.

In response to this need for streamlined early diagnosis, Canvas Dx was developed and validated prospectively to empower a broader pool of clinicians to act rapidly upon first developmental concerns13. The first FDA-authorized diagnostic for autism of any kind14, Canvas Dx uses AI technology built on data from thousands of diverse children at risk for, and diagnosed with, developmental delays including autism. Device inputs were designed to capture the behavioral, executive functioning, and language and communication features maximally predictive of autism. Consistent with best-practice recommendations that evaluation for autism include both caregiver and clinician input, as well as direct observation of the child15, Canvas Dx integrates data from multiple sources (see Fig. 1) in its machine learning algorithm.

Fig. 1
figure 1

After downloading the Canvas Dx App on their smartphone, the child’s caregiver answers a brief question set about the child’s behavior and development (5 min). The caregiver also uploads two brief (1.5–5 min) videos of their child playing via the App. Videos undergo analysis and feature extraction. The child’s clinician answers a set of questions via the Canvas Dx clinician web portal (10 min). All inputs are fed through the machine learning algorithm. An output of positive, negative or indeterminate for autism is returned, along with an auto-generated detailed report mapping challenges to DSM-5 criteria relevant to autism diagnosis. The image depicts actors, not real study participants. Image copyright Cognoa Inc.

The device provides a positive or negative autism prediction in the majority of cases, as well as a detailed report for each child that helps identify developmental strengths and challenges, and maps data to DSM-5 autism criteria to better inform next steps. In cases where there is insufficient information to confidently provide a diagnostic prediction or rule out with high accuracy, the device produces an ‘indeterminate’ output. This diagnostic abstention mechanism allows for safer uncertainty management in cases where misclassification risks are the highest16. Explainable AI and the management of uncertainty have become central to AI in healthcare16,17. Arbitrary cut-offs that result in a binary classification are subject to error at the edge cases, particularly in the field of autism, where ambiguous presentations or multiple co-occurring conditions increase misclassification risk in binary screeners18,19. Having an indeterminate range or abstention feature may support greater clinician accuracy when evaluating complex autism cases: just as a clinician is able to say “I don’t know” when uncertain, AI-based devices are likely to operate more safely and transparently when they are not forced to produce a binary prediction in all cases20.
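This abstention mechanism can be sketched as a simple three-way decision rule; the threshold values below are hypothetical placeholders for illustration, not the actual Canvas Dx cut-offs.

```python
# Three-way output sketch: scores inside the abstention band return
# "indeterminate" rather than forcing a binary call. Threshold values
# here are hypothetical, not the actual Canvas Dx cut-offs.
def classify(score, neg_threshold=0.3, pos_threshold=0.7):
    """Map a risk score in [0, 1] to 'negative', 'indeterminate', or 'positive'."""
    if score <= neg_threshold:
        return "negative"
    if score >= pos_threshold:
        return "positive"
    return "indeterminate"

print(classify(0.5))  # "indeterminate": the device abstains on this case
```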

Based on clinical trial data13, in a study environment with an underlying autism prevalence of 29%, the device achieved a Positive Predictive Value (PPV) for autism of 80.8% (95% CI, 70.3–88.8) and a Negative Predictive Value (NPV) of 98.3% (95% CI, 90.6–100.0). Given examples of interventions failing to perform with equal accuracy outside of clinical trial settings21, and the underperformance of AI models in real-world settings in particular22, the purpose of this analysis was to determine how Canvas Dx is performing in real world settings, and to learn more about its impact on age of diagnosis, as well as the characteristics of device prescribers and patient users. Analysis of AI model performance in real-world contexts is a critical step towards ensuring safe and impactful clinical adoption22.

Methods

A de-identified aggregate data analysis of the initial 254 Canvas Dx prescriptions fulfilled in clinical settings post-market authorization was conducted to determine: what proportion of children received a determinate device output (positive or negative for autism); device PPV, NPV, sensitivity, and specificity compared to clinical reference standard; and key prescriber and patient characteristics. Real world performance metrics were then compared to previously published clinical trial device performance.

Sample: All patients who were prescribed Canvas Dx and completed all inputs needed to get a diagnostic result were included in this analysis. All patients were in the intended use population of the device, children 18 to 72 months of age with caregiver or health provider concern for developmental delay.

Ethics: The de-identified real world aggregate data analysis (PR015) was determined exempt by Advarra IRB. The previously published Canvas Dx clinical study protocol referenced in this analysis, and informed consent forms were reviewed and approved by a centralized Institutional Review Board (IntegReview IRB). Protocol Number: Q170886. IntegReview IRB granted approval of the study (protocol version 1.0) on 19 July 2019. IntegReview was subsequently purchased by Advarra IRB. Informed consent was obtained from all caregivers whose children participated in the clinical study. This study was registered on ClinicalTrials.gov (NCT04151290) prior to study initiation. All clinical study methods were carried out in accordance with relevant guidelines and regulations.

Real world data analysis

Clinical reference standard procedure

As part of its obligation to conduct continuous algorithmic performance monitoring, the device manufacturer tracks Canvas Dx performance against a panel of blinded, independent, board-certified child and adolescent psychiatrists, child neurologists, developmental-behavioral pediatricians, or child psychologists with more than 5 years’ experience in diagnosing autism. Two specialists, blinded to the device results and to the diagnostic call of their peer, evaluate the device inputs and determine if autism and/or other neurodevelopmental conditions are present based on DSM-5 criteria. In cases where the two specialists disagree, a third specialist (also blinded) reviews the data, and the majority decision determines the clinical reference standard diagnosis.
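The adjudication procedure above amounts to a two-reviewer majority vote with a blinded tiebreaker; a minimal sketch, where `True` denotes an autism-positive call:

```python
# Majority-vote adjudication sketch: two blinded specialists call the case
# independently; a third (also blinded) is consulted only on disagreement.
# True = autism present per DSM-5 criteria, False = absent.
def reference_standard(first, second, tiebreaker=None):
    if first == second:          # unanimous: no tiebreaker needed
        return first
    if tiebreaker is None:
        raise ValueError("disagreement requires a third blinded reviewer")
    return tiebreaker            # the third call creates a 2-1 majority
```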

Statistical analysis of device performance

The determinate rate was calculated as the proportion of prescriptions for which the device predicted positive or negative for autism, as opposed to abstaining. Because the device is not a binary classifier, abstention cases were analyzed separately from determinate cases. For determinate cases, PPV, NPV, sensitivity and specificity were calculated with the clinical reference standard consensus diagnosis for each case used as the true label. The corresponding 95% confidence intervals were generated for each metric. Fisher’s Exact Test was used to determine whether there was a statistically significant difference in device performance across biological sex and age groups for each of these metrics. As abstention cases represent neither a correct nor incorrect classification, sensitivity and specificity are not reported in the indeterminate sample. Instead, we calculated the percentage of indeterminate cases that received a positive or negative reference standard autism diagnosis, as well as the percentage indicated as being at risk for other neurodevelopmental conditions. These analyses were conducted on the indeterminate group as a whole, as well as on subsets of the indeterminate group stratified into low, moderate, and high autism risk. These risk groupings were derived by examining the distribution of positive and negative reference standard diagnoses across the range of device scores within the indeterminate zone. Score ranges that resulted in the lowest and highest observed prevalence of autism were assigned to the low- and high-risk groups respectively, and the middle range group was selected to maximize the separation in autism prevalence across the three categories.
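The four determinate-case metrics follow directly from the confusion counts. As a concreteness check, the sketch below uses the real world determinate-case counts reported later in this analysis (TP = 110, FP = 9, FN = 1, TN = 40); the confidence interval method used in the paper is not reproduced here.

```python
# Determinate-case performance metrics from confusion counts. The example
# counts are the real world determinate-case counts reported in this
# analysis (TP = 110, FP = 9, FN = 1, TN = 40).
def metrics(tp, fp, fn, tn):
    return {
        "ppv": tp / (tp + fp),          # precision of positive outputs
        "npv": tn / (tn + fn),          # precision of negative outputs
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

m = metrics(tp=110, fp=9, fn=1, tn=40)
print({k: round(v * 100, 2) for k, v in m.items()})
# -> {'ppv': 92.44, 'npv': 97.56, 'sensitivity': 99.1, 'specificity': 81.63}
```

These point estimates match the real world values reported in the Results.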

Analysis of decision thresholds

To examine the impact of decision thresholds on performance, the PPV, NPV, sensitivity, and specificity of the device were calculated for a range of decision thresholds resulting in determinate rates between 20% and 100%. The range of decision thresholds was selected by adjusting both the positive and negative threshold boundaries from the true device thresholds to achieve specific determinate rates. The determinate rates at which each performance metric becomes significantly different from the real world device performance were calculated.
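This sweep can be sketched as follows. For simplicity the abstention band below is widened symmetrically around a hypothetical score midpoint, whereas the actual analysis adjusted the positive and negative boundaries independently; the scores and labels are synthetic.

```python
# Threshold-sweep sketch: widen the abstention band to trade determinate
# rate against accuracy. The band grows symmetrically around a
# hypothetical midpoint of 0.5; scores and labels are synthetic.
def sweep(scores, labels, band_half_widths, center=0.5):
    results = []
    for w in band_half_widths:
        neg_t, pos_t = center - w, center + w
        det = [(s, y) for s, y in zip(scores, labels) if s <= neg_t or s >= pos_t]
        tp = sum(1 for s, y in det if s >= pos_t and y == 1)
        fp = sum(1 for s, y in det if s >= pos_t and y == 0)
        tn = sum(1 for s, y in det if s <= neg_t and y == 0)
        fn = sum(1 for s, y in det if s <= neg_t and y == 1)
        results.append({
            "determinate_rate": len(det) / len(scores),
            "ppv": tp / (tp + fp) if tp + fp else None,
            "npv": tn / (tn + fn) if tn + fn else None,
        })
    return results

# Wider bands abstain more often, shrinking the determinate rate.
res = sweep([0.1, 0.2, 0.6, 0.9], [0, 0, 1, 1], band_half_widths=[0.1, 0.3])
```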

Comparison to clinical trial data

Calculated real world performance metrics were then compared to clinical trial data to ensure that there was no degradation in device performance between clinical and real world settings, using Fisher’s Exact Test. Full details of the methodology used to derive the clinical trial performance metrics are described in previously published work23.

Results

Real world data analysis

Prescriber characteristics

At the time of data analysis, 100 unique prescribers had Canvas Dx prescriptions fulfilled. Prescribers were located in 20 different states, across 40 practices. The greatest numbers of prescriptions were generated in California (68), Virginia (43), and Florida (42). A breakdown of prescriber qualifications is included in Fig. 2.

Fig. 2
figure 2

Prescriber qualifications.

Patient characteristics

Based on clinical reference standard determination, the underlying autism prevalence in the sample was 54.7% (139/254). Over a quarter of the sample, 29.13% (74/254), were female. The median age of children evaluated with Canvas Dx was 37.2 months (range: 17.1–71.8 months). The median age of children who received a positive output was 33.7 months (range: 17.1–69.7 months).

Table 1 presents the demographic and clinical characteristics of the full study population, the population with a Negative ASD reference standard, and the population with a Positive ASD reference standard. Fisher’s Exact Test was used to assess whether there were statistically significant differences between the positive and negative ASD groups for each characteristic.

Table 1 Patient characteristics stratified by reference standard diagnosis.

Device performance

Nearly two-thirds of users (62.99%; 95% CI, 57.05–68.93%) received a determinate result. For determinate cases, compared to the reference standard, Canvas Dx had an NPV of 97.56% (95% CI, 92.84–100.0%) and a PPV of 92.44% (95% CI, 87.69–97.19%). Sensitivity and specificity were 99.1% (95% CI, 97.34–100.0%) and 81.63% (95% CI, 70.79–92.47%), respectively. Autism prevalence rates in the indeterminate group are displayed in Table 3. Data regarding the prescribing clinician’s final diagnosis were available for 41.1% of the 95 indeterminate cases. In the majority of these cases (76.9%), the prescribing clinician agreed with the reference standard diagnosis (21 positive cases and 9 negative cases). For the 23.1% of cases with disagreement between the prescribing clinician and the reference standard, the majority received a clinician-positive diagnosis and a negative reference standard (6 cases), while the rest received a clinician-negative diagnosis and a positive reference standard (3 cases).

Table 2 presents a contingency table comparing the reference standard diagnosis (Positive or Negative for ASD) to the device result (Positive, Indeterminate, or Negative). Counts reflect the number of cases falling into each combination of reference standard and device outcome.

Table 2 Contingency table.

Table 3 presents the percentage of individuals within the indeterminate device result group who received an autism diagnosis or had at least one documented risk factor for a neurodevelopmental condition other than autism. The data are stratified by autism risk level assigned within the indeterminate group: low, moderate, and high.

Table 3 Indeterminate autism risk group analysis.

Device performance by biological sex

For determinate cases there were no statistically significant differences in device performance between males and females at the 0.05 p value level. The rate at which the device produced a determinate versus indeterminate result also did not differ significantly between sexes at the 0.05 p value level (see Table 4).

Table 4 Device performance by biological sex.

Device performance by age

There were no statistically significant differences in NPV, sensitivity, specificity or determinate rate between the over 48 months of age and the under 48 months of age groups. The device had a statistically significant difference in PPV performance between age groups, with cases under 48 months of age achieving superior PPV (see Table 5).

Table 5 Device performance by age group.

Impacts of threshold adjustments on device performance

Fig. 3
figure 3

Fig. 3 illustrates the Best Device Performance line, which represents the theoretical determinate rate at which all accuracy metrics are maximized.

Figure 3 Impact of adjusting abstention thresholds: this figure demonstrates the change in PPV, NPV, sensitivity, and specificity as the abstention thresholds are adjusted to allow for a range of determinate rates. The Best Device Performance line represents the theoretical determinate rate at which all accuracy metrics are maximized. The Selected Determinate Rate line represents the current real world device performance with the abstention thresholds used in this study. The Significant Determinate Increase line represents the point at which the determinate rate becomes statistically significantly improved over the current real world device determinate rate. All other lines represent the point at which an accuracy metric statistically significantly decreases from real world performance.

Real world device performance comparison to clinical trial results

The demographic composition of our real world and clinical trial samples are included in Table 6.

Table 6 Clinical trial population characteristics vs. real world population characteristics.

Across all cases, PPV improved to a significant degree in real world performance. This improvement was driven by statistically significant improvements to PPV in the female and under-48-months-of-age demographics. Real world PPV performance for the male and over-48-months-of-age demographics was equivalent to clinical trial performance. Real world NPV performance was equivalent to clinical trial performance across all demographics. The real world determinate rate was significantly improved compared to the clinical trial determinate rate across all demographics (see Table 7). The sample of real world patients reflects the composition of the clinical trial sample for age and sex, though the real world patient sample had a significantly higher autism prevalence. This increased prevalence may drive some of the significant improvements to PPV, and the decreases in NPV.

Table 7 Clinical trial device performance vs. real world performance.
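The dependence of PPV and NPV on prevalence follows directly from Bayes’ rule. A sketch with purely illustrative sensitivity and specificity values (not the device’s actual operating characteristics) shows PPV rising, and NPV falling slightly, as prevalence increases:

```python
# PPV/NPV as a function of prevalence at fixed sensitivity/specificity
# (Bayes' rule). The 0.95/0.85 values are illustrative only, not the
# device's actual operating characteristics.
def ppv_npv(sens, spec, prevalence):
    p = prevalence
    ppv = sens * p / (sens * p + (1 - spec) * (1 - p))
    npv = spec * (1 - p) / (spec * (1 - p) + (1 - sens) * p)
    return ppv, npv

low = ppv_npv(sens=0.95, spec=0.85, prevalence=0.29)   # trial-like prevalence
high = ppv_npv(sens=0.95, spec=0.85, prevalence=0.55)  # real-world-like prevalence
# PPV is higher at the higher prevalence; NPV is slightly lower.
```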

Discussion

Principal results

In this analysis of real-world Canvas Dx use, the device provided highly accurate positive and negative outputs for autism that aligned with the specialist reference standard in the majority of cases. In a patient population with an autism prevalence of 54.7%, Canvas Dx had a high NPV (97.56%) and PPV (92.44%), providing a determinate output for 62.99% of children. Children in this analysis were provided a positive output more than 2 years (26.3 months) earlier than the current average age of autism diagnosis in the United States3. This finding highlights the substantial waitlist reductions that could be made by streamlining evaluations and recruiting a broader range of clinicians to participate in the autism evaluation process. Currently, the U.S. has only 758 developmental-behavioral pediatricians for 19 million children with developmental or learning challenges4, and 11 child and adolescent psychiatrists for every 100,000 children5. By empowering more clinicians to participate in autism evaluations, Canvas Dx can help to support definitive early action for a greater subset of children. Earlier answers, in turn, may enable initiation of targeted interventions during the critical early years of high brain neuroplasticity, when they have the greatest impact.

While device performance was consistent across biological sex for all metrics and across age groups for most metrics, PPV performance differed between older and younger age groups. Comparison of these real world results to clinical trial results23 suggests that this difference in PPV performance is due to substantially improved device performance in the younger age group, rather than degraded performance in the older age group. While girls comprised only 29.13% of the sample analyzed here, they represented 30.0% of children who received a determinate result, indicating proportional representation in determinate results across sexes. This finding is of critical importance given the existing inequities in autism diagnosis for girls in the U.S.3,24.

Economic and societal impacts

Robust data across numerous published studies support both the short- and long-term health and economic benefits of diagnosing children with autism earlier, so that treatments can begin in the critical neurodevelopmental window where they have the greatest impact25. A U.S. analysis of the potential medical and residential cost savings that could be realized with earlier initiation of evidence-based therapies for children with autism projects annual cost savings in excess of $23.8 billion, with savings of ~$8.5 billion and $2.6 billion in Federal and State Medicaid spending, respectively26. Canadian lifetime cost-effectiveness modeling per person with autism, based on eliminating the current 32-month wait time for intensive behavioral intervention (IBI) initiation, found substantial savings to government ($53,000 per person) and society ($267,000 per person)27.

Cost savings are realized not only in the post-diagnostic period, but also through reduction of unnecessary or untargeted treatments and poorly managed symptoms in the period between first concern and eventual diagnosis. A large US claims analysis28 of ~9,000 children with autism, for example, found that the mean all-cause medical cost per child was roughly twice as high for those with a longer time from first concern to diagnosis compared with those with a shorter delay ($5,268 vs. $2,525 per child in the younger age cohort and $5,570 vs. $2,265 per child in the older age cohort). Children who had a longer delay to diagnosis also experienced a greater number of both all-cause and autism-related health care visits compared with children who had a shorter delay. For example, the mean and median numbers of office or home visits were between 1.5x and 2x greater among children who had a longer time from concern to diagnosis28.

Limitations

Only data captured as part of routine device use were available for the real world analysis; therefore, we were unable to comment on subjective patient and provider experiences, satisfaction measures, or longitudinal diagnostic stability. Similarly, information on patient race/ethnicity and socio-economic status is not collected as part of routine clinical device use, so we could not conduct covariate analysis on these features. Pivotal trial results, however, did point to equitable device performance across race/ethnicity and socio-economic status23. More information on device performance across these covariates is currently being collected as part of a primary care integration study29.

In 37% of cases, the device abstained from making an autism prediction or rule out. As Fig. 3 demonstrates, adjusting determinate thresholds impacts both abstention and accuracy. Restricting determinate outputs to the 63.0% of cases with sufficient certainty prevents the degradation of device performance that is seen when adjusting abstention thresholds to allow for larger determinate rates. Increasing the determinate rate to 72.0% results in a statistically significant improvement in determinate rate over current real world performance (Fisher’s Exact Test p value 0.047) without any statistically significant decrease in accuracy metrics. The determinate rate can be further increased to 81.4% while maintaining statistically equivalent accuracy metrics. At this point, PPV drops significantly (Fisher’s Exact Test p value 0.039), and specificity decreases to a clinically significant degree though it maintains statistical equivalence. At this point, the number of indeterminates decreases from 95 to 49 cases, while the number of False Positives increases from 9 to 22 cases and the number of False Negatives increases from 1 to 4 cases. The number of True Positives increases from 110 to 120 cases, and True Negatives increase from 40 to 59 cases. The determinate rate can then be increased up to 94.7% before both PPV and sensitivity drop statistically significantly (Fisher’s Exact Test p value 0.042). While both NPV and specificity remain statistically equivalent to current real world performance, both metrics experience clinically significant decreases. Specificity and NPV performance are statistically maintained up to a 100% determinate rate. The real world device PPV remains statistically superior or equivalent to clinical trial performance at a 100% determinate rate.

All four metrics can achieve 100% performance, but this can only be realized by lowering the determinate rate to 20.9%. Though the numbers of False Positives and False Negatives decrease to 0, the number of indeterminate cases rises from 95 to 198. True Positives decrease from 110 to 40 cases, and True Negatives decrease from 40 cases to 15. Restricting the determinate rate to cases with even higher certainty would further improve device performance, but with the trade-off of providing fewer children with a determinate result. At a determinate rate of 52.97%, the number of children provided with a determinate result would be significantly decreased (Fisher’s Exact Test p value 0.047). The selected thresholds for this device therefore balance maximizing accuracy metrics against providing a determinate result to as many children as possible.

While allowing for a 37% abstention rate is arguably a limitation of the device, it aligns with calls from clinicians and statisticians alike to consider machine learning abstention in complex edge cases16,17. Abstention in such cases may represent a preferred method for addressing high uncertainty because it both minimizes misclassification and highlights challenging cases that may need further investigation18,19,20,30. This is particularly critical for conditions such as autism, where consequences of misclassification include a potential failure to receive treatment during the window of peak brain neuroplasticity. As demonstrated in Fig. 3, the selected Canvas Dx abstention thresholds were chosen to preserve device performance while providing determinate results to as many cases as can be classified with high certainty, though device performance would remain clinically useful at much lower abstention rates. For indeterminate cases, clinicians are still given access to the full Canvas Dx detailed report that includes DSM-5 patient-specific mapping. In this real world analysis we observed that, in the majority of indeterminate cases where the prescriber rendered a diagnostic call or rule out, it aligned with the blinded reference standard call. While this analysis demonstrates high device accuracy in real world settings, and an earlier average age of autism diagnosis with related potential cost savings, its full impact will likely not be felt until payors clarify how reimbursement will be achieved through comprehensive medical policy coverage. The AAP leadership’s recent prioritization of advocacy efforts to ensure primary care providers throughout the country can have their autism diagnoses recognized9 suggests that an acceleration of clinical adoption may occur in the near future.

Conclusions

This analysis of 254 Canvas Dx uses highlighted device accuracy, feasibility, and utility across a variety of real-world contexts. Reducing the proportion of children requiring specialty referral and time-intensive evaluations is a critical step towards tackling diagnostic delays and getting children into the right services sooner. Future longitudinal research quantifying the extent of pre- and post-diagnostic cost savings associated with early streamlined diagnosis is recommended.