Introduction

Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by challenges in social interaction, communication, and repetitive behaviors. About 1% of the world population1 and an estimated 5.4M people in the US have ASD2, which affects 1 in 54 children; roughly 25% of children with ASD go undiagnosed. ASD can be detected through early surveillance and developmental monitoring, but the process is long, elaborate, and tedious3, involving multiple screening stages at various ages, during which parents complete a series of checklists and questionnaires and children's development is monitored regularly. This is followed by more intensive questioning of parents and further screening to conclusively diagnose ASD and develop treatment plans. Early symptoms of ADHD often overlap with those of ASD, which can lead to misdiagnosis and delayed interventions for ASD.

Early diagnosis of ASD is crucial as it allows timely intervention, which can significantly improve developmental outcomes4. For milder forms of ASD like Asperger’s syndrome5, early diagnosis and treatment can help children develop healthy social and communication skills. However, accurate diagnosis is often delayed because ASD characteristics may not emerge until the disorder is well established.

Several non-invasive, early-detection technologies using artificial intelligence and machine learning6 are being explored for the accurate and timely diagnosis of ASD in children7. For example, facial features recorded during social interactions8 at the onset of the disorder display unique and differential characteristics9. Researchers have successfully used computer vision to train machine learning models that detect ASD from facial images of affected children with a reasonable degree of accuracy10. However, one major drawback of using facial images is the invasion of privacy, especially for minors, and the need to collect their images over prolonged periods of childhood. Attempts have also been made to diagnose ASD using retinal image11 and brain image12 analysis as objective screening methods. Recent approaches for ASD detection frequently rely on multimodal deep learning frameworks, notably transformer-based models such as BERT, CNN-based models such as Xception, and LSTM-based models such as WS-BiTM and Bat-PSO-LSTM. These methods predominantly employ structural Magnetic Resonance Imaging (sMRI) data, achieving high accuracy13,14,15,16. However, they present significant drawbacks: they are computationally expensive, require substantial data volumes, and rely on invasive data collection methods such as MRI scanning, raising ethical and privacy concerns17. Other studies, employing short-time Fourier transforms, have shown that audio-based analysis can detect subtle variations in speech that are challenging to encode in text form; for instance, features such as voice pitch modulation or spectral irregularities can be particularly useful for distinguishing ASD-related speech patterns18.

Unlike images, audio recordings, and MRI scans, speech transcripts can be leveraged to detect language disorders19 and ASD20 while maintaining privacy. Like fingerprints, every child's linguistic patterns are unique. Studies have shown that speech patterns in children with ASD deviate atypically, with varying levels of voice pitch and spectral content. One characteristic of autistic children is that they tend to repeat certain words and phrases over and over; these words and phrases are regular and simple, typical of the vocabulary of children in their age group. These speech abnormalities provide an excellent opportunity to use computational linguistics and machine learning21,22,23,24 for early detection of ASD.

With privacy and ethics in AI becoming increasingly pressing concerns, developing methods that protect the privacy of minors is of utmost importance. In this study, we evaluate the feasibility of privacy-preserving machine learning models for the early detection of ASD in children, focusing on the predictive power of linguistic features extracted from speech transcripts. It is important to note that in this study, 'privacy-preserving' specifically refers to the method's inherent protection of personal identity by avoiding the collection and use of audio or visual biometric data. By analyzing structured, text-based inputs, our work highlights the potential of non-invasive and privacy-respecting approaches for ASD detection. Our findings provide foundational insights into key linguistic features and their role in ASD detection, paving the way for future development of scalable, ethical, and effective diagnostic systems.

Materials and methods

Datasets

Spoken language, particularly among children, represents a critical yet underexplored domain in data collection and analysis compared to the extensive focus on written language and digital images. In recent years, efforts to compile and study natural speech datasets have gained traction, driven by advancements in machine learning and the growing recognition of spoken language’s diagnostic potential. For this study, we utilized the TalkBank system as the primary data source. TalkBank is the world’s largest open-access repository of spoken language data, widely used across disciplines such as education, medicine, linguistics, and psychology25. It provides a vast collection of multimedia language corpora, including data on children and adolescents. We leveraged two subsets from TalkBank: the CHILDES database and the ASDBank English corpus. These datasets were selected due to their relevance for our task of predicting autism spectrum disorder, as they include comparable methods and analyses26,27.

The dataset from Eigsti et al.28 included children aged 3-6 years, divided into three groups: typically developing (TD) children, children with non-ASD developmental delays, and children with ASD. The dataset comprised 48 participants, each engaged in a 30-minute free-play session. The cohort’s ethnic composition was as follows: 39 White, 3 LatinX, and 6 African-American children. The average participant age was 51 months, with a mean utterance length of 3.08 words and a median of 2.6 words per utterance.

The datasets from Nadig and Bang29,30 featured 38 children, including both ASD and TD participants, from English-speaking families. Natural language samples were collected over a year at three time points during parent-child interactions. The children with ASD were aged 36-74 months, while the TD children were younger, aged 12-57 months, reflecting the significant language delays typically observed in children with ASD compared to their TD counterparts.

These datasets were chosen for their relevance and quality based on several criteria: the inclusion of participants from the appropriate age groups for both ASD and TD cohorts, the satisfactory sample size compared to other available datasets, and the availability of data for both ASD and TD children within the corpus. The datasets utilized in this study (Eigsti and Nadig) are publicly accessible via TalkBank and are linked in the data availability section. TalkBank ensures compliance with ethical standards for data collection and explicitly requires researchers who contribute data to obtain informed consent from participants or their guardians and obtain necessary ethical approvals before data sharing. Data retrieval was facilitated through TalkBank’s APIs, enabling the extraction of participant details, word tokens, utterances, and transcript information. The raw data from both datasets were cleansed and prepared for analysis. Linguistic features, including stem words and parts of speech, were extracted and incorporated as features. The final datasets contained 52 features, encompassing participant demographics and linguistic attributes. An overview of the class distribution and some relevant statistics are presented in Table 1.
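To make the feature-extraction step concrete, the sketch below computes word-token and utterance-length statistics of the kind reported in Table 1 from CHAT-formatted transcript lines. It is a minimal illustration, not the study's pipeline: the speaker tier (CHI), the punctuation handling, and the input format are simplifying assumptions, and the actual work used TalkBank's APIs with additional cleaning steps.

```python
import statistics

def utterance_stats(chat_lines, speaker="CHI"):
    """Summarize utterance lengths for one speaker from CHAT-formatted
    transcript lines (e.g., '*CHI: more juice .')."""
    lengths = []
    for line in chat_lines:
        if line.startswith(f"*{speaker}:"):
            # Drop the speaker tier marker; keep the utterance body.
            utterance = line.split(":", 1)[1].strip()
            # Count word tokens, ignoring terminal punctuation marks.
            words = [w for w in utterance.split() if w not in {".", "?", "!"}]
            if words:
                lengths.append(len(words))
    return {
        "total_words": sum(lengths),
        "mean_utterance_len": statistics.mean(lengths),
        "median_utterance_len": statistics.median(lengths),
    }

print(utterance_stats(["*CHI: more juice .", "*MOT: you want juice ?", "*CHI: juice ."]))
```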

Table 1 ASDBank Participant Profiles and Key Linguistic Features. This table summarizes demographic and speech transcript characteristics from the Eigsti and Nadig datasets used for ASD screening. The table includes participant counts and key linguistic measures: Total Words, Mean, and Median utterance lengths. Definitions: TD: Typical Development; DD: Delayed Development; ASD: Autism Spectrum Disorder; Age: age in months; Total Words: total words spoken; Mean: average number of words per utterance; Median: median number of words per utterance.

Machine learning models

To perform our experiments, we selected three machine learning models with varying orders of complexity and a high degree of explainability.

Data preprocessing included cleansing, handling missing values by imputing means, and encoding categorical features as integers. To address class imbalance, we applied the Synthetic Minority Oversampling Technique (SMOTE), which synthesizes new instances of minority classes to balance the dataset. SMOTE was configured to re-sample all classes except the majority class. This resampling improved our models' performance by up to 3%.
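A minimal sketch of this configuration with the imbalanced-learn library is shown below; the toy arrays stand in for the engineered feature matrix and are not the study data.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Toy stand-ins for the feature matrix and class labels (not the study data).
X = rng.normal(size=(48, 5))
y = np.array([0] * 30 + [1] * 12 + [2] * 6)  # imbalanced classes, e.g., TD/ASD/DD

# sampling_strategy='not majority' oversamples every class except the
# majority class, matching the configuration described in the text.
smote = SMOTE(sampling_strategy="not majority", random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))  # each class now matches the majority count (30)
```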

To ensure that the synthetic samples did not introduce bias or unrealistic data patterns, we evaluated the distributions of the features before and after SMOTE using the Kolmogorov-Smirnov (KS) test. The KS test is a nonparametric method that compares the cumulative distribution functions of two samples: the KS statistic measures the maximum deviation between the two distributions, and a high p-value (typically above 0.05) indicates no significant difference between them31,32. In our study, we obtained KS statistics below 0.15 with corresponding p-values of around 0.99 for all features. These results indicate that the distributions of the original minority class and the SMOTE-generated samples are nearly identical, implying that the oversampling did not introduce bias or unrealistic data patterns.
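This check can be reproduced with SciPy's two-sample KS test, as sketched below; the two arrays are hypothetical stand-ins for the original minority-class rows and the SMOTE-generated rows.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Hypothetical stand-ins: original minority-class rows vs. SMOTE-generated rows.
orig = rng.normal(size=(12, 5))
synth = rng.normal(size=(18, 5))

# ks_2samp compares the empirical CDFs of the two samples, feature by feature.
for j in range(orig.shape[1]):
    stat, p = ks_2samp(orig[:, j], synth[:, j])
    print(f"feature {j}: KS statistic = {stat:.3f}, p-value = {p:.3f}")
```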

Hyperparameter tuning was performed for each model using grid search combined with 5-fold stratified cross-validation. Specifically, for the Logistic Regression model, the maximum number of iterations was found to be optimal at 750. For the Random Forest classifier, the number of trees and the maximum depth were tuned, with the best performance obtained using 200 trees and a maximum depth of 2. For TabNet, hyperparameters including batch size and learning rate were systematically explored, with optimal performance achieved at a batch size of 8. This approach ensured robustness and maximized accuracy within the constraints of the available datasets.
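The sketch below illustrates this tuning procedure with scikit-learn for the first two models; the candidate grids beyond the values named above, the synthetic data, and the roc_auc scoring choice are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Toy stand-in for the 52-feature dataset (not the study data).
X, y = make_classification(n_samples=86, n_features=52, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

searches = {
    "logistic_regression": GridSearchCV(
        LogisticRegression(), {"max_iter": [250, 500, 750, 1000]},
        cv=cv, scoring="roc_auc"),
    "random_forest": GridSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100, 200], "max_depth": [2, 4, 8]},
        cv=cv, scoring="roc_auc"),
}
for name, search in searches.items():
    search.fit(X, y)  # exhaustive grid search with stratified 5-fold CV
    print(name, search.best_params_, round(search.best_score_, 3))
```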

The 3 classifiers used in our analysis are detailed below:

  • Logistic Regression (LR): a widely used algorithm for binary classification problems. It models a linear relationship between independent variables and a binary dependent variable, offering interpretable results for categorical predictions33.

  • Random Forest (RF): an ensemble learning method that combines multiple decision trees to enhance classification accuracy and reduce overfitting. By sub-sampling the dataset during training, RF ensures robustness and reduces variance in predictions34.

  • Tabular Neural Network (TabNet): TabNet is designed for tabular data and employs sequential attention to prioritize relevant features during learning. This approach improves interpretability and computational efficiency35; a brief usage sketch follows this list.
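The sketch below shows how TabNet might be trained with the pytorch-tabnet package, assuming that is the implementation used; the synthetic data and all settings other than the tuned batch size of 8 are illustrative.

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the 52-feature dataset (not the study data).
X = rng.normal(size=(86, 52)).astype(np.float32)
y = rng.integers(0, 2, size=86)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

clf = TabNetClassifier(seed=0)
# batch_size=8 matches the tuned value reported above; other settings are illustrative.
clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], max_epochs=50, patience=10,
        batch_size=8, virtual_batch_size=8)
probabilities = clf.predict_proba(X_va)  # class probabilities for evaluation
```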

Results

Predictive performance

To evaluate our models, we applied 5-fold stratified cross-validation on two datasets from ASD TalkBank. The Nadig dataset had a total of 38 participants, comprising children with ASD and TD children. SMOTE was used only on the Nadig dataset due to the imbalance in its classes (ASD vs TD). The Eigsti dataset is more comprehensive and includes data on children with Delayed Development (DD). This is important to consider since, in many instances, ASD and DD characteristics overlap, leading to misdiagnosis. A summary of the number of participants and their characteristics is provided in Table 2.

Table 2 Dataset Composition from TalkBank. This table provides a breakdown of the Eigsti and Nadig datasets, detailing the number of participants in each group (ASD: Autism Spectrum Disorder, TD: Typical Development, and DD: Delayed Development).

We used 6 different metrics to evaluate the performance of our models and establish their effectiveness. Precision and Recall are highly relevant in medical evaluation and are consolidated using the F1 score to make performance comparison easier. Together with the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) score, these metrics are also widely used to evaluate skewed datasets. Accuracy is a general metric used to evaluate classifier performance. P-values are also used in medical research to determine whether a sample estimate significantly differs from a hypothesized value. Below is a more detailed description of the metrics we used, along with the formulas to compute them.

Precision: measures the proportion of positive predictions that are correct. High precision ensures that when the model predicts a child has ASD, it is likely to be correct36.

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$

Recall: measures the proportion of actual positive cases that are correctly identified. High recall is important since we want to ensure that children with ASD are identified accurately, minimizing the risk of undiagnosed cases and enabling timely interventions36.

$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$

F1 Score: The F1 score balances Precision and Recall to evaluate classification performance, particularly in imbalanced datasets, by accounting for both false positives and false negatives.

$$\begin{aligned} F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$

Accuracy: measures the proportion of all predictions, positive and negative, that the model gets right36.

$$\begin{aligned} Accuracy = \frac{TP + TN}{ TP + FN + TN + FP} \end{aligned}$$

P-Value: Represents the probability that the observed performance of the model (such as its accuracy or ROC-AUC) would have occurred purely by chance, assuming that there is no true effect (i.e., the model is performing at chance level). A p-value below conventional thresholds (e.g., p < 0.05 or p < 0.01) indicates that the model's performance is statistically significant, meaning that it is very unlikely that the observed results are due to random variation alone31.

Area under the Receiver Operating Characteristic Curve (ROC-AUC): the ROC is a probability curve, and the AUC represents the degree of separability it achieves. It indicates how capable the model is of distinguishing between classes: the higher the AUC, the better the model is at distinguishing between children with and without ASD37,38,39.

We used the following definitions to compute the evaluation metrics; a computational sketch follows the list:

  • True Positives (TP): the number of children with ASD correctly predicted by the model.

  • False Positives (FP): the number of children without ASD incorrectly predicted by the model to have ASD.

  • True Negatives (TN): the number of children without ASD correctly predicted by the model.

  • False Negatives (FN): the number of children with ASD that the model fails to identify.
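A minimal sketch of how these metrics could be computed with scikit-learn is shown below. The synthetic data and the logistic regression stand-in are assumptions, and since the text does not specify how the p-values were obtained, the permutation test shown here (permutation_test_score) is just one plausible way to attach a chance-level p-value to a cross-validated score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     permutation_test_score)

# Toy stand-in for the 52-feature dataset (not the study data).
X, y = make_classification(n_samples=86, n_features=52, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=750)

# Out-of-fold predictions yield TP/FP/TN/FN counts across all five folds.
y_pred = cross_val_predict(clf, X, y, cv=cv)
y_score = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
print("precision:", precision_score(y, y_pred))
print("recall:   ", recall_score(y, y_pred))
print("F1 score: ", f1_score(y, y_pred))
print("accuracy: ", accuracy_score(y, y_pred))
print("ROC-AUC:  ", roc_auc_score(y, y_score))

# Permutation test: refit on label-shuffled data to estimate how often a
# chance-level model matches the observed cross-validated accuracy.
score, _, pvalue = permutation_test_score(clf, X, y, cv=cv, scoring="accuracy",
                                          n_permutations=500, random_state=0)
print("accuracy:", score, "p-value:", pvalue)
```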

Results from our experiments for each of the classifiers using the 6 performance metrics are captured in Table 3. Of particular significance is the consistency of results achieved for each of these models across the various performance measures.

Logistic regression achieved strong performance on the smaller binary datasets (Nadig and Eigsti), with ROC-AUC scores of 0.93 and 0.87, respectively, suggesting that the model was effective at distinguishing between ASD and TD cases in these datasets. When the datasets were merged, TabNet outperformed the other models, achieving a ROC-AUC score of approximately 0.96, which aligns with the tendency of deep learning models to benefit from larger datasets. For the multi-class Eigsti dataset, Random Forest attained the highest ROC-AUC score of 0.71, indicating its ability to capture more nuanced relationships between ASD, TD, and DD classes despite the smaller sample size per class. As shown in Table 3, the p-values for all models are less than conventional thresholds (p < 0.05), indicating the reliability and statistical significance of the observed model results.

Given the limited sizes of the Eigsti (48 participants) and Nadig (38 participants) datasets, there is a potential risk of overfitting, which could limit the generalizability of our models. However, when we compared training accuracy to validation accuracy, we found a difference of no more than 4%, indicating that our models are not overfitting.
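This train/validation comparison can be read directly off scikit-learn's cross_validate output, as in the sketch below; the data and model are again toy stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Toy stand-in for the merged dataset (not the study data).
X, y = make_classification(n_samples=86, n_features=52, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# return_train_score=True exposes per-fold training accuracy alongside
# validation accuracy, so the overfitting gap can be inspected directly.
res = cross_validate(LogisticRegression(max_iter=750), X, y, cv=cv,
                     scoring="accuracy", return_train_score=True)
gap = res["train_score"].mean() - res["test_score"].mean()
print(f"train/validation accuracy gap: {gap:.3f}")  # small gap: little overfitting
```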

Table 3 Performance Metrics Across ASD Datasets: Summary of Precision, Recall, F1 Score, Accuracy, ROC-AUC, and P-value for Logistic Regression, Random Forest, and TabNet models evaluated using 5-Fold Stratified Cross-Validation on the Nadig, Eigsti, Eigsti Multi, and Merged datasets. Metrics were calculated with standard formulas and validated through cross-validation to ensure reliability. Bold values indicate the highest score for each metric within the corresponding dataset.

Feature importance

Given Logistic Regression’s strong performance across both individual and merged datasets, we leveraged its interpretability to analyze the relationship between key features and ASD classification. During the 5-fold cross-validation process on the merged dataset, we recorded the feature coefficients obtained from each fold and stored them in an accumulator. After completing all folds, we computed the average of these coefficients to obtain a final estimate. By examining the coefficients with the largest magnitudes, we identified the most influential features and their potential association with ASD.
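The coefficient-averaging procedure described above can be sketched as follows; the synthetic data are a stand-in, and the hyperparameters mirror the values reported earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the merged dataset (not the study data).
X, y = make_classification(n_samples=86, n_features=52, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

coef_accumulator = []
for train_idx, _ in cv.split(X, y):
    model = LogisticRegression(max_iter=750).fit(X[train_idx], y[train_idx])
    coef_accumulator.append(model.coef_[0])  # one coefficient per feature

# Average the per-fold coefficients, then rank features by magnitude.
mean_coefs = np.mean(coef_accumulator, axis=0)
top = np.argsort(np.abs(mean_coefs))[::-1][:4]
print("top feature indices:", top, "coefficients:", mean_coefs[top])
```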

We analyzed the top features that were identified by the Logistic Regression model and observed that while certain parts of speech (POS) features were among the highest-ranking, they were excluded to minimize the overall feature set. POS features contribute a large number of variables, and their removal did not lead to a significant drop in performance. To further validate the importance of the remaining top features, we trained a model using only the top four non-POS features and achieved nearly the same accuracy of 86%, reinforcing their relevance in distinguishing ASD cases. These features are:

  • MLU (Mean Length of Utterance): Measures the average number of morphemes per utterance spoken by the child. A morpheme is the smallest meaningful unit in a language (e.g., “in,” “come,” “-ing,” forming “incoming”)40,41. A lower MLU is associated with a higher likelihood of ASD, indicating that children with ASD may use shorter and simpler utterances.

  • MLT Ratio (Mean Length of Turn Ratio): Represents the ratio of the mean length of a child’s turn to that of the mother or investigator. A lower MLT ratio is correlated with ASD, suggesting that children with ASD contribute less in conversational exchanges compared to their counterparts.

  • Age: A positive coefficient suggests that older children in these cohorts are more likely to be classified with ASD, reflecting the age composition of the datasets.

  • Sex: A positive coefficient for this feature indicates that male children are more likely to be classified with ASD, consistent with broader epidemiological trends.

Fig. 1 Feature Importance in ASD Prediction: Averaged Logistic Regression Coefficients from 5-Fold Stratified Cross-Validation on the Merged Dataset. This figure was generated using Python's Matplotlib library and visualizes the relative contributions of key features (e.g., Mean Length of Utterance, Mean Length of Turn Ratio, Age, Sex) to ASD classification.

As shown in Fig. 1, our analysis of these features suggests that a decrease in the child's MLU and MLT_ratio is likely suggestive of ASD. The importance of the MLU and the MLT_ratio aligns with prior research showing that children with ASD tend to have shorter and less reciprocal conversational patterns. In addition, older and male children were more likely to be classified with ASD, consistent with broader epidemiological trends.

Discussion

In this study, we evaluated the effectiveness of privacy-centered machine learning models for ASD detection using children’s speech transcripts. Our approach demonstrated strong predictive performance, with accuracy exceeding 86%, and provided valuable insights into the importance of key linguistic features for distinguishing ASD characteristics. By using structured, text-based inputs, our study highlights the potential of speech transcripts as a non-invasive and privacy-preserving alternative for ASD detection, addressing the ethical challenges associated with sensitive audio or visual data.

A survey of current ASD detection methods highlighted that traditional machine learning models such as Logistic Regression and Random Forests still outperform more complex models, particularly when applied to structured data17,42. Our study reinforces these findings by reporting comparable performance while explicitly enhancing interpretability through feature importance analysis. A key aspect highlighted in the existing literature is the need for explainability, as models used for clinical decisions must provide insights into their reasoning17. Interpretability remains a challenge in many current deep learning approaches, including the multimodal transformer and CNN-based models, which operate as “black-box” methods and do not elucidate their decision-making processes effectively, limiting their clinical adoption13,14,17.

A central contribution of this study is the identification of a concise set of linguistic features that have the potential to aid in ASD detection while minimizing the need for extensive data collection. By isolating key features, such as Mean Length of Utterance (MLU) and Mean Length of Turn Ratio (MLT Ratio), our findings suggest that a more privacy-conscious approach to ASD screening could be feasible. This approach reduces the amount of sensitive information required, aligning with ethical considerations while maintaining the potential for meaningful insights into linguistic patterns associated with ASD.

Using speech transcripts as input provides a structured, text-based approach that inherently reduces privacy concerns. Transcripts avoid potential exposure of sensitive or identifiable audio data, ensuring a higher level of confidentiality. However, they may lack some critical acoustic details, such as prosodic features, which are often indicative of atypical speech characteristics in individuals with ASD18. These features include variations in pitch, rhythm, and intonation, which are difficult to capture solely through textual representation.

One of the limitations of our study was the size and diversity of the datasets, particularly the limited age range (3-6 years) of the children included. Expanding the dataset to include a broader age group and more diverse environments and dialects or performing transfer learning from larger speech-based datasets would improve the generalizability and performance of the models17. Additionally, the lower performance observed for the Eigsti (multi) dataset underscores the need to identify and integrate features that capture nuances specific to developmental delays (DD) to reduce the risk of misdiagnosis.

Future research could explore how the identified linguistic features can be leveraged in practical applications, such as the development of interactive conversational systems for early ASD screening. For instance, chatbots or robotic assistants equipped with conversational AI could enable scalable, non-invasive diagnostic assessments, while simultaneously offering personalized recommendations for follow-up or intervention. In addition, multi-modal approaches combining linguistic features with limited prosodic cues could be explored to balance the valuable information prosodic features provide with the privacy that text-based data affords.