Main

The importance of knowing whether there are clinical phenotypic differences between girls and boys with autism cannot be overstated. Clarity on this topic can lead to improved early-age detection and diagnostic procedures, reveal new insights into causes and mechanisms, point to more efficacious early-age treatment protocols, aid in differential planning and targets for intervention and more generally provide enhanced sex and gender equity in society1. Examination of early-age sex differences within the context of longitudinal data in particular can generate important insights into sex-specific developmental trajectories, benefiting parents and clinicians in understanding progression and course. Inconsistent and contradictory reports on sex differences in autism are the source of debate inside and outside the scientific and clinical communities. Some studies report symptom differences2,3 and some do not4; some report cognitive differences5 and some do not6; some report female versus male differences in autism spectrum disorder (ASD) that are similar to sex differences in those who are typically developing (TD)7 and some do not4.

For example, females with ASD have been reported to have superior social skills8,9, subtler autistic traits10, fewer restricted and repetitive behaviours (RRB)2,11,12 and superior language abilities compared with males with ASD13. In contrast, other studies have reported that girls with autism have more social difficulties6,14,15 and display more intellectual disability than males with autism16. Yet, still other studies report that females and males with ASD are similar in cognition17,18, attention19 and behavioural traits associated with ASD20,21. Thus, there is currently a pronounced lack of clarity on the question of clinical phenotypic differences between girls and boys with autism, and this has led to widely divergent causal and developmental theories, such as the protective effect theory22,23 and the extreme male brain (EMB) theory24. This has also led to concerns about equity in detecting and diagnosing ASD in girls and providing clinical care for girls with ASD25. This situation may be less due to the biological and/or clinical complexity of sex and autism factors and more due to study design and small sample limitations.

There is now considerable evidence suggesting that ASD begins during prenatal life26,27,28. Within this context, examination of sex differences at the earliest ages possible is essential given the large impact of very early experience on phenotypic expression29. For example, in one study, changes in the quality of the home environment dramatically influenced cognitive profiles in infants originally placed in an orphanage by as much as 15 IQ points30. In another study, the receptive language ability of children with autism was considerably improved if exposure to a particular intervention occurred at or before 18 months of age31. Although studying extremely young individuals does not entirely eliminate the confound of experience, it provides a snapshot of ASD more proximal to the disorder’s onset, which may generate results that are more biologically, and less experientially, driven. In contrast, studies of older children and adults, while important for understanding the condition and complementary to studies with infants and toddlers, cannot necessarily disentangle the influence of experience. However, more sex differences are reported in studies involving older children3,10,32,33, highlighting the importance of investigating sex differences at early ages.

Overall, limitations in previous studies span a wide range. The great majority of studies have used small samples (n = 28–96) and thus lack statistical power6,7. Very few studies examined the earliest ages possible, when effects are closest to the prenatal beginnings of autism27,28,34 and before a host of experiential effects could come into play. Ascertainment bias weakens the generalizability of sex difference findings. For instance, ascertainment from clinic-referred samples may be biased towards those who are more cognitively and socially affected, as in a recent study where 98% of toddlers scored in the cognitively impaired range with an early learning composite (ELC) score of <70 (ref. 17). Similarly, the ascertainment of infant siblings of children with autism from within multiplex families may also be unrepresentative of the general ASD population4,5. Assessment limitations provide incomplete or weakly validated clinical phenotypic information about social symptom severity, language ability, cognitive level and social attention behaviour. For instance, some studies used parent report rather than gold-standard expert-based assessment tools35 and some reported cognitive but not ASD symptom test results17 and vice versa3. Few examined longitudinal changes in sex effects36,37. Some studies lack comparison with TD participants and/or participants with developmental delay (DD), thus limiting the interpretation of sex differences and sex-effect specificity in ASD12,18. Lastly, nearly all studies lacked unsupervised patient subtyping, which enables data-driven analyses of phenotypes in girls and boys with ASD across different levels of clinical ability from higher to lower.

We utilized our general San Diego County-wide population-based screening approach, Get SET Early38, to screen, evaluate and identify a large cohort of females and males with early-age ASD. This method generated a large single-site study sample containing n = 2,618 toddlers (1,539 with ASD and 1,079 without ASD), including toddlers with ASD as young as 12 months of age (Table 1). Importantly, the Get SET Early method simultaneously enabled us to recruit, examine and assess, in a comparable and unbiased way, TD girls and boys as well as girls and boys with developmental delay (DD) but not ASD, for comparison with the toddlers with ASD. We further leveraged data integration and machine learning approaches to examine heterogeneity in males and females with ASD, utilized a social attention eye-tracking test and leveraged longitudinal data to examine sex-specific early-age psychometric and social symptom developmental changes associated with ASD.

Results

Primary analysis

Two-way analyses of variance (ANOVAs) across all subscales revealed significant omnibus F-test results for the entire model (Supplementary Table 1) and we examined our hypotheses when appropriate (see Methods for detailed steps and procedures).

Sex differences in ASD

Although the parent-based screening tool used to recruit toddlers to the study (that is, the Communication and Symbolic Behavior Scales (CSBS) Infant-Toddler Checklist39) revealed lower screen scores in boys with ASD than girls with ASD, we found no significant sex differences in 17 of the 18 standardized test scores presented in Fig. 1. Similarly, on the GeoPref eye-tracking test, no sex difference among girls and boys was found, as reported in our earlier paper40 (Fig. 2a). Only for the Vineland Adaptive Behavior Scales (Vineland)41 daily living skills subdomain did girls with ASD score significantly higher than boys with ASD (Extended Data Table 1).

Fig. 1: Comparison of sex differences in ASD and TD groups across various test subscales.

The bars represent performance differences between males and females, with longer bars indicating stronger statistical significance (lower P values). Any bar crossing the red dashed line has a P value of <0.05. Asterisks denote significant sex differences. In all cases where a significant difference is marked, girls outperformed boys. For the ADOS, where lower scores indicate less impairment, outperformance means lower scores for girls. For the Vineland scales, where higher scores represent better abilities, outperformance means higher scores for girls. All of the tests used to determine significance were two sided. Either a Kruskal–Wallis chi-squared test or a t-test was used, as appropriate (Extended Data Table 1). Multiple comparisons were corrected for using the false discovery rate. comm, communication; DailyLiving, daily living skills; EarlyGes, early gestures; exprLang, expressive language; FineMotor, fine motor skills; LaterGes, later gestures; MtrSkls, motor skills; recepLang, receptive language; SA, social affect; TotalGes, total gestures; VisRec, visual reception; WProd, words produced; WUnd, words understood.
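The false discovery rate correction mentioned in this caption can be illustrated with a minimal sketch. The caption does not state which procedure was used; the Benjamini–Hochberg step-up procedure is assumed here because it is the most common choice, and the P values below are hypothetical, not study data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for four subscale comparisons (not study data).
raw = [0.01, 0.04, 0.03, 0.20]
q = benjamini_hochberg(raw)
significant = [value < 0.05 for value in q]
```

In this toy example, two comparisons with raw P values below 0.05 no longer pass the threshold after adjustment, which is why corrected and uncorrected results can differ across a battery of subscales.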

Fig. 2: Primary analysis of sex differences in ASD and TD groups.

a, Top, images from the GeoPref eye-tracking test—a tool that analyses visual stimulus preferences in toddlers. Bottom, bar graphs showing that there were no statistically significant sex differences in the results of this test in the group with ASD or the TD group (Extended Data Table 1). Each box plot illustrates the data distribution, with the centre line representing the median, the box edges indicating the interquartile range (IQR; 25th–75th percentiles) and the whiskers extending to the smallest and largest values within 1.5× the IQR. b, Test subscales that exhibited statistically significant interactions between sex and diagnostic group, with the TD group showing sex differences (Supplementary Table 1). Mullen subscales show standardized (Std.) age-equivalent scores. The data are presented as mean values ± CIs; see the primary analysis section of Table 1 for sample sizes. Please see the datasets df.match.geopref.csv (a) and df.match.ados2.csv, df.match.mul2.csv, df.match.wg.csv and unq.ws2.csv (b) in GitHub (Primary Cross-Sectional Analysis→Data; https://github.com/ACE-UCSD/Autism-Sex-Differences-Analysis-Pathway). Images in panel a © 2003 Gaiam Americas, Inc., courtesy of Gaiam Americas, Inc. and Fit For Life, LLC.

Sex differences in TD

Similar to toddlers with ASD, on the parent-based CSBS checklist, TD boys had poorer screen scores than girls on social, symbolic and total score composites. However, in contrast with toddlers with ASD, ten of the 18 standardized test comparisons in Fig. 1 showed significant sex differences, with TD girls performing better than TD boys. On the Autism Diagnostic Observation Schedule (ADOS)42, in the TD group girls had slightly better scores than boys. On the Mullen Scales of Early Learning (MSEL) test, TD girls had higher scores than TD boys for fine motor skills, visual reception and receptive language. Similarly, on the MacArthur-Bates Communicative Development Inventories (CDI)43 words and gestures (WG) analysis, TD girls had higher scores on later gestures and total gestures. On the CDI words and sentences (WS) analysis, TD girls also had higher scores on the words produced subscale. On the GeoPref eye-tracking test, there was no sex difference between TD girls and boys. In addition, on the Vineland parent questionnaire, TD girls scored better than TD boys for daily living skills, as well as the adaptive behaviour composite (ABC) and socialization (Extended Data Table 1).

It is noteworthy that the ANOVA results indicated significant interactions between sex and group for only four subscales: ADOS social affect, MSEL visual reception, MSEL receptive language and CDI–WS words produced. These interaction effects can be described as subtle but significant sex differences in the TD group, with a female advantage over males, but a lack of sex differences in the group with ASD. For the remaining subscales showing statistically significant differences, these differences were driven by the significant main effect of sex. Therefore, sex differences in the ASD and TD groups should be interpreted independently for these other measures (Fig. 2b).
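The sex-by-group interaction described above can be read as a difference of differences between cell means. The sketch below is illustrative only: it computes that point estimate for a 2 × 2 layout and does not perform the F-test used in the study. The function name and all scores are hypothetical.

```python
from statistics import mean


def interaction_contrast(scores):
    """Return the TD sex gap, the ASD sex gap and their difference."""
    td_gap = mean(scores[("TD", "F")]) - mean(scores[("TD", "M")])
    asd_gap = mean(scores[("ASD", "F")]) - mean(scores[("ASD", "M")])
    # A non-zero difference of gaps is what an interaction term captures.
    return td_gap, asd_gap, td_gap - asd_gap


# Hypothetical standardized scores, invented for illustration only.
toy = {
    ("TD", "F"): [52, 55, 58],
    ("TD", "M"): [48, 50, 52],
    ("ASD", "F"): [30, 35, 40],
    ("ASD", "M"): [31, 35, 39],
}
td_gap, asd_gap, contrast = interaction_contrast(toy)
```

With these invented numbers the TD gap is 5 points and the ASD gap is 0, mirroring the qualitative pattern reported for the four subscales with significant interactions.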

Sex differences in DD

There were almost no sex differences between girls and boys with DD. Girls performed slightly better than boys on ADOS RRB, with a negligible effect size, and scored slightly better on Vineland motor skills, with a small effect size (Extended Data Table 2).

Cluster analysis

After reviewing the results of 26 techniques to determine the number of clusters (Supplementary Table 2), there was a tie between two- and three-cluster solutions. We proceeded with analyses of three clusters. Based on our experience of clustering children with ASD and TD toddlers using similarity network fusion (SNF), the first obvious result of SNF would be a two-cluster solution, as children with ASD and TD toddlers tend to separate easily based on their scores. However, in this study, we opted for a three-cluster solution, which resulted in high, medium and low clusters, two of which contained both individuals with ASD and TD toddlers. This approach allowed us to capture a broader range of heterogeneity within the sample, providing more detailed and nuanced insights into the varying characteristics and different levels of traits. The three clusters resulted in toddlers with high, medium and low performance in the social, language and motor domains. Toddlers with ASD spanned all three cluster performance levels, but TD toddlers were only present in the high and medium performance clusters, as expected.

Validation of clusters

We trained SNF44 on 80% of the data (1,337 participants) and tested it on the remaining 20% (336 participants) (Methods). The three clusters were consistently observed in both the training and test datasets (Fig. 3). To validate separation of the clusters and assess differences between them, we conducted ANOVAs and multiple pairwise comparisons on the training and test clusters. The results showed a significant omnibus ANOVA and significant pairwise comparisons across all three clusters, spanning the variables used to construct the SNF across the three data layers. Additionally, in predicting the test data clusters using the trained SNF model, we observed a similar structure to that of the training data. This included toddlers with ASD in the high-, medium- and low-ability clusters and TD toddlers in the high- and medium-ability clusters. Furthermore, the proportions of the test sample assigned to the clusters were relatively comparable to those of the trained model (Supplementary Tables 3–6). We computed silhouette scores for both the training and test data clusters to further validate the cluster separation. The results showed overall high scores of 0.46 and 0.40 for the training and test clusters, respectively, indicating distinct and well-separated clusters (Fig. 4 and Supplementary Table 7).
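For readers unfamiliar with silhouette scores, the following minimal sketch shows how an average silhouette width can be computed. It assumes Euclidean distance between feature vectors and at least two clusters; the study may instead have derived distances from the fused SNF network, and the points below are hypothetical.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width using Euclidean distance (an assumption)."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        # Mean distance from point i to each cluster, excluding itself.
        mean_dist = {}
        for c in clusters:
            members = [q for j, q in enumerate(points)
                       if labels[j] == c and j != i]
            if members:
                mean_dist[c] = sum(dist(p, q) for q in members) / len(members)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: width taken as 0
            widths.append(0.0)
            continue
        a = mean_dist[own]  # cohesion: distance to own cluster
        b = min(d for c, d in mean_dist.items() if c != own)  # separation
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical 2-D points forming two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])
bad = mean_silhouette(toy_points, [0, 1, 0, 1])
```

Silhouette widths range from −1 to 1. Labels that match the geometry give values near 1, whereas labels that cut across the groups give negative values.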

Fig. 3: Cluster analysis.

a,b, Graphs for the training set (a) and test set (b) illustrating the separation of high, medium and low clusters across three domains: social, motor and language (Supplementary Tables 3–6). For each domain, violin plots display the density and distribution of scores for the group with ASD and TD toddlers, separated by sex. Individual data points are overlaid on the violin plots to highlight the variability within each cluster. The social domain includes the measures ADOS social affect and Vineland socialization; the motor domain features Vineland motor skills and MSEL fine motor skills; and the language domain comprises Vineland communication, MSEL receptive language and MSEL expressive language. The pie charts show the proportional distribution of ASD and TD groups across clusters and sexes, emphasizing the distinct separation and composition of the clusters. The clustering patterns are consistent between the training and testing sets, validating the robustness of the clustering approach across all domains. Please see the datasets train.labels.asd.td2.csv and test.labels.asd.td.csv in GitHub (Cluster Analysis; https://github.com/ACE-UCSD/Autism-Sex-Differences-Analysis-Pathway).

Fig. 4: Cluster validation strategies.

a, Results of fivefold cross-validation on the training set, repeated ten times, yielding a high average accuracy of 0.91. The embedded box plot highlights key summary statistics: the median (centre line), IQR (box boundaries) and values within 1.5× the IQR (whiskers). Individual data points show accuracy values. The mean accuracy is indicated by the large purple point, with the dashed red line representing the overall mean accuracy across all runs. b, Graph showing high NMI values remaining consistent across increasing percentages of random removals. Each box plot represents the distribution of NMI values from 100 replicates per removal percentage, with the centre line indicating the median, the box boundaries showing the IQR (25th–75th percentile) and the whiskers extending to 1.5× the IQR. c, Confirmation of the quality of clustering in both the training and test sets, as evidenced by silhouette scores of 0.46 and 0.40, respectively. d, Graphs showing the distinct separation of clusters when external variables are applied (Supplementary Tables 8 and 9).

To examine the SNF clusters in relation to external variables, we compared the clusters for WG words produced, words understood, early gestures, later gestures, total gestures, WS words produced, ADOS RRB and the percentage of social fixation in the eye-tracking GeoPref test. The ANOVA results revealed that at least two of the three cluster comparisons showed significant differences across all external variables. This was consistent with the clear separation observed in the variables used to construct the SNF clusters, for which all three pairwise comparisons were significant (Fig. 4 and Supplementary Tables 8 and 9). Furthermore, performing a fivefold cross-validation on the training set, repeated ten times, resulted in a high average accuracy of 0.91. Finally, robustness analyses demonstrated the high stability of SNF against data perturbations. When subjected to random removal of 5, 10, 20, 30, 40 and 50% of the study participants, SNF exhibited high average normalized mutual information (NMI) values of 90.8, 88.3, 84.5, 81.3, 78.6 and 75.2%, respectively (Fig. 4).
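The NMI metric used in the robustness analysis compares two sets of cluster labels for the same participants. The sketch below shows one way to compute it. It assumes normalization by the geometric mean of the two label entropies, which is one of several conventions and is not specified in the text; the re-clustering step after random removal is not reproduced.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information divided by sqrt(H(A) * H(B)) (assumed convention)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = sqrt(entropy(labels_a) * entropy(labels_b))
    return mutual_info / denom if denom > 0 else 1.0


# Hypothetical labels: the same partition with cluster names permuted.
same_partition = nmi([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
# Hypothetical labels: two partitions that carry no shared information.
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because NMI ignores how clusters are named, it is suited to comparing the original cluster assignment with the assignment obtained after a subset of participants is removed and the data are re-clustered.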

Next, we tested whether sex differences varied across the heterogeneous high, medium and low spectrum of clinical performance seen in the SNF clusters.

Sex differences in those with ASD within and across clusters

There were few differences between girls and boys with ASD within and across clusters. In the training SNF data, girls with ASD in the medium cluster had worse ADOS social affect scores than boys with ASD, yet better socialization scores on the Vineland parent questionnaire. No other significant sex differences were found for the ASD group in the high-, medium- and low-ability clusters in the training datasets. Those few sex differences in the medium-ability cluster were not replicated in the test SNF data (for example, social affect scores were 14.5 and 14.1 in girls and boys, respectively, and not different). Instead, in the test dataset girls with ASD in the medium cluster had better MSEL fine motor scores than boys with ASD in this cluster, and girls with ASD in the high-ability cluster had higher Vineland communication and MSEL expressive language scores than boys with ASD in this cluster. These several better scores in girls with ASD were not seen in the larger-sample training dataset (Extended Data Tables 3 and 4 and Supplementary Tables 10 and 11).

We also investigated sex effects on external variables and found no significant sex differences in the high- and medium-performing clusters for the group with ASD. However, in the low-performing group, males with ASD scored higher for WS words produced. It is noteworthy that in the low-ability cluster for the group with ASD, we did not have enough observations for girls to examine sex differences in several external variables (Extended Data Table 5 and Supplementary Table 12).

Sex differences in TD toddlers within and across clusters

Among the clusters, there were several sex differences for TD toddlers. In the training SNF dataset, among higher-ability toddlers, TD girls achieved better scores than boys on Vineland socialization, Vineland motor skills, MSEL fine motor skills and MSEL receptive language; however, these sex differences did not replicate in the smaller sample of higher-ability TD toddlers in the test dataset. Similarly, TD boys scored higher than TD girls on MSEL expressive language in the medium-ability cluster and these differences were not observed in the smaller testing dataset (Extended Data Tables 3 and 4 and Supplementary Tables 10 and 11).

We also investigated sex effects on external variables within the TD group. Girls in the high-ability cluster had better scores on WG early gestures, total gestures and WS words produced than boys; and girls in the medium-ability cluster had better ADOS RRB scores than boys. These better scores in different domains in TD girls generally align with the results obtained from the trained SNF model (Extended Data Table 5 and Supplementary Table 12).

To examine sex differences across clusters, we used a chi-squared test, which revealed no significant difference in the distribution of boys and girls across the three clusters for either toddlers with ASD or TD toddlers in the training data, indicating that sex was not a determining factor in cluster membership (Supplementary Table 13).
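As an illustration of this kind of test, the sketch below computes the Pearson chi-squared statistic for a sex-by-cluster contingency table. The counts are hypothetical and the function is a generic textbook implementation, not the authors' analysis code.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if sex and cluster were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts (rows: girls, boys; columns: high, medium, low).
proportional = [[20, 50, 30], [80, 200, 120]]
skewed = [[40, 40, 20], [60, 210, 130]]
stat_equal, dof = chi_squared_statistic(proportional)
stat_skewed, _ = chi_squared_statistic(skewed)
```

With two degrees of freedom, the critical value at P = 0.05 is about 5.99. The proportional table gives a statistic of zero, which corresponds to the pattern reported here, whereas the skewed table would exceed the threshold.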

Longitudinal analysis

Consistent with the primary analysis results, longitudinal analysis revealed more and stronger sex differences in TD toddlers than in those with ASD. These sex differences in TD toddlers occurred in both initial status (that is, intercept) and growth trajectories (that is, slope) and the estimates in Extended Data Table 6 indicate the difference between boys’ and girls’ scores regarding their intercept and slope.

Longitudinal sex differences in those with ASD

On the ADOS, boys and girls with ASD did not differ in intercepts. However, boys with ASD exhibited age-related ADOS social affect and total score growth trajectories that were increasingly worse and became nearly identical to those of girls by later toddler ages (slopes in Extended Data Table 6), and effect sizes were small (that is, 0.42 and 0.44, respectively). On the MSEL, no significant longitudinal sex differences among toddlers with ASD were found. On the Vineland parent questionnaire, girls with ASD had a slightly but significantly less declining slope than boys with ASD for the motor skills subscale, but again the effect size was small (0.43; Fig. 5).

Fig. 5: Longitudinal analysis.

a–c, Longitudinal growth trajectories and baseline differences across the ADOS (a), Vineland (b) and MSEL subscales (c), stratified by sex. Only subscales with statistically significant sex differences in intercept and/or slope (P < 0.05) are shown (Extended Data Table 6). For the ADOS, significant differences were observed in the intercept and slope for TD-Social Affect, the slope for ASD-Social Affect, the intercept for TD-RRB, the intercept and slope for TD-Overall Total, and the slope for ASD-Overall Total. For the Vineland, significant differences were found in the intercept for TD-Communication, TD-Daily Living Skills, TD-Socialization and TD-ABC, and in the slope for ASD-Motor Skills and TD-Motor Skills. For the MSEL, significant intercept differences were identified for TD-Visual Reception, TD-Receptive Language and TD-Expressive Language. Smoothed trend lines represent estimated values with 95% CIs (shaded areas) and individual participant trajectories are shown as thin lines to highlight variability. See the datasets ados.long.csv, vine.long.csv and mul.long.csv in the GitHub repository (Longitudinal Analysis; https://github.com/ACE-UCSD/Autism-Sex-Differences-Analysis-Pathway).

Longitudinal sex differences in the TD group

On the ADOS, TD girls had significantly better initial social affect, RRB and total scores than TD boys. Compared with TD boys, girls in this group displayed better longitudinal improvement in social affect and total scores (effect sizes ranged from small to medium: 0.48–0.79) (Extended Data Table 6 and Fig. 5). On the MSEL, TD girls exhibited significantly better initial scores than boys for the visual reception, receptive language and expressive language subscales by 4.12, 6.82 and 3.73, with effect sizes of 0.43, 0.72 and 0.36, respectively. On the Vineland parent questionnaire, TD girls once again had better intercept scores than boys for communication, daily living skills, socialization and ABC, with differences of 2.76, 3.78, 2.07 and 2.46, respectively. However, the effect sizes were small at 0.35, 0.49, 0.30 and 0.36, respectively. Additionally, on the Vineland motor skills subscale, TD girls had better improvement in longitudinal trajectory than TD boys, with an average slope difference of 0.16 and a medium effect size of 0.55 (Fig. 5).

Interpretation of confidence intervals and precision

The non-significant findings reported in this study are supported by the precision of the estimates, as reflected in the calculated confidence intervals (CIs). In the primary analyses, narrow CIs for the mean or median differences across groups suggest that the observed null effects are unlikely to be due to insufficient power or variability in the data. Instead, these intervals indicate that any potential differences, if present, are probably negligible and not clinically meaningful. Similarly, in the longitudinal analyses, the CIs for both intercepts and slopes in the latent growth models demonstrate high precision in estimating baseline levels and rates of change over time. These precise estimates reinforce the robustness of the null results and the consistency of the findings across multiple time points. The absence of wide CIs across analyses further highlights the stability of the results and provides strong evidence that sex differences in ASD at early ages are minimal or non-existent.

Discussion

Our large-scale study shows that at the early-age clinical beginning of autism, there are virtually no clinically or statistically significant differences between female and male toddlers with ASD across a wide range of standardized and validated tests of symptom severity, social and language ability and behavioural social attention. This study included n = 2,618 male and female toddlers with ASD, toddlers with DD and TD toddlers from the general San Diego population, with the majority uniformly ascertained and recruited using the Get SET Early model38. Toddlers were psychometrically and diagnostically assessed at a single site by licensed psychologists; thus, the participants were not a collection from different sites with varying procedures, personnel and populations. Only one of 19 primary study test comparisons between female and male toddlers with ASD was significantly different. This single difference was in the daily living skills subdomain score on a parent report tool, the Vineland, with girls scoring higher than boys. In contrast, ten of 18 measures were significantly different between typically developing females and males, with female toddlers consistently performing better than male toddlers. Furthermore, the lack of sex differences was not unique to those with ASD, since we also found almost no sex differences in toddlers with DD who did not have ASD.

Our finding that there are no clinical sex differences in ASD at very early ages leads to two possible conclusions. The first is that previous studies that reported clinical sex differences in ASD are incorrect, possibly due to small sample sizes, sampling bias, limited study measures or other methodological issues. An alternative conclusion is that sex differences do not exist in ASD at the time of first symptom onset, but emerge slowly at later ages, driven by psychosocial factors or differences in biology between males and females that unfold across development. At the psychosocial level, studies have shown that parents engage in more positive parenting behaviour with their female children relative to males, which can lead to sex differences in language expression45. At the biological level, differences in sex hormone surges can also impact clinical phenotype46. Longitudinal studies that track children for whom ASD was detected early through to school age and beyond and then compare early- versus later-age symptom presentation could help to resolve this question.

Similar to other studies, we found stronger performance for female TD toddlers than male TD toddlers. For example, studies on neurotypical development note that female TD toddlers display better social and linguistic abilities, as well as increased attention to faces47 and greater eye contact48, relative to male toddlers. Moreover, typical girls produce more gestures and more words than boys49,50,51—a phenomenon that has been reported across languages and cultures52. Lange et al.53 showed that typical girls have larger vocabularies, and preschool girls in particular (3–6 years of age) have better grammar, speech comprehension, pronunciation and processing of sentences and nonce words. In the present study, females with ASD did not show such differences compared with males with ASD. Although the study was not designed to test the EMB theory of ASD24, the results—showing a similar female over male advantage in the TD group but no sex difference in the group with ASD for the ADOS social affect scale—align with EMB predictions. The EMB theory posits that in domains with typical sex differences (for example, early social and language abilities), these differences would be attenuated in ASD, with ASD presenting at the extreme end of the spectrum compared with TD males24.

Clinical heterogeneity in ASD is well known, but there are few large-sample studies addressing whether or not there are sex differences in clinical heterogeneity at very early ages. Here we considered this topic using state-of-the-art precision medicine methods to determine subtype membership for study toddlers, and leveraged comprehensive validation strategies including cluster separation analyses, fivefold cross-validation, hold-out test set analysis, robustness assessment, cluster quality assessment and external validation analyses54. Using SNF, a robust, unsupervised, multimodality integration approach, we identified three reproducible subtypes of toddler with ASD, best described as low, medium and high ability. The profile of the low-ability ASD subtype aligns well with the clinical characteristics of profound autism55, whereas the high-ability ASD subtype overlaps with typical toddlers, and many patients assigned to the high-ability subtype may go on to have optimal later-age outcomes. The heterogeneity and subtypes in the present study represent the wide spectrum known for autism, ranging from severely affected to high functioning. Ninety percent of patients with ASD fell into the low- and medium-ability clusters, whereas 96% of TD toddlers fell into the high-ability cluster. Thus, SNF is an accurate unsupervised multimodality approach for ASD versus TD diagnostic separation and clinical subtyping. Within each of the three subtypes of this ASD spectrum, girls and boys with ASD in our study had remarkably similar symptom severity and social, language, cognitive and social eye-tracking levels of performance. In addition, cluster analyses revealed that the proportions of girls and boys with ASD were the same at low, medium and high levels of ability. Thus, girls with ASD were not disproportionately impaired relative to boys, nor were they disproportionately higher functioning. Although ASD is highly heritable56 and girls are thought to carry higher genetic liability, leading to them potentially being more impacted22, the lack of clinical differences in ASD between girls and boys at the early onset of the disorder does not fit that hypothesis.

In this study, we examined not only sex differences at the very young age of first symptom expression, but also possible changes across time. Our results did not reveal striking differences in trajectories between male and female children with ASD. Given the considerable biological heterogeneity in ASD, however, we propose that subtype-specific hypotheses may be more informative than sex-based ones57.

Although the present study had numerous strengths, including a large sample size (that is, 2,618 toddlers) and an extensive test battery with both standardized and eye-tracking measures, there are two potential limitations. First, although data collection at a single site using standardized operating procedures and licensed clinical psychologists probably minimized the noise that is often associated with multi-site studies, it is unclear whether the results would generalize to other geographical regions. Second, it is important to consider the possible influence of our approach to recruiting participants with early-detected ASD (that is, Get SET Early, which relies on a toddler’s failure of the CSBS screening at well-baby check-ups to prompt a referral to our centre) on the study results. Although sex-specific norms are not provided by the test developers, it is possible that this screening method is less sensitive to the detection of autism in younger girls given that male toddlers with ASD had slightly lower (worse) CSBS scores than females with ASD. However, counter to this is the fact that the male:female ratio in the current study was actually more strongly in favour of the detection of ASD in females than the national average (3.6:1.0 in the current study versus 3.8:1.0 in the most recent Centers for Disease Control and Prevention average58), suggesting that the Get SET Early approach detects females with ASD at expected, or better than expected, rates. Another possible issue relating to the use of the Get SET Early model is that extremely-high-functioning individuals with ASD are unlikely to be detected using developmental screening tools, such as the CSBS, that rely on the presence of observable symptoms. Indeed, although precise estimates regarding late-identified cases of ASD are not formally established, some studies suggest that as many as 6–25% of individuals with ASD do not receive a diagnosis until school age or later59,60.
It is thus possible that results from the present study may, or may not, generalize to the subset of individuals with ASD who have mild presentations and do not receive diagnoses until later in life60,61,62. Future studies can test this possibility by comparing sex differences in late-detected groups with those in more traditionally detected cohorts.

Collectively, despite limitations inherent in past studies of sex differences, such as relatively small sample sizes (for example, n = 28–96), limited clinical data and a lack of longitudinal data, these studies nonetheless found few to no sex differences in important clinical measures in ASD at early ages. The present study overcomes these limitations and also did not find compelling evidence of sex differences in the clinical presentation and progression of ASD during the earliest years of the disorder. ASD is highly heritable56, yet more than 80–90% of patients are classified as idiopathic with no identifiable genetic cause. Thus, whether or not there might be genetic differences between girls and boys with idiopathic ASD is currently unknown, but if there are differences, they do not result in clinical sex differences at early ages. Future studies aimed at identifying whether sex differences emerge at later ages in autism should incorporate comprehensive and reproducible designs, such as those used herein. In conclusion, although later environmental influences could arguably impact later-age symptom presentation, particularly in females32,63,64, the current body of evidence and our present study do not support previous speculations about sex differences in ASD at the time of first diagnosis. Therefore, it is unlikely that girls with ASD differ clinically from boys with ASD across early development.

Methods

Study design

Sex differences were examined using both cross-sectional and longitudinal clinical data collected between 2002 and 2022. Participants in the current study completed comprehensive psychological assessments on social skills, expressive and receptive language, gesture production, motor skills, visual reception and core ASD symptoms (‘Clinical testing’ section). This research met all of the ethical requirements of the Human Research Protection Program under the approval of the University of California, San Diego Office of IRB Administration (project number 202115). Parents gave written informed consent and all testing occurred at the University of California, San Diego Autism Center of Excellence.

Participants, recruitment, clinical testing and diagnostic criteria

Participants

A total of 2,618 toddlers participated in this study, including those with a diagnosis of ASD (n = 1,539; 1,200 male and 339 female; mean age = 28.6 months), those with DD (n = 478; 349 male and 129 female; mean age = 26.0 months) and TD individuals (n = 601; 349 male and 252 female; mean age = 25.7 months). No statistical methods were used to predetermine sample sizes, but our sample sizes are larger than those reported in previous publications4,14,65.

Within this cohort, 44.3% returned for one or more clinical test sessions before the age of 4 years, resulting in data from a total of 4,440 longitudinal testing sessions. Socioeconomic status (that is, median household income) and racial and ethnic distributions were as expected for the San Diego region (Table 1). Overall, no significant sex differences in socioeconomic status were found; however, when children with ASD and TD individuals were examined separately, boys had a higher median household income, whereas no sex differences were observed among children with DD (Supplementary Table 14).

Table 1 Demographics and clinical characteristics summary

Recruitment

The majority of toddlers (~75%) were recruited through a general population-based screening approach using the Get SET Early method (formerly known as the 1-Year Well-Baby Check-Up Approach)66. This method results in the detection of ASD in children as young as 12 months. The programme is based on the collaboration of more than 200 local paediatricians who screen for ASD and other delays using the CSBS Infant-Toddler Checklist at all 12-, 18- and 24-month well-baby check-ups and refer to our centre toddlers who fail the CSBS screening and/or are suspected of having ASD. The remainder of the cohort were community referrals who contacted our centre seeking a developmental evaluation. Toddlers referred younger than 30 months were invited for a re-evaluation every 9–12 months until their third birthday, when a final diagnosis was given. Toddlers were stratified into the diagnostic groups described above based on the results from their most recent diagnostic evaluation.

Clinical testing

Toddlers and their parents participated in a series of tests, including the ADOS-2 (ref. 42) or ADOS-G, MSEL67, Vineland-3 (ref. 41), MacArthur-Bates CDI43, CSBS39 and GeoPref eye-tracking test. The study examined three subscales of the ADOS: social affect, RRB and overall total. Additionally, five subscales of the Vineland were assessed: communication, daily living skills, socialization, motor skills and ABC. For the MSEL, the following five subscales were evaluated: visual reception, fine motor skills, expressive language, receptive language and ELC. The CDI–WG included five subscales: words produced, words understood, early gestures, later gestures and total gestures. Words produced from the CDI–WS were also assessed. Furthermore, the CSBS assessment covered four subscales: social composite, speech composite, symbolic composite and total score. Lastly, the percentage of social fixation in the GeoPref test was also examined. Approximately 80% of the GeoPref eye-tracking data came from our earlier eye-tracking study40 and the remaining 20% were collected subsequently.

All assessments were administered by licensed clinical psychologists and eye-tracking technicians blind to the initial CSBS screening scores. To help ensure interclinician reliability, the lead clinical psychologist (C.C.B.; an ADOS-certified independent trainer with over 25 years of experience in toddler testing) was responsible for training the other psychologists to achieve research-reliable levels on the ADOS. ADOS reliability checks were conducted approximately twice per year. The consistency of testing procedures and setting and the use of only licensed clinical psychologists may have served to bolster the validity of the results (see Supplementary Information for more details).

Diagnostic criteria

Each toddler was assigned to one of three diagnostic categories based on the following criteria: ASD (scored within the range of concern on the ADOS-2 and was considered to have ASD based on Diagnostic and Statistical Manual of Mental Disorders (5th edn) criteria and clinical judgement); DD (scored ≤85 on the overall MSEL ELC); or TD (scored within the normal range on all clinical assessments).

Data analyses

Three main analyses were conducted, including a primary cross-sectional analysis, a longitudinal analysis and a cluster analysis. In all subsequent sections, the statistical tests used were two sided.

Primary analysis

The primary analysis included examination of sex differences between the group with ASD and the TD group across all tests and subscales defined in the Methods. Although the primary goal of the study was to examine possible sex differences in children with ASD specifically, examination of possible sex differences within TD toddlers was also included to aid interpretation of the results. Examination of all available data revealed a statistically significant (P < 0.001) 4-month age difference between toddlers with ASD (mean age = 29.4 months) and TD individuals (mean age = 25.2 months), except for the CDI–WS test. To ensure that any reported differences in ASD and TD children were not driven by age effects, primary analyses included data from all available time points, ranging from visit 1 to visit 5, representing the number of times each child visited the clinic. Cardinality matching68 was then conducted to achieve an evenly age-matched sample. Cardinality matching is an alternative to propensity score matching that resolves the covariate overlap problem. It identifies the largest possible matched sample based on the pre-specified ratio of participants with ASD to TD toddlers and balance criteria, such as age69. The MatchIt package70 in R was used to perform cardinality matching. Matching was conducted on different tests separately, as some of the participants took the tests at slightly different ages and there was also small variation in the sample sizes of groups with ASD and TD toddlers across tests (Table 1). The best ratio of participants with ASD to TD toddlers was found to be 2.0 for ADOS and Vineland, 1.5 for MSEL and 1.0 for CDI–WG. No matching was required for CDI–WS since there were no significant age differences between the group with ASD and the TD group.

First, we conducted a two-way ANOVA including sex, group and sex × group interaction across all tests and subscales to explore the effects of group and sex on clinical test scores. If there was a significant interaction or a significant sex main effect, we proceeded to test our planned contrasts. Since we were interested in sex differences within the ASD and TD groups, we focused on two out of six multiple comparisons resulting from the ANOVA interaction (that is, ASD (F) versus ASD (M) and TD (F) versus TD (M)) and adjusted the P values for two comparisons using the false discovery rate71. Then, depending on the data normality and homogeneity of variance assumptions, examination of sex differences within the group with ASD and the TD group was conducted using either t-tests or Kruskal–Wallis rank-sum tests as appropriate. Effect sizes for Kruskal–Wallis rank-sum tests are reported as \(\eta^{2}\) values, where 0.01–0.05 represents a small effect, 0.06–0.13 represents a medium effect and >0.14 represents a large effect based on the recommendation of Lomax and Hahs-Vaughn72. Effect sizes for t-tests are reported as Cohen’s d values, where small, medium and large effects correspond to approximate ranges of 0.2–0.49, 0.5–0.79 and ≥0.8, respectively73. In a separate analysis, toddlers with DD who had an MSEL ELC score of ≤85 were examined for potential sex differences.
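
As an illustration of the effect-size and multiple-comparison steps described above, the following is a minimal Python sketch and not the code used in the study (the analyses were run in R). It assumes Cohen's d is computed with the pooled standard deviation, that the Kruskal–Wallis effect size is obtained as (H − k + 1)/(n − k), and that the false discovery rate adjustment follows the Benjamini–Hochberg procedure. None of these choices is spelled out in the text, and all function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def label_d(d):
    """Label |d| using the ranges quoted in the text (0.2-0.49, 0.5-0.79, >=0.8)."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


def eta_squared_kw(h, k, n):
    """Eta-squared from a Kruskal-Wallis H statistic (k groups, n observations).

    Assumed formula (not stated in the text): (H - k + 1) / (n - k).
    """
    return (h - k + 1) / (n - k)


def fdr_bh(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    # Walk from the largest P value down, enforcing monotonicity with a running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank of this P value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With two planned contrasts, as here, a call such as `fdr_bh([p_asd, p_td])` (hypothetical variable names) doubles the smaller P value and leaves the larger one unchanged, unless the doubled value would exceed the larger one, in which case both take the larger value.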

For the primary analysis, we examined sex differences in groups with ASD, TD and DD across multiple subscales using the appropriate statistical tests. When the assumptions of normality and homogeneity of variance were satisfied, two-sample t-tests were employed to compare group means. CIs for the mean differences were calculated based on the standard error and the critical values from the t-distribution at a 95% confidence level. In cases where the assumptions of the t-test were violated, non-parametric Kruskal–Wallis tests were used to assess group differences. Here CIs were calculated to reflect the rank-based differences between groups, providing a range of plausible differences between the medians of the groups.

Cluster analysis

To better understand patterns of clinical heterogeneity and whether or not they may be different across males and females with ASD, we conducted a study of female and male clinical subtypes in ASD and TD using SNF44. Since the SNF method requires complete data across all layers and cannot accommodate missing information, we included only those participants who were consistently present across the ADOS, Vineland and MSEL subscales after matching. This process resulted in a consolidated age-matched sample of 1,673 participants with ASD or TD toddlers.

We employed Pearson correlation—a filter method in feature selection techniques—to identify features for inclusion in the SNF analysis. We calculated all possible pairwise correlations among the subscales of ADOS, Vineland and MSEL and then removed those subscales that were highly correlated with others. Additionally, we grouped similar subscales that approximately measure the same concept into a separate layer. Therefore, data from three clinical domains, including social, language and motor, were considered as three layers in SNF. Within the social domain, we considered the ADOS social affect as well as the Vineland socialization subscales. For the language domain, we included the MSEL receptive language, MSEL expressive language and Vineland communication subscales. Lastly, the motor domain included Vineland motor skills and MSEL fine motor skills. The data were normalized at each layer, and given that the data are continuous the Euclidean distance was utilized to compute pairwise distances. Upon fusing the similarity graphs through SNF, spectral clustering74 (a community detection algorithm) was employed to identify cluster labels. To determine the optimal number of clusters (that is, three) based on various distance measures and clustering methods, we compared the results of 24 indices using the NbClust75 R package, the Bayesian information criterion based on a Gaussian mixture model in the mclust76 R package and the total within-cluster sums of squares (that is, elbow method) using the cluster77 R package (Supplementary Table 2). In addition to the primary SNF analysis, we conducted a separate SNF analysis incorporating sex as a main feature. This allowed us to further examine the effect of sex on the clustering outcomes.
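
The correlation-based filter step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the study's implementation: the greedy keep-or-drop rule, the function names and the 0.9 cut-off are all hypothetical, since the text does not report the threshold or the order in which correlated subscales were removed.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature is below threshold.

    `features` maps a subscale name to its list of scores. The 0.9 cut-off is a
    hypothetical value, not one reported in the study.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data with placeholder names; these are not real subscale scores.
toy = {
    "subscale_a": [1, 2, 3, 4, 5],
    "subscale_b": [2, 4, 6, 8, 10.5],  # nearly collinear with subscale_a
    "subscale_c": [5, 1, 4, 2, 3],
}
selected = filter_correlated(toy)
```

Because the filter is greedy, its result depends on the order of the features; a real analysis would need an explicit rule for which member of a correlated pair to retain.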

Training SNF was performed on 80% of the data (n = 1,337) and the remaining 20% (n = 336) were held out to test the clusters obtained. To assess the quality of the clusters during both training and testing stages, we computed silhouette scores78, which ranged from −1 to 1. Higher scores indicate better-defined and more well-separated clusters. The resulting scores of 0.46 for training and 0.4 for testing suggest reasonably well-separated clusters in both stages (Fig. 4c). Furthermore, to investigate the differences between clusters, we performed ANOVA and multiple pairwise comparisons with false discovery rate correction on several variables included in the SNF, as well as other variables outside of the SNF, to externally validate the clusters. To evaluate the robustness of SNF clustering, we randomly removed 5, 10, 20, 30, 40 and 50% of the data, conducted SNF on the remaining data and calculated NMI values. We repeated this process 100 times for each random removal proportion to obtain more stable results. Finally, a five-fold cross-validation was conducted on the training set, repeated ten times, to further validate the clusters.
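
For readers unfamiliar with NMI (normalized mutual information), the sketch below shows one way to compare two sets of cluster labels, for example the labels from the full data and the labels obtained after random removal, restricted to the participants retained in both. It is a minimal Python illustration, not the study's code: the arithmetic-mean normalization is an assumption, as the text does not state which NMI variant was used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same participants."""
    n = len(u)
    count_u, count_v, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (c / n) * math.log((c / n) / ((count_u[a] / n) * (count_v[b] / n)))
        for (a, b), c in joint.items()
    )


def nmi(u, v):
    """NMI with arithmetic-mean normalization (an assumed variant): 2 I(U;V) / (H(U) + H(V))."""
    hu, hv = entropy(u), entropy(v)
    if hu == 0 and hv == 0:
        return 1.0  # both labelings put everyone in a single cluster
    return 2 * mutual_information(u, v) / (hu + hv)
```

NMI is invariant to how the clusters are numbered, so `nmi([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])` is 1 (up to floating-point error), whereas two unrelated labelings such as `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give 0.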

After validating the obtained clusters using the aforementioned strategies, we took an additional step to investigate the presence of sex differences at the subtype level. This involved examining variables within the training and test data, as well as external variables, using the t-test or the Kruskal–Wallis test as appropriate. Furthermore, we explored the associations between sex and cluster membership in both the group with ASD and the TD group separately, employing the chi-squared test for this purpose.
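
The test of association between sex and cluster membership can be illustrated with a small Python sketch of the Pearson chi-squared test of independence. This is not the study's code, the function name is hypothetical and the counts in the example are invented for illustration. For a 2 × 3 table (two sexes by three clusters) there are two degrees of freedom, for which the chi-squared upper-tail probability has the closed form exp(−x/2); the sketch returns a P value only in that case.

```python
import math


def chi_squared_independence(table):
    """Pearson chi-squared test of independence for an r x c table of counts.

    Returns (statistic, degrees_of_freedom, p_value). The P value is computed
    only when df == 2, where the chi-squared survival function is exp(-x / 2);
    for any other df it is returned as None.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    p_value = math.exp(-stat / 2) if df == 2 else None
    return stat, df, p_value


# Invented counts: rows are sexes, columns are three clusters.
stat, df, p_value = chi_squared_independence([[10, 20, 30], [20, 20, 20]])
```

If the two rows have identical proportions across clusters, the statistic is 0 and the P value is 1, which corresponds to the pattern of equal proportions of girls and boys across ability levels.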

Longitudinal analyses using a latent growth model with individually varying times of observation

We began by investigating sex differences through an initial cross-sectional analysis. Subsequently, we evaluated potential sex differences within ASD and TD subtypes. Finally, to comprehensively analyse sex differences in toddlers with ASD and TD individuals, we conducted a longitudinal analysis to examine these differences over time. Given the availability of longitudinal scores, ADOS, Vineland and MSEL subscales were used in this analysis. However, due to inconsistencies between test intervals, which could potentially lead to overestimation or underestimation of the targeted parameters79, we utilized latent growth modelling80. This method was chosen for its ability to accommodate individual variations in observation times.

In this multilevel model, repeated measures were nested within participants, and sex was a predictor of the random intercept and random slope at level two. Level one variables were subscale scores from the ADOS, Vineland and MSEL, and Cohen’s d was reported as the effect size73. Maximum likelihood with robust standard errors was the estimation method, but due to a convergence problem in three subscales (that is, Vineland daily living skills, Vineland ABC and MSEL visual reception) in the group with ASD, the MLF estimator was used (that is, a simpler version of maximum likelihood with robust standard errors) for the calculation of standard errors. MLF estimator settings represent “maximum likelihood parameter estimates with standard errors approximated by first-order derivatives and a conventional chi-square test statistic”81. Additionally, for some subscales, such as ADOS RRB and all four subscales of MSEL for the TD group, only up to four measurements were considered due to low sample sizes. Specifically, the fifth measurement had a variance of zero, which made it unusable in the analysis. CIs for the parameter estimates, including intercepts and slopes, were computed in Mplus using the standard errors derived from the maximum likelihood estimation. These CIs represent the range of plausible values for the growth parameters at the specified confidence level (95%). Data preparation was conducted using R and the analyses were performed using Mplus version 8.3 (ref. 81).
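
In generic notation, a two-level growth model of the kind described above can be written as follows. The symbols are illustrative and do not appear in the original analysis, and a linear growth form is assumed here.

```latex
\[
y_{ti} = \pi_{0i} + \pi_{1i}\, a_{ti} + e_{ti}
\]
\[
\pi_{0i} = \beta_{00} + \beta_{01}\,\mathrm{sex}_{i} + r_{0i},
\qquad
\pi_{1i} = \beta_{10} + \beta_{11}\,\mathrm{sex}_{i} + r_{1i}
\]
```

Here \(y_{ti}\) is the subscale score of child \(i\) at visit \(t\), \(a_{ti}\) is that child's individually varying time of observation, \(\pi_{0i}\) and \(\pi_{1i}\) are the random intercept and random slope, and \(e_{ti}\), \(r_{0i}\) and \(r_{1i}\) are residual terms. Under this formulation, \(\beta_{01}\) and \(\beta_{11}\) capture sex differences in initial level and in rate of change, respectively.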

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.