Introduction

Autism (AT) and schizophrenia (SZ) have been recognized as independent diagnoses since the 1970s [1]. While AT is primarily characterized by distinct patterns of social communication and repetitive behaviors, SZ is characterized by positive (e.g., hallucinations, delusions), negative (e.g., social withdrawal), and cognitive symptoms (e.g., executive function deficits) [2]. They also show different onset trajectories (i.e., AT becomes apparent in early childhood, while SZ onset usually occurs in late adolescence or early adulthood), although a progression from AT to SZ cannot be excluded [3].

However, the heterogeneity of both diagnostic categories [4, 5] and their phenotypic overlap [6,7,8] can hinder accurate psychiatric diagnosis. For example, a review [9] reported that catatonic features, commonly associated with psychosis, were prevalent in up to 20% of autistic samples. Social functioning patterns, such as emotion processing, have also been shown to bear similarities in AT and SZ [10]. In addition, AT and SZ co-occur in approximately 4% of cases [11], and AT has been found to be diagnosed in up to 11% of people at clinical high risk for psychosis [12]. Differential diagnosis is additionally hindered by the fact that the common clinical observational and interview instruments used to assess diagnosis-specific symptoms, such as the Autism Diagnostic Observation Schedule (ADOS) and the Positive and Negative Syndrome Scale (PANSS), do not have good specificity [13, 14]. Specifically, [13] reported a higher degree of overlap between AT and SZ with respect to negative symptoms, while [14] reported that positive symptoms can better discriminate between the two groups. This symptom overlap between SZ and AT has led us to inquire about the underlying neural mechanisms, and whether these might aid in differential diagnosis [15]. A recent international machine learning competition that aimed to classify AT and typically developed (TD) participants showed that fMRI data can yield a classification accuracy of ~80% [16]. Classification accuracies based on structural or functional MRI data can be just as high or higher when distinguishing SZ from TD [17, 18]. However, even though some previous attempts have been made [19], the challenge is greater when attempting to discriminate between heterogeneous nosological categories, such as AT and SZ, that share both genetic variants and neuroimaging patterns [20]. For example, [21] computed functional connectivity based on resting state fMRI data from 2980 subjects (1665 TD, 537 SZ, and 778 AT).
They revealed an astounding level of heterogeneity, with extensively overlapping connectivity patterns between AT and SZ, especially in the Default Mode Network, as well as opposing patterns, such as increased connectivity between sensorimotor and default mode areas in SZ, but decreased connectivity between these areas in AT.

One potential discriminatory brain-based marker is the excitation/inhibition (E/I) ratio, which has been shown to be different in TD compared to both AT [22] and SZ [23]. The E/I ratio is based on the concerted activity of mostly glutamatergic (i.e., excitatory) and GABAergic (i.e., inhibitory) neurons. The former are the most numerous and project throughout the entire brain, while the latter are fewer and synapse locally (for a comprehensive account of excitatory and inhibitory activity balance in the human brain, see [24]). One way to estimate the E/I ratio in humans from non-invasive measures, such as resting state fMRI (rsfMRI), is to compute the Hurst (H) exponent from the acquired timeseries; an increased H value indicates a decreased E/I ratio, and vice versa (for a review see [25]).
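To make the idea of estimating H from a timeseries concrete, the sketch below uses the aggregated-variance method, which is a simpler estimator than the wavelet-based maximum likelihood approach used in this study and is shown only for illustration. It assumes the series behaves like fractional Gaussian noise, for which the variance of block means scales as m^(2H − 2) with block size m. The function name, block sizes, and synthetic white-noise input are all assumptions made for this example.

```python
import math
import random


def hurst_aggregated_variance(series, block_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Estimate H from the scaling of block-mean variance: Var(m) ~ m**(2H - 2)."""
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        if n_blocks < 8:  # too few blocks for a stable variance estimate
            continue
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        grand = sum(means) / n_blocks
        var = sum((x - grand) ** 2 for x in means) / (n_blocks - 1)
        log_m.append(math.log(m))
        log_var.append(math.log(var))
    # Ordinary least-squares slope of log-variance against log-block-size.
    mx = sum(log_m) / len(log_m)
    my = sum(log_var) / len(log_var)
    slope = sum((x - mx) * (y - my) for x, y in zip(log_m, log_var)) / sum(
        (x - mx) ** 2 for x in log_m
    )
    return 1.0 + slope / 2.0


# Synthetic white noise (illustrative only); its H should be near 0.5.
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(4096)]
h_white = hurst_aggregated_variance(white_noise)
print(f"H for white noise: {h_white:.2f}")
```

Values of H above 0.5 indicate a more persistent (long-memory) signal, which, following the interpretation given above, would correspond to a lower E/I ratio.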

In AT, [26] first hypothesized that observed sensory processing patterns may result from an increased E/I ratio. Among the evidence they cite is the fact that parietal and cerebellar areas show ~50% less glutamic acid decarboxylase (GAD), the enzyme that synthesizes the inhibitory neurotransmitter γ-aminobutyric acid (GABA), in AT compared to TD [27]. Additionally, cortical minicolumns, which are functional units composed of GABAergic and glutamatergic neurons processing thalamic inputs, are smaller and more numerous in AT compared to TD [28]. A more recent summary specifically points towards the impact of reduced inhibition on cortical and hippocampal functioning in AT [29]. Whether this E/I imbalance is mainly due to excessive excitatory activity or deficient inhibitory activity is not entirely clear [30, 31], but recent evidence points to the E/I imbalance in AT being caused by concomitant effects [32]. Finally, direct evidence for the contribution of an E/I imbalance in AT comes from a study using bumetanide (i.e., a selective NKCC1 chloride importer antagonist that decreases depolarizing GABA action and thereby reduces the E/I ratio) in a large cohort of AT children [33]. These authors reported a decrease in repetitive behaviors following a 91-day bumetanide trial. Another direct link between sensory processing and GABAergic activity in AT has been provided by [34]. These authors used arbaclofen, a GABA type B receptor agonist, to show that auditory repetition suppression was negatively impacted by the drug in TD, but improved in AT.

In SZ, the E/I ratio has also been reported to be imbalanced compared to TD. Post-mortem and genetic evidence [35], together with computational modeling, revealed that this imbalance causes hyperconnectivity in association brain areas [36]. In addition, research has shown a link between an E/I imbalance and aberrant internal sensory processing in SZ [37], such as hallucinations [23]. Finally, dopamine appears to be crucial in maintaining the E/I balance by modulating the excitability of glutamatergic and GABAergic neurons, thus contributing to the E/I ratio in SZ [38]. When disrupted, dopaminergic activity, acting in concert with glutamatergic and GABAergic activity, can directly impact memory function and prefrontal connectivity in SZ (for an extended account, see [39]).

It has been proposed that an E/I imbalance characterizes both AT and SZ [40, 41], and that this in turn relies on a shared genotype [42]. However, given the substantial heterogeneity in both AT and SZ [5], it is difficult to ascertain to what extent the brain areas that display this imbalance overlap between these populations.

In recent years, various machine learning approaches have been employed to improve differential diagnosis of mental disorders [43]. Among these, interpretable models, such as Random Forest (RF), have become increasingly popular due to their transparency, as opposed to the traditional “black box” methods, such as support vector machines [44, 45]. A trade-off between high interpretability and high accuracy is usually considered when opting for a particular classification approach, as highly accurate classifiers tend to provide less interpretability (but see [45] for a different account). Among interpretable approaches, RF stands out as an algorithm that can provide both high accuracy and reasonable interpretability [46]. In addition, it can also be used for feature selection based on feature importance in an out-of-sample classification, as we did in the current project [47].

Considering the current state of knowledge regarding the E/I ratio in different populations, we had three main objectives: (1) to quantify group differences in the E/I ratio between TD, AT, and SZ, (2) to assess whether the E/I ratio could support differential diagnosis, and (3) to verify the replicability of our findings in an independently acquired dataset.

To assess the role of clinical vs. E/I ratio data in classifying AT and SZ, we used two independent datasets and five distinct sets of features (i.e., classification models) comprising phenotypic assessments only, the E/I ratio (as indexed by the H exponent) of multiple brain areas only, or a combination of these. To quantify the E/I ratio based on rsfMRI timeseries, we computed the H of 53 predefined functional brain areas, based on the NeuroMark templates [48]. The H exponent has been refined as a reliable computational approximation of synaptic E/I based on extensive physiological and in silico studies [22]. For the phenotypic features we focused on core symptom assessments, namely the ADOS, measuring AT-related social and communication patterns, and the PANSS, measuring SZ-related positive (e.g., delusions, hallucinations) and negative (e.g., social withdrawal) symptoms, and general psychopathology (e.g., attention deficits). The total ADOS scores were calculated using the original ADOS-2 algorithm, thus including the following items: A-4, 8, 9 and 10, and B-1, 2, 6, 8, 9, 11 and 12. In addition, we used an IQ estimate and two social cognitive measures: the Empathizing Quotient (EQ), measuring empathy, and the Bermond–Vorst Alexithymia Questionnaire (BVAQ), measuring alexithymia, both of which have been shown to be different in AT and SZ compared to TD [49,50,51]. Two rsfMRI datasets were used to test replicability. The first, relatively large, dataset comprised data from publicly available fMRI repositories of either AT or SZ data [21]. The second, smaller, dataset was collected on-site from AT, SZ, and TD participants. We believe the use of both datasets holds important advantages. The replication dataset, while consisting of fewer participants, contains both rsfMRI and phenotypic data. In addition, the AT, SZ and TD data in this dataset were collected in the same setting, which precludes the risk of site-related confounds.
The larger, exploratory dataset was obtained by sourcing datasets from different online repositories. These datasets had been acquired at various sites with different scanning parameters, and phenotypic data were not uniformly available across sites and clinical groups. While this prevented us from doing a full exploration and replication of all the classification models that we were able to test using the smaller dataset, it allowed us to: (1) reduce model complexity and increase model stability [52] in the smaller replication dataset by using only the most important H features (i.e., brain regions) from the larger exploratory dataset, and (2) illustrate the replicability of the results of the H-only classification model.

Methods

Participants

Two independent datasets were used in the current project. An exploratory dataset (Exploratory), based on several publicly-available online datasets (described below), and an internally-collected replication dataset (Replication).

For the Exploratory dataset, we analyzed 519 TD (362 males & 157 females; mean age = 28.49 ± 7.68), 200 AT (180 males & 20 females; mean age = 24.74 ± 6.6), and 355 SZ (245 males & 110 females; mean age = 30.91 ± 7.95) from the previously preprocessed and harmonized dataset used in [21]. The participants in [21] had been selected from several data repositories: the AT from the Autism Brain Imaging Data Exchange (ABIDE I and II), and the SZ from the Bipolar-Schizophrenia Network on Intermediate Phenotypes (B-SNIP), the Center for Biomedical Research Excellence (COBRE), the Maryland Psychiatric Research Center (MPRC), and the Function Biomedical Informatics Research Network (FBIRN). From the dataset of [21], we chose a subset of participants that closely matched the age (18–35 y.o.) and intelligence quotient (IQ) of the Replication dataset. Note that an estimated IQ > 75 criterion was chosen because it was the inclusion threshold in the B-SNIP dataset, where no IQ values were recorded. Because some of the data from the Replication dataset had been previously submitted to data repositories (e.g., ABIDE II) from which the dataset of [21] had been drawn, we ensured that no participants were included in both datasets by excluding repeated participants from the Exploratory dataset.

For the Replication dataset, participants were recruited via the Olin Neuropsychiatry Research Center (ONRC) and the Yale University School of Medicine and underwent rsfMRI scanning for the current study. We discarded participants with head motion > 10 mm, and those with incomplete phenotypic assessment information, resulting in a final dataset consisting of 55 TD (26 males & 29 females; mean age = 23.86 ± 3.65), 30 AT (25 males & 5 females, mean age = 22.33 ± 3.78), and 39 SZ (31 males & 8 females, mean age = 25.66 ± 3.53). The Replication dataset has been previously used by [53, 54] and [19], and the exclusion criteria we used here were the same: intellectual disability (i.e., estimated IQ < 80), neurological disorders (e.g., epilepsy), current drug use as indicated by pre-scanning interview and urine test, incompatibility with MRI safety measures (e.g., metal implants), and a history of psychiatric diagnoses in TD.

Phenotypic assessment in the replication dataset

Below we describe the phenotypic data that were collected for the Replication dataset. These were not consistently available for the Exploratory dataset. In addition, we also recorded the chlorpromazine equivalent for participants from the Replication dataset who took antipsychotic medication (i.e., eight AT, and 37 SZ).

Diagnostic assessment

The severity of psychotic symptoms was assessed using the Positive and Negative Syndrome Scale (PANSS; [55]) in both AT and SZ. The PANSS scores can be interpreted along three subscales: (1) positive symptoms, reflecting the severity of hallucinations and delusions; (2) negative symptoms, reflecting the severity of blunted affect and anhedonia, and (3) a general subscale quantifying other psychopathology such as poor attention and lack of insight.

The ADOS, module 4 [56], was administered to all participants, and the total score was used in this study to confirm or rule out an autism diagnosis and to quantify autistic social and communication characteristics.

The Structured Clinical Interview for DSM-IV-TR Axis I Disorders (SCID; [57]) was used to confirm or rule out the diagnosis of SZ and to confirm the absence of any Axis I diagnoses in TD.

Social cognition and estimated IQ

Estimated IQ was calculated for the entire dataset using the Vocabulary and Block Design subtests of the Wechsler Adult Intelligence Scale-III (WAIS-III; [58, 59]). Additionally, all participants completed: (1) the Empathizing Quotient (EQ; [60]), which measures general empathy, including both its affective and cognitive components; and (2) the Bermond–Vorst Alexithymia Questionnaire (BVAQ; [61]), whose sub-scores are computed along five distinct dimensions: “verbalizing”, reflecting one’s propensity to talk about one’s feelings; “identifying”, capturing the extent to which one is able to accurately define one’s emotional states; “analyzing”, quantifying the extent to which one seeks to understand the reason for one’s emotions; “fantasizing”, quantifying one’s tendency to day-dream; and “emotionalizing”, reflecting the extent to which a person is emotionally aroused by emotion-inducing events. Descriptive statistics and group comparisons of phenotypic data are given in Table 1.

Table 1 Means and standard deviations (in parentheses) of demographics, phenotypic and clinical raw scores for all three groups of the replication dataset.

Imaging data acquisition and preprocessing

For the Exploratory dataset, the rsfMRI data were preprocessed using the SPM toolbox, as described extensively in [21]. In short, the first few volumes were discarded, then rigid-body motion correction and slice-timing correction were performed. Finally, the data were normalized, resampled to 3 mm³ isotropic voxels, and smoothed with a 6 mm FWHM Gaussian kernel. Prior to preprocessing, the effects of age, gender, acquisition site, and the interactions between age and site and between gender and site were regressed from the gray matter volumes of each voxel, to ensure between-site harmonization.

For the Replication dataset, rsfMRI scans lasted 7.5 min and were collected using a Siemens Skyra 3 T scanner at the ONRC. Participants lay still, with eyes open, while fixating on a centrally presented cross. Blood oxygenation level dependent (BOLD) signal was obtained with a T2*-weighted echo planar imaging (EPI) sequence: TR = 475 ms, TE = 30 ms, flip angle = 60 deg, 48 slices, multiband (8), interleaved slice order, 3 mm³ voxels. Neuroimaging data were preprocessed using SPM8 (www.fil.ion.ucl.ac.uk/spm/software/spm8/). Each dataset was realigned to the first T2* image using the INRIAlign toolbox (https://www-sop.inria.fr/epidaure/Collaborations/IRMf/INRIAlign.html), coregistered to its corresponding high signal-to-noise single-band reference image (sbREF; [62]), spatially normalized to the Montreal Neurological Institute (MNI) standard template [63], and spatially smoothed (6 mm³). Finally, framewise displacement (FD) motion parameters were computed according to the FSL library algorithm [64], and the mean FD value for each run was used as a covariate in group analyses.

Data analysis

For both datasets, we ran a fully automated independent component analysis (ICA) on the preprocessed rsfMRI data using the Group ICA for fMRI Toolbox (GIFT v4.0c; https://trendscenter.org/software/gift/; [65]) to define functional brain regions. The 53 replicable independent component (IC) templates from the NeuroMark pipeline with the neuromark_fMRI_1.0 templates [48] were used to estimate participant-specific, spatially-independent components using a spatially-constrained ICA algorithm [66]. A complete list of the NeuroMark IC templates, arranged into seven functional domains, and peak MNI coordinates for each IC template are given in Table 2 and illustrated in Supplementary Fig. 1. After detrending and despiking using 3dDespike [67], we extracted one Hurst exponent (H), an estimate of the E/I ratio, from each of the resulting 53 IC time courses of each participant.

Table 2 Group differences in Hurst exponent (H) per component, in the Exploratory dataset, computed using ANCOVA with age and sex as covariates.

We estimated H using the nonfractal MATLAB toolbox (https://github.com/wonsang/nonfractal; [68]). Specifically, we used the function bfn_mfin_ml.m with the “filter” argument set to “haar” and the “ub” (upper bound) and “lb” (lower bound) arguments set to [1.5, 10] and [−0.5, 0], respectively, as previously recommended by [22]. This is a wavelet-based maximum likelihood estimation that applies a discrete wavelet transform with a Haar filter to a number of volumes that is a power of 2. As in [22], for the Replication dataset, this resulted in the first 512 volumes being used.
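The 512-volume figure follows from the acquisition parameters reported above. As a quick check, the sketch below derives the approximate number of acquired volumes from the scan duration and TR, then takes the largest power of two that fits. The helper name is hypothetical, and the calculation assumes that no volumes were discarded before H estimation.

```python
def largest_power_of_two_at_most(n):
    """Return the largest power of two that does not exceed n (n >= 1)."""
    return 1 << (n.bit_length() - 1)


scan_seconds = 7.5 * 60   # Replication scan duration reported above
tr_seconds = 0.475        # Replication TR reported above
n_volumes = int(scan_seconds / tr_seconds)  # assumption: no volumes dropped
n_used = largest_power_of_two_at_most(n_volumes)
print(n_volumes, n_used)
```

Under these assumptions, roughly half of the acquired volumes enter the wavelet decomposition, which is worth keeping in mind when comparing H estimates across acquisitions of different lengths.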

All other statistical analyses were performed with R 4.1.1. These included one-way analyses of covariance (ANCOVA), Tukey post-hoc tests, two-sided two-sample Welch t tests, and a Spearman correlation analysis.

Classification analyses

A crucial aspect of classification algorithms in neuroimaging is sample size. It has been shown that larger sample sizes lead to more accurate estimates when brain-behavior relationships are investigated via traditional statistical approaches [69]. While this seems to be generally true also for machine learning [52], the relationship between sample size and classification accuracy does not appear to be entirely linear [70]. For this reason, in the current project, we established the initial classification accuracy of our brain-based classification model based on the largest of the two datasets.

First, using the Exploratory dataset, we classified the AT and SZ participants using a random forest (RF) algorithm, as implemented in the Interpretable Artificial Intelligence (IAI) toolbox (https://www.interpretable.ai/) and accessed through R 4.2.1 [71]. The RF algorithm was optimized using a grid search with 100 iterations of 100 trees of maximum 10 leaves per tree. The features were normalized before being used by the classifier, and the optimum sample split was determined using the “gini” criterion, which was preferred because it is less computationally expensive and gives comparable results to other criteria (e.g., “entropy”). The feature set consisted of the 53 H exponents of each participant. We ran 100 randomized sample splits and averaged the model performance metrics obtained across the splits to yield three final classification performance indices (i.e., area under the curve/AUC, sensitivity and specificity). With each new split, 50% of the data from each group were randomly allocated to the test set, and the rest to the training set. We used the AT sample as reference for calculating sensitivity and specificity. Following the classification, we selected the ten ICs with the highest feature importance of H (Supplementary Fig. 2) as a simplified feature set for use in RF classification of the Replication dataset. Next, using the Replication dataset, we classified the AT and SZ participants using the same algorithm and toolbox, with 100 splits and 50% randomized allocation of data into the training and test sets.
Five models were used for the RF classification in this case, containing the following features: (a) E/I model: only the H values of the ten ICs ranked as most important by the RF classification in the Exploratory dataset; (b) symptoms only model: the three PANSS factor scores and the ADOS total scores; (c) symptoms and cognitive model: PANSS, ADOS, EQ, BVAQ, and IQ scores; (d) E/I and symptoms model: the ten H from model (a) plus the PANSS and ADOS scores; and (e) E/I, symptoms and cognitive model: the ten H from model (a) plus the PANSS, ADOS, EQ, BVAQ, and IQ scores. As in the previous step, we calculated three classification performance indices (i.e., AUC, sensitivity and specificity), and used the AT sample as reference for sensitivity and specificity. A misclassification index for each participant and each model was calculated as the ratio between the number of times that participant was misclassified and the total number of times they were allocated to a test set.
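The resampling scheme and the per-participant misclassification index described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the classifier is a deliberately simple stand-in (a threshold on one synthetic feature) rather than the IAI random forest, and all function names, the random seed, and the synthetic data are assumptions made for the example. Only the group sizes (30 AT, 39 SZ), the 100 splits, the 50% per-group allocation, and the use of AT as the reference class come from the text.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, rng):
    """Randomly send 50% of each group to the test set and the rest to train."""
    train, test = [], []
    for group in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == group]
        rng.shuffle(idx)
        half = len(idx) // 2
        test.extend(idx[:half])
        train.extend(idx[half:])
    return train, test


def sensitivity_specificity(y_true, y_pred, positive="AT"):
    """Sensitivity and specificity with `positive` as the reference class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def fit_threshold(x, y):
    """Stand-in classifier (NOT a random forest): midpoint between group means."""
    mean_at = sum(v for v, lab in zip(x, y) if lab == "AT") / y.count("AT")
    mean_sz = sum(v for v, lab in zip(x, y) if lab == "SZ") / y.count("SZ")
    cut = (mean_at + mean_sz) / 2
    high = "AT" if mean_at > mean_sz else "SZ"
    low = "SZ" if high == "AT" else "AT"
    return lambda v: high if v > cut else low


# Synthetic single feature for 30 AT and 39 SZ participants (illustrative only).
rng = random.Random(1)
labels = ["AT"] * 30 + ["SZ"] * 39
feature = [rng.gauss(1.0 if lab == "AT" else 0.0, 1.0) for lab in labels]

n_tested = defaultdict(int)
n_wrong = defaultdict(int)
sens_runs, spec_runs = [], []
for _ in range(100):
    train, test = stratified_half_split(labels, rng)
    predict = fit_threshold([feature[i] for i in train], [labels[i] for i in train])
    y_true = [labels[i] for i in test]
    y_pred = [predict(feature[i]) for i in test]
    sens, spec = sensitivity_specificity(y_true, y_pred)
    sens_runs.append(sens)
    spec_runs.append(spec)
    for i, pred in zip(test, y_pred):
        n_tested[i] += 1
        n_wrong[i] += pred != labels[i]

mean_sensitivity = sum(sens_runs) / len(sens_runs)
mean_specificity = sum(spec_runs) / len(spec_runs)
# Misclassification index: errors divided by the number of test-set appearances.
misclassification_index = {i: n_wrong[i] / n_tested[i] for i in n_tested}
```

Dividing by the number of test-set appearances, rather than by the number of splits, keeps the index comparable across participants who happen to be drawn into the test set more or less often.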

Results

Group differences in demographic and phenotypic assessment

In the Exploratory dataset, there were significant group differences in age (F (2) = 42.45, p < 0.001), and sex (χ² (2) = 35.156, p < 0.001).

Data for the Replication dataset, including statistical tests, are presented in Table 1. There were significant differences in estimated IQ, age, sex and FD, and therefore these parameters were used as covariates in further group analyses. Regarding symptom assessments, the AT and SZ groups did not significantly differ in their social and communication skills, as indicated by the ADOS scores, but the PANSS scores on all three domains (i.e., positive and negative symptoms, and general psychopathology) were significantly elevated in SZ compared to AT. For social functioning, BVAQ-Fantasizing was significantly decreased in AT compared to SZ, while Empathy was significantly decreased in both AT and SZ compared to TD, and in AT compared to SZ.

Group differences in H in the exploratory dataset

The ANCOVA results testing group differences in H values are given in Table 2. While all areas showed significant group differences, the largest effect sizes (i.e., η² ≥ 0.06) were found for: the paracentral lobule (i.e., ICs no. 10 and 13 from the Sensorimotor domain), the calcarine gyrus, middle occipital gyrus, fusiform gyrus, and inferior occipital gyrus (i.e., ICs no. 17, 18, 20, 23 from the Visual network), the insula, and the superior, middle and right inferior frontal gyrus (i.e., ICs no. 27, 30, 31, 35 from the Cognitive control domain), the precuneus and anterior cingulate cortex (i.e., ICs no. 43, 44, 47 from the Default Mode network), and one area of the Cerebellar domain (i.e., IC no. 50).

Group differences in H in the replication dataset

The ANCOVA results showing group differences in H values from each component are given in Table 3. The areas most sensitive to overall group differences, after controlling for age, sex, IQ, and FD, were the left and right postcentral gyrus and paracentral lobule (i.e., IC no. 9, 11, and 10, from the Sensorimotor network), and the calcarine gyrus, middle occipital gyrus, middle temporal gyrus, inferior occipital gyrus, and lingual gyrus (i.e., IC no. 17, 18, 19, 23 and 24 from the Visual network). The supplementary motor area (i.e., IC no. 34), associated with the Cognitive Control domain, also yielded significant group differences.

Table 3 Group differences in Hurst exponent (H) per component, in the Replication dataset, computed using ANCOVA with age, sex, IQ, and FD as covariates.

Classification accuracy and feature importance in the exploratory dataset

Using the complete 53 H feature set, the classification performance was: AUC = 84%, Sensitivity = 65%, and Specificity = 83% (Fig. 1A; please note that sensitivity and specificity are calculated in relation to AT).

Fig. 1: Classification performance.

A Classification performance for each model and dataset. AUC area under the curve; a Expl: model with all 53 H values in the Exploratory dataset; a Repl: model with the ten H values in the Replication dataset; b: model with ADOS and PANSS; c: model with ADOS, PANSS, IQ, EQ, BVAQ; d: model with the ten H, ADOS and PANSS; e: model with the ten H, ADOS, PANSS, IQ, EQ, BVAQ. B Misclassification of participants. AT autism, SZ schizophrenia.

Next, we used the ten ICs with the highest feature importance from the RF classification of the Exploratory dataset to simplify the feature set for the RF classification of the Replication dataset. These ten ICs were: the precuneus and the anterior cingulate cortex (i.e., ICs 43, 44, 47, and 48, from the Default Mode Network), the superior frontal gyrus (i.e., IC 35 of the Cognitive Control domain), the paracentral lobule and the precentral gyrus (ICs 10 and 14, from the Sensorimotor domain), and the middle occipital gyrus, the inferior occipital gyrus, and the lingual gyrus (i.e., ICs 18, 23, and 24, from the Visual domain) (Supplementary Fig. 2).

Classification results in the replication dataset

In the Replication dataset, we obtained the following classification performance, outlined in Fig. 1A: model (a), using the reduced H feature set (ten ICs): AUC = 72%, Sensitivity = 64%, and Specificity = 67%; model (b), using the PANSS and ADOS as features: AUC = 78%, Sensitivity = 65%, and Specificity = 73%; model (c), using the PANSS, ADOS, EQ, IQ, and BVAQ as features: AUC = 76%, Sensitivity = 62%, and Specificity = 76%; model (d), using the ten H plus PANSS and ADOS: AUC = 81%, Sensitivity = 67%, and Specificity = 76%; model (e), using the ten H plus PANSS, ADOS, EQ, IQ, and BVAQ: AUC = 83%, Sensitivity = 68%, and Specificity = 79%.

In the Replication dataset, for each classification instance, we inspected the scaled contribution of each feature included in each of the five models. To account for the different number of features included in each classification model, we scaled the raw feature importance values (Supplementary Fig. 3) by multiplying them by the number of features included in the respective model (Supplementary Fig. 4). As can be seen in Supplementary Fig. 4, the precentral gyrus (IC 14), the inferior occipital gyrus (IC 23), the lingual gyrus (IC 24), and the precuneus (IC 48) are the most important H features overall. The most important symptom scores are the ADOS total, followed by PANSS positive and negative. However, the most important feature overall was the estimated IQ. This raised the concern that IQ might bias our classification output and misleadingly appear more important than other features in model (e), given that our groups were not matched for IQ (Table 1). We therefore repeated the classification for models (c) and (e) without IQ, and found that performance remained virtually unchanged (i.e., model (c): AUC = 0.75, Sensitivity = 0.6, Specificity = 0.76, and model (e): AUC = 0.83, Sensitivity = 0.68, Specificity = 0.81). Feature hierarchy also remained otherwise unchanged following IQ elimination.
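The scaling step described above can be illustrated with a short sketch. It assumes that raw importances within a model are normalized to sum to 1, so that multiplying by the number of features places every model on a common scale where 1.0 corresponds to the importance each feature would have if all contributed equally. The feature names and importance values below are hypothetical and are not taken from the Supplementary Figures.

```python
def scale_importance(raw_importance):
    """Multiply each raw importance by the number of features in the model."""
    n_features = len(raw_importance)
    return {name: value * n_features for name, value in raw_importance.items()}


# Hypothetical raw importances (each model's values sum to 1).
model_b = {"ADOS": 0.40, "PANSS_pos": 0.30, "PANSS_neg": 0.20, "PANSS_gen": 0.10}
model_a = {f"IC_{k}": 0.10 for k in range(10)}

scaled_b = scale_importance(model_b)
scaled_a = scale_importance(model_a)
```

In this toy case, a raw importance of 0.10 in the ten-feature model and 0.25 in the four-feature model would both scale to 1.0, which is what makes values comparable across models of different size.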

Finally, inspection of misclassified participants (Fig. 1B) showed that, on average, SZ were classified increasingly accurately as we moved from model (a) through model (e), while AT were most accurately classified when H features were included (i.e., models a, d, and e).

Antipsychotic medication effect

Finally, we verified whether medication (i.e., chlorpromazine equivalent) in the SZ group of the Replication dataset might have influenced the H values that we observed. We ran a Spearman correlation between the chlorpromazine equivalent and each of the 53 H values and found no significant correlations (all p > 0.05).
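For readers less familiar with the Spearman correlation used in this check, the sketch below computes the coefficient as the Pearson correlation of rank-transformed values, with tied values receiving their average rank. It is an illustration only: the analysis reported here was run in R, the function names are hypothetical, the dose and H values are invented, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented chlorpromazine-equivalent doses and H values for five participants.
dose = [100, 200, 300, 400, 600]
h_values = [0.71, 0.69, 0.74, 0.70, 0.72]
rho = spearman_rho(dose, h_values)
```

Because the coefficient depends only on ranks, it is insensitive to the skewed distribution that dose-equivalent measures often show.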

Discussion

In the current project, we assessed the feasibility of using the E/I ratio, as estimated by the H exponent, to distinguish between AT and SZ. We had two main goals: (1) to compare the classification accuracy when different sets of clinical, social and non-social cognition, and imaging features were combined, and (2) to perform an out-of-sample replication of the classification model using a reduced set of brain-based features (i.e., the H exponent from the ten most important brain components).

We first explored differences between the three groups in each independent dataset (Tables 2 and 3). The most consistent findings across both datasets reflected a significantly reduced H (i.e., increased E/I ratio) in SZ compared to TD in most areas of the cerebellar domain, the bilateral postcentral gyrus and paracentral lobule (from the Sensorimotor domain), and all but one area of the Visual domain. Given that levels of glutamate and GABA have been reported to vary inconsistently across SZ and, moreover, to be impacted by both medication and disease progression (see [41] for an in-depth overview), we believe that the replicability of these group differences is all the more notable. What is more, this may suggest that despite the aforementioned neurotransmitter variations related to medication and disease progression in SZ, a persistently elevated E/I ratio remains, predominantly in visual processing brain areas. In addition, we checked whether antipsychotic medication in the SZ group from the Replication dataset correlated with H; we found that it did not, which suggests that, at least in this sample, medication did not substantially bias our results. Nevertheless, the effects of medication and condition cannot be completely disentangled in the current study.

Group differences were less consistent between TD and AT, and the effects mostly displayed opposite directions in the two independent datasets. Namely, in the Exploratory dataset, significantly larger H (i.e., reduced E/I ratio) was found in AT compared to TD in the precuneus, the cerebellum, the frontal cortex, the left inferior parietal lobule, and some areas of the visual domain (i.e., calcarine, fusiform, inferior occipital, and middle temporal gyrus). In the Replication dataset, the supplementary motor area showed significantly reduced H (i.e., increased E/I ratio) in AT compared to TD. One potential explanation for this reduced replicability is the increased heterogeneity of AT, which may simply not have been sufficiently captured given the limited size of the Replication dataset. Indeed, [41] also suggests that E/I variations appear to be more heterogeneous in AT than in SZ. Sex differences, which we were unable to assess in the current project given the insufficient number of female AT participants, could also contribute to these inconsistent findings. Previously, [22] reported that the H of the ventromedial prefrontal cortex was significantly elevated in female AT compared to female TD, but significantly decreased in male AT compared to male TD. Thus, a more extensive exploration of sex differences in larger samples of AT could clarify this aspect.

A direct comparison of H in AT vs. SZ revealed no significant differences in the Replication dataset but, in the Exploratory dataset, showed a consistent pattern of significantly increased H (i.e., decreased E/I ratio) in AT compared to SZ, in all but six of the 53 brain areas. These findings require further validation given the heterogeneity in data collection between these two groups in the Exploratory dataset.

Classification of AT versus SZ using the H values of all 53 brain areas in the Exploratory dataset performed very well (AUC = 84%), with especially high specificity for SZ (83%). A similar trend was maintained when using a reduced classification model comprising the ten most important H-based features in the Replication dataset (Fig. 1A), though classification performance was reduced overall, as was to be expected in an out-of-sample replication and given the limited size of the Replication dataset. We further explored how augmenting the feature set in the Replication dataset could improve classification performance. We found that while using PANSS and ADOS alone resulted in increased classification performance compared to using H alone, performance increased further when combining H, PANSS and ADOS, and was highest when using H, PANSS, ADOS, as well as social and non-social cognitive measures (i.e., IQ, EQ and BVAQ). This increase in performance appeared to be mostly driven by a steady decrease in misclassified SZ (Fig. 1B). Because unmatched IQ between AT and SZ might have biased these results, we re-ran these classification models without IQ as a feature and obtained similar results. In both the Exploratory and the Replication dataset, sensitivity was < 70% for all models, indicating that our models retained substantial error when identifying true positives. Specificity, however, was consistently > 70% in both datasets and for all models except for model (a) in the Replication dataset. This indicates that using the reduced set of H features alone is insufficient to correctly identify true negatives.
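To make the sensitivity and specificity figures above concrete, the following is a minimal sketch of how the two quantities are derived from true and predicted labels. It is not the pipeline used in this study; the function name, the toy labels, and the choice of AT as the "positive" class are illustrative assumptions.

```python
def sensitivity_specificity(y_true, y_pred, positive="AT"):
    """Return (sensitivity, specificity) for a two-class problem.

    `positive` names the class treated as the positive label; using "AT"
    here is an assumption made for illustration only.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive case correctly identified
            else:
                fn += 1  # positive case missed
        else:
            if pred == positive:
                fp += 1  # negative case wrongly flagged as positive
            else:
                tn += 1  # negative case correctly rejected
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Hypothetical labels, not data from either dataset.
y_true = ["AT"] * 5 + ["SZ"] * 5
y_pred = ["AT", "AT", "AT", "SZ", "SZ", "SZ", "SZ", "SZ", "SZ", "AT"]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this toy example, three of five positive cases are recovered (sensitivity 0.6) and four of five negative cases are correctly rejected (specificity 0.8), mirroring the pattern of lower sensitivity and higher specificity described above.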

We may therefore conclude that: (1) classification performance based on E/I, as estimated by H, is substantial and replicable across independent datasets, and (2) classification performance is highest when H and phenotypic assessment are combined, resulting in notably fewer misclassifications in both AT and SZ (Fig. 1B). Interestingly, AT were more frequently misclassified when only phenotypic assessment was used, compared to both H only and H combined with phenotypic assessment (Fig. 1B). Given that the data for both AT and SZ in the Replication dataset were acquired using the same protocol and testing environment, we surmise that inherent AT heterogeneity could best explain these different trajectories in classification performance.

H, as an fMRI-based estimate of the E/I ratio, has long been proposed as a valuable biomarker, and its use has increased substantially in recent years, likely due to the availability of less computationally expensive implementations. In addition, toolboxes equipped with algorithms that are better suited to estimating H from biological signals, such as the nonfractal toolbox [72], have likely also contributed to the neuroimaging community’s increased attention to the multiple applications of the H exponent. Recent studies have investigated the longitudinal developmental trajectories of AT children [73] using H and have shown that E/I imbalances can already be identified in childhood. Also using H based on rsfMRI, to compute non-fractal connectivity in a sample of adult AT and TD participants, [74] obtained a classification performance of > 90%. The H exponent has further been shown to correlate positively with molecular indices of E/I balance, such as parvalbumin mRNA expression in human children and adults, as well as parvalbumin-positive cell counts in mice [75]. Thus, H appears to be an ever-stronger contender among the biomarkers that have so far been proposed for the study of AT and E/I. These latest findings also add further support to the results of our current study, as they indicate that a pervasive E/I imbalance is characteristic of AT, and that H can be used as a robust estimator.
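For readers unfamiliar with what an estimate of H involves, the sketch below illustrates one simple, classical estimator, the aggregated-variance method, on a synthetic time series. This is explicitly not the algorithm implemented in the nonfractal toolbox nor the procedure used in this study; it is a swapped-in textbook technique, and the function name, block sizes, and synthetic input are assumptions chosen for illustration. The method relies on the fact that, for a stationary self-similar series, the variance of block means of size m scales approximately as m^(2H − 2).

```python
import math
import random


def hurst_aggregated_variance(series, block_sizes=(4, 8, 16, 32, 64)):
    """Estimate H with the aggregated-variance method (illustrative only)."""
    n = len(series)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = n // m
        if n_blocks < 2:
            continue  # too few blocks to compute a variance
        # Mean of each non-overlapping block of length m.
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        grand_mean = sum(means) / n_blocks
        var = sum((x - grand_mean) ** 2 for x in means) / (n_blocks - 1)
        if var <= 0:
            continue  # log undefined; skip degenerate block size
        log_m.append(math.log(m))
        log_var.append(math.log(var))
    if len(log_m) < 2:
        raise ValueError("series too short for the requested block sizes")
    # Ordinary least-squares slope of log(variance) against log(block size).
    mean_x = sum(log_m) / len(log_m)
    mean_y = sum(log_var) / len(log_var)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(log_m, log_var))
             / sum((x - mean_x) ** 2 for x in log_m))
    # slope = 2H - 2, hence H = 1 + slope / 2.
    return 1.0 + slope / 2.0


# Synthetic uncorrelated noise (theoretical H = 0.5); not BOLD data.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
h = hurst_aggregated_variance(noise)
```

For uncorrelated noise the estimate should fall near the theoretical value of 0.5, whereas series with long-range positive dependence yield larger values. Resting-state time series are typically far shorter than the synthetic series used here, which is one reason estimators designed specifically for short biological signals are preferred in practice.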

Limitations

While the Exploratory dataset that we used in the first step of our classification analysis offered a satisfactory amount of data, it has the limitation that these data were collected using a variety of protocols and acquisition sites. On the other hand, while the Replication dataset was acquired uniformly and offered us the possibility to test different classification models, given that identical clinical assessment protocols were used for both AT and SZ, it contained few participants. Nevertheless, we believe that, especially given these limitations, the replicable results using the brain-based data are noteworthy. Another limitation, which we were unable to address using the currently available datasets, was sex differences. This is especially true for AT, given the historic bias which has resulted in currently available samples being overwhelmingly dominated by male participants. A limitation regarding the replication sample was that interviewers were not blind to participants’ diagnoses, which could have biased the on-site assessments. Finally, we wish to note that in this project, the E/I ratio was estimated indirectly from the rsfMRI BOLD signal through the H exponent. While rsfMRI has the advantage of being noninvasive and requiring only a few minutes of acquisition, its shortcoming is that H is a computational proxy of the E/I ratio, and it is therefore unclear whether it captures the underlying fluctuations of excitatory and inhibitory neurotransmitters. We therefore caution readers to interpret these results with this limitation in mind.

Conclusion

In conclusion, in the current study, we probed the ability of the E/I ratio, as estimated indirectly by the H exponent, to discriminate between AT and SZ. We showed that it has good discriminatory power, on its own as well as in combination with additional phenotypic measures. In fact, this neurophysiological measure augmented the discriminative effect of traditional phenotypic measures. Notably, when using brain-based classification features only (i.e., H), we obtained good classification performance, which speaks for the distinct possibility that different etiologic mechanisms underlie the observed E/I ratio alterations in AT and SZ. Given that E/I ratio alterations are closely linked to impaired sensory processing, these results may offer an avenue for future clinical research to explore potential therapeutic interventions specific to each group. Our results point to incorporating H as a valuable discriminant criterion in differential diagnosis, with especially encouraging results for SZ, since misclassification instances of SZ were minimized when H was added to all the available phenotypic measures. In addition, for AT, even using H only led to fewer misclassifications than when the standard ADOS and PANSS were used, but combining H with phenotypic measures did not lead to a significantly better discriminatory effect. We believe that this stresses the potential for the E/I ratio to contribute to differential diagnosis.