Introduction

Major depressive disorder (MDD) dramatically impacts the daily functioning of patients and is currently the leading cause of disability worldwide [1]. Therefore, early diagnosis and optimal allocation of the proper treatment are critical. Unfortunately, current treatment strategies yield response and remission rates as low as 36.8% after a first treatment [2,3,4]. Thus, as proposed in the realm of systems medicine, we expect that identifying brain patterns that classify patients at the individual level may open new biomarker-based avenues for the development of more personalized and effective treatments.

Neuroimaging techniques, such as magnetic resonance imaging (MRI), enable a non-invasive macro-scale view of human brain structure at the millimeter level of resolution. Initial neuroimaging studies used univariate approaches to reveal structural brain differences in MDD compared to healthy controls (HC) [5,6,7], identifying reduced hippocampal and frontal lobe volume. However, these studies had limited sample sizes and the more recent large sample studies have reported small effect sizes [8,9,10,11], highlighting the absence of a single neuro-anatomical biomarker associated with MDD. The search for more complex biomarkers, which may include the interaction between different neuro-anatomical features, can be conducted via machine learning (ML) algorithms - especially deep learning (DL) algorithms - applied to the MDD vs HC classification task.

Like univariate approaches, ML and DL studies have reported widely varying classification accuracies, ranging from 53% to 91% [12, 13]. The high variability of classification performance and the lack of consistent biomarkers can partly be explained by small sample sizes, as demonstrated by Flint and colleagues [14]. In line with this, a study based on cortical and subcortical morphological features reported a high accuracy of 75% in a small sample, which was not replicated in the large independent UK Biobank dataset, where accuracy reached only 54% [15].

Another factor that may inflate classification accuracies is the study-site effect. The site effect corresponds to site-specific characteristics other than diagnosis, such as scanner type, acquisition protocol, demographic differences, and inclusion and exclusion criteria, which may bias classification accuracies. One study demonstrated how the site effect may contribute to both inflated and deflated classification accuracies [16]. Numerous ways to tackle the site effect and improve model generalizability exist, from linear and non-linear ComBat harmonization tools [17, 18] to embedding site confounders directly into the model [19]. However, to overcome the heterogeneity of MDD and the lack of replicability and generalization of the models, the investigation of very large samples of participants with global representation is fundamental.

Using a large-scale dataset from the ENIGMA-MDD consortium, we compared the classification performance of commonly used ML models to predict diagnosis based on cortical and subcortical parcellations of morphological features (surface areas, thicknesses, volumes) [20]. Overall, results showed a trend that may highlight the contribution of site-effects to classification performance. Specifically, there was a clear difference in classification performance dependent on the cross-validation splitting techniques used in training. Site-splitting generally performed at close to chance level for all classifiers, while mixing sites across splits achieved up to 62% balanced accuracy with an SVM. Of note, data harmonization using ComBat removed the site effect and resulted in a balanced accuracy of 52% with SVM. Based on these findings, we concluded that most commonly used ML classification algorithms could not successfully discriminate MDD from HC individuals based on morphological features organized in pre-defined Desikan-Killiany atlas parcellations. However, it remains unclear whether more fine-grained information of morphometric features, displayed in a vertex-wise organization, could outperform the classification based on parcellation atlas-distributed information.

There are several directions for improving classification based on morphological information. First, previous ML studies considered surface area, thickness, and volume characteristics only, while information on cortical shape, such as gyral and sulcal shape patterns, was not integrated into the analyses. Cortical gyrification patterns are affected by genetic and non-genetic factors [21, 22], and their alterations have been associated with MDD [23, 24]. Multimodal morphological feature analysis, including myelination, gray matter, and curvature, revealed a correlation between cortical differences and MDD-associated genes [25]. Therefore, the addition of shape modalities, such as cortical curvature and sulcal depth, to cortical thickness could enhance the classification performance, as demonstrated for sex and autism classification [26].

Cortical morphological features such as sulcal depth and gyrification, measured via the local gyrification index (LGI) or curvature, have been investigated as potential biomarkers for MDD, although the literature remains limited and somewhat inconsistent. One earlier study suggested that sulcal depth may be decreased in individuals with suicidality-associated MDD [25]. Even so, this study included only 39 healthy controls, 40 depressed patients without suicidality (patient controls), and 39 patients with suicidality (suicidal group), who were analyzed using surface-based morphometry (SBM) to estimate the fractal dimension, gyrification index, sulcal depth, and cortical thickness; the small sample size and the range of features assessed make it prone to both type I and type II errors, relative to the studies we have performed in thousands of patients. In terms of gyrification, multiple studies have reported both hypo- and hyper-gyrification in various cortical regions, including the frontal, cingulate, insular, parietal, and temporal lobes [24, 27,28,29,30,31]. However, these findings are often region-constrained, based on small sample sizes, and lack consistent replication across cohorts and studies. These constraints highlight the need for coordinated multi-site analyses using harmonized data and advanced morphometric modeling approaches.

A second promising direction is the use of more advanced classification algorithms. DL methods have gained popularity in the neuroimaging field as a promising tool for cortical surface reconstruction [32], image preprocessing [33], and cortical parcellation [34]. Furthermore, DL is widely evaluated as a predictive tool in psychiatry, showing higher or at least equal classification performance compared to linear models [26, 35,36,37,38,39]. Cortical morphometric features can be analyzed via a convolutional neural network (CNN) [40], which is designed to reveal complex patterns in 2D images. Applying such a 2D CNN to classification requires the 3D cortical features to first be projected into 2D image space. Nevertheless, this step may distort spatial properties such as shape, area, distance, and direction. Several approaches have been implemented previously, such as latitude/longitude projection [41] and optimal mass transport (OMT) projection [26, 42], which preserves area. However, the impact of these projection methods on classification performance has never been directly compared in the neuroimaging field.

The main goal of this study was to distinguish MDD from HC individuals based on integrated cortical morphological features, including sulcal depth, curvature, and thickness. These features were analyzed via an SVM with a linear kernel and a CNN based on the pre-trained DenseNet architecture [43], which demonstrated its superiority over simpler models in the autism vs HC classification task [26]. SVM was chosen as it is a robust shallow ML model frequently used in neuroimaging settings [44,45,46]. We investigated the classification performance of these two methods to understand the role of complex non-linear patterns in MDD manifestation. We used balanced accuracy, sensitivity, specificity, and AUC as the classification performance metrics. A higher classification performance of the DenseNet model would suggest the presence of spatially complex patterns in brain morphology that are relevant for classification. Furthermore, we aimed to estimate the relevance of integrating cortical thickness and shape characteristics (sulcal depth and curvature) into the analysis by training the models with all features combined and with each feature separately. Similar to our previous study [20], different cross-validation (CV) approaches were evaluated: splitting the data by balancing the age and sex distributions across all CV folds (Splitting by Age/Sex), and performing leave-sites-out CV in order to estimate the performance on sites unseen during training (Splitting by Site). This approach allowed us to estimate whether the model’s performance is influenced by demographic or site-related factors. A difference between the results of the two splitting strategies would suggest the presence of a site effect, which we addressed by harmonizing the data in both splitting strategies via ComBat.
In summary, we hypothesized that: (1) integration of cortical thickness and shape characteristics would contribute positively to the classification performance, and (2) DenseNet could differentiate MDD from HC based on the provided features. Additionally, we compared two projection methods, latitude/longitude and OMT, by performing an auxiliary single-site sex classification in three of the largest cohorts to explore whether classification performance varies with the 2D projection method. We had no a priori hypothesis for the projection results.

Material and methods

Study participants and study design

We analyzed a large-scale multi-site sample provided by the ENIGMA-MDD working group, comprising 2772 MDD and 4240 HC individuals from 30 cohorts worldwide. Details on inclusion/exclusion criteria and sample characteristics can be found in Supplementary Table 1. Subjects with missing demographic information or missing any of the cortical surface mesh files (l(r).sulc, l(r).curv, l(r).thickness) were excluded from the analysis (476 subjects, 6%, were excluded).

Image processing and analysis

Each site acquired structural T1-weighted MRI scans of participants and preprocessed them according to the ENIGMA Consortium protocol (http://enigma.ini.usc.edu/protocols/imaging-protocols/). This pipeline includes the segmentation of T1-weighted MRI volumes, tessellation, topology correction, and spherical inflation of the white matter surface. Detailed information on the acquisition protocols and scanner models in each cohort can be found in Supplementary Table 2. Cortical meshes were generated during FreeSurfer preprocessing at every site. Cerebral cortex meshes were then extracted from the FreeSurfer unsmoothed fsaverage6 template, effectively removing intracranial volume (ICV) differences (Supplementary Fig. 1) and yielding 37,747 and 37,766 vertices for the left and right hemispheres, respectively. The preprocessing pipeline applied in this study was consistent across all subjects, regardless of age, as the core procedures do not differ fundamentally between adolescents and adults. We analyzed vertex-wise features, namely sulcal depth, curvature, and thickness, both as integrated features and separately (Fig. 1).

Fig. 1: Proposed conceptualization levels and implementation of classification procedure.

Left: Higher classification performance in the MDD vs HC classification task may be achieved by implementing deep ML models, such as DenseNet, in comparison to a shallow ML model, such as SVM. Furthermore, the analysis of integrated morphometric features can provide a more detailed description of cortical organization than separate features, leading to better differentiability of MDD from HC. The application of ComBat may improve the generalizability of results, as site-related differences are removed. Right: Cortical sulcal depth, curvature, and thickness are first projected into the 2D grid and then transformed into 2D images using OMT projection. We split the data into 10 CV folds according to age and sex (Splitting by Age/Sex) and according to site affiliation (Splitting by Site). After the residualization step, in which the age and sex effects are regressed out linearly, we train and test SVM and DenseNet on the diagnosis classification.

Considering the absence of well-established CNN models pre-trained on cortical meshes, we projected 3D cortical surfaces into 2D images and applied DenseNet, which was pre-trained on natural images. Few studies have applied different projection methods, such as latitude/longitude projection and area-preserving maps [e.g., 26, 41]. Of note, the latitude/longitude method, in which the cortical mesh is first re-sampled to the sphere and subsequently mapped to the 2D grid, creates strong area distortions at the edges and near the medial wall close to subcortical regions [41]. Both methods may (differentially) influence subsequent classification performance, but to the best of our knowledge, no study to date has directly compared them using the same samples. Thus, we applied both 2D projection methods to the cortical meshes, resulting in 224 × 224 pixel images for each hemisphere. The images were normalized to have a mean of 0 and a standard deviation of 1.
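As a rough illustration of the latitude/longitude idea and the subsequent normalization, the sketch below bins per-vertex values on a sphere into a 224 × 224 grid and standardizes the result. It is a simplified stand-in, not the pipeline used in this study: the function names (`latlong_project`, `standardize`), the nearest-bin averaging, the zero-filling of empty pixels, and the per-image standardization are all assumptions made for the example, and OMT projection is not shown.

```python
import numpy as np

def latlong_project(xyz, values, size=224):
    """Average per-vertex values into a size x size colatitude/longitude grid.

    Hypothetical helper: nearest-bin averaging, no interpolation.
    """
    xyz = np.asarray(xyz, dtype=float)
    values = np.asarray(values, dtype=float)
    r = np.linalg.norm(xyz, axis=1)
    colat = np.arccos(np.clip(xyz[:, 2] / r, -1.0, 1.0))  # in [0, pi]
    lon = np.arctan2(xyz[:, 1], xyz[:, 0]) + np.pi         # in [0, 2*pi]
    rows = np.minimum((colat / np.pi * size).astype(int), size - 1)
    cols = np.minimum((lon / (2 * np.pi) * size).astype(int), size - 1)
    total = np.zeros((size, size))
    count = np.zeros((size, size))
    np.add.at(total, (rows, cols), values)  # unbuffered sum per pixel
    np.add.at(count, (rows, cols), 1)
    image = np.zeros((size, size))
    filled = count > 0
    image[filled] = total[filled] / count[filled]  # empty pixels stay 0 (assumption)
    return image, filled

def standardize(image):
    """Rescale one image to mean 0 and standard deviation 1."""
    std = image.std()
    centered = image - image.mean()
    return centered / std if std > 0 else centered
```

Equal steps in colatitude and longitude cover very different areas on the sphere (rows near the poles are strongly stretched), which is one way to see the area distortion discussed above.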

Data splitting

To assess potential biases in the model’s decision-making, we performed 10-fold cross-validation (CV) by splitting the data according to (1) demographic covariates, in which the age and sex distributions were balanced and subjects from each site were equally distributed across all CV folds (Splitting by Age/Sex), and (2) site affiliation, where each site was contained in only one CV fold (Splitting by Site). In both strategies, 9 CV folds were used for training, while the remaining CV fold was used as a test set. This procedure was repeated iteratively until every CV fold had been used as a test set. In the Splitting by Age/Sex strategy, the effect of demographic factors on classification performance is reduced, as the model is trained and tested on the same demographics. Nevertheless, site-related differences may bias the decision-making of the classification models [20], which is directly addressed in Splitting by Site. This strategy demonstrates how well a model trained on one set of sites can be applied to data from unseen sites. As the number of sites exceeds the number of folds, we distributed the sites across the folds to balance the number of subjects in every fold as closely as possible, by first iteratively distributing the largest sites across all 10 folds. The smallest sites were added subsequently to further even out the number of subjects in every fold. Overall, a difference in the classification results between these two splitting strategies may indicate the existence of a site effect. A more detailed description of both splitting strategies can be found elsewhere [20].
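One plausible reading of the Splitting by Site procedure is a greedy, largest-first assignment in which each site goes to the fold that currently holds the fewest subjects. The sketch below illustrates that reading only; the function name `assign_sites_to_folds` and the site labels and sizes are hypothetical, and the exact procedure used in the study is the one described in [20].

```python
import heapq

def assign_sites_to_folds(site_sizes, n_folds=10):
    """Greedy balancing: place sites largest-first into the smallest fold.

    site_sizes maps a site label to its number of subjects (hypothetical input).
    Returns a dict mapping each site to exactly one fold index.
    """
    # Heap of (subjects currently in fold, fold index); ties go to the lower index.
    heap = [(0, fold) for fold in range(n_folds)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(site_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for site, size in ordered:
        total, fold = heapq.heappop(heap)
        assignment[site] = fold
        heapq.heappush(heap, (total + size, fold))
    return assignment

# Toy example with made-up site sizes and 3 folds.
sizes = {"A": 700, "B": 400, "C": 400, "D": 300, "E": 100, "F": 100}
folds = assign_sites_to_folds(sizes, n_folds=3)
fold_totals = [sum(sizes[s] for s in sizes if folds[s] == k) for k in range(3)]
print(fold_totals)  # [700, 700, 600]
```

Because every site lands in exactly one fold, a model evaluated on a held-out fold never sees subjects from the test sites during training.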

MDD vs HC classification

After the data-splitting step, the primary analysis was carried out. First, we residualized all features normatively, removing linear age and sex dependencies. To avoid data leakage, age and sex regressors were estimated on the healthy subjects from the training set (9 CV folds) and then applied to the training and test set (1 CV fold) for patients and HC. Next, the classification algorithms were trained on the training folds, and classification performance was estimated on the test fold. As demonstrated by Dinga and colleagues, accuracy alone should be avoided as it does not account for class frequencies [47]. Thus, the algorithms were evaluated according to categorical measures, including balanced accuracy, sensitivity, and specificity, and a rank-based measure, AUC, allowing for a broad overview of performance. For model-level assessment [48], we performed the classification using all features combined and then using features separately to assess the final classification performance. We evaluated the classification performance of a robust shallow model (SVM with a linear kernel) and a DL model (DenseNet pre-trained on natural images from the ImageNet dataset [49]), which has been shown to be a robust convolutional neural network for image classification in both natural images and neuroimaging contexts [26, 43]. When DenseNet was trained on a single data domain, left and right hemisphere images were propagated through corresponding left and right DenseNets, the fully connected layers of which were concatenated. The resulting feature vectors were then fed to the output layer. For the whole-brain all-features analysis, we extracted the feature vectors for every feature type and hemisphere, concatenated them, and fed them to the output layer. For SVM, all considered images were flattened and then concatenated into a single array. In this study, we intentionally chose not to apply dimensionality reduction techniques (e.g., PCA or feature selection) prior to model training. 
This decision was driven by the goal of preserving the full anatomical interpretability of vertex-wise cortical features and directly evaluating the classification potential of the complete morphometric representation. To mitigate the risk of overfitting in this high-dimensional setting, we implemented nested 10-fold cross-validation for robust performance estimation and hyperparameter tuning. Specifically, for the SVM, nine values of the regularization parameter (C) were explored, resulting in 90 model evaluations across outer folds. For DenseNet, the grid search spanned 54 unique hyperparameter combinations, yielding 540 model evaluations (hyperparameters in Supplementary Table 3). The concept and implementation of the analysis are illustrated in Fig. 1. To mitigate site-related differences, which may bias the classification results, we additionally performed the analysis after harmonizing all features via ComBat. Variance explained by age and sex was preserved during this harmonization step. Next, we residualized the features normatively, as described above, and trained and tested the models. The application of ComBat differed between the two splitting strategies. In short, for Splitting by Age/Sex, ComBat parameters estimated on the training set were applied directly to the test set. In Splitting by Site, ComBat was applied twice: first, we used ComBat to harmonize the training sites; second, we applied ComBat to adjust the test sites to the harmonized training sites, i.e., using the training sites as the reference batch [50]. A more detailed description of the ComBat application can be found in our previous work [20].
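The leakage-free normative residualization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study code: the function names (`fit_normative`, `residualize`), the ordinary least-squares fit with an intercept, and the 0/1 coding of sex are choices made for the example.

```python
import numpy as np

def _design(age, sex):
    """Design matrix with intercept, age, and sex (sex coded 0/1, an assumption)."""
    age = np.asarray(age, dtype=float)
    sex = np.asarray(sex, dtype=float)
    return np.column_stack([np.ones(len(age)), age, sex])

def fit_normative(train_features, train_age, train_sex, train_is_hc):
    """Estimate linear age/sex effects per feature on training-set HC only."""
    hc = np.asarray(train_is_hc, dtype=bool)
    X = _design(train_age, train_sex)[hc]
    Y = np.asarray(train_features, dtype=float)[hc]
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta  # shape (3, n_features)

def residualize(features, age, sex, beta):
    """Remove the fitted age/sex effects from any subjects (train or test)."""
    return np.asarray(features, dtype=float) - _design(age, sex) @ beta
```

In each CV iteration, `fit_normative` would be called once on the nine training folds, and the resulting coefficients applied with `residualize` to both the training subjects and the held-out fold, so that no test-fold information enters the regression.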

Auxiliary analysis in projection methods

To explore and evaluate the potential impact of 2D projection methods on the classification performance, we compared both methods in their ability to distinguish healthy males from healthy females in 3 of the largest cohorts separately. The single-site classification was estimated via 10-fold CV on 411, 723, and 397 subjects, respectively. As in the main analysis, 9 CV folds were used for training, while the remaining CV fold was used as a test set. This procedure was repeated iteratively until every CV fold had been used as a test set. To provide an initial perspective on the potential classification advantages of the pre-trained DenseNet, we presented the balanced accuracies obtained by two classifiers: an SVM with a linear kernel and the DenseNet [43]. Furthermore, using the hyperparameters found via the sex classification task (Supplementary Table 3), we presented the classification performance of both models.

Results

Participants and data splitting

We detected substantial differences in age (78% of pairwise comparisons between cohorts were significant, t-test, p < 0.05) and sex (47%, t-test, p < 0.05) across cohorts. The demographic and clinical profile is presented in Table 1. As expected, Splitting by Age/Sex resulted in more balanced CV folds with respect to number of subjects, age and sex distributions, while folds created by Splitting by Site were more uneven on these characteristics (Table 2).

Table 1 Participating sites.
Table 2 Data splitting strategies.

MDD vs HC classification

First, we compared the performance of SVM and DenseNet for different splitting strategies (Fig. 2). In Splitting by Age/Sex, SVM achieved a balanced accuracy of 0.551 ± 0.021, while DenseNet yielded 0.578 ± 0.022. In Splitting by Site, both SVM and DenseNet performed worse, yielding 0.528 ± 0.039 and 0.512 ± 0.019, respectively. The minor difference in classification performance between the splitting strategies indicated a potential site effect, which we addressed by applying ComBat. In Splitting by Age/Sex, the balanced accuracy of SVM with ComBat dropped to 0.478 ± 0.019, while the performance of DenseNet remained similar at 0.561 ± 0.015. In Splitting by Site with ComBat, the performance of both models was similar and close to random chance, with balanced accuracies of 0.520 ± 0.019 and 0.508 ± 0.020 for SVM and DenseNet, respectively. Thus, we did not observe an improvement in the models’ performance after data harmonization by ComBat. A full panel of results, including all classification metrics, can be found in Supplementary Table 4.
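For readers less familiar with the reported metrics, the sketch below computes sensitivity, specificity, balanced accuracy, and a rank-based AUC for one test fold. The function name `classification_metrics` and the 0.5 decision threshold are assumptions made for the example. The toy call uses the class sizes of this sample (2772 MDD, 4240 HC) to show why plain accuracy is misleading: a classifier that always predicts HC reaches about 60.5% accuracy but only 50% balanced accuracy.

```python
import numpy as np

def classification_metrics(y_true, y_score, threshold=0.5):
    """Categorical metrics plus rank-based AUC (Mann-Whitney formulation)."""
    y_true = np.asarray(y_true).astype(bool)       # True = MDD, False = HC
    y_score = np.asarray(y_score, dtype=float)
    y_pred = y_score >= threshold
    sensitivity = float(np.mean(y_pred[y_true]))    # recall on patients
    specificity = float(np.mean(~y_pred[~y_true]))  # recall on controls
    # Average ranks (ties share the mean rank), then the Mann-Whitney U statistic.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos, n_neg = int(y_true.sum()), int((~y_true).sum())
    auc = (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
    return {
        "accuracy": float(np.mean(y_pred == y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2.0,
        "auc": float(auc),
    }

# Always predicting HC on a sample with 2772 MDD and 4240 HC.
labels = np.r_[np.ones(2772), np.zeros(4240)]
trivial = classification_metrics(labels, np.zeros(labels.size))
print(round(trivial["accuracy"], 3), trivial["balanced_accuracy"])  # 0.605 0.5
```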

Fig. 2: MDD vs HC classification performance of SVM and DenseNet applied to vertex-wise cortical features.

Balanced accuracy for both classification models when trained on all features integrated with and without ComBat harmonization for both splitting strategies and when trained on single features. Error bars represent standard deviation.

Next, we explored whether any of the considered feature modalities yielded greater classification performance (Fig. 2). In Splitting by Age/Sex, all data modalities yielded a similar range of accuracies for both classification models: thickness (SVM: 0.549 ± 0.020; DenseNet: 0.576 ± 0.019), sulcal depth (SVM: 0.543 ± 0.022; DenseNet: 0.562 ± 0.019), and curvature (SVM: 0.531 ± 0.015; DenseNet: 0.567 ± 0.019). In Splitting by Site, sulcal depth (SVM: 0.523 ± 0.016; DenseNet: 0.515 ± 0.020), curvature (SVM: 0.513 ± 0.033; DenseNet: 0.516 ± 0.025), and thickness (SVM: 0.522 ± 0.038; DenseNet: 0.515 ± 0.022) also exhibited a similar range of classification accuracies. Both models performed similarly for all feature types. These results demonstrate that integration of shape modalities with cortical thickness did not benefit the classification models. Results from the exploratory analyses for each hemisphere and for each feature modality per hemisphere showed no improvements in the performance of the models (Supplementary Table 5, Supplementary Fig. 3). In addition, we applied the main demographic and clinical stratifications used in the ENIGMA-MDD working group to assess post hoc whether more homogeneous groups would achieve better classification metrics (Supplementary Table 6).

Auxiliary sex prediction task

As an initial step, we also conducted a sex classification to explore which projection method (latitude/longitude or OMT) yields higher classification performance for both SVM and DenseNet (Supplementary Fig. 2). There was no clear difference between projection methods; however, we observed a consistently higher classification performance of DenseNet compared to SVM for all types of features and hemispheres. Considering the previous success of OMT as a projection method applied to the cortical surface and its property of preserving distances between vertices [26], we conducted our main analysis with OMT projection.

Discussion

In this work, we evaluated the diagnostic classification performance of DenseNet and SVM models trained on cortical maps projected via OMT, including sulcal depth, curvature, and thickness, from a multi-site global dataset. Our analysis included 7012 participants from 31 sites worldwide, allowing for a comprehensive and realistic overview of classification performance. Both models were evaluated in parallel using two different CV splitting strategies. In Splitting by Age/Sex, we obtained CV folds with comparable demographics; thus, the performance of the models should not be affected by these demographic variables. In Splitting by Site, whole sites were distributed across folds, so that models were trained and tested on different sets of sites. This strategy is closer to the application of diagnostic classification models in clinical practice and allowed for a realistic estimation of classification performance on unseen sites. Overall, the classification performance of both models was similar: in Splitting by Age/Sex, DenseNet achieved 58% vs 55% for SVM; in Splitting by Site, the difference was even smaller, with DenseNet achieving 51% vs 52% for SVM. Both models performed better in Splitting by Age/Sex, implying the presence of a confounding site effect, most likely arising from differences in scanner vendors or image acquisition parameters. In this sense, ComBat brought the classification results of the two splitting strategies closer together, but did not improve the accuracy of the models. Ultimately, the classification performance of both models for all integrated morphometric features, both in Splitting by Age/Sex and in Splitting by Site, was similar to that obtained with single features.

Cortical morphological maps as diagnostic biomarkers for MDD

To the best of our knowledge, this is the first study to combine cortical thickness, sulcal depth, and curvature features in order to classify MDD vs HC. Furthermore, previous ML studies with large samples only incorporated low-resolution atlas-based thickness characteristics. In our approach, we analyzed vertex-wise information, providing a richer and more detailed description of brain characteristics than atlas-derived regional measures. Even so, the integration of complementary cortical characteristics did not lead to higher classification performance than that obtained from single cortical features, regardless of the data splitting strategy and the classification model. In Splitting by Site, no feature yielded an accuracy substantially higher than chance, indicating the failure of both models to capture MDD-specific alterations. Furthermore, the analysis of finer-grained cortical maps, even for thickness alone, did not result in higher classification performance compared to the ML performance levels observed in our previous study [20]. Thus, the assumption that higher resolution would lead to greater classification performance did not hold in this study, as all results were close to chance level, in line with previous attempts at classifying MDD [14, 15, 20]. Furthermore, stratification of the sample according to demographic (sex) and clinical characteristics (age of onset, antidepressant use, and number of depressive episodes) did not yield better differentiability between HC and MDD, in line with our previous study [20]. This new evidence suggests the absence of prominent gray matter alterations that alone may serve as a diagnostic tool in patients with MDD. Combining features such as sulcal depth, curvature, and thickness in vertex-wise, unfolded cortical maps, and including them within a deep learning framework, is highly original. 
It advances prior work [24, 25, 27,28,29, 31] by integrating these complementary morphometric dimensions in a way few studies have, potentially clarifying whether these combined metrics can yield robust, clinically actionable biomarkers.

Although we combined complementary characteristics in the analysis, the interaction between thickness and shape was not addressed here. According to recent evidence, local cortical shape may correlate with thickness [51]. So, combined thickness-shape patterns should be further explored for the classification of MDD. Furthermore, reduced myelination was associated with MDD [52,53,54], which could lead to structural reorganization of cortical features, making it a potential classification aspect to be investigated. In addition, subcortical morphological characteristics may improve the classification by taking into account structural modifications in cortico-subcortical loops associated with MDD [8].

Integration of morphological characteristics with cytoarchitectonic and functional information may allow better contextualization of MDD-related alterations, as demonstrated in a transdiagnostic study [55], with the potential to achieve higher classification performance [56, 57]. Brain topology can be described via the connectome, the whole-brain connectivity architecture. As nodes of the brain connectome exhibit elevated susceptibility to brain disorders [58], graph analytical approaches could also lead to stronger differentiability between MDD and HC. Moreover, subject-specific parcellation schemes could be applied to compute structural and functional connectomes [59], which could be further analyzed by suitably sophisticated classification models that take the neural architecture into account, e.g., graph neural networks [60].

Data splitting and site effect

Several multi-site psychiatric neuroimaging studies directly demonstrated how different splitting strategies might introduce unwanted biases and inflate classification performance [20, 36, 61]. In Splitting by Age/Sex, trained models are unbiased regarding demographic factors, while in Splitting by Site the site affiliation is controlled, which allowed us to address the generalizability of the models to unseen sites. Similar to the results from our previous study [20], the classification performance of both SVM and DenseNet was higher in Splitting by Age/Sex, up to 58%, compared to Splitting by Site, where it was close to random chance. This discrepancy indicates the existence of hidden site-related biases influencing classification performance. As this nuisance-based phenomenon appears in multi-site mega-analyses [36, 62], we strongly encourage the application of different splitting strategies in future multi-site ML studies to better understand it.

The low accuracy of both models in the Splitting by Site strategy is due either to a strong site effect, hindering the ability of the models to capture diagnosis-related differences, or to a general inability of both models to find meaningful alterations associated with MDD. To address the former, we applied ComBat. As ComBat has never been applied to vertex-wise cortical projections, we visually inspected its effect on a single pixel for every feature type (Supplementary Fig. 4). The application of ComBat resulted in a more homogeneous value distribution across cohorts, in line with previous studies analyzing its effects on atlas-based features [17, 20]. Nevertheless, this harmonization step did not improve accuracies. Thus, the possibility remains that subject-level prediction based on cortical features is unfeasible. Alternatively, while demographic covariates were preserved, ComBat may over-correct the data [63], removing part of the MDD-related associations along with the site effect. Given this, more careful consideration of the site effect is required in future studies.

In Splitting by Age/Sex, the balanced accuracy of both models dropped when ComBat was applied (SVM: from 55% to 48%; DenseNet: from 58% to 56%). The decrease in the models’ performance toward the levels observed in Splitting by Site indicates that the initially higher classification performance was most likely driven by site-related biases. To further validate this assumption, we performed the classification with a balanced ratio between HC and MDD in every site in Splitting by Age/Sex, which resulted in accuracies close to random chance for both DenseNet and SVM. Notably, DenseNet was less affected by the application of ComBat in the original analysis, reflecting potential non-linear site-related differences that remained in the dataset after harmonization, which is in line with previous findings [64]. Therefore, we recommend that ComBat be applied only in combination with linear models, such as SVM, while more sophisticated models should directly incorporate site information as an additional input.

SVM vs DenseNet

Previous ML mega-analyses of structural MDD vs HC classification considered only shallow linear and non-linear ML models, such as SVM, penalized logistic regression and decision trees [14, 15, 20]. In this study, we extended the diagnostic classification approach by comparing the performance of a shallow linear model, an SVM with a linear kernel, with that of a highly non-linear deep DenseNet classifier applied to vertex-wise cortical information. The exploratory results of sex classification in HC revealed higher classification performance of DenseNet compared to the SVM for all data modalities (Supplementary Fig. 2). The higher accuracy suggests that DenseNet was able to capture non-linear sex dependencies present in the cortical maps. The superiority of DenseNet over SVM in the sex classification task is in line with a previous study conducted on the same vertex-wise cortical maps [26]. Conversely, another large-sample study revealed no advantage of deep architectures over simpler models in predicting demographic factors [37]; therefore, further tests in even bigger samples are required. Nevertheless, in this study both models exhibited a similar range of accuracies, close to random chance, for the main task of MDD versus HC classification. Therefore, the application of DenseNet did not yield the expected improvement in detecting combined (or separate) structural cortical features that discriminate patients from controls.

The similar performance of the linear SVM and the non-linear DenseNet model may be due to the absence of non-linear interactions between cortical regions that are relevant for MDD detection. Furthermore, the analyzed sample is highly heterogeneous in terms of demographic and clinical covariates, which potentially interferes with the main task and lowers classification performance. In this vein, there are several possible directions for improving DenseNet performance. First, the considered model was pre-trained only on natural images from ImageNet. The model could be further pre-trained on cortical projections from an independent large sample using an intermediate task, for example predicting sex, as was done in Gao’s study [26]. Furthermore, one could use more than one intermediate task to optimize the weights of the neural network, for example, predicting demographic or clinical covariates. This approach is broadly known as multi-task learning [65], the usefulness of which in the neuroimaging domain has already been demonstrated [19, 35].

Second, the multi-task approach could be used to “unlearn” undesired biases. In our analysis, site-related differences were removed via ComBat. Instead, one could train the network to perform the main task while unlearning the scanner parameters, as was successfully demonstrated by Dinsdale and colleagues [66]. Furthermore, one could replace the residualization step in the same manner by making the network unlearn age and sex dependencies. In line with our previous analysis, we linearly regressed out age and sex dependencies from the cortical features using a normative approach [20]. Considering the higher performance of the DenseNet model in predicting sex, we can speculate that non-linear male-female differences in cortical morphology are present. Thus, unlearning age- and sex-related dependencies could improve classification performance.
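For reference, the linear residualization step can be sketched as follows. The sketch assumes that the normative approach amounts to fitting the age and sex regression in HC only and then applying it to all subjects. This reading, the helper names, and the toy values are assumptions made for illustration and are not necessarily the exact procedure of [20].

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_age_sex_model(feature, age, sex):
    """Ordinary least squares for feature ~ intercept + age + sex."""
    design = [[1.0, a, s] for a, s in zip(age, sex)]
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, feature)) for i in range(3)]
    return solve_linear_system(xtx, xty)

def residualize(feature, age, sex, coef):
    """Subtract the fitted age and sex effects from each observation."""
    return [y - (coef[0] + coef[1] * a + coef[2] * s)
            for y, a, s in zip(feature, age, sex)]

# Invented toy data: a purely linear age/sex effect, plus a -0.05 shift in MDD.
hc_age, hc_sex = [20, 30, 40, 50], [0, 1, 0, 1]
hc_thick = [3.0 - 0.01 * a + 0.1 * s for a, s in zip(hc_age, hc_sex)]
mdd_age, mdd_sex = [25, 45], [1, 0]
mdd_thick = [3.0 - 0.01 * a + 0.1 * s - 0.05 for a, s in zip(mdd_age, mdd_sex)]

coef = fit_age_sex_model(hc_thick, hc_age, hc_sex)  # fitted on HC only (assumed)
hc_resid = residualize(hc_thick, hc_age, hc_sex, coef)
mdd_resid = residualize(mdd_thick, mdd_age, mdd_sex, coef)
```

Because the toy data are exactly linear, the HC residuals are approximately zero and the MDD residuals retain the simulated shift of about -0.05. Any non-linear age or sex dependence would remain in the residuals, which is the motivation for the unlearning alternative discussed above.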

Further strengths and limitations

Here we were interested in whether a pre-trained deep learning model, specifically DenseNet, could effectively distinguish patients with MDD from healthy controls using finer-grained, unfolded cortical surface maps, and whether such information, when combined, could offer complementary classificatory value compared to previously examined features. This approach extends the methodology of our prior study [20], where we employed more conventional structural MRI-derived features such as cortical thickness, surface area, and subcortical volumes from whole-brain regions-of-interest (ROIs). Our current approach is original and methodologically relevant, particularly in light of the increasing interest in surface-based neuroimaging analyses that go beyond standard ROI measures. By employing more detailed cortical maps from different sources, such as sulcal depth and curvature, and projecting them into unfolded 2D space, we sought to assess whether such refinements in cortical representation could provide additional or differentially informative patterns for diagnostic classification. While the classification performance did not surpass that of previous studies, this negative finding is itself valuable, as it helps to delineate the boundaries of what these finer-grained representations currently offer in this domain.

Although we did not apply dimensionality reduction in the present analysis, we acknowledge that this remains a promising avenue for future research. Prior work has employed PCA, spherical harmonic decomposition [67], surface eigenmodes [68], cortical gradients [69], and deep generative models such as variational autoencoders [70] to reduce feature dimensionality while preserving meaningful structure. The potential impact of these dimensionality reduction approaches on classification performance should be explored in dedicated follow-up studies.

A potential limitation of this study is the absence of modeling based on MDD subtypes. While studies have proposed various subtyping schemes to address the clinical and biological heterogeneity of MDD, there is a wide range of subtyping approaches and most were derived from small samples (e.g., [71]), with limited replication or independent validation. For this reason, we intentionally chose not to include a subtyping step. This decision avoids reliance on uncertain stratification and reflects a key strength of the approach: classification performance could have direct clinical applicability, independent of MDD subtype definitions. Nonetheless, we acknowledge that the presence of unmodeled heterogeneity within the MDD group may have contributed to the lack of discriminative performance observed. Another important limitation is the lack of detailed ethnic and genetic information across the full sample. Sociocultural and genetic diversity are known to influence both brain morphology and disease expression, and their absence may affect the generalizability of the findings. These remain open challenges for future research aiming to enhance the specificity and robustness of neuroimaging-based classifiers for MDD. In particular, large-scale studies incorporating diverse populations and robust subtyping frameworks may offer insights with broader international applicability.

Conclusion

In this study, we tested whether more advanced classification algorithms applied to high-resolution morphometric shape characteristics can improve MDD vs HC classification. Splitting the data according to demographic variables and according to site allowed a comprehensive analysis of the models' performance and biases. We detected site effects, which we addressed at least partially with the ComBat harmonization tool, but this did not increase classification metrics. Both shallow and deep ML models exhibited low, close-to-chance accuracies. Most importantly, the integration of high-resolution vertex-wise cortical thickness and shape features did not lead to greater classification performance than previously analyzed atlas-based cortical features. According to our results, it seems unlikely that structural MRI alone will provide diagnostic biomarkers of MDD. Thus, further investigation is required into classification performance based on the fusion of other MRI modalities, including fMRI and DWI.