Introduction

Eating disorders (EDs) are serious psychiatric disorders influenced by a complex interplay of biological, psychological, and environmental factors, leading to severe disturbances in eating behaviors and related thoughts and emotions1,2,3,4,5. The physical manifestations of EDs are multiple, leading to three primary diagnostic groups6,7. Anorexia nervosa (AN) is characterized by a distorted body image with an intense fear of gaining weight and by weight loss behaviors (including restrictive eating)1,8 leading to substantially low body weight. Bulimia nervosa (BN) is characterized by recurrent episodes of binge-eating (overeating with loss of control) and purging behaviors to control weight1,8, while binge eating disorder (BED) primarily involves frequent binge-eating episodes but without regular compensatory behaviors1,8. Individuals with disturbed eating patterns who do not fit the criteria for AN, BN, or BED might be diagnosed with other specified or unspecified feeding or EDs according to the fifth edition of the DSM (Diagnostic and Statistical Manual of Mental Disorders)3. Irrespective of the type of diagnosis, EDs have been associated with chronic health issues, co-occurring physical and psychiatric comorbidities, as well as increased suicidality and mortality risks, rendering them an essential field for prevention and intervention4,9,10,11,12,13,14,15,16.

According to Danish nationwide registry-based studies, approximately 50% of all EDs are diagnosed before the age of 19 years17,18, with the peak age at diagnosis falling between 15 and 22 years, depending on sub-type15,19. A Canadian cohort study investigating trajectories of ED symptoms found the most significant increase between the ages of 12 and 15 years20. A similar study based on US survey data also observed that EDs typically begin in adolescence16. The onset of such disorders in the adolescent years can negatively impact young individuals’ social relations, as well as their educational success and, consequently, their long-term professional development21,22,23. Nevertheless, the identification of ED cases remains a challenge7,24,25, and there are severe delays in seeking and receiving treatment26,27: individuals take an average of 3.6 years after symptom onset to recognize that they have an ED, and 4.2 years from onset to receiving treatment28. Psychological factors, restricted access to healthcare, the stigma surrounding mental health, dietary behaviors, perceptions of body image, and cultural standards regarding appearance could all potentially obstruct the early detection of EDs and the ability to access treatment28.

Considering the substantial prevalence of EDs and their individual and societal health impacts, coupled with the obstacles to accessing treatment, it is critically important to develop and investigate new applicable methods and tools for the early detection of individuals at elevated risk for EDs29,30. In the current data-dense era of high interest in artificial intelligence (AI), prominent tools could be prediction models utilizing classic statistical or machine learning methodologies, which can provide individualized risks of having or developing a specific ED at a particular time point31,32. The deployment of such validated prediction models in a clinical setting could further enhance the quality of selective prevention programs, which are known to be more efficient than universal ones for reducing eating pathologies33. Nevertheless, previous work on ED prediction consists of a few non-generalizable models trained on small samples, most of them relying on cross-sectional study designs and thereby missing the participants’ follow-up or prognosis29,34. The main objective of this study was to develop and validate diagnostic and prognostic models predicting eating disorders (AN, BN, BED) in adolescents, utilizing a comprehensive dataset of approximately 100 possible predictors. Using self-reported data from the Danish National Birth Cohort (DNBC)35,36 linked with national registry information, we developed models to identify the presence of EDs by the 11-year follow-up, i.e., DNBC-11 (diagnostic task), and by the 18-year follow-up, i.e., DNBC-18 (prognostic task).

Methods

Source of data

The study is based on 96,822 live-born children in Denmark, born between 1996 and 2003, who participated in the Danish National Birth Cohort (DNBC) and were followed from in utero through childhood and into young adulthood. The data were collected across seven waves, the first two being interviews during pregnancy. The remaining five waves collected data at child ages 6 and 18 months and 7, 11, and 18 years. Data from the DNBC are self-reported (by mother and/or child) and relate to the child’s life and early exposures. More details on the cohort are available elsewhere (www.dnbc.dk)35,36.

Further information on the children was obtained through the nationwide registries, since every Danish citizen is linked to these databases via their unique person identifier37. We extracted data on the parents’ education status via the Population Education Register, income through the Income Register, urbanicity from the Population Register, and employment status from the Integrated Database for Labor Market Affiliation37,38,39,40. We additionally extracted data related to various childhood adversities covering material deprivation (family poverty, long-term unemployment), loss or threat of loss (death of a parent, death of a sibling, parent somatic illness, sibling somatic illness), and family dynamics (foster care, parental and sibling psychiatric illness, parental alcohol and drug abuse, maternal separation) via a combination of registers, with slight adaptations to the definitions used in the DANLIFE cohort41. Specifically, the Danish National Prescription Registry, the Danish National Patient Register, the Danish Psychiatric Central Research Register, and the Register of Support for Children and Adolescents were used to extract the adversities above42,43,44. All registered diagnoses are defined based on the 8th and 10th revisions of the International Classification of Diseases (ICD-8 and ICD-10) used in Denmark. Both primary and secondary diagnoses are included. From the self-reported data of the DNBC cohort, we extracted information on the child’s stress and emotional and behavioral difficulties, assessed using the Stress in Children (SiC) self-assessed questionnaire and the Strengths and Difficulties Questionnaire (SDQ) (https://www.sdqinfo.org/)45,46.

Study population

This longitudinal cohort study uses data from the DNBC. The overall study population for the diagnostic task (predicting the risk of having EDs at DNBC-11) consisted of 44,357 adolescents, predominantly between the ages of 11 and 12 years at questionnaire completion. For the prognostic task (predicting the risk of having EDs at DNBC-18), the corresponding size was further reduced to 26,127 after excluding those who did not respond to or complete the DNBC-18 questionnaire.

Predictors

We used a variety of potential ED predictors based on domain knowledge and availability in both the Danish registers and the DNBC self-reported data. Given the complex, multifactorial etiology of EDs without a clearly established set of risk factors, we included variables spanning multiple domains. From the registers, we included variables covering childhood adversities from the date of birth up to the DNBC-11 questionnaire completion, family characteristics (maternal and paternal age at birth, parity, parental education at DNBC-11, urbanicity), and various disorders recorded in the Danish National Patient Register (feeding disorders, psychiatric disorders other than EDs, autoimmune and autoinflammatory conditions). We also included self-reported and mother-reported variables regarding the child’s life, social relations, stress status, behavioral patterns, activity levels, dietary habits, parental characteristics, prenatal exposures, and self-reported height and weight, measured at the 7- and 11-year follow-ups of the DNBC (waves 5–6). We further used six scales of the SDQ behavioral screening questionnaire: five sub-scales (emotional symptoms, conduct problems, hyperactivity/inattention, peer relationship problems, prosocial behavior) and an overall impact scale, each reported separately at DNBC-11 by the parent and the child. We extracted further information on physical, psychological, and behavioral responses to stress from the SiC questionnaire and selected items from the DAWBA (Development and Well-Being Assessment). We ended up with 97 possible predictors for the diagnostic task and 99 for the prognostic task. The two additional predictors in the prognostic task were the presence/absence of an official ED diagnosis between age 6 and the DNBC-11 assessment, and the BMI measured at DNBC-11.
BMI at DNBC-11 was excluded from the diagnostic task to prevent circularity, since BMI forms part of the self-reported ED outcome definition at DNBC-11, but was included in the prognostic task as it does not interfere with the DNBC-18 outcome assessment. This list of predictors allowed us to capture diverse factors that might influence ED risk, from family background and early life experiences to immediate psychological and social determinants. Missing values in categorical variables were assigned to a separate “Unknown” category, whereas missing values in numeric variables were replaced by the median computed from the training set. A complete list of the predictors used in the prediction models, along with the ICD-8 and ICD-10 codes used for specific diseases, can be found in Supplementary Table 1.
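The missing-data handling described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the column names (`bmi_7y`, `urbanicity`), and the toy rows are hypothetical. The point it shows is that medians are learned from the training rows only and then reused unchanged on the testing rows.

```python
from statistics import median

def fit_imputer(train_rows, numeric_cols):
    """Learn one median per numeric column from the training rows only."""
    medians = {}
    for col in numeric_cols:
        observed = [r[col] for r in train_rows if r[col] is not None]
        medians[col] = median(observed)
    return medians

def apply_imputer(rows, medians, categorical_cols):
    """Fill numeric gaps with training medians; label missing categories 'Unknown'."""
    filled_rows = []
    for r in rows:
        filled = dict(r)  # copy so the input rows are left untouched
        for col, med in medians.items():
            if filled[col] is None:
                filled[col] = med
        for col in categorical_cols:
            if filled[col] is None:
                filled[col] = "Unknown"
        filled_rows.append(filled)
    return filled_rows

# Hypothetical toy data; None marks a missing value.
train = [
    {"bmi_7y": 15.0, "urbanicity": "capital"},
    {"bmi_7y": None, "urbanicity": None},
    {"bmi_7y": 17.0, "urbanicity": "rural"},
]
test = [{"bmi_7y": None, "urbanicity": None}]

medians = fit_imputer(train, ["bmi_7y"])
test_filled = apply_imputer(test, medians, ["urbanicity"])
```

Fitting the imputer on the training set alone keeps information from the testing set out of model development, which is the reason the text specifies "training set" medians.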

Outcome

We constructed a composite outcome of EDs combining two distinct components: (1) formal diagnoses extracted from Danish health registers using established ICD-10 diagnostic codes (AN: F500; atypical AN: F501; BN: F502; atypical BN: F503; EDNOS: F50, F508, F509), and (2) algorithm-derived classifications based on DNBC questionnaire responses using the DSM-5 diagnostic criteria as standard to define symptoms fulfilling threshold criteria for EDs. For classifications at DNBC-11, which relied on BMI (such as for AN and BN), developmental changes in adolescents were accounted for by using appropriate age- and sex-specific BMI cut-offs from the International Obesity Task Force (IOTF). Standard adult cut-offs were used for the DNBC-18 follow-up. In the diagnostic task, we classified a participant as “ED-Positive” if that person was either officially diagnosed with an ED in the registers between age 6 and their DNBC-11 assessment, or met threshold or sub-threshold ED criteria (AN, BN, BED) based on DNBC-11 questionnaire responses. For the prognostic task, we included all participants irrespective of their ED status at DNBC-11. We classified these participants as “ED-Positive” if they were either officially diagnosed with an ED in the registers between their DNBC-11 and DNBC-18 assessments, or met threshold ED criteria based on DNBC-18 questionnaire responses. We chose this approach because it reflects real-world screening scenarios where the model would be applied to unselected populations that include individuals with varying baseline ED statuses. Thus, our composite outcome is intentionally designed to maximize case identification for a clinical screening tool, rather than estimate a pure epidemiological prevalence. A detailed description of how the algorithm defines threshold and sub-threshold EDs based on questionnaire responses can be found in Supplementary Table 2.
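The composite outcome rule can be summarized in a few lines of Python. This is a hedged sketch of the logic as described in the paragraph above, not the study's algorithm: the function name and the string labels are hypothetical, and the questionnaire classification itself (Supplementary Table 2) is taken as a given input.

```python
def ed_positive(task, register_diagnosis, questionnaire_status):
    """Return True if a participant counts as ED-Positive for the given task.

    task: "diagnostic" (outcome at DNBC-11) or "prognostic" (outcome at DNBC-18)
    register_diagnosis: True if an ED diagnosis appears in the registers
        within the task's time window
    questionnaire_status: "none", "sub-threshold", or "threshold"
    """
    if task == "diagnostic":
        # Primary diagnostic outcome: threshold or sub-threshold criteria count.
        qualifying = {"threshold", "sub-threshold"}
    elif task == "prognostic":
        # Primary prognostic outcome: only threshold criteria count.
        qualifying = {"threshold"}
    else:
        raise ValueError("task must be 'diagnostic' or 'prognostic'")
    return register_diagnosis or questionnaire_status in qualifying
```

For example, a participant with sub-threshold symptoms and no register diagnosis is positive in the diagnostic task but negative in the primary prognostic task.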

Model development and selection

We split the original dataset into 75% for training and the remaining 25% for testing, using stratified sampling on the outcome variable, i.e., keeping the distribution of the two outcome levels similar across the two splits. We used this data-splitting strategy to internally validate the developed models’ conditional predictive performance in a new, unseen set of individuals. For both prediction tasks, we initially trained an extreme gradient-boosting machine learning model (XGBoost) that uses all available variables and their interactions to predict the outcomes of interest. XGBoost relies on the boosting ensemble technique, a process in which decision trees are added sequentially to the ensemble to minimize the residuals (errors) of the predictions produced by the previously fitted decision trees47.
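A stratified 75/25 split can be sketched as follows. This is an illustrative Python version under stated assumptions (the function name, the seed, and the toy labels are hypothetical), not the procedure used in the study's R workflow. Each outcome class is shuffled and split separately so that the share of positives is preserved in both parts.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test while preserving the class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy outcome vector with 4% positives (hypothetical numbers).
labels = [1] * 40 + [0] * 960
train_idx, test_idx = stratified_split(labels)
```

With a rare outcome, a purely random split could by chance leave very few positives in the testing set; stratifying avoids that.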

The XGBoost algorithm has several hyperparameters, namely the number and maximum depth of trees, the learning rate, the number of randomly selected predictors used to perform the splits, the minimum size of nodes, the number of early stopping iterations, and the weights assigned to the “ED-Positive” and “ED-Negative” classes to account for imbalanced data (few “ED-Positive” cases). We selected the optimal hyperparameters using 5-fold cross-validation (CV) on the training data, with the Brier Score as the evaluation metric. The Brier Score is a proper scoring rule that measures the average squared distance of the model’s predictions from the observed outcome, with lower values indicating a better model48. This optimization process used a grid of 30 candidate hyperparameter configurations. We employed a space-filling approach based on maximum entropy, distributing the parameter values across the space with minimal overlap or redundancy49. Using the selected hyperparameters, we finally fitted the XGBoost models in the training set and evaluated their predictive performance in the testing set.
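The Brier Score used for tuning is simple enough to compute by hand. The sketch below is a minimal Python illustration with hypothetical toy predictions; it shows the definition given above (mean squared difference between predicted risk and the 0/1 outcome) and nothing about the study's actual models.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risks and observed 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy example: one positive among four individuals (hypothetical values).
y_true = [1, 0, 0, 0]
sharp_predictions = [0.9, 0.1, 0.1, 0.1]       # close to the observed outcomes
prevalence_only = [0.25, 0.25, 0.25, 0.25]     # predicts the prevalence for everyone

score_sharp = brier_score(y_true, sharp_predictions)
score_prevalence = brier_score(y_true, prevalence_only)
```

Here the sharper predictions score about 0.01, whereas predicting the prevalence for everyone scores 0.1875, so the lower value identifies the better model.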

The large number of predictors used in the XGBoost model (“ML model”) could make its implementation time-consuming and restrictive in a real-world clinical setting. Therefore, we introduced a feature selection step by screening for the top 10 most important predictors of the XGBoost model in the training process using SHapley Additive exPlanations (SHAP) values, a feature attribution method50. Higher absolute SHAP values for a predictor indicate higher contributions of that predictor to the overall risk of having or developing EDs. The highest-ranked predictors were then entered into a logistic regression model (“Reduced model”), with restricted cubic splines with 5 knots for numeric variables, and evaluated using the same testing data as the XGBoost model. The knots were placed at specific quantiles of each variable’s distribution (5th, 27.5th, 50th, 72.5th, and 95th percentiles), with the algorithm automatically reducing the number of knots when variables had insufficient unique values to support 5 knots51. We further introduced two additional benchmark models (“Simple” and “Single” models), using only the top 2 and the top 1 most important predictors based on the SHAP values, for model comparison purposes.
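The SHAP-based screening step amounts to ranking predictors by their mean absolute SHAP value and keeping the top k. The Python sketch below assumes the per-individual SHAP values have already been computed by some explainer; the function name, feature names, and numbers are hypothetical toy inputs, not values from the study.

```python
def top_k_by_mean_abs_shap(shap_rows, feature_names, k):
    """Rank features by mean |SHAP| over individuals and return the top k.

    shap_rows: one list of SHAP values per individual, ordered like feature_names.
    """
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(mean_abs, key=mean_abs.get, reverse=True)
    return ranked[:k], mean_abs

# Hypothetical SHAP values (log-odds scale) for three individuals.
features = ["body_satisfaction", "sex", "sleep"]
shap_rows = [
    [0.40, -0.10, 0.02],
    [-0.30, 0.20, -0.01],
    [0.50, -0.15, 0.03],
]
top2, mean_abs = top_k_by_mean_abs_shap(shap_rows, features, k=2)
```

Taking absolute values before averaging matters: a predictor that pushes risk up for some individuals and down for others would otherwise average out to near zero despite being influential.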

Extended analyses

We performed sensitivity analyses for both the diagnostic and prognostic tasks by expanding the definition of the target outcome. Specifically, for the diagnostic task, we also classified participants with self-reported disordered eating behaviors (DEBs) at DNBC-11 as “ED-Positive.” For the prognostic task (the risk of developing EDs by DNBC-18), we also considered participants with sub-threshold EDs as part of the positive class of the composite outcome. The workflow and the performance metrics used to evaluate the models were the same as in the primary analysis. To assess the performance of our clinically applicable model beyond a single train-test partition, we estimated the potential clinical benefit of the logistic regression model using 5-fold CV based on all available data for each prediction task.

We further examined the association between the prevalence (for categorical variables) or spread (for continuous ones) of the predictors in both the diagnostic and prognostic sets and their respective predictive performance, as measured by the AUC. To that end, we computed the normalized (scale 0–1) informativeness of each predictor. We defined informativeness as the percentage of the minority category for binary variables (categorical variables were first binarized through one-hot encoding) and as the coefficient of variation (standard deviation divided by the mean) for continuous ones. We then computed each predictor’s univariate AUC against the ED outcome and defined high-performing predictors as those in the top 25% of univariate AUC values. As a last step, we used logistic regression to evaluate the association between tertiles of informativeness and high predictive performance.
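The quantities in this analysis can be sketched in Python as follows. Several details are assumptions made for illustration only, because the text does not specify them: min-max scaling as the 0–1 normalization, the population standard deviation in the coefficient of variation, and the pairwise (Mann-Whitney) form of the AUC. All function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def informativeness_binary(values):
    """Share of the minority category of a 0/1 variable."""
    p = sum(values) / len(values)
    return min(p, 1 - p)

def informativeness_continuous(values):
    """Coefficient of variation (population SD / mean; the SD type is an assumption)."""
    return pstdev(values) / mean(values)

def min_max(scores):
    """Rescale scores to 0-1 (the normalization method is an assumption)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy inputs.
rare_flag = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bmi_like = [2, 4, 4, 4, 5, 5, 7, 9]

info_flag = informativeness_binary(rare_flag)
info_bmi = informativeness_continuous(bmi_like)
toy_auc = auc([1, 1, 0, 0], [0.8, 0.4, 0.5, 0.1])
```

In the toy pairs, three of the four positive-negative comparisons are ranked correctly, giving an AUC of 0.75.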

Performance metrics

We assessed the predictive performance of the models with respect to their discrimination ability, calibration, and net benefit from the decision curve analysis. The area under the receiver operating characteristic curve (AUC) was the metric reported for discrimination. The AUC ranges from 0 to 1 (reported on a 0–100 scale in the Results) and is the probability that the model assigns a higher risk to a randomly selected “ED-Positive” individual than to a randomly selected “ED-Negative” one. The calibration of the model reflects how well the predicted risks agree with the observed outcome frequencies. For example, if a model is well-calibrated, we expect 30% of individuals with a predicted risk of 0.3 to have the outcome. We evaluated the calibration of the models via calibration plots and by computing the Brier Score, which ranges from 0 to 1 and reflects both calibration and discrimination. The net benefit is a metric stemming from the decision curve analysis, assessing the potential clinical impact of a model52,53,54. It is defined as the proportion of individuals correctly classified as “ED-Positive” (True Positives) minus the proportion of individuals incorrectly classified as “ED-Positive” (False Positives), the latter weighted by the odds of the risk threshold, which reflect the relative harms and benefits of the two classifications. The Net Benefit is computed based on the predicted risks from a statistical or machine learning model and compared against default clinical strategies, i.e., intervening on all individuals or not intervening at all. It is calculated for the whole range of possible thresholds to avoid basing our decision on a single threshold (for example, intervening on individuals when their risk is above 0.2). Hence, the Net Benefit is interpreted in units of net True Positives, i.e., how many cases (True Positives) a specific strategy or model identifies, net of unnecessary interventions (False Positives), at a particular threshold.
Stated differently, whenever the threshold is, for example, chosen to be 0.1, the clinician is willing to accept up to 9 unnecessary ED interventions (False Positives) per true positive, meaning that the harm of a false positive is 9 times smaller than the benefit of a true positive. Consequently, higher risk thresholds would be considered for interventions that are likely to be harmful or deemed to be intrusive with potential side effects, with lower thresholds being more relevant for less-harmful interventions such as counseling.
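The Net Benefit calculation can be made concrete with a short Python sketch. It follows the definition given above (true-positive proportion minus false-positive proportion weighted by the threshold odds); the function names and toy data are hypothetical, and treating a predicted risk equal to the threshold as "intervene" is a convention assumed here.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net Benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1 - threshold)  # e.g. 0.1 / 0.9 = 1/9
    return tp / n - (fp / n) * odds

def net_benefit_treat_all(y_true, threshold):
    """Net Benefit of the default strategy of intervening on all individuals."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Toy data: 2 positives among 10 individuals (hypothetical risks).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.30, 0.12, 0.20, 0.05, 0.02, 0.02, 0.01, 0.01, 0.01, 0.01]

nb_model = net_benefit(y_true, y_prob, 0.1)      # 2 TP, 1 FP at threshold 0.1
nb_all = net_benefit_treat_all(y_true, 0.1)
nb_none = 0.0                                    # intervening on no one
```

At a threshold of 0.1 each false positive costs one ninth of a true positive, so the toy model's two true positives and one false positive give a Net Benefit of about 0.19, compared with about 0.11 for intervening on everyone and 0 for intervening on no one. Repeating the calculation over a range of thresholds yields the decision curve.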

All the metrics mentioned above are based on predictions from the testing set to avoid reporting an over-optimistic performance.

Software

We used R statistical software 4.3.2 (https://www.r-project.org/) to preprocess and analyze the current study’s data.

Ethical statement

This study was performed in accordance with the Declaration of Helsinki. Approval of the study was obtained from the Danish Data Protection Agency through the joint notification of The Faculty of Health and Medical Sciences at The University of Copenhagen (SUND-2017-09) and the DNBC Steering Committee (2017-24). Written informed consent was collected from the mothers, who enrolled themselves and their unborn child in the DNBC during pregnancy in 1996–2002. The participants born into the DNBC were informed about their participation, what it involved, its implications, their rights, and how to opt out when they came of age, i.e., turned 18.

Results

During the 11-year follow-up of the DNBC cohort (Fig. 1), the sample included 44,357 participants, with 33,267 allocated to the training set and 11,090 to the testing set. A total of 475 out of 33,267 (1.4%) and 157 out of 11,090 (1.4%) individuals in the training and testing sets, respectively, were classified as having diagnosed EDs or having symptoms in accordance with threshold or sub-threshold ED (composite diagnostic outcome). Regarding age distribution, 81.0% of participants were 11 years old at questionnaire completion, 14.7% were 12 years old, and 2.8% were of other ages. The sample was evenly distributed by sex (51.1% female, 48.9% male), with 60.2% of the adolescents experiencing parental separations, 25.6% experiencing poverty periods, and 18.2% experiencing parental unemployment up to the 11-year follow-up. Parental psychiatric disorders were present in 13.1% of participants (one parent affected). For the prognostic task, the sample was reduced to 26,127 participants who completed both 11-year and 18-year follow-ups, with 19,595 in the training set and 6532 in the testing set. There were 698 out of 19,595 (3.6%) in the training and 216 out of 6532 (3.3%) participants in the testing set with either a diagnosed or a threshold ED (composite prognostic outcome). The distribution of the predictors, the primary outcome, and their missingness across the data are reported in Supplementary Tables 3 and 4 for the diagnostic and prognostic tasks.

Fig. 1: Participant flowchart.

Flowchart of the DNBC (Danish National Birth Cohort) study sample data used to develop the predictive models for eating disorders (1996–2003, follow-up of 18 years).

A more detailed overview of the various threshold EDs (AN, BN, BED) and of other disordered eating behaviors (DEBs), purging, and sub-threshold EDs across the two datasets (diagnostic and prognostic) is illustrated in Fig. 2. DEBs were the most frequent category at both follow-ups (2612 individuals out of 44,357 at the 11-year and 2309 individuals out of 26,127 at the 18-year follow-up). We also observed a high prevalence of BED and its sub-threshold category, both in the early and late adolescent years. There were few participants with an official ED diagnosis, particularly at the 11-year follow-up (24 out of 44,357), although the number was higher at the 18-year follow-up. The component behaviors within the DEB category showed distinct patterns across developmental stages (Supplementary Table 5). At DNBC-11, binge eating was the predominant behavior, occurring in 82.8% of individuals with DEBs, while purging behaviors were relatively rare (6.1%). Fasting behaviors were present in 17.2% of cases, and multiple concurrent behaviors were observed in 5.9% of individuals. By DNBC-18, the behavioral profile of those with DEBs had shifted: binge eating frequency decreased to 42.7%, while fasting behaviors more than doubled to 35.9%. Excessive exercise, which was measured only in DNBC-18, accounted for 25.3%, and the co-occurrence of multiple behaviors increased to 12.3%.

Fig. 2: Barplots of the distribution of eating disorders by the two distinct follow-up periods of the analysis.

The upper plots display the absolute number of cases and relative proportion for a specific eating disorder or disordered eating pattern by the 11-year follow-up of the Danish National Birth Cohort (DNBC) for the 44,357 analyzed individuals (Diagnostic Set). The lower plots display the corresponding numbers observed by the DNBC-18 for the 26,127 analyzed individuals (Prognostic Set). The bold labels on the y-axis display the categories included in the composite outcome of the prediction tasks.

Figure 3 displays the transitions in ED statuses between the two follow-up times. Of the participants who completed the questionnaires for both follow-ups (N = 26,127), 7% had DEBs or EDs by DNBC-11. The majority of these exhibited DEBs (79.47%), with smaller proportions classified as having subthreshold (11.42%) or diagnosed/threshold EDs (9.11%). By DNBC-18, many individuals had moved between these categories. A sizeable proportion of individuals with DEBs by DNBC-11 transitioned to no ED (71.46%), suggesting a degree of remission over time. However, a persistent subset progressed to subthreshold (7.59%) or diagnosed EDs (6.71%), highlighting the potential for escalation in severity. Similarly, individuals classified with subthreshold EDs by DNBC-11 exhibited diverse trajectories; while most transitioned to no disordered eating patterns (73.11%), others progressed to diagnosed/threshold disorders (7.55%). Finally, the majority of those with a diagnosed/threshold disorder by DNBC-11 displayed no ED by DNBC-18 (65.68%), with 13.61% maintaining the diagnosis.

Fig. 3: Alluvial plot of the transitions between eating disorders from the DNBC-11 to DNBC-18 follow-up.

The x-axis displays the two distinct follow-up times (11 years and 18 years). The purple bars represent the proportions of individuals in each category by the DNBC-11 and DNBC-18 follow-up. The width of the flows (orange for disordered eating behaviors, red for subthreshold eating disorder, and blue for diagnosed or threshold eating disorder) connecting the bars illustrates the proportion of individuals transitioning between categories over time, highlighting patterns of persistence and change. Percentages on the y-axis indicate the distribution of individuals across the categories.

Model description and predictive performance

We developed an XGBoost machine learning model (“ML model”) using all available predictors, tuning the hyperparameters through 5-fold CV (Supplementary Table 6). We extracted the top 10 most important predictors of the ML model for each prediction task based on their average absolute SHAP values in the training set (Supplementary Fig. 1). We then developed three logistic regression models for the target composite outcomes using the top 10 (“Reduced model”), top 2 (“Simple model”), and top 1 (“Single model”) highest ranked predictors.

The discriminative ability and overall performance metrics for both prediction tasks are reported in Table 1. For the diagnostic task (predicting ED presence by the 11-year follow-up), the ML model achieved an AUC [95% CI] of 81.3 [78.0, 84.6]. The Reduced model using the top 10 predictors showed similar performance with an AUC of 81.1 [77.9, 84.3], with a non-significant difference of −0.2 [−2.1, 1.8]. The Simple model yielded a significantly lower AUC of 77.9 [74.1, 81.6] (ΔAUC = −3.4 [−6.1, −0.7]), while the Single model showed the largest performance decrease with an AUC of 65.6 [61.2, 69.9] (ΔAUC = −15.7 [−19.8, −11.6]). The Brier scores were comparable between the ML, Reduced, and Simple models, with only the Single model showing a small but significant increase in Brier score (ΔBrier = 0.03 [0.01, 0.05]). For the prognostic task (predicting ED development by 18-year follow-up), the ML model achieved an AUC of 76.9 [74.3, 79.5]. The Reduced model showed a non-significant decrease in performance (ΔAUC = −1.5 [−3.4, 0.4]), while both the Simple and Single models demonstrated significantly lower discriminative ability (ΔAUC = −3.2 [−5.7, −0.7] and −8.1 [−10.3, −6.0], respectively). The Brier scores for the prognostic models showed minimal differences, with only the Single model displaying a small but significant increase compared to the ML model (0.03 [0.01, 0.06]). We report the differences (ΔAUC and ΔBrier) in the performance metrics between models and across prediction tasks in Supplementary Table 7. The ML model provided risks up to 0.25 and was generally well-calibrated, behaving similarly across both prediction tasks and displaying signs of slight overestimation and broader uncertainty with increasing risks. The calibration plots for the diagnostic and prognostic tasks are found in the Supplementary Material (Supplementary Figs. 2 and 3).
We further observed positive associations between predictor informativeness tertiles and high predictive performance as measured by the AUC. For the diagnostic task, compared to the lowest tertile, middle and highest informativeness tertiles showed odds ratios of 2.73 (95% CI: 1.22–6.49) and 3.09 (95% CI: 1.39–7.31), respectively. For the prognostic task, the associations were similar: middle tertile OR 2.87 (95% CI: 1.21–7.39) and highest tertile OR 5.90 (95% CI: 2.60–14.8). We also found that self-reported variables demonstrated significantly higher distributional variability (informativeness) compared to register-based variables in both the diagnostic task (0.23 vs 0.16, difference = 0.07, 95% CI: 0.002–0.14) and prognostic task (0.23 vs 0.15, difference = 0.07, 95% CI: 0.004–0.14). Extending the outcome for the diagnostic and prognostic task to include DEBs and sub-threshold EDs, respectively, led to similar AUC patterns, with the predictive performance of the Reduced model being comparable to the ML one, with the Simple and Single models reaching smaller values (Supplementary Table 8).

Table 1: Prediction performance of the models across prediction tasks in the testing set.

SHAP-values and variable importance

We employed SHAP-values (SHapley Additive exPlanations) to identify the most influential predictors for both diagnostic and prognostic tasks, with higher absolute values indicating greater contribution to ED risk prediction. For the prognostic model (predicting EDs by DNBC-18), being female emerged as the strongest predictor (Fig. 4). The remaining top predictors included: higher emotional symptoms (measuring feelings of worry, unhappiness, and nervousness), lower body satisfaction scores (indicating greater dissatisfaction with physical appearance), higher peer relationship problems (reflecting difficulties in social relationships with other children), lower hyperactivity/inattention scores (measuring restlessness, distractibility, and impulsivity as reported by parents), higher childhood BMI at the 7-year follow-up, lower maternal and paternal BMI values at DNBC-11, higher stress levels from the SiC questionnaire, and higher conduct problems (indicating antisocial behaviors). All of these factors, when present in the specified directions, contributed to increased ED risk by DNBC-18. We provide the partial dependence plots illustrating how the prognostic risk changes across the different predictors in Supplementary Fig. 4.

Fig. 4: SHAP-values beeswarm plot of the top 10 most influential predictors of ML-model for eating disorders by DNBC-18.

The x-axis of the plot displays the SHAP-value in units of log odds, while the y-axis shows the predictors sorted from most important to least important based on their mean absolute SHAP-value. The SHAP-value is represented as a dot point for each individual and reflects the deviation of each individual from the average baseline risk extracted from the training set based on the value of each specific predictor. Higher feature values are colored yellow, whereas lower values are purple. SHAP-values clustered on the right side of the gray vertical line (higher than zero) indicate higher predicted risks (positive deviations from the average risk). In contrast, values on the left (lower than zero) reflect lower predicted risks (negative deviations from the average risk). For illustration, higher values of the child’s BMI at the 7-year follow-up push the predictions of the model further away from the baseline risk and towards the ED-positive class. All predictors are measured at time zero (DNBC-11) unless specified.

For the diagnostic model (identifying current EDs by DNBC-11), lower body satisfaction was the most influential predictor (Fig. 5). Other key predictors included higher emotional symptoms, higher stress levels, higher hyperactivity/inattention scores (as reported by children themselves), frequent obsessive-compulsive disorder symptoms (assessing repetitive behaviors and intrusive thoughts), higher conduct problems and peer relationship problems (both child-reported), having lost contact with a best friend, presence of depressive feelings, and lack of sleep. Sex did not appear among the top 10 predictors for current ED identification, whereas it was the most influential predictor for the presence of ED in late adolescence. The diagnostic model uniquely incorporated acute psychological symptoms such as obsessive-compulsive behaviors, depressive feelings, and sleep disturbances, while the prognostic model incorporated early physical indicators, including childhood BMI and parental BMI values. Hyperactivity/inattention showed opposing patterns between models: lower parent-reported scores increased risk in the prognostic model, while higher child-reported scores increased risk in the diagnostic model.

Fig. 5: SHAP-values beeswarm plot of the top 10 most influential predictors of ML-model for eating disorders by DNBC-11.

The x-axis of the plot displays the SHAP-value in units of log odds, while the y-axis shows the predictors sorted from most important to least important based on their mean absolute SHAP-value. The SHAP-value is represented as a dot point for each individual and reflects the deviation of each individual from the average baseline risk extracted from the training set based on the value of each specific predictor. Higher feature values are colored yellow, whereas lower values are deep blue. SHAP-values clustered on the right side of the gray vertical line (higher than zero) indicate higher predicted risks (positive deviations from the average risk). In contrast, values on the left (lower than zero) reflect lower predicted risks (negative deviations from the average risk). All predictors are measured at time zero (DNBC-11) unless specified.

Potential clinical benefit of models

We assessed the models’ ability to provide additional clinical benefit based on the results of the decision curve analysis. Specifically, we evaluated their net benefit, i.e., how many ED cases each model or strategy can correctly identify (True Positives) on the testing set without unnecessarily intervening on individuals (False Positives) across the continuum of risk thresholds. The analysis results are illustrated in Fig. 6. The use of all four prediction models as a decision-support tool for an intervention exhibited a higher net benefit for thresholds up to 10% for both prediction tasks when compared with a strategy of intervening on every adolescent or no one. The models did not show any added benefit for thresholds above that number.

Fig. 6: Decision curve analysis on the testing data for the risk of eating disorders.

The upper plot displays the net benefit of models and strategies for diagnosing eating disorders (EDs) in adolescents by DNBC-11 (diagnostic task). The lower plot displays the net benefit for the ED risk of adolescents by DNBC-18 (prognostic task). The x-axis (“Threshold probability for intervention”) shows the range of thresholds, i.e., the probability that, when exceeded, adolescents are classified as having a high risk of having EDs by DNBC-11 or DNBC-18. The y-axis (“Net benefit”) shows the smoothed proportion, measured in True Positives (TPs), of accurately diagnosing or identifying adolescents at risk of ED presence after subtracting the weighted (by the odds of the threshold) false positives for every threshold probability. A proportion of 0.05 implies five correctly identified ED cases for every 100 individuals in the target population without any unnecessary interventions. The “Treat All” line represents the scenario of intervening on every individual in the target population, i.e., every child in early adolescence, without using a specific model. The “Treat None” line depicts the scenario of not intervening or screening the target population. The remaining lines display the net benefit extracted from the predictions of each model on the testing set. In general, the model or strategy with the highest decision curve at a given threshold should be preferred.

The differences in the models’ net benefit for the diagnostic task were minimal for the ML, Reduced, and Simple models, all of which had a higher net benefit than the Single model, which used only the top-1 predictor based on the SHAP-values. Specifically, at a risk threshold of 2%, the net benefit of the ML, Reduced, Simple, and Single models was 0.0058, 0.0053, 0.0050, and 0.0028, respectively. Therefore, at the given threshold, the first three models could identify the net equivalent of approximately 50 to 60 ED cases per 10,000 adolescents without any false positives. For reference, the prevalence of the studied outcome in our test data corresponded to 140 cases per 10,000 individuals. For the prognostic task, the ML and Reduced models exhibited a higher net benefit for thresholds above 4% and up to 10% when compared against the Simple or Single models. Namely, at a threshold of 5%, the net benefit of the ML and Reduced models was 0.0070 and 0.0065, respectively, while that of the Simple and Single models was 0.0044 and 0.0021. Hence, the former two models could correctly identify the development of the net equivalent of 70 and 65 ED cases in early adulthood without any false positives from a population of 10,000 adolescents, with the prevalence of the outcome in the test data being 330 cases out of 10,000. We also tested the Reduced model’s robustness by examining its net benefit from predictions extracted from a 5-fold cross-validated split. The decision curves displayed a net benefit similar to the single train/test split (Supplementary Fig. 5).
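The net benefit figures above follow the standard decision curve definition: the proportion of true positives minus the proportion of false positives weighted by the odds of the threshold. A minimal Python sketch of that arithmetic is given below. The true positive and false positive counts are hypothetical and chosen only to illustrate the calculation; the 1.4% prevalence for the "Treat All" line is the diagnostic test-set prevalence reported above.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit = TP/n - (FP/n) * threshold / (1 - threshold)."""
    weight = threshold / (1.0 - threshold)  # odds of the risk threshold
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of intervening on everyone: all cases are TPs, all non-cases FPs."""
    weight = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * weight


def per_10k(nb):
    """Express a net benefit as net true positives per 10,000 individuals."""
    return nb * 10_000


# Hypothetical counts (NOT from the study): 10,000 adolescents screened at a
# 2% threshold, with 100 true positives and 1,960 false positives flagged.
nb_model = net_benefit(tp=100, fp=1960, n=10_000, threshold=0.02)

# "Treat All" at the same threshold, using the reported 140 per 10,000 prevalence.
nb_all = net_benefit_treat_all(prevalence=0.014, threshold=0.02)

print(f"model:     {nb_model:.4f} ({per_10k(nb_model):.0f} per 10,000)")
print(f"treat all: {nb_all:.4f} ({per_10k(nb_all):.0f} per 10,000)")
```

With these inputs, the 1,960 false positives are discounted to the equivalent of 40 cases, leaving a net 60 true positives per 10,000. The "Treat All" value is negative because the 2% threshold exceeds the 1.4% prevalence, so intervening on everyone performs worse than intervening on no one at that threshold.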

Discussion

The current study presents the first large-scale attempt to develop prediction models for EDs in adolescence using comprehensive self-reported and registry-based data34,55. By combining these two types of data sources, models were developed and internally validated, demonstrating potential predictive capacity while also revealing important insights about the relative value of different types of predictors. The comparative analysis of our predictive models clarified the relationship between model complexity and performance. While our machine learning model achieved strong discrimination for both tasks, the reduced logistic regression model using only the top 10 predictors showed equivalent performance, with non-significant differences in AUC for both the diagnostic and prognostic tasks. This finding aligns with previous research in clinical prediction modeling, where simpler models have achieved comparable performance to more complex algorithms. For instance, a previous systematic review of 71 studies showed that machine learning algorithms did not generally outperform logistic regression for clinical prediction tasks56. Here, we proposed a carefully constructed, simpler model using key predictors, offering a favorable balance between accuracy and clinical applicability. Our approach leverages the strengths of both methods, i.e., using machine learning for feature selection to inform simpler logistic regression models, providing an effective strategy for predicting outcomes such as EDs.

The variable importance analysis showed that self-reported measures were the primary predictors at both time points, highlighting the value of structured questionnaire data. Multiple subscales from the SDQ consistently ranked among the most influential predictors, in line with previous work57, with emotional symptoms, peer relationship problems, and conduct problems leading the way for diagnostic and prognostic tasks. The SiC questionnaire score was also important across both prediction tasks, emphasizing the already-known significance of stress assessment in evaluating ED risk58. In contrast, despite their comprehensive nature and objective measurement, registry-based variables did not prominently appear among the top predictors. Our informativeness analysis provided insight into these patterns, revealing that self-reported variables demonstrated higher distributional variability than registry-based variables. The positive dose-response relationship between predictor informativeness tertiles and high predictive performance demonstrated that variables with greater distributional variability were systematically more likely to achieve stronger discrimination, underscoring that predictors must have reasonably high spread or prevalence to contribute substantially to model performance. Additionally, for our prediction tasks, the limited prominence of registry-based variables may reflect that, when detailed self-reported data are available, registry-based information may not provide additional predictive signal beyond what is already captured by detailed psychological and behavioral measures.
These results highlight an important methodological consideration: analyses based solely on effect estimates such as odds ratios cannot appropriately determine whether a variable will function as an effective predictor of EDs, since a rare predictor with a significant and large effect size can potentially influence a subgroup of the target population (here adolescents in Denmark), but have minimal impact on the overall target population if the subgroup’s size is too small59.
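The point about rare predictors can be illustrated with a short calculation. For a single binary predictor, the AUC equals 0.5 + 0.5 × (q1 − q0), where q0 and q1 are the proportions of non-cases and cases carrying the predictor, and q1 follows from q0 and the odds ratio. The Python sketch below compares a rare predictor with a large odds ratio against a common predictor with a modest one. All input values are hypothetical and serve only to illustrate the argument; they are not estimates from the study.

```python
def exposure_among_cases(q0, odds_ratio):
    """Proportion of cases with the predictor, given the proportion among
    non-cases (q0) and the predictor-outcome odds ratio."""
    return odds_ratio * q0 / (1.0 - q0 + odds_ratio * q0)


def auc_binary_predictor(q0, odds_ratio):
    """AUC of a single binary predictor: 0.5 + 0.5 * (q1 - q0)."""
    q1 = exposure_among_cases(q0, odds_ratio)
    return 0.5 + 0.5 * (q1 - q0)


# Hypothetical rare predictor: present in 0.1% of non-cases, odds ratio of 10.
rare_strong = auc_binary_predictor(q0=0.001, odds_ratio=10.0)

# Hypothetical common predictor: present in 30% of non-cases, odds ratio of 2.
common_modest = auc_binary_predictor(q0=0.30, odds_ratio=2.0)

print(f"rare predictor, OR = 10:  AUC = {rare_strong:.4f}")
print(f"common predictor, OR = 2: AUC = {common_modest:.4f}")
```

Under these assumptions, the rare predictor barely moves the AUC away from 0.5 despite its fivefold larger odds ratio, whereas the common predictor discriminates noticeably better, which is the pattern the informativeness analysis describes.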

Concerning the models’ clinical utility, our decision curve analysis showed that these models can be beneficial for interventions with a low risk of harm. At lower thresholds, where the models indicated greater net benefit than alternative strategies, they can effectively inform preventive measures such as psychoeducational resources, tailored counseling sessions, peer support groups, and family-based interventions for high-risk individuals. We developed a simple logistic regression model that offers a pragmatic approach to clinical implementation, requiring only 10 readily accessible predictors from standard psychological questionnaires and basic health information. The model’s reliance on established measures, including the SDQ subscales, SiC score, and metrics such as BMI and body dissatisfaction, means it could be easily integrated into existing pediatric and adolescent health screenings. This parsimony, combined with performance comparable to more complex algorithms, positions the model as a potentially valuable clinical decision-support tool. However, while our validation demonstrates promising predictive capabilities, this does not automatically translate to improved patient outcomes. Ideally, a stepped wedge cluster randomized trial would need to be conducted to evaluate whether model-guided screening and intervention improves early detection rates, reduces time to treatment initiation, and ultimately improves clinical outcomes compared to standard care60,61,62,63.

Beyond their clinical utility, our findings could offer theoretical insights into the developmental pathways of EDs. The differences in behavioral composition between the two follow-ups point to distinct developmental phases: by DNBC-11, DEBs were predominantly characterized by binge eating, while by DNBC-18, there was a shift toward more diverse behavioral patterns, including increased fasting and the emergence of excessive exercise. This developmental transition from primarily impulsive eating behaviors to more restrictive and compensatory patterns may explain some of the differences in predictive factors between diagnostic and prognostic models. The contrasting patterns of hyperactivity/inattention across models may reflect this behavioral evolution: lower parent-reported hyperactivity/inattention scores predicting future ED risk could indicate children with greater capacity for self-control and internalization, potentially predisposing them to the restrictive behaviors that become more prominent by late adolescence, as evidenced by longitudinal trajectories showing increasing internalizing problems in restrictive eaters64. Conversely, higher child-reported hyperactivity/inattention symptoms in the diagnostic ED model may capture the acute cognitive effects of active ED symptoms or the restlessness associated with mostly binge-eating presentations at early adolescence, consistent with elevated externalizing problems in emotional/uncontrolled eaters64. The divergent BMI patterns, where lower parental BMI but higher childhood BMI both predicted increased ED risk, suggest complex intergenerational transmission mechanisms operating differently across developmental phases. Lower parental BMI may reflect familial genetic vulnerability to EDs, shared environmental factors that promote weight concern, or intergenerational transmission of weight-related attitudes65.
At the same time, higher childhood BMI may create vulnerability through weight stigma, body dissatisfaction, and subsequent dieting attempts that may initially manifest as binge eating but later evolve into restrictive behaviors66,67,68,69.

The differential prominence of sex as a predictor between tasks may reflect developmental changes in ED risk, where sex differences become more pronounced as adolescents progress toward late adolescence and the peak incidence period for many EDs. The emergence of obsessive-compulsive symptoms, depressive feelings, and sleep disturbances as key diagnostic predictors (but not prognostic ones) suggests that these factors may represent more proximal indicators of current ED pathology rather than risk factors for future development. Conversely, the consistent importance of emotional symptoms, conduct problems, and body satisfaction across both tasks highlights these as factors that remain relevant for both identifying current cases and predicting future risk, supporting their potential utility as targets for both screening and prevention efforts. However, it is important to note that our model used to extract such patterns is designed for screening and clinical decision-support rather than etiological inference. While these patterns are scientifically interesting and merit further investigation, definitive interpretation of their causal or theoretical significance would require dedicated analyses incorporating causal inference methods, which fall outside our current study’s scope.

This study has several key strengths. First, it represents the largest sample size to date used for developing ED prediction models, with 44,357 individuals for the diagnostic task and 26,127 for the prognostic task. Additionally, we evaluated an extensive set of approximately 100 potential predictors spanning demographic, social, behavioral, and clinical domains, substantially more comprehensive than other studies in the field29,55. We consider the longitudinal design with up to 18 years of follow-up another strength, enabling us to develop diagnostic and prognostic models and characterize transitions between ED states over time. Importantly, we conducted thorough internal validation of the models, examining discrimination, calibration, and potential clinical utility. Finally, we successfully developed a simplified logistic regression model using only 10 predictors that maintained performance comparable to the more complex machine learning model, enhancing potential clinical implementation. However, some limitations warrant discussion. The reliance on self-reported data for many predictors and outcomes may introduce reporting bias, given the sensitive nature of variables related to mental health. Nevertheless, our composite outcome, combining diagnosed cases with self-reported threshold and subthreshold ones according to the DSM-5 criteria, provides a more complete picture of ED risk, particularly valuable given that many individuals may delay or avoid seeking formal diagnosis due to stigma or access barriers26,27. However, this composite approach still has a limitation: it cannot capture those adolescents who may have experienced an undiagnosed ED that had already remitted before their respective follow-up questionnaire (either DNBC-11 or DNBC-18) was administered. Such cases would be missed by both the register and the self-reported DNBC information, a limitation inherent to the available data. 
While we could not predict specific ED subtypes separately due to sample size constraints, this limitation may be less critical for early prevention efforts that target shared risk factors such as those identified by our models. Also, DNBC participants tend to be of higher socioeconomic status than the general Danish population, potentially limiting generalizability and highlighting the need for external validation of the models35,70,71,72. Furthermore, while SHAP-values identified influential predictors, they depict statistical patterns within our model’s predictions and cannot determine whether these factors cause EDs. Any causal claims would require different analytical approaches. Finally, although our models showed promising performance, they were developed and validated using Danish data, and their applicability to other populations requires further investigation.

In summary, this study demonstrated the feasibility of developing predictive models for EDs using routinely collected data from questionnaires and health registries. Our models, integrating emotional symptoms, peer relationship difficulties, stress levels, conduct problems, body satisfaction levels, and BMI trajectories, demonstrated promising accuracy in identifying adolescents who may develop EDs, offering clinicians a potential screening tool for use during routine pediatric visits. While these models show promise for identifying young people at risk, particularly for low-risk interventions, their real-world impact on patient outcomes needs to be rigorously evaluated through clinical trials. The predominance of psychological and behavioral variables in our prediction models suggests potential value in exploring transdiagnostic approaches to mental health screening in adolescence, which could offer a more comprehensive framework for early risk assessment73,74.