Introduction

Atopic dermatitis (AD) is a common chronic, recurrent skin disease characterized by dry skin, localized red scaly patches, intense itching, and skin pain1,2. AD affects patients’ sleep quality, schooling, work, and even future career plans3,4, leading to reduced quality of life and increased healthcare expenditures5. For example, adults with eczema incur out-of-pocket costs of 371–489 dollars per person-year6, and those with hand eczema frequently have difficulty using fingerprint unlocking7. According to the most recent Global Burden of Disease (GBD) 2022 report, there are an estimated 223 million AD cases globally, with nearly 20% of cases occurring in children aged 1–4 years. This represents a significant healthcare burden on society. Additionally, AD ranks first among skin diseases and 15th among nonfatal diseases in the Global Burden of Disease based on Disability-Adjusted Life Years (DALYs)1,2,8. AD has a complex pathogenesis involving gene-environment interactions, skin barrier disruption, microbial homeostasis disruptions, and immunoregulation imbalances1,9,10.

A significant amount of recent research has focused on identifying the genetic and environmental factors contributing to AD, such as parental asthma or AD11, FLG gene mutations12, the degree of hygiene13, air pollution14, ultraviolet light exposure15, diversity of intestinal flora16, immunization17, green cover, tobacco exposure18, and environmental hormonal persistent organic pollutants19. These recent findings have significantly enhanced our understanding of AD and contributed to the development of preventive strategies. However, considering the increased prevalence and incidence of AD over the past few decades2,19, it is important to assess if any crucial issues have been overlooked. Epidemiologists have discovered since the turn of the century that having many siblings20, attending daycare centers21,22, residing on farms with animals23, and having furry pets24 all offer varied degrees of protection against the onset of atopic illnesses. Possible explanations include early childhood close contact with other children or living in a rural area, which can increase exposure to a variety of microorganisms and pathogens, thereby partially triggering the natural maturation process of the immune system. On the other hand, AD may arise from an imbalance between type 1 helper cells (Th1) and type 2 helper cells (Th2)25. High hygienic cleanliness scores were associated with an increased risk of reporting AD at 30–42 months of age (OR = 1.04, 1.01–1.07), particularly for AD with painful, exudative fluid, as indicated by studies on the relationship between hygienic cleanliness in infancy and AD13. Therefore, while improved cleanliness has notably reduced the incidence and mortality of infectious diseases, it may also contribute to the ongoing rise in the prevalence of AD, a concept known as the “hygiene hypothesis” proposed by Strachan26.

Machine learning, a branch of artificial intelligence, relies on advanced statistical algorithms to identify inherent patterns in extensive datasets27. This capability allows researchers to uncover novel insights, including the importance of specific variables and the interrelationships among various risk factors—insights often beyond the reach of traditional statistical methods such as logistic regression28. Machine learning models employing various algorithms explore the effect of the variable of interest on the outcome from multiple perspectives, providing a more comprehensive understanding of the issue. Moreover, the robust predictive capacity of machine learning aids in identifying high-risk populations and delivering personalized preventive and treatment protocols, which are essential for the early prevention and diagnosis of atopic dermatitis. This research utilized a case–control study design to investigate the overall impact of various potential variables on early childhood AD and their relationship to the hygiene hypothesis. Additionally, a machine learning model was developed to evaluate the relative significance of these variables across different algorithmic frameworks.

Materials and methods

Study population

In August 2019, six administrative districts (Xinshi, Shayibake, Tianshan, Shuimogou, Toutunhe, and Midong) were selected from 40 districts in Urumqi using stratified random sampling. Subsequently, eight to twelve kindergartens were randomly chosen from each district, sixty in total, to identify children diagnosed with atopic dermatitis by a physician. For each case, four children without a diagnosis of AD were selected as controls, matched by gender, age, and ethnicity. Each eligible control child had an equal chance of being selected. The study was approved by the hospital’s ethical committee, and parents provided signed informed consent. A standardized questionnaire was developed and administered to all children in both the case and control groups. This questionnaire was adapted from the one used in the China Children, Homes, Health (CCHH) study, with minor modifications for Urumqi. The questionnaire used in this study comprised four sections: general demographic information, child feeding status, AD prevalence in children and family members, and living environment. Prior to the survey, the research team communicated with the Urumqi City Department of Education and kindergarten teachers, who then received standardized professional training. Preschool teachers agreed to distribute the questionnaire to the parents of the children, who were instructed to complete it at home. Parents were given one week to complete the questionnaire and submit it to the designated kindergarten teacher, who subsequently returned it to the Urumqi Education Bureau. All questionnaires were reviewed by at least two trained survey team members, and those deemed unqualified were excluded. The inclusion–exclusion process is illustrated in Fig. 1.

Fig. 1

The flowchart of this study.

Key variables in this study were defined as follows: atopic dermatitis was determined by the question, “Has the child ever had a physician-diagnosed case of AD or eczema?”; antibiotic use was defined based on whether the child had received antibiotic treatment and the frequency of use during the first year. Breastfeeding status was assessed through questions regarding whether the child was exclusively breastfed by the mother and the duration of breastfeeding. Sibling status was evaluated by determining if the child was an only child; if not, the number of older and younger siblings was recorded. Additionally, we recorded each child’s birth weight, categorizing weights below 2.5 kg as low birth weight and those above 4.0 kg as fetal macrosomia.

Statistical analysis

EpiData 3.1 was used to establish the database, and R 4.3.0 was used to process and analyze the data. The data were first screened according to the inclusion and exclusion criteria. Non-random missing data were filled in according to the cause of missingness, and random missing data were imputed using Multiple Imputation by Chained Equations (MICE)29. Univariate analyses were performed using the χ2 test. Factors associated with the occurrence of AD in preschool children were analyzed using multivariate logistic regression, with odds ratios (ORs) and 95% confidence intervals (CIs) reported. All statistical tests were two-sided with α = 0.05, and P < 0.05 was considered statistically significant.
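As a rough illustration of how an odds ratio and its 95% confidence interval are obtained, the sketch below computes an unadjusted OR with a Wald interval from a 2×2 table. It is written in Python for illustration only (the study's analyses were run in R), the function name and the counts are hypothetical, and it does not reproduce the adjusted ORs from the multivariate logistic regression.

```python
import math

def odds_ratio_wald_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table."""
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal cell counts
    se_log_or = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, NOT taken from the study
or_value, ci_lower, ci_upper = odds_ratio_wald_ci(30, 70, 60, 340)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f}-{ci_upper:.2f}")
```

An interval that excludes 1 corresponds to P < 0.05 under the two-sided criterion described above.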

Stratification analysis

To further explore whether the effects of the variables of interest in the multifactorial analyses were influenced by intrinsic attributes, we conducted separate stratified analyses by intrinsic attribute variables: parental history of allergy-related disorders (atopic dermatitis, allergic rhinitis, or asthma), parental history of allergic rhinitis or asthma, parental AD status, and the child’s mode of birth. Because the data in this study were 1:4 matched, our strategy differed from traditional stratified analysis. We introduced the concept of a Basic Data Unit (BDU), defined as one case together with its four healthy controls matched on gender, age, and ethnicity. We retained the BDUs that could be matched on the stratification variable and discarded those that could not, resulting in 1:1 case–control data matched on gender, age, ethnicity, and the stratification variable. Figure S1 illustrates the strategy using parental history of atopic disease as an example.
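One possible reading of the BDU stratification strategy is sketched below in Python, for illustration only. It assumes that a BDU is "matched" when at least one of its controls shares the case's value of the stratification variable, and that the first such control is kept. Both rules, the field names, and the toy records are assumptions that the text does not specify.

```python
def stratify_bdus(bdus, strat_var):
    """Group 1:1 case-control pairs by the case's level of strat_var.

    A BDU is kept only if some control has the same strat_var value
    as the case (assumed matching rule); otherwise it is discarded.
    """
    strata = {}
    for bdu in bdus:
        case = bdu["case"]
        level = case[strat_var]
        matches = [c for c in bdu["controls"] if c[strat_var] == level]
        if not matches:
            continue  # no control matches on the stratification variable
        # Assumption: keep the first matching control to form a 1:1 pair
        strata.setdefault(level, []).append((case, matches[0]))
    return strata

# Toy BDUs with a hypothetical binary variable "parental_ad"
bdus = [
    {"case": {"id": "c1", "parental_ad": 1},
     "controls": [{"id": "k1", "parental_ad": 0}, {"id": "k2", "parental_ad": 1},
                  {"id": "k3", "parental_ad": 0}, {"id": "k4", "parental_ad": 0}]},
    {"case": {"id": "c2", "parental_ad": 1},
     "controls": [{"id": "k5", "parental_ad": 0}, {"id": "k6", "parental_ad": 0},
                  {"id": "k7", "parental_ad": 0}, {"id": "k8", "parental_ad": 0}]},
    {"case": {"id": "c3", "parental_ad": 0},
     "controls": [{"id": "k9", "parental_ad": 0}, {"id": "k10", "parental_ad": 0},
                  {"id": "k11", "parental_ad": 1}, {"id": "k12", "parental_ad": 0}]},
]
strata = stratify_bdus(bdus, "parental_ad")
```

In this toy example the second BDU is dropped because none of its controls has a parent with AD, so each stratum ends up with one 1:1 pair.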

Interpretable machine learning models

Traditional statistical methods are limited when analyzing large, multidimensional datasets. Logistic regression is a variation of generalized linear regression that utilizes the sigmoid function for dichotomous classification, mapping input values to the [0,1] interval. While Logistic regression is straightforward, historically established, and effective for analyzing simple datasets, it often fails to fully capture underlying patterns in large datasets through coefficients. In contrast, Random Forest and eXtreme Gradient Boosting (XGBoost) are advanced ensemble models based on decision trees, differing primarily in their integration strategies for the base learners. Random Forest combines classification and regression trees (CART) with a parallel bagging strategy that enhances learner diversity through self-sampling and feature subset perturbation, culminating in a final decision based on voting. Its parallel architecture enables the use of the Gini index—a metric for evaluating node partitioning in CART—evolving into Mean Decrease Gini (MDG) in Random Forest. A higher MDG indicates greater importance of the feature for model performance. Conversely, XGBoost is a sequentially integrated model that adjusts data distribution based on preceding training results before training each base learner. It uses the outputs of weighted learners for final decision-making. Its sequential training process complements the SHAP interpretability framework. The SHAP value, grounded in game theory30, was initially applied to economic modeling before being successfully adopted in machine learning. In machine learning, SHAP assesses the significance of individual features by comparing model performance with and without them. This technique evaluates not only a feature’s relative importance but also its directional impact. The MDG from random forests can also assess feature importance; however, this metric primarily reveals quantitative significance.
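To make the Gini-based importance measure more concrete, the following Python sketch (illustrative only; the study used the R randomForest package) computes the Gini impurity of a binary outcome and the impurity decrease produced by splitting on a single binary feature. In a random forest, Mean Decrease Gini accumulates such decreases over every node that splits on a feature and averages them across trees. The function names and the toy data are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # same as 1 - p**2 - (1 - p)**2

def gini_decrease(labels, feature):
    """Impurity decrease from splitting the node on a binary feature."""
    left = [y for y, x in zip(labels, feature) if x == 0]
    right = [y for y, x in zip(labels, feature) if x == 1]
    n = len(labels)
    weighted_child_impurity = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - weighted_child_impurity

# Toy node: 1 = case, 0 = control
outcome = [1, 1, 0, 0]
informative_feature = [1, 1, 0, 0]    # separates cases from controls perfectly
uninformative_feature = [1, 0, 1, 0]  # carries no information about the outcome

informative_gain = gini_decrease(outcome, informative_feature)
uninformative_gain = gini_decrease(outcome, uninformative_feature)
```

The informative feature removes all impurity from the node, while the uninformative one removes none. A value of this kind reflects magnitude only and does not indicate whether a feature raises or lowers risk.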

The operational process encompasses various challenges, including feature engineering, handling class imbalance, optimal parameter tuning, and model iteration. Among these, class imbalance is a significant issue that impacts model performance. With a 1:4 case–control ratio in this study, cases account for only 20% of the samples. Complex machine learning algorithms typically require this ratio to be around 1:2. Given the unique advantages of BDUs in this study, we opted for down-sampling instead of up-sampling to address class imbalance. We implemented a distance-based strategy that, within each BDU, retains the control sample that differs most from the case sample. This approach maximizes the separation between case and control samples in feature space while maintaining a 1:1 matched case–control design. Figure S2 illustrates the data down-sampling strategy, aiding in the comprehension of the study’s machine learning component. We employed the area under the receiver operating characteristic curve (AUROC) as the primary metric for model performance evaluation, presented as mean ± standard deviation. All analyses were conducted using R (Version 4.3.0), with the following packages: mice (Version 3.16.0), caret (Version 6.0–94), randomForest (Version 4.7–1.1), xgboost (Version 1.7.6.1), and ggplot2 (Version 3.4.4). The Python Streamlit module (Version 1.32.0) was used for model deployment.
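The down-sampling idea can be sketched as follows, in Python and for illustration only. The sketch assumes that features are numerically coded and that "sample spacing" is measured by Euclidean distance. The text does not name the distance metric, so that choice, the function names, and the toy data are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_control(case, controls):
    """Return the control that lies farthest from the case in feature space."""
    return max(controls, key=lambda ctrl: euclidean(case, ctrl))

def downsample(bdus):
    """Reduce each BDU (one case, several controls) to a single 1:1 pair."""
    return [(case, farthest_control(case, controls)) for case, controls in bdus]

# Toy BDU with hypothetical coded features
toy_bdus = [
    ([1, 0, 3], [[1, 0, 3], [0, 1, 0], [1, 1, 3], [0, 0, 2]]),
]
pairs = downsample(toy_bdus)
```

Here the second control is retained because it is the most distant from the case. With mixed binary and ordinal variables, a different distance (for example Gower distance) might be more appropriate.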

Results

Characteristics of the study population

The case group consisted of 771 children: 400 boys (51.9%), 371 girls (48.1%), 717 Han Chinese (93.0%), and 54 individuals from ethnic minorities (7.0%). The mean age of participants was 5.40 ± 1.06 years, with an average age of first atopic dermatitis episode at 2.16 ± 1.32 years. The control group comprised 3084 children matched in a 1:4 ratio based on similar gender, ethnicity, and age distribution to the case group. Significant differences emerged between the control and AD groups regarding mode of birth, full-term status, sibling presence, birth weight, paternal asthma, paternal allergic rhinitis (AR), paternal AD, maternal asthma, maternal AR, and maternal AD (P < 0.05). Detailed results are presented in Table 1.

Table 1 Demographic characteristics of AD and control groups.

Univariate analysis of indoor environmental factors, breastfeeding, and antibiotic exposure on AD

Table 2 displays the population distribution, χ2 values, and associated p-values for the AD and control groups. It was observed that multiple factors were significantly different between the two groups, including newly purchased furniture, renovations, mold presence, dampness in the parents’ residence prior to or during maternal pregnancy, and smoking by the father and maternal grandfather during the mother’s pregnancy. Additionally, factors such as newly purchased furniture and renovations during the child’s first year, mold presence, dampness, father and maternal grandfather smoking during this period, pet ownership, exposure to fish or reptiles, the duration of exclusive breastfeeding and antibiotics exposure were significantly different between the AD and control groups (P < 0.05). No significant differences were observed between the two groups regarding paternal grandfather smoking during maternal pregnancy, paternal grandfather smoking, ownership of dogs, or cultivation of flowering plants during the child’s first year (P > 0.05). The Cochran-Armitage trend test indicated a significant trend regarding the duration of exclusive breastfeeding and frequency of antibiotic treatment during the child’s first year across subgroups (P for trend < 0.01).

Table 2 Univariate analysis of early-life indoor environmental factors, breastfeeding, and antibiotic exposure.

Multivariate analysis of indoor environmental factors, breastfeeding, and antibiotic exposure on AD

Variables demonstrating statistical significance (P < 0.05) in the univariate analysis were included in the logistic regression analysis. Table S1 lists the statistically significant variables included in the multifactorial analysis, along with their odds ratios (ORs) and 95% confidence intervals. The analysis revealed that paternal asthma (OR = 2.07, 95% CI 1.19–3.6), paternal AR (OR = 1.42, 95% CI 1.22–1.64), paternal AD (OR = 1.79, 95% CI 1.41–2.27), maternal AR (OR = 1.52, 95% CI 1.31–1.76), and maternal AD (OR = 2.10, 95% CI 1.72–2.55) are genetically associated risk factors for childhood AD. Figure 2, derived from Table S1, illustrates the results of the multivariate analysis concerning indoor environmental factors, antibiotic use, exclusive breastfeeding duration, sibling presence, and birth weight in relation to AD. Notably, renovation of the dwelling during maternal pregnancy was identified as an indoor environmental risk factor for AD (OR = 1.50, 95% CI 1.15–1.96). Our findings indicated that children receiving three or more antibiotic treatments between 0–1 year old had a significantly increased risk of AD compared to those who did not (OR = 1.92, 95% CI 1.29–2.85). Additionally, children exclusively breastfed for four months or longer also faced an elevated risk compared to those not exclusively breastfed (OR = 1.59, 95% CI 1.17–2.17). Furthermore, children with older siblings demonstrated a reduced risk of AD compared to only children (OR = 0.76, 95% CI 0.63–0.92). Moreover, children with low birth weight exhibited a lower risk of AD compared to those of normal weight (OR = 0.62, 95% CI 0.47–0.81).

Fig. 2

Multifactorial logistic analysis of indoor environmental factors, antibiotic use during the first year, breastfeeding, and sibling effects on preschoolers with AD; see Table S1 for complete results and covariates. mp maternal pregnancy, cfy child first year.

Stratified analyses by parental history of atopic diseases and child’s mode of birth

For the variables of interest in the multifactorial analysis, including exclusive breastfeeding duration, antibiotic use at age 0–1 years, sibling status, and birth weight, we conducted stratified analyses to examine in depth whether the effects of these variables on children’s development of AD were moderated by intrinsic attributes. Figure 3 shows the change in the effect of each variable of interest on the outcome after stratifying separately by parental history of atopic disease (Fig. 3A, Table S2), parental allergic rhinitis or asthma (Fig. 3B, Table S3), parental AD, and child’s mode of birth (Fig. 3C, Table S4), with covariates including all variables in Table S1 except the stratification variable and the variable of interest. Overall, exclusive breastfeeding for 4 months or more and 3 or more antibiotic treatments early in life showed a predominantly harmful effect on AD, whereas having an older sibling and low birth weight showed a predominantly protective effect on AD. Specifically, we observed that the risk effect of exclusive breastfeeding for 4 months or more on AD was significantly lower when the mother had an allergy-related disease compared with when the mother did not have an allergy-related disease (yes: 1.73, 95% CI 1.11–2.69 vs no: 1.34, 95% CI 0.81–2.21), and that this phenomenon was particularly pronounced in the group of AD parents (yes: 1.67, 95% CI 1.16–2.4 vs no: 0.74, 95% CI 0.25–2.18); in contrast, father’s allergy-related disease significantly increased the risk of breastfeeding for AD (yes: 2.0, 95% CI 1.08–3.67 vs no: 1.26, 95% CI 0.85–1.87). The risk effect of 3 or more antibiotic administrations on AD was significantly lower when the father had an allergy-related disease compared to when the father did not have an allergy-related disease (yes: 1.66, 95% CI 0.77–3.58 vs no: 2.3, 95% CI 1.42–3.73). 
This phenomenon was particularly significant in the group of parents with AD (yes: 0.91, 95% CI 0.11–7.35 vs no: 1.8, 95% CI 1.16–2.8). Consistent results regarding the protective effect of older siblings and low birth weight on childhood AD were demonstrated in the stratification of parental history of allergy-related disease, that is, a parental history of allergy-related disease would significantly reduce this protective effect compared to no history of related disease. Interestingly, the protective effect of low birth weight on AD was significantly reduced in the subgroup with no parental history of AD (OR = 1.13, 95% CI 0.48–2.65) compared with a parental history of AD (OR = 0.62, 95% CI 0.45–0.86), even with the direction of the β coefficient reversed.

Fig. 3

Stratified analysis of atopic disease history (A), rhinitis or asthma history (B), parental AD, and mode of birth (C) to explore factors influencing childhood AD, including exclusive breastfeeding duration, antibiotic use during the first year, sibling status, and birth weight.

Machine learning model building and feature evaluation

We obtained the optimal model parameters through a grid search; details of the hyperparameters and their explanations are given in Table S5. Figure 4A shows the average performance of the three models over 100 iterations on the training and testing sets. The Random Forest model performed best and significantly better than the logistic regression model on both the training set (AUROC: 0.80 ± 0.006) and the testing set (AUROC: 0.741 ± 0.016); the specific performances of the three models and their hypothesis tests are shown in Table S6. Figure 4B,C show the global feature importance assessed by the XGBoost and Random Forest models on the downsampled data under the optimal parameters. Both SHAP and MDG values suggest that breastfeeding duration, older siblings, and low birth weight are of considerable importance. Fig. 4B also suggests that exclusive breastfeeding for 4 months or more (Code: 3) increases AD risk, whereas having older siblings (Code: 1) and low birth weight (Code: 1) reduce children’s AD risk. Finally, we built an online AD prediction tool (https://admodel-ghrcirrmt5ik6wz5shqydj.streamlit.app/) based on the RF model of Fig. 4C for the early identification of AD risk in children aged 2–8 years using early-life variables.
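For readers unfamiliar with the "mean ± standard deviation" AUROC summary, the Python sketch below (illustrative only; the study's models were fitted in R) computes AUROC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half, and then summarizes AUROC values from repeated iterations. The labels, scores, and per-iteration AUROC values are made up for illustration and are not the study's results.

```python
import statistics

def auroc(labels, scores):
    """AUROC via pairwise comparison of case scores against control scores."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0   # case ranked above control
            elif c == k:
                wins += 0.5   # tie counts as half
    return wins / (len(case_scores) * len(control_scores))

# Toy predictions for a single test split
single_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# Hypothetical AUROC values from repeated train/test iterations
iteration_aucs = [0.74, 0.75, 0.73]
mean_auc = statistics.mean(iteration_aucs)
sd_auc = statistics.stdev(iteration_aucs)
print(f"AUROC: {mean_auc:.3f} ± {sd_auc:.3f}")
```

A small standard deviation across iterations suggests that performance is stable with respect to how the data are split.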

Fig. 4

Interpretable machine learning applied to downsampled 1:1 matched case–control data, utilizing logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost) models. The five key variables indicated in Fig. 2 (new renovations during maternal pregnancy (mp), frequency of antibiotic therapy during the child’s first year (cfy), duration of exclusive breastfeeding, presence of older siblings, and low birth weight) are highlighted in the plot. (A) Area under the receiver operating characteristic curve (AUROC) for the training and testing sets of the three models across 100 iterations. (B) SHAP value evaluation based on the testing dataset from the XGBoost model. Each point represents a sample, with point color indicating the value of the corresponding feature. Dichotomous variables (e.g., older siblings) are represented by distinct colors, while multicategorical ordinal variables (e.g., months of exclusive breastfeeding) are shown with a color gradient across categories; affiliated feature SHAP values are indicated by green bars. SHAP values greater than zero indicate a positive contribution to the outcome, reflecting a hazard effect. For example, children with older siblings (coded as 1) are less likely to develop AD than those without (coded as 0). (C) Mean Decrease Gini (MDG) assessment from the full random forest (RF) model; larger MDG values indicate greater feature importance. mp maternal pregnancy, cfy child’s first year, AR allergic rhinitis, AD atopic dermatitis.

Discussion

About 20% of children and 10% of adults worldwide suffer from AD, a multifactorial inflammatory skin condition that lowers the quality of life and has a non-negligible economic impact2,8. AD is prevalent in children and often precedes a cascade of allergic conditions, including food allergies, allergic rhinitis, and asthma31. A recent study reported a 12.0% prevalence of AD in preschool-aged children in Urumqi32. Apart from causing discomfort and itching, AD can significantly impact children’s self-esteem and future social engagement3,4. Therefore, it is crucial to investigate early infancy risk factors for AD to facilitate timely prevention and treatment. This case–control study revealed that exclusive breastfeeding for 4 months or more, three or more antibiotic treatments during the child’s first year, and renovation of the dwelling during the mother’s pregnancy were associated with an increased risk of AD in children aged 2–8 years. Conversely, having older siblings and low birth weight (< 2.5 kg) were found to be protective factors for AD.

One significant risk factor for atopic disease in children is the history of atopic disease in either or both parents. For example, prospective studies have revealed that a high FLG mutation risk score (OR = 1.8; 95% CI 1.1–2.9), parental asthma (OR = 3.7; 95% CI 1.2–11.5), and parental AD (OR = 6.2; 95% CI 1.2–23.2) are substantial genetic risks for persistent AD in children11. A significant prospective cohort study conducted at the beginning of the century discovered that children whose parents had atopic disease had an increased risk of AD by the time the children were 4 years old; the risk of AD in children with a parental atopic history was nearly twice as high as that in children without a parental atopic history33; this effect was even more pronounced when parental AD was taken into account. The present research discovered that atopic disease of the parents, such as atopic dermatitis, asthma, and allergic rhinitis, all exacerbated the risk of AD in preschoolers, particularly in mothers. This is also evidenced by the assessment of the factors’ importance by predictive models.

Multivariate regression analysis identified that renovation of the dwelling during the mother’s pregnancy increased the risk of AD in children, in contrast to renovation before the pregnancy or during the child’s first year. The release of chemicals such as formaldehyde, organic volatiles, surfactants, and endocrine-disrupting chemicals (EDCs) during the renovation process has been linked to adverse health effects, particularly in young children and infants34. Additionally, immunological research suggests a higher prevalence of the type 2 helper cell phenotype in AD patients, characterized by elevated serum IgE and interleukin (IL)-4 levels. Furthermore, decorative and furniture materials containing volatile organic compounds (VOCs) can impact the fetal immune system and compromise the skin barrier, intensifying the sensitization process to indoor dust mites and molds35,36,37. This is particularly significant during the seventh to seventeenth week of gestation, a crucial period for fetal epidermal differentiation and susceptibility to exogenous hazardous substances such as PM2.537. Overall, new renovations during the mother’s pregnancy, rather than before or after it, were significantly associated with the subsequent development of AD in children.

The association between antibiotic use and childhood AD has been extensively studied, yielding varying conclusions due to differences in study design, periods of interest, and antibiotic types and doses, which have impaired the credibility of meta-analyses38,39. A multicenter cross-sectional study revealed a correlation (OR = 1.20, 95% CI 1.11–1.30) between childhood AD and antibiotic use in children aged 0–1 years40. Similarly, a retrospective cohort study found that prenatal antibiotic exposure increased the incidence of AD in 11-year-old children (aHR = 1.19, 95% CI 1.09–1.31)41. However, most studies have not shown an adverse impact or a statistically significant link between antibiotic usage in neonates and childhood AD22,42. For example, a retrospective study based on a prospective cohort found decreased odds ratios for developing AD in children using antibiotics during the ages of 0–1 year and 1–4 years, with OR of 0.61 and 0.11, respectively; as well as a lower risk of atopic sensitization, with OR of 0.38 and 0.15, respectively22. Moreover, the quantity and frequency of antibiotic use have not been thoroughly examined in many studies positively associated with AD43. This study found a significant association between antibiotic use in the first year of life and AD in preschool-aged children, demonstrating a frequency-enhanced effect (P for trend < 0.001). Furthermore, multivariable analysis indicated that administering three or more antibiotic doses between 0 and 1 year is associated with a 92% increased risk of AD, regardless of the child’s mode of birth or parental history of atopic disorders. However, stratified analyses revealed that the risk of AD associated with the use of three or more antibiotics was significantly reduced in children of parents with a history of AD, even becoming non-significant compared to those whose parents did not have AD. 
While the exact mechanism remains unknown, this phenomenon suggests complex interactions between antibiotic use and parental AD that may influence the development of AD in children. These findings provide important implications for further research on antibiotic use.

Preliminary research indicates a possible preventive effect of breastfeeding on childhood atopic dermatitis. A meta-analysis of prospective studies published prior to 2000 found that exclusive breastfeeding for at least three months reduced the risk of AD, especially when accounting for parental atopic disorders (OR = 0.58; 95% CI 0.41–0.92)44. However, recent studies have not provided adequate evidence to support the protective role of exclusive breastfeeding against AD45. Conversely, a 2014 Japanese birth cohort study reported an increased risk of AD associated with exclusive breastfeeding compared to formula feeding (OR = 1.26, 95% CI 1.12–1.41), showing a dose–response relationship (P for trend < 0.001)46. Several factors, such as recall bias, study design, and varying interpretations of exclusive breastfeeding, may account for this discrepancy. Additionally, exclusive breastfeeding may not supply adequate vitamin D, potentially leading to deficiency and an increased risk of AD; however, vitamin D supplementation can improve clinical symptoms of AD47,48. Moreover, recent advancements in infant formula may have diminished some benefits of breastfeeding. The hygiene hypothesis posits that prolonged exclusive breastfeeding decreases exposure to pathogenic stimuli, favoring type 2 helper cells over type 1 cells and increasing the risk of allergy development13. Furthermore, it proposes that breast milk contains antimicrobial and anti-inflammatory bioactive compounds that enhance infant resistance to infections49. Introducing a diverse array of complementary foods between 6 and 12 months helps establish intestinal flora homeostasis, thereby reducing AD incidence from 1 to 2 years of age50. Our study identified a 1.59-fold increased risk of AD in children exclusively breastfed for four months or longer compared to their non-breastfed counterparts. 
Further stratified analyses indicated that this effect was most pronounced in children without a parental history of atopic disease, particularly when the mother lacked such a history. This finding suggests that mothers with a history of allergy-related diseases may possess specific antibodies or molecules that confer some protection against AD, allowing the child to acquire a degree of resistance through breastfeeding. This hypothesis warrants further exploration in relevant basic research. A recent matched case–control study indicated a significantly lower risk of AD in children under two years old when weaned or introduced to a diverse solid complementary diet as early as 4 months, with ORs of 0.41 (95% CI 0.20–0.87) and 0.30 (95% CI 0.11–0.81), respectively51. However, this effect diminished after stratification by the child’s mode of birth in our stratification analysis, implying that the mode of birth may be a confounding factor influencing the impact of extended exclusive breastfeeding on AD. This necessitates further rigorous studies for confirmation.

The hygiene hypothesis, proposed in 1989, has undergone significant revisions and modifications. Strachan initially suggested that the exchange of early childhood infections between siblings could protect against immune-related diseases26. Building on this concept, Rook introduced the “Old-Friends-Hypothesis,” which emphasizes the coexistence of infectious diseases and human evolution over time, suggesting that appropriate early-life exposure to microbial communities can help prevent immune-related diseases and allergic conditions52. Subsequent Alpine farm studies provided strong evidence for this hypothesis, broadening our understanding of the relationship between health and early-life microbial exposure53. In the field of immunology, the Microbiota Hypothesis, developed by Noverr and Hufnagle, has been refined through the study of microbial communities and their interactions with host mucosal surfaces, highlighting their metabolic and immunological effects54. Simultaneously, phylogenetic evidence indicates a lower variety and richness of microbial communities in invertebrates compared to vertebrates55. Combining concepts from biological evolution, immunology, and microbiology, the hygiene hypothesis is considered a historically relevant model explaining how modern lifestyles impact human health. It emphasizes the long-evolved balance between pathogen stimuli and immune responses from a human-nature perspective, suggesting a potential link to the rising prevalence of allergy-related diseases in industrialized nations56,57. Notably, this study found that children with older siblings had a 24% lower risk of AD, independent of parental atopic disease. However, no correlation was found between having a dog at home and AD in children aged 0–1 year. 
According to the hygiene hypothesis, close contact with older siblings, for example during caregiving, may increase a child’s exposure to pathogenic stimuli after birth, potentially lowering the risk of AD by promoting normal immune system maturation. A recent study on early-life illnesses and the development of AD in children suggests that older siblings may act as “microbe contact carriers” when interacting with the child58. By contrast, children under one year old are less likely to interact frequently with a dog, which may explain why no significant association was observed.

This study suggests that children born with low birth weight have a significantly lower risk of developing AD later in life. This unexpected finding aligns with the notion that babies with low birth weights require more personalized attention. Notably, low birth weight children are less likely than those with normal birth weight to experience exclusive breastfeeding (Table S7), indirectly supporting this observation. Further logistic analyses indicated that low birth weight children were 1.46 times more likely than those with normal birth weight to experience discontinuous exclusive breastfeeding (Table S8). Under the hygiene hypothesis, low birth weight babies may therefore have a reduced risk of AD because of less exclusive breastfeeding and increased nursing care. However, multiple other factors, such as socioeconomic status and the quality of postnatal nursing care, likely contribute to this effect, so the role and significance of the hygiene hypothesis in this context should be interpreted with care.

Artificial intelligence (AI) technologies leveraging machine learning are set to revolutionize atopic dermatitis management by enabling data-driven, personalized treatment. Beyond clinical diagnosis and prognosis, machine learning has increasingly been applied to population studies. For instance, a recent large-scale case–control study utilized machine learning models to extract new insights regarding breast cancer risk factors59. In this study, paternal asthma emerged as the weakest predictor of AD among all assessed variables, as indicated by both SHAP values and MeanDecreaseGini scores. However, logistic regression revealed that paternal asthma ranked second only to maternal AD in terms of odds ratio (OR) for childhood AD (Table S1). This discrepancy arises because the traditional OR measures the ratio of exposure odds in cases versus controls and does not reflect the exposure’s actual contribution to the outcome. In contrast, machine learning algorithms focus on the degree of each exposure’s contribution relative to all factors. Both machine learning approaches identified the same key variables: months of exclusive breastfeeding, having older siblings, low birth weight, maternal AD or AR, and paternal AR. Overall, machine learning feature evaluation suggests three key considerations. First, the effects of older siblings and the hygiene hypothesis on atopic dermatitis warrant adequate attention. Second, children of parents with AD or AR are a key population of concern for future AD preventive health care, and parents with atopic disease should raise their health awareness to prevent the occurrence of AD in their children. Third, high-quality epidemiological and mechanistic studies are essential to elucidate the impacts of exclusive breastfeeding duration and low birth weight on AD, providing a scientific foundation for maternal and child health practices.
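The contrast between a large OR and a small overall contribution can be illustrated with a minimal sketch. The counts below are hypothetical and are not taken from this study, and the population attributable fraction (Miettinen's case-based formula, with the OR standing in for the relative risk) is used here only as a simple stand-in for "contribution"; it is not what SHAP values or MeanDecreaseGini compute.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 case-control table (cross-product ratio)."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


def attributable_fraction(exposed_cases, unexposed_cases, or_value):
    """Miettinen's formula: PAF = p_c * (OR - 1) / OR.

    p_c is the proportion of cases that are exposed. The OR is used as an
    approximation of the relative risk, which is an assumption of this sketch.
    """
    p_c = exposed_cases / (exposed_cases + unexposed_cases)
    return p_c * (or_value - 1) / or_value


# Hypothetical counts, 1000 cases and 1000 controls per exposure.
# "rare": an uncommon exposure with a strong association.
# "common": a frequent exposure with a weaker association.
tables = {
    "rare": (30, 970, 10, 990),
    "common": (500, 500, 400, 600),
}

for name, (a, b, c, d) in tables.items():
    or_value = odds_ratio(a, b, c, d)
    paf = attributable_fraction(a, b, or_value)
    print(f"{name}: OR = {or_value:.2f}, attributable fraction = {paf:.3f}")
```

In this illustration the rare exposure has the larger OR but accounts for a smaller share of cases, which is one plausible reason a factor can rank high by OR yet low in a feature-importance ranking.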

Using a large case–control study, we analyzed the effects of early-life indoor environmental factors, frequency of antibiotic use in infants aged 0–1 years, duration of exclusive breastfeeding, and sibling status on AD in preschool children. For the stratified analysis and machine learning, we converted the 1:4 matched dataset to a 1:1 matched dataset to enhance comparability and strengthen the conclusions. Although the strengths of this study lie in its comprehensive analysis and machine learning model development, it has several limitations. First, this is a single-center study conducted in Urumqi and may not fully represent national demographics. Second, as a questionnaire-based retrospective study, it is subject to potential recall and reporting biases that may weaken the findings. Third, the data transformation for the stratified analysis and machine learning, while ensuring internal data consistency, reduced the representativeness of the control group, widened the gap from real-world data, and may limit the model’s generalizability.

Conclusion

Our case–control study found that a history of parental atopic disease, indoor renovations during pregnancy, exclusive breastfeeding for four months or longer, and the use of antibiotics three or more times in a child’s first year were significantly associated with an increased risk of AD in preschoolers. Conversely, having older siblings and low birth weight were negatively associated with the risk of AD, supporting the hygiene hypothesis. Our machine learning analysis further identified children of parents affected by AD or allergic rhinitis (AR) as a priority population for future preventive healthcare. Parents with AD or AR should enhance their health awareness to mitigate AD risk in their children. Moreover, reducing antibiotic overuse may be crucial for managing childhood AD. High-quality epidemiological and mechanistic studies are essential to further elucidate the impact of exclusive breastfeeding duration on AD and its underlying mechanisms, offering a scientific basis for maternal and child healthcare practices. Importantly, our findings do not diminish the significance of breastfeeding for healthy child development; rather, they suggest that the role of alternative factors, including supplemental formula, warrants further investigation as society evolves and becomes more affluent. The recommended duration of exclusive breastfeeding may also need to be redefined.