Introduction

Allergic conditions arise when the immune system responds inappropriately to normally harmless substances, or allergens1. This response activates immunoglobulin E (IgE)-mediated pathways and triggers the release of inflammatory mediators such as histamine. This hypersensitivity reaction can lead to a range of clinical manifestations, including atopic dermatitis, asthma, and allergic rhinitis, collectively known as the atopic march. Allergies have a multifactorial etiology, including immune dysfunction, sex hormones, genetic predisposition, and environmental factors2. As documented by previous cohort studies, the atopic march typically begins during childhood3,4,5,6. During this period, boys have been observed to have a higher prevalence of allergic diseases than girls until puberty, when the trend reverses7,8. The increase in the incidence and severity of allergic diseases among women has been attributed to female sex hormones, which can amplify the immune response and contribute to airway hyperresponsiveness, inflammation, and increased mucus production9,10,11. Evidence specifically addressing allergies in elderly women is scarce. This population remains underrepresented despite its relevance for understanding late-life allergy patterns, which highlights the need for more research.

Moreover, adult-onset atopic diseases can also be driven by environmental exposures. Several epidemiological studies have shown that environmental factors (e.g., air pollution, meteorological factors, and green space) are strongly associated with existing allergies as well as with the development of new allergies12,13,14. Specifically, air pollution and climatic factors are linked to allergies in children14. The strongest associations were found for the air pollutants nitrogen dioxide (NO2), particulate matter with an aerodynamic diameter of 2.5 μm or less (PM2.5), and ultrafine particles, as well as for temperature15. Changes in temperature have been associated with longer pollen seasons and skin barrier dysfunction16,17. In Germany, where allergic rhinitis is the most common allergic disease, climate change is seen to indirectly influence allergy incidence18. Additionally, active smoking and exposure to secondhand smoke have been associated with a higher risk of atopic dermatitis and allergic rhinitis19,20.

Environmental exposures are often highly correlated, which causes collinearity among the exposure variables and makes it difficult to assess the cumulative effects of multiple exposures on health outcomes. One challenge in environmental epidemiology is therefore estimating the independent effects of many correlated exposures. General approaches include assessing each exposure in a separate model, adjusting for other exposure domains, or assessing all exposures simultaneously in a single model, for example with semi-Bayes modeling. However, the optimal strategy often remains uncertain, and the combined effect of exposures is not studied.

In addition, genetic make-up is involved in the pathogenesis of allergies21,22. Genetic risk scores (GRS) were developed to determine the cumulative genetic effect on a trait or disease23. Methodology initially developed for genome-wide analyses can also be useful for analyses of environmental exposures, since environmental predictors, like genetic factors, are highly correlated. In a previous study24, the GRS methodology was adopted to assess the extent to which various domains of exposomic factors contribute to health outcomes, building risk scores (RS) from several correlated exposure variables to assess the cumulative effect of one domain. Here, we extend this approach and explore its utility for binary health outcomes. Moreover, instead of selecting relevant single-nucleotide polymorphisms (SNPs) based on previous genome-wide association studies, we use a new screening approach, called cross leverage scores (CLS), that incorporates all available SNPs simultaneously.

Jointly modeling genetic predisposition and environmental exposures is necessary in the context of allergic disease development. While allergies are triggered by environmental exposures, genes shape the immune recognition, barrier function, and inflammatory responses that determine how the body responds to these triggers. As a complement to the genome, the exposome captures the totality of exposures and thereby provides a more comprehensive understanding of disease etiology and its multifactorial nature25,26. In fact, a recent study comprising 14 European cohorts demonstrated that an exposomal risk score (ERS) integrating multiple external exposome domains was associated with a higher incidence of asthma27. We therefore aim to use an exposomic approach that considers different domains of exposures, along with genetic risk, to identify their relative contributions to allergies. To the best of our knowledge, this is the first study to examine both genetic and multiple exposomic factors in relation to allergy. Ranking the contributions of the different exposure domains may support health risk assessment of allergies and inform policy-making by identifying which domains of exposures should be prioritized in setting limit values.

Materials and methods

Cohort data

In this article, we used data from the second follow-up examination (2007 to 2010) of the SALIA cohort study (Study on the influence of air pollution on lung function, inflammation and aging), comprising 450 elderly German women from the Ruhr area and Southern Münsterland with available genetic data. The women were first recruited in 1985, when they were aged 54–55. Men were not recruited, to avoid bias due to occupational exposure, since the study region had mining and steel industries at the time. More details on the SALIA cohort can be found in previous studies28. The cohort study was carried out in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Ruhr University, Bochum (protocol code 2732 and date of approval: 4 April 2006) and the Heinrich Heine University, Düsseldorf (protocol code 3507 and date of approval: 24 November 2010) for studies involving humans. All women gave their written informed consent before the investigation. The primary outcome of this study is an existing or previous physician’s diagnosis of an allergy, which includes bronchial asthma, atopic dermatitis/eczema, and allergic rhinitis/hay fever. Participants’ information on age, body mass index (BMI), and socioeconomic status (SES) was derived from the cohort interview questionnaire and entered our models as single predictors or fixed confounding variables. For the current analysis, a participant was classified as having high SES if she or her partner had 10 or more years of schooling, and low SES otherwise. In addition, responses to questions on smoking behavior, such as currently smoking, number of pack-years, formerly smoking, and exposure to secondhand smoke, were used to develop a Smoking Risk Score.

Genetic assessment

Genome-wide genotyping was performed in 752 individuals from different biological samples (venous blood, saliva from buccal swab or Oragene collection kit OG-500) at different time points using the Axiom Precision Medicine Research Array (Affymetrix, Santa Clara, CA, USA) (GRCh37/hg19), resulting in 871,262 variants. After quality controls29 and genotype imputation against the Haplotype Reference Consortium using the Michigan Imputation Server30, 586 individuals and 7,643,653 SNPs remained. The full set of almost 8 million SNPs was screened using Cross Leverage Scores (CLS). This procedure detects the SNPs most influential on a health outcome through a computationally efficient sketching approach based on QR-decomposition31. Specifically, the CLS of the ith SNP is the dot product of the rows Q_i· and Q_p*·, where the matrix Q is an orthonormal basis for the column space of the combined matrix [X y]^T. The advantage of CLS over multiple testing of p-values from simple (generalized) linear models is that it uses all SNPs simultaneously and thereby inherently considers interaction effects when computing the importance of each SNP. In this study, we selected the 300 SNPs with the highest absolute CLS with respect to the allergy outcome. We also took the first 10 principal components of the genotype matrix into account to control for population structure during CLS computation. The top 300 SNPs were then pruned at a linkage disequilibrium threshold of 0.20 using the SNPRelate32 R package. SNPs included in the GRS were annotated with their corresponding or nearest gene using Ensembl33 via the biomaRt Bioconductor package in R.
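The CLS definition above can be illustrated with a minimal NumPy sketch. It assumes that the outcome occupies the last row of the stacked matrix (our reading of the index p*), uses a small simulated genotype matrix, and omits the adjustment for principal components; the function name and toy data are hypothetical and do not reproduce the authors' implementation.

```python
import numpy as np

def cross_leverage_scores(X, y):
    """Cross leverage score of each column (SNP) of X with the outcome y.

    X: (n, p) genotype matrix; y: (n,) outcome vector.
    Assumption: the outcome is placed in the last row of [X y]^T.
    """
    # Stack the p SNP rows and the outcome row: A has shape (p + 1, n).
    A = np.vstack([X.T, y])
    # Reduced QR: the columns of Q are an orthonormal basis of the column space of A.
    Q, _ = np.linalg.qr(A, mode="reduced")
    # Row i of Q belongs to SNP i, the last row to the outcome.
    return Q[:-1] @ Q[-1]

# Toy data (simulated, not from the cohort): 8 participants, 20 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n, p = 8, 20
X = rng.integers(0, 3, size=(n, p)).astype(float)
y = rng.integers(0, 2, size=n).astype(float)

cls = cross_leverage_scores(X, y)
top = np.argsort(-np.abs(cls))[:5]  # indices of the 5 SNPs with the largest |CLS|
```

Ranking by absolute value mirrors the selection of the SNPs with the highest absolute CLS; for millions of SNPs a sketching step would be needed before the decomposition, which this example does not attempt.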

Exposure assessment

To build the air pollution risk score, air pollutants (e.g., PM2.5, PM10, and NO2) were estimated using the optimal interpolation method at a spatial resolution of 2 × 2 km². These estimates were provided by the German Environment Agency (Umweltbundesamt) and rescaled to a spatial resolution of 1 × 1 km² (ref. 34). For the meteorological risk score, ambient temperature and humidity data at a resolution of 6 × 6 km² were extracted from the Consortium for Small-scale Modeling – Regional Reanalysis 6 (COSMO-REA6)35. For both air pollution and meteorological variables, estimated levels were assigned to participants’ geocoded home addresses to minimize spatial exposure misclassification, and we used the averages over the 1 year and the 1 month prior to the examination date to reduce temporal misclassification. For the greenness risk score, we included the normalized difference vegetation index (NDVI) within buffers of 300 m, 500 m, and 1000 m around participants’ residences to capture local variation in greenspace exposure.
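As a rough illustration of the two averaging windows, the sketch below computes the mean of a daily series over the days preceding an examination date. The daily resolution, the window lengths of 365 and 30 days, the exclusion of the examination day itself, and all names and values are assumptions made for this example only.

```python
from datetime import date, timedelta

def window_mean(daily, exam_date, days):
    """Mean of daily values over the `days` days before exam_date (exam day excluded)."""
    values = [daily[exam_date - timedelta(days=k)]
              for k in range(1, days + 1)
              if exam_date - timedelta(days=k) in daily]
    if not values:
        raise ValueError("no observations in window")
    return sum(values) / len(values)

# Hypothetical daily NO2 series at one residential grid cell.
start = date(2007, 1, 1)
daily_no2 = {start + timedelta(days=d): 20.0 + (d % 7) for d in range(800)}

exam = date(2008, 6, 15)
yr_no2 = window_mean(daily_no2, exam, 365)   # average over the prior year
mth_no2 = window_mean(daily_no2, exam, 30)   # average over the prior month
```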

Bootstrapping and splitting into training-testing samples

Following a stratified bootstrap method, all analyses are repeated B times. Here we used B = 200 replications for computational efficiency, since within each bootstrap replicate the model fitting is repeated 10 times to reduce variation due to the folds, as described below. This number of iterations was also used before24 and appeared to yield stable results. Stratification means that the bootstrapped datasets are generated by randomly selecting participants from the original sample with replacement while maintaining the baseline proportion of cases and controls. The observations in each bootstrap replicate are then randomly split into training (60%) and test (40%) sets, in accordance with the recommendation from Dudbridge23. The training samples are used to learn the risk score weights, while the relative contributions of the risk scores are assessed in the test sample, as discussed in the following subsections.
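The resampling scheme can be sketched as follows, assuming that stratification is done by resampling cases and controls separately and that the 60/40 split is a simple random split of each replicate (whether the split itself is stratified is not stated). Function names are hypothetical; the case and control counts follow the cohort description.

```python
import numpy as np

def stratified_bootstrap(y, rng):
    """Resample indices with replacement within controls and cases separately."""
    idx = np.arange(len(y))
    parts = [rng.choice(idx[y == label], size=int(np.sum(y == label)), replace=True)
             for label in (0, 1)]
    return np.concatenate(parts)

def train_test_split_indices(n, rng, train_frac=0.6):
    """Randomly split positions 0..n-1 into training and test positions."""
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    return perm[:n_train], perm[n_train:]

rng = np.random.default_rng(1)
y = np.array([1] * 94 + [0] * 356)  # 94 cases and 356 controls (450 participants)
B = 200

splits = []
for b in range(B):
    boot = stratified_bootstrap(y, rng)           # indices into the original sample
    train_pos, test_pos = train_test_split_indices(len(boot), rng)
    splits.append((boot[train_pos], boot[test_pos]))
```

Each element of `splits` holds the original-sample indices for the training and test sets of one replicate; the number of cases per replicate stays fixed at 94 by construction.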

Training for risk score weights

To determine the weights of the variables within their respective risk scores, a cross-validated logistic ridge regression is fitted using the glmnet36 R package. The model solves the optimization problem with an L2 penalty term and thereby addresses potential multicollinearity among the variables. The parameter λ, which determines the strength of the ridge penalty, is tuned through 10-fold cross-validation. Within the folds, the ratio of cases to controls is maintained. Following the principle employed by Wigmann24, the estimation of the coefficients or weights is repeated ten times and averaged to reduce the randomness due to the folds. Then, for every risk score, the respective subset of weights is normalized so that the resulting weights sum to one.
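For reference, the penalized objective of logistic ridge regression can be written as below. This follows the glmnet parameterization with the elastic-net mixing parameter set to zero; the symbols N (number of training observations), x_i (predictor vector of observation i), and β (coefficient vector) are notation introduced here for illustration and do not appear in the original text.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
 - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
\;+\; \frac{\lambda}{2}\,\lVert \beta \rVert_2^{2}
```

The intercept β0 is not penalized, and larger values of λ shrink the coefficients of correlated predictors toward each other and toward zero.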

Relative contributions of risk scores in the test sample

The binary allergic outcome y is modeled by a logistic regression on the risk scores (RS) and single predictors, i.e., log(π/(1 − π)) = β0 + β1 Age + β2 BMI + β3 SES + β4 Smoking RS + β5 Air Pollution RS + β6 Meteorological RS + β7 Greenness RS + β8 GRS, where π is the probability of an allergy diagnosis (y = 1). Dominance analysis of this model is performed using the R package dominanceanalysis37. Dominance analysis compares the R-squared of all possible subset models of the full model and thereby decomposes the overall measure into the relative contributions of each predictor.
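The decomposition can be sketched in a few lines. The example below computes general dominance: the incremental fit of a predictor is averaged over all subset models of the same size and then across sizes. The fit statistic is passed in as a callable, and the toy values are hypothetical numbers rather than results from the study; in the setting described here the statistic would be a pseudo R-squared from a logistic model fitted to each subset.

```python
from itertools import combinations

def general_dominance(predictors, fit_stat):
    """Average incremental fit of each predictor over all subset models.

    predictors: list of predictor names.
    fit_stat: callable mapping a frozenset of names to a fit statistic
              (must be defined for every subset, including the empty set).
    """
    p = len(predictors)
    contributions = {}
    for target in predictors:
        others = [v for v in predictors if v != target]
        level_means = []
        for k in range(p):  # size of the subset the target is added to
            gains = [fit_stat(frozenset(s) | {target}) - fit_stat(frozenset(s))
                     for s in combinations(others, k)]
            level_means.append(sum(gains) / len(gains))
        contributions[target] = sum(level_means) / p
    return contributions

# Toy fit statistics for three predictors (hypothetical values).
toy_r2 = {
    frozenset(): 0.000,
    frozenset({"GRS"}): 0.040,
    frozenset({"AirRS"}): 0.010,
    frozenset({"MetRS"}): 0.012,
    frozenset({"GRS", "AirRS"}): 0.048,
    frozenset({"GRS", "MetRS"}): 0.051,
    frozenset({"AirRS", "MetRS"}): 0.018,
    frozenset({"GRS", "AirRS", "MetRS"}): 0.057,
}
contrib = general_dominance(["GRS", "AirRS", "MetRS"], toy_r2.__getitem__)
```

A useful property of this decomposition is that the contributions add up to the fit statistic of the full model, so they can be read as shares of the total. With eight predictors, as in the model above, 2^8 = 256 subset models have to be evaluated.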

In the logistic regression context, we use McFadden’s Pseudo R-squared, also called the deviance R², rather than other pseudo R-squared statistics, since it reflects the “variance accounted for” by the logistic regression model. Mathematically, McFadden’s measure is given by R² = 1 − L1/L0, where L1 is the log-likelihood of the full model and L0 is the log-likelihood of the intercept-only model. In addition, it was previously shown that the measure is relatively independent of the base rate of the binary outcome variable compared to other Pseudo R-squared indices38. This may be crucial when the analysis is applied to cohorts with unbalanced outcome data such as allergies.
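A small numerical sketch of this definition is given below. It takes predicted probabilities as input rather than fitting a model, and the outcome vector and probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under predicted probabilities p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mcfadden_r2(y, p_model):
    """McFadden's pseudo R-squared: 1 - L1 / L0."""
    y = np.asarray(y, dtype=float)
    l1 = log_likelihood(y, np.asarray(p_model, dtype=float))
    # The intercept-only model predicts the sample proportion of cases for everyone.
    l0 = log_likelihood(y, np.full_like(y, y.mean()))
    return 1.0 - l1 / l0

y = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.7, 0.2, 0.1, 0.6, 0.3, 0.2, 0.1, 0.8, 0.4, 0.2])  # hypothetical fitted probabilities
r2 = mcfadden_r2(y, p_hat)
```

A model that predicts only the base rate yields a value of zero, which is why the statistic can be read as the relative improvement in log-likelihood over the intercept-only model.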

From 200 bootstrap replications, median relative contributions of the predictors and their corresponding 95% confidence intervals (C.I.) are estimated. Similarly, the median regression coefficients and confidence intervals are estimated. See Supplementary Fig. S1 for the analytic workflow of the study.
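The summary across replicates can be sketched as follows, assuming that the 95% confidence interval is a percentile interval (the 2.5th and 97.5th percentiles of the bootstrap distribution); the text does not state which interval type was used, and the simulated values are hypothetical.

```python
import numpy as np

def bootstrap_summary(values, level=0.95):
    """Median and percentile interval of a statistic across bootstrap replicates."""
    values = np.asarray(values, dtype=float)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(values, [alpha, 1.0 - alpha])
    return float(np.median(values)), float(lower), float(upper)

rng = np.random.default_rng(2)
replicate_contributions = rng.gamma(shape=2.0, scale=0.02, size=200)  # hypothetical values
median, lo, hi = bootstrap_summary(replicate_contributions)
```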

Sensitivity analyses

Because previous studies showed that the relative contribution of the GRS tends to be much higher than that of the remaining risk scores, a sensitivity analysis was conducted without the GRS. In addition, we performed a sensitivity analysis with a broader definition of allergy, in which participants who had a total IgE level > 100 IU/L, but not necessarily a diagnosed allergy, were also considered cases.

Results

Descriptive analysis

Among the 450 participants with available genetic data, 94 (20.89%) had been diagnosed with an allergic disease. Among those with an allergic disease, the most common allergy was allergic asthma (48%), followed by allergic rhinitis/hay fever (45%) and atopic dermatitis/eczema (22%). A total of 13 participants had more than one manifestation: one woman with eczema, rhinitis, and asthma; seven with eczema and rhinitis; three with eczema and asthma; and two with asthma and rhinitis.

Table 1 shows the descriptive summary of the participants, as well as a cross-tabulation by their allergic status. The women had a mean age of 73.9 (SD = 2.8) years at the second follow-up examination and a mean BMI of 27 kg/m² (mildly overweight) in both the case and control groups. Those without allergies (83%) had a higher percentage of high socioeconomic status (SES) than those with allergies (47%). Women with allergies had a higher number of cigarette pack-years than women without allergies.

Table 1 Cross-tabulation of predictor variables among participants with and without allergies.

Air pollution variables were generally higher in the residential areas of those with allergies. For meteorological variables, there is a contrasting trend. For the included SNPs, a pattern in the mean number of variant alleles is expected, since the relevant SNPs were already selected using cross leverage scores (CLS).

Genetic risk score (GRS)

After pruning the 300 SNPs with the highest CLS, 50 SNPs remained and were included in the constructed GRS. Supplementary Table S1 shows that rs10759210 has the highest absolute CLS.

To investigate the role of the SNPs in the constructed GRS, the median normalized weights of these SNPs, obtained from logistic ridge regression across the 200 bootstraps, are shown in Supplementary Fig. S2 and Table S1. The weights of the SNPs ranged from −0.023 to 0.055, and multiple SNPs are used to explain or predict the allergic disease status of the participants. rs2780980 had the highest positive normalized weight, while rs6065705 had the most negative normalized weight.

Among the genes annotated to the SNPs in the GRS, the RAR-related orphan receptor A (RORA) gene had the most extensive literature on associations with allergies in humans and mouse models. The expressed gene restrains allergic skin inflammation and influences immunologic features of asthma39,40.

Exposomal risk scores (ERSs)

The pollutants in the Air Pollution RS were mostly positively weighted, with the monthly mean exposure to NO2 having a median normalized weight of 0.29 (see Fig. 1). In contrast, the factors in the Smoking RS did not show clear directions, except for pack-years and active smoking, which were negatively weighted on average. For the Meteorological RS, the pattern is consistent with the descriptive analysis: monthly mean relative humidity and annual temperature are weighted positively, while monthly mean temperature and annual relative humidity are weighted inversely. Meanwhile, the weights of the variables in the Greenness RS, such as NDVI within the 300 m and 500 m buffers, lean positive.

Determining the relative contribution in the testing sample

Building several risk scores effectively reduced the correlation between predictors, since the correlated variables are contained within the risk scores (see Supplementary Fig. S3). Only the Air Pollution RS and the Meteorological RS showed a low positive correlation.

Based on the fitted model in the test sample, all the risk scores and the single predictors together “explain” 11.13% (3.05, 21.42) of the variance (see Fig. 2). In the technical sense of McFadden’s Pseudo R-squared, this percentage is the improvement in log-likelihood of the fitted model relative to an intercept-only model. The GRS has the highest contribution, with a median of 3.80% (0.12, 11.40). This indicates that adding the genetic risk score provided the largest gain in model fit, suggesting it may be the most useful single marker for improving risk stratification. Second, the Meteorological RS has a relative contribution of 1.13% (0.05, 6.32), followed by the Air Pollution RS with a relative contribution of 0.73% (0.03, 5.07). The Greenness RS has a relative contribution of 0.58% (0.02, 4.58), while the Smoking RS has a median relative contribution of 0.39% (0.01, 3.12). The single predictors BMI, Age, and SES have the lowest relative contributions, with medians of 0.35% (0.01, 3.78), 0.32% (0.03, 2.73), and 0.03% (0.01, 2.50), respectively. Across the 200 bootstrap replicates, all models with risk scores in the test split converged.

The GRS is significantly associated with the diagnosis of allergies [OR = 2.06 (1.15, 4.21)] in the fitted model (see Fig. 3). In contrast, the ERSs and single predictors are not significantly associated with allergies based on the bootstrap confidence intervals. In terms of median estimates, the Meteorological RS [OR = 1.50 (0.71, 4.43)], Air Pollution RS [OR = 1.19 (0.51, 2.99)], and Greenness RS [OR = 1.26 (0.58, 2.44)] have odds ratios greater than 1, which implies that in most models across the 200 bootstraps they contribute to the risk of having allergies. BMI [OR = 0.88 (0.53, 1.33)] has an odds ratio below 1 and is thus inversely related to the diagnosis of allergies. The remaining predictors have odds ratios very close to 1, i.e., Smoking RS [OR = 1.00 (0.57, 1.72)], SES [OR = 0.98 (0.69, 1.60)], and Age [OR = 0.97 (0.62, 1.59)].

Sensitivity analysis

After removing the GRS from the analysis, the relative contributions of the ERSs remained small, as shown in Supplementary Fig. S4. The model containing only the ERSs and the individual factors as single predictors explains 6.50% (1.19, 14.08) of the variance. There was little change in the median relative contribution estimates of the risk scores. In terms of ranking, only the low-contributing single predictors Age and SES switched places. In this sensitivity analysis, still none of the ERSs and single predictors was significantly associated with allergies (see Supplementary Fig. S5).

Using a broader definition of allergic cases that included participants with high IgE levels, the proportion of cases increased from 20.89% to 27.11%. For this outcome, the overall percentage explained by the fitted model decreased slightly to 10.97% (3.60, 22.20). The relative contribution of the GRS changed little, with a median of 3.84% (0.28, 11.81). As in the main model, only the GRS has a significant association with allergies. Supplementary Table S3 shows the relative contributions and odds ratios of the risk scores and predictors in this sensitivity analysis.

Discussion

Overall, the proposed approach demonstrates how the contributions of genetic factors and several domains of exposures to allergy can be quantified in terms of the percentage of explained variance. In this SALIA cohort, the model containing the GRS and ERSs showed a modest improvement in model fit for explaining allergic diagnosis in elderly German women.

The GRS, built from a CLS-screened set of SNPs, attained the highest relative contribution among all risk scores. Several genome-wide association studies already support a genetic influence on allergies, as SNPs from different loci relevant to the epithelial barrier, innate-adaptive immunity, IL-1 family signaling, regulatory T cells, and the vitamin D pathway have been identified41,42.

Even after excluding the GRS in a sensitivity analysis, the relative contributions of the ERSs remained low. Since the outcome used in the cohort is defined as prevalence (ever having been diagnosed with an allergy), the impact of the meteorological and air pollution risk scores is likely to have been understated. The included exposure variables were derived only for the most recent year and month before data collection, which may not capture the biologically relevant exposure windows. Several studies support that the atopic march often develops in early childhood, which indicates that susceptibility to exposures is highest during this period3,43,44. However, recent studies show that allergies can develop at any age, which means that for the elderly a more long-term or longer-lagged exposure to air pollution and climatic factors may be relevant45,46. Notably, other confounders such as diet, medication use, and exposure to house dust mite allergens were not considered in the model, which may lead to an underestimation of the overall explained variance of allergies.

Gene-environment studies on allergies are scarcer in adults than in children and adolescents. In both populations, we were not able to find any study on allergies that compared the relative contributions of genotype data and environmental exposures. Existing studies usually examine genetic risk through a single variant or a GRS along with an environmental variable. For example, a variant in the PID1 gene has been associated with asthma related to exposure to irritants47. Likewise, SNPs in the IL1RN gene have been linked to asthma risk in settings with tobacco smoke exposure during childhood48. Among infants, prenatal maternal exposure to indoor PM10 together with a GRS for asthma was found to reduce lung function49. For childhood asthma incidence, the main effect of a GRS for asthma was significant in one study, but the main effects of air pollutants were not50. In another study, the main effects of traffic-related NO2 and PM2.5 were significantly associated with asthma among children and were linked to higher genetic susceptibility in the GSTP1 gene51. Meanwhile, exposure to environmental tobacco smoke in early life was associated with a higher risk of early-onset asthma when considering variants in the 17q21 locus52. Consistent with these studies, our results showed a significant association between the constructed GRS and allergies. In contrast, our risk scores for air pollutants and smoking factors did not have a significant effect on allergies. Clearly, there is still an evidence gap in exposomic research on modeling allergic conditions.

While our approach successfully integrated CLS-based screening of SNPs with dominance analysis after risk score construction, it has some limitations. First, it relies on the calculation of relative contributions via McFadden’s Pseudo R-squared. While this was chosen as an analog of the R-squared in linear regression, some may argue for alternative metrics for binary outcomes, such as Tjur’s R-squared and Nagelkerke’s R-squared.

Second, splitting the available data set into training and test samples reduces statistical power, and it requires repeated execution of the fitting process to reduce randomness. In addition, with a binary outcome, this may lead to some draws with highly unbalanced data, which at worst may hinder the convergence of the model. Although there were no convergence errors in this analysis, the unbalanced data may explain why very wide confidence intervals were obtained across the bootstrap replicates.

Third, the estimated regression coefficients can be highly inflated when a risk score combines very discriminating variables that are further multiplied by their weights, as was the case for the GRS in our application. This can be further aggravated by unbalanced case-control data. Future directions to mitigate this include subgrouping SNPs by their functional annotations to further decompose how the GRS explains allergic outcomes while maintaining interpretability.

Lastly, combining several covariates into a weighted RS leads to some loss of information. Environmental variables commonly have a non-linear relationship with the outcome, and reducing them to a weighted sum in the risk score may have diluted their actual relative importance. Hence, constructing an environmental risk score that contains non-linear functions of exposures, such as splines, remains to be explored.

Nonetheless, the dominance analysis approach can easily be adapted to other health outcomes. The proposed approach is flexible: although we used a binary classification (allergy present or not present) in this example, it can also be applied to continuous measures, such as spirometric outcomes, or to other binary outcomes, such as a diabetes diagnosis.

Ranking the contributions of ERSs along with the GRS may support health risk assessment and inform policy-making by identifying which domains of exposures should be prioritized in setting limit values. For example, in our study, knowing that the GRS is the most influential factor for allergies among elderly women can enable targeted prevention and monitoring strategies. Depending on which exposome domains have high contributions, public health policies may be developed and the sources of exposure relevant to a particular disease can be regulated. Aside from smoking regulations, ambient air quality standards can be strengthened and enforced by reducing industrial and combustion emissions near vulnerable communities. Similarly, urban planning and the creation of more green spaces can be prioritized.

Conclusion

In summary, we observed that genetics had the highest influence on allergic diagnosis, while the exposomal factors had lower relative contributions to allergies in this particular cohort. Methodologically, we successfully applied dominance analysis based on McFadden’s Pseudo R-squared to exposomal and genetic risk scores, which allowed us to rank the various domains of cumulative risk factors. In addition, we used CLS as a screening strategy for SNPs. Overall, we were able to integrate CLS and dominance analysis, with our results serving as a proof of concept. This approach is potentially useful for health risk assessment, for regulating specific sources of exposure, and for priority setting in public health policy-making. Future directions include applications to longitudinal studies and methodological improvements to capture possible non-linear relationships with exposomes.

Fig. 1

Normalized weights of variables in ERSs. The violin plot shows the distribution of the normalized weights of the variables in the ERSs, obtained through logistic ridge regression in the training split across the 200 bootstrap replications. 2ndsmoke: Secondhand smoking, exsmoker: Formerly smoking, packyr: cigarette pack-years, smoker: Currently smoking. mth-variable: mean of the variable over the month prior to outcome evaluation, yr-variable: mean of the variable over the year prior to outcome evaluation.

Fig. 2

Relative contribution of GRS, ERSs, and single predictors to allergic diagnosis. The bar plot shows the median contribution of each variable to the increase in McFadden’s R² for explaining allergic diagnosis. The error bars indicate the 95% bootstrapped confidence interval based on 200 replications in the test split of the analysis.

Fig. 3

Estimated odds ratios (OR) of GRS, ERSs, and single predictors for allergic diagnosis in the test split across 200 bootstraps. The forest plot shows the median OR of each variable for explaining allergic diagnosis. The error bars indicate the 95% bootstrapped confidence interval based on 200 replications in the test split of the analysis. The dotted line represents the null value of OR = 1 (i.e., no association).