Fig. 3: Estimating population-genetics parameters for hundreds of diseases and thousands of disease pairs. | Nature Communications

From: Estimating heritability and genetic correlations from large health datasets in the absence of genetic data

Here, h² denotes heritability, and corr denotes a correlation between a pair of diseases, which can be genetic, environmental, or phenotypic.
a Workflow showing the key steps of our model development. We used three national-scale health registries, representing the United States, Denmark, and Sweden, which comprised 3.8 billion, 154 million, and 95 million disease diagnoses, respectively. We computed curves reflecting disease prevalence by age and sex (disease prevalence curves) and derived a metric mapping (disease embedding in metric space) for the whole disease spectrum. We used these two complementary representations to estimate hundreds of thousands of disease-specific parameters. We then validated the accuracy of our model’s predictions by benchmarking them against previously published (“actual”) estimates that were not used in model training.
b, c Kernel density estimation plots computed from 1000 random 4:1 splits of the data (4/5 for training and 1/5 for testing). These plots visualize the joint distribution of the actual values in the test data and the model-predicted values. The slopes of the linear fits between the actual and predicted values are 0.996 for h² and 0.993 for corr, indicating nearly unbiased estimates.
d The distributions of Pearson’s correlations between the actual and predicted values have means of 0.870 for h² and 0.874 for corr.
e Distribution of the mean age of patients carrying each disease-specific diagnosis. The median of the mean ages over all diseases is around 42 years; specifically, the mean ages for autism, bipolar disorder, and schizophrenia in the US data are 9, 40, and 41 years, respectively.
f There is a significant positive correlation between disease onset age and diagnosis count in the US data, suggesting that rare, late-onset diseases are less common than expected.
g The relationship also holds for each of the five disease clusters. For individual clusters (c1–c5), we show the best linear approximation, regression coefficients (p values were computed using Student’s t test), and Spearman’s correlation ρ (p values were computed using algorithm AS 89), color-coded by shape cluster. Superscript asterisks indicate the significance level at which the estimates differ from 0.
