Fig. 2: Cryptic phenotype inference in the UCSF and UKBB datasets.

a Distribution of HHT cryptic phenotype severity among subjects in the UCSF testing dataset, stratified by HHT diagnostic status (green: controls; purple: HHT cases). a (inset) Precision-recall curve for the prediction of HHT diagnoses using the cryptic phenotype. The approximate performance of a random classifier is shown in red. b Same information as (a) for the UKBB dataset, generated using an independently inferred phenotype model. c Same information as (a), except that the UKBB phenotype model is used to generate the cryptic phenotypes in the UCSF dataset. d, e Increase in cryptic phenotype severity among diagnosed cases, displayed jointly for the UCSF and UKBB models/datasets (N = 13 diseases, see main text and Supplementary Data 4). Panel d compares the results of the UCSF model (applied to the UCSF dataset; x-axis) with those of the UKBB model (applied to the UKBB dataset; y-axis). Panel e instead compares the results of the UKBB model applied to the UCSF (x-axis) and UKBB (y-axis) datasets. Error bars in panels d and e represent 95% confidence intervals for the severity statistics (estimated using bootstrapped re-sampling, N = 105). f Distribution of the coefficients of determination (r²) between the cryptic phenotypes inferred by the UCSF and UKBB models, estimated using the UCSF dataset and shown for the 38 diseases where model fitting was successful in both datasets.
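The bootstrapped confidence intervals in panels d and e can be illustrated with a minimal sketch of a percentile bootstrap. This is not the authors' code: the function name `bootstrap_severity_ci` is hypothetical, the severity statistic is assumed here to be the difference in mean cryptic phenotype between diagnosed cases and controls (the caption does not define it), and the number of resamples is reduced for speed.

```python
import random
import statistics


def bootstrap_severity_ci(cases, controls, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(cases) - mean(controls).

    Assumption: the 'severity statistic' is a difference in means;
    the caption does not specify the exact statistic used.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        boot_cases = rng.choices(cases, k=len(cases))
        boot_controls = rng.choices(controls, k=len(controls))
        diffs.append(statistics.fmean(boot_cases) - statistics.fmean(boot_controls))
    diffs.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the resampled statistic.
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (illustrative only): cases have higher severity than controls.
toy_rng = random.Random(1)
toy_cases = [toy_rng.uniform(2.0, 3.0) for _ in range(50)]
toy_controls = [toy_rng.uniform(0.0, 1.0) for _ in range(500)]
ci_lower, ci_upper = bootstrap_severity_ci(toy_cases, toy_controls)
print(f"95% CI for severity increase: ({ci_lower:.3f}, {ci_upper:.3f})")
```

In this toy example every resampled difference lies between 1 and 3 by construction, so the interval excludes zero; with real data, an interval that overlaps zero would indicate no clear increase in severity among cases.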