Extended Data Fig. 2: An overview of the UK Biobank dataset used in this study.

Our initial dataset consists of all European-ancestry individuals in UK Biobank (n = 435,766). Individuals with valid spirograms formed the modeling dataset (n = 325,027). The remaining European-ancestry individuals, who had invalid spirograms and were used in neither the ML modeling nor the GWASs, formed the PRS holdout set (n = 110,739). We split the modeling dataset into training and validation sets containing 80% and 20% of samples, respectively. The modeling dataset was used to select model architectures, tune hyperparameters, and evaluate ML model performance across tasks, while a two-fold cross-fold dataset was used during the final model application process to generate phenotypes. Note that the combined sample size of train fold 1 and train fold 2 does not equal the size of the whole training dataset, because we removed genetically close samples that fell across the two folds. Folds were constructed to keep genetically related individuals together, preventing the same individual or a close relative from being used for both training and prediction.
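The relatedness filter described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the function name `drop_cross_fold_relatives`, the toy sample IDs, the list of related pairs, and the choice to drop both members of a cross-fold pair are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def drop_cross_fold_relatives(
    fold_of: Dict[str, int],
    related_pairs: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Remove samples that have a close relative in a different fold.

    Hypothetical helper: `fold_of` maps sample ID to fold index, and
    `related_pairs` lists pairs of genetically close samples.
    Assumption: both members of a cross-fold pair are dropped.
    """
    to_drop = set()
    for a, b in related_pairs:
        # Only pairs whose members sit in different folds are a problem.
        if a in fold_of and b in fold_of and fold_of[a] != fold_of[b]:
            to_drop.update((a, b))
    return {s: f for s, f in fold_of.items() if s not in to_drop}


# Toy data (invented): five samples in two folds, two related pairs.
fold_of = {"s1": 1, "s2": 1, "s3": 2, "s4": 2, "s5": 1}
related_pairs = [("s1", "s2"), ("s2", "s3")]

kept = drop_cross_fold_relatives(fold_of, related_pairs)
print(len(fold_of), len(kept))
```

In this toy case the pair (s1, s2) lies within one fold and is left alone, whereas (s2, s3) crosses folds and is removed, so the combined fold size after filtering is smaller than the number of samples originally assigned to folds.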