Extended Data Fig. 3: Performance of pgBoost trained on eQTL vs. CRISPR data.

Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs vs. CRISPR-validated links for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. While other regulatory linking methods have used CRISPR links for model training, pgBoost may perform better when trained on eQTL data due to the limited amount of CRISPR data, pervasive aneuploidy and structural variants in the K562 cell line, and/or better representation of multiple cell types in GTEx eQTL data. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote two-sided bootstrap p-values for the difference between the top method and each other method (*: p < 0.05, **: p < 0.01, ***: p < 0.001; Methods). For pgBoostCRISPR, we defined a training set comprising 892 SNP-gene links validated by CRISPR as positives and 10,345 links tested but not validated by CRISPR as negatives (Supplementary Table 7).
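
For readers who want to reproduce this style of comparison, the following is a minimal sketch of how an average enrichment, its bootstrap standard error, and a two-sided bootstrap p-value for the difference between two methods might be computed. It is illustrative only and is not the procedure specified in Methods. The function names (`avg_enrichment`, `bootstrap_difference`), the recall grid (0.25, 0.5, 0.75), the definition of enrichment as precision divided by the fraction of positives among all candidate links, and the choice to resample links with replacement are all assumptions made for this example.

```python
import numpy as np


def avg_enrichment(scores, labels, recalls=(0.25, 0.5, 0.75)):
    """Mean enrichment over a grid of recall values.

    Assumed definition: enrichment = precision / base rate of positives,
    evaluated at the first rank where each target recall is reached.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    if n_pos == 0:
        return float("nan")
    order = np.argsort(-scores, kind="stable")   # highest score first
    hits = np.cumsum(labels[order])              # positives among the top-k links
    ranks = np.arange(1, len(labels) + 1)
    precision = hits / ranks
    recall = hits / n_pos                        # non-decreasing, ends at 1.0
    base_rate = n_pos / len(labels)
    idx = [int(np.searchsorted(recall, r, side="left")) for r in recalls]
    return float(np.mean(precision[idx] / base_rate))


def bootstrap_difference(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Observed difference (A minus B), bootstrap SE, and two-sided p-value.

    Links are resampled with replacement (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = len(labels)
    observed = avg_enrichment(scores_a, labels) - avg_enrichment(scores_b, labels)
    diffs = []
    for _ in range(n_boot):
        i = rng.integers(0, n, size=n)
        d = avg_enrichment(scores_a[i], labels[i]) - avg_enrichment(scores_b[i], labels[i])
        if not np.isnan(d):                      # skip resamples with no positives
            diffs.append(d)
    diffs = np.array(diffs)
    se = float(diffs.std(ddof=1))
    # Two-sided p-value: twice the smaller tail of the bootstrap differences around zero.
    p = min(1.0, 2 * min((diffs <= 0).mean(), (diffs >= 0).mean()))
    return observed, se, float(p)


# Toy data: 200 candidate links, 20 positives; method A ranks all positives first,
# method B assigns random scores.
labels = np.zeros(200, dtype=bool)
labels[:20] = True
scores_a = labels.astype(float)
scores_b = np.random.default_rng(1).random(200)
observed, se, p = bootstrap_difference(scores_a, scores_b, labels)
print(f"difference = {observed:.2f}, SE = {se:.2f}, p = {p:.3f}")
```

In this toy setup the p-value would be mapped to stars using the thresholds given in the legend. The exact recall grid, resampling unit, and p-value construction used for the figure are those described in Methods.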