Abstract
Methods that analyze single-cell paired RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin using sequencing (ATAC-seq) multiome data have shown promise in linking regulatory elements to genes. However, existing methods exhibit low concordance and do not capture the effects of genomic distance. We propose pgBoost, an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on expression quantitative trait locus (eQTL) data to assign a probabilistic score to each candidate single-nucleotide polymorphism–gene link. pgBoost attained higher enrichment than existing methods for evaluation sets derived from eQTL, activity-by-contact, CRISPR and genome-wide association study (GWAS) data. We further determined that restricting pgBoost to features from a focal cell type improved power to identify links relevant to that cell type. We highlight several examples in which pgBoost linked fine-mapped GWAS variants to experimentally validated or biologically plausible target genes that were not implicated by other methods. In conclusion, a non-linear combination of linking strategies improves power to identify target genes underlying GWAS associations.
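As a rough formalization of the non-linear combination described above, a boosted-tree classifier can be written in the following generic form. This is a sketch under stated assumptions, not the exact objective or hyperparameters used by pgBoost: the logistic link, the cross-entropy loss, and the symbols x, y, f_t, T and p are introduced here for illustration only.

```latex
% x_i : feature vector for candidate SNP-gene link i
%       (linking scores from existing methods plus genomic distance)
% y_i : 1 if link i is an eQTL-derived positive, 0 otherwise (assumed labeling)
% f_t : the t-th regression tree; T : number of trees (assumed notation)
F(x) = \sum_{t=1}^{T} f_t(x), \qquad
p(x) = \frac{1}{1 + e^{-F(x)}}

% Assumed training loss: binary cross-entropy over labeled links
\mathcal{L} = -\sum_{i} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big]
```

Because each tree splits on individual features, F(x) can capture interactions (for example, between a linking score and distance) that a weighted sum of the same features cannot.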
Data availability
All single-cell multiome datasets analyzed are publicly available. The 10x PBMC (ref. 36) dataset is available at https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0. The Luecken BMMC (ref. 37), SHARE-seq LCL (ref. 12) and Meijer brain (ref. 42) datasets are available at Gene Expression Omnibus (accession codes GSE194122, GSE140203 and GSE193240, respectively). The Xu K562 (ref. 38) dataset is available at https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11264. Linking scores and percentiles for pgBoost and existing methods have been made publicly available at https://doi.org/10.5281/zenodo.11211925. Fine-mapped GTEx eQTL data (ref. 24) are available at https://www.finucanelab.org/data. GTEx eQTL summary statistics are available at https://www.gtexportal.org/home/downloads/adult-gtex/qtl. ABC scores (ref. 9) are available on the ENCODE portal (https://www.encodeproject.org/). Biosample IDs and file accessions are listed in Supplementary Table 5. The CRISPR dataset (refs. 25–33) is available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/crispr_data/EPCrisprBenchmark_ensemble_data_GRCh38.tsv.gz. GWAS-derived SNP–gene links (ref. 34) are available at https://doi.org/10.5281/zenodo.11211925. PCHi-C enhancer–promoter links are available as a supplemental data archive of a previous publication (ref. 66). eQTLGen summary statistics (ref. 67) are available at https://eqtlgen.org/cis-eqtls.html. ENCODE-rE2G predictions (ref. 9) are available on the ENCODE portal (https://www.encodeproject.org). Biosample IDs and file accessions are listed in Supplementary Table 16. GWAS fine-mapping results (ref. 35) for 94 UK Biobank (ref. 41) traits are available at https://www.finucanelab.org/data. Coordinates for the set of genes analyzed (ref. 9) are available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/genome_annotations/CollapsedGeneBounds.hg38.bed.
Code availability
The code to implement pgBoost (ref. 68) has been made publicly available at https://github.com/elizabethdorans/pgBoost. Tutorials for existing methods (ref. 69) (SCENT, Signac, ArchR and Cicero) have been made publicly available at https://github.com/elizabethdorans/E2G_Method_Tutorials. Code implementing SCENT (ref. 15) is publicly available at https://github.com/immunogenomics/SCENT. Code implementing Signac (refs. 12,16) is publicly available at https://github.com/stuart-lab/signac. Code implementing ArchR (ref. 17) is publicly available at https://github.com/GreenleafLab/ArchR. Code implementing Cicero (ref. 18) is publicly available at https://github.com/cole-trapnell-lab/cicero-release. We used Seurat (v.4.3.0.1), SCENT (v.0.1.0), Signac (v.1.10.0), ArchR (v.1.0.2), Cicero (v.1.3.9) and XGBoost (v.1.7.5.1).
References
1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
2. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
3. Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).
4. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022).
5. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).
6. Abdellaoui, A., Yengo, L., Verweij, K. J. H. & Visscher, P. M. 15 years of GWAS discovery: realizing the promise. Am. J. Hum. Genet. 110, 179–194 (2023).
7. Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).
8. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).
9. Gschwind, A. R. et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. Preprint at bioRxiv https://doi.org/10.1101/2023.11.09.563812 (2023).
10. Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).
11. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
12. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
13. Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24, 494–515 (2023).
14. Cuomo, A. S. E., Nathan, A., Raychaudhuri, S., MacArthur, D. G. & Powell, J. E. Single-cell genomics meets human genetics. Nat. Rev. Genet. 24, 535–549 (2023).
15. Sakaue, S. et al. Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles. Nat. Genet. 56, 615–626 (2024).
16. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
17. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
18. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).
19. Kartha, V. K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genomics 2, 100166 (2022).
20. Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat. Methods 20, 1355–1367 (2023).
21. Mitra, S. et al. Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis. Nat. Genet. 56, 627–636 (2024).
22. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (eds Krishnapuram, B. et al.) (Association for Computing Machinery, 2016).
23. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
24. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).
25. Thakore, P. I. et al. Highly specific epigenome editing by CRISPR–Cas9 repressors for silencing of distal regulatory elements. Nat. Methods 12, 1143–1149 (2015).
26. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).
27. Ulirsch, J. C. et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).
28. Wakabayashi, A. et al. Insight into GATA1 transcriptional activity through interrogation of cis elements disrupted in human erythroid disorders. Proc. Natl Acad. Sci. USA 113, 4434–4439 (2016).
29. Huang, J. et al. Dissecting super-enhancer hierarchy based on chromatin interactions. Nat. Commun. 9, 943 (2018).
30. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
31. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).
32. Schraivogel, D. et al. Targeted Perturb-seq enables genome-scale genetic screens in single cells. Nat. Methods 17, 629–635 (2020).
33. Klann, T. S. et al. CRISPR–Cas9 epigenome editing enables high-throughput screening for functional regulatory elements in the human genome. Nat. Biotechnol. 35, 561–568 (2017).
34. Weeks, E. M. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat. Genet. 55, 1267–1276 (2023).
35. Kanai, M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv https://doi.org/10.1101/2021.09.03.21262975 (2021).
36. PBMC from a Healthy Donor (10k), Single Cell Multiome ATAC + Gene Expression Dataset by Cell Ranger 2.0.0 (10x Genomics, 2021); https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0
37. Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and protein data in single cells. In 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2020).
38. Xu, W. et al. ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat. Methods 19, 1243–1249 (2022).
39. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).
40. Zhou, B. et al. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res. 29, 472–484 (2019).
41. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
42. Meijer, M. et al. Epigenomic priming of immune genes implicates oligodendroglia in multiple sclerosis susceptibility. Neuron 110, 1193–1210.e13 (2022).
43. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 82, 1273–1300 (2020).
44. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
45. Dutka, M., Zimmer, K., Ćwiertnia, M., Ilczak, T. & Bobiński, R. The role of PCSK9 in heart failure and other cardiovascular diseases—mechanisms of action beyond its effect on LDL cholesterol. Heart Fail. Rev. 29, 917–937 (2024).
46. Bao, X. et al. Targeting proprotein convertase subtilisin/kexin type 9 (PCSK9): from bench to bedside. Signal Transduct. Target. Ther. 9, 13 (2024).
47. Simeonov, D. R. et al. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature 549, 111–115 (2017).
48. Burchill, M. A., Yang, J., Vang, K. B. & Farrar, M. A. Interleukin-2 receptor signaling in regulatory T cell development and homeostasis. Immunol. Lett. 114, 1–8 (2007).
49. Stahl, E. A. et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).
50. Martin, J. E. et al. The autoimmune disease-associated IL2RA locus is involved in the clinical manifestations of systemic sclerosis. Genes Immun. 13, 191–196 (2012).
51. Lowe, C. E. et al. Large-scale genetic fine mapping and genotype–phenotype associations implicate polymorphism in the IL2RA region in type 1 diabetes. Nat. Genet. 39, 1074–1082 (2007).
52. Avellino, R. et al. An autonomous CEBPA enhancer specific for myeloid-lineage priming and neutrophilic differentiation. Blood 127, 2991–3003 (2016).
53. Leroy, H. et al. CEBPA point mutations in hematological malignancies. Leukemia 19, 329–334 (2005).
54. Pabst, T. & Mueller, B. U. Complexity of CEBPA dysregulation in human acute myeloid leukemia. Clin. Cancer Res. 15, 5303–5307 (2009).
55. Du, Z., Luo, G., Wang, K., Bing, Z. & Pan, S. Identification of a novel heterozygous SPTB mutation by whole genome sequencing in a Chinese patient with hereditary spherocytosis and atrial septal defect: a case report. BMC Pediatr. 21, 291 (2021).
56. Li, S. et al. A novel SPTB mutation causes hereditary spherocytosis via loss-of-function of β-spectrin. Ann. Hematol. 101, 731–738 (2022).
57. Wang, Y. et al. A novel variant in the SPTB gene underlying hereditary spherocytosis and a literature review of previous variants. BMC Med. Genomics 17, 206 (2024).
58. Rees, H. A. & Liu, D. R. Base editing: precision chemistry on the genome and transcriptome of living cells. Nat. Rev. Genet. 19, 770–788 (2018).
59. Vernimmen, D. & Bickmore, W. A. The hierarchy of transcriptional activation: from enhancer to promoter. Trends Genet. 31, 696–708 (2015).
60. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
61. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
62. Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
63. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
64. Feng, J., Liu, T., Qin, B., Zhang, Y. & Liu, X. S. Identifying ChIP–seq enrichment using MACS. Nat. Protoc. 7, 1728–1740 (2012).
65. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
66. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell 167, 1369–1384.e19 (2016).
67. Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021).
68. Dorans, E. elizabethdorans/pgBoost: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15276307
69. Dorans, E. elizabethdorans/E2G_Method_Tutorials: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15284078
Acknowledgements
We thank J. Engreitz, R. Andersson, A. Gschwind and W. Qiu for valuable feedback on the manuscript and S. Raychaudhuri, S. Sunyaev, V. Sankaran, S. Sakaue, C. Boix, B. Strober, J. Rossen and M. Zhang for helpful discussions. This research was funded by NIH grants U01 HG012009, R56 HG013083, R01 MH101244, R37 MH107649, R01 HG006399 and R01 MH115676 (A.L.P.). K.D. is funded by NIH/NHGRI under grant R00HG012203, NIH/NCI Cancer Center Support Grant P30CA008748 and the Josie Robertson Investigator Program.
Author information
Contributions
K.D. and K.J. conceived this study. E.D. developed the pgBoost model and conducted all analyses. A.L.P., K.D. and K.J. provided critical scientific guidance. E.D. and A.L.P. wrote the initial draft of the manuscript. All authors contributed to the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks Tim Stuart and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Extended data
Extended Data Fig. 1 Performance of pgBoost vs. a logistic regression model.
Average enrichment across recall values of links predicted by pgBoost vs. a logistic regression model with identical features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). Gradient boosting significantly outperformed logistic regression on eQTL evaluation data and CRISPR evaluation data >50 kb. Gradient boosting did not significantly outperform logistic regression on GWAS evaluation data. While gradient boosting significantly outperformed logistic regression on all ABC evaluation data (>1 kb and >5 kb), gradient boosting significantly underperformed logistic regression on longer-range ABC evaluation data (>50 kb and >100 kb); this could be due to the gradient boosting model assigning higher weight to distance-based features than the logistic regression model (however, we do not observe this behavior in other evaluation sets).
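The evaluation metric named in these captions, average enrichment across recall values, can be illustrated with a minimal sketch. The exact definition is given in the paper's Methods, which are not part of this section, so the code below assumes that enrichment is the precision among predicted links divided by the fraction of positives among all candidate links. The function names and the default recall grid are hypothetical.

```python
from typing import Sequence


def enrichment_at_recall(
    scores: Sequence[float], labels: Sequence[int], recall: float
) -> float:
    """Enrichment of the top-ranked links needed to reach `recall`.

    Assumed definition (not taken from the source): precision among the
    predicted links divided by the fraction of positives among all
    candidate links.
    """
    n_pos = sum(labels)
    if n_pos == 0 or n_pos == len(labels):
        raise ValueError("need at least one positive and one negative link")
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    baseline = n_pos / len(labels)
    # Rank candidate links from highest to lowest linking score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for k, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= recall * n_pos:
            precision = found / k
            return precision / baseline
    raise AssertionError("unreachable: full recall is always reached")


def average_enrichment(
    scores: Sequence[float],
    labels: Sequence[int],
    recalls: Sequence[float] = (0.25, 0.5, 0.75, 1.0),  # hypothetical grid
) -> float:
    """Mean enrichment over a grid of recall values."""
    values = [enrichment_at_recall(scores, labels, r) for r in recalls]
    return sum(values) / len(values)


# Toy example: 8 candidate links, 2 of which are evaluation positives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
print(average_enrichment(toy_scores, toy_labels))
```

To evaluate at a distance threshold such as >50 kb, the same functions would be applied after restricting the candidate links to those beyond that distance.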
Extended Data Fig. 2 Performance of pgBoost vs. fine-mapped eQTL.
Average enrichment across recall values of links predicted by pgBoost vs. fine-mapped eQTL (ranking candidate SNP-gene links by maximum PIP across GTEx tissues) for (a) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).
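The stars in these figures summarize 2-sided bootstrap p-values for the difference between two methods. The following is a minimal sketch of one common way to compute such a p-value, assuming a paired bootstrap that resamples candidate links with replacement; the paper's exact procedure is described in its Methods and may differ. The helper `top_k_precision` is a hypothetical stand-in metric, and only the star thresholds are taken from the captions.

```python
import random
from typing import Callable, Sequence


def top_k_precision(scores: Sequence[float], labels: Sequence[int], k: int = 2) -> float:
    """Hypothetical stand-in metric: fraction of positives among the top k links."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def bootstrap_p_value(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    labels: Sequence[int],
    metric: Callable[[Sequence[float], Sequence[int]], float] = top_k_precision,
    n_boot: int = 1000,
    seed: int = 0,
) -> float:
    """2-sided paired-bootstrap p-value for metric(A) - metric(B).

    Assumption: links are resampled with replacement, and the p-value is
    twice the smaller tail fraction of the bootstrap differences.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_le = n_ge = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[i] for i in idx]
        diff = metric([scores_a[i] for i in idx], boot_labels) - metric(
            [scores_b[i] for i in idx], boot_labels
        )
        if diff <= 0:
            n_le += 1
        if diff >= 0:
            n_ge += 1
    return min(1.0, 2 * min(n_le, n_ge) / n_boot)


def stars(p: float) -> str:
    """Star notation used in the captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

Resampling the same links for both methods keeps the comparison paired, so variation in which links are drawn affects both methods equally.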
Extended Data Fig. 3 Performance of pgBoost trained on eQTL vs. CRISPR data.
Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs vs. CRISPR-validated links for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. While other regulatory linking methods have used CRISPR links for model training, pgBoost may perform better when trained on eQTL due to the limited amount of CRISPR data, pervasive aneuploidy and structural variants in the K562 cell line, and/or better representation of multiple cell types in GTEx eQTL data. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_CRISPR, we defined a training set comprising 892 SNP-gene links validated by CRISPR as positives and 10,345 links tested but not validated by CRISPR as negatives (Supplementary Table 7).
Extended Data Fig. 4 Performance of pgBoost trained on eQTL data from all tissues vs. whole blood.
Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs from all GTEx tissues vs. fine-mapped eSNP-eGene pairs from whole blood for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, and (e) 712 fine-mapped eSNP-eGene pairs attaining PIP > 0.5 in GTEx whole blood, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_Blood-eQTL, we defined a training set of 1,518 eSNP-eGene pairs attaining PIP > 0.2 in whole blood as positives and 6,997 eSNP-eGene pairs including a gene in the positive set and attaining PIP < 0.01 in whole blood as negatives.
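The whole-blood training set described in this caption follows a simple rule: positives are pairs with PIP > 0.2, and negatives are pairs with PIP < 0.01 whose gene also appears in the positive set. A minimal sketch of that rule is shown below. The function name, the dictionary layout, and the toy SNP and gene identifiers are hypothetical; only the two thresholds come from the caption.

```python
from typing import Dict, Set, Tuple

Link = Tuple[str, str]  # (SNP identifier, gene identifier)


def build_training_set(
    pip_by_link: Dict[Link, float],
    pos_threshold: float = 0.2,
    neg_threshold: float = 0.01,
) -> Tuple[Set[Link], Set[Link]]:
    """Split eSNP-eGene pairs into positives and negatives by PIP.

    Positives: PIP > pos_threshold.
    Negatives: PIP < neg_threshold and the gene is in the positive set.
    Pairs between the two thresholds are left out of training.
    """
    positives = {link for link, pip in pip_by_link.items() if pip > pos_threshold}
    positive_genes = {gene for _, gene in positives}
    negatives = {
        link
        for link, pip in pip_by_link.items()
        if pip < neg_threshold and link[1] in positive_genes
    }
    return positives, negatives


# Toy example with made-up identifiers.
toy_pips = {
    ("rs1", "GENE_A"): 0.90,   # positive
    ("rs2", "GENE_A"): 0.001,  # negative (GENE_A has a positive link)
    ("rs3", "GENE_B"): 0.005,  # excluded (GENE_B has no positive link)
    ("rs4", "GENE_A"): 0.10,   # excluded (between thresholds)
}
pos, neg = build_training_set(toy_pips)
```

Restricting negatives to genes that have at least one positive link means the model is trained to separate causal from non-causal variants for the same gene.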
Extended Data Fig. 5 Performance of pgBoost models restricted to a single existing method or distance.
Average enrichment across recall values of links predicted by pgBoost restricted to linking scores from a single existing method or genomic distance for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. Because enrichment differences are highly similar across distance thresholds, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for the difference of the main pgBoost model (denoted in black outline) vs. each other method (*: p < 0.05, **: p < 0.01, ***: p < 0.001) (Methods). Red stars indicate that the focal method significantly underperforms pgBoost.
Extended Data Fig. 6 Performance of pgBoost models adding or ablating one existing method.
Average enrichment across recall values of links predicted by pgBoost (denoted in black outline) vs. pgBoost models ablating features from one existing peak-gene linking method (SCENT, Signac, Cicero), ablating distance-based features, or adding ArchR features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. (e) Table summarizing the significance of differences in average enrichments between pgBoost and each other model on the 4 evaluation sets. Because enrichment differences are highly similar across distance thresholds in a–d, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. main pgBoost model (Methods). Red stars (in a–d) or red shading (in e) indicate that the focal method significantly underperformed pgBoost. Black stars (in a–d) or blue shading (in e) indicate that the focal method significantly outperformed pgBoost. While the main pgBoost model significantly underperformed the model ablating Signac on ABC evaluation data, we prioritized results on eQTL and CRISPR evaluations (see Supplementary Note). The main pgBoost model significantly outperformed each ablated model on at least 1 evaluation data set and never significantly underperformed an expanded model, supporting our choice to include SCENT, Signac, Cicero, and distance as features (and exclude ArchR).
Extended Data Fig. 7 Performance of pgBoost vs. ENCODE-rE2G across biosamples.
Average enrichment across recall values of links predicted by pgBoost vs. ENCODE-rE2G (maximum score across biosamples; see Supplementary Table 16) for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds (we exclude ABC evaluation data from these comparisons, because the ABC evaluation data were used as features to generate ENCODE-rE2G predictions). The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).
Extended Data Fig. 8 Performance of pgBoost restricted to features from a focal cell type on fine-mapped GWAS variants.
Average enrichment across recall values of variants linked to genes by pgBoost restricted to features from a focal cell type (Union denotes all features) for (a) n = 488 fine-mapped variants (PIP > 0.2) for 7 autoimmune diseases or granulocyte-related blood cell traits, (b) n = 455 fine-mapped variants (PIP > 0.2) for 7 red blood cell or platelet-related blood cell traits, (c) fine-mapped variants for individual traits from (a), (d) fine-mapped variants for individual traits from (b). Confidence intervals denote standard errors, and stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. top method (denoted in black outline) (Methods). Plt: platelet count, HbA1c: hemoglobin A1c, Hb: hemoglobin, RBC: red blood cell count, MCV: mean corpuscular volume, MCH: mean corpuscular hemoglobin, MCHC: mean corpuscular hemoglobin concentration, WBC: white blood cell count, Mono: monocyte count, Eosino: eosinophil count, AID: autoimmune disease, Lym: lymphocyte count, Neutro: neutrophil count, Baso: basophil count.
Extended Data Fig. 9 Performance of pgBoost restricted to features from a focal cell type on eQTL data from LCL.
Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 in GTEx EBV-transformed lymphocytes or LCL (number of evaluation links: n = 79). We expanded the set of pgBoost models to include a model restricted to LCL features from the SHARE-seq LCL data set (Table 2). The pgBoost model restricted to LCL significantly outperformed models restricted to B cell, myeloid, or erythroid features and the model spanning the union of all 5 cell types, and performed similarly to the model restricted to T cell features. (We did not analyze cell-type-specific GTEx evaluation data for T cells, B cells, myeloid cells, or erythroid cells, because these cell types are not represented in GTEx). Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline) vs. each other method (Methods). Red stars denote significant underperformance of the focal method vs. the top method.
Extended Data Fig. 10 Performance of pgBoost restricted to features from a focal cell type in cell-type-level and tissue-level ABC evaluation data sets.
Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for cell-type-level (left) or tissue-level (right) ABC evaluation links (rows) defined by SNP-gene links attaining ABC score > 0.2 in biosample(s) related to the focal cell type or tissue (irrespective of ABC scores attained in other biosamples). Number of evaluation links in each ABC evaluation set: n_T = 19,301, n_B = 17,638, n_myeloid = 10,632, n_K562 = 3,793, n_blood = 33,867, n_brain = 19,623. Number of biosamples in each ABC evaluation set: N_blood = 74, N_brain = 26, N_T = 53, N_B = 14, N_myeloid = 5, N_K562 = 1. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline; defined separately for each row) vs. each other method (Methods). For cell-type-level evaluation data sets (left), pgBoost models restricted to the focal cell type significantly outperformed each other model, in general (except for the T cell-level ABC evaluation data set, for which the model restricted to T cell features did not significantly outperform the model restricted to B cell features, but significantly outperformed each other model). On brain-level evaluation data (right), the pgBoost model restricted to brain oligodendroglia features significantly outperformed the model restricted to K562 features and the model spanning both cell types. On blood-level evaluation data (right), the pgBoost model spanning the union of both cell types significantly outperformed both other models.
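The two kinds of ABC evaluation set used in the extended data can be contrasted with a minimal sketch: the main set keeps links whose maximum ABC score across all biosamples exceeds 0.2, while a cell-type-level or tissue-level set keeps links exceeding 0.2 in at least one biosample related to the focal cell type, regardless of scores elsewhere. The function names, the nested-dictionary layout, and the toy biosample labels below are hypothetical; only the 0.2 threshold comes from the captions.

```python
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str]  # (SNP identifier, gene identifier)
AbcScores = Dict[Link, Dict[str, float]]  # link -> {biosample: ABC score}


def max_abc_links(abc_scores: AbcScores, threshold: float = 0.2) -> Set[Link]:
    """Links whose maximum ABC score across all biosamples exceeds the threshold."""
    return {
        link
        for link, by_biosample in abc_scores.items()
        if max(by_biosample.values(), default=0.0) > threshold
    }


def focal_abc_links(
    abc_scores: AbcScores,
    focal_biosamples: Iterable[str],
    threshold: float = 0.2,
) -> Set[Link]:
    """Links exceeding the threshold in at least one focal biosample.

    Scores in non-focal biosamples are ignored, matching the caption's
    "irrespective of ABC scores attained in other biosamples".
    """
    focal = set(focal_biosamples)
    return {
        link
        for link, by_biosample in abc_scores.items()
        if any(score > threshold for b, score in by_biosample.items() if b in focal)
    }


# Toy example with made-up identifiers and biosample labels.
toy_abc = {
    ("rs1", "GENE_A"): {"T_cell_1": 0.35, "K562": 0.05},
    ("rs2", "GENE_B"): {"T_cell_1": 0.10, "K562": 0.60},
    ("rs3", "GENE_C"): {"T_cell_2": 0.25},
    ("rs4", "GENE_D"): {"T_cell_1": 0.15, "K562": 0.02},
}
t_cell_links = focal_abc_links(toy_abc, ["T_cell_1", "T_cell_2"])
```

Under this rule a link can belong to several focal sets at once, which is why the per-cell-type counts in the caption need not sum to the size of the main ABC evaluation set.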
Supplementary information
Supplementary Information
Supplementary Figs. 1–45 and Note.
Supplementary Tables
Supplementary Tables 1–16.
About this article
Cite this article
Dorans, E., Jagadeesh, K., Dey, K. et al. Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. Nat Genet 57, 1649–1658 (2025). https://doi.org/10.1038/s41588-025-02220-3
DOI: https://doi.org/10.1038/s41588-025-02220-3