Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance

Dorans, Elizabeth; Jagadeesh, Karthik; Dey, Kushal; Price, Alkes L.

doi:10.1038/s41588-025-02220-3

Article
Published: 12 June 2025

Nature Genetics volume 57, pages 1649–1658 (2025)

Abstract

Methods that analyze single-cell paired RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin using sequencing (ATAC-seq) multiome data have shown promise in linking regulatory elements to genes. However, existing methods exhibit low concordance and do not capture the effects of genomic distance. We propose pgBoost, an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on expression quantitative trait locus (eQTL) data to assign a probabilistic score to each candidate single-nucleotide polymorphism–gene link. pgBoost attained higher enrichment than existing methods for evaluation sets derived from eQTL, activity-by-contact, CRISPR and genome-wide association study (GWAS) data. We further determined that restricting pgBoost to features from a focal cell type improved power to identify links relevant to that cell type. We highlight several examples in which pgBoost linked fine-mapped GWAS variants to experimentally validated or biologically plausible target genes that were not implicated by other methods. In conclusion, a non-linear combination of linking strategies improves power to identify target genes underlying GWAS associations.
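
To make the idea of a non-linear combination of linking scores concrete, the following is a minimal sketch of gradient boosting with depth-1 trees on a logistic loss, written in plain Python. It is not the pgBoost implementation (which, according to the Code availability section, uses XGBoost); the feature layout, toy values, labels and hyperparameters are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, grad, hess):
    """Pick the single-feature threshold split with the largest second-order gain.

    Leaf values are Newton steps: sum(grad) / sum(hess) within each leaf.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            gl = hl = gr = hr = 0.0
            for row, g, h in zip(X, grad, hess):
                if row[j] <= thr:
                    gl += g
                    hl += h
                else:
                    gr += g
                    hr += h
            gain = gl * gl / (hl + 1e-9) + gr * gr / (hr + 1e-9)
            if best is None or gain > best[0]:
                best = (gain, j, thr, gl / (hl + 1e-9), gr / (hr + 1e-9))
    _, j, thr, left, right = best
    return j, thr, left, right

def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.3):
    base_rate = sum(y) / len(y)
    bias = math.log(base_rate / (1.0 - base_rate))
    margins = [bias] * len(y)
    stumps = []
    for _ in range(n_rounds):
        probs = [sigmoid(m) for m in margins]
        grad = [yi - p for yi, p in zip(y, probs)]  # negative gradient of the log loss
        hess = [p * (1.0 - p) for p in probs]
        j, thr, left, right = fit_stump(X, grad, hess)
        left, right = learning_rate * left, learning_rate * right
        stumps.append((j, thr, left, right))
        margins = [m + (left if row[j] <= thr else right)
                   for m, row in zip(margins, X)]
    return bias, stumps

def predict_proba(model, row):
    bias, stumps = model
    margin = bias + sum(l if row[j] <= thr else r for j, thr, l, r in stumps)
    return sigmoid(margin)

# Hypothetical toy features per candidate SNP-gene link:
# [SCENT-like score, Signac-like correlation, distance to TSS in kb]
X = [
    [0.9, 0.6, 5.0], [0.8, 0.1, 12.0], [0.2, 0.7, 8.0], [0.7, 0.5, 30.0],
    [0.1, 0.2, 150.0], [0.3, 0.1, 400.0], [0.6, 0.0, 600.0], [0.0, 0.3, 90.0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # toy labels: 1 = link supported by eQTL data
model = fit_boosted_stumps(X, y)
scores = [predict_proba(model, row) for row in X]
```

Each stump is a step function of one feature, so the summed model is non-linear in every input, including distance; a production library such as XGBoost additionally grows deeper trees that capture interactions between features.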

**Fig. 1: Existing single-cell peak–gene linking methods exhibit low concordance and underperform a distance-based linking score.**

**Fig. 2: Overview of the pgBoost method.**

**Fig. 3: Performance of pgBoost and other methods on eQTL and ABC evaluation datasets.**

**Fig. 4: Performance of pgBoost and other methods on gold-standard evaluation datasets.**

**Fig. 5: Performance of pgBoost restricted to features from a focal or non-focal cell type in evaluation datasets relevant to the focal cell type.**

**Fig. 6: Examples of SNP–gene links identified by pgBoost.**

Data availability

All single-cell multiome datasets analyzed are publicly available. The 10x PBMC³⁶ dataset is available at https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0. The Luecken BMMC³⁷, SHARE-seq LCL¹² and Meijer brain⁴² datasets are available at Gene Expression Omnibus (accession codes GSE194122, GSE140203 and GSE193240, respectively). The Xu K562 (ref. ³⁸) dataset is available at https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11264. Linking scores and percentiles for pgBoost and existing methods have been made publicly available at https://doi.org/10.5281/zenodo.11211925. Fine-mapped GTEx eQTL data²⁴ are available at https://www.finucanelab.org/data. GTEx eQTL summary statistics are available at https://www.gtexportal.org/home/downloads/adult-gtex/qtl. ABC scores⁹ are available on the ENCODE portal (https://www.encodeproject.org/). Biosample IDs and file accessions are listed in Supplementary Table 5. The CRISPR dataset²⁵,²⁶,²⁷,²⁸,²⁹,³⁰,³¹,³²,³³ is available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/crispr_data/EPCrisprBenchmark_ensemble_data_GRCh38.tsv.gz. GWAS-derived SNP–gene links³⁴ are available at https://doi.org/10.5281/zenodo.11211925. PCHi-C enhancer–promoter links are available as a supplemental data archive of a previous publication⁶⁶. eQTLGen summary statistics⁶⁷ are available at https://eqtlgen.org/cis-eqtls.html. ENCODE-rE2G predictions⁹ are available on the ENCODE portal (https://www.encodeproject.org). Biosample IDs and file accessions are listed in Supplementary Table 16. GWAS fine-mapping results³⁵ for 94 UK Biobank⁴¹ traits are available at https://www.finucanelab.org/data. Coordinates for the set of genes analyzed⁹ are available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/genome_annotations/CollapsedGeneBounds.hg38.bed.

Code availability

The code to implement pgBoost⁶⁸ has been made publicly available at https://github.com/elizabethdorans/pgBoost. Tutorials for existing methods⁶⁹ (SCENT, Signac, ArchR and Cicero) have been made publicly available at https://github.com/elizabethdorans/E2G_Method_Tutorials. Code implementing SCENT¹⁵ is publicly available at https://github.com/immunogenomics/SCENT. Code implementing Signac¹²,¹⁶ is publicly available at https://github.com/stuart-lab/signac. Code implementing ArchR¹⁷ is publicly available at https://github.com/GreenleafLab/ArchR. Code implementing Cicero¹⁸ is publicly available at https://github.com/cole-trapnell-lab/cicero-release. We used Seurat (v.4.3.0.1), SCENT (v.0.1.0), Signac (v.1.10.0), ArchR (v.1.0.2), Cicero (v.1.3.9) and XGBoost (v.1.7.5.1).

References

1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).
2. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
3. Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).
4. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022).
5. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).
6. Abdellaoui, A., Yengo, L., Verweij, K. J. H. & Visscher, P. M. 15 years of GWAS discovery: realizing the promise. Am. J. Hum. Genet. 110, 179–194 (2023).
7. Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).
8. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).
9. Gschwind, A. R. et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. Preprint at bioRxiv https://doi.org/10.1101/2023.11.09.563812 (2023).
10. Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).
11. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
12. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).
13. Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24, 494–515 (2023).
14. Cuomo, A. S. E., Nathan, A., Raychaudhuri, S., MacArthur, D. G. & Powell, J. E. Single-cell genomics meets human genetics. Nat. Rev. Genet. 24, 535–549 (2023).
15. Sakaue, S. et al. Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles. Nat. Genet. 56, 615–626 (2024).
16. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
17. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
18. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).
19. Kartha, V. K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genomics 2, 100166 (2022).
20. Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat. Methods 20, 1355–1367 (2023).
21. Mitra, S. et al. Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis. Nat. Genet. 56, 627–636 (2024).
22. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (eds Krishnapuram, B. et al.) (Association for Computing Machinery, 2016).
23. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
24. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).
25. Thakore, P. I. et al. Highly specific epigenome editing by CRISPR–Cas9 repressors for silencing of distal regulatory elements. Nat. Methods 12, 1143–1149 (2015).
26. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).
27. Ulirsch, J. C. et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).
28. Wakabayashi, A. et al. Insight into GATA1 transcriptional activity through interrogation of cis elements disrupted in human erythroid disorders. Proc. Natl Acad. Sci. USA 113, 4434–4439 (2016).
29. Huang, J. et al. Dissecting super-enhancer hierarchy based on chromatin interactions. Nat. Commun. 9, 943 (2018).
30. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
31. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).
32. Schraivogel, D. et al. Targeted Perturb-seq enables genome-scale genetic screens in single cells. Nat. Methods 17, 629–635 (2020).
33. Klann, T. S. et al. CRISPR–Cas9 epigenome editing enables high-throughput screening for functional regulatory elements in the human genome. Nat. Biotechnol. 35, 561–568 (2017).
34. Weeks, E. M. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat. Genet. 55, 1267–1276 (2023).
35. Kanai, M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv https://doi.org/10.1101/2021.09.03.21262975 (2021).
36. PBMC from a Healthy Donor (10k), Single Cell Multiome ATAC + Gene Expression Dataset by Cell Ranger 2.0.0 (10x Genomics, 2021); https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0
37. Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and protein data in single cells. In 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2020).
38. Xu, W. et al. ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat. Methods 19, 1243–1249 (2022).
39. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).
40. Zhou, B. et al. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res. 29, 472–484 (2019).
41. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
42. Meijer, M. et al. Epigenomic priming of immune genes implicates oligodendroglia in multiple sclerosis susceptibility. Neuron 110, 1193–1210.e13 (2022).
43. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 82, 1273–1300 (2020).
44. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
45. Dutka, M., Zimmer, K., Ćwiertnia, M., Ilczak, T. & Bobiński, R. The role of PCSK9 in heart failure and other cardiovascular diseases—mechanisms of action beyond its effect on LDL cholesterol. Heart Fail. Rev. 29, 917–937 (2024).
46. Bao, X. et al. Targeting proprotein convertase subtilisin/kexin type 9 (PCSK9): from bench to bedside. Signal Transduct. Target. Ther. 9, 13 (2024).
47. Simeonov, D. R. et al. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature 549, 111–115 (2017).
48. Burchill, M. A., Yang, J., Vang, K. B. & Farrar, M. A. Interleukin-2 receptor signaling in regulatory T cell development and homeostasis. Immunol. Lett. 114, 1–8 (2007).
49. Stahl, E. A. et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).
50. Martin, J. E. et al. The autoimmune disease-associated IL2RA locus is involved in the clinical manifestations of systemic sclerosis. Genes Immun. 13, 191–196 (2012).
51. Lowe, C. E. et al. Large-scale genetic fine mapping and genotype–phenotype associations implicate polymorphism in the IL2RA region in type 1 diabetes. Nat. Genet. 39, 1074–1082 (2007).
52. Avellino, R. et al. An autonomous CEBPA enhancer specific for myeloid-lineage priming and neutrophilic differentiation. Blood 127, 2991–3003 (2016).
53. Leroy, H. et al. CEBPA point mutations in hematological malignancies. Leukemia 19, 329–334 (2005).
54. Pabst, T. & Mueller, B. U. Complexity of CEBPA dysregulation in human acute myeloid leukemia. Clin. Cancer Res. 15, 5303–5307 (2009).
55. Du, Z., Luo, G., Wang, K., Bing, Z. & Pan, S. Identification of a novel heterozygous SPTB mutation by whole genome sequencing in a Chinese patient with hereditary spherocytosis and atrial septal defect: a case report. BMC Pediatr. 21, 291 (2021).
56. Li, S. et al. A novel SPTB mutation causes hereditary spherocytosis via loss-of-function of β-spectrin. Ann. Hematol. 101, 731–738 (2022).
57. Wang, Y. et al. A novel variant in the SPTB gene underlying hereditary spherocytosis and a literature review of previous variants. BMC Med. Genomics 17, 206 (2024).
58. Rees, H. A. & Liu, D. R. Base editing: precision chemistry on the genome and transcriptome of living cells. Nat. Rev. Genet. 19, 770–788 (2018).
59. Vernimmen, D. & Bickmore, W. A. The hierarchy of transcriptional activation: from enhancer to promoter. Trends Genet. 31, 696–708 (2015).
60. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
61. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
62. Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
63. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
64. Feng, J., Liu, T., Qin, B., Zhang, Y. & Liu, X. S. Identifying ChIP–seq enrichment using MACS. Nat. Protoc. 7, 1728–1740 (2012).
65. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).
66. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell 167, 1369–1384.e19 (2016).
67. Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021).
68. Dorans, E. elizabethdorans/pgBoost: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15276307
69. Dorans, E. elizabethdorans/E2G_Method_Tutorials: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15284078

Acknowledgements

We thank J. Engreitz, R. Andersson, A. Gschwind and W. Qiu for valuable feedback on the manuscript and S. Raychaudhuri, S. Sunyaev, V. Sankaran, S. Sakaue, C. Boix, B. Strober, J. Rossen and M. Zhang for helpful discussions. This research was funded by NIH grants U01 HG012009, R56 HG013083, R01 MH101244, R37 MH107649, R01 HG006399 and R01 MH115676 (A.L.P.). K.D. is funded by NIH/NHGRI under grant R00HG012203, NIH/NCI Cancer Center Support Grant P30CA008748 and the Josie Robertson Investigator Program.

Author information

These authors contributed equally: Kushal Dey, Alkes L. Price.

Authors and Affiliations

Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Elizabeth Dorans, Karthik Jagadeesh & Alkes L. Price
PhD Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA
Elizabeth Dorans
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Elizabeth Dorans, Karthik Jagadeesh & Alkes L. Price
Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Kushal Dey
Gerstner Sloan Kettering Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York, NY, USA
Kushal Dey
Physiology, Biophysics and Systems Biology, Weill Cornell Medicine, New York, NY, USA
Kushal Dey
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Alkes L. Price

Authors

Elizabeth Dorans
Karthik Jagadeesh
Kushal Dey
Alkes L. Price

Contributions

K.D. and K.J. conceived this study. E.D. developed the pgBoost model and conducted all analyses. A.L.P., K.D. and K.J. provided critical scientific guidance. E.D. and A.L.P. wrote the initial draft of the manuscript. All authors contributed to the final manuscript.

Corresponding authors

Correspondence to Elizabeth Dorans, Kushal Dey or Alkes L. Price.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Tim Stuart and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of pgBoost vs. a logistic regression model.

Average enrichment across recall values of links predicted by pgBoost vs. a logistic regression model with identical features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). Gradient boosting significantly outperformed logistic regression on eQTL evaluation data and CRISPR evaluation data >50 kb. Gradient boosting did not significantly outperform logistic regression on GWAS evaluation data. While gradient boosting significantly outperformed logistic regression on all ABC evaluation data (>1 kb and >5 kb), gradient boosting significantly underperformed logistic regression on longer-range ABC evaluation data (>50 kb and >100 kb); this could be due to the gradient boosting model assigning higher weight to distance-based features than the logistic regression model (however, we do not observe this behavior in other evaluation sets).
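
The metric named in these captions can be illustrated with a short sketch. The definition used here is an assumption, since the Methods are not part of this excerpt: enrichment at a given point in the ranking is taken to be the precision among links ranked at or above that point divided by the fraction of positives among all candidate links, and the average is taken over the recall values reached at each positive link. Function names and toy values are hypothetical.

```python
def enrichment_recall_curve(scores, labels):
    """Return (recall, enrichment) pairs, one per positive link, walking down the ranking.

    Assumed definition: enrichment = precision / base rate of positives.
    Ties in score are broken by input order.
    """
    n = len(scores)
    n_pos = sum(labels)
    base_rate = n_pos / n
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    curve = []
    hits = 0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision = hits / rank
            curve.append((hits / n_pos, precision / base_rate))
    return curve

def average_enrichment(scores, labels):
    """Mean enrichment over the recall values in the curve."""
    curve = enrichment_recall_curve(scores, labels)
    return sum(e for _, e in curve) / len(curve)

# Toy example: 8 candidate links, 2 of them in the evaluation set.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_curve = enrichment_recall_curve(toy_scores, toy_labels)
toy_average = average_enrichment(toy_scores, toy_labels)
```

Under this definition a perfect ranking attains the maximum value of one over the base rate, and an uninformative ranking is expected to give values near 1.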

Extended Data Fig. 2 Performance of pgBoost vs. fine-mapped eQTL.

Average enrichment across recall values of links predicted by pgBoost vs. fine-mapped eQTL (ranking candidate SNP-gene links by maximum PIP across GTEx tissues) for (a) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).
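
The standard errors and two-sided bootstrap p-values mentioned in these captions can be sketched as follows. This is one common way to do it and is an assumption, not the procedure from the paper's Methods: candidate links are resampled with replacement, the difference in the metric between two methods is recomputed on each resample, the standard error is the standard deviation of those differences, and the p-value is twice the smaller tail fraction. All names and toy scores are hypothetical.

```python
import random
import statistics

def average_enrichment(scores, labels):
    """Mean over positives of (precision at that rank) / (base rate); assumed metric."""
    n, n_pos = len(labels), sum(labels)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += (hits / rank) / (n_pos / n)
    return total / n_pos

def bootstrap_difference(scores_a, scores_b, labels, metric, n_boot=1000, seed=0):
    """Return (standard error, two-sided p-value) for metric(A) - metric(B)."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if sum(lab) == 0:
            continue  # metric is undefined without positives; draw again
        a = metric([scores_a[i] for i in idx], lab)
        b = metric([scores_b[i] for i in idx], lab)
        diffs.append(a - b)
    se = statistics.stdev(diffs)
    frac_le = sum(d <= 0 for d in diffs) / n_boot
    frac_ge = sum(d >= 0 for d in diffs) / n_boot
    # A value of 0 here only means p < 1 / n_boot.
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return se, p_value

def stars(p_value):
    """Map a p-value to the star notation used in the captions."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""

# Toy comparison: method A ranks the 5 positives first, method B ranks them last.
labels = [1] * 5 + [0] * 15
method_a = [1.0 - 0.01 * i for i in range(20)]
method_b = list(reversed(method_a))
se, p_value = bootstrap_difference(method_a, method_b, labels, average_enrichment, n_boot=500)
```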

Extended Data Fig. 3 Performance of pgBoost trained on eQTL vs. CRISPR data.

Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs vs. CRISPR-validated links for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. While other regulatory linking methods have used CRISPR links for model training, pgBoost may perform better when trained on eQTL data because of the limited amount of CRISPR data, pervasive aneuploidy and structural variants in the K562 cell line, and/or better representation of multiple cell types in GTEx eQTL data. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_CRISPR, we defined a training set comprising 892 SNP-gene links validated by CRISPR as positives and 10,345 links tested but not validated by CRISPR as negatives (Supplementary Table 7).

Extended Data Fig. 4 Performance of pgBoost trained on eQTL data from all tissues vs. whole blood.

Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs from all GTEx tissues vs. fine-mapped eSNP-eGene pairs from whole blood for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, and (e) 712 fine-mapped eSNP-eGene pairs attaining PIP > 0.5 in GTEx whole blood, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_Blood-eQTL, we defined a training set of 1,518 eSNP-eGene pairs attaining PIP > 0.2 in whole blood as positives and 6,997 eSNP-eGene pairs including a gene in the positive set and attaining PIP < 0.01 in whole blood as negatives.
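
The training-set rule described for pgBoost_Blood-eQTL (positives with PIP > 0.2; negatives with PIP < 0.01 that involve a gene from the positive set) can be sketched in a few lines. The data structure, function name and identifiers below are hypothetical and chosen only to illustrate the rule as stated in the caption.

```python
def build_training_set(pip_by_link, pos_threshold=0.2, neg_threshold=0.01):
    """Split (snp, gene) links into positives and negatives by fine-mapping PIP.

    Negatives are kept only if their gene also appears in some positive link.
    Links with intermediate PIP are left out of training.
    """
    positives = {link for link, pip in pip_by_link.items() if pip > pos_threshold}
    positive_genes = {gene for _, gene in positives}
    negatives = {
        link for link, pip in pip_by_link.items()
        if pip < neg_threshold and link[1] in positive_genes
    }
    return positives, negatives

# Hypothetical PIPs keyed by (SNP, gene).
toy_pips = {
    ("rs1", "GENE_A"): 0.85,   # positive
    ("rs2", "GENE_A"): 0.001,  # negative: low PIP, gene has a positive link
    ("rs3", "GENE_A"): 0.05,   # excluded: intermediate PIP
    ("rs4", "GENE_B"): 0.002,  # excluded: GENE_B has no positive link
}
positives, negatives = build_training_set(toy_pips)
```

Restricting negatives to genes with at least one positive link means that the negatives come from genes that are demonstrably regulated by some variant in the data, which is how the caption defines them.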

Extended Data Fig. 5 Performance of pgBoost models restricted to a single existing method or distance.

Average enrichment across recall values of links predicted by pgBoost restricted to linking scores from a single existing method or genomic distance for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. Because enrichment differences are highly similar across distance thresholds, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for the difference of the main pgBoost model (denoted in black outline) vs. each other method (*: p < 0.05, **: p < 0.01, ***: p < 0.001) (Methods). Red stars indicate that the focal method significantly underperforms pgBoost.

Extended Data Fig. 6 Performance of pgBoost models adding or ablating one existing method.

Average enrichment across recall values of links predicted by pgBoost (denoted in black outline) vs. pgBoost models ablating features from one existing peak-gene linking method (SCENT, Signac, Cicero), ablating distance-based features, or adding ArchR features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. (e) Table summarizing the significance of differences in average enrichments between pgBoost and each other model on the 4 evaluation sets. Because enrichment differences are highly similar across distance thresholds in a–d, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. main pgBoost model (Methods). Red stars (in a–d) or red shading (in e) indicate that the focal method significantly underperformed pgBoost. Black stars (in a–d) or blue shading (in e) indicate that the focal method significantly outperformed pgBoost. While the main pgBoost model significantly underperformed the model ablating Signac on ABC evaluation data, we prioritized results on eQTL and CRISPR evaluations (see Supplementary Note). The main pgBoost model significantly outperformed each ablated model on at least 1 evaluation data set and never significantly underperformed an expanded model, supporting our choice to include SCENT, Signac, Cicero, and distance as features (and exclude ArchR).
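
An ablation of this kind amounts to retraining the model on a feature matrix with one group of columns removed. The sketch below shows only the column-dropping step; the `method:cell_type` naming convention for feature columns is an assumption made for illustration and is not taken from the pgBoost code.

```python
def ablate_method(rows, feature_names, method):
    """Drop every feature column whose name starts with '<method>:'."""
    keep = [k for k, name in enumerate(feature_names)
            if not name.startswith(method + ":")]
    kept_names = [feature_names[k] for k in keep]
    kept_rows = [[row[k] for k in keep] for row in rows]
    return kept_rows, kept_names

# Hypothetical feature layout for two candidate links.
feature_names = ["SCENT:T", "Signac:T", "Cicero:T", "distance:tss"]
rows = [
    [0.9, 0.4, 0.1, 5000.0],
    [0.1, 0.2, 0.0, 250000.0],
]
ablated_rows, ablated_names = ablate_method(rows, feature_names, "Signac")
```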

Extended Data Fig. 7 Performance of pgBoost vs. ENCODE-rE2G across biosamples.

Average enrichment across recall values of links predicted by pgBoost vs. ENCODE-rE2G (maximum score across biosamples; see Supplementary Table 16) for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds (we exclude ABC evaluation data from these comparisons, because the ABC evaluation data were used as features to generate ENCODE-rE2G predictions). The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).
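
Collapsing per-biosample predictions to a single score per SNP-gene link by taking the maximum, as described for ENCODE-rE2G here (and for PIP across GTEx tissues in other captions), can be sketched as below. The record layout and identifiers are hypothetical.

```python
def max_score_across_biosamples(records):
    """Collapse (snp, gene, biosample, score) records to one max score per (snp, gene)."""
    best = {}
    for snp, gene, _biosample, score in records:
        key = (snp, gene)
        if key not in best or score > best[key]:
            best[key] = score
    return best

# Hypothetical per-biosample scores.
toy_records = [
    ("rs1", "GENE_A", "biosample_1", 0.3),
    ("rs1", "GENE_A", "biosample_2", 0.7),
    ("rs2", "GENE_B", "biosample_1", 0.1),
]
collapsed = max_score_across_biosamples(toy_records)
ranked_links = sorted(collapsed, key=collapsed.get, reverse=True)
```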

Extended Data Fig. 8 Performance of pgBoost restricted to features from a focal cell type on fine-mapped GWAS variants.

Average enrichment across recall values of variants linked to genes by pgBoost restricted to features from a focal cell type (Union denotes all features) for (a) n = 488 fine-mapped variants (PIP > 0.2) for 7 autoimmune diseases or granulocyte-related blood cell traits, (b) n = 455 fine-mapped variants (PIP > 0.2) for 7 red blood cell or platelet-related blood cell traits, (c) fine-mapped variants for individual traits from (a), (d) fine-mapped variants for individual traits from (b). Confidence intervals denote standard errors, and stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. top method (denoted in black outline) (Methods). Plt: platelet count, HbA1c: hemoglobin A1c, Hb: hemoglobin, RBC: red blood cell count, MCV: mean corpuscular volume, MCH: mean corpuscular hemoglobin, MCHC: mean corpuscular hemoglobin concentration, WBC: white blood cell count, Mono: monocyte count, Eosino: eosinophil count, AID: autoimmune disease, Lym: lymphocyte count, Neutro: neutrophil count, Baso: basophil count.

Extended Data Fig. 9 Performance of pgBoost restricted to features from a focal cell type on eQTL data from LCL.

Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 in GTEx EBV-transformed lymphocytes (lymphoblastoid cell line, LCL; number of evaluation links: n = 79). We expanded the set of pgBoost models to include a model restricted to LCL features from the SHARE-seq LCL data set (Table 2). The pgBoost model restricted to LCL features significantly outperformed the models restricted to B cell, myeloid, or erythroid features and the model spanning the union of all 5 cell types, and performed similarly to the model restricted to T cell features. (We did not analyze cell-type-specific GTEx evaluation data for T cells, B cells, myeloid cells, or erythroid cells, because these cell types are not represented in GTEx.) Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline) vs. each other method (Methods). Red stars denote significant underperformance of the focal method vs. the top method.

Extended Data Fig. 10 Performance of pgBoost restricted to features from a focal cell type in cell-type-level and tissue-level ABC evaluation data sets.

Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for cell-type-level (left) or tissue-level (right) ABC evaluation links (rows) defined by SNP-gene links attaining ABC score > 0.2 in biosample(s) related to the focal cell type or tissue (irrespective of ABC scores attained in other biosamples). Number of evaluation links in each ABC evaluation set: n_T = 19,301, n_B = 17,638, n_myeloid = 10,632, n_K562 = 3,793, n_blood = 33,867, n_brain = 19,623. Number of biosamples in each ABC evaluation set: N_blood = 74, N_brain = 26, N_T = 53, N_B = 14, N_myeloid = 5, N_K562 = 1. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline; defined separately for each row) vs. each other method (Methods). For cell-type-level evaluation data sets (left), the pgBoost model restricted to the focal cell type generally significantly outperformed each other model. The one exception was the T cell-level ABC evaluation data set, for which the model restricted to T cell features significantly outperformed each other model except the model restricted to B cell features. On brain-level evaluation data (right), the pgBoost model restricted to brain oligodendroglia features significantly outperformed the model restricted to K562 features and the model spanning both cell types. On blood-level evaluation data (right), the pgBoost model spanning the union of both cell types significantly outperformed both other models.
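The legends above report average enrichment across recall values, standard errors, and 2-sided bootstrap p-values for the difference between methods; the exact definitions are given in Methods and are not reproduced here. The following is a minimal sketch of how such a comparison could be computed, under several assumptions that are ours rather than the authors': enrichment is taken to be precision divided by the proportion of positive links among all candidate links, the recall grid is hypothetical, links are resampled with replacement, and the p-value is a percentile-style bootstrap p-value. The function names `average_enrichment` and `bootstrap_difference` are illustrative only.

```python
import numpy as np


def average_enrichment(scores, labels, recalls=(0.25, 0.5, 0.75, 1.0)):
    """Mean enrichment over a (hypothetical) grid of recall values.

    Assumption: enrichment = precision / base rate of positive links.
    Tied scores are broken by input order (stable sort).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    if n_pos == 0:
        return float("nan")
    order = np.argsort(-scores, kind="stable")      # highest score first
    cum_pos = np.cumsum(labels[order])
    precision = cum_pos / np.arange(1, len(scores) + 1)
    recall_curve = cum_pos / n_pos                  # non-decreasing, ends at 1
    base_rate = n_pos / len(scores)
    # First rank at which each target recall is reached.
    idx = np.searchsorted(recall_curve, recalls, side="left")
    return float(np.mean(precision[idx] / base_rate))


def bootstrap_difference(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Observed difference (A minus B), bootstrap SE, and 2-sided p-value.

    Assumption: candidate links are resampled with replacement; resamples
    containing no positive link are skipped.
    """
    rng = np.random.default_rng(seed)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n = len(labels)
    observed = average_enrichment(scores_a, labels) - average_enrichment(scores_b, labels)
    diffs = []
    for _ in range(n_boot):
        sample = rng.integers(0, n, size=n)
        if not labels[sample].any():
            continue
        diffs.append(
            average_enrichment(scores_a[sample], labels[sample])
            - average_enrichment(scores_b[sample], labels[sample])
        )
    diffs = np.array(diffs)
    se = float(diffs.std(ddof=1))
    # Percentile-style 2-sided p-value for "difference equals zero".
    p = min(1.0, 2.0 * min(np.mean(diffs <= 0), np.mean(diffs >= 0)))
    return observed, se, float(p)


# Toy data: method A ranks both positive links first, method B ranks them last.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores_a = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
scores_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
observed, se, p = bootstrap_difference(scores_a, scores_b, labels)
```

In this toy example method A attains the maximum possible enrichment (the reciprocal of the base rate) at every recall value, so the observed difference is positive. Real evaluation sets would replace the toy arrays with per-link scores from each method and membership in the evaluation set.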

Supplementary information

Supplementary Information

Supplementary Figs. 1–45 and Note.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–16.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dorans, E., Jagadeesh, K., Dey, K. et al. Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. Nat Genet 57, 1649–1658 (2025). https://doi.org/10.1038/s41588-025-02220-3

Received: 23 May 2024
Accepted: 07 May 2025
Published: 12 June 2025
Version of record: 12 June 2025
Issue date: July 2025
DOI: https://doi.org/10.1038/s41588-025-02220-3