Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance

Abstract

Methods that analyze single-cell paired RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin using sequencing (ATAC-seq) multiome data have shown promise in linking regulatory elements to genes. However, existing methods exhibit low concordance and do not capture the effects of genomic distance. We propose pgBoost, an integrative modeling framework that trains a non-linear combination of existing linking strategies (including genomic distance) on expression quantitative trait locus (eQTL) data to assign a probabilistic score to each candidate single-nucleotide polymorphism–gene link. pgBoost attained higher enrichment than existing methods for evaluation sets derived from eQTL, activity-by-contact, CRISPR and genome-wide association study (GWAS) data. We further determined that restricting pgBoost to features from a focal cell type improved power to identify links relevant to that cell type. We highlight several examples in which pgBoost linked fine-mapped GWAS variants to experimentally validated or biologically plausible target genes that were not implicated by other methods. In conclusion, a non-linear combination of linking strategies improves power to identify target genes underlying GWAS associations.
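The idea of a non-linear combination of linking scores can be illustrated with a minimal sketch of gradient boosting under logistic loss, using depth-1 trees (stumps). This is a toy stand-in written for illustration, not the pgBoost implementation (which the authors report building on XGBoost); the two features (distance in kb and a peak–gene linking score), the toy values and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return the depth-1 split (feature, threshold, left value, right value)
    that best fits the residuals in the least-squares sense."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lmean, rmean)
    if best is None:
        raise ValueError("no valid split: every feature is constant")
    return best[1:]

def fit_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Fit an additive model of stumps by following the log-loss gradient."""
    stumps, margins = [], [0.0] * len(X)
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to each margin.
        residuals = [yi - sigmoid(m) for yi, m in zip(y, margins)]
        j, t, lv, rv = fit_stump(X, residuals)
        lv, rv = learning_rate * lv, learning_rate * rv
        stumps.append((j, t, lv, rv))
        margins = [m + (lv if row[j] <= t else rv) for row, m in zip(X, margins)]
    return stumps

def predict_proba(stumps, row):
    """Probabilistic score for one candidate link."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))

def log_loss(stumps, X, y):
    total = 0.0
    for row, yi in zip(X, y):
        p = min(max(predict_proba(stumps, row), 1e-12), 1 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

# Hypothetical candidate links: [distance to gene in kb, linking score].
X = [[5, 0.1], [8, 0.7], [20, 0.9], [250, 0.8],
     [200, 0.1], [400, 0.2], [300, 0.05], [12, 0.2]]
y = [1, 1, 1, 1, 0, 0, 0, 1]  # 1 = link supported by (toy) training data

model = fit_boosting(X, y)
scores = [predict_proba(model, row) for row in X]
```

In this toy data one positive link is distal but has a high linking score, so neither feature alone separates the classes; the boosted stumps combine thresholds on both features, which is the kind of interaction a single linear weight per feature cannot express.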

Fig. 1: Existing single-cell peak–gene linking methods exhibit low concordance and underperform a distance-based linking score.
Fig. 2: Overview of the pgBoost method.
Fig. 3: Performance of pgBoost and other methods on eQTL and ABC evaluation datasets.
Fig. 4: Performance of pgBoost and other methods on gold-standard evaluation datasets.
Fig. 5: Performance of pgBoost restricted to features from a focal or non-focal cell type in evaluation datasets relevant to the focal cell type.
Fig. 6: Examples of SNP–gene links identified by pgBoost.

Data availability

All single-cell multiome datasets analyzed are publicly available. The 10x PBMC (ref. 36) dataset is available at https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0. The Luecken BMMC (ref. 37), SHARE-seq LCL (ref. 12) and Meijer brain (ref. 42) datasets are available at Gene Expression Omnibus (accession codes GSE194122, GSE140203 and GSE193240, respectively). The Xu K562 (ref. 38) dataset is available at https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-11264. Linking scores and percentiles for pgBoost and existing methods have been made publicly available at https://doi.org/10.5281/zenodo.11211925. Fine-mapped GTEx eQTL data (ref. 24) are available at https://www.finucanelab.org/data. GTEx eQTL summary statistics are available at https://www.gtexportal.org/home/downloads/adult-gtex/qtl. ABC scores (ref. 9) are available on the ENCODE portal (https://www.encodeproject.org/). Biosample IDs and file accessions are listed in Supplementary Table 5. The CRISPR dataset (refs. 25–33) is available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/crispr_data/EPCrisprBenchmark_ensemble_data_GRCh38.tsv.gz. GWAS-derived SNP–gene links (ref. 34) are available at https://doi.org/10.5281/zenodo.11211925. PCHi-C enhancer–promoter links are available as a supplemental data archive of a previous publication (ref. 66). eQTLGen summary statistics (ref. 67) are available at https://eqtlgen.org/cis-eqtls.html. ENCODE-rE2G predictions (ref. 9) are available on the ENCODE portal (https://www.encodeproject.org). Biosample IDs and file accessions are listed in Supplementary Table 16. GWAS fine-mapping results (ref. 35) for 94 UK Biobank (ref. 41) traits are available at https://www.finucanelab.org/data. Coordinates for the set of genes analyzed (ref. 9) are available at https://github.com/EngreitzLab/CRISPR_comparison/blob/main/resources/genome_annotations/CollapsedGeneBounds.hg38.bed.

Code availability

The code to implement pgBoost (ref. 68) has been made publicly available at https://github.com/elizabethdorans/pgBoost. Tutorials for existing methods (ref. 69; SCENT, Signac, ArchR and Cicero) have been made publicly available at https://github.com/elizabethdorans/E2G_Method_Tutorials. Code implementing SCENT (ref. 15) is publicly available at https://github.com/immunogenomics/SCENT. Code implementing Signac (refs. 12,16) is publicly available at https://github.com/stuart-lab/signac. Code implementing ArchR (ref. 17) is publicly available at https://github.com/GreenleafLab/ArchR. Code implementing Cicero (ref. 18) is publicly available at https://github.com/cole-trapnell-lab/cicero-release. We used Seurat (v.4.3.0.1), SCENT (v.0.1.0), Signac (v.1.10.0), ArchR (v.1.0.2), Cicero (v.1.3.9) and XGBoost (v.1.7.5.1).

References

  1. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl Acad. Sci. USA 106, 9362–9367 (2009).

  2. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).

  3. Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease- and trait-associated variation. Nat. Genet. 50, 956–967 (2018).

  4. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022).

  5. Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).

  6. Abdellaoui, A., Yengo, L., Verweij, K. J. H. & Visscher, P. M. 15 years of GWAS discovery: realizing the promise. Am. J. Hum. Genet. 110, 179–194 (2023).

  7. Boix, C. A., James, B. T., Park, Y. P., Meuleman, W. & Kellis, M. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).

  8. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).

  9. Gschwind, A. R. et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. Preprint at bioRxiv https://doi.org/10.1101/2023.11.09.563812 (2023).

  10. Heinz, S., Romanoski, C. E., Benner, C. & Glass, C. K. The selection and function of cell type-specific enhancers. Nat. Rev. Mol. Cell Biol. 16, 144–154 (2015).

  11. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).

  12. Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116 (2020).

  13. Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24, 494–515 (2023).

  14. Cuomo, A. S. E., Nathan, A., Raychaudhuri, S., MacArthur, D. G. & Powell, J. E. Single-cell genomics meets human genetics. Nat. Rev. Genet. 24, 535–549 (2023).

  15. Sakaue, S. et al. Tissue-specific enhancer–gene maps from multimodal single-cell data identify causal disease alleles. Nat. Genet. 56, 615–626 (2024).

  16. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).

  17. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).

  18. Pliner, H. A. et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell 71, 858–871 (2018).

  19. Kartha, V. K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genomics 2, 100166 (2022).

  20. Bravo González-Blas, C. et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat. Methods 20, 1355–1367 (2023).

  21. Mitra, S. et al. Single-cell multi-ome regression models identify functional and disease-associated enhancers and enable chromatin potential analysis. Nat. Genet. 56, 627–636 (2024).

  22. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (eds Krishnapuram, B. et al.) (Association for Computing Machinery, 2016).

  23. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  24. Wang, Q. S. et al. Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs. Nat. Commun. 12, 3394 (2021).

  25. Thakore, P. I. et al. Highly specific epigenome editing by CRISPR–Cas9 repressors for silencing of distal regulatory elements. Nat. Methods 12, 1143–1149 (2015).

  26. Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).

  27. Ulirsch, J. C. et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).

  28. Wakabayashi, A. et al. Insight into GATA1 transcriptional activity through interrogation of cis elements disrupted in human erythroid disorders. Proc. Natl Acad. Sci. USA 113, 4434–4439 (2016).

  29. Huang, J. et al. Dissecting super-enhancer hierarchy based on chromatin interactions. Nat. Commun. 9, 943 (2018).

  30. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).

  31. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).

  32. Schraivogel, D. et al. Targeted Perturb-seq enables genome-scale genetic screens in single cells. Nat. Methods 17, 629–635 (2020).

  33. Klann, T. S. et al. CRISPR–Cas9 epigenome editing enables high-throughput screening for functional regulatory elements in the human genome. Nat. Biotechnol. 35, 561–568 (2017).

  34. Weeks, E. M. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat. Genet. 55, 1267–1276 (2023).

  35. Kanai, M. et al. Insights from complex trait fine-mapping across diverse populations. Preprint at medRxiv https://doi.org/10.1101/2021.09.03.21262975 (2021).

  36. PBMC from a Healthy Donor (10k), Single Cell Multiome ATAC + Gene Expression Dataset by Cell Ranger 2.0.0 (10x Genomics, 2021); https://www.10xgenomics.com/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0

  37. Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and protein data in single cells. In 35th Conference on Neural Information Processing Systems Track on Datasets and Benchmarks (eds Vanschoren, J. & Yeung, S.) (Curran Associates, 2020).

  38. Xu, W. et al. ISSAAC-seq enables sensitive and flexible multimodal profiling of chromatin accessibility and gene expression in single cells. Nat. Methods 19, 1243–1249 (2022).

  39. Yardımcı, G. G. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).

  40. Zhou, B. et al. Comprehensive, integrated, and phased whole-genome analysis of the primary ENCODE cell line K562. Genome Res. 29, 472–484 (2019).

  41. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  42. Meijer, M. et al. Epigenomic priming of immune genes implicates oligodendroglia in multiple sclerosis susceptibility. Neuron 110, 1193–1210.e13 (2022).

  43. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B Stat. Methodol. 82, 1273–1300 (2020).

  44. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).

  45. Dutka, M., Zimmer, K., Ćwiertnia, M., Ilczak, T. & Bobiński, R. The role of PCSK9 in heart failure and other cardiovascular diseases—mechanisms of action beyond its effect on LDL cholesterol. Heart Fail. Rev. 29, 917–937 (2024).

  46. Bao, X. et al. Targeting proprotein convertase subtilisin/kexin type 9 (PCSK9): from bench to bedside. Signal Transduct. Target. Ther. 9, 13 (2024).

  47. Simeonov, D. R. et al. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature 549, 111–115 (2017).

  48. Burchill, M. A., Yang, J., Vang, K. B. & Farrar, M. A. Interleukin-2 receptor signaling in regulatory T cell development and homeostasis. Immunol. Lett. 114, 1–8 (2007).

  49. Stahl, E. A. et al. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat. Genet. 42, 508–514 (2010).

  50. Martin, J. E. et al. The autoimmune disease-associated IL2RA locus is involved in the clinical manifestations of systemic sclerosis. Genes Immun. 13, 191–196 (2012).

  51. Lowe, C. E. et al. Large-scale genetic fine mapping and genotype–phenotype associations implicate polymorphism in the IL2RA region in type 1 diabetes. Nat. Genet. 39, 1074–1082 (2007).

  52. Avellino, R. et al. An autonomous CEBPA enhancer specific for myeloid-lineage priming and neutrophilic differentiation. Blood 127, 2991–3003 (2016).

  53. Leroy, H. et al. CEBPA point mutations in hematological malignancies. Leukemia 19, 329–334 (2005).

  54. Pabst, T. & Mueller, B. U. Complexity of CEBPA dysregulation in human acute myeloid leukemia. Clin. Cancer Res. 15, 5303–5307 (2009).

  55. Du, Z., Luo, G., Wang, K., Bing, Z. & Pan, S. Identification of a novel heterozygous SPTB mutation by whole genome sequencing in a Chinese patient with hereditary spherocytosis and atrial septal defect: a case report. BMC Pediatr. 21, 291 (2021).

  56. Li, S. et al. A novel SPTB mutation causes hereditary spherocytosis via loss-of-function of β-spectrin. Ann. Hematol. 101, 731–738 (2022).

  57. Wang, Y. et al. A novel variant in the SPTB gene underlying hereditary spherocytosis and a literature review of previous variants. BMC Med. Genomics 17, 206 (2024).

  58. Rees, H. A. & Liu, D. R. Base editing: precision chemistry on the genome and transcriptome of living cells. Nat. Rev. Genet. 19, 770–788 (2018).

  59. Vernimmen, D. & Bickmore, W. A. The hierarchy of transcriptional activation: from enhancer to promoter. Trends Genet. 31, 696–708 (2015).

  60. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  61. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

  62. Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).

  63. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).

  64. Feng, J., Liu, T., Qin, B., Zhang, Y. & Liu, X. S. Identifying ChIP–seq enrichment using MACS. Nat. Protoc. 7, 1728–1740 (2012).

  65. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 296 (2019).

  66. Javierre, B. M. et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell 167, 1369–1384.e19 (2016).

  67. Võsa, U. et al. Large-scale cis- and trans-eQTL analyses identify thousands of genetic loci and polygenic scores that regulate blood gene expression. Nat. Genet. 53, 1300–1310 (2021).

  68. Dorans, E. elizabethdorans/pgBoost: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15276307

  69. Dorans, E. elizabethdorans/E2G_Method_Tutorials: v1.0.0. Zenodo https://doi.org/10.5281/zenodo.15284078

Acknowledgements

We thank J. Engreitz, R. Andersson, A. Gschwind and W. Qiu for valuable feedback on the manuscript and S. Raychaudhuri, S. Sunyaev, V. Sankaran, S. Sakaue, C. Boix, B. Strober, J. Rossen and M. Zhang for helpful discussions. This research was funded by NIH grants U01 HG012009, R56 HG013083, R01 MH101244, R37 MH107649, R01 HG006399 and R01 MH115676 (A.L.P.). K.D. is funded by NIH/NHGRI under grant R00HG012203, NIH/NCI Cancer Center Support Grant P30CA008748 and the Josie Robertson Investigator Program.

Author information

Contributions

K.D. and K.J. conceived this study. E.D. developed the pgBoost model and conducted all analyses. A.L.P., K.D. and K.J. provided critical scientific guidance. E.D. and A.L.P. wrote the initial draft of the manuscript. All authors contributed to the final manuscript.

Corresponding authors

Correspondence to Elizabeth Dorans, Kushal Dey or Alkes L. Price.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Tim Stuart and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of pgBoost vs. a logistic regression model.

Average enrichment across recall values of links predicted by pgBoost vs. a logistic regression model with identical features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). Gradient boosting significantly outperformed logistic regression on eQTL evaluation data and CRISPR evaluation data >50 kb. Gradient boosting did not significantly outperform logistic regression on GWAS evaluation data. While gradient boosting significantly outperformed logistic regression on all ABC evaluation data (>1 kb and >5 kb), gradient boosting significantly underperformed logistic regression on longer-range ABC evaluation data (>50 kb and >100 kb); this could be due to the gradient boosting model assigning higher weight to distance-based features than the logistic regression model (however, we do not observe this behavior in other evaluation sets).
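The evaluation metric and significance test named in this caption can be sketched as follows. The exact definitions are given in the paper's Methods, which are not reproduced here, so this sketch rests on two labeled assumptions: enrichment at a recall level is taken to be the precision among the top-ranked links needed to reach that recall, divided by the overall fraction of positive links, and the two-sided bootstrap p-value is taken to be twice the smaller tail fraction of resampled differences. The recall grid and function names are hypothetical.

```python
import math
import random

def average_enrichment(scores, labels, recalls=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Mean over recall levels of precision-at-recall divided by the
    baseline positive fraction (assumed definition of enrichment)."""
    n_pos = sum(labels)
    baseline = n_pos / len(labels)
    # Rank candidate links from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    enrichments = []
    for r in recalls:
        needed = max(1, math.ceil(r * n_pos))
        found = 0
        for rank, i in enumerate(order, start=1):
            found += labels[i]
            if found >= needed:
                enrichments.append((found / rank) / baseline)
                break
    return sum(enrichments) / len(enrichments)

def bootstrap_p_value(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Two-sided bootstrap p-value for the difference in average enrichment
    between two methods, resampling candidate links with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if sum(lab) in (0, n):
            continue  # enrichment is undefined without both classes
        a = average_enrichment([scores_a[i] for i in idx], lab)
        b = average_enrichment([scores_b[i] for i in idx], lab)
        diffs.append(a - b)
    frac_le = sum(d <= 0 for d in diffs) / len(diffs)
    frac_ge = sum(d >= 0 for d in diffs) / len(diffs)
    return min(1.0, 2 * min(frac_le, frac_ge))

# Toy example: eight candidate links, four of them positive.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
good = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
bad = [-s for s in good]  # same links ranked in reverse

enrichment_good = average_enrichment(good, labels)
enrichment_bad = average_enrichment(bad, labels)
p_self = bootstrap_p_value(good, good, labels)
```

Resampling links rather than re-fitting the model is a simplification chosen to keep the sketch short; it only quantifies uncertainty in the evaluation set.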

Extended Data Fig. 2 Performance of pgBoost vs. fine-mapped eQTL.

Average enrichment across recall values of links predicted by pgBoost vs. fine-mapped eQTL (ranking candidate SNP-gene links by maximum PIP across GTEx tissues) for (a) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).
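Ranking candidate links by their maximum score across contexts, as done here with PIP across GTEx tissues, amounts to a group-wise maximum followed by a sort. The sketch below assumes a flat list of (SNP, gene, tissue, PIP) records; the record layout, identifiers and values are hypothetical.

```python
def rank_by_max_score(records):
    """Collapse per-context scores to one maximum per SNP-gene link,
    then rank links from highest to lowest maximum."""
    best = {}
    for snp, gene, _context, score in records:
        key = (snp, gene)
        if key not in best or score > best[key]:
            best[key] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical fine-mapping records: (SNP, gene, tissue, PIP).
records = [
    ("snp1", "GENE_A", "whole_blood", 0.10),
    ("snp1", "GENE_A", "liver", 0.85),
    ("snp2", "GENE_A", "whole_blood", 0.40),
    ("snp3", "GENE_B", "liver", 0.02),
]
ranking = rank_by_max_score(records)
```

The same aggregation applies to any per-biosample score, for example a maximum ABC score across biosamples.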

Extended Data Fig. 3 Performance of pgBoost trained on eQTL vs. CRISPR data.

Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs vs. CRISPR-validated links for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds. While other regulatory linking methods have used CRISPR links for model training, pgBoost may perform better when trained on eQTL due to the limited amount of CRISPR data, pervasive aneuploidy and structural variants in the K562 cell line, and/or better representation of multiple cell types in GTEx eQTL data. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_CRISPR, we defined a training set comprising 892 SNP-gene links validated by CRISPR as positives and 10,345 links tested but not validated by CRISPR as negatives (Supplementary Table 7).

Extended Data Fig. 4 Performance of pgBoost trained on eQTL data from all tissues vs. whole blood.

Average enrichment across recall values of links predicted by pgBoost trained on fine-mapped eSNP-eGene pairs from all GTEx tissues vs. fine-mapped eSNP-eGene pairs from whole blood for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, and (e) 712 fine-mapped eSNP-eGene pairs attaining PIP > 0.5 in GTEx whole blood, at various distance thresholds. The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods). For pgBoost_Blood-eQTL, we defined a training set of 1,518 eSNP-eGene pairs attaining PIP > 0.2 in whole blood as positives and 6,997 eSNP-eGene pairs including a gene in the positive set and attaining PIP < 0.01 in whole blood as negatives.
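The training-set rule stated at the end of this caption (positives with PIP > 0.2; negatives with PIP < 0.01 whose gene appears in the positive set) can be written as a short filter. This is a sketch under the assumption that fine-mapping results are available as a mapping from (SNP, gene) to PIP in the relevant tissue; the function name, data layout and toy values are hypothetical.

```python
def build_training_set(pip_by_link, pos_threshold=0.2, neg_threshold=0.01):
    """Split candidate links into positives and negatives by PIP.

    Negatives are restricted to genes that have at least one positive link,
    so every gene in the training set contributes a positive."""
    positives = {link for link, pip in pip_by_link.items() if pip > pos_threshold}
    positive_genes = {gene for _snp, gene in positives}
    negatives = {
        link for link, pip in pip_by_link.items()
        if pip < neg_threshold and link[1] in positive_genes
    }
    return sorted(positives), sorted(negatives)

# Hypothetical whole-blood fine-mapping results: (SNP, gene) -> PIP.
pips = {
    ("snp1", "GENE_A"): 0.9,    # positive
    ("snp2", "GENE_A"): 0.001,  # negative: low PIP, gene has a positive
    ("snp3", "GENE_B"): 0.005,  # dropped: gene has no positive link
    ("snp4", "GENE_A"): 0.1,    # dropped: between the two thresholds
    ("snp5", "GENE_C"): 0.3,    # positive
}
positives, negatives = build_training_set(pips)
```

Links with intermediate PIP are left out of both classes, which keeps ambiguous fine-mapping results from acting as noisy labels.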

Extended Data Fig. 5 Performance of pgBoost models restricted to a single existing method or distance.

Average enrichment across recall values of links predicted by pgBoost restricted to linking scores from a single existing method or genomic distance for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. Because enrichment differences are highly similar across distance thresholds, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for the difference of the main pgBoost model (denoted in black outline) vs. each other method (*: p < 0.05, **: p < 0.01, ***: p < 0.001) (Methods). Red stars indicate that the focal method significantly underperforms pgBoost.

Extended Data Fig. 6 Performance of pgBoost models adding or ablating one existing method.

Average enrichment across recall values of links predicted by pgBoost (denoted in black outline) vs. pgBoost models ablating features from one existing peak-gene linking method (SCENT, Signac, Cicero), ablating distance-based features, or adding ArchR features for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 53,701 SNP-gene pairs attaining maximum ABC score > 0.2 across 344 biosamples, (c) 892 links validated by CRISPR, and (d) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window. (e) Table summarizing the significance of differences in average enrichments between pgBoost and each other model on the 4 evaluation sets. Because enrichment differences are highly similar across distance thresholds in a–d, results at >10 kb are shown as barplots to aid in visual comparison. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. main pgBoost model (Methods). Red stars (in a–d) or red shading (in e) indicate that the focal method significantly underperformed pgBoost. Black stars (in a–d) or blue shading (in e) indicate that the focal method significantly outperformed pgBoost. While the main pgBoost model significantly underperformed the model ablating Signac on ABC evaluation data, we prioritized results on eQTL and CRISPR evaluations (see Supplementary Note). The main pgBoost model significantly outperformed each ablated model on at least 1 evaluation data set and never significantly underperformed an expanded model, supporting our choice to include SCENT, Signac, Cicero, and distance as features (and exclude ArchR).

Extended Data Fig. 7 Performance of pgBoost vs. ENCODE-rE2G across biosamples.

Average enrichment across recall values of links predicted by pgBoost vs. ENCODE-rE2G (maximum score across biosamples; see Supplementary Table 16) for (a) 4,434 fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 across GTEx tissues, (b) 892 links validated by CRISPR, and (c) 155 non-coding SNP-gene pairs derived from fine-mapped GWAS variants with a unique fine-mapped coding variant within a 2 Mb window, at various distance thresholds (we exclude ABC evaluation data from these comparisons, because the ABC evaluation data were used as features to generate ENCODE-rE2G predictions). The number of positive evaluation links at each distance threshold is specified in parentheses. Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method vs. each other method (Methods).

Extended Data Fig. 8 Performance of pgBoost restricted to features from a focal cell type on fine-mapped GWAS variants.

Average enrichment across recall values of variants linked to genes by pgBoost restricted to features from a focal cell type (Union denotes all features) for (a) n = 488 fine-mapped variants (PIP > 0.2) for 7 autoimmune diseases or granulocyte-related blood cell traits, (b) n = 455 fine-mapped variants (PIP > 0.2) for 7 red blood cell or platelet-related blood cell traits, (c) fine-mapped variants for individual traits from (a), (d) fine-mapped variants for individual traits from (b). Confidence intervals denote standard errors, and stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of focal method vs. top method (denoted in black outline) (Methods). Plt: platelet count, HbA1c: hemoglobin A1c, Hb: hemoglobin, RBC: red blood cell count, MCV: mean corpuscular volume, MCH: mean corpuscular hemoglobin, MCHC: mean corpuscular hemoglobin concentration, WBC: white blood cell count, Mono: monocyte count, Eosino: eosinophil count, AID: autoimmune disease, Lym: lymphocyte count, Neutro: neutrophil count, Baso: basophil count.

Extended Data Fig. 9 Performance of pgBoost restricted to features from a focal cell type on eQTL data from LCL.

Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for fine-mapped eSNP-eGene pairs attaining maximum PIP > 0.5 in GTEx EBV-transformed lymphocytes or LCL (number of evaluation links: n = 79). We expanded the set of pgBoost models to include a model restricted to LCL features from the SHARE-seq LCL data set (Table 2). The pgBoost model restricted to LCL significantly outperformed models restricted to B cell, myeloid, or erythroid features and the model spanning the union of all 5 cell types, and performed similarly to the model restricted to T cell features. (We did not analyze cell-type-specific GTEx evaluation data for T cells, B cells, myeloid cells, or erythroid cells, because these cell types are not represented in GTEx). Confidence intervals denote standard errors. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline) vs. each other method (Methods). Red stars denote significant underperformance of the focal method vs. the top method.

Extended Data Fig. 10 Performance of pgBoost restricted to features from a focal cell type in cell-type-level and tissue-level ABC evaluation data sets.

Average enrichment across recall values of links predicted by pgBoost restricted to features from a focal cell type (columns; Union denotes all features) for cell-type-level (left) or tissue-level (right) ABC evaluation links (rows) defined by SNP-gene links attaining ABC score > 0.2 in biosample(s) related to the focal cell type or tissue (irrespective of ABC scores attained in other biosamples). Number of evaluation links in each ABC evaluation set: n_T = 19,301, n_B = 17,638, n_myeloid = 10,632, n_K562 = 3,793, n_blood = 33,867, n_brain = 19,623. Number of biosamples in each ABC evaluation set: N_blood = 74, N_brain = 26, N_T = 53, N_B = 14, N_myeloid = 5, N_K562 = 1. Stars denote 2-sided bootstrap p-values for difference (*: p < 0.05, **: p < 0.01, ***: p < 0.001) of top method (denoted in black outline; defined separately for each row) vs. each other method (Methods). For cell-type-level evaluation data sets (left), pgBoost models restricted to the focal cell type significantly outperformed each other model, in general (except for the T cell-level ABC evaluation data set, for which the model restricted to T cell features did not significantly outperform the model restricted to B cell features, but significantly outperformed each other model). On brain-level evaluation data (right), the pgBoost model restricted to brain oligodendroglia features significantly outperformed the model restricted to K562 features and the model spanning both cell types. On blood-level evaluation data (right), the pgBoost model spanning the union of both cell types significantly outperformed both other models.
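The way a focal ABC evaluation set is defined in this caption (ABC score > 0.2 in at least one biosample related to the focal cell type, irrespective of scores in other biosamples) can be sketched as a filter over per-biosample scores. The record layout, biosample names and values are hypothetical; the real biosample groupings are those listed in the paper's supplementary tables.

```python
def focal_abc_links(records, focal_biosamples, threshold=0.2):
    """Return SNP-gene links whose ABC score exceeds the threshold in at
    least one focal biosample; scores in other biosamples are ignored."""
    return {
        (snp, gene)
        for snp, gene, biosample, abc in records
        if biosample in focal_biosamples and abc > threshold
    }

# Hypothetical per-biosample ABC scores: (SNP, gene, biosample, ABC score).
records = [
    ("snp1", "GENE_A", "T_cell_1", 0.35),
    ("snp1", "GENE_A", "B_cell_1", 0.05),
    ("snp2", "GENE_B", "T_cell_2", 0.10),
    ("snp2", "GENE_B", "B_cell_1", 0.60),
    ("snp3", "GENE_C", "T_cell_1", 0.25),
    ("snp3", "GENE_C", "B_cell_1", 0.30),
]
t_cell_links = focal_abc_links(records, {"T_cell_1", "T_cell_2"})
b_cell_links = focal_abc_links(records, {"B_cell_1"})
```

Because membership depends only on the focal biosamples, a link can belong to several evaluation sets at once, as the third link does here.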

Supplementary information

Supplementary Information

Supplementary Figs. 1–45 and Note.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–16.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dorans, E., Jagadeesh, K., Dey, K. et al. Linking regulatory variants to target genes by integrating single-cell multiome methods and genomic distance. Nat Genet 57, 1649–1658 (2025). https://doi.org/10.1038/s41588-025-02220-3

  • DOI: https://doi.org/10.1038/s41588-025-02220-3
