  • Article

Decoding sequence determinants of gene expression in diverse cellular and disease states

Abstract

Sequence-to-function models that predict gene expression from genomic DNA sequence have proven valuable for many biological tasks, including understanding cis-regulatory syntax and interpreting noncoding genetic variation. However, current state-of-the-art models are trained largely on bulk expression profiles from healthy tissues or cell lines and have not learned the properties of precise cell types and states that are captured in large-scale single-cell transcriptomic datasets. Thus, they cannot perform these tasks at the resolution of specific cell types or states. Here we present Decima, a model that predicts the cell type- and condition-specific expression of a gene from its surrounding DNA sequence. Decima is trained on single-cell or single-nucleus RNA sequencing data from over 22 million cells and successfully predicts the cell-type-specific expression of unseen genes. We demonstrate Decima’s ability to reveal cis-regulatory mechanisms driving cell-type-specific gene expression and its changes in disease, predict noncoding-variant effects at cell type resolution and design context-specific regulatory DNA elements.

Fig. 1: The Decima model and its evaluation.
Fig. 2: Decima’s attributions highlight regulatory elements.
Fig. 3: Decima’s attributions reveal drivers of cell-type identity.
Fig. 4: Decima predicts variant effects at cell type resolution.
Fig. 5: Decima reveals the potential effects of disease-associated variants.
Fig. 6: Decima identifies TFs underlying disease-associated expression changes.

Data availability

Publicly available sc/snRNA-seq count matrices were downloaded from the following sources. SCimilarity (sc/snRNA-seq): Individual datasets were downloaded and prepared as described in ref. 15. Brain Cell Atlas (snRNA-seq): https://www.braincellatlas.org/dataSet. Human skin atlas: https://singlecell.broadinstitute.org/single_cell/study/SCP2738. Human retina atlas: https://cellxgene.cziscience.com/collections/4c6eaf5c-6d57-4c76-b1e9-60df8c655f1e. Genome annotation files containing gene and exon coordinates were obtained via CellRanger at https://www.10xgenomics.com/support/software/cell-ranger/latest and via the National Center for Biotechnology Information (NCBI) gene database at https://www.ncbi.nlm.nih.gov/gene. sc-eQTL variants were obtained from the EBI eQTL catalog (accession no. QTS000038). Sources for all GWAS variant datasets are presented in Supplementary Table 5. A reference set of fine-mapped eQTLs was obtained from the September 2022 release of Open Targets at https://ftp.ebi.ac.uk/pub/databases/opentargets/genetics/22.09/. Model weights and predictions made by Decima for all genes and variants are available via Zenodo at https://doi.org/10.5281/zenodo.18142522 (ref. 65).

Code availability

Decima is available via GitHub at https://github.com/Genentech/decima. Data were processed using Scanpy v1.10.2, AnnData v0.10.8 and pandas v2.1.4. The models used in this paper were trained and applied using Decima version 0.1, PyTorch v2.2.2, PyTorch Lightning v2.4.0, wandb v0.17 and Python v3.11.9. Models were trained on a single NVIDIA A100 GPU with CUDA 12.1. Analyses of the model’s predictions used modisco-lite v2.2.1, MEME suite v0.5.5 and gimmemotifs v0.18.0. Fine-mapping was performed using DENTIST v0.9.2.1, PolyFun v1.0.0 and susieR v0.11. Code used to process data, train models and perform all analyses in this paper is available via GitHub at https://github.com/Genentech/decima-applications and via Zenodo at https://doi.org/10.5281/zenodo.18142522 (ref. 65). A Snakemake-based pipeline for GWAS fine-mapping is available upon request.
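
For readers trying to reproduce the software environment listed above, the following is a minimal sketch that compares installed Python packages against the reported versions. The distribution names used as dictionary keys (for example `torch` for PyTorch and `pytorch-lightning` for PyTorch Lightning) are assumptions about how these tools are published on PyPI, not names given in the statement, and the non-Python tools (MEME suite, DENTIST, PolyFun, susieR) are left out.

```python
from importlib import metadata

# Versions quoted in the Code availability statement. The keys are ASSUMED
# PyPI distribution names; the statement itself gives only tool names.
REPORTED = {
    "scanpy": "1.10.2",
    "anndata": "0.10.8",
    "pandas": "2.1.4",
    "torch": "2.2.2",
    "pytorch-lightning": "2.4.0",
    "wandb": "0.17",
    "modisco-lite": "2.2.1",
    "gimmemotifs": "0.18.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(reported, lookup=installed_version):
    """Map each package to 'ok', 'missing' or 'differs (found X)'.

    A reported version such as '0.17' is treated as a prefix, so an
    installed '0.17.3' counts as a match. Local build suffixes such as
    '+cu121' are ignored.
    """
    report = {}
    for name, wanted in reported.items():
        found = lookup(name)
        if found is None:
            report[name] = "missing"
            continue
        base = found.split("+")[0]
        if base == wanted or base.startswith(wanted + "."):
            report[name] = "ok"
        else:
            report[name] = f"differs (found {found})"
    return report


if __name__ == "__main__":
    for name, status in compare_versions(REPORTED).items():
        print(f"{name:20s} {REPORTED[name]:10s} {status}")
```

The `lookup` argument exists only so the comparison can be exercised without installing anything; in normal use the default reads the active environment.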

References

  1. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).

  2. Trevino, A. E. et al. Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell 184, 5053–5069 (2021).

  3. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).

  4. Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

  5. Sasse, A., Chikina, M. & Mostafavi, S. Unlocking gene regulation with sequence-to-function models. Nat. Methods 21, 1374–1377 (2024).

  6. Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).

  7. Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

  8. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).

  9. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).

  10. Holland, C. H. et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol. 21, 1–19 (2020).

  11. Badia-i-Mompel, P. et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat. Rev. Genet. 24, 739–754 (2023).

  12. Schwessinger, R., Deasy, J., Woodruff, R. T., Young, S. & Branson, K. M. Single-cell gene expression prediction from DNA sequence at large contexts. Preprint at bioRxiv https://doi.org/10.1101/2023.07.26.550634 (2023).

  13. Li, J. et al. Deep learning of cross-species single-cell landscapes identifies conserved regulatory programs underlying cell types. Nat. Genet. 54, 1711–1720 (2022).

  14. Hingerl, J. C. et al. scooby: modeling multi-modal genomic profiles from DNA sequence at single-cell resolution. Nat. Methods 22, 2275–2285 (2025).

  15. Heimberg, G. et al. A cell atlas foundation model for scalable search of similar human cells. Nature 638, 1085–1094 (2025).

  16. Chen, X. et al. A brain cell atlas integrating single-cell transcriptomes across human brain regions. Nat. Med. 30, 2679–2691 (2024).

  17. Fiskin, E. et al. Multi-modal skin atlas identifies a multicellular immune-stromal community associated with altered cornification and specific T cell expansion in atopic dermatitis. Nat. Commun. 17, 3194 (2026).

  18. Li, J. et al. Integrated multi-omics single cell atlas of the human retina. Preprint at bioRxiv https://doi.org/10.1101/2023.11.07.566105 (2023).

  19. Eraslan, G. et al. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Science 376, eabl4290 (2022).

  20. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In International Conference on Machine Learning 3145–3153 (PMLR, 2017).

  21. ENCODE Project Consortium, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).

  22. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).

  23. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).

  24. Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001 (2021).

  25. Grainger, S., Hryniuk, A. & Lohnes, D. Cdx1 and Cdx2 exhibit transcriptional specificity in the intestine. PLoS ONE 8, e54757 (2013).

  26. Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at https://arxiv.org/abs/1811.00416 (2020).

  27. Piasecki, B. P., Burghoorn, J. & Swoboda, P. Regulatory Factor X (RFX)-mediated transcriptional rewiring of ciliary genes in animals. Proc. Natl Acad. Sci. USA 107, 12969–12974 (2010).

  28. Little, D. R. et al. Differential chromatin binding of the lung lineage transcription factor NKX2-1 resolves opposing murine alveolar cell fates in vivo. Nat. Commun. 12, 1–18 (2021).

  29. Daniely, Y. et al. Critical role of p63 in the development of a normal esophageal and tracheobronchial epithelium. Am. J. Physiol. Cell Physiol. 287, C171–C181 (2004).

  30. Yang, H., Lu, M. M., Zhang, L., Whitsett, J. A. & Morrisey, E. E. GATA6 regulates differentiation of distal lung epithelium. Development 129, 2233–2246 (2002).

  31. Shiraishi, K. et al. Airway epithelial cell identity and plasticity are constrained by Sox2 during lung homeostasis, tissue regeneration, and in human disease. npj Regen. Med. 9, 1–14 (2024).

  32. Kersbergen, A. et al. Lung morphogenesis is orchestrated through Grainyhead-like 2 (Grhl2) transcriptional programs. Dev. Biol. 443, 1–9 (2018).

  33. Mall, M. et al. Myt1l safeguards neuronal identity by actively repressing many non-neuronal fates. Nature 544, 245–249 (2017).

  34. Qureshi, I. A., Gokhan, S. & Mehler, M. F. REST and CoREST are transcriptional and epigenetic regulators of seminal neural fate decisions. Cell Cycle 9, 4477 (2010).

  35. Masuda, S., Matsuura, K. & Shimizu, T. GATA6 regulates anti-angiogenic properties in human cardiac fibroblasts via modulating LYPD1 expression. Regen. Ther. 23, 8–16 (2023).

  36. Song, S. et al. TEA domain transcription factor 1 (TEAD1) induces cardiac fibroblasts cells remodeling through BRD4/Wnt4 pathway. Signal Transduct. Targeted Ther. 9, 1–12 (2024).

  37. Burgos Villar, K. N., Liu, X. & Small, E. M. Transcriptional regulation of cardiac fibroblast phenotypic plasticity. Curr. Opin. Physiol. 28, 100556 (2022).

  38. Yazar, S. et al. Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).

  39. Kerimov, N. et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat. Genet. 53, 1290–1299 (2021).

  40. Rosenbauer, F. & Tenen, D. G. Transcription factors in myeloid development: balancing differentiation with transformation. Nat. Rev. Immunol. 7, 105–117 (2007).

  41. Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet. 55, 1866–1875 (2023).

  42. Zhang, M. J. et al. Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nat. Genet. 54, 1572–1580 (2022).

  43. Alves-Bezerra, M. & Cohen, D. E. Triglyceride metabolism in the liver. Compr. Physiol. 8, 1 (2017).

  44. Wenzel, P. Monocytes as immune targets in arterial hypertension. Br. J. Pharmacol. 176, 1966 (2018).

  45. Hung, H. L. et al. Stimulation of NF-E2 DNA binding by CREB-binding protein (CBP)-mediated acetylation. J. Biol. Chem. 276, 10715–10721 (2001).

  46. Kim, S. et al. DNA-guided transcription factor cooperativity shapes face and limb mesenchyme. Cell 187, 692–711 (2024).

  47. Yeo, S.-Y. et al. A positive feedback loop bi-stably activates fibroblasts. Nat. Commun. 9, 3016 (2018).

  48. Schreiber, S., Nikolaus, S. & Hampe, J. Activation of nuclear factor κB in inflammatory bowel disease. Gut 42, 477 (1998).

  49. Han, Y. M. et al. NF-kappa B activation correlates with disease phenotype in Crohn’s disease. PLoS ONE 12, e0182071 (2017).

  50. Taskiran, I. I. et al. Cell-type-directed design of synthetic enhancers. Nature 626, 212–220 (2024).

  51. de Almeida, B. P. et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 626, 207–211 (2024).

  52. Gosai, S. J. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024).

  53. Lal, A., Gunsalus, L., Nair, S., Biancalani, T. & Eraslan, G. gReLU: a comprehensive framework for DNA sequence modeling and design. Nat. Methods 22, 2253–2257 (2025).

  54. Wang, D., Tai, P. W. L. & Gao, G. Adeno-associated virus vector as a platform for gene therapy delivery. Nat. Rev. Drug Discov. 18, 358–378 (2019).

  55. Ponjavic, J. et al. Transcriptional and structural impact of TATA-initiation site spacing in mammalian core promoters. Genome Biol. 7, R78 (2006).

  56. Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).

  57. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).

  58. Bruse, N. & van Heeringen, S. J. GimmeMotifs: an analysis framework for transcription factor motif analysis. Preprint at bioRxiv https://doi.org/10.1101/474403 (2018).

  59. Majdandzic, A., Rajesh, C. & Koo, P. K. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol. 24, 1–13 (2023).

  60. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B 82, 1273–1300 (2020).

  61. Chen, W. et al. Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors. Nat. Commun. 12, 7117 (2021).

  62. Weissbrod, O. et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat. Genet. 52, 1355–1363 (2020).

  63. Gazal, S. et al. Functional architecture of low-frequency variants highlights strength of negative selection across coding and non-coding annotations. Nat. Genet. 50, 1600–1607 (2018).

  64. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  65. Lal, A. Decoding sequence determinants of gene expression in diverse cellular and disease states. Zenodo https://doi.org/10.5281/zenodo.18142522 (2024).

Acknowledgements

We thank the following for their helpful comments: A. Regev, J. Rock, O. Ursu, H. Jasper, T. Sterne-Weiler, X. Yao, S. Mostafavi, C. Cox, M. H. Celik, K. Fletez-Brant, D. Chang, N. Jorstad, J. Song and J. Hingerl.

Author information

Contributions

G.E., J.L.C. and A.L. processed single-cell data. A.L. trained the model. A.L., A.K., D.G., L.G., G.E. and A.M.T. performed analyses with assistance from S.N., M.G.G. and N.D. J.B., B.V.D.G. and T. Bhangale processed GWAS data. G.E., T. Biancalani, H.C.B. and G.S. supervised the work. G.E., A.L., A.K., D.G., L.G. and H.C.B. wrote the paper. All authors read and approved the paper.

Corresponding authors

Correspondence to Avantika Lal or Gokcen Eraslan.

Ethics declarations

Competing interests

All authors were employed by Genentech, Inc. while contributing to this study.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Decima predicts the expression patterns of cell type-specific genes.

a) Schematic illustrating the approach used to evaluate Decima’s performance in identifying cell type-specific genes. b) Histogram showing the area under the receiver operating characteristic curve (AUROC) for classification of specific versus nonspecific genes in each cell type, based on Decima’s predictions in the same cell type.
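
As an illustration of the metric named in panel b, the sketch below computes an AUROC from scores and binary labels using the rank-based (Mann-Whitney) formulation. It is a generic, dependency-free implementation written for this note and is not taken from the Decima codebase; the example scores and labels are invented.

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a positive outranks a negative.

    Tied scores receive the average of the ranks they span, which is the
    standard Mann-Whitney U treatment of ties.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the end of the run of items tied with pairs[i].
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Invented example: predicted expression for four genes in one cell type,
    # with label 1 marking genes treated as specific to that cell type.
    predicted = [0.1, 0.4, 0.35, 0.8]
    is_specific = [0, 0, 1, 1]
    print(auroc(predicted, is_specific))
```

A value of 0.5 corresponds to predictions that do not separate the two gene classes, and 1.0 to perfect separation.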

Extended Data Fig. 2 Design of a fibroblast-specific regulatory element in the context of Crohn’s disease using directed evolution.

a) Schematic showing promoter design through directed evolution with Decima. b) Predicted expression of the cargo gene across healthy and diseased fibroblast and non-fibroblast cells over 100 rounds of directed evolution. c) Predicted specificity of cargo gene expression in fibroblasts and disease fibroblasts; the design was optimized for cell-type specificity in rounds 0–50 and for disease-state specificity in rounds 50–100. d) In silico mutagenesis (ISM) of the synthetic regulatory element reveals key sequence motifs whose perturbation is predicted to uniquely affect expression of the cargo gene in fibroblasts. e, f) ISM with respect to disease fibroblast expression identifies key motifs generated in the design process for the fibroblast-evolved (top) and disease-fibroblast-evolved (bottom) sequences, including TWIST1, C/EBP and IRF motifs, which are implicated in fibroblast-specific and immune-specific regulation. g) These motifs match HOCOMOCO v12 motifs.
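
To make the directed-evolution idea in this caption concrete, here is a toy sketch of a greedy variant: in each round every single-base substitution is scored and the best one is kept. The scoring function `toy_scorer`, its two motifs and the "on-target minus off-target" objective are all hypothetical placeholders chosen for illustration; they are not Decima, its API or the authors' objective, and the two-phase schedule described in panel c is not reproduced.

```python
import random

BASES = "ACGT"


def toy_scorer(seq):
    """HYPOTHETICAL stand-in for a model: returns (on_target, off_target).

    It counts one made-up 'on' motif and one made-up 'off' motif. It is not
    Decima and carries no biological meaning.
    """
    return float(seq.count("TGAC")), float(seq.count("GGGG"))


def specificity(seq, scorer):
    """Assumed objective: predicted on-target minus off-target expression."""
    on, off = scorer(seq)
    return on - off


def single_mutants(seq):
    """Yield every sequence that differs from seq by one substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield seq[:i] + alt + seq[i + 1:]


def directed_evolution(seq, scorer, rounds):
    """Greedy directed evolution: keep the best single mutant each round."""
    history = [specificity(seq, scorer)]
    for _ in range(rounds):
        best = max(single_mutants(seq), key=lambda s: specificity(s, scorer))
        if specificity(best, scorer) <= history[-1]:
            break  # no single substitution improves the objective
        seq = best
        history.append(specificity(seq, scorer))
    return seq, history


if __name__ == "__main__":
    rng = random.Random(0)
    start = "".join(rng.choice(BASES) for _ in range(40))
    evolved, history = directed_evolution(start, toy_scorer, rounds=10)
    print(evolved)
    print(history)
```

Swapping in a different `scorer` partway through the loop would be one way to mimic a change of objective between phases, though how the authors implemented that switch is not stated here.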

Supplementary information

Supplementary Information

Supplementary Figs. 1–20, Tables 2–8, Methods and references.

Reporting Summary

Peer Review File

Supplementary Table 1

List of single-cell RNA-seq or single-nucleus RNA-seq datasets included in Decima’s training data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lal, A., Karollus, A., Gunsalus, L. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. Nat Methods (2026). https://doi.org/10.1038/s41592-026-03102-0

  • DOI: https://doi.org/10.1038/s41592-026-03102-0
