Main

Spatio-molecular organization in the brain reflects cellular composition and circuit connectivity patterns that underlie fundamental structure–function relationships. Mapping patterns of gene expression onto brain topography using spatially-resolved transcriptomics (SRT) has facilitated new biological insights into the molecular mechanisms that link structure to function, and enabled better understanding of these relationships over development and disease progression1,2,3. In the human brain, these techniques have enabled molecular delineation of cortical laminae beyond classic histological definitions4,5, identified new cell types and their topographical organization in the noradrenergic locus coeruleus6 and described differentiation trajectories at multiple gestational stages of fetal brain7. These technologies have been deployed to better understand disease states in the brain, including mapping the local microenvironment of multiple sclerosis lesions8,9 and infiltration patterns of malignant glioblastoma10.

Comprehensive spatio-molecular mapping of the human hippocampus (HPC) is critical to understand how its unique organizational structure supports many fundamental biological processes11,12. The HPC includes the dentate gyrus (DG) and the cornu ammonis (CA) regions, subdivided into CA1–CA4, each of which contains specialized cell types and distinct laminar organization. The organization of these specialized neuronal cell types into neuropil-enriched layers, including the DG molecular layer (ML), stratum lucidum (SL) and stratum radiatum (SR), supports well-defined functions within the canonical HPC circuitry. The trisynaptic loop, which supports various features of learning, memory and the stress response, initiates with inputs from the entorhinal cortex (ENT), which traverse from DG to CA3 to CA1, culminating with a relay to the subiculum (SUB), the major output nucleus of the HPC13,14. Output circuits from the SUB to numerous cortical and subcortical regions control important cognitive and motivated behaviors11,15 and are implicated in neuropsychiatric and neurodevelopmental disorders16,17.

Defining the molecular composition of cell types that have specialized roles in HPC circuit function is a prerequisite to targeting their function for therapeutic interventions. However, available transcriptomic profiles generated using single-nucleus RNA-sequencing (snRNA-seq) from postmortem human HPC tissue18,19,20,21 lack important spatial information and do not retain cytosolic or synaptic transcripts22. Additionally, many existing transcriptomic datasets have focused specifically on the DG given its importance in development and aging23,24,25,26,27,28, or have inconsistently sampled across HPC subregions, resulting in cellular composition differences between donors. To investigate gene expression at cellular resolution across the human HPC, we curated postmortem human tissue specimens with well-defined HPC neuroanatomy that systematically encompassed all subfields and sampled across the structure’s diverse longitudinal axis. We deployed a discovery-based experimental design using a well-validated platform to measure gene expression transcriptome-wide in a spatial context and generated paired snRNA-seq data from adjacent tissue sections to investigate gene expression at cellular resolution. To maximize the utility and value of this data resource for the community, we sourced HPC tissue from the same adult, neurotypical brain donors for which we recently provided comprehensive, paired SRT and snRNA-seq data in the dorsolateral prefrontal cortex (dlPFC)5.

We used spot-level deconvolution and non-negative matrix factorization (NMF) to integrate the SRT and snRNA-seq datasets, providing biological insights about the molecular organization of HPC cell types, cell states and spatial domains in the human brain. We also deployed computational strategies to overcome the inherent limitations of postmortem human tissue by incorporating functional molecular data from model organisms. Specifically, we used the human gene expression data to identify latent factors, then incorporated existing rodent datasets that feature information on circuit connectivity and neural activity induction to make predictions about axonal projection targets and likelihood of ensemble recruitment in spatially-defined cellular populations of the human HPC. The ability to infer functional roles for human cell types within the context of intact circuitry has profound potential for understanding how function in the human brain is disrupted in disease.

Results

SRT in the human HPC

We performed Visium Spatial Gene Expression (Visium-H&E, 10x Genomics) and 3′ Single Cell Gene Expression (10x Genomics) experiments to generate paired SRT maps and snRNA-seq data of the postmortem human anterior HPC (Fig. 1a–c and Supplementary Table 1). Using the same ten neurotypical adult brain donors included in our previous dlPFC study5, we generated Visium-H&E data from two to five capture areas per donor to encompass all major HPC subfields (DG, CA1–CA4 and SUB; Supplementary Fig. 1 and for abbreviations see Supplementary Table 2). Following quality control (QC; Supplementary Figs. 2 and 3), we retained 150,917 spots from n = 36 capture areas, hereafter referred to as the SRT dataset.

Fig. 1: Experimental design to generate paired snRNA-seq and SRT data in the human HPC.

a, Postmortem human tissue blocks containing the anterior HPC were dissected from ten adult neurotypical brain donors. b, Tissue blocks were cryosectioned for snRNA-seq assays (gold), and placed on Visium slides (Visium-H&E, blue). Tissue sections (n = 2–4, 100 μm cryosections per donor) from all ten donors were collected from the same tissue blocks for measurement with the 10x Genomics Chromium 3′ gene expression platform. For each donor, two samples were generated, one sorted based on PI+ (purple) and the second sorted based on PI+ and NeuN+ (green). Replicate samples were collected from three donors for a total of n = 26 snRNA-seq libraries. Panels a and b were created with BioRender.com. c, Tissue blocks were scored with a razor blade to demarcate regions of interest and 10 μm tissue sections were collected for measurement with the 10x Genomics Visium-H&E platform. For all ten donors, two to five tissue sections were collected to include the extent of the HPC (n = 36 total capture areas). Dashed outlines indicate the approximate location of score marks for an example donor, with color indicating capture areas obtained from consecutive sections. Orientation was verified based on the expression of known marker genes, such as SNAP25 (dashed outline color corresponding to the capture area location on the tissue section). d, Canonical marker genes were identified as SVGs using nnSVG29. Spots are colored by log2-normalized counts. e, SRT data were clustered using PRECAST30 with k = 18, and clusters were annotated (columns) based on expression of known marker genes (rows). Cluster groupings indicated at the top of the heatmap define which clusters contributed to the broad domains of neuron, neuropil, WM and vasc/CSF. Hippocampal region abbreviations are also presented in Supplementary Table 2.

To identify HPC spatial domains in our SRT dataset, we leveraged spatially aware feature selection (nnSVG29) and clustering (PRECAST30) methods. nnSVG and PRECAST were chosen for their ability to improve on nonspatially aware feature selection and clustering methods and their computational efficiency31,32. We evaluated a range of clustering resolutions (k) and used the Akaike Information Criterion (AIC), marker gene expression, and comparison with histological annotations to inform our selection of k = 18 PRECAST clusters (Fig. 1d,e and Supplementary Figs. 4 and 5). Due to similar gene expression, multiple clusters corresponding to CA2–CA4 and to CA1 were collapsed into their respective domains (Supplementary Fig. 6a,b). The final spatial domains (Fig. 2a and Extended Data Fig. 1) identified the DG granule cell layer (GCL), CA2–CA4, CA1, SUB, cortical neurons of other retrohippocampal (RHP) regions, and a domain that appeared to be transitional between the SUB and RHP (SUB.RHP). To finalize spatial domains for neuropil clusters, we compared gene expression and anatomical localization to the histologically annotated SL, SR, DG ML, stratum lacunosum-moleculare (SLM) and subgranular zone of the DG (SGZ; Supplementary Fig. 6c,d), resulting in final neuropil spatial domains including ML, SL.SR, SR.SLM and SLM.SGZ. Although the SLM and SGZ are composed of distinct cell types and exhibit distinct functions, the combinatorial nature of the SLM.SGZ spatial domain was consistent with other neuropil regions SL.SR and SR.SLM, where the k = 18 PRECAST clusters were most accurately annotated to a combination of adjacent regions. We also annotated clusters of cell types that are not traditionally associated with fixed HPC anatomical regions: GABA, vasculature (Vasc) and choroid plexus (CP) (Extended Data Fig. 1). We compared these results with other clustering algorithms and found high concordance.

Fig. 2: Spatial domain annotation and DE in the human HPC using SRT data.

a, Integrated and merged spot plot of four Visium capture areas from the same donor (Br3942) with spots colored by the 16 spatial domains annotated from k = 18 PRECAST clusters. CA1.1/CA1.2 were collapsed to CA1 and CA2-4.1/CA2-4.2 were collapsed to CA2–CA4. See Supplementary Table 2 for abbreviations. b, Schematic illustrating the pseudobulking approach, which collapses spot-level data to the spatial domain level within each capture area by summing the total UMIs for each group. c, PCA of the pseudobulked samples captures variation across broad spatial domains. Neuron cell body-enriched (greens and light blue (GABA)), neuropil-enriched (grays), WM-enriched (purples) and Vasc/CSF (dark blue). d, Heatmap showing DEG expression (rows) across the spatial domains (columns). Grouping across the top shows broad domain annotations. Spot plots are filled by log2-normalized counts. Spot borders are colored by broad domain. e–h, Spot plots demonstrate spatial expression of DEGs: PPFIA2, a known marker for the GCL (e), PRKCG, which is known to be enriched in CA1–CA4 domains (f), APOC1, a known astrocyte cell marker (g) and SFRP2, which was specifically increased in the SLM-subgranular zone (SGZ) domain (h). i, Volcano plots illustrate results from DE analysis for each broad-level domain, with log2(FC) on the x axis and FDR-adjusted, −log10-transformed P values on the y axis. Genes colored red pass both FDR and log2(FC) thresholds (FDR-adjusted P < 0.01 and log2(FC) > 1). Top DEGs for each broad domain grouping are labeled. j, Violin plots of pseudobulk expression of DEGs for neuron-enriched regions (CLSTN3), neuropil-enriched regions (SLC1A3), white matter regions (SHTN1) and vasc/CSF regions (TPM2). Spatial domains are on the x axis and normalized gene expression in log2(CPM) is on the y axis. Boxes are colored by broad cluster grouping.

To identify differentially expressed genes (DEGs) across the spatial domains, we employed a ‘layer-enriched’ linear mixed-effects modeling strategy performed on pseudobulked SRT data, as previously described4,5 (Fig. 2b–d, Extended Data Fig. 2, Supplementary Fig. 7 and Supplementary Table 4). We confirmed canonical marker genes for spatial domains, including PPFIA2 for GCL (Fig. 2e), PRKCG for pyramidal neuron layers CA1–CA4 (Fig. 2f) and MOBP for WM. We also identified several new marker genes, particularly with respect to the SUB and HPC neuropil-enriched domains (Fig. 2g,h (APOC1 and SFRP2, respectively)). Using the same modeling strategy, we identified genes specific to broad domains (neuron, neuropil, WM and Vasc/CSF; Fig. 2i,j and Supplementary Table 5). Altogether, these findings reveal widespread differences in gene expression across the HPC corresponding with both discrete spatial subregions and broad spatial domains.
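The pseudobulking step that precedes the enrichment model can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the spot records and counts are invented toy values, the function names `pseudobulk` and `log2_cpm` are hypothetical, and the pseudocount of 1 in the log2(CPM) transform is an assumption.

```python
import math
from collections import defaultdict


def pseudobulk(spots):
    """Sum UMI counts over all spots sharing a (capture_area, domain) pair.

    `spots` is an iterable of (capture_area, domain, {gene: umi_count}).
    """
    totals = defaultdict(lambda: defaultdict(int))
    for area, domain, counts in spots:
        for gene, n in counts.items():
            totals[(area, domain)][gene] += n
    return {key: dict(genes) for key, genes in totals.items()}


def log2_cpm(sample, pseudocount=1.0):
    """Convert one pseudobulked sample to log2 counts per million.

    The pseudocount is an assumption made for this sketch.
    """
    library_size = sum(sample.values())
    if library_size == 0:
        raise ValueError("sample has no counts")
    return {
        gene: math.log2(n / library_size * 1e6 + pseudocount)
        for gene, n in sample.items()
    }


# Toy spots with invented counts: two capture areas, two spatial domains.
spots = [
    ("A1", "GCL", {"PPFIA2": 5, "MOBP": 0}),
    ("A1", "GCL", {"PPFIA2": 7, "MOBP": 1}),
    ("A1", "WM", {"PPFIA2": 0, "MOBP": 9}),
    ("B1", "GCL", {"PPFIA2": 3, "MOBP": 0}),
]
bulk = pseudobulk(spots)
gcl_a1 = log2_cpm(bulk[("A1", "GCL")])
```

Each key of `bulk` is one pseudobulked sample (one spatial domain within one capture area), which is the unit a domain-level enrichment model would then compare.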

Paired snRNA-seq in the human HPC

While SRT captures spatial organization that unlocks circuit-level information based on extensive prior histological and anatomical research, snRNA-seq provides molecular profiling at improved cellular resolution. Using cryosections adjacent to those collected for the SRT dataset, we collected two populations of nuclei for snRNA-seq: a NeuN+ population to enrich for neurons and a PI+ population (Fig. 1b). After QC and preprocessing (Supplementary Figs. 8 and 9), we retained 75,411 high-quality nuclei across the ten donors. We implemented graph-based clustering on batch-corrected reduced dimension representations of gene expression patterns and annotated the resulting 60 high-resolution clusters based on known marker gene expression (Supplementary Figs. 10–12).

We identified DEGs across snRNA-seq clusters using the pseudobulk enrichment modeling approach above and used these results to collapse our clusters into 24 cell types (Fig. 3a,b and Supplementary Table 6). Through investigation of robust DEGs, we were further able to resolve minute differences between cell classes (Fig. 3c,d and Extended Data Fig. 3). Intriguingly, we observed subsets of GCs that express distinct forms of activin receptors (ACVR1, ACVR2A and ACVR1C), suggesting a stable heterogeneity within the GCL during adulthood (Extended Data Fig. 3g).

Fig. 3: Cell type identification and DE in the human HPC using snRNA-seq.

a, UMAP representation of snRNA-seq data. Individual nuclei are represented as points, colored and labeled according to their cell type. Cell type abbreviations are listed in Supplementary Table 2. b, Left, stacked bar plots of cell types indicated in a showing proportions of nuclei for each donor (columns). Right, stacked bar plots of cell types indicated in a, with columns indicating nuclei grouped by sort strategy (PI+ or PI+NeuN+), and across the overall dataset (all nuclei). c, Violin plots showing log2-normalized expression (y axis) of select DEGs identified with both spatial domain and n = 60 snRNA-seq cluster DE analyses. Nuclei are grouped based on cell types (x axis) for improved visibility, and the fill color also corresponds to cell type as indicated in a. d, Heatmap showing a selection of DEGs (rows) from snRNA-seq DE analysis across all n = 60 clusters (columns). Heatmap is colored by mean log2-normalized counts. Cell cluster abbreviations are also presented in Supplementary Table 2. 
UMAP, uniform manifold approximation and projection; GC, DG granule cell; CA1/ProS, CA1 and prosubiculum; Sub, subiculum; HATA, HPC–amygdala transition area; Amy, amygdala; Thal, thalamus; Cajal, Cajal–Retzius cells; GABA.PENK, PENK-positive GABAergic neurons; GABA.MGE, medial ganglionic eminence-derived GABAergic neurons; GABA.LAMP5, LAMP5-positive GABAergic neurons; GABA.CGE, caudal ganglionic eminence-derived GABAergic neurons; Micro/Macro/T, microglia and macrophages and T cells; Astro, astrocytes; Oligo, oligodendrocytes; OPC, oligodendrocyte progenitor cells; Ependy, ependymal cells; AHi, amygdala-hippocampal region; CXCL14, CXCL14-positive GABAergic neurons; HTR3A, HTR3A-positive GABAergic neurons; VIP, VIP-positive GABAergic neurons; CRABP1, CRABP1-positive GABAergic neurons; C1QL1, C1QL1-positive GABAergic neurons; PV.FS, PVALB-positive fast-spiking GABAergic neurons; SST, SST-positive GABAergic neurons; CORT, CORT-positive GABAergic neurons; COP, committed oligodendrocyte precursor; CP, choroid plexus tissue; Endo, endothelial cells; PC/SMC, pericytes/smooth muscle cells; VLMC, vascular leptomeningeal cells.

Correlating the gene-level enrichment model t statistics of snRNA-seq and SRT revealed strong agreement between the two profiling approaches (Extended Data Fig. 3h). We compared specific DEGs that were significant in pyramidal neuron groups of both data modalities (Figs. 2d and 3c). KIT, a CA3 marker in SRT33, is expressed in pyramidal cell types from CA3 and CA1, as well as in a specific GABAergic population. We observed that POU3F1 is expressed at similar levels in CA1 and Sub.1 clusters in snRNA-seq data, but is restricted to the CA1 in SRT. We identified new SUB markers COL24A1 and PART1 in both modalities, while KCNH5 and TESPA1 were specific to SUB in SRT but were expressed more broadly in snRNA-seq clusters (Figs. 2d and 3c).
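The cross-modality comparison of enrichment statistics amounts to correlating two gene-indexed vectors over their shared genes. The sketch below assumes a Pearson correlation (the text does not state which coefficient was used), uses invented t statistics, and the function name `correlate_t_stats` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlate_t_stats(t_first, t_second):
    """Correlate gene-level t statistics over genes present in both inputs."""
    shared = sorted(set(t_first) & set(t_second))
    r = pearson([t_first[g] for g in shared], [t_second[g] for g in shared])
    return r, shared


# Invented t statistics for one matched domain/cell type pair.
srt_t = {"PPFIA2": 8.0, "PRKCG": 2.0, "MOBP": -5.0, "APOC1": -1.0}
snrna_t = {"PPFIA2": 6.5, "PRKCG": 1.0, "MOBP": -4.0, "KIT": 0.3}
r, shared_genes = correlate_t_stats(srt_t, snrna_t)
```

Genes measured in only one modality (here `APOC1` and `KIT`) are dropped before the correlation is computed.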

The HPC is implicated in many pathological conditions, but the specific cell types and spatial domains driving these associations remain unclear16,34,35,36. To identify cell types and spatial domains associated with genetic risk for diseases and disorders, we used stratified linkage disequilibrium score regression (S-LDSC)37,38,39. S-LDSC regression coefficients represent the contribution of a given annotation (for example, spatial domain or cell type) to the heritability of a given trait, including neuropsychiatric conditions40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55. We observed many expected associations, including enrichment of Alzheimer’s disease risk in microglia41 (Supplementary Fig. 13a). Spatial domain analyses revealed enrichment of genetic risk for schizophrenia (SCZ) in the CA1 domain, and for multiple disorders, including SCZ, in the RHP (Supplementary Fig. 13b). We thus sought additional integration strategies for our snRNA-seq and SRT data to enable better understanding of the diversity of HPC cellular populations and their spatial organization.

Integrating snRNA-seq and SRT data

One approach to integrating SRT and snRNA-seq data involves estimating the cell type composition within individual Visium spots via spot-level deconvolution algorithms56. To provide a robust spot deconvolution reference dataset for postmortem human anterior HPC, we generated a Visium Spatial Proteogenomics (Visium-SPG) dataset, which replaces H&E histology with immunofluorescence (IF) staining to label proteins of interest. Visium-SPG was conducted with tissue from two donors (one male/one female) with clear anatomical orientation (Supplementary Fig. 14). By labeling cell-type-specific proteins (NEUN for neurons, OLIG2 for oligodendrocytes, GFAP for astrocytes and TMEM119 for microglia), the fluorescence intensity in each IF channel provides an orthogonal validation for predictions of cell type composition (Extended Data Fig. 4a,b). We benchmarked the ability of three spot-level deconvolution algorithms57,58,59 to leverage cell type identity, learned from marker genes in our snRNA-seq data and validated by IF staining, to predict the cell type composition of individual Visium-SPG spots (Extended Data Fig. 4c,d). We found the most consistent performance with RCTD57 and applied this method to the Visium-H&E SRT dataset (Extended Data Fig. 4d,e and Supplementary Note 1).

While spot-level deconvolution methods are valuable, most rely on a priori classification of reference data into discrete clusters and do not allow for the integrated representation of transcriptional variation beyond cell type. Another integration approach involves the consolidation of reference snRNA-seq gene expression information into reduced dimensions with NMF, and the subsequent transfer of these NMF patterns onto a query dataset (for example, SRT)60,61 (Extended Data Fig. 5). Unlike spot-level deconvolution, NMF patterns represent any source of gene expression variation and may be specific to cell types, cell states, other biological processes, or technical artifacts62,63. Additionally, NMF provides interpretable gene-level weights for each pattern, theoretically allowing for the direct biological interpretation of NMF patterns. Here we used NMF to integrate snRNA-seq and SRT datasets by first defining gene expression patterns within the snRNA-seq data and then mathematically projecting the patterns to transfer the weights onto SRT data to probe the spatial organization of these patterns (Extended Data Fig. 5).
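The two-step logic described above (learn patterns in a reference, then project them onto a query) can be illustrated with a small sketch. It assumes Lee–Seung multiplicative updates with a squared-error loss and a projection that re-estimates sample-level weights with the gene-level weights held fixed; the text does not specify the solver or projection routine, so these are stand-ins. All matrices are toy values and every function name is hypothetical.

```python
import random


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def update_h(v, w, h, eps=1e-9):
    """One multiplicative update of sample weights H with gene weights W fixed."""
    wt = transpose(w)
    num = matmul(wt, v)
    den = matmul(matmul(wt, w), h)
    return [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(h[0]))]
            for i in range(len(h))]


def update_w(v, w, h, eps=1e-9):
    """One multiplicative update of gene weights W with H fixed."""
    ht = transpose(h)
    num = matmul(v, ht)
    den = matmul(w, matmul(h, ht))
    return [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(len(w[0]))]
            for i in range(len(w))]


def nmf(v, k, n_iter=500, seed=0):
    """Factorize V (genes x samples) into W (genes x k) and H (k x samples)."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(k)]
    for _ in range(n_iter):
        h = update_h(v, w, h)
        w = update_w(v, w, h)
    return w, h


def project(v_query, w, n_iter=500, seed=1):
    """Transfer patterns: estimate query sample weights with W held fixed."""
    rng = random.Random(seed)
    h = [[rng.random() + 0.1 for _ in v_query[0]] for _ in range(len(w[0]))]
    for _ in range(n_iter):
        h = update_h(v_query, w, h)
    return h


def frob_error(v, w, h):
    wh = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, wh) for a, b in zip(ra, rb)) ** 0.5


# Toy reference (4 genes x 4 nuclei) built from two gene programs.
reference = [[4, 2, 0, 0], [2, 1, 0, 0], [0, 0, 3, 6], [0, 0, 1, 2]]
# Toy query (4 genes x 3 spots): program 1 only, program 2 only, a mixture.
query = [[2, 0, 2], [1, 0, 1], [0, 3, 3], [0, 1, 1]]

w, h = nmf(reference, k=2)
h_query = project(query, w)
```

In this toy case the third query spot mixes both programs, so it receives weight on both patterns, which is the kind of continuous representation that discrete cluster labels cannot express.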

We used NMF to distill normalized snRNA-seq counts into 100 NMF patterns. Unsurprisingly, many of these patterns were specific to HPC cell types (Extended Data Fig. 6a). Following pattern transfer to the SRT dataset and removal of poorly represented or donor-specific patterns, we observed 47 patterns that corresponded strongly to specific spatial domains (Extended Data Fig. 6c,d). We compared the ability of RCTD spot-level deconvolution and NMF to classify Visium-H&E spots, and found strong correlations with spot-level predictions for cell type proportion via RCTD (nmf77-Oligo ρ = 0.94, nmf81-Astro ρ = 0.86; Fig. 4a,b,d,e and Supplementary Fig. 15c–f). We performed gene set enrichment analysis (GSEA) using the ordinal rank of gene-level NMF weights and found that genes contributing to the transcription patterns captured by oligodendrocyte- and WM-dominant pattern nmf77 (ANK3 (ref. 64), RHOA65 and DPYSL2 (ref. 66)) and astrocyte-dominant pattern nmf81 (NRXN1 (ref. 67), NCAN68 and PLEC69) were consistent with the known biological role of these cell types (Fig. 4c,f). These findings highlight the agreement of the two integration approaches and the benefit of gene-level weights provided with NMF.
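To make the rank-based enrichment idea concrete, the sketch below computes an unweighted running-sum enrichment score from genes ordered by NMF weight. It is a simplified stand-in: standard GSEA implementations can weight hits by the ranking metric and derive normalized scores and P values by permutation, neither of which is done here. Gene names and weights are placeholders, and `enrichment_score` is a hypothetical helper.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walk down the ranked list, stepping up at each gene in the set and
    down at each gene outside it; return the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Placeholder gene-level weights for one pattern (not values from the study).
weights = {"gene1": 0.9, "gene2": 0.8, "gene3": 0.4,
           "gene4": 0.2, "gene5": 0.1, "gene6": 0.05}
ranked = sorted(weights, key=weights.get, reverse=True)

es_top = enrichment_score(ranked, {"gene1", "gene2"})     # set at the top of the ranking
es_bottom = enrichment_score(ranked, {"gene5", "gene6"})  # set at the bottom
```

A gene set concentrated among the highest-weighted genes gives a positive score and one concentrated among the lowest-weighted genes gives a negative score.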

Fig. 4: NMF reveals cell type heterogeneity and biologically relevant pathways in hippocampal subfields compared to RCTD deconvolution results.

For all spot plots of example capture areas from donor Br3942, spot borders are colored by spatial domain (see Supplementary Table 2 for abbreviations). a, RCTD prediction of proportion of oligodendrocytes (Oligo, fill color) per Visium spot. b, Spot plots displaying spot-level weights (fill color) for NMF pattern nmf77. c, GSEA results showed that genes with higher nmf77 weights contributed to the significant enrichment of biological pathways associated with oligodendrocytes. For GSEA results, the x axis shows the ranked gene-level NMF weight. Each vertical black bar indicates the rank of genes for the indicated Reactome gene set. GSEA statistics are presented to the right of the plot. Top, NES; middle, numerator indicates the number of genes present in the experimental gene set, denominator indicates the total number of genes in the Reactome gene set; bottom, FDR-adjusted one-tailed P value (Padj). d, RCTD prediction of proportion of astrocytes (Astro, fill color) per spot. e, Spot plots displaying spot-level weights (fill color) for NMF pattern nmf81 (specifically elevated in astrocyte snRNA-seq clusters). f, Genes with higher nmf81 weights contributed to the significant enrichment of biological pathways associated with astrocytes. g, Spot plots displaying spot-level weights (fill color) for NMF pattern nmf13 (elevated in neuronal snRNA-seq clusters). h, Spot plots displaying spot-level weights (fill color) for NMF pattern nmf7 (elevated in neuronal snRNA-seq clusters). i, Genes with higher nmf13 weights contributed to the significant enrichment of biological pathways related to neuronal signaling, with increased representation of transcriptional variation highly relevant to excitatory postsynaptic response. 
j, Genes with higher nmf7 weights contributed to the significant enrichment of biological pathways related to neuronal signaling, with increased representation of transcriptional variation highly relevant to the structure and function of inhibitory synaptic connections. NES, normalized enrichment score.

Another advantage of NMF is the capacity to capture transcriptional variation beyond cell type definition. We observed 19 patterns that were present across all cell types and spatial domains (Extended Data Fig. 6a,d). We examined two of these general patterns (nmf13 and nmf7), both of which exhibited increased spot-level weights in multiple neuronal cell types and HPC spatial domains but were anti-correlated with one another (Fig. 4g,h and Supplementary Fig. 16). Examination of GSEA results suggested that nmf13 and nmf7 encapsulate distinct synaptic transcriptional profiles. Top nmf13-weighted genes include CAMKK1 (ref. 70), LRFN2 (ref. 71) and DLGAP3 (ref. 72), and the enrichment of genes related to the activation of NMDA receptors and postsynaptic events (Padj. = 0.01) suggested that nmf13 represents gene expression highly relevant to excitatory postsynaptic response (Fig. 4i). In contrast, top nmf7-weighted genes, such as GABRA1, KIF5A73,74,75 and DYNLL2 (ref. 76), and the enrichment of genes related to GABA synthesis, release, reuptake and degradation (Padj. = 1.2 × 10−4) indicate that nmf7 more strongly represents the structure and function of inhibitory postsynaptic specializations (Fig. 4j). These data demonstrate that, in practice, NMF identifies subfield-specific differences in transcription related to neuronal structure and function.

NMF captures activity-dependent transcription programs

We identified two NMF patterns (nmf91 and nmf20) that captured stimulus-dependent transcriptional programs, exemplified by the expression of genes with known roles in modulating synaptic response to activity77,78,79,80, such as FOS, JUN and NR4A1 (ref. 81) in nmf91 and SORCS3 (ref. 82), PDE10A83 and HOMER1 (ref. 84) in nmf20 (Extended Data Fig. 7). To further investigate the ability of NMF patterns to capture neuronal activity-related gene expression, we transferred the NMF patterns from our human snRNA-seq dataset onto a mouse snRNA-seq dataset containing HPC neurons activated by electroconvulsive seizures (ECS) or HPC neurons under control conditions (Sham)85. As we observed when transferring human snRNA-seq nuclei-level weights into human SRT data, NMF patterns mathematically projected onto mouse snRNA-seq nuclei were both general and cell-type specific (Extended Data Fig. 8).

We found that nmf91 and nmf20 were more highly weighted onto nuclei from GCs in the ECS condition than in the sham-activated group, supporting our hypothesis that these patterns capture activity-dependent transcriptional programs (Fig. 5a). We further found that many of the top-weighted genes in nmf91 and nmf20 were significantly increased in ECS GCs compared to sham. For nmf91, these genes included many canonical activity-dependent transcripts (Fig. 5b). Although the functions of the highly weighted genes in nmf20 are less clear, the high weight and increased expression of BDNF86 and intracellular BDNF signaling mediators SORCS3 and SORCS1 (refs. 82,87) suggest that this pattern represents genes involved in secondary activity response (Fig. 5c). We found 98.0% of spots with non-zero nmf20 weights were located in neuron (91.0%) or neuropil (7.0%) broad domains, while 68.9% of spots with non-zero nmf91 weights occurred in these regions (42.8% in neuron and 26.1% in neuropil), suggesting enrichment of these patterns in synapse-rich domains.
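The domain percentages reported above are shares of the spots that carry a non-zero weight for a pattern. A minimal sketch of that tally is shown below; the weights and domain labels are invented, and `nonzero_share_by_domain` is a hypothetical helper rather than a function from the study's code.

```python
from collections import Counter


def nonzero_share_by_domain(weights, domains):
    """Percentage of non-zero-weight spots falling in each broad domain."""
    nonzero_domains = [d for w, d in zip(weights, domains) if w > 0]
    total = len(nonzero_domains)
    if total == 0:
        return {}
    counts = Counter(nonzero_domains)
    return {domain: 100.0 * n / total for domain, n in counts.items()}


# Invented spot-level weights for one pattern and the spots' broad domains.
spot_weights = [0.0, 0.4, 0.1, 0.0, 0.3, 0.2]
spot_domains = ["WM", "Neuron", "Neuron", "Neuropil", "Neuropil", "Neuron"]
share = nonzero_share_by_domain(spot_weights, spot_domains)
```

Spots with zero weight are excluded from the denominator, so the shares across domains sum to 100% of the non-zero spots rather than of all spots.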

Fig. 5: NMF captures transcriptional programs relevant to neuronal activity.

a, Select NMF patterns projected onto a mouse snRNA-seq dataset of hippocampal neurons activated by ECS or under control conditions (Sham; y axis, by cell type). The dot color indicates scaled average nuclei NMF weights. Dot size indicates the proportion of y axis group with non-zero pattern weight for the given x axis value. DE analysis was performed on mouse GC nuclei, testing for differences in the expression by activity condition. For all volcano plots (b,c,f,g), the y axis presents the −log10 FDR-adjusted P value and the x axis presents log2(FC), where negative values indicate greater expression in sham-activated GCs and positive values indicate greater expression in ECS GCs. b,c,f,g, Points are colored for the gene-level NMF weight for nmf91 (b), nmf20 (c), nmf10 (f) and nmf14 (g). d,e, Spot plots isolating the DG GCL spatial domain (green outlined spots) demonstrate the differing spatial organization of nmf10 (d) and nmf14 (e) weights in an example capture area from donor Br3942. Spot fill indicates spot-level NMF pattern weight. h, Left, UMAP plot of all nuclei present in our human snRNA-seq, highlighting the GC clusters. Right, zoomed UMAP plot of only GC nuclei from our human snRNA-seq dataset with color indicating cluster identity. i, UMAP plot of human GC nuclei showing (left) nmf10 nuclei-level weights and (right) log2-normalized counts of highly-weighted nmf10 gene CHST9. j, UMAP plot of human GC nuclei showing (left) nmf14 nuclei-level weights and (right) log2-normalized counts of highly-weighted nmf14 gene SORCS3. PS/Sub, prosubiculum and subiculum neurons; L5/Po, layer 5 and polymorphic layer.

Compared to other HPC cell types, mouse GC nuclei exhibited specific increases in nmf14 and nmf10 weights, which were also associated with GC clusters and the GCL in our human datasets (Fig. 5a and Extended Data Fig. 6a,d). We observed that the transcriptional patterns encapsulated by nmf10 and nmf14 represented distinct responses to neuronal activity in the mouse (Fig. 5a) and unique spatial organization in the human (Fig. 5d,e). nmf10 mapped principally to unstimulated GC nuclei, and many top nmf10-weighted genes were significantly increased in baseline conditions (Fig. 5f). Notably, these genes (CNTN1, ADAM22, FAT4, CDH10, ST6GALNAC5 and CHST9) encode critical synaptic adhesion proteins and proteins that stabilize established synapses.

In contrast, nmf14 was elevated in stimulated GCs. Key genes with higher nmf14 weights that were significantly increased in ECS GCs included GC.4 cluster markers BDNF and ACVR1C (Extended Data Fig. 3g), as well as SORCS3 and SGK1 (Fig. 5g). Given the functional importance of BDNF88, ACVR1C89, SORCS3 (ref. 77) and SGK1 (refs. 90,91) expression in activity-dependent synaptic plasticity, we hypothesize that nmf14 represents gene expression patterns that promote synaptic scaling. The elevated weight of nmf14 in the superficial human GCL in unstimulated conditions (Fig. 5e) suggests that this subset of GCs may be uniquely poised to promote activity-dependent synaptic scaling. Examining the nuclei-level weights of nmf10 and nmf14 and the expression of top-weighted nmf10 and nmf14 genes indicated that the GC.4 cluster fits the criteria for poised GCs (Fig. 5h–j).

These data indicate that NMF can identify activity-regulated gene expression in the context of cell-type-specific recruitment. Activity-regulated transcription is intimately related to physiological function, recruitment of cellular ensembles and synaptic connectivity, all of which fundamentally contribute to cellular behavior. However, these properties are not often considered when annotating cell types based on transcriptomics. NMF enables the transfer of information from animal models to human brain datasets, allowing predictions about the functional properties of human cells across species. This approach can be leveraged to understand cell-type function in the context of HPC circuit connectivity.

Spatial mapping elucidates pyramidal neuron diversity

Although the HPC and RHP contain multiple classes of pyramidal neurons with distinct molecular profiles18,19,92, physiological properties93 and axonal projection targets94, it has historically been difficult to distinguish subfield-specific RHP populations (for example, SUB versus ENT) with snRNA-seq alone. To address this gap, we integrated SRT and snRNA-seq using NMF to better define these cellular populations.

Mapping Sub.1-specific nmf40 and Sub.2-specific nmf54 to SRT revealed distinct laminar organization in situ (CA1-proximal versus CA1-middle, respectively; Fig. 6a). To probe whether these differences corresponded to distinct efferent target regions13,14, we used data from a recent study coupling retrograde viral circuit tracing with single-cell methylation sequencing (snmC-seq) in mouse95. Our human snRNA-seq-derived NMF patterns were transferred onto the mouse snmC-seq data from excitatory HPC and RHP neurons (n = 2,004 nuclei). After removing sparsely mapping patterns (<45 nuclei), nmf54/Sub.2 was excluded (Extended Data Fig. 9a). However, patterns specific to CA1, SUB and RHP cell types mapped to mouse HPC neurons with distinct efferent targets (Fig. 6b). These results recapitulated known projections from the SUB to the thalamus96 and hypothalamus97, and from the ENT to the HPC and prefrontal cortex98.

Fig. 6: NMF reveals a continuum of pyramidal cell types across the RHP.

All spot plots are filled by the spot-level weights for the indicated NMF pattern, with scales corresponding to the maximum spot-level weight of any spot in the SRT dataset. Spot border color indicates the spatial domain. See Supplementary Table 2 for abbreviations. a, Spot plots of example areas from donor Br3942 highlight the laminar organization of nmf40 (snRNA-seq cluster Sub.1) and nmf54 (snRNA-seq cluster Sub.2), as well as the clear distinction from the CA1 (nmf15). b, Dot plot of NMF weights (dot color) after transfer to mouse single-cell methylation sequencing (snmC-seq) with retrograde viral tracing (n = 2,004 HPC and RHP nuclei)95. Rows indicate the axonal projection target region of the nuclei. Dot size indicates the number of nuclei with non-zero pattern weights. c, Spot plots of example capture area from donor Br2743 show ENT-specific NMF patterns. d, TSNE plots of pyramidal nuclei from the snRNA-seq dataset colored by the NMF patterns in c. e, Spot plots of example capture area from donor Br2743 show RHP NMF patterns that exhibit spot-level weights distributed across the ENT and subicular complex. f, TSNE plots of pyramidal nuclei from the snRNA-seq dataset colored by the NMF patterns in e. g, Spot plots (for example, donor Br3942) exemplify the anatomical location of nmf17 to the PreS, indicated by the asterisk. h, Violin plots show snRNA-seq log2-normalized counts (y axis) across HPC and RHP clusters (x axis, reflecting spatially-informed annotations; Supplementary Table 2) for traditional cortical layer markers (SATB2, TLE4 and CUX2), canonical subiculum marker FN1, and new subiculum markers (COL24A1, TOX). i, New DEGs distinguish the superficial, middle and deep subiculum (Sub.1, Sub.2, Sub.3, respectively), and the PreS. Dot size indicates the proportion of nuclei in each cluster (column) with non-zero expression for each gene (row). The dot color indicates average log2-normalized gene counts.
TSNE, t-distributed stochastic neighbor embedding; HPF, hippocampal field; MOB, main olfactory bulb; RSP, retrosplenial cortex; PTLp, posterior parietal cortex; ACA, anterior cingulate cortex; PIR, piriform cortex; STR, striatum; TH, thalamus; AMY, amygdala; HY, hypothalamus; MOp, primary motor cortex; PFC, prefrontal cortex.

NMF patterns enriched in distinct clusters of non-CA pyramidal nuclei corresponded to nonoverlapping SRT spots with stereotypical laminar organization throughout the RHP (Fig. 6c–f). NMF patterns nmf84, nmf45, nmf27 and nmf51 were spatially restricted to the distal and superficial RHP, suggesting that these patterns and associated snRNA-seq clusters are ENT-specific (Fig. 6c,d). NMF patterns corresponding to deeper layer neurons (nmf68, nmf22 and nmf53) were not restricted along the CA1 transverse axis (distal to proximal) but were instead present throughout the SUB-ENT transition (Fig. 6e). One pattern (nmf65), enriched in deep pyramidal clusters (L6.1 and half of L6b), ran along the border of WM and SUB, adjacent to the middle layer of SUB spots highlighted by nmf54 (Fig. 6e,f). Based on the expression of SUB marker genes and top NMF pattern genes, including TOX, TSHZ2 and ZNF385D, we believe that the L6.1 cluster and nmf65 capture the transcriptional profile of the deep SUB (Extended Data Fig. 9e,f). Altogether, spatial registration of pyramidal NMF patterns agreed with broader classification of the snRNA-seq clusters as superficial (L2/3), L5, and deep (L6/6b) and facilitated improved annotation of snRNA-seq clusters.

One NMF pattern (nmf17) enriched in another superficial snRNA-seq cluster (L2/3.1) exhibited a dramatically different spatial location than the other ENT-specific L2/3 patterns (Fig. 6g and Supplementary Fig. 17). The distribution of nmf17 follows previous anatomical descriptions of the presubiculum (PreS)98,99; in more anterior samples (donors Br6423 and Br2743) nmf17 is more adjacent to the SUB. In more posterior samples (donors Br3942 and Br8325), it is present at the curve of the SUB. In some donors, nmf17 is present in an island-like pattern, again consistent with descriptions of the PreS in anatomical studies. Unfortunately, to our knowledge, no molecular markers have been identified for the PreS to provide crucial supporting evidence that cluster L2/3.1 and nmf17 label this specific brain region.

Recognizing that the expression of common cortical layer markers may not be sufficient to identify these spatially-annotated populations in future experiments (Fig. 6h), we revisited our DEG results to identify high-confidence gene markers specifically associated with subpopulations of the SUB (Fig. 6h). Although the canonical SUB marker gene FN1 is helpful to isolate superficial SUB neurons (Sub.1), our data indicated that COL24A1—identified as a SUB DEG in both SRT and snRNA-seq datasets (Fig. 3c)—defines both superficial and middle SUB neurons. TOX, another DEG for the SUB in both SRT and snRNA-seq data, is a more general subicular complex indicator, labeling the PreS as well (Fig. 6h).

We performed additional differential expression (DE) analysis to distinguish our spatially-annotated SUB clusters from transcriptionally similar local neuron populations. We set a robust threshold and identified multiple genes that, in combination with traditional regional and laminar markers, will be useful for future research either in annotating these populations or in biological discovery (Fig. 6i and Extended Data Fig. 10). For example, markers like MDFIC (PreS) and NDST4 (superficial SUB) could elucidate region-specific circuit functions as the products of these genes contribute to glucocorticoid response100 and synaptic specificity101, respectively. Other markers like SCN7A (deep SUB)102 and TRPC3 (middle SUB)103 may correspond to region-specific neuronal firing properties.

Discussion

By integrating SRT and snRNA-seq data, we generated a comprehensive transcriptomic atlas of the adult human HPC. We characterized the molecular organization of the human HPC with spatial and cellular resolution, identifying discrete spatial domains and a rich repertoire of HPC cell types. Spot-level deconvolution and NMF revealed gene expression patterns that represent cell-type-specific molecular signatures. NMF enabled the capture of gene expression profiles shared across biological processes, such as synaptic signaling (Fig. 4). Integration of these patterns with cross-species functional genomics data (Fig. 5) provides spatial context to behaviorally-relevant cell types, cell states and molecular pathways. To molecularly define the organization of the subicular complex, we leveraged cross-species neuronal circuit data and anatomical insights from our SRT dataset (Fig. 6).

Measuring transcriptomic changes in response to experimental manipulations is not possible in human postmortem brain tissue, while functional studies in rodents that can test causality of cellular and molecular associations may lack direct relevance to human brain function and behavior. Therefore, integrating human transcriptomic data with functional genomics data from animal models has the potential to facilitate the interpretation of cellular responses to various stimuli. To illustrate the biological insights that can be gleaned using this approach, we mapped NMF patterns from the human HPC onto snRNA-seq data from a mouse model of induced ECS85 to identify gene expression patterns that are putatively associated with neural activity in the human brain (Fig. 5). Identifying and localizing expression patterns associated with activity-regulated genes in the human HPC is important because expression of these genes is critical for recruitment of HPC neuronal ensembles controlling learning, memory and other cognitive processes104. NMF patterns that reflected processes known to be regulated by activity, including immediate-early gene expression and synaptic signaling, were enriched in activated mouse neurons (Fig. 5). In the rodent, the majority of GCs are relatively silent under both baseline conditions105 and during exploration of new environments106,107. However, a small fraction of GCs rapidly increases their firing activity during behavior, supporting a sparse coding scheme that facilitates precision in the detection of newness and the animal’s location during learning108. NMF patterns that differentially map to control versus activated neurons may represent the population of GCs that are more quiescent versus those that are either actively firing or intrinsically primed for activity85. Indeed, active and quiescent mouse GCs have substantially different transcriptomic profiles85. 
Differences in GC activity levels may reflect differences in their ability to be recruited into neuronal ensembles in response to behavioral experiences109,110,111,112,113,114. These analyses provide insight into the putative spatial and molecular organization of cellular activity states in the human DG, which is critical to better understand circuit function in the HPC. Circuit activity in the HPC flows from the DG, extending through the pyramidal, subicular and RHP regions, which have not been well-characterized in the human at the molecular scale. Since annotations for the RHP transition region rely on a priori knowledge of spatial organization and no molecular profiles are available, we were unable to manually annotate subdivisions in this region. While analysis of the snRNA-seq data enabled identification of many individual cell types along this transition region, unbiased clustering strategies were also unable to fully identify and differentiate across subdivisions. However, application of NMF afforded a more biologically meaningful interpretation of these cell types by mapping their spatial organization in the SRT dataset. With this approach we were able to identify gene signatures attributable to specific subdivisions of the ENT and subicular complex (Fig. 6). These findings highlight the future potential of these approaches for incorporating human transcriptomic data with rapidly emerging viral circuit labeling tools, which have recently enabled the profiling of neuron populations in rodent models based on their innervation patterns115,116.

We demonstrate the effectiveness of NMF for integrating transcriptomics data with other datasets, particularly across species and data modalities. This approach has some limitations. Namely, NMF can be sensitive to initialization values, patterns corresponding to noise may be incorrectly attributed to distinct biological processes, and NMF weights can be scaled differently across different factorizations. Despite these limitations, the illustrated strategy can be iterated upon to map existing and forthcoming datasets to our atlas. NMF is able to transfer continuous patterns of expression, such as cell types (Figs. 4 and 6) and transcriptional activity (Fig. 5), while other tools, including principal component analysis (PCA) and spot-level deconvolution, are not able to find these gradients of expression. NMF-based integration of our snRNA-seq and SRT data leveraged the spatial information provided by the SRT dataset and the cellular resolution of snRNA-seq to enable the discovery of molecular signatures in human tissue. Ultimately, these signatures can be extended with targeted gene panels at single-cell resolution for validation, or reverse translated to test causality in rodent or cellular models. Notably, snRNA-seq and SRT data from dlPFC are available from the same donors as those used in this study5, and emerging computational methods may enable modeling molecular connectivity patterns across this clinically-relevant circuit in the future. This is an important endeavor with substantial potential to improve clinical outcomes given strong scientific rationale and precedent for normalizing circuit-level dysfunction to improve symptoms of neuropsychiatric disorders117,118,119. Understanding functional dynamics in HPC circuits is important because their dysregulation is implicated in various neuropsychiatric and neurodevelopmental disorders120,121.
Defining the human HPC at cellular resolution with spatial fidelity is necessary for designing molecular approaches that can facilitate more precise circuit manipulation. Ultimately, identifying unique molecular identities for spatially organized HPC cell types based on their innervation targets in the human brain is necessary for the development of circuit-specific therapeutics.

This highly integrated, well-annotated single-cell and spatial transcriptomics dataset of the human HPC provides unique biological insight into hippocampal molecular neuroanatomy, including activity-dependent transcriptomic profiles in the GCL and molecular definition of the RHP transition zone between the ENT and SUB. Our innovative approach can be effectively used to map external data to this and other transcriptomic atlases, as illustrated. To facilitate wide access to our data, we provided the human HPC snRNA-seq and SRT atlas as a resource to the scientific community through multiple avenues, including interactive web applications for visualization and exploration.

Methods

Postmortem human tissue samples, processing and QC

Postmortem human brain tissues from ten neurotypical adult donors of European ancestry were obtained after autopsy following informed consent from legal next-of-kin, through the Maryland Department of Health IRB protocol 12–24, and from the Western Michigan University Homer Stryker MD School of Medicine, Department of Pathology, the Department of Pathology, University of North Dakota School of Medicine and Health Sciences and the County of Santa Clara Medical Examiner-Coroner Office in San Jose, CA, all under the WCG protocol 20111080, as previously described5,122 (Supplementary Table 1). No statistical methods were used to predetermine sample sizes, but our donor cohort includes the same donors described in a previous publication5.

Anterior hippocampal dissections were performed by the same neuroanatomist (T.M.H.) using anatomical landmarks to include the anterior portion of the HPC proper plus the subicular complex36. Cryosections (10 μm) were mounted on standard microscopy slides for staining with hematoxylin and eosin (H&E) for orientation and QC. Following QC, two to four 100 μm cryosections, totaling approximately 50 mg of tissue, were transferred into a low-adhesion tube (Eppendorf) kept on dry ice and reserved from the anterior portion of the HPC for snRNA-seq (Fig. 1a). Then, two 10 μm sections were mounted on standard microscopy slides and reserved for H&E staining and RNAscope to guide reassembly of Visium capture areas and histological annotation of canonical subfields, using previously defined markers of the DG (PROX1), CA3 (NECAB1), CA1 (MPPED1) and SUB (SLC17A6).

The tissue block was then scored using a razor blade, so that tissue sections approximately the size of a Visium capture area could be positioned and placed onto chilled Visium Spatial Gene Expression slides (2000233, 10x Genomics). In some cases, multiple Visium experiments were performed to ensure the inclusion of relevant subfields. In these cases, the distance along the anterior-posterior axis between experiments did not exceed 500 μm. Following the successful completion of Visium experiments, additional 100 μm cryosections were collected for snRNA-seq.

SRT data generation and analysis

Visium Spatial Gene Expression slides were processed as previously described5,28. Experimenters were blind to the age and sex of donors. For each Visium slide, H&E staining was performed (protocol CG000160, revision B, 10x Genomics) and imaged on a Leica CS2 slide scanner equipped with a ×20/0.75NA objective and a ×2 doubler. Following tissue permeabilization (18 min, empirically determined), on-slide cDNA synthesis was performed, and sequencing libraries were generated for all Visium samples following the manufacturer’s protocol (CG000239, revision C, 10x Genomics). Libraries were sequenced on a NovaSeq 6000 System (Illumina) at the Johns Hopkins Single Cell Transcriptomics Core according to the manufacturer’s instructions at a minimum depth of 60,000 reads per spot. Images were processed using VistoSeg for preprocessing and nuclei segmentation123. FASTQ and image data were integrated using the 10x SpaceRanger pipeline124 and reads were aligned to the reference genome GRCh38 2020-A. Spatial coordinates were rotated to reflect the orientation of the intact tissue section (Supplementary Fig. 1). For all SRT analysis, we used R (version 4.3.2, unless otherwise noted) and Bioconductor (version 3.17) for analysis of genomics data125.

Visium QC and count normalization

After removing undetected genes and spots with zero counts, spot-level QC was performed with scuttle to calculate mitochondrial expression rate, library size, and the number of detected genes126,127. We applied a 3× median absolute deviation (MAD) threshold to discard spots with low library sizes and/or low numbers of detected genes (Supplementary Fig. 2). Mitochondrial expression rate was not used to define low-quality spots as this metric appeared to correlate with true biological variation attributable to hippocampal neuropil-enriched layers (Supplementary Fig. 3). Neuropil is characterized by large numbers of synapses and a paucity of neuronal cell bodies, and mitochondria are known to be enriched at both presynaptic and postsynaptic sites128,129. After applying 3MAD thresholds, the SRT dataset consisted of 31,483 genes and 150,917 spots. Counts were normalized by library size and log2-transformed using computeSumFactors() from scran126 and logNormCounts() from scuttle127.
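
As an illustration only, the MAD-based filter described above can be sketched in a few lines of Python (the analysis itself was run in R with scuttle). The function name `low_outliers`, the toy library sizes and the choice to threshold on the log2 scale are assumptions made for this sketch, not details reported here.

```python
import math
import statistics


def low_outliers(values, nmads=3.0, log=True):
    """Flag values lying more than `nmads` MADs below the median.

    Hypothetical helper for illustration; not the scuttle implementation.
    """
    # Assumption: thresholds are computed on log2 values, which makes
    # right-skewed metrics such as library size more symmetric.
    x = [math.log2(v) for v in values] if log else list(values)
    med = statistics.median(x)
    # The 1.4826 factor makes the MAD comparable to a standard deviation
    # for normally distributed data.
    mad = 1.4826 * statistics.median(abs(v - med) for v in x)
    cutoff = med - nmads * mad
    return [v < cutoff for v in x]


# Toy library sizes (UMIs per spot); zero-count spots are assumed to have
# been removed already, so the log transform is safe.
library_sizes = [5200, 4800, 6100, 5500, 4900, 5300, 150]
flags = low_outliers(library_sizes)
```

Only the lower tail is tested in this sketch, matching the stated goal of discarding spots with low library size or few detected genes.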

Feature selection: spatially variable genes (SVGs)

The top 2,000 SVGs identified with nnSVG29 were used as input features for unsupervised clustering. Traditional methods of selecting highly variable genes do not necessarily capture spatial expression patterns. The nnSVG package uses a nearest-neighbor Gaussian process model and performs favorably relative to other SVG detection methods; selecting SVGs as features also improves spatial domain detection compared to nonspatial feature selection approaches31. nnSVG aims to identify genes with varying expression patterns across different regions within the tissue by estimating gene-specific spatial ranges and retaining detailed spatial correlation information.

Before feature selection, we filtered out genes with fewer than 100 total counts summed over all spots and capture areas, following the nnSVG documentation. Within each capture area, we additionally retained only genes with at least three counts in at least 0.5% of spots in that capture area. We did not exclude mitochondrial genes. nnSVG was run separately on each capture area using the default parameters, generating one result per capture area. To consolidate SVGs across capture areas, we ranked all genes within each capture area by spatial variance, then averaged gene ranks across all capture areas. We retained the top 2,000 genes (based on mean rank) for clustering analysis, further constraining to genes ranked in the top 1,000 genes in at least two different capture areas (Supplementary Table 3 and Fig. 1d).
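
The rank-consolidation step can be sketched as follows. This is a minimal Python illustration rather than the code used for the analysis: the function name `consensus_svgs` and the toy ranks are hypothetical, and two details are assumptions, namely that the top-1,000 constraint is applied before taking the top 2,000 genes by mean rank, and that a gene's mean rank is taken over the capture areas in which it was tested.

```python
from collections import defaultdict


def consensus_svgs(ranks_by_area, n_top=2000, top_k=1000, min_areas=2):
    """Consolidate per-capture-area SVG ranks into one gene list.

    ranks_by_area: {capture_area: {gene: rank}}, where rank 1 is the most
    spatially variable gene in that capture area.
    """
    rank_lists = defaultdict(list)
    for ranks in ranks_by_area.values():
        for gene, rank in ranks.items():
            rank_lists[gene].append(rank)

    # Assumption: average over the capture areas in which the gene was tested.
    mean_rank = {g: sum(r) / len(r) for g, r in rank_lists.items()}

    # Keep genes ranked within the top `top_k` in at least `min_areas` areas.
    eligible = [
        g for g, r in rank_lists.items()
        if sum(x <= top_k for x in r) >= min_areas
    ]

    # Order by mean rank (ties broken by gene name) and keep the best n_top.
    eligible.sort(key=lambda g: (mean_rank[g], g))
    return eligible[:n_top]


# Toy example with three capture areas and four placeholder genes.
toy_ranks = {
    "A": {"g1": 1, "g2": 2, "g3": 3, "g4": 4},
    "B": {"g1": 2, "g2": 1, "g3": 4, "g4": 3},
    "C": {"g1": 1, "g2": 3, "g3": 2, "g4": 4},
}
selected = consensus_svgs(toy_ranks, n_top=3, top_k=2, min_areas=2)
```

In the toy example, g3 reaches the top two in only one capture area, so it is dropped even though three genes were requested.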

Clustering and spatial domain annotation

For spatially-aware unsupervised clustering of SRT data, we used PRECAST v1.5 (ref. 30), with the 2,000 SVGs as input features (Supplementary Fig. 4a). PRECAST was chosen based on its ability to leverage spatial information when performing clustering, its use of joint embeddings that allow for integration across samples and batches, and its relatively fast runtime, which allows for optimization of the number of desired clusters32. We ran PRECAST using k = 5 through k = 20 and focused on k = 15 through k = 20, as these had the lowest AIC values (Supplementary Fig. 4b). Visual inspection indicated similar clusters for k = 15 through k = 20. Focusing on k = 16, k = 17 and k = 18, we examined marker gene expression and the concordance with histologically defined annotations performed by experienced neurobiologists (Supplementary Fig. 5). We selected k = 18 clustering results as the basis for our HPC spatial domains due to minor differences, including a cluster that appeared to map to SLM and SGZ and that was absent from the k = 16 and k = 17 results (Supplementary Fig. 4c–e).

Histological annotations

Before unsupervised clustering, experienced neurobiologists (E.D.N. and S.C.P.) generated histologically defined annotations of tissue regions present in the SRT capture areas to assist in later annotation of unsupervised clusters. To facilitate histological annotations, we built a temporary shiny application using the spatialLIBD package4,130. The following anatomical regions were defined (Supplementary Fig. 5)—GCL, subgranular zone (SGZ), ML, CA4 pyramidal cell layer (PCL-CA4), PCL-CA3, PCL-CA1, SUB, cortex, stratum oriens (SO), SR, SL, SLM and white matter (WM). Expert neuroanatomists also recognized choroid plexus (CP) and thalamus tissue in a small number of capture areas.

To define these regions, we visualized log2 counts per million (CPM) expression of marker genes from snRNA-seq studies in both human HPC18,19 and mouse HPC13,92 and used H&E images for histological reference. To identify GCL, we used PROX1 and SEMA5A and were also guided by high levels of hematoxylin staining due to densely packed granule cell bodies. To identify SGZ, we used SEMA5A, but not PROX1, expression. We also used GAD1/GAD2 due to the presence of GABAergic neurons in SGZ131. To identify ML, we noted low unique molecular identifiers (UMIs) and detected genes in areas between GCL and SLM. To identify PCL-CA4, we used NECTIN3, AMPH, SEMA5A, SLC17A7, and observed the presence of higher UMIs and more detected genes compared to adjacent SGZ. For PCL-CA3, we used NECTIN3, NECAB1, AMPH and MPPED1. For PCL-CA1, we used FIBCD1, MPPED1, CLMP and SLC17A7. For SUB, we used FN1, NTS and SLC17A6. For SO, we used PVALB, SST, GAD1/GAD2. For SR, we searched for regions between SLM and PCL-CA1/PCL-CA3 that exhibited high levels of GFAP, but lower levels of MBP/MOBP compared to SLM and WM. For SL, we identified regions between SR and PCL-CA3 with low UMI counts and a high number of detected genes, as well as higher levels of GFAP and MBP/MOBP compared to adjacent PCL-CA3. For SLM, we observed a strip of darker staining compared to adjacent ML and SR, along with the expression of MBP/MOBP and GFAP. To identify WM regions, we looked for very high levels of oligodendrocyte markers such as MBP, MOBP and PLP1. To identify CP, we used TTR, PRLR and MSX1. CP tissue also had a distinctive sponge-like appearance, making visual identification easy. For the thalamus, we used TCF7L2. Finally, for cortex, we used CUX2, RORB and BCL11B.

Annotation of SRT clusters

The expression of well-established marker genes for HPC regions and cell types (Supplementary Fig. 5c) was used to annotate PRECAST k = 18 clusters to the DG GCL, the CA pyramidal cell layers (CA2–CA4 and CA1), the SUB, the RHP, and as GABAergic neuron-rich spots (GABA). Multiple clusters that were annotated to the same HPC domain, and that differed only in the number of nuclei per spot, were collapsed together (Supplementary Fig. 6a,b). One cluster was labeled as SUB.RHP based on the expression of both SUB and RHP genes and its positioning between the two regions. Clusters with distinct non-neuronal gene expression were annotated as WM, CP and Vasc tissue.

To assist in annotation of neuropil-rich domains, we referenced the manual annotations (Supplementary Fig. 6c). We also examined the expression of genes enriched in the CA1 (MPPED1, FIBCD1), CA3 (TSPAN18, NECTIN3, AMPH), DG (PROX1, SEMA5A) and astrocytic genes since we had a cluster with SGZ-proximal spatial organization (Supplementary Fig. 6d). With these aids we annotated domains mapping to DG ML, SL/SR, SR/SLM and SLM/SGZ. Domains mapping to WM (WM.1, WM.2 and WM.3) were grouped as WM domains at the broad domain level. Domains mapping to vascular and CP tissue were grouped as vascular/cerebrospinal fluid (vasc/CSF) domains at the broad domain level. The final HPC domain annotations used throughout the manuscript are presented in Extended Data Fig. 1.

Identification of the thalamus and amygdala spots

During manual annotation, we identified the presence of a small amount of thalamus tissue in one capture area using thalamus-specific marker TCF7L2 (refs. 132,133; Supplementary Fig. 18a–c). In this capture area, the thalamus tissue was incorrectly annotated as SUB and SUB.RHP PRECAST domains. The thalamic SUB/SUB.RHP spots were the only spots from SUB or SUB.RHP spatial domains in this capture area. These spots were therefore excluded from DE analyses. In another donor, we identified a large, homogeneous expanse of tissue that comprised the entire capture area and was annotated as RHP. Based on the anatomy of the region, we believed this was in fact amygdalar tissue, which was supported by the expression of several genes known to be enriched in the amygdala: SLC17A6, CDH22 (ref. 21), OPRM1 (ref. 134) and CACNG4 (ref. 135; Supplementary Fig. 18e–h). Using the expression of these genes, we identified a total of three capture areas from two donors containing tissue from the adjacent amygdala, all of which was annotated to the RHP spatial domain. These spots were not included in the DE analyses.

Comparison with other unsupervised spatial clustering algorithms

To ensure robustness of spatial domain predictions across computational algorithms, we compared PRECAST results with those generated from two other leading clustering methods specifically designed for spatial transcriptomics. We examined a domain detection algorithm using graph-based autoencoders, specifically GraphST136. Following the GraphST tutorial, features were selected by the highly variable gene method, and GraphST was performed for each individual slide. Samples were integrated with Harmony137, and we generated k = 16 clusters (Supplementary Fig. 19). We also compared BayesSpace138, which is similar to PRECAST in that it implements a hidden Markov random field but does so within a Bayesian model framework. For BayesSpace implementation, we used the same set of SVGs input to PRECAST to generate PCs that were mutual nearest neighbor (MNN) corrected for donor identity and used k = 18 clusters (Supplementary Fig. 20).

We compared the manual annotations with the unsupervised clustering results by examination of the spot plots (Supplementary Figs. 5, 19 and 20), visualization of clustering results with alluvial plots (Supplementary Fig. 21a–c), and quantitative comparison of spot-level cluster assignments using the adjusted Rand index (ARI). PRECAST k = 18 clusters produced the highest agreement with manual annotations, followed closely by BayesSpace clusters (ARI with manual annotation: PRECAST = 0.192, GraphST = 0.112, BayesSpace = 0.173). When we collapsed clusters into the domains for PRECAST and similarly collapsed CA2.4 for BayesSpace, PRECAST still exhibited higher ARI (ARI with manual annotation: PRECAST domains = 0.231, BayesSpace domains = 0.179). We observed strong agreement between PRECAST and BayesSpace results (Extended Data Fig. 1 and Supplementary Figs. 20 and 21d), and the ARI for these two clustering methods was 0.398, which increased to 0.427 when the domains for both methods were used. We further examined whether GraphST or BayesSpace produced clusters that corresponded to the amygdala and thalamus spots identified manually, and found no improvement over the performance of PRECAST (Supplementary Fig. 21e). Altogether, we did not find evidence to suggest that selecting one of these alternative methods to derive our spatial domain annotations would improve the quality of our data, and we continued to use PRECAST-derived spatial domains throughout the remainder of the study.
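
For readers unfamiliar with the ARI, the following self-contained Python sketch computes it from two label vectors using the standard pair-counting formula. It is illustrative only: the function name and toy labels are hypothetical, and this is not the implementation used to produce the values reported above.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if n < 2:
        return 1.0  # degenerate case: no pairs of spots to compare

    # Pairs of spots co-clustered in both labelings, and in each separately.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # agreement under perfect match
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single cluster)
    return (sum_ij - expected) / (max_index - expected)


# Toy labels: a manual annotation and a clustering that differ only in names.
manual = ["GCL", "GCL", "CA1", "CA1", "WM", "WM"]
cluster = [1, 1, 2, 2, 3, 3]
ari_same = adjusted_rand_index(manual, cluster)
```

Because the index is corrected for chance, it equals 1 for identical partitions regardless of label names and is close to 0 for unrelated partitions.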

Pseudobulked SRT DE

Following unsupervised clustering, pseudobulk samples were generated by summing the raw gene expression counts across all Visium spots within a given capture area and spatial domain. We discarded pseudobulk samples that were composed of <50 spots and samples of spots identified as the thalamus or amygdala. We removed lowly expressed genes using the filterByExpr() function from the edgeR package139,140. Following edgeR recommendations, counts were normalized by the product of the library size and size factor, then converted to CPM and log2-transformed. After normalization, PCA and examination of the percent of gene-level variance confirmed that pseudobulk samples primarily captured biological variation (Supplementary Figs. 22 and 23). DE analysis was implemented using limma and accounted for block-specific (that is, capture area) variation141. Our DE approach followed a one-vs-rest enrichment model at either the domain level or broad domain level, in each case adjusting for donor age, donor sex and Visium slide as covariates. DE models computed fold change (FC), Student’s t-test statistics and two-tailed P values for each gene within each domain/broad domain; P values were corrected for the false discovery rate (FDR)142 (Supplementary Tables 4 and 5). Genes were considered statistically significant if |log2(FC)| > 1 and FDR < 0.01 (Fig. 2i,j, Extended Data Fig. 2 and Supplementary Fig. 7).
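
The aggregation and significance rules described above can be summarized in a short Python sketch. It is a simplified illustration under stated assumptions: the helper names `pseudobulk` and `is_significant` and the toy spots are hypothetical, and the edgeR filtering and normalization and the limma model fit are not reproduced.

```python
from collections import defaultdict


def pseudobulk(spots, min_spots=50):
    """Sum raw counts over spots sharing a capture area and spatial domain.

    spots: iterable of (capture_area, domain, {gene: raw_count}) tuples.
    Pseudobulk samples built from fewer than `min_spots` spots are dropped.
    """
    sums = defaultdict(lambda: defaultdict(int))
    n_spots = defaultdict(int)
    for area, domain, counts in spots:
        key = (area, domain)
        n_spots[key] += 1
        bucket = sums[key]  # created even if this spot has no counts
        for gene, count in counts.items():
            bucket[gene] += count
    return {k: dict(v) for k, v in sums.items() if n_spots[k] >= min_spots}


def is_significant(log2_fc, fdr, lfc_cutoff=1.0, fdr_cutoff=0.01):
    """Apply the |log2(FC)| > 1 and FDR < 0.01 rule to one gene."""
    return abs(log2_fc) > lfc_cutoff and fdr < fdr_cutoff


# Toy input: three spots from one capture area; min_spots is lowered to 2
# only so that the example is small.
toy_spots = [
    ("A1", "GCL", {"PROX1": 5, "MBP": 0}),
    ("A1", "GCL", {"PROX1": 7, "MBP": 1}),
    ("A1", "WM", {"PROX1": 0, "MBP": 9}),
]
bulk = pseudobulk(toy_spots, min_spots=2)
```

In the toy input the single WM spot falls below the minimum and is discarded, leaving one pseudobulk sample for the GCL.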

snRNA-seq data generation and analysis

Using 100 μm cryosections collected from each donor, we conducted snRNA-seq using 10x Genomics Chromium Single Cell Gene Expression v3.1 technology. Following standard procedures for nuclei isolation4, nuclei from each donor were split into two equal samples for fluorescence-activated nuclear sorting (FANS) based on propidium iodide (PI+) and Alexa Fluor 488-conjugated anti-NeuN (MAB377X, MilliporeSigma) fluorescence. One sample per donor was sorted based on PI+ fluorescence (neuronal and non-neuronal nuclei), and the second sample was sorted based on both PI+ and NeuN+ fluorescence to facilitate enrichment of neurons. This approach initially gave n = 20 for snRNA-seq (1 PI+ and 1 PI+NeuN+ sample for all ten donors) with a total of 18,000 sorted nuclei per donor (9,000 per Chromium sample). Samples were collected over multiple rounds, each containing one to three donors for two to six samples per round. We obtained poor yields of nuclei from one sequencing round (round 3), which included both PI+ and PI+NeuN+ samples from three donors. We collected additional PI+ and PI+NeuN+ samples from these three donors and performed the nuclei sorting steps again, this time achieving better final yields of nuclei. Therefore, we had a final n = 26 for snRNA-seq (1 PI+ and 1 PI+NeuN+ sample for seven donors, 2 PI+ and 2 PI+NeuN+ samples for three donors).

All samples were sorted into reverse transcription reagents from the 10x Genomics Single Cell 3′ Reagents kit (without enzyme). Enzyme and water were added to bring the reaction to full volume. cDNA synthesis and subsequent sequencing library generation were performed according to the manufacturer’s instructions for the Chromium Next GEM Single Cell 3′ v3.1 (dual-index) kit (CG000315, revision E, 10x Genomics). Samples were sequenced on a NovaSeq (Illumina) at the Johns Hopkins University Single Cell and Transcriptomics Sequencing Core at a minimum read depth of 50,000 reads per nucleus.

snRNA-seq QC

Following sequencing, reads were mapped to Genome Reference Consortium Human Build 38 (GRCh38 2020-A) using cellranger (version 7.0.0). A Bioconductor pipeline was used for downstream analysis125. Empty droplets were detected and removed with DropletUtils143.

Because neurons have larger library sizes and a greater number of expressed genes compared to non-neuronal cells in human brain tissue144, and because our PI+NeuN+ samples are enriched for neurons, initial QC thresholds (3MAD) were computed on a per-sample basis (Supplementary Fig. 8a–c). The data-driven mitochondrial fraction thresholds from two sequencing rounds (round 1 and round 3—previously recognized to contain lower nuclei yields) were dramatically higher than the remaining samples (Supplementary Fig. 8a), resulting in exclusion of fewer low-quality nuclei (Supplementary Fig. 8d). We calculated new per-sample 3MAD thresholds for mitochondrial fraction by restricting threshold calculation to the higher quality samples with original thresholds of <5% expression from the mitochondrial genome (Supplementary Fig. 8g). This increased the number of nuclei excluded based on mitochondrial fraction, particularly in samples from sequencing rounds 1 and 3 (Supplementary Fig. 8j). The initial 3MAD thresholds for the other two QC metrics resulted in no nuclei being excluded from almost all PI+ sorted samples and from many PI+NeuN+ sorted samples (Supplementary Fig. 8e,f). For all samples with no nuclei excluded from the initial 3MAD threshold, we manually set the library size and number of detected genes threshold to 1,000 counts and 1,000 detected genes (Supplementary Fig. 8h,i), thereby increasing the number of excluded nuclei (Supplementary Fig. 8k,l). Following nuclei QC, the dimensions of our dataset were 36,601 genes × 86,905 nuclei, with the kept nuclei exhibiting consistent QC metrics across samples.

We observed that one sample of PI+NeuN+ nuclei remaining after QC filters had noticeably lower library sizes and fewer detected genes than all other PI+NeuN+ samples (Supplementary Fig. 9b,c). To see if this discrepancy was likely the result of a neuronal population unique to that sample (17c-scp), we performed rudimentary clustering with the quickCluster() function from scran and estimated the number of neurons present in each cluster based on raw expression of SYT1 > 1 count (Supplementary Fig. 9d). We found that across neuronal clusters, sample 17c-scp displayed an increased number of nuclei with low library size and few detected genes, but that there remained 17c-scp nuclei that matched the distribution of these QC metrics present in the other PI+NeuN+ samples (Supplementary Fig. 9e). We reasoned that many of the nuclei in sample 17c-scp were low-quality neurons. To prevent a detrimental impact on downstream analyses, we took a conservative approach and removed all nuclei with fewer than 5,000 detected genes in this sample only. Following this removal step, the dimensions of our dataset were 36,601 genes × 80,594 nuclei (Supplementary Fig. 9f).

snRNA-seq feature selection, dimensionality reduction and clustering

Using raw counts, scry identified features based on deviance (rather than variance) and generated Pearson residuals that were used as input for dimension reduction and clustering, avoiding biases introduced by log-normalization of UMI count data145. Subsequent PCA and MNN correction with batchelor146 produced low-dimensional embeddings that were not influenced by donor and sequencing round (Supplementary Fig. 10). Gene expression count data were then normalized and log2 transformed with scran126, using cluster-specific size factors generated from rough cluster assignments.
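
To illustrate deviance-based feature ranking, the sketch below computes a binomial deviance for each gene against a null model in which the gene makes up the same proportion of counts in every nucleus. It is a simplified stand-in written for this illustration, not the scry implementation; the function name and toy counts are hypothetical.

```python
import math

def binomial_deviance(gene_counts, cell_totals):
    """Binomial deviance of one gene under a constant-proportion null model."""
    pi = sum(gene_counts) / sum(cell_totals)  # pooled proportion for this gene
    dev = 0.0
    for y, n in zip(gene_counts, cell_totals):
        if y > 0:
            dev += y * math.log(y / (n * pi))
        if n - y > 0:
            dev += (n - y) * math.log((n - y) / (n * (1.0 - pi)))
    return 2.0 * dev

# Toy data: total UMI counts per nucleus and raw counts for two genes.
cell_totals = [10, 20, 30]
genes = {
    "flat_gene": [1, 2, 3],      # same proportion in every nucleus
    "variable_gene": [5, 0, 1],  # proportion differs between nuclei
}
deviance = {g: binomial_deviance(y, cell_totals) for g, y in genes.items()}

# Genes with the highest deviance are the most informative features.
ranked = sorted(deviance, key=deviance.get, reverse=True)
```

A gene whose counts simply track library size has a deviance near zero, whereas a gene concentrated in a subset of nuclei scores highly without any log-normalization step.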

We implemented graph-based clustering to identify cell types present in the snRNA-seq dataset. A nearest neighbor graph was generated with scran126 based on the 50 MNN-corrected PCs using k = 5 nearest neighbors and Jaccard weights, followed by igraph147 implementation of Louvain clustering148. This resulted in 59 clusters. We annotated these clusters as neuronal or non-neuronal based on expression of neuron-specific genes, such as SYT1 (Supplementary Fig. 11a). We identified three low-quality neuron clusters based on fewer detected genes (Supplementary Fig. 11b). We also identified one cluster of likely doublets, which were marked by coexpression of oligodendrocyte, astrocyte, OPC and microglia markers in the same nuclei (Supplementary Fig. 11c). These four clusters were removed, together accounting for 5% of all nuclei.

After removing low-quality neurons and likely doublets, we repeated feature selection, dimensionality reduction, batch correction and graph-based clustering using the same parameters as before. This resulted in 62 clusters, which we classified as neuronal and non-neuronal based on SYT1 expression (Supplementary Fig. 11d). Using the number of detected genes and coexpression of distinct glial markers, we identified one low-quality neuron cluster (Supplementary Fig. 11e) and one low-quality non-neuronal cluster, respectively (Supplementary Fig. 11f). These two clusters contained 1.6% of all nuclei, and so we did not reprocess the dataset after removal. Our final dataset contained 75,411 nuclei classified into 60 clusters.

Preliminary snRNA-seq cluster annotation

Before DE analysis, we used markers from published snRNA-seq data in humans18 to label broad cell types including glial cell types (Supplementary Fig. 11f), inhibitory/GABAergic neurons and excitatory/glutamatergic neurons (Supplementary Fig. 12a). Within excitatory neurons, we were able to identify distinct clusters of CA pyramidal cells, DG granule cells and SUB pyramidal cells based on marker genes like PROX1, CALB1, FNDC1, TSPAN18, CARTPT and FN1. We expected a small thalamus population and, based on our SRT data, identified it by looking for a cluster with elevated TCF7L2 expression (Supplementary Figs. 18a–c and 12b). Most of the remaining excitatory cell clusters likely comprise the RHP regions, which exhibit laminar organization akin to cortical layers and are often annotated based on the expression of CUX2 (superficial marker), SATB2 (deep callosal projection marker) and TLE4 (deep subcortical projection marker; Supplementary Fig. 12a). We also expected some amygdalar neurons to be present, but were unable to identify these clusters based on the expression of genes used to classify amygdalar domains in the SRT dataset (SLC17A6, CDH22, OPRM1 and CACNG4; Supplementary Fig. 18d–h). To identify amygdala neuron clusters in the snRNA-seq data, we isolated the RHP spots from SRT data and used scran::findMarkers() to implement a binomial test for gene markers that distinguished the spots annotated to the amygdala (n = 4,519 spots) from the true RHP domain (n = 3,940 spots). Evaluating markers increased in either population at an FDR < 0.05 and log2(FC) > 0.3, we identified 73 genes that were increased in amygdala spots and 86 genes that were increased in RHP spots. We examined the expression of these genes in all excitatory cells annotated to RHP layers and identified five clusters that we relabeled as amygdalar neurons (n = 5,524 nuclei; Supplementary Fig. 12c). Notably, these nuclei overwhelmingly came from the donors with amygdala domains identified in SRT (Br6432 n = 3,380 nuclei, Br6423 n = 1,338 nuclei), although we did annotate 721 nuclei from Br8667 to these amygdala clusters.

Pseudobulked snRNA-seq DE

Following unsupervised clustering, nuclei were pseudobulked by cluster (n = 60) and sample. We dropped all genes with zero counts across all nuclei and summed the raw gene expression counts across all nuclei in a given sample in a given cluster using the aggregateAcrossCells() function from scran126. We discarded pseudobulk samples that were composed of fewer than ten nuclei (864 remaining pseudobulk samples). This resulted in the complete removal of one cluster (cluster 58), which had already been identified as Cajal–Retzius cells by the expression of RELN. As with SRT pseudobulk samples, snRNA-seq pseudobulk samples were processed with edgeR to remove lowly expressed genes and normalize gene expression. We performed DE analysis on the n = 60 clusters using pseudobulk log2-transformed CPM values. As with SRT DE analysis, we used a limma framework to conduct enrichment DE models, accounting for block-specific (that is, FANS sample) variation and adjusting for age, sex and sequencing round covariates. For each gene and cluster, this model computed FC, Student’s t-test statistics and two-tailed P values. P values were corrected for the FDR142 (Supplementary Table 6). We correlated the gene-level enrichment model t statistics for each of the 60 snRNA-seq clusters with the t statistics from the enrichment model of the SRT domains and found concordance where appropriate (Extended Data Fig. 3h).
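
The pseudobulking step can be sketched as a grouped sum over nuclei followed by a size filter. The code below is an illustrative sketch only: the function names and toy data are hypothetical, and the log2 CPM transform shown is a simplified formula with a prior count of 1, not the edgeR calculation.

```python
import math
from collections import defaultdict

def pseudobulk(counts, clusters, samples, min_nuclei=10):
    """Sum raw counts per (cluster, sample) and drop groups below min_nuclei."""
    sums = {}
    sizes = defaultdict(int)
    for row, key in zip(counts, zip(clusters, samples)):
        sizes[key] += 1
        if key not in sums:
            sums[key] = list(row)
        else:
            sums[key] = [a + b for a, b in zip(sums[key], row)]
    return {k: v for k, v in sums.items() if sizes[k] >= min_nuclei}

def log2_cpm(vector, prior=1.0):
    """Simplified log2 counts per million (prior count is an assumption)."""
    total = sum(vector)
    return [math.log2(c / total * 1e6 + prior) for c in vector]

# Toy data: 15 nuclei x 2 genes, one cluster, two hypothetical samples.
counts = [[1, 0]] * 12 + [[4, 2]] * 3
clusters = ["GC"] * 15
samples = ["s1"] * 12 + ["s2"] * 3

pb = pseudobulk(counts, clusters, samples)  # the 3-nucleus group is dropped
pb_logcpm = {key: log2_cpm(vec) for key, vec in pb.items()}
```

A cluster is removed entirely only when none of its per-sample groups reaches the minimum size, which is how a very small population can disappear at this step.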

Detailed snRNA-seq cluster annotation

We used the statistically significant (log2(FC) > 2 and FDR < 0.0001) markers from pseudobulk DE analysis to provide greater detail to the cluster labels and identified DEGs that distinguished nearly all 60 clusters as unique in some way (Extended Data Fig. 3 and Fig. 3d). With this approach we were able to annotate small populations of distinct non-neuronal cell types through the cluster-specific expression of LAMA2 (fibroblast population of vascular leptomeningeal cells (VLMCs))149, DLC1 (pericytes/smooth muscle cells (PC/SMC))150, MECOM (endothelial cells)151 and GPR17 (committed oligodendrocyte precursors (COPs))152 (Extended Data Fig. 3a). DEGs across the six superficial RHP pyramidal neuron clusters indicated the grouping of two clusters (L2/3.1 and L2/3.5) based on the expression of TESPA1 and TSHZ2 (Extended Data Fig. 3b and Fig. 3d). Even across the amygdala neurons we identified genes distinguishing the clusters from one another, although the number of donors contributing to this population was small as the amygdala was not systematically targeted for dissection (Extended Data Fig. 3c). We were able to annotate two SUB clusters based on CNTN6 expression13 and all HPC pyramidal neurons exhibited a gradient of NUP93 expression that may serve to drive HPC identity153 (Extended Data Fig. 3f).

GABAergic neurons aggregated into cell types based on cell type-specific gene expression (Extended Data Fig. 3d) and gene expression that indicated broader GABAergic families (Extended Data Fig. 3e). Using these DEGs, we annotated the following GABAergic families: a family of caudal ganglionic eminence (CGE) origin (consisting of VIP and HTR3A cell types), based on ADARB2 expression144 and lack of NXPH1; a LAMP5+ family of ADARB2-expressing interneurons154 (consisting of CXCL14, LAMP5.CGE and LAMP5.MGE cell types); a group of GABAergic neurons from the medial ganglionic eminence (MGE; consisting of CRABP1, C1QL1, PV.FS, CORT and SST cell types), based on LHX6 expression144,154; and a population of PENK+ interneurons that originates before the major GABAergic/glutamatergic split154. We noted that the CXCL14 cluster exhibited high RELN expression, but confirmed that these were not mislabeled Cajal–Retzius neurons based on the lack of TP73 expression (Extended Data Fig. 3e); they likely represent a population of LAMP5+ interneurons that continue to express RELN into adulthood155.

Even the smallest clusters (like GC.5, n = 96 nuclei; thalamus, n = 35 nuclei; COP, n = 57 nuclei and AHi.4, n = 73 nuclei) exhibited clear expression of marker genes unique from similar clusters. Some clusters were difficult to distinguish even with DEG expression (for example, Micro.1 versus Micro.2, CP.1 versus CP.2 versus CP.3; Extended Data Fig. 3a). However, our ability to discern gene markers unique to each of the 60 clusters suggested that the high resolution with which we determined our cell clusters did not result in over-clustering.

Validation of snRNA-seq clustering approach with orthogonal computation methods

Given the importance of cluster-derived cell type identity for downstream analysis (especially NMF-based integration with SRT data), we performed a re-analysis of the snRNA-seq data from the initial QC through to clustering. Our rationale with this re-analysis is that, if we identify similar cell clusters using an orthogonal framework with different computational tools, then the conclusions generated from our n = 60 clusters are generalizable. Re-analysis was performed by a co-author (J.R.T.) who was not involved in the original snRNA-seq processing and was blinded to whether individual nuclei were included or excluded therein.

In our re-analysis, instead of identifying empty droplets using Bioconductor tools, we began with the filtered feature matrix from which empty droplets had already been removed during Cell Ranger processing. We removed the samples from sequencing round 3 that had increased debris during sorting and exhibited high mitochondrial fraction (starting nuclei = 118,415; Supplementary Fig. 24a). Because the samples from sequencing round 3 had been re-collected, there were still n = 10 donors represented. After evaluating a maximum initial mitochondrial fraction of 20%, we further filtered out all nuclei with >10% reads from the mitochondrial genome (remaining nuclei = 116,494; Supplementary Fig. 24b,c). Minimum thresholds for library size and the number of detected genes were evaluated separately for PI+ and PI+NeuN+ samples due to enrichment of neurons in the PI+NeuN+ population, the increased number of reads in the larger neuronal nuclei, and the distinct distribution of these metrics across the two sorting methods (Supplementary Fig. 24d,e). For the PI+ samples, we selected a minimum threshold of 750 detected genes and 2,000 reads (Supplementary Fig. 24f,g). For the PI+NeuN+ data, one sample in sequencing round 2 (sample 17c-scp) exhibited a substantially shifted distribution (Supplementary Fig. 24h,i) and was therefore excluded before determining QC thresholds of a minimum of 1,000 detected genes and 3,500 reads (Supplementary Fig. 24j,k). The total number of nuclei after QC filters was 81,838, with 22.5% of the 36,577 discarded nuclei originating from the reduced-quality sample 17c-scp in sequencing round 2 (all remaining samples constituting 2.85–6.80% of excluded nuclei, consistent with Supplementary Fig. 9f).

Feature selection for dimensionality reduction in re-analysis was conducted using the same binomial deviance approach as before, given the ability of this method to handle raw count data and account for batch effects. For dimensionality reduction, we performed scVI156 analysis in Python to generate latent representations that replaced the PCA and subsequent MNN batch correction used in the original analysis. Our scVI model included the covariates of sequencing round, donor ID, mitochondrial fraction, age and postmortem interval. Scanpy157 was then used to construct a nearest-neighbor graph of k = 15 neighbors based on the scVI latent representations, and subsequently to generate cell clusters using the Leiden algorithm at a resolution of 0.5 (Supplementary Fig. 25a). scVI latent representations were also used to calculate UMAP embeddings for visualization. Clusters were then refined based on the expression of MALAT1, which has been shown to indicate the quality of nuclei in snRNA-seq data, and based on mitochondrial fraction158 (Supplementary Fig. 25b–f). After filtering out low-quality clusters, new neighbors (k = 10) and clusters were found, and new UMAP embeddings computed (Supplementary Fig. 25g,h). Two small clusters (<100 nuclei) were removed due to size, and one additional small cluster was removed because it comprised nuclei from mostly one sample (110 of 138 nuclei from one sample).

Despite the very different approach to QC, re-analysis within a Python framework produced a highly similar set of nuclei: 64,251 nuclei were present in both the new set (74,216 total) and our original set (75,411 total; Supplementary Fig. 25i). To examine the similarity of the cell clusters generated, we performed the pseudobulk enrichment model on the re-analysis clusters. Specifically, we found that the clusters identified with scVI re-analysis correlated strongly with our original n = 60 clusters when comparing the gene expression patterns captured by the nuclei groupings (Supplementary Fig. 25j). This shows that the same cell types were found in the re-analysis and that the nuclei that were present in only one of the two datasets were spread across all clusters, suggesting we were not missing any cell types in either analysis.

S-LDSC

We performed S-LDSC to evaluate the enrichment of heritability of brain-related traits for gene sets defined for different cell populations. Gene sets were generated by selecting the top 10% of genes by t statistic for each population (unique SRT domain and snRNA-seq cell type). To construct a genome annotation, we added a 100 kb window upstream and downstream of the transcribed region of each gene in the corresponding gene set. GWAS summary statistics were run for each trait40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55. Following recommendations from the LDSC resource website (https://alkesgroup.broadinstitute.org/LDSCORE), S-LDSC was run for each gene set using the baseline LD model v2.2, which included 97 annotations to control for the LD between variants and other functional annotations in the genome. We used HapMap Project Phase 3 SNPs as regression SNPs, and 1000 Genomes SNPs of European ancestry samples as reference SNPs, obtained from the LDSC resource website. The z score of per-SNP heritability was used to evaluate the unique contribution of gene sets to trait heritability while accounting for contributions from other functional annotations in the baseline model. The P values were derived from the z score, assuming a normal distribution, and the FDR was computed from the P values based on the Benjamini–Hochberg procedure.
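
The final step, converting z scores to P values and then to FDR, can be sketched as below. This is an illustrative sketch: the function names and toy values are hypothetical, and the use of a one-sided (upper-tail) P value is an assumption, since the sidedness of the test is not stated above.

```python
import math

def upper_tail_p(z):
    """One-sided P value for a z score under a standard normal (assumed one-sided)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy z scores for four hypothetical gene sets tested against one trait.
z_scores = [0.0, 1.96, 3.0, -1.0]
p_values = [upper_tail_p(z) for z in z_scores]
fdr = benjamini_hochberg(p_values)
```

The running minimum enforces monotonicity, so an adjusted value never exceeds the adjusted value of a larger raw P value.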

Spot-level deconvolution algorithm benchmarking

A step-by-step protocol of the spot-level deconvolution benchmarking in the human HPC is presented in Supplementary Note 1. To estimate cellular composition across spatial domains (Extended Data Fig. 4a), we generated a spot deconvolution reference dataset in human anterior HPC using Visium-SPG to provide orthogonal validation of cellular identity. Two high-quality samples (one male/one female) were selected for IF staining of NEUN, OLIG2, GFAP and TMEM119 (Supplementary Fig. 14a). After QC and normalization, 37,845 spots across eight capture areas were retained (Supplementary Fig. 14b,c). As these samples were not included in the original PRECAST clustering, spatial domains were projected into Visium-SPG using RcppML60 (Supplementary Fig. 14d). To establish ground truth cell annotations based on IF staining, nuclei were segmented from DAPI using Cellpose 2.0 (ref. 159), and mean intensities across all IF channels were extracted (Supplementary Fig. 26a). A subset of ~1,000 annotated nuclei was used to train a CART decision tree model160 for automated classification into four broad cell types (Extended Data Fig. 4b and Supplementary Fig. 26b).

To benchmark the performance of three deconvolution tools (RCTD57, cell2location58 and Tangram59), we generated several different classification levels based on the n = 60 superfine cluster annotations (Supplementary Fig. 27a). We verified that across these resolutions we did not see donor-specific enrichment that was unexpected (for example, amygdala enrichment expected; Supplementary Fig. 27b–d). Marker genes selected via DeconvoBuddies161 were used for cell2location and Tangram (Supplementary Figs. 28 and 29 and Extended Data Fig. 4c); however, RCTD performance required the use of the RCTD default marker set (Supplementary Fig. 30). RCTD showed the highest concordance with IF-based counts (Supplementary Figs. 31–33 and Extended Data Fig. 4d). RCTD was thus applied to the full Visium-H&E dataset (Extended Data Fig. 4e), with results available in Samui162 (‘Integrating snRNA-seq and SRT data’).

Integration of snRNA-seq with NMF and label transfer

We used NMF to identify continuous patterns of gene expression within our snRNA-seq data (Extended Data Fig. 5). NMF is a dimensionality reduction technique for pattern recognition163 that is used across various disciplines, including transcriptomics61,63. NMF decomposes a matrix A into two lower-rank matrices W and H corresponding to gene-level and nucleus-level weight matrices, respectively.

$${A}_{(i,\,j)}\approx {W}_{(i,k)}{H}_{(k,\,j)}$$

where k is the rank of the factorization and both W and H are constrained to have only non-negative elements. The decomposition is performed iteratively, considering the mean squared error between A and the product of W and H. Iterations continue until the change in mean squared error between successive iterations falls below a specified tolerance of fit, which we set to 10⁻⁶ (recommended threshold for high-quality factorization results: ≤ 10⁻⁵).
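
The iterative scheme can be illustrated with a small pure-Python factorization. Note that this sketch uses classic multiplicative updates as a stand-in solver and is not the RcppML algorithm; the function names, random initialization, seed and toy matrix are all assumptions made for illustration. Only the stopping rule (change in mean squared error below a tolerance) follows the description above.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def mse(A, W, H):
    R = matmul(W, H)
    size = len(A) * len(A[0])
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr)) / size

def nmf(A, k, tol=1e-6, max_iter=5000, seed=0, eps=1e-12):
    """Rank-k NMF by multiplicative updates (stand-in solver, not RcppML)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in A]
    H = [[rng.random() + 0.1 for _ in A[0]] for _ in range(k)]
    history = [mse(A, W, H)]
    for _ in range(max_iter):
        # H <- H * (W^T A) / (W^T W H); entries stay non-negative.
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        history.append(mse(A, W, H))
        # Stop when the change in MSE between iterations falls below tol.
        if abs(history[-2] - history[-1]) < tol:
            break
    return W, H, history

# Toy 3 genes x 3 nuclei matrix with an exact rank-2 non-negative factorization.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 3.0]]
W, H, history = nmf(A, k=2)
```

Because every update multiplies a non-negative entry by a non-negative ratio, the non-negativity constraint on W and H is maintained without an explicit projection step.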

NMF analysis was implemented using RcppML, which is significantly faster and more memory efficient than other NMF methods60. Using RcppML to implement NMF, we diagonalize the columns in W and the rows in H to ensure they sum to 1, allowing for L1/L2 regularization and reproducible NMF loadings. In our case, A was the normalized snRNA-seq counts matrix with dimensions 36,601 genes × 75,411 nuclei. We selected k = 100 NMF patterns after using the cross_validate_nmf() function from singlet to perform cross-validation of different NMF ranks164 (Supplementary Fig. 34). The matrix of gene-level NMF weights for k = 100 patterns is presented in Supplementary Table 9, and a simplified version containing the top ten genes weighted to each NMF pattern is presented in Supplementary Table 10 and Supplementary Fig. 35.

Dataset integration with NMF

To integrate gene expression patterns identified with NMF in the source dataset (snRNA-seq) with other target datasets (for example, SRT), we performed transfer learning with the project() function from RcppML, which uses a fast implementation of non-negative least squares165. We mathematically projected the source W matrix with k = 100 NMF patterns into a second, target dataset based on the target matrix (A′) with dimensions i′ genes and j′ observations to generate the target coefficient matrix of H′.

$${{W}^{{{T}}}}_{(k,i^{\prime})}{A^{\prime}}_{(i^{\prime},\,j^{\prime})}\approx{H^{\prime}}_{(k,\,j^{\prime})}$$

Notably, i′ is the number of genes present in both source snRNA-seq (A) and target (A′) gene expression matrices. To assess the validity of NMF pattern transfer, we examined the frequency of non-zero weighted observations for each NMF pattern in the target coefficient matrix (H′) and used the empirical cumulative distribution function to determine a minimum number of observations for consideration of NMF patterns.
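
Pattern transfer amounts to solving, for each target observation, a non-negative least squares problem with W held fixed. The sketch below uses a simple coordinate descent solver written for this illustration; it is not the RcppML project() implementation, and the function names and toy matrices are hypothetical. The final helper counts non-zero weighted observations per pattern, the quantity examined when assessing pattern transfer.

```python
def nnls_coordinate_descent(W, a, n_iter=200):
    """Solve min_h ||a - W h||^2 subject to h >= 0 for one observation a."""
    k = len(W[0])
    cols = list(zip(*W))
    G = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]  # W^T W
    b = [sum(x * y for x, y in zip(ci, a)) for ci in cols]                     # W^T a
    h = [0.0] * k
    for _ in range(n_iter):
        for j in range(k):
            if G[j][j] == 0:
                continue
            grad = sum(G[j][l] * h[l] for l in range(k)) - b[j]
            # Exact minimization along coordinate j, clipped at zero.
            h[j] = max(0.0, h[j] - grad / G[j][j])
    return h

def project_patterns(W, A_target):
    """Return H' (k patterns x j' observations) for a genes x observations matrix."""
    columns = [nnls_coordinate_descent(W, list(col)) for col in zip(*A_target)]
    return [list(row) for row in zip(*columns)]

def nonzero_counts(H):
    """Number of observations with a non-zero weight, per pattern."""
    return [sum(1 for v in row if v > 0) for row in H]

# Toy example: 3 shared genes, k = 2 patterns, 2 target observations.
W_toy = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
A_target = [[2.0, 0.0], [0.0, 3.0], [5.0, 1.0]]
H_target = project_patterns(W_toy, A_target)
counts_per_pattern = nonzero_counts(H_target)
```

Rows of W and A′ must refer to the same i′ shared genes in the same order before projection; genes absent from either matrix are dropped beforehand.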

To explore different questions in this manuscript, the source snRNA-seq NMF patterns were transferred to three different target datasets: our SRT dataset (Extended Data Fig. 6d), a mouse electroconvulsive stimulation (ECS) snRNA-seq dataset85 (Extended Data Fig. 8b,c) and a mouse single-nucleus methylation sequencing (snmC-seq) dataset in which retroviral tracing before collection meant that sequenced neurons were identified by both the brain region from which they were collected and the brain region to which their axon projected95 (Extended Data Fig. 9b).

NMF transfer to mouse ECS snRNA-seq dataset

For pattern transfer to the mouse ECS snRNA-seq data85, we downloaded the processed data object from the GitHub repository for the original mouse study and used the full dataset (n = 15,990 nuclei; https://github.com/Erik-D-Nelson/ARG_HPC_snRNAseq/blob/main/processed_data/sce_subset.rda.xz). Information on how these data were processed before download can be found in the same GitHub repository (https://github.com/Erik-D-Nelson/ARG_HPC_snRNAseq/tree/main/code). To enhance interpretability of our human datasets, we identified human orthologs for mouse genes in the ECS dataset and removed all genes without identified orthologs. We matched these orthologs with genes in the W matrix and discarded any genes not included in both datasets. Following the removal of genes without orthologs and genes not included in both datasets, we retained 17,557 genes. We used normalized, log2-transformed counts from the mouse ECS snRNA-seq dataset as the A′ matrix. Following label transfer, patterns were normalized to sum to one in each target dataset. We removed patterns with non-zero weights in <1,000 nuclei in the mouse ECS snRNA-seq dataset (Extended Data Fig. 8a,b). To facilitate interpretation within the context of our human findings, we further subset the data to only NMF patterns that also mapped to >1,050 SRT spots (Extended Data Fig. 8c). We found that the appropriate patterns corresponded to the relevant cell types (Extended Data Fig. 8b,c).
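
The gene-matching and post-transfer filtering steps can be sketched as follows. This is a minimal illustration with hypothetical function names, a hypothetical ortholog lookup table and toy inputs. It assumes that normalization scales each pattern to sum to one across the observations of the target dataset, and that a pattern is kept when its count of non-zero weighted observations is at least the stated minimum.

```python
def match_orthologs(w_genes, mouse_genes, mouse_to_human):
    """Return (human, mouse) gene pairs present in both W and the mouse dataset."""
    w_set = set(w_genes)
    human_to_mouse = {}
    for gene in mouse_genes:
        human = mouse_to_human.get(gene)  # None if no ortholog is known
        if human in w_set and human not in human_to_mouse:
            human_to_mouse[human] = gene
    # Preserve the gene order of W so rows of W and A' stay aligned.
    return [(h, human_to_mouse[h]) for h in w_genes if h in human_to_mouse]

def normalize_patterns(H):
    """Scale each pattern (row) to sum to one across observations."""
    out = []
    for row in H:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out

def keep_patterns(H, min_nonzero):
    """Indices of patterns with non-zero weights in at least min_nonzero observations."""
    return [i for i, row in enumerate(H) if sum(v > 0 for v in row) >= min_nonzero]

# Toy inputs; the lookup table is a hypothetical stand-in for an ortholog resource.
w_genes = ["PROX1", "SYT1", "XIST"]
mouse_genes = ["Prox1", "Syt1", "Gm1234"]
lookup = {"Prox1": "PROX1", "Syt1": "SYT1"}
pairs = match_orthologs(w_genes, mouse_genes, lookup)

H_norm = normalize_patterns([[1.0, 3.0], [0.0, 0.0]])
kept = keep_patterns(H_norm, min_nonzero=1)
```

In the analysis described above the minimum would be 1,000 nuclei for the mouse ECS dataset; the value of 1 here only suits the toy matrix.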

NMF transfer to mouse snmC-seq dataset

We leveraged a mouse model linking snmC-seq in neurons with their axon projection targets through retroviral tracing95 to see if any NMF patterns generated from our human snRNA-seq data corresponded to HPC neurons with specific axonal projection targets. We downloaded the data from GEO (accession code GSE230782). As only the raw data were available, we reproduced the original authors’ pipeline by adapting code found in the GitHub repository for the snmC-seq study (https://github.com/zhoujt1994/EpiRetroSeq2023/blob/main/02.integration_mc_rna/03.process_RS2.ipynb). This required extracting CH gene body methylation counts to approximate gene expression, based on the premise that non-CpG cytosine methylation exhibits a strong inverse relationship with gene expression in neurons166. Following the original authors’ specifications, data were QC processed and cell type clusters were generated by integrating with mouse single-cell RNA-seq data. We then subset the snmC-seq dataset to excitatory HPC and RHP neurons only (n = 2,004 nuclei). Then, following the initial study design, CH gene body methylation counts were extracted, log-scaled and negated. This matrix was used as the A′ matrix for pattern transfer. We identified human orthologs for mouse genes in this dataset and removed all genes without orthologs. We matched these orthologs with genes in the W matrix and discarded any genes not included in both datasets. Following the removal of genes without orthologs and genes not included in both datasets, we retained 14,452 genes. Following label transfer, patterns were normalized to sum to one in each target dataset. We isolated the NMF patterns that we previously identified (Extended Data Fig. 6a) as corresponding to the tissue source of the snmC-seq dataset (CA patterns: nmf11, nmf63, nmf61 and nmf15; prosubiculum/subiculum patterns: nmf32, nmf40 and nmf54; RHP patterns: nmf65, nmf22, nmf53, nmf68, nmf51, nmf45, nmf84, nmf27 and nmf17). We then removed those patterns with non-zero weights in <45 snmC-seq nuclei (Extended Data Fig. 9a). We found that the remaining patterns correctly corresponded to nuclei collected from relevant brain regions (Extended Data Fig. 9b).

Postintegration analyses of human HPC datasets

After NMF transfer to SRT, we excluded patterns with non-zero weights in <1,050 spots (Extended Data Fig. 6c). We identified two sex-based patterns based on the elevated weights of XIST (nmf37) and chrY genes (nmf27; Supplementary Fig. 36). These patterns were also excluded from downstream analysis since sex as a biological variable was not a focus in this study.

We observed that, after transfer to SRT, cell type-specific patterns corresponded with the appropriate spatial domains (Extended Data Fig. 6d). Spot-level weights of the general NMF patterns remained diffusely distributed across spatial domains, with the exception of nmf94, which was highly weighted in spots annotated to the CP (Supplementary Fig. 37a). The CP marker gene TTR was the highest weighted gene to nmf94 and was weighted 8.79 times higher than the next highest weighted gene (Supplementary Table 10 and Supplementary Fig. 35). We determined that nmf94 was distinct from the TTR-dominant CP-specific transcriptional program captured by nmf48 (Supplementary Fig. 37b–d) and that observation-level weights reflected the donor bias of CP nuclei without actually being elevated in CP nuclei (Supplementary Fig. 37e,f). It appears that when nmf94 was transferred to the SRT dataset, the nuanced donor-specific effect was not conserved due to the strong importance of TTR expression in determining observation-level weights, and nmf94-weighted spots were nearly identical to nmf48-weighted spots (Supplementary Fig. 37g–j). This highlights the importance of a close examination of NMF patterns to determine the biological or technical significance of the gene expression patterns they represent and whether they can be interpreted in the same way in a dataset to which they are projected.

To ascertain if additional NMF patterns captured donor-specific effects that could confound biological interpretation in downstream analysis, we examined the donor-level representation in non-zero weighted observations across all general and specific NMF patterns (Supplementary Fig. 38). We identified two patterns (nmf91 and nmf20) that were equally represented across snRNA-seq samples but exhibited notable enrichment for two donors when projected to the SRT dataset (Extended Data Fig. 7a,c and Supplementary Fig. 38a,b, asterisk). These patterns were characterized by stimulus- and activity-dependent genes (Extended Data Fig. 7b,d). The increased abundance of these genes and the increased weight of nmf91 and nmf20 in nuclei across donors is likely due to the increased sensitivity of snRNA-seq to detect these sparsely expressed transcripts (Extended Data Fig. 7e,f). Thus, the apparent donor bias seemingly introduced when nmf91 and nmf20 are projected onto the SRT dataset is likely biologically meaningful and highlights specific donors in whom stimulus-dependent transcripts were enriched. This is explored further in Fig. 5.

To validate the biological relevance of genes highly weighted to individual NMF patterns, we performed GSEA on patterns explored in Fig. 4. GSEA was implemented with fgsea167 using reactome.db pathways168, limited to pathways annotated to Homo sapiens. Non-zero gene-level weights were used for the gene score, resulting in 4,661 genes for oligodendrocyte-specific nmf44, 9,030 genes for astrocyte-specific nmf81, 7,616 genes for transneuronal nmf13 and 7,675 genes for transneuronal nmf7. Results were evaluated as significant at an FDR-adjusted P value < 0.05. The FDR-adjusted P value and normalized enrichment score for these tests are reported in Fig. 4c,f,i,j.

Postintegration analyses of the mouse ECS dataset

To investigate whether the genes contributing to the differential weights of select NMF patterns between ECS GCs and sham GCs were associated with neuronal activation85, we performed DE analysis using scran to identify markers using a binomial test. The ECS snRNA-seq data were subset to GCs and limited to the genes with non-zero weights for the NMF pattern being tested. Results were evaluated at a strict threshold of FDR-adjusted two-tailed P value < 1 × 10⁻³⁰ and an absolute value of the log2(FC) > 1. Genes labeled with text in Fig. 5 volcano plots were further characterized by high NMF pattern weights: nmf91 weight > 0.0015, nmf20 weight > 0.00065, nmf10 weight > 0.00065 and nmf14 weight > 0.0005.

We observed that nmf55 nuclei weights were higher in ECS GCs compared to sham GCs (Extended Data Fig. 8c). However, eight of the top ten genes weighted to nmf55 encode ribosomal subunits (RPS24, RPL26, RPS27A, RPL32, RPS12, RPLP1, RPL34 and RPS8), and the remaining two genes are strongly associated with translation (TPT1 and EEF1A1; Supplementary Table 10). We performed DE analysis on non-zero weighted nmf55 genes and found that although these genes were significantly increased in ECS GCs, the magnitude of log2(FC) and adjusted P values were much attenuated compared to nmf91 and nmf20 which captured activity-dependent transcriptional programs (Extended Data Fig. 8d and Fig. 5). De novo protein production in response to neuronal activity is critical for subsequent changes in synaptic and circuit connectivity169 and production of translational machinery like ribosomes likely facilitates such changes170. Thus, although ribosome transcript-dominant nmf55 is associated with activated mouse GCs, it is unlikely that nuclei or spots highly weighted for nmf55 in our human snRNA-seq or SRT datasets were recently activated. This is supported by the ubiquitous weighting of nmf55 across SRT spatial domains (Extended Data Fig. 8e). This example highlights the need for biological context in interpreting NMF patterns.

Spatial mapping of RHP laminar organization and identification of layer-specific SUB marker genes

Spatial mapping of SUB cluster-specific NMF patterns with SRT revealed laminar distribution in the SUB (Fig. 6a). To capitalize on the enhanced detail present in the integrated NMF representations, we generated binarized thresholds for individual NMF patterns by taking one fifth of the 95% max spot-level weight (Supplementary Fig. 39a). This revised spot-level annotation strategy exhibited strong concordance with the SUB spatial domain identified via unsupervised clustering (Supplementary Fig. 39b) and was used in combination with broad domain labels in Fig. 6c,e to provide visual reference for SUB spots when examining the laminar organization of other RHP-enriched NMF patterns. This binarization strategy was extended to the RHP-enriched NMF patterns to permit the discrete labeling of superficial and L5 ENT spots and was performed for CA1 and CA3 as a measure of validation (Supplementary Fig. 39d,e). The following NMF patterns were used to generate the binarized domain labels: CA1 (nmf15), CA3 (union of nmf11 and nmf63), ENT_sup (union of nmf84, nmf45 and nmf27) and ENT_L5 (nmf51). Only spots that passed exactly one of the thresholds were kept for the combined annotation (Supplementary Fig. 39d). The increased detail of these NMF-driven thresholded annotations provided histological context that assisted in the classification of the spots (Supplementary Fig. 17) and nuclei (Supplementary Fig. 39e) labeled by nmf17 as the PreS.
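
The thresholding and label-combination logic can be sketched as below. This is a hedged illustration with hypothetical function names and toy weights. In particular, it assumes that the reference value is the 95th percentile of spot-level weights, which is one possible reading of the description above, and it computes that percentile by linear interpolation.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def binarize(weights, q=95.0, fraction=0.2):
    """Flag spots whose weight exceeds one fifth of the reference value (assumed 95th percentile)."""
    cutoff = fraction * percentile(weights, q)
    return [w > cutoff for w in weights]

def union(*masks):
    """A spot passes if it passes any of the supplied pattern masks."""
    return [any(flags) for flags in zip(*masks)]

def combine_labels(mask_by_label):
    """Label a spot only if it passes exactly one threshold; otherwise None."""
    names = list(mask_by_label)
    n_spots = len(mask_by_label[names[0]])
    labels = []
    for i in range(n_spots):
        hits = [name for name in names if mask_by_label[name][i]]
        labels.append(hits[0] if len(hits) == 1 else None)
    return labels

# Toy spot-level weights for three hypothetical patterns across five spots.
ca1_pattern = [1.0, 0.9, 0.0, 0.0, 0.8]
ca3_pattern_a = [0.0, 0.5, 1.0, 0.0, 0.0]
ca3_pattern_b = [0.0, 0.0, 0.6, 0.0, 0.0]

masks = {
    "CA1": binarize(ca1_pattern),
    "CA3": union(binarize(ca3_pattern_a), binarize(ca3_pattern_b)),
}
labels = combine_labels(masks)
```

Spots passing more than one threshold are left unlabeled rather than assigned to the strongest pattern, which keeps the combined annotation conservative at domain boundaries.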

We examined SUB and deep RHP snRNA-seq clusters (Extended Data Fig. 9c) for evidence that nmf65 indicated nuclei corresponding to the deep SUB as suggested by nmf65 laminar organization (Fig. 6e). While the entirety of the L6.1 cluster was highly weighted to nmf65, only a portion of the L6b cluster was highly weighted to this pattern (Fig. 6f). To ascertain if these deep pyramidal nuclei corresponded to the separate populations of RHP and SUB-adjacent spots labeled by nmf65, we split L6b into two portions by thresholding nmf65 weights (L6b_nmf65 with nmf65 weights > 0.00035; Extended Data Fig. 9d). This grouping was used to visualize gene expression of top DEGs that were also highly weighted to the relevant NMF patterns and indicated that L6.1 is more accurately described as deep SUB (Extended Data Fig. 9e,f).
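As a minimal illustration of the L6b split, the sketch below relabels L6b nuclei whose nmf65 weight exceeds the stated cutoff of 0.00035. The function name, the input layout (parallel arrays of cluster labels and weights) and the choice to leave the remaining nuclei labeled "L6b" are assumptions made for illustration.

```python
import numpy as np

NMF65_CUTOFF = 0.00035  # threshold stated in the text

def split_l6b(cluster_labels, nmf65_weights, cutoff=NMF65_CUTOFF):
    """Relabel L6b nuclei with nmf65 weight above the cutoff as 'L6b_nmf65'.

    Nuclei in other clusters, and L6b nuclei at or below the cutoff,
    keep their original label (an assumption of this sketch).
    """
    labels = np.asarray(cluster_labels, dtype=object).copy()
    weights = np.asarray(nmf65_weights, dtype=float)
    high = (labels == "L6b") & (weights > cutoff)
    labels[high] = "L6b_nmf65"
    return labels
```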

Following NMF-driven assessment of the spatial localization of the clusters corresponding to SUB and RHP pyramidal neurons, the following snRNA-seq clusters were relabeled: L6.1 to Sub.3 (Extended Data Fig. 9e,f); L2/3.1 to PreS (Fig. 6g and Supplementary Figs. 17 and 39e); L2/3.4 to ENT.sup1, L2/3.6 to ENT.sup2a, L2/3.3 to ENT.sup2b, L2/3.2 to ENT.sup3 and L5.1 to ENT.L5 (Fig. 6c,d); L5.2 to RHP.CBLN2+ (based on Extended Data Fig. 3b); and L6.2 to RHP.L6 and L6b to RHP.L6b (Fig. 6e,f).

Characterization of SUB layers

Integration of our snRNA-seq and SRT datasets with NMF revealed that four snRNA-seq clusters corresponded to subicular gene expression patterns with laminar organization. Following the identification of the superficial SUB (Sub.1), middle SUB (Sub.2), deep SUB (Sub.3) and presubiculum (PreS), we performed DE analysis (scran-implemented binomial test) in our snRNA-seq dataset focused on identifying genes that could specifically identify these regions (Fig. 6i and Extended Data Fig. 10).

To test for genes that distinguished the Sub.1 and Sub.2 clusters, we subset to regions with similar gene expression and that are spatially adjacent (CA1, ProS, Sub.3, PreS), testing for enrichment across six groups in total and only for genes with counts in >100 nuclei (22,572 genes × 8,783 nuclei). Sub.1 markers were considered statistically significant at −log10(FDR) > 30 and with log2(FC) > 1 versus each of the CA1, ProS, Sub.2, Sub.3 and PreS (146 genes). After filtering to genes with an average log2 CPM expression of >1 in the Sub.1 nuclei, we identified ten superficial SUB marker genes: AC007368.1, AL138694.1, ATP6V1C2, COL21A1, EBF4, FN1, NDST4, PARD3B, PRKCH and RAPGEF3 (Extended Data Fig. 10a). Sub.2 markers were considered statistically significant at −log10(FDR) > 30 and with log2(FC) > 1 versus each of the CA1, ProS, Sub.1, Sub.3 and PreS (184 genes). After filtering to genes with an average log2 CPM expression of >1 in the Sub.2 nuclei, we identified seven middle SUB marker genes: GDNF-AS1, LHFPL3, MAMDC2, PCED1B, RORB, SULF1 and TRPC3 (Extended Data Fig. 10b). ProS markers were considered statistically significant at −log10(FDR) > 30 and with log2(FC) > 1 versus each of the CA1, Sub.1, Sub.2, Sub.3 and PreS (17 genes). After filtering to genes with an average log2 CPM expression of >1 in the ProS nuclei, we identified no unique prosubiculum marker genes.

To test for genes that distinguished the Sub.3 cluster from other SUB layers and other deep RHP clusters, we subset to regions with similar gene expression and that are spatially adjacent (Sub.1, Sub.2, RHP.L6b, RHP.L6), testing for enrichment across five groups in total and only for genes with counts in >100 nuclei (21,037 genes × 5,647 nuclei). Sub.3 markers were considered statistically significant at −log10(FDR) > 30 and with log2(FC) > 1 versus each of the Sub.1, Sub.2, RHP.L6b and RHP.L6 (68 genes). After filtering to genes with an average log2 CPM expression of >1 in the Sub.3 nuclei, we identified 21 deep SUB marker genes—AC007100.1, AC010967.1, AC023503.1, AC046195.2, AL356295.1, CD36, COL4A1, COL4A2, DISC1, FSTL5, GUCA1C, LINC01194, LINC01239, LINC01821, NR2F2-AS1, PLEKHG1, RASGEF1B, SCN7A, SCUBE1, SNCAIP and VEGFC (Extended Data Fig. 10c).

To test for genes that distinguished the PreS cluster from other SUB layers and other superficial RHP clusters, we subset to regions with similar gene expression and that are spatially adjacent (Sub.1, Sub.2, ProS, ENT.sup1, ENT.sup2a, ENT.sup2b and ENT.sup3), testing for enrichment across eight groups in total and only for genes with counts in >100 nuclei (22,708 genes × 9,835 nuclei). PreS markers were considered statistically significant at −log10(FDR) > 30 and with log2(FC) > 1 compared to each of the Sub.1, Sub.2, ProS, ENT.sup1, ENT.sup2a, ENT.sup2b and ENT.sup3 (38 genes). After filtering for genes with an average log2 CPM expression of >1 in the PreS nuclei, we identified the following five PreS marker genes: AC008662.1, AL161629.1, FSTL5, MDFIC and WSCD1 (Extended Data Fig. 10d).
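The marker-selection rule applied to each SUB layer above can be summarized in a short sketch. It does not reproduce the DE test itself and takes test results as input. The input layout is an assumption: one FDR per gene and one log2(FC) per comparison group (the text does not state whether the FDR was computed per comparison or combined), and "counts in >100 nuclei" is read as nonzero counts in more than 100 nuclei. Function and argument names are hypothetical.

```python
import numpy as np

def detected_genes(counts, min_nuclei=100):
    """Mask of genes (rows) with nonzero counts in more than min_nuclei nuclei (columns)."""
    counts = np.asarray(counts)
    return (counts > 0).sum(axis=1) > min_nuclei

def select_markers(fdr, log2_fc, mean_log2_cpm,
                   min_neg_log10_fdr=30.0, min_log2_fc=1.0, min_expr=1.0):
    """Return genes passing the significance, fold-change and expression filters.

    fdr:           dict gene -> FDR for the target cluster
    log2_fc:       dict gene -> dict comparison group -> log2(FC) versus that group
    mean_log2_cpm: dict gene -> average log2 CPM in the target cluster
    """
    # -log10(FDR) > 30 is equivalent to FDR < 1e-30; this form avoids log10(0).
    fdr_cutoff = 10.0 ** (-min_neg_log10_fdr)
    markers = []
    for gene in sorted(fdr):
        significant = fdr[gene] < fdr_cutoff
        # The fold-change threshold must hold versus each comparison group.
        enriched = all(fc > min_log2_fc for fc in log2_fc[gene].values())
        expressed = mean_log2_cpm.get(gene, 0.0) > min_expr
        if significant and enriched and expressed:
            markers.append(gene)
    return markers
```

Requiring the fold-change threshold against every comparison group, not on average, is what makes the resulting genes specific to a single layer and not shared between neighboring regions.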

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.