Main

Cell-type annotation is a core step in single-cell RNA sequencing (scRNA-seq) pipelines. The quality of annotations directly impacts downstream analyses, including mapping cellular diversity across tissues and deciphering cell-type-specific regulatory mechanisms. Manual annotation remains time-consuming and dependent on domain-specific expertise, and the rapid adoption of scRNA-seq as a standard laboratory technique has created an urgent need for automated, scalable solutions1. With repositories such as the Human Cell Atlas2 and CELLxGENE3 now containing over 100 million cells, accurate and robust annotation methods are a critical first step in translating these large-scale datasets into actionable biological insights4,5.

Automated atlas-level cell-type annotation can be framed as a supervised classification problem, where models assign labels to individual cells on the basis of gene expression profiles, using reference annotations provided by original studies6,7,8. A defining feature of this task is that cell types are organized within a hierarchical ontology9,10, forming a multilevel taxonomy. For example, ‘leukocytes’ represent a broad category that contains ‘lymphocytes’, which in turn includes more specific subtypes such as ‘B cells’. However, annotation practices vary substantially between studies—some assign broad categories, while others distinguish fine-grained subtypes. This inconsistency in label granularity introduces ambiguity into the training signal, as models must infer the appropriate level of resolution without explicit guidance. More formally, the annotation task can be viewed as learning a function f: X → Y, where X is the space of gene expression profiles and Y is a structured label space defined by a directed acyclic graph (DAG). In this graph, each node corresponds to a cell type and directed edges represent subtype relationships—for example, ‘B cell’ and ‘T cell’ are children of ‘lymphocyte’. This structure captures relationships across varying levels of annotation granularity11,12,13.

Many methods have been developed to perform automated cell-type annotation, ranging from logistic regression to deep learning architectures14,15,16,17. Recent benchmarking studies have shown that deep learning models outperform simpler methods as the number of cells in a dataset increases11. Importantly, these evaluations were conducted using donor-partitioned training and test splits, a design we refer to as the in-distribution (ID) setting (Fig. 1a). While useful for controlled comparisons, such splits do not reflect how cell atlases evolve in practice, where new studies are continually added and must be annotated upon release.

Fig. 1: Evaluating model generalization in continuously updated single-cell atlases reveals sharp OOD performance drops for the annotation task.

a, A curated subset of the CELLxGENE census (May 2023 release) consisting of 22.2 million human cells annotated with 164 curated cell types, spanning 5,052 donors, 56 tissues and 249 studies. All cells were profiled using 10x Genomics platforms. b, OOD test set comprising 2.6 million newly added cells from 21 studies in the December 2023 release. These cells span 470 donors and 16 tissues, and are annotated with 80 of the 164 original training cell types. All cells are also profiled using 10x Genomics platforms. c, We train three models (linear classifier, MLP and TabNet) on a donor-partitioned training set comprising 15.2 million cells from the May 2023 CELLxGENE census. d, ID test set comprising 3.4 million cells from the May 2023 release of the CELLxGENE census, held out by donor. The linear model, MLP and TabNet achieve 80%, 82% and 84% macro F1 scores, respectively (mean taken over four independent runs). e, All models exhibit substantial OOD performance drops: macro F1 scores decrease to 55%, 57% and 52% for the linear model, MLP and TabNet, respectively. The dashed red bars indicate the ID performances for comparison.


Results

To better evaluate generalization to newly released studies, we consider an out-of-distribution (OOD) set-up in which models are tested on datasets not seen during training (Fig. 1b). We trained three methods with increasingly complex architectures (a linear classifier, a multilayer perceptron (MLP) and TabNet18) on an atlas of 15.2 million human cells annotated with 164 unique cell types, curated in the scTab study11 from the May 2023 release of the CELLxGENE census (Fig. 1c). We then evaluated each method on 2.6 million human cells from 21 studies newly added during the 2023-12-15 release, spanning 470 donors, 16 tissues and 80 of the original 164 cell types represented in the training set. Despite being evaluated on the same cell types profiled with the same assays, macro-averaged F1 (macro F1) scores dropped by 24–32% for the linear classifier, MLP and TabNet when moving from the ID case (Fig. 1d) to the OOD setting (Fig. 1e), underscoring the limitations of current modeling strategies in generalizing across studies.

To address these shortcomings, we introduce a hierarchical cross-entropy (HCE) loss that explicitly incorporates the structural relationships between cell types (Methods). Unlike standard cross-entropy (CE), which treats all classes as flat and independent, HCE enforces a consistency constraint: the probability assigned to a general cell type (for example, T cell) must be at least as high as the sum of more granular subtypes (for example, ‘α–β T cell’ and ‘γ–δ T cell’). This prevents the model from needing to choose between broad and granular labels, since predicting a child inherently implies selecting the parent in the hierarchy (Fig. 2a).

Fig. 2: HCE loss improves performance across architectures.

a, The standard CE loss defines a probability distribution over a flat label set, treating each cell type independently and requiring that probabilities sum to unity across the ontology. The HCE loss modifies these predictions by propagating probability mass up the DAG of the cell ontology: parent nodes such as T cell accumulate mass from their more specific descendants, such as α–β T cell and γ–δ T cell, encouraging biologically coherent predictions. b, The HCE loss improves macro F1 scores by 12–15% on OOD evaluations across the linear classifier, MLP and TabNet. All performance metrics reported reflect the mean taken over four independent training and evaluation runs per model, with results from each run shown as individual dots (color coding remains the same as in the legend). c, Per-cell-type performance changes induced by the HCE loss strategy for the MLP model, shown relative to standard CE. All performance metrics reported reflect the mean taken over four independent training and evaluation runs per model, with results from each run shown as individual dots (color coding remains the same as in the legend). For each cell type, a paired t-test was performed and P values were adjusted using the Holm–Bonferroni method to correct for multiple hypothesis testing. d, Improvements from HCE loss for the MLP model visualized directly on the cell ontology DAG consisting of all 164 cell types seen in the training set. Node size reflects the number of cells of that type seen in training, while color indicates the change in F1 score, shown as a gradient from green (improvement) to white (neutral/no change) to red (decline). Gray nodes correspond to cell types not observed in the OOD test set.


While there are methods that leverage ontological information for cell-type annotation, they do not enforce hierarchical consistency as an integral part of their predictive framework. For example, OnClass maps both transcriptomic profiles and cell ontology structure into a joint embedding space, enabling both the annotation of unseen cell types and the identification of marker genes12. However, it operates primarily as a nearest-neighbor or embedding search algorithm and does not couple hierarchical relationships to the learned probabilities for each cell. As a result, sibling classes or intermediate states can still be misassigned if their embeddings overlap in feature space. As another example, popV aggregates predictions from multiple classifiers using ontology-based voting, producing robust consensus labels and uncertainty estimates for ambiguous or outlier populations7. However, the ontology is used only as a scaffold for post hoc reconciliation and not as a guide for model optimization. This means that hierarchical constraints are not encoded in training and possible conflicts or inconsistencies in the ensemble are resolved heuristically. In contrast, SCimilarity focuses on metric learning for scalable, cross-study retrieval of transcriptionally similar cells, using the ontology at training time to exclude ambiguous annotation pairs when sampling triplets for a contrastive loss function19. The learned representation supports high-quality search and transfer tasks but is not directly optimized for hierarchical or taxonomic consistency when determining class probabilities. In summary, unlike these approaches, our HCE loss explicitly encodes hierarchical dependencies and relationships into the model’s objective, ensuring that all predictions respect the structure of the cell ontology.

Applying the HCE loss improved OOD macro F1 scores by 12–15% for the linear classifier, MLP and TabNet, without modifying their architecture or tuning any hyperparameters (Fig. 2b). These consistent gains demonstrate the widespread benefits of hierarchy-aware training for cell-type annotation tasks. The HCE loss function enables the recovery of roughly half of the performance drop observed when models are applied to new studies, underscoring the practical value of aligning training objectives with ontology structure. To further assess the consistency of this effect, we evaluated performance across each of the 21 held-out studies individually (Supplementary Fig. 1). With the exception of a single study evaluated with the linear model, all HCE-trained models showed statistically significant improvements, highlighting the robustness of this approach across diverse experimental settings.

To better understand where the improvements from hierarchical training arise, we identified cell types that exhibited statistically significant changes in performance between models trained with and without the HCE loss. These differences were determined using a paired t-test across training runs, with P values corrected using the Holm–Bonferroni method (Methods). In the MLP model, for example, HCE led to improvements of up to 0.9 in F1 score for cell types such as ‘glutamatergic neuron’ and ‘CD14+ monocyte’ (Fig. 2c). Examining these effects in relation to the cell ontology, we found that the largest gains occurred for internal nodes, particularly those embedded in densely connected regions of the DAG where related types were annotated in the training data (Fig. 2d and Supplementary Figs. 2–4). In contrast, leaf nodes—especially structurally isolated ones—showed more modest gains (Supplementary Figs. 5 and 6). This aligns with the intuition that the hierarchical loss is most effective when it can propagate signal across nearby cell types. Similar trends were observed for the linear and transformer-based models, highlighting the architecture-agnostic nature of the effect (Supplementary Fig. 7). Importantly, gains were largely unaffected by a cell type’s rarity, the number of contributing studies or tissues and the diversity of sequencing technologies used—further underscoring the robustness of the approach (Supplementary Fig. 8). Finally, these gains extend to cells observed in new contexts, including across diseases and tissues not seen in the training set, where we also observe consistent improvements (Supplementary Figs. 9 and 10).

Discussion

Our results challenge the view that increasing model complexity is the primary route to improved cell-type annotation at atlas scale. Instead, we demonstrate that aligning the training objective with biological structure—through an HCE loss—consistently improves generalization across model classes, from linear classifiers to transformers. Critically, our findings suggest a strategy for building more effective training sets: rather than simply adding data, efforts should prioritize studies that increase connectivity among annotated cell types, especially in sparsely represented regions, thereby amplifying the generalization capabilities of learning architectures. While the hierarchical objective recovers roughly half of the OOD performance drop, some decrease in accuracy is expected due to imperfect annotation agreement across studies. Understanding the downstream impacts of these inconsistencies and correcting them will be essential as atlas-level resources become more standardized and begin to power the next generation of AI-driven discoveries in biology. It is important to note that HCE relies on a predefined, labeled DAG, and while the cell ontology serves as a valuable reference it is continuously evolving, with ongoing updates to cell-type definitions and their hierarchical relationships. Furthermore, while this study centers on cell-type classification, the hierarchical loss generalizes to any setting with structured label spaces, offering a simple drop-in replacement for standard CE that brings domain knowledge into model training. This points to a broader opportunity to incorporate biological priors into learning objectives, an increasingly important consideration as models are trained on ever-growing single-cell atlases19,20,21.

Methods

Training and evaluation datasets

The dataset used in this study originates from the same filtered subset of the CELLxGENE census (v.2023-05-15)3 that was curated for the scTab study11. This subset was constructed by applying strict inclusion criteria to the full census: only primary human cells profiled with 10x Genomics technologies were retained and the feature space was limited to 19,331 human protein-coding genes. Cell types were required to appear in at least 5,000 cells drawn from a minimum of 30 donors. All gene expression profiles were size-factor normalized to 10,000 counts per cell and log-transformed with a pseudocount of 1 (that is, f(x) = log(x + 1)). The resulting dataset included 22,189,056 cells annotated with 164 distinct cell types, spanning 5,052 donors and 56 tissues. For the ID task, we adopted the same donor-partitioned data split as used by Fischer et al.11—that is, 15,240,192 cells for training, 3,500,032 for validation and 3,448,832 for testing.
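The preprocessing described above can be sketched as follows. This is a minimal NumPy version for illustration only; the function name and toy count matrix are ours, not part of the original pipeline.

```python
import numpy as np

def preprocess(counts, target_sum=10_000.0):
    """Size-factor normalize each cell to `target_sum` counts, then log1p.

    counts: (n_cells, n_genes) raw count matrix.
    Illustrative sketch of the normalization described in the text.
    """
    counts = np.asarray(counts, dtype=np.float64)
    size_factors = counts.sum(axis=1, keepdims=True)
    size_factors[size_factors == 0] = 1.0  # guard against empty cells
    normalized = counts / size_factors * target_sum
    return np.log1p(normalized)  # f(x) = log(x + 1)

# toy example: two cells, three genes
X = np.array([[1.0, 2.0, 7.0],
              [0.0, 5.0, 5.0]])
X_log = preprocess(X)
```

After undoing the log transform, each cell's expression sums back to exactly 10,000 counts, which is the invariant the size-factor normalization establishes.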

The OOD test dataset consisted of all newly added human cells in a subsequent release of the CELLxGENE census (v.2023-12-15). These cells were also profiled using 10x Genomics platforms and annotated with one of the 164 labels observed during training. This resulted in approximately 2.6 million cells drawn from 21 studies, covering 80 of the 164 training cell types.

Cell ontology

We used the cell ontology obtained from the Ontology Lookup Service at EMBL-EBI as the hierarchical scaffold for all analyses10. The ontology was represented as a DAG, where nodes correspond to cell types and directed edges correspond to is_a subtype relationships. We restricted the ontology to the 164 distinct cell types observed in the training set (‘Training and evaluation datasets’). In CELLxGENE, which is the atlas used in our study, cell types are annotated by the original data contributors and then harmonized by mapping each label to the closest cell ontology term as specified by the portal’s data schema. While the cell ontology offers a valuable scaffold for representing hierarchical relationships among cell types, it is important to note that its structure is continuously being revised, with certain definitions and mappings between cell types remaining under active refinement.

Because each cell type corresponds to a node in the DAG, we can further classify them on the basis of the type of node they represent. A node was defined as a leaf if it had no children in the pruned ontology and as an internal node if it had at least one child. We also distinguished between connected nodes, which had at least one parent or child present in the curated training set, and isolated nodes, which had none of their ancestors or descendants represented in the training data. These definitions were used to assess how the hierarchical loss propagates information across the ontology (Supplementary Fig. 6).
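These node categories can be sketched in a few lines of Python. The ontology fragment below is a toy example of our own, not the actual 164-type graph.

```python
# Classify ontology nodes as leaf/internal and connected/isolated, following
# the definitions above. `children` maps each cell type to its child types
# present in the pruned ontology (toy fragment for illustration).
children = {
    "lymphocyte": ["B cell", "T cell"],
    "B cell": [],
    "T cell": [],
    "platelet": [],  # no parent or child present in this toy training set
}
parents = {c: p for p, cs in children.items() for c in cs}

def node_kind(node):
    # internal = at least one child in the pruned ontology; otherwise leaf
    return "internal" if children.get(node) else "leaf"

def is_connected(node):
    # connected = at least one parent or child present in the training set
    return bool(children.get(node)) or node in parents
```

For example, ‘lymphocyte’ is classified as an internal connected node, ‘T cell’ as a connected leaf and ‘platelet’ as an isolated leaf.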

Evaluation protocol

Classification performance was evaluated using the macro F1 score, which computes the unweighted average of the F1 scores across all cell types. This metric ensures that each cell type contributes equally to the overall score, regardless of class imbalance or prevalence in the dataset. For C cell types, the macro F1 score is computed as

$$\text{macro}\,{F}_{1}\,\text{score}=\frac{1}{C}\sum _{i=1}^{C}\frac{2\times {\text{precision}}_{i}\times {\text{recall}}_{i}}{{\text{precision}}_{i}+{\text{recall}}_{i}}$$
(1)

where \({\text{precision}}_{i}\) and \({\text{recall}}_{i}\) are defined for the ith class as

$${\text{precision}}_{i}=\frac{{\text{TP}}_{i}}{{\text{TP}}_{i}+{\text{FP}}_{i}},\qquad {\text{recall}}_{i}=\frac{{\text{TP}}_{i}}{{\text{TP}}_{i}+{\text{FN}}_{i}}.$$
(2)

Here, the terms \({\text{TP}}_{i}\), \({\text{FP}}_{i}\) and \({\text{FN}}_{i}\) denote the number of true positives, false positives and false negatives for the ith cell type, respectively. We followed the evaluation framework introduced by Fischer et al.11 in the scTab study, particularly because of the way those authors handled differences in the granularity of annotations that can occur across different studies: namely, a predicted label is considered correct if it exactly matches the ground-truth label or if it corresponds to a descendant of the ground-truth label in the cell ontology (that is, the prediction is a more specific subtype). This accounts for the fact that some datasets provide coarse-grained annotations (for example, T cell) while others include more detailed subtypes (for example, ‘CD4-positive, α–β T cell’). In such cases, predicting a valid subtype is treated as correct, as it remains consistent with the original label. Any other prediction, including a coarser label (such as a parent node) or an unrelated class, is considered incorrect.
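This matching rule can be sketched as follows. The ontology below is a toy tree of our own; the real cell ontology is a DAG with possibly multiple parents per node, so a full implementation would follow every parent path rather than a single chain.

```python
# Hierarchy-aware correctness check in the scTab-style evaluation:
# a prediction counts as correct if it equals the ground-truth label or is
# a descendant (more specific subtype) of it. Edges point child -> parent
# (is_a); toy tree for illustration only.
parent = {
    "CD4-positive, alpha-beta T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
}

def is_correct(pred, truth):
    """Walk up from `pred` via is_a edges; correct if we reach `truth`."""
    node = pred
    while node is not None:
        if node == truth:
            return True
        node = parent.get(node)
    return False
```

Predicting a valid subtype of the annotated label (for example, ‘CD4-positive, alpha-beta T cell’ against ground truth ‘T cell’) is correct, whereas a coarser label such as ‘lymphocyte’ or an unrelated class such as ‘B cell’ is not.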

Model details

We evaluated three model architectures of increasing complexity: a linear classifier, an MLP and the TabNet transformer model. Each model takes as input the full set of 19,331 human protein-coding genes. To ensure a fair comparison across models and with previous work, we adopted the architecture configurations and hyperparameters used in the scTab benchmarking study from Fischer et al.11 (Supplementary Tables 1–3). The models using CE versus HCE share identical architecture and hyperparameter settings; the loss term is the only difference between them. Specifically, for the models with CE, we used the best hyperparameters available according to the original scTab study. For the models using the HCE loss, we did not perform additional hyperparameter tuning and instead kept the (possibly suboptimal) hyperparameters used for the models with CE. Note that, while recent efforts have explored large-scale foundation models to learn transferable embeddings for single-cell data, such approaches have not yet demonstrated clear advantages over simpler, task-specific approaches for cell-type annotation11,22. We therefore focused on methods where we could easily isolate and study the direct effects of implementing the HCE strategy.

HCE loss function

The HCE loss function extends the standard CE loss by explicitly encoding the structural relationships across the cell ontology. With the standard CE, the loss is computed directly from raw model predictions, treating all cell types as independent classes. Let \(p=({p}_{1},\ldots ,{p}_{C})\) denote the raw predicted probabilities for C different cell types. The standard CE loss is given by

$${{\mathcal{L}}}_{\text{CE}}=-\sum _{i=1}^{C}{\mathbb{1}}\{\text{label}=i\}\,\log {p}_{i}$$
(3)

where \({\mathbb{1}}\{\text{label}=i\}\) is an indicator function that is equal to 1 if the true class label is the ith cell type and 0 otherwise. The HCE adjusts these predictions to reflect hierarchical dependencies encoded in the ontology’s DAG. The adjusted score \({s}_{i}\) for the ith cell type is computed as the sum of the predicted probability for its label and the predicted probabilities of all its descendant subtypes

$${s}_{i}={p}_{i}+\sum _{j\in {\mathcal{D}}(i)}{p}_{j}$$
(4)

where \({\mathcal{D}}(i)\) denotes the set of all descendants of cell type i in the DAG. This adjustment ensures that the probability of a parent node reflects its entire subgraph. The hierarchical loss is then

$${{\mathcal{L}}}_{\text{HCE}}=-\sum _{i=1}^{C}{\mathbb{1}}\{\text{label}=i\}\,\log {s}_{i}\,.$$
(5)

This formulation directly parallels the evaluation framework, where predictions are considered correct if they match the ground-truth label or any of its descendants. By aligning the training objective with the assessment criterion, HCE encourages cell-type classification models to distribute probability mass in a way that respects biological hierarchy and annotation granularity.

Consider an ontology subgraph that is rooted at the node T cell, which includes subtype labels such as CD4+ T cell, CD8+ T cell and γ–δ T cell. The HCE enables classification models to predict fine-grained subtypes when available, while also deferring to parent categories when annotations are coarse or ambiguous. For example, if some studies annotate cells as T cell while others use more specific labels such as CD4+ T cell or CD8+ T cell, the adjusted score is computed as

$${s}_{\text{T cell}}={p}_{\text{T cell}}+{p}_{{\text{CD4}}^{+}}+{p}_{{\text{CD8}}^{+}}+{p}_{\gamma \text{–}\delta }+\ldots .$$
(6)

This hierarchical set-up allows the model to aggregate subtype information upward, improving consistency across annotations with varying granularity.
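A toy numeric illustration of this adjustment, with probabilities and descendant sets invented for the example:

```python
# Adjusted scores per equations (4)-(6): a parent's score accumulates the
# raw probabilities of all its descendants. Values below are made up.
probs = {
    "T cell": 0.10,
    "CD4+ T cell": 0.40,
    "CD8+ T cell": 0.30,
    "gamma-delta T cell": 0.05,
    "B cell": 0.15,
}
descendants = {
    "T cell": ["CD4+ T cell", "CD8+ T cell", "gamma-delta T cell"],
}

def adjusted_score(cell_type):
    # s_i = p_i + sum of p_j over descendants j of i
    return probs[cell_type] + sum(probs[d] for d in descendants.get(cell_type, []))
```

Here the adjusted score for ‘T cell’ is 0.10 + 0.40 + 0.30 + 0.05 = 0.85, so a model that is confident in any T-cell subtype automatically supports the coarser ‘T cell’ annotation, while leaf nodes keep their raw probabilities.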

Implementation details for the HCE loss

We implemented the HCE loss using a reachability matrix \(R\in {\{0,1\}}^{C\times C}\), where element \({R}_{ij}=1\) if the jth class is reachable from the ith class (meaning j is either i itself or j is a descendant of i in the hierarchy) and \({R}_{ij}=0\) otherwise. The reachability relation encoded in this matrix is a partial order and has the following mathematical properties:

  • Reflexive—every class is reachable from itself (diagonal elements are 1).

  • Antisymmetric—if class i can reach j and j can reach i, then i = j.

  • Transitive—if class i can reach j and j can reach k, then i can reach k.

Indeed, the reachability matrix represents the transitive closure of the inverted adjacency matrix of the hierarchical DAG structure. Since the original DAG encodes is_a relationships from child to parent, we invert the edge directions to enable parent-to-descendant reachability, and ensure reflexivity by setting the diagonal to 1. Each trained model outputs a raw probability distribution \(p=({p}_{1},\ldots ,{p}_{C})\) over the class labels. The adjusted scores are computed via matrix–vector multiplication: s = Rp, which efficiently aggregates descendant probabilities for each class. We then apply a log transformation with numerical stability, log(s + ϵ), where \(\epsilon ={10}^{-6}\). The final loss uses a weighted negative log-likelihood as implemented in PyTorch, with class weights computed following scikit-learn’s compute_class_weight approach: \({w}_{i}=N/(C{n}_{i})\), where N is the total number of samples, C is the number of classes and \({n}_{i}\) is the count of samples for class i. The complete loss for a single training sample x with true label t is

$${{\mathcal{L}}}_{{\rm{H}}{\rm{C}}{\rm{E}}}(x)=-{w}_{t}\,\log ({s}_{t}+\epsilon ).$$
(7)

This formulation maintains consistency with the models trained with the weighted CE, while incorporating hierarchical structure through efficient matrix operations.
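A minimal PyTorch sketch of this computation follows. The function and variable names are ours, and the three-class toy hierarchy (class 0 as parent of classes 1 and 2) is illustrative; this is not the exact training code.

```python
import torch

def hce_loss(logits, targets, reach, class_weights, eps=1e-6):
    """Sketch of the HCE loss: softmax probabilities are aggregated over
    descendants via the reachability matrix, then a weighted negative
    log-likelihood is taken (equation (7)).

    logits:        (batch, C) raw model outputs
    targets:       (batch,) integer class labels
    reach:         (C, C) reachability matrix, reach[i, j] = 1 if j is i
                   or a descendant of i (transitive closure, diagonal = 1)
    class_weights: (C,) weights w_i = N / (C * n_i)
    """
    p = torch.softmax(logits, dim=-1)   # raw class probabilities
    s = p @ reach.T                     # s = Rp, batched: aggregates descendants
    log_s = torch.log(s + eps)          # numerical stability
    w = class_weights[targets]
    return -(w * log_s.gather(1, targets.unsqueeze(1)).squeeze(1)).mean()

# toy hierarchy: class 0 is the parent of classes 1 and 2
reach = torch.tensor([[1., 1., 1.],
                      [0., 1., 0.],
                      [0., 0., 1.]])
logits = torch.tensor([[0.0, 3.0, 0.0]])  # model is confident in child class 1
targets = torch.tensor([0])               # annotation is the coarse parent
weights = torch.ones(3)
loss = hce_loss(logits, targets, reach, weights)
```

Because the child's probability mass counts toward the parent via s = Rp, predicting a confident subtype for a coarsely annotated cell incurs almost no loss, whereas standard CE on the same prediction would penalize it heavily.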

Statistical evaluation of performance differences across loss functions

To assess changes in predictive performance induced by the ontology-aware training strategy, we computed per-cell-type differences in macro F1 score between models trained with standard CE and HCE across four independent training runs. For each cell type, a paired t-test was performed and P values were adjusted using the Holm–Bonferroni method to correct for multiple hypothesis testing. Statistically significant differences indicate cell types for which ontology-aware training produces consistent changes beyond random variability.
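This procedure can be sketched as follows. SciPy's ttest_rel provides the paired t-test, the Holm–Bonferroni step-down adjustment is implemented by hand, and the per-run F1 values are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

def holm_bonferroni(pvals):
    """Holm-Bonferroni step-down adjustment of raw P values (sketch)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # k-th smallest P value is scaled by (m - k + 1); enforce monotonicity
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# toy per-cell-type F1 scores across four independent runs, CE vs HCE
f1_ce  = np.array([0.60, 0.62, 0.61, 0.59])
f1_hce = np.array([0.72, 0.75, 0.70, 0.74])
t_stat, p_value = ttest_rel(f1_hce, f1_ce)  # paired t-test across runs
adjusted = holm_bonferroni([p_value, 0.04, 0.50])
```

In practice the adjustment is applied jointly across all cell types tested, so each raw P value is corrected for the full family of comparisons.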

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.