Main

T cells play critical roles in cancer, infection and autoimmune disease, driving widespread interest in characterizing their states using single-cell RNA sequencing (scRNA-seq)1,2,3. Clustering, the predominant analysis approach, has key limitations for interpreting T cell profiles. Transcriptomes reflect expression of multiple GEPs—co-regulated gene modules reflecting distinct biologic functions such as defining cell type, activation states, life cycle processes or external stimuli responses4. T cell GEPs vary continuously5, combine additively within individual cells6 and exhibit stimulus-dependent plasticity7. However, clustering discretizes cells, obscuring the coexpressed GEPs. For instance, proliferating T cells from multiple subsets may cluster together, masking their subsets. This may explain why clustering typically fails to delineate many canonical T cell subsets1,8, even with integration of surface protein markers via CITE-seq9,10.

Component-based models like nonnegative matrix factorization (NMF), hierarchical Poisson factorization and SPECTRA overcome some limitations of clustering8,11,12,13,14. These methods model GEPs as gene expression vectors and transcriptomes as weighted mixtures of GEPs. Unlike principal component analysis (PCA), NMF components correspond to biologically interpretable GEPs reflecting cell types and functional states that additively contribute to a transcriptome12. Component-based approaches yield GEP vectors that serve as a fixed coordinate system for comparing GEP activities across datasets. This is similar to scoring gene-set activities15 but with variable gene weights and simultaneous modeling of multiple GEPs. This prevents confounding of related signals and enables comparison of relative GEP activities. Previous analyses of T cells using component-based models have already recognized GEPs associated with T cell activation8 and exhaustion13 but were limited in dataset size and only addressed a small number of biological contexts. Furthermore, it is not well established how well such GEPs generalize across datasets.

Here, we present star-CellAnnoTator (starCAT), a framework to score cells based on a fixed, multi-dataset catalog of GEPs. ‘star’ is a wildcard placeholder based on the asterisk (*) used in programming, indicating applicability across tissues and cell types. Our specific instantiation for T cells is thus written T-CellAnnoTator (TCAT). We derive a comprehensive T cell GEP catalog by applying consensus nonnegative matrix factorization (cNMF)12 to seven scRNA-seq datasets comprising 1.7 million T cells from 38 human tissues1,2,10,16,17,18,19. Combining GEPs across datasets yields 46 consensus gene expression programs (cGEPs) capturing T cell subsets, activation states and functions (Fig. 1a). We demonstrate TCAT’s utility for inferring subset and antigen-specific activation (ASA) states and identifying cGEPs predictive of immunotherapy response across multiple tumor types.

Fig. 1: Overview of starCAT.
figure 1

a, starCAT first identifies GEPs in multiple datasets and aggregates them into cGEPs. It then uses the cGEPs to annotate new query datasets and compute additional scores and classifiers. b, Pairwise correlations of GEPs discovered across reference datasets with insets for cGEPs derived from all seven references. Inset row and column orders are the same for all cGEPs. c, Heat map of cGEPs (rows) and which datasets the comprising GEPs were found in (columns). Green boxes indicate a GEP was found in a dataset. Colored bar indicates the cGEP’s assigned class. cGEPs corresponding to non-T cell lineages are excluded. d, Marker genes for selected example cGEPs in z-score units with the minimum value fixed at 0. The AB_ prefix indicates a surface protein.

Results

Annotating cells with predefined GEPs

We first augmented the published cNMF algorithm to enhance GEP discovery (Fig. 1a). cNMF mitigates randomness in NMF by repeating NMF and combining outputs into robust estimates, generating GEP spectra (gene weights) and per-cell activities (‘usages’) reflecting the relative contributions of GEPs to each cell. To improve cross-dataset GEP reproducibility, we corrected batch effects which can cause cNMF to learn redundant dataset-specific GEPs. Standard batch-correction methods are incompatible with cNMF as they introduce negative values or modify low-dimensional embeddings rather than gene-level data. Therefore, we adapted Harmony20 to provide batch-corrected nonnegative gene-level data. Additionally, we modified cNMF to incorporate surface protein measurements in GEP spectra for CITE-seq datasets to enhance GEP interpretability (Methods).

Next, we developed starCAT to infer the usages of GEPs learned in a reference dataset in new ‘query’ datasets. Unlike cNMF, which learns GEP spectra and usages simultaneously, starCAT quantifies the activity of predefined GEPs within each cell, using nonnegative least squares, similarly to NMFproject11. starCAT then leverages the GEP usages to predict additional cell features, including lineage, T cell antigen receptor (TCR) activation and cell cycle phase (Fig. 1a). This can provide several advantages over running cNMF or similar approaches de novo: it ensures a consistent cell state representation for comparison across datasets, can quantify rarely used GEPs that may be hard to identify de novo in small query datasets and markedly reduces run time.

We benchmarked starCAT’s performance through simulations where the reference and query datasets contained only partially overlapping GEPs (Methods). Simulations included two 100,000-cell references and a 20,000-cell query, where each cell expressed one subset-defining and one or more non-subset GEPs. Cells in the reference datasets included additional GEPs or lacked certain GEPs relative to the query datasets and only shared 90% of genes in common (Extended Data Fig. 1a). We then learned GEPs from each reference with cNMF and predicted their usage in the query using starCAT.

starCAT accurately inferred the usage of GEPs overlapping between the reference and query (Pearson R > 0.7) and predicted low usage of extra GEPs in the reference that were not in the query (Extended Data Fig. 1b–d). We observed similar prediction accuracies when predicting a simulated query dataset with half or fewer overlapping GEPs between the reference and query (Supplementary Fig. 1). starCAT outperformed direct application of cNMF to the query for overlapping GEPs, despite having extra or missing GEPs in the references. We suspected this was due to the larger size of the references and confirmed that starCAT maintained its performance across smaller query datasets while cNMF’s performance declined (Extended Data Fig. 1e).

cGEPs for T cell annotation

We next developed a catalog of T cell GEPs to use for TCAT. We analyzed T cells from seven datasets spanning blood and tissues from healthy individuals and those with coronavirus disease 2019 (COVID-19), cancer, rheumatoid arthritis or osteoarthritis (Supplementary Table 1 and Extended Data Fig. 1f). We chose datasets to reflect phenotypic breadth, large sample sizes (>70,000 T cells) and, where possible, inclusion of CITE-seq data to aid GEP interpretation. We included two COVID-19 peripheral blood mononuclear cell (PBMC) datasets, two healthy PBMC datasets and two tissue datasets to assess cross-dataset GEP reproducibility. After quality control, 1.7 million cells remained from 905 samples from 695 individuals. We applied cNMF to each batch-corrected dataset independently (Supplementary Fig. 2 and Methods).

GEPs were reproducible across datasets. To quantify this, we clustered GEPs found in different datasets and defined a cGEP as the average of each cluster (Methods). Nine cGEPs were supported by all seven datasets (mean Pearson R = 0.81, P < 1 × 10−50 all pairs) and 49 by two or more datasets (mean R = 0.74, P < 1 × 10−50 all pairs; Fig. 1b and Extended Data Fig. 1g). Across datasets, 68.4–96.8% of GEPs clustered with at least one GEP from another dataset, indicating high reproducibility. Gene expression principal components showed substantially less concordance across datasets (Extended Data Fig. 1h).

We curated a catalog of 46 T cell cGEPs—27–36 more than prior analyses11,13,14—including 11 discovered only in blood, 7 only in tissue, and 28 in both (Supplementary Table 1 and Fig. 1c). We excluded 49 of 52 singleton GEPs as likely dataset-specific artifacts but retained three reflecting biologically justified signals. Specifically, the rheumatoid arthritis dataset contributed a T peripheral helper (TPH) GEP, previously identified in inflamed synovium21 (markers included PD-1 protein, LAG3 and CXCL13 RNA; Supplementary Table 2), while the pan-cancer dataset contributed an exhaustion GEP (HAVCR2, ENTPD1, LAG3) and a T follicular helper (TFH) GEP (PD-1 protein, CXCR5 and CXCL13 RNA) distinct from a second TFH GEP discovered in the non-cancer tissue datasets. We also identified six cGEPs corresponding to non-T cell populations, including erythrocytes (HBA2, HBA1) and plasmablasts (JCHAIN, IGKC), likely reflecting residual doublets. We retained these in the catalog to help flag doublets.

We annotated cGEPs by examining their top weighted genes (Fig. 1d, Extended Data Fig. 2a and Supplementary Fig. 3 and Supplementary Table 2). For example, FOXP3 and GATA3 marked the regulatory T (Treg) and type 2 helper T (TH2)-resting cGEPs. GATA3, IL4, IL5 and IL17A, RORC and IL26 marked the TH2-activated and interleukin-17-producing helper T (TH17)-activated cGEPs, respectively. Some cGEPs were also annotated based on gene-set enrichment analysis (Supplementary Note and Supplementary Table 3).

We further labeled cGEPs through association with surface marker-based gating of canonical T cell subsets in a COVID-19 PBMC CITE-seq reference (COMBAT)17 (Extended Data Fig. 2b,c and Methods). Multivariate logistic regression revealed strong associations between specific cGEPs and the Treg, γδT, mucosal-associated invariant T (MAIT) cell, CD4/CD8 naive, CD8 effector memory (CD8 EM), CD4 central memory (CD4 CM) and terminally differentiated effector memory (TEMRA) subsets (P value < 1 × 10−200, coefficient > 0.35). The CD4 EM subset was associated with TH17-resting and TH1-like cGEPs as expected (P < 4.1 × 10−189, coefficient > 0.22). In total, we identified 17 subset-associated cGEPs.

We also identified likely technical artifact cGEPs (Supplementary Table 4). A mitochondria cGEP marked by mitochondrially transcribed genes correlated with per-cell mitochondrial transcript fraction (average R = 0.81 across datasets), a common quality-control metric in scRNA-seq22,23. Another cGEP, labeled ‘poor-quality’, was marked by MALAT1, a long noncoding RNA linked to poor cell viability24. Its usage correlated with mitochondrial transcript fraction (mean R = 0.25 across datasets), inversely with the fraction of protein-coding transcripts per cell (mean R = −0.50) and positively with the percentage of intergenic reads per cell (mean R = 0.74; Extended Data Fig. 2d–f). Thus, it may be driven by contaminating DNA or nascent RNA. We also flagged immediate early gene cGEPs as potentially technical in nature (Supplementary Note and Supplementary Fig. 4).

Benchmarking TCAT on an independent query dataset

Next, we benchmarked TCAT for predicting discrete T cell subsets in a query CITE-seq dataset (labeled ‘flu-vaccine’), containing 336,739 T cells from PBMCs of 24 COVID-19-recovered and 17 healthy individuals following influenza vaccine25. We defined ten conventional T cell subsets via manual surface protein gating to serve as prediction targets (Extended Data Fig. 3a). While subsets largely separated on a gene expression uniform manifold approximation and projection (UMAP), memory populations overlapped substantially, possibly due to shared functional GEPs (Fig. 2a). We hypothesized that predictions based on TCAT, which disentangles subset and functional cGEPs, would outperform methods unable to distinguish these signals.

Fig. 2: Benchmarking TCAT on a query dataset.
figure 2

a, UMAP of the flu-vaccine dataset colored by manual gating (Extended Data Fig. 3a) and TCAT multinomial label prediction. b, Cross-method comparison of balanced accuracy for manually gated subset prediction. c, Same UMAP as a but demonstrating prediction of manually gated Treg and CD8 EM populations with the most associated individual cGEP (usage > 0.025), the multi-label classifier based on multiple cGEPs, Leiden clustering with a resolution of 1.0 and three reference mapping algorithms. d, Heat map of pseudobulk expression in cGEP-high (usage > 0.1) and cGEP-low (usage < 0.1) cells, per sample. Pseudobulk profiles were normalized by library size and rows were z-scored.

Indeed, TCAT enabled more accurate subset prediction than RNA-based clustering or the discrete reference mapping tools Azimuth10, Symphony26 and ProjecTILs3 (Methods). Subset assignment by thresholding the most associated cGEP performed comparably to reference mapping and clustering across nine resolutions for predicting one lineage at a time (Supplementary Fig. 5a). For simultaneous multi-label prediction, we trained a multinomial logistic classifier on the COMBAT reference and measured performance with balanced accuracy (which weights classes of different sizes equally) in the flu-vaccine query (Methods and Fig. 2a). This greatly outperformed all tested reference mapping methods and clustering (balanced accuracy—TCAT, 0.72; Clustering, 0.61; Symphony, 0.58; Azimuth, 0.52; ProjecTILs, 0.13; Fig. 2b,c and Extended Data Fig. 3b).

We compared the performance of this multi-label classifier when trained using TCAT’s cGEP catalog versus previously published GEP catalogs from NMF analyses of T cells in autoimmune diseases11 and tumors14 (Methods). TCAT’s catalog yielded better prediction accuracy for all lineages (Extended Data Fig. 3c and Supplementary Fig. 5b). These analyses show that TCAT can predict peripheral T cell subsets without manual annotation and with accuracy surpassing clustering and leading reference mappers.

We also found that usage of the CellCycle-S, CellCycle-G2M and mitochondrial cGEPs correlated well with common, gene-set-based estimates of these programs, including published proliferation gene sets27 (R > 0.75; Extended Data Fig. 3d and Methods).

Next, we validated prediction of T cell polarization against canonical marker expression. We discretized cells based on the TH1-like, TH2-resting and TH17-resting cGEPs and computed per-sample pseudobulk profiles of high (usage > 0.1) and low (usage < 0.1) cells. TH2-resting-high cells had significantly higher expression of TH2 markers (GATA3, CCR4, PTGDR2) and analogously for TH17-resting-high cells and TH17 markers (CCR6, RORC, AQP3; P < 1 × 10−35 all, paired t-test; Fig. 2d). TH1-like-high cells had increased expression of TH1 markers (CXCR3, IFNG-AS1, CD195 protein; P < 1 × 10−35 all), although IFNG and TBX21 were also expressed in TH1-like-low cells (Extended Data Fig. 3e), potentially due to their expression in cytotoxic T cells28,29. Excluding cytotoxic-high cells illustrated significantly higher IFNG and TBX21 in TH1-like-high cells (P = 8.2 × 10−13, P = 9.6 × 10−47).

cGEPs capture multi-GEP T cell identities

Next, we illustrate how TCAT reveals cellular heterogeneity obscured by clustering in the COMBAT COVID-19 dataset. First, we examined cell cycle effects since they often mask subsets30. Regressing out cell cycle programs31 does not always work well because it may remove correlated signals like activation.

While the published analysis of CD4 memory cells identified multiple proliferating subclusters, these did not correspond directly to subsets, except for one—CD4.TEFF.prolif.MKI67lo—that was enriched for the myeloid doublet cGEP (Fig. 3a,b) and reflects a likely myeloid doublet population (Supplementary Fig. 6a). By contrast, TCAT readily identified distinct proliferating subsets based on coexpression of cell cycle and subset cGEPs (Fig. 3c,d).

Fig. 3: Comparing TCAT to clustering in the COMBAT dataset.
figure 3

a, UMAP of T cells showing published CD4 memory subclusters with other clusters shown in gray. ‘CD4’ is omitted from cluster names for space. b, Average usage of selected cGEPs in CD4 memory subclusters. Subcluster labels for a align with the corresponding row labels. The three cell cycle cGEPs are summed and labeled CellCycle. A red arrow is shown next to the myeloid doublet subcluster and a red line is shown next to subclusters annotated as proliferating. c, Same UMAP as in a colored by usage of selected subset, functional and artifact cGEPs. Intensities were averaged over 20 nearest neighbors to reduce overplotting. d, Usage of selected cGEPs in cells with high or low cell cycle cGEP usage. Cells were grouped by their most highly used subset cGEPs. e, Percentage of cells within each manual gate assigned to each polarization based on usage > 0.1. Points show the per-sample proportions (n = 137 samples). Bar represents the average across samples.

This enabled us to quantify subset proliferation rates. Most subsets had increased cell cycle usage in COVID-19 compared to healthy cells (Extended Data Fig. 4a). Proliferation rates were correlated between the COVID-19 datasets (R = 0.80, P = 0.00021 in COVID-19, R = 0.56, P = 0.025 in healthy). The most proliferative population expressed the TPH cGEP and likely corresponds to TPH cells recently identified in COVID-19 (ref. 32).

Analogous to the cell cycle, we found that poor-quality, cytotoxic and interferon-stimulated gene (ISG) cGEPs could also dominate clusters, obscuring subsets (Fig. 3b–d and Supplementary Fig. 6b). For example, ISGs drove the CD4.TEM.IFN.resp and CD4.Th.IFN.resp clusters, which contained cells using multiple subset cGEPs (Extended Data Fig. 4b). For example, CD4.Th.IFN.resp contained many cells that expressed the CD4-naive cGEP and expected naive subset markers, suggesting they were misclustered with memory cells due to the ISG signal (Supplementary Note and Supplementary Fig. 6c–e).

Clusters with high cytotoxic cGEP usage contained cells with high usage of many subset cGEPs including CD8 EM, TEMRA and γδT (Extended Data Fig. 4c). Cells coexpressing cytotoxic and subset cGEPs coexpressed the expected cytotoxicity and subset marker genes (Extended Data Fig. 4d). This illustrates how TCAT can reveal cytotoxic T cell heterogeneity.

TCAT revealed polarization via the TH1-like, TH2-resting and TH17-resting cGEPs (Fig. 3c) while the published clustering lacked a TH2 cluster, and only annotated TH1/TH17 subsets with a high resolution yielding 243 total subclusters. We observed the expected enrichment between TH1 and TH17 annotated subclusters and cells expressing the TH1-like and TH17-resting cGEPs, respectively (Fisher’s exact test P < 1 × 10−100).

However, TCAT also identified polarization outside the canonical CD4 memory subsets (Fig. 3e). Across manually gated populations (Extended Data Fig. 2b), TH2-resting was most enriched in CD8 CM (15.7%) and CD4 CM (12.8%) subsets, while TH1-like was enriched in CD8 CM (15.7%), CD4 EM (14.7%), CD8 EM (14.4%) and MAIT populations (12.3%). By contrast, the Treg cGEP was most enriched in the expected Treg subset (88.1%) and TH17-resting in the expected CD4 EM (22.1%) and CD4 CM (10.7%) populations. Subset polarization proportions across subsets correlated strongly between the COMBAT and flu-vaccine datasets (R > 0.9, P < 5.5 × 10−5 all; Extended Data Fig. 4e). Furthermore, cells expressed the expected surface markers for their polarization, irrespective of CD4/CD8 lineage (Extended Data Fig. 4f), illustrating how TCAT can reveal polarized CD8+ T cell populations33.

cGEPs associated with TCR-dependent activation

Next, we identified cGEPs induced by antigen-specific TCR activation using an activation-induced marker (AIM) assay followed by scRNA-seq (AIM-seq; Fig. 4a–d). We stimulated PBMCs from five healthy donors for 24 h using either a pool of 176 peptide antigens from common pathogens (CEFX, JPT)34 plus anti-CD28/CD49d co-stimulation, or co-stimulation only (mock). Then, we sorted activated and non-activated T cells from the peptide stimulation using activation markers (OX40 and PD-L1 for the CD4 population35 and CD137 for the CD8 population36). We labeled the resulting populations with hashtag antibodies and pooled them for CITE-seq and TCR-seq, totaling 42,370 cells (12,743 AIM-positive, 15,369 AIM-negative, 14,258 mock; Methods and Supplementary Fig. 7a). As expected, peptide stimulation substantially increased the percentage of AIM-positive cells (Fig. 4b and Extended Data Fig. 5a,b).

Fig. 4: Identifying cGEPs associated with TCR-dependent activation.
figure 4

a, Schematic of AIM-seq and UMAP of resulting data colored by donor. b, Flow cytometry data from an AIM-seq run showing surface activation markers in CD3+CD4+ and CD3+CD4 gated populations with the gates used for AIM-positive (+), AIM-negative (−) and mock (M) populations. c,d, UMAP of AIM-seq dataset colored by sorting condition (c) or manually gated subset based on CITE-seq (d). e, cGEP association with AIM positivity. x axis shows the regression coefficient. y axis shows the −log10 false discovery rate (FDR)-corrected two-tailed P value (that is, Q value). cGEPs are labeled by assigned category. f, Average usage of selected AIM-associated cGEPs in +, − and M cells from different gated subsets, per sample. Lines show the median. P values are from a per-sample two-tailed rank-sum test comparing ‘+’ with ‘−’ and ‘M’ samples, for each lineage. *P < 0.05 and average usage in ‘+’ cells > 0.01.

Source data

The data confirmed expected features of AIM-positive cells. First, they had increased expression of additional activation markers (CD54, CD25, CD71, CD69, P < 1 × 10−200; Extended Data Fig. 5c–e). They were also depleted of naive T cells (CD4: P = 0.027, CD8: P = 8.6 × 10−4) and enriched for Treg cells and CD4 CMs and EMs (P = 0.00064, 0.0044 and 0.054; Extended Data Fig. 5f), consistent with expected preexisting memory to pathogens in the pool. However, we could still detect some naive T cell activation (11.8% and 1.4% of AIM-positive cells were CD4 and CD8 naive, respectively). Clonal expansion (defined as 2+ cells sharing a CDR3 beta sequence) was significantly more common in AIM-positive than AIM-negative CD4 memory cells (P = 2.1 × 10−7 CD4, P = 0.14 CD8; Supplementary Fig. 7b) and clones were significantly more likely to share AIM status than expected by chance (77% agreement versus 51% expected; binomial P < 1 × 10−200).

Next, we identified cGEPs that increased following ASA using sample-level pseudobulk association tests (Fig. 4e, Extended Data Fig. 5g and Methods). Twenty-four cGEPs increased in AIM-positive relative to AIM-negative cells (FDR-adjusted Q < 0.05). Of these, we labeled the ISG and metallothionein cGEPs as milieu regulated as they increased in both AIM-negative and AIM-positive cells relative to mock (P < 1.5 × 10−3 all). We suspect these cGEPs are upregulated by interferon and extracellular ion concentration sensing37, independent of TCR engagement.

Five subset cGEPs were higher in AIM-positive cells (TH17-resting, Treg, TPH, TH22, TFH-2) while three were higher in AIM-negative cells (CD8-naive, CD4-naive and TH1-like), likely reflecting baseline differences in the number of peptide-reactive cells in these populations rather than cGEP upregulation (Supplementary Table 5 and Extended Data Fig. 5f).

The 17 remaining AIM-associated cGEPs included several with clear links to TCR stimulation including cell cycle38, actin cytoskeleton39, heatshock40,41 and major histocompatibility complex class II42. Additionally, 11 functional AIM-associated cGEPs may be specific to T cell activation, including CTLA4/CD38, ICOS/CD38, NME1/FABP5, OX40/EBI3, multi-cytokine, exhaustion, TIMD4/TIM3, TH2-activated, TH17-activated and BCL2/FAM13A (see Supplementary Note and Supplementary Fig. 7c for more details). Several were most upregulated in specific subsets, such as multi-cytokine in CD8 memory; TIMD4/TIM3 in CD8 CM and γδT; CTLA4/CD38 in Treg, CD4 memory and CD8 CM (Fig. 4f); and OX40/EBI3 in tumor-infiltrating T cells (Supplementary Fig. 7d).

Because proliferation is a core response to TCR activation, we tested if AIM-associated cGEPs were enriched in proliferating cells in vivo. Results were concordant across datasets, with 15 cGEPs increased in cell cycle-high cells (aggregate usage > 0.1) in at least four of six datasets (Extended Data Fig. 5h, Supplementary Table 6 and Supplementary Fig. 7e). Of these, 14 were AIM-associated (Fisher exact test P = 2.1 × 10−5), further supporting a role for these cGEPs in TCR activation in vivo.

Annotating antigen-dependent activation in disease

Next, we developed a per-cell antigen-specific activation (ASA) score to identify TCR-activated T cells in disease. Using forward stepwise selection, we identified four AIM-associated cGEPs (TIMD4/TIM3, ICOS/CD38, CTLA4/CD38 and OX40/EBI3) that together predict CD71/CD95 coexpression in the COMBAT and flu-vaccine datasets (Methods, Extended Data Fig. 6a,b and Supplementary Note). We selected CD71 and CD95 as activation markers because they are known to be upregulated within 24 h of TCR activation43,44,45,46, were upregulated in the AIM assay (Extended Data Fig. 5c–e) and had high quality across subsets in both datasets.

ASA effectively predicted CD71/CD95 coexpression in vivo (COMBAT: area under the curve (AUC) = 0.920, flu-vaccine: AUC = 0.818) and AIM positivity in the AIM-seq data (AUC = 0.828; Extended Data Fig. 6c–e). It also correlated with expression of other surface activation markers (CD69: R = 0.43, CD25: R = 0.52, P < 1 × 10−100; Supplementary Fig. 8a). We chose a discrete ASA threshold of 0.0625 by balancing sensitivity and specificity (Extended Data Fig. 6c–e). This threshold resulted in a positive call for 76.7% of CD71+CD95+ and 5.2% of non-CD71+CD95+ T cells in the COMBAT dataset, and 60.6%, 7.0% and 3.2% of AIM-positive, AIM-negative and mock-stimulated cells in AIM-seq (Fig. 5a,b).

Fig. 5: Annotating ASA in vivo.
figure 5

a, Box plot of ASA score for cells stratified as activated (that is, CD71+CD95+, N = 24,341 cells) or not activated (N = 375,258 cells). Boxes represent the interquartile range and whiskers represent 1.5 times the interquartile range. The box center line indicates the median. b, Same as a but for AIM-seq data with cells stratified by sort condition (+: N = 13,235 cells; −: N = 15,528 cells; M: N = 14,459 cells). c,d, UMAP of the COMBAT dataset colored by published clustering (c) and ASA score (d). e, Percentage of activated (ASA > 0.065) or proliferating (sum of cell cycle cGEPs > 0.1) cells per sample across datasets. Boxes represent the interquartile range and whiskers represent the 95% quantile range. (AMP-RA, N = 162 samples; COMBAT, N = 244; HIV-vaccine, 16; pan-cancer, 272; pan-tissue, 24; Sparks, 82; AIM-seq, 10; TBRU, 518; UK-COVID, 242). f, Clonality in manually gated conventional CD4+ and CD8+ T cells annotated as activated (ASA > 0.065) or not activated (ASA < 0.065). Clonality is defined as the number of cells in the same sample with an identical alpha and beta CDR3 amino acid sequence. g, Percentage of activated (ASA > 0.065) CD4+ and CD8+ conventional T cells in COVID-19 and healthy control samples, by cohort. Boxes represent the interquartile range and whiskers represent 1.5 times the interquartile range. The box center line indicates the median. (COMBAT, N = 77 COVID-19, 10 healthy; UK-COVID, N = 80 COVID-19, 21 healthy). h, log2 OR for 2 × 2 association of ASA positivity and manual gating subset assignment. An asterisk indicates Bonferroni-adjusted two-tailed Fisher’s exact test P value < 0.05. i, Percentage of activated (ASA > 0.065), exhausted (exhaustion cGEP usage > 0.065) or bystander (ASA + exhaustion usage < 0.065) T cells in CD4+ and CD8+ conventional T cells, per sample stratified by tumor type and corresponding healthy tissues. Boxes represent the interquartile range and whiskers represent 1.5 times the interquartile range. The box center line indicates the median. (BC: n = 2 tumor, n = 2 normal; ESCA: n = 7 tumor, n = 7 normal; HCC: n = 5 tumor, n = 5 normal; PACA: n = 26 tumor, n = 1 normal; RC: n = 10 tumor, n = 11 normal; THCA: n = 10 tumor, n = 8 normal; UCEC: n = 9 tumor, n = 8 normal). BC, breast cancer; ESCA, esophageal cancer; HCC, hepatocellular carcinoma; PACA, pancreatic cancer; RC, renal cell carcinoma; THCA, thyroid carcinoma; UCEC, endometrial cancer. j, Per-individual comparison of percentage of exhausted CD8+ T cells and average number of mutations per mutational burden (MB). k, log2 OR for enrichment of bystander T cells by subset cGEP assignment. Bar value reflects the estimated OR, while error bars represent the analytical 95% confidence intervals around the estimate.

Source data

We benchmarked ASA against literature-derived T cell activation gene sets for predicting surface activation profiles. ASA outperformed 9/9 and 7/9 tested gene sets in the flu-vaccine and AIM-seq datasets, respectively, demonstrating its utility relative to a widely used approach (Extended Data Fig. 6f,g and Supplementary Fig. 8b,c).

ASA-high cells had several expected features of antigen activated cells in vivo. They were enriched in proliferating clusters (Fisher’s exact odds ratio (OR) 2.8–58.8 across datasets, P < 1 × 10−100 all) and ASA correlated with cell cycle usage (mean R 0.15; Fig. 5c,d and Extended Data Fig. 6h). However, ASA identified significantly more activated cells than the cell cycle alone, indicating greater sensitivity for classifying activation (P = 8.8 × 10−189, two-tailed paired t-test; Fig. 5e).

ASA-high cells were more likely to be clonally expanded in both COVID-19 datasets (COMBAT OR = 2.50, UK-COVID OR = 2.28, P < 1 × 10−100 for both). Furthermore, ASA and cell cycle were independently associated with clonal expansion in a multivariate logistic regression (beta values: ASA, 0.45 and 0.50; cell cycle, 0.66 and 0.52, in COMBAT and UK-COVID respectively; P < 1 × 10−22; Methods). The TCR clone size distribution was also shifted upward in ASA-high relative to ASA-low cells (P < 1 × 10−100, both datasets; Fig. 5f and Extended Data Fig. 6i).

There were significantly more ASA-high cells in COVID-19 than healthy samples, consistent with viral activation (COMBAT: P = 1.9 × 10−7, UK-COVID: P = 1.5 × 10−6; Fig. 5g). ASA rates were comparable in CD4 and CD8 conventional subsets but higher in Treg cells for both healthy and COVID-19 samples (Fig. 5h and Supplementary Fig. 8d–f). In COVID-19, ASA-high cells were enriched in CD8 CM, CD8 EM and DN subsets (OR = 4.8, 2.8 and 3.1 respectively, all P < 1 × 10−10), although not in healthy samples, likely reflecting the antiviral response. ASA rates were higher in UK-COVID than COMBAT. This correlated with differences in sample quality reflected in poor-quality cGEP usage and library size and may reflect nonspecific activation related to sample processing (Supplementary Fig. 9a).

To further illustrate analyses enabled by TCAT, we characterized variation in T cell exhaustion and activation in breast cancer (BC), esophageal cancer (ESCA), hepatocellular carcinoma (HCC), pancreatic cancer (PACA), renal cell carcinoma (RC), thyroid carcinoma (THCA) and endometrial cancer (UCEC) (Fig. 5i). ASA positivity in TCAT-annotated CD4 cells ranged from 5.4% (breast) to 48.0% (esophageal). This correlated with analogous rates in CD8 cells for ASA (R = 0.70, P = 2.6 × 10−9) and exhaustion (R = 0.38, P = 4.0 × 10−3; Extended Data Fig. 6j) across tumor types. Treg cells had significantly higher ASA positivity in thyroid (P = 3.0 × 10−6) and esophageal (P = 0.0045) cancer relative to matched normal tissues (Extended Data Fig. 6k). As expected, tumor mutation burden was significantly correlated with the percentage of exhausted CD8+ T cells per tumor (Spearman ρ = 0.59, P = 6.9 × 10−8; Fig. 5j and Supplementary Fig. 9b).

Many tumors included T cells with low ASA and exhaustion usage (‘bystanders’). CD8 bystanders varied widely from 35.5% (endometrial) to 90.1% (breast) of total CD8 cells. Bystanders were enriched within populations marked by usage of the CD4-naive (OR = 15.9), TH2-resting (OR = 10.6), TH1-like (OR = 7.3), MAIT (OR = 4.42) and CD8-naive (OR = 4.03) cGEPs and were most depleted from TPH (OR = 0.19), Treg (OR = 0.23) and CD8 TRM (OR = 0.61) populations (P < 1 × 10−21 all; Fig. 5k). These analyses illustrate how TCAT and ASA scoring can enable disease exploration.

Identifying disease-associated cGEPs

Next, we associated cGEPs with cancer, COVID-19 and rheumatoid arthritis phenotypes (Supplementary Table 7, Extended Data Fig. 7a–f and Supplementary Note). Using pseudobulk sample-level regression in the pan-cancer dataset (89 tumor, 47 matched normal samples, 13 cancer types), we identified Treg47, exhaustion48 and ISG49 as strongly tumor-associated, consistent with their known role in cancer (FDR-corrected Q = 7.4 × 10−12, 8.5 × 10−6 and 9.3 × 10−6, respectively). Of 21 tumor-enriched cGEPs, 17 were AIM-associated (Fisher exact test P = 7.4 × 10−6).

We separately analyzed individual tumor types with ≥2 tumor and normal samples each (Methods). Results were highly concordant across cancers (sign test P < 0.05 for 14/15 tumor-type pairs; Extended Data Fig. 7b). Treg, exhaustion and CTLA4/CD38 cGEPs were upregulated in all six tumor types tested (P < 0.05). However, some signals were more specific including TH17-activated (thyroid: P = 5.3 × 10−6, hepatocellular: P = 0.013) and TH2-activated cGEPs (esophageal, uterine, thyroid and hepatocellular: P < 0.05 all).

Of note, the TFH-2 and TPH cGEPs were both upregulated in cancer (Q = 3.6 × 10−4, Q = 3.3 × 10−10). TFH and TPH cells recruit B cells via CXCL13 aiding in antibody production. TFH cells are found primarily in lymphoid organs and TPH cells are predominantly in inflamed tissues50, including likely within tumors51. TPH cell cGEP usage was associated with CXCL13 expression and plasma cells abundance across tumors, indicating a role in tumor-associated lymphoid aggregates (Supplementary Fig. 9c–e and Supplementary Note).

cGEP associations with COVID-19 status revealed consistent associations between the two reference datasets (R = 0.64, P = 2.8 × 10−7) highlighting TPH cells and ASA cGEP involvement (Q < 0.05; Extended Data Fig. 7c–e and Supplementary Note). Rheumatoid arthritis similarly showed increased usage of metallothionein, HLA, ICOS/CD38, TPH cells and other activation-associated cGEPs (Q < 0.05; Extended Data Fig. 7f and Supplementary Note).

Characterizing ICI response

We next demonstrate TCAT’s utility by identifying cGEPs that predict tumor response to immune checkpoint inhibitors (ICIs). ICIs are state-of-the-art therapies for treating many types of cancer, yet 5-year survival remains poor for over half of treated patients52. To investigate T cell states associated with ICI response, we applied TCAT to melanoma53, non-melanoma skin cancer (NMSC)54 and colorectal cancer (CRC)55 datasets containing responder and nonresponder tumors before and after treatment.

We first examined melanoma as the largest dataset containing 19 pretreatment and 48 total samples. TCAT revealed populations expressing ASA, exhaustion, cell cycle and CD4-naive signatures (Fig. 6a). We also noted a prominent subset of cells expressing TCF7, which was previously associated with ICI response in this dataset53.

Fig. 6: cGEPs associated with ICI response.
figure 6

a, UMAPs of the melanoma dataset showing TCAT predicted lineage; the CD4-naive and exhaustion cGEPs; the ASA and cell cycle scores; TCF7 expression; and treatment status and response. b, Associations of cGEP usage with ICI response in CD4+ T cells of pretreatment melanoma. Dots are colored by cGEP type. x axis shows the average difference in usage between nonresponders and responders. y axis shows the −log10 two-tailed P value. ce, Average ASA score, cell cycle score and CD4-naive cGEP usage in CD4+ T cells from pretreatment melanomas and NMSC tumors and combined pretreatment and post-treatment CRC. The average scores are mean and variance normalized. P values are one-tailed t-tests for melanoma and NMSC and mixed linear regression P values for CRC. P value in the title is a meta-analysis of the P values across the three cancer types. *P < 0.05. Boxes represent the interquartile range and whiskers represent 1.5 times interquartile range. The box center line indicates the median. Sample sizes shown are n = 9, n = 6 and n = 15 for responders in melanoma, NMSC and CRC, respectively, and n = 10, n = 7 and n = 4 in nonresponders in melanoma, NMSC and CRC. f, Same as b in pretreatment NMSC. g, Same as b in combined pretreatment and post-treatment CRC, but showing coefficients and P values from mixed linear regression analysis, controlling for treatment time point and donor of origin.

In melanoma pretreatment tumors, CD4+ T cells from nonresponders had significantly higher activation (for example, TIMD4/TIM3, OX40/EBI3, HLA) and cell cycle cGEP usage (P < 0.05 two-tailed t-test; Fig. 6b and Supplementary Table 8). Associations were concordant in pretreatment NMSC samples (sign test P = 0.0016; Extended Data Fig. 8a), including significant associations for CellCycle-Late-S and TIMD4/TIM3 (one-tailed P = 0.037 and 0.046, respectively; Fig. 6c–f). Meta-analysis of associations between pretreatment melanoma and NMSC tumors was significant for the combined ASA and cell cycle scores (P = 0.0072 and P = 0.0036, respectively; Fig. 6c–e) and for many cGEPs (for example, CellCycle-Late-S, exhaustion, ICOS/CD38; P < 0.05 all). Thus, elevated pretreatment TCR activation and proliferation signatures predict a worse response to ICIs in these tumor types.

Responders also exhibited higher usage of the CD4-naive cGEP in pretreatment melanoma and NSMC tumors (meta-analysis P = 0.0063; Fig. 6e). This aligns with prior evidence linking TCF7 to improved ICI outcome, as TCF7 is a top marker of the CD4-naive cGEP (Supplementary Table 2). The naive T cells markers TCF7, CCR7 and SELL all had higher expression in pretreatment responders than nonresponders (meta-analysis P = 0.024, 0.0016 and 0.00047, respectively; Extended Data Fig. 8b–d). Furthermore, the proportion of TCAT-classified naive CD4+ T cells was similarly predictive of response (meta-analysis P = 0.016; Extended Data Fig. 8e), suggesting infiltrating naive CD4 cells may predict positive ICI response.

The CRC dataset only had one pretreatment nonresponder sample, which precluded association testing in pretreatment tumors. However, we observed concordant associations between pretreatment and post-treatment samples across the three tumor types (Extended Data Fig. 8f–h and Supplementary Fig. 10a,b). For example, response-associated cGEPs (P < 0.05) were significantly concordant between pretreatment and post-treatment melanoma samples (Fisher’s exact test P = 0.0039).

Assuming pretreatment and post-treatment samples share many immune states, we repeated associations using combined pretreatment and post-treatment samples, modeling treatment status with a fixed effect and patient of origin with random effects (Methods). This yielded consistent results with our prior findings, including increased activation and cell cycle cGEPs in CRC nonresponders (for example, P < 1 × 10−6 for OX40/EBI3, CellCycle-Late-S and ASA; Fig. 6g and Supplementary Fig. 10c,d). These distinct analyses support that our findings are robust and reproducible in three cancer types.

We obtained similar results in CD8+ T cells (Extended Data Fig. 8i–k). Activation and cell cycle cGEPs were elevated in pretreatment nonresponders of melanoma and NMSC (meta-analysis P < 0.05 for CellCycle-G2M, CellCycle-Late-S, TIMD4/TIM3, exhaustion, ICOS/CD38, OX40/EBI3 and HLA) and in the combined CRC cohort (P < 0.05 for all). The CD8-naive cGEP was associated with positive ICI response in pretreatment melanoma and NMSC samples (meta-analysis P = 0.032). These analyses highlight TCAT’s capacity to reveal clinically meaningful immune patterns across multiple datasets.

Discussion

Here, we introduced starCAT, a tool that leverages the reproducibility of functionally informative cGEPs across datasets to annotate new scRNA-seq data. We illustrated starCAT with the most comprehensive T cell GEP catalog to date, including 16 subset-associated and 25 functional cGEPs derived from reference datasets spanning many tissues and diseases. Combining this catalog with starCAT yields TCAT.

TCAT offers key advantages over standard approaches. It simultaneously annotates functional and subset GEPs, disentangling conflated signals and revealing unexpected populations like abundant TH2 polarized CD8+ T cells33. It outperformed RNA-based clustering and reference label transfer methods for subset annotation and enabled facile disease activity comparisons. TCAT is also much faster than de novo matrix factorization, avoids the need for GEP annotation and improves GEP inference accuracy for smaller datasets.

We developed the AIM-seq assay to identify cGEPs induced following TCR-dependent activation. This identified 24 TCR activation-associated cGEPs including context-specific responses like multi-cytokine in CD8 memory and CTLA4/CD38 predominantly in CD4 memory cells. We aggregated several of these cGEPs into an ASA score to identify activation in scRNA-seq data. This revealed numerous ‘bystanders’ within tumors that lacked either activation or exhaustion signatures and were enriched for naive and unconventional T cell subsets.

TCAT uncovered features of tumor-infiltrating T cells that predict ICI response. Surprisingly, nonresponsive tumors were enriched for activation, cell cycle and exhaustion cGEPs. By contrast ICI responsive samples were enriched for the CD4-naive cGEP and were predicted to contain more naive CD4+ T cells. These findings suggest that TCAT can help predict therapy response and characterize clinically relevant cell states.

Limitations of this work include that the catalog may be missing GEPs that are disease or tissue-context specific and were filtered due to non-reproducibility across datasets. Our AIM-seq experiment used only a single co-stimulation signal and set of microbial peptides and thus may not have identified all possible activation-associated GEPs. Future work incorporating additional datasets and stimulation conditions can address both limitations.

While demonstrated in T cells, starCAT is applicable to any cell type or tissue. We provide open-source software and a growing repository of GEPs, including human glioma myeloid56 and bone marrow hematopoiesis references57. Similar to the molecular signatures database (MSigDB)58,59 but for scRNA-seq annotation, the platform supports community-contributed GEP catalogs. We hope starCAT will enable comprehensive identification of GEPs across tissues and diseases.

Methods

Materials and reagents

Reagent or resource

Source

Identifier

XVIVO15 culture media

Lonza

Catalog no. 02-060Q

RPMI 1640 medium

Thermo Fisher

Catalog no. 11875093

Benzonase nuclease

Sigma-Aldrich

CAS no. 9025-65-4

Trial grade CEFX peptide pool

JPT

PM-CEFX-2

Anti-CD28 antibody

BioLegend

Catalog no. 302933

RRID: AB_11150591

Anti-CD49d antibody

BioLegend

Catalog no. 304339

RRID: AB_2810443

Human TruStain FcX (Fc receptor blocking solution)

BioLegend

Catalog no. 422302

RRID: AB_2818986

Zombie yellow fixable viability kit

BioLegend

Catalog no. 423104

TotalSeq-C human universal cocktail, V1.0

BioLegend

Catalog no. 399905

Human TOTAL-SeqC repertoire (5′) hashing antibodies

BioLegend

Catalog nos. 394661, 394663 and 394665

Anti-CD3-BV421 (SK7)

BioLegend

Catalog no. 344833

RRID: AB_2565674

Anti-CD134-PE (Ber-ACT35)

BioLegend

Catalog no. 350003

RRID: AB_10641708

Anti-CD274-BV785 (29E.2A3)

BioLegend

Catalog no. 329735

RRID: AB_2629581

Anti-CD137-APC (4-B4-1)

BioLegend

Catalog no. 309809

RRID: AB_830671

Anti-CD4-FITC (RPA-T4)

BioLegend

Catalog no. 300505

RRID: AB_314073

Chromium next GEM single-cell 5′ kit v2, 16 reactions

10x

Catalog no. 1000263

Dual index kit TN set A, 96 reactions

10x

Catalog no. 1000250

Chromium next GEM chip K single-cell kit, 48 reactions

10x

Catalog no. 1000286

Chromium single-cell human TCR amplification kit, 16 reactions

10x

Catalog no. 1000252

Library construction kit, 16 reactions

10x

Catalog no. 1000190

5′ feature barcode kit, 16 reactions

10x

Catalog no. 1000256

starCAT algorithm

Whereas cNMF learns both GEPs and their usage in cells, starCAT has the simpler problem of fitting the usage for a fixed set of GEPs. Specifically, cNMF runs NMF multiple times, each time solving the following optimization as given by equation (1):

$${\rm{ArgMin}}_{G,U}{|X}-{{UG}}{|}_{F}$$
(1)

where

$$U\ge 0,G\ge 0$$

where X is a NxH matrix of N cells by the top H overdispersed genes, U is a learned NxK matrix of the usages of K GEPs in each cell, and G is a learned KxH matrix where each row encodes the relative contribution of each highly variable gene in a GEP. H is usually a parameter set to ~2,000 overdispersed genes. | |F denotes the Frobenius norm. X includes variance-normalized overdispersed genes to ensure biologically informative genes are included and contribute similar amounts of information even when they may be expressed on different scales. For cNMF, the optimization is solved multiple times and the resulting G matrices are concatenated, filtered and clustered to determine a final average estimate of G. Ultimately cNMF refits the GEP spectra into two separate representations, one reflecting the average expression of the GEP in units of transcripts per million \({G}^{{tpm}}\) and one in z-scored units used to define marker genes \({G}^{\rm{scores}}\) (see ref. 12 for details).

By contrast, starCAT takes a fixed catalog of GEPs as input, denoted as \({G}^{* }\), and a new query dataset \({X}^{\rm{query}}\) and solves the optimization as given by equation (2):

$${\rm{ArgMin}}_{U}|{X}^{\rm{query}}-{{U{G}}}^{* }{|}_{F}$$
(2)

where

$$U\ge 0$$

The columns of \({X}^{\rm{query}}\) and \({G}^{* }\) correspond to a prespecified set of overdispersed genes. Analogous to cNMF, we use gene-wise standard-deviation-normalized counts for \({X}^{{query}}\). See below for how \({G}^{* }\) is calculated for TCAT. We solve for U with nonnegative least squares using the NMF package in scikit-learn (version 1.1.3)60 with \({G}^{* }\) fixed. We use the Frobenius error, the multiplicative update (‘mu’) solver, a tolerance of 1 × 10−4 and maximum iterations of 1,000. We then perform row normalization of the U matrix so that each cell’s aggregate usage across all K GEPs sums to 1.

Dataset preprocessing and batch-effect correction

To generate the input matrix for cNMF for each dataset, we first filtered genes detected in fewer than ten cells and cells with fewer than 500 unique molecular identifiers. We also excluded antibody-derived tags (ADTs) and genes containing a period in their gene name. We subsequently subsetted the data to the top 2,000 most overdispersed genes, identified by the ‘seurat_v3’ algorithm as implemented in Scanpy61. Next, we scaled each gene to unit variance. To avoid outliers with excessively high values, we calculated the 99.99th percentile value across all cells and genes and set this as a ceiling. We denote this matrix as \({X}^{\rm{raw}}\).

We used an adapted version of harmonypy to correct batch effects and other technical variables from \({X}^{\rm{raw}}\) before cNMF20. For this, we computed Harmony’s maximum diversity clustering matrix from principal components calculated from a normalized version of X, which we label \({X}^{{norm}}\). Specifically, to compute \({X}^{{norm}}\), we started from the same initial gene list described above but first normalized the rows of the matrix so that each cell’s counts sum to 10,000 (TP10K normalization). We then subsetted to the top 2,000 overdispersed genes, and scaled each column (gene) to unit variance, resulting in \({X}^{\rm{norm}}\). We then performed PCA on \({X}^{\rm{norm}}\) and supplied those principal components to the run_harmony function of harmonypy. We then used the mixture of experts model correction, implemented in harmonypy with the computed maximum diversity clustering matrix, but instead of correcting the principal components using this model, as standard Harmony does, we corrected \({X}^{\rm{raw}}\). This creates a small amount of variability around 0 for the smallest values in \({X}^{\rm{raw}}\). We therefore set a floor of 0, resulting in the corrected matrix \({X}^{c}\) used as the count matrix for cNMF.

cNMF

We ran cNMF on the batch-corrected \({X}^{c}\) matrix, which only includes the top 2,000 overdispersed RNA genes. Spectra for the resulting GEPs were then refit by cNMF including all genes that passed the initial set of filters, including ADTs. Specifically, RNA counts were normalized to sum to 10,000, and ADT counts were separately normalized to sum to 10,000 and the combined matrix was passed as the --tpm argument for cNMF. Thus, the GEP spectra output by cNMF incorporates ADTs and genes not included in the 2,000 overdispersed genes.

cNMF was run for each dataset with the number of components (K) varying between 15 and 55 and with 20 iterations. The final number of NMF components used for each dataset, K*, was chosen by visualizing the trade-off between reconstruction error and stability for these runs (Supplementary Fig. 2). Once K* was selected, we ran cNMF a final time with only this value for K and with 200 iterations to generate the final GEP spectra estimates.

Constructing a catalog of cGEPs

Next, we identified consensus GEP spectra—that is, the average of correlated GEP spectra identified by cNMF in different datasets. Normalized input GEP vectors, denoted as gi, were computed by starting from the spectra_tpm output from cNMF, renormalizing each vector to sum to 106, and then dividing each gene by its standard deviation in TP10K units. Then, we created an undirected graph where the 267 GEPs identified across all reference datasets were represented as nodes g1g267. We drew edges, denoted as Ei,j connecting a pair of GEPs gi and gj if the following criteria were met:

  1. 1.

    gi and gj were from different datasets

  2. 2.

    Rij > 0.5 where Rij denotes the Pearson correlation between gi and gj. For computing Rij, gi and gj were subset to the union of the overdispersed genes for each dataset.

  3. 3.

    gi was among the top seven most correlated GEPs with gj, and gj was among the top seven most correlated GEPs with gi with correlation defined as in 2.

Next, we initialized a set for each GEP: x1 = {g1} … x267 = {g267}. We then iterated through all edges Ei,j in the graph in order of decreasing Rij and merged the sets xi and xj into a new set xi,j = {gi, gj}. If either gi or gj were already members of a merged set from previous merges, we merged their containing sets only if at least two-thirds of the GEP pairs in the resulting consensus set were connected by edges. For example, if there is an edge E4,9 and g4 is already merged into a set {g1, g2, g4}, then we only merged {g1, g2, g4} and {g9} if there were also edges E1,9 and E2,9. This resulted in 49 merged sets and 52 unmerged ‘singleton’ sets. We filtered 49 of the 52 singletons and retained 3 that had a biological explanation for being identified in only one dataset. This resulted in a final catalog of 52 cGEPs, including doublet programs.

Lastly, we subset each GEP to the union of overdispersed genes across all seven reference datasets that were present in all datasets and obtained the final consensus GEPs by taking the element-wise average GEPs in each merged set. This matrix was used as the reference for TCAT. For marker gene analyses (for example, Fig. 2b,d and Extended Data Fig. 2), we then averaged, element-wise, the z-score representation of the GEP output by cNMF for GEPs in a consensus set.

Simulation analysis

We adapted the scsim simulation framework described previously12 based on Splatter62 into a new version, scsim2. Like with scsim, we distinguished between subset GEPs, which are mutually exclusive, and non-subset or ‘activity’ GEPs, which are not. For the original scsim framework, cells used one of multiple subset GEPs and potentially used a single-activity GEP. We adapted scsim to allow cells to use anywhere from none to all of the activity GEPs in addition to their single subset GEP. We kept the Splatter parameters used in the original publication to describe the distribution of gene expression data: mean_rate = 7.68, mean_shape = 0.34, libloc = 7.64, libscale = 0.78, expoutprob = 0.00286, expoutloc = 6.15, expoutscale = 0.49, diffexpprob = 0.025, diffexpdownprob = 0.025, diffexploc = 1.0, diffexpscale = 1.0, bcv_dispersion = 0.448 and bcv_dof = 22.087.

We simulated 10 subset GEPs and 10 activity GEPs based on 10,000 total genes. The extra-GEP reference included all 20, the missing-GEP reference included 6 of the subset GEPs and 6 of the non-subset GEPs, and the query dataset included 8 subset GEPs and 8 non-subset GEPs. Each dataset consisted of 9,000 genes, randomly sampled from the total of 10,000. Each cell was randomly assigned a subset GEP with uniform probability, and each cell randomly selected whether it expressed each activity GEP with a probability of 0.3. The degree of usage of each activity GEP was sampled uniformly between 0.1 and 0.7. If the sum of the activity GEPs exceeded 0.8 for a cell, they were renormalized to sum to 0.8. Thus, each cell’s usage of its subset GEP always exceeded 0.2. We simulated 100,000 cells each for the extra-GEP and missing GEP references. We simulated multiple query datasets containing 100, 500, 1,000, 5,000, 10,000, 20,000, 50,000 or 100,000 cells. The same parameters were used for Supplementary Fig. 1 but with different numbers of GEPs in the references and query.

We subsequently ran cNMF using 1,000 overdispersed genes, 20 iterations, local_neighborhood_size = 0.3 and density_threshold = 0.15. We used K = 20, K = 12 and K = 16 for the extra-GEP reference, missing-GEP reference and query datasets, respectively. We then used starCAT to fit the usage of the reference GEPs on the query dataset. To evaluate the performance of starCAT and cNMF, we calculated the Pearson correlation of the inferred GEP usage with the simulated ground-truth usage.

Gene-set enrichment analysis

We used Fisher’s exact test in Python’s Scipy library to associate cGEPs with gene sets (Supplementary Note). For the T cell polarization dataset63, we defined polarization gene sets as genes that had an FDR-corrected P value < 0.05 and fold change > 2 with the stimulation condition. We excluded genes with FDR-corrected P value between 0.05 and 0.2 and fold change > 1, as many of these are upregulated by the stimulation but just did not reach FDR significance. We also obtained literature gene sets corresponding to immediate early genes64 and gene ontologies65,66. We tested these for enrichments with each cGEP thresholded with a z score > 0.015, which corresponded to the 99th percentile across all genes and cGEPs, using Fisher’s exact test as implemented in scipy.stats in Python.

Manual subset gating analysis

We library size normalized ADT protein measurements to sum to 104 (TP10K) and applied the centered log-ratio transformation. We then scaled each protein to unit variance, and set a ceiling of 15 to remove excessively high outliers. Next, we performed PCA and ran batch correction using harmonypy with the same batch features as for cNMF. We then computed the K-nearest neighbor graph with K = 5 neighbors, using the Harmony-corrected principal components. We then smoothed the normalized protein estimates using MAGIC67 using the K-nearest neighbor graph and the diffusion operator powered to t = 3.

We gated canonical T cell subsets using the smoothed normalized ADTs. First, we gated γδ T cells using expression of Vδ2 TCR. Then, we separated MAIT cells using expression of CD161 and TCR Vα 7.2. We then used CD4 and CD8 to separate CD4 (CD4+CD8), CD8 (CD4CD8+), double-positive (CD4+CD8) and double-negative (CD4CD8) T cells. We then subset to CD4+ T cells and gated Treg cells using expression of CD25 and CD39. Of the remaining CD4+ T cells, we used CD62L and CD45RA to define CD4-naive (CD62L+CD45RA+), CD4 central memory (CD62L+CD45RA), CD4 effector memory (CD62LCD45RA) and CD4 TEMRA (CD62LCD45RA+) populations. For the CD8+ T cells, we similarly used CD62L and CD45RA to define CD8 naive (CD62L+CD45RA+), CD8 central memory (CD62L+CD45RA), CD8 effector memory (CD62LCD45RA) and CD8 TEMRA (CD62LCD45RA+) populations.

T cell subset classification benchmarking analyses

We used T cell subsets defined by manual gating of ADTs in the flu-vaccine dataset as ground truth for prediction. For single cGEP prediction, we ran TCAT to predict cGEP usage, and identified the cGEP that best predicted the lineage based on AUC.

We also used all the cGEPs to perform simultaneous multi-label prediction. We scaled the normalized usages for all cGEPs to zero mean and unit variance. Using COMBAT as a training dataset, we trained a multinomial logistic regression using scikit-learn60 version 1.0.2 with lbfgs solver to predict gated subset from usages. Model weights were adjusted by the inverse of subset size using class_weight = ‘balanced’, allowing subsets with different cell counts to contribute to the model equally. We excluded CD4 TEMRA, double-negative and double-positive subsets from this analysis due to low cell counts in both the training and testing datasets. We evaluated this model in the independent flu-vaccine query dataset.

Analogous comparisons were made using GEPs from Yasumizu et al. fit to the data using the NMFproject software11. We also obtained gene sets derived from NMF analyses of T cells in a pan-cancer dataset14. To assess the ability of these gene sets to predict gated subsets, we used the score_genes function in Scanpy61 on data normalized following the standard pipeline (library size normalizing to TP10K, log transformation, scaling each gene to unit variance). We then assigned each subset to the gene set that yielded the maximal AUC.

To evaluate clustering, we first normalized the data as above, and data were subset to highly variable genes using the highly_variable_genes function in Scanpy with default parameters. We then ran PCA and Harmony batch correction of the principal components20. We then computed the K-nearest neighbor graph using 31 harmony-corrected principal components and 30 nearest neighbors. We then performed Leiden clustering68 with resolution parameters ranging from 0.25 to 2.25 increasing by 0.25. For each clustering resolution, we performed a greedy search to assign clusters to manually gated subsets based on maximization of the balanced accuracy (that is, the average recall across all subsets). In each iteration, we considered all unassigned clusters and possible gated subset assignments and selected the cluster and assignment that most increased the overall balanced accuracy. When no remaining cluster assignments would increase the balanced accuracy, we assigned the cluster to a subset that least decreased the balanced accuracy. We continued this process until each cluster was assigned to a subset.

To evaluate other reference mapping methods, we followed the normalization methods directed by each method. We supplied the flu-vaccine raw counts matrix to Azimuth and utilized its human PBMC reference (the HIV-vaccine dataset). For reference mapping with Symphony, we built a reference for the HIV-vaccine dataset, performing library size normalization (TP10K normalization) followed by log transformation. We selected the top 2,000 variable genes per donor using VST selection, centered and scaled the normalized counts, and performed PCA (irlba package) and batch correction (on lane and donor) using Harmony. We built a Symphony reference using the batch-corrected PCs. We then performed TP10K library size normalization and log transformation on the flu-vaccine query and annotated cells using Symphony’s reference mapping and annotation transfer algorithms. For reference mapping with ProjecTILs, we supplied the flu-vaccine raw count matrix and utilized the default comprehensive T cell reference provided by ProjecTILs (ref_TILAtlas_mouse_v1). Internally, ProjecTILs maps human queries to its mouse reference using gene orthologs.

For all accuracy calculations, we utilized sklearn’s balanced_accuracy_score, an approach appropriate for cases of class imbalance. In the binary case, balanced accuracy refers to the mean sensitivity and specificity of a prediction. In the multi-class case, balanced accuracy refers to the mean sensitivity across classes.

AIM-seq

Patients were recruited for this study through the Partners Biobank69. Informed consent was obtained from all participants. We have complied with all ethical regulations, and the study protocol was approved by the Mass General Brigham Institutional Review Board. PBMCs were collected from five genotyped participants with no autoimmune diseases or use of immunomodulatory medications. PBMCs were quickly thawed and placed in prewarmed xVIVO15 cell culture medium (Lonza) supplemented with 5% heat-inactivated FBS. To reduce cell clumping, PBMCs were incubated in xVIVO15 containing 50 U ml−1 of benzonase nuclease (Sigma-Aldrich) for 15 min at 37 °C and filtered using a 70-µm cell strainer. Washed and nuclease-treated cells were seeded in a 96-well cell culture plate at a concentration of 2.5 × 106 per ml. Peptide stimulations were performed using the CEFX Ultra SuperStim Pool (JPT Peptide Technologies, PM-CEFX-1) at a final concentration of 1.25 µg ml−1 per peptide for 22 h at 37 °C and 5% CO2. Recombinant anti-CD28 and anti-CD49d antibodies (BioLegend) were added at a final concentration of 5 µg ml−1 and 0.625 µg ml−1, respectively, to provide co-stimulation for peptide-reactive T cells. Separately mock-stimulated cells were treated with anti-CD28 and anti-CD49d antibodies at the same concentration.

Peptide responsive T cells were detected by the expression of the surface activation markers PD-L1, OX40 and CD137 via flow cytometry. Following the stimulation, peptide-treated and mock-stimulated cells were washed in cell staining buffer (PBS + 2 mM EDTA + 2% FBS) to end the stimulation. Fc receptor blocking was performed using a 1:50 dilution of Human TruStain FcX (BioLegend) in cell staining buffer for 10 min at 4 °C. Cell viability staining was performed using a 1:500 dilution of Zombie Yellow Fixable Viability Dye (BioLegend) prepared in PBS for 30 min at 4 °C. Surface staining was performed using 1:100 dilutions of BV421-conjugated anti-CD3, FITC-conjugated anti-CD4, BV786-conjugated anti-PD-L1, PE-conjugated anti-OX40 and APC-conjugated anti-CD137 (BioLegend) for 25 min at 4 °C in cell staining buffer. Following cell staining, antigen reactive and non-reactive T cells were identified using a BD FACSAria II cell sorter and collected in cRPMI medium (100 U ml−1 penicillin–streptomycin + 2 mM l-glutamine + 10 mM HEPES + 0.1 mM non-essential amino acids + 1 mM sodium pyruvate + 0.05 mM 2-mercaptoethanol) supplemented with 20% FBS. Sorted T cell populations were then labeled with 75 μl of TotalSeq oligonucleotide-conjugated hashing antibody mix, incubated for 30 min at 4 °C with gentle mixing after 15 min, and pooled in equal quantities. Staining with the TotalSeq-C Human Universal Cocktail (BioLegend) was then performed according to the manufacturer’s instructions. The cells were then resuspended in PBS supplemented with 0.04% FBS at a final concentration of 500 cells per µl and submitted for single-cell profiling on the Chromium Next GEM instrument. Library preparation was completed for the hashtag oligonucleotides, scRNA-seq, CITE-seq and TCR-repertoire sequencing following the manufacturer’s instructions.

We collected AIM-seq data from two separate 10x runs. In the first experiment, PBMCs from three donors were processed independently as described above and were pooled after fluorescence-activated cell sorting. In the second run, PBMCs from four donors, two of which overlapped with the first run, were stimulated separately and pooled before fluorescence-activated cell sorting.

Preprocessing of AIM-seq data

The AIM-seq dataset was processed using Cell Ranger version 6.1.1 with default parameters and alignment to hg38 reference genome. The donor of origin for each cell was determined using Demuxlet version 1.0 with a doublet-prior of 0.1 (ref. 70). Cells with a null or ambiguous Demuxlet result, fewer than 10 counts of the hashtag oligonucleotides, or fewer than 50 total RNA counts were filtered. To account for staining differences between the hashtag oligonucleotides and different sequencing depths of the two 10x runs, the counts for each hashtag oligonucleotide in each 10x run were scaled to have the same median value. Next we added a pseudocount to the hashtag oligonucleotide counts and log10 transformed these data. Then we ran Gaussian mixture models separately for each hashtag oligonucleotide with K = 2 clusters. Each cell was assigned to a single condition if it was in the high cluster for one oligonucleotide and the low clusters for all others, a doublet if it was in the high cluster for more than one oligonucleotide, or an empty droplet if it was in the low cluster for all oligonucleotides. Empty droplets or doublets based on the hashtag oligonucleotide clustering were filtered, as were doublets based on Demuxlet. Genes detected in fewer than ten cells were filtered before running TCAT.

Statistics and reproducibility

We did not perform a statistical analysis for choosing sample size. We chose to replicate the study across five samples for reproducibility, as we sequenced thousands of cells per donor and were well powered to find significant associations between stimulation conditions. We excluded cells based on unique molecular identifier counts and ability to be demultiplexed, as above. No blinding was performed.

cGEP associations with AIM positivity, proliferation and disease

To associate cGEPs with the AIM-seq stimulus, we first ran TCAT to fit the usages of the cGEPs in the AIM-seq dataset. We then computed the average usage of each cGEP in cells from each sort condition in each donor. We created two dummy variables, the first indicating whether a sample was treated with CEFX or mock, and the second indicating whether a sample was both CEFX-treated and AIM-positive or not. We used ordinary least-squares regression to estimate the effects of these two variables and an intercept. cGEPs associated with the CEFX or mock dummy variable were labeled ‘milieu-associated’, while cGEPs positively associated with the AIM-positive dummy were labeled ‘AIM-associated’.

To associate cGEPs with proliferation, we defined cells as proliferating or non-proliferating in each dataset by setting a threshold of 0.1 on the sum of the three cell cycle cGEPs—CellCycle-S-phase, CellCycle-Late-S and CellCycle-G2M. We then computed the mean usage of each cGEP per sample separately in high cell-cycle (sum usage > 0.1) and low cell-cycle (sum usage < 0.1) cells. We filtered samples that did not have at least 10 high cell-cycle cells and 100 low cell-cycle cells. Then, for each cGEP, we performed a two-tailed paired t-test (ttest_rel in Scipy, default parameters) between average cGEP usage for high and low cell-cycle cells. We meta-analyzed P values across datasets using Fisher’s Method (combine_pvalues in Scipy).

To associate cGEPs with sample-level disease phenotypes, we calculated the average usage of each cGEP in each sample for a given dataset. We then used ordinary least-squares regression to find cGEPs with higher average usage in disease samples than controls, controlling for sample-level batch variables as covariates. For all datasets, disease status was modeled as a binary dummy variable, and an intercept was included. For UK-COVID, the processing site was included as a dummy variable covariate. For COMBAT, sequencing pool and processing institute were included as dummy variable covariates. For the pan-cancer dataset, all cancer types were initially included in the analysis and dummy variable covariates were included for tissue of origin. In addition, sequencing technology was included as a dummy variable. When there were multiple tumor samples or matched normal samples from the same donor, we excluded the duplicates with fewer total cells. For all association tests, we performed FDR correction of the P values using the Benjamini–Hochberg method (fdrcorrection in Statsmodels with method = ‘indep’).

For the ICI response analyses, we used one-tailed or two-tailed Welch’s t-tests as indicated in the main text for analyses of isolated pretreatment or post-treatment samples. For analyses of combined pretreatment and post-treatment samples, we used mixed linear models with an intercept, treatment status and clinical response as fixed effects, patient of origin as random intercepts, and average cGEP usage as the response variable. We used the MixedLM package in statsmodels with REML = False (that is, using maximum likelihood estimation) and computed P values for the response fixed effect using the likelihood-ratio test. For meta-analyses, we used Fisher’s method to combine one-tailed P values.

Defining the ASA score

We used CD71+CD95+ surface protein coexpression in the COMBAT and flu-vaccine datasets as an in vivo correlate of TCR activation to help prioritize AIM-associated cGEPs for predicting TCR-activated cells. First, we preprocessed the ADT surface proteins in these datasets as described in the manual subset gating section. We then subsetted cells by their manual gating-defined broad cell types (conventional CD4, CD4+ Treg, conventional CD8, other) and gated CD71+CD95+ cells separately for each cell type as the response feature to be predicted by AIM-associated cGEPs.

We then performed forward stepwise selection, evaluating how well the summation of usages of different combinations of AIM-associated cGEPs would predict CD71+CD95+ gating. At each stage, the per-cell ASA score was computed as the sum of normalized usages of cGEPs in the predictive set. At each forward step, we determined which cGEP should be added to the predictive set based on which would most improve the average AUC across the flu-vaccine and COMBAT datasets. We used a reduction in AUC in both datasets as the stopping criterion for adding cGEPs. We considered all AIM-associated cGEPs identified as candidates for this but excluded those known to have a broader function outside T cell activation (for example, cytoskeleton, metallothionein and cell cycle) and those reflecting activation-associated T cell subsets (TPH and TH17-activated). We also excluded exhaustion from the ASA score as it reflects a distinct inhibitory response to antigen stimulation that users may wish to annotate separately.

Benchmarking the ASA score

We benchmarked ASA’s prediction of activation with published T cell activation gene sets, where ground-truth activation is defined by AIM positivity in the AIM-seq dataset and CD71+CD95+ coexpression in the flu-vaccine dataset. We utilized all T cell activation gene sets present in the immunologic signature (C7) collection on MSigDB. We then scored each cell’s usage of each of the gene sets using scanpy’s score_genes function following data preprocessing (TP10K normalization, log transformation, mean and variance scaling).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.