Table 1. Summary of selected single-cell foundation models (scFMs).
From: Single-cell foundation models: bringing artificial intelligence into cell biology
Model name | Journal | Date | Omics | Dataset size | Pretraining dataset | Architecture | Input modification | Pretraining tasks | Downstream tasks* |
|---|---|---|---|---|---|---|---|---|---|
scBERT | Nature Machine Intelligence | September 2022 | scRNA-seq | 1.1 million | PanglaoDB. 209 human datasets and 74 tissues | Encoder-only (Performer) | Value binning | Gene masking | Cell type prediction and novel cell type detection |
Geneformer | Nature | May 2023 | scRNA-seq | 30 million | Human Cell Atlas, EMBL-EBI Single Cell Expression Atlas, PanglaoDB, Tumor-Immune Single-Cell Hub, GEO and SRA. Human only, no malignant cells or immortalized cell lines | Encoder-only | Ranked by expression, with gene expression normalized by the median value from the pretraining dataset | Gene masking | Cell type prediction, in silico perturbation, gene classification and cell/gene embedding |
tGPT | iScience | May 2023 | scRNA-seq | 22.3 million | Human Cell Atlas, Single Cell Expression Atlas, COVID-19 Atlas, Tabula Muris and Mouse Cell Atlas | Decoder-only | Ranked by expression | Autoregressive gene prediction | Cell embedding |
CellLM | arXiv | June 2023 | scRNA-seq | 2 million | PanglaoDB and CancerSCEM | Encoder-only (Performer) | Value binning and PPI embedding | Gene masking and contrastive learning (cell) | Cell type prediction |
CellPLM | bioRxiv | October 2023 | scRNA-seq and spatial | 11 million | Human Tumor Cell Atlas, Human Cell Atlas, GEO and CosMx dataset | Encoder | Value projection | Gene masking | Cell type prediction, spatial gene imputation and cell embedding |
scGPT | Nature Methods | February 2024 | scRNA-seq, scATAC-seq and CITE-seq | 33 million | CELLxGENE, Human Cell Atlas and PanglaoDB | Decoder-inspired with masked generative pretraining | Value binning | Attention masking | Cell type prediction, in silico perturbation, batch integration, multiomics integration, gene regulatory network generation and reference mapping |
scFoundation | Nature Methods | June 2024 | scRNA-seq | 50 million | GEO, Human Cell Atlas, EMBL-EBI, hECA and DISCO | Asymmetric encoder–decoder | Value projection | Gene masking | Cell type prediction, perturbation prediction (GEARS), cell/gene embedding, drug response prediction (DeepCDR), read depth enhancement, gene module and network inference |
Geneformer2 | bioRxiv | August 2024 | scRNA-seq | 103 million | Geneformer1 + Broad Institute Single Cell Portal, CELLxGENE and Brotman Baty Institute-Allen Single Cell Atlases | Encoder-only | Ranked by expression, with gene expression normalized by the median value from the pretraining dataset | Gene masking | Cell type prediction, in silico perturbation, gene classification, cell/gene embedding and multitask fine-tuning |
UCE | bioRxiv | October 2024 | scRNA-seq | 36 million | CELLxGENE. Eight species | Encoder | ESM2-based embedding and ordered by genomic location | Gene masking and binary classification of expression | Cell embedding |
GeneCompass | Cell Research | October 2024 | scRNA-seq | 126 million | GEO, ArrayExpress, China National Center for Bioinformation and CELLxGENE. Human and mouse | Encoder-only | Ranked by expression and embeddings from prior knowledge | Gene masking for gene ID and expression prediction | Cell type prediction, perturbation prediction (GEARS), GRN inference, drug response prediction and gene embedding |
Nicheformer | bioRxiv | October 2024 | scRNA-seq and spatial | 110 million | GEO, CosMx, Xenium and MERFISH. Human and mouse | Encoder-only | Ranked by expression | Gene masking for gene rank prediction | Cell embedding and niche prediction |
SCimilarity | Nature | November 2024 | scRNA-seq | 23.4 million | GEO, CELLxGENE and manual curation | Non-transformer encoder–decoder model | Value projection | Cell similarity and expression reconstruction | Cell type prediction and cell search |
scPRINT | Nature Communications | April 2025 | scRNA-seq and scATAC-seq | 54 million | CELLxGENE. Human and mouse | Encoder–decoder | Normalized expression, ESM2-based embedding and gene location as position | Denoising, label prediction and expression reconstruction | Cell type prediction, GRN inference and cell embedding |
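
The "Input modification" column distinguishes, among others, rank-based encoding from value binning. The sketch below illustrates the general idea of each on a toy cell. It is not the code of any model in the table: the function names `rank_encode` and `bin_encode`, the toy gene values, the toy medians, and the equal-width binning rule are all assumptions made for illustration, and the published models use their own normalization and binning schemes.

```python
import math
from typing import Dict, List


def rank_encode(expr: Dict[str, float], medians: Dict[str, float]) -> List[str]:
    """Order expressed genes by median-normalized expression, highest first.

    Illustrative only: `medians` stands in for per-gene medians that a
    rank-based model would compute over its pretraining dataset.
    """
    scaled = {
        gene: value / medians[gene]
        for gene, value in expr.items()
        if value > 0 and medians.get(gene, 0) > 0
    }
    # Sort by scaled value (descending); break ties by gene name for determinism.
    return sorted(scaled, key=lambda g: (-scaled[g], g))


def bin_encode(expr: Dict[str, float], n_bins: int) -> Dict[str, int]:
    """Map each expression value to an integer bin; zeros stay in bin 0.

    Assumption: equal-width bins 1..n_bins over (0, max value in this cell].
    """
    nonzero = [v for v in expr.values() if v > 0]
    if not nonzero:
        return {gene: 0 for gene in expr}
    top = max(nonzero)
    binned = {}
    for gene, value in expr.items():
        if value <= 0:
            binned[gene] = 0
        else:
            raw_bin = math.ceil(value / top * n_bins)
            binned[gene] = max(1, min(n_bins, raw_bin))
    return binned


# Toy cell with hypothetical expression values and hypothetical medians.
cell = {"GAPDH": 8.0, "CD3E": 4.0, "MS4A1": 0.0, "ACTB": 2.0}
medians = {"GAPDH": 8.0, "CD3E": 1.0, "MS4A1": 1.0, "ACTB": 4.0}

print(rank_encode(cell, medians))  # gene tokens ordered by normalized rank
print(bin_encode(cell, n_bins=4))  # gene -> discrete expression token
```

In this toy case, median normalization moves CD3E ahead of GAPDH in the ranked sequence even though its raw value is lower, while binning keeps every gene and replaces its value with a discrete token. Value projection, the third approach in the table, keeps the continuous value and passes it through a learned layer instead, so it is not shown here.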