Table 1 Summary of selected scFMs.

From: Single-cell foundation models: bringing artificial intelligence into cell biology

| Model name | Journal | Date | Omics | Dataset size (cells) | Pretraining dataset | Architecture | Input modification | Pretraining tasks | Downstream tasks* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| scBERT | Nature Machine Intelligence | September 2022 | scRNA-seq | 1.1 million | PanglaoDB. 209 human datasets and 74 tissues | Encoder-only (performer) | Value binning | Gene masking | Cell type prediction and novel cell type detection |
| Geneformer | Nature | May 2023 | scRNA-seq | 30 million | Human Cell Atlas, EMBL-EBI Single Cell Expression Atlas, PanglaoDB, Tumor-Immune Single-Cell Hub, GEO and SRA. Human only, no malignant cells or immortalized cell lines | Encoder-only | Ranked by expression, with gene expression normalized by the median value from the pretraining dataset | Gene masking | Cell type prediction, in silico perturbation, gene classification and cell/gene embedding |
| tGPT | iScience | May 2023 | scRNA-seq | 22.3 million | Human Cell Atlas, Single Cell Expression Atlas, COVID-19 Atlas, Tabula Muris and Mouse Cell Atlas | Decoder-only | Ranked by expression | Autoregressive gene prediction | Cell embedding |
| CellLM | arXiv | June 2023 | scRNA-seq | 2 million | PanglaoDB and CancerSCEM | Encoder-only (performer) | Value binning and PPI embedding | Gene masking and contrastive learning (cell) | Cell type prediction |
| CellPLM | bioRxiv | October 2023 | scRNA-seq and spatial | 11 million | Human Tumor Cell Atlas, Human Cell Atlas, GEO and CosMx dataset | Encoder | Value projection | Gene masking | Cell type prediction, spatial gene imputation and cell embedding |
| scGPT | Nature Methods | February 2024 | scRNA-seq, scATAC-seq and CITE-seq | 33 million | CELLxGENE, Human Cell Atlas and PanglaoDB | Decoder-inspired with masked generative pretraining | Value binning | Attention masking | Cell type prediction, in silico perturbation, batch integration, multiomics integration, gene regulatory network generation and reference mapping |
| scFoundation | Nature Methods | June 2024 | scRNA-seq | 50 million | GEO, Human Cell Atlas, EMBL-EBI, hECA and DISCO | Asymmetric encoder–decoder | Value projection | Gene masking | Cell type prediction, perturbation prediction (GEARS), cell/gene embedding, drug response prediction (DeepCDR), read depth enhancement, gene module and network inference |
| Geneformer2 | bioRxiv | August 2024 | scRNA-seq | 103 million | Geneformer1 + Broad Institute Single Cell Portal, CELLxGENE and Brotman Baty Institute-Allen Single Cell Atlases | Encoder-only | Ranked by expression, with gene expression normalized by the median value from the pretraining dataset | Gene masking | Cell type prediction, in silico perturbation, gene classification, cell/gene embedding and multitask fine-tuning |
| UCE | bioRxiv | October 2024 | scRNA-seq | 36 million | CELLxGENE. Eight species | Encoder | ESM2-based embedding and ordered by genomic location | Gene masking and binary classification of expression | Cell embedding |
| GeneCompass | Cell Research | October 2024 | scRNA-seq | 126 million | GEO, ArrayExpress, China National Center for Bioinformation and CELLxGENE. Human and mouse | Encoder-only | Ranked by expression and embeddings from prior knowledge | Gene masking for gene ID and expression prediction | Cell type prediction, perturbation prediction (GEARS), GRN inference, drug response prediction and gene embedding |
| Nicheformer | bioRxiv | October 2024 | scRNA-seq and spatial | 110 million | GEO, CosMx, Xenium and MERFISH. Human and mouse | Encoder-only | Ranked by expression | Gene masking for gene rank prediction | Cell embedding and niche prediction |
| SCimilarity | Nature | November 2024 | scRNA-seq | 23.4 million | GEO, CELLxGENE and manual curation | Non-transformer encoder–decoder model | Value projection | Cell similarity and expression reconstruction | Cell type prediction and cell search |
| scPRINT | Nature Communications | April 2025 | scRNA-seq and scATAC-seq | 54 million | CELLxGENE. Human and mouse | Encoder–decoder | Normalized expression, ESM2-based embedding and gene location as position | Denoising, label prediction and expression reconstruction | Cell type prediction, GRN inference and cell embedding |

*Tasks available as built-in functions or tutorials. CITE-seq, Cellular Indexing of Transcriptomes and Epitopes by Sequencing; GRN, gene regulatory network; PPI, protein-protein interaction.
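
The "Input modification" entries in Table 1 name several ways of turning a cell's expression profile into model input. As a minimal sketch of two of them, value binning and ranking by median-normalized expression, the example below may help make the distinction concrete. The gene names, median values, function names and the equal-width binning rule are all assumptions chosen for illustration; none of this is taken from the implementation of any model listed in the table.

```python
from typing import Dict, List


def value_binning(expr: Dict[str, float], n_bins: int = 5) -> Dict[str, int]:
    """Map each expressed gene's value to an integer bin in 1..n_bins.

    Bins are equal-width over the cell's nonzero range; zero counts stay 0.
    This is an illustrative scheme, not the exact procedure of any listed model.
    """
    nonzero = [v for v in expr.values() if v > 0]
    if not nonzero:
        return {gene: 0 for gene in expr}
    lo, hi = min(nonzero), max(nonzero)
    # Fall back to a width of 1.0 when all nonzero values are equal.
    width = (hi - lo) / n_bins or 1.0
    binned = {}
    for gene, v in expr.items():
        if v <= 0:
            binned[gene] = 0
        else:
            # Clamp so the maximum value lands in the top bin.
            binned[gene] = min(int((v - lo) / width) + 1, n_bins)
    return binned


def rank_encoding(expr: Dict[str, float], medians: Dict[str, float]) -> List[str]:
    """Return expressed genes ordered by median-normalized expression, highest first.

    `medians` stands in for per-gene medians computed over a pretraining corpus
    (hypothetical values here). Ties are broken alphabetically by gene name.
    """
    scaled = {g: v / medians[g] for g, v in expr.items() if v > 0}
    return sorted(scaled, key=lambda g: (-scaled[g], g))


# Hypothetical single-cell profile and hypothetical corpus-wide medians.
cell = {"GENE_A": 0.0, "GENE_B": 2.0, "GENE_C": 10.0, "GENE_D": 6.0}
medians = {"GENE_A": 1.0, "GENE_B": 0.5, "GENE_C": 5.0, "GENE_D": 6.0}

bins = value_binning(cell, n_bins=4)
ranks = rank_encoding(cell, medians)
print(bins)
print(ranks)
```

In this toy cell, binning keeps one token per gene and encodes magnitude as a discrete level, whereas rank encoding drops unexpressed genes and encodes magnitude only through position in the sequence. Dividing by the median can also reorder genes: GENE_B has the lowest raw count among expressed genes but ranks first because its assumed median is low.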