Fig. 1: Overview of scFMs.
From: Single-cell foundation models: bringing artificial intelligence into cell biology

a, Single-cell datasets are aggregated from public databases, repositories and manual searches. These datasets are processed to serve as pretraining data for scFMs, and a subset or a new dataset is selected to adapt the models to specific contexts or conditions. The example uniform manifold approximation and projection (UMAP) plots are from the Allen Brain Cell Atlas. b, Each cell in the pretraining datasets is converted into a model input by tokenizing its gene expression values or other measured profiles, with special tokens optionally added to denote particular cell characteristics. c, The model is built from multiple deep neural network layers, typically arranged in a transformer architecture. d, Pretraining is carried out using tasks based on, for example, masking strategies or contrastive learning. e, After the initial pretraining, the model can be fine-tuned and further adapted via continual pretraining or transfer learning to meet specialized purposes or contexts. f, User inputs are fed into the scFM to extract cell embeddings, gene embeddings and generated feature profiles, which can be used for a variety of downstream tasks.
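
Panels b and d can be made concrete with a minimal sketch. It assumes one possible tokenization scheme (ordering a cell's expressed genes by decreasing count) and one possible masking task (hiding a random subset of gene tokens for the model to recover). The caption does not prescribe either choice. The gene list, the vocabulary, the `IGNORE` sentinel and the function names `tokenize_cell` and `mask_tokens` are all hypothetical and serve only as illustration.

```python
import random

# Hypothetical toy vocabulary: special tokens first, then gene tokens.
SPECIAL = {"<pad>": 0, "<mask>": 1, "<cls>": 2}
GENES = ["CD3E", "MS4A1", "LYZ", "NKG7", "GNLY"]
VOCAB = {**SPECIAL, **{g: i + len(SPECIAL) for i, g in enumerate(GENES)}}
IGNORE = -100  # assumed sentinel for positions that carry no training target


def tokenize_cell(counts, max_len=6):
    """Turn one cell's counts (aligned with GENES) into a fixed-length token list.

    Assumed scheme: keep expressed genes, order them by decreasing count,
    prepend a <cls> special token, then truncate or pad to max_len.
    """
    expressed = [i for i, c in enumerate(counts) if c > 0]
    expressed.sort(key=lambda i: -counts[i])  # stable sort: ties keep gene order
    ids = [VOCAB["<cls>"]] + [VOCAB[GENES[i]] for i in expressed]
    ids = ids[:max_len]
    return ids + [VOCAB["<pad>"]] * (max_len - len(ids))


def mask_tokens(ids, mask_prob=0.15, seed=0):
    """Replace a random subset of gene tokens with <mask>.

    Returns (masked_ids, labels); labels hold the original token at masked
    positions and IGNORE elsewhere. Special tokens are never masked.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in ids:
        if tok >= len(SPECIAL) and rng.random() < mask_prob:
            masked.append(VOCAB["<mask>"])
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE)
    return masked, labels


if __name__ == "__main__":
    cell = [0, 5, 2, 0, 9]  # toy counts for CD3E, MS4A1, LYZ, NKG7, GNLY
    tokens = tokenize_cell(cell)
    print(tokens)
    print(mask_tokens(tokens, mask_prob=0.5))
```

In an actual scFM the masked sequence would be passed through the transformer layers of panel c, and the loss would be computed only at positions whose label is not `IGNORE`. Other schemes, such as binning expression values or pairing each gene token with a value embedding, fit the same caption equally well.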