Extended Data Fig. 2: Downstream performance across different tissues of Nicheformer models trained on different subsets of the data splitting by modality. | Nature Methods

From: Nicheformer: a foundation model for single-cell and spatial omics

A) Shown are the F1 scores for niche classification in the CosMx human liver (top left) and lung (top right) datasets and for cell type classification in MERFISH mouse brain (bottom right), together with the MSE for niche regression in MERFISH mouse brain (bottom left), obtained by models trained on different data subsets. The results demonstrate a clear advantage of training on spatial data over dissociated data. A model trained on just 1% of the spatial data significantly outperforms models trained on the same amount, or even three times the amount, of dissociated data, reinforcing the fundamental difference between these modalities. This suggests that no amount of dissociated data can fully compensate for the missing spatial context when models are evaluated on spatial tasks. Additionally, computational efficiency plays a crucial role: the model trained on the smaller dissociated subset (1%) performs better than the one trained on the larger subset (3%) because both were trained for the same duration, which results in more updates per sample for the smaller dataset. Furthermore, stratified training offers advantages only in specific cases, such as the liver. This can be explained by the distribution of tissue types in the random subset, since some tissues are overrepresented in SpatialCorpus-110M. For example, brain cells are more abundant in the random subset than in the stratified one, potentially influencing performance. The results remain statistically significant after FDR correction.
B) Shown are the F1 score curves of two models trained on different modalities, spatial and dissociated, respectively. Both models have the same number of parameters and were trained for the same amount of time. The task is performed by linear probing. The model trained on MERFISH data notably outperforms the model trained on RNA-seq data, highlighting a significant distribution shift between technologies.
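
The "more updates per sample" argument in panel A can be illustrated with simple arithmetic. The sketch below assumes that "same duration" translates into the same number of optimizer steps at the same batch size, which the caption does not state. The function name `passes_per_sample` and all numbers are hypothetical and are not taken from the paper.

```python
def passes_per_sample(num_steps: int, batch_size: int, dataset_size: int) -> float:
    """Average number of times each sample is seen under a fixed step budget."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return num_steps * batch_size / dataset_size


# Hypothetical values for illustration only (not from the paper).
NUM_STEPS = 10_000        # assumed identical for both runs
BATCH_SIZE = 512          # assumed identical for both runs
SMALL_SUBSET = 1_000_000  # stands in for the 1% subset
LARGE_SUBSET = 3_000_000  # stands in for the 3% subset

small = passes_per_sample(NUM_STEPS, BATCH_SIZE, SMALL_SUBSET)
large = passes_per_sample(NUM_STEPS, BATCH_SIZE, LARGE_SUBSET)

print(f"1% subset: {small:.2f} passes per sample")
print(f"3% subset: {large:.2f} passes per sample")
print(f"ratio: {small / large:.1f}x")
```

Under these assumptions, a subset three times smaller receives three times as many passes per sample, whatever the absolute step count is.
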
C) Shown are the F1 scores for niche classification in the CosMx human liver (top left) and lung (top right) datasets and for cell type classification in MERFISH mouse brain (bottom right), together with the MSE for niche regression in MERFISH mouse brain (bottom left), obtained by models trained on different data subsets. As in the previous data split test, a training distribution with broad coverage is necessary to achieve good performance across a variety of scenarios. In this case, models trained exclusively on mouse data underperform in downstream tasks based on human data (top row), and models trained exclusively on human data underperform in downstream tasks based on mouse data (bottom row). A model trained on a combination of mouse and human data performs on par in both cases. The results remain statistically significant after FDR correction.
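
Panels A and C both report significance after FDR correction, but the caption does not name the procedure. As a minimal sketch, the example below assumes the Benjamini-Hochberg method, which is a common choice and may not be the one the authors used. The helper `benjamini_hochberg` and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from five model comparisons (not from the paper).
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adj]

for p, q, s in zip(raw, adj, significant):
    print(f"raw={p:.3f}  adjusted={q:.5f}  significant={s}")
```

In this made-up example, two comparisons with raw p-values just under 0.05 lose significance after adjustment, which is why reporting post-correction significance is the stronger statement.
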
