Fig. 3: Data augmentation for scRNA-seq cell type classification improves model generalizability. | Nature Communications

From: scTab: Scaling cross-tissue single-cell annotation models

a Illustration of the data augmentation procedure. The difference vector in raw gene space between the same cell type observed in two donors can be used to simulate how the gene expression of that cell type might look for a different donor and thus artificially increase the training data size. b For each input vector to the neural network, an augmentation vector is randomly sampled and added to the original input vector. The augmented vector is then fed into the neural network (the batch dimension is omitted from the sketch for simplicity). c t-SNE visualization of original and augmented data. The augmentation blurs the boundaries between cell types, but the main source of variation (cell type) is preserved. d Effect of augmentation on training and validation loss and on macro F1-score (training data were subset to 4.3 million cells; see Methods). Data augmentation has the desired effect: the training loss increases (a regularizing effect) and the validation loss decreases. The dashed vertical lines indicate how long the models with and without data augmentation are fitted on average (early stopping is based on the macro F1-score). Data are presented as mean values ± 95% CI.
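
The procedure in panels a and b can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `build_augmentation_vectors` and `augment` are hypothetical, and two details are assumptions that the caption does not specify, namely that each difference vector is taken between per-donor mean expression profiles, and that augmented values are clipped at zero so they remain valid raw counts.

```python
import numpy as np


def build_augmentation_vectors(X, cell_types, donors, rng, n_vectors=100):
    """Collect difference vectors between two donors for the same cell type.

    Assumption: the difference is taken between per-donor mean profiles.
    X is a (cells x genes) array of raw counts.
    """
    types = np.unique(cell_types)
    vectors = []
    attempts = 0
    while len(vectors) < n_vectors and attempts < 10 * n_vectors:
        attempts += 1
        ct = rng.choice(types)
        donors_with_ct = np.unique(donors[cell_types == ct])
        if len(donors_with_ct) < 2:
            continue  # this cell type was observed in only one donor
        d1, d2 = rng.choice(donors_with_ct, size=2, replace=False)
        mean1 = X[(cell_types == ct) & (donors == d1)].mean(axis=0)
        mean2 = X[(cell_types == ct) & (donors == d2)].mean(axis=0)
        vectors.append(mean1 - mean2)
    if not vectors:
        raise ValueError("no cell type is shared by at least two donors")
    return np.stack(vectors)


def augment(x_batch, aug_vectors, rng):
    """Add one randomly sampled augmentation vector to each input vector."""
    idx = rng.integers(0, len(aug_vectors), size=len(x_batch))
    # Assumption: clip at zero so augmented values stay non-negative.
    return np.clip(x_batch + aug_vectors[idx], 0.0, None)
```

In a training loop, `augment` would be applied to each mini-batch before the forward pass, so the network sees a differently perturbed version of each cell every epoch while the labels stay unchanged.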
