Extended Data Fig. 3: Multi-hop integration of PBMCs using StabMap.
From: Stabilized mosaic single-cell data integration using unshared features

a. Number of cells present in the CYTOF, ECCITE-Seq and Multiome PBMC datasets. b. UpSet plot of features shared among datasets. For example, 7 proteins are measured in both the CYTOF and ECCITE-Seq datasets, and gene expression is measured for 154 genes in both the Multiome and ECCITE-Seq datasets, while all other protein, RNA and chromatin accessibility features are distinct. c. Mosaic data topology of these datasets. The ECCITE-Seq dataset shares features with both the CYTOF and Multiome datasets, but there are no shared features between the CYTOF and Multiome datasets. d. Joint UMAP embeddings of multi-hop StabMap performed with reference dataset Multiome (left column) and both CYTOF and Multiome (right column), coloured by data modality (top row) and broad cell type (bottom row). e. Violin plots of LISI values among CYTOF and Multiome cells for the three embeddings as in panel d. LISI values are calculated with reference to broad cell type (left), where low values are more desirable, and with reference to modality (right), where high values are more desirable. Overall, we observe more desirable mixing of cells when using the CYTOF dataset as the reference for this scenario. f. Line plots indicating the preservation of biological signal across several steps of multi-hop mosaic data integration. Cells were randomly selected from the Mouse Gastrulation Dataset and split into 8 distinct datasets (x-axis), with varying numbers of total cells per dataset (n = 500, 1,000 or 2,000; panels). Then, varying numbers of features (n = 100, 200, 500 or 1,000; lines in each plot) were retained from among the HVGs such that there was approximately 50% overlap of features between datasets 1 and 2, 2 and 3, and so on. As a result, any one dataset shares features only with its neighbouring datasets, representing an extreme task for multi-hop mosaic data integration.
To assess quality, cell type accuracy was calculated with dataset 1 as the reference (y-axis). We observe some decrease in mapping quality as the number of intermediate datasets increases, especially when fewer features are used. Ribbons represent 95% confidence intervals of the generalised additive model smoothed curves.
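
The chained feature design described for panel f can be illustrated with a minimal sketch. This is not the authors' simulation code: the function name `chained_feature_sets`, the sliding-window construction and the placeholder gene names are all assumptions made for illustration. The sketch shuffles a list of HVGs and gives each dataset a window of features offset by half a window from the previous one, so neighbouring datasets share 50% of their features and non-neighbouring datasets share none.

```python
import random


def chained_feature_sets(hvgs, n_datasets=8, n_features=100, seed=0):
    """Hypothetical helper: give each dataset a window of features so that
    only neighbouring datasets overlap (by about half of their features)."""
    step = n_features // 2  # each window starts half a window after the previous one
    needed = step * (n_datasets - 1) + n_features
    if len(hvgs) < needed:
        raise ValueError(f"need at least {needed} features, got {len(hvgs)}")
    rng = random.Random(seed)
    shuffled = list(hvgs)
    rng.shuffle(shuffled)  # random order, so windows are random feature subsets
    return [
        set(shuffled[i * step : i * step + n_features])
        for i in range(n_datasets)
    ]


# Placeholder gene names standing in for a real HVG list (assumption).
hvgs = [f"gene_{i}" for i in range(1000)]
sets = chained_feature_sets(hvgs, n_datasets=8, n_features=100)

# Neighbouring datasets share half of their features.
for i in range(len(sets) - 1):
    shared = len(sets[i] & sets[i + 1])
    print(f"datasets {i + 1} and {i + 2}: {shared} shared features")

# Datasets two steps apart share nothing, so mapping between them needs multiple hops.
print("datasets 1 and 3:", len(sets[0] & sets[2]), "shared features")
```

Under this construction, information can only reach dataset 8 from dataset 1 by passing through every intermediate dataset, which is what makes the task an extreme case for multi-hop integration.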