Extended Data Fig. 1: Schematic representation of our approach for projecting a neighborhood-based phenotype into an independent dataset to test whether an association replicates.
From: Identifying genetic variants that influence the abundance of cell states in single-cell data

We use a published reference mapping algorithm, Symphony, to project each cell from the replication dataset (blue labels) into the embedding that was used to construct the nearest-neighbor graph of the discovery dataset (orange labels). For each replication dataset cell, we store its distances to the 15 nearest discovery dataset cells; these represent the seed weights of this replication dataset cell in the discovery dataset neighborhoods, of which there is one per discovery dataset cell. Diffusing these seed weights through the nearest-neighbor graph yields the fractional membership of each replication dataset cell in every discovery dataset neighborhood. For each replication dataset sample, combining the neighborhood memberships of all cells in the sample yields the fractional abundance of that sample across discovery dataset neighborhoods. Stacking these per-sample vectors row-wise produces an estimated Neighborhood Abundance Matrix (NAM) that contains the distribution of each replication dataset sample across discovery dataset neighborhoods. We then use the stored products of the singular value decomposition (SVD) of the discovery dataset NAM to obtain loadings for each replication dataset sample on the discovery dataset NAM-PCs, as shown. Combining these loadings with the fitted coefficients that define the phenotype in the discovery dataset produces an estimated phenotype value per replication dataset sample, which can be used to test for association with the allele of interest (or with case-control status) while controlling for relevant covariates.
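
The steps in this caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names (`knn_seed_weights`, `diffuse`, `sample_abundance`, `project_phenotype`) are hypothetical. The Symphony mapping step is not shown: `rep_emb` is assumed to already hold replication cells in the discovery embedding. The caption does not say how stored distances become seed weights, how many diffusion steps are used, or how the NAM is normalized, so the sketch assumes uniform weights over the 15 nearest cells, a simple random walk with self-loops, and column centering with the discovery dataset means.

```python
import numpy as np

def knn_seed_weights(rep_emb, disc_emb, k=15):
    """Seed weights of each replication cell on its k nearest discovery cells."""
    # Pairwise squared Euclidean distances, shape (n_rep, n_disc).
    d2 = ((rep_emb[:, None, :] - disc_emb[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    seeds = np.zeros_like(d2)
    rows = np.arange(rep_emb.shape[0])[:, None]
    # ASSUMPTION: uniform weights; the caption only says distances are stored.
    seeds[rows, nn] = 1.0 / k
    return seeds

def diffuse(seeds, adjacency, n_steps=3):
    """Spread seed weights over the discovery nearest-neighbor graph."""
    # ASSUMPTION: random walk with self-loops; n_steps is illustrative.
    a = adjacency + np.eye(adjacency.shape[0])
    p = a / a.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    membership = seeds.copy()
    for _ in range(n_steps):
        membership = membership @ p
    return membership  # one row per replication cell, rows still sum to 1

def sample_abundance(membership, sample_ids):
    """Combine per-cell memberships into one abundance row per sample."""
    samples = np.unique(sample_ids)
    nam = np.vstack([membership[sample_ids == s].sum(axis=0) for s in samples])
    nam = nam / nam.sum(axis=1, keepdims=True)  # fractional abundance
    return samples, nam

def project_phenotype(nam_rep, col_means, V, svals, beta):
    """Project replication samples onto discovery NAM-PCs, then apply beta.

    ASSUMPTION: the discovery NAM was column-centered and decomposed as
    U @ diag(svals) @ V.T, so sample loadings are (X - col_means) @ V / svals.
    """
    loadings = (nam_rep - col_means) @ V / svals
    return loadings, loadings @ beta
```

Under these assumptions, passing the discovery NAM itself to `project_phenotype` recovers its own left singular vectors, which is a useful sanity check before applying the projection to replication samples. The resulting per-sample phenotype values would then enter an ordinary regression on genotype (or case-control status) with covariates.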