Fig. 1: The framework of FastQDesign.
From: A realistic FastQ-based framework FastQDesign for ScRNA-seq study design issues

Step 1: we prepare both the FastQ reference dataset and the pseudo-design dataset, which is downsampled from the FastQ reference dataset. The reference dataset can come from a publicly available resource, such as GEO. After cellranger alignment, the barcode list and the alignment details (BAM file) are passed to our proposed algorithm, fastF, to generate the pseudo-design dataset. fastF first processes the cell barcodes one at a time, selecting a barcode only if the current random number n is less than the given cell-number subsampling ratio N, until the entire barcode list has been processed. It then processes the BAM file one read at a time and keeps a read only if: i) the current random number r is less than the given read-depth subsampling ratio R, ii) the read is confidently mapped to the transcriptome (i.e., it is not noise), and iii) the read belongs to one of the selected cell barcodes. Note that both n and r are drawn from Uniform(0, 1) by a random number generator (RNG). The filtered reads are then written to an SQLite database, from which the UMI matrix of the pseudo-design dataset is generated. Step 2: we assess the stability of the pseudo-design sample in three aspects, cell clustering, marker genes, and pseudotime, using the adjusted Rand index (ARI), the Jaccard index, and Kendall’s τ, respectively (details are in Methods). We define the similarity as the average of these three indices. We obtain a grid of similarity values by varying the cell number and the read depth, where each dot is the average of 10 repeated measurements. A shape-constrained additive model (SCAM) is fitted to smooth the surface. Step 3: cost-benefit analysis for optimal designs. The color-coded curves represent different flow cell capacities. In particular, the purple curve is the budget function: designs under it are feasible (black), and designs above it are not attainable (grey). The design marked with a diamond exceeds the similarity threshold (red horizontal line) at minimal cost and is therefore the cost-optimal design. The design marked with a star lies within the budget (blue vertical line) and achieves the highest similarity. The designs correspond one-to-one between the two scatter plots.
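
The two-pass subsampling described in Step 1 can be illustrated with a minimal Python sketch. This is not the fastF implementation: it works on in-memory records rather than a BAM file, and the names `Read`, `subsample_barcodes`, `subsample_reads`, and `pseudo_design`, as well as the `confidently_mapped` flag, are hypothetical and introduced only for illustration. The sketch assumes that each barcode and each read receives an independent Uniform(0, 1) draw, as the caption states.

```python
import random
from typing import Iterable, Iterator, List, NamedTuple, Set, Tuple


class Read(NamedTuple):
    """Hypothetical stand-in for one aligned read from the BAM file."""
    barcode: str
    umi: str
    gene: str
    confidently_mapped: bool  # assumed flag: mapped confidently to the transcriptome


def subsample_barcodes(barcodes: Iterable[str], cell_ratio: float,
                       rng: random.Random) -> Set[str]:
    """First pass: keep a barcode when its random number n is below N."""
    return {bc for bc in barcodes if rng.random() < cell_ratio}


def subsample_reads(reads: Iterable[Read], kept_barcodes: Set[str],
                    depth_ratio: float, rng: random.Random) -> Iterator[Read]:
    """Second pass: apply checks i) to iii) to each read in turn."""
    for read in reads:
        if rng.random() >= depth_ratio:        # i) random number r is not below R
            continue
        if not read.confidently_mapped:        # ii) treat the read as noise
            continue
        if read.barcode not in kept_barcodes:  # iii) cell was not selected
            continue
        yield read


def pseudo_design(barcodes: Iterable[str], reads: Iterable[Read],
                  cell_ratio: float, depth_ratio: float,
                  seed: int = 0) -> Tuple[Set[str], List[Read]]:
    """Run both passes with one seeded RNG so the result is reproducible."""
    rng = random.Random(seed)
    kept = subsample_barcodes(barcodes, cell_ratio, rng)
    return kept, list(subsample_reads(reads, kept, depth_ratio, rng))
```

Under these assumptions, calling `pseudo_design(barcodes, reads, cell_ratio=0.5, depth_ratio=0.5)` would keep roughly half of the cells and, among the confidently mapped reads of those cells, roughly half of the reads. The retained reads would then be aggregated into a UMI matrix, which the sketch does not cover.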