Figure 1

A new computational pipeline for genome-wide identification of pSNMs without matched control tissue samples. (A) Overall framework of the pipeline, including read pre-processing, genotyping, and filtering. The mosaic identification and filtering steps were implemented in our scripts. (B) The Bayesian genotyper represented as a probabilistic graphical model. Four genotypes were defined: ref-hom for “homozygous for the reference allele”, het for “heterozygous”, alt-hom for “homozygous for the alternative allele”, and mosaic for “mosaic”. The posterior probabilities were inferred from prior and conditional probabilities that were calculated or simulated from known population genetics data and next-generation sequencing data (see Materials and Methods). (C) Simulated behavior of the Bayesian genotyper as sequencing depth and base quality varied. The X axis denotes the alternative allele fraction, and the Y axis denotes the posterior probability of each of the four genotypes. Columns 1 to 4 represent sequencing depths of 20, 40, 80, and 160, respectively, and rows 1 to 3 represent base qualities of 10, 20, and 30, respectively. Increasing sequencing depth improved the power to distinguish mosaic from heterozygous sites, whereas increasing base quality helped to distinguish mosaic from homozygous sites. (D) The power to distinguish mosaic sites from ∼20 000 simulated homozygous and ∼20 000 simulated heterozygous sites by sequentially applying the Bayesian genotyper and each of the ten error filters. This result demonstrates the high specificity of our pipeline in excluding germline sites and shows the relative contributions of the genotyper and the filters.
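
To make the behavior described for panels (B) and (C) concrete, the following is a minimal sketch of a four-genotype Bayesian genotyper. It is not the pipeline's implementation. It assumes a binomial read-sampling model, a symmetric per-base error rate derived from a Phred-scaled base quality, a uniform distribution over the mosaic allele fraction, and placeholder priors. The function names (`log_binom_pmf`, `genotype_posteriors`), the `grid` parameter, and the prior values are all hypothetical; the actual priors and conditional probabilities are those described in Materials and Methods.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def genotype_posteriors(alt_reads, depth, base_quality, priors=None, grid=200):
    """Posterior probability of ref-hom, het, alt-hom and mosaic at one site.

    Assumptions (illustrative only): base_quality is Phred-scaled, errors are
    symmetric between the two alleles, and the mosaic allele fraction is
    uniform on (0, 1).
    """
    if priors is None:
        # Placeholder priors, NOT the values used in the paper.
        priors = {"ref-hom": 0.998, "het": 1e-3, "alt-hom": 5e-4, "mosaic": 5e-4}

    err = 10 ** (-base_quality / 10)  # Phred score -> per-base error rate

    def observed_alt_prob(theta):
        # Chance that a read shows the alternative allele when the true
        # alternative allele fraction is theta and bases are miscalled at rate err.
        return theta * (1 - err) + (1 - theta) * err

    def likelihood(theta):
        return math.exp(log_binom_pmf(alt_reads, depth, observed_alt_prob(theta)))

    lik = {
        "ref-hom": likelihood(0.0),
        "het": likelihood(0.5),
        "alt-hom": likelihood(1.0),
    }
    # Mosaic: average the likelihood over allele fractions (midpoint rule).
    thetas = [(i + 0.5) / grid for i in range(grid)]
    lik["mosaic"] = sum(likelihood(t) for t in thetas) / grid

    joint = {g: priors[g] * lik[g] for g in lik}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}
```

Under these assumptions, calling `genotype_posteriors(7, 20, 30)` and `genotype_posteriors(56, 160, 30)` evaluates the same alternative allele fraction (0.35) at depths of 20 and 160, and calling `genotype_posteriors(6, 160, 10)` and `genotype_posteriors(6, 160, 30)` evaluates the same low-fraction site at base qualities of 10 and 30. These are the two comparisons that panel (C) varies across its columns and rows.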