Table 4 Outcome from the hard filters utilized in the QC pipeline, at the variant, genotype, and sample levels, for genome-wide biallelic and triallelic sites.

From: Empirical design of a variant quality control pipeline for whole genome sequencing data using replicate discordance

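Variant Level

| Step | Site Removal Criterion | Biallelic, Sequential Filtering: # Pass (% Pass), Variants | Triallelic, Sequential Filtering: # Pass (% Pass), Variants |
|---|---|---|---|
| | Monomorphic | 17,585,919 (100) | 1,536,657 (100) |
| 1 | Missingness ≥ 5% | 17,584,990 (99.99) | 1,536,085 (99.96) |
| 2 | Blacklisted region or LCR | 17,584,990 (100) | 1,536,085 (100) |
| 3 | DP < 25,000 | 17,346,931 (98.65) | 1,345,292 (87.58) |
| 4 | MQ < 58.75 or MQ > 61.25 | 15,971,098 (92.17) | 968,987 (72.03) |
| 5 | InbreedingCoeff < −0.8 | 15,661,311 (98.06) | 949,810 (98.02) |
| 6 | VQSLOD < 7.81 | 14,760,982 (94.25) | 888,194 (93.51) |

Genotype Level

| Step | Genotype Removal Criterion | Biallelic, Sequential Filtering: # Pass (% Pass), Genotypes | Triallelic, Sequential Filtering: # Pass (% Pass), Genotypes |
|---|---|---|---|
| 7 | DP < 10 | 3,819,276,086 (99.96) | 202,424,447 (98.89) |
| 8 | GQ < 20 | 3,800,347,137 (99.50) | 187,956,031 (92.85) |

Sample Level

| Step | Sample Removal Criterion | Biallelic, Sequential Filtering: # Pass (% Pass), Samples | Triallelic, Sequential Filtering: # Pass (% Pass), Samples |
|---|---|---|---|
| 9 | Missingness ≥ 10% | 259 (100) | 193 (74.52) |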

  1. These values were calculated following removal of non-‘PASS’ sites according to GATK HaplotypeCaller. The third and fourth columns report sequential filtering results: only variants that pass a given filter proceed to the next one. If only SNV-SNV triallelic sites are considered, no samples are removed in the triallelic pipeline (missingness remained below 8.5% for all samples).
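
The sequential logic behind the variant-level rows can be illustrated with a minimal Python sketch. It is not the authors' pipeline code: the `Site` class, its field names, and the four example sites are hypothetical, and only the thresholds are taken from the table. Each filter is applied to the survivors of the previous one, and the pass percentage is computed relative to the number of sites entering that step.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical per-site summary; field names are illustrative only.
    monomorphic: bool
    missingness: float         # fraction of samples with a missing genotype
    in_blacklist_or_lcr: bool  # overlaps a blacklisted region or LCR
    dp: int                    # site-level depth
    mq: float                  # mapping quality
    inbreeding_coeff: float
    vqslod: float


# (label, predicate that returns True when the site should be REMOVED).
# Thresholds follow the variant-level rows of the table, in order.
SITE_FILTERS = [
    ("Monomorphic", lambda s: s.monomorphic),
    ("Missingness >= 5%", lambda s: s.missingness >= 0.05),
    ("Blacklisted region or LCR", lambda s: s.in_blacklist_or_lcr),
    ("DP < 25,000", lambda s: s.dp < 25_000),
    ("MQ < 58.75 or MQ > 61.25", lambda s: s.mq < 58.75 or s.mq > 61.25),
    ("InbreedingCoeff < -0.8", lambda s: s.inbreeding_coeff < -0.8),
    ("VQSLOD < 7.81", lambda s: s.vqslod < 7.81),
]


def sequential_filter(sites):
    """Apply each filter to the survivors of the previous one.

    Returns the surviving sites and a list of (label, n_pass, pct_pass),
    where pct_pass is relative to the sites entering that step.
    """
    report = []
    remaining = list(sites)
    for label, should_remove in SITE_FILTERS:
        n_in = len(remaining)
        remaining = [s for s in remaining if not should_remove(s)]
        pct = 100.0 * len(remaining) / n_in if n_in else 0.0
        report.append((label, len(remaining), round(pct, 2)))
    return remaining, report


# Toy input (made-up values): one site passes everything, the others fail
# the DP, MQ, and VQSLOD filters respectively.
example_sites = [
    Site(False, 0.01, False, 30_000, 60.0, 0.0, 10.0),
    Site(False, 0.00, False, 20_000, 60.0, 0.0, 10.0),
    Site(False, 0.02, False, 26_000, 57.0, 0.1, 9.0),
    Site(False, 0.00, False, 40_000, 60.5, -0.1, 5.0),
]

kept, report = sequential_filter(example_sites)
for label, n_pass, pct_pass in report:
    print(f"{label}: {n_pass} ({pct_pass})")
```

Because every percentage is taken against the preceding step, a filter applied late in the sequence can show a high pass rate simply because earlier filters already removed the sites it would otherwise have caught.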