Extended Data Fig. 10: Comparison of nanopore amplicon (this study) vs. whole genome Illumina sequence data (from the MalariaGEN Pf7 data resource) for describing csp diversity.

a. Allele frequency estimates in Ghana for mutations in the C-terminal region (CTR) of the csp gene. Frequency estimates from the ONT data generated in this study (ONT, n = 196) are very close to those from the Ghanaian samples of the MalariaGEN Pf7 dataset (WGS, n = 1746). This analysis uses all C-terminal mutations observed in both datasets (selecting only samples from Ghana in Pf7) within clonal haplotypes (that is, heterozygous mutation calls in Pf7 samples were discarded for frequency estimation). We also discarded Pf7 samples with missing data for the C-terminal haplotype. For several SNPs, the ONT samples produced a higher non-reference allele frequency (NRAF) estimate than the Pf7 samples. However, we confirmed with Fisher’s exact tests (2-sided) that the frequency differences could be explained by the variance introduced by the smaller ONT sample size. None of the 17 SNPs with NRAF > 5% in the ONT data reached the significance threshold set using Bonferroni correction for multiple comparisons (0.05/17 = 0.0029). The lowest p-value was for K317E (p = 0.0075), and in this context the allele frequency change (0.76 in Pf7 to 0.84 in the ONT data) is unlikely to be meaningful. All other SNPs had p-values > 0.01. b. Non-reference haplotype frequency distributions for the csp CTR in samples from Ghana. We compare the ONT samples from this study (inset; ONT, n = 178) with Ghanaian samples in the Pf7 dataset (WGS, n = 1604), after removing missing, heterozygous and reference haplotypes (that is, haplotypes without any allele difference from the reference). The two distributions have a very similar shape, with a small set of high-frequency haplotypes that quickly decays into a long tail of minor ones. In addition, the first and third top-ranking haplotypes are identical in both datasets.
This figure indicates not only that CTR mutations have very similar frequencies in both datasets, but also that haplotype distribution and composition are alike.
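The per-SNP comparison in panel a can be sketched as a two-sided Fisher's exact test on a 2x2 table of non-reference and reference counts, checked against a Bonferroni-corrected threshold. The sketch below is a minimal standard-library illustration, not the analysis code used for this figure. The function names are hypothetical, and the K317E counts are rough reconstructions from the rounded frequencies and sample sizes quoted above (0.84 of 196 and 0.76 of 1746), so the resulting p-value is not expected to reproduce the reported value exactly.

```python
import math


def _log_comb(n, r):
    """Natural log of the binomial coefficient C(n, r), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_pmf(k):
        # Log-probability of k counts in the top-left cell given fixed margins.
        return (_log_comb(row1, k) + _log_comb(row2, col1 - k)
                - _log_comb(total, col1))

    log_observed = log_pmf(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_value = 0.0
    for k in range(k_min, k_max + 1):
        log_p = log_pmf(k)
        if log_p <= log_observed + 1e-7:  # small tolerance for floating-point ties
            p_value += math.exp(log_p)
    return min(p_value, 1.0)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# ASSUMPTION: illustrative counts reconstructed from rounded frequencies,
# not the actual per-sample calls behind the figure.
ont_nonref, ont_total = 165, 196      # about 0.84 non-reference in ONT
wgs_nonref, wgs_total = 1327, 1746    # about 0.76 non-reference in Pf7

p = fisher_exact_two_sided(
    ont_nonref, ont_total - ont_nonref,
    wgs_nonref, wgs_total - wgs_nonref,
)
threshold = bonferroni_threshold(0.05, 17)  # 17 SNPs with NRAF > 5% in ONT
print(f"p = {p:.4f}, threshold = {threshold:.4f}, significant = {p < threshold}")
```

With real data, the same comparison would be repeated for each of the 17 SNPs, and a difference would be treated as significant only if its p-value fell below the corrected threshold. Where SciPy is available, `scipy.stats.fisher_exact` with `alternative='two-sided'` is a library alternative to the hand-written function.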