Fig. 1: Microhaplotype discovery pipeline.

Panel (a) provides an overview of the marker selection process. The criteria applied to samples, variants, and windows yielded a total of 5,460 candidate microhaplotype windows (200 bp each). The MalariaGEN Pv4 dataset was filtered to retain only high-quality monoclonal samples (Fws ≥ 0.95) in which at least 50% of the core genome positions were callable. SNPs in this sample subset were retained if they were biallelic, had low genotype missingness (<0.1), had a high global minor allele frequency (MAF ≥ 0.1), and had FILTER = PASS in the MalariaGEN dataset, resulting in 13,084 SNPs in total. The coding regions of the core genome were then scanned for all 200 bp windows containing > 3 of the identified variants, and these windows were filtered for high diversity (global heterozygosity ≥ 0.5). Panel (b) provides a schematic representation of microhaplotypes with 3 SNPs. Microhaplotypes combine the information content of SNPs within short genomic windows to provide a high-resolution reconstruction of the parasite genome. Three high-diversity biallelic SNPs in a single microhaplotype can form as many as 8 (2^3) distinct allele combinations which, when combined across 100 or more microhaplotypes genome-wide, provide high discriminatory power to characterise relatedness. Panel (c) shows a Manhattan plot of the P. vivax windows identified across the genome, giving the chromosomal distribution of all windows with at least 1 SNP identified in the global set of high-quality, independent monoclonal infections. Each point is a window, with point size increasing with the number of SNPs in the window. Potential microhaplotype regions are well distributed across the 14 chromosomes. The microhaplotypes with the highest SNP densities tend to be located towards the ends of the chromosomes.
Note that microhaplotypes were selected only from the core regions of the genome, i.e., excluding the highly diverse telomeric, sub-telomeric and centromeric regions where sequence reads could not be mapped accurately (see the Pv4 data resource for further details on core regions).
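
The window scan and diversity filter described in the caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`max_haplotypes`, `expected_heterozygosity`, `candidate_windows`) are hypothetical, the windows are assumed to be non-overlapping 200 bp tiles, and heterozygosity is assumed to be computed as 1 minus the sum of squared haplotype frequencies across monoclonal samples. The default `min_snps=4` encodes the "> 3" threshold literally as written in the caption.

```python
from collections import Counter, defaultdict

WINDOW_BP = 200  # window size stated in the caption


def max_haplotypes(n_snps, alleles_per_snp=2):
    """Upper bound on distinct allele combinations in one window."""
    return alleles_per_snp ** n_snps


def expected_heterozygosity(haplotypes):
    """Assumed diversity measure: 1 - sum(p_i^2) over haplotype frequencies."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def candidate_windows(snp_positions, genotypes,
                      window_bp=WINDOW_BP, min_snps=4, min_het=0.5):
    """Return (start, end, n_snps, heterozygosity) for windows passing both filters.

    snp_positions: 1-based positions of already-filtered SNPs on one chromosome.
    genotypes: dict mapping position -> list of alleles, one per monoclonal
               sample, in the same sample order for every position.
    Windows are non-overlapping tiles (an assumption of this sketch).
    """
    # Assign each SNP to its tile.
    bins = defaultdict(list)
    for pos in sorted(snp_positions):
        bins[(pos - 1) // window_bp].append(pos)

    passing = []
    for idx in sorted(bins):
        positions = bins[idx]
        if len(positions) < min_snps:
            continue
        # Build one haplotype string per sample by joining its alleles
        # across the SNPs in this window.
        haps = ["".join(alleles)
                for alleles in zip(*(genotypes[p] for p in positions))]
        het = expected_heterozygosity(haps)
        if het >= min_het:
            passing.append((idx * window_bp + 1, (idx + 1) * window_bp,
                            len(positions), het))
    return passing


if __name__ == "__main__":
    # Toy data: five SNPs, four monoclonal samples (illustrative only).
    toy_positions = [10, 50, 120, 190, 450]
    toy_genotypes = {
        10: ["A", "A", "G", "G"],
        50: ["C", "T", "C", "T"],
        120: ["A", "A", "A", "G"],
        190: ["T", "C", "T", "C"],
        450: ["A", "G", "A", "G"],
    }
    print(max_haplotypes(3))
    print(candidate_windows(toy_positions, toy_genotypes))
```

In the toy data, only the first tile (positions 1 to 200) holds enough SNPs to pass the count filter; the single SNP at position 450 falls in a tile that is discarded. A sliding-window scan or a sample-size-corrected heterozygosity estimator would be equally plausible readings of the caption and would change the set of windows returned.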