Lineage-informative microhaplotypes for recurrence classification and spatio-temporal surveillance of Plasmodium vivax malaria parasites

Siegel, Sasha V.; Trimarsanto, Hidayat; Amato, Roberto; Murie, Kathryn; Taylor, Aimee R.; Sutanto, Edwin; Kleinecke, Mariana; Whitton, Georgia; Watson, James A.; Imwong, Mallika; Assefa, Ashenafi; Rahim, Awab Ghulam; Nguyen, Hoang Chau; Tran, Tinh Hien; Green, Justin A.; Koh, Gavin C. K. W.; White, Nicholas J.; Day, Nicholas; Kwiatkowski, Dominic P.; Rayner, Julian C.; Price, Ric N.; Auburn, Sarah

doi:10.1038/s41467-024-51015-3

Article
Open access
Published: 08 August 2024

Nature Communications volume 15, Article number: 6757 (2024)

Abstract

Challenges in classifying recurrent Plasmodium vivax infections constrain surveillance of antimalarial efficacy and transmission. Recurrent infections may arise from activation of dormant liver stages (relapse), blood-stage treatment failure (recrudescence) or reinfection. Molecular inference of familial relatedness (identity-by-descent or IBD) can help resolve the probable origin of recurrences. As whole genome sequencing of P. vivax remains challenging, targeted genotyping methods are needed for scalability. We describe a P. vivax marker discovery framework to identify and select panels of microhaplotypes (multi-allelic markers within small, amplifiable segments of the genome) that can accurately capture IBD. We evaluate panels of 50–250 microhaplotypes discovered in a global set of 615 P. vivax genomes. A candidate global 100-microhaplotype panel exhibits high marker diversity in the Asia-Pacific, Latin America and horn of Africa (median H_E = 0.70–0.81) and identifies 89% of the polyclonal infections detected with genome-wide datasets. Data simulations reveal lower error in estimating pairwise IBD using microhaplotypes relative to traditional biallelic SNP barcodes. The candidate global panel also exhibits high accuracy in predicting geographic origin and captures local infection outbreak and bottlenecking events. Our framework is open-source enabling customised microhaplotype discovery and selection, with potential for porting to other species or data resources.

Introduction

The malaria parasite Plasmodium vivax is a major public health threat affecting the poorest and most vulnerable populations in more than 49 endemic countries¹. Over the past decade, enhanced malaria control efforts in areas outside of sub-Saharan Africa have achieved a marked decline in P. falciparum infections, but a relative rise in the proportion of P. vivax cases¹. Several biological attributes of P. vivax render this species more resilient to interventions than P. falciparum². P. vivax forms dormant liver stage parasites (hypnozoites) that can reactivate weeks to months after initial inoculation, causing recurrent episodes of malaria (relapses). A single mosquito inoculation can cause multiple relapses, sustaining transmission over extended periods. Relapses are thought to cause over 60% of clinical cases³. Knowledge of the biology and epidemiology of P. vivax relapse, including the underlying reactivation mechanism(s) and host and parasite determinants, is vital for informing public health strategies to combat this parasite, but our understanding of these processes is limited⁴. A major obstacle to improving this understanding has been the difficulty of classifying recurrent P. vivax infections accurately. These recurrences can arise from reinfection (new mosquito inoculations), recrudescence (blood-stage treatment failures), or relapse (reactivation of dormant liver stages). Discriminating between these causes is challenging¹.

Accurate methods to disentangle relapse from reinfection and recrudescence events are critical to improving our understanding of the therapeutic efficacy of current treatment regimens for P. vivax. Accumulating reports suggest that chloroquine, the most widely used drug for treating the P. vivax blood stages, is failing in several endemic regions, but recrudescence (drug failure) can be confused with relapse, confounding efficacy studies⁵. Accurate diagnosis of the cause of recurrent infection is also essential to clinical studies of hypnozoiticidal (anti-relapse) regimens (primaquine and tafenoquine) as this relies on distinguishing relapse from reinfection^6,7. Reinfection dilutes observed differences in recurrence between hypnozoiticidal interventions and biases treatment effect estimates towards interventions which provide longer post-treatment prophylaxis⁸.

The ability to discriminate between recurrent P. falciparum infections deriving from reinfections (which are likely genetically heterologous to the initial infection given sufficient population diversity) and recrudescences (which contain homologous parasites) using genotyping at a handful of polymorphic markers represented a crucial advance for P. falciparum clinical research⁹. It allowed therapeutic efficacy studies to be conducted in endemic areas without the need for concomitant detailed epidemiological assessment. However, the situation is more complex for P. vivax malaria since relapsing parasites can be genetically homologous or heterologous to the initial infection^10,11. Moreover, recent genomic studies have revealed that a proportion of paired P. vivax isolates (from acute and recurrent infections) that would be classified as heterologous using traditional genotyping approaches share homology in large segments of the genome, implying familial relatedness^12,13,14. In mosquito hosts capturing blood meals with mixtures of parasite genomes, the obligate meiotic stage will generate meiotic recombinants, producing sporozoites that share parents. Every natural infection deriving from more than one sporozoite may comprise mixtures of meiotic recombinants. Pairs of infections with evidence of recent identity by descent (IBD) consistent with close relationships such as siblings or half-siblings (≥50% and ≥25% IBD, respectively) are more likely to have derived from the same mosquito inoculation than from different inoculations and are, therefore, more likely to reflect relapse than reinfection events. IBD therefore has significant potential to enable more accurate classification of recurrent P. vivax infections. IBD-based measures would also allow finer resolution of the spatial connectivity between P. vivax populations than is possible using classical methods such as the fixation index or phylogenetic approaches¹⁵. With appropriate genetic data, IBD can be used to characterise major transmission networks (foci) for targeted intervention, or the risks of infection spread between communities and across borders.

While genomic data provides the greatest information to estimate IBD, P. vivax patient isolates often have low parasite densities which currently precludes whole genome sequencing at large scale, even when using selective whole genome amplification methods¹⁶. Restricting analyses to infections with high parasite densities is not ideal as these are atypical of P. vivax infection and hence may not be representative of the true diversity in patient infections. High-throughput genotyping offers a more operationally feasible approach that can be applied to low-volume samples such as dried blood spots and would be more readily implemented in surveillance frameworks in malaria-endemic countries^17,18,19,20. However, selecting parsimonious marker sets that can capture genome wide IBD effectively is challenging²¹. To date, genotyping methods for P. vivax have relied on either capillary sequencing of microsatellite markers or next generation sequencing (NGS) of Single Nucleotide Polymorphisms (SNPs)^2,22. In a recent P. falciparum study, targeted NGS of short regions comprising multiple highly variable SNPs (microhaplotypes) provided a simpler, cheaper, and higher-throughput approach than microsatellite typing, with substantially higher resolution of individual clones than with single SNPs²³.

In this work, we establish a framework for selecting and evaluating the effectiveness of universal P. vivax IBD barcodes that can be used to improve the interpretation of therapeutic evaluations of hypnozoiticidal and schizonticidal antimalarial drugs, elucidate the biology and epidemiology of P. vivax relapses, and provide national malaria control programmes (NMCPs) and other agencies with actionable information on parasite transmission within and across national borders. Using 615 publicly available P. vivax genomes, we identify panels of microhaplotype markers that can estimate IBD across pairs of infections in diverse parasite populations. In silico analyses confirm that P. vivax microhaplotypes can estimate close relationships with much higher resolution than the biallelic SNP barcodes that are currently used. We demonstrate that carefully selected panels for IBD characterisation can not only resolve P. vivax relatedness relationships within patients, but also have broad utility for spatiotemporal use cases.

Results

Microhaplotype marker discovery framework and selection

We created a microhaplotype marker discovery framework to identify and select panels of microhaplotypes with high accuracy in capturing IBD and potential to be incorporated into amplicon-based sequencing assays. The primary marker selection criteria were: i) high diversity and preferably multiallelic; and ii) even spacing throughout the genome. Variable panel sizes can be selected but a minimum of 100 microhaplotypes is recommended. The framework has been built as a series of Jupyter notebooks in Python and was implemented on the MalariaGEN Pv4 dataset. The notebooks are directly connected to the Pv4 data resource on Google Colab, enabling users to discover and analyse their own marker panels without downloading data or files. The discovery framework walks users through the entire microhaplotype selection process and is customisable to different sample sets, marker properties and panel sizes, with embedded exploratory benchmarking analysis for marker informativeness (https://svsiegel.github.io/vivax-mhaps/; https://doi.org/10.5281/zenodo.12622789).

The first step of the discovery framework (notebook 1: svsiegel/vivax-mhaps) entails sample quality filtering; an initial set of 1,816 isolates in the MalariaGEN Pv4 dataset were filtered to 615 high quality, likely monoclonal samples (Fig. 1a, Supplementary Data 1)²⁴. In addition to country classifications, the MalariaGEN Pv4 samples were assigned regional classifications based on genetic clustering patterns²⁴. Although most parasites originated from the Asia-Pacific region, the dataset exhibited global geographic representation, spanning 17 countries, with all regional populations represented by at least 30 isolates (range n = 31–151) (Table 1, Supplementary Fig. 1). The filtered sample subset was used to select Single Nucleotide Polymorphisms (SNPs) and identify candidate microhaplotypes within 200 bp regions of the P. vivax genome (Fig. 1b) with high diversity in all geographic regions selected.

**Fig. 1: Microhaplotype discovery pipeline.**

Table 1 Geographic distribution of the sample set

A total of 13,084 candidate high-quality biallelic SNPs with global high minor allele frequency (MAF ≥ 0.1) and low genotype failure rate (missingness <0.1) were identified in core, coding regions of the genome to minimise potential primer design challenges (Fig. 1a). A windowed scan was conducted in partially overlapping windows across all identified variants of interest (13,498 windows found). As illustrated in Fig. 1c, windows were relatively uniformly distributed across the genome. We then retained only windows that contained 3 or more SNPs and whose heterozygosity exceeded 0.5, the theoretical maximum for a single biallelic SNP. Together, this framework identified a total of 3,830 microhaplotype candidate heterozygome windows within the core genome.
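The window-level filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the notebooks: the function names `haplotype_heterozygosity` and `scan_windows`, the input layout, and the restriction to monoclonal samples without missing calls are all hypothetical simplifications. The 200 bp window length, the requirement for 3 or more SNPs, and the heterozygosity threshold of 0.5 are taken from the text.

```python
from collections import Counter


def haplotype_heterozygosity(haplotypes):
    """Expected heterozygosity, 1 - sum(p_i^2), over distinct haplotypes."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def scan_windows(positions, genotypes, window=200, min_snps=3, min_het=0.5):
    """Return (start, end, n_snps, het) for windows passing both filters.

    positions: sorted SNP coordinates on one chromosome.
    genotypes: one list of 0/1 alleles per sample, aligned with positions
               (assumes monoclonal samples and no missing calls).
    A window is opened at every SNP, so consecutive windows partially overlap.
    """
    hits = []
    for i, start in enumerate(positions):
        # Indices of all SNPs falling inside the window opened at this SNP.
        idx = [j for j in range(i, len(positions)) if positions[j] < start + window]
        if len(idx) < min_snps:
            continue
        # Each sample's microhaplotype is its allele string across the window.
        haps = [tuple(sample[j] for j in idx) for sample in genotypes]
        het = haplotype_heterozygosity(haps)
        if het > min_het:
            hits.append((start, positions[idx[-1]], len(idx), het))
    return hits
```

Because a microhaplotype combines several SNPs, its heterozygosity can exceed the 0.5 ceiling of any single biallelic SNP, which is why that value serves as a natural cut-off.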

The first step in panel selection (notebook 2: svsiegel/vivax-mhaps) divides the total length of the 14 chromosomes by the number of desired markers (default 100) to assign the proportional number of markers needed per chromosome. The framework then provides user-defined options to generate panels prioritised for either uniform spacing (see notebook 2 evenly spaced algorithm), or for heterozygosity (see notebook 2 greedy algorithm) (Fig. 2a). Candidate panels can also be manually refined to balance both heterozygosity and marker spacing (Fig. 2b).
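As a rough sketch of the allocation step described above, the following shows one way to split a marker budget across chromosomes in proportion to their length, followed by a simple heterozygosity-first pick within each chromosome. The function names, the largest-remainder rounding rule and the candidate tuple layout are assumptions made for illustration; the notebook algorithms may differ, and this sketch does not enforce any spacing constraint.

```python
def allocate_markers(chrom_lengths, n_markers=100):
    """Split n_markers across chromosomes in proportion to their length.

    Largest-remainder rounding keeps the total equal to n_markers
    (the rounding rule is an assumption, not taken from the notebooks).
    """
    total = sum(chrom_lengths.values())
    quotas = {c: n_markers * length / total for c, length in chrom_lengths.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = n_markers - sum(counts.values())
    # Hand the remaining markers to the chromosomes with the largest remainders.
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1
    return counts


def greedy_pick(candidates, counts):
    """Pick the highest-heterozygosity candidates on each chromosome.

    candidates: list of (chrom, start, heterozygosity) tuples.
    counts: markers wanted per chromosome, e.g. from allocate_markers.
    Selected markers are returned in genomic order within each chromosome.
    """
    panel = []
    for chrom, k in counts.items():
        on_chrom = [c for c in candidates if c[0] == chrom]
        on_chrom.sort(key=lambda c: c[2], reverse=True)
        panel.extend(sorted(on_chrom[:k], key=lambda c: c[1]))
    return panel
```

A heterozygosity-first pick of this kind can cluster markers in diverse regions, which is one reason a manual refinement step that balances diversity against spacing is useful.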

**Fig. 2: Marker property outputs within the discovery pipeline.**

For evaluation of random versus high-diversity optimisation, we created two comparative 100-microhaplotype panels, both with 3–10 SNPs, global heterozygosity >0.6, and final marker selection curated manually. The exemplar panel contained well-spaced, high heterozygosity markers (high-diversity panel), while the comparator panel had well-spaced markers without heterozygosity optimisation (random panel) (Fig. 2c). The rationale for selecting 3–10 SNPs was based on diminishing returns observed above 10 SNPs (see exploratory analyses - notebook 2) and on retaining a computationally manageable number of allele combinations (10 biallelic SNPs give 2¹⁰ = 1,024 possible combinations).

The high-diversity and random 100-microhaplotype panels were also benchmarked against a previously developed 42-SNP biallelic panel (Broad barcode) that has been widely used by the vivax community²⁵. Four SNPs within the Broad barcode had to be excluded from analyses because of high genotype failure rates or multi-allelic status in the Pv4 dataset, leaving a 38-SNP Broad barcode for evaluation (Fig. 2c).

For evaluation of panel size, we also created five panels with 50, 100, 150, 200 and 250 microhaplotypes using the greedy algorithm with roughly even spacing. Panels were chosen to contain 3–10 candidate SNPs and were optimised for highest heterozygosity. Details on the coordinates and SNPs covered by all microhaplotype panels are provided in Supplementary Data 2. A full description of the optimal selection criteria identified and demonstrated in this manuscript can be found in notebooks 1 and 2 (svsiegel/vivax-mhaps), which are directly executable via GitHub Pages (https://svsiegel.github.io/vivax-mhaps/).

Evaluation of randomly selected versus high-diversity microhaplotype panels on IBD estimation using simulated data

In each geographic region, the 95% confidence intervals (CIs) around the estimate \(\hat{r}\) were calculated as a measure of the informativeness of a given marker panel (Fig. 3). In all geographic regions and for all marker panels, the CIs were narrowest when r = 1 (identical infections), closely followed by r = 0 (strangers), and widest when r = 0.5 (siblings), indicating lowest accuracy in predicting sibling relationships. In all geographic regions, the CIs around the estimate \(\hat{r}\) were narrowest for the high-diversity microhaplotype panel, followed by the random microhaplotype panel and then the 38-SNP Broad barcode (Fig. 3). For the high-diversity panel, all (100%) of the estimates of r = 0.5 had 95% CIs between 0.1 and 0.8. Whilst the random microhaplotype panel followed similar trends to the high-diversity panel in each geographic region, the 38-SNP Broad barcode displayed particularly wide CIs relative to the microhaplotype panels in Maritime Southeast Asia and Oceania. The same trends in marker panel performance were observed using the root mean squared error (RMSE) of the estimates \(\hat{r}\) compared to the data-generating r (Supplementary Fig. 2). In all regions, the high-diversity microhaplotype panel exhibited consistently lower RMSE values than the random microhaplotype panel and the 38-SNP Broad barcode. Importantly, the high-diversity panel was the only panel able to consistently keep RMSE values below 0.1 in all subpopulations (even when r = 0.5), showing that selecting the highest-heterozygosity microhaplotypes outperformed random selection in relatedness estimation. These results indicate that panels of ~100 uniformly spaced, high heterozygosity microhaplotypes provide more accurate estimation of IBD than the current single-SNP panel used for P. vivax infections; this advantage may be more pronounced in certain geographic regions, possibly reflecting the geographic representation informing SNP selection. All downstream analyses focused on the 100 marker panels for evaluation of capturing IBD, including the high-diversity and random microhaplotype panels for comparison, with the high-diversity panel serving as a hypothetical exemplar panel for IBD use cases.
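For readers unfamiliar with the error metric used above, the RMSE of a set of relatedness estimates against the data-generating value can be computed as in the sketch below. The function name `rmse` and the toy numbers are illustrative assumptions; the study's simulations were run with paneljudge, not with this code.

```python
import math


def rmse(estimates, true_r):
    """Root mean squared error of relatedness estimates against the true r."""
    squared_errors = [(estimate - true_r) ** 2 for estimate in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: simulated sibling pairs (data-generating r = 0.5).
sibling_estimates = [0.42, 0.55, 0.47, 0.58]
panel_meets_target = rmse(sibling_estimates, 0.5) < 0.1
```

Unlike a CI width, the RMSE penalises both bias and variance in the estimates, so the two measures are complementary summaries of panel informativeness.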

**Fig. 3: Confidence intervals around infection relatedness estimates using 100 microhaplotype-based simulations.**

Evaluation of an exemplar high-diversity 100 microhaplotype panel on IBD estimation using real data

In addition to the simulated data, the efficacy of the high-diversity 100 microhaplotype panel in estimating IBD between infections was evaluated using real data from the independent, high-quality monoclonal infections in Pv4. Pairwise measures of IBD were determined for the genome-wide SNPs and for the microhaplotype SNPs using hmmIBD. In all geographic regions, the microhaplotype-based estimates of pairwise IBD demonstrated a significant positive correlation with the genome-wide estimates (all P < 0.05, Spearman’s rho statistic using a paired test, Supplementary Fig. 3).
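The rank correlation used in this comparison can be illustrated with a small standard-library sketch. It computes Spearman's rho as the Pearson correlation of average ranks, which handles tied IBD values; the function names and the toy data are assumptions, and the sketch does not reproduce the significance test reported above.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)
```

In use, `x` would hold the genome-wide pairwise IBD estimates and `y` the microhaplotype-based estimates for the same pairs of infections.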

Further evaluation using real data was undertaken on sequenced pairs of primary and recurrent P. vivax infections in Pv4. These samples came from a range of clinical trials, as well as from returning travellers^26,27,28,29. A total of 14 infection pairs satisfied the criteria for selection which required monoclonal infections with high-quality genomic data at both time points. The infection pairs originated from sites in the Asia-Pacific and exhibited a range of durations between the primary and recurrent episode (Table 2). Genome-wide IBD measures using hmmIBD revealed that 11 infection pairs were clones (IBD ≥ 0.95) reflecting potential recrudescence or relapse events, whilst 1 pair was distant relatives (0.05 ≤ IBD < 0.25) and 2 pairs were strangers (IBD < 0.05). The distant relatives and strangers reflect potential reinfection or relapse events. When the data were restricted to the 100-microhaplotype markers, the resulting hmmIBD-based estimates correlated highly with the genome-wide measures of IBD (rho=0.790, Spearman’s rank correlation). One pair of infections (PJ0167-C and PJ0166-C) that were classified as strangers with genomic data (IBD = 0.048) were classified as distant relatives (IBD = 0.156) with the microhaplotypes, but the other 13 pairs had concordant recurrence classifications.
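The thresholds used to classify infection pairs above can be written as a small rule-based function. This is an illustrative sketch: the function name is hypothetical, and the label "close relative" for 0.25 ≤ IBD < 0.95 is an assumption, since the text defines only the clone, distant relative and stranger categories for the 14 pairs.

```python
def classify_pair(ibd):
    """Map a pairwise IBD estimate to a relatedness category.

    Thresholds for clone, distant relative and stranger follow the text;
    the intermediate "close relative" label is an assumption.
    """
    if not 0.0 <= ibd <= 1.0:
        raise ValueError("IBD must lie between 0 and 1")
    if ibd >= 0.95:
        return "clone"
    if ibd >= 0.25:
        return "close relative"
    if ibd >= 0.05:
        return "distant relative"
    return "stranger"


# The discordant pair reported above (PJ0167-C and PJ0166-C):
genome_wide_class = classify_pair(0.048)     # "stranger"
microhaplotype_class = classify_pair(0.156)  # "distant relative"
```

The example shows how a small difference in the estimate near a threshold changes the category, even though both categories carry the same interpretation (potential reinfection or relapse).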

Table 2 Genome-wide versus microhaplotype-based identity by descent (IBD) estimates in P. vivax infection pairs

Microhaplotype panels can effectively capture diversity and differentiation

Further evaluation of the high-diversity 100 microhaplotype panel was conducted to assess the utility of such panels in capturing key population genetic features. The genetic diversity of the panel was assessed for isolates from each of the 7 geographic regions using measures of heterozygosity and effective cardinality. Heterozygosities across the microhaplotypes varied, but overall diversity was high in all geographic regions. The lowest diversity was observed in the horn of Africa (median heterozygosity = 0.70) and the highest in Oceania (median heterozygosity = 0.81) (Table 3, Fig. 4a). Similar trends occurred in effective cardinality (roughly equivalent to the number of alleles, with adjustment for minor allele frequency), with median values ranging from 3.88 in Africa to 5.18 in Oceania (Table 3, Fig. 4b).
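To make the two diversity measures concrete, the sketch below computes both from the alleles observed at one microhaplotype. It assumes the common definitions of expected heterozygosity, 1 − Σp², and effective cardinality, 1 / Σp²; whether the study applied a sample-size correction is not stated here, so none is used. The function names are hypothetical.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Relative frequency of each distinct allele observed at one marker."""
    counts = Counter(alleles)
    n = len(alleles)
    return [count / n for count in counts.values()]


def heterozygosity(freqs):
    """Expected heterozygosity: chance that two random draws differ."""
    return 1.0 - sum(p ** 2 for p in freqs)


def effective_cardinality(freqs):
    """Number of equally frequent alleles giving the same diversity."""
    return 1.0 / sum(p ** 2 for p in freqs)


# Four equally frequent microhaplotype alleles at a single marker.
freqs = allele_frequencies(["ACG", "ATG", "GCG", "GTA"])
marker_het = heterozygosity(freqs)
marker_cardinality = effective_cardinality(freqs)
```

With four equally frequent alleles the effective cardinality equals the allele count; rare alleles contribute less, which is the adjustment for allele frequency mentioned above.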

Table 3 Regional patterns of population diversity using the high-diversity microhaplotype panel

**Fig. 4: Comparative diversity between the High-diversity microhaplotype panel and the 38-SNP Broad Barcode.**

Within-host diversity estimates using the high-diversity 100 microhaplotype panel were generated for each isolate using Complexity of Infection (COI) measures and correlated with the genome-wide estimates derived from the F_WS score. Using thresholds of COI = 1 and F_WS ≥ 0.95 to define a likely monoclonal infection, there were no significant differences in the proportion of polyclonal infections defined by the microhaplotype and genome-wide data in any geographic region (Supplementary Table 1). Furthermore, as illustrated in Fig. 5, the F_WS scores of the infections with COI = 1 were significantly larger than those with COI > 1 in all regions, demonstrating a strong alignment between the microhaplotype and genomic data (Supplementary Table 1).

**Fig. 5: Genome-wide F_WS distribution by microhaplotype-based complexity of infection (COI).**

The capacity of the high-diversity microhaplotype panel to capture spatial transmission patterns including differentiation of geographically distinct populations was also assessed, using Principal Coordinate Analysis (PCoA) and IBD-based networks. The microhaplotype-based PCoA trends were consistent with spatial trends observed with genome-wide datasets²⁴. Analysis of all the available P. vivax data demonstrated three major clusters with PC1 and PC2 representing South America, Africa and West Asia (group 1), East and West Southeast Asia (group 2), and Oceania and Maritime Southeast Asia (group 3) (Fig. 6a, b). Clear differentiation of the regional groups was observed within each of the major clusters. To retain accuracy in MAF estimates, IBD-based analyses were restricted to within each of the 7 regional groupings. In Maritime Southeast Asia, a major clonal outbreak in Malaysia (defined K2) that we have described previously with genomic data³⁰, was also distinguished clearly with the microhaplotypes (Fig. 6c, d). Furthermore, the resolution of previously defined K3 and K4 Malaysian sub-populations was also captured with the microhaplotypes. In other geographic regions, patterns of infection relatedness were largely consistent with genomic patterns described in previous studies (Supplementary Fig. 4)²⁴.

**Fig. 6: Spatial trends in *P. vivax* connectivity using microhaplotypes versus genomic data.**

Further spatial evaluations of the high-diversity 100 microhaplotype panel were undertaken to assess the accuracy of such panels in detecting imported P. vivax cases by determining an infection’s country of origin. Using the data from 21 countries with a minimum sample size of 4 from the available P. vivax global dataset, we determined the accuracy of the 494 SNPs within the microhaplotype panel in predicting country of origin using a recently developed Bi-Allele Likelihood (BALK) classifier³¹. The predictive accuracy of the 494 microhaplotype SNPs was compared to the 38-SNP Broad barcode and three recently identified P. vivax geographic barcodes selected specifically for determining country of origin (GEO33, GEO50 and GEO55 comprising 33, 50, and 55 SNPs respectively)³¹. The median Matthews correlation coefficient (MCC, ranging from −1 when prediction is 0% correct to 1 when prediction is 100% correct) of the 494 microhaplotype SNPs was greater than or equal to that of the 38-SNP Broad barcode, GEO33 and GEO50 in 21 (100%) countries (Supplementary Table 2, Fig. 7). The GEO55 panel exhibited higher MCCs than the microhaplotype panel in 3/21 (14%) countries (Cambodia, Indonesia and Vietnam) but lower values in 4/21 (19%) countries (Afghanistan, India, Myanmar and Thailand). The median MCCs of the microhaplotype panel exceeded 0.9 in all countries except Cambodia and Vietnam (19/21, 90%); the median MCCs in these countries were also <0.9 at the 38-SNP Broad barcode and the three GEO panels.
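As an illustration of the performance metric, the sketch below computes a per-country MCC from true and predicted country labels by treating each country as a one-versus-rest binary problem. The one-versus-rest reduction, the function names and the toy labels are assumptions made for illustration; this is not the BALK classifier or its evaluation code.

```python
import math


def one_vs_rest_counts(true_labels, predicted_labels, country):
    """Confusion-matrix counts (tp, fp, fn, tn) for one country against the rest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == country and truth == country:
            tp += 1
        elif pred == country:
            fp += 1
        elif truth == country:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels using ISO country codes (illustrative data only).
truth = ["TH", "TH", "VN", "VN"]
predicted = ["TH", "VN", "VN", "VN"]
thailand_mcc = mcc(*one_vs_rest_counts(truth, predicted, "TH"))
```

MCC accounts for all four cells of the confusion matrix, so it remains informative when countries contribute very different numbers of samples, as is the case in this dataset.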

**Fig. 7: Comparison of country prediction performance between SNP panels.**

External validation of IBD accuracy using an exemplar high-diversity 100 microhaplotype panel with an independent dataset

Given that the microhaplotype panels were selected using the MalariaGEN Pv4 dataset, we considered the potential of bias in using Pv4 data for the evaluation. An independent P. vivax genomic dataset was therefore employed for additional external evaluation of IBD estimation using the high-diversity 100 microhaplotype panel. From an initial set of 836 non-Pv4 isolates, we identified 324 high quality, monoclonal samples^32,33,34,35. The 324 samples derived from 19 endemic countries and clustered with infections from represented or geographically proximal countries in the Pv4 dataset (Supplementary Fig. 5, Supplementary Data 1^32,33,34,35). Although all 7 geographic regions were represented amongst the 324 non-Pv4 samples, IBD analyses were restricted to 289 samples from 4 regional groups comprising ≥30 samples: AF (n = 44), ESEA (n = 62), SAM (n = 95) and WAS (n = 87). Pairwise measures of IBD were determined for the genome-wide SNPs and for the microhaplotype SNPs using hmmIBD. The 324 independent samples exhibited an overall higher proportion of missing calls at the 494 SNPs comprising the microhaplotype loci (median 17% SNPs, range 0–50% SNPs) than in the 615 Pv4 samples (median 0%, mean 2.5%, range 0–34%) reflecting the use of Pv4 for the original marker selection. Nonetheless, the microhaplotype-based estimates of pairwise IBD in the independent dataset demonstrated a significant positive correlation with the genome-wide estimates in all 4 geographic regions (all P < 0.05, Spearman’s rho statistic using a paired test, Supplementary Fig. 6).

Evaluation of microhaplotype number on IBD estimation

A formal evaluation of the impact of panel size on the accuracy of estimating IBD was undertaken using a set of 50, 100, 150, 200, and 250 microhaplotype panels generated using the greedy algorithm (see Notebook 2). Analysis was undertaken on monoclonal P. vivax infections using simulated data generated on pairs of infections across a range of relatedness (r) values using paneljudge³⁶. Trends in marker panel performance were assessed using the root mean squared error (RMSE) of the estimates \(\hat{r}\) compared to the data-generating r (Supplementary Fig. 7). Increasing panel size improved the accuracy of IBD estimation, with a large gain in panel informativeness between 50 and 100 microhaplotypes but diminishing returns above 100 microhaplotypes.

Discussion

Our study provides the first description in silico of P. vivax microhaplotype panels which can be used to estimate identity by descent relatedness between pairs of acute and recurrent infection isolates, and thus help to discriminate between different causes of vivax malaria recurrence. A systematic genome-wide selection process was used to identify panels of 50–250 globally diverse microhaplotypes using an expansive P. vivax genome dataset. The utility of these panels was assessed using both simulated and ‘real’ genomic data, including an independent validation dataset. Panels of 100 or more microhaplotypes demonstrate significant potential to improve the interpretation of clinical trials and surveillance data. Further details on the integration of microhaplotype data in the proposed use cases, and areas requiring further research and development are described herein.

A key requirement of the P. vivax microhaplotype panels was the ability to estimate IBD accurately between paired isolates and thereby help to classify the likely origin of vivax malaria recurrence (i.e., relapse, recrudescence, or reinfection) in clinical trials. Our simulations showed that both high-diversity and random microhaplotype panels yielded consistently higher accuracy in IBD estimation (across different IBD levels and in different populations) than the Broad barcode, which is currently the most widely used SNP barcode for P. vivax. This is not surprising as the microhaplotypes have substantially greater genetic information content than the 38 evaluable SNPs of the Broad barcode. We also demonstrated greater accuracy of IBD estimation when microhaplotypes were selected to meet specified diversity criteria (high-diversity panel) relative to microhaplotypes selected without heterozygosity optimisation (random panel). In addition, we were able to confirm that panel size increases informativeness greatly between 50 and 100 microhaplotypes per panel, with diminishing returns beyond 100. The marginal gains to informativeness at higher panel sizes (150, 200, 250) are consistent with modelling studies by Taylor and colleagues and come at a trade-off in cost and laboratory practicalities. Panels of 100 microhaplotypes provide a reasonable compromise.

In accordance with previous predictions²¹, our simulation-based results demonstrated error rates (RMSE) below 0.1 in the estimation of pairwise IBD in all populations tested using the high-diversity microhaplotype panel. The error rates were highest for the prediction of infections with 50% IBD, which is consistent with previous findings. Further work is needed to understand how error rates might impact clinical predictions of individual treatment response or population-level drug efficacy both in the context of descriptive data analyses and mathematical models³⁷. In descriptive analyses, data summaries (e.g., allelic homology observations or relatedness estimates) and a set of rules (e.g., relatedness values greater than 25% are suggestive of relapse) are used to classify recurrences categorically. Rules-based classification is not the same as estimating the probability of relapse given the data under a statistical model (as in Taylor et al.²¹, which also accounts for the added complexity associated with multiclonal infections). Nevertheless, improved data informativeness for IBD estimation likely translates into improved model-based probabilistic classification performance, at least in the case of the model of Taylor, because that model machinery includes an intermediary that evaluates the probability of the data given IBD partitions compatible with networks of sibling, clonal and stranger parasites²¹.

It should be acknowledged that panel evaluation based on simulated data was done in a highly idealised manner that captures panel performance in its most favourable light. Data were simulated under the same model used to estimate relatedness, such that the model was perfectly specified in relation to the simulated data. Real data are generated by an ancestral process that the model does not capture, i.e., it is mis-specified in relation to real data. The impact of mis-specified allele frequencies, mis-classified multiclonal infections, etc. was not evaluated. In addition, it should be acknowledged that genotyping errors and failures (missing data) were not included in our simulations as this would require reprogramming of the paneljudge package, which is beyond the remit of the current study. However, we anticipate that genotyping failures should be infrequent with NGS-based amplicon sequencing methods, where each locus is typically covered by hundreds of reads²³.

Data from pairs of P. vivax isolates, collected from recurrent infections in clinical trials, confirmed the ability of the high-diversity microhaplotype panel to estimate pairwise IBD. High correlations were observed between microhaplotype and genome wide IBD estimates in a set of 14 paired P. vivax isolates from the same patient before and after treatment. Using our assigned IBD thresholds, only one of 14 pairs of infections had mismatching classifications of stranger versus distant relative, which are both likely to reflect either reinfections or relapses. However, the available genomic data did not comprise any infection pairs with IBD estimates around 50% (i.e., siblings), which our simulations predicted to be the most difficult relationships to determine accurately. The number of genomic pairs was limited owing to the difficulty in obtaining enough DNA and thus high-quality sequence data from recurrent infections, which typically exhibit low parasite densities. This is a major incentive for using targeted genotyping approaches. Nonetheless, we demonstrated high correlations between the microhaplotype and genome-wide IBD estimates in the assessments of day 0 samples, where a range of IBD relationships were observed.

In some scenarios, genetic data alone will not be informative of the likely origin of a recurrence. A single mosquito inoculation may carry clonal parasites or related parasites generated through the recombination of heterologous clones. If a human host carries hypnozoites from a single inoculation, the relapsing parasites will therefore either be homologous or related (meiotic siblings) to the incident infection. In situations where the relapses are homologous, they will be genetically indistinguishable from recrudescence. If a host carries hypnozoites from one or more previous mosquito inoculations, relapsing parasites are liable to be unrelated/heterologous to the incident infection, and thus genetically indistinguishable from reinfections. In our study, a high proportion of infection pairs (79%) had homologous genomes, although the majority (86%) of pairs derived from low seasonal transmission settings in Malaysian Borneo, southern Vietnam, Cambodia and western Thailand. A therapeutic efficacy study undertaken in Peru also identified a moderately high proportion of genetically homologous recurrences (52%, 12/23)¹². The proportion of relapses amongst the recurrences in our and the Peruvian study is uncertain. In a therapeutic efficacy study undertaken in Cambodia, 20 patients were relocated to a malaria-free area excluding the possibility of reinfection; the authors were able to confidently define five recurrences as relapses and demonstrated that 4 (80%) of these were related to the initial infection³⁸. As more comprehensive data become available, a clearer picture of the epidemiology of recurrent P. vivax infections will emerge. Even in areas where the prevalence of related recurrences is low, mathematical modelling approaches that combine genetic data with time-to-event data will help to resolve the probable cause of recurrence³⁷.

A recent study highlights the human spleen as an important reservoir of P. vivax parasites³⁹. However, there is currently no evidence of altered metabolism of endosplenic stages in human P. vivax infections, in marked contrast to the hypnozoite reservoir. We therefore anticipate that the endosplenic stages are treatable by blood-stage antimalarials such as chloroquine, although this needs to be confirmed. In this context, information on IBD between paired clinical isolates would not distinguish between endosplenic and circulating infections but should still help to distinguish reactivated liver stages and reinfections.

Another key requirement of the microhaplotype panel was the ability to capture spatial P. vivax transmission dynamics. Strategies to contain P. vivax effectively will be assisted by a more comprehensive understanding of the major routes of infection spread within and across borders. The SNP-based data from our microhaplotype panel displayed clear geographic trends and high accuracy in predicting the country of origin, suggesting utility in detecting and mapping imported P. vivax cases. Although the high-diversity microhaplotype panel was not intentionally selected for country prediction, the rich genetic data enabled high-performance country prediction compared to recently described geographic marker panels (GEO33, GEO50, and GEO55)³¹. The equivalently high prediction accuracy of the microhaplotypes relative to the GEO panels means that users who want to capture information on both IBD and country of origin can derive all information with the microhaplotypes alone without compromising accuracy. The accuracy in pairwise IBD estimation using the microhaplotype data also demonstrates a unique potential for tracking infection spread at micro-epidemiological spatial resolution, to inform on the dispersal of infections within and between communities. For example, the microhaplotype data from Malaysia effectively captured a previously described clonal expansion, as well as more subtle population structure reflecting different foci of infection. The spatial analyses conducted here used biallelic SNP data at the microhaplotypes; whilst the high density of SNPs (n = 494) provided rich genetic information, even greater information content can be achieved using multiallelic microhaplotypes once new software to deal with these complex datasets becomes available. It is possible that some microhaplotypes are subject to site-specific selective pressure, which could affect their spatio-temporal surveillance utility; this impact could be mitigated by implementing additional microhaplotypes.

Some global regions are not well represented in the Pv4 genomic dataset that was used for marker selection, such as Africa, the Indian subcontinent (West Asia), and Central and South America (SAM). It is therefore unclear how well the microhaplotypes described here will capture IBD in these regions. Some insights have been derived from our evaluation of an independent (non-Pv4) P. vivax genomic dataset. Amongst the four regional groups evaluated in the independent dataset, moderate differences were observed in country representation relative to Pv4. For example, the Pv4 Africa dataset only comprised Ethiopian isolates, whilst the independent dataset comprised six African countries. There were also differences between the two datasets in four countries in South America, three in East Southeast Asia and two in West Asia. Despite the differences in country representation, the independent data regional groups exhibited significant correlations between genome-wide and microhaplotype-based estimations of IBD. This suggests that the panel discovery framework captured globally representative microhaplotypes from the Pv4 dataset. Nonetheless, the informatics framework that was established for the microhaplotype selection can be applied readily to update panels as needed once additional whole genome data become available from new geographical regions. The framework can also be used to select country- or region-specific panels where needed.

Information on within-host infection complexity is important to capture epidemiologically relevant transmission dynamics. We observed high concordance in the proportion of polyclonal infections captured by microhaplotype-based COI and genome-wide F_WS measures when thresholds of COI = 1 and F_WS ≥ 0.95 were applied. Only a few infections displayed differences in the classification of polyclonality between the microhaplotype and genomic datasets. In interpreting these differences, it should be acknowledged that the 0.95 F_WS threshold is only a guideline, and that population distributions of within-host diversity generally reflect a continuum, not discrete clusters of monoclonal and polyclonal infections.

Within-host diversity at SNP barcodes also follows a continuum, but this is less marked owing to the lower genetic resolution. When monoclonal thresholds were only applied to the COI data, the F_WS demonstrated significantly larger values in the COI = 1 (monoclonal) vs COI > 1 (polyclonal) infection group in all geographic regions investigated, highlighting consistency between the two data sets. Further work is needed to determine how to phase microhaplotype profiles in highly complex infections where any number of clones may be present in varying proportions, but tools such as the Dcifer software provide an important step forward⁴⁰. Indeed, phasing of polyclonal malaria infections is not a challenge unique to microhaplotypes.

The P. vivax genome has an abundance of globally diverse microhaplotype regions that can effectively capture information on infection lineages and spatial connectivity, overcoming the previous requirement to generate genomic data in samples that are often notoriously difficult to sequence. With targeted, deep sequencing platforms, these markers have great potential to inform on the complex diversity within individual infections and associated insights on transmission. The establishment of targeted microhaplotype genotyping tools for P. vivax will transform the assessment of clinical surveys in this species, enhance knowledge of relapse biology, and greatly improve surveillance of infections.

Methods

MalariaGEN Pv4 data preparation

The initial dataset comprised 1,816 samples from 17 countries derived from the P. vivax community study (Pv4) in the vivax and Malaria Genomic Epidemiology Networks (vivaxGEN and MalariaGEN)²⁴, as well as previously published external studies^41,42. Of the 1,895 samples described in the Pv4 data resource, the 1,816 samples reflect data to which we had access prior to the open release. All genomic datasets were generated using Illumina short-read sequencing platforms. Sequence alignment, SNP discovery and variant calling, population assignments, and F_WS (within-sample F statistic) calculations of within-host infection complexity were undertaken using previously described methods within the MalariaGEN framework, producing a dataset referred to as P. vivax Genome Variation Project release 4.0 (Pv4). As described in the data resource, the Pv4 dataset comprises ~4.5 million variants, of which 911,901 are high-quality biallelic SNPs suitable for population genetic analyses²⁴. The F_WS statistic estimates the fixation of alleles within each infection relative to the diversity observed in the total population on a scale from 0 to 1; F_WS values were provided as part of the Pv4 dataset^24,43. Using a threshold of F_WS ≥ 0.95 as a proxy to identify a monoclonal infection, all polyclonal infections were excluded from subsequent analysis. High-quality samples were selected using a threshold of ≥50% of the core genome callable, denoted by the “analysis-set” flag in the MalariaGEN Pv4 dataset. The patient metadata provided by the contributing vivaxGEN partner studies was used to identify independent isolates for marker selection, and recurrent isolates for downstream evaluation of selected marker sites. The MalariaGEN curated metadata was used to define 7 regional-level geographic assignments with the following categories: Africa (AF), South and Central America (SAM), West Southeast Asia (WSEA), East Southeast Asia (ESEA), Maritime Southeast Asia (MSEA), Oceania (OCE) and West Asia (WAS).
Rather than using country-based classifications to define infection groups, we employed the regional classifications defined in the MalariaGEN Pv4 data resource, which are based on both geographic location and genomic clustering patterns. This approach is more accurate than purely country-based classifications since malaria transmission networks do not always conform to national spatial boundaries. For example, there is a malaria-free ‘corridor’ that runs through the middle of Thailand, leading to distinct separation of eastern and western infections⁴⁴. In contrast, some border regions, such as parts of Vietnam and Cambodia, are very ‘porous’ with high levels of homology seen between infections across national borders.
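
As a rough illustration of how F_WS relates within-host diversity to population diversity, the sketch below computes a simplified F_WS = 1 − H_W/H_S from per-SNP read counts and population allele frequencies, and then applies the 0.95 monoclonality threshold. It is not the MalariaGEN implementation, which may differ in detail; the function names and the toy inputs are hypothetical.

```python
def site_heterozygosity(p):
    """Expected heterozygosity of a biallelic site with alternate-allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fws_simplified(read_counts, pop_freqs):
    """Simplified F_WS = 1 - mean(H_W) / mean(H_S).

    read_counts: (ref_reads, alt_reads) per SNP for one infection.
    pop_freqs:   population alternate-allele frequency per SNP.
    SNPs with no reads in the sample are skipped (an assumption of this sketch).
    """
    h_within, h_pop = [], []
    for (ref, alt), p in zip(read_counts, pop_freqs):
        depth = ref + alt
        if depth == 0:
            continue
        h_within.append(site_heterozygosity(alt / depth))
        h_pop.append(site_heterozygosity(p))
    if not h_pop or sum(h_pop) == 0:
        raise ValueError("no informative SNPs")
    return 1.0 - (sum(h_within) / len(h_within)) / (sum(h_pop) / len(h_pop))


def is_monoclonal(fws, threshold=0.95):
    """Apply the F_WS >= 0.95 proxy for a monoclonal infection."""
    return fws >= threshold


pop = [0.3, 0.5, 0.2]
clonal = fws_simplified([(50, 0), (0, 40), (30, 0)], pop)    # no within-host mixture
mixed = fws_simplified([(25, 25), (20, 20), (30, 0)], pop)   # two sites with mixed reads
```

In this toy case the first infection has no mixed sites and would be retained as monoclonal, whereas the second falls well below the threshold and would be excluded.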

External validation data preparation

A genomic dataset of samples not overlapping with the MalariaGEN Pv4 dataset was established for independent validation of IBD estimation accuracy in microhaplotypes selected from the Pv4 dataset. Details of the methods can be found in Supplementary Note 1. In brief, the independent dataset was sourced from all published, open-access, Illumina paired-end genome-wide P. vivax data that were not represented in Pv4. A total of 836 independent samples were sourced from 8 studies^32,33,34,35,38,45,46,47. Read alignment, variant calling and genotype calling were undertaken in close alignment with the MalariaGEN Pv4 pipeline with the aim of deriving comparable genotype calls at the 911,901 high-quality biallelic SNPs detected in Pv4²⁴. Samples were filtered to remove those with >50% missing variants, and variants were filtered to remove those with >25% missing samples. The resulting variant call format (VCF) file comprised 329,371 variants in 324 samples. The R-based moimix package was used to calculate the F_WS and a threshold of ≥0.95 was used to assign monoclonal infections for downstream clustering and IBD analyses⁴⁸. The high-quality monoclonal Pv4 dataset was subset to the 329,371 variants. A genetic distance matrix was computed on the pooled Pv4 and non-Pv4 genome-wide dataset, and a neighbour-joining tree was constructed with the R-based ape package⁴⁹. The neighbour-joining tree was visually inspected to define regional classifications for the independent dataset samples.
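
The two missingness filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes that samples are filtered first and that variant missingness is then recomputed over the retained samples, and the function name and toy genotype matrix are hypothetical.

```python
def filter_missingness(genotypes, max_sample_missing=0.5, max_variant_missing=0.25):
    """Filter a variants x samples genotype matrix (None = missing call).

    Step 1 drops samples with more than max_sample_missing missing variants.
    Step 2 drops variants with more than max_variant_missing missing calls
    among the retained samples (the ordering is an assumption of this sketch).
    Returns the filtered matrix plus the retained variant and sample indices.
    """
    n_variants = len(genotypes)
    n_samples = len(genotypes[0])
    keep_samples = [
        j for j in range(n_samples)
        if sum(row[j] is None for row in genotypes) / n_variants <= max_sample_missing
    ]
    if not keep_samples:
        return [], [], []
    keep_variants = [
        i for i, row in enumerate(genotypes)
        if sum(row[j] is None for j in keep_samples) / len(keep_samples) <= max_variant_missing
    ]
    filtered = [[genotypes[i][j] for j in keep_samples] for i in keep_variants]
    return filtered, keep_variants, keep_samples


toy = [
    [0, 1, None, 0],
    [1, 1, None, 0],
    [0, None, None, 1],
    [None, 0, 1, None],
]
filtered, kept_variants, kept_samples = filter_missingness(toy)
```

In the toy matrix the third sample is missing at three of four variants and is removed, after which only the first two variants pass the 25% threshold.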

Marker selection

We tailored our marker selection (Fig. 1) to the Illumina amplicon-based sequencing workflow as this methodology is widely used in the malaria community and has demonstrated utility and feasibility for malaria molecular surveillance in low- and middle-income countries¹⁷. The maximum amplicon size for Illumina amplicon-based sequencing is 200 bp, which set the criterion for the maximum length of each microhaplotype. The minimum target of 100 microhaplotypes was based on a mathematical modelling study, which demonstrated that under idealised settings (i.e., estimating relatedness from data simulated under the model used for estimation), IBD estimates with low root mean squared error (RMSE < 0.1) between monoclonal malaria parasites with relatedness equal to 0.5 (the relatedness with the highest RMSE) are obtainable using ~100 polyallelic markers²¹. Diminishing returns in RMSE reduction were observed above 100 polyallelic markers, highlighting this target number as a pragmatic balance of IBD accuracy against the economic cost of primers and assays²¹. Panels of microhaplotypes (schematic Fig. 1b) were subsequently selected according to these criteria using a three-phase approach, as described in Fig. 1a, with respect to sample, variant and window selection. In the first phase, high-quality samples were subset from the MalariaGEN Pv4 data resource, identified in Pv4 as being independent, having ≥50% of the genome callable, and being likely monoclonal (F_WS ≥ 0.95). In the second phase, candidate SNPs were identified by filtering the high-quality biallelic SNPs (filter pass) at a minor allele frequency (MAF) threshold of 0.1 and genotype failure rates <0.1. Finally, candidate microhaplotype windows were identified by scanning the core genome (excluding hypervariable regions) in coding regions with 50 bp overlapping, 200 bp windows to identify amplicon-sized segments containing one or more candidate SNPs.
We then defined the Pv heterozygome as the collection of all windows that additionally have sufficiently high diversity (windows with a minimum heterozygosity of 0.5 and at least 3 SNPs; 3,830 total windows). A Manhattan plot of all windows distributed across the core, coding regions of the P. vivax genome is shown in Fig. 1c. Microhaplotype panels were then selected from the P. vivax heterozygome by further selecting windows that had 3–10 SNPs, approximately even spacing across the 14 chromosomes and a minimum heterozygosity of 0.6 (1,110 windows total). The Random mhaps panel had 100 markers chosen irrespective of diversity, and the high-diversity 100 SNP panel had each marker chosen with the highest possible heterozygosity in a given genomic region (Fig. 2c). Five additional panels were generated to separately evaluate the impact of marker number (50, 100, 150, 200 and 250 marker panels); these were selected with the same criteria as the high-diversity panel using the greedy algorithm described in notebook 2: https://github.com/svsiegel/vivax-mhaps. Measures of MAF and heterozygosity were calculated using the scikit-allel package (microhaplotype discovery marker framework, notebook 1: https://github.com/svsiegel/vivax-mhaps).
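
The window scan and panel selection can be sketched in a few lines. The code below is an illustrative approximation only and does not reproduce the notebooks in the vivax-mhaps repository: it assumes that window heterozygosity is 1 − Σp² over distinct haplotypes, that a 150 bp step yields 200 bp windows overlapping by 50 bp, and that the greedy step simply takes the most diverse windows first subject to a minimum spacing. All function names, parameters and toy data are hypothetical.

```python
from collections import Counter


def haplotype_heterozygosity(haplotypes):
    """Expected heterozygosity, 1 - sum(p_i^2), over distinct haplotypes."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def scan_windows(chrom, snp_positions, sample_alleles, chrom_length,
                 size=200, step=150, min_snps=3, min_het=0.5):
    """Return windows with enough SNPs and enough haplotype diversity.

    snp_positions:  1-based SNP coordinates on one chromosome.
    sample_alleles: one string per monoclonal sample, one character per SNP.
    """
    windows = []
    for start in range(1, chrom_length - size + 2, step):
        idx = [i for i, pos in enumerate(snp_positions)
               if start <= pos < start + size]
        if len(idx) < min_snps:
            continue
        haps = ["".join(s[i] for i in idx) for s in sample_alleles]
        het = haplotype_heterozygosity(haps)
        if het >= min_het:
            windows.append({"chrom": chrom, "start": start,
                            "n_snps": len(idx), "het": het})
    return windows


def select_greedy(windows, n_markers, min_spacing):
    """Take the most diverse windows first, skipping any too close to a chosen one."""
    chosen = []
    for w in sorted(windows, key=lambda w: w["het"], reverse=True):
        if len(chosen) >= n_markers:
            break
        crowded = any(c["chrom"] == w["chrom"]
                      and abs(c["start"] - w["start"]) < min_spacing
                      for c in chosen)
        if not crowded:
            chosen.append(w)
    return chosen


alleles = ["AAAAAA", "ATAAAA", "TTGAAA", "TAGAAT"]
found = scan_windows("chr1", [10, 50, 120, 400, 450, 480], alleles, chrom_length=600)

candidates = [
    {"chrom": "chr1", "start": 100, "het": 0.90},
    {"chrom": "chr1", "start": 200, "het": 0.85},
    {"chrom": "chr1", "start": 5000, "het": 0.70},
    {"chrom": "chr2", "start": 100, "het": 0.80},
]
panel = select_greedy(candidates, n_markers=3, min_spacing=1000)
```

In the toy scan, the first window carries four distinct three-SNP haplotypes and is retained, whereas the window covering the last three SNPs is too uniform to pass the 0.5 threshold. The spacing rule in `select_greedy` stands in for the "approximately even spacing" criterion and would need tuning for real data.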

Population genetic evaluation of a 100-microhaplotype panel

A panel of 100 high-heterozygosity microhaplotypes (high-diversity panel) was selected as an exemplar microhaplotype panel for comparative evaluation against the Broad P. vivax barcode, as well as a panel that had similar characteristics without being specifically optimised for highest possible heterozygosity (random panel). These two panels were discovered from a broader set of microhaplotypes output from the greedy algorithm, and a final selection was curated manually (Supplementary Data 2). Population diversity was assessed at each panel (microhaplotype and Broad) for each of the 7 geographic regions using measures of effective cardinality and heterozygosity computed with the R-based paneljudge package (release not published) available at https://github.com/aimeertaylor/paneljudge. Analyses were conducted on the high-quality monoclonal samples; at heterozygote genotype positions, one of the two alleles was selected at random to reconstruct microhaplotypes, using the haploidify function in scikit-allel version 1.2.0 (https://github.com/cggh/scikit-allel). Measures of within-host diversity were also generated using THE REAL McCOIL package version 2.0 (https://github.com/EPPIcenter/THEREALMcCOIL) on the biallelic SNPs in the High-diversity microhaplotype panel and correlated against the genome-wide F_WS estimates of infection complexity^43,50.
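
For orientation, the two diversity measures can be written compactly from the allele frequencies at a marker. The sketch below assumes the conventional definitions, heterozygosity as 1 − Σf² and effective cardinality as 1/Σf², which we take to correspond to those in paneljudge; it is written in Python for illustration rather than calling the R package, and the function names and frequencies are hypothetical.

```python
from statistics import median


def marker_heterozygosity(freqs):
    """Probability that two alleles drawn at random from the population differ."""
    return 1.0 - sum(f * f for f in freqs)


def effective_cardinality(freqs):
    """Number of equifrequent alleles that would give the same diversity."""
    return 1.0 / sum(f * f for f in freqs)


# Toy panel: allele frequencies at three markers in one region.
panel_freqs = [
    [0.25, 0.25, 0.25, 0.25],   # four equifrequent microhaplotype alleles
    [0.5, 0.5],                 # balanced biallelic SNP
    [1.0],                      # monomorphic marker
]
het_values = [marker_heterozygosity(f) for f in panel_freqs]
card_values = [effective_cardinality(f) for f in panel_freqs]
median_het = median(het_values)
```

The example makes the ceiling on biallelic markers explicit: a SNP cannot exceed a heterozygosity of 0.5 or an effective cardinality of 2, whereas a multi-allelic microhaplotype can.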

The potential of the biallelic SNPs within the microhaplotype panel in capturing spatial patterns of transmission was assessed using Principal Coordinate Analysis (PCoA) to illustrate the regional-level geographic clustering patterns. PCoA was performed on the same high-quality monoclonal samples (n = 615) by applying the PCA method of sklearn.decomposition (scikit-learn package version 1.2.0) to a microhaplotype-based pairwise distance matrix calculated from the proportion of non-identical microhaplotype alleles across all positions. Spatial clustering patterns were also assessed using IBD-based measures of differentiation implemented with the hmmIBD package as described below and plotted using the R-based igraph package version 1.6 (https://igraph.org).
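
The distance underlying the ordination is simple to state in code. The sketch below computes the proportion of non-identical microhaplotype alleles between each pair of samples; skipping loci that are missing in either sample is an assumption of this example, and the function names and toy profiles are hypothetical. The resulting matrix would then be passed to an ordination routine.

```python
def mhap_distance(a, b):
    """Proportion of loci with non-identical alleles between two samples.

    Loci missing (None) in either sample are ignored; returns NaN when no
    locus can be compared.
    """
    compared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not compared:
        return float("nan")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(profiles):
    """Symmetric pairwise distance matrix with a zero diagonal."""
    n = len(profiles)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = mhap_distance(profiles[i], profiles[j])
    return d


# Each profile lists one microhaplotype allele label per locus.
profiles = [
    ["A", "B", "C", "D"],
    ["A", "B", "X", "D"],
    ["Z", "Y", "X", None],
]
dist = distance_matrix(profiles)
```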

A Bi-allele Likelihood (BALK) classifier was used to predict the country of origin of P. vivax isolates using SNP data (not multi-allelic microhaplotypes) at biallelic SNPs in the High-diversity SNP 100 microhaplotype panel, Broad P. vivax barcode, and three recently identified P. vivax geographic barcodes (GEO33, GEO50 and GEO55)^25,31. Details of the BALK classifier can be found in the original paper³¹. Country prediction performance was assessed using data from 799 of the 922 high-quality samples in the available Pv4 genomic dataset. The 799 sample-set corresponds with the sample set used in the original paper describing the BALK classifier³¹. Briefly, the sample set was derived by filtering samples to include a single representative of samples with near-identical genomes (defined as pairs with genetic distance <0.001), subjecting to iterative data quality filtering to obtain the best representative number of samples and informative SNPs without any missing genotype by removing samples with higher missingness iteratively, and then removing samples that appeared to be imported based on genome-wide data clustering patterns. A total of 21 countries had ≥4 samples. The comparative predictive performance of the SNP panels was evaluated using a stratified 10-fold cross-validation with 500 repeats, reporting the Matthews Correlation Coefficient (MCC).
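
Because country prediction is a multi-class problem, the MCC reported here is the multi-class generalisation of the binary coefficient. The sketch below shows that standard formula computed from true and predicted labels; it does not reproduce the BALK classifier or the cross-validation scheme, and the function name and country labels are hypothetical.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient.

    With s samples, c correct predictions, t_k true occurrences of class k and
    p_k predictions of class k:
        MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    Returns 0.0 when the denominator is zero.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[label] * p_k[label] for label in labels)
    denom = math.sqrt((s * s - sum(v * v for v in p_k.values()))
                      * (s * s - sum(v * v for v in t_k.values())))
    return numerator / denom if denom else 0.0


perfect = multiclass_mcc(["PE", "PE", "VN", "ET"], ["PE", "PE", "VN", "ET"])
partial = multiclass_mcc(["PE", "PE", "VN", "VN"], ["PE", "PE", "VN", "PE"])
```

Unlike raw accuracy, the coefficient accounts for how predictions are distributed across classes, which matters when some countries contribute only a handful of samples.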

Evaluation of IBD estimation using simulated data

The two selected 100-microhaplotype panels (random microhaplotype panel and high-diversity microhaplotype panel), as well as the 42-SNP Broad barcode panel, were evaluated with the paneljudge package for their ability to estimate relatedness (IBD) using a range of simulations to compute the uncertainty (error rate) in estimations of relatedness (r) ranging from 0 (strangers) to 1 (identical) under varying allele frequency distributions in different geographic regions. Details of the equations, assumptions, and parameter options in the paneljudge package can be found on GitHub (https://github.com/aimeertaylor/paneljudge). Briefly, for each of a range of data-generating r values (r = 0, 0.25, 0.5, 0.75 and 1.0), data were simulated on a pair of haploid genotypes, generating 100 haploid genotype pairs. Estimates of r (\(\hat{r}\)) and of the switch-rate parameter k (\(\hat{k}\)), the 95% confidence intervals (CIs) around the estimates \(\hat{r}\) and \(\hat{k}\), and root mean square errors (RMSE) were then computed. The switch-rate parameter modifies the rate of switching between latent IBD and non-IBD states in the hidden Markov model by multiplying the probability of a crossover per base pair (default value 7.4e-7 as per⁵¹) and the inter-marker distance in base pairs. The accuracy of a given marker panel in estimating r was measured by the CI width and RMSE, ranging from ~0 (near perfect informativeness) to 1 (uninformative). The CI width and RMSE of \(\hat{r}\) compared to the data-generating r never reach 0, as a genome of finite length may have many realised relatednesses compatible with a given probability of IBD. This means that there will always be some variance in realised relatedness around r.
For comparative value, paneljudge simulations were run on two panels of 100 microhaplotype markers, one optimised for heterozygosity (high-diversity panel), one not optimised for heterozygosity (random panel), and for a set of 38 assayable biallelic SNPs from the 42-SNP Broad barcode (BR38) that has been widely used by the P. vivax surveillance community^25,52,53. The five additional “greedy” algorithm panels consisting of 50, 100, 150, 200, and 250 markers were also analysed for RMSE comparisons against panel size using the same method (Supplementary Fig. 3).
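
To make the simulation design concrete, the sketch below generates a pair of haploid genotypes under a two-state (IBD, non-IBD) Markov chain and computes an RMSE against the data-generating r. It is a simplified Python illustration, not the paneljudge code: the transition probabilities are assumed to follow the form used in the relatedness model of ref. 21, with switching governed by exp(−kρd), and the value of k, the allele frequencies and the function names are hypothetical.

```python
import math
import random


def prob_ibd_given_previous(prev_ibd, r, k, rho, d):
    """Probability that the current marker is IBD given the previous state.

    r: relatedness, k: switch-rate parameter, rho: crossover probability per
    base pair, d: inter-marker distance in base pairs.
    """
    no_switch = math.exp(-k * rho * d)
    if prev_ibd:
        return 1.0 - (1.0 - r) * (1.0 - no_switch)
    return r * (1.0 - no_switch)


def simulate_pair(freqs, distances, r, k, rho=7.4e-7, rng=None):
    """Simulate two haploid genotypes across markers.

    freqs:     per-marker list of allele frequencies.
    distances: base-pair distance between consecutive markers (len(freqs) - 1).
    """
    rng = rng or random.Random()
    g1, g2 = [], []
    ibd = rng.random() < r                      # initial state drawn with probability r
    for t, f in enumerate(freqs):
        if t > 0:
            ibd = rng.random() < prob_ibd_given_previous(ibd, r, k, rho, distances[t - 1])
        alleles = list(range(len(f)))
        a = rng.choices(alleles, weights=f)[0]
        # IBD markers share the allele; otherwise the second allele is an independent draw.
        b = a if ibd else rng.choices(alleles, weights=f)[0]
        g1.append(a)
        g2.append(b)
    return g1, g2


def rmse(estimates, truth):
    """Root mean square error of estimates around the data-generating value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))


freqs = [[0.25, 0.25, 0.25, 0.25]] * 10
distances = [1000] * 9
clone_a, clone_b = simulate_pair(freqs, distances, r=1.0, k=5.0, rng=random.Random(1))
example_rmse = rmse([0.4, 0.6], truth=0.5)
```

With r = 1 every marker is IBD and the two genotypes are identical; intermediate values of r produce a mosaic of shared and independently drawn segments, which is where marker diversity has the greatest effect on estimation error.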

Evaluation of IBD estimation using real data

In addition to simulation-based data, the efficacy of the high-diversity 100 microhaplotype panel in estimating IBD was assessed relative to gold-standard genome-wide data using high-quality genomic data (≥50% genome covered by at least 5 reads) from monoclonal infections (F_WS ≥ 0.95). Evaluations were undertaken in the MalariaGEN Pv4 data resource using both baseline (day 0 infections) data and pairs of day 0 and recurrent infections, and in the independent validation dataset using baseline data. IBD estimates between infection pairs were generated using the hmmIBD version 3.0 software⁵⁴. As the version 3 software does not enable adjustments in the genotyping error rate for polyallelic markers (which may artificially reduce the IBD estimates between infections), the microhaplotype data were run as biallelic SNP markers rather than polyallelic markers. Sample pairs were run in geographically defined batches with default parameters, using allele frequency estimates based on the regional estimates from the baseline samples (high-quality, monoclonal, independent infections). Genome-wide and microhaplotype-based recurrence classifications were assigned using the following infection pair IBD thresholds: clone (pairwise IBD ≥ 0.95), close relative (0.25 ≤ IBD < 0.95), distant relative (0.05 ≤ IBD < 0.25), stranger (IBD < 0.05). The concordance between the microhaplotype and genome-wide IBD estimates was evaluated by the correlation coefficient between the datasets, and the proportion of recurrence classification mismatches.
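
The classification thresholds and the mismatch proportion translate directly into code. The sketch below applies the four IBD categories stated above to paired microhaplotype and genome-wide estimates; the function names and the example IBD values are hypothetical and are not taken from the study data.

```python
def classify_ibd(ibd):
    """Assign a relationship category from a pairwise IBD estimate."""
    if ibd >= 0.95:
        return "clone"
    if ibd >= 0.25:
        return "close relative"
    if ibd >= 0.05:
        return "distant relative"
    return "stranger"


def mismatch_proportion(mhap_ibd, genome_ibd):
    """Proportion of pairs whose category differs between the two data types."""
    pairs = list(zip(mhap_ibd, genome_ibd))
    mismatches = sum(classify_ibd(m) != classify_ibd(g) for m, g in pairs)
    return mismatches / len(pairs)


# Illustrative estimates for three infection pairs.
mhap_estimates = [0.99, 0.04, 0.30]
genome_estimates = [0.97, 0.06, 0.40]
prop = mismatch_proportion(mhap_estimates, genome_estimates)
```

In this toy example only the second pair changes category (stranger versus distant relative), the same kind of boundary case as the single mismatch reported among the 14 recurrence pairs.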

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Sequencing data from the MalariaGEN Pv4 samples have been made publicly available in the European Nucleotide Archive (ENA), with details provided in a data resource describing the MalariaGEN Pv4 data²⁴. Accession codes are also available in Supplementary Data 1 and the VCF can be downloaded here: https://www.malariagen.net/resource/30. The sequencing data from the independent validation samples have also been made publicly available in the ENA or National Institutes of Health Sequence Read Archive in the respective contributing studies^32,33,34,35. Accession codes are also available in Supplementary Data 1.

Code availability

The full microhaplotype marker discovery framework, which includes easy-to-use code and additional exploratory and optimisation analyses that are fully customisable, can be found at https://github.com/svsiegel/vivax-mhaps and on the GitHub Pages site https://svsiegel.github.io/vivax-mhaps/, with citable code for vivax-mhaps v1.0.0 at DOI:10.5281/zenodo.12622789. This repository can be accessed and run without needing to download the repository files, as it is connected directly to the Pv4 data package in a cloud-based instance using Google Colab; all of the required supplementary analysis files are available at https://svsiegel.github.io/vivax-mhaps/. This framework could additionally be applied to other data resources and other malaria species with minimal changes to the underlying codebase.

References

World Health Organization. World Malaria Report 2022 (World Health Organization, 2022).
Auburn, S., Cheng, Q., Marfurt, J. & Price, R. N. The changing epidemiology of Plasmodium vivax: Insights from conventional and novel surveillance tools. PLoS Med. 18, e1003560 (2021).
Commons, R. J., Simpson, J. A., Watson, J., White, N. J. & Price, R. N. Estimating the proportion of plasmodium vivax recurrences caused by relapse: a systematic review and meta-analysis. Am. J. Trop. Med. Hyg. 103, 1094–1099 (2020).
White, N. J. Determinants of relapse periodicity in Plasmodium vivax malaria. Malar. J. 10, 297 (2011).
Price, R. N. et al. Global extent of chloroquine-resistant Plasmodium vivax: a systematic review and meta-analysis. Lancet Infect. Dis. 14, 982–991 (2014).
von Seidlein, L. et al. Review of key knowledge gaps in glucose-6-phosphate dehydrogenase deficiency detection with regard to the safe clinical deployment of 8-aminoquinoline treatment regimens: a workshop report. Malar. J. 12, 112 (2013).
Thriemer, K. et al. Challenges for achieving safe and effective radical cure of Plasmodium vivax: a round table discussion of the APMEN Vivax Working Group. Malar. J. 16, 141 (2017).
Cheah, P. Y., Steinkamp, N., von Seidlein, L. & Price, R. N. The ethics of using placebo in randomised controlled trials: a case study of a Plasmodium vivax antirelapse trial. BMC Med. Ethics 19, 19 (2018).
Snounou, G. & Beck, H. P. The use of PCR genotyping in the assessment of recrudescence or reinfection after antimalarial drug treatment. Parasitol. Today 14, 462–467 (1998).
Chen, N., Auliff, A., Rieckmann, K., Gatton, M. & Cheng, Q. Relapses of Plasmodium vivax infection result from clonal hypnozoites activated at predetermined intervals. J. Infect. Dis. 195, 934–941 (2007).
Imwong, M. et al. Relapses of Plasmodium vivax infection usually result from activation of heterologous hypnozoites. J. Infect. Dis. 195, 927–933 (2007).
Cowell, A. N., Valdivia, H. O., Bishop, D. K. & Winzeler, E. A. Exploration of Plasmodium vivax transmission dynamics and recurrent infections in the Peruvian Amazon using whole genome sequencing. Genome Med. 10, 52 (2018).
Bright, A. T. et al. A high resolution case study of a patient with recurrent Plasmodium vivax infections shows that relapses were caused by meiotic siblings. PLoS Negl. Trop. Dis. 8, e2882 (2014).
Popovici, J. et al. Recrudescence, reinfection, or relapse? a more rigorous framework to assess chloroquine efficacy for Plasmodium vivax Malaria. J. Infect. Dis. 219, 315–322 (2019).
Taylor, A. R. et al. Quantifying connectivity between local Plasmodium falciparum malaria parasite populations using identity by descent. PLoS Genet. 13, e1007065 (2017).
Noviyanti, R. et al. Implementing parasite genotyping into national surveillance frameworks: feedback from control programmes and researchers in the Asia-Pacific region. Malar. J. 19, 271 (2020).
Jacob, C. G. et al. Genetic surveillance in the Greater Mekong subregion and South Asia to support malaria control and elimination. Elife 10, e62997 (2021).
Lerch, A. et al. Longitudinal tracking and quantification of individual Plasmodium falciparum clones in complex infections. Sci. Rep. 9, 3333 (2019).
Wamae, K. et al. Amplicon sequencing as a potential surveillance tool for complexity of infection and drug resistance markers in plasmodium falciparum asymptomatic infections. J. Infect. Dis. 226, 920–927 (2022).
Rao, P. N. et al. A method for amplicon deep sequencing of drug resistance genes in plasmodium falciparum clinical isolates from India. J. Clin. Microbiol. 54, 1500–1511 (2016).
Taylor, A. R., Jacob, P. E., Neafsey, D. E. & Buckee, C. O. Estimating relatedness between malaria parasites. Genetics 212, 1337–1351 (2019).
Ruybal-Pesántez, S. et al. Molecular markers for malaria genetic epidemiology: progress and pitfalls. Trends Parasitol. 40, 147–163 (2024).
Tessema, S. K. et al. Sensitive, highly multiplexed sequencing of microhaplotypes from the plasmodium falciparum heterozygome. J. Infect. Dis. 225, 1227–1237 (2022).
MalariaGEN et al. An open dataset of Plasmodium vivax genome variation in 1,895 worldwide samples. Wellcome Open Res. 7, 136 (2022).
Baniecki, M. L. et al. Development of a single nucleotide polymorphism barcode to genotype Plasmodium vivax infections. PLoS Negl. Trop. Dis. 9, e0003539 (2015).
Lacerda, M. V. G. et al. Single-dose tafenoquine to prevent relapse of Plasmodium vivax malaria. N. Engl. J. Med. 380, 215–228 (2019).
Taylor, W. R. J. et al. Short-course primaquine for the radical cure of Plasmodium vivax malaria: a multicentre, randomised, placebo-controlled non-inferiority trial. Lancet 394, 929–938 (2019).
Chu, C. S. et al. Comparison of the cumulative efficacy and safety of chloroquine, artesunate, and chloroquine-primaquine in plasmodium vivax malaria. Clin. Infect. Dis. 67, 1543–1549 (2018).
Grigg, M. J. et al. Efficacy of artesunate-mefloquine for chloroquine-resistant plasmodium vivax malaria in Malaysia: an open-label, randomized, controlled trial. Clin. Infect. Dis. 62, 1403–1411 (2016).
Auburn, S. et al. Genomic analysis of a pre-elimination Malaysian Plasmodium vivax population reveals selective pressures and changing transmission dynamics. Nat. Commun. 9, 2585 (2018).
Trimarsanto, H. et al. A molecular barcode and web-based data analysis tool to identify imported Plasmodium vivax malaria. Commun. Biol. 5, 1411 (2022).
Benavente, E. D. et al. Distinctive genetic structure and selection patterns in Plasmodium vivax from South Asia and East Africa. Nat. Commun. 12, 3160 (2021).
Buyon, L. E. et al. Population genomics of Plasmodium vivax in Panama to assess the risk of case importation on malaria elimination. PLoS Negl. Trop. Dis. 14, e0008962 (2020).
Daron, J. et al. Population genomic evidence of Plasmodium vivax Southeast Asian origin. Sci. Adv. 7, eabc3713 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Ibrahim, A. et al. Population-based genomic study of Plasmodium vivax malaria in seven Brazilian states and across South America. Lancet Reg. Health Am. 18, 100420 (2023).
PubMed PubMed Central Google Scholar
LaVerriere, E. et al. Design and implementation of multiplexed amplicon sequencing panels to serve genomic epidemiology of infectious disease: a malaria case study. Mol. Ecol. Resour. 22, 2285–2303 (2022).
Article CAS PubMed PubMed Central Google Scholar
Taylor, A. R. et al. Resolving the cause of recurrent Plasmodium vivax malaria probabilistically. Nat. Commun. 10, 5595 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Popovici, J. et al. Genomic analyses reveal the common occurrence and complexity of Plasmodium vivax relapses in Cambodia. MBio 9, e01888–17 (2018).
Article CAS PubMed PubMed Central Google Scholar
Kho, S. et al. Evaluation of splenic accumulation and colocalization of immature reticulocytes and Plasmodium vivax in asymptomatic malaria: a prospective human splenectomy study. PLoS Med. 18, e1003632 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gerlovina, I., Gerlovin, B., Rodríguez-Barraquer, I. & Greenhouse, B. Dcifer: an IBD-based method to calculate genetic distance between polyclonal infections. Genetics 222, iyac126 (2022).
Article PubMed PubMed Central Google Scholar
Parobek, C. M. et al. Selective sweep suggests transcriptional regulation may underlie Plasmodium vivax resilience to malaria control measures in Cambodia. Proc. Natl Acad. Sci. USA 113, E8096–E8105 (2016).
Article CAS PubMed PubMed Central Google Scholar
Lo, E. et al. Frequent expansion of Plasmodium vivax Duffy Binding Protein in Ethiopia and its epidemiological significance. PLoS Negl. Trop. Dis. 13, e0007222 (2019).
Article CAS PubMed PubMed Central Google Scholar
Auburn, S. et al. Characterization of within-host Plasmodium falciparum diversity using next-generation sequence data. PLoS ONE 7, e32891 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Parker, D. M., Carrara, V. I., Pukrittayakamee, S., McGready, R. & Nosten, F. H. Malaria ecology along the Thailand-Myanmar border. Malar. J. 14, 388 (2015).
Article PubMed PubMed Central Google Scholar
Chan, E. R. et al. Whole genome sequencing of field isolates provides robust characterization of genetic diversity in Plasmodium vivax. PLoS Negl. Trop. Dis. 6, e1811 (2012).
Article CAS PubMed PubMed Central Google Scholar
Chen, S.-B. et al. Whole-genome sequencing of a Plasmodium vivax clinical isolate exhibits geographical characteristics and high genetic variation in China-Myanmar border area. BMC Genomics 18, 131 (2017).
Article PubMed PubMed Central Google Scholar
Delgado-Ratto, C. et al. Population genetics of Plasmodium vivax in the Peruvian Amazon. PLoS Negl. Trop. Dis. 10, e0004376 (2016).
Article PubMed PubMed Central Google Scholar
Creators Lee, Stuart1 Bahlo, Melanie1 Show affiliations 1. Walter and Eliza Hall Institute. Moimix: An R Package for Assessing Clonality in High-Througput Sequencing Data. https://doi.org/10.5281/zenodo.58257.
Paradis, E. Analysis of Phylogenetics and Evolution with R (Springer Science & Business Media, 2011).
Chang, H.-H. et al. THE REAL McCOIL: a method for the concurrent estimation of the complexity of infection and SNP allele frequency for malaria parasites. PLoS Comput. Biol. 13, e1005348 (2017).
Article PubMed PubMed Central Google Scholar
Miles, A. et al. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome Res. 26, 1288–1299 (2016).
Article CAS PubMed PubMed Central Google Scholar
Dewasurendra, R. L. et al. Use of a Plasmodium vivax genetic barcode for genomic surveillance and parasite tracking in Sri Lanka. Malar. J. 19, 342 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ba, H. et al. Multi-locus genotyping reveals established endemicity of a geographically distinct Plasmodium vivax population in Mauritania, West Africa. PLoS Negl. Trop. Dis. 14, e0008945 (2020).
Article CAS PubMed PubMed Central Google Scholar
Schaffner, S. F., Taylor, A. R., Wong, W., Wirth, D. F. & Neafsey, D. E. hmmIBD: software to infer pairwise identity by descent between haploid genotypes. Malar. J. 17, 196 (2018).
Article PubMed PubMed Central Google Scholar
Miotto, O. et al. Genetic architecture of artemisinin-resistant Plasmodium falciparum. Nat. Genet. 47, 226–234 (2015).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The study was supported by the National Health and Medical Research Council of Australia (APP2001083 supporting S.A. and S.V.S.), the Wellcome Trust (200909 and ICRG GR071614MA Senior Fellowships in Clinical Science to R.N.P., 206194/Z17/Z supporting J.C.R. and S.V.S.), the National Institutes of Health (R01AI137154 to J.C.R.) and the Bill & Melinda Gates Foundation (INV-043618 supporting S.A. and R.N.P.). The whole genome sequencing component of the study was supported by the Medical Research Council and UK Department for International Development (award number M006212 to D.K.) and the Wellcome Trust (award numbers 206194 and 204911 to D.K.). The IMPROV clinical trial was supported by the Bill & Melinda Gates Foundation (OPP1054404 awarded to R.N.P.). We thank the patients who contributed their samples to the study, and the health workers and field teams who assisted with the sample collections. Genome sequencing was undertaken by the Wellcome Sanger Institute, and we thank the staff of the Wellcome Sanger Institute Sample Logistics, Sequencing, and Informatics facilities for their contribution.

Author information

Deceased: Dominic P. Kwiatkowski.

Authors and Affiliations

Wellcome Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
Sasha V. Siegel, Roberto Amato, Kathryn Murie, Georgia Whitton & Dominic P. Kwiatkowski
Menzies School of Health Research and Charles Darwin University, Darwin, Northern Territory, 0811, Australia
Sasha V. Siegel, Hidayat Trimarsanto, Mariana Kleinecke, Ric N. Price & Sarah Auburn
Eijkman Research Center for Molecular Biology, National Research and Innovation Agency, Jakarta, 10430, Indonesia
Hidayat Trimarsanto
Institut Pasteur, University de Paris, Infectious Disease Epidemiology and Analytics Unit, Paris, France
Aimee R. Taylor
Exeins Health Initiative, Jakarta Selatan, 12870, Indonesia
Edwin Sutanto
Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7LJ, UK
James A. Watson, Nicholas J. White, Nicholas Day, Ric N. Price & Sarah Auburn
Oxford University Clinical Research Unit, Hospital for Tropical Diseases, 764 Vo Van Kiet, W.1, Dist.5, Ho Chi Minh City, Vietnam
James A. Watson, Hoang Chau Nguyen & Tinh Hien Tran
Department of Molecular Tropical Medicine and Genetics, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand
Mallika Imwong
Ethiopian Public Health Institute, Addis Ababa, Ethiopia
Ashenafi Assefa
Institute for Global Health and Infectious Diseases, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
Ashenafi Assefa
Mahidol‐Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, 10400, Thailand
Awab Ghulam Rahim, Nicholas J. White, Nicholas Day & Ric N. Price
Afghan International Islamic University, Kabul, Afghanistan
Awab Ghulam Rahim
Formerly GlaxoSmithKline, Brentford, UK
Justin A. Green
Department of Infectious Diseases, Northwick Park Hospital, Harrow, UK
Gavin C. K. W. Koh
Cambridge Institute for Medical Research, University of Cambridge, Hills Road, Cambridge, CB2 0XY, UK
Julian C. Rayner

Contributions

S.A., S.V.S., R.N.P., and J.C.R. conceived the study. S.V.S., H.T., R.A., and S.A. designed major components of the study. S.A., S.V.S., H.T., and R.A. wrote the original drafts for major sections of the manuscript. S.V.S., H.T., R.A., K.M., E.S., M.K., G.W., and S.A. conducted data analysis. A.R.T., J.A.W., D.P.K., J.C.R., and R.N.P. contributed critical guidance and tools for the analytical methods and data interpretation. D.P.K. and J.C.R. contributed sequencing, data production and informatic support. M.I., A.A., A.G.R., N.H.C., T.T.H., J.A.G., G.C.K.W.K., N.J.W., N.D., D.P.K., J.C.R., and R.N.P. contributed essential field-based malaria collections and metadata.

Corresponding author

Correspondence to Sarah Auburn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Prashant Mallick, Shazia Ruybal-Pesántez and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Information

Supplementary Dataset 1

Supplementary Dataset 2

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Siegel, S.V., Trimarsanto, H., Amato, R. et al. Lineage-informative microhaplotypes for recurrence classification and spatio-temporal surveillance of Plasmodium vivax malaria parasites. Nat Commun 15, 6757 (2024). https://doi.org/10.1038/s41467-024-51015-3

Received: 17 April 2023
Accepted: 25 July 2024
Published: 08 August 2024
Version of record: 08 August 2024
DOI: https://doi.org/10.1038/s41467-024-51015-3
