Abstract
During infection with Plasmodium falciparum in pregnancy, parasites express a unique virulence factor, VAR2CSA, that mediates binding of infected red blood cells to the placenta. A major goal in designing vaccines to protect pregnant women from malaria is to elicit antibodies to VAR2CSA. The challenge is that VAR2CSA is highly polymorphic and identifying conserved epitopes is essential to elicit strain-transcending immunity. Unexpectedly, a mouse monoclonal antibody, 3D10, raised against region II of the unrelated Duffy binding protein from P. vivax (DBPII) cross-reacts with diverse alleles of VAR2CSA in vitro, suggesting that epitopes may be shared across this family of ‘Duffy binding-like’ (DBL) proteins. Peptide arrays spanning four DBL proteins from two Plasmodium spp, including two alleles of VAR2CSA, DBPII, and PvEBP2 (as a negative control), were screened with 3D10 but the data were too complex to manually identify common epitope sequences. As such, we designed a machine learning framework to analyse the array data. We applied decision trees to extract features correlated to 3D10 binding and evaluated the model on an independent dataset for a rodent Plasmodium DBL protein (PcDBP). Next, we analysed patterns of the features predicted by the model to be strongly associated with 3D10 binding and designed mutant peptides to test complex sequence motifs. Features associated with 3D10 reactivity were mapped onto predicted 3D structures of Plasmodium proteins and validated based on 3D10 reactivity to the recombinant antigens. While the array data identified certain linear epitopes, the framework predicted other epitopes to be conformational. This was demonstrated with PcDBP; as predicted by the model, no linear peptides reacted strongly with 3D10, yet the folded protein was recognized by the antibody in a conformation-dependent manner. With this approach, peptide array data can be mined to extract physicochemical properties of epitopes recognized by cross-reactive antibodies.
Similar content being viewed by others
Introduction
Malaria caused by Plasmodium falciparum is among the deadliest infectious diseases and results in more than half a million deaths per year1. Pregnant women compose a large portion of the at-risk population (>36 million pregnancies) with over 12 million infected in sub-Saharan Africa annually2. Vaccines to protect pregnant women from malaria aim primarily to elicit antibodies to the P. falciparum virulence factor VAR2CSA that is uniquely expressed during infection in pregnancy3. VAR2CSA is expressed on the surface of infected erythrocytes and binds to chondroitin sulfate A (CSA)-specific proteoglycans on the placenta syncytiotrophoblast4, disrupting the exchange of oxygen and nutrients from the maternal blood to the fetus5. Placental malaria is associated with poor maternal and birth outcomes, including severe maternal anaemia, low birthweight infants, pre-term birth and fetal growth restriction6, underscoring the importance of developing vaccines specific to this population.
Vaccines based on recombinant fragments of VAR2CSA showed promise in pre-clinical studies but failed to elicit broadly inhibitory antibodies in Phase I trials7,8. A key bottleneck for these vaccines is the extensive genetic diversity of the VAR2CSA alleles; thus, efforts are underway to identify conserved epitopes for next-generation vaccines7,8. One approach, common with genetically diverse pathogens like influenza and HIV, is to study the epitopes recognized by antibodies acquired naturally during infection9. Typically, VAR2CSA antibodies are acquired in a parity-dependent manner and are associated with protection from placental malaria10. However, antibodies to VAR2CSA were also reported in certain non-pregnant populations11,12,13. In Colombia and Brazil, men and children had antibodies to VAR2CSA at similar levels to pregnant women of all gravidities11,14. Based on the local epidemiology of malaria, we proposed that exposure to the Plasmodium vivax species, and specifically to the P. vivax Duffy binding protein (region 2; DBPII), can elicit antibodies that cross-react with VAR2CSA14. In support of this hypothesis, a mouse monoclonal antibody raised against DBPII, called 3D10, cross-reacts with VAR2CSA and moderately inhibits the adhesion to CSA of infected erythrocytes expressing different alleles of VAR2CSA14,15. These findings suggest a conserved epitope may be shared between these evolutionarily distinct proteins, which could be exploited to develop a vaccine against VAR2CSA14. A unifying trait between DBPII and VAR2CSA is the presence of Duffy binding like (DBL) domains, which are common to several malaria virulence factors used for invasion or sequestration.
Various studies describe the epitope of 3D10 in the DBPII region of PvDBP based on mutational analysis, phage display libraries, mimotopes, peptide arrays, and alanine scanning16,17,18. A discrete binding region was identified within subdomain 1 (SD1) that includes the amino acids NxxRKR and/or YK(R/Y/E). Yet a search for a similar motif in VAR2CSA failed to identify a homologous epitope, highlighting a common shortfall in mapping discrete epitopes recognized by cross-reactive antibodies19. Our screening of VAR2CSA by peptide array with 3D10 provided no clarity with regards to a conserved epitope between VAR2CSA and DBPII20. The array data revealed one major binding site along with several other clusters of highly reactive peptides within the VAR2CSA DBL domains but none of the peptides contained the consensus 3D10 epitope from DBPII16,17. Since manual analysis of our dataset was insufficient to identify physicochemical characteristics predictive of 3D10 binding, in this study we employed machine learning to extract features of 3D10 epitopes19,21,22. We developed a regression machine learning framework to analyse 430 features from a dataset of over 4300 peptides in 4 arrays derived from DBPII, two alleles of VAR2CSA, and PvEBP2 (3D10 non-reactive). We held out a fifth dataset that tested 3D10 against peptides derived from PcDBP, a protein orthologous to PvDBP but derived from Plasmodium chabaudi. This held out dataset was used to assess performance of our model as it is similar to DBPII but does not react to 3D10. We used several methods to select hyperparameters and fit our model to our 4 arrays prior to feature selection. Using feature selection to extract physicochemical characteristics that describe 3D10 binding, we applied the selected features to generate a panel of mutated peptides to test more complex patterns of amino acids. Our analyses revealed a broader set of criteria for 3D10 epitopes in VAR2CSA that are also present, as linear and conformational epitopes, in other DBL proteins from Plasmodium. We propose our framework as a method to generate a physicochemically-based binding profile of cross-reactive antibody binding19.
Results
Machine learning and feature selection
The dataset used in fitting the machine learning algorithm prior to feature selection consists of the reactivity values (expressed as arbitrary units) of 3D10 to each peptide in the arrays derived from DBPII, PvEBP2, and 2 alleles of VAR2CSA. We then mapped the peptide array data to 3D structures of DBPII, PvEBP2, and VAR2CSA (FCR3) to visualize 3D10 reactivity to these different proteins (Fig. 1A–D). As expected, 3D10 strongly recognized an epitope in SD1 of DBPII (Fig. 1A). There were no epitopes in PvEBP2 (Fig. 1B) consistent with previous findings that 3D10 does not recognize this recombinant protein by ELISA23. In VAR2CSA, 3D10 recognized several epitopes within DBL3 and 4 (Fig. 1C) and DBL5 (Fig. 1D). Based on the intensity of 3D10 binding, we classified peptides as 3D10 reactive and 3D10 non-reactive and identified the five most reactive peptides (Supplementary Fig S1; Fig. 1E). Many of the strongest binding peptides contain a significant enrichment of lysine residues and tyrosines adjacent to lysines. However, these traits alone are insufficient to define reactive vs non-reactive peptides given that non-reactive peptides were also rich in lysines, and tyrosines next to lysines was commonly observed in the dataset (Supplementary Fig S2).
Structures generated by AlphaFold 3 (A, B) are accessible in a GitHub repository described in our data availability section. VAR2CSA (FCR3) structures were resolved by cryo-EM and are accessible at PDB ID 7JGE (C) and 7JGF (D)61. Raw median values are represented according to the colour scale and mapped to DBPII (A), PvEBP2 (B), VAR2CSA (FCR3) core region (DBL 1–4) (C), and VAR2CSA (FCR3) flexible arm region (DBL 5 and 6) (D). E One peptide within SD1 and four peptides recognized most strongly by 3D10 in VAR2CSA were selected for further analysis.
Next, we generated a panel of features (see Methods) (Supplementary Table S1) that we hypothesized were associated with 3D10 binding. These features focused on specific amino acids, plus properties such as charge, hydropathy and polarity. Structural features were included with the caveat that synthetic peptides can adopt heterogeneous structures in vitro that fail to replicate their structures within the native or recombinant protein24,25. The label (predicted variable) was the raw median 3D10 binding value of two replicates; this was selected to minimize the noise from non-reactive peptides. For feature selection, we applied the methods of feature elimination, feature utilization, and variance feature selection to our panel of 430 features, to select a small subset of features that correlated to 3D10 reactivity (see Methods) (Supplementary Table S2).
To assess the performance of a model with our hyperparameters, we tested our model on our holdout dataset which was generated under the same experimental conditions as our training sets. This dataset measured the reactivity of 3D10 screened against an overlapping library of peptides spanning PcDBP, an orthologous protein from Plasmodium chabaudi that is closely related to PvDBP and PvEBP2. The machine learning model correctly predicted that no peptides from the array would strongly react to 3D10 (Supplementary Fig. S3).
Feature validation
Based on the features selected by the model, we analysed each feature statistically to identify those that were positively associated with 3D10 binding in our dataset. We evaluated our selected features in our datasets using three methods: heatmaps of features, amino acid enrichment calculation, and analysis of amino acid distances within each peptide (see Methods).
The first set of analyses focused on positive charge, hydropathy, and sidechain energy (Fig. 2). In 3D10 reactive peptides, we observed that lysine and arginine residues were significantly enriched (Fig. 2A). There was also a tendency for positive residues to cluster at the C-terminus and for non-polar residues to cluster at the N-terminus of 3D10 reactive peptides (Fig. 2B)26,27. We hypothesized that N-terminal modifications of VAR2CSA peptides that preserved the C-terminal positive clusters observed in P2–P5 may increase reactivity as they will be more like P1, the minimum binding peptide from DBPII. To test this hypothesis, we designed subsets of mutated peptides for four different tests (Fig. 2C). For each subset, we applied different approaches to adjust the relative hydropathy and sidechain energy of P2–P5 to be closer to P1. In subset A, we tested if hybrid peptides that combined the N-terminal residues of P1 with the C-terminal positive clusters of P2–P5 were sufficient or enhanced binding to 3D10; this would indicate N-terminal sidechain energy and hydropathy contribute to the differences in reactivity between peptides. In tests B, C, and D, we mutated residues in peptides P2–P5 to set the overall peptide sidechain energy and/or hydropathy to be similar to P1. We used either chemically-related amino acids, or a minimal number of amino acids while avoiding (where possible) changes to positively charged residues (Fig. 2C). We tested mutated peptides by indirect ELISA and binding was compared to the non-mutated ‘wildtype’ peptides (Fig. 2D). We observed that the N-terminal modifications in P2–P5 decreased binding in all subsets except when modifications decreased the total number of negative residues. Additionally, the clusters of positive residues at the affixed C-terminus, despite being strongly associated with 3D10 reactive peptides, were not sufficient to explain 3D10 reactivity to peptides P2–P5. Moreover, sidechain energy and hydropathy patterns at the C-terminus that correlate to 3D10 reactive peptides involve a more complex signature than what is observed in our heatmaps (Fig. 2B)26,27.
A Frequency of charged residues in 3D10 reactive and non-reactive peptides was compared. Significance was calculated by single factor ANOVA with a classic Bonferroni correction (α = 0.05). B Heatmap of positive charge, sidechain energy, and hydropathy in 3D10 reactive peptides. Each cell indicates a residue position from the N-terminus at the top to the C-terminus at the bottom. The colour scale for each table ranges from the maximum to minimum for each feature as indicated above each column. C Table of peptide sequences with mutated residues coloured red. Peptides are grouped by different tests. D Mutated peptide reactivity relative to wildtype. Bars represent mean values plus standard deviation. Bars are coloured relative to their test group.
Our analysis suggested that clusters of positive charge at the C-terminus were insufficient for 3D10 reactivity, and reactivity was affected by N-terminal mutations modifying net-charge (Fig. 2C, D). To expand on these observations, we increased the depth of positive charge analysis and characterized sequence patterns relative to enriched positive residues (Fig. 3A). Consistent with the heatmaps of 3D10 reactive peptides, there was an increased likelihood of lysine residues positioned adjacent to arginine residues (Figs. 2B and 3A). Tyrosine residues were significantly more likely to occur adjacent to lysine residues and significantly less likely to occur distal from lysine residues; this corresponds with the selection of the ‘KY count’ feature (Fig. 3B). Further, lysine residues had a higher propensity to occur one amino acid away from other lysine residues (KxK) in 3D10 reactive peptides (Fig. 3D). ‘YK’ was not selected as a feature despite being present in P1. To determine if the absence of ‘YK’ in selected features was a function of the fitting datasets, we swapped the position of lysine and tyrosine residues to create the opposite feature (Fig. 3D). We also decreased the ‘KY’ count by increasing the distance between lysine and arginine (test B). Again, we focused on C-terminal ‘K’ and ‘Y’ residues as our top-binding peptides consistently contained C-terminal ‘KY’ motifs. To test N-terminal positive residues, we focused on P3 as it contained positive residues with consistent spacing allowing us to explore the interchangeability of lysine and arginine residues and the requirement for specific lysine residues (tests C, D, and E). Finally, we tested whether tyrosine could be interchanged with other aromatic residues (test F) (Fig. 3C, E). Consistently, we observed that despite the chemical similarities of arginine and lysine or phenylalanine and tyrosine, it was essential that lysine was adjacent to tyrosine for high 3D10 reactivity. This highlights that 3D10 binding requires the specific amino acid motif: ‘KY’/ ‘YK’.
A–C Log odds ratio of distances between arginine to lysine, tyrosine to lysine, or lysine to lysine residues comparing observed proportion to expected value in 3D10 reactive and non-reactive peptides. Missing data points indicate a distance value occurred zero times and is significantly less common than expected (see Methods). D Table of peptide sequences with mutated residues coloured red. Peptides are grouped by different tests. E Mutated peptide reactivity relative to wildtype. Bars represent mean values plus standard deviation. Bars are coloured relative to their test group.
For several of our selected features, the statistical analyses did not uncover significant differences between 3D10 reactive and non-reactive peptides (Supplementary Table S2). However, their selection by the decision trees suggested they could contribute to 3D10 reactivity. These include polar and non-polar residues, specifically VM, HL, HA, HM, HF, MT, HV, and asparagine counts. Also, there was a tendency for some polar and non-polar residues to occur less frequently in 3D10 reactive peptides but were not statistically significant except for tyrosine count in 3D10 reactive peptides (Fig. 4A). It is unclear if reduced frequency in some residues is due to the proportional increase in positive residues like lysine and arginine (Fig. 2A). Additionally, we were unable to determine a distinct pattern relative to hydropathy by broadly adjusting the hydropathy of 3D10 reactive peptides (Fig. 2B, D). This may highlight a more complex relationship that varies from peptide to peptide. To address this, we mutated minimal polar and non-polar residues into various positions to determine their importance. For tests A, B, and C, we used serine, methionine, and alanine: serine was present in all top-binding peptides, methionine was present in the frequently observed ‘VM’ feature, and alanine appeared in the ‘HA’ feature and is a small amino acid (Fig. 4B). The results of this testing failed to establish a specific pattern of polar residues relative to other aspects of the 3D10 epitope (Fig. 4C). These results suggest that polarity is not a sufficient criterion to distinguish 3D10 reactive peptides from 3D10 non-reactive peptides.
A The frequency of residues in 3D10 reactive and non-reactive peptides was compared. Significance was analysed by single factor ANOVA with a classic Bonferroni correction (α = 0.05). B Table of peptide sequences with mutated residues coloured red. Peptides are grouped by different tests. C Mutated peptide reactivity relative to wildtype. Bars represent mean values plus standard deviation. Bars are coloured relative to their test group.
By integrating the results from all mutant peptide sets, we propose the following six criteria for highest 3D10 reactivity: 1) a lysine adjacent to a tyrosine; 2) a lysine one residue away from another lysine; 3) clusters of lysines and arginines; 4) a polar residue proximal to the criteria 1/2/3; 5) minimal negative residues within or between the above criteria; and 6) criteria 1–4 are proximal to one another. The first criterion, a tyrosine adjacent to lysine, is present in 40 of 43 3D10 reactive peptides and only one amino acid away in the remaining peptides. We found that this criterion is not restricted to the C-terminus (Fig. 3E). We also observed that arginine cannot replace lysine, nor can other aromatics replace tyrosine (Fig. 3E). We observed a significantly higher rate of lysine positioned one amino acid away from another lysine (Fig. 3D). Particularly, we observed that binding was diminished or enhanced with the deletion or addition of the ‘KxK’ motif, respectively (Figs. 2C, D and 3D, E). We observed that arginine and lysine residues were positioned proximal to one another, positive charge was strongly correlated to binding, and these residues clustered at the C-terminus of reactive peptides (Supplementary Table S2; Figs. 2A, B and 3A–C). Our data indicated that this clustering is not sufficient for 3D10 reactivity but modifications in this region resulted in a major loss of 3D10 reactivity (Figs. 2D and 3E). We observed that the addition or removal of certain polar residues had significant effects on reactivity, but we were unable to discern a specific positionality relative to other criteria necessary for reactivity (Figs. 2D and 4C). The addition of negative residues consistently decreased binding while the mutation of negative residues increased binding. Finally, because 3D10 strongly reacts to linear peptides, like P1, we believe that ideally, these criteria must be proximal to one another; in P1, all criteria are satisfied in the minimum possible 6 amino acids28.
3D10 binding criteria application and testing
To validate these experimentally-derived criteria within the context of a protein with three-dimensional structure, we constrained the distance between criteria to a field of ~1000 Ų and restricted our analysis to epitopes on the protein surface29,30,31. Moreover, we defined strict standards to translate criteria for 3D structure mapping. We defined clusters of arginines and lysines as 3 of either residue present within a 6 amino acid span. Criterion 2 was defined as ‘KY’ or ‘YK’ occurrences in protein sequence, and criterion 3 was defined as any occurrence of ‘KxK’ excluding ‘KDK’ or ‘KEK’. ‘Proximal’ was defined as continuous on the surface structure without negative residues between criteria 1, 2, or 3. This standard eliminated ‘KEK’ and ‘KDK’ from consideration. Criterion 4 was interpreted to be the occurrence of a polar residue continuous with any of criteria 1, 2, or 3.
Using our experimentally derived criteria, we classified protein segments that satisfied all six criteria as ‘signature regions’. In our analysis of VAR2CSA (FCR3 allele), we observed 8 signature regions across the protein (Fig. 5A). We identified 7 signatures in the VAR2CSA core region DBL 1–4. One signature includes the sequences from P2 and P3. P4 was not associated with any of the signatures because it was not surface accessible and does not satisfy criterion 2. One signature was identified in DBL5 within the VAR2CSA arm region and mapped to P5. Surprisingly, several of the signature regions are discontinuous and predicted to be conformational epitopes (Fig. 5A). To further validate both the specificity of this approach and the breadth of cross-reactivity of 3D10, we mapped the 3D10 binding criteria onto other DBL proteins from Plasmodium (Fig. 5B–E)32. We used the DBPII region of PvDBP as a reference, and as expected, the criteria mapped to the known epitope in SD1 (Fig. 5B)17. Consistent with our array data and experimental data, PvEBP2 was not predicted to have any signature regions despite containing a DBL domain (Figs. 1B and 5C). Surprisingly, when we applied the criteria to the model of PcDBP, we identified a single conformational signature corresponding to the same region in DBPII (Fig. 5D). EBA-175 from Plasmodium falciparum also shares significant genetic similarity to the DBL domain of DBPII and contains 2 predicted signature regions: one predominantly linear and one conformational (Fig. 5E).
Structures generated by AlphaFold 3 (A–D) are accessible in a GitHub repository described in our data availability section. EBA-175 structure was resolved by crystallography and is accessible at PDB ID 1ZRL (E)63. A–E Surface-exposed regions containing a polar residue (magenta), a ‘KxK’ motif (yellow), a ‘KY’/‘YK’ motif (cyan), and a positive cluster region (green) that are uninterrupted by other residues or negative residues (grey) within a 28-Angstrom diameter was defined as satisfying the criteria. Criteria mapped onto the structure of VAR2CSA (FCR3) (A), DBPII (B), PvEBP2 (C), PcDBP (D), EBA-175 (E). Regions that satisfy the criteria are highlighted and numbered relative to their order in the protein sequence and assigned a specific colour. Boxes around specific amino acids indicate homologous regions between DBPII (B), PvEBP2 (C), and PcDBP (D). F Reactivity of 3D10 to DBL proteins and effect of DTT on protein recognition.
To validate the predicted binding signatures and test the specificity of these criteria, we tested 3D10 against a panel of recombinant proteins by ELISA (Fig. 5F). As expected, VAR2CSA and DBPII, but not PvEBP2, were recognized by 3D1014,23,33. Both PcDBP and EBA-175 were strongly recognized by 3D10, while two other Plasmodium, PfMSP1 and PvMSP1-19, were not. An exciting outcome of the algorithm is that the predicted epitope in PcDBP and one of two predicted epitopes in EBA-175 may be conformational. We tested whether disruption of the disulfide bonds in these DBL domains would reduce binding of 3D10 (Fig. 5F). Reactivity to the full-length VAR2CSA was unaffected by DTT, while there was a partial reduction in reactivity to DBPII. Reactivity to EBA-175 increased, suggesting DTT treatment may have revealed linear epitopes that are not surface-exposed. Of specific interest, DTT reduced binding of 3D10 to PcDBP, consistent with this epitope being conformational.
Discussion
We propose a framework for integrating machine learning to enhance epitope mapping with peptide arrays (Fig. 6). In this pipeline, peptide array data were combined with feature selection to extract features associated with antibody binding (positively or negatively); statistical analysis and empirical testing down-selected specific binding criteria, which were then mapped onto 3D protein structures to predict linear and conformational epitopes. We identified six criteria for 3D10 binding which mapped to predicted epitopes only in those proteins that were recognized by 3D10 experimentally. These criteria are consistent with the 3D10 epitope in DBPII requiring the amino acids ‘NxxRKR’ and/or ‘YK(R/Y/E)’ determined by mutational analysis, arrays, and mimotopes16,17. Importantly, our criteria identified additional physicochemical features that broaden the epitope beyond this simple motif, which in turn, may explain the cross-reactivity of 3D10 with VAR2CSA.
When we mapped the criteria to the VAR2CSA structure, we identified 8 predicted epitopes. Two of these correspond to a linear sequence in DBL3X that matches two peptides (P2 and P3) recognized most strongly by 3D10 in the array. Surprisingly, the other 6 epitopes are predicted to be conformational. These findings suggest that the framework may reveal conformational epitopes based on linear array data. This was evident with the related DBL protein, PcDBP, which was recognized by 3D10 by ELISA and western blot (data not shown) but failed to bind to any linear peptides in an array. Since DBL domains are structurally stabilized by disulphide bonds34,35, we used DTT to disrupt the sole conformational epitope and indeed, recognition by 3D10 was reduced. While DTT reduced the binding of 3D10 to PcDBP, we noted that for EBA-175, binding increased. We propose that DTT treatment can expose other linear epitopes that are not surface-exposed.
One concern is that 3D10 is a polyreactive antibody that binds promiscuously to all DBL proteins. Analysis of DBL domains from Plasmodium proteins suggests a high likelihood for specific signatures such as ‘KxK’ and positive clusters despite significant sequence differences and genetic diversity36,37. However, our data with PvEBP2 (which shares 36% sequence identity and 52% sequence similarity with DBPII (Supplementary Fig S4)) support the specificity and accuracy of our 3D10 binding criteria. 3D10 does not bind to the PvEBP2 recombinant protein nor to any of its peptides within our array. When we applied the six criteria for binding to the 3D structure of PvEBP2, no epitopes were predicted. The implication is that the criteria generated by our framework and the requirement for those criteria to be satisfied within a defined physical space on the surface of the protein were specific enough to differentiate reactive and non-reactive recombinant DBL domains.
If criteria are not sufficiently restrictive, the framework will be subject to a high rate of false positive signature prediction. For example, our criterion 4 concerning polar residues did not eliminate any signatures for our proteins, and this criterion’s low specificity was not meaningful for predicting 3D10 binding. If we established a definite pattern for the position of polar residues for 3D10 reactivity, we might eliminate one or several potential signatures in VAR2CSA. The challenge with generating additional criteria is the inherent overlap between some features; mutating a charged residue will inevitably change the net polarity and structural dynamics of the peptide.
The generation of restrictive criteria depends on the design and size of the peptide arrays and feature space used to train machine learning models. Although we used arrays specific for target antigens, non-natural randomized peptide arrays could be screened with an antibody of interest to extract features associated with binding or performed in addition to target antigens. The ideal panel would have high variance with respect to physicochemical features selected by machine learning and several peptides with varying levels of reactivity. A peptide array without sufficient variance in physicochemical features will be more difficult to generalize and cause overfitting. The variance in peptides relative to their physicochemical features should ideally be present in both reactive and non-reactive peptides. The quality of features will also affect the outcomes of machine learning; features attempting to predict the structure of peptides will likely be less correlative to antibody reactivity due to peptide structures being dynamic in solution. We included structural features while training our model as they describe the physicochemical properties and types of residues in a peptide and may describe structural tendencies as opposed to a definite structure. Additionally, conformationally-constrained peptide arrays could be incorporated into testing and feature space to enable the consideration of certain structural elements during criteria development38,39,40.
Based on our findings, we propose that 3D10 is an example of a cross-reactive monoclonal antibody, a type of antibody specific for a few antigens related via their DBL domains. Recent work has labelled similar antibodies as “super antibodies” or “promiscuous” and note that these antibodies form a population distinct from typical highly specific antibodies9,19,41,42. This type of antibody is associated with antigenically variable pathogens; much like DBL represents an antigenically variable domain present in many malaria proteins important for pathogenesis35. Despite consistent structural elements (such as interdomain and intradomain disulphide bonds), DBL domains of pathogenic Plasmodium proteins are genetically diverse36. A review by Walker et al. 2018 highlighted the association between antigenic variability and the generation of broadly neutralizing antibody responses9. For example, HIV and influenza often elicit antibodies with cross-reactivity to related antigens, supporting development of ‘universal’ vaccines for these immune-evasive pathogens43,44,45,46,47. These antibodies make up a fraction of the total pool of antibodies but have become a focus in the development of vaccines against immune evading pathogens like HIV48. Efforts are tempered by concerns that these antibodies may be difficult to elicit consistently, and not all are broadly neutralizing49. Our framework and approach may serve as an effective platform for the comparison of these antibodies to one another and to identify common elements between them. As more broadly neutralizing antibodies are discovered against HIV, our framework can be used to determine physicochemical patterns that are associated with high breadth and potency of neutralization. The conserved structural and physicochemical properties of multiple binding regions for cross-reactive monoclonal antibodies can be exploited to effectively target antigenically variable pathogens with vaccines.
If applied to polyclonal sera, this framework can also be used to extract features associated with immunodominance. For specific pathogens such as influenza virus, the criteria could define specific immunodominant features within the HA head region (for example) that elicit only strain-specific antibodies, resulting in low vaccine efficacy50. Targeted epitope masking strategies could be applied to reduce the immunogenicity of these sites and stimulate responses to subdominant, protective epitopes.
The key utility in this framework is a new interpretation of peptide array data. Specifically, it can analyse peptide array data as a physicochemical characterization of the interface between an antibody and thousands of different antigens rather than a tool to map discrete epitopes by residue. Machine learning enables the simultaneous analysis of these antibody-antigen interfaces to reveal complex features of epitopes that extend beyond identifying discrete amino acid motifs. Validation using HDX, cryo-EM, and site-directed mutagenesis, along with testing novel antibodies and datasets, is needed to independently verify the epitopes predicted by this framework. This approach to cross-reactive antibody characterization presents a marked shift from existing methods to better understand their target epitopes and can be applied to both linear and conformational epitopes.
Methods
Peptide array and data processing
Synthetic peptide arrays were synthesized by PepperPrint™ (Germany) using the PEPperCHIP protocol to develop overlapping arrays: VAR2CSA (FCR3) (GenBank: AAQ73926.1, accession: AY372123.1, residues 1-2659), VAR2CSA (NF54) (GenBank: EWC87419.1, accession: KE123842.1, residues 1-2652), DBPII, PcDBP, and PvEBP2. Peptides were 20 amino acids in length and there was a 19 amino acid overlap between peptides in all arrays except VAR2CSA (NF54) for which there was an 18 amino acid overlap.
For each data point, we used the average of the median foreground intensities for each spot (arbitrary units). For each peptide, binding was measured in duplicate (V1 and V2), and the average raw median intensity was used except for when the difference between the lesser of the two values was less than or equal to 40% the larger of the two values.
Additionally, some peptides occurred in both VAR2CSA (FCR3) and VAR2CSA (NF54); in this case, the highest of the two averaged values were taken. This resulted in a total of 4378 data points being left; 215 duplicate data points were identified and the lower raw median intensity replicant was removed. We employed a training-validation split of 1/3 to 2/3 to randomly distribute the data. That is, 2/3 of the datapoints were randomly selected to be involved in training and validation phases of algorithm development and 1/3 of datapoints were selected to test the algorithm.
Model design and selection
The high similarity between peptides with a 19-amino acid overlap from related proteins and the low proportion of 3D10-reactive peptides present important considerations for feature selection model design. Specific training-validation splits can artificially decrease or increase error if reactive peptide examples are distributed asymmetrically. However, to extract as much data as possible on the 3D10 reactivity in our dataset, there is a need to maximize training data. We employed several strategies to minimize the impact of overfitting during machine learning model design and analysis.
We chose mean squared error (MSE) as the loss function for training the algorithm to predict reactive peptides51. In model selection, we preferred the algorithm that had the lowest MSE averaged over multiple random training-validation split instances to minimize bias introduced by a favourable data distribution. We avoided including more complex features that could identify specific peptide(s) and cause overfitting. For example, some dipeptide features are too rare in these proteins to be used to validate their impact on 3D10 reactivity in our dataset. Removal of these noisy features from the feature pool will vary between datasets and antibodies. To support the generalizability of the framework, we relied on additional testing to filter non-predictive features and synthesize more complex motifs/criteria correlated to reactivity. Specifically, feature selection, statistical analysis, and experimental validation were used to generate our final criteria. We used decision trees in our framework because they had the lowest average MSE in our comparisons, were used previously in machine learning applied to peptide array data, and are easily interpretable (Supplementary Fig. S5)52,53. We used GridSearchCV to optimize hyperparameters defined in the DecisionTreeRegressor function in Scikit-learn54. Five-fold cross-validation generated a decision tree with a maximum depth of four, two samples minimum per leaf, five samples minimum to split a node, and had a MSE of 5.9E + 5 and an R2 of 0.82. A randomly generated model from our dataset using these hyperparameters accurately predicted that none of the peptides in PcDBP reached high 3D10 reactivity (Supplementary Fig. S3).
Features
Features employed in machine learning can be placed into six categories: single amino acid counts, dipeptide counts, secondary structure, physicochemical values, charge, and disulphide bond potential (Supplementary Table S1). These values are either whole number, continuous, or binary values. Single amino acid counts are the total counts of each of the 20 different amino acids occurring in protein sequences. The secondary structure employs S4PRED to predict the secondary structure of the recombinant protein sequences, including the elongating “GSGSGSG” N-terminal and C-terminal linkers, and values were taken as the average of residue values for respective peptide sequence segments55. There were three values for secondary structure representing the odds of forming either an alpha-helix, beta-sheet, and a flexible coil. Physicochemical values are based on sums of values for amino acids from 3 different physicochemical scales: sidechain energy, hydropathy, and polarity26,27,56. Charge features were sums of positively charged residues (lysine, arginine, and histidine), negatively charged residues (glutamate and aspartate), and net charge. Disulphide bond potential is a binary feature which is set to ‘1’ if there are two cysteine residues in a peptide with two or more amino acids in between. Dipeptide counts, like single amino acid counts, is a count of total occurrences of each combination of 2 consecutive amino acids; 400 for each combination of 20 amino acids. This feature is adjusted by the hyperparameter ‘peptide trim.’ ‘Peptide trim’ for dipeptide counts defines the number of amino acids to skip before counting dipeptide features. This is employed because of the concern that the C-terminal of the peptides may be less accessible to 3D10 as this is the end attached to the glass slide. We chose not to employ this hyperparameter for other features as dipeptide count is intended to identify patterns of specific amino acids, whereas single amino acid counts, charge, etc., inform the physical properties of the whole peptide, which may affect its flexibility or be used in conjunction with other features. A ‘peptide trim’ of 4 minimized average error.
Feature selection
We employed three methods for feature selection: variance-based feature selection, feature utilization, and recursive feature elimination. Using an ensemble of features accounts for the potential of each method to be biased for missing predictive features and selecting nonpredictive features. Both variance-based feature selection and recursive feature elimination are provided by scikit-learn54. We created the feature selection method, feature utilization, for our project.
Variance-based feature selection is an unsupervised method, meaning it is agnostic to 3D10 reactivity. The goal of this method is to establish which features have high variance and, therefore, are less likely to be selected due to overfitting. Because it is not dependent on 3D10 reactivity, features selected by this method are not sufficient to qualify them for analysis. We used a threshold (\(p\)) of 1.
Feature utilization is not available in scikit-learn. Because of the random nature of our training-validation splitting, this method involves quantifying feature usage frequency from the lowest MSE decision trees of 100 tournaments of 100 randomly generated trees with new training-validation splits57. To minimize the impact of favourable training-validation splits caused by similarities between peptides in the whole dataset, we used several small tournaments. If any feature is present in a tournament-winning tree, it is defined as being ‘selected’; we did not validate features based on their frequency of utilization in ‘tournament-winning’ trees. At a tree depth of four, each tree contributes 15 features; hyperparameter optimization may produce a different tree depth for different antibodies which may affect output. This method is not highly restrictive and could include features that enable overfitting. This emphasizes the importance of using multiple methods and qualifying selected features using statistical and experimental analysis.
Recursive feature elimination recursively eliminates features to a given value until a cohort of predictive features is left. Starting with the whole feature set (n features) will eliminate a poorly correlative feature to 3D10 reactivity and yield ‘n-1’ features. It repeats this until it reaches our defined values of 5, 10, or 15 features. Fifteen features represent the maximum number of features that can be applied by a tree with depth 4.
Statistical methods and feature analysis
We used three methods to analyse features in addition to their selection: heatmapping, enrichment, and sequence patterns. For enrichment analysis, we compared 3D10 reactive peptides to 3D10 non-reactive peptides (Supplementary Fig. S1). Using these three methods, we analysed selected features or groups of overlapping features that describe a majority of 3D10 reactive peptides or for which there is a significant difference between 3D10 reactive and 3D10 non-reactive peptides. Overfitting results in some selected features corresponding to very few reactive peptides and not correlating to 3D10 reactivity. Similarly, feature overlap may have resulted in features like ‘YK’ not being selected despite being equivalent to ‘KY’ in our testing, and strongly correlating to 3D10 reactivity, emphasizing the value in grouping and further testing. Data for feature values from this dataset with respective hyperparameters are accessible in a GitHub repository described in our data availability section. For heat mapping, we determined average feature values for each of the 20 positions amino acid positions that span the length of the peptides. As the peptide array contained significantly overlapping peptides, the average value for each position in the non-reactive population was almost identical to that of the average for the whole sequence. This meant that the comparison between non-reactive peptides and reactive peptides was not meaningful and may diminish trends observed in 3D10 reactive peptides if they were reflected in the whole sequence. For example, 3D10 reactive peptides were more likely to contain positively charged residues than non-reactive peptides; the difference in their average positive charge for each residue position does not improve the characterization of positionality of positive residues. The data provided by the difference between 3D10 reactive and non-reactive peptides would be better characterized by enrichment analysis.
Amino acid enrichment compared frequencies of amino acid occurrence in 3D10 reactive peptides and 3D10 non-reactive peptides. Statistical significance was determined by 2-way ANOVA selecting for relevant comparisons and correcting for number of comparisons (only comparing values for each amino acid between reactive and non-reactive peptide groups). For example, to compare three different amino acid occurrences in either group, we corrected for three comparisons.
The method for determining amino acid patterns is similar to the method for sequence analysis known as TMSTAT58. Like TMSTAT, we do not have to assume that the distribution of amino acids across sequences is not anti-segregated or co-segregated relative to the features we are seeking to analyse. Assuming non-anti-segregated and non-co-segregated distribution assumes that amino acid distributions are homogeneous across all populations. We do not need to perform this assumption because our statistical comparisons are between observed odds and the expected odds. The expected odds, in this case, would be the likelihood of a homogeneously distributed population. For two types of amino acids, ‘A’ and ‘B’, we calculated the expected probability of a distance between an ‘A’ amino acid to the closest ‘B’ amino acid in 3D10 reactive or 3D10 non-reactive peptides, \({P}_{{Exp}}\). This value depended on the average number of ‘B’ amino acids in the respective population, \(\bar{B}\). Expected odds for any distance was calculated by bootstrapping. We simulated all possible distributions of 1–5 ‘B’ amino acids across the peptide. We determined the shortest distance between an ‘A’ amino acid to the closest ‘B’ amino acid to develop a reference table (Supplementary Table S3). For each possible distance, \({D}^{X}\), we obtained a probability as a function of \(\bar{B}\) where \(X\) denotes the discrete minimum distance between ‘A’ and the closest ‘B’ amino acid; \(X\in \left\{1,\,2,\,3,\,\ldots ,\left.19\right\}\right.\). This probability function was calculated by performing second-order polynomial regression on the bootstrapped odds.
This resulted in 19 equations for each possible value of \(X\) and enabled the prediction of expected minimum distances between ‘A’ and ‘B’ type amino acids for cases where \({B}\) was not a whole number. The observed probabilities were a proportional count for the occurrence for each distance to occur in the population \({P}_{{Obs}}\). Log odds ratio was obtained by a ratio of observed distance proportions and expected distance probabilities.
We used 95% confidence intervals as a significance cut-off and calculated using previously published methods that assume a normal distribution59. With this cut-off, we could establish if the increased likelihood for a particular distance to occur between amino acids of type ‘A’ and ‘B’ was significantly more than expected. For situations where ‘A’ and ‘B’ were the same (like lysine-to-lysine distances), we employed a correction of \(\overline{{B}_{Cor}}=\overline{B}-1\) to account for the fact that the reference lysine serving as ‘A’ will not be contributing to \(\bar{B}\) relative to the expected probability of distances. We ignored cases in which only ‘A’ or ‘B’ were present as distances between them would be indeterminant. For a specific \({D}^{X}\), we excluded plotting data points for which the \({P}_{{Obs}}\) was 0 as this meant that the observed probability was indeterminately less common than \({P}_{{Exp}}\) as the log odds ratio is equal to \(-\infty\). For graphing, we plotted \({D}^{[\mathrm{0,5}]}\) as the 95% confidence interval grows as \({P}_{{Exp}}\to 0\).
Indirect ELISA
We measured the reactivity of 3D10, a mouse IgG mAb, to various proteins by an enzyme-linked immunosorbent assay (ELISA). 96-well Maxisorb plates (catalogue no. 439454; Thermo Fisher Scientific) were coated with synthetic peptides at 10 µg/mL and recombinant proteins at either 1 µg/mL (VAR2CSA) or 0.5 µg/mL (all others) at 50 µL diluted in 1× PBS and incubated overnight at 4 °C. We added 4% BSA for 1 h at 37 °C. We washed wells 4 times with 1X PBST (0.1% Tween 20). Plates were incubated with 50 µL of 3D10 or mouse IgG1 isotype control (catalogue no. MA5-14453; Invitrogen, Waltham, MA) for 1 h at room temperature (RT) and washed four times with 1× PBST. We then incubated samples at RT for 1 h with 50 µL of HRP-conjugated goat anti-mouse HRP (catalogue no. 170-6516; Bio-Rad, Mississauga, Canada). After repeated washes, 50 µL of 3,3′,5,5′-Tetramethylbenzidine (catalogue no. T0440; Sigma-Aldrich) was added and incubated for 30 min or when reactions approached saturation. This consideration was relevant for cases in which the optimized antibody concentration for a wildtype peptide (P1–P5) resulted in saturated optical density (OD) values for mutated peptides that significantly increased reactivity. We stopped reactions with 50 µL H2SO4 (0.5 N) and measured the OD of each well at 450 nm. All samples were tested in duplicate on the same plate and repeated in at least two independent experiments. For ELISAs with proteins treated with DTT, samples were incubated with DTT (10 mM) at 56 °C for 10 min and then added to the plate. ELISAs with DTT treatments were performed in volumes of 100 µL as opposed to 50 µL.
For the ELISAs with peptides, the OD values were corrected by subtracting the average of blank wells from the raw measurements and evaluating if IgG isotype control and secondary antibody control had no reactivity (OD < 0.25). To determine relative binding, OD values for mutated peptides were measured as a ratio to the wildtype peptide from which they are derived:
For ELISAs with recombinant proteins, the OD for the wells with the isotype control was subtracted from the corresponding value with 3D10.
Peptides were synthesized by PEPMIC (China) at 70% purity. The 3D10 antibody was generously provided by Dr. John Adams. EBA-175 was obtained through BEI Resources, NIAID, NIH: Plasmodium falciparum Erythrocyte Binding Antigen-175 RII-Non-Glycosylated Protein, Recombinant from Pichia pastoris, MRA-1162, contributed by Annie Mo. PfMSP1 was purchased from CTK Biotech, San Diego, California (Catalogue no-A3003). PvMSP1-19 was obtained through BEI Resources, NIAID, NIH: Plasmodium vivax yP30P2-Pv200 MSP1-19 Protein, Recombinant from Saccharomyces cerevisiae, Strain 2905/6, MRA-60, contributed by David C. Kaslow. Full-length VAR2CSA (FV2; FCR3 allele) was generously provided by Dr. Ali Salanti. PcDBP was expressed and purified from E. coli. PvDBP was expressed in E. coli and purified by Mohammad Rafiul Hoque.
Criteria mapping and structure prediction
To map criteria onto 3D structures, we first generated structures of relevant protein segments in AlphaFold 332. For proteins that had high-resolution structures available, we used the experimentally determined structures. Only the EBA-175 structure was sourced from previously resolved structures. We used AlphaFold 3 generated structures because of the unavailability of target protein structures, or, in the case of VAR2CSA, available protein structures were insufficient. More specifically, previously resolved structures for VAR2CSA (FCR3) and (NF54) had several surface-exposed segments excluded60,61. EBA-175 had near-atomic accuracy resolution (<2.5 Å) because the structure was determined by high-accuracy methods: crystallography and solution NMR62,63.
The AlphaFold 3 generated VAR2CSA (FCR3) core region was evaluated against cryo-EM resolved structures for the same area (Supplementary Fig S5). Relative to the pruned RMSD score, these structures were sufficiently similar to one another (RMSD < 3 Å). Low resolution and a high proportion of unstructured regions may have resulted in a higher whole RMSD score (5.894 Å)64,65. Relative to our protein segment of greater than 1400 amino acids, the whole RMSD score suggests high accuracy66.
Data availability
The Github contains access to the code and files associated with the production of all figures and research data:https://github.com/SimranjitGrewal/REMMI-Resolution-of-Epitopes-by-Microarray-using-Machine-learning-IntegrationPrimary data sets are available on figshare: https://figshare.com/s/52f4dd3daa32b0b46b78.
References
Venkatesan, P. The 2023 WHO World malaria report. Lancet Microbe 5, e214 (2024).
World Health Organization. World malaria report 2024: addressing inequity in the global malaria response. World Health Organization (2024).
Lee, W. C., Russell, B. & Renia, L. Sticking for a cause: the Falciparum Malaria parasites cytoadherence paradigm. Front. Immunol. 10, 1444 (2019).
Fried, M. & Duffy, P. E. Adherence of Plasmodium falciparum to chondroitin sulfate A in the human placenta. Science 272, 1502–1504 (1996).
Ngai, M. et al. Malaria in pregnancy and adverse birth outcomes: new mechanisms and therapeutic opportunities. Trends Parasitol. 36, 127–137 (2020).
Desai, M. et al. Epidemiology and burden of malaria in pregnancy. Lancet Infect. Dis. 7, 93–104 (2007).
Mordmuller, B. et al. First-in-human, randomized, double-blind clinical trial of differentially adjuvanted PAMVAC, a vaccine candidate to prevent pregnancy-associated malaria. Clin. Infect. Dis. 69, 1509–1516 (2019).
Sirima, S. B. et al. PRIMVAC vaccine adjuvanted with Alhydrogel or GLA-SE to prevent placental malaria: a first-in-human, randomised, double-blind, placebo-controlled study. Lancet Infect. Dis. 20, 585–597 (2020).
Walker, L. M. & Burton, D. R. Passive immunotherapy of viral infections: ‘super-antibodies’ enter the fray. Nat. Rev. Immunol. 18, 297–308 (2018).
Doritchamou, J. Y. A. et al. A single full-length VAR2CSA ectodomain variant purifies broadly neutralizing antibodies against placental malaria isolates. Elife 11, https://doi.org/10.7554/eLife.76264 (2022).
Gnidehou, S. et al. Functional antibodies against VAR2CSA in nonpregnant populations from colombia exposed to Plasmodium falciparum and Plasmodium vivax. Infect. Immun. 82, 2565–2573 (2014).
Beeson, J. G. et al. Antibodies among men and children to placental-binding Plasmodium falciparum-infected erythrocytes that express var2csa. Am. J. Trop. Med. Hyg. 77, 22–28 (2007).
Oleinikov, A. V. et al. A plasma survey using 38 PfEMP1 domains reveals frequent recognition of the Plasmodium falciparum antigen VAR2CSA among young Tanzanian children. PLoS One 7, e31011 (2012).
Gnidehou, S. et al. Cross-species immune recognition between Plasmodium vivax Duffy binding protein antibodies and the Plasmodium falciparum surface antigen VAR2CSA. J. Infect. Dis. 219, 110–120 (2019).
Mitran, C. J. et al. Antibodies to cryptic epitopes in distant homologues underpin a mechanism of heterologous immunity between Plasmodium vivax PvDBP and Plasmodium falciparum VAR2CSA. mBio. 10, https://doi.org/10.1128/mBio.02343-19 (2019).
George, M. T. et al. Identification of an immunogenic broadly inhibitory surface epitope of the Plasmodium vivax Duffy binding protein ligand domain. mSphere 4, https://doi.org/10.1128/mSphere.00194-19 (2019).
Chen, E. et al. Broadly neutralizing epitopes in the Plasmodium vivax vaccine candidate Duffy Binding Protein. Proc. Natl Acad. Sci. USA113, 6277–6282 (2016).
Mitran, C. J., Higa, L. M., Good, M. F. & Yanow, S. K. Generation of a Peptide Vaccine Candidate against Falciparum Placental Malaria Based on a Discontinuous Epitope. Vaccines 8, https://doi.org/10.3390/vaccines8030392 (2020).
Laffy, J. M. J. et al. Promiscuous antibodies characterised by their physico-chemical properties: from sequence to structure and back. Prog. Biophys. Mol. Biol. 128, 47–56 (2017).
Iyamu, U. et al. A conserved epitope in VAR2CSA is targeted by a cross-reactive antibody originating from Plasmodium vivax Duffy binding protein. Front. Cell Infect. Microbiol. 13, 1202276 (2023).
Saha, S. & Raghava, G. P. BcePred: Prediction of Continuous B-Cell Epitopes in Antigenic Sequences Using Physico-chemical Properties. International Conference on Artificial Immune Systems 197–204 (2004).
Teimouri, H., Medvedeva, A. & Kolomeisky, A. B. Bacteria-specific feature selection for enhanced antimicrobial peptide activity predictions using machine-learning methods. J. Chem. Inf. Model. 63, 1723–1733 (2023).
Ntumngia, F. B. et al. A novel erythrocyte binding protein of plasmodium vivax suggests an alternate invasion pathway into duffy-positive reticulocytes. mBio 7, https://doi.org/10.1128/mBio.01261-16 (2016).
McDonald, E. F., Jones, T., Plate, L., Meiler, J. & Gulsevin, A. Benchmarking AlphaFold2 on peptide structure prediction. Structure 31, 111–119 e112 (2023).
Weng, G. et al. Comprehensive evaluation of fourteen docking programs on protein-peptide complexes. J. Chem. Theory Comput. 16, 3959–3969 (2020).
Di Rienzo, L. et al. Characterizing hydropathy of amino acid side chain in a protein environment by investigating the structural changes of water molecules network. Front. Mol. Biosci. 8, 626837 (2021).
Liang, S. & Grishin, N. V. Effective scoring function for protein sequence design. Proteins 54, 271–281 (2004).
Regenmortel, M. H. V. What is a B-cell epitope? Methods in Mol. Biol. 524, 3–20 (2009).
Ainavarapu, S. R. et al. Contour length and refolding rate of a small protein controlled by engineered disulfide bonds. Biophys. J. 92, 225–233 (2007).
Ramaraj, T., Angel, T., Dratz, E. A., Jesaitis, A. J. & Mumey, B. Antigen-antibody interface properties: composition, residue interactions, and features of 53 non-redundant structures. Biochim. Biophys. Acta 1824, 520–532 (2012).
Rubinstein, N. D. et al. Computational characterization of B-cell epitopes. Mol. Immunol. 45, 3477–3489 (2008).
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Ntumngia, F. B. et al. Conserved and variant epitopes of Plasmodium vivax Duffy binding protein as targets of inhibitory monoclonal antibodies. Infect. Immun. 80, 1203–1208 (2012).
Singh, S. K., Hora, R., Belrhali, H., Chitnis, C. E. & Sharma, A. Structural basis for Duffy recognition by the malaria parasite Duffy-binding-like domain. Nature 439, 741–744 (2006).
Howell, D. P., Samudrala, R. & Smith, J. D. Disguising itself-insights into Plasmodium falciparum binding and immune evasion from the DBL crystal structure. Mol. Biochem. Parasitol. 148, 1–9 (2006).
Benavente, E. D. et al. Global genetic diversity of var2csa in Plasmodium falciparum with implications for malaria in pregnancy and vaccine development. Sci. Rep. 8, 15429 (2018).
Smith, J. D., Subramanian, G., Gamain, B., Baruch, D. I. & Miller, L. H. Classification of adhesive domains in the Plasmodium falciparum erythrocyte membrane protein 1 family. Mol. Biochem. Parasitol. 110, 293–310 (2000).
Wei, S. et al. Binding epitope for recognition of human TRPM4 channel by monoclonal antibody M4M. Sci. Rep. 12, 19562 (2022).
Ashworth, J. et al. Peptide microarray IgM and IgG screening of pre-SARS-CoV-2 human serum samples from Zimbabwe for reactivity with peptides from all seven human coronaviruses: a cross-sectional study. Lancet Microbe 4, e215–e227 (2023).
Vashisht, K. et al. Cyclic constrained immunoreactive peptides from crucial P. falciparum proteins: potential implications in malaria diagnostics. Transl. Res. 249, 28–36 (2022).
Kaur, H. & Salunke, D. M. Antibody promiscuity: understanding the paradigm shift in antigen recognition. IUBMB Life. 67, 498–505 (2015).
Jain, D. & Salunke, D. M. Antibody specificity and promiscuity. Biochem J. 476, 433–447 (2019).
Sather, D. N. et al. Factors associated with the development of cross-reactive neutralizing antibodies during human immunodeficiency virus type 1 infection. J. Virol. 83, 757–769 (2009).
Andrews, S. F. et al. An influenza H1 hemagglutinin stem-only immunogen elicits a broadly cross-reactive B cell response in humans. Sci. Transl. Med. 15, eade4976 (2023).
Centers for Disease, C. & Prevention. Serum cross-reactive antibody response to a novel influenza A (H1N1) virus after vaccination with seasonal influenza vaccine. MMWR Morb. Mortal. Wkly Rep. 58, 521–524 (2009).
Mikell, I. et al. Characteristics of the earliest cross-neutralizing antibody response to HIV-1. PLoS Pathog. 7, e1001251 (2011).
Li, T. et al. Identification of a cross-neutralizing antibody that targets the receptor binding site of H1N1 and H5N1 influenza viruses. Nat. Commun. 13, 5182 (2022).
Burton, D. R. & Hangartner, L. Broadly neutralizing antibodies to HIV and their role in vaccine design. Annu. Rev. Immunol. 34, 635–659 (2016).
Sok, D. & Burton, D. R. Recent progress in broadly neutralizing antibodies to HIV. Nat. Immunol. 19, 1179–1188 (2018).
Sanchez-de Prada, L. et al. Immunodominance hierarchy after seasonal influenza vaccination. Emerg. Microbes Infect. 11, 2670–2679 (2022).
Reich, N. G. et al. Case study in evaluating time series prediction models using the relative mean absolute error. Am. Stat. 70, 285–292 (2016).
Li, Y. et al. Establishment of a decision tree model for diagnosis of early rheumatoid arthritis by proteomic fingerprinting. Int J. Rheum. Dis. 18, 835–841 (2015).
Takeshi, M., Go, Y., Xiaoming, H. & Yusuke, I. Signal-to-noise ratio improvement of peptide microarrays by using hyperbranched-polymer materials. J. Appl. Phys. 105 (2009).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res 12, 2825–2830 (2011).
Moffat, L. & Jones, D. T. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Bioinformatics 37, 3744–3751 (2021).
Grantham, R. Amino acid difference formula to help explain protein evolution. Science 185, 862–864 (1974).
An, C. et al. Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLoS One 16, e0256152 (2021).
Senes, A., Gerstein, M. & Engelman, D. M. Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J. Mol. Biol. 296, 921–936 (2000).
Bland, J. M. & Altman, D. G. Statistics notes. The odds ratio. BMJ 320, 1468 (2000).
Wang, K. et al. Cryo-EM reveals the architecture of placental malaria VAR2CSA and provides molecular insight into chondroitin sulfate binding. Nat. Commun. 12, 2956 (2021).
Ma, R. et al. Structural basis for placental malaria mediated by Plasmodium falciparum VAR2CSA. Nat. Microbiol. 6, 380–391 (2021).
Barth, P., Schonbrun, J. & Baker, D. Toward high-resolution prediction and design of transmembrane helical protein structures. Proc. Natl Acad. Sci. USA. 104, 15682–15687 (2007).
Tolia, N. H., Enemark, E. J., Sim, B. K. & Joshua-Tor, L. Structural basis for the EBA-175 erythrocyte invasion pathway of the malaria parasite Plasmodium falciparum. Cell 122, 183–193 (2005).
Carugo, O. How root-mean-square distance (r.m.s.d.) values depend on the resolution of protein structures that are compared. J. Appl. Crystallogr. 36, 125–128 (2003).
Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
Reva, B. A., Finkelstein, A. V. & Skolnick, J. What is the probability of a chance prediction of a protein structure with an rmsd of 6 A?. Fold. Des. 3, 141–147 (1998).
Acknowledgements
We thank Dr. Michael Good for providing valuable feedback on our manuscript.This research was funded by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01AI150944). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We acknowledge funding from a project grant from the Canadian Institutes of Health Research (CIHR-IRSC Funding Reference Number 168944). We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RGPIN-2017-04176].
Author information
Authors and Affiliations
Contributions
S.G. and S.K.Y. wrote the manuscript. S.G., U.I., D.V. and C.M. performed experiments. N.H. helped design the machine learning algorithm. All authors reviewed the manuscript prior to submission.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Grewal, S., Iyamu, U., Vinals, D.F. et al. Machine learning framework to extract physicochemical features of B-cell epitopes recognized by a cross-reactive antibody. npj Syst Biol Appl 11, 109 (2025). https://doi.org/10.1038/s41540-025-00583-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41540-025-00583-1