Machine learning framework to extract physicochemical features of B-cell epitopes recognized by a cross-reactive antibody

Grewal, Simranjit; Iyamu, Uwa; Vinals, Daniel Ferrer; Mitran, Catherine J.; Hegde, Nidhi; Yanow, Stephanie K.

doi:10.1038/s41540-025-00583-1

Download PDF

Article
Open access
Published: 02 October 2025

Machine learning framework to extract physicochemical features of B-cell epitopes recognized by a cross-reactive antibody

Simranjit Grewal¹,
Uwa Iyamu³,
Daniel Ferrer Vinals³,
Catherine J. Mitran³,
Nidhi Hegde² &
…
Stephanie K. Yanow¹^nAff3

npj Systems Biology and Applications volume 11, Article number: 109 (2025) Cite this article

1032 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

During infection with Plasmodium falciparum in pregnancy, parasites express a unique virulence factor, VAR2CSA, that mediates binding of infected red blood cells to the placenta. A major goal in designing vaccines to protect pregnant women from malaria is to elicit antibodies to VAR2CSA. The challenge is that VAR2CSA is highly polymorphic and identifying conserved epitopes is essential to elicit strain-transcending immunity. Unexpectedly, a mouse monoclonal antibody, 3D10, raised against region II of the unrelated Duffy binding protein from P. vivax (DBPII) cross-reacts with diverse alleles of VAR2CSA in vitro, suggesting that epitopes may be shared across this family of ‘Duffy binding-like’ (DBL) proteins. Peptide arrays spanning four DBL proteins from two Plasmodium spp, including two alleles of VAR2CSA, DBPII, and PvEBP2 (as a negative control), were screened with 3D10 but the data were too complex to manually identify common epitope sequences. As such, we designed a machine learning framework to analyse the array data. We applied decision trees to extract features correlated to 3D10 binding and evaluated the model on an independent dataset for a rodent Plasmodium DBL protein (PcDBP). Next, we analysed patterns of the features predicted by the model to be strongly associated with 3D10 binding and designed mutant peptides to test complex sequence motifs. Features associated with 3D10 reactivity were mapped onto predicted 3D structures of Plasmodium proteins and validated based on 3D10 reactivity to the recombinant antigens. While the array data identified certain linear epitopes, the framework predicted other epitopes to be conformational. This was demonstrated with PcDBP; as predicted by the model, no linear peptides reacted strongly with 3D10, yet the folded protein was recognized by the antibody in a conformation-dependent manner. With this approach, peptide array data can be mined to extract physicochemical properties of epitopes recognized by cross-reactive antibodies.

Immunogenicity of a Plasmodium vivax vaccine based on the duffy binding protein formulated using adjuvants compatible for use in humans

Article Open access 25 August 2023

Development of a bispecific immune engager using a recombinant malaria protein

Article Open access 06 April 2021

Allelic variants of full-length VAR2CSA, the placental malaria vaccine candidate, differ in antigenicity and receptor binding affinity

Article Open access 19 November 2021

Introduction

Malaria caused by Plasmodium falciparum is among the deadliest infectious diseases and results in more than half a million deaths per year¹. Pregnant women compose a large portion of the at-risk population (>36 million pregnancies) with over 12 million infected in sub-Saharan Africa annually². Vaccines to protect pregnant women from malaria aim primarily to elicit antibodies to the P. falciparum virulence factor VAR2CSA that is uniquely expressed during infection in pregnancy³. VAR2CSA is expressed on the surface of infected erythrocytes and binds to chondroitin sulfate A (CSA)-specific proteoglycans on the placenta syncytiotrophoblast⁴, disrupting the exchange of oxygen and nutrients from the maternal blood to the fetus⁵. Placental malaria is associated with poor maternal and birth outcomes, including severe maternal anaemia, low birthweight infants, pre-term birth and fetal growth restriction⁶, underscoring the importance of developing vaccines specific to this population.

Vaccines based on recombinant fragments of VAR2CSA showed promise in pre-clinical studies but failed to elicit broadly inhibitory antibodies in Phase I trials^7,8. A key bottleneck for these vaccines is the extensive genetic diversity of the VAR2CSA alleles; thus, efforts are underway to identify conserved epitopes for next-generation vaccines^7,8. One approach, common with genetically diverse pathogens like influenza and HIV, is to study the epitopes recognized by antibodies acquired naturally during infection⁹. Typically, VAR2CSA antibodies are acquired in a parity-dependent manner and are associated with protection from placental malaria¹⁰. However, antibodies to VAR2CSA were also reported in certain non-pregnant populations^11,12,13. In Colombia and Brazil, men and children had antibodies to VAR2CSA at similar levels to pregnant women of all gravidities^11,14. Based on the local epidemiology of malaria, we proposed that exposure to the Plasmodium vivax species, and specifically to the P. vivax Duffy binding protein (region 2; DBPII), can elicit antibodies that cross-react with VAR2CSA¹⁴. In support of this hypothesis, a mouse monoclonal antibody raised against DBPII, called 3D10, cross-reacts with VAR2CSA and moderately inhibits the adhesion to CSA of infected erythrocytes expressing different alleles of VAR2CSA^14,15. These findings suggest a conserved epitope may be shared between these evolutionarily distinct proteins, which could be exploited to develop a vaccine against VAR2CSA¹⁴. A unifying trait between DBPII and VAR2CSA is the presence of Duffy binding like (DBL) domains, which are common to several malaria virulence factors used for invasion or sequestration.

Various studies describe the epitope of 3D10 in the DBPII region of PvDBP based on mutational analysis, phage display libraries, mimotopes, peptide arrays, and alanine scanning^16,17,18. A discrete binding region was identified within subdomain 1 (SD1) that includes the amino acids NxxRKR and/or YK(R/Y/E). Yet a search for a similar motif in VAR2CSA failed to identify a homologous epitope, highlighting a common shortfall in mapping discrete epitopes recognized by cross-reactive antibodies¹⁹. Our screening of VAR2CSA by peptide array with 3D10 provided no clarity with regards to a conserved epitope between VAR2CSA and DBPII²⁰. The array data revealed one major binding site along with several other clusters of highly reactive peptides within the VAR2CSA DBL domains but none of the peptides contained the consensus 3D10 epitope from DBPII^16,17. Since manual analysis of our dataset was insufficient to identify physicochemical characteristics predictive of 3D10 binding, in this study we employed machine learning to extract features of 3D10 epitopes^19,21,22. We developed a regression machine learning framework to analyse 430 features from a dataset of over 4300 peptides in 4 arrays derived from DBPII, two alleles of VAR2CSA, and PvEBP2 (3D10 non-reactive). We held out a fifth dataset that tested 3D10 against peptides derived from PcDBP, a protein orthologous to PvDBP but derived from Plasmodium chabaudi. This held out dataset was used to assess performance of our model as it is similar to DBPII but does not react to 3D10. We used several methods to select hyperparameters and fit our model to our 4 arrays prior to feature selection. Using feature selection to extract physicochemical characteristics that describe 3D10 binding, we applied the selected features to generate a panel of mutated peptides to test more complex patterns of amino acids. Our analyses revealed a broader set of criteria for 3D10 epitopes in VAR2CSA that are also present, as linear and conformational epitopes, in other DBL proteins from Plasmodium. We propose our framework as a method to generate a physicochemically-based binding profile of cross-reactive antibody binding¹⁹.

Results

Machine learning and feature selection

The dataset used in fitting the machine learning algorithm prior to feature selection consists of the reactivity values (expressed as arbitrary units) of 3D10 to each peptide in the arrays derived from DBPII, PvEBP2, and 2 alleles of VAR2CSA. We then mapped the peptide array data to 3D structures of DBPII, PvEBP2, and VAR2CSA (FCR3) to visualize 3D10 reactivity to these different proteins (Fig. 1A–D). As expected, 3D10 strongly recognized an epitope in SD1 of DBPII (Fig. 1A). There were no epitopes in PvEBP2 (Fig. 1B) consistent with previous findings that 3D10 does not recognize this recombinant protein by ELISA²³. In VAR2CSA, 3D10 recognized several epitopes within DBL3 and 4 (Fig. 1C) and DBL5 (Fig. 1D). Based on the intensity of 3D10 binding, we classified peptides as 3D10 reactive and 3D10 non-reactive and identified the five most reactive peptides (Supplementary Fig S1; Fig. 1E). Many of the strongest binding peptides contain a significant enrichment of lysine residues and tyrosines adjacent to lysines. However, these traits alone are insufficient to define reactive vs non-reactive peptides given that non-reactive peptides were also rich in lysines, and tyrosines next to lysines was commonly observed in the dataset (Supplementary Fig S2).

**Fig. 1: 3D10 peptide array data mapped on their respective 3D protein structures.**

Next, we generated a panel of features (see Methods) (Supplementary Table S1) that we hypothesized were associated with 3D10 binding. These features focused on specific amino acids, plus properties such as charge, hydropathy and polarity. Structural features were included with the caveat that synthetic peptides can adopt heterogeneous structures in vitro that fail to replicate their structures within the native or recombinant protein^24,25. The label (predicted variable) was the raw median 3D10 binding value of two replicates; this was selected to minimize the noise from non-reactive peptides. For feature selection, we applied the methods of feature elimination, feature utilization, and variance feature selection to our panel of 430 features, to select a small subset of features that correlated to 3D10 reactivity (see Methods) (Supplementary Table S2).

To assess the performance of a model with our hyperparameters, we tested our model on our holdout dataset which was generated under the same experimental conditions as our training sets. This dataset measured the reactivity of 3D10 screened against an overlapping library of peptides spanning PcDBP, an orthologous protein from Plasmodium chabaudi that is closely related to PvDBP and PvEBP2. The machine learning model correctly predicted that no peptides from the array would strongly react to 3D10 (Supplementary Fig. S3).

Feature validation

Based on the features selected by the model, we analysed each feature statistically to identify those that were positively associated with 3D10 binding in our dataset. We evaluated our selected features in our datasets using three methods: heatmaps of features, amino acid enrichment calculation, and analysis of amino acid distances within each peptide (see Methods).

The first set of analyses focused on positive charge, hydropathy, and sidechain energy (Fig. 2). In 3D10 reactive peptides, we observed that lysine and arginine residues were significantly enriched (Fig. 2A). There was also a tendency for positive residues to cluster at the C-terminus and for non-polar residues to cluster at the N-terminus of 3D10 reactive peptides (Fig. 2B)^26,27. We hypothesized that N-terminal modifications of VAR2CSA peptides that preserved the C-terminal positive clusters observed in P2–P5 may increase reactivity as they will be more like P1, the minimum binding peptide from DBPII. To test this hypothesis, we designed subsets of mutated peptides for four different tests (Fig. 2C). For each subset, we applied different approaches to adjust the relative hydropathy and sidechain energy of P2–P5 to be closer to P1. In subset A, we tested if hybrid peptides that combined the N-terminal residues of P1 with the C-terminal positive clusters of P2–P5 were sufficient or enhanced binding to 3D10; this would indicate N-terminal sidechain energy and hydropathy contribute to the differences in reactivity between peptides. In tests B, C, and D, we mutated residues in peptides P2–P5 to set the overall peptide sidechain energy and/or hydropathy to be similar to P1. We used either chemically-related amino acids, or a minimal number of amino acids while avoiding (where possible) changes to positively charged residues (Fig. 2C). We tested mutated peptides by indirect ELISA and binding was compared to the non-mutated ‘wildtype’ peptides (Fig. 2D). We observed that the N-terminal modifications in P2–P5 decreased binding in all subsets except when modifications decreased the total number of negative residues. Additionally, the clusters of positive residues at the affixed C-terminus, despite being strongly associated with 3D10 reactive peptides, were not sufficient to explain 3D10 reactivity to peptides P2–P5. Moreover, sidechain energy and hydropathy patterns at the C-terminus that correlate to 3D10 reactive peptides involve a more complex signature than what is observed in our heatmaps (Fig. 2B)^26,27.

**Fig. 2: Binding-correlated clusters of C-terminal lysines and arginines are insufficient for 3D10 reactivity.**

Our analysis suggested that clusters of positive charge at the C-terminus were insufficient for 3D10 reactivity, and reactivity was affected by N-terminal mutations modifying net-charge (Fig. 2C, D). To expand on these observations, we increased the depth of positive charge analysis and characterized sequence patterns relative to enriched positive residues (Fig. 3A). Consistent with the heatmaps of 3D10 reactive peptides, there was an increased likelihood of lysine residues positioned adjacent to arginine residues (Figs. 2B and 3A). Tyrosine residues were significantly more likely to occur adjacent to lysine residues and significantly less likely to occur distal from lysine residues; this corresponds with the selection of the ‘KY count’ feature (Fig. 3B). Further, lysine residues had a higher propensity to occur one amino acid away from other lysine residues (KxK) in 3D10 reactive peptides (Fig. 3D). ‘YK’ was not selected as a feature despite being present in P1. To determine if the absence of ‘YK’ in selected features was a function of the fitting datasets, we swapped the position of lysine and tyrosine residues to create the opposite feature (Fig. 3D). We also decreased the ‘KY’ count by increasing the distance between lysine and arginine (test B). Again, we focused on C-terminal ‘K’ and ‘Y’ residues as our top-binding peptides consistently contained C-terminal ‘KY’ motifs. To test N-terminal positive residues, we focused on P3 as it contained positive residues with consistent spacing allowing us to explore the interchangeability of lysine and arginine residues and the requirement for specific lysine residues (tests C, D, and E). Finally, we tested whether tyrosine could be interchanged with other aromatic residues (test F) (Fig. 3C, E). Consistently, we observed that despite the chemical similarities of arginine and lysine or phenylalanine and tyrosine, it was essential that lysine was adjacent to tyrosine for high 3D10 reactivity. This highlights that 3D10 binding requires the specific amino acid motif: ‘KY’/ ‘YK’.

**Fig. 3: Lysine adjacent to tyrosine and specific positive residue patterns result in strong 3D10 reactivity.**

For several of our selected features, the statistical analyses did not uncover significant differences between 3D10 reactive and non-reactive peptides (Supplementary Table S2). However, their selection by the decision trees suggested they could contribute to 3D10 reactivity. These include polar and non-polar residues, specifically VM, HL, HA, HM, HF, MT, HV, and asparagine counts. Also, there was a tendency for some polar and non-polar residues to occur less frequently in 3D10 reactive peptides but were not statistically significant except for tyrosine count in 3D10 reactive peptides (Fig. 4A). It is unclear if reduced frequency in some residues is due to the proportional increase in positive residues like lysine and arginine (Fig. 2A). Additionally, we were unable to determine a distinct pattern relative to hydropathy by broadly adjusting the hydropathy of 3D10 reactive peptides (Fig. 2B, D). This may highlight a more complex relationship that varies from peptide to peptide. To address this, we mutated minimal polar and non-polar residues into various positions to determine their importance. For tests A, B, and C, we used serine, methionine, and alanine: serine was present in all top-binding peptides, methionine was present in the frequently observed ‘VM’ feature, and alanine appeared in the ‘HA’ feature and is a small amino acid (Fig. 4B). The results of this testing failed to establish a specific pattern of polar residues relative to other aspects of the 3D10 epitope (Fig. 4C). These results suggest that polarity is not a sufficient criterion to distinguish 3D10 reactive peptides from 3D10 non-reactive peptides.

**Fig. 4: Polar residues do not consistently predict binding of 3D10.**

By integrating the results from all mutant peptide sets, we propose the following six criteria for highest 3D10 reactivity: 1) a lysine adjacent to a tyrosine; 2) a lysine one residue away from another lysine; 3) clusters of lysines and arginines; 4) a polar residue proximal to the criteria 1/2/3; 5) minimal negative residues within or between the above criteria; and 6) criteria 1–4 are proximal to one another. The first criterion, a tyrosine adjacent to lysine, is present in 40 of 43 3D10 reactive peptides and only one amino acid away in the remaining peptides. We found that this criterion is not restricted to the C-terminus (Fig. 3E). We also observed that arginine cannot replace lysine, nor can other aromatics replace tyrosine (Fig. 3E). We observed a significantly higher rate of lysine positioned one amino acid away from another lysine (Fig. 3D). Particularly, we observed that binding was diminished or enhanced with the deletion or addition of the ‘KxK’ motif, respectively (Figs. 2C, D and 3D, E). We observed that arginine and lysine residues were positioned proximal to one another, positive charge was strongly correlated to binding, and these residues clustered at the C-terminus of reactive peptides (Supplementary Table S2; Figs. 2A, B and 3A–C). Our data indicated that this clustering is not sufficient for 3D10 reactivity but modifications in this region resulted in a major loss of 3D10 reactivity (Figs. 2D and 3E). We observed that the addition or removal of certain polar residues had significant effects on reactivity, but we were unable to discern a specific positionality relative to other criteria necessary for reactivity (Figs. 2D and 4C). The addition of negative residues consistently decreased binding while the mutation of negative residues increased binding. Finally, because 3D10 strongly reacts to linear peptides, like P1, we believe that ideally, these criteria must be proximal to one another; in P1, all criteria are satisfied in the minimum possible 6 amino acids²⁸.

3D10 binding criteria application and testing

To validate these experimentally-derived criteria within the context of a protein with three-dimensional structure, we constrained the distance between criteria to a field of ~1000 Å² and restricted our analysis to epitopes on the protein surface^29,30,31. Moreover, we defined strict standards to translate criteria for 3D structure mapping. We defined clusters of arginines and lysines as 3 of either residue present within a 6 amino acid span. Criterion 2 was defined as ‘KY’ or ‘YK’ occurrences in protein sequence, and criterion 3 was defined as any occurrence of ‘KxK’ excluding ‘KDK’ or ‘KEK’. ‘Proximal’ was defined as continuous on the surface structure without negative residues between criteria 1, 2, or 3. This standard eliminated ‘KEK’ and ‘KDK’ from consideration. Criterion 4 was interpreted to be the occurrence of a polar residue continuous with any of criteria 1, 2, or 3.

Using our experimentally derived criteria, we classified protein segments that satisfied all six criteria as ‘signature regions’. In our analysis of VAR2CSA (FCR3 allele), we observed 8 signature regions across the protein (Fig. 5A). We identified 7 signatures in the VAR2CSA core region DBL 1–4. One signature includes the sequences from P2 and P3. P4 was not associated with any of the signatures because it was not surface accessible and does not satisfy criterion 2. One signature was identified in DBL5 within the VAR2CSA arm region and mapped to P5. Surprisingly, several of the signature regions are discontinuous and predicted to be conformational epitopes (Fig. 5A). To further validate both the specificity of this approach and the breadth of cross-reactivity of 3D10, we mapped the 3D10 binding criteria onto other DBL proteins from Plasmodium (Fig. 5B–E)³². We used the DBPII region of PvDBP as a reference, and as expected, the criteria mapped to the known epitope in SD1 (Fig. 5B)¹⁷. Consistent with our array data and experimental data, PvEBP2 was not predicted to have any signature regions despite containing a DBL domain (Figs. 1B and 5C). Surprisingly, when we applied the criteria to the model of PcDBP, we identified a single conformational signature corresponding to the same region in DBPII (Fig. 5D). EBA-175 from Plasmodium falciparum also shares significant genetic similarity to the DBL domain of DBPII and contains 2 predicted signature regions: one predominantly linear and one conformational (Fig. 5E).

**Fig. 5: Mapping of proposed 3D10 binding criteria onto 3D structures of *Plasmodium* proteins.**

To validate the predicted binding signatures and test the specificity of these criteria, we tested 3D10 against a panel of recombinant proteins by ELISA (Fig. 5F). As expected, VAR2CSA and DBPII, but not PvEBP2, were recognized by 3D10^14,²³^,³³. Both PcDBP and EBA-175 were strongly recognized by 3D10, while two other Plasmodium, PfMSP1 and PvMSP1-19, were not. An exciting outcome of the algorithm is that the predicted epitope in PcDBP and one of two predicted epitopes in EBA-175 may be conformational. We tested whether disruption of the disulfide bonds in these DBL domains would reduce binding of 3D10 (Fig. 5F). Reactivity to the full-length VAR2CSA was unaffected by DTT, while there was a partial reduction in reactivity to DBPII. Reactivity to EBA-175 increased, suggesting DTT treatment may have revealed linear epitopes that are not surface-exposed. Of specific interest, DTT reduced binding of 3D10 to PcDBP, consistent with this epitope being conformational.

Discussion

We propose a framework for integrating machine learning to enhance epitope mapping with peptide arrays (Fig. 6). In this pipeline, peptide array data were combined with feature selection to extract features associated with antibody binding (positively or negatively); statistical analysis and empirical testing down-selected specific binding criteria, which were then mapped onto 3D protein structures to predict linear and conformational epitopes. We identified six criteria for 3D10 binding which mapped to predicted epitopes only in those proteins that were recognized by 3D10 experimentally. These criteria are consistent with the 3D10 epitope in DBPII requiring the amino acids ‘NxxRKR’ and/or ‘YK(R/Y/E)’ determined by mutational analysis, arrays, and mimotopes^16,17. Importantly, our criteria identified additional physicochemical features that broaden the epitope beyond this simple motif, which in turn, may explain the cross-reactivity of 3D10 with VAR2CSA.

**Fig. 6: Machine learning peptide array analysis framework.**

When we mapped the criteria to the VAR2CSA structure, we identified 8 predicted epitopes. Two of these correspond to a linear sequence in DBL3X that matches two peptides (P2 and P3) recognized most strongly by 3D10 in the array. Surprisingly, the other 6 epitopes are predicted to be conformational. These findings suggest that the framework may reveal conformational epitopes based on linear array data. This was evident with the related DBL protein, PcDBP, which was recognized by 3D10 by ELISA and western blot (data not shown) but failed to bind to any linear peptides in an array. Since DBL domains are structurally stabilized by disulphide bonds^34,35, we used DTT to disrupt the sole conformational epitope and indeed, recognition by 3D10 was reduced. While DTT reduced the binding of 3D10 to PcDBP, we noted that for EBA-175, binding increased. We propose that DTT treatment can expose other linear epitopes that are not surface-exposed.

One concern is that 3D10 is a polyreactive antibody that binds promiscuously to all DBL proteins. Analysis of DBL domains from Plasmodium proteins suggests a high likelihood for specific signatures such as ‘KxK’ and positive clusters despite significant sequence differences and genetic diversity^36,37. However, our data with PvEBP2 (which shares 36% sequence identity and 52% sequence similarity with DBPII (Supplementary Fig S4)) support the specificity and accuracy of our 3D10 binding criteria. 3D10 does not bind to the PvEBP2 recombinant protein nor to any of its peptides within our array. When we applied the six criteria for binding to the 3D structure of PvEBP2, no epitopes were predicted. The implication is that the criteria generated by our framework and the requirement for those criteria to be satisfied within a defined physical space on the surface of the protein were specific enough to differentiate reactive and non-reactive recombinant DBL domains.

If criteria are not sufficiently restrictive, the framework will be subject to a high rate of false positive signature prediction. For example, our criterion 4 concerning polar residues did not eliminate any signatures for our proteins, and this criterion’s low specificity was not meaningful for predicting 3D10 binding. If we established a definite pattern for the position of polar residues for 3D10 reactivity, we might eliminate one or several potential signatures in VAR2CSA. The challenge with generating additional criteria is the inherent overlap between some features; mutating a charged residue will inevitably change the net polarity and structural dynamics of the peptide.

The generation of restrictive criteria depends on the design and size of the peptide arrays and feature space used to train machine learning models. Although we used arrays specific for target antigens, non-natural randomized peptide arrays could be screened with an antibody of interest to extract features associated with binding or performed in addition to target antigens. The ideal panel would have high variance with respect to physicochemical features selected by machine learning and several peptides with varying levels of reactivity. A peptide array without sufficient variance in physicochemical features will be more difficult to generalize and cause overfitting. The variance in peptides relative to their physicochemical features should ideally be present in both reactive and non-reactive peptides. The quality of features will also affect the outcomes of machine learning; features attempting to predict the structure of peptides will likely be less correlative to antibody reactivity due to peptide structures being dynamic in solution. We included structural features while training our model as they describe the physicochemical properties and types of residues in a peptide and may describe structural tendencies as opposed to a definite structure. Additionally, conformationally-constrained peptide arrays could be incorporated into testing and feature space to enable the consideration of certain structural elements during criteria development^38,39,40.

Based on our findings, we propose that 3D10 is an example of a cross-reactive monoclonal antibody, a type of antibody specific for a few antigens related via their DBL domains. Recent work has labelled similar antibodies as “super antibodies” or “promiscuous” and note that these antibodies form a population distinct from typical highly specific antibodies^9,19,41,42. This type of antibody is associated with antigenically variable pathogens; much like DBL represents an antigenically variable domain present in many malaria proteins important for pathogenesis³⁵. Despite consistent structural elements (such as interdomain and intradomain disulphide bonds), DBL domains of pathogenic Plasmodium proteins are genetically diverse³⁶. A review by Walker et al. 2018 highlighted the association between antigenic variability and the generation of broadly neutralizing antibody responses⁹. For example, HIV and influenza often elicit antibodies with cross-reactivity to related antigens, supporting development of ‘universal’ vaccines for these immune-evasive pathogens^{43,44,45,46,47}. These antibodies make up a fraction of the total pool of antibodies but have become a focus in the development of vaccines against immune evading pathogens like HIV⁴⁸. Efforts are tempered by concerns that these antibodies may be difficult to elicit consistently, and not all are broadly neutralizing⁴⁹. Our framework and approach may serve as an effective platform for the comparison of these antibodies to one another and to identify common elements between them. As more broadly neutralizing antibodies are discovered against HIV, our framework can be used to determine physicochemical patterns that are associated with high breadth and potency of neutralization. The conserved structural and physicochemical properties of multiple binding regions for cross-reactive monoclonal antibodies can be exploited to effectively target antigenically variable pathogens with vaccines.

If applied to polyclonal sera, this framework can also be used to extract features associated with immunodominance. For specific pathogens such as influenza virus, the criteria could define specific immunodominant features within the HA head region (for example) that elicit only strain-specific antibodies, resulting in low vaccine efficacy⁵⁰. Targeted epitope masking strategies could be applied to reduce the immunogenicity of these sites and stimulate responses to subdominant, protective epitopes.

The key utility in this framework is a new interpretation of peptide array data. Specifically, it can analyse peptide array data as a physicochemical characterization of the interface between an antibody and thousands of different antigens rather than a tool to map discrete epitopes by residue. Machine learning enables the simultaneous analysis of these antibody-antigen interfaces to reveal complex features of epitopes that extend beyond identifying discrete amino acid motifs. Validation using HDX, cryo-EM, and site-directed mutagenesis, along with testing novel antibodies and datasets, is needed to independently verify the epitopes predicted by this framework. This approach to cross-reactive antibody characterization presents a marked shift from existing methods to better understand their target epitopes and can be applied to both linear and conformational epitopes.

Methods

Peptide array and data processing

Synthetic peptide arrays were synthesized by PepperPrint™ (Germany) using the PEPperCHIP protocol to develop overlapping arrays: VAR2CSA (FCR3) (GenBank: AAQ73926.1, accession: AY372123.1, residues 1-2659), VAR2CSA (NF54) (GenBank: EWC87419.1, accession: KE123842.1, residues 1-2652), DBPII, PcDBP, and PvEBP2. Peptides were 20 amino acids in length and there was a 19 amino acid overlap between peptides in all arrays except VAR2CSA (NF54) for which there was an 18 amino acid overlap.

For each data point, we used the average of the median foreground intensities for each spot (arbitrary units). For each peptide, binding was measured in duplicate (V1 and V2), and the average raw median intensity was used except for when the difference between the lesser of the two values was less than or equal to 40% the larger of the two values.

$${MIN}\left(V1,\,V2\right)\le {MAX}\left(V1,\,V2\right)* 0.4$$

(1)

Additionally, some peptides occurred in both VAR2CSA (FCR3) and VAR2CSA (NF54); in this case, the highest of the two averaged values were taken. This resulted in a total of 4378 data points being left; 215 duplicate data points were identified and the lower raw median intensity replicant was removed. We employed a training-validation split of 1/3 to 2/3 to randomly distribute the data. That is, 2/3 of the datapoints were randomly selected to be involved in training and validation phases of algorithm development and 1/3 of datapoints were selected to test the algorithm.

Model design and selection

The high similarity between peptides with a 19-amino acid overlap from related proteins and the low proportion of 3D10-reactive peptides present important considerations for feature selection model design. Specific training-validation splits can artificially decrease or increase error if reactive peptide examples are distributed asymmetrically. However, to extract as much data as possible on the 3D10 reactivity in our dataset, there is a need to maximize training data. We employed several strategies to minimize the impact of overfitting during machine learning model design and analysis.

We chose mean squared error (MSE) as the loss function for training the algorithm to predict reactive peptides⁵¹. In model selection, we preferred the algorithm that had the lowest MSE averaged over multiple random training-validation split instances to minimize bias introduced by a favourable data distribution. We avoided including more complex features that could identify specific peptide(s) and cause overfitting. For example, some dipeptide features are too rare in these proteins to be used to validate their impact on 3D10 reactivity in our dataset. Removal of these noisy features from the feature pool will vary between datasets and antibodies. To support the generalizability of the framework, we relied on additional testing to filter non-predictive features and synthesize more complex motifs/criteria correlated to reactivity. Specifically, feature selection, statistical analysis, and experimental validation were used to generate our final criteria. We used decision trees in our framework because they had the lowest average MSE in our comparisons, were used previously in machine learning applied to peptide array data, and are easily interpretable (Supplementary Fig. S5)^52,53. We used GridSearchCV to optimize hyperparameters defined in the DecisionTreeRegressor function in Scikit-learn⁵⁴. Five-fold cross-validation generated a decision tree with a maximum depth of four, two samples minimum per leaf, five samples minimum to split a node, and had a MSE of 5.9E + 5 and an R² of 0.82. A randomly generated model from our dataset using these hyperparameters accurately predicted that none of the peptides in PcDBP reached high 3D10 reactivity (Supplementary Fig. S3).

Features

Features employed in machine learning can be placed into six categories: single amino acid counts, dipeptide counts, secondary structure, physicochemical values, charge, and disulphide bond potential (Supplementary Table S1). These values are either whole number, continuous, or binary values. Single amino acid counts are the total counts of each of the 20 different amino acids occurring in protein sequences. The secondary structure employs S4PRED to predict the secondary structure of the recombinant protein sequences, including the elongating “GSGSGSG” N-terminal and C-terminal linkers, and values were taken as the average of residue values for respective peptide sequence segments⁵⁵. There were three values for secondary structure representing the odds of forming either an alpha-helix, beta-sheet, and a flexible coil. Physicochemical values are based on sums of values for amino acids from 3 different physicochemical scales: sidechain energy, hydropathy, and polarity^26,27,56. Charge features were sums of positively charged residues (lysine, arginine, and histidine), negatively charged residues (glutamate and aspartate), and net charge. Disulphide bond potential is a binary feature which is set to ‘1’ if there are two cysteine residues in a peptide with two or more amino acids in between. Dipeptide counts, like single amino acid counts, is a count of total occurrences of each combination of 2 consecutive amino acids; 400 for each combination of 20 amino acids. This feature is adjusted by the hyperparameter ‘peptide trim.’ ‘Peptide trim’ for dipeptide counts defines the number of amino acids to skip before counting dipeptide features. This is employed because of the concern that the C-terminal of the peptides may be less accessible to 3D10 as this is the end attached to the glass slide. We chose not to employ this hyperparameter for other features as dipeptide count is intended to identify patterns of specific amino acids, whereas single amino acid counts, charge, etc., inform the physical properties of the whole peptide, which may affect its flexibility or be used in conjunction with other features. A ‘peptide trim’ of 4 minimized average error.

Feature selection

We employed three methods for feature selection: variance-based feature selection, feature utilization, and recursive feature elimination. Using an ensemble of features accounts for the potential of each method to be biased for missing predictive features and selecting nonpredictive features. Both variance-based feature selection and recursive feature elimination are provided by scikit-learn⁵⁴. We created the feature selection method, feature utilization, for our project.

Variance-based feature selection is an unsupervised method, meaning it is agnostic to 3D10 reactivity. The goal of this method is to establish which features have high variance and, therefore, are less likely to be selected due to overfitting. Because it is not dependent on 3D10 reactivity, features selected by this method are not sufficient to qualify them for analysis. We used a threshold ($p$) of 1.

$${Var}\left|X\right|\ge p$$

(2)

Feature utilization is not available in scikit-learn. Because of the random nature of our training-validation splitting, this method involves quantifying feature usage frequency from the lowest MSE decision trees of 100 tournaments of 100 randomly generated trees with new training-validation splits⁵⁷. To minimize the impact of favourable training-validation splits caused by similarities between peptides in the whole dataset, we used several small tournaments. If any feature is present in a tournament-winning tree, it is defined as being ‘selected’; we did not validate features based on their frequency of utilization in ‘tournament-winning’ trees. At a tree depth of four, each tree contributes 15 features; hyperparameter optimization may produce a different tree depth for different antibodies which may affect output. This method is not highly restrictive and could include features that enable overfitting. This emphasizes the importance of using multiple methods and qualifying selected features using statistical and experimental analysis.

Recursive feature elimination recursively eliminates features to a given value until a cohort of predictive features is left. Starting with the whole feature set (n features) will eliminate a poorly correlative feature to 3D10 reactivity and yield ‘n-1’ features. It repeats this until it reaches our defined values of 5, 10, or 15 features. Fifteen features represent the maximum number of features that can be applied by a tree with depth 4.

Statistical methods and feature analysis

We used three methods to analyse features in addition to their selection: heatmapping, enrichment, and sequence patterns. For enrichment analysis, we compared 3D10 reactive peptides to 3D10 non-reactive peptides (Supplementary Fig. S1). Using these three methods, we analysed selected features or groups of overlapping features that describe a majority of 3D10 reactive peptides or for which there is a significant difference between 3D10 reactive and 3D10 non-reactive peptides. Overfitting results in some selected features corresponding to very few reactive peptides and not correlating to 3D10 reactivity. Similarly, feature overlap may have resulted in features like ‘YK’ not being selected despite being equivalent to ‘KY’ in our testing, and strongly correlating to 3D10 reactivity, emphasizing the value in grouping and further testing. Data for feature values from this dataset with respective hyperparameters are accessible in a GitHub repository described in our data availability section. For heat mapping, we determined average feature values for each of the 20 positions amino acid positions that span the length of the peptides. As the peptide array contained significantly overlapping peptides, the average value for each position in the non-reactive population was almost identical to that of the average for the whole sequence. This meant that the comparison between non-reactive peptides and reactive peptides was not meaningful and may diminish trends observed in 3D10 reactive peptides if they were reflected in the whole sequence. For example, 3D10 reactive peptides were more likely to contain positively charged residues than non-reactive peptides; the difference in their average positive charge for each residue position does not improve the characterization of positionality of positive residues. The data provided by the difference between 3D10 reactive and non-reactive peptides would be better characterized by enrichment analysis.

Amino acid enrichment compared frequencies of amino acid occurrence in 3D10 reactive peptides and 3D10 non-reactive peptides. Statistical significance was determined by 2-way ANOVA selecting for relevant comparisons and correcting for number of comparisons (only comparing values for each amino acid between reactive and non-reactive peptide groups). For example, to compare three different amino acid occurrences in either group, we corrected for three comparisons.

The method for determining amino acid patterns is similar to the method for sequence analysis known as TMSTAT⁵⁸. Like TMSTAT, we do not have to assume that the distribution of amino acids across sequences is not anti-segregated or co-segregated relative to the features we are seeking to analyse. Assuming non-anti-segregated and non-co-segregated distribution assumes that amino acid distributions are homogeneous across all populations. We do not need to perform this assumption because our statistical comparisons are between observed odds and the expected odds. The expected odds, in this case, would be the likelihood of a homogeneously distributed population. For two types of amino acids, ‘A’ and ‘B’, we calculated the expected probability of a distance between an ‘A’ amino acid to the closest ‘B’ amino acid in 3D10 reactive or 3D10 non-reactive peptides, ${P}_{{Exp}}$. This value depended on the average number of ‘B’ amino acids in the respective population, $\bar{B}$. Expected odds for any distance was calculated by bootstrapping. We simulated all possible distributions of 1–5 ‘B’ amino acids across the peptide. We determined the shortest distance between an ‘A’ amino acid to the closest ‘B’ amino acid to develop a reference table (Supplementary Table S3). For each possible distance, ${D}^{X}$, we obtained a probability as a function of $\bar{B}$ where $X$ denotes the discrete minimum distance between ‘A’ and the closest ‘B’ amino acid; $X\in \left\{1,\,2,\,3,\,\ldots ,\left.19\right\}\right.$. This probability function was calculated by performing second-order polynomial regression on the bootstrapped odds.

$${P}_{{Exp}}=f\left({D}^{x}\right|\bar{B})$$

(3)

This resulted in 19 equations for each possible value of $X$ and enabled the prediction of expected minimum distances between ‘A’ and ‘B’ type amino acids for cases where ${B}$ was not a whole number. The observed probabilities were a proportional count for the occurrence for each distance to occur in the population ${P}_{{Obs}}$. Log odds ratio was obtained by a ratio of observed distance proportions and expected distance probabilities.

$$LN\left(\frac{{P}_{obs}}{{P}_{EXP}}\right)$$

(4)

We used 95% confidence intervals as a significance cut-off and calculated using previously published methods that assume a normal distribution⁵⁹. With this cut-off, we could establish if the increased likelihood for a particular distance to occur between amino acids of type ‘A’ and ‘B’ was significantly more than expected. For situations where ‘A’ and ‘B’ were the same (like lysine-to-lysine distances), we employed a correction of $\overline{{B}_{Cor}}=\overline{B}-1$ to account for the fact that the reference lysine serving as ‘A’ will not be contributing to $\bar{B}$ relative to the expected probability of distances. We ignored cases in which only ‘A’ or ‘B’ were present as distances between them would be indeterminant. For a specific ${D}^{X}$, we excluded plotting data points for which the ${P}_{{Obs}}$ was 0 as this meant that the observed probability was indeterminately less common than ${P}_{{Exp}}$ as the log odds ratio is equal to $-\infty$. For graphing, we plotted ${D}^{[\mathrm{0,5}]}$ as the 95% confidence interval grows as ${P}_{{Exp}}\to 0$.

Indirect ELISA

We measured the reactivity of 3D10, a mouse IgG mAb, to various proteins by an enzyme-linked immunosorbent assay (ELISA). 96-well Maxisorb plates (catalogue no. 439454; Thermo Fisher Scientific) were coated with synthetic peptides at 10 µg/mL and recombinant proteins at either 1 µg/mL (VAR2CSA) or 0.5 µg/mL (all others) at 50 µL diluted in 1× PBS and incubated overnight at 4 °C. We added 4% BSA for 1 h at 37 °C. We washed wells 4 times with 1X PBST (0.1% Tween 20). Plates were incubated with 50 µL of 3D10 or mouse IgG1 isotype control (catalogue no. MA5-14453; Invitrogen, Waltham, MA) for 1 h at room temperature (RT) and washed four times with 1× PBST. We then incubated samples at RT for 1 h with 50 µL of HRP-conjugated goat anti-mouse HRP (catalogue no. 170-6516; Bio-Rad, Mississauga, Canada). After repeated washes, 50 µL of 3,3′,5,5′-Tetramethylbenzidine (catalogue no. T0440; Sigma-Aldrich) was added and incubated for 30 min or when reactions approached saturation. This consideration was relevant for cases in which the optimized antibody concentration for a wildtype peptide (P1–P5) resulted in saturated optical density (OD) values for mutated peptides that significantly increased reactivity. We stopped reactions with 50 µL H₂SO₄ (0.5 N) and measured the OD of each well at 450 nm. All samples were tested in duplicate on the same plate and repeated in at least two independent experiments. For ELISAs with proteins treated with DTT, samples were incubated with DTT (10 mM) at 56 °C for 10 min and then added to the plate. ELISAs with DTT treatments were performed in volumes of 100 µL as opposed to 50 µL.

For the ELISAs with peptides, the OD values were corrected by subtracting the average of blank wells from the raw measurements and evaluating if IgG isotype control and secondary antibody control had no reactivity (OD < 0.25). To determine relative binding, OD values for mutated peptides were measured as a ratio to the wildtype peptide from which they are derived:

$${\rm{Relative\; binding}}=\left({{OD}}_{{mutated}}/{{OD}}_{{wildtype}}\right)* 100$$

(5)

For ELISAs with recombinant proteins, the OD for the wells with the isotype control was subtracted from the corresponding value with 3D10.

Peptides were synthesized by PEPMIC (China) at 70% purity. The 3D10 antibody was generously provided by Dr. John Adams. EBA-175 was obtained through BEI Resources, NIAID, NIH: Plasmodium falciparum Erythrocyte Binding Antigen-175 RII-Non-Glycosylated Protein, Recombinant from Pichia pastoris, MRA-1162, contributed by Annie Mo. PfMSP1 was purchased from CTK Biotech, San Diego, California (Catalogue no-A3003). PvMSP1-19 was obtained through BEI Resources, NIAID, NIH: Plasmodium vivax yP30P2-Pv200 MSP1-19 Protein, Recombinant from Saccharomyces cerevisiae, Strain 2905/6, MRA-60, contributed by David C. Kaslow. Full-length VAR2CSA (FV2; FCR3 allele) was generously provided by Dr. Ali Salanti. PcDBP was expressed and purified from E. coli. PvDBP was expressed in E. coli and purified by Mohammad Rafiul Hoque.

Criteria mapping and structure prediction

To map criteria onto 3D structures, we first generated structures of relevant protein segments in AlphaFold 3³². For proteins that had high-resolution structures available, we used the experimentally determined structures. Only the EBA-175 structure was sourced from previously resolved structures. We used AlphaFold 3 generated structures because of the unavailability of target protein structures, or, in the case of VAR2CSA, available protein structures were insufficient. More specifically, previously resolved structures for VAR2CSA (FCR3) and (NF54) had several surface-exposed segments excluded^60,61. EBA-175 had near-atomic accuracy resolution (<2.5 Å) because the structure was determined by high-accuracy methods: crystallography and solution NMR^62,63.

The AlphaFold 3 generated VAR2CSA (FCR3) core region was evaluated against cryo-EM resolved structures for the same area (Supplementary Fig S5). Relative to the pruned RMSD score, these structures were sufficiently similar to one another (RMSD < 3 Å). Low resolution and a high proportion of unstructured regions may have resulted in a higher whole RMSD score (5.894 Å)^64,65. Relative to our protein segment of greater than 1400 amino acids, the whole RMSD score suggests high accuracy⁶⁶.

Data availability

The Github contains access to the code and files associated with the production of all figures and research data:https://github.com/SimranjitGrewal/REMMI-Resolution-of-Epitopes-by-Microarray-using-Machine-learning-IntegrationPrimary data sets are available on figshare: https://figshare.com/s/52f4dd3daa32b0b46b78.

References

Venkatesan, P. The 2023 WHO World malaria report. Lancet Microbe 5, e214 (2024).
Article PubMed Google Scholar
World Health Organization. World malaria report 2024: addressing inequity in the global malaria response. World Health Organization (2024).
Lee, W. C., Russell, B. & Renia, L. Sticking for a cause: the Falciparum Malaria parasites cytoadherence paradigm. Front. Immunol. 10, 1444 (2019).
Article CAS PubMed PubMed Central Google Scholar
Fried, M. & Duffy, P. E. Adherence of Plasmodium falciparum to chondroitin sulfate A in the human placenta. Science 272, 1502–1504 (1996).
Article CAS PubMed Google Scholar
Ngai, M. et al. Malaria in pregnancy and adverse birth outcomes: new mechanisms and therapeutic opportunities. Trends Parasitol. 36, 127–137 (2020).
Article CAS PubMed Google Scholar
Desai, M. et al. Epidemiology and burden of malaria in pregnancy. Lancet Infect. Dis. 7, 93–104 (2007).
Article PubMed Google Scholar
Mordmuller, B. et al. First-in-human, randomized, double-blind clinical trial of differentially adjuvanted PAMVAC, a vaccine candidate to prevent pregnancy-associated malaria. Clin. Infect. Dis. 69, 1509–1516 (2019).
Article PubMed PubMed Central Google Scholar
Sirima, S. B. et al. PRIMVAC vaccine adjuvanted with Alhydrogel or GLA-SE to prevent placental malaria: a first-in-human, randomised, double-blind, placebo-controlled study. Lancet Infect. Dis. 20, 585–597 (2020).
Article CAS PubMed Google Scholar
Walker, L. M. & Burton, D. R. Passive immunotherapy of viral infections: ‘super-antibodies’ enter the fray. Nat. Rev. Immunol. 18, 297–308 (2018).
Article CAS PubMed PubMed Central Google Scholar
Doritchamou, J. Y. A. et al. A single full-length VAR2CSA ectodomain variant purifies broadly neutralizing antibodies against placental malaria isolates. Elife 11, https://doi.org/10.7554/eLife.76264 (2022).
Gnidehou, S. et al. Functional antibodies against VAR2CSA in nonpregnant populations from colombia exposed to Plasmodium falciparum and Plasmodium vivax. Infect. Immun. 82, 2565–2573 (2014).
Article PubMed PubMed Central Google Scholar
Beeson, J. G. et al. Antibodies among men and children to placental-binding Plasmodium falciparum-infected erythrocytes that express var2csa. Am. J. Trop. Med. Hyg. 77, 22–28 (2007).
Article PubMed Google Scholar
Oleinikov, A. V. et al. A plasma survey using 38 PfEMP1 domains reveals frequent recognition of the Plasmodium falciparum antigen VAR2CSA among young Tanzanian children. PLoS One 7, e31011 (2012).
Article CAS PubMed PubMed Central Google Scholar
Gnidehou, S. et al. Cross-species immune recognition between Plasmodium vivax Duffy binding protein antibodies and the Plasmodium falciparum surface antigen VAR2CSA. J. Infect. Dis. 219, 110–120 (2019).
CAS PubMed Google Scholar
Mitran, C. J. et al. Antibodies to cryptic epitopes in distant homologues underpin a mechanism of heterologous immunity between Plasmodium vivax PvDBP and Plasmodium falciparum VAR2CSA. mBio. 10, https://doi.org/10.1128/mBio.02343-19 (2019).
George, M. T. et al. Identification of an immunogenic broadly inhibitory surface epitope of the Plasmodium vivax Duffy binding protein ligand domain. mSphere 4, https://doi.org/10.1128/mSphere.00194-19 (2019).
Chen, E. et al. Broadly neutralizing epitopes in the Plasmodium vivax vaccine candidate Duffy Binding Protein. Proc. Natl Acad. Sci. USA113, 6277–6282 (2016).
Article CAS PubMed PubMed Central Google Scholar
Mitran, C. J., Higa, L. M., Good, M. F. & Yanow, S. K. Generation of a Peptide Vaccine Candidate against Falciparum Placental Malaria Based on a Discontinuous Epitope. Vaccines 8, https://doi.org/10.3390/vaccines8030392 (2020).
Laffy, J. M. J. et al. Promiscuous antibodies characterised by their physico-chemical properties: from sequence to structure and back. Prog. Biophys. Mol. Biol. 128, 47–56 (2017).
Article CAS PubMed PubMed Central Google Scholar
Iyamu, U. et al. A conserved epitope in VAR2CSA is targeted by a cross-reactive antibody originating from Plasmodium vivax Duffy binding protein. Front. Cell Infect. Microbiol. 13, 1202276 (2023).
Article CAS PubMed PubMed Central Google Scholar
Saha, S. & Raghava, G. P. BcePred: Prediction of Continuous B-Cell Epitopes in Antigenic Sequences Using Physico-chemical Properties. International Conference on Artificial Immune Systems 197–204 (2004).
Teimouri, H., Medvedeva, A. & Kolomeisky, A. B. Bacteria-specific feature selection for enhanced antimicrobial peptide activity predictions using machine-learning methods. J. Chem. Inf. Model. 63, 1723–1733 (2023).
Article CAS PubMed Google Scholar
Ntumngia, F. B. et al. A novel erythrocyte binding protein of plasmodium vivax suggests an alternate invasion pathway into duffy-positive reticulocytes. mBio 7, https://doi.org/10.1128/mBio.01261-16 (2016).
McDonald, E. F., Jones, T., Plate, L., Meiler, J. & Gulsevin, A. Benchmarking AlphaFold2 on peptide structure prediction. Structure 31, 111–119 e112 (2023).
Article CAS PubMed Google Scholar
Weng, G. et al. Comprehensive evaluation of fourteen docking programs on protein-peptide complexes. J. Chem. Theory Comput. 16, 3959–3969 (2020).
Article CAS PubMed Google Scholar
Di Rienzo, L. et al. Characterizing hydropathy of amino acid side chain in a protein environment by investigating the structural changes of water molecules network. Front. Mol. Biosci. 8, 626837 (2021).
Article PubMed PubMed Central Google Scholar
Liang, S. & Grishin, N. V. Effective scoring function for protein sequence design. Proteins 54, 271–281 (2004).
Article CAS PubMed Google Scholar
Regenmortel, M. H. V. What is a B-cell epitope? Methods in Mol. Biol. 524, 3–20 (2009).
Ainavarapu, S. R. et al. Contour length and refolding rate of a small protein controlled by engineered disulfide bonds. Biophys. J. 92, 225–233 (2007).
Article CAS PubMed Google Scholar
Ramaraj, T., Angel, T., Dratz, E. A., Jesaitis, A. J. & Mumey, B. Antigen-antibody interface properties: composition, residue interactions, and features of 53 non-redundant structures. Biochim. Biophys. Acta 1824, 520–532 (2012).
Article CAS PubMed PubMed Central Google Scholar
Rubinstein, N. D. et al. Computational characterization of B-cell epitopes. Mol. Immunol. 45, 3477–3489 (2008).
Article CAS PubMed Google Scholar
Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
Article CAS PubMed PubMed Central Google Scholar
Ntumngia, F. B. et al. Conserved and variant epitopes of Plasmodium vivax Duffy binding protein as targets of inhibitory monoclonal antibodies. Infect. Immun. 80, 1203–1208 (2012).
Article CAS PubMed PubMed Central Google Scholar
Singh, S. K., Hora, R., Belrhali, H., Chitnis, C. E. & Sharma, A. Structural basis for Duffy recognition by the malaria parasite Duffy-binding-like domain. Nature 439, 741–744 (2006).
Article CAS PubMed Google Scholar
Howell, D. P., Samudrala, R. & Smith, J. D. Disguising itself-insights into Plasmodium falciparum binding and immune evasion from the DBL crystal structure. Mol. Biochem. Parasitol. 148, 1–9 (2006).
Article CAS PubMed Google Scholar
Benavente, E. D. et al. Global genetic diversity of var2csa in Plasmodium falciparum with implications for malaria in pregnancy and vaccine development. Sci. Rep. 8, 15429 (2018).
Article PubMed PubMed Central Google Scholar
Smith, J. D., Subramanian, G., Gamain, B., Baruch, D. I. & Miller, L. H. Classification of adhesive domains in the Plasmodium falciparum erythrocyte membrane protein 1 family. Mol. Biochem. Parasitol. 110, 293–310 (2000).
Article CAS PubMed Google Scholar
Wei, S. et al. Binding epitope for recognition of human TRPM4 channel by monoclonal antibody M4M. Sci. Rep. 12, 19562 (2022).
Article CAS PubMed PubMed Central Google Scholar
Ashworth, J. et al. Peptide microarray IgM and IgG screening of pre-SARS-CoV-2 human serum samples from Zimbabwe for reactivity with peptides from all seven human coronaviruses: a cross-sectional study. Lancet Microbe 4, e215–e227 (2023).
Article CAS PubMed Central Google Scholar
Vashisht, K. et al. Cyclic constrained immunoreactive peptides from crucial P. falciparum proteins: potential implications in malaria diagnostics. Transl. Res. 249, 28–36 (2022).
Article CAS PubMed Google Scholar
Kaur, H. & Salunke, D. M. Antibody promiscuity: understanding the paradigm shift in antigen recognition. IUBMB Life. 67, 498–505 (2015).
Article CAS PubMed Google Scholar
Jain, D. & Salunke, D. M. Antibody specificity and promiscuity. Biochem J. 476, 433–447 (2019).
Article CAS PubMed Google Scholar
Sather, D. N. et al. Factors associated with the development of cross-reactive neutralizing antibodies during human immunodeficiency virus type 1 infection. J. Virol. 83, 757–769 (2009).
Article CAS PubMed Google Scholar
Andrews, S. F. et al. An influenza H1 hemagglutinin stem-only immunogen elicits a broadly cross-reactive B cell response in humans. Sci. Transl. Med. 15, eade4976 (2023).
Article CAS PubMed Google Scholar
Centers for Disease, C. & Prevention. Serum cross-reactive antibody response to a novel influenza A (H1N1) virus after vaccination with seasonal influenza vaccine. MMWR Morb. Mortal. Wkly Rep. 58, 521–524 (2009).
Google Scholar
Mikell, I. et al. Characteristics of the earliest cross-neutralizing antibody response to HIV-1. PLoS Pathog. 7, e1001251 (2011).
Article CAS PubMed PubMed Central Google Scholar
Li, T. et al. Identification of a cross-neutralizing antibody that targets the receptor binding site of H1N1 and H5N1 influenza viruses. Nat. Commun. 13, 5182 (2022).
Article CAS PubMed PubMed Central Google Scholar
Burton, D. R. & Hangartner, L. Broadly neutralizing antibodies to HIV and their role in vaccine design. Annu. Rev. Immunol. 34, 635–659 (2016).
Article CAS PubMed PubMed Central Google Scholar
Sok, D. & Burton, D. R. Recent progress in broadly neutralizing antibodies to HIV. Nat. Immunol. 19, 1179–1188 (2018).
Article CAS PubMed PubMed Central Google Scholar
Sanchez-de Prada, L. et al. Immunodominance hierarchy after seasonal influenza vaccination. Emerg. Microbes Infect. 11, 2670–2679 (2022).
Article CAS PubMed PubMed Central Google Scholar
Reich, N. G. et al. Case study in evaluating time series prediction models using the relative mean absolute error. Am. Stat. 70, 285–292 (2016).
Article PubMed PubMed Central Google Scholar
Li, Y. et al. Establishment of a decision tree model for diagnosis of early rheumatoid arthritis by proteomic fingerprinting. Int J. Rheum. Dis. 18, 835–841 (2015).
Article CAS PubMed Google Scholar
Takeshi, M., Go, Y., Xiaoming, H. & Yusuke, I. Signal-to-noise ratio improvement of peptide microarrays by using hyperbranched-polymer materials. J. Appl. Phys. 105 (2009).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res 12, 2825–2830 (2011).
Google Scholar
Moffat, L. & Jones, D. T. Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework. Bioinformatics 37, 3744–3751 (2021).
Article CAS PubMed PubMed Central Google Scholar
Grantham, R. Amino acid difference formula to help explain protein evolution. Science 185, 862–864 (1974).
Article CAS PubMed Google Scholar
An, C. et al. Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLoS One 16, e0256152 (2021).
Article CAS PubMed PubMed Central Google Scholar
Senes, A., Gerstein, M. & Engelman, D. M. Statistical analysis of amino acid patterns in transmembrane helices: the GxxxG motif occurs frequently and in association with beta-branched residues at neighboring positions. J. Mol. Biol. 296, 921–936 (2000).
Article CAS PubMed Google Scholar
Bland, J. M. & Altman, D. G. Statistics notes. The odds ratio. BMJ 320, 1468 (2000).
Article CAS PubMed PubMed Central Google Scholar
Wang, K. et al. Cryo-EM reveals the architecture of placental malaria VAR2CSA and provides molecular insight into chondroitin sulfate binding. Nat. Commun. 12, 2956 (2021).
Article CAS PubMed PubMed Central Google Scholar
Ma, R. et al. Structural basis for placental malaria mediated by Plasmodium falciparum VAR2CSA. Nat. Microbiol. 6, 380–391 (2021).
Article CAS PubMed PubMed Central Google Scholar
Barth, P., Schonbrun, J. & Baker, D. Toward high-resolution prediction and design of transmembrane helical protein structures. Proc. Natl Acad. Sci. USA. 104, 15682–15687 (2007).
Article CAS PubMed PubMed Central Google Scholar
Tolia, N. H., Enemark, E. J., Sim, B. K. & Joshua-Tor, L. Structural basis for the EBA-175 erythrocyte invasion pathway of the malaria parasite Plasmodium falciparum. Cell 122, 183–193 (2005).
Article CAS PubMed Google Scholar
Carugo, O. How root-mean-square distance (r.m.s.d.) values depend on the resolution of protein structures that are compared. J. Appl. Crystallogr. 36, 125–128 (2003).
Article CAS Google Scholar
Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
Article CAS PubMed PubMed Central Google Scholar
Reva, B. A., Finkelstein, A. V. & Skolnick, J. What is the probability of a chance prediction of a protein structure with an rmsd of 6 A?. Fold. Des. 3, 141–147 (1998).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We thank Dr. Michael Good for providing valuable feedback on our manuscript.This research was funded by the National Institute of Allergy and Infectious Diseases of the National Institutes of Health (R01AI150944). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We acknowledge funding from a project grant from the Canadian Institutes of Health Research (CIHR-IRSC Funding Reference Number 168944). We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [RGPIN-2017-04176].

Author information

Stephanie K. Yanow
Present address: School of Public Health, University of Alberta, Edmonton, AB, Canada

Authors and Affiliations

Department of Medical Microbiology and Immunology, University of Alberta, Edmonton, AB, Canada
Simranjit Grewal & Stephanie K. Yanow
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
Nidhi Hegde
School of Public Health, University of Alberta, Edmonton, AB, Canada
Uwa Iyamu, Daniel Ferrer Vinals & Catherine J. Mitran

Authors

Simranjit Grewal
View author publications
Search author on:PubMed Google Scholar
Uwa Iyamu
View author publications
Search author on:PubMed Google Scholar
Daniel Ferrer Vinals
View author publications
Search author on:PubMed Google Scholar
Catherine J. Mitran
View author publications
Search author on:PubMed Google Scholar
Nidhi Hegde
View author publications
Search author on:PubMed Google Scholar
Stephanie K. Yanow
View author publications
Search author on:PubMed Google Scholar

Contributions

S.G. and S.K.Y. wrote the manuscript. S.G., U.I., D.V. and C.M. performed experiments. N.H. helped design the machine learning algorithm. All authors reviewed the manuscript prior to submission.

Corresponding author

Correspondence to Stephanie K. Yanow.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Grewal, S., Iyamu, U., Vinals, D.F. et al. Machine learning framework to extract physicochemical features of B-cell epitopes recognized by a cross-reactive antibody. npj Syst Biol Appl 11, 109 (2025). https://doi.org/10.1038/s41540-025-00583-1

Download citation

Received: 18 March 2025
Accepted: 24 August 2025
Published: 02 October 2025
DOI: https://doi.org/10.1038/s41540-025-00583-1