Abstract
Understanding and predicting viral host range is a fundamental challenge in virology, with direct implications for emerging pathogen surveillance and pandemic preparedness. Traditional molecular descriptors such as PAAC and DPC capture only local physicochemical properties, limiting their ability to generalize across diverse viral taxa. In this work, we developed VirHostPRED, a novel computational framework based on protein language models (PLMs) that leverages embeddings derived from ESM-2 to predict the likelihood of human infectivity from individual viral protein sequences. Among nine machine learning algorithms evaluated, SVM-RBF achieved the best performance, reaching 0.852 accuracy and 0.914 AUC on the hold-out test set using ESM2-t48-15B embeddings. The progressive scaling of ESM-2 from 8 million to 15 billion parameters resulted in consistent gains in discriminative capability, while t-SNE projections revealed enhanced class separability with larger models, confirming that ESM-2 embeddings encode biologically meaningful structure. Comparative benchmarks with ensemble and linear classifiers further demonstrated that nonlinear models effectively capture the high-dimensional relationships within PLM representations. Our web server, VirHostPRED, enables rapid in silico prediction of human infectivity risk from viral protein sequences without requiring extensive experimental characterization, providing an efficient computational triage system to support early warning, prioritization, and resource allocation in viral surveillance pipelines. The VirHostPRED server is freely available at https://www.biochemintelli.com/virhostpred/.
Introduction
Emerging infectious diseases remain one of the greatest global health threats1,2, as exemplified by the COVID-19 pandemic3,4,5 and recurrent zoonotic spillovers6,7. Anticipating which newly detected viruses may pose risks to humans is a pressing challenge for surveillance systems and pandemic preparedness8,9. Traditional wet-lab host range assays, while essential, are time-consuming, labor-intensive, and unevenly applied across the vast virosphere10,11. This gap has driven the development of computational approaches that aim to triage novel sequences rapidly, providing early signals to guide experimental follow-up and risk assessment11,12.
Machine-learning models relying solely on sequence information have demonstrated that genomic composition can immediately prioritize viruses with zoonotic potential, substantially narrowing the candidate space for investigation11,13. Broader ecological analyses further suggest that apparent clustering of zoonotic risk often reflects sampling biases rather than intrinsic host-order susceptibilities, underscoring the need for mechanistic, sequence-driven approaches at the pathogen level11,14,15. In parallel, risk maps of emerging zoonoses remain constrained by heterogeneous surveillance data, highlighting the value of portable computational frameworks grounded in molecular sequences14,16.
A protein-centric perspective is biologically well motivated, as viral host range is frequently determined by proteins that mediate receptor binding, entry, and tissue tropism17,18,19. The coronavirus spike protein illustrates how a single viral protein can encode determinants of cross-species infectivity. Such proteins carry evolutionary constraints and functional signatures reflecting compatibility with host receptors and immune evasion pressures, making them ideal substrates for predictive modeling20,21,22,23,24. Data-driven approaches have become essential for analyzing high-dimensional biological data. Recently, the application of Large Language Models (LLMs) has expanded beyond text to encode complex biological entities. These models have proven effective in diverse domains, ranging from extracting semantic features from biomedical knowledge graphs for drug interaction discovery (e.g., LLM-DDI25) to deciphering the evolutionary 'language' of protein sequences.
Protein language models (PLMs), trained on millions of natural sequences, learn representations where structural and functional properties emerge from sequence alone26,27. Scaling PLMs yields richer embeddings that capture long-range dependencies and structural regularities, enabling downstream transfer to diverse biological prediction tasks26,28. Recent work shows that PLM embeddings can encode sequence–structure–function relationships and improve annotation, retrieval, and predictive modeling29,30. Multi-aspect frameworks further suggest that PLM representations unify sequence, structure, and function signals into general-purpose feature spaces27.
For viruses and hosts, sequence-only models have already demonstrated practical utility: genome-composition models highlight human-infecting viruses, while deep learning at the protein level predicts virus–host protein interactions, showing that residue-level features encode host-related biology11,31,32,33. Approaches such as DeepViral integrate protein sequences with disease phenotypes to infer virus–human protein–protein interactions34, and more recent advances, like EvoMIL, directly leverage PLMs to predict host species from viral proteins35. These studies indicate that embeddings contain stronger predictive signals than handcrafted features and can also provide interpretability by highlighting proteins most responsible for host assignments35,36.
Traditional alignment-based tools such as BLAST remain foundational for homology search but falter when novel viruses lack close relatives, precisely the scenario where early-risk triage is most needed37,38,39,40. By contrast, PLM embeddings offer alignment-free, fixed-length representations that capture higher-order regularities and can be integrated with lightweight classifiers, enabling robust performance under limited labels and distributional shifts27,40. Empirical studies in related protein tasks confirm that PLM features can rival or surpass structure-dependent pipelines in low-data settings, supporting their adoption for viral host-range inference35,41,42.
Recent advances in language models for viral genomics have followed several distinct paradigms. Generative approaches, such as SARITA43, SpikeGPT-244, and RITA45, leverage autoregressive or masked language modeling to generate novel viral sequences or predict fitness landscapes. These models typically require substantial computational resources for training and fine-tuning. Anomaly-detection frameworks, such as DeepAutoCoV46, combine PLM-derived latent representations with outlier detection algorithms to flag potentially dangerous viral variants in real-time surveillance contexts.
Guided by this evidence, the present study introduces VirHostPRED, an ESM2-embedding approach to determine whether a virus infects humans from a single viral protein sequence. Unlike generative PLM approaches (e.g., SARITA, SpikeGPT-2) that model sequence distributions or anomaly-detection frameworks (e.g., DeepAutoCoV) designed for variant surveillance, VirHostPRED focuses on binary host-range classification using frozen PLM embeddings combined with classical machine learning classifiers. This design prioritizes simplicity, interpretability, and low computational requirements over end-to-end fine-tuning. The method leverages evolutionary-scale representations learned by protein language models to capture functional constraints relevant to host specificity, while operating at the single-protein level rather than requiring complete viral genomes. To enhance accessibility and translational impact, we further implemented a user-friendly web server that enables researchers and surveillance teams to submit viral protein sequences and obtain rapid predictions of human infectivity risk. This resource provides a practical, lightweight tool for early-stage assessment of emerging viruses, complementing experimental efforts and supporting pandemic preparedness in resource-limited settings.
Materials and methods
Datasets
The viral protein amino acid sequences used in this study were obtained from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). Sequences were selected using the following filtering criteria. For viral proteins from human-infecting viruses (positive class): GenBank/RefSeq: RefSeq; Nucleotide Completeness: complete; Assembly Completeness: complete; Host: Homo sapiens (human), taxid: 9606. For viral proteins from non-human viruses (negative class): GenBank/RefSeq: RefSeq; Nucleotide Completeness: complete; Assembly Completeness: complete; Host: NOT Homo sapiens (human), taxid: NOT 9606 (Table S1).
The retrieved sequences were then processed to ensure quality and completeness. To reduce sequence redundancy and prevent overrepresentation of closely related viral strains, we applied the CD-HIT algorithm47 for protein sequence clustering. CD-HIT employs a greedy incremental clustering approach that groups sequences by pairwise identity: the first sequence becomes a cluster representative, and subsequent sequences are assigned to existing clusters if they share identity above a specified threshold with any representative, or become new representatives otherwise. We selected a 70% identity threshold based on several considerations: (1) this threshold is commonly used in protein homology studies to define remote homologs while excluding highly similar sequences that could inflate performance estimates48; (2) at 70% identity, sequences typically share structural and functional similarity but represent distinct evolutionary lineages; (3) this threshold balances between removing redundant sequences (which would cause data leakage between train/test sets) and retaining sufficient biological diversity for robust model training. More stringent thresholds (e.g., 40–50%) would substantially reduce dataset size, while less stringent thresholds (e.g., 90%) would retain potentially redundant sequences. The specific CD-HIT parameters used were: -c 0.7 -n 5 -d 0 -M 16000 -T 8, where -n 5 specifies the word length appropriate for identities in the 0.7–0.88 range.
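Under these settings, the clustering step corresponds to a single CD-HIT invocation; the input and output filenames below are illustrative placeholders, not the ones used in the study:

```shell
# Cluster viral proteins at 70% identity: word length 5, full FASTA headers,
# 16 GB memory limit, 8 threads (parameters as reported in the text)
cd-hit -i viral_proteins.fasta -o viral_proteins_nr70.fasta \
       -c 0.7 -n 5 -d 0 -M 16000 -T 8
```

The representatives written to the output FASTA then form the non-redundant dataset used downstream.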
After clustering, the human-infecting class contained 2,127 sequences, while the non-human class contained 10,635 sequences. To ensure balanced training and prevent classifier bias toward the majority class, we randomly sampled 2,127 sequences from the non-human class, matching the size of the limiting class. This sampling was performed with a fixed random seed (42) for reproducibility, resulting in a final balanced dataset of 4,254 sequences (Table S2). All viral sequences used in this study are available in the VirHostPRED Datasets repository.
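The seeded downsampling step can be sketched in pure Python (illustrative; identifiers and helper names here are hypothetical, not the study's actual code):

```python
import random

def balance_classes(pos_ids, neg_ids, seed=42):
    """Downsample the majority class to the size of the minority class,
    using a fixed seed so the selection is reproducible."""
    rng = random.Random(seed)
    n = min(len(pos_ids), len(neg_ids))
    if len(neg_ids) > n:
        neg_ids = rng.sample(neg_ids, n)
    elif len(pos_ids) > n:
        pos_ids = rng.sample(pos_ids, n)
    return pos_ids, neg_ids

# Class sizes from the text: 2,127 human-infecting vs. 10,635 non-human
pos, neg = balance_classes([f"p{i}" for i in range(2127)],
                           [f"n{i}" for i in range(10635)])
```

Because the seed is fixed, rerunning the sampling yields the identical balanced set of 4,254 sequences.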
Molecular descriptor and embedding computation
From all sequences, we computed two classical molecular descriptors: (1) pseudo-amino acid composition (PAAC; lambda = 5, weight = 0.05)49, which captures both amino acid composition and sequence-order effects, producing 30 features per protein; and (2) dipeptide composition (DPC)50, defined as the frequency of all possible dipeptides (400 combinations) normalized by sequence length. ESM-2 protein embeddings were extracted using the official esm Python library (https://github.com/facebookresearch/esm)28 in three configurations: ESM2-t6-8M (8 million parameters, 320 dimensions), ESM2-t33-650M (650 million parameters, 1,280 dimensions), and ESM2-t48-15B (15 billion parameters, 5,120 dimensions). Prior to processing, non-standard amino acids were replaced with 'X' to ensure compatibility with the model's vocabulary. For each viral protein sequence, we computed a fixed-dimensional vector representation by mean-pooling the residue-level hidden states from the final transformer layer of each model variant (layer 6 for esm2_t6_8M, layer 33 for esm2_t33_650M, and layer 48 for esm2_t48_15B). The mean-pooling operation was computed over all amino acid positions, explicitly excluding the special beginning-of-sequence ([CLS]) and end-of-sequence ([EOS]) tokens. For proteins exceeding the model's maximum context length of 1022 residues, we employed a sliding window strategy with 900-residue windows and 150-residue overlap. Individual window embeddings were extracted independently and subsequently averaged to produce the final sequence-level representation. This approach ensures that long viral proteins (e.g., large polyproteins) are fully represented without truncation. The resulting embeddings, with dimensionalities of 320, 1,280, and 5,120, were used directly as input features for downstream machine learning classifiers.
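The windowing scheme (900-residue windows, 150-residue overlap, averaged into one vector) can be sketched as follows; this is an illustrative pure-Python version, with the actual per-residue embeddings assumed to come from the esm library:

```python
def window_spans(seq_len, max_len=1022, window=900, overlap=150):
    """Return (start, end) residue spans covering a sequence;
    a single span if the protein fits the model's context length."""
    if seq_len <= max_len:
        return [(0, seq_len)]
    stride = window - overlap  # windows advance by 750 residues
    spans, start = [], 0
    while start < seq_len:
        spans.append((start, min(start + window, seq_len)))
        if start + window >= seq_len:
            break
        start += stride
    return spans

def pooled_embedding(per_window_vectors):
    """Average the per-window mean-pooled vectors into the final
    fixed-dimensional sequence-level representation."""
    k, dim = len(per_window_vectors), len(per_window_vectors[0])
    return [sum(v[d] for v in per_window_vectors) / k for d in range(dim)]
```

For a hypothetical 2,000-residue polyprotein, this yields three overlapping windows whose embeddings are averaged, so no residues are discarded by truncation.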
PAAC and DPC descriptors were computed with the Python package propy3 (https://pypi.org/project/propy3/). ESM-2 embeddings were generated using PyTorch 2.x (https://pytorch.org/) and fair-esm (https://github.com/facebookresearch/esm)28, running on the Apple MPS accelerator (Metal Performance Shaders) on a MacBook Pro M3 Max (16-core CPU, 40-core GPU) with 128GB of RAM, using float32 precision to ensure numerical stability.
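To make the DPC descriptor concrete, a minimal pure-Python version (illustrative only; the study uses propy3) that counts the 400 possible dipeptides and normalizes by sequence length, as described above, is:

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AA, repeat=2)]  # all 400 combinations

def dpc(seq):
    """Dipeptide composition: counts of each of the 400 standard dipeptides,
    normalized by sequence length."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    L = len(seq)
    return [counts[dp] / L for dp in DIPEPTIDES]
```

Each protein thus maps to a fixed 400-dimensional vector regardless of its length, which is what makes DPC directly usable by the downstream classifiers.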
Normalization and preprocessing
To improve computational efficiency and reduce noise, constant features (variance = 0) were removed from all datasets. No missing values were present in the ESM-2 embeddings or PAAC/DPC descriptors, so no imputation was required. The balanced dataset was split into training (80%, n = 3,403) and test (20%, n = 851) sets using a stratified partition to preserve class proportions (random_state = 42). Following the split, all numeric features were standardized using Z-score scaling (mean = 0, standard deviation = 1) with scikit-learn's StandardScaler (https://scikit-learn.org/). The scaler was fit only on the training set and applied to both the training and hold-out test sets to ensure rigorous evaluation and prevent data leakage.
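The fit-on-train-only standardization can be sketched without scikit-learn (illustrative; the study uses StandardScaler, which likewise computes the population standard deviation):

```python
def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation on the TRAINING set only."""
    n, dims = len(train_rows), len(train_rows[0])
    means = [sum(r[d] for r in train_rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in train_rows) / n
        stds.append(var ** 0.5 or 1.0)  # guard constant features against div-by-zero
    return means, stds

def transform(rows, means, stds):
    """Apply the training-set statistics to any split (train or hold-out test)."""
    return [[(r[d] - means[d]) / stds[d] for d in range(len(means))] for r in rows]
```

Because the test set is transformed with statistics estimated from the training set alone, no information from the hold-out data leaks into the features the model sees.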
Training, cross-validation, and evaluation
We evaluated nine machine learning classification algorithms: Random Forest (RF), Multilayer Perceptron (MLP), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGBM), Logistic Regression with L1 and L2 regularization (LR-L1, LR-L2), Support Vector Machine with an RBF kernel (SVM-RBF), Support Vector Machine with a linear kernel (SVM-Linear), and Gradient Boosting Classifier (GBC). Training for all classifiers was performed on 80% of the full dataset, using stratified 10-fold cross-validation. The remaining 20% of the data (referred to as the hold-out test set in this study) was used to evaluate the performance of the trained models. All analyses were carried out using scikit-learn (https://scikit-learn.org/), XGBoost (https://xgboost.readthedocs.io/), and Microsoft LightGBM (https://lightgbm.readthedocs.io/). Model performance in this binary classification problem was assessed with standard metrics (including accuracy, sensitivity, and specificity), averaged across folds during cross-validation and computed once on the hold-out test set.
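A stratified fold assignment of the kind used here can be sketched in pure Python (illustrative only; the study relies on scikit-learn's stratified cross-validation):

```python
import random

def stratified_kfold_indices(labels, k=10, seed=42):
    """Yield (train_idx, val_idx) pairs in which each fold preserves
    the overall class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):       # deal each class round-robin
            folds[j % k].append(i)        # so every fold gets its share
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val
```

Stratification matters here because, even with a balanced dataset, random folds could drift away from the 50/50 class ratio and distort the per-fold metrics.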
In this study, we also assessed the effectiveness of the predictive models using the area under the receiver operating characteristic (ROC) curve (AUC). ROC curves were generated for both the cross-validation phase and the hold-out test sets, enabling evaluation of the models' discriminative ability. Cross-validation AUC is reported as the mean AUC across the 10 folds. ROC curves show the performance of all evaluated classifiers for each representation on the hold-out test set. This study followed a reproducible pipeline that begins with dataset construction and proceeds through feature or embedding computation, model training with stratified 10-fold cross-validation, independent testing, and web deployment. Figure 1 summarizes the architecture of the workflow, including the two dataset branches (positive, human-infecting viruses; negative, non-human viruses), the computation of molecular descriptors (DPC and PAAC) and pretrained ESM2 embeddings, the evaluation of nine classifier variants, and the final web application that serves probabilistic predictions from FASTA inputs (Fig. 1).
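The AUC reported above is, equivalently, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (the Mann-Whitney U formulation). A minimal illustration of that rank-based computation, not the evaluation code used in the study:

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive outscores the negative; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise reading also explains why AUC is insensitive to the classification threshold: it depends only on the ranking of scores, not on where the decision cutoff is placed.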
Architecture of the workflow. Viral protein amino acid sequences were collected from NCBI Virus and split into positive (human viruses) and negative (non-human viruses) datasets. Molecular descriptors (DPC, PAAC) and pretrained ESM2 embeddings (ESM2-t6-8M, ESM2-t33-650M, ESM2-t48-15B) were computed. Models were trained with stratified 10-fold cross-validation on 80% of the data and evaluated on the remaining 20%. Nine classifier variants were assessed: LR-L1, LR-L2, RF, SVM-L, SVM-RBF, XGB, LGBM, GB, and MLP. The selected model was deployed as a web application that accepts FASTA inputs and returns probabilistic predictions.
Unsupervised 2D projections using t-SNE (left panel within each subpanel) and PCA (right), colored by class. (A) Dipeptide composition (DPC). (B) Pseudo-amino acid composition (PAAC). (C) ESM2-t6-8M embeddings. (D) ESM2-t33-650M embeddings. (E) ESM2-t48-15B embeddings. PLM-based projections show progressively greater class separation than classical descriptors, with ESM2-t48-15B yielding the most compact clouds and clearest separation.
ROC curves comparing 10-fold cross-validation (left in each subpanel) versus the hold-out test set (right) for each representation space. (A) Dipeptide composition (DPC). (B) Pseudo-amino acid composition (PAAC). (C) ESM2-t6-8M. (D) ESM2-t33-650M. (E) ESM2-t48-15B. In all cases, non-linear classifiers, such as SVM-RBF, LGBM, and XGB, outperform linear ones. The highest overall performance is observed with ESM2-t48-15B, where SVM-RBF attains the top AUC on the hold-out test set, consistent with the improved separability seen in the projections.
Visualization and separability analysis
Class separability was evaluated using dimensionality reduction techniques. Principal Component Analysis (PCA) with 2 components was applied to obtain a linear projection of the data, and t-Distributed Stochastic Neighbor Embedding (t-SNE) was used to capture non-linear relationships. These analyses were also conducted using scikit-learn (https://scikit-learn.org/).
Web application
The VirHostPRED web server utilizes a microservices architecture to ensure scalability. The core application logic and user interface are implemented using Django (https://www.djangoproject.com/) and Alpine.js (https://alpinejs.dev/), backed by a PostgreSQL (https://www.postgresql.org/) database. Protein sequence inference is offloaded to a dedicated API service hosted on Google Cloud (https://cloud.google.com/), built with Flask (https://flask.palletsprojects.com/) and PyTorch (https://pytorch.org/), which executes the ESM-2 embedding generation and SVM classification on GPU-accelerated computing nodes. This decoupled design allows the web interface to remain lightweight while the cloud backend scales to handle computationally intensive prediction tasks. The platform allows users to upload protein sequences in FASTA format and obtain classification predictions with probabilistic scores in real time. The tool is available freely at https://www.biochemintelli.com/virhostpred/. This implementation democratizes access to machine learning technologies for viral protein classification and contributes to advances in computational virology research.
Results
Unsupervised projections reveal a clear increase in class separability when moving from composition-based descriptors to protein language model embeddings. In the t-SNE and PCA plots for dipeptide composition (DPC) and pseudo-amino acid composition (PAAC), the two classes largely overlap, with diffuse aggregation and no evident linear boundaries. In contrast, ESM2 embeddings show progressively stronger structure: with ESM2-t6-8M the projections already display distinct regions and density gradients; ESM2-t33-650M further consolidates these clusters; and ESM2-t48-15B produces the most compact and clearly separated clouds, especially in t-SNE, indicating that larger models capture finer discriminative signals than classical descriptors (Fig. 2).
Predictive performance mirrors these trends. With PAAC, the best results were obtained with SVM-RBF (accuracy CV/test = 0.783/0.778; AUC CV/test = 0.861/0.844), followed by LGBM (0.776/0.760; AUC 0.850/0.848). The remaining algorithms showed more modest performance, and CV-to-test gaps were small, which indicates stable generalization despite the information ceiling of PAAC (Table 1; Fig. 3). With DPC, SVM-RBF again led (0.776/0.769; AUC 0.850/0.852), with LGBM and XGB close behind. The slight AUC increase on the hold-out set suggests that DPC provides a signal complementary to PAAC, although still limited compared with PLM embeddings (Table 2; Fig. 3).
Using ESM2 embeddings increased performance consistently with model scale. With ESM2-t6-8M, several classifiers exceeded 0.80 accuracy and approximately 0.88–0.89 AUC. SVM-RBF stood out (0.827/0.813; AUC 0.895/0.885), and LGBM or XGB were close by, with approximately 0.821/0.805 accuracy and approximately 0.899/0.889 AUC (Table 3; Fig. 3). ESM2-t33-650M pushed metrics further: SVM-RBF reached 0.836/0.833 accuracy with AUC 0.912/0.905, while XGB, GB, and LGBM achieved similar AUC values, approximately 0.916 in cross-validation and approximately 0.898–0.900 on the hold-out test set (Table 4; Fig. 3).
Finally, ESM2-t48-15B delivered the best balance between cross-validation and test performance. SVM-RBF achieved the highest independent test accuracy, 0.852 with AUC 0.914. LGBM and Gradient Boosting remained highly competitive, with accuracies around 0.836 and AUC around 0.909–0.913. The Random Forest ensemble also showed solid generalization, 0.826 accuracy with AUC 0.898. Linear models were strong but lagged behind non-linear ones. These results are summarized in Table 5, and the corresponding ROC curves are shown in Fig. 3. Overall, ESM2 embeddings, particularly those from the ESM2-t48-15B model, provide the most informative signal for this task, with small CV-test gaps and ROC curves consistently above those obtained with composition-based methods, supporting their use as the representation of choice for deployment.
To ensure broad accessibility of our predictive framework, we developed VirHostPRED, an interactive web application built using the Django 4.2 framework. The platform provides an intuitive interface where users can upload viral protein sequences in FASTA format and receive real-time probabilistic predictions of human infectivity generated by the SVM-RBF model trained on ESM-2 15B embeddings. The backend integrates the trained classifier with an optimized inference pipeline, enabling efficient batch processing and dynamic visualization of prediction outputs. The web server runs on a scalable cloud deployment, ensuring high availability and reproducibility. VirHostPRED is freely accessible at https://www.biochemintelli.com/virhostpred/ and is intended as a community resource to democratize access to machine learning-based viral host range prediction and to support genomic surveillance and computational virology research.
Discussion
Human viral infections continue to emerge and re-emerge from animal reservoirs, driven by ecological changes and human mobility. This scenario demands rapid and scalable tools capable of prioritizing risk signals as soon as new viral sequences appear in metagenomic and surveillance pipelines, well before viral cultures or phenotypic characterization become available7,51. Although high-throughput sequencing has greatly expanded viral discovery, several bottlenecks persist: sampling biases, incomplete host annotations, and operational limitations of metatranscriptomics, all of which delay the interpretation of which viral lineages represent a genuine threat to humans52.
NCBI Virus host annotations derive from literature curation and may contain noise. Potential sources of label uncertainty include: (1) incorrect or incomplete host annotations in original submissions; (2) multi-host (zoonotic) viruses that may appear in both classes depending on annotation; (3) laboratory-adapted strains with artificial host annotations. Our use of RefSeq sequences (which undergo additional curation) partially mitigates these issues. The human infectivity label in VirHostPRED should be interpreted as documented human infection in the literature rather than a definitive biological capability.
The results demonstrate that embeddings derived from ESM-2 consistently outperform traditional molecular descriptors in predicting viral host range. While PAAC and DPC achieved maximum AUC values of 0.861 and 0.852, respectively, ESM-2 embeddings exceeded 0.90, with the 15-billion-parameter model reaching 0.914 on the hold-out test set (Tables 1, 2 and 5, and Fig. 3). Unlike handcrafted descriptors that capture only local physicochemical properties, ESM-2 embeddings learn contextual sequence representations without requiring alignment, enabling the extraction of structural, functional, and evolutionary signals directly from primary sequences28.
These findings align with a growing body of evidence showing that learned protein representations contain substantially richer discriminative information than composition-based features. Liu et al. demonstrated that ESM-1b embeddings combined with multiple-instance learning outperform traditional approaches in viral host prediction, achieving AUC values above 0.95 in bacteriophages and close to 0.9 in eukaryotic viruses35.
Similarly, Villegas-Morcillo et al. reported superior performance of unsupervised protein embeddings over handcrafted features in molecular function prediction53, while Thomas et al. showed that PLM-based embeddings improve protein–protein interaction interface prediction by capturing a broad range of biophysical properties54. Collectively, these studies support the notion that transformer-based PLMs, trained on large-scale protein sequence corpora, encode higher-order evolutionary and functional constraints that cannot be recovered from local compositional statistics alone55,56, providing a more biologically meaningful representation space for predictive modeling in computational virology.
To systematically evaluate the impact of feature representation on predictive performance, we compared classifiers trained on baseline physicochemical descriptors (PAAC, DPC) against those trained on ESM-2 embeddings of increasing capacity. As detailed in Table S3, there is a clear progression in performance: models based on simple composition metrics achieved test AUCs of approximately 0.84–0.85, whereas protein language model embeddings yielded substantial improvements, with the ESM-2 15B model reaching a test AUC of 0.9136. This trend highlights the superior ability of evolutionary-scale representations to capture complex functional signals relevant to host specificity compared to traditional handcrafted features.
The progressive scaling of ESM-2 from 8 million to 15 billion parameters resulted in consistent improvements in discriminative capability, with accuracy increasing from 0.813 to 0.852 on the test set (Tables 3, 4 and 5, and Fig. 3). Unsupervised projections using t-SNE and PCA revealed that ESM-2 embeddings produce progressively clearer class separation as model size increases, consistent with the aforementioned scaling trend and in agreement with previous literature. While PAAC and DPC exhibited extensive overlap between human and non-human viruses in both projection spaces, ESM2-t48-15B generated compact and distinct clusters, particularly in t-SNE, indicating that larger models capture finer discriminative signals. t-SNE, by preserving local nonlinear relationships, demonstrated superior ability to visualize clustering structure compared to PCA, which, as a linear method, failed to reveal class boundaries present in the high-dimensional space. This enhanced separability in two-dimensional projections directly correlates with the predictive performance of supervised classifiers, confirming that PLM embeddings encode biologically relevant structure that facilitates host discrimination. This pattern is consistent with previous reports showing that larger protein language models learn richer representations that capture long-range dependencies and higher-order structural regularities28,57,58.
Among the nine algorithms evaluated, SVM-RBF exhibited the best overall performance with ESM-2 embeddings, achieving the highest test accuracy (0.852) using esm2_t48_15B. Tree-based ensemble methods, including LGBM, XGB, and GB, maintained competitive performance with AUC values consistently above 0.90, while linear methods (LR-L1, LR-L2, SVM-L) showed robust yet slightly lower predictive capacity. This suggests that nonlinear relationships in the embedding space contribute significantly to host discrimination. The small gaps between cross-validation and hold-out test set (typically < 2% in accuracy) indicate stable generalization without substantial overfitting, an advantageous property for viral surveillance applications where emerging viruses may differ from the training data. Comparative studies in related tasks confirm that nonlinear classifiers such as SVMs and ensemble models effectively capture the complexity of high-dimensional feature spaces derived from protein embeddings. For example, in a study on virulence factor prediction using ESM embeddings, six machine learning algorithms were compared with ESM-1 and ESM-2 representations, and it was reported that among them, SVM consistently achieved the best performance for both ESM PLMs, obtaining strong hold-out test set metrics for Gram-positive and Gram-negative bacterial datasets59.
In another study on membrane protein type prediction, high-dimensional meta-features derived from PSSMs (closely resembling embedding-based feature spaces) were evaluated, and algorithms such as SVM, RF, XGB, LGBM, and MLP were compared. In that study, SVM achieved the best predictive performance across most datasets60. Finally, in a study predicting protein thermophilicity from embeddings derived from protein language models (ProtT5), the SVM-RBF algorithm demonstrated outstanding performance, establishing itself as one of the most effective classifiers for capturing nonlinear relationships between sequence representations and the thermodynamic properties of proteins61. All of these precedents corroborate the excellent predictive performance metrics achieved with the three ESM-2 models evaluated in this study.
While protein language models have revolutionized sequence representation, the extent to which model capacity drives predictive performance for viral phenotypes remained to be quantified. To address this, we conducted a rigorous comparative analysis using both cross-validation and an independent held-out test set, implementing a Grid Search strategy (GridSearchCV) to identify optimal hyperparameters (C and γ) for three ESM-2 variants (8M, 650M, and 15B parameters).
As detailed in Table S4, our analysis confirmed a distinct scaling law: we observed a monotonic improvement in sensitivity and specificity as parameters increased. The 15B parameter model optimized with this protocol achieved the highest performance on the independent test set (Accuracy: 0.848, ROC-AUC: 0.922), significantly outperforming both the 8M (Accuracy: 0.801, ROC-AUC: 0.875) and 650M (Accuracy: 0.814, ROC-AUC: 0.900) variants. This finding suggests that the subtle evolutionary signatures distinguishing human-infecting viruses from environmental viromes are deeply embedded in the protein space and require high-capacity models for accurate resolution. Thus, the computational cost of the 15B model is justified by its superior ability to disentangle these complex biological signals.
To visually corroborate these quantitative findings, we further analyzed the structure of the embedding space using dimensionality reduction. As illustrated in Fig. S2, a comparative 3D PCA reveals that while embeddings from smaller models (ESM2-8M) show considerable overlap between human and non-human viral proteins, the 15B parameter model produces a markedly clearer separation between the classes. This progressive disentanglement in the latent space aligns with the observed scaling law and offers a geometric explanation for the superior accuracy of the larger model: by learning richer and more distinct representations of viral host-tropism features, high-capacity models simplify the downstream classification task, effectively separating biological signals that appear entangled in lower-dimensional representations.
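The projection underlying this kind of comparison can be sketched in a few lines. This is a hedged illustration with synthetic Gaussian clouds standing in for the two embedding classes; the real analysis would use the actual ESM-2 vectors for human and non-human viral proteins.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins: two shifted Gaussian clouds mimic embeddings of
# human-infecting vs non-human viral proteins (real inputs: ESM-2 vectors).
rng = np.random.default_rng(1)
emb_human = rng.normal(loc=0.5, size=(100, 320))
emb_other = rng.normal(loc=-0.5, size=(100, 320))
X = np.vstack([emb_human, emb_other])

# Reduce to three principal components, as in a comparative 3D PCA.
pca = PCA(n_components=3)
coords = pca.fit_transform(X)

print(coords.shape)  # one 3D coordinate per protein
print(round(pca.explained_variance_ratio_.sum(), 3))
```

The `coords` array can then be scatter-plotted in 3D, colored by class, to inspect how well the two groups separate for each embedding model.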
A potential limitation of our study concerns the composition of the negative training set, where bacteriophages constitute 73.2% of the non-human viral sequences. Since distinguishing bacteriophages from human-infecting viruses is comparatively easier due to fundamental differences between prokaryotic and eukaryotic viral biology, this composition could inflate reported performance metrics. To address this concern, we conducted additional evaluations using a filtered dataset that completely excludes bacteriophages, retaining only eukaryotic viruses (plant viruses, invertebrate viruses, vertebrate viruses, giant viruses, and archaeal viruses) as negative examples (n = 570). In this more challenging evaluation scenario, VirHostPRED achieved a CV ROC-AUC of 0.828 ± 0.027 and test ROC-AUC of 0.850, compared to 0.922 ± 0.006 and 0.914, respectively, on the full dataset (Table S5; Fig. S3). While this represents a decrease of approximately 10% in discriminative performance, the model maintains strong predictive ability according to standard interpretation guidelines (ROC-AUC 0.80–0.90 indicates “good” discrimination)62. Importantly, VirHostPRED significantly outperformed baseline methods on the eukaryotic-only dataset, achieving 74.7% improvement over random classification and outperforming both Random Forest (ROC-AUC: 0.794) and Logistic Regression (ROC-AUC: 0.776) (Table S5). These results demonstrate that VirHostPRED captures biologically meaningful molecular signatures associated with human infectivity that generalize beyond the prokaryote-eukaryote distinction.
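A baseline comparison of this kind can be sketched with stratified cross-validated ROC-AUC. The snippet below is illustrative only: random data with a weak injected class signal stands in for the eukaryotic-only dataset, and the classifier settings are defaults rather than the tuned models reported in Table S5.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Stand-in for the eukaryotic-only evaluation set: a weak class signal
# is injected so the classifiers have something to learn.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 64))
X[:150] += 0.4
y = np.array([1] * 150 + [0] * 150)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "SVM-RBF": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # Cross-validated ROC-AUC, reported as mean ± std across folds.
    aucs = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: ROC-AUC {aucs.mean():.3f} ± {aucs.std():.3f}")
```

Reporting the fold mean and standard deviation in this way matches the “CV ROC-AUC ± std” convention used in the evaluation above.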
An additional limitation is the absence of external validation on independently collected sequences. The random train-test split, while stratified, does not fully assess generalization to novel virus families or geographically distinct isolates. Future work should evaluate VirHostPRED on emerging viruses (e.g., novel coronaviruses, influenza reassortants) to quantify robustness under distribution shift. Users should interpret predictions for highly divergent sequences with appropriate caution.
The ability to predict human infectivity from individual viral protein sequences has direct implications for surveillance systems and pandemic preparedness. Our web server, VirHostPRED, enables researchers and surveillance teams to submit viral protein sequences and obtain rapid predictions of human infectivity risk without requiring extensive experimental characterization, thereby providing early signals to guide experimental follow-up and risk assessment. Consequently, we believe that the presented approach complements experimental efforts by providing a rapid computational triage that can inform resource allocation and prioritization of emerging pathogens for detailed characterization.
Conclusions
In this study, we demonstrated that protein language model embeddings, particularly those derived from ESM-2 15B, substantially enhance the accuracy of viral host range prediction compared to traditional composition-based descriptors. By combining these embeddings with a robust SVM-RBF classifier, we achieved consistent performance across cross-validation and the hold-out test set, underscoring the generalizability of the learned representations. The resulting VirHostPRED web application translates this predictive framework into an accessible, real-time tool for the research community, enabling early computational triage of newly sequenced viral proteins. Beyond its immediate applicability in genomic surveillance, VirHostPRED highlights the growing potential of large-scale pretrained models to extract biologically meaningful patterns directly from amino acid sequences. Future developments will focus on integrating structural and ecological features, expanding the range of host taxa, and continuously updating the model as new viral data become available, further strengthening its utility for pandemic preparedness and viral discovery.
Data availability
All viral protein sequences used in this study were retrieved from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/). The curated datasets, including the non-redundant sequences and the balanced training and hold-out test sets, are publicly available at VirHostPRED Datasets (data). Additional metadata and scripts used for dataset construction are available from the corresponding author upon reasonable request.
References
Morens, D. M., Folkers, G. K. & Fauci, A. S. The challenge of emerging and re-emerging infectious diseases. Nature 430, 242–249 (2004).
Jones, K. E. et al. Global trends in emerging infectious diseases. Nature 451, 990–993 (2008).
Haider, N. et al. COVID-19—Zoonosis or emerging infectious disease? Front. Public Health 8 (2020).
Zhou, P. et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature 579, 270–273 (2020).
Morens, D. M. et al. The origin of COVID-19 and why it matters. Am. J. Trop. Med. Hyg. 103, 955–959 (2020).
Allen, T. et al. Global hotspots and correlates of emerging zoonotic diseases. Nat. Commun. 8, 1124 (2017).
Plowright, R. K. et al. Pathways to zoonotic spillover. Nat. Rev. Microbiol. 15, 502–510 (2017).
Olival, K. J. et al. Host and viral traits predict zoonotic spillover from mammals. Nature 546, 646–650 (2017).
Grange, Z. L. et al. Ranking the risk of animal-to-human spillover for newly discovered viruses. Proc. Natl. Acad. Sci. 118 (2021).
Call, L., Nayfach, S. & Kyrpides, N. C. Illuminating the virosphere through global metagenomics. Annu. Rev. Biomed. Data Sci. 4, 369–391 (2021).
Mollentze, N., Babayan, S. A. & Streicker, D. G. Identifying and prioritizing potential human-infecting viruses from their genome sequences. PLoS Biol. 19, e3001390 (2021).
Wardeh, M., Blagrove, M. S. C., Sharkey, K. J. & Baylis, M. Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat. Commun. 12, 3954 (2021).
Ladner, J. T. Genomic signatures for predicting the zoonotic potential of novel viruses. PLoS Biol. 19, e3001403 (2021).
Wille, M., Geoghegan, J. L. & Holmes, E. C. How accurately can we assess zoonotic risk? PLoS Biol. 19, e3001135 (2021).
Mollentze, N. & Streicker, D. G. Viral zoonotic risk is homogenous among taxonomic orders of mammalian and avian reservoir hosts. Proc. Natl. Acad. Sci. 117, 9423–9430 (2020).
da Silva, A. F. et al. ViralFlow v1.0—a computational workflow for streamlining viral genomic surveillance. NAR Genom. Bioinform. 6 (2024).
Morizono, K. & Chen, I. S. Receptors and tropisms of envelope viruses. Curr. Opin. Virol. 1, 13–18 (2011).
Maginnis, M. S. Virus–receptor interactions: the key to cellular invasion. J. Mol. Biol. 430, 2590–2611 (2018).
Valero-Rello, A., Baeza-Delgado, C., Andreu-Moreno, I. & Sanjuán, R. Cellular receptors for mammalian viruses. PLoS Pathog. 20, e1012021 (2024).
Dadonaite, B. et al. Spike deep mutational scanning helps predict success of SARS-CoV-2 clades. Nature 631, 617–626 (2024).
Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature 622, 818–825 (2023).
Starr, T. N. et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell 182, 1295–1310.e20 (2020).
Liu, K. et al. Cross-species recognition of SARS-CoV-2 to bat ACE2. Proc. Natl. Acad. Sci. 118 (2021).
Li, F. Receptor recognition and cross-species infections of SARS coronavirus. Antiviral Res. 100, 246–254 (2013).
Li, D. et al. LLM-DDI: leveraging large language models for drug–drug interaction prediction on biomedical knowledge graph. IEEE J. Biomed. Health Inform. 1–9. https://doi.org/10.1109/JBHI.2025.3585290 (2025).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118 (2021).
Bepler, T. & Berger, B. Learning the protein language: evolution, structure, and function. Cell Syst. 12, 654–669.e3 (2021).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods. 16, 1315–1322 (2019).
Wu, F., Wu, L., Radev, D., Xu, J. & Li, S. Z. Integration of pre-trained protein language models into geometric deep learning networks. Commun. Biol. 6, 876 (2023).
Tsukiyama, S. & Kurata, H. Cross-attention PHV: prediction of human and virus protein–protein interactions using cross-attention-based neural networks. Comput. Struct. Biotechnol. J. 20, 5564–5573 (2022).
Lanchantin, J., Weingarten, T., Sekhon, A., Miller, C. & Qi, Y. Transfer learning for predicting virus-host protein interactions for novel virus sequences. In Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 1–10. https://doi.org/10.1145/3459930.3469527 (ACM, 2021).
Iuchi, H. et al. Bioinformatics approaches for unveiling virus-host interactions. Comput. Struct. Biotechnol. J. 21, 1774–1784 (2023).
Liu-Wei, W. et al. DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes. Bioinformatics 37, 2722–2729 (2021).
Liu, D., Young, F., Lamb, K. D., Robertson, D. L. & Yuan, K. Prediction of virus-host associations using protein language models and multiple instance learning. PLoS Comput. Biol. 20, e1012597 (2024).
Madan, S., Demina, V., Stapf, M., Ernst, O. & Fröhlich, H. Accurate prediction of virus-host protein-protein interactions via a Siamese neural network using deep protein sequence embeddings. Patterns 3, 100551 (2022).
Zielezinski, A., Barylski, J. & Karlowski, W. M. Taxonomy-aware, sequence similarity ranking reliably predicts phage–host relationships. BMC Biol. 19, 223 (2021).
Pertsemlidis, A. & Fondon, J. W. Having a BLAST with bioinformatics (and avoiding BLASTphemy). Genome Biol. 2, reviews2002.1 (2002).
Watkins, S. C. & Putonti, C. The use of informativity in the development of robust viromics-based examinations. PeerJ 5, e3281 (2017).
Trifonov, V. & Rabadan, R. Frequency analysis techniques for identification of viral genetic data. mBio 1 (2010).
Gonzales, M. E. M., Ureta, J. C. & Shrestha, A. M. S. Protein embeddings improve phage-host interaction prediction. PLoS One 18, e0289030 (2023).
Gonzales, M. E. M., Ureta, J. C. & Shrestha, A. M. S. PHIStruct: improving phage–host interaction prediction at low sequence similarity settings using structure-aware protein embeddings. Bioinformatics 41 (2024).
Rancati, S. et al. SARITA: a large language model for generating the S1 subunit of the SARS-CoV-2 Spike protein. Brief. Bioinform. 26 (2025).
Dhodapkar, R. M. A deep generative model of the SARS-CoV-2 Spike protein predicts future variants. Preprint at https://doi.org/10.1101/2023.01.17.524472 (2023).
Hesslow, D., Zanichelli, N., Notin, P., Poli, I. & Marks, D. RITA: a study on scaling up generative protein sequence models. Preprint at https://doi.org/10.48550/arXiv.2205.05789 (2022).
Rancati, S. et al. Forecasting dominance of SARS-CoV-2 lineages by anomaly detection using deep autoencoders. Brief. Bioinform. 25 (2024).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Genet. 43, 246–255 (2001).
Petrilli, P. Classification of protein sequences by their dipeptide composition. Bioinformatics 9, 205–209 (1993).
Marie, V. & Gordon, M. L. The (Re-)Emergence and spread of viral zoonotic disease: A perfect storm of human ingenuity and stupidity. Viruses 15, 1638 (2023).
Charre, C. et al. Evaluation of NGS-based approaches for SARS-CoV-2 whole genome characterisation. Virus Evol. 6 (2020).
Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021).
Thomas, D. P. G., Garcia Fernandez, C. M., Haydarlou, R. & Feenstra, K. A. PIPENN-EMB ensemble net and protein embeddings generalise protein interface prediction beyond homology. Sci. Rep. 15, 4391 (2025).
Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet. 141, 1629–1647 (2022).
Chen, J. Y. et al. Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Front. Bioeng. Biotechnol. 13 (2025).
Vieira, L. C., Handojo, M. L. & Wilke, C. O. Medium-sized protein language models perform well at transfer learning on realistic datasets. Sci. Rep. 15, 21400 (2025).
Hou, C., Liu, D., Zafar, A. & Shen, Y. Understanding language model scaling on protein fitness prediction. Preprint at https://doi.org/10.1101/2025.04.25.650688 (2025).
Liu, Y. et al. Advancing virulence factor prediction using protein language models. BMC Biol. 23, 307 (2025).
Ruan, X., Xia, S., Li, S., Su, Z. & Yang, J. Hybrid framework for membrane protein type prediction based on the PSSM. Sci. Rep. 14, 17156 (2024).
Haselbeck, F. et al. Superior protein thermophilicity prediction with protein language model embeddings. NAR Genom. Bioinform. 5 (2023).
Mandrekar, J. N. Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol. 5, 1315–1316 (2010).
Funding
This research received no funding from any funding agency.
Author information
Authors and Affiliations
Contributions
J.F.B. conceived the study, designed the methodology, and developed the VirHostPRED software. He performed the data collection, preprocessing, and model training and evaluation. L.H.B. contributed to dataset curation, literature review, and biological interpretation of the results. A.J.Y. participated in the manuscript writing, and critical revision of the final version. Writing, review, and editing of the manuscript were carried out by J.F.B. with input from all authors. Supervision and final approval were provided by J.F.B. as the corresponding author.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.



Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Beltrán, J.F., Belén, L.H., Parraguez-Contreras, F. et al. Protein language models enable accurate viral host range prediction. Sci Rep 16, 7606 (2026). https://doi.org/10.1038/s41598-026-37765-8
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-026-37765-8


