Abstract
Background
Machine learning models have been deployed to assess the zoonotic spillover risk of viruses by identifying their potential for human infectivity. However, the lack of comprehensive datasets on viral infectivity poses a major challenge, limiting the range of viruses for which predictions can be made.
Methods
In this study, we address this limitation through two key strategies: constructing expansive datasets across 26 viral families and developing the BERT-infect model, which leverages large language models pre-trained on extensive nucleotide sequences.
Results
Here we show that our approach substantially boosts model performance. This enhancement is particularly notable for segmented RNA viruses, which are associated with severe zoonoses but have been overlooked owing to limited data availability. Our model also exhibits high predictive performance even with partial viral sequences, such as high-throughput sequencing reads or contig sequences from de novo sequence assemblies, indicating the model’s applicability for mining zoonotic viruses from virus metagenomic data. Furthermore, models trained on data up to 2018 demonstrate robust predictive capability for most viruses identified post-2018. Nonetheless, high-resolution evaluation based on phylogenetic analysis reveals general limitations of current machine learning models: the difficulty of flagging the human-infection risk of specific zoonotic viral lineages, including SARS-CoV-2.
Conclusions
Our study provides a comprehensive benchmark for viral infectivity prediction models and highlights unresolved issues in fully exploiting machine learning to prepare for future zoonotic threats.
Plain language summary
To prepare for future pandemics caused by animal-derived viruses, there is a growing need for computational models that can predict whether a virus might infect humans. We constructed extensive datasets covering information about different viruses, including key human pathogens. We developed computational models using these datasets, which outperformed existing approaches across many virus types. However, we also revealed that current models share the same unresolved challenges when assessing whether specific viruses will infect humans, including SARS-CoV-2. These findings suggest that current models may fail to identify animal viruses that can infect humans, which underscores the urgent need for improved predictive models to strengthen pandemic preparedness.
Introduction
Because zoonotic viruses pose a significant threat to human health, monitoring animal viruses with the potential for human infection is crucial1,2,3. Despite advances in metagenomic and metatranscriptomic research revealing vast viral genetic diversity in animals4,5, the evaluation of viral phenotypes, such as infectivity, transmissibility, pathogenesis, and virulence, requires substantial human effort owing to the lack of high-throughput methods. Human infectivity, an important phenotypic characteristic relevant to zoonotic viral spillover and subsequent emerging diseases, remains largely unvalidated for most viruses. To address these challenges, machine learning models for predicting human infectivity using viral genetic features as inputs have been developed2,3,6,7,8,9,10,11. These models may help determine priority viruses for further virological characterization.
Unsupervised feature extraction from viral genetic sequences, together with its interpretation, is helpful for understanding the mechanisms driving viral infectivity as well as for improving model performance. Large language models (LLMs), pre-trained on extensive genetic data, have achieved state-of-the-art performance in various genotype-to-phenotype tasks12. Pre-trained LLMs are expected to capture context-like rules in nucleotide sequences, allowing the construction of high-performance models even with limited labeled data. Furthermore, leveraging LLMs for viral infectivity prediction can potentially uncover previously indiscernible patterns crucial for predicting viral infectivity. Thus, such unsupervised feature extraction could improve model performance and deepen our understanding of the molecular mechanisms underlying viral infectivity.
While existing models have demonstrated high performance, several gaps remain in the model evaluation scheme2,13,14. First, the absence of standardized datasets limits uniform comparison of model performance. Second, previous evaluations may have overestimated predictive performance because the datasets were dominated by viruses irrelevant to the goal of predicting viral infectivity in humans, such as bacteriophages7,8,11. Third, given the successive emergence of novel zoonotic viruses, models should preferably be predictive of the infectivity of novel viruses (i.e., those identified after model construction). Addressing these challenges in model development is essential for real-world application in public health management.
Here, we aim to improve viral infectivity prediction by (i) curating datasets covering various viral families and (ii) developing models that leverage LLMs. Our models outperform existing models for most viral families. Nonetheless, we also identify general limitations of current machine learning models: the difficulty of flagging the human-infection risk of specific zoonotic viral lineages. This study provides a comprehensive benchmark for viral infectivity prediction models and highlights the remaining challenges of fully leveraging machine learning in preparedness for upcoming zoonotic threats.
Methods
Preparing viral sequences and infectivity datasets
Viral sequences and metadata for the 26 viral families were collected from the NCBI Virus Database15. Viral infectivity was labeled according to the host information, i.e., the organism from which the viral sequences were isolated; human infectivity was therefore not experimentally verified for all viruses. In addition, viral sequences collected from environmental samples were excluded from our dataset because their host organisms were ambiguous. For segmented RNA viruses, whose genomes comprise multiple sequences, we grouped sequences into viral isolates based on the combination of 22 entries in the metadata (Supplementary Data 1). If a single viral isolate contained more sequences than the specified number of viral segments (i.e., Orthomyxoviridae: 8, Rotaviridae: 12, and other segmented RNA viruses: 3), redundancy was eliminated by randomly sampling one sequence for each segment. We also checked the Segment column in the metadata to ensure that multiple sequences of the same segment were not assigned to a single virus, which resulted in the removal of 76 virus isolates and 1270 sequences. For non-segmented viruses, representative sequences were selected from identical viral sequences sharing the same infectivity label. Through this curation, we selected 140,638 sequences from the 1,336,901 sequences downloaded from the NCBI Virus Database. Eventually, our curation strategy yielded ~29 times more viral data than the Virus-Host Database used predominantly in previous studies (Supplementary Data 2).
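As an illustration of this curation logic, the following minimal Python sketch implements the per-segment down-sampling and the deduplication of non-segmented viruses; the record keys (isolate_key, segment, sequence, label) are illustrative placeholders rather than the actual metadata column names.

```python
import random
from collections import defaultdict

# Expected segment counts from the curation rules above; other segmented
# RNA families are treated as having three segments.
SEGMENT_COUNTS = {"Orthomyxoviridae": 8, "Rotaviridae": 12}

def curate_segmented(records, family, seed=0):
    """Keep one sequence per segment per isolate; drop isolates whose
    Segment annotation lists more segments than expected."""
    rng = random.Random(seed)
    expected = SEGMENT_COUNTS.get(family, 3)
    by_isolate = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_isolate[rec["isolate_key"]][rec["segment"]].append(rec)

    curated = []
    for isolate, segments in by_isolate.items():
        if len(segments) > expected:
            continue  # inconsistent segment annotation: remove the isolate
        # Random down-sampling: a single sequence per segment.
        curated.extend(rng.choice(seqs) for seqs in segments.values())
    return curated

def curate_non_segmented(records):
    """Collapse identical sequences that share the same infectivity label."""
    seen, curated = set(), []
    for rec in records:
        key = (rec["sequence"], rec["label"])
        if key not in seen:
            seen.add(key)
            curated.append(rec)
    return curated
```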
We divided the collected data into a past virus dataset for model training and a future virus dataset for evaluating model performance on novel viruses identified after model construction. Data with a sequence collection date before December 31, 2017, were classified into the past virus datasets, while subsequent data were classified into the future virus datasets. To validate model applicability to viruses associated with repeated zoonoses, we collected additional future virus datasets for the Orthomyxoviridae and Coronaviridae families from several resources: (i) influenza A viruses classified into four H subtypes (i.e., H1, H3, H5, and H7) from the NCBI Influenza Virus Database16, (ii) influenza A viruses of the H5 subtype from the GISAID database (https://gisaid.org/), and (iii) SARS-CoV-2-related viruses (i.e., sarbecoviruses identified after 2018) with animal infectivity from a previous study17 and those with human infectivity from the Nextstrain database18 (Supplementary Data 3–4). For the GISAID dataset prediction, we annotated open reading frame (ORF) positions with EMBOSS getorf (version 6.6.0.0)19 and used them as inputs for the zoonotic_rank model.
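The temporal split itself reduces to a cut-off on the collection date, as in this sketch (the collection_date field is a hypothetical, already-parsed date; records lacking a usable date would need to be handled separately):

```python
from datetime import date

CUTOFF = date(2017, 12, 31)  # collection dates before this go to the past dataset

def split_by_collection_date(records):
    """Split records into past (training) and future (evaluation) datasets."""
    past, future = [], []
    for rec in records:
        (past if rec["collection_date"] < CUTOFF else future).append(rec)
    return past, future
```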
Constructing predictive model for viral infectivity based on LLMs
We constructed predictive models for viral infectivity, namely BERT-infect, based on LLMs. We used (i) the DNABERT model pre-trained on the human whole genome20 and (ii) the ViBE model pre-trained on the viral genome sequences in the NCBI RefSeq database21. Since pre-trained models with 4-mer tokenization were available for both DNABERT and ViBE, we fine-tuned these BERT models using the past virus datasets to construct an infectivity prediction model for each viral family. Input data were prepared by splitting viral genomes into 250 bp fragments using a sliding window with a 125 bp step and applying 4-mer tokenization. The hyperparameters used to fine-tune the two BERT models are listed in Supplementary Data 5.
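The input preparation can be illustrated as follows, assuming a 125 bp step between consecutive 250 bp fragments and the overlapping k-mer tokenization used by DNABERT-style tokenizers:

```python
def fragment_genome(seq, frag_len=250, step=125):
    """Split a genome into overlapping fragments (whole sequence if shorter)."""
    seq = seq.upper()
    starts = range(0, max(len(seq) - frag_len, 0) + 1, step)
    return [seq[i:i + frag_len] for i in starts]

def tokenize_kmer(fragment, k=4):
    """Overlapping k-mer tokens, e.g., 'ATGCA' -> 'ATGC TGCA'."""
    return " ".join(fragment[i:i + k] for i in range(len(fragment) - k + 1))

# Example: token strings that would be fed to the fine-tuned BERT model.
genome = "ATGC" * 200  # toy sequence
token_strings = [tokenize_kmer(f) for f in fragment_genome(genome)]
```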
Re-training comparative models
Supplementary Data 6 lists the candidate models that could serve as comparisons in this study. Among these, the HostNet10 and VirusBERTHP11 models were not publicly available when we initiated this study, and retraining on our dataset was not possible. Furthermore, VIDHOP9 was excluded from the subsequent analysis for the following reasons: (i) the original study used >100,000 sequences per viral family as the training dataset (i.e., approximately ten-fold the amount of our dataset), and (ii) the original VIDHOP model was built for only three viral families, making it difficult to compare model performance across a diverse range of viral families.
Eventually, three existing models (humanVirusFinder7, DeePac_vir8, and zoonotic_rank3) were re-trained with our newly constructed datasets. The hyperparameters for re-training the DeePac_vir model are listed in Supplementary Data 7. The humanVirusFinder and zoonotic_rank models were re-trained with the same parameters as in the previous studies. For the humanVirusFinder model, we used 4-mer frequencies as inputs, which exhibited the best performance in the previous study. The zoonotic_rank model was re-trained using all genome composition features (i.e., viral genomic features, similarity to interferon-stimulated genes, similarity to housekeeping genes, and similarity to remaining genes) over 1000 iterations, and the top 10% of iterations were used to construct the bagged model.
We used BLASTn22 to evaluate the potential predictability of our datasets based on the simple hypothesis that a virus showing sequence similarity to human-infecting viruses is predicted to be human-infecting. BLASTn prediction was conducted using the training dataset as the database and the test dataset as the query. The following cases were judged as unpredictable: (i) there was no hit with an E-value of <1e-4, or (ii) the aligned length of the top hit did not cover >50% of the query sequence.
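A sketch of this BLASTn baseline is shown below; it assumes a tabular output produced with -outfmt "6 qseqid sseqid evalue length qlen" (hits ordered per query as reported by BLAST) and a set of subject IDs known to be human-infecting, both of which are illustrative assumptions.

```python
import csv
from collections import defaultdict

def blast_predict(blast_tsv, human_ids, max_evalue=1e-4, min_coverage=0.5):
    """Return {query_id: 'human' | 'animal' | None} from a BLASTn tabular file.

    None marks queries judged unpredictable: the top hit covers <=50% of the
    query. Queries absent from the result had no hit with E-value < 1e-4 and
    are therefore also unpredictable.
    """
    hits = defaultdict(list)
    with open(blast_tsv) as fh:
        for qseqid, sseqid, evalue, length, qlen in csv.reader(fh, delimiter="\t"):
            if float(evalue) < max_evalue:
                hits[qseqid].append((sseqid, int(length) / int(qlen)))

    predictions = {}
    for qseqid, qhits in hits.items():
        top_id, top_cov = qhits[0]
        if top_cov <= min_coverage:
            predictions[qseqid] = None
        else:
            # Predict human-infecting if any qualifying hit is a human-infecting virus.
            predictions[qseqid] = "human" if any(s in human_ids for s, _ in qhits) else "animal"
    return predictions
```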
Evaluation of viral infectivity prediction for the past viral genome
For model training and validation using the past virus datasets, stratified five-fold cross-validation was performed to account for class imbalance in infectivity labels and virus genus classifications. The proportions of the training, evaluation, and test datasets were set to 60%, 20%, and 20%, respectively. The prediction probabilities were calculated differently for each type of model. The humanVirusFinder and zoonotic_rank models use the viral genome sequence as input and directly output the predicted results for each sequence. In contrast, the BERT-infect, DeePac_vir, and VIDHOP models use 250 bp subsequences as inputs, and the prediction results for genomic sequences were calculated by averaging the predicted scores over subsequences. The predictive performances were compared using two metrics: (i) the area under the receiver operating characteristic curve (AUROC) and (ii) the area under the precision-recall curve (PR-AUC). Down-sampling was conducted for virus families with a PR-AUC < 0.75 in all models when training with the original past virus dataset (i.e., Rhabdoviridae, Circoviridae, Poxviridae, and Hantaviridae).
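For the subsequence-based models, the genome-level score and the two metrics can be computed as in the following sketch, which uses scikit-learn's average precision as the PR-AUC estimate; the combined stratification key is an approximation of the label/genus balancing described above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import StratifiedKFold

def genome_score(subseq_probs):
    """Average the per-subsequence probabilities of being human-infecting."""
    return float(np.mean(subseq_probs))

def evaluate(y_true, y_score):
    """y_true: 1 = human-infecting, 0 = animal-infecting (one entry per genome)."""
    return {"AUROC": roc_auc_score(y_true, y_score),
            "PR-AUC": average_precision_score(y_true, y_score)}

# Stratified five-fold split; stratifying on a combined "label|genus" key
# approximates the balancing over infectivity labels and genera described above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# for train_idx, test_idx in cv.split(genomes, combined_labels): ...
```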
Evaluation of model detection ability for human-infecting virus using high-throughput sequencing data
We evaluated model predictive performance when using high-throughput sequencing data based on different inputs: (i) 250 bp single-ended high-throughput sequence reads and (ii) variable-length contig sequences (i.e., 500 bp, 1000 bp, 3000 bp, and 5000 bp). For the first input, we only compared BERT-infectDNABERT, BERT-infectViBE, and DeePac_vir because the humanVirusFinder and zoonotic_rank models have a recommended input of >500 bp sequences and viral genomic sequences, respectively. For the second input, we compared BERT-infectDNABERT, BERT-infectViBE, humanVirusFinder, and DeePac_vir. The humanVirusFinder model directly outputted the predicted results for each contig, whereas the prediction results of the other three models were calculated by averaging the predicted scores for subsequences. We evaluated model performance based on the AUROC and PR-AUC.
Evaluation of viral infectivity prediction for the future viral genome
Model predictive performance for the future virus dataset was evaluated under two scenarios. First, the infectivity of novel viruses was determined based on the threshold yielding the highest F1 score in the past virus datasets, and the F1, recall, and precision metrics were calculated. Second, we investigated the enrichment of human-infecting viruses within the top 20% of predicted probabilities.
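Both evaluation scenarios can be sketched as follows; the F1-maximizing threshold is derived from the precision-recall curve of the past dataset, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true_past, y_score_past):
    """Threshold maximizing the F1 score on the past virus dataset."""
    precision, recall, thresholds = precision_recall_curve(y_true_past, y_score_past)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return thresholds[np.argmax(f1[:-1])]  # the last PR point has no threshold

def top_fraction_enrichment(y_true_future, y_score_future, fraction=0.2):
    """Fraction of human-infecting viruses captured in the top 20% of scores
    (assumes at least one human-infecting virus in the future dataset)."""
    y_true_future = np.asarray(y_true_future)
    n_top = max(1, int(len(y_score_future) * fraction))
    top_idx = np.argsort(y_score_future)[::-1][:n_top]
    return y_true_future[top_idx].sum() / y_true_future.sum()
```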
Phylogenetic analyses
Multiple sequence alignments for each viral family were constructed by adding sequences to the reference alignments provided by the International Committee on Taxonomy of Viruses resources23. First, we eliminated sequence redundancy in our datasets using CD-HIT (version 4.8.1)24,25 when the number of sequences per label was >200. For nucleotide reference alignments, our dataset sequences were added to the reference alignment using MAFFT (version 7.508)26 with the options “--add” and “--keeplength”. For amino acid reference alignments, protein sequences were extracted from viral genomic sequences using tBLASTn (version 2.15.0)22 and added to the reference alignment using MAFFT with the same options. Phylogenetic trees were constructed by the maximum likelihood method using IQ-TREE (version 2.1.4 beta)27. Substitution models were selected based on the Bayesian information criterion provided by ModelFinder28. Branch support values were estimated using the ultrafast bootstrap implemented in UFBoot229 with 1000 replicates. Visualization of the phylogenetic trees, viral characteristics, and prediction results was performed using the ggtree package (version 3.6.0)30. The parameters used for the phylogenetic analyses are listed in Supplementary Data 8.
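The alignment and tree-building steps can be scripted roughly as follows; the options are those named above, whereas the file names and the iqtree2 binary name are assumptions that depend on the local installation.

```python
import subprocess

def add_to_reference_alignment(ref_alignment, new_seqs, out_fasta):
    """Add dataset sequences to the ICTV reference alignment with MAFFT."""
    with open(out_fasta, "w") as out:
        subprocess.run(
            ["mafft", "--add", new_seqs, "--keeplength", ref_alignment],
            stdout=out, check=True,
        )

def build_ml_tree(alignment, prefix):
    """Maximum-likelihood tree with ModelFinder and 1000 ultrafast bootstrap replicates."""
    subprocess.run(
        ["iqtree2", "-s", alignment, "-m", "MFP", "-B", "1000", "--prefix", prefix],
        check=True,
    )
```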
Ethical approval
This study did not require approval from an institutional review board (IRB) because it was based exclusively on publicly available data and did not involve any identifiable human subjects or personal information.
Results
Constructing comprehensive large-scale datasets for 26 viral families
We developed models for each viral family by training with pairs of viral sequences and their host information (Fig. 1A). While the Virus-Host Database31, a curated source for virus-host relationships, has mainly been used as the training dataset in previous studies29, its composition does not perfectly align with the purpose of the models: predicting the zoonotic potential of viruses. For example, this database (version 2019/01/31, used in ref. 30) included 13,396 viral sequences, but only 77.6% were associated with eukaryotic hosts (Supplementary Data 2), which may lead to overestimation of model performance owing to the presence of easy-to-predict viruses, such as bacteriophages.
To construct models better suited to human-infecting viruses, we collected data from the NCBI Virus database15 for 26 viral families that include key human-infecting pathogens (details in “Methods”, Supplementary Data 1 and 3–4). Because the human infectivity of many viruses has not been directly validated, the infectivity label (i.e., human-infecting or animal-infecting) for each viral strain was defined based on the host organism from which the viral sequences were collected (see “Discussion”). Our datasets included viruses associated with 1476 vertebrate species and 535 arthropod species.
Eventually, our curated datasets offered a substantial increase in available data, ~29 times more than that of the Virus-Host Database31 (Fig. 1C). Notably, the Virus-Host Database contained <20 human-infecting viral strains for 15 of the 26 viral families, potentially overlooking important pathogens during model training and evaluation. By contrast, our comprehensive datasets encompassed at least 50 human-infecting viral strains for each viral family in the past virus datasets, establishing a valuable resource for developing predictive models for viral infectivity.
Evaluation strategy and comparative models
To assess the predictive capability against new viral sequences identified after the model construction, we divided the viral data into two datasets: (i) a past virus dataset comprising sequences identified up to the end of 2017 for model training and (ii) a future virus dataset for evaluating the predictive capability toward viruses discovered post-2018 (Fig. 1B).
To address the potential shortage of labeled data for viral infectivity, we utilized LLMs pre-trained on extensive genetic sequences26. We fine-tuned two pre-trained Bidirectional Encoder Representations from Transformers (BERT) models using our datasets: (i) DNABERT, pre-trained on the human whole genome20, and (ii) ViBE, pre-trained on the viral genomes registered in the NCBI RefSeq Viral Database21 (see “Methods”). Molecular mimicry of host organisms is known to be a factor that defines host range by contributing to efficient viral replication and immune evasion3,32,33,34; we therefore hypothesized that the DNABERT model, pre-trained on the human genome, could extract representative features associated with human infectivity.
We evaluated model performance through benchmarking with existing models29. Candidate models were selected based on (i) their use of viral genome sequences as inputs and (ii) their applicability to several viral families (Supplementary Data 6).
Performance comparison among models trained with the past virus datasets
To evaluate our model performance in predicting viral infectivity, we trained our and existing models using the past virus datasets with five-fold stratified cross-validation (Fig. 2A, B). BLASTn22 was used to assess the potential predictability of our datasets based on the simple hypothesis: if a human-infecting virus was included among the hit sequences, it was predicted to be human-infecting (see “Methods”).
A Dataset division for five-fold stratified cross-validation. The fold datasets were prepared to maintain similar proportions of infectivity labels and viral genera as the overall data. B Differences in inputs and outputs of each model (see “Methods”). C Comparison of precision-recall area under the curve (PR-AUC) scores when inputting the past viral genomes. Box plots show the median (center line), interquartile range (box), and data range within 1.5 × interquartile range (IQR; whiskers). The median PR-AUC score is shown on the right side of the plot. Each dot corresponds to a viral family (n = 26). D Changes in the PR-AUC score of each model according to the length of input sequences. Each line corresponds to a viral family (n = 26). The 250 bp input is not available in the humanVirusFinder model, and its result is not included.
Our benchmark revealed that BERT-infectDNABERT and BERT-infectViBE (fine-tuned with the past virus datasets) outperformed existing models across most viral families (Fig. 2C). Conversely, BERT-infectDNABERT_scratch and BERT-infectViBE_scratch models (fine-tuned BERT models without pre-trained weights) failed to predict viral infectivity even with fine-tuning on the same datasets. The difference in performance between BERT-infect and previous models was especially pronounced when training with the relatively small datasets from the Virus-Host Database (Supplementary Fig. 1). These results underscore the critical role of LLM pre-training for enhancing model performance, especially when labeled data is limited.
Remarkably, BERT-infect models demonstrated superior PR-AUC scores across 18 viral families, with particularly pronounced performance in the segmented RNA viruses. Despite comprising key pathogens associated with severe diseases, such as hemorrhagic fever35, these viral families have been neglected in previous model evaluations owing to data limitations (Fig. 2C). Thus, our comprehensive datasets represent a valuable resource for developing predictive models for viral infectivity across a range of viral families.
Model applicability for zoonotic virus detection from high-throughput sequencing data
We evaluated model performance in detecting human-infecting viruses from high-throughput sequencing data when inputting (i) 250 bp single-ended high-throughput sequencing reads and (ii) viral contigs with various lengths, which reflect the real-world scenario where short viral contigs are often obtained using sequence assembly36. The BERT-infect and DeePac_vir models maintained consistent performance for most viral families, regardless of input length (Fig. 2D). We attributed the decrease in the predictive performance of the humanVirusFinder model to the k-mer compositions in short input sequences, which may not fully represent viral genome complexity31. These results indicated that only certain models are suitable for mining human-infecting viruses when inputting partial viral sequences (i.e., high-throughput sequencing reads or assembled contigs).
We also considered the computational resources required for model training and prediction (Supplementary Data 9). Deep learning-based models, including BERT-infect, can effectively process shorter input sequences, but they require considerable computational power and time. For example, fine-tuning the BERT-infectDNABERT model with the Coronaviridae dataset takes 12 hours on four NVIDIA Tesla V100 GPUs. In contrast, the humanVirusFinder and zoonotic_rank models offer greater resource efficiency but are difficult to apply to high-throughput sequencing data. These observations underline the trade-off between computational efficiency and model applicability to various types of inputs.
Model performance in predicting infectivity of newly identified viruses
To provide early warnings of future pandemics, models should be able to predict the infectivity of newly discovered viruses. We evaluated the predictive capability of models trained on the past virus datasets when applied to the future virus datasets (Fig. 1B). For the family Coronaviridae, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related viruses (i.e., sarbecoviruses identified post-2018) were distinguished from other coronaviruses in the evaluation of model performance (see “Methods”).
Our approach considered two distinct thresholds (Fig. 3A). The first threshold was set based on the highest F1 scores from the past virus dataset analysis and served as a definitive criterion for distinguishing human-infecting viruses. We observed comparable median F1 scores across four models (Fig. 3B). In contrast, the DeePac_vir model displayed lower F1 scores, primarily due to decreased precision. However, it should be noted that the human infectivity of viruses in our dataset has not been experimentally validated, suggesting that some viruses may be mislabeled. Therefore, it is difficult to determine whether the low precision score is due to a high number of false positives or to the detection of viruses unproven to be infectious to humans (see “Discussion”, Supplementary Figs. 2–4). For the second threshold, we explored the enrichment of human-infecting viruses among those predicted with higher probabilities, reflecting a practical scenario in which high-risk viruses must be prioritized based on their predicted scores. We observed no notable differences across models; for instance, targeting viruses with the top 20% of predicted probabilities led to a median detection rate of ~40% for human-infecting viruses (Fig. 3C). These results demonstrate a high level of model performance toward novel viruses identified after model training.
A Evaluation of model predictive performance for the future viral infectivity using two thresholds. B Model comparison when using the criteria representing the highest F1 scores in the past virus dataset. C The enrichment of human-infecting viruses in the top 20% of predicted probabilities. The y-axis shows the percentage of human-infecting viruses detected for each viral family in the top 20% of probabilities. Box plots show the median (center line), interquartile range (box), and data range within 1.5×IQR (whiskers). Each dot corresponds to a viral family (n = 26). The median score is shown above the plot.
Systematic identification of difficult-to-predict viral lineages
To identify virus lineages for which the risk of human infection is difficult to predict, we conducted a high-resolution model comparison at the viral family and genus levels (Supplementary Figs. 5–8). While F1 scores were >0.75 for most virus families, poor predictive ability was observed in almost all models for some viral families: Circoviridae, Coronaviridae, Flaviviridae, Hepeviridae, Rhabdoviridae, and Hantaviridae. Furthermore, even within families whose other genera had F1 scores above 0.75, predictive gaps existed for particular lineages: SARS-CoV-2-related viruses, Flavivirus, and Phlebovirus, which are associated with severe infectious diseases and suspected zoonotic origins. Such difficulties were also observed in the original models, although they were not noted in previous studies32. Notably, half of these difficult-to-predict viral lineages are listed as high-risk viruses in the WHO pathogen prioritization (Supplementary Fig. 9). Although we checked the training datasets for these difficult-to-predict lineages, neither was the amount of data extremely small nor was the proportion of human-infecting viruses particularly low (Supplementary Fig. 10), so the reason for the prediction difficulty could not be clarified (see “Discussion”). Our findings highlight critical issues in viral infectivity prediction models: despite achieving substantial predictive performance overall, the models consistently failed to predict the infectivity of specific zoonotic viral lineages.
Difficult-to-predict viral lineages frequently changed infectivity during evolution
To further investigate the challenges in predicting human infectivity of zoonotic viruses, we mapped the phylogenetic relationships of viruses alongside their prediction results (Fig. 4A). Here, we focused on Flaviviridae, which includes viruses that cause severe infectious diseases in humans, such as hepatitis C virus (genus Hepacivirus) and Zika virus (genus Flavivirus). In the genus Hepacivirus, a phylogenetic distinction existed between human- and animal-infecting viruses, and F1 scores for all models were above 0.8 for both the past and future virus datasets (Fig. 4B). In contrast, model performance dropped for the genus Flavivirus, which was characterized by frequent shifts in infectivity during its evolution: the F1 scores were ~0.6 for the future virus dataset. These results suggest the insufficient predictive capability of current models for viral lineages in which the zoonotic-infection potential has been acquired repeatedly.
A Association between viral phylogenetic relationships and predicted results (n = 2215). The upper panel shows the phylogenetic tree, and the heatmaps in the middle panel show each viral property (i.e., viral genus, host label, and dataset). The lower panel shows the predictive probabilities generated by the five models. B Comparison of the F1 scores among viral genera. The black and pale green bars correspond to the past and future viral datasets.
Challenges in detecting human infectious risk of emerging zoonotic viral lineages
Our study revealed severe limitations in the ability of models to identify the risk of human infectivity posed by SARS-CoV-2-related viruses (Supplementary Fig. 5). All models trained on the past virus dataset failed to recognize SARS-CoV-2-related viruses as a potential threat despite achieving high predictive performance for most coronaviruses (Fig. 5A, B). Notably, although this dataset included various human-infecting SARS-CoV-2 variants from the original outbreak and the currently prevalent Omicron variants, the human-infecting potential of SARS-CoV-2-related viruses could not be determined based on the best F1 score for the past virus datasets (Fig. 5C). These findings highlight a crucial gap in our preparedness for zoonotic pandemics: even if SARS-CoV-2-related viruses had been detected before the COVID-19 outbreak through genomic surveillance, the current models would not have flagged them as high-risk (see “Discussion”).
A Association between viral phylogenetic relationships and predicted results (n = 3102). The upper panel shows the phylogenetic tree, and the heatmaps in the middle panel show each viral property (i.e., viral genus, host label, and dataset). The lower panel shows the predictive probabilities generated by five models. B Comparison of the F1 scores among viral genera. C Ranking of predictive probabilities in the past virus dataset (upper, n = 2683), future virus dataset (middle, n = 439), and SARS-CoV-2-related viruses (bottom, n = 205).
Furthermore, we also revealed that current models did not predict the human infectivity of most H5 influenza A viruses, for which >900 human infections have been reported to date and spillover into dairy cattle, cats, and humans has increased since March 202437,38,39 (Supplementary Fig. 11). Despite the high performance of several models, including the BERT-infect models, on the past and future virus datasets, they failed to detect the human-infection risk of H5 influenza A viruses. These findings emphasize an essential area for model improvement to accurately assess the risk posed by emerging zoonotic viruses.
Discussion
The rapid elucidation of viral genetic diversity has increased the demand for high-throughput approaches to assess the potential human infectivity of viruses, such as machine learning models that use viral genetic information as inputs3,34,35. In this study, to overcome the limitation of insufficient labeled data in constructing viral infectivity prediction models, we constructed new datasets across 26 viral families and developed models that utilize LLMs pre-trained to capture context-like rules within genetic sequences (Fig. 1). Our models exhibited high predictive performance on the past and future virus datasets across most viral families (Figs. 2, 3). Particularly noteworthy was the improved performance of our models on segmented RNA viruses, which have been neglected in viral infectivity prediction owing to limited data. These results represent a major advance in our preparedness to combat future threats from zoonotic viruses.
While models trained on the past virus datasets demonstrated high predictive capability for most viruses identified after model construction, our high-resolution evaluation based on phylogenetic analysis revealed significant limitations of current models in assessing the risk associated with specific zoonotic viral lineages (Figs. 4, 5, Supplementary Figs. 5–8). Such limitations were observed even in models previously reported to have high predictive performance, indicating an essential area for future development that has not been addressed before. Crucially, no model accurately identified the human-infection risk posed by SARS-CoV-2-related viruses, and previous studies have likewise only weakly warned of the risks associated with these viruses. Therefore, even with enhanced viral surveillance in animal populations, current models may fail to detect emerging high-risk zoonotic viruses. These insights emphasize the need for advances in predictive models to prepare against future zoonotic viral diseases.
A potential limitation in developing models to predict viral infectivity is the gap between the predictive purpose and the training datasets, where the labels may not accurately reflect actual viral infectivity. Typically, most models are trained using infectivity labels defined based on the host organism from which the viral sequences were identified (Fig. 1A). This proxy approach is necessitated by the limited availability of comprehensive datasets on viral infectivity in humans35; it should therefore be noted that a certain number of human-infecting viruses may be labeled as animal-infecting. Given these limitations, this study focused on model sensitivity (i.e., detecting human-infection risk), and animal-derived viruses predicted to be human-infecting by multiple models should be prioritized for risk assessment in future research (Supplementary Fig. 4). Furthermore, a multi-labeling method that explicitly indicates viruses infecting both animals and humans may improve predictive performance for zoonotic viruses. However, an important hyperparameter of the multi-labeling method has not been investigated: the taxonomic level at which labels are assigned. Future work will need to include preliminary investigations into the degree of sequence variation that leads to infectivity changes in order to identify the appropriate taxonomic level for multi-labeling.
In this study, we constructed new datasets across 26 viral families, comprising ~29 times more available data than previously used36 (Fig. 1C); however, further dataset curation is necessary to enhance model performance in predicting zoonotic viral infectivity. An overrepresentation of easy-to-predict viruses could hinder learning about zoonotic viruses derived from rare spillover events36. Our evaluation showed high predictive capability for most viral families in the future virus datasets, likely owing to the presence of viruses with high sequence similarity and the same infectivity as past viruses (Supplementary Fig. 2). Thus, further dataset curation focusing on viral lineages associated with zoonoses could enhance model performance. However, it should also be noted that removing redundancy involves a trade-off with data scarcity, because only a limited number of human-infecting viruses have been experimentally validated.
Furthermore, another limitation of our dataset curation is that the labeling method cannot distinguish between infectivity (i.e., the capability for animal-to-human infection) and transmissibility (i.e., the ability to sustain human-to-human transmission). In this study, we also conducted large-scale predictions of the infectivity of zoonotic H5 influenza A viruses, showing that current models failed to predict the infectivity of almost all of these viruses (Supplementary Fig. 11). It is currently difficult to determine whether these results reflect the fact that recent H5 influenza A virus spillover events have not led to human-to-human transmission or stem from the inadequate predictive performance of current models. Further model development to hierarchically predict infectivity and transmissibility will be necessary to accurately assess the risk of zoonotic spillover and subsequent pandemic potential.
We identified difficult-to-predict viral lineages, but we found no problems with these viral genera in the training datasets; the reason behind this prediction difficulty thus remains unclear (Supplementary Fig. 10). Frequent changes in infectivity during Flavivirus evolution may have hindered model training (Fig. 4); however, the association between prediction difficulty and changes in infectivity should be further assessed through virus-host evolutionary analysis. Another possibility is that capturing infectivity changes caused by a small number of genomic variations is challenging for models that use nucleotide sequences as input. Indeed, our evaluation revealed that current models fail to recognize the human infectivity threat posed by SARS-CoV-2-related viruses (Fig. 5), whose infectivity has been reported to change through limited mutations, mainly in the spike protein40,41. Utilizing protein language models, known for extracting feature vectors that reflect the structural and functional properties of proteins40, could enhance model performance by detecting key changes at the amino acid level relevant to viral infectivity42.
Knowledge-based models integrating multiple features related to molecular mechanisms underlying viral infectivity represent another promising approach43. Recent studies highlighted the importance of various types of virus-host interactions, beyond the entry pathway, in determining viral infectivity3,32,44. Interestingly, our BERT-infect models displayed comparable predictive performance, although the DNABERT and ViBE models were pre-trained on the human and viral genomes, respectively (Figs. 2, 3, Supplementary Figs. 5–8). These results suggest that two BERT-infect models may capture different contextual features involved in viral infectivity and yield new insights into viral infection mechanisms. Thus, a dual strategy of (i) model refinement for unsupervised feature extraction and (ii) deepened understanding of infection mechanisms through model interpretation would contribute to developing knowledge-based models for accurately assessing the human infection risk of viruses.
In conclusion, this study presents the BERT-infect model leveraging LLMs, along with a framework for evaluating model performance in predicting viral infectivity. Although our models demonstrated high predictive performance for various viral families, we found unresolved challenges shared by current models in accurately predicting certain zoonotic viral lineages. These findings emphasize an essential area for model improvement to accurately assess the risk posed by emerging zoonotic viruses. Nevertheless, relying solely on machine learning models is not sufficient to prepare for future zoonotic viral pandemics; we believe that enhanced virus surveillance in animals and collaboration with virological experiments remain essential, and zoonotic risk surveillance should be strengthened through such a research ecosystem.
Data availability
Code availability
The relevant codes are available at https://github.com/Junna-Kawasaki/BERT-infect_2024.
References
Carroll, D. et al. The Global Virome Project. Science 359, 872–874 (2018).
Carlson, C. J. et al. The future of zoonotic risk prediction. Philos. Trans. R. Soc. Lond. B Biol. Sci. 376, 20200358 (2021).
Mollentze, N., Babayan, S. A. & Streicker, D. G. Identifying and prioritizing potential human-infecting viruses from their genome sequences. PLoS Biol. 19, e3001390 (2021).
Greninger, A. L. A decade of RNA virus metagenomics is (not) enough. Virus Res. 244, 218–229 (2018).
Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).
Keusch, G. T. et al. Pandemic origins and a One Health approach to preparedness and prevention: Solutions based on SARS-CoV-2 and other RNA viruses. Proc. Natl Acad. Sci. USA 119, e2202871119 (2022).
Zhang, Z. et al. Rapid identification of human-infecting viruses. Transbound. Emerg. Dis. 66, 2517–2522 (2019).
Bartoszewicz, J. M., Seidel, A. & Renard, B. Y. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom. Bioinform. 3, lqab004 (2021).
Mock, F., Viehweger, A., Barth, E. & Marz, M. VIDHOP, viral host prediction with deep learning. Bioinformatics 37, 318–325 (2020).
Ming, Z. et al. HostNet: improved sequence representation in deep neural networks for virus-host prediction. BMC Bioinforma. 24, 455 (2023).
Wang, Y., Yang, J. & Cai, Y. VirusBERTHP: Improved Virus Host Prediction Via Attention-based Pre-trained Model Using Viral Genomic Sequences. in 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 678–683 (IEEE, 2023).
Liu, J. et al. Large language models in bioinformatics: applications and perspectives. ArXiv (2024).
Mollentze, N. & Streicker, D. G. Predicting zoonotic potential of viruses: where are we? Curr. Opin. Virol. 61, 101346 (2023).
Wille, M., Geoghegan, J. L. & Holmes, E. C. How accurately can we assess zoonotic risk? PLoS Biol. 19, e3001135 (2021).
Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577 (2015).
Bao, Y. et al. The influenza virus resource at the National Center for Biotechnology Information. J. Virol. 82, 596–601 (2008).
Lytras, S. et al. Exploring the natural origins of SARS-CoV-2 in the light of recombination. Genome Biol. Evol. 14, evac018 (2022).
Hadfield, J. et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Gwak, H.-J. & Rho, M. ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data. Brief. Bioinform. 23, bbac204 (2022).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinforma. 10, 421 (2009).
Lefkowitz, E. J. et al. Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res. 46, D708–D717 (2018).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Minh, B. Q. et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q. & Vinh, L. S. UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522 (2018).
Yu, G. Using ggtree to visualize data on tree-like structures. Curr. Protoc. Bioinforma. 69, e96 (2020).
Mihara, T. et al. Linking virus genomes with host taxonomy. Viruses 8, 66 (2016).
Babayan, S. A., Orton, R. J. & Streicker, D. G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science 362, 577–580 (2018).
Takata, M. A. et al. CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature 550, 124–127 (2017).
Martínez, M. A., Jordan-Paiz, A., Franco, S. & Nevot, M. Synonymous virus genome recoding as a tool to impact viral fitness. Trends Microbiol 24, 134–147 (2016).
Kuhn, J. H. et al. Annual (2023) taxonomic update of RNA-directed RNA polymerase-encoding negative-sense RNA viruses (realm Riboviria: kingdom Orthornavirae: phylum Negarnaviricota). J. Gen. Virol. 104, 001864 (2023).
Kawasaki, J., Kojima, S., Tomonaga, K. & Horie, M. Hidden viral sequences in public sequencing data and warning for future emerging diseases. MBio 12, e0163821 (2021).
Garg, S. Outbreak of highly pathogenic avian influenza A(H5N1) viruses in U.S. dairy cattle and detection of two human cases—United States, 2024. MMWR Morb. Mortal. Wkly. Rep. 73, 501–505 (2024).
Burrough, E. R. et al. Highly pathogenic avian influenza A(H5N1) Clade 2.3.4.4b virus infection in domestic dairy cattle and cats, United States, 2024. Emerg. Infect. Dis. 30, 1335-1343 (2024).
Nguyen, T.-Q. et al. Emergence and interstate spread of highly pathogenic avian influenza A(H5N1) in dairy cattle. bioRxiv (2024) https://doi.org/10.1101/2024.05.01.591751.
Li, P. et al. Effect of polymorphism in Rhinolophus affinis ACE2 on entry of SARS-CoV-2 related bat coronaviruses. PLoS Pathog. 19, e1011116 (2023).
Temmam, S. et al. Bat coronaviruses related to SARS-CoV-2 and infectious for human cells. Nature 604, 330–336 (2022).
Liu D. et al. Prediction of virus-host associations using protein language models and multiple instance learning. PLoS Comput. Biol. 20, e1012597 (2024).
Thadani, N. N. et al. Learning from prepandemic data to forecast viral escape. Nature (2023) https://doi.org/10.1038/s41586-023-06617-0 (2023).
Dufloo, J. et al. Receptor-binding proteins from animal viruses are broadly compatible with human cell entry factors. Nat. Microbiol. 10, 405–419 (2025).
Kawasaki, J. Comprehensive datasets for 25 viral families. Zenodo https://doi.org/10.5281/zenodo.11102793 (2024).
Kawasaki, J. BERT-infect models for non-segmented DNA viruses. Zenodo https://doi.org/10.5281/zenodo.11103056 (2024).
Kawasaki, J. BERT-infect models for non-segmented RNA viruses. Zenodo https://doi.org/10.5281/zenodo.11103079 (2024).
Kawasaki, J. BERT-infect models for segmented RNA viruses. Zenodo https://doi.org/10.5281/zenodo.11103091 (2024).
Acknowledgements
We acknowledge with gratitude the Authors, the Originating laboratories responsible for obtaining specimens, and the Submitting laboratories that generated and shared the genetic sequences and metadata via the GISAID Initiative. We are grateful to Dr. Jumpei Ito (The University of Tokyo, Japan), Dr. Sho Miyamoto (National Institute of Infectious Diseases, Japan), and Dr. Masayuki Horie (Osaka Metropolitan University, Japan) for their helpful discussions. This study was supported by a grant from JST PRESTO (JPMJPR23R4 to J.K.); Japan Society for the Promotion of Science (JSPS) KAKENHI (JP22KJ2901 and JP25H01310 to J.K.); and the Waseda Research Institute for Science and Engineering, Grant-in-Aid for Young Scientists (Early Bird) (J.K.). The super-computing resources were provided by the ROIS National Institute of Genetics, the Human Genome Center at The University of Tokyo, and the Information Technology Center at The University of Tokyo. We thank Editage (https://www.editage.com/) for English-language editing and review of this manuscript. Some illustrations were retrieved from BioRender.com.
Author information
Authors and Affiliations
Contributions
J.K. and M.H. conceived the study; J.K. mainly performed the bioinformatics analyses; J.K. prepared the figures and wrote the initial draft of the manuscript. All authors contributed to study design, data interpretation, and paper revision, and have approved the final manuscript for publication.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. Peer review reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Kawasaki, J., Suzuki, T. & Hamada, M. Hidden challenges in evaluating spillover risk of zoonotic viruses using machine learning models. Commun Med 5, 187 (2025). https://doi.org/10.1038/s43856-025-00903-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s43856-025-00903-w