Fig. 1
From: Protein language models enable accurate viral host range prediction

Architecture of the workflow. Viral protein amino acid sequences were collected from NCBI Virus and split into positive (human viruses) and negative (non-human viruses) datasets. Molecular descriptors (DPC, PAAC) and pretrained ESM2 embeddings (ESM2-t6-8M, ESM2-t33-650M, ESM2-t48-15B) were computed. Models were trained with stratified 10-fold cross-validation on 80% of the data and evaluated on the remaining 20%. Nine classifier variants were assessed: LR L1, LR L2, RF, SVM-L, SVM-RBF, XGB, LGBM, GB, and MLP. The selected model was deployed as a web application that accepts FASTA inputs and returns probabilistic predictions.
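
As an illustration of the descriptor step, the following is a minimal sketch of how a DPC (dipeptide composition) vector could be computed from one protein sequence. It is not the authors' implementation. The names `AMINO_ACIDS`, `DIPEPTIDES`, and `dipeptide_composition` are hypothetical, and two choices are assumptions: dipeptides containing non-standard residues (for example `X`) are skipped, and frequencies are normalized by the number of dipeptides actually counted.

```python
from itertools import product

# The 20 standard amino acids; all ordered pairs give 400 dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
_INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Assumption: overlapping dipeptides that contain a non-standard
    residue are ignored, and counts are divided by the number of
    dipeptides that were counted.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    # Slide a window of width 2 over the sequence.
    for i in range(len(seq) - 1):
        idx = _INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        # Sequence too short or without any valid dipeptide.
        return [0.0] * len(DIPEPTIDES)
    return [count / total for count in counts]


if __name__ == "__main__":
    features = dipeptide_composition("MKTAYIAKQR")
    print(len(features))
```

A vector of this kind has a fixed length regardless of sequence length, which is what allows proteins of different sizes to be passed to the same classifier.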