Figure 1
From: Using machine learning to detect coronaviruses potentially infectious to humans

Methodological workflow of the human Binding Potential (h-BiP) score. Left: preprocessing of sequences from alpha and beta coronaviruses. Top: whether the S protein was obtained from annotation or extracted from the whole genome, the dataset consists of 2534 unique S protein sequences. Each protein sequence is transformed into a trimer (3 amino acid) representation by sliding a window one amino acid at a time. Bottom: we curated the host field and annotated the sequences according to their binding status to human receptors. Regardless of the host, a virus is considered positive for binding if there is experimental evidence that it binds a human receptor. Right: a skip-gram model uses a neural network to generate trimer embeddings of a fixed dimension (d = 100). These trimer embeddings are numerical vectors that encode information from the neighboring trimers within a context window in the protein sequence. Next, we compute the final sequence embedding (d = 100) by summing all of its trimer embeddings. The scatterplot visualizes the embeddings of all viruses after dimensionality reduction with t-distributed stochastic neighbor embedding (t-SNE). Finally, the sequence embeddings are fed to a classifier (logistic regression) that learns from the binding information of alpha and beta coronaviruses and produces the h-BiP score. An h-BiP score greater than or equal to 0.5 flags the virus as likely to bind a human receptor.
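
The three computational steps named in the caption (sliding-window trimers, summed sequence embedding, logistic score with a 0.5 cutoff) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 2-dimensional toy trimer vectors, the weights, and the choice to skip trimers without a vector are all assumptions made for illustration. In the workflow described above, the trimer vectors would come from a trained skip-gram model with d = 100 and the weights from a fitted logistic regression.

```python
import math


def trimers(seq, k=3):
    """Overlapping k-mers from sliding a window one residue at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_embedding(seq, trimer_vectors, dim):
    """Sum the vectors of all trimers in the sequence.

    Trimers with no vector are skipped (an assumption of this sketch).
    """
    total = [0.0] * dim
    for t in trimers(seq):
        vec = trimer_vectors.get(t)
        if vec is None:
            continue
        for j in range(dim):
            total[j] += vec[j]
    return total


def logistic_score(embedding, weights, bias):
    """Logistic-regression probability for a sequence embedding."""
    z = bias + sum(w * x for w, x in zip(weights, embedding))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy values; real embeddings would have d = 100.
toy_vectors = {
    "MFV": [1.0, 0.0],
    "FVF": [0.5, 0.5],
    "VFL": [0.0, 1.0],
}
toy_weights = [1.0, -1.0]
toy_bias = 0.0

embedding = sequence_embedding("MFVFL", toy_vectors, dim=2)
score = logistic_score(embedding, toy_weights, toy_bias)
flagged = score >= 0.5  # threshold stated in the caption
```

For the toy sequence `MFVFL`, the window yields the trimers `MFV`, `FVF`, and `VFL`. Summing instead of averaging means longer sequences produce embeddings of larger magnitude, which matches the caption's description of the sequence embedding as the sum of its trimer embeddings.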