Fig. 2: Cross-dataset evaluation on the GUIDE-seq dataset.

From: A versatile CRISPR/Cas9 system off-target prediction tool using language model

The models were trained on the CIRCLE-seq dataset and externally validated on GUIDE-seq, which was generated on a different experimental platform. Left: ROC curves showing AUROC for all models. Right: precision–recall curves showing AUPRC, a metric that is particularly informative for imbalanced datasets. The RNA-pretrained model CCLMoff achieved the best performance (AUROC = 0.996, AUPRC = 0.520), significantly outperforming baseline models including LSTM, CRISPR-Net, and AttenToCrispr. In addition, we evaluated three recent language-model-based approaches: CCLMoff-Hyena (DNA-pretrained), CRISPR-BERT (task-specific pretraining), and CRISPR-DNT (Transformer-based). Both CRISPR-BERT and CCLMoff-Hyena yielded lower AUPRC than CCLMoff, highlighting the advantage of RNA-specific foundation-model pretraining for capturing sgRNA–DNA interactions.
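For readers reproducing this comparison, the two panel metrics can be computed from per-site binary labels and model scores. The sketch below is illustrative only: the variable names, the synthetic data, and the ~1% positive rate are assumptions chosen to mimic the class imbalance typical of off-target datasets, not values taken from the paper or its code.

```python
# Minimal sketch of the Fig. 2 evaluation metrics (AUROC, AUPRC).
# Assumptions (not from the paper): y_true holds binary off-target
# labels for GUIDE-seq sites, y_score holds one model's predictions.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Toy stand-in for the GUIDE-seq validation set: heavily imbalanced,
# with roughly 1% true off-target sites.
y_true = rng.binomial(1, 0.01, size=10_000)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.2, size=10_000), 0, 1)

auroc = roc_auc_score(y_true, y_score)            # left panel: ROC curve area
auprc = average_precision_score(y_true, y_score)  # right panel: PR curve area
print(f"AUROC = {auroc:.3f}, AUPRC = {auprc:.3f}")
```

On imbalanced data like this, a model can reach a high AUROC while still ranking many negatives above the rare positives, which is why the right panel's AUPRC separates the methods more sharply than the left panel's AUROC.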
