Fig. 3: Sequence space distribution and similarity analysis of generated proteins. | Nature Communications

From: Ab-initio amino acid sequence design from protein text description with ProtDAT

a Point-cloud distribution of protein vectors for three groups: ‘ProtDAT(PM1)’ and ‘ProtDAT(PM2)’ (sequences generated using two distinct prompt methods) and ‘Test’ (the reference protein sequences from the ProtDAT-Dataset test set). The structures of the reference and generated sequences are shown in yellow and blue, respectively, together with their UniProt IDs, global sequence identity, and TM-score. b An example of the ProtDAT design process with PM1, comprising four components: prompt input, protein sequence generation, protein sequence alignment, and protein evaluation. c Similarity between protein sequences generated by ProtDAT under different generation parameters and natural protein sequences, measured by KL divergence. d, e Amino acid residue distributions of sequences generated (‘Gen Seqs’) with PM1 and PM2, respectively, at Top-p = 0.85 and T (temperature coefficient) = 1.0, compared with the corresponding natural sequences (‘Test Seqs’) from the ProtDAT-Dataset test set.
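
To make the comparison in panels c to e concrete, the following is a minimal sketch of how a KL divergence between two sets of sequences could be computed. It assumes, which the caption does not state, that the divergence is taken over the 20-letter amino acid frequency distributions, in the direction D(generated || natural), with a small pseudocount to avoid zero probabilities. The function names `residue_distribution` and `kl_divergence` and the toy sequences are hypothetical and are not taken from ProtDAT.

```python
import math
from collections import Counter

# The 20 standard amino acids; other symbols (e.g. X, B, Z) are ignored here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_distribution(sequences, pseudocount=1e-6):
    """Return the amino acid frequency distribution over a set of sequences.

    The pseudocount is an assumption of this sketch; it keeps every
    probability strictly positive so the KL divergence stays finite.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_divergence(p, q):
    """Compute D_KL(p || q) in nats for two distributions over AMINO_ACIDS."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)


# Toy, made-up sequences standing in for 'Gen Seqs' and 'Test Seqs'.
gen_seqs = ["MKVLAAGIVG", "MSTNPKPQRK"]
test_seqs = ["MKTAYIAKQR", "MGSSHHHHHH"]

p_gen = residue_distribution(gen_seqs)
p_test = residue_distribution(test_seqs)
print(f"KL(gen || test) = {kl_divergence(p_gen, p_test):.4f} nats")
```

A lower value indicates that the generated residue composition is closer to the natural one. Because KL divergence is asymmetric, swapping the arguments generally gives a different number, so the direction should be fixed before comparing generation parameters.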
