Extended Data Fig. 6: Relationships between quality of wild-type protein structure and PST performance.

a–b, Supplement to Fig. 3c. Boxplots show the distribution of a, the percentage of coils in the protein structure over 34 datasets and b, the third quartile (Q3) of B factors at alpha carbons over 26 X-ray datasets. Datasets were divided into two classes according to whether the PST embedding was the best embedding. Scatter points show the same data as the boxplots, for individual datasets. A one-sided Mann–Whitney U-test was used to assess whether the values of the two classes differ significantly. Boxplots display the five-number summary: the center line shows the median, the upper and lower limits of the box show the upper and lower quartiles, and the upper and lower whiskers show the maximum and minimum after excluding “outliers” outside the interquartile range. In a, sample sizes for PST ranked as the best model are n = 21, n = 15, n = 18 and n = 19 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 13, n = 19, n = 16 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. P values are 0.01, 3 × 10⁻⁵, 1 × 10⁻³ and 1 × 10⁻³ for training data sizes 24, 96, 168 and 240, respectively. In b, sample sizes for PST ranked as the best model are n = 8, n = 12, n = 14 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 18, n = 14, n = 12 and n = 11 for training data sizes 24, 96, 168 and 240, respectively. P values are 0.03, 0.02, 0.07 and 0.02 for training data sizes 24, 96, 168 and 240, respectively.
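
As an illustration of the statistics named in this legend, the following is a minimal Python sketch of a one-sided Mann–Whitney U-test and of a boxplot five-number summary. It is not the authors' analysis code. Several choices are assumptions made only for the sketch: the tested direction (first class greater than second), the normal approximation for the P value (without tie or continuity correction), the 1.5 × IQR whisker rule, and the function names `mann_whitney_u_one_sided` and `box_summary`. The input values are invented and are not data from the figure.

```python
import math
from statistics import quantiles


def mann_whitney_u_one_sided(x, y):
    """One-sided test of the alternative 'x tends to be greater than y'.

    Returns (U, p). U counts pairs (xi, yj) with xi > yj, ties counting 0.5.
    p uses the normal approximation (assumption: no tie or continuity
    correction), so it is only approximate for small samples.
    """
    n1, n2 = len(x), len(y)
    u = sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
            for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return u, p


def box_summary(values, whisker=1.5):
    """Five-number summary: (low whisker, Q1, median, Q3, high whisker).

    Assumption: points farther than `whisker` * IQR from the box are treated
    as outliers and excluded from the whiskers.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return min(inside), q1, med, q3, max(inside)


# Invented example values (hypothetical coil percentages), one per dataset.
pst_best = [31.0, 35.5, 40.2, 44.8, 47.1, 52.3]
pst_not_best = [18.4, 22.9, 25.0, 29.7, 33.3]

u_stat, p_value = mann_whitney_u_one_sided(pst_best, pst_not_best)
print(f"U = {u_stat}, one-sided P = {p_value:.3g}")
print("PST best:    ", box_summary(pst_best))
print("PST not best:", box_summary(pst_not_best))
```

For small samples an exact P value is preferable to the normal approximation; `scipy.stats.mannwhitneyu` offers this through its `alternative` and `method` parameters.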