Table 1 Overall performance of NuFold on the test set using different models

From: NuFold: end-to-end approach for RNA tertiary structure prediction with flexible nucleobase center representation

| NuFold Variations | Ave. RMSD (Å) | Ave. GDT-TS | # of Correct targets | VS Baseline (win / tie / lose) |
| --- | --- | --- | --- | --- |
| RMSD-centric (Baseline) | 6.98 | 0.443 | 25 / 36 | - |
| GDT-TS-centric | 7.06 | 0.441 | 25 / 36 | 2 / 30 / 4 |
| (Baseline +) | | | | |
| Baseline + Recycles | 6.87 | 0.444 | 25 / 36 | 3 / 31 / 2 |
| Baseline + Metagenome | 6.68 | 0.453 | 25 / 36 | 5 / 29 / 2 |
| Baseline + Recycles + Metagenome | 6.67 | 0.456 | 25 / 36 | 6 / 28 / 2 |
| (Small Models) | | | | |
| 24 Evoformer Blocks + 50% self-distillation | 7.28 | 0.454 | 24 / 36 | 9 / 19 / 8 |
| 24 Evoformer Blocks + 75% self-distillation | 7.98 | 0.445 | 24 / 36 | 7 / 20 / 9 |
| Population-based: | | | | |
| Best in the population | 5.62 | 0.490 | 27 / 36 | 16 / 20 / 0 |
| Largest cluster (centroid) | 7.77 | 0.440 | 25 / 36 | 2 / 26 / 8 |
| Largest cluster (pLDDT) | 7.80 | 0.439 | 25 / 36 | 2 / 26 / 8 |
| Highest pLDDT | 6.87 | 0.452 | 25 / 36 | 7 / 24 / 5 |

# of correct targets: the number of the 36 test RNA targets for which the model achieved a root mean square deviation (RMSD) below 6 Å. VS Baseline: a per-target comparison of RMSD with the baseline, counting the targets where the model's RMSD is better than, tied with, or worse than the baseline's. A target is counted as a tie when its RMSD is within 0.5 Å of the baseline's. The “RMSD-centric” model, selected at the 146,287th training step, had the smallest average RMSD on the validation dataset. Similarly, the “GDT-TS-centric” model, selected at the 145,263rd training step, had the highest average GDT-TS on the validation dataset. The second block (Baseline +) shows results for the baseline model with an enlarged MSA from a metagenome database search and with the number of recycles increased from 3 to 30. In the + Recycles models, the structure with the highest pLDDT was selected from those generated at recycle iterations 8 to 14 (this method and its motivation are discussed in detail later). In the + Metagenome models, the structure with the highest pLDDT was selected from those generated from 3 metagenome MSAs and the original MSA. The number of recycles was set to 3. The last block, with four rows, presents results of the population-based methods. The “Best in the population” row shows the best values (lowest RMSD, highest GDT-TS) over all 385 structure models. In the “Largest cluster (centroid)” approach, structure models were clustered by structural similarity using LB3Dclust [43], and the structure closest to the averaged structure of the largest cluster was selected. In the “Largest cluster (pLDDT)” approach, the structure with the highest pLDDT within the largest cluster was chosen. “Highest pLDDT” indicates the structure with the highest pLDDT among the 385 generated structures, without clustering.
The best result in each metric is shown in bold (the best in the population values were excluded from the comparison).
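
The two count-based columns can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names and the example RMSD values are hypothetical, and it assumes that a tie means the absolute difference between the model's and the baseline's per-target RMSD is below 0.5 Å, which is one reading of the footnote.

```python
from typing import Sequence, Tuple

CORRECT_RMSD_CUTOFF = 6.0  # Å; "correct target" threshold stated in the footnote
TIE_MARGIN = 0.5           # Å; tie threshold stated in the footnote


def count_correct(rmsds: Sequence[float],
                  cutoff: float = CORRECT_RMSD_CUTOFF) -> int:
    """Count targets whose RMSD is strictly below the cutoff."""
    return sum(1 for r in rmsds if r < cutoff)


def win_tie_lose(model_rmsds: Sequence[float],
                 baseline_rmsds: Sequence[float],
                 tie_margin: float = TIE_MARGIN) -> Tuple[int, int, int]:
    """Per-target comparison against the baseline.

    Assumption: a tie is |model - baseline| < tie_margin. Otherwise a
    lower RMSD than the baseline is a win and a higher one is a loss.
    """
    if len(model_rmsds) != len(baseline_rmsds):
        raise ValueError("both lists must cover the same targets")
    win = tie = lose = 0
    for model, baseline in zip(model_rmsds, baseline_rmsds):
        diff = model - baseline
        if abs(diff) < tie_margin:
            tie += 1
        elif diff < 0:
            win += 1
        else:
            lose += 1
    return win, tie, lose


# Hypothetical per-target RMSDs (Å) for three targets, not data from the paper.
baseline = [5.0, 5.0, 8.0]
variant = [4.0, 5.2, 9.0]
n_correct = count_correct(variant)
summary = win_tie_lose(variant, baseline)
```

With the real per-target RMSDs for all 36 test RNAs, `count_correct` would give the numerator of the "# of Correct targets" column and `win_tie_lose` the three numbers in the "VS Baseline" column, provided the tie assumption above matches the authors' definition.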