Fig. 4: Training of a high-accuracy XNA basecalling model.

A Line plot depicting base-level accuracy per round of training for non-canonical bases (NCBs), canonical bases within 5 bp of an NCB, and more distant canonical bases. The baseline round of training depicts results using the Bonito Super Accuracy DNA model. B Line plots depicting performance on the proof-of-concept (POC) library using basecaller models trained with different sets of real and artificial reads, with an increasing proportion of NCBs per read. The following figures are based on models trained with spliced data and a 9% NCB proportion. C Line plots depicting the performance of models trained jointly for both NCBs or separately for each NCB, with an increasing number of layers frozen. Subsequent analyses are based on models with 6 frozen layers. D Boxplots depicting performance of the final model on the complex and POC libraries, testing with diverse templates and held-out reads. Boxplots show first and third quartiles (box), median (line), and ±1.5× the interquartile range (whiskers). E Basecalling confusion matrix for NCBs and canonical bases. The last column represents deletion errors. F Barplots depicting performance on reads with multiple NCBs. Data are presented as mean values ± standard deviation. NCB proximity, number and order did not seem to have a strong impact on NCB accuracy, indicating that the final trained basecaller is robust in handling reads with multiple NCBs. Source data are provided as a Source Data file.