Fig. 3: Impact of pre-training on MethylBERT performance.

A UMAP plot of 3-mer token embeddings before and after pre-training. Clusters were assigned by k-means clustering. B Confusion matrices of read classification results from the model pre-trained on the human genome hg19 (top left) and on the mouse genome mm10 (bottom left). Distribution of P(cell type = Tumour|read) in both cell types, calculated by the two pre-trained models (right). P-values in the violin plot were calculated using a two-sided paired t-test. The inner boxplots show the median and the first and third quartiles, whereas the whiskers show the rest of the distribution. C Training (solid line) and validation (dotted line) curves of MethylBERT with and without pre-training (green and yellow, respectively). Both curves are plotted at 10-step intervals. D Confusion matrices of read classification results from the MethylBERT model with and without pre-training, calculated at the step at which each model achieved its best validation performance. E Histograms of P(cell type = Tumour|read) in tumour (T) and normal (N) reads (orange and blue, respectively), calculated by MethylBERT with and without pre-training (top and bottom, respectively).