Fig. 9: Confidence Score Illustration and Out-of-Training Detection.
From: Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization

a Confidence score illustration: circular points represent the training set, with different colors corresponding to different classes, while star points represent query points. The outlines around the stars depict decision boundaries. In this example, the query-point distances satisfy a < b ≪ c, illustrating how the confidence score can vary despite the proximity of the nearest neighbors. The confidence score combines the distance to the closest training point with the neighborhood class probabilities. The blue star, although as far from its nearest neighbor as the other query points are from theirs, receives a lower confidence score because of its outlier status and the influence of the decision boundaries. b KDE plot of the distribution of confidence scores for E. coli genes. Genes are categorized by their availability in the training set: “In Training Set (462 genes)” denotes genes present in the training data, and “Out of Training Set (3853 genes)” denotes genes absent from it. The plot illustrates how the confidence score can serve as a quality estimate that helps users judge the reliability of predictions.
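The idea in panel a, combining nearest-neighbor distance with neighborhood class probabilities, can be sketched as follows. This is only an illustrative sketch and not the formula used in the article: the function name `confidence_score`, the exponential distance decay, the `scale` and `k` parameters, and the choice to multiply the two terms are all assumptions made for this example.

```python
import math
from collections import Counter


def confidence_score(query, train_points, train_labels, k=3, scale=1.0):
    """Hypothetical confidence score (an assumption, not the article's formula).

    Multiplies a distance term, which decays with the distance to the closest
    training point, by the fraction of the k nearest neighbors that share the
    majority class.
    """
    # Euclidean distance from the query to every training point, closest first
    neighbors = sorted(
        ((math.dist(query, point), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )[:k]

    # Distance to the single closest training point
    d_min = neighbors[0][0]

    # Majority class among the k nearest neighbors and its vote share
    votes = Counter(label for _, label in neighbors)
    predicted, count = votes.most_common(1)[0]
    class_prob = count / len(neighbors)

    # Assumed combination: exponential distance decay times class agreement
    return predicted, math.exp(-d_min / scale) * class_prob


# Toy 2-D training set with two classes (made-up coordinates)
points = [(0, 0), (1, 0), (0, 1), (3, 0), (3, 1), (4, 0)]
labels = ["A", "A", "A", "B", "B", "B"]

# Both queries are at distance 1 from their closest training point, but the
# second one sits between the classes, so its neighborhood is mixed.
label_pure, score_pure = confidence_score((-1, 0), points, labels)
label_mixed, score_mixed = confidence_score((2, 0), points, labels)

# A query far from all training data gets a score close to zero.
label_far, score_far = confidence_score((20, 20), points, labels)
```

Under these assumptions, two queries with the same nearest-neighbor distance can still receive different scores when one of them has a mixed-class neighborhood, which mirrors the behavior described for the blue star. A query far from all training data scores near zero, matching the out-of-training case in panel b.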