Fig. 2: LookingGlass identifies homologous sequence pairs at the phylum level.

a Distributions of embedding similarities for homologous (blue) and nonhomologous (red) sequence pairs are significantly different (unpaired two-sided t-test, P < 10⁻¹⁶, n = 163,184 sequence pairs). Box shows median and interquartile range, whiskers extend to minima and maxima of range, and diamonds indicate outliers, defined as points beyond 1.5× the interquartile range. b Accuracy, precision, recall, and F1 metrics (Eqs. (1)–(4)) for homologous/nonhomologous predictions across embedding similarity thresholds. The default threshold, chosen to maximize accuracy (0.62), is shown as a vertical dashed line. c Distributions of embedding and sequence similarities for homologous (blue) and nonhomologous (red) sequence pairs. In total, 44% of homologous sequence pairs have sequence similarity alignment scores below the threshold of 50 (horizontal line). The embedding similarity threshold (0.62, vertical line) separates homologous and nonhomologous sequence pairs with maximum accuracy. The bold black box in the lower right indicates homologous sequence pairs correctly identified by LookingGlass that are missed using alignments. Source data are provided as a Source Data file.
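
The threshold sweep summarized in panel b can be illustrated with a minimal Python sketch. It assumes that Eqs. (1)–(4), which are not reproduced in this caption, are the standard confusion-matrix definitions of accuracy, precision, recall, and F1, and that a pair is predicted homologous when its embedding similarity is at or above the threshold. The function names, the `>=` decision rule, and the toy data are hypothetical and are not taken from the LookingGlass code or Source Data.

```python
def classification_metrics(similarities, is_homologous, threshold):
    """Return (accuracy, precision, recall, f1) at one similarity threshold.

    Assumption: a pair is called homologous when similarity >= threshold.
    """
    tp = fp = tn = fn = 0
    for sim, label in zip(similarities, is_homologous):
        predicted = sim >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and not label:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


def best_accuracy_threshold(similarities, is_homologous, thresholds):
    """Return the first threshold in `thresholds` that maximizes accuracy."""
    return max(
        thresholds,
        key=lambda t: classification_metrics(similarities, is_homologous, t)[0],
    )


# Made-up toy data for illustration only (not from the paper).
toy_similarities = [0.90, 0.80, 0.70, 0.55, 0.65, 0.40, 0.30, 0.20]
toy_is_homologous = [True, True, True, True, False, False, False, False]
toy_thresholds = [i / 100 for i in range(101)]

toy_metrics = classification_metrics(toy_similarities, toy_is_homologous, 0.5)
toy_best = best_accuracy_threshold(
    toy_similarities, toy_is_homologous, toy_thresholds
)
```

When several thresholds tie for maximum accuracy, this sketch returns the lowest one; how ties were resolved for the reported default of 0.62 is not stated in the caption.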