Fig. 1: Seeding scheme of LexicMap for reference database. | Nature Biotechnology


From: Efficient sequence alignment against millions of prokaryotic genomes with LexicMap


a, A fixed set of 20,000 31-mers (called probes) is generated such that their prefixes include every possible 7-mer. Seeds, each sharing a prefix with one of these probes, are distributed across all database genomes and are chosen so as to provide a window guarantee. b, LexicHash creates one hash function per probe; applied to a genome, each function finds the k-mer with the longest prefix match to its probe, and this k-mer is stored as a seed. c, Each genome is scanned for seed deserts (regions longer than 100 bp with no seed). Every k-mer within such a region has a 7-mer prefix match with at least one probe (because the probes cover all possible 7-mers), so additional seeds can be chosen with a spacing of x bp (50 by default). d, Seeds are stored in a hierarchical index. Although not shown here for simplicity, the number of seeds is doubled to support both prefix and suffix matching (details in Methods).
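The steps in panels a to c can be illustrated with a toy sketch. This is not the LexicMap implementation: it uses brute-force prefix comparison instead of hash functions, builds exactly one probe per prefix (the caption describes 20,000 probes covering all 7-mers), and places desert seeds at exact multiples of the spacing. All function names and the small parameters used to exercise it are assumptions made for illustration.

```python
import itertools
import random

ALPHABET = "ACGT"


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def make_probes(k, p, rng):
    """Panel a (simplified): one random k-mer per possible p-mer prefix.

    Returns a dict mapping each p-mer prefix to its probe, so every
    possible p-mer is covered by exactly one probe.
    """
    probes = {}
    for prefix in itertools.product(ALPHABET, repeat=p):
        head = "".join(prefix)
        tail = "".join(rng.choice(ALPHABET) for _ in range(k - p))
        probes[head] = head + tail
    return probes


def capture_seeds(genome, probes, k):
    """Panel b (simplified): for each probe, keep the genome k-mer
    with the longest prefix match. Brute force, for illustration only.
    """
    kmers = [(i, genome[i:i + k]) for i in range(len(genome) - k + 1)]
    seeds = {}
    for probe in probes:
        pos, kmer = max(kmers, key=lambda t: common_prefix_len(t[1], probe))
        seeds[probe] = (pos, kmer, common_prefix_len(kmer, probe))
    return seeds


def find_deserts(positions, genome_len, max_gap):
    """Panel c: gaps longer than max_gap between consecutive seed
    positions (with the genome ends as boundaries).
    """
    deserts = []
    prev = 0
    for pos in sorted(set(positions)) + [genome_len]:
        if pos - prev > max_gap:
            deserts.append((prev, pos))
        prev = pos
    return deserts


def fill_desert(genome, start, end, probes_by_prefix, k, p, spacing):
    """Panel c (simplified): add a seed every `spacing` bp inside a desert.

    Any k-mer qualifies because its p-mer prefix matches some probe.
    """
    extra = []
    pos = start + spacing
    while pos < end and pos + k <= len(genome):
        kmer = genome[pos:pos + k]
        extra.append((pos, kmer, probes_by_prefix[kmer[:p]]))
        pos += spacing
    return extra
```

With the values in the caption, the calls would use `k=31`, `p=7`, `max_gap=100` and `spacing=50`; smaller values such as `k=8` and `p=2` keep a toy run fast. The key property the sketch shows is that desert filling never fails: because the probe prefixes cover every possible p-mer, the lookup `probes_by_prefix[kmer[:p]]` always succeeds.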
