Extended Data Fig. 2: Annotation statistics on pre-training data of LucaOne.
From: Generalized biological foundation model with unified nucleic acid and protein language

a. The proportion of nucleic acid sequences annotated with genome region types and with order-level taxonomy. Most sequences have both types of annotation. b. The proportion of sequences carrying each of the six selected annotations (order-level taxonomy, keyword, site, domain, homology, and tertiary structure). Sequences with tertiary structure make up only a tiny fraction. c. and d. The proportion of sequence counts in the top 10 phylum-level taxonomy of nucleic acids and proteins, respectively. e. The distribution of eight selected genome region types in nucleic acids, of which CDS is the most frequent. f. and g. The proportion of sequence counts in the top 10 order-level taxonomy (total 2,196 categories) of nucleic acids and proteins, respectively. h–k. The proportion of protein sequence counts in the top 10 keywords (total 1,179 categories), the top 10 site types (total 946 categories), the top 10 domain types (total 13,717 categories), and the top 10 homology types (total 3,442 categories), respectively. l. The distribution of the Cα-atom coordinates (x, y, z) after local normalization within each protein chain. It closely resembles a normal distribution. The distributions in c–f are long-tailed. The distributions in g–k decrease stepwise.
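Panel l refers to local normalization of Cα coordinates within a protein chain, but the caption does not state the formula. The following is a minimal sketch under one plausible assumption, per-axis z-scoring (subtract the chain's mean and divide by its standard deviation). The function name `normalize_chain` and the example coordinates are hypothetical and are not taken from the paper.

```python
import numpy as np

def normalize_chain(ca_coords):
    """Z-score the Cα coordinates of one chain, axis by axis.

    Assumption: "local normalization" means per-chain, per-axis
    standardization. The caption does not confirm this.
    """
    coords = np.asarray(ca_coords, dtype=float)  # shape (n_residues, 3)
    mean = coords.mean(axis=0)                   # per-axis mean over the chain
    std = coords.std(axis=0)                     # per-axis standard deviation
    std[std == 0] = 1.0                          # avoid division by zero on a flat axis
    return (coords - mean) / std

# Hypothetical three-residue chain, coordinates in arbitrary units.
chain = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
normalized = normalize_chain(chain)
```

Under this assumption each chain's coordinates have zero mean and unit variance per axis, so values pooled across chains share a common scale before their distribution is plotted.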