Extended Data Fig. 1: Overall statistics on pre-training data of LucaOne.
From: Generalized biological foundation model with unified nucleic acid and protein language

a. Sequences (DNA, RNA, and proteins) were derived from RefSeq, UniProt, ColabFoldDB, and UniRef50. b. The data (nucleic acids and proteins) span four superkingdoms: Viruses, Archaea, Eukarya, and Bacteria, with Bacteria accounting for the largest share. c. Sequence length distribution of nucleic acids; most sequences are longer than 1,000. d. Sequence length distribution of proteins; the largest share of sequences have lengths between 100 and 1,000. e. Proportions of the five nucleotide symbols ('A', 'T', 'C', 'G', and 'Unknown') in nucleic acid sequences ('U' in RNA is counted as 'T'); the four identified nucleotides occur in similar proportions. f. Proportions of the 20 standard amino acid letters and five other letters (four non-standard amino acids and 'X' for unknown amino acids) in protein sequences; leucine has the highest proportion.
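
The composition statistic in panel e can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `nucleotide_proportions` is hypothetical, and treating every symbol other than A, T, C, G, and U as 'Unknown' is an assumption made for illustration only.

```python
from collections import Counter

# Hypothetical helper, not taken from the LucaOne code base.
SYMBOLS = ("A", "T", "C", "G", "Unknown")

def nucleotide_proportions(sequences):
    """Return the share of each nucleotide symbol across all sequences.

    'U' (RNA) is counted as 'T'; any other non-ACGT symbol is counted
    as 'Unknown' (an assumption about how ambiguity codes are handled).
    """
    counts = Counter()
    for seq in sequences:
        for base in seq.upper():
            if base == "U":
                base = "T"
            if base not in ("A", "T", "C", "G"):
                base = "Unknown"
            counts[base] += 1
    total = sum(counts.values())
    if total == 0:
        # No symbols seen: report zero for every category.
        return {symbol: 0.0 for symbol in SYMBOLS}
    return {symbol: counts[symbol] / total for symbol in SYMBOLS}
```

Calling `nucleotide_proportions(["AUGC", "ATNN"])` pools both sequences before dividing, so longer sequences contribute proportionally more to the result; whether the figure pools symbols this way or averages per sequence is not stated in the caption.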