Fig. 1: CodonTransformer multispecies model with combined organism-amino acid-codon embedding.
From: CodonTransformer: a multispecies codon optimizer using context-aware neural networks

a An encoder-only BigBird Transformer model trained on combined amino acid-codon tokens together with an organism encoding for host-specific codon usage representation. b Schematic representation of the organism encoding strategy used in CodonTransformer via token_type_id, analogous to contextualized vectors in natural language processing (NLP). c CodonTransformer was trained on ~1 million genes from 164 organisms across all domains of life and fine-tuned on highly expressed genes (top 10% Codon Similarity Index, CSI) of 13 organisms and two chloroplast genomes. CLS start-of-sequence token, UNK general unknown token, SEP end-of-sequence token, PAD padding token.
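
The combined tokenization and organism encoding described in panels a and b can be sketched as follows. This is an illustrative example, not the authors' implementation: the token format (`<aa>_<codon>`), the `tokenize` helper, and the example organism id are assumptions for demonstration; only the idea of pairing each codon with its amino acid and supplying the organism as a constant token_type_id comes from the caption.

```python
# Minimal sketch (hypothetical, not the CodonTransformer codebase) of
# combined amino acid-codon tokenization with organism encoding:
# each codon becomes one "<aa>_<codon>" token, flanked by [CLS] and [SEP],
# and the host organism is injected as a constant token_type_id,
# analogous to BERT's segment embeddings.

# Small subset of the standard codon table, enough for the demo
# ("_" denotes a stop codon).
CODON_TO_AA = {
    "ATG": "M", "GCT": "A", "GGC": "G", "TAA": "_",
}

def tokenize(dna: str, organism_id: int):
    """Return (tokens, token_type_ids) for one coding sequence."""
    codons = [dna[i:i + 3].upper() for i in range(0, len(dna), 3)]
    tokens = ["[CLS]"]
    for codon in codons:
        aa = CODON_TO_AA.get(codon, "X")  # unknown codons map to "X"
        tokens.append(f"{aa}_{codon}")
    tokens.append("[SEP]")
    # One organism id per position, so every token embedding is shifted
    # by the same host-specific vector.
    token_type_ids = [organism_id] * len(tokens)
    return tokens, token_type_ids

tokens, type_ids = tokenize("ATGGCTGGCTAA", organism_id=22)
print(tokens)    # ['[CLS]', 'M_ATG', 'A_GCT', 'G_GGC', '__TAA', '[SEP]']
print(type_ids)  # [22, 22, 22, 22, 22, 22]
```

Because the amino acid is part of every token, masking the codon half of a token at training time lets the model predict host-appropriate codons while the protein sequence stays fixed.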