Fig. 1: Construction of a pre-trained protein LLM based on the entire UniProtKB/Swiss-Prot database.

a, Schematic of ProteoGPT. ProteoGPT consists of multiple Transformer blocks, stacked sequentially and organized into three main modules: the input module, the processing module and the output module. The input module consists of embedding layers that map encoded sequence data into a continuous vector space the model can process; positional encoding is added to inject information about token order. The processing module comprises multi-head self-attention layers and feed-forward neural network layers. It applies nonlinear transformations to the vectors produced by the input module, enabling the model to attend to different segments of a sequence and build contextual representations. Each processing module includes residual connections and layer normalization to stabilize gradient propagation and training. The output module comprises linear layers that generate the probability distribution over the next token. b, Length distribution of the protein sequences used in a. c, Source distribution of the protein sequences used in a. d, Training process of ProteoGPT. The blue solid line shows the raw loss values and the black solid line shows the smoothed loss values. The schematic in a was created with BioRender.com.
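
To make the architecture described in panel a concrete, a minimal decoder-only sketch in PyTorch follows. The legend does not state ProteoGPT's hyperparameters, so the model width, head count, layer count, learned positional embeddings and the 33-token amino-acid vocabulary below are illustrative assumptions, as are all class and variable names.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One processing module: multi-head self-attention plus a feed-forward
    network, each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Self-attention sublayer with residual connection and layer norm
        a, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.ln1(x + a)
        # Feed-forward sublayer with residual connection and layer norm
        x = self.ln2(x + self.ff(x))
        return x

class DecoderOnlyLM(nn.Module):
    """Sketch of the three modules in panel a (all sizes are assumptions)."""
    def __init__(self, vocab_size=33, d_model=512, n_heads=8,
                 d_ff=2048, n_layers=12, max_len=1024):
        super().__init__()
        # Input module: token embedding plus learned positional encoding
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Processing module: sequentially stacked Transformer blocks
        self.blocks = nn.ModuleList(
            Block(d_model, n_heads, d_ff) for _ in range(n_layers)
        )
        # Output module: linear layer producing next-token logits
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier positions
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                     device=tokens.device), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(x)  # (B, T, vocab_size) logits for the next token

# Usage: a batch of 4 tokenized sequences of length 128
model = DecoderOnlyLM()
logits = model(torch.randint(0, 33, (4, 128)))
print(logits.shape)  # torch.Size([4, 128, 33])
```

Softmax over the final dimension of these logits yields the probability distribution for the next token that the output module in panel a is described as producing.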
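
The legend does not specify how the smoothed loss curve in panel d is computed; an exponential moving average is a common choice for smoothing training-loss curves and is sketched here purely as an assumption (the smooth function, the beta value and the toy loss values are illustrative).

```python
def smooth(losses, beta=0.9):
    """Exponential moving average of raw loss values (assumed method)."""
    out, avg = [], losses[0]
    for loss in losses:
        avg = beta * avg + (1 - beta) * loss
        out.append(avg)
    return out

raw = [2.9, 2.7, 2.8, 2.4, 2.5, 2.1]  # toy raw loss values
print(smooth(raw))
```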