Introduction

Protein structure at atomic resolution is usually determined by X-ray crystallography1 or nuclear magnetic resonance (NMR)2. However, this is an expensive process where \(> 70\%\) of the total cost is spent on attempts that do not produce crystals of diffraction quality3. Crystallization of proteins is a prerequisite for structural determination. Yet, it has been a daunting challenge, with an overall success rate ranging between 2 and 10\(\%\)4. The determination of important biological features that help increase the propensity for protein crystallization remains a great challenge. Several machine learning methods and statistical techniques have been developed to predict sequence-based protein crystallization5,6,7,8,9,10,11. These approaches utilize feature-based protein representations including physicochemical and k-mer frequency features from amino acid sequences and corresponding structures. Most of these techniques undergo a feature selection procedure(s), followed by traditional machine learning techniques such as support vector machines12,13, random forests14 and gradient-boosting machines15.

The availability of large-scale protein datasets through public databases such as PepcDB16, enables the use of deep learning techniques for the problem of protein crystallization prediction. DeepCrystal, a deep neural network (DNN) based model was proposed by Elbasir et al.17 to predict protein crystallization propensity using only the protein AA sequence as input without the need to extract additional physio-chemical and k-mer features by implementing convolutional neural networks (CNNs)18 as backbone. DeepCrystal captures frequently occurring amino acid (AA) k-mers of different lengths driving the crystallization prediction and outperforms state-of-the-art (sota) feature-based methods. Furthermore, techniques such as ATTCry19 design a CNN framework based on multi-scale and multi-head self-attention for crystallization prediction. CLPred20 uses a bidirectional recurrent neural network with long- and short-term memory (BLSTM) to capture long-range interaction patterns between the k-mers of AA sequence to predict protein crystallizability using the AA protein sequence as input.

DCFCrystal21 was a multistage crystallization predictor that could estimate the success propensities of three individual steps in the protein crystallization process by utilizing a deep-cascade forest model with multiple types of sequence-based features. The effectiveness of DCFCrystal was driven by the pseudo-predicted hybrid solvent accessibility (PsePHSA) feature. However, the DCFCrystal method is only available as a web-server and can predict at most 100 protein sequences in one job request, thereby, making it infeasible for high-throughput screening. Similarly, SADeepcry22 was another multistage crystallization predictor like DCFCrystal which uses optimized self-attention and auto-encoder modules to extract sequence, structure and physico-chemical features from the proteins to predict the success rate of final protein crystallization. However, we observed that from their source code (https://github.com/zhc940702/SADeepcry) there is no code available to test on new proteins or generate features required to test the model on unseen test proteins. This renders the method infeasible for benchmarking and a fair comparison with other state-of-the-art crystallization predictors. Finally, GCmapCrys23, was proposed for the prediction of multistage crystallization propensity by integrating graph attention networks with the predicted protein contact map. Moreover, it uses BLAST24 to generate a position-specific scoring matrix, SCRATCH-1D to use predicted solvent accessibility and secondary structure, and HHblits25 for multiple sequence alignment (MSA). A similar technique, namely BCrystal26, utilizes homology, secondary structure, solvent accessibility, torsion angle features in combination with an XGBoost model. However, these techniques, especially those that use MSA, are extremely slow (\(\approx\) 30 minutes for one protein sequence) and cannot be used for high-throughput protein screening. Since, the goal of our work was to compare the crystallization propensity of a protein using just their AA sequence and the ability of the model to perform high-throughput screening, hence we focus on methods such as DeepCrystal, ATTCrys and CLPred during our experimental comparisons.

In recent years, application of natural language processing (NLP) methods to protein sequences has led to remarkable breakthroughs for sota protein structure and property prediction. The driving force for these breakthroughs is the transformer, a deep learning architecture27, which uses the concept of self-attention to efficiently capture long-range dependencies and intricate patterns in protein sequences that were previously difficult to discern using traditional deep learning methods27.

Analogous to using words and sentences to train typical large language models (LLMs), transformer-based models such as ESM2 use individual AAs, peptides, and protein sequences28 to learn the “language” of life. These protein language models (PLMs) follow a self-supervised learning framework, where the model attempts to predict the identity of randomly masked AAs (usually \(15\%\) of the AAs per protein sequence) using the unmasked portions of the protein sequence. For example, ESM2 was pre-trained on the masked language training task with \(\approx\)65 million unique protein sequences from UniRef28. After this extensive training, scientists are able to use these pre-trained models to extract high-dimensional representations for their proteins of interest. These vectors can be used for downstream tasks such as protein property prediction, protein clustering, and functional comparisons26,29,30,31,32,33,34.

In the present work, we perform efficacy assessments of several open source PLMs for the task of predicting protein crystallization using the TRILL platform35. TRILL is a comprehensive resource designed to democratize access to sota open PLMs, eliminating the requirement for advanced computational skills. Using robust deep learning frameworks such as Pytorch Lightning36 and HuggingFace Accelerate37, TRILL provides access to several PLMs such as ESM228, Ankh38 and ProstT539, specifically for tasks such as protein design and property analysis. Moreover, TRILL facilitates the usage of these PLMs with different model configurations and parameter space. These PLMs in TRILL are complemented by a suite of utilities that enhance user experience and functionality.

For protein sequence classification, the platform provides functionalities to embed protein sequences into vector representations per protein or per residue-basis, visualize the embedded protein sequence representation, train custom classifiers, and predict class labels for unseen protein sequences. These diverse tools and functionalities are encapsulated within a command-line interface, organized through ten commands as detailed in the original TRILL paper35. In the present work, we utilize the TRILL platform to determine the vector representation of proteins for each PLM using just the AA sequence as input. We used two additional PLMs including the xTrimoPGLM40 and SaProt41 to determine protein vector representations using only the AA sequence as input and to have a comprehensive comparison.

These vector representations are then passed as training data to classifiers which are optimized through hyper-parameter tuning. This results in optimal crystallization propensity predictor for individual PLM. We then performed a comprehensive comparison of these PLM-based predictors on several independent test sets. Finally, we generate 3000 proteins through a fine-tuned ProtGPT2 model (on the crystallizable class) and through a series of computational filtration steps identify a reduced set of 5 novel proteins as potentially crystallizable.

Figure 1
figure 1

Flowchart of the proposed PLM benchmarking framework for protein crystallization propensity prediction.

The key contributions of the manuscript are:

  • Benchmarking different ESM2 models for the task of protein crystallization prediction using raw protein sequences on external balanced, SwissProt and TrEMBL test sets;

  • Benchmark PLMs such as Ankh, Ankh-Large, ProstT5 and ProtT5-XL for the task of protein crystallization prediction on external balanced, SwissProt and TrEMBL test sets;

  • Benchmark PLMs such as xTrimoPGLM and SaProt models for the task of protein crystallization prediction on external balanced, SwissProt and TrEMBL test sets;

  • Benchmark per-residue feature representation of three top-performing PLMs as input to CNN and LSTM models for the task of protein crystallization prediction on external balanced, SwissProt and TrEMBL test sets;

  • Comprehensive comparison of open-source PLMs to predict diffraction-quality crystals with superior performance on aforementioned test sets;

  • Provide all the code used for benchmarking open-PLMs for crystallization prediction task via github (https://github.com/raghvendra5688/crystallization_benchmark) for reproducibility and enabling community to utilize TRILL for their protein property prediction task.

  • Fine-tune a protein generator namely ProtGPT242 to generate de novo protein sequences from the crystallizable class;

  • Evaluate, screen and validate the generated proteins to identify a unique set of stable and well-folded proteins.

Figure 1 provides a flow diagram of the proposed framework for predicting protein crystallization propensity.

Materials and methods

Overview

The problem of predicting the crystallization propensity of a protein is a binary classification task. A protein sequence is given by a sequence of AAs \(x = (x_{1},x_{2},\dots ,x_{L})\), where \(x_{i}\), is the \(i^{th}\) amino acid in the sequence and is part of a vocabulary comprising 20 amino acids, while L is the length of the protein sequence. A given PLM uses its encoder referred as “tokenizer” (\(t(\cdot )\)) that encodes the AA sequence x to an encoded representation (\(t(x) \in \mathbb {R}^L\)) that is then ingestible for deep learning technique. This is a widely used encoding scheme in natural language processing (NLP) to have a vector representation of words in a sentence43,44.

The encoded representation t(x) is then given as input to the PLM and the final transformer layer of the PLM generates an embedding representation of the protein, preserving meaningful inter-residue relationships and contextual information within the original protein sequence. In mathematical terms e(t(x)) is the embedding of the protein x, with \(e: \mathbb {R}^{L} \rightarrow \mathbb {R}^{d}\), where d represents the embedding dimension of the transformer layer of the PLM (note: for comparison reasons, we use different PLMs, thus d changes). Our aim is to learn a function \(c(\cdot )\) that takes as input the embedded protein sequence e(t(x)) and outputs a probability, i.e., \(c: \mathbb {R}^L \rightarrow [0,1]\), where \(c(\cdot )\) is the function computed by the nonlinear classifier. In this work, \(c(\cdot )\) is an XGBoost45 or a LightGBM46 or a multilayer pereceptron (MLP)47 classifier.

While fine-tuning individual PLM (either all layers or few layers) with a classification head is an option, some of the PLMs tested in this work are extremely large i.e. ESM2 with 36 transformer layers and \(\approx\) 3 billion parameters and xTrimoPGLM-10B with \(\approx\) 10 billion parameters. Thus, it is impossible to fine-tune such a PLM even with a batch size of 2, given the configuration of the available GPU - NVIDIA RTX A6000 with 48 Gb RAM. Hence, to have a fair evaluation given our GPU capacity, and to understand the learning representation capacity of these PLMs, we considered all these PLMs in a zero-shot learning framework to generate the embedded vector representations for proteins using their AA sequence.

Data partitioning

We perform our experiment on the processed PepcDB dataset (http://pepcdb.rcsb.org) following the protocols set by Wang et al.11. The data set comprises proteins which have been classified into five groups, namely (i) diffraction-quality crystals, (ii) protein cloning failure, (iii) protein material production failure, (iv) purification failure, and (v) crystallization failure. We consider the proteins labeled as diffraction-quality crystals to be the crystallizable class, while other proteins are assigned to the non-crystallizable class. The final dataset comprises 28,731 sequences of which 5383 proteins belong to the crystallizable class, and the remaining 23,348 are non-crystallizable. As in11,17, all sequences in each class are passed through a sequence identity filter \(> 25\%\) with other proteins in that class to remove redundant and similar protein sequences within each class.

To divide our dataset into training and test sets, we follow a simple protocol. The maximum length of a protein sequence considered for our model is \(L_{\text {max}} = 800\). This is done to be compliant with methods like DeepCrystal17 and CLPred20, which use the same L as the maximum length of the protein sequence. Proteins with \(L < L_{\text {max}}\) are padded with the symbolic representation of gaps. By performing this protein filtering step, the total number of proteins in the dataset is reduced to 25,120.

We follow the procedure used in DeepCrystal17, ATTCrys19 and CLPred20 to divide this dataset into two parts: \(\mathbb {D}_{1}\) and \(\mathbb {D}_{2}\) such that \(\mathbb {D}_{2}\) consists of \(\mathbb {D}_{2}^{1}=891\) crystallizable and \(\mathbb {D}_{2}^{0}=896\) non-crystallizable proteins. Here 1 corresponds to crystallizable and 0 corresponds to non-crystallizable class. Thus, \(\mathbb {D}_{2}\) represents the fairly balanced test set for performance evaluation as used in DeepCrystal, ATTCrys and CLPred methods. \(\mathbb {D}_{1}\) has a total of 23,333 protein sequences, where \(\mathbb {D}_{1}^{1}=3,846\) proteins belong to crystallizable class while remaining \(\mathbb {D}_{1}^{0}=19,487\) proteins fall are non-crystallizable.

We also use two independent test sets generated in1 as external validation sets. The two external datasets, referred as SP_final and TR_final were obtained from SwissProt and TrEMBL databases respectively, following the protocol detailed in Elbasir et al.17. In the SP_final dataset, we have 148 proteins belonging to the positive class while remaining 89 sequences are non-crystallizable, whereas in the TR_final dataset there are 374 crystallizable proteins and 638 proteins belonging to the negative class. We compare our methods with sota web-servers such as fDETECT8, DeepCrystal17, ATTCrys19 and CLPred20 on these datasets. For all performance comparisons, we provide our test protein sequences to these web-servers to obtain corresponding prediction scores.

Benchmarking models

The TRILL platform35 provides access to several PLMs, such as ESM228, Ankh38, ProstT539 and ProtT5-XL48, which can generate protein embedding representations via a zero-shot learning framework. Moreover, there are several pretrained PLMs, such as ESM228, ProtGPT242 and ZymCTRL49, which can either directly generate proteins in a zero-shot fashion or first by fine-tuning these models and then proceed with protein generation. Furthermore, we used two PLMs including the xTrimoPGLM40 and SaProt41 which are not available via TRILL platform to have a more comprehensive comparison. Here we provide a summary of several PLMs used in the present work. For further details of these PLMs, the reader’s indulgence is sought.

Evolutionary Scale Modeling (ESM2)

ESM2 is a sota transformer-based protein language model trained on \(\approx\)65 million unique protein sequences28. ESM2 has been shown to outperform all tested single-sequence PLMs on a range of structure prediction tasks, enabling atomic resolution structure prediction. Although the ESM2 model has been benchmarked for structure prediction, it has not been compared for protein property prediction and has been shown to not scale for protein function prediction50. Moreover, the ESM2 models are available with different architectural configurations, that is, with an increase in number of transformer layers leading to an increase in number of model parameters. The ESM2 models are available with 6, 12, 30, 33 and 36 transformer layers having \(\approx\) 8, 12, 150, 650 and 3,000 million parameters, respectively.

Ankh

The Ankh is an optimized general-purpose PLM, as a first version for future specialized high-impact protein modeling tasks. Ankh is pre-trained on the UniRef50 dataset51, that provides more variability and representation compared to UniRef10051 and BFD52. The model is tested on a comprehensive set of downstream tasks spanning protein function prediction, structure prediction, and localization prediction. Ankh demonstrated superior performance on tasks such as fluorescence prediction, solubility prediction, contact prediction, fold prediction, and secondary structure prediction. Additionally, Ankh used the latest Google TPU v4 hardware and JAX/Flax software for efficient training. Thus, Ankh is presented as a powerful general-purpose PLM that can serve as a foundation for specialized protein modeling tasks, with outstanding performances demonstrated on a wide range of benchmarks. Ankh-Large has \(\approx\)2 billion parameters and is trained using the encoder-decoder architecture, while Ankh base has \(< 10\%\) parameters when compared to the sota models.

ProstT5

ProstT5 is a bilingual language model for protein sequences and structures that utilizes the AlphaFold Protein Structure Database (AFDB)53. ProstT5 was pre-trained using 34.6 million proteins. It can translate between 1-D amino acid sequences and 1-D structure sequences (3Di tokens). ProstT5 demonstrated improved performance in various protein function prediction tasks compared to sota sequence-based models such as ProtT5, ESM2 and Ankh. It can perform inverse folding, generate novel AA sequences that adopt a desired structural template, and assess the quality of its own predictions. ProstT5 exemplifies how language modeling techniques and transformers can be used to leverage the wealth of information from protein structure databases such as AFDB. Finally, ProstT5 is a proof-of-concept bilingual PLM that showcases the potential of integrating sequence and structure information for various protein modeling tasks.

ProtT5-XL

ProtT5-XL uses an encoder-decoder framework for training27. ProtT5-XL has 3 billion parameters and is trained using an 8-way model parallelism. ProtT5-XL is trained on BFD for 1.2 million steps, followed by fine-tuning of UniRef50 for 991k steps. Contrary to the original T5 model53 that masks the spans of multiple tokens, ProtT5-XL adopts BERT’s denoising objective to corrupt and reconstruct single tokens using a masking probability of \(15\%\). ProtT5-XL uses the AdaFactor optimizer with inverse square root learning rate schedule for pretraining. Using ProtT5-XL embeddings as input to supervised models to predict secondary structure and subcellular localization, it outperformed previous methods on these tasks.

xTrimoPGLM

xTrimoPGLM is a unified protein language model designed to enhance both understanding and generation tasks in protein science. Traditional models often focus on either autoencoding or autoregressive pre-training, limiting their effectiveness in handling diverse protein-related tasks. In40, the authors propose a novel framework that integrates both approaches, enabling the model to leverage a massive dataset of 940 million unique protein sequences, resulting in a model with 100 billion parameters. In this work, we use smaller version of the xTrimoPGLM model including xTrimoPGLM with 1, 3 and 10 billion parameters as these models can be loaded with our current GPU configuration. xTrimoPGLM outperforms existing models across 18 benchmarks related to protein understanding and structure prediction. The model facilitates advanced structural predictions, surpassing tools like AlphaFold2 in speed and accuracy. It can generate new protein sequences that closely resemble natural proteins and can be fine-tuned for specific properties. This highlights the model’s versatility and potential applications in drug design, while also addressing limitations that need to be overcome for practical use in real-world scenarios.

SaProt

SaProt is a novel protein language model (PLM) that incorporates a structure-aware vocabulary to enhance the understanding of protein sequences and structures. Traditional PLMs primarily focus on residue sequences, neglecting the crucial structural information that can significantly influence protein function. SaProt integrates both residue tokens and 3D structure tokens derived from protein models using Foldseek, enabling a more comprehensive representation of proteins. The introduction of a new vocabulary that combines residue and geometric features, allowing for effective representation of both primary and tertiary protein structures. SaProt was trained on approximately 40 million sequences and structures, achieving superior performance across ten significant biological tasks compared to established models like ESM-1b. The model demonstrates versatility in various applications, including clinical variant prediction and protein-protein interaction analysis.

ProtGPT2

ProtGPT2 is a PLM that can generate novel protein sequences which are structurally and functionally similar to natural proteins42. ProtGPT2 effectively generates sequences that are distantly related to natural ones but are not a consequence of memorization and repetition. Majority of ProtGPT2 sequences (\(93\%\)) have significant sequence similarity to natural proteins42. AlphaFold predictions show \(37\%\) of ProtGPT2 sequences have high confidence (pLDDT > 70) for being ordered structures, comparable to \(66\%\) for natural sequences. Molecular dynamics simulations indicate ProtGPT2 sequences have similar dynamic properties as natural proteins42 .

Integrating ProtGPT2 sequences into a structural network representation of the protein universe reveals they bridge separate “islands” of known protein structures. ProtGPT2 generates sequences across different structural classes like all-\(\alpha\), all-\(\beta\), \(\alpha /\beta\), etc. The model can be conditioned to design proteins for specific families, functions or structural classes. Thus, the unsupervised ProtGPT2 model effectively learns the “protein language” and generates novel sequences that populate unexplored regions of protein structure space while maintaining key structural and functional properties. This highlights the potential of PLMs for de novo protein design.

Model building & test

We follow a simple protocol to use the TRILL platform for our task of benchmarking PLMs for protein crystallization propensity prediction. Starting with the training sequences \(x\in \mathbb {D}_{1}\), we obtain embedding representations \(e\left( t(x)\right)\) for each of the following 9 protein language models: ESM2 T6-8M, ESM2 T12-35M, ESM2 T30-150M, ESM2 T33-650M, ESM2 T36-3B, Ankh, Ankh Large, ProstT5, ProtT5-XL PLMs using the embed function in TRILL with a global averaging of vector representation per residue in the original protein sequence.

We obtain embedding representations for xTrimoPGLM-1B, xTrimoPGLM-3B, xTrimoPGLM-10B, SaProt-35M and SaProt-650M using the guidelines provided in their respective github and huggingface interface. The mean embedding representations (\(\mu (e_k(t(x))), k=1\dots 14\) and \(\mu\) represents average across the length of protein) for all PLMs are generated in a zero-shot learning setting. These mean embedding representations of the training set \(\mathbb {D}_{1}\) are then passed to the XGBoost classifier using the classify utility, where a 10-fold cross-validation technique is used for hyper-parameter optimization. More details are available via xgboost classifier script.

The XGBoost classifiers optimizes a weighted average F1-metric during the classification step to address the problem of class-imbalance. We also pass the mean embedding representations \(\mu (e_k(t(x)))\) from each PLM to custom LightGBM models46 in 10-fold cross-validation setting to generate LightGBM classifiers. We performed a randomized search over a grid of parameters including number of estimators, maximum depth of a tree, number of leaves, minimum child samples, learning rate, subsampling rate, L1 and L2 regularizers during hyper-parameter optimization. The details of the parameter space for LightGBM classifiers are available at hyperparameter tuning script.

Thus, in total we have 14 XGBoost classifiers and 14 LightGBM classifiers, where each classifier is built on top of mean embedding representation (\(\mu (e_{k}(t(x))\))) obtained from a PLM. After obtaining the XGBoost / LightGBM classifier for each of the 14 PLMs, we pass the test sets to each PLM to obtain mean embedding representations for the respective set of proteins. Finally, the class label and probability \(c\left( \mu (e_{k}(t(x)))\right)\) for each protein sequence x in a given test set and the \(k^{th}\) PLM is obtained by passing its mean embedding representation \(\mu (e_{k}(t(x)))\) to the classifier \(c(\cdot )\). We utilize the classify function with ‘–preComputed_Embs’ and ‘–preTrained’ utilties in TRILL to obtain the class probability as shown in Fig. 2. A consensus of the predictions from these classifiers is obtained by taking average of the probabilities estimated by these classifiers.

Additionally, we built a MLP47 classifier on top of mean embedding representation obtained from each PLM using ‘scikit-learn’ package (v1.5.1) in Python v3.10.0 and the performance of these MLP classifiers on different test sets is depicted in Supp. Table S1. A detailed workflow of building the classifiers and obtaining predictions on test sets is highlighted in Fig. 2.

Figure 2
figure 2

Workflow of building the crystallization propensity prediction classifiers for each PLM and obtaining test set predictions using the TRILL platform. Here the ‘red’ colored dots represent crystallizable proteins and ‘black’ colored dots correspond to non-crystallizable proteins.

We finally pass the embedding representation (e(t(x))) for the top three best performing PLMs and utilize the per-residue embedding representation i.e. e(t(x)) in combination with multi-layered CNN and LSTM models for crystallization propensity prediction task. The maximum length of a protein is fixed to the \(L=800\) as done in DeepCrystal17 and DeepSol34. Proteins with length \(L < 800\) are padded with matrix of zeros to have consistent embedding dimensions for all proteins in the training and test sets. The output of the CNN / LSTM layers is concatenated with the mean embedding representation of the protein and passed to multiple feed forward layers which is finally connected to the output neuron. The output neuron has a sigmoid activation function to predict the probability of crystallization propensity. We built 10 models for each PLM varying the number of layers (convolution or LSTM layers), learning rate, dimension of hidden neurons, and number of feed-forward layers in a setting where 80% of the dataset was used for training and 20% for validation through stratified sampling.

Protein generation

We fine-tune the ProtGPT2 PLM on the crystallizable class (\(\mathbb {D}_{1}^{1}\)) using the fine-tune function available in TRILL for 10 epochs. In35 it was shown that 10 epochs are sufficient to generate synthetic cell penetrating peptides and anti-crispr proteins using ProtGPT2. Thus, the fine-tuned ProtGPT2 model learns the underlying distribution of crystallizable proteins. We then generate a total of 3,000 proteins using the fine-tuned ProtGPT2 model via the lang_gen utility. Once we have generated the synthetic proteins, we obtain the embedding representation for the same using the PLMs and visualize these embeddings in a low-dimensional space (2 dimensions) using the visualize function. This function utilizes the Unified Manifold and Approximation (UMAP) algorithm54 to project the embeddings into a two-dimensional space. Then, the embedding representation for a generated protein is obtained and classified by the classifiers. This protein generation and classification process is illustrated in Figure 3.

We then follow a series of filtration steps to determine the most promising candidates:

Step 1: A consensus of all PLM-based classifiers consistently identified 706 out of the 3, 000 generated proteins as crystallizable proteins.

Step 2: To remove generated sequences with high sequence identity with training set, we perform CD-HIT-2D55 with a identity cut-off of \(\le 40\%\), resulting in 700 protein sequences.

Step 3: CD-HIT is then performed to cluster proteins with \(> 25\%\) sequence identity into groups, leading to a total of 347 proteins with low sequence identity within the group and with the training set.

Step 4: Filtered protein sequences are screened by sequence to secondary structure compatibility scores56,57. The secondary structural characterization of the designed protein sequences is performed by utilizing PSIPRED (standalone ver. 4.02)58. This reduces the generated protein set from 347 to 32 candidate sequences.

Step 5: The screened proteins are further evaluated on the basis of presence of aggregation prone regions59 and 4 sequences are filtered out.

Step 6: The screened proteins are subjected for the availability of any homolog(s) in known protein sequence database, UniRef10051, resulting in a reduced set of 5 proteins.

Step 7: The crystallization propensity probability for each of these 5 proteins across different PLMs and their consensus probability is compared with DeepCrystal and CLPred as depicted in Supp. Table S2. It highlights that the PLMs consistently predict these proteins to be crystallizable whereas both DeepCrystal and CLPred miss one out of the 5 candidate crystallizable proteins.

Step 8: The 5 filtered proteins are modeled using a consensus approach by implementing RoseTTAFold260, and AlphaFold261, resulting in 6 model structures (5 from AlphaFold2 and 1 from RoseTTAFold end-2-end prediction) for each protein.

Step 9: Each model structure is refined by implementing GalaxyRefine62 to generate 5 refined model structures, resulting in 30 candidate model structure for each protein.

Step 10: The modeled structure for each protein are thoroughly analyzed to identify the best model structure (1 out of 30) among the candidate structures using ModFold (ver. 9.0)25 and ProFitFun56,57.

Step 11: Finally, the stereo-chemical quality (all atoms contact and geometry) of the best model structure for each protein is assessed by passing it through ProCheck63, Errat64, and MolProbity65.

By following the aforementioned steps, we filter an initial set of 3, 000 proteins generated from crystallizable class to the set of 5 most likely and high confidence crystallizable proteins.

Figure 3
figure 3

The protocol followed to generate crystallizable proteins using fine-tuned ProtGPT2 PLM and further downstream filtering and evaluation.

Evaluation metrics

The performance of benchmark classifiers is compared with various other sota techniques using quality metrics such as accuracy, Matthew’s correlation coefficient (MCC) as in17,33. We assessed other evaluation metrics, based on TP, TN, false positives (FP) and false negative (FN). We highlight that TP represents the set of proteins which are crystallizable (the true label is 1) and are correctly identified by a given method as crystallizable, i.e., \(c\left( \mu (e(t(x)))\right) \ge 0.5\). Similarly, TN represents the set of proteins which are non-crystallizable (true label is 0) and are correctly identified by a given method as non-crystallizable \(c\left( \mu (e(t(x))))\right) < 0.5\). The metrics for evaluation include:

$$\begin{aligned} \text {Accuracy (ACC)}&= \frac{\text {TP}+\text {TN}}{\text {TP}+\text {FP}+\text {TN}+\text {FN}} \\\text {MCC}&= \frac{\text {TP} \cdot \text {TN} - \text {FP} \cdot \text {FN}}{\sqrt{(\text {TP + FP})\cdot (\text {TP + FN})\cdot (\text {TN + FP})\cdot (\text {TN + FN})}} \\\text {Recall (Rec)}&= \frac{\text {TP}}{\text {TP + FN}} \\\text {Precision (Prec)}&= \frac{\text {TP}}{\text {TP + FP}} \\\text {F1-score (F1)}&= \frac{2 \cdot \text {Prec} \cdot \text {Rec}}{\text {Prec} + \text {Rec}} \end{aligned}$$
(1)

Experimental results

Figure 4
figure 4

Comparison of area under receiver operating curve (AUC) of benchmark PLMs for the crystallization prediction task. (a) AUC for fairly balanced test set using XGBoost, (b) AUC for SP_final dataset using XGBoost, (c) AUC for TR_final dataset using XGBoost, (d) AUC for fairly balanced test set using LightGBM, (e) AUC for SP_final dataset using LightGBM, and (f) AUC for TR_final dataset using LightGBM.

We benchmark the predictive performance of the PLMs on the \(\mathbb {D}_{2}\) test set extracted from the publicly available dataset11 as described earlier (see “Data partitioning”). Moreover, we evaluate the quality of predictions from these models on two independent datasets obtained from SwissProt and TrEMBL, the SP_final and TR_final datasets, respectively. A comprehensive comparison of the PLMs of varying size and configurations including ESM2 T6-8M, ESM2 T12-35M, ESM2 T30-150M, ESM2 T33-650M, ESM2 T36-3B, Ankh, Ankh Large, ProstT5, ProtT5-XL, SaProt-35M, SaProt-650M, xTrimoPGLM-1B, xTrimoPGLM-3B and xTrimoPGLM-10B was done against methods like fDETECT, DeepCrystal, ATTCrys and CLPred across these test sets. The evaluation metric values for fDETECT and CLPred were obtained from17 and20 respectively. Finally, the cross-validation performance of the XGBoost and LightGBM classifiers built on embedding representations learnt via each PLM on various evaluation metrics is highlighted in Supp. Figs. 1 and 2. From Supp. Figs. 1 and 2 and Tables 1, 2 and 3, we observe that the XGBoost models are over-fitting on the training set and have poor generalization performance. On the other hand, the LightGBM classifiers have better generalization performance as their cross-validation performance aligns with the performance attained on multiple independent test sets (see Supp. Fig. 2 and Tables 1, 2 and 3). Additionally, from Tables 1, 2 and 3 and Supp. Table S1, we observe that MLP classifiers tend to perform poorly across various evaluation metrics when compared to their corresponding XGBoost or LightGBM classifiers.

We highlight the training performance of the CNN and LSTM based classifiers in Supp. Table S5. Moreover, the training and validation performance curves of all the CNN and LSTM model built with each of the three top-performing PLM based embedding representations is highlighted in Supp. Figure S3 and Supp. Figure S4 respectively.

Table 1 Benchmarking of PLMs in TRILL on the balanced test set against sota methods.

Balanced test set results

On the balanced test set consisting of 1787 proteins (891 crystallizable and 896 non-crystallizable), the ESM2 T30-150M PLM (with LightGBM classifier) achieves a prediction accuracy of \(85.7\%\). This is better than the current sota method, CLPred (\(85.1\%\)). The ESM2 T30-150M (LightGBM) also reaches the best performance of 0.854 and 0.715 for quality metrics such as F1 score and MCC, respectively, as observed from Table 1. These quality metrics take into account the class imbalance in the data set. The performance of ESM2 T30-150M (LightGBM) is \(0.4\%\) and \(1.5\%\) better in absolute terms than the current sota sequence-based crystallization predictor i.e., CLPred. Moreover, ESM2 T30-150M is \(3.2\%\), \(2.9\%\), and \(5.7\%\) better than DeepCrystal for F1 score, accuracy, and MCC metrics, respectively.

However, with respect to quality metrics such as AUPR and AUC, the ESM2 T30-150M (with XGBoost classifier) model leads when compared to all other benchmark models as observed from Table 1 and Figs. 4a, 4d, 5a, and 5d. The ESM2 T30-150M (XGBoost) model reaches AUPR \(=0.929\) and AUC \(=0.936\). This is \(4.3\%\) and \(3.3\%\) better than DeepCrystal for AUPR and AUC metrics, respectively, as observed in Table 1. Furthermore, from Table 1, we observe that PLMs with XGBoost classifier available via TRILL tend to handle the class-imbalance worse than PLMs with custom LightGBM classifier. This is highlighted from the superior performance of PLMs with LightGBM classifier on F1-score and MCC metrics when compared to their equivalent XGBoost classifiers available via TRILL as depicted in Table 1. Overall, PLMs trained with either LightGBM or XGBoost classifier outperform CLPred, ATTCrys and DeepCrystal across all metrics on balanced test set.

When we combined the mean embedding representation of each test protein together with CNN / LSTM based classifier, the deep learning model tends to perform better than just the mean embedding based classifiers w.r.t. F1-score, accuracy, MCC, AUC and AUPR metrics as observed in Table 1. This suggests that the CNN and LSTM models can encapsulate additional contextual information when compared to mean embedding representation of protein, thereby, resulting in deep learning classifiers with significantly better performance for the balanced test set.

Figure 5
figure 5

Comparison of area under precision-recall curve (AUPR) of benchmark PLMs for the crystallization prediction task. (a) AUPR for fairly balanced test set using XGBoost, (b) AUPR for SP_final dataset using XGBoost, (c) AUPR for TR_final dataset using XGBoost, (d) AUPR for fairly balanced test set using LightGBM, (e) AUPR for SP_final dataset using LightGBM, and (f) AUPR for TR_final dataset using LightGBM.

SP_final test set results

A second experiment is performed on the reduced SP_final dataset obtained from SP_Pre dataset1. The ESM2 T36-3B model (with LightGBM classifier) outperforms sota sequence-based crystallization predictors like CLPred and DeepCrystal for the majority of the metrics, including F1, accuracy, MCC and precision as depicted in Table 2. The ESM2 T36-3B (LightGBM) model also outperforms other PLMs available via TRILL for these quality metrics as shown in Table 2. ESM2 T36-3B model (LightGBM) achieves a prediction accuracy of \(89\%\), which is \(9\%\) and \(14\%\) better than CLPred and DeepCrystal respectively (see Table 2). From Table 1, we observe ESM2 T36-3B model (LightGBM) attains an MCC of 0.769 and F1-score of 0.911, whereas CLPred obtains an MCC of 0.599 and F1-score of 0.832 indicating \(17\%\) and \(8\%\) improvement in performance. The ProstT5 model (with LightGBM classifier) achieves the best AUC (0.940) and AUPR (0.964) compared to other PLM-based classifiers as depicted in Figs. 4b, e, 5b, e.

We observe from Table 2 that small sized ESM2 models such as ESM2 T6-12M and ESM2 T12-35M cannot outperform CLPred for several quality metrics but bigger sized ESM2 models easily surpass sota models like fDETECT, DeepCrystal, ATTCrys and CLPred. The SP_final test set comprises 237 proteins with very little sequence similarity with training set and still ESM2 T36-3B classifiers (desgined with XGBoost / LightGBM) outperforms majority of sequence-based predictors on several evaluation metrics highlighting their effectiveness for crystallization propensity prediction.

The LSTM classifier built on top of embedding representation obtained from ProstT5 model achieves the best performance among all CNN and LSTM-based classifiers as observed in Table 2. Its performance is similar to the ESM2 T36-3B classifier wr.t. accuracy, AUC and AUPR metrics but cannot outperform the same on F1-score and MCC metrics as indicated in Table 2. Finally, the LSTM-based classifier (ProstT5) attained the best recall of 0.935 amidst all the models benchmarked for the SP_final test set.

Table 2 Benchmarking of PLMs in TRILL on the SP_final test set against sota methods.

TR_final test set results

We perform a final experiment to test for crystallization propensities of proteins using sota crystallization tools and benchmark PLM-based classifiers available via TRILL platform on the TR_final dataset1. ESM2 T30-150M model (with LightGBM classifier) achieves a prediction accuracy of \(89.4\%\), which is \(4\%\) better than CLPred (\(85.4\%\)), \(5.3\%\) better than DeepCrystal (\(84.1\%\)) and fDETECT (\(84.1\%\)). It is also \(0.9\%\) better than the next-best ESM2 T6-8M (LightGBM) model that attains an accuracy of \(88.5\%\) as depicted in Table 3. The ESM2 T30-150M model (LightGBM) achieves the best F1 (0.862) and MCC (0.778) as shown in Table 3 and second best performance for AUC (0.929) and AUPR (0.959) when compared to ESM2 T30-150M (XGBoost), which achieves AUC of 0.933 and AUPR of 0.960 as indicated in Table 3 and Figs. 4c, f, 5c, f.

Interestingly, we observe from Table 3 that LightGBM classifiers are superior than their counterpart XGBoost classifiers for the same PLM models and configurations highlighting their generalization capability (see Supp. Figure 2). Additionally, the CNN and LSTM based classifiers achieve performance comparable to mean embedding based classifiers w.r.t. AUC and AUPR metrics as observed in Table 3. The ESM2 T36-3B based CNN model achieved the best F1-score (0.855), accuracy (0.885) and MCC (0.765), which is slightly lower than the ESM2 T30-150M model (LightGBM), suggesting that an average pooling operator can better capture essential features to discriminate crystallizable proteins from non-crystallizable ones when compared to multi-layered CNN model for the TR_final dataset. Finally, on the TR_final dataset comprising 1012 proteins (far more than SP_final test set), the PLM-based classifiers are superior than DeepCrystal, ATTCrys and CLPred w.r.t. several evaluation metrics.

Table 3 Benchmarking of PLMs in TRILL on the TR_final test set against sota methods.

Protein generation results

The selected crystallizable candidates (\(n = 347\)) were trimmed on the basis of sequence to secondary structural compatibility (CS-Score \(\ge 40\) and CSS-Scores \(\ge 20\))56,57, resulting in a dataset of 32 proteins. The cut-off values for CS- and CSS-Scores were adopted from their benchmarking of successfully designed proteins56. These proteins were further tapered to 28 proteins, based on presence of aggregation protein region screening66, and to 5 proteins based on screening against UniRef10051.

Figure 6
figure 6

Best model structures for the 5 candidate proteins identified through our crystallizable protein generator workflow.

The proteins with pairwise sequence coverage \(\ge 40\%\), sequence identity \(\ge 35\%\) and e-value \(\le 0.5\) were discarded while screening for available homolog(s) in known protein sequence database (UniRef100), resulting in the set of 5 proteins. These protein were modeled by implementing RoseTTAFold (end-2-end prediction; 1 candidate structure for each protein)67 and AlphaFold2 (\(n = 5\) candidate structures for each protein)61, followed by structure refinement by using GalaxyRefine (\(n = 30\); 5 refined candidate structures for each candidate structure)68. The best model structure for each protein, selected on the basis of consensus score from ModFold25 and ProFitFun57. An important note here is that the model structures for each protein from AlphaFold2 and RosettaFold were refined (molecular dynamics-based refinement) with the rationale of achieving better quality scores. The pLDDT scores of the initial 5 models from AlphaFold2 for each of the candidate proteins along with other structural quality scores are provided in Supp. Table S3. It is worth noting that the pLDDT scores were available for the predicted model structures by AlphaFold2 only. To assure the improvement in the quality of the final selected model structure for each protein, the additional assessment metrics (TMScore, GDT-TS and GQ Score) were calculated for the AlphaFold2 model structures and compared with the corresponding scores for the final selected model. Since the final model for each protein is selected from the pool of 30 decoys generated post-structural refinement, the pLDDT score for them is not available. It is evident that the structural quality of the final model (selected post refinement) has improved significantly as illustrated in Supp. Table S3. The best model structure for each protein along with the distribution of backbone di-hedrals (Ramachandran Map) are depicted in Fig. 6. A summary of different quality assessment statistics of the best model structures is provided in Table 4. Additionally, the predicted Global Distance Test - Template Score (GDT-TS), Template Modeling Score (TMS), Global Quality Score (GQS), and Average Quality Score (OAQS) for all the candidate model structures are provided in Table 4.

The quality metrics for the best model structure of selected proteins (Prot-142, Prot-630, Prot-851, Prot-1120, and Prot-1302) ensured the accuracy of the tertiary structure prediction (Table 4). For all the model structures, the Ramachandran distribution of backbone di-hedral angles (\(\phi\) and \(\psi\)) is found to be distributed in the allowed regions, mainly the core region (colored ‘red’), as shown in Table 4 and Fig. 6. The predicted model structure for Prot-630 and Prot-1302 had the highest quality score (=0.69), followed by Prot-142 (=0.67), Prot-1120 (=0.65), and Prot-851 (=0.58). Notably, the predicted GDT-TS (0.84 for Prot-630 and 0.88 for Prot-1302) and predicted TM Score (0.83 for Prot-630 and 0.82 for Prot-1302) for these protein structure fall in the highly reliable range for predicted model structure (0.8 - 1.0). The GDT-TS and TM Score varies from 0-1, where 1 shows the highest level of structural prediction. The relative predicted quality of the model structure for Prot-851 was observed to be lower as compared to the model structures of other proteins. The secondary and tertiary structures of the selected protein revealed them to be mainly \(\alpha\)-proteins, except for Prot-142 which has fraction of residues (about 4%) part of \(\beta\)-strands.

Table 4 Summary of different quality evaluation parameters for the best model structure for each of the selected protein.
Figure 7
figure 7

Functional annotations associated with Prot-1120 and Prot-1302.

The functional annotations including biological processes (BP), molecular functions (MF) and cellular components (CC) associated with the generated proteins are provided in Supp. Table S4. Additionally, the two proteins with the maximum functional annotations were Prot-1120 and Prot-1302. The functional annotations associated with these proteins is depicted in Fig. 7. We observed that Prot-142 and Prot-630 are localized in cytoplasm, associated to different membranes such as cellular anatomical entity and mainly involved in different metabolic processes and bio-synthetic processes as depicted in Supp. Table S4. The designed protein, Prot-851, while being associated with plasma membrane and cell peripheries such as cellular anatomical entity, was predicted to perform diverse transporter activities by its involvement in different metabolic and transport processes. In contrast to the functional characterization of Prot-142, Prot-630, and Prot-851, the designed proteins Prot-1120 and Prot-1302 were predicted to be involved in the highly diverse set of molecular functions and biological processes as illustrated in Fig. 7. For instance, Prot-1120, with the similar cellular localization of other designed proteins, was predicted to be involved in a wider range of metabolic processes, viz. phosphorous, phosphate-containing, and organo-nitrogen compound metabolic processes, primary and cellular metabolic processes, and overall regulation of cellular processes. The Prot-1120 was predicted to be involved catalytic activity, calcium-dependent phospholipid binding, transferase activity, purine ribonucleoside triphosphate binding, small molecule binding, phosphoric ester hydrolase activity, ion binding, organic cyclic compound binding, carbohydrate derivative binding, and heterocyclic compound binding. Further, Prot-1302 is computationally characterized to perform metabolic and biosynthetic process along with trans-membrane transport of various compounds. With the involvement in a diverse set of biological processes, the Prot-1302 was predicted to perform ion channel activity, ATP binding, trans-membrane transporter activity, transferase activity, phosphotransferase activity, purine ribonucleoside triphosphate binding, small molecules and ions binding, organic cyclic compound binding, carbohydrate derivative binding, and heterocyclic compound binding.

With a comprehensive computational functional characterization, we believe that experimental validation of Prot-1120 and Prot-1302 can lead to the novel functional proteins that can be fine-tuned to have desired functions.

Discussion & conclusion

One of the main challenges for protein structure determination is that only about 2-\(10\%\) of pursued protein targets yield high-resolution protein structures69. Upon investigating these estimates in the TargetDB database6, it was observed that among the 150, 727 cloned targets that were deposited into TargetDB, only 37, 398 (\(24.8\%\)) were successfully purified, 12, 923 (\(8.6\%\)) further successfully crystallized, and 6,942 \((4.6\%)\) resulted in diffraction quality crystals70. Additionally, majority of the cost of structure determination is consumed by the failed attempts7 as crystallization is a process that is characterized by a significant rate of attrition. The reasons for this attrition include the need for the crystals to be sufficiently large (> 50 micrometers), pure in composition, regular in structure, and without significant internal imperfections. Furthermore, to produce diffraction-quality crystals, an empirical or trial-and-error approach is commonly used, in which a large number of experiments are brute-forced to find a suitable setup71, often resulting in failure. Thus, the above provides strong motivation to develop accurate and efficient in silico sequence-based protein crystallization predictors that allow high-throughput screening of candidate protein sequences for favorable crystallization propensity.

In this paper, we benchmark open-PLMs accessed via the TRILL platform, a framework enabling democratization of protein language models, for sequence-based protein crystallization propensity prediction. The main objective is to determine whether PLMs trained on hundreds of millions of protein sequences can discriminate crystallizable proteins from non-crystallizable ones without fine-tuing using just the raw protein sequences as input. These PLMs encode the raw protein sequences and generate embedding (vector) representations. We then built optimized tree-based classifiers (XGBoost / LightGBM) on top of these embedding representations to estimate their discriminative capacity without the need to manually engineered biological and physiochemical features. By implementing a thorough benchmarking on a set of independent test sets, we observe that these open-PLM based classifiers consistently outperform state-of-the-art deep learning techniques, such as DeepCrystal, ATTCrys and CLPred, on several evaluation metrics.

DeepCrystal17 captures frequent amino acid k-mers in the input sequence using a set of parallel convolution filters of varying sizes with the CNN design providing the freedom of calculating local dependencies with different filter sizes. Conversely, CLPred20 uses a BiLSTM deep learning architecture to capture high-order, long-range interaction patterns between k-mers making it better than the CNN-based DeepCrystal as indicated in Tables 1, 2 and 3. However, open source protein language models trained on several million protein sequences are much better than smaller and crystallization specific deep learning models like DeepCrystal, ATTCrys and CLPred (see Tables 1, 2 and 3), even with no additional fine-tuning and a simple linear probing approach i.e. building classifiers on top of embedding representations. In particular, the ESM2 T30-150M and ESM2 T36-3B based models (with LightGBM classifier) outperform every other benchmark model on the three independent test sets for quality metrics such as F1-score, accuracy, MCC, and precision.

This success can be attributed to the huge amount of data on which these PLMs are trained, the underlying transformer architecture which can capture local and long-range contextual dependencies in protein sequences through attention mechanism27 and generate meaningful and discriminative embedding representations for the downstream crystallization task.

The proposed methodology illustrates its ability to generate and filter unique crystallizable proteins as well as engineer proteins to achieve desired properties and functions. These proteins may aid in the better understanding of biological processes, as well as the rapid development of new medicines and materials. For example, a designed protein with certain mutations could aid in understanding the roles of specific amino acid residue(s) in the natural protein. Similarly, protein-based therapeutic regimes that involve improvements in the efficacy, stability, solubility, or specificity of certain enzymes, antibodies, and hormones may be accelerated with computational engineering with the help of proposed workflow. Furthermore, computational design may help in the development of more efficient, stable, and selective enzymes that can considerably boost industrial output in the fields of bio-catalysis, food industry, and bio-fuels.