Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model

Zhu, Mingming; Song, Yidong; Yuan, Qianmu; Yang, Yuedong

doi:10.1038/s42003-024-07436-3

Open access
Published: 29 December 2024

Communications Biology volume 7, Article number: 1709 (2024)

Abstract

Proteins derived from microorganisms that survive in the harshest environments on Earth have stable activity under extreme conditions, providing rich resources for industrial applications and enzyme engineering. Due to the time-consuming nature of experimental determinations, it is imperative to develop computational models for fast and accurate prediction of protein optimal conditions. Previous studies were limited by the scarcity of data and the neglect of protein structures. To solve these problems, we constructed an up-to-date dataset with 175,905 non-redundant proteins and proposed a new model GeoPoc based on geometric graph learning for the protein optimal temperature, pH, and salt concentration prediction. GeoPoc leverages protein structures and sequence embeddings extracted from pre-trained language model, and further employs a geometric graph transformer network to capture the sequence and spatial information. We first focused on in-house validation for optimal temperature prediction for robustness assessment, and achieved a PCC of 0.78. The algorithm is further confirmed in an independent test set, where GeoPoc surpasses the state-of-the-art method by 2.3% in AUC. Additionally, GeoPoc was extended to pH and salt concentration prediction, and obtained AUC scores of 0.78 and 0.77, respectively. Through further interpretable analysis, GeoPoc elucidates the critical physicochemical properties that contribute to enhancing protein thermostability.

Introduction

Extremophiles, organisms that can survive in the harshest environments, demonstrate remarkable survival strategies on Earth. The proteins in these organisms display exceptional endurance to extreme temperatures, acidic or alkaline conditions, high salinity, and other harsh conditions¹. Since these proteins can remain stable and active under extreme environmental conditions, they have various applications in biotechnology², pharmaceuticals, chemical industry³, and environmental science⁴. Moreover, these proteins are crucial for understanding the adaptability mechanisms of life and provide a perspective for exploring the evolutionary patterns of life⁵. Therefore, it is important to develop a tool that accurately predicts the optimal temperature, pH, and salt concentration for extremophilic proteins from microorganisms.

Thermophilic proteins can maintain their structural integrity and function at high temperatures, with some variants showing stability at temperatures up to 100 °C⁶. Their unique thermal stability is closely related to biological characteristics such as amino acid composition, hydrogen bonds, salt bridges, and disulfide bonds⁷. Leveraging these characteristics, some thermophilic protein predictors have been developed with traditional machine learning methods, such as logistic model trees⁸, support vector machines⁹, ensemble learning¹⁰, and other classical algorithms¹¹. In recent years, deep learning has also been used to predict thermophilic proteins: iThermo¹² uses a multilayer perceptron for thermophilic protein prediction, and DeepTP¹³ employs convolutional neural networks to extract critical local information from protein sequences and then uses bidirectional long short-term memory networks to capture long-range dependency features. Besides thermophilic proteins, proteins that maintain structural and functional stability under extreme pH and salt concentration play a crucial role in industrial production¹⁴,¹⁵. However, these proteins are relatively under-researched. For pH, the enzymes produced by acidophilic and alkaliphilic microorganisms are widely used in various fields, including food processing, biofuel production, and wastewater treatment¹⁶. In a previous study, EpHod was designed to predict enzyme optimal pH directly from protein sequences¹⁷. Halophilic proteins, produced by microorganisms that survive in high salinity, are promising candidates for biocatalytic applications¹⁸.

Although methods have already been developed for protein optimal condition prediction, their overall utility has been limited by the scarcity of data, time-consuming features, and the lack of consideration of protein structures. Scarce samples and excessive homologous proteins impair generalization ability¹³, which leaves room for constructing a reliable large-scale benchmark dataset. Furthermore, some methods¹⁰ that rely on sequence-based evolutionary features are time-consuming. Considering the linguistic characteristics of protein sequences, unsupervised pre-training with language models has recently been applied to protein sequence representation¹⁹, and many methods have replaced these evolutionary features with pre-trained embeddings²⁰,²¹. In addition, previous studies have rarely utilized structural information to predict protein optimal conditions, even though the three-dimensional structure of proteins is crucial for exploring their stability and functionality in extreme conditions²². With the development of deep learning techniques, significant breakthroughs have been achieved in protein structure prediction. The representative approach, AlphaFold2²³, has exhibited a degree of accuracy that is comparable to experimental results in most cases and provided accessible databases. Such breakthroughs will undoubtedly benefit downstream protein studies²⁴, including protein optimal condition prediction. To fully leverage protein structure, various studies have represented proteins as graph data and utilized graph convolutional networks to extract sequence and structural features²⁵.

In this study, we constructed an up-to-date and non-redundant data set for research on protein optimal conditions. We developed GeoPoc (Geometry-aware Protein Optimal Condition predictor), which applies a graph transformer network and AlphaFold-predicted protein structures for protein optimal condition prediction. Specifically, the structures are used to construct the protein graph and obtain the geometric features. We integrate sequence embeddings from a pre-trained language model with the geometric features to construct residue features. Then, a geometric graph transformer network is applied to capture the complex interactions and structural features. Next, an attention pooling layer performs weighted pooling of information from the nodes in the graph, aiming to enhance the model’s perception of the graph structure. Various evaluations show that GeoPoc accurately predicts protein optimal temperature and surpasses state-of-the-art thermophilic protein predictors. In species-level validation, GeoPoc is robust for unseen species. Extending GeoPoc to predict pH and salt concentration also demonstrates accurate classification for each task. Furthermore, the model interpretability analysis indicates that our model captures the underlying mechanisms related to proteins adapted to extreme conditions and evolutionary patterns.

Results

Overview of GeoPoc

We constructed an up-to-date microorganism protein data set and developed GeoPoc, a model tailored for protein optimal condition prediction. GeoPoc consists of two major components: the representation of proteins as graphs and the GeoFormer block built on a graph transformer network, as shown in Fig. 1 and Materials and Methods.

**Fig. 1: Overview of dataset preparation, and GeoPoc model architecture.**

As shown in Fig. 1a, organism optimal culture condition data were retrieved from two prominent bacterial databases (BacDive²⁶, HaloDom²⁷) for 7766 entries on November 7, 2023, and 6,595,396 protein sequences for these entries were downloaded from UniProt²⁸. While preserving species diversity, redundant proteins were removed to form the final data set.

The proteins in the data set described above, together with their corresponding structures and pre-trained language model embeddings, were represented as graphs, which served as the input for the GeoFormer (Fig. 1b). The GeoFormer module is a geometry-aware versatile network consisting of normalization and message passing layers. The message passing layer adopts the multi-head attention of the graph transformer network, enhanced by edge features. We validated the performance of GeoPoc on protein optimal temperature prediction in our data set and on external data, and further performed optimal pH and salt concentration prediction. The results demonstrated our framework to be accurate and robust. We then conducted several interpretability analyses and confirmed that GeoPoc is effective for capturing protein multidimensional structure and evolution patterns.
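The attention pooling step that turns per-residue node representations into one protein-level vector can be sketched in a few lines. This is a minimal illustration, not the GeoPoc implementation: the function name `attention_pool` and the single scoring vector `score_weights` are assumptions made for the example, and the real layer may compute its attention logits differently.

```python
import math


def attention_pool(node_feats, score_weights):
    """Softmax-weighted sum of per-residue feature vectors.

    node_feats: list of feature vectors (lists of floats), one per residue.
    score_weights: hypothetical learned vector; the attention logit of a
        node is its dot product with this vector (an assumption).
    Returns (pooled_vector, attention_weights).
    """
    logits = [sum(w * x for w, x in zip(score_weights, feats))
              for feats in node_feats]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(node_feats[0])
    pooled = [sum(a * feats[d] for a, feats in zip(attn, node_feats))
              for d in range(dim)]
    return pooled, attn
```

With all-zero scoring weights every residue receives the same attention, so the layer reduces to mean pooling; a trained scoring vector lets some residues contribute more than others.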

The up-to-date protein optimal condition datasets

In previous research, various data sets were used in thermophilic protein prediction (Table S1). However, the lack of a comprehensive public benchmark data set prompted us to construct a reliable and extensive one. The BacDive and HaloDom databases contain extensive optimal environmental conditions for various species without protein information, while UniProt has abundant protein sequences. Therefore, we linked BacDive and HaloDom to the UniProt database using species names to create a data set of optimal conditions for proteins. Our data set stands out as the largest and most diverse data set of thermophilic proteins, spanning a wide range of protein sequences from various protein families and species. Moreover, previous studies mainly focused on classifying thermophilic proteins, while our study represents the first accurate prediction of protein optimal temperatures. To enhance the overall quality and reliability of our data set, we removed redundant proteins sharing a sequence identity above 25%, whereas the identity cutoffs of the other data sets are 40% or more.

Our data set is comprised of three tasks: predicting protein optimal temperature, pH, and salt concentration. The data distribution of each task is shown in Fig. 2. The temperature data set we collected ranges from 4 °C to 105 °C, with 10,743 proteins above 50 °C from 157 species. The pH data set ranges from 1.1 to 12, with 8837 proteins from 100 species in the pH <5 and >9 ranges. The salt data set ranges from 0% to 37%, with 25,542 proteins above 4% from 535 species. The broader range of data distribution and abundance of extreme samples help us explore the intrinsic mechanisms of proteins adapted to extreme conditions.

**Fig. 2: Distribution of the optimal growth conditions for species and proteins across different condition ranges in three data sets.**

GeoPoc for optimal temperature prediction

We first focused on the task of predicting protein optimal temperature, and evaluated the performance of GeoPoc by the Pearson correlation coefficient (PCC) using 5-fold cross-validation (CV) and the test set. As shown in Fig. 3a, b, GeoPoc achieved PCC values of 0.758 ± 0.005 and 0.779 for the 5-fold CV and test set, respectively. The consistent performance on the cross-validation and test sets indicates the robustness of our model. We also extracted the points in the dense red regions of Fig. 3b, for which the PCC is 0.777 (Fig. S1), close to the value for the full test set. In addition, the model accuracy across different temperature ranges is primarily correlated with the size of the training set, with no clear positive correlation to structural accuracy, as shown in Fig. S2. We also trained models using only thermophilic proteins, but the resulting model performed worse on the test set (MAE of 6.975 vs 6.083 for the original model). To demonstrate the relative importance of each component, we conducted several ablation experiments. Compared with GeoPoc, GeoPoc (w/o attention pooling) removes the attention pooling layer, and GeoPoc (w/o geometric features) excludes the geometric information in the graph. We also included a naive baseline that uses ESM2.0 embeddings processed by a single transformer block to predict protein conditions. To assess the impact of protein structure, we also established an LSTM baseline and applied different feature inputs, including ESM2.0 embeddings and one-hot embeddings. Additionally, we introduced another baseline, a GNN using only one-hot embeddings, for direct comparison against LSTM (one-hot).
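The two regression metrics used in this section, PCC and MAE, can be computed as in the following minimal sketch. The function names are illustrative and the snippet relies only on the Python standard library; it is not taken from the GeoPoc code base.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the same unit as the inputs (here °C)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

PCC measures how well the predicted temperatures track the ordering and spread of the true ones, while MAE reports the typical error in degrees, which is why the thermophile-only comparison above is quoted as MAE.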

**Fig. 3: The performance of GeoPoc and other models on protein optimal temperature prediction.**

As shown in Fig. 3c, the LSTM model utilizing one-hot embedding inputs exhibited the poorest performance, with a PCC of 0.419. When the pre-trained language model embedding was used as the input feature, the LSTM model yielded a PCC of 0.680. This result indicates that a model that learns solely from one-hot embedding inputs yields much weaker predictions. Moreover, GeoPoc (without geometric features), which utilizes protein structures, outperforms LSTM (ESM embeddings) and Transformer (ESM embeddings), indicating that structural information is more directly relevant to the protein optimal temperature. Similarly, the GNN using only one-hot features outperformed LSTM (one-hot) by 0.096 PCC, further demonstrating the importance of incorporating protein structure. When we enhance the node and edge features with geometric information, GeoPoc yields the highest PCC (0.779). This further demonstrates that geometric information provides insights into the protein optimal temperature, such as well-packed hydrophobic cores and optimized hydrogen bonding patterns, which can contribute to a protein’s ability to maintain its structure and function at a particular temperature. On the other hand, removing the attention pooling layer has the slightest impact on GeoPoc, with the PCC dropping from 0.779 to 0.764. This indicates that attention pooling can assist the network in capturing key features across the sequence more effectively, but its contribution to our model’s performance is limited. The performance of GeoPoc and the baselines is shown in Table S2.

Since our model uses structures predicted by AF2, we also tried to incorporate structural accuracy information into the model, but the model performance slightly decreased (PCC from 0.779 to 0.770). Additionally, due to the limited accuracy of AF2 in predicting side chains, we also evaluated the impact of side chains on our model’s performance. When side chain-related features were removed, the PCC slightly decreased to 0.771 on the test set, indicating slight but positive contributions from the predicted side chains. We also explored the impact of different TransGCN depths on our model’s performance, with GeoPoc achieving the best results at a depth of 2, as shown in Table S6.

To demonstrate the effectiveness of our method, we compared GeoPoc with DeepTP, TMPpred⁹, SCMTPP²⁹, iThermo, and SAPPHIRE¹⁰ on an independent test set collected by Zhao¹³. The independent test set consisted of 817 thermophilic and 817 mesophilic proteins. We removed 1677 proteins from our training data set to ensure that the sequence identity between the independent test set and our training set was <25%. Thermophiles are defined as organisms with optimal growth temperatures exceeding 45 °C, while mesophiles are those with optimal growth temperatures ranging between 15 °C and 45 °C³⁰. Thus, GeoPoc was retrained and the 45 °C cutoff was used to classify thermophilic and mesophilic proteins (Fig. S3). The baseline LSTM (ESM embedding) was also considered here.
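Turning a predicted optimal temperature into a thermophilic/mesophilic label with the 45 °C cutoff described above amounts to a single comparison. The sketch below is illustrative only: the function names are hypothetical, and treating exactly 45 °C as mesophilic follows the "exceeding 45 °C" wording rather than any documented implementation detail.

```python
THERMO_CUTOFF_C = 45.0  # cutoff stated in the text


def is_thermophilic(temp_c, cutoff=THERMO_CUTOFF_C):
    """True when the (predicted) optimal temperature exceeds the cutoff."""
    return temp_c > cutoff


def accuracy_from_temperatures(pred_temps_c, true_labels):
    """Fraction of proteins whose thresholded prediction matches the label.

    true_labels: booleans, True for thermophilic proteins.
    """
    hits = sum(is_thermophilic(t) == label
               for t, label in zip(pred_temps_c, true_labels))
    return hits / len(true_labels)
```

Because the model outputs a continuous temperature, the same predictions can also be ranked to obtain threshold-free metrics such as AUC.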

Figure 3d shows that GeoPoc outperforms existing thermophilic protein predictors on the independent test set, with the highest AUC (0.966). In threshold-dependent metrics (Fig. 3e), GeoPoc achieved an MCC value of 0.823, which was 10.3% better than DeepTP (0.746) and 25.3% better than SAPPHIRE (0.657), and an accuracy value of 0.905, which was 3.7% better than DeepTP (0.873) and 10.2% better than SAPPHIRE (0.821). The performance of the other models is shown in Table S3. Moreover, GeoPoc obtained the highest AUPR of 0.973 compared with the other predictors, as shown in Fig. S4. The comparable performance of LSTM (ESM embedding) and DeepTP suggests that the pre-trained language model can capture richer semantic and evolutionary information of proteins, leading to improved predictions of optimal temperature. The outperformance of GeoPoc indicates that structure and geometric features are an effective part of learning the patterns for temperature prediction. These patterns are likely the mechanisms by which proteins maintain functionality and structural stability at specific temperatures.
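The percentages quoted above appear to be relative improvements rather than absolute differences, and a quick calculation reproduces them from the reported scores. The sketch also shows how MCC follows from confusion-matrix counts; the function names are illustrative, and no confusion counts from the paper are assumed.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom


def relative_gain_percent(ours, baseline):
    """Improvement of `ours` over `baseline`, as a percentage of `baseline`."""
    return (ours - baseline) / baseline * 100.0


# Scores reported in the text for the independent test set.
mcc_gain_vs_deeptp = relative_gain_percent(0.823, 0.746)    # about 10.3
mcc_gain_vs_sapphire = relative_gain_percent(0.823, 0.657)  # about 25.3
acc_gain_vs_deeptp = relative_gain_percent(0.905, 0.873)    # about 3.7
acc_gain_vs_sapphire = relative_gain_percent(0.905, 0.821)  # about 10.2
```

Read this way, the 10.3% MCC gain over DeepTP corresponds to an absolute difference of 0.077.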

In the species-level validation, GeoPoc obtained a PCC of 0.780 for proteins from unseen species (Fig. S5). Moreover, we determined the predicted optimal temperature for each species by averaging the predicted temperatures of the proteins associated with that species, and achieved a PCC of 0.805, as shown in Fig. 3f. The performance in the species-level validation indicates that our model has the potential to predict a species’ optimal temperature from the proteins of that species.
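The species-level estimate described above, the mean of the per-protein predictions within each species, can be sketched as follows. The record layout `(species, predicted_temperature)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def species_mean_temperature(records):
    """Average per-protein predictions within each species.

    records: iterable of (species_name, predicted_temperature_c) pairs.
    Returns a dict mapping species_name to the mean predicted temperature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for species, temp_c in records:
        sums[species] += temp_c
        counts[species] += 1
    return {species: sums[species] / counts[species] for species in sums}
```

Averaging many per-protein predictions tends to cancel protein-level noise, which is consistent with the species-level PCC (0.805) being higher than the protein-level one (0.780).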

GeoPoc for optimal pH and salt concentration prediction

We next extended GeoPoc to predict protein optimal pH and salt concentration, and evaluated its performance by AUC, AUPR, F1, and MCC using 5-fold CV and the test set. Similar to the previous section, we performed an ablation study to evaluate the effect of different parts of GeoPoc, including GeoPoc (w/o attention pooling), GeoPoc (w/o geometric features), and LSTM (ESM embedding). GeoPoc outperforms all baselines, with the highest AUPR of 0.56 and 0.59 and AUC of 0.79 and 0.79 on the pH and salt concentration test sets, respectively. Other performance metrics and 5-fold CV results are detailed in Table S4. Although GeoPoc yields the highest MCC (0.32 and 0.40) on the test sets, there is still room for improvement in its performance. Figure 4c and Fig. S6a show that our model misclassifies some samples of the extremely acidic or alkaline categories as neutral. Figure 4e and Fig. S6b show that our model incorrectly classifies some proteins with the highest salt concentrations as those originating from freshwater environments, reflecting the challenges in accurately predicting low salt concentrations. As a result, the model achieved a precision of only 0.41 for this class, indicating significant room for improvement. While stratified splitting was used to balance the dataset, the relatively small number of proteins with optimal salt concentrations below 0.05% likely contributed to the reduced predictive performance in this class. Furthermore, microorganism responses to salinity involve a wide array of genetic and metabolic pathways, including osmoregulation and ion transport mechanisms, which vary significantly between species and environmental conditions³¹. This complexity makes it harder to capture the relationship between a protein’s sequence and its optimal salt concentration.

**Fig. 4: The performance of GeoPoc and baselines on protein optimal pH and salt concentration prediction.**

As shown in Fig. 4a and b, the ROC curves of LSTM (ESM embedding) are closer to the diagonal, indicating that the predictions of the LSTM model are close to random choice. In contrast, GeoPoc (w/o attention pooling) and GeoPoc (w/o geometric features) performed better, benefiting from the graph constructed from the protein structures. This indicates that protein structures can reveal the physicochemical properties of proteins, which helps the model capture the intrinsic mechanisms behind protein optimal pH and salt concentrations. Figure 4d, f show the Sankey diagrams of the predictions for the three categories of pH and salt concentration after geometric features are removed from GeoPoc. In the pH prediction task, GeoPoc (w/o geometric features) shows a 0.09 decrease in MCC, a 0.10 decrease in F1, and a 0.04 decrease in AUPR compared to GeoPoc. In the salt concentration task, GeoPoc (w/o geometric features) shows a 0.05 decrease in F1 and a 0.03 decrease in both MCC and AUPR. This indicates that geometric information can better capture the interactions between residues, such as charge regulation, salt bridges, and hydration effects, which are crucial for proteins to maintain their activity under extreme pH or salt concentrations.

GeoPoc is effective for capturing multidimensional structure and evolution patterns

We further conducted model interpretability analysis to obtain insights into the underlying mechanisms utilized by the model during prediction. The importance scores of secondary structures and of the 20 amino acids were analyzed to provide a multidimensional understanding of the physicochemical properties of thermophilic proteins. Fig. S7 shows the importance distribution of secondary structures among thermophilic and mesophilic proteins from the test set. A t-test (P < 0.0001) indicates that the helix is more important than other secondary structures. In previous research³², the analysis of helices and intrahelical interactions within the protein led to the conclusion that maintaining high helical stability is essential for protein thermostability. Moreover, compared to mesophilic proteins, the importance of sheet secondary structures increased in thermophilic proteins. Figure 5a shows the per-residue importance in protein M0HVG5, and Fig. S8 shows two other examples. In 73.6% of the thermophilic proteins, the average importance of helices exceeds that of other secondary structures. This confirms the discriminative power of GeoPoc to capture the key information in protein optimal temperature prediction.

**Fig. 5: Visualization of residue importance and conservation and aligned protein families.**

As shown in Fig. 5b, the four most important amino acids are glutamic acid (E, 0.594), lysine (K, 0.593), isoleucine (I, 0.589), and tyrosine (Y, 0.584). Other thermophilic protein predictors also reveal that lysine¹², glutamic acid²⁹, isoleucine, and tyrosine are often among the most important amino acids¹⁰. Lysine and glutamic acid, whose side chains carry positive and negative charges, respectively, can form strong electrostatic interactions¹¹. Their interactions can lead to the formation of salt bridges, further stabilizing the protein structure²². Furthermore, glutamic acid and lysine are known for their strong helix-forming tendencies due to their long side chains³³. This is consistent with their role in stabilizing helices, which are essential for the thermal stability of protein structures, as analyzed above. In contrast, glycine and proline, which are known as helix breakers due to their conformational flexibility and rigid ring structure, respectively, exhibit the lowest importance. Additionally, tyrosine contains a large hydrophobic phenyl ring, and isoleucine is a hydrophobic amino acid, which promotes the formation of a tight hydrophobic core within the protein³⁴. The hydrophobic core enhances protein stability by compacting its interior, preserving the tertiary structure, and preventing structural disintegration under high temperatures³⁵. These results are consistent with previous studies³⁶,³⁷. Moreover, we find that the importance and frequency of amino acids do not demonstrate a clear correlation, suggesting that GeoPoc predicts optimal temperature without relying on the frequency of amino acid occurrence. The importance and occurrence of each amino acid in the thermophilic proteins are shown in Table S5.

To explore protein evolutionary patterns, we analyzed the conservation and importance scores of protein regions from two protein families, including DNA_helicase_UvrD/PEP and ABC_transpr_LoID_like. We calculated Spearman’s rank correlation coefficient between the conservation and importance of the 52 protein sequence regions from two protein families as shown in Fig. S9, and the average of Spearman’s rank correlation is 0.35. Figure 5c shows that conserved regions of the sequence are often more important for predicting the protein optimal temperature. These conserved regions are especially crucial for the stability and functionality of proteins during their evolution, including their optimal activity at specific temperatures. For future work, we can induce mutations in these essential regions and then utilize our model to screen for proteins possessing desired characteristics, such as improved stability or activity under specific temperatures in protein design.
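Spearman’s rank correlation between a region’s conservation and importance scores is the Pearson correlation of their ranks. The following standard-library sketch assigns tied values their average rank, a common convention; the function names are illustrative and the snippet does not reproduce the authors’ analysis pipeline.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean rank of the tie group
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)
```

Applied per region and then averaged over the 52 regions, this kind of calculation yields a summary value such as the 0.35 reported above.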

Discussion and conclusion

The stability of extremophile proteins under extreme conditions is crucial for industrial applications and enzyme engineering. However, experimental determination of the optimal temperature, pH, and salt concentration for specific proteins is time-consuming and expensive. Moreover, rapid advancements in sequencing technologies have led to the generation of a vast amount of protein sequence data. Therefore, computational models are urgently needed to screen and annotate extremophilic proteins. Previous studies were limited by small sample sizes in datasets and a lack of consideration for protein structural information.

In this study, we propose GeoPoc, a novel deep-learning model designed to accurately predict the optimal conditions for extremophile proteins. To develop GeoPoc, an up-to-date and large-scale data set has been constructed for the research of extremophile proteins. Our data set incorporates the latest data on protein sequences and optimal conditions measured in experiments from three databases (BacDive, HaloDom, UniProt), ensuring that our model is trained on the most relevant and reliable data. GeoPoc introduces several innovative features and geometric graph neural networks that enhance its predictive capabilities compared to existing methods. What sets GeoPoc apart from existing state-of-the-art methods is its incorporation of protein structural information. By leveraging the predicted protein structure, GeoPoc can capture the intricate geometric features and spatial arrangements that influence protein stability and function under extreme conditions. We employ the pre-trained language model ESM to efficiently extract informative sequence embeddings, thus bypassing expensive sequence-based evolutionary features. Moreover, GeoPoc achieved similar performance using ESM embeddings and Evoformer embeddings in a 5-fold CV on a subset of 200 proteins, as shown in Table S7. This is consistent with a previous observation³⁸ that Evoformer embeddings are not better than ESM embeddings for protein function prediction.

GeoPoc shows promising performance and generalization capabilities for protein optimal temperature in our data set, both at the protein and species levels. In independent testing, GeoPoc demonstrates superior performance compared to other existing methods for thermophilic protein prediction. Additionally, GeoPoc is extended to protein optimal pH and salt concentration prediction, exhibiting accurate classification for the pH task, while the performance for salt concentration, particularly in the low salt range, still leaves room for improvement. In the future, we will incorporate additional biochemical or environmental features to improve the performance of optimal salt concentration prediction. We also tried to predict the optimal pH and salt concentration directly as real values, but the performance was low (PCCs of 0.480 and 0.558, respectively). This is likely because cells can regulate ion concentrations across membranes, so the working environment of proteins inside cells does not exactly match the external optimal environment. Multi-class classification is comparatively less affected by this mismatch than real-value prediction.

The model’s interpretability analysis reveals that GeoPoc elucidates the critical physicochemical properties that contribute to enhancing protein thermostability. Our model still has some limitations. Our current dataset comprises only single-chain proteins, which limits the extension of our model to predicting conditions for multi-chain proteins. We will further enhance the generalizability of our model in the future as we acquire multi-chain protein data. In the future, we will also explore fine-tuning AlphaFold2 to predict protein optimal conditions and structures. This will enable us to better understand how proteins behave under different conditions and allow us to integrate these findings into protein design efforts, potentially leading to the development of proteins with tailored properties for specific industrial or biomedical applications.

Method

Data preparation

In this study, we created an up-to-date data set from two bacteria databases: BacDive²⁶ and HaloDom²⁷. The bacterial diversity metadatabase BacDive is the world’s largest database for standardized bacterial phenotypic information. HaloDom has collected data on 1,268 halophilic organisms across all life domains, covering aspects from genome and sequence data to environmental details. We downloaded each organism’s species name and corresponding optimal culture conditions from these databases. Because the optimal growth condition of an organism is commonly used to estimate the stability of the proteins encoded in its genome³⁹, we annotated each protein with the optimal condition of the organism from which it originates. To get protein sequences, we downloaded data from UniProt²⁸ by NCBI tax ID or species name of the organism. We retained protein sequences with a Protein Existence (PE) value of 3 or lower; the PE value represents the evidence supporting the existence of the protein and is defined in the PE field of the UniProt FASTA headers. If the optimal condition was given as a range, we replaced it with its average. Moreover, the unit of salt concentration was standardized to x%, which means x g/100 ml. The data collection process is shown in Fig. 1a. In total, this yielded 6,595,396 reliable protein sequences from 4948 species spanning 39 different phyla, as shown in Fig. S10 and Fig. S11. To examine the generalization ability of our method, we removed redundant proteins sharing sequence identity >25% within each data set using MMseqs2⁴⁰. Finally, we formed a temperature data set with 102,615 proteins, a pH data set with 84,646 proteins, and a salt data set with 66,618 proteins.
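Two of the normalization steps above, replacing a reported range by its average and expressing salt concentration as a percentage (g per 100 ml), are simple enough to sketch. The raw string format (`"70-75"`, `"18–20"`) and the g/L input unit are assumptions for illustration; the actual database fields may look different.

```python
import re


def condition_value(raw):
    """Return a single number for a value such as '37' or a range such as '70-75'.

    Ranges are replaced by the average of their two endpoints. Accepts a
    hyphen or an en dash as the range separator (assumed input format).
    """
    parts = re.split(r"\s*[-–]\s*", raw.strip())
    values = [float(part) for part in parts]
    if len(values) not in (1, 2):
        raise ValueError(f"unexpected condition format: {raw!r}")
    return sum(values) / len(values)


def salt_percent_from_g_per_l(grams_per_litre):
    """Convert g/L to x%, where x% means x g per 100 ml."""
    return grams_per_litre / 10.0
```

For instance, a reported optimum of 70 to 75 °C becomes 72.5 °C, and a medium with 40 g/L salt corresponds to 4%.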

Because microorganisms have mechanisms to regulate their intracellular pH and salt concentration⁴¹, we formulated the prediction of the different optimal conditions as different tasks. For example, Natronomonas moolapensis, which belongs to the Halobacteriaceae family⁴², is recorded in BacDive as a highly halophilic archaeon with an optimal growth salt concentration of 18–20%. To adapt to extreme saline environments, the Halobacteriaceae family commonly uses the “salt-in” effect⁴³, which consists of compensating ion levels: KCl usually provides an osmotic balance³¹. Microorganisms also have various regulatory mechanisms to adapt to acidic⁴⁴ or alkaline⁴⁵ environments. While bacteria possess the ability to regulate their internal pH and salt concentrations, it should be noted that this range of regulation is limited⁴⁶. Therefore, the protein optimal pH and salt concentration prediction problems were converted into three-class classification tasks. We set up three categories for pH and salt concentration, with pH values split at 5 and 9 and salt concentrations split at 0.05% and 4%, where 0.05% is the dividing line between freshwater and seawater and 4% is the maximum salt concentration of seawater. As for temperature, it is consistent inside and outside the microorganism, so we treat temperature prediction as a regression task. We then removed the proteins with lengths greater than 1023 due to ESM2.0’s input limitation. Furthermore, the optimal temperature of Balnearium lithotrophicum was found to be between 70 and 75 °C⁴⁷. Therefore, we adjusted the optimal temperature for Balnearium lithotrophicum from 55 °C, as reported by BacDive, to 72.5 °C. Each data set was then divided into training and test sets in a ratio of 8:2 using stratified sampling. The training set sizes for temperature, pH, and salt concentration were 79,272, 65,364, and 51,624, and the test set sizes were 19,817, 16,339, and 12,904, respectively.
For the protein optimal pH prediction task, the 0–5 range contains 3261 training samples and 787 test samples; the 5–9 range contains 57,578 training samples and 14,409 test samples; the 9–14 range contains 4525 training samples and 1143 test samples. For the salt concentration task, the 0–0.05% range contains 2988 training samples and 746 test samples; the 0.05–4% range contains 27,396 training samples and 6849 test samples; the 4–100% range contains 21,240 training samples and 5309 test samples. We perform five-fold cross-validation on the training set.
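The labeling rules described above (averaging a reported range and binning pH and salt concentration into three classes) can be illustrated with a minimal sketch. The function names are hypothetical, and the side on which a boundary value falls (lower bound inclusive here) is an assumption, since the text only gives the cut points.

```python
from typing import Tuple

# Cut points quoted in the text. Which class a value exactly on a
# boundary belongs to is an assumption of this sketch.
PH_CUTS = (5.0, 9.0)
SALT_CUTS = (0.05, 4.0)  # in %, i.e. g per 100 ml


def range_average(low: float, high: float) -> float:
    """Replace a reported range with its average."""
    return (low + high) / 2.0


def to_class(value: float, cuts: Tuple[float, float]) -> int:
    """Map a value to class 0, 1 or 2 given two cut points."""
    if value < cuts[0]:
        return 0
    if value < cuts[1]:
        return 1
    return 2


# Natronomonas moolapensis: optimal salt concentration reported as 18-20%.
salt = range_average(18.0, 20.0)
print(salt, to_class(salt, SALT_CUTS))
# A neutral pH falls into the middle class.
print(to_class(7.0, PH_CUTS))
```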

Protein representation as a graph

AlphaFold2 (AF2) has demonstrated remarkable accuracy in predicting protein structures, and we downloaded the predicted protein structures for our dataset from the AF2 database. To assess structural accuracy, AF2 uses the pLDDT (predicted Local Distance Difference Test) score, which estimates the model’s confidence in its predictions. For the proteins in our data set, the average pLDDT score is 81.76, above the value of 70 that AF2 indicates for a correct backbone prediction. Furthermore, there is no decline in the accuracy of structures predicted by AF2 at extreme temperatures, as shown in Fig. S1a. Since protein structure represents a balance between environment and evolution, and AF2 utilizes evolutionary information through MSA, it can accurately predict protein structures in extreme environments. Based on the protein structure, we represent a protein as a graph G = (V, E), with V denoting the amino acids (nodes) and E denoting the contacts between amino acids defined by pairwise distances (edges). Considering the crucial role of the geometric information of residues and of the interactions between these residues, we derived protein graph representations from the 3D protein structure as described in ref. ⁴⁸. Specifically, the node for residue i carries sequence features from the pre-trained language model and geometric node features, X_i = (X_e,i, X_G,i), and the edge (j, k) carries the geometric edge features E_j,k between residues j and k.
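As a rough illustration of how a distance-based contact graph can be built from a structure, the sketch below connects residues whose Cα atoms lie within a cutoff. The cutoff value, the use of Cα atoms, and the function name `contact_edges` are assumptions made for illustration; the text does not state the contact definition used by GeoPoc.

```python
import numpy as np


def contact_edges(ca_coords: np.ndarray, cutoff: float = 10.0) -> np.ndarray:
    """Return a (2, E) array of directed edges between residues whose
    C-alpha atoms are closer than `cutoff` angstroms. Each contact appears
    in both directions; self-loops are excluded."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    mask = (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)
    src, dst = np.nonzero(mask)
    return np.stack([src, dst])


# Toy chain of four residues spaced 3.8 angstroms apart along the x axis.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [11.4, 0.0, 0.0]])
edges = contact_edges(coords, cutoff=8.0)
print(edges.shape)
```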

The node features X_e,i are derived from the embeddings of a pre-trained 36-layer ESM2.0 model named ‘esm2_t36_3B_UR50D’, which provides a rich and information-dense evolutionary representation for protein sequences. We extracted the output from the last layer of the encoder part of ESM2.0 as the embedding. The geometric node feature X_G,i was derived from the 3D structure by taking angles and distances between nodes and the direction of local atoms relative to the Cα atom. From the coordinates of the three atoms N − Cα − C, we calculated distances and angles, the latter consisting of dihedral angles and bond angles. The direction is calculated by computing the direction of each atom within an amino acid residue relative to the Cα atom and then projecting it onto the local coordinate system defined by N − Cα − C.
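The bond and dihedral angles mentioned above are standard geometric quantities. A minimal sketch of how they can be computed from atom coordinates is given below; the function names are hypothetical, and the exact set of angles and their encoding in GeoPoc follow ref. 48 and are not reproduced here.

```python
import numpy as np


def bond_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (radians) at atom b formed by the bonds b-a and b-c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def dihedral(p0: np.ndarray, p1: np.ndarray,
             p2: np.ndarray, p3: np.ndarray) -> float:
    """Signed dihedral angle (radians) around the p1-p2 axis."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond.
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.arctan2(y, x))


p0 = np.array([1.0, 0.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([0.0, 0.0, 1.0])
p3 = np.array([0.0, 1.0, 1.0])
print(np.degrees(dihedral(p0, p1, p2, p3)))
print(np.degrees(bond_angle(p0, p1, p2)))
```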

The geometric edge feature E_j,k comprises positional embeddings, edge orientation, edge distance, and edge direction. Positional embeddings calculate the frequency of the distances between nodes in the sequence and encode them, helping the model understand the order of the sequence or the relative positions of nodes in the graph structure. Edge direction is a directional vector calculated based on the two residue nodes of an edge relative to their respective local coordinate systems. It describes the relative directional relationship between these two nodes in space. Edge orientation is obtained by comparing the N − Cα − C coordinate systems of the two nodes, specifically by calculating the rotation between the two local coordinate systems, represented as quaternions, to describe the relative difference in the local spatial orientation of the two nodes.
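To make the role of the local coordinate systems concrete, the following sketch builds an orthonormal frame from the N, Cα and C atoms of a residue, expresses the direction to a neighbouring Cα atom in that frame, and computes the relative rotation between two frames. The axis convention and function names are assumptions; the conversion of the rotation matrix to a quaternion, which the text describes for edge orientation, is omitted.

```python
import numpy as np


def local_frame(n: np.ndarray, ca: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Orthonormal frame (rows are the axes) from backbone N, CA, C atoms,
    built by Gram-Schmidt orthogonalization. Axis order is an assumption."""
    u = c - ca
    u = u / np.linalg.norm(u)
    v = n - ca
    v = v - np.dot(v, u) * u
    v = v / np.linalg.norm(v)
    w = np.cross(u, v)
    return np.stack([u, v, w])


def edge_direction(frame_i: np.ndarray, ca_i: np.ndarray,
                   ca_j: np.ndarray) -> np.ndarray:
    """Unit vector from residue i to residue j in the local frame of i."""
    d = ca_j - ca_i
    return frame_i @ (d / np.linalg.norm(d))


def relative_rotation(frame_i: np.ndarray, frame_j: np.ndarray) -> np.ndarray:
    """Rotation matrix taking coordinates in frame j to coordinates in frame i."""
    return frame_i @ frame_j.T


n = np.array([-0.5, 1.4, 0.0])
ca = np.array([0.0, 0.0, 0.0])
c = np.array([1.5, 0.0, 0.0])
frame = local_frame(n, ca, c)
print(edge_direction(frame, ca, np.array([0.0, 3.8, 0.0])))
```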

The architecture of GeoPoc

As shown in Fig. 1b, GeoPoc represents protein structures and sequence embeddings as protein graphs. Subsequently, GeoPoc employs a GeoFormer block to learn the intricate interactions between nodes and the overall characteristics of the graph structure, generating encoded information that reflects the molecular features of the proteins.

In the GeoFormer module, we apply a norm layer and an MLP to project the initial node and edge features to facilitate better processing and learning in subsequent graph neural network layers. After the MLP, we use message passing with TransformerConv⁴⁹. When processing graph data, we utilize a multi-head attention mechanism to enhance interaction between nodes, while also considering edge feature information during this process. We denote the hidden feature vectors of node i and edge j → i in layer l as ${h}_{i}^{\left(l\right)}$ and e_ij, respectively. Specifically, we calculate multi-head attention for each edge from j to i as follows:

$${q}_{c,i}^{\left(l\right)}={W}_{c,q}^{\left(l\right)}{h}_{i}^{\left(l\right)}$$

(1)

$${k}_{c,j}^{\left(l\right)}={W}_{c,k}^{\left(l\right)}{h}_{j}^{\left(l\right)}$$

(2)

$${e}_{c,ij}={W}_{c,e}{e}_{ij}$$

(3)

$${\alpha }_{c,ij}^{(l)}=\frac{\langle {q}_{c,i}^{(l)},{k}_{c,j}^{(l)}+{e}_{c,ij}\rangle }{{\sum }_{u\in {{\mathcal{N}}}(i)}\langle {q}_{c,i}^{(l)},{k}_{c,u}^{(l)}+{e}_{c,iu}\rangle }$$

(4)

Where $\langle q,k\rangle =\exp \left(\frac{{q}^{T}k}{\sqrt{d}}\right)$ represents the exponentially scaled dot-product function, and d denotes the hidden size of each head. We transform ${h}_{i}^{\left(l\right)}$ and ${h}_{j}^{\left(l\right)}$ into the query vector ${q}_{c,i}^{\left(l\right)}$ and the key vector ${k}_{c,j}^{\left(l\right)}$ by ${W}_{c,q}^{\left(l\right)}$ and ${W}_{c,k}^{\left(l\right)}$, respectively. The edge features e_ij are encoded and incorporated into the key vector as supplementary information for each layer.

To update node i, the message passing in layer l is:

$${v}_{c,j}^{\left(l\right)}={W}_{c,v}^{\left(l\right)}{h}_{j}^{\left(l\right)}$$

(5)

$${\hat{h}}_{i}^{(l+1)}={\parallel }_{c = 1}^{C}\left[{\sum}_{j\in {{\mathcal{N}}}(i)}{\alpha }_{c,ij}^{(l)}\left({v}_{c,j}^{(l)}+{e}_{c,ij}\right)\right]$$

(6)

Where ∣∣ denotes the concatenation operation for C-head attention. ${{\mathcal{N}}}(i)$ denotes the neighbors of node i.
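Equations (1)–(6) can be traced step by step in a small NumPy sketch for a single target node. This is an illustration under stated assumptions, not the GeoPoc implementation: the function name, the dictionary-based graph layout and the random weights are hypothetical, and subtracting the maximum score before exponentiation is only a numerical-stability device that leaves Eq. (4) unchanged.

```python
import numpy as np


def edge_attention(h, e, neighbors, i, Wq, Wk, Wv, We):
    """Update node i with edge-aware multi-head attention.

    h: (N, D) node features; e: dict mapping (j, i) to a (De,) edge feature;
    neighbors: dict mapping i to a list of neighbor indices j;
    Wq, Wk, Wv: (C, d, D) per-head projections; We: (C, d, De).
    Returns the concatenated head outputs, shape (C * d,)."""
    C, d, _ = Wq.shape
    heads = []
    for c in range(C):
        q = Wq[c] @ h[i]                              # Eq. (1)
        scores, messages = [], []
        for j in neighbors[i]:
            k = Wk[c] @ h[j]                          # Eq. (2)
            e_c = We[c] @ e[(j, i)]                   # Eq. (3)
            scores.append(q @ (k + e_c) / np.sqrt(d))
            v = Wv[c] @ h[j]                          # Eq. (5)
            messages.append(v + e_c)
        scores = np.array(scores)
        alpha = np.exp(scores - scores.max())
        alpha = alpha / alpha.sum()                   # Eq. (4)
        heads.append((alpha[:, None] * np.array(messages)).sum(axis=0))
    return np.concatenate(heads)                      # Eq. (6)


rng = np.random.default_rng(0)
N, D, De, C, d = 4, 8, 5, 2, 3
h = rng.normal(size=(N, D))
neighbors = {0: [1, 2, 3]}
e = {(j, 0): rng.normal(size=De) for j in neighbors[0]}
Wq, Wk, Wv = (rng.normal(size=(C, d, D)) for _ in range(3))
We = rng.normal(size=(C, d, De))
out = edge_attention(h, e, neighbors, 0, Wq, Wk, Wv, We)
print(out.shape)
```

Adding the projected edge feature to both the key and the value, as in Eqs. (4) and (6), lets the geometric relationship between two residues influence both how strongly a neighbor is attended to and what information it passes on.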

Before generating the final output, we employ self-attention pooling¹⁹ to automatically emphasize important residues. Our study involves two types of tasks: regression and three-class classification. Therefore, the output dimension of the multi-layer perceptron (MLP) is set to one and three, respectively. In the classification task, the three outputs represent the probability values of each class, and we select the class with the highest probability as the result.

Implementations and analyses

We performed five-fold cross-validation on the training data. The training sets were randomly divided into five folds, and each time the model was trained on four folds and tested on the remaining fold. This process was repeated five times, and the average validation performance was used to optimise the hyperparameters of the network. Specifically, GeoPoc consists of 2 GNN layers, 256 hidden units, and 4 attention heads. During training, we use early stopping based on the validation performance for a maximum of 50 epochs, with the batch size set to 12. In the testing phase, all five trained models from the cross-validation are used for prediction, and the average is taken as the final prediction result of GeoPoc.
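The cross-validation and ensembling scheme described above can be sketched in a few lines. The helper names are hypothetical and the "models" are stand-in callables; the sketch only shows how five train/validation splits are formed and how the five resulting predictions are averaged at test time.

```python
import random
from statistics import mean


def make_folds(n, k=5, seed=0):
    """Randomly partition sample indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cv_splits(n, k=5, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = make_folds(n, k, seed)
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


def ensemble_predict(models, x):
    """Average the predictions of the models trained on each split."""
    return mean(m(x) for m in models)


for train, val in cv_splits(23):
    print(len(train), len(val))

# Stand-in models: each adds a different constant to its input.
toy_models = [lambda x, b=b: x + b for b in (1.0, 2.0, 3.0, 4.0, 5.0)]
print(ensemble_predict(toy_models, 10.0))
```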

We conducted species-level validation for optimal temperature prediction to evaluate the performance of our model when applied to new species. Under the constraint that proteins from species in the training data set are excluded from the test set, we randomly split the proteins into training and test sets in a ratio of 8:2. Consequently, the test set contained 19,817 proteins from 626 species, while the training set contained 79,272 proteins from 3,148 species. To evaluate the performance of GeoPoc in predicting the optimal temperature of a species, we calculate it as the average of the predicted optimal temperatures of all proteins within the species. Moreover, we excluded from the test data set species with fewer than three proteins, because their averages are sensitive to deviations in the predictions for individual proteins.
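A minimal sketch of the species-level protocol is shown below: proteins are split so that no species appears in both sets, and species-level predictions are obtained by averaging protein-level predictions, skipping species with fewer than three proteins. The greedy way in which species are assigned to the test set is an assumption of this sketch, as are all names and the toy data.

```python
import random
from collections import defaultdict


def split_by_species(species_of, test_fraction=0.2, seed=0):
    """Split protein IDs so that train and test share no species.
    species_of maps protein ID -> species name."""
    by_species = defaultdict(list)
    for pid, sp in species_of.items():
        by_species[sp].append(pid)
    names = sorted(by_species)
    random.Random(seed).shuffle(names)
    target = test_fraction * len(species_of)
    train, test = [], []
    for sp in names:
        # Fill the test set with whole species until the target size is met.
        (test if len(test) < target else train).extend(by_species[sp])
    return train, test


def species_means(pred, species_of, min_proteins=3):
    """Average protein-level predictions per species, dropping species
    with fewer than min_proteins proteins."""
    grouped = defaultdict(list)
    for pid, value in pred.items():
        grouped[species_of[pid]].append(value)
    return {sp: sum(v) / len(v)
            for sp, v in grouped.items() if len(v) >= min_proteins}


toy_species = {"p1": "a", "p2": "a", "p3": "a", "q1": "b", "q2": "b"}
toy_pred = {"p1": 50.0, "p2": 60.0, "p3": 70.0, "q1": 30.0, "q2": 40.0}
print(species_means(toy_pred, toy_species))
```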

For qualitative analysis, we calculate the importance of each residue in the sequence to the predictions of GeoPoc using the GNNExplainer method⁵⁰. The amino acid importance scores from each model in the ensemble are summed and normalized across the protein sequence using the L2-norm. To compute residue conservation scores for protein families, we utilize the tool provided by ref. ⁵¹ with scaled Shannon entropy. We choose protein families using MMseqs2 with a minimum sequence identity of 0.2 and then query them in InterPro⁵².
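The aggregation of residue importance across the ensemble amounts to a sum followed by L2 normalization, which can be sketched as follows. The function name and the toy scores are hypothetical; the per-model scores themselves would come from GNNExplainer, which is not reproduced here.

```python
import math


def aggregate_importance(per_model):
    """Sum per-residue importance over ensemble members and scale the
    result to unit L2 norm. per_model is a list of equal-length lists,
    one per model."""
    summed = [sum(column) for column in zip(*per_model)]
    norm = math.sqrt(sum(s * s for s in summed))
    if norm == 0:
        return summed
    return [s / norm for s in summed]


# Two toy models scoring a two-residue sequence.
print(aggregate_importance([[3.0, 0.0], [0.0, 4.0]]))
```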

Evaluation metrics

We use the Pearson correlation coefficient (PCC), mean absolute error (MAE), Accuracy, area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPR), F1-score (F1), and Matthews correlation coefficient (MCC) to evaluate the prediction performance:

$${{\rm{PCC}}}=\frac{{\sum }_{i = 1}^{n}({x}_{i}-{\mu }_{x})({y}_{i}-{\mu }_{y})}{\sqrt{\mathop{\sum }_{i = 1}^{n}{({x}_{i}-{\mu }_{x})}^{2}}\sqrt{\mathop{\sum }_{i = 1}^{n}{({y}_{i}-{\mu }_{y})}^{2}}}$$

(7)

$${{\rm{MAE}}}=\frac{1}{n}\mathop{\sum }_{i=1}^{n}| {y}_{i}-{\hat{y}}_{i}|$$

(8)

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

(9)

$$Recall=\frac{TP}{TP+FN}$$

(10)

$$Precision=\frac{TP}{TP+FP}$$

(11)

$$F1=2\times \frac{Precision\times Recall}{Precision+Recall}$$

(12)

$$MCC=\frac{TP\times TN-FN\times FP}{\sqrt{(TP+FP)\times (TP+FN)\times (TN+FP)\times (TN+FN)}}$$

(13)

Where x and y are the two variables being compared, and x_i and y_i are the individual data points in the samples. TP (True Positive) refers to the number of correct positive predictions made by a classifier, while TN (True Negative) refers to the number of correct negative predictions. FP (False Positive) represents the number of incorrect positive predictions, and FN (False Negative) represents the number of incorrect negative predictions. In the evaluation of the three-class predictors for pH and salt concentration, we used macro-averaging for the AUC, F1, and AUPR metrics to account for the performance across all classes equally. For MCC, micro-averaging was applied, treating all samples equally regardless of class, to provide a global measure of performance.

PCC is sensitive to linear relationships between variables, making it an ideal metric for evaluating the performance of regression models that aim to capture such relationships. The other metrics are used for evaluating the performance of classification models.
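For reference, Eqs. (7)–(13) translate directly into code. The sketch below uses only the standard library; the function names are hypothetical, and the classification metrics are shown for a single binary confusion matrix, without the macro- and micro-averaging applied to the three-class tasks.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient, Eq. (7)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den


def mae(y, y_hat):
    """Mean absolute error, Eq. (8)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, F1 and MCC, Eqs. (9)-(13)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fn * fp) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


print(pcc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
print(binary_metrics(tp=40, tn=45, fp=5, fn=10))
```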

Statistics and Reproducibility

Data for n ≥ 3 independent experiments were expressed as mean absolute error (MAE). Statistical significance in this work was analyzed via a two-sample independent t-test in SciPy (version 1.10.1). In this research, we made use of datasets that are available at https://github.com/ZhummCrow/GeoPoc/tree/main/datasets. The protein structures were downloaded from the AF2 database, which is publicly accessible. It is important to note that our experiments were not randomized.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

This study constructed a new dataset by linking the UniProt (https://www.uniprot.org/), BacDive (https://bacdive.dsmz.de/), and HaloDom (http://halodom.bio.auth.gr/?view=all_data) databases. The protein optimal condition data sets are available at https://github.com/ZhummCrow/GeoPoc/tree/main/datasets. The source data behind the graphs in the paper can be found in Supplementary Data 1.

Code availability

All source code and trained models used in our experiments have been deposited at https://github.com/ZhummCrow/GeoPoc.

References

Stetter, K. O. Extremophiles and their adaptation to hot environments. FEBS Lett. 452, 22–25 (1999).
Dumorné, K., Córdova, D. C., Astorga-Eló, M. & Renganathan, P. Extremozymes: a potential source for industrial applications. J. Microbiol. Biotechnol. 27, 649–659 (2017).
Cowan, D. A., Ramond, J.-B., Makhalanyane, T. P. & De Maayer, P. Metagenomics of extreme environments. Curr. Opin. Microbiol. 25, 97–102 (2015).
Fujiwara, S. Extremophiles: Developments of their special functions and potential resources. J. Biosci. Bioeng. 94, 518–525 (2002).
Brininger, C., Spradlin, S., Cobani, L. & Evilia, C. The more adaptive to change, the more likely you are to survive: protein adaptation in extremophiles. In Seminars In Cell & Developmental Biology (ed. Mao, Y.) 158–169 (Elsevier, 2018).
Sahoo, K., Sahoo, R. K., Gaur, M. & Subudhi, E. Cellulolytic thermophilic microorganisms in white biotechnology: a review. Folia Microbiol. 65, 25–43 (2020).
Zhou, X.-X., Wang, Y.-B., Pan, Y.-J. & Li, W.-F. Differences in amino acids composition and coupling patterns between mesophilic and thermophilic proteins. Amino Acids 34, 25–33 (2008).
Zhang, G. & Fang, B. Logitboost classifier for discriminating thermophilic and mesophilic proteins. J. Biotechnol. 127, 417–424 (2007).
Meng, C., Ju, Y. & Shi, H. Tmppred: A support vector machine-based thermophilic protein identifier. Anal. Biochem. 645, 114625 (2022).
Charoenkwan, P. et al. Sapphire: A stacking-based ensemble learning framework for accurate prediction of thermophilic proteins. Comput. Biol. Med. 146, 105704 (2022).
Lin, H. & Chen, W. Prediction of thermophilic proteins using feature selection technique. J. Microbiol. Methods 84, 67–70 (2011).
Ahmed, Z. et al. ithermo: a sequence-based model for identifying thermophilic proteins using a multi-feature fusion strategy. Front. Microbiol. 13, 790063 (2022).
Zhao, J., Yan, W. & Yang, Y. Deeptp: a deep learning model for thermophilic protein prediction. Int. J. Mol. Sci. 24, 2217 (2023).
Suplatov, D. et al. Computational design of a ph stable enzyme: understanding molecular mechanism of penicillin acylase’s adaptation to alkaline conditions. PLoS One 9, e100643 (2014).
Hu, S. et al. Hpclas: A data-driven approach for identifying halophilic proteins based on catboost. bioRxiv https://doi.org/10.1101/2023.11.30.569348 (2023).
Ma, F. et al. Sequence homolog-based molecular engineering for shifting the enzymatic ph optimum. Synth. Syst. Biotechnol. 1, 195–206 (2016).
Gado, J. E. et al. Deep learning prediction of enzyme optimum pH. bioRxiv https://doi.org/10.1101/2023.06.22.544776 (2023).
Joo, W.-A. & Kim, C.-W. Proteomics of halophilic archaea. J. Chromatogr. B 815, 237–250 (2005).
Yuan, Q., Xie, J., Xie, J., Zhao, H. & Yang, Y. Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion. Brief. Bioinforma. 24, bbad117 (2023).
Song, Y. et al. Fast and accurate protein intrinsic disorder prediction by using a pretrained language model. Brief. Bioinforma. 24, bbad173 (2023).
Yuan, Q., Chen, S., Wang, Y., Zhao, H. & Yang, Y. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning. Brief. Bioinforma. 23, bbac444 (2022).
Panja, A. S., Maiti, S. & Bandyopadhyay, B. Protein stability governed by its structural plasticity is inferred by physicochemical factors and salt bridges. Sci. Rep. 10, 1822 (2020).
Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature 596, 583–589 (2021).
Yuan, Q. et al. Alphafold2-aware protein–dna binding site prediction using graph transformer. Brief. Bioinforma. 23, bbab564 (2022).
Song, Y., Yuan, Q., Zhao, H. & Yang, Y. Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Brief. Bioinforma. 24, bbad360 (2023).
Reimer, L. C. et al. BacDive in 2022: the knowledge base for standardized bacterial and archaeal data. Nucleic Acids Res. 50, D741–D746 (2022).
Loukas, A., Kappas, I. & Abatzopoulos, T. J. Halodom: a new database of halophiles across all life domains. J. Biol. Res. Thessalon. 25, 1–8 (2018).
UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Charoenkwan, P., Chotpatiwetchkul, W., Lee, V. S., Nantasenamat, C. & Shoombuatong, W. A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides. Sci. Rep. 11, 23782 (2021).
Taylor, T. J. & Vaisman, I. I. Discrimination of thermophilic and mesophilic proteins. BMC Struct. Biol. 10, 1–10 (2010).
Matarredona, L., Camacho, M., Zafrilla, B., Bonete, M.-J. & Esclapez, J. The role of stress proteins in haloarchaea and their adaptive response to environmental shifts. Biomolecules 10, 1390 (2020).
Petukhov, M., Kil, Y., Kuramitsu, S. & Lanzov, V. Insights into thermal resistance of proteins from the intrinsic stability of their α-helices. Proteins Struct. Funct. Bioinforma. 29, 309–320 (1997).
Blaber, M., Zhang, X.-j & Matthews, B. W. Structural basis of amino acid α helix propensity. Science 260, 1637–1640 (1993).
Nakariyakul, S., Liu, Z.-P. & Chen, L. Detecting thermophilic proteins through selecting amino acid and dipeptide composition features. Amino Acids 42, 1947–1953 (2012).
Baldwin, R. L. Energetics of protein folding. J. Mol. Biol. 371, 283–301 (2007).
Zhang, G. & Fang, B. Application of amino acid distribution along the sequence for discriminating mesophilic and thermophilic proteins. Process Biochem. 41, 1792–1798 (2006).
Szilágyi, A. & Závodszky, P. Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: results of a comprehensive survey. Structure 8, 493–504 (2000).
Hu, M. et al. Exploring evolution-aware &-free protein language models as protein function predictors. Adv. Neural Inf. Process. Syst. 35, 38873–38884 (2022).
Li, G., Rabe, K. S., Nielsen, J. & Engqvist, M. K. Machine learning applied to predicting microorganism growth temperatures and enzyme catalytic optima. ACS Synth. Biol. 8, 1411–1420 (2019).
Steinegger, M. & Söding, J. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Schäfer, G., Purschke, W. & Schmidt, C. L. On the origin of respiration: electron transport proteins from archaea to man. FEMS Microbiol. Rev. 18, 173–188 (1996).
Burns, D. G. et al. Natronomonas moolapensis sp. nov., non-alkaliphilic isolates recovered from a solar saltern crystallizer pond, and emended description of the genus natronomonas. Int. J. Syst. Evolut. Microbiol. 60, 1173–1176 (2010).
Pikuta, E. V., Hoover, R. B. & Tang, J. Microbial extremophiles at the limits of life. Crit. Rev. Microbiol. 33, 183–209 (2007).
Baker-Austin, C. & Dopson, M. Life in acid: ph homeostasis in acidophiles. Trends Microbiol. 15, 165–171 (2007).
Krulwich, T. A. Alkaliphiles:‘basic’molecular problems of ph tolerance and bioenergetics. Mol. Microbiol. 15, 403–410 (1995).
Preiss, L., Hicks, D. B., Suzuki, S., Meier, T. & Krulwich, T. A. Alkaliphilic bacteria with impact on industrial applications, concepts of early life forms, and bioenergetics of atp synthesis. Front. Bioeng. Biotechnol. 3, 75 (2015).
Takai, K., Nakagawa, S., Sako, Y. & Horikoshi, K. Balnearium lithotrophicum gen. nov., sp. nov., a novel thermophilic, strictly anaerobic, hydrogen-oxidizing chemolithoautotroph isolated from a black smoker chimney in the suiyo seamount hydrothermal system. Int. J. Syst. Evolut. Microbiol. 53, 1947–1954 (2003).
Yuan, Q., Tian, C. & Yang, Y. Genome-scale annotation of protein binding sites via language model and geometric deep learning. bioRxiv https://doi.org/10.1101/2023.11.02.565344 (2023).
Shi, Y. et al. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv https://doi.org/10.48550/arXiv.2009.03509 (2020).
Ying, Z., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. Gnnexplainer: Generating explanations for graph neural networks. Adv. Neural Inform. Process. Syst. https://doi.org/10.48550/arXiv.1903.03894 (2019).
Capra, J. A. & Singh, M. Predicting functionally important residues from sequence conservation. Bioinformatics 23, 1875–1882 (2007).
Paysan-Lafosse, T. et al. Interpro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).


Acknowledgements

This study has been supported by the National Natural Science Foundation of China (T2394502).

Author information

These authors contributed equally: Mingming Zhu, Yidong Song.

Authors and Affiliations

School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
Mingming Zhu, Yidong Song, Qianmu Yuan & Yuedong Yang
High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, 518000, China
Qianmu Yuan
Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Guangzhou, 510006, China
Yuedong Yang


Contributions

Y.Y. and Q.Y. designed research. M.Z. and Y.S. performed research. All the authors analyzed data. M.Z., Y.S., and Q.Y. wrote the paper. Y.Y. reviewed and edited the manuscript.

Corresponding author

Correspondence to Yuedong Yang.

Ethics declarations

Competing interests

Y.Y. is an Editorial Board Member for Communications Biology, but was not involved in the editorial review of, nor the decision to publish this article. All the other authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Yandong Huang, Xiao Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Aylin Bircan and Joao Manuel de Sousa Valente.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Zhu, M., Song, Y., Yuan, Q. et al. Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model. Commun Biol 7, 1709 (2024). https://doi.org/10.1038/s42003-024-07436-3


Received: 29 May 2024
Accepted: 20 December 2024
Published: 29 December 2024
DOI: https://doi.org/10.1038/s42003-024-07436-3
