Table 1 Performance comparison between EasIFA and the baseline models on the SwissProt E-RXN ASA test set^a

From: Multi-modal deep learning enables efficient and accurate annotation of enzymatic active sites

Column groups: Precision through MCC-bin belong to Binary-classification (active site location annotation task); Recall (Binding) through MCC-multi belong to Multi-classification^d (active site type annotation task).

| Methods | Note | Precision | Recall | FPR | F1 | MCC-bin | Recall (Binding) | FPR (Binding) | Recall (Catalytic) | FPR (Catalytic) | Recall (Other Site) | FPR (Other Site) | MCC-multi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EasIFA-ESM-bin^b | | 85.78% | 79.03% | 0.41% | 79.15% | 0.8010 | na | na | na | na | na | na | na |
| EasIFA-SaProt-bin^c | | 83.87% | 80.57% | 0.55% | 78.68% | 0.7971 | na | na | na | na | na | na | na |
| EasIFA-ESM-multi | | 85.65% | 80.83% | 0.48% | 80.09% | 0.8101 | 64.85% | 0.48% | 48.99% | 0.02% | 8.03% | 0.01% | 0.8029 |
| EasIFA-ESM-multi | | 85.09% | 81.77% | 0.51% | 80.56% | 0.8139 | 68.47% | 0.51% | 36.44% | 0.02% | 7.12% | 0.01% | 0.8093 |
| EasIFA-SaProt-multi | | 85.39% | 80.05% | 0.46% | 78.85% | 0.8006 | 64.35% | 0.46% | 48.78% | 0.02% | 7.77% | 0.01% | 0.7932 |
| EasIFA-SaProt-multi | | 84.38% | 80.96% | 0.51% | 78.97% | 0.8012 | 67.93% | 0.50% | 36.47% | 0.02% | 7.20% | 0.01% | 0.7957 |
| AEGAN | | 16.84% | 56.73% | 7.87% | 22.15% | 0.2449 | na | na | 50.81% | 8.70% | na | na | na |
| AEGAN | | 16.82% | 54.96% | 7.73% | 21.82% | 0.2394 | na | na | 36.17% | 8.62% | na | na | na |
| BLASTp | | 64.97% | 73.13% | 1.21% | 65.68% | 0.6634 | 59.31% | 1.12% | 45.71% | 0.07% | 8.50% | 0.03% | 0.6618 |
| BLASTp | | 72.57% | 73.26% | 0.76% | 70.41% | 0.7089 | 59.30% | 0.71% | 46.12% | 0.04% | 8.28% | 0.02% | 0.7073 |
| Schrodinger-SiteMap | | na | na | na | 12.21% | 0.1096 | 45.28% | 20.69% | na | na | na | na | na |

  a. Bold indicates the best result.
  b. Uses ESM-2 for enzyme residue sequence representation.
  c. Uses SaProt for enzyme residue sequence representation.
  d. Binding: consistent with the UniProt definition of "Binding Site"; the amino acid residues that bind substrates, products, and cofactors. Catalytic: consistent with "Active Site" as defined in UniProt; the residues that directly participate in catalysis. Other site: consistent with the UniProt definition of "Site"; other interesting amino acid sites, such as the inhibitory sites of proteases.

Note:

  1. The training set of the SwissProt E-RXN ASA dataset, which contains sequence and structural data for 44,341 enzymes, serves as the knowledge base and sequence alignment database. Scoring is on its test set of 892 samples. (Empirical rule-based methods do not use this knowledge base.)
  2. EasIFA uses the training set of the SwissProt E-RXN ASA dataset as its knowledge base. AEGAN uses the model state reported in the literature. Both are scored on the test set of the SwissProt E-RXN ASA dataset, excluding the 225 test samples that overlap with AEGAN's training set, which leaves 667 samples.
  3. AEGAN uses the model state reported in the literature and is scored on the test set of the SwissProt E-RXN ASA dataset without removing the 225 samples that overlap with AEGAN's training set, for a total of 892 samples.
  4. The entire SwissProt, comprising 569,516 sequences, serves as the sequence alignment database, and all enzymes in SwissProt, totaling 139,469 samples, serve as the knowledge base. Scoring is on the SwissProt E-RXN ASA test set of 892 samples.
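For readers who want to reproduce the binary-classification columns, the following is a minimal sketch of how Precision, Recall, FPR, F1, and MCC are conventionally computed from per-residue confusion counts. It is not the authors' evaluation code: the function names (`confusion_counts`, `binary_site_metrics`) and the example enzyme are hypothetical, and the table does not state how scores are aggregated across test enzymes. The reported F1 values are not the harmonic mean of the reported precision and recall (85.78% and 79.03% would give about 82.3%, not 79.15%), which would be consistent with averaging per-enzyme scores, but that is an assumption here.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary per-residue labels (1 = active site)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_site_metrics(tp, fp, tn, fn):
    """Standard binary metrics; undefined ratios are reported as 0.0 (an assumption)."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)  # false positive rate
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1, "mcc": mcc}


# Hypothetical enzyme: 300 residues, 10 annotated sites, 9 predicted, 8 of them correct.
y_true = [1] * 10 + [0] * 290
y_pred = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 289
metrics = binary_site_metrics(*confusion_counts(y_true, y_pred))
print({name: round(value, 4) for name, value in metrics.items()})
```

Under the per-enzyme averaging assumption, each test enzyme would be scored this way and the resulting values averaged over the 892 (or 667) test samples described in the notes.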