Table 1. Data statistics of the training and testing datasets after removal of homologous sequences using the CD-HIT program.
Filtering step or dataset | Number of ACPs | Number of non-ACPs |
---|---|---|
Raw data | 1354 | 2250 |
Sequence length > 10 aa | 1256 | 2250 |
Sequence identity < 90% | 992 | 1980 |
Training dataset | 800 | 1600 |
Independent testing dataset | 192 | 380 |
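
The counts in the last two rows partition the non-redundant set (800 + 192 = 992 ACPs; 1600 + 380 = 1980 non-ACPs). As a minimal sketch of the length filter and the train/test partition, assuming a seeded random split (the table does not state how sequences were assigned to each set), the steps could look like the following. The helper names `filter_by_length` and `split_dataset` and the placeholder sequence IDs are hypothetical, and the CD-HIT redundancy-removal step is not reproduced here.

```python
import random


def filter_by_length(seqs, min_len=10):
    """Keep sequences strictly longer than min_len residues (the '> 10 aa' row)."""
    return [s for s in seqs if len(s) > min_len]


def split_dataset(items, n_train, seed=0):
    """Shuffle a copy of items and return (train, test) with n_train training items.

    Assumption: a seeded random split; the source does not describe the procedure.
    """
    if n_train > len(items):
        raise ValueError("n_train exceeds the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical placeholder IDs standing in for the non-redundant sequences
# that remain after the 90% identity filter (992 ACPs, 1980 non-ACPs).
acps = [f"ACP_{i}" for i in range(992)]
non_acps = [f"nonACP_{i}" for i in range(1980)]

train_pos, test_pos = split_dataset(acps, n_train=800)
train_neg, test_neg = split_dataset(non_acps, n_train=1600)

print(len(train_pos), len(test_pos))  # sizes of the ACP training and testing sets
print(len(train_neg), len(test_neg))  # sizes of the non-ACP training and testing sets
```

Fixing the seed keeps the partition reproducible, so the independent testing set stays disjoint from the training set across runs.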