Table 2. Numbers of sequences in each class at each step of data set preparation. Filtering denotes homology reduction with CD-HIT and removal of sequences containing non-standard amino acids; it was performed before the data were divided into the two versions of the train-test and independent data sets. IM, inner membrane; OM, outer membrane.
From: Prediction of protein subplastid localization and origin with PlastoGram
Localization, origin | Dataset | Before filtering | After filtering | Holdout version: train-test | Holdout version: independent | Partitioning version: train-test | Partitioning version: independent |
|---|---|---|---|---|---|---|---|
Envelope, nuclear-encoded | N_E | 118 (59 IM, 59 OM) | 115 (59 IM, 56 OM) | 98 (50 IM, 48 OM) | 17 (9 IM, 8 OM) | 96 (50 IM, 46 OM) | 10 (6 IM, 4 OM) |
Thylakoid membrane, nuclear-encoded | N_TM | 276 | 222 | 189 | 33 | 192 | 30 |
Stroma, nuclear-encoded | N_S | 357 | 340 | 289 | 51 | 287 | 53 |
Thylakoid lumen, nuclear-encoded (imported via Sec pathway) | N_TL_SEC | 49 | 43 | 37 | 6 | 37 | 4 |
Thylakoid lumen, nuclear-encoded (imported via Tat pathway) | N_TL_TAT | 84 | 79 | 67 | 12 | 67 | 6 |
Inner membrane, plastid-encoded | P_IM | 187 | 128 | 109 | 19 | 106 | 11 |
Thylakoid membrane, plastid-encoded | P_TM | 4456 | 1237 | 1051 | 186 | 1073 | 156 |
Stroma, plastid-encoded | P_S | 1417 | 419 | 356 | 63 | 360 | 42 |
Total number of sequences | - | 6944 | 2583 | 2196 | 387 | 2218 | 312 |
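
The caption names removal of sequences with non-standard amino acids as one part of the filtering step. A minimal sketch of such a check is given below, assuming that "standard" means the 20 canonical one-letter residue codes; the function names and example sequences are hypothetical and are not taken from the PlastoGram code, and the CD-HIT homology reduction is not shown.

```python
# Minimal sketch (assumption): "standard" = the 20 canonical amino acids
# in one-letter code. Names and example data are illustrative only.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence: str) -> bool:
    """Return True if the sequence is non-empty and every residue is standard."""
    residues = sequence.strip().upper()
    return bool(residues) and set(residues) <= STANDARD_AMINO_ACIDS


def drop_nonstandard(sequences: dict[str, str]) -> dict[str, str]:
    """Keep only the entries whose sequence passes the residue check."""
    return {
        name: seq
        for name, seq in sequences.items()
        if has_only_standard_residues(seq)
    }


# Hypothetical input: identifiers mapped to protein sequences.
example = {
    "seq_ok": "MKTAYIAKQR",
    "seq_with_X": "MKTXYIAKQR",  # X = unknown residue
    "seq_with_U": "MKUAYIAKQR",  # U = selenocysteine
}
kept = drop_nonstandard(example)
print(sorted(kept))  # prints ['seq_ok']
```

A set comparison is used rather than a regular expression so that the accepted alphabet sits in one place and is easy to adjust if a different definition of "standard" is needed.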