Table 2. Numbers of sequences in each class at each step of data set preparation. "Filtering" denotes homology reduction with CD-HIT and removal of sequences containing non-standard amino acids; it was performed before the data were divided into the two versions (holdout and partitioning) of the train-test and independent data sets.

From: Prediction of protein subplastid localization and origin with PlastoGram

| Localization, origin | Dataset | Before filtering | After filtering | Holdout version, train-test | Holdout version, independent | Partitioning version, train-test | Partitioning version, independent |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Envelope, nuclear-encoded | N_E | 118 (59 IM, 59 OM) | 115 (59 IM, 56 OM) | 98 (50 IM, 48 OM) | 17 (9 IM, 8 OM) | 96 (50 IM, 46 OM) | 10 (6 IM, 4 OM) |
| Thylakoid membrane, nuclear-encoded | N_TM | 276 | 222 | 189 | 33 | 192 | 30 |
| Stroma, nuclear-encoded | N_S | 357 | 340 | 289 | 51 | 287 | 53 |
| Thylakoid lumen, nuclear-encoded (imported via Sec pathway) | N_TL_SEC | 49 | 43 | 37 | 6 | 37 | 4 |
| Thylakoid lumen, nuclear-encoded (imported via Tat pathway) | N_TL_TAT | 84 | 79 | 67 | 12 | 67 | 6 |
| Inner membrane, plastid-encoded | P_IM | 187 | 128 | 109 | 19 | 106 | 11 |
| Thylakoid membrane, plastid-encoded | P_TM | 4456 | 1237 | 1051 | 186 | 1073 | 156 |
| Stroma, plastid-encoded | P_S | 1417 | 419 | 356 | 63 | 360 | 42 |
| Total number of sequences | - | 6944 | 2583 | 2196 | 387 | 2218 | 312 |
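
The totals row can be checked against the per-class counts with a few lines of Python. This is a minimal sketch for readers who want to verify or reuse the numbers. The counts are transcribed from the table above, while the names `ROWS`, `REPORTED_TOTALS`, `column_totals`, and `retained_fraction` are illustrative and not taken from PlastoGram.

```python
# Per-class counts transcribed from Table 2. Column order:
# before filtering, after filtering,
# holdout train-test, holdout independent,
# partitioning train-test, partitioning independent.
# All names below are illustrative, not part of PlastoGram.
ROWS = {
    "N_E":      (118, 115, 98, 17, 96, 10),
    "N_TM":     (276, 222, 189, 33, 192, 30),
    "N_S":      (357, 340, 289, 51, 287, 53),
    "N_TL_SEC": (49, 43, 37, 6, 37, 4),
    "N_TL_TAT": (84, 79, 67, 12, 67, 6),
    "P_IM":     (187, 128, 109, 19, 106, 11),
    "P_TM":     (4456, 1237, 1051, 186, 1073, 156),
    "P_S":      (1417, 419, 356, 63, 360, 42),
}

# Totals as reported in the last row of the table.
REPORTED_TOTALS = (6944, 2583, 2196, 387, 2218, 312)


def column_totals(rows):
    """Sum each column over all classes."""
    return tuple(sum(column) for column in zip(*rows.values()))


def retained_fraction(rows):
    """Fraction of each class that remains after filtering."""
    return {name: counts[1] / counts[0] for name, counts in rows.items()}


if __name__ == "__main__":
    print("column totals match:", column_totals(ROWS) == REPORTED_TOTALS)
    for name, fraction in retained_fraction(ROWS).items():
        print(f"{name}: {fraction:.1%} of sequences retained after filtering")
```

With these counts, the holdout train-test and independent sets add up to the after-filtering count for every class. The two partitioning sets together contain 2530 of the 2583 filtered sequences.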