Table 2 Manual verification of proteins linked to positively classified publications.

From: Compilation of parasitic immunogenic proteins from 30 years of published research using machine learning and natural language processing

| PubMed ID | Year | Row | Protein(s) | UniProt ID | e_Protein(s) | e_UniProt | Species | Prot. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 28424680 | 2017 | 2 | Nucleoside triphosphate hydrolase-IIᵃ | A0A7J6JVX4 | Hydrolase, NTPase II | A0A7J6K3V0, A0A7J6JVX4 | T. gondii | Yes |
| 23928460 | 2013 | 11 | Rhoptry protein 5 (ROP5) | Q3YJR4 | Rhoptry protein ROP5, Type II rhoptry protein 5A, Type III rhoptry protein 5A, Type I rhoptry protein 5A | A0A125YQ30, Q3YJR4, B9QLN8, B9Q3F2 | T. gondii | Yes |
| 26421596 | 2014 | 22 | Toxoplasma gondii 10 kDa excretory–secretory antigen (TgESA10) | R4H6A8 | Ubiquitin | R4H6A8 | T. gondii | Yes |
| 14500491 | 2003 | 33 | Merozoite surface protein 1 (MSP1)ᵃ | P13828 | Merozoite surface protein 1 | Q9GSA3 | P. yoelii | Yes |
| 29599776 | 2018 | 44 | Amastigote 2 (A2)ᵃ | A4HZU7, Q26351 | Stage-specific S antigen-like protein, stage-specific S antigen homolog | A4HZU7, Q26351 | L. infantum | Yes |
| 7595214 | 1995 | 1737 | Sporozoite surface protein 2 (PfSSP2)ᵃ | Q26020 | Sporozoite surface protein 2 | Q26020 | P. falciparum | Noᵇ |
| 21715579 | 2011 | 1747 | Pfs230ᵃ | P68874 | Gametocyte surface protein P230 | P68874 | P. falciparum | Yes |
| 9106193 | 1996 | 1757 | Merozoite surface protein 1ᵃ | P04933 | Merozoite surface protein 1 | Q8IJ53 | P. falciparum | Noᵇ |
| 29524527 | 2017 | 1767 |  | False positive |  |  | T. gondii | No |
| 11349025 | 2001 | 1777 | Paramyosin (Pmy)ᵃ | A0A3Q0KD88 | Paramyosin | A0A3Q0KD88 | S. mansoni | Noᵇ |

  1. Year = year of publication; Row = row position on the sheet [Candidates per PubID] in Supplementary Table S3. This sheet contains 1776 PubMed IDs, with their relative keyword counts, whose abstracts were classified as containing names of proteins considered vaccine candidates worthy of further investigation. The rows are in descending order of the ‘Protection’ column, and rows were chosen at regular intervals from the top and bottom for manual verification; Protein(s) = protein name(s) specified in the publication. Names with strikethrough are not vaccine candidates, so the classification is a false positive; UniProt ID = UniProt ID(s) linked to the protein name; e_Protein(s) = names automatically extracted from the publication title and abstract; e_UniProt = UniProt ID(s) linked to the extracted protein names, which in effect represent the vaccine candidates listed in Supplementary Table S4. Underlined UniProt IDs exactly match the protein identifier in the publication. UniProt IDs in bold are incorrect with respect to the protein name, e.g., Q9GSA3 is the ID for ‘merozoite surface protein’ and not ‘merozoite surface protein 1’; Species = the source species of the proteins; Prot. = ‘Yes’ or ‘No’, indicating whether the publication reports testing the protein's immunogenicity in an animal model.
  2. ᵃ No formal identifier (e.g., GenBank accession no. or UniProt ID) is given in the publication. The UniProt ID shown is based on the protein name only.
  3. ᵇ Protein reported in other publications as a possible vaccine candidate.
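
The legend distinguishes e_UniProt IDs that match the manually assigned identifier from those that do not, but the underline and bold formatting is not preserved in this text. The following is a minimal Python sketch of that comparison, assuming an "exact match" can be approximated by string equality between the UniProt ID and e_UniProt columns of Table 2. The names `TABLE2`, `exact_matches`, and `unmatched` are hypothetical and introduced only for illustration; they are not part of the authors' pipeline.

```python
# UniProt IDs copied from Table 2, keyed by PubMed ID.
# Each value is (manually assigned IDs, automatically extracted e_UniProt IDs).
# The false-positive row (29524527) lists no IDs and is omitted.
TABLE2 = {
    "28424680": ({"A0A7J6JVX4"}, {"A0A7J6K3V0", "A0A7J6JVX4"}),
    "23928460": ({"Q3YJR4"}, {"A0A125YQ30", "Q3YJR4", "B9QLN8", "B9Q3F2"}),
    "26421596": ({"R4H6A8"}, {"R4H6A8"}),
    "14500491": ({"P13828"}, {"Q9GSA3"}),
    "29599776": ({"A4HZU7", "Q26351"}, {"A4HZU7", "Q26351"}),
    "7595214": ({"Q26020"}, {"Q26020"}),
    "21715579": ({"P68874"}, {"P68874"}),
    "9106193": ({"P04933"}, {"Q8IJ53"}),
    "11349025": ({"A0A3Q0KD88"}, {"A0A3Q0KD88"}),
}


def exact_matches(table):
    """Per PubMed ID, list the extracted IDs that also appear in the manual column.

    Assumption: a match is plain string equality between the two columns.
    """
    return {
        pmid: sorted(manual & extracted)
        for pmid, (manual, extracted) in table.items()
    }


def unmatched(table):
    """PubMed IDs for which no extracted ID equals a manually assigned ID."""
    return sorted(
        pmid
        for pmid, (manual, extracted) in table.items()
        if not (manual & extracted)
    )


if __name__ == "__main__":
    for pmid, ids in exact_matches(TABLE2).items():
        print(pmid, ids if ids else "no exact match")
    print("Unmatched:", unmatched(TABLE2))
```

Under this assumption, the two merozoite surface protein 1 rows (14500491 and 9106193) are the only ones where the extracted ID differs from the manually assigned one, consistent with the legend's Q9GSA3 example.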