Table 1 Datasets in the benchmark. The reported numbers of drugs and diseases count only those involved in at least one nonzero drug-disease association. The sparsity s is the number of unknown (neither positive nor negative) matches, times 100, over the total number of possible drug-disease matches (rounded to the first decimal place). The imbalance ratio IR is the ratio between negative and positive outcomes in the dataset, expressed as a percentage (rounded to the second decimal place). The private version of PREDICT is the one generated from notebooks in the original GitHub repository, whereas the public one is the one deposited on Zenodo (ref. 14). The association matrix in the Fdataset comes from ref. 34, whereas the drug and disease features are from ref. 33.

From: Comprehensive evaluation of pure and hybrid collaborative filtering in drug repurposing

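| Type | Dataset | Paper | \(N_S\) | \(F_S\) | \(N_P\) | \(F_P\) | #Positive | #Negative | s (%) | IR (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Text-mining | Cdataset | 33 | 663 | 663 | 409 | 409 | 2,532 | 0 | 99.1 | 0 |
| Text-mining | Fdataset | 33,34 | 593 | 593 | 313 | 313 | 1,933 | 0 | 99.0 | 0 |
| Text-mining | DNdataset | 35 | 550 | 1,490 | 360 | 4,516 | 1,008 | 0 | 99.5 | 0 |
| Biological | Gottlieb | 34,36 | 593 | 1,779 | 313 | 313 | 1,933 | 0 | 99.0 | 0 |
| Biological | LRSSL | 37 | 763 | 2,049 | 681 | 681 | 3,051 | 0 | 99.4 | 0 |
| Biological | PREDICT | 14 | 1,351 | 6,265 | 1,066 | 2,914 | 5,624 | 152 | 99.6 | 2.70 |
| Biological | PREDICT | 14 | 1,014 | 1,642 | 941 | 1,490 | 4,627 | 132 | 99.5 | 2.85 |
| Biological | TRANSCRIPT | 13 | 204 | 12,096 | 116 | 12,096 | 401 | 11 | 98.3 | 2.74 |
| Artificial | Synthetic | 32 | 300 | 25 | 300 | 25 | 200 | 100 | 99.7 | 50 |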
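
The sparsity and imbalance-ratio definitions in the caption can be checked against the table with a short calculation. The sketch below is illustrative only and is not taken from the paper's code: the function names are hypothetical, it assumes that \(N_S\) and \(N_P\) are the numbers of drugs and diseases (so that \(N_S \times N_P\) is the number of possible matches), that IR is reported as a percentage, and that values are rounded to the nearest decimal.

```python
def sparsity_percent(n_drugs, n_diseases, n_positive, n_negative):
    """Percentage of unknown drug-disease pairs among all possible pairs.

    Assumes every pair that is neither positive nor negative is unknown.
    """
    total_pairs = n_drugs * n_diseases
    unknown_pairs = total_pairs - n_positive - n_negative
    return round(100.0 * unknown_pairs / total_pairs, 1)


def imbalance_ratio_percent(n_positive, n_negative):
    """Negative-to-positive ratio, expressed as a percentage."""
    return round(100.0 * n_negative / n_positive, 2)


# TRANSCRIPT row: 204 drugs, 116 diseases, 401 positives, 11 negatives.
print(sparsity_percent(204, 116, 401, 11))   # 98.3
print(imbalance_ratio_percent(401, 11))      # 2.74
```

Under these assumptions, the same two functions reproduce the s and IR columns for the other rows, for example 99.7 and 50 for the Synthetic dataset.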