Table 3 Representative test-set compounds with highest model-predicted probabilities for each odor label

From: A comparative study of machine learning models on molecular fingerprints for odor decoding

| Odor label | Compound CID | Compound name | Assigned odor descriptors | Probability |
|---|---|---|---|---|
| CITRUS | 20797 | Nootkatone | CITRUS, FRUITY, GRAPEFRUIT | 0.974 |
| FRUITY | 16324 | Allyl butyrate | FRUITY, APRICOT, PINEAPPLE | 0.973 |
| GREEN | 324382 | Methyl 2-decynoate | WAXY, NUTTY, GREEN | 0.977 |
| FLORAL | 10176245 | (5R)-2,5,6-trimethylheptan-2-ol | FLORAL, TERPENIC, ROSE | 0.972 |
| WOODY | 24758199 | Methyl cedryl ether | GREEN, WOODY, CINNAMON | 0.984 |
| MUSK | 71332160 | 4-tert-butyl-2,6-dimethyl-3,5-dinitrobenzaldehyde | MUSK | 0.997 |
| EARTHY | 32065 | Nutty pyrazine | EARTHY, OTHERS, ROASTED | 0.973 |
| ODORLESS | 135565913 | Dipotassium guanylate | ODORLESS | 0.999 |
| MEATY | 47649 | 2-methyl-3-(methyldisulfanyl)furan | OTHERS, CHEMICAL, MEATY | 0.995 |
  1. For each odor label, the table lists the test-set compound to which the ST-LGBM model assigned the highest predicted probability. In all cases, the predicted label was among the compound’s assigned odor descriptors, suggesting consistency between model output and expert annotation.
  2. CID: PubChem compound identifier; ST: structural (Morgan) fingerprint; LGBM: Light Gradient Boosting Machine.
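
The selection rule described in the first note (for each label, take the test-set compound with the highest predicted probability, then check whether that label appears among the compound's assigned descriptors) can be sketched in a few lines. This is only an illustration, not the authors' code: the function name `top_compound_per_label`, the data layout, and the toy values are all assumptions made for the example.

```python
def top_compound_per_label(probabilities, labels, compound_ids, assigned):
    """Pick, for each label, the compound with the highest predicted probability.

    probabilities: list of rows, one per compound; probabilities[i][j] is the
                   predicted probability of labels[j] for compound_ids[i].
    assigned:      dict mapping compound id -> set of annotated descriptors.
    (All names and the layout are hypothetical, chosen for this sketch.)
    """
    results = {}
    for j, label in enumerate(labels):
        # Row index with the largest probability in column j.
        best = max(range(len(compound_ids)), key=lambda i: probabilities[i][j])
        cid = compound_ids[best]
        results[label] = {
            "cid": cid,
            "probability": probabilities[best][j],
            # True when the predicted label is among the annotated descriptors.
            "label_in_annotations": label in assigned[cid],
        }
    return results


# Toy data, invented purely for illustration (not taken from the table).
labels = ["CITRUS", "MUSK"]
compound_ids = ["A", "B", "C"]
probabilities = [
    [0.91, 0.05],  # compound A
    [0.40, 0.88],  # compound B
    [0.10, 0.30],  # compound C
]
assigned = {"A": {"CITRUS", "FRUITY"}, "B": {"MUSK"}, "C": {"GREEN"}}

top = top_compound_per_label(probabilities, labels, compound_ids, assigned)
for label, row in top.items():
    print(label, row["cid"], row["probability"], row["label_in_annotations"])
```

With a real model, the probability matrix would come from the classifier's per-label probability output on the test set; how that output is arranged depends on the multi-label setup, which this sketch does not assume.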