Table 5. The contribution of each preprocessing strategy to LLM-Prop performance

From: LLM-Prop: predicting the properties of crystalline materials using large language models

| Model | Band gap ↓ | Volume ↓ | Is-gap-direct ↑ |
| --- | --- | --- | --- |
| LLM-Prop (baseline) | 0.256 | 69.352 | 0.796 |
| + modified tokenizer | 0.247 | 78.632 | 0.785 |
| + label scaling | 0.242 | 44.515 | N/A |
| + [CLS] token | 0.231 | 39.520 | 0.842 |
| + [NUM] token | 0.251 | 86.090 | 0.793 |
| + [ANG] token | 0.242 | 64.965 | 0.810 |
| − stopwords | 0.252 | 56.593 | 0.779 |
| LLM-Prop+all (without space group) | 0.235 | 97.457 | 0.705 |
| LLM-Prop+all | 0.229 | 42.259 | 0.857 |

  1. We compare the baseline (the input crystal descriptions and the targets left unmodified, with the default T5 tokenizer) to the setting in which all strategies are combined.
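
As a reading aid, the comparison against the baseline can be expressed as a percent change per metric. The sketch below is illustrative only: the names `TABLE_5` and `relative_change` are hypothetical, the values are copied from the table, and the percent changes are derived here rather than reported in the source. For the columns marked ↓ a negative change is an improvement; for the column marked ↑ a positive change is an improvement.

```python
# Values copied from Table 5, in column order:
# (Band gap, Volume, Is-gap-direct). None marks the N/A entry.
TABLE_5 = {
    "LLM-Prop (baseline)": (0.256, 69.352, 0.796),
    "+ modified tokenizer": (0.247, 78.632, 0.785),
    "+ label scaling": (0.242, 44.515, None),
    "+ [CLS] token": (0.231, 39.520, 0.842),
    "+ [NUM] token": (0.251, 86.090, 0.793),
    "+ [ANG] token": (0.242, 64.965, 0.810),
    "− stopwords": (0.252, 56.593, 0.779),
    "LLM-Prop+all (without space group)": (0.235, 97.457, 0.705),
    "LLM-Prop+all": (0.229, 42.259, 0.857),
}
METRICS = ("Band gap", "Volume", "Is-gap-direct")
BASELINE = "LLM-Prop (baseline)"


def relative_change(row, baseline=BASELINE):
    """Return the percent change of each metric relative to the baseline row.

    Entries that are N/A in the table stay None.
    """
    reference = TABLE_5[baseline]
    return tuple(
        None if value is None else 100.0 * (value - ref) / ref
        for value, ref in zip(TABLE_5[row], reference)
    )


def report():
    """Print the percent change of every non-baseline row."""
    for name in TABLE_5:
        if name == BASELINE:
            continue
        parts = []
        for metric, change in zip(METRICS, relative_change(name)):
            shown = "N/A" if change is None else f"{change:+.1f}%"
            parts.append(f"{metric}: {shown}")
        print(f"{name:36s} " + ", ".join(parts))


if __name__ == "__main__":
    report()
```

Calling `relative_change("LLM-Prop+all")` gives the change of the combined configuration against the baseline for all three columns, which is the comparison the footnote describes.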