Table 1 List of the databases used to train 7net-Omni

From: Optimizing cross-domain transfer for universal machine learning interatomic potentials

| Database abbreviation | Task (Channel) | Number of structures / 10³ | Domain | XC functional | Oversampling | Reference |
|---|---|---|---|---|---|---|
| MPtrj | mpa | 1580 | Inorganic crystal | PBE(+U) | 15 | 7 |
| Alex¹ | mpa | 12,069 | Inorganic crystal | PBE(+U) | 4 | 18, 19, 20 |
| DBS | mpa | 121 | General | PBE(+U) | 15 | - |
| OMat24 | omat24 | 101,901 | Inorganic crystal | PBE(+U) | 1 | 21 |
| MatPES | matpes | 412 | Inorganic crystal | PBE | 40 | 22 |
| OC20² | oc20 | 30,757 | Catalyst (metal) | RPBE | 2 | 23 |
| OC22 | oc22 | 8210 | Catalyst (oxide) | PBE(+U) | 6 | 24 |
| ODAC23 | odac23 | 4082 | MOF | PBE-D3 | 1 | 25 |
| OMol25 (low)³ | omol25 | 60,852 | Molecule | ωB97M-V | 1 (5)³ | 26 |
| OMol25 (high)³ | omol25_high | 1390 | Molecule | ωB97M-V | 5 | 26 |
| SPICE⁴ | spice | 1738 | Molecule | ωB97M | 5 | 27 |
| QCML | qcml | 18,301 | Molecule | PBE0 (ref. 118) | 1 | 28 |
| MAD | mad | 86 | General | PBEsol (ref. 119) | 100 | 29 |
| MP-r2SCAN | mp_r2scan | 50 | Inorganic crystal | r2SCAN | 40 | 93, 120 |
| MatPES-r2SCAN | matpes_r2scan | 368 | Inorganic crystal | r2SCAN | 40 | 22 |
| MP-ALOE | matpes_r2scan | 864 | Inorganic crystal | r2SCAN | 15 | 30 |

Each entry provides the database abbreviation, the corresponding task name, the number of structures included in training, the chemical domain, the XC functional used in the calculations, the oversampling factor applied during training, and the corresponding literature reference. Charged structures in SPICE, OMol25 and QCML are excluded. Databases with identical computational protocols are grouped under the same task. The total number of structures is 242 million.

  1. A subset of the Alexandria dataset (sAlex) is included for 3D configurations (ref. 21), and 2D and 1D configurations are also incorporated.
  2. The OC20 database consists of relaxation trajectories, rattled structures, and ab initio molecular dynamics configurations. For the relaxation trajectories, we employ the OC20M split provided in OC20, while for the rattled and MD structures, we use subsampled datasets. See the Methods section for detailed subsampling criteria.
  3. We split the OMol25 database, which contains various spin states, into ‘low-spin’ and ‘high-spin’ configurations, treating these two classes as separate tasks. The organometallic complex structures in the low-spin category are oversampled five times to improve training. See the Methods section for the splitting criteria.
  4. We use energies and forces calculated without the D3 dispersion correction.
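
The bookkeeping behind the table can be checked with a short script. The sketch below transcribes the structure counts (in thousands) and oversampling factors from Table 1, sums the counts, and estimates an oversampling-weighted count per task. It rests on two assumptions that the table does not confirm: that oversampling by a factor k is equivalent to presenting each structure k times, and that OMol25 (low) can be treated with a factor of 1 throughout, since the size of its five-times-oversampled organometallic subset is not given here. The names `ROWS`, `total_structures` and `effective_by_task` are illustrative and do not come from the 7net-Omni code.

```python
# Rows transcribed from Table 1:
# (database, task, structures in thousands, oversampling factor).
# OMol25 (low) uses factor 1; its organometallic subset is oversampled 5x,
# but the subset size is not listed in the table, so it is ignored here.
ROWS = [
    ("MPtrj", "mpa", 1580, 15),
    ("Alex", "mpa", 12069, 4),
    ("DBS", "mpa", 121, 15),
    ("OMat24", "omat24", 101901, 1),
    ("MatPES", "matpes", 412, 40),
    ("OC20", "oc20", 30757, 2),
    ("OC22", "oc22", 8210, 6),
    ("ODAC23", "odac23", 4082, 1),
    ("OMol25 (low)", "omol25", 60852, 1),
    ("OMol25 (high)", "omol25_high", 1390, 5),
    ("SPICE", "spice", 1738, 5),
    ("QCML", "qcml", 18301, 1),
    ("MAD", "mad", 86, 100),
    ("MP-r2SCAN", "mp_r2scan", 50, 40),
    ("MatPES-r2SCAN", "matpes_r2scan", 368, 40),
    ("MP-ALOE", "matpes_r2scan", 864, 15),
]


def total_structures(rows):
    """Sum of the raw structure counts, in thousands."""
    return sum(count for _, _, count, _ in rows)


def effective_by_task(rows):
    """Oversampling-weighted counts (thousands), grouped by task name.

    Assumes a factor k means each structure is seen k times; this is an
    interpretation of the table, not a documented training detail.
    """
    weighted = {}
    for _, task, count, factor in rows:
        weighted[task] = weighted.get(task, 0) + count * factor
    return weighted


if __name__ == "__main__":
    print(f"total: {total_structures(ROWS) / 1000:.1f} million structures")
    for task, weighted in sorted(effective_by_task(ROWS).items()):
        print(f"{task:>14}: {weighted / 1000:.1f} million (weighted)")
```

The rounded per-database entries sum to about 242.8 million, in line with the stated total of 242 million. Grouping by task also shows the effect of the shared-protocol rule: MPtrj, Alex and DBS all contribute to `mpa`, and MatPES-r2SCAN and MP-ALOE both contribute to `matpes_r2scan`.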