Table 1 Classification and regression metrics for all models tested on our benchmark, ranked by F1 score

From: A framework to evaluate machine learning crystal stability predictions

| Model | F1 | DAF | Prec | Acc | TPR | TNR | MAE | RMSE | R² | Training set | Model parameters | Targets | Date added |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eqV2 S DeNS | 0.815 | 5.042 | 0.771 | 0.941 | 0.864 | 0.953 | 0.036 | 0.085 | 0.788 | 146k (1.6M) (MPtrj) | 31.2M | EFSD | 18 October 2024 |
| ORB MPtrj | 0.765 | 4.702 | 0.719 | 0.922 | 0.817 | 0.941 | 0.045 | 0.091 | 0.756 | 146k (1.6M) (MPtrj) | 25.2M | EFSD | 14 October 2024 |
| SevenNet | 0.724 | 4.252 | 0.650 | 0.904 | 0.818 | 0.919 | 0.048 | 0.092 | 0.750 | 146k (1.6M) (MPtrj) | 842.4k | EFSG | 13 July 2024 |
| MACE | 0.669 | 3.777 | 0.577 | 0.878 | 0.796 | 0.893 | 0.057 | 0.101 | 0.697 | 146k (1.6M) (MPtrj) | 4.7M | EFSG | 14 July 2023 |
| CHGNet | 0.613 | 3.361 | 0.514 | 0.851 | 0.758 | 0.868 | 0.063 | 0.103 | 0.689 | 146k (1.6M) (MPtrj) | 412.5k | EFSGM | 3 March 2023 |
| M3GNet | 0.569 | 2.882 | 0.441 | 0.813 | 0.803 | 0.813 | 0.075 | 0.118 | 0.585 | 63k (188.3k) (MPF) | 227.5k | EFSG | 20 September 2022 |
| ALIGNN | 0.567 | 3.206 | 0.490 | 0.841 | 0.672 | 0.872 | 0.093 | 0.154 | 0.297 | 155k (MP 2022) | 4.0M | Energy | 2 June 2023 |
| MEGNet | 0.510 | 2.959 | 0.452 | 0.826 | 0.585 | 0.870 | 0.130 | 0.206 | −0.248 | 133k (MP Graphs) | 167.8k | Energy | 14 November 2022 |
| CGCNN | 0.507 | 2.855 | 0.436 | 0.818 | 0.605 | 0.857 | 0.138 | 0.233 | −0.603 | 155k (MP 2022) | 128.4k (n = 10) | Energy | 28 December 2022 |
| CGCNN+P | 0.500 | 2.563 | 0.392 | 0.786 | 0.693 | 0.803 | 0.113 | 0.182 | 0.019 | 155k (MP 2022) | 128.4k (n = 10) | Energy | 3 February 2023 |
| Wrenformer | 0.466 | 2.256 | 0.345 | 0.745 | 0.719 | 0.750 | 0.110 | 0.186 | −0.018 | 155k (MP 2022) | 5.2M (n = 10) | Energy | 26 November 2022 |
| BOWSR | 0.423 | 1.964 | 0.300 | 0.712 | 0.718 | 0.693 | 0.118 | 0.167 | 0.151 | 133k (MP Graphs) | 167.8k | Energy | 17 November 2022 |
| Voronoi RF | 0.333 | 1.579 | 0.241 | 0.668 | 0.535 | 0.692 | 0.148 | 0.212 | −0.329 | 155k (MP 2022) | 26.2M | Energy | 26 November 2022 |
| Dummy | 0.185 | 1.000 | 0.154 | 0.687 | 0.232 | 0.769 | 0.124 | 0.184 | 0.000 | | | | |
  1. DAF, the discovery acceleration factor, is the ratio of model precision to the percentage of stable structures in the test set. The dummy classifier uses the scikit-learn 'stratified' strategy of randomly assigning stable or unstable labels according to the training-set prevalence. The dummy regression metrics MAE, RMSE and R² are attained by always predicting the test-set mean. The top positions in the leaderboard are all taken by UIP models trained on the combination of energies, forces and stresses. There is a pronounced gap in the regression metrics between the UIP models and the seven energy-only models. It is worth noting that CGCNN+P, Wrenformer and BOWSR achieve lower regression errors through their mitigation strategies for the mismatch between initial and relaxed structures, but ultimately these strategies did not improve their usefulness as measured by the F1 score and DAF. Voronoi RF, CGCNN and MEGNet perform worse than the dummy model on regression metrics but better on some classification metrics, demonstrating that regression metrics alone can be misleading. Acc, accuracy; k, thousand; M, million; Prec, precision; TNR, true negative rate; TPR, true positive rate.
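
As a minimal illustration of the dummy baselines described above, the sketch below uses scikit-learn's DummyClassifier with the 'stratified' strategy for the classification baseline, predicts the test-set mean for the regression baseline, and computes DAF as precision divided by the test-set prevalence of stable structures. The synthetic arrays and the roughly 15% stable prevalence are illustrative assumptions, not data from the benchmark.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, mean_absolute_error, precision_score, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1 = stable, 0 = unstable, with ~15% stable structures,
# plus a continuous energy-like target for the regression baseline.
y_train = rng.binomial(1, 0.15, size=10_000)
y_test = rng.binomial(1, 0.15, size=10_000)
e_test = rng.normal(0.1, 0.2, size=10_000)

# Features are ignored by the dummy estimator; a column of zeros suffices.
X_train = np.zeros((y_train.size, 1))
X_test = np.zeros((y_test.size, 1))

# Dummy classifier: randomly assigns stable/unstable labels according to the
# training-set prevalence (scikit-learn "stratified" strategy).
clf = DummyClassifier(strategy="stratified", random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

prevalence = y_test.mean()  # fraction of stable structures in the test set
precision = precision_score(y_test, y_pred)
daf = precision / prevalence  # discovery acceleration factor; ~1 for a random classifier

# Dummy regression baseline: always predict the test-set mean, giving R^2 = 0 by construction.
e_pred = np.full_like(e_test, e_test.mean())

print(f"F1 = {f1_score(y_test, y_pred):.3f}, DAF = {daf:.3f}")
print(f"MAE = {mean_absolute_error(e_test, e_pred):.3f}, R^2 = {r2_score(e_test, e_pred):.3f}")
```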