Fig. 2: TEM BF images and prediction visualizations for a high-performing (top) and a low-performing (bottom) image.

Images 16 (top) and 18 (bottom) are from the test dataset. On Image 16 (top), the Experimental-ML model had a pixel-wise F1 score of 0.92 and the Synthetic-ML model had a score of 0.90. On Image 18 (bottom), the pixel-wise F1 scores were 0.51 (Experimental-ML) and 0.57 (Synthetic-ML). In both images, the two models have similar performance-variance characteristics across the evaluated engineering performance metrics. Image 18 also shows what the authors perceive as an over-labeling issue in the datasets obtained from ref. 21.
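
The caption reports pixel-wise F1 scores without stating how they are computed. A minimal sketch follows, assuming the conventional definition F1 = 2TP / (2TP + FP + FN) taken over every pixel of a binary prediction mask and a binary ground-truth mask. The function name `pixelwise_f1`, the nested-list mask representation, and the toy masks are illustrative assumptions, not the authors' code or data.

```python
def pixelwise_f1(pred, truth):
    """F1 over all pixels of two same-shaped binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # labeled in both masks
            elif p:
                fp += 1      # predicted, but absent from ground truth
            elif t:
                fn += 1      # in ground truth, but not predicted
    denom = 2 * tp + fp + fn
    # Assumed convention: two entirely empty masks count as perfect agreement.
    return 1.0 if denom == 0 else 2 * tp / denom


# Toy 2x3 masks (hypothetical, for illustration only).
truth = [[1, 1, 0],
         [0, 1, 0]]
pred = [[1, 0, 0],
        [0, 1, 1]]

score = pixelwise_f1(pred, truth)  # TP=2, FP=1, FN=1
```

Under this definition, every ground-truth pixel that the prediction leaves unmarked counts as a false negative, so the score depends directly on how the ground truth was labeled.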