Fig. 6: Non-Abelian, iterative transformation of BERT by FIP composition.
From: Engineering flexible machine learning systems by traversing functionally invariant paths

a, Left: the FIP first identifies high-performance sparse BERTs (sparsities ranging from 10% to 80%) and then retrains them on IMDb. Right: BERT accuracy on the Yelp (solid) and IMDb (dashed) datasets along the FIP. b, Left: the FIP first retrains BERT on a new task (IMDb) and then discovers a range of sparse BERTs. Right: BERT accuracy on the Yelp (solid) and IMDb (dashed) datasets along the FIP. c, Top: graph-connected set of 300 BERT models trained on sentence completion on the Wikipedia and Yelp datasets, coloured by perplexity scores evaluated on two new query datasets, WikiText and IMDb. Nodes correspond to individual BERT models, and edges correspond to the Euclidean distance between BERT models in weight space. Bottom: scatter plot of inverse perplexity scores for the two query datasets, IMDb and WikiText.
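
The graph in panel c can be illustrated with a minimal sketch, assuming each model is represented by a flattened weight vector. The function names and the k-nearest-neighbour rule for choosing which edges to draw are assumptions made for illustration, since the caption states only that edges correspond to Euclidean distance in weight space. The inverse perplexity helper uses the standard definition of perplexity as the exponential of the mean negative log-likelihood per token.

```python
import numpy as np


def pairwise_weight_distances(weights):
    """Euclidean distance between every pair of flattened weight vectors.

    weights: array-like of shape (n_models, n_params).
    Builds an (n_models, n_models, n_params) intermediate, so this is
    suitable for toy inputs only, not for full-size BERT weights.
    """
    w = np.asarray(weights, dtype=float)
    diff = w[:, None, :] - w[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_edges(dist, k):
    """Undirected edges (i, j, distance) linking each model to its k nearest
    neighbours. The k-NN rule is an assumption, not taken from the caption."""
    edges = {}
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])
        neighbours = [int(j) for j in order if j != i][:k]
        for j in neighbours:
            a, b = (i, j) if i < j else (j, i)
            edges[(a, b)] = float(dist[a, b])
    return sorted((a, b, d) for (a, b), d in edges.items())


def inverse_perplexity(mean_nll):
    """Inverse of perplexity, where perplexity = exp(mean negative
    log-likelihood per token)."""
    return 1.0 / np.exp(mean_nll)


# Toy stand-ins for three models with two parameters each.
toy_weights = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
toy_dist = pairwise_weight_distances(toy_weights)
toy_edges = knn_edges(toy_dist, k=1)
```

For real models the weight vectors would have to be extracted and flattened first, and distances computed pair by pair to keep memory use manageable.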