Fig. 2: Model performance on novel tasks. | Nature Neuroscience

From: Natural language instructions induce compositional generalization in networks of neurons

a, Learning curves for the first 100 exposures to held-out tasks averaged over all tasks. Data are presented as the mean ± s.d. across n = 5 different random initializations of sensorimotor-RNN weights. For all subplots, asterisks indicate significant differences in performance according to a two-sided unequal-variance t-test. The most relevant comparisons are presented in the plots (for all subplots, not significant (NS), P > 0.05, *P < 0.05, **P < 0.01, ***P < 0.001; STRUCTURENET versus SBERTNET (L): t = 3.761, P = 1.89 × 10−4; SBERTNET (L) versus SBERTNET: t = 2.19, P = 0.029; SBERTNET versus CLIPNET: t = 6.22, P = 1.02 × 10−9; CLIPNET versus BERTNET: t = 1.037, P = 0.300; BERTNET versus GPTNET (XL): t = −1.122, P = 0.262; GPTNET (XL) versus GPTNET: t = 6.22, P = 1.04 × 10−9; GPTNET versus BOWNET: t = −3.346, P = 8.85 × 10−4; BOWNET versus SIMPLENET: t = 10.25, P = 2.091 × 10−22). A full table of pairwise comparisons can be found in Supplementary Fig. 3. b, Distribution of generalization performance (that is, first exposure to a novel task) across models. c–f, Performance across different test conditions for n = 5 different random initializations of sensorimotor-RNN weights, where each point indicates average performance across tasks for a given initialization. c, Generalization performance for tasks where instructions are swapped at test time (STRUCTURENET versus SBERTNET (L): t = −0.15, P = 0.875; SBERTNET (L) versus SBERTNET: t = −2.102, P = 0.036; SBERTNET versus CLIPNET: t = −0.162, P = 0.871; CLIPNET versus BERTNET: t = 0.315, P = 0.752; BERTNET versus GPTNET (XL): t = 0.781, P = 0.435; GPTNET (XL) versus GPTNET: t = 1.071, P = 0.285; GPTNET versus BOWNET: t = −2.702, P = 0.007; BOWNET versus SIMPLENET: t = −3.471, P = 5.633 × 10−4). A full table of pairwise comparisons can be found in Supplementary Fig. 4. 
d, Generalization performance for models where tasks from the same family are held out during training (STRUCTURENET versus SBERTNET (L): t = 0.629, P = 0.530; SBERTNET (L) versus SBERTNET: t = −0.668, P = 0.504; SBERTNET versus CLIPNET: t = 8.043, P = 7.757 × 10−15; CLIPNET versus BERTNET: t = −0.306, P = 0.759; BERTNET versus GPTNET (XL): t = 0.163, P = 0.869; GPTNET (XL) versus GPTNET: t = 1.534, P = 0.126; GPTNET versus BOWNET: t = −6.418, P = 3.26 × 10−10; BOWNET versus SIMPLENET: t = 14.23, P = 8.561 × 10−39). A full table of pairwise comparisons can be found in Supplementary Fig. 4. e, Generalization performance for models where the last layers of language models are allowed to fine-tune to the loss from sensorimotor tasks (STRUCTURENET versus SBERTNET (L): t = 1.203, P = 0.229; SBERTNET (L) versus SBERTNET: t = 2.399, P = 0.016; SBERTNET versus CLIPNET: t = 5.186, P = 3.251 × 10−7; CLIPNET versus BERTNET: t = −3.002, P = 0.002; BERTNET versus GPTNET (XL): t = 0.522, P = 0.601; GPTNET (XL) versus GPTNET: t = 2.631, P = 0.009; GPTNET versus BOWNET: t = 4.440, P = 1.134 × 10−5; BOWNET versus SIMPLENET: t = 10.255, P = 2.091 × 10−22). A full table of pairwise comparisons can be found in Supplementary Fig. 4. f, Average difference in performance between tasks that use standard imperative instructions and those that use instructions with conditional clauses and require a simple deductive reasoning component. 
Colored asterisks at the bottom of the plot show P values for a two-sided, unequal-variance t-test against the null distribution constructed using random splits of the task set (transparent points represent mean differences for random splits; STRUCTURENET: t = −36.46, P = 4.34 × 10−23; SBERTNET (L): t = −16.38, P = 3.02 × 10−5; SBERTNET: t = −15.35, P = 3.920 × 10−5; CLIPNET: t = −44.68, P = 5.32 × 10−13; BERTNET: t = −25.51, P = 3.14 × 10−8; GPTNET (XL): t = −16.99, P = 3.61 × 10−6; GPTNET: t = −9.150, P = 0.0002; BOWNET: t = −70.99, P = 4.566 × 10−35; SIMPLENET: t = 19.60, P = 5.82 × 10−6), and asterisks at the top of the plot indicate P-value results from a t-test comparing differences with STRUCTURENET and our other instructed models (versus SBERTNET (L): t = 3.702, P = 0.0168; versus SBERTNET: t = 6.592, P = 0.002; versus CLIPNET: t = 30.35, P = 2.367 × 10−7; versus BERTNET: t = 7.234, P = 0.0007; versus GPTNET (XL): t = 5.282, P = 0.004; versus GPTNET: t = −1.745, P = 0.149; versus BOWNET: t = 75.04, P = 9.96 × 10−11; versus SIMPLENET: t = −30.95, P = 2.86 × 10−6; see Methods and Supplementary Fig. 6 for full comparisons).
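
For readers unfamiliar with the test named in this caption, the following is a minimal sketch of a two-sided unequal-variance (Welch) t-test statistic and of the asterisk convention stated above. It is not the authors' analysis code. The function names `welch_t` and `stars` and the sample values are hypothetical, and converting t to a P value (which requires the t-distribution CDF with the Welch-Satterthwaite degrees of freedom) is left out to keep the sketch dependency-free.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for an unequal-variance (Welch) t-test of samples a and b.

    Hypothetical helper for illustration; not taken from the paper.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def stars(p):
    """Map a P value to the caption's notation (NS, *, **, ***).

    The caption does not define P == 0.05 exactly; it is treated as NS here.
    """
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "NS"


# Made-up per-initialization scores for two models (n = 5 each), not real data.
model_a = [0.81, 0.84, 0.79, 0.83, 0.82]
model_b = [0.74, 0.78, 0.73, 0.77, 0.75]
t_stat, dof = welch_t(model_a, model_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Given a P value reported in the caption, `stars(0.029)` yields `*` and `stars(0.300)` yields `NS`, matching the thresholds listed in panel a.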
