Fig. 4: AlphaProof performance scaling with inference compute per problem.
From: Olympiad-level formal mathematical reasoning with reinforcement learning

Solve rates on held-out benchmarks (miniF2F-valid, formal-imo and PutnamBench-test). In both panels, the ‘compute per problem’ is an average, calculated as the total TPU compute consumed during the evaluation, divided by the total number of problems in the benchmark.

a, Solve rates as a function of increasing tree search compute per problem, measured in v6e TPU hours (logarithmic scale). Solve rates are highlighted for low search budgets (for example, 2/60 TPU hours per problem, corresponding to 2 minutes on 1 TPU) and more extensive search.

b, Scaling with TTRL compute. Solve rates as a function of increasing TTRL training compute per target problem, measured in v6e TPU days (linear scale). Solve rates are highlighted after an initial TTRL compute investment (for example, 50 TPU days or 1 day on 50 TPUs per problem) and at the end of the TTRL phase with performance evaluated using 4,000 simulations.

Note the different x-axis units and scales (logarithmic TPU hours versus linear TPU days) between panels.
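
The averaging and the unit conversions in the caption can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the total of 10 TPU hours over 300 problems is an invented illustration chosen so that the average equals the 2/60 TPU hours quoted in the caption. Only the 2/60 TPU hours and 50 TPU days figures come from the caption itself.

```python
from fractions import Fraction

HOURS_PER_DAY = 24
MINUTES_PER_HOUR = 60


def average_compute_per_problem(total_tpu_hours, num_problems):
    """Average compute per problem: total TPU hours divided by problem count."""
    if num_problems <= 0:
        raise ValueError("num_problems must be positive")
    return Fraction(total_tpu_hours) / num_problems


def tpu_hours_to_tpu_days(tpu_hours):
    """Convert TPU hours (panel a units) to TPU days (panel b units)."""
    return Fraction(tpu_hours) / HOURS_PER_DAY


def wall_clock_minutes(tpu_hours, num_tpus):
    """Wall-clock minutes if the TPU hours are spread evenly over num_tpus devices."""
    if num_tpus <= 0:
        raise ValueError("num_tpus must be positive")
    return Fraction(tpu_hours) * MINUTES_PER_HOUR / num_tpus


# Hypothetical evaluation: 10 TPU hours in total over 300 problems (illustrative only).
avg = average_compute_per_problem(10, 300)

# Caption example for panel a: 2/60 TPU hours is 2 minutes on 1 TPU.
low_budget_minutes = wall_clock_minutes(Fraction(2, 60), num_tpus=1)

# Caption example for panel b: 50 TPU days is 1 day on 50 TPUs.
ttrl_tpu_hours = 50 * HOURS_PER_DAY
ttrl_wall_clock_days = wall_clock_minutes(ttrl_tpu_hours, num_tpus=50) / (
    MINUTES_PER_HOUR * HOURS_PER_DAY
)
```

Expressing both budgets in TPU hours makes the gap between the panels concrete: under these conversions, the 50 TPU day example in panel b is 1,200 TPU hours, compared with 2/60 TPU hours for the low search budget in panel a.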