Table 1 An interesting ‘aha moment’ of an intermediate version of DeepSeek-R1-Zero

From: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

Question: If a > 1, then the sum of the real solutions of \(\sqrt{a-\sqrt{a+x}}=x\) is equal to

Response: <think>

To solve the equation \(\sqrt{a-\sqrt{a+x}}=x\), let’s start by squaring both …

\({(\sqrt{a-\sqrt{a+x}})}^{2}={x}^{2}\Rightarrow a-\sqrt{a+x}={x}^{2}.\)

Rearrange to isolate the inner square root term:

\({(a-{x}^{2})}^{2}=a+x\Rightarrow {a}^{2}-2a{x}^{2}+{({x}^{2})}^{2}=a+x\Rightarrow {x}^{4}-2a{x}^{2}-x+({a}^{2}-a)=0\)

…

Wait, wait. Wait. That’s an aha moment I can flag here.

Let’s reevaluate this step by step to identify whether the correct sum can be …

We started with the equation:

\(\sqrt{a-\sqrt{a+x}}=x\)

First, let’s square both sides:

\(a-\sqrt{a+x}={x}^{2}\Rightarrow \sqrt{a+x}=a-{x}^{2}\)

Next, I could square both sides again, treating the equation: …

…

The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of RL.
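As a sanity check on the algebra quoted in the response, the following minimal Python sketch locates the real solution of \(\sqrt{a-\sqrt{a+x}}=x\) numerically and confirms that it also satisfies the quartic \({x}^{4}-2a{x}^{2}-x+({a}^{2}-a)=0\) obtained by squaring twice. It is not part of the original table or of the model's output. The function names `lhs_minus_rhs`, `quartic_residual`, and `real_solution` are hypothetical, and the bisection approach is our own choice. It relies on the observation that, for a > 1, the left-hand side minus x is strictly decreasing on the admissible interval 0 ≤ x ≤ a² − a and changes sign there, so the equation has exactly one real solution, and the requested sum equals that single root.

```python
import math


def lhs_minus_rhs(x: float, a: float) -> float:
    """Return sqrt(a - sqrt(a + x)) - x on the domain 0 <= x <= a*a - a."""
    inner = a - math.sqrt(a + x)
    # Clamp tiny negative values caused by rounding at the domain edge.
    return math.sqrt(max(inner, 0.0)) - x


def quartic_residual(x: float, a: float) -> float:
    """Evaluate x^4 - 2a x^2 - x + (a^2 - a), the quartic from the response."""
    return x**4 - 2 * a * x**2 - x + (a * a - a)


def real_solution(a: float, iterations: int = 200) -> float:
    """Find the real solution of sqrt(a - sqrt(a + x)) = x by bisection.

    Assumes a > 1, as in the question. The difference lhs - rhs is positive
    at x = 0 and negative at x = a*a - a, and it is strictly decreasing
    in between, so bisection converges to the unique root.
    """
    if a <= 1:
        raise ValueError("the question assumes a > 1")
    lo, hi = 0.0, a * a - a
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if lhs_minus_rhs(mid, a) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2


if __name__ == "__main__":
    a = 3.0
    x = real_solution(a)
    print("root:", x)
    print("quartic residual:", quartic_residual(x, a))
```

For a = 3 the root can be verified by hand: x = 1 gives \(\sqrt{3-\sqrt{4}}=1\). The quartic has further real roots, but these are extraneous solutions introduced by squaring and do not satisfy the original equation, which is why the sketch solves the unsquared form directly.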