Fig. 3: Deep reinforcement learning agents learn to approximate single-qubit unitaries using different bases of gates.

A proximal policy-optimization agent (PPO) (blue color) and a deep Q-learning hindsight-experience replay agent DQL+HER (orange color) were trained to approximate single-qubit unitaries using two different bases of gates, i.e., six small rotations of π/128 around the three-axis of the Bloch sphere and the Harrow–Recht–Chuang efficient base of gates (HRC), respectively. The tolerance was fixed to 0.99 average gate fidelity. a The length distributions of the gates sequences discovered by the agents at the end of the learning. The HRC base generates shorter circuits as expected. b Performance of the agent during training on the tasks.