Extended Data Fig. 3: Machine learning architectures.

(a) Scheme of the denoising U-Net architecture predicting the noise \(\epsilon_\theta(x_t, t, c)\). First, we project the input tensor features into a higher-dimensional space through a convolutional layer (red) and then apply a 2D sinusoidal positional encoding. We then apply a typical encoder-decoder structure, with skip connections scaled by \(1/\sqrt{2}\). The time-step encoding t is injected into the residual convolution layers (turquoise), and the condition embeddings c are fed to the residual transformer blocks (purple), as detailed in the Pipeline and Architecture section of the Methods. All transformer blocks have a residual connection. (b) Scheme of the unitary encoder that transforms input unitaries into conditionings.
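Two of the ingredients above can be sketched concretely: a 2D sinusoidal positional encoding (half the channels encode the row index, half the column index) and skip connections scaled by \(1/\sqrt{2}\), which keeps the activation variance roughly unit when two independent unit-variance streams are summed. This is a minimal NumPy illustration under those standard conventions, not the authors' implementation; all function names are ours:

```python
import numpy as np

def sinusoidal_encoding_1d(length, dim):
    """Standard sinusoidal encoding: sin/cos pairs at geometric frequencies."""
    pos = np.arange(length)[:, None]                  # (length, 1)
    i = np.arange(dim // 2)[None, :]                  # (1, dim/2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(pos * freq)                 # even channels: sine
    enc[:, 1::2] = np.cos(pos * freq)                 # odd channels: cosine
    return enc

def positional_encoding_2d(h, w, channels):
    """2D encoding: first half of the channels encodes rows, second half columns."""
    assert channels % 4 == 0
    half = channels // 2
    row = sinusoidal_encoding_1d(h, half)             # (h, half)
    col = sinusoidal_encoding_1d(w, half)             # (w, half)
    enc = np.zeros((h, w, channels))
    enc[:, :, :half] = row[:, None, :]                # broadcast over columns
    enc[:, :, half:] = col[None, :, :]                # broadcast over rows
    return enc

def scaled_skip(decoder_feat, encoder_feat):
    """Skip connection scaled by 1/sqrt(2): if both inputs have unit variance
    and are independent, the scaled sum again has unit variance."""
    return (decoder_feat + encoder_feat) / np.sqrt(2.0)
```

For example, `positional_encoding_2d(4, 6, 8)` returns a `(4, 6, 8)` tensor that is added to the projected features before the encoder-decoder stack.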