Fig. 5: Architecture and training curves of the decomposing transformer.

a Left to right: at training time, a variable number of particles with similar (within ~0.2 mm) initial positions are selected from the dataset and combined into an event, and multiple events form a training batch. The input voxel data (energy loss and spatial coordinates) are processed and passed through the decomposing transformer, which first predicts the vertex positions and then estimates the kinematic parameters and termination conditions of each particle in the event. Parentheses indicate the tensor dimensions at each stage. b Training and validation curves for the different outputs of the network, showing smooth convergence of the model (this plot corresponds to the model of the “Initial case” subsection of the “Results” section; similar curves are obtained for the model of the “Nuclear clusters” subsection). The learning rate schedule can be read off the dashed purple lines.
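The following is a minimal sketch, not the authors' implementation, of the two ideas summarized in panel a: events are assembled by merging particles whose initial positions agree within ~0.2 mm, and the network first predicts the vertex and then per-particle kinematics and termination. PyTorch is assumed; all names (build_event, DecomposingTransformer, the heads) and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn


def build_event(particles, tol=0.2):
    """Merge a variable number of single-particle voxel sets whose initial
    positions agree within `tol` (mm) into one event.

    `particles` is a list of dicts with hypothetical keys:
      'voxels' : (n_vox, 4) tensor of (energy loss, x, y, z)
      'start'  : (3,) tensor, initial position
    """
    anchor = particles[0]["start"]
    kept = [p for p in particles if torch.norm(p["start"] - anchor) < tol]
    voxels = torch.cat([p["voxels"] for p in kept], dim=0)   # merged event voxels
    return voxels, len(kept)                                  # event, particle count


class DecomposingTransformer(nn.Module):
    """Skeleton of the two-stage output: vertex first, then per-particle heads."""

    def __init__(self, d_model=128, n_heads=8, n_layers=4, max_particles=5):
        super().__init__()
        self.embed = nn.Linear(4, d_model)                    # (dE, x, y, z) -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.particle_queries = nn.Parameter(torch.randn(max_particles, d_model))
        self.vertex_head = nn.Linear(d_model, 3)              # vertex position (x, y, z)
        self.kinematics_head = nn.Linear(d_model, 4)          # per-particle kinematics
        self.termination_head = nn.Linear(d_model, 1)         # termination-condition logit

    def forward(self, voxels):
        # voxels: (batch, n_vox, 4)
        tokens = self.encoder(self.embed(voxels))             # (batch, n_vox, d_model)
        pooled = tokens.mean(dim=1)                           # event-level summary
        vertex = self.vertex_head(pooled)                     # (batch, 3)
        # per-particle estimates conditioned on the encoded event
        queries = self.particle_queries.unsqueeze(0).expand(voxels.shape[0], -1, -1)
        per_particle = queries + pooled.unsqueeze(1)
        kinematics = self.kinematics_head(per_particle)       # (batch, max_particles, 4)
        termination = self.termination_head(per_particle)     # (batch, max_particles, 1)
        return vertex, kinematics, termination
```

Stacking several such events (padded to a common voxel count) along the first dimension yields the training batch shown on the left of panel a.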