Fig. 4: Performance of 2-GPU versus 2-CPU server.

a Average wall-time for a single iteration in DFT and TDDFT for the 5TCzBN molecule. Horizontal dotted lines indicate the ideal wall-times that would be obtained if our GPU code achieved the same FLOP performance as the CPU code. b Roofline analysis of GPU kernel performance for the Fock build on a single Nvidia A100 GPU. The 21 compute kernels shown correspond to the distinct combinations of angular momenta. The theoretical peak performance was bounded by the profiling conditions of the employed tool.