Extended Data Fig. 1: Flow chart comparing eventual and currently implemented DNN acceleration approaches.

From: Equivalent-accuracy accelerated neural-network training using analogue memory

a, Comparison between an eventual analogue-memory-based hardware implementation and our mixed software–hardware experiment. Although we do not implement CMOS neurons, we mimic their behaviour closely. In both schemes, weight update is performed only on the 3T1C g devices, and these contributions are later transferred to the PCM devices (G+ and G−). Owing to wall-clock throughput issues in our experiment, we have to perform all of the weight transfers at once. By contrast, in an eventual hardware implementation, weight transfer would take place on a distributed, column-by-column basis. Ideally, transfer for any weight column would be performed at a point in time when the neural-network computation, focused on some other layer, leaves that particular array core temporarily idle. b, Guidelines for optimizing the choice of transfer interval, depending on the time constant of the capacitor and the dynamic range of g. Because training of one image takes 240 ns, training of 8,000 images takes 8,000 × 240 ns = 1.92 ms, which is a substantial fraction of the time constant of the capacitor (5.16 ms). Although a longer transfer interval would allow more of the dynamic range of g to be used, it would probably suffer from poor retention of information in any volatile g device. However, even in the ideal case of an infinitely long time constant, the transfer interval would still need to be limited, owing to the finite dynamic range of g: a long transfer interval would probably cause g values to saturate under repeated weight updates, leading to loss of training information before transfer. c, Guidelines for optimizing the choice of gain factor F. We define ‘efficacy of post-transfer tuning’ as the inverse of the overall residual error after g tuning. Because a larger gain factor F means more available dynamic range for each weight, larger F is desirable. However, large F also amplifies any programming errors on the PCM devices caused by intrinsic device variability, and limits the correction that g can provide during post-transfer tuning. The efficacy therefore decreases monotonically with F, although perhaps not linearly as sketched here. The value we chose (F = 3) represents a reasonable trade-off for the PCM and 3T1C devices used here. For other situations, F can be initially estimated as F = DR_g/σ, where DR_g is the dynamic range of g and σ is the standard deviation of the PCM programming error. Additional optimization comes with neural-network training, which includes the weak effect of the drift contribution.
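As a rough illustration of the arithmetic behind panels b and c, the short Python sketch below reproduces the transfer-interval estimate and the first-pass gain-factor estimate F = DR_g/σ. The timing values are taken from the caption; the values used for DR_g and σ are placeholders, not measured device parameters.

    # Panel b: transfer interval as a fraction of the capacitor time constant.
    t_image = 240e-9       # time to train one image, 240 ns (from caption)
    n_images = 8000        # images trained between weight transfers (from caption)
    tau_cap = 5.16e-3      # 3T1C capacitor time constant, 5.16 ms (from caption)

    transfer_interval = n_images * t_image          # 8,000 x 240 ns = 1.92 ms
    fraction_of_tau = transfer_interval / tau_cap   # ~0.37 of the time constant

    # Panel c: first-pass estimate of the gain factor, F = DR_g / sigma.
    # DR_g and sigma below are hypothetical placeholder values for illustration only.
    DR_g = 3.0     # assumed dynamic range of g (arbitrary conductance units)
    sigma = 1.0    # assumed PCM programming-error standard deviation (same units)
    F_estimate = DR_g / sigma

    print(f"transfer interval = {transfer_interval * 1e3:.2f} ms "
          f"({fraction_of_tau:.0%} of the capacitor time constant)")
    print(f"initial gain-factor estimate F = {F_estimate:.1f}")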
