Fig. 4: Experimental results of RMAB in the four-content scenario.

a Evolution of λ with respect to randomly generated S0 (uniform distribution) and S1 (weighted distribution). b S1-dependent reward distributions of the four contents. c Evolution of the Whittle indices of the four contents. During the in-situ training phase, the Whittle index increases with λ, producing the positive slopes with respect to S1 (in the inference-like phase) that enhance the total discounted reward (TDR). d Selection matrices and e cumulative selections for the (4, 1) scenario. The brown and green squares indicate selected and unselected arms, respectively. f Selection matrices and g cumulative selections for the (4, 2) scenario. The in-situ training phase enables balanced selection among the four contents in both the (4, 1) and (4, 2) scenarios, enhancing the TDR. h Evolution of the TDR for the experiment, the model, and the software in the (4, 1) and (4, 2) scenarios. The noise-free model fails to enhance the TDR, indicating that intrinsic hardware noise is necessary for successful training of the neural Whittle index network. The experimental TDR evolution is also comparable to the software result.
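To make the quantities in panels c-h concrete, the following minimal sketch illustrates the generic Whittle-index selection rule and the TDR metric. It is an assumption-laden toy (hypothetical index values, a hypothetical discount factor `gamma`), not the paper's trained neural Whittle index network: at each step the k arms with the largest indices are activated (k = 1 for the (4, 1) scenario, k = 2 for (4, 2)), and the TDR accumulates rewards weighted by powers of the discount factor.

```python
import numpy as np

def select_arms(whittle_indices, k):
    # Whittle-index policy: activate the k arms with the largest
    # indices (k = 1 for the (4, 1) scenario, k = 2 for (4, 2)).
    return np.argsort(whittle_indices)[-k:]

def total_discounted_reward(rewards, gamma=0.99):
    # TDR = sum_t gamma^t * r_t, with a hypothetical discount gamma.
    t = np.arange(len(rewards))
    return float(np.sum(gamma ** t * np.asarray(rewards, dtype=float)))

# Toy usage with made-up index values for four contents:
indices = np.array([0.1, 0.5, 0.3, 0.9])
chosen = select_arms(indices, k=2)  # arms 1 and 3 have the largest indices
tdr = total_discounted_reward([1, 1, 1], gamma=0.5)  # 1 + 0.5 + 0.25
```

In this picture, the "selection matrices" of panels d and f simply record which arms `select_arms` activates at each time step, and balanced cumulative selections (panels e and g) indicate that no single content's index dominates after training.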