Fig. 1: Proof-of-concept restless multi-armed bandit (RMAB)-inspired recommendation system.

Users navigate from the main web page to specific category pages. The page a user is on is treated as either the initial state (S0), which is used to estimate the cost (λ), or the current state (S1), which is used to compute the Whittle index. The state S (S0 or S1) is the input to each core of the nonvolatile memory (NVM) crossbar; each core is programmed with the neural-network weight values of one arm (A to D), where the arms are the content items. Using the initially estimated λ to compute the Whittle index, the agent selects the arm with the highest Whittle index at each S1 to maximize the total discounted rewards (TDR) for recommendation.
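
The selection step described in the caption can be illustrated with a minimal software sketch. It is not the hardware implementation: each crossbar core is modeled as an ideal dot product between a one-hot state vector and a single weight vector per arm, the weight values are hypothetical, and the λ estimation from S0 is not modeled. The names `one_hot`, `select_arm`, and `arm_weights` are illustrative assumptions.

```python
import numpy as np

def one_hot(state, n_states):
    """Encode a discrete page/state index as a one-hot input vector."""
    v = np.zeros(n_states)
    v[state] = 1.0
    return v

def select_arm(arm_weights, state, n_states):
    """Return the arm with the highest index for the given state.

    Each arm's core is modeled as an ideal dot product (assumption):
    index = weights . one_hot(state).
    """
    s = one_hot(state, n_states)
    indices = {arm: float(w @ s) for arm, w in arm_weights.items()}
    best = max(indices, key=indices.get)
    return best, indices

# Hypothetical per-arm weights, one entry per state (not from the source).
arm_weights = {
    "A": np.array([0.1, 0.7, 0.2]),
    "B": np.array([0.4, 0.3, 0.9]),
    "C": np.array([0.8, 0.2, 0.1]),
    "D": np.array([0.3, 0.5, 0.6]),
}

best, indices = select_arm(arm_weights, state=1, n_states=3)
print(best, indices)
```

With these made-up weights, the selected arm changes with the current state, which mirrors the caption's point that the arm is chosen at each S1 rather than fixed in advance.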