Imagine walking into a casino and being given one hour to play a large group of slot machines. However, unlike regular slot machines, these ones will produce a fixed amount of coins per pull. The trick is to figure out which slot machine gives you the most reward, and then exploit it. Sounds simple, right? There are two caveats: First, the machines will only give you the coins after some unknown delay. Second, when you receive the coins, you will not know the identity of the machine that produced the reward.

I wrote a fairly simple program in java to do this experiment by making use of eligibility traces, and found that it worked far better than I had expected was possible. Given a large number of possible actions (say n = 30), the system was able to properly assign credit to actions such that for all actions in A, if the real reward R(A

_{n}) < R(A

_{m}) then the expectation of reward E(A

_{n}) < E(A

_{m}) as well - so the actions would be ranked correctly in terms of value. This ranking could be learned perfectly, even when the delay between action and reward was 100 time-steps or more. Granted, a large number of observations are required (in the millions), but it is still impresses me that such a thing can be done at all.

In order to understand how and why this worked, I attempted to characterize the result mathematically, and was able to do so in a way that very closely matched the empirical results:

Where:

r := eligibility trace decay rate (such that 0 < r < 1)

d := (fixed) delay between causal action and reward

E[V_{all}] := Expectation of reward averaged over all actions

E[V_{cause}] := Expectation of reward due to causal action

This rather simple equation appears to nearly perfectly describe the observed results in the limit of a great number of observations.

The main technique used is treating the “volume” of eligibility as the sum of an infinite geometric series, which is where the 1 / (1 – r) terms come from. Basically, we are assuming a randomized action policy, and therefore, an average over all possible rewards weighted by their relative likelihoods. These rewards will be (incorrectly) applied to our “causal” action, weighted by the eligibility. The one difference is the time-step at which the “actual” reward is emitted; this will be weighted by the eligibility at that delay, which is represented by the r

^{d}terms.

Additionally, I tried some variations on the basic experiment.

Successful circumstances:

- Various delays, uniform across bandits (up to about 500 time-steps)
- Online randomized delays (varies from observation to observation)
- Heterogeneous delays across bandits
- Skewed action frequencies (some actions taken far more frequently than others)
- Online Gaussian noise added to rewards (up to variance of 10)
- Various numbers of bandits (up to about 30)
- Non-random sequences of action (e.g. 1, 2, …, n–1, n, 1, 2, …)

Unsuccessful circumstances:

- Replacing traces