Tuesday, May 02, 2006

Limits Of Reinforcement Learning In Fully Non-Markov Environments

I set up a very simple experiment to highlight what I believe to be the major deficiencies in the Reinforcement Learning paradigm (as opposed to a direct search of policy space, as in Evolutionary Algorithms).

Basically, Reinforcement Learning is a method for converging on an optimal policy of behavior that maximizes the reward of an agent acting within a given environment. Generally this is done with a lookup table containing a set of state-action pairs and associated probabilities; the probabilities are collectively referred to as the 'policy'. The policy is updated over time in a semi-greedy fashion, but with enough randomness to ensure a good deal of exploration of the space of possible behaviors.
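To make this concrete, here is a minimal sketch of such a tabular stochastic policy for the memory-bit task described below. The update rule is a simplified stand-in for a semi-greedy scheme, not the exact one from my experiments: reward nudges probability toward the action taken, while renormalization keeps some probability on the alternatives so exploration continues.

```python
import random

STATES = [0, 1]    # the single memory bit
ACTIONS = [0, 1]   # the value to write into the memory bit

# policy[state][action] = probability of taking `action` in `state`
policy = {s: {a: 0.5 for a in ACTIONS} for s in STATES}

def choose_action(state):
    # sample an action according to the current probabilities
    return 0 if random.random() < policy[state][0] else 1

def reinforce(state, action, reward, lr=0.1):
    # nudge the taken action's probability by the reward signal, clamp it
    # away from 0 and 1, then renormalize so the probabilities sum to 1
    p = policy[state][action] + lr * reward
    policy[state][action] = min(max(p, 0.01), 0.99)
    total = sum(policy[state].values())
    for a in ACTIONS:
        policy[state][a] /= total
```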

The problem with Reinforcement Learning is that it cannot effectively handle fully non-Markov environments. To illustrate this, I set up a simple task: temporal parity. Basically, I feed the program a string of 1s and 0s, one bit at a time, and the program is allowed to have a single bit that acts as a form of contextual memory. At the end of the sequence I examine the state of the memory bit and give reward if it corresponds to the parity of the sequence that was fed to it. This is a very simple task; all it requires is the following logic:

   if (bCurrentInput == 1)
   {
      //toggle the memory bit
      if (bMemoryState == 0)
         bMemoryState = 1;
      else
         bMemoryState = 0;
   }
   //else: do nothing

So this is a pretty trivial function. However, I was unable to get it to learn even this simple task when the input strings were longer than 3 bits. (Note: it is trivial to find a solution to this by doing a direct search of policy space using Evolutionary Algorithms, where the string can be of arbitrary length.)
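The direct-search claim is easy to see: a deterministic policy for this task is just a 4-entry table mapping (input bit, memory bit) to a new memory bit, so there are only 2^4 = 16 possible policies. The sketch below enumerates all of them (the degenerate case of a direct policy-space search) and immediately finds the table that computes parity for strings of any length:

```python
from itertools import product

def run_policy(table, bits):
    # feed the string through one bit at a time, updating the memory bit
    memory = 0
    for b in bits:
        memory = table[(b, memory)]
    return memory

def fitness(table, length=8):
    # score the policy against every possible input string of this length
    correct = 0
    for bits in product((0, 1), repeat=length):
        if run_policy(table, bits) == sum(bits) % 2:
            correct += 1
    return correct / 2 ** length

keys = [(i, m) for i in (0, 1) for m in (0, 1)]
tables = [dict(zip(keys, outs)) for outs in product((0, 1), repeat=4)]
best = max(tables, key=fitness)
# the winner is the XOR table: the memory bit flips exactly when the input is 1
```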

Why is standard Reinforcement Learning so bad at this task? The reason is simple. In its untrained state, the program has approximately a 50% chance of getting the right answer due purely to dumb luck. And it is rewarded for every correct answer, which means it may perform a sequence of completely incorrect actions and still be rewarded, thus further reinforcing those actions. In fact, if it makes even a single incorrect action during the course of processing a string, the end result can be correct no more than 50% of the time.

Consider the following case. Let's say that it was performing the correct actions (the logic depicted above) 75% of the time. So it's mostly there! But now let's say that the length of the string being fed to it is 10. That means it has only a 0.75^10 (about 0.056) probability of giving the correct output for the correct reasons. It is easy to extrapolate and see how long sequences of correct actions become nearly impossible to perform if there is much noise or randomness in the action policy.
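Extending the arithmetic: if each individual action is correct with probability p, the chance of executing an entire length-n sequence of actions correctly is p^n, which collapses geometrically with n.

```python
# chance that every action in a length-n sequence is correct,
# given per-action accuracy p
def chance_of_perfect_run(p, n):
    return p ** n

for n in (3, 5, 10, 20):
    print(n, round(chance_of_perfect_run(0.75, n), 4))
```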

Later I tried making an additional change to the learning mechanism: I scaled the reward by the likelihood of the sequence of actions taken. In other words, if the program was acting randomly, or took a very unlikely action that was improbable given its current policy, its reward was reduced. This helped a little bit, allowing it to learn the task with strings up to length 4 rather than 3.
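An illustrative sketch of this reward-scaling change (the function and names here are mine, not the exact mechanism from the experiment): the episode's reward is multiplied by the probability the current policy assigned to the exact sequence of actions taken, so a correct answer reached through unlikely, noisy actions is reinforced far less than one reached through confident actions.

```python
def scaled_reward(base_reward, action_probs):
    """action_probs: the policy's probability for each action actually taken."""
    likelihood = 1.0
    for p in action_probs:
        likelihood *= p
    # discount the reward by how probable the whole action sequence was
    return base_reward * likelihood
```

For example, three near-deterministic actions (p = 0.9 each) yield a scaled reward of 0.729, while three coin-flip actions (p = 0.5 each) yield only 0.125.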

This leads me to the following principle:

When performing tasks that require the precise execution of a sequence of actions, one must strictly avoid randomness/variation in action policies during a period of time in which a program is being evaluated/rewarded.

I did some additional research using the keywords 'non markov reinforcement learning' and found a few papers that try to address these issues, but the tasks they tested their programs against were only partially non-Markov, generally maze-navigation tasks. For this class of problems my above-stated principle does not necessarily hold; it may be very useful to have semi-random behaviors while performing exploratory actions. However, I believe that behavioral noise makes it nearly impossible to discover sequences of actions requiring precise execution that are much longer than 3 steps.

Surprisingly Simple Solutions

I revisited my Non-Markov Tic Tac Toe experiment (see previous post), but this time using what I thought to be an impossibly simple architecture: only 2 nodes in the hidden layer (note: I am using sinusoidal rather than sigmoid activation functions). Amazingly, I observed it reaching a 43% level of fitness (64% was the highest I had observed before, although I have been unable to reach that level of fitness on subsequent runs), and it found this solution rather rapidly compared to previous tests.
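For a sense of scale, a network with 9 board inputs, 2 sinusoidal hidden nodes, and 9 outputs (the wiring here is my assumption about the Tic Tac Toe setup, not necessarily the exact original architecture) has only 2*9 + 2 + 9*2 = 38 parameters:

```python
import math

def forward(board, w_in, b_in, w_out):
    # board: 9 inputs; w_in: 2x9 input weights; b_in: 2 hidden biases;
    # w_out: 9x2 output weights. Hidden nodes use sin() as the activation.
    hidden = [math.sin(sum(w * x for w, x in zip(w_in[h], board)) + b_in[h])
              for h in range(2)]
    return [sum(w * h for w, h in zip(w_out[o], hidden)) for o in range(9)]
```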

This astonished me because I would not have thought that such complex behavior could be generated by such a small set of parameters. This implies to me that the billions of neurons in the human brain are not really necessary for creating complex behaviors; I suspect that instead they are primarily used for statistical feature detection and also for memory formation and retrieval. This may make me sound a bit like Wolfram, although I think he is still oversimplifying human intelligence.

This also suggests that humans do not possess much in the way of intuitions regarding algorithms created by the process of evolution (again, hints of Wolfram). This sentiment is echoed by Danny Hillis in his latest book, where he talks about the sorting algorithms that his Connection Machine discovered, and his inability to understand them analytically. To me this is a hopeful state of affairs, because it may mean that many things that we currently consider to be 'hard' problems may not be hard at all, and may in fact possess solutions that are rather easy for evolutionary methods to discover.