Basically Reinforcement Learning is a method for converging on an optimal policy of behavior that maximizes reward of an agent acting within a given environment. Generally this is done with a lookup table that contains a set of state-action pairs and associated probabilities; the probabilities being referred to as the 'policy'. The 'policy' is updated over time in a semi-greedy fashion, but with enough randomness to ensure a good deal of exploration in the space of possible behaviors.
The problem with Reinforcement Learning is that it cannot effectively handle fully non-Markov environments. To illustrate this, I set up a simple task: temporal parity. Basically I feed the program a string of 1s and 0s, one bit at a time, and the program is allowed to have a single bit that acts as a form of contextual memory. At the end of the sequence I examine the state of the memory bit and give reward if it corresponds to the parity of the sequence that has been fed to it. This is a very simple task, all it requires is the following logic:
if (bCurrentInput == 1)
if (bMemoryState == 0)
bMemoryState = 1;
bMemoryState = 0;
So this is a pretty trivial function. However, I was unable to get it to learn even this simple task when the input strings where of greater than length 3. (Note: It is trivial to find a solution to this by doing a direct search of policy space using Evolutionary Algorithms where the string can be of arbitrary length).
Why is standard Reinforcement Learning so bad at this task? The reason is simple. In it's untrained state, the program has approximately a 50% chance of getting the right answer due purely to dumb luck. And it is rewarded for every correct answer, so that means that it may perform a sequence of completely incorrect actions and still be rewarded, thus further reinforcing those actions. In fact, if it makes a single incorrect action during the course of processing a string, then the end result can be correct no greater than 50% of the time.
Consider the following case. Let's say that it was performing the correct actions (the logic depicted above) 75% of the time. So it's mostly there! But now let's say that the length of the string being fed to it is 10. That means that it only has 0.7510 (about 0.056) of giving the correct output for the correct reasons. It is easy to extrapolate and see how long sequences of correct actions would be nearly impossible to perform if there is much noise or randomness in the action policy.
Later I tried making an additional change to the learning mechanism: I scaled the reward by the likelihood of the sequence of actions taken. In other words, if the program was acting randomly, or took a very unlikely action that was improbable given its current policy, its reward was reduced. This helped a little bit, allowing it to learn the task with strings up to length 4 rather than 3.
This leads me to the following principle:
When performing tasks that require the precise execution of a sequence of actions, one must strictly avoid randomness/variation in action policies during a period of time in which a program is being evaluated/rewarded.
I did some additional research using the keywords 'non markov reinforcement learning' and found a few papers that try to address these issues, but the tasks that they tested their programs against were only partially non-markov, generally maze navigation tasks. For this class of problems my above stated principle does not necessarily hold; it may be very useful to have semi-random behaviors while doing exploratory-like actions. However, I believe that behavioral noise makes it nearly impossible to discover sequences of action that require precision in their execution that are of length much greater than 3.