I've been doing some further experiments with pole balancing (also called the inverted pendulum problem).
In my previous experiment, I partitioned the four-dimensional state space into a set of discrete states (typically several hundred), then used observed transitions, conditioned on actions, to estimate transition probabilities and thereby produce a Markov decision process (MDP). I then had it learn an optimal policy over the MDP using dynamic programming. This approach was fairly effective: it was usually able to learn to balance a single pole indefinitely after roughly 10 to 15 balancing attempts (plus an additional 10 episodes at the beginning under a completely random policy, to gather a bit of data to learn from). Between balancing attempts, it relearned the model from scratch.
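The model-learning and planning steps can be sketched roughly as follows. This is a minimal illustration rather than my actual code: the discretization is assumed to have already happened, and the reward matrix `R` and discount `gamma` are placeholder choices.

```python
import numpy as np

def estimate_mdp(transitions, n_states, n_actions):
    """Count observed (state, action, next_state) triples and normalize
    them into transition probabilities P[s, a, s']."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unobserved (s, a) pairs fall back to a uniform distribution.
    P = np.divide(counts, totals,
                  out=np.full_like(counts, 1.0 / n_states),
                  where=totals > 0)
    return P

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Dynamic programming over the learned MDP: returns the optimal
    value function and the greedy policy with respect to it."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```

Relearning the model from scratch between attempts amounts to rerunning `estimate_mdp` on the full transition log and then replanning with `value_iteration`.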
So I next decided to try it out on a more challenging version of the problem - balancing with an additional pole. This version proved too much for it, and it failed to learn the task even after several hundred balancing attempts. My guess is that the increase in the dimensionality of the state space from four to six was the key cause of the failure. As the dimensionality grows, more and more discrete states are needed to represent the space properly, and the more states used, the more observations are needed to populate the probabilistic model of transitions between them. In other words, this particular approach simply will not scale to larger problems.
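To make the scaling problem concrete: with a grid-style partition, the number of discrete states grows exponentially in the number of dimensions. The figure of 5 bins per dimension below is purely illustrative, not the partitioning I actually used.

```python
# With k bins per dimension, a grid partition of a d-dimensional
# state space yields k**d discrete states, and the transition model
# needs enough observations to cover every reachable (s, a) pair.
def n_states(bins_per_dim, dims):
    return bins_per_dim ** dims

single = n_states(5, 4)  # single pole, 4 dimensions: 625 states
double = n_states(5, 6)  # double pole, 6 dimensions: 15625 states
```

Going from four to six dimensions at the same resolution multiplies the state count by a factor of 25, with a corresponding blow-up in the data needed to estimate transitions.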
So I decided to try a different approach. I borrowed an idea I had read about in Bio-Inspired Artificial Intelligence, described below:
Yet another method consists in the coevolution of a robot's controller and of the simulator parameters that are most likely to differ from the real world and that may affect the quality of the transfer (Bongard and Lipson 2005). This method, also known as the estimation-exploration algorithm, consists of coevolving three populations, one encoding possible simulator parameters, one encoding robot controllers used only to find a good simulator, and the third also encoding robot controllers, but these controllers are meant to allow the robot to fulfill its task, such as forward locomotion. Coevolution happens in several passes through a three-stage process. In stage 1, a randomly generated population of controllers is evolved in the default simulator and the best individual is executed on the real robot while the time series of sensory values is recorded. In stage 2, the population of simulators is evolved for reducing the difference between the time series recorded on the real robot and the time series obtained by running the same controller within the simulator. The best evolved simulator is then used for stage 1 where a new randomly generated population of controllers is evolved and the best individual is tested on the real robot to generate the time series for another pass of simulator evolution. This two-stage coevolution is repeated several times until the error between simulated and real robot behavior is as small as possible. Then, the best simulator found so far is used by a third population (the third stage), in which robot controllers are again evolved, this time to elicit some desired behavior. The best controller from this population is then executed on the physical robot, and, if the evolved simulator was accurate, the real robot will exhibit the same behavior as that seen in the evolved simulator.
It was shown that 16 passes of the three-stage process were sufficient to evolve a good control system that could be transferred to an articulated robot. In this case, the real robot was used to test only 17 controllers: the 16 controllers used to find an accurate simulator, and the 17th evolved to enable the robot to perform some desired task (Bongard et al. 2006). The authors also showed that this approach could be used to adapt the control system to cope with a damaged robot (for example, loss of a leg, malfunctioning of a sensor, etc.). (pg 510)
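The three-stage process described above can be sketched as the following loop. The helpers `evolve_controllers`, `evolve_simulators`, and `run` are hypothetical stand-ins for the evolutionary algorithm and the simulated/physical trials; they are passed in as parameters here to keep the sketch self-contained.

```python
def estimation_exploration(default_sim, real_robot,
                           evolve_controllers, evolve_simulators, run,
                           n_passes=16):
    """Sketch of the estimation-exploration coevolution loop.
    Returns the final task controller and its real-robot behavior."""
    sim = default_sim
    trials = []  # (controller, sensor time series recorded on the real robot)
    for _ in range(n_passes):
        # Stage 1: evolve controllers in the current best simulator,
        # then execute the champion on the real robot and record sensors.
        controller = evolve_controllers(sim, objective="informative")
        trials.append((controller, run(real_robot, controller)))
        # Stage 2: evolve simulator parameters to minimize the mismatch
        # between the recorded and simulated time series.
        sim = evolve_simulators(trials)
    # Stage 3: evolve task controllers in the refined simulator,
    # then transfer the best one to the real robot.
    best = evolve_controllers(sim, objective="task")
    return best, run(real_robot, best)
```

With `n_passes=16`, the real robot is queried exactly 17 times - 16 informative trials plus the final task controller - matching the count reported above.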
Here is a link to the paper An exploration-estimation algorithm for synthesis and analysis of engineering systems using minimal physical testing (2004), which describes the concept in much greater detail.
This is essentially what I did, and the results turned out to be far better than those of my earlier dynamic programming approach.
It is my strong conviction that it is critical to focus on learning mechanisms that work effectively with a minimal number of actual interactions with the target environment ("physical testing"). For the double-pole balancing task, I could easily have performed neuroevolution directly within the environment itself (as I did here), without the bother of first inductively learning a simulation of it. However, there are many real-world situations in which interactions are expensive - sometimes in terms of money, but more often in terms of time.
There may be an analogy to be made with mental rehearsal, or daydreaming.