Saturday, November 20, 2010

Balancing with Dynamic Programming



I used dynamic programming to balance a simulated cart-pole. In this case, the task is learned after just 5 balancing attempts, although the system usually does not learn the task quite so quickly.

The blue box on the right displays the internal states of the system; its "brain", if you will.

For this particular experiment I didn't use any form of evolutionary algorithm or neural network (perhaps contrary to the title of my blog). Instead, I partitioned the state vectors into several hundred discrete states (745 in this case) and used the observed transitions between those states to build a Markov Decision Process. I then applied dynamic programming to that MDP to generate the control policy.
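Roughly, the approach looks like the following sketch. To be clear, this is not the exact code I ran: the function names, the tabular layout, and the choice of value iteration as the dynamic-programming step are just for illustration.

```python
import numpy as np

def build_mdp(transitions, n_states, n_actions):
    """Estimate transition probabilities and rewards from observed
    (state, action, reward, next_state) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (state, action) pairs are left with zero probability/reward.
    P = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    R = np.divide(reward_sum, totals[:, :, 0],
                  out=np.zeros_like(reward_sum), where=totals[:, :, 0] > 0)
    return P, R

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Dynamic programming on the estimated MDP: repeat the Bellman
    optimality update until the value function converges, then return
    the greedy policy."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)  # greedy action in each discrete state
    return policy, V
```

After each failed balancing attempt, the newly observed transitions can simply be added to the list and the MDP re-estimated, which is why only a handful of attempts are needed before the policy becomes good enough to balance.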

The video shows one of the runs that did a bit better than average. So far I have found that the problem is solved in an average of about 10 balancing attempts (the fastest I have observed is 2), so I think the video is still representative of the resulting behavior.

As an example, here is a run that takes a bit longer than the first:



Note that this is the fully-observable version of the cart-pole task; in other words, the controller has access not only to the cart position and pole angle, but also to the cart velocity and pole angular velocity. This makes it significantly easier to solve than the non-Markov version of the problem. I plan to attempt the non-Markov version in future experiments.
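For reference, discretizing that four-dimensional state might look something like the sketch below. The bin edges and counts here are made up for illustration; my actual partition had 745 discrete states.

```python
import numpy as np

# Illustrative bin edges for each of the four observed state variables.
CART_POS_BINS  = np.linspace(-2.4, 2.4, 5)    # cart position
CART_VEL_BINS  = np.linspace(-2.0, 2.0, 5)    # cart velocity
POLE_ANG_BINS  = np.linspace(-0.21, 0.21, 7)  # pole angle
POLE_AVEL_BINS = np.linspace(-2.0, 2.0, 5)    # pole angular velocity

def discretize(x, x_dot, theta, theta_dot):
    """Map the continuous 4-D state onto a single discrete state index."""
    idx = [np.digitize(x, CART_POS_BINS),
           np.digitize(x_dot, CART_VEL_BINS),
           np.digitize(theta, POLE_ANG_BINS),
           np.digitize(theta_dot, POLE_AVEL_BINS)]
    dims = [len(b) + 1 for b in (CART_POS_BINS, CART_VEL_BINS,
                                 POLE_ANG_BINS, POLE_AVEL_BINS)]
    return int(np.ravel_multi_index(idx, dims))
```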

Here are some additional videos:
Balancing with Dynamic Programming 3 (6 trials)
Balancing with Dynamic Programming 4 (11 trials)
Balancing with Dynamic Programming 5 (4 trials)
Balancing with Dynamic Programming 6 (7 trials)
Balancing with Dynamic Programming 7 (2 trials)