Thursday, December 09, 2010

Double Pole Balancing with Exploration-Estimation



I've been doing some further experiments with pole balancing (also called the inverted pendulum problem).

In my previous experiment, I had partitioned the four-dimensional state space into a set of discrete states (typically several hundred), and then used observations of transitions conditioned on actions to compute transition probabilities in order to produce a Markov decision process. I then had it learn an optimal policy over the MDP using dynamic programming. This approach seemed to be fairly effective; it was usually able to learn to balance a single pole indefinitely after roughly 10 to 15 balancing attempts (plus an additional 10 episodes at the beginning using a completely random policy in order to gather a bit of data to learn from). Between balancing attempts, it would relearn the model from scratch.

So I next decided to try it out on a more challenging version of the problem - balancing with an additional pole. This newer version proved too much for it: it failed to learn to solve the problem, even after being given several hundred balancing attempts in which to do so. My guess is that the increase in the dimensionality of the state space from four to six was the key cause of the failure. As the dimensionality of the state space grows, more and more discrete states are needed to represent it adequately, and the more states used, the more observations are required to populate the probabilistic model of the transitions between those states. In other words, this particular approach simply will not scale to larger problems.

So I decided to try a different approach. I borrowed an idea I had read about in Bio-Inspired Artificial Intelligence, described below:

Yet another method consists in the coevolution of a robot's controller and of the simulator parameters that are most likely to differ from the real world and that may affect the quality of the transfer (Bongard and Lipson 2005). This method, also known as the estimation-exploration algorithm, consists of coevolving three populations, one encoding possible simulator parameters, one encoding robot controllers used only to find a good simulator, and the third also encoding robot controllers, but these controllers are meant to allow the robot to fulfill its task, such as forward locomotion. Coevolution happens in several passes through a three-stage process. In stage 1, a randomly generated population of controllers is evolved in the default simulator and the best individual is executed on the real robot while the time series of sensory values is recorded. In stage 2, the population of simulators is evolved for reducing the difference between the time series recorded on the real robot and the time series obtained by running the same controller within the simulator. The best evolved simulator is then used for stage 1 where a new randomly generated population of controllers is evolved and the best individual is tested on the real robot to generate the time series for another pass of simulator evolution. This two-stage coevolution is repeated several times until the error between simulated and real robot behavior is as small as possible. Then, the best simulator found so far is used by a third population (the third stage), in which robot controllers are again evolved, this time to elicit some desired behavior. The best controller from this population is then executed on the physical robot, and, if the evolved simulator was accurate, the real robot will exhibit the same behavior as that seen in the evolved simulator.

It was shown that 16 passes of the three-stage process were sufficient to evolve a good control system that could be transferred to an articulated robot. In this case, the real robot was used to test only 17 controllers: the 16 controllers used to find an accurate simulator, and the 17th evolved to enable the robot to perform some desired task (Bongard et al. 2006). The authors also showed that this approach could be used to adapt the control system to cope with a damaged robot (for example, loss of a leg, malfunctioning of a sensor, etc.). (pg 510)

Here is a link to the paper An exploration-estimation algorithm for synthesis and analysis of engineering systems using minimal physical testing (2004), describing the concept in much greater detail.

This is essentially what I did, and the approach turned out to be far more effective than the dynamic programming method I had been using before.
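The structure of the loop is easy to sketch. Below is only a toy illustration, not my actual code: the "real" plant is a two-parameter linear system rather than the double-pole cart, and a crude random hill climber stands in for the evolutionary algorithm, but the exploration and estimation stages alternate in the same way.

import numpy as np

rng = np.random.default_rng(0)

# Toy "real" plant (hidden from the learner): x_{t+1} = a*x + b*u.
# In the actual experiment the plant is the double-pole cart and the evolve()
# calls below are a proper neuroevolution run; here everything is reduced to
# the simplest possible stand-ins so that the loop structure is clear.
TRUE_A, TRUE_B = 1.08, 0.35

def rollout(gain, a, b, steps=30):
    """Run the proportional controller u = -gain*x and return the trajectory."""
    x, traj = 1.0, []
    for _ in range(steps):
        x = a * x + b * (-gain * x)
        traj.append(x)
    return np.array(traj)

def evolve(score, x0, iters=200, sigma=0.3):
    """Stand-in for an evolutionary algorithm: random hill climbing."""
    best, best_score = x0, score(x0)
    for _ in range(iters):
        cand = best + rng.normal(0, sigma, size=np.shape(x0))
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best

tests = []                   # (controller, trajectory recorded on the real plant)
sim = np.array([1.0, 1.0])   # current estimate of the simulator parameters (a, b)

for _ in range(6):           # each pass costs exactly one real-plant trial
    # Exploration: find a controller that does well in the current simulator...
    gain = evolve(lambda g: -np.abs(rollout(g, *sim)).sum(), 0.0)
    # ...then run it once on the real plant and record what actually happened.
    tests.append((gain, rollout(gain, TRUE_A, TRUE_B)))
    # Estimation: fit the simulator parameters to every trajectory seen so far.
    sim = evolve(lambda p: -sum(np.abs(rollout(g, *p) - t).sum() for g, t in tests), sim)

print("estimated (a, b):", sim, "  true:", (TRUE_A, TRUE_B))

Every pass costs exactly one trial on the real plant; everything else happens inside the learned simulator.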

It is my strong conviction that it is critical to focus on learning mechanisms that can work effectively using a minimal number of actual interactions with the target environment ("physical testing"). For the double pole balancing task, I could very easily have performed neuroevolution directly within the environment itself (as I did here), without all the bother of having it first inductively learn a simulation. However, there are many real-world situations in which interactions are expensive, sometimes in terms of money, but more often in terms of time.

There may be an analogy to be made with mental rehearsal, or daydreaming.

Saturday, November 20, 2010

Balancing with Dynamic Programming



I used dynamic programming to balance a simulated cart-pole. In this case, the task is learned after just 5 balancing attempts, although the system usually does not learn quite so quickly.

The blue box on the right displays the internal states of the system; its "brain", if you will.

For this particular experiment I didn't use any form of evolutionary algorithm, nor a neural network (perhaps contrary to the title of my blog). Instead, I partitioned the continuous state space into several hundred discrete states (745 in this case), and used the observed transitions between these states to produce a Markov decision process. I then applied dynamic programming to that MDP to generate the control policy.
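For the curious, here is a rough sketch of that step. This is not my actual code: it assumes the continuous state has already been mapped to an integer index by the discretization, and that the observations arrive as (state, action, next state, reward) tuples.

import numpy as np

def learn_policy(transitions, n_states, n_actions=2, gamma=0.95, sweeps=200):
    # Count observed transitions, then normalize the counts into probabilities.
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    for s, a, s2, r in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
    totals = counts.sum(axis=2, keepdims=True)
    P = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    R = np.divide(rewards, totals[:, :, 0],
                  out=np.zeros_like(rewards), where=totals[:, :, 0] > 0)

    # Value iteration - the dynamic programming step. State-action pairs that
    # have never been observed simply keep a value of zero.
    V = np.zeros(n_states)
    for _ in range(sweeps):
        Q = R + gamma * (P @ V)   # Q[s, a] = R[s, a] + gamma * sum over s' of P[s, a, s'] * V[s']
        V = Q.max(axis=1)
    return Q.argmax(axis=1)       # greedy policy: one action index per discrete state

To act, the current cart-pole state vector is mapped to its discrete index, and the policy entry for that index gives the push direction.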

The video shows one of the runs that did a bit better than average. However, so far I have found that the problem is usually solved in about 10 balancing attempts on average (the fastest I have observed so far is 2), so I think the video is still representative of the resulting behavior.

As an example, here is a run that takes a bit longer than the first:



Note that this is the fully-observable version of the cart-pole task; in other words, the controller has access not only to the cart position and pole angle, but also to the cart velocity and pole angular velocity. This makes it significantly easier to solve than the non-Markov version of the problem. I plan to attempt the non-Markov version in future experiments.

Here are some additional videos:
Balancing with Dynamic Programming 3 (6 trials)
Balancing with Dynamic Programming 4 (11 trials)
Balancing with Dynamic Programming 5 (4 trials)
Balancing with Dynamic Programming 6 (7 trials)
Balancing with Dynamic Programming 7 (2 trials)

Friday, July 02, 2010

Discretized RNN state-space, Part 2


Graph created using the same setup described in my previous post, but this time with a larger number of hidden-state vectors (2500).

Thursday, June 24, 2010

Discretized RNN state-space



Here is a graph of the discretized state space of a recurrent neural network I trained on the embedded Reber grammar. Notice that the state space splits at the beginning in order to preserve information about the long-range transitions.
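For anyone curious how such a graph can be produced, the sketch below captures the general idea - a rough sketch of one way to do it, not necessarily the exact procedure behind the image: rnn_step (one step of the trained network) and strings (a sample of grammar strings) are stand-ins, and the visited hidden states are clustered with k-means to obtain the discrete states.

import numpy as np
from sklearn.cluster import KMeans

def discretize_state_space(rnn_step, h0, strings, n_clusters=100):
    # Record every hidden-state vector visited, plus which state followed which.
    states, edges = [], []
    for s in strings:
        h = h0
        for sym in s:
            h_next = rnn_step(h, sym)
            states.append(h)
            edges.append((len(states) - 1, len(states)))  # h -> h_next
            h = h_next
        states.append(h)

    # Cluster the visited hidden states into a finite set of discrete states.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(states))

    # Collapse the recorded transitions onto cluster labels: these are the edges
    # of the discrete transition graph, which can be rendered with Graphviz or similar.
    return {(int(labels[i]), int(labels[j])) for i, j in edges}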

Tuesday, March 30, 2010

Why I prefer Sinusoids to Sigmoids in RNNs

Recently I've been reading about Bayesian inference, and thinking about how it relates to Minimum Description Length and other similar concepts.

A key idea is that the ability to do inference over a small set of examples is, in general, a very difficult problem, and strongly dependent upon the prior probabilities of the hypothesized models matching up with the "true" model to some degree.

In the absence of any knowledge about the "true" model, things get more complicated. Some theories suggest that there should always be a bias towards simpler models, while others contradict this idea. In any case, the prior distribution over the space of hypotheses turns out to be critical to the process of inference.

Therefore, any way of gaining better intuitions about the underlying distribution of a given class of hypotheses/models is of particular interest to me.

In this post, I briefly explore some simple behavioral distributions of my own personal favorite class of models - discrete time recurrent neural networks (RNNs).



First, I randomly created a large number of extremely simple RNNs, each having the following architecture:


The link weights were chosen randomly from a normal distribution with zero mean and a variance of 4.

Each randomly generated RNN was given an "impulse" input of 1.0 at the first time step, and 0.0 thereafter, and run for 10 time steps. The output of the RNN was collected and discretized to binary strings of length 10. Thus, there were exactly 1024 (2^10) possible unique "behaviors" that could be displayed by an RNN.
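In rough outline, the experiment looks like the sketch below. This is a simplification, not the exact network shown in the figure: I assume two hidden neurons, no bias weights, and a sign threshold on the output sum (equivalent to thresholding a sigmoid output unit at 0.5).

import numpy as np
from collections import Counter

rng = np.random.default_rng()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def behavior(activation, steps=10, hidden=2):
    # One randomly weighted RNN, driven by a unit impulse and discretized
    # into a binary string of length `steps`.
    w_in = rng.normal(0, 2, hidden)               # input -> hidden (std 2, i.e. variance 4)
    w_rec = rng.normal(0, 2, (hidden, hidden))    # hidden -> hidden (recurrent)
    w_out = rng.normal(0, 2, hidden)              # hidden -> output
    h = np.zeros(hidden)
    bits = []
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0                # impulse input at the first time step
        h = activation(w_rec @ h + w_in * x)
        bits.append('1' if w_out @ h > 0 else '0')
    return ''.join(bits)

for name, act in (('sigmoid', sigmoid), ('sin', np.sin)):
    counts = Counter(behavior(act) for _ in range(50000))
    print(name, 'distinct behaviors:', len(counts), 'of 1024')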

Most behaviors turned out to be very simple - all zeroes or all ones - but there was a range of other behaviors as well. I decided to plot the behaviors as a histogram. Here are the results:


The original image (before scaling) is precisely 1024 by 500 pixels. Each pixel of the x-axis represents a different RNN "behavior", starting with 0000000000 at the left, and ending with 1111111111 at the right. (I considered ordering them by Gray code, but haven't implemented that just yet.) The height along the y-axis is proportional to the natural log of the frequency of that particular behavior.

The two most immediately apparent aspects of the histogram are its symmetry and sparsity.

The edges of the histogram represent the most frequent (and simplest) behaviors - constant output of zeroes (left), and constant output of ones (right).

The two peaks nearer the center of the image occur at locations #341 and #682 along the x-axis, which, when converted to binary strings, produce the sequences 0101010101 and 1010101010, respectively. With a little thought, it is not difficult to see why these particular behaviors are so common - they can be represented by the following, extremely simple finite state machine:


So it appears that these very simple RNNs are able to emulate some correspondingly simple finite state machines. But, as can be seen by all of the gaps in the histogram, there are a lot of behaviors that simply never occur. As an example, sequence #589 (1001001101) never happens.
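The mapping from an x-axis location to its behavior string is simply the 10-bit binary expansion of the index; for example:

for i in (341, 682, 589):
    print(i, format(i, '010b'))   # prints 0101010101, 1010101010 and 1001001101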



The really interesting thing I found is that these gaps in behavior are not an inherent limitation of simple RNNs, but rather a limitation of simple RNNs that use sigmoidal activation functions:


If instead, a sinusoidal activation function is used, we get very different results!



Here is another histogram of the exact same experimental setup, but this time using sinusoidal activation functions for the hidden nodes of the RNNs:


The difference is stunning. All 1024 binary sequences are produced!

Empirically, I have previously noticed that sinusoidal RNNs used for neuroevolution massively outperform sigmoidal ones on a wide range of tasks. I believe this can now be at least partially explained by the fact that RNNs with sinusoidal neurons appear to have a much broader and more evenly distributed coverage of possible behaviors.

In summary: RNNs using sinusoidal hidden neurons appear to produce a more uniform prior over the space of possible dynamical models, which in turn enhances their learning capabilities over a wide range of problems.

Sunday, March 14, 2010

Animated evolution of RNN state-space


The above animation depicts the state space of a discrete-time recurrent neural network (RNN) undergoing neuroevolution. Each new frame represents the discovery of an improvement over the previous best RNN.

The goal is to be able to classify the parity of a binary string, in this case, of length 10. The string is fed sequentially to the RNN, one bit per time step.

A few training examples:

0100110101 -> 1
1110001011 -> 0
0010000010 -> 0

The RNN I used for this particular experiment has a hidden layer of two neurons. This was done so that I could easily plot the state space in two dimensions. The x-axis represents the activation of the first neuron, and the y-axis the activation of the second. Each point within a given frame represents the 'state' of the RNN at the very end of processing a string. The number of points per frame corresponds to the number of training examples. The first few frames appear to have fewer points; this is just because many of the points overlap. The points have been colored red and blue to denote whether the target output should have been even or odd parity; they do not reflect the actual output of the RNN.
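For reference, a single frame can be produced along the lines of the sketch below. The weights here are just random placeholders (in the real runs they come from the current best individual of the neuroevolution), and I use tanh hidden units purely for illustration - not necessarily the activation function used in the experiment.

import numpy as np
import matplotlib.pyplot as plt

def final_state(bits, w_in, w_rec):
    # Feed the string to the two-neuron RNN one bit per time step and return
    # the hidden state at the very end.
    h = np.zeros(2)
    for b in bits:
        h = np.tanh(w_rec @ h + w_in * b)
    return h

rng = np.random.default_rng(1)
w_in, w_rec = rng.normal(0, 2, 2), rng.normal(0, 2, (2, 2))  # placeholder weights

strings = rng.integers(0, 2, size=(200, 10))     # 200 random length-10 training strings
points = np.array([final_state(s, w_in, w_rec) for s in strings])
parity = strings.sum(axis=1) % 2                 # target label, not the RNN's own output

plt.scatter(points[:, 0], points[:, 1],
            c=['red' if p else 'blue' for p in parity], s=10)
plt.xlabel('activation of hidden neuron 1')
plt.ylabel('activation of hidden neuron 2')
plt.show()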

As can be seen, over the course of evolution, the state space of the RNN rearranges itself so that the red and blue points become progressively more linearly separable. On the last frame, it has nearly achieved that goal (as shown below), and classification accuracy on unseen examples is roughly 99%.



For those who are interested, I created these visualizations using a combination of Processing and GIMP.

Monday, January 18, 2010

n-grams as finite state machines

I'm going through Foundations of Statistical Natural Language Processing and decided to create some visualizations of n-grams as finite state machines to improve my intuitive understanding of them.

For alphabet {0,1} and n = 3:



For alphabet {0,1} and n = 4:



For alphabet {0,1,2} and n = 3:



I find it fascinating that such a simple concept (the n-gram) can produce such intricate structures.
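These machines are easy to generate mechanically: the states are the (n-1)-grams, and reading one more symbol shifts the window to the next (n-1)-gram. Here is a small sketch that emits such a graph in Graphviz DOT form (not necessarily how the figures above were produced, just one way to reproduce them):

from itertools import product

def ngram_fsm_dot(alphabet, n):
    # States are the (n-1)-grams; each symbol read moves to the next (n-1)-gram.
    lines = ['digraph ngram {']
    for state in product(alphabet, repeat=n - 1):
        for sym in alphabet:
            nxt = state[1:] + (sym,)
            src, dst = ''.join(state), ''.join(nxt)
            lines.append(f'  "{src}" -> "{dst}" [label="{sym}"];')
    lines.append('}')
    return '\n'.join(lines)

print(ngram_fsm_dot(['0', '1'], 3))   # the {0,1}, n = 3 machine above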