Tuesday, March 30, 2010

Why I prefer Sinusoids to Sigmoids in RNNs

Recently I've been reading about Bayesian inference, and thinking about how it relates to Minimum Description Length and other similar concepts.

A key idea is that inference over a small set of examples is, in general, a very difficult problem, and it depends strongly on how well the prior probabilities of the hypothesized models match up with the "true" model.

In the absence of any knowledge about the "true" model, things get more complicated. Some theories suggest that there should always be a bias towards simpler models, while others contradict this idea. In any case, the prior distribution over the space of hypotheses turns out to be critical to the process of inference.

Therefore, any way of gaining better intuitions about the underlying distribution of a given class of hypotheses/models is of particular interest to me.

In this post, I briefly explore some simple behavioral distributions of my own personal favorite class of models - discrete time recurrent neural networks (RNNs).

First, I randomly generated a large number of extremely simple RNNs, each having the following architecture:

The link weights were drawn randomly from a normal distribution with zero mean and a variance of 4 (that is, a standard deviation of 2).

Each randomly generated RNN was given an "impulse" input of 1.0 at the first time step, and 0.0 thereafter, and run for 10 time steps. The output of the RNN was collected and discretized to binary strings of length 10. Thus, there were exactly 1024 (2^10) possible unique "behaviors" that could be displayed by an RNN.
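
The procedure above can be sketched in a few lines of Python. This is only an illustration under assumptions of mine, since the architecture diagram is not reproduced here: one input, `n_hidden` fully recurrent sigmoid hidden units (three in the usage line), one sigmoid output, no biases, a zero initial state, and a discretization threshold of 0.5. The names `random_rnn` and `impulse_response` are hypothetical.

```python
import math
import random

def sigmoid(x):
    # Logistic function written with tanh so large |x| cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def random_rnn(n_hidden, rng):
    # Zero mean, variance 4 -> standard deviation 2.
    draw = lambda: rng.gauss(0.0, 2.0)
    return {
        "w_in": [draw() for _ in range(n_hidden)],
        "w_rec": [[draw() for _ in range(n_hidden)] for _ in range(n_hidden)],
        "w_out": [draw() for _ in range(n_hidden)],
    }

def impulse_response(net, steps=10):
    n = len(net["w_in"])
    h = [0.0] * n  # assumed initial hidden state
    bits = []
    for t in range(steps):
        x = 1.0 if t == 0 else 0.0  # impulse: 1.0 at the first step only
        h = [sigmoid(net["w_in"][i] * x
                     + sum(net["w_rec"][i][j] * h[j] for j in range(n)))
             for i in range(n)]
        y = sigmoid(sum(net["w_out"][i] * h[i] for i in range(n)))
        bits.append("1" if y > 0.5 else "0")  # assumed threshold of 0.5
    return "".join(bits)

rng = random.Random(0)
behavior = impulse_response(random_rnn(3, rng))  # 3 hidden units is an assumption
print(behavior, int(behavior, 2))  # the bit string and its index in 0..1023
```

Reading the bit string as a binary number gives the behavior's index, which is how each behavior maps to a position on the x-axis of the histograms.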

Most behaviors turned out to be very simple - all zeroes or all ones - but there was a range of other behaviors as well. I decided to plot the behaviors as a histogram. Here are the results:

The original image (before scaling) is precisely 1024 by 500 pixels. Each pixel along the x-axis represents a different RNN "behavior", starting with 0000000000 at the left and ending with 1111111111 at the right. (I considered ordering them by Gray code, but haven't implemented that just yet.) The height of each bar on the y-axis is proportional to the natural log of the frequency of that particular behavior.
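
For readers who want to rebuild such a histogram, here is a minimal sketch of the bookkeeping. It is not the code used for the images. The helper names `log_heights` and `gray_code` are hypothetical, and marking never-seen behaviors with `None` (so they can be told apart from behaviors seen exactly once, whose log is 0) is my own choice. The Gray code helper uses the standard reflected binary code, which is one way the alternative ordering mentioned above could be done.

```python
import math
from collections import Counter

def log_heights(behaviors, n_bits=10):
    """Return ln(count) for each of the 2**n_bits behaviors, None if never seen."""
    counts = Counter(int(b, 2) for b in behaviors)
    return [math.log(counts[i]) if i in counts else None
            for i in range(2 ** n_bits)]

def gray_code(k):
    """Standard reflected binary Gray code of a non-negative integer."""
    return k ^ (k >> 1)

# Toy input: three all-zero runs and one all-one run.
heights = log_heights(["0000000000"] * 3 + ["1111111111"])

# Gray-code ordering: column k shows behavior gray_code(k), so that
# neighbouring columns differ in exactly one bit.
gray_columns = [gray_code(k) for k in range(1024)]
```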

The two most immediately apparent aspects of the histogram are its symmetry and sparsity.

The edges of the histogram represent the most frequent (and simplest) behaviors - constant output of zeroes (left), and constant output of ones (right).

The two peaks nearer the center of the image occur at locations #341 and #682 along the x-axis, which, when converted to binary strings, produce the sequences 0101010101 and 1010101010, respectively. With a little thought, it is not difficult to see why these particular behaviors are so common - they can be represented by the following, extremely simple finite state machine:

So it appears that these very simple RNNs are able to emulate some correspondingly simple finite state machines. But, as can be seen from all of the gaps in the histogram, there are a lot of behaviors that simply never occur. As an example, sequence #589 (1001001101) never happens.
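
The index-to-sequence conversion is easy to check, and the alternating sequences can be produced by a two-state machine that flips its state on every step. The sketch below assumes that this toggle machine is what the diagram showed. The names `index_to_bits` and `toggle_fsm` are hypothetical.

```python
def index_to_bits(i, n_bits=10):
    """Write a histogram x-position as a zero-padded binary string."""
    return format(i, "0{}b".format(n_bits))

def toggle_fsm(start, steps=10):
    """Two-state machine: emit the current state, then flip it."""
    state = start
    out = []
    for _ in range(steps):
        out.append(str(state))
        state = 1 - state
    return "".join(out)

print(index_to_bits(341))  # 0101010101
print(index_to_bits(682))  # 1010101010
print(index_to_bits(589))  # 1001001101
print(toggle_fsm(0) == index_to_bits(341))  # True
```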

The really interesting thing I found is that these gaps in behavior are not an inherent limitation of simple RNNs. Rather, they are a limitation of simple RNNs that use sigmoidal activation functions. If a sinusoidal activation function is used instead, we get very different results! Here is another histogram of the exact same experimental setup, but this time using sinusoids as the hidden nodes of the RNNs:

The difference is stunning. All 1024 binary sequences are produced!
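
A compact way to try this comparison yourself is to count how many distinct behaviors a batch of random networks produces under each hidden activation. The sketch below is self-contained and rests on several assumptions of mine: three hidden units, no biases, a zero initial state, and a sigmoid output thresholded at 0.5, which is the same as checking the sign of the output unit's weighted input. The function name `count_behaviors` and the sample size are hypothetical. The counts it prints depend on those choices, so it should not be read as a reproduction of the histograms.

```python
import math
import random

def logistic(x):
    # Sigmoid written with tanh so large |x| cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def count_behaviors(act, n_nets=2000, n_hidden=3, steps=10, seed=0):
    """Count distinct 10-bit impulse responses over n_nets random RNNs."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_nets):
        # Weights ~ N(0, variance 4), i.e. standard deviation 2.
        w_in = [rng.gauss(0.0, 2.0) for _ in range(n_hidden)]
        w_rec = [[rng.gauss(0.0, 2.0) for _ in range(n_hidden)]
                 for _ in range(n_hidden)]
        w_out = [rng.gauss(0.0, 2.0) for _ in range(n_hidden)]
        h = [0.0] * n_hidden
        bits = ""
        for t in range(steps):
            x = 1.0 if t == 0 else 0.0
            h = [act(w_in[i] * x + sum(w * hj for w, hj in zip(w_rec[i], h)))
                 for i in range(n_hidden)]
            # sigmoid(z) > 0.5 exactly when z > 0
            bits += "1" if sum(w * hi for w, hi in zip(w_out, h)) > 0 else "0"
        seen.add(bits)
    return len(seen)

print("sigmoid hidden units:", count_behaviors(logistic))
print("sinusoid hidden units:", count_behaviors(math.sin))
```

Because both calls use the same seed, the two runs draw identical weights, so any difference in the counts comes from the activation function alone.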

Empirically, I have previously noticed that sinusoidal RNNs used for neuroevolution massively outperform sigmoidal ones on a wide range of tasks. I believe that now this can be at least partially explained by the fact that RNNs with sinusoidal neurons appear to have a much broader and more evenly distributed coverage of possible behaviors.

In summary: RNNs using sinusoidal hidden neurons appear to produce a more uniform prior over the space of possible dynamical models, which in turn enhances their learning capabilities over a wide range of problems.

Sunday, March 14, 2010

Animated evolution of RNN state-space

The above animation depicts the state space of a discrete-time recurrent neural network (RNN) undergoing neuroevolution. Each new frame represents the discovery of an improvement over the previous best RNN.

The goal is to be able to classify the parity of a binary string, in this case, of length 10. The string is fed sequentially to the RNN, one bit per time step.
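
The target function itself is tiny. Here is a sketch that labels a bit string with its parity, using 1 for an odd number of ones and 0 for an even number, and generates random labelled strings. The names `parity` and `make_examples`, the seed, and the uniform sampling of strings are my own assumptions. The post does not say how the training set was drawn.

```python
import random

def parity(bits):
    """Return 1 if the string contains an odd number of '1's, else 0."""
    return bits.count("1") % 2

def make_examples(n, length=10, seed=0):
    """Generate n random (bit string, parity label) pairs."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        s = "".join(rng.choice("01") for _ in range(length))
        examples.append((s, parity(s)))
    return examples

print(parity("0100110101"))  # 1 (five ones, so odd)
for s, label in make_examples(3):
    print(s, "->", label)
```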

A few training examples:
`0100110101 -> 1`
`1110001011 -> 0`
`0010000010 -> 0`

The RNN I used for this particular experiment has a hidden layer of two neurons. This was done so that I could easily plot the state space in two dimensions. The x-axis represents the activation of the first neuron, the y-axis is the activation of the second. Each point within a given frame represents the 'state' of the RNN at the very end of processing a string. The number of points per frame corresponds to the number of training examples. The first few frames appear to have fewer points; this is just overlap. The points have been colored red and blue to denote whether the target output should have been even or odd parity; they are not representative of the actual output of the RNN.
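
To make the plotted quantity concrete, here is a sketch of how one point in a frame could be computed: feed the string one bit per time step and keep the two hidden activations after the last bit. The weight layout, the biases, the zero initial state, and the use of `math.sin` as the activation are all assumptions on my part, and the weights are random placeholders rather than evolved ones. `final_state` is a hypothetical name.

```python
import math
import random

def final_state(bits, w_in, w_rec, bias, act=math.sin):
    """Return (h1, h2) after the RNN has consumed every bit of the string."""
    h = [0.0, 0.0]  # assumed initial hidden state
    for ch in bits:
        x = float(ch)  # one bit per time step
        h = [act(w_in[i] * x
                 + w_rec[i][0] * h[0]
                 + w_rec[i][1] * h[1]
                 + bias[i])
             for i in range(2)]
    return (h[0], h[1])

# Placeholder weights; in the experiment these would come from neuroevolution.
rng = random.Random(0)
w_in = [rng.gauss(0.0, 1.0) for _ in range(2)]
w_rec = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)]
bias = [rng.gauss(0.0, 1.0) for _ in range(2)]

points = []
for s in ["0100110101", "0000000000", "1111111110"]:
    x, y = final_state(s, w_in, w_rec, bias)
    target = s.count("1") % 2  # decides the point's color, not the RNN output
    points.append((x, y, target))
```

Each tuple in `points` is one dot in a frame: its position comes from the network, and its color comes only from the target parity.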

As can be seen, over the course of evolution, the state space of the RNN rearranges itself so that the red and blue points become progressively more linearly separable. On the last frame, it has nearly achieved that goal (as shown below), and classification accuracy on unseen examples is roughly 99%. For those who are interested, I created these visualizations using a combination of Processing and GIMP.