## Thursday, October 19, 2006

### Simple Solutions Revisited

I have been doing several experiments recently, attempting to find out how much behavior I can squeeze out of a recurrent neural network possessing only a single node in the hidden layer. Strangely enough, the answer appears to be: A LOT.

Note that I am using a sinusoidal activation function as opposed to the standard monotonic sigmoidal activation function, which gives the RNN a much greater information storage capacity.

RANDOM SEQUENCE GENERATION

The first task I had it perform was to simply generate a prespecified sequence of binary outputs, given no form of input. In order to perform this task, it needs to rely on its contextual memory in order to know "where" in the sequence it is. A typical target output sequence would look something like this:
0101110101001010110011101

Interestingly enough, given only a single neuron in the hidden layer as well as a single hidden neuron in the context layer, I found it is able to learn sequences of up to 25 in length perfectly.

NON-MARKOV TIC TAC TOE

I revisited an older experiment of mine. In it, the program is evolved to play tic tac toe against a hand-coded opponent of moderate skill (it cannot be beaten without being forked). It is also limited to only being capable of perceiving a single cell on the 3 by 3 tic tac toe grid at a time. In order to see more, it must move around from cell to cell, and it must create an internal representation of what it has already seen. In order to make a move, it must move to a legal empty cell, and output a 'halt' instruction (which is detected by the activation of a particular neuron in its output array). The fitness of the program is calculated as the percentage of games won. No credit is given for draws, making this a somewhat difficult task due to the discrete nature of the feedback.

Again, given only a single node in the hidden layer and a single node in the context layer, I have so far observed it capable of winning 48% of the games played. Keep in mind that even a 'perfect' hand-coded solution would not win 100% of the games, as by chance the opponent would sometimes play a series of moves that would, at best, only allow for a draw.

The best behavior I had previously seen on this task was 64% wins, however, that was with many more hidden nodes, and after many more runs of evolution.

The impressive thing, really, is that it is able to do anything at all with such scarce computational resources! My intuitions continue to be violated. I would not have believed this to be possible before actually seeing it working. And it makes me wonder why, if so much can be done with just a handful of adjustable parameters, do we need trillions of synapses in the human brain? My strong suspicion is that this complexity is not required to achieve the generation of complex behavior, but rather it is primarily needed for storing massive amounts of memory - for encoding the experiences accumulated during a human lifetime.

## Thursday, July 06, 2006

### Train of Thought

I’ve been thinking about a GUT (Grand Unifying Theory) of AI, about what form it might take if such a thing were to be found, and I've gone back over some ideas I've had in the past. Really all I've done so far is come up with arguments against certain approaches. For example, I argued that although one could potentially do analysis based upon structure, behavior, or fitness, both structural and behavioral examination run into seemingly unresolvable problems with only a little bit of thought. The main argument is that examining systems on a low level of detail is bound to destroy any useful ‘meaning’. For example this would be like comparing two stories at the level of letters and punctuation. It loses the important meaning. In other words by doing this we are in no way analyzing the stories in a way at all related to the way they will eventually be analyzed when they are assigned a fitness. That leaves us with just fitness to analyze. I also made the observation (obvious) that two solutions of equal fitness could not be guaranteed to exhibit the same behavior, and even less strong would be their tendency to possess the same structure. There are a number of other metrics we could look at as well, for example, the ‘complexity’ of the behavior. But how are we to recognize ‘complexity' in the first place? And is it necessarily a good thing? For example, Occam’s razor would suggest that we should wish to try simple models first, thus minimizing complexity and then allowing it to gradually increase. However, models of the brain (Neural Darwinism) suggests that the opposite approach is taken by the living brain; it starts with more connections and neurons than necessary and then prunes them away to generate the correct behavior. However, this suggests an architecture so complex that the correct behavior was already there just waiting for discovery. Neural network algorithms using both growth and pruning have been shown (claimed) to be more effective than static architectures. However, there are other ways of getting at the simplest model as well. One is to add noise (as I have done), and another is to combine multiple networks to cancel out noise - these are known as ‘ensemble’ methods, Including mixture-of-experts, boosting, and bagging. Other algorithms allow direct evolution of the architecture, such as NEAT and Genetic Programming, although both of these appear to suffer from ‘bloat’. But, given the ‘junk DNA’ found in nature, this may be an unavoidable aspect of complexity. Rather than increasing the size of a given model, some suggest building directly on top of it with fresh material, such as Rodney Brooks' subsumption architecture, Cascade Correlation Algorithm, and Schmidhuber’s Probabilistic Incremental Program Evolution with Filters. Unfortunately the last few mentioned require a model, i.e. are ‘supervised’ learning methods, and are therefore limited to a vastly smaller domain of possible problems. So, back to the issue of complexity, what method gives us an ‘optimal' level of complexity, and does such a thing even exist? We can look at Wolfram and Langton's definition of complexity in Cellular Automata - the 'edge of chaos' which appears to be a predominant theme in many areas of A-Life. So might think that it would be useful to hold a system near the edge of chaos in order to encourage complex things to emerge from it. But then again this might not be necessary, as many have suggested that evolution pushes systems to within that region of processing anyway without the need for any special encouragement. Also, how do we encourage 'complexity’ even if we wanted to? It is a difficult concept to define and compute, although Langton gave a fairly good definition, which was to maximize the information processing of the system. Deterministic Systems are not very sensitive to their inputs, whereas chaotic systems are too sensitive to their inputs, and are therefore essentially random. So we can analyze systems by looking within populations for uniqueness, or within a single program for the behavior of it’s elements. Another method that has been seemingly successful at pushing systems to the edge of chaos is competition – predation and parasitism – as has been show elegantly by the sorting experiments of Danny Hillis. But the problem here is that not all problems have an obvious way of being converted to a competition. Some tasks are inherently solitary ventures. I guess the point is that for nearly every point of view I take I can come up with a very valid exception, or something of importance that is missing in the model being considered. It makes me think that perhaps there is no simple "Theory of Everything" in the realm of Machine Learning, or that the only situations in which that will be the case are by necessity so impoverished as to be uninteresting. An example: the game theoretic strategy Tit-For-Tat in the Prisoner's dilema - it is simple and elegant, but this is only because the possible range of behaviors is so small (binary), there is no noise, and the system only has access to memory of the previous timestep, nothing more. Anyway, I feel no closer than before, and so that will be all for now...

## Tuesday, May 02, 2006

### Limits Of Reinforcement Learning In Fully Non-Markov Environments

I set up a very simple experiment to highlight what I believe to be the major deficiencies in the Reinforcement Learning paradigm (as opposed to a direct search of policy space, as in Evolutionary Algorithms).

Basically Reinforcement Learning is a method for converging on an optimal policy of behavior that maximizes reward of an agent acting within a given environment. Generally this is done with a lookup table that contains a set of state-action pairs and associated probabilities; the probabilities being referred to as the 'policy'. The 'policy' is updated over time in a semi-greedy fashion, but with enough randomness to ensure a good deal of exploration in the space of possible behaviors.

The problem with Reinforcement Learning is that it cannot effectively handle fully non-Markov environments. To illustrate this, I set up a simple task: temporal parity. Basically I feed the program a string of 1s and 0s, one bit at a time, and the program is allowed to have a single bit that acts as a form of contextual memory. At the end of the sequence I examine the state of the memory bit and give reward if it corresponds to the parity of the sequence that has been fed to it. This is a very simple task, all it requires is the following logic:

if (bCurrentInput == 1)
{
if (bMemoryState == 0)
bMemoryState = 1;
else
bMemoryState = 0;
}
else
{
//do nothing
}

So this is a pretty trivial function. However, I was unable to get it to learn even this simple task when the input strings where of greater than length 3. (Note: It is trivial to find a solution to this by doing a direct search of policy space using Evolutionary Algorithms where the string can be of arbitrary length).

Why is standard Reinforcement Learning so bad at this task? The reason is simple. In it's untrained state, the program has approximately a 50% chance of getting the right answer due purely to dumb luck. And it is rewarded for every correct answer, so that means that it may perform a sequence of completely incorrect actions and still be rewarded, thus further reinforcing those actions. In fact, if it makes a single incorrect action during the course of processing a string, then the end result can be correct no greater than 50% of the time.

Consider the following case. Let's say that it was performing the correct actions (the logic depicted above) 75% of the time. So it's mostly there! But now let's say that the length of the string being fed to it is 10. That means that it only has 0.7510 (about 0.056) of giving the correct output for the correct reasons. It is easy to extrapolate and see how long sequences of correct actions would be nearly impossible to perform if there is much noise or randomness in the action policy.

Later I tried making an additional change to the learning mechanism: I scaled the reward by the likelihood of the sequence of actions taken. In other words, if the program was acting randomly, or took a very unlikely action that was improbable given its current policy, its reward was reduced. This helped a little bit, allowing it to learn the task with strings up to length 4 rather than 3.

This leads me to the following principle:

When performing tasks that require the precise execution of a sequence of actions, one must strictly avoid randomness/variation in action policies during a period of time in which a program is being evaluated/rewarded.

I did some additional research using the keywords 'non markov reinforcement learning' and found a few papers that try to address these issues, but the tasks that they tested their programs against were only partially non-markov, generally maze navigation tasks. For this class of problems my above stated principle does not necessarily hold; it may be very useful to have semi-random behaviors while doing exploratory-like actions. However, I believe that behavioral noise makes it nearly impossible to discover sequences of action that require precision in their execution that are of length much greater than 3.

### Surprisingly Simple Solutions

I revisited my Non-Markov Tic Tac Toe experiment (see previous post), but this time using what I thought to be an impossibly simple architecture; only 2 nodes in the hidden layer (note: I am using Sinusoidal rather than Sigmoid activation functions). Amazingly, I observed it reaching a 43% level of fitness (64% was the highest I had observed before although I have been unable to find that level of fitness on subsequent runs), and it was able to find this solution rather rapidly compared to previous tests.

This astonished me because I would not have thought that such complex behavior could be generated by such a small set of parameters. This implies to me that the billions of neurons in the human brain are not really necessary for creating complex behaviors; I suspect that instead they are primarily used for statistical feature detection and also for memory formation and retrieval. This may make me sound a bit like Wolfram, although I think he is still oversimplifying human intelligence.

This also suggests that humans do not possess much in the way of intuitions regarding algorithms created by the process of evolution (again, hints of Wolfram). This sentinment is echoed by Danny Hillis in his latest book, where he talks about the sorting algorithms that his Connection-Machine discovered, and his inability to understand them analytically. To me this is a hopeful state of affairs, because it may mean that many things that we currently consider to be 'hard' problems may not be hard at all, and may in fact possess solutions that are rather easy for evolutionary methods to discover.

## Thursday, March 30, 2006

### A brief summary of my recent work

Here is a list of some of the tasks I have recently tested my system against:

• N-bit Parity Classification (up to 10 bits)

• Palindrome Classification (arbitrary length)

• Embedded Reber Grammar

• Temporal XOR

• Shift Register (delay of N up to 3)

• Pima Indian Diabetes Dataset classification

• Dow Jones Industrial Average time series prediction

• Controller for a Tic Tac Toe player against a semi-skilled opponent

• Triple Sines time series prediction (see EVOLINO: Schmidhuber et al)

## Tuesday, January 17, 2006

### Non-Markov Tic-Tac-Toe Experiment

EXPERIMENTAL SETUP

1) The Neural Network (NN) always plays X; O is played by a hand-crafted computer opponent.
2) The computer opponent is of intermediate skill, it is not perfect, but can only be beaten if it is "forked" by X.
3) The NN is recurrent, and has multiple time steps to decide on a given move.
4) The NN controls a "focus of attention" (FOA) that can be moved around the TTT grid by using the following actions: turn right, turn left, move forward.
5) The FOA allows the NN to "see" only a single cell at a time. The input to the NN is 0 for an empty space, 1 for X, and -1 for O.
6) At the beginning of each turn the FOA of the NN is placed in the center of the grid facing a randomly selected direction (north, east, south, or west).
7) The NN is given no hint as to its current location or the direction it is facing. It must build up an internal representation based on what it has already seen through the FOA.
8) If the NN inadvertantly attempts to move the FOA out of the boundaries of the grid, it automatically forfits the game.
9) The NN selects its move by moving the FOA to a given cell and then performing a "halt" action. This "halt" action must be taken within the specified time-limit (50) or else it automatically forfits the game.
10) The move selected by the NN after halting must be legal (i.e. must be on an empty cell). If the cell is already occupied then it automatically forfits the game.
11) Feedback is given after playing a set of games (1000). Fitness is computed as the sum of the number of games won in the trial set. No credit is given for draws, losses, or premature forfits.

As you can see, these are rather Draconian measures. To summarize: in order to be successful, the NN is expected to learn to move around its "focus of attention" that only allows it to see a single location at a time while being careful not to accidentally move outside of boundaries it has not been informed of, create an internal representation of what it has already seen, use that information to navigate successfully to an empty spot, announce that it is halting, and do all of this repeatedly for each and every one of its turns in such a way that it manages to beat a pre-programmed opponent that can never be beaten without being forked! (Remember, it receives absolutely no credit for a draw).

RESULTS

So far, after a few hours of CPU time, it has managed to discover a solution that can win 64% of the games it plays, even under these drastically difficult conditions!!

## Tuesday, January 03, 2006

### Traversing Neutral Networks

THE IDEA

I was doing some research over my christmas break and stumbled upon a concept called Neutral Networks (which sounds rediculously close to Neural Networks, but oh well).

Basically the idea is that during the course of evolution, many point mutations made to an organism's genome are neither positive nor negative with respect to it's reproductive fitness; they are "neutral". A good example of this is the many-to-one mapping for encoding various amino-acids from codons.

Presumably there are many neutral mutations that are often connected by a single point mutation, thus forming "Neutral Networks" embedded in the fitness landscape. The theory is that this is advantageous because it allows the evolving organism to escape local-optima by horizontally sliding (whatever exactly "horizontal" means in N-dimensions) along flat plateaus of homogeneous fitness until an even higher point is stumbled upon. Much research on this is being done at Sussex University.

IMPLICATIONS

So, if this is true, then one would assume that evolution could occur even with very low rates of mutation... there would be no need to "jump" off of local optima.

I tested this hypothesis by altering the mutation function in my software. Rather than applying Gaussian or Cauchy distributed mutation (as is standardly done in Evolutionary Programming and Fast Evolutionary Programming, respectively), I chose to use a small, fixed-size constant that would either be added to or subtracted from each linkweight of my Neural Network with equal probability. The results have been nice; this new (and considerably simpler) mutation strategy significantly outperforms the previous method I was using.

However, I did find that it was not feasible to use arbitrarily small constants because: 1) evolution would be too slow, and 2) the changes in fitness need to be larger than the "thermal noise" inherent in the evaluation function (note: I am not working with fixed training sets, therefore the feedback function is a bit noisy).

TWO TYPES OF NEUTRAL NETWORK SYMMETRY

This concept was not mentioned in the literature, but I realized that there are actually two distinct forms of Neutral Networks; one nested inside the other. Basically there are two forms of symmetry that can lead to Neutral Networks.

The first form is symmetry of summed fitness; two different behaviors may be equally fit although for differing reasons. The second form is symmetry of representation; two internally different systems (e.g. Neural Networks) can lead to the same external behavior, which then necessarily implies equal fitness.

I believe that biological systems make use of both forms of Neutral-Network symmetry during the evolutionary process.