Friday, December 16, 2005

A little noise is a good thing

Recently I have started testing my Neural Network software against some real-world data sets. The first dataset I tested against was the Pima Indian Diabetes dataset; my system was able to achieve a fairly good training error (19%), but it did not generalize as well to the test set.

Many neuro-evolutionary algorithms (such as Covenet) use regularization terms to minimize the structural resources of the network and thereby improve generalization. Presumably, the fewer the nodes and/or link weights, the lower the effective VC-dimension.

However, introducing dynamic network topology adds a host of programmatic complexities and design decisions that I would like to avoid. So I thought of an alternative way to "stress" the resources of the network: random noise.

My hypothesis is that injecting random noise into the operation of the neural network will force it to form a more distributed and robust internal representation, because it cannot become too dependent on the activation of any single element. I also suspect that, because functionality cannot hinge on any single element, the network will have more flexibility to steadily alter its functionality bit by bit; hence it may be more continuous and "evolvable".
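To make this concrete, here is a minimal sketch (in Python, not my actual implementation) of what I mean by injecting noise; the network shape, the tanh/sigmoid choices, and the noise level are all arbitrary placeholders:

    import numpy as np

    def forward(x, W1, W2, noise_std=0.1, rng=None):
        # Tiny one-hidden-layer network; Gaussian noise is injected into the
        # hidden activations whenever noise_std > 0.
        if rng is None:
            rng = np.random.default_rng()
        h = np.tanh(x @ W1)
        if noise_std > 0.0:
            h = h + rng.normal(0.0, noise_std, size=h.shape)
        return 1.0 / (1.0 + np.exp(-(h @ W2)))

    # Train (or evolve) the weights while calling forward(..., noise_std=0.1),
    # then evaluate the finished network with noise_std=0.0 (noise removed).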

My initial experiments with this approach have met with success; I've found that generalization improves substantially when noise is injected into the training process.

Additionally, if the noise is removed from the network's processing after training, its performance improves even further. With this approach the network has reached levels of generalization that were previously unattainable without the addition of noise.

Tuesday, October 25, 2005

Backtracking seems essential for good optimization

Recently I've done some optimization experiments using two different forms of optimization algorithms: those that allow for a bit of "backtracking" (i.e. sometimes allowing transitions to a state of lower fitness), and those that do not (i.e. "Hill-Climbing").

What I've found is that it is overwhelmingly the case that optimization methods that make use of backtracking allow for the discovery of superior solutions to those that do not.

Unfortunately, this does not explain to me WHY this is the case! I have started to postulate some toy models of the fitness landscape that could conceivably explain these observations.

Saturday, October 01, 2005

To leap further or to retrace one's steps?

This entry is simply the description of a general question I've been grappling with. I haven't come to any firm conclusions about an answer just yet...

I am, as always, pondering local minima in the fitness landscape and trying to understand what an optimal way of dealing with them would be. As I currently see it there are generally two different ways to proceed when one reaches a point at which improvement becomes difficult:

1) Keep trying random variations on the current best solution until another solution of equal or better utility is found, possibly increasing the magnitude of the random variations over time in order to sample a progressively wider neighborhood.
2) Save off the current best solution, and then allow occasional "degeneration" or backtracking; in other words, allow for solutions of lesser quality as a temporary measure in the hopes that another path can be found to a point of even higher utility.

The first method is what is known as Hill-Climbing in the literature, and is generally frowned upon, the argument being that it is prone to becoming trapped in local minima. However, this argument rests upon the assumption of a limited mutation size. If one allows for scale-free mutation, for example by drawing mutations from the Cauchy distribution, then technically every point on the fitness landscape can eventually be reached, so even a Hill-Climbing technique can be completely ergodic (i.e. guaranteed to sample the entire function space eventually). Another trivial way of ensuring maximal coverage of the fitness landscape is a strategy called Random-Restart Hill Climbing, which simply restarts the search at a randomly selected new location once a local minimum is encountered. The actual effectiveness of this strategy is debatable; I have seen it evaluated inconsistently in the literature, with some claiming it is effective and others not.
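For concreteness, here is a minimal sketch of hill-climbing with Cauchy-distributed mutations (assuming a real-valued parameter vector and a fitness function to be maximized; the scale and step count are arbitrary):

    import numpy as np

    def cauchy_hill_climb(fitness, x0, scale=0.1, steps=10000, rng=None):
        # Greedy hill-climbing with heavy-tailed (Cauchy) mutations: a worse
        # candidate is never accepted, but arbitrarily large jumps are possible.
        if rng is None:
            rng = np.random.default_rng()
        x = np.asarray(x0, dtype=float)
        fx = fitness(x)
        for _ in range(steps):
            candidate = x + scale * rng.standard_cauchy(size=x.shape)
            fc = fitness(candidate)
            if fc >= fx:
                x, fx = candidate, fc
        return x, fx

    # Toy usage: maximize an inverted quadratic.
    best_x, best_f = cauchy_hill_climb(lambda v: -np.sum(v ** 2), x0=np.ones(5))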

The second method encompasses many more standard optimization techniques, including Evolutionary Computation and Simulated Annealing. Simulated Annealing explicitly defines the probability that a change to a lower-utility state will be accepted (the Boltzmann acceptance criterion) according to a temperature parameter that is reduced over time following a schedule. Evolutionary Computation encompasses several population-based methods that implicitly allow degenerate solutions to survive within the population to varying degrees, depending on the selective pressure applied by the specifics of the particular EC algorithm. Essentially, anything that slows down the spread of fit solutions through the population gives solutions of lower fitness a greater probability of survival, and hence a longer time in which to stumble upon a better solution. One way to do this is to break the population up into smaller subpopulations that are only loosely coupled with each other; this particular concept is known as Demes and has been used with great success in a variety of EC algorithms.
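The Boltzmann acceptance rule is simple enough to show directly; here is a minimal sketch that minimizes an arbitrary cost function with a geometric cooling schedule (all the parameters are placeholders):

    import math
    import random

    def accept(delta, temperature):
        # Boltzmann criterion: always take improvements (delta <= 0), and take
        # a worse move with probability exp(-delta / T).
        return delta <= 0 or random.random() < math.exp(-delta / temperature)

    def simulated_annealing(cost, x0, neighbor, t0=1.0, cooling=0.999, steps=10000):
        x, cx = x0, cost(x0)
        best, cbest = x, cx
        t = t0
        for _ in range(steps):
            y = neighbor(x)
            cy = cost(y)
            if accept(cy - cx, t):        # backtracking can happen here
                x, cx = y, cy
                if cx < cbest:
                    best, cbest = x, cx
            t *= cooling                  # geometric cooling schedule
        return best, cbest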

Note that both of these methods suffer from the classic dilemma of Exploration versus Exploitation.

Which of these two methods described above should, in general, be more effective, and WHY? (Perhaps neither, according to the No-free-lunch theorem)

Tuesday, September 06, 2005

Constraints of Copying versus Processing of Information

I find it interesting that copying of information is best done in a low number of dimensions (think DNA), while processing of information is best done in a high number of dimensions (think 3D brain).

It has been shown, however, that any kind of information processing can theoretically be accomplished by a purely local, one-dimensional scheme, in the form of a cellular automaton. That being said, the difficulty of actually writing programs in such a medium is tremendous; it is possible, but not at all practical. The reason is that it is difficult to shuttle information around without the signals bumping into and disrupting each other. There just isn't enough room for everything to happen easily. Another example: it is possible to build logical circuits in the 2D environment of Conway's Game of Life, but it is difficult as heck (I had a friend who actually did this for fun... yikes!)
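As a tiny illustration of such a purely local, one-dimensional scheme, here is a sketch of an elementary cellular automaton update rule (Rule 110, which is known to be computationally universal); the ring size and step count are arbitrary:

    def ca_step(cells, rule=110):
        # One synchronous update of a one-dimensional binary cellular automaton
        # on a ring; each cell sees only itself and its two immediate neighbors.
        n = len(cells)
        out = []
        for i in range(n):
            left, center, right = cells[i - 1], cells[i], cells[(i + 1) % n]
            neighborhood = (left << 2) | (center << 1) | right   # value 0..7
            out.append((rule >> neighborhood) & 1)               # look up the rule bit
        return out

    # Toy run: a single live cell in the middle of a ring of 64 cells.
    cells = [0] * 64
    cells[32] = 1
    for _ in range(32):
        cells = ca_step(cells)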

So it is interesting (and not surprising) that nature has optimized both of these mechanisms, choosing a 1-dimensional copying scheme (DNA/RNA) and a 3-dimensional processing scheme (the brain); 1 and 3 being, respectively, the minimum and maximum spatial dimensionalities available in our particular universe.

But what if our universe were > 3 dimensions? Would nature have made use of the extra dimensions to develop even more routes of connectivity than are possible/convenient in 3 dimensions? Would this have led to very different (better?) styles of information processing?

Friday, July 15, 2005

Limits Of Nature: 1D Replication

I think that there is not always enough emphasis placed upon the massive constraints that evolution has been forced to work within in order to create the amazing ecosystem of life and culture that we (humans) now participate in.

For example, wouldn't Lamarckian evolution (direct inheritance of lifetime experiences) have been a lot more effective to use than Darwinian evolution? Who wants to start over from scratch with each new generation? Why didn't Nature choose to work in this manner?


Well, there are a lot of good reasons why this didn't "naturally" occur at first; mostly having to do with the extreme difficulty of making accurate copies of information, even in only a single dimension. The medium of choice for information transfer had, until recently, been DNA. However, Nature has now discovered a few neat ways to circumvent these problems, namely through the transmission of memes.


In humans, the evolution of memes has continued to rapidly accelerate, starting with the usage of language, then writing, printing, radios, telephones, the internet, etc... (Note that cultural transmission of information is not strictly limited to animals with linguistic capacities; birds do it too.)


Interestingly enough, all of these newer methods of information transfer and copying still essentially revolve around the sequential transmission of a finite set of tokens, i.e. languages; once again, linear chains of information, be they English or ASL or TCP/IP. So nature hasn't really circumvented the difficulties of working in information mediums of greater than one dimension, it's just that it has recently stumbled upon some new one-dimensional mediums to work in!


Of course, these new mediums do not replace the old mediums, but rather build on top of them in a hierarchy of ever increasing abstraction. Entropy must always increase, but paradoxically complexity appears to always increase as well.

Surprisingly, even as the form of the information under consideration becomes progressively more abstract, the manner in which that information is dealt with tends to follow a somewhat universal pattern:

Complex adaptive systems tend to make strong use of linear sequences of tokenized information; they tend to use languages.

Thursday, June 16, 2005

Useful Algorithms Are Discovered, Not Created

FUNCTIONS

A function simply defines an input-to-output mapping between two sets of elements. The input set is referred to as the domain; the output set is the co-domain (or range). There is only one basic requirement of a function: each and every element of the input set (domain) must map to exactly one element of the output set (co-domain), although multiple input elements may map to the same output element.

Below, for example, is a complete list of mappings which completely define a very simple logical function called Exclusive-OR (often abbreviated XOR):

  • (0,0)->(0)

  • (0,1)->(1)

  • (1,0)->(1)

  • (1,1)->(0)


This particular function takes two binary-valued inputs and generates a single binary-valued output.

Defining a function using an explicit lookup-table of input-output mappings works well for XOR, as it is discrete and there are only four possible two-bit combinations of input (2² = 4). Lookup-tables become utterly impractical, however, when one wishes to describe functions of higher dimensionality or possibly real-valued input/output spaces. To define a more complex function, more powerful mechanisms are necessary.

A very useful (and more powerful) way to think about functions is to imagine them as surfaces embedded in high-dimensional spaces; surfaces that must bend and conform in particular ways in order to pass through particular points, these points being samples of whatever it is that is to be modeled. The idea is that once a surface (or hypersurface) has been correctly shaped to fit the known data points, it can then also be used to make predictions about unknown points! This predictive ability is referred to as Generalization (note that lookup-tables, like the one described above, are incapable of any form of generalization). Another very important concept is that a "simpler" surface (i.e. smoother) is likely to generalize more effectively than a very "complex" one (i.e. distorted or wrinkled) and is thus preferable. [see Occam's Razor, VC-Dimension]
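To make the surface-fitting picture concrete, here is a minimal sketch using a plain polynomial as a stand-in for a more powerful function approximator: fit a curve through a handful of known sample points, then query it at points it has never seen (the data and the polynomial degree are arbitrary):

    import numpy as np

    # Known samples of some underlying process (here, a noisy sine wave).
    x_known = np.linspace(0, 2 * np.pi, 10)
    y_known = np.sin(x_known) + np.random.normal(0, 0.1, size=x_known.shape)

    # Shape a simple "surface" (a degree-3 polynomial) to pass near those points.
    coeffs = np.polyfit(x_known, y_known, deg=3)

    # Generalization: predict values at points never seen during fitting.
    x_new = np.linspace(0, 2 * np.pi, 100)
    y_predicted = np.polyval(coeffs, x_new)

    # A lookup-table built only from (x_known, y_known) could not do this; it
    # has no opinion at all about any x that is not already in its table.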



There exist many techniques for creating, poking, prodding, sculpting, and analyzing these high-dimensional surfaces (i.e. functions), the most popular being gradient-based error-minimization methods such as the backpropagation training of Feedforward Neural Networks.

FUNCTIONS AS ALGORITHMS - THE PROBLEM OF VANISHING GRADIENTS

An algorithm is a procedure performed on data that can involve branching conditional logic and requires a finite (but possibly very large) number of steps in order to complete. It differs from a function mapping in that it unfolds through time, rather than being a simple input to output relationship.

Many attempts have been made to modify the techniques developed for training static function mappings so that they apply to functions that recursively refer to themselves, which moves us into the realm of algorithms. An example of one such self-referential architecture is the Recurrent Neural Network (RNN), which can be trained in various ways, including two popular methods known as Real-Time Recurrent Learning (RTRL) and BackPropagation Through Time (BPTT).

The problem is that none of this actually works very well in practice! The various training methods cannot handle long time delays or large amounts of non-linearity, and they manage to learn only the simplest of algorithms. The reason for this limitation is that the error signal (i.e. the information used for training) is either amplified to the point of distortion or else dwindles and disappears, and this phenomenon worsens the further back through time the error signal must travel. Each step of algorithmic execution further weakens the signal. It has been shown that this problem is essentially unavoidable; it is commonly referred to as the Problem of Vanishing Gradients.
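A rough numerical sketch of why this happens: the error signal reaching a point k steps in the past is (roughly) a product of k per-step derivative factors, so unless those factors sit exactly at one, the signal shrinks or explodes exponentially with the time lag (the weight and activations below are arbitrary):

    import numpy as np

    # Per-step factor for a single recurrent unit: w * tanh'(h) = w * (1 - tanh(h)^2).
    w = 0.9                                            # recurrent weight (arbitrary)
    h = np.random.uniform(-1.0, 1.0, size=50)          # pre-activations over 50 steps
    factors = w * (1.0 - np.tanh(h) ** 2)

    # The error contribution from k steps back is scaled by the product of the
    # intervening factors, which heads toward zero as k grows.
    for k in (1, 5, 10, 25, 50):
        print(k, np.prod(factors[:k]))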


(One partial exception to these limitations is a fascinating architecture called Long Short-Term Memory (LSTM), a fairly powerful form of RNN. Unfortunately, however, it still relies on corrective error signals, making it inappropriate for direct use in reinforcement learning environments, and it has been shown to be unable to learn some fairly trivial algorithms, such as Temporal XOR.)

THE HALTING PROBLEM AND UNDECIDABILITY

Unfortunately, there exists no general way to examine the specifications of a potential algorithm and figure out, a priori, if it will ever halt after being fed with a given piece of input data. This dilemma has been sensibly dubbed the Halting Problem and has ties to many other areas of theoretical interest, such as Gödel's incompleteness theorem.

A SIMPLE ALGORITHM:
(1) Set X to a randomly chosen positive integer
(2) While X does not equal 1 do the following:
     (2.1) if X is even, do X:=X/2
     (2.2) otherwise (X is odd), do X:=(X*3)+1
(3) Halt

For example, even the "simple" algorithm outlined above has never been proven to halt for all possible initial values of X. It has been extensively tested over the years, and it has always been empirically observed to halt, but this does not constitute a proof.
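For the curious, the procedure above (known in the literature as the Collatz or "3n+1" problem) translates directly into code; a minimal sketch:

    import random

    def collatz_steps(x):
        # Run the procedure above and count the steps until X reaches 1. Nobody
        # has proven that this loop terminates for every positive integer.
        steps = 0
        while x != 1:
            x = x // 2 if x % 2 == 0 else 3 * x + 1
            steps += 1
        return steps

    print(collatz_steps(random.randint(1, 10 ** 6)))   # halts in practice, so far...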

In addition to the Halting Problem, there are other, even more disturbing gaps in what can be known about algorithms. Here is a quote from a wonderful book entitled Algorithmics: The Spirit Of Computing by David Harel:

The fact that verification and halting are undecidable is only one small part of a far more general phenomenon, which is actually much deeper and a lot more devastating. There is a remarkable result, called Rice's theorem, which shows that not only can we not verify programs or determine their halting status, but we can't really figure out anything about them. No algorithm can decide any nontrivial property of computations. More precisely, let us say we are interested in deciding some property of programs, which is (1) true of some programs but not of others, and (2) insensitive to the syntax of the program and to its method of operation or algorithm; that is, it is a property of what the program does, of the problem it solves, and not of the particular form that solution takes. For example, we might want to know whether a program ever outputs a "yes," whether it always produces numbers, whether it is equivalent to some other program, etc. etc. Rice's theorem tells us that no such property of programs can be decided. Not even one. They are all undecidable. We can really forget about being able to reason automatically about programs, or to deduce things about what our programs do. This is true whether our programs are small or large, simple or complex, or whether what we want to find out is a general property or something petty and idiosyncratic. Virtually nothing about computation is computable!


DIRE IMPLICATIONS

Clearly even the simplest of algorithms can be monstrosities to deal with, at least when it comes to creating them or making definitive statements about their behavior.

None of the gradient based methods that have been developed to deal with continuous function-mappings (as described in the previous section above) are able to work effectively in the space of general-purpose algorithms, because none of them can completely avoid the Problem of Vanishing Gradients.

It also appears that making definitive statements about the behavior of an arbitrary algorithm is, from a proof-theoretic standpoint, generally impossible.

These sobering insights severely limit the set of mathematical tools that can be applied to the creation of useful algorithms. On the one hand, we are unable to use continuous mechanisms (derivative-based error-minimization) to create useful algorithms. On the other hand, we aren't able to use symbolic mechanisms (logical proofs) to talk about useful algorithms either!

Seemingly the only mechanisms left available to us for dealing with algorithms are ones that, much like Mother Nature, rely primarily on trial and error. We are essentially forced to change our strategy from creating solutions to searching for them.

EVOLUTIONARY ALGORITHMS

Many researchers worldwide are now beginning to turn in this general direction, using a large variety of emerging methods that are often collectively referred to as Evolutionary Algorithms, of which Genetic Algorithms (GA), Evolution Strategies (ES), Evolutionary Programming (EP), Genetic Programming (GP), and Classifier Systems (CS) are well known branches.
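To give a flavor of how simple the core of these methods can be, here is a minimal sketch of a real-valued evolutionary algorithm with tournament selection and Gaussian mutation (the representation, operators, and parameters are all placeholders, not any particular published algorithm):

    import numpy as np

    def evolve(fitness, genome_length=10, pop_size=50, generations=500,
               sigma=0.1, tournament=3, rng=None):
        # Minimal real-valued evolutionary algorithm: tournament selection plus
        # Gaussian mutation. Pure trial-and-error search; no gradients, no proofs.
        if rng is None:
            rng = np.random.default_rng()
        pop = rng.normal(size=(pop_size, genome_length))
        for _ in range(generations):
            scores = np.array([fitness(ind) for ind in pop])
            new_pop = []
            for _ in range(pop_size):
                idx = rng.integers(0, pop_size, size=tournament)   # tournament
                winner = pop[idx[np.argmax(scores[idx])]]
                new_pop.append(winner + sigma * rng.normal(size=genome_length))
            pop = np.array(new_pop)
        scores = np.array([fitness(ind) for ind in pop])
        return pop[np.argmax(scores)], scores.max()

    # Toy usage: evolve a genome toward an arbitrary target vector.
    target = np.arange(10, dtype=float)
    best, score = evolve(lambda g: -np.sum((g - target) ** 2))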

These methods have the exciting potential to solve many previously intractable real-world problems, including (but not limited to) the discovery of useful algorithms. For example, Genetic Programming has found an improved training algorithm for Feedforward Neural Networks. There have also been several recreations of patented inventions.

In fact, some of the discoveries made using these evolutionary methods have already greatly surpassed the abilities of human designers. Here is one:

This very striking (and small) example is an antenna designed for NASA using a Genetic Algorithm. The solution is superior in several respects to anything that human engineers have thus far managed to come up with. And exactly why it works remains somewhat of a mystery...

In the end it should come as no surprise that our most powerful methods for discovery are directly inspired by Nature. She's been doing this a lot longer than we have.

Monday, June 06, 2005

Local Optima and Feller's Theory of Records

LOCAL OPTIMA

A common problem in Evolutionary Computation, as well as in other forms of Stochastic Local Search, is premature convergence on a local optimum. A large variety of strategies have been devised for overcoming this problem, including Simulated Annealing, Swarm Intelligence, Fitness Uniform Selection to Preserve Genetic Diversity (FUSS), and many others.

Once a given local optimum has been reached, it is often very difficult to escape, particularly if adaptation is taking place on a fairly non-linear or "rough" fitness landscape. In his extensive work on adaptation over NK-model landscapes, biologist Stuart Kauffman makes the following observations in his book The Origins Of Order:

Walks always stop at local optima well below the global optimum. Trapping is rife in network space. This has a critical consequence: adaptive walks typically cannot achieve arbitrary patterns of activities on attractors! Adaptation via fitter variants in network space becomes grossly hindered by the rugged structure of the landscape. Walks become frozen into small regions of the space. Any intuition we may have harbored that mutation and selection alone could tune attractors to arbitrary patterns of behavior appears to be wrong.


I suggest that once a given solution has reached a fairly fit local optimum and has stopped improving at a reasonable rate, further computational resources are best spent on experiments involving other potential solutions of lesser fitness that still have a reasonable chance of improvement (i.e. have not yet reached their own local optima), in the hope that at least one of them will surpass the current best before inevitably grinding to a halt itself.

Assuming the above idea is correct, it would be highly desirable to have some way of detecting when a given solution has reached an optimum (either local or global), and thus be able to automatically trigger a reallocation of search resources.

The Theory of Records may offer a way to do just that...


FELLER'S THEORY OF RECORDS

To quote Kauffman again:

In keeping with the sensible intellectual strategy of gaining insight into complex behavior by proposing simple initial caricatures, it is of interest to find that a simple, near-universal law describes the behavior of an adapting population in the limit of search by long jumps across sequence space. The "law" is equivalent to Feller's theory of records, which asks the waiting time for successive new athletic records to be set. The answer is that the waiting time doubles after each record is set.


Kauffman goes on to describe his findings on optimization over a variety of problems, including the Traveling Salesman Problem, in which the rate of improvement very closely matches the Theory of Records schedule described above.

The Theory of Records can be characterized by the following simple equation:

X = log₂(T)


where X is the total number of improvements found during the lifetime of the agent, and T is the total number of generations of experimentation that the agent has undergone.


LOCAL OPTIMA DETECTION VIA THE THEORY OF RECORDS

I suggest that it may be possible to use the Theory of Records to statistically predict whether or not a given agent has become "stuck" on a local optimum by examining the density of its recent improvements X with respect to its total lifetime T.

We would anticipate that once an agent had reached a "dead-end" it would stop improving, thus leading to the following inequality:

X < log₂(T)


The above inequality would be a signal to us that further improvement of a particular agent is becoming increasingly unlikely. This would suggest that we allocate fewer computational resources to that particular agent, and instead redirect those freed resources to other, more promising (albeit lower fitness) agents within the population.
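The bookkeeping required for this heuristic is trivial; here is a minimal sketch (the agent fields and the reallocation routine are hypothetical placeholders):

    import math

    def looks_stuck(num_improvements, lifetime, slack=0.0):
        # Flag an agent as probably trapped on a local optimum once its count of
        # improvements (records) falls below the log2(T) schedule.
        if lifetime < 2:
            return False
        return num_improvements + slack < math.log2(lifetime)

    # Hypothetical usage inside a search loop (agent fields are placeholders):
    # if looks_stuck(agent.num_improvements, agent.lifetime):
    #     shift_resources_away_from(agent)   # hypothetical reallocation routine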

Friday, June 03, 2005

Powerful AI + Biofeedback = ???

Imagine, if you will, two things:

First, the existence (at some unspecified point in the future) of some really good Machine Learning algorithms; perhaps not yet surpassing human intellect or even passing the Turing Test, but considerably more powerful than today's methods.

Second, imagine some relatively inexpensive and portable real-time brain monitoring technologies, perhaps not sensitive at the level of the individual neuron, but maybe sensitive enough to differentiate between cortical columns; in other words, far better than any current EEG or fMRI technologies.

Now let's imagine an interesting application that might be developed as a result of these hypothetical technological breakthroughs: a program that could generate pleasing novel music (or any other media form) in real-time based on the physiological/neurological state of the listener!

PHASE I: Data-Mining

Imagine that a listener, as the subject of this thought-experiment, is exposed to a set of songs. They listen to each song in turn and then give some form of feedback about their reaction to it; perhaps a simple thumbs-up or thumbs-down, perhaps a rating from 1 to 10. The particular form of feedback is inconsequential; however, the more information given, the better. During this listening session the listener is attached to some sort of brain-monitoring device (as alluded to above) which relays a vast amount of information (possibly Terabytes, certainly Gigabytes) to the CPU. The CPU concurrently receives three forms of data: 1) the raw waveform information of the given audio file, 2) the real-time cortical activity patterns of the listener's brain in response to this auditory stimulus, and 3) the consciously elicited response of the listener, recorded as some form of "rating".

The CPU now has the task of making two types of highly complex associative mappings. First, it needs to correlate the auditory stimuli with the effect produced in the cortical dynamics of the listener's brain. This mapping must be good enough to be at least somewhat predictive; i.e. the CPU should be able to make a decent guess about the effect that a new piece of audio will have on the neurological state of the listener. Second, the CPU needs to make an additional complex associative mapping between the cortical state of the listener and the consciously elicited response (e.g. thumbs-up or thumbs-down).

Waveform Information -> Measurable Cortical Dynamics -> Conscious Elicited Response
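Here is a toy sketch of this two-stage mapping with everything reduced to linear least-squares and random stand-in data, purely to show the shape of the pipeline (the feature counts are arbitrary):

    import numpy as np

    rng = np.random.default_rng()

    # Stand-in data: 100 songs, each with 64 waveform-derived features,
    # 32 "cortical state" features, and one 1-10 rating.
    waveform = rng.normal(size=(100, 64))
    cortical = rng.normal(size=(100, 32))
    rating = rng.integers(1, 11, size=(100, 1)).astype(float)

    # Mapping 1: waveform features -> cortical dynamics (linear least squares).
    W1, *_ = np.linalg.lstsq(waveform, cortical, rcond=None)

    # Mapping 2: cortical dynamics -> consciously elicited rating.
    W2, *_ = np.linalg.lstsq(cortical, rating, rcond=None)

    # Guess the rating of a brand-new piece of audio by chaining the two mappings.
    new_waveform = rng.normal(size=(1, 64))
    predicted_rating = new_waveform @ W1 @ W2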

PHASE II: Experimental Generation of New Auditory Stimuli

Now the CPU can begin to experimentally generate new music, and can receive meaningful feedback in real-time about the success of its efforts! [Note that there may be major differences between individual brains! This means that the Phase I data-mining process would likely have to be repeated for each user.]

We will assume that the CPU's initial efforts would be utterly disastrous, but could presumably improve with time via evolutionary selective pressures combined with massive compute power (this is the future after all...)


The end result of this process would (hypothetically) be personalized music/sound-scapes that fluctuate and re-adapt in real-time depending on the minute moment-by-moment details of the listener's mood and state of attention, and would be custom-tailored to the brain-dynamics of that particular individual.

An offshoot of this idea might be the cultivation of particular states of consciousness such as 'alertness', 'relaxation', 'creativity', etc...

Also, the sensory stimuli need not (as mentioned above) be limited to audio...

1st post

This is my first ever blog posting!

I have spent the last 5 years learning about and actively experimenting with Artificial Intelligence. At first this exploration was focused primarily on Neural Networks, but it has since blossomed out into a wider range of areas. Here are some of the topics that currently fascinate me:


  • Recurrent Neural Networks

  • Probability / Statistics / Information Theory

  • Chaos Theory

  • Cellular Automata

  • Abiogenesis

  • Algorithmic Information Theory

  • Neuroanatomy

  • Cognitive Psychology

  • Episodic Memory

  • Turing Machines

  • Autocatalytic Networks

  • Predator/Prey dynamics and the "Red Queen" effect

  • Game Theory

  • Classifier Systems

  • Natural Language Processing / Computer Aided Translation