Thursday, June 16, 2005

Useful Algorithms Are Discovered, Not Created

FUNCTIONS

A function simply defines an input-to-output mapping between two sets of elements. The input set is referred to as the domain, the output set is the co-domain or range. There is only one basic requirement of a function: each and every element of the input set (domain) must correspond with (map to) a single element of the output set (co-domain), although multiple input elements may map to the same output element.

Below, for example, is a complete list of mappings which completely define a very simple logical function called Exclusive-OR (often abbreviated XOR):

  • (0,0)->(0)

  • (0,1)->(1)

  • (1,0)->(1)

  • (1,1)->(0)


This particular function takes two binary-valued inputs and generates a single binary-valued output.

Defining a function using an explicit lookup-table of input-output mappings works well for XOR, as it is discrete and there are only four possible two-bit combinations of input (22). Lookup-tables become utterly impractical, however, when one wishes to describe functions of higher dimensionality or possibly real-valued input/output spaces. To define a more complex function, more powerful mechanisms are necessary.

A very useful (and more powerful) way to think about functions is to imagine them as surfaces embedded in high-dimensional spaces; surfaces that must bend and conform in particular ways in order to pass through particular points, these points of course being samples of whatever it is that is to be modeled (see image below). The idea is that once a surface (or hypersurface) has been correctly shaped to fit the known data points, it can then also be used to make predictions about unknown points! This predictive ability is referred to as Generalization (note that lookup-tables, like the one described above, are incapable of any form of generalization). Another very important concept is that a "simpler" surface (i.e. smoother) is likely to generalize more effectively than a very "complex" one (i.e. distorted or wrinkled) and is thus preferable. [see Occam's Razor, VC-Dimension]



There exist many techniques for creating, poking, prodding, sculpting, and analyzing these high-dimensional surfaces (i.e. functions). A few popular examples of these include:

FUNCTIONS AS ALGORITHMS - THE PROBLEM OF VANISHING GRADIENTS

An algorithm is a procedure performed on data that can involve branching conditional logic and requires a finite (but possibly very large) number of steps in order to complete. It differs from a function mapping in that it unfolds through time, rather than being a simple input to output relationship.

Many attempts have been made to modify the techniques developed for the training of static function mappings, and to apply them to functions that recursively refer to themselves, which moves into the realm of algorithms. An example of one such self-referential architecture is known as a Recurrent Neural Network (RNN), which can be trained in various ways, including two popular methods known as Real-Time Recursive Learning (RTRL) and BackPropagation Through Time (BPTT).

The problem is that none of this actually works very well in practice! The various training methods cannot handle long time delays or large amounts of non-linearity. They manage to learn only the simplest of algorithms. The reason for this limitation is that the error-signal (i.e. the information which is used for training) is either amplified to the point of distortion, or else dwindles and disappears, and this phenomena increases the further back through time the error-signal must travel. Each step of algorithmic execution further weakens the signal. It has been shown that this problem is essentially unavoidable and is commonly referred to as the Problem of Vanishing Gradients.


(One partial exception these limitations is a fascinating architecture called Long Short Term Memory (LSTM), which is a fairly powerful form of RNN (depicted above). Unfortunately, however, it still relies on corrective error signals, thus making it inappropriate for direct use in reinforcement learning environments, and has been shown to be unable to learn some fairly trivial algorithms, such as Temporal XOR.)

THE HALTING PROBLEM AND UNDECIDABILITY

Unfortunately, there exists no general way to examine the specifications of a potential algorithm and figure out, a priori, if it will ever halt after being fed with a given piece of input data. This dilemma has been sensibly dubbed the Halting Problem and has ties to many other areas of theoretical interest, such as Gödel's incompleteness theorem.

A SIMPLE ALGORITHM:
(1) Set X to a randomly chosen positive integer
(2) While X does not equal 1 do the following:
     (2.1) if X is even, do X:=X/2
     (2.2) otherwise (X is odd), do X:=(X*3)+1
(3) Halt

For example, even the "simple" algorithm outlined above has never been proven to halt for all possible initial values of X. It has been extensively tested over the years, and it has always been empirically observed to halt, but this does not constitute a proof.

In addition to the Halting Problem, there are other, even more disturbing gaps in what can be known about algorithms. Here is a quote from a wonderful book entitled Algorithmics: The Spirit Of Computing by David Harel:

The fact that verification and halting are undecidable is only one small part of a far more general phenomenon, which is actually much deeper and a lot more devastating. There is a remarkable result, called Rice's theorem, which shows that not only can we not verify programs or determine their halting status, but we can't really figure out anything about them. No algorithm can decide any nontrivial property of computations. More precisely, let us say we are interested in deciding some property of programs, which is (1) true of some programs but not of others, and (2) insensitive to the syntax of the program and to its method of operation or algorithm; that is, it is a property of what the program does, of the problem it solves, and not of the particular form that solution takes. For example, we might want to know whether a program ever outputs a "yes," whether it always produces numbers, whether it is equivalent to some other program, etc. etc. Rice's theorem tells us that no such property of programs can be decided. Not even one. They are all undecidable. We can really forget about being able to reason automatically about programs, or to deduce things about what our programs do. This is true whether our programs are small or large, simple or complex, or whether what we want to find out is a general property or something petty and idiosyncratic. Virtually nothing about computation is computable!


DIRE IMPLICATIONS

Clearly even the simplest of algorithms can be monstrosities to deal with, at least when it comes to creating them or making definitive statements about their behavior.

None of the gradient based methods that have been developed to deal with continuous function-mappings (as described in the previous section above) are able to work effectively in the space of general-purpose algorithms, because none of them can completely avoid the Problem of Vanishing Gradients.

It also appears that trying to make definitive statements about the behavior of any particular arbitrary algorithm is, from a proof-theoretic standpoint, impossible.

These sobering insights severely limit the set of mathematical tools that can be applied to the creation of useful algorithms. On the one hand, we are unable to use continuous mechanisms (derivative-based error-minimization) to create useful algorithms. On the other hand, we aren't able to use symbolic mechanisms (logical proofs) to talk about useful algorithms either!

Seemingly the only mechanisms left available to us for dealing with algorithms are ones that, much like Mother Nature, primarily rely on trial-and-error. We essentially are forced to change our strategy from that of creating solutions to that of searching for solutions.

EVOLUTIONARY ALGORITHMS

Many researchers worldwide are now beginning to turn in this general direction, using a large variety of emerging methods that are often collectively referred to as Evolutionary Algorithms, of which Genetic Algorithms (GA), Evolution Strategies (ES), Evolutionary Programming (EP), Genetic Programming (GP), and Classifier Systems (CS) are well known branches.

These methods have the exciting potential to solve many previously intractable real-world problems, including (but not limited to) the discovery of useful algorithms. For example, Genetic Programming has found an improved training algorithm for Feedforward Neural Networks. There have also been several recreations of patented inventions.

In fact, some of the discoveries made using these evolutionary methods have already greatly surpassed the abilities of human designers. Here is one:

This very striking (and small) example is an antenna designed for NASA using a Genetic Algorithm. This solution is superior in several respects to anything that human engineers have thusfar managed to come up with. And exactly why it works remains somewhat of a mystery...

In the end it should come as no surprise that our most powerful methods for discovery are directly inspired by Nature. She's been doing this a lot longer than we have.

No comments: