Visualizations of Deep Learning, Recurrent Neural Networks, and Evolutionary Algorithms

Tuesday, October 25, 2005

Recently I've run some optimization experiments using two different kinds of algorithms: those that allow a bit of "backtracking" (i.e. occasionally accepting transitions to a state of lower fitness), and those that do not (i.e. "Hill-Climbing").
What I've found, overwhelmingly, is that the methods which make use of backtracking discover superior solutions compared to those that do not.
Unfortunately, this does not explain to me WHY this is the case! I have started to postulate some toy models of the fitness landscape that could conceivably explain these observations.

Saturday, October 01, 2005

To leap further or to retrace one's steps?
This entry is simply the description of a general question I've been grappling with. I haven't come to any firm conclusions about an answer just yet...
I am, as always, pondering local minima in the fitness landscape and trying to understand what an optimal way of dealing with them would be. As I currently see it, there are generally two different ways to proceed when one reaches a point at which improvement becomes difficult:
1) Keep trying random variations on the current best solution until another solution of equal or better utility is found, possibly increasing the magnitude of the random variations over time in order to sample a progressively wider neighborhood.
2) Save off the current best solution, and then allow occasional "degeneration" or backtracking; in other words, allow for solutions of lesser quality as a temporary measure in the hopes that another path can be found to a point of even higher utility.
The first method is what is known as Hill-Climbing in the literature, and it is generally frowned upon, the argument being that it is prone to becoming trapped in local minima. However, this argument rests on the assumption of a limited mutation size. If one allows for scale-free mutation, such as can be obtained by drawing steps from the Cauchy distribution, then technically every point on the fitness landscape can eventually be reached, so even a Hill-Climbing technique can be completely ergodic (i.e. guaranteed to eventually sample the entire function space). Another trivial method of ensuring maximal coverage of the fitness landscape is a strategy called Random-Restart Hill Climbing, which simply restarts the search at a randomly selected new location once a local minimum is encountered. The actual effectiveness of this strategy is debatable; I have seen it evaluated inconsistently in the literature, with some claiming it is effective and others not.
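To make this concrete, here is a minimal Python sketch of the Cauchy-mutation Hill-Climbing idea. The step scale and the Rastrigin test function below are just assumptions for illustration, not something from my actual experiments; the point is simply that the heavy-tailed Cauchy steps are what give the "scale-free" coverage described above.

```python
# Minimal sketch of Hill-Climbing with Cauchy-distributed ("scale-free") mutations,
# assuming a real-valued vector encoding and a fitness function to be MAXIMIZED.
# Most proposed steps are small, but the heavy tails occasionally propose very
# large jumps, so in principle any region of the landscape can be sampled.
import numpy as np

def cauchy_hill_climb(fitness, x0, scale=0.1, iterations=10000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    best_x = np.asarray(x0, dtype=float)
    best_f = fitness(best_x)
    for _ in range(iterations):
        # Heavy-tailed proposal around the current best solution.
        candidate = best_x + scale * rng.standard_cauchy(size=best_x.shape)
        f = fitness(candidate)
        if f >= best_f:  # accept only equal-or-better moves: no backtracking
            best_x, best_f = candidate, f
    return best_x, best_f

if __name__ == "__main__":
    # Toy multimodal test function (negated Rastrigin, so higher is better) --
    # an assumed example, used here only to exercise the sketch.
    rastrigin = lambda x: -(10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))
    x, f = cauchy_hill_climb(rastrigin, np.full(5, 3.0))
    print(x, f)
```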
The second method encompasses many more standard optimization techniques, including Evolutionary Computation and Simulated Annealing. Simulated Annealing explicitly defines the probability that a change to a lower-utility state will be accepted (the Boltzmann acceptance criterion) according to a temperature parameter that is reduced over time according to a schedule. Evolutionary Computation encompasses several population-based methods which implicitly allow degenerate solutions to survive within the population to varying degrees, based on the selective pressure applied by the specifics of the particular EC algorithm. Essentially, anything that slows down the spread of fit solutions through the population will give solutions of lower fitness a greater probability of survival, and hence a longer time in which to possibly stumble on a better solution. One way to do this is to break the population up into smaller subpopulations which are only loosely coupled with each other; this particular concept is known as Demes and has been used with great success in a variety of EC algorithms.
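For concreteness, here is a rough sketch of the Boltzmann acceptance rule together with a simple geometric cooling schedule. The particular schedule and its parameters are assumptions for illustration only; the key point is that backtracking is common at high temperature and becomes rare as the system cools.

```python
# Sketch of the Boltzmann acceptance criterion from Simulated Annealing,
# assuming we are MAXIMIZING fitness. A worse candidate is accepted with
# probability exp(delta / T), where delta < 0 is the drop in fitness.
import math
import random

def accept(current_fitness, candidate_fitness, temperature):
    delta = candidate_fitness - current_fitness
    if delta >= 0:
        return True  # always accept improvements
    return random.random() < math.exp(delta / temperature)

def cooling_schedule(t0=1.0, alpha=0.95, steps=1000):
    # Geometric cooling: one common (assumed) choice of schedule.
    t = t0
    for _ in range(steps):
        yield t
        t *= alpha
```

Paired with any proposal mechanism (such as the Cauchy mutation above), this acceptance rule is what allows the temporary "degeneration" described in option 2.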
Note that both of these methods suffer from the classic dilemma of Exploration versus Exploitation.
Which of these two methods described above should, in general, be more effective, and WHY? (Perhaps neither, according to the No-free-lunch theorem)