PROGRAMMER'S BOOKSHELF

Genetic Programming

Peter D. Varhol

Peter is an assistant professor of computer science and mathematics at Rivier College in Nashua, New Hampshire.

We can all generally agree upon what a good computer program looks like. It tends to be short, uses computer time efficiently, and certainly leads to the one right answer under all circumstances. The algorithms it uses are easy to understand, not convoluted, and expressed in the most concise way possible. We can look at it, nod our heads, and say, "Yeah, this makes sense."

Well, throw away those preconceived notions of what a proper computer program looks like when you open John Koza's Genetic Programming: On the Programming of Computers by Means of Natural Selection. This comprehensive book examines what genetic programming is all about, how its concepts derive from the roots of human genetics, and how it can be used to solve a wide variety of problems in system control, planning, and decision support.

Genetic programming begins with program induction--the idea that, under many circumstances, problem solving can be thought of as the discovery of a computer program that produces the proper outputs when given the appropriate inputs. The program itself is a black box--feed it the inputs, and it produces the correct outputs for that input set. This obviously isn't true for all computer programs, but this analogy fits well to a large number of programs that perform computations or data manipulations. The process starts off with the random selection of a number of possible programs. These are tested against the problem, and the best ones are retained, while the worst are discarded. The remaining ones are used to generate other programs, all of which are then once again tested against the problem. If this is done correctly, the average "fit" of the population increases, although the genetic algorithms themselves are not concerned with average fitness, but rather, the fitness of individual programs.

The interesting thing about genetic programming is that this program may be neither the shortest, the most efficient, nor even the best. It is, however, the most "fit" of those generated, in that it is the closest solution to the problem, or the shortest, or the best in whatever measure of fitness you care to use.

It is easy to see the analogy with classical genetics. The fittest programs survive in each succeeding generation, until a threshold criterion is reached. The threshold criterion may specify an average error rate, or average time to reach a solution, or some other quantitative standard that fitness can be measured against. Of course, unlike genetics in biology, genetic programming must stop at some point in time, since the goal is to formulate a solution to a problem.

There is even an analogy to mutation. At each generation of the program, it is possible to introduce a random change to a random part of one or more of the programs. In succeeding generations, a mutation may result in programs that would never have been derived from the original algorithm.

Is genetic programming, then, guaranteed to give the optimum result? No, according to Koza. It might, after a while, produce an answer that meets the solution parameters of the problem, but there is no way of telling whether or not there is a best answer, let alone whether we have found it.

So what good is this technique, if it is not guaranteed to give the optimal or most-efficient answer? There are many problems, especially those with nonlinear behavior, that do not easily lend themselves to any analytical answer using conventional techniques. Chaotic processes such as the weather or the structural behavior of materials may lend themselves well to genetic techniques.

How do these genetic techniques work? Koza describes the genetic algorithm in three steps:

Randomly generate an initial population of computer programs capable of solving the problem.

Iteratively perform the following operations:

Execute each program in the population and assign it a fitness value according to how well it solves the problem.
Create a new population of computer programs by applying the following operations:
1. Copy existing programs to the new population.
2. Create new programs by genetically recombining randomly chosen parts of two existing programs.

Evaluate the resulting population of programs. The best program that appears in any generation may be an acceptable solution to the problem.

One concern you might have about the genetic approach is that it may lead into a blind alley. That is, a search path may prove to be promising at first, but after succeeding generations, it may become clear that the algorithm is not converging on an optimum range of solutions. No particular approach is guaranteed to converge on a solution. One solution is the aforementioned mutation. By randomly mutating one or more characteristics of the program, it's possible to produce potential solutions that could not have been derived from the genetic algorithm itself. If the mutations do not improve the overall fitness of the programs in succeeding generations, they will gradually be weeded out.

Representing the problem to be modeled genetically is also a concern. Koza favors a binary string in his initial examples, which have some nice characteristics from a genetics standpoint. However, he acknowledges that this is overly simplistic, and does not well represent the characteristics of computer programs. In more complex cases, he uses equations to represent behavior, and subsequent generations of programs continually modify the equations according to genetic principles. Since we're already used to representing the behavior of a system with equations, this seems to be one appropriate form of problem representation.

Once the problem representation is selected, the programmer has to develop a measure of fitness. Fitness can be measured in a number of different ways and is highly dependent upon the specification of the problem. If the desired output from the program is known, fitness can be simply the difference between the desired and actual outputs from one of the programs. If the goal is to maximize the result, then fitness is measured by the magnitude of the output.

These steps are not automatic, and the programmer's selection of the representation and the fitness measure can greatly affect the outcome. Koza provides plenty of examples and explains them well, but setting up a problem seems just as much an art as a science. An intuitive feel for the problem may be just as important as good genetic technique. For example, I applied Koza's explanation of symbolic regression to determine the appropriate model form for data that predicts the amount of time needed to manufacture a given item, given the number of items that have already been manufactured. The goal of symbolic regression is to find the functional form, rather than the functional coefficients, for a given set of finite data.

I'd expect this data to take a negative log form, since the problem effectively describes the manufacturing learning curve. However, I defined several functional forms as candidates: F={+, --, *, SIN, COS, EXP, LOG}.

My measure of fitness is the average error between the predicted times and the actual times. For each form, I designed (like Koza, using Lisp) a set of four equations for which I substituted one or more of these transformations. For example, one of those generated was the simple linear-regression form (+ A (* B X)). (A and B are the y-intercept and slope, respectively, calculated as a part of each form.) The result after each generation was 24 different functional forms. I threw out the 12 worst and combined aspects of the most-successful 12 in different ways. My best effort, after three generations, was (+ A (* COS (B) (LOG (* X (-- 0 1))))), which is reasonably close to the negative log answer I'd expect.

The genetic programming approach may be one of the best proposed so far for the slippery concept of machine learning. The computer produces a range of possible outputs, examines the algorithms that produced those outputs, keeps the best ones for another trial, and produces more through the genetic-recombination process. The system is clearly learning to produce a reasonable output.

Upon reflection, genetic programming seems comparable to supervised-learning neural-network techniques. What's the difference between genetic programming and this class of neural networks? Both have the capability of learning, both deal well with nonlinear systems, and both can be viewed as a black box. At one level, the two concepts are clearly related. Through successive trials, each attempts to converge upon an acceptable solution. There are also analogies to fitness, in that possible solutions are discarded if they do not conform well to the expected output.

What is different is the learning algorithm. A neural network adjusts coefficient values as the error is propagated back through the network. It has no "memory" and can return to previous states based on succeeding inputs. The genetic algorithm, on the other hand, tries to improve the overall population fitness in each succeeding generation. Koza gives an example of a genetic technique to choose the appropriate neural-network structure to solve a problem. For anyone who has ever attempted to build a neural net before, this is an attractive alternative to the usual trial-and-error approach.

Lisp is Koza's language of choice for experimentation in genetic programming. There are advantages to using Lisp, not the least of which are the easy manipulation of symbols and the ability to rapidly prototype different genetic structures.

Genetic programming won't replace traditional, procedural programming anytime soon. Most of the programming problems we deal with have exact and readily available solutions. Genetic programming is similar to the expert-system paradigm in that it seeks acceptable, though not necessarily the best, solutions to problems that aren't precisely defined or are nonlinear in nature.

Nonetheless, Genetic Programming is well worth the space on your bookshelf. More and more problems we deal with have these characteristics, and, though still in its conceptual infancy, genetic programming is a potentially powerful approach. Koza's treatment is so comprehensive that, while this may not be the last word on genetic programming, it may be the only word you'll need.

Genetic

Programming

John R. Koza

MIT Press, 1992, 819 pp, $55.00

ISBN 0-262-11170-5