Cultured Perl: Genetic Algorithms Applied with Perl

Create Your Own Darwinian Breeding Grounds

Based on the Darwinian principle of survival of the fittest, genetic programming uses mutation and replication to produce algorithms for creating ever-improving computer programs. In this column, you’ll get to know the genetic algorithm in simple terms. Ted provides Perl implementations for some specific tasks, which you can adapt for generic use. To demonstrate the genetic algorithm, Ted breeds numbers for fitness to a formula, and letters to form English words.

You can run the examples in this article if you have Perl 5.005 or later installed on your system. Preferably, your system should be a recent (2000 or later) mainstream UNIX (Linux, Solaris, BSD) installation, but other types of operating systems may work as well. The examples may work with earlier versions of Perl and UNIX, and other operating systems, but their failure to function should be considered an exercise for the reader to solve.

History

The progress of genetics in the 20th century was rivaled in speed and reach only by the evolution of electronics and computer science. It is fitting that one of the most intriguing algorithms to come about in the 20th century is the genetic algorithm.

Emerging in the early 1960s, the genetic algorithm (and evolutionary algorithms in general) took a place in computer science between deterministic and non-deterministic algorithms. The genetic algorithm, essentially, is as deterministic as you want to make it, meaning that the user can decide the number of iterations and termination criteria. It mimics Darwinian natural selection, making “fitness” (as determined by a formula applied to an individual) the chief selector of individuals for survival and procreation, together with mutation.

Other evolutionary algorithms attempt to mimic Lamarckian evolution, where behavior as a survival mechanism can be transferred between generations, and there are even evolutionary programs that write themselves for a particular purpose. All of those are beyond the scope of this article.

The main drawback of Perl for genetic algorithms is speed. Because of the computing needs of genetic algorithms, they are more efficiently implemented in C or other low-level pre-compiled languages. The Perl examples shown here are not as fast as their equivalents in C, but they will show you how the genetic algorithm works, and they are fast enough for some problems.

Resources

• Read Ted’s other Perl articles in the “Cultured Perl” series on developerWorks.

• Thanks to Abigail, whose CPAN Sample module exhibits the sample() function I use in both examples. The Sample module and its documentation are wonderful tools for any Perl programmer.

• Visit CPAN for all the Perl modules you ever wanted.

• Check out Perl.com for Perl information and related resources.

• Visit perldoc.com for Perldoc information online.

• “Programming Perl, Third Edition”, by Larry Wall, Tom Christiansen, and Jon Orwant (O’Reilly & Associates, 2000) is the best guide to Perl today, up-to-date with 5.005 and 5.6.0.

• “Mastering Algorithms with Perl”, by Jon Orwant, Jarkko Hietaniemi, and John Macdonald (O’Reilly & Associates, 1999) is a great compendium of algorithms expressed in Perl. Chapter 14, “Probability,” shows how to do weighted and unweighted probability distributions with Perl.

• The Genetic Algorithms FAQ is quite outdated, but it does point to useful suites of genetic algorithm software, both free and commercial.

• Related developerWorks articles by Teodor Zlatanov include:

One-liners 101

Small observations about the big picture

• Browse more Linux resources on developerWorks.

• Browse more Open source resources on developerWorks.

So What Is the Genetic Algorithm?

The genetic algorithm is simple enough for anyone to understand, using biological terms taught in high school. Take a population of individuals, each with DNA of its own. Then measure each individual’s fitness (a function applied to the individual’s DNA) and make the individual more likely to procreate if it is more fit. Extremely unfit individuals are killed off. Every survivor gets a chance to procreate (and it is important that procreation is never denied to a survivor, only made less likely if the survivor is less fit). Procreation is simulated by merging the parents’ DNA and then applying a random mutation to the merged DNA. The new individual will, in theory, be about as fit as the parents, plus or minus a small variance caused by the mutation. The cycle is then repeated.
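
In Perl terms, one pass of the cycle might look like the following sketch. The function names match the ones used in the listings later in this article, but the wiring here is illustrative rather than the listings’ actual code, and $population_size, $generations, and $mutation_rate are assumed parameters.

my $population = init_population($population_size);

for my $generation (1 .. $generations) {
    # measure fitness and mark the unfit: they lose the "survived" flag
    survive($population);

    # fitness-weighted choice of who gets to breed
    my @parents = select_parents($population);

    # children get merged parent DNA...
    $population = recombine(\@parents);

    # ...plus a small random mutation
    mutate($population, $mutation_rate);
}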

There are, obviously, many variables that can affect the genetic algorithm, including population size, generations (iterations of the algorithm), merging method, fitness function, how fitness affects procreation chances, and how much mutation happens.

There are some drawbacks to the algorithm as well. It works best if the fitness function is applied to the DNA as a series of bits; in other words, it’s best if the DNA is a series of binary options, yes or no. Blue eyes or black eyes? Red hair or black hair? Also, merging of parent DNA and subsequent mutation must not be allowed to produce certain combinations of bits, because the resulting DNA would not be a valid solution to the original problem. Remember that the “DNA” is nothing more than input to a mathematical fitness formula, and some values used in that formula may be invalid (for instance, values that cause a division by zero).

In addition, the genetic algorithm is not inherently bounded in time. You pick the number of generations. You can define some goal (for instance, “look for an individual with fitness of 0.99999”) and stop there, but then the algorithm may never end, because it may never find that individual. That can be a problem if you set unrealistic goals or too few generations. Trial and error, and a good feel for the problem, are the best ways around this issue.

The fitness formula should return a floating point number between 0 and 1. Other ranges can be used, but in my experience floating point numbers work best. You may want a range between 0 and 32767, for instance, if you want the fitness to fit in a 15-bit integer for optimization.

Of course, delaying optimization until you know it’s needed is a good idea, so you should at least start with a simple fitness formula. The fitness formula is the most-used function in the genetic algorithm (it will be invoked (population size) x (generations) times), so you should make it as simple and as fast as possible.

There are three “good” ways to exit the genetic algorithm. First, you can decide to exit if there is no variety in the DNA pool anymore. This is a tricky test, actually, and a CPAN module for finding the distance of strings might be useful here, as long as you can represent the DNA as a string. Second, you can exit if you have reached a fitness goal. Unless you have a very good understanding of the fitness formula (in which case, you probably don’t need the genetic algorithm anyway), setting a fitness goal can result in either infinite loops, or an individual who is only “good enough.” Third, you can exit the algorithm after a set number of iterations, or “generations.”
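
The first test can be as simple as checking whether only one distinct DNA value is left in the pool. Here is a minimal sketch, assuming each individual is a hash reference whose dna field can be used as a string:

sub converged {
    my $population = shift;
    my %distinct = map { $_->{dna} => 1 } @$population;
    return keys(%distinct) == 1;    # no variety left in the DNA pool
}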

In practice, all three ways (or at least the second and third) are used to control the genetic algorithm. A few test runs, maybe 10 or 20, will give you a good feel for how quickly the algorithm converges and what kind of fitness you can expect.

Genetic Algorithms FAQ

See the Resources section for the link to the Genetic Algorithms FAQ. Although the FAQ is quite outdated, it points to useful suites of genetic algorithm software, both free and commercial.

A Simple Example

The code in Listing 1 uses a single byte as DNA (value between 0 and 255, 8 bits). The fitness formula is applied once to every new individual, taking the byte value of the DNA and dividing it by 256. This means that the fitness formula will always return a number between 0 and 255/256, so it can never reach 1. What do you think the fittest DNA will be?
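
The fitness formula itself is a one-liner; here is a sketch of the idea (not necessarily the exact code in the listing):

sub fitness {
    my $dna = shift;      # a single byte, 0 to 255
    return $dna / 256;    # always in [0, 255/256], so it never reaches 1
}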

numbers.pl source

There are several interesting things about Listing 1. Its main loop is at the beginning, and you should understand all the pieces and how they work together on the population (and yet they are separate, so we can reuse them in the next example). You can execute Listing 1 with numbers.pl.

We build the weights array in the select_parents() function by stacking map() on top of grep(). While it could have been written as a loop, the one-line solution is much cleaner, and it does not slow the program down significantly.

my @weights = map { $_->{fitness} } grep { $_->{survived} } @$population;

The $population array reference is de-referenced. Then, only elements of the array with the “survived” field (set earlier by the survive() function) get through the grep. The survivors are then distilled to their fitness numbers, which get put in their places in the weights array.

The population size was chosen to be 256, because that way it was easy to initialize every individual to a number equal to its index. Feel free to start with different population sizes.

Mutation rates of more than 1% caused the maximum and minimum fitness to fluctuate wildly, and the population never stabilized towards high fitness. Lower mutation rates caused the population as a whole to reach high fitness much more slowly. In the end, 1% was about right for our population size.

The breeding selection algorithm picks one parent by looking at the weights – in effect, every individual gets a chance to be a parent, but the number of parent slots is finite. The second parent is picked at random from the parent population. Why? Well, we could have used weights to determine the second parent as well, but this way we ensure every parenting individual has a chance to participate in the breeding process.
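
The listings use the sample() function from Abigail’s Sample module for random picking (see Resources). If you want to hand-roll the fitness-weighted pick, the classic technique is roulette-wheel selection, sketched here with illustrative names:

sub pick_weighted {
    my ($candidates, $weights) = @_;    # two parallel array references
    my $total = 0;
    $total += $_ for @$weights;
    my $roll = rand $total;             # a random point on the wheel
    for my $i (0 .. $#$candidates) {
        $roll -= $weights->[$i];
        return $candidates->[$i] if $roll < 0;
    }
    return $candidates->[-1];           # guard against floating-point drift
}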

The actual breeding is accomplished with a random 8-bit bitmask. We just AND the bitmask to the first parent’s DNA (remember, it’s just a byte) and AND the inverted bitmask to the second parent’s DNA. The effect is that we pick random bits from one parent, and then the rest of the bits come from the other parent.

The mutation is accomplished by AND and OR of individual DNA with random 8-bit bitmasks.
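
For single-byte DNA, both operations fit in a few lines. The following is a sketch, not the listing’s exact code; it reads the AND-and-OR description above as setting or clearing one random bit:

sub recombine {
    my ($mom, $dad) = @_;
    my $mask = int rand 256;                         # a random 8-bit bitmask
    return ($mom & $mask) | ($dad & ~$mask & 0xFF);  # some bits from each parent
}

sub mutate {
    my ($dna, $rate) = @_;
    return $dna unless rand() < $rate;    # mutate only a small fraction
    my $bit = 1 << int rand 8;            # a bitmask with one random bit set
    return rand() < 0.5
        ? $dna | $bit                     # OR: turn the bit on
        : $dna & (~$bit & 0xFF);          # AND: turn the bit off
}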

Oh, and by the way, the most fit DNA was 255, of course. You don’t need to wait 100,000 generations. Just hit Ctrl-C when you are done admiring the status line.

Breeding Words

For this example, we will make the DNA 40 bits (5 bytes). Each byte will represent a letter or a space. We could have packed more information into each byte, but it would have obscured the purpose of the example. Every byte’s value (0-255, numerically) will be either A through Z (if the value is between 65 and 90, conveniently chosen to match the ASCII set), or a space (if the value is 0 to 64, or 91 to 255).
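
In Perl, that byte-to-letter mapping is a short conditional; the helper name here is illustrative:

sub byte_to_char {
    my $value = shift;                                   # 0 to 255
    return chr $value if $value >= 65 && $value <= 90;   # ASCII 'A' through 'Z'
    return ' ';                                          # everything else is a space
}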

Note how much of the example is similar to the example in Listing 1. The dictionary of words follows the program.

words.pl source

The main problem with this example was that DNA lengths greater than 32 bits were hard to manage. I started out trying to do my own bit operations, which were not only unwieldy, but also extremely slow. Then I tried the Math::BigInt package, which was also extremely slow for this purpose. You can execute Listing 2 with words.pl.

Finally, I settled on the vec() function — it is quite fast, and was the right choice for handling DNA (essentially, the DNA is a bit vector, a built-in Perl data structure). Use “perldoc -f vec” to find out more about the vec() function.
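
If vec() is new to you, here is a minimal demonstration of treating an ordinary string as a vector of 8-bit elements (the values are arbitrary):

my $dna = '';                  # an empty bit vector is just an empty string
vec($dna, 0, 8) = 72;          # write element 0
vec($dna, 4, 8) = 90;          # write element 4; the string grows as needed
print vec($dna, 0, 8), "\n";   # prints 72
print length($dna), "\n";      # prints 5: five bytes of DNA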

It is possible to end up with 1024 individuals with 0 fitness. That’s why this example has better safeguards against such an “accident” than the first example.

The init_population(), recombine(), and mutate() functions were changed to handle bit vectors instead of bytes.
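
Reworked for bit vectors, recombine() can apply the same bitmask idea one 8-bit element at a time. This is a sketch under the 5-byte DNA assumption, not the listing’s exact code; mutate() can follow the same pattern:

sub recombine {
    my ($mom, $dad) = @_;
    my $child = '';
    for my $i (0 .. 4) {                   # five 8-bit elements of DNA
        my $mask = int rand 256;           # a fresh random mask per element
        vec($child, $i, 8) = (vec($mom, $i, 8) & $mask)
                           | (vec($dad, $i, 8) & ~$mask & 0xFF);
    }
    return $child;
}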

The dna_to_words() function is somewhat inefficient, but is not invoked often enough to make a difference. The biggest slowdown comes from the fitness() function trying to match all the words in the dictionary, plus all the letters in the alphabet.
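
For reference, the DNA-to-characters conversion can be as simple as this sketch, reusing the byte_to_char() helper shown earlier (again illustrative, not the listing’s exact code):

sub dna_to_string {
    my $dna = shift;    # a 5-byte bit vector
    my $string = '';
    $string .= byte_to_char(vec($dna, $_, 8)) for 0 .. 4;
    return $string;
}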

Fitness was calculated as: 2 for each letter in the DNA, plus the frequency of that letter in the dictionary, plus 2^N for every dictionary word of length N in the DNA. The dictionary array and letter frequencies hash were obtained only once (using a closure). Feel free to modify the fitness function and the dictionary to breed your own English words. The fitness formula shown is heavily biased towards letters, and takes a while to converge on English words (though “on” and “in” were pretty frequent visitors).
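
A sketch of that scoring scheme follows, with illustrative names; the real listing caches the dictionary array and letter-frequency hash in a closure, as noted above, and may scale the raw score differently:

sub fitness {
    my ($string, $dictionary, $letter_freq) = @_;
    my $score = 0;

    # 2 points per letter, plus that letter's frequency in the dictionary
    for my $letter ($string =~ /([A-Z])/g) {
        $score += 2 + ($letter_freq->{$letter} || 0);
    }

    # 2**N points for every dictionary word of length N found in the DNA
    for my $word (@$dictionary) {
        $score += 2 ** length($word) if index($string, $word) >= 0;
    }

    return $score;
}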

Conclusion

The evolutionary genetic algorithm is a fascinating topic that can hardly be exhausted in a single article. I hope you experiment with the examples and create your own Darwinian breeding grounds. It is especially entertaining to play with the fitness function in the second example, and watch English words appear out of noise.

The techniques shown in the examples range from beginner to advanced level, so try to understand them thoroughly. Often there is room for improvement. The vec() function is especially interesting. It is perfectly suited for long bit vectors such as DNA or other numeric data.

Write your own genetic algorithm implementation. Compare it with mine, and learn from the shortcomings (not necessarily yours, either). Implementing an algorithm is tricky business. There’s a lot you can do wrong, and only a few ways to get it right.


About the Author:

Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work, Perl, text parsing, three-tier client-server database architectures, and UNIX system administration. Contact Ted with suggestions and corrections at tzz@bu.edu.
