Below we provide brief summaries of selected models that have been successfully applied to a variety of geospatial analysis problems in recent years. These examples are intended as illustrations of the range of applications and the general form of models applied. Readers are referred to the individual papers referenced for full details. As will become apparent, some implementations are clearly better than others, in terms of quality of result, computational overhead, scalability, robustness and/or flexibility.

GA Example 1: TSP

In the introductory section (Section 8.4.1, Genetic algorithms - introduction) we provided a simple TSP sample application as a means of illustrating the basic ideas of GAs applied to a geospatial problem. The C++ library, GALib provides a coded example of such a procedure (example 26), but describes the procedure as being very poor ― there are much better ways of solving the TSP, as identified elsewhere in this Guide.

GA Example 2: Clustering

Painho and Bação (2000) applied GAs to the problem of optimally clustering artificial and real datasets, comparing their results with those obtained by the K-means clustering approach and self-organizing map clustering (SOM) . On the test dataset GA outperformed K-means and was similar to SOM in terms of misclassifications and the standard error of cluster dispersion (SE), whilst with real data GA produced the most tightly clustered results (lowest SE). Implementation of the procedure was carried out using the C++ software library, GALib, from MIT. GA components used in this example were as follows:

Representation: chromosomes with n genes, each gene representing an individual datapoint (hence very long chromosomes). Alleles were the cluster assignments of these genes (1,2,…K). A population size of P=100 was used. The chromosome representation was thus of the form:

A |
B |
C |
D |
E |
F |
G |
… |
… |

1 |
5 |
3 |
3 |
3 |
2 |
7 |
… |
… |

where A, B, C… represent separate data items and 1,5,3… represent the cluster assignments of these data items, i.e. the values assigned to the genes of this chromosome (the gray-shaded string). Although simple, this representation is not scalable so is of limited use for most real-world problems. A more appropriate representational model for clustering with larger datasets is one in which chromosomes have length Kd where K is the number of clusters and d is the number of variables or dimensions of the dataset, recorded as a real variable (see for example, Demiriz et al., 1999, for a fuller discussion). Each set of d numbers in this arrangement represents one cluster center and cluster assignment is based on Euclidean distance rather than genome value.

Fitness: The fitness function chosen was based on the sum of squared error, E, or overall cluster dispersion statistic. Using Painho and Bação’s notation:

where mk is the center of cluster Ck. This expression is very similar to a widely used measure (for the objective function) in K-means clustering (see further, Section 4.2.12, Classification and clustering), and is generally normalized by dividing the overall sum by the number of datapoints, N, to give a mean squared dispersion statistic.

Selection: A replacement percentage of 50% was used. Other details of the selection method are not provided.

Crossover: A crossover probability of 90% was used. Although not specified in this article, single point crossover is the most likely to have been implemented. Clustering algorithms that use the kind of representation applied in this example often use a combination of so-called fusion and fission operators. Fusion involves taking two unique allele values and combining them into a single value, thus combining two clusters into one, whilst fission takes a single allele value and assigns it a different random value, thus breaking a cluster apart.

Mutation: A mutation probability of 1% was used. Again, the details are not provided but can be assumed to be single point random mutation (i.e. altering the cluster assignment of a randomly selected data item).

Stopping criterion: 15000 generations

GA Example 3: Map labeling

Dijk et al. (2002) analyzed the use of GAs to address the map labeling problem. This problem involves placing labels on a conventional 2D map such that (as far as possible) labels do not overlap ― or more specifically, that the maximum possible number of labels on a map do not overlap. For the problem described here labels can be placed in one of four positions relative to point objects: top left, bottom left, top right, bottom right. The basic procedure was tested using a rectangular grid and 1000 points placed such that, with labels of a given fixed size, it was possible for every point to be labeled with no overlap. To remove edge effects the underlying grid was treated as a torus, i.e. wrapping both pairs of edges to give a continuous edgeless surface. Multiple runs of the GA on multiple maps of the type described were tested and the results pooled. Overall the GA performed to within 80% of the optimum, but only after some 400 million intersection tests had been computed. This result is clearly not satisfactory, but can be greatly improved as described below.

Revising the GA by the use of an heuristic crossover procedure and coupling this with a local search optimizer yielded results that were very close to the theoretical optimum with roughly 25% of the number of label intersection tests required. The results with this version of the GA were compared with the simulated annealing algorithm of Christensen et al. (1995) and found to be almost identical across a wide range of point set sizes. The authors conclude that the GA approach has the merit of high quality, scalability (to large numbers of points) and flexibility. The latter was exemplified by their extension of the constraints to include variable label size according to city size, special labels, preferential positioning and coastal city labeling (on the seaward side of the city in question).

Representation: If the possible label positions are denoted 1, 2, 3 and 4, these are the alleles of the representation, with each map point being a gene. Chromosomes are thus long strings of integers, with length equal to the number of points, N, to be mapped. The size of the initial population in this example was not specified. Initialization of the population involved random assignment of alleles.

Fitness: The fitness of a particular chromosome is simply the number of labels that do not intersect. This requires computation of label overlaps for each chromosome, which is the most processor-intensive part of the procedure.

Selection: Tournament, roulette and uniform selection were all tested and tournament selection proved to be the best choice. In this instance the basic tournament size chosen was 2, with the fittest parent being selected in each case. The process was repeated n times to select n parents and produce n children. Subsequently an elitist model was tested, where the two parents and two children were evaluated for fitness and the fittest two retained (i.e. this could yield a mix of parents and children).

Reproduction: For every pair of parents selected two children were produced, either by crossover and mutation or by cloning and mutation. Cloning simply reproduces the parent(s) in the next generation.

Crossover: Several alternatives were tested, with one point crossover performing best in the initial test. However, this procedure does not take into account the known spatial aspects of this problem. For this reason an alternative crossover procedure was designed that reflected the nature of the problem at hand. First, a random point on the map is selected and all neighboring points that could, in principle, contain intersecting labels were identified and marked. This set contains what may be regarded as a building block for this particular problem. The process of random selection and marking continues until 50% of all the mapped points have been marked. Then two children are generated by copying the labellings of the marked points (parent 1) as the first half of the child genome and the labellings of the unmarked points as the second half, and vice versa for the second child. The process continues for the next member of the population by selecting another random starting point. The result of this process is a pattern of crossover that disrupts the local linkage of the building blocks as little as possible. The resulting genomes can, however, be further improved by simple local optimization (see below).

Mutation: Mutation was ignored in this problem.

Local search: The initial experiment did not include a local search facility, whereas the enhanced algorithm added this operation, which substantially improved the quality of solution. Local search was conducted as follows. If the building blocks of newly produced genomes are examined it is possible to identify non-disruptive changes that will reduce the number of intersecting labels. It is also possible to apply additional positioning criteria, e.g. that coastal locations should have their labels placed in the sea. For each building block where new conflicts have arisen, the label is moved to a random available empty slot (i.e. one which no overlapping will occur as a result of the move), assuming one is available.

Termination: In the initial trials, this was set to 400 million intersection tests following prior experiments and consideration of available compute resources.

GA Example 4: Optimum location

Correa et al. (2001) applied a GA to the problem of selecting 26 facilities from 43 candidate sites, each of which has capacity constraints. The search space of possible solutions is 43!/(26!*17!), i.e. just over 4*1011 alternative configurations. The sites to be selected needed to minimize the total travel distance to exam centers for 19,710 students. The problem is thus a classical p-median problem with capacity constraints. In this instance the researchers made the following choices for the GA:

Representation: each chromosome was assigned p genes (p=26 in this case). The value (allele) assigned to a gene was the index number of the site (hence a value from 1 to 43). A population size of P=100 was used

Fitness: Although fitness was essentially computed based on the standard total summed distance for all journeys to facilities, an additional computation was necessary to allocate students to facilities that were not their closest if facility capacity had already been reached. The procedure assigned students to their second, third... etc. nearest facility based on which student would be most disadvantaged (in terms of distance traveled) by so doing. Clearly this is an extremely processor intensive calculation.

Selection: Selection was based on pairs of individuals with a bias towards the fittest individuals. The individuals in the population were ranked according to fitness, 1…P where P is the population size applied to the problem (P=100 in this case). The selection rule used the formula:

where floor is the integer rounding down function and rnd(x) is a uniform random number generator over the interval [0,x]. With P=100 and selecting the first 5 pairs of results from a sample run of this selection model, the following example results (parents selected) were obtained:

(53,23) (31,7) (14,34) (88,10) (34,23)

The selection function is approximately linear with rank over the long run. Offspring were only inserted into the population if their fitness score exceeded that of the worst current member.

Crossover: The procedure applied was a form of one point crossover, but applied to pre-processed ‘exchange vectors’ rather than the source chromosome. These exchange vectors were designed to ensure that offspring do not contain duplicated locations. The exchange vector for parent 1 only contains locations that are not present in parent 2’s chromosome, and vice versa. The length of each exchange vector is computed and a random number of locations less than this value in each case are swapped between the two parents.

Mutation: A single randomly selected location, not in the current chromosome, is selected and used to replace one gene selected at random in the individual’s chromosome. The probability of mutation was set to 1%.

Local search: Without additional local search the test results did not perform quite as well as tabu search (see further, Section 7.2.2, Heuristic and meta-heuristic algorithms). When an interchange heuristic (that the authors describe as hypermutation ― occurring with probability 0.5%, applied to 10% of the population, P) was added to the algorithm, the GA outperformed tabu search in terms of the average distance students would have to travel and the percentage being served by their geographically closest location. Processor overhead for this procedure was high, mainly due to the repeated computation of the fitness function that was required.