Hot spot and cluster analysis


Identifying that clustering exists in spatial and spatio-temporal datasets does not provide a detailed picture of the nature and pattern of clustering. It is frequently helpful to apply simple hot-spot (and cold-spot) identification techniques to such datasets. For example, Figure 5‑21 shows the location of almost 1000 recorded lung cancer cases in part of Lancashire, UK, over the period 1974-83.

Figure 5‑21 Lung cancer incidence data


Visual inspection suggests that several clusters of different types and sizes exist, but this initial inspection of the mapped data is somewhat misleading. Examining the source dataset shows that roughly 50% of the dataset consists of duplicate locations, i.e. points geocoded to the same coordinates — in several instances coincident locations with 5+ events. This is a very common feature of some types of event dataset and may occur for several reasons: (i) rounding to the nearest whole grid reference; (ii) measurement error and/or data resolution issues; (iii) genuinely co-located events — for example, in the infamous Dr Harold Shipman murder cases, 8 of the murders took place at the same nursing home; (iv) allocation of events to an agreed or surrogate location — for example, to the nearest road intersection, to the location at which an incident was reported rather than took place (e.g. a police station or a medical practice), or to a nominal address (e.g. place of birth); (v) deliberate data modification, e.g. for privacy or security reasons. The list of possible causes is extensive, and mapping the data as uniform-sized points does not reflect this. In this particular dataset, Figure 5‑21, the observed clustering is essentially a reflection of population density, or more specifically of the population at risk.

Very closely located events may also be difficult to detect, depending on their separation and the symbology used. ArcGIS incorporates a “collect events” tool that combines such duplicate data, creating a new feature class with fewer elements whose attribute table includes a new “count” field containing the frequency of occurrences at each location. The count field may then be used as a weight and rendered as variable point sizes — although this may exacerbate overlap problems. The MapInfo add-in, HotSpot Detective, provided similar functionality, but is no longer supported.
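
The essence of such an operation is simple to sketch. For example, in Python (a minimal illustration using NumPy, not the ArcGIS implementation; the coordinates shown are purely hypothetical):

```python
import numpy as np

def collect_events(xy):
    """Collapse coincident points into unique locations with a count field,
    sketching the idea behind a 'collect events' style tool."""
    # np.unique over rows returns each distinct coordinate pair and its frequency
    unique_xy, counts = np.unique(xy, axis=0, return_counts=True)
    return unique_xy, counts

# Hypothetical example: three events share one location
xy = np.array([[355000.0, 413000.0],
               [355000.0, 413000.0],
               [355000.0, 413000.0],
               [356200.0, 414750.0]])
locs, n = collect_events(xy)
print(locs)   # two unique locations
print(n)      # [3 1] -- the new 'count' field
```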

Assuming the point set is unweighted, and exhibits marked clustering, it is then useful to identify factors such as: (i) where are the main (most intensive) clusters located? (ii) are clusters distinct or do they merge into one another? (iii) are clusters associated with some known background variable, such as the presence of a suspected environmental hazard (e.g. a power station, a smoke plume from a chemical facility), variations in land-use (farmland, urban areas, water etc.), or variations in the background population or some other regional variable? (iv) is there a common size to clusters or are they variable in size? (v) do clusters themselves cluster into higher order groupings? (vi) if comparable data are mapped over time, do the clusters remain stable or do they move and/or disappear? There are many such questions, and many more approaches to addressing them.

Crimestat, Clusterseer and several other packages provide a very useful range of facilities to assist in answering some of these questions. Here we use Crimestat to illustrate a number of them. The first is identification and highlighting of the top N duplicate locations. To deal with possible uncertainty in georeferencing, a fuzzy variant of this facility is provided in Crimestat, enabling common locations to be regarded as those falling within a specified range of each other (e.g. within 10 meters). Having conducted this initial analysis, Crimestat provides a range of spatial clustering and hot-spot identification methods, as described in the subsections below. Note that in general such techniques are based on static patterns; for some problems (such as analyzing certain types of disease incidence) spatio-temporal techniques are preferable (see, for example, Jacquez and Meliker, 2008, for a discussion of one approach to addressing such questions).
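
By way of illustration, a cheap approximation to fuzzy duplicate identification snaps coordinates to a tolerance-sized grid before counting. The sketch below is in Python; note that Crimestat's own fuzzy matching works on pairwise distances, so this grid-snapping shortcut is an approximation that can split pairs straddling a grid boundary, and all parameter names are illustrative:

```python
import numpy as np

def top_duplicates(xy, tol=10.0, top_n=5):
    """Report the top-N most frequent locations, treating points within
    roughly `tol` map units of one another as coincident by snapping
    coordinates to a `tol`-sized grid."""
    snapped = np.round(xy / tol) * tol
    unique_xy, counts = np.unique(snapped, axis=0, return_counts=True)
    # Rank the snapped locations by frequency, largest first
    order = np.argsort(counts)[::-1][:top_n]
    return unique_xy[order], counts[order]
```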

Hierarchical nearest neighbor clustering

Crimestat provides a general purpose form of clustering based on nearest neighbor (NN) distances. This form of clustering can be single-level or multi-level hierarchical (NNh) and is of particular applicability if nearest neighbor distance is believed to be of relevance to the problem being considered. Events are considered to be members of a level 1 cluster if they lie within the expected mean NN distance under CSR, plus or minus a confidence interval value obtained from the standard error and a user-definable tolerance. These parameters effectively define a search radius within which point pairs are combined into clusters. A further constraint can be applied, specifying the minimum number of events required to constitute a cluster. The mean center and standard deviational ellipses for these clusters are calculated and may be saved in various GIS file formats. These mean centers are then regarded as a new point set, and are subjected to the same type of clustering in order to identify and generate second order and ultimately higher orders of clustering. Clearly the number of points in the initial sample and the degree of clustering have a major bearing on the way in which such clusters are identified. Figure 5‑22 illustrates the results of applying this process for the lung cancer data shown earlier.
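
The following Python sketch implements a simplified, single-level analogue of this procedure, using the standard Clark-Evans expressions for the mean NN distance and its standard error under CSR. Crimestat's exact parameterization (which exposes the confidence adjustment as a slider) differs in detail, and the one-tailed z-value used here is an assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

def nnh_level1(xy, area, min_points=10, z=-1.645):
    """Simplified single-level analogue of NNh clustering: pairs closer than
    a CSR-based threshold are linked, and connected groups with at least
    `min_points` members form level-1 clusters."""
    n = len(xy)
    mean_nn = 0.5 * np.sqrt(area / n)        # expected mean NN distance under CSR
    se_nn = 0.26136 * np.sqrt(area) / n      # standard error of the mean NN distance
    threshold = mean_nn + z * se_nn          # search radius (lower one-tailed limit)

    # Link every pair of events closer than the threshold (union-find)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i
    for i, j in cKDTree(xy).query_pairs(r=threshold):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) >= min_points]
    # Mean centers of the level-1 clusters; feeding these back into the same
    # routine yields level-2 (and higher) clusters
    return [xy[g].mean(axis=0) for g in clusters]
```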

Crimestat also provides a variation on this clustering procedure to account for background or “baseline” variation. It describes the procedure as a risk adjusted NNh method, or RNNh. The background data is represented as a fine grid using kernel density estimation, and this is used to adjust the threshold distance for clustering the original point set, on a cell-by-cell basis. Note that with both NNh and RNNh not all events are assigned to clusters, and each point is assigned to either one cluster at a given hierarchical level or none at all.
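
The core idea of the risk adjustment can be sketched as follows; this illustrates the principle only (choose, cell by cell, a radius whose circle would contain a fixed expected number of baseline events) and is not Crimestat's actual formula:

```python
import numpy as np

def risk_adjusted_radius(baseline_density, expected_count=1.0):
    """Per-cell search radius given a baseline kernel density estimate
    (events per unit area): solve pi * r^2 * density = expected_count."""
    with np.errstate(divide="ignore"):
        return np.sqrt(expected_count / (np.pi * baseline_density))
```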

Figure 5‑22 Lung cancer NNh clusters


K-means clustering

A conceptually different clustering approach is point-set partitioning. Essentially this procedure is a form of K-means clustering, as described in Section 4.2.12, Classification and clustering. The user specifies the value K, and a set of K random points is placed in the study region as seed points. Each point in the dataset is then allocated to the nearest seed point. The set of points assigned to each seed point is then used to create a new set of seed points comprising the centers of these initial groupings. The procedure continues until the sum of distances (or squared distances) from each point to its cluster center seed cannot be reduced significantly by further iterations. Some implementations run the procedure multiple times, with differing initial seeds, selecting the final result from the solution that minimizes overall cluster dispersion. Another widely used option, particularly with large datasets and many variables (dimensions), involves ‘training’ the selection by analyzing the clusters for a subset of the data and then using the best solution set as the starting point for the entire dataset. Crimestat, working on 2-dimensional point-set clustering, attempts to identify very good starting points for the initial seeds by a form of simple density analysis (placing a grid over the point set and identifying distinct areas of point concentration).

Unlike NNh clustering, the K-means procedure assigns all events to a unique cluster and clusters do not form hierarchical groupings. Its dependence on the user's selection of the K-value and the sub-optimal algorithm used are distinct weaknesses. Dependence on K can be reduced by systematically increasing K from 1 upwards, examining the weighting of cluster centers on the problem variables, and plotting the total and average cluster dispersion values as K increases.
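
A short Python sketch of this K-sweep, using scikit-learn's KMeans (which already re-runs the algorithm from multiple random seeds via its n_init parameter), might be:

```python
from sklearn.cluster import KMeans

def kmeans_sweep(xy, k_max=10, n_init=25, seed=0):
    """Partition a 2-D point set for K = 1..k_max, keeping each K's
    lowest-dispersion solution. Plotting inertia against K (the 'elbow'
    plot) helps reduce the dependence on an arbitrary choice of K."""
    results = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(xy)
        results[k] = km.inertia_   # total within-cluster sum of squared distances
    return results
```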

Kernel density clustering

Kernel density estimation (KDE) may also provide an informative (exploratory) tool for hot-spot and cold-spot identification and analysis. Although not strictly a form of clustering, assignment of points to cells whose density exceeds a pre-determined value provides a form of clustering in this case. Figure 5‑23A illustrates the use of KDE for the same point dataset as above (lung and larynx cancer cases in the Chorley-Ribble area of Lancashire) using a quartic (finite extent) KDE model. Note that this dataset is included as one of the test sets distributed with the spatstat R software.
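
As a concrete, simplified illustration, the following Python sketch builds such a quartic-kernel density surface on a regular grid; the parameter names and grid construction are illustrative rather than taken from any particular package:

```python
import numpy as np

def quartic_kde_grid(xy, bandwidth, cell, bbox):
    """Quartic (finite-extent) kernel density surface on a regular grid.
    bbox = (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bbox
    gx = np.arange(xmin, xmax, cell)
    gy = np.arange(ymin, ymax, cell)
    X, Y = np.meshgrid(gx, gy)
    density = np.zeros_like(X)
    for px, py in xy:
        u2 = ((X - px) ** 2 + (Y - py) ** 2) / bandwidth ** 2
        inside = u2 <= 1.0
        # 2-D quartic kernel: K(u) = 3/(pi*h^2) * (1 - u^2)^2 for u <= 1
        density[inside] += 3.0 / (np.pi * bandwidth ** 2) * (1.0 - u2[inside]) ** 2
    return X, Y, density

# Cells whose density exceeds a chosen threshold then define the 'hot spots':
# hot = density > threshold
```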

The pattern shown is largely a reflection of the distribution of population and associated infrastructure (roads etc.) in the region. The density grid illustrated has been overlain with the point set and an ellipse showing the possible relationship between the high-density area in the south of the study area and an old incinerator plant (dot in lower left of ellipse) with its hypothetical smoke plume. The principal interest in this particular dataset was in the relationship between the incinerator location and another, far rarer, form of cancer, affecting the larynx (Figure 5‑23B). A small apparent cluster (4 events located very close to the incinerator site in Figure 5‑23B) had been observed, and the research sought to establish whether this was a real or apparent relationship. The incidence of lung cancer in this instance was being used as a form of control dataset, on the hypothesis that these data represented an estimate of the distribution of the underlying population at risk, and assuming that there was no relationship between lung cancer incidence and incinerator location (a working assumption only). See Diggle (1990), Baddeley et al. (2005) and related papers for more details.

As can be seen from Figure 5‑23, the overall pattern of the larynx cancer cases seems to follow the pattern exhibited by the lung cancer cases, i.e. to be largely a reflection of underlying variation in the at-risk population. Whether there is a real and unexpected cluster in the neighborhood of the old incinerator is difficult to determine. Diggle’s model and tests suggest that the cluster does appear to be significant. But he also notes the sensitivity of the model to the low number of cases — with deletion of just one of these cases there is a reasonable chance the result could have arisen by chance. He also notes the problem of formulating hypotheses based on examining specific apparent clusters. A wide range of comparable regions should be studied, without pre-conceptions, since clusters may well be observed which may or may not be associated with particular facilities (see also the earlier discussion of exploratory cluster hunting in Section 5.2.6, Cluster hunting and scan statistics).

Figure 5‑23 KDE cancer incidence mapping

A. Lung cancer incidence (978 controls)


B. Larynx cancer incidence (58 cases)


Spatio-temporal clustering

It is often the case that event data, such as disease incidence, has associated temporal information, for example: the date of birth; the date of death; the date of diagnosis. In such cases it is often of interest to determine whether there is evidence of spatio-temporal clustering. The question was initially raised by Knox (1964) when investigating cases of childhood leukemia in a region of NE England. Knox proposed a simple means of testing for such clustering, or space-time interaction. He suggested that for each case, i, one could compute the (Euclidean) distance to every other case, j, and record this in a table or matrix as the set {x_ij}. Likewise one could compute the separation in time between each of these events, producing a second set {y_ij}. For the data under consideration Knox then sought to identify a critical distance, D, and critical time interval, T, that appeared to be meaningful for the problem being examined. Using these critical values, the data could then be classified into a 2x2 contingency table, with the suspected spatio-temporal cluster being the count of events that fell in the first cell, i.e. the cell with x_ij<D and y_ij<T. Knox had 96 cases and hence there were 96(95)/2=4560 pairs to consider. Of these there were 152 close pairs in time and 25 close pairs in space, with 5 being close in both space and time. The expected number can be estimated as for standard contingency tables using the marginal totals, i.e. as 152x25/4560=0.8333. The question then arises as to how to test whether the observed value of 5 is significantly greater than the expected value, which is <1. At the time Knox proposed using the Poisson distribution with mean 0.8333 as the basis for significance testing, which suggested that a value of 5 was highly significant.
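
A minimal Python sketch of the Knox procedure, reproducing the arithmetic above, might look as follows (the function name and arguments are illustrative; the Poisson upper-tail probability of 5 or more close pairs, given an expectation of 0.833, is roughly 0.002):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import poisson

def knox_test(xy, t, D, T):
    """Knox's space-time interaction test. xy: (n, 2) coordinates; t: (n,)
    event times; D, T: critical distance and time interval. Returns the
    observed count of pairs close in both space and time, its expectation
    from the marginal totals, and a Poisson upper-tail p-value (Knox's
    original approximation)."""
    d_space = pdist(xy)                     # n(n-1)/2 pairwise distances
    d_time = pdist(t.reshape(-1, 1))        # pairwise time separations
    close_s = d_space < D
    close_t = d_time < T
    observed = np.sum(close_s & close_t)
    n_pairs = len(d_space)
    expected = close_s.sum() * close_t.sum() / n_pairs
    p = poisson.sf(observed - 1, expected)  # P(X >= observed)
    return observed, expected, p

# Knox's leukemia data: 4560 pairs, 152 close in time, 25 close in space;
# expected = 152*25/4560 = 0.833, observed = 5 gives p ~ 0.002
```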

Subsequently many statisticians examined Knox’s approach and ideas, highlighting a number of problems with the method and suggesting a range of modifications and developments. Of particular interest is the paper by Mantel (1967), in which he develops Knox’s ideas; this work has led not only to tests for spatio-temporal clustering, but also to a more general methodology that has been adopted widely in spatial ecology (see further, Section 5.4.5, Proximity matrix comparisons). Mantel proposed using a test statistic of the form:

$$Z=\sum_{i=1}^{n}\sum_{\substack{j=1 \\ j\neq i}}^{n}x_{ij}\,y_{ij}$$

This measure has the advantage of including the actual distance and time measures, rather than discarding this information. Knox’s test can be seen as a special case of Mantel’s Z statistic in which pairs with x_ij<D are coded as 1, pairs with y_ij<T are coded as 1, and all other entries are coded as 0; the Mantel statistic then simply counts the number of close pairs. Note that, depending on how the summation limits are defined, it may be necessary to divide the result by 2, and that in this form distances are treated as symmetric.

To test the significance of the observed Z-value Mantel proposes using a Monte Carlo simulation approach. A simple procedure is to randomly permute the rows and columns of one of the matrices, typically the locations matrix. After each permutation the Z-statistic is computed. A set of, say, 1000 permutations is performed, generating a probability distribution for Z under the assumption that the space-time matching is random. The observed Z-value can then be compared with this computed probability distribution to obtain an estimate of the significance of the observed result.
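
A sketch of this permutation procedure in Python might look as follows (a minimal implementation assuming symmetric n x n matrices of pairwise separations; not an optimized routine):

```python
import numpy as np

def mantel_test(x, y, n_perm=1000, seed=0):
    """Mantel's Z with a Monte Carlo significance test. x, y: symmetric
    n x n matrices of pairwise spatial and temporal separations. Rows and
    columns of one matrix are permuted jointly to build the reference
    distribution."""
    rng = np.random.default_rng(seed)
    n = len(x)
    iu = np.triu_indices(n, k=1)            # use each pair once (i < j)
    z_obs = np.sum(x[iu] * y[iu])
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns of x together, leaving y fixed
        z_perm = np.sum(x[np.ix_(p, p)][iu] * y[iu])
        if z_perm >= z_obs:
            count += 1
    p_value = (count + 1) / (n_perm + 1)    # one-tailed Monte Carlo p-value
    return z_obs, p_value
```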

If the Mantel Z-statistic is amended slightly, normalizing it to fall in the range [-1,1], it can be seen to be a form of product moment correlation coefficient. This is achieved by adjusting the data values in each table by subtracting the mean and dividing by the observed standard deviation (i.e. a z-transform), and then dividing by the number of pairs, n(n-1)/2, less 1 (the degrees of freedom):

$$r=\frac{1}{\dfrac{n(n-1)}{2}-1}\sum_{i<j}\left(\frac{x_{ij}-\bar{x}}{s_x}\right)\left(\frac{y_{ij}-\bar{y}}{s_y}\right)$$

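
In code, this normalized form reduces to the product-moment correlation between the paired off-diagonal entries, for example:

```python
import numpy as np

def mantel_r(x, y):
    """Normalized Mantel statistic: the product-moment correlation between
    the i < j entries of the two symmetric matrices."""
    n = len(x)
    iu = np.triu_indices(n, k=1)
    xs = (x[iu] - x[iu].mean()) / x[iu].std(ddof=1)   # z-transform
    ys = (y[iu] - y[iu].mean()) / y[iu].std(ddof=1)
    m = len(xs)                             # n(n-1)/2 pairs
    return np.sum(xs * ys) / (m - 1)        # degrees of freedom = n(n-1)/2 - 1
```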
Several issues may make the analysis of such data more difficult than appears at first sight. These include: (i) is the examination of pairwise data sufficient (are 3-way interactions important, for example)? (ii) are there any factors in the determination of the temporal data that may bias the results (for example, how close are the onset of a disease and its diagnosis? are cases detected because they are being looked for especially, or defined as cases where they were not so defined in the past or elsewhere in the study region)? (iii) how are critical time and distance values to be determined (for Knox tests), and is it reasonable to assume the critical distance is constant when underlying population densities may vary considerably? (iv) if one examines a region in which an unexpectedly large number of cases has been reported, testing for space-time clustering may show no significant spatio-temporal effects, a result of region pre-selection rather than of any failure to detect a spatio-temporal effect; (v) changes in the underlying population distribution in the study region over time may have occurred, which will affect the results; (vi) the distance measure applied may not be appropriate — Mantel suggested that for contagious diseases in particular a reciprocal transform be used to adjust for the overly strong influence of large distances and times on the results (with a constant included to avoid the distortions apparent with very small times and distances).

Jacquez (1996) proposed using k-nearest neighbors (k-NN) rather than an explicit distance measure. This test is similar to those described above, but does not rely on distance directly — instead he defines the k-nearest neighbors of a case as the set of cases as near or nearer to it than its kth NN. He then defines an expression similar to the basic Mantel or Knox statistic:

$$J_k=\sum_{i=1}^{n}\sum_{j=1}^{n}a_{ijk}\,b_{ijk}$$

where a_ijk=1 if case j is a k-nearest neighbor of case i in space (0 otherwise), and b_ijk=1 if case j is a k-nearest neighbor of case i in time.
In this statistic the variables are taken as binary values, as per the Knox model, so the total is a count. As with Mantel’s test, he proposes permuting one or other table to generate a suitable reference distribution (he suggests permuting the rows of the time matrix). In a range of tests Jacquez demonstrated that the k-NN approach is more powerful than the standard Mantel and Knox methods and less susceptible to several of the issues described above. The value of k may be varied to test the sensitivity of the results to the value chosen.
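
A sketch of this k-NN test in Python might be as follows; it assumes ties in the distance rankings are broken arbitrarily (Jacquez's treatment of ties is more careful), and permutes the time labels to build the reference distribution:

```python
import numpy as np
from scipy.spatial import cKDTree

def jacquez_knn(xy, t, k, n_perm=999, seed=0):
    """Jacquez's k-NN space-time test: J_k counts the case pairs that are
    k-nearest neighbors in both space and time."""
    rng = np.random.default_rng(seed)
    n = len(xy)

    def knn_matrix(coords):
        # Boolean matrix: a[i, j] = True if j is among the k NNs of i
        _, idx = cKDTree(coords).query(coords, k=k + 1)  # k+1 includes i itself
        a = np.zeros((n, n), dtype=bool)
        for i in range(n):
            a[i, idx[i, 1:]] = True
        return a

    a_space = knn_matrix(xy)
    j_obs = np.sum(a_space & knn_matrix(t.reshape(-1, 1)))
    count = 0
    for _ in range(n_perm):
        # Permute time labels and recount matching k-NN pairs
        j_perm = np.sum(a_space & knn_matrix(rng.permutation(t).reshape(-1, 1)))
        if j_perm >= j_obs:
            count += 1
    return j_obs, (count + 1) / (n_perm + 1)
```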

Kulldorff and Hjalmars (1999) sought to address another weakness of the Knox test, that of population shift bias. They showed that if one has the background population data over time, and therefore information on population growth or decline in the various parts of the study region, this element of bias can be removed. This is achieved by randomly assigning cases to a given region and timeslot in proportion to the actual population (or population at risk) in that region at that time. Computations and tests then proceed much as Mantel and Jacquez describe. The difficulty with this approach is that it requires access to data that may not be readily available.

A range of spatial and spatio-temporal cluster analysis tools is provided in the National Cancer Institute’s (NCI) SaTScan software, which is available free of charge from http://www.satscan.org/. These include purely spatial scan statistics (as described further in Section 5.2.6, Cluster hunting and scan statistics), the scanning version of the permutation-type models described above, and the space-time scan statistic developed by Kulldorff (1997). The latter extends the spatial scanning approach to include a time dimension, with the scan region being a cylinder (the spatial scan defining the cylinder’s radius and the temporal scan its height). A key feature of this software is the identification of both the existence of significant clusters (by size) and where these clusters are located. An example application of this approach to crime event data is illustrated in the section: Spatial and Spatio-temporal Data Models and Methods (Figure 4-0).