Cluster hunting and scan statistics


Cluster hunting is the term applied to a family of computationally intensive search procedures for point- and zone-based cluster identification. These techniques aim to identify clusters from the spatial arrangement of incidents combined with basic information on the background population. They search for clusters (areas of unexpectedly high incidence) by exhaustively examining all possible locations on a fine grid covering the study area. One of the best known of these techniques, GAM, is best described in its authors’ own words (from the GAM web site):

“The Geographical Analysis Machine (GAM) is an attempt at automated exploratory spatial data analysis of point or small area data that is easy to understand. The purpose is to answer a simple practical question; namely given some spatial data of something interesting where might there be evidence of localized geographic clustering if you do not know in advance where to look. [This may be] due to lack of knowledge of possible causal mechanisms, or if prior knowledge of the data precludes testing more hypotheses on the same database. Or more simply put, you send GAM a geographically referenced point or small area referenced database and it will indicate where there is evidence of localized clustering and how strong it is”

A Java version of GAM can be downloaded from the University of Leeds CCG web site. This is a working version of the software but is, unfortunately, not the subject of ongoing support or development at present. The Java version of GAM takes as an input a text file of the form:

ID,easting,northing,incidence_count,population_count

The incidence_count value contains the variable of interest, e.g. disease incidence, and the population_count value contains the population variable, e.g. the population at risk. Frequently the location data will be taken as the centers of small area statistics zones, and the population variable will be the total applicable value for that zone (e.g. census output area or tract). The study area is then divided into a grid, and circles are placed at every grid intersection. The number of incidents falling inside each circle is counted and checked to see if it is, by some measure, excessive. The notion of excessive can be user-specified (e.g. more than a certain number or rate of incidents, or pre-computed variable levels for each location) or can be computed by the program based on an expected level defined as the population at risk times the mean incidence rate.
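The expected level described above can be illustrated with a minimal Python sketch. It is not part of the GAM software: the function names are hypothetical, the sample values are invented for illustration, and the input is assumed to have no header row.

```python
import csv
import io

def read_gam_records(text):
    """Parse lines of the form ID,easting,northing,incidence_count,population_count.

    Assumes there is no header row (an assumption, not taken from the GAM documentation).
    """
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        rec_id, x, y, cases, pop = row
        records.append((rec_id, float(x), float(y), float(cases), float(pop)))
    return records

def mean_incidence_rate(records):
    """Overall rate: total incidents divided by total population at risk."""
    total_cases = sum(r[3] for r in records)
    total_pop = sum(r[4] for r in records)
    return total_cases / total_pop

def expected_count(population_in_circle, rate):
    """Expected incidents in a circle: population at risk times the mean rate."""
    return population_in_circle * rate

# Invented sample data, for illustration only
sample = """a,100,200,2,1000
b,150,220,0,500
c,900,900,8,2500
"""
records = read_gam_records(sample)
rate = mean_incidence_rate(records)
```

A circle containing zones a and b would then have a population at risk of 1500, and its observed count of 2 would be compared with expected_count(1500, rate).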

The steps in the analytical procedure are as follows:

Step 1 Read in X (easting), Y (northing), a variable of interest, and data for the population at risk
Step 2 Identify the minimum bounding rectangle (MBR) containing the data, and choose a starting circle radius and degree of overlap. If R is the radius of a circle of area equal to that of the MBR, then the starting radius might be r=R/100 with increment dr=R/100
Step 3 Generate a grid covering this rectangle so that circles of the current radius overlap by the desired amount. A range of radius values r will be used, varying over a user-specified range of distances
Step 4 For each grid intersection generate a circle of radius r
Step 5 Retrieve two counts for the circle: the population at risk and the variable of interest
Step 6 Apply some “significance” test procedure. A variety of alternative procedures are supported, including Monte Carlo simulation and bootstrap methods. The latter is a form of sampling with replacement, similar to jackknifing in some respects; see further Section 6.7.2, Kriging interpolation, and Efron (1982) and Efron and Tibshirani (1997)
Step 7 Keep the result if significant
Step 8 Repeat Steps 5 to 7 until all circles have been processed
Step 9 Increase the circle radius by dr and return to Step 3; once the user-specified range of radii has been covered, go to Step 10
Step 10 Create a smoothed density surface of excess incidence for the significant circles using a kernel smoothing procedure and aggregating the results for all circles (this step uses the Epanechnikov kernel function, Table 4‑8)
Step 11 Map this surface and inspect the results
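Steps 2 to 9 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the GAM implementation: a Poisson upper-tail probability stands in for the Monte Carlo and bootstrap tests, the grid spacing rule 2r(1 - overlap) is an assumed way of expressing the degree of overlap, and the function names, parameters and sample data are all hypothetical.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected); observed is a whole number."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)          # P(X = 0)
    cdf = term
    for k in range(1, int(observed)):   # accumulate P(X <= observed - 1)
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)

def gam_search(records, radii, overlap=0.5, alpha=0.01):
    """records: (x, y, cases, population) tuples. Returns the significant circles
    as (cx, cy, r, observed, expected) tuples."""
    rate = sum(r[2] for r in records) / sum(r[3] for r in records)
    xmin, xmax = min(r[0] for r in records), max(r[0] for r in records)
    ymin, ymax = min(r[1] for r in records), max(r[1] for r in records)
    hits = []
    for r in radii:                                   # Step 9: loop over radii
        step = 2.0 * r * (1.0 - overlap)              # Step 3: assumed spacing rule
        nx = int(math.ceil((xmax - xmin) / step)) + 1
        ny = int(math.ceil((ymax - ymin) / step)) + 1
        for i in range(nx):                           # Step 4: circle per grid node
            cx = xmin + i * step
            for j in range(ny):
                cy = ymin + j * step
                cases = pop = 0                       # Step 5: two counts
                for x, y, c, p in records:
                    if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                        cases += c
                        pop += p
                if cases == 0:
                    continue
                expected = pop * rate
                if poisson_upper_tail(cases, expected) < alpha:   # Steps 6 and 7
                    hits.append((cx, cy, r, cases, expected))
    return hits

# Invented data: a 5 x 5 lattice of zones with one zone of raised incidence
points = [(10.0 * i, 10.0 * j, 1, 1000) for i in range(5) for j in range(5)]
points = [(x, y, 15 if (x, y) == (20.0, 20.0) else c, p) for x, y, c, p in points]
hits = gam_search(points, radii=[5.0])
```

Because the circles overlap, a single unusual zone is typically flagged by several neighbouring circles, which is why the results are aggregated into a surface rather than reported individually.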

GAM with kernel smoothing is generally referred to as GAM/K, and it is in this form that it is currently used. It is by no means the only method of cluster hunting, but has been found in tests on synthetic data to be one of the most effective at locating genuine clusters in large test datasets, and also one of the least subject to finding false clusters. Further discussion relating to this topic is provided in Section 5.4.4, Hot spot and cluster analysis.
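The kernel smoothing stage of GAM/K can be illustrated with a short Python sketch. The assumptions here are not confirmed by the GAM documentation: each significant circle contributes its excess (observed minus expected) weighted by an Epanechnikov kernel whose bandwidth equals the circle radius, and the one-dimensional constant 0.75 is used as the normalising factor. The function names and values are hypothetical.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel weight for a scaled distance u (zero beyond |u| = 1)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def excess_surface(circles, xs, ys):
    """Accumulate kernel-weighted excess incidence on a raster.

    circles: (cx, cy, r, observed, expected) tuples for the significant circles.
    xs, ys:  coordinates of the raster columns and rows.
    Returns surface[row][col].
    """
    surface = [[0.0 for _ in xs] for _ in ys]
    for cx, cy, r, observed, expected in circles:
        excess = observed - expected
        for j, y in enumerate(ys):
            for i, x in enumerate(xs):
                u = math.hypot(x - cx, y - cy) / r   # distance scaled by bandwidth
                surface[j][i] += excess * epanechnikov(u)
    return surface

# One invented significant circle: 15 observed against 3 expected, radius 10
surface = excess_surface([(0.0, 0.0, 10.0, 15, 3)], xs=[0.0, 5.0, 10.0, 20.0], ys=[0.0])
```

The contribution is largest at the circle centre and falls to zero at its edge, so overlapping significant circles reinforce one another on the resulting surface.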

GAM has been the subject of considerable controversy over the years and alternative methods are often used in preference, in particular the spatial scan statistic due to Kulldorff (1997) and provided in the SaTScan software. Again, using the authors’ own description of the facilities provided, “SaTScan is designed to:

Perform geographical surveillance of disease, to detect spatial or space-time disease clusters, and to see if they are statistically significant
Test whether a disease is randomly distributed over space, over time or over space and time
Evaluate the statistical significance of disease cluster alarms
Perform repeated time-periodic disease surveillance for early detection of disease outbreaks

SaTScan uses either a Poisson-based model, where the number of events in a geographical area is Poisson-distributed, according to a known underlying population at risk; a Bernoulli model, with 0/1 event data such as cases and controls; a space-time permutation model, using only case data; an ordinal model, for ordered categorical data; an exponential model for survival time data with or without censored variables; or a normal model for other types of continuous data. The data may be either aggregated at the census tract, zip code, county or other geographical level, or there may be unique coordinates for each observation. SaTScan adjusts for the underlying spatial inhomogeneity of a background population. It can also adjust for any number of categorical covariates provided by the user, as well as for temporal trends, known space-time clusters and missing data. It is possible to scan multiple data sets simultaneously to look for clusters that occur in one or more of them.”
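For the Poisson-based model, the spatial scan statistic of Kulldorff (1997) compares each candidate window with the rest of the study area using a likelihood ratio; the window with the largest ratio is the most likely cluster, and its significance is assessed by Monte Carlo replication. The Python sketch below evaluates the logarithm of that ratio for a single window. It is an illustration only and not SaTScan code: the function name and arguments are hypothetical, and the expected count is assumed to be positive and already scaled so that expected counts over the whole study area sum to the total number of cases.

```python
import math

def poisson_log_likelihood_ratio(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one window under the Poisson scan model.

    cases_in:     observed cases inside the window (c)
    expected_in:  expected cases inside the window under the null (E), with E > 0
    total_cases:  total observed cases in the study area (C)
    Returns 0.0 when the window does not show an excess (c <= E).
    """
    c, e, big_c = cases_in, expected_in, total_cases
    if c <= e:
        return 0.0
    llr = c * math.log(c / e)
    if big_c > c:
        # contribution from outside the window; zero when every case is inside
        llr += (big_c - c) * math.log((big_c - c) / (big_c - e))
    return llr

# Invented example: 10 cases observed where 5 were expected, out of 100 in total
example_llr = poisson_log_likelihood_ratio(10, 5.0, 100)
```

In a full scan this value would be computed for every candidate window, and the maximum compared with the maxima obtained from randomly generated datasets.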

SaTScan has been applied in a very wide range of fields, and many of the resulting studies are listed and available as downloadable articles. These cover many medical topics (notably cancer research and the study of infectious diseases), together with studies in fields ranging from criminology to demography, history, and accidental poisoning. Many of the papers can be selected from the SaTScan bibliography: http://satscan.org/references.html