Identifying that clustering exists in spatial and spatio-temporal datasets does not provide a detailed picture of the nature and pattern of clustering. It is frequently helpful to apply simple hot-spot (and cold spot) identification techniques to such datasets. For example, Figure 5‑17 shows the location of almost 1000 reported lung cancer cases in part of Lancashire, UK, over the period 1974-83.
Figure 5‑17 Lung cancer incidence data

Visual inspection suggests that several clusters of different types and sizes exist, but the initial inspection of the mapped data is somewhat misleading. Examining the source dataset shows that roughly 50% of the dataset consists of duplicate locations, i.e. points geocoded to the same coordinates, and in several instances 5 or more events share a single location. This is a very common feature of some types of event dataset and may occur for several reasons: (i) rounding to the nearest whole grid reference; (ii) measurement error and/or data resolution issues; (iii) genuinely co-located events, for example, in the infamous Dr Harold Shipman murder cases, 8 of the murders took place at the same nursing home; (iv) allocation of events to an agreed or surrogate location, for example, to the nearest road intersection, to the location at which an incident was reported rather than where it took place (e.g. a police station or a medical practice), or to a nominal address (e.g. place of birth); (v) deliberate data modification, e.g. for privacy or security reasons. The list is extensive, and mapping the data as uniformly sized points does not reflect any of this. In this particular dataset, Figure 5‑17, the observed clustering is essentially a reflection of population density, or more specifically the population at risk.
Very closely located events may also be difficult to distinguish on a map, depending on their separation and the symbology used. ArcGIS incorporates a “collect events” tool that combines such duplicate data and creates a new “count” field containing the frequency of occurrences at each location. The output is a new feature set with one point per unique location, so its attribute table contains fewer records than the original. The count field may then be used as a weight and rendered as variable point sizes, although this may exacerbate overlap problems. The MapInfo add-in, HotSpot Detective, provides similar functionality.
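The idea of collapsing coincident events into weighted points can be illustrated with a minimal Python sketch. This is not the ArcGIS tool itself; the function name `collect_events` and the sample coordinates are hypothetical, and the sketch assumes events are supplied as exact (x, y) coordinate pairs.

```python
from collections import Counter


def collect_events(points):
    """Collapse coincident points into (x, y, count) records.

    `points` is an iterable of (x, y) tuples. Points are treated as
    duplicates only if their coordinates match exactly.
    """
    counts = Counter(points)
    return [(x, y, n) for (x, y), n in counts.items()]


# Hypothetical event coordinates (illustrative values, not the Lancashire data)
events = [
    (355000, 420000),
    (355000, 420000),
    (356100, 421500),
    (355000, 420000),
    (356100, 421500),
    (357250, 419800),
]

collected = collect_events(events)

# Number of events that share their location with at least one other event
n_duplicated = sum(n for _, _, n in collected if n > 1)

for x, y, n in collected:
    print(x, y, n)
print("events at shared locations:", n_duplicated, "of", len(events))
```

The resulting count can serve as the weight for graduated symbols, and the tally of events at shared locations gives a quick check on how much of a dataset is affected by duplication.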
Assuming the point set is unweighted and exhibits marked clustering, it is then useful to ask questions such as: (i) where are the main (most intensive) clusters located? (ii) are clusters distinct or do they merge into one another? (iii) are clusters associated with some known background variable, such as the presence of a suspected environmental hazard (e.g. a power station, a smoke plume from a chemical facility), variations in land-use (farmland, urban areas, water etc.), or variations in the background population or other regional variable? (iv) is there a common size to clusters or are they variable in size? (v) do clusters themselves cluster into higher order groupings? (vi) if comparable data are mapped over time, do the clusters remain stable or do they move and/or disappear? Again, there are many such questions, and even more approaches to addressing them.
Crimestat, Clusterseer and several other packages provide a very useful range of facilities to assist in answering some of these questions. Here we use Crimestat to illustrate a number of these. The first is identification and highlighting of the top N duplicate locations. In order to deal with possible uncertainty in georeferencing, a fuzzy variant of this facility is provided in Crimestat, enabling common locations to be regarded as those falling within a specified range of each other (e.g. within 10 metres). Having conducted this initial analysis Crimestat then provides a range of spatial clustering and hot-spot identification methods, as described in subsections 5.4.3.1 to 5.4.3.3.
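To make the notion of fuzzy duplicate locations concrete, the following Python sketch groups events that fall within a chosen tolerance of one another and reports the top N groups by event count. It is a simple greedy illustration under stated assumptions, not the procedure implemented in Crimestat: each event joins the first existing group whose seed point lies within the tolerance, so results can depend on input order. The function name `fuzzy_top_locations`, its parameters and the sample coordinates are all hypothetical.

```python
from math import hypot


def fuzzy_top_locations(points, tolerance, top_n):
    """Return the top_n fuzzy locations as (seed_x, seed_y, count) tuples.

    A point joins the first group whose seed lies within `tolerance`
    (Euclidean distance, same units as the coordinates); otherwise it
    becomes the seed of a new group.
    """
    groups = []  # each entry: [seed_x, seed_y, count]
    for x, y in points:
        for group in groups:
            if hypot(x - group[0], y - group[1]) <= tolerance:
                group[2] += 1
                break
        else:
            # No existing group is close enough, so start a new one
            groups.append([x, y, 1])
    # Rank groups by the number of events they contain, largest first
    groups.sort(key=lambda g: g[2], reverse=True)
    return [tuple(g) for g in groups[:top_n]]


# Hypothetical coordinates in metres (illustrative values only)
events = [
    (100.0, 100.0),
    (104.0, 103.0),  # 5 m from the first event
    (100.0, 100.0),  # exact duplicate of the first event
    (500.0, 500.0),
    (506.0, 508.0),  # 10 m from the previous event
    (900.0, 900.0),
]

top = fuzzy_top_locations(events, tolerance=10.0, top_n=2)
print(top)
```

Setting the tolerance to zero reduces this to exact duplicate matching. For large datasets a spatial index would be preferable to the nested loop used here.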