|
|
In many disciplines the distance between events (e.g. trees, cases of a disease, bird’s nests) reflects an underlying process, for example competition for food or nutrients, birth and dispersal processes, infection or contagion. Particular attention is therefore focused on the nearest event to a particular other event of point of interest. This nearest event is known as the nearest neighbour (NN) or first-order nearest neighbour. The second nearest event is then the second-order nearest neighbour and so forth to kth-order NN. A global (whole area) measure of a point pattern is the mean distance to the kth-order nearest neighbour, and more typically for k=1.
The steps involved in computing this measure are as follows:
· Input coordinates of all points {xi,yi}
· Compute (symmetric) distances matrix D between every pair of points
· For each i, sort the distances to identify the 1st, 2nd,...kth nearest values
· Compute the mean of the observed 1st, 2nd, ...kth nearest values
· Compare this mean with the expected mean under Complete Spatial Randomness (CSR or Poisson) model
To undertake the last step in this sequence the expected mean under the CSR hypothesis is needed. The distribution of NN distances can be obtained from the terms of the Poisson distribution (see Section 1.5.2.6) using the information in Figure 5‑14. In this we envisage the point under consideration as lying at the centre of a circle, radius r.
Figure 5‑14 Nearest Neighbour distribution

Assuming that the overall density of points in the study region is m, the expected number in a circle of radius r is simply mπr2, which is where this term comes from in the expressions shown below. The model examines the probability that there are no other points within this circle of radius r, p(r,0), but there is exactly one, p(dr,1), in the thin yellow annular region shown, i.e. in the interval between r and r+dr from the original point. We can now use this information, as dr→0 to obtain the distribution of nearest neighbour distances under CSR, and hence the moments of this distribution, including the mean and variance. The steps for deriving the first-order NN distribution are shown below, together with the first three central moments:

Let x=mπr2 then dx=2mπrdr and r=(x/mπ)1/2, hence p(r)=e-xdr, thus the mean, μ, may be obtained as the first moment, μ1 or μ, by integration:

where Γ() is the Gamma function. The variance, μ2 or σ2, can be obtained in a similar manner, giving the result:
![]()
The mean distance to NN under CSR is thus a simple function of the density, m. For a uniform (completely dispersed) distribution of points the expected mean distance to NN is simply double the CSR value. Assuming that the density can be estimated reliably (a significant issue in itself) then a simple index of global spatial randomness can be obtained by taking the ratio of the observed mean, distance to NN,
, divided by the expected mean distance,
:
![]()
Values of R<1 suggest greater clustering (closer spacing) than would be expected under CSR, whilst values of R>1 suggest a more even distribution. The significance of the observed value can be tested, assuming we have a fairly large number of points, e.g. n>100. Using the z-transform (based on the standard error rather than the variance, as we are comparing mean values), we have:
![]()
Programs such as Crimestat provide index values and significance estimates, whilst ArcGIS Spatial Statistics toolbox simply provides index information. The general expression for the crude moments, μα', α =1,2,3…, of the kth-order nearest neighbour distribution in D-dimensions (D=1,2,3..) with population density, m, is:
![]()
where Φ is the volume of a unit hypersphere:
,
and Γ(x) is the standard Gamma function. In 2D this expression for the mean (α=1, D=2) simplifies to:

Hence for k=1 this simplifies further to:
, as above
A number of comments about this form of basic point-pattern analysis need to be made. The first is that the statistic depends heavily on the parameter, m. This is usually estimated by dividing the study area, A, by the number of points, N, i.e. m=A/N. However, deciding on the appropriate study area may alter the results substantially (Table 5‑7). In this example three alternative region boundaries have been selected: SDC — standard distance circle; 2SDE — an ellipse of size 2 standard deviations; and the MBR of the point set — Minimum Bounding Rectangle. These three regions result in substantially different ratio values and z-scores for the point set, primarily as a result of their different areal extents.
Table 5‑7 NN Statistics and study area size
One solution to this problem is to be very careful about the boundary specification, ideally using a user-defined polygon that corresponds to a meaningful boundary (e.g. a physical habitat boundary), or possibly the convex hull of the point set plus a buffer zone (see below). Most available software does not support such boundary specification, although add-ins have been developed to facilitate this kind of region definition.
A related problem to that of region definition is that of edge-related errors (boundary effects). If the NN to a selected event is actually outside the study region the apparent NN found will be further away than the true value, leading to over-estimation of mean distances (see, for example, Figure 4‑29). This problem is most severe with low numbers of points and higher order NN analyses. It can be minimised by careful selection of the study boundary, the application of edge correction factors (e.g. as provided within Crimestat), by applying a boundary guard zone (e.g. a inner or outer buffer), by remapping the space onto a torus (e.g. so each of four MBR edges are treated as if they wrap onto the opposite pair) and lastly, by applying NN analysis only where the number of events is quite large and the NN order of analysis is not too high.
If there are very good reasons to assume that the event distribution is truly random over a particular study region, then the formula for the mean distance to NN can be inverted to obtain an estimate of the point density, m:
![]()
First-order NN analysis provides a simple global approach to point pattern analysis, but is of limited use for most real-world problems. Frequently real or apparent clustering is observed, and we may wish to consider questions such as:
· Is the observed clustering due to natural background variation in the population from which the events arise?
· Over what spatial scales does clustering occur?
· Are clusters a reflection of regional variations in underlying variables?
· Are clusters associated with some feature of interest, such as a refinery, waste disposal site or nuclear plant?
· Are clusters simply spatial or are they spatio-temporal?
These questions demand tools that are more exploratory in nature rather than strictly descriptive. Many tools have been developed to assist with answering such questions. The simplest, perhaps, is to plot the value of the kth order NN index against the order level, k. This option is supported with the Crimestat package, amongst others.
A rather better approach (making greater use of the underlying data, particularly with larger point sets) is to examine the observed frequency distribution of nearest neighbour distances. This can be achieved by dividing the observed nearest neighbour distances into evenly spaced distance bands, 0‑d1, d1-d2, d2-… and then comparing these frequencies (expressed as proportions of the total) to the expected distribution under CSR. Comparison is usually carried out on the cumulative probability distribution, G(r), which may be obtained by integration of the expression for p(r) above:
![]()
Typically both the observed and expected cumulative distributions are plotted and visually compared. A variation on this arrangement is to select random points in the study area and compute the distance from these to the nearest event. The observed cumulative distribution in this case is denoted F(r), or commonly with a ^ (hat) symbol above the F to indicate that it is an estimate based on the observed data (sometimes referred to as Fhat). It has an expected cumulative distribution as per G(r). However, if the sample pattern is more clustered than random the observed F(r) plot will differ from the observed G(r) plot, rising far more slowly.
Support for F(r) and G(r) computation and associated Monte Carlo ) simulation (see below) is not provided within most GIS and related packages, but it is provided in the SPLANCS package (S-Plus and R-Plus implementations — FHAT and GHAT functions). It is also relatively straightforward to generate these functions programmatically.
In principle a measure of the difference between the observed and expected cumulative distributions can be computed (e.g. the Kolmogorov-Smirnov test statistic) and its significance determined. However, the set of nearest neighbour distance does not represent a set of independent samples — almost by definition NN distances are non-independent and frequently are reflexive — i.e. the NN of event A is event B and the NN of event B is event A. Even under CSR such reflexivity occurs for over 62% of first order neighbours. For these and related reasons (e.g. edge effects) significance testing involves Monte Carlo simulation. For example, with N observed events in a study area A, one can simulate a set of N purely random events within the same region and compute the cumulative distribution for this set. The maximum absolute difference between the simulated cumulative distribution and the theoretical cumulative distribution can then be calculated and the value recorded. The process in then repeated a large number of times (e.g. T=999 times). Let X be the number of times the recorded difference was larger than that between the observed pattern and the expected cumulative distribution. Then p=(1+X)/(1+T) is the estimated probability of observing a difference of the magnitude found. Thus if X=99 and T=999 we would have p=0.1 or 10%.
A similar and in many ways more powerful approach, known as Ripley’s K (or L) function, is supported within several packages, including Crimestat and SPLANCS. Ripley’s procedure utilises all event-event distances, not just nearest neighbour distances. It operates as follows (Figure 5‑15):
· Construct a circle, radius d, around each point (event), i (d rather than r is generally used for the circle radius in this technique)
· Count the number of other events, labelled j, that fall inside this circle
· Repeat these first two stages for all points i, and then sum the results
These steps equate to computing the sum:
![]()
where I(dij)=1 if the distance, dij, from i to j is less then d otherwise I(dij)=0
· Increment d by a small fixed amount (e.g. R/100, where R is the radius of a circle of area equal to the study area, A)
· Repeat the computation, giving values of K(d) for a set of distances, d
Figure 5‑15 Ripley’s K function computation

The function K(d) can then be plotted against distance, d, in a manner similar to that described in Section 6.7 for correlograms and variograms. The graph of the (transformed) K(d) function provides an excellent insight into localised clustering, with characteristic graphs reflecting particular observed patterns. Computation of (the transformed) K(d) function for a mapped dataset is supported in both SPLANCS (KHAT and KENV functions) and Crimestat, and in both cases Monte Carlo) simulation of confidence envelopes is provided.
As noted earlier, under CSR the expected number of events within a circle of radius r or d is simply the event density, m, times the circle area. With m estimated as the overall number of events divided by the study area, N/A, we have
![]()
The observed K(d) function minus the expected value above has an expected value of L(d)=0 for a given d. L(d) measures the difference between the observed pattern and that expected under CSR. It is generally computed plotted with d subtracted from the expectation:
![]()
If the component d is not subtracted (as per Ripley’s original formulation of this method) then the L(d) plot against d for a truly random distribution will be a line at 45 degrees (i.e. L(d)=d) whereas with d subtracted it will be approximately 0 for all d. This transformed expression is the form in which the K-statistic is analysed within Crimestat.
The sampling distribution of L(d) is unknown, but may be approximated by Monte Carlo simulation, as described earlier. Crimestat produces its simulations based on a rectangular region of area A and shape matching the MBR, whilst SPLANCS utilises a user-specified polygon, which is more satisfactory in most cases. Figure 5‑16 shows the L(d) plot for the random point set shown in Figure 5‑12. The envelope shown provides the maximum and minimum L(d) curves generated from 300 simulation runs. The observed plot lies within this envelope, so it is reasonable to suppose that this pattern is broadly random rather than more clustered or even spaced than random. As d is incremented the behaviour of the statistic becomes increasingly subject to border effects. Ideally the number of points utilised should be large (well in excess of 100 if possible) and where necessary edge correction procedures should be adopted. Simple edge correction can be implemented by weighting circular regions that lie partially outside the study boundary by the proportion lying within the study region.
Figure 5‑16 Ripley K function, shown as transformed L function plot

The ideas behind the production of the standard K(d) statistic and associated Monte Carlo) simulations can be used to provide additional measures that are often more useful. For example, instead of comparing the observed pattern to CSR, alternative models could be examined. SPLANCS, for example, supports comparison with a Poisson Cluster Process (PCP) rather than a simple Poisson (CSR) process. PCP is similar to a simple Poisson process in that it starts with a random point set. These are regarded as “parents”. Each parent then produces a random number of offspring, and these are located around the parent according to a bivariate distribution function, g(x,y), for example a circular Normal with variance (spread), σ2:
![]()
The parents are then removed from the set and the PCP consists of the set of all offspring. SPLANCS implements this kind of model with three parameters:
RHO: intensity (density parameter, ρ) of the parent process
M: the average number of offspring per parent, and
S2: the variance of location of offspring relative to their parent (the σ2 value in the expression above)
By generating many PCP simulations for a given study region, an alternative probability envelope can be computed and plotted against the observed K- or L-function.
A second important variant of the K(d) function involves treatment of two or more point (event) patterns within the same region. For example, one set of points might represent disease cases and a second set might be matched cases (controls) selected at random from the background population. Similar procedures may be used for spatio-temporal analysis, for example comparing crime patterns or disease incidence over time. If both datasets exhibit similar clustering we would expect the plots for both to be similar, or the difference between the computed functions, D=K1(d)‑K2(d), to be approximately zero for all d. Where background data are only available at an aggregated level (e.g. census Output Areas), it may be acceptable to assign the second set of points to the OA centroids together with a weighting factor reflecting the zone population. Crimestat supports such weighting as an option in order to assist in studies with background variation.
Another variant of the K(d) function, supported within SPLANCS and in SANET for a network-based variant, is the bivariate model (or Cross-K function). With the Cross-K function the relationship between two separate point patterns is being examined. For example, in an ecological study this might involve examining collaborative or competitive behaviour between two species, or an analysis of infected trees and the “at risk” population. In this instance the K(d) function measures the number of events in set B that lie within distance d of events A. The SPLANCS function K12Hat supports this model together with supporting simulation functions to provide an estimated probability envelope for the function.
Readers may have noticed the similarity between the procedures involved in generating Ripley’s K, and the generation of kernel density maps (Section 4.3.4.1 and Figure 4‑45). The latter involves assigning a proportion of each event or point to a circular neighbourhood, and then summing and normalising the weighted surface at grid intersection points. This provides a continuous density surface rather than a linear model, and can be used to highlight local clustering. However, with case/control data or case/background variation data, separate surfaces may be computed for each dataset and then these may be combined, e.g. by taking the difference or the ratio of the case surface/control surface (for densities greater than some minimum value) to provide a comparison surface. Values of 1 on the ratio surface represent matching densities, whilst values greater than 1 represent locations where the cases occur in higher densities than the controls or background data.
As noted earlier, the majority of these distance-based statistics are computed using the Euclidean or similar analytical metrics. Where the observations lie on a network, for example an urban street network, it is often preferable to perform computations using a metric determined from the network, typically using shortest path calculations. Network-based statistical techniques implemented within the SANET package (see Okabe et al. 2006a, 2006b for more details) include: nearest neighbour analysis; conditional nearest neighbour analysis (in which the location of a related point set is incorporated into the model, see Okabe and Miki, 1984); Ripley’s K-function; and Ripley’s Cross-K function. If the observed points are not georeferenced to network nodes they can be assigned by SANET as additional nodes placed on the closest network link and then network variants of the statistical methods may be run. In these models distances are computed using shortest paths (which may be regarded as a reasonable behavioural assumption or approximation) and probability distributions (mean values and probability envelopes) are computed by Monte Carlo ) simulation.
It is worth concluding this subsection with some general observations (cautionary notes) regarding point pattern analysis:
· The classical statistical model of CSR is inappropriate for many spatial datasets, but does provide a starting point for analysis and simulations in many instances
· A close examination of the source data is always advisable, checking questions such as: how the data were collected; how locations and attributes were recorded; when data were recorded; what data (e.g. true or surrogate) information is represented; what overall measurement error is associated with the data; etc. Examining the underlying dataset for duplicates or near-duplicates is often an importance exercise
· Multiple analytical approaches, including visualisation techniques (1D, 2D, 3D) are advisable since initial impressions and single approaches may be misleading
· The variables selectable in analysing point patterns (e.g. sample region, weights, order neighbour, maximum distance, clustering model, kernel function and bandwidth etc.) can yield very different results depending on the values and models chosen
· Borders, area definitions, metrics, background variation, temporal variation and non-spatial data issues are of major significance in describing and modelling point patterns
· Analysis of rare events presents particular problems — the small numbers involved may result in very low densities of events in large parts of a study region, if only by chance, and calculation based on zone counts are particularly subject to extremes and problems computing ratios
· Analysis should be seen as exploratory, and a part of an overall in-depth analysis process
· Analyses of this kind will not provide meaningful cause-effect or process-realisation models ― in general pattern analysis tells us very little about process, although it may enable us to exclude certain processes from consideration. Many processes can result in the same or similar observed patterns, and different patterns can generate the same or very similar statistical measures, such as K(d) functions
|
|