Nearest neighbor methods

Navigation:  Data Exploration and Spatial Statistics > Point Sets and Distance Statistics >

Nearest neighbor methods

Previous pageReturn to chapter overviewNext page

In many disciplines the distance between events (e.g. trees, cases of a disease, bird’s nests) reflects an underlying process, for example competition for food or nutrients, birth and dispersal processes, infection or contagion. Particular attention is therefore focused on the nearest event to a particular other event or point of interest. This nearest event is known as the nearest neighbor (NN) or first-order nearest neighbor. The second nearest event is then the second-order nearest neighbor and so forth to kth-order NN. A global (whole area) measure of a point pattern is the mean distance to the kth-order nearest neighbor, and more typically for k=1.

The steps involved in computing this measure are as follows:

Input coordinates of all points {xi,yi}
Compute the (symmetric) distances matrix D between every pair of points
For each i, sort the distances to identify the 1st, 2nd,...kth nearest values
Compute the mean of the observed 1st, 2nd, ...kth nearest values
Compare this mean with the expected mean under Complete Spatial Randomness (CSR or Poisson) model

To undertake the last step in this sequence the expected mean under the CSR hypothesis is needed. The distribution of NN distances can be obtained from the terms of the Poisson distribution using the information in Figure 5‑17. In this we envisage the point under consideration as lying at the center of a circle, radius r.

Assume that the overall density of points in the study region is a constant value, m. Then the expected number in a circle of radius r is simply mπr2, which is where this term comes from in the expressions shown below. The model examines the probability that there are no other points within this circle of radius r, p(r,0), but there is exactly one, p(dr,1), in the thin yellow annular region shown, i.e. in the interval between r and r+dr from the original point. We can now use this information, as dr0 to obtain the distribution of nearest neighbor distances under CSR, and hence the moments of this distribution, including the mean and variance.

Figure 5‑17 Nearest Neighbor distribution

The steps for deriving the first-order NN distribution are shown below, together with the first three central moments:

Let x=mπr2 then dx=2mπrdr and r=(x/mπ)1/2, hence p(r)=e-xdx, thus the mean, μ, may be obtained as the first moment, μ1 or μ, by integration:

where Γ() is the Gamma function. The variance, μ2 or σ2, can be obtained in a similar manner, giving the result:

The mean distance to NN under CSR is thus a simple function of the density, m. For a uniform (completely dispersed) distribution of points the expected mean distance to NN is simply double the CSR value. Assuming that the density can be estimated reliably (a significant issue in itself) then a simple index of global spatial randomness can be obtained by taking the ratio of the observed mean distance to NN:

Values of R<1 suggest greater clustering (closer spacing) than would be expected under CSR, whilst values of R>1 suggest a more even distribution. The significance of the observed value can be tested, assuming we have a fairly large number of points, e.g. n>100. Using the z-transform (based on the standard error rather than the variance, as we are comparing mean values), we have:

Programs such as Crimestat provide index values and significance estimates, whilst ArcGIS Spatial Statistics toolbox simply provides index information. The general expression for the crude moments, μα', α =1,2,3…, of the kth-order nearest neighbor distribution in D-dimensions (D=1,2,3..) with population density, m, is:

where Φ is the volume of a unit hypersphere:

and Γ(x) is the standard Gamma function. In 2D this expression for the mean (α=1, D=2) simplifies to:

Hence for k=1 this simplifies further to:

as above

A number of comments about this form of basic point-pattern analysis need to be made. The first is that the statistic depends heavily on the parameter, m. This is usually estimated by dividing the study area, A, by the number of points, N, i.e. m=A/N. However, deciding on the appropriate study area may alter the results substantially (Table 5‑7 and Figure 4‑78). In this example three alternative region boundaries have been selected: SDC — standard distance circle; 2SDE — an ellipse of size 2 standard deviations; and the MBR of the point set — Minimum Bounding Rectangle. These three regions result in substantially different ratio values and z-scores for the point set, primarily as a result of their different areal extents.

Table 5‑7 NN Statistics and study area size

Region

SDC

2SDE

MBR

R-index

1.798

0.97

1.337

Z

7.163

0.267

3.026

Area, km2

176

605

318

One solution to this problem is to be very careful about the boundary specification, ideally using a user-defined polygon that corresponds to a meaningful boundary (e.g. a physical habitat boundary), or possibly the convex hull of the point set plus a buffer zone (see below). Most available software does not support such boundary specification, although add-ins have been developed to facilitate this kind of region definition.

A related problem to that of region definition is that of edge-related errors (boundary effects). If the NN to a selected event is actually outside the study region the apparent NN found will be further away than the true value, leading to over-estimation of mean distances (see, for example, Figure 4‑30). This problem is most severe with low numbers of points and higher order NN analyses. It can be minimized by careful selection of the study boundary, the application of edge correction factors (e.g. as provided within Crimestat and spatstat), by applying a boundary guard zone (e.g. a inner or outer buffer), by remapping the space onto a torus (e.g. so each of four MBR edges are treated as if they wrap onto the opposite pair) and lastly, by applying NN analysis only where the number of events is quite large and the NN order of analysis is not too high. If there are very good reasons to assume that the event distribution is truly random with uniform density over a particular study region, then the formula for the mean distance to NN can be inverted to obtain an estimate of the point density, m:

First-order NN analysis provides a simple global approach to point pattern analysis, but is of limited use for most real-world problems. Frequently real or apparent clustering is observed, and we may wish to consider questions such as:

Is the observed clustering due to natural background variation in the population from which the events arise?
Over what spatial scales does clustering occur?
Are clusters a reflection of regional variations in underlying variables?
Are clusters associated with some feature of interest, such as a refinery, waste disposal site or nuclear plant?
Are clusters simply spatial or are they spatio-temporal?

These questions demand tools that are more exploratory in nature rather than strictly descriptive. Many tools have been developed to assist with answering such questions. The simplest, perhaps, is to plot the value of the kth order NN index against the order level, k. This option is supported with the Crimestat package, amongst others. A rather better approach (making greater use of the underlying data, particularly with larger point sets) is to examine the observed frequency distribution of nearest neighbor distances. This can be achieved by dividing the observed nearest neighbor distances into evenly spaced distance bands, 0‑d1, d1-d2, d2-… and then comparing these frequencies (expressed as proportions of the total) to the expected distribution under CSR. Comparison is usually carried out on the cumulative probability distribution, G(r), which may be obtained by integration of the expression for p(r) above:

Typically both the observed and expected cumulative distributions are plotted and visually compared. A variation on this arrangement is to select random points in the study area and compute the distance from these to the nearest event. The observed cumulative distribution in this case is denoted F(r), or commonly with a ^ (hat) symbol above the F to indicate that it is an estimate based on the observed data (sometimes referred to as Fhat). It has an expected cumulative distribution as per G(r). However, if the sample pattern is more clustered than random the observed F(r) plot will differ from the observed G(r) plot, rising far more slowly. Support for F(r) and G(r) computation and associated Monte Carlo simulation (see below) is not provided within most GIS and related packages, but it is provided in the Splancs package (S-Plus and R-Plus implementations — FHAT and GHAT functions) and G(r) is provided within the spatstat package. It is also relatively straightforward to generate these functions programmatically.

In principle, a measure of the difference between the observed and expected cumulative distributions can be computed (e.g. the Kolmogorov-Smirnov test statistic), and its significance determined. However, the set of nearest neighbor distance does not represent a set of independent samples — almost by definition NN distances are non-independent and frequently are reflexive — i.e. the NN of event A is event B and the NN of event B is event A. Even under CSR such reflexivity occurs for over 62% of first order neighbors. For these and related reasons (e.g. edge effects)  significance testing involves Monte Carlo simulation. For example, with N observed events in a study area A, one can simulate a set of N purely random events within the same region and compute the cumulative distribution for this set. The maximum absolute difference between the simulated cumulative distribution and the theoretical cumulative distribution can then be calculated and the value recorded. The process in then repeated a large number of times (e.g. T=999 times). Let X be the number of times the recorded difference was larger than that between the observed pattern and the expected cumulative distribution. Then p=(1+X)/(1+T) is the estimated probability of observing a difference of the magnitude found. Thus if X=99 and T=999 we would have p=0.1 or 10%.