
Consider a regular grid of cells that completely covers a sampled region, with a value associated with each grid square. The set of values in this case could be binary, e.g. presence/absence; or they could be classified into one of k classes (see Dacey, 1968, for a full discussion of this case); or k classes could be reclassified into two classes: one class of interest and the remaining k-1 classes combined. Typically this latter kind of data is nominal-valued, i.e. the cell or zone values can be regarded as categories, such as woodland, grassland, sandy desert etc., rather than ordered values.

If we assume the data are binary, perhaps representing the presence or absence of a particular species of insect or plant variety in sample quadrats, a range of possible patterns might be observed (Figure 5‑25A-D). In Figure 5‑25A all the observed values for presence are in one half of the 6x6 grid (strong positive autocorrelation), whilst in Figure 5‑25B they are perfectly evenly distributed (strong negative autocorrelation). Figure 5‑25C gives a particular case of a random pattern (of which there are many). In each case 50% of the cells show presence, but this value could easily be 10% or any other value depending on the data being studied. Figure 5‑25D is an example of a real-world dataset showing the presence or absence of desert holly in a 16x16 cell sample region. Note that in each instance our comments refer to autocorrelation at the scale of the cell, i.e. between cells; within cells autocorrelation is typically strongly positive.

One way of analyzing these patterns is to ask “what is the chance that a particular pattern could occur at random?” In each of the three sample patterns shown we can look at the spatial equivalent of one time step or lag, the patterns observed at one cell step, i.e. adjacent cells. If steps are restricted to rook’s moves we can count the number of instances of 1‑1, 0‑0, 1‑0 and 0‑1 occurring, and compare these to the number we might expect if the pattern was random. These numbers are referred to as join counts (sometimes erroneously described as ‘joint counts’). For smaller regions edge effects will be significant, so calculations need to be adjusted to reflect the fact that along the borders not all of the four directions are possible. For example, only two adjacent cells exist for the four corner positions, and only three adjacencies for other border cells.
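The counting step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `rook_join_counts` and the two 6x6 test grids (one half occupied, and a checkerboard) are our own assumptions, chosen to resemble a completely separated and an evenly spaced pattern.

```python
def rook_join_counts(grid):
    """Count 1-1, 0-0 and mixed (1-0 or 0-1) joins between rook-adjacent cells."""
    rows, cols = len(grid), len(grid[0])
    counts = {"11": 0, "00": 0, "10": 0}
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once;
            # border cells simply have fewer neighbours to look at.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = grid[r][c], grid[rr][cc]
                    if a == 1 and b == 1:
                        counts["11"] += 1
                    elif a == 0 and b == 0:
                        counts["00"] += 1
                    else:
                        counts["10"] += 1
    return counts


# Hypothetical test patterns: left half occupied, and a checkerboard.
halves = [[1, 1, 1, 0, 0, 0] for _ in range(6)]
checker = [[(r + c) % 2 for c in range(6)] for r in range(6)]

print(rook_join_counts(halves))   # {'11': 27, '00': 27, '10': 6}
print(rook_join_counts(checker))  # {'11': 0, '00': 0, '10': 60}
```

Scanning only rightwards and downwards avoids double counting and handles the edge adjustment implicitly, since joins that would fall outside the grid are never visited.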

Figure 5‑25 Join count patterns

 A. Completely separated pattern (+ve)

B. Evenly spaced pattern (-ve)

C. Random pattern

D. Atriplex hymenelytra (desert holly)

 

Row and column totals for the adjacencies, or joins, are shown in Figure 5‑26, with the overall total being 120/2=60 joins (each join is counted once from each of its two cells, hence the division by 2). For our patterns, with 50% occupancy, we might expect 15 of these to be 1‑1 joins, 15 to be 0‑0 joins and the remaining 30 to be 0‑1 or 1‑0 joins. We can count up the number of each type of join and compare this to our expected values to judge how special (significant) or not our patterns are. In Figure 5‑25A there are 27 1‑1 joins, 27 0‑0 joins and only 6 0‑1 or 1‑0 joins, an outcome that is most unlikely to arise by chance. Similar calculations can be undertaken for cases B and C, and as expected for case B all 60 joins are of type 1‑0 or 0‑1, which is again extremely unlikely to occur by chance. Case C has 35 1‑0/0‑1 joins compared with perhaps 30 expected, with 1‑1 joins being 13 and 0‑0 being 12, as against perhaps 15 in each case.
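The arithmetic behind these totals can be checked with a short sketch. The helper name `neighbour_count` is hypothetical, and the expected values use the simple assumption that each cell is independently occupied with probability 0.5.

```python
def neighbour_count(r, c, rows, cols):
    """Number of rook-adjacent cells for cell (r, c), allowing for the grid edges."""
    return sum(
        0 <= r + dr < rows and 0 <= c + dc < cols
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
    )


rows = cols = 6
per_cell = [[neighbour_count(r, c, rows, cols) for c in range(cols)] for r in range(rows)]
row_totals = [sum(row) for row in per_cell]  # [16, 22, 22, 22, 22, 16]
total_adjacencies = sum(row_totals)          # 120: each join is seen from both ends
total_joins = total_adjacencies // 2         # 60

p = 0.5  # assumed probability of presence in any cell
expected = {
    "11": total_joins * p * p,               # 15.0
    "00": total_joins * (1 - p) * (1 - p),   # 15.0
    "10": total_joins * 2 * p * (1 - p),     # 30.0
}
print(row_totals, total_joins, expected)
```

Corner cells contribute 2 adjacencies, other border cells 3 and interior cells 4, which is why the first and last rows total 16 and the interior rows 22.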

Figure 5‑26 Join count computation

A test for the significance of the results we have observed can be produced using a Normal or z-transform of the data. In practice three separate z-transforms are needed, one for the 1‑1 case, one for 0‑0, and one for 0‑1 and 1‑0. These transforms evaluate expressions of the form z=(O−E)/SD, where O is the observed number of joins of a given type, E is the expected number based on a random model, and SD is the expected standard deviation. The procedure and formulas can be implemented in software scripts for use within mainstream GIS packages, or data may be externally analyzed using specialized software and then the results mapped within a GIS; for example, using the Rookcase Excel add-in produced by Sawada (1999) and obtainable from the University of Ottawa, Laboratory for Paleoclimatology and Climatology. Results for the four examples given in Figure 5‑25A-D generated using the Rookcase add-in are shown in Table 5‑8 (“B” and “W” here refer to Black and White, rather than 1 or 0; “Rand.” refers to the randomization model, which is discussed further below; and # means the number of joins). Table 5‑9 provides details of the principal formulas used in these computations.
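As an illustration of how such a script might look, the sketch below computes O, E, SD and z for black-black joins under the randomization (non-free sampling) model. It is not the Rookcase implementation: the function names are hypothetical, and the moment expressions are the standard Cliff and Ord forms as we recall them, written in terms of sums S0, S1 and S2 over a spatial weights matrix (an assumed notation). It assumes at least four cells and a mix of both classes.

```python
from math import sqrt


def rook_weights(rows, cols):
    """Binary rook-contiguity weights as a dict {(i, j): 1} over ordered pairs."""
    w = {}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:
                w[(i, i + 1)] = w[(i + 1, i)] = 1
            if r + 1 < rows:
                w[(i, i + cols)] = w[(i + cols, i)] = 1
    return w


def falling(n, k):
    """Falling factorial n(n-1)...(n-k+1)."""
    out = 1
    for t in range(k):
        out *= n - t
    return out


def bb_z_nonfree(x, w):
    """Return (O, E, SD, z) for BB joins under non-free sampling (assumed formulas)."""
    n, nb = len(x), sum(x)
    pairs = set(w) | {(j, i) for (i, j) in w}
    s0 = sum(w.values())
    s1 = 0.5 * sum((w.get((i, j), 0) + w.get((j, i), 0)) ** 2 for (i, j) in pairs)
    row, col = [0.0] * n, [0.0] * n
    for (i, j), wij in w.items():
        row[i] += wij
        col[j] += wij
    s2 = sum((row[i] + col[i]) ** 2 for i in range(n))

    observed = 0.5 * sum(wij * x[i] * x[j] for (i, j), wij in w.items())
    expected = 0.5 * s0 * falling(nb, 2) / falling(n, 2)
    second_moment = 0.25 * (
        s1 * falling(nb, 2) / falling(n, 2)
        + (s2 - 2 * s1) * falling(nb, 3) / falling(n, 3)
        + (s0 ** 2 + s1 - s2) * falling(nb, 4) / falling(n, 4)
    )
    sd = sqrt(second_moment - expected ** 2)
    return observed, expected, sd, (observed - expected) / sd


# Hypothetical 6x6 pattern with the left half occupied, flattened row by row.
x = [1, 1, 1, 0, 0, 0] * 6
observed, expected, sd, z = bb_z_nonfree(x, rook_weights(6, 6))
print(observed, round(expected, 2), round(sd, 2), round(z, 2))
```

The same structure would serve for the white-white case (swap the 0s and 1s in `x`); the black-white case needs its own mean and variance expressions.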

Table 5‑8 Join count analysis results

A. Positive autocorrelation

B. Negative autocorrelation

C. Random model: no discernible autocorrelation

D. A. hymenelytra: positive BB autocorrelation

Two features of the results in Table 5‑8 should be noted: (i) the expected numbers of BW, BB and WW joins are not 30, 15 and 15 as we suggested earlier, but 30.86, 14.57 and 14.57, these being adjusted values that take account of what is known as non-free sampling (i.e. sampling without replacement); and (ii) the z-statistic in Table 5‑8B shows a large positive value for BW joins, and large negative values for BB and WW joins (absolute values of >1.96 would be significant at the 5% probability level). This mixed pattern can be confusing, especially if one or two of the z-scores show a significant value whilst others do not, as in Table 5‑8D.
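The adjusted expectations can be reproduced directly. Under sampling without replacement, the chance that both ends of a join are black is nb(nb-1)/(n(n-1)) rather than (nb/n) squared. The sketch below applies this to the 6x6 grid with 18 occupied cells and 60 joins; the function name is hypothetical.

```python
def expected_joins_nonfree(total_joins, n_cells, n_black):
    """Expected BB, WW and BW joins when cell colours are drawn without replacement."""
    n_white = n_cells - n_black
    ordered_pairs = n_cells * (n_cells - 1)
    return {
        "BB": total_joins * n_black * (n_black - 1) / ordered_pairs,
        "WW": total_joins * n_white * (n_white - 1) / ordered_pairs,
        "BW": total_joins * 2 * n_black * n_white / ordered_pairs,
    }


e = expected_joins_nonfree(total_joins=60, n_cells=36, n_black=18)
print({k: round(v, 2) for k, v in e.items()})  # {'BB': 14.57, 'WW': 14.57, 'BW': 30.86}
```

The three expectations still sum to 60; fixing the number of black cells simply shifts a little of the expectation from same-color joins to mixed joins.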

The join counts method has been used in a variety of application areas, including: ecological data analysis; the analysis of voting patterns (for example, voting for one of two political parties or individuals); the analysis of land use (e.g. developed or undeveloped land); and the examination of the distribution of rural settlements. However, the complexity of computing the theoretical means and variances in more realistic spatial models and the difficulty of interpreting the multiple z-scores, coupled with the availability of alternative approaches to analysis, have meant that join count procedures are not widely used and are rarely implemented in mainstream GIS packages. Support for this kind of analysis is provided by the joincount functions of the R‑Project spatial package spdep, and within the PASSaGE software. Metrics based on these concepts are provided in Fragstats and similar packages (see further, Section 5.3.3 and Section 5.3.4), for example in the computation of Connection and Contagion indices.

The standard form of the join counts statistical analysis makes a number of assumptions regarding the observation data. These include (under free sampling) that the distribution of BB and WW joins is asymptotically Normal (Gaussian) and that the data are first-order homogeneous. To clarify this latter point, we use an example from Kabos and Csillag (2002), who have investigated relaxing this assumption. They give the example of two simulated images with the same B/W distribution (50%B and 50%W), but with different first-order effects. Their first pattern was simulated with first-order homogeneity, i.e. pb, the probability of any cell being black, was 0.5 for all cells (Figure 5‑27, lefthand image). Their second pattern was simulated by setting pb=0.6 for half of the image and pb=0.4 for the other half, so the overall probability pb remained at 0.5 but was no longer homogeneously distributed across the image (Figure 5‑27, righthand image). They found that the join count statistics (JCS) using pb=0.5 accepted the null hypothesis of spatial randomness for the first image, but rejected it for the second. They have thus demonstrated that, using unmodified JCS, we would erroneously (but understandably) conclude that there is significant spatial autocorrelation in the second image.
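An experiment of this kind can be sketched as follows. This is not Kabos and Csillag's code or data: the image size, random seed and function names are our own assumptions, and the free-sampling variance used is the standard binary rook-case form as we recall it, Var = J p^2 + 2m p^3 - (J + 2m) p^4, where J is the number of joins and m the number of pairs of joins sharing a cell.

```python
import random
from math import sqrt


def simulate(rows, cols, p_left, p_right, rng):
    """Independent cells: black with p_left in the left half, p_right in the right."""
    half = cols // 2
    return [
        [1 if rng.random() < (p_left if c < half else p_right) else 0 for c in range(cols)]
        for _ in range(rows)
    ]


def bb_joins(grid):
    """Number of rook joins with a black cell at both ends."""
    rows, cols = len(grid), len(grid[0])
    horiz = sum(grid[r][c] * grid[r][c + 1] for r in range(rows) for c in range(cols - 1))
    vert = sum(grid[r][c] * grid[r + 1][c] for r in range(rows - 1) for c in range(cols))
    return horiz + vert


def bb_z_free(grid, p):
    """z-score for BB joins under free sampling with one global probability p."""
    rows, cols = len(grid), len(grid[0])
    joins = rows * (cols - 1) + cols * (rows - 1)
    m = 0  # pairs of joins that share a cell: sum over cells of k(k-1)/2
    for r in range(rows):
        for c in range(cols):
            k = (r > 0) + (r < rows - 1) + (c > 0) + (c < cols - 1)
            m += k * (k - 1) // 2
    mean = joins * p ** 2
    var = joins * p ** 2 + 2 * m * p ** 3 - (joins + 2 * m) * p ** 4
    return (bb_joins(grid) - mean) / sqrt(var)


rng = random.Random(0)  # arbitrary seed, for repeatability only
homogeneous = simulate(400, 400, 0.5, 0.5, rng)
split = simulate(400, 400, 0.6, 0.4, rng)
z_homogeneous = bb_z_free(homogeneous, 0.5)
z_split = bb_z_free(split, 0.5)
print(round(z_homogeneous, 2), round(z_split, 2))
```

In the split image the cells are still independent of one another, yet one would expect the z-score to be well above 1.96, because black-black joins occur with probability 0.36 in one half and 0.16 in the other (an average of 0.26, not 0.25). The apparent autocorrelation is a first-order effect.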

Figure 5‑27 Homogeneous and non-homogeneous probability images

Table 5‑9 Join count mean and variance formulas

Columns: Sampling hypothesis; Mean; Variance

Same color case (e.g. black-black or white-white)

F: Free sampling (re-sampling or ‘with replacement’)

R: Non-free sampling (randomization or ‘without replacement’)

Different color case (e.g. black-white)

f: Free sampling (re-sampling or ‘with replacement’)

r: Non-free sampling (randomization or ‘without replacement’)

where: pb, pw = probability of the category of interest occurring in a cell or lattice zone (e.g. probability of black or white zones in the case of binary datasets); note that where probabilities are estimated from the lattice then pb=nb/n and pw=nw/n but probabilities might be estimated from a related variable (e.g. proportion of votes cast) or from a broader sample (regional or national data);

nb = number (count) of the category of interest occurring in a cell or lattice zone (e.g. number of black zones in the case of binary datasets);

W={wij} is a spatial weights matrix, the basic version being a binary weights matrix in which wii=0 and wij=1 if zones i and j are adjacent, else wij=0. The formulas simplify considerably with binary weights.
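For reference, the means and variances for the four cases (F, R, f, r) are usually written as below. This is a sketch of the standard forms generally attributed to Cliff and Ord, restated here rather than copied from the table. The weight sums S0, S1 and S2 and the falling-factorial notation n^(k) = n(n-1)...(n-k+1) are our notational assumptions, and W is assumed to be symmetric.

```latex
% Weight sums (assumed notation)
S_0=\sum_i\sum_j w_{ij},\qquad
S_1=\tfrac12\sum_i\sum_j\,(w_{ij}+w_{ji})^2,\qquad
S_2=\sum_i\Bigl(\sum_j w_{ij}+\sum_j w_{ji}\Bigr)^2

% F: same color, free sampling
E[BB]=\tfrac12 S_0\,p_b^2,\qquad
\mathrm{Var}[BB]=\tfrac14\bigl[S_1p_b^2+(S_2-2S_1)p_b^3+(S_1-S_2)p_b^4\bigr]

% R: same color, non-free sampling
E[BB]=\tfrac12 S_0\,\frac{n_b^{(2)}}{n^{(2)}},\qquad
\mathrm{Var}[BB]=\tfrac14\Bigl[S_1\frac{n_b^{(2)}}{n^{(2)}}
 +(S_2-2S_1)\frac{n_b^{(3)}}{n^{(3)}}
 +(S_0^2+S_1-S_2)\frac{n_b^{(4)}}{n^{(4)}}\Bigr]-E[BB]^2

% f: different color, free sampling
E[BW]=S_0\,p_bp_w,\qquad
\mathrm{Var}[BW]=\tfrac14\bigl[2S_1p_bp_w+(S_2-2S_1)p_bp_w(p_b+p_w)
 +4(S_1-S_2)p_b^2p_w^2\bigr]

% r: different color, non-free sampling
E[BW]=S_0\,\frac{n_bn_w}{n^{(2)}},\qquad
\mathrm{Var}[BW]=\tfrac14\Bigl[2S_1\frac{n_bn_w}{n^{(2)}}
 +(S_2-2S_1)\frac{n_bn_w(n_b+n_w-2)}{n^{(3)}}
 +4(S_0^2+S_1-S_2)\frac{n_b^{(2)}n_w^{(2)}}{n^{(4)}}\Bigr]-E[BW]^2
```

With binary rook weights on the 6x6 grid, S0=120 (twice the 60 joins), so with nb=nw=18 the non-free means reduce to 60 x 18 x 17/(36 x 35) = 14.57 for BB and 60 x 2 x 18 x 18/(36 x 35) = 30.86 for BW.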

Additional questions might reasonably be asked about this analytical procedure that point the way to more powerful extensions and developments of these ideas. Key questions, which we discuss below and in Section 5.5.2.2 include:

·         how sensitive is this technique (and, of course, many other techniques) to the particular size of grid cell used and the number of cells?

·         if there are k classes rather than just two, how might this kind of data be analyzed?

·         why choose rook’s moves — why not permit queen’s moves, which would extend the idea of contiguity but increase the complexity of computation?

·         why restrict the notion of contiguity to directly adjacent cells (a lag of 1)? Why not examine longer range effects, i.e. higher order spatial lags? Indeed, with separate analyses for different lags or distance bands one could produce a form of correlogram of autocorrelation effects.

·         why should every cell have the same level of importance or weight — why not permit cells to have differing weights, depending on the kind of model of spatial association we are considering?

·         should the analysis be restricted to regular grids, or could it be extended to irregular grids, polygonal regions and even pure point data?

·         is a single statistic for a whole area meaningful, or should separate statistics be computed for sub-regions or individual cells and then plotted, giving a Local Indicator of Spatial Autocorrelation (LISA) measure, which would highlight clustering effects?

·         if the data in the study cells are integer or real-valued, why restrict analysis (lose valuable information) by ignoring the levels in regions?

·         how should we determine the cell probabilities, e.g. pb? Is it sufficient to always derive the probability estimate from the observed proportions of cells of type B, or are there other approaches that might be more appropriate?

We briefly comment on the first two of these questions in the discussion below. The remaining questions in the list, notably those relating to cases where the data values in the grid, lattice or point-set are integer or real-valued, are discussed in Section 5.5.2.2 et seq.
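Several of the questions listed above (rook's versus queen's moves, and longer lags) amount to changing which pairs of cells count as joined. The sketch below makes that choice an explicit parameter. The function names are hypothetical, and "lag k" is taken here, as a simplifying assumption, to mean cells exactly k steps apart along a row, column or (for queen's moves) diagonal.

```python
def contiguity_joins(rows, cols, rule="rook", lag=1):
    """Return the set of unordered joins (i, j), cells numbered row by row."""
    steps = [(0, lag), (lag, 0)]              # right and down
    if rule == "queen":
        steps += [(lag, lag), (lag, -lag)]    # the two downward diagonals
    joins = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    joins.add((r * cols + c, rr * cols + cc))
    return joins


def count_same(values, joins, cls=1):
    """Number of joins whose two cells both belong to class cls."""
    return sum(values[i] == cls and values[j] == cls for i, j in joins)


# Hypothetical 6x6 pattern with the left half occupied, flattened row by row.
halves = [1, 1, 1, 0, 0, 0] * 6
for rule, lag in (("rook", 1), ("queen", 1), ("rook", 2)):
    joins = contiguity_joins(6, 6, rule, lag)
    print(rule, lag, len(joins), count_same(halves, joins))
```

On a 6x6 grid this gives 60 joins for rook's moves, 110 for queen's moves and 48 for rook's moves at lag 2. Repeating the analysis over a sequence of lags is the basis of the correlogram idea mentioned in the list.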

Figure 5‑28 Grouping and size effects

A. 2x2 grouping of Atriplex hymenelytra grid

B. 128x128 grid of Calluna vulgaris presence

The first point raised above can be examined directly by reviewing our example in Figure 5‑25D. If the grid had been sampled at half the frequency in each direction there would be 8x8=64 cells rather than 16x16=256 cells, and in this instance only one new larger cell contains no data, i.e. a 0, all others showing a desert holly plant present (Figure 5‑28A, empty 2x2 cell shown in white). So we immediately see that such techniques may be highly sensitive to grid resolution, or, with irregular lattices, to the particular density, shape and connectivity of individual areas.
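The grouping operation can be sketched as below. We assume that a coarse cell is recorded as present if any of its constituent fine cells shows presence, and that the grid dimensions are exact multiples of the block size. The function name and the small 4x4 example are hypothetical.

```python
def aggregate_presence(grid, block=2):
    """Coarsen a binary grid: a block is 1 if any fine cell inside it is 1."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            int(any(grid[r + dr][c + dc] for dr in range(block) for dc in range(block)))
            for c in range(0, cols, block)
        ]
        for r in range(0, rows, block)
    ]


fine = [
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
coarse = aggregate_presence(fine)
print(coarse)  # [[1, 0], [0, 1]]
```

Here occupancy rises from 3 of 16 cells to 2 of 4, which shows how coarsening alone can change both the proportion of occupied cells and the join counts derived from them.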

Furthermore, with large grids, e.g. 128x128 cells, the z-scores may become huge, raising doubts over the sensitivity and interpretation of the technique as the resolution of sampling or the size of the area covered is increased; in Figure 5‑28B there are 16384 cells and all three z-scores are over 160. A critical question here is whether the data in such cases reflect the result of direct observation, or whether they represent results from re-sampling or similar computational procedures. In the latter instance test scores may not be valid. At the other extreme scores may be suspect where the number of cells or regions is small (less than 30) or the category selected only occurs in a small percentage of areas.

The second question deals with the k-class case. There are several ways of dealing with this situation. Perhaps the first question to ask is whether the classes are genuinely nominal-valued (e.g. distinct species of tree), or whether the classes have been created from a set of underlying data that are themselves integer or real-valued. In the latter instance it is sensible to examine the underlying data separately using more powerful techniques, such as Moran’s I (see Section 5.5.2.2). It is also possible to tackle each of the k classes in turn, treating the chosen class as B=Black, say, and regarding all other classes as W=White, and proceeding as described earlier. Alternatively one can examine all the k classes simultaneously by identifying all instances of the same class being found in adjacent areas. This pattern of occurrences can then be compared with the expected pattern assuming the zones are labeled in a spatially random manner, and/or using a pseudo-probability distribution generated using random permutations of the known values.
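The permutation approach for k classes can be sketched as follows. This is a minimal illustration under our own assumptions (rook adjacency, a one-sided test for an excess of same-class joins, 999 permutations, an arbitrary seed); the function names and the three-class banded example are hypothetical.

```python
import random


def same_class_joins(grid):
    """Number of rook joins whose two cells carry the same class label."""
    rows, cols = len(grid), len(grid[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols and grid[r][c] == grid[r][c + 1]:
                total += 1
            if r + 1 < rows and grid[r][c] == grid[r + 1][c]:
                total += 1
    return total


def permutation_test(grid, n_perm=999, seed=0):
    """Pseudo p-value for observing at least this many same-class joins."""
    rows, cols = len(grid), len(grid[0])
    observed = same_class_joins(grid)
    values = [v for row in grid for v in row]
    rng = random.Random(seed)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(values)  # relabel the zones at random, keeping class totals
        shuffled = [values[r * cols:(r + 1) * cols] for r in range(rows)]
        if same_class_joins(shuffled) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


# Hypothetical map: three classes arranged in vertical bands two cells wide.
banded = [["a", "a", "b", "b", "c", "c"] for _ in range(6)]
observed, p_value = permutation_test(banded)
print(observed, p_value)
```

Shuffling preserves the number of cells in each class, so this corresponds to the randomization (non-free sampling) hypothesis rather than free sampling.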

Support for this form of ‘multi-colored map’ analysis is to be found in the joincount functions of the R-Project package spdep, and in the join counts option of the PASSaGE software. Note that for some of the analyses performed it is assumed that the spatial weights matrix, W, is symmetric, or can be adjusted to be symmetric; this excludes the use of some more complex neighborhood relations.
