Within GIS software, univariate classification facilities are found as tools to: aid in the production of choropleth or thematic maps; explore untransformed or transformed datasets; analyse (classify and re-classify) image data (see further, Section 4.2.12.3); and display continuous field data. In the majority of cases these procedures perform classification based purely on the input dataset, without reference to separate external evaluation criteria.
In almost all instances the objects to be classified are regarded as discrete, distinct items that can only reside in one class at a time (sometimes referred to as hard classification). Separate schemes exist for classifying objects that have uncertain class membership (soft or fuzzy classification) and/or unclear boundaries (as discussed briefly in Section 4.2.13.4), or which require classification on the basis of multiple attributes (see Section 4.2.12.2). Typically the attributes used in classification have numerical values that are real or integer type. In most instances these numeric values represent interval or ratio-scaled variables. Purely nominal data are effectively already classified (see, for example, Figure 4‑20 in which each zone has been assigned a unique colour from a pre-selected palette; and see Section 2.2.2 for a brief discussion of commonly used scales).
Table 4‑5 provides details of a number of univariate classification schemes together with comments on their use. Most of the main GIS packages provide classification options of the types listed, although some (such as the box and percentile methods) are only available in a limited number of software tools (e.g. GeoDa). A useful variant of the box method, known as hybrid equal interval, in which the inter-quartile range is itself divided into equal intervals, does not appear to be implemented in mainstream GIS packages. Schemes that take into account spatial contiguity, such as the so-called minimum boundary error method described by Cromley (1996), also do not appear to be readily available as a standard means of classification.
Table 4‑5 Selected univariate classification schemes
| Classification scheme | Description/application |
|---|---|
| Unique values | Each value is treated separately, for example mapped as a distinct colour |
| Manual classification | The analyst specifies the boundaries between classes required as a list, or specifies a lower bound and interval, or lower and upper bounds plus the number of intervals required |
| Equal interval | The attribute values are divided into n classes with each interval having the same width = range/n. For raster maps this operation is often called slice |
| Defined interval | A variant of manual and equal interval, in which the user defines each of the intervals required |
| Exponential interval | Intervals are selected so that the number of observations in each successive interval increases (or decreases) exponentially |
| Equal count or quantile | Intervals are selected so that the number of observations in each interval is the same. If each interval contains 25% of the observations the result is known as a quartile classification. Ideally the procedure should indicate the exact numbers assigned to each class, since they will rarely be exactly equal |
| Percentile | Percentile plots are a variant of equal count or quantile plots. In the standard version equal percentages (percentiles) are included in each class. In GeoDa’s implementation of percentile plots unequal numbers are assigned to give six classes: <=1%, 1% to <10%, 10% to <50%, 50% to <90%, 90% to <99% and >=99% |
| Natural breaks (Jenks) | Widely used within GIS packages, these are forms of variance-minimisation classification. Breaks are typically uneven, and are selected to separate values where large changes in value occur. The result may be significantly affected by the number of classes selected and tends to have unusual class boundaries. Typically the method applied is due to Jenks, as described in Jenks and Caspall (1971), which in turn follows Fisher (1958). See Figure 4‑21 for more details |
| Standard deviation (SD) | The mean and standard deviation of the attribute values are calculated, and values classified according to their deviation from the mean (z-transform). The transformed values are then mapped, usually at intervals of 1.0 or 0.5 standard deviations. Note that this often results in no central class, only classes either side of the mean, and the number of classes is then even. SD classifications in which there is a central class (defined as the mean value +/-0.5SD), with additional classes at +/-1SD intervals beyond this central class, are also used |
| Box | A variant of quartile classification designed to highlight outliers, due to Tukey (1977, Section 2C). Typically six classes are defined: the four quartile classes plus two further classes for outliers. These outliers are defined as data items (if any) that lie more than 1.5 times the inter-quartile range (IQR) below the lower quartile or above the upper quartile. An even more restrictive set is defined by 3.0 times the IQR. A slightly different formulation is sometimes used to determine these box ends or hinge values. Box plots (see Section 5.2.2.2) are implemented in GeoDa and STARS, but are not generally found in mainstream GIS software. They are commonly implemented in statistics packages, including the MATLAB Statistics Toolbox |
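Several of the schemes in Table 4‑5 reduce to computing a short list of break values. The following is a minimal Python sketch using only the standard library, not the implementation of any particular GIS package. The function names are hypothetical, the quantile interpolation rule and the use of the population standard deviation are assumptions, and the box breaks place the fences 1.5 × IQR beyond the lower and upper quartiles (the usual Tukey convention).

```python
import bisect
import statistics

def equal_interval_breaks(values, n):
    """Interior breaks for n classes of equal width (range / n)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + width * i for i in range(1, n)]

def quantile_breaks(values, n):
    """Interior breaks for n classes holding roughly equal counts."""
    return statistics.quantiles(values, n=n, method="inclusive")

def sd_breaks(values, step=1.0, span=2):
    """Breaks at the mean and at +/- multiples of step * SD.
    This gives an even number of classes with no central class."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    return [mean + k * step * sd for k in range(-span, span + 1)]

def box_breaks(values, hinge=1.5):
    """Five breaks giving six classes: lower outliers, four quartile
    classes, upper outliers. Fences sit hinge * IQR beyond Q1 and Q3."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [q1 - hinge * iqr, q1, q2, q3, q3 + hinge * iqr]

def class_counts(values, breaks):
    """Count the observations in each class; class i holds the values v
    with breaks[i-1] < v <= breaks[i]."""
    counts = [0] * (len(breaks) + 1)
    for v in values:
        counts[bisect.bisect_left(breaks, v)] += 1
    return counts

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(class_counts(values, equal_interval_breaks(values, 3)))  # [9, 0, 1]
print(class_counts(values, quantile_breaks(values, 4)))        # [3, 2, 2, 3]
print(class_counts(values, box_breaks(values)))                # [0, 3, 2, 2, 2, 1]
```

With this skewed sample the equal interval scheme leaves one class empty, the quartile classes are only approximately equal in size, and the box scheme isolates the value 100 in its own upper outlier class.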
Brewer and Pickle (2002) examined many of these methods, with particular attention to ease of interpretation and comparison of map series, for example time series of disease incidence by health service area over a number of years. They concluded that simple quantile methods were amongst the most effective, and had the great advantage of consistent legends for each map in the series, despite the fact that the class interval values often vary widely. They also found that schemes that resulted in a very large central class (e.g. 40%+) were more difficult to interpret.
In addition to selection of the scheme to use, the number of breaks or intervals and the positioning of these breaks are fundamental decisions. The number of breaks is often selected as an odd value: 5, 7 or 9. With an even number of classes there is no central class, and with fewer than 4 or 5 classes the level of detail obtained may be too limited. With more than 9 classes gradations may be too fine to distinguish key differences between zones, but this will depend a great deal on the data and the purpose of the classification. In a number of cases the GIS package provides linked frequency diagrams with breaks identified, and in some cases these are interactively adjustable. In other cases frequency diagrams should be generated in advance to help determine the ideal number of classes and type of classification to be used. For data with a very large range of values, smoothly varying across a range, a graduated set of classes and colouring may be entirely appropriate. Such classification schemes are common with field-like data and raster files. Continuous graded shading of thematic maps is also possible, with some packages, such as Manifold’s Thematic Formatting facilities, providing a wide range of options that reduce or remove the dependence on formal class boundaries.
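Where a package does not provide a linked frequency diagram, a rough one can be produced before classification. The sketch below is illustrative only: `frequency_table` is a hypothetical helper that bins values into equal-width intervals, and the text bars stand in for a proper histogram.

```python
def frequency_table(values, bins):
    """Return (lower, upper, count) for each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to bin")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last interval.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

values = [1, 2, 2, 3, 3, 3, 10]
for lower, upper, count in frequency_table(values, 3):
    print(f"{lower:6.1f} to {upper:6.1f}  {'#' * count}")
```

A strongly skewed diagram, such as the one produced by this sample, suggests that equal interval classes will be poorly populated and that a quantile or natural breaks scheme may be worth examining.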
Positioning of breaks may be pre-determined by the classification procedure (e.g. Jenks Natural Breaks), but for some of the schemes provided the breaks are often manually adjustable. This is particularly useful if specific values or intervals are preferred, such as whole numbers or convenient values such as multiples of 1000, or if comparisons are to be made across a series of maps. In some instances options are provided for dealing with zero-valued and/or missing data prior to classification, but if not provided the data should be inspected in advance and subsets selected prior to classification where necessary. Some authors recommend that maps with large numbers of zero or missing data values should have such regions identified in a class of their own.
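Both adjustments can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of any package's behaviour: `split_special` and `round_breaks` are hypothetical helpers, missing data are assumed to be encoded as `None` or NaN, and breaks are snapped to the nearest multiple of a chosen unit.

```python
import math

def split_special(values):
    """Separate missing (None or NaN), zero and remaining values so that
    the first two groups can be mapped as classes of their own."""
    missing, zeros, rest = [], [], []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            missing.append(v)
        elif v == 0:
            zeros.append(v)
        else:
            rest.append(v)
    return missing, zeros, rest

def round_breaks(breaks, unit):
    """Snap each break to the nearest multiple of `unit`, dropping duplicates.
    Note that Python rounds exact halves to the nearest even multiple."""
    return sorted({round(b / unit) * unit for b in breaks})

missing, zeros, rest = split_special([5, 0, None, float("nan"), 12.5, 0.0])
print(len(missing), len(zeros), rest)                 # 2 2 [5, 12.5]
print(round_breaks([1240.0, 2875.5, 4410.2], 1000))   # [1000, 3000, 4000]
```

Only the `rest` subset would then be passed to the chosen classification scheme, with the rounded breaks reused across every map in a series.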
The Jenks method, as implemented in a number of GIS packages such as ArcGIS, is not always well-documented so a brief description of the algorithm follows (Figure 4‑21). This description also provides an initial model for some of the multivariate methods described in subsection 4.2.12.2.
Figure 4‑21 Jenks Natural Breaks algorithm
Step 1: The user selects the attribute, x, to be classified and specifies the number of classes required, k
Step 2: A set of k‑1 random or uniform values are generated in the range [min{x},max{x}]. These are used as initial class boundaries
Step 3: The mean values for each initial class are computed and the sum of squared deviations of class members from the mean values is computed. The total sum of squared deviations (TSSD) is recorded
Step 4: Individual values in each class are then systematically assigned to adjacent classes by adjusting the class boundaries to see if the TSSD can be reduced. This is an iterative process, which ends when improvement in TSSD falls below a threshold level, i.e. when the within class variance is as small as possible and between class variance is as large as possible. True optimisation is not assured. The entire process can be optionally repeated from Step 1 or 2 and TSSD values compared
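The steps in Figure 4‑21 can be sketched in Python as follows. This is a simplified illustration, not the code used by ArcGIS or any other package. The names `tssd` and `natural_breaks` are hypothetical, the initial boundaries are placed uniformly by position in the sorted data rather than by value (an assumption that keeps every class non-empty), and each pass moves one boundary by a single observation at a time.

```python
def tssd(data, bounds):
    """Total sum of squared deviations of class members from class means.
    `bounds` holds the indices that split sorted `data` into classes."""
    edges = [0] + list(bounds) + [len(data)]
    total = 0.0
    for start, end in zip(edges, edges[1:]):
        members = data[start:end]
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total

def natural_breaks(values, k, max_iter=1000, tol=1e-12):
    """Return (upper values of the first k-1 classes, final TSSD)."""
    data = sorted(values)
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and the number of values")
    # Step 2: uniform initial boundaries (by position, an assumption).
    bounds = [i * n // k for i in range(1, k)]
    # Step 3: record the initial TSSD.
    best = tssd(data, bounds)
    # Step 4: shift boundaries one observation at a time while TSSD falls.
    for _ in range(max_iter):
        improved = False
        for b in range(len(bounds)):
            for shift in (-1, 1):
                trial = bounds.copy()
                trial[b] += shift
                lower = trial[b - 1] if b > 0 else 0
                upper = trial[b + 1] if b + 1 < len(trial) else n
                if not lower < trial[b] < upper:
                    continue  # the move would leave a class empty
                score = tssd(data, trial)
                if score < best - tol:
                    bounds, best, improved = trial, score, True
        if not improved:
            break
    return [data[i - 1] for i in bounds], best

values = [1, 2, 3, 4, 50, 51, 100, 101, 102, 103, 104, 105]
breaks, score = natural_breaks(values, 3)
print(breaks, score)  # [4, 51] 23.0
```

Because the search only accepts local improvements it can stop at a local minimum, which is why the figure notes that true optimisation is not assured and that the process may be repeated from different starting boundaries.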