Outlier detection

One of the areas on which ESDA tools focus is outlier detection, since so-called outliers are often of great interest. In the present context these are spatial objects whose value on one or more attributes is markedly different from others in the set under consideration. The data in question may be correct or may be the result of some form of error (measurement, coding, representation etc.). Such data are of interest because they may represent the most important items in an investigation (e.g. mineral concentrations, a pollution source, an unexpectedly high incidence of a particular disease). Alternatively, they may represent data that need to be removed or adjusted (e.g. smoothed), either because the information is known or suspected to be incorrect, or because its retention would adversely affect the results obtained from a particular analytical technique.

Mapped histograms

One of the simplest methods of highlighting possible outliers is to create a histogram of the data, typically using a fine class division, and then to examine the extreme classes. Where this facility is linked to a map of the data, the location of the object(s) may be identified and examined (Figure 5‑9). The upper figure shows the histogram and basic statistics for the attribute OWN_OCC (the number of Owner Occupiers, i.e. property owners) within 86 census districts (test census Output Areas, OAs, for part of Manchester, UK). The districts with the highest data values have been selected on the histogram window, and are simultaneously highlighted in the map window (lower figure). The same approach may be applied for other vector object types, such as point data.
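The selection step behind such a linked histogram can be sketched in a few lines of Python. This is a minimal illustration using NumPy rather than the procedure of any particular GIS package. The function name extreme_class_members, the number of classes and the sample values are assumptions made for the example, and the values are not the Manchester data.

```python
import numpy as np

def extreme_class_members(values, n_classes=20):
    """Return (lowest, highest): indices of records in the two end classes of a histogram."""
    values = np.asarray(values, dtype=float)
    # Equal-width class boundaries spanning the data range, as in a standard histogram
    _, edges = np.histogram(values, bins=n_classes)
    # Class index 0..n_classes-1 for every record (the last class includes the maximum)
    class_idx = np.digitize(values, edges[1:-1])
    lowest = np.flatnonzero(class_idx == 0)
    highest = np.flatnonzero(class_idx == n_classes - 1)
    return lowest, highest

# Hypothetical attribute values for ten zones (not the Manchester data)
own_occ = [5, 48, 50, 52, 49, 51, 53, 47, 50, 120]
lowest, highest = extreme_class_members(own_occ, n_classes=23)
print("zones in lowest class:", lowest.tolist())
print("zones in highest class:", highest.tolist())
```

In a linked display the returned indices would be used to highlight the corresponding zones or points in the map window. Membership of an end class only marks a record as a candidate for inspection, since the end classes of a histogram are always occupied.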

Figure 5‑9 Histogram linkage

Source: UK 2001 Census Test Output Areas (OAs)

Data items that lie at the upper or lower limits of a dataset range may be described as global outliers. This term refers to values that are extreme compared to the dataset as a whole. However, within the dataset there may be values that are “relatively extreme” and these are referred to as local outliers. A local outlier is a value that is markedly different from (spatially) neighboring values. An example of this might be a set of measurements taken along a transect, with a value part of the way along the transect that is very different from those immediately before or after, but still well within the overall range of the data recorded on the entire transect. Some ESDA software packages, such as ArcGIS Geostatistical Analyst, provide tools for displaying local as well as global outliers for selected data types.
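The transect example can be made concrete with a short sketch. It is one simple way of separating the two ideas and is not the method used by ArcGIS Geostatistical Analyst or any other package. Global outliers are flagged with the common rule of 1.5 times the inter-quartile range, and local outliers by comparing each value with the median of its neighbors along the transect. The function name transect_outliers, the window size, the threshold k and the data are all assumptions made for illustration.

```python
import numpy as np

def transect_outliers(values, half_window=2, k=3.0):
    """Flag global and local outliers in an ordered set of transect measurements."""
    values = np.asarray(values, dtype=float)
    n = values.size

    # Global outliers: extreme relative to the dataset as a whole
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    global_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    # Local outliers: difference from the median of neighboring values
    resid = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        neighbours = np.delete(values[lo:hi], i - lo)  # exclude the point itself
        resid[i] = values[i] - np.median(neighbours)

    # Robust scale of the differences (median absolute deviation, MAD).
    # Note: a MAD of zero (very smooth data) would need separate treatment.
    dev = np.abs(resid - np.median(resid))
    mad = np.median(dev)
    local_flags = dev > k * 1.4826 * mad
    return global_flags, local_flags

# Hypothetical transect with a steady trend; the value at position 11 is 12 instead of 21
transect = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 12, 22, 23, 24]
global_flags, local_flags = transect_outliers(transect)
print("global outliers at:", np.flatnonzero(global_flags).tolist())
print("local outliers at:", np.flatnonzero(local_flags).tolist())
```

The altered value lies well within the overall range of the transect, so it is not a global outlier, but it differs sharply from the readings immediately before and after it and is flagged as a local outlier.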

Box plots

Box plots (or box-whisker plots) are a form of EDA provided in many data analysis and graphing packages (e.g. Minitab, SPSS, STATA, Grapher, Matplotlib, Mondrian, WinBUGS, GeoXP). Together with distribution plots and scatter plots they provide one of the three main ways in which statistical data are examined graphically. Because box plots are less familiar to many, and of particular use in examining outliers, we describe them in some detail (see Figure 5‑10).

Figure 5‑10 Simple box plot

A box plot consists of a number of distinct elements. The example in Figure 5‑10 was generated using the MATLAB Statistics Toolbox, and the definitions we provide below apply to this particular implementation:

The lower and upper lines of the "box" in the center of the plot window are the 25th and 75th percentiles of the sample. The distance between the top and bottom of the box is the inter-quartile range (IQR)
The line in the middle of the box is the sample median. If the median is not centered in the box it is an indication of skewness
The whiskers are lines extending above and below the box. They show the extent of the rest of the sample (unless there are outliers). Assuming no outliers, the maximum of the sample is the top of the upper whisker and the minimum of the sample is the bottom of the lower whisker. By default, an outlier is a value that lies more than 1.5 times the IQR beyond the top or bottom of the box (a hinge value of 1.5). When outliers are present the whiskers therefore show a form of trimmed range, i.e. one that excludes the outliers (n.b. the term hinge is also used in statistics to refer to locations within the main data range, in some instances matching the upper and lower quartile values)
A symbol, e.g. a small circle, at the top and/or bottom of the plot is an indication of an outlier in the data. This point may be the result of a data entry error, a poor measurement or perhaps a highly significant observation
The notches in the box are a graphic confidence interval about the median of a sample. A side-by-side comparison of two notched box plots is sometimes described as the graphical equivalent of a t-test. Box plots do not have notches by default
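The elements listed above can be computed directly, as in the following sketch. It is an illustration in Python with NumPy and is not the MATLAB routine used to produce Figure 5‑10. The function name box_plot_stats and the sample values are assumptions. Quartiles here use NumPy's default interpolation rule, which may differ slightly from the rule used in other packages. The notch half-width uses the widely quoted approximation of 1.57 times the IQR divided by the square root of the sample size.

```python
import numpy as np

def box_plot_stats(sample, hinge=1.5):
    """Compute the elements of a simple box plot for a one-dimensional sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1  # height of the box

    # Values beyond these fences are treated as outliers
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    is_outlier = (x < lower_fence) | (x > upper_fence)

    # Whiskers end at the most extreme values that are not outliers
    inside = x[~is_outlier]

    # Approximate confidence interval about the median (the notch)
    half_notch = 1.57 * iqr / np.sqrt(x.size)

    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whiskers": (inside.min(), inside.max()),
        "outliers": x[is_outlier].tolist(),
        "notch": (median - half_notch, median + half_notch),
    }

# Hypothetical sample with one unusually large value
stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
for name, value in stats.items():
    print(name, value)
```

For this sample the box runs from 4 to 8, the whiskers from 2 to 9, and the value 30 is reported as an outlier because it lies more than 1.5 times the IQR above the top of the box.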

The box plots in Figure 5‑10 are for a set of radioactivity observations made at 1008 sites in Germany on one day in 2004. The plot on the left consists of 200 of the records, with whiskers extending to 1.5 times the IQR. Some packages allow user specification of the hinges, or provide an alternative set value (e.g. GeoDa: 3 times the IQR; WinBUGS: 2.5% and 97.5% limits). The plot on the right shows the remaining 808 locations and their readings (see also Figure 5‑13). Three values in sample 1 were deliberately altered for this plot to simulate measurement or coding errors. One, for example, involved recording a measured value of 106.0 as 16.0. The plot picks out each of these outliers. Box plots with a link to zone-based spatial datasets are supported within GeoDa. Figure 5‑11 illustrates the technique, again using the Manchester area test census Output Areas described earlier in this section.
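The effect of the hinge setting, and of a coding error of the kind described above, can be explored with a small sketch. The readings below are invented for illustration and are not taken from the German dataset. The function name flag_outliers is likewise an assumption, and the quartiles use NumPy's default interpolation rule.

```python
import numpy as np

def flag_outliers(values, hinge=1.5):
    """Return the sorted values lying more than hinge * IQR beyond the quartiles."""
    v = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(v, [25, 75])
    iqr = q3 - q1
    flagged = v[(v < q1 - hinge * iqr) | (v > q3 + hinge * iqr)]
    return np.sort(flagged).tolist()

# Hypothetical readings: one value of 106.0 has been miscoded as 16.0,
# and one genuine reading (115) is moderately high
readings = [98, 101, 103, 99, 104, 106, 100, 102, 97, 105, 16, 115]

mild = flag_outliers(readings, hinge=1.5)    # conventional setting
strict = flag_outliers(readings, hinge=3.0)  # wider setting
print("hinge 1.5:", mild)
print("hinge 3.0:", strict)
```

With a hinge of 1.5 both the miscoded value and the moderately high reading are flagged. With a hinge of 3.0 only the miscoded value remains, which shows how the wider setting reserves the outlier label for the most extreme cases.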

Figure 5‑11 Mapped box plot, GeoDa

The census variable Owner Occupier has been selected for mapping, and a conventional box plot of the data is also illustrated. In the GeoDa implementation each data item that lies outside the box but within the whiskers is shown with a *, and the sole “true” outlier OA appears at the very top of the box plot above the upper whisker. The mapped box plot shows the OAs that fall into the various data quartiles (and the number of OAs in each), plus upper and lower outlier OAs. In this case there is just the one upper outlier, an OA of rather complex shape, which is the same OA as that identified in Figure 5‑9.
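The classification that underlies a mapped box plot of this kind can be sketched as follows. This is an illustration of the idea and not GeoDa's own code. Each zone is assigned to one of six classes (lower outlier, the four quartiles, upper outlier), and the class codes can then be used to shade the map. The function name box_map_classes, the class labels and the data values are assumptions made for the example.

```python
import numpy as np

BOX_MAP_LABELS = ("lower outlier", "< 25%", "25% - 50%",
                  "50% - 75%", "> 75%", "upper outlier")

def box_map_classes(values, hinge=1.5):
    """Assign each value a class code 0..5 indexing BOX_MAP_LABELS."""
    v = np.asarray(values, dtype=float)
    q1, q2, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Quartile classes are coded 1..4
    classes = np.digitize(v, [q1, q2, q3]) + 1
    # Values beyond the fences are recoded as outlier classes 0 and 5
    classes[v < q1 - hinge * iqr] = 0
    classes[v > q3 + hinge * iqr] = 5
    return classes

# Hypothetical attribute values for nine zones (not the Manchester data)
zone_values = [12, 20, 22, 25, 27, 30, 33, 35, 90]
classes = box_map_classes(zone_values)
counts = np.bincount(classes, minlength=6)
for label, count in zip(BOX_MAP_LABELS, counts):
    print(f"{label}: {count} zone(s)")
```

The per-class counts correspond to the numbers shown in the legend of a mapped box plot. In this example a single zone falls in the upper outlier class and none in the lower outlier class.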