
Box plots (or box-whisker plots) are a form of EDA provided in many data analysis and graphing packages (e.g. SPSS, Stata, Grapher, WinBUGS). Together with distribution plots and scatter plots, they are one of the three main ways in which statistical data are examined graphically. Because box plots are less familiar to many readers, and are particularly useful for examining outliers, we describe them in some detail (see Figure 5‑7).

A box plot consists of a number of distinct elements. The example in Figure 5‑7 was generated using the MATLAB Statistics Toolbox, and the definitions below apply to this particular implementation:

·         The lower and upper lines of the "box" in the centre of the plot window are the 25th and 75th percentiles of the sample. The distance between the top and bottom of the box is the inter-quartile range (IQR)

·         The line in the middle of the box is the sample median. If the median is not centred in the box it is an indication of skewness

·         The whiskers are lines extending above and below the box. They show the extent of the rest of the sample, excluding any outliers. If there are no outliers, the top of the upper whisker is the sample maximum and the bottom of the lower whisker is the sample minimum. By default, an outlier is a value that lies more than 1.5 times the IQR above the top or below the bottom of the box (a hinge value of 1.5). When outliers are present, each whisker ends at the most extreme observation that is not an outlier, so the whiskers show a form of trimmed range (n.b. the term hinge is also used in statistics to refer to locations within the main data range, in some instances matching the upper and lower quartile values)

·         A symbol, e.g. a small circle, at the top and/or bottom of the plot is an indication of an outlier in the data. This point may be the result of a data entry error, a poor measurement or perhaps a highly significant observation

·         The notches in the box are a graphic confidence interval about the median of a sample. A side-by-side comparison of two notched box plots is sometimes described as the graphical equivalent of a t-test. Box plots do not have notches by default
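The elements listed above can be computed directly from a sample. The following is a minimal sketch in Python, not the MATLAB implementation itself: the function names, the linear-interpolation percentile rule and the sample values are all illustrative assumptions, and packages differ slightly in how they calculate quartiles. The notch half-width of 1.57 × IQR / √n is a commonly quoted approximation and is likewise treated here as an assumption.

```python
import math


def percentile(sorted_vals, p):
    """Percentile (0-100) by linear interpolation between order statistics.

    This is only one of several conventions; packages differ slightly.
    """
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def box_plot_stats(values, multiplier=1.5):
    """Return the main elements of a box plot for a non-empty sample."""
    if not values:
        raise ValueError("sample must not be empty")
    data = sorted(values)
    q1 = percentile(data, 25)
    median = percentile(data, 50)
    q3 = percentile(data, 75)
    iqr = q3 - q1

    # Values beyond these limits are treated as outliers.
    lower_limit = q1 - multiplier * iqr
    upper_limit = q3 + multiplier * iqr
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    outliers = [v for v in data if v < lower_limit or v > upper_limit]

    # Assumed notch rule: median +/- 1.57 * IQR / sqrt(n).
    half_notch = 1.57 * iqr / math.sqrt(len(data))

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme values that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "notch": (median - half_notch, median + half_notch),
    }


# Made-up sample for illustration only (not the SIC2004 data).
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(sample)
print(stats)
```

For this made-up sample the box runs from 4 to 8, the whiskers extend to 2 and 9, and the value 30 is reported separately as an outlier.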

The box plots in Figure 5‑7 are for a set of radioactivity observations made at 1008 sites in Germany on one day in 2004. The plot on the left (Sample 1) is based on 200 of the records, with whiskers extending up to 1.5 times the IQR beyond the box.

Figure 5‑7 Simple box plot

Data source: SIC2004, AI-GEOSTATS

Some packages allow the user to specify the hinge value, or provide an alternative fixed value (e.g. 3 times the IQR in GeoDa; 2.5% and 97.5% limits in WinBUGS). The plot on the right shows a further 808 locations and their readings (see also Figure 5‑4 and Figure 5‑10). Three values in Sample 1 were deliberately altered for this plot to simulate measurement or coding errors. One, for example, involved recording a measured value of 106.0 as 16.0. The plot picks out each of these outliers.
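The effect of the hinge value can be illustrated with a short sketch. The readings below are hypothetical (they are not the SIC2004 data), with 16.0 standing in for a value of 106.0 that has been miscoded; the function name and the "inclusive" quartile rule from Python's statistics module are assumptions made for the example, and other packages may compute quartiles slightly differently.

```python
from statistics import quantiles


def flag_outliers(values, multiplier):
    """Return values lying more than multiplier * IQR beyond the box."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - multiplier * iqr
    upper_limit = q3 + multiplier * iqr
    return sorted(v for v in values if v < lower_limit or v > upper_limit)


# Hypothetical readings; 16.0 simulates a coding error (106.0 entered as 16.0).
readings = [98.0, 101.5, 95.0, 110.0, 104.0, 99.5,
            16.0, 107.0, 102.0, 96.5, 120.0, 100.0]

mild = flag_outliers(readings, 1.5)     # default hinge value
extreme = flag_outliers(readings, 3.0)  # stricter alternative

print("1.5 x IQR:", mild)
print("3.0 x IQR:", extreme)
```

With these illustrative values the default setting flags both 16.0 and 120.0, whereas the stricter setting of 3 times the IQR flags only the miscoded 16.0. A larger hinge value therefore reserves the outlier label for the most extreme observations.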

Box plots with a link to zone-based spatial datasets are supported within GeoDa. Figure 5‑8 illustrates the technique, again using the Manchester area test census Output Areas (OAs) described earlier in this section. The census variable Owner Occupier has been selected for mapping, and a conventional box plot of the data is also illustrated. In the GeoDa implementation each data item that lies outside the box but within the whiskers is shown with a *, and the sole “true” outlier OA appears at the very top of the box plot, above the upper whisker. The mapped box plot shows the OAs that fall into the various data quartiles (and the number of OAs in each), plus upper and lower outlier OAs; in this case there is just the one upper outlier OA, of rather complex shape, the same as that identified in Figure 5‑6.

Figure 5‑8 Mapped box plot
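The classification behind a mapped box plot can be sketched as follows. This is not GeoDa's own code: the zone identifiers, the values, the class labels and the treatment of values that fall exactly on a class boundary are all hypothetical choices made for illustration. Each zone is assigned to one of six classes (four quartile classes plus lower and upper outliers), and the number of zones in each class is counted, as in the map legend.

```python
from collections import Counter
from statistics import quantiles


def box_map_class(value, q1, median, q3, multiplier=1.5):
    """Assign a value to one of six box map classes (assumed boundary rules)."""
    iqr = q3 - q1
    if value < q1 - multiplier * iqr:
        return "lower outlier"
    if value < q1:
        return "below 25%"
    if value < median:
        return "25% to 50%"
    if value < q3:
        return "50% to 75%"
    if value <= q3 + multiplier * iqr:
        return "above 75%"
    return "upper outlier"


# Hypothetical zone identifiers and values (not the Manchester census data).
zones = {
    "Z01": 20.0, "Z02": 35.0, "Z03": 40.0, "Z04": 42.0,
    "Z05": 45.0, "Z06": 50.0, "Z07": 52.0, "Z08": 55.0,
    "Z09": 58.0, "Z10": 60.0, "Z11": 65.0, "Z12": 98.0,
}

q1, median, q3 = quantiles(list(zones.values()), n=4, method="inclusive")
classes = {zone: box_map_class(v, q1, median, q3) for zone, v in zones.items()}
counts = Counter(classes.values())

for label, count in counts.items():
    print(label, count)
```

In a GIS setting the resulting class for each zone would be joined back to the zone polygons and used to shade the map; here only the class counts are printed.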

 
