Centroids and centers

The terms center and centroid have a variety of meanings and formulas, depending on the writer and/or the software package in which they are implemented. There are also specific measures for polygons, point sets and lines, so we treat each separately below. Centers or centroids are provided in many GIS packages for multiple polygons and/or lines, and in such cases the combined location is typically computed in the same manner as for point sets, using the centers or centroids of the individual objects as the input points. In almost all cases the metric utilized is $d_E$ or $d_S$, although other metrics may be more appropriate for some problems. Centroids are often defined as the nominal center of gravity for an object or collection of objects, but many other measures of centrality are commonly employed within GIS and related packages. In these latter instances the term center (qualified where necessary by how this center has been determined) is more appropriate than centroid, which should be reserved for the center of gravity.

Polygon centroids and centers

Polygon centers have many important functions in GIS: they are sometimes used as “handle points”, by which an object may be selected, moved or rotated; they may be the default position for labeling; and for analytical purposes they are often used to “represent” the polygon, for example in distance calculations (zone to zone) and when assigning zone attribute values to a point location (e.g. income per capita, disease incidence, soil composition). The effect of such assignment is to assume that the variable of interest is well approximated by assigning values to a single point, which essentially involves loss of information. This is satisfactory for some problems, especially where the polygons are small, homogeneous and relatively compact (e.g. unit ZIP-code or postcode areas, residential land parcels), but in other cases warrants closer examination.

With a number of raster-based GIS packages the process of vector file importing may utilize polygon centers as an alternative to the polygons themselves, applying interpolation to the values at these centroids in order to create a continuously varying grid rather than the common procedure of creating a gridded version of discretely classified polygonal areas (see further Chapter 6). Similar procedures may be applied in vector-based systems as part of their vector-to-raster conversion algorithms.

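If $(x_i, y_i)$, $i = 1, \dots, n$, are the coordinate pairs of a point set, or of the vertices defining a single polygon, we have for the Mean Center, M1:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$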

The Mean Center is not the same as the center of gravity for a polygon (although it is for an unweighted point set, see below). A weighted version of this expression is sometimes used:

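Mean Center (weighted), M1*:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$

where $w_i$ is the weight associated with the ith point or vertex.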

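The RMS variation of the point set $\{x_i, y_i\}$ about the mean center is known as the standard distance, SD. It is computed using the expressions:

$$SD = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n} + \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n}}$$

or weighted:

$$SD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i} + \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{\sum_{i=1}^{n} w_i}}$$

where $(\bar{x}, \bar{y})$ is the mean center (the weighted mean center in the second case) and $w_i$ are the weights.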
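The mean center and standard distance calculations described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the sample coordinates are hypothetical, and the standard distance divides by the sum of the weights (that is, by n for unit weights), in line with the RMS definition given above.

```python
import math

def mean_center(points, weights=None):
    """Return the (optionally weighted) mean center of a list of (x, y) pairs."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x0 = sum(w * x for w, (x, _) in zip(weights, points)) / total
    y0 = sum(w * y for w, (_, y) in zip(weights, points)) / total
    return x0, y0

def standard_distance(points, weights=None):
    """RMS distance of the points about their (weighted) mean center."""
    if weights is None:
        weights = [1.0] * len(points)
    x0, y0 = mean_center(points, weights)
    total = sum(weights)
    ss = sum(w * ((x - x0) ** 2 + (y - y0) ** 2)
             for w, (x, y) in zip(weights, points))
    return math.sqrt(ss / total)

# Hypothetical coordinates, chosen only for illustration
pts = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
print(mean_center(pts))                 # unweighted mean center
print(mean_center(pts, [1, 1, 2, 2]))   # pulled towards the heavier points
print(standard_distance(pts))
```

With these sample values the unweighted mean center is the middle of the rectangle, and doubling the weights on the two upper points moves the center upwards without changing its x-coordinate.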

As noted above, the term centroid for a polygon is widely used to refer to its assumed center of gravity. By this is meant the point about which the polygon would balance if it were made of a uniform thin sheet of material with a constant density, such as a sheet of steel or cardboard. This point, M2, can be computed directly from the coordinates of the polygon vertices in a similar manner to the calculation of the polygon area, A, provided in Section 4.2.1, Length and area for vector data. Indeed, it requires A as an input. If the polygon is a triangle, which is a widely used form in spatial analysis, the center of gravity lies at the mean of the vertex coordinates, which is located at the intersection of the straight lines drawn from each vertex to the mid-point of the opposite side (Figure 4‑8).

Figure 4‑8 Triangle centroid

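For general polygons, using the same notation as before and with the vertices listed in order so that $(x_{n+1}, y_{n+1}) = (x_1, y_1)$, the x- and y-components can be written in the following standard form ($y \ge 0$):

$$x_c = \frac{1}{6A}\sum_{i=1}^{n}(x_i + x_{i+1})(x_i y_{i+1} - x_{i+1} y_i), \qquad y_c = \frac{1}{6A}\sum_{i=1}^{n}(y_i + y_{i+1})(x_i y_{i+1} - x_{i+1} y_i)$$

where $A = \frac{1}{2}\sum_{i=1}^{n}(x_i y_{i+1} - x_{i+1} y_i)$ is the signed area of the polygon.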

The formula arises as the weighted average of the centroids of the triangles in a standard triangulation of the polygon, where the weights correspond to the triangle areas.
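The area-weighted averaging of triangle centroids can be sketched as follows. This is an illustrative implementation of the standard shoelace-style calculation rather than the code of any particular GIS package. Each polygon edge is paired with the origin to form a triangle with a signed area, so the vertices must be listed in order around a simple (non-self-intersecting) polygon. The function name and the L-shaped test polygon are hypothetical.

```python
def polygon_centroid(vertices):
    """Return (cx, cy, signed_area) for a simple polygon given as ordered (x, y) vertices.

    Each edge (P_i, P_{i+1}) forms a triangle with the origin. Its signed area
    is cross / 2 and its centroid is (P_i + P_{i+1}) / 3, so the area-weighted
    average of the triangle centroids reduces to the sums accumulated below.
    """
    a2 = 0.0  # twice the signed area
    sx = 0.0
    sy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        a2 += cross
        sx += (x0 + x1) * cross
        sy += (y0 + y1) * cross
    if a2 == 0:
        raise ValueError("degenerate polygon (zero area)")
    return sx / (3.0 * a2), sy / (3.0 * a2), a2 / 2.0

# Hypothetical L-shaped polygon, listed counter-clockwise
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
cx, cy, area = polygon_centroid(l_shape)
print(round(cx, 3), round(cy, 3), area)
```

For this L-shaped example the center of gravity falls at roughly (1.357, 1.357), which lies in the notch of the L and therefore outside the polygon, whereas the mean of the vertex coordinates is roughly (1.667, 1.667).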

Figure 4‑9 Polygon centroid (M2) and alternative polygon centers

Figure 4‑9 shows a sample polygon with 6 nodes or vertices, A-F, together with the computed locations of the mean center (M1), the centroid (M2) as defined by the center of gravity, and the center (M3) of the Minimum Bounding Rectangle (or MBR), which is shown in gray.

Each of these three points falls within the polygon boundary in this example, and they are fairly closely spaced. The MBR center is clearly the fastest to compute, but it is not invariant under rotation and is the most subject to outliers, e.g. a single vertex location that is very different from the majority of the elements being considered. For example, if point B had (valid or invalid) coordinates of B(34,3) then we would find M1=(10.67,6.5), M2=(9.27,5.57) and M3=(17,6.5). M3 is now well outside the polygon, M1 is close to the polygon boundary and M2 remains firmly inside. None of these points minimizes the sum of distances to the polygon vertices.
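The sensitivity of the MBR center to outliers and to rotation can be checked with a small sketch. The coordinates below are hypothetical (they are not those of the polygon in the figure) and the helper names are our own.

```python
import math

def mbr_center(points):
    """Center of the axis-aligned minimum bounding rectangle."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0

def mean_center(points):
    """Unweighted mean of the coordinates."""
    n = len(points)
    return sum(x for x, _ in points) / n, sum(y for _, y in points) / n

def rotate(points, degrees):
    """Rotate points about the origin by the given angle."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical vertex set, and a copy in which one vertex is an outlier
pts = [(0.0, 0.0), (6.0, 0.0), (6.0, 2.0), (2.0, 4.0), (0.0, 4.0)]
outlier = pts[:1] + [(30.0, 0.0)] + pts[2:]

print(mbr_center(pts), mbr_center(outlier))    # MBR center shifts by 12 units in x
print(mean_center(pts), mean_center(outlier))  # mean center shifts by 4.8 units in x

# Rotate, take the MBR center, rotate back: the result differs from mbr_center(pts)
back = rotate([mbr_center(rotate(pts, 45))], -45)[0]
print(back)
```

The same rotate-and-return test applied to the mean center gives back the original location, which is what invariance under rotation means in this context.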

Despite the apparent robustness and widespread use of M2, with a polygon of complex shape the centroid may well lie outside of its boundary. In the example illustrated in Figure 4‑10 we have generated centroids for a set of polygons using the X‑Tools add-in for ArcGIS, and these produce slightly different results from those created by ArcGIS itself. ArcGIS includes a function, Features to Points, which will create polygon centers that are guaranteed to lie inside the polygon if the option INSIDE is included in the command. Manifold includes a Centroids|Inner command which performs a similar function.

Figure 4‑10 Center and centroid positioning

An alternative to using the MBR is to find the smallest circle that completely encloses the polygon, taking the center of the circle as the polygon center (Figure 4‑11, M4). This is the default in the Manifold GIS, viewable by selecting the menu option View|Structure|Centroids and/or by using the Transform function “Centroids”. This procedure suffers similar problems to that of using the MBR. Yet another alternative is to find the largest inscribed circle, and take its center as the polygon center (Figure 4‑11, M5). This approach has the advantage that the center will definitely be located inside the polygon, although possibly in an odd location, depending on the polygon shape. If the polygon is an equilateral triangle then M5 and M2 will coincide, but for other triangles they generally differ. Manifold supports location of centers of types M2, M3 and M4.
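For a polygon with only a handful of vertices, the smallest enclosing circle can be found by brute force. The smallest circle is always defined either by two vertices (as a diameter) or by three vertices (as a circumcircle), so it is enough to test every such candidate. The sketch below makes that assumption explicit; it is not the method used by Manifold or any other package, and faster algorithms exist for large point sets. It needs at least two points.

```python
import math
from itertools import combinations

def _circle_from_two(p, q):
    """Circle having the segment pq as its diameter."""
    return (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0, math.dist(p, q) / 2.0

def _circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are collinear."""
    ax, ay = p
    bx, by = q
    cx, cy = r
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, math.dist((ux, uy), p)

def smallest_enclosing_circle(points, eps=1e-9):
    """Return (cx, cy, radius) of the smallest circle containing all points."""
    candidates = [_circle_from_two(p, q) for p, q in combinations(points, 2)]
    for p, q, r in combinations(points, 3):
        circle = _circle_from_three(p, q, r)
        if circle is not None:
            candidates.append(circle)
    best = None
    for cx, cy, rad in candidates:
        if all(math.dist((cx, cy), p) <= rad + eps for p in points):
            if best is None or rad < best[2]:
                best = (cx, cy, rad)
    return best

# Hypothetical L-shaped polygon: the circle center (2, 2) lies outside the polygon
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 4), (0, 4)]
print(smallest_enclosing_circle(l_shape))
```

The L-shaped example shows one of the problems this center shares with the MBR center: the circle is fixed by the two extreme vertices, so its center can fall outside the polygon.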

As we can also see in Figure 4‑10, multiple polygons may be selected and assigned combined centers. In this example the combined center appears to have been computed from the MBR of the two selected census tracts, but different packages and toolsets may provide alternative procedures and thus positions. In such cases the center may be nowhere near any of the selected features, and if it is important to avoid such a circumstance the GIS may facilitate selection of the most central feature (e.g. the most central point or polygon) if one exists. In some instances the combined center calculation may be weighted by the value of an attribute associated with each polygon. This procedure will result in a weighted center whose location will be pulled towards regions with larger weights. However, in such cases the distribution of polygons used in the calculation may produce unrepresentative results.

This is apparent again in Figure 4‑10 where a polygon center calculation for all census tracts in the western half of the region, weighted by farm revenues, would be pulled strongly to the east of the sub-region where many small urban tracts are found, even though these may have relatively low weights associated with them.
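One simple way to pick a "most central feature" is to choose the input feature whose center has the smallest total (optionally weighted) distance to all the other centers. The sketch below takes that definition as an assumption, since packages may define centrality differently. The coordinates are hypothetical polygon centers.

```python
import math

def central_feature(centers, weights=None):
    """Index of the center with the smallest weighted total distance to the others."""
    if weights is None:
        weights = [1.0] * len(centers)

    def total_distance(i):
        return sum(w * math.dist(centers[i], q) for w, q in zip(weights, centers))

    return min(range(len(centers)), key=total_distance)

# Hypothetical polygon centers: four corner zones and one interior zone
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (4.0, 5.0)]
n = len(centers)
combined = (sum(x for x, _ in centers) / n, sum(y for _, y in centers) / n)

print(combined)                           # (4.8, 5.0): not the location of any feature
print(centers[central_feature(centers)])  # (4.0, 5.0): an actual feature
```

Unlike the combined center, the result is always the location of one of the selected features.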

Figure 4‑11 Polygon center selection

Point sets

If $(x_i, y_i)$ are the coordinate pairs of a point set then the Mean Center, M1$(x_0, y_0)$, is simply the average of the coordinate values, as we saw earlier:

$$x_0 = \frac{\sum_{i} w_i x_i}{\sum_{i} w_i}, \qquad y_0 = \frac{\sum_{i} w_i y_i}{\sum_{i} w_i}$$

In this version of the expression for M1 there is an additional (optional) component, $w_i$, representing weights associated with the points included in the calculation. For point sets the weights might be the number of crimes recorded at a particular location, or the number of beds in a hospital; for multiple polygons, where these are represented by their individual centers or centroids, the weights would typically be an attribute value associated with each polygon. Clearly if all weights equal 1 the formula simplifies to a calculation that is purely based on the coordinate values, with the number of points $n = \sum_i w_i$.

M1 in this case is the center of gravity, or centroid, of the point set, assuming the locations used in the calculation are treated as point masses. M1 is the location that minimizes the sum of (weighted) squared distances to all the points in the set, is simple and fast to compute, and is widely used. However, it is not the point that minimizes the sum of the (weighted) distances to each of the points. The latter is sometimes known as the (bivariate) “median center” (although Crimestat, for example, uses the term median center to mean simply the middle values of the x- and y-coordinates). It is best described as the MAT point (the center of Minimum Aggregate Travel, M6). This point may be determined using the following iterative procedure, with M1 as the initial pair $(x_0, y_0)$ and $k = 0, 1, \dots$:

$$x_{k+1} = \frac{\sum_{i} w_i x_i / d_{i,k}}{\sum_{i} w_i / d_{i,k}}, \qquad y_{k+1} = \frac{\sum_{i} w_i y_i / d_{i,k}}{\sum_{i} w_i / d_{i,k}}$$

In this expression $d_{i,k}$ is the distance from the ith point to the kth estimated optimal location. It is usual to adjust distances in this, and similar formulas involving division by interpoint distances, by a small increment and/or to apply code checks to avoid divide-by-zero or close-to-zero situations. For this particular problem, testing the numerically estimated surface derivative and curvature, in addition to checking for division by zero, has been shown to be an effective approach to avoiding overshooting the true optimum (see Ostresh’s WEBER algorithm, in Rushton et al., 1973). Iteration is continued until the change in the objective function (the cumulative distance) or in both coordinates is less than some pre-specified small value.

These formulas for M1 and M6 can be derived by taking the standard equation for distance, $d_E$ (for M6), or distance squared, $d_E^2$ (for M1), partially differentiating first with respect to x and then with respect to y, and finally equating the results to 0 to determine the minimum value.

An extension to this type of weighted mean is provided by the Geographically Weighted Regression (GWR) software package, and is described further in Section 5.6.3. With the GWR software it is possible to compute a series of locally defined (geographically weighted) means, and associated variances, which may then be analyzed and/or mapped.
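The iterative procedure for the MAT point, M6, can be sketched as below. This is the plain iteratively re-weighted mean, starting from M1, with a small floor on the distances as a simple guard against division by zero. It does not include the derivative and curvature tests of Ostresh's WEBER algorithm. The function name, tolerances and sample coordinates are hypothetical.

```python
import math

def mat_point(points, weights=None, tol=1e-9, max_iter=1000, eps=1e-12):
    """Estimate the point minimizing the (weighted) sum of Euclidean distances."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    # Start at the (weighted) mean center, M1
    x = sum(w * px for w, (px, _) in zip(weights, points)) / total
    y = sum(w * py for w, (_, py) in zip(weights, points)) / total
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for w, (px, py) in zip(weights, points):
            d = max(math.hypot(px - x, py - y), eps)  # floor avoids division by zero
            num_x += w * px / d
            num_y += w * py / d
            denom += w / d
        new_x, new_y = num_x / denom, num_y / denom
        shift = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if shift < tol:  # stop when both coordinates have settled
            break
    return x, y

def total_distance(center, points):
    """Cumulative Euclidean distance from center to every point."""
    return sum(math.dist(center, p) for p in points)

# Hypothetical point set
pts = [(0.0, 0.0), (6.0, 0.0), (6.0, 2.0), (2.0, 4.0), (0.0, 4.0)]
m1 = (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
m6 = mat_point(pts)
print(m1, total_distance(m1, pts))
print(m6, total_distance(m6, pts))  # cumulative distance is no larger than at M1
```

The stopping rule here tests the movement of the estimate between iterations. Testing the change in cumulative distance instead, as the text also allows, would be an equally reasonable choice.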

The positions of M1, M3 and M6 are illustrated in Figure 4‑12, using the same set of coordinates as for the vertices A-F of the polygon described above.

Figure 4‑12 Point set centers

Note that M1=M2 in this case, and for both M1 and M3 (the MBR center) their position is unchanged from the polygonal case. Also note that if we move point B to a new position B′ further along the line from M6 through B, the position of M6 is unchanged (although the cumulative distance will be greater). This observation is true for each of the points A-F: they can be moved away from the MAT point to an arbitrary distance along a line connecting them to M6, and the position of M6 will be unaffected. This would not generally be the case with M1 or M3.

The locations described for different types of center often assume that all points (or polygons) are weighted equally. If the weights, $w_i$, are not all equal, the locations of M1 and M6 will be altered, but M3 will be unchanged. The effect of unequal weights is to “pull” the location of M1 and M6 towards the locations with higher weights. For example, if point B in our previous example had a weight of 3, M1 would be moved to (9.00,5.63), just to the right of M6 in Figure 4‑12, M3 would be unaltered, and M6 would be altered to (10.41,4.29), which is very close to point C. Any weights in the dataset that are zero or missing will generally result in the point being removed from the calculation. Such occurrences require checking to ensure valid data is not being discarded due to incomplete information or errors.

There is no difference between weighted calculations where the weights are integer values and the standard calculations for unit weights if some points in a set are co-located. For example, if point B is recorded in a dataset 3 times, its effective weight would be 3, and instead of 6 points there would be 8 to consider. Co-located point recording is very common, especially in crime and medical datasets where each incident is associated with a nominal rather than precise location. Examples of co-located data might be the closest street intersection to an incident, the nominal coordinates of a shopping mall, the location of the doctor’s surgery where a patient is registered, or a location that recurs because incidents or cases have been rounded to the nearest 50 meters for data protection reasons. It is often a good idea to execute queries looking for duplicate and unweighted locations before conducting analyses of this type, since mapped datasets may not reveal the true underlying patterns. Likewise it is important to check that co-located data are meaningful for the analysis to be undertaken — surgery location is not generally a substitute for a patient’s home address.
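The equivalence between repeated records and integer weights, and a simple check for co-located records, can be sketched as follows. The record coordinates are hypothetical, and the duplicate test uses exact coordinate equality, which is an assumption that suits data already snapped to nominal locations.

```python
import math
from collections import Counter

def mean_center(points, weights=None):
    """Return the (optionally weighted) mean center of a list of (x, y) pairs."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    x0 = sum(w * x for w, (x, _) in zip(weights, points)) / total
    y0 = sum(w * y for w, (_, y) in zip(weights, points)) / total
    return x0, y0

# Hypothetical incident records: one location is recorded three times
records = [(2.0, 3.0), (8.0, 1.0), (8.0, 1.0), (8.0, 1.0), (5.0, 7.0)]

counts = Counter(records)             # number of records at each distinct location
unique = list(counts)                 # distinct locations, in order of first appearance
weights = [counts[p] for p in unique]

a = mean_center(records)              # every record counted with unit weight
b = mean_center(unique, weights)      # distinct locations with integer weights
print(a, b, math.isclose(a[0], b[0]) and math.isclose(a[1], b[1]))

# Locations that occur more than once, worth inspecting before analysis
duplicates = {p: c for p, c in counts.items() if c > 1}
print(duplicates)
```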

The preceding calculations have all been carried out using the standard Euclidean metric, $d_E$. As previously stated, this is the standard for GIS packages, but other metrics may be more appropriate depending on the problem at hand (see further, Section 4.4.1). Specialist packages such as Crimestat and the now defunct LOLA project support a range of other metrics. Using the city block metric ($L_1$) and the minimax metric ($L_\infty$) the MAT point, M6, is no longer guaranteed to be unique: all points within a defined region may be equally close to the input point set (Crimestat only provides a single location, so LOLA or similar facilities are preferable for such computations). For example, the point set in Figure 4‑12 with the $L_1$ metric has an MAT solution point set (a rectangle) bounded by (4,4) and (10,8). Clearly, if the point set lay on a network the location of M6 would again be different.
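Under the city block metric the x- and y-components of the total distance separate, so the MAT solution set for unit weights is bounded by the middle order statistics of each coordinate. The sketch below illustrates this for a hypothetical set of six points (not the coordinates used in the figure). With an even number of points the solution is a rectangle, and with an odd number it collapses to a single point.

```python
def l1_mat_region(points):
    """Return ((x_lo, y_lo), (x_hi, y_hi)) bounding the city block MAT solutions.

    Assumes unit weights: any x between the two middle x-values, and any y
    between the two middle y-values, gives the same minimum total distance.
    """
    xs = sorted(x for x, _ in points)
    ys = sorted(y for _, y in points)
    n = len(points)
    lo, hi = (n - 1) // 2, n // 2  # the same index when n is odd
    return (xs[lo], ys[lo]), (xs[hi], ys[hi])

def l1_total(center, points):
    """Cumulative city block distance from center to every point."""
    cx, cy = center
    return sum(abs(x - cx) + abs(y - cy) for x, y in points)

# Hypothetical point set
pts = [(1, 1), (2, 6), (3, 2), (7, 5), (8, 3), (9, 8)]
lower, upper = l1_mat_region(pts)
print(lower, upper)
print(l1_total(lower, pts), l1_total(upper, pts), l1_total((5, 4), pts))  # all equal
print(l1_total((8, 4), pts))  # larger: this point lies outside the rectangle
```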

To add to the confusion, Crimestat provides three further measures of centrality for point sets. Each involves variations in the way the mean is calculated (see further, Table 1‑3): the geometric mean (the x- and y-coordinates are calculated using the sum of the logarithms of the coordinates, averaging and then taking antilogs); the harmonic mean (the x- and y-coordinates are calculated using the reciprocal of the coordinates, averaging and then taking the reciprocal of the result); and the triangulated mean (a Crimestat “special”). The first two alternative means are less sensitive to outliers (extreme values) than the conventional mean center, whilst the third is claimed to represent the directionality of the data better. The Crimestat manual, Chapter 4, provides more details of each measure, with examples. Note that the harmonic mean is vulnerable to coordinate values which are 0 or close to 0.
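The geometric and harmonic mean centers follow directly from the verbal definitions above. The sketch below implements those definitions only and is not Crimestat's own code. It assumes that all coordinates are strictly positive, and the sample points are hypothetical.

```python
import math

def geometric_mean_center(points):
    """Average the logarithms of each coordinate, then take antilogs."""
    n = len(points)
    gx = math.exp(sum(math.log(x) for x, _ in points) / n)
    gy = math.exp(sum(math.log(y) for _, y in points) / n)
    return gx, gy

def harmonic_mean_center(points):
    """Average the reciprocals of each coordinate, then take the reciprocal."""
    n = len(points)
    hx = n / sum(1.0 / x for x, _ in points)
    hy = n / sum(1.0 / y for _, y in points)
    return hx, hy

# Hypothetical points with one relatively extreme location
pts = [(1.0, 2.0), (4.0, 8.0), (16.0, 32.0)]
print(geometric_mean_center(pts))  # close to (4, 8)
print(harmonic_mean_center(pts))   # roughly (2.29, 4.57)
```

For comparison, the conventional mean center of these three points is (7, 14), which is pulled furthest towards the extreme point.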

Lines

In the case of a line (single segment, polyline or curve) the common notion of the center is simply the point on the line that is equidistant, measured along the line, from the two endpoints. This provides both the (intrinsic) center of gravity and the mean center, although when a polyline is viewed as embedded in a plane its “center” should be considered to lie in the plane and not on the line.
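The two notions of a polyline center can be contrasted in a short sketch: the point halfway along the line, and the planar center of gravity obtained by weighting each segment mid-point by the segment length. The function names and the sample polyline are hypothetical.

```python
import math

def polyline_midpoint(vertices):
    """Point on the polyline at half its total length from either end."""
    segments = list(zip(vertices, vertices[1:]))
    lengths = [math.dist(p, q) for p, q in segments]
    half = sum(lengths) / 2.0
    run = 0.0
    for (p, q), length in zip(segments, lengths):
        if length > 0 and run + length >= half:
            t = (half - run) / length  # fraction of the way along this segment
            return p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])
        run += length
    return vertices[-1]  # fallback for a zero-length polyline

def polyline_centroid(vertices):
    """Planar center of gravity: length-weighted mean of the segment mid-points."""
    segments = list(zip(vertices, vertices[1:]))
    lengths = [math.dist(p, q) for p, q in segments]
    total = sum(lengths)
    cx = sum(l * (p[0] + q[0]) / 2.0 for (p, q), l in zip(segments, lengths)) / total
    cy = sum(l * (p[1] + q[1]) / 2.0 for (p, q), l in zip(segments, lengths)) / total
    return cx, cy

# Hypothetical right-angled polyline with two segments of length 4
line = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
print(polyline_midpoint(line))  # (4.0, 0.0): on the line, at the corner
print(polyline_centroid(line))  # (3.0, 1.0): in the plane, off the line
```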

For collections of lines there is no generally applied formula and a common central point might be selected from the centers or centroids of the individual elements, or from the MBR, or utilizing a central feature if one exists. Combinations of points, lines and polygons are treated in a similar manner. Because lines and line segments have a well-defined orientation, labeling tools that utilize line or segment centroids may also use the orientation of that element to align associated labels.