Regression overview

Regression analysis is the term used to describe a family of methods that seek to model the relationship between one (or more) dependent or response variables and a number of independent or predictor variables. A typical (linear) relationship might be of the form:

y = xβ = β₀ + β₁x₁ + β₂x₂ + … + βₚ₋₁xₚ₋₁
where β is a column vector of p parameters to be determined and x is a row vector of independent variables with a 1 in the first column. It should be noted that an expression of the form shown below is also described as linear, since it remains linear in the coefficients, β, even if it is not linear in the predictor variables, x, e.g.:

y = β₀ + β₁x₁ + β₂x₁² + β₃x₁x₂
If y = {yᵢ} is a set of n observations or measurements of the response variable, with corresponding recorded values for the set of independent variables, then a series of linear equations of the type shown above can be constructed. However, typically the number of observations will be greater than the number of coefficients (the system is said to be over-determined). One approach to this problem is to seek a best fit solution, where best fit is taken to mean a solution for the vector β that minimizes the difference between the fitted model and the observed values at these data points. In practice the procedure most widely applied seeks to minimize the sum of squared differences, hence the term Least Squares. Each linear expression requires an additional "error" component to allow for the degree of fit of the expression to the particular data values. If we denote these error components by εᵢ, the general form of the expression above now becomes:

y = Xβ + ε
where X is now a matrix with 1’s in column 1, and ε is a vector of errors that in conceptual terms is assumed to represent the effects of unobserved variables and measurement errors in the observations. Typically we assume that the expected value of these errors is zero, E(ε) = 0, and that the variance, E(εεᵀ) = σ²I, is constant. Here 0 is a column vector of zeros and I is the identity matrix.

This set of n equations in p unknowns (n>p) is typically solved for the vector β by seeking to minimize the sum of the squared error terms, εᵀε (Ordinary Least Squares, or OLS). The solution for the coefficients using this approach is obtained from the matrix expression:

β̂ = (XᵀX)⁻¹Xᵀy
where the hat (^) symbol denotes sample estimates of population parameters; the sample estimates of the population error (ε) terms are described as ‘residuals’.

The variance for such models is usually estimated from the residuals of the fitted model, using the expression:

σ̂² = ε̂ᵀε̂/(n − p)
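To make the estimation steps concrete, here is a minimal numpy sketch on synthetic data (all variable names and values are illustrative assumptions, not part of the original text):

```python
import numpy as np

# Synthetic example: n observations, p coefficients (including the intercept)
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # 1's in column 1
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS estimate, beta_hat = (X'X)^-1 X'y; lstsq is numerically preferable
# to forming the inverse explicitly
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals and the usual unbiased variance estimate
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p)
```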
Methods such as these may be used in the context of spatial description and mapping, as part of a process of data exploration, or for explanatory and/or predictive purposes.

The broad principles of regression analysis and modeling, as described in basic statistics texts, also apply to spatial datasets. In a spatial context one is seeking to model the variation in some spatially distributed dependent variable, for example house prices in a city, from a set of independent variables such as size of property, age of property, distance from transport facilities, proportion of green space in the region etc. (see further Table 5‑11, hedonic regression). Spatial coordinates might be explicitly or implicitly incorporated into such models, for example including the coordinates of the centroid of a census tract or event location into the modeling process, or by taking account of the pattern of adjacency of neighboring zones.

Similar methods are also widely supported in raster-based packages, with the dependent variable being taken as one grid from a set of matching layers or bands, and the independent variables being taken as the matching grids. Grid cell entries are treated as the sample values (responses) and the measured predictor values. In most packages this technique is used to aid grid combination, but others suggest that it may be used for predictive purposes (e.g. predicting crop yield from a series of measured grids that provide details of soil structure and chemical makeup). This kind of modeling is generally unsafe due to the high degree of spatial autocorrelation between nearby cells in each grid (possibly generated by interpolation or resampling).

Ideally in regression analysis the form of the chosen model should be as simple as possible, both in terms of the expression employed and the number of independent variables included; these are sometimes referred to as the simplicity and parsimony objectives. In addition, the proportion of the variation in the dependent variable(s), y, explained by the independent variables, x, should be as high as possible, and the correlation between separate independent variables should be as low as possible. These criteria form part of a strength of evidence objective. If some, or all, of the x are highly correlated (typically having a correlation coefficient above 0.8) the model almost certainly contains redundant information and may be described as being over-specified. If a strong and broadly linear relationship exists between selected independent variables, the variables are said to exhibit multi-collinearity. There are a number of standard techniques for reducing multi-collinearity, including: applying the so-called centering transform (deducting the mean of the relevant independent variable from each measured value), as illustrated below; increasing the sample size (especially if samples are quite small); removing the most inter-correlated variable; and combining inter-correlated variables into a new single composite variable (e.g. using principal components analysis, PCA, or other data reduction techniques). The latter procedures can be viewed as forms of model re-specification.
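As a small illustration of the centering transform, consider a predictor measured far from zero together with its square; centering sharply reduces the correlation between the two (a sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(5, 10, size=200)        # values far from zero

print(np.corrcoef(x, x**2)[0, 1])       # near 1: x and x^2 are almost collinear

xc = x - x.mean()                       # the centering transform
print(np.corrcoef(xc, xc**2)[0, 1])     # close to 0 for a symmetric distribution
```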

With higher-order spatial trend analyses, multi-collinearity is to be expected, since various combinations of the coordinates are themselves the independent variables. Technically, imperfect multi-collinearity does not result in the OLS assumptions being violated, but it does lead to large standard errors for the coefficients involved, also referred to as variance inflation (VI), which limits their usefulness. As a rule of thumb, if the measure known as the Multi-collinearity Condition Number, MCN, is greater than 30 the problem is quite serious and should be addressed. The MCN is derived from the ratio of the largest to the smallest eigenvalue of the matrix XᵀX. Generally the square root of this ratio is reported, but it is important to check this for any specific software package (if no square root is taken, values greater than 1000 are taken as significant).
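The MCN is straightforward to compute from the eigenvalue definition above; the following sketch (with an illustrative trend-surface style design matrix) uses the square-root convention:

```python
import numpy as np

def mcn(X):
    """Multi-collinearity Condition Number: square root of the ratio of the
    largest to the smallest eigenvalue of X'X."""
    eig = np.linalg.eigvalsh(X.T @ X)   # X'X is symmetric, so eigvalsh applies
    return np.sqrt(eig.max() / eig.min())

rng = np.random.default_rng(2)
x = rng.uniform(5, 10, size=100)
X = np.column_stack([np.ones(100), x, x**2])  # polynomial (trend-surface) terms
print(mcn(X))   # typically far above the rule-of-thumb threshold of 30
```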

When fitting a model using OLS it is generally assumed that the errors (residuals for sample points) are independently and identically distributed (iid). If the spread of errors is not constant, for example if in some parts of the study area the residuals are much more variable than in others, the errors are said to exhibit heteroskedasticity. If this is the case (which may be detected using a range of standard tests) the estimated variance under OLS will be biased, either too large or too small (it is not generally possible to know which direction the bias takes). As a result the specification of confidence intervals and the application of standard statistical significance tests (e.g. F tests) will not be reliable.
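One such standard test is the Breusch-Pagan test; a hedged sketch using statsmodels on synthetic data in which the error spread grows with the predictor:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 1, size=n)
X = sm.add_constant(x)
y = 1 + 2 * x + rng.normal(scale=0.2 + x)   # heteroskedastic errors by construction

fit = sm.OLS(y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)   # a small p-value suggests heteroskedasticity is present
```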

Many (perhaps most) spatial datasets exhibit patterns of data and/or residuals in which neighboring areas have similar values (positive spatial autocorrelation), and hence violate the core assumptions of standard regression models. One result of this feature of spatial data is that the effective sample size is greatly over-estimated, and hence the degrees of freedom indicated (for example in multiple grid regressions) will be stated as being far higher than they should be. One possible (partial) solution to this problem, in the case of grids, is to take (stratified or equalized) random samples from the response grid and apply regression to the subset of grid cells that correspond to these samples in the predictor grids, as sketched below. This approach is sometimes (rather confusingly) referred to as resampling.
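A sketch of this grid-subsampling idea, using synthetic arrays to stand in for real response and predictor grids (the grid contents and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
shape = (200, 200)
response = rng.normal(size=shape)                     # stand-in response grid
predictor = 0.5 * response + rng.normal(size=shape)   # stand-in predictor grid

# Simple random sample of cells (stratified or equalized sampling
# schemes could be substituted here)
idx = rng.choice(response.size, size=500, replace=False)

y = response.ravel()[idx]
X = np.column_stack([np.ones(idx.size), predictor.ravel()[idx]])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```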

A common additional requirement for standard regression modeling, as noted above, is that the distribution of the errors is not only identical but also Normal, i.e.

ε ~ N(0, σ²I)

If the distribution of the errors is Normal and the regression relationship is linear, this equates to the response variable being Normally distributed, and vice versa, i.e.

y ~ N(Xβ, σ²I), or, setting the vector μ = Xβ,

y ~ N(μ, σ²I)

The assumption of Normality permits the analytical derivation of confidence intervals for the model parameters and significance tests on their values (against the null hypothesis that they are zero). In order to satisfy the Normality conditions required in many models it is common to transform (continuous valued) response data using Log, Square Root or Box-Cox functions (Table 1‑3) prior to analysis, where this can be shown to improve the approximation to the Normal. It is also advisable to examine and, where appropriate, remove outliers, even if these are not errors in the dataset but represent valid data that would otherwise distort the modeling process. Any exclusion must be made explicit, and excluded data should be investigated separately. An alternative to exclusion is to weight observations according to some rule. Spatial declustering (see Section 5.1.2, Spatial sampling) is an example of such weighting applied to spatial datasets.
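The transforms themselves are one-liners; a sketch using numpy and scipy on synthetic positive-valued, skewed data (illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.lognormal(mean=0.0, sigma=0.8, size=500)   # skewed, strictly positive

y_log = np.log(y)             # log transform
y_sqrt = np.sqrt(y)           # square-root transform
y_bc, lam = stats.boxcox(y)   # Box-Cox, with lambda chosen by maximum likelihood
```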

If spatial autocorrelation has been detected (see Section 5.5.1, Autocorrelation, time series and spatial analysis, et seq.) a more reasonable assumption is:

y ~ N(μ, C)

where C is a positive definite covariance matrix. Typically C would be defined in such a way as to ensure that locations that are close to each other are represented with a high value for the modeled covariance, whilst places further apart have lower covariance. This may be achieved using some form of distance-decay function or, for zoned data, using a contiguity-based spatial weights matrix. This kind of arrangement is similar to Generalized Least Squares (GLS), which allows for so-called second-order effects via the inclusion of a covariance matrix. The standard form of expression for GLS is:

y = Xβ + u

where u is a vector of random variables (errors) with mean 0 and variance-covariance matrix C. The GLS solution for β is then:

β̂ = (XᵀC⁻¹X)⁻¹XᵀC⁻¹y
In spatial analysis this kind of GLS model is often utilized, with the matrix C modeled using some form of distance or interaction function. Note that C must be invertible, which places some restrictions on the possible forms it may take. Simple weighted least squares (WLS) is a variant of GLS in which the matrix C is diagonal and the diagonal values are not equal (hence the errors are heteroskedastic). In this case the weights are normally set to the inverse of the variance, thereby giving less weight to observations which have higher variance.
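A hedged numpy sketch of the GLS estimator with an exponential distance-decay covariance (the site coordinates, range parameter and noise level are illustrative assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
coords = rng.uniform(0, 10, size=(n, 2))            # illustrative site locations

# Exponential distance-decay covariance: nearby sites covary strongly
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
C = np.exp(-d / 2.0)                                # positive definite

X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.multivariate_normal(np.zeros(n), 0.1 * C)

# GLS estimate: beta_hat = (X'C^-1 X)^-1 X'C^-1 y, using solve()
# rather than forming C^-1 explicitly
Ci_X = np.linalg.solve(C, X)
Ci_y = np.linalg.solve(C, y)
beta_hat = np.linalg.solve(X.T @ Ci_X, X.T @ Ci_y)
```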

In addition to regression on dependent variables that take continuous values, similar methods have been developed for binary and count or rate data. In the case of binary data the data are typically transformed using the logit function (see Table 1‑3). The central assumption of this model is that the probability, p, of the dependent variable, y, taking the value 1 (rather than 0) follows the logistic curve, i.e.

p = exp(Xβ)/(1 + exp(Xβ))
This model can be linearised using the logit transform, q = ln(p/(1−p)). The regression model then becomes

q = Xβ
As with standard linear regression the errors in this model are assumed to be iid and the predictor variables are assumed not to be co-linear. The parameters in such models are determined using non-linear (iterative) optimization methods, involving maximization of the likelihood function. With count or rate data Poisson regression may be applicable. The model is similar in form to the binary logit model, but in this case the response variable, y, is count (integer) data (hence non-continuous). The basic model in this case is of the form:

y = A·exp(Xβ)
where A is an (optional) offset or population-based value (e.g. the underlying population at risk in each modeled zone). This model may be linearised, as per the logit model, but this time by simply applying a log transform: q = ln(y). The regression model then becomes:

q = ln(A) + Xβ

or

q − ln(A) = Xβ
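Both models are routinely fitted by maximum likelihood; a sketch using statsmodels on synthetic data (the coefficient values and the offset A are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
X = sm.add_constant(rng.normal(size=(n, 2)))    # column of 1's plus two predictors

# Logistic (logit) regression for a binary response
p = 1.0 / (1.0 + np.exp(-(X @ np.array([0.2, 1.0, -0.5]))))
y_bin = rng.binomial(1, p)
logit_fit = sm.Logit(y_bin, X).fit(disp=0)

# Poisson regression for counts, with an offset A (e.g. population at risk),
# so that the fitted mean is A * exp(X beta)
A = rng.uniform(50, 150, size=n)
mu = A * np.exp(X @ np.array([-3.0, 0.3, 0.1]))
y_count = rng.poisson(mu)
pois_fit = sm.Poisson(y_count, X, offset=np.log(A)).fit(disp=0)
```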
There are many variants of linear and non-linear regression analysis and a wealth of associated terms applied in this field — a number of these are summarized in Table 5‑11. In some instances the techniques are not always described as regression analysis per se, but share many underlying statistical assumptions and associated tests.

Choosing between alternative models presents a particular challenge. Each additional parameter may improve the fit of the model to the sample dataset, but at the cost of increasing model complexity. Model building should be guided by the principle of parsimony: that is, the simplest model that is fit for the purpose for which it is built. Overly complicated models frequently involve a loss of information regarding the nature of key relationships. Drawing on this information content view of modeling, an information theoretic statistic is often used as a guide to model selection. The most common statistic now used is the Akaike Information Criterion (AIC), often computed in its small-sample corrected version (AICc) since the latter is asymptotic to the standard version. The measures are defined as:

AIC = 2k − 2ln(L)

AICc = AIC + 2k(k + 1)/(n − k − 1)
where n is the sample size, k is the number of parameters used in the model, and L is the (maximized) likelihood function. It is recommended that the corrected version be used unless n/k > 40. If the errors are assumed to be Normally distributed then the standard expression can be re-written (omitting constant terms that are the same for all models fitted to the same data) as:

AIC = n·ln(ε̂ᵀε̂/n) + 2k
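These measures are simple to compute once a model has been fitted; a sketch under the Normal-errors assumption (this also covers the BIC expression given at the end of this section):

```python
import numpy as np

def aic_gaussian(residuals, k):
    """AIC for Normal errors, with constant terms omitted."""
    n = residuals.size
    return n * np.log(residuals @ residuals / n) + 2 * k

def aicc_gaussian(residuals, k):
    """Small-sample corrected AIC (recommended unless n/k > 40)."""
    n = residuals.size
    return aic_gaussian(residuals, k) + 2 * k * (k + 1) / (n - k - 1)

def bic_gaussian(residuals, k):
    """BIC under the same assumption; penalizes parameters more heavily."""
    n = residuals.size
    return n * np.log(residuals @ residuals / n) + k * np.log(n)
```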
The AIC measure is utilized in several of the software packages described in this Section and elsewhere in this Guide (e.g. in the point pattern analysis software, spatstat). For example GeoDa uses the version above in OLS estimation, whilst Geographically Weighted Regression (GWR) uses a variant of the version corrected for small sample sizes. Both packages use appropriate alternative expressions in cases where the likelihood functions are different from the Normal model. A wide range of regression models, including OLS and GWR, and several autocorrelation models (see below) are supported by the SAM package (which also supports AIC-based model selection). Although SAM has been designed for and by macro ecologists, it can equally well be applied in other fields of spatial research.

Table 5‑11 Selected regression analysis terminology

Form of model

Notes

Simple linear

A single approximately continuous response (dependent) variable and one or more predictor variables related by an expression that is linear in its coefficients (i.e. not necessarily linear in the predictor variables). Note the similarity to weighted averages and geostatistical models

Multiple

This term applies when there are multiple predictor variables and all are quantitative

Multivariate

Regression involving more than one response variable. If, in addition, there are multiple predictor variables the composite term multivariate multiple regression is used

SAR

Simultaneous autoregressive models (SAR is also used as an abbreviation for Spatial Autoregression). A form of regression model including adjustments for spatial autocorrelation. Many variants of the SAR model have been devised

CAR

Conditional autoregressive models — as per SAR, a form of regression model including adjustments for spatial autocorrelation. Differs from SAR in the specification of the inverse covariance matrix. In this model the expected value of the response variable is regarded as being conditional on the recorded values at all other locations

Logistic

Logistic regression applies where the response variable is binary, i.e. of type 1/0 or Yes/No. Typically this involves use of the logit transform (Table 1‑3), with linear regression being conducted on the transformed data. Variants on the basic binary model are available for response variables that represent more than two categories, which may or may not be ordered

Poisson

Poisson regression applies where the response variable is a count (e.g. crime incidents, cases of a disease) rather than a continuous variable. This model may also be applied to standardized counts or “rates”, such as disease incidence per capita, species of tree per square kilometer. It assumes the response variable has a Poisson distribution whose expected value (mean) is dependent on one or more predictor variables. Typically the log of the expected value is assumed to have a linear relationship with the predictor variables. An excess of zeros in many sample datasets may present problems when attempting to apply this form of regression

Ecological

The term ecological regression does not relate directly to the subject of ecology, but to the application of regression methods to data that are aggregated to zones (lattices), as is often the case with census datasets and information collected by administrative districts. The related issue of the so-called ecological fallacy (referred to in Section 3.3.1, Problem: Framing the question) concerns the difficulty of making inferences about the nature of individuals within an aggregated region on the basis of statistics (data values, parameters, relationships) that apply to the aggregated data

Hedonic

The term hedonic regression is used in economics, especially in real estate (property) economics, to estimate demand or prices as a combination of separate components, each of which may be treated as if it had its own market or price. In the context of regression these separate components are often treated as the independent variables in the modeling process

Analysis of variance

Applies if all of the predictors are either qualitative or classified into a (small) number of distinct groups. Analysis of variance methods are often used to analyze the significance of alternative regression models under the Normality assumption for the distribution of errors

Analysis of covariance

Applies if some of the predictors are qualitative and some are quantitative. Analysis of covariance methods are also widely applied in spatial modeling, where the covariance of observations at pairs of locations is examined

A second information theoretic measure, which places greater weight on the number of parameters used, is often also provided. This is known as the Bayesian Information Criterion (BIC), or sometimes as the Schwarz Criterion. It has the simpler form:

BIC = k·ln(n) − 2ln(L)