Statistical measures and related formulas


Table 1‑3, below, provides a list of common measures (univariate statistics) applied to datasets, together with formulas for calculating each measure from a sample dataset in summation form (rather than integral form) where necessary. In some instances these formulas are adjusted to provide estimates of the population values rather than the values obtained from the sample of data one is working with.

Many of the measures can be extended to two-dimensional forms in a very straightforward manner, and thus they provide the basis for numerous standard formulas in spatial statistics. For a number of univariate statistics (variance, skewness, kurtosis) we refer to the notion of (estimated) moments about the mean. These are computations of the form

mr=(1/n)∑(xi‑x̄)^r, r=1,2,3,…

When r=1 this summation will be 0, since this is just the difference of all values from the mean. For values of r>1 the expression provides measures that are useful for describing the shape (spread, skewness, peakedness) of a distribution, and simple variations on the formula are used to define the correlation between two or more datasets (the product moment correlation). The term moment in this context comes from physics, i.e. like ‘momentum’ and ‘moment of inertia’, and in a spatial (2D) context provides the basis for the definition of a centroid — the center of mass or center of gravity of an object, such as a polygon (see further, Section 4.2.5, Centroids and centers).
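As an illustration (not part of the original table), the r-th estimated moment about the mean, mr=(1/n)∑(xi‑x̄)^r, can be sketched in a few lines of Python; the function name is illustrative:

```python
def central_moment(xs, r):
    """Estimated r-th moment about the mean: m_r = (1/n) * sum((x - mean)**r)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** r for x in xs) / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m1 = central_moment(data, 1)   # deviations from the mean cancel: always 0
m2 = central_moment(data, 2)   # the 2nd moment is the (population) variance
```

For this small dataset the mean is 5, so m1 is exactly 0 and m2 is 4, consistent with the r=1 remark above.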

Table 1‑3 Common formulas and statistical measures

This table of measures has been divided into 9 subsections for ease of use. Each is provided with its own subheading:

Counts and specific values
Measures of centrality
Measures of spread
Measures of distribution shape
Measures of complexity and dimensionality
Common distributions
Data transforms and back transforms
Selected functions
Matrix expressions

For more details on these topics, see the relevant sections of the StatsRef website.

 

Counts and specific values

Measure

Definition

Expression(s)

Count

The number of data values in a set

Count({xi})=n

Top m, Bottom m

The set of the largest (smallest) m values from a set. May be generated via an SQL command. Here, and throughout the table, X1≤X2≤…≤Xn denote the data values sorted into ascending order

Topm{xi}={Xn‑m+1,…Xn‑1,Xn};

Botm{xi}={X1,X2,… Xm};

Variety

The number of distinct, i.e. different, data values in a set. Some packages refer to the variety as diversity, which should not be confused with information theoretic and other diversity measures

 

Majority

The most common, i.e. most frequent, data value in a set. Similar to the mode (see below), but often applied to raster datasets at the neighborhood or zonal level. For general datasets the term should only be applied to cases where a given class accounts for more than 50% of the total

 

Minority

The least common, i.e. least frequently occurring, data value in a set. Often applied to raster datasets at the neighborhood or zonal level

 

Maximum, Max

The maximum value of a set of values. May not be unique

Max{xi}=Xn

Minimum, Min

The minimum value of a set of values. May not be unique

Min{xi}=X1

Sum

The sum of a set of data values

Sum{xi}=∑xi=x1+x2+…+xn

Measures of centrality

Measure

Definition

Expression(s)

Mean (arithmetic)

The arithmetic average of a set of data values (also known as the sample mean where the data are a sample from a larger population). Note that if the set {fi} are regarded as weights rather than frequencies the result is known as the weighted mean. Other mean values include the geometric and harmonic mean. The population mean is often denoted by the symbol μ. In many instances the sample mean is the best (unbiased) estimate of the population mean and is sometimes denoted by μ with a ^ symbol above it, or as a variable such as x with a bar above it (x̄)

Mean{xi}=x̄=(1/n)∑xi; frequency/weighted form: x̄=∑fixi/∑fi

Mean (harmonic)

The harmonic mean, H, is the mean of the reciprocals of the data values, which is then adjusted by taking the reciprocal of the result. The harmonic mean is less than or equal to the geometric mean, which is less than or equal to the arithmetic mean

H=n/∑(1/xi)

Mean (geometric)

The geometric mean, G, is the mean obtained by taking the product of the data values and then the nth root of the result. The geometric mean is greater than or equal to the harmonic mean and less than or equal to the arithmetic mean

G=(∏xi)^(1/n), hence

ln(G)=(1/n)∑ln(xi)

Mean (power)

The general (limit) expression for mean values. Values of p give the following means: p=1 arithmetic; p=2 root mean square; p=‑1 harmonic. Limit values of p (i.e. as p tends to these values) give the following means: p→0 geometric; p→‑∞ minimum; p→∞ maximum

Mp=((1/n)∑xi^p)^(1/p)
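A minimal Python sketch (function names are illustrative) of the three classical means, which can be used to check the ordering H ≤ G ≤ A for positive data:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    # reciprocal of the mean of the reciprocals
    return len(xs) / sum(1.0 / x for x in xs)

def geometric_mean(xs):
    # nth root of the product, computed via logs for numerical stability
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

data = [1.0, 2.0, 4.0]
H, G, A = harmonic_mean(data), geometric_mean(data), arithmetic_mean(data)
# for this data: H = 12/7, G = 2, A = 7/3, so H <= G <= A
```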

Trim-mean, TM, t, Olympic mean

The mean value computed with a specified percentage (proportion), t/2, of values removed from each tail to eliminate the highest and lowest outliers and extreme values. For small samples a specific number of observations (e.g. 1), rather than a percentage, may be ignored. In general an equal number, k, of high and low values should be removed and the number of observations summed should equal n(1‑t) expressed as an integer. This variant is sometimes described as the Olympic mean, as used in scoring Olympic gymnastics, for example

t∈[0,1]

Mode

The most common or frequently occurring value in a set. Where a set has one dominant value or range of values it is said to be unimodal; if there are several commonly occurring values or ranges it is described as multi-modal. Note that mean‑mode≈3(mean‑median) for many unimodal distributions

 

Median, Med

The middle value in an ordered set of data if the set contains an odd number of values, or the average of the two middle values if the set contains an even number of values. For a continuous distribution the median is the 50% point (0.5) obtained from the cumulative distribution of the values or function

Med{xi}=X(n+1)/2 ; n odd

Med{xi}=(X(n/2)+X(n/2+1))/2; n even
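The odd/even rule above can be sketched in Python (an illustrative helper, not part of the original table):

```python
def median(xs):
    """Middle value of the sorted data for odd n; mean of the two middle values for even n."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
```

For example, median([3, 1, 2]) is 2, while median([4, 1, 3, 2]) averages the two middle values to give 2.5.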

Mid-range, MR

The middle value of the range, i.e. the average of the minimum and maximum values

MR{xi}=(X1+Xn)/2

Root mean square (RMS)

The root of the mean of the squared data values. Squaring removes the effect of negative values

RMS=√((1/n)∑xi^2)

Measures of spread

Measure

Definition

Expression(s)

Range

The difference between the maximum and minimum values of a set

Range{xi}=Xn‑X1

Lower quartile (25%), LQ

In an ordered set, 25% of data items are less than or equal to the upper bound of this range. For a continuous distribution the LQ is the set of values from 0% to 25% (0.25) obtained from the cumulative distribution of the values or function. Treatment of cases where n is even and where n is odd, and whether i runs from 1 to n or 0 to n, varies

LQ={X1, … X(n+1)/4}

Upper quartile (75%), UQ

In an ordered set, 75% of data items are less than or equal to the upper bound of this range. For a continuous distribution the UQ is the set of values from 75% (0.75) to 100% obtained from the cumulative distribution of the values or function. Treatment of cases where n is even and where n is odd, and whether i runs from 1 to n or 0 to n, varies

UQ={X3(n+1)/4, … Xn}

Inter-quartile range, IQR

The difference between the lower and upper quartile values, hence covering the middle 50% of the distribution. The inter-quartile range can be obtained by taking the median of the dataset, then finding the median of the upper and lower halves of the set. The IQR is then the difference between these two secondary medians

IQR=UQ-LQ
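The median-of-halves procedure described above can be sketched as follows (a minimal Python illustration; the convention of excluding the overall median from both halves when n is odd is one of several in use, as the quartile entries note):

```python
def _median(s):
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def iqr(xs):
    """IQR via the median-of-halves method: median of the upper half minus
    median of the lower half, excluding the overall median when n is odd."""
    s = sorted(xs)
    n = len(s)
    lower, upper = s[: n // 2], s[(n + 1) // 2:]
    return _median(upper) - _median(lower)

# for the values 1..9 the halves are [1,2,3,4] and [6,7,8,9],
# with medians 2.5 and 7.5, so the IQR is 5
```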

Trim-range, TR, t

The range computed with a specified percentage (proportion), t/2, of the highest and lowest values removed to eliminate outliers and extreme values. For small samples a specific number of observations (e.g. 1) rather than a percentage, may be ignored. In general an equal number, k, of high and low values are removed (if possible)

TRt=X(n(1‑t/2))‑X(nt/2), t∈[0,1]

TR50%=IQR

Variance, Var, σ2, s2 , μ2

The average squared difference of values in a dataset from their population mean, μ, or from the sample mean (also known as the sample variance where the data are a sample from a larger population). Differences are squared to remove the effect of negative values (the summation would otherwise be 0). The third formula is the frequency form, where frequencies have been standardized, i.e. ∑fi=1. Var is a function of the 2nd moment about the mean. The population variance is often denoted by the symbol μ2 or σ2

Var{xi}=(1/n)∑(xi‑μ)^2; Var{xi}=(1/n)∑(xi‑x̄)^2; Var{xi}=∑fi(xi‑x̄)^2

The estimated population variance is often denoted by s2 or by σ2 with a ^ symbol above it

s^2=(1/(n‑1))∑(xi‑x̄)^2
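The distinction between the population form (divisor n) and the unbiased estimate (divisor n‑1) can be made concrete in Python (an illustrative sketch; the `sample` flag name is an assumption):

```python
def variance(xs, sample=True):
    """sample=True: unbiased estimate s^2 (divisor n-1);
    sample=False: population form (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / (n - 1) if sample else ss / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# population variance: 32/8 = 4; unbiased estimate: 32/7
```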

Standard deviation, SD, s or RMSD

The square root of the variance, hence it is the Root Mean Squared Deviation (RMSD). The population standard deviation is often denoted by the symbol σ. SD* shows the estimated population standard deviation (sometimes denoted by σ with a ^ symbol above it or by s)

SD{xi}=√((1/n)∑(xi‑x̄)^2); SD*{xi}=s=√((1/(n‑1))∑(xi‑x̄)^2)

Standard error of the mean, SE

The estimated standard deviation of the mean values of n samples from the same population. It is simply the sample standard deviation reduced by a factor equal to the square root of the number of samples, n≥1

SE=s/√n

Root mean squared error, RMSE

The standard deviation of samples from a known set of true values, xi*. If xi* are estimated by the mean of sampled values RMSE is equivalent to RMSD

RMSE=√((1/n)∑(xi‑xi*)^2)

Mean deviation/error, MD or ME

The mean deviation of samples from the known set of true values, xi*

ME=(1/n)∑(xi‑xi*)

Mean absolute deviation/error, MAD or MAE

The mean absolute deviation of samples from the known set of true values, xi*

MAE=(1/n)∑|xi‑xi*|

Covariance, Cov

Literally the pattern of common (or co-) variation observed in a collection of two (or more) datasets, or partitions of a single dataset. Note that if the two sets are the same the covariance is the same as the variance

Cov(x,y)=(1/n)∑(xi‑x̄)(yi‑ȳ)

Cov(x,x)=Var(x)

Correlation/ product moment or Pearson’s correlation coefficient, r

A measure of the similarity between two (or more) paired datasets. The correlation coefficient is the ratio of the covariance to the product of the standard deviations. If the two datasets are the same or perfectly matched the result is r=1

r=Cov(x,y)/(SDxSDy)
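The ratio r=Cov(x,y)/(SDxSDy) can be computed directly (an illustrative Python sketch; the population divisor n cancels in the ratio, so the divisor choice does not affect r):

```python
import math

def pearson_r(xs, ys):
    """Pearson's product moment correlation: Cov(x, y) / (SDx * SDy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sdx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sdy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sdx * sdy)

# perfectly matched (linearly related) data gives r = 1;
# a perfect inverse relationship gives r = -1
```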

Coefficient of variation, CV

The ratio of the standard deviation to the mean, sometimes computed as a percentage. If this ratio is close to 1, and the distribution is unimodal with its mass concentrated to the left (a long right tail), it may suggest the underlying distribution is Exponential. Note, mean values close to 0 may produce unstable results

CV=SD/x̄

Variance mean ratio, VMR

The ratio of the variance to the mean, sometimes computed as a percentage. If this ratio is close to 1, and the distribution is unimodal and relates to count data, it may suggest the underlying distribution is Poisson. Note, mean values close to 0 may produce unstable results

VMR=Var/x̄

Measures of distribution shape

Measure

Definition

Expression(s)

Skewness, α3

If a frequency distribution is unimodal and symmetric about the mean it has a skewness of 0. Values greater than 0 suggest skewness of a unimodal distribution to the right, whilst values less than 0 indicate skewness to the left. A function of the 3rd moment about the mean (denoted by α3 with a ^ symbol above it for the sample skewness)

α3=m3/m2^(3/2), where mr=(1/n)∑(xi‑x̄)^r

Kurtosis, α4

A measure of the peakedness of a frequency distribution. More peaked distributions tend to have high kurtosis values. A function of the 4th moment about the mean. It is customary to subtract 3 from the raw kurtosis value (3 being the kurtosis of the Normal distribution) to give a figure relative to the Normal (denoted by α4 with a ^ symbol above it for the sample kurtosis)

α4=m4/m2^2‑3, where mr=(1/n)∑(xi‑x̄)^r

Measures of complexity and dimensionality

Measure

Definition

Expression(s)

Information statistic (Entropy), I (Shannon’s)

A measure of the amount of pattern, disorder or information in a set {xi}, where pi is the proportion of events or values occurring in the ith class or range. Note that if pi=0 then pilog2(pi) is taken as 0. I takes values in the range [0,log2(k)]. The lower value means all data fall into 1 category, whilst the upper means all data are evenly spread across the k categories

I=‑∑pilog2(pi)
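The two extremes noted above (all data in one class, versus an even spread over k classes) can be verified with a short Python sketch (the function name is illustrative):

```python
import math

def shannon_entropy(ps):
    """I = -sum(p_i * log2(p_i)), with terms where p_i = 0 taken as 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# all mass in one class gives the minimum, 0;
# k equally likely classes give the maximum, log2(k) (here k = 4, so 2 bits)
```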

Information statistic (Diversity), Div

Shannon’s entropy statistic (see above) standardized by the number of classes, k, to give a range of values from 0 to 1

Div=I/log2(k)

Dimension (topological), DT

Broadly, the number of (intrinsic) coordinates needed to refer to a single point anywhere on the object. The dimension of a point=0, a rectifiable line=1, a surface=2 and a solid=3. See text for fuller explanation. The value 2.5 (often denoted 2.5D) is used in GIS to denote a planar region over which a single-valued attribute has been defined at each point (e.g. height). In mathematics topological dimension is now equated to a definition similar to cover dimension (see below)

DT=0,1,2,3,…

Dimension (capacity, cover or fractal), DC

Let N(h) represent the number of small elements of edge length h required to cover an object. For a line of length 1, N(h)=1/h. For a unit plane surface (covered by small squares of side length h), N(h)=1/h^2, and for a unit volume (covered by cubes of edge h), N(h)=1/h^3.

More generally N(h)=1/h^D, where D is the dimension, so N(h)=h^‑D and thus log(N(h))=‑Dlog(h), giving Dc=‑log(N(h))/log(h). Dc may be fractional, in which case the term fractal is used

Dc≥0
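The cover-dimension computation Dc=‑log(N(h))/log(h) reduces to a one-liner; the sketch below (illustrative function name) recovers the integer dimensions for the unit line and unit square examples:

```python
import math

def cover_dimension(n_elements, h):
    """Dc = -log(N(h)) / log(h), for covering elements of edge length h < 1."""
    return -math.log(n_elements) / math.log(h)

# a unit line covered by 10 segments of length 0.1 has dimension 1;
# a unit square covered by 100 cells of side 0.1 has dimension 2
```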

Common distributions

Measure

Definition

Expression(s)

Uniform (continuous)

All values in the range are equally likely. Mean=a/2, variance=a^2/12. Here we use f(x) to denote the probability distribution associated with continuous valued variables x, also described as a probability density function

f(x)=1/a, 0≤x≤a

Binomial (discrete)

The terms of the Binomial give the probability of x successes out of n trials, for example 3 heads in 10 tosses of a coin, where p=probability of success and q=1‑p=probability of failure. Mean, m=np, variance=npq. Here we use p(x) to denote the probability distribution associated with discrete valued variables x

p(x)=C(n,x)p^x q^(n‑x), x=0,1,2,…n

Poisson (discrete)

An approximation to the Binomial when p is very small and n is large (>100), but the mean m=np is fixed and finite (usually not large). Mean=variance=m

p(x)=e^(‑m)m^x/x!, x=0,1,2,…
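The quality of the Poisson approximation to the Binomial is easy to check numerically; the following Python sketch (illustrative function names) compares the two probability functions for small p, large n and m=np:

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x): x successes in n trials, success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, m):
    """P(X = x) for a Poisson variate with mean (and variance) m."""
    return math.exp(-m) * m ** x / math.factorial(x)

# with p small and n large the two are close: here m = n * p = 2
b = binomial_pmf(3, 1000, 0.002)
q = poisson_pmf(3, 2.0)
```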

Normal (continuous)

The distribution of a measurement, x, that is subject to a large number of independent, random, additive errors. The Normal distribution may also be derived as an approximation to the Binomial when p is not small (e.g. p≈1/2) and n is large. If μ=mean and σ=standard deviation, we write N(μ,σ) as the Normal distribution with these parameters. The Normal- or z-transform z=(x‑μ)/σ changes (normalizes) the distribution so that it has a zero mean and unit variance, N(0,1). The distribution of the mean of n independent random variables drawn from any underlying distribution with finite variance also tends to the Normal as n increases (Central Limit Theorem)

f(x)=(1/(σ√(2π)))exp(‑(x‑μ)^2/(2σ^2))

Data transforms and back transforms

Measure

Definition

Expression(s)

Log

If the frequency distribution for a dataset is broadly unimodal with its mass concentrated to the left (a long right tail), the natural log transform (logarithms base e) will adjust the pattern to make it more symmetric/similar to a Normal distribution. For variates whose values may range from 0 upwards, a value of 1 is often added before taking the transform. Back transform with the exp() function

z=ln(x) or

z=ln(x+1)

n.b. ln(x)=loge(x)=log10(x)/log10(e)

x=exp(z) or x=exp(z)‑1
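The z=ln(x+1) form and its back transform round-trip exactly, as a short Python sketch shows (illustrative function names); the test also checks the base-change identity noted above:

```python
import math

def log1_transform(x):
    """z = ln(x + 1), usable for variates ranging from 0 upwards."""
    return math.log(x + 1)

def back_transform(z):
    """Inverse: x = exp(z) - 1."""
    return math.exp(z) - 1
```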

Square root
(Freeman-Tukey)

A transform that may adjust the dataset to make it more similar to a Normal distribution. For variates whose values may range from 0 upwards a value of 1 is often added to the transform. For 0≤x≤1 (e.g. rate data) the combined form of the transform is often used, and is known as the Freeman-Tukey (FT) transform

z=√x or z=√(x+1); FT: z=√x+√(x+1)

Logit

Often used to transform binary response data, such as survival/non-survival or present/absent, to provide a continuous value in the range (‑∞,∞), where p is the proportion of the sample that is 1 (or 0). The inverse or back-transform is shown as p in terms of z. This transform avoids concentration of values at the ends of the range. For samples where proportions p may take the values 0 or 1 a modified form of the transform may be used, typically achieved by adding 1/2n to the numerator and denominator, where n is the sample size. Often used to correct S-shaped (logistic) relationships between response and explanatory variables

z=ln(p/(1‑p)); p=e^z/(1+e^z)=1/(1+e^(‑z))
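The logit, its back transform, and the 1/2n adjustment for p exactly 0 or 1 can be sketched as follows (a minimal Python illustration; function names are assumptions):

```python
import math

def logit(p):
    """z = ln(p / (1 - p)): maps (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Back-transform: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def adjusted_logit(p, n):
    """Modified form when p may be exactly 0 or 1: add 1/(2n) to the
    numerator and denominator, where n is the sample size."""
    half = 1.0 / (2 * n)
    return math.log((p + half) / (1 - p + half))
```

Note that logit(0.5)=0, and the adjusted form remains finite at p=0 and p=1, which the plain logit does not.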

Normal, z-transform

This transform normalizes or standardizes the distribution so that it has a zero mean and unit variance. If {xi} is a set of n sample mean values from any probability distribution with mean μ and variance σ2 then the z-transform shown here as z2 will be distributed N(0,1) for large n (Central Limit Theorem). The divisor in this instance is the standard error. In both instances the standard deviation must be non-zero

z=(x‑μ)/σ; z2=(x̄‑μ)/(σ/√n)
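Both forms of the z-transform are one-liners in Python (an illustrative sketch; function names are assumptions); the second divides by the standard error σ/√n rather than σ:

```python
import math

def z_transform(x, mu, sigma):
    """z = (x - mu) / sigma: zero mean and unit variance after transformation."""
    return (x - mu) / sigma

def z_transform_of_mean(xbar, mu, sigma, n):
    """For a sample mean the divisor is the standard error, sigma / sqrt(n)."""
    return (xbar - mu) / (sigma / math.sqrt(n))
```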

Box-Cox, power transforms

A family of transforms defined for positive data values only, that often can make datasets more Normal; k is a parameter. The inverse or back-transform is also shown as x in terms of z

z=(x^k‑1)/k, k≠0; x=(kz+1)^(1/k)

Angular transforms (Freeman-Tukey)

A transform for proportions, p, designed to spread the set of values near the ends of the range. k is typically 0.5. Often used to correct S-shaped relationships between response and explanatory variables. If p=x/n then the Freeman-Tukey (FT) version of this transform is the averaged version shown. This is a variance-stabilizing transform

z=sin^(‑1)(p^k); FT: z=(1/2)[sin^(‑1)(√(x/(n+1)))+sin^(‑1)(√((x+1)/(n+1)))]

Selected functions

Measure

Definition

Expression(s)

Bessel functions of the first kind

Bessel functions occur as the solution to specific differential equations. They are described with reference to a parameter known as the order, shown as a subscript. For non-negative real orders Bessel functions can be represented as an infinite series. Order 0 expansions are shown here for standard (J) and modified (I) Bessel functions. Usage in spatial analysis arises in connection with directional statistics and spline curve fitting. See the Mathworld website entry for more details

J0(x)=∑k≥0(‑1)^k(x/2)^(2k)/(k!)^2 and I0(x)=∑k≥0(x/2)^(2k)/(k!)^2

Exponential integral function, E1(x)

A definite integral function. Used in association with spline curve fitting. See the Mathworld website entry for more details

E1(x)=∫(from x to ∞) (e^(‑t)/t)dt

Gamma function, Γ

A widely used definite integral function: Γ(x)=∫(from 0 to ∞) t^(x‑1)e^(‑t)dt. For integer values of x:

Γ(x)=(x‑1)!, and Γ(1/2)=√π, so Γ(3/2)=(1/2)Γ(1/2)=√π/2

See the Mathworld website entry for more details
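The factorial relation Γ(x)=(x‑1)! and the half-integer value Γ(3/2)=√π/2 can be checked directly with Python's standard library:

```python
import math

# for integer x, Gamma(x) = (x - 1)!
g5 = math.gamma(5)          # = 4! = 24
# Gamma(1/2) = sqrt(pi), hence Gamma(3/2) = sqrt(pi) / 2
g32 = math.gamma(1.5)
```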

Matrix expressions

Measure

Definition

Expression(s)

Identity

A matrix with diagonal elements 1 and off-diagonal elements 0

I, or In for the n by n case

Determinant

Determinants are only defined for square matrices. Let A be an n by n matrix with elements {aij}. The matrix Mij here is a subset of A known as the minor, formed by eliminating row i and column j from A. An n by n matrix, A, with Det(A)=0 is described as singular, and such a matrix has no inverse. If Det(A) is very close to 0 the matrix is described as ill-conditioned

Det(A)=∑j(‑1)^(i+j)aijDet(Mij), for any fixed row i

|A|, Det(A)

Inverse

The matrix equivalent of division in conventional algebra. For a matrix, A, to be invertible its determinant must be non-zero, and ideally not very close to zero. A matrix that has an inverse is by definition non-singular. A symmetric real-valued matrix is positive definite if all its eigenvalues are positive, whereas a positive semi-definite matrix allows for some eigenvalues to be 0. A matrix, A, that is invertible satisfies the relation AA‑1=I

A‑1

Transpose

A matrix operation in which the rows and columns are transposed, i.e. in which elements aij are swapped with aji for all i,j. The inverse of a transposed matrix is the same as the transpose of the matrix inverse

AT or A′

(AT)–1=(A‑1)T

Symmetric

A matrix in which element aij=aji for all i,j

A=AT

Trace

The sum of the diagonal elements of a matrix, aii — the sum of the eigenvalues of a matrix equals its trace

Tr(A)

Eigenvalue, Eigenvector

If A is a real-valued k by k square matrix and x is a non-zero real-valued vector, then a scalar λ that satisfies the equation shown in the adjacent column is known as an eigenvalue of A and x is an eigenvector of A. There are k eigenvalues of A, each with a corresponding eigenvector. The matrix A can be decomposed into three parts, as shown, where E is a matrix of its eigenvectors and D is a diagonal matrix of its eigenvalues

(A‑λI)x=0

A=EDE‑1 (diagonalization)
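For the 2 by 2 case the eigenvalues can be found by hand from the characteristic polynomial λ^2 − Tr(A)λ + Det(A) = 0, which also illustrates the trace relation noted above (a minimal Python sketch assuming real eigenvalues, as for symmetric matrices; the function name is illustrative):

```python
import math

def eig2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial
    lambda^2 - Tr(A)*lambda + Det(A) = 0 (assumes real roots)."""
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

l1, l2 = eig2(2.0, 1.0, 1.0, 2.0)   # symmetric matrix: eigenvalues 3 and 1
# the sum of the eigenvalues (3 + 1) equals the trace (2 + 2)
```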