Next: 9.4 Linear Regression
Up: 9. Robust Statistics
Previous: 9.2 Location and Scale
9.3 Location and Scale in $\mathbb{R}^k$

9.3.1 Equivariance and Metrics
In Sect. 9.2.1 we discussed the equivariance of estimators
for location and scale with respect to the affine group of transformations on
$\mathbb{R}$. This
carries over to higher dimensions although here the requirement of
affine equivariance lacks immediate
plausibility. A change of location and scale for each individual component in
$\mathbb{R}^k$ is represented by an affine transformation of the form
$x \mapsto \Lambda x + b$ where $\Lambda$ is a diagonal matrix. A general affine
transformation forms linear combinations of the individual components, which
goes beyond arguments based on units of measurement. The use of affine
equivariance reduces to the almost empirical question as to whether the data,
regarded as a cloud of points in
$\mathbb{R}^k$, can be well represented by an
ellipsoid. If this is the case, as it often is, then consideration of linear
combinations of different components makes data-analytical sense. With this
proviso in mind we consider the affine group
$\mathcal{A}$ of
transformations of
$\mathbb{R}^k$
into itself,
$A(x) = \Lambda x + b \qquad (9.67)$

where $\Lambda$ is a non-singular $k \times k$-matrix and $b$ is an
arbitrary point in
$\mathbb{R}^k$. Let
$\mathcal{P}$ denote a family of
distributions over
$\mathbb{R}^k$ which is closed under affine
transformations:

$P^A \in \mathcal{P} \text{ for all } P \in \mathcal{P},\ A \in \mathcal{A}, \qquad (9.68)$

where $P^A$ denotes the distribution of $A(X)$ when $X$ is distributed according to $P$.
A function
$T_l : \mathcal{P} \to \mathbb{R}^k$ is called a location functional if it
is well defined and

$T_l(P^A) = A(T_l(P)) \text{ for all } P \in \mathcal{P},\ A \in \mathcal{A}. \qquad (9.69)$

A functional
$T_s : \mathcal{P} \to \Sigma_k$, where
$\Sigma_k$ denotes the set of all strictly positive definite
symmetric $k \times k$
matrices, is called a scale or scatter functional if

$T_s(P^A) = \Lambda\, T_s(P)\, \Lambda^\top \text{ for all } P \in \mathcal{P},\ A \in \mathcal{A} \text{ with } A(x) = \Lambda x + b. \qquad (9.70)$
The requirement of affine equivariance is a strong one as we now
indicate. The most obvious way of defining the median of a
$k$-dimensional data set is to define it by the medians of the
individual components. With this definition the median
is equivariant with respect to transformations of the form
$x \mapsto \Lambda x + b$ with $\Lambda$ a diagonal matrix, but it is not
equivariant for the affine group. A second possibility is to define
the median of a distribution $P$ by

$\mathrm{MED}(P) = \operatorname{argmin}_{\mu} \int \bigl( \lVert x - \mu \rVert - \lVert x \rVert \bigr)\, dP(x).$

With this definition the median is equivariant with respect to
transformations of the form
$x \mapsto \Lambda x + b$ with $\Lambda$ an
orthogonal matrix, but not with respect to the affine group or the
group of transformations $x \mapsto \Lambda x + b$ with $\Lambda$ a diagonal
matrix. The conclusion is that there is no canonical extension of the
median to higher dimensions which is equivariant with respect to the
affine group.
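The failure of affine equivariance for the coordinatewise median is easy to check numerically. The following sketch (our own illustration, not taken from the text's references) compares the coordinatewise median of transformed data with the transformed median, first for a diagonal matrix and then for a general shearing transformation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 2))          # a cloud of points in R^2

def coord_median(data):
    """Coordinatewise median of an (n, k) data array."""
    return np.median(data, axis=0)

# Diagonal transformation: equivariance holds, since each coordinate
# is transformed monotonically on its own.
D = np.diag([2.0, 5.0])
assert np.allclose(coord_median(x @ D.T), D @ coord_median(x))

# General affine transformation: equivariance fails, because the shear
# mixes the components before the componentwise medians are taken.
A = np.array([[1.0, 0.9], [0.0, 1.0]])
gap = np.abs(coord_median(x @ A.T) - A @ coord_median(x))
print(gap.max())                           # typically not (numerically) zero
```

The same experiment with an orthogonal matrix in place of the shear shows the analogous failure for the coordinatewise median and the analogous success for the spatial median.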
In Sect. 9.2 use was made of metrics on the space of
probability distributions on
$\mathbb{R}$. We extend this to
$\mathbb{R}^k$,
where all metrics we consider are of the form

$d_{\mathcal{C}}(P, Q) = \sup_{C \in \mathcal{C}} \lvert P(C) - Q(C) \rvert \qquad (9.71)$

where
$\mathcal{C}$ is a so-called
Vapnik-Cervonenkis class (see for example Pollard (1984)).
The class
$\mathcal{C}$ can be chosen to suit
the problem. Examples are the class of all lower-dimensional
hyperplanes

$\mathcal{H} = \{H : H \text{ a lower-dimensional hyperplane in } \mathbb{R}^k\} \qquad (9.72)$

and the class of all ellipsoids

$\mathcal{E} = \{E : E \text{ an ellipsoid in } \mathbb{R}^k\}. \qquad (9.73)$

These give rise to the metrics $d_{\mathcal{H}}$
and $d_{\mathcal{E}}$ respectively. Just as in
$\mathbb{R}$, metrics
of the form
(9.71) allow direct comparisons between empirical measures and
models. We have

$d_{\mathcal{C}}\bigl(\mathbb{P}_n(P), P\bigr) = O\bigl(n^{-1/2}\bigr) \qquad (9.74)$

uniformly in $P$ (see [80]).
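In one dimension the class of half-lines yields the familiar Kolmogorov metric, and the $O(n^{-1/2})$ behaviour of (9.74) can be seen in a small simulation. The sketch below (our own, with $N(0,1)$ as the model) quadruples the sample size and checks that the mean distance roughly halves:

```python
import numpy as np
from math import erf, sqrt

def d_ko(x):
    """Kolmogorov distance between the empirical measure of x and N(0, 1)."""
    xs = np.sort(x)
    n = len(xs)
    cdf = np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in xs])
    grid = np.arange(1, n + 1) / n
    # The supremum over half-lines is attained at the jump points,
    # approached from either side.
    return max(np.max(np.abs(grid - cdf)), np.max(np.abs(grid - 1.0 / n - cdf)))

rng = np.random.default_rng(4)
dists = [np.mean([d_ko(rng.standard_normal(n)) for _ in range(50)])
         for n in (100, 400, 1600)]
# Quadrupling n roughly halves the mean distance: O(n^{-1/2}).
print([round(d, 3) for d in dists])
```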
9.3.2 M-estimators of Location and Scale
Given the usefulness of M-estimators for one-dimensional data it seems
natural to extend the concept to higher dimensions. We follow [68]. For any positive definite symmetric $k \times k$-matrix
$\Sigma$ we define the metric
$d_\Sigma$ by

$d_\Sigma(x, y) = \bigl( (x - y)^\top \Sigma^{-1} (x - y) \bigr)^{1/2}, \quad x, y \in \mathbb{R}^k.$

Further, let $u$ and $v$ be two non-negative continuous functions
defined on
$[0, \infty)$ and be such that $s\,u(s)$ and $s\,v(s)$, $s \ge 0$, are both
bounded. For a given probability distribution $P$ on the Borel sets of
$\mathbb{R}^k$ we consider in analogy to (9.21) and (9.22) the two
equations in $\mu$ and $\Sigma$:

$\int u\bigl(d_\Sigma(x, \mu)\bigr)\,(x - \mu)\, dP(x) = 0, \qquad (9.75)$

$\int v\bigl(d_\Sigma(x, \mu)\bigr)\,(x - \mu)(x - \mu)^\top\, dP(x) = \Sigma. \qquad (9.76)$
Assuming that at least one solution $(\mu, \Sigma)$
exists we denote it
by
$(T_l(P), T_s(P))$. The existence of a solution of (9.75)
and (9.76) can be shown under weak conditions as
follows. If we define

$\Delta(P) := \sup\{P(H) : H \in \mathcal{H}\} \qquad (9.77)$

with $\mathcal{H}$
as in (9.72), then a solution exists if
$\Delta(P) < 1 - \delta(u, v)$ where $\delta(u, v) > 0$ depends only on the functions
$u$ and $v$ ([68]). Unfortunately the problem of
uniqueness is much more difficult than in the one-dimensional
case. The conditions placed on $P$ in [68] are either that it has a density which is
a decreasing function of
$\lVert x \rVert$ or that it is symmetric, $P(B) = P(-B)$
for every Borel set $B$. Such conditions do not hold
for real data sets, which puts us in an awkward position.
Furthermore, without existence and uniqueness there can be no
results on asymptotic normality and
consequently no results on confidence
intervals. The situation is unsatisfactory, so we now turn to the
one class of M-functionals for which
existence and uniqueness can be shown. The following is
based on [61] and is the multidimensional generalization of
(9.33). The $k$-dimensional $t_\nu$-distribution with density $f_{\nu,k}(\cdot\,; \mu, \Sigma)$
is defined by

$f_{\nu,k}(x; \mu, \Sigma) = \dfrac{c(k, \nu)}{\sqrt{\det(\Sigma)}} \left( 1 + \dfrac{1}{\nu}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right)^{-(\nu + k)/2} \qquad (9.78)$

and we consider the minimization problem

$\text{minimize } \ \dfrac{1}{2}\log(\det(\Sigma)) + \dfrac{\nu + k}{2} \int \log\left( 1 + \dfrac{1}{\nu}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right) dP(x) \ \text{ over } \mu, \Sigma, \qquad (9.79)$

where
$\det(\Sigma)$ denotes the determinant of the positive
definite matrix $\Sigma$. For any distribution $P$ on the Borel
sets of
$\mathbb{R}^k$ we define
$\Delta(P) := \sup\{P(H) : H \text{ a lower-dimensional hyperplane}\}$, which is the
$k$-dimensional version of (9.23). It can be shown that if
$\Delta(P) < 1/2$ then (9.79) has a unique solution.
Moreover for data sets there is a simple algorithm which
converges to the solution. On differentiating the right-hand side
of (9.79) it is seen that the solution is an
M-estimator as in (9.75) and
(9.76). Although this has not been proven explicitly, it
seems clear that the solution will be locally uniformly
Fréchet differentiable,
that is, it will satisfy (9.12) where the influence
function
can be obtained as in (9.54) and the
metric $d_{ko}$ is replaced by the metric
$d_{\mathcal{H}}$. This
together with (9.74) leads to uniform asymptotic
normality and allows the construction of confidence regions. The
only weakness of the proposal is the low gross-error
breakdown point
defined below, which is at most $1/(k+1)$. This upper bound is
shared with the M-functionals defined
by (9.75) and (9.76) ([68]). The problem of
constructing high breakdown functionals in $k$ dimensions will be
discussed below.
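The simple convergent algorithm mentioned above is a fixed-point iteration. A minimal sketch is given below, assuming the standard EM-type weights $w_i = (\nu + k)/(\nu + d_i^2)$ for the $k$-dimensional $t_\nu$-distribution; the function name and the choice $\nu = 3$ are our own:

```python
import numpy as np

def t_mfunctional(x, nu=3.0, iters=200, tol=1e-10):
    """Fixed-point iteration for the multivariate t M-functional.

    Each observation receives the weight w_i = (nu + k) / (nu + d_i^2),
    where d_i is the Mahalanobis distance to the current (mu, Sigma);
    gross outliers are thereby strongly downweighted.
    """
    n, k = x.shape
    mu = x.mean(axis=0)
    sigma = np.cov(x, rowvar=False)
    for _ in range(iters):
        diff = x - mu
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(sigma), diff)
        w = (nu + k) / (nu + d2)
        mu_new = (w[:, None] * x).sum(axis=0) / w.sum()
        diff = x - mu_new
        sigma_new = (w[:, None] * diff).T @ diff / n
        done = (np.abs(mu_new - mu).max() < tol
                and np.abs(sigma_new - sigma).max() < tol)
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return mu, sigma

rng = np.random.default_rng(1)
x = rng.standard_normal((500, 3))
x[:10] += 20.0                      # a few gross outliers
mu, sigma = t_mfunctional(x)
print(np.round(mu, 2))              # close to 0 despite the outliers
```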
9.3.3 Bias and Breakdown
The concepts of bias and breakdown developed in Sect. 9.2.4 carry over to
higher dimensions. Given a metric $d$ on the space of distributions on
$\mathbb{R}^k$ and a location functional $T_l$ we
follow (9.37) and define

$b(T_l, P, \epsilon, d) = \sup\{\lVert T_l(Q) - T_l(P) \rVert : d(P, Q) < \epsilon\} \qquad (9.80)$

and

$b(T_l, P, \epsilon, GE) = \sup\{\lVert T_l(Q) - T_l(P) \rVert : Q = (1 - \epsilon)P + \epsilon G,\ G \in \mathcal{P}\} \qquad (9.81)$

where by convention the bias is infinite
if $T_l$ is not defined at
$Q$. The extension to scale functionals is not so obvious
as there is no canonical definition of bias. We require
a measure of difference between two positive definite symmetric
matrices. For reasons
of simplicity, and because it is sufficient for
our purposes, the one we take is
$D(\Sigma_1, \Sigma_2) = \lvert \log(\det(\Sigma_1 \Sigma_2^{-1})) \rvert$.
Corresponding to (9.36) we define

$b(T_s, P, \epsilon, d) = \sup\{ D(T_s(Q), T_s(P)) : d(P, Q) < \epsilon\} \qquad (9.82)$

and

$b(T_s, P, \epsilon, GE) = \sup\{ D(T_s(Q), T_s(P)) : Q = (1 - \epsilon)P + \epsilon G,\ G \in \mathcal{P}\}. \qquad (9.83)$

Most work is done using the gross-error model, that is the bias functionals
(9.81) and
(9.83).
The breakdown points of $T_l$ are defined by

$\epsilon^*(T_l, P, d) = \sup\{\epsilon > 0 : b(T_l, P, \epsilon, d) < \infty\}, \qquad (9.84)$

$\epsilon^*(T_l, P, GE) = \sup\{\epsilon > 0 : b(T_l, P, \epsilon, GE) < \infty\}, \qquad (9.85)$

$\epsilon^*(T_l, x_1, \ldots, x_n, fsbp) = \max\bigl\{ m/n : \sup_{\mathbb{Q}_{n,m}} \lVert T_l(\mathbb{Q}_{n,m}) \rVert < \infty \bigr\}, \qquad (9.86)$

where $\mathbb{Q}_{n,m}$ ranges over all empirical measures obtained from
$x_1, \ldots, x_n$ by replacing at most $m$ of the observations by
arbitrary points; (9.86) corresponds in the obvious manner to
(9.41). The breakdown points for the scale functional $T_s$
are defined analogously using the bias
functional (9.82). We have

Theorem 4
For any translation equivariant functional $T_l$,

$\epsilon^*(T_l, P, GE) \le 1/2 \quad \text{and} \quad \epsilon^*(T_l, x_1, \ldots, x_n, fsbp) \le \lfloor n/2 \rfloor / n, \qquad (9.87)$

and for any affine equivariant scale functional $T_s$,

$\epsilon^*(T_s, P, GE) \le (1 - \Delta(P))/2 \quad \text{and} \quad \epsilon^*(T_s, x_1, \ldots, x_n, fsbp) \le \lfloor (n - k + 1)/2 \rfloor / n. \qquad (9.88)$
In Sect. 9.2.4 it was shown that the M-estimators of
Sect. 9.2.3 can attain or almost attain the upper bounds
of Theorem 1. Unfortunately this is not the case in $k$
dimensions where, as we have already mentioned, the breakdown points of the M-functionals of
Sect. 9.3.2 are at most $1/(k+1)$. In recent years much
research activity has been directed towards finding high breakdown
affinely equivariant location and scale functionals which attain
or nearly attain the upper bounds of Theorem 4. This is
discussed in the next section.
9.3.4 High Breakdown Location and Scatter Functionals in $\mathbb{R}^k$

The first high breakdown affine
equivariant location and scale functionals were proposed independently of each other by
[103] and [31].
They were defined for empirical data but the
construction can be carried over to measures satisfying a certain
weak condition. The idea is to project the data points onto lines
through the origin and then to determine which points are outliers
with respect to this projection using one-dimensional functionals with
a high breakdown point. More precisely, for a unit vector $\theta$ we set

$o(x, \theta, \mathbb{P}_n) = \dfrac{\lvert \theta^\top x - \mathrm{MED}(\theta^\top x_1, \ldots, \theta^\top x_n) \rvert}{\mathrm{MAD}(\theta^\top x_1, \ldots, \theta^\top x_n)} \qquad (9.89)$

and

$o(x, \mathbb{P}_n) = \sup\{ o(x, \theta, \mathbb{P}_n) : \lVert \theta \rVert = 1 \}. \qquad (9.90)$

This is a measure of the outlyingness of the point $x$ and it may be
checked that it is affine invariant. Location and scale functionals may now be
obtained by taking for example the mean and the covariance matrix of those
observations with the smallest outlyingness measure.
Although (9.90) requires a supremum over all values of $\theta$ this
can be reduced for empirical distributions as follows. Choose all linearly
independent subsets
$x_{i_1}, \ldots, x_{i_k}$ of size $k$ and for each such
subset determine a $\theta$ which is orthogonal to their span. If the
supremum in (9.90) is replaced by a maximum over all such $\theta$ then the
location and scale functionals remain affine equivariant and retain the high
breakdown point. Although this requires the consideration of only a finite
number of directions, namely at most
$\binom{n}{k}$,
this number is too large to make it a practicable
possibility even for small values of $n$ and $k$. The problem of calculability
has remained with high breakdown methods ever since and it is their main
weakness. There are still no high breakdown affine equivariant functionals
which can be calculated exactly except for very small
data sets. [60] goes as far as to say that the problem of
calculability is the breakdown of high breakdown methods. This is
perhaps too
pessimistic but the problem remains unsolved.
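A rough numerical sketch of the outlyingness measure (9.90) is given below. Instead of the exact subset-based directions described above, the supremum is approximated by a maximum over randomly drawn unit vectors, a common heuristic and our own simplification; MED and MAD serve as the high breakdown one-dimensional functionals:

```python
import numpy as np

def sd_outlyingness(x, n_dirs=500, seed=0):
    """Approximate Stahel-Donoho outlyingness of each observation.

    The supremum over all directions is replaced by a maximum over
    randomly drawn unit vectors, so the result is a lower bound on
    the exact outlyingness (9.90).
    """
    rng = np.random.default_rng(seed)
    n, k = x.shape
    theta = rng.standard_normal((n_dirs, k))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    proj = x @ theta.T                              # (n, n_dirs) projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)     # high breakdown scale
    return np.max(np.abs(proj - med) / mad, axis=1)

rng = np.random.default_rng(2)
x = rng.standard_normal((100, 4))
x[0] = [10.0, -10.0, 10.0, -10.0]                   # one gross outlier
o = sd_outlyingness(x)
print(o[0] > o[1:].max())                           # the outlier stands out
```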
[89] introduced two further high breakdown location and scale functionals as follows. The first, the so-called minimum volume ellipsoid (MVE) functional, is a multidimensional
version of Tukey's shortest half-sample (9.8) and is defined as follows. We set

$E(\mathbb{P}_n) = \operatorname{argmin}\bigl\{ \lvert E \rvert : E \text{ an ellipsoid},\ \lvert\{ i : x_i \in E \}\rvert \ge \lfloor n/2 \rfloor + 1 \bigr\} \qquad (9.91)$

where
$\lvert E \rvert$ denotes the volume of $E$ and
$\lvert \{\,\cdot\,\} \rvert$ denotes the number of elements of the set $\{\,\cdot\,\}$. In other words $E(\mathbb{P}_n)$
has the smallest volume of any ellipsoid which contains more than
half the data points. For a general distribution $P$ we define

$E(P) = \operatorname{argmin}\bigl\{ \lvert E \rvert : E \text{ an ellipsoid},\ P(E) \ge 1/2 \bigr\}. \qquad (9.92)$

Given $E(P)$, the location functional $T_l(P)$ is defined to be the centre $\mu(E(P))$ of
$E(P)$ and the covariance functional is taken to be $c(k)\Sigma(E(P))$
where

$E(P) = \bigl\{ x : (x - \mu(E(P)))^\top \Sigma(E(P))^{-1} (x - \mu(E(P))) \le 1 \bigr\}. \qquad (9.93)$

The factor $c(k)$ can be chosen so that
$c(k)\Sigma(E(P)) = I_k$ for
the standard normal distribution in $k$ dimensions.
The second functional is based on the so-called minimum covariance
determinant (MCD) and is as follows. For a Borel set $B$ with $P(B) \ge 1/2$ we write

$\mu(B) = \int_B x \, dP(x) \Big/ P(B), \qquad (9.94)$

$\Sigma(B) = \int_B (x - \mu(B))(x - \mu(B))^\top \, dP(x) \Big/ P(B) \qquad (9.95)$

and define

$\mathrm{MCD}(P) = \operatorname{argmin}\bigl\{ \det(\Sigma(B)) : P(B) \ge 1/2 \bigr\} \qquad (9.96)$

where
$\det(\Sigma(B))$ is
defined to be infinite if either of (9.94) or (9.95) does
not exist. The location functional is taken to be
$\mu(\mathrm{MCD}(P))$
and the scatter functional
$c(k)\Sigma(\mathrm{MCD}(P))$ where again
$c(k)$ is usually chosen so that
$c(k)\Sigma(\mathrm{MCD}(P)) = I_k$ for
the standard normal distribution in $k$ dimensions. It can be shown
that both these functionals are affinely equivariant.
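The combinatorial cost of exact high breakdown computation can be seen directly in a brute-force MCD for empirical data, feasible only for tiny $n$. The sketch below (our own illustration, not an algorithm from the references) enumerates all subsets of size $h = \lfloor n/2 \rfloor + 1$ and keeps the one with minimal covariance determinant:

```python
import numpy as np
from itertools import combinations

def mcd_exact(x, h=None):
    """Exact empirical MCD by exhaustive enumeration.

    The number of h-subsets grows combinatorially in n, which is
    precisely why exact computation is impracticable for realistic
    data sets and heuristic algorithms must be used instead.
    """
    n, k = x.shape
    if h is None:
        h = n // 2 + 1                 # more than half the data points
    best_det, best = np.inf, None
    for idx in combinations(range(n), h):
        sub = x[list(idx)]
        det = np.linalg.det(np.cov(sub, rowvar=False))
        if det < best_det:
            best_det, best = det, sub
    return best.mean(axis=0), np.cov(best, rowvar=False)

rng = np.random.default_rng(3)
x = rng.standard_normal((12, 2))
x[:3] += 15.0                          # three far-away points
mu, sigma = mcd_exact(x)               # C(12, 7) = 792 subsets examined
print(np.all(np.abs(mu) < 2.0))        # the shifted points are excluded
```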
A smoothed version of the minimum volume estimator can be obtained by
considering the minimization problem

$\text{minimize } \det(\Sigma) \text{ subject to } \int \rho\bigl( (x - \mu)^\top \Sigma^{-1} (x - \mu) \bigr)\, dP(x) \ge 1/2, \qquad (9.97)$

where
$\rho : [0, \infty) \to [0, 1]$ satisfies
$\rho(0) = 1$, $\lim_{r \to \infty}\rho(r) = 0$, is non-increasing and is continuous on the right
(see [23]). This gives rise to the class of so-called
$S$-functionals. The minimum volume estimator
can be obtained by
specializing to the case
$\rho = \mathbb{1}_{[0,1]}$.
On differentiating (9.97) it can be seen that an $S$-functional
can be regarded as an M-functional but with redescending functions
$u$ and $v$, in contrast to the conditions placed on $u$ and
$v$ in (9.75) and (9.76) ([64]). For such functions the defining equations for an
M-estimator have many solutions and the minimization problem of
(9.97) can be viewed as a choice function. Other choice
functions can be made, giving rise to different high breakdown
M-estimators. We refer to
[65] and [62].
A further class of
location and scatter functionals has been developed from Tukey's
concept of depth ([109]). We refer to
[32], [63] and Zuo and Serfling (2000a, 2000b).
Many of the above functionals have breakdown points
close to or equal to the upper bound of Theorem 4. For the
calculation of breakdown points we refer to
Davies (1987, 1993), [66], [32] and [111].
The problem of determining a functional which minimizes the
bias over a neighbourhood was considered in the
one-dimensional case in Sect. 9.2.4. The problem is much more
difficult in
$\mathbb{R}^k$ but some
work in this direction has been done (see [1]). The more tractable problem of determining
the size of the bias function for particular
functionals or classes of functionals has also been
considered ([115,69]).
All the above functionals can be shown to exist but there are problems
concerning the uniqueness of the functional. Just as in the case of
Tukey's shortest half (9.8), restrictions must be placed on
the distribution; these generally include the existence of
a density with given properties
(see [23] and [105]) and are
therefore at odds with the spirit of robust statistics. Moreover, even
uniqueness and asymptotic normality at some small class of models are not sufficient. Ideally the functional should
exist, be uniquely defined and be locally uniformly Fréchet differentiable just as
in Sect. 9.2.5. It is not easy to construct affinely equivariant location and scatter functionals
which satisfy the first two conditions but it has been
accomplished by [30] using the Stahel-Donoho idea of
projections described above. To go further and define functionals
which are also locally uniformly Fréchet differentiable with
respect to some metric $d_{\mathcal{C}}$,
just as in the one-dimensional
case considered in Sect. 9.2.5 is a very difficult
problem. The only result in this direction is again due to [30] who managed to construct functionals which are locally
uniformly Lipschitz. The lack of locally uniform Fréchet
differentiability means that all derived confidence intervals will
exhibit a certain degree of instability. Moreover the problem is
compounded by the inability to calculate the functionals
themselves. To some extent it is possible to reduce the instability by,
say, using the MCD functional in preference to the MVE functional, by
reweighting the observations or by calculating a one-step M-functional
as in (9.29)
(see [24]).
However the problem
remains and for this reason we do not discuss the research which has
been carried out on the efficiency of the location and scatter
functionals mentioned above. Their main use is in data analysis where
they are an invaluable tool for detecting
outliers. This will be discussed in the following section.
A scatter matrix plays an important role in many statistical
techniques such as principal component analysis and factor
analysis. The use of robust
scatter functionals in some of these areas has been studied by
among others
[21], [20] and [113].
As already mentioned, the major weakness of all known high breakdown functionals is their
computational complexity. For the MCD functional an exact algorithm
whose complexity is polynomial in $n$ but grows rapidly with the dimension $k$
exists, and there are reasons for supposing
that this cannot be essentially improved upon ([12]).
This means that in practice for all but very
small data sets heuristic algorithms have to be used. We refer to
[95] for a heuristic algorithm for the MCD-functional.
9.3.5 Outliers in $\mathbb{R}^k$

Whereas for univariate, bivariate and even trivariate data
outliers may often be found by visual inspection,
this is not practical in higher dimensions
([16,49,4,48,50]).
This makes it all the more important to
have methods which automatically detect high-dimensional
outliers. Much of the analysis of the one-dimensional problem given in
Sect. 9.2.7 carries over to the $k$-dimensional
problem. In particular, outlier identification rules based on the mean
and covariance of the data suffer from masking problems and must be
replaced by high breakdown functionals
(see also Rocke and Woodruff (1996, 1997)). We restrict attention to
affine equivariant functionals so that
an affine transformation of the data will not alter the
observations which are identified as outliers. The identification
rules we consider are of the form
$\bigl\{ x : (x - T_l(\mathbb{P}_n))^\top T_s(\mathbb{P}_n)^{-1} (x - T_l(\mathbb{P}_n)) \ge c(k, n) \bigr\} \qquad (9.98)$

where $\mathbb{P}_n$ is the empirical measure, $T_l$ and $T_s$ are affine
equivariant location and scatter functionals respectively and $c(k, n)$
is a constant to be determined. This rule is the $k$-dimensional
counterpart of (9.60). In order to specify some reasonable
value for $c(k, n)$ and in order to be able to compare different
outlier identifiers we require, just as in Sect. 9.2.7,
a precise definition of an outlier and a basic model for the majority
of the observations. As our basic model we take the $k$-dimensional
normal distribution
$N(\mu, \Sigma)$. The definition of an
$\alpha_n$-outlier corresponds to (9.62) and is

$\mathrm{out}(\alpha_n, \mu, \Sigma) = \bigl\{ x : (x - \mu)^\top \Sigma^{-1} (x - \mu) > \chi^2_{k; 1 - \alpha_n} \bigr\} \qquad (9.99)$

where
$\alpha_n = 1 - (1 - \tilde{\alpha})^{1/n}$ for some given value of
$\tilde{\alpha} \in (0, 1)$.
Clearly, for an i.i.d. sample of size $n$ distributed
according to
$N(\mu, \Sigma)$ the probability that no
observation lies in the outlier region of (9.99) is just
$1 - \tilde{\alpha}$. Given location and scale functionals $T_l$ and $T_s$ and
a sample
$x_1, \ldots, x_n$ with empirical measure $\mathbb{P}_n$
we write

$\mathrm{OR}(x_1, \ldots, x_n) = \bigl\{ x : (x - T_l(\mathbb{P}_n))^\top T_s(\mathbb{P}_n)^{-1} (x - T_l(\mathbb{P}_n)) \ge c(k, n) \bigr\} \qquad (9.100)$

which corresponds to (9.64). The region
$\mathrm{OR}(x_1, \ldots, x_n)$
is the empirical counterpart of
$\mathrm{out}(\alpha_n, \mu, \Sigma)$
of (9.99) and any
observation lying in
$\mathrm{OR}(x_1, \ldots, x_n)$ will be
identified as an outlier. Just as in the one-dimensional case we
determine the constants $c(k, n)$
by requiring that with probability
$1 - \tilde{\alpha}$ no observation is identified as an outlier in
i.i.d. $N(\mu, \Sigma)$
samples of size $n$. This can be
done by simulations, with appropriate asymptotic approximations for
large $n$. The simulations will of course be based on the
algorithms used to calculate the functionals and will not be based
on the exact functionals, assuming these to be well defined. For
the purpose of outlier
identification this will not be of great consequence. We give
results for three multivariate outlier identifiers based on the
MVE- and
MCD-functionals of [89] and the $S$-functional based on
Tukey's biweight function as given in [84]. There are good
heuristic algorithms for calculating these functionals at least
approximately
([84,95,97]).
The following is based on [8].
Table 9.2 gives the values of $c(k, n)$; the results are based on
simulations for each combination of $k$ and $n$.
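The simulation-based calibration of $c(k, n)$ described above can be sketched as follows. For illustration only, the classical mean and covariance stand in for the robust functionals $T_l$ and $T_s$ (an actual identifier of course requires the high breakdown functionals); the function name and $\tilde{\alpha} = 0.05$ are our own choices:

```python
import numpy as np

def calibrate_cutoff(k, n, alpha_tilde=0.05, n_sim=2000, seed=0):
    """Calibrate c(k, n) by simulation: choose c so that a fraction
    1 - alpha_tilde of clean N(0, I_k) samples contain no flagged
    observation."""
    rng = np.random.default_rng(seed)
    max_d2 = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.standard_normal((n, k))
        diff = x - x.mean(axis=0)
        sigma_inv = np.linalg.inv(np.cov(x, rowvar=False))
        d2 = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)
        max_d2[s] = d2.max()          # largest squared Mahalanobis distance
    # The (1 - alpha_tilde) quantile of the per-sample maximum is the
    # smallest cutoff for which clean samples are rarely flagged.
    return np.quantile(max_d2, 1.0 - alpha_tilde)

c = calibrate_cutoff(k=2, n=50)
print(c > 0)
```

With a robust pair $(T_l, T_s)$ substituted for the mean and covariance, the same loop produces the constants of Table 9.2 up to simulation error.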
[8] show by simulations that although none of the
above rules fails to detect arbitrarily large outliers, it can still
happen that very extreme observations are not identified as outliers. To
quantify this we consider one of the identifiers and a
constellation in which a number of the observations are replaced by other values;
there the mean norm of the most extreme non-identifiable outlier is 4.17. The
situation clearly becomes worse with an increasing proportion of
replaced observations and with the dimension $k$
(see [7]).
If we use the mean of the norm of the most extreme
non-identifiable outlier as a criterion then none of the three rules
dominates the others, although the biweight identifier performs
reasonably well in all cases and is our preferred choice.
Table 9.2:
Normalizing constants $c(k, n)$ for the outlier regions
$\mathrm{OR}_{MVE}$, $\mathrm{OR}_{MCD}$ and $\mathrm{OR}_{BW}$.