Next: 9.4 Linear Regression Up: 9. Robust Statistics Previous: 9.2 Location and Scale


# 9.3 Location and Scale in $\mathbb{R}^k$

## 9.3.1 Equivariance and Metrics

In Sect. 9.2.1 we discussed the equivariance of estimators for location and scale with respect to the affine group of transformations on $\mathbb{R}$. This carries over to higher dimensions although here the requirement of affine equivariance lacks immediate plausibility. A change of location and scale for each individual component in $\mathbb{R}^k$ is represented by an affine transformation of the form $\Lambda x + b$ where $\Lambda$ is a diagonal matrix. A general affine transformation forms linear combinations of the individual components, which goes beyond arguments based on units of measurement. The use of affine equivariance reduces to the almost empirical question as to whether the data, regarded as a cloud of points in $\mathbb{R}^k$, can be well represented by an ellipsoid. If this is the case, as it often is, then consideration of linear combinations of different components makes data analytical sense. With this proviso in mind we consider the affine group $\mathcal{A}$ of transformations of $\mathbb{R}^k$ into itself,

 $\alpha(x) = Ax + b, \quad x \in \mathbb{R}^k$  (9.67)

where $A$ is a non-singular $k \times k$-matrix and $b$ is an arbitrary point in $\mathbb{R}^k$. Let $\mathcal{P}_k$ denote a family of distributions over $\mathbb{R}^k$ which is closed under affine transformations

 $P^{\alpha} \in \mathcal{P}_k$ for all $P \in \mathcal{P}_k,\ \alpha \in \mathcal{A}$  (9.68)

where $P^{\alpha}$ denotes the distribution of $\alpha(X)$ for $X$ distributed according to $P$.

A function $T_l : \mathcal{P}_k \rightarrow \mathbb{R}^k$ is called a location functional if it is well defined and

 $T_l(P^{\alpha}) = A\,T_l(P) + b$ for all $P \in \mathcal{P}_k,\ \alpha \in \mathcal{A}$  (9.69)

A functional $T_s : \mathcal{P}_k \rightarrow \Sigma_k$, where $\Sigma_k$ denotes the set of all strictly positive definite symmetric $k \times k$-matrices, is called a scale or scatter functional if

 $T_s(P^{\alpha}) = A\,T_s(P)\,A^{\top}$ for all $P \in \mathcal{P}_k,\ \alpha \in \mathcal{A}$ with $\alpha(x) = Ax + b$  (9.70)

The requirement of affine equivariance is a strong one as we now indicate. The most obvious way of defining the median of a $k$-dimensional data set is to define it by the medians of the individual components. With this definition the median is equivariant with respect to transformations of the form $x \mapsto \Lambda x + b$ with $\Lambda$ a diagonal matrix but it is not equivariant for the affine group. A second possibility is to define the median $\mathrm{MED}(P)$ of a distribution $P$ by

 $\mathrm{MED}(P) = \operatorname{argmin}_{\mu} \int \big(\|x - \mu\| - \|x\|\big)\,dP(x)$

With this definition the median is equivariant with respect to transformations of the form $x \mapsto Ox + b$ with $O$ an orthogonal matrix but not with respect to the affine group or the group of transformations $x \mapsto \Lambda x + b$ with $\Lambda$ a diagonal matrix. The conclusion is that there is no canonical extension of the median to higher dimensions which is equivariant with respect to the affine group.
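These equivariance properties are easy to check numerically. The following sketch (plain NumPy with illustrative data; the function name is ours, not from the text) confirms that the componentwise median commutes with a diagonal affine map but fails to commute with a rotation.

```python
import numpy as np

def coordinatewise_median(X):
    """Median of each component separately."""
    return np.median(X, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
b = np.array([1.0, -2.0])

# Diagonal affine map: the componentwise median is equivariant.
L = np.diag([2.0, 0.5])
lhs = coordinatewise_median(X @ L + b)        # median of the transformed data
rhs = L @ coordinatewise_median(X) + b        # transformed median
print(np.allclose(lhs, rhs))                  # True

# General affine map (a rotation): equivariance fails.
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
A = np.array([[c, -s], [s, c]])
lhs_rot = coordinatewise_median(X @ A.T + b)
rhs_rot = A @ coordinatewise_median(X) + b
print(np.allclose(lhs_rot, rhs_rot))          # generally False
```

The rotated case fails because the median of each rotated coordinate is not the rotation of the coordinatewise medians; the discrepancy is of the order of the sampling error, not rounding error.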

In Sect. 9.2 use was made of metrics on the space of probability distributions on $\mathbb{R}$. We extend this to $\mathbb{R}^k$ where all metrics we consider are of the form

 $d_{\mathcal{C}}(P, Q) = \sup\{|P(C) - Q(C)| : C \in \mathcal{C}\}$  (9.71)

where $\mathcal{C}$ is a so-called Vapnik-Cervonenkis class (see for example Pollard (1984)). The class $\mathcal{C}$ can be chosen to suit the problem. Examples are the class $\mathcal{H}$ of all lower dimensional hyperplanes

 $\mathcal{H} = \{H : H \text{ a lower dimensional hyperplane in } \mathbb{R}^k\}$  (9.72)

and the class $\mathcal{E}$ of all ellipsoids

 $\mathcal{E} = \{E : E \text{ an ellipsoid in } \mathbb{R}^k\}$  (9.73)

These give rise to the metrics $d_{\mathcal{H}}$ and $d_{\mathcal{E}}$ respectively. Just as in $\mathbb{R}$, metrics of the form (9.71) allow direct comparisons between empirical measures and models. We have

 $d_{\mathcal{C}}\big(P_n(P), P\big) = O\big(1/\sqrt{n}\big)$  (9.74)

uniformly in $P$ (see [80]).

## 9.3.2 M-estimators of Location and Scale

Given the usefulness of M-estimators for one dimensional data it seems natural to extend the concept to higher dimensions. We follow [68]. For any positive definite symmetric $k \times k$-matrix $\Sigma$ we define the metric $d_{\Sigma}$ by

 $d_{\Sigma}(x, y) = \big((x - y)^{\top}\Sigma^{-1}(x - y)\big)^{1/2}$

Further, let $u$ and $v$ be two non-negative continuous functions defined on $[0, \infty)$ and be such that $s\,u(s)$ and $s\,v(s)$, $s \ge 0$, are both bounded. For a given probability distribution $P$ on the Borel sets of $\mathbb{R}^k$ we consider, in analogy to (9.21) and (9.22), the two equations in $\mu$ and $\Sigma$

 $\displaystyle\int u\big(d_{\Sigma}(x, \mu)\big)(x - \mu)\,dP(x) = 0$  (9.75)

 $\displaystyle\int v\big(d_{\Sigma}(x, \mu)\big)(x - \mu)(x - \mu)^{\top}\,dP(x) = \Sigma$  (9.76)

Assuming that at least one solution exists we denote it by $(T_l(P), T_s(P))$. The existence of a solution of (9.75) and (9.76) can be shown under weak conditions as follows. If we define

 $\Delta(P) = \max\{P(H) : H \in \mathcal{H}\}$  (9.77)

with $\mathcal{H}$ as in (9.72) then a solution exists if $\Delta(P) < 1 - \delta(u, v)$ where $\delta(u, v) > 0$ depends only on the functions $u$ and $v$ ([68]). Unfortunately the problem of uniqueness is much more difficult than in the one-dimensional case. The conditions placed on $P$ in [68] are either that it has a density which is a decreasing function of $\|x\|$ or that it is symmetric in the sense that $P(B) = P(-B)$ for every Borel set $B$. Such conditions do not hold for real data sets which puts us in an awkward position. Furthermore without existence and uniqueness there can be no results on asymptotic normality and consequently no results on confidence intervals. The situation is unsatisfactory so we now turn to the one class of M-functionals for which existence and uniqueness can be shown. The following is based on [61] and is the multidimensional generalization of (9.33). The $k$-dimensional $t$-distribution with $\nu$ degrees of freedom has density $f_{k,\nu}$ defined by

 $f_{k,\nu}(x) = \dfrac{\Gamma\big((k + \nu)/2\big)}{(\pi\nu)^{k/2}\,\Gamma(\nu/2)}\Big(1 + \dfrac{\|x\|^2}{\nu}\Big)^{-(k+\nu)/2}$  (9.78)

and we consider the minimization problem

 minimize $\; -\displaystyle\int \log f_{k,\nu}\big(\Sigma^{-1/2}(x - \mu)\big)\,dP(x) + \tfrac{1}{2}\log\big(\mathrm{det}(\Sigma)\big)\;$ over $(\mu, \Sigma) \in \mathbb{R}^k \times \Sigma_k$  (9.79)

where $\mathrm{det}(\Sigma)$ denotes the determinant of the positive definite matrix $\Sigma$. For any distribution $P$ on the Borel sets of $\mathbb{R}^k$ we define $\Delta(P)$ as in (9.77), which is the $k$-dimensional version of (9.23). It can be shown that if $\Delta(P) < 1/2$ then (9.79) has a unique solution. Moreover for data sets there is a simple algorithm which converges to the solution. On differentiating the right hand side of (9.79) it is seen that the solution is an M-estimator as in (9.75) and (9.76). Although this has not been proven explicitly it seems clear that the solution will be locally uniformly Fréchet differentiable, that is, it will satisfy (9.12) where the influence function can be obtained as in (9.54) and the metric $d_{ko}$ is replaced by the metric $d_{\mathcal{E}}$. This together with (9.74) leads to uniform asymptotic normality and allows the construction of confidence regions. The only weakness of the proposal is the low gross error breakdown point defined below which is at most $1/(k+1)$. This upper bound is shared with the M-functionals defined by (9.75) and (9.76) ([68]). The problem of constructing high breakdown functionals in $k$ dimensions will be discussed below.
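For the $t$-distribution the simple convergent algorithm is the well-known fixed-point (EM-type) reweighting iteration: an observation at squared Mahalanobis distance $d^2$ receives the weight $(\nu + k)/(\nu + d^2)$ and the location and scatter are refitted. The sketch below is our own illustration of this scheme, not code from the text; the choice $\nu = 3$ and the contaminated data are arbitrary.

```python
import numpy as np

def t_location_scatter(X, nu=3.0, n_iter=500, tol=1e-10):
    """Fixed-point (EM-type) iteration for the t_nu location/scatter
    M-functional: weights (nu + k)/(nu + d^2) down-weight points with
    large Mahalanobis distance d."""
    n, k = X.shape
    mu = X.mean(axis=0)                       # classical starting values
    Sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
        w = (nu + k) / (nu + d2)
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu_new
        Sigma_new = (w[:, None] * diff).T @ diff / n
        done = (np.abs(mu_new - mu).max() < tol
                and np.abs(Sigma_new - Sigma).max() < tol)
        mu, Sigma = mu_new, Sigma_new
        if done:
            break
    return mu, Sigma

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
X[:25] += 10.0                                # contaminate 5% of the sample
mu, Sigma = t_location_scatter(X)
```

Despite the 5% contamination the fitted location stays near the centre of the clean data, whereas the sample mean is pulled towards the outliers.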

## 9.3.3 Bias and Breakdown

The concepts of bias and breakdown developed in Sect. 9.2.4 carry over to higher dimensions. Given a metric $d$ on the space of distributions on $\mathbb{R}^k$ and a location functional $T_l$ we follow (9.37) and define

 $b(T_l, P, \varepsilon, d) = \sup\{\|T_l(Q) - T_l(P)\| : d(P, Q) < \varepsilon\}$  (9.80)

and

 $b(T_l, P, \varepsilon, GE) = \sup\{\|T_l(Q) - T_l(P)\| : Q = (1 - \varepsilon)P + \varepsilon G,\ G \in \mathcal{P}\}$  (9.81)

where by convention $b(T_l, P, \varepsilon, d) = \infty$ if $T_l$ is not defined at some $Q$ with $d(P, Q) < \varepsilon$. The extension to scale functionals is not so obvious as there is no canonical definition of bias. We require a measure of difference between two positive definite symmetric matrices $\Sigma_1$ and $\Sigma_2$. For reasons of simplicity and because it is sufficient for our purposes the one we take is $D(\Sigma_1, \Sigma_2) = \max\{|\log(\lambda)| : \lambda \text{ an eigenvalue of } \Sigma_1\Sigma_2^{-1}\}$. Corresponding to (9.36) we define

 $b(T_s, P, \varepsilon, d) = \sup\{D\big(T_s(Q), T_s(P)\big) : d(P, Q) < \varepsilon\}$  (9.82)

and

 $b(T_s, P, \varepsilon, GE) = \sup\{D\big(T_s(Q), T_s(P)\big) : Q = (1 - \varepsilon)P + \varepsilon G,\ G \in \mathcal{P}\}$  (9.83)

Most work is done using the gross error model (9.81) and (9.83). The breakdown points of $T_l$ are defined by

 $\varepsilon^{*}(T_l, P, GE) = \inf\{\varepsilon > 0 : b(T_l, P, \varepsilon, GE) = \infty\}$  (9.84)

 $\varepsilon^{*}(T_l, P, d) = \inf\{\varepsilon > 0 : b(T_l, P, \varepsilon, d) = \infty\}$  (9.85)

 $\varepsilon^{*}(T_l, \boldsymbol{x}_n, \mathrm{fsbp}) = \dfrac{1}{n}\min\big\{m : \sup_{\boldsymbol{x}^m_n}\|T_l(P_n(\boldsymbol{x}^m_n))\| = \infty\big\}$  (9.86)

where the supremum in (9.86) extends over all samples $\boldsymbol{x}^m_n$ obtained from $\boldsymbol{x}_n$ by replacing at most $m$ of the observations by arbitrary values.

where (9.86) corresponds in the obvious manner to (9.41). The breakdown points for the scale functional are defined analogously using the bias functional (9.82). We have

Theorem 4   For any translation equivariant functional $T_l$,

 $\varepsilon^{*}(T_l, P, GE) \le 1/2$ and $\varepsilon^{*}(T_l, \boldsymbol{x}_n, \mathrm{fsbp}) \le \lfloor n/2 \rfloor / n$  (9.87)

and for any affine equivariant scale functional $T_s$,

 $\varepsilon^{*}(T_s, P, GE) \le \big(1 - \Delta(P)\big)/2$ and $\varepsilon^{*}(T_s, \boldsymbol{x}_n, \mathrm{fsbp}) \le \big\lfloor (n - k + 1)/2 \big\rfloor / n$  (9.88)

In Sect. 9.2.4 it was shown that the M-estimators of Sect. 9.2.3 can attain or almost attain the upper bounds of Theorem 1. Unfortunately this is not the case in higher dimensions where, as we have already mentioned, the breakdown points of the M-functionals of Sect. 9.3.2 are at most $1/(k+1)$. In recent years much research activity has been directed towards finding high breakdown affinely equivariant location and scale functionals which attain or nearly attain the upper bounds of Theorem 4. This is discussed in the next section.

## 9.3.4 High Breakdown Location and Scale Functionals in $\mathbb{R}^k$

The first high breakdown affine equivariant location and scale functionals were proposed independently of each other by [103] and [31]. They were defined for empirical data but the construction can be carried over to measures satisfying a certain weak condition. The idea is to project the data points onto lines through the origin and then to determine which points are outliers with respect to this projection using one-dimensional functionals with a high breakdown point. More precisely we set

 $o(x, \theta, P) = \dfrac{\big|\theta^{\top}x - \mathrm{MED}(\theta^{\top}P)\big|}{\mathrm{MAD}(\theta^{\top}P)}$  (9.89)

where $\theta^{\top}P$ denotes the distribution of $\theta^{\top}X$ for $X$ distributed according to $P$, and

 $o(x, P) = \sup\{o(x, \theta, P) : \|\theta\| = 1\}$  (9.90)

This is a measure of the outlyingness of the point $x$ and it may be checked that it is affine invariant. Location and scale functionals may now be obtained by taking for example the mean and the covariance matrix of those observations with the smallest outlyingness measure. Although (9.90) requires a supremum over all directions $\theta$ this can be reduced for empirical distributions as follows. Choose all linearly independent subsets $\{x_{i_1}, \ldots, x_{i_k}\}$ of size $k$ and for each such subset determine a $\theta$ which is orthogonal to their span. If the supremum in (9.90) is replaced by a maximum over all such $\theta$ then the location and scale functionals remain affine equivariant and retain the high breakdown point. Although this requires the consideration of only a finite number of directions, namely at most $\binom{n}{k}$, this number is too large to make it a practicable possibility even for small values of $n$ and $k$. The problem of calculability has remained with high breakdown methods ever since and it is their main weakness. There are still no high breakdown affine equivariant functionals which can be calculated exactly except for very small data sets. [60] goes as far as to say that the problem of calculability is the breakdown of high breakdown methods. This is perhaps too pessimistic but the problem remains unsolved.
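In practice the supremum in (9.90) is often approximated by a maximum over a random sample of directions. The following sketch (our own illustration; the number of directions is arbitrary) computes this approximate outlyingness and then, as described above, a location estimate from the half-sample with the smallest outlyingness.

```python
import numpy as np

def stahel_donoho_outlyingness(X, n_dir=500, seed=0):
    """Approximate Stahel-Donoho outlyingness: for each sampled direction
    theta the projected point is standardized by the median and MAD of the
    projected sample; the outlyingness is the maximum over directions,
    a heuristic stand-in for the supremum over all theta."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    theta = rng.normal(size=(n_dir, k))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    proj = X @ theta.T                        # shape (n, n_dir)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    return np.max(np.abs(proj - med) / mad, axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[0] = [8.0, 8.0, 8.0]                        # one gross outlier

out = stahel_donoho_outlyingness(X)
# Location estimate: mean of the half-sample with smallest outlyingness.
keep = np.argsort(out)[: len(X) // 2]
mu = X[keep].mean(axis=0)
```

The gross outlier receives by far the largest outlyingness and is therefore excluded from the half-sample used for the location estimate.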

[89] introduced two further high breakdown location and scale functionals. The first, the so-called minimum volume ellipsoid (MVE) functional, is a multidimensional version of Tukey's shortest half-sample (9.8) and is defined as follows. We set

 $\mathcal{E}(P_n) = \operatorname{argmin}\big\{|E| : E \in \mathcal{E},\ |\{i : x_i \in E\}| > n/2\big\}$  (9.91)

where $|E|$ denotes the volume of the ellipsoid $E$ and $|\{i : x_i \in E\}|$ denotes the number of elements of the set $\{i : x_i \in E\}$. In other words $\mathcal{E}(P_n)$ has the smallest volume of any ellipsoid which contains more than half the data points. For a general distribution $P$ we define

 $\mathcal{E}(P) = \operatorname{argmin}\big\{|E| : E \in \mathcal{E},\ P(E) \ge 1/2\big\}$  (9.92)

Given $\mathcal{E}(P)$ the location functional $T_l(P)$ is defined to be the centre $\mu(P)$ of $\mathcal{E}(P)$ and the covariance functional $T_s(P)$ is taken to be $c(k)\,\Sigma(P)$ where

 $\mathcal{E}(P) = \big\{x : (x - \mu(P))^{\top}\Sigma(P)^{-1}(x - \mu(P)) \le 1\big\}$  (9.93)

The factor $c(k)$ can be chosen so that $T_s(P) = I_k$ for the standard normal distribution in $k$ dimensions.

The second functional is based on the so-called minimum covariance determinant (MCD) and is as follows. We write

 $\mu(P, B) = \displaystyle\int_B x\,dP(x)\Big/P(B)$  (9.94)

 $\Sigma(P, B) = \displaystyle\int_B \big(x - \mu(P, B)\big)\big(x - \mu(P, B)\big)^{\top}\,dP(x)\Big/P(B)$  (9.95)

and define

 $B_{\mathrm{MCD}}(P) = \operatorname{argmin}_{B}\big\{\mathrm{det}\big(\Sigma(P, B)\big) : P(B) \ge 1/2\big\}$  (9.96)

where $\mathrm{det}(\Sigma(P, B))$ is defined to be infinite if either of (9.94) or (9.95) does not exist. The location functional is taken to be $\mu\big(P, B_{\mathrm{MCD}}(P)\big)$ and the scatter functional $c(k)\,\Sigma\big(P, B_{\mathrm{MCD}}(P)\big)$ where again $c(k)$ is usually chosen so that the scatter functional equals $I_k$ for the standard normal distribution in $k$ dimensions. It can be shown that both these functionals are affinely equivariant.
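For an empirical distribution the MCD minimizes the covariance determinant over all subsets containing at least half the observations. For very small data sets this can be done exactly by enumeration, which also makes the combinatorial cost visible; the sketch below is illustrative only, with arbitrary data.

```python
import numpy as np
from itertools import combinations

def mcd_exact(X, h):
    """Exact minimum covariance determinant by enumerating all h-subsets.
    Only feasible for very small n: the search is over C(n, h) subsets."""
    best_det, best_subset = np.inf, None
    for idx in combinations(range(len(X)), h):
        S = np.cov(X[list(idx)], rowvar=False)
        d = np.linalg.det(S)
        if d < best_det:
            best_det, best_subset = d, list(idx)
    sub = X[best_subset]
    return sub.mean(axis=0), np.cov(sub, rowvar=False)

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 2))
X[:3] += 20.0                                 # three gross outliers
mu, Sigma = mcd_exact(X, h=9)                 # h > n/2; C(15, 9) = 5005 subsets
```

Any subset containing one of the distant outliers has an enormously inflated covariance determinant, so the minimizing subset consists of clean points and the resulting location estimate ignores the contamination.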

A smoothed version of the minimum volume estimator can be obtained by considering the minimization problem

 minimize $\;\mathrm{det}(\Sigma)\;$ subject to $\;\displaystyle\int \rho\big(d_{\Sigma}(x, \mu)\big)\,dP(x) \ge \tfrac{1}{2}$  (9.97)

where $\rho : [0, \infty) \rightarrow [0, 1]$ satisfies $\rho(0) = 1$ and $\lim_{s \to \infty}\rho(s) = 0$, is non-increasing and is continuous on the right (see [23]). This gives rise to the class of so-called $S$-functionals. The minimum volume estimator can be obtained by specializing to the case where $\rho$ is the indicator function of the interval $[0, 1)$.

On differentiating (9.97) it can be seen that an $S$-functional can be regarded as an M-functional but with redescending functions $u$ and $v$ in contrast to the conditions placed on $u$ and $v$ in (9.75) and (9.76) ([64]). For such functions the defining equations for an M-estimator have many solutions and the minimization problem of (9.97) can be viewed as a choice function. Other choice functions can be made giving rise to different high breakdown M-estimators. We refer to [65] and [62]. A further class of location and scatter functionals has been developed from Tukey's concept of depth ([109]). We refer to [32], [63] and Zuo and Serfling (2000a, 2000b). Many of the above functionals have breakdown points close to or equal to the upper bound of Theorem 4. For the calculation of breakdown points we refer to Davies (1987, 1993), [66], [32] and [111].

The problem of determining a functional which minimizes the bias over a neighbourhood was considered in the one-dimensional case in Sect. 9.2.4. The problem is much more difficult in $\mathbb{R}^k$ but some work in this direction has been done (see [1]). The more tractable problem of determining the size of the bias function for particular functionals or classes of functionals has also been considered ([115,69]).

All the above functionals can be shown to exist but there are problems concerning the uniqueness of the functional. Just as in the case of Tukey's shortest half (9.8) restrictions must be placed on the distribution which generally include the existence of a density with given properties (see [23] and [105]) and which is therefore at odds with the spirit of robust statistics. Moreover even uniqueness and asymptotic normality at some small class of models are not sufficient. Ideally the functional should exist and be uniquely defined and locally uniformly Fréchet differentiable just as in Sect. 9.2.5. It is not easy to construct affinely equivariant location and scatter functionals which satisfy the first two conditions but it has been accomplished by [30] using the Stahel-Donoho idea of projections described above. To go further and define functionals which are also locally uniformly Fréchet differentiable with respect to some metric just as in the one-dimensional case considered in Sect. 9.2.5 is a very difficult problem. The only result in this direction is again due to [30] who managed to construct functionals which are locally uniformly Lipschitz. The lack of locally uniform Fréchet differentiability means that all derived confidence intervals will exhibit a certain degree of instability. Moreover the problem is compounded by the inability to calculate the functionals themselves. To some extent it is possible to reduce the instability by say using the MCD functional in preference to the MVE functional, by reweighting the observations or by calculating a one-step M-functional as in (9.29) (see [24]). However the problem remains and for this reason we do not discuss the research which has been carried out on the efficiency of the location and scatter functionals mentioned above. Their main use is in data analysis where they are an invaluable tool for detecting outliers. This will be discussed in the following section.

A scatter matrix plays an important role in many statistical techniques such as principal component analysis and factor analysis. The use of robust scatter functionals in some of these areas has been studied by among others [21], [20] and [113].

As already mentioned the major weakness of all known high breakdown functionals is their computational complexity. For the MCD functional an exact algorithm exists whose complexity is polynomial in $n$ but exponential in the dimension $k$, and there are reasons for supposing that this dependence on $k$ cannot be removed ([12]). This means that in practice for all but very small data sets heuristic algorithms have to be used. We refer to [95] for a heuristic algorithm for the MCD-functional.
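The heuristic of [95] (FAST-MCD) rests on the concentration step: refitting to the $h$ observations closest in Mahalanobis distance to the current fit never increases the covariance determinant. The following is a much simplified sketch of that idea (random small starts and a fixed number of concentration steps); the actual algorithm adds nested subsampling and further refinements.

```python
import numpy as np

def c_step(X, h, mu, Sigma):
    """One concentration step: keep the h observations with the smallest
    Mahalanobis distances relative to (mu, Sigma) and refit. The covariance
    determinant of the fit never increases under this step."""
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    sub = X[np.argsort(d2)[:h]]
    return sub.mean(axis=0), np.cov(sub, rowvar=False)

def mcd_heuristic(X, h, n_starts=100, n_steps=10, seed=0):
    """Heuristic MCD search: many random (k + 1)-point starts, each
    concentrated by a few C-steps; return the smallest-determinant fit."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    best_det, best = np.inf, None
    for _ in range(n_starts):
        sub = X[rng.choice(n, size=k + 1, replace=False)]
        mu, Sigma = sub.mean(axis=0), np.cov(sub, rowvar=False)
        if np.linalg.det(Sigma) <= 1e-12:     # skip (near-)singular starts
            continue
        for _ in range(n_steps):
            mu, Sigma = c_step(X, h, mu, Sigma)
        d = np.linalg.det(Sigma)
        if d < best_det:
            best_det, best = d, (mu, Sigma)
    return best

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3))
X[:80] += 12.0                                # 20% contamination
mu, Sigma = mcd_heuristic(X, h=210)
```

Starts drawn entirely from the clean part of the data concentrate onto a clean half-sample with a small determinant; any fit forced to cover both clusters has a far larger determinant and is discarded.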

## 9.3.5 Outliers in $\mathbb{R}^k$

Whereas for univariate, bivariate and even trivariate data outliers may often be found by visual inspection, this is not practical in higher dimensions ([16,49,4,48,50]). This makes it all the more important to have methods which automatically detect high dimensional outliers. Much of the analysis of the one-dimensional problem given in Sect. 9.2.7 carries over to the $k$-dimensional problem. In particular outlier identification rules based on the mean and covariance of the data suffer from masking problems and must be replaced by high breakdown functionals (see also Rocke and Woodruff (1996, 1997)). We restrict attention to affine equivariant functionals so that an affine transformation of the data will not alter the observations which are identified as outliers. The identification rules we consider are of the form

 $\big(x_i - T_l(P_n)\big)^{\top} T_s(P_n)^{-1}\big(x_i - T_l(P_n)\big) \ge c(k, n)$  (9.98)

where $P_n$ is the empirical measure, $T_l$ and $T_s$ are affine equivariant location and scatter functionals respectively and $c(k, n)$ is a constant to be determined. This rule is the $k$-dimensional counterpart of (9.60). In order to specify some reasonable value for $c(k, n)$ and in order to be able to compare different outlier identifiers we require, just as in Sect. 9.2.7, a precise definition of an outlier and a basic model for the majority of the observations. As our basic model we take the $k$-dimensional normal distribution $N(\mu, \Sigma)$. The definition of an $\alpha$-outlier corresponds to (9.62) and is

 $\mathrm{out}(\alpha, \mu, \Sigma) = \big\{x : (x - \mu)^{\top}\Sigma^{-1}(x - \mu) > \chi^2_{k;1-\alpha}\big\}$  (9.99)

where $\chi^2_{k;1-\alpha}$ denotes the $(1 - \alpha)$-quantile of the $\chi^2$-distribution with $k$ degrees of freedom and $\alpha = \alpha_n = 1 - (1 - \widetilde{\alpha})^{1/n}$ for some given value of $\widetilde{\alpha} \in (0, 1)$. Clearly for an i.i.d. sample of size $n$ distributed according to $N(\mu, \Sigma)$ the probability that no observation lies in the outlier region of (9.99) is just $1 - \widetilde{\alpha}$. Given location and scale functionals $T_l$ and $T_s$ and a sample $\boldsymbol{x}_n$ we write

 $\mathrm{out}\big(T_l(P_n), T_s(P_n)\big) = \big\{x : \big(x - T_l(P_n)\big)^{\top} T_s(P_n)^{-1}\big(x - T_l(P_n)\big) > c(k, n)\big\}$  (9.100)

which corresponds to (9.64). The region of (9.100) is the empirical counterpart of the outlier region of (9.99) and any observation lying in it will be identified as an outlier. Just as in the one-dimensional case we determine the constants $c(k, n)$ by requiring that with probability $1 - \widetilde{\alpha}$ no observation is identified as an outlier in i.i.d. samples of size $n$. This can be done by simulations with appropriate asymptotic approximations for large $n$. The simulations will of course be based on the algorithms used to calculate the functionals and will not be based on the exact functionals, assuming these to be well defined. For the purpose of outlier identification this will not be of great consequence. We give results for three multivariate outlier identifiers based on the MVE- and MCD-functionals of [89] and the $S$-functional based on Tukey's biweight function as given in [84]. There are good heuristic algorithms for calculating these functionals at least approximately ([84,95,97]). The following is based on [8]. Table 9.2 gives the resulting values of $c(k, n)$. The results are based on simulations for each combination of $k$ and $n$.
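The identification rule (9.98) itself is simple once a robust pair $(T_l, T_s)$ and a cutoff are available. The sketch below is a deliberately crude illustration for $k = 2$: the robust pair is a rescaled half-sample mean and covariance rather than the MVE/MCD/S-functionals, and the cutoff uses the asymptotic chi-squared approximation rather than the simulated constants of Table 9.2 (for two degrees of freedom the $(1 - a)$-quantile is $-2\log a$, which avoids any external quantile function).

```python
import numpy as np

def identify_outliers_2d(X, alpha=0.05):
    """Outlier rule of the form (9.98) for k = 2, using an illustrative
    robust pair: mean/covariance of the half-sample closest to the
    coordinatewise median, rescaled so that the median squared distance
    matches the chi-squared(2) median of 2 log 2."""
    n, k = X.shape
    assert k == 2
    med = np.median(X, axis=0)
    half = np.argsort(np.linalg.norm(X - med, axis=1))[: n // 2 + 1]
    mu = X[half].mean(axis=0)
    Sigma = np.cov(X[half], rowvar=False)
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    d2 *= 2.0 * np.log(2.0) / np.median(d2)   # consistency rescaling
    # Per-observation level chosen so that with probability 1 - alpha no
    # observation of a clean i.i.d. normal sample of size n is flagged.
    alpha_n = 1.0 - (1.0 - alpha) ** (1.0 / n)
    return d2 > -2.0 * np.log(alpha_n)        # chi^2_2 (1 - alpha_n)-quantile

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
X[:5] += 8.0                                  # five gross outliers
flags = identify_outliers_2d(X)
```

Because the location and scatter are computed from a robust half-sample, the five gross outliers cannot mask themselves: their squared distances far exceed the cutoff while the clean observations almost all fall below it.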

[8] show by simulations that although none of the above rules fails to detect arbitrarily large outliers it can still happen that very extreme observations are not identified as outliers. To quantify this they consider a particular identifier and sample constellation in which some of the observations are replaced by other values. The mean norm of the most extreme non-identifiable outlier is 4.17. The situation clearly becomes worse with an increasing proportion of replaced observations and with the dimension $k$ (see [7]). If we use the mean of the norm of the most extreme non-identifiable outlier as a criterion then none of the three rules dominates the others although the biweight identifier performs reasonably well in all cases and is our preferred choice.
