

# 9.1 Robust Statistics; Examples and Introduction

## 9.1.1 Two Examples

The first example involves the real data given in Table 9.1, which contains the results of an interlaboratory test. The boxplots are shown in Fig. 9.1, where the dotted line denotes the mean of the observations and the solid line the median.

We note that only the results of Laboratories 1 and 3 lie below the mean, whereas all the remaining laboratories return larger values. In the case of the median, several of the readings coincide with the median and the remainder are more evenly split between smaller and larger values. A glance at Fig. 9.1 suggests that in the absence of further information Laboratories 1 and 3 should be treated as outliers. This is the course which we recommend, although the issues involved require careful thought. For the moment we note simply that the median is a robust statistic whereas the mean is not.
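The contrast between the two location statistics is easy to demonstrate numerically. The following sketch uses made-up readings (not the data of Table 9.1): two aberrant values displace the mean noticeably while leaving the median almost unchanged.

```python
# Illustrative sketch with hypothetical readings (not the data of Table 9.1):
# two aberrant laboratories drag the mean away from the bulk of the data,
# while the median barely moves.
import statistics

clean    = [2.2, 2.3, 2.3, 2.4, 2.4, 2.5, 2.5]   # hypothetical readings
with_out = [1.2, 2.3, 2.3, 2.4, 2.4, 2.5, 1.4]   # two aberrant values added

for xs in (clean, with_out):
    print(round(statistics.mean(xs), 3), statistics.median(xs))
```

The mean shifts by roughly 0.3 under the two aberrant readings; the median shifts by only 0.1.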

The second example concerns quantifying the scatter of real valued observations $x_1, \dots, x_n$. This example is partially taken from [58] and reports a dispute between Eddington [34, p.147] and Fisher [38, p.762] about the relative merits of the standard deviation

$$s_n = \Big(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2\Big)^{1/2}$$

and the mean absolute deviation

$$d_n = \frac{1}{n}\sum_{i=1}^{n}|x_i - \bar{x}_n|.$$

Fisher argued that for normal observations the standard deviation $s_n$ is about 12% more efficient than the mean absolute deviation $d_n$. In contrast Eddington claimed that his experience with real data indicates that $d_n$ is better than $s_n$. In [108] and [57] we find a resolution of this apparent contradiction. Consider the model

$$P_\varepsilon = (1-\varepsilon)\,N(\mu, \sigma^2) + \varepsilon\,N(\mu, 9\sigma^2) \qquad (9.1)$$

where $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$, and $0 \le \varepsilon \le 1$. For data distributed according to (9.1) one can calculate the asymptotic relative efficiency ARE of $d_n$ with respect to $s_n$,

$$\mathrm{ARE}(\varepsilon) = \lim_{n\to\infty}\frac{\mathrm{var}(s_n)/\mathrm{E}(s_n)^2}{\mathrm{var}(d_n)/\mathrm{E}(d_n)^2}.$$

As Huber states, the result is disquieting. Already for $\varepsilon \ge 0.002$ the ARE exceeds 1, and the effect is apparent for samples of size 1000. For $\varepsilon = 0.05$ we have $\mathrm{ARE}(0.05) = 2.035$, and simulations show that for samples of size 20 the relative efficiency exceeds 1.5 and increases to 2.0 for samples of size 100. This is a severe deficiency of $s_n$, as models such as (9.1) with $\varepsilon$ between 0.01 and 0.1 often give better descriptions of real data than the normal distribution itself. We quote [58]:
> thus it becomes painfully clear that the naturally occurring deviations from the idealized model are large enough to render meaningless the traditional asymptotic optimality theory.
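The finite-sample effect described above can be reproduced in a small Monte Carlo experiment. The sketch below is illustrative and not Huber's calculation; in particular, standardizing each scale estimate by its squared coefficient of variation is our choice for making $s_n$ and $d_n$ comparable, since they estimate different multiples of $\sigma$.

```python
# Monte Carlo sketch of the efficiency comparison under the contaminated
# normal model (9.1) with sigma = 1: with probability eps an observation is
# drawn with standard deviation 3 instead of 1.  The relative efficiency of
# d_n with respect to s_n is approximated by the ratio of their squared
# coefficients of variation (our standardization, to make the two scale
# estimates comparable).
import math
import random

def sample(n, eps, rng):
    # n observations from model (9.1) with mu = 0, sigma = 1
    return [rng.gauss(0.0, 3.0 if rng.random() < eps else 1.0)
            for _ in range(n)]

def scale_estimates(xs):
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)   # standard deviation s_n
    d = sum(abs(x - m) for x in xs) / n                # mean absolute deviation d_n
    return s, d

def cv2(v):
    # squared coefficient of variation: variance standardized by squared mean
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v) / m ** 2

def rel_efficiency(eps, n=20, reps=10000, seed=0):
    rng = random.Random(seed)
    ss, ds = [], []
    for _ in range(reps):
        s, d = scale_estimates(sample(n, eps, rng))
        ss.append(s)
        ds.append(d)
    return cv2(ss) / cv2(ds)   # > 1 means d_n is the more efficient estimate

for eps in (0.0, 0.05):
    print(eps, round(rel_efficiency(eps), 2))
```

For $\varepsilon = 0$ the ratio comes out below 1 (the Fisher side of the dispute); for $\varepsilon = 0.05$ it comes out well above 1 (the Eddington side), in line with the sample-size-20 simulations cited above.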

## 9.1.2 General Philosophy

The two examples of the previous section illustrate a general phenomenon. An optimal statistical procedure based on one particular family of models can differ considerably from an optimal procedure based on another family, even though the two families are very close. This may be expressed by saying that optimal procedures are often unstable, in that small changes in the data or the model can lead to large changes in the analysis. The basic philosophy of robust statistics is to produce statistical procedures which are stable with respect to small changes in the data or model, and for which even large changes should not cause a complete breakdown of the procedure.

Any inspection of the data and the removal of aberrant observations may be regarded as part of robust statistics, but it was only with [78] that the consideration of deviations from models commenced. He showed that the exact theory based on the normal distribution for variances is highly nonrobust. There were other isolated papers on the problem of robustness ([77,6]; Geary (1936, 1937); [44,14,15]). [108] initiated a widespread interest in robust statistics which has continued to this day. The first systematic investigation of robustness is due to [56] and was expounded in [58]. Huber's approach is functional analytic and he was the first to investigate the behaviour of a statistical functional over a full topological neighbourhood of a model instead of restricting the investigation to other parametric families as in (9.1).

Huber considers three problems. The first is that of minimizing the bias over certain neighbourhoods and results in the median as the most robust location functional. For large samples deviations from the model have consequences which are dominated by the bias, and so this is an important result. The second problem is concerned with what Tukey calls the statistical version of no free lunches. If we take the simple model of i.i.d. $N(\mu, 1)$ observations then the confidence interval for $\mu$ based on the mean is on average shorter than that based on any other statistic. If short confidence intervals are of interest then one can choose not only the statistic which gives the shortest interval but also the model itself. The new model must of course still be consistent with the data, but even with this restriction the confidence interval can be made as small as desired ([26]). Such a short confidence interval represents a free lunch, and if we do not believe in free lunches then we must look for that model which maximizes the length of the confidence interval over a given family of models.
If we take all distributions with variance 1 then the confidence interval for the normal distribution $N(\mu, 1)$ is the longest. Huber considers the same problem over a full Kolmogoroff neighbourhood $\{P : d_{ko}(P, N(\mu, 1)) \le \varepsilon\}$ of the normal model, where $d_{ko}$ denotes the Kolmogoroff metric. Under certain simplifying assumptions Huber solves this problem and the solution is known as the Huber distribution (see [58]). Huber's third problem is the robustification of the Neyman-Pearson test theory. Given two distributions $P_0$ and $P_1$, [76] derive the optimal test for testing $P_0$ against $P_1$. Huber considers full neighbourhoods $\mathcal{P}_0$ of $P_0$ and $\mathcal{P}_1$ of $P_1$ and then derives the form of the minimax test for the composite hypothesis of $\mathcal{P}_0$ against $\mathcal{P}_1$. The weakness of Huber's approach is that it does not generalize easily to other situations. Nevertheless it is the spirit of this approach which we adopt here. It involves treating estimators as functionals on the space of distributions, investigating where possible their behaviour over full neighbourhoods, and always being aware of the danger of a free lunch.

[51] introduced another approach to robustness, that based on the influence function $I(x, T, F)$ defined for a statistical functional $T$ as follows:

$$I(x, T, F) = \lim_{\varepsilon \to 0}\frac{T\big((1-\varepsilon)F + \varepsilon\delta_x\big) - T(F)}{\varepsilon} \qquad (9.2)$$

where $\delta_x$ denotes the point mass at the point $x$. The influence function has two interpretations. On the one hand it measures the infinitesimal influence of an observation situated at the point $x$ on the value of the functional $T$ at $F$. On the other hand, if $P_n(F)$ denotes the empirical measure of a sample of $n$ i.i.d. random variables with common distribution $F$, then under appropriate regularity conditions

$$\sqrt{n}\,\big(T(P_n(F)) - T(F)\big) \xrightarrow{D} N\Big(0, \int I(x, T, F)^2\, dF(x)\Big) \qquad (9.3)$$

where $\xrightarrow{D}$ denotes convergence in distribution. Given a parametric family of distributions $\{P_\theta : \theta \in \Theta\}$ we restrict attention to those functionals which are Fisher consistent, that is,

$$T(P_\theta) = \theta, \quad \theta \in \Theta. \qquad (9.4)$$

Hampel's idea was to minimize the asymptotic variance of $T$ as an estimate of a parameter $\theta$ subject to a bound on the influence function,

$$\sup_x |I(x, T, P_\theta)| \le k(\theta), \qquad (9.5)$$

that is,

minimize $\int I(x, T, P_\theta)^2\, dP_\theta(x)$ under (9.4) and (9.5)

where $k(\theta)$ is a given function of $\theta$. Hampel complemented the infinitesimal part of his approach by also considering the global behaviour of the functional $T$. He introduced the concept of breakdown point, which has had and continues to have a major influence on research in robust statistics. The approach based on the influence function was carried out in [54]. The strength of the Hampel approach is that it can be used to robustify, in some sense, the estimation of parameters in any parametric model. The weaknesses are that (9.5) only bounds infinitesimally small deviations from the model and that the approach does not explicitly take into account the free lunch problem. Hampel is aware of this and recommends simple models, but simplicity is an addition to, and not an integral part of, his approach. The influence function is usually used as a heuristic tool and care must be taken in interpreting the results. For examples of situations where the heuristics go wrong we refer to [25].
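The influence function (9.2) has a simple finite-sample analogue, the sensitivity curve $SC_n(x) = n\,\big(T(x_1, \dots, x_{n-1}, x) - T(x_1, \dots, x_{n-1})\big)$, which can be computed directly. The sketch below (with simulated data, our own illustration) contrasts the unbounded influence of an outlier on the mean with the bounded influence on the median.

```python
# Sensitivity-curve sketch of the influence function (9.2): the effect of one
# additional observation placed at x.  For the mean it grows without bound in
# x; for the median it stays bounded.
import random
import statistics

def sensitivity(T, xs, x):
    # finite-sample sensitivity curve of the functional T at the sample xs
    n = len(xs) + 1
    return n * (T(xs + [x]) - T(xs))

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(99)]   # simulated N(0,1) sample

for x in (0.0, 10.0, 1000.0):
    print(x,
          round(sensitivity(statistics.mean, xs, x), 2),    # grows with x
          round(sensitivity(statistics.median, xs, x), 2))  # stays bounded
```
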

Another approach, which lies so to speak between those of Huber and Hampel, is the so-called shrinking neighbourhood approach. It has been worked out in full generality by [83]. Instead of considering neighbourhoods of a fixed size (Huber) or only infinitesimal neighbourhoods (Hampel), this approach considers full neighbourhoods of a model whose size decreases at the rate of $n^{-1/2}$ as the sample size $n$ tends to infinity. The size of the neighbourhoods is governed by the fact that for larger neighbourhoods the bias term is dominant, whereas models in smaller neighbourhoods cannot be distinguished. The shrinking neighbourhoods approach has the advantage that it does not need any assumptions of symmetry. The disadvantage is that the size of the neighbourhoods goes to zero, so that the resulting theory is only a theory of robustness over vanishingly small neighbourhoods.

## 9.1.3 Functional Approach

Although a statistic based on a data sample may be regarded as a function of the data, a more general approach is often useful. Given a data set $(x_1, \dots, x_n)$ we define the corresponding empirical distribution $P_n$ by

$$P_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i} \qquad (9.6)$$

where $\delta_x$ denotes the unit mass in $x$. Although $P_n$ clearly depends on the sample we will usually suppress the dependency for the sake of clarity. With this notation we can now regard the arithmetic mean $\bar{x}_n$ either as a function of the data or as a function of the empirical measure $P_n$,

$$\bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i = \int x\, dP_n(x) =: T_{ave}(P_n).$$

The function $T_{ave}$ can be extended to all measures $P$ which have a finite mean,

$$T_{ave}(P) = \int x\, dP(x), \qquad (9.7)$$

and $T_{ave}$ is now a functional defined on a certain subset of the family $\mathcal{P}$ of probability measures on $\mathbb{R}$. This manner of treating statistics is one whose origins go back to [112]. In the context of robust statistics it was introduced by [56] and has proved very useful (see [37]). Another example is given by the functional $T_{sh}$ defined as the length of the shortest interval which carries a mass of at least 1/2,

$$T_{sh}(P) = \min\big\{|I| : P(I) \ge 1/2,\ I \subset \mathbb{R}\big\} \qquad (9.8)$$

where denotes the length of the interval . The idea of using the shortest half interval goes back to Tukey (see [2]) who proposed using the mean of the observations contained in it as a robust location functional.
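A sketch of how the shortest-half functional can be evaluated at an empirical measure: for a sorted sample it suffices to compare the $n - h + 1$ intervals spanned by $h$ consecutive order statistics, where $h = \lceil n/2 \rceil$. The data below are made up for illustration.

```python
# Shortest half at an empirical measure: the shortest interval containing at
# least half of the sample points, plus Tukey's "shorth" location estimate,
# the mean of the observations inside it.  Illustrative data.
def shortest_half(xs):
    xs = sorted(xs)
    n = len(xs)
    h = (n + 1) // 2   # smallest number of points carrying mass >= 1/2
    # every candidate interval [xs[i], xs[i+h-1]] contains exactly h points
    i = min(range(n - h + 1), key=lambda i: xs[i + h - 1] - xs[i])
    return xs[i], xs[i + h - 1]

xs = [0.1, 0.2, 0.25, 0.3, 4.0, 9.0, 10.0]   # made-up data with a tight cluster
lo, hi = shortest_half(xs)
print(lo, hi)
# Tukey's shorth: the mean of the observations inside the shortest half
inside = [x for x in xs if lo <= x <= hi]
print(sum(inside) / len(inside))
```

Note that the shorth locates the tight cluster near 0.2 and is unaffected by the three large values.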

The space $\mathcal{P}$ may be metricized in many ways, but we prefer the Kolmogoroff metric $d_{ko}$ defined by

$$d_{ko}(P, Q) = \sup_{x \in \mathbb{R}}\big|P((-\infty, x]) - Q((-\infty, x])\big|. \qquad (9.9)$$

The Glivenko-Cantelli theorem states

$$\lim_{n \to \infty} d_{ko}(P_n(P), P) = 0 \quad \text{a.s.} \qquad (9.10)$$

where $P_n(P)$ denotes the empirical measure of the random variables $X_1, \dots, X_n$ of the i.i.d. sequence $(X_i)_{i=1}^{\infty}$ with common distribution $P$. In conjunction with (9.10) the metric $d_{ko}$ makes it possible to connect analytic properties of a functional $T$ and its statistical properties. As a first step we note that a functional $T$ which is locally bounded in the Kolmogoroff metric,

$$\sup\big\{|T(Q) - T(P)| : d_{ko}(P, Q) < \varepsilon\big\} < \infty \qquad (9.11)$$

for some $\varepsilon > 0$, offers protection against outliers. On moving from local boundedness to continuity, we see that if a functional $T$ is continuous at $P$ then the sequence $T(P_n(P))$ is a consistent statistic in that

$$\lim_{n \to \infty} T(P_n(P)) = T(P) \quad \text{a.s.}$$
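Both the Kolmogoroff metric (9.9) evaluated at an empirical measure and the convergence (9.10) can be illustrated directly: for a continuous model distribution the supremum is attained at an order statistic, and the distance shrinks as the sample grows. The sketch below (our own illustration) compares simulated standard normal samples with the $N(0, 1)$ model.

```python
# Kolmogoroff distance (9.9) between an empirical measure and a continuous
# model distribution, illustrating the Glivenko-Cantelli convergence (9.10):
# the distance decreases as the sample size grows.
import math
import random

def normal_cdf(x):
    # standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def d_ko(xs, cdf):
    # the supremum over x is attained just before or at an order statistic
    xs = sorted(xs)
    n = len(xs)
    return max(max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
               for i, x in enumerate(xs))

rng = random.Random(0)
for n in (100, 10000):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, round(d_ko(xs, normal_cdf), 4))
```
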

Finally we consider a functional $T$ which is differentiable at $P$, that is,

$$T(Q) - T(P) = \int \psi(x, P)\, d(Q - P)(x) + o\big(d_{ko}(P, Q)\big) \qquad (9.12)$$

for some bounded function $\psi(\cdot, P)$ where, without loss of generality, $\int \psi(x, P)\, dP(x) = 0$ (see [18]). On putting

$$Q = (1-\varepsilon)P + \varepsilon\delta_x$$

it is seen that $\psi(x, P)$ is the influence function $I(x, T, P)$ of (9.2). As

$$\sqrt{n}\int \psi(x, P)\, d\big(P_n(P) - P\big)(x) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(X_i, P) \xrightarrow{D} N\Big(0, \int \psi(x, P)^2\, dP(x)\Big) \qquad (9.13)$$

the central limit theorem (9.3) follows immediately. Textbooks which make use of this functional analytic approach are, as already mentioned, [58], [54] and [83], and also [104], a book which can be strongly recommended to students as a well written and at the same time deep introductory text.
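As an illustrative check of (9.3), consider the median at $P = N(0, 1)$: its influence function is $\mathrm{sign}(x)/(2\varphi(0))$ with $\varphi$ the standard normal density, so the predicted asymptotic variance is $1/(4\varphi(0)^2) = \pi/2 \approx 1.571$. A small simulation (our own sketch, not from the text) agrees.

```python
# Simulation check of the central limit theorem (9.3) for the median at
# P = N(0,1): the variance of sqrt(n) * median should be close to the
# asymptotic value pi/2 predicted by the influence function.
import math
import random
import statistics

def median_variance(n=200, reps=4000, seed=2):
    # Monte Carlo estimate of the variance of sqrt(n) * median
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        vals.append(math.sqrt(n) * statistics.median(xs))
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / reps

print(round(median_variance(), 2), round(math.pi / 2, 2))   # should be close
```
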
