9.1 Robust Statistics; Examples and Introduction


9.1.1 Two Examples

The first example involves the real data given in Table 9.1 which are the results of an interlaboratory test. The boxplots are shown in Fig. 9.1 where the dotted line denotes the mean of the observations and the solid line the median.


Table 9.1: The results of an interlaboratory test involving 14 laboratories
Laboratory:  1      2      3      4      5      6      7      8      9      10     11     12     13     14
Readings:    1.4    5.7    2.64   5.5    5.2    5.5    6.1    5.54   6.0    5.1    5.5    5.9    5.5    5.3
             1.5    5.8    2.88   5.4    5.7    5.8    6.3    5.47   5.9    5.1    5.5    5.6    5.4    5.3
             1.4    5.8    2.42   5.1    5.9    5.3    6.2    5.48   6.1    5.1    5.5    5.7    5.5    5.4
             0.9    5.7    2.62   5.3    5.6    5.3    6.1    5.51   5.9    5.3    5.3    5.6    5.6
(Laboratory 14 returned only three readings.)

Figure 9.1: A boxplot of the data of Table 9.1. The dotted line and the solid line denote respectively the mean and the median of the observations
\includegraphics[width=83mm,clip]{text/3-9/DaviesGatherFig1.eps}

We note that only the readings of Laboratories 1 and 3 lie below the mean whereas all the remaining laboratories return larger values. In the case of the median, $7$ of the readings coincide with the median, $24$ readings are smaller and $24$ are larger. A glance at Fig. 9.1 suggests that in the absence of further information Laboratories 1 and 3 should be treated as outliers. This is the course we recommend, although the issues involved require careful thought. For the moment we note simply that the median is a robust statistic whereas the mean is not.
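The effect is easy to check numerically. The following minimal sketch (Python with NumPy, chosen here purely for illustration; the values are transcribed from Table 9.1) computes the mean and the median of all $55$ readings and shows how strongly the two outlying laboratories displace the mean while leaving the median essentially untouched.

    import numpy as np

    # Readings of Table 9.1, one list per laboratory (Laboratory 14 reported only three values).
    labs = [
        [1.4, 1.5, 1.4, 0.9], [5.7, 5.8, 5.8, 5.7], [2.64, 2.88, 2.42, 2.62],
        [5.5, 5.4, 5.1, 5.3], [5.2, 5.7, 5.9, 5.6], [5.5, 5.8, 5.3, 5.3],
        [6.1, 6.3, 6.2, 6.1], [5.54, 5.47, 5.48, 5.51], [6.0, 5.9, 6.1, 5.9],
        [5.1, 5.1, 5.1, 5.3], [5.5, 5.5, 5.5, 5.3], [5.9, 5.6, 5.7, 5.6],
        [5.5, 5.4, 5.5, 5.6], [5.3, 5.3, 5.4],
    ]
    x = np.concatenate([np.asarray(lab) for lab in labs])

    print("mean   :", x.mean())      # pulled towards the small values of Laboratories 1 and 3
    print("median :", np.median(x))  # equal to 5.5, unaffected by the two outlying laboratories

    # Dropping Laboratories 1 and 3 changes the mean substantially but leaves the median at 5.5.
    x_clean = np.concatenate([np.asarray(lab) for i, lab in enumerate(labs) if i not in (0, 2)])
    print("mean without Laboratories 1 and 3   :", x_clean.mean())
    print("median without Laboratories 1 and 3 :", np.median(x_clean))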

The second example concerns quantifying the scatter of real valued observations $ x_1, \ldots, x_n$. This example is partially taken from [58] and reports a dispute between [34, p.147] and [38, p.762] about the relative merits of

$\displaystyle s_n \, = \, \left( \frac{1}{n} \sum (x_i - \bar{x})^2 \right)^{1/2} \quad\text{and}\quad d_n \, = \, \frac{1}{n} \sum \vert x_i - \bar{x}\vert\,.$

Fisher argued that for normal observations the standard deviation $ s_n$ is about $ 12\,{\%}$ more efficient than the mean absolute deviation $ d_n$. In contrast Eddington claimed that his experience with real data indicates that $ d_n$ is better than $ s_n$. In [108] and [57] we find a resolution of this apparent contradiction. Consider the model

$\displaystyle \mathcal{N}_{\epsilon} = \big( 1-\epsilon \big) N \left(\mu , \sigma^2\right) + \epsilon N\left(\mu, 9\sigma^2\right)\,,$ (9.1)

where $ N(\mu, \sigma^2)$ denotes a normal distribution with mean $ \mu$ and variance $ \sigma^2$ and $ 0 \leq \epsilon \leq 1$. For data distributed according to (9.1) one can calculate the asymptotic relative efficiency ARE of $ d_n$ with respect to $ s_n$,

$\displaystyle {\mathrm{ARE}}(\epsilon) \, = \, \lim_{n \rightarrow \infty} \frac{{\mathrm{Var}} (s_n) / E( s_n)^2 }{{\mathrm{Var}} (d_n) / E (d_n)^2}\,.$

As Huber states, the result is disquieting. Already for $ \epsilon \geq 0.002$ the ARE exceeds $1$ and the effect is apparent for samples of size $1000$. For $ \epsilon = 0.05$ we have $ {\mathrm{ARE}}(\epsilon) = 2.035$ and simulations show that for samples of size $20$ the relative efficiency exceeds $1.5$ and increases to $2.0$ for samples of size $100$. This is a severe deficiency of $ s_n$ as models such as $ {\cal N}_{\epsilon}$ with $ \epsilon$ between $0.01$ and $0.1$ often give better descriptions of real data than the normal distribution itself. We quote [58]:

thus it becomes painfully clear that the naturally occurring deviations from the idealized model are large enough to render meaningless the traditional asymptotic optimality theory.
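The finite sample figures quoted above are easily reproduced by simulation. The sketch below (Python with NumPy; the number of replications, the seed and the function names are illustrative choices, not part of the original sources) draws samples from the model (9.1) with $\mu=0$, $\sigma=1$ and estimates the relative efficiency $({\mathrm{Var}}(s_n)/E(s_n)^2)/({\mathrm{Var}}(d_n)/E(d_n)^2)$ for finite $n$.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(n, eps, reps):
        """reps samples of size n from (1 - eps)*N(0,1) + eps*N(0, 9)."""
        scale = np.where(rng.random((reps, n)) < eps, 3.0, 1.0)
        return scale * rng.standard_normal((reps, n))

    def relative_efficiency(n, eps, reps=20000):
        x = sample(n, eps, reps)
        xbar = x.mean(axis=1, keepdims=True)
        s = np.sqrt(((x - xbar) ** 2).mean(axis=1))  # s_n for each replication
        d = np.abs(x - xbar).mean(axis=1)            # d_n for each replication
        # standardized variances remove the different scalings of s_n and d_n
        return (s.var() / s.mean() ** 2) / (d.var() / d.mean() ** 2)

    for n in (20, 100):
        for eps in (0.0, 0.05):
            print(f"n = {n:4d}, eps = {eps:.2f}: rel. eff. = {relative_efficiency(n, eps):.2f}")

    # For eps = 0 the ratio stays below one (Fisher's 12% advantage of s_n); for eps = 0.05
    # it already exceeds 1.5 at n = 20 and rises to about 2 at n = 100 (asymptotically 2.035).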


9.1.2 General Philosophy

The two examples of the previous section illustrate a general phenomenon. An optimal statistical procedure based on a particular family of models $ \mathcal{M}_1$ can differ considerably from an optimal procedure based on another family $ \mathcal{M}_2$ even though the families $ \mathcal{M}_1$ and $ \mathcal{M}_2$ are very close. This may be expressed by saying that optimal procedures are often unstable in that small changes in the data or the model can lead to large changes in the analysis. The basic philosophy of robust statistics is to produce statistical procedures which are stable with respect to small changes in the data or model, and for which even large changes do not cause a complete breakdown of the procedure.

Any inspection of the data and the removal of aberrant observations may be regarded as part of robust statistics but it was only with [78] that the systematic consideration of deviations from models commenced. He showed that the exact theory based on the normal distribution for variances is highly nonrobust. There were other isolated papers on the problem of robustness ([77,6]; Geary (1936, 1937); [44,14,15]). [108] initiated a widespread interest in robust statistics which has continued to this day. The first systematic investigation of robustness is due to [56] and was expounded in [58]. Huber's approach is functional analytic and he was the first to investigate the behaviour of a statistical functional over a full topological neighbourhood of a model instead of restricting the investigation to other parametric families as in (9.1).

Huber considers three problems. The first is that of minimizing the bias over certain neighbourhoods and results in the median as the most robust location functional. For large samples deviations from the model have consequences which are dominated by the bias and so this is an important result.

The second problem is concerned with what Tukey calls the statistical version of no free lunches. If we take the simple model of i.i.d. $ N(\mu,1)$ observations then the confidence interval for $ \mu$ based on the mean is on average shorter than that based on any other statistic. If short confidence intervals are of interest then one can choose not only the statistic which gives the shortest interval but also the model itself. The new model must of course still be consistent with the data but even with this restriction the confidence interval can be made as short as desired ([26]). Such a short confidence interval represents a free lunch and if we do not believe in free lunches then we must look for that model which maximizes the length of the confidence interval over a given family of models. If we take all distributions with variance $1$ then the confidence interval for the $ N(\mu,1)$ distribution is the longest. Huber considers the same problem over the family $ \mathcal{F}=\{F: d_{ko}(F,N(0,1)) < \epsilon\}$ where $ d_{ko}$ denotes the Kolmogoroff metric. Under certain simplifying assumptions Huber solves this problem and the solution is known as the Huber distribution (see [58]).

Huber's third problem is the robustification of the Neyman-Pearson test theory. Given two distributions $ P_0$ and $ P_1$, [76] derived the optimal test for testing $ P_0$ against $ P_1$. Huber considers full neighbourhoods $ \mathcal{P}_0$ of $ P_0$ and $ \mathcal{P}_1$ of $ P_1$ and then derives the form of the minimax test for the composite hypothesis of $ \mathcal{P}_0$ against $ \mathcal{P}_1$.

The weakness of Huber's approach is that it does not generalize easily to other situations. Nevertheless it is the spirit of this approach which we adopt here. It involves treating estimators as functionals on the space of distributions, investigating where possible their behaviour over full neighbourhoods and always being aware of the danger of a free lunch.

[51] introduced another approach to robustness, that based on the influence function $ I(x,T,F)$ defined for a statistical functional $ T$ as follows

$\displaystyle I(x,T,F)= \lim_{\epsilon \rightarrow 0}\frac{T((1-\epsilon)F+\epsilon\delta_x)-T(F)}{\epsilon}\,,$ (9.2)

where $ \delta_x$ denotes the point mass at the point $ x$. The influence function has two interpretations. On the one hand it measures the infinitesimal influence of an observation situated at the point $ x$ on the value of the functional $ T$. On the other hand if $ P_n(F)$ denotes the empirical measure of a sample of $ n$ i.i.d. random variables with common distribution $ F$ then under appropriate regularity conditions

$\displaystyle \lim_{n \rightarrow \infty}\sqrt{n}(T(P_n(F))-T(F)) \stackrel{D}{=} N \left(0,\int I(x,T,F)^2\,{d}F(x) \right)\,,$ (9.3)

where $ \stackrel{D}{=}$ denotes equality of distribution. Given a parametric family $ \mathcal{P}^{\prime}=\{P_{\theta}: \theta \in \Theta\}$ of distributions we restrict attention to those functionals which are Fisher consistent, that is,

$\displaystyle T(P_{\theta}) = \theta, \quad \theta \in \Theta\,.$ (9.4)

Hampel's idea was to minimize the asymptotic variance of $ T$ as an estimate of a parameter $ \theta$ subject to a bound on the influence function

$\displaystyle \min_T \int I(x,T,P_{\theta})^2\,{d}P_{\theta}(x)$   under (9.4) and $\displaystyle \sup_x \vert I(x,T,P_{\theta})\vert \le k(\theta)\,,$ (9.5)

where $ k(\theta)$ is a given function of $ \theta$. Hampel complemented the infinitesimal part of his approach by considering also the global behaviour of the functional $ T$. He introduced the concept of breakdown point which has had and continues to have a major influence on research in robust statistics. The approach based on the influence function was carried out in [54]. The strength of the Hampel approach is that it can be used to robustify in some sense the estimation of parameters in any parametric model. The weaknesses are that (9.5) only bounds infinitesimally small deviations from the model and that the approach does not explicitly take into account the free lunch problem. Hampel is aware of this and recommends simple models but simplicity is an addition to and not an integral part of his approach. The influence function is usually used as a heuristic tool and care must be taken in interpreting the results. For examples of situations where the heuristics go wrong we refer to [25].
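The definition (9.2) can also be evaluated numerically by contaminating a model distribution with a small point mass and differencing. The sketch below (Python with NumPy and SciPy; the contamination size $\epsilon=10^{-3}$, the grid of $x$ values and the helper name mixture_median are illustrative) contrasts the unbounded influence function of the mean at $P=N(0,1)$ with the bounded influence function of the median.

    import numpy as np
    from scipy.stats import norm

    def mixture_median(eps, x):
        """Median of (1 - eps)*N(0,1) + eps*delta_x, read off the mixture distribution function."""
        t_upper = norm.ppf(0.5 / (1 - eps))          # median if the point mass lies above it
        t_lower = norm.ppf((0.5 - eps) / (1 - eps))  # median if the point mass lies below it
        if x >= t_upper:
            return t_upper
        if x <= t_lower:
            return t_lower
        return x                                     # otherwise the point mass itself carries the median

    eps = 1e-3
    for x in (-3.0, -1.0, 0.5, 3.0):
        infl_mean = ((1 - eps) * 0.0 + eps * x) / eps    # T_av of the mixture minus T_av(N(0,1)) = 0
        infl_median = mixture_median(eps, x) / eps       # median of the mixture minus median of N(0,1) = 0
        print(f"x = {x:5.1f}:  I(x, mean) = {infl_mean:6.2f},  I(x, median) = {infl_median:6.3f}")

    # The influence of the median stays close to +-sqrt(pi/2) = +-1.2533 however large x is,
    # whereas the influence of the mean equals x and is therefore unbounded.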

Another approach, which lies so to speak between those of Huber and Hampel, is the so-called shrinking neighbourhood approach. It has been worked out in full generality by [83]. Instead of considering neighbourhoods of a fixed size (Huber) or only infinitesimal neighbourhoods (Hampel) this approach considers full neighbourhoods of a model whose size decreases at the rate of $ n^{-1/2}$ as the sample size $ n$ tends to infinity. The size of the neighbourhoods is governed by the fact that for larger neighbourhoods the bias term is dominant whereas models in smaller neighbourhoods cannot be distinguished. The shrinking neighbourhoods approach has the advantage that it does not need any assumptions of symmetry. The disadvantage is that the size of the neighbourhoods tends to zero, so that the resulting theory describes robustness only over vanishingly small neighbourhoods.


9.1.3 Functional Approach

Although a statistic based on a data sample may be regarded as a function of the data, a more general approach is often useful. Given a data set $ (x_1,\ldots,x_n)$ we define the corresponding empirical distribution $ P_n$ by

$\displaystyle P_n =\frac{1}{n}\sum_{i=1}^n \delta_{x_i}\,,$ (9.6)

where $ \delta_x$ denotes the unit mass in $ x$. Although $ P_n$ clearly depends on the sample $ (x_1,\ldots,x_n)$ we will usually suppress the dependency for the sake of clarity. With this notation we can now regard the arithmetic mean $ {\bar x}_n=\sum_{i=1}^n x_i/n$ either as a function of the data or as a function $ T_{av}$ of the empirical measure $ P_n$,

$\displaystyle \notag {\bar x}_n =\int x\,{d}P_n(x)=T_{av}(P_n)\,.$    

The function $ T_{av}$ can be extended to all measures $ P$ which have a finite mean

$\displaystyle T_{av}(P) = \int x\,{d}P(x)\,,$ (9.7)

and is now a functional defined on a certain subset of the family $ \mathcal{P}$ of probability measures on $ \mathbb{R}$. This manner of treating statistics is one whose origins go back to [112]. In the context of robust statistics it was introduced by [56] and has proved very useful (see [37]). Another example is given by the functional $ T_{sh}$ defined as the length of the shortest interval which carries a mass of at least $ 1/2$,

$\displaystyle T_{sh}(P) = \min \{\vert I \vert: P(I) \ge 1/2,\, I \subset \mathbb{R}\}\,,$ (9.8)

where $ \vert I \vert$ denotes the length of the interval $ I$. The idea of using the shortest half interval goes back to Tukey (see [2]) who proposed using the mean of the observations contained in it as a robust location functional.
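For an empirical measure $P_n$ the functional $T_{sh}$ can be computed by sorting the sample and scanning all intervals whose endpoints are order statistics $\lceil n/2 \rceil$ apart. The sketch below (Python with NumPy; the function name shortest_half and the contaminated example sample are illustrative) returns both $T_{sh}(P_n)$ and Tukey's location functional, the mean of the observations in the shortest half.

    import numpy as np

    def shortest_half(x):
        """Length of the shortest interval containing at least half of the sample,
        together with the mean of the observations inside it (Tukey's 'shorth')."""
        x = np.sort(np.asarray(x, dtype=float))
        n = x.size
        k = (n + 1) // 2                        # smallest number of points carrying mass >= 1/2
        lengths = x[k - 1:] - x[:n - k + 1]     # lengths of all intervals spanning k order statistics
        i = int(np.argmin(lengths))
        return lengths[i], x[i:i + k].mean()

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.standard_normal(95), 50.0 + 10.0 * rng.standard_normal(5)])
    length, location = shortest_half(x)
    print("T_sh(P_n) =", round(length, 3), "  shorth location =", round(location, 3))
    # The five wild observations near 50 affect neither the length nor the location noticeably.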

The space $ \mathcal{P}$ may be metrized in many ways but we prefer the Kolmogoroff metric $ d_{ko}$ defined by

$\displaystyle d_{ko}(P,Q) = \sup_{ x \in \mathbb{R}} \vert P((-\infty,\,x]) - Q((-\infty,\,x])\vert\,.$ (9.9)

The Glivenko-Cantelli theorem states

$\displaystyle \lim_{n \rightarrow \infty} d_{ko}(P_n(P),P) = 0, \quad a.s.\,,$ (9.10)

where $ P_n(P)$ denotes the empirical measure of the $ n$ random variables $ X_1(P),\ldots, X _n(P)$ of the i.i.d. sequence $ (X_i(P))_1^{\infty}$. In conjunction with (9.10) the metric $ d_{ko}$ makes it possible to connect analytic properties of a functional $ T$ and its statistical properties. As a first step we note that a functional $ T$ which is locally bounded in the Kolmogoroff metric

$\displaystyle \sup \{ \vert T(Q)-T(P) \vert: d_{ko}(P,Q) < \epsilon\} < \infty\,,$ (9.11)

for some $ \epsilon > 0$ offers protection against outliers. On moving from local boundedness to continuity we see that if a functional $ T$ is continuous at $ P$ then the sequence $ T(P_n(P))$ is a consistent statistic in that

$\displaystyle \notag \lim_{n \rightarrow \infty} T(P_n(P)) = T(P), \quad a.s.$    
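The Kolmogoroff distance between $P_n(P)$ and $P$ is straightforward to compute because the supremum in (9.9) is attained at a jump point of the empirical distribution function. The sketch below (Python with NumPy and SciPy; the sample sizes and the seed are illustrative) evaluates $d_{ko}(P_n(P),P)$ for $P=N(0,1)$, illustrating (9.10) and anticipating the $\mathrm{O}_P(1/\sqrt{n})$ rate stated in (9.13) below.

    import numpy as np
    from scipy.stats import norm

    def d_ko(x, cdf):
        """Kolmogoroff distance between the empirical measure of x and the distribution
        with distribution function cdf; the supremum is attained at a jump of P_n."""
        x = np.sort(np.asarray(x, dtype=float))
        n = x.size
        F = cdf(x)
        upper = np.arange(1, n + 1) / n - F     # P_n jumps up to i/n at the i-th order statistic
        lower = F - np.arange(0, n) / n         # just before the jump P_n equals (i-1)/n
        return max(upper.max(), lower.max())

    rng = np.random.default_rng(2)
    for n in (100, 1000, 10000):
        dist = d_ko(rng.standard_normal(n), norm.cdf)
        print(f"n = {n:6d}:  d_ko = {dist:.4f},  sqrt(n)*d_ko = {np.sqrt(n) * dist:.2f}")
    # d_ko(P_n(P), P) tends to zero as in (9.10) while sqrt(n)*d_ko remains of the same
    # order of magnitude, which is the rate of (9.13).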

Finally we consider a functional $ T$ which is differentiable at $ P$, that is

$\displaystyle T(Q)-T(P) = \int I(x,P,T)\,{d}(Q-P)(x) + \mathrm{o}_P({d}_{ko}(P,Q))$ (9.12)

for some bounded function $ I(\cdot, P,T):\mathbb{R} \rightarrow \mathbb{R}$ where, without loss of generality, $ \int I(x,P,T)\,{d}P(x)=0$ (see [18]). On putting

$\displaystyle Q=Q_{\epsilon}=(1-\epsilon)P+\epsilon \delta_x$

it is seen that $ I(x,P,T)$ is the influence function of (9.2). As

$\displaystyle d_{ko}(P_n(P),P) = \mathrm{O}_P(1/\sqrt{n})$ (9.13)

the central limit theorem (9.3) follows immediately. Textbooks which make use of this functional analytic approach are, as already mentioned, [58], [54], [83], and also [104], a book which can be strongly recommended to students as a well-written and at the same time deep introductory text.
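The connection between (9.2), (9.12) and the central limit theorem (9.3) can be checked in a simple case. For the median at $P=N(0,1)$ the influence function is $I(x,P,T)=\mathrm{sign}(x)/(2\varphi(0))$, where $\varphi$ denotes the standard normal density, so that $\int I(x,P,T)^2\,{d}P(x)=\pi/2$. The simulation below (Python with NumPy; the sample size and the number of replications are illustrative) compares the variance of $\sqrt{n}\,(T(P_n(P))-T(P))$ with this value.

    import numpy as np

    rng = np.random.default_rng(3)
    n, reps = 500, 20000

    # medians of `reps` samples of size n from N(0,1); T(P) = 0 for the median functional
    medians = np.median(rng.standard_normal((reps, n)), axis=1)
    empirical_variance = n * medians.var()

    print("empirical variance of sqrt(n)*median :", round(empirical_variance, 3))
    print("asymptotic variance pi/2             :", round(np.pi / 2, 3))
    # The two values agree to within simulation error, in accordance with (9.3).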

