2.3 Supplementary Remarks


2.3.1 Choice of the Trimming Constant

As already mentioned, the trimming constant $h$ has to satisfy $\frac{n}{2} < h \leq n$ and determines the breakdown point of LTS. The choice of this constant depends mainly on the purpose for which we want to use LTS. There is, of course, a trade-off involved: lower values of $h$ lead to a higher breakdown point, while higher values improve efficiency (provided the data are not too contaminated) since more of the information stored in the data is utilized. The maximum breakdown point is attained for $h = [n/2] + [(p+1)/2]$. This choice is often employed when LTS is used for diagnostic purposes (see Subsection 2.3.2). The most robust choice of $h$ may also be favored when LTS is used for comparison with some less robust estimator, e.g., least squares, since comparing the two estimators can serve as a simple check of the data and the model: if the estimates are not similar to each other, special care should be taken throughout the subsequent analysis. On the other hand, it may be sensible to evaluate LTS for a wide range of values of the trimming constant and to observe how the estimate behaves as $h$ increases, because this can provide hints on the amount of contamination and possibly on suspicious structures in a given data set (for example, that the data set actually contains a mixture of two different populations).
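To make this concrete, the following minimal sketch (written in Python rather than as an XploRe quantlet; the function lts_fit, the toy data, and all tuning constants are illustrative choices, not part of the original analysis) approximates the LTS fit for a given $h$ by random elemental starts followed by concentration steps, and then traces the estimate over a grid of trimming constants from the maximum-breakdown choice $h = [n/2] + [(p+1)/2]$ upward.

\begin{verbatim}
# Minimal LTS sketch: random elemental starts followed by concentration steps.
import numpy as np

def lts_fit(X, y, h, n_starts=500, n_csteps=20, seed=None):
    """Approximately minimize the sum of the h smallest squared residuals."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        # elemental start: exact fit through p randomly chosen observations
        idx = rng.choice(n, size=p, replace=False)
        beta = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        for _ in range(n_csteps):
            # concentration step: refit on the h points with smallest residuals
            keep = np.argsort((y - X @ beta) ** 2)[:h]
            beta_new = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
            if np.allclose(beta_new, beta):
                break
            beta = beta_new
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()   # LTS objective
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta

# Toy contaminated sample: one regression line plus a group of outliers.
rng = np.random.default_rng(0)
n = 40
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(scale=0.5, size=n)
y[:8] += 10                                    # contaminated observations
X = np.column_stack([np.ones(n), x])           # p = 2 (intercept and slope)

h_max_bdp = n // 2 + (X.shape[1] + 1) // 2     # h = [n/2] + [(p+1)/2]
for h in range(h_max_bdp, n + 1, 4):
    print(h, np.round(lts_fit(X, y, h, seed=1), 3))
\end{verbatim}

If the estimates stay close to each other over the whole range of $h$, the data are probably not heavily contaminated; a sudden jump of the estimate once $h$ grows past some value hints at the amount of contamination discussed above.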


2.3.2 LTS as a Diagnostic Tool

Figure 2.4: The LS residual plot for the stackloss data (XAGls04.xpl)
\includegraphics[scale=0.6]{ls04}

We have several times advocated the use of the least trimmed squares estimator for diagnostic purposes. Therefore, a brief guidance regarding diagnostics is provided in this subsection via an example. Let us look at the stackloss data, which have already been analyzed many times, for example by Draper and Smith (1966), Daniel and Wood (1971), Carroll and Ruppert (1985), and Rousseeuw and Leroy (1987). The data consist of 21 four-dimensional observations characterizing the production of nitric acid by the oxidation of ammonia. The stack loss ($y$) is assumed to depend on the rate of operation ($x_1$), on the cooling water inlet temperature ($x_2$), and on the acid concentration ($x_3$). Most of the studies dealing with this data set found, among other things, that data points 1, 3, 4, 21, and possibly also 2 are outliers. First, the least squares regression result

$\displaystyle \hat{y} = -39.92 + 0.716 x_1 + 1.295 x_2 - 0.152 x_3, $

(XAGls03.xpl) is reported for comparison with LTS; the corresponding residual plot is shown in Figure 2.4 (once again, the thin blue lines represent $\pm\sigma$ and the thick blue lines correspond to $\pm 3\sigma$). There are no significantly large residuals with respect to the standard deviation, so without any further diagnostic statistics one would be tempted to believe that there are no outlying observations. In contrast, if we inspect the least trimmed squares regression, which produces

$\displaystyle \hat{y} = -35.21 + 0.746 x_1 + 0.338 x_2 - 0.005 x_3, $

(XAGlts03.xpl), our conclusion will be different. To construct a residual plot for a robust estimator, it is necessary to use a robust estimator of scale as well, because the presence of outliers is presumed. In the case of LTS, such a robust scale estimator can be based, for example, on the sum of the $h$ smallest squared residuals or on the median absolute deviation $\mathop{\rm MAD}\nolimits_i x_i = \mathop{\rm med}\nolimits_i \vert x_i - \mathop{\rm med}\nolimits_i x_i \vert$, as is the case in Figure 2.5. Inspecting the residual plot in Figure 2.5 (the blue lines again represent the $\pm\sigma$ and $\pm 3\sigma$ levels, where $\sigma = 1.483 \mathop{\rm MAD}\nolimits_i r_i(\beta)$), observations 1, 2, 3, 4, and 21 become suspicious, as their residuals are very large in the sense that they lie outside the interval $(-3\sigma,3\sigma)$. Thus, the LTS estimate provides us at the same time with a powerful diagnostic tool. One naturally has to decide which ratios $\vert r_i(\beta)/\sigma\vert$ are already suspicious, but the value 2.5 is often used as a cut-off.

Figure 2.5: The LTS residual plot for the stackloss data (XAGlts04.xpl)
\includegraphics[scale=0.6]{lts04}
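The standardization just described can be summarized in a short sketch (again in Python rather than the XAGlts04.xpl quantlet; the residual vector below is an artificial stand-in, not the actual stackloss residuals): the residuals are scaled by $\sigma = 1.483 \mathop{\rm MAD}\nolimits_i r_i(\beta)$ and an observation is flagged when $\vert r_i(\beta)/\sigma\vert$ exceeds a chosen cut-off such as 2.5 or 3.

\begin{verbatim}
# Flag observations whose residuals are large relative to a robust scale.
import numpy as np

def robust_flags(residuals, cutoff=2.5):
    r = np.asarray(residuals, dtype=float)
    mad = np.median(np.abs(r - np.median(r)))   # median absolute deviation
    sigma = 1.483 * mad                         # robust scale estimate
    return np.flatnonzero(np.abs(r) > cutoff * sigma), sigma

# Artificial residual vector: 21 values, five of them far from the bulk.
rng = np.random.default_rng(2)
res = rng.normal(scale=1.0, size=21)
res[[0, 1, 2, 3, 20]] += 8.0                    # mimic five suspicious points
suspicious, sigma = robust_flags(res, cutoff=2.5)
print("robust sigma:", round(sigma, 3))
print("suspicious observations:", suspicious + 1)   # 1-based indices
\end{verbatim}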


2.3.3 High Subsample Sensitivity

The final note on LTS concerns a broader issue that we should be aware of whenever such a robust estimator is employed. The already mentioned high subsample sensitivity is caused by the fact that high breakdown point estimators search for a ``core'' subset of the data that best follows a certain model (with all its assumptions) without taking the remaining observations into account. A change of some observations may then lead to a large swing in the composition of this core subset. This might happen, for instance, if the data are actually a mixture of two (or several) populations, i.e., one part of the data can be explained by one regression line, another part of the same data by a quite different regression function, and, in addition, some observations may suit both models relatively well (this can happen with a real data set too, see Benácek, Jarolím, and Víšek, 1998). In such a situation, a small change of some observations or of some parameters of the estimator can move the estimate from one regression function to the other. Moreover, applying several (robust) estimators is likely to produce several rather different estimates in such a situation; see Víšek (1999b) for a detailed discussion. Still, it is necessary to keep in mind that this is not a shortcoming of the discussed estimators, but of the approach taken in this case: procedures designed to suit some theoretical model are applied to an unknown sample, and the procedures in question simply try to explain it by means of the prescribed model.