As was already mentioned, the trimming constant $h$ has to satisfy
$n/2 \le h \le n$ and indeed determines the breakdown point of LTS.
The choice of this constant depends mainly on the purpose for which we want
to use LTS. There is, of course, a trade-off involved:
lower values of $h$, which are close to the optimal breakdown point choice,
lead to a higher breakdown point, while higher values improve efficiency
(if the data are not too contaminated) since more of the information stored
in the data is utilized. The maximum breakdown point is attained for
$h = \lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor$.
This choice is often employed when the LTS is used for
diagnostic purposes (see Subsection 2.3.2).
The most robust choice of $h$ may also be favored
when LTS is used for comparison with some less robust estimator, e.g., the
least squares, since a comparison of these two estimators can serve as a
simple check of the data and the model--if the estimates are not similar to
each other, special care should be taken throughout the subsequent analysis.
On the other hand, it may be sensible to evaluate LTS for a wide range of
values of the trimming constant and to observe how the estimate
behaves with increasing $h$, because this can provide hints on the amount
of contamination and possibly on suspicious structures in a given data set
(for example, that the data set actually contains a mixture of two
different populations).
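To make this scan over the trimming constant concrete, here is a minimal Python sketch for the simplest setting, the one-dimensional location model, where the exact LTS fit can be found by examining contiguous blocks of the sorted sample. The data and the function name `lts_location` are purely illustrative, not part of the original analysis.

```python
def lts_location(y, h):
    """Exact LTS location estimate for trimming constant h.

    For the location model, the optimal h-subset is a contiguous block
    of the sorted sample, so an exhaustive window scan is exact.
    """
    ys = sorted(y)
    best = None
    for i in range(len(ys) - h + 1):
        block = ys[i:i + h]
        m = sum(block) / h                      # LS fit on the subsample
        obj = sum((v - m) ** 2 for v in block)  # sum of h smallest squares
        if best is None or obj < best[0]:
            best = (obj, m)
    return best[1]

# Illustrative sample: 16 "good" points near 0 plus 5 gross outliers near 10.
data = [-0.4, -0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 0.2,
        0.3, 0.3, 0.4, 0.5, 0.6, 0.7] + [9.8, 9.9, 10.0, 10.1, 10.2]

# Track the estimate as h grows: it stays near 0 while h <= 16 and is
# dragged towards the outliers once contaminated points must be included.
for h in (11, 14, 16, 19, 21):
    print(h, round(lts_location(data, h), 3))
```

The abrupt jump of the estimate between $h = 16$ and $h = 19$ is exactly the kind of behavior that hints at the amount of contamination in the sample.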
We have several times advocated the use of the least trimmed squares
estimator for diagnostic purposes. Therefore, a brief guidance regarding
diagnostics is provided in this subsection via an example. Let us look at
the stackloss
data, which were already analyzed many times, for example
by Draper and Smith (1966), Daniel and Wood (1971),
Carroll and Ruppert (1985), and Rousseeuw and Leroy (1987). The data consist of 21
four-dimensional observations characterizing the production of nitric acid by
the oxidation of ammonia. The stackloss ($y$) is assumed to depend on the
rate of operation ($x_1$), on the cooling water inlet temperature ($x_2$),
and on the acid concentration ($x_3$). Most of the studies dealing with
this data set found, among other things, that data points 1, 3, 4, 21, and
maybe also 2 were outliers. First, consider the least squares regression results.
The final note on LTS concerns a broader issue that we should be aware of whenever such a robust estimator is employed. The already mentioned high subsample sensitivity is caused by the fact that high breakdown point estimators search for a ``core'' subset of data that best follows a certain model (with all its assumptions) without taking into account the rest of the observations. A change of some observations may then lead to a large swing in the composition of this core subset. This might happen, for instance, if the data are actually a mixture of two (or several) populations, i.e., part of the data can be explained by one regression line, another part of the same data by a quite different regression function, and in addition to that, some observations may suit both models relatively well (this can happen with a real data set too, see Benácek, Jarolím, and Víšek, 1998). In such a situation, a small change of some observations or some parameters of the estimator can bring the estimate from one regression function to another. Moreover, applying several (robust) estimators is likely to produce several rather different estimates in such a situation--see Víšek (1999b) for a detailed discussion. Still, it is necessary to keep in mind that this is not a shortcoming of the discussed estimators, but of the approach taken in this case--procedures designed to suit some theoretical models are applied to an unknown sample, and the procedures in question just try to explain it by means of a prescribed model.
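The flip of the core subset between two populations can be demonstrated in a few lines. The sketch below, again for the simple one-dimensional location model and with purely artificial data, shows that moving a single observation is enough to switch the exact LTS fit from one population to the other.

```python
def lts_location(y, h):
    """Exact LTS location fit: for the location model the optimal
    h-subset is a contiguous block of the sorted sample, so it
    suffices to scan all such windows."""
    ys = sorted(y)
    best = None
    for i in range(len(ys) - h + 1):
        block = ys[i:i + h]
        m = sum(block) / h
        obj = sum((v - m) ** 2 for v in block)
        if best is None or obj < best[0]:
            best = (obj, m)
    return best[1]

# Two artificial populations: a looser one near 0, a tighter one near 10.
pop_a = [0.1 * i for i in range(10)]          # 0.0, 0.1, ..., 0.9
pop_b = [10 + 0.05 * i for i in range(10)]    # 10.00, 10.05, ..., 10.45

# With h = 10 the "core" is the tighter population near 10 ...
print(lts_location(pop_a + pop_b, 10))

# ... but moving a single observation of that population flips the core,
# and hence the whole estimate, to the other population near 0.
pop_b2 = pop_b[:-1] + [12.0]
print(lts_location(pop_a + pop_b2, 10))
```

The estimate jumps from about 10.2 to 0.45 after a change in one observation, which is the large swing in the composition of the core subset described above.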