The classical least squares (LS) estimator is widely used in regression analysis, both because of the ease of its computation and because of tradition. Unfortunately, it is quite sensitive to data contamination, which is all the more serious because outliers and other deviations from the standard linear regression model (for which the least squares method is best suited) appear quite frequently in real data. The danger that outlying observations, whether in the dependent or in the explanatory variables, pose to least squares regression is that they can have a strong adverse effect on the estimate while remaining unnoticed, especially when higher dimensional data are analyzed. Therefore, statistical techniques that are able to cope with or to detect outlying observations have been developed. One of them is the least trimmed squares estimator.
The methods designed to treat contaminated data can be based on one of two principles. They can either detect highly influential observations first and then apply a classical estimation procedure to the ``cleaned'' data, or they can be designed so that the resulting regression estimates are not easily influenced by contamination. Before we discuss these approaches, especially the latter one, let us exemplify the sensitivity of the least squares estimator to outlying observations.
The data set phonecal serves this purpose well. This data set, which comes from the Belgian Statistical Survey and was analyzed by Rousseeuw and Leroy (1987), describes the number of international phone calls from Belgium in the years 1950-1973. The result of the least squares regression is depicted in Figure 2.1. Apparently, there is heavy contamination caused by a different measurement system used in the years 1964-1969 and in parts of 1963 and 1970: instead of the number of phone calls, the total number of minutes of these calls was reported. Moreover, one can immediately see the effect of this contamination: the estimated regression line follows neither the mild upward trend in the rest of the data, nor any
other pattern that can be recognized in the data. One could argue that the
contamination was quite high and evident after a brief inspection of the
data. However, such an effect can be caused by even a single observation, and in addition, the outlying observations do not have to be easily recognizable if the analyzed data are multi-dimensional. To give an example, an
artificial data set consisting of 10 observations and one outlier is used.
We can see the effect of a single outlier in Figure 2.2: while the blue line represents the underlying model, the red thick line shows the least squares estimate. Moreover, the same figure shows that the residual plot does not have to have any outlier-detection power (the blue lines indicate the reference bands within which the residuals of non-outlying observations would be expected to lie).
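To make this effect easy to reproduce, the following minimal sketch (Python with NumPy; the ten-point data set is generated here purely for illustration and is not the one plotted in Figure 2.2, and the helper `ls_fit` is defined only for this example) fits ordinary least squares to clean data and then to the same data with a single gross error, first in the dependent and then in the explanatory variable.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)                       # 10 artificial observations
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)

def ls_fit(x, y):
    """Ordinary least squares fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("clean data   :", ls_fit(x, y))          # close to (1, 2)

y_out = y.copy()
y_out[-1] = 100.0                              # one gross error in y
print("outlier in y :", ls_fit(x, y_out))

x_out = x.copy()
x_out[-1] = 50.0                               # one leverage point in x
print("outlier in x :", ls_fit(x_out, y))
```

Even this small perturbation is enough to pull the estimated slope far away from the true value of 2, although only one of the ten observations has been changed.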
Since most statisticians have been aware of the described threats posed by highly influential observations for a long time, they have been trying to develop procedures that help to identify these influential observations and provide ``outlier-resistant'' estimates. There are actually two ways in which this goal can be achieved. The first one relies on some kind of regression diagnostics to identify highly influential data points. Having identified suspicious data points, one can remove them and subsequently apply classical regression methods. These methods are not the focus of this chapter. Another strategy, which will be discussed here, is to utilize estimation techniques based on so-called robust statistics. These robust estimation methods are designed so that they are not easily endangered by contamination of the data. Furthermore, a subsequent analysis of the regression residuals coming from such a robust regression fit can hint at outlying observations. Consequently, such robust regression methods can serve as diagnostic tools as well.
Within the theory of robustness, several concepts exist. They range from the original minimax approach introduced by Huber (1964) and the approach based on the influence function (Hampel et al.; 1986) to high breakdown point procedures (Hampel; 1971), that is, procedures that are able to handle highly contaminated data.
The last concept will be of interest here, as the least trimmed squares estimator belongs to this class and was developed as a high breakdown point method. To formalize the notion of the capability of an estimator to resist some amount of contamination in the data, the breakdown point was introduced. For simplicity of exposition, we present here one of its finite-sample versions, suggested by Donoho and Huber (1983):
Take an arbitrary sample of $n$ data points, $Z = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, and let $T$ be a regression estimator, i.e., applying $T$ to the sample $Z$ produces an estimate $T(Z) = \hat{\beta}$ of the regression coefficients. Further, let $Z_m$ denote any sample obtained from $Z$ by replacing $m$ of its data points by arbitrary values. Then the breakdown point of the estimator $T$ at $Z$ is defined by

$$
\varepsilon^{*}_{n}(T, Z) = \frac{1}{n} \min\left\{ m : \sup_{Z_m} \left\| T(Z_m) - T(Z) \right\| = \infty \right\}. \qquad (2.1)
$$
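To connect this definition with the least squares estimator, here is a small numerical sketch (Python with NumPy on artificial data; the helper `ls_fit` exists only for this illustration). It replaces a single observation, i.e. $m = 1$ in (2.1), by ever more remote values and tracks how far the estimate moves; since the displacement can be made arbitrarily large, the supremum is already infinite for $m = 1$ and the breakdown point of least squares is merely $1/n$.

```python
import numpy as np

def ls_fit(x, y):
    """Ordinary least squares fit of y = b0 + b1*x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
x = np.arange(1.0, 21.0)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
beta = ls_fit(x, y)                      # T(Z) on the uncontaminated sample

# Replace m = 1 observation by increasingly extreme values: the shift of the
# estimate grows without bound, so sup ||T(Z_1) - T(Z)|| = infinity.
for value in (1e2, 1e4, 1e6, 1e8):
    y_m = y.copy()
    y_m[0] = value
    shift = np.linalg.norm(ls_fit(x, y_m) - beta)
    print(f"outlier = {value:.0e}   shift of the estimate = {shift:.2e}")
```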
Quite a few estimators were intended to have a high breakdown point, that is, one close to the highest achievable value of approximately one half, although some of them were not entirely successful in this respect because of their sensitivity to specific kinds of data contamination. Among the truly high breakdown point estimators that attain this upper bound are the least median of squares (LMS) estimator (Rousseeuw; 1984), which minimizes the median of squared residuals, and the least trimmed squares (LTS) estimator (Rousseeuw; 1985), which takes as its objective function the sum of a fixed number of the smallest squared residuals and was indeed proposed as a remedy for the low asymptotic efficiency of LMS.
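The LTS objective can be illustrated with a minimal sketch (Python with NumPy; the helper name `lts_fit`, the default trimming constant `h` and the brute-force search are choices made here for clarity, not the FAST-LTS algorithm used in practice). It relies on the fact that the LTS fit coincides with the ordinary least squares fit computed on the h-subset of observations whose own residual sum of squares is smallest, so for very small samples the estimator can be obtained by plain enumeration of all h-subsets.

```python
import numpy as np
from itertools import combinations

def lts_fit(x, y, h=None):
    """Least trimmed squares for y = b0 + b1*x by enumerating all h-subsets.
    Exact but feasible only for small n; returns (coefficients, objective)."""
    n = x.size
    if h is None:
        h = (n + 3) // 2                     # keep roughly half of the data
    best_beta, best_obj = None, np.inf
    for subset in combinations(range(n), h):
        idx = list(subset)
        X = np.column_stack([np.ones(h), x[idx]])
        beta = np.linalg.lstsq(X, y[idx], rcond=None)[0]
        rss = np.sum((y[idx] - X @ beta) ** 2)   # squared residuals of the h kept points
        if rss < best_obj:
            best_beta, best_obj = beta, rss
    return best_beta, best_obj
```

The number of subsets grows combinatorially with the sample size, which is why practical implementations replace the enumeration by random subsampling combined with concentration steps.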
Before proceeding to the definition and a more detailed discussion of the least trimmed squares estimator, let us show the behavior of this estimator when applied to the phonecal data used in the previous section. In Figure 2.3 we can see two estimated regression lines: the red thick line corresponds to the LTS estimate and, for comparison, the blue thin line depicts the least squares result. While the least squares estimate is spoilt by the outliers coming from the years 1963-1970, the least trimmed squares regression line is not affected by them and outlines the trend one would consider the right one.
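A numerical analogue of this comparison can be sketched as follows (Python; the series is artificial and only imitates the contamination pattern of the phonecal data, and `lts_fit` is the illustrative helper from the previous sketch rather than a library function).

```python
import numpy as np

# A short artificial series imitating the phonecal pattern: a mild linear
# trend with a block of grossly inflated values in the middle (wrong unit).
x = np.arange(12.0)
y = 0.5 + 0.2 * x
y[6:9] += 10.0                         # block of contaminated observations

beta_ls = np.polyfit(x, y, 1)          # ordinary least squares: [slope, intercept]
beta_lts, _ = lts_fit(x, y)            # LTS sketch from above: [intercept, slope]
print("LS  slope :", beta_ls[0])       # pulled away from the clean trend
print("LTS slope :", beta_lts[1])      # stays at the clean trend of 0.2
```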