2.1 Robust Regression


2.1.1 Introduction

The classical least squares (LS) estimator is widely used in regression analysis, both because it is easy to compute and because of its long tradition. Unfortunately, it is quite sensitive to data contamination, and outliers and other deviations from the standard linear regression model (for which the least squares method is best suited) appear quite frequently in real data. The danger posed by outlying observations, both in the direction of the dependent and the explanatory variables, is that they can have a strong adverse effect on the least squares estimate while remaining unnoticed, especially when higher dimensional data are analyzed. Therefore, statistical techniques that are able to cope with or to detect outlying observations have been developed. One of them is the least trimmed squares estimator.

Figure 2.1: Least squares regression with outliers, phonecal data (XAGls01.xpl)
\includegraphics[scale=0.6]{ls01}

The methods designed to treat contaminated data can be based on one of two principles. They can either detect highly influential observations first and then apply a classical estimation procedure to the ``cleaned'' data, or they can be designed so that the resulting regression estimates are not easily influenced by contamination. Before discussing these approaches, especially the latter one, let us exemplify the sensitivity of the least squares estimator to outlying observations.

The data set phonecal serves this purpose well. The data set, which comes from the Belgian Statistical Survey and was analyzed by Rousseeuw and Leroy (1987), describes the number of international phone calls from Belgium in the years 1950-1973. The result of the least squares regression is depicted in Figure 2.1. Apparently, there is heavy contamination caused by a different measurement system used in the years 1964-1969 and in parts of 1963 and 1970: instead of the number of phone calls, the total number of minutes of these calls was reported. Moreover, one can immediately see the effect of this contamination: the estimated regression line follows neither the mild upward trend in the rest of the data, nor any other pattern that can be recognized in the data. One could argue that the contamination was quite high and evident after a brief inspection of the data. However, such an effect might be caused even by a single observation, and in addition, outlying observations need not be easily recognizable if the analyzed data are multi-dimensional. To give an example, an artificial data set consisting of 10 observations and one outlier is used. We can see the effect of a single outlier in Figure 2.2: while the blue line represents the underlying model, the red thick line shows the least squares estimate. Moreover, the same figure shows that the residual plot need not have any outlier-detection power (the blue thin lines represent the interval $ (-\sigma,\sigma)$ and the blue thick lines correspond to $ \pm 3\sigma$).

Figure 2.2: Least squares regression with one outlier and the corresponding residual plot (XAGls02.xpl)
\includegraphics[scale=0.55]{ls02}
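The following minimal numerical sketch (assuming only numpy; it is not the XAGls02.xpl quantlet behind Figure 2.2) reproduces the effect just described: ten points following a linear model plus a single gross outlier, with the least squares fit computed once on all observations and once on the clean part only.

\begin{verbatim}
import numpy as np

# Ten observations from a linear model plus one gross outlier,
# then ordinary least squares with and without that outlier.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 10)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=10)   # "clean" observations
y[-1] = 20.0                                          # one gross outlier

X = np.column_stack([np.ones_like(x), x])             # design matrix with intercept
beta_all, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_clean, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)

print("LS estimate with the outlier:   ", beta_all)    # pulled away from the model
print("LS estimate without the outlier:", beta_clean)  # close to (2.0, 0.5)
\end{verbatim}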

Statisticians have been aware of the threats posed by highly influential observations for a long time, and they have therefore tried to develop procedures that help to identify such observations and provide ``outlier-resistant'' estimates. There are actually two ways in which this goal can be achieved. The first one relies on some kind of regression diagnostics to identify highly influential data points. Having identified suspicious data points, one can remove them and subsequently apply classical regression methods. These methods are not the focus of this chapter. Another strategy, which will be discussed here, is to utilize estimation techniques based on so-called robust statistics. These robust estimation methods are designed so that they are not easily endangered by contamination of the data. Furthermore, a subsequent analysis of the regression residuals coming from such a robust fit can then hint at outlying observations. Consequently, robust regression methods can serve as diagnostic tools as well.


2.1.2 High Breakdown Point Estimators

Within the theory of robustness, several concepts exist. They range from the original minimax approach introduced in Huber (1964) and the approach based on the influence function (Hampel et al.; 1986) to high breakdown point procedures (Hampel; 1971), that is, procedures that are able to handle highly contaminated data. The last concept will be of interest here, since the least trimmed squares estimator was developed as, and belongs to, the class of high breakdown point methods. To formalize the notion of the capability of an estimator to resist some amount of contamination in the data, the breakdown point was introduced. For simplicity of exposition, we present here one of its finite-sample versions, suggested by Donoho and Huber (1983): Take an arbitrary sample of $ n$ data points, $ S_n = ({x}_{1},\ldots,{x}_{n})$, and let $ T_n$ be a regression estimator, i.e., applying $ T_n$ to the sample $ S_n$ produces an estimate of the regression coefficients $ T_n(S_n)$. Then the breakdown point of the estimator $ T_n$ at $ S_n$ is defined by

$\displaystyle \varepsilon _n^{\star}(T_n,S_n) = \frac{1}{n} \max \left\{ m \left\vert\; \max_{i_1,\ldots,i_m} \, \sup_{{y}_{1},\ldots,{y}_{m}} \Vert T_n({z}_{1},\ldots,{z}_{n}) \Vert < +\infty \right. \right\},$ (2.1)

where the sample $ ({z}_{1},\ldots,{z}_{n})$ is created from the original sample $ S_n$ by replacing the observations $ x_{i_1},\ldots,x_{i_m}$ by the values $ {y}_{1},\ldots,{y}_{m}$.
The breakdown point usually does not depend on $ S_n$. To give an example, it immediately follows from the definition that the finite-sample breakdown point of the arithmetic mean equals 0 in the one-dimensional location model, while for the median it is $ 1/2$. Actually, a breakdown point of $ 1/2$ is the highest that can be achieved at all: if the amount of contamination is higher, it is not possible to decide which part of the data is the correct one. Such a result is proved, for example, in Theorem 4, Chapter 3 of Rousseeuw and Leroy (1987) for the case of regression equivariant estimators (the upper bound on $ \varepsilon _n^{\star}$ in this case is actually $ ([(n-p)/2] + 1)/n$, where $ [x]$ denotes the integer part of $ x$).
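A small numerical check (a sketch only, assuming numpy) illustrates the mean/median comparison: replacing a single observation by an arbitrarily large value already drives the sample mean to infinity, while the median stays bounded until about half of the sample is replaced.

\begin{verbatim}
import numpy as np

def corrupt(sample, m, value=1e12):
    """Replace the first m observations by an arbitrarily large value."""
    z = np.asarray(sample, dtype=float).copy()
    z[:m] = value
    return z

x = np.arange(1.0, 11.0)              # a clean sample of n = 10 observations
for m in range(0, 7):
    z = corrupt(x, m)
    print(f"m = {m}: mean = {np.mean(z):10.3g}   median = {np.median(z):10.3g}")

# The mean explodes already for m = 1, matching a breakdown point of 0 in the
# sense of (2.1); the median stays bounded until roughly half the sample has
# been replaced, matching a breakdown point of 1/2.
\end{verbatim}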

Figure 2.3: Least trimmed squares regression with outliers, phonecal data (XAGlts01.xpl)
\includegraphics[scale=0.6]{lts01}

Quite a few estimators were intended to have a high breakdown point, that is, one close to the upper bound, although some of them were not entirely successful in achieving it because of their sensitivity to specific kinds of data contamination. Two truly high breakdown point estimators that reach the above mentioned upper bound are the least median of squares (LMS) estimator (Rousseeuw; 1984), which minimizes the median of the squared residuals, and the least trimmed squares (LTS) estimator (Rousseeuw; 1985), which takes as its objective function the sum of the $ h$ smallest squared residuals and was proposed as a remedy for the low asymptotic efficiency of LMS.
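The two objective functions just described can be written down directly; the following short sketch (the helper names lms_objective and lts_objective are ours, assuming numpy) shows them for a candidate coefficient vector.

\begin{verbatim}
import numpy as np

def lms_objective(beta, X, y):
    """Least median of squares: the median of the squared residuals."""
    r2 = (y - X @ beta) ** 2
    return np.median(r2)

def lts_objective(beta, X, y, h):
    """Least trimmed squares: the sum of the h smallest squared residuals."""
    r2 = np.sort((y - X @ beta) ** 2)
    return r2[:h].sum()
\end{verbatim}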

Before proceeding to the definition and a more detailed discussion of the least trimmed squares estimator, let us show the behavior of this estimator when applied to the phonecal data used in the previous section. In Figure 2.3 we can see two estimated regression lines: the red thick line corresponds to the LTS estimate and, for comparison, the blue thin line depicts the least squares result. While the least squares estimate is spoilt by the outliers coming from the years 1963-1970, the least trimmed squares regression line is not affected and follows the trend one would consider the right one.
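To make the comparison in Figure 2.3 concrete, the following sketch approximates the LTS fit by random elemental starts followed by concentration steps (refitting least squares on the $ h$ currently best-fitting observations), in the spirit of later fast LTS algorithms. It is an illustration only, not the XAGlts01.xpl quantlet, and the toy data merely mimic the structure of the phonecal example (a mild trend with a block of grossly inflated responses).

\begin{verbatim}
import numpy as np

def lts_fit(X, y, h, n_starts=500, n_csteps=10, seed=0):
    """Crude LTS approximation via random elemental starts and concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)        # elemental start
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        for _ in range(n_csteps):                            # concentration steps
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]                        # h best-fitting points
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()         # trimmed objective
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta

# Toy data mimicking the phonecal example: 24 "years" with a mild upward trend
# and six grossly inflated responses in the middle of the series.
x = np.arange(1950, 1974, dtype=float)
y = 0.1 * (x - 1950) + 1.0
y[14:20] *= 20.0                                             # contaminated block
X = np.column_stack([np.ones_like(x), x - 1950])

print("LS :", np.linalg.lstsq(X, y, rcond=None)[0])          # pulled by the outliers
print("LTS:", lts_fit(X, y, h=18))                           # close to (1.0, 0.1)
\end{verbatim}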