9.4 Linear Regression

9.4.1 Equivariance and Metrics

The linear regression model may be written in the form

$\displaystyle Y_i=\boldsymbol{x}_{i}^{\top}\beta+\epsilon_i, \quad i=1,\ldots, n$ (9.101)

where $ \boldsymbol{x}_i\in\mathbb{R}^k,\, i=1,\ldots,n,$ and $ \beta\in \mathbb{R}^k$. The assumptions of the standard model are that the $ \boldsymbol{x}_i$ are fixed and that the $ \epsilon_i$ are i.i.d. random variables, the default distribution being the normal distribution $ N(0,\sigma^2)$. There are of course many other models in the literature, including random $ \boldsymbol{x}_i$-values and a covariance structure for the errors $ \epsilon_i$. For the purpose of robust regression we consider probability distributions $ P$ on $ \mathbb{R}^{k+1}$ where the first $ k$ components refer to the covariates $ \boldsymbol {x}$ and the last component is the corresponding value of $ y$. We restrict attention to the family $ \mathcal{P}_{k+1}$ of probability measures given by

$\displaystyle \mathcal{P}_{k+1}=\{P: P(H\times \mathbb{R})<1 \ \text{ for all lower dimensional subspaces } H \subset \mathbb{R}^k\}\,.$ (9.102)

The metric we use on $ \mathcal{P}_{k+1}$ is $ d_\mathcal{H}$ with $ \mathcal{H}$ given by (9.73).

Consider the regression group $ G$ of transformations $ g:\mathbb{R}^{k+1}\rightarrow \mathbb{R}^{k+1}$ of the form

$\displaystyle g(\boldsymbol{x},y)= \left(A(\boldsymbol{x}),sy+\boldsymbol{x}^{\top}\gamma\right)$ (9.103)

where $ A$ is a non-singular $ k\times k$-matrix, $ s \in \mathbb{R}, s\ne 0,$ and $ \gamma \in \mathbb{R}^k$. A functional $ T:\mathcal{P}_{k+1}\rightarrow \mathbb{R}^k\times\mathbb{R}_{+}$ is called a regression functional if for all $ g \in G$ and $ P\in \mathcal{P}_{k+1}$

$\displaystyle T(P^g)=h_g(T(P))\,,$ (9.104)

where

$\displaystyle h_g(\beta,\sigma) = \left((A^{-1})^{\top}(s\beta+\gamma),\,\vert s\vert\sigma\right)\,,$ (9.105)

with $ A$, $ s$ and $ \gamma$ as in (9.103). The first $ k$ components of $ T(P)$ specify the value of $ \beta\in \mathbb{R}^k$ and the last component that of $ \sigma $. The restriction to the models $ P\in \mathcal{P}_{k+1}$ of (9.102) is required because without it there is no uniquely defined value of $ \beta$.
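The ordinary least squares functional is, for example, regression equivariant in the sense of (9.104). The following Python sketch (an illustration only; the simulated data and the particular choices of $ A$, $ s$ and $ \gamma$ are arbitrary and made here) transforms a sample by an element $ g$ of the regression group and checks numerically that the least squares estimate transforms according to $ h_g$.

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

def ols(X, y):
    # least squares estimate of beta; root mean squared residual as the scale part
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = np.sqrt(np.mean((y - X @ beta) ** 2))
    return beta, sigma

# an arbitrary element g of the regression group (9.103)
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0, 1.0]])
s = -1.5
gamma = np.array([0.3, -0.7, 2.0])

# transformed sample: x -> A x,  y -> s y + x^T gamma
Xg = X @ A.T
yg = s * y + X @ gamma

beta_hat, sigma_hat = ols(X, y)
beta_g, sigma_g = ols(Xg, yg)

# h_g of (9.105): beta -> (A^{-1})^T (s beta + gamma),  sigma -> |s| sigma
beta_pred = np.linalg.inv(A).T @ (s * beta_hat + gamma)
print(np.allclose(beta_g, beta_pred), np.isclose(sigma_g, abs(s) * sigma_hat))  # True True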

9.4.2 M-estimators for Regression

Given a distribution $ P\in \mathcal{P}_{k+1}$ we define an M-functional by $ T(P)=(\beta^{\ast},\sigma^{\ast})$ where $ (\beta^{\ast},\sigma^{\ast})$ is a solution of the equations

$\displaystyle \int \phi\left(\boldsymbol{x},\left(y-\boldsymbol{x}^{\top}\beta\right)/\sigma\right)\boldsymbol{x}\,{d}P(\boldsymbol{x},y)=0\,,$ (9.106)
$\displaystyle \int \chi\left(\left(y-\boldsymbol{x}^{\top}\beta\right)/\sigma\right)\,{d}P(\boldsymbol{x},y)=0$ (9.107)

for given functions $ \phi:\mathbb{R}^{k+1} \rightarrow \mathbb{R}$ and $ \chi:\mathbb{R}\rightarrow \mathbb{R}$. Just as for the M-functionals of location and scatter in Sect. 9.3.2 there are problems concerning existence and uniqueness. [70] give sufficient conditions for existence which depend only on the properties of $ \phi$ and $ \chi$ and the values of $ \sup_{\theta} \{P(\theta^{\top}\boldsymbol{x}=0):\theta\ne 0\}$ and $ \sup_{\alpha,\theta}\{P(\alpha y+\theta^{\top}\boldsymbol{x}=0):\vert\alpha\vert +\Vert \theta \Vert \ne 0\}$. Uniqueness requires additional strong assumptions such as either symmetry or the existence of a density for the conditional distribution of $ y-\theta_0^{\top}\boldsymbol{x}$ for each fixed $ \boldsymbol {x}$. [58] considers the minimization problem

$\displaystyle (\beta^{\ast},\sigma^{\ast})=\mathrm{argmin}_{(\beta,\sigma)}\left(\int\rho\left(\left(y-\boldsymbol{x}^{\top}\beta\right)/\sigma\right)\,{d}P(\boldsymbol{x},y)+a\right)\sigma\,,$ (9.108)

where $ \rho :\mathbb{R} \rightarrow \mathbb{R}_{+}$ is convex with $ \rho(0)=0$ and $ a > 0$. Under appropriate conditions on $ \rho$ it can be shown that the solution is unique and that there exists a convergent algorithm to calculate it. On differentiating (9.108) we obtain (9.106) and (9.107) with

$\displaystyle \phi(\boldsymbol{x},u)=\rho^{(1)}(u)$    and $\displaystyle \chi(u)=u\rho^{(1)}(u)-\rho(u)-a\,.$ (9.109)
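In the sample case the defining equations (9.106) and (9.107) are usually solved by iteratively reweighted least squares. The sketch below is an illustration only and not the scheme of [58]: it uses Huber's $ \psi$-function with the conventional tuning constant $ c=1.345$, takes $ \phi(\boldsymbol{x},u)=\psi(u)$ (so leverage points are not downweighted), and replaces the $ \chi$ of (9.109) by the scale equation $ \chi(u)=\psi(u)^2-\delta$ with $ \delta=E\psi(Z)^2$, $ Z\sim N(0,1)$; all function names are choices made here.

import numpy as np
from scipy.stats import norm

def huber_psi(u, c=1.345):
    # Huber's psi: psi(u) = max(-c, min(u, c))
    return np.clip(u, -c, c)

def m_regression(X, y, c=1.345, tol=1e-8, max_iter=200):
    # iteratively reweighted least squares for beta combined with a fixed point
    # iteration for sigma; a sketch, convergence is not checked rigorously
    n, k = X.shape
    # delta = E psi_c(Z)^2 for Z ~ N(0,1), makes sigma Fisher consistent at the normal
    delta = 2 * (c**2 * norm.sf(c) + norm.cdf(c) - 0.5 - c * norm.pdf(c))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)            # least squares start
    r = y - X @ beta
    sigma = 1.4826 * np.median(np.abs(r - np.median(r)))    # MAD start for the scale
    for _ in range(max_iter):
        u = r / sigma
        # scale step: fixed point of (1/n) sum psi(r_i/sigma)^2 = delta
        sigma_new = sigma * np.sqrt(np.mean(huber_psi(u, c) ** 2) / delta)
        # beta step: weighted least squares with weights psi(u)/u
        au = np.abs(u)
        w = np.where(au <= c, 1.0, c / np.maximum(au, 1e-12))
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol and abs(sigma_new - sigma) < tol:
            beta, sigma = beta_new, sigma_new
            break
        beta, sigma = beta_new, sigma_new
        r = y - X @ beta
    return beta, sigma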

Even if the solution of (9.106) and (9.107) exists and is unique it is not necessarily regression equivariant. To make it so we must introduce a scatter functional $ T_{\Sigma}$ on the marginal distribution $ {P^{\prime}}, {P^{\prime}}(B)=P(B\times \mathbb{R})$, of the covariate $ \boldsymbol {x}$. Such a functional satisfies $ T_{\Sigma}({P^{\prime}}^A)= AT_{\Sigma}({P^{\prime}})A^{\top}$ for any non-singular $ k\times k$-matrix $ A$. It is required not only for equivariance reasons but also to downweight outlying $ \boldsymbol {x}$-values, the so-called leverage points, and for this latter purpose $ T_{\Sigma}$ must itself be robust. We now replace (9.106) by

$\displaystyle \int \phi\left(\boldsymbol{x}^{\top}T_{\Sigma}(P)^{-1}\boldsymbol{x},\left(y-\boldsymbol{x}^{\top}\beta\right)/\sigma\right)\boldsymbol{x}\,{d}P(\boldsymbol{x},y)=0\,.$ (9.110)

The resulting functional is now regression equivariant, but its analysis is more difficult, requiring as it does an analysis of the robustness properties of the scatter functional $ T_{\Sigma}$.

Finally we note that in the literature most $ \phi$ functions of (9.106) are of the form

$\displaystyle \phi(\boldsymbol{x},u)=\pi(\boldsymbol{x})\psi(u)$ (9.111)

and the resulting functionals are known as GM-functionals. We refer to [54].

9.4.3 Bias and Breakdown

Given a regression functional $ T_r=(T_b,T_s)$, where $ T_b$ refers to the $ \beta$-components and $ T_s$ to the scale part, it is usual to define breakdown by the behaviour of $ T_b$ alone and to neglect $ T_s$. The breakdown point of $ T_r$ at the distribution $ P$ is defined by

$\displaystyle \epsilon^{\ast}(T_r,P,d_\mathcal{H})= \sup \{\epsilon: b(T_r,P,\epsilon,d_\mathcal{H})< \infty\}$ (9.112)

where

$\displaystyle b(T_r,P,\epsilon,d_\mathcal{H})= \sup \{\Vert T_b(Q)-T_b(P)\Vert: d_{\cal H}(P,Q) < \epsilon\}$ (9.113)

with corresponding definitions for the gross error neighbourhood $ \epsilon^{\ast}(T_r,P,GE)$ and for the finite sample breakdown point $ \epsilon^{\ast}(T_r,P_n,fsbp)$. To state the next theorem we set

$\displaystyle \Delta(P)= \sup\{P(H\times \mathbb{R}): H \ \text{ a plane in } \mathbb{R}^k \ \text{ of dimension at most } k-1\}\,,$

which is the regression equivalent of (9.77). We have

Theorem 5   For any regression equivariant functional

$\displaystyle \epsilon^{\ast}(T_r,P,d_\mathcal{H}) \le (1-\Delta(P))/2$    and $\displaystyle \epsilon^{\ast}(T_r,P_n,fsbp) \le (1-\Delta(P_n))/2\,.$ (9.114)

If one considers $ L_1$-regression

$\displaystyle \beta^{\ast}=\mathrm{argmin }\sum_{i=1}^n\big\vert y_i-\boldsymbol{x}_i^{\top}\beta \big\vert$ (9.115)

it can be shown that if one $ \boldsymbol{x}_i$ is sufficiently outlying then the residual at this point will be zero and hence the finite sample breakdown point is a disappointing $ 1/n$. This turns out to apply to most M-functionals of the last section, whose breakdown point is at most $ 1/(k+1)$ irrespective of their exact definition. The literature on this point is unsatisfactory. Although some M-functionals have been shown to have a positive breakdown point, this has only been done under the assumption that the scale part $ T_s$ is known. As obtaining the correct magnitude of the scale of the errors is in some sense the most difficult problem in robust regression, such results are of limited value. They do not however alter the fact that M-functionals have a disappointing breakdown point.
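The leverage effect for $ L_1$-regression is easy to reproduce numerically. The following sketch (purely illustrative; the simulated data and the linear programming formulation are choices made here, not part of the chapter) solves (9.115) as a linear program and shows that a single extreme $ \boldsymbol{x}_i$ obtains an essentially zero residual and drags the fitted slope towards itself.

import numpy as np
from scipy.optimize import linprog

def l1_regression(X, y):
    # argmin_beta sum_i |y_i - x_i^T beta| as a linear program:
    # minimise sum_i t_i subject to -t_i <= y_i - x_i^T beta <= t_i
    n, k = X.shape
    c = np.concatenate([np.zeros(k), np.ones(n)])
    I = np.eye(n)
    A_ub = np.block([[-X, -I], [X, -I]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * k + [(0, None)] * n   # beta free, t_i >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:k]

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 1, n)
y = 2.0 + 1.0 * x + rng.normal(scale=0.1, size=n)
x[0], y[0] = 100.0, -50.0                           # one extreme leverage point
X = np.column_stack([np.ones(n), x])

beta = l1_regression(X, y)
print("L1 fit:", beta)                              # slope dragged towards the outlier
print("residual at the leverage point:", y[0] - X[0] @ beta)   # essentially zero

We now turn to the problem of constructing high breakdown regression functionals.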

9.4.4 High Breakdown Regression Functionals

The first high breakdown regression functional was proposed by [52] and is as follows.

$\displaystyle T_{lms}(P)=\mathrm{argmin}_{(\beta,\sigma)} \left\{\sigma: \int \big\{\vert y-\boldsymbol{x}^{\top}\beta\vert \le \sigma\big\}\,{d}P(\boldsymbol{x},y) \ge 1/2 \right\}.$ (9.116)

The idea goes back to Tukey's shortest half-sample, of which it is the regression counterpart. It can be shown that it almost attains the highest finite sample breakdown point given by Theorem 5; by slightly altering the factor $ 1/2$ in (9.116) to take into account the dimension $ k$ of the $ \boldsymbol {x}$-variables it can attain this bound. [88] propagated its use and gave it the name by which it is now known, the least median of squares (LMS). Rousseeuw calculated the finite sample breakdown point and provided a first heuristic algorithm which could be applied to real data sets. He also defined a second high breakdown functional, the least trimmed squares (LTS) functional, given by

$\displaystyle T_{lts}(P)=\mathrm{argmin}_{(\beta,\sigma)} \left\{\int \left(y-\boldsymbol{x}^{\top}\beta\right)^2\big\{\vert y-\boldsymbol{x}^{\top}\beta\vert \le \sigma\big\}\,{d}P(\boldsymbol{x},y): \int \big\{\vert y-\boldsymbol{x}^{\top}\beta\vert \le \sigma\big\}\,{d}P(\boldsymbol{x},y) \ge 1/2\right\}.$ (9.117)

There are now many high breakdown regression functionals such as $ S$-functionals ([98]), MM-functionals ([114]), $ \tau$-functionals ([116]), constrained M-functionals ([75]), rank regression ([17]) and regression depth ([93]). Just as in the location and scale problem in $ \mathbb{R}^k$, statistical functionals can have the same breakdown points but very different bias functions. We refer to [72], [71] and [13]. All these high breakdown functionals either attain, or by some minor adjustment can be made to attain, the breakdown points of Theorem 5, with the exception of the depth based methods, for which the maximal breakdown point is $ 1/3$ (see [32]).

All the above high breakdown regression functionals can be shown to exist under weak assumptions, but just as in the case of high breakdown location and scatter functionals in $ \mathbb{R}^k$ uniqueness can only be shown under very strong conditions which typically involve the existence of a density function for the errors (see [25]). The comments made about high breakdown location and scale functionals in $ \mathbb{R}^k$ apply here. Thus even if a regression functional is well defined at some particular model there will be other models arbitrarily close in the metric $ d_\mathcal{H}$ where a unique solution does not exist. This points to an inherent local instability of high breakdown regression functionals which has been noted in the literature ([100,36]). [30] has constructed regression functionals which are well defined at all models $ P$ with $ \Delta(P) < 1$ and which are locally uniformly Lipschitz, though not locally uniformly Fréchet differentiable. For this reason all confidence regions and efficiency claims must be treated with a degree of caution. An increase in stability can however be attained by using the LTS-functional instead of the LMS-functional, by reweighting the observations or by using some form of one-step M-functional improvement as in (9.29).

Just as with high breakdown location and scatter functionals in $ \mathbb{R}^k$, the calculation of high breakdown regression functionals poses considerable difficulties. The first high breakdown regression functional was Hampel's least median of squares, and even in the simplest case of a straight line in $ \mathbb{R}^2$ the computational cost is of order $ n^2$. The algorithm is by no means simple, requiring as it does ideas from computational geometry (see [35]). From this and the fact that the computational complexity increases with the dimension $ k$ it follows that one has to fall back on heuristic algorithms. The one recommended for linear regression is that of [95] for the LTS-functional.
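To give an idea of what such a heuristic looks like, the sketch below implements a basic random-subset search with concentration steps for the finite sample LTS problem: fit exactly through $ k$ randomly chosen points, then repeatedly refit by least squares on the $ h$ observations with the smallest squared residuals. This is only a toy version written for illustration here and is not the algorithm of [95]; the coverage $ h$, the number of starts and all names are choices made here.

import numpy as np

def lts_fit(X, y, h=None, n_starts=200, n_csteps=20, seed=0):
    # toy LTS heuristic: random k-point elemental starts followed by
    # concentration steps until the active subset stabilises
    rng = np.random.default_rng(seed)
    n, k = X.shape
    h = h if h is not None else (n + k + 1) // 2    # coverage just above n/2
    best_obj, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=k, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])  # exact fit through k points
        except np.linalg.LinAlgError:
            continue                                # singular elemental subset
        for _ in range(n_csteps):
            r2 = (y - X @ beta) ** 2
            subset = np.argsort(r2)[:h]             # h smallest squared residuals
            beta_new, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
            if np.allclose(beta_new, beta):
                break
            beta = beta_new
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    r = y - X @ best_beta
    scale = 1.4826 * np.median(np.abs(r))           # crude residual scale, no finite sample correction
    return best_beta, scale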

9.4.5 Outliers

To apply the concept of $ \alpha $-outlier regions to the linear regression model we have to specify the distribution $ P_Y$ of the response and the joint distribution $ P_{\boldsymbol{X}}$ of the regressors, assuming the latter to be random. For concreteness we consider the model

$\displaystyle P_{Y\vert\boldsymbol{X} = \boldsymbol{x}}\; =\; N(\boldsymbol{x}^{\top} \beta , \sigma^2),$ (9.118)

and

$\displaystyle P_{\boldsymbol{X}}\; =\; \mathcal{N}( \boldsymbol \mu, \boldsymbol \Sigma).$ (9.119)

Assumption (9.118) states that the conditional distribution of the response given the regressors is normal and assumption (9.119) means that the joint distribution of the regressors is a $ k$-variate normal distribution. If both assumptions are fulfilled then the joint distribution of $ (Y,\boldsymbol{X})$ is a multivariate normal distribution.

We can define outlier regions under model (9.101) in several reasonable ways. If only (9.118) is assumed then a response-$ \alpha $-outlier region could be defined as

$\displaystyle {\mathrm{out}}(\alpha, P_{Y\vert\boldsymbol{X} = \boldsymbol{x}}) = \bigl\{ y \in \mathbb{R}: \vert y - \boldsymbol{x}^{\top} \beta \vert > \sigma\, z_{1 - \alpha/2} \bigr\},$ (9.120)

which is appropriate if the regressors are fixed and only outliers in $ y$-direction are to be identified. If the regressors are random, which will be the more frequent case in actuarial or econometric applications, outliers in $ \boldsymbol {x}$-direction are important as well. Under assumption (9.119) a regressor-$ \alpha $-outlier region is a special case of the $ \alpha $-outlier region (9.99). This approach leads to a population based version of the concept of leverage points. These are the points in a sample $ (y_i,\boldsymbol{x}_i),\, i = 1,\ldots,n,$ from model (9.101) "for which $ \boldsymbol{x}_i$ is far away from the bulk of the $ \boldsymbol{x}_i$ in the data" ([97]).

For the identification of regressor-outliers (leverage points) the same identification rules can be applied as in the multivariate normal situation. For the detection of response-outliers by resistant one-step identifiers, one needs robust estimators of the regression coefficients and of the scale $ \sigma $. Examples of high breakdown estimators that can be used in this context are the Least Trimmed Squares estimator and the corresponding scale estimator ([88,94]), S-estimators ([98]), MM-estimators ([114]) and the REWLS-estimators ([47]).
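A minimal sketch of such a one-step identifier is given below, assuming a robust fit $ (\hat\beta,\hat\sigma)$ is available; here it is produced by a simple elemental-subset LMS-type search rather than by any of the estimators cited above, and every observation whose residual falls in the estimated region (9.120) with $ \alpha=\alpha_n$ is flagged. All names and tuning choices are ours.

import numpy as np
from scipy.stats import norm

def robust_fit(X, y, n_starts=500, seed=0):
    # illustrative LMS-type fit: random k-point exact fits, keep the one with
    # the smallest median absolute residual; residual scale via normalised MAD
    rng = np.random.default_rng(seed)
    n, k = X.shape
    best_crit, best_beta = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=k, replace=False)
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue
        crit = np.median(np.abs(y - X @ beta))
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    r = y - X @ best_beta
    return best_beta, 1.4826 * np.median(np.abs(r))

def response_outliers(X, y, alpha_n=0.01):
    # one-step identifier based on (9.120) with (beta, sigma) replaced by robust estimates
    beta, sigma = robust_fit(X, y)
    cutoff = sigma * norm.ppf(1 - alpha_n / 2)
    return np.abs(y - X @ beta) > cutoff

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[:5] += 10.0                                        # five shifted responses
print(np.where(response_outliers(X, y))[0])          # flags observations 0..4 (and possibly a few others)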

