6.1 Resistant smoothing techniques

A linear local average of the response variables is, per se, not robust against outliers: moving a single response observation to infinity would drag the smooth to infinity as well. In this sense, local average smoothing has unbounded capacity to be influenced by ``far out" observations. Resistance, or ``bounded influence", against outliers can be achieved by downweighting large residuals which would otherwise dominate the smoother.

We have already encountered a straightforward resistant technique: median smoothing. It is highly robust since the extreme response observations (stemming from predictor variables in a neighborhood around $x$) do not have any effect on the (local) median of the response variables. A slight disadvantage of median smoothing, though, is that it produces a rough and wiggly curve. Resmoothing and twicing are data-analytic techniques to ameliorate median smoothing in this respect; see Velleman (1980) and Mallows (1980).

6.1.1 LOcally WEighted Scatter plot Smoothing (LOWESS)

Cleveland (1979) proposed the following algorithm, LOWESS, a resistant method based on local polynomial fits. The basic idea is to start with a local polynomial least squares fit and then to ``robustify" it. ``Local" means here a $k$-$NN$ type neighborhood. The procedure starts from a $k$-$NN$ pilot estimate and iteratively defines robustness weights and re-smoothes several times.

Algorithm 6.1.1

LOWESS

STEP 1. Fit a polynomial regression in a neighborhood of $x$, that is, find coefficients $\{ \beta_j \}^p_{j=0}$ which minimize

\begin{displaymath}n^{-1} \sum^n_{i=1} W_{ki} (x) \left(Y_i - \sum^p_{j=0} \beta_j X_i^j
\right)^2,\end{displaymath}

where $\{ W_{ki}(x) \}$ denote $k$-$NN$ weights.

FOR $i=1$ TO maxiter DO BEGIN

STEP 2.

Compute from the estimated residuals $\{ \hat\varepsilon_i \}$ the scale estimate $\hat \sigma = $med $\{ \left\vert \hat \varepsilon_i
\right\vert \}$ and define robustness weights $\delta_i = K (\hat
\varepsilon_i /(6\hat \sigma))$, where $K$ denotes the quartic kernel, $ K(u) = (15/16){(1-u^2)}^2 \ I(\left\vert u\right\vert \le 1)$.

STEP 3.

Fit a polynomial regression as in STEP 1 but with weights

\begin{displaymath}\{ \delta_i W_{ki} (x) \}.\end{displaymath}

END (* i *).
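To make the steps concrete, here is a minimal sketch of Algorithm 6.1.1 in Python. All function names are mine, the fit is evaluated at the observation points only, and the quartic robustness weights and $k$-$NN$ weights follow the definitions above; it is meant as an illustration, not as Cleveland's implementation.

\begin{verbatim}
import numpy as np

def quartic(u):
    """Quartic kernel K(u) = (15/16)(1 - u^2)^2 I(|u| <= 1)."""
    return np.where(np.abs(u) <= 1, (15.0 / 16.0) * (1 - u**2) ** 2, 0.0)

def knn_weights(x, X, k):
    """k-NN weights: uniform over the k nearest observations, zero elsewhere."""
    dist = np.abs(X - x)
    radius = np.sort(dist)[k - 1]
    w = (dist <= radius).astype(float)
    return w / w.sum()

def lowess_sketch(X, Y, k, p=1, maxiter=2):
    """Robustified local polynomial fit (Algorithm 6.1.1), evaluated at each X_i."""
    n = len(X)
    B = np.vander(X, p + 1, increasing=True)     # columns 1, X_i, ..., X_i^p
    delta = np.ones(n)                           # robustness weights; all one in STEP 1
    for _ in range(maxiter + 1):                 # initial fit plus maxiter robust refits
        fitted = np.empty(n)
        for j, x in enumerate(X):
            w = delta * knn_weights(x, X, k)     # STEP 3 weights (STEP 1 on first pass)
            WB = B * w[:, None]
            beta = np.linalg.lstsq(WB.T @ B, WB.T @ Y, rcond=None)[0]
            fitted[j] = np.polyval(beta[::-1], x)
        resid = Y - fitted                       # STEP 2: residuals, scale, new weights
        sigma = np.median(np.abs(resid))
        delta = quartic(resid / (6.0 * sigma))
    return fitted
\end{verbatim}

Calling lowess_sketch(X, Y, k=n // 2) mirrors the parameter choices quoted in the caption of Figure 6.2 ($p=1$, maxiter $=2$, $k=[n/2]$).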

Cleveland recommends the choice $p=1$ (as for the supersmoother) as striking a good balance between computational ease and the need for flexibility to reproduce patterns in the data. The smoothing parameter can be determined by cross-validation as in Section 5.1. Figure 6.2 shows an application of Cleveland's algorithm to a simulated data set. It is quite obvious that the LOWESS smooth is resistant to the ``far out" response variables at the upper borderline of the plot.

Figure 6.2: Scatter plot of artificially generated data ( $\scriptstyle n=50$, $\scriptstyle Y_i=0.02X_i+ \varepsilon_i$, $\scriptstyle X_i=i$, $\scriptstyle \varepsilon_i
\sim N(0,1)$) and robust smoothed values with $\scriptstyle p=1$, $\scriptstyle maxiter=2$, $\scriptstyle k= [ n/2 ] $. From Cleveland (1979) with permission of the American Statistical Association.
\includegraphics[scale=0.2]{ANR6,2.ps}

6.1.2 $L$-smoothing

Another class of resistant smoothers is given by local trimmed averages of the response variables. If $Z_{(1)}, Z_{(2)}, \ldots, Z_{(N)
}$ denotes the order statistic from $N$ observations $\{ Z_j \}^N_{j=1}$, a trimmed average (mean) is defined by

\begin{displaymath}\overline Z_{\alpha} = (N-2
[ \alpha N ])^{-1} \sum^{N - [ \alpha N ]}_{j= [
\alpha N ]} Z_{(j)}, 0<\alpha<1/2,\end{displaymath}

the mean of the ``inner $100(1-2 \alpha)$ percent of the data." A local trimmed average at the point $x$ from regression data $\{ (X_i, Y_i) \}^n_{i=1}$ is defined as a trimmed mean of the response variables $Y_i$ such that $X_i$ is ``in a neighborhood of $x$." (The neighborhood could be parameterized, for instance, by a bandwidth sequence $h=h_n$.) Adopting terminology from the theory of robust estimation, this type of smoothing is called $L$-smoothing.
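In code, a local trimmed average with a bandwidth neighborhood might look as follows; the function name, the fallback to the median for very small neighborhoods, and the default $\alpha$ are illustrative choices of mine.

\begin{verbatim}
import numpy as np

def local_trimmed_mean(x, X, Y, h, alpha=0.1):
    """Trimmed mean of those Y_i whose X_i lie within h of x."""
    Z = np.sort(Y[np.abs(X - x) <= h])   # order statistic of the local responses
    N = len(Z)
    if N == 0:
        return np.nan                    # empty neighborhood
    cut = int(alpha * N)                 # [alpha N]
    if N <= 2 * cut:
        return np.median(Z)              # too few points left after trimming
    return Z[cut:N - cut].mean()         # mean of the inner 100(1 - 2 alpha) percent
\end{verbatim}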

$L$-smoothing is a resistant technique: the ``far out extremes" at a point $x$ do not enter the local averaging procedure. More generally, one considers a conditional $L$-functional

\begin{displaymath}
l(x)=\int^1_0 J(v) F^{-1} (v\vert x) dv,
\end{displaymath} (6.1.1)

where $F^{-1} (v\vert x) =\inf \{ y: F(y\vert x) \ge v \}, 0 < v <1, $ denotes the conditional quantile function associated with $F(\cdot\vert x)$, the conditional distribution function of $Y$ given $X=x$. For $J(v) \equiv1, l(x)$ reduces to the regression function $m(x),$ since by substituting $y=F^{-1} (v\vert x)$,

\begin{displaymath}\int^1_0 F^{-1} (v\vert x) dv = \int^{F^{-1}(1\vert x)}_{F^{-1}(0\vert x)}
y dF(y\vert x) = m(x). \end{displaymath}

The same occurs in the case $J(v)= I (\alpha \le v \le 1- \alpha)/ (1- 2\alpha)$, where $0< \alpha <1/2$, provided the conditional distribution function is symmetric. Median smoothing is the special case of $L$-smoothing corresponding to $\alpha=1/2$.

In practice, we do not know $F(\cdot\vert x)$ and have to estimate it. If $F_n(\cdot \vert x)$ denotes such an estimator of $F(\cdot\vert x)$, one obtains from formula 6.1.1 the class of $L$-smoothers. Estimates of $F(t \vert x)$ can be constructed, for example, by the kernel technique,

\begin{displaymath}F_h(t \vert x)={n^{-1} \sum^n_{i=1} K_h (x-X_i) I(Y_i \le t) \over \hat f_h
(x)},\end{displaymath}

to obtain

\begin{displaymath}\hat m^L_h(x) =\int^1_0 J(v) F_h^{-1} (v\vert x) dv. \end{displaymath}
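A sketch of this kernel-based $L$-smoother with the trimming weight $J(v)= I (\alpha \le v \le 1- \alpha)/ (1- 2\alpha)$ is given below. The quadratic (Epanechnikov) kernel, the default $\alpha$ and the function names are illustrative assumptions; the conditional quantile function is read off the kernel-weighted empirical distribution $F_h(\cdot \vert x)$.

\begin{verbatim}
import numpy as np

def kernel_L_smoother(x, X, Y, h, alpha=0.1):
    """L-smoother 6.1.1 at x with J(v) = I(alpha <= v <= 1-alpha)/(1-2 alpha)."""
    K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    w = K((x - X) / h)
    if w.sum() == 0.0:
        return np.nan                        # empty neighborhood
    w = w / w.sum()                          # kernel weights divided by n f_h(x)
    order = np.argsort(Y)
    y_sorted, w_sorted = Y[order], w[order]
    F = np.cumsum(w_sorted)                  # F_h(y | x) at its jump points
    lo, hi = alpha, 1.0 - alpha
    total, v_prev = 0.0, 0.0
    for y, v in zip(y_sorted, F):
        a, b = max(v_prev, lo), min(v, hi)   # overlap of (v_prev, v] with [lo, hi]
        if b > a:
            total += (b - a) * y             # F_h^{-1}(v|x) equals y on this piece
        v_prev = v
    return total / (1.0 - 2.0 * alpha)
\end{verbatim}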

Stute (1984) and Owen (1987) show asymptotic normality of such conditional functionals. Härdle, Janssen and Serfling (1988) derive (optimal) uniform consistency rates for $L$-smoothers.

6.1.3 $R$-smoothing

Yet another class of resistant smoothers consists of the $R$-smoothers derived from $R$-estimates of location. Assume that $F(\cdot\vert x)$ is symmetric around $m(x)$ and that $J$ is a nondecreasing function defined on $(0,1)$ such that $J(1-s)= - J(s)$. Then the score

\begin{displaymath}T(\theta, F(\cdot \vert x))=\int^{\infty}_{- \infty} J\left({1 \over 2}
(F(v \vert x)+1-F(2 \theta-v \vert x))\right) dF(v \vert x)\end{displaymath}

is zero for $\theta = m(x)$. The idea now is to replace $F(\cdot\vert x)$ by an estimate $F_n(\cdot \vert x)$ of the conditional distribution function; the score $T(\theta, F_n(\cdot\vert x))$ should then be roughly zero for a good estimate $\theta$ of $m(x)$. The motivation for this $R$-smoothing technique stems from rank tests.

Consider a two-sample rank test for shift based on the sample $\{Z_i \}
^n_{i=1}$ and $\{2 \theta -Z_i \}^n_{i=1}$, that is, a mirror image of the first sample serves as a stand-in for the second sample. Now try to adjust $\theta$ in such a way that the test statistic $T_n=n^{-1}
\sum^n_{i=1}a(R_i)$ based on the scores

\begin{displaymath}a(i)=\left(2n \int^{i/2n}_{(i-1)/2n}J(s)ds\right)\end{displaymath}

of the ranks $R_i$ of $\{Z_i \}$ in the combined sample $\{ Z_i
\}+ \{2 \theta -Z_i \}$ is roughly zero (see Huber 1981, chapter 3.4). This would make the two samples $\{Z_i \}$ and $\{2 \theta -Z_i \}$ almost indistinguishable or, in other words, would make $\theta$ a good estimate of location. If this $T_n$ is translated into the setting of smoothing then the above form of $T(\theta ,F( \cdot \vert x))$ is obtained.

A solution of $T(\theta, F_n(\cdot\vert x)) = 0$ is, in general, not unique and may behave irregularly. Cheng and Cheng (1986) therefore suggested

\begin{displaymath}
\hat m^R_h = {1 \over 2} [\sup \{ \theta: T(\theta, F_n(\cdot\vert x))>0 \}
+\inf \{ \theta: T(\theta, F_n(\cdot\vert x))<0 \}]
\end{displaymath} (6.1.2)

as an estimate for the regression curve $m(x)$. Consistency and asymptotic normality of this smoothing technique are derived in Cheng and Cheng (1987).
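The following sketch evaluates the empirical score $T(\theta, F_n(\cdot\vert x))$ and approximates 6.1.2 by a grid search for the sign change. The Wilcoxon-type score $J(s)=s-1/2$, the quadratic kernel, the grid resolution and the fallback for degenerate cases are all illustrative assumptions.

\begin{verbatim}
import numpy as np

def T_score(theta, x, X, Y, h, J=lambda s: s - 0.5):
    """Empirical analogue of T(theta, F_n(.|x)) with a kernel-weighted F_n."""
    K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    w = K((x - X) / h)
    w = w / w.sum()
    Fn = lambda t: np.sum(w * (Y <= t))          # estimated conditional c.d.f. at t
    FnY = np.array([Fn(y) for y in Y])
    Fn_mirror = np.array([Fn(2.0 * theta - y) for y in Y])
    return np.sum(w * J(0.5 * (FnY + 1.0 - Fn_mirror)))

def r_smoother(x, X, Y, h, grid_size=201):
    """Grid approximation of 6.1.2: midpoint of the sign change of the score."""
    thetas = np.linspace(Y.min(), Y.max(), grid_size)
    scores = np.array([T_score(t, x, X, Y, h) for t in thetas])
    pos, neg = thetas[scores > 0], thetas[scores < 0]
    if len(pos) == 0 or len(neg) == 0:
        return np.median(Y)                      # degenerate case; crude fallback
    return 0.5 * (pos.max() + neg.min())
\end{verbatim}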

6.1.4 $M$-smoothing

Resistant smoothing techniques based on $M$-estimates of location are called $M$-smoothers. Recall that all smoothers of the form

\begin{displaymath}\hat m(x)=n^{-1} \sum^n_{i=1} W_{ni} (x)Y_i\end{displaymath}

can be viewed as solutions to (local) least squares problems; see 3.1.8. The basic idea of $M$-smoothers is to reduce the influence of outlying observations by the use of a non-quadratic loss function in 3.1.8. A well-known example (see Huber 1981) of such a loss function with ``lighter tails" is
\begin{displaymath}
\rho(u)=\left \{ \begin{array}{lrcl}
(1/2)u^2, & {\rm if }\; \vert u \vert & \le & c \\
c \vert u \vert - (1/2)c^2, & {\rm if }\; \vert u \vert & > & c
\end{array} \right.
\end{displaymath} (6.1.3)

The constant $c$ regulates the degree of resistance. For large values of $c$ one obtains the ordinary quadratic loss function. For small values ($c \approx$ one or two times the standard deviation of the observation errors) one achieves more robustness.
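Written out as code, the loss 6.1.3 is simply the following (the function name is mine):

\begin{verbatim}
import numpy as np

def huber_rho(u, c):
    """Loss 6.1.3: quadratic for |u| <= c, linear with slope c beyond."""
    au = np.abs(u)
    return np.where(au <= c, 0.5 * u**2, c * au - 0.5 * c**2)
\end{verbatim}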

In the setting of spline smoothing, an $M$-type spline was defined by Cox (1983) as

\begin{displaymath}
\mathop {\rm argmin}_{g} \left\{ n^{-1} \sum^n_{i=1} \rho
(Y_i-g(t_i))+ \lambda \int [ g''(x) ]^2 dx \right\}
\end{displaymath} (6.1.4)

where, again, $\rho$ is a loss function with ``lighter" tails than the quadratic. Related types of $M$-smoothers were considered by Huber (1979), Nemirovskii, Polyak and Tsybakov (1983, 1985), and Silverman (1985).

Kernel smoothers can be made resistant by similar means. Assume that the conditional distribution $F(\cdot\vert x)$ is symmetric. This assumption ensures that we are still estimating $m(x)$, the conditional mean curve. Define a robust kernel $M$-smoother $\hat m^M_h(x)$ as

\begin{displaymath}
\mathop {\rm argmin}_{\theta} \left\{ n^{-1}
\sum^n_{i=1} W_{hi}(x) \rho (Y_i-\theta) \right\}
\end{displaymath} (6.1.5)

where $\{ W_{hi} (x) \}^n_{i=1}$ denotes a positive kernel weight sequence. Differentiating 6.1.5 with respect to $\theta$ yields, with $\psi= \rho'$,
\begin{displaymath}
n^{-1} \sum^n_{i=1} W_{hi}(x) \psi (Y_i- \theta)=0.
\end{displaymath} (6.1.6)
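Equation 6.1.6 defines $\hat m^M_h(x)$ only implicitly. A minimal sketch of one way to solve it numerically is the Newton-type iteration below; the Huber $\psi$ from 6.1.3, the quadratic kernel, the starting value and all names are illustrative assumptions, not the algorithm of Härdle (1987a).

\begin{verbatim}
import numpy as np

def huber_psi(u, c):
    """psi(u) = rho'(u) for the Huber loss 6.1.3."""
    return np.clip(u, -c, c)

def kernel_M_smoother(x, X, Y, h, c=1.5, tol=1e-8, maxit=50):
    """Solve n^{-1} sum_i W_hi(x) psi(Y_i - theta) = 0 for theta."""
    K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
    w = K((x - X) / h)
    w = w / w.sum()
    theta = np.median(Y[w > 0])              # resistant starting value
    for _ in range(maxit):
        r = Y - theta
        score = np.sum(w * huber_psi(r, c))  # left-hand side of 6.1.6
        slope = np.sum(w * (np.abs(r) <= c)) # sum_i w_i psi'(Y_i - theta)
        if slope == 0.0:
            break
        step = score / slope                 # Newton step for the score equation
        theta += step
        if abs(step) < tol:
            break
    return theta
\end{verbatim}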

Since the kernel $M$-smoother is only implicitly defined, it has to be computed by iterative numerical methods, such as the Newton-type iteration sketched above. A fast algorithm based on the Fast Fourier Transform and a ``one-step" approximation to $\hat m^M_h $ are given in Härdle (1987a). A wide variety of possible $\psi$-functions yield consistent estimators $\hat m^M_h(x)$. (Consistency follows by arguments given in Huber (1981, chapter 3).) Note that the special case with linear $\psi(u)=u$ reproduces the ordinary kernel smoother $\hat m_h(x)$. To understand what resistant $M$-smoothers are actually doing to the data, define unobservable pseudo-observations

\begin{displaymath}\tilde Y_i =m(X_i)+ {\psi(\varepsilon_i) \over q(X_i)} \end{displaymath}

with

\begin{displaymath}q(X_i) =E ( \psi' (\varepsilon_i)\vert X_i ). \end{displaymath}

The following theorem can be derived using methods given in Tsybakov (1982b) and Härdle (1984b).

Theorem 6.1.1   Let $\hat m_h^M(x)$ be the kernel $M$-smoother computed from
$\{(X_i, Y_i) \}^n_{i=1}$ and let $\hat m_h(x)$ be the ordinary kernel smoother applied to the pseudo-data $\{ (X_i,\tilde Y_i) \}^n_{i=1}$; then $\sqrt{nh} (\hat m_h(x)-m(x))$ and $\sqrt {nh} (\hat m^M_h(x)-m(x))$ have the same asymptotic normal distribution with mean as in (4.2.1) and asymptotic variance

\begin{displaymath}V_x(\psi, K)= { c_K \over f(x)} {E ( \psi^2
(\varepsilon)\vert X=x ) \over q^2(x)}.\end{displaymath}
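A small simulation sketch of this equivalence is given below: an ordinary kernel smoother applied to the pseudo-data is compared with the kernel $M$-smoother at a single point. The regression function, the standard normal errors (for which $q = 2\Phi(c)-1$ under the Huber $\psi$), the bandwidth and the Huber constant are purely illustrative assumptions, and the comparison only illustrates that the two estimates are close in a finite sample.

\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, h, c, x0 = 400, 0.1, 1.5, 0.5
X = np.sort(rng.uniform(0.0, 1.0, n))
m = np.sin(2.0 * np.pi * X)                  # illustrative regression function
eps = rng.standard_normal(n)
Y = m + eps

q = 2.0 * norm.cdf(c) - 1.0                  # E psi'(eps) for Huber psi, N(0,1) errors
Y_tilde = m + np.clip(eps, -c, c) / q        # pseudo-observations of Theorem 6.1.1

K = lambda u: np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
w = K((x0 - X) / h)
w = w / w.sum()
mh_pseudo = np.sum(w * Y_tilde)              # ordinary kernel smoother on pseudo-data

theta = np.median(Y[w > 0])                  # kernel M-smoother via Newton steps
for _ in range(50):
    r = Y - theta
    theta += np.sum(w * np.clip(r, -c, c)) / max(np.sum(w * (np.abs(r) <= c)), 1e-12)

print(mh_pseudo, theta, np.sin(2.0 * np.pi * x0))
\end{verbatim}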

This result deserves some discussion. First, it shows that kernel $M$-smoothers can be interpreted as ordinary kernel smoothers applied to nonobservable pseudo-data with transformed errors $\psi(\varepsilon_i)/q(X_i)$. This sheds some light on how the resistance of $M$-smoothers is achieved: the ``extreme" observation errors $\varepsilon_i$ enter only through the nonlinear, bounded transformation $\psi(\varepsilon_i)/q(X_i)$. Second, Theorem 6.1.1 reveals that the bias of the kernel $M$-smoother is the same as that of the ordinary kernel smoother; the nonlinear definition of $\hat m^M_h(x)$ does not affect the (asymptotic) bias properties. Third, the asymptotic variance $V_x(\psi, K)$ factorizes into $c_K/f(x)$ and $E ( \psi^2 (\varepsilon) \vert X=x )/q^2(x)$, so it can be optimized by considering $\psi$ and $K$ separately.

The first of these two separate problems was solved in Section 4.5. By utilizing classical theory for $M$-estimates of location, the second problem can be treated as in Huber (1981, chapter 4). The details of this optimization technique are rather delicate; the reader is referred to the standard literature on robust estimation. Optimization of the smoothing parameter is discussed in Härdle (1984c) and more recently by Leung (1988). Both authors consider the direct analogue of cross-validation, namely, to construct robust leave-one-out smoothers and then proceed as in Section 5.1.
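As a rough sketch of this bandwidth-selection recipe, one can compute leave-one-out fits of a resistant smoother over a grid of bandwidths and minimize a cross-validation criterion. The quadratic criterion and the function names below are my own illustrative choices; the cited papers define the precise criteria.

\begin{verbatim}
import numpy as np

def robust_cv(X, Y, bandwidths, smoother):
    """Leave-one-out bandwidth selection for a resistant smoother.

    `smoother(x, Xtrain, Ytrain, h)` returns the fit at x, e.g. the
    kernel_M_smoother sketched after 6.1.6."""
    n = len(X)
    cv = []
    for h in bandwidths:
        resid = [Y[i] - smoother(X[i], np.delete(X, i), np.delete(Y, i), h)
                 for i in range(n)]
        cv.append(np.mean(np.square(resid)))
    return bandwidths[int(np.argmin(cv))]
\end{verbatim}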

A natural question to ask is, how much is gained or lost in asymptotic accuracy when using an $M$-smoother? The bias is the same as for the kernel smoother. A way of comparing the nonresistant and the resistant technique is therefore to study the ratio of asymptotic variances,

\begin{displaymath}
{ \sigma^2(x) \over E ( \psi^2 (\varepsilon) \vert X=x )/q^2(x) },
\end{displaymath} (6.1.7)

of the Nadaraya-Watson kernel smoother to the kernel $M$-smoother (based on the same kernel weights). But this relative efficiency 6.1.7 is the same as for the estimation of location. The reader is therefore referred to the literature on robust estimation (see e.g. Huber 1981).

As an example, I would like to present a smoothing problem in physical chemistry. Raman spectra are an important diagnostic tool in that field. One would like to identify the location and size of peaks and troughs of spectral bands; see Hillig and Morris (1982) and Bussian and Härdle (1984). Unfortunately, small-scale instrumental noise and a certain proportion of observation error caused by random external events blur the observations. The latter type of error causes high-frequency signals or bubbles in the sample and produces single spikes like those in Figure 6.3.

Figure 6.3: A Raman spectrum with two single spike outliers. From Härdle and Gasser (1984) with permission of the Royal Statistical Society.
\includegraphics[scale=0.3]{ANR6,3.ps}

Estimating with $\hat m_h(x),$ the ordinary Nadaraya-Watson kernel smoother, results in the curve depicted in Figure 6.4.

Figure: The kernel smoother $\scriptstyle \hat m_h(x)$ applied to the spectrum shown in Figure 6.3, $\scriptstyle h=9$, $\scriptstyle K(u)=(3/4)(1-u^2)
I(\left\vert u \right\vert
\leq 1)$. From Härdle and Gasser (1984) with permission of the Royal Statistical Society.
\includegraphics[scale=0.3]{ANR6,4.ps}

The single spike outliers obviously produced two spurious neighboring peaks. The resistant smoothing technique, on the other hand, leads to Figure 6.5.

Figure 6.5: The resistant kernel $\scriptstyle M$-smoother $\scriptstyle \hat
m^M_h(x)$ applied to the spectrum shown in Figure 6.3, $\scriptstyle h=9$, $\scriptstyle c=0.9$, $\scriptstyle K(u)=(3/4)(1-u^2)
I(\left\vert u \right\vert
\leq 1)$. From Härdle and Gasser (1984) with permission of the Royal Statistical Society.
\includegraphics[scale=0.3]{ANR6,5.ps}

The influence of the outliers is obviously reduced. Uniform confidence bands -- based on asymptotic extreme value theory -- may be constructed using the methods presented in Section 4.3; see Härdle (1987b). Figure 6.6 depicts a kernel $M$-smoother $\hat m_h^M$ together with uniform confidence bands, and $\hat m_h$, the Nadaraya-Watson kernel smoother, for the data presented in Figure 6.1.

Figure: The kernel $\scriptstyle M$-smoother with uniform confidence bands and the kernel smoother $\scriptstyle \hat m_h(x)$. The original data are those of Figure 6.1. From Härdle (1989).
\includegraphics[scale=0.15]{ANR6,6.ps}

Optimal uniform convergence rates (see Section 4.1) for kernel
$M$-smoothers have been derived in Härdle and Luckhaus (1984). In the context of time series, robust estimation and prediction have been discussed by Velleman (1977, 1980), Mallows (1980) and Härdle and Tuan (1986). Robust nonparametric prediction of time series by $M$-smoothers has been investigated by Robinson (1984, 1987b), Collomb and Härdle (1986) and Härdle (1986c). Robust kernel smoothers for estimation of derivatives have been investigated in Härdle and Gasser (1985) and Tsybakov (1986).

Exercises

6.1.1 Find conditions such that $L$-smoothers, as defined in 6.1.1, are consistent estimators for the regression curve.

6.1.2 Find conditions such that $R$-smoothers, as defined in 6.1.2, asymptotically converge to the true regression curve.

6.1.3 Do you expect the general $L$-smoothers 6.1.1 to produce smoother curves than the running median?

6.1.4 Construct a fast algorithm for $L$-smoothers 6.1.1. Based on the ideas of efficient running median smoothing (Section 3.8) you should be able to find a code that runs in $O(n \log k)$ steps ($k$ is the number of neighbors).

6.1.5 Prove consistency for the $M$-smoother 6.1.4 for monotone $\psi$ functions.

[Hint: Follow the proof of Huber (1981, chapter 3).]

6.1.6 Can you extend the proof of Exercise 6.1.5 to nonmonotone $\psi$ functions such as Hampel's ``three-part redescender"?