3.10 The regressogram

This name was coined by Tukey (1961) to accentuate the relationship of this smoother to the histogram. The regressogram is an average of those response variables whose corresponding $X$s fall into disjoint bins spanning the $X$-observation space (Tukey, 1947). It can be thought of as approximating $m(x)$ by a step function and is in fact a kernel estimate (with uniform kernel) evaluated at the midpoints of the bins. Convergence in mean squared error has been shown by Collomb (1977) and Lecoutre (1983, 1984). Figure 3.15 shows the motorcycle data set together with a regressogram of bin size 4.

Figure 3.15: A regressogram smooth of the motorcycle data. The raw data are indicated by squares; the regressogram (solid line) has bin-size 4 and bin-edge 0. ANRmotregress.xpl
\includegraphics[scale=0.7]{ANRmotregress.ps}
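
For readers who wish to experiment with this estimator, the following Python sketch computes a regressogram for a chosen bin width and bin origin. It is only an illustration (not the XploRe macro ANRmotregress.xpl behind Figure 3.15); the function name, the default parameters and the synthetic data are placeholders.

\begin{verbatim}
import numpy as np

def regressogram(x_grid, X, Y, binwidth=4.0, origin=0.0):
    """Step-function estimate of m(x): average the Y's whose X's fall into
    the same bin of width `binwidth`, with bins anchored at `origin`."""
    bin_idx = np.floor((X - origin) / binwidth).astype(int)    # bin of each X_i
    means = {b: Y[bin_idx == b].mean() for b in np.unique(bin_idx)}
    grid_idx = np.floor((x_grid - origin) / binwidth).astype(int)
    return np.array([means.get(b, np.nan) for b in grid_idx])  # NaN for empty bins

# toy usage with synthetic data (the motorcycle data are not reproduced here)
rng = np.random.default_rng(0)
X = rng.uniform(0, 60, 133)
Y = np.sin(X / 10) + rng.normal(scale=0.3, size=X.size)
x_grid = np.linspace(0, 60, 200)
m_hat = regressogram(x_grid, X, Y, binwidth=4.0, origin=0.0)
\end{verbatim}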

Although the regressogram is a special kernel estimate, it is by definition always a discontinuous step function, which might obstruct the perception of features that are ``below the bin-size." Recall Figures 1.2 and 2.5. Both show the average expenditures for potatoes. The regressogram (Figure 2.5) captures the general unimodal structure but cannot resolve a slight second mode at $x\approx 2$, the double income level. This slight mode was modeled by the kernel smoother in Figure 1.2.

A $k$-$NN$ analogue of the regressogram has also been proposed. Instead of averaging the response variables in bins of fixed width, the statistically equivalent block regressogram is constructed by always averaging over the $k$ nearest neighbors. The result is again a step function, but now with varying lengths of the windows over which averaging is performed. Bosq and Lecoutre (1987) consider consistency and rates of convergence of this estimator.

3.10.1 Convolution smoothing

The idea of convolution smoothing was proposed by Clark (1977) and has strong relations to kernel smoothing (see Section 3.1). The CS-estimator (CS for convolution-smoothing) is defined as

\begin{displaymath}\hat m_{CS}(x)=\int G_n(t)\ K_h(x-t)\ dt,\end{displaymath}

where $G_n(t)$ is the linear interpolant of the data $\{ (X_i,Y_i) \}_{i=1}^n$ and $K_h$ is a kernel sequence as in Section 3.1. Clark (1980) studied prediction and calibration of carbon-14 curves and presented Monte Carlo simulations in which the smoothing parameter $h$ had been selected by cross-validation, a method to be analyzed in detail in Section 5.1. It is easy to see that the CS-estimator also has the form of a weighted average:

\begin{displaymath}\hat m_{CS}(x)=n^{-1}\ \sum_{i=1}^n W_{ni}^{(CS)}(x)\ Y_i, \end{displaymath} (3.5.4)

where

\begin{displaymath}W_{ni}^{(CS)}(x)=\int v_i(t)\ K_h(x-t)\ dt \end{displaymath}

and $v_i(t)$ is the linear interpolant of the points $\{ (X_j,\delta_{ij} )\}_{j=1}^n $, $\delta_{ij}$ being the Kronecker delta.
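
The CS-estimator can be evaluated numerically by a straightforward discretization of the integral. The sketch below assumes a Gaussian kernel for $K_h$ and a trapezoidal-rule approximation on a fine grid; both choices, as well as the function name, are illustrative and not part of Clark's proposal.

\begin{verbatim}
import numpy as np

def cs_estimator(x, X, Y, h, n_grid=2000):
    """Convolution-smoothing estimate m_CS(x) = int G_n(t) K_h(x - t) dt,
    with G_n the linear interpolant of the data and K_h a Gaussian kernel
    (any kernel from Section 3.1 could be substituted)."""
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    t = np.linspace(Xs[0], Xs[-1], n_grid)   # fine grid over the X-range
    G = np.interp(t, Xs, Ys)                 # linear interpolant G_n(t)
    K = np.exp(-0.5 * ((x - t) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    return np.trapz(G * K, t)                # trapezoidal approximation

# usage: evaluate the smoother at a single point, e.g.
# m_hat = cs_estimator(x=20.0, X=X, Y=Y, h=2.0)
\end{verbatim}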

3.10.2 Delta function sequence estimators

A delta function sequence (DFS) is a sequence of smooth weighting functions $\{ \delta_n (x)\} $, approximating the Dirac $\delta$-function for large $n$. These DFSs were used by Johnston (1979) in forming the following type of regression estimator,

\begin{displaymath}
\hat m_{\delta }(x)= { n^{-1}\ \sum_{i=1}^n \delta_n(x-X_i)\ Y_i \over
n^{-1}\ \sum_{i=1}^n \delta_n (x-X_i)}.
\end{displaymath} (3.10.39)

This estimator in fact generalizes the Nadaraya-Watson kernel estimator, which corresponds to the DFS $\delta_n(u)=K_h (u)$. Watson and Leadbetter (1963, 1964) formulated the conditions
\begin{eqnarray*}
(a) && \sup_n\ \int \left\vert \delta_n (u) \right\vert \ du < \infty, \qquad (3.10.40) \\
(b) && \int \delta_n (u)\ du = 1, \\
(c) && \sup_{\left\vert u\right\vert \ge \alpha } \left\vert \delta_n(u) \right\vert \to 0
\quad \hbox{for\ all\ } \alpha > 0, \\
(d) && \int_{\left\vert u\right\vert \ge \alpha} \delta_n (u)\ du \to 0
\quad \hbox{for\ all\ } \alpha >0. \qquad (3.10.41)
\end{eqnarray*}

Under these general conditions on $\{ \delta_n\} $ and continuity assumptions on $m$ and $f$ it can be shown that

\begin{displaymath}\hat m_{\delta } (x) \buildrel p \over \to m(x). \end{displaymath}

Further restrictions on the speed of convergence of the DFSs permit formulation of a central limit theorem for $\hat m_{\delta }$ (Johnston 1979). There is a richer literature on DFS estimation of densities (see e.g. Walter and Blum 1979).
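
As a small illustration of (3.10.39), the sketch below evaluates the DFS estimator for an arbitrary member $\delta_n$ of the sequence. The Gaussian choice $\delta_n(u)=K_h(u)$, which satisfies (a)-(d) as $h\to 0$ and reproduces the Nadaraya-Watson estimator, is used only as one admissible example.

\begin{verbatim}
import numpy as np

def dfs_estimator(x, X, Y, delta_n):
    """DFS estimator (3.10.39): a ratio of local averages built from an
    arbitrary delta function sequence member delta_n(u)."""
    w = delta_n(x - X)
    return np.sum(w * Y) / np.sum(w)

def gauss_dfs(u, h=2.0):
    # one admissible DFS member (an assumption): delta_n(u) = K_h(u), Gaussian
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

# m_hat = dfs_estimator(x=20.0, X=X, Y=Y, delta_n=gauss_dfs)
\end{verbatim}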

3.10.3 Median smoothing

Suppose that the aim of approximation is the conditional median curve med$(Y \vert X=x) $ rather than the conditional mean curve. A sequence of ``local medians" of the response variables defines the median smoother. This estimator and related robust smoothers are treated in more theoretical detail in Chapter 6, but it makes sense to present it here already, since median smoothing played a dominant role in the historical evolution of smoothing techniques. More formally, it is defined as

\begin{displaymath}\hat m(x)=\rm {med}\{ Y_i: i\in J_x\},\end{displaymath}

where

\begin{displaymath}J_x=\{ i\!\!:X_i\ \hbox{is\ one\ of\ the\ } k \hbox{-nearest\ neighbors\
of\ } x\}. \end{displaymath}

It has obvious similarities to the $k$-$NN$ estimate (3.4.18) but differs in at least two aspects: Median smoothing is highly resistant to outliers and it is able to model unexpected discontinuities in the regression curve med$(Y \vert X=x) $. A comparison of both smoothing techniques is given in Figure 3.16, which shows the motorcycle data set (Table 1 in Appendix 2) with a median smooth and a $k$-$NN$ smooth.

Figure 3.16: Running median and a $k$-$NN$ smooth. The squares indicate the raw data (motorcycle data set, Table 3.1), the curve with label 1 is the running median, the curve with label 2 denotes the $k$-$NN$ smooth, $k=15$. ANRmotcompare.xpl
\includegraphics[scale=0.7]{ANRmotcompare.ps}

Note that the robustness of median smoothing becomes visible here. The median smooth is not influenced by the group of possible outliers near $x\approx 35$, and it is a little closer to the main body of the data in the two ``peak regions" $(x=20,32)$. A slight disadvantage is that, by its nature, the median smooth is a rough function.

Median smoothing seems to require more computing time than the $k$-$NN$ estimate (due to sorting operations). The simplest algorithm for running medians would sort in each window. This would result in $O(n k \log(k))$ operations using a fast sorting routine. Using the fast median algorithm by Bent and John (1985) this complexity could be reduced to $O(nk)$ operations. Härdle and Steiger (1988) have shown that by maintaining a double heap structure as the window moves over the span of the $X$-variables, this complexity can be reduced to $O(n\log(k))$ operations. Thus running medians are only by a factor of $\log(k)$ slower than $k$-$NN$ smoothers.
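
The sketch below implements the naive sort-in-each-window algorithm just described. It evaluates the running median at the ordered observation points using a symmetric window of roughly $k$ neighbors, which is a simplification of the index set $J_x$; the faster algorithms of Bent and John (1985) and Härdle and Steiger (1988) are not reproduced here.

\begin{verbatim}
import numpy as np

def running_median(X, Y, k=15):
    """Naive median smoother: for each (sorted) X_i take the median of the
    Y's in a window of roughly k neighbors in the sorted X-order. Sorting
    within each window gives the O(n k log(k)) cost mentioned above."""
    order = np.argsort(X)
    Xs, Ys = X[order], Y[order]
    n, half = len(Xs), k // 2
    m_hat = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        m_hat[i] = np.median(Ys[lo:hi])      # sort-based median of the window
    return Xs, m_hat

# Xs, med = running_median(X, Y, k=15)
\end{verbatim}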

3.10.4 Split linear fits

A useful assumption for the mathematical analysis of nonparametric smoothing methods is the continuity of the underlying regression curve $m$. In some situations a curve with steps, abruptly changing derivatives or even cusps might be more appropriate than a smooth regression function. McDonald and Owen (1986) give several examples: these include Sweezy's kinked demand curve (Lipsey, Sparks and Steiner 1976) in microeconomics and daily readings of the sea surface temperature. Figure 3.17 shows a sawtooth function together with a kernel estimator.

Figure 3.17: A kernel smooth applied to a sawtooth function. From McDonald and Owen (1986) with permission of the American Statistical Association. ANRsawtooth.xpl
\includegraphics[scale=0.7]{ANRsawtooth.ps}

The kernel estimate is qualitatively smooth but by construction must blur the discontinuity. McDonald and Owen (1986) point out that smoothing by running medians has no trouble finding the discontinuity but appears to be very rough. They therefore proposed the split linear smoother. Suppose that the $X$-data are ordered, that is, $X_j\le X_{j+1}$. The split linear smoother begins by obtaining at $x$ a family of linear fits corresponding to a family of windows. These windows are an ensemble of neighborhoods of $x$ with different spans, centered at $x$ or having $x$ as their left or right boundary. The split linear smoother at the point $x$ is then obtained as a weighted average of the linear fits there, with weights depending on a measure of quality of the corresponding linear fits. In Figure 3.18 the sawtooth data are presented together with the split linear fit. This smoother captures the discontinuities of the sawtooth curve and is smooth elsewhere. Theoretical aspects (confidence bands, convergence to $m$) are described in Marhoul and Owen (1984).

Figure 3.18: The split linear fit applied to a sawtooth function. From McDonald and Owen (1986) with permission of the American Statistical Association.
\includegraphics[scale=0.15]{ANR3,18.ps}
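
To make the construction concrete, here is a strongly simplified sketch of the idea: at each point $x$ three least squares lines are fitted, on a left-sided, a centered and a right-sided window of span $h$, and they are combined with weights inversely proportional to their mean squared residuals, so that a fit straddling a jump receives little weight. McDonald and Owen's actual smoother uses a whole family of spans and a more refined quality measure, so this is an illustration of the principle only.

\begin{verbatim}
import numpy as np

def split_linear(x, X, Y, h):
    """Simplified split linear smoother: combine left-, center- and
    right-window linear fits at x, each weighted by the inverse of its
    mean squared residual (a crude stand-in for the quality measure)."""
    windows = [(x - h, x), (x - h / 2, x + h / 2), (x, x + h)]
    fits, weights = [], []
    for lo, hi in windows:
        mask = (X >= lo) & (X <= hi)
        if mask.sum() < 3:                      # too few points for a line
            continue
        b, a = np.polyfit(X[mask], Y[mask], 1)  # slope, intercept
        resid = Y[mask] - (a + b * X[mask])
        fits.append(a + b * x)                  # value of this fit at x
        weights.append(1.0 / (np.mean(resid ** 2) + 1e-12))
    return np.dot(weights, fits) / np.sum(weights)

# m_hat = np.array([split_linear(x, X, Y, h=5.0) for x in x_grid])
\end{verbatim}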

3.10.5 Empirical regression

Schmerling and Peil (1985) proposed to estimate the unknown joint density $f(x,y)$ of $(X,Y)$ and then to estimate $m(x)$ by the standard formula $m(x)=\int y\ f(x,y)\ dy \big/ \int f(x,y)\ dy$. In particular, they proposed to use the mixture


\begin{displaymath}\hat f(x,y) = n^{-1} \sum_{i=1}^n f_{XY}(x, y; X_i, Y_i), \end{displaymath}

where $f_{XY}(x, y; X_i, Y_i) $ are known densities. The evaluation of the conditional expectation on the basis of this estimate results in a curve which they call the empirical regression curve. There are many possibilities for a choice of $f_{XY}$. From

\begin{displaymath}f_{XY}(x, y; X_i, Y_i) = K_h (x-X_i) K_h (y - Y_i) \end{displaymath}

for instance, one obtains

\begin{displaymath}\hat m_h(x) =
{ n^{-1} \sum^n_{i=1} K_h(x-X_i) Y_i \over n^{-1} \sum^n_{i=1} K_h(x-X_i)}.\end{displaymath}

This is the Nadaraya-Watson estimator. It is also possible to take a special two-dimensional Gaussian density


\begin{displaymath}
f_{XY}(x, y; X_i, Y_i) =
{1 \over 2 \pi \sigma^2 \sqrt{\mathop{\rm det}(S)}}\,
\exp\left(- {Q_i(x,y) \over 2 \sigma^2 \mathop{\rm det}(S) }\right), \end{displaymath}

with $S$ a matrix with components

\begin{eqnarray*}
S_{xx} &=& n^{-1}\sum_{i=1}^n (X_i - \bar X)^2, \\
S_{yy} &=& n^{-1}\sum_{i=1}^n (Y_i - \bar Y)^2, \\
S_{xy} = S_{yx} &=& n^{-1}\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X),
\end{eqnarray*}



and the quadratic form

\begin{displaymath}Q_i(x,y) = S_{yy} (x - X_i)^2 -
2 S_{xy} (x - X_i)(y - Y_i) +
S_{xx} (y - Y_i)^2. \end{displaymath}

Then the empirical regression curve for a Gaussian kernel is given by
\begin{displaymath}
\hat m_h(x) = {\sum^n_{i=1} \left( Y_i + {S_{xy} \over S_{xx}} (x-X_i) \right)
\exp\left( - { (x-X_i)^2 \over 2 \sigma^2 S_{xx} }\right) \over
\sum^n_{i=1} \exp\left( - { (x-X_i)^2 \over 2 \sigma^2 S_{xx} }\right)}.
\end{displaymath} (3.10.42)

Figure 3.19 gives an impression of how this empirical regression curve works with real data. For more details I refer to Schmerling and Peil (1985).

Figure 3.19: Empirical regression for the thickness of the myelin sheath as a function of the axon diameter. From Schmerling and Peil (1985) with permission of Gegenbaurs morphologisches Jahrbuch.
\includegraphics[scale=0.2]{ANR3,19.ps}
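
As a small numerical illustration of (3.10.42), the sketch below computes the empirical regression curve with the Gaussian components above; the function name and the choice of $\sigma$ are placeholders.

\begin{verbatim}
import numpy as np

def empirical_regression(x, X, Y, sigma):
    """Empirical regression curve (3.10.42): a weighted average of the
    local linear predictions Y_i + (S_xy / S_xx)(x - X_i), with Gaussian
    weights in x and sample moments S_xx, S_xy as defined above."""
    S_xx = np.mean((X - X.mean()) ** 2)
    S_xy = np.mean((X - X.mean()) * (Y - Y.mean()))
    w = np.exp(-(x - X) ** 2 / (2 * sigma ** 2 * S_xx))
    pred = Y + (S_xy / S_xx) * (x - X)
    return np.sum(pred * w) / np.sum(w)

# m_hat = np.array([empirical_regression(x, X, Y, sigma=0.5) for x in x_grid])
\end{verbatim}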

Exercises

3.10.1 Vary the ``origin" of the regressogram, that is, define the bins over which to average the response variable as

\begin{displaymath}\left( (j-1+l/m)h,\ (j+l/m)h \right), \quad l=0, \ldots, m-1, \ j = \ldots,
-2, -1, 0, 1, 2, \ldots \end{displaymath}

Study qualitatively the behavior of the regressogram by comparing the regressograms obtained for different values of $l$.

3.10.2 Average the $m$ regressograms defined in Exercise 3.10.1. Do you see any connection with the kernel technique? [Hint: In Section 3.1 we called this Weighted Averaging of Rounded Points the WARPing technique.]

3.10.3 Find a correspondence between the conditions (3.10.40) and the assumptions needed for the kernel consistency result, Proposition 3.1.1.

3.10.6 Complements

Schmerling and Peil (1985) also considered more general polynomial fits than those introduced in Section 3.1. These local polynomial fits can be obtained by using approximating polynomials of higher order or by using kernels other than the uniform kernel in (3.1.13); see also Katkovnik (1979, 1983, 1985) and Lejeune (1985).

Suppose that $m(x)$ is to be approximated by a polynomial

\begin{displaymath}\sum^p_{j=0} \alpha_j(x) \varphi_j(u-x),\quad p+1<n, \end{displaymath}

where $\varphi_j(\cdot)$ are polynomials of order $j$ and $\alpha_j(x)$ are some weighting coefficients. If the weight sequence $W_{hi}(x)=K_h(x-X_i)$ with a positive symmetric kernel $K$ is used, the least squares problem

\begin{displaymath}\sum_{i=1}^nW_{hi}(x)\ \left[ Y_i-\sum_{j=0}^p \alpha_j(x) \varphi_j
(X_i-x)\right]^2 {\buildrel ! \over =}\min \end{displaymath}

gives coefficients $\{ \hat \alpha_j(x)\}^p_{j=0}.$ A necessary and sufficient condition is

\begin{displaymath}2 \sum_{i=1}^n W_{hi}(x)\ \left[ Y_i-\sum_{j=0}^p\alpha_j (x)\varphi_j
(X_i-x)\right] \varphi_k(X_i-x)=0, \quad k=0,\ldots ,p. \end{displaymath}

This linear system of equations can be solved easily if the $\varphi_j$s are orthogonal polynomials, that is, for some constant $C_x$

\begin{displaymath}\sum^n_{i=1}
W_{hi}(x) \varphi_j(X_i-x) \varphi_k(X_i-x)=C_x I (j=k).\end{displaymath}

Hence the least squares coefficients $\{ \hat \alpha_j(x) \}$ can be written as

\begin{displaymath}\hat{\alpha}_j(x)= {\sum^n_{i=1} W_{hi}(x) \varphi_j(X_i-x)\ Y_i \over
\sum^n_{i=1} W_{hi}(x) \varphi^2_j (X_i-x)} ,\quad j=0,\ldots,p, \end{displaymath}

and the local polynomial fit can be written as

\begin{displaymath}\hat m_h(x)=n^{-1} \sum^n_{i=1} W^*_{hi}(x) Y_i\end{displaymath}

with effective weight sequence

\begin{displaymath}W^*_{hi}(x)= \sum^p_{j=0} \left[ {W_{hi}(x) \varphi_j (X_i-x) \over
n^{-1} \sum^n_{k=1} W_{hk}(x) \varphi^2_j (X_k-x) } \right] \varphi_j(0).\end{displaymath}

So here again the local polynomial fitting leads to a weight sequence $\{W_{hi}^*(x)\}$, but this weight sequence is slightly more complicated than that for the ordinary kernel estimator.
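
The following sketch carries out the kernel-weighted least squares fit above directly, using ordinary monomials instead of orthogonal polynomials (the orthogonal basis simplifies the algebra but yields the same fitted curve) and a Gaussian kernel for the weights $W_{hi}(x)$; both choices are assumptions made for the illustration.

\begin{verbatim}
import numpy as np

def local_poly_fit(x, X, Y, h, p=1):
    """Kernel-weighted least squares fit of a degree-p polynomial in
    (X_i - x); the estimate of m(x) is the fitted polynomial evaluated
    at u = x, i.e. the intercept coefficient."""
    W = np.exp(-0.5 * ((x - X) / h) ** 2)           # weights W_hi(x)
    D = np.vander(X - x, N=p + 1, increasing=True)  # columns 1, (X_i-x), ...
    sw = np.sqrt(W)
    alpha, *_ = np.linalg.lstsq(sw[:, None] * D, sw * Y, rcond=None)
    return alpha[0]                                 # polynomial value at 0

# m_hat = np.array([local_poly_fit(x, X, Y, h=2.0, p=2) for x in x_grid])
\end{verbatim}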