10. Investigating multiple regression by additive models

                 While it is possible to encode several more dimensions into pictures by using time (motion), color, and various symbols (glyphs), the human perceptual system is not really prepared to deal with more than three continuous dimensions simultaneously.

Huber, P.J. (1985, p. 437)

The basic idea of scatter plot smoothing can be extended to higher dimensions in a straightforward way. Theoretically, the regression smoothing for a $d$-dimensional predictor can be performed as in the case of a one-dimensional predictor. The local averaging procedure will still give asymptotically consistent approximations to the regression surface. However, there are two major problems with this approach to multiple regression smoothing. First, the regression function $m(x)$ is a high-dimensional surface and, since its form cannot be displayed for $d>2$, it does not provide a geometrical description of the regression relationship between $X$ and $Y$. Second, the basic element of nonparametric smoothing - averaging over neighborhoods - will often be applied to a relatively meager set of points, since even samples of size $n\ge 1000$ are surprisingly sparsely distributed in higher-dimensional Euclidean space. The following two examples by Werner Stuetzle exhibit this ``curse of dimensionality''.

A possible procedure for estimating two-dimensional surfaces could be to find the smallest rectangle with axis-parallel sides containing all the predictor vectors and to lay down a regular grid on this rectangle. This gives a total of one hundred cells if one cuts each side of a two-dimensional rectangle into ten pieces. Each inner cell will have eight neighboring cells. If one carried out this procedure in ten dimensions there would be a total of $10^{10}=10,000,000,000$ cells and each inner cell would have $3^{10}-1=59048$ neighboring cells. In other words, it will be hard to find neighboring observations in ten dimensions!
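
For readers who wish to verify these counts, the following short Python sketch recomputes the number of cells and the number of neighbors of an inner cell; the dimension $d$ and the number of cuts per side are the only inputs, taken from the text above.

\begin{verbatim}
# Number of cells, and of neighbors of an inner cell, for a regular
# grid with `cuts` pieces per side in dimension d.
for d in (2, 10):
    cuts = 10
    n_cells = cuts ** d              # 10**2 = 100,  10**10 = 10,000,000,000
    n_neighbors = 3 ** d - 1         # 3**2 - 1 = 8,  3**10 - 1 = 59048
    print(d, n_cells, n_neighbors)
\end{verbatim}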

Suppose now one had $n = 1000$ points uniformly distributed over the ten-dimensional unit cube $[0,1]^{10}$. What is our chance of catching some points in a neighborhood of reasonable size? An average over a neighborhood of diameter $0.3$ (in each coordinate) results in a volume of $0.3^{10}\approx 5.9\cdot 10^{-6}$ for the corresponding ten-dimensional cube. Hence, the expected number of observations in this cube will be $5.9\cdot 10^{-3}$ and not much averaging can be expected. If, on the other hand, one fixes the count $k=10$ of observations over which to average, the diameter of the typical (marginal) neighborhood will be larger than $0.63$, which means that the average extends over at least two-thirds of the range along each coordinate.
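
The same arithmetic can be checked directly; the cube side $0.3$ and the count $k=10$ below are just the values used in the text.

\begin{verbatim}
# Expected number of points in a marginal cube of side 0.3, and the
# side length needed to expect k = 10 of n = 1000 uniform points.
n, d = 1000, 10
volume = 0.3 ** d                    # ~ 5.9e-6
expected = n * volume                # ~ 5.9e-3 observations on average
side_for_k = (10 / n) ** (1 / d)     # ~ 0.63 along each coordinate
print(volume, expected, side_for_k)
\end{verbatim}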

A first view of the sparsity of high-dimensional data could lead one to the conclusion that one is simply in a hopeless situation - there is just not enough clay to make the bricks! This first view, however, is, as many first views are, a little too rough. Assume, for example, that the ten-dimensional regression surface is a function of $x_1$, the first coordinate of $X$, alone and constant in all other coordinates. In this case the ten-dimensional estimation problem collapses to a one-dimensional one. A similar conclusion holds if the regression surface is a function only of certain linear combinations of the coordinates of the predictor variable. The basic idea of additive models is to take advantage of the fact that a regression surface may have such a simple, additive structure.

A regression tree is based on such a structure. The regression surface is approximated by a linear combination of step functions

\begin{displaymath}m(x) =\sum_{i=1}^p c_i\, I\{ x\in N_i\},\end{displaymath}

where the $N_i$ are disjoint hyper-rectangles with sides parallel to the coordinate axes. The hyper-rectangles are constructed by successive splits and can be represented as a tree. A recursive partitioning regression (RPR) algorithm for constructing such a tree is described in Section 10.1.
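
To make the step-function idea concrete, the following Python sketch fits such a piecewise-constant surface by greedy recursive partitioning. It is only an illustration of the principle: the function names rpr_fit and rpr_predict as well as the stopping parameters depth and min_leaf are choices made for this example and do not reproduce the exact splitting and stopping rules of the RPR algorithm in Section 10.1.

\begin{verbatim}
import numpy as np

def rpr_fit(x, y, depth=2, min_leaf=10):
    """Greedy recursive partitioning: returns a nested dict of
    axis-parallel splits; each leaf stores the local mean c_i."""
    n, d = x.shape
    if depth == 0 or n < 2 * min_leaf:
        return {"leaf": True, "value": y.mean()}
    best = None
    for j in range(d):                        # try every coordinate ...
        for s in np.unique(x[:, j])[1:]:      # ... and every split point
            left = x[:, j] < s
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if best is None or rss < best[0]:
                best = (rss, j, s, left)
    if best is None:
        return {"leaf": True, "value": y.mean()}
    _, j, s, left = best
    return {"leaf": False, "coord": j, "split": s,
            "lo": rpr_fit(x[left], y[left], depth - 1, min_leaf),
            "hi": rpr_fit(x[~left], y[~left], depth - 1, min_leaf)}

def rpr_predict(tree, xnew):
    """Evaluate the fitted step function at a single point xnew."""
    if tree["leaf"]:
        return tree["value"]
    side = "lo" if xnew[tree["coord"]] < tree["split"] else "hi"
    return rpr_predict(tree[side], xnew)
\end{verbatim}

A call such as tree = rpr_fit(x, y) followed by rpr_predict(tree, x[0]) evaluates the estimated surface at a single observation.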

Another additive model is projection pursuit regression (PPR) (Friedman and Stuetzle 1981). This model is an extension of the regression tree model and is defined through projections $\beta_j^T x, \Vert \beta_j\Vert =1, j=1, \ldots, p$. It models the regression surface as

\begin{displaymath}m(x) =\sum_{j=1}^p g_j (\beta_j^T x);\end{displaymath}

see Section 10.2.
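
As an illustration of the projection idea, here is a toy one-term fit for a two-dimensional predictor: it scans candidate unit directions, smooths the response along each projection with a crude running-mean smoother, and keeps the direction with the smallest residual sum of squares. The names one_term_ppr and running_mean_smoother, the grid of 180 directions and the neighborhood size k are assumptions of this sketch; the PPR algorithm of Friedman and Stuetzle adds several ridge terms to successive residuals and optimizes the directions numerically rather than over a grid.

\begin{verbatim}
import numpy as np

def running_mean_smoother(z, y, k=20):
    """k-nearest-neighbour running mean of y along the projection z,
    returned at the observed points."""
    order = np.argsort(z)
    y_sorted = y[order]
    fit_sorted = np.array([y_sorted[max(0, i - k // 2): i + k // 2 + 1].mean()
                           for i in range(len(z))])
    fit = np.empty_like(fit_sorted)
    fit[order] = fit_sorted
    return fit

def one_term_ppr(x, y, n_directions=180):
    """Toy one-term projection pursuit fit for a two-column x."""
    best = None
    for angle in np.linspace(0.0, np.pi, n_directions, endpoint=False):
        beta = np.array([np.cos(angle), np.sin(angle)])
        z = x @ beta                              # projection beta' x
        fit = running_mean_smoother(z, y)         # g_hat(beta' x)
        rss = ((y - fit) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, beta, fit)
    return best[1], best[2]                       # direction and fitted values
\end{verbatim}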

PPR applies one-dimensional nonparametric functions to linear combinations of the predictor variables, whereas the alternative ACE model determines unknown, possibly nonlinear, one-dimensional nonparametric transformations of the individual coordinates of the predictor variable and combines them additively; see Section 10.3.
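
The additive structure behind this model can be illustrated by a simple backfitting loop that updates one coordinate function at a time by smoothing partial residuals (reusing the running_mean_smoother from the PPR sketch above). This is only a sketch of the additive part; the ACE procedure of Section 10.3 estimates the unknown transformations by alternating conditional expectation steps, which this loop does not reproduce.

\begin{verbatim}
import numpy as np

def backfit_additive(x, y, n_iter=10, k=20):
    """Backfitting sketch for an additive fit y ~ c + sum_j f_j(x_j):
    each component f_j is refitted by smoothing the partial residuals
    against the j-th coordinate (running_mean_smoother as defined in
    the PPR sketch above)."""
    n, d = x.shape
    const = y.mean()
    f = np.zeros((n, d))                     # current component fits f_j(x_ij)
    for _ in range(n_iter):
        for j in range(d):
            partial = y - const - f.sum(axis=1) + f[:, j]
            f[:, j] = running_mean_smoother(x[:, j], partial, k)
            f[:, j] -= f[:, j].mean()        # center each component
    return const, f
\end{verbatim}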

The last technique considered here is related to PPR; it is based on the model

\begin{displaymath}Y= g (\delta^TX) +\varepsilon = m(X)+\varepsilon.\end{displaymath}

The coefficients $\delta$ are defined differently, namely as $\delta = E[m'(X)]$, an average derivative (ADE); see Section 10.4. This estimation technique is also important in theoretical economics, in particular in questions related to the law of demand (see Härdle, Hildenbrand and Jerison 1989).
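
The defining formula $\delta = E[m'(X)]$ can be illustrated by a crude plug-in computation: estimate $m$ by a kernel smoother, differentiate it numerically at the observations and average the gradients. The bandwidth h, the step eps and the function names nw_estimate and average_derivative are choices of this sketch only; the estimator actually studied in Section 10.4 is constructed differently.

\begin{verbatim}
import numpy as np

def nw_estimate(x, y, points, h=0.5):
    """Nadaraya-Watson estimate of m with a Gaussian product kernel,
    evaluated at the rows of `points`."""
    diffs = (points[:, None, :] - x[None, :, :]) / h      # shape (m, n, d)
    weights = np.exp(-0.5 * (diffs ** 2).sum(axis=2))     # shape (m, n)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

def average_derivative(x, y, h=0.5, eps=1e-2):
    """Plug-in sketch of delta = E[m'(X)]: average the numerical
    gradient of the kernel regression estimate over the sample."""
    n, d = x.shape
    grad = np.zeros((n, d))
    for j in range(d):
        shift = np.zeros(d)
        shift[j] = eps
        grad[:, j] = (nw_estimate(x, y, x + shift, h)
                      - nw_estimate(x, y, x - shift, h)) / (2 * eps)
    return grad.mean(axis=0)
\end{verbatim}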