8.1 Backfitting

The backfitting procedures proposed by Breiman & Friedman (1985) and Buja et al. (1989) are widely used to approximate the additive components $ g_\alpha(\bullet) $ of (8.1). The latter paper considers the problem of finding the projection of $ m(\bullet)$ onto the space of additive models represented by the right hand side of (8.1). Using the observed sample, this leads to a system of normal equations of dimension $ nd\times nd$. To solve this system in practice, a backfitting (Gauss-Seidel) algorithm is used.

The backfitting technique is iterative, which leads to additional difficulties in developing asymptotic theory. Moreover, the final estimates may depend on the starting values or on the convergence criterion. Since its first introduction, the method has therefore been refined considerably and extended to more complicated models. We refer to the bibliographic notes at the end of this and the following chapter for more information.

8.1.1 Classical Backfitting

We will now consider the problem of estimating the additive conditional expectation of $ Y$ given $ {\boldsymbol{X}}$ based on the random sample $ \{{\boldsymbol{X}}_i,Y_i\}$, $ i=1,\ldots,n$. The component functions $ g_\alpha(\bullet) $ in (8.1) explain the specific impact of the particular component $ X_\alpha$ on the response $ Y$.

We first introduce an identification assumption. Let

$\displaystyle E_{X_\alpha}\{ g_\alpha (X_\alpha ) \} = 0$ (8.2)

for all $ \alpha$ which consequently yields

$\displaystyle c=E(Y).$

Note that condition (8.2) on the marginal effects is no restriction on the model since all we can estimate is the relative impact of each direction $ X_\alpha$. This means the $ g_\alpha$ are otherwise only identified up to a shift in the vertical direction. As before we use the model

$\displaystyle Y=m({\boldsymbol{X}})+\varepsilon,$

where (by definition) $ E(\varepsilon \vert{\boldsymbol{X}})=0$ and $ \mathop{\textrm{Var}}(\varepsilon\vert{\boldsymbol{X}})=\sigma ^2({\boldsymbol{X}})$, i.e. the model is allowed to be heteroscedastic. The constant $ c$ can easily be estimated with $ \sqrt{n}$-rate (i.e. a rate faster than the nonparametric one) by

$\displaystyle \widehat{c} = \overline{Y}=\frac 1n \sum_{i=1}^n Y_i \,.$

Thus, we assume

$\displaystyle c=0 $

for the rest of this section unless indicated otherwise. This is without loss of generality, since otherwise we may replace $ Y_i$ by $ Y_i-\overline{Y}$.

Let us now turn to the estimation of the additive components $ g_\alpha(\bullet) $. Hastie & Tibshirani (1990) motivate the backfitting method as follows. Consider the optimization problem

$\displaystyle \min_m E\left\{ Y-m({\boldsymbol{X}}) \right\}^2\, \quad \textrm{ such that } \quad m({\boldsymbol{X}})=\sum_{\alpha =1}^d g_\alpha (X_\alpha ).$ (8.3)

This means we are searching for the best additive predictor for $ E(Y\vert{\boldsymbol{X}})$, where ``best'' is meant with respect to the expected squared distance. By the theory of projections, there exists a unique solution, which is characterized by the first order conditions
$\displaystyle E\left[ \left\{ Y-m({\boldsymbol{X}})\right\} \big\vert X_\alpha \right] =0 \quad \textrm{ or equivalently } \quad g_\alpha (X_\alpha ) = E\left[ \Big\{ Y-\sum_{k\neq \alpha} g_k (X_k )\Big\} \Big\vert X_\alpha\right] \,,$ (8.4)

for all $ \alpha =1,\ldots ,d$. This leads to the matrix representation

$\displaystyle \left( \begin{array}{cccc}
{\mathcal{I}} & {\mathcal{P}}_1 & \cdots & {\mathcal{P}}_1 \\
{\mathcal{P}}_2 & {\mathcal{I}} & \cdots & {\mathcal{P}}_2 \\
\vdots & \vdots & \ddots & \vdots \\
{\mathcal{P}}_d & {\mathcal{P}}_d & \cdots & {\mathcal{I}}
\end{array} \right)
\left( \begin{array}{c} g_1(X_1) \\ g_2(X_2) \\ \vdots \\ g_d(X_d) \end{array} \right)
= \left( \begin{array}{c} {\mathcal{P}}_1Y \\ {\mathcal{P}}_2Y \\ \vdots \\ {\mathcal{P}}_dY \end{array} \right)\,,$ (8.5)

where $ {\mathcal{P}}_\alpha(\bullet )=E\left( \bullet \vert X_\alpha \right)$. By analogy, let $ {\mathbf{S}}_\alpha$ be an ($ n\times n$) smoother matrix which, when applied to the vector $ {\boldsymbol{Y}}=(Y_1,\ldots,Y_n)^\top $, yields an $ n\times 1$ estimate $ {\mathbf{S}}_\alpha {\boldsymbol{Y}}$ of $ \{ E(Y_1\vert X_{1\alpha}),\ldots ,E(Y_n\vert X_{n\alpha})\}^\top$. Replacing the operator $ {\mathcal{P}}_\alpha $ with the smoother matrix $ {\mathbf{S}}_\alpha$, we obtain the system of equations

$\displaystyle \underbrace{\left( \begin{array}{llll}
{\mathbf{I}} & {\mathbf{S}}_1 & \cdots & {\mathbf{S}}_1 \\
{\mathbf{S}}_2 & {\mathbf{I}} & \cdots & {\mathbf{S}}_2 \\
\vdots & \vdots & \ddots & \vdots \\
{\mathbf{S}}_d & {\mathbf{S}}_d & \cdots & {\mathbf{I}}
\end{array} \right)}_{\widehat{{\mathbf{P}}}}
\underbrace{\left( \begin{array}{c} {\boldsymbol{g}}_1 \\ {\boldsymbol{g}}_2 \\ \vdots \\ {\boldsymbol{g}}_d \end{array} \right)}_{{\boldsymbol{g}}}
= \underbrace{\left( \begin{array}{c} {\mathbf{S}}_1{\boldsymbol{Y}} \\ {\mathbf{S}}_2{\boldsymbol{Y}} \\ \vdots \\ {\mathbf{S}}_d{\boldsymbol{Y}} \end{array} \right)}_{\widehat{{\mathbf{Q}}}{\boldsymbol{Y}}} \,,$ (8.6)

where $ {\boldsymbol{g}}_\alpha = \{ g_\alpha(X_{1\alpha}),\ldots,g_\alpha(X_{n\alpha})\}^\top$ denotes the vector of the $ \alpha$th component function evaluated at the observations. In abbreviated form this reads


$\displaystyle {\widehat{{\mathbf{P}}} {\boldsymbol{g}}} = {\widehat{{\mathbf{Q}}}} {\boldsymbol{Y}}.$ (8.7)

Although $ E(\bullet \vert X_\alpha)$ in fact involves an expectation over all directions $ X_1,\ldots,X_d$, Buja et al. (1989) suggest using only one-dimensional smoothers as a sufficient approximation to $ {\mathcal{P}}_\alpha $.
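
To fix ideas, a minimal sketch (in Python with NumPy) of such a one-dimensional smoother matrix is given below; it assumes a Nadaraya-Watson smoother with Quartic kernel, but any linear smoother (local linear, splines, etc.) could be used in the same way.

\begin{verbatim}
import numpy as np

def quartic_kernel(u):
    """Quartic (biweight) kernel K(u) = 15/16 (1 - u^2)^2 on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def nw_smoother(x, h):
    """Nadaraya-Watson smoother matrix S_alpha for one regressor x (n-vector).

    Row i holds the kernel weights used to estimate E(Y | X_alpha = x_i),
    so that S @ Y returns the fitted values at the observation points."""
    u = (x[:, None] - x[None, :]) / h            # pairwise scaled distances
    w = quartic_kernel(u)
    return w / w.sum(axis=1, keepdims=True)      # normalize rows to sum to one

# S @ Y estimates {E(Y_1 | X_{1 alpha}), ..., E(Y_n | X_{n alpha})}
rng = np.random.default_rng(0)
x = rng.uniform(-2.5, 2.5, 75)
y = np.sin(x) + rng.normal(scale=0.5, size=75)
S = nw_smoother(x, h=1.0)
fitted = S @ y                                   # n smoothed values
\end{verbatim}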

The system (8.7) could in principle be solved exactly for $ \widehat{g}_\alpha(X_{1\alpha}), \ldots, \widehat{g}_\alpha(X_{n\alpha})$, $ \alpha=1,\ldots,d$. Of course, when $ nd$ is large the exact solution of (8.7) is hardly feasible. Furthermore, the matrix $ \widehat{{\mathbf{P}}}$ on the left hand side is often not regular and thus the system cannot be solved directly. Therefore, in practice the backfitting (Gauss-Seidel) algorithm is used to solve these equations: given starting values $ \widehat{{\boldsymbol{g}}}_\alpha^{(0)}$, the vectors $ \widehat{{\boldsymbol{g}}}_\alpha$ are updated through

$\displaystyle \widehat{{\boldsymbol{g}}}_\alpha^{(l)}={\mathbf{S}}_\alpha \left\{ {\boldsymbol{Y}} - \sum_{k \neq \alpha }\widehat{{\boldsymbol{g}}}_k ^{(l-1)}\right\} ,\quad l=1,2,\ldots$ (8.8)

until the distance between successive estimates falls below a prespecified tolerance. Thus, in each step one performs a one-dimensional nonparametric regression on $ X_\alpha$, using the partial residuals $ \{ {\boldsymbol{Y}}-\sum_{k \neq \alpha } \widehat{{\boldsymbol{g}}}_k^{(l-1)} \}$ instead of simply $ {\boldsymbol{Y}}$. To summarize, we present the whole procedure, including the estimation of the (possibly non-zero) constant:
\fbox{\parbox{0.95\textwidth}{
\centerline{{Backfitting Algorithm}}
\hrule
\begin{center}\begin{tabular}{ll}
{\em initialization} & $\widehat{c}=\overline{Y}\,, \quad \widehat{{\boldsymbol{g}}}_\alpha^{(0)}\equiv 0 \quad \textrm{for } \alpha=1,\ldots,d$ \\ [3mm]
{\em repeat} & for $\alpha = 1,\ldots,d$: \quad $\widehat{{\boldsymbol{g}}}_\alpha = {\mathbf{S}}_\alpha \left\{ {\boldsymbol{Y}} - \widehat{c}\,{\boldsymbol{1}} - \sum_{k\neq\alpha} \widehat{{\boldsymbol{g}}}_k \right\}$ \\ [3mm]
{\em until\/} & convergence is reached
\end{tabular}\end{center}}}
Of course, the choice of the convergence criterion is a crucial problem and in practice different software implementations of the algorithm may stop after a different number of iterations. Also, the result of the iteration can obviously depend on the order of the variables $ X_1$, ..., $ X_d$.
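
The following sketch (Python with NumPy) illustrates the complete procedure; the Nadaraya-Watson smoother from the earlier sketch is rebuilt here so that the code is self-contained, a common bandwidth $h$ is assumed for all components, and the recentering inside the loop simply enforces the identification condition (8.2). It is meant as an illustration of (8.8), not as a reference implementation.

\begin{verbatim}
import numpy as np

def nw_smoother(x, h):
    """Nadaraya-Watson smoother matrix with Quartic kernel (see the sketch above)."""
    u = (x[:, None] - x[None, :]) / h
    w = np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)
    return w / w.sum(axis=1, keepdims=True)

def backfit(X, Y, h, tol=1e-6, max_iter=100):
    """Classical backfitting for Y = c + sum_alpha g_alpha(X_alpha) + eps.

    X: (n, d) regressor matrix, Y: (n,) responses, h: common bandwidth.
    Returns the estimated constant and the (n, d) matrix whose column alpha
    contains g_alpha evaluated at the observations X_{i alpha}."""
    n, d = X.shape
    c_hat = Y.mean()                                    # sqrt(n)-consistent estimate of c
    S = [nw_smoother(X[:, a], h) for a in range(d)]
    g = np.zeros((n, d))                                # starting values g_alpha^(0) = 0
    for _ in range(max_iter):
        g_old = g.copy()
        for a in range(d):
            partial = Y - c_hat - g.sum(axis=1) + g[:, a]   # partial residuals
            g[:, a] = S[a] @ partial                        # update step, cf. (8.8)
            g[:, a] -= g[:, a].mean()                       # recenter, cf. (8.2)
        if np.max(np.abs(g - g_old)) < tol:             # convergence criterion
            break
    return c_hat, g
\end{verbatim}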

It is important to note that to estimate $ {\mathcal{P}}_\alpha=E(\bullet \vert X_\alpha)$, the backfitting algorithm uses smoother matrices $ {\mathbf{S}}_\alpha$ that depend only on the components $ X_\alpha$. These smoother matrices are easy and fast to compute since they involve only one-dimensional smoothing problems. This is a strong simplification (also from a theoretical point of view) and may lead to problems both in the theory and in the interpretation of what the procedure is actually fitting in practice. We will come back to this point in Subsection 8.1.3.

For the two-dimensional case we can write down the backfitting estimator explicitly. From equation (8.6) we have

$\displaystyle {\boldsymbol{g}}_1 = {\mathbf{S}}_1 ({\boldsymbol{Y}}-{\boldsymbol{g}}_2) \,, \qquad {\boldsymbol{g}}_2 = {\mathbf{S}}_2 ({\boldsymbol{Y}}-{\boldsymbol{g}}_1) \,.$

Now it is easy to derive the smoother (or hat) matrices for $ \widehat{{\boldsymbol{g}}}_1$ and $ \widehat{{\boldsymbol{g}}}_2$. Let $ \widehat{{\boldsymbol{g}}}_1^{(l)}$ and $ \widehat{{\boldsymbol{g}}}_2^{(l)}$ denote the estimates in the $ l$th updating step. The backfitting procedure steps are
$\displaystyle \widehat{\boldsymbol{g}}_1^{(l)} = {\mathbf{S}}_1 \left\{{\boldsymbol{Y}}-\widehat{\boldsymbol{g}}_2^{(l-1)}\right\} \,, \qquad \widehat{\boldsymbol{g}}_2^{(l)} = {\mathbf{S}}_2 \left\{{\boldsymbol{Y}}-\widehat{\boldsymbol{g}}_1^{(l)}\right\} \,.$

Since the backfitting consists of alternating these steps, we obtain by induction
$\displaystyle \widehat{\boldsymbol{g}}_1^{(l)} = {\boldsymbol{Y}}-\sum_{\alpha=0}^{l-1}({\mathbf{S}}_1{\mathbf{S}}_2)^\alpha ({\mathbf{I}}-{\mathbf{S}}_1){\boldsymbol{Y}} - ({\mathbf{S}}_1{\mathbf{S}}_2)^{l-1}{\mathbf{S}}_1\widehat{\boldsymbol{g}}_2^{(0)}\,,$

$\displaystyle \widehat{\boldsymbol{g}}_2^{(l)} = {\mathbf{S}}_2 \sum_{\alpha=0}^{l-1} ({\mathbf{S}}_1{\mathbf{S}}_2)^\alpha ({\mathbf{I}}-{\mathbf{S}}_1){\boldsymbol{Y}} + {\mathbf{S}}_2({\mathbf{S}}_1{\mathbf{S}}_2)^{l-1}{\mathbf{S}}_1\widehat{\boldsymbol{g}}_2^{(0)}\,.$

Due to the assumptions $ E_{X_\alpha}\{ g_{\alpha}(X_\alpha) \} = 0$ and $ c=0$, the initialization $ \widehat{\boldsymbol{g}}^{(0)}_2 = 0$ is reasonable and we have
$\displaystyle \widehat{\boldsymbol{g}}_1^{(l)} = \Big\{ {\mathbf{I}}-\sum_{\alpha=0}^{l-1}({\mathbf{S}}_1{\mathbf{S}}_2)^\alpha ({\mathbf{I}}-{\mathbf{S}}_1) \Big\} {\boldsymbol{Y}}\,,$

$\displaystyle \widehat{\boldsymbol{g}}_2^{(l)} = \Big\{ {\mathbf{S}}_2 \sum_{\alpha=0}^{l-1} ({\mathbf{S}}_1{\mathbf{S}}_2)^\alpha({\mathbf{I}}-{\mathbf{S}}_1) \Big\} {\boldsymbol{Y}}\,.$

If the operator $ {\mathbf{S}}_1{\mathbf{S}}_2$ is shrinking, i.e. if its matrix norm satisfies $ \Vert {\mathbf{S}}_1{\mathbf{S}}_2 \Vert <1$, then the algorithm converges as $ l \rightarrow \infty$ and yields

$\displaystyle \widehat{\boldsymbol{g}}_1 = \left\{ {\mathbf{I}}-({\mathbf{I}}-{\mathbf{S}}_1{\mathbf{S}}_2)^{-1} ({\mathbf{I}}-{\mathbf{S}}_1) \right\} {\boldsymbol{Y}}\,,$ (8.9)

$\displaystyle \widehat{\boldsymbol{g}}_2 = \left\{ {\mathbf{I}}- ({\mathbf{I}}-{\mathbf{S}}_2{\mathbf{S}}_1)^{-1} ({\mathbf{I}}-{\mathbf{S}}_2) \right\} {\boldsymbol{Y}}\,.$
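
The closed form (8.9) can be checked numerically against the iteration. The sketch below uses centered Nadaraya-Watson smoothers (a common device to remove the constant direction, without which $ {\mathbf{I}}-{\mathbf{S}}_1{\mathbf{S}}_2$ would be singular for kernel smoothers); assuming the data and bandwidths are such that $ \Vert {\mathbf{S}}_1{\mathbf{S}}_2 \Vert <1$ indeed holds, both computations agree.

\begin{verbatim}
import numpy as np

def nw_smoother(x, h):
    u = (x[:, None] - x[None, :]) / h
    w = np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.uniform(-2.5, 2.5, n), rng.uniform(-2.5, 2.5, n)
Y = np.sin(x1) + 0.5 * x2 + rng.normal(size=n)
Y = Y - Y.mean()                                  # work with c = 0 as in the text

I = np.eye(n)
C = I - np.ones((n, n)) / n                       # centering matrix
S1 = C @ nw_smoother(x1, h=1.0)                   # centered smoothers: constants no
S2 = C @ nw_smoother(x2, h=1.0)                   # longer pass through unchanged

# closed-form limits, equation (8.9)
g1 = (I - np.linalg.solve(I - S1 @ S2, I - S1)) @ Y
g2 = (I - np.linalg.solve(I - S2 @ S1, I - S2)) @ Y

# the backfitting iteration converges to the same vectors
g1_it, g2_it = np.zeros(n), np.zeros(n)
for _ in range(200):
    g1_it = S1 @ (Y - g2_it)
    g2_it = S2 @ (Y - g1_it)
print(np.max(np.abs(g1 - g1_it)), np.max(np.abs(g2 - g2_it)))   # both close to zero
\end{verbatim}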

In the general case, because of its iterative nature, theoretical results for backfitting were not available until recently. Opsomer & Ruppert (1997) provide conditional mean squared error expressions, albeit under rather strong conditions on the design and the smoother matrices. Mammen et al. (1999) prove consistency and derive the asymptotic properties under weaker conditions. We will present their backfitting algorithm, which differs from the procedures discussed here, in Subsection 8.1.3.

Before proceeding we will consider two examples illustrating what precision can be achieved by additive regression and how the backfitting algorithm performs in practice.

Figure 8.1: Estimated (solid lines) versus true additive component functions (circles at the input values)
\includegraphics[width=1.4\defpicwidth]{SPMdemob1.ps}

EXAMPLE 8.1  
Consider a regression problem with four input variables $ X_1$ to $ X_4$. When $ n$ is small, it is difficult to obtain precise nonparametric kernel estimates due to the curse of dimensionality. Let us consider a sample of size $ n=75$. We use explanatory variables that are independent and uniformly distributed on $ [-2.5,2.5]$ and responses generated from the additive model

$\displaystyle Y= \sum_{\alpha=1}^4 g_\alpha (X_\alpha) + \varepsilon,\quad
\varepsilon \sim N(0,1).$

The component functions are chosen as

\begin{displaymath}\begin{array}{ll} g_1 (X_1) = -\sin (2X_1), &\quad g_2 (X_2) = X_2^2 - E(X_2^2)\,, \\[2mm]
g_3 (X_3) = X_3 , &\quad g_4 (X_4) = \exp(-X_4)-E\{\exp(-X_4)\}.
\end{array}\end{displaymath}

In Figure 8.1 we have plotted the true functions (at the corresponding observations $ X_{i\alpha}$) and the estimated curves. We used backfitting with univariate local linear smoothers and set the bandwidth to $ h=1$ for each dimension (using the Quartic kernel). We see that even for this small sample size the estimator gives rather precise results. $ \Box$
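
The simulation of this example can be reproduced along the following lines; the sketch assumes the backfit helper from the illustration given after the backfitting algorithm above and uses the theoretical centering constants of the uniform design.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)
n = 75
X = rng.uniform(-2.5, 2.5, size=(n, 4))

g1 = -np.sin(2 * X[:, 0])
g2 = X[:, 1] ** 2 - 2.5 ** 2 / 3                          # E(X^2) = 2.5^2 / 3 on U[-2.5, 2.5]
g3 = X[:, 2]
g4 = np.exp(-X[:, 3]) - (np.exp(2.5) - np.exp(-2.5)) / 5  # E{exp(-X)} on U[-2.5, 2.5]
Y = g1 + g2 + g3 + g4 + rng.normal(size=n)

# 'backfit' is the sketch given after the backfitting algorithm box above
c_hat, g_hat = backfit(X, Y, h=1.0)               # Quartic kernel, h = 1 per dimension
\end{verbatim}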

Next, we turn to a real data example, demonstrating the use of additive regression estimators in practice and showing that the backfitting estimator works reasonably well even for a high-dimensional data set.

EXAMPLE 8.2  
We consider the Boston house price data of Harrison & Rubinfeld (1978) with $ n=506 $ observations for the census districts of the Boston metropolitan area. We selected ten explanatory variables, listed below in the same order as the plots of the resulting estimates:
$ X_1$ per capita crime rate by town,
$ X_2$ proportion of non-retail business acres per town,
$ X_3$ nitric oxides concentration (parts per 10 million),
$ X_4$ average number of rooms per dwelling,
$ X_5$ proportion of owner-occupied units built prior to 1940,
$ X_6$ weighted distances to five Boston employment centers,
$ X_7$ full-value property tax rate per $ \$ $10,000,
$ X_8$ pupil-teacher ratio by town,
$ X_9$ $ 1,000(Bk - 0.63)^2$ where $ Bk$ is the proportion of people of Afro-American descent by town,
$ X_{10}$ percent lower status of the population.

Figure 8.2: Function estimates of additive components $ g_1$ to $ g_{6}$ and partial residuals (from the left to the right, top to bottom)
\includegraphics[width=1.2\defpicwidth]{SPMbost3.ps}

Figure 8.3: Function estimates of additive components $ g_7$ to $ g_{10}$ and partial residuals (from the left to the right, top to bottom)
\includegraphics[width=1.2\defpicwidth]{SPMbost4.ps}

The response variable $ Y$ is the median value of owner-occupied homes in 1,000 USD. Proceeding on the assumption that the model is

$\displaystyle E(Y\vert{\boldsymbol{X}})=c+\sum_{\alpha=1}^{10}g_\alpha \{ \log (X_\alpha) \}\,,
$

we obtain the estimates $ \widehat{g}_\alpha$ plotted in Figures 8.2 and 8.3. Again we used local linear smoothers (with Quartic kernel) and chose the bandwidth for each dimension $ \alpha$ proportional to its standard deviation ( $ h_{\alpha}=0.5\,\widehat\sigma_\alpha$). This choice of bandwidths is perhaps suboptimal in a theoretical sense, but it is reasonable for getting an impression of the curves.

Looking at Figures 8.2 and 8.3, we realize that a log-linear model would fit these data well in many directions. This was the model actually applied by Harrison & Rubinfeld (1978). In contrast, the explanatory variables $ X_1$ and $ X_5$ show a clearly nonlinear impact. $ \Box$

Note that in this example smoothing for the variables $ X_7$ and $ X_9$ seems to be problematic due to the distribution of these regressors. In fact, in applying backfitting we have so far assumed continuous regressors only. To overcome this deficiency, we will later (in Section 9.3) extend the generalized additive model to a semiparametric one. Among other advantages, this allows us to fit some components parametrically.

8.1.2 Modified Backfitting

Two disadvantages of the backfitting procedure described above are the dependence of the final solution on the starting functions and the possible non-uniqueness of the solution. Recall equation (8.6), which can be written as $ \widehat{{\mathbf{P}}} {\boldsymbol{g}}= \widehat{{\mathbf{Q}}} {\boldsymbol{Y}}$. Non-uniqueness occurs if there exists a vector $ {\boldsymbol{b}}$ such that $ \widehat{{\mathbf{P}}} {\boldsymbol{b}}= 0$. In that case equation (8.6) has an infinite number of solutions, because if $ \widehat{{\boldsymbol{g}}}$ is a solution, then so is $ \widehat{{\boldsymbol{g}}}+ \gamma \, {\boldsymbol{b}}$ for any $ \gamma\in\mathbb{R}$. This phenomenon is called concurvity.

Concurvity occurs for special interrelations among the $ {\mathbf{S}}_\alpha$s. In mathematical terms this can be explained as follows: Let $ {\mathbf{S}}_\alpha$ be smoother matrices with eigenvalues in $ [0,1]$, which is true for all cases presented here. Let further $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)$ be the space spanned by all eigenvectors of $ {\mathbf{S}}_\alpha$ that correspond to eigenvalue $ 1$. Then $ \widehat{{\mathbf{P}}}
{\boldsymbol{b}}= 0$ is possible if there exist $ {\boldsymbol{b}}_\alpha \in {\mathcal{V}}_1 ({\mathbf{S}}_\alpha)$ for all $ \alpha =1,\ldots ,d$, and $ \sum_{\alpha =1}^d {\boldsymbol{b}}_\alpha = 0$. Note that the elements of $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)$ are those which pass unchanged through the smoother $ {\mathbf{S}}_\alpha$.

To reduce the problem of concurvity, a modification of the backfitting algorithm is useful. Let $ {\mathbf{A}}_\alpha$ be the matrix that projects onto $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)$ and consequently $ {\mathbf{I}}-{\mathbf{A}}_\alpha$ the matrix that projects onto $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)^{\perp}$ (the orthogonal complement of $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)$). Let further $ {\mathbf{A}}$ denote the orthogonal projection onto $ {\mathcal{V}}_1 ({\mathbf{S}}_1)+\ldots+{\mathcal{V}}_1 ({\mathbf{S}}_d)$. The idea is now to first project the variable to be regressed onto $ {\mathcal{V}}_1 ({\mathbf{S}}_1)+\ldots+{\mathcal{V}}_1 ({\mathbf{S}}_d)$, then to smooth using $ {\mathbf{S}}_\alpha$, and finally to project onto $ {\mathcal{V}}_1({\mathbf{S}}_\alpha)^{\perp}$. In other words, we first apply $ {\mathbf{A}}$ to the full residual and then $ \widetilde {\mathbf{S}}_\alpha = ({\mathbf{I}}-{\mathbf{A}}_\alpha){\mathbf{S}}_\alpha$ to the partial residuals.

We can summarize this modified estimation algorithm as follows:

\fbox{\parbox{0.95\textwidth}{
\centerline{{Modified Backfitting Algorithm}}
\hrule
\begin{center}\begin{tabular}{ll}
{\em initialization} & $\widehat{c}=\overline{Y}\,, \quad \widehat{{\boldsymbol{g}}}_\alpha^{(0)}\equiv 0 \quad \textrm{for } \alpha=1,\ldots,d$ \\ [3mm]
{\em repeat} & project the current residual onto $ {\mathcal{V}}_1 ({\mathbf{S}}_1)+\ldots+{\mathcal{V}}_1 ({\mathbf{S}}_d)$ using $ {\mathbf{A}}$ \\ [1mm]
 & for $\alpha = 1,\ldots,d$: apply $ \widetilde {\mathbf{S}}_\alpha = ({\mathbf{I}}-{\mathbf{A}}_\alpha){\mathbf{S}}_\alpha$ to the partial residuals \\ [3mm]
{\em until\/} & convergence is reached
\end{tabular}\end{center}}}

In practice, often $ {\mathbf{A}}={\mathbf{X}}({\mathbf{X}}^\top {\mathbf{X}})^{-1} {\mathbf{X}}^\top $ is used, i.e. each iteration step starts with a parametric linear least squares regression on the explanatory variables  $ {\boldsymbol{X}}$.
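
One possible implementation of this modification is sketched below (illustrative only): $ {\mathbf{A}}$ is taken as the linear least squares projection including an intercept and, assuming local linear smoothers, $ {\mathbf{A}}_\alpha$ is the projection onto the span of the constant and $ X_\alpha$, which then defines $ \widetilde {\mathbf{S}}_\alpha = ({\mathbf{I}}-{\mathbf{A}}_\alpha){\mathbf{S}}_\alpha$.

\begin{verbatim}
import numpy as np

def projection(Z):
    """Orthogonal projection matrix onto the column space of Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

def modified_backfit(X, Y, smoothers, tol=1e-6, max_iter=100):
    """Sketch of modified backfitting.

    smoothers: list of (n, n) smoother matrices S_alpha.  A is the linear
    least squares projection (with intercept); A_alpha projects onto
    span{1, X_alpha}, which equals V_1(S_alpha) for local linear smoothers."""
    n, d = X.shape
    Y = Y - Y.mean()
    A = projection(np.column_stack([np.ones(n), X]))
    S_tilde = [(np.eye(n) - projection(np.column_stack([np.ones(n), X[:, a]])))
               @ smoothers[a] for a in range(d)]
    g0 = np.zeros(n)                       # part captured by the projection A
    g = np.zeros((n, d))                   # remaining nonparametric parts
    for _ in range(max_iter):
        g_old = g.copy()
        g0 = A @ (Y - g.sum(axis=1))       # project the full residual using A
        for a in range(d):
            partial = Y - g0 - g.sum(axis=1) + g[:, a]
            g[:, a] = S_tilde[a] @ partial # smooth, then remove the V_1 part
        if np.max(np.abs(g - g_old)) < tol:
            break
    return g0, g
\end{verbatim}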

Hastie & Tibshirani (1990) claim that this algorithm makes it easier to find a unique solution and also eliminates the dependence of the final results on the starting functions.

8.1.3 Smoothed Backfitting

As already mentioned above, Mammen et al. (1999) give theoretical results for a backfitting procedure that differs from the algorithms presented so far. Their main idea is to estimate the conditional expectations ``correctly'' instead of approximating them by the one-dimensional smoother matrices $ {\mathbf{S}}_\alpha$. Their proof uses rather weak conditions, and both the modification and the proof go back to the interpretation of backfitting as a projection of the original data onto the space of additive models.

Let us now consider the algorithm of Mammen et al. (1999) in more detail. As before, set $ c=0$ for ease of notation. Recall (8.4) to (8.8), i.e.

$\displaystyle g_\alpha (X_\alpha) = E \left[ \left\{ Y-\sum\nolimits_{k\neq \alpha} g_k (X_k)
\right\} \Big\vert X_\alpha \right] $

leading to

$\displaystyle g_\alpha (X_\alpha ) = E(Y\vert X_\alpha) - \sum_{k\neq \alpha} E\{g_k(X_k) \vert X_\alpha \}\,.$ (8.10)

Then, replacing the expressions on the right hand side by nonparametric estimates, we obtain for each iteration step

$\displaystyle \widehat{g}_\alpha^{(l)} (X_{i\alpha}) = {\mathbf{S}}_{i\alpha}^\top {\boldsymbol{Y}} - \sum_{k\neq \alpha} \int \widehat{g}_k^{(l-1)}(t)\, \frac{ \widehat{f}_{\alpha k}(X_{i\alpha} ,t)}{ \widehat{f}_\alpha (X_{i\alpha} )} \,dt \,.$ (8.11)

Here, $ \widehat{f}_\alpha $ is a univariate density estimator for the marginal pdf of $ X_\alpha$ and $ \widehat{f}_{\alpha k}$ is a bivariate estimator for the joint pdf of $ X_\alpha$ and $ X_k$. $ {\mathbf{S}}_{i\alpha}$ denotes the $ i$th row of the smoother matrix $ {\mathbf{S}}_\alpha$ (introduced above for estimating $ {\mathcal{P}}_\alpha=E(\bullet\vert X_{\alpha})$).

The difference from the earlier backfitting procedures is that here $ E(\bullet \vert X_\alpha)$ in (8.10) is estimated as an expectation over all dimensions and not only over the direction $ \alpha$. For more details see Mammen et al. (1999), who give the following theorem when kernel regression smoothers are used inside the backfitting step:
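
A grid-based sketch of the resulting iteration is given below; it replaces all quantities in (8.10) by kernel estimates (Nadaraya-Watson for $ E(Y\vert X_\alpha)$, product kernels for $ \widehat{f}_{\alpha k}$) and approximates the integral by a Riemann sum. This is a simplified illustration, not the implementation of Mammen et al. (1999) or Nielsen & Sperlich (2002).

\begin{verbatim}
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def smooth_backfit(X, Y, h, grid_size=101, max_iter=50, tol=1e-6):
    """Grid-based sketch of the smoothed backfitting update (8.10)."""
    n, d = X.shape
    Y = Y - Y.mean()                                   # c = 0 as in the text
    grids = [np.linspace(X[:, a].min(), X[:, a].max(), grid_size) for a in range(d)]
    step = [gr[1] - gr[0] for gr in grids]

    # kernel weights of grid points w.r.t. observations: (grid_size, n) per component
    K = [quartic((grids[a][:, None] - X[None, :, a]) / h) / h for a in range(d)]
    eps = 1e-12                                        # guards against empty windows
    f = [np.maximum(K[a].mean(axis=1), eps) for a in range(d)]               # f_alpha
    EY = [(K[a] @ Y) / np.maximum(K[a].sum(axis=1), eps) for a in range(d)]  # E(Y|X_alpha)

    # product-kernel estimates of the joint densities f_{alpha k} on the grid meshes
    f_joint = {(a, k): (K[a][:, None, :] * K[k][None, :, :]).mean(axis=2)
               for a in range(d) for k in range(d) if k != a}

    g_hat = [np.zeros(grid_size) for _ in range(d)]
    for _ in range(max_iter):
        g_old = [g.copy() for g in g_hat]
        for a in range(d):
            update = EY[a].copy()
            for k in range(d):
                if k != a:
                    # integral of g_k(t) f_{alpha k}(x_alpha, t) / f_alpha(x_alpha) dt
                    update -= (f_joint[(a, k)] @ (g_hat[k] * step[k])) / f[a]
            g_hat[a] = update - np.sum(update * f[a] * step[a])  # center: int g_a f_a = 0
        if max(np.max(np.abs(g_hat[a] - g_old[a])) for a in range(d)) < tol:
            break
    return grids, g_hat
\end{verbatim}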

THEOREM 8.1  
Under appropriate regularity conditions on the regressor densities, the additive component functions, the error term and the smoother it holds:
(a)
There exists a unique solution $ \widehat{g}_\alpha$ for the estimation problem of equation (8.11) and the iteration converges.
(b)

For the backfitting estimator we have

$\displaystyle n^{2/5}\left\{ \widehat{g}_\alpha \left( x_\alpha \right) - g_\alpha \left( x_\alpha \right) \right\} \; \mathop{\longrightarrow}\limits^{L} \; N\left\{ b_\alpha (x_\alpha),v_\alpha (x_\alpha) \right\}. $

The variance function is of the same order as if we had estimated the univariate model $ Y = g_\alpha (X_{\alpha} )+\varepsilon$. The same holds for the bias term $ b_\alpha$ if the regression function is truly additive. Note in particular that for estimating the univariate functions $ g_\alpha(\bullet) $ we obtain the univariate rate of convergence known from Chapter 4.

Further, for the overall regression estimator $ \widehat{m}$ constructed from the estimators $ \widehat{g}_\alpha$ the following result holds:

THEOREM 8.2  
Assume the same regularity conditions as in Theorem 8.1. For

$\displaystyle \widehat{m} ({\boldsymbol{x}}) = \sum_{\alpha=1}^d \widehat{g}_\alpha (x_\alpha)$

we have

$\displaystyle n^{2/5}\left\{ \widehat{m} \left( {\boldsymbol{x}}\right) - m \left( {\boldsymbol{x}}\right) \right\} \; \mathop{\longrightarrow}\limits^{L} \; N\left\{ \sum_\alpha b_\alpha (x_\alpha) , \sum_\alpha v_\alpha (x_\alpha) \right\} \,.$

This means that even for the estimate of $ m(\bullet)$ we obtain the univariate rate of convergence. Moreover, the correlations between two estimators $ \widehat{g}_\alpha$ and $ \widehat{g}_k$ are of higher order and hence asymptotically negligible. For details on the implementation of the smoothed backfitting estimator and its computational performance we refer to Nielsen & Sperlich (2002).