# 8.1 Backfitting

The backfitting procedures proposed in Breiman & Friedman (1985) and Buja et al. (1989) are widely used to approximate the additive components of (8.1). The latter paper considers the problem of finding the projection of the regression function $m$ onto the space of additive models represented by the right hand side of (8.1). Using the observed sample, this leads to a system of normal equations of dimension $nd$. To solve this system in practice, a backfitting, i.e. Gauss-Seidel, algorithm is used.

The backfitting technique is iterative, which leads to additional difficulties in developing asymptotic theory. Moreover, the final estimates may depend on the starting values or the convergence criterion. Therefore, since its first introduction, the method has been refined considerably and extended to more complicated models. We refer to the bibliographic notes at the end of this and the following chapter for more information.

## 8.1.1 Classical Backfitting

We will now consider the problem of estimating the additive conditional expectation of $Y$ given $X = (X_1, \ldots, X_d)^\top$ based on the random sample $\{(X_i, Y_i)\}_{i=1}^n$. The component functions $g_\alpha$ in (8.1) explain the specific impact of the particular component $X_\alpha$ on the response $Y$.

We first introduce an identification assumption. Let

$$ E\{g_\alpha(X_\alpha)\} = 0 \quad (8.2) $$

for all $\alpha = 1, \ldots, d$, which consequently yields

$$ E(Y) = c. $$

Note that condition (8.2) on the marginal effects is no restriction on the model, since all we can estimate is the relative impact of each direction $X_\alpha$. This means the $g_\alpha$ are otherwise only identified up to a shift in the vertical direction. As before we use the model

$$ Y = c + \sum_{\alpha=1}^d g_\alpha(X_\alpha) + \varepsilon, $$

where $E(\varepsilon \mid X) = 0$ (by definition) and $\operatorname{Var}(\varepsilon \mid X) = \sigma^2(X)$, i.e. the model is allowed to be heteroscedastic. The constant $c$ can easily be estimated with $\sqrt{n}$-rate (i.e. a rate faster than the nonparametric one) by

$$ \hat c = \frac{1}{n} \sum_{i=1}^n Y_i. $$

Thus, we assume

$$ c = 0 $$

for the rest of this section if not indicated differently. This is without loss of generality, since otherwise we may replace $Y_i$ by $Y_i - \hat c$.

Let us now turn to the estimation of the additive components $g_\alpha$. Hastie & Tibshirani (1990) motivate the backfitting method as follows. Consider the optimization problem

$$ \min_{g_1, \ldots, g_d} E\Big\{ Y - \sum_{\alpha=1}^d g_\alpha(X_\alpha) \Big\}^2. \quad (8.3) $$

This means we are searching for the best additive predictor for $Y$, where "best" is meant with respect to the expected squared distance. By the theory of projections, there exists a unique solution which is characterized by the first order conditions

$$ g_\alpha(X_\alpha) = E\Big\{ Y - \sum_{\beta \neq \alpha} g_\beta(X_\beta) \,\Big|\, X_\alpha \Big\} \quad (8.4) $$

for all $\alpha = 1, \ldots, d$. This leads to the matrix representation

$$ \begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_d & P_d & \cdots & I \end{pmatrix} \begin{pmatrix} g_1(X_1) \\ g_2(X_2) \\ \vdots \\ g_d(X_d) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_d Y \end{pmatrix} \quad (8.5) $$

where $P_\alpha$ denotes the conditional expectation operator $P_\alpha(\bullet) = E(\bullet \mid X_\alpha)$. By analogy, let $S_\alpha$ be an $(n \times n)$ smoother matrix which, when applied to the vector $Y = (Y_1, \ldots, Y_n)^\top$, yields an estimate of $\{E(Y_1 \mid X_{1\alpha}), \ldots, E(Y_n \mid X_{n\alpha})\}^\top$. Substituting the operator $P_\alpha$ by the smoother matrix $S_\alpha$, and writing $g_\alpha$ for the vector $\{g_\alpha(X_{1\alpha}), \ldots, g_\alpha(X_{n\alpha})\}^\top$, we obtain a system of equations

$$ \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_d & S_d & \cdots & I \end{pmatrix} \begin{pmatrix} g_1 \\ g_2 \\ \vdots \\ g_d \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_d Y \end{pmatrix} \quad (8.6) $$

or abbreviated

$$ \widehat{P}\, g = \widehat{Q}\, Y, \quad (8.7) $$

where $\widehat{P}$ denotes the $(nd \times nd)$ matrix on the left hand side of (8.6), $g = (g_1^\top, \ldots, g_d^\top)^\top$ and $\widehat{Q} Y = \{(S_1 Y)^\top, \ldots, (S_d Y)^\top\}^\top$ the stacked right hand side.

Although applying $P_\alpha$ in fact means taking an expectation over all directions $X_\beta$, $\beta \neq \alpha$, Buja et al. (1989) suggest using only one-dimensional smoothers $S_\alpha$ as a sufficient approximation to $P_\alpha$.
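To make the smoother matrix concrete, here is a minimal sketch using Nadaraya-Watson weights with the Quartic kernel (one possible choice; the examples later in this section use local linear smoothers instead): row $i$ of $S_\alpha$ holds the kernel weights that produce the fitted value at observation $i$.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Nadaraya-Watson smoother matrix for one regressor: row i holds the
    kernel weights so that (S @ y)[i] estimates E(Y | X_alpha = x[i])."""
    u = (x[:, None] - x[None, :]) / h                              # scaled distances
    k = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)   # Quartic kernel
    return k / k.sum(axis=1, keepdims=True)                        # rows sum to one

x = np.linspace(0, 1, 50)
S = nw_smoother_matrix(x, h=0.2)
y_fit = S @ np.sin(2 * np.pi * x)   # smoothed version of a noise-free curve
```

Note that $S \mathbf{1} = \mathbf{1}$ for such a row-normalized smoother: constant vectors pass unchanged through it, a property that becomes relevant in the concurvity discussion of Subsection 8.1.2.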

The system (8.7) could in principle be solved exactly for $g$. Of course, when $nd$ is large, the exact solution of (8.7) is hardly feasible. Furthermore, the matrix $\widehat{P}$ on the left is often not regular, so that the equation cannot be solved directly. Therefore, in practice the backfitting (Gauss-Seidel) algorithm is used to solve these equations: given starting values $\hat g_\alpha^{(0)}$, the vectors $\hat g_\alpha$ are updated through

$$ \hat g_\alpha = S_\alpha \Big( Y - \sum_{\beta \neq \alpha} \hat g_\beta \Big), \quad \alpha = 1, \ldots, d, \quad (8.8) $$

until the distance between successive estimates is below a prespecified tolerance, always inserting the most recent versions of the $\hat g_\beta$ on the right hand side. So what one actually does is a successive one-dimensional nonparametric regression on $X_\alpha$, using the partial residuals $r_\alpha = Y - \hat c - \sum_{\beta \neq \alpha} \hat g_\beta$ (instead of simply $Y$). To summarize the whole procedure including the estimation of the (possibly non-zero) constant: initialize $\hat c = n^{-1} \sum_{i=1}^n Y_i$ and $\hat g_\alpha \equiv 0$ for all $\alpha$, then cycle through the updating steps (8.8), applied to $Y - \hat c$, until convergence.
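The procedure can be sketched in code. This is a minimal illustration, not the book's implementation: it assumes precomputed Nadaraya-Watson smoother matrices with the Quartic kernel and re-centers each $\hat g_\alpha$ after every update to enforce the identification condition (8.2).

```python
import numpy as np

def quartic_smoother(x, h):
    """Nadaraya-Watson smoother matrix with the Quartic kernel."""
    u = (x[:, None] - x[None, :]) / h
    k = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)
    return k / k.sum(axis=1, keepdims=True)

def backfit(S_list, y, tol=1e-8, max_iter=200):
    """Classical backfitting (Gauss-Seidel) for the smoother system (8.7)."""
    n, d = len(y), len(S_list)
    c = y.mean()                                  # sqrt(n)-rate estimate of c
    g = [np.zeros(n) for _ in range(d)]
    for _ in range(max_iter):
        g_old = [ga.copy() for ga in g]
        for a in range(d):                        # updating step (8.8)
            partial = y - c - sum(g[b] for b in range(d) if b != a)
            g[a] = S_list[a] @ partial
            g[a] -= g[a].mean()                   # re-center: identification (8.2)
        if max(np.abs(ga - go).max() for ga, go in zip(g, g_old)) < tol:
            break
    return c, g

# small simulated example (hypothetical component functions, for illustration)
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = 1 + x1**2 + np.sin(np.pi * x2) + 0.1 * rng.standard_normal(n)
c, (g1, g2) = backfit([quartic_smoother(x1, 0.3), quartic_smoother(x2, 0.3)], y)
```

The recovered $\hat g_1$ and $\hat g_2$ follow the shapes of the two simulated components up to the vertical centering.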
Of course, the choice of the convergence criterion is a crucial issue, and in practice different software implementations of the algorithm may stop after different numbers of iterations. Also, the result of the iteration can obviously depend on the order of the variables $X_1, \ldots, X_d$.

It is important to note that to estimate $g_\alpha$, the backfitting algorithm uses smoother matrices $S_\alpha$ that depend only on the observations of the single component $X_\alpha$. These smoother matrices are easy and fast to calculate since they involve only a one-dimensional smoothing problem. This is a strong simplification (also from a theoretical point of view) and may lead to problems in the theory and in the interpretation of what the procedure is actually fitting in practice. We will come back to this point in Subsection 8.1.3.

For the two-dimensional case we can write down the backfitting estimator explicitly. From equation (8.6) we have

$$ g_1 = S_1 (Y - g_2), \qquad g_2 = S_2 (Y - g_1). $$

Now it is easy to derive the smoother (or hat) matrices for $\hat g_1$ and $\hat g_2$. Let $\hat g_1^{(l)}$ and $\hat g_2^{(l)}$ denote the estimates in the $l$-th updating step. The backfitting procedure steps are

$$ \hat g_1^{(l)} = S_1 \big( Y - \hat g_2^{(l-1)} \big), \qquad \hat g_2^{(l)} = S_2 \big( Y - \hat g_1^{(l)} \big). $$

Since the backfitting consists of alternating these steps, we obtain by induction

$$ \hat g_2^{(l)} = \sum_{k=0}^{l-1} (S_2 S_1)^k\, S_2 (I - S_1)\, Y + (S_2 S_1)^l\, \hat g_2^{(0)}. $$

Due to the assumptions $E\{g_1(X_1)\} = 0$ and $E\{g_2(X_2)\} = 0$, the initialization $\hat g_1^{(0)} = \hat g_2^{(0)} = 0$ is reasonable and we have

$$ \hat g_2^{(l)} = \sum_{k=0}^{l-1} (S_2 S_1)^k\, S_2 (I - S_1)\, Y. $$

This algorithm converges if the operator $S_2 S_1$ is shrinking (i.e. for its matrix norm $\| S_2 S_1 \| < 1$ holds) and hence gives

$$ \hat g_2 = (I - S_2 S_1)^{-1} S_2 (I - S_1)\, Y, \qquad \hat g_1 = S_1 (Y - \hat g_2) \quad (8.9) $$

when $l \to \infty$.
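The agreement between the alternating updates and the closed-form limit (8.9) can be checked numerically. The smoothers below are artificial random matrices scaled so that each row sums to $0.9$; then $\|S_2 S_1\|_\infty \le 0.81 < 1$ holds by construction, so this is an illustration of the algebra rather than a data example.

```python
import numpy as np

def shrinking_smoother(rng, n):
    """Artificial smoother with row sums 0.9, so the shrinking condition
    ||S2 S1|| < 1 holds by construction."""
    W = rng.uniform(size=(n, n)) + n * np.eye(n)   # heavy diagonal
    return 0.9 * W / W.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
n = 40
S1, S2 = shrinking_smoother(rng, n), shrinking_smoother(rng, n)
y = rng.standard_normal(n)

# alternate the two backfitting steps
g1, g2 = np.zeros(n), np.zeros(n)
for _ in range(300):
    g1 = S1 @ (y - g2)
    g2 = S2 @ (y - g1)

# closed-form limit (8.9)
I = np.eye(n)
g2_limit = np.linalg.solve(I - S2 @ S1, S2 @ (I - S1) @ y)
g1_limit = S1 @ (y - g2_limit)
```

Since the iteration error is multiplied by $S_2 S_1$ in every sweep, the iterates agree with the closed form to machine precision after enough sweeps.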

In the general case, because of its iterative nature, theoretical results for backfitting were not available until recently. Opsomer & Ruppert (1997) provide conditional mean squared error expressions, albeit under rather strong conditions on the design and the smoother matrices. Mammen et al. (1999) prove consistency and derive the asymptotic properties under weaker conditions. We will present their (different) backfitting algorithm in Subsection 8.1.3.

Before proceeding we will consider two examples illustrating what precision can be achieved by additive regression and how the backfitting algorithm performs in practice.

EXAMPLE 8.1
Consider a regression problem with four input variables $X_1$ to $X_4$. When the sample size $n$ is small, it is difficult to obtain a precise nonparametric kernel estimate of the full four-dimensional regression function due to the curse of dimensionality. Let us take a small sample of regression values. We use explanatory variables that are independent and uniformly distributed, and responses generated from the additive model

$$ Y = \sum_{\alpha=1}^4 g_\alpha(X_\alpha) + \varepsilon. $$

In Figure 8.1 we have plotted the true component functions $g_\alpha$ (at the corresponding observations $X_{i\alpha}$) together with the estimated curves $\hat g_\alpha$. We used backfitting with univariate local linear smoothers (Quartic kernel) and the same bandwidth in each dimension. We see that even for this small sample size the estimator gives rather precise results.

Next, we turn to a real data example demonstrating the use of additive regression estimators in practice and showing that even for a high-dimensional data set the backfitting estimator works reasonably well.

EXAMPLE 8.2
We consider the Boston house price data of Harrison & Rubinfeld (1978), with observations for the 506 census districts of the Boston metropolitan area. We selected ten explanatory variables, listed below in the same order as the pictures of the resulting estimates:

- $X_1$: per capita crime rate by town,
- $X_2$: proportion of non-retail business acres per town,
- $X_3$: nitric oxides concentration (parts per 10 million),
- $X_4$: average number of rooms per dwelling,
- $X_5$: proportion of owner-occupied units built prior to 1940,
- $X_6$: weighted distances to five Boston employment centers,
- $X_7$: full-value property tax rate per 10,000 USD,
- $X_8$: pupil-teacher ratio by town,
- $X_9$: $1000\,(B - 0.63)^2$, where $B$ is the proportion of people of Afro-American descent by town,
- $X_{10}$: percent lower status of the population.

The response variable $Y$ is the median value of owner-occupied homes in 1,000 USD. Proceeding on the assumption that the model is

$$ Y = c + \sum_{\alpha=1}^{10} g_\alpha(X_\alpha) + \varepsilon, $$

we get the results for $\hat g_1, \ldots, \hat g_{10}$ as plotted in Figure 8.2. We again used local linear smoothers (with Quartic kernel) and chose the bandwidth for each dimension proportional to the standard deviation of that regressor. Our choice of bandwidths is perhaps suboptimal in a theoretical sense, but it is reasonable for getting an impression of the curves.

Looking at Figures 8.2 and 8.3 we realize that a log-linear model would fit these data in many directions. This was the model actually applied by Harrison & Rubinfeld (1978). In contrast, some of the explanatory variables show a clear nonlinear impact.

Note that in this example smoothing for some of the variables seems to be problematic due to the distribution of these regressors. In fact, by applying backfitting we have so far assumed continuous regressors only. To overcome this deficiency, we will later on (in Section 9.3) extend the generalized additive model to a semiparametric one. Among other advantages, this allows us to fit some of the components in a parametric fashion.

## 8.1.2 Modified Backfitting

Two of the disadvantages of the backfitting procedure described above are the dependence of the final solution on the starting functions and the possible non-uniqueness of the solution. Recall equation (8.6), which can be written as $\widehat{P} g = \widehat{Q} Y$. Non-uniqueness can occur if there exists a vector $g^* \neq 0$ such that $\widehat{P} g^* = 0$: equation (8.6) then has an infinite number of solutions, because if $\hat g$ is a solution, then so is $\hat g + \gamma g^*$ for any $\gamma \in \mathbb{R}$. This phenomenon is called concurvity.

Concurvity occurs for special interrelations among the smoothers $S_\alpha$. In mathematical terms this can be explained as follows. Let $S_1, \ldots, S_d$ be smoother matrices with eigenvalues in $[0, 1]$, which is true for all cases presented here. Let further $M_1(S_\alpha)$ be the space spanned by all eigenvectors of $S_\alpha$ that correspond to the eigenvalue $1$. Then $\widehat{P} g^* = 0$ is possible if there exist vectors $g_\alpha \in M_1(S_\alpha)$, not all zero, such that $\sum_{\alpha=1}^d g_\alpha = 0$; in that case $g^* = (g_1^\top, \ldots, g_d^\top)^\top$ satisfies $\widehat{P} g^* = 0$. Note that the elements of $M_1(S_\alpha)$ are those vectors which pass unchanged through the smoother $S_\alpha$.
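This can be made concrete: row-stochastic smoothers (such as Nadaraya-Watson weights) reproduce constants, so the constant vector lies in $M_1(S_\alpha)$ for every $\alpha$. For $d = 2$, choosing $g_1 = \mathbf{1}$ and $g_2 = -\mathbf{1}$ then yields a nonzero $g^*$ annihilated by the system matrix of (8.6), which is therefore singular. A small numerical sketch (with random stand-in smoothers, purely for illustration):

```python
import numpy as np

def stochastic_smoother(rng, n):
    """Random row-stochastic matrix standing in for a kernel smoother."""
    W = rng.uniform(size=(n, n)) + n * np.eye(n)
    return W / W.sum(axis=1, keepdims=True)    # rows sum to one: S @ 1 = 1

rng = np.random.default_rng(2)
n = 30
S1, S2 = stochastic_smoother(rng, n), stochastic_smoother(rng, n)
I = np.eye(n)
P_hat = np.block([[I, S1], [S2, I]])           # system matrix of (8.6) for d = 2
ones = np.ones(n)
g_star = np.concatenate([ones, -ones])         # g_1 = 1, g_2 = -1, g_1 + g_2 = 0
residual = P_hat @ g_star                      # = 0: a concurvity direction
```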

To reduce the problem of concurvity, a modification of the backfitting algorithm is useful. Let $A_\alpha$ be the matrix that projects onto $M_1(S_\alpha)$ and consequently $I - A_\alpha$ the matrix that projects onto $M_1(S_\alpha)^\perp$ (the orthogonal complement of $M_1(S_\alpha)$). Let further $\widetilde{A}$ denote the orthogonal projection onto $M_1(S_1) + \cdots + M_1(S_d)$. The idea is now to first project the variable to regress onto this space, then to smooth the remainder by using the centered smoothers $\widetilde{S}_\alpha = (I - A_\alpha) S_\alpha$, and finally to project the component estimates onto $M_1(S_\alpha)^\perp$. In other words, we first apply $\widetilde{A}$ to the full response and then the $\widetilde{S}_\alpha$ to the partial residuals.

We can summarize this modified estimation algorithm as follows: first regress $Y$ on the space $M_1(S_1) + \cdots + M_1(S_d)$ by applying $\widetilde{A}$; then iterate the updating steps (8.8) on the remainder, with the centered smoothers $\widetilde{S}_\alpha$ in place of the $S_\alpha$, until convergence; finally, combine the parametric and the nonparametric parts.

In practice, $\widetilde{A}$ is often taken to be the least squares projection onto the linear span of the regressors, i.e. each iteration step starts with a parametric linear least squares regression on the explanatory variables $X_1, \ldots, X_d$.
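A minimal sketch of this idea, under simplifying assumptions: $M_1(S_\alpha)$ is taken to contain only the constants (true for the row-stochastic Nadaraya-Watson smoothers used below), $\widetilde{A}$ is the least squares projection onto the linear span of the regressors, and for simplicity that projection is applied once up front rather than in every sweep.

```python
import numpy as np

def quartic_smoother(x, h):
    """Nadaraya-Watson smoother matrix with the Quartic kernel."""
    u = (x[:, None] - x[None, :]) / h
    k = np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)
    return k / k.sum(axis=1, keepdims=True)

def modified_backfit(X, y, h=0.3, n_iter=30):
    """Modified backfitting sketch: linear least squares first, then
    backfitting with centered smoothers on the remaining residuals."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)     # apply A~: linear projection
    r = y - Z @ beta                                 # part left for the smoothers
    C = np.eye(n) - np.ones((n, n)) / n              # I - A_alpha: removes constants
    S = [C @ quartic_smoother(X[:, a], h) for a in range(d)]   # centered smoothers
    g = [np.zeros(n) for _ in range(d)]
    for _ in range(n_iter):
        for a in range(d):
            g[a] = S[a] @ (r - sum(g[b] for b in range(d) if b != a))
    return beta, g, Z @ beta + sum(g)

# simulated example: one linear and one clearly nonlinear component
rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-1, 1, (n, 2))
y = X[:, 0] + np.cos(np.pi * X[:, 1]) + 0.1 * rng.standard_normal(n)
beta, g, fitted = modified_backfit(X, y)
```

The linear part picks up the slope in $X_1$, while the centered smoothers capture the nonlinear shape in $X_2$ that a purely linear fit misses.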

Hastie & Tibshirani (1990) claim that this algorithm makes it easier to find a unique solution and also eliminates the dependence of the final results on the starting functions.

## 8.1.3 Smoothed Backfitting

As already mentioned above, Mammen et al. (1999) give theoretical results for a backfitting procedure that differs from the algorithms presented so far. Their main idea is to estimate the conditional expectations "correctly" instead of approximating them with the one-dimensional smoother matrices $S_\alpha$. Their proof uses rather weak conditions, and both their modification and their proof go back to the interpretation of backfitting as a projection of the original data onto the space of additive models.

Let us consider the algorithm of Mammen et al. (1999) in more detail here. As before, set $c = 0$ for ease of notation. Recall (8.4) to (8.8), i.e.

$$ g_\alpha(x_\alpha) = E\Big\{ Y - \sum_{\beta \neq \alpha} g_\beta(X_\beta) \,\Big|\, X_\alpha = x_\alpha \Big\}. \quad (8.10) $$

Then, replacing the expressions on the right hand side by nonparametric estimates, we obtain for each iteration step

$$ \hat g_\alpha(X_{i\alpha}) = S_\alpha^{(i)} Y - \sum_{\beta \neq \alpha} \int \hat g_\beta(x_\beta)\, \frac{\hat p_{\alpha\beta}(X_{i\alpha}, x_\beta)}{\hat p_\alpha(X_{i\alpha})}\, dx_\beta. \quad (8.11) $$

Here, $\hat p_\alpha$ is a univariate density estimator for the marginal pdf of $X_\alpha$ and $\hat p_{\alpha\beta}$ is a bivariate estimator for the joint pdf of $X_\alpha$ and $X_\beta$. $S_\alpha^{(i)}$ denotes the $i$th row of the smoother matrix $S_\alpha$ (introduced above for estimating $E(Y \mid X_\alpha)$).
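The update (8.11) can be turned into a grid-based procedure. The sketch below makes several simplifying assumptions (a common bandwidth, the Quartic kernel, evaluation on a grid rather than at the observations, Riemann-sum integration, a fixed number of sweeps); it is an illustration in the spirit of the smoothed backfitting estimator, not the authors' implementation.

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u**2) ** 2, 0.0)

def smooth_backfit(X, y, grid, h, n_iter=50):
    """Grid-based sketch of the smoothed backfitting update (8.11):
    marginal Nadaraya-Watson fit minus integrals against density ratios."""
    n, d = X.shape
    m = len(grid)
    dg = grid[1] - grid[0]
    c = y.mean()
    # kernel weights of each observation at each grid point, per direction
    K = [quartic((grid[:, None] - X[:, a][None, :]) / h) / h for a in range(d)]
    p = [K[a].mean(axis=1) for a in range(d)]            # marginal density p_alpha
    nw = [(K[a] @ y) / (n * p[a]) for a in range(d)]     # marginal NW regression
    g = [np.zeros(m) for _ in range(d)]
    for _ in range(n_iter):
        for a in range(d):
            integral = np.zeros(m)
            for b in range(d):
                if b != a:
                    p_ab = (K[a] @ K[b].T) / n           # joint density p_alpha,beta
                    integral += (p_ab @ g[b]) * dg / p[a]
            g[a] = nw[a] - c - integral
            w = p[a] / p[a].sum()
            g[a] -= (g[a] * w).sum()                     # center: int g_a p_a = 0
    return c, g

# simulated additive model (hypothetical components, for illustration)
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-1, 1, (n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.standard_normal(n)
grid = np.linspace(-0.9, 0.9, 41)
c, (g1, g2) = smooth_backfit(X, y, grid, h=0.25)
```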

The difference from the earlier backfitting procedures is that here the conditional expectation in (8.10) is estimated as an expectation over all dimensions and not only by a one-dimensional smooth in the direction $X_\alpha$. For more details see Mammen et al. (1999). They give the following theorem when using kernel regression smoothers inside their backfitting step:

THEOREM 8.1
Under appropriate regularity conditions on the regressor densities, the additive component functions, the error term and the smoothers, it holds:
(a)
There exists a unique solution $\hat g_1, \ldots, \hat g_d$ of the estimation problem in equation (8.11) and the iteration converges.
(b)

For the backfitting estimators we have

$$ \sqrt{nh}\, \big\{ \hat g_\alpha(x_\alpha) - g_\alpha(x_\alpha) - b_\alpha(x_\alpha) \big\} \stackrel{L}{\longrightarrow} N\{0, v_\alpha(x_\alpha)\}, $$

where $b_\alpha$ denotes the asymptotic bias and $v_\alpha$ the asymptotic variance function.

The variance function $v_\alpha$ is of the same rate as if we had estimated the univariate model $E(Y \mid X_\alpha) = g_\alpha(X_\alpha)$. The same result holds for the bias term if the regression function is truly additive. Note in particular that for estimating the univariate functions $g_\alpha$ we obtain the univariate rate of convergence known from Chapter 4.

Further, for the overall regression estimator constructed from the component estimators $\hat g_\alpha$ the following result holds:

THEOREM 8.2
Assume the same regularity conditions as in Theorem 8.1. For

$$ \hat m(x) = \sum_{\alpha=1}^d \hat g_\alpha(x_\alpha) $$

we have

$$ \sqrt{nh}\, \Big[ \hat m(x) - m(x) - \sum_{\alpha=1}^d b_\alpha(x_\alpha) \Big] \stackrel{L}{\longrightarrow} N\Big\{ 0, \sum_{\alpha=1}^d v_\alpha(x_\alpha) \Big\}. $$
This means that even for the estimator of $m$ we obtain the univariate rate of convergence. Moreover, the correlations between two estimators $\hat g_\alpha$ and $\hat g_\beta$, $\alpha \neq \beta$, are of higher order and hence asymptotically negligible. For details on the implementation of the smoothed backfitting estimator and its computational performance we refer to Nielsen & Sperlich (2002).