The backfitting procedures proposed in Breiman & Friedman (1985) and Buja et al. (1989) are widely used to approximate the additive components of (8.1). The latter paper considers the problem of finding the projection of the regression function onto the space of additive models represented by the right hand side of (8.1). Using the observed sample, this leads to a system of normal equations of dimension $nd \times nd$ (one $n$-vector of function values per additive component). To solve this system in practice, a backfitting, i.e. Gauss-Seidel, algorithm is used.
The backfitting technique is iterative, which creates additional difficulties for developing asymptotic theory. Moreover, the final estimates may depend on the starting values or the convergence criterion. Therefore, since its first introduction, the method has been refined considerably and extended to more complicated models. We refer to the bibliographic notes at the end of this and the following chapter for more information.
We will now consider the problem of estimating the additive conditional expectation of $Y$ given $\mathbf{X}$ based on the random sample $\{(\mathbf{X}_i, Y_i)\}$, $i = 1, \ldots, n$. The component functions $g_\alpha$ in (8.1) explain the specific impact of the particular component $X_\alpha$ on the response $Y$.
We first introduce an identification assumption: let $c = E(Y)$ and $E\{g_\alpha(X_\alpha)\} = 0$ for all $\alpha = 1, \ldots, d$.
Let us now turn to the estimation of the additive components $g_\alpha$. Hastie & Tibshirani (1990) motivate the backfitting method as follows. Consider the optimization problem of minimizing $E\{Y - c - \sum_{\alpha=1}^{d} g_\alpha(X_\alpha)\}^2$ over the additive components.
The system (8.7) could in principle be solved exactly for the component vectors $\mathbf{g}_1, \ldots, \mathbf{g}_d$. Of course, when $n$ is large the exact solution of (8.7) is hardly feasible. Furthermore, the matrix on the left is often not regular and thus the equation cannot be solved directly. Therefore, in practice the backfitting (Gauss-Seidel) algorithm is used to solve these equations: given starting values $\mathbf{g}_\alpha^{(0)}$, the component vectors are updated through
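The cycle of updates above can be sketched in a few lines. The following is a minimal illustration, not the exact estimator of the references: it assumes Nadaraya-Watson smoother matrices with a Gaussian kernel, zero starting values, a fixed number of sweeps, and recentering of each component to enforce the identification $E\,g_\alpha(X_\alpha) = 0$; the data are simulated.

```python
import numpy as np

def kernel_smoother_matrix(x, h):
    """n x n Nadaraya-Watson smoother matrix for one covariate
    (Gaussian kernel; each row holds normalized weights)."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return w / w.sum(axis=1, keepdims=True)

def backfit(y, X, h=0.3, n_iter=50):
    """Classical backfitting (Gauss-Seidel): cycle the updates
    g_a <- S_a (y - c - sum_{b != a} g_b), recentering each
    component so that it averages to zero."""
    n, d = X.shape
    S = [kernel_smoother_matrix(X[:, a], h) for a in range(d)]
    c = y.mean()
    g = np.zeros((d, n))                       # starting values g_a = 0
    for _ in range(n_iter):
        for a in range(d):
            partial = y - c - g.sum(axis=0) + g[a]   # partial residuals
            g[a] = S[a] @ partial
            g[a] -= g[a].mean()                # identification constraint
    return c, g

# Simulated additive model: y = sin(pi x1) + x2^2 + noise
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
c, g = backfit(y, X)
fitted = c + g.sum(axis=0)
```

Each inner step is only a one-dimensional smooth of the current partial residuals, which is exactly what makes the algorithm cheap compared with solving the full $nd \times nd$ system.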
It is important to note that, to estimate $g_\alpha$, the backfitting algorithm uses smoother matrices $\mathbf{S}_\alpha$ that depend only on the observations of the single component $X_\alpha$. These smoother matrices are easy and fast to calculate since they involve only a one-dimensional smoothing problem. This is a strong simplification (also from a theoretical point of view) and may lead to problems in theory and in the interpretation of what the procedure is fitting in practice. We will come back to this point in Subsection 8.1.3.
For the two-dimensional case we can write down the backfitting estimator explicitly. From equation (8.6) we have
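If the iteration converges and $\mathbf{I} - \mathbf{S}_1 \mathbf{S}_2$ is invertible, the fixed point of the two updates can be written in closed form. The following display is a standard consequence of the fixed-point equations $\mathbf{g}_1 = \mathbf{S}_1(\mathbf{Y} - \mathbf{g}_2)$ and $\mathbf{g}_2 = \mathbf{S}_2(\mathbf{Y} - \mathbf{g}_1)$, with centering terms omitted for brevity:

```latex
\widehat{\mathbf{g}}_1
  = (\mathbf{I} - \mathbf{S}_1 \mathbf{S}_2)^{-1}\,
    \mathbf{S}_1 (\mathbf{I} - \mathbf{S}_2)\, \mathbf{Y},
\qquad
\widehat{\mathbf{g}}_2
  = (\mathbf{I} - \mathbf{S}_2 \mathbf{S}_1)^{-1}\,
    \mathbf{S}_2 (\mathbf{I} - \mathbf{S}_1)\, \mathbf{Y}.
```

Substituting the second equation into the first and collecting terms in $\mathbf{g}_1$ verifies the formula directly.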
In the general case, because of its iterative nature, theoretical results for backfitting were not available until recently. Opsomer & Ruppert (1997) provide conditional mean squared error expressions, albeit under rather strong conditions on the design and smoother matrices. Mammen et al. (1999) prove consistency and derive the asymptotic properties under weaker conditions. We will present their backfitting algorithm, which differs from the one above, in Subsection 8.1.3.
Before proceeding we will consider two examples illustrating what precision can be achieved by additive regression and how the backfitting algorithm performs in practice.
Next, we turn to a real data example demonstrating the use of additive regression estimators in practice and showing that even for a high-dimensional data set the backfitting estimator works reasonably well.
$X_1$: per capita crime rate by town,
$X_2$: proportion of non-retail business acres per town,
$X_3$: nitric oxides concentration (parts per 10 million),
$X_4$: average number of rooms per dwelling,
$X_5$: proportion of owner-occupied units built prior to 1940,
$X_6$: weighted distances to five Boston employment centers,
$X_7$: full-value property tax rate per USD 10,000,
$X_8$: pupil-teacher ratio by town,
$X_9$: $1000(B - 0.63)^2$, where $B$ is the proportion of people of Afro-American descent by town,
$X_{10}$: percent lower status of the population.
The response variable $Y$ is the median value of owner-occupied homes in 1,000 USD. We proceed on the assumption that the model is additive as in (8.1).
Looking at Figures 8.2 and 8.3 we realize that a log-linear model would fit these data in many directions. This was the model actually applied by Harrison & Rubinfeld (1978). In contrast, some of the explanatory variables show a clear nonlinear impact.
Note that in this example smoothing seems to be problematic for some of the variables due to the distribution of these regressors. In fact, by applying backfitting we have up to now assumed continuous regressors only. To overcome this deficiency, we will later on (in Section 9.3) extend the generalized additive model to a semiparametric model. Among other advantages, this allows us to fit some components in a parametric fashion.
Two of the disadvantages of the backfitting procedure described above are the dependence of the final solution on the starting functions and the possible non-uniqueness of the solution. Recall that equation (8.6) is a linear system in the stacked component vectors. Non-uniqueness can happen if there exists a nonzero vector $\mathbf{g}^{+}$ that the matrix on the left hand side of (8.6) maps to zero. Then equation (8.6) has an infinite number of solutions, because if $\widehat{\mathbf{g}}$ is a solution, then so is $\widehat{\mathbf{g}} + \gamma\,\mathbf{g}^{+}$ for any scalar $\gamma$. This phenomenon is called concurvity.
Concurvity occurs for special interrelations among the regressors $X_\alpha$. In mathematical terms this can be explained as follows: let $\mathbf{S}_\alpha$ be smoother matrices with eigenvalues in $[0,1]$, which is true for all cases presented here. Let further $\mathcal{M}_1(\mathbf{S}_\alpha)$ be the space spanned by all eigenvectors of $\mathbf{S}_\alpha$ that correspond to eigenvalue $1$. Then a nontrivial null vector $\mathbf{g}^{+}$ is possible if there exist $\mathbf{g}_\alpha^{+} \in \mathcal{M}_1(\mathbf{S}_\alpha)$ for all $\alpha$, not all zero, with $\sum_\alpha \mathbf{g}_\alpha^{+} = \mathbf{0}$. Note that the elements of $\mathcal{M}_1(\mathbf{S}_\alpha)$ are exactly those vectors which pass unchanged through the smoother $\mathbf{S}_\alpha$.
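A small numerical sketch of this mechanism, under the assumption that both smoothers are row-stochastic (rows of nonnegative weights summing to one, as for Nadaraya-Watson smoothing): the vector of ones then lies in $\mathcal{M}_1(\mathbf{S}_\alpha)$ for both smoothers, so shifting one component up by a constant and the other down by the same constant leaves the backfitting equations satisfied.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
# Two row-stochastic "smoother" matrices: each row sums to one,
# hence each matrix maps the constant vector to itself.
S1 = rng.uniform(size=(n, n)); S1 /= S1.sum(axis=1, keepdims=True)
S2 = rng.uniform(size=(n, n)); S2 /= S2.sum(axis=1, keepdims=True)

ones = np.ones(n)
print(np.allclose(S1 @ ones, ones), np.allclose(S2 @ ones, ones))  # True True

# Consequence: if (g1, g2) solves g1 = S1(y - g2), g2 = S2(y - g1),
# then (g1 + c*1, g2 - c*1) solves the same system, for any c.
y = rng.normal(size=n)
g2 = rng.normal(size=n)
c = 3.0
lhs = S1 @ (y - (g2 - c * ones))   # update against the shifted g2
rhs = S1 @ (y - g2) + c * ones     # original update, shifted by c*1
print(np.allclose(lhs, rhs))       # True
```

The constant vector is the simplest example; any direction passing unchanged through all smoothers produces the same non-uniqueness.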
To reduce the problem of concurvity, a modification of the backfitting algorithm is useful. Let $\mathbf{P}$ be the matrix that projects onto $\mathcal{M}_1$, the span of all the spaces $\mathcal{M}_1(\mathbf{S}_\alpha)$, and consequently $\mathbf{I} - \mathbf{P}$ the matrix that projects onto $\mathcal{M}_1^{\perp}$ (the orthogonal complement of $\mathcal{M}_1$). The idea is now to first project the variable to regress onto $\mathcal{M}_1^{\perp}$, then to smooth by using $\mathbf{S}_\alpha$ and finally to project the result onto $\mathcal{M}_1^{\perp}$ again. In other words, we first apply $\mathbf{P}$ to the full residuals and then the modified smoothers $(\mathbf{I} - \mathbf{P})\,\mathbf{S}_\alpha\,(\mathbf{I} - \mathbf{P})$ to the partial residuals.
We can summarize this modified estimation algorithm as follows:
In practice, often $\mathbf{P}$ is chosen as the linear least squares projection, i.e. each iteration step starts with a parametric linear least squares regression on the explanatory variables $X_1, \ldots, X_d$.
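A minimal sketch of this modified algorithm, assuming $\mathbf{P}$ is the least squares projection onto $[\mathbf{1}, \mathbf{X}]$ and Nadaraya-Watson smoother matrices as before (simulated data; not the exact implementation of the references):

```python
import numpy as np

def nw_smoother(x, h=0.3):
    """One-dimensional Nadaraya-Watson smoother matrix (Gaussian kernel)."""
    d = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * d**2)
    return w / w.sum(axis=1, keepdims=True)

def modified_backfit(y, X, S, n_iter=50):
    """Modified backfitting sketch: P projects onto the linear space
    spanned by [1, X]; each cycle keeps that parametric fit, and every
    smoothing step is sandwiched between projections Q = I - P,
    i.e. S_a is replaced by Q S_a Q."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])
    P = Z @ np.linalg.lstsq(Z, np.eye(n), rcond=None)[0]  # hat matrix of Z
    Q = np.eye(n) - P                                     # projection onto complement
    linear = P @ y                                        # parametric (linear LS) part
    g = np.zeros((d, n))
    for _ in range(n_iter):
        for a in range(d):
            partial = y - linear - g.sum(axis=0) + g[a]
            g[a] = Q @ (S[a] @ (Q @ partial))             # project, smooth, project
    return linear, g

rng = np.random.default_rng(0)
n = 150
X = rng.uniform(-1, 1, size=(n, 2))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=n)
S = [nw_smoother(X[:, a]) for a in range(2)]
linear, g = modified_backfit(y, X, S)
fitted = linear + g.sum(axis=0)
```

Because every $\mathbf{g}_\alpha$ is forced into the orthogonal complement of the linear space, the constant and linear directions that cause concurvity can no longer drift between components.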
Hastie & Tibshirani (1990) claim that this algorithm makes it easier to find a unique solution and also eliminates the dependence of the final results on the starting functions.
As already mentioned above, Mammen et al. (1999) give theoretical results for a backfitting procedure that is different from the algorithms presented so far. Their main idea is to estimate the conditional expectations ``correctly'' instead of approximating them with the one-dimensional smoother matrices $\mathbf{S}_\alpha$. Their modification, as well as their proof, goes back to the interpretation of backfitting as a projection of the original data onto the space of additive models; the proof requires only rather weak conditions.
Let us consider the algorithm of Mammen et al. (1999) in more detail here. As before, assume for ease of notation that the constant $c$ is zero, and recall (8.4) to (8.8), i.e.
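In kernel notation, the resulting system can be sketched as follows. This is a standard form of the smoothed backfitting equations, written with notation not taken from (8.4)-(8.8): $\widehat{m}_\alpha$ denotes the marginal Nadaraya-Watson estimator in direction $\alpha$, and $\widehat{p}_\alpha$, $\widehat{p}_{\alpha\beta}$ the one- and two-dimensional kernel density estimates of the regressors:

```latex
\widehat{g}_\alpha(x_\alpha)
  = \widehat{m}_\alpha(x_\alpha) - \widehat{c}
    - \sum_{\beta \neq \alpha} \int \widehat{g}_\beta(x_\beta)\,
      \frac{\widehat{p}_{\alpha\beta}(x_\alpha, x_\beta)}
           {\widehat{p}_\alpha(x_\alpha)}\, dx_\beta ,
\qquad \widehat{c} = \bar{Y} .
```

The integral term replaces the one-dimensional smooth of partial residuals used before: each component is adjusted by the full conditional expectation of the other components given $X_\alpha$.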
The difference to the earlier backfitting procedures is that here the conditional expectation in (8.10) is estimated over all dimensions and not only over direction $\alpha$. For more details see Mammen et al. (1999). They prove the following theorem when using kernel regression smoothers inside their backfitting step:
For the backfitting estimator $\widehat{g}_\alpha$, $\alpha = 1, \ldots, d$, we have $n^{2/5}\{\widehat{g}_\alpha(x_\alpha) - g_\alpha(x_\alpha)\} \stackrel{L}{\longrightarrow} N\{b_\alpha(x_\alpha), v_\alpha(x_\alpha)\}$ with a bias function $b_\alpha$ and a variance function $v_\alpha$.
The variance function is of the same rate as if we had estimated the univariate model $E(Y \mid X_\alpha) = g_\alpha(X_\alpha)$. The same result holds for the bias term if the regression function is truly additive. Note in particular that for estimating the univariate functions $g_\alpha$, we obtain the univariate rate of convergence that we know from Chapter 4.
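Written out, the rate claim amounts to the following (a standard consequence for twice continuously differentiable component functions and second-order kernels):

```latex
h_\alpha \sim n^{-1/5}
\quad\Longrightarrow\quad
\widehat{g}_\alpha(x_\alpha) - g_\alpha(x_\alpha) = O_p\!\left(n^{-2/5}\right),
\qquad \alpha = 1, \ldots, d ,
```

i.e. the rate is free of the dimension $d$, in contrast to the fully nonparametric estimation of a $d$-dimensional regression surface.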
Further, for the overall regression estimator constructed from the component estimators $\widehat{g}_\alpha$, the following result holds:
This means that even for the estimate of the full additive regression function we obtain the univariate rate of convergence. Moreover, the correlations between two component estimators $\widehat{g}_\alpha$ and $\widehat{g}_\beta$ are of higher order and hence asymptotically negligible. For details on the implementation of the smoothed backfitting estimator and its computational performance we refer to Nielsen & Sperlich (2002).