The backfitting procedures proposed in Breiman & Friedman (1985) and
Buja et al. (1989) are widely used to approximate the additive components
of (8.1). The latter paper
considers the problem of finding the projection of the regression function
$E(Y\,|\,\mathbf{X})$ onto the
space of additive models represented by the right hand side of
(8.1). Using the observed sample, this
leads to a system of normal equations of dimension
$nd \times nd$. To solve this system in practice, a backfitting or
Gauss-Seidel algorithm
is used.
The backfitting technique is iterative which leads to additional difficulties for developing asymptotic theory. Moreover, the final estimates may depend on the starting values or the convergence criterion. Therefore, since its first introduction, the method has been refined considerably and extended to more complicated models. We refer to the bibliographic notes at the end of this and the following chapter for more information.
We will now consider the problem of estimating the additive conditional expectation of $Y$ given
$\mathbf{X} = (X_1,\ldots,X_d)^\top$
based on the random sample
$\{(\mathbf{X}_i, Y_i)\}$,
$i = 1,\ldots,n$.
The component functions
$g_\alpha$ in (8.1)
explain the specific impact of the particular regressor
$X_\alpha$ on the
response
$Y$.
We first introduce an identification assumption: let $E\{g_\alpha(X_\alpha)\} = 0$ for all $\alpha = 1,\ldots,d$, so that the constant in (8.1) equals $c = E(Y)$.
Let us now turn to the estimation of the additive components
$g_\alpha$.
Hastie & Tibshirani (1990) motivate the backfitting method as follows.
Consider the optimization problem
$$\min_{c,\,g_1,\ldots,g_d}\; E\Bigl\{\, Y - c - \sum_{\alpha=1}^{d} g_\alpha(X_\alpha) \Bigr\}^2 .$$
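Holding all other components fixed and minimizing over $g_\alpha$ gives, for each $\alpha$, the first order condition
$$g_\alpha(x_\alpha) \;=\; E\Bigl\{\, Y - c - \sum_{j\neq\alpha} g_j(X_j) \,\Big|\, X_\alpha = x_\alpha \Bigr\},$$
i.e. each component is the regression of the corresponding partial residual on $X_\alpha$. Replacing the conditional expectation operators by sample smoother matrices and stacking the $d$ equations yields the system of normal equations discussed next.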
The system (8.7)
could in principle be solved exactly for the stacked vector
$\widehat{\mathbf{g}} = (\widehat{\mathbf{g}}_1^\top,\ldots,\widehat{\mathbf{g}}_d^\top)^\top$ of component values at the observations.
Of course, when
$n$ is large the exact solution of (8.7)
is hardly feasible. Furthermore, the
$nd \times nd$ matrix
on the left hand side is often not regular, so the
equation cannot be solved directly. Therefore, in practice the backfitting
(Gauss-Seidel) algorithm is used to solve these equations: given
starting values
$\widehat{\mathbf{g}}_1^{(0)},\ldots,\widehat{\mathbf{g}}_d^{(0)}$,
the vectors
$\widehat{\mathbf{g}}_\alpha$ are updated through
$$\widehat{\mathbf{g}}_\alpha^{(r)} \;=\; \mathbf{S}_\alpha\Bigl(\mathbf{Y} - \sum_{j<\alpha}\widehat{\mathbf{g}}_j^{(r)} - \sum_{j>\alpha}\widehat{\mathbf{g}}_j^{(r-1)}\Bigr), \qquad \alpha = 1,\ldots,d,\ r = 1,2,\ldots,$$
with $\mathbf{Y} = (Y_1,\ldots,Y_n)^\top$, cycling over the components until convergence.
It is important to note that to estimate
$g_\alpha$, the
backfitting algorithm uses smoother matrices
$\mathbf{S}_\alpha$ that
depend only on the observations of the single component
$X_\alpha$. These smoother matrices
are easy and fast to calculate since they involve only a one-dimensional
smoothing problem. This is a strong simplification
(also from a theoretical point of view) and
may lead to problems in the theory and in the
interpretation of what the procedure is actually fitting in practice.
We will come back to this point in Subsection 8.1.3.
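To make the update rule concrete, here is a minimal Python sketch of the Gauss-Seidel iteration, assuming Gaussian-kernel Nadaraya-Watson smoother matrices; the function names, the bandwidth handling and the convergence rule are illustrative choices rather than prescriptions from the text.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Nadaraya-Watson smoother matrix for one regressor (Gaussian kernel);
    row i holds the weights used to smooth at observation x[i]."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2)
    return K / K.sum(axis=1, keepdims=True)

def backfit(X, y, h, n_iter=100, tol=1e-6):
    """Classical backfitting: cycle the univariate smoothers over the
    partial residuals until the component fits stop changing."""
    n, d = X.shape
    c = y.mean()                                    # estimate of the constant c
    S = [nw_smoother_matrix(X[:, a], h[a]) for a in range(d)]
    g = np.zeros((d, n))                            # g[a] = fitted g_alpha at the observations
    for _ in range(n_iter):
        g_old = g.copy()
        for a in range(d):
            partial = y - c - g.sum(axis=0) + g[a]  # remove constant and all other components
            g[a] = S[a] @ partial
            g[a] -= g[a].mean()                     # recenter: sample analogue of E g_alpha = 0
        if np.max(np.abs(g - g_old)) < tol:
            break
    return c, g

# toy usage on a truly additive model
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
c_hat, g_hat = backfit(X, y, h=[0.3, 0.3])
```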
For the two-dimensional case we can write down the backfitting estimator
explicitly. From equation (8.6) we have
$$\widehat{\mathbf{g}}_1 = \mathbf{S}_1\bigl(\mathbf{Y}-\widehat{\mathbf{g}}_2\bigr), \qquad
\widehat{\mathbf{g}}_2 = \mathbf{S}_2\bigl(\mathbf{Y}-\widehat{\mathbf{g}}_1\bigr).$$
Substituting the second equation into the first yields
$$\bigl(\mathbf{I}-\mathbf{S}_1\mathbf{S}_2\bigr)\widehat{\mathbf{g}}_1 = \mathbf{S}_1\bigl(\mathbf{I}-\mathbf{S}_2\bigr)\mathbf{Y},$$
so that, provided the matrix on the left hand side is invertible,
$$\widehat{\mathbf{g}}_1 = \bigl(\mathbf{I}-\mathbf{S}_1\mathbf{S}_2\bigr)^{-1}\mathbf{S}_1\bigl(\mathbf{I}-\mathbf{S}_2\bigr)\mathbf{Y}, \qquad
\widehat{\mathbf{g}}_2 = \mathbf{S}_2\bigl(\mathbf{Y}-\widehat{\mathbf{g}}_1\bigr).$$
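These formulas lend themselves to a quick numerical check. The following is a minimal companion sketch, assuming two precomputed smoother matrices (for instance from nw_smoother_matrix above) and a centered response vector; with raw Nadaraya-Watson weights both smoothers reproduce the constant vector, so centering is needed to keep $\mathbf{I}-\mathbf{S}_1\mathbf{S}_2$ invertible.

```python
import numpy as np

def explicit_backfit_2d(S1, S2, y):
    """Closed-form two-dimensional backfitting limit:
    g1 = (I - S1 S2)^{-1} S1 (I - S2) y,  g2 = S2 (y - g1).
    Requires I - S1 S2 to be invertible (no exact concurvity); in practice
    y and/or the smoothers are centered so the constant direction is removed."""
    n = len(y)
    I = np.eye(n)
    g1 = np.linalg.solve(I - S1 @ S2, S1 @ (I - S2) @ y)
    g2 = S2 @ (y - g1)
    return g1, g2
```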
In the general case, because of its iterative nature, theoretical results for backfitting were not available until recently. Opsomer & Ruppert (1997) provide conditional mean squared error expressions, albeit under rather strong conditions on the design and the smoother matrices. Mammen et al. (1999) prove consistency and derive the asymptotic properties under weaker conditions. We will present their backfitting algorithm and a different one in Subsection 8.1.3.
Before proceeding we will consider two examples illustrating what precision can be achieved by additive regression and how the backfitting algorithm performs in practice.
Next, we turn to a real data example that demonstrates the use of additive regression estimators in practice and shows that even for a high-dimensional data set the backfitting estimator works reasonably well. We consider the Boston housing data studied by Harrison & Rubinfeld (1978), with the following explanatory variables:
$X_1$: per capita crime rate by town,
$X_2$: proportion of non-retail business acres per town,
$X_3$: nitric oxides concentration (parts per 10 million),
$X_4$: average number of rooms per dwelling,
$X_5$: proportion of owner-occupied units built prior to 1940,
$X_6$: weighted distances to five Boston employment centers,
$X_7$: full-value property tax rate per USD 10,000,
$X_8$: pupil-teacher ratio by town,
$X_9$: $1000\,(B-0.63)^2$, where $B$ is the proportion of Black residents by town,
$X_{10}$: percent lower status of the population.
The response variable is the median value of owner-occupied
homes in 1,000 USD. Proceeding on the assumption that the model
is of the additive form (8.1), we estimate the component functions $g_\alpha$ by backfitting; the resulting estimates are displayed in Figures 8.2 and 8.3.
Looking at
Figures 8.2 and 8.3 we realize that a log-linear
model would fit these data well in many directions. This
was in fact the model applied by Harrison & Rubinfeld (1978).
In contrast, two of the explanatory variables
show a clear nonlinear impact.
Note that in this example smoothing
seems to be problematic for some of the variables
due to the distribution
of these regressors. In fact, in
applying backfitting we have so far
assumed continuous regressors only.
To overcome this deficiency, we will
later on (in Section 9.3) extend the generalized
additive model to a semiparametric model. Among
other advantages, this allows us to fit some of the regressors in
a parametric fashion.
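To illustrate how such a fit can be computed, the following sketch applies the backfit routine from the earlier sketch to a local copy of the Boston housing data; the file name boston.csv, the column labels and the rule-of-thumb bandwidths are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Assumed local copy of the data with the standard Boston housing column labels.
df = pd.read_csv("boston.csv")
regressors = ["CRIM", "INDUS", "NOX", "RM", "AGE", "DIS", "TAX", "PTRATIO", "B", "LSTAT"]
X = df[regressors].to_numpy(dtype=float)
y = df["MEDV"].to_numpy(dtype=float)        # median value of owner-occupied homes (1,000 USD)

# crude rule-of-thumb bandwidths, one per regressor
h = 1.06 * X.std(axis=0) * len(y) ** (-1 / 5)

# backfit() is the sketch given after the algorithm description above
c_hat, g_hat = backfit(X, y, h)

# g_hat[a] holds the fitted a-th component at the observations; plotting X[:, a]
# against g_hat[a] gives displays of the kind discussed for Figures 8.2 and 8.3.
```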
Two of the disadvantages of the backfitting procedure described above
are the dependence of the final solution on
the starting functions and the
possible non-uniqueness of the solution.
Recall equation (8.6),
which can be written compactly as a linear system
$\widehat{\mathbf{P}}\,\widehat{\mathbf{g}} = \widehat{\mathbf{Q}}\,\mathbf{Y}$, say, for the stacked vector $\widehat{\mathbf{g}}$. Non-uniqueness can occur if there exists a vector
$\mathbf{g}^{*} \neq 0$ such that
$\widehat{\mathbf{P}}\,\mathbf{g}^{*} = 0$. Then equation (8.6) has an infinite number of
solutions because if
$\widehat{\mathbf{g}}$ is a solution, then so is
$\widehat{\mathbf{g}} + \gamma\,\mathbf{g}^{*}$ for any
$\gamma \in \mathbb{R}$.
This phenomenon is called concurvity.
Concurvity occurs for special interrelations among the
regressors $X_\alpha$.
In mathematical terms this can be explained as follows:
Let
$\mathbf{S}_1,\ldots,\mathbf{S}_d$ be smoother matrices with eigenvalues
in
$[0,1]$, which is true for all cases presented here. Let further
$\mathcal{M}_\alpha$ be the space spanned by all eigenvectors of
$\mathbf{S}_\alpha$ that correspond to the eigenvalue
$1$. Then
$\widehat{\mathbf{P}}\,\mathbf{g}^{*} = 0$ is possible if there exist vectors
$\mathbf{g}^{*}_\alpha \in \mathcal{M}_\alpha$ for all
$\alpha$, not all of them zero, such that
$\sum_{\alpha=1}^{d}\mathbf{g}^{*}_\alpha = 0$.
Note that the elements of
$\mathcal{M}_\alpha$ are exactly those vectors which
pass unchanged through the smoother
$\mathbf{S}_\alpha$.
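This eigenspace condition can be checked numerically. The sketch below (function names are illustrative) extracts a basis of the eigenvalue-1 space of each smoother matrix and tests whether some nontrivial combination of such vectors sums to zero; note that raw Nadaraya-Watson smoother matrices always share the constant vector as an eigenvalue-1 direction, which is precisely the ambiguity removed by fixing the constant $c$.

```python
import numpy as np

def eigenspace_one(S, tol=1e-8):
    """Basis of the eigenvalue-1 eigenspace of S: the vectors that pass
    unchanged through the smoother."""
    vals, vecs = np.linalg.eig(S)
    keep = np.abs(vals - 1.0) < tol
    return np.real(vecs[:, keep])

def exact_concurvity(smoothers, tol=1e-8):
    """True if there are vectors g*_alpha in the eigenvalue-1 spaces,
    not all zero, with sum_alpha g*_alpha = 0 (so P g* = 0 and the
    normal equations have infinitely many solutions)."""
    bases = [eigenspace_one(S, tol) for S in smoothers]
    B = np.hstack(bases)
    # a nontrivial null vector of the stacked basis matrix yields such g*_alpha
    return np.linalg.matrix_rank(B) < B.shape[1]
```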
To reduce the problem of concurvity, a modification of the backfitting
algorithm is useful. Let
$\mathbf{A}_\alpha$ be
the matrix that projects onto
$\mathcal{M}_\alpha$ and consequently
$\mathbf{I}-\mathbf{A}_\alpha$ the matrix that projects onto
$\mathcal{M}_\alpha^{\perp}$
(the orthogonal complement of
$\mathcal{M}_\alpha$).
Let further
$\mathbf{A}$ denote the orthogonal projection onto
$\mathcal{M}_1 + \cdots + \mathcal{M}_d$.
The idea is now to first project the variable to regress
onto
$\mathcal{M}_1 + \cdots + \mathcal{M}_d$, then to smooth the remaining partial residuals
by using
$\mathbf{S}_\alpha$,
and finally to project the result onto
$\mathcal{M}_\alpha^{\perp}$.
In other words, we first apply
$\mathbf{A}$ to the full residual and then
the modified smoothers $(\mathbf{I}-\mathbf{A}_\alpha)\mathbf{S}_\alpha$ to the partial residuals.
We can summarize this modified estimation algorithm as follows: starting from initial component estimates, each cycle first updates the projection part by applying $\mathbf{A}$ to the current residual and then updates each $\widehat{\mathbf{g}}_\alpha$ by applying $(\mathbf{I}-\mathbf{A}_\alpha)\mathbf{S}_\alpha$ to the corresponding partial residual; the cycles are repeated until convergence.
In practice, often
the projection onto the linear space spanned by the explanatory variables is used for $\mathbf{A}$,
i.e. each iteration step starts with a parametric linear least
squares regression on
$X_1,\ldots,X_d$.
Hastie & Tibshirani (1990) claim that this algorithm makes it easier to find a unique solution and also eliminates the dependence of the final results on the starting functions.
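Here is a minimal sketch of this "linear least squares first" variant, assuming Nadaraya-Watson smoothers; taking the span of $\{1, X_\alpha\}$ as the per-component projection space stands in for the projections $\mathbf{A}_\alpha$ and is an illustrative assumption, not the book's exact choice.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Nadaraya-Watson smoother matrix (Gaussian kernel), as in the earlier sketch."""
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2)
    return K / K.sum(axis=1, keepdims=True)

def modified_backfit(X, y, h, n_iter=100):
    """Modified backfitting in the spirit described above: every cycle starts
    with a linear least squares fit on the regressors (playing the role of A),
    and each nonparametric update is purged of the part already explained by
    the per-component linear projection (standing in for A_alpha)."""
    n, d = X.shape
    Z = np.column_stack([np.ones(n), X])          # intercept + all regressors
    H = Z @ np.linalg.pinv(Z)                     # OLS hat matrix, used as A
    S = [nw_smoother_matrix(X[:, a], h[a]) for a in range(d)]
    Za = [np.column_stack([np.ones(n), X[:, a]]) for a in range(d)]
    Ha = [Za[a] @ np.linalg.pinv(Za[a]) for a in range(d)]   # projection onto span{1, X_alpha}
    g = np.zeros((d, n))
    lin = np.zeros(n)
    for _ in range(n_iter):
        lin = H @ (y - g.sum(axis=0))             # parametric step on the full residual
        for a in range(d):
            partial = y - lin - g.sum(axis=0) + g[a]
            fit = S[a] @ partial
            g[a] = fit - Ha[a] @ fit              # modified smoother: remove the projected part
    return lin, g                                 # total fit: lin + g.sum(axis=0)
```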
As already mentioned above, Mammen et al. (1999) give theoretical
results for a backfitting procedure that is different from the
algorithms presented so far. Their main idea is to
estimate the conditional expectations ``correctly'' instead of
approximating them with the one-dimensional smoother matrices
$\mathbf{S}_\alpha$. The proof uses rather weak conditions, and
both their modification and their proof go back to the
interpretation of backfitting as a projection of the original
data onto the space of additive models.
Let us consider the algorithm of Mammen et al. (1999)
in more detail here.
As before we keep the notation introduced above and recall
equations (8.4) to (8.8).
The difference from the earlier backfitting procedures
is that here the conditional expectation
in (8.10) is estimated
as the expectation over all dimensions and not only over the direction
$\alpha$.
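To make this concrete, when Nadaraya-Watson pilot estimates are used, the components of the Mammen et al. (1999) backfitting solve, up to the choice of centering, a system of the form
$$\widehat{g}_\alpha(x_\alpha) \;=\; \widetilde{m}_\alpha(x_\alpha) \;-\; \sum_{j\neq\alpha} \int \widehat{g}_j(x_j)\, \frac{\widehat{f}_{\alpha j}(x_\alpha,x_j)}{\widehat{f}_{\alpha}(x_\alpha)}\, dx_j \;-\; \widehat{c},$$
where $\widetilde{m}_\alpha$ denotes the marginal Nadaraya-Watson regression of $Y$ on $X_\alpha$ and $\widehat{f}_{\alpha}$, $\widehat{f}_{\alpha j}$ are one- and two-dimensional kernel density estimates; the system is iterated over $\alpha$ until convergence. The integration over the estimated joint density replaces the simple evaluation at the observations used by the ordinary smoother matrices $\mathbf{S}_\alpha$, which is the sense in which the conditional expectations are estimated ``correctly''.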
For more details see
Mammen et al. (1999). They give
the following theorem when using kernel regression smoothers
inside their backfitting step. For the backfitting estimator the theorem
provides the asymptotic bias and variance functions.
The variance function is of the same rate as if we had estimated
the univariate model
$Y = c + g_\alpha(X_\alpha) + \varepsilon$.
The same result holds for the bias term
if the regression function is truly additive.
Note in particular that for estimating the univariate
functions
$g_\alpha$, we obtain the univariate rate
of convergence that we know from Chapter 4.
Further, for the overall regression estimator
$\widehat{m}$ constructed from the estimators
$\widehat{g}_\alpha$ the following
result holds:
This means that even for the estimate of