Breaking down an intricate regression with many variables into a system
of simpler relationships - each involving fewer variables - is a
desirable goal when modeling high dimensional regression data. Leontief
(1947a), for example,
considers the process of steel production and points out that
the various materials involved in the production of steel should
be combined into additional, intermediate variables. Such
a goal can be achieved, for instance, by an additive model in which
the multivariate regression function splits up into
a sum of nonparametric functions.
More precisely, let $\Psi(Y), \phi_1(X_1), \ldots, \phi_d(X_d)$ be arbitrary measurable
mean zero functions of the corresponding random variables.
The fraction of variance not explained by a regression of
$\Psi(Y)$ on $\sum_{j=1}^{d} \phi_j(X_j)$ is

$$e^2(\Psi, \phi_1, \ldots, \phi_d) = \frac{E\left[\Psi(Y) - \sum_{j=1}^{d} \phi_j(X_j)\right]^2}{E[\Psi^2(Y)]}. \qquad (10.3.5)$$

Define the optimal transformations $\Psi^*, \phi_1^*, \ldots, \phi_d^*$
as the minimizers of 10.3.5.
Such optimal transformations exist and the ACE algorithm
(Breiman and Friedman 1985) gives estimates of these transformations.
Leontief (1947b) calls such a model additive separable and describes
a method of checking this additive separability.
For the bivariate case ($d = 1$) the optimal transformations $\Psi^*$ and $\phi^*$
satisfy

$$\rho^*(X, Y) = \rho\bigl(\phi^*(X), \Psi^*(Y)\bigr) = \max_{\phi, \Psi} \rho\bigl(\phi(X), \Psi(Y)\bigr),$$

where $\rho$ is the correlation coefficient. The quantity $\rho^*$ is also
known as the maximal correlation coefficient
between $X$ and $Y$ and is used
as a general measure of dependence. For theoretical properties of maximal
correlation I refer to Breiman and Friedman (1985). These authors also report
that, according to Kolmogorov, if $(Y, X_1, \ldots, X_d)$ are jointly normally
distributed, then the transformations having maximal correlation
are linear; for a bivariate normal pair the maximal correlation is therefore
simply $\lvert\rho(X, Y)\rvert$.
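This result can be checked numerically. The following is a small illustrative sketch of our own (assuming Python with numpy; it is not part of any reference implementation): for simulated bivariate normal data, no nonlinear pair of transformations should exceed the correlation attained by the linear pair.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.6
# Draw a large bivariate normal sample with correlation rho.
x, y = rng.multivariate_normal([0.0, 0.0],
                               [[1.0, rho], [rho, 1.0]],
                               size=20_000).T

def corr(a, b):
    """Sample correlation coefficient."""
    return np.corrcoef(a, b)[0, 1]

# By the Kolmogorov result, for jointly normal data the maximal
# correlation equals |rho|, attained by linear transformations.
print(corr(x, y))                    # approximately 0.6
print(corr(np.tanh(x), np.tanh(y)))  # smaller than 0.6
print(corr(x**3, y**3))              # smaller than 0.6
```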
Suppose that the data are generated by the regression model

$$\Psi(Y) = \sum_{j=1}^{d} \phi_j(X_j) + \varepsilon.$$

Note that the optimal transformations do not correspond to the
conditional mean function here. Looking for functions that maximize correlation
is not the same as estimating the conditional mean function. However,
if the $X_j$ have a joint normal distribution and $\varepsilon$ is an
independent normal random variable, then the optimal transformations are
exactly the (linear) transformations $\Psi$ and $\phi_j$.
In general, though, for a
regression model of this form with $\varepsilon$ independent of the $X_j$, the
optimal transformations are different from the transformations used to
construct the model. In practice, the transformations found by the ACE
algorithm are sometimes different as well, as will be seen in the Exercises.
To illustrate the ACE algorithm consider first the bivariate case:

$$e^2(\Psi, \phi) = E\left[\Psi(Y) - \phi(X)\right]^2. \qquad (10.3.6)$$

The optimal $\Psi$ for a given $\phi$,
keeping $E\Psi^2(Y) = 1$, is

$$\Psi_1(y) = \frac{E[\phi(X) \mid Y = y]}{\left\|E[\phi(X) \mid Y = y]\right\|} \qquad (10.3.7)$$

with $\|\cdot\| = \left[E(\cdot)^2\right]^{1/2}$. The
minimization of 10.3.6 with respect to $\phi$ for given $\Psi$
gives

$$\phi_1(x) = E[\Psi(Y) \mid X = x]. \qquad (10.3.8)$$

The basis of the following iterative optimization algorithm is the
alternation between the conditional expectations
10.3.7 and 10.3.8.
Algorithm 10.3.1
Basic ACE
SET $\Psi(Y) = Y / \|Y\|$;
REPEAT
Replace $\phi(X)$ with $E[\Psi(Y) \mid X]$ as in 10.3.8;
Replace $\Psi(Y)$ with $E[\phi(X) \mid Y] \big/ \left\|E[\phi(X) \mid Y]\right\|$ as in 10.3.7;
UNTIL $e^2(\Psi, \phi)$ fails to decrease.
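To make the alternation concrete, here is a minimal sketch of Algorithm 10.3.1 (our own illustration in Python with numpy, not the original implementation). The conditional expectations are estimated by a crude k-nearest-neighbor average; any scatterplot smoother, such as the supersmoother recommended below, could be substituted. The names knn_smooth and basic_ace are ours.

```python
import numpy as np

def knn_smooth(x, z, k=20):
    """Estimate E[z | x] at each observation by averaging the z-values
    of roughly the k nearest neighbors in the x-ordering."""
    order = np.argsort(x)
    z_sorted = z[order]
    csum = np.cumsum(np.insert(z_sorted, 0, 0.0))
    n = len(x)
    smoothed = np.empty(n)
    for i in range(n):
        lo = max(0, i - k // 2)
        hi = min(n, lo + k)
        lo = max(0, hi - k)
        smoothed[i] = (csum[hi] - csum[lo]) / (hi - lo)
    result = np.empty(n)
    result[order] = smoothed
    return result

def basic_ace(x, y, max_iter=50, tol=1e-6):
    """Bivariate ACE: alternate the conditional expectations
    10.3.7 and 10.3.8 until e^2 fails to decrease."""
    psi = y - y.mean()
    psi /= np.sqrt((psi ** 2).mean())       # SET Psi(Y) = Y / ||Y||
    e2_old = np.inf
    for _ in range(max_iter):
        phi = knn_smooth(x, psi)            # phi(X) <- E[Psi(Y) | X]   (10.3.8)
        phi -= phi.mean()
        psi = knn_smooth(y, phi)            # Psi(Y) <- E[phi(X) | Y]   (10.3.7)
        psi -= psi.mean()
        psi /= np.sqrt((psi ** 2).mean())   # ... normalized to ||Psi|| = 1
        e2 = ((psi - phi) ** 2).mean()      # criterion 10.3.6
        if e2_old - e2 < tol:               # UNTIL e^2 fails to decrease
            break
        e2_old = e2
    return phi, psi, e2
```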
The more general case of multiple predictors can be treated
in direct analogy with the basic ACE algorithm.
For a given set of functions $\phi_1, \ldots, \phi_d$, minimization
of 10.3.5 with respect to $\Psi(Y)$, holding
$E\Psi^2(Y) = 1$, yields

$$\Psi_1(y) = \frac{E\left[\sum_{j=1}^{d} \phi_j(X_j) \mid Y = y\right]}{\left\|E\left[\sum_{j=1}^{d} \phi_j(X_j) \mid Y = y\right]\right\|}.$$

Next, 10.3.5 is minimized with respect to a single function $\phi_k(X_k)$
for given $\Psi(Y)$ and given
$\phi_1(X_1), \ldots, \phi_{k-1}(X_{k-1}), \phi_{k+1}(X_{k+1}), \ldots, \phi_d(X_d)$;
the minimizer is again a conditional expectation,

$$\phi_{k,1}(x_k) = E\Big[\Psi(Y) - \sum_{j \neq k} \phi_j(X_j) \mid X_k = x_k\Big].$$

This iterative procedure is described in the full ACE algorithm.
Algorithm 10.3.2
Full ACE
SET $\Psi(Y) = Y / \|Y\|$
and $\phi_1(X_1) = \cdots = \phi_d(X_d) = 0$;
REPEAT
REPEAT
FOR $k = 1$ TO $d$ DO BEGIN
Replace $\phi_k(X_k)$ with $E\big[\Psi(Y) - \sum_{j \neq k} \phi_j(X_j) \mid X_k\big]$;
END;
UNTIL $e^2(\Psi, \phi_1, \ldots, \phi_d)$
fails to
decrease;
Replace $\Psi(Y)$ with $E\big[\sum_{j=1}^{d} \phi_j(X_j) \mid Y\big] \Big/ \left\|E\big[\sum_{j=1}^{d} \phi_j(X_j) \mid Y\big]\right\|$;
UNTIL $e^2(\Psi, \phi_1, \ldots, \phi_d)$
fails to decrease.
In practice, one has to use smoothers to estimate the conditional
expectations involved. Use of a fully
automatic smoothing procedure, such as the supersmoother, is
recommended. Figure
10.9 shows a three-dimensional data set with
$X_1, X_2$ independent standard normals and
a response generated from a smooth function of them
with standard normal errors.
The ACE algorithm produced the transformation presented in
Figure 10.10.
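An experiment of this kind can be imitated with the sketches above. The regression function used here is a hypothetical stand-in of our own choosing, since the exact simulated model is recorded only in the figures:

```python
rng = np.random.default_rng(1)
n = 400
X = rng.standard_normal((n, 2))      # X1, X2 independent standard normal
eps = rng.standard_normal(n)         # standard normal errors
y = X[:, 0] + X[:, 1] ** 3 + eps     # hypothetical additive model
psi_hat, phi_hats, e2 = full_ace(X, y)
# Plotting y against psi_hat, and each X[:, k] against phi_hats[:, k],
# displays the estimated transformations in the manner of
# Figures 10.10 and 10.11.
```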
Figure 10.9:
A simulated data set; $X_1, X_2$ independent standard normal. Made with XploRe (1989).

Figure 10.10:
The estimated ACE transformation. Made with XploRe (1989).

The estimated transformation is remarkably close to the true transformation.
Figure 10.11 displays the estimated transformation for one of the predictor
variables, which represents the underlying function extremely well.

Figure 10.11:
The estimated transformation for one of the predictor variables.
Made with XploRe (1989).
Breiman and Friedman (1985) applied the ACE methodology also to the Boston
housing data set (Harrison and Rubinfeld 1978; see also Section
10.1).
The resulting final model
involved four predictors and achieved a high $R^2$; an application of ACE
to the full set of 13 variables yielded only a small further increase in $R^2$
(the exact values are reported in their paper).
Figure 10.12a shows a plot from their paper of the solution response surface
transformation $\hat\Psi$. This function is seen to have a positive curvature
for central values of its argument, connecting two straight line segments of different
slope on either side. This suggests that the log-transformation used by
Harrison and Rubinfeld (1978) may be too severe. Figure 10.12b shows the
response transformation for the original untransformed census measurements.
The remaining plots in Figure 10.12 display the other estimated transformations;
for details see Breiman and Friedman (1985).
Exercises
10.3.1 Prove that in the bivariate case the function given in 10.3.7 is indeed the optimal
transformation $\Psi_1$.
10.3.2 Prove that in the bivariate case the function given in 10.3.8 is indeed the optimal
transformation $\phi_1$.
10.3.3 Try the ACE algorithm with some real data. Which smoother would you use as an
elementary building block?
10.3.4 In the discussion of the Breiman and Friedman article, D. Pregibon and Y.
Vardi generated data from a simple model with known error distribution
(the exact specification is given in their discussion). What are possible
transformations?
10.3.5 Try the ACE algorithm with the data set from Exercise 10.3.4. What
transformations do you get?
Do they coincide with the transformations you
computed in 10.3.4?
[Hint: See the discussion of Breiman and Friedman (1985).]