16.3 Boosting

Boosting algorithms have been proposed in the machine learning literature by Schapire ([47]) and Freund ([30], [31]); see also ([48]). These first algorithms were developed as ensemble methods. Unlike bagging, which is a parallel ensemble method, boosting methods are sequential ensemble algorithms where the weights $ c_k$ in (16.1) depend on the previously fitted functions $ \hat{g}_1,\ldots ,\hat{g}_{k-1}$. Boosting has been empirically demonstrated to be very accurate in terms of classification, notably with the so-called AdaBoost algorithm ([31]).

We will explain below that boosting can be viewed as a nonparametric optimization algorithm in function space, as first pointed out by Breiman ([11], [12]). This view turns out to be very fruitful to adapt boosting for other problems than classification, including regression and survival analysis.

Maybe it is worth mentioning here that boosting algorithms often have better predictive power than bagging (cf. ([11])); of course, such a statement has to be read with caution, and methods should be tried out on individual data-sets, e.g. using cross-validation, before selecting one among a few candidates.

To give an idea, we report here some empirical results from ([11]) for classification: we show below the gains (in percentage) of boosting trees over bagging trees:

  ''normal'' size data-sets:   $\displaystyle \quad 64.3\,{\%} , 10.8\,{\%} , 20.3\,{\%} , - 4.6\,{\%} , 6.9\,{\%} , 16.2\,{\%} \;,$    
  large data-sets:   $\displaystyle \quad 37.5\,{\%} , 12.6\,{\%} , - 50.0\,{\%} , 4.0\,{\%} , 28.6\,{\%} \;.$    

For all data-sets, boosting trees performed better than a single classification tree. The biggest loss of $ 50\,{\%}$ for boosting in comparison with bagging occurs for a data-set with very low misclassification error, where bagging achieves $ 0.014\,{\%}$ and boosting $ 0.021\,{\%}$.


16.3.1 Boosting as Functional Gradient Descent

Rather than looking through the lens of ensemble methods, boosting algorithms can be seen as functional gradient descent techniques ([11], [12]). The goal is to estimate a function $ g: \mathbb{R}^d \to \mathbb{R}$, minimizing an expected loss

$\displaystyle \mathbb{E}\left[\ell(Y,g(X))\right]\;,\ \ell(\cdot,\cdot): \mathbb{R} \times \mathbb{R} \to \mathbb{R}^{+}\;,$ (16.11)

based on data $ (X_i,Y_i)\ (i=1,\ldots ,n)$ as in Sect. 16.2.1. The loss function $ \ell$ is typically assumed to be convex in the second argument. We consider here both the case where the univariate response $ Y$ is continuous (regression problem) and the case where it is discrete (classification problem), since boosting is potentially useful in both.

As we will see in Sect. 16.3.2, boosting algorithms are pursuing a ''small'' empirical risk

$\displaystyle n^{-1} \sum_{i=1}^n \ell\left(Y_i,g(X_i)\right)$    

by selecting a $ g$ in the linear hull of some function class, i.e. $ g(\cdot) = \sum_k c_k g_k(\cdot)$ with $ g_k(\cdot)$'s from a function class such as trees.

The most popular loss functions, for regression and binary classification, are given in Table 16.1.


Table 16.1: The squared error, binomial negative log-likelihood and exponential loss functions and their population minimizers; logit$ (p) = \log(p/(1-p))$

Boosting | Loss function | Population minimizer for (16.11)
$ L_2$Boost | $ \ell(y,g) = (y-g)^2$ | $ g(x) = \mathbb{E}[Y\vert X=x]$
LogitBoost | $ \ell(y,g) = \log_2(1+ \exp(-2(2y-1)g))$ | $ g(x) = 0.5 \cdot$   logit$ (\mathbb{P}[Y=1\vert X=x])$
AdaBoost | $ \ell(y,g) = \exp(-(2y-1)g)$ | $ g(x) = 0.5 \cdot$   logit$ (\mathbb{P}[Y=1\vert X=x])$

While the squared error loss is mainly used for regression (see ([19]) for classification with the squared error loss), the log-likelihood and the exponential loss are for binary classification only.

16.3.1.1 The Margin for Classification

The form of the log-likelihood loss may be somewhat unusual: we normalize it by using the logarithm to base $ 2$, so that it ''touches'' the misclassification error as an upper bound (see Fig. 16.3), and we write it as a function of the so-called margin $ \tilde{y} g$, where $ \tilde{y} = 2y-1 \in \{-1,1\}$ is the usual labeling in the machine learning community. Thus, the loss is a function of the margin $ \tilde{y} g$ only; the same is true for the exponential loss, and also for the squared error loss in classification, since

$\displaystyle (\tilde{y} - g)^2 = \tilde{y}^2 - 2 \tilde{y} g + g^2 = 1 - 2 \tilde{y}g + (\tilde{y}g)^2\;,$    

using $ \tilde{y}^2 =1$.

Figure 16.3: Loss functions of the margin for binary classification. Zero-one misclassification loss (solid line), log-likelihood loss (dashed line), exponential loss (dotted line), squared error loss (dashed/dotted). The loss-functions are described in Table 16.1

The misclassification loss, or zero-one loss, is $ {\large {\text{\textbf{1}}}}_{[\tilde{y}g <0]}$, again a function of the margin, whose population minimizer is $ g(x) = {\large {\text{\textbf{1}}}}_{\left[\mathbb{P}[Y=1\vert X=x] > 1/2\right]}$. For readers less familiar with the concept of the margin, this can also be understood as follows: the Bayes-classifier which minimizes the misclassification risk is

$\displaystyle g_{{\text{Bayes}}}(x) = {\large {\text{\textbf{1}}}}_{\left[\mathbb{P}[Y=1\vert X=x] > 1/2\right]}\;.$    

We can now see that a misclassification occurs if $ y=0,\ g_{{\text{Bayes}}}(x)=1$ or $ y=1,\ g_{{\text{Bayes}}}(x)=0$, which is equivalent to $ (2y-1)g_{{\text{Bayes}}}(x)<0$, i.e. $ \tilde{y}g_{{\text{Bayes}}}(x)<0$.

The (surrogate) loss functions given in Table 16.1 are all convex functions of the margin $ \tilde{y} g$ which bound the zero-one misclassification loss from above, see Fig. 16.3. The convexity of these surrogate loss functions is computationally important for empirical risk minimization; minimizing the empirical zero-one loss is computationally intractable.
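
To fix ideas, the following small Python check evaluates the losses of Table 16.1 as functions of the margin and verifies numerically that each bounds the zero-one loss from above (a minimal sketch; the grid of margin values is arbitrary):

import numpy as np

margin = np.linspace(-2, 2, 9)                   # values of the margin y~ * g

zero_one = (margin < 0).astype(float)            # misclassification loss 1_{[y~ g < 0]}
squared = (1.0 - margin) ** 2                    # (y~ - g)^2 = (1 - y~ g)^2
log_lik = np.log2(1.0 + np.exp(-2.0 * margin))   # binomial loss, normalized with log base 2
exponential = np.exp(-margin)                    # AdaBoost loss exp(-y~ g)

# each surrogate loss lies above the zero-one loss
assert np.all(squared >= zero_one)
assert np.all(log_lik >= zero_one)
assert np.all(exponential >= zero_one)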


16.3.2 The Generic Boosting Algorithm

Estimation of the function $ g(\cdot)$, which minimizes an expected loss in (16.11), is pursued by a constrained minimization of the empirical risk $ n^{-1} \sum_{i=1}^n \ell(Y_i,g(X_i))$. The constraint comes in algorithmically (and not explicitly), through the way we attempt to minimize the empirical risk, namely with a so-called functional gradient descent. This gradient descent view has been recognized and refined by various authors (cf. ([11], [12], [43], [34], [33], [19])). In summary, the minimizer of the empirical risk is constrained to satisfy a ''smoothness'' condition in terms of a linear expansion of (''simple'') fits from a real-valued base procedure.

16.3.2.0.1 Generic Functional Gradient Descent

Step 1 (initialization). Given data $ \{(X_i,Y_i); i=1,\ldots ,n\}$, apply the base procedure yielding the function estimate

$\displaystyle \hat{F}_1(\cdot) = \hat{g}(\cdot)\;,$    

where $ \hat{g} = \hat{g}_{X,Y} = h_n((X_1,Y_1),\ldots ,(X_n,Y_n))$ is a function of the original data. Set $ m = 1$.


Step 2 (projecting gradient to learner). Compute the negative gradient vector

$\displaystyle U_i = - \frac {\partial \ell(Y_i,g)} {\partial g}\vert _{g= \hat{F}_m(X_i)}\;,\quad i=1,\ldots ,n\;,$    

evaluated at the current $ \hat{F}_m(\cdot)$. Then, fit the base procedure to the negative gradient vector, yielding the estimate

$\displaystyle \hat{g}_{m+1}(\cdot)\;,$    

where $ \hat{g}_{m+1} = \hat{g}_{X,U} = h_n((X_1,U_1),\ldots ,(X_n,U_n))$ is a function of the original predictor variables and the current negative gradient vector as pseudo-response.


Step 3 (line search). Do a one-dimensional numerical search for the best step-size

$\displaystyle \hat{s}_{m+1} = \mathop{\text{arg}}\mathop{\text{min}}\nolimits_s \sum_{i=1}^n \ell\left(Y_i,\hat{F}_m(X_i) + s \hat{g}_{m+1}(X_i)\right)\;.$    

Update,

$\displaystyle \hat{F}_{m+1}(\cdot) = \hat{F}_m(\cdot) + \hat{s}_{m+1}\hat{g}_{m+1}(\cdot)\;.$    


Step 4 (iteration). Increase $ m$ by one and repeat Steps 2 and 3 until a stopping iteration $ M$ is achieved.
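
To fix ideas, the generic algorithm can be sketched in a few lines of Python; base_fit, loss and loss_grad are illustrative names for a user-supplied base procedure $ h_n$, an elementwise loss and its derivative with respect to the second argument (a sketch, not a reference implementation):

import numpy as np
from scipy.optimize import minimize_scalar

def generic_boost(X, Y, base_fit, loss, loss_grad, M=100):
    # base_fit(X, U) returns a callable g(.) with real-valued predictions;
    # loss(Y, F) returns the vector of losses, loss_grad its derivative in F
    g1 = base_fit(X, Y)                          # Step 1 (initialization): F_1 = g_hat
    ensemble = [(1.0, g1)]
    F = g1(X)
    for m in range(1, M):                        # Steps 2-4
        U = -loss_grad(Y, F)                     # negative gradient as pseudo-response
        g = base_fit(X, U)                       # base procedure applied to (X_i, U_i)
        s = minimize_scalar(lambda s: loss(Y, F + s * g(X)).sum()).x   # line search
        F = F + s * g(X)                         # F_{m+1} = F_m + s_hat_{m+1} g_hat_{m+1}
        ensemble.append((s, g))
    return lambda Xnew: sum(s * g(Xnew) for s, g in ensemble)   # the estimate F_hat_M(.)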


The number of iterations $ M$ is the tuning parameter of boosting. The larger it is, the more complex the estimator. But the complexity, for example the variance of the estimator, is not linearly increasing in $ M$: instead, it increases very slowly as $ M$ gets larger, see also Fig. 16.4 in Sect. 16.3.5.

Obviously, the choice of the base procedure influences the boosting estimate. Originally, boosting has been mainly used with tree-type base procedures, typically with small trees such as stumps (two terminal nodes) or trees having say $ 8$ terminal nodes (cf. ([11], [14], [7], [34], [25])); see also Sect. 16.3.8. But we will demonstrate in Sect. 16.3.6 that boosting may be very worthwhile within the class of linear, additive or interaction models, allowing for good model interpretation.

The function estimate $ \hat{g}_{m+1}$ in Step 2 can be viewed as an estimate of $ \mathbb{E}[U_i\vert X=x]$, the expected negative gradient given the predictor $ X$, and takes values in $ \mathbb{R}$, even in case of a classification problem with $ Y_i$ in a finite set (this is different from the AdaBoost algorithm, see below).

We call $ \hat{F}_M(\cdot)$ the $ L_2$Boost-, LogitBoost- or AdaBoost-estimate, according to the implementing loss function $ (y-g)^2$, $ \log_2(1+ \exp(-2(2y-1)g))$ or $ \ell(y,g) = \exp(-(2y-1)g)$, respectively; see Table 16.1.

The original AdaBoost algorithm for classification is actually a bit different: the base procedure fit is a classifier, and not a real-valued estimator for the conditional probability of $ Y$ given $ X$; and Steps 2 and 3 are also somewhat different. Since AdaBoost's implementing exponential loss function is not well established in statistics, we refer for a detailed discussion to ([34]). From a statistical perspective, the squared error loss and log-likelihood loss functions are most prominent and we describe below the corresponding boosting algorithms in detail.

16.3.2.1 $ L_2$Boost

Boosting using the squared error loss, $ L_2$Boost, has a simple structure: the negative gradient in Step 2 is the classical residual vector and the line search in Step 3 is trivial when using a base procedure which does least squares fitting.

16.3.2.1.1 $ L_2$Boost Algorithm

Step 1 (initialization). As in Step 1 of generic functional gradient descent.


Step 2. Compute residuals $ U_i = Y_i - \hat{F}_m(X_i)\ (i=1,\ldots ,n)$ and fit the real-valued base procedure to the current residuals (typically by (penalized) least squares) as in Step 2 of the generic functional gradient descent; the fit is denoted by $ \hat{g}_{m+1}(\cdot)$.

Update

$\displaystyle \hat{F}_{m+1}(\cdot) = \hat{F}_m(\cdot) + \hat{g}_{m+1}(\cdot)\;.$    

We remark here that, assuming the base procedure does some (potentially penalized) least squares fitting of the residuals, the line search in Step 3 of the generic algorithm becomes trivial with $ \hat{s}_{m+1} = 1$.


Step 3 (iteration). Increase iteration index $ m$ by one and repeat Step 2 until a stopping iteration $ M$ is achieved.


The estimate $ \hat{F}_M(\cdot)$ is an estimator of the regression function $ \mathbb{E}[Y\vert X=\cdot]$. $ L_2$Boosting is nothing other than repeated least squares fitting of residuals (cf. ([33], [19])). With $ m = 2$ (one boosting step), it had already been proposed by Tukey ([51]) under the name ''twicing''. In the non-stochastic context, the $ L_2$Boosting algorithm is known as ''Matching Pursuit'' ([41]), which is popular in signal processing for fitting overcomplete dictionaries.
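
A minimal Python sketch of this residual-fitting view, with sklearn's DecisionTreeRegressor standing in as an arbitrary least squares base procedure:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost(X, Y, M=100, make_base=lambda: DecisionTreeRegressor(max_depth=2)):
    # starting from F_0 = 0, fitting the residuals once reproduces Step 1 of the generic algorithm
    F = np.zeros(len(Y))
    fits = []
    for m in range(M):
        g = make_base().fit(X, Y - F)            # least squares fit to the current residuals
        F += g.predict(X)                        # F_{m+1} = F_m + g_hat_{m+1} (step size 1)
        fits.append(g)
    return lambda Xnew: sum(g.predict(Xnew) for g in fits)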

16.3.2.2 LogitBoost

Boosting using the log-likelihood loss for binary classification (and more generally for multi-class problems) is known as LogitBoost ([34]). LogitBoost uses some Newton-stepping with the Hessian, rather than the line search in Step 3 of the generic boosting algorithm:

16.3.2.2.1 LogitBoost Algorithm

Step 1 (initialization). Start with conditional probability estimates $ \hat{p}_1(X_i) = 1/2\ (i=1,\ldots ,n)$ (for $ \mathbb{P}[Y=1\vert X = X_i]$). Set $ m = 1$.


Step 2. Compute the pseudo-response (negative gradient)

$\displaystyle U_i = \frac{Y_i - \hat{p}_m(X_i)}{\hat{p}_m(X_i)(1-\hat{p}_m(X_i))}\;,$    

and the weights

$\displaystyle w_i = \hat{p}_m(X_i)\left(1-\hat{p}_m(X_i)\right)\;.$    

Fit the real-valued base procedure to the current pseudo-response $ U_i\ (i=1,\ldots ,n)$ by weighted least squares, using the current weights $ w_i\ (i=1,\ldots ,n)$; the fit is denoted by $ \hat{g}_{m+1}(\cdot)$. Update

$\displaystyle \hat{F}_{m+1}(\cdot) = \hat{F}_m(\cdot) + 0.5 \cdot \hat{g}_{m+1}(\cdot)$    

and

$\displaystyle \hat{p}_{m+1}(X_i) = \frac{\exp\left(\hat{F}_{m+1}(X_i)\right)}{\exp\left(\hat{F}_{m+1}(X_i)\right) + \exp\left(-\hat{F}_{m+1}(X_i)\right)}\;.$    


Step 3 (iteration). Increase iteration index $ m$ by one and repeat Step 2 until a stopping iteration $ M$ is achieved.


The estimate $ \hat{F}_M(\cdot)$ is an estimator for half of the log-odds ratio $ 0.5 \cdot$   logit$ (\mathbb{P}[Y=1\vert X=\cdot])$ (see Table 16.1). Thus, a classifier (under equal misclassification loss for the labels $ Y=0$ and $ Y=1$) is

sign$\displaystyle \left(\hat{F}_M(\cdot)\right)\;,$    

and an estimate for the conditional probability $ \mathbb{P}[Y=1\vert X=\cdot]$ is

$\displaystyle \hat{p}_M(\cdot) = \frac{\exp\left(\hat{F}_{M}(\cdot)\right)}{\exp\left(\hat{F}_{M}(\cdot)\right) + \exp\left(-\hat{F}_{M}(\cdot)\right)}\;.$    

A requirement for LogitBoost is that the base procedure has the option to be fitted by weighted least squares.
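
A minimal Python sketch of the LogitBoost steps above, assuming labels coded as $ 0/1$ and using sklearn's DecisionTreeRegressor (which accepts sample_weight, i.e. supports weighted least squares) as a stand-in base procedure; the clipping of the weights is a small numerical safeguard:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, Y, M=100, make_base=lambda: DecisionTreeRegressor(max_depth=1)):
    # Y is assumed to be coded as 0/1
    n = len(Y)
    p = np.full(n, 0.5)                              # Step 1: p_1(X_i) = 1/2
    F = np.zeros(n)
    fits = []
    for m in range(M):                               # Step 2, repeated
        w = p * (1.0 - p)                            # weights w_i
        U = (Y - p) / np.clip(w, 1e-10, None)        # pseudo-response (negative gradient)
        g = make_base().fit(X, U, sample_weight=w)   # weighted least squares base fit
        F += 0.5 * g.predict(X)                      # F_{m+1} = F_m + 0.5 g_hat_{m+1}
        p = np.exp(F) / (np.exp(F) + np.exp(-F))     # p_{m+1}(X_i)
        fits.append(g)
    F_M = lambda Xnew: 0.5 * sum(g.predict(Xnew) for g in fits)
    return F_M    # classify with sign(F_M(x)); p_M(x) = exp(F_M)/(exp(F_M) + exp(-F_M))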

16.3.2.3 Multi-class Problems

The LogitBoost algorithm described above can be modified for multi-class problems where the response variable takes values in a finite set $ \{0,1,\ldots ,J-1\}$ with $ J > 2$ by using the multinomial log-likelihood loss ([34]). But sometimes it can be advantageous to run instead a binary classifier (e.g. with boosting) for many binary problems. The most common approach is to code for $ J$ binary problems where the $ j$th problem assigns the response

$\displaystyle Y^{(j)} = \begin{cases}1\;, & \text{if}\;\; Y=j\;,\\ [1.5mm] 0\;, & \text{if}\;\; Y \neq j\;. \end{cases}$    

i.e. the so-called ''one versus all'' approach. For example, if a single class-label can be distinguished well from all the others, the ''one versus all'' approach seems adequate: empirically, this has been reported for classifying tumor types based on microarray gene expressions when using a LogitBoost algorithm ([25]).

Other codings of a multi-class problem into multiple binary problems are discussed in ([1]).
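
A minimal sketch of the ''one versus all'' coding, where binary_boost stands for any binary boosting procedure returning a real-valued score function (e.g. the LogitBoost sketch above):

import numpy as np

def one_vs_all(X, Y, J, binary_boost):
    scores = []
    for j in range(J):
        Yj = (Y == j).astype(int)            # Y^(j) = 1 if Y = j, 0 otherwise
        scores.append(binary_boost(X, Yj))   # one real-valued discriminant per class
    # classify a new observation by the class with the largest score
    return lambda Xnew: np.argmax(np.column_stack([f(Xnew) for f in scores]), axis=1)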


16.3.3 Small Step Size

It is often better to use small step sizes instead of using the full line search step-length $ \hat{s}_{m+1}$ from Step 3 in the generic boosting algorithm (or $ \hat{s}_{m+1} \equiv 1$ for $ L_2$Boost or $ \hat{s}_{m+1} \equiv 0.5$ for LogitBoost). We advocate here to use the step-size

$\displaystyle \nu \hat{s}_{m+1}\;,\quad 0< \nu \le 1\;,$    

where $ \nu $ is constant during the boosting iterations and small, e.g. $ \nu = 0.1$. The parameter $ \nu $ can be seen as a simple shrinkage parameter: we use the shrunken $ \nu \hat{g}_{m+1}(\cdot)$ instead of the unshrunken $ \hat{g}_{m+1}(\cdot)$. Small step-sizes (or shrinkage) make the boosting algorithm slower and require a larger number $ M$ of iterations. However, the computational slow-down often turns out to be advantageous for better out-of-sample prediction performance (cf. ([33], [19])). There are also some theoretical reasons to use boosting with $ \nu $ (infinitesimally) small ([29]).


16.3.4 The Bias-variance Trade-off for $ L_2$Boost

We discuss here the behavior of boosting in terms of model complexity and estimation error when the number of iterations increases. This is best understood in the framework of squared error loss and $ L_2$Boosting.

We represent the base procedure as an operator

$\displaystyle \mathcal{S}: \mathbb{R}^n \to \mathbb{R}^n\;,\quad (U_1,\ldots ,U_n)^T \mapsto \left(\hat{U}_1,\ldots ,\hat{U}_n\right)^T$    

which maps a (pseudo-)response vector $ (U_1,\ldots ,U_n)^T$ to its fitted values; the predictor variables $ X$ are absorbed here into the operator notation. That is,

$\displaystyle \mathcal{S} (U_1,\ldots ,U_n)^T = \left(\hat{g}(X_1),\ldots ,\hat{g}(X_n)\right)^T\;,$    

where $ \hat{g}(\cdot) = \hat{g}_{X,U}(\cdot)$ is the estimate from the base procedure based on data $ (X_i,U_i),\ i=1,\ldots ,n$. Then, the boosting operator in iteration $ m$ equals

$\displaystyle \mathcal{B}_m = I - (I- \mathcal{S})^m$    

and the fitted values of boosting after $ m$ iterations are

$\displaystyle \mathcal{B}_m \mathbf{Y} = \mathbf{Y} - (I - \mathcal{S})^m \mathbf{Y},\ \mathbf{Y} = \left(Y_1,\ldots ,Y_n\right)^T\;.$    

Heuristically, if the base procedure satisfies $ \Vert I - \mathcal{S}\Vert < 1$ for a suitable norm, i.e. has a ''learning capacity'' such that the residual vector is shorter than the input-response vector, we see that $ \mathcal{B}_m$ converges to the identity $ I$ as $ m \to \infty$, and $ \mathcal{B}_m \mathbf{Y}$ converges to the fully saturated model $ \mathbf{Y}$ as $ m \to \infty$, interpolating the response data exactly. Thus, we have to stop the boosting algorithm at some suitable iteration number $ m=M$, and we see that a bias-variance trade-off is involved when varying the iteration number $ M$.
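
This heuristic can be checked numerically; in the following sketch, a row-normalized Gaussian kernel smoother serves as an assumed linear base procedure $ \mathcal{S}$, and the residual error of $ \mathcal{B}_m \mathbf{Y}$ indeed shrinks as $ m$ grows:

import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.sort(rng.uniform(size=n))
Y = np.sin(2 * np.pi * X) + 0.3 * rng.normal(size=n)

# a simple linear smoother S: row-normalized Gaussian kernel weights
K = np.exp(-((X[:, None] - X[None, :]) / 0.2) ** 2)
S = K / K.sum(axis=1, keepdims=True)

I = np.eye(n)
for m in [1, 10, 100, 1000]:
    Bm = I - np.linalg.matrix_power(I - S, m)       # boosting operator B_m
    print(m, np.mean((Bm @ Y - Y) ** 2))            # residual MSE shrinks as m grows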


16.3.5 $ L_2$Boost with Smoothing Spline Base Procedure for One-dimensional Curve Estimation

The case where the base procedure is a smoothing spline for a one-dimensional predictor $ X \in \mathbb{R}^1$ is instructive, although it is only a toy example within the range of potential applications of boosting algorithms.

In our notation from above, $ \mathcal{S}$ denotes the smoothing spline operator whose fitted values $ \mathcal{S} \mathbf{Y} = \left(\hat{g}(X_1),\ldots ,\hat{g}(X_n)\right)^T$ are given by the solution $ \hat{g}$ of the following optimization problem (cf. ([54]))

$\displaystyle \mathop{\text{arg}}\mathop{\text{min}}\nolimits_g\ n^{-1} \sum_{i=1}^n \left(Y_i - g(X_i)\right)^2 + \lambda \int g''(x)^2 \,\mathrm{d} x\;.$    

The smoothing parameter $ \lambda $ controls the bias-variance trade-off, and tuning the smoothing spline estimator usually boils down to estimating a good value of $ \lambda $. Alternatively, the $ L_2$Boosting approach for curve-estimation with a smoothing spline base procedure is as follows.

16.3.5.1 Choosing the Base Procedure

Within the class of smoothing spline base procedures, we choose a spline by fixing a smoothing parameter $ \lambda $. This should be done such that the base procedure has low variance but potentially high bias: for example, we may choose $ \lambda $ such that the degrees of freedom $ df= $   trace$ (\mathcal{S})$ are low, e.g. $ df = 2.5$. Although the base procedure then typically has high bias, we will reduce the bias by pursuing suitably many boosting iterations. The value of $ df$ is not really a tuning parameter: we only have to make sure that it is small enough, so that the initial estimate (or the first few boosting estimates) is not already overfitting. This is easy to achieve in practice and a theoretical characterization is described in ([19]).

Related aspects of choosing the base procedure are described in Sects. 16.3.6 and 16.3.8. The general ''principle'' is to choose a base procedure which has low variance and whose linear combinations yield a model class rich enough for the application at hand.

16.3.5.2 MSE Trace and Stopping

As the boosting iterations proceed, the bias of the estimator goes down and the variance increases. However, this bias-variance trade-off behaves very differently than when varying the smoothing parameter $ \lambda $ in the classical way. It can be shown that the variance increases with exponentially small increments of the order $ \exp(-Cm),\ C>0$, while the bias decays quickly: the optimal mean squared error for the best boosting iteration $ m$ is (essentially) the same as for the optimally selected tuning parameter $ \lambda $ ([19]), but the mean squared error as a function of the iteration number behaves very differently, see Fig. 16.4. The $ L_2$Boosting method is much less sensitive to overfitting and hence often easier to tune. This insensitivity to overfitting also applies to higher-dimensional problems, implying potential advantages for tuning.

Figure 16.4: Mean squared error $ \mathbb{E} [(g(X) - \hat {g}(X))^2]$ for new predictor $ X$ (solid line) and $ n^{-1} \sum _{i=1}^n \mathbb{E}[(\hat {F}_m(X_i) - g(X_i))^2]$ (dotted line) from 100 simulations of a nonparametric regression model with smooth regression function and Unif. $ [-1/2,1/2]$-distributed design points. Sample size is $ n=100$. Left: $ L_2$Boost with cubic smoothing spline having $ df=3$, as a function of boosting iterations $ m$. Right: Cubic smoothing spline for various degrees of freedom (various amounts of smoothing)

16.3.5.3 Asymptotic Optimality

Such $ L_2$Boosting with smoothing splines achieves the asymptotically optimal minimax MSE rates, and the method can even adapt to higher order smoothness of the true underlying function, without knowledge of the true degree of smoothness ([19]).


16.3.6 $ L_2$Boost for Additive and Interaction Regression Models

In Sect. 16.3.4, we already pointed out that $ L_2$Boosting yields another way of regularization by seeking a compromise between bias and variance. This regularization turns out to be particularly powerful in settings with many predictor variables.

16.3.6.1 Additive Modeling

Consider the component-wise smoothing spline which is defined as a smoothing spline with one selected predictor variable $ X^{(\hat{\iota})}$ ( $ \hat{\iota} \in \{1,\ldots ,d\}$), where

$\displaystyle \hat{\iota} = \mathop{\text{arg}}\mathop{\text{min}}\nolimits_{\iota} \sum_{i=1}^n (Y_i - \hat{g}_{\iota} (X_{i}^{(\iota)}))^2\;,$    

and $ \hat{g}_{\iota}$ is the smoothing spline fitted with the single predictor $ X^{(\iota)}$, where all candidate splines have the same low degrees of freedom $ df$, e.g. $ df = 2.5$.

$ L_2$Boost with component-wise smoothing splines yields an additive model, since in every boosting iteration, a function of one selected predictor variable is linearly added to the current fit and hence, we can always rearrange the summands to represent the boosting estimator as an additive function in the original variables, $ \sum_{j=1}^d \hat{m}_j(x_j),\ x \in \mathbb{R}^d$. The estimated functions $ \hat{m}_j(\cdot)$ are fitted in a stage-wise fashion and they are different from the backfitting estimates in additive models (cf. ([35])). Boosting has much greater flexibility to add complexity in a stage-wise fashion: in particular, boosting does variable selection, since some of the predictors will never be chosen, and it assigns a variable amount of degrees of freedom to the selected components (or function estimates); the degrees of freedom are defined below. An illustration of this interesting way to fit additive regression models with high-dimensional predictors is given in Figs. 16.5 and 16.6 (actually, a penalized version of $ L_2$Boosting, as described below, is shown).
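
A sketch of component-wise $ L_2$Boosting in Python is given below; since a smoothing spline with fixed low $ df$ is not directly available in standard Python libraries, a low-degree polynomial fit serves here as an assumed stand-in for the component-wise smoothing spline, while the algorithmic structure (select the best-fitting component, add its shrunken fit) is as described above:

import numpy as np

def componentwise_l2boost(X, Y, M=200, nu=0.1, degree=3):
    n, d = X.shape
    F = np.zeros(n)
    coefs = []                                        # pairs (selected component, coefficients)
    for m in range(M):
        U = Y - F                                     # current residuals
        # select the component with the smallest residual sum of squares
        best = min(range(d),
                   key=lambda j: np.sum((U - np.polyval(np.polyfit(X[:, j], U, degree),
                                                        X[:, j])) ** 2))
        c = np.polyfit(X[:, best], U, degree)         # refit the selected component
        F += nu * np.polyval(c, X[:, best])           # shrunken update
        coefs.append((best, c))
    def predict(Xnew):                                # additive representation sum_j m_hat_j(x_j)
        return nu * sum(np.polyval(c, Xnew[:, j]) for j, c in coefs)
    return predict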

When using regression stumps (decision trees having two terminal nodes) as the base procedure, we also get an additive model fit (by the same argument as with component-wise smoothing splines). If the additive terms $ m_j(\cdot)$ are smooth functions of the predictor variables, the component-wise smoothing spline is often a better base procedure than stumps ([19]). For the purpose of classification, e.g. with LogitBoost, stumps often seem to do a decent job; also, if the predictor variables are non-continuous, component-wise smoothing splines are often inadequate.

Finally, if the number $ d$ of predictors is ''reasonable'' in relation to the sample size $ n$, boosting techniques are not necessarily better than more classical estimation methods ([19]). It seems that boosting has most potential when the predictor dimension is very high ([19]). Presumably, more classical methods then become very difficult to tune, while boosting produces a path of solutions (one solution for every boosting iteration) whose best member, chosen e.g. via cross-validation, often has very good predictive performance. A reason for the efficiency of this path of boosting solutions is given in Sect. 16.3.9.

16.3.6.2 Degrees of Freedom and $ AIC_c$-stopping Estimates

For component-wise base procedures, which pick one variable or a pair of variables at a time, all the component-wise fitting operators are involved: for simplicity, we focus on additive modeling with component-wise fitting operators $ \mathcal{S}_j,\ j=1,\ldots, d$, e.g. the component-wise smoothing spline.

The boosting operator, when using the step size $ 0 < \nu \le 1$, is then of the form

$\displaystyle \mathcal{B}_m = I - \left(I - \nu \mathcal{S}_{\hat{\iota}_1}\right)\left(I - \nu \mathcal{S}_{\hat{\iota}_2}\right) \cdots \left(I - \nu \mathcal{S}_{\hat{\iota}_m}\right)\;,$    

where $ \hat{\iota}_i \in \{1,\ldots ,d\}$ denotes the component which is picked in the component-wise smoothing spline in the $ i$th boosting iteration.

If the $ \mathcal{S}_j$'s are all linear operators, and ignoring the effect of selecting the components, it is reasonable to define the degrees of freedom of boosting as

$\displaystyle df(\mathcal{B}_m) =$   trace$\displaystyle (\mathcal{B}_m)\;.$    

We can represent

$\displaystyle \mathcal{B}_m = \sum_{j=1}^d M_j\;,$    

where $ M_j = M_{j,m}$ is the linear operator which yields the fitted values for the $ j$th additive term, e.g. $ M_j \mathbf{Y} = (\hat{m}_j(X_1),\ldots ,\hat{m}_j(X_n))^T$. Note that the $ M_j$'s can be easily computed in an iterative way by updating in the $ i$th boosting iteration as follows:

$\displaystyle M_{\hat{\iota}_i, {\text{new}}} \leftarrow M_{\hat{\iota}_i,{\text{old}}} + \nu \mathcal{S}_{\hat{\iota}_i} (I - \mathcal{B}_{i-1})$    

and all other $ M_j,\ j \neq \hat{\iota}_i$ do not change. Thus, we have a decomposition of the total degrees of freedom into the $ d$ additive terms:

$\displaystyle df(\mathcal{B}_m) = \sum_{j=1}^d df_{j,m}\;,$    
$\displaystyle df_{j,m} =$   trace$\displaystyle (M_j)\;.$    

The individual degrees of freedom $ df_{j,m}$ are a useful measure to quantify the complexity of the $ j$th additive function estimate $ \hat{m}_j(\cdot)$ in boosting iteration $ m$. Note that $ df_{j,m}$ will increase very sub-linearly as a function of boosting iterations $ m$, see also Fig. 16.4.

Having some degrees of freedom at hand, we can now use the AIC, or some corrected version thereof, to define a stopping rule of boosting without doing some sort of cross-validation: the corrected AIC statistic ([38]) for boosting in the $ m$th iteration is

$\displaystyle AIC_c = \log\left(\hat{\sigma}^2\right) + \frac{1 + \text{trace}(\mathcal{B}_m)/n}{1 - \left(\text{trace}(\mathcal{B}_m) + 2\right)/n}\;,$ (16.12)
$\displaystyle \hat{\sigma}^2 = n^{-1} \sum_{i=1}^n \left(Y_i - (\mathcal{B}_m \mathbf{Y})_i\right)^2\;.$ (16.13)

Alternatively, we could use generalized cross-validation (cf. ([36])), which involves degrees of freedom. This would exhibit the same computational advantage, as $ AIC_c$, over cross-validation: instead of running boosting multiple times, $ AIC_c$ and generalized cross-validation need only one run of boosting (over a suitable number of iterations).
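
The trace computation and the $ AIC_c$ stopping rule can be sketched as follows, assuming for simplicity that the per-component smoothers $ \mathcal{S}_j$ are hat matrices of univariate linear regressions (any linear smoother matrices could be plugged in); the operator is updated as $ \mathcal{B}_{m+1} = \mathcal{B}_m + \nu \mathcal{S}_{\hat{\iota}_{m+1}}(I - \mathcal{B}_m)$, in line with the update of the $ M_j$ given above:

import numpy as np

def aicc_boost(X, Y, M=200, nu=0.1):
    n, d = X.shape
    # hat matrices S_j for each single-predictor least squares fit (with intercept)
    S = []
    for j in range(d):
        Z = np.column_stack([np.ones(n), X[:, j]])
        S.append(Z @ np.linalg.solve(Z.T @ Z, Z.T))
    B = np.zeros((n, n))                               # boosting operator, B_0 = 0
    aicc_path = []
    for m in range(M):
        R = Y - B @ Y                                  # current residual vector
        j = min(range(d), key=lambda jj: np.sum((R - S[jj] @ R) ** 2))
        B = B + nu * S[j] @ (np.eye(n) - B)            # update B_{m+1}
        df = np.trace(B)                               # df(B_{m+1}); assumes df + 2 < n
        sigma2 = np.mean((Y - B @ Y) ** 2)
        aicc_path.append(np.log(sigma2) + (1 + df / n) / (1 - (df + 2) / n))  # (16.12)
    return int(np.argmin(aicc_path)) + 1, aicc_path    # estimated stopping iteration, AIC_c path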

16.3.6.3 Penalized $ L_2$ Boosting

When viewing the $ AIC_c$ criterion in (16.12) as a reasonable estimate of the true underlying mean squared error (ignoring uninteresting constants), we may attempt to construct a boosting algorithm which in every step reduces the $ AIC_c$ statistic (an estimate of the out-of-sample MSE) most, instead of maximally reducing the in-sample residual sum of squares.

We describe here penalized boosting for additive model fitting using individual smoothing splines:

16.3.6.3.1 Penalized $ L_2$Boost with Additive Smoothing Splines

Step 1 (initialization). As in Step 1 of $ L_2$Boost by fitting a component-wise smoothing spline.


Step 2. Compute residuals $ U_i = Y_i - \hat{F}_m(X_i)\ (i=1,\ldots ,n)$. Choose the individual smoothing spline which reduces $ AIC_c$ most: denote the selected component by $ \hat{\iota}_{m+1}$ and the corresponding fitted function by $ \hat{g}_{m+1}(\cdot)$.

Update

$\displaystyle \hat{F}_{m+1}(\cdot) = \hat{F}_m(\cdot) + \nu \hat{g}_{m+1}(\cdot)\;,$    

for some step size $ 0 < \nu \le 1$.


Step 3 (iteration). Increase iteration index $ m$ by one and repeat Step 2 until the $ AIC_c$ criterion in (16.12) cannot be improved anymore.


This algorithm cannot be written in terms of fitting a base procedure multiple times, since selecting the component $ \hat{\iota}$ in Step 2 depends not only on the residuals $ U_1,\ldots ,U_n$, but also on the degrees of freedom of boosting, i.e. trace$ (\mathcal{B}_{m+1})$; the latter is a complicated, although linear, function of the boosting iterations $ m' \in \{1,2,\ldots, m\}$. Penalized $ L_2$Boost yields sparser solutions than the corresponding $ L_2$Boost (with component-wise smoothing splines as the corresponding base procedure). The reason is that $ df_{j,m}$ increases only little in iteration $ m+1$ if the $ j$th selected predictor variable has already been selected many times in previous iterations; this is directly connected to the slow increase in variance and overfitting as exemplified in Fig. 16.4.
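
A sketch of this penalized variant, reusing the $ AIC_c$ computation from the sketch above and assuming that the individual base procedures are given as linear smoother matrices $ \mathcal{S}_j$:

import numpy as np

def aicc_of(B, Y):
    n = len(Y)
    df = np.trace(B)
    sigma2 = np.mean((Y - B @ Y) ** 2)
    return np.log(sigma2) + (1 + df / n) / (1 - (df + 2) / n)      # criterion (16.12)

def penalized_l2boost(S, Y, nu=0.1, max_iter=500):
    # S: list of n x n linear smoother matrices S_j (e.g. individual smoothing splines)
    n = len(Y)
    B = np.zeros((n, n))
    current = np.inf
    for m in range(max_iter):
        candidates = [B + nu * Sj @ (np.eye(n) - B) for Sj in S]   # possible B_{m+1}
        scores = [aicc_of(Bc, Y) for Bc in candidates]
        j = int(np.argmin(scores))                                 # component reducing AIC_c most
        if scores[j] >= current:                                   # stop when AIC_c cannot improve
            break
        B, current = candidates[j], scores[j]
    return B                                                       # fitted values: B @ Y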

An illustration of penalized $ L_2$Boosting with individual smoothing splines is shown in Figs. 16.5 and 16.6, based on simulated data. The simulation model is

$\displaystyle X_1,\ldots ,X_n\ $   i.i.d.$\displaystyle \ \sim\ $   Unif.$\displaystyle [0,1]^{100}\;,$    
$\displaystyle Y_i = \sum_{j=1}^{10} m_j\left(X^{(j)}\right) + \varepsilon_i\ (i=1,\ldots ,n)\;,$    
$\displaystyle \varepsilon_1,\ldots,\varepsilon_n\ $   i.i.d.$\displaystyle \ \sim\ \mathcal{N}(0,0.5)\;,$ (16.14)

where the $ m_j$'s are smooth curves having varying curve complexities, as illustrated in Fig. 16.6. Sample size is $ n=200$ which is small in comparison to $ d=100$ (but the effective number of predictors is only $ 10$).

Figure 16.5: Degrees of freedom ($ df$) in additive model fitting for all $ 100$ predictor variables (from model (16.14)) during the process of penalized $ L_2$Boosting with individual smoothing splines (having $ df = $   trace$ (\mathcal{S}_j)= 2.5$ for each spline). The first ten predictor variables (separated by the dashed line) are effective. The result is based on one realization from model (16.14) with sample size $ n=200$. The plot on the lower right corresponds to the estimated optimal number of boosting iterations using the $ AIC_c$ criterion in (16.12). Only three non-effective predictors have been selected (and assigned a small amount of $ df$), and one effective predictor has not been selected (but its true underlying function is close to the zero-line, see Fig. 16.6)

Figure 16.6: True underlying additive regression curves (dashed lines) and estimates (solid lines) from penalized $ L_2$Boosting as described in Fig. 16.5 (using 436 iterations, estimated from (16.12)). The last two plots correspond to non-effective predictors (the true functions are the zero-line), where $ L_2$Boosting assigned most $ df$ among non-effective predictors

In terms of prediction performance, penalized $ L_2$Boosting is not always better than $ L_2$Boosting; Fig. 16.7 illustrates an advantage of penalized $ L_2$Boosting. But penalized $ L_2$Boosting is always sparser (or at least not less sparse) than the corresponding $ L_2$Boosting.

Obviously, penalized $ L_2$Boosting can be used for model fitting other than with additive smoothing splines. The modifications are straightforward as long as the individual base procedures are linear operators.

16.3.6.4 Interaction Modeling

$ L_2$Boosting for additive modeling can be easily extended to interaction modeling (having low degree of interaction). The most prominent case is the second-order interaction model $ \sum_{j,k=1}^d \hat{m}_{j,k}(x_j,x_k)$, where $ \hat{m}_{j,k}: \mathbb{R}^2 \to \mathbb{R}$.

Boosting with a pairwise thin plate spline, which selects the best pair of predictor variables yielding lowest residual sum of squares (when having the same degrees of freedom for every thin plate spline), yields a second-order interaction model. We demonstrate in Fig. 16.7 the effectiveness of this procedure in comparison with the second-order MARS fit ([32]). The underlying model is the Friedman #1 model:

$\displaystyle X_1,\ldots ,X_n\ $   i.i.d.$\displaystyle \ \sim\ $   Unif.$\displaystyle \left([0,1]^d\right)\;,\quad d \in \{10,20\}\;,$    
$\displaystyle Y_i = 10 \sin\left(\pi X^{(1)} X^{(2)}\right) + 20\left(X^{(3)} - 0.5\right)^2 + 10 X^{(4)} + 5X^{(5)} + \varepsilon_i$    
$\displaystyle (i=1,\ldots ,n)\;,$    
$\displaystyle \varepsilon_1,\ldots,\varepsilon_n\ $   i.i.d$\displaystyle \ \sim \mathcal{N}(0,1)\;.$ (16.15)

The sample size is chosen as $ n=50$ which is small in comparison to $ d=20$.

In high-dimensional settings, it seems that such interaction $ L_2$Boosting is clearly better than the more classical MARS fit, while both of them share the same superb simplicity of interpretation.

Figure 16.7: Mean squared errors for $ L_2$Boost with pairwise thin-plate splines (of two predictor variables, having $ df= $   trace$ (\mathcal{S}_{j,k}) = 2.5$) (solid lines), its penalized version (dashed lines) and MARS restricted to the (correct) second order interactions (dashed/dotted lines). The point with abscissa $ x=501$ for the boosting methods corresponds to the performance when estimating the number of iterations using (16.12). Based on simulated data from model (16.15) with $ n=50$

16.3.7 Linear Modeling

$ L_2$Boosting also turns out to be very useful for linear models when there are many predictor variables. An attractive base procedure is component-wise linear least squares regression, using the one selected predictor variable which reduces the residual sum of squares most.

This method does variable selection, since some of the predictors will never be picked during the boosting iterations; and it assigns a variable amount of degrees of freedom (or shrinkage), as discussed for additive models above. Recent theory shows that this method is consistent for very high-dimensional problems where the number of predictors $ d = d_n$ is allowed to grow like $ \exp(C n)\ (C>0)$, provided the true underlying regression coefficients are sparse in terms of their $ \ell _1$-norm, i.e. $ \sup_n \Vert\beta\Vert _1 = \sup_n \sum_{j=1}^{d_n} \vert\beta_j\vert < \infty$, where $ \beta$ is the vector of regression coefficients ([16]).
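
A minimal sketch of component-wise linear least squares $ L_2$Boosting (the standardization of the predictors is an assumption made for convenience):

import numpy as np

def componentwise_linear_boost(X, Y, M=500, nu=0.1):
    n, d = X.shape
    Xc = (X - X.mean(0)) / X.std(0)            # standardized predictors
    R = Y - Y.mean()                           # residuals of the centered response
    beta = np.zeros(d)
    ss = (Xc ** 2).sum(0)                      # squared norms of the predictors
    for m in range(M):
        b = Xc.T @ R / ss                      # least squares coefficient per single predictor
        j = int(np.argmax(b ** 2 * ss))        # largest reduction of residual sum of squares
        beta[j] += nu * b[j]                   # shrunken coefficient update
        R = R - nu * b[j] * Xc[:, j]
    return Y.mean(), beta                      # intercept and coefficients (standardized scale)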


16.3.8 Boosting Trees

The most popular base procedures for boosting, at least in the machine learning community, are trees. This may be adequate for classification, but when it comes to regression, or also estimation of conditional probabilities $ \mathbb{P}[Y=1\vert X=x]$ in classification, smoother base procedures often perform better if the underlying regression or probability curve is a smooth function of continuous predictor variables ([19]).

Even when using trees, the question remains about the size of the tree. A guiding principle is as follows: take the smallest trees, i.e. trees with the smallest number $ k$ of terminal nodes, such that the class of linear combinations of $ k$-node trees is sufficiently rich for the phenomenon to be modeled; of course, there is also here a trade-off between sample size and the complexity of the function class.

For example, when taking stumps with $ k = 2$, the set of linear combinations of stumps is dense in (or ''yields'' the) set of additive functions ([14]). In ([34]), this is demonstrated from a more practical point of view. When taking trees with three terminal nodes ($ k=3$), the set of linear combinations of 3-node trees yields all second-order interaction functions. Thus, when aiming for consistent estimation of the full regression (or conditional class-probability) function, we should choose trees with $ k=d+1$ terminal nodes (in practice only if the sample size is ''sufficiently large'' in relation to $ d$), (cf. ([14])).
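
This tree-size principle can be sketched with sklearn trees, where the parameter max_leaf_nodes fixes the number $ k$ of terminal nodes; the shrinkage $ \nu$ and the concrete settings are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2boost_trees(X, Y, k=2, M=300, nu=0.1):
    # k = 2 (stumps) gives an additive fit; k = 3 allows second-order interactions
    F = np.zeros(len(Y))
    trees = []
    for m in range(M):
        t = DecisionTreeRegressor(max_leaf_nodes=k).fit(X, Y - F)   # fit current residuals
        F += nu * t.predict(X)
        trees.append(t)
    return lambda Xnew: sum(nu * t.predict(Xnew) for t in trees)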

Consistency of the AdaBoost algorithm is proved in ([39]), for example when using trees having $ d+1$ terminal nodes. More refined results are given in ([42], [55]) for modified boosting procedures with more general loss functions.

16.3.8.1 Interpretation

The main disadvantage from a statistical perspective is the lack of interpretation when boosting trees. This is in sharp contrast to boosting for linear, additive or interaction modeling. An approach to enhance interpretation is described in ([33]).


16.3.9 Boosting and $ \ell _1$-penalized Methods (Lasso)

Another method which does variable selection and variable amount of shrinkage is basis pursuit ([22]) or Lasso ([50]) which employs an $ \ell _1$-penalty for the coefficients in the log-likelihood.

Interestingly, in case of linear least squares regression with a ''positive cone condition'' on the design matrix, an approximate equivalence of (a version of) $ L_2$Boosting and Lasso has been demonstrated in ([29]). More precisely, the set of boosting solutions, when using an (infinitesimally) small step size (see Sect. 16.3.3), over all the different boosting iterations, equals approximately the set of Lasso solutions when varying the $ \ell _1$-penalty parameter. Moreover, the approximate set of boosting solutions can be computed very efficiently by the so-called least angle regression algorithm ([29]).

It is not clear to what extent this approximate equivalence between boosting and Lasso carries over to more general design matrices in linear regression or to other problems than regression with other loss functions. But the approximate equivalence in the above mentioned special case may help to understand boosting from a different perspective.

In the machine learning community, there has been a substantial focus on consistent estimation in the convex hull of function classes (cf. ([5], [6], [40])). For example, one may want to estimate a regression or probability function which can be written as

$\displaystyle \sum_{k=1}^{\infty} w_k g_k(\cdot)\;,\quad w_k \ge 0\;,\quad \sum_{k=1}^{\infty} w_k = 1\;,$    

where the $ g_k(\cdot)$'s belong to a function class such as stumps or trees with a fixed number of terminal nodes. The quantity above is a convex combination of individual functions, in contrast to boosting which pursues linear combinations of individual functions. By scaling, which is necessary in practice and theory (cf. ([40])), one can actually look at this as a linear combination of functions whose coefficients satisfy $ \sum_k w_k = \lambda$. This then represents an $ \ell _1$-constraint as in the Lasso, a relation which we have already outlined above.

16.3.10 Other References

Boosting, or functional gradient descent, has also been proposed for other settings than regression or classification, including survival analysis ([8]), ordinal response problems ([52]) and high-multivariate financial time series ([4], [3]).

Support vector machines (cf. ([53], [36], [49])) have become very popular in classification due to their good performance in a variety of data sets, similarly to boosting methods for classification. A connection between boosting and support vector machines has been made in ([46]), suggesting also a modification of support vector machines towards more sparse solutions ([56]).

Acknowledgements. I would like to thank Marcel Dettling for some constructive comments.

