Bagging ([9]), a sobriquet for bootstrap aggregating, is an ensemble method for improving unstable estimation or classification schemes. Breiman ([9]) motivated bagging as a variance reduction technique for a given base procedure, such as decision trees or methods that do variable selection and fitting in a linear model. It has attracted much attention, probably due to its implementational simplicity and the popularity of the bootstrap methodology. At the time of its invention, only heuristic arguments were presented for why bagging would work. Later, it was shown in ([18]) that bagging is a smoothing operation which turns out to be advantageous when aiming to improve the predictive performance of regression or classification trees. In the case of decision trees, the theory in ([18]) confirms Breiman's intuition that bagging is a variance reduction technique, which also reduces the mean squared error (MSE). The same holds for subagging (subsample aggregating), defined in Sect. 16.2.3, which is a computationally cheaper version of bagging. However, for other (even ''complex'') base procedures, the variance and MSE reduction effect of bagging does not necessarily hold; this has been shown in ([20]) for the simple case where the estimator is a U-statistic.
Consider the regression or classification setting. The data is given as in Sect. 16.1: we have pairs $(X_i, Y_i)$ ($i = 1, \ldots, n$), where $X_i \in \mathbb{R}^d$ denotes the $d$-dimensional predictor variable and $Y_i$ the response, with $Y_i \in \mathbb{R}$ (regression) or $Y_i \in \{0, 1, \ldots, J-1\}$ (classification with $J$ classes). The target function of interest is usually $E[Y \mid X = x]$ for regression or the multivariate function $P[Y = j \mid X = x]$ ($j = 0, \ldots, J-1$) for classification. The function estimator, which is the result from a given base procedure, is

$\hat{g}(\cdot) = h_n\bigl((X_1, Y_1), \ldots, (X_n, Y_n)\bigr)(\cdot),$

where the function $h_n(\cdot)$ defines the estimator as a function of the data.
Bagging is defined as follows.

Step 1. Construct a bootstrap sample $(X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)$ by randomly drawing $n$ times with replacement from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.

Step 2. Compute the bootstrapped estimator $\hat{g}^*(\cdot)$ by the plug-in principle:

$\hat{g}^*(\cdot) = h_n\bigl((X_1^*, Y_1^*), \ldots, (X_n^*, Y_n^*)\bigr)(\cdot).$

Step 3. Repeat steps 1 and 2 $M$ times, where $M$ is often chosen as 50 or 100, yielding $\hat{g}^{*k}(\cdot)$ ($k = 1, \ldots, M$). The bagged estimator is

$\hat{g}_{Bag}(\cdot) = M^{-1} \sum_{k=1}^{M} \hat{g}^{*k}(\cdot).$

In theory, the bagged estimator is

$\hat{g}_{Bag}(\cdot) = E^*[\hat{g}^*(\cdot)],$

of which the finite-$M$ average in Step 3 is a Monte Carlo approximation. This is exactly Breiman's ([9]) definition for bagging regression estimators. For classification, we propose to average the bootstrapped probabilities, yielding an estimator for $P[Y = j \mid X = x]$, whereas Breiman ([9]) proposed to vote among classifiers for constructing the bagged classifier.
The empirical fact that bagging improves the predictive performance of regression and classification trees is nowadays widely documented ([9], [10], [18], [26], [20], [37]). To give an idea about the gain in performance, we cite some of the results of Breiman's pioneering paper ([9]): for classification problems, bagging a classification tree improved over a single classification tree, in terms of cross-validated misclassification error, by a substantial margin on most of the datasets considered.
A trivial equality indicates the somewhat unusual approach of using the bootstrap methodology:

$\hat{g}_{Bag}(x) = \hat{g}(x) + \bigl(E^*[\hat{g}^*(x)] - \hat{g}(x)\bigr) = \hat{g}(x) + \widehat{\mathrm{bias}}^*(x),$

where $\widehat{\mathrm{bias}}^*(x)$ is the bootstrap bias estimate of $\hat{g}(x)$. Instead of the usual bias correction with a minus sign, bagging adds the bootstrap bias estimate.
Instability often occurs when hard decisions with indicator functions are involved, as in regression or classification trees. One of the main underlying ideas of why bagging works can be demonstrated by a simple example.
Consider the estimator

$\hat{\theta}_n(x) = \mathbf{1}_{[\bar{Y}_n \le x]}, \quad x \in \mathbb{R}, \quad (16.3)$

where $\bar{Y}_n = n^{-1} \sum_{i=1}^{n} Y_i$ with $Y_1, \ldots, Y_n$ i.i.d. having mean $\mu$ and variance $\sigma^2$.
We can compute the first two asymptotic moments in the unstable region with $x = x_n(c) = \mu + c \sigma n^{-1/2}$. Numerical evaluations of these first two moments and the mean squared error (MSE) are given in Fig. 16.2. We see that in the approximate range where $|c| \le 2.3$, bagging improves the asymptotic MSE. The biggest gain, by a factor of 3, is at the most unstable point $x = \mu$, corresponding to $c = 0$. The squared bias with bagging has only a negligible effect on the MSE (note the different scales in Fig. 16.2). Note that we always give an a-priori advantage to the original estimator, which is asymptotically unbiased for the target as defined.
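The variance reduction at the most unstable point can be checked by simulation. The following sketch (our illustration, not from the chapter) repeatedly generates data with $\mu = 0$, evaluates the plain indicator estimator and a Monte Carlo version of its bagged counterpart at $x = \mu$, and compares their variances; the plain estimator behaves like a fair coin flip there, with variance about $1/4$.

```python
import numpy as np

def toy_mse(n=50, n_sim=2000, M=100, x=0.0, seed=0):
    """Compare theta_hat(x) = 1[Ybar_n <= x] with its bagged version at x = mu = 0."""
    rng = np.random.default_rng(seed)
    plain, bagged = [], []
    for _ in range(n_sim):
        y = rng.normal(0.0, 1.0, size=n)                # i.i.d. data with mu = 0
        plain.append(float(y.mean() <= x))              # hard indicator: coin flip at x = mu
        boot_means = rng.choice(y, size=(M, n)).mean(axis=1)  # bootstrap means
        bagged.append(float((boot_means <= x).mean()))  # bootstrap-smoothed indicator
    return np.var(plain), np.var(bagged)

v_plain, v_bag = toy_mse()
```

In this simulation the bagged indicator comes out markedly less variable than the plain one, in line with the asymptotic gain at $c = 0$ discussed above.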
In ([18]), this kind of analysis has been given for more general estimators than in (16.3), and also for estimation in linear models after testing. Hard decision indicator functions are involved there as well, and bagging reduces variance due to its smoothing effect. The key to deriving this property is always the fact that the bootstrap is asymptotically consistent, as in (16.8).
We address here the effect of bagging in the case of decision trees which are most often used in practice in conjunction with bagging. Decision trees consist of piecewise constant fitted functions whose supports (for the piecewise constants) are given by indicator functions similar to (16.3). Hence we expect bagging to bring a significant variance reduction as in the toy example above.
For simplicity of exposition, we consider first a one-dimensional predictor space and a so-called regression stump, which is a regression tree with one split and two terminal nodes. The stump estimator (or algorithm) is then defined as the decision tree

$\hat{g}(x) = \hat{\beta}_l \mathbf{1}_{[x < \hat{d}]} + \hat{\beta}_u \mathbf{1}_{[x \ge \hat{d}]}, \quad (16.9)$

where the estimates $\hat{\beta}_l$, $\hat{\beta}_u$ and the split point $\hat{d}$ are obtained by least squares.
The main mathematical difference of the stump in (16.9) to the toy estimator in (16.3) is the behavior of the estimated split point $\hat{d}$ in comparison to the behavior of $\bar{Y}_n$ (and not the constants $\hat{\beta}_l$ and $\hat{\beta}_u$ involved in the stump). It is shown in ([18]) that $\hat{d}$ has convergence rate $n^{-1/3}$ (in case of a smooth regression function) and a limiting distribution which is non-Gaussian. This also explains why the bootstrap is not consistent, whereas consistency as in (16.8) turned out to be crucial in our analysis above. Bagging still does some kind of smoothing, but it is not known how this behaves quantitatively. However, a computationally attractive version of bagging, which has often been found to perform as well as bagging, turns out to be more tractable from a theoretical point of view.
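For concreteness, the stump in (16.9) can be fitted by an exhaustive least-squares search over candidate split points. This is a hypothetical sketch with illustrative names, not code from the chapter:

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree (stump) by least squares."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(xs)):              # candidate split between xs[i-1] and xs[i]
        d = 0.5 * (xs[i - 1] + xs[i])
        bl, bu = ys[:i].mean(), ys[i:].mean()  # level constants beta_l, beta_u
        rss = ((ys[:i] - bl) ** 2).sum() + ((ys[i:] - bu) ** 2).sum()
        if rss < best[0]:
            best = (rss, (d, bl, bu))
    d, bl, bu = best[1]
    # the fitted stump: beta_l on {x < d_hat}, beta_u on {x >= d_hat}
    return lambda x_new: np.where(x_new < d, bl, bu)
```

The instability resides entirely in the data-driven split point $\hat{d}$, which jumps between candidate splits under small perturbations of the data.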
Subagging is a sobriquet for subsample aggregating, where subsampling is used instead of the bootstrap for the aggregation. An estimator $\hat{g}(\cdot) = h_n\bigl((X_1, Y_1), \ldots, (X_n, Y_n)\bigr)(\cdot)$ is aggregated as follows:

Step 1. For $k = 1, \ldots, M$ (e.g. $M = 50$ or $100$) do:
(i) Generate a random subsample $(X_1^{*k}, Y_1^{*k}), \ldots, (X_m^{*k}, Y_m^{*k})$ by randomly drawing $m$ times without replacement from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$.
(ii) Compute the subsampled estimator

$\hat{g}_{(m)}^{*k}(\cdot) = h_m\bigl((X_1^{*k}, Y_1^{*k}), \ldots, (X_m^{*k}, Y_m^{*k})\bigr)(\cdot).$

Step 2. Average the subsampled estimators to approximate

$\hat{g}_{SB(m)}(\cdot) \approx M^{-1} \sum_{k=1}^{M} \hat{g}_{(m)}^{*k}(\cdot).$

As indicated in the notation, subagging depends on the subsample size $m$, which is a tuning parameter (in contrast to $M$).
An interesting case is half subagging with $m = \lfloor n/2 \rfloor$. More generally, we could also use $m = \lfloor a n \rfloor$ with $0 < a < 1$ (i.e. a fraction of $n$), and we will argue why the usual choice $m = o(n)$ in subsampling for distribution estimation ([44]) is a bad choice here. Half subagging has also been studied in ([20]): in the case where $\hat{g}$ is a U-statistic, half subagging is exactly equivalent to bagging, and subagging yields very similar empirical results to bagging when the estimator is a decision tree. Thus, if we don't want to optimize over the tuning parameter $m$, a good choice in practice is very often $m = \lfloor n/2 \rfloor$. Consequently, half subagging typically saves more than half of the computing time, because the computational order of an estimator is usually at least linear in $n$.
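Subagging differs from bagging only in the resampling step: draw $m$ indices without replacement rather than $n$ with replacement. A hypothetical sketch (the names `subag` and `base_fit` are illustrative), defaulting to half subagging:

```python
import numpy as np

def subag(X, y, base_fit, m=None, M=50, seed=None):
    """Aggregate base estimators fitted on subsamples of size m (no replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = n // 2 if m is None else m                  # default: half subagging
    fits = []
    for _ in range(M):
        idx = rng.choice(n, size=m, replace=False)  # subsample WITHOUT replacement
        fits.append(base_fit(X[idx], y[idx]))       # estimator on m observations
    # Step 2: average the M subsampled estimators
    return lambda X_new: np.mean([g(X_new) for g in fits], axis=0)
```

Since each base fit sees only $m = \lfloor n/2 \rfloor$ observations, and subsampling itself is cheaper than bootstrap resampling, the overall cost is typically less than half that of bagging with the same $M$.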
We describe here in a non-technical way the main mathematical result from ([18]) about subagging regression trees.
The underlying assumptions for some mathematical theory are as follows. The data generating regression model is

$Y_i = g(X_i) + \varepsilon_i \quad (i = 1, \ldots, n),$

where the errors $\varepsilon_i$ are i.i.d. with $E[\varepsilon_i] = 0$, independent of the $X_i$'s.
It is then shown in ([18]) that for $m = \lfloor a n \rfloor$ with $0 < a < 1$, subagging reduces the asymptotic MSE of the regression tree in the unstable region around the split points. For subagging with small order $m = o(n)$, such a result is no longer true: the reason is that small order subagging will then be dominated by a large bias (while the variance reduction is even better than for fraction subagging with $m = \lfloor a n \rfloor$).
As for the toy example in Sect. 16.2.2, subagging smoothes the hard decisions in a regression tree, resulting in reduced variance and MSE.
As discussed in Sects. 16.2.2 and 16.2.3, (su-)bagging smoothes out indicator functions which are inherent in some base procedures such as decision trees. For base procedures which are ''smoother'', e.g. which do not involve hard decision indicators, the smoothing effect of bagging is expected to cause only small effects.
For example, in ([20]) it is proved that the effect of bagging on the MSE appears only in the second order term if the base procedure is a U-statistic. Similarly, citing ([23]): ''when bagging is applied to relatively conventional statistical problems, it cannot reliably be expected to improve performance''. On the other hand, we routinely use nowadays ''non-conventional'' methods: a simple example is variable selection and fitting in a linear model, where bagging has been demonstrated to improve predictive performance ([9]).
In ([26]), the performance of bagging has been studied for MARS, projection pursuit regression and regression tree base procedures: the largest improvements from bagging are reported for decision trees. In ([18]), it is shown that bagging the basis functions in MARS essentially doesn't change the asymptotic MSE. In ([15]) it is empirically demonstrated in greater detail that for finite samples, bagging is far less effective for MARS, and sometimes very destructive, than for decision trees.
(Su-)bagging may also have a positive effect due to averaging over different selected predictor variables; this is an additional effect besides smoothing out indicator functions. In the case of MARS, we could also envision that such an averaging over different selected predictor variables would have a positive effect: in the empirical analysis in ([15]), this has been found to be true only when using a robust version of aggregation, see below.
Bragging stands for bootstrap robust aggregating ([15]): it uses the sample median over the bootstrap estimates $\hat{g}^{*k}(\cdot)$ ($k = 1, \ldots, M$), instead of the sample mean in Step 3 of the bagging algorithm.
While bragging regression trees was often found to improve slightly less than bagging, bragging MARS seems better than the original MARS and much better than bagging MARS.
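In code, bragging only changes the aggregation step of the bagging sketch: a pointwise median replaces the mean. A hypothetical illustration (the name `brag` is ours):

```python
import numpy as np

def brag(X, y, base_fit, M=50, seed=None):
    """Bagging with a pointwise sample median over the bootstrap estimators."""
    rng = np.random.default_rng(seed)
    n = len(y)
    fits = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)        # bootstrap resampling, as in bagging
        fits.append(base_fit(X[idx], y[idx]))
    # robust aggregation: median instead of mean over the M bootstrap estimators
    return lambda X_new: np.median([g(X_new) for g in fits], axis=0)
```

The median is insensitive to the occasional wildly misfitted bootstrap estimator, which is the motivation for using it with unstable procedures such as MARS.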
Bagging ''automatically'' yields an estimate of the out-of-sample error, sometimes referred to as the generalization error. Consider a loss $\rho(Y, \hat{g}(X))$, measuring the discrepancy between an estimated function $\hat{g}$, evaluated at $X$, and the corresponding response $Y$, e.g. $\rho(Y, \hat{g}(X)) = |Y - \hat{g}(X)|^2$. The generalization error is then

$\mathrm{err} = E\bigl[\rho(Y_{new}, \hat{g}(X_{new}))\bigr],$

where $(X_{new}, Y_{new})$ is a new observation, independent of the data used for constructing $\hat{g}$.
In a bootstrap sample (in the bagging procedure), roughly $e^{-1} \approx 37\%$ of the original observations are left out: they are called ''out-of-bag'' observations ([10]). Denote by $\mathcal{I}_k \subseteq \{1, \ldots, n\}$ the set of original sample indices which were resampled in the $k$th bootstrap sample; note that the out-of-bag sample observations (in the $k$th bootstrap resampling stage) are then given by $\{(X_i, Y_i);\ i \notin \mathcal{I}_k\}$, which can be used as test sets. The out-of-bag error estimate of bagging is then defined as

$\widehat{\mathrm{err}}_{OOB} = n^{-1} \sum_{i=1}^{n} N_i^{-1} \sum_{k:\, i \notin \mathcal{I}_k} \rho\bigl(Y_i, \hat{g}^{*k}(X_i)\bigr), \quad N_i = \sum_{k=1}^{M} \mathbf{1}_{[i \notin \mathcal{I}_k]},$

i.e., every observation is evaluated only by the bootstrap estimators for which it was out-of-bag, and the resulting losses are averaged.
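The out-of-bag computation can be sketched as follows (a hypothetical illustration with names of our choosing; `base_fit` returns a prediction function and `loss` is the discrepancy measure $\rho$):

```python
import numpy as np

def oob_error(X, y, base_fit, loss, M=50, seed=None):
    """Out-of-bag error estimate: each observation i is predicted only by
    bootstrap estimators whose resample did not contain i."""
    rng = np.random.default_rng(seed)
    n = len(y)
    losses = [[] for _ in range(n)]
    for _ in range(M):
        idx = rng.integers(0, n, size=n)     # bootstrap sample indices I_k
        g = base_fit(X[idx], y[idx])
        in_bag = np.zeros(n, dtype=bool)
        in_bag[idx] = True
        for i in np.flatnonzero(~in_bag):    # roughly 37% of the observations
            losses[i].append(loss(y[i], g(X[i:i + 1])[0]))
    # per-observation OOB average, then overall average
    return float(np.mean([np.mean(l) for l in losses if l]))
```

No separate test set or cross-validation loop is needed: the resampling already done for bagging supplies the held-out predictions for free.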
The main disadvantage of bagging, and other ensemble algorithms, is the lack of interpretability. A linear combination of decision trees is much harder to interpret than a single tree. Likewise, bagging a variable-selection and fitting algorithm for linear models (e.g. selecting the variables using the AIC criterion within the least-squares estimation framework) gives few clues as to which of the predictor variables are actually important.
One way out of this lack of interpretability is sometimes given within the framework of bagging. In ([28]), the bootstrap has been used to judge the importance of automatically selected variables by looking at relative appearance-frequencies in the bootstrap runs. The bagging estimator is then the average of the fitted bootstrap functions, while the appearance-frequencies of selected variables or interactions may serve for interpretation.
Bagging may also be useful as a ''module'' in other algorithms: BagBoosting ([17]) is a boosting algorithm (see Sect. 16.3) with a bagged base-procedure, often a bagged regression tree. The theory about bagging supports the finding that BagBoosting using bagged regression trees, which have smaller asymptotic MSEs than trees, is often better than boosting with regression trees. This is empirically demonstrated for a problem about tumor classification using microarray gene expression predictors ([24]).
Bundling classifiers ([37]), which is a more complicated aggregation algorithm but related to bagging, seems to perform better than bagging for classification. In ([45]), bagging is used in conjunction with boosting (namely for stopping boosting iterations) for density estimation. In ([27]), bagging is used in the unsupervised context of cluster analysis, reporting improvements when using bagged clusters instead of original cluster-outputs.