10.3 Selecting the Final Tree


cross = cartcv(x, y, type, opt, wv)
performs cross-validation: it calculates the sequence of $ \alpha$ values, the numbers of leaves of the corresponding pruned subtrees, and estimates of the expected values of the mean squared residuals together with their standard errors
ln = leafnum(cs, node)
gives the number of leaves of a given tree
res = ssr(cs, node)
calculates the sum of squared residuals
enn = pred(tr, x, type)
calculates the prediction of the regression tree at a given point $ x$
mssr = prederr(tr, x, y, type)
calculates the sum of prediction errors for a given tree over a set of $ x$ and $ y$ values
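To make the roles of these quantlets concrete, here is a minimal Python sketch of analogous helpers; the dict-based tree representation and all names are assumptions made for illustration, not the XploRe implementation.

import numpy as np

# Assumed illustrative representation: a leaf is {"mean": value}; an
# internal node is {"var": j, "threshold": t, "left": ..., "right": ...}.

def leafnum(node):
    # number of leaves of a given tree (cf. leafnum)
    if "mean" in node:
        return 1
    return leafnum(node["left"]) + leafnum(node["right"])

def pred(node, x):
    # prediction of the regression tree at the point x (cf. pred)
    if "mean" in node:
        return node["mean"]
    child = "left" if x[node["var"]] <= node["threshold"] else "right"
    return pred(node[child], x)

def prederr(node, X, y):
    # sum of squared prediction errors over the points (X, y) (cf. ssr, prederr)
    yhat = np.array([pred(node, xi) for xi in X])
    return float(np.sum((np.asarray(y) - yhat) ** 2))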
Now we have to choose the best tree from the sequence $ \hat{f}_{\alpha_1},\ldots ,\hat{f}_{\alpha_k}$. In other words, we have to choose the smoothing parameter $ \alpha$. We will try to estimate the expectation of the mean of squared residuals $ R( \hat{f}_{\alpha_i} )$, and then choose the regression estimate for which this estimate is minimal. This can be done by way of cross-validation.
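Here and below, $ R(\cdot)$ denotes the mean of squared residuals on the data in question, and $ R_{\beta}(\cdot)$ the cost-complexity criterion from the pruning step: up to the normalization used there, for a tree $ T$,

$$ R_{\beta}(T) = R(T) + \beta \cdot \#\{\mbox{leaves of } T\}, $$

so a larger penalty $ \beta$ favours smaller subtrees.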

In tenfold cross-validation, for example, we take $ 90\%$ of the sample, grow the tree on this part, prune to obtain a sequence of subtrees, and calculate the mean of squared residuals for every subtree in the sequence using the remaining $ 10\%$ of the sample as a test set. This is repeated $ 10$ times, each time using a different part of the sample as the estimation set and as the test set.
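A minimal sketch of this scheme, using scikit-learn's minimal cost-complexity pruning (the ccp_alpha parameter of DecisionTreeRegressor) as a stand-in for the quantlets above; the function cv_pruning_errors and the choice of library are assumptions for illustration.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def cv_pruning_errors(X, y, betas, n_folds=10, seed=0):
    # errors[v, i] = mean squared residual on fold v's test 10%, for the
    # tree grown on the remaining 90% and pruned with penalty betas[i]
    errors = np.zeros((n_folds, len(betas)))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for v, (train, test) in enumerate(folds.split(X)):
        for i, beta in enumerate(betas):
            tree = DecisionTreeRegressor(ccp_alpha=beta).fit(X[train], y[train])
            errors[v, i] = np.mean((y[test] - tree.predict(X[test])) ** 2)
    # mean over the cross-validation rounds and its standard error
    # (cf. the output of cartcv)
    return errors.mean(axis=0), errors.std(axis=0) / np.sqrt(n_folds)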

A problem arises: since each cross-validation round grows and prunes on different data, each round produces a different $ \alpha$-sequence. The approach proposed by Breiman, Friedman, Olshen, and Stone (1984, Section 8.5.2, page 234) is to first grow and prune using all of the data, which gives the sequence $ \{\alpha_i\}$, and then to form a new sequence $ \beta_i= \sqrt{\alpha_i \alpha_{i+1}}$, where $ \beta_i$ is the geometric mean of $ \alpha_i$ and $ \alpha_{i+1}$. When pruning the trees grown on $ 90\%$ of the sample, we choose the subtrees that minimize $ R_{\beta_i}(\cdot)$.
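Continuing the sketch, the $ \beta$-sequence is one line of code; here scikit-learn's cost_complexity_pruning_path, which computes the $ \alpha$-sequence of the tree grown on the full sample, is again an assumed stand-in for the quantlet interface.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# X, y: the full sample (assumed given as numpy arrays)
path = DecisionTreeRegressor().cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas                    # increasing alpha-sequence
betas = np.sqrt(alphas[:-1] * alphas[1:])   # geometric means beta_i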

Finally, the estimate for the expectation of $ R( \hat{f}_{\alpha_i} )$ is the mean of $ R( \hat{f}_{\beta_i}^v )$ over the $ 10$ cross-validation estimates $ \hat{f}_{\beta_i}^v$, $ v=1,\ldots ,10$. In practice, the estimates for the expectation of $ R(\cdot)$ rarely have a clear minimum, and it is reasonable to choose the smallest tree whose estimated expected $ R(\cdot)$ is reasonably close to the minimum.
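One common way to make "reasonably close" precise is the one-standard-error rule of Breiman et al.: choose the smallest tree whose estimated error is within one standard error of the minimum. A sketch, continuing with the assumed helpers cv_pruning_errors and betas from above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

cv_mean, cv_se = cv_pruning_errors(X, y, betas)
i_min = int(np.argmin(cv_mean))
# largest penalty (hence smallest tree) within one standard error
# of the minimal cross-validation error
cutoff = cv_mean[i_min] + cv_se[i_min]
i_best = max(i for i in range(len(betas)) if cv_mean[i] <= cutoff)
final_tree = DecisionTreeRegressor(ccp_alpha=betas[i_best]).fit(X, y)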