10.1 Growing the Tree


cs = cartsplit (x, y, type{, opt})
grows the tree
opt = cartsplitopt (s1{, s2, s3,$ \dots$})
sets the parameters for growing the tree
Growing the tree proceeds sequentially. In the first step, the regression estimator is a single constant over the whole sample space, namely the mean value of the response variable. Thus, when the observed values of the response variable are $ Y_1, \ldots, Y_n$, the regression estimator is given by

$\displaystyle \hat{f}(x) = \left( \frac{1}{n} \sum_{i=1}^n Y_i \right) I_R(x)
$

where $ R$ is the sample space and $ I_R$ is the indicator function of $ R$. We assume that the sample space $ R$, that is, the space of the values of the regression variables, is a rectangle.

In the second step, the sample space is divided into two parts. Some regression variable $ X_j$ is chosen. If $ X_j$ is a continuous random variable, then some real number $ a$ is chosen, and we define

$\displaystyle R_1 = \{ x \in R: x_j \leq a \}, \, \, \,
R_2 = \{ x \in R: x_j > a \} .
$

If $ X_j$ is a categorical random variable with values $ A_1,\ldots ,A_q$, then some nonempty proper subset $ I \subset \{ A_1,\ldots ,A_q \}$ is chosen, and we define

$\displaystyle R_1 = \{ x \in R: x_j \in I \}, \, \, \,
R_2 = \{ x \in R: x_j \in \{ A_1,\ldots ,A_q \} \backslash I \} .
$
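The candidate subsets $ I$ for a categorical variable can be enumerated explicitly. The following Python sketch is an illustration only, not the method used by cartsplit; the function name `categorical_splits` is hypothetical. It lists each partition once by always keeping the first category on the left side, since swapping $ R_1$ and $ R_2$ describes the same division of the sample space. For $ q$ categories this gives $ 2^{q-1} - 1$ candidate splits.

```python
from itertools import combinations


def categorical_splits(categories):
    """List every split of the distinct categories into two nonempty groups.

    Hypothetical helper for illustration. The first category (in sorted
    order) is always placed on the left, so mirror images are not repeated.
    """
    cats = sorted(set(categories))
    first, rest = cats[0], cats[1:]
    splits = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(cats) - left
            if right:  # skip the trivial split with an empty right side
                splits.append((left, right))
    return splits


splits_3 = categorical_splits(["a", "b", "c"])
splits_4 = categorical_splits(["a", "b", "c", "d"])
```

Because the number of candidates grows exponentially in $ q$, exhaustive enumeration of this kind is practical only for variables with few categories.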

The regression estimator in the second step is

$\displaystyle \hat{f}(x) = \left( \frac{1}{\vert I_1\vert} \sum_{I_1} Y_i \right) I_{R_1}(x)
+ \left( \frac{1}{\vert I_2\vert} \sum_{I_2} Y_i \right) I_{R_2}(x)
$

where $ I_k = \{ i: X_i \in R_k \}$ is the set of indices of the observations falling in $ R_k$, and $ \vert I_k\vert$ is the number of elements in $ I_k$, for $ k = 1, 2$.
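As a small illustration of the second-step estimator, the Python sketch below computes the two group means for a split of one continuous variable at a given threshold. The function names and the toy data are assumptions made for this example and are not part of the software described here.

```python
def fit_two_piece(xs, ys, a):
    """Return the means of the responses with x <= a and with x > a."""
    left = [y for x, y in zip(xs, ys) if x <= a]
    right = [y for x, y in zip(xs, ys) if x > a]
    if not left or not right:
        raise ValueError("the split leaves one side without observations")
    return sum(left) / len(left), sum(right) / len(right)


def predict_two_piece(x, a, mean_left, mean_right):
    """Evaluate the piecewise constant estimator at a point x."""
    return mean_left if x <= a else mean_right


# Toy data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 10.0, 12.0]
mean_left, mean_right = fit_two_piece(xs, ys, a=2.0)
```

With these toy data the estimator takes the value 2.0 on $ R_1$ and 11.0 on $ R_2$.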

The split of $ R$ into $ R_1$ and $ R_2$, that is, the variable $ X_j$ together with the number $ a$ or the subset $ I$, is chosen in such a way that the sum of squared residuals of the estimator $ \hat{f}$ is minimized. The sum of squared residuals is defined as

$\displaystyle \sum_{i=1}^n \left( Y_i - \hat{f}(X_i) \right)^2 .
$
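The search for the minimizing split can be sketched for a single continuous variable as follows. This is a minimal Python illustration under stated assumptions, not the cartsplit implementation: candidate thresholds are taken as midpoints between consecutive distinct observed values (any value between two neighbours yields the same partition), and the names `ssr` and `best_split` are hypothetical.

```python
def ssr(values):
    """Sum of squared deviations of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_ssr) minimizing the sum of squared residuals.

    Returns None when all x values are equal, so that no split exists.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # equal x values cannot be separated by a threshold
        a = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        total = ssr(left) + ssr(right)
        if best is None or total < best[1]:
            best = (a, total)
    return best


# Toy data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 10.0, 12.0]
best = best_split(xs, ys)
```

For these toy data the chosen threshold is 2.5, which separates the two low responses from the two high ones and leaves a sum of squared residuals of 4.0. With several regression variables, the same search would be repeated for each variable and the overall minimum taken.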

Next, $ R_1$ and $ R_2$ are each split separately in the same way. Splitting is continued in this way until the number of observations in every rectangle is small or the sum of squared residuals is small. The rectangle $ R$ corresponds to the root node of a binary tree, the rectangle $ R_1$ to its left child node, and the rectangle $ R_2$ to its right child node. The end result is a binary tree.
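The recursive procedure can be sketched in Python as follows. This is an illustration under simplifying assumptions and not the cartsplit implementation: only continuous regression variables are handled, thresholds are midpoints between consecutive distinct values, and the stopping parameters `min_obs` and `min_ssr` are hypothetical names standing in for the notion of "small" used above.

```python
def ssr(ys):
    """Sum of squared deviations of the responses from their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def grow(rows, ys, min_obs=2, min_ssr=1e-12):
    """Grow a regression tree as nested dicts.

    rows: list of tuples of continuous regressors; ys: list of responses.
    A node becomes a leaf when it holds at most min_obs observations or
    its sum of squared residuals is at most min_ssr (assumed rules).
    """
    node = {"mean": sum(ys) / len(ys), "n": len(ys)}
    if len(ys) <= min_obs or ssr(ys) <= min_ssr:
        return node
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            a = (lo + hi) / 2
            left = [i for i, row in enumerate(rows) if row[j] <= a]
            right = [i for i, row in enumerate(rows) if row[j] > a]
            total = ssr([ys[i] for i in left]) + ssr([ys[i] for i in right])
            if best is None or total < best[0]:
                best = (total, j, a, left, right)
    if best is None:
        return node  # every regressor is constant here, nothing to split
    _, j, a, left, right = best
    node["var"] = j
    node["threshold"] = a
    node["left"] = grow([rows[i] for i in left], [ys[i] for i in left],
                        min_obs, min_ssr)
    node["right"] = grow([rows[i] for i in right], [ys[i] for i in right],
                         min_obs, min_ssr)
    return node


def predict(node, row):
    """Follow the splits down to a leaf and return its mean."""
    while "var" in node:
        if row[node["var"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["mean"]


# Toy data (made up for illustration): one regressor, four observations.
rows = [(1.0,), (2.0,), (3.0,), (4.0,)]
ys = [1.0, 3.0, 10.0, 12.0]
tree = grow(rows, ys, min_obs=2)
```

Here the root is split once at the threshold 2.5, and both child nodes become leaves because each holds only two observations.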