In Sects. 14.2.1 and 14.2.2, we have explained the basic steps and concepts for tree construction. For most users of decision trees, the implementation aspect does not really affect the application. For methodological and software developments, however, it is imperative to understand the computational issues. The most critical issue is to find the optimal split efficiently for any given node. The overall strategy is to identify the optimal split from each of the predictors and then choose the overall best one. Choosing the overall best one is straightforward, but identifying the optimal split from a predictor takes some efforts. The algorithm must take into account the nature of the predictor. Although we will use a dichotomous response to explain the ideas, the algorithm is also applicable for the other types of responses.
Let us first consider a predictor with an ordinal scale such as gene expression in Fig. 14.1 or the ratio of cash flow to total debt in Fig. 14.2. Under the tree framework, as long as a predictor is ordinal, we will soon see that it does not matter whether the predictor is on a continuous or discrete scale.
Table 14.1 displays the expression levels of gene IL-8 in normal and colon cancer tissues. Our objective for the time being is to split these tissues into two subsamples according to whether the expression level of gene IL-8 is greater than a given threshold. In theory, this threshold can be anything, but practically, there is only a finite number of them that make a difference. In other words, it takes a finite number of steps to find an optimal threshold, although the solution is not unique.
The first step in finding an optimal threshold is to sort all expression levels, say, in an ascending order as displayed in Table 14.2. If the threshold is below the minimum (11.37625) or above the maximum (2076.9025), it produces an empty subsample. Thus, the threshold should be between 11.37625 and 2076.9025. If we take a look at the two lowest levels, 11.37625 and 23.74, it is clear that any threshold between these two levels produces the same two subsamples (or daughter nodes). In this example, there are distinct levels of expression. Thus, we have distinct ways to split the samples into two daughter nodes. It is noteworthy that, unlike this example, the number of unique levels of a predictor is usually lower than the number of samples.
The second step in finding an optimal threshold is to move along the
intervals defined by two adjacent, distinct levels of the sorted
predictor values. In Table 14.2, we move along as
We have determined the pool of the potential thresholds, which is sometimes referred to as the allowable splits. Obviously, we can examine each threshold one at a time and assess its quality according to (14.1).
|Left node||Right node||Split|
|No. of||No. of||Node||No. of||No. of||Node||Quality|
For a large data set, this means a lot of wasted computing time. To reduce the computation to a minimal level, let us take a careful look as to what happens when we move the threshold from one interval to the next. In Table 14.3, as the threshold is moved up to the next interval, the samples that were already assigned to the left daughter stay on the left side because their expression levels are still below the new threshold. Most of the samples that were assigned to the right daughter stay on the right side, except those samples whose expression levels are equal to the lower limit of the new interval. In this particular case, there is only one sample that we need to move from the right side to the left every time we move the threshold by one interval. This observation implies that the node impurities and the split quality can be computed by updating the information slightly for the small set of the samples that are affected. Every of such a small set of samples is affected only once in the entire search of the predictor. In summary, after the values of a predictor are sorted, we can find an optimal threshold to split a node in the number of steps proportional to the number of distinct values of the predictor, which is at most the number of samples in the node. For the present example, any value in is an optimal split. Intuitively from Table 14.3, we push the threshold as high as possible to maintain the perfect purity of the left daughter node. In the meantime, if we look bottom-up from the table, we also push the threshold as low as possible to maximize the purity of the right daughter node. The interval offers the best balance. In Fig. 14.1, the split is chosen at , although any number in this interval is a legitimate choice.
Overall, if we have samples in a node and predictors, excluding the sorting time, the final threshold for the node can be identified in at most steps.
For a nominal variable, we cannot sort the values of the variable as we did in Table 14.2. For a predictor of levels, there are a total of ways to split a node. To explain the algorithm, let us use an artificial example as summarized in Table 14.4.
In Table 14.4, the predictor has levels, giving rise to possible ways to split a node. A naive way is to assess every allowable split on an individual basis. This could be an extensive computation when the number of levels is or higher. Thus, it is important to find a way to compute the quality of all splits in a gradual manner as in Sect. 14.3.1. If we focus on the levels of the predictor for the left daughter node, we can travel all possible splits as follows: , , , , , , and . The key is that every move requires either the deletion or addition of a single level, which keeps the computation at the minimal level. Such a path of traveling through all splits can be defined for any .
There is actually a simple and quick solution for a dichotomous response. As shown in Table 14.4, we can compute the cancer rate for every level of the nominal predictor. During the splitting, the rates can substitute for the corresponding nominal levels. Because the rates are ordinal, the method described in Sect. 14.3.1 can be applied. After the optimal split is determined, we can map the rate back to the original nominal level. For example, for the data in Table 14.4, the optimal threshold based on the rate is in the interval which means that the left daughter node contains samples with levels B and D, and the right daughter node with levels A and C. For a multiclass response, there is no apparent way to form an ordinal surrogate for a nominal predictor.
An important feature of decision trees is their ability to deal with missing predictor values. There are several solutions. Although there have been limited attempts () to compare some of them, the performance of the various solutions is largely unexplored. The choice mostly depends on the objective of the study.
The easiest approach is to treat the missing attribute as a distinct value and to assign all samples with missing values to the same node (). This approach is not only simple, but also provides clear paths as to where the samples with missing attributes end up in the tree structure.
 introduced and advocated surrogate splits to deal with missing attributes. The idea is very intuitive. For example, in Table 14.2, we considered using expression levels from gene IL-8 to split the samples. What happens if the expression level from one of the samples, say, the first one, was not recorded? This happens in microarray experiments. Because IL-8 level is missing for the first sample, we cannot determine whether the level is below or above and hence cannot decide whether the first sample should be assigned to the left or right daughter node. To resolve this ambiguity,  proposed to seek help from other genes that act ''similarly'' to IL-8. Since there are many other genes, we can use the one that is most similar to IL-8, which leads to a surrogate for IL-8.
What we need to clarify is the meaning of similarity. To illustrate this concept, let us consider gene CANX. Using the method described in Sect. 14.3.1, we can find an optimal split from gene CANX. The similarity between CANX and IL-8 is the probability that the optimal splits from these two genes assign a sample with complete information in these two genes into the same node. This strategy is similar to replacing a missing value in one variable in linear regression by regressing on the non-missing value most highly correlated with it. Then, why can't we use the same strategy as in the linear regression? According to , their strategy is more robust. The main reason is that their strategy is more specific to the particular sample with missing attributes, and does not result in a potential catastrophic impact for other samples with missing attributes.
The surrogate splits have some advantages over the simpler approach as described earlier. It makes use of other potentially useful information.  also proposed to rank the importance of variables through surrogate splits. The surrogate splits also have some limitations. First, it is uncommon, if at all, that surrogate splits are provided in published applications. Thus, it is unrealistic to know what the surrogate splits are and how we assign a sample with a missing attribute. Second, there is no guarantee in a data set that we can find a satisfactory surrogate split. Lastly, while it is a sensible idea to rank the variable importance based on surrogate splits, there is no assurance that a predictor ranked relatively high is necessarily predictive of the outcome, which can create a dilemma for interpretation. More recently, the importance of a variable tends to be evaluated on the basis of its performance in forests ([7,80]) rather than on a single tree.
In the construction of random forests, Breiman proposed another way of replacing missing values through an iterative process. A similar idea can be applied for tree construction. To initialize the process, we can fill in the missing values by the median of an ordered variable or by the category of a nominal variable with the highest frequency. An initial tree can be constructed once all missing data are imputed. In the next step, suppose again that in Table 14.2, the expression of gene IL-8 is missing for the first sample. The unobserved level is estimated by a weighted average over the samples with observed expressions for gene IL-8. Here, the weight is the so-called proximity, which is a similarity measure between a pair of samples. Intuitively, if the second sample is more similar to the first sample than to the third one, we give more weight to the second sample than to the third one if the first sample is not observed. How is the proximity defined for a pair of samples? We can set it to zero before the initial tree is grown. Then, whenever a tree is grown, if two samples end up in the same terminal nodes, its promixity is increased by one unit. After the missing data are updated, a new tree is grown. Breiman recommends to continue this process at most five times in the random forest construction. For tree construction, it may take longer for the process to ''converge'', especially when the number of predictors is large. Nonetheless, it may still be worthwhile to repeat a few iterations. In addition to this convergence issue, it is also difficult to track where the samples with missing values are assigned as with the use of surrogate splits.