14.4 Interpretation

Interpretation of results from trees is usually straightforward. In Fig. 14.1, we identified genes IL-8, CANX, and RAB3B whose expression levels are highly predictive of colon cancer. However, this does not necessarily mean that these genes cause colon cancer. Such a conclusion requires a thorough search of the literature and further experiments. For example, after reviewing the literature, [81] found evidence that associates IL-8 with the stage of colon cancer ([30]), the migration of human colonic epithelial cell lines ([70]), and metastasis of bladder cancer ([42]). In addition, the expression of the molecular chaperone CANX was found to be down-regulated in HT-29 human colon adenocarcinoma cells ([72]) and to be involved in apoptosis in human prostate epithelial tumor cells ([56]). Lastly, RAB3B is a member of the RAS oncogene family. Thus, these existing studies provide independent support that the three genes identified in Fig. 14.1 may be in the pathways of colon cancer. If this hypothesis could be confirmed by further experiments, Fig. 14.1 would have another important implication. Pathologically speaking, the colon cancer samples are indistinguishable. Figure 14.1 indicates, however, that those samples are not homogeneous in terms of gene expression levels. If confirmed, such a finding could be useful in cancer diagnosis and treatment.

As we stated earlier, there are numerous applications of decision trees in biomedical research, including the example above. To get a glimpse of how diverse the applications of decision trees are, let us review two different examples.

Example 3 Frydman and colleagues introduced recursive partitioning for financial classification ([32]). They considered a financial dataset of bankrupt () industrial companies that failed during 1971-81, and non-bankrupt () manufacturing and retailing companies randomly selected from the COMPUSTAT universe. Each company forms an observational unit, or a so-called sample. Twenty financial variables with prior evidence of predicting business failure are considered. They include the ratio of cash to total assets, the ratio of cash to total sales, the ratio of cash flow to total debt, the ratio of current assets to current liabilities, the ratio of current assets to total assets, the ratio of current assets to total sales, the ratio of earnings before interest and taxes to total assets, interest coverage, the ratio of market value of equity to total capitalization, the ratio of net income to total assets, the ratio of quick assets to current liabilities, the ratio of quick assets to total assets, the ratio of quick assets to total sales, the ratio of retained earnings to total assets, the ratio of total debt to total assets, the ratio of total sales to total assets, and the ratio of working capital to total sales.

**Figure 14.2:** Classification tree for bankruptcy. B1, B2, and B3 are three groups of relatively high risk of bankruptcy, and NB1 and NB2 are two groups of likely non-bankrupt companies. Inside the terminal nodes (*boxes*) are the numbers of bankrupt and non-bankrupt companies. See [32] for more details.

We can see from Fig. 14.2 that the risk of bankruptcy is relatively high if the ratio of cash flow to total debt is below 0.1309, unless both the ratio of retained earnings to total assets and the ratio of cash to total sales are above certain levels, i.e., 0.1453 and 0.025, respectively. Even if the ratio of cash flow to total debt is above 0.1309, there can be elevated risk of bankruptcy if the ratio of total debt to total assets is high (above 0.6975). A tree diagram as in Fig. 14.2 offers a very clear and simple assessment of the financial state of a company.
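The verbal reading of Fig. 14.2 can be restated as a short decision rule. The following is a minimal sketch based only on the cutoffs quoted in the paragraph above, not the authors' code. The function name, argument names, and the two return labels are hypothetical, and the handling of values exactly equal to a cutoff is an assumption, since the text only says "below" and "above".

```python
def bankruptcy_risk(cash_flow_to_total_debt,
                    retained_earnings_to_total_assets,
                    cash_to_total_sales,
                    total_debt_to_total_assets):
    """Return 'high risk' or 'lower risk' from the splits described for Fig. 14.2.

    Hypothetical helper: ties at a cutoff are sent to the 'not below' /
    'not above' side, which is an assumption and not taken from [32].
    """
    if cash_flow_to_total_debt < 0.1309:
        # Low cash flow relative to debt: high risk unless BOTH retained
        # earnings and cash holdings exceed their cutoffs.
        if (retained_earnings_to_total_assets > 0.1453
                and cash_to_total_sales > 0.025):
            return "lower risk"
        return "high risk"
    # Cash flow is not below the cutoff: risk is elevated only when the
    # company is highly leveraged.
    if total_debt_to_total_assets > 0.6975:
        return "high risk"
    return "lower risk"
```

Writing the tree this way makes the point of the example concrete: each company reaches a terminal node after at most three comparisons, each on a single financial ratio.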

Example 4 (continued) We indicated earlier what the predictors and response are for Example 2. Let us revisit this example. Unlike the other examples that we have introduced so far, this example uses a continuous response, the compound potency. Because of this difference, the resulting tree is called a regression tree. To utilize the information from the 3-dimensional structures of compounds, [15] used atom pair descriptors that are composed of the atom types of the two component atoms and the "binned" Euclidean distance between these two atoms. The width of each distance bin was chosen as $1.0\,\text{\AA}$. To define predictors $\boldsymbol{x}$ from the atom pair descriptors, the authors grouped the component atoms into types, including negative charge center (e.g., sulfinic group), positive charge center (e.g., the nitrogen in primary, secondary, and tertiary amines), hydrogen bond acceptor (e.g., oxygen with at least one available lone pair electron), triple bond center, aromatic ring center, and H-bond donor hydrogen.
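To make the construction of atom pair descriptors concrete, here is a minimal sketch of how such descriptors could be counted for one compound. It is not the procedure of [15]. The function name, the input layout, the type labels, and the coordinates are hypothetical. The binning convention (bin index equals the distance divided by the bin width, rounded down) is an assumption; the text only specifies the bin width.

```python
import math
from itertools import combinations

BIN_WIDTH = 1.0  # width of each distance bin in angstroms, as stated in the text


def atom_pair_descriptors(atoms, bin_width=BIN_WIDTH):
    """Count atom pair descriptors for one compound.

    atoms: list of (atom_type, (x, y, z)) tuples, coordinates in angstroms.
    Returns a dict mapping (type_a, type_b, bin_index) to a count.
    """
    counts = {}
    for (type_a, pos_a), (type_b, pos_b) in combinations(atoms, 2):
        distance = math.dist(pos_a, pos_b)      # Euclidean distance
        bin_index = int(distance // bin_width)  # assumed floor binning
        # Sort the two types so that (A, B) and (B, A) share one key.
        first, second = sorted((type_a, type_b))
        key = (first, second, bin_index)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Made-up coordinates for illustration only.
compound = [
    ("aromatic_ring_center", (0.0, 0.0, 0.0)),
    ("triple_bond_center", (4.2, 0.0, 0.0)),
    ("positive_charge_center", (0.0, 1.5, 0.0)),
]
descriptors = atom_pair_descriptors(compound)
```

Each key of the resulting dictionary can serve as one component of the predictor vector $\boldsymbol{x}$, for instance as an indicator or a count of how often that descriptor occurs in the compound.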

**Figure 14.3:** Regression tree for predicting potencies of compounds. Inside each node are the number of compounds (*middle*) and the average potency of all compounds within the node (*bottom*). Underneath each node is the selected atom pair descriptor. Above the arm is the distance interval for the selected atom pair descriptor that assigns the compounds to the right daughter node. See [15] for more details.

Figure 14.3 presents part of the regression tree constructed by [15]. We trimmed the left-hand side to fit into the space here; however, we can get the idea from the right-hand side of the tree. Generally speaking, a node of size or such as nodes and is too small to be reliable. Since we do not have the data to re-grow the tree, let us pretend that the node sizes are adequate, and concentrate on the interpretation instead. Since the main objective of Chen et al. appears to be to identify active nodes (i.e., those with high potencies), a small, inactive node is not of great concern.
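The two checks discussed above, whether a node is large enough to be trusted and whether it is active, can be sketched as a small summary routine. This is only an illustration under stated assumptions: the function name, the `min_size` and `active_cutoff` parameters, and the toy potencies are all hypothetical and are not taken from [15]. The only fact used from the text is that a regression tree node is summarized by its size and its average response.

```python
from statistics import mean


def summarize_nodes(potencies_by_node, min_size, active_cutoff):
    """Summarize terminal nodes of a regression tree.

    potencies_by_node: dict mapping a node label to the list of potencies
    of the compounds that fall into that node.
    min_size, active_cutoff: hypothetical, user-chosen thresholds.
    """
    summary = {}
    for node_id, potencies in potencies_by_node.items():
        size = len(potencies)
        average = mean(potencies)  # the node's predicted potency
        summary[node_id] = {
            "size": size,
            "mean_potency": average,
            "too_small": size < min_size,       # reliability concern
            "active": average > active_cutoff,  # high average potency
        }
    return summary


# Made-up numbers for illustration only.
toy_nodes = {"A": [1.0, 2.0, 3.0], "B": [9.0, 11.0]}
toy_summary = summarize_nodes(toy_nodes, min_size=3, active_cutoff=5.0)
```

A node flagged as both active and too small is the case that deserves scrutiny, whereas a small inactive node matters little when the goal is to find active compounds.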

First, there is one highly active node (node with potency greater than ) in Fig. 14.3. There are also two highly active nodes on the left-hand side, which are not shown in Fig. 14.3. Supported by the literature, [15] postulated that there might be different mechanisms of action because the active nodes contain compounds of very different characteristics. This is similar to the hypothesis suggested by Fig. 14.1 that the colon cancer tissues might be biologically heterogeneous. Chen et al. concluded further that their tree demonstrates the ability to detect multiple mechanisms of action coexisting in a large three-dimensional chemical data set. In addition, the selected atom pair descriptors reveal interesting features of the monoamine oxidase (MAO) inhibitors. For instance, the "aromatic ring center-triple bond center" pair in the first split is the structural characteristic of pargyline, a well-known MAO inhibitor.

We can see from these examples that tree-based methods tend to yield integrated, intuitive results whose pieces are consistent with prior findings. Not only can we use trees for prediction, but we may also use them to identify potentially important mechanisms or pathways for further investigation.