

14.1 Introduction

Tree-based methods have become one of the most flexible, intuitive, and powerful data analytic tools for exploring complex data structures. The applications of these methods are far-reaching. They include financial firms (credit cards: [1,32]; investments: [8,60]), manufacturing and marketing companies ([48]), and pharmaceutical companies.

The best documented and arguably most popular uses of tree-based methods are in biomedical research, for which classification is a central issue. For example, a clinician or health scientist may be very interested in the following question ([34,35,77]): Is this patient with chest pain suffering a heart attack, or does he simply have a strained muscle? To answer this question, information on the patient must be collected, and a good diagnostic test utilizing that information must be in place. Tree-based methods provide one solution for constructing such a diagnostic test.

Classification problems also frequently arise in engineering research. [4] introduced a tree-based language model for natural-language speech recognition. [24] used the idea of decision trees to detect proper nouns in document images. [33] used a related idea to form an active testing model for tracking roads in satellite images. In addition, decision trees have been used in scientific and social studies, including astronomy ([59]), chemistry ([15]), and politics (http://www.dtreg.com/housevotes.htm). We will revisit some of these applications later in detail.

Most commercial applications of tree-based methods have not been well documented in peer-reviewed publications. In 1999 the author helped CLARITAS, a marketing company, apply a tree-based method as described in Sect. 14.6 ([74]) for market segmentation analysis. Tree-based methods have also been used frequently in the drug development process. The author has personally provided consultations to Aventis, Inc. for drug approvals.

The purpose of this article is to provide an overview of the construction of decision trees and, particularly, of the recursive partitioning technique, which is the thrust of this methodology. In their early applications, tree-based methods were developed primarily to facilitate the automation of classification as an expert system ([6,31,71]), although [55] were motivated by the need to analyze survey data to identify interactions, particularly in the presence of non-numerical predictors. More recently, classification trees have been used not only for automated disease diagnosis, but also for selecting important variables that are associated with a disease or any response of interest ([75,76,79,80,81]).

There are different approaches to classification. First, it can be done intuitively. For example, a physician or a group of physicians may use their experience in caring for patients with chest pain to form a subjective opinion or an empirical decision as to whether a new patient with chest pain is likely to suffer a heart attack and, consequently, decide what treatment is most appropriate. Second, methods have been developed in both the statistical and machine learning literature, such as Fisher's linear discriminant analysis ([28]) and the support vector machine ([20]). These methods have a parametric flavor in the sense that the classification rule has an explicit form with only a few parameters to be determined from a given sample, usually referred to as the learning sample.

Classification trees belong to a third type of method, for which we allow a very general structure, e.g., the binary tree displayed in Fig. 14.1, but the number of "parameters" must also be determined from the data, and this number varies. For this reason, classification trees are regarded as nonparametric methods. They are adaptive to the data and flexible, although the large number of quantities (or parameters) to be estimated from the data makes the classification rule more vulnerable to noise in the data.

Figure 14.1: Classification Tree for Colon Cancer Diagnosis Based on Gene Expression Data. Inside each node are the numbers of tumor (C) and normal (N) tissues. See [81] for more details

To be more precise about the statistical problem, let us define the data structure and introduce some notation. Suppose that we have observed $ p$ covariates, denoted by a $ p$-vector $ \boldsymbol {x}$, and a response $ y$ for $ n$ individuals. For the $ i$th individual, the measurements are

$\boldsymbol{x}_i=(x_{i1},\ldots ,x_{ip})^{\prime} \quad\text{and}\quad y_i\,,\ i=1,\ldots ,n\;.$

The objective is to model the conditional probability distribution $P(y \vert \boldsymbol{x})$ or a functional of this conditional distribution.
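To make this setup concrete, here is a minimal, self-contained sketch, not taken from this chapter, of the elementary step behind recursive partitioning: scoring every candidate binary split of a single continuous covariate against a dichotomous response. The Gini impurity is used here as the impurity measure, and the function names and toy data are invented for illustration.

```python
# Hypothetical illustration: score candidate splits "x <= c" of one
# continuous covariate x against a binary response y, using the Gini
# impurity as the node-impurity measure (one of several in common use).

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best split x <= c."""
    n = len(y)
    best_c, best_score = None, float("inf")
    for c in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        if not left or not right:          # skip degenerate splits
            continue
        # impurity of the two daughter nodes, weighted by node size
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_c, best_score = c, score
    return best_c, best_score

# Toy data in the (x_i, y_i) notation above: y_i = 1 whenever x_i > 3,
# so the search should recover the threshold c = 3 with zero impurity.
print(best_split([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1]))  # (3, 0.0)
```

A full tree grower would apply this search to every covariate, pick the best split overall, and then recurse on the two daughter nodes; the basic method, together with stopping and pruning rules, is taken up in the sections that follow.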

To appreciate how these variables are characterized in real applications, let us examine some published examples.

Example 1   [48] described a probability-driven, customer-oriented decision support system for the marketing decisions of the Franklin Mint, a leading Philadelphia-based worldwide direct-response marketer of quality collectibles and luxury home decor products. The purpose of the system is to target the "right" audience for each promotion from a very large marketing database, based on the customers' attributes and characteristics. In this case, the customers' attributes and characteristics constitute the $ \boldsymbol {x}$ variables, and whether a targeted client is desirable forms the basis for the response $ y$.

Example 2   To screen large chemical databases in corporate collections and chemical libraries, [15] used recursive partitioning to develop three-dimensional pharmacophores that can guide database screening, chemical library design, and lead optimization. Their idea was to encode the three-dimensional features of chemical compounds into bit strings; these features serve as the $ \boldsymbol {x}$ variables and are selected in relation to the biological activities (i.e., $ y$) of the compounds. Here, each compound contributes one observation. Using this idea, the authors successfully retrieved three-dimensional structure-activity relationships from a large heterogeneous dataset of $ 1644$ monoamine oxidase inhibitors. We will revisit this example in detail in Sect. 14.4.

As in any multivariate regression model, and as the above examples show, the covariates or predictors in $ \boldsymbol {x}$ may be categorical (nominal or ordinal) or continuous. For example, ethnicity is usually treated as categorical and age as continuous. Some of the covariates may have missing values, and we will discuss how missing values are handled in the tree framework. In a nutshell, unlike what is usually done in simple linear regression, observations with missing information are not omitted from classification trees.
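Because candidate splits differ by covariate type, it may help to see them enumerated. The sketch below is purely illustrative and its helper names are invented: a continuous covariate yields threshold questions, while a nominal covariate yields subset-membership questions, counted up to complementation since "$x \in S$" and "$x \notin S$" define the same partition.

```python
# Hypothetical helpers enumerating the binary split candidates a tree
# considers for the two covariate types discussed above.
from itertools import combinations

def continuous_splits(values):
    """Candidate questions "x <= c": midpoints of adjacent observed values."""
    v = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(v, v[1:])]

def nominal_splits(levels):
    """Candidate questions "x in S" for a nominal covariate:
    nonempty proper subsets of the levels, up to complementation."""
    levels = sorted(levels)
    out = []
    for k in range(1, len(levels)):
        for s in combinations(levels, k):
            comp = tuple(l for l in levels if l not in s)
            if s < comp:   # keep one of each {S, complement} pair
                out.append(set(s))
    return out

# An age-like continuous covariate vs. a three-level nominal covariate:
print(continuous_splits([20, 35, 50]))   # [27.5, 42.5]
print(nominal_splits(["A", "B", "C"]))   # three distinct partitions
```

A nominal covariate with $k$ levels admits $2^{k-1}-1$ distinct binary partitions, which is why nominal covariates with many levels make the split search computationally expensive.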

Not only can we have mixed types of predictors, but the response variable can also be discrete (binary or multiclass), continuous, or censored. The characteristics of the response $ y$ determine the method for estimating $P(y \vert \boldsymbol{x})$. We will review a variety of tree-based methods that are adaptable to the distribution of $ y$. In Sect. 14.2, we will introduce the basic idea of classification trees using a dichotomous response. Section 14.2 is followed by an in-depth discussion of computational challenges and implementations in Sect. 14.3 and by examples in Sect. 14.4 illustrating how to interpret results from tree-based analyses. One of the most popular uses of tree-based methods is in the analysis of censored data, in which $ y$ is the time to an event and is subject to censoring. As described in Sect. 14.5, such trees are referred to as survival trees ([3,12,13,40,73]). In Sect. 14.6, we will present an extension of the tree methodology to the classification of a response consisting of multiple components, such as an array of respiratory symptoms ([74]). Finally, we will conclude in Sect. 14.7 with some remarks on relatively recent developments such as forests and Bayesian trees. To illustrate the methods and their applications, some examples will be presented along with the methods.

