Tree-based methods have become one of the most flexible, intuitive, and powerful data-analytic tools for exploring complex data structures. The applications of these methods are far reaching. They include financial firms (credit cards: [1,32]; investments: [60,8]), manufacturing and marketing companies ([48]), and pharmaceutical companies.
The best documented, and arguably most popular, uses of tree-based methods are in biomedical research, for which classification is a central issue. For example, a clinician or health scientist may be very interested in the following question ([34,35,77]): Is this patient with chest pain suffering a heart attack, or does he simply have a strained muscle? To answer this question, information on this patient must be collected, and a good diagnostic test utilizing such information must be in place. Tree-based methods provide one solution for constructing the diagnostic test.
Classification problems also arise frequently in engineering research. [4] introduced a tree-based language model for natural language speech recognition. [24] used the idea of decision trees to detect proper nouns in document images. [33] used a related idea to form an active testing model for tracking roads in satellite images. In addition, decision trees have been used in scientific and social studies including astronomy ([59]), chemistry ([15]), and politics (http://www.dtreg.com/housevotes.htm). We will revisit some of these applications later in detail.
Most commercial applications of tree-based methods have not been well documented through peer-reviewed publications. In 1999 the author helped CLARITAS, a marketing company, apply a tree-based method described in Sect. 14.6 ([74]) to market segmentation analysis. Tree-based methods have also been used frequently in the drug development process. The author has personally provided consultations to Aventis, Inc. for drug approvals.
The purpose of this article is to provide an overview of the construction of decision trees and, in particular, the recursive partitioning technique, which is the thrust of this methodology. In their early applications, tree-based methods were developed primarily to facilitate the automation of classification as an expert system ([6,31,71]), although [55] were motivated by the need to analyze survey data to identify interactions, particularly in the presence of non-numerical predictors. More recently, classification trees have been used not only for automated disease diagnosis, but also for selecting important variables that are associated with a disease or any response of interest ([75,76,79,80,81]).
There are different approaches to classification. First, it can be done intuitively. For example, a physician or a group of physicians may use their experience in caring for patients with chest pain to form a subjective opinion, or an empirical decision, as to whether a new patient with chest pain is likely to suffer a heart attack, and consequently decide what treatment is most appropriate. Second, methods have been developed in both the statistical and machine learning literature, such as Fisher's linear discriminant analysis ([28]) and the support vector machine ([20]). These methods have a parametric flavor in the sense that the classification rule has an explicit form with only a few parameters to be determined from a given sample, usually referred to as the learning sample.
Classification trees belong to a third type of method, for which we allow a very general structure, e.g., the binary tree displayed in Fig. 14.1, but the number of "parameters" also needs to be determined from the data, and this number varies. For this reason, classification trees are regarded as nonparametric methods. They are adaptive to the data and are flexible, although the large number of quantities (or parameters) to be estimated from the data makes the classification rule more vulnerable to noise in the data.
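To make this adaptive, nonparametric character concrete, the following sketch (not from this chapter; it assumes the scikit-learn library and a synthetic two-class dataset) fits a small binary classification tree whose number of splits is chosen from the data, with only a cap on its depth specified in advance:

```python
# A minimal sketch of fitting a binary classification tree,
# assuming scikit-learn and synthetic data; the dataset and
# parameter choices here are illustrative, not the chapter's.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for, e.g., chest-pain patients:
# 4 covariates, binary outcome (heart attack vs. not).
X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=3, n_redundant=1,
                           random_state=0)

# The tree's "parameters" (its splits) are estimated from the
# data; max_depth merely caps the complexity of the structure.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))       # the fitted binary tree
print(tree.tree_.node_count)   # size is data-dependent
```

Note how the fitted structure, unlike a parametric rule with a fixed number of coefficients, grows or shrinks with the data, which is exactly what makes it flexible yet sensitive to noise.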
[Fig. 14.1. A binary classification tree]
To be more precise about the statistical problem, let us define the data structure and introduce some notation. Suppose that we have observed covariates, denoted by a $p$-vector $\mathbf{x}$, and a response $y$ for $n$ individuals. For the $i$th individual, the measurements are

$$(y_i, \mathbf{x}_i) = (y_i, x_{i1}, \ldots, x_{ip}), \quad i = 1, \ldots, n.$$
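In code, this data structure is simply an $n \times p$ covariate matrix paired with a length-$n$ response vector. A minimal sketch (assuming NumPy, with made-up values for illustration):

```python
import numpy as np

# n = 4 individuals, p = 3 covariates (e.g., age, sex, blood
# pressure); the numbers here are hypothetical.
X = np.array([[63, 1, 145],
              [41, 0, 120],
              [57, 1, 160],
              [35, 0, 118]], dtype=float)
y = np.array([1, 0, 1, 0])   # binary response, e.g., heart attack or not

n, p = X.shape               # n = 4, p = 3
x_i, y_i = X[0], y[0]        # measurements for the 1st individual
```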
To appreciate how these variables are characterized in real applications, let us examine some of the published applications.
Like any multivariate regression model, and as we can see from the above examples, the covariates or predictors in $\mathbf{x}$ may contain variables that can be categorical (nominal or ordinal) or continuous. For example, ethnicity is usually treated as categorical data and age as continuous. Some of the covariates may have missing values, and we will discuss how missing values are handled in the tree framework. In a nutshell, unlike what is usually done in a simple linear regression, observations with missing information are not omitted from classification trees.
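As a toy illustration of this point (a hypothetical sketch, not the chapter's exact algorithm; classical CART uses the more refined device of surrogate splits), one simple way a split can retain observations with a missing covariate is to send them down the branch taken by the majority of complete cases:

```python
import numpy as np

# Toy sketch: route observations with a missing covariate (NaN)
# to whichever side the majority of complete cases went, instead
# of dropping them. A crude stand-in for surrogate splits.
def apply_split(x, threshold):
    left = np.zeros(len(x), dtype=bool)
    known = ~np.isnan(x)
    left[known] = x[known] <= threshold    # split the complete cases
    majority_left = left[known].mean() >= 0.5
    left[~known] = majority_left           # missing values follow the majority
    return left

x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
mask = apply_split(x, threshold=2.5)
# every observation, including the two with NaN, gets a branch
```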
Not only can we have mixed types of predictors, but the response
variable can also be discrete (binary or multiclass), continuous, and
sometimes censored. The characteristics of the response, $y$,
determine the method for estimating the relationship between $y$ and
$\mathbf{x}$. We will review a variety of tree-based methods that are
adaptable to the distribution of $y$. In Sect. 14.2, we will introduce the
basic idea of classification trees using a dichotomous
response. Section 14.2 is followed by some in-depth
discussion of computational challenges and implementations in
Sect. 14.3 and by examples in Sect. 14.4 to
illustrate how we can interpret results from tree-based analyses. One
of the most popular uses of tree-based methods is in the analysis of
censored data in which $y$ is the time to an event and is subject to
censoring. As described in Sect. 14.5, such trees are
referred to as survival trees
([3,12,13,40,73]). In
Sect. 14.6, we will present an extension of the tree
methodology to the classification of a response consisting of multiple
components such as an array of respiratory symptoms
([74]). Finally, we will conclude in
Sect. 14.7 with some remarks on relatively recent
developments such as forests and Bayesian trees. To illustrate the
methods and their applications, some examples will be presented along
with the methods.