As pointed out by [74], multiple binary responses arise in many applications in which an array of health-related symptoms is of primary interest. Most of the existing methods are parametric; see, e.g., [25] for an excellent overview. In this section, we describe a tree-based alternative to the parametric methods.
Motivated both by these broad applications and by the need to analyze building-related occupant complaint syndrome (BROCS), [74] proposed a tree-based method to model and classify multiple binary responses. Let us use the BROCS study to explain the method.
To understand the nature of BROCS, data were collected in 1989 from
employees of the Library of Congress (LOC) and the
headquarters of the Environmental Protection Agency (EPA) in the
United States. The data contain many explanatory variables, but
[74] extracted a subset of
putative risk factors, most
of which are answers to "yes or no" or frequency (never, rarely,
sometimes, etc.) questions. For example, is the working space an enclosed
office with a door, a cubicle without a door, stacks, etc.? See Table 1 of
[74] for a detailed list. In this data set, BROCS is
represented by six binary responses that cover respiratory symptoms in
the central nervous system, upper airway, pain, flu-like, eyes, and
lower airway. The primary purpose of analyzing this extracted data set is to
evaluate the effect of the important risk factors on the six responses
by constructing trees.
In terms of notation, the primary distinction is that the response
for each subject is a 6-vector. Consequently, we need to generalize
the node-splitting criterion and cost-complexity to this
vector-response. As we indicated earlier, one solution is to assume
a certain type of within-node distribution for the vector-response and
then maximize the within-node likelihood for splitting. One such
distribution is given in (14.4).
A naive approach is to treat the response vector as a numerical vector and use
a function such as the determinant of the within-node covariance
matrix of the responses as a measure of impurity. If the responses
were continuous, this
approach is what [64] proposed to construct regression trees
for repeatedly measured continuous responses. For binary outcomes, however,
this approach appears to suffer from the well-known end-cut preference
problem: it favors splits that
result in two daughter nodes of very unequal sizes.
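To make the naive criterion concrete, the following is a minimal sketch of a determinant-based impurity for a node whose subjects each carry a vector of binary responses. The function names, the use of the divisor n in the covariance, and the size-weighted sum used to score a split are assumptions made for illustration; they are not taken from [64] or [74].

```python
def covariance_matrix(rows):
    """Within-node covariance matrix (divisor n) of equal-length response vectors."""
    n = len(rows)
    q = len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(q)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / n
         for j in range(q)]
        for i in range(q)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def node_impurity(rows):
    """Naive impurity: determinant of the within-node covariance matrix."""
    return determinant(covariance_matrix(rows))


def split_impurity(left, right):
    """Size-weighted sum of daughter-node impurities (an assumed combination rule)."""
    return len(left) * node_impurity(left) + len(right) * node_impurity(right)


# Toy data with two binary responses per subject (hypothetical values).
mixed = [(0, 0), (0, 1), (1, 0), (1, 1)]
pure = [(1, 1), (1, 1)]
print(node_impurity(mixed), node_impurity(pure), split_impurity(mixed, pure))
```

Note that a node in which any single response is constant has a singular covariance matrix and therefore zero impurity under this measure, however heterogeneous the remaining responses are.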
One advantage of the likelihood-based method is that the negative of
the within-node likelihood can also be used as the within-node cost
for tree pruning. The main difficulty with this method is the
computational burden, because every allowable split calls for
a maximization of the likelihood derived from (14.4). Some
strategies for reducing the computational time are discussed in
[74].
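The idea of using a negative within-node likelihood both to choose splits and as a node cost can be sketched as follows. To keep the example short and free of numerical optimization, the sketch swaps in a much simpler working model than (14.4): the responses are assumed independent Bernoulli variables within a node, so the maximized log-likelihood has a closed form. This is a deliberate simplification for illustration and not the method of [74]; the function names and the toy data are hypothetical.

```python
import math


def node_cost(rows):
    """Negative maximized log-likelihood of a node, assuming independent
    Bernoulli responses (a simplified stand-in for the model in (14.4))."""
    n = len(rows)
    q = len(rows[0])
    cost = 0.0
    for k in range(q):
        ones = sum(r[k] for r in rows)
        # The MLE of each response probability is its within-node proportion.
        for count in (ones, n - ones):
            if count > 0:
                cost -= count * math.log(count / n)
    return cost


def best_split(rows, x):
    """Search all cut points of one ordered predictor x and return the
    (cut, gain) pair with the largest reduction in total node cost."""
    parent = node_cost(rows)
    best = None
    for cut in sorted(set(x))[:-1]:
        left = [r for r, v in zip(rows, x) if v <= cut]
        right = [r for r, v in zip(rows, x) if v > cut]
        gain = parent - node_cost(left) - node_cost(right)
        if best is None or gain > best[1]:
            best = (cut, gain)
    return best


# Toy data: two binary responses per subject and one ordered predictor.
responses = [(0, 0), (0, 0), (1, 1), (1, 1)]
predictor = [1, 2, 3, 4]
cut, gain = best_split(responses, predictor)
print(cut, gain)
```

Under the model in (14.4) the closed-form step inside `node_cost` would be replaced by a numerical maximization for every candidate split, which is the source of the computational burden noted above.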
The criterion based on (14.4) ultimately leads to the
tree displayed in Fig. 14.5, which
suggests that respondents belonging to two of its terminal nodes
have a high incidence of respiratory symptoms. This is because the
air quality in the working areas of the people within these terminal nodes was
poor, namely, often too stuffy or sometimes dusty. On the other hand,
subjects in another terminal node
experienced the least
discomfort because they had the best air quality. The basic message
from this example is that tree-based analyses often reveal findings
that are readily interpretable.