14.6 Tree-based Methods for Multiple Correlated Outcomes

As pointed out by [74], multiple binary responses arise in many applications in which an array of health-related symptoms is of primary interest. Most of the existing methods are parametric; see, e.g., [25] for an excellent overview. In this section, we describe a tree-based alternative to the parametric methods.

Motivated both by the broad range of applications and by the need to analyze building-related occupant complaint syndrome (BROCS), [74] proposed a tree-based method to model and classify multiple binary responses. We use the BROCS study to explain the method.

To understand the nature of BROCS, data were collected in 1989 from employees of the Library of Congress (LOC) and the headquarters of the Environmental Protection Agency (EPA) in the United States. The data contain many explanatory variables, but [74] extracted a subset of putative risk factors, most of which are answers to "yes or no" or frequency (never, rarely, sometimes, etc.) questions. For example, is the working space an enclosed office with a door, a cubicle without a door, stacks, etc.? See Table 1 of [74] for a detailed list. In this data set, BROCS is represented by six binary responses that cover respiratory symptoms in the central nervous system, upper airway, pain, flu-like, eyes, and lower airway. The primary purpose of this extracted data set is to evaluate the effect of the important risk factors on the six responses by constructing trees.

In terms of notation, the primary distinction is that the response for each subject is a 6-vector. Consequently, we need to generalize the node-splitting criterion and the cost-complexity to this vector response. As we indicated earlier, one solution is to assume a certain type of within-node distribution for the vector response and then maximize the within-node likelihood for splitting. One such distribution is that given in (14.4).

A naive approach is to treat the response as a numerical vector and use a function such as the determinant of the within-node covariance matrix of the response as a measure of impurity. If the response were continuous, this approach is what [64] proposed to construct regression trees for repeatedly measured continuous responses. For binary outcomes, however, this approach appears to suffer from the well-known end-cut preference problem: it favors splits that produce two daughter nodes of very unequal sizes.
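The covariance-determinant impurity mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of [64] or [74]: the function names are hypothetical, the covariance uses the divisor n, and the split score is assumed to be a size-weighted average of the two daughter-node impurities.

```python
def covariance_matrix(rows):
    """Within-node covariance matrix (divisor n) of a list of response vectors."""
    n = len(rows)
    q = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(q)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / n
         for k in range(q)]
        for j in range(q)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def node_impurity(rows):
    """Impurity of a node: determinant of its within-node covariance matrix."""
    return determinant(covariance_matrix(rows))


def split_impurity(left, right):
    """Assumed split score: size-weighted average of daughter impurities."""
    n = len(left) + len(right)
    return (len(left) * node_impurity(left)
            + len(right) * node_impurity(right)) / n


# Toy node with two binary responses per subject (illustrative data only).
node = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(node_impurity(node))
print(split_impurity(node[:2], node[2:]))
```

Note that the determinant is zero whenever one response is constant within a node, which is common for binary outcomes in small nodes; this is one way to see why such a criterion can behave poorly for binary data.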

One advantage of the likelihood-based method is that the negative of the within-node likelihood can also be used as the within-node cost for tree pruning. The main difficulty with this method is the computational burden, because every allowable split calls for a maximization of the likelihood derived from (14.4). Some strategies for reducing the computational time are discussed in [74].
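To make the idea of likelihood-based splitting concrete, the sketch below uses a deliberately simplified stand-in for (14.4): it assumes the binary responses are independent Bernoulli variables within a node, so the maximized likelihood has a closed form and no numerical optimization is needed. The node cost is taken to be the negative maximized log-likelihood (an assumption), and the covariate names `stuffy` and `door` are hypothetical. The method of [74] instead models the dependence among responses, which is what makes each split expensive to evaluate.

```python
import math


def node_cost(responses):
    """Negative maximized log-likelihood of a node, assuming independent
    Bernoulli responses (a simplification, not the distribution (14.4))."""
    n = len(responses)
    cost = 0.0
    for j in range(len(responses[0])):
        ones = sum(y[j] for y in responses)
        for count in (ones, n - ones):
            if count > 0:  # the term 0 * log(0) is taken as 0
                cost -= count * math.log(count / n)
    return cost


def best_split(covariates, responses):
    """Pick the binary covariate whose split reduces the total cost the most.

    covariates: list of dicts mapping covariate name -> 0 or 1
    responses:  list of binary response vectors, one per subject
    Returns (name, cost_reduction); name is None if no split helps.
    """
    parent_cost = node_cost(responses)
    best_name, best_gain = None, 0.0
    for name in covariates[0]:
        left = [y for x, y in zip(covariates, responses) if x[name] == 0]
        right = [y for x, y in zip(covariates, responses) if x[name] == 1]
        if not left or not right:
            continue  # not an allowable split
        gain = parent_cost - node_cost(left) - node_cost(right)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Toy data: two binary symptoms and two hypothetical binary risk factors.
responses = [(1, 1), (1, 1), (0, 0), (0, 0)]
covariates = [
    {"stuffy": 1, "door": 1},
    {"stuffy": 1, "door": 0},
    {"stuffy": 0, "door": 1},
    {"stuffy": 0, "door": 0},
]
print(best_split(covariates, responses))
```

Because the same quantity serves as both the splitting criterion and the node cost, the sum of `node_cost` over the terminal nodes could play the role of the tree cost in cost-complexity pruning.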

The criterion based on (14.4) ultimately leads to a tree with nine terminal nodes, as displayed in Fig. 14.5. The tree suggests that respondents in two of the terminal nodes have a high incidence of respiratory symptoms. This is because the air quality in the working area of the people within these terminal nodes was poor, namely, often too stuffy or sometimes dusty. On the other hand, subjects in another terminal node experienced the least discomfort because they had the best air quality. The basic message from this example is that tree-based analyses often reveal findings that are readily interpretable.

**Figure 14.5:** Tree structure for the risk factors of BROCS based on (14.4). Inside each node are the node number and the number of subjects. The splitting question is given under the node. The *asterisks* indicate where the subjects with missing information are assigned. The *pin diagrams* under the tree show the incidence rates of the six clusters (C: CNS; U: upper airway; P: pain; F: flu-like; E: eyes; and L: lower airway) in the nine terminal nodes. The *side bar* on the right end indicates the scale, starting at 0, for the rates of all symptoms.