12.2 Discrimination Rules in Practice

The ML rule is used if the distribution of the data is known up to parameters. Suppose for example that the data come from multivariate normal distributions $N_{p}(\mu_{j},\Sigma)$. If we have $J$ groups with $n_{j}$ observations in each group, we use $\overline x_j$ to estimate $\mu_j$, and $\data{S}_j$ to estimate $\Sigma$. The common covariance may be estimated by

\begin{displaymath}
\data{S}_{u}=\sum ^J_{j=1}n_j\left (\frac{\data{S}_j }{n-J }\right ),
\end{displaymath} (12.9)

with $n=\sum ^J_{j=1}n_j$. Thus the empirical version of the ML rule of Theorem 12.2 is to allocate a new observation $x$ to the population $\Pi_{j}$ whose index $j$ minimizes

\begin{displaymath}(x-\overline{x}_{j})^{\top}\data{S}_{u}^{-1}(x-\overline{x}_{j}) \quad
\textrm{over}\quad j\in\{1,\ldots,J\}.\end{displaymath}
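The following Python sketch illustrates this empirical rule under assumed synthetic data; the two-group setup and names such as allocate are illustrative and not prescribed by the text:

\begin{verbatim}
import numpy as np

# Hypothetical data: J = 2 groups of p = 2 dimensional observations.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=0.0, size=(30, 2)),
          rng.normal(loc=2.0, size=(40, 2))]

n_j = np.array([X_j.shape[0] for X_j in groups])
n, J = n_j.sum(), len(groups)
xbar = [X_j.mean(axis=0) for X_j in groups]

# Pooled estimate S_u as in (12.9): S_j uses the 1/n_j convention,
# so n_j * S_j is the within-group sum of squares.
S_u = sum(nj * np.cov(X_j, rowvar=False, bias=True)
          for nj, X_j in zip(n_j, groups)) / (n - J)
S_inv = np.linalg.inv(S_u)

def allocate(x):
    """Allocate x to the group with smallest (x - xbar_j)' S_u^{-1} (x - xbar_j)."""
    d2 = [(x - m) @ S_inv @ (x - m) for m in xbar]
    return int(np.argmin(d2))            # 0-based group index

print(allocate(np.array([0.3, -0.1])))   # close to the first group mean
print(allocate(np.array([2.1,  1.9])))   # close to the second group mean
\end{verbatim}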

EXAMPLE 12.4   Let us apply this rule to the Swiss bank notes. The 20 randomly chosen bank notes which we had clustered into two groups in Example 11.6 are used. First the covariance $\Sigma$ is estimated by the average of the covariances of $\Pi_1$ (cluster 1) and $\Pi_2$ (cluster 2). The hyperplane $\widehat\alpha ^{\top}(x-\overline{x})=0$ which separates the two populations is given by

\begin{eqnarray*}
\widehat\alpha &=& \data{S}_{u}^{-1}(\overline{x}_{1}-\overline{x}_{2}),\\
\overline{x} &=& \frac{1}{2}(\overline{x}_{1}+\overline{x}_{2})
= \left (214.79, 130.05, 129.92, 9.23, 10.48, 140.46 \right)^{\top}.
\end{eqnarray*}



Now let us apply the discriminant rule to the entire bank notes data set. Counting the number of misclassifications by

\begin{displaymath}\sum_{i=1}^{100} {\boldsymbol{I}}\{\widehat\alpha^{\top}(x_{i}-\overline{x}) < 0\}
+ \sum_{i=101}^{200} {\boldsymbol{I}}\{\widehat\alpha^{\top}(x_{i}-\overline{x}) > 0\},
\end{displaymath}

we obtain 1 misclassified observation for the counterfeit bank notes and 0 misclassifications for the genuine bank notes.

For $J=3$ groups the allocation regions can be calculated using

\begin{eqnarray*}
h_{12}(x)&=&(\overline x_1-\overline x_2)^{\top}
\data{S}^{-1}_u\left \{x-\frac{1 }{2 }(\overline x_1+\overline x_2)\right\},\\
h_{13}(x)&=&(\overline x_1-\overline x_3)^{\top}
\data{S}^{-1}_u\left \{x-\frac{1 }{2 }(\overline x_1+\overline x_3)\right\},\\
h_{23}(x)&=&(\overline x_2-\overline x_3)^{\top}
\data{S}^{-1}_u\left \{x-\frac{1 }{2 }(\overline x_2+\overline x_3)\right\}.
\end{eqnarray*}



The rule is to allocate $x$ to

\begin{displaymath}\left\{
\begin{array}{cc}
\Pi _1\quad &\qquad\textrm{if}\quad h_{12}(x)\ge 0\quad \textrm{and}\quad h_{13}(x)\ge 0,\\
\Pi _2\quad &\qquad\textrm{if}\quad h_{12}(x)<0\quad \textrm{and}\quad h_{23}(x)\ge 0,\\
\Pi _3\quad &\qquad\textrm{if}\quad h_{13}(x)<0\quad \textrm{and}\quad h_{23}(x)<0.
\end{array}\right.\end{displaymath}
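A minimal Python sketch of this pairwise formulation; the helper names h and allocate_three as well as the toy means and the identity covariance are hypothetical illustrations:

\begin{verbatim}
import numpy as np

def h(xbar_i, xbar_j, S_inv, x):
    """h_ij(x) = (xbar_i - xbar_j)' S_u^{-1} {x - (xbar_i + xbar_j)/2}."""
    return (xbar_i - xbar_j) @ S_inv @ (x - 0.5 * (xbar_i + xbar_j))

def allocate_three(x, xbar, S_inv):
    """Allocate x among Pi_1, Pi_2, Pi_3 using h_12, h_13 and h_23."""
    h12 = h(xbar[0], xbar[1], S_inv, x)
    h13 = h(xbar[0], xbar[2], S_inv, x)
    h23 = h(xbar[1], xbar[2], S_inv, x)
    if h12 >= 0 and h13 >= 0:
        return 1
    if h12 < 0 and h23 >= 0:
        return 2
    return 3   # remaining case: h_13(x) < 0 and h_23(x) < 0

# Toy usage with three means and S_u = I (so S_u^{-1} = I).
xbar = [np.array([0., 0.]), np.array([3., 0.]), np.array([0., 3.])]
S_inv = np.eye(2)
print(allocate_three(np.array([0.2, 0.1]), xbar, S_inv))   # -> 1
print(allocate_three(np.array([2.8, 0.4]), xbar, S_inv))   # -> 2
\end{verbatim}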

Estimation of the probabilities of misclassification

Misclassification probabilities are given by (12.7) and can be estimated by replacing the unknown parameters by their corresponding estimators.

For the ML rule for two normal populations we obtain

\begin{displaymath}\hat{p}_{12}=\hat{p}_{21}=\Phi\left(-\frac{1}{2}\hat{\delta}\right)\end{displaymath}

where $\hat{\delta}^2=(\bar{x}_1-\bar{x}_2)^{\top} \data{S}_{u}^{-1}(\bar{x}_1-\bar{x}_2)$ is the estimator for $\delta^2$.
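As a small numerical illustration (Python; the means and pooled covariance below are arbitrary placeholders, not the bank-note estimates):

\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Hypothetical two-group estimates.
xbar1 = np.array([1.0, 0.5])
xbar2 = np.array([0.2, 1.1])
S_u = np.array([[0.8, 0.1],
                [0.1, 0.5]])

d = xbar1 - xbar2
delta_hat = np.sqrt(d @ np.linalg.solve(S_u, d))   # estimated Mahalanobis distance
p12_hat = norm.cdf(-0.5 * delta_hat)               # Phi(-delta_hat / 2)
print(p12_hat)
\end{verbatim}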

The probabilities of misclassification may also be estimated by the re-substitution method. We reclassify each original observation $x_i$, $i=1,\cdots,n$ into $\Pi_{1},\cdots,\Pi_J$ according to the chosen rule. Then denoting the number of individuals coming from $\Pi _j$ which have been classified into $\Pi _i$ by $n_{ij}$, we have $\hat{p}_{ij}=\frac{n_{ij}}{n_j}$, an estimator of $p_{ij}$. Clearly, this method leads to overly optimistic estimators of $p_{ij}$, but it provides a rough measure of the quality of the discriminant rule. The matrix $(\hat{p}_{ij})$ is called the confusion matrix in Johnson and Wichern (1998).
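The resubstitution counts $n_{ij}$ and the estimates $\hat{p}_{ij}$ can be tabulated as sketched below (Python; the nearest-mean rule and the synthetic groups merely stand in for whichever discriminant rule is being assessed):

\begin{verbatim}
import numpy as np

def confusion_matrix(groups, allocate):
    """N[i, j] = number of observations from group j classified into group i."""
    J = len(groups)
    N = np.zeros((J, J), dtype=int)
    for j, X_j in enumerate(groups):
        for x in X_j:
            N[allocate(x), j] += 1
    return N, N / N.sum(axis=0)        # p_hat_ij = n_ij / n_j

# Toy usage: two synthetic groups and a nearest-mean allocation rule.
rng = np.random.default_rng(1)
groups = [rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(2.0, 1.0, size=(60, 2))]
means = [X.mean(axis=0) for X in groups]
allocate = lambda x: int(np.argmin([np.sum((x - m) ** 2) for m in means]))
N, p_hat = confusion_matrix(groups, allocate)
print(N)
print(p_hat)
\end{verbatim}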

EXAMPLE 12.5   In the above classification problem for the Swiss bank notes (Table B.2), we have the following confusion matrix:

\begin{displaymath}
\begin{array}{llcc}
 & & \multicolumn{2}{c}{\textrm{true membership}}\\
 & & \textrm{genuine } (\Pi_1) & \textrm{counterfeit } (\Pi_2)\\
\textrm{predicted} & \Pi_1 & 100 & 1\\
 & \Pi_2 & 0 & 99
\end{array}
\end{displaymath}
MVAaper.xpl

The apparent error rate (APER) is defined as the fraction of observations that are misclassified. The APER, expressed as a percentage, is

\begin{displaymath}\textrm{APER}= \left(\frac{1}{200}\right)100\% =0.5\%.\end{displaymath}

For the calculation of the APER we use the observations twice: the first time to construct the classification rule and the second time to evaluate this rule. An APER of $ 0.5\%$ might therefore be too optimistic. An approach that corrects for this bias is based on the holdout procedure of Lachenbruch and Mickey (1968). For two populations this procedure is as follows:

1. Start with the first population $\Pi_1$. Omit one observation and develop the classification rule based on the remaining $n_1 -1,\; n_2$ observations.
2. Classify the ``holdout'' observation using the discrimination rule in Step 1.
3. Repeat steps 1 and 2 until all of the $\Pi_1$ observations are classified. Count the number $n'_{21}$ of misclassified observations.
4. Repeat steps 1 through 3 for population $\Pi_2$. Count the number $n'_{12}$ of misclassified observations.
Estimates of the misclassification probabilities are given by

\begin{displaymath}
\hat{p}'_{12} = \frac{n'_{12}}{n_2}
\end{displaymath}

and

\begin{displaymath}
\hat{p}'_{21} = \frac{n'_{21}}{n_1}.
\end{displaymath}

A more realistic estimator of the actual error rate (AER) is given by
\begin{displaymath}
\frac{n'_{12}+n'_{21}}{n_2 + n_1}.
\end{displaymath} (12.10)
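A leave-one-out sketch of the holdout procedure for the two-group ML rule (Python; the function names and the synthetic data are illustrative assumptions, not part of the text):

\begin{verbatim}
import numpy as np

def pooled_cov(X1, X2):
    """Pooled covariance S_u with divisor n - J (here J = 2), as in (12.9)."""
    n1, n2 = len(X1), len(X2)
    S1 = np.cov(X1, rowvar=False, bias=True)   # 1/n_j convention
    S2 = np.cov(X2, rowvar=False, bias=True)
    return (n1 * S1 + n2 * S2) / (n1 + n2 - 2)

def classify(x, X1, X2):
    """Empirical ML rule for two groups: 1 if alpha'(x - xbar) > 0, else 2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    alpha = np.linalg.solve(pooled_cov(X1, X2), m1 - m2)
    return 1 if alpha @ (x - 0.5 * (m1 + m2)) > 0 else 2

def holdout_aer(X1, X2):
    """Holdout counts n'_21, n'_12 and the AER estimate (12.10)."""
    n21 = sum(classify(X1[i], np.delete(X1, i, axis=0), X2) == 2
              for i in range(len(X1)))
    n12 = sum(classify(X2[i], X1, np.delete(X2, i, axis=0)) == 1
              for i in range(len(X2)))
    return n21, n12, (n12 + n21) / (len(X1) + len(X2))

# Toy usage with synthetic data.
rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=(40, 3))
X2 = rng.normal(1.0, 1.0, size=(50, 3))
print(holdout_aer(X1, X2))
\end{verbatim}

Note that the rule is re-estimated once per held-out observation, which is the computational cost referred to below.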

Statisticians favor the AER (for its unbiasedness) over the APER. In large samples, however, the computational costs might counterbalance the statistical advantage. This is not a real problem since the two misclassification measures are asymptotically equivalent.

MVAaer.xpl


Fisher's linear discrimination function

Another approach stems from R. A. Fisher. His idea was to base the discriminant rule on a projection $a^{\top}x$ such that a good separation is achieved. This LDA projection method is called Fisher's linear discrimination function. If

\begin{displaymath}\data{Y}=\data{X}a\end{displaymath}

denotes a linear combination of observations, then the total sum of squares of $y$, $\sum_{i=1}^n(y_i-\bar{y})^2$, is equal to
\begin{displaymath}
\data{Y}^{\top}\data{H}\data{Y}=a^{\top}\data{X}^{\top}\data{H}\data{X}a=a^{\top}\data{T}a
\end{displaymath} (12.11)

with the centering matrix $\data{H} = \data{I}-n^{-1}1_{n}1^{\top}_{n}$ and $\data{T}= \data{X}^{\top}\data{H}\data{X}$.

Suppose we have samples $\data{X}_{j}$, $j=1,\ldots,J$, from $J$ populations. Fisher's suggestion was to find the linear combination $a^{\top}x$ which maximizes the ratio of the between-group-sum of squares to the within-group-sum of squares.

The within-group-sum of squares is given by

\begin{displaymath}
\sum_{j=1}^J \data{Y}_j^{\top}\data{H}_j\data{Y}_j
=\sum_{j=1}^J a^{\top}\data{X}_j^{\top}\data{H}_j\data{X}_ja=a^{\top}\data{W}a,
\end{displaymath} (12.12)

where $\data{Y}_j$ denotes the $j$-th sub-matrix of $\data{Y}$ corresponding to observations of group $j$ and $\data{H}_j$ denotes the $(n_j\times n_j)$ centering matrix. The within-group-sum of squares measures the sum of variations within each group.

The between-group-sum of squares is

\begin{displaymath}
\sum_{j=1}^J n_j(\overline y_j-\overline y)^2=\sum_{j=1}^J n_j\{a^{\top}(\overline
x_j-\overline x)\}^2=a^{\top}\data{B}a,
\end{displaymath} (12.13)

where $\overline{y}_{j}$ and $\overline{x}_{j}$ denote the means of $\data{Y}_{j}$ and $\data{X}_{j}$ and $\overline{y}$ and $\overline{x}$ denote the sample means of $\data{Y}$ and $\data{X}$. The between-group-sum of squares measures the variation of the means across groups.

The total sum of squares (12.11) is the sum of the within-group-sum of squares and the between-group-sum of squares, i.e.,

\begin{displaymath}a^{\top}\data{T}a=a^{\top}\data{W}a+a^{\top}\data{B}a.\end{displaymath}

Fisher's idea was to select a projection vector $a$ that maximizes the ratio
\begin{displaymath}
\frac{a^{\top}\data{B}a }{a^{\top}\data{W}a }.
\end{displaymath} (12.14)

The solution is found by applying Theorem 2.5.

THEOREM 12.4   The vector $a$ that maximizes (12.14) is the eigenvector of $\data{W}^{-1}\data{B}$ that corresponds to the largest eigenvalue.

Now a discrimination rule is easy to obtain:
classify $x$ into group $j$ where $a^{\top} \bar{x}_j$ is closest to $a^{\top}x$, i.e.,

\begin{displaymath}
x \to \Pi_j \textrm{ where } j=\arg \min_i \vert a^{\top}(x-\bar{x}_i)\vert.
\end{displaymath}
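A compact Python sketch of this rule (the function fisher_lda and the toy groups are hypothetical; the direction $a$ is taken as the leading eigenvector of $\data{W}^{-1}\data{B}$ as in Theorem 12.4):

\begin{verbatim}
import numpy as np

def fisher_lda(groups):
    """Return Fisher's direction a and the induced allocation rule."""
    xbar_j = [X.mean(axis=0) for X in groups]
    n_j = [len(X) for X in groups]
    xbar = np.concatenate(groups).mean(axis=0)

    # Within-group (W) and between-group (B) sums of squares, (12.12)-(12.13).
    W = sum((X - m).T @ (X - m) for X, m in zip(groups, xbar_j))
    B = sum(nj * np.outer(m - xbar, m - xbar) for nj, m in zip(n_j, xbar_j))

    eigval, eigvec = np.linalg.eig(np.linalg.solve(W, B))
    a = np.real(eigvec[:, np.argmax(np.real(eigval))])

    def allocate(x):
        """Group whose projected mean a'xbar_j is closest to a'x."""
        return int(np.argmin([abs(a @ (x - m)) for m in xbar_j]))
    return a, allocate

# Toy usage with three synthetic groups.
rng = np.random.default_rng(3)
groups = [rng.normal(m, 1.0, size=(30, 2)) for m in (0.0, 2.0, 4.0)]
a, allocate = fisher_lda(groups)
print(a, allocate(np.array([3.9, 4.1])))
\end{verbatim}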

When $J=2$ groups, the discriminant rule is easy to compute. Suppose that group 1 has $n_{1}$ elements and group 2 has $n_{2}$ elements. In this case

\begin{displaymath}\data{B}=\left (\frac{n_1n_2 }{ n}\right) dd^{\top},\end{displaymath}

where $d=(\overline x_1-\overline x_2)$. Since $\data{B}$ has rank one, $\data{W}^{-1}\data{B}$ has only one nonzero eigenvalue, which equals

\begin{displaymath}\mathop{\hbox{tr}}(\data{W}^{-1}\data{B})=\left( \frac{n_1n_2 }{n } \right)
d^{\top}\data{W}^{-1}d,\end{displaymath}

and the corresponding eigenvector is $a=\data{W}^{-1}d$. The corresponding discriminant rule is
\begin{displaymath}
\begin{array}{lll}
x\to \Pi _1\quad &\qquad\textrm{if}\quad & a^{\top}\{x-\frac{1 }{ 2}(\overline x_1+\overline x_2)\} > 0,\\
x\to \Pi _2\quad &\qquad\textrm{if}\quad & a^{\top}\{x-\frac{1 }{ 2}(\overline x_1+\overline x_2)\}\le 0.
\end{array} \end{displaymath} (12.15)
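For $J=2$ the rule can be coded directly from this closed form (a sketch; the names and the synthetic data are illustrative):

\begin{verbatim}
import numpy as np

def fisher_two_group_rule(X1, X2):
    """Rule (12.15) with a = W^{-1} d, d = xbar_1 - xbar_2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    a = np.linalg.solve(W, m1 - m2)
    midpoint = 0.5 * (m1 + m2)
    return lambda x: 1 if a @ (x - midpoint) > 0 else 2

# Toy usage.
rng = np.random.default_rng(4)
X1 = rng.normal(0.0, 1.0, size=(40, 2))
X2 = rng.normal(2.0, 1.0, size=(50, 2))
rule = fisher_two_group_rule(X1, X2)
print(rule(np.array([-0.2, 0.1])), rule(np.array([2.3, 1.8])))
\end{verbatim}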

The Fisher LDA is closely related to projection pursuit (Chapter 18) since the statistical technique is based on a one-dimensional index $a^\top x$.

EXAMPLE 12.6   Consider the bank notes data again. Let us use the subscript ``g'' for the genuine and ``f'' for the counterfeit bank notes, e.g., $\data{X}_{g}$ denotes the first hundred observations of $\data{X}$ and $\data{X}_{f}$ the second hundred. In the context of the bank data set the ``between-group-sum of squares'' is defined as
\begin{displaymath}
100\left \{({\overline y}_g - \overline y)^2 +
({\overline y}_f - \overline y)^2 \right \} = a^{\top}{\data{B}}a
\end{displaymath} (12.16)

for some matrix ${\data{B}}$. Here, ${\overline y}_g$ and ${\overline y}_{f}$ denote the means for the genuine and counterfeit bank notes and $\overline y = {\textstyle \frac{1}{2}} ({\overline y}_g
+ {\overline y}_f)$. The ``within-group-sum of squares'' is
\begin{displaymath}
\sum^{100}_{i=1} \{(y_{g})_{i} - {\overline y}_g\}^2 +
\sum^{100}_{i=1} \{(y_{f})_{i} - {\overline y}_f\}^2= a^{\top}{\data{W}}a,
\end{displaymath} (12.17)

with $(y_{g})_{i}=a^{\top}x_i$ and $(y_{f})_{i}=a^{\top}x_{i+100}$ for $i=1,\ldots,100$.

Figure 12.2: Densities of projections of genuine and counterfeit bank notes by Fisher's discrimination rule. MVAdisfbank.xpl
\includegraphics[width=1\defpicwidth]{discbfis.ps}

The resulting discriminant rule consists of allocating an observation $x_0$ to the genuine sample space if

\begin{displaymath}a^{\top}(x_0 - \overline x)
> 0,\end{displaymath}

with $a=\data{W}^{-1}(\overline{x}_{g}-\overline{x}_{f})$ (see Exercise 12.8) and of allocating $x_0$ to the counterfeit sample space when the opposite is true. In our case

\begin{displaymath}
a=\left( 0.000, 0.029, -0.029, -0.039, -0.041, 0.054\right)^{\top}\cdot
\end{displaymath}

One genuine and no counterfeit bank notes are misclassified. Figure 12.2 shows the estimated densities for $y_g=a^{\top}\data{X}_g$ and $y_f=a^{\top}\data{X}_f$. They are separated better than those of the diagonals in Figure 1.9.

Note that the allocation rule (12.15) is exactly the same as the ML rule for $J=2$ groups and for normal distributions with the same covariance. For $J=3$ groups this rule will be different, except for the special case of collinear sample means.

Summary
$\ast$
A discriminant rule is a separation of the sample space into sets $R_j$. An observation $x$ is classified as coming from population $\Pi _j$ if it lies in $R_j$.
$\ast$
The expected cost of misclassification (ECM) for two populations is given by ECM $= C(2\vert 1)p_{21}\pi_1 + C(1\vert 2)p_{12}\pi_2$.
$\ast$
The ML rule is applied if the distributions in the populations are known up to parameters, e.g., for normal distributions $N_{p}(\mu_{j},\Sigma)$.
$\ast$
The ML rule allocates $x$ to the population that exhibits the smallest Mahalanobis distance

\begin{displaymath}\delta^2(x;\mu_i)=(x-\mu_i)^{\top}\Sigma^{-1}(x-\mu_i).\end{displaymath}

$\ast$
The probability of misclassification is given by

\begin{displaymath}p_{12}=p_{21}=\Phi\left(-\frac{1}{2}\delta\right),\end{displaymath}

where $\delta$ is the Mahalanobis distance between $\mu_1$ and $\mu_2$.
$\ast$
Classification for different covariance structures in the two populations leads to quadratic discrimination rules.
$\ast$
A different approach is Fisher's linear discrimination rule which finds a linear combination $a^{\top}x$ that maximizes the ratio of the ``between-group-sum of squares'' and the ``within-group-sum of squares''. This rule turns out to be identical to the ML rule when $J=2$ for normal populations.