# 12.1 Allocation Rules for Known Distributions

Discriminant analysis is a set of methods and tools used to distinguish between groups (populations) and to determine how to allocate new observations to these groups. In one of our running examples we are interested in discriminating between counterfeit and genuine bank notes on the basis of measurements of these bank notes (see Table B.2). In this case we have two groups (counterfeit and genuine bank notes) and we would like to establish an algorithm (rule) that can allocate a new observation (a new bank note) to one of the groups.

Another example is the detection of "fast" and "slow" consumers of a newly introduced product. Using a consumer's characteristics such as education, income, family size, and amount of previous brand switching, we want to classify each consumer into one of the two groups just identified.

In poetry and literary studies the frequencies of spoken or written words and the lengths of sentences indicate profiles of different artists and writers. It can be of interest to attribute unknown literary or artistic works to certain writers with a specific profile. Anthropological measurements on ancient skulls help in discriminating between male and female bodies. Good and poor credit risk ratings constitute a discrimination problem that might be tackled using observations on income, age, number of credit cards, family size, etc.

In general we have populations $\Pi_j$, $j = 1, 2, \ldots, J$, and we have to allocate an observation $x$ to one of these groups. A discriminant rule is a separation of the sample space (in general $\mathbb{R}^p$) into sets $R_j$ such that if $x \in R_j$, it is identified as a member of population $\Pi_j$.

The main task of discriminant analysis is to find "good" regions $R_j$ such that the error of misclassification is small. In the following we describe such rules when the population distributions are known.

## Maximum Likelihood Discriminant Rule

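Denote the densities of each population $\Pi_j$ by $f_j(x)$. The maximum likelihood discriminant rule (ML rule) is given by allocating $x$ to the population $\Pi_j$ maximizing the likelihood $L_j(x) = f_j(x) = \max_i f_i(x)$.

If several $f_i$ give the same maximum then any of them may be selected. Mathematically, the sets $R_j$ given by the ML discriminant rule are defined as

$$R_j = \{x : L_j(x) > L_i(x) \text{ for } i = 1, \ldots, J,\ i \ne j\}. \tag{12.1}$$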
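As a minimal sketch of the ML rule, the function below evaluates every population density at the observation and returns the index of the largest one. The two normal populations and the function name `ml_allocate` are hypothetical choices made only for illustration; ties are resolved in favour of the smallest index, which is one of the arbitrary choices the rule allows.

```python
from statistics import NormalDist


def ml_allocate(x, densities):
    """Return the 1-based index j of the population whose density is largest at x.

    Ties go to the smallest index, one of the arbitrary choices the rule permits.
    """
    values = [f(x) for f in densities]
    return values.index(max(values)) + 1


# Hypothetical populations: Pi_1 = N(0, 1) and Pi_2 = N(3, 1).
densities = [NormalDist(0, 1).pdf, NormalDist(3, 1).pdf]

print(ml_allocate(0.5, densities))  # observation close to the first mean
print(ml_allocate(2.5, densities))  # observation close to the second mean
```

Passing the densities as a list keeps the sketch valid for any number of populations, not only two.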

By classifying an observation into a certain group we may encounter a misclassification error. For $J = 2$ groups the probability of putting $x$ into group 2 although it is from population 1 can be calculated as

$$p_{21} = P(X \in R_2 \mid \Pi_1) = \int_{R_2} f_1(x)\,dx. \tag{12.2}$$

Similarly, the conditional probability of classifying an object as belonging to the first population $\Pi_1$ although it actually comes from $\Pi_2$ is

$$p_{12} = P(X \in R_1 \mid \Pi_2) = \int_{R_1} f_2(x)\,dx. \tag{12.3}$$
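To make the two misclassification probabilities concrete, the following sketch evaluates them for a one-dimensional rule with a single cut-off. The populations $N(0, 1)$ and $N(2, 1)$ and the cut-off $c = 1$ are assumptions chosen for illustration; with regions $R_1 = (-\infty, c]$ and $R_2 = (c, \infty)$ the integrals reduce to values of the normal distribution function.

```python
from statistics import NormalDist

# Hypothetical setup: Pi_1 = N(0, 1), Pi_2 = N(2, 1), and the rule
# R_1 = (-inf, c], R_2 = (c, inf) with a cut-off c chosen by hand.
pop1 = NormalDist(mu=0.0, sigma=1.0)
pop2 = NormalDist(mu=2.0, sigma=1.0)
c = 1.0

# p21 = P(X in R_2 | Pi_1): mass of f_1 to the right of c.
p21 = 1.0 - pop1.cdf(c)
# p12 = P(X in R_1 | Pi_2): mass of f_2 to the left of c.
p12 = pop2.cdf(c)

print(f"p21 = {p21:.4f}, p12 = {p12:.4f}")
```

Because the cut-off sits halfway between the two means and the variances are equal, both error probabilities coincide in this particular setup.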

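Misclassified observations create a cost $C(i \mid j)$ when a $\Pi_j$ observation is assigned to $R_i$. In the credit risk example, this might be the cost of a "sour" credit. The cost structure can be pinned down in a cost matrix:

| | Classified as $\Pi_1$ | Classified as $\Pi_2$ |
|---|---|---|
| True population $\Pi_1$ | 0 | $C(2 \mid 1)$ |
| True population $\Pi_2$ | $C(1 \mid 2)$ | 0 |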

Let $\pi_j$ be the prior probability of population $\Pi_j$, where "prior" means the a priori probability that an individual selected at random belongs to $\Pi_j$ (i.e., before looking at the value $x$). Prior probabilities should be considered if it is clear ahead of time that an observation is more likely to stem from a certain population $\Pi_j$. An example is the classification of musical tunes. If it is known that during a certain period of time a majority of tunes were written by a certain composer, then there is a higher probability that a given tune from that period was composed by this composer. Therefore, this composer should receive a higher prior probability when tunes are assigned to a specific group.

The expected cost of misclassification (ECM) is given by

$$ECM = C(2 \mid 1)\, p_{21}\, \pi_1 + C(1 \mid 2)\, p_{12}\, \pi_2. \tag{12.4}$$

We will be interested in classification rules that keep the ECM small or minimize it over a class of rules. The discriminant rule minimizing the ECM (12.4) for two populations is given below.

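THEOREM 12.1   For two given populations, the rule minimizing the ECM is given by

$$R_1 = \left\{ x : \frac{f_1(x)}{f_2(x)} \ge \left( \frac{C(1 \mid 2)}{C(2 \mid 1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right\},$$

$$R_2 = \left\{ x : \frac{f_1(x)}{f_2(x)} < \left( \frac{C(1 \mid 2)}{C(2 \mid 1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right\}.$$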

The ML discriminant rule is thus a special case of the ECM rule for equal misclassification costs and equal prior probabilities. For simplicity the unity cost case, $C(1 \mid 2) = C(2 \mid 1) = 1$, and equal prior probabilities, $\pi_2 = \pi_1$, are assumed in the following.
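The likelihood-ratio form of the ECM rule can be sketched in a few lines. The function name `ecm_allocate`, its argument names, and the two normal populations are hypothetical; `c12` stands for the cost of classifying a population 2 observation into region 1, `c21` for the reverse error, and `pi1`, `pi2` for the prior probabilities. The comparison is written in product form so that no division by a density is needed.

```python
from statistics import NormalDist


def ecm_allocate(x, f1, f2, c12=1.0, c21=1.0, pi1=0.5, pi2=0.5):
    """Allocate x to population 1 or 2 by the likelihood-ratio threshold.

    c12 is the cost C(1|2) of putting a Pi_2 observation into R_1,
    c21 is the cost C(2|1) of putting a Pi_1 observation into R_2.
    """
    # f1/f2 >= (c12/c21) * (pi2/pi1), rewritten without dividing by f2(x).
    return 1 if f1(x) * c21 * pi1 >= f2(x) * c12 * pi2 else 2


f1 = NormalDist(0, 1).pdf  # hypothetical Pi_1
f2 = NormalDist(2, 1).pdf  # hypothetical Pi_2

# Unit costs and equal priors give the ML rule with boundary at x = 1.
print(ecm_allocate(0.9, f1, f2))
# A tenfold cost C(1|2) shrinks R_1, so the same point now goes to Pi_2.
print(ecm_allocate(0.9, f1, f2, c12=10.0))
```

With the default arguments the function reproduces the ML rule, which illustrates the special case mentioned above.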

Theorem 12.1 will be proven by an example from credit scoring.

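EXAMPLE 12.1   Suppose that $\Pi_1$ represents the population of bad clients who create the cost $C(2 \mid 1)$ if they are classified as good clients. Analogously, define $C(1 \mid 2)$ as the cost of losing a good client classified as a bad one. Let $\gamma$ denote the gain of the bank for the correct classification of a good client. The total gain of the bank is then

$$\begin{aligned}
G(R_2) &= -C(2 \mid 1)\,\pi_1 \int I(x \in R_2)\, f_1(x)\,dx - C(1 \mid 2)\,\pi_2 \int \{1 - I(x \in R_2)\}\, f_2(x)\,dx + \gamma\,\pi_2 \int I(x \in R_2)\, f_2(x)\,dx \\
&= -C(1 \mid 2)\,\pi_2 + \int I(x \in R_2)\, \{ -C(2 \mid 1)\,\pi_1 f_1(x) + (C(1 \mid 2) + \gamma)\,\pi_2 f_2(x) \}\,dx.
\end{aligned}$$

Since the first term in this equation is constant, the maximum is obviously obtained for

$$R_2 = \{ x : -C(2 \mid 1)\,\pi_1 f_1(x) + \{C(1 \mid 2) + \gamma\}\,\pi_2 f_2(x) \ge 0 \}.$$

This is equivalent to

$$R_2 = \left\{ x : \frac{f_2(x)}{f_1(x)} \ge \frac{C(2 \mid 1)\,\pi_1}{\{C(1 \mid 2) + \gamma\}\,\pi_2} \right\},$$

which corresponds to the set $R_2$ in Theorem 12.1 for a gain of $\gamma = 0$.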

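EXAMPLE 12.2   Suppose $x \in \{0, 1\}$ and

$$\Pi_1 : P(X = 0) = P(X = 1) = \tfrac{1}{2}, \qquad \Pi_2 : P(X = 0) = \tfrac{1}{4} = 1 - P(X = 1).$$

The sample space is the set $\{0, 1\}$. The ML discriminant rule is to allocate $x = 0$ to $\Pi_1$ and $x = 1$ to $\Pi_2$, defining the sets $R_1 = \{0\}$, $R_2 = \{1\}$ and $R_1 \cup R_2 = \{0, 1\}$.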
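The allocation in this discrete example can be checked by direct enumeration. The sketch below assumes that the first population puts probability 1/2 on each of the two points and that the second population puts probability 1/4 on 0 and 3/4 on 1; the dictionary names are illustrative only.

```python
from fractions import Fraction

# Assumed probability functions on the sample space {0, 1}.
f1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
f2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}

# ML rule: each point goes to the population with the larger probability.
regions = {1: set(), 2: set()}
for x in (0, 1):
    j = 1 if f1[x] >= f2[x] else 2
    regions[j].add(x)

print(regions)  # {1: {0}, 2: {1}}

# Misclassification probabilities of this rule.
p21 = sum(f1[x] for x in regions[2])  # P(X in R_2 | Pi_1)
p12 = sum(f2[x] for x in regions[1])  # P(X in R_1 | Pi_2)
print(p21, p12)  # 1/2 1/4
```

Exact fractions are used so that the comparison of probabilities is not affected by rounding.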

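EXAMPLE 12.3   Consider two normal populations

$$\Pi_1 : N(\mu_1, \sigma_1^2), \qquad \Pi_2 : N(\mu_2, \sigma_2^2).$$

Then

$$L_i(x) = (2\pi\sigma_i^2)^{-1/2} \exp\left\{ -\frac{1}{2} \left( \frac{x - \mu_i}{\sigma_i} \right)^2 \right\}.$$

Hence $x$ is allocated to $\Pi_1$ ($x \in R_1$) if $L_1(x) \ge L_2(x)$. Note that $L_1(x) \ge L_2(x)$ is equivalent to

$$\frac{\sigma_2}{\sigma_1} \exp\left\{ -\frac{1}{2} \left[ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - \left( \frac{x - \mu_2}{\sigma_2} \right)^2 \right] \right\} \ge 1$$

or

$$x^2 \left( \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2} \right) - 2x \left( \frac{\mu_1}{\sigma_1^2} - \frac{\mu_2}{\sigma_2^2} \right) + \left( \frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2} \right) \le 2 \log \frac{\sigma_2}{\sigma_1}. \tag{12.5}$$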

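Suppose that $\mu_1 = 0$, $\sigma_1 = 1$ and $\mu_2 = 1$, $\sigma_2 = \frac{1}{2}$. Formula (12.5) leads to

$$R_1 = \left\{ x : x \le \tfrac{1}{3}\left(4 - \sqrt{4 + 6 \log 2}\right) \text{ or } x \ge \tfrac{1}{3}\left(4 + \sqrt{4 + 6 \log 2}\right) \right\}, \qquad R_2 = \mathbb{R} \setminus R_1.$$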

This situation is shown in Figure 12.1.
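The boundary points of the two regions can be computed numerically as a check. The sketch below assumes the parameter values $\mu_1 = 0$, $\sigma_1 = 1$, $\mu_2 = 1$, $\sigma_2 = 1/2$ and solves the quadratic obtained by turning the inequality for $L_1(x) \ge L_2(x)$ into an equation; the helper name `ml_boundaries` is hypothetical.

```python
import math
from statistics import NormalDist


def ml_boundaries(mu1, s1, mu2, s2):
    """Roots of a*x^2 + b*x + c = 0, the points where the two normal densities cross.

    Requires s1 != s2, otherwise the quadratic term vanishes.
    """
    a = 1 / s1**2 - 1 / s2**2
    b = -2 * (mu1 / s1**2 - mu2 / s2**2)
    c = mu1**2 / s1**2 - mu2**2 / s2**2 - 2 * math.log(s2 / s1)
    disc = math.sqrt(b * b - 4 * a * c)
    return sorted(((-b - disc) / (2 * a), (-b + disc) / (2 * a)))


lo, hi = ml_boundaries(0.0, 1.0, 1.0, 0.5)
print(round(lo, 4), round(hi, 4))  # 0.3812 2.2855

# Between the two roots the narrower density f_2 dominates, so that interval is R_2.
f1, f2 = NormalDist(0.0, 1.0).pdf, NormalDist(1.0, 0.5).pdf
print(f2(1.0) > f1(1.0))  # True
```

Observations below about 0.38 or above about 2.29 are therefore allocated to the first population, even though the second mean lies between these points.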

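The situation simplifies in the case of equal variances $\sigma_1 = \sigma_2$. The discriminant rule (12.5) is then (for $\mu_1 < \mu_2$)

$$\begin{aligned}
x \to \Pi_1, &\quad \text{if } x \in R_1 = \{ x : x \le \tfrac{1}{2}(\mu_1 + \mu_2) \}, \\
x \to \Pi_2, &\quad \text{if } x \in R_2 = \{ x : x > \tfrac{1}{2}(\mu_1 + \mu_2) \}.
\end{aligned} \tag{12.6}$$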

Theorem 12.2 shows that the ML discriminant rule for multinormal observations is intimately connected with the Mahalanobis distance. The discriminant rule is based on linear combinations and belongs to the family of Linear Discriminant Analysis (LDA) methods.

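THEOREM 12.2   Suppose $\Pi_i = N_p(\mu_i, \Sigma)$.

(a) The ML rule allocates $x$ to $\Pi_j$, where $j \in \{1, \ldots, J\}$ is the value minimizing the squared Mahalanobis distance between $x$ and $\mu_i$:

$$\delta^2(x, \mu_i) = (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i), \quad i = 1, \ldots, J.$$

(b) In the case of $J = 2$,

$$x \in R_1 \iff \alpha^\top (x - \mu) \ge 0,$$

where $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \frac{1}{2}(\mu_1 + \mu_2)$.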
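The two formulations in Theorem 12.2 can be compared numerically. The following sketch works in two dimensions with hand-written 2x2 linear algebra so that it needs no external libraries; the means, the common covariance matrix `S`, and all function names are hypothetical. One function allocates by the smaller squared Mahalanobis distance, the other by the sign of the linear score built from $\Sigma^{-1}(\mu_1 - \mu_2)$ and the midpoint of the means.

```python
def solve2(S, v):
    """Solve the 2x2 linear system S z = v by Cramer's rule."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (a * v[1] - c * v[0]) / det]


def mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu)."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    z = solve2(S, diff)
    return diff[0] * z[0] + diff[1] * z[1]


def allocate_by_distance(x, mu1, mu2, S):
    """Part (a): choose the population whose mean is nearer in Mahalanobis distance."""
    return 1 if mahalanobis_sq(x, mu1, S) <= mahalanobis_sq(x, mu2, S) else 2


def allocate_by_linear_rule(x, mu1, mu2, S):
    """Part (b): choose population 1 when alpha^T (x - mu) is non-negative."""
    alpha = solve2(S, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
    mid = [(mu1[0] + mu2[0]) / 2, (mu1[1] + mu2[1]) / 2]
    score = alpha[0] * (x[0] - mid[0]) + alpha[1] * (x[1] - mid[1])
    return 1 if score >= 0 else 2


# Hypothetical parameters.
mu1, mu2 = [0.0, 0.0], [2.0, 1.0]
S = [[1.0, 0.5], [0.5, 2.0]]

for x in ([0.2, 0.1], [1.8, 1.5], [0.5, -3.0]):
    print(x, allocate_by_distance(x, mu1, mu2, S), allocate_by_linear_rule(x, mu1, mu2, S))
```

Away from the boundary both functions return the same label; exactly on the boundary, rounding could make them differ, which is why the test points are chosen off the separating line.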

PROOF:
Part (a) of the Theorem follows directly from comparison of the likelihoods.

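For $J = 2$, part (a) says that $x$ is allocated to $\Pi_1$ if

$$(x - \mu_1)^\top \Sigma^{-1} (x - \mu_1) \le (x - \mu_2)^\top \Sigma^{-1} (x - \mu_2).$$

Rearranging terms leads to

$$-2\mu_1^\top \Sigma^{-1} x + 2\mu_2^\top \Sigma^{-1} x + \mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2 \le 0,$$

which is equivalent to

$$\begin{aligned}
2(\mu_2 - \mu_1)^\top \Sigma^{-1} x + (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2) &\le 0, \\
(\mu_1 - \mu_2)^\top \Sigma^{-1} \left\{ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right\} &\ge 0, \\
\alpha^\top (x - \mu) &\ge 0.
\end{aligned}$$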

## Bayes Discriminant Rule

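We have seen an example where prior knowledge on the probability of classification into $\Pi_j$ was assumed. Denote the prior probabilities by $\pi_j$ and note that $\sum_{j=1}^{J} \pi_j = 1$. The Bayes rule of discrimination allocates $x$ to the $\Pi_j$ that gives the largest value of $\pi_i f_i(x)$, that is, $\pi_j f_j(x) = \max_i \pi_i f_i(x)$. Hence, the discriminant rule is defined by $R_j = \{ x : \pi_j f_j(x) \ge \pi_i f_i(x) \text{ for } i = 1, \ldots, J \}$. Obviously the Bayes rule is identical to the ML discriminant rule for $\pi_j = 1/J$.

A further modification is to allocate $x$ to $\Pi_j$ with a certain probability $\phi_j(x)$, such that $\sum_{j=1}^{J} \phi_j(x) = 1$ for all $x$. This is called a randomized discriminant rule. A randomized discriminant rule is a generalization of deterministic discriminant rules since

$$\phi_j(x) = \begin{cases} 1 & \text{if } \pi_j f_j(x) = \max_i \pi_i f_i(x), \\ 0 & \text{otherwise} \end{cases}$$

reflects the deterministic rules.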
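A sketch of the Bayes rule only needs to weight each density by its prior before taking the maximum. The populations, the priors, and the function name `bayes_allocate` below are hypothetical; with equal priors the function returns the same allocation as the ML rule.

```python
from statistics import NormalDist


def bayes_allocate(x, densities, priors):
    """Return the 1-based index j maximizing priors[j] * densities[j](x)."""
    scores = [p * f(x) for p, f in zip(priors, densities)]
    return scores.index(max(scores)) + 1


# Hypothetical populations: Pi_1 = N(0, 1) and Pi_2 = N(2, 1).
densities = [NormalDist(0, 1).pdf, NormalDist(2, 1).pdf]

print(bayes_allocate(1.2, densities, [0.5, 0.5]))  # equal priors: same as the ML rule
print(bayes_allocate(1.2, densities, [0.9, 0.1]))  # strong prior for Pi_1 flips the decision
```

The point 1.2 lies slightly closer to the second mean, so it is allocated to the second population under equal priors and to the first once that population is assumed to be nine times more frequent.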

Which discriminant rules are good? We need a measure of comparison. Denote

$$p_{ij} = \int \phi_i(x)\, f_j(x)\,dx \tag{12.7}$$

as the probability of allocating $x$ to $\Pi_i$ if it in fact belongs to $\Pi_j$. A discriminant rule with probabilities $p_{ij}$ is as good as any other discriminant rule with probabilities $p'_{ij}$ if

$$p_{ii} \ge p'_{ii} \quad \text{for all } i = 1, \ldots, J. \tag{12.8}$$

We call the first rule better if the strict inequality in (12.8) holds for at least one $i$. A discriminant rule is called admissible if there is no better discriminant rule.

THEOREM 12.3   All Bayes discriminant rules (including the ML rule) are admissible.

## Probability of Misclassification for the ML rule ($J = 2$)

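Suppose that $\Pi_i = N_p(\mu_i, \Sigma)$. In the case of two groups, it is not difficult to derive the probabilities of misclassification for the ML discriminant rule. Consider for instance $p_{12} = P(x \in R_1 \mid \Pi_2)$. By part (b) in Theorem 12.2 we have

$$p_{12} = P\{ \alpha^\top (x - \mu) > 0 \mid \Pi_2 \}.$$

If $X$ stems from $\Pi_2$, then $\alpha^\top (X - \mu) \sim N_1\left( -\frac{1}{2}\delta^2, \delta^2 \right)$, where $\delta^2 = (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)$ is the squared Mahalanobis distance between the two populations. We obtain

$$p_{12} = \Phi\left( -\tfrac{1}{2}\delta \right).$$

Similarly, the probability of being classified into population 2 although $x$ stems from $\Pi_1$ is equal to $p_{21} = \Phi\left( -\frac{1}{2}\delta \right)$.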
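The misclassification probability depends on the populations only through the Mahalanobis distance between their means, which the following sketch makes explicit for the bivariate case. The means, the covariance matrix, and the function name are assumptions made for illustration; the 2x2 inverse is written out by hand to keep the example free of external libraries.

```python
import math
from statistics import NormalDist


def misclassification_probability(mu1, mu2, S):
    """Phi(-delta/2) for two bivariate normal populations with common covariance S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    d0, d1 = mu1[0] - mu2[0], mu1[1] - mu2[1]
    # delta^2 = (mu1 - mu2)^T S^{-1} (mu1 - mu2), using the explicit 2x2 inverse.
    delta_sq = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    return NormalDist().cdf(-0.5 * math.sqrt(delta_sq))


# Hypothetical parameters for which delta^2 = 4, so the result is Phi(-1).
p = misclassification_probability([0.0, 0.0], [2.0, 1.0], [[1.0, 0.5], [0.5, 2.0]])
print(round(p, 4))  # 0.1587
```

Moving the means further apart in Mahalanobis distance increases $\delta$ and therefore lowers both error probabilities.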

## Classification with different covariance matrices

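The minimum ECM depends on the ratio of the densities $f_1(x)/f_2(x)$ or equivalently on the difference $\ln\{f_1(x)\} - \ln\{f_2(x)\}$. When the covariance matrices of the two densities differ, the allocation rule becomes more complicated:

$$R_1 = \left\{ x : -\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k \ge \ln\left[ \left( \frac{C(1 \mid 2)}{C(2 \mid 1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right] \right\},$$

$$R_2 = \left\{ x : -\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1^\top \Sigma_1^{-1} - \mu_2^\top \Sigma_2^{-1}) x - k < \ln\left[ \left( \frac{C(1 \mid 2)}{C(2 \mid 1)} \right) \left( \frac{\pi_2}{\pi_1} \right) \right] \right\},$$

where $k = \frac{1}{2} \ln\left( \frac{|\Sigma_1|}{|\Sigma_2|} \right) + \frac{1}{2} \left( \mu_1^\top \Sigma_1^{-1} \mu_1 - \mu_2^\top \Sigma_2^{-1} \mu_2 \right)$. The classification regions are defined by quadratic functions. Therefore they belong to the family of Quadratic Discriminant Analysis (QDA) methods. This quadratic classification rule coincides with the rules used when $\Sigma_1 = \Sigma_2$, since the term $\frac{1}{2} x^\top (\Sigma_1^{-1} - \Sigma_2^{-1}) x$ disappears.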
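A quadratic rule of this kind can be sketched by comparing the two log densities directly, which is algebraically the same as evaluating the quadratic form with its constant term. The example below is a two-dimensional illustration with hypothetical parameters: both populations share the mean at the origin but differ in spread, so the boundary is a circle rather than a line. Function and argument names are assumptions, with `c12`, `c21` denoting the two misclassification costs and `pi1`, `pi2` the priors.

```python
import math


def inv2(S):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def log_density(x, mu, S):
    """Bivariate normal log density up to the additive constant -log(2*pi)."""
    Sinv, det = inv2(S)
    u = [x[0] - mu[0], x[1] - mu[1]]
    quad = (u[0] * (Sinv[0][0] * u[0] + Sinv[0][1] * u[1])
            + u[1] * (Sinv[1][0] * u[0] + Sinv[1][1] * u[1]))
    return -0.5 * math.log(det) - 0.5 * quad


def qda_allocate(x, mu1, S1, mu2, S2, c12=1.0, c21=1.0, pi1=0.5, pi2=0.5):
    """Allocate x to population 1 if ln f1(x) - ln f2(x) reaches the log threshold."""
    threshold = math.log((c12 / c21) * (pi2 / pi1))
    diff = log_density(x, mu1, S1) - log_density(x, mu2, S2)
    return 1 if diff >= threshold else 2


# Hypothetical populations with equal means and different covariance matrices.
mu = [0.0, 0.0]
S1 = [[1.0, 0.0], [0.0, 1.0]]
S2 = [[4.0, 0.0], [0.0, 4.0]]

for x in ([1.0, 1.0], [2.0, 1.0], [-2.0, -1.0]):
    print(x, qda_allocate(x, mu, S1, mu, S2))
```

In this setup points near the origin go to the tighter first population and points far from it, in any direction, go to the second, which no linear rule could achieve.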

## Summary

- Discriminant analysis is a set of methods used to distinguish among groups in data and to allocate new observations to the existing groups.
- Given that data are from populations $\Pi_j$ with densities $f_j$, $j = 1, \ldots, J$, the maximum likelihood discriminant rule (ML rule) allocates an observation $x$ to that population $\Pi_j$ which has the maximum likelihood $L_j(x) = f_j(x) = \max_i f_i(x)$.
- Given prior probabilities $\pi_j$ for populations $\Pi_j$, the Bayes discriminant rule allocates an observation $x$ to the population $\Pi_j$ that maximizes $\pi_i f_i(x)$ with respect to $i$. All Bayes discriminant rules (incl. the ML rule) are admissible.
- For the ML rule and $J = 2$ normal populations, the probabilities of misclassification are given by $p_{12} = p_{21} = \Phi\left( -\frac{1}{2}\delta \right)$, where $\delta$ is the Mahalanobis distance between the two populations.
- Classification of two normal populations with different covariance matrices (ML rule) leads to regions defined by a quadratic function.
- Desirable discriminant rules have a low expected cost of misclassification (ECM).