Discriminant analysis is a set of methods and tools used to distinguish between groups of populations and to determine how to allocate new observations into groups. In one of our running examples we are interested in discriminating between counterfeit and genuine bank notes on the basis of measurements of these bank notes, see Table B.2. In this case we have two groups (counterfeit and genuine bank notes) and we would like to establish an algorithm (rule) that can allocate a new observation (a new bank note) into one of the groups.
Another example is the detection of ``fast'' and ``slow'' consumers of a newly introduced product. Using a consumer's characteristics like education, income, family size, and amount of previous brand switching, we want to classify each consumer into one of these two groups.
In poetry and literary studies the frequencies of spoken or written words and the lengths of sentences indicate profiles of different artists and writers. It can be of interest to attribute unknown literary or artistic works to certain writers with a specific profile. Anthropological measures on ancient skulls help in discriminating between male and female bodies. Good and poor credit risk ratings constitute a discrimination problem that might be tackled using observations on income, age, number of credit cards, family size, etc.
In general we have populations $\Pi_1, \Pi_2, \ldots, \Pi_J$ and we have to allocate an observation $x$ to one of these groups. A discriminant rule is a separation of the sample space (in general $\mathbb{R}^p$) into sets $R_1, \ldots, R_J$ such that if $x \in R_j$, it is identified as a member of population $\Pi_j$.
The main task of discriminant analysis is to find ``good'' regions $R_j$ such that the error of misclassification is small. In the following we describe such rules when the population distributions are known.
Denote the densities of each population $\Pi_j$ by $f_j(x)$. The maximum likelihood discriminant rule (ML rule) is given by allocating $x$ to $\Pi_j$ maximizing the likelihood $L_j(x) = f_j(x) = \max_i f_i(x)$.
If several $f_i$ give the same maximum then any of them may be selected.
Mathematically, the sets $R_j$ given by the ML discriminant rule are defined as
$$R_j = \{x : L_j(x) > L_i(x) \ \text{for} \ i = 1, \ldots, J, \ i \neq j\}. \qquad (12.1)$$
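The ML rule is straightforward to compute. A minimal sketch in Python; the two normal populations and their parameters are illustrative choices, not taken from the text:

```python
import numpy as np
from scipy.stats import norm

def ml_rule(x, densities):
    """Allocate x to the population whose density f_j(x) is largest (ML rule).
    Returns the population index j = 1, ..., J."""
    likelihoods = [f(x) for f in densities]
    return int(np.argmax(likelihoods)) + 1

# Two illustrative populations: Pi_1 = N(0, 1), Pi_2 = N(3, 1)
f1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
f2 = lambda x: norm.pdf(x, loc=3.0, scale=1.0)

print(ml_rule(0.5, [f1, f2]))  # 1: x lies closer to mu_1 = 0
print(ml_rule(2.9, [f1, f2]))  # 2: x lies closer to mu_2 = 3
```

With equal variances the boundary between $R_1$ and $R_2$ falls at the midpoint $1.5$ of the two means, which the two calls above illustrate from either side.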
By classifying the observation into a certain group we may encounter
a misclassification error.
For $J = 2$ groups the probability of
putting $x$ into group $2$ although it is from population $1$
can be calculated as
$$p_{21} = P(X \in R_2 \mid \Pi_1) = \int_{R_2} f_1(x)\,dx. \qquad (12.2)$$
Similarly, the conditional probability of classifying an object into $\Pi_1$ although it stems from $\Pi_2$ is
$$p_{12} = P(X \in R_1 \mid \Pi_2) = \int_{R_1} f_2(x)\,dx. \qquad (12.3)$$
A misclassified observation from $\Pi_j$ that is assigned to $R_i$ creates a cost $C(i \mid j)$, which can be summarized in a cost matrix:
                              Classified population
                              $\Pi_1$          $\Pi_2$
True             $\Pi_1$      $0$              $C(2 \mid 1)$
population       $\Pi_2$      $C(1 \mid 2)$    $0$
Let $\pi_j$ be the prior probability of population $\Pi_j$, where ``prior'' means the a priori probability that an individual selected at random belongs to $\Pi_j$ (i.e., before looking at the value $x$). Prior probabilities should be considered if it is clear ahead of time that an observation is more likely to stem from a certain population $\Pi_j$. An example is the classification of musical tunes. If it is known that during a certain period of time a majority of tunes were written by a certain composer, then there is a higher probability that a certain tune was composed by this composer. Therefore, this composer should receive a higher prior probability when tunes are assigned to a specific group.
The expected cost of misclassification (ECM)
is given by
$$\mathrm{ECM} = C(2 \mid 1)\, p_{21}\, \pi_1 + C(1 \mid 2)\, p_{12}\, \pi_2. \qquad (12.4)$$
The rule minimizing the ECM allocates $x$ to $\Pi_1$ if
$$\frac{f_1(x)}{f_2(x)} \geq \left(\frac{C(1 \mid 2)}{C(2 \mid 1)}\right) \left(\frac{\pi_2}{\pi_1}\right)$$
and to $\Pi_2$ otherwise. The ML discriminant rule is thus a special case of the ECM rule for equal misclassification costs and equal prior probabilities. For simplicity the unity cost case, $C(1 \mid 2) = C(2 \mid 1) = 1$, and equal prior probabilities, $\pi_1 = \pi_2$, are assumed in the following.
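The minimum-ECM rule can be sketched as follows; the densities, costs, and priors below are made-up illustrations, not values from the text:

```python
import numpy as np
from scipy.stats import norm

def ecm_rule(x, f1, f2, c12, c21, pi1, pi2):
    """Minimum-ECM rule: allocate x to Pi_1 iff
    f1(x)/f2(x) >= (C(1|2)/C(2|1)) * (pi2/pi1), else to Pi_2."""
    threshold = (c12 / c21) * (pi2 / pi1)
    return 1 if f1(x) / f2(x) >= threshold else 2

# Illustrative populations: Pi_1 = N(0, 1), Pi_2 = N(3, 1)
f1 = lambda x: norm.pdf(x, 0.0, 1.0)
f2 = lambda x: norm.pdf(x, 3.0, 1.0)

# Equal costs and priors: the ECM rule coincides with the ML rule
# (boundary at the midpoint 1.5 of the two means).
print(ecm_rule(1.4, f1, f2, c12=1, c21=1, pi1=0.5, pi2=0.5))   # 1
# Misclassifying Pi_2 as Pi_1 is expensive (large C(1|2)): R_1 shrinks,
# and the same x is now allocated to Pi_2.
print(ecm_rule(1.4, f1, f2, c12=10, c21=1, pi1=0.5, pi2=0.5))  # 2
```

The second call shows how an asymmetric cost structure moves the allocation boundary away from the population that is costly to contaminate.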
Theorem 12.1 will be illustrated by an example from credit scoring.
The situation simplifies in the case of equal covariance matrices,
$\Sigma_1 = \cdots = \Sigma_J = \Sigma$. The discriminant rule (12.5) then reduces (for
$\Sigma_j = \Sigma$, $j = 1, \ldots, J$) to: allocate $x$ to $\Pi_j$, where $j$ minimizes the squared Mahalanobis distance
$$\delta^2(x, \mu_j) = (x - \mu_j)^\top \Sigma^{-1} (x - \mu_j). \qquad (12.6)$$
Theorem 12.2 shows that the ML discriminant rule for multinormal observations is intimately connected with the Mahalanobis distance. The discriminant rule is based on linear combinations and belongs to the family of Linear Discriminant Analysis (LDA) methods.
PROOF:
Part (a) of the Theorem follows directly from comparison of the likelihoods: with a common $\Sigma$, maximizing $f_j(x)$ over $j$ is equivalent to minimizing the exponent $\delta^2(x, \mu_j)$.
For $J = 2$, part (a) says that $x$ is allocated to $\Pi_1$ if $\delta^2(x, \mu_1) \leq \delta^2(x, \mu_2)$. Expanding both quadratic forms, the common term $x^\top \Sigma^{-1} x$ cancels, leaving
$$2(\mu_1 - \mu_2)^\top \Sigma^{-1} x \geq (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 + \mu_2),$$
i.e.,
$$(\mu_1 - \mu_2)^\top \Sigma^{-1} \left\{ x - \tfrac{1}{2}(\mu_1 + \mu_2) \right\} \geq 0,$$
which is $\alpha^\top (x - \mu) \geq 0$ with $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \tfrac{1}{2}(\mu_1 + \mu_2)$.
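The equivalence of the Mahalanobis formulation (part (a)) and the linear rule $\alpha^\top(x - \mu) \geq 0$ (part (b)) can be verified numerically; the means and covariance matrix below are arbitrary illustrative choices:

```python
import numpy as np

# Illustrative parameters for two normal populations with common covariance
mu1 = np.array([0.0, 0.0])
mu2 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)

def mahalanobis_rule(x):
    """Part (a): allocate to the population with smaller delta^2(x, mu_j)."""
    d1 = (x - mu1) @ Sinv @ (x - mu1)
    d2 = (x - mu2) @ Sinv @ (x - mu2)
    return 1 if d1 <= d2 else 2

alpha = Sinv @ (mu1 - mu2)
mu = 0.5 * (mu1 + mu2)

def linear_rule(x):
    """Part (b): allocate to Pi_1 iff alpha^T (x - mu) >= 0."""
    return 1 if alpha @ (x - mu) >= 0 else 2

# Both formulations agree on a cloud of random test points
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
agree = all(mahalanobis_rule(x) == linear_rule(x) for x in X)
print(agree)  # True
```

This is the linearity alluded to above: the allocation depends on $x$ only through the linear score $\alpha^\top(x - \mu)$, which is why the rule belongs to the LDA family.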
We have seen an example where prior knowledge of the probability of classification into $\Pi_j$ was assumed. Denote the prior probabilities by $\pi_j$ and note that $\sum_{j=1}^{J} \pi_j = 1$. The Bayes rule of discrimination allocates $x$ to the $\Pi_j$ that gives the largest value of $\pi_j f_j(x)$, $j = 1, \ldots, J$. Hence, the discriminant rule is defined by $R_j = \{x : \pi_j f_j(x) \geq \pi_i f_i(x) \ \text{for} \ i = 1, \ldots, J\}$. Obviously the Bayes rule is identical to the ML discriminant rule for $\pi_j = 1/J$.
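A short sketch of the Bayes rule, again with illustrative normal densities: with equal priors it reproduces the ML allocation, while a strong prior on $\Pi_1$ enlarges $R_1$:

```python
import numpy as np
from scipy.stats import norm

def bayes_rule(x, densities, priors):
    """Allocate x to the Pi_j maximizing pi_j * f_j(x)."""
    scores = [p * f(x) for f, p in zip(densities, priors)]
    return int(np.argmax(scores)) + 1

# Illustrative populations: Pi_1 = N(0, 1), Pi_2 = N(3, 1)
densities = [lambda x: norm.pdf(x, 0.0, 1.0),
             lambda x: norm.pdf(x, 3.0, 1.0)]

# Equal priors: Bayes rule = ML rule, so x = 1.7 (past the midpoint 1.5)
# goes to Pi_2.
print(bayes_rule(1.7, densities, [0.5, 0.5]))  # 2
# A strong prior on Pi_1 shifts the boundary towards mu_2: the same x
# is now allocated to Pi_1.
print(bayes_rule(1.7, densities, [0.9, 0.1]))  # 1
```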
A further modification is to allocate $x$ to $\Pi_j$ with a
certain probability $\phi_j(x)$, such that $\sum_{j=1}^{J} \phi_j(x) = 1$
for all $x$.
This is called a randomized discriminant rule.
A randomized discriminant rule is a generalization of
deterministic discriminant rules since
$$\phi_j(x) = \begin{cases} 1 & \text{if } x \in R_j, \\ 0 & \text{otherwise} \end{cases}$$
reflects the deterministic rules.
Which discriminant rules are good?
We need a measure of comparison. Denote by
$$p_{ij} = P(x \in R_i \mid \Pi_j)$$
the probability of allocating $x$ to $\Pi_i$ when it in fact belongs to $\Pi_j$. Suppose that
$\Pi_1 = N_p(\mu_1, \Sigma)$ and $\Pi_2 = N_p(\mu_2, \Sigma)$.
In the case of two groups, it is not difficult to derive the probabilities of
misclassification for the ML discriminant rule. Consider for instance
$p_{12} = P(x \in R_1 \mid \Pi_2)$. By part (b) in Theorem 12.2
we have
$$p_{12} = P\{\alpha^\top (x - \mu) \geq 0 \mid \Pi_2\}.$$
If $X \sim \Pi_2$, then $\alpha^\top(X - \mu) \sim N(-\tfrac{1}{2}\delta^2, \delta^2)$, where $\delta^2 = (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)$ is the squared Mahalanobis distance between the two populations, so that
$$p_{12} = \Phi\left(-\tfrac{1}{2}\delta\right).$$
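Under these normality assumptions, the closed form $p_{12} = \Phi(-\delta/2)$ can be checked by Monte Carlo simulation; the means and covariance below are illustrative parameters:

```python
import numpy as np
from scipy.stats import norm

# Illustrative setup: Pi_1 = N(mu1, I), Pi_2 = N(mu2, I) in R^2
rng = np.random.default_rng(42)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
Sigma = np.eye(2)
Sinv = np.linalg.inv(Sigma)

alpha = Sinv @ (mu1 - mu2)
mu = 0.5 * (mu1 + mu2)
delta = np.sqrt((mu1 - mu2) @ Sinv @ (mu1 - mu2))  # Mahalanobis distance

# Draw from Pi_2 and count how often the ML rule (wrongly) puts x into R_1
X2 = rng.multivariate_normal(mu2, Sigma, size=200_000)
p12_mc = np.mean((X2 - mu) @ alpha >= 0)

print(round(norm.cdf(-delta / 2), 3))          # theoretical p_12 = Phi(-delta/2)
print(abs(p12_mc - norm.cdf(-delta / 2)) < 0.01)  # True
```

Here $\delta = 2$, so the theoretical misclassification probability is $\Phi(-1) \approx 0.159$, and the Monte Carlo frequency matches it to within sampling error.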
The minimal misclassification error thus depends on the ratio of the densities $f_1(x)/f_2(x)$ or, equivalently, on the difference $\ln\{f_1(x)\} - \ln\{f_2(x)\}$. When the covariance matrices of the two density functions differ, the allocation rule becomes more complicated: