11.2 The Proximity between Objects

The starting point of a cluster analysis is a data matrix $\mathcal{X}(n \times p)$ with $n$ measurements (objects) of $p$ variables. The proximity (similarity) among objects is described by a matrix $\mathcal{D}(n \times n)$.
The matrix $\mathcal{D}$ contains measures of similarity or dissimilarity among the $n$ objects. If the values $d_{ij}$ are distances, then they measure dissimilarity. The greater the distance, the less similar are the objects. If the values $d_{ij}$ are proximity measures, then the opposite is true, i.e., the greater the proximity value, the more similar are the objects. A distance matrix, for example, could be defined by the $L_2$-norm: $d_{ij} = \|x_i - x_j\|_2$, where $x_i$ and $x_j$ denote the rows of the data matrix $\mathcal{X}$.
Distance and similarity are, of course, dual. If $d_{ij}$ is a distance, then $d'_{ij} = \max_{i,j}\{d_{ij}\} - d_{ij}$ is a proximity measure.
The nature of the observations plays an important role in the choice
of proximity measure. Nominal values (like binary variables) lead in
general to proximity values, whereas metric values lead (in general)
to distance matrices. We first present possibilities for $\mathcal{D}$ in the binary case and then consider the continuous case.
Similarity of objects with binary structure

In order to measure the similarity between objects we always compare pairs of observations $(x_i, x_j)$, where $x_i^\top = (x_{i1}, \ldots, x_{ip})$, $x_j^\top = (x_{j1}, \ldots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$. Obviously there are four cases:
$$x_{ik} = x_{jk} = 1, \qquad x_{ik} = 0,\ x_{jk} = 1, \qquad x_{ik} = 1,\ x_{jk} = 0, \qquad x_{ik} = x_{jk} = 0.$$
Counting the occurrences of these four cases over the $p$ variables, define
$$a_1 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 1), \qquad a_2 = \sum_{k=1}^p I(x_{ik} = 0,\, x_{jk} = 1),$$
$$a_3 = \sum_{k=1}^p I(x_{ik} = 1,\, x_{jk} = 0), \qquad a_4 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 0).$$
Note that each $a_l$, $l = 1, \ldots, 4$, depends on the pair $(x_i, x_j)$.

The following proximity measures are used in practice:
$$d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda(a_2 + a_3)},$$
where $\delta$ and $\lambda$ are weighting factors.
Table 11.1 shows some similarity measures for given weighting factors.

Table 11.1: The common similarity coefficients.

| Name                | $\delta$ | $\lambda$ | Definition                                   |
|---------------------|----------|-----------|----------------------------------------------|
| Jaccard             | 0        | 1         | $\frac{a_1}{a_1 + a_2 + a_3}$                |
| Tanimoto            | 1        | 2         | $\frac{a_1 + a_4}{a_1 + 2(a_2 + a_3) + a_4}$ |
| Simple Matching (M) | 1        | 1         | $\frac{a_1 + a_4}{p}$                        |
| Russel and Rao (RR) | –        | –         | $\frac{a_1}{p}$                              |
These measures provide alternative ways of weighting mismatchings and
positive (presence of a common character) or negative (absence of a
common character) matchings.
In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations $0$ and $1$ in the same way. If $x_{ik} = 1$ denotes, say, knowledge of a certain language, then the contrary, $x_{ik} = 0$ (not knowing the language), should eventually be treated differently.
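The coefficients of Table 11.1 are easy to compute directly from the counts $a_1, \ldots, a_4$. A minimal sketch (the helper names and the two example vectors are illustrative, not from the text):

```python
# Binary similarity coefficients built from the four agreement/disagreement
# counts a1..a4 of a pair of 0/1 vectors.

def binary_counts(x, y):
    """Return (a1, a2, a3, a4) for two equal-length 0/1 sequences."""
    a1 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    a2 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    a3 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    a4 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a1, a2, a3, a4

def similarity(x, y, delta, lam):
    """General coefficient: (a1 + d*a4) / (a1 + d*a4 + l*(a2 + a3))."""
    a1, a2, a3, a4 = binary_counts(x, y)
    return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))

# Illustrative pair of binary observations (p = 5 variables).
x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
jaccard  = similarity(x, y, delta=0, lam=1)   # a1/(a1+a2+a3)
tanimoto = similarity(x, y, delta=1, lam=2)   # (a1+a4)/(a1+2(a2+a3)+a4)
matching = similarity(x, y, delta=1, lam=1)   # (a1+a4)/p
```

With $a_1 = 2$, $a_2 = a_3 = a_4 = 1$ here, the Jaccard, Tanimoto, and Simple Matching values are $1/2$, $3/7$, and $3/5$, respectively.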
Let us consider binary variables computed from the car data set (Table ). We define the new binary data by $y_{ik} = I(x_{ik} > \bar{x}_k)$ for $i = 1, \ldots, n$ and $k = 1, \ldots, p$. This means that we transform the observations of the $k$-th variable to $1$ if it is larger than the mean value of all observations of the $k$-th variable. Let us only consider the data points 17 to 19 (Renault 19, Rover and Toyota Corolla), which lead to $(3 \times 3)$ similarity matrices. The Jaccard, Tanimoto, and Simple Matching measures each yield a different similarity matrix for these three cars.
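The binarization step and the resulting similarity matrix can be sketched as follows (the data values are made up for illustration, not the actual car data):

```python
# Binarize a data matrix against its column means, then build a Jaccard
# similarity matrix over the observations.

def binarize(data):
    """Set y_ik = 1 if x_ik exceeds the mean of variable k, else 0."""
    n = len(data)
    means = [sum(row[k] for row in data) / n for k in range(len(data[0]))]
    return [[1 if row[k] > means[k] else 0 for k in range(len(row))]
            for row in data]

def jaccard(x, y):
    """a1 / (a1 + a2 + a3) for two 0/1 vectors."""
    a1 = sum(1 for xi, yi in zip(x, y) if xi == yi == 1)
    mismatch = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return a1 / (a1 + mismatch)

# Three illustrative observations of three variables.
data = [[2.0, 9.0, 1.0],
        [4.0, 3.0, 5.0],
        [9.0, 6.0, 6.0]]
binary = binarize(data)
S = [[jaccard(bi, bj) for bj in binary] for bi in binary]  # similarity matrix
```

The diagonal of $S$ is $1$ by construction, and off-diagonal entries shrink as the binary profiles disagree.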
Distance measures for continuous variables

A wide variety of distance measures can be generated by the $L_r$-norms, $r \geq 1$:
$$d_{ij} = \|x_i - x_j\|_r = \left\{\sum_{k=1}^p |x_{ik} - x_{jk}|^r\right\}^{1/r}. \tag{11.3}$$
Here $x_{ik}$ denotes the value of the $k$-th variable on object $i$. It is clear that $d_{ii} = 0$ for $i = 1, \ldots, n$. The class of distances (11.3) for varying $r$ measures the dissimilarity with different weights. The $L_1$-metric, for example, gives less weight to outliers than the $L_2$-norm (Euclidean norm). It is common to consider the squared $L_2$-norm.
Suppose we have $x_1 = (0, 0)$, $x_2 = (1, 0)$ and $x_3 = (5, 5)$. Then the distance matrix for the $L_1$-norm is
$$\mathcal{D}_1 = \begin{pmatrix} 0 & 1 & 10 \\ 1 & 0 & 9 \\ 10 & 9 & 0 \end{pmatrix},$$
and for the squared $L_2$- or Euclidean norm
$$\mathcal{D}_2 = \begin{pmatrix} 0 & 1 & 50 \\ 1 & 0 & 41 \\ 50 & 41 & 0 \end{pmatrix}.$$
One can see that the third observation $x_3$ receives much more weight in the squared $L_2$-norm than in the $L_1$-norm.
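The contrast between the two norms is easy to check numerically. A minimal sketch with three illustrative points:

```python
# L1 and squared L2 distance matrices for a small point set.

points = [(0, 0), (1, 0), (5, 5)]

def l1(x, y):
    """L1 (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sq_l2(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

D1 = [[l1(x, y) for y in points] for x in points]
D2 = [[sq_l2(x, y) for y in points] for x in points]
```

The entries involving the outlying third point grow much faster under the squared $L_2$-norm ($50$ and $41$) than under the $L_1$-norm ($10$ and $9$).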
An underlying assumption in applying distances based on $L_r$-norms is that the variables are measured on the same scale. If this is not the case, a standardization should first be applied. This corresponds to using a more general $L_2$- or Euclidean norm with a metric $\mathcal{A}$, where $\mathcal{A} > 0$ (see Section 2.6):
$$d_{ij}^2 = \|x_i - x_j\|_{\mathcal{A}} = (x_i - x_j)^\top \mathcal{A}\, (x_i - x_j). \tag{11.4}$$
$L_2$-norms are given by $\mathcal{A} = \mathcal{I}_p$, but if a standardization is desired, then the weight matrix $\mathcal{A} = \operatorname{diag}(s_{X_1 X_1}^{-1}, \ldots, s_{X_p X_p}^{-1})$ is suitable. Recall that $s_{X_k X_k}$ is the variance of the $k$-th component. Hence we have
$$d_{ij}^2 = \sum_{k=1}^p \frac{(x_{ik} - x_{jk})^2}{s_{X_k X_k}}.$$
Here each component has the same weight in the computation of the distances and the distances do not depend on a particular choice of the units of measure.
Consider the French Food expenditures (Table B.6). The Euclidean distance matrix (squared $L_2$-norm) can be computed directly from the raw data. Taking instead the weight matrix $\mathcal{A} = \operatorname{diag}(s_{X_1 X_1}^{-1}, \ldots, s_{X_p X_p}^{-1})$, we obtain the standardized distance matrix (squared $L_2$-norm with metric $\mathcal{A}$), which no longer depends on the units of measure.
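The standardized squared Euclidean distance can be sketched as follows (the data are made up to show the effect of very different scales; the variance uses divisor $n$, an assumption about the empirical variance convention):

```python
# Squared Euclidean distance with each coordinate difference divided by
# that variable's variance, i.e. A = diag(1/var_1, ..., 1/var_p).

def variances(data):
    """Column variances (divisor n, empirical variance)."""
    n = len(data)
    means = [sum(row[k] for row in data) / n for k in range(len(data[0]))]
    return [sum((row[k] - means[k]) ** 2 for row in data) / n
            for k in range(len(means))]

def std_sq_dist(x, y, var):
    """Standardized squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, var))

# Two variables on very different scales.
data = [[1.0, 100.0],
        [2.0, 300.0],
        [3.0, 200.0]]
var = variances(data)
D = [[std_sq_dist(x, y, var) for y in data] for x in data]
```

After standardization, the second variable no longer dominates the distances despite its hundredfold larger scale.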
When applied to contingency tables, a $\chi^2$-metric is suitable to compare (and cluster) rows and columns of a contingency table. If $\mathcal{X}$ is a contingency table, row $i$ is characterized by the conditional frequency distribution $\frac{x_{ij}}{x_{i\bullet}}$, where $x_{i\bullet} = \sum_{j=1}^p x_{ij}$ indicates the marginal distributions over the rows: $\frac{x_{i\bullet}}{x_{\bullet\bullet}}$, with $x_{\bullet\bullet} = \sum_{i=1}^n x_{i\bullet}$. Similarly, column $j$ of $\mathcal{X}$ is characterized by the conditional frequencies $\frac{x_{ij}}{x_{\bullet j}}$, where $x_{\bullet j} = \sum_{i=1}^n x_{ij}$. The marginal frequencies of the columns are $\frac{x_{\bullet j}}{x_{\bullet\bullet}}$.
The distance between two rows, $i_1$ and $i_2$, corresponds to the distance between their respective frequency distributions. It is common to define this distance using the $\chi^2$-metric:
$$d^2(i_1, i_2) = \sum_{j=1}^p \frac{1}{\left(\frac{x_{\bullet j}}{x_{\bullet\bullet}}\right)} \left(\frac{x_{i_1 j}}{x_{i_1 \bullet}} - \frac{x_{i_2 j}}{x_{i_2 \bullet}}\right)^2.$$
Note that this can be expressed as a distance between the vectors $x_1 = \left(\frac{x_{i_1 j}}{x_{i_1 \bullet}}\right)_{j=1,\ldots,p}$ and $x_2 = \left(\frac{x_{i_2 j}}{x_{i_2 \bullet}}\right)_{j=1,\ldots,p}$ as in (11.4) with weighting matrix $\mathcal{A} = \operatorname{diag}\left(\frac{x_{\bullet\bullet}}{x_{\bullet j}}\right)$.
Similarly, if we are interested in clusters among the columns, we can define the distance between two columns, $j_1$ and $j_2$, as
$$d^2(j_1, j_2) = \sum_{i=1}^n \frac{1}{\left(\frac{x_{i \bullet}}{x_{\bullet\bullet}}\right)} \left(\frac{x_{i j_1}}{x_{\bullet j_1}} - \frac{x_{i j_2}}{x_{\bullet j_2}}\right)^2.$$
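The $\chi^2$-distance between two rows can be sketched directly from the definition (the small contingency table is illustrative; its counts are made up):

```python
# Chi-square distance between rows i1 and i2 of a contingency table:
# squared differences of row profiles, weighted by inverse column
# marginal frequencies.

def chi2_row_dist(table, i1, i2):
    total = sum(sum(row) for row in table)                 # grand total
    col = [sum(row[j] for row in table)                    # column totals
           for j in range(len(table[0]))]
    r1, r2 = sum(table[i1]), sum(table[i2])                # row totals
    return sum((total / col[j])
               * (table[i1][j] / r1 - table[i2][j] / r2) ** 2
               for j in range(len(col)))

# Rows 0 and 1 have identical profiles (1/3, 2/3), so their distance is 0.
table = [[10, 20],
         [30, 60],
         [20, 10]]
```

Note that only the row profiles matter: rows 0 and 1 differ in size but not in distribution, so their $\chi^2$-distance vanishes.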
Apart from the Euclidean and the $L_r$-norm measures one can use a proximity measure such as the Q-correlation coefficient
$$d_{ij} = \frac{\sum_{k=1}^p (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left\{\sum_{k=1}^p (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^p (x_{jk} - \bar{x}_j)^2\right\}^{1/2}}.$$
Here $\bar{x}_i$ denotes the mean over the variables $(x_{i1}, \ldots, x_{ip})$.
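The Q-correlation coefficient is simply a Pearson correlation computed across the variables of a pair of observations. A minimal sketch with illustrative vectors:

```python
# Q-correlation between two observations: correlation of their values
# across the p variables, each centered by its own row mean.

def q_corr(x, y):
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # same profile on a different scale
```

As a proximity measure it captures the shape of a profile rather than its level: $y$ is twice $x$, yet their Q-correlation is exactly $1$.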
Summary

- The proximity between data points is measured by a distance or similarity matrix $\mathcal{D}$ whose components $d_{ij}$ give the similarity coefficient or the distance between two points $x_i$ and $x_j$.
- A variety of similarity (distance) measures exist for binary data (e.g., Jaccard, Tanimoto, Simple Matching coefficients) and for continuous data (e.g., $L_r$-norms).
- The nature of the data could impose the choice of a particular metric $\mathcal{A}$ in defining the distances (standardization, $\chi^2$-metric in the case of contingency tables).