11.2 The Proximity between Objects

The starting point of a cluster analysis is a data matrix $\mathcal{X}(n \times p)$ with $n$ measurements (objects) of $p$ variables. The proximity (similarity) among objects is described by a matrix $\mathcal{D}(n \times n)$.
The matrix $\mathcal{D}$ contains measures of similarity or dissimilarity among the $n$ objects. If the values $d_{ij}$ are distances, then they measure dissimilarity. The greater the distance, the less similar are the objects. If the values $d_{ij}$ are proximity measures, then the opposite is true, i.e., the greater the proximity value, the more similar are the objects. A distance matrix, for example, could be defined by the $L_2$-norm: $d_{ij} = \|x_i - x_j\|_2$, where $x_i$ and $x_j$ denote the rows of the data matrix $\mathcal{X}$.
Distance and similarity are, of course, dual. If $d_{ij}$ is a distance, then $d'_{ij} = \max_{i,j}\{d_{ij}\} - d_{ij}$ is a proximity measure.
The nature of the observations plays an important role in the choice
of proximity measure. Nominal values (like binary variables) lead in
general to proximity values, whereas metric values lead (in general)
to distance matrices. We first present possibilities for $\mathcal{D}$ in the binary case and then consider the continuous case.
Similarity of objects with binary structure

In order to measure the similarity between objects we always compare pairs of observations $(x_i, x_j)$, where $x_i^\top = (x_{i1}, \ldots, x_{ip})$, $x_j^\top = (x_{j1}, \ldots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$. Obviously there are four cases:
$$x_{ik} = x_{jk} = 1, \qquad x_{ik} = 0,\ x_{jk} = 1, \qquad x_{ik} = 1,\ x_{jk} = 0, \qquad x_{ik} = x_{jk} = 0.$$
Counting the occurrences of these four cases over the $p$ variables, define
$$a_1 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 1), \qquad a_2 = \sum_{k=1}^p I(x_{ik} = 0,\, x_{jk} = 1),$$
$$a_3 = \sum_{k=1}^p I(x_{ik} = 1,\, x_{jk} = 0), \qquad a_4 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 0).$$
Note that each $a_l$, $l = 1, \ldots, 4$, depends on the pair $(x_i, x_j)$.

The following proximity measures are used in practice:
$$d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda(a_2 + a_3)},$$
where $\delta$ and $\lambda$ are weighting factors.
Table 11.1 shows some similarity measures for given weighting factors.

Table 11.1: The common similarity coefficients.

| Name                | $\delta$ | $\lambda$ | Definition                                   |
|---------------------|----------|-----------|----------------------------------------------|
| Jaccard             | 0        | 1         | $\frac{a_1}{a_1 + a_2 + a_3}$                |
| Tanimoto            | 1        | 2         | $\frac{a_1 + a_4}{a_1 + 2(a_2 + a_3) + a_4}$ |
| Simple Matching (M) | 1        | 1         | $\frac{a_1 + a_4}{p}$                        |
| Russel and Rao (RR) | –        | –         | $\frac{a_1}{p}$                              |
These measures provide alternative ways of weighting mismatchings and
positive (presence of a common character) or negative (absence of a
common character) matchings.
In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations $0$ and $1$ in the same way. If $x_{ik} = 1$ denotes, say, knowledge of a certain language, then the contrary, $x_{ik} = 0$ (not knowing the language), should eventually be treated differently.
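The coefficients of Table 11.1 are easy to compute directly from the counts $a_1, \ldots, a_4$. A minimal sketch (the helper names and the two example vectors are illustrative, not from the text):

```python
# Binary similarity coefficients built from the four agreement/disagreement
# counts a1..a4 of a pair of 0/1 vectors.

def binary_counts(x, y):
    """Return (a1, a2, a3, a4) for two equal-length 0/1 sequences."""
    a1 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    a2 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    a3 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    a4 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a1, a2, a3, a4

def similarity(x, y, delta, lam):
    """General coefficient: (a1 + d*a4) / (a1 + d*a4 + l*(a2 + a3))."""
    a1, a2, a3, a4 = binary_counts(x, y)
    return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))

# Illustrative pair of binary observations (p = 5 variables).
x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
jaccard  = similarity(x, y, delta=0, lam=1)   # a1/(a1+a2+a3)
tanimoto = similarity(x, y, delta=1, lam=2)   # (a1+a4)/(a1+2(a2+a3)+a4)
matching = similarity(x, y, delta=1, lam=1)   # (a1+a4)/p
```

With $a_1 = 2$, $a_2 = a_3 = a_4 = 1$ here, the Jaccard, Tanimoto, and Simple Matching values are $1/2$, $3/7$, and $3/5$, respectively.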
Let us consider binary variables computed from the car data set (Table ). We define the new binary data by $y_{ik} = I(x_{ik} > \bar{x}_k)$ for $i = 1, \ldots, n$ and $k = 1, \ldots, p$. This means that we transform the observations of the $k$-th variable to $1$ if it is larger than the mean value of all observations of the $k$-th variable. Let us only consider the data points 17 to 19 (Renault 19, Rover and Toyota Corolla), which lead to $(3 \times 3)$ similarity matrices. The Jaccard, Tanimoto, and Simple Matching measures each yield a different similarity matrix for these three cars.
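The binarization step and the resulting similarity matrix can be sketched as follows (the data values are made up for illustration, not the actual car data):

```python
# Binarize a data matrix against its column means, then build a Jaccard
# similarity matrix over the observations.

def binarize(data):
    """Set y_ik = 1 if x_ik exceeds the mean of variable k, else 0."""
    n = len(data)
    means = [sum(row[k] for row in data) / n for k in range(len(data[0]))]
    return [[1 if row[k] > means[k] else 0 for k in range(len(row))]
            for row in data]

def jaccard(x, y):
    """a1 / (a1 + a2 + a3) for two 0/1 vectors."""
    a1 = sum(1 for xi, yi in zip(x, y) if xi == yi == 1)
    mismatch = sum(1 for xi, yi in zip(x, y) if xi != yi)
    return a1 / (a1 + mismatch)

# Three illustrative observations of three variables.
data = [[2.0, 9.0, 1.0],
        [4.0, 3.0, 5.0],
        [9.0, 6.0, 6.0]]
binary = binarize(data)
S = [[jaccard(bi, bj) for bj in binary] for bi in binary]  # similarity matrix
```

The diagonal of $S$ is $1$ by construction, and off-diagonal entries shrink as the binary profiles disagree.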
Distance measures for continuous variables

A wide variety of distance measures can be generated by the $L_r$-norms, $r \geq 1$:
$$d_{ij} = \|x_i - x_j\|_r = \left\{\sum_{k=1}^p |x_{ik} - x_{jk}|^r\right\}^{1/r}. \tag{11.3}$$
Here $x_{ik}$ denotes the value of the $k$-th variable on object $i$. It is clear that $d_{ii} = 0$ for $i = 1, \ldots, n$. The class of distances (11.3) for varying $r$ measures the dissimilarity with different weights. The $L_1$-metric, for example, gives less weight to outliers than the $L_2$-norm (Euclidean norm). It is common to consider the squared $L_2$-norm.
Suppose we have $x_1 = (0, 0)$, $x_2 = (1, 0)$ and $x_3 = (5, 5)$. Then the distance matrix for the $L_1$-norm is
$$\mathcal{D}_1 = \begin{pmatrix} 0 & 1 & 10 \\ 1 & 0 & 9 \\ 10 & 9 & 0 \end{pmatrix},$$
and for the squared $L_2$- or Euclidean norm
$$\mathcal{D}_2 = \begin{pmatrix} 0 & 1 & 50 \\ 1 & 0 & 41 \\ 50 & 41 & 0 \end{pmatrix}.$$
One can see that the third observation $x_3$ receives much more weight in the squared $L_2$-norm than in the $L_1$-norm.
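The contrast between the two norms is easy to check numerically. A minimal sketch with three illustrative points:

```python
# L1 and squared L2 distance matrices for a small point set.

points = [(0, 0), (1, 0), (5, 5)]

def l1(x, y):
    """L1 (city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sq_l2(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

D1 = [[l1(x, y) for y in points] for x in points]
D2 = [[sq_l2(x, y) for y in points] for x in points]
```

The entries involving the outlying third point grow much faster under the squared $L_2$-norm ($50$ and $41$) than under the $L_1$-norm ($10$ and $9$).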
An underlying assumption in applying distances based on $L_r$-norms is that the variables are measured on the same scale. If this is not the case, a standardization should first be applied. This corresponds to using a more general $L_2$- or Euclidean norm with a metric $\mathcal{A}$, where $\mathcal{A} > 0$ (see Section 2.6):
$$d_{ij}^2 = \|x_i - x_j\|_{\mathcal{A}} = (x_i - x_j)^\top \mathcal{A}\, (x_i - x_j). \tag{11.4}$$
$L_2$-norms are given by $\mathcal{A} = \mathcal{I}_p$, but if a standardization is desired, then the weight matrix $\mathcal{A} = \operatorname{diag}(s_{X_1 X_1}^{-1}, \ldots, s_{X_p X_p}^{-1})$ is suitable. Recall that $s_{X_k X_k}$ is the variance of the $k$-th component. Hence we have
$$d_{ij}^2 = \sum_{k=1}^p \frac{(x_{ik} - x_{jk})^2}{s_{X_k X_k}}.$$
Here each component has the same weight in the computation of the distances and the distances do not depend on a particular choice of the units of measure.
Consider the French Food expenditures (Table B.6). The Euclidean distance matrix (squared $L_2$-norm) can be computed directly from the raw data. Taking instead the weight matrix $\mathcal{A} = \operatorname{diag}(s_{X_1 X_1}^{-1}, \ldots, s_{X_p X_p}^{-1})$, we obtain the standardized distance matrix (squared $L_2$-norm with metric $\mathcal{A}$), which no longer depends on the units of measure.
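The standardized squared Euclidean distance can be sketched as follows (the data are made up to show the effect of very different scales; the variance uses divisor $n$, an assumption about the empirical variance convention):

```python
# Squared Euclidean distance with each coordinate difference divided by
# that variable's variance, i.e. A = diag(1/var_1, ..., 1/var_p).

def variances(data):
    """Column variances (divisor n, empirical variance)."""
    n = len(data)
    means = [sum(row[k] for row in data) / n for k in range(len(data[0]))]
    return [sum((row[k] - means[k]) ** 2 for row in data) / n
            for k in range(len(means))]

def std_sq_dist(x, y, var):
    """Standardized squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, var))

# Two variables on very different scales.
data = [[1.0, 100.0],
        [2.0, 300.0],
        [3.0, 200.0]]
var = variances(data)
D = [[std_sq_dist(x, y, var) for y in data] for x in data]
```

After standardization, the second variable no longer dominates the distances despite its hundredfold larger scale.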
When applied to contingency tables, a $\chi^2$-metric is suitable to compare (and cluster) rows and columns of a contingency table. If $\mathcal{X}$ is a contingency table, row $i$ is characterized by the conditional frequency distribution $\frac{x_{ij}}{x_{i\bullet}}$, where $x_{i\bullet} = \sum_{j=1}^p x_{ij}$ indicates the marginal distributions over the rows: $\frac{x_{i\bullet}}{x_{\bullet\bullet}}$, with $x_{\bullet\bullet} = \sum_{i=1}^n x_{i\bullet}$. Similarly, column $j$ of $\mathcal{X}$ is characterized by the conditional frequencies $\frac{x_{ij}}{x_{\bullet j}}$, where $x_{\bullet j} = \sum_{i=1}^n x_{ij}$. The marginal frequencies of the columns are $\frac{x_{\bullet j}}{x_{\bullet\bullet}}$.
The distance between two rows, $i_1$ and $i_2$, corresponds to the distance between their respective frequency distributions. It is common to define this distance using the $\chi^2$-metric:
$$d^2(i_1, i_2) = \sum_{j=1}^p \frac{1}{\left(\frac{x_{\bullet j}}{x_{\bullet\bullet}}\right)} \left(\frac{x_{i_1 j}}{x_{i_1 \bullet}} - \frac{x_{i_2 j}}{x_{i_2 \bullet}}\right)^2.$$
Note that this can be expressed as a distance between the vectors $x_1 = \left(\frac{x_{i_1 j}}{x_{i_1 \bullet}}\right)_{j=1,\ldots,p}$ and $x_2 = \left(\frac{x_{i_2 j}}{x_{i_2 \bullet}}\right)_{j=1,\ldots,p}$ as in (11.4) with weighting matrix $\mathcal{A} = \operatorname{diag}\left(\frac{x_{\bullet\bullet}}{x_{\bullet j}}\right)$.
Similarly, if we are interested in clusters among the columns, we can define the distance between two columns, $j_1$ and $j_2$, as
$$d^2(j_1, j_2) = \sum_{i=1}^n \frac{1}{\left(\frac{x_{i \bullet}}{x_{\bullet\bullet}}\right)} \left(\frac{x_{i j_1}}{x_{\bullet j_1}} - \frac{x_{i j_2}}{x_{\bullet j_2}}\right)^2.$$
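The $\chi^2$-distance between two rows can be sketched directly from the definition (the small contingency table is illustrative; its counts are made up):

```python
# Chi-square distance between rows i1 and i2 of a contingency table:
# squared differences of row profiles, weighted by inverse column
# marginal frequencies.

def chi2_row_dist(table, i1, i2):
    total = sum(sum(row) for row in table)                 # grand total
    col = [sum(row[j] for row in table)                    # column totals
           for j in range(len(table[0]))]
    r1, r2 = sum(table[i1]), sum(table[i2])                # row totals
    return sum((total / col[j])
               * (table[i1][j] / r1 - table[i2][j] / r2) ** 2
               for j in range(len(col)))

# Rows 0 and 1 have identical profiles (1/3, 2/3), so their distance is 0.
table = [[10, 20],
         [30, 60],
         [20, 10]]
```

Note that only the row profiles matter: rows 0 and 1 differ in size but not in distribution, so their $\chi^2$-distance vanishes.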
Apart from the Euclidean and the $L_r$-norm measures one can use a proximity measure such as the Q-correlation coefficient
$$d_{ij} = \frac{\sum_{k=1}^p (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left\{\sum_{k=1}^p (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^p (x_{jk} - \bar{x}_j)^2\right\}^{1/2}}.$$
Here $\bar{x}_i$ denotes the mean over the variables $(x_{i1}, \ldots, x_{ip})$.
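The Q-correlation coefficient is simply a Pearson correlation computed across the variables of a pair of observations. A minimal sketch with illustrative vectors:

```python
# Q-correlation between two observations: correlation of their values
# across the p variables, each centered by its own row mean.

def q_corr(x, y):
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # same profile on a different scale
```

As a proximity measure it captures the shape of a profile rather than its level: $y$ is twice $x$, yet their Q-correlation is exactly $1$.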
Summary

- The proximity between data points is measured by a distance or similarity matrix $\mathcal{D}$ whose components $d_{ij}$ give the similarity coefficient or the distance between two points $x_i$ and $x_j$.
- A variety of similarity (distance) measures exist for binary data (e.g., Jaccard, Tanimoto, Simple Matching coefficients) and for continuous data (e.g., $L_r$-norms).
- The nature of the data could impose the choice of a particular metric $\mathcal{A}$ in defining the distances (standardization, $\chi^2$-metric in the case of contingency tables).