11.2 The Proximity between Objects

The proximity between the $n$ objects is described by an $(n \times n)$ matrix

$$\mathcal{D} = (d_{ij})_{i,j=1,\dots,n}, \qquad (11.1)$$

whose entries $d_{ij}$ are distances or similarity coefficients.

The nature of the observations plays an important role in the choice of proximity measure. Nominal values (like binary variables) lead in general to proximity (similarity) values, whereas metric values lead (in general) to distance matrices. We first present possibilities for defining the proximity matrix $\mathcal{D}$ in the binary case and then consider the continuous case.

Similarity of objects with binary structure

Define, for a pair of observations $(x_i, x_j)$ with binary components $x_{ik}, x_{jk} \in \{0, 1\}$,

$$a_1 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1),$$
$$a_2 = \sum_{k=1}^{p} I(x_{ik} = 0, x_{jk} = 1),$$
$$a_3 = \sum_{k=1}^{p} I(x_{ik} = 1, x_{jk} = 0),$$
$$a_4 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 0).$$

Note that each $a_l$, $l = 1, \dots, 4$, depends on the pair $(x_i, x_j)$.

The following proximity measures are used in practice:

$$d_{ij} = \frac{a_1 + \delta a_4}{a_1 + \delta a_4 + \lambda (a_2 + a_3)}, \qquad (11.2)$$

where $\delta$ and $\lambda$ are weighting factors. The Jaccard coefficient corresponds to $\delta = 0$, $\lambda = 1$; the Tanimoto coefficient to $\delta = 1$, $\lambda = 2$; and the Simple Matching coefficient to $\delta = 1$, $\lambda = 1$.

These measures provide alternative ways of weighting mismatchings and positive (presence of a common character) or negative (absence of a common character) matchings. In principle, we could also consider the Euclidean distance. However, the disadvantage of this distance is that it treats the observations $0$ and $1$ in the same way. If $x_{ik}$ denotes, say, knowledge of a certain language, then the contrary, $x_{ik} = 0$ (not knowing the language), should possibly be treated differently.
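As an illustrative sketch (the two binary vectors are invented for this example, not taken from the text), the coefficients above can be computed directly from the counts $a_1, \dots, a_4$:

```python
# Compute the binary similarity coefficients from the counts a1..a4.
# The vectors x and y below are made-up example data.

def binary_counts(x, y):
    """Count the four match/mismatch cases for two 0/1 vectors."""
    a1 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    a2 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    a3 = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    a4 = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a1, a2, a3, a4

def similarity(x, y, delta, lam):
    """General coefficient (a1 + d*a4) / (a1 + d*a4 + l*(a2 + a3))."""
    a1, a2, a3, a4 = binary_counts(x, y)
    return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]

jaccard  = similarity(x, y, delta=0, lam=1)  # a1 / (a1 + a2 + a3)
tanimoto = similarity(x, y, delta=1, lam=2)  # (a1 + a4) / (a1 + 2(a2 + a3) + a4)
simple   = similarity(x, y, delta=1, lam=1)  # (a1 + a4) / p
```

For these vectors $a_1 = 2$, $a_2 = a_3 = 1$, $a_4 = 2$, so Jaccard and Tanimoto both give $0.5$ while Simple Matching gives $4/6$.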

Define the binary variables $y_{ik} = 1$ if $x_{ik} > \bar{x}_k$ and $y_{ik} = 0$ otherwise, for $i = 1, \dots, n$ and $k = 1, \dots, p$. This means that we transform the observations of the $k$-th variable to $1$ if it is larger than the mean value of all observations of the $k$-th variable. Let us only consider the data points 17 to 19 (Renault 19, Rover and Toyota Corolla), which lead to $(3 \times 3)$ distance matrices. The Jaccard measure gives the similarity matrix

the Tanimoto measure yields

whereas the Simple Matching measure gives
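The binarize-at-the-mean construction can be sketched as follows; the small data matrix is invented for illustration (it is not the car data set), and only the Jaccard similarity matrix is computed:

```python
# Binarize each variable at its column mean (y_ik = 1 if x_ik > mean_k),
# then build a pairwise Jaccard similarity matrix. Data are invented.
import numpy as np

X = np.array([[2.0, 7.0, 5.0],
              [4.0, 3.0, 5.0],
              [6.0, 2.0, 1.0]])

Y = (X > X.mean(axis=0)).astype(int)  # y_ik = 1 iff x_ik exceeds the column mean

def jaccard(x, y):
    a1 = int(np.sum((x == 1) & (y == 1)))      # common 1s
    mism = int(np.sum(x != y))                 # a2 + a3
    return a1 / (a1 + mism) if (a1 + mism) > 0 else 1.0

n = Y.shape[0]
S = np.array([[jaccard(Y[i], Y[j]) for j in range(n)] for i in range(n)])
```

Each diagonal entry of `S` equals 1, and off-diagonal entries give the Jaccard similarity between the binarized rows.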

Distance measures for continuous variables

A wide variety of distance measures can be generated by the $L_r$-norms, $r \geq 1$,

$$d_{ij} = \|x_i - x_j\|_r = \left\{ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^r \right\}^{1/r}, \qquad (11.3)$$

and for the squared $L_2$- or Euclidean norm

$$d_{ij}^2 = \|x_i - x_j\|_2^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2. \qquad (11.4)$$

One can see that an outlying observation receives much more weight in the squared $L_2$-norm than in the $L_1$-norm.

An underlying assumption in applying distances based on $L_r$-norms is that the variables are measured on the same scale. If this is not the case, a standardization should first be applied. This corresponds to using a more general $L_2$- or Euclidean norm with a metric $A$, where $A > 0$ (see Section 2.6):

$$d_{ij}^2 = \|x_i - x_j\|_A = (x_i - x_j)^{\top} A (x_i - x_j). \qquad (11.5)$$

$L_2$-norms are obtained with $A = \mathcal{I}_p$; if a standardization is desired, the weight matrix $A = \mathrm{diag}(s_{X_1X_1}^{-1}, \dots, s_{X_pX_p}^{-1})$ may be suitable, where $s_{X_kX_k}$ denotes the variance of the $k$-th component. Then each component has the same weight in the computation of the distances, and the distances do not depend on the choice of units.

Taking the weight matrix $A = \mathrm{diag}(s_{X_1X_1}^{-1}, \dots, s_{X_pX_p}^{-1})$, we obtain the distance matrix (squared $L_2$-norm)
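A standardized distance of this form can be sketched as follows; the data matrix is invented, and the weight matrix $A$ is built from the empirical variances of the columns:

```python
# Weighted squared Euclidean distance d_ij^2 = (x_i - x_j)^T A (x_i - x_j)
# with A = diag(1/s_11, ..., 1/s_pp). The data are invented; note the
# two variables live on very different scales.
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

variances = X.var(axis=0)        # empirical variance of each variable
A = np.diag(1.0 / variances)     # inverse-variance weight matrix

def dist_A(x, y, A):
    d = x - y
    return float(d @ A @ d)

n = X.shape[0]
D = np.array([[dist_A(X[i], X[j], A) for j in range(n)] for i in range(n)])
```

After this weighting, each variable contributes on the same scale, so neither column dominates the distances despite the difference in units.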

When applied to contingency tables, the $\chi^2$-metric is suitable for comparing (and clustering) rows and columns.

If $\mathcal{X}$ is a contingency table, row $i$ is characterized by the conditional frequency distribution $x_{ij}/x_{i\bullet}$, where $x_{i\bullet} = \sum_{j} x_{ij}$ indicates the marginal distributions over the rows: $x_{i\bullet}/x_{\bullet\bullet}$, with $x_{\bullet\bullet} = \sum_{i} x_{i\bullet}$. Similarly, column $j$ of $\mathcal{X}$ is characterized by the conditional frequencies $x_{ij}/x_{\bullet j}$, where $x_{\bullet j} = \sum_{i} x_{ij}$. The marginal frequencies of the columns are $x_{\bullet j}/x_{\bullet\bullet}$.

The distance between two rows, $i_1$ and $i_2$, corresponds to the distance between their respective frequency distributions. It is common to define this distance using the $\chi^2$-metric:

$$d^2(i_1, i_2) = \sum_{j} \frac{1}{x_{\bullet j}/x_{\bullet\bullet}} \left( \frac{x_{i_1 j}}{x_{i_1 \bullet}} - \frac{x_{i_2 j}}{x_{i_2 \bullet}} \right)^2. \qquad (11.7)$$
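The $\chi^2$-metric between two rows can be sketched as follows; the contingency table is invented for illustration:

```python
# Chi-square metric between rows of a contingency table: each row is
# turned into its conditional frequency distribution x_ij / x_i., and
# squared differences are weighted by the inverse column margin x_.j / x_..
import numpy as np

X = np.array([[20.0, 30.0, 50.0],
              [10.0, 40.0, 50.0],
              [30.0, 30.0, 40.0]])   # invented table

row_sums = X.sum(axis=1)             # x_i.
col_freq = X.sum(axis=0) / X.sum()   # x_.j / x_..

def chi2_dist(i1, i2):
    f1 = X[i1] / row_sums[i1]        # conditional frequencies of row i1
    f2 = X[i2] / row_sums[i2]        # conditional frequencies of row i2
    return float(np.sum((f1 - f2) ** 2 / col_freq))

d01 = chi2_dist(0, 1)
```

Columns with small marginal frequency receive a larger weight, so a given difference in rare categories contributes more to the distance.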

Apart from the Euclidean and the $L_r$-norm measures, one can use a proximity measure such as the Q-correlation coefficient

$$d_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left\{ \sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2 \right\}^{1/2}}, \qquad (11.8)$$

where $\bar{x}_i$ denotes the mean of observation $i$ taken over the $p$ variables.
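The Q-correlation coefficient is the ordinary correlation computed across variables rather than across observations; a minimal sketch with invented vectors:

```python
# Q-correlation between two observations: center each vector at its own
# mean (over the p variables) and normalize. Example vectors are invented.
import numpy as np

def q_correlation(x, y):
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # same profile as x, different scale
z = np.array([4.0, 3.0, 2.0, 1.0])   # reversed profile

q_xy = q_correlation(x, y)           # close to +1
q_xz = q_correlation(x, z)           # close to -1
```

Note that, unlike a distance, this measure is large when the two observations have similar profiles, regardless of their levels or scales.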

- The proximity between data points is measured by a distance or similarity matrix $\mathcal{D}$ whose components $d_{ij}$ give the similarity coefficient or the distance between two points $x_i$ and $x_j$.
- A variety of similarity (distance) measures exist for binary data (e.g., Jaccard, Tanimoto, Simple Matching coefficients) and for continuous data (e.g., $L_r$-norms).
- The nature of the data may impose the choice of a particular metric for defining the distances (standardization, $\chi^2$-metric, etc.).