9.1 Introduction

In general, cluster analysis can be divided into hierarchical and nonhierarchical clustering techniques. Examples of hierarchical techniques are single linkage, complete linkage, average linkage, the median method, and Ward's method. Nonhierarchical techniques include K-means, adaptive K-means, K-medoids, and fuzzy clustering. Which algorithm is appropriate depends on the type of data available and on the particular purpose of the analysis. It is therefore advisable to run more than one algorithm and to analyze and compare the results carefully. More objectively, the stability of clusters can be investigated in simulation studies (Mucha, 1992).


9.1.1 Distance Measures


d = distance (x {,metric})
    computes the distance between p-dimensional data points depending on a specified metric
d = lpdist (x, q, p)
    computes the so-called $ L_p$-distances between the rows of a data matrix; in the case $ p = 1$ (absolute metric) or $ p = 2$ (Euclidean metric), one should favour the function distance

The distances between points play an important role in clustering. Several distance measures are available through the XploRe command distance. Moreover, additional distance measures can be computed using the XploRe matrix language.

For the distance between two $p$-dimensional observations $ x = (x_{1},x_{2},...,x_{p})^T$ and $ y = (y_{1},y_{2},...,y_{p})^T$, we first consider the Euclidean metric, defined as

$\displaystyle d(x,y) = \left[\sum^p_{i=1} (x_i - y_i)^2\right]^{1 \over 2}$ (9.1)

In matrix notation, this is written as

$\displaystyle d(x,y) = \sqrt{(x - y)^T(x - y)}$ (9.2)

The statistical distance between these two observations is

$\displaystyle d(x,y) = \sqrt{(x - y)^T A (x - y)}$ (9.3)

where $ A = S^{-1}$ is the inverse of $ S$, the matrix of sample variances and covariances. This distance is often called the Mahalanobis distance.

In XploRe, the available distances include the Euclidean, the diagonal, and the Mahalanobis metric. The distance measure should be chosen with care. The Euclidean metric should not be used where different attributes have widely varying average values and standard deviations, since large values in one attribute will dominate smaller values in another. With the diagonal and Mahalanobis metrics, the input data are transformed before use. Choosing the diagonal metric transforms the data set into one in which all attributes have equal variance. Choosing the Mahalanobis metric transforms the data set into one in which all attributes have zero mean and unit variance; correlations between variables are taken into account.
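To illustrate this effect outside of XploRe, the following minimal Python/NumPy sketch (an illustration only, with made-up data; it is not part of XploRe) computes the Euclidean distance (9.2) and the Mahalanobis distance (9.3) for data whose attributes have widely differing scales:

  import numpy as np

  def euclidean(x, y):
      # equation (9.2): sqrt((x - y)^T (x - y))
      d = x - y
      return np.sqrt(d @ d)

  def mahalanobis(x, y, S):
      # equation (9.3) with A = S^{-1}, the inverse sample covariance matrix
      d = x - y
      return np.sqrt(d @ np.linalg.solve(S, d))

  rng = np.random.default_rng(0)
  data = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])  # widely varying scales
  S = np.cov(data, rowvar=False)

  x, y = data[0], data[1]
  print(euclidean(x, y))       # dominated by the third, large-scale attribute
  print(mahalanobis(x, y, S))  # variances and correlations are taken into account

The Euclidean value is driven almost entirely by the attribute with the largest scale, whereas the Mahalanobis value is unaffected by rescaling the columns.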

Here is an example:

  x = #(1,4)~#(1,5)
  distance (x, "l1")
  distance (x, "l2")
  distance (x, "maximum")
XAGclust01.xpl

The results of this code are:
  Contents of distance
  [1,]     0     7
  [2,]     7     0
  Contents of distance
  [1,]     0     5
  [2,]     5     0
  Contents of distance
  [1,]     0     4
  [2,]     4     0
This means that the distance between the two observations is 7 with the city-block distance, 5 with the Euclidean distance, and 4 with the maximum distance: $|1-4| + |1-5| = 7$, $\sqrt{(1-4)^2 + (1-5)^2} = 5$, and $\max(|1-4|, |1-5|) = 4$.
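For readers working outside of XploRe, the same three distances can be cross-checked with a short Python/SciPy sketch (an illustration only; SciPy calls the metrics cityblock, euclidean and chebyshev):

  import numpy as np
  from scipy.spatial.distance import pdist, squareform

  # the same two observations as above: the rows (1, 1) and (4, 5)
  x = np.array([[1, 1],
                [4, 5]])

  for metric in ("cityblock", "euclidean", "chebyshev"):  # L1, L2, maximum
      print(metric, squareform(pdist(x, metric=metric)))

Each call prints a symmetric 2 x 2 distance matrix with 7, 5 and 4, respectively, in its off-diagonal entries.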

Alternatively, a distance measure can also be computed by the quantlet lpdist, which computes the so-called $ L_p$-distances between the rows of a data matrix.

Here is the syntax of the quantlet lpdist in XploRe:

  d = lpdist(x, q, p)
where x is an $ n \times m$ matrix, q is an $ m \times 1$ vector of nonnegative column weights, and p is the scalar parameter ($ p > 0$) of the $ L_p$-metric. In the cases $ p = 1$ (absolute metric) and $ p = 2$ (Euclidean metric), the function distance should be preferred.

As an example, we start by loading the quantlib xclust; then we generate eight pairs of data, set the column weights, and apply the Euclidean metric:

  library ("xclust")
  x = #(5, 2,-2, -3, -2, -2, 1, 1)~#(-3, -4, -1, 0, -2, 4, 2, 4)
  q = #(1, 1)
  lpdist (x, q, 2)
XAGclust02.xpl

The output of this code is as follows:
  Content of object d
  [1,]   3.1623
  [2,]   7.0821
  [3,]   8.5440
  ...
  ...
  [26,]  3.6056
  [27,]       3
  [28,]       2
The result is a $ 28 \times 1$ matrix of pairwise distances between the $ 8$ row points; it also serves as the input for the hierarchical clustering presented in the following section.
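As a rough cross-check of the number of pairwise distances, the following Python/SciPy sketch (an illustration only; the ordering of the pairs follows SciPy's convention and need not match that of lpdist) computes the 28 unweighted Euclidean distances for the same eight points:

  import numpy as np
  from scipy.spatial.distance import pdist

  # the eight points from the XploRe example above
  x = np.array([[ 5, -3], [ 2, -4], [-2, -1], [-3,  0],
                [-2, -2], [-2,  4], [ 1,  2], [ 1,  4]])

  d = pdist(x, metric="euclidean")  # corresponds to q = (1, 1) and p = 2
  print(d.shape)                    # (28,), i.e. 8 * 7 / 2 pairwise distances
  print(np.round(d, 4))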


9.1.2 Similarity of Objects

According to Härdle and Simar (1998), to measure the similarity between objects we compare pairs of binary observations $ (x_i,x_j)$, where $ x_i^T = (x_{i1}, \ldots, x_{ip})$, $ x_j^T = (x_{j1}, \ldots, x_{jp})$, and $ x_{ik}, x_{jk} \in \{0,1\}$. For each component, there are four possible cases:

$\displaystyle x_{ik} = x_{jk} = 1; \quad x_{ik} = 0,\ x_{jk} = 1; \quad x_{ik} = 1,\ x_{jk} = 0; \quad x_{ik} = x_{jk} = 0.$ (9.4)

We define

$\displaystyle a_1 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 1), a_2 = \sum_{k=1}^p I(x_{ik} = 0, x_{jk} = 1),$ (9.5)

$\displaystyle a_3 = \sum_{k=1}^p I(x_{ik} = 1, x_{jk} = 0), a_4 = \sum_{k=1}^p I(x_{ik} = x_{jk} = 0).$ (9.6)

The following general measure is used in practice:

$\displaystyle T_{ij} = \frac {{a_1} + \delta {a_4}} {{a_1} + \delta {a_4} + \lambda (a_2 + a_3)},$ (9.7)

where $ \delta$ and $ \lambda$ are weighting factors. Depending on the choice of these weights, we obtain the similarity coefficients in Table 9.1.

Table 9.1: Common similarity coefficients.

  Name                   $\delta$   $\lambda$   $T(x_i,x_j)$
  Jaccard                    0          1       $\frac{a_1}{a_1+a_2+a_3}$
  Tanimoto                   1          2       $\frac{a_1+a_4}{a_1+2(a_2+a_3)+a_4}$
  Simple Matching (M)        1          1       $\frac{a_1+a_4}{p}$


Note that each $ a_l$, $ l=1, \ldots, 4$, depends on the pair $ (x_i,x_j)$. In XploRe, the similarity matrix $ T$ given above is transformed into a distance matrix $ D$ with elements $ d_{ij} = 1 - t_{ij}$.
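The following Python sketch (an illustration only, independent of XploRe; the two binary vectors are made up for this example) evaluates the general coefficient (9.7) for the three weight choices of Table 9.1 and converts each similarity into a distance $ 1 - T$:

  import numpy as np

  def similarity(xi, xj, delta, lam):
      # count a1, ..., a4 as in (9.5) and (9.6), then apply (9.7)
      a1 = np.sum((xi == 1) & (xj == 1))
      a2 = np.sum((xi == 0) & (xj == 1))
      a3 = np.sum((xi == 1) & (xj == 0))
      a4 = np.sum((xi == 0) & (xj == 0))
      return (a1 + delta * a4) / (a1 + delta * a4 + lam * (a2 + a3))

  xi = np.array([1, 0, 1, 1, 0])
  xj = np.array([1, 1, 0, 1, 0])

  for name, delta, lam in (("Jaccard", 0, 1), ("Tanimoto", 1, 2), ("Simple Matching", 1, 1)):
      T = similarity(xi, xj, delta, lam)
      print(name, T, 1 - T)  # similarity coefficient and the corresponding distance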

An example of this is as follows:

  x = #(1,0, 0)~#(1,0,1)
  distance (x, "tanimoto")
XAGclust03.xpl

The result is the distance matrix obtained from the Tanimoto coefficient:
  Contents of distance
  [1,]     0     1    0.5
  [2,]     1     0      1
  [3,]   0.5     1      0