In general, cluster analysis can be divided into hierarchical and nonhierarchical clustering techniques. Examples of hierarchical techniques are single linkage, complete linkage, average linkage, the median method, and Ward's method. Nonhierarchical techniques include K-means, adaptive K-means, K-medoids, and fuzzy clustering. Which algorithm is appropriate depends on the type of data available and on the particular purpose of the analysis. It is therefore advisable to run more than one algorithm and to analyze and compare the results carefully. More objectively, the stability of the resulting clusters can be investigated in simulation studies (Mucha, 1992).
The distances between points play an important role in clustering. Several distance measures are available through the XploRe command distance. Moreover, additional distance measures can be computed by using the XploRe matrix language.
For the distance between two p-dimensional observations $x_i$ and $x_j$, we consider the Euclidean metric

  $d_{ij} = \left\{ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right\}^{1/2} = \left\{ (x_i - x_j)^\top (x_i - x_j) \right\}^{1/2}$,   (9.1)

the diagonal metric

  $d_{ij} = \left\{ (x_i - x_j)^\top D^{-1} (x_i - x_j) \right\}^{1/2}$, where $D = \operatorname{diag}(s_{11}, \ldots, s_{pp})$,   (9.2)

and the Mahalanobis metric

  $d_{ij} = \left\{ (x_i - x_j)^\top S^{-1} (x_i - x_j) \right\}^{1/2}$,   (9.3)

where $S$ denotes the empirical covariance matrix of the data and $s_{kk}$ its diagonal elements.
In XploRe, the available distances thus include the Euclidean, diagonal, and Mahalanobis metrics. The distance measure should be chosen with care. The Euclidean metric should not be used where different attributes have widely varying average values and standard deviations, since large values in one attribute will dominate smaller values in another. With the diagonal and Mahalanobis metrics, the input data are transformed before use: the diagonal metric transforms the data set into one in which all attributes have equal variance, while the Mahalanobis metric transforms it into one in which all attributes have zero mean and unit variance, so that correlations between variables are taken into account.
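The transformation view of these metrics can be made concrete outside XploRe as well. The following Python sketch (our own illustration; the helper pairwise_dist is not an XploRe quantlet) treats the diagonal metric as standardization by the attribute standard deviations and the Mahalanobis metric as whitening with the inverse covariance matrix before taking Euclidean distances:

  import numpy as np

  def pairwise_dist(x, metric="euclid"):
      # illustrative sketch of the three metrics as data transformations
      x = np.asarray(x, dtype=float)
      if metric == "diagonal":
          # equalize variances: divide each attribute by its standard deviation
          x = x / x.std(axis=0, ddof=1)
      elif metric == "mahalanobis":
          # center, then whiten with the inverse covariance matrix
          x = x - x.mean(axis=0)
          L = np.linalg.cholesky(np.linalg.inv(np.cov(x, rowvar=False)))
          x = x @ L
      diff = x[:, None, :] - x[None, :, :]
      return np.sqrt((diff ** 2).sum(axis=-1))

  x = np.array([[1.0, 100.0], [4.0, 500.0], [2.0, 300.0]])
  print(pairwise_dist(x, "euclid"))       # dominated by the second attribute
  print(pairwise_dist(x, "diagonal"))     # attributes on an equal footing
  print(pairwise_dist(x, "mahalanobis"))  # correlations removed as well

On this small data set the second attribute dominates the plain Euclidean distances completely, while the diagonal and Mahalanobis variants weight both attributes comparably.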
Returning to XploRe, here is an example using the distance command:

  x = #(1, 4)~#(1, 5)
  distance(x, "l1")
  distance(x, "l2")
  distance(x, "maximum")

  Contents of distance
  [1,]        0        7
  [2,]        7        0
  Contents of distance
  [1,]        0        5
  [2,]        5        0
  Contents of distance
  [1,]        0        4
  [2,]        4        0

This means that the distance between the two observations is 7 in the city-block (L1) metric, 5 in the Euclidean metric, and 4 in the maximum metric.
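For readers without XploRe at hand, the same three numbers can be checked in Python (an illustration using scipy, not an XploRe facility):

  import numpy as np
  from scipy.spatial.distance import pdist

  x = np.array([[1, 1], [4, 5]])   # the two observations from above
  print(pdist(x, "cityblock"))     # [7.]  city-block (L1) distance
  print(pdist(x, "euclidean"))     # [5.]  Euclidean (L2) distance
  print(pdist(x, "chebyshev"))     # [4.]  maximum distance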
Alternatively, distances can also be computed by the quantlet lpdist, whose aim is to compute the so-called $L_p$-distances between the rows of a data matrix. The quantlet is called in XploRe as

  d = lpdist(x, q, p)

where x is the $n \times p$ data matrix, q is the vector of column weights, and p determines the $L_p$ metric (p = 2 yields the Euclidean distance).
To see an example, we start by loading the quantlib xclust; then we generate eight pairs of data, set the column weights, and apply the Euclidean metric (p = 2):
library ("xclust") x = #(5, 2,-2, -3, -2, -2, 1, 1)~#(-3, -4, -1, 0, -2, 4, 2, 4) q = #(1, 1) lpdist (x, q, 2)
  Content of object d
  [ 1,]   3.1623
  [ 2,]   7.0821
  [ 3,]   8.5440
  ...
  [26,]   3.6056
  [27,]   3
  [28,]   2

This result is the vector of all 28 = 8·7/2 pairwise distances between the eight observations.
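As a rough counterpart outside XploRe, a weighted $L_p$ distance of this kind can be sketched in Python (the helper lp_dist is our own; we assume the weights q multiply the coordinate-wise differences, which reduces to the plain $L_p$ distance for q = (1, 1)):

  import numpy as np
  from itertools import combinations

  def lp_dist(x, q, p):
      # weighted L_p distances between all rows of x:
      # d_ij = (sum_k q_k |x_ik - x_jk|^p)^(1/p)
      x = np.asarray(x, dtype=float)
      q = np.asarray(q, dtype=float)
      return np.array([(q * np.abs(x[i] - x[j]) ** p).sum() ** (1 / p)
                       for i, j in combinations(range(len(x)), 2)])

  x = np.column_stack(([5, 2, -2, -3, -2, -2, 1, 1],
                       [-3, -4, -1, 0, -2, 4, 2, 4]))
  d = lp_dist(x, q=[1, 1], p=2)
  print(len(d))   # 28 pairwise distances
  print(d[0])     # 3.1623, matching the first entry above
  print(d[-3:])   # [3.6056 3.     2.    ], matching the last entries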
According to Härdle and Simar (1998), for measuring the similarity between objects with binary attributes, we can compare pairs of observations $(x_i, x_j)$, where $x_i^\top = (x_{i1}, \ldots, x_{ip})$, $x_j^\top = (x_{j1}, \ldots, x_{jp})$, and $x_{ik}, x_{jk} \in \{0, 1\}$. Actually, we have four cases, counted by

  $a_1 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 1)$,   (9.4)

  $a_2 = \sum_{k=1}^{p} I(x_{ik} = 0, x_{jk} = 1)$,   (9.5)

  $a_3 = \sum_{k=1}^{p} I(x_{ik} = 1, x_{jk} = 0)$,   (9.6)

  $a_4 = \sum_{k=1}^{p} I(x_{ik} = x_{jk} = 0)$.   (9.7)
Note that each $a_l$, $l = 1, \ldots, 4$, depends on the pair $(x_i, x_j)$. From these counts the Tanimoto similarity coefficient $T_{ij} = a_1/(a_1 + a_2 + a_3)$ is computed, and in XploRe the similarity matrix T is transformed into a distance matrix D by $D = 1 - T$, so that $D_{ij} = (a_2 + a_3)/(a_1 + a_2 + a_3)$.
An example is as follows:

  x = #(1, 0, 0)~#(1, 0, 1)
  distance(x, "tanimoto")

  Contents of distance
  [1,]        0        1      0.5
  [2,]        1        0        1
  [3,]      0.5        1        0

For instance, the first and third observations, $(1, 1)$ and $(0, 1)$, agree in one nonzero attribute and disagree in one ($a_1 = 1$, $a_2 + a_3 = 1$), so their Tanimoto distance is $1/2$.
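The same matrix can be reproduced with a short Python sketch (illustrative; the function tanimoto_dist is our own and implements $D_{ij} = (a_2 + a_3)/(a_1 + a_2 + a_3)$ as above):

  import numpy as np

  def tanimoto_dist(x):
      # Tanimoto distance for binary rows: D_ij = (a2 + a3) / (a1 + a2 + a3)
      x = np.asarray(x, dtype=bool)
      n = len(x)
      d = np.zeros((n, n))
      for i in range(n):
          for j in range(n):
              a1 = np.sum(x[i] & x[j])     # attributes where both are 1
              a23 = np.sum(x[i] != x[j])   # attributes that disagree
              d[i, j] = a23 / (a1 + a23) if a1 + a23 > 0 else 0.0
      return d

  x = np.array([[1, 1], [0, 0], [0, 1]])   # the rows of #(1,0,0)~#(1,0,1)
  print(tanimoto_dist(x))
  # [[0.  1.  0.5]
  #  [1.  0.  1. ]
  #  [0.5 1.  0. ]]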