9.2 Hierarchical Clustering

At any stage of the procedure, a hierarchical clustering technique performs either a merger of clusters or a division of a cluster formed at a previous stage. This conceptually gives rise to a tree-like structure of the clustering process. The clusters of items formed at any stage are understood to be nonoverlapping, or mutually exclusive.

Hierarchical clustering techniques proceed by either a series of successive mergers or a series of successive divisions.

The results of these methods can be displayed in a dendrogram, a tree-like diagram that depicts the mergers or divisions made at successive levels. Figure 9.1 below shows an example of a dendrogram for eight pairs of data that will be used as a running example in this section.

Figure 9.1: An example of a dendrogram using eight pairs of data.
\includegraphics[scale=0.6]{singled}


9.2.1 Agglomerative Hierarchical Methods


cagg = agglom (d, method, no{, opt})
performs a hierarchical cluster analysis

This method starts with each object forming its own cluster and then combines clusters into successively more inclusive clusters until only one cluster remains. Härdle and Simar (1998) give the following algorithm for the agglomerative hierarchical method (a code sketch of the loop is given after the listing):

  1. Construct the finest partition (each object forms its own cluster)
  2. Compute the distance matrix $ D$
  3. DO
  4. Find the two clusters with the smallest distance
  5. Put these two clusters into one cluster
  6. Compute the distances between the new group and the remaining groups to obtain a reduced distance matrix $ D$
  7. UNTIL all clusters are agglomerated into one group.
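For readers who prefer code to pseudocode, this loop can be sketched in a few lines. The following is an illustrative, naive $ O(n^3)$ Python implementation driven by the general distance update (9.8) given below; the function and variable names are illustrative choices, and it is not the XploRe quantlet agglom.

  # A naive O(n^3) sketch of the agglomerative loop, driven by the
  # Lance-Williams update (9.8).  Illustrative Python only, not the
  # XploRe quantlet agglom; all names are illustrative choices.
  import numpy as np

  def agglomerate(d, deltas):
      """d: symmetric (n x n) distance matrix;
      deltas(nP, nQ, nR) returns the weights (delta1, delta2, delta3, delta4)."""
      d = np.array(d, dtype=float)
      clusters = {i: [i] for i in range(d.shape[0])}   # cluster id -> member objects
      active = list(clusters)
      merges = []                                      # history: (members of P, members of Q, distance)
      while len(active) > 1:
          # step 4: find the two closest clusters P and Q
          p, q = min(((a, b) for i, a in enumerate(active) for b in active[i + 1:]),
                     key=lambda pair: d[pair[0], pair[1]])
          merges.append((clusters[p][:], clusters[q][:], d[p, q]))
          nP, nQ = len(clusters[p]), len(clusters[q])
          # step 6: update the distance from every other cluster R to the new cluster P+Q
          for r in active:
              if r == p or r == q:
                  continue
              d1, d2, d3, d4 = deltas(nP, nQ, len(clusters[r]))
              d[r, p] = d[p, r] = (d1 * d[r, p] + d2 * d[r, q]
                                   + d3 * d[p, q] + d4 * abs(d[r, p] - d[r, q]))
          clusters[p] += clusters[q]                   # step 5: P now holds the union P+Q
          active.remove(q)                             # Q is dropped from the distance matrix
      return merges

  # single linkage weights, taken from Table 9.2 below
  single_linkage = lambda nP, nQ, nR: (0.5, 0.5, 0.0, -0.5)

Calling agglomerate(d, single_linkage) on a distance matrix reproduces the single linkage merging order; the other rows of Table 9.2 can be plugged in the same way.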

If two objects or groups $ P$ and $ Q$ are to be united, one obtains the distance to another group (object) $ R$ by the following distance function

$\displaystyle d(R,P+Q) = \delta_1 d(R,P)+\delta_2 d(R,Q) + \delta_3 d(P,Q) + \delta_4 \vert d(R,P)-d(R,Q)\vert$ (9.8)

The $ \delta_j$'s are weighting factors that lead to different agglomerative algorithms, as described in Table 9.2. Here $ n_P = \sum^{n}_{i=1} I (x_i \in P)$ is the number of objects in group $ P$; the values of $ n_Q$ and $ n_R$ are defined analogously. The flexible method requires a parameter $ \beta$ that is specified by the user.

Table 9.2: Computation of group distances available in XploRe.

  Name                   | $\delta_1$                      | $\delta_2$                      | $\delta_3$                       | $\delta_4$
  -----------------------|---------------------------------|---------------------------------|----------------------------------|-----------
  Single linkage         | $1/2$                           | $1/2$                           | $0$                              | $-1/2$
  Complete linkage       | $1/2$                           | $1/2$                           | $0$                              | $1/2$
  Simple average linkage | $1/2$                           | $1/2$                           | $0$                              | $0$
  Average linkage        | $\frac{n_P}{n_P+n_Q}$           | $\frac{n_Q}{n_P+n_Q}$           | $0$                              | $0$
  Centroid               | $\frac{n_P}{n_P+n_Q}$           | $\frac{n_Q}{n_P+n_Q}$           | $-\frac{n_P n_Q}{(n_P+n_Q)^2}$   | $0$
  Median                 | $1/2$                           | $1/2$                           | $-1/4$                           | $0$
  Ward                   | $\frac{n_R+n_P}{n_R+n_P+n_Q}$   | $\frac{n_R+n_Q}{n_R+n_P+n_Q}$   | $-\frac{n_R}{n_R+n_P+n_Q}$       | $0$
  Flexible method        | $\frac{1-\beta}{2}$             | $\frac{1-\beta}{2}$             | $\beta$                          | $0$


Below is the quantlet agglom, which is implemented in XploRe to perform hierarchical cluster analysis:
  cagg = agglom (d, method, no{, opt})
where d is an $ n \times 1 $ vector or an $ l \times l$ matrix of distances, method is a string that specifies one of the following agglomeration methods: ``WARD'', ``SINGLE'', ``COMPLETE'', ``MEAN_LINK'', ``MEDIAN_LINK'', ``AVERAGE'', ``CENTROID'', or ``LANCE'' (flexible method), no is a scalar giving the number of clusters, and opt is an optional argument for the flexible method, with default value $ -0.15$.

The outputs of the quantlet agglom are: cagg.p, a vector with partition numbers $ (1,2,\ldots)$; cagg.t, a matrix containing the dendrogram for the chosen number of clusters (no); cagg.g, a matrix containing the dendrogram for all $ l$ clusters; cagg.pd, a matrix with partition numbers $ (1,2,\ldots)$; and cagg.d, a matrix with the distances between the cluster centers.


9.2.1.1 Single Linkage Method

The single linkage method is also called the nearest neighbor or minimum distance method. It is defined by

$\displaystyle d(R,P+Q) = \min \left(d(R,P),d(R,Q)\right)$ (9.9)

The process continues from weak clustering to strong clustering. This method is invariant under monotone transformations of the input data; therefore the algorithm can be used with both similarity and dissimilarity measures. Its tendency to chain clusters together is sometimes undesirable because it prevents the detection of clusters that are not well separated. On the other hand, this criterion may be useful for detecting outliers in the data set.
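This minimum rule is exactly what the general update (9.8) yields with the single linkage weights of Table 9.2: since $ \min(a,b) = \frac{1}{2}(a+b) - \frac{1}{2}\vert a-b\vert$ for any real $ a,b$, we have

$\displaystyle d(R,P+Q) = \frac{1}{2}\, d(R,P) + \frac{1}{2}\, d(R,Q) - \frac{1}{2}\, \vert d(R,P)-d(R,Q)\vert = \min\left(d(R,P),d(R,Q)\right).$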

As an example, we apply the single linkage method to the eight data points plotted in Figure 9.2.

First we prepare the data,

  x=#(5,2,-2,-3,-2,-2,1,1)~ #(-3,-4,-1,0,-2,4,2,4)
                                  ; creates 8 pairs of data
  n=rows(x)                       ; rows of data
  xs=string("%1.0f", 1:n)         ; adds labels
  setsize(500, 500)
  dd1=createdisplay(1,1)
  setmaskp(x, 0, 0, 0)
  setmaskt(x, string("%.0f", 1:rows(x)), 0, 0, 16)
  setmaskl(x, 1~2~7~8~6~0~7~3~5~0~3~4, 0, 1, 1)
  show(dd1, 1, 1, x)                   ; shows data
  setgopt(dd1, 1, 1,"xlab","first coord.",
                                      "ylab","second coord.")
  setgopt(dd1, 1, 1,"title","8 points","xoff",7|7,"yoff",7|7)
then we calculate the Euclidean distance and apply the single linkage method,
  d=distance(x, "euclid")         ; Euclidean distance
  d.*d                            ; squared distance matrix
  t=agglom(d.*d, "SINGLE", 5)     ; here single linkage method
  g=tree(t.g, 0, "CENTER")
  g=g.points
  l = 5.*(1:rows(g)/5) + (0:4)' - 4
  setmaskl (g, l, 0, 1, 1)
  setmaskp (g, 0, 0, 0)
finally, we display the dendrogram with the point labels
  tg=paf(t.g[,2], t.g[,2]!=0)
  numbers=(0:(rows(x)-1))
  numbers=numbers~((-1)*matrix(rows(x)))
  setmaskp(numbers, 0, 0, 0)
  setmaskt(numbers, string("%.0f", tg), 0, 0, 14)
  dd2=createdisplay(1,1)
  show (dd2, 1, 1, g, numbers)
  setgopt(dd2, 1, 1, "xlab","Single Linkage Dendrogram", 
                     "ylab","Squared Euclidean Distance")
  setgopt(dd2,1,1,"title","8 points","xoff",7|7,"yoff",7|7)
XAGclust04.xpl

The plot of the eight pairs of data is shown in Figure 9.2.

Figure 9.2: Plot of eight pairs of data.
\includegraphics[scale=0.6]{singlep}

The dendrogram obtained with the single linkage method is shown in Figure 9.1. If we cut the tree at level 10, we find three clusters: $ \{1,2\}$, $ \{3,4,5\}$, and $ \{6,7,8\}$.


9.2.1.2 Complete Linkage Method

The complete linkage method is also called the farthest neighbor or maximum distance method. It is defined by

$\displaystyle d(R,P+Q) = \max \left(d(R,P),d(R,Q)\right)$ (9.10)
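As in the single linkage case, this follows from (9.8) with $ \delta_4 = +\frac{1}{2}$, because $ \max(a,b) = \frac{1}{2}(a+b) + \frac{1}{2}\vert a-b\vert$:

$\displaystyle d(R,P+Q) = \frac{1}{2}\, d(R,P) + \frac{1}{2}\, d(R,Q) + \frac{1}{2}\, \vert d(R,P)-d(R,Q)\vert = \max\left(d(R,P),d(R,Q)\right).$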

If we change SINGLE into COMPLETE in the example above
  ...
  t=agglom(d.*d, "SINGLE", 5)     ; here single linkage method
  ...
then we get
  ...
  t=agglom(d.*d, "COMPLETE", 5)   ; here complete linkage method
  ...
XAGclust05.xpl

The dendrogram is shown in Figure 9.3. If we cut the tree at level 10, we again find three clusters: $ \{1,2\}, \{3,4,5\}$ and $ \{6,7,8\}$.

Figure 9.3: Plot of a dendrogram with complete linkage method.
\includegraphics[scale=0.6]{completed}

This method proceeds exactly as the single linkage method, except that at the crucial step of revising the distance matrix, the maximum instead of the minimum distance is used.

The two methods behave quite differently: the single linkage method tends to maximize connectivity in a closeness sense, whereas the complete linkage (maximum distance) method typically leads to more clusters that are smaller, tighter, and more compact.


9.2.1.3 Average Linkage Method

The average linkage method is a hierarchical method that avoids the extremes of either large clusters or tight, compact clusters. It is a compromise between the nearest neighbor and the farthest neighbor methods.

The simple average linkage (mean linkage) method takes both elements of the new cluster into account:

$\displaystyle d(R,P+Q) = 1/2 \left(d(R,P)+d(R,Q)\right)$ (9.11)

After the new distances are computed, the distance matrix is reduced by one row and column for the merged cluster. The algorithm then loops back to find the next minimum value and continues until all objects are united into one cluster. Note, however, that this method is not invariant under monotone transformations of the distances.
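The (group-size weighted) average linkage entry of Table 9.2, in contrast, weights the two parts of the new cluster by the number of objects they contain:

$\displaystyle d(R,P+Q) = \frac{n_P\, d(R,P) + n_Q\, d(R,Q)}{n_P + n_Q}.$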

If we change SINGLE into AVERAGE in the example above, we obtain the following:

  ...
  t=agglom(d.*d, "AVERAGE", 5)   ; here average linkage method
  ...
XAGclust06.xpl

The dendrogram is shown in Figure 9.4. If we cut the tree at level 10, we again find three clusters: $ \{1,2\}, \{3,4,5\}$ and $ \{6,7,8\}$.

Figure 9.4: Plot of a dendrogram with average linkage method.
\includegraphics[scale=0.6]{averaged}


9.2.1.4 Centroid Method

Everitt (1993) explains that in the centroid method, groups once formed are represented by their mean values for each variable (the mean vector), and inter-group distance is defined as the distance between two such mean vectors. The use of the mean strictly implies that the variables are on an interval scale.
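If the distances are taken to be squared Euclidean distances between group mean vectors $ \bar x_P$, $ \bar x_Q$, $ \bar x_R$, the centroid weights of Table 9.2 yield exactly the squared distance from $ \bar x_R$ to the mean vector $ \bar x_{P+Q} = (n_P \bar x_P + n_Q \bar x_Q)/(n_P+n_Q)$ of the merged group:

$\displaystyle d(R,P+Q) = \frac{n_P}{n_P+n_Q}\, d(R,P) + \frac{n_Q}{n_P+n_Q}\, d(R,Q) - \frac{n_P n_Q}{(n_P+n_Q)^2}\, d(P,Q).$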

Figure 9.5 shows the dendrogram obtained with the centroid method for the eight pairs of data, produced with the quantlet XAGclust07.xpl.

Figure 9.5: Plot of a dendrogram with centroid linkage method.
\includegraphics[scale=0.6]{centroidd}


9.2.1.5 Median Method

If the sizes of the two groups to be merged are very different, then the centroid of the new group will be very close to that of the larger group and may remain within that group. This is a disadvantage of the centroid method. For that reason, Gower (1967) suggested an alternative strategy, the median method, which can be made suitable for both similarity and distance measures.
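With the median weights of Table 9.2, the update (9.8) becomes

$\displaystyle d(R,P+Q) = \frac{1}{2}\, d(R,P) + \frac{1}{2}\, d(R,Q) - \frac{1}{4}\, d(P,Q),$

which, for squared Euclidean distances, amounts to representing the new group by the midpoint of the two old centroids, irrespective of the group sizes.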

The dendrogram obtained with the median method for the eight pairs of data is shown in Figure 9.6, produced with the quantlet XAGclust08.xpl.

Figure 9.6: Plot of a dendrogram with median method.
\includegraphics[scale=0.6]{mediand}


9.2.1.6 Ward Method


cw = wardcont (x, k, l)
performs Ward's hierarchical cluster analysis of the rows as well as of the columns of a contingency table, including a multivariate graphic based on correspondence analysis; the factorial coordinates of the row points and column points (scores) are made available

Ward (1963) proposed a clustering procedure that seeks to form the partitions $ P_k, P_{k-1} ,\ldots, P_1$ in a manner that minimizes the loss associated with each grouping and quantifies that loss in a readily interpretable form. Ward defines the information loss in terms of an error sum of squares (ESS) criterion, where the ESS is defined as

$\displaystyle ESS = \sum^K_{k=1} \sum_{x_i \in C_k} \sum^p_{j=1} \left ( x_{ij} - \bar x_{kj} \right )^2$ (9.12)

with the cluster mean $ \bar x_{kj} = \frac{1}{n_k} \sum_{x_i \in C_k} x_{ij}$, where $ x_{ij}$ denotes the value of the $ j$-th variable for the $ i$-th individual, $ K$ is the total number of clusters at the current stage, and $ n_k$ is the number of individuals in the $ k$-th cluster.
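When two groups $ P$ and $ Q$ are merged, the resulting increase in ESS depends only on the group sizes and on the distance between the group mean vectors,

$\displaystyle \Delta ESS = \frac{n_P\, n_Q}{n_P + n_Q} \sum^p_{j=1} \left(\bar x_{Pj} - \bar x_{Qj}\right)^2,$

and Ward's method merges, at each step, the two groups for which this increase is smallest.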

The corresponding call in XploRe is

  t = agglom (d, "WARD", 2)

The main difference between this method and the linkage methods lies in the unification procedure. This method does not join the groups with the smallest distance; rather, it joins the groups whose merger increases a given measure of heterogeneity the least. The aim of the Ward method is to unify groups such that the variation inside these groups does not increase too drastically, which results in clusters that are as homogeneous as possible.

The following quantlet gives an example of how to show the dendrogram with the WARD method in XploRe.

In this example we use the bank2.dat data set taken from Flury and Riedwyl (1988). This data set consists of 200 measurements on Swiss bank notes; one half of the bank notes are genuine, the other half are forged. The variables in this data set are as follows: $ X_1$ = length of the bill, $ X_2$ = height of the bill (left), $ X_3$ = height of the bill (right), $ X_4$ = distance of the inner frame to the lower border, $ X_5$ = distance of the inner frame to the upper border, $ X_6$ = length of the diagonal of the central picture.

First, we compute the Euclidean distances between the bank notes:

 proc()=main()
  x=read("bank2")
  i=0                       ; compute the euclidean distance
  d=0.*matrix(rows(x),rows(x))
  while (i.<cols(x))
    i = i+1
    d = d+(x[,i] - x[,i]')^2
  endo
  d = sqrt(d)
Next, we use the WARD method and show the dendrogram
  t = agglom (d, "WARD", 2)  ; use WARD method
  g = tree (t.g, 0)          ; to cluster the data
  g=g.points
  l = 5.*(1:rows(g)/5) + (0:4)' - 4
  setmaskl (g, l, 0, 1, 1)
  setmaskp (g, 0, 0, 0)
  d = createdisplay (1,1)
  show (d, 1, 1, g)         ; show the dendrogram
 endp
  ;
 main()
XAGclust09.xpl

The result is a partition of the data into 2 clusters; the dendrogram is plotted in Figure 9.7. With the Ward method we see that only one observation, the 70-th, is assigned to the wrong cluster; all other observations are grouped correctly.
  [  1,]        1
  [  2,]        1
  ...
  ...
  [ 68,]        1
  [ 69,]        1
  [ 70,]        2
  [ 71,]        1
  [ 72,]        1
  ....
  ....
  [ 99,]        1
  [100,]        1
  [101,]        2
  [102,]        2
  ...
  ...
  [199,]        2
  [200,]        2

Figure 9.7: Dendrogram for the 200 Swiss bank notes data.
\includegraphics[scale=0.6]{xagclustpic}
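For comparison, a similar Ward analysis can be sketched outside XploRe with SciPy. This is only an illustration: the plain-text file name bank2.dat is an assumption, and the numerical details may differ from the quantlet above.

  # A comparable Ward analysis sketched with SciPy (not XploRe).
  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  x = np.loadtxt("bank2.dat")                      # 200 x 6 bank note measurements (file name assumed)
  z = linkage(x, method="ward")                    # Ward's method on Euclidean distances
  labels = fcluster(z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
  print(np.bincount(labels)[1:])                   # sizes of the two clusters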

The other quantlet that we use is wardcont. The aim of this quantlet is to perform Ward's hierarchical cluster analysis of the rows as well as of the columns of a contingency table. It includes a multivariate graphic based on correspondence analysis and makes available the factorial coordinates of the row points and column points (scores).

The syntax of this quantlet is as follows.

  cw = wardcont (x, k, l)
where x is an $ n \times p$ matrix of $ n$ row points to be clustered (the elements must be $ > 0$, with positive marginal sums), k is a scalar giving the maximum number of clusters of rows, and l is a scalar giving the maximum number of clusters of columns.

As an example, we use the bird.dat data set taken from Mucha (1992). This data set consists of 412 areas (each area = 1 square km) and 102 kinds of birds. The areas are divided into 12 groups and the kinds of birds into 9 groups.

After loading the quantlib xclust, we apply the wardcont quantlet:

  library ("xclust")
  x=read("bird.dat")
  cw = wardcont(x, 3, 3)
XAGclust10.xpl

Figures 9.8, 9.9, and 9.10 visualize the correspondence analysis scores of the row points, the column points, and both rows and columns.

Figure 9.8: Correspondence analysis scores of the row points.
\includegraphics[scale=0.6]{wardrow2}

Figure 9.9: Correspondence analysis scores of the column points.
\includegraphics[scale=0.6]{wardcol2}

Figure 9.10: Correspondence analysis scores of both rows and columns.
\includegraphics[scale=0.6]{wardboth2}


9.2.2 Divisive Hierarchical Methods


cd = divisive (x, k, w, m, sv)
performs an adaptive divisive K-means cluster analysis with an appropriate (adaptive) multivariate graphic using principal components

The divisive hierarchical methods proceed in the opposite direction to the agglomerative hierarchical methods. Here, an initial single group of objects is divided into two groups such that the objects in one subgroup are far from the objects in the other. These methods can be divided into two types: $ \bf monothetic$ methods, which divide the data on the basis of the possession of a single specified attribute, and $ \bf polythetic$ methods, where divisions are based on the values taken by several attributes.
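To make the distinction concrete, here is a small polythetic divisive sketch in Python: a simple bisecting 2-means that repeatedly splits the cluster with the largest within-cluster sum of squares. The helper names are illustrative choices, and this is not the adaptive algorithm used by the divisive quantlet described below.

  # A generic sketch of a polythetic divisive procedure (bisecting 2-means).
  # Illustrative Python only; NOT the adaptive algorithm of the divisive quantlet.
  import numpy as np

  def two_means(x, n_iter=20, seed=0):
      """Split the rows of x into two groups with a plain 2-means."""
      rng = np.random.default_rng(seed)
      centers = x[rng.choice(len(x), size=2, replace=False)].astype(float)
      labels = np.zeros(len(x), dtype=int)
      for _ in range(n_iter):
          labels = np.argmin(((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
          for j in (0, 1):
              if np.any(labels == j):
                  centers[j] = x[labels == j].mean(axis=0)
      return labels

  def divisive_kmeans(x, k, seed=0):
      """Divide the rows of x into k clusters, always splitting the 'worst' cluster."""
      x = np.asarray(x, dtype=float)
      labels = np.zeros(len(x), dtype=int)
      for new_label in range(1, k):
          # pick the cluster with the largest within-cluster sum of squares
          wss = [((x[labels == c] - x[labels == c].mean(axis=0)) ** 2).sum()
                 if np.sum(labels == c) > 1 else 0.0 for c in range(new_label)]
          idx = np.where(labels == int(np.argmax(wss)))[0]
          if len(idx) < 2:
              break                                  # nothing left to split
          split = two_means(x[idx], seed=seed + new_label)
          labels[idx[split == 1]] = new_label        # one half keeps its label, the other gets a new one
      return labels

Calling divisive_kmeans(x, 4) then gives a partition comparable in spirit, though not in detail, to the example at the end of this section.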

The divisive quantlet in XploRe performs an adaptive divisive K-means cluster analysis with an appropriate (adaptive) multivariate graphic using principal components:

  cd = divisive (x, k, w, m, sv)
where x is an $ n \times p$ matrix of $ n$ row points to be clustered, k is the number of clusters, w is a $ p \times 1$ matrix of weights of the column points, m is an $ n \times 1 $ matrix of weights (masses) of the row points, and sv is a scalar seed value for the random number generator.

The outputs of the quantlet divisive are: cd.p, the partition of the $ n$ points of x into k clusters; cd.n, the number of observations in each cluster; and cd.a, the matrix of final (pooled) adaptive weights of the variables.

We illustrate the usage of the quantlet divisive in the following example.

After loading the quantlib xclust , we generate random data with 4 clusters,

  randomize(0)
  x  = normal(30, 5)
  x1 = x - #(2,1,3,0,0)'
  x2 = x + #(1,1,3,1,0.5)'
  x3 = x + #(0,0,1,5,1)'
  x4 = x - #(0,2,1,3,0)'
  x  = x1|x2|x3|x4
Then we compute the column weights (inverse variances) and the row weights,
  w  = 1./var(x)
  m  = matrix(rows(x))
Next, we apply the divisive method and compare the estimated partition with the true partition,
  cd = divisive (x, 4, w, m, 1111)
  conting (cd.p, ceil((1:120)/30))
XAGclust11.xpl

The result is the following:
  Contents of h
  [1,]        0       30        0        0
  [2,]        0        0       30        0
  [3,]       30        0        0        0
  [4,]        0        0        0       30
The output is a cross-table of the 120 observations divided into four clusters. Each cluster consists of 30 observations and corresponds exactly to one of the true classes, without any error.