9.3 Nonhierarchical Clustering

In contrast to hierarchical clustering, which possesses a monotonically increasing ranking of strengths as clusters progressively become members of larger clusters, nonhierarchical clustering methods do not possess tree-like structures. New clusters are formed in successive steps either by merging or splitting existing clusters.

One of the nonhierarchical clustering methods is the partitioning method. A given number of clusters, say $ g$, is fixed in advance, and the objects are partitioned so as to obtain the required $ g$ clusters. In contrast to the hierarchical methods, this partitioning technique permits objects to change group membership during the cluster-formation process. The partitioning method usually begins with an initial solution, after which reallocation occurs according to some optimality criterion.

The partitioning method constructs $ k$ clusters from the data such that each cluster contains at least one object and each object belongs to exactly one cluster.

Note: $ k$ is determined by the user, so it is advisable to run the algorithm several times and to select the value of $ k$ that yields the best characteristics, as illustrated below. It is also possible to generate candidate values of $ k$ automatically and then choose the best one according to certain criteria.
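
As an illustration of this note, the following Python sketch (an assumption for illustration only, using scikit-learn's KMeans, a partitioning algorithm of the kind discussed in the next subsection, rather than an XploRe quantlet) runs the algorithm for several candidate values of $ k$ and reports the sum of within-cluster variances, so that the user can pick the value after which the criterion stops improving markedly.

  # Sketch: compare several candidate k by the sum of within-cluster
  # variances (scikit-learn exposes this as the attribute inertia_).
  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  # three artificial, well-separated clusters in two dimensions
  x = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0.0, 4.0, 8.0)])

  for k in range(2, 7):
      km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
      print(k, km.inertia_)   # choose the k where the decrease levels off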


9.3.1 K-means Method


ckm = kmeans (x, b, it{, w, m})
performs a cluster analysis, i.e. computes a partition of $ n$ row points into $ K$ clusters.
ck = kmcont (x, k, t)
performs a K-means cluster analysis of the rows of a contingency table, including a multivariate graphic using correspondence analysis; it makes the factorial coordinates (scores) available.

This method was developed by MacQueen (1967), who suggested the name K-means for his algorithm that assigns each item to the cluster having the nearest centroid (mean). The process consists of three steps:

  1. Partition the items into $ K$ initial clusters
  2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
  3. Repeat step 2 until no more assignments take place
It is better to determine $ K$ initial centroids (seed points) first, before proceeding to step 2.

This method tries to minimize the sum of the within-cluster variances.

$\displaystyle V_{K} = \sum^K_{k=1} \sum^n_{i=1} \delta_{ik}\, m_i\, d^2 \left( x_{i}, \overline x_{k} \right)$ (9.13)

The indicator function $ {\delta_{ik}}$ equals 1 if the observation $ x_i$ belongs to cluster $ k$, and 0 otherwise. Furthermore, the element $ \overline x_{kj}$ of the vector $ \overline x_k$ is the mean value of variable $ j$ in cluster $ k$:

$\displaystyle \overline x_{kj} = {1 \over n_k} \sum^n_{i=1} \delta_{ik}\, m_i\, x_{ij}$ (9.14)

We denote the mass of the cluster $ k$ with $ n_k$, which is equal to the sum of the masses of all observations belonging to the cluster $ k$.

The above criterion of the K-means method can be derived straightforwardly using the maximum likelihood approach, assuming that the populations are independent and normally distributed.
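
To make steps 1-3 and the criterion (9.13) concrete, here is a minimal Python sketch (an illustration assuming the squared Euclidean distance and user-supplied masses $ m_i$; the function name is hypothetical and it is not the XploRe quantlet described below).

  import numpy as np

  def kmeans_masses(x, labels, masses, iterations=100):
      """Minimal K-means with observation masses m_i, cf. (9.13)-(9.14).

      x      : (n, p) data matrix
      labels : (n,)   initial partition with values 0, ..., K-1
      masses : (n,)   observation weights m_i
      Assumes that no cluster becomes empty during the iteration.
      """
      K = int(labels.max()) + 1
      for _ in range(iterations):
          # centroids: mass-weighted means per cluster, cf. (9.14)
          centroids = np.array([
              np.average(x[labels == k], axis=0, weights=masses[labels == k])
              for k in range(K)
          ])
          # squared Euclidean distances of every point to every centroid
          d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
          new_labels = d2.argmin(axis=1)          # step 2: reassign items
          if np.array_equal(new_labels, labels):  # step 3: stop if unchanged
              break
          labels = new_labels
      V_K = (masses * d2[np.arange(len(x)), labels]).sum()  # criterion (9.13)
      return labels, centroids, V_K

With all masses equal to one, this reduces to the ordinary K-means procedure.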

Below is the usage of the quantlet kmeans in XploRe. It computes a partition of $ n$ row points into $ K$ clusters.

  ckm = kmeans (x, b, it{, w, m})
where x is an $ n \times p$ matrix, b is an $ n \times 1 $ matrix giving the initial partition (for example randomly generated cluster numbers $ 1,2,...,K$), it is the number of iterations, w is a $ p \times 1$ matrix with the weights of the column points, and m is an $ n \times 1 $ matrix of weights (masses) of the row points.

The output of the quantlet kmeans consists of ckm.g, a matrix containing the final partition; ckm.c, a matrix of the means (centroids) of the $ K$ clusters; ckm.v, a matrix of the within-cluster variances divided by the weight (mass) of the clusters; and ckm.s, a matrix of the weights (masses) of the clusters.

In the following example, we generate random data with 3 clusters

  randomize(0)
  x  = normal(100, 4)            ; generate random normal data
  x1 = x - #(2,1,3,0)'           ; shift the data to form cluster 1
  x2 = x + #(1,1,3,1)'           ; shift the data to form cluster 2
  x3 = x + #(0,0,1,5)'           ; shift the data to form cluster 3
  x  = x1|x2|x3
  b  = ceil(uniform(rows(x)).*3) ; generate a random partition
Furthermore, we apply K-means clustering to the data and display the initial and the final partition:
  {g, c, v, s} = kmeans(x, b, 100)
  b~g
XAGclust12.xpl

The results of the initial and the final partition of the data in 3 clusters are as follows.
  Contents of object _tmp
  [  1,]        1        2
  [  2,]        3        2
  [  3,]        1        2
  ...
  [297,]        1        1
  [298,]        2        1
  [299,]        1        1
  [300,]        2        1


9.3.2 Adaptive K-means Method


ca = adaptive (x, k, w, m, t)
performs an adaptive K-means cluster analysis with an appropriate (adaptive) multivariate graphic using principal components

In order to increase the stability of the cluster analysis, specific or adaptive weights can be used in the distance formula instead of the ordinary weights $ q_{jj} = {1 \over s_j^2}$ or $ q_{jj}=1$.

For example, the simple adaptive weights

$\displaystyle q_{jj}= {1 \over \overline s^2_j}$ (9.15)

can be used in the squared weighted Euclidean distance, where $ \overline s_j$ is the pooled standard deviation of the variable $ j$:

$\displaystyle \overline s_j^2 = {1 \over M} \sum^K_{k=1} \sum^n_{i=1} \delta_{ik}m_i \left( x_{ij} - \overline x_{kj}\right)^2$ (9.16)

The indicator $ \delta_{ik}$ is defined in the usual way. For simplicity, use $ M = \sum^n_{i=1} m_i$, i.e. $ M$ becomes independent of the number of clusters $ K$.

The ``true'' pooled standard deviations cannot be computed in advance in cluster analysis, because the cluster structure is usually unknown. However, it is known that the pooled standard deviations of a random partition are nearly equal to the total standard deviations. Therefore, starting with the weights $ q_{jj}=1/s_j^2$ and a random initial partition $ P^0(n,K)$, the K-means method computes a (local) optimum partition $ P^1(n,K)$ of the $ n$ observations into $ K$ clusters.
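
One plausible way to continue this idea is to alternate between the K-means step and a recomputation of the pooled within-cluster variances (9.16), which then serve as updated weights. The following Python sketch illustrates this under simplifying assumptions (equal masses $ m_i = 1$, scikit-learn's KMeans as the inner clustering step, and a fixed number of weight updates); it is not the adaptive quantlet itself.

  import numpy as np
  from sklearn.cluster import KMeans

  def adaptive_kmeans(x, K, steps=5):
      """Adaptive weighting: start with q_jj = 1/s_j^2, then use
      1/(pooled within-cluster variance), cf. (9.15)-(9.16)."""
      q = 1.0 / x.var(axis=0)                   # start: q_jj = 1 / s_j^2
      for _ in range(steps):
          # weighted Euclidean distance = ordinary distance on rescaled data
          km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(x * np.sqrt(q))
          labels = km.labels_
          # pooled within-cluster variances on the original scale, cf. (9.16)
          s2 = np.zeros(x.shape[1])
          for k in range(K):
              xk = x[labels == k]
              s2 += ((xk - xk.mean(axis=0)) ** 2).sum(axis=0)
          s2 /= len(x)                          # M = sum of the masses = n here
          q = 1.0 / s2                          # updated adaptive weights
      return labels, q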

Below is the usage of the quantlet adaptive, which performs an adaptive K-means cluster analysis with an appropriate (adaptive) multivariate graphic using principal components:

  ca = adaptive(x, k, w, m, t)
The following is an example of adaptive clustering in XploRe:
  randomize(0)
  x  = normal(200, 5)            ; generate random normal data
  x1 = x - #(2,1,3,0,0)'         ; shift the data to form cluster 1
  x2 = x + #(1,1,3,1,0.5)'       ; shift the data to form cluster 2
  x3 = x + #(0,0,1,5,1)'         ; shift the data to form cluster 3
  x  = x1|x2|x3
  w  = 1./var(x)                 ; column weights (inverse variances)
  m  = matrix(rows(x))           ; masses of the row points (all ones)
  t  = matrix(200)|matrix(200).+1|matrix(200).+2   ; true partition
  ca = adaptive (x, 3, w, m, t)  ; apply adaptive clustering
XAGclust13.xpl

The result is shown below.

Figure 9.11: Start and final partition with adaptive clustering.
\includegraphics[scale=0.55]{adapdisp}

The adaptive clustering gives a partition ca.b of the row points into 3 clusters, which minimizes the sum of within-cluster variances according to the column weights (the reciprocals of the pooled within-cluster variances).


9.3.3 Hard C-means Method


v = xchcme (x, c, m, e)
performs a hard C-means cluster analysis

Fuzzy sets were introduced by Zadeh (1965). They offer a new way to isolate and identify functional relationships, both qualitative and quantitative, which is also called pattern recognition.

In general, fuzzy models can be constructed in two ways: from expert knowledge or by identification from measured data.

We concentrate only on the identification techniques. One of these techniques is the fuzzy clustering method. With a sufficiently informative identification data set, this method does not require any prior knowledge about the partitioning of the domains. Moreover, the use of membership values provides more flexibility and makes the clustering results locally interpretable; they often correspond well with the local behaviour of the identified process.

The idea of fuzzy clustering came from the hard C-means of Ruspini (1969). He introduced the fuzzification of hard C-means to accommodate intergrades in situations where the groups are not well separated, with hybrid points lying between groups. The objective function is:

$\displaystyle J_m(U,P : X) = \sum_{k=1}^n \sum_{i=1}^c u_{ik} d^2 (x_k,v_i),$ (9.17)

where $ X = (x_1, x_2, ..., x_n)$ denotes the $ n$ data sample vectors, $ U$ is a partition of $ X$ into $ c$ parts, $ P = (v_1, v_2,..., v_c)$ are the cluster centers in $ R^p$, $ d^2(x_k,v_i)$ is the squared inner-product-induced norm on $ R^p$, and $ u_{ik}$ is referred to as the grade of membership of $ x_k$ in cluster $ i$; in the hard case $ u_{ik}$ is either 0 or 1.
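
As a small numerical illustration of (9.17), the following Python fragment (an illustrative helper, not part of XploRe) evaluates the hard C-means objective for a crisp partition, using the ordinary Euclidean norm.

  import numpy as np

  def hard_cmeans_objective(x, labels, centers):
      """J(U, P; X) of (9.17) with crisp memberships u_ik in {0, 1}."""
      u = (labels[:, None] == np.arange(len(centers))[None, :]).astype(float)
      d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
      return (u * d2).sum()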

The syntax of this algorithm in XploRe is the following:

  hcm=xchcme(x,c,m,e)
The inputs are the following: $ \tt x$ is an $ n \times p$ matrix of the $ n$ row points to be clustered, $ \tt c$ is a scalar giving the number of clusters, $ \tt m$ is the exponent weight factor ($ m = 1$ for the hard case), and e is the termination tolerance; the membership matrix u is initialized from a uniform distribution.

As an example, we use the butterfly data set taken from Bezdek (1981). This data set is a $ 15 \times 2$ matrix. It is called ``butterfly'' because its scatterplot resembles a butterfly.

After loading the quantlib xclust, we load the data set

  library("xclust")
  z=read("butterfly.dat")
  x=z[,2:3]
  c=2
  m=1
  e=0.001
and apply hard C-means clustering
  hcm=xchcme(x,c,m,e)
  hcm.clus
  d=createdisplay(1,1)
  setmaskp(x,hcm.clus,hcm.clus+2,8)
  show(d,1,1,x)
  title="Hard-c-means for Butterfly Data"
  setgopt(d,1,1,"title", title)
XAGclust14.xpl

The result is as follows:
  Contents of hcm.clus
  [ 1,]        2
  [ 2,]        2
  ...
  [ 7,]        2
  [ 8,]        2
  [ 9,]        1
  ...
  [14,]        1
  [15,]        1

Figure 9.12: Hard C-means for butterfly data.
\includegraphics[scale=0.6]{butterh}

From Figure 9.12, we can see that the data separate into two clusters. Although observation number $ 8$, namely $ (3,2)$, lies exactly in the middle, it must belong either to the first or to the second cluster; no additional cluster can be constructed for it. In this example we see that this observation is assigned to the cluster containing the first seven observations.


9.3.4 Fuzzy C-means Method


v = xcfcme (x, c, m, e, alpha)
performs a fuzzy C-means cluster analysis

One approach to fuzzy clustering, probably the best known and most commonly used, is the fuzzy C-means of Bezdek (1981). Before Bezdek, Dunn (1973) had developed a fuzzy C-means algorithm. The idea of Dunn's algorithm is to extend the classical within-groups sum-of-squared-errors objective function to a fuzzy version and to minimize this objective function. Bezdek generalized this fuzzy objective function by introducing the weighting exponent $ m$, $ 1 \leq m < \infty$:

$\displaystyle J_m(U,P : X) = \sum_{k=1}^n \sum_{i=1}^c (u_{ik})^m d^2 (x_k,v_i),$ (9.18)

where $ U$ is a partition of $ X$ into $ c$ parts, $ P = v = (v_1, v_2,..., v_c)$ are the cluster centers in $ R^p$, and $ A$ is any $ (p \times p)$ symmetric positive definite matrix defining the distance:

$\displaystyle d^2 (x_k,v_i) = \Vert {x_k - v_i} \Vert _A^2 = (x_k - v_i)^T A(x_k - v_i),$ (9.19)

where $ d^2(x_k,v_i)$ is the squared inner-product-induced norm on $ R^p$ and $ u_{ik}$ is referred to as the grade of membership of $ x_k$ in cluster $ i$. This grade of membership satisfies the following constraints:

$ 0 \le u_{ik} \le 1, \quad \textrm{for} \; 1 \le i \le c, \; 1 \le k \le n,$
$ 0 < \sum_{k=1}^n u_{ik} < n, \quad \textrm{for} \; 1 \le i \le c,$
$ \sum_{i=1}^c u_{ik} = 1, \quad \textrm{for} \; 1 \le k \le n.$

The fuzzy C-means (FCM) uses an iterative optimization of the objective function, based on the weighted similarity measure between $ x_k$ and the cluster center $ v_i$.

Steps of the fuzzy C-means algorithm, according to Hellendorn and Driankov (1997) are the following:

  1. Given a data set $ X = \{x_1, x_2, ..., x_n\}$, select the number of clusters $ 2 \leq c < n$, the maximum number of iterations $ T$, the distance norm $ {\parallel \bullet \parallel}_A$, the fuzziness parameter $ m$, and the termination condition $ \varepsilon > 0$.
  2. Give an initial value $ U_0 \in M_{fcn}$.
  3. For $ t=1,2, ..., T$
    1. Calculate the $ c$ cluster centers $ \{v_{i,t}\}, i=1, ..., c$

      $\displaystyle v_{i,t} = \frac {\sum_{k=1}^n u_{ik,t-1}^m x_k} {\sum_{k=1}^n u_{ik,t-1}^m}$ (9.20)

    2. Update the membership matrix. Check for the occurrence of singularities ($ d_{ik,t} = {\parallel x_k - v_{i,t} \parallel}_A = 0$). Let $ {\mit\Upsilon} = \{1, ...,c\}$, $ {\mit\Upsilon}_{k,t} = \{i \in {\mit\Upsilon} \vert d_{ik,t}=0 \}$, and $ \overline{\mit\Upsilon}_{k,t}= {\mit\Upsilon} \backslash {\mit\Upsilon}_{k,t}$. Then calculate the following

      $\displaystyle u_{ik,t} = \left[ \sum_{j=1}^c \left(\frac {d_{ik,t}} {d_{jk,t}}\right)^{ \frac {2} {m-1}} \right]^{-1}, \; \textrm{if} \; {\mit\Upsilon}_{k,t} = \emptyset$ (9.21)

      Otherwise set $ u_{ik,t} = 0$ for all $ i \in \overline{\mit\Upsilon}_{k,t}$ and choose $ u_{ik,t} = 1/\char93 ({\mit\Upsilon}_{k,t})$ for all $ i \in {\mit\Upsilon}_{k,t}$; $ \char93 (\cdot)$ denotes the cardinality of a set.
  4. If $ E_t = \parallel U_{t-1} - U_t \parallel \leq \varepsilon$, then stop; otherwise return to step 3.

This procedure converges to a local minimum or a saddle point of $ J_m$. The FCM algorithm computes the partition matrix $ U$ and the clusters' prototypes in order to derive the fuzzy models from these matrices.

In pseudocode, the algorithm reads:

 Initialize membership (U)
 iter = 0
 Repeat {Picard iteration}
     iter = iter+1
     Calculate cluster center (C)
     Calculate distance of data to centroid ||X-C||
     U'=U
     Update membership U
 Until ||U-U'|| <= tol_crit .or. iter = Max_iter
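
The pseudocode translates almost line by line into the following Python sketch (a simplified illustration with the Euclidean norm, i.e. $ A = I$, and a crude guard instead of the full singularity handling of step 3.2; the function name and default values are assumptions, not the XploRe quantlet).

  import numpy as np

  def fuzzy_cmeans(x, c, m=1.25, tol=1e-3, max_iter=100, seed=0):
      """Picard iteration for fuzzy C-means, cf. (9.18), (9.20) and (9.21)."""
      rng = np.random.default_rng(seed)
      u = rng.uniform(size=(len(x), c))
      u /= u.sum(axis=1, keepdims=True)          # rows of U sum to one
      for _ in range(max_iter):
          w = u ** m
          v = (w.T @ x) / w.sum(axis=0)[:, None]                   # centers, cf. (9.20)
          d2 = ((x[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)  # squared distances
          d2 = np.maximum(d2, 1e-12)             # crude guard against d_ik = 0
          p = d2 ** (1.0 / (m - 1))              # equals d_ik^{2/(m-1)}
          u_new = 1.0 / (p * (1.0 / p).sum(axis=1, keepdims=True)) # cf. (9.21)
          if np.abs(u_new - u).max() <= tol:     # termination ||U_t - U_{t-1}||
              u = u_new
              break
          u = u_new
      return u, v

A crisp partition can then be obtained, for example, with u.argmax(axis=1).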
The syntax of this algorithm in XploRe is
  fcm=xcfcme(x,c,m,e,alpha)
The inputs are the following: $ \tt x$ is an $ n \times p$ matrix of the $ n$ row points to be clustered, $ \tt c$ is the number of clusters, $ \tt m$ is the exponent weight factor $ (m > 1)$, e is the termination tolerance, and the membership matrix u is initialized from a uniform distribution.

Below is an example. We use the same butterfly data as in the previous example (quantlet xchcme), and we follow exactly the same procedure, except for the part applying fuzzy C-means clustering.

  library("xclust")
  z=read("butterfly.dat")
  x=z[,2:3]
  c=2
  m=1.25
  e=0.001
  alpha=0.9
  fcm=xcfcme(x,c,m,e,alpha) ; apply fuzzy c-means clustering
  fcm.clus
  d=createdisplay(1,1)
  setmaskp(x,fcm.clus,fcm.clus+3,8)
  show(d,1,1,x)
  title="Fuzzy-c-means for Butterfly Data"
  setgopt(d,1,1,"title", title)
XAGclust15.xpl

The result is as follows:

  Contents of fcm.clus
  [ 1,]        1
  [ 2,]        1
  ...
  [ 8,]        3
  [ 9,]        2
  ...
  [14,]        2
  [15,]        2
This result is shown in Figure 9.13.

Figure 9.13: Fuzzy C-means for butterfly data
\includegraphics[scale=0.6]{butterf}

Using $ \tt m = 1.25$ and $ \tt alpha = 0.9$, we can see that not all observations belong to the first or the second cluster: the 8-th observation forms a new cluster, because it has the same distance to the centers of both previous clusters.

As another example, we use bank2.dat, which was also analyzed with the Ward method.

After loading the quantlib xclust, we load the bank2 data set. We proceed exactly as in the previous example, using the quantlet XAGclust16.xpl.

The result is as follows:

  [  1,]        1
  [  2,]        1
  ...
  ...
  [ 98,]        1
  [ 99,]        1
  [100,]        1
  [101,]        2
  [102,]        2
  [103,]        1
  [104,]        2
  ...
  [199,]        2
  [200,]        2
If we compare with the Ward method, depicted in Figure 9.14, we do not obtain exactly the same clusters. With fuzzy C-means, the $ 103$-rd observation belongs to the first cluster, whereas with the Ward method the $ 70$-th observation belongs to the second cluster. To make it easier to see how the data are clustered, we plot the variables $ X_4$, the distance of the inner frame to the lower border, against $ X_6$, the length of the diagonal of the central picture.

Figure 9.14: Ward method vs fuzzy C-means for Swiss banknotes data (X4 vs X6) with two clusters
\includegraphics[scale=0.6]{wardfuz2}

Using the contingency table, we can conclude that both methods construct essentially the same clusters, up to the labelling of the clusters.


Table 9.3: Contingency table between the Ward method and the fuzzy C-means method with two clusters.

  Ward \ Fuzzy    Cluster 1   Cluster 2   Total
  Cluster 1            1          99        100
  Cluster 2           99           1        100
  Total              100         100        200
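
A contingency table such as Table 9.3 can be obtained by cross-tabulating the two label vectors. Below is a minimal Python sketch (pandas is assumed; the two label arrays are illustrative placeholders whose ordering is arbitrary and whose counts merely reproduce Table 9.3).

  import pandas as pd

  # ward_labels and fcm_labels stand for the cluster labels (one per banknote)
  # produced by the Ward method and by fuzzy C-means.
  ward_labels = [1] * 100 + [2] * 100
  fcm_labels  = [1] + [2] * 99 + [1] * 99 + [2]

  table = pd.crosstab(pd.Series(ward_labels, name="Ward"),
                      pd.Series(fcm_labels, name="Fuzzy"),
                      margins=True, margins_name="Total")
  print(table)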


Now we want to compare both methods for three clusters, depicted in Figure 9.15, using the quantlet XAGclust17.xpl.

Figure 9.15: Ward method vs fuzzy C-means for Swiss banknotes data (X4 vs X6) with three clusters
\includegraphics[scale=0.6]{wardfuz3}

Using the contingency table, we see that 16 observations that are in the second cluster of the Ward method belong to the third cluster of fuzzy C-means.

Table 9.4: Contingency table between the Ward method and the fuzzy C-means method with three clusters.

  Ward \ Fuzzy    Cluster 1   Cluster 2   Cluster 3   Total
  Cluster 1           100          0           0       100
  Cluster 2             0         48          16        64
  Cluster 3             0          0          36        36
  Total               100         48          52       200