11.3 Cluster Algorithms

There are essentially two types of clustering methods: hierarchical algorithms and partitioning algorithms. The hierarchical algorithms can be divided into agglomerative and splitting (divisive) procedures. The first type of hierarchical clustering starts from the finest partition possible (each observation forms a cluster) and successively merges clusters into coarser groups. The second type starts with the coarsest partition possible: one cluster containing all of the observations. It proceeds by splitting this single cluster into smaller clusters.

The partitioning algorithms start from a given group definition and proceed by exchanging elements between groups until a certain score is optimized. The main difference between the two clustering techniques is that in hierarchical clustering, once groups are found and elements are assigned to the groups, this assignment cannot be changed. In partitioning techniques, on the other hand, the assignment of objects into groups may change during the course of the algorithm.


Hierarchical Algorithms, Agglomerative Techniques

Agglomerative algorithms are used quite frequently in practice. The algorithm consists of the following steps:


Agglomerative Algorithm
  1. Construct the finest partition.
  2. Compute the distance matrix $\data{D}$.
  DO
  3. Find the two clusters with the closest distance.
  4. Put those two clusters into one cluster.
  5. Compute the distances between the new group and the remaining groups to obtain a reduced distance matrix $\data{D}$.
  UNTIL all clusters are agglomerated into $\data{X}$.
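The loop in steps 3-5 can be written down directly. The following minimal Python sketch (the function name and the four-point demo distances are illustrative choices, not from the text) uses single linkage, i.e. the minimum of the two old distances, as one concrete rule for step 5; the general rule and further choices are given in (11.9) and Table 11.2 below.

\begin{verbatim}
import numpy as np

def agglomerate(D):
    """Naive agglomerative clustering on a symmetric distance matrix D,
    using single linkage for step 5; returns the merge history."""
    D = np.asarray(D, dtype=float).copy()
    clusters = [[i] for i in range(D.shape[0])]       # step 1: finest partition
    np.fill_diagonal(D, np.inf)                       # ignore self-distances
    history = []
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)  # step 3: closest pair
        i, j = min(i, j), max(i, j)
        d_merge = float(D[i, j])
        # step 5: distance of every other cluster to the new cluster P+Q
        # (single linkage: the smaller of the two old distances)
        new_row = np.minimum(D[i, :], D[j, :])
        D[i, :], D[:, i] = new_row, new_row
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)  # reduced matrix
        clusters[i] += clusters[j]                    # step 4: unite the clusters
        history.append((sorted(clusters[i]), d_merge))
        del clusters[j]
    return history

# illustrative 4-point distance matrix
D = np.array([[ 0.,  2.,  9., 10.],
              [ 2.,  0.,  7.,  8.],
              [ 9.,  7.,  0.,  1.],
              [10.,  8.,  1.,  0.]])
print(agglomerate(D))  # [([2, 3], 1.0), ([0, 1], 2.0), ([0, 1, 2, 3], 7.0)]
\end{verbatim}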


If two objects or groups, say $P$ and $Q$, are united, one computes the distance between this new group (object) $P+Q$ and group $R$ using the following distance function:

\begin{displaymath}
d(R,P+Q) = \delta_{1}d(R,P) + \delta_{2}d(R,Q) + \delta_{3}d(P,Q) +
\delta_{4}\vert d(R,P)-d(R,Q)\vert.
\end{displaymath} (11.9)

The $\delta_{j}$'s are weighting factors that lead to different agglomerative algorithms as described in Table 11.2. Here $n_{P} = \sum ^n_{i=1} {\boldsymbol{I}}(x_{i} \in P)$ is the number of objects in group $P$. The values of $n_{Q}$ and $n_{R}$ are defined analogously.


Table 11.2: Computations of group distances.

\begin{displaymath}
\begin{array}{l|cccc}
\hline
\textrm{Name} & \delta_{1} & \delta_{2} & \delta_{3} & \delta_{4}\\
\hline
\textrm{Single linkage} & 1/2 & 1/2 & 0 & -1/2\\
\textrm{Complete linkage} & 1/2 & 1/2 & 0 & 1/2\\
\textrm{Average linkage (unweighted)} & 1/2 & 1/2 & 0 & 0\\
\textrm{Average linkage (weighted)} & \frac{n_{P}}{n_{P}+n_{Q}} & \frac{n_{Q}}{n_{P}+n_{Q}} & 0 & 0\\
\textrm{Centroid} & \frac{n_{P}}{n_{P}+n_{Q}} & \frac{n_{Q}}{n_{P}+n_{Q}} & -\frac{n_{P}\,n_{Q}}{(n_{P}+n_{Q})^{2}} & 0\\
\textrm{Median} & 1/2 & 1/2 & -1/4 & 0\\
\textrm{Ward} & \frac{n_{R}+n_{P}}{n_{R}+n_{P}+n_{Q}} & \frac{n_{R}+n_{Q}}{n_{R}+n_{P}+n_{Q}} & -\frac{n_{R}}{n_{R}+n_{P}+n_{Q}} & 0\\
\hline
\end{array}
\end{displaymath}
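For reference, the weights of Table 11.2 can be coded directly into (11.9). Below is a minimal Python sketch; the function names and the numeric values in the demo line are illustrative, not from the text.

\begin{verbatim}
def lw_weights(method, n_P=1, n_Q=1, n_R=1):
    """Weights (delta1, delta2, delta3, delta4) from Table 11.2."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average_unweighted":
        return 0.5, 0.5, 0.0, 0.0
    if method == "average_weighted":
        s = n_P + n_Q
        return n_P / s, n_Q / s, 0.0, 0.0
    if method == "centroid":
        s = n_P + n_Q
        return n_P / s, n_Q / s, -n_P * n_Q / s**2, 0.0
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":
        s = n_R + n_P + n_Q
        return (n_R + n_P) / s, (n_R + n_Q) / s, -n_R / s, 0.0
    raise ValueError(f"unknown method: {method}")

def lw_distance(d_RP, d_RQ, d_PQ, method, n_P=1, n_Q=1, n_R=1):
    """Distance between R and the merged group P+Q, formula (11.9)."""
    d1, d2, d3, d4 = lw_weights(method, n_P, n_Q, n_R)
    return d1 * d_RP + d2 * d_RQ + d3 * d_PQ + d4 * abs(d_RP - d_RQ)

# illustrative distances: d(R,P) = 50, d(R,Q) = 41, d(P,Q) = 1
print(lw_distance(50, 41, 1, "median"))   # 45.25 = 50/2 + 41/2 - 1/4
\end{verbatim}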


EXAMPLE 11.4   Let us examine the agglomerative algorithm for the three points in Example 11.2, $x_{1}=(0,0)$, $x_{2}=(1,0)$ and $x_{3} =(5,5)$, and the squared Euclidean distance matrix with single linkage weighting. The algorithm starts with $N=3$ clusters: $P=\{x_{1}\}$, $Q=\{x_{2}\}$ and $R=\{x_{3}\}$. The distance matrix $\data{D}_{2}$ is given in Example 11.2. The smallest distance in $\data{D}_{2}$ is the one between the clusters $P$ and $Q$. Therefore, applying step 4 in the above algorithm we combine these clusters to form $P+Q =
\{x_{1},x_{2}\}$. The single linkage distance between the remaining two clusters is, using Table 11.2 and (11.9), equal to
\begin{displaymath}
\begin{array}{rcl}
d(R,P+Q) &=& \frac{1}{2}\,d(R,P) + \frac{1}{2}\,d(R,Q) - \frac{1}{2}\,\vert d(R,P) - d(R,Q)\vert\\[2mm]
&=& \frac{1}{2}\,d_{13} + \frac{1}{2}\,d_{23} - \frac{1}{2}\cdot\vert d_{13}-d_{23}\vert\\[2mm]
&=& \frac{50}{2} + \frac{41}{2} - \frac{1}{2}\cdot\vert 50-41\vert\\[2mm]
&=& 41.
\end{array}
\end{displaymath} (11.10)

The reduced distance matrix is then $\left(\begin{array}{cc} 0 & 41\\ 41 & 0 \end{array}\right)$. The next and last step is to unite the clusters $R$ and $P+Q$ into a single cluster $\data{X}$, the original data matrix.
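The same two merges can be cross-checked with SciPy, which accepts the condensed distance matrix directly; since the routine simply treats the supplied values as dissimilarities, passing the squared Euclidean distances of Example 11.2 reproduces (11.10). A minimal sketch:

\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage

# condensed squared Euclidean distances for x1=(0,0), x2=(1,0), x3=(5,5):
# the order is d(x1,x2), d(x1,x3), d(x2,x3)
d = np.array([1.0, 50.0, 41.0])

Z = linkage(d, method="single")
print(Z)
# each row: [cluster i, cluster j, merge distance, size of new cluster]
# [[ 0.  1.  1.  2.]   x1 and x2 are joined first (distance 1)
#  [ 2.  3. 41.  3.]]  then x3 joins {x1,x2} at distance 41, as in (11.10)
\end{verbatim}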

When there are more data points than in the example above, a visualization of the sequence of cluster formations is desirable. A graphical representation of this sequence is called a dendrogram. It displays the observations, the sequence of clusters and the distances between the clusters. The vertical axis displays the indices of the points, whereas the horizontal axis gives the distance between the clusters. Large distances indicate the clustering of heterogeneous groups. Thus, if we choose to ``cut the tree'' at a desired level, the branches describe the corresponding clusters.
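A dendrogram of this kind can be produced, for example, with SciPy and matplotlib. The sketch below uses two purely illustrative artificial point clouds (not the 8-point example of Figure 11.1) and also shows how ``cutting the tree'' at a chosen level yields cluster labels.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

# illustrative data matrix X: two artificial clouds of 4 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (4, 2)),
               rng.normal(5, 0.5, (4, 2))])

# single linkage on squared Euclidean distances
Z = linkage(pdist(X, metric="sqeuclidean"), method="single")

# dendrogram with point indices on the vertical axis,
# cluster distances on the horizontal axis
dendrogram(Z, orientation="left", labels=list(range(1, 9)))
plt.xlabel("distance between clusters")
plt.show()

# "cutting the tree" at a chosen level yields the cluster labels
print(fcluster(Z, t=10, criterion="distance"))
\end{verbatim}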

Figure 11.1: The 8-point example. MVAclus8p.xpl
\includegraphics[width=1\defpicwidth]{clus8p1.ps}

EXAMPLE 11.5   Here we describe the single linkage algorithm for the eight data points displayed in Figure 11.1. The distance matrix ($L_{2}$-norms) is

\begin{displaymath}
\data{D}=\left(\begin{array}{rrrrrrrr}
0 & 10 & 53 & 73 & 50 & 98 & \cdots & \cdots\\
  &  0 & \cdots & \cdots & \cdots & \cdots & \cdots & \cdots\\
  &    & 0 & \cdots & \cdots & \cdots & \cdots & \cdots\\
  &    &   & 0 & \cdots & \cdots & \cdots & \cdots\\
  &    &   &   & 0 & \cdots & \cdots & \cdots\\
  &    &   &   &   & 0 & \cdots & \cdots\\
  &    &   &   &   &   & 0 & 4\\
  &    &   &   &   &   &   & 0
\end{array}\right)
\end{displaymath}

and the dendrogram is shown in Figure 11.2.

Figure 11.2: The dendrogram for the 8-point example, single linkage algorithm. MVAclus8p.xpl
\includegraphics[width=1\defpicwidth]{clus8p2.ps}

If we decide to cut the tree at the level $10$, three clusters are defined: $\{1,2\}$, $\{3,4,5\}$ and $\{6,7,8\}$.

The single linkage algorithm defines the distance between two groups as the smallest value of the individual distances. Table 11.2 shows that in this case

\begin{displaymath}
d(R,P+Q)=\min\{d(R,P), d(R,Q)\}.
\end{displaymath} (11.11)

This algorithm is also called the Nearest Neighbor algorithm. As a consequence of its construction, single linkage tends to build large groups. Two groups that differ but are not well separated may thus be joined as long as a single pair of their points lies close together. The complete linkage algorithm tries to correct this kind of grouping by considering the largest (individual) distances. Indeed, the complete linkage distance can be written as
\begin{displaymath}
d(R,P+Q) = \max \{ d(R,P), \ d(R,Q) \}.
\end{displaymath} (11.12)

It is also called the Farthest Neighbor algorithm. This algorithm joins groups only when all of their points are relatively close together, since it compares the largest distances. The average linkage algorithm (weighted or unweighted) proposes a compromise between the two preceding algorithms, in that it computes an average distance:
\begin{displaymath}
d(R,P+Q) = \frac{n_P}{n_P+n_Q} d(R,P) + \frac{n_Q}{n_P+n_Q} d(R,Q).
\end{displaymath} (11.13)
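That the weights of Table 11.2 indeed reproduce (11.11) and (11.12) is easy to verify numerically; the two distance values below are arbitrary illustrations.

\begin{verbatim}
# illustrative individual distances from R to P and from R to Q
d_RP, d_RQ = 50.0, 41.0

# single linkage, Table 11.2: (1/2, 1/2, 0, -1/2)
single = 0.5 * d_RP + 0.5 * d_RQ - 0.5 * abs(d_RP - d_RQ)
# complete linkage, Table 11.2: (1/2, 1/2, 0, 1/2)
complete = 0.5 * d_RP + 0.5 * d_RQ + 0.5 * abs(d_RP - d_RQ)

print(single == min(d_RP, d_RQ))    # True, formula (11.11)
print(complete == max(d_RP, d_RQ))  # True, formula (11.12)
\end{verbatim}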

The centroid algorithm is quite similar to the average linkage algorithm and uses the natural geometrical distance between $R$ and the weighted center of gravity of $P$ and $Q$ (see Figure 11.3):
\begin{displaymath}
d(R,P+Q) = \frac{n_P}{n_P+n_Q} d(R,P) + \frac{n_Q}{n_P+n_Q} d(R,Q)
- \frac{n_P n_Q}{(n_P+n_Q)^2}d(P,Q).
\end{displaymath} (11.14)

Figure 11.3: The centroid algorithm: the distance between $R$ and the weighted center of gravity of $P+Q$.
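A small numerical check of this geometric interpretation is sketched below, assuming that $d^2$ in (11.14) denotes squared Euclidean distances between the group centers of gravity; the group sizes and coordinates are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
P = rng.normal(0, 1, (3, 2))    # illustrative group P (n_P = 3)
Q = rng.normal(4, 1, (5, 2))    # illustrative group Q (n_Q = 5)
R = rng.normal(-3, 1, (4, 2))   # illustrative group R (n_R = 4)

n_P, n_Q = len(P), len(Q)
xbar_P, xbar_Q, xbar_R = P.mean(axis=0), Q.mean(axis=0), R.mean(axis=0)

def d2(u, v):                   # squared Euclidean distance
    return float(np.sum((u - v) ** 2))

# right-hand side of (11.14), with d^2 taken between group centroids
w_P, w_Q = n_P / (n_P + n_Q), n_Q / (n_P + n_Q)
formula_value = (w_P * d2(xbar_R, xbar_P) + w_Q * d2(xbar_R, xbar_Q)
                 - w_P * w_Q * d2(xbar_P, xbar_Q))

# squared distance from xbar_R to the weighted center of gravity of P+Q
center_PQ = w_P * xbar_P + w_Q * xbar_Q
geometric_value = d2(xbar_R, center_PQ)

print(np.isclose(formula_value, geometric_value))   # True
\end{verbatim}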

The Ward clustering algorithm computes the distance between groups according to the formula in Table 11.2. The main difference between this algorithm and the linkage procedures is in the unification procedure. The Ward algorithm does not put together the groups with the smallest distance. Instead, it joins groups that do not increase a given measure of heterogeneity ``too much''. The aim of the Ward procedure is to unify groups such that the variation inside these groups does not increase too drastically: the resulting groups are as homogeneous as possible.

The heterogeneity of group $R$ is measured by the inertia inside the group. This inertia is defined as follows:

\begin{displaymath}
I_{R}=\frac{1}{n_{R}} \sum_{i=1}^{n_{R}} d^2(x_{i},\overline x_{R})
\end{displaymath} (11.15)

where $\overline x_{R}$ is the center of gravity (mean) of group $R$. $I_{R}$ clearly provides a scalar measure of the dispersion of the group around its center of gravity. If the usual Euclidean distance is used, then $I_{R}$ represents the sum of the variances of the $p$ components of $x_{i}$ inside group $R$.

When two objects or groups $P$ and $Q$ are joined, the new group $P+Q$ has a larger inertia $I_{P+Q}$. It can be shown that the corresponding increase of inertia is given by

\begin{displaymath}
\Delta(P,Q) = \frac{n_{P}n_{Q}}{n_{P}+n_{Q}}\; d^2(P,Q).
\end{displaymath} (11.16)

In this case, the Ward algorithm is defined as the algorithm that ``joins the groups that give the smallest increase $\Delta(P,Q)$''. It is easy to prove that when $P$ and $Q$ are joined, the new criterion values are given by (11.9) along with the values of $\delta_{i}$ given in Table 11.2, when the centroid formula is used to modify $d^2(R,P+Q)$. So, the Ward algorithm is related to the centroid algorithm, but with an ``inertial'' distance $\Delta$ rather than the ``geometric'' distance $d^2$.
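Formula (11.16) can also be verified numerically. The sketch below assumes that the increase refers to the within-group sums of squared deviations $n_R I_R$ and that $d^2(P,Q)$ is the squared Euclidean distance between the centers of gravity of $P$ and $Q$; the groups themselves are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
P = rng.normal(0, 1, (4, 3))   # illustrative group P
Q = rng.normal(2, 1, (6, 3))   # illustrative group Q

def within_ssq(group):
    """Sum of squared Euclidean distances to the group's center of gravity."""
    return float(np.sum((group - group.mean(axis=0)) ** 2))

n_P, n_Q = len(P), len(Q)
PQ = np.vstack([P, Q])

# increase in the (un-normalized) inertia when P and Q are joined
increase = within_ssq(PQ) - within_ssq(P) - within_ssq(Q)

# formula (11.16), with d^2(P,Q) the squared distance between the centroids
delta = (n_P * n_Q / (n_P + n_Q)
         * float(np.sum((P.mean(axis=0) - Q.mean(axis=0)) ** 2)))

print(np.isclose(increase, delta))   # True
\end{verbatim}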

As pointed out in Section 11.2, all the algorithms above can be adjusted by the choice of the metric $\data{A}$ defining the geometric distance $d^2$. If the results of a clustering algorithm are illustrated as graphical representations of individuals in spaces of low dimension (using principal components (normalized or not) or using a correspondence analysis for contingency tables), it is important to be consistent in the choice of the metric used.

EXAMPLE 11.6   As an example we randomly select 20 observations from the bank notes data and apply the Ward technique using Euclidean distances. Figure 11.4 shows the first two PCs of these data, Figure 11.5 displays the dendrogram.

Figure 11.4: PCA for 20 randomly chosen bank notes. MVAclusbank.xpl
\includegraphics[width=1\defpicwidth]{clusbank1.ps}

Figure 11.5: The dendrogram for the 20 bank notes, Ward algorithm. MVAclusbank.xpl
\includegraphics[width=1\defpicwidth]{clusbank2.ps}
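A possible reconstruction of this example in Python is sketched below. The file name ``bank2.dat'' and its plain whitespace-separated layout are assumptions about how the Swiss bank notes data (200 observations, 6 variables) might be stored locally, and the random sample will of course differ from the one shown in Figures 11.4 and 11.5.

\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.loadtxt("bank2.dat")              # assumed local copy of the data

rng = np.random.default_rng(0)
sample = X[rng.choice(len(X), size=20, replace=False)]

# first two principal components of the 20 sampled notes (centered data)
Xc = sample - sample.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:2].T
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.title("first two PCs of 20 sampled bank notes")

# Ward algorithm on Euclidean distances and the corresponding dendrogram
# (SciPy's merge heights are a monotone transform of the Ward criterion,
#  so the tree structure is unaffected by the different scale)
Z = linkage(pdist(sample, metric="euclidean"), method="ward")
plt.figure()
dendrogram(Z)
plt.show()
\end{verbatim}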

EXAMPLE 11.7   Consider the French food expenditures. As in Chapter 9 we use standardized data which is equivalent to using $\data{A}=\mathop{\hbox{diag}}(s_{X_{1}X_{1}}^{-1},\ldots,s_{X_{7}X_{7}}^{-1})$ as the weight matrix in the $L_{2}$-norm. The NPCA plot of the individuals was given in Figure 9.7. The Euclidean distance matrix is of course given by (11.6). The dendrogram obtained by using the Ward algorithm is shown in Figure 11.6.

Figure 11.6: The dendrogram for the French food expenditures, Ward algorithm. MVAclusfood.xpl
\includegraphics[width=1\defpicwidth]{clusfood.ps}

If the aim was to have only two groups, as can be seen in Figure 11.6, they would be $\{$CA2, CA3, CA4, CA5, EM5$\}$ and $\{$MA2, MA3, MA4, MA5, EM2, EM3, EM4$\}$. Clustering into three groups is somewhat arbitrary (the levels of the distances are too similar). If we were interested in four groups, we would obtain $\{$CA2, CA3, CA4$\}$, $\{$EM2, MA2, EM3, MA3$\}$, $\{$EM4, MA4, MA5$\}$ and $\{$EM5, CA5$\}$. This grouping shows a balance between socio-professional levels and family sizes in determining the clusters. The four groups are clearly well represented in the NPCA plot in Figure 9.7.
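A sketch of this analysis in Python is given below. The file name ``food.dat'', its column layout and the label column are assumptions about local storage of the French food data; standardizing each column by its standard deviation implements the weight matrix $\data{A}$ mentioned above.

\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# assumed layout: first column holds labels such as MA2, EM5, CA3,
# followed by the 7 expenditure categories
labels = np.loadtxt("food.dat", usecols=0, dtype=str)
X = np.loadtxt("food.dat", usecols=range(1, 8))

# standardizing the columns corresponds to A = diag(1/s_{XjXj}) in the L2-norm
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

Z = linkage(pdist(Xs, metric="euclidean"), method="ward")

# cutting the dendrogram into two and into four groups
for k in (2, 4):
    groups = fcluster(Z, t=k, criterion="maxclust")
    print(k, dict(zip(labels, groups)))
\end{verbatim}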

Summary
$\ast$
The class of clustering algorithms can be divided into two types: hierarchical and partitioning algorithms. Hierarchical algorithms start with the finest (coarsest) possible partition and put groups together (split groups apart) step by step. Partitioning algorithms start from a preliminary clustering and exchange group elements until a certain score is reached.
$\ast$
Hierarchical agglomerative techniques are frequently used in practice. They start from the finest possible structure (each data point forms a cluster), compute the distance matrix for the clusters and join the clusters that have the smallest distance. This step is repeated until all points are united in one cluster.
$\ast$
The agglomerative procedure depends on the definition of the distance between two clusters. Single linkage, complete linkage, and Ward distance are frequently used distances.
$\ast$
The process of the unification of clusters can be graphically represented by a dendrogram.