11.3 Cluster Algorithms
There are essentially two types of clustering methods: hierarchical
algorithms and partitioning algorithms. The hierarchical algorithms can
be divided into agglomerative and splitting procedures. The
first type of hierarchical clustering starts from the finest
partition possible (each observation forms a cluster) and groups them
step by step. The second type starts with the coarsest
partition possible: one cluster contains all of the observations.
It proceeds by splitting the single cluster up into smaller clusters.
The partitioning algorithms start from a given group definition and
proceed by exchanging elements between groups until a certain score
is optimized. The main difference between the two clustering
techniques is that in hierarchical clustering, once groups are found
and elements are assigned to the groups, this assignment cannot be
changed. In partitioning techniques, on the other hand,
the assignment of objects into groups may change during the
application of the algorithm.
Hierarchical Algorithms, Agglomerative Techniques

Agglomerative algorithms are used quite frequently in practice. The
algorithm consists of the following steps:

1. Construct the finest partition, i.e., each observation forms its own cluster.
2. Compute the distance matrix $\mathcal{D}$.
3. Find the two clusters with the smallest distance.
4. Put those two clusters into one cluster.
5. Compute the distances between the new cluster and the remaining clusters to obtain a reduced distance matrix $\mathcal{D}$.
6. Repeat steps 3-5 until all observations are united in one cluster.
If two objects or groups, say $P$ and $Q$, are united, one computes the
distance between this new group (object) $P\cup Q$ and group $R$ using
the following distance function:

$$d(R, P\cup Q) = \delta_1\, d(R,P) + \delta_2\, d(R,Q) + \delta_3\, d(P,Q) + \delta_4\, |d(R,P) - d(R,Q)|. \qquad (11.9)$$

The $\delta_j$'s are weighting factors that lead to different
agglomerative algorithms as described in Table 11.2.
Here $n_P$ is the number of
objects in group $P$. The values of $n_Q$ and $n_R$ are defined
analogously.
Table 11.2. Computations of group distances.

  Single linkage:                $\delta_1 = \delta_2 = 1/2$, $\delta_3 = 0$, $\delta_4 = -1/2$
  Complete linkage:              $\delta_1 = \delta_2 = 1/2$, $\delta_3 = 0$, $\delta_4 = 1/2$
  Average linkage (unweighted):  $\delta_1 = \delta_2 = 1/2$, $\delta_3 = \delta_4 = 0$
  Average linkage (weighted):    $\delta_1 = n_P/(n_P+n_Q)$, $\delta_2 = n_Q/(n_P+n_Q)$, $\delta_3 = \delta_4 = 0$
  Centroid:                      $\delta_1 = n_P/(n_P+n_Q)$, $\delta_2 = n_Q/(n_P+n_Q)$, $\delta_3 = -n_P n_Q/(n_P+n_Q)^2$, $\delta_4 = 0$
  Ward:                          $\delta_1 = (n_R+n_P)/(n_R+n_P+n_Q)$, $\delta_2 = (n_R+n_Q)/(n_R+n_P+n_Q)$, $\delta_3 = -n_R/(n_R+n_P+n_Q)$, $\delta_4 = 0$
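A minimal sketch of this update rule, with the single and complete linkage weights from Table 11.2 (the function name and the numeric distances are illustrative, not from the text):

```python
# General distance update d(R, P+Q) = d1*d(R,P) + d2*d(R,Q)
#                                   + d3*d(P,Q) + d4*|d(R,P) - d(R,Q)|
def update_distance(d_RP, d_RQ, d_PQ, deltas):
    """Distance between group R and the merged group P+Q."""
    d1, d2, d3, d4 = deltas
    return d1 * d_RP + d2 * d_RQ + d3 * d_PQ + d4 * abs(d_RP - d_RQ)

# Weighting factors (delta_1, delta_2, delta_3, delta_4) from Table 11.2
SINGLE_LINKAGE = (0.5, 0.5, 0.0, -0.5)   # reduces to min(d_RP, d_RQ)
COMPLETE_LINKAGE = (0.5, 0.5, 0.0, 0.5)  # reduces to max(d_RP, d_RQ)

# With illustrative distances d(R,P)=50, d(R,Q)=41, d(P,Q)=1:
print(update_distance(50, 41, 1, SINGLE_LINKAGE))    # 41.0
print(update_distance(50, 41, 1, COMPLETE_LINKAGE))  # 50.0
```

The min/max identities follow from $\frac{1}{2}a + \frac{1}{2}b \mp \frac{1}{2}|a-b| = \min\{a,b\}$ resp. $\max\{a,b\}$.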
Let us examine the agglomerative algorithm
for the three points in Example 11.2
and the squared Euclidean distance matrix
with single linkage weighting. The algorithm starts with
$N=3$ clusters, one for each point. The distance matrix $\mathcal{D}$
is given in Example 11.2. The smallest distance in $\mathcal{D}$ is
the one between the clusters $P$ and $Q$. Therefore, applying step 4 in
the above algorithm, we combine these clusters to form
$P\cup Q$. The single linkage distance between the remaining
cluster $R$ and $P\cup Q$ is (from Table 11.2) equal to

$$d(R, P\cup Q) = \frac{1}{2}\, d(R,P) + \frac{1}{2}\, d(R,Q) - \frac{1}{2}\, |d(R,P) - d(R,Q)| = \min\{d(R,P),\, d(R,Q)\}.$$

The reduced distance matrix then contains only this single
distance. The next and last step is to unite $R$ and $P\cup Q$
into a single cluster, the original data set.
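This one merge step can be replayed numerically. The coordinates below are made-up stand-ins (Example 11.2 defines the actual data); the mechanics are the same:

```python
# Hypothetical three points; Example 11.2 supplies the real ones.
points = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (5.0, 5.0)}

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Squared Euclidean distance matrix, stored as upper-triangle entries
d = {(i, j): sq_dist(points[i], points[j])
     for i in points for j in points if i < j}

# Step 3/4: the smallest distance identifies the pair (P, Q) to merge
(p, q) = min(d, key=d.get)
r = ({1, 2, 3} - {p, q}).pop()

# Step 5, single linkage update: d(R, P+Q) = min(d(R,P), d(R,Q))
d_merged = min(d[tuple(sorted((r, p)))], d[tuple(sorted((r, q)))])
print(p, q, d_merged)
```

Here the two closest points are merged first, and the reduced distance matrix consists of the single value `d_merged`.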
When there are more data points than in the example above,
a visualization of the sequence of cluster mergers is desirable.
A graphical representation of
the sequence of clustering is called a dendrogram. It displays
the observations, the sequence of clusters and the distances between
the clusters. The vertical axis displays the indices of the points, whereas
the horizontal axis gives the distance between the clusters.
Large distances indicate the clustering of heterogeneous
groups. Thus, if we choose to ``cut the tree'' at a desired level, the
branches describe the corresponding clusters.
Here we describe the single linkage algorithm for the
eight data points displayed in Figure 11.1.
The distance matrix and the resulting dendrogram are shown in Figure 11.2.

The dendrogram for the 8-point example, single linkage algorithm.

If we decide to cut the tree at a suitable level, three
clusters are defined.
The single linkage algorithm defines
the distance between two
groups as the smallest value of the individual distances.
Table 11.2 shows that in this case

$$d(R, P\cup Q) = \min\{d(R,P),\, d(R,Q)\}.$$

This algorithm is also called the
Nearest Neighbor algorithm.
As a consequence of its construction,
single linkage tends to build large groups.
Groups that differ but are not well separated may thus be
classified into one group as long as they contain a pair of points
that are close to each other.
The complete linkage algorithm tries to correct
this kind of grouping by
considering the largest (individual) distances. Indeed,
the complete linkage distance can be written as

$$d(R, P\cup Q) = \max\{d(R,P),\, d(R,Q)\}.$$
It is also called the
Farthest Neighbor algorithm.
This algorithm will cluster groups where all the points are proximate,
since it compares the largest distances.
The average linkage algorithm (weighted or unweighted) proposes
a compromise between the two preceding algorithms,
in that it computes an average distance:

$$d(R, P\cup Q) = \delta_1\, d(R,P) + \delta_2\, d(R,Q),$$

with $\delta_1 = \delta_2 = 1/2$ in the unweighted case and
$\delta_1 = n_P/(n_P+n_Q)$, $\delta_2 = n_Q/(n_P+n_Q)$ in the weighted case.
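The three linkage rules can equivalently be viewed as the minimum, maximum, and mean of all pairwise distances between two groups (the mean corresponding to the unweighted average linkage). A small sketch with made-up points:

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Two hypothetical groups of observations
group_a = [(0.0, 0.0), (1.0, 0.0)]
group_b = [(4.0, 3.0), (5.0, 5.0)]

# All individual distances between the groups
pairwise = [sq_dist(x, y) for x in group_a for y in group_b]

single   = min(pairwise)                  # nearest neighbour
complete = max(pairwise)                  # farthest neighbour
average  = sum(pairwise) / len(pairwise)  # unweighted average linkage
print(single, complete, average)
```

For these points the group distances are 18.0 (single), 50.0 (complete), and 33.5 (average), illustrating how average linkage sits between the two extremes.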
The centroid algorithm is quite similar
to the average linkage algorithm and uses
the natural geometric distance between $\bar{x}_R$ and the weighted center of
gravity of $\bar{x}_P$ and $\bar{x}_Q$ (see Figure 11.3):

$$d(R, P\cup Q) = d\!\left(\bar{x}_R,\ \frac{n_P\,\bar{x}_P + n_Q\,\bar{x}_Q}{n_P + n_Q}\right),$$

where $d$ denotes the squared Euclidean distance.

The centroid algorithm.
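That the centroid weights of Table 11.2 reproduce this geometric construction (when $d$ is the squared Euclidean distance between group centroids) can be verified numerically. The groups below are hypothetical:

```python
def centroid(pts):
    """Center of gravity (componentwise mean) of a group of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

P = [(0.0, 0.0), (1.0, 0.0)]
Q = [(2.0, 2.0)]
R = [(5.0, 5.0), (6.0, 4.0)]
nP, nQ = len(P), len(Q)

# Geometric construction: distance from the centroid of R to the
# weighted center of gravity of P and Q
direct = sq_dist(centroid(R), centroid(P + Q))

# Lance-Williams update with the centroid weights of Table 11.2
update = (nP / (nP + nQ) * sq_dist(centroid(R), centroid(P))
          + nQ / (nP + nQ) * sq_dist(centroid(R), centroid(Q))
          - nP * nQ / (nP + nQ) ** 2 * sq_dist(centroid(P), centroid(Q)))

print(abs(direct - update) < 1e-12)  # True
```

Both routes give the same group distance, which is why the centroid algorithm can be run with the recursive update alone, without revisiting the raw data.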
The Ward clustering algorithm computes the distance between groups
according to the formula in Table 11.2. The main difference
between this algorithm and the linkage procedures is in the
unification procedure. The Ward algorithm does not put together groups
with the smallest distance. Instead, it joins groups that do not increase
a given measure of heterogeneity ``too much''.
The aim of the Ward procedure is to
unify groups such that the variation inside these groups does not
increase too drastically: the resulting groups are as homogeneous as
possible.
The heterogeneity of group $R$ is measured by the inertia inside the
group. This inertia is defined as follows:

$$I_R = \frac{1}{n_R} \sum_{x_i \in R} d^2(x_i, \bar{x}_R),$$

where $\bar{x}_R$ is the center of gravity (mean) over the
group. $I_R$ clearly provides a scalar measure of the dispersion
of the group around its center of gravity. If the usual Euclidean
distance is used, then $I_R$ represents the sum of the variances of the
components of $x_i$ inside group $R$.

When two objects or groups $P$ and $Q$ are joined, the new group
$P\cup Q$ has a larger inertia $I_{P\cup Q}$. It can be shown that the
corresponding increase of the size-weighted inertia,
$n_{P\cup Q} I_{P\cup Q} - n_P I_P - n_Q I_Q$, is given by

$$\Delta(P,Q) = \frac{n_P\, n_Q}{n_P + n_Q}\, d^2(\bar{x}_P, \bar{x}_Q).$$
In this case,
the Ward algorithm is defined as an algorithm that
``joins the groups that give
the smallest increase in $\Delta(P,Q)$''. It is easy to prove that
when $P$ and $Q$ are joined, the new criterion values $\Delta(R, P\cup Q)$ are
given by (11.9) along with the values of $\delta_j$ given
in Table 11.2, when the centroid formula is used to modify the distances.
So, the Ward algorithm is related to the centroid algorithm, but
with an ``inertial'' distance $\Delta$ rather than the ``geometric''
distance $d$.
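The claimed increase of heterogeneity can be checked numerically. In the sketch below (hypothetical groups), heterogeneity is measured by the within-group sum of squared distances to the centroid, i.e., the size-weighted inertia:

```python
def centroid(pts):
    """Center of gravity (componentwise mean) of a group of points."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def ssq(pts):
    """Within-group sum of squared distances to the centroid."""
    c = centroid(pts)
    return sum(sq_dist(x, c) for x in pts)

P = [(0.0, 0.0), (1.0, 0.0)]
Q = [(4.0, 3.0), (5.0, 5.0)]
nP, nQ = len(P), len(Q)

# Increase of heterogeneity caused by merging P and Q ...
increase = ssq(P + Q) - ssq(P) - ssq(Q)
# ... equals n_P*n_Q/(n_P+n_Q) times the squared centroid distance
delta = nP * nQ / (nP + nQ) * sq_dist(centroid(P), centroid(Q))
print(abs(increase - delta) < 1e-12)  # True
```

This identity is why Ward's criterion can be evaluated from group sizes and centroid distances alone.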
As pointed out in Section 11.2, all the algorithms above can
be adjusted by the choice of the metric defining the
geometric distance $d$. If the results of a clustering algorithm
are illustrated as graphical representations of individuals in spaces
of low dimension (using principal components (normalized or not) or
using a correspondence analysis for contingency tables), it is important to
be coherent in the choice of the metric used.
As an example we randomly select 20
observations from the bank notes data and apply the Ward technique.
Figure 11.4 shows the first two PCs of these data, and
Figure 11.5 displays the dendrogram.
The dendrogram for the 20 bank notes, Ward algorithm.
Consider the French food expenditures. As in Chapter 9
we use standardized data, which is equivalent to using the diagonal
matrix of the reciprocals of the variances as the weight matrix in the
$L_2$-norm. The NPCA plot of the
individuals was given in Figure 9.7.
The Euclidean distance matrix is of course given by (11.6).
The dendrogram obtained by using the Ward algorithm is shown in
Figure 11.6.

The dendrogram for the French food expenditures, Ward algorithm.
If the aim were to have only two groups, as can be seen in
Figure 11.6, they would be
{CA2, CA3, CA4, CA5, EM5} and
{MA2, MA3, MA4, MA5, EM2, EM3, EM4}.
Clustering into three groups is somewhat arbitrary (the levels of the
distances are too similar). If we were
interested in four groups, we would obtain
{CA2, CA3, CA4},
{EM2, MA2, EM3, MA3},
{EM4, MA4, MA5} and
{CA5, EM5}.
This grouping shows a balance between socio-professional levels and size of
the families in determining the clusters. The four groups are clearly
well represented in the NPCA plot in Figure 9.7.
The class of clustering algorithms can be divided into two types:
hierarchical and partitioning algorithms.
Hierarchical algorithms start with the finest (coarsest) possible
partition and put groups together (split groups apart) step by step.
Partitioning algorithms start from a preliminary clustering and
exchange group elements until a certain score is reached.
Hierarchical agglomerative techniques are frequently used in practice.
They start from the finest possible structure (each data point forms a cluster),
compute the distance matrix for the clusters and join the clusters
that have the smallest distance. This step is repeated until all points are
united in one cluster.
The agglomerative procedure depends on the definition of the distance
between two clusters. Single linkage,
complete linkage, and Ward distance are frequently used distances.
The process of the unification of
clusters can be graphically represented by a dendrogram.
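The agglomerative procedure summarized above fits in a short routine. A naive single linkage sketch (illustrative only, cubic in the number of points) that records the merge distances a dendrogram would display:

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def single_linkage(points):
    """Repeatedly merge the two closest clusters; return the merge log."""
    clusters = [[p] for p in points]          # finest partition
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: smallest individual distance
                d = min(sq_dist(x, y)
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # unite the two clusters
        del clusters[j]
    return merges

# Hypothetical data: merge distances grow along the agglomeration
pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
for a, b, d in single_linkage(pts):
    print(d)
```

The sequence of recorded distances is exactly the information a dendrogram plots on its distance axis.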