11.1 The Problem

Cluster analysis is a set of tools for building groups (clusters) from multivariate data objects. The aim is to construct groups with homogeneous properties out of large heterogeneous samples. The groups or clusters should be as homogeneous as possible, and the differences among the various groups should be as large as possible. Cluster analysis can be divided into two fundamental steps.

  1. Choice of a proximity measure:
    One checks each pair of observations (objects) for the similarity of their values. A similarity (proximity) measure is defined to measure the ``closeness'' of the objects. The ``closer'' they are, the more homogeneous they are.
  2. Choice of group-building algorithm:
    On the basis of the proximity measures, the objects are assigned to groups so that differences between groups become large and observations within a group become as close as possible.
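
The two steps can be illustrated with a minimal sketch. It assumes Euclidean distance as the proximity measure and single-linkage agglomerative merging as the group-building algorithm; both are chosen here only for illustration and are not prescribed by the text. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Step 1 (assumed choice): Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(data):
    """Compute the symmetric matrix of pairwise distances."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = euclidean(data[i], data[j])
    return d


def single_linkage(d, n_groups):
    """Step 2 (assumed choice): merge the two closest clusters until
    only n_groups clusters remain. The distance between two clusters is
    the smallest distance between any of their members."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > n_groups:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                d[i][j] for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: three points near (1, 1) and two near (5, 5).
data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.2), (5.1, 4.9)]
d = distance_matrix(data)
groups = single_linkage(d, n_groups=2)
print(groups)  # [[0, 1, 2], [3, 4]]
```

The observations are grouped by index: the three points near (1, 1) end up in one cluster and the two near (5, 5) in the other. A different proximity measure or merging rule could produce different groups from the same data, which is why both choices matter.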

In marketing, for example, cluster analysis is used to select test markets. Other applications include the classification of companies according to their organizational structures, technologies and types. In psychology, cluster analysis is used to find types of personalities on the basis of questionnaires. In archaeology, it is applied to classify art objects into different time periods. Other scientific branches that use cluster analysis are medicine, sociology, linguistics and biology. In each case a heterogeneous sample of objects is analyzed with the aim of identifying homogeneous subgroups.

Summary
$\ast$ Cluster analysis is a set of tools for building groups (clusters) from multivariate data objects.
$\ast$ The methods used are usually divided into two fundamental steps: The choice of a proximity measure and the choice of a group-building algorithm.