# 15.3 Linear SVM: Learning Theory in Practice

Having summarized the prerequisites from statistical learning theory, we now give an example of a particular learning machine that builds upon these insights. The Support Vector Machine (SVM) algorithm developed by Vapnik and others (e.g. [12,17,18,44,59,71]) is one of the most successful classification techniques of the last decade, especially after being combined with the kernel idea, which we shall discuss in Sect. 15.4.

## 15.3.1 Linear Separation Planes

We are now going to discuss how one could possibly control the size of a function class and how to select the empirical risk minimizer in this class.

In the following, let us assume that we are dealing with a two-class classification problem (i.e. $y \in \{-1,+1\}$) in a real-valued vector space, e.g. $\boldsymbol{x} \in \mathbb{R}^N$. Further, we assume that the distribution of these two classes is such that they are linearly separable, i.e. one can find a linear function $f$ of the inputs such that $f(\boldsymbol{x}) > 0$ whenever the label $y = +1$ and $f(\boldsymbol{x}) < 0$ otherwise. This can be conveniently expressed by a hyperplane in the space $\mathbb{R}^N$, i.e. we are looking for a function $f$ of the form

$$f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b \,. \qquad (15.10)$$
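As a concrete illustration, the decision function (15.10) can be evaluated directly; the weight vector and threshold below are hypothetical values chosen only for illustration:

```python
import numpy as np

def f(x, w, b):
    """Linear decision function f(x) = w.x + b as in (15.10);
    the predicted class is sign(f(x))."""
    return np.dot(w, x) + b

# hypothetical weight vector and threshold, not taken from the text
w, b = np.array([2.0, -1.0]), 0.5
assert f(np.array([1.0, 1.0]), w, b) == 1.5          # 2 - 1 + 0.5
assert np.sign(f(np.array([0.0, 1.0]), w, b)) == -1  # -1 + 0.5 < 0
```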

Assume that the function class $\mathcal{F}$ we choose our solution from is the one containing all possible hyperplanes, i.e. $\mathcal{F} = \{f : f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b\}$. For $\mathcal{F}$ it is rather straightforward to show that the VC-dimension of this class of functions is $h = N+1$, i.e. in an $N$-dimensional space the maximal number of points that can be separated for an arbitrary labelling using a hyperplane is $N+1$.
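The shattering argument behind this VC-dimension can be checked numerically. The sketch below (a minimal perceptron on illustrative points, not an algorithm from the text) verifies for $N = 2$ that every one of the $2^{N+1}$ labellings of $N+1$ points in general position admits a separating hyperplane:

```python
import itertools
import numpy as np

def perceptron_separable(X, y, epochs=1000):
    """Return True if the perceptron finds (w, b) with
    y_i * (w.x_i + b) > 0 for all i, i.e. the labelling is separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi   # standard perceptron update
                b += yi
                updated = True
        if not updated:
            return True        # all points correctly classified
    return False

# three points in general position in R^2 (N = 2, so N + 1 = 3)
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# every one of the 2^3 labellings is realized by some hyperplane
assert all(perceptron_separable(X, np.array(y))
           for y in itertools.product([-1.0, 1.0], repeat=3))
```

Adding a fourth point in the plane breaks this: an XOR-style labelling of four points is no longer linearly separable, which is exactly why the VC-dimension is $N+1$ and not more.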

## 15.3.2 Canonical Hyperplanes and Margins

To apply the SRM principle in practice the VC-dimension of the class of hyperplanes must be finite, and a nested structure of function classes must be defined. To this end we define the function classes

$$\mathcal{F}_\Lambda = \left\{ f : f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b,\ \|\boldsymbol{w}\| \le \Lambda \right\} \,. \qquad (15.11)$$

Clearly $\mathcal{F}_{\Lambda_1} \subseteq \mathcal{F}_{\Lambda_2}$ whenever $\Lambda_1 \le \Lambda_2$. But what effect does constraining the norm of the weight vector have on the corresponding VC-dimensions of $\mathcal{F}_\Lambda$? It turns out that we also get $h_{\Lambda_1} \le h_{\Lambda_2}$ for $\Lambda_1 \le \Lambda_2$ (see (15.12) below), i.e. the required structure is present.

The crucial ingredient in making the function classes nested is to define a unique representation for each hyperplane. We introduce the concept of canonical hyperplanes and the notion of margins. If the data are separable by $(\boldsymbol{w}, b)$ then they are also separable by any (positive) multiple of $(\boldsymbol{w}, b)$, and hence there exist infinitely many representations for the same separating hyperplane. In particular, all function classes $\mathcal{F}_\Lambda$ would have the same VC-dimension, as they would contain the same functions in different representations.

A canonical hyperplane with respect to an $M$-sample $\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_M, y_M)\}$ is defined as a function

$$f(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} + b \,,$$

where $\boldsymbol{w}$ is normalized such that

$$\min_{i=1,\ldots,M} |f(\boldsymbol{x}_i)| = 1 \,.$$
The notion of a canonical hyperplane is illustrated in Fig. 15.4. Notice that none of the training examples produces an absolute output smaller than one, and the examples closest to the hyperplane have an output of exactly one, i.e. $|f(\boldsymbol{x}_i)| = 1$. In Sect. 15.5 we will see that these latter objects are used in the description of the hyperplane, and they are therefore called the support vectors. In Fig. 15.4 these are the objects connected to the decision boundary (dashed lines). Since we assumed the sample to be linearly separable, we can turn any hyperplane that separates the data into a canonical hyperplane by suitably normalizing the weight vector $\boldsymbol{w}$ and adjusting the threshold $b$ correspondingly.
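The normalization step can be sketched as follows; the sample and the initial hyperplane are hypothetical values chosen for illustration:

```python
import numpy as np

def canonicalize(w, b, X):
    """Rescale (w, b) by a positive constant so that min_i |w.x_i + b| = 1.
    A positive rescaling leaves the separating hyperplane itself unchanged."""
    scale = np.min(np.abs(X @ w + b))
    return w / scale, b / scale

# toy separable sample and hyperplane (illustrative, not from the text)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
w, b = canonicalize(np.array([1.0, 1.0]), 0.0, X)

outputs = np.abs(X @ w + b)
assert np.isclose(outputs.min(), 1.0)          # closest examples: |f(x)| = 1
support_vectors = X[np.isclose(outputs, 1.0)]  # the candidate support vectors
```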

The margin is defined as the minimal Euclidean distance between any training example and the separating hyperplane. Intuitively, the margin measures how well a hyperplane separates the two classes. If the hyperplane is in canonical form, the margin can be computed from the length of the weight vector $\boldsymbol{w}$. Consider two support vectors $\boldsymbol{x}_1$ and $\boldsymbol{x}_2$ from different classes, i.e. $f(\boldsymbol{x}_1) = +1$ and $f(\boldsymbol{x}_2) = -1$. The margin is given by the projection of the distance between these two points onto the direction perpendicular to the hyperplane. This distance can be computed as (e.g. [71])

$$\frac{\boldsymbol{w}^\top}{\|\boldsymbol{w}\|} (\boldsymbol{x}_1 - \boldsymbol{x}_2) = \frac{2}{\|\boldsymbol{w}\|} \,.$$

The smaller the norm of the weight vector in the canonical representation, the larger the margin.
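This projection identity is easy to verify numerically; the canonical hyperplane and support vectors below are hypothetical values satisfying $f(\boldsymbol{x}_1) = +1$ and $f(\boldsymbol{x}_2) = -1$:

```python
import numpy as np

# hypothetical canonical hyperplane w.x + b with one support vector per class
w = np.array([1.0, 1.0])
b = 0.0
x1 = np.array([1.0, 0.0])   # f(x1) = +1
x2 = np.array([-1.0, 0.0])  # f(x2) = -1

# projection of x1 - x2 onto the unit normal w/||w|| gives the margin
margin = np.dot(w, x1 - x2) / np.linalg.norm(w)
assert np.isclose(margin, 2.0 / np.linalg.norm(w))
```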

More generally, it was shown (e.g. [71]) that if the hyperplane is constructed under the constraint $\|\boldsymbol{w}\| \le \Lambda$, then the VC-dimension of the class $\mathcal{F}_\Lambda$ is bounded by

$$h \le \min\left(R^2 \Lambda^2, N\right) + 1 \,, \qquad (15.12)$$

where $R$ is the radius of the smallest sphere enclosing the data. Thus, if we bound the margin of a function class from below, say by $2/\Lambda$, we can control its VC-dimension and hence apply the SRM principle, as shown in Fig. 15.5.

A particularly important insight is that the complexity of the function class depends only indirectly on the dimensionality of the data. This is in marked contrast to, e.g., density estimation, where the problem becomes harder as the dimensionality of the data increases. For SVM classification, if we can achieve a large margin, the problem remains simple even in high dimensions.
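This dimension-independence can be made concrete by evaluating the bound (15.12) on high-dimensional data with a small radius. The sketch below centers the enclosing sphere at the data mean, which yields a radius at least as large as that of the smallest enclosing sphere, so the bound remains valid (if slightly loose); the data are synthetic:

```python
import numpy as np

def vc_bound(X, Lam):
    """Evaluate the bound min(R^2 * Lam^2, N) + 1 from (15.12), using a
    simple mean-centered estimate of the enclosing-sphere radius R."""
    R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))
    return min(R**2 * Lam**2, X.shape[1]) + 1

# 50 synthetic points on the unit sphere in R^100: high dimension, small radius
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# with ||w|| <= 2 the bound is roughly R^2 * 4 + 1, far below N + 1 = 101;
# with a very loose norm constraint the bound falls back to N + 1
assert vc_bound(X, 2.0) < 101
assert vc_bound(X, 100.0) == 101
```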
