Having summarized the prerequisites from statistical learning theory, we now give an example of a particular learning machine that builds on these insights. The Support Vector Machine (SVM) algorithm, developed by Vapnik and others (e.g. [12,17,71,18,44,59]), has been one of the most successful classification techniques of the last decade, especially in combination with the kernel idea, which we discuss in Sect. 15.4.
We now discuss how one can control the size of a function class and how to select the empirical risk minimizer within this class.
In the following, let us assume that we are dealing with a two-class classification problem (i.e. $y \in \{-1, +1\}$) in a real-valued vector space, e.g. $\mathbf{x} \in \mathbb{R}^N$. Further, we assume that the distribution of these two classes is such that they are linearly separable, i.e. one can find a linear function $f$ of the inputs $\mathbf{x}$ such that $f(\mathbf{x}) > 0$ whenever the label $y = +1$ and $f(\mathbf{x}) < 0$ otherwise. This can be conveniently expressed by a hyperplane in the space $\mathbb{R}^N$, i.e. we are looking for a function $f$ of the form
$$f(\mathbf{x}) = (\mathbf{w}^\top \mathbf{x}) + b.$$
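As a minimal sketch (the function names are illustrative, not from the text), the decision function $f(\mathbf{x}) = (\mathbf{w}^\top \mathbf{x}) + b$ and the classifier obtained from its sign could look as follows:

```python
import numpy as np

def decision_function(w, b, x):
    """Linear decision function f(x) = <w, x> + b."""
    return np.dot(w, x) + b

def predict(w, b, x):
    """Class label in {-1, +1} from the sign of f(x)."""
    return 1 if decision_function(w, b, x) > 0 else -1

# A toy separable setting in R^2: the hyperplane x_1 + x_2 - 1 = 0.
w, b = np.array([1.0, 1.0]), -1.0
print(predict(w, b, np.array([2.0, 2.0])))    # point on the positive side -> 1
print(predict(w, b, np.array([-1.0, -1.0])))  # point on the negative side -> -1
```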
To apply the SRM principle in practice, the VC-dimension of the class of hyperplanes must be finite, and a nested structure of function classes must be defined. To this end we define the function classes
$$F_\Lambda = \{ f_{\mathbf{w},b} : \|\mathbf{w}\| \le \Lambda \}.$$
The crucial ingredient in making the function classes nested is to define a unique representation for each hyperplane. We introduce the concept of canonical hyperplanes and the notion of margins. If the data are separable by $f_{\mathbf{w},b}$, then they are also separable by any positive multiple $f_{c\mathbf{w},cb}$, $c > 0$, and hence there exists an infinite number of representations for the same separating hyperplane. In particular, all function classes $F_\Lambda$ would have the same VC-dimension, as they would contain the same functions in different representations.
A canonical hyperplane with respect to an $n$-sample $Z = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ is defined as a function
$$f_{\mathbf{w},b} \quad \text{with} \quad \min_{i=1,\ldots,n} \left| (\mathbf{w}^\top \mathbf{x}_i) + b \right| = 1,$$
i.e. the weight vector is scaled such that the training point closest to the hyperplane attains $|f(\mathbf{x})| = 1$.
The margin is defined to be the minimal Euclidean distance between any training example and the separating hyperplane. Intuitively, the margin measures how well the hyperplane separates the two classes. If the hyperplane is in canonical form, the margin can be measured by the length of the weight vector $\mathbf{w}$. Consider two support vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ from different classes, i.e. with $(\mathbf{w}^\top \mathbf{x}_1) + b = 1$ and $(\mathbf{w}^\top \mathbf{x}_2) + b = -1$. The margin is given by the projection of the distance between these two points onto the direction perpendicular to the hyperplane. This distance can be computed as (e.g. [71])
$$\frac{\mathbf{w}^\top}{\|\mathbf{w}\|} (\mathbf{x}_1 - \mathbf{x}_2) = \frac{2}{\|\mathbf{w}\|}.$$
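To make the scaling concrete, the following sketch (helper names are illustrative) rescales a given hyperplane into canonical form and reads off the margin $2/\|\mathbf{w}\|$:

```python
import numpy as np

def to_canonical(w, b, X):
    """Rescale (w, b) so that min_i |<w, x_i> + b| = 1 (canonical form)."""
    s = np.min(np.abs(X @ w + b))
    return w / s, b / s

def margin(w):
    """Margin of a canonical hyperplane: 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

# Toy data separated by x_1 = 0; any positive multiple of w = (1, 0), b = 0 works.
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.0], [-3.0, 2.0]])
w, b = to_canonical(np.array([2.0, 0.0]), 0.0, X)
print(w, b)       # (1, 0), 0 after rescaling
print(margin(w))  # 2.0: twice the distance of the closest point to the hyperplane
```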
More generally, it was shown (e.g. [71]) that if the hyperplane is constructed under the constraint $\|\mathbf{w}\| \le \Lambda$, then the VC-dimension of the class $F_\Lambda$ is bounded by
$$h \le \min\left( \left\lceil \Lambda^2 R^2 \right\rceil, N \right) + 1,$$
where $R$ is the radius of the smallest sphere containing the data.
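Assuming a bound of the form $h \le \min(\lceil \Lambda^2 R^2 \rceil, N) + 1$ (the helper name and example numbers below are illustrative), evaluating it is a one-liner:

```python
import math

def vc_bound(Lam, R, N):
    """Bound on the VC-dimension of hyperplanes with ||w|| <= Lam on data
    inside a sphere of radius R in R^N: min(ceil(Lam^2 R^2), N) + 1."""
    return min(math.ceil(Lam**2 * R**2), N) + 1

print(vc_bound(Lam=0.5, R=4.0, N=1000))   # large margin: bound 5, far below N
print(vc_bound(Lam=10.0, R=4.0, N=1000))  # small margin: bound capped at N + 1 = 1001
```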
A particularly important insight is that the complexity depends only indirectly on the dimensionality $N$ of the data. This is very much in contrast to, e.g., density estimation, where the problem becomes harder as the dimensionality of the data increases. For SVM classification, if we can achieve a large margin, the problem remains simple regardless of the dimensionality.
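This dimension-independence can be illustrated numerically with a toy construction (not from the text): embedding the same separable sample in a much higher-dimensional space by zero-padding leaves the closest-point distance and the margin $2/\|\mathbf{w}\|$ unchanged.

```python
import numpy as np

# Toy 2-D data separable by the hyperplane x_1 = 0, canonical w = (1, 0), b = 0.
X2 = np.array([[1.0, 0.5], [2.0, -1.0], [-1.0, 0.3], [-2.5, 2.0]])
w2 = np.array([1.0, 0.0])

# Embed the same points in R^100 by zero-padding: the geometry is unchanged.
X100 = np.hstack([X2, np.zeros((4, 98))])
w100 = np.concatenate([w2, np.zeros(98)])

for X, w in ((X2, w2), (X100, w100)):
    closest = np.min(np.abs(X @ w))                      # closest point (b = 0)
    print(X.shape[1], closest, 2.0 / np.linalg.norm(w))  # margin is identical
```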