15.3 Linear SVM: Learning Theory in Practice

Having summarized the prerequisites from statistical learning theory, we now give an example of a particular learning machine that builds upon these insights. The Support Vector Machine (SVM) algorithm developed by Vapnik and co-workers (e.g. [12,17,71,18,44,59]) has been one of the most successful classification techniques of the last decade, especially when combined with the kernel idea, which we shall discuss in Sect. 15.4.


15.3.1 Linear Separation Planes

We now discuss how one can control the size of a function class and how to select the empirical risk minimizer within this class.

In the following, let us assume that we are dealing with a two class classification problem (i.e.  $ {\mathcal{Y}}=\{-1,+1\}$) in a real-valued vector space, e.g.  $ {\mathcal{X}}={\mathbb{R}}^N$. Further, we assume that the distribution of these two classes is such that they are linearly separable, i.e. one can find a linear function of the inputs $ {\boldsymbol{x}}\in{\mathcal{X}}$ such that $ f({\boldsymbol{x}})<0$ whenever the label $ y=-1$ and $ f({\boldsymbol{x}})\geq 0$ otherwise. This can be conveniently expressed by a hyperplane in the space  $ {\mathcal{X}}$, i.e. we are looking for a function $ f$ of the form

$\displaystyle f({\boldsymbol{x}}) = \left(\boldsymbol{w}^{\top}{\boldsymbol{x}}\right) + b\;.$ (15.10)

Assume that the function class  $ {\mathcal{F}}$ we choose our solution from is the one containing all possible hyperplanes, i.e.  $ {\mathcal{F}}=\{f:{\mathcal{X}}\rightarrow{\mathbb{R}}\;\vert\; f({\boldsymbol{x}}) = (\boldsymbol{w}^{\top} {\boldsymbol{x}}) + b\}$. For $ {\mathcal{X}}={\mathbb{R}}^N$ it is rather straightforward to show that the VC-dimension of this class of functions is $ h=N+1$, i.e. in an $ N$-dimensional space the maximal number of points that can be separated for an arbitrary labelling using a hyperplane is $ N+1$.
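To make the shattering statement concrete, the following sketch (not part of the original text) numerically checks that three points in general position in $ {\mathbb{R}}^2$ can be separated for every labelling, while four points cannot (the XOR labelling fails). It assumes NumPy and SciPy are available and tests linear separability with a small feasibility linear program.

\begin{verbatim}
# Numerical illustration of h = N + 1 for hyperplanes in R^2.
import itertools
import numpy as np
from scipy.optimize import linprog


def linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy y_i (w^T x_i + b) >= 1?"""
    M, N = X.shape
    # Variables are (w_1, ..., w_N, b); constraints -y_i (w^T x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((M, 1))])
    b_ub = -np.ones(M)
    res = linprog(c=np.zeros(N + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (N + 1))
    return res.status == 0  # status 0: a feasible solution was found


def shatterable(X):
    """Check every +/-1 labelling of the points in X."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))


three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shatterable(three_points))  # True:  N + 1 = 3 points are shattered
print(shatterable(four_points))   # False: the XOR labelling is not separable
\end{verbatim}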


15.3.2 Canonical Hyperplanes and Margins

To apply the SRM principle in practice, the VC-dimension of the class of hyperplanes must be finite and a nested structure of function classes must be defined. To this end we define the function classes

$\displaystyle {\mathcal{F}}_{\Lambda} = \left\{f:{\mathbb{R}}^N\rightarrow{\mathbb{R}}\;\vert\; f({\boldsymbol{x}}) = \left(\boldsymbol{w}^{\top}{\boldsymbol{x}}\right)+b,\ \Vert\boldsymbol{w}\Vert\leq\Lambda\right\}\;.$ (15.11)

Clearly $ {\mathcal{F}}_{\Lambda_1}\subseteq{\mathcal{F}}_{\Lambda_2}$ whenever $ \Lambda_1\leq\Lambda_2$. But what effect does constraining the norm of the weight vector have on the corresponding VC-dimensions of $ {\mathcal{F}}_{\Lambda}$? It turns out that we also get $ h({\mathcal{F}}_{\Lambda_1})\leq h({\mathcal{F}}_{\Lambda_2})$ for $ \Lambda_1\leq\Lambda_2$ (see (15.12) below), i.e. the required structure is present.

The crucial ingredient in making the function classes $ {\mathcal{F}}_{\Lambda}$ nested is to define a unique representation for each hyperplane. To this end we introduce the concept of canonical hyperplanes and the notion of margins. If the data are separable by $ (\boldsymbol{w},b)$, then they are also separable by any (positive) multiple of $ (\boldsymbol{w},b)$, and hence there are infinitely many representations of the same separating hyperplane. Without a unique representation, all function classes $ {\mathcal{F}}_{\Lambda}$ would have the same VC-dimension, as they would contain the same functions in different representations.

Figure 15.4: Linear classifier and margins. A linear classifier is defined by the normal vector  $ \boldsymbol {w}$ of a hyperplane and an offset $ b$, i.e. the decision boundary is $ \{{\boldsymbol {x}}\vert(\boldsymbol {w}^{\top }{\boldsymbol {x}}) + b = 0\}$ (solid line). Each of the two half spaces induced by this hyperplane corresponds to one class, i.e.  $ f({\boldsymbol {x}}) = {\textrm {sgn}}((\boldsymbol {w}^{\top }{\boldsymbol {x}}) + b)$. The margin of a linear classifier is the minimal distance of any training point to the hyperplane. For the case shown in the picture it is the distance between the dotted lines and the solid line

A canonical hyperplane with respect to an $ M$-sample $ {\mathcal{Z}}$ is defined as a function

$\displaystyle f({\boldsymbol{x}}) = \left(\boldsymbol{w}^{\top}{\boldsymbol{x}}\right)+b\;,$    

where $ \boldsymbol {w}$ is normalized such that

$\displaystyle \min_{i\,=\,1,\ldots,M} \vert\kern.5pt f({\boldsymbol{x}}_i)\vert = 1\;.$    

The notion of a canonical hyperplane is illustrated in Fig. 15.4. Notice that none of the training examples produces an absolute output that is smaller than one, and the examples closest to the hyperplane produce an output of exactly one, i.e.  $ (\boldsymbol{w}^{\top}{\boldsymbol{x}}) + b = \pm 1$. In Sect. 15.5 we will see that the latter objects are used in the description of the hyperplane, and they are therefore called support vectors. In Fig. 15.4 these are the objects connected to the decision boundary (dashed lines). Since we assumed the sample  $ {\mathcal{Z}}$ to be linearly separable, we can turn any $ f$ that separates the data into a canonical hyperplane by suitably rescaling the weight vector  $ \boldsymbol {w}$ and adjusting the threshold $ b$ accordingly.
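As a small illustration of this rescaling step, the following sketch (not part of the original text; it assumes NumPy and an already separating pair $ (\boldsymbol{w},b)$ on a hypothetical toy sample) divides both parameters by $ \min_i \vert f({\boldsymbol{x}}_i)\vert$ to obtain the canonical representation.

\begin{verbatim}
import numpy as np

def canonicalize(w, b, X):
    """Rescale a separating (w, b) so that min_i |w^T x_i + b| = 1 on X."""
    scale = np.min(np.abs(X @ w + b))
    return w / scale, b / scale

# Toy sample; (2w, 2b) describes the same hyperplane as (w, b),
# but only the canonical representation satisfies the normalization above.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
w, b = canonicalize(np.array([2.0, 2.0]), -2.0, X)
print(w, b)                          # [1. 1.] -1.0
print(np.min(np.abs(X @ w + b)))     # 1.0
\end{verbatim}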

The margin is defined to be the minimal Euclidean distance between any training example $ {\boldsymbol{x}}_i$ and the separating hyperplane. Intuitively, the margin measures how well a hyperplane separates the two classes. If the hyperplane is in canonical form, the margin can be read off from the length of the weight vector  $ \boldsymbol {w}$: consider two support vectors $ {\boldsymbol{x}}_1$ and $ {\boldsymbol{x}}_2$ from different classes. Projecting their difference onto the direction perpendicular to the hyperplane yields the distance between the two classes, i.e. twice the margin, which can be computed as (e.g. [71])

$\displaystyle \left(\frac{\boldsymbol{w}^{\top}}{\Vert\boldsymbol{w}\Vert}({\boldsymbol{x}}_1-{\boldsymbol{x}}_2)\right) = \frac{2}{\Vert\boldsymbol{w}\Vert}\;.$    

The smaller the norm of the weight vector $ \boldsymbol {w}$ in the canonical representation, the larger the margin.
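The relation between $ \Vert\boldsymbol{w}\Vert$ and the margin is easy to check numerically. The following sketch (illustrative values only, assuming NumPy) verifies that for a canonical $ (\boldsymbol{w},b)$ the closest points lie at distance $ 1/\Vert\boldsymbol{w}\Vert$ from the hyperplane, so the two classes are separated by $ 2/\Vert\boldsymbol{w}\Vert$.

\begin{verbatim}
import numpy as np

# Toy canonical hyperplane (w, b) and sample; purely illustrative values.
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

outputs = X @ w + b                      # f(x_i) = w^T x_i + b
print(np.min(np.abs(outputs)))           # 1.0 -> (w, b) is canonical
distances = np.abs(outputs) / np.linalg.norm(w)
print(np.min(distances))                 # 1/||w||: distance of the closest points
print(2 / np.linalg.norm(w))             # 2/||w||: separation between the classes
\end{verbatim}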

More generally, it was shown (e.g. [71]) that if the hyperplane is constructed under the constraint $ \Vert\boldsymbol{w}\Vert_2\leq\Lambda$, then the VC-dimension of the class $ {\mathcal{F}}_{\Lambda}$ is bounded by

$\displaystyle h \leq \min\left(\Lambda^2 R^2 + 1,N+1\right)\;,$ (15.12)

where $ R$ is the radius of the smallest sphere enclosing the data. Thus, if we bound the margin of a function class from below, say by requiring $ \Vert\boldsymbol{w}\Vert\leq\Lambda$ so that the distance between the classes is at least $ 2/\Lambda$, we can control its VC-dimension and hence apply the SRM principle, as shown in Fig. 15.5.

Figure 15.5: Illustration of why a large margin reduces the complexity of a linear hyperplane classifier. If we only allow hyperplanes with a large margin, there are few possibilities to separate the data, i.e. the VC-dimension of $ {\mathcal {F}}_{\Lambda _1}$ is small (left panel). In contrast, if we allow smaller margins there are more separating hyperplanes, i.e. the VC-dimension of $ {\mathcal {F}}_{\Lambda _2}$ is large (right panel)

A particularly important insight is that the complexity of the classifier depends only indirectly on the dimensionality of the data. This is very much in contrast to, e.g., density estimation, where the problem becomes harder as the dimensionality of the data increases. For SVM classification, as long as we can achieve a large margin, the problem remains simple.
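To make the dimensionality argument concrete, the following sketch (illustrative numbers only, not from the original text) evaluates the bound (15.12): even in a very high-dimensional space, a small $ \Lambda$ relative to the data radius $ R$, i.e. a large margin, keeps the bound on the VC-dimension small.

\begin{verbatim}
def vc_bound(Lambda, R, N):
    """Upper bound (15.12) on the VC-dimension of F_Lambda."""
    return min(Lambda**2 * R**2 + 1, N + 1)

# A large margin (small Lambda) keeps the bound small even for N = 10000.
print(vc_bound(Lambda=2.0, R=1.0, N=10_000))    # 5.0, far below N + 1
print(vc_bound(Lambda=200.0, R=1.0, N=10_000))  # 10001, the trivial bound N + 1
\end{verbatim}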

