


15.6 Extensions of SVM


15.6.1 Regression

One extension of the SVM is to the regression task. In this subsection we give a short overview of the idea of Support Vector Regression (SVR). A regression problem is given whenever $ {\mathcal{Y}}={\mathbb{R}}$ for the training data set $ {\mathcal{Z}}=\{({\boldsymbol{x}}_i,y_i)\in{\mathcal{X}}\times{\mathcal{Y}}\vert i=1,\ldots,M\}$ (cf. Sect. 15.2.1) and our interest is to find a function of the form $ f:{\mathcal{X}}\rightarrow{\mathbb{R}}$ (or more generally $ {\mathbb{R}}^D$).

In our discussion of the theoretical foundations of learning we have not yet talked about loss functions, except for saying that they should be non-negative functions of the form (15.1). In the following we discuss an interesting alternative for the regression problem. Two loss functions are commonly used: the simple squared loss

$\displaystyle {\ell}(\kern.5pt f({\boldsymbol{x}}),y) = (\kern.5pt f({\boldsymbol{x}})-y)^2\;,$ (15.55)

and the $ \epsilon $-insensitive loss

$\displaystyle {\ell}(\kern.5pt f({\boldsymbol{x}}),y) = \begin{cases}\vert f({\boldsymbol{x}})-y\vert-\epsilon\;, & \text{if } \vert f({\boldsymbol{x}})-y\vert > \epsilon\;,\\ 0\;, & \text{otherwise}\;. \end{cases}$ (15.56)

For $ \epsilon=0$ the $ \epsilon $-insensitive loss equals the $ \ell _1$-loss; otherwise it penalizes linearly those deviations from the correct prediction that exceed $ \epsilon $.

In the left subplot of Fig. 15.8 the two loss functions are shown. In the right subplot a regression function obtained with the $ \epsilon $-insensitive loss function is shown for some artificial data. The dashed lines indicate the boundaries of the region where the loss is zero (the "tube"). Most of the data points lie within the tube.

Figure 15.8: The left subplot shows the two different loss functions for the regression problem. The right subplot gives a regression function derived with an $ \epsilon $-insensitive loss function. The solid line indicates the function, the dashed lines indicate the $ \epsilon $-tube around this function
\includegraphics[width=5cm]{text/3-15/regression_loss.eps} \includegraphics[width=5cm]{text/3-15/regression_tube.eps}
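To make the two loss functions (15.55) and (15.56) concrete, the following sketch evaluates both for a vector of residuals. It is our illustration in NumPy; the function names and the choice $ \epsilon=0.1$ are ours, not the chapter's.

import numpy as np

def squared_loss(f_x, y):
    """Squared loss (15.55): (f(x) - y)^2."""
    return (f_x - y) ** 2

def eps_insensitive_loss(f_x, y, eps=0.1):
    """Epsilon-insensitive loss (15.56): zero inside the tube,
    linear in |f(x) - y| - eps outside it."""
    return np.maximum(np.abs(f_x - y) - eps, 0.0)

# Residuals from -0.3 to 0.3; the eps-insensitive loss vanishes
# wherever |f(x) - y| <= eps, while the squared loss does not.
f_x = np.linspace(-0.3, 0.3, 7)
y = np.zeros_like(f_x)
print(squared_loss(f_x, y))
print(eps_insensitive_loss(f_x, y, eps=0.1))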

As in the classification task, one is looking for the function that best describes the values $ y_i$. Whereas in classification one seeks the function that separates the two classes, in regression one seeks a function whose $ \epsilon $-tube contains the given data set. Some data points can be allowed to lie outside the $ \epsilon $-tube by introducing slack variables.

The primal formulation for the SVR is then given by:

$\displaystyle \min_{\boldsymbol{w},b,{\boldsymbol{\xi}},{\boldsymbol{\xi}}^{\ast}} \quad \frac{1}{2}\Vert\boldsymbol{w}\Vert^2 + C\sum_{i=1}^M \left({\boldsymbol{\xi}}_i + {\boldsymbol{\xi}}_i^{\ast}\right)\;,$
subject to$\displaystyle \quad \left(\left(\boldsymbol{w}^{\top} {\boldsymbol{x}}_i\right)+b\right) - y_i \le \epsilon + {\boldsymbol{\xi}}_i\;,$    
$\displaystyle y_i - \left(\left(\boldsymbol{w}^{\top} {\boldsymbol{x}}_i\right)+b\right) \le \epsilon + {\boldsymbol{\xi}}_i^{\ast}\;,$    
$\displaystyle {\boldsymbol{\xi}}_i,{\boldsymbol{\xi}}_i^{\ast} \geq 0\;,\quad i=1,\ldots,M\;.$    

In contrast to the primal formulation for the classification task, we have to introduce two types of slack variables, $ {\boldsymbol{\xi}}_i$ and $ {\boldsymbol{\xi}}_i^{\ast}$: one controls the error induced by observations lying above the upper boundary of the $ \epsilon $-tube, the other the error induced by observations lying below the lower boundary. To enable the construction of a non-linear regression function, a dual formulation is obtained in a similar way to the classification SVM and the kernel trick is applied. For an extensive description of SVR we recommend the book of [59].
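In practice the primal above is rarely solved directly; kernelized SVR solvers work on the dual. As an illustration (not part of the original text), the following sketch fits an RBF-kernel SVR with scikit-learn, whose parameters C and epsilon correspond to $ C$ and $ \epsilon $ in the primal; the data set and parameter values are arbitrary choices for demonstration.

import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression problem: a noisy sine curve.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# C weighs the slack terms, epsilon is the half-width of the tube,
# and the RBF kernel plays the role of k(x_i, x_j) in the dual.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X, y)

# Points with non-zero dual coefficients lie on or outside the tube.
print("support vectors:", len(svr.support_), "of", len(X))
print("prediction at pi/2:", svr.predict([[np.pi / 2]]))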

15.6.2 One-Class Classification

Another common problem of statistical learning is one-class classification (novelty detection). Its fundamental difference from the standard classification problem is that the training data is not distributed identically to the test data. The dataset contains two classes: one of them, the target class, is well sampled, while the other class is absent or sampled very sparsely. [56] have proposed an approach in which the target class is separated from the origin by a hyperplane. Alternatively ([67]), the target class can be modeled by fitting a hypersphere with minimal radius around it. We present this approach, schematically shown in Fig. 15.9, in more detail below.

Figure 15.9: Schematic illustration of the hypersphere model for describing a target class of data. The center  $ \boldsymbol {a}$ and the radius $ R$ should be optimized to minimize the volume of the captured space
\includegraphics[width=4.5cm]{text/3-15/svdd.eps}

Mathematically the problem of fitting a hypersphere around the data is formalized as:

$\displaystyle \min_{R,\boldsymbol{a},{\boldsymbol{\xi}}} \quad R^2 + C \sum_{i=1}^M {\boldsymbol{\xi}}_i\;,$ (15.57)
subject to$\displaystyle \quad \vert\vert \boldsymbol{x}_i - \boldsymbol{a}\vert\vert^2 \leq R^2 + {\boldsymbol{\xi}}_i\;, \quad i=1,\ldots,M\;,$    
$\displaystyle {\boldsymbol{\xi}}_i \geq 0\;,$    

where $ \boldsymbol {a}$ is the center of the sphere and $ R$ is its radius. Similarly to the SVM, we make a "soft" fit by allowing non-negative slacks $ {\boldsymbol{\xi}}_i$. One can likewise apply the kernel trick by deriving the dual formulation of (15.57):

$\displaystyle \max_{{\boldsymbol{\alpha}}} \quad \sum_{i=1}^M\alpha_i k\left(\boldsymbol{x}_i,\boldsymbol{x}_i\right) - \sum_{i,j=1}^M\alpha_i\alpha_j k\left(\boldsymbol{x}_i,\boldsymbol{x}_j\right)\;,$ (15.58)
subject to$\displaystyle \quad 0 \le \alpha_i \le C\;, \quad i=1,\ldots,M\;,$    
$\displaystyle \sum_{i=1}^M\alpha_i =1\;.$    

The parameter $ C$ can be interpreted ([56]) as the quantity $ \frac{1}{M\nu}$, where $ \nu $ is an upper bound on the fraction of objects outside the boundary.
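To make the dual (15.58) and the role of $ C$ tangible, here is a small numerical sketch (ours, not from the chapter) that solves the dual for a Gaussian kernel with a general-purpose SLSQP solver; a dedicated QP solver would normally be used instead, and the data set and kernel width are arbitrary illustrative choices.

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.randn(60, 2)                       # sample of the target class
M, nu = len(X), 0.1
C = 1.0 / (M * nu)                         # C = 1/(M*nu), cf. the text

K = np.exp(-cdist(X, X, "sqeuclidean"))    # Gaussian kernel matrix

# Dual (15.58): maximise sum_i alpha_i K_ii - alpha^T K alpha,
# i.e. minimise its negative, subject to 0 <= alpha_i <= C, sum alpha = 1.
def neg_dual(a):
    return a @ K @ a - a @ np.diag(K)

def neg_dual_grad(a):
    return 2 * K @ a - np.diag(K)

alpha0 = np.full(M, 1.0 / M)               # feasible starting point
res = minimize(neg_dual, alpha0, jac=neg_dual_grad, method="SLSQP",
               bounds=[(0.0, C)] * M,
               constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
alpha = res.x

# Objects whose alpha_i reaches the upper bound C lie outside the sphere;
# since sum alpha = 1, their fraction cannot exceed nu.
at_bound = alpha > C * (1.0 - 1e-3)
print("fraction at upper bound:", at_bound.mean(), "<= nu =", nu)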

To decide whether a new object belongs to the target class one should determine its position with respect to the sphere using the formula

$\displaystyle f({\boldsymbol{x}}) = \operatorname{sign}\left(R^2 - k(\boldsymbol{x},\boldsymbol{x}) + 2\sum_i \alpha_i k\left(\boldsymbol{x}_i,\boldsymbol{x}\right) - \sum_{i,j}\alpha_i \alpha_j k\left(\boldsymbol{x}_i,\boldsymbol{x}_j\right)\right)\;.$ (15.59)

An object with a positive sign is assigned to the target class; an object with a negative sign is rejected as a novelty.
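As a practical aside (our illustration, not part of the text), the hyperplane approach of [56] is available in scikit-learn as OneClassSVM; for kernels with constant $ k(\boldsymbol{x},\boldsymbol{x})$, such as the Gaussian kernel, it yields the same solution as the hypersphere model, and its parameter nu plays the role of $ \nu $ described above.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_target = rng.randn(200, 2)                 # well-sampled target class
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # one typical, one novel point

# nu upper-bounds the fraction of training objects treated as outliers,
# matching C = 1/(M*nu) in the text.
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_target)

# predict returns +1 for objects accepted as target class and -1 otherwise,
# i.e. the sign of a decision function analogous to (15.59).
print(ocsvm.predict(X_new))            # expected: [ 1 -1 ]
print(ocsvm.decision_function(X_new))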

