15.4 Non-linear SVM

In the previous section we saw that by restricting ourselves to linear functions one can control the complexity of a learning machine. We thus avoided the problem of dealing with overly complex functions, at the price of being able to solve only linearly separable problems. In the following we show how to extend the linear SVM to construct a very rich set of non-linear decision functions while at the same time controlling their complexity.

Central to the success of support vector machines was the re-discovery of the so-called Reproducing Kernel Hilbert Spaces (RKHS) and Mercer's Theorem ([12]). There is a large body of literature dealing with kernel functions, their theory and applicability; see e.g. [34], [3], [1], [12] or [59] for an overview. We only recall the basic definitions and properties necessary for turning our linear, hyperplane-based learning technique into a very powerful algorithm capable of finding non-linear decision functions with controllable complexity.


15.4.1 The Kernel Trick

The basic idea of so-called kernel methods is to first preprocess the data by some non-linear mapping $ \Phi $ and then to apply the same linear algorithm as before, but in the image space of $ \Phi $ (cf. Fig. 15.6 for an illustration). More formally, we apply the mapping

$\displaystyle \Phi: \quad {\mathbb{R}}^N \rightarrow {\mathcal{E}}, \qquad {\boldsymbol{x}} \mapsto \Phi({\boldsymbol{x}})$    

to the data $ {\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M \in {\mathcal{X}}$ and consider our algorithm in $ {\mathcal{E}}$ instead of $ {\mathcal{X}}$, i.e. the sample is preprocessed as

$\displaystyle \left\{(\Phi({\boldsymbol{x}}_1),y_1),\ldots,(\Phi({\boldsymbol{x}}_M),y_M)\right\} \subseteq ({\mathcal{E}}\times{\mathcal{Y}})^M\;.$    

Figure 15.6: Three different views on the same two-class separation problem. (a) A linear separation of the input points is not possible without errors. Even allowing misclassification of one data point results in a small margin. (b) A better separation is provided by a non-linear surface in the input space. (c) This non-linear surface corresponds to a linear surface in a feature space. Data points are mapped from input space to feature space by the function $ \Phi $ induced by the kernel function $ k$.

In certain applications we might have sufficient knowledge about our problem such that we can design an appropriate $ \Phi $ by hand (e.g. [79,11]). If this mapping is not too complex to compute and the space $ {\mathcal{E}}$ is not too high-dimensional, we might just explicitly apply this mapping to our data. Something similar is done for (single hidden layer) neural networks ([10]), radial basis networks (e.g. [42]) or Boosting algorithms ([24]), where the input data are mapped to some representation given by the hidden layer, the RBF bumps or the hypothesis space, respectively ([52]). The difference with kernel methods is that for a suitably chosen $ \Phi $ we get an algorithm that has powerful non-linearities but is still very intuitive and retains most of the favorable properties of its linear input-space version.

The problem with explicitly using the mapping $ \Phi $ to construct a feature space is that the resulting space can be extremely high-dimensional. As an example consider the case when the input space $ {\mathcal{X}}$ consists of images of $ 16\times 16$ pixels, i.e. $ 256$ dimensional vectors, and we choose $ 5$th order monomials as non-linear features. The dimensionality of such a space would be

$\displaystyle \binom{5+256-1}{5} \approx 10^{10} \;.$    

Such a mapping would clearly be intractable to carry out explicitly. We would not only face the technical problem of storing the data and doing the computations, but we would also introduce problems due to the fact that we are now working in an extremely sparsely sampled space. By the use of the SRM principle, however, we can be less sensitive to the dimensionality of the space and still achieve good generalization.
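The count above is just the number of ways of choosing $ 5$ pixel indices out of $ 256$ with repetition. A one-line check of the arithmetic (in Python, used here purely for illustration):

from math import comb

# number of distinct 5th-order monomials of 256 pixel values:
# multisets of size 5 drawn from 256 variables
print(comb(5 + 256 - 1, 5))   # 9525431552, i.e. on the order of 10^10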

The problems concerning the storage and the manipulation of the high-dimensional data can be avoided. It turns out that for a certain class of mappings we are well able to compute scalar products in this new space even if it is extremely high-dimensional. Simplifying the above example of computing all $ 5$th order products of $ 256$ pixels to that of computing all $ 2$nd order products of two "pixels", i.e.

$\displaystyle {\boldsymbol{x}} = (x_1,x_2) \quad\text{and}\quad \Phi({\boldsymbol{x}}) = \left(x_1^2,\sqrt{2}\,x_1x_2,x_2^2\right)\;,$    

the computation of a scalar product between two such feature space vectors can be readily reformulated in terms of a kernel function $ k$:

$\displaystyle \Phi({\boldsymbol{x}})^{\top}\Phi(\boldsymbol{z})$ $\displaystyle =\left(x_1^2, \sqrt{2}\; x_1x_2, x_2^2\right)\left(z_1^2, \sqrt{2}\; z_1z_2, z_2^2\right)^{\top}$    
  $\displaystyle =\left((x_1,x_2)(z_1,z_2)^{\top}\right)^2$    
  $\displaystyle =\left({\boldsymbol{x}}^{\top}\boldsymbol{z}\right)^2$    
  $\displaystyle =: k({\boldsymbol{x}},\boldsymbol{z})\;.$    
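This identity is easy to verify numerically. The following minimal sketch (NumPy and the helper name phi are illustrative choices, not part of the text) compares the explicit feature-space scalar product with the kernel evaluation:

import numpy as np

def phi(x):
    # explicit 2nd-order monomial feature map for x = (x1, x2)
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)     # scalar product in the feature space
rhs = (x @ z) ** 2        # kernel evaluation k(x, z) = (x'z)^2
assert np.isclose(lhs, rhs)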

This finding generalizes: for $ {\boldsymbol{x}},\boldsymbol{z}\in{\mathbb{R}}^N$ and $ d\in{\mathbb{N}}$, the kernel function

$\displaystyle k({\boldsymbol{x}},\boldsymbol{z}) = \left({\boldsymbol{x}}^{\top}\boldsymbol{z}\right)^d$    

computes a scalar product in the space of all products of $ d$ vector entries (monomials) of $ {\boldsymbol{x}}$ and $ \boldsymbol{z}$ ([71,60]).
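The general case can be checked in the same spirit: one possible realization of the degree-$ d$ monomial feature space lists all ordered $ d$-fold products of the entries of $ {\boldsymbol{x}}$, and the scalar product of two such feature vectors collapses to $ ({\boldsymbol{x}}^{\top}\boldsymbol{z})^d$. A small sketch, again with purely illustrative names:

import itertools
import numpy as np

def phi_d(x, d):
    # all ordered d-fold products of the entries of x
    # (one realization of the degree-d monomial feature space)
    return np.array([np.prod([x[i] for i in idx])
                     for idx in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(1)
x, z = rng.normal(size=4), rng.normal(size=4)

for d in (2, 3, 4):
    assert np.isclose(phi_d(x, d) @ phi_d(z, d), (x @ z) ** d)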

The kernel trick ([1,12,71]) is to take the original algorithm and to formulate it such that the mapped data $ \Phi({\boldsymbol{x}})$ are used only inside scalar products. Then, if we can efficiently evaluate these scalar products, we do not need to carry out the mapping $ \Phi $ explicitly and can still solve the problem in the huge feature space $ {\mathcal{E}}$. Furthermore, we do not even need to know the mapping $ \Phi $ but only the kernel function.
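To make this concrete, the sketch below kernelizes a simple mistake-driven linear algorithm, the perceptron, which serves here only as a stand-in for the SVM treated in this chapter: the data enter exclusively through kernel evaluations, so $ \Phi $ never has to be computed explicitly. All names and parameter values are illustrative assumptions.

import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    # dual (kernelized) perceptron: the data enter only via k(x_i, x_j)
    # X: (M, N) inputs, y: labels in {-1, +1}, k: kernel function
    M = len(X)
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix
    alpha = np.zeros(M)
    for _ in range(epochs):
        for i in range(M):
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0                           # mistake-driven update
    # decision function f(x) = sign( sum_i alpha_i y_i k(x_i, x) )
    return lambda x: np.sign(sum(a * yi * k(xi, x)
                                 for a, yi, xi in zip(alpha, y, X)))

# usage with the polynomial kernel k(x, z) = (x'z)^2 from above
X = np.array([[1., 1.], [-1., -1.], [1., -1.], [-1., 1.]])
y = np.array([1, 1, -1, -1])
f = kernel_perceptron(X, y, k=lambda x, z: (x @ z) ** 2)
print(f(np.array([2., 2.])))   # 1.0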

Now we can ask two questions:

  1. For which mappings $ \Phi $ does there exist a simple way to evaluate the scalar product?
  2. Under which conditions does a function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ correspond to a scalar product?
The first question is difficult to answer in general. For the second question, however, there exists an answer, which we present in the following.

15.4.2 Feature Spaces

To address the question of whether a kernel function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ corresponds to a dot-product, let us first introduce some more notation and definitions. Given a training sample $ \{{\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M\}\subseteq{\mathcal{X}}$, the $ M\times M$ matrix $ K$ with elements $ K_{ij}=k({\boldsymbol{x}}_i,{\boldsymbol{x}}_j)$ is called the kernel matrix or the Gram matrix. An $ M\times M$ matrix $ K$ (and, more generally, any symmetric matrix) is said to be positive definite if every quadratic form over $ K$ is non-negative, i.e. for all $ r_i\in{\mathbb{R}}$, $ i=1,\ldots,M$, we have

$\displaystyle \sum_{i,\,j\,=\,1}^M r_ir_jK_{ij}\geq 0\;.$ (15.13)

Positive definite kernels are exactly those giving rise to a positive definite kernel matrix $ K$ for all $ M$ and all sets $ \{{\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M\}\subseteq{\mathcal{X}}$. Note that for a kernel (and a matrix) to be positive definite, it must be symmetric and non-negative on the diagonal.
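Condition (15.13) is easy to probe numerically for a given kernel and sample: form the Gram matrix and check that it is symmetric with non-negative eigenvalues (up to round-off). A minimal sketch, assuming NumPy and an arbitrarily chosen polynomial kernel:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                          # 20 training points in R^3

k = lambda x, z: (x @ z) ** 3                         # polynomial kernel, d = 3
K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix K_ij = k(x_i, x_j)

# symmetry and non-negative quadratic forms (15.13), checked via the spectrum
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-8           # >= 0 up to round-off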

For any positive definite kernel $ k$ we can construct a mapping $ \Phi $ into a feature space  $ {\mathcal{E}}$, such that $ k$ acts as a dot-product over $ \Phi $. As a matter of fact, it is possible to construct more than one of these spaces. We will omit many crucial details and only present the central results. For more details see e.g.  [59].

15.4.2.1 The Feature Map

Given a real-valued, positive definite kernel function $ k$, defined over a non-empty set  $ {\mathcal{X}}$, we define the feature space  $ {\mathcal{E}}$ as the space of all functions mapping from $ {\mathcal{X}}$ to $ {\mathbb{R}}$, i.e. as $ {\mathcal{E}}={\mathbb{R}}^{\mathcal{X}}=\{f\vert f:{\mathcal{X}}\rightarrow{\mathbb{R}}\}$. Notice that, unlike the example in Fig. 15.6, this feature space is not a usual Euclidean space but rather a vector space of functions. The mapping $ \Phi $ is now defined as

$\displaystyle \Phi:{\mathcal{X}}\rightarrow{\mathbb{R}}^{\mathcal{X}}, \Phi({\boldsymbol{x}}) = k(\cdot,{\boldsymbol{x}})\;,$ (15.14)

i.e. $ \Phi $ maps each $ {\boldsymbol{x}}$ to the function $ k(\cdot,{\boldsymbol{x}})$, that is, the kernel $ k$ with its first argument left free and its second argument fixed to $ {\boldsymbol{x}}$. One can show that the set of all linear combinations of the form

$\displaystyle f(\cdot) = \sum_{i=1}^M\alpha_ik(\cdot,{\boldsymbol{x}}_i)\;,$ (15.15)

for arbitrary $ M$, $ \alpha_i\in{\mathbb{R}}$, and $ {\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M$ forms a vector space. Moreover, for all functions of the form (15.15) one gets

$\displaystyle \left\langle k(\cdot,{\boldsymbol{x}}),f\right\rangle_{{\mathcal{H}}} = f({\boldsymbol{x}})\;,$    

where $ \langle\cdot,\cdot\rangle_{{\mathcal{H}}}$ denotes the scalar product in a Hilbert space that will be made precise below. In particular we have

$\displaystyle \left\langle k(\cdot,{\boldsymbol{x}}),k(\cdot,\boldsymbol{z})\right\rangle_{{\mathcal{H}}}$ $\displaystyle = \left\langle\Phi({\boldsymbol{x}}),\Phi(\boldsymbol{z})\right\rangle_{{\mathcal{E}}}$    
  $\displaystyle = k({\boldsymbol{x}},\boldsymbol{z})\;.$    

The last property is the reason why positive definite kernels are also called reproducing kernels: they reproduce the evaluation of $ f$ on $ {\boldsymbol{x}}$. It also shows that $ k$ indeed computes, as desired, the dot-product in $ {\mathcal{E}}$ for $ \Phi({\boldsymbol{x}})$ defined as in (15.14). Hence (15.14) is one possible realization of the mapping associated with a kernel and is called the feature map (for its empirical counterpart see e.g. [41]). The following is a formal definition of a Reproducing Kernel Hilbert Space (cf. [59]).

Definition 1 (Reproducing Kernel Hilbert Space (RKHS))  

Let $ {\mathcal{X}}$ be a nonempty set and $ {\mathcal{H}}$ a Hilbert space of functions $ f:{\mathcal{X}}\rightarrow{\mathbb{R}}$. Then $ {\mathcal{H}}$ is called a reproducing kernel Hilbert space endowed with the dot product $ \langle\cdot,\cdot\rangle$ if there exists a function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ with the properties that

  1. $ k$ has the reproducing property $ \langle f,k(\cdot,{\boldsymbol{x}})\rangle = f({\boldsymbol{x}})$ for all $ f\in{\mathcal{H}}$, in particular $ \langle k(\cdot,{\boldsymbol{x}}),k(\cdot,\boldsymbol{z})\rangle=k({\boldsymbol{x}},\boldsymbol{z})$, and
  2. $ k$ spans $ {\mathcal{H}}$, i.e.  $ {\mathcal{H}}=\overline{{\text{span}}\{k(\cdot,{\boldsymbol{x}})\vert{\boldsymbol{x}}\in{\mathcal{X}}\}}$, where $ \overline{A}$ denotes the completion of the set $ A$.

One can show that the kernel $ k$ of such an RKHS is uniquely determined.
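These properties can be mimicked numerically: represent functions of the form (15.15) by their coefficients and expansion points, define the scalar product bilinearly through kernel evaluations, and check the reproducing property. The Gaussian kernel and the helper names in the following sketch are illustrative assumptions, not part of the text.

import numpy as np

k = lambda x, z: np.exp(-np.sum((x - z) ** 2))        # a positive definite kernel

def rkhs_dot(alpha, Xs, beta, Zs):
    # <f, g> for f = sum_i alpha_i k(., x_i), g = sum_j beta_j k(., z_j),
    # defined bilinearly as sum_ij alpha_i beta_j k(x_i, z_j)
    return sum(a * b * k(xi, zj)
               for a, xi in zip(alpha, Xs) for b, zj in zip(beta, Zs))

rng = np.random.default_rng(3)
Xs, alpha = rng.normal(size=(5, 2)), rng.normal(size=5)
f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, Xs))   # form (15.15)

x, z = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(rkhs_dot([1.0], [x], alpha, Xs), f(x))      # <k(., x), f> = f(x)
assert np.isclose(rkhs_dot([1.0], [x], [1.0], [z]), k(x, z))  # <Phi(x), Phi(z)> = k(x, z)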

15.4.2.2 Mercer Kernels

As a second way to identify a feature space associated with a kernel $ k$ one can use a technique derived from Mercer's Theorem.

Mercer's Theorem, which we reproduce in the following, states that if the function $ k$ (the kernel) gives rise to a positive integral operator, the evaluation of $ k({\boldsymbol{x}},\boldsymbol{z})$ can be expressed as a finite or infinite, absolutely and uniformly convergent series, almost everywhere. This series then defines, in another way, a feature space and an associated mapping connected to the kernel $ k$.

Let $ {\mathcal{X}}$ be a finite measure space, i.e. a space with a $ \sigma $-algebra and a measure $ \mu$ satisfying $ \mu({\mathcal{X}})<\infty$.

Theorem 3 ([40])  

Suppose $ k\in L_\infty({\mathcal{X}}^2,\mu)$ is a symmetric real-valued function such that the integral operator

$\displaystyle T_{k}:L_2({\mathcal{X}},\mu)\rightarrow L_2({\mathcal{X}},\mu)\;,\quad (T_{k}f)({\boldsymbol{x}}) := \int_{\mathcal{X}}k({\boldsymbol{x}},\boldsymbol{z})f(\boldsymbol{z})\,{\mathrm{d}}\mu(\boldsymbol{z})$    

is positive definite, i.e. for all $ f\in L_2({\mathcal{X}},\mu)$

$\displaystyle \int_{{\mathcal{X}}^2}k({\boldsymbol{x}},\boldsymbol{z})f({\boldsymbol{x}})f(\boldsymbol{z})\,{\mathrm{d}}\mu({\boldsymbol{x}})\,{\mathrm{d}}\mu(\boldsymbol{z})\geq 0\;.$    

Let $ \varphi_j\in L_2({\mathcal{X}},\mu)$ be the normalized orthogonal eigenfunctions of $ T_{k}$ associated with the eigenvalues $ \lambda_j\geq 0$, sorted in non-increasing order. Then
  1. $ (\lambda_j)_j\in l_1$
  2. $ k({\boldsymbol{x}},\boldsymbol{z}) = \sum_{j=1}^{N_{{\mathcal{E}}}}\lambda_j\varphi_j({\boldsymbol{x}})\varphi_j(\boldsymbol{z})$ holds for almost all $ {\boldsymbol{x}},\boldsymbol{z}$. Either $ N_{{\mathcal{E}}}\in{\mathbb{N}}$ or $ N_{{\mathcal{E}}}=\infty$; in the latter case, the series converges absolutely and uniformly for almost all $ {\boldsymbol{x}},\boldsymbol{z}$.

If we choose as feature space $ {\mathcal{E}}=l_2^{N_{{\mathcal{E}}}}$ and the mapping $ \Phi $ as

$\displaystyle \Phi:{\mathcal{X}}\rightarrow l_2^{N_{{\mathcal{E}}}}\;, \quad\Phi({\boldsymbol{x}}) = \left(\sqrt{\lambda_j}\,\varphi_j({\boldsymbol{x}})\right)_{j=1,\ldots,N_{{\mathcal{E}}}}\;,$    

we see from the second statement in Theorem 3 that the kernel $ k$ corresponds to the dot product in $ l_2^{N_{{\mathcal{E}}}}$, i.e.  $ k({\boldsymbol{x}},\boldsymbol{z}) = \langle\Phi({\boldsymbol{x}}),\Phi(\boldsymbol{z})\rangle$.
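On a finite sample, the eigendecomposition of the Gram matrix plays the role of the eigenvalue/eigenfunction expansion of Theorem 3. The following sketch (the Gaussian kernel and all parameter values are illustrative assumptions) builds the corresponding finite-dimensional map and verifies that it reproduces the kernel matrix:

import numpy as np

k = lambda x, z: np.exp(-np.sum((x - z) ** 2))        # Gaussian kernel (cf. (15.16))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
K = np.array([[k(xi, xj) for xj in X] for xi in X])   # Gram matrix

# discrete analogue of Theorem 3: eigen-decompose K and set
# Phi(x_i) = (sqrt(lambda_j) v_j[i])_j
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)                         # clip tiny negative round-off
Phi = V * np.sqrt(lam)                                # row i is Phi(x_i)

assert np.allclose(Phi @ Phi.T, K, atol=1e-8)         # <Phi(x_i), Phi(x_j)> = K_ij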

Kernels satisfying Mercer's Theorem are called Mercer kernels. It can be shown that, if the set $ {\mathcal{X}}$ on which the kernel is defined is compact, a kernel is a Mercer kernel if and only if it is a positive definite kernel (cf. [63]).

Table 15.1 lists some of the most widely used kernel functions. More sophisticated kernels (e.g. kernels generating splines or Fourier expansions) and kernels designed for special applications like DNA analysis can be found in [71], [65], [63], [28], [30], [79], [69] and numerous others.


Table 15.1: Common kernel functions

Gaussian RBF            $ k({\boldsymbol{x}},\boldsymbol{z}) = \exp\left(-\dfrac{\Vert{\boldsymbol{x}} - \boldsymbol{z}\Vert^2}{c}\right)$   (15.16)
Polynomial              $ k({\boldsymbol{x}},\boldsymbol{z}) = \left({\boldsymbol{x}}^{\top}\boldsymbol{z} + \theta\right)^d$
Sigmoidal               $ k({\boldsymbol{x}},\boldsymbol{z}) = \tanh\left(\kappa\left({\boldsymbol{x}}^{\top}\boldsymbol{z}\right) + \theta\right)$
Inverse multiquadric    $ k({\boldsymbol{x}},\boldsymbol{z}) = \dfrac{1}{\sqrt{\Vert{\boldsymbol{x}} - \boldsymbol{z}\Vert^2 + c^2}}$
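Written out as plain functions, the kernels of Table 15.1 read as follows (the parameter defaults are arbitrary illustration values; note that the sigmoidal kernel is not positive definite for all parameter choices):

import numpy as np

gaussian_rbf  = lambda x, z, c=1.0: np.exp(-np.sum((x - z) ** 2) / c)
polynomial    = lambda x, z, theta=1.0, d=3: (x @ z + theta) ** d
sigmoidal     = lambda x, z, kappa=1.0, theta=0.0: np.tanh(kappa * (x @ z) + theta)
inv_multiquad = lambda x, z, c=1.0: 1.0 / np.sqrt(np.sum((x - z) ** 2) + c ** 2)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian_rbf(x, z), polynomial(x, z), inv_multiquad(x, z))   # 0.135..., 1.0, 0.577...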


15.4.3 Properties of Kernels

Besides being useful tools for the computation of dot-products in high- or infinite-dimensional spaces, kernels possess some additional properties that make them an interesting choice in algorithms. It was shown ([26]) that using a particular positive definite kernel corresponds to an implicit choice of a regularization operator. For translation-invariant kernels, the regularization properties can be expressed conveniently in Fourier space in terms of the frequencies ([63,25]). For example, Gaussian kernels (cf. (15.16)) correspond to a general smoothness assumption in all $ k$-th order derivatives ([63]). Vice versa, using this correspondence, kernels matching a certain prior about the frequency content of the data can be constructed to reflect our prior knowledge about the problem.

Another particularly useful feature of kernel functions is that we are not restricted to kernels operating on vectorial data, e.g. $ {{\mathcal{X}}={\mathbb{R}}^N}$. It is also possible to define positive definite kernels for, e.g., strings or graphs, making it possible to embed discrete objects into a metric space and to apply metric-based algorithms (e.g. [28,76,79]).
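As a toy illustration of this idea (only loosely in the spirit of the string kernels cited above, not a reproduction of them), a positive definite kernel on strings can be obtained by counting shared substrings of a fixed length, i.e. as a scalar product between substring-count vectors:

from collections import Counter

def substring_count_kernel(s, t, p=3):
    # dot product between the vectors of length-p substring counts of s and t;
    # the (potentially huge) count vectors over the full alphabet are never formed
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

print(substring_count_kernel("GATTACA", "ATTACCA"))   # 3 (shared 3-mers: ATT, TTA, TAC)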

Furthermore, many algorithms can be formulated using so-called conditionally positive definite kernels (cf. [63,55]), which form a superclass of the positive definite kernels considered so far. They can be interpreted as generalized non-linear dissimilarity measures (as opposed to just the scalar product) and are applicable e.g. in SVM and kernel PCA.

