Next: 15.5 Implementation of SVM Up: 15. Support Vector Machines Previous: 15.3 Linear SVM: Learning

Subsections

# 15.4 Non-linear SVM

In the previous section we have seen that by restricting ourselves to linear functions one can control the complexity of a learning machine. We have thus avoided the problem of dealing with too complex functions at the price of being able to solve only linearly separable problems. In the following we will show how to extend the linear SVM for constructing a very rich set of non-linear decision functions while at the same time controlling their complexity.

Central to the success of support vector machines was the re-discovery of the so called Reproducing Kernel Hilbert Spaces (RKHS) and Mercer's Theorem ([12]). There is a large body of literature dealing with kernel functions, their theory and applicability, see e.g. [34], [3], [1], [12] or [59] for an overview. We only recall the basic definitions and properties necessary for turning our linear, hyperplane based learning technique into a very powerful algorithm capable of finding non-linear decision functions with controllable complexity.

## 15.4.1 The Kernel Trick

The basic idea of the so called kernel-methods is to first preprocess the data by some non-linear mapping  and then to apply the same linear algorithm as before but in the image space of . (cf. Fig. 15.6 for an illustration). More formally we apply the mapping

to the data and consider our algorithm in instead of , i.e. the sample is preprocessed as

In certain applications we might have sufficient knowledge about our problem such that we can design an appropriate  by hand (e.g. [79,11]). If this mapping is not too complex to compute and the space is not too high-dimensional, we might just explicitly apply this mapping to our data. Something similar is done for (single hidden layer) neural networks ([10]), radial basis networks (e.g [42]) or Boosting algorithms ([24]), where the input data are mapped to some representation given by the hidden layer, the RBF bumps or the hypotheses space, respectively ([52]). The difference with kernel-methods is that for a suitably chosen  we get an algorithm that has powerful non-linearities but is still very intuitive and retains most of the favorable properties of its linear input space version.

The problem with explicitly using the mapping to contruct a feature space is that the resulting space can be extremely high-dimensional. As an example consider the case when the input space consists of images of pixels, i.e.  dimensional vectors, and we choose th order monomials as non-linear features. The dimensionality of such space would be

Such a mapping would clearly be intractable to carry out explicitly. We are not only facing the technical problem of storing the data and doing the computations, but we are also introducing problems due to the fact that we are now working in an extremely sparse sampled space. By the use of the SRM principle, we can be less sensitive to the dimensionality of the space and achieve good generalization.

The problems concerning the storage and the manipulation of the high dimensional data can be avoided. It turns out that for a certain class of mappings we are well able to compute scalar products in this new space even if it is extremely high dimensional. Simplifying the above example of computing all th order products of  pixels to that of computing all nd order products of two ''pixels'', i.e.

 and

the computation of a scalar product between two such feature space vectors can be readily reformulated in terms of a kernel function :

This finding generalizes: For , and the kernel function

computes a scalar product in the space of all products of vector entries (monomials) of and ([71,60]).

The kernel trick ([1,12,71]) is to take the original algorithm and formulate it such, that we only use in scalar products. Then, if we can efficiently evaluate these scalar products, we do not need to carry out the mapping  explicitly and can still solve the problem in the huge feature space  . Furthermore, we do not need to know the mapping  but only the kernel function.

Now we can ask two questions:

1. For which mappings does there exist a simple way to evaluate the scalar product?
2. Under which conditions does a function correspond to a scalar product?
The first question is difficult to answer in general. But for the second question there exists an answer which we present in the following.

## 15.4.2 Feature Spaces

To address the question whether a kernel function is a dot-product let us first introduce some more notation and definitions. Given a training sample , the matrix  with elements is called the kernel matrix or the Gram matrix. An matrix (and any other symmetric matrix) is said to be positive definite if any quadratic form over is positive, i.e. for all , , we have

 (15.13)

Positive definite kernels are exactly those giving rise to a positive definite kernel matrix  for all and all sets . Note, that for a kernel (and a matrix) to be positive definite, it is necessary to be symmetric and non-negative on the diagonal.

For any positive definite kernel we can construct a mapping  into a feature space  , such that acts as a dot-product over . As a matter of fact, it is possible to construct more than one of these spaces. We will omit many crucial details and only present the central results. For more details see e.g.  [59].

### 15.4.2.1 The Feature Map

Given a real-valued, positive definite kernel function , defined over a non-empty set  , we define the feature space  as the space of all functions mapping from to , i.e. as . Notice that, unlike the example in Fig. 15.6, this feature space is not a usual Euclidean space but rather a vector space of functions. The mapping is now defined as

 (15.14)

i.e.  maps each to the function , i.e. the kernel where the first argument is free and the second is fixed to . One can show that the set of all linear combinations of the form

 (15.15)

for arbitrary , , and forms a vector space. Especially, for all functions of the form (15.15) one gets

where denotes the scalar product in some Hilbert space that will become clearer below. In particular we have

The last property is the reason why positive definite kernels are also called reproducing kernels: they reproduce the evaluation of  on  . It also shows that indeed computes, as desired, the dot-product in for defined as in (15.14). Hence (15.14) is one possible realization of the mapping associated with a kernel and is called the feature map (for its empirical counterpart see e.g. ([41])). The following is a formal definition of a Reproducing Kernel Hilbert Space (cf. [59]).

Definition 1 (Reproducing Kernel Hilbert Space (RKHS))

Let be a nonempty set and a Hilbert space of functions . Then is called a reproducing kernel Hilbert space endowed with the dot product if there exists a function with the properties that

1. has the reproducing property for all , in particular , and
2. spans , i.e.  , where denotes the completion of the set .

One can show, that the kernel for such a RKHS is uniquely determined.

### 15.4.2.2 Mercer Kernels

As a second way to identify a feature space associated with a kernel one can use a technique derived from Mercer's Theorem.

The Mercer's Theorem, which we will reproduce in the following, states that if the function (the kernel) gives rise to a positive integral operator, the evaluation of can be expressed as a finite or infinite, absolute and uniformly convergent series, almost everywhere. This series then defines in another way a feature space and an associated mapping connected to the kernel .

Let be a finite measure space, i.e. a space with a -algebra and a measure satisfying .

Theorem 3 ([40])

Suppose is a symmetric real-valued function such that the integral operator

 d

is positive definite, i.e. for all

 dd

Let be the normalized orthogonal eigenfunctions of associated with the eigenvalues , sorted in non-increasing order. Then
1. holds for almost all . Either or ; in the latter case, the series converges absolutely and uniformly for almost all .

If we choose as feature space and the mapping as

we see from the second statement in Theorem 3 that the kernel corresponds to the dot product in , i.e.  .

The kernels satisfying the Mercer's Theorem are called Mercer kernels. It can be shown that, if the set on which the kernel is defined, is compact, a kernel is a Mercer kernel if and only if it is a positive definite kernel (cf. [63]).

Table 15.1 lists some of the most widely used kernel functions. More sophisticated kernels (e.g. kernels generating splines or Fourier expansions) and kernels designed for special applications like DNA analysis can be found in [71], [65], [63], [28], [30], [79], [69] and numerous others.

## 15.4.3 Properties of Kernels

Besides being useful tools for the computation of dot-products in high- or infinite-dimensional spaces, kernels possess some additional properties that make them an interesting choice in algorithms. It was shown ([26]) that using a particular positive definite kernel corresponds to an implicit choice of a regularization operator. For translation-invariant kernels, the regularization properties can be expressed conveniently in Fourier space in terms of the frequencies ([63,25]). For example, Gaussian kernels (cf. (15.16)) correspond to a general smoothness assumption in all -th order derivatives ([63]). Vice versa, using this correspondence, kernels matching a certain prior about the frequency content of the data can be constructed so as to reflect our prior problem knowledge.

Another particularly useful feature of kernel functions is that we are not restricted to kernels that operate on vectorial data, e.g. . In principle, it is also possible to define positive kernels for e.g. strings or graphs, i.e. making it possible to embed discrete objects into a metric space and apply metric-based algorithms (e.g. [28,76,79]).

Furthermore, many algorithms can be formulated using so called conditionally positive definite kernels (cf. [63,55]) which are a superclass of the positive definite kernels considered so far. They can be interpreted as generalized non-linear dissimilarity measures (opposed to just the scalar product) and are applicable e.g. in SVM and kernel PCA.

Next: 15.5 Implementation of SVM Up: 15. Support Vector Machines Previous: 15.3 Linear SVM: Learning