next up previous contents index
Next: 15.5 Implementation of SVM Up: 15. Support Vector Machines Previous: 15.3 Linear SVM: Learning


15.4 Non-linear SVM

In the previous section we saw that restricting ourselves to linear functions allows one to control the complexity of a learning machine. We have thus avoided the problem of dealing with overly complex functions, at the price of being able to solve only linearly separable problems. In the following we show how to extend the linear SVM to construct a very rich set of non-linear decision functions while at the same time controlling their complexity.

Central to the success of support vector machines was the re-discovery of the so-called Reproducing Kernel Hilbert Spaces (RKHS) and Mercer's Theorem ([12]). There is a large body of literature dealing with kernel functions, their theory and applicability; see e.g. [34], [3], [1], [12] or [59] for an overview. We only recall the basic definitions and properties necessary for turning our linear, hyperplane-based learning technique into a very powerful algorithm capable of finding non-linear decision functions with controllable complexity.

15.4.1 The Kernel Trick

The basic idea of so-called kernel methods is to first preprocess the data by some non-linear mapping $ \Phi $ and then to apply the same linear algorithm as before, but in the image space of $ \Phi $ (cf. Fig. 15.6 for an illustration). More formally, we apply the mapping

$\displaystyle \Phi: \quad {\mathbb{R}}^N \rightarrow {\mathcal{E}}, \qquad {\boldsymbol{x}} \mapsto \Phi({\boldsymbol{x}})$    

to the data $ {\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M \in {\mathcal{X}}$ and consider our algorithm in $ {\mathcal{E}}$ instead of $ {\mathcal{X}}$, i.e. the sample is preprocessed as

$\displaystyle \left\{(\Phi({\boldsymbol{x}}_1),y_1),\ldots,(\Phi({\boldsymbol{x}}_M),y_M)\right\} \subseteq ({\mathcal{E}}\times{\mathcal{Y}})^M\;.$    

Figure 15.6: Three different views on the same two class separation problem. (a) A linear separation of the input points is not possible without errors. Even allowing misclassification of one data point results in a small margin. (b) A better separation is provided by a non-linear surface in the input space. (c) This non-linear surface corresponds to a linear surface in a feature space. Data points are mapped from input space to feature space by the function $ \Phi $ induced by the kernel function $ k$

In certain applications we might have sufficient knowledge about our problem to design an appropriate $ \Phi $ by hand (e.g. [79,11]). If this mapping is not too complex to compute and the space $ {\mathcal{E}}$ is not too high-dimensional, we can simply apply it to our data explicitly. Something similar is done for (single hidden layer) neural networks ([10]), radial basis networks (e.g. [42]) or Boosting algorithms ([24]), where the input data are mapped to some representation given by the hidden layer, the RBF bumps or the hypotheses space, respectively ([52]). The difference with kernel methods is that for a suitably chosen $ \Phi $ we get an algorithm that has powerful non-linearities but is still very intuitive and retains most of the favorable properties of its linear input-space version.

The problem with explicitly using the mapping $ \Phi $ to construct a feature space is that the resulting space can be extremely high-dimensional. As an example consider the case when the input space $ {\mathcal{X}}$ consists of images of $ 16\times 16$ pixels, i.e. $ 256$-dimensional vectors, and we choose $ 5$th order monomials as non-linear features. The dimensionality of such a space would be

$\displaystyle \left(\begin{array}{c}5+256-1\\ 5\end{array}\right) \approx 10^{10} \;.$    

Such a mapping would clearly be intractable to carry out explicitly. We are not only facing the technical problem of storing the data and doing the computations, but we are also introducing problems due to the fact that we are now working in an extremely sparsely sampled space. By the use of the SRM principle, we can be less sensitive to the dimensionality of the space and achieve good generalization.
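The count above is easy to verify: the number of monomials of degree $d$ over $N$ variables is the number of multisets of size $d$ drawn from $N$ symbols. A quick sanity check (the function name is ours):

```python
import math

def monomial_space_dim(N: int, d: int) -> int:
    """Number of degree-d monomials over N variables: C(d + N - 1, d)."""
    return math.comb(d + N - 1, d)

# The 16x16-pixel example: 5th order monomials over 256 inputs.
dim = monomial_space_dim(256, 5)
print(dim)  # 9525431552, i.e. roughly 10^10
```

For comparison, the small $N=2$, $d=2$ case used below has only three features, matching $\left(x_1^2,\sqrt{2}x_1x_2,x_2^2\right)$.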

The problems concerning the storage and the manipulation of the high-dimensional data can be avoided. It turns out that for a certain class of mappings we can still compute scalar products in this new space, even if it is extremely high-dimensional. Simplifying the above example of computing all $ 5$th order products of $ 256$ pixels to that of computing all $ 2$nd order products of two ''pixels'', i.e.

$\displaystyle {\boldsymbol{x}} = (x_1,x_2)$   and$\displaystyle \quad\Phi({\boldsymbol{x}}) = \left(x_1^2,\sqrt{2}x_1x_2,x_2^2\right)\;,$    

the computation of a scalar product between two such feature space vectors can be readily reformulated in terms of a kernel function $ k$:

$\displaystyle \left(\Phi({\boldsymbol{x}})^{\top}\Phi(\boldsymbol{z})\right)$ $\displaystyle =\left(x_1^2, \sqrt{2}\; x_1x_2, x_2^2\right)\left(z_1^2, \sqrt{2}\; z_1z_2, z_2^2\right)^{\top}\nonumber$    
  $\displaystyle =\left((x_1,x_2)(z_1,z_2)^{\top}\right)^2\nonumber$    
  $\displaystyle =\left({\boldsymbol{x}}^{\top}\boldsymbol{z}\right)^2\nonumber$    
  $\displaystyle =: k({\boldsymbol{x}},\boldsymbol{z})\;.$    

This finding generalizes: For $ {\boldsymbol{x}},\boldsymbol{z}\in{\mathbb{R}}^N$, and $ d\in{\mathbb{N}}$ the kernel function

$\displaystyle k({\boldsymbol{x}},\boldsymbol{z}) = \left({\boldsymbol{x}}^{\top}\boldsymbol{z}\right)^d$    

computes a scalar product in the space of all products of $ d$ vector entries (monomials) of $ {\boldsymbol{x}}$ and $ \boldsymbol{z}$ ([71,60]).
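This identity is easy to check numerically for the toy case $d=2$, $N=2$ above; a minimal sketch (function names are ours):

```python
import math

def phi(x):
    """Explicit 2nd-order monomial feature map for 2-dimensional input."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def poly_kernel(x, z, d=2):
    """Polynomial kernel k(x, z) = (x^T z)^d, evaluated in input space."""
    return sum(xi * zi for xi, zi in zip(x, z)) ** d

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # dot product in feature space
implicit = poly_kernel(x, z)                           # kernel in input space
print(explicit, implicit)  # both equal 121.0
```

The kernel evaluation costs one $N$-dimensional dot product plus a power, regardless of how large the monomial feature space becomes.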

The kernel trick ([1,12,71]) is to take the original algorithm and formulate it such that we use $ \Phi({\boldsymbol{x}})$ only inside scalar products. Then, if we can evaluate these scalar products efficiently, we do not need to carry out the mapping $ \Phi $ explicitly and can still solve the problem in the huge feature space  $ {\mathcal{E}}$. Furthermore, we do not need to know the mapping $ \Phi $ itself but only the kernel function.

Now we can ask two questions:

  1. For which mappings $ \Phi $ does there exist a simple way to evaluate the scalar product?
  2. Under which conditions does a function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ correspond to a scalar product?
The first question is difficult to answer in general. But for the second question there exists an answer which we present in the following.
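To illustrate the trick on a full algorithm, here is a kernelized perceptron, a deliberately simple stand-in for the SVM (the choice of algorithm and all names are ours, not from the text): the weight vector $w=\sum_i \alpha_i y_i \Phi({\boldsymbol{x}}_i)$ is never formed; only kernel evaluations between input points occur.

```python
import numpy as np

def rbf(x, z, c=1.0):
    # Gaussian kernel; evaluates a feature-space dot product implicitly.
    return np.exp(-np.sum((x - z) ** 2) / c)

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: learns multipliers alpha instead of a weight vector."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            # Decision value uses only kernel evaluations k(x_j, x_i).
            f = sum(a * yj * kernel(xj, xi) for a, xj, yj in zip(alpha, X, y))
            if yi * f <= 0:        # mistake -> strengthen this point's influence
                alpha[i] += 1.0
    return alpha

# XOR-like data: not linearly separable in input space.
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., -1., -1.])
alpha = kernel_perceptron(X, y, rbf)
pred = [np.sign(sum(a * yj * rbf(xj, x) for a, xj, yj in zip(alpha, X, y)))
        for x in X]
```

With the Gaussian kernel the four XOR points become separable in feature space, so the dual perceptron classifies all of them correctly, which no linear decision function in the input space can do.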

15.4.2 Feature Spaces

To address the question whether a kernel function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ corresponds to a dot-product, let us first introduce some more notation and definitions. Given a training sample $ \{{\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M\}\subseteq{\mathcal{X}}$, the $ M\times M$ matrix $ K$ with elements $ K_{ij}=k({\boldsymbol{x}}_i,{\boldsymbol{x}}_j)$ is called the kernel matrix or the Gram matrix. An $ M\times M$ matrix $ K$ (and any other symmetric matrix) is said to be positive definite if every quadratic form over $ K$ is non-negative, i.e. if for all $ r_i\in{\mathbb{R}}$, $ i=1,\ldots,M$, we have

$\displaystyle \sum_{i,\,j\,=\,1}^M r_ir_jK_{ij}\geq 0\;.$ (15.13)

Positive definite kernels are exactly those giving rise to a positive definite kernel matrix $ K$ for all $ M$ and all sets $ \{{\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M\}\subseteq{\mathcal{X}}$. Note that for a kernel (and a matrix) to be positive definite, it is necessary that it be symmetric and non-negative on the diagonal.
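Condition (15.13) can be checked empirically on any finite sample: a symmetric matrix satisfies it exactly when all its eigenvalues are non-negative. A small sketch for the Gaussian kernel (assuming NumPy; the tolerance guards against round-off):

```python
import numpy as np

def gaussian_gram(X, c=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / c) for the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / c)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gaussian_gram(X)

eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric Gram matrix
print(eigvals.min() >= -1e-10)    # True: K is positive (semi-)definite
```

The same check applied to an indefinite similarity measure would produce negative eigenvalues, signalling that no feature space interpretation exists.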

For any positive definite kernel $ k$ we can construct a mapping $ \Phi $ into a feature space  $ {\mathcal{E}}$ such that $ k$ acts as a dot-product over $ \Phi $. As a matter of fact, it is possible to construct more than one such space. We will omit many crucial details and only present the central results. For more details see e.g. [59].

The Feature Map

Given a real-valued, positive definite kernel function $ k$, defined over a non-empty set  $ {\mathcal{X}}$, we define the feature space  $ {\mathcal{E}}$ as the space of all functions mapping from $ {\mathcal{X}}$ to $ {\mathbb{R}}$, i.e. as $ {\mathcal{E}}={\mathbb{R}}^{\mathcal{X}}=\{f\vert f:{\mathcal{X}}\rightarrow{\mathbb{R}}\}$. Notice that, unlike the example in Fig. 15.6, this feature space is not a usual Euclidean space but rather a vector space of functions. The mapping $ \Phi $ is now defined as

$\displaystyle \Phi:{\mathcal{X}}\rightarrow{\mathbb{R}}^{\mathcal{X}}, \Phi({\boldsymbol{x}}) = k(\cdot,{\boldsymbol{x}})\;,$ (15.14)

i.e. $ \Phi $ maps each $ {\boldsymbol{x}}$ to the function $ k(\cdot,{\boldsymbol{x}})$, the kernel $ k$ with the first argument left free and the second fixed to $ {\boldsymbol{x}}$. One can show that the set of all linear combinations of the form

$\displaystyle f(\cdot) = \sum_{i=1}^M\alpha_ik(\cdot,{\boldsymbol{x}}_i)\;,$ (15.15)

for arbitrary $ M$, $ \alpha_i\in{\mathbb{R}}$, and $ {\boldsymbol{x}}_1,\ldots,{\boldsymbol{x}}_M$ forms a vector space. Moreover, for all functions of the form (15.15) one gets

$\displaystyle \left\langle k(\cdot,{\boldsymbol{x}}),f\right\rangle_{{\mathcal{H}}} = f({\boldsymbol{x}})\;,$    

where $ \langle\cdot,\cdot\rangle_{{\mathcal{H}}}$ denotes the scalar product in some Hilbert space that will become clearer below. In particular we have

$\displaystyle \left\langle k(\cdot,{\boldsymbol{x}}),k(\cdot,\boldsymbol{z})\right\rangle_{{\mathcal{H}}}$ $\displaystyle = \left\langle\Phi({\boldsymbol{x}}),\Phi(\boldsymbol{z})\right\rangle_{{\mathcal{E}}}$    
  $\displaystyle = k({\boldsymbol{x}},\boldsymbol{z})\;.$    

The last property is the reason why positive definite kernels are also called reproducing kernels: they reproduce the evaluation of $ f$ on  $ {\boldsymbol{x}}$. It also shows that $ k$ indeed computes, as desired, the dot-product in $ {\mathcal{E}}$ for $ \Phi({\boldsymbol{x}})$ defined as in (15.14). Hence (15.14) is one possible realization of the mapping associated with a kernel and is called the feature map (for its empirical counterpart see e.g. [41]). The following is a formal definition of a Reproducing Kernel Hilbert Space (cf. [59]).

Definition 1 (Reproducing Kernel Hilbert Space (RKHS))  

Let $ {\mathcal{X}}$ be a nonempty set and $ {\mathcal{H}}$ a Hilbert space of functions $ f:{\mathcal{X}}\rightarrow{\mathbb{R}}$. Then $ {\mathcal{H}}$ is called a reproducing kernel Hilbert space endowed with the dot product $ \langle\cdot,\cdot\rangle$ if there exists a function $ k:{\mathcal{X}}\times{\mathcal{X}}\rightarrow{\mathbb{R}}$ with the properties that

  1. $ k$ has the reproducing property $ \langle f,k(\cdot,{\boldsymbol{x}})\rangle = f({\boldsymbol{x}})$ for all $ f\in{\mathcal{H}}$, in particular $ \langle k(\cdot,{\boldsymbol{x}}),k(\cdot,\boldsymbol{z})\rangle=k({\boldsymbol{x}},\boldsymbol{z})$, and
  2. $ k$ spans $ {\mathcal{H}}$, i.e.  $ {\mathcal{H}}=\overline{{\text{span}}\{k(\cdot,{\boldsymbol{x}})\vert{\boldsymbol{x}}\in{\mathcal{X}}\}}$, where $ \overline{A}$ denotes the completion of the set $ A$.

One can show that the kernel $ k$ for such an RKHS is uniquely determined.

Mercer Kernels

As a second way to identify a feature space associated with a kernel $ k$ one can use a technique derived from Mercer's Theorem.

Mercer's Theorem, which we reproduce in the following, states that if the function $ k$ (the kernel) gives rise to a positive integral operator, the evaluation of $ k({\boldsymbol{x}},\boldsymbol{z})$ can be expressed as a finite or infinite, absolutely and uniformly convergent series, almost everywhere. This series then defines, in another way, a feature space and an associated mapping connected to the kernel $ k$.

Let $ {\mathcal{X}}$ be a finite measure space, i.e. a space with a $ \sigma $-algebra and a measure $ \mu$ satisfying $ \mu({\mathcal{X}})<\infty$.

Theorem 3 ([40])  

Suppose $ k\in L_\infty({\mathcal{X}}^2,\mu)$ is a symmetric real-valued function such that the integral operator

$\displaystyle T_{k}:L_2({\mathcal{X}},\mu)\rightarrow L_2({\mathcal{X}},\mu)\;,\quad \left(T_{k}f\right)({\boldsymbol{x}}) := \int_{\mathcal{X}}k({\boldsymbol{x}},\boldsymbol{z})f(\boldsymbol{z})\,\mathrm{d}\mu(\boldsymbol{z})$    

is positive definite, i.e. for all $ f\in L_2({\mathcal{X}},\mu)$

$\displaystyle \int_{{\mathcal{X}}^2}k({\boldsymbol{x}},\boldsymbol{z})f({\boldsymbol{x}})f(\boldsymbol{z})\,\mathrm{d}\mu({\boldsymbol{x}})\,\mathrm{d}\mu(\boldsymbol{z})\geq 0\;.$    

Let $ \varphi_j\in L_2({\mathcal{X}},\mu)$ be the normalized orthogonal eigenfunctions of $ T_{k}$ associated with the eigenvalues $ \lambda_j\geq 0$, sorted in non-increasing order. Then
  1. $ (\lambda_j)_j\in l_1$
  2. $ k({\boldsymbol{x}},\boldsymbol{z}) = \sum_{j=1}^{N_{{\mathcal{E}}}}\lambda_j\varphi_j({\boldsymbol{x}})\varphi_j(\boldsymbol{z})$ holds for almost all $ {\boldsymbol{x}},\boldsymbol{z}$. Either $ N_{{\mathcal{E}}}\in{\mathbb{N}}$ or $ N_{{\mathcal{E}}}=\infty$; in the latter case, the series converges absolutely and uniformly for almost all $ {\boldsymbol{x}},\boldsymbol{z}$.

If we choose as feature space $ {\mathcal{E}}=l_2^{N_{{\mathcal{E}}}}$ and the mapping $ \Phi $ as

$\displaystyle \Phi:{\mathcal{X}}\rightarrow l_2^{N_{{\mathcal{E}}}}\;, \quad \Phi({\boldsymbol{x}}) := \left(\sqrt{\lambda_j}\,\varphi_j({\boldsymbol{x}})\right)_{j\,=\,1,\ldots,N_{{\mathcal{E}}}}\;,$    

we see from the second statement in Theorem 3 that the kernel $ k$ corresponds to the dot product in $ l_2^{N_{{\mathcal{E}}}}$, i.e.  $ k({\boldsymbol{x}},\boldsymbol{z}) = \langle\Phi({\boldsymbol{x}}),\Phi(\boldsymbol{z})\rangle$.
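A finite-sample analogue of this construction can be computed directly: eigendecompose the Gram matrix and set $\Phi({\boldsymbol{x}}_i)=\left(\sqrt{\lambda_j}\,v_{ij}\right)_j$; the resulting feature vectors reproduce $K$ exactly. A sketch, assuming NumPy (this empirical map is the one used e.g. in kernel PCA):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

# Gram matrix of the polynomial kernel k(x, z) = (x^T z)^2.
K = (X @ X.T) ** 2

# Eigendecomposition K = V diag(lam) V^T; lam >= 0 since K is PSD.
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)       # clip tiny negative round-off

# Empirical feature map: row i is Phi(x_i) = (sqrt(lam_j) * V_ij)_j.
Phi = V * np.sqrt(lam)

# Dot products in this finite feature space recover the kernel matrix.
print(np.allclose(Phi @ Phi.T, K))  # True
```

The eigenvalues/eigenvectors of $K$ play the role of the $\lambda_j$ and $\varphi_j$ of Theorem 3 on the sample.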

Kernels satisfying Mercer's Theorem are called Mercer kernels. It can be shown that, if the set $ {\mathcal{X}}$ on which the kernel is defined is compact, a kernel is a Mercer kernel if and only if it is a positive definite kernel (cf. [63]).

Table 15.1 lists some of the most widely used kernel functions. More sophisticated kernels (e.g. kernels generating splines or Fourier expansions) and kernels designed for special applications like DNA analysis can be found in [71], [65], [63], [28], [30], [79], [69] and numerous others.

Table 15.1: Common kernel functions
Gaussian RBF: $ k({\boldsymbol{x}},\boldsymbol{z}) = \exp\left(-\dfrac{\Vert{\boldsymbol{x}} - \boldsymbol{z}\Vert^2}{c}\right)$ (15.16)
Polynomial: $ k({\boldsymbol{x}},\boldsymbol{z}) = \left({\boldsymbol{x}}^{\top}\boldsymbol{z} + \theta\right)^d$
Sigmoidal: $ k({\boldsymbol{x}},\boldsymbol{z}) = \tanh\left(\kappa\,{\boldsymbol{x}}^{\top}\boldsymbol{z} + \theta\right)$
Inverse multiquadric: $ k({\boldsymbol{x}},\boldsymbol{z}) = \dfrac{1}{\sqrt{\Vert{\boldsymbol{x}} - \boldsymbol{z}\Vert^2 + c^2}}$
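These standard kernels are one-liners to implement; a sketch assuming NumPy (the parameter names c, theta, kappa, d are conventional choices of ours):

```python
import numpy as np

def gaussian_rbf(x, z, c=1.0):
    return np.exp(-np.sum((x - z) ** 2) / c)

def polynomial(x, z, theta=1.0, d=3):
    return (np.dot(x, z) + theta) ** d

def sigmoidal(x, z, kappa=1.0, theta=0.0):
    # Note: positive definite only for certain parameter choices.
    return np.tanh(kappa * np.dot(x, z) + theta)

def inverse_multiquadric(x, z, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - z) ** 2) + c ** 2)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(gaussian_rbf(x, z), inverse_multiquadric(x, z))
```

Each function consumes two input-space vectors and returns the implicit feature-space dot product, so any of them can be dropped into a kernelized algorithm unchanged.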

15.4.3 Properties of Kernels

Besides being useful tools for the computation of dot-products in high- or infinite-dimensional spaces, kernels possess some additional properties that make them an interesting choice in algorithms. It was shown ([26]) that using a particular positive definite kernel corresponds to an implicit choice of a regularization operator. For translation-invariant kernels, the regularization properties can be expressed conveniently in Fourier space in terms of the frequencies ([63,25]). For example, Gaussian kernels (cf. (15.16)) correspond to a general smoothness assumption on all $ k$-th order derivatives ([63]). Vice versa, using this correspondence, kernels matching a certain prior about the frequency content of the data can be constructed so as to reflect our prior problem knowledge.

Another particularly useful feature of kernel functions is that we are not restricted to kernels operating on vectorial data, e.g. $ {{\mathcal{X}}={\mathbb{R}}^N}$. In principle, it is also possible to define positive definite kernels for e.g. strings or graphs, making it possible to embed discrete objects into a metric space and to apply metric-based algorithms (e.g. [28,76,79]).
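As a concrete non-vectorial example, a simple p-spectrum kernel counts shared length-p substrings of two strings; it is positive definite by construction, being an explicit dot product of substring-count vectors (the construction is standard in the string-kernel literature, the code and names are ours):

```python
from collections import Counter

def spectrum(s: str, p: int) -> Counter:
    """Count vector of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def spectrum_kernel(s: str, t: str, p: int = 2) -> int:
    """k(s, t) = <phi(s), phi(t)> with phi the p-mer count vector."""
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(cs[m] * ct[m] for m in cs)

print(spectrum_kernel("GATTACA", "TACATA"))  # 5 shared bigram pairs
```

Plugging such a kernel into an SVM yields a non-linear classifier over strings without ever representing them as fixed-length vectors.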

Furthermore, many algorithms can be formulated using so-called conditionally positive definite kernels (cf. [63,55]), a superclass of the positive definite kernels considered so far. They can be interpreted as generalized non-linear dissimilarity measures (as opposed to just the scalar product) and are applicable e.g. in SVM and kernel PCA.
