15.2 Learning from Examples


15.2.1 General Setting of Statistical Learning

The main objective of statistical learning is to find a description of an unknown dependency between measurements of objects and certain properties of these objects. The measurements, also called ''input variables'', are assumed to be observable for all objects of interest. In contrast, the objects' properties, or ''output variables'', are in general available only for a small subset of objects, known as examples. The purpose of estimating the dependency between the input and output variables is to be able to determine the values of the output variables for any object of interest.

The problem of estimating an unknown dependency occurs in various practical applications. For example, the input variables can be the prices for a set of stocks and the output variable the direction of change in a certain stock price. As another example, the input can be some medical parameters and the output the probability of a patient having a certain disease. An essential feature of statistical learning is that the information is assumed to be contained in a limited set of examples (the sample), and the estimated dependency should be as accurate as possible for all objects of interest.

To proceed with a formal description of the main properties of statistical learning, let us fix some notation. Let $ {\mathcal{X}}$ denote the space of input variables representing the objects, and let $ {\mathcal{Y}}$ be the space of output variables. The structure of $ {\mathcal{Y}}$ defines the learning task. For example, if $ {\mathcal{Y}}={\mathbb{R}}$, learning amounts to a regression problem; for $ {\mathcal{Y}}=\{1,2,3\}$, the task is a classification problem with three classes; etc.

Let $ {\mathcal{Z}}=\{({\boldsymbol{x}}_i,y_i)\in{\mathcal{X}}\times{\mathcal{Y}}\vert i=1,\ldots,M\}$ be a given sample. We assume that there exists some unknown but fixed probability distribution $ P(X,Y)$ over the space $ {\mathcal{X}}\times{\mathcal{Y}}$ generating our data; that is, the pairs $ ({\boldsymbol{x}}_i,y_i) \in {\mathcal{Z}}$ are drawn independently and identically distributed (i.i.d.) from $ P(X,Y)$.

The dependency to be estimated takes the form of a function $ f: {\mathcal{X}}\rightarrow{\mathcal{Y}}$. To decide which of many possible functions best describes the dependency observed in the training sample, we introduce the concept of a loss function:

$\displaystyle {\ell}:{\mathcal{Y}}\times{\mathcal{Y}}\rightarrow{\mathbb{R}}\;.$ (15.1)

Such a loss function should be bounded from below and should measure the cost $ {\ell}(\kern.5pt f({\boldsymbol{x}}),y)$ of discrepancy between the predicted value $ f({\boldsymbol{x}})\in{\mathcal{Y}}$ and the true value $ y\in{\mathcal{Y}}$. Then the risk, i.e. the expected loss incurred from using a particular prediction function $ f$, can be defined as:

$\displaystyle \ensuremath{R}(\kern.5pt f) = {\mathbb{E}}_P[{\ell}(\kern.5pt f({\boldsymbol{x}}),y)]\;,$ (15.2)

where $ {\mathbb{E}}_P$ denotes the expectation with respect to the joint distribution $ P(X,Y)$ of input and output variables.
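
To make these definitions concrete, the following minimal Python sketch (our own illustration; none of it is prescribed by the text) implements two standard loss functions of the form (15.1) and a Monte Carlo approximation of the risk (15.2) for the hypothetical case where the generating distribution can actually be sampled:

\begin{verbatim}
import numpy as np

# Two standard instances of the loss function (15.1), both bounded from below.
def zero_one_loss(y_pred, y_true):
    """0/1 loss for classification: cost 1 for a wrong label, 0 otherwise."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """Squared loss for regression."""
    return (y_pred - y_true) ** 2

def monte_carlo_risk(f, sample_xy, loss, n=100_000, seed=0):
    """Approximate the risk (15.2), E_P[loss(f(x), y)], by sampling.

    `sample_xy` draws one pair (x, y) from P(X, Y); in the actual
    learning problem this distribution is of course unknown.
    """
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(n):
        x, y = sample_xy(rng)
        losses.append(loss(f(x), y))
    return float(np.mean(losses))
\end{verbatim}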

Notice that if we knew the joint distribution $ P(X,Y)$, the learning problem could easily be solved. For example, in the classification case one could calculate the conditional probability $ P(Y\vert X)$ and compute the so-called ''Bayes-optimal solution'':

$\displaystyle f^{\ast}({\boldsymbol{x}}) = \underset{y_1\in{\mathcal{Y}}}{\text{argmin}} \int_{y_2 \in {\mathcal{Y}}} {\ell}(y_1,y_2) P(Y=y_2\vert X={\boldsymbol{x}})\;.$ (15.3)

However, in our setup $ P(X,Y)$ is unknown, and only a sample $ {\mathcal{Z}}$ is available. One possible solution would be to estimate $ P(X,Y)$ or $ P(Y\vert X)$ from the sample $ {\mathcal{Z}}$. In many theoretical and practical approaches the inference is carried out exactly in this way ([23,10,20]). But it is also well known that estimating a density from empirical data is a hard problem, especially in the multi-dimensional case: the number of examples one needs to obtain a reliable estimate of a density in $ N$ dimensions grows exponentially with $ N$. In the approach followed in this chapter we shall instead attempt to estimate the function $ f$ directly from $ {\mathcal{Z}}$, without using $ P(X,Y)$ or $ P(Y\vert X)$. For this, three steps are necessary. First, a class of functions  $ {\mathcal{F}}$ needs to be defined. Second, a suitable loss $ {\ell}$ has to be fixed. Finally, a method has to be provided that finds the function $ f$ minimizing the risk  $ \ensuremath{R}(\kern.5pt f)$ among all $ f \in {\mathcal{F}}$. Such a method is called an ''induction principle''. Desirable properties of induction principles are discussed in the next section.
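
For illustration, if $ P(Y\vert X)$ were known for a finite label set, (15.3) could be evaluated directly; a small sketch (the toy conditional model below is a hypothetical example of ours):

\begin{verbatim}
# Bayes-optimal prediction (15.3) for a finite label set, given P(Y | X).
def bayes_optimal(x, labels, cond_prob, loss):
    expected = {
        y1: sum(loss(y1, y2) * cond_prob(y2, x) for y2 in labels)
        for y1 in labels
    }
    return min(expected, key=expected.get)

# With the 0/1 loss this reduces to picking the most probable label.
labels = (0, 1)
cond_prob = lambda y, x: 0.8 if y == int(x > 0) else 0.2  # assumed toy model
loss01 = lambda y1, y2: float(y1 != y2)
print(bayes_optimal(0.5, labels, cond_prob, loss01))      # prints 1
\end{verbatim}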


15.2.2 Desirable Properties for Induction Principles

The most commonly used induction principle is that of minimizing the empirical risk

$\displaystyle {R_{\text{emp}}}(\kern.5pt f) = \frac{1}{M}\sum_{i=1}^M{\ell}(\kern.5pt f({\boldsymbol{x}}_i), y_i)\;,$ (15.4)

which is the empirical counterpart of the expected risk (15.2). The goal of learning in our setup is to find an algorithm that, given a training sample  $ {\mathcal{Z}}$, finds a function $ f \in {\mathcal{F}}$ that minimizes (15.4). Notice that this will not necessarily result in a unique solution. As one can see in Fig. 15.1, more than one function can have the same (e.g. zero) empirical risk on the same data sample. However, these functions can take arbitrary values at other points in  $ {\mathcal{X}}$; hence the solution that minimizes the empirical risk is not guaranteed to minimize the true risk (15.2).

Figure 15.1: Two functions that separate two classes of data points with zero empirical risk. Without further information it is impossible to choose between them
\includegraphics[width=6cm]{text/3-15/fct_no_emp_error.eps}
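
A minimal sketch of empirical risk minimization over a finite function class (the threshold classifiers and data below are our own toy example):

\begin{verbatim}
import numpy as np

def empirical_risk(f, X, y, loss):
    """Empirical risk (15.4): average loss over the training sample Z."""
    return float(np.mean([loss(f(xi), yi) for xi, yi in zip(X, y)]))

def erm(function_class, X, y, loss):
    """Return the function in F with minimal empirical risk."""
    return min(function_class, key=lambda f: empirical_risk(f, X, y, loss))

# Toy usage: F contains three threshold classifiers on the real line.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])
F = [lambda x, t=t: int(x > t) for t in (-1.5, 0.0, 1.5)]
loss01 = lambda yp, yt: float(yp != yt)
f_best = erm(F, X, y, loss01)   # the threshold at 0.0 attains zero risk
\end{verbatim}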

Two other phenomena arising in connection with the minimization of the empirical risk (15.4) are over- and under-fitting. An overly complex function $ f$ might describe the training data well but fail to generalize to unseen examples. The converse can also happen: assume the function class $ {\mathcal{F}}$ we can choose from is very small, e.g. it contains only a single, fixed function. Then our learning machine would trivially be consistent, since $ \ensuremath{R}(\kern.5pt f) =$   const for all $ f \in {\mathcal{F}}$. But if this single $ f \in {\mathcal{F}}$ is not, by accident, the rule that generates our data, the decisions are unrelated to the concept generating the data. This phenomenon is called under-fitting (cf. Fig. 15.2). Apparently we need some way of controlling how large the class of functions  $ {\mathcal{F}}$ is, such that we avoid over- and under-fitting and obtain solutions that generalize well (i.e. have reasonable complexity). The questions of consistency, over- and under-fitting are closely related and will lead us to the concept of regularization (e.g. [68,43]) and to the principle of structural risk minimization ([71]).

Figure 15.2: An illustration of under- and over-fitting on a small sample. The simple linear function (solid line) underfits the data and already makes training errors. The complex one (dash-dotted line) has no training error but may not generalize well on unseen data. The function with intermediate complexity (dashed line) seems to capture the decision boundary best
\includegraphics[width=6cm]{text/3-15/overfitting.eps}
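
The effect shown in Fig. 15.2 is easy to reproduce numerically, e.g. with polynomial regression of varying degree (a sketch with toy data of our own choosing; the three degrees play the role of the three functions in the figure):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):   # under-fit, intermediate, over-fit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
\end{verbatim}

Typically the degree-9 polynomial interpolates the noisy training points (near-zero training error) but attains a much larger test error, while the intermediate degree generalizes best.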

15.2.2.1 Regularization

In the previous paragraphs we have shown that for successful learning it is not enough to find a function with minimal empirical risk. If we are interested in a good estimate of the true risk over all possible data points, we need to introduce a complexity control and choose our solution by minimizing the following objective function:

$\displaystyle {R_{\text{emp}}}(\kern.5pt f,{\mathcal{Z}}) + \lambda\Omega(\kern.5pt f)\;.$ (15.5)

This is the regularization approach: a penalty term $ \lambda\Omega(\kern.5pt f)$ is added to trade off the complexity of the function class against the empirical error. Using such a regularization, a bound on the true risk can be derived.
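
As one concrete instance of (15.5), under assumptions the text does not fix, take linear functions $ f({\boldsymbol{x}})={\boldsymbol{w}}^{\top}{\boldsymbol{x}}$, the squared loss, and the penalty $ \Omega(\kern.5pt f)=\Vert{\boldsymbol{w}}\Vert^2$; the minimizer then has the well-known ridge-regression closed form, sketched below:

\begin{verbatim}
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/M) * sum_i (w . x_i - y_i)^2 + lam * ||w||^2.

    Setting the gradient to zero yields the linear system
    (X^T X / M + lam * I) w = X^T y / M.
    """
    M, N = X.shape
    A = X.T @ X / M + lam * np.eye(N)
    b = X.T @ y / M
    return np.linalg.solve(A, b)

# Larger lam shrinks w towards zero: stronger complexity control,
# at the price of a larger empirical error.
\end{verbatim}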

There are several possibilities for choosing $ \lambda $ and $ \Omega$ so as to obtain a consistent induction principle. In the following sections we will describe the choice inspired by the work of Vapnik. Other possible choices are, for example, the Akaike information criterion ([2]) or Mallows' $ C_p$ ([39]), used in classical statistics, as well as spline regularization ([74]), wavelet regularization ([21]), CART ([14]) and many other modern approaches. A general foundation for regularization in model selection is given in [4]; [7] investigate regularization in the context of SVMs.

15.2.2.2 Consistency

Let us define more closely what consistency means and how it can be characterized. Let us denote by $ f^M$ the function $ f \in {\mathcal{F}}$ that minimizes (15.4) for a given training sample  $ {\mathcal{Z}}$ of size $ M$. The notion of consistency implies that $ \vert\ensuremath{R}(\kern.5pt f^M) - {R_{\text{emp}}}(\kern.5pt f^M)\vert\rightarrow 0$ in probability as $ M\rightarrow\infty$. We have already seen in the previous example that such convergence need not hold in general, the reason being that $ f^M$ depends on the sample  $ {\mathcal{Z}}$. One can show that a necessary and sufficient condition for consistency is the uniform convergence to zero, over all functions in  $ {\mathcal{F}}$, of the difference between the expected and the empirical risk. This insight is summarized in the following theorem:

Theorem 1 ([73])  

One-sided uniform convergence in probability, i.e.

$\displaystyle \lim_{M\rightarrow\infty} {\mathbb{P}}\left[\sup_{f\in{\mathcal{F}}}\left(\ensuremath{R}(\kern.5pt f)-{R_{\text{emp}}}(\kern.5pt f)\right)> \epsilon\right] = 0\;,$ (15.6)

for all $ \epsilon > 0$, is a necessary and sufficient condition for (nontrivial) consistency of empirical risk minimization.

Since the condition in the theorem is not only sufficient but also necessary, it seems reasonable that any ''good'' learning machine implementing a specific function class should satisfy condition (15.6).

15.2.3 Structural Risk Minimization

Consequently, the question arises of how one can, in practice, choose function classes that satisfy Theorem 1. It will turn out that this is possible, and that it crucially depends on how complex the functions in the class  $ {\mathcal{F}}$ are, a question we have already seen to be equally important in the discussion of over- and under-fitting. But what does complexity mean, and how can one control the size of a function class?

The complexity of a function class can be measured by the number of different possible combinations of outcome assignments when choosing functions from this class. This quantity is usually difficult to obtain theoretically for useful classes of functions. Popular approximations of this measure are covering numbers ([61]), annealed entropy and fat-shattering dimension ([6]), VC-entropy and VC dimension ([71]), or Rademacher and Gaussian complexity ([7]). We will not go into detail about these quantities here.

A specific way of controlling the complexity of a function class is given by the VC-theory and the structural risk minimization (SRM) principle ([71]). Here the concept of complexity is captured by the VC-dimension $ h$ of the function class  $ {\mathcal{F}}$. Roughly speaking, the VC-dimension measures how many (training) points can be shattered (i.e. separated for all possible labellings) using functions of the class. This quantity can be used to bound the probability that the expected error deviates much from the empirical error for any function from the class, i.e. VC-style bounds usually take the form

$\displaystyle {\mathbb{P}}\left[\sup_{f\in{\mathcal{F}}}\left(\ensuremath{R}(\kern.5pt f)-{R_{\text{emp}}}(\kern.5pt f,{\mathcal{Z}})\right)>\epsilon\right]\leq H({\mathcal{F}}, M, \epsilon)\;,$ (15.7)

where $ H$ is some function that depends on properties of the function class  $ {\mathcal{F}}$, e.g. the VC-dimension, the size $ M$ of the training set and the desired closeness $ \epsilon $. By equating the right-hand side of (15.7) to $ \delta>0$ and solving $ H = \delta$ for $ \epsilon $ one can turn these bounds into expressions of the following form: with probability at least $ 1-\delta$ over the random draw of the training sample  $ {\mathcal{Z}}$,

$\displaystyle \ensuremath{R}(\kern.5pt f) \leq {R_{\text{emp}}}(\kern.5pt f,{\mathcal{Z}}) + \widetilde H({\mathcal{F}},M,\delta)\;,$ (15.8)

where $ \widetilde H$ is the penalty term that measures our degree of uncertainty. If the function class is simple then $ \widetilde H$ is small. This penalty term usually increases if we require a higher precision (e.g. with $ \log(1/\delta)$) and decreases if we observe more examples (e.g. with $ 1/M$ or $ 1/\sqrt{M}$). Note that this prototypical bound is structurally identical to the regularized risk functional in (15.5). The practical implication of bounds like (15.8) is that our learning machine should be constructed such that
  1. it finds a function with a small empirical error, and
  2. at the same time keeps the penalty term $ \widetilde H$ small.
Only if our learning principle can control both quantities do we have a guarantee that the expected error of our estimate will be small (cf. Fig. 15.3).

Figure 15.3: Schematic illustration of (15.8). The dash-dotted line represents the training error (empirical risk), the dashed line the upper bound on the complexity term (confidence). With increasing complexity the empirical error decreases, but the upper bound on the risk becomes worse. The best expected risk (solid line) is obtained at an intermediate complexity of the function class. Thus, in practice, the goal is to find the best trade-off between empirical error and complexity
\includegraphics[width=6cm]{text/3-15/illurisknew.eps}

One of the most famous learning bounds is due to Vapnik and Chervonenkis:

Theorem 2 ([72])  

Let $ h$ denote the VC-dimension of the function class $ {\mathcal{F}}$ and let $ {R_{\text{emp}}}$ be defined by (15.4) using the $ 0/1$-loss. For all $ \delta>0$ and $ f \in {\mathcal{F}}$ the inequality bounding the risk

$\displaystyle \ensuremath{R}(\kern.5pt f)\le {R_{\text{emp}}}(\kern.5pt f,{\mathcal{Z}})+\sqrt{\frac{h\left(\ln \frac{2M}{h}+1\right)-\ln(\delta/4)}{M}}$ (15.9)

holds with probability of at least $ 1-\delta$ for $ M>h$ over the random draw of the training sample  $ {\mathcal{Z}}$.

This theorem lays the groundwork for the Support Vector algorithm that we will consider in more detail in Sect. 15.3.
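
To get a feeling for the size of the complexity term in (15.9), one can evaluate it numerically (the values of $ h$, $ M$ and $ \delta $ below are arbitrary illustrations):

\begin{verbatim}
import numpy as np

def vc_confidence(h, M, delta):
    """Complexity term of the VC bound (15.9); requires M > h."""
    return np.sqrt((h * (np.log(2 * M / h) + 1) - np.log(delta / 4)) / M)

# The term decreases as the sample grows and increases with the
# VC-dimension h of the function class.
for M in (100, 1_000, 10_000):
    print(M, vc_confidence(h=10, M=M, delta=0.05))
\end{verbatim}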

As a consequence of Theorem 2, the so-called ''Structural Risk Minimization'' (SRM) principle is suggested (e.g. [17,71]). According to this principle, a nested family of function classes $ {\mathcal{F}}_1 \subseteq \cdots \subseteq {\mathcal{F}}_k$ with non-decreasing VC-dimensions $ h_1\leq\cdots\leq h_k$ is constructed. After the solutions $ f_1,\ldots,f_k$ of the empirical risk minimization (15.4) in the function classes $ {\mathcal{F}}_1,\ldots,{\mathcal{F}}_k$ have been found, the SRM principle chooses the function class $ {\mathcal{F}}_i$ (and the function $ f_i$) such that an upper bound on the generalization error like (15.9) is minimized.
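
Schematically, SRM can be implemented as in the following sketch, reusing vc_confidence from the previous example; fit_erm, empirical_risk and the per-class VC-dimensions are placeholders for whatever nested family of function classes one actually uses:

\begin{verbatim}
def structural_risk_minimization(classes, X, y, delta, fit_erm, empirical_risk):
    """Pick the class F_i whose ERM solution minimizes the bound (15.9).

    `classes` is a list of (function_class, vc_dimension) pairs with
    non-decreasing VC-dimension; `fit_erm` minimizes (15.4) within a class.
    """
    M = len(X)
    best_f, best_bound = None, float("inf")
    for F, h in classes:
        f = fit_erm(F, X, y)
        bound = empirical_risk(f, X, y) + vc_confidence(h, M, delta)
        if bound < best_bound:
            best_f, best_bound = f, bound
    return best_f
\end{verbatim}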

