The main objective of statistical learning is to find a description of an unknown dependency between measurements of objects and certain properties of these objects. The measurements, also called ''input variables'', are assumed to be observable for all objects of interest. In contrast, the objects' properties, or ''output variables'', are in general available only for a small subset of objects known as examples. The purpose of estimating the dependency between the input and output variables is to be able to determine the values of the output variables for any object of interest.
The problem of estimating an unknown dependency occurs in various practical applications. For example, the input variables can be the prices for a set of stocks and the output variable the direction of change in a certain stock price. As another example, the input can be some medical parameters and the output the probability of a patient having a certain disease. An essential feature of statistical learning is that the information is assumed to be contained in a limited set of examples (the sample), and the estimated dependency should be as accurate as possible for all objects of interest.
To proceed with a formal description of the main properties of statistical learning, let us fix some notation. Let $\mathcal{X}$ denote the space of input variables representing the objects, and let $\mathcal{Y}$ be the space of output variables. The structure of $\mathcal{Y}$ defines the learning task. For example, if $\mathcal{Y} = \mathbb{R}$, the learning amounts to a regression problem; for $\mathcal{Y} = \{1, 2, 3\}$, the task is a classification problem with three classes, etc.
Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$ be a given sample. We assume that there exists some unknown but fixed probability distribution $P(x, y)$ over the space $\mathcal{X} \times \mathcal{Y}$ generating our data; that is, the pairs $(x_i, y_i)$ are drawn independently and identically distributed (i.i.d.) from $P(x, y)$.
The dependency to be estimated takes the form of a function $f: \mathcal{X} \to \mathcal{Y}$. To decide which of many possible functions best describes the dependency observed in the training sample, we introduce the concept of a loss function $\ell(f(x), y)$, which measures the cost incurred by predicting $f(x)$ when the true output is $y$ (e.g. the squared loss $(f(x) - y)^2$ for regression, or the 0/1-loss for classification). The quality of a function $f$ is then judged by its expected loss, the so-called ''risk'' $R[f] = \int \ell(f(x), y)\, dP(x, y)$.
Notice that if we knew the joint distribution $P(x, y)$, the learning problem could be solved easily. For example, in the classification case one could calculate the conditional probability $P(y \mid x)$ and compute the so-called ''Bayes-optimal solution'' $f^*(x) = \operatorname{argmax}_{y} P(y \mid x)$.
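The Bayes-optimal rule can be illustrated on a toy joint distribution; the probability table below is invented purely for illustration:

```python
# Hypothetical toy joint distribution P(x, y) over two inputs and two classes.
# All numbers are invented for illustration.
joint = {
    ("x1", 0): 0.30, ("x1", 1): 0.10,
    ("x2", 0): 0.15, ("x2", 1): 0.45,
}

def bayes_optimal(x):
    """Return the class y maximizing the conditional probability P(y | x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal P(x)
    cond = {y: joint[(x, y)] / p_x for (xi, y) in joint if xi == x}
    return max(cond, key=cond.get)

print(bayes_optimal("x1"))  # class 0 is more likely given x1
print(bayes_optimal("x2"))  # class 1 is more likely given x2
```

In practice $P(x, y)$ is of course unknown, which is exactly why the learning problem is non-trivial.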
The most commonly used induction principle is that of minimizing the empirical risk
$$
R_{\mathrm{emp}}[f] = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i). \qquad (15.4)
$$
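Empirical risk minimization over a finite function class can be sketched in a few lines; the sample and the toy class of threshold classifiers below are invented for illustration:

```python
# Sketch: empirical risk (15.4) with the 0/1-loss, and empirical risk
# minimization over a small, hypothetical finite function class.
sample = [(-2, 0), (-1, 0), (1, 1), (3, 1)]  # (x_i, y_i) pairs, invented

def emp_risk(f, sample):
    """R_emp[f] = (1/n) * sum of 0/1-losses on the training sample."""
    return sum(f(x) != y for x, y in sample) / len(sample)

# A toy class of threshold classifiers f_t(x) = 1 if x >= t else 0.
candidates = {t: (lambda x, t=t: 1 if x >= t else 0) for t in (-3, 0, 2)}

# Empirical risk minimization: pick the threshold with the smallest R_emp.
best_t = min(candidates, key=lambda t: emp_risk(candidates[t], sample))
print(best_t, emp_risk(candidates[best_t], sample))  # → 0 0.0
```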
The other two phenomena arising in connection with the minimization of the empirical risk (15.4) are over- and under-fitting. An overly complex function might describe the training data well but does not generalize to unseen examples. The converse can also happen: assume the function class we can choose from is very small, e.g. it contains only a single, fixed function. Then our learning machine would trivially be consistent, since $f_n = \mathrm{const}$ for all $n$. But if this single function is not by accident the rule that generates our data, the decisions are unrelated to the concept generating our data. This phenomenon is called under-fitting (cf. Fig. 15.2). Apparently we need some way of controlling how large the class of functions is, such that we avoid over- and under-fitting and obtain solutions that generalize well (i.e. with reasonable complexity). The questions of consistency, over- and under-fitting are closely related and will lead us to the concept of regularization (e.g. [68,43]) and to the principle of structural risk minimization ([71]).
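Both failure modes can be simulated numerically. In the sketch below (with an invented noisy labeling rule) a memorizing classifier achieves zero training error but near-chance test error, a constant classifier ignores the data entirely, and a classifier of appropriate complexity approaches the noise level:

```python
import random

random.seed(0)

# Hypothetical data: the label is sign(x), flipped with probability 0.2.
def draw(n):
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = (x > 0) if random.random() > 0.2 else (x <= 0)
        data.append((x, int(y)))
    return data

train, test = draw(50), draw(1000)

def err(f, data):
    """Fraction of misclassified points (empirical 0/1-risk)."""
    return sum(f(x) != y for x, y in data) / len(data)

memo = dict(train)
overfit = lambda x: memo.get(x, 0)   # memorizes the training set
underfit = lambda x: 0               # a single constant function
threshold = lambda x: int(x > 0)     # matches the generating rule

print(err(overfit, train), err(overfit, test))  # 0 on train, poor on test
print(err(underfit, test))                      # ignores x entirely
print(err(threshold, test))                     # close to the 20% noise level
```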
In the previous paragraphs we have shown that for successful learning it is not enough to find a function with minimal empirical risk. If we are interested in a good estimate of the true risk on all possible data points, we need to introduce a complexity control and choose our solution by minimizing the following objective function:

$$
R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \lambda\, \Omega[f],
$$

where $\Omega[f]$ is a regularization term penalizing complex functions and $\lambda > 0$ controls the trade-off between empirical risk and complexity.
There are several possibilities to choose $\Omega$ and $\lambda$ in order to derive a consistent inductive principle. In the following sections we will describe the choice inspired by the work of Vapnik. Other possible choices are, for example, the Akaike information criterion ([2]) or Mallows' Cp ([39]), used in classical statistics, as well as spline regularization ([74]), wavelet regularization ([21]), CART ([14]) and many other modern approaches. A general foundation for regularization in model selection is given in [4]; [7] investigates regularization in the context of SVMs.
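The effect of the trade-off parameter can be made concrete with a minimal sketch; the candidate models, their empirical risks, and the complexity measure (here simply a parameter count) are all invented for illustration:

```python
# Sketch of the regularized objective R_reg[f] = R_emp[f] + lambda * Omega[f],
# with an invented complexity measure Omega (number of parameters).
# Each candidate: (name, empirical risk, complexity) -- numbers illustrative.
candidates = [
    ("constant",  0.40, 1),
    ("linear",    0.15, 2),
    ("degree-10", 0.02, 11),
]

def select(lam):
    """Pick the candidate minimizing R_emp + lam * Omega."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])[0]

print(select(0.0))   # no penalty: the most complex fit wins
print(select(0.02))  # moderate penalty favours the linear model
```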
Let us define more closely what consistency means and how it can be characterized. Let us denote by $f_n$ the function that minimizes (15.4) for a given training sample of size $n$. The notion of consistency implies that, as $n \to \infty$, the risk $R[f_n]$ converges to $\inf_{f \in \mathcal{F}} R[f]$ in probability. We have already seen in a previous example that such convergence may fail in general, the reason being that $f_n$ depends on the sample. One can show that a necessary and sufficient condition for consistency is uniform convergence, over all functions in $\mathcal{F}$, of the difference between the expected and the empirical risk to zero. This insight is summarized in the following theorem:
One-sided uniform convergence in probability, i.e.

$$
\lim_{n \to \infty} P\left( \sup_{f \in \mathcal{F}} \left( R[f] - R_{\mathrm{emp}}[f] \right) > \epsilon \right) = 0 \quad \text{for all } \epsilon > 0,
$$

is a necessary and sufficient condition for the consistency of empirical risk minimization over $\mathcal{F}$.
Consequently, the question arises of how one can choose, in practice, function classes that satisfy Theorem 1. It will turn out that this is possible, and that it crucially depends on how complex the functions in the class are, a question we have already seen to be equally important when talking about over- and under-fitting. But what does complexity mean, and how can one control the size of a function class?
The complexity of a function class can be measured by the number of different outcome assignments (labelings) that functions from this class can realize on a set of points. This quantity is usually difficult to obtain theoretically for useful classes of functions. Popular approximations of this measure are covering numbers ([61]), annealed entropy and the fat-shattering dimension ([6]), the VC-entropy and VC-dimension ([71]), or Rademacher and Gaussian complexity ([7]). We will not go into detail about these quantities here.
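Counting realizable labelings can be done directly for small examples. The sketch below (points and threshold grid invented for illustration) shows that one-sided threshold classifiers realize far fewer labelings on four points than the $2^4 = 16$ possible ones:

```python
# Sketch: measure the "size" of a function class by counting the distinct
# labelings it induces on a fixed set of points.
points = [0.5, 1.5, 2.5, 3.5]

def labelings(functions, points):
    """Set of distinct label vectors realized by the class on the points."""
    return {tuple(f(x) for x in points) for f in functions}

# Threshold classifiers f_t(x) = 1 if x >= t, for a grid of thresholds t.
thresholds = [lambda x, t=t: int(x >= t) for t in range(5)]

print(len(labelings(thresholds, points)))  # → 5, far fewer than 2**4 = 16
```

For this class only $n + 1$ of the $2^n$ labelings are realizable, which is the kind of gap the capacity measures above quantify.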
A specific way of controlling the complexity of a function class is given by the VC-theory and the structural risk minimization (SRM) principle ([71]). Here the concept of complexity is captured by the VC-dimension of the function class . Roughly speaking, the VC-dimension measures how many (training) points can be shattered (i.e. separated for all possible labellings) using functions of the class. This quantity can be used to bound the probability that the expected error deviates much from the empirical error for any function from the class, i.e. VC-style bounds usually take the form
$$
P\left( \sup_{f \in \mathcal{F}} \left| R[f] - R_{\mathrm{emp}}[f] \right| > \epsilon \right) \le H(\mathcal{F}, n, \epsilon),
$$

where $H$ depends on a capacity measure of the class $\mathcal{F}$, the sample size $n$ and the deviation $\epsilon$.
One of the most famous learning bounds is due to Vapnik and Chervonenkis:
Let $h$ denote the VC-dimension of the function class $\mathcal{F}$ and let $R_{\mathrm{emp}}$ be defined by (15.4) using the 0/1-loss. For all $\delta > 0$ and $f \in \mathcal{F}$ the inequality bounding the risk

$$
R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{h \left( \ln \frac{2n}{h} + 1 \right) - \ln(\delta / 4)}{n}} \qquad (15.9)
$$

holds with probability of at least $1 - \delta$ for $n > h$.
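The behaviour of the capacity term in a bound of this form is easy to evaluate numerically; the helper below is a sketch of the confidence term $\sqrt{(h(\ln(2n/h) + 1) - \ln(\delta/4))/n}$, with the values of $h$ and $\delta$ chosen arbitrarily:

```python
import math

def vc_confidence(h, n, delta):
    """Capacity term of a Vapnik-Chervonenkis-style bound:
    sqrt((h * (ln(2n/h) + 1) - ln(delta / 4)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(delta / 4)) / n)

# The bound only becomes informative when n is large relative to h.
for n in (100, 10_000, 1_000_000):
    print(n, round(vc_confidence(h=10, n=n, delta=0.05), 3))
```

For small $n$ the confidence term can exceed 1, making the bound vacuous for the 0/1-loss; it shrinks roughly like $\sqrt{h \ln n / n}$.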
As a consequence of Theorem 2, the so-called ''structural risk minimization'' (SRM) principle is suggested (e.g. [17,71]). According to this principle, a nested family of function classes $\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}_k$ with non-decreasing VC-dimensions $h_1 \le h_2 \le \cdots \le h_k$ is constructed. After the solutions of the empirical risk minimization (15.4) in the function classes have been found, the SRM principle chooses the function class (and the function in it) such that an upper bound on the generalization error like (15.9) is minimized.
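The SRM selection step can be sketched as follows; the empirical risks and VC-dimensions of the nested classes are invented for illustration, and the confidence term follows the form of (15.9):

```python
import math

def vc_confidence(h, n, delta=0.05):
    """Confidence term of the VC bound (15.9)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(delta / 4)) / n)

# Hypothetical nested classes F_1 ⊂ F_2 ⊂ ...: empirical risks shrink while
# the VC-dimensions h_k grow (all numbers invented for illustration).
n = 1000
classes = [(0.30, 2), (0.12, 10), (0.08, 50), (0.07, 200)]  # (R_emp, h_k)

# SRM: choose the class whose bound R_emp + confidence term is smallest.
bounds = [r + vc_confidence(h, n) for r, h in classes]
best = min(range(len(classes)), key=bounds.__getitem__)
print(best, round(bounds[best], 3))
```

Note that neither the smallest empirical risk nor the smallest class is chosen: SRM trades the two off via the bound.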