We are now at the point to merge the ideas of statistical learning, Structural Risk Minimization and reproducing kernels into a single algorithm, Support Vector Machines, suitable for a wide range of practical applications. The main goal of this algorithm is to find a weight vector $\mathbf{w}$ separating the data with the largest possible margin. Assume that the data are separable. Our goal is to find the smallest possible $\|\mathbf{w}\|$ without committing any error. This can be expressed by the following quadratic optimization problem:

$$\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\langle \mathbf{w},\mathbf{x}_i\rangle + b) \ge 1, \quad i=1,\dots,N \qquad (15.20)$$
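As a rough illustration of the primal problem (the toy one-dimensional dataset and all variable names below are our own, not from the chapter), one can minimize $\tfrac{1}{2}w^2$ over a crude grid subject to the margin constraints and observe that the optimum attains the largest possible geometric margin $1/|w^*|$:

```python
# Toy illustration of the primal hard-margin problem: two 1-D points,
# brute-force grid search over (w, b); names are ours, not the chapter's.
X = [-1.0, 1.0]
y = [-1, 1]

def feasible(w, b):
    # all margin constraints y_i (w*x_i + b) >= 1 must hold
    return all(yi * (w * xi + b) >= 1.0 for xi, yi in zip(X, y))

w_grid = [i / 100.0 for i in range(-300, 301)]   # w in [-3, 3], step 0.01
b_grid = w_grid[::10]                            # b in [-3, 3], step 0.1
best = None
for w in w_grid:
    for b in b_grid:
        if feasible(w, b) and (best is None or 0.5 * w * w < best[0]):
            best = (0.5 * w * w, w, b)

obj, w_star, b_star = best
print(w_star, b_star)  # 1.0 0.0 -> geometric margin 1/|w*| = 1
```

The minimizer puts both points exactly on the margin, which is precisely the "largest possible margin" interpretation of (15.20).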
In the formulation (15.20), referred to as the primal formulation, we are bound to use the original data $\mathbf{x}_i$. In order to apply the kernel trick (cf. Sect. 15.4.1) we need to transform the problem such that the only operation involving the original data $\mathbf{x}_i$ is an inner product between certain data
points. This can be achieved by transforming the problem to the dual
formulation. The notion of duality is an essential part of non-linear optimization theory; for details one can refer to any standard textbook on mathematical programming (e.g. [38,9]). For
our purposes it is important that for every quadratic optimization
problem there exists a dual problem which is also a quadratic
problem. If both the primal and the dual problems have an optimal
solution then the values of the objective function at the optimal
solutions coincide. This implies that by solving the dual
problem - which uses the original data only through inner
products - the solution to the primal problem can be reconstructed.
To derive the dual of (15.20), we introduce Lagrange multipliers $\alpha_i \ge 0$, $i=1,\dots,N$, one for each of the constraints in (15.20). We obtain the following Lagrangian:

$$L(\mathbf{w},b,\boldsymbol{\alpha}) = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \bigl( y_i(\langle \mathbf{w},\mathbf{x}_i\rangle + b) - 1 \bigr)$$
So far we have only considered the separable case which corresponds to an empirical error of zero (cf. Theorem 2). However, for many practical applications this assumption is violated. If the data are not linearly separable then problem (15.20) has no feasible solution. By allowing for some errors we might get better results and avoid over-fitting effects (cf. Fig. 15.2).

Therefore a ''good'' trade-off between the empirical risk and the complexity term in (15.9) needs to be found. Using a technique which was first proposed in [8] and later used for SVMs in [17], one introduces slack variables $\xi_i \ge 0$ to relax the hard-margin constraints:

$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i(\langle \mathbf{w},\mathbf{x}_i\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i=1,\dots,N$$
The threshold $b$ can be computed by exploiting the fact that for all support vectors $\mathbf{x}_i$ with $\alpha_i < C$, the slack variable $\xi_i$ is zero. This follows from the Karush-Kuhn-Tucker (KKT) conditions (cf. (15.31) below). Thus, for any support vector $\mathbf{x}_i$ with $0 < \alpha_i < C$ it holds that:

$$y_i \Bigl( \sum_{j=1}^{N} y_j \alpha_j k(\mathbf{x}_j,\mathbf{x}_i) + b \Bigr) = 1,$$

from which

$$b = y_i - \sum_{j=1}^{N} y_j \alpha_j k(\mathbf{x}_j,\mathbf{x}_i).$$
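The recovery of $b$ from an unbounded support vector can be sketched as follows (the toy two-point dataset, its hand-computed dual solution, and all names are our own assumptions, not from the chapter; linear kernel):

```python
# Recover the threshold b from an unbounded support vector (0 < alpha_i < C),
# for which the margin constraint holds with equality and the slack is zero.
X = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]   # dual solution of this toy problem (computed by hand)
C = 10.0

def k(u, v):         # linear kernel
    return u * v

def threshold(X, y, alpha, C):
    for i, a in enumerate(alpha):
        if 0.0 < a < C:   # unbounded SV => y_i * f(x_i) = 1 exactly
            return y[i] - sum(alpha[j] * y[j] * k(X[j], X[i])
                              for j in range(len(X)))
    raise ValueError("no unbounded support vector found")

print(threshold(X, y, alpha, C))  # 0.0 for this symmetric toy dataset
```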
The Karush-Kuhn-Tucker (KKT) conditions are the necessary conditions for an optimal solution of a non-linear programming problem (e.g. [9,38]). The conditions are particularly simple for the dual SVM problem (15.28)-(15.30) ([70]):

$$\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1, \qquad 0 < \alpha_i < C \;\Rightarrow\; y_i f(\mathbf{x}_i) = 1, \qquad \alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1, \qquad (15.31)$$

where $f(\mathbf{x}) = \sum_{j=1}^{N} y_j \alpha_j k(\mathbf{x}_j,\mathbf{x}) + b$.
The practical usefulness of SVMs stems from their ability to provide arbitrary non-linear separation boundaries and at the same time to control generalization ability through the parameter $C$ and the kernel parameters. In order to utilize these features it is necessary to work with the dual formulation (15.28)-(15.30) of the SVM training problem. This can be difficult from the computational point of view, for two reasons: the matrix of the quadratic form is dense and has as many rows and columns as there are training examples, so it may not fit into memory for large datasets, and general-purpose quadratic programming solvers scale poorly with the number of training examples.
The key idea of decomposition is to freeze all but a small number of optimization variables and to solve a sequence of constant-size problems. The set of variables optimized at a current iteration is referred to as the working set. Because the working set is re-optimized, the value of the objective function is improved at each iteration, provided the working set is not optimal before re-optimization.
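The improvement property of decomposition can be illustrated on a tiny quadratic objective (our own toy construction, with a working set of size one and the SVM constraints omitted for clarity): freezing all but one variable and re-optimizing that variable exactly can only decrease the objective.

```python
# Decomposition sketch on f(a) = 0.5 a^T Q a - 1^T a: the working set is a
# single coordinate i; all other variables are frozen while i is re-optimized.
Q = [[2.0, 0.5], [0.5, 1.0]]
a = [0.0, 0.0]

def f(a):
    quad = sum(a[i] * Q[i][j] * a[j] for i in range(2) for j in range(2))
    return 0.5 * quad - sum(a)

history = [f(a)]
for it in range(20):
    i = it % 2                      # working set = {i}, all others frozen
    # exact minimizer of f in coordinate i with the rest held fixed:
    a[i] = (1.0 - sum(Q[i][j] * a[j] for j in range(2) if j != i)) / Q[i][i]
    history.append(f(a))

# the objective never increases when a sub-optimal working set is re-optimized
assert all(h2 <= h1 + 1e-12 for h1, h2 in zip(history, history[1:]))
print(a)  # converges to the solution of Q a = 1
```

The iterates converge to the unconstrained optimum; a real SVM decomposition additionally respects the box and equality constraints when re-optimizing the working set.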
The mathematics of the decomposition technique can be best seen in the matrix notation. Let $\mathbf{y} = (y_1,\dots,y_N)^T$, let $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_N)^T$, let $Q$ be the matrix with the entries $Q_{ij} = y_i y_j k(\mathbf{x}_i,\mathbf{x}_j)$, and let $\mathbf{1}$ denote the vector of length $N$ consisting of all $1$s. Then the dual SVM problem (15.28)-(15.30) can be written in the matrix form as:

$$\min_{\boldsymbol{\alpha}}\ \tfrac{1}{2}\boldsymbol{\alpha}^T Q \boldsymbol{\alpha} - \mathbf{1}^T \boldsymbol{\alpha} \quad \text{subject to} \quad \mathbf{0} \le \boldsymbol{\alpha} \le C\mathbf{1}, \quad \mathbf{y}^T \boldsymbol{\alpha} = 0$$
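The matrix form is easy to set up in code. The following sketch (toy two-point dataset, linear kernel, and hand-computed optimal $\boldsymbol{\alpha}$ are all our own assumptions) builds $Q$, evaluates the dual objective, and checks feasibility:

```python
# Matrix form of the dual: Q_ij = y_i y_j k(x_i, x_j),
# minimize 0.5 a^T Q a - 1^T a  s.t.  0 <= a <= C, y^T a = 0.
X = [-1.0, 1.0]
y = [-1, 1]
C = 10.0

def k(u, v):          # linear kernel
    return u * v

n = len(X)
Q = [[y[i] * y[j] * k(X[i], X[j]) for j in range(n)] for i in range(n)]

def dual_objective(alpha):
    quad = sum(alpha[i] * Q[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)

def feasible(alpha):  # equality and box constraints
    return abs(sum(y[i] * alpha[i] for i in range(n))) < 1e-12 and \
           all(0.0 <= a <= C for a in alpha)

alpha_star = [0.5, 0.5]            # optimum of this toy problem
assert feasible(alpha_star)
print(dual_objective(alpha_star))  # -0.5
```

Any feasible perturbation of `alpha_star` (e.g. `[0.4, 0.4]`) yields a larger objective value, consistent with it being the minimizer.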
Iteration is carried out until the following termination criteria, derived from the Karush-Kuhn-Tucker conditions (15.31), are satisfied to the required precision $\epsilon$:

$$\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1 - \epsilon \qquad (15.38)$$
$$0 < \alpha_i < C \;\Rightarrow\; 1 - \epsilon \le y_i f(\mathbf{x}_i) \le 1 + \epsilon \qquad (15.39)$$
$$\alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1 + \epsilon \qquad (15.40)$$
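The $\epsilon$-precision termination check is straightforward to implement. In this sketch (toy data, linear kernel and the pre-computed solution are our own assumptions), each training point is tested against the case of (15.38)-(15.40) that matches its multiplier:

```python
# epsilon-KKT termination check for a candidate dual solution,
# with f(x) = sum_j y_j alpha_j k(x_j, x) + b.
X = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]
b = 0.0
C = 10.0
eps = 1e-3

def k(u, v):
    return u * v

def f(x):
    return sum(y[j] * alpha[j] * k(X[j], x) for j in range(len(X))) + b

def kkt_satisfied(i):
    m = y[i] * f(X[i])            # functional margin of point i
    if alpha[i] <= 0.0:           # case (15.38)
        return m >= 1.0 - eps
    if alpha[i] >= C:             # case (15.40)
        return m <= 1.0 + eps
    return 1.0 - eps <= m <= 1.0 + eps   # case (15.39)

print(all(kkt_satisfied(i) for i in range(len(X))))  # True
```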
The crucial issue in the decomposition technique presented above is the selection of working sets. First, the provision that a working set must be sub-optimal before re-optimization is essential to prevent the algorithm from cycling. Second, the working set selection affects the speed of the algorithm: if sub-optimal working sets are selected more or less at random, the algorithm converges very slowly. Finally, the working set selection is important in the theoretical analysis of decomposition algorithms, in particular in the convergence proofs and in the analysis of the convergence rate.
Two main approaches exist to the working set selection in the SVM decomposition algorithms: the heuristic approach and the feasible direction approach. The former was used in the original paper of [45] on SVM decomposition and has since been mainly associated with the specific flavor of decomposition algorithms called Sequential Minimal Optimization (SMO), presented in the next section. The feasible direction decomposition algorithms are rooted in the original algorithm of [32] for pattern recognition (SVM$^{light}$), with the formal connection to the feasible direction methods of nonlinear programming established by [35].
The notion of a ''feasible direction'' stems from the classical techniques of nonlinear programming subject to linear constraints ([80,9]). It refers to a direction $\mathbf{d}$ along which any step of magnitude $t$ satisfying $0 \le t \le t_0$, for some fixed $t_0 > 0$, results in a feasible solution to the nonlinear program. For any nonlinear program, finding the feasible direction amounts to a solution of a linear programming problem. In particular, for the dual SVM training problem (15.28)-(15.30) the following problem must be solved:

$$\min_{\mathbf{d}}\ \mathbf{g}(\boldsymbol{\alpha})^T \mathbf{d} \quad \text{subject to} \quad \mathbf{y}^T\mathbf{d} = 0, \quad d_i \ge 0 \text{ for } \alpha_i = 0, \quad d_i \le 0 \text{ for } \alpha_i = C, \quad -\mathbf{1} \le \mathbf{d} \le \mathbf{1},$$

where $\mathbf{g}(\boldsymbol{\alpha}) = Q\boldsymbol{\alpha} - \mathbf{1}$ is the gradient of the dual objective function, and the number of non-zero components of $\mathbf{d}$ is restricted to the size of the working set.
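A common shortcut for this linear program sorts the points by $y_i g_i$ and picks variables from the two ends of the ordering that are still free to move. The following sketch is only in the spirit of such feasible-direction selection; the function, the tie-breaking, and the example values are our own assumptions, not the exact rule of [32]:

```python
# Heuristic working-set selection sketch: choose q variables promising the
# steepest feasible descent, by sorting the gradient components y_i * g_i.
def select_working_set(grad, y, alpha, C, q=2):
    # ascending order of y_i * g_i; variables at the low end can move with
    # d_i = -y_i only if alpha_i < C, those at the high end with d_i = +y_i
    # only if alpha_i > 0 (otherwise the step would leave the box).
    order = sorted(range(len(grad)), key=lambda i: y[i] * grad[i])
    low = [i for i in order if alpha[i] < C][: q // 2]
    high = [i for i in reversed(order) if alpha[i] > 0.0][: q // 2]
    return low + high

grad = [0.2, -0.7, 0.5, -0.1]   # dual gradient at the current iterate
y = [1, -1, 1, -1]
alpha = [0.0, 1.0, 2.0, 0.5]
C = 2.0
print(select_working_set(grad, y, alpha, C))  # [3, 1]
```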
Convergence of the feasible direction SVM decomposition has been proved by [37], and the linear convergence rate has been observed experimentally ([35]).
The Sequential Minimal Optimization (SMO) algorithm proposed by [48] is another popular and efficient algorithm for SVM training. In this algorithm the idea of decomposition is taken to its extreme by reducing the working sets to two points - the smallest size for which it is possible to maintain the equality constraint (15.30). With two points in the working set, the optimal solution can be computed analytically without calling an optimization tool.
The analytical computation of the optimal solution is based on the following idea: given the solution $\boldsymbol{\alpha}^t$, the optimal update of the two working-set variables $\alpha_1$ and $\alpha_2$ is computed to obtain the solution $\boldsymbol{\alpha}^{t+1}$. To carry out the update, first the constraints (15.29)-(15.30) have to be obeyed. The geometry of these constraints depends on whether the labels $y_1$ and $y_2$ are equal or not. The two possible configurations are shown in Fig. 15.7. If $y_1 \ne y_2$ (left picture) the solution should be sought along the line $\alpha_2 = \alpha_1 + \gamma$, where $\gamma = \alpha_2^t - \alpha_1^t$. If $y_1 = y_2$ (right picture) the solution should be sought along the line $\alpha_2 = -\alpha_1 + \gamma$, where $\gamma = \alpha_1^t + \alpha_2^t$. If the solution transgresses the bound of the box, it should be clipped at the bound.
The optimal values of the variables along the line are computed by eliminating one of the variables through the equality constraint and finding the unconstrained minimum of the objective function. Thus eliminating $\alpha_1$ one obtains the following update rule for $\alpha_2$:

$$\alpha_2^{t+1} = \alpha_2^t + \frac{y_2 (E_1 - E_2)}{\eta}, \qquad (15.47)$$

where $E_i = f(\mathbf{x}_i) - y_i$ is the prediction error on the point $\mathbf{x}_i$ and $\eta = k(\mathbf{x}_1,\mathbf{x}_1) + k(\mathbf{x}_2,\mathbf{x}_2) - 2k(\mathbf{x}_1,\mathbf{x}_2)$. The clipping at the bounds of the box is expressed by:

$$\alpha_2^{t+1,\text{clipped}} = \begin{cases} H, & \text{if } \alpha_2^{t+1} \ge H,\\ \alpha_2^{t+1}, & \text{if } L < \alpha_2^{t+1} < H,\\ L, & \text{if } \alpha_2^{t+1} \le L, \end{cases} \qquad (15.48)$$

with $L = \max(0, \alpha_2^t - \alpha_1^t)$, $H = \min(C, C + \alpha_2^t - \alpha_1^t)$ if $y_1 \ne y_2$, and $L = \max(0, \alpha_1^t + \alpha_2^t - C)$, $H = \min(C, \alpha_1^t + \alpha_2^t)$ if $y_1 = y_2$. Finally, the value of $\alpha_1$ is recovered from the equality constraint:

$$\alpha_1^{t+1} = \alpha_1^t + y_1 y_2 \bigl(\alpha_2^t - \alpha_2^{t+1,\text{clipped}}\bigr). \qquad (15.49)$$
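A single two-variable SMO step, including the clipping at the box, can be sketched as follows (toy two-point dataset, linear kernel, and all names are our own assumptions; a production solver additionally handles the degenerate case $\eta \le 0$ and caches errors):

```python
# One analytical SMO step on the working set {i, j}.
def k(u, v):          # linear kernel
    return u * v

def smo_step(i, j, X, y, alpha, b, C):
    def f(x):         # current decision function
        return sum(y[l] * alpha[l] * k(X[l], x) for l in range(len(X))) + b
    Ei, Ej = f(X[i]) - y[i], f(X[j]) - y[j]      # prediction errors
    eta = k(X[i], X[i]) + k(X[j], X[j]) - 2.0 * k(X[i], X[j])
    if eta <= 0.0:
        return alpha  # degenerate direction; a real solver treats this case
    aj_new = alpha[j] + y[j] * (Ei - Ej) / eta   # unconstrained optimum
    if y[i] != y[j]:                             # box bounds on alpha_j
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    aj_new = min(max(aj_new, L), H)              # clip at the box
    ai_new = alpha[i] + y[i] * y[j] * (alpha[j] - aj_new)  # equality constr.
    out = list(alpha)
    out[i], out[j] = ai_new, aj_new
    return out

X = [-1.0, 1.0]
y = [-1, 1]
alpha = smo_step(0, 1, X, y, [0.0, 0.0], 0.0, 10.0)
print(alpha)  # [0.5, 0.5] -- the optimum of this toy problem in one step
```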
Many real-life machine learning problems can be more naturally viewed as online rather than batch learning problems. Indeed, the data are often collected continuously in time, and, more importantly, the concepts to be learned may also evolve in time. Significant effort has been spent in recent years on the development of online SVM learning algorithms (e.g. [53,33,49]). An elegant solution to online SVM learning is the incremental SVM proposed by [16] which provides a framework for exact online learning.
To present the online SVM learning we first need an abstract formulation of the SVM optimization problem and a brief overview of the incremental SVM. We start with the following abstract form of the SVM optimization problem:

$$\max_{\mu}\ \min_{\mathbf{0} \le \mathbf{x} \le C\mathbf{1}}\ W := -\mathbf{c}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T K \mathbf{x} + \mu\,\mathbf{a}^T\mathbf{x},$$

which recovers the dual SVM problem (15.28)-(15.30) for $\mathbf{c} = \mathbf{1}$, $\mathbf{a} = \mathbf{y}$, $K_{ij} = y_i y_j k(\mathbf{x}_i,\mathbf{x}_j)$ and $\mu = b$.
Incremental SVM provides a procedure for adding one example to an existing optimal solution. When a new point is added, its weight is initially assigned to 0. Then the weights of the other points and the threshold $b$ should be updated, in order to obtain the optimal solution for the enlarged dataset. Likewise, when a point is to be removed from the dataset, its weight is forced to 0, while updating the weights of the remaining points and $b$ so that the solution obtained with this weight equal to 0 is optimal for the reduced dataset. The online learning follows naturally from the incremental/decremental learning: the new example is added while some old example is removed from the working set.
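The add-with-zero-weight step can be illustrated with a small check (toy data, linear kernel, and the pre-computed optimal solution are our own assumptions): if the new point already satisfies its KKT condition with weight 0, the old solution is still optimal and no update is needed at all.

```python
# Incremental-SVM idea: a new example enters with weight 0; the weights of
# the other points must change only if the new point violates its KKT
# condition, i.e. lies inside the margin.
X = [-1.0, 1.0]
y = [-1, 1]
alpha = [0.5, 0.5]   # optimal dual solution for the current dataset
b = 0.0

def k(u, v):
    return u * v

def f(x):
    return sum(y[j] * alpha[j] * k(X[j], x) for j in range(len(X))) + b

x_new, y_new = 2.0, 1                    # candidate point, weight set to 0
needs_update = y_new * f(x_new) < 1.0    # KKT violated => weights must move
print(needs_update)  # False: the point lies outside the margin
```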
The basic principle of the incremental SVM ([16]) is that updates to the state of the current example should keep the remaining examples in their optimal state. In other words, the KKT conditions (15.31) expressed for the gradients $g_i$: