Next: 15.6 Extensions of SVM Up: 15. Support Vector Machines Previous: 15.4 Non-linear SVM

Subsections

# 15.5 Implementation of SVM

## 15.5.1 Basic Formulations

We are now at the point to merge the ideas of statistical learning, Structural Risk Minimization and reproducing kernels into a single algorithm, Support Vector Machines, suitable for a wide range of practical application. The main goal of this algorithm is to find a weight vector separating the data with the largest possible margin.

### 15.5.1.1 Separable Data

Assume that the data are separable. Our goal is to find the smallest possible without committing any error. This can be expressed by the following quadratic optimization problem:

The constraints in (15.20) assure that and  are chosen such that no example has a distance to the hyperplane smaller than one. The problem can be solved directly by using a quadratic optimizer. Notice that the optimal solution renders a canonical hyperplane. In contrast to many neural networks (e.g. [10]) one can always find the global minimum. In fact, all minima of (15.20) are global minima, although they might not be unique as e.g. in the case when , where is the dimensionality of the data.

In the formulation (15.20), refered to as the primal formulation, we are bound to use the original data . In order to apply the kernel trick (cf. Sect. 15.4.1) we need to transform the problem such that the only operation involving the original data  is an inner product between certain data points. This can be achieved by transforming the problem to the dual formulation. The notion of duality is an essential part of non-linear optimization theory, for details one can refer to any standard textbook on mathematical programming (e.g. [38,9]). For our purposes it is important that for every quadratic optimization problem there exists a dual problem which is also a quadratic problem. If both the primal and the dual problems have an optimal solution then the values of the objective function at the optimal solutions coincide. This implies that by solving the dual problem - which uses the orginal data  only through inner products - the solution to the primal problem can be reconstructed.

To derive the dual of (15.20), we introduce Lagrange multipliers with , one for each of the constraints in (15.20). We obtain the following Lagrangian:

 (15.19)

The task is to minimize (15.21) with respect to and to maximize it with respect to . At the optimal point, we have the following saddle point equations:

 and

which translate into

 and (15.20)

From the right equation of (15.22), we find that is contained in the subspace spanned by the in the training set. By substituting (15.22) into (15.21), we get the dual quadratic optimization problem:

 (15.21) subject to (15.22) (15.23)

Thus, by solving the dual optimization problem, one obtains the coefficients , , which one needs to express the solution  . This leads to the decision function:

Note that the scalar product in this dual formulation can be directly replaced by the kernel mapping , opening the possibility for the non-linear classifiers. This expression does not directly depend on the dimensionality  of the data but on the number of training examples . As long as we are able to evaluate the scalar product the dimensionality could be arbitrary, even infinite.

### 15.5.1.2 Non-separable Data

So far we have only considered the separable case which corresponds to an empirical error of zero (cf. Theorem 2). However for many practical applications this assumption is violated. If the data are not linearly separable then problem (15.20) has no feasible solution. By allowing for some errors we might get better results and avoid over-fitting effects (cf. Fig. 15.2).

Therefore a ''good'' trade-off between the empirical risk and the complexity term in (15.9) needs to be found. Using a technique which was first proposed in ([8]) and later used for SVMs in ([17]), one introduces slack-variables to relax the hard-margin constraints:

 (15.25)

additionally allowing for some classification errors. The SVM solution can then be found by (a) keeping the upper bound on the VC-dimension small and (b) by minimizing an upper bound on the empirical risk, i.e. the number of training errors. Thus, one minimizes

where the regularization constant determines the trade-off between the empirical error and the complexity term. This leads to the dual problem:

 (15.26) subject to (15.27) (15.28)

From introducing the slack-variables , one gets the box constraints that limit the size of the Lagrange multipliers: , .

The threshold can be computed by exploiting the fact that for all support vectors with , the slack variable is zero. This follows from the Karush-Kuhn-Tucker (KKT) conditions (cf. (15.31) below). Thus, for any support vector with holds:

Averaging over these patterns yields a numerically stable solution:

where denotes the identity matrix.

### 15.5.1.3 Sparsity

The Karush-Kuhn-Tucker (KKT) conditions are the necessary conditions for an optimal solution of a non-linear programming problem (e.g. [9,38]). The conditions are particularly simple for the dual SVM problem (15.28,15.29,15.30) ([70]):

 (15.29)

They reveal one of the most important properties of SVMs: the solution is sparse in  . For all examples outside the margin area the optimal 's are zero. Specifically, the KKT conditions show that only such connected to a training pattern , which is either on the edge of (i.e.  and ) or inside the margin area (i.e.  and ) are non-zero. These are exactly the support vectors as mentioned in Sect. 15.3.2. This sparsity property makes SVM learning practical for large data sets.

## 15.5.2 Decomposition

The practical usefulness of SVM stems from their ability to provide arbitrary non-linear separation boundaries and at the same time to control generalization ability through the parameter and the kernel parameters. In order to utilize these features it is necessary to work with the dual formulation (15.28)-(15.30) of the SVM training problem. This can be difficult from the computational point of view, for two reasons:

1. One needs to solve the quadratic programming problem with as many variables as the number  of available data points (this can be quite large, up to -).
2. Merely to define the quadratic problem formally, one needs to store the kernel matrix, which poses an unsurmountable storage problem for large datasets.
Because of these implications, it is usually impossible to use the standard optimization tools (e.g. MINOS, CPLEX, LOQO) for solving the SVM training problems on datasets containing larger than examples. In the following sections the decomposition techniques are presented, which use the special structure of the SVM problem to provide efficient SVM training algorithms.

### 15.5.2.1 Basic Principles

The key idea of decomposition is to freeze all but a small number of optimization variables and to solve a sequence of constant-size problems. The set of variables optimized at a current iteration is referred to as the working set. Because the working set is re-optimized, the value of the objective function is improved at each iteration, provided the working set is not optimal before re-optimization.

The mathematics of the decomposition technique can be best seen in the matrix notation. Let , let , let be the matrix with the entries , and let denote the vector of length  consisting of all s. Then the dual SVM problem (15.28)-(15.30) can be written in the matrix form as:

 (15.30) subject to (15.31) (15.32) (15.33)

Let us partition the vector into of the variables to be included in the working set at a current iteration and the vector of the remaining variables. The matrix  is partitioned as

with the corresponding parts determined by the index sets and . By re-writing the problem (15.32)-(15.35) using the partitioned matrix notation, and observing that only the free variables  are to be considered as variables, the following sub-problem, to be solved at each iteration, is obtained:

 (15.34) subject to (15.35) (15.36) (15.37)

By choosing the size of the working set relatively small (usually ) one can ensure that the sub-problem (15.36)-(15.39) is easily solved by any optimization tool.

Iteration is carried out until the following termination criteria, derived from Karush-Kuhn-Tucker conditions (15.31), are satisfied to the required precision :

 (15.38) (15.39) (15.40)

### 15.5.2.2 Working Set Selection: Feasible Direction Algorithms

The crucial issue in the decomposition technique presented above is the selection of working sets. First, the provision that a working set must be sub-optimal before re-optimization is essential to prevent the algorithm from cycling. Second, the working set selection affects the speed of the algorithm: if sub-optimal working sets are selected more or less at random, the algorithm converges very slowly. Finally, the working set selection is important in theoretical analysis of decomposition algorithms; in particular in the convergence proofs and in the analysis of the convergence rate.

Two main approaches exist to the working set selection in the SVM decomposition algorithms: the heuristic approach and the feasible direction approach. The former has been used in the original paper of [45] on SVM decomposition and has been mainly used in the specific flavor of decomposition algorithms called Sequential Minimal Optimization (SMO), presented in the next section. The feasible direction decomposition algorithms root in the original algorithm of [32] for pattern recognition ( SVM), with the formal connection to the feasible direction methods of nonlinear programming established by [35].

The notion of a ''feasible direction'' stems from the classical techniques of nonlinear programming subject to linear constraints ([80,9]). It refers to the direction along which any step of the magnitude satisfying , for some fixed , results in a feasible solution to the nonlinear program. For any nonlinear program, finding the feasible direction amounts to a solution of a linear programming problem. In particular, for the dual SVM training problem (15.28)-(15.30) the following problem must be solved:

 (15.41) subject to (15.42) for all (15.43) for all (15.44) (15.45)

where is the gradient of the objective function (15.28). To solve the feasible direction problem exactly at each iteration is inefficient because the linear program (15.43)-(15.47) has all variables. However, an approximate solution to the feasible direction problem can be efficiently found by using the normalization and requiring that the number of positive direction components is equal to the number of the negative components. In this case, the solution is obtained by sorting all examples by the quantity , and selecting  examples with the largest and examples with the smallest values. In fact, by using the heap operations, sorting can be avoided, and the entire selection can be executed in time. The motivation behind the quantity can be traced back to the first-order Karush-Kuhn-Tucker conditions ([35]), which provides the solid formal background for the feasible direction SVM decomposition.

Convergence of the feasible direction SVM decomposition has been proved by [37], and the linear convergence rate has been observed experimentally ([35]).

### 15.5.2.3 Sequential Minimal Optimization

The Sequential Minimal Optimization (SMO) algorithm proposed by [48] is another popular and efficient algorithm for the SVM training. In this algorithm the idea of decomposition is taken to its extreme by reducing the working sets to two points - the smallest size for which it is possible to maintain the equality constraint (15.30). With two points in the working set, the optimal solution can be computed analytically without calling an optimization tool.

The analytical computation of the optimal solution is based on the following idea: given the solution , the optimal update is computed to obtain the solution . To carry out the update, first the constraints (15.29)-(15.30) have to be obeyed. The geometry of these constraints depends on whether the labels  and  are equal or not. The two possible configurations are shown in Fig. 15.7. If (left picture) the solution should be sought along the line , where If (right picture) the solution should be sought along the line . If the solution transgresses the bound of the box, it should be clipped at the bound.

Figure 15.7: Constraints of the SVM training problem with two examples

The optimal values of the variables along the line are computed by eliminating one of the variables through the equality constraint and finding the unconstrained minimum of the objective function. Thus eliminating one obtains the following update rule for :

 (15.46)

where

 (15.47) (15.48) (15.49)

Next, the bound constraints should be taken care of. Depending on the geometry, one computes the following lower and upper bounds on the value of the variable :

The constrained optimum is then found by clipping the unconstrained optimum to the ends of the line segment:

Finally, the value of is computed:

 (15.50)

The working set selection in the SMO algorithm is carried out by means of two heuristics. The ''first choice'' heuristic is responsible for the choice of the first example in each pair. Following this heuristic, all examples that violate the KKT condition (15.31) are used in turns as the first example in the pair. More precisely, the algorithm makes repeated passes only through the examples whose  is strictly between then bounds, and only when all such examples satisfy the KKT conditions, the sweep over the entire data is done to bring in new examples. The ''second choice'' heuristic is responsible for the selection of the second example in the pair. It is intended to bring in such an example that results in the largest step taken during the joint optimization (15.48). As a cheap approximation of this step the numerator is taken (the denominator, which involves evaluation of kernels, can be expensive to compute). Following the strategy to maximize , the SMO algorithm chooses the example with the largest if is negative and the example with the smallest if is positive. Under unusual circumstances, when no progress can be made using the second choice heuristic above a hierarchy of weaker heuristics is applied, the details of which can be found in ([48]).

## 15.5.3 Incremental Support Vector Optimization

Many real-life machine learning problems can be more naturally viewed as online rather than batch learning problems. Indeed, the data are often collected continuously in time, and, more importantly, the concepts to be learned may also evolve in time. Significant effort has been spent in recent years on the development of the online SVM learning algorithms (e.g. [53,33,49]). An elegant solution to online SVM learning is the incremental SVM proposed by [16] which provides a framework for exact online learning.

To present the online SVM learning we first need an abstract formulation of the SVM optimization problem and a brief overview of the incremental SVM. We start with the following abstract form of the SVM optimization problem:

 (15.51)

where and are vectors, is the kernel matrix and is a scalar. By defining the meaning of the abstract parameters  , and for the particular SVM problem at hand, one can use the same algorithmic structure for different SVM algorithms. In particular, for the standard support vector classifiers ([71]), take , and the given regularization constant ; the same definition applies to the -SVM ([58]) except that ; for the one-class classifier ([56,67]), the parameters are defined as:    diag and .

Incremental SVM provides a procedure for adding one example to an existing optimal solution. When a new point is added, its weight is initially assigned to 0. Then the weights of other points and should be updated, in order to obtain the optimal solution for the enlarged dataset. Likewise, when a point  is to be removed from the dataset, its weight is forced to 0, while updating the weights of the remaining points and so that the solution obtained with is optimal for the reduced dataset. The online learning follows naturally from the incremental/decremental learning: the new example is added while some old example is removed from the working set.

The basic principle of the incremental SVM ([16]) is that updates to the state of the example should keep the remaining examples in their optimal state. In other words, the KKT conditions (15.31) expressed for the gradients :

 (15.52)

 (15.53)

must be maintained for all the examples, except possibly for the current example . Let denote the set of unbounded support vectors and the set of other examples. In order to maintain the KKT conditions (15.54), one can express increments to the weights of the unbounded support vectors and to the gradients of other examples as a linear function of the increment of the current example's weight :

 (15.54)

The sensitivity relations (15.56) only make sense for the fixed composition of sets and . To proceed with learning, we need to detect the largest increment such that the sets and remain intact. After changing the SVM state according to the relations (15.56) evaluated for the increment , the sensitivity vectors and must be recomputed accordingly (depending of whether an example enters or leaves the set ). For the details of accounting the reader should refer to ([16,66]). The iteration terminates when the current element satisfies the KKT conditions (15.54) as well.

Next: 15.6 Extensions of SVM Up: 15. Support Vector Machines Previous: 15.4 Non-linear SVM