15.5 Implementation of SVM

15.5.1 Basic Formulations

We are now in a position to merge the ideas of statistical learning, Structural Risk Minimization and reproducing kernels into a single algorithm, Support Vector Machines, suitable for a wide range of practical applications. The main goal of this algorithm is to find a weight vector $ \boldsymbol {w}$ separating the data $ {\mathcal{Z}}$ with the largest possible margin.

15.5.1.1 Separable Data

Assume that the data are separable. Our goal is to find a weight vector $ \boldsymbol {w}$ with the smallest possible norm that separates the data without committing any error. This can be expressed by the following quadratic optimization problem:

$\displaystyle \min_{\boldsymbol{w},b}\quad \frac{1}{2} \Vert\boldsymbol{w}\Vert^2\;,$ (15.20)
$\displaystyle {\mathrm{subject\;to}}\quad {y_i}\left(\left(\boldsymbol{w}^{\top}{\boldsymbol{x}}_i\right)+b\right)\geq 1\;,\quad{\forall i=1,\ldots,M}\,.$

The constraints in (15.20) ensure that $ \boldsymbol {w}$ and $ b$ are chosen such that no example has a distance to the hyperplane smaller than one. The problem can be solved directly by using a quadratic optimizer. Notice that the optimal solution renders a canonical hyperplane. In contrast to many neural networks (e.g. [10]), one can always find the global minimum. In fact, all minima of (15.20) are global minima, although they might not be unique, e.g. in the case when $ M<N$, where $ N$ is the dimensionality of the data.
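As an illustration, the following minimal sketch solves the primal problem (15.20) on a small, made-up two-dimensional toy dataset. It uses the generic SLSQP solver from scipy in place of a dedicated quadratic optimizer; the data X, y and all names are illustrative assumptions, not part of the text.

# A minimal sketch of the primal problem (15.20) on made-up 2-D toy data,
# using scipy's generic SLSQP solver as the quadratic optimizer.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels

def objective(p):                     # p = (w_1, ..., w_N, b)
    w = p[:-1]
    return 0.5 * w @ w                # (1/2) ||w||^2

def margin_constraints(p):            # y_i ((w^T x_i) + b) - 1 >= 0 for all i
    w, b = p[:-1], p[-1]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")
w, b = res.x[:-1], res.x[-1]          # maximum-margin hyperplane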

In the formulation (15.20), referred to as the primal formulation, we are bound to use the original data $ {\boldsymbol{x}}$. In order to apply the kernel trick (cf. Sect. 15.4.1) we need to transform the problem such that the only operation involving the original data  $ {\boldsymbol{x}}$ is an inner product between certain data points. This can be achieved by transforming the problem to the dual formulation. The notion of duality is an essential part of non-linear optimization theory; for details one can refer to any standard textbook on mathematical programming (e.g. [38,9]). For our purposes it is important that for every quadratic optimization problem there exists a dual problem which is also a quadratic problem. If both the primal and the dual problems have an optimal solution, then the values of the objective function at the optimal solutions coincide. This implies that by solving the dual problem, which uses the original data  $ {\boldsymbol{x}}$ only through inner products, the solution to the primal problem can be reconstructed.

To derive the dual of (15.20), we introduce Lagrange multipliers $ \alpha_i\geq 0$ with $ i=1,\ldots,M$, one for each of the constraints in (15.20). We obtain the following Lagrangian:

$\displaystyle L(\boldsymbol{w},b,{\boldsymbol{\alpha}})=\frac{1}{2}\Vert \boldsymbol{w}\Vert^2-\sum_{i=1}^M \alpha_i\left(y_i\left(\left(\boldsymbol{w}^{\top}{\boldsymbol{x}}_i\right) + b\right) - 1\right)\;.$ (15.21)

The task is to minimize (15.21) with respect to $ \boldsymbol{w}, b$ and to maximize it with respect to $ \alpha_i$. At the optimal point, we have the following saddle point equations:

$\displaystyle \frac{\partial L}{\partial b} = 0$ and $\displaystyle \frac{\partial L}{\partial \boldsymbol{w}} = 0\;,$

which translate into

$\displaystyle \sum_{i=1}^M \alpha_i y_i = 0$ and $\displaystyle \boldsymbol{w} = \sum_{i=1}^M \alpha_i y_i {\boldsymbol{x}}_i\;.$ (15.22)

From the right equation of (15.22), we find that $ \boldsymbol {w}$ is contained in the subspace spanned by the $ {\boldsymbol{x}}_i$ in the training set. By substituting (15.22) into (15.21), we get the dual quadratic optimization problem:

$\displaystyle \max_{{\boldsymbol{\alpha}}} \sum_{i=1}^M\alpha_i-\frac{1}{2} \sum_{i,j=1}^M \alpha_i\alpha_j y_iy_j \left({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}}_j\right)\;,$ (15.23)
subject to$\displaystyle \quad \alpha_i \ge 0\;,\quad i=1,\ldots,M\;,$ (15.24)
$\displaystyle \sum_{i=1}^M\alpha_i y_i=0\;.$ (15.25)

Thus, by solving the dual optimization problem, one obtains the coefficients $ \alpha_i$, $ i=1,\ldots,M$, which are needed to express the solution  $ \boldsymbol {w}$. This leads to the decision function:

$\displaystyle f({\boldsymbol{x}}) =\mathrm{sgn}\left(\left(\boldsymbol{w}^{\top}{\boldsymbol{x}}\right)+b\right) =\mathrm{sgn}\left(\sum_{i=1}^M y_i \alpha_i \left({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}}\right)+b\right)\;.$ (15.26)

Note that the scalar product in this dual formulation can be directly replaced by the kernel mapping $ k({\boldsymbol{x}}_i,{\boldsymbol{x}})$, opening the possibility of non-linear classifiers. This expression does not directly depend on the dimensionality $ N$ of the data but on the number of training examples $ M$. As long as we are able to evaluate the scalar product $ ({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}})$, the dimensionality can be arbitrary, even infinite.
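A sketch of the resulting classifier with the inner product replaced by a kernel, as just described. The RBF kernel and all names are illustrative assumptions, and alpha and b are assumed to come from a solved dual problem:

import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # k(a, b) = exp(-gamma ||a - b||^2); any admissible kernel works here
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # f(x) = sgn( sum_i y_i alpha_i k(x_i, x) + b ), cf. (15.26)
    s = sum(y_train[i] * alpha[i] * kernel(X_train[i], x)
            for i in range(len(alpha)) if alpha[i] > 0)   # only SVs contribute
    return np.sign(s + b)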

15.5.1.2 Non-separable Data

So far we have only considered the separable case, which corresponds to an empirical error of zero (cf. Theorem 2). However, for many practical applications this assumption is violated. If the data are not linearly separable, then problem (15.20) has no feasible solution. By allowing for some errors we might get better results and avoid over-fitting effects (cf. Fig. 15.2).

Therefore a ''good'' trade-off between the empirical risk and the complexity term in (15.9) needs to be found. Using a technique which was first proposed in [8] and later used for SVMs in [17], one introduces slack variables $ \xi_i$ to relax the hard-margin constraints:

$\displaystyle y_i \left(\left(\boldsymbol{w}^{\top} {\boldsymbol{x}}_i\right)+b\right)\geq 1-\xi_i\;,\quad \xi_i \geq 0\;,\quad i=1,\ldots,M\;,$ (15.27)

additionally allowing for some classification errors. The SVM solution can then be found by (a) keeping the upper bound on the VC-dimension small and (b) minimizing an upper bound $ \sum_{i=1}^M \xi_i$ on the empirical risk, i.e. the number of training errors. Thus, one minimizes

$\displaystyle \min_{\boldsymbol{w},b,{\boldsymbol{\xi}}}\quad\frac{1}{2}\Vert\boldsymbol{w}\Vert^2+C\sum_{i=1}^M \xi_i\;,$

where the regularization constant $ C > 0$ determines the trade-off between the empirical error and the complexity term. This leads to the dual problem:

$\displaystyle \max_{{\boldsymbol{\alpha}}} \sum_{i=1}^M\alpha_i-\frac{1}{2} \sum_{i,j=1}^M \alpha_i\alpha_j y_iy_j \left({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}}_j\right)\;,$ (15.28)
subject to$\displaystyle \quad 0 \leq \alpha_i \leq C\;,\quad i=1,\ldots,M\;,$ (15.29)
$\displaystyle \sum_{i=1}^M\alpha_i y_i=0\;.$ (15.30)

Introducing the slack variables $ \xi_i$ yields the box constraints that limit the size of the Lagrange multipliers: $ 0 \leq \alpha_i\leq C$, $ i=1,\ldots,M$.
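For moderately sized problems this dual can be handed directly to an off-the-shelf QP solver. The sketch below assumes the cvxopt package is available and rewrites the maximization (15.28)-(15.30) as the equivalent minimization of $ \frac{1}{2}{\boldsymbol{\alpha}}^{\top}H{\boldsymbol{\alpha}} - \boldsymbol{1}^{\top}{\boldsymbol{\alpha}}$:

import numpy as np
from cvxopt import matrix, solvers

def fit_dual_svm(K, y, C):
    """K: M x M kernel matrix, y: labels in {-1, +1}, C: regularization."""
    M = len(y)
    H = (y[:, None] * y[None, :]) * K            # H_ij = y_i y_j k(x_i, x_j)
    P, q = matrix(H), matrix(-np.ones(M))        # max 1'a  ->  min -1'a
    G = matrix(np.vstack([-np.eye(M), np.eye(M)]))
    h = matrix(np.hstack([np.zeros(M), C * np.ones(M)]))   # 0 <= a_i <= C
    A, b = matrix(y.astype(float)[None, :]), matrix(0.0)   # sum_i a_i y_i = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()            # the multipliers alpha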

The threshold $ b$ can be computed by exploiting the fact that for all support vectors $ {\boldsymbol{x}}_i$ with $ 0<\alpha_i < C$, the slack variable $ \xi_i$ is zero. This follows from the Karush-Kuhn-Tucker (KKT) conditions (cf. (15.31) below). Thus, for any support vector $ {\boldsymbol{x}}_i$ with $ 0<\alpha_i<C$ it holds that:

$\displaystyle y_i\left(b+\sum_{j=1}^M y_j\alpha_j\left({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}}_j\right)\right) = 1\;.$    

Averaging over these patterns yields a numerically stable solution:

$\displaystyle b=\frac{1}{\vert I\vert}\sum_{i\in I} \left( y_i-\sum_{j=1}^M y_j\alpha_j \left({\boldsymbol{x}}_i^{\top}{\boldsymbol{x}}_j\right)\right)\;,$    

where $ I$ denotes the index set of the support vectors with $ 0<\alpha_i<C$ and $ \vert I\vert$ its cardinality.
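A sketch of this averaging, assuming a precomputed kernel matrix K and the multipliers alpha from the dual solution; the tolerance is an illustrative guard against numerical noise at the bounds:

import numpy as np

def compute_bias(K, y, alpha, C, tol=1e-8):
    # I = indices of in-bound support vectors (0 < alpha_i < C), for which
    # the slack is zero and hence y_i f(x_i) = 1 holds exactly
    I = np.where((alpha > tol) & (alpha < C - tol))[0]
    return np.mean([y[i] - np.sum(y * alpha * K[i, :]) for i in I])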

15.5.1.3 Sparsity

The Karush-Kuhn-Tucker (KKT) conditions are the necessary conditions for an optimal solution of a non-linear programming problem (e.g. [9,38]). The conditions are particularly simple for the dual SVM problem (15.28)-(15.30) ([70]):

\begin{displaymath}\begin{array}{rclcl} \alpha_i = 0 &\Rightarrow &~y_i f({\boldsymbol{x}}_i) \geq 1 &\text{~and~}& \xi_i = 0\;,\\ 0 < \alpha_i < C &\Rightarrow &~y_i f({\boldsymbol{x}}_i) = 1 &\text{~and~}& \xi_i = 0\;,\\ \alpha_i = C &\Rightarrow &~y_i f({\boldsymbol{x}}_i) \leq 1 &\text{~and~}& \xi_i\geq 0\;. \end{array}\end{displaymath} (15.31)

They reveal one of the most important properties of SVMs: the solution is sparse in  $ {\boldsymbol{\alpha}}$. For all examples outside the margin area the optimal $ \alpha_i$'s are zero. Specifically, the KKT conditions show that only those $ \alpha_i$ are non-zero whose training pattern $ {\boldsymbol{x}}_i$ lies either on the margin boundary (i.e.  $ 0<\alpha_i < C$ and $ y_if({\boldsymbol{x}}_i) = 1$) or inside the margin area (i.e.  $ \alpha_i=C$ and $ y_if({\boldsymbol{x}}_i) < 1$). These are exactly the support vectors mentioned in Sect. 15.3.2. This sparsity property makes SVM learning practical for large data sets.
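The following sketch makes this partition explicit from the multipliers alone; the tolerance is an illustrative guard against floating-point noise:

import numpy as np

def partition_by_kkt(alpha, C, tol=1e-8):
    """Split examples according to the KKT conditions (15.31)."""
    outside = np.where(alpha <= tol)[0]                         # y_i f(x_i) >= 1
    on_margin = np.where((alpha > tol) & (alpha < C - tol))[0]  # y_i f(x_i) = 1
    inside = np.where(alpha >= C - tol)[0]                      # y_i f(x_i) <= 1
    support_vectors = np.concatenate([on_margin, inside])       # sparse solution
    return outside, on_margin, inside, support_vectors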

15.5.2 Decomposition

The practical usefulness of SVMs stems from their ability to provide arbitrary non-linear separation boundaries and at the same time to control generalization ability through the parameter $ C$ and the kernel parameters. In order to utilize these features it is necessary to work with the dual formulation (15.28)-(15.30) of the SVM training problem. This can be difficult from the computational point of view, for two reasons:

  1. One needs to solve the quadratic programming problem with as many variables as the number $ M$ of available data points (this can be quite large, up to $ 10^5$-$ 10^6$).
  2. Merely to define the quadratic problem formally, one needs to store the $ M\times M$ kernel matrix, which poses an insurmountable storage problem for large datasets.
Because of these limitations, it is usually impossible to use standard optimization tools (e.g. MINOS, CPLEX, LOQO) for solving SVM training problems on datasets containing more than $ 1000$ examples. The following sections present decomposition techniques, which use the special structure of the SVM problem to provide efficient SVM training algorithms.

15.5.2.1 Basic Principles

The key idea of decomposition is to freeze all but a small number of optimization variables and to solve a sequence of constant-size problems. The set of variables optimized at a current iteration is referred to as the working set. Because the working set is re-optimized, the value of the objective function is improved at each iteration, provided the working set is not optimal before re-optimization.

The mathematics of the decomposition technique is best seen in matrix notation. Let $ {\boldsymbol{\alpha}} = (\alpha_1, \ldots, \alpha_M)^{\top}$, let $ {\boldsymbol{y}} = (y_1, \ldots, y_M)^{\top}$, let $ H$ be the matrix with the entries $ H_{ij} = y_i y_j k({\boldsymbol{x}}_i, {\boldsymbol{x}}_j)$, and let $ \boldsymbol{1}$ denote the vector of length $ M$ consisting of all $ 1$s. Then the dual SVM problem (15.28)-(15.30) can be written in the matrix form as:

$\displaystyle \max_{{\boldsymbol{\alpha}}} \boldsymbol{1}^{\top} {\boldsymbol{\alpha}} - \frac{1}{2} {\boldsymbol{\alpha}}^{\top} H {\boldsymbol{\alpha}}\;,$ (15.32)
subject to$\displaystyle \quad {\boldsymbol{y}}^{\top} {\boldsymbol{\alpha}} = 0\;,$ (15.33)
$\displaystyle {\boldsymbol{\alpha}} - C \boldsymbol{1} \leq 0\;,$ (15.34)
$\displaystyle {\boldsymbol{\alpha}} \geq 0 \;.$ (15.35)

Let us partition the vector $ {\boldsymbol{\alpha}}$ into the vector $ {\boldsymbol{\alpha}}_B$ of the variables included in the working set at the current iteration and the vector $ {\boldsymbol{\alpha}}_N$ of the remaining variables. The matrix $ H$ is partitioned as

$\displaystyle H = \begin{bmatrix}H_{BB} & H_{BN}\\ H_{NB} & H_{NN} \end{bmatrix}\;,$    

with the corresponding parts determined by the index sets $ B$ and $ N$. By re-writing the problem (15.32)-(15.35) using the partitioned matrix notation, and observing that only the free variables  $ {\boldsymbol{\alpha}}_B$ are to be considered as variables, the following sub-problem, to be solved at each iteration, is obtained:

$\displaystyle \max_{{\boldsymbol{\alpha}}_B} \left(\boldsymbol{1}^{\top}_B - {\boldsymbol{\alpha}}^{\top}_N H_{NB}\right) {\boldsymbol{\alpha}}_B - \frac{1}{2} {\boldsymbol{\alpha}}^{\top}_B H_{BB} {\boldsymbol{\alpha}}_B\;,$ (15.36)
subject to$\displaystyle \quad {\boldsymbol{y}}^{\top}_B {\boldsymbol{\alpha}}_B = -{\boldsymbol{y}}^{\top}_N {\boldsymbol{\alpha}}_N\;,$ (15.37)
$\displaystyle {\boldsymbol{\alpha}}_B - C \boldsymbol{1}_B \leq 0\;,$ (15.38)
$\displaystyle {\boldsymbol{\alpha}}_B \geq 0\;.$ (15.39)

By choosing the size $ q$ of the working set $ B$ relatively small (usually $ q \leq 100$) one can ensure that the sub-problem (15.36)-(15.39) is easily solved by any optimization tool.
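A schematic of the overall decomposition loop. Here select_working_set (sketched in the next subsection), solve_subproblem (standing for any small dense QP solver applied to (15.36)-(15.39)) and is_optimal (sketched after the termination criteria below) are hypothetical placeholders, not part of the text:

import numpy as np

def decompose(K, y, C, q=50, eps=1e-3, max_iter=10_000):
    M = len(y)
    alpha = np.zeros(M)                          # feasible starting point
    H = (y[:, None] * y[None, :]) * K            # H_ij = y_i y_j k(x_i, x_j)
    for _ in range(max_iter):
        if is_optimal(alpha, K, y, C, eps):      # criteria (15.40)-(15.42)
            break
        grad = 1.0 - H @ alpha                   # gradient of the objective
        B = select_working_set(alpha, grad, y, q)
        N = np.setdiff1d(np.arange(M), B)
        lin = 1.0 - H[np.ix_(B, N)] @ alpha[N]   # linear term in (15.36)
        rhs = -y[N] @ alpha[N]                   # equality rhs in (15.37)
        alpha[B] = solve_subproblem(H[np.ix_(B, B)], lin, y[B], rhs, C)
    return alpha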

Iteration is carried out until the following termination criteria, derived from Karush-Kuhn-Tucker conditions (15.31), are satisfied to the required precision $ \epsilon $:

$\displaystyle \textstyle b - \epsilon \leq y_i - \sum_{j=1}^M y_j \alpha_j k\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right) \leq b + \epsilon\;, \quad$   $\displaystyle \forall i: \;0 < \alpha_i < C\;,$ (15.40)
$\displaystyle \textstyle y_i \left( \sum_{j=1}^M y_j \alpha_j k\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right) + b \right) \geq 1 - \epsilon\;, \quad$   $\displaystyle \forall i: \; \alpha_i = 0\;,$ (15.41)
$\displaystyle \textstyle y_i \left( \sum_{j=1}^M y_j \alpha_j k\left({\boldsymbol{x}}_i,{\boldsymbol{x}}_j\right) + b \right) \leq 1 + \epsilon\;, \quad$   $\displaystyle \forall i: \; \alpha_i = C\;.$ (15.42)
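A sketch of this check, usable as the is_optimal placeholder in the loop above; the threshold b is estimated on the fly from the free support vectors, and the tolerances are illustrative:

import numpy as np

def is_optimal(alpha, K, y, C, eps=1e-3, tol=1e-8):
    f = (y * alpha) @ K                      # f_i = sum_j y_j a_j k(x_i, x_j)
    free = (alpha > tol) & (alpha < C - tol)
    if not np.any(free):                     # degenerate case: no free SVs
        return False
    b = np.mean(y[free] - f[free])           # threshold from the free SVs
    if np.any(np.abs(y[free] - f[free] - b) > eps):
        return False                         # (15.40) violated
    lower, upper = alpha <= tol, alpha >= C - tol
    return bool(np.all(y[lower] * (f[lower] + b) >= 1 - eps)       # (15.41)
                and np.all(y[upper] * (f[upper] + b) <= 1 + eps))  # (15.42)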

15.5.2.2 Working Set Selection: Feasible Direction Algorithms

The crucial issue in the decomposition technique presented above is the selection of working sets. First, the provision that a working set must be sub-optimal before re-optimization is essential to prevent the algorithm from cycling. Second, the working set selection affects the speed of the algorithm: if sub-optimal working sets are selected more or less at random, the algorithm converges very slowly. Finally, the working set selection is important in theoretical analysis of decomposition algorithms; in particular in the convergence proofs and in the analysis of the convergence rate.

Two main approaches exist to working set selection in SVM decomposition algorithms: the heuristic approach and the feasible direction approach. The former was used in the original paper of [45] on SVM decomposition and is mainly applied in the specific flavor of decomposition algorithms called Sequential Minimal Optimization (SMO), presented in the next section. The feasible direction decomposition algorithms are rooted in the original SVM$ ^{\text{light}}$ algorithm of [32] for pattern recognition, with the formal connection to the feasible direction methods of nonlinear programming established by [35].

The notion of a ''feasible direction'' stems from the classical techniques of nonlinear programming subject to linear constraints ([80,9]). It refers to a direction along which any step of magnitude $ \delta$ satisfying $ 0 < \delta \leq \delta^0$, for some fixed $ \delta^0$, results in a feasible solution to the nonlinear program. For any nonlinear program, finding the feasible direction amounts to solving a linear programming problem. In particular, for the dual SVM training problem (15.28)-(15.30) the following problem must be solved:

$\displaystyle \max_{\boldsymbol{d}} \boldsymbol{g}^{\top} \boldsymbol{d}\;,$ (15.43)
subject to$\displaystyle \quad {\boldsymbol{y}}^{\top} \boldsymbol{d} = 0\;,$ (15.44)
$\displaystyle d_i \geq 0$   for all $ i$ with $ \alpha_i = 0\;,$ (15.45)
$\displaystyle d_i \leq 0$   for all $ i$ with $ \alpha_i = C\;,$ (15.46)
$\displaystyle \Vert\boldsymbol{d}\Vert _2 \leq 1\;,$ (15.47)

where $ \boldsymbol{g}$ is the gradient of the objective function (15.28). Solving the feasible direction problem exactly at each iteration is inefficient because the linear program (15.43)-(15.47) has all $ M$ variables. However, an approximate solution to the feasible direction problem can be efficiently found by using the normalization $ d_i \in \{-1,0,1\}$ and requiring that the number of positive direction components is equal to the number of negative components. In this case, the solution is obtained by sorting all examples by the quantity $ g_i y_i$, and selecting the $ q/2$ examples with the largest and the $ q/2$ examples with the smallest values. In fact, by using heap operations, sorting can be avoided, and the entire selection can be executed in $ O(q \log M)$ time. The motivation behind the quantity $ g_i y_i$ can be traced back to the first-order Karush-Kuhn-Tucker conditions ([35]), which provide a solid formal background for the feasible direction SVM decomposition.
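A sketch of this approximate selection; a complete implementation would additionally discard direction components that are infeasible at the box bounds, cf. (15.45)-(15.46):

import numpy as np

def select_working_set(alpha, grad, y, q):
    # sort all examples by g_i y_i and take the q/2 smallest and the q/2
    # largest values; alpha is unused here but kept for the bound checks
    # that a full implementation performs
    v = grad * y
    order = np.argsort(v)                        # ascending in g_i y_i
    return np.concatenate([order[: q // 2],      # q/2 smallest values
                           order[-(q // 2):]])   # q/2 largest values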

Convergence of the feasible direction SVM decomposition has been proved by [37], and the linear convergence rate has been observed experimentally ([35]).

15.5.2.3 Sequential Minimal Optimization

The Sequential Minimal Optimization (SMO) algorithm proposed by [48] is another popular and efficient algorithm for SVM training. In this algorithm the idea of decomposition is taken to its extreme by reducing the working sets to two points, the smallest size for which it is possible to maintain the equality constraint (15.30). With two points in the working set, the optimal solution can be computed analytically without calling an optimization tool.

The analytical computation of the optimal solution is based on the following idea: given the solution $ (\alpha_1^{\text{old}}, \alpha_2^{\text{old}})$, the optimal update is computed to obtain the solution $ (\alpha_1^{\text{new}},\alpha_2^{\text{new}})$. To carry out the update, first the constraints (15.29)-(15.30) have to be obeyed. The geometry of these constraints depends on whether the labels $ y_1$ and $ y_2$ are equal or not. The two possible configurations are shown in Fig. 15.7. If $ y_1 \neq y_2$ (left picture) the solution should be sought along the line $ \alpha_1 - \alpha_2 = \gamma$, where $ \gamma = \alpha_1^{\text{old}} + y_1 y_2 \alpha_2^{\text{old}}$. If $ y_1 = y_2$ (right picture) the solution should be sought along the line $ \alpha_1 + \alpha_2 = \gamma$. If the solution crosses the boundary of the box, it is clipped at the bound.

Figure 15.7: Constraints of the SVM training problem with two examples. Left: $ y_1 \neq y_2$; right: $ y_1 = y_2$.

The optimal values of the variables along the line are computed by eliminating one of the variables through the equality constraint and finding the unconstrained minimum of the objective function. Thus, eliminating $ \alpha_1$, one obtains the following update rule for $ \alpha_2$:

$\displaystyle \alpha_2^{\text{new}} = \alpha_2^{\text{old}} - \frac{y_2(E_1 - E_2)}{\eta}\;,$ (15.48)

where

$\displaystyle E_1 = \textstyle \sum_{j=1}^M y_j \alpha_j k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_j\right) + b - y_1\;,$ (15.49)
$\displaystyle E_2 = \textstyle \sum_{j=1}^M y_j \alpha_j k\left({\boldsymbol{x}}_2,{\boldsymbol{x}}_j\right) + b - y_2\;,$ (15.50)
$\displaystyle \eta = 2 k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_2\right) - k\left({\boldsymbol{x}}_1,{\boldsymbol{x}}_1\right) - k\left({\boldsymbol{x}}_2,{\boldsymbol{x}}_2\right)\;.$ (15.51)

Next, the bound constraints should be taken care of. Depending on the geometry, one computes the following lower and upper bounds on the value of the variable $ \alpha_2$:

$\displaystyle L = \begin{cases}\max\left(0,\alpha_2^{\text{old}} - \alpha_1^{\text{old}}\right)\;, &\text{if $y_1 \neq y_2$}\;,\\ \max\left(0,\alpha_2^{\text{old}} + \alpha_1^{\text{old}} - C\right)\;, &\text{if $y_1 = y_2$}\;, \end{cases}$
$\displaystyle H = \begin{cases}\min\left(C,C + \alpha_2^{\text{old}} - \alpha_1^{\text{old}}\right)\;, &\text{if $y_1 \neq y_2$}\;,\\ \min\left(C,\alpha_2^{\text{old}} + \alpha_1^{\text{old}}\right)\;, &\text{if $y_1 = y_2$}\;. \end{cases}$

The constrained optimum is then found by clipping the unconstrained optimum to the ends of the line segment:

$\displaystyle \alpha_2^{\text{new}} := \begin{cases}H\;, & \text{if $\alpha_2^{\text{new}} \geq H$}\;, \\ L\;, & \text{if $\alpha_2^{\text{new}} \leq L$}\;, \\ \alpha_2^{\text{new}}\;, & \text{otherwise}\;. \end{cases}$

Finally, the value of $ \alpha_1^{\text{new}}$ is computed:

$\displaystyle \alpha_1^{\text{new}} = \alpha_1^{\text{old}} + y_1 y_2 \left(\alpha_2^{\text{old}} - \alpha_2^{\text{new}}\right)\;.$ (15.52)
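The complete analytical step, collecting (15.48)-(15.52) into one routine; the indices i and j play the roles of examples 1 and 2, and recomputing f from the full kernel matrix is a simplification for clarity (practical implementations cache the errors):

import numpy as np

def smo_pair_update(i, j, alpha, K, y, b, C):
    f = (y * alpha) @ K                        # f_l = sum_m y_m a_m k(x_m, x_l)
    E_i = f[i] + b - y[i]                      # cf. (15.49)
    E_j = f[j] + b - y[j]                      # cf. (15.50)
    eta = 2 * K[i, j] - K[i, i] - K[j, j]      # cf. (15.51); eta < 0 normally
    if eta >= 0:                               # degenerate pair: skip it
        return alpha[i], alpha[j]
    a_j = alpha[j] - y[j] * (E_i - E_j) / eta  # unconstrained optimum (15.48)
    if y[i] != y[j]:                           # bounds L and H on a_j
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[j] + alpha[i] - C)
        H = min(C, alpha[j] + alpha[i])
    a_j = min(max(a_j, L), H)                  # clip to the segment [L, H]
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)   # cf. (15.52)
    return a_i, a_j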

The working set selection in the SMO algorithm is carried out by means of two heuristics. The ''first choice'' heuristic is responsible for the choice of the first example in each pair. Following this heuristic, all examples that violate the KKT condition (15.31) are used in turns as the first example in the pair. More precisely, the algorithm makes repeated passes only through the examples whose $ \alpha_i$ is strictly between the bounds, and only when all such examples satisfy the KKT conditions is a sweep over the entire data performed to bring in new examples. The ''second choice'' heuristic is responsible for the selection of the second example in the pair. It is intended to select the example that results in the largest step taken during the joint optimization (15.48). As a cheap approximation of this step the numerator $ \vert E_1 - E_2\vert$ is taken (the denominator, which involves evaluation of kernels, can be expensive to compute). Following the strategy to maximize $ \vert E_1 - E_2\vert$, the SMO algorithm chooses the example with the largest $ E_2$ if $ E_1$ is negative and the example with the smallest $ E_2$ if $ E_1$ is positive. Under unusual circumstances, when no progress can be made using the second choice heuristic above, a hierarchy of weaker heuristics is applied; the details can be found in ([48]).
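A minimal sketch of the second choice heuristic, assuming an error cache E maintained by the caller; maximizing $ \vert E_1 - E_2\vert$ directly subsumes the two cases described above:

import numpy as np

def second_choice(E, i, candidates):
    # pick the second example maximizing |E_i - E_j| over the candidates
    # (typically the examples with 0 < alpha_j < C)
    return candidates[np.argmax(np.abs(E[i] - E[candidates]))]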

15.5.3 Incremental Support Vector Optimization

Many real-life machine learning problems can be more naturally viewed as online rather than batch learning problems. Indeed, the data are often collected continuously in time, and, more importantly, the concepts to be learned may also evolve in time. Significant effort has been spent in recent years on the development of online SVM learning algorithms (e.g. [53,33,49]). An elegant solution to online SVM learning is the incremental SVM proposed by [16], which provides a framework for exact online learning.

To present the online SVM learning we first need an abstract formulation of the SVM optimization problem and a brief overview of the incremental SVM. We start with the following abstract form of the SVM optimization problem:

$\displaystyle \min_{\mu}\;\min_{0\,\leq\,{\boldsymbol{\alpha}}\,\leq\,C}\,:\; W = -\boldsymbol{c}^{\top}{\boldsymbol{\alpha}} + \frac{1}{2} {\boldsymbol{\alpha}}^{\top} K {\boldsymbol{\alpha}} + \mu \left(\boldsymbol{a}^{\top} {\boldsymbol{\alpha}} + b\right)\;,$ (15.53)

where $ \boldsymbol{c}$ and $ \boldsymbol {a}$ are $ M\times 1$ vectors, $ K$ is the $ M\times M$ kernel matrix and $ b$ is a scalar. By defining the meaning of the abstract parameters  $ \boldsymbol {a}$, $ b$ and $ \boldsymbol{c}$ for the particular SVM problem at hand, one can use the same algorithmic structure for different SVM algorithms. In particular, for the standard support vector classifier ([71]), take $ \boldsymbol{c}=\boldsymbol{1}$, $ \boldsymbol{a}=\boldsymbol{y}$, $ b=0$ and the given regularization constant $ C$; the same definition applies to the $ \nu $-SVM ([58]) except that $ C= 1/(M\nu)$; for the one-class classifier ([56,67]), the parameters are defined as $ \boldsymbol{c} = \mathrm{diag}(K)$, $ \boldsymbol{a} = \boldsymbol{y}$ and $ b=-1$.
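These parameter choices can be collected in a small dispatch routine; the variant names are made up for illustration, and $ C = 1/(M\nu)$ follows the text with $ M$ read as the sample size:

import numpy as np

def abstract_parameters(variant, K, y, C=1.0, nu=0.5):
    """Choices of (c, a, b, C) in (15.53) for the SVM flavors in the text."""
    M = len(y)
    if variant == "classification":            # standard SV classifier
        return np.ones(M), y.astype(float), 0.0, C
    if variant == "nu-svm":                    # nu-SVM: same, but C = 1/(M nu)
        return np.ones(M), y.astype(float), 0.0, 1.0 / (M * nu)
    if variant == "one-class":                 # one-class classifier
        return np.diag(K).copy(), y.astype(float), -1.0, C
    raise ValueError(variant)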

Incremental SVM provides a procedure for adding one example to an existing optimal solution. When a new point $ k$ is added, its weight $ \alpha_k$ is initially set to $ 0$. Then the weights of the other points and $ \mu$ are updated, in order to obtain the optimal solution for the enlarged dataset. Likewise, when a point $ k$ is to be removed from the dataset, its weight is forced to $ 0$, while updating the weights of the remaining points and $ \mu$ so that the solution obtained with $ \alpha_k = 0$ is optimal for the reduced dataset. Online learning follows naturally from incremental/decremental learning: a new example is added while some old example is removed from the working set.

The basic principle of the incremental SVM ([16]) is that updates to the state of the example $ k$ should keep the remaining examples in their optimal state. In other words, the KKT conditions (15.31) expressed for the gradients $ g_i$:

$\displaystyle g_i = -c_i + K_{i,:}{\boldsymbol{\alpha}} + \mu a_i \begin{cases} \geq 0\;, & \text{if}\, \alpha_i = 0\;, \\ = 0\;, & \text{if} \,0 < \alpha_i < C\;, \\ \leq 0\;, & \text{if} \,\alpha_i = C\;, \end{cases}$ (15.54)

$\displaystyle \frac{\partial W}{\partial \mu}=\boldsymbol{a}^{\top}{\boldsymbol{\alpha}}+b=0\;,$ (15.55)

must be maintained for all the examples, except possibly for the current example $ k$. Let $ S$ denote the set of unbounded support vectors and $ R$ the set of other examples. In order to maintain the KKT conditions (15.54), one can express increments to the weights of the unbounded support vectors and to the gradients of other examples as a linear function of the increment of the current example's weight $ \Delta \alpha_k$:

$\displaystyle \Delta{\boldsymbol{\alpha}}_S = \boldsymbol{\beta}\, \Delta \alpha_k\;, \quad \Delta \boldsymbol{g}_R = \boldsymbol{\gamma}\, \Delta \alpha_k\;.$ (15.56)

The sensitivity relations (15.56) only make sense for a fixed composition of the sets $ S$ and $ R$. To proceed with learning, we need to detect the largest increment $ \Delta \alpha_k^{\text{max}}$ such that the sets $ S$ and $ R$ remain intact. After changing the SVM state according to the relations (15.56) evaluated for the increment $ \Delta \alpha_k^{\text{max}}$, the sensitivity vectors $ \boldsymbol{\beta}$ and $ \boldsymbol{\gamma}$ must be recomputed accordingly (depending on whether an example enters or leaves the set $ S$). For the details of this bookkeeping the reader is referred to ([16,66]). The iteration terminates when the current example satisfies the KKT conditions (15.54) as well.
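A hedged sketch of how the sensitivity vectors can be obtained: maintaining $ g_i = 0$ on $ S$ and the equality condition (15.55) under a change of $ \alpha_k$ yields a linear system in $ (\Delta\mu, \Delta{\boldsymbol{\alpha}}_S)$. The routine below re-solves that system directly for clarity; practical implementations instead update its inverse incrementally, and all names are illustrative:

import numpy as np

def sensitivities(K, a, S, R, k):
    # KKT system of the unbounded SVs; first row/column belongs to mu
    Q = np.block([[np.zeros((1, 1)), a[S][None, :]],
                  [a[S][:, None], K[np.ix_(S, S)]]])
    rhs = -np.concatenate(([a[k]], K[S, k]))
    beta = np.linalg.solve(Q, rhs)        # (d mu, d alpha_S) per unit d alpha_k
    # response of the gradients of the remaining examples, cf. (15.56)
    gamma = K[R, k] + a[R] * beta[0] + K[np.ix_(R, S)] @ beta[1:]
    return beta, gamma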

