10.3 Lagrangian Formulation of the SVM

Having introduced some elements of statistical learning and demonstrated the potential of SVMs for company rating, we can now give a Lagrangian formulation of the SVM for the linear classification problem and then generalize this approach to the nonlinear case.

Figure 10.3: The separating hyperplane $ { x}^\top { w}+b=0$ and the margin in a non-separable case.



In the linear case the following inequalities hold for all $ n$ points of the training set:

$\displaystyle { x}_i^\top { w}+b \geq 1-\xi _i \quad {\rm for}\quad y_i=1,$
$\displaystyle { x}_i^\top { w}+b \leq -1+\xi _i \quad {\rm for}\quad y_i=-1,$
$\displaystyle \xi _i \geq 0,$

which can be combined into two constraints:
    $\displaystyle y_i({ x}_i^\top { w}+b) \geq 1-\xi _i$ (10.9)
    $\displaystyle \xi _i \geq 0.$ (10.10)
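To make the slack variables concrete, the following sketch (in Python with NumPy; the toy data and the candidate hyperplane $ { w}$, $ b$ are illustrative assumptions, not values from the text) computes the smallest $ \xi _i$ satisfying (10.9) and (10.10), namely $ \xi _i=\max\{0,\,1-y_i({ x}_i^\top { w}+b)\}$:

\begin{verbatim}
import numpy as np

# Toy training set: rows are x_i, labels y_i in {-1, +1} (illustrative only).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-0.5, 0.2]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane x^T w + b = 0 (chosen arbitrarily here).
w = np.array([1.0, 1.0])
b = 0.0

# Smallest slack satisfying y_i(x_i^T w + b) >= 1 - xi_i and xi_i >= 0.
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)  # zero for points classified with margin at least 1, positive otherwise
\end{verbatim}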

The basic idea of SVM classification is to find a separating hyperplane that provides the largest possible margin between the points of the two classes, see Figure 10.3. Some penalty for misclassification must also be introduced. The classification error $ \xi_i$ is related to the distance from a misclassified point $ { x}_i$ to the canonical hyperplane bounding its class. If $ \xi_i>0$, an error in separating the two sets occurs. The objective function corresponding to penalized margin maximization is formulated as:

$\displaystyle \frac 12\left\Vert { w}\right\Vert ^2+C\left( \sum\limits_{i=1}^n\xi _i\right) ^\upsilon,$     (10.11)

where the parameter $ C$ characterizes the generalization ability of the machine and $ \upsilon \geq 1$ is a positive integer controlling the sensitivity of the machine to outliers. Minimizing the objective function subject to constraints (10.9) and (10.10) provides the highest possible margin when classification errors are inevitable due to the linearity of the separating hyperplane. Under such a formulation the problem is convex. One can show that margin maximization reduces the VC dimension.
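Continuing the sketch above (same illustrative arrays), the penalized objective (10.11) for $ \upsilon =1$ can be evaluated as:

\begin{verbatim}
def svm_objective(w, b, X, y, C=1.0):
    """Penalized margin objective (10.11) with upsilon = 1."""
    xi = np.maximum(0.0, 1.0 - y * (X @ w + b))  # slack variables
    return 0.5 * w @ w + C * np.sum(xi)

print(svm_objective(w, b, X, y))
\end{verbatim}

Minimizing this quantity trades off a wide margin (small $ \left\Vert { w}\right\Vert$) against the total classification error.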

The Lagrange functional for the primal problem for $ \upsilon =1$ is:

$\displaystyle L_P=\frac 12\left\Vert { w}\right\Vert ^2+C\sum\limits_{i=1}^n\xi _i-\sum\limits_{i=1}^n\alpha _i\{y_i\left( { x}_i^\top { w}+b\right) -1+\xi _i\} -\sum\limits_{i=1}^n\mu _i\xi _i,$     (10.12)

where $ \alpha _i\geq 0$ and $ \mu _i\geq 0$ are Lagrange multipliers. The primal problem is formulated as:
$\displaystyle \min_{w_k,b,\xi _i}\max_{\alpha _i,\mu _i}L_P.$
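Setting the derivatives of $ L_P$ with respect to the primal variables to zero gives the stationarity part of the Karush-Kuhn-Tucker conditions:

$\displaystyle \frac{\partial L_P}{\partial { w}}=0 \;\Rightarrow\; { w}=\sum\limits_{i=1}^n\alpha _iy_i{ x}_i, \qquad \frac{\partial L_P}{\partial b}=0 \;\Rightarrow\; \sum\limits_{i=1}^n\alpha _iy_i=0, \qquad \frac{\partial L_P}{\partial \xi _i}=0 \;\Rightarrow\; C-\alpha _i-\mu _i=0.$

The last condition, together with $ \mu _i\geq 0$, yields the box constraint $ 0\leq \alpha _i\leq C$ that appears in the dual problem below.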

After substituting the Karush-Kuhn-Tucker conditions (Gale et al.; 1951) into the primal Lagrangian, we derive the dual Lagrangian as:
$\displaystyle L_D=\sum\limits_{i=1}^n\alpha _i-\frac 12\sum\limits_{i=1}^n\sum\limits_{j=1}^n\alpha _i\alpha_jy_iy_j { x}_i^\top { x}_j,$     (10.13)

and the dual problem is posed as:
$\displaystyle \max_{\alpha _i}L_D,$      

subject to:
$\displaystyle 0 \leq \alpha _i\leq C,$      
$\displaystyle \sum\limits_{i=1}^n\alpha _iy_i = 0.$      
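As a numerical illustration, the dual problem can be handed to a general-purpose constrained optimizer; the sketch below uses SciPy's SLSQP routine (an assumption of this example, not a method prescribed by the text) and reuses the toy arrays from above:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def svm_dual_solve(X, y, C=1.0):
    """Maximize L_D (10.13) subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n = len(y)
    Q = np.outer(y, y) * (X @ X.T)                    # Q_ij = y_i y_j x_i^T x_j
    neg_LD = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimize -L_D
    grad   = lambda a: Q @ a - np.ones(n)
    cons   = {"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y.astype(float)}
    res = minimize(neg_LD, np.zeros(n), jac=grad, bounds=[(0.0, C)] * n,
                   constraints=[cons], method="SLSQP")
    return res.x

alpha = svm_dual_solve(X, y)   # alpha_i = 0 for non-support vectors
\end{verbatim}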

Those points $ { x}_i$ for which the inequality $ y_i({ x}_i^\top { w}+b)\leq 1$ holds are called support vectors. After training the support vector machine and deriving the Lagrange multipliers (which are equal to 0 for non-support vectors), one can classify a company described by the vector of parameters $ { x}$ using the classification rule:

$\displaystyle g({ x})={\rm sign}\left( { x}^\top { w}+b\right),$     (10.14)

where $ { w}=\sum_{i=1}^n\alpha _iy_i{ x}_i$ and $ b=-\frac 12\left( { x}_{+1}+{ x}_{-1}\right)^\top{ w}$. Here $ { x}_{+1}$ and $ { x}_{-1}$ are two support vectors belonging to different classes for which $ y_i({ x}_i^\top { w}+b)=1$. The value of the classification function (the score of a company) can be computed as
$\displaystyle f({ x})={ x}^\top { w}+b.$     (10.15)

Each value of $ f({ x})$ uniquely corresponds to a default probability (PD).
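A minimal continuation of the sketch (same assumptions as before) recovers $ { w}$ and $ b$ from the dual solution and evaluates the classification rule (10.14) and the score (10.15):

\begin{verbatim}
def svm_recover_w_b(alpha, X, y, C=1.0, tol=1e-6):
    """Recover w = sum_i alpha_i y_i x_i and b from two margin support vectors."""
    w = (alpha * y) @ X
    on_margin = (alpha > tol) & (alpha < C - tol)     # support vectors with xi_i = 0
    x_pos = X[on_margin & (y == 1)][0]                # one from each class
    x_neg = X[on_margin & (y == -1)][0]
    b = -0.5 * (x_pos + x_neg) @ w
    return w, b

w, b = svm_recover_w_b(alpha, X, y)
score = X @ w + b          # f(x) = x^T w + b, the score of each company
label = np.sign(score)     # classification rule g(x) = sign(f(x))
\end{verbatim}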

SVMs can also be easily generalized to the nonlinear case. It is worth noting that the training vectors enter the dual Lagrangian only through scalar products. This means that we can map the data via a transformation $ \Psi$ into a high-dimensional Hilbert feature space and use the linear algorithm there:

$\displaystyle \Psi :{\mathbb{R}}^d\mapsto {\mathbb{H}}.$     (10.16)
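In the feature space the linear machine operates on $ \Psi ({ x}_i)$, so the dual objective (10.13) and the score function involve the transformed data only through scalar products:

$\displaystyle L_D=\sum\limits_{i=1}^n\alpha _i-\frac 12\sum\limits_{i=1}^n\sum\limits_{j=1}^n\alpha _i\alpha _jy_iy_j\Psi ({ x}_i)^\top \Psi ({ x}_j), \qquad f({ x})=\sum\limits_{i=1}^n\alpha _iy_i\Psi ({ x}_i)^\top \Psi ({ x})+b.$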

If a kernel function $ K$ exists such that $ K({ x}_i,{ x}_j)=\Psi ({ x}_i)^\top \Psi ({ x}_j)$, then it can be used without knowing the transformation $ \Psi$ explicitly. A necessary and sufficient condition for a symmetric function $ K({ x}_i,{ x}_j)$ to be a kernel is given by Mercer's (1909) theorem. It requires positive definiteness, i.e. for any data set $ { x}_1,...,{ x}_n$ and any real numbers $ \lambda_1,...,\lambda_n $ the function $ K$ must satisfy

$\displaystyle \sum\limits_{i=1}^n\sum\limits_{j=1}^n\lambda _i\lambda _jK({ x}_i,{ x}_j)\geq 0.$     (10.17)
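As an illustration, condition (10.17) can be checked numerically for a given kernel and data set by verifying that the Gram matrix with entries $ K({ x}_i,{ x}_j)$ has no negative eigenvalues, since $ \sum_i\sum_j\lambda _i\lambda _jK({ x}_i,{ x}_j)=\lambda ^\top K\lambda$. The sketch below uses a Gaussian kernel purely as an example; the kernel choice and the random data are assumptions of the sketch:

\begin{verbatim}
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.default_rng(0).normal(size=(20, 3))  # arbitrary data set
K = gaussian_kernel(X)
eigvals = np.linalg.eigvalsh(K)                    # K is symmetric, eigenvalues are real
print(eigvals.min() >= -1e-10)                     # True: lambda^T K lambda >= 0 for all lambda
\end{verbatim}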

Some examples of kernel functions are: