10.3 Lagrangian Formulation of the SVM

Having introduced some elements of statistical learning and demonstrated the potential of SVMs for company rating, we can now give a Lagrangian formulation of an SVM for the linear classification problem and generalize this approach to the nonlinear case.

In the linear case the following inequalities hold for all points of the training set:

$$x_i^\top w + b \geq 1 - \xi_i \quad \text{for } y_i = 1,$$
$$x_i^\top w + b \leq -1 + \xi_i \quad \text{for } y_i = -1,$$
$$\xi_i \geq 0,$$

which can be combined into two constraints:

$$y_i\left(x_i^\top w + b\right) \geq 1 - \xi_i, \qquad (10.9)$$
$$\xi_i \geq 0. \qquad (10.10)$$

The basic idea of SVM classification is to find a separating
hyperplane that corresponds to the largest possible margin between the points
of different classes, see Figure 10.3. Some penalty for
misclassification must also be introduced. The classification error is
related to the distance from a misclassified point to the canonical
hyperplane bounding its class. If $\xi_i > 1$, an error in separating the two
sets occurs. The objective function corresponding to penalized margin
maximization is formulated as:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i^{\upsilon}, \qquad (10.11)$$

where the parameter $C$ characterizes the generalization ability of the machine and $\upsilon$ is a positive integer controlling the sensitivity of the machine to outliers. The conditional minimization of the objective function under constraints (10.9) and (10.10) provides the highest possible margin in the case when classification errors are inevitable due to the linearity of the separating hyperplane. Under such a formulation the problem is convex. One can show that margin maximization reduces the VC dimension.
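For a fixed candidate hyperplane, each slack variable $\xi_i$ can be computed directly from constraint (10.9) as a hinge term. The following numpy sketch (the toy points and the hyperplane $(w, b)$ are our own illustrative assumptions, not data from the study) shows how $\xi_i > 1$ corresponds to a misclassified point:

```python
import numpy as np

# Hypothetical canonical hyperplane w^T x + b and toy points of class y = +1.
w, b = np.array([1.0, 1.0]), -3.0
X = np.array([[3.0, 2.0],   # correctly classified, outside the margin: xi = 0
              [2.5, 1.0],   # inside the margin: 0 < xi <= 1
              [0.5, 0.5]])  # on the wrong side: xi > 1, a classification error
y = np.array([1.0, 1.0, 1.0])

# Slack variables from constraint (10.9): xi_i = max(0, 1 - y_i (x_i^T w + b)).
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)  # [0.  0.5 3. ]

# Penalty term of the objective (10.11) with upsilon = 1.
C = 1.0
penalty = C * np.sum(xi)
```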

The Lagrange functional for the primal problem for $\upsilon = 1$ is:

$$L_P = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left\{y_i\left(x_i^\top w + b\right) - 1 + \xi_i\right\} - \sum_{i=1}^{n}\mu_i\xi_i, \qquad (10.12)$$

where $\alpha_i \geq 0$ and $\mu_i \geq 0$ are Lagrange multipliers. The primal problem is formulated as:

$$\min_{w,\,b,\,\xi_i}\ \max_{\alpha_i,\,\mu_i}\ L_P.$$

After substituting the Karush-Kuhn-Tucker conditions (Gale et al., 1951) into the primal Lagrangian, we derive the dual Lagrangian as:

$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^\top x_j, \qquad (10.13)$$

and the dual problem is posed as:

$$\max_{\alpha_i}\ L_D$$

subject to:

$$0 \leq \alpha_i \leq C, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0.$$
Those points $i$ for which the equation $y_i\left(x_i^\top w + b\right) = 1 - \xi_i$ holds
are called support vectors. After training the support vector machine and
deriving the Lagrange multipliers $\alpha_i$ (they are equal to 0 for non-support vectors) one
can classify a company described by the vector of parameters $x$ using the
classification rule:

$$g(x) = \mathrm{sign}\left(x^\top w + b\right), \qquad (10.14)$$

where $w = \sum_{i=1}^{n}\alpha_i y_i x_i$ and $b = -\frac{1}{2}\left(x_{+1} + x_{-1}\right)^\top w$. Here $x_{+1}$ and $x_{-1}$ are two support vectors belonging to different classes for which $y\left(x^\top w + b\right) = 1$. The value of the classification function (the score of a company) can be computed as

$$f(x) = x^\top w + b. \qquad (10.15)$$

Each value of $f(x)$ uniquely corresponds to a default probability (PD).
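For a minimal worked instance of the classification rule and the score, consider a hypothetical training set with a single support vector per class (all numbers are our own illustration; the multipliers $\alpha_i = 0.25$ can be checked against the dual constraint $\sum_i \alpha_i y_i = 0$ and the expansion of $w$):

```python
import numpy as np

# One support vector per class (hypothetical toy data).
x_pos = np.array([1.0, 1.0])    # y = +1
x_neg = np.array([-1.0, -1.0])  # y = -1
alpha = 0.25                    # equal multipliers, since sum_i alpha_i y_i = 0

# w = sum_i alpha_i y_i x_i and b = -(1/2)(x_{+1} + x_{-1})^T w
w = alpha * x_pos - alpha * x_neg
b = -0.5 * (x_pos + x_neg) @ w

x_new = np.array([2.0, 0.5])    # parameter vector of a company to be rated
score = x_new @ w + b           # classification function f(x), eq. (10.15)
label = np.sign(score)          # classification rule g(x), eq. (10.14)
print(score, label)             # 1.25 1.0
```

Both training points lie exactly on their canonical hyperplanes, $y\left(x^\top w + b\right) = 1$, as support vectors must.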

SVMs can also be easily generalized to the nonlinear case. It is worth
noting that the training vectors appear in the dual Lagrangian formulation
only as scalar products. This means that we can map the data into a
high-dimensional Hilbert feature space and use the linear algorithms there,
replacing each scalar product according to:

$$x_i^\top x_j \to \Psi(x_i)^\top \Psi(x_j). \qquad (10.16)$$

If a kernel function $K$ exists such that
$K(x_i, x_j) = \Psi(x_i)^\top \Psi(x_j)$, then it can be used without knowing the transformation
$\Psi$ explicitly. A necessary and sufficient condition for a symmetric function
$K(x_i, x_j)$ to be a kernel is given by Mercer's (1909) theorem.
It requires positive definiteness, i.e. for any data set
$x_1, \ldots, x_n$ and any real numbers
$\lambda_1, \ldots, \lambda_n$ the function $K$ must satisfy

$$\sum_{i=1}^{n}\sum_{j=1}^{n}\lambda_i\lambda_j K(x_i, x_j) \geq 0. \qquad (10.17)$$

Some examples of kernel functions are:

- $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$ - the isotropic Gaussian kernel;
- $K(x_i, x_j) = \exp\left\{-(x_i - x_j)^\top r^{-2}\Sigma^{-1}(x_i - x_j)/2\right\}$ - the stationary Gaussian kernel with an anisotropic radial basis; we will apply this kernel in our study taking $\Sigma$ equal to the variance matrix of the training set; $r$ is a constant;
- $K(x_i, x_j) = \left(x_i^\top x_j + 1\right)^p$ - the polynomial kernel;
- $K(x_i, x_j) = \tanh\left(k\, x_i^\top x_j - \delta\right)$ - the hyperbolic tangent kernel.
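A minimal numpy sketch of the first three kernels above, together with a numerical check of Mercer's condition (10.17) on a random data set (function names, sample sizes, and parameter values are our own assumptions). Condition (10.17) is equivalent to the Gram matrix being positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # toy data set x_1, ..., x_n

def gaussian_iso(xi, xj, sigma=1.0):
    # isotropic Gaussian kernel
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def gaussian_aniso(xi, xj, Sigma_inv, r=1.0):
    # anisotropic Gaussian kernel; Sigma is the variance matrix of the data
    d = xi - xj
    return np.exp(-(d @ Sigma_inv @ d) / (2.0 * r ** 2))

def polynomial(xi, xj, p=3):
    # polynomial kernel of degree p
    return (xi @ xj + 1.0) ** p

def gram(kernel, X, **params):
    n = len(X)
    return np.array([[kernel(X[i], X[j], **params) for j in range(n)]
                     for i in range(n)])

Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
for K in (gram(gaussian_iso, X),
          gram(gaussian_aniso, X, Sigma_inv=Sigma_inv),
          gram(polynomial, X)):
    # Mercer's condition (10.17): sum_{i,j} lambda_i lambda_j K(x_i, x_j) >= 0
    # for all lambda, i.e. the Gram matrix is positive semi-definite
    # (all eigenvalues nonnegative, up to numerical noise).
    assert np.linalg.eigvalsh(K).min() > -1e-8
```

The hyperbolic tangent kernel is omitted from the check because it is known to violate positive definiteness for some values of $k$ and $\delta$.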