# 10.3 Lagrangian Formulation of the SVM

Having introduced some elements of statistical learning and demonstrated the potential of SVMs for company rating we can now give a Lagrangian formulation of an SVM for the linear classification problem and generalize this approach to a nonlinear case.

In the linear case the following inequalities hold for all points of the training set:

which can be combined into two constraints:
 (10.9) (10.10)

The basic idea of the SVM classification is to find such a separating hyperplane that corresponds to the largest possible margin between the points of different classes, see Figure 10.3. Some penalty for misclassification must also be introduced. The classification error is related to the distance from a misclassified point to the canonical hyperplane bounding its class. If , an error in separating the two sets occurs. The objective function corresponding to penalized margin maximization is formulated as:

 (10.11)

where the parameter characterizes the generalization ability of the machine and is a positive integer controlling the sensitivity of the machine to outliers. The conditional minimization of the objective function with constraint (10.9) and (10.10) provides the highest possible margin in the case when classification errors are inevitable due to the linearity of the separating hyperplane. Under such a formulation the problem is convex. One can show that margin maximization reduces the VC dimension.

The Lagrange functional for the primal problem for is:

 (10.12)

where and are Lagrange multipliers. The primal problem is formulated as:

After substituting the Karush-Kuhn-Tucker conditions (Gale et al.; 1951) into the primal Lagrangian, we derive the dual Lagrangian as:
 (10.13)

and the dual problem is posed as:

subject to:

Those points for which the equation holds are called support vectors. After training the support vector machine and deriving Lagrange multipliers (they are equal to 0 for non-support vectors) one can classify a company described by the vector of parameters using the classification rule:

 (10.14)

where and . and are two support vectors belonging to different classes for which . The value of the classification function (the score of a company) can be computed as
 (10.15)

Each value of uniquely corresponds to a default probability (PD).

The SVMs can also be easily generalized to the nonlinear case. It is worth noting that all the training vectors appear in the dual Lagrangian formulation only as scalar products. This means that we can apply kernels to transform all the data into a high dimensional Hilbert feature space and use linear algorithms there:

 (10.16)

If a kernel function exists such that , then it can be used without knowing the transformation explicitly. A necessary and sufficient condition for a symmetric function to be a kernel is given by Mercer's (1909) theorem. It requires positive definiteness, i.e. for any data set and any real numbers the function must satisfy

 (10.17)

Some examples of kernel functions are:
• - the isotropic Gaussian kernel;
• - the stationary Gaussian kernel with an anisotropic radial basis; we will apply this kernel in our study taking equal to the variance matrix of the training set; is a constant;
• - the polynomial kernel;
• - the hyperbolic tangent kernel.