10.1 Bankruptcy Analysis Methodology

Although the earliest works in bankruptcy analysis were published as early as the 19th century (Dev; 1974), statistical techniques were not introduced into the field until the publications of Beaver (1966) and Altman (1968). Demand from financial institutions for investment risk estimation stimulated subsequent research. However, despite substantial interest, the accuracy of corporate default predictions remained much lower than in the private loan sector, largely due to the small number of observed corporate bankruptcies.

Since then, the situation in bankruptcy analysis has changed dramatically. Larger data sets, in which the median number of failing companies exceeds 1000, have become available; twenty years ago the median was around 40 companies, and statistically significant inferences often could not be drawn. The spread of computer technologies and advances in statistical learning techniques have made it possible to identify more complex data structures, and basic methods are no longer adequate for analysing these expanded data sets. Demand for advanced methods of controlling and measuring default risks has increased rapidly in anticipation of the adoption of the New Basel Capital Accord (BCBS; 2003). The Accord emphasises the importance of risk management and encourages improvements in financial institutions' risk assessment capabilities.

In order to estimate investment risks, one needs to evaluate the default probability (PD) of a company. Each company is described by a set of variables (predictors) $ { x}$, such as financial ratios, and by its class $ y$, which is either $ y=-1$ (`successful') or $ y=1$ (`bankrupt'). First, an unknown classifier function $ f:{ x} \rightarrow y$ is estimated on a training set of companies $ ({ x}_i,y_i)$, $ i=1,...,n$, i.e. companies that are known to have survived or gone bankrupt. The estimated $ f$ is then applied to compute default probabilities, which can be uniquely translated into a company rating.
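For illustration only, a minimal Python sketch of this scheme is given below. It estimates a simple classifier $ f$ (here a logistic regression, which is just one possible choice) on a synthetic training set and then converts its scores into default probabilities; the data and all variable names are hypothetical and serve only to show the mechanics.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# synthetic training set (x_i, y_i), i = 1,...,n: ratios and class labels
n, d = 200, 3
X_train = rng.normal(size=(n, d))                     # financial ratios x_i
y_train = np.where(X_train[:, 0] - X_train[:, 1]
                   + rng.normal(size=n) > 0, 1, -1)   # 1: 'bankrupt', -1: 'successful'

# estimate the classifier function f on the training set
f = LogisticRegression().fit(X_train, y_train)

# apply f to new companies: PD = estimated probability of the class y = 1
X_new = rng.normal(size=(5, d))
pd_new = f.predict_proba(X_new)[:, list(f.classes_).index(1)]
print(pd_new)
\end{verbatim}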

The importance of financial ratios for company analysis has been known for more than a century. Among the first researchers to apply financial ratios to bankruptcy prediction were Ramser (1931), Fitzpatrick (1932) and Winakor and Smith (1935). However, it was not until the publications of Beaver (1966) and Altman (1968) and the introduction of univariate and multivariate discriminant analysis that the systematic application of statistics to bankruptcy analysis began. Altman's linear Z-score model became the standard for a decade to come and is still widely used today due to its simplicity. However, its underlying assumption that the financial ratios of both failing and successful companies follow normal distributions with identical covariance matrices has been justly criticized. This approach was further developed by Deakin (1972) and Altman et al. (1977).

Later on, the center of research shifted towards the logit and probit models. The original works of Martin (1977) and Ohlson (1980) were followed by (Wiginton; 1980), (Zavgren; 1983) and (Zmijewski; 1984). Other statistical methods applied to bankruptcy analysis include the gambler's ruin model (Wilcox; 1971), option pricing theory (Merton; 1974), recursive partitioning (Frydman et al.; 1985), neural networks (Tam and Kiang; 1992) and rough sets (Dimitras et al.; 1999), to name a few.

There are three main types of models used in bankruptcy analysis. The first type comprises structural, or parametric, models, e.g. the option pricing model, logit and probit regressions and discriminant analysis. These models assume that the relationship between the input and output parameters can be described a priori; apart from their fixed structure they are fully determined by a set of parameters, and solving them requires estimating these parameters on a training set.

Although structural models provide a very clear interpretation of the modelled processes, their rigid structure is often not flexible enough to capture information from the data. Non-structural, or nonparametric, models (e.g. neural networks or genetic algorithms) are more flexible in describing data: they do not impose very strict limitations on the classifier function, but they usually do not provide a clear interpretation either.

Between the structural and non-structural models lies the class of semiparametric models. These models, like the RiskCalc private company rating model developed by Moody's, are based on an underlying structural model, but all or some of the predictors enter this structural model after a nonparametric transformation. In recent years the focus of research has shifted towards non-structural and semiparametric models, since they are more flexible and better suited for practical purposes than purely structural ones.

Statistical models for corporate default prediction are of practical importance. For example, the corporate bond ratings published regularly by rating agencies such as Moody's or S&P correspond directly to company default probabilities that are estimated to a large extent statistically. Moody's RiskCalc model is essentially a probit regression estimating the cumulative default probability over a number of years using a linear combination of non-parametrically transformed predictors (Falkenstein; 2000). These non-linear transformations $ f_1$, $ f_2$, ..., $ f_d$ are estimated on univariate models. As a result, the original probit model:

$\displaystyle E[y_{i,t}\vert{ x}_{i,t}] = \Phi \left( \beta _1x_{i1,t}+\beta _2x_{i2,t}+...+\beta _dx_{id,t}\right),$     (10.1)

is converted into:
$\displaystyle E[y_{i,t}\vert{ x}_{i,t}] = \Phi \{ \beta _1f_1(x_{i1,t})+\beta_2 f_2(x_{i2,t}) +...+ \beta _df_d(x_{id,t})\},$     (10.2)

where $ y_{i,t}$ indicates whether company $ i$ defaults within the prediction horizon starting at time $ t$, so that $ E[y_{i,t}\vert{ x}_{i,t}]$ is the cumulative default probability over that horizon. Although modifications of traditional methods like probit analysis extend their applicability, it is more desirable to base our methodology on general ideas of statistical learning theory without making many restrictive assumptions.
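Before turning to that methodology, a rough Python sketch of the two-step idea in (10.2) is given below; it is not Moody's actual RiskCalc procedure. Each predictor is first passed through a crude univariate non-parametric transform, here simply the empirical default frequency of its quantile bin (an assumption made purely for illustration), and a probit regression is then fitted to the transformed predictors; the synthetic data and all names are hypothetical.

\begin{verbatim}
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# synthetic data (assumption): d financial ratios and a 0/1 default flag
n, d = 500, 3
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))).astype(int)

def bin_transform(x, y, n_bins=10):
    """Crude univariate transform f_j: map each value of a ratio to the
    empirical default frequency of its quantile bin (illustration only)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    freq = np.array([y[idx == b].mean() if np.any(idx == b) else y.mean()
                     for b in range(n_bins)])
    return freq[idx]

# non-parametrically transformed predictors f_1(x_1), ..., f_d(x_d)
Xt = np.column_stack([bin_transform(X[:, j], y) for j in range(d)])

# probit regression on the transformed predictors, cf. (10.2)
model = sm.Probit(y, sm.add_constant(Xt)).fit(disp=0)
pd_hat = model.predict(sm.add_constant(Xt))   # estimated default probabilities
print(model.params)
\end{verbatim}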

The ideal classification machine, which applies a classifier function $ f$ from the available set of functions $ {\mathcal F}$, is based on the so-called expected risk minimization principle. The expected risk

$\displaystyle R\left( f\right) =\int \frac 12\left\vert f({ x})-y\right\vert dP({ x},y),$     (10.3)

is computed under the distribution $ P({ x},y)$, which is assumed to be known. In practical applications, however, this is never the case, and the distribution must itself be estimated from the training set $ ({ x}_i,y_i)$, $ i=1,2,...,n$, leading to an ill-posed problem (Tikhonov and Arsenin; 1977).

In most methods implemented in today's statistical packages, this problem is solved by applying another principle, that of empirical risk minimization, i.e. risk minimization over the training set of companies, even when the training set is not representative. The empirical risk, defined as:

$\displaystyle \hat{R}\left( f\right) =\frac 1n\sum\limits_{i=1}^n\frac 12\left\vert f({ x}_i)-y_i\right\vert,$     (10.4)

is nothing but the average loss over the training set, while the expected risk is the expected value of the loss under the true probability measure. For i.i.d. observations the loss is given by:

\begin{displaymath}
\frac 12\left\vert f({ x})-y\right\vert = \bigg\{
\begin{array}{ll}
0, & \textrm{if classification is correct,}\\
1, & \textrm{if classification is wrong.}
\end{array}\end{displaymath}
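A few lines of Python make the computation of (10.4) explicit; the linear indicator function, its weights and the toy data below are arbitrary assumptions chosen only to show how the average 0/1 loss over a training set is obtained.

\begin{verbatim}
import numpy as np

def empirical_risk(f, X, y):
    """Average 0/1 loss, cf. (10.4); y_i in {-1, +1}, f returns -1 or +1."""
    return np.mean(0.5 * np.abs(f(X) - y))

# example: a linear indicator function g(x) = sign(x'w + b), weights assumed
w, b = np.array([1.0, -0.5]), 0.2
g = lambda X: np.sign(X @ w + b)

X = np.array([[0.3, 1.2], [-1.0, 0.4], [2.1, -0.3]])   # toy 'companies'
y = np.array([1, -1, 1])
print(empirical_risk(g, X, y))   # fraction of misclassified companies
\end{verbatim}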

The solutions to the problems of expected and empirical risk minimization:

$\displaystyle f_{opt} = \arg \min_{f\in {\mathcal F}} R\left( f\right),$     (10.5)
$\displaystyle \hat{f}_n = \arg \min_{f\in {\mathcal F}} \hat{R}\left( f\right),$     (10.6)

generally do not coincide (Figure 10.1), although they converge to each other as $ n\rightarrow \infty$ if $ {\mathcal F}$ is not too large.

Figure 10.1: The minima $ f_{opt}$ and $ \hat{f}_n$ of the expected ($ R$) and empirical ($ \hat{R}$) risk functions generally do not coincide.
\includegraphics[width=0.92\defpicwidth]{_func.ps}
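The gap shown in Figure 10.1 can be reproduced numerically. The sketch below is a constructed example: it minimizes the empirical risk of a simple threshold classifier on a small training sample and compares the result with the minimizer of the risk approximated on a very large sample from the same distribution; the data-generating model and the threshold family are assumptions made for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Toy model (assumed): one ratio x, P(y = 1 | x) = 1 / (1 + exp(-3x))."""
    x = rng.normal(size=n)
    y = np.where(rng.random(n) < 1 / (1 + np.exp(-3 * x)), 1, -1)
    return x, y

def risk(t, x, y):
    """Average 0/1 loss of the threshold classifier f(x) = sign(x - t)."""
    return np.mean(0.5 * np.abs(np.sign(x - t) - y))

x_train, y_train = sample(50)         # small training set
x_test,  y_test  = sample(200_000)    # large sample as a proxy for R(f)

grid = np.linspace(-2, 2, 401)
t_hat = grid[int(np.argmin([risk(t, x_train, y_train) for t in grid]))]
t_opt = grid[int(np.argmin([risk(t, x_test,  y_test)  for t in grid]))]
print(t_hat, t_opt)                   # the two minimizers generally differ
\end{verbatim}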

We cannot minimize the expected risk directly since the distribution $ P({ x},y)$ is unknown. However, according to statistical learning theory (Vapnik; 1995), it is possible to estimate the Vapnik-Chervonenkis (VC) bound, which holds with probability of at least $ 1-\eta$:

$\displaystyle R\left( f\right) \leq \hat{R}\left( f\right) +\phi \left( \frac hn,\frac{\ln (\eta )}n\right).$     (10.7)

For a linear indicator function $ g(x)={\rm sign}(x^\top w+b)$:
$\displaystyle \phi \left( \frac hn,\frac{\ln (\eta )}n\right) = \sqrt{\frac{h\left( \ln \frac{2n}h\right) -\ln \frac \eta 4}n},$     (10.8)

where $ h$ is the VC dimension.
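The penalty term (10.8) is easy to evaluate numerically. The short sketch below, with $ \eta=0.05$ and the sample sizes chosen arbitrarily, shows how the penalty shrinks as the number of observations $ n$ grows for a fixed VC dimension; for linear indicator functions in $ {\mathbb{R}}^d$ the VC dimension is $ h=d+1$, in line with the two-dimensional example discussed next.

\begin{verbatim}
import numpy as np

def vc_penalty(h, n, eta=0.05):
    """Penalty term phi(h/n, ln(eta)/n) of the VC bound, cf. (10.8)."""
    return np.sqrt((h * np.log(2 * n / h) - np.log(eta / 4)) / n)

# h = d + 1 for linear indicator functions; here d = 3 ratios (assumed)
for n in (100, 1000, 10000, 100000):
    print(n, round(float(vc_penalty(h=4, n=n)), 3))
\end{verbatim}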

The VC dimension of the function set $ {\mathcal F}$ in a $ d$-dimensional space is $ h$ if functions from $ {\mathcal F}$ can shatter some set of $ h$ objects $ \left\{ { x}_i \in {\mathbb{R}}^d, i=1,...,h\right\}$, i.e. separate them in all $ 2^h$ possible ways, and no set $ \left\{ { x}_j \in {\mathbb{R}}^d, j=1,...,q\right\}$ with $ q>h$ has this property. For example, three points on a plane ($ d=2$) can be shattered by linear indicator functions in $ 2^3=8$ ways, whereas 4 points cannot be shattered in all $ 2^4=16$ ways. Thus, the VC dimension of the set of linear indicator functions in a two-dimensional space is three, see Figure 10.2.

Figure 10.2: Eight possible ways of shattering 3 points on the plane with a linear indicator function.
\includegraphics[width=1.21\defpicwidth]{_ex1.ps}
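The shattering argument behind Figure 10.2 can also be verified by brute force. The sketch below is a constructed check, not part of the original text: it enumerates all $ \pm 1$ labelings of a point set and tests each for linear separability by solving a linear-programming feasibility problem; the point coordinates are assumptions, and for the four points the labeling that fails is the XOR pattern.

\begin{verbatim}
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """True if the labels y in {-1,+1} are realizable by a linear indicator
    function sign(w'x + b); checked as an LP feasibility problem."""
    n, d = X.shape
    A = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))   # rows -y_i*(x_i, 1)
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0

def shatterable(X):
    """True if every +/-1 labeling of the rows of X is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in product([-1, 1], repeat=len(X)))

three = np.array([[0., 0.], [1., 0.], [0., 1.]])
four  = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(shatterable(three))   # True:  all 2^3 = 8 labelings are realizable
print(shatterable(four))    # False: e.g. the XOR labeling is not
\end{verbatim}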

The expression for the VC bound (10.7) is a regularized functional in which the VC dimension $ h$ is a parameter controlling the complexity of the classifier function. The term $ \phi \left( \frac hn,\frac{\ln (\eta )}n\right)$ introduces a penalty for excessive complexity of the classifier function. There is a trade-off between the number of classification errors on the training set and the complexity of the classifier function: if complexity were not controlled, it would be possible to find a classifier function that makes no classification errors on the training set, regardless of how low its generalization ability might be.
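A stylized numerical illustration of this trade-off follows; the training errors below are simply assumed to decrease as the function class grows richer (they are not computed from any data), and the right-hand side of (10.7) is evaluated for increasing $ h$ at a fixed $ n$.

\begin{verbatim}
import numpy as np

def vc_bound(train_err, h, n, eta=0.05):
    """Right-hand side of (10.7): empirical risk plus the penalty (10.8)."""
    return train_err + np.sqrt((h * np.log(2 * n / h) - np.log(eta / 4)) / n)

n = 1000
h_values   = np.array([2, 5, 10, 50, 200, 500])             # growing complexity
train_errs = np.array([0.30, 0.22, 0.18, 0.10, 0.04, 0.0])  # assumed, stylized

for h, e in zip(h_values, train_errs):
    print(h, round(float(vc_bound(e, h, n)), 3))   # bound first falls, then rises
\end{verbatim}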