13.4 Statistical Validation Techniques

Having a large collection of distributions to choose from, we need to narrow our selection to a single model and a unique parameter estimate. A suitable class for the loss distribution can often be selected by comparing the shapes of the empirical and theoretical mean excess functions. The mean excess function, presented in Section 13.4.1, is based on the idea of conditioning a random variable on the event that it exceeds a certain level.

Once the distribution class is selected and the parameters are estimated using one of the available methods, the goodness of fit has to be tested. Probably the most natural approach consists of measuring the distance between the empirical and the fitted analytical distribution function. A group of statistics and tests based on this idea is discussed in Section 13.4.2. However, when using these tests we face the problem of comparing a discontinuous step function with a continuous non-decreasing curve: the two functions will always differ from each other in the vicinity of a step by at least half the size of the step. This problem can be overcome by integrating both distribution functions once, which leads to the so-called limited expected value function introduced in Section 13.4.3.


13.4.1 Mean Excess Function

For a claim amount random variable $ X$, the mean excess function or mean residual life function is the expected payment per claim on a policy with a fixed amount deductible of $ x$, where claims with amounts less than or equal to $ x$ are completely ignored:

$\displaystyle e(x) = \mathop{\textrm{E}}(X - x \vert X > x) = \frac{\int_x^\infty \left\{1-F(u)\right\}du}{1-F(x)}.$ (13.46)

In practice, the mean excess function $ e$ is estimated by $ \hat{e}_n$ based on a representative sample $ x_1,\ldots ,x_n$:

$\displaystyle \hat{e}_n(x)=\frac{\sum_{x_i>x} x_i }{\#\{i:\, x_i>x \}}-x.$ (13.47)

Note that in a financial risk management context, after switching from the right tail to the left tail, $ e(x)$ is referred to as the expected shortfall (Weron; 2004).
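The estimator (13.47) is straightforward to compute; the following is a minimal Python sketch, where the function name and the NumPy-based implementation are illustrative choices:

```python
import numpy as np

def mean_excess(sample, x):
    """Empirical mean excess function (13.47): average exceedance over x."""
    sample = np.asarray(sample)
    tail = sample[sample > x]        # observations exceeding the threshold x
    return tail.mean() - x           # sum of exceeding x_i / #{i: x_i > x} - x
```

Plotting mean_excess(sample, x) over a grid of thresholds (typically the order statistics of the sample) yields the empirical counterpart of the theoretical shapes discussed below.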

When considering the shapes of mean excess functions, the exponential distribution plays a central role. It has the memoryless property: whether or not the information $ X>x$ is given, the expected value of $ X-x$ is the same as if one started at $ x=0$ and calculated $ \textrm{E}(X)$. The mean excess function for the exponential distribution is therefore constant; in fact, one easily calculates that $ e(x)=1/\beta$ for all $ x>0$.
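Indeed, substituting the exponential tail $ 1-F(u)=e^{-\beta u}$ into (13.46) gives

$\displaystyle e(x) = \frac{\int_x^\infty e^{-\beta u}\,du}{e^{-\beta x}} = \frac{\beta^{-1}e^{-\beta x}}{e^{-\beta x}} = \frac{1}{\beta}.$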

If the distribution of $ X$ is heavier-tailed than the exponential distribution, the mean excess function ultimately increases; if it is lighter-tailed, $ e(x)$ ultimately decreases. Hence, the shape of $ e(x)$ provides important information on the sub-exponential or super-exponential nature of the tail of the distribution at hand.

Mean excess functions and first-order approximations to the tail can be derived from (13.46) for all of the distributions discussed in Section 13.3.
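For instance, for the exponential distribution with intensity $ \beta$ and for the Pareto distribution with cdf $ F(x)=1-\left(\frac{\lambda}{\lambda+x}\right)^{\alpha}$ (we assume this standard parametrization here), formula (13.46) yields

$\displaystyle e(x) = \frac{1}{\beta} \qquad\textrm{and}\qquad e(x) = \frac{\lambda+x}{\alpha-1}, \quad \alpha>1,$

respectively: a constant and a linearly increasing function of $ x$.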

Selected shapes are also sketched in Figure 13.6.

Figure 13.6: Left panel: Shapes of the mean excess function $ e(x)$ for the log-normal (green dashed line), gamma with $ \alpha <1$ (red dotted line), gamma with $ \alpha >1$ (black solid line) and a mixture of two exponential distributions (blue long-dashed line). Right panel: Shapes of the mean excess function $ e(x)$ for the Pareto (green dashed line), Burr (blue long-dashed line), Weibull with $ \tau <1$ (black solid line) and Weibull with $ \tau >1$ (red dotted line) distributions.


13.4.2 Tests Based on the Empirical Distribution Function

A statistic measuring the difference between the empirical distribution function $ F_n(x)$ and the fitted distribution function $ F(x)$, called an edf statistic, is based on the vertical distance between the two functions. This distance is usually measured either by a supremum or by a quadratic norm (D'Agostino and Stephens; 1986).

The best-known supremum statistic:

$\displaystyle D=\sup\limits_x\left\vert F_n(x)-F(x)\right\vert,$ (13.49)

is known as the Kolmogorov or Kolmogorov-Smirnov statistic. It can also be written in terms of two supremum statistics:

$\displaystyle D^+=\sup\limits_x\left\{F_n(x)-F(x)\right\} \quad \textrm{and} \quad D^-=\sup\limits_x\left\{F(x)-F_n(x)\right\},$

where the former is the largest vertical difference when $ F_n(x)$ is larger than $ F(x)$ and the latter is the largest vertical difference when it is smaller. The Kolmogorov statistic is then given by $ D=\max(D^+,D^-)$. A closely related statistic proposed by Kuiper is simply the sum of the two differences, i.e. $ V=D^+ + D^-$.

The second class of measures of discrepancy is given by the Cramér-von Mises family

$\displaystyle Q=n\int\limits_{-\infty}^{\infty}\left\{F_n(x)-F(x)\right\}^2\psi(x)\mathrm{d}F(x),$ (13.50)

where $ \psi(x)$ is a suitable function which assigns weights to the squared difference $ \left\{F_n(x)-F(x)\right\}^2$. When $ \psi(x)=1$ we obtain the $ W^2$ statistic of Cramér-von Mises. When $ \psi(x)=[F(x)\left\{1-F(x)\right\}]^{-1}$, formula (13.50) yields the $ A^2$ statistic of Anderson and Darling. To evaluate the statistics defined above, suitable computing formulas are needed. These can be derived by utilizing the transformation $ Z=F(X)$: when $ F(x)$ is the true distribution function of $ X$, the random variable $ Z$ is uniformly distributed on the unit interval.

Suppose that a sample $ x_1,\dots,x_n$ gives values $ z_i=F(x_i),\ i=1,\dots,n$. It can be easily shown that, for values $ z$ and $ x$ related by $ z=F(x)$, the corresponding vertical differences in the edf diagrams for $ X$ and for $ Z$ are equal. Consequently, edf statistics calculated from the empirical distribution function of the $ z_i$'s compared with the uniform distribution will take the same values as if they were calculated from the empirical distribution function of the $ x_i$'s, compared with $ F(x)$. This leads to the following formulas given in terms of the order statistics $ z_{(1)}< z_{(2)}<\dots<z_{(n)}$:

$\displaystyle D^+ = \max\limits_{1\leq i\leq n}\left\{\frac{i}{n}-z_{(i)}\right\},$ (13.51)
$\displaystyle D^- = \max\limits_{1\leq i\leq n}\left\{z_{(i)}-\frac{i-1}{n}\right\},$ (13.52)
$\displaystyle D = \max(D^+,D^-),$ (13.53)
$\displaystyle V = D^+ + D^-,$ (13.54)
$\displaystyle W^2 = \sum\limits_{i=1}^n\left\{z_{(i)}-\frac{2i-1}{2n}\right\}^2 + \frac{1}{12n},$ (13.55)


$\displaystyle A^2 = -n-\frac{1}{n}\sum\limits_{i=1}^n(2i-1)\left\{\log z_{(i)}+\log(1-z_{(n+1-i)})\right\}$ (13.56)
$\displaystyle \;\;\;\;= -n-\frac{1}{n}\sum\limits_{i=1}^n\left\{(2i-1)\log z_{(i)} + (2n+1-2i)\log(1-z_{(i)})\right\}.$ (13.57)
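These formulas translate directly into code. The following Python sketch (the function name and the NumPy-based implementation are illustrative choices) evaluates the four statistics $ D$, $ V$, $ W^2$ and $ A^2$ for a sample and a fitted cdf:

```python
import numpy as np

def edf_statistics(sample, cdf):
    """Evaluate the edf statistics (13.51)-(13.57) for a sample and a
    fitted cdf, via the probability integral transform z = F(x)."""
    z = np.sort(cdf(np.asarray(sample)))   # order statistics z_(1) <= ... <= z_(n)
    n, i = len(z), np.arange(1, len(z) + 1)
    d_plus = np.max(i / n - z)             # D+
    d_minus = np.max(z - (i - 1) / n)      # D-
    D = max(d_plus, d_minus)               # Kolmogorov-Smirnov
    V = d_plus + d_minus                   # Kuiper
    W2 = np.sum((z - (2 * i - 1) / (2 * n)) ** 2) + 1 / (12 * n)  # Cramer-von Mises
    A2 = -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1])))  # Anderson-Darling
    return {"D": D, "V": V, "W2": W2, "A2": A2}
```

For an exponential fit with estimated intensity beta_hat, for example, one would call edf_statistics(sample, lambda x: 1 - np.exp(-beta_hat * x)).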

The general test of fit is structured as follows. The null hypothesis is that a specific distribution is acceptable, whereas the alternative is that it is not:

$\displaystyle H_0:$   $\displaystyle F_n(x) = F(x;\theta),$  
$\displaystyle H_1:$   $\displaystyle F_n(x) \ne F(x;\theta),$  

where $ \theta$ is a vector of known parameters. Small values of the test statistic $ T$ are evidence in favor of the null hypothesis, while large ones indicate its falsity. To see how unlikely such a large outcome would be if the null hypothesis were true, we calculate the $ p$-value by:

$\displaystyle p\textrm{-value} = P(T \ge t),$ (13.58)

where $ t$ is the test value for a given sample. It is typical to reject the null hypothesis when a small $ p$-value is obtained.

However, we are in a situation where we want to test the hypothesis that the sample has a common distribution function $ F(x;\theta)$ with unknown $ \theta$. To employ any of the edf tests, we first need to estimate the parameters. It is important to recognize, however, that when the parameters are estimated from the data, the critical values for the tests of the uniform distribution (or, equivalently, of a fully specified distribution) must be reduced. In other words, if the value of the test statistic $ T$ is $ d$, then the $ p$-value is overestimated by $ P_{U}(T \ge d)$, where $ P_U$ indicates that the probability is computed under the assumption of a uniformly distributed sample. Hence, if $ P_{U}(T \ge d)$ is small, then the true $ p$-value is even smaller and the hypothesis is rejected. If it is large, however, a more accurate estimate of the $ p$-value is required.

Ross (2002) advocates the use of Monte Carlo simulations in this context. First, the parameter vector is estimated for the given sample of size $ n$, yielding $ \hat\theta$, and the edf test statistic is calculated assuming that the sample is distributed according to $ F(x;\hat\theta)$, returning a value of $ d$. Next, a sample of size $ n$ of $ F(x;\hat\theta)$-distributed variates is generated. The parameter vector is estimated for this simulated sample, yielding $ \hat\theta_1$, and the edf test statistic is calculated assuming that the sample is distributed according to $ F(x;\hat\theta_1)$. The simulation is repeated as many times as required to achieve a desired level of accuracy. The estimate of the $ p$-value is obtained as the proportion of times that the simulated test quantity is at least as large as $ d$.
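A minimal Python sketch of this procedure (the callable-based interface, with user-supplied fit, cdf, rvs and statistic functions, is an illustrative design choice, not part of the original text):

```python
import numpy as np

def mc_p_value(sample, fit, cdf, rvs, statistic, n_sim=1000, seed=None):
    """Monte Carlo p-value for an edf test with estimated parameters.
    fit(sample) estimates theta; cdf(x, theta) evaluates F(x; theta);
    rvs(theta, n, rng) simulates n variates from F(.; theta);
    statistic(sample, cdf) returns a scalar edf statistic."""
    rng = np.random.default_rng(seed)
    theta = fit(sample)
    d = statistic(sample, lambda u: cdf(u, theta))  # test value for the data
    exceed = 0
    for _ in range(n_sim):
        y = rvs(theta, len(sample), rng)            # simulate from F(.; theta_hat)
        theta_i = fit(y)                            # re-estimate on the simulated sample
        exceed += statistic(y, lambda u: cdf(u, theta_i)) >= d
    return exceed / n_sim                           # proportion of values >= d
```

Here statistic could be, for instance, lambda s, c: edf_statistics(s, c)["A2"] with edf_statistics from the sketch above.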

An alternative solution to the problem of unknown parameters was proposed by Stephens (1978). The half-sample approach consists of using only half of the data to estimate the parameters, but then using the entire data set to conduct the test. In this case, the critical values for the uniform distribution can be applied, at least asymptotically. The quadratic edf tests seem to converge fairly rapidly to their asymptotic distributions (D'Agostino and Stephens; 1986). Although the method is much faster than the Monte Carlo approach, it is not invariant: depending on the choice of the half-samples, different test values will be obtained, and there is no way of increasing the accuracy.

As a by-product, the edf tests supply us with a natural technique for estimating the parameter vector $ \theta$: we can simply find the $ \hat\theta^*$ that minimizes a selected edf statistic. Of the four statistics presented above ($ D$, $ V$, $ W^2$ and $ A^2$), $ A^2$ is the most powerful when the fitted distribution departs from the true distribution in the tails (D'Agostino and Stephens; 1986). Since the fit in the tails is of crucial importance in most actuarial applications, $ A^2$ is the recommended statistic for this estimation scheme.
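As a minimal sketch, the scheme can be implemented in Python for an exponential model as follows (the model choice and the log-parametrization keeping $ \beta>0$ are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def anderson_darling(z):
    """A^2 statistic (13.56) for sorted values z_i = F(x_i)."""
    n, i = len(z), np.arange(1, len(z) + 1)
    return -n - np.mean((2 * i - 1) * (np.log(z) + np.log(1 - z[::-1])))

def fit_exponential_a2(sample):
    """Minimum-distance estimate of the exponential intensity beta,
    obtained by minimizing A^2 over beta (illustrative model choice)."""
    x = np.sort(np.asarray(sample))
    objective = lambda log_beta: anderson_darling(1 - np.exp(-np.exp(log_beta) * x))
    res = minimize(objective, x0=[np.log(1 / x.mean())], method="Nelder-Mead")
    return np.exp(res.x[0])
```

The same pattern applies to any parametric family: replace the exponential cdf inside the objective and minimize over the relevant parameter vector.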


13.4.3 Limited Expected Value Function

The limited expected value function $ L$ of a claim size variable $ X$, or of the corresponding cdf $ F(x)$, is defined by

$\displaystyle L(x) = \textrm{E}\{\min(X,x)\} = \int_0^x ydF(y)+x\left\{1-F(x)\right\},\;\;x>0.$ (13.59)

The value of the function $ L$ at the point $ x$ is equal to the expected value of the random variable $ X$ censored (limited) at this point. In other words, it represents the expected amount per claim retained by the insured on a policy with a fixed amount deductible of $ x$. The empirical estimate is defined as follows:

$\displaystyle \hat{L}_n(x) = \frac{1}{n}\left(\sum_{x_j<x}x_j+\sum_{x_j\geq x}x\right).$ (13.60)
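Since $ \hat{L}_n(x)$ is just the sample mean of $ \min(x_j,x)$, the estimate (13.60) reduces to a one-line computation; a minimal Python sketch (the function name is an illustrative choice):

```python
import numpy as np

def lev_empirical(x, sample):
    """Empirical limited expected value function (13.60):
    the sample mean of min(x_j, x)."""
    return np.minimum(np.asarray(sample), x).mean()
```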

In order to fit the limited expected value function $ L$ of an analytical distribution to the observed data, the estimate $ \hat{L}_n$ is first constructed. Thereafter one tries to find a suitable analytical cdf $ F$, such that the corresponding limited expected value function $ L$ is as close to the observed $ \hat{L}_n$ as possible.

The limited expected value function has the following important properties:

  1. the graph of $ L$ is concave, continuous and increasing;
  2. $ L(x) \rightarrow E(X)$, as $ x\rightarrow\infty$;
  3. $ F(x) = 1 - L'(x)$, where $ L'(x)$ is the derivative of the function $ L$ at point $ x$; if $ F$ is discontinuous at $ x$, then the equality holds true for the right-hand derivative $ L'(x+)$.
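Indeed, property 3 follows by differentiating (13.59); assuming that $ F$ has a density $ f$,

$\displaystyle L'(x) = \frac{d}{dx}\left[\int_0^x y\,dF(y) + x\left\{1-F(x)\right\}\right] = xf(x) + 1 - F(x) - xf(x) = 1 - F(x).$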

A reason why the limited expected value function is a particularly suitable tool for our purposes is that it represents the claim size distribution in the monetary dimension. For example, we have $ L(\infty)= \textrm{E}(X)$ if it exists. The cdf $ F$, on the other hand, operates on the probability scale, i.e. takes values between 0 and $ 1$. Therefore, it is usually difficult to see, by looking only at $ F(x)$, how sensitive the price of the insurance (the premium) is to changes in the values of $ F$, while the limited expected value function shows immediately how different parts of the claim size cdf contribute to the premium (see Chapter 19 for information on various premium calculation principles). Apart from curve-fitting purposes, the function $ L$ will turn out to be a very useful concept in dealing with deductibles in Chapter 19. It is also worth mentioning that there exists a connection between the limited expected value function and the mean excess function:

$\displaystyle \textrm{E}(X) = L(x) + \textrm{P}(X>x) e(x).$ (13.61)
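The empirical counterparts $ \hat{L}_n$, $ \hat{e}_n$ and the empirical tail probability satisfy the same identity, which yields a simple numerical consistency check; a Python sketch (the simulated Pareto-type sample and the threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.0 + rng.pareto(3.0, size=100_000)   # illustrative heavy-tailed sample
t = 2.0                                   # illustrative threshold
lev = np.minimum(x, t).mean()             # L_hat_n(t), cf. (13.60)
tail = x[x > t]
e_hat = tail.mean() - t                   # e_hat_n(t), cf. (13.47)
print(x.mean(), lev + len(tail) / len(x) * e_hat)  # both sides of (13.61) agree
```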

The limited expected value function can be computed in closed form for all of the distributions considered in this chapter.
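For instance, integration by parts in (13.59) shows that $ L(x)=\int_0^x\left\{1-F(u)\right\}du$, so for the exponential distribution with intensity $ \beta$ and for the Pareto distribution with cdf $ F(x)=1-\left(\frac{\lambda}{\lambda+x}\right)^{\alpha}$ (again assuming this parametrization) one obtains

$\displaystyle L(x) = \frac{1}{\beta}\left(1-e^{-\beta x}\right) \qquad\textrm{and}\qquad L(x) = \frac{\lambda}{\alpha-1}\left\{1-\left(\frac{\lambda}{\lambda+x}\right)^{\alpha-1}\right\}, \quad \alpha\neq 1.$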

From the curve-fitting point of view, the use of the limited expected value function has the advantage, compared with the use of cdfs, that both the analytical function $ L$ and the corresponding observed function $ \hat{L}_n$, based on the observed discrete cdf, are continuous and concave, whereas the observed claim size cdf $ F_n$ is a discontinuous step function. Property (3) implies that the limited expected value function determines the corresponding cdf uniquely. When the limited expected value functions of two distributions are close to each other, not only are the mean values of the distributions close, but so are the whole distributions.