13.1 Introduction

The derivation of loss distributions from insurance data is not an easy task. Insurers normally keep data files containing detailed information about policies and claims, which are used for accounting and rate-making purposes. However, claim size distributions and other data needed for risk-theoretical analyses can usually be obtained only after tedious data preprocessing. Moreover, the claim statistics are often limited: data files containing detailed information about some policies and claims may be missing or corrupted. There may also be situations where prior data or experience are not available at all, e.g. when a new type of insurance is introduced or when very large special risks are insured. The distribution then has to be based on knowledge of similar risks or on extrapolation from smaller risks.

There are three basic approaches to deriving the loss distribution: empirical, analytical, and moment based. The empirical method, presented in Section 13.2, can be used only when large data sets are available. In such cases it yields a sufficiently smooth and accurate estimate of the cumulative distribution function (cdf). Sometimes it is beneficial to apply curve fitting techniques to smooth the empirical distribution function. If the fitted curve can be described by a function with a tractable analytical form, this approach becomes computationally efficient and similar to the second method.
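The empirical estimate of the cdf mentioned above assigns probability mass 1/n to each of the n observed claims. A minimal pure-Python sketch (the claim amounts below are hypothetical, chosen only for illustration):

```python
import bisect

def empirical_cdf(sample):
    """Return the empirical cdf F_n(x) = (# observations <= x) / n as a function."""
    data = sorted(sample)
    n = len(data)

    def F(x):
        # bisect_right counts the observations less than or equal to x
        return bisect.bisect_right(data, x) / n

    return F

# Hypothetical claim amounts, for illustration only
claims = [120.0, 350.0, 80.0, 410.0, 95.0, 230.0]
F = empirical_cdf(claims)
```

With a large sample, the resulting step function is already close to the underlying cdf; with scarce data it remains jagged, which is where the smoothing and analytical approaches discussed next become relevant.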

The analytical approach is probably the one most often used in practice and certainly the most frequently adopted in the actuarial literature. It reduces to finding a suitable analytical expression which fits the observed data well and which is easy to handle. Basic characteristics and estimation issues for the most popular and useful loss distributions are discussed in Section 13.3. Note that it may sometimes be helpful to subdivide the range of the claim size distribution into intervals for which different methods are employed. For example, small and medium size claims could be described by the empirical claim size distribution, while large claims, for which the scarcity of data rules out the empirical approach, could be modeled by an analytical loss distribution.
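As one concrete instance of the analytical approach, the lognormal distribution (one of the candidates covered in Section 13.3) admits closed-form maximum likelihood estimates: the MLE reduces to the sample mean and standard deviation of the logged claims. The sketch below uses the same hypothetical claim amounts as before and only the standard library; it illustrates the idea, not the chapter's specific fitting procedure:

```python
import math

def fit_lognormal(sample):
    """Closed-form MLE for a lognormal: mean and (1/n) std of the logged data."""
    logs = [math.log(x) for x in sample]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)  # 1/n: MLE variant
    return mu, sigma

def lognormal_cdf(x, mu, sigma):
    # F(x) = Phi((ln x - mu) / sigma), expressed via the error function
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical claim amounts, for illustration only
claims = [120.0, 350.0, 80.0, 410.0, 95.0, 230.0]
mu, sigma = fit_lognormal(claims)
```

Once the parameters are estimated, the fitted analytical cdf can be evaluated at any point, which is exactly the tractability that motivates this approach.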

In some applications the exact shape of the loss distribution is not required. We may then use the moment based approach, which consists of estimating only the lowest moments of the distribution, such as the mean and variance. However, it should be kept in mind that even the first three or four moments do not fully determine the shape of a distribution, so the fit to the observed data may be poor. Further details on the moment based approach can be found e.g. in Daykin, Pentikainen, and Pesonen (1994).
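To make the moment based idea concrete, the sketch below matches the sample mean m and variance s² to a lognormal distribution via the closed-form inversion σ² = ln(1 + s²/m²), μ = ln m − σ²/2. The choice of the lognormal as the target family and the data are illustrative assumptions, not prescriptions from the text:

```python
import math

def lognormal_moments_fit(sample):
    """Method of moments for a lognormal: invert mean/variance for (mu, sigma)."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / n
    sigma2 = math.log(1.0 + s2 / m ** 2)
    mu = math.log(m) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

# Hypothetical claim amounts, for illustration only
claims = [120.0, 350.0, 80.0, 410.0, 95.0, 230.0]
mu_mm, sigma_mm = lognormal_moments_fit(claims)
```

By construction the fitted lognormal reproduces the sample mean, exp(μ + σ²/2) = m, but, as noted above, matching two moments says nothing about the tail behaviour of the data.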

Having a large collection of distributions to choose from, we need to narrow the selection to a single model and a unique parameter estimate. The type of loss distribution can be preselected by comparing the shapes of the empirical and theoretical mean excess functions. Goodness-of-fit can be verified by plotting the corresponding limited expected value functions. Finally, the hypothesis that the modeled random event is governed by a certain loss distribution can be tested statistically. These statistical issues are thoroughly discussed in Section 13.4.
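The empirical mean excess function used for this preselection, e_n(u) = average of (X − u) over the observations exceeding the threshold u, is straightforward to compute. A minimal sketch on the same hypothetical claim data (an increasing, roughly linear plot of e_n(u) points toward a heavy, Pareto-type tail, while a decreasing one suggests lighter-tailed models):

```python
def mean_excess(sample, u):
    """Empirical mean excess e_n(u): average exceedance over threshold u."""
    exceedances = [x - u for x in sample if x > u]
    if not exceedances:
        return None  # no observations above the threshold
    return sum(exceedances) / len(exceedances)

# Hypothetical claim amounts, for illustration only
claims = [120.0, 350.0, 80.0, 410.0, 95.0, 230.0]
```

Evaluating e_n(u) over a grid of thresholds and plotting it against u gives the empirical counterpart to the theoretical mean excess curves compared in Section 13.4.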

In Section 13.5 we apply the presented tools to the modeling of real-world insurance data. The analysis is conducted for two datasets: (i) the PCS (Property Claim Services) dataset, covering losses resulting from catastrophic events in the USA between 1990 and 1999, and (ii) the Danish fire losses dataset, which concerns major fire losses that occurred between 1980 and 1990 and were recorded by Copenhagen Re.