2.1 Motivation and Derivation

Let $ X$ be a continuous random variable and $ f$ its probability density function (pdf). The pdf tells you ``how $ X$ is distributed''. From the pdf you can calculate the mean and variance of $ X$ (if they exist) and the probability that $ X$ will take on values in a certain interval. The pdf is thus very useful for characterizing the distribution of the random variable $ X$.

In practice, the pdf of some observable random variable $ X$ is in general unknown. All you have are $ n$ observations $ X_1, \ldots,X_n$ of $ X$, and your task is to use these $ n$ values to estimate $ f(x)$. We shall assume that the $ n$ observations are independent and that they all indeed come from the same distribution, namely $ f(x)$. That is, in this and the following chapters we will be concerned with estimating $ f(x)$ at a certain value $ x$ from independent and identically distributed (i.i.d.) data.

We will approach this estimation problem without assuming that $ f(x)$ has a known functional form up to some unknown parameter(s) that would need to be estimated. For instance, we do not assume that $ f(x)$ has the well-known form of the normal density with unknown parameters $ \mu$ and $ \sigma^2.$ We will instead focus on nonparametric ways of estimating $ f(x)$. The most commonly used nonparametric density estimator is the histogram.

2.1.1 Construction

Figure 2.1: Histogram for the stock returns data of Pagan & Schwert (1990) with binwidth $ h=0.02$ and origin $ x_{0}=0$ (quantlet SPMhistogram)

The construction of a histogram is fairly simple. Suppose you have a random sample $ X_{1},X_{2},\ldots,X_{n}$ from some unknown continuous distribution. Choose an origin $ x_{0}$ and a binwidth $ h$, and divide the real line into bins

$\displaystyle B_{j}=[x_{0}+(j-1)h,\, x_{0}+jh), \quad j\in\mathbb{Z}.$

Count how many observations fall into each bin and, over each bin, draw a bar whose height is the count divided by $ nh$.

More formally, the histogram is given by

$\displaystyle \widehat f_{h}(x)=\frac{1}{nh} \sum_{i=1}^n \sum_{j} \Ind(X_{i}\in B_{j}) \Ind(x\in B_{j}),$ (2.1)

where

$\displaystyle \Ind(X_{i}\in B_{j})=\left\{ \begin{array}{ll} 1 & \textrm{if } X_{i}\in B_{j}, \\ 0 & \textrm{otherwise}. \end{array} \right.$

Note that formula (2.1) (as well as its corresponding graph, the histogram) gives an estimate of $ f$ for all $ x$. Denote by $ m_j$ the center of the bin $ B_{j}$, so that $ B_{j}=[m_j-\frac{h}{2},m_j+\frac{h}{2})$. It is easy to see from formula (2.1) that the histogram assigns each $ x$ in $ B_{j}$ the same estimate for $ f$, namely $ \widehat f_{h}(m_j)$. This seems rather restrictive and inflexible, and later on we will see that there is a better alternative.
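
To make formula (2.1) concrete, here is a minimal Python sketch; the function name fhat, the toy sample, and the zero origin are our own choices for illustration, not part of the text:

```python
import math

def fhat(x, data, h, x0=0.0):
    """Histogram estimate (2.1) of the density at a point x.

    Every x in the same bin receives the same value: the number of
    observations falling into that bin, divided by n*h.
    """
    j = math.floor((x - x0) / h)  # index of the bin containing x (zero-based here)
    count = sum(1 for xi in data if math.floor((xi - x0) / h) == j)
    return count / (len(data) * h)

# Toy example: three of the six observations share the bin [0.2, 0.4)
sample = [0.1, 0.2, 0.22, 0.3, 0.45, 0.7]
print(fhat(0.25, sample, h=0.2))  # 3 / (6 * 0.2) = 2.5
```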

It can easily be verified that the total area under a histogram is indeed equal to one, a property that we certainly require of any reasonable estimator of a pdf. But we can give further motivation for viewing the histogram as an estimator of the pdf of a continuous distribution.
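
To spell out this verification using (2.1): each bin $ B_{j}$ has length $ h$, so $ \int \Ind(x\in B_{j})\,dx=h$, and every observation falls into exactly one bin, so $ \sum_{j}\Ind(X_{i}\in B_{j})=1$. Hence

$\displaystyle \int \widehat f_{h}(x)\,dx =\frac{1}{nh}\sum_{i=1}^n\sum_{j}\Ind(X_{i}\in B_{j})\int \Ind(x\in B_{j})\,dx =\frac{1}{nh}\sum_{i=1}^n\sum_{j}\Ind(X_{i}\in B_{j})\,h =\frac{1}{nh}\cdot n\cdot h=1.$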

2.1.2 Derivation

Consider Figure 2.2, where the pdf of a random variable $ X$ is graphed. The probability that an observation of $ X$ will fall into the bin $ [m_j-\frac{h}{2},m_j+\frac{h}{2})$ is given by

$\displaystyle P\left( X\in \left[m_j-\frac{h}{2},m_j+\frac{h}{2}\right) \right) =\int^{m_j+\frac{h}{2}}_{m_j-\frac{h}{2}} f(u)\,du$ (2.2)

which is just the shaded area under the density between $ m_j-\frac{h}{2}$ and $ m_j+\frac{h}{2}$. This area can be approximated by a bar with height $ f(m_j)$ and width $ h$ (see Figure 2.2). Thus we can write

$\displaystyle P\left( X\in \left[m_j-\frac{h}{2},m_j+\frac{h}{2}\right) \right) =\int^{m_j+\frac{h}{2}}_{m_j-\frac{h}{2}} f(u)\,du \approx f(m_j)\cdot h.$ (2.3)

Figure 2.2: Approximation of the area under the pdf over an interval by erecting a rectangle over the interval
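
As a quick numerical illustration of (2.3), with a standard normal pdf, $ m_j=0$, and $ h=0.5$ chosen by us for the example: the exact bin probability and the rectangle approximation are already close:

```python
import math

def norm_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

m_j, h = 0.0, 0.5
exact = norm_cdf(m_j + h / 2) - norm_cdf(m_j - h / 2)  # shaded area under f
approx = norm_pdf(m_j) * h                             # rectangle f(m_j) * h
print(exact, approx)  # 0.1974... versus 0.1994...; the gap vanishes as h -> 0
```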

A natural estimate of this probability is the relative frequency of observations in this interval

$\displaystyle P\left( X\in \left[m_j-\frac{h}{2},m_j+\frac{h}{2}\right) \right) \approx \frac{1}{n}\, \char93 \left\{X_{i}\in \left[m_j-\frac{h}{2}, m_j+\frac{h}{2}\right) \right\}.$ (2.4)

Combining (2.3) and (2.4) we get

$\displaystyle \widehat f_{h}(m_j) = \frac{1}{nh} \char93  \left\{X_{i}\in \left[m_j-\frac{h}{2},m_j+\frac{h}{2}\right) \right\}.$ (2.5)

(Here and in the following we use $ \char93 $ to denote the cardinality, i.e. the number of elements in a set.)
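
A small simulation, again with standard normal data of our own choosing standing in for an unknown distribution, shows (2.5) at work: the relative frequency of observations in a bin around $ m_j$, divided by $ h$, approaches $ f(m_j)$ as $ n$ grows:

```python
import math
import random

random.seed(1)
n, h, m_j = 10000, 0.2, 0.0
data = [random.gauss(0.0, 1.0) for _ in range(n)]

# Relative frequency in [m_j - h/2, m_j + h/2), divided by h, as in (2.5)
count = sum(1 for x in data if m_j - h / 2 <= x < m_j + h / 2)
fhat_mj = count / (n * h)

true_f = math.exp(-m_j ** 2 / 2) / math.sqrt(2 * math.pi)  # f(m_j) for N(0,1)
print(fhat_mj, true_f)  # close to 0.3989... for large n
```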

2.1.3 Varying the Binwidth

The subscript $ h$ of $ \widehat{f}_{h}(m_j)$ indicates that the estimate given by the histogram depends on the choice of the binwidth $ h$. Note that it also depends on the choice of the origin, even though this dependence is not reflected in the notation. The dependence of the histogram on the origin will be discussed in Section 2.3. To illustrate the effect of the choice of the binwidth on the shape of the histogram, consider Figure 2.3, where we have computed and displayed histograms for the stock returns data using different binwidths.

Figure 2.3: Four histograms for the stock returns data with binwidths $ h=0.007$, $ h=0.02$, $ h=0.05$, and $ h=0.1$; origin $ x_{0}=0$ (quantlet SPMhisdiffbin)

Clearly, if we increase $ h$ the histogram appears to be smoother, but without some reasonable criterion at hand it remains very difficult to say which binwidth provides the ``optimal'' degree of smoothness.
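
The effect is easy to reproduce. The following sketch uses simulated normal data as a stand-in for the returns series (which we do not reproduce here) and rebuilds the bin heights of (2.1) for the four binwidths of Figure 2.3:

```python
import random

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]  # stand-in for the returns
x0 = 0.0                                             # origin, as in the figures

for h in (0.007, 0.02, 0.05, 0.1):
    counts = {}  # bin j covers [x0 + j*h, x0 + (j+1)*h)
    for xi in data:
        j = int((xi - x0) // h)
        counts[j] = counts.get(j, 0) + 1
    heights = [c / (len(data) * h) for c in counts.values()]
    print(f"h={h}: {len(counts)} nonempty bins, tallest bar {max(heights):.2f}")
```

Larger binwidths produce fewer, wider bars and hence a smoother-looking estimate, exactly the pattern visible in Figure 2.3.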