Let $X$ be a continuous random variable and $f$ its probability density function (pdf). The pdf tells you ``how $X$ is distributed''. From the pdf you can calculate the mean and variance of $X$ (if they exist) and the probability that $X$ will take on values in a certain interval. The pdf is thus very useful for characterizing the distribution of the random variable $X$.
In practice, the pdf of an observable random variable $X$ is in general unknown. All you have are $n$ observations $X_1, \ldots, X_n$ of $X$, and your task is to use these $n$ values to estimate $f(x)$. We shall assume that the $n$ observations are independent and that they all indeed come from the same distribution, namely the one with pdf $f$. That is, in this and the next chapters we will be concerned with estimating $f(x)$ at a certain value $x$ from i.i.d. (independent and identically distributed) data.
We will approach this estimation problem without assuming that $f$ has some known functional form except for some unknown parameter(s) that need to be estimated. For instance, we do not assume that $f$ has the well-known form of the normal distribution with unknown parameters $\mu$ and $\sigma^2$. We will focus on nonparametric ways of estimating $f$ instead. The most commonly used nonparametric density estimator is the histogram.
The construction of a histogram is fairly simple. Suppose you have a random sample $X_1, \ldots, X_n$ from some unknown continuous distribution.
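As a minimal sketch of one standard construction, assume the real line is cut into bins of equal width $h$ starting from an origin $x_0$, and the estimate at $x$ is the relative frequency of observations in the bin containing $x$, divided by $h$. The function name `histogram_estimate`, the zero-based bin index, and the toy sample are assumptions made for illustration only.

```python
import math

def histogram_estimate(x, sample, h, x0=0.0):
    """Histogram density estimate at x for binwidth h and origin x0.

    Bins are assumed to be [x0 + j*h, x0 + (j+1)*h) for integer j.
    """
    if h <= 0:
        raise ValueError("binwidth h must be positive")
    # Index of the bin that contains x.
    j = math.floor((x - x0) / h)
    # Number of observations that fall into the same bin as x.
    count = sum(1 for xi in sample if math.floor((xi - x0) / h) == j)
    # Relative frequency divided by the binwidth.
    return count / (len(sample) * h)

# Toy data, purely illustrative.
sample = [0.1, 0.4, 0.45, 1.2, 1.9, 2.5]
print(histogram_estimate(0.3, sample, h=0.5))  # bin [0, 0.5) holds 3 of the 6 points
```

Every $x$ in the same bin receives the same value, so the estimate is a step function.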
Note that formula (2.1) (as well as its corresponding graph, the histogram) gives an estimate of $f$ for all $x$. Denote by $m_j$ the center of the bin $B_j$. It is easy to see from formula (2.1) that the histogram assigns each $x$ in $B_j$ the same estimate for $f$, namely $\hat f_h(m_j)$. This is rather restrictive and inflexible, and later on we will see that there is a better alternative.
It can be easily verified that the area of a histogram is indeed equal to one, a property that we certainly require from any reasonable estimator of a pdf. But we can give further motivation for viewing the histogram as an estimator of the pdf of a continuous distribution.
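The area property can be checked numerically. The sketch below assumes equal-width bins with origin `x0` and binwidth `h`, and uses synthetic normal data as a stand-in for a real sample; these choices are illustrative assumptions.

```python
import math
import random
from collections import Counter

random.seed(0)
# Synthetic data, purely illustrative.
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

h, x0 = 0.4, 0.0
n = len(sample)

# Number of observations in each bin, keyed by the bin index j,
# where bin j is assumed to be [x0 + j*h, x0 + (j+1)*h).
counts = Counter(math.floor((xi - x0) / h) for xi in sample)

# Bar height in bin j is count_j / (n * h); every bar has width h.
heights = {j: c / (n * h) for j, c in counts.items()}
area = sum(height * h for height in heights.values())
print(area)
```

Each bar contributes its relative frequency to the area, and the relative frequencies sum to one, so `area` equals one up to floating-point rounding.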
Consider Figure 2.2 where the pdf of a random variable $X$ is graphed. The probability that an observation of $X$ will fall into the bin $B_j$ is given by
\[
P(X \in B_j) = \int_{B_j} f(u)\,du .
\]
A natural estimate of this probability is the relative frequency of observations in this interval:
\[
P(X \in B_j) \approx \frac{1}{n}\,\#\{i : X_i \in B_j\} .
\]
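The step from this relative frequency to a density estimate can be sketched as follows, assuming that $f$ is roughly constant over a bin $B_j$ with center $m_j$ and width $h$, so that the area under $f$ over the bin is close to that of a rectangle of height $f(m_j)$:

```latex
\[
f(m_j)\,h \;\approx\; \int_{B_j} f(u)\,du \;=\; P(X \in B_j)
\;\approx\; \frac{1}{n}\,\#\{i : X_i \in B_j\}
\quad\Longrightarrow\quad
\hat f_h(m_j) = \frac{1}{nh}\,\#\{i : X_i \in B_j\} .
\]
```

Dividing the relative frequency by $h$ thus gives the height of the histogram bar over $B_j$.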
The subscript $h$ of $\hat f_h(m_j)$ indicates that the estimate given by the histogram depends on the choice of the binwidth $h$. Note that it also depends on the choice of the origin $x_0$, even though this dependence is not reflected in the notation. The dependency of the histogram on the origin will be discussed later in Subsection 2.3. To illustrate the effect of the choice of the binwidth on the shape of the histogram, consider Figure 2.3, where we have computed and displayed histograms for the stock returns data corresponding to different binwidths.
Clearly, if we increase $h$ the histogram appears to be smoother, but without some reasonable criterion at hand it remains very difficult to say which binwidth provides the ``optimal'' degree of smoothness.
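The smoothing effect of a larger binwidth can be illustrated with a small experiment. The sketch below uses synthetic normal data in place of the stock returns, and the helper `roughness` (the sum of absolute jumps between neighbouring bars) is a hypothetical descriptive measure introduced only for this illustration; it is not a criterion for choosing the binwidth.

```python
import math
import random
from collections import Counter

def histogram_heights(sample, h, x0=0.0):
    """Return {bin index j: bar height} for bins [x0 + j*h, x0 + (j+1)*h)."""
    n = len(sample)
    counts = Counter(math.floor((xi - x0) / h) for xi in sample)
    return {j: c / (n * h) for j, c in counts.items()}

def roughness(heights):
    """Sum of absolute jumps between neighbouring bars (empty bins count as 0)."""
    js = range(min(heights) - 1, max(heights) + 1)
    return sum(abs(heights.get(j + 1, 0.0) - heights.get(j, 0.0)) for j in js)

random.seed(1)
# Synthetic stand-in data, purely illustrative.
sample = [random.gauss(0.0, 1.0) for _ in range(500)]

for h in (0.05, 0.2, 0.8):
    heights = histogram_heights(sample, h)
    print(f"h={h}: {len(heights)} non-empty bins, roughness={roughness(heights):.2f}")
```

Small binwidths give many narrow bars with large jumps between neighbours, while large binwidths give a few wide bars and a visibly smoother outline.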