Let $X$ be a continuous random variable and $f$ its probability density
function (pdf). The pdf tells you ``how $X$ is distributed''. From the pdf
you can calculate the mean and variance of $X$ (if they exist) and the
probability that $X$ will take on values in a certain interval. The pdf
is, thus, very useful to characterize the distribution of the random
variable $X$.
In practice, the pdf of some observable random variable $X$ is in
general unknown. All you have are $n$ observations $X_1, \ldots, X_n$
of $X$ and your task is to use these $n$ values to estimate $f(x)$. We
shall assume that the $n$ observations are independent and that they
all indeed come from the same distribution, namely the one with pdf $f$.
That is, in this and the next chapters we will be concerned with
estimating $f(x)$ at a certain value $x$ from i.i.d. (independent and
identically distributed) data.
We will approach this estimation problem without assuming that $f$
has some known functional form except for some unknown parameter(s)
that need to be estimated. For instance, we do not assume that $f$
has the well-known form of the normal distribution with unknown
parameters $\mu$ and $\sigma^2$. We will focus on nonparametric ways
of estimating $f$ instead.
The most commonly used nonparametric density estimator is the histogram.
The construction of a histogram is fairly simple. Suppose you
have a random sample $X_1, \ldots, X_n$ from some unknown
continuous distribution.
Note that formula (2.1) (as well as its corresponding
graph, the histogram) gives an estimate of $f$ for all $x$.
Denote by $m_j$ the center of the bin $B_j$, and by $h$ the binwidth.
It is easy to see from formula (2.1) that the
histogram assigns each $x$ in
$B_j = [m_j - \frac{h}{2}, m_j + \frac{h}{2})$
the same estimate for $f$, namely $\hat{f}_h(m_j)$. This seems
to be rather restrictive and inflexible, and later on we will see that
there is a better alternative.
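
To make the piecewise-constant behavior concrete, here is a minimal sketch of a histogram estimator. It assumes the usual construction: bins of width $h$ laid out from an origin $x_0$, with the estimate inside a bin equal to the number of observations in that bin divided by $nh$. The function name `histogram_estimate`, the default origin, and the sample values are made up for illustration.

```python
import math
from collections import Counter

def histogram_estimate(sample, x, h, x0=0.0):
    """Histogram density estimate at x with binwidth h and origin x0.

    Assumed bin layout: [x0 + k*h, x0 + (k+1)*h) for integer k.
    """
    n = len(sample)

    def bin_index(v):
        # Integer index k of the bin that contains v
        return math.floor((v - x0) / h)

    # Number of observations in each nonempty bin
    counts = Counter(bin_index(v) for v in sample)
    # Counter returns 0 for bins without observations
    return counts[bin_index(x)] / (n * h)

sample = [0.12, 0.35, 0.41, 0.48, 0.77, 1.30, 1.42, 1.95]  # made-up data
h = 0.5

# All x in the bin [0.0, 0.5) get the same value as the bin center 0.25:
# 4 of the 8 observations lie in this bin, so the estimate is 4 / (8 * 0.5).
for x in (0.05, 0.25, 0.49):
    print(x, histogram_estimate(sample, x, h))  # each line shows 1.0
```

The estimate changes only when $x$ crosses a bin boundary, which is the inflexibility noted above.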
It can be easily verified that the area of a histogram is indeed equal to one, a property that we certainly require from any reasonable estimator of a pdf. But we can give further motivation for viewing the histogram as an estimator of the pdf of a continuous distribution.
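
The verification takes one line. As a sketch, write $n_j$ for the number of observations in bin $j$ (a symbol assumed here for illustration) and suppose the usual normalization, so that the bar over that bin has height $n_j/(nh)$ and width $h$:

```latex
\int_{-\infty}^{\infty} \hat{f}_h(x)\,dx
  = \sum_{j} \frac{n_j}{nh}\cdot h
  = \frac{1}{n}\sum_{j} n_j
  = \frac{n}{n}
  = 1 .
```

The last step uses the fact that every observation falls into exactly one bin.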
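Consider Figure 2.2 where the pdf of a random
variable $X$ is graphed.
The probability that an observation of $X$ will fall into the
bin $[m_j - \frac{h}{2}, m_j + \frac{h}{2})$, centered at $m_j$ with
binwidth $h$, is given by
$$
P\left(X \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right) = \int_{m_j - h/2}^{m_j + h/2} f(u)\,du. \tag{2.2}
$$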
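A natural estimate of this probability is the relative frequency of observations in this interval, that is, the number of $X_i$ falling into the bin divided by $n$. Since the integral in (2.2) is approximately $f(m_j)\,h$ when $h$ is small, dividing this relative frequency by $h$ yields the histogram estimate $\hat{f}_h(m_j)$ of $f(m_j)$.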
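
The two approximations behind this motivation, replacing the integral in (2.2) by a bar of area $f(m_j)\,h$ and replacing the bin probability by a relative frequency, can be checked numerically. The sketch below does so for a case where the true pdf is known; the standard normal distribution, the bin center `m_j = 0.5`, the binwidth `h = 0.5`, the sample size, and the seed are all assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

m_j, h = 0.5, 0.5                     # hypothetical bin center and binwidth
lo, hi = m_j - h / 2, m_j + h / 2     # the bin [0.25, 0.75)

random.seed(1)
n = 100_000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]  # simulated i.i.d. data

exact_prob = normal_cdf(hi) - normal_cdf(lo)       # the integral in (2.2)
bar_area = normal_pdf(m_j) * h                     # bar of height f(m_j), width h
rel_freq = sum(lo <= x < hi for x in sample) / n   # relative frequency in the bin

print(f"exact probability   {exact_prob:.4f}")
print(f"f(m_j) * h          {bar_area:.4f}")
print(f"relative frequency  {rel_freq:.4f}")
print(f"relative frequency / h = {rel_freq / h:.4f}, f(m_j) = {normal_pdf(m_j):.4f}")
```

With a large sample and a moderate binwidth, the three quantities should be close, so the relative frequency divided by $h$ is a reasonable estimate of $f(m_j)$.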
The subscript $h$ of $\hat{f}_h(m_j)$ indicates that the
estimate given by the histogram depends on the choice of the
binwidth $h$. Note that it also depends on the choice of the origin
even though this dependence is not reflected in the notation.
The dependency of
the histogram on the origin will be discussed later in Subsection
2.3.
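
A small numerical example already shows the dependence on the origin. The sketch assumes bins of the form $[x_0 + kh, x_0 + (k+1)h)$ for integer $k$; the function name, the data, and the two origins are hypothetical.

```python
import math

def histogram_estimate(sample, x, h, x0):
    """Histogram estimate at x for binwidth h and origin x0.

    Assumed bin layout: [x0 + k*h, x0 + (k+1)*h) for integer k.
    """
    def bin_of(v):
        return math.floor((v - x0) / h)

    # Count the observations that share a bin with x
    in_bin = sum(1 for v in sample if bin_of(v) == bin_of(x))
    return in_bin / (len(sample) * h)

sample = [0.12, 0.35, 0.41, 0.48, 0.77, 1.30, 1.42, 1.95]  # made-up data

# Same data, same binwidth, same evaluation point, two different origins
print(histogram_estimate(sample, 0.4, h=0.5, x0=0.0))   # bin [0.00, 0.50) holds 4 points: 1.0
print(histogram_estimate(sample, 0.4, h=0.5, x0=0.25))  # bin [0.25, 0.75) holds 3 points: 0.75
```

Shifting the origin moves the bin boundaries, so the observations are grouped differently and the estimate at the same point changes.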
To illustrate the effect of the choice of the binwidth on the shape
of the histogram consider Figure 2.3 where we have
computed and displayed histograms for the stock returns data,
corresponding to different binwidths.
Clearly, if we increase $h$ the histogram
appears to be smoother, but without some reasonable criterion at
hand it remains very difficult to say which binwidth provides the
``optimal'' degree of smoothness.
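
The smoothing effect of a larger binwidth can be reproduced on simulated data. The sketch below does not use the stock returns data; it draws a synthetic normal sample and uses the number of nonempty bins and the height of the tallest bar as crude, assumed proxies for roughness. The binwidths are doubled at each step with a fixed origin so that each wider bin is the union of two narrower ones.

```python
import math
import random
from collections import Counter

def histogram_heights(sample, h, x0=0.0):
    """Return {bin index k: bar height n_k / (n * h)} for the nonempty bins.

    Assumed bin layout: [x0 + k*h, x0 + (k+1)*h) for integer k.
    """
    n = len(sample)
    counts = Counter(math.floor((v - x0) / h) for v in sample)
    return {k: c / (n * h) for k, c in counts.items()}

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]  # synthetic data

for h in (0.125, 0.25, 0.5, 1.0):
    heights = histogram_heights(sample, h)
    area = sum(heights.values()) * h  # total area of the bars
    print(f"h={h}: {len(heights)} nonempty bins, "
          f"tallest bar {max(heights.values()):.3f}, area {area:.3f}")
```

As $h$ grows, there are fewer bars and the tallest one cannot get higher, because merging two adjacent bins averages their heights. The area stays at one throughout, so these numbers describe the shape only and do not say which binwidth is best.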