# 2.1 Motivation and Derivation

Let $X$ be a continuous random variable and $f$ its probability density function (pdf). The pdf tells you "how $X$ is distributed". From the pdf you can calculate the mean and variance of $X$ (if they exist) and the probability that $X$ will take on values in a certain interval. The pdf is thus very useful for characterizing the distribution of the random variable $X$.

In practice, the pdf of an observable random variable $X$ is generally unknown. All you have are $n$ observations $X_1, \ldots, X_n$ of $X$, and your task is to use these $n$ values to estimate $f(x)$. We shall assume that the $n$ observations are independent and that they all come from the same distribution, namely $f(x)$. That is, in this and the following chapters we will be concerned with estimating $f(x)$ at a given value $x$ from i.i.d. (independent and identically distributed) data.

We will approach this estimation problem without assuming that $f$ has some known functional form except for some unknown parameter(s) that need to be estimated. For instance, we do not assume that $f$ has the well-known form of the normal distribution with unknown parameters $\mu$ and $\sigma^2$. We will focus on nonparametric ways of estimating $f$ instead. The most commonly used nonparametric density estimator is the histogram.

## 2.1.1 Construction

The construction of a histogram is fairly simple. Suppose you have a random sample $X_1, \ldots, X_n$ from some unknown continuous distribution.

• Select an origin $x_0$ and divide the real line into bins of binwidth $h$:

$$B_j = [x_0 + (j-1)h,\; x_0 + jh), \qquad j \in \mathbb{Z}.$$

• Count how many observations fall into each bin. Denote the number of observations that fall into bin $j$ by $n_j$.
• For each bin, divide the frequency count by the sample size $n$ (to convert the counts into relative frequencies, the sample analog of probabilities) and by the binwidth $h$ (to make sure that the area under the histogram is equal to one):

$$f_j = \frac{n_j}{nh}.$$

• Plot the histogram by erecting a bar over each bin with height $f_j$ and width $h$.

More formally, the histogram is given by

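$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} \sum_{j} I(X_i \in B_j)\, I(x \in B_j) \tag{2.1}$$

where

$$I(X_i \in B_j) = \begin{cases} 1 & \text{if } X_i \in B_j, \\ 0 & \text{otherwise.} \end{cases}$$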

Note that formula (2.1) (as well as its corresponding graph, the histogram) gives an estimate of $f$ for all $x$. Denote by $m_j$ the center of the bin $B_j$. It is easy to see from formula (2.1) that the histogram assigns each $x$ in $B_j = [m_j - \frac{h}{2},\, m_j + \frac{h}{2})$ the same estimate for $f$, namely $\hat{f}_h(m_j)$. This is rather restrictive and inflexible, and later on we will see that there is a better alternative.
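
The construction steps and formula (2.1) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `histogram_estimate`, the default origin `x0=0.0`, and the small sample are all hypothetical choices made for the example.

```python
import math

def histogram_estimate(x, sample, h, x0=0.0):
    """Evaluate the histogram density estimate (2.1) at the point x."""
    n = len(sample)
    # Index j of the bin B_j = [x0 + (j-1)h, x0 + jh) that contains x.
    j = math.floor((x - x0) / h) + 1
    # n_j: number of observations that fall into the same bin as x.
    n_j = sum(1 for xi in sample if math.floor((xi - x0) / h) + 1 == j)
    # Relative frequency n_j / n, divided by the binwidth h.
    return n_j / (n * h)

# Hypothetical toy sample (n = 8), made up for illustration only.
sample = [0.1, 0.4, 0.45, 1.2, 1.9, 2.3, 2.35, 2.6]
h = 1.0

# Both points lie in the bin [0, 1), so they receive the same estimate.
print(histogram_estimate(0.2, sample, h))
print(histogram_estimate(0.9, sample, h))
```

Because the estimate depends on $x$ only through the bin that contains it, any two points in the same bin return the same value, which is the piecewise-constant behaviour described above.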

It can be easily verified that the area of a histogram is indeed equal to one, a property that we certainly require from any reasonable estimator of a pdf. But we can give further motivation for viewing the histogram as an estimator of the pdf of a continuous distribution.
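
The area claim can be checked directly from the construction. Writing $n_j$ for the number of observations in bin $B_j$, so that the bar over $B_j$ has height $n_j/(nh)$ and width $h$, and using the fact that every observation falls into exactly one bin:

$$\int_{-\infty}^{\infty} \hat{f}_h(x)\,dx = \sum_{j} \frac{n_j}{nh} \cdot h = \frac{1}{n} \sum_{j} n_j = \frac{n}{n} = 1.$$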

## 2.1.2 Derivation

Consider Figure 2.2, where the pdf of a random variable $X$ is graphed. The probability that an observation of $X$ will fall into the bin $[m_j - \frac{h}{2},\, m_j + \frac{h}{2})$ is given by

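$$P\left(X \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right) = \int_{m_j - \frac{h}{2}}^{m_j + \frac{h}{2}} f(u)\,du \tag{2.2}$$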

which is just the shaded area under the density between $m_j - \frac{h}{2}$ and $m_j + \frac{h}{2}$. This area can be approximated by a bar with height $f(m_j)$ and width $h$ (see Figure 2.2). Thus we can write

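$$P\left(X \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right) = \int_{m_j - \frac{h}{2}}^{m_j + \frac{h}{2}} f(u)\,du \approx f(m_j) \cdot h \tag{2.3}$$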

A natural estimate of this probability is the relative frequency of observations in this interval

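$$P\left(X \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right) \approx \frac{1}{n}\, \#\left\{X_i \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right\} \tag{2.4}$$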

Combining (2.3) and (2.4) we get

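$$\hat{f}_h(m_j) = \frac{1}{nh}\, \#\left\{X_i \in \left[m_j - \tfrac{h}{2},\, m_j + \tfrac{h}{2}\right)\right\} \tag{2.5}$$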

(Here and in the following we use $\#$ to denote the cardinality, i.e. the number of elements in a set.)
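
To get a feel for the quality of the bar approximation in (2.3), one can compare the exact bin probability with $f(m_j) \cdot h$ for a density whose integral is known. The sketch below assumes, purely for illustration, that $f$ is the standard normal density. The bin center `m = 0.5` and the binwidths are arbitrary choices, and all function names are hypothetical.

```python
import math

def normal_pdf(u):
    """Standard normal density, used here as an assumed example of f."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    """Standard normal distribution function, expressed via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def bin_probability(m, h):
    """Exact P(X in [m - h/2, m + h/2)) for standard normal X, as in (2.2)."""
    return normal_cdf(m + h / 2) - normal_cdf(m - h / 2)

def bar_approximation(m, h):
    """Area of the bar with height f(m) and width h, as in (2.3)."""
    return normal_pdf(m) * h

m = 0.5  # arbitrary bin center chosen for the illustration
for h in (1.0, 0.5, 0.1):
    exact = bin_probability(m, h)
    approx = bar_approximation(m, h)
    print(f"h={h}: exact={exact:.6f}, bar={approx:.6f}, "
          f"abs. error={abs(exact - approx):.6f}")
```

In this example the gap between the two quantities shrinks as $h$ gets smaller, which is what makes replacing the integral by the bar area plausible for narrow bins.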

## 2.1.3 Varying the Binwidth

The subscript $h$ of $\hat{f}_h(m_j)$ indicates that the estimate given by the histogram depends on the choice of the binwidth $h$. Note that it also depends on the choice of the origin $x_0$, even though this dependence is not reflected in the notation. The dependency of the histogram on the origin will be discussed later in Subsection 2.3. To illustrate the effect of the choice of the binwidth on the shape of the histogram, consider Figure 2.3, where we have computed and displayed histograms for the stock returns data, corresponding to different binwidths.

Clearly, if we increase $h$ the histogram appears to be smoother, but without some reasonable criterion at hand it remains very difficult to say which binwidth provides the "optimal" degree of smoothness.
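
The effect of the binwidth can also be explored numerically. The sketch below does not use the stock returns data of Figure 2.3. As a stand-in it draws a synthetic sample from a standard normal distribution, which is an assumption made only so that the example is self-contained. The helper `histogram_heights`, the origin `x0=0.0`, and the three binwidths are hypothetical choices.

```python
import math
import random
from collections import Counter

def histogram_heights(sample, h, x0=0.0):
    """Return {j: f_j} with f_j = n_j / (n h) for every nonempty bin B_j."""
    n = len(sample)
    # Bin index j such that x_i lies in B_j = [x0 + (j-1)h, x0 + jh).
    counts = Counter(math.floor((xi - x0) / h) + 1 for xi in sample)
    return {j: n_j / (n * h) for j, n_j in sorted(counts.items())}

random.seed(0)  # fixed seed so that repeated runs give the same sample
# Synthetic stand-in data (NOT the stock returns data from the text).
sample = [random.gauss(0.0, 1.0) for _ in range(500)]

for h in (0.1, 0.5, 1.0):
    heights = histogram_heights(sample, h)
    area = sum(f_j * h for f_j in heights.values())
    print(f"h={h}: {len(heights)} nonempty bins, "
          f"max height={max(heights.values()):.3f}, area={area:.3f}")
```

A small binwidth spreads the observations over many narrow bars, while a large binwidth pools them into a few wide ones. In every case the total area stays equal to one, so the binwidth changes only how that area is distributed.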