4.6 Bootstrap

Recall that we need large sample sizes for the critical values obtained from the CLT to be sufficiently accurate approximations. Here, large means $n = 50$ for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:

  1. estimate the parameter of interest,
  2. simulate from an estimated distribution to approximate the asymptotic distribution of the statistic of interest.
In detail, the bootstrap works as follows. Consider the observations $x_{1},\ldots,x_{n}$ of the sample $X_{1},\ldots,X_{n}$ and estimate the empirical distribution function (edf) $F_{n}$. In the case of one-dimensional data,
\begin{displaymath}
F_{n}(x) = \frac{1}{n} \sum_{i=1}^n {\boldsymbol{I}}(X_{i}\le x).
\end{displaymath} (4.57)

This is a step function which is constant between neighboring data points and jumps by $1/n$ at each observation.
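As a quick illustration (a Python sketch of our own, not part of the text; the helper name \texttt{edf} is ours), the edf in (4.57) can be evaluated at arbitrary points via a binary search in the sorted data:

\begin{verbatim}
import numpy as np

def edf(x):
    """Return the empirical distribution function F_n of the data x."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    def F_n(t):
        # F_n(t) = #{i : x_i <= t} / n; side='right' counts x_i <= t
        return np.searchsorted(xs, t, side='right') / n
    return F_n

# evaluate the edf of n = 100 simulated N(0,1) observations
F_n = edf(np.random.default_rng(0).standard_normal(100))
F_n(0.0)   # close to Phi(0) = 0.5
\end{verbatim}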

EXAMPLE 4.23   Suppose that we have $n=100$ standard normal $N(0,1)$ data points $X_i$, $i=1,\dots,n$. The cdf of $X$ is $ \Phi(x) = \int^{x}_{-\infty} \varphi(u) du$ and is shown in Figure 4.6 as the thin, solid line. The empirical distribution function (edf) is displayed as a thick step function line. Figure 4.7 shows the same setup for $n=1000$ observations.

Figure 4.6: The standard normal cdf (thin line) and the empirical distribution function (thick line) for $n=100$. MVAedfnormal.xpl
\includegraphics[width=1\defpicwidth]{edfnorm.ps}

Figure 4.7: The standard normal cdf (thin line) and the empirical distribution function (thick line) for $n=1000$. MVAedfnormal.xpl
\includegraphics[width=1\defpicwidth]{edfnorm2.ps}

Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement $n^\ast$ observations $X_{1}^\ast, \ldots, X_{n^\ast}^\ast$ from the original sample. This is called a Bootstrap sample. Usually one takes $n^\ast = n$.

Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations $x_{1}, x_{2}, x_{3}$, then a Bootstrap sample might look like $X_1^\ast=x_{3},\ X_2^\ast=x_{2},\ X_3^\ast=x_{3}$. Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices $1, 2, \ldots, n$ of the original sample.
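This index-drawing step takes only a few lines; the following Python sketch (the function name \texttt{draw\_bootstrap\_sample} is ours) uses \texttt{numpy}'s uniform integer generator to draw the indices:

\begin{verbatim}
import numpy as np

def draw_bootstrap_sample(x, n_star=None, seed=None):
    """Draw X*_1, ..., X*_{n*} by sampling indices 1, ..., n
    uniformly with replacement (usually n* = n)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n_star = len(x) if n_star is None else n_star
    idx = rng.integers(0, len(x), size=n_star)   # uniform over the indices
    return x[idx]

# with three observations, repeated indices are possible, e.g.
# indices (3, 2, 3) give X*_1 = x_3, X*_2 = x_2, X*_3 = x_3
x_star = draw_bootstrap_sample(np.array([1.2, -0.4, 0.7]))
\end{verbatim}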

The Bootstrap observations are drawn randomly from the empirical distribution, i.e., the probability for each original observation to be selected into the Bootstrap sample is $1/n$ for each draw. It is easy to compute that

\begin{displaymath}E_{F_{n}} (X_{i}^\ast) = \frac{1}{n}\sum_{i=1}^{n}x_{i}=\,\bar{x}. \end{displaymath}

That is, under the edf $F_{n}$ the expected value of a bootstrap observation is the mean of the original sample $x_{1},\ldots,x_{n}$. The same holds for the variance, i.e.,

\begin{displaymath}\mathop{\mathit{Var}}_{F_{n}} (X_{i}^\ast) = \widehat{\sigma}^2, \end{displaymath}

where $\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2$. The edf $F_n^\ast$ of the bootstrap observations is defined as in (4.57).
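The variance identity above can be verified directly; the short computation below (not spelled out in the text) uses only the fact that each $x_{i}$ has probability $1/n$ under $F_{n}$:

\begin{displaymath}
\mathop{\mathit{Var}}_{F_{n}}(X_{i}^\ast)
= E_{F_{n}}\{(X_{i}^\ast)^{2}\} - \{E_{F_{n}}(X_{i}^\ast)\}^{2}
= \frac{1}{n}\sum_{i=1}^{n} x_{i}^{2} - \bar{x}^{2}
= \frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}
= \widehat{\sigma}^{2}.
\end{displaymath}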

Figure 4.8 shows the cdf of the $n=100$ original observations as a solid line and two bootstrap cdf's as thin lines.

Figure 4.8: The cdf $F_n$ (thick line) and two bootstrap cdf's $F_n^*$ (thin lines). MVAedfbootstrap.xpl
\includegraphics[width=1\defpicwidth]{edfboot.ps}

The CLT holds for the bootstrap sample. Analogously to Corollary 4.1 we have the following corollary.

COROLLARY 4.2   If $X_{1}^\ast, \ldots, X_{n}^\ast$ is a bootstrap sample from $X_{1},\ldots,X_{n}$, then the distribution of

\begin{displaymath}\sqrt{n} \left( \frac{\bar{x}^\ast-\bar{x}}{\widehat{\sigma}^\ast}
\right) \end{displaymath}

also becomes $N(0,1)$ asymptotically, where $\overline x^\ast = \frac{1}{n} \sum_{i=1}^{n} X_{i}^\ast$ and $(\widehat{\sigma}^\ast)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_{i}^\ast -
\bar{x}^\ast)^2$.

How do we find a confidence interval for $\mu$ using the Bootstrap method? Recall that the quantile $u_{1-\alpha/2}$ might be a poor approximation for small sample sizes, because the true distribution of $\sqrt{n}\left( \frac{\bar{x}-\mu}{\widehat{\sigma}}\right)$ might be far away from the limit distribution $N(0,1)$. The Bootstrap idea enables us to ``simulate'' this distribution by computing $\sqrt{n} \left( \frac{\bar{x}^\ast - \bar{x}}{\widehat{\sigma}^\ast} \right)$ for many Bootstrap samples. In this way we can estimate an empirical $(1-\alpha/2)$-quantile $u_{1-\alpha/2}^\ast$. The Bootstrap-improved confidence interval is then

\begin{displaymath}C_{1-\alpha}^\ast = \left[\bar{x}-\frac{\widehat{\sigma}}{\sqrt{n}}\,
u_{1-\alpha/2}^\ast,\ \bar{x}+\frac{\widehat{\sigma}}{\sqrt{n}}\,
u_{1-\alpha/2}^\ast \right]. \end{displaymath}

By Corollary 4.2 we have

\begin{displaymath}P(\mu \in C_{1-\alpha}^\ast) \longrightarrow 1 - \alpha \quad
\textrm{as} \; n \rightarrow \infty, \end{displaymath}

but with an improved speed of convergence, see Hall (1992).
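The whole construction is easy to implement. Here is a Python sketch of our own (the number of bootstrap replications $B = 999$ is an illustrative choice): it simulates $\sqrt{n}\,(\bar{x}^\ast-\bar{x})/\widehat{\sigma}^\ast$ a total of $B$ times, takes the empirical $(1-\alpha/2)$-quantile $u_{1-\alpha/2}^\ast$, and returns $C_{1-\alpha}^\ast$:

\begin{verbatim}
import numpy as np

def bootstrap_ci(x, alpha=0.05, B=999, seed=None):
    """Bootstrap confidence interval C*_{1-alpha} for the mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    sigma_hat = x.std()            # 1/n normalization, as in the text

    t_star = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=n, replace=True)   # bootstrap sample, n* = n
        t_star[b] = np.sqrt(n) * (xs.mean() - xbar) / xs.std()

    u_star = np.quantile(t_star, 1 - alpha / 2)    # empirical quantile
    half = sigma_hat / np.sqrt(n) * u_star
    return xbar - half, xbar + half

# small sample, where the N(0,1) quantile may be inaccurate
x = np.random.default_rng(42).standard_normal(20)
lo, hi = bootstrap_ci(x)
\end{verbatim}

Note that only $u_{1-\alpha/2}^\ast$ comes from the bootstrap; the center $\bar{x}$ and the scale $\widehat{\sigma}/\sqrt{n}$ are taken from the original sample, exactly as in the formula for $C_{1-\alpha}^\ast$ above.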

Summary
$\ast$
For small sample sizes the bootstrap improves the precision of the confidence interval.
$\ast$
The bootstrap distribution ${\mathcal{L}}(\sqrt{n}(\overline x^*-\overline x)/\hat\sigma^*)$ converges to the same asymptotic limit as the distribution ${\mathcal{L}}(\sqrt{n}(\overline x-\mu)/\hat\sigma)$.