4.6 Bootstrap

Recall that we need large sample sizes for the critical values obtained from the CLT to be sufficiently accurate approximations. Here, large means $n = 50$ for one-dimensional data. How can we construct confidence intervals in the case of smaller sample sizes? One way is to use a method called the Bootstrap. The Bootstrap algorithm uses the data twice:

  1. estimate the parameter of interest,
  2. simulate from an estimated distribution to approximate the asymptotic distribution of the statistic of interest.
In detail, the bootstrap works as follows. Consider the observations $x_{1},\ldots,x_{n}$ of the sample $X_{1},\ldots,X_{n}$ and estimate the empirical distribution function (edf) $F_{n}$. In the case of one-dimensional data,
\begin{displaymath}
F_{n}(x) = \frac{1}{n} \sum_{i=1}^n {\boldsymbol{I}}(X_{i}\le x).
\end{displaymath} (4.57)

This is a step function which is constant between neighboring data points and jumps by $1/n$ at each observation.
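As a quick illustration (a Python sketch of our own, not part of the text; the helper name \texttt{edf} is ours), the edf in (4.57) can be evaluated at arbitrary points via a binary search in the sorted data:

\begin{verbatim}
import numpy as np

def edf(x):
    """Return the empirical distribution function F_n of the data x."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    def F_n(t):
        # F_n(t) = #{i : x_i <= t} / n; side='right' counts x_i <= t
        return np.searchsorted(xs, t, side='right') / n
    return F_n

# evaluate the edf of n = 100 simulated N(0,1) observations
F_n = edf(np.random.default_rng(0).standard_normal(100))
F_n(0.0)   # close to Phi(0) = 0.5
\end{verbatim}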

EXAMPLE 4.23   Suppose that we have $n=100$ standard normal $N(0,1)$ data points $X_i$, $i=1,\dots,n$. The cdf of $X$ is $ \Phi(x) = \int^{x}_{-\infty} \varphi(u) du$ and is shown in Figure 4.6 as the thin, solid line. The empirical distribution function (edf) is displayed as a thick step function line. Figure 4.7 shows the same setup for $n=1000$ observations.

Figure 4.6: The standard normal cdf (thin line) and the empirical distribution function (thick line) for $n=100$. MVAedfnormal.xpl
\includegraphics[width=1\defpicwidth]{edfnorm.ps}

Figure 4.7: The standard normal cdf (thin line) and the empirical distribution function (thick line) for $n=1000$. MVAedfnormal.xpl
\includegraphics[width=1\defpicwidth]{edfnorm2.ps}

Now draw with replacement a new sample from this empirical distribution. That is, we sample with replacement $n^\ast$ observations $X_{1}^\ast, \ldots, X_{n^\ast}^\ast$ from the original sample. This is called a Bootstrap sample. Usually one takes $n^\ast = n$.

Since we sample with replacement, a single observation from the original sample may appear several times in the Bootstrap sample. For instance, if the original sample consists of the three observations $x_{1}, x_{2}, x_{3}$, then a Bootstrap sample might look like $X_1^\ast=x_{3},\ X_2^\ast=x_{2},\ X_3^\ast=x_{3}$. Computationally, we find the Bootstrap sample by using a uniform random number generator to draw from the indices $1, 2, \ldots, n$ of the original sample.
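This index-drawing step takes only a few lines; the following Python sketch (the function name \texttt{draw\_bootstrap\_sample} is ours) uses \texttt{numpy}'s uniform integer generator to draw the indices:

\begin{verbatim}
import numpy as np

def draw_bootstrap_sample(x, n_star=None, seed=None):
    """Draw X*_1, ..., X*_{n*} by sampling indices 1, ..., n
    uniformly with replacement (usually n* = n)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n_star = len(x) if n_star is None else n_star
    idx = rng.integers(0, len(x), size=n_star)   # uniform over the indices
    return x[idx]

# with three observations, repeated indices are possible, e.g.
# indices (3, 2, 3) give X*_1 = x_3, X*_2 = x_2, X*_3 = x_3
x_star = draw_bootstrap_sample(np.array([1.2, -0.4, 0.7]))
\end{verbatim}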

The Bootstrap observations are drawn randomly from the empirical distribution, i.e., the probability for each original observation to be selected into the Bootstrap sample is $1/n$ for each draw. It is easy to compute that

\begin{displaymath}E_{F_{n}} (X_{i}^\ast) = \frac{1}{n}\sum_{i=1}^{n}x_{i}=\,\bar{x}. \end{displaymath}

That is, under the edf $F_{n}$ the expected value of a bootstrap observation is the mean of the original sample $x_{1},\ldots,x_{n}$. The same holds for the variance, i.e.,

\begin{displaymath}\mathop{\mathit{Var}}_{F_{n}} (X_{i}^\ast) = \widehat{\sigma}^2, \end{displaymath}

where $\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{i} - \bar{x})^2$. The edf $F_n^\ast$ of the bootstrap observations is defined as in (4.57).
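The variance identity above can be verified directly; the short computation below (not spelled out in the text) uses only the fact that each $x_{i}$ has probability $1/n$ under $F_{n}$:

\begin{displaymath}
\mathop{\mathit{Var}}_{F_{n}}(X_{i}^\ast)
= E_{F_{n}}\{(X_{i}^\ast)^{2}\} - \{E_{F_{n}}(X_{i}^\ast)\}^{2}
= \frac{1}{n}\sum_{i=1}^{n} x_{i}^{2} - \bar{x}^{2}
= \frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}
= \widehat{\sigma}^{2}.
\end{displaymath}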

Figure 4.8 shows the cdf of the $n=100$ original observations as a solid line and two bootstrap cdf's as thin lines.

Figure 4.8: The cdf $F_n$ (thick line) and two bootstrap cdf's $F_n^*$ (thin lines). MVAedfbootstrap.xpl
\includegraphics[width=1\defpicwidth]{edfboot.ps}

The CLT holds for the bootstrap sample. Analogously to Corollary 4.1 we have the following corollary.

COROLLARY 4.2   If $X_{1}^\ast, \ldots, X_{n}^\ast$ is a bootstrap sample from $X_{1},\ldots,X_{n}$, then the distribution of

\begin{displaymath}\sqrt{n} \left( \frac{\bar{x}^\ast-\bar{x}}{\widehat{\sigma}^\ast}
\right) \end{displaymath}

also becomes $N(0,1)$ asymptotically, where $\overline x^\ast = \frac{1}{n} \sum_{i=1}^{n} X_{i}^\ast$ and $(\widehat{\sigma}^\ast)^2 = \frac{1}{n} \sum_{i=1}^{n} (X_{i}^\ast -
\bar{x}^\ast)^2$.

How do we find a confidence interval for $\mu$ using the Bootstrap method? Recall that the quantile $u_{1-\alpha/2}$ might be a poor approximation for small sample sizes, because the true distribution of $\sqrt{n}\left( \frac{\bar{x}-\mu}{\widehat{\sigma}}\right)$ might be far away from the limit distribution $N(0,1)$. The Bootstrap idea enables us to ``simulate'' this distribution by computing $\sqrt{n} \left( \frac{\bar{x}^\ast - \bar{x}}{\widehat{\sigma}^\ast} \right)$ for many Bootstrap samples. In this way we can estimate an empirical $(1-\alpha/2)$-quantile $u_{1-\alpha/2}^\ast$. The Bootstrap-improved confidence interval is then

\begin{displaymath}C_{1-\alpha}^\ast = \left[\bar{x}-\frac{\widehat{\sigma}}{\sqrt{n}}\,
u_{1-\alpha/2}^\ast,\ \bar{x}+\frac{\widehat{\sigma}}{\sqrt{n}}\,
u_{1-\alpha/2}^\ast \right]. \end{displaymath}

By Corollary 4.2 we have

\begin{displaymath}P(\mu \in C_{1-\alpha}^\ast) \longrightarrow 1 - \alpha \quad
\textrm{as} \; n \rightarrow \infty, \end{displaymath}

but with an improved speed of convergence, see Hall (1992).
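The whole construction is easy to implement. Here is a Python sketch of our own (the number of bootstrap replications $B = 999$ is an illustrative choice): it simulates $\sqrt{n}\,(\bar{x}^\ast-\bar{x})/\widehat{\sigma}^\ast$ a total of $B$ times, takes the empirical $(1-\alpha/2)$-quantile $u_{1-\alpha/2}^\ast$, and returns $C_{1-\alpha}^\ast$:

\begin{verbatim}
import numpy as np

def bootstrap_ci(x, alpha=0.05, B=999, seed=None):
    """Bootstrap confidence interval C*_{1-alpha} for the mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    sigma_hat = x.std()            # 1/n normalization, as in the text

    t_star = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=n, replace=True)   # bootstrap sample, n* = n
        t_star[b] = np.sqrt(n) * (xs.mean() - xbar) / xs.std()

    u_star = np.quantile(t_star, 1 - alpha / 2)    # empirical quantile
    half = sigma_hat / np.sqrt(n) * u_star
    return xbar - half, xbar + half

# small sample, where the N(0,1) quantile may be inaccurate
x = np.random.default_rng(42).standard_normal(20)
lo, hi = bootstrap_ci(x)
\end{verbatim}

Note that only $u_{1-\alpha/2}^\ast$ comes from the bootstrap; the center $\bar{x}$ and the scale $\widehat{\sigma}/\sqrt{n}$ are taken from the original sample, exactly as in the formula for $C_{1-\alpha}^\ast$ above.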

Summary
$\ast$
For small sample sizes the bootstrap improves the precision of the confidence interval.
$\ast$
The bootstrap distribution ${\mathcal{L}}(\sqrt{n}(\overline x^*-\overline x)/\hat\sigma^*)$ converges to the same asymptotic limit as the distribution ${\mathcal{L}}(\sqrt{n}(\overline x-\mu)/\hat\sigma)$.