

2.1 Introduction

The bootstrap is by now a standard method in modern statistics. Its roots go back to a variety of resampling ideas that circulated in the seventies. The seminal work of [19] synthesized some of these earlier resampling ideas and established a new framework for simulation-based statistical analysis. The idea of the bootstrap is to generate additional (pseudo) data using the information contained in the original data. True underlying sample properties are reproduced as closely as possible and unknown model characteristics are replaced by sample estimates.

In its basic nature the bootstrap is a data analytic tool. It allows one to study the performance of statistical methods by applying them repeatedly to bootstrap pseudo data (''resamples''). By inspecting the outcomes for the different bootstrap resamples the statistician can get a good impression of the performance of the statistical procedure. This concerns graphical methods in particular. The random nature of a statistical plot is very difficult to summarize by quantitative approaches. In this respect data analytic methods differ from classical estimation and testing problems. We will illustrate data analytic uses of the bootstrap in the next section.

Most of the bootstrap literature is concerned with bootstrap implementations of tests and confidence intervals and with bootstrap applications to estimation problems. It has been argued that for these problems the bootstrap is better understood if it is described as a plug-in method. The plug-in method is an approach used for the estimation of functionals that depend on unknown finite or infinite dimensional model parameters of the observed data set. The basic idea of plug-in estimates is to estimate these unknown parameters and to plug them into the functional. A widely known example is the plug-in bandwidth selector for kernel estimates. Asymptotically optimal bandwidths typically depend on averages of derivatives of unknown curves (e.g. densities, regression functions), on residual variances, etc. Plug-in bandwidth selectors are constructed by replacing these unknown quantities by finite sample estimates. We now illustrate why the bootstrap can be understood as a plug-in approach. We will do this for i.i.d. resampling, which is perhaps the simplest version of the bootstrap. It is applied to an i.i.d. sample $ X_1, \ldots, X_n$ with underlying distribution $ P$. I.i.d. resamples are generated by drawing $ n$ times with replacement from the original sample $ X_1, \ldots, X_n$. This gives a resample $ X_1^{\ast},\ldots,X_n^{\ast}$. More formally, the resample is constructed by generating $ X_1^{\ast},\ldots,X_n^{\ast}$ that are conditionally independent (given the original data set) and have conditional distribution $ \widehat{P}_n$. Here $ \widehat{P}_n$ denotes the empirical distribution. This is the distribution that puts mass $ 1/n$ on each value of $ X_1, \ldots, X_n$ in case all observations have different values (or, more generally, mass $ j/n$ on points that appear $ j$ times in the sample), i.e. for a set $ A$ we have $ \widehat{P}_n(A) = n^{-1} \sum_{i=1}^n \mathrm{I}(X_i \in A)$, where $ \mathrm{I}$ denotes the indicator function. The bootstrap estimate of a functional $ T(P)$ is defined as the plug-in estimate $ T(\widehat{P}_n)$. Let us consider the mean $ \mu(P) = \int x P(\d x)$ as a simple example. The bootstrap estimate of $ \mu(P)$ is given by $ \mu(\widehat{P}_n)$. Clearly, the bootstrap estimate is equal to the sample mean $ \overline{X}_n = n^{-1} \sum_{i=1}^n X_i$. In this simple case, simulations are not needed to calculate the bootstrap estimate. Also in more complicated cases it is very helpful to distinguish between the statistical performance of the bootstrap estimate and its algorithmic calculation. In some cases it may be more appropriate to calculate the bootstrap estimate by Monte-Carlo simulations; in other cases powerful analytic approaches may be available. The discussion of which algorithmic approach is preferable should not be mixed up with the discussion of the statistical properties of the bootstrap estimate. Perhaps the clarification of this point is one of the major advantages of viewing the bootstrap as a plug-in method.

Let us now consider a slightly more complicated example. Suppose that the functional we want to estimate is the distribution of $ \sqrt {n} [\overline{X}_n - \mu(P)]$, denoted $ T_n(P) = T(P)$. The functional now depends on the sample size $ n$. The factor $ \sqrt{n}$ has been introduced to simplify the asymptotic considerations below. The bootstrap estimate of $ T_n(P)$ is equal to $ T_n(\widehat{P}_n)$. This is the conditional distribution of $ \sqrt{n}[\overline{X}_n^{\ast} - \mu(\widehat{P}_n)] = \sqrt{n}(\overline{X}_n^{\ast} - \overline{X}_n)$, given the original sample $ X_1, \ldots, X_n$. In this case the bootstrap estimate can be calculated by Monte-Carlo simulations. Resamples are generated repeatedly, say $ m$ times, and for the $ j$-th resample the bootstrap statistic $ \Delta_j = \sqrt{n}(\overline{X}_n^{\ast} - \overline{X}_n)$ is calculated. This gives $ m$ values $ \Delta_1,\ldots,\Delta_m$. The bootstrap estimate $ T_n(\widehat{P}_n)$ is then approximated by the empirical distribution of these $ m$ values. E.g. the quantiles of the distribution $ T_n(P)$ of $ \sqrt {n} [\overline{X}_n - \mu(P)]$ are estimated by the sample quantiles of $ \Delta_1,\ldots,\Delta_m$. The bootstrap quantiles can be used to construct ''bootstrap confidence intervals'' for $ \mu(P)$. We will come back to bootstrap confidence intervals in Sect. 2.3.
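
To make the Monte-Carlo step concrete, here is a minimal Python sketch of i.i.d. resampling for this example; the function name, the exponential toy sample and the choice $ m=2000$ are ours and purely illustrative:

```python
import numpy as np

def iid_bootstrap_deltas(x, m=1000, rng=None):
    """Approximate T_n(P_hat_n) by Monte-Carlo: draw m i.i.d. resamples
    from the empirical distribution and return the values
    Delta_j = sqrt(n) * (mean of j-th resample - mean of x)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # each row of `resamples` is one resample X_1*, ..., X_n*
    resamples = rng.choice(x, size=(m, n), replace=True)
    return np.sqrt(n) * (resamples.mean(axis=1) - x.mean())

# usage: estimate quantiles of sqrt(n) * (X_bar - mu) by bootstrap quantiles
x = np.random.default_rng(0).exponential(size=50)      # some i.i.d. sample
deltas = iid_bootstrap_deltas(x, m=2000, rng=1)
q_lo, q_hi = np.quantile(deltas, [0.025, 0.975])
n = len(x)
# one way to turn the quantiles into an interval for mu(P)
ci = (x.mean() - q_hi / np.sqrt(n), x.mean() - q_lo / np.sqrt(n))
```

The last two lines only indicate how the bootstrap quantiles may be turned into an interval for $ \mu(P)$; bootstrap confidence intervals are discussed in Sect. 2.3.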

There are two other advantages of the plug-in view of the bootstrap. First, the estimate of $ P$ that is plugged into the functional $ T_n$ can be replaced by other estimates. For example, if one is willing to assume that the observations have a symmetric distribution around their mean, one could replace $ \widehat{P}_n$ by a symmetrized version. Or, if one uses a parametric model $ \{P_\theta: \theta \in \Theta\}$ for the observations, one could use $ P_{\hat \theta}$, where $ \hat \theta$ is an estimate of the parameter $ \theta$. In the latter case the procedure is also called the parametric bootstrap (a small sketch follows the list below). If the parametric model holds one may expect better accuracy of the parametric bootstrap, whereas, naturally, the ''nonparametric'' bootstrap is more robust against deviations from the model. We now come to another advantage of the plug-in view. It gives a good intuitive explanation of when the ''bootstrap works''. One says that the bootstrap works, or that the bootstrap is consistent, if the difference between $ T_n(\widetilde{P}_n)$ and $ T_n(P)$, measured by some distance, converges to zero. Here $ \widetilde{P}_n$ is some estimate of $ P$. The bootstrap will work when two conditions hold:

(1)
The estimate $ \widetilde{P}_n$ is a consistent estimate of $ P$, i.e. $ \widetilde{P}_n$ converges to $ P$, in some appropriate sense.
(2)
The functionals $ T_n$ are continuous, uniformly in $ n$.
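
As announced above, here is a minimal sketch of the parametric bootstrap for the running example, assuming (purely for illustration) a normal model for the observations; the function name and estimator choices are ours:

```python
import numpy as np

def parametric_bootstrap_deltas(x, m=1000, rng=None):
    """Parametric bootstrap sketch under an assumed normal model
    {P_theta : theta = (mu, sigma)}: theta is estimated from the data
    and resamples are drawn from P_theta_hat instead of P_hat_n."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat, sigma_hat = x.mean(), x.std(ddof=1)   # estimate theta
    resamples = rng.normal(mu_hat, sigma_hat, size=(m, n))
    return np.sqrt(n) * (resamples.mean(axis=1) - mu_hat)
```

Whether this or the nonparametric scheme is preferable depends, as noted above, on how much one trusts the parametric model.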

Consistency of the bootstrap has been proved for a broad variety of models and for a large class of different bootstrap resampling schemes. Typically, the proofs use an approach other than (1) and (2). Using asymptotic theory one can often verify that $ T_n(\widetilde{P}_n)$ and $ T_n(P)$ have the same limiting distribution; see [6] for one of the first consistency proofs for the bootstrap. In our example, if the observations have a finite variance $ \sigma^2(P)$ then both $ T_n(\widetilde{P}_n)$ and $ T_n(P)$ have the normal limit $ N(0,\sigma^2(P))$. For a more general discussion of the approach based on (1) and (2), see also Ducharme (1991) [4]. The importance of (1) and (2) also lies in the fact that they give an intuitive explanation of when the bootstrap works. For a recent discussion of whether assumption (2) is necessary see also [40].
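
One common way to make this consistency statement concrete in the running example (this particular formulation is ours) is via the Kolmogorov distance between the two distribution functions,

$\displaystyle \sup_x \left\vert P^{\ast}\left (\sqrt{n}\left [\overline{X}_n^{\ast} - \overline{X}_n\right ] \leq x \right ) - P\left (\sqrt{n}\left [\overline{X}_n - \mu(P)\right ] \leq x \right ) \right\vert \longrightarrow 0{},$

where $ P^{\ast}$ denotes the conditional probability given the original sample; if the observations have a finite variance, this convergence holds almost surely.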

There also exist bootstrap methods that cannot be written or interpreted as plug-in estimates. This concerns bootstrap methods in which random weights are generated instead of random (pseudo) observations (see [9]). It may also happen in applications where the data model is not fully specified. Important examples are models for dependent data. Whereas classical parametric time series models specify the full distribution of the complete data vector, some non- and semi-parametric models only describe the distribution of neighboring observations. Then the full data generating process is not specified, and a basic problem arises: how should bootstrap resamples be generated? Several interesting proposals exist, and research on the bootstrap for dependent data is still ongoing. We give a short introduction to this topic in Sect. 2.4. It is a nice example of an active research field on the bootstrap.

Several reasons have been given why the bootstrap should be applied. The bootstrap can be compared with other approaches. In our example the classical approach would be to use the normal approximation $ N(0,\sigma^2(\widehat{P}_n))$. It has been shown that the bootstrap works if and only if the normal approximation works, see Mammen (1992a) [56]. This even holds if the observations are not identically distributed. Furthermore, one can show that the rate of convergence of both the bootstrap and the normal approximation is $ n^{-1/2}$. This result can be shown by using Edgeworth expansions. We will give a short outline of the argument. The distribution function $ F(x) = P(\sqrt{n} [\overline{X}_n - \mu(P)] \leq x)$ can be approximated by

$\displaystyle \boldsymbol{\Phi}\left ( \frac{x}{\sigma(P)}\right ) - \frac{1}{6\sqrt{n}} \frac{\mu_3(P)}{\sigma(P)^{3}} \left [\left ( \frac{x}{\sigma(P)}\right )^2 - 1\right ]\phi \left ( \frac{x}{\sigma(P)}\right ){}.$

Here, $ \boldsymbol{\Phi}$ is the distribution function of the standard normal distribution and $ \phi$ is its density. $ \mu_3(P) = E [X_i - \mu(P)]^3$ is the third centered moment of the observations $ X_i$. Under regularity conditions this approximation holds with errors of order $ O(n^{-1})$. For the bootstrap estimate of $ F$ a similar expansion can be shown where $ \sigma(P)$ and $ \mu_3(P)$ are replaced by their sample versions $ \sigma(\widehat{P}_n)$ and $ \mu_{3}(\widehat{P}_n)= n^{-1} \sum_{i=1}^n (X_i- \overline{X}_n)^3$:

$\displaystyle \boldsymbol{\Phi}\left ( \frac{x}{\sigma(\widehat{P}_n)}\right ) - \frac{1}{6\sqrt{n}} \frac{\mu_3(\widehat{P}_n)}{\sigma(\widehat{P}_n)^{3}} \left [\left ( \frac{x}{\sigma(\widehat{P}_n)}\right )^2 - 1\right ]\phi \left ( \frac{x}{\sigma(\widehat{P}_n)}\right ){}.$

The difference between the bootstrap estimate and $ F$ is of order $ n^{-1/2}$ because the first order terms $ \boldsymbol{\Phi}( {x / \sigma(P)})$ and $ \boldsymbol{\Phi}( {x / \sigma(\widehat{P}_n)} )$ differ by a term of order $ O_P(n^{-1/2})$, since the same holds for $ \sigma(P)$ and $ \sigma(\widehat{P}_n)$. Thus there seems to be no asymptotic advantage in using the bootstrap compared to the classical normal approximation, although the skewness of the distribution is accurately mimicked by the bootstrap. However, if the functional $ T_n$ is replaced by the distribution of the studentized statistic $ \sqrt{n} \sigma(\widehat{P}_n)^{-1} (\overline{X}_n - \mu(P))$, then the bootstrap achieves a rate of convergence of order $ n^{-1}$, whereas the normal approximation $ N(0,1)$ still only has an accuracy of order $ n^{-1/2}$. Again, this can easily be seen by Edgeworth expansions. For the distribution function of the studentized statistic the following expansion holds with accuracy $ O(1/n)$:

$\displaystyle \boldsymbol{\Phi}(x) + \frac{1}{6 \sqrt n} \frac{\mu_3 (P)}{\sigma(P)^{3}} \left [2 x^2 + 1\right ]\phi(x){}.$    

The normal approximation $ \boldsymbol{\Phi}(x)$ differs from this expansion by terms of order $ O(n^{-1/2})$. For the bootstrap estimate one gets the following expansion with error terms of order $ O(1/n)$:

$\displaystyle \boldsymbol{\Phi}(x) + \frac{1}{6 \sqrt n} \frac{\mu_3 (\widehat{P}_n)}{\sigma(\widehat{P}_n)^{3}} \left [2 x^2 + 1\right ]\phi(x){}.$    

This approximates the distribution function of the studentized statistic with accuracy $ O_P(n^{-1})$ because $ \mu_3 (\widehat{P}_n)-\mu_3(P) = O_P(n^{-1/2})$ and $ \sigma(\widehat{P}_n)- \sigma(P) = O_P(n^{-1/2})$. This means that in this case the classical normal approximation is outperformed by the bootstrap. This result for studentized statistics has been used as the main asymptotic argument for the bootstrap. It has been verified for a large class of models and resampling methods. For a rigorous and detailed discussion see [32].
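
A minimal Python sketch of this studentized construction (often called a bootstrap-t interval); the function name, the confidence level and the number of resamples are our illustrative choices, and the plug-in standard deviation $ \sigma(\widehat{P}_n)$ is used as in the text:

```python
import numpy as np

def bootstrap_t_interval(x, m=2000, level=0.95, rng=None):
    """Sketch of a studentized ('bootstrap-t') interval for mu(P):
    the quantiles of t* = sqrt(n) (mean(x*) - mean(x)) / sigma(x*)
    replace the standard normal quantiles."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    x_bar = x.mean()
    sigma_hat = x.std(ddof=0)                    # plug-in estimate sigma(P_hat_n)
    resamples = rng.choice(x, size=(m, n), replace=True)
    t_star = (np.sqrt(n) * (resamples.mean(axis=1) - x_bar)
              / resamples.std(axis=1, ddof=0))   # studentized resample statistics
    alpha = 1.0 - level
    q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1.0 - alpha / 2])
    # invert P(q_lo <= sqrt(n)(X_bar - mu)/sigma_hat <= q_hi) ~ level
    return (x_bar - q_hi * sigma_hat / np.sqrt(n),
            x_bar - q_lo * sigma_hat / np.sqrt(n))
```

The resulting interval reflects the $ O_P(n^{-1})$ accuracy of the studentized bootstrap discussed above, in contrast to the $ n^{-1/2}$ accuracy of the plain normal approximation.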

There also exist other arguments in favor of the bootstrap. For linear models with increasing dimension it has been shown in [7] and [55,57,58] that the bootstrap works under weaker conditions than the normal approximation. These results have been extended to more general sequences of models and resampling schemes, see [9] and the references cited therein. These results indicate that the bootstrap may still give reasonable results even when the normal approximation does not work. For many applications this type of result may be more important than a comparison of higher order performances. Higher order Edgeworth expansions only work if the simple normal approximation is already quite reasonable. But then the normal approximation is sufficient for most statistical applications because typically very accurate approximations are not required. For example, an actual level of $ .06$ instead of an assumed level of $ .05$ may not lead to a misleading statistical analysis. Thus one may argue that higher order Edgeworth expansions can only be applied when they are not really needed, and that for this reason they are not the appropriate method for judging the performance of the bootstrap. On the other hand, no other mathematical technique is available that works for such a large class of problems as Edgeworth expansions do. Thus there is no general alternative way of checking the accuracy of the bootstrap and of comparing it with normal approximations.

The bootstrap is a very important tool for statistical models where classical approximations are not available or where they are not given in a simple form. Examples arise in the construction of tests and confidence bands in nonparametric curve estimation. Here approximations based on the central limit theorem lead to distributions of functionals of Gaussian processes. Often these distributions are not explicitly known and must be calculated by simulations of Gaussian processes. We will give an example in the next section (the number of modes of a kernel smoother as a function of the bandwidth). Compared with classical asymptotic methods the bootstrap offers approaches for a much broader class of statistical problems.

By now, the bootstrap is a standard method of statistics. It has been discussed in a series of papers, overview articles and books. The books [20,22,17] give a very insightful introduction to possible applications of the bootstrap in different fields of statistics. The books [4,57] contain a more technical treatment of the consistency of the bootstrap, see also [27]. Higher order performance of the bootstrap is discussed in the book [32]. The book [76] gives a rather complete overview of the theoretical results on the bootstrap up to the mid-nineties. The book [72] gives a complete discussion of subsampling, a resampling method where the resample size is smaller than the size of the original sample. The book [49] discusses the bootstrap for dependent data. Some overview articles are contained in Statistical Science (2003), Vol. 18, Number 2. There, [21] gives a short (re)discussion of bootstrap confidence intervals, [18] reports on recent developments of the bootstrap, in particular in classification, [33] discusses the roots of the bootstrap, and [8,3] give a short introduction to the bootstrap; further articles give an overview of recent developments of bootstrap applications in different fields of statistics. Overview articles on special bootstrap applications have been given for sample surveys ([74,75,48]), econometrics ([36,37,38]), nonparametric curve estimation ([28,59]), estimating functions ([51]) and time series analysis ([12,30,70]).

