

2.3 Resampling Tests and Confidence Intervals

In the last section we pointed out how resampling can offer additional insights in a data analysis. We now discuss applications of the bootstrap that are more in the tradition of classical statistics: resampling approaches for the construction of confidence intervals and of testing procedures. The majority of the vast bootstrap literature is devoted to these topics. There exist two basic approaches for the construction of confidence regions: methods based on (asymptotic) pivot statistics and percentile-type methods that use bootstrap quantiles of an estimate directly. We will outline both methods below. There also exist two basic approaches for the construction of resampling tests: bootstrap tests and conditional (resampling) tests. We will discuss testing after confidence intervals.

Approaches based on pivot statistics are classical methods for the construction of confidence sets. In a statistical model $ \{P_{\theta}:\theta\in\Theta\}$ a pivot statistic is a random quantity $ Q=Q(\theta, X)$ that depends on the unknown parameter $ \theta$ and on the observation (vector) $ X$ and whose distribution under $ P_{\theta}$ does not depend on $ \theta$. Thus the distribution of $ Q$ is known and one can calculate quantiles $ q_{1,\alpha},q_{2,\alpha}$ such that $ P_{\theta} \{ q_{1,\alpha}\leq Q(\theta, X) \leq q_{2,\alpha} \} = 1-\alpha$. Then $ C_\alpha=\{ \theta \in \Theta: q_{1,\alpha}\leq Q(\theta, X) \leq q_{2,\alpha} \}$ is a confidence set for the unknown parameter $ \theta$ with coverage probability $ P( \theta \in C_\alpha ) = 1- \alpha$. A classical example is given by i.i.d. normal observations $ X_i$ with mean $ \mu$ and variance $ \sigma ^2$. Then $ Q= (\overline{X} - \mu)/\widehat{\sigma}$ is a pivot statistic. Here $ \overline{X}$ is the sample mean and $ \widehat{\sigma}^2 = (n-1)^{-1} \sum_{i=1}^n (X_i-\overline{X})^2$ is the sample variance. In this case, e.g., $ C_{\alpha} = [\overline{X} - n^{-1/2} k_{1-\alpha/2} \widehat{\sigma} , \overline{X} + n^{-1/2} k_{1-\alpha/2} \widehat{\sigma}]$ is a confidence interval for $ \mu$ with exact coverage probability $ 1-\alpha$, where $ k_{1-\alpha/2}$ is the $ 1-\alpha/2$ quantile of the t-distribution with $ n-1$ degrees of freedom.
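As a small illustration, the following is a minimal sketch of this exact pivot interval in Python (assuming i.i.d. normal data; the function name t_pivot_interval and the use of SciPy for the t quantile are our choices, not part of the text):

    import numpy as np
    from scipy import stats

    def t_pivot_interval(x, alpha=0.10):
        """Exact pivot confidence interval for the mean of i.i.d. normal data."""
        x = np.asarray(x, dtype=float)
        n = x.size
        xbar = x.mean()
        sigma_hat = x.std(ddof=1)                      # sample standard deviation with (n-1) denominator
        k = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)   # (1 - alpha/2) quantile of t with n-1 df
        half_width = k * sigma_hat / np.sqrt(n)
        return xbar - half_width, xbar + half_width

    # example: 90% confidence interval for the mean of a simulated normal sample
    rng = np.random.default_rng(0)
    print(t_pivot_interval(rng.normal(loc=1.0, scale=2.0, size=50)))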

Pivot statistics exist only in very rare cases. However, for a very rich class of settings one can find statistics $ Q=Q(\theta, X)$ that have a limiting distribution $ \mathcal{L}(\theta)$ depending smoothly on $ \theta$. Such statistics are called asymptotic pivot statistics. If $ q_{1,\alpha},q_{2,\alpha}$ are chosen such that under $ \mathcal{L}(\widehat{\theta})$ the interval $ [q_{1,\alpha}, q_{2,\alpha}]$ has probability $ 1-\alpha$, then $ P( \theta \in C_{\alpha} )$ converges to $ 1-\alpha$. Here $ \widehat{\theta}$ is a consistent estimate of $ \theta$ and the confidence set $ C_{\alpha}$ is defined as above. A standard example arises when an estimate $ \widehat{\tau}$ of a (one-dimensional, say) parameter $ \tau = \tau(\theta)$ is available that is asymptotically normal: $ \sqrt{n} (\widehat{\tau} - \tau)$ converges in distribution to a normal limit with mean zero and variance $ \sigma^2(\theta)$ depending on the unknown parameter $ \theta$. Here $ Q= \sqrt{n} (\widehat{\tau} - \tau)$ or the studentized version $ Q= \sqrt{n} (\widehat{\tau} - \tau)/\sigma(\widehat{\theta})$, with a consistent estimate $ \widehat{\theta}$ of $ \theta$, could be used as asymptotic pivot. Asymptotic pivot confidence intervals are based on the quantiles of the asymptotic distribution $ \mathcal{L}$ of $ Q$. The bootstrap idea is instead to use the finite sample distribution $ \mathcal{L}_n(\theta)$ of the pivot statistic $ Q$. This distribution depends on $ n$ and on the unknown parameter $ \theta$; the bootstrap estimates the unknown parameter and plugs it in. Bootstrap quantiles for $ Q$ are then defined as the (random) quantiles of $ \mathcal{L}_n(\widehat{\theta})$. For the unstudentized statistic $ Q= \sqrt{n} (\widehat{\tau} - \tau)$ we get the bootstrap confidence interval $ [\widehat{\tau} - n^{-1/2} \widehat{q}_{2,\alpha}, \widehat{\tau} - n^{-1/2} \widehat{q}_{1,\alpha}]$, where $ \widehat{q}_{1,\alpha}$ is the $ \alpha/2$ bootstrap quantile and $ \widehat{q}_{2,\alpha}$ is the $ 1-\alpha/2$ bootstrap quantile. This confidence interval has an asymptotic coverage probability equal to $ 1-\alpha$. We illustrate this approach with the data example of the last section. Suppose we fit a GARCH(1,1) model to the log returns and we want a confidence interval for $ \tau = a_1+b_1$. It is known that a GARCH(1,1) process is covariance stationary if and only if $ \vert\tau\vert < 1$. For values of $ \tau$ close to $ 1$, shocks are highly persistent in the process. We now construct a bootstrap confidence interval for $ \tau$, using $ Q= \sqrt{n} (\widehat{\tau} - \tau)$ as asymptotic pivot statistic. The results are summarized in Table 2.1.
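The following is a minimal sketch of this pivot-based bootstrap interval in Python. It is not the model-based GARCH(1,1) bootstrap used for Table 2.1; as a stand-in it uses the simple i.i.d. nonparametric bootstrap with the sample mean as estimator, and all function names are illustrative:

    import numpy as np

    def pivot_bootstrap_interval(x, estimator, alpha=0.10, n_boot=999, seed=0):
        """Bootstrap interval based on the asymptotic pivot Q = sqrt(n)*(tau_hat - tau)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        tau_hat = estimator(x)
        # bootstrap replicates of Q, with tau_hat playing the role of the unknown tau
        q = np.array([
            np.sqrt(n) * (estimator(rng.choice(x, size=n, replace=True)) - tau_hat)
            for _ in range(n_boot)
        ])
        q1, q2 = np.quantile(q, [alpha / 2.0, 1.0 - alpha / 2.0])
        # note the reversed quantiles: [tau_hat - q2/sqrt(n), tau_hat - q1/sqrt(n)]
        return tau_hat - q2 / np.sqrt(n), tau_hat - q1 / np.sqrt(n)

    # example: 90% interval for the mean of a skewed sample
    rng = np.random.default_rng(1)
    print(pivot_bootstrap_interval(rng.exponential(size=200), np.mean))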


Table 2.1: Estimate of $ a_1 + b_1$ and $ 90\,{\%}$ bootstrap confidence interval using GARCH(1,1) bootstrap (asymptotic pivot method)
$ \widehat{a}_1 + \widehat{b}_1$ Confidence lower bound Upper bound
0.9919 0.9874 0.9960

We also applied the GARCH(1,1) bootstrap to the first half and to the second half of our data set. The results are summarized in Table 2.2. The value of $ \widehat{\tau}$ is quite similar for both halves. The fitted parameter of each half is contained in the confidence interval based on the other half of the sample, and both confidence intervals overlap substantially. So there seems to be no reason to expect different values of $ \tau$ for the two halves of the data. The situation becomes a little confusing if we compare Table 2.2 with Table 2.1. Neither of the fitted values of $ \tau$, for the first half and for the second half, is contained in the confidence interval based on the whole sample. This suggests that a GARCH(1,1) model with fixed parameters for the whole sample is not an appropriate model; a model with parameter values that change over time seems more realistic. When a GARCH(1,1) model is fitted to the whole time series, the change of the parameters over time forces the persistence parameter $ \tau$ closer to $ 1$, and this effect increases for GARCH fits over longer periods. We do not want to discuss this point further here and refer to [61] for more details.


Table 2.2: Estimate of $ a_1 + b_1$ and $ 90\,{\%}$ bootstrap-t confidence interval using GARCH(1,1) bootstrap for the first half and for the second half of the DAX returns (asymptotic pivot method)
  $ \widehat{a}_1 + \widehat{b}_1$ Confidence lower bound Upper bound
Using Part I 0.9814 0.9590 0.9976
Using Part II 0.9842 0.9732 0.9888

In [19] another approach for confidence intervals was suggested: the bootstrap quantiles of the estimate itself are used directly as bounds of the bootstrap confidence interval. In our example the estimate $ \widehat{\tau}$ is then calculated repeatedly for bootstrap resamples and the $ 5\,{\%}$ and $ 95\,{\%}$ empirical quantiles are used as lower and upper bound of the bootstrap confidence interval. It can easily be checked that this yields $ [\widehat{\tau} + n^{-1/2} \widehat{q}_{1,\alpha}, \widehat{\tau} + n^{-1/2} \widehat{q}_{2,\alpha}]$ as bootstrap confidence interval, where the quantiles $ \widehat{q}_{1,\alpha}$ and $ \widehat{q}_{2,\alpha}$ are defined as above, see also [22]. Note that the interval is just the pivot interval reflected around $ \widehat{\tau}$. The resulting confidence interval for $ \tau$ is shown in Table 2.3. For asymptotically normal statistics both bootstrap confidence intervals are asymptotically equivalent. Using higher order Edgeworth expansions it has been shown that bootstrap pivot intervals achieve a higher order of coverage accuracy. Modifications of percentile intervals have been proposed that achieve coverage accuracy of the same order, see [22]. For a recent discussion on bootstrap confidence intervals see also [21,18]. In our data example there is only a minor difference between the two intervals, cf. Tables 2.1 and 2.3. This may be caused by the very large sample size.
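For comparison, a minimal sketch of the percentile interval under the same stand-in assumptions as above (i.i.d. nonparametric bootstrap, sample mean as estimator, illustrative function names):

    import numpy as np

    def percentile_interval(x, estimator, alpha=0.10, n_boot=999, seed=0):
        """Percentile bootstrap interval: empirical quantiles of the bootstrapped
        estimate are used directly as the interval bounds."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        tau_star = np.array([
            estimator(rng.choice(x, size=n, replace=True)) for _ in range(n_boot)
        ])
        return tuple(np.quantile(tau_star, [alpha / 2.0, 1.0 - alpha / 2.0]))

    # example: compare with the pivot-based interval from the sketch above
    rng = np.random.default_rng(1)
    x = rng.exponential(size=200)
    print(percentile_interval(x, np.mean))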


Table 2.3: Estimate of $ a_1 + b_1$ and $ 90\,{\%}$ bootstrap percentile confidence interval using GARCH(1,1) bootstrap
$ \widehat{a}_1 + \widehat{b}_1$ Confidence lower bound Upper bound
0.9919 0.9877 0.9963

The basic idea of bootstrap tests is rather simple. Suppose that for a statistical model $ \{P_{\theta}:\theta\in\Theta\}$ a testing hypothesis $ \theta \in \Theta_0 \subset \Theta$ and a test statistic $ T(X)$ are given. Then the bootstrap is used to calculate critical values for $ T(X)$. This can be done by fitting a model on the hypothesis and by generating bootstrap resamples under the fitted hypothesis model. The $ 1-\alpha$ quantile of the test statistic in the bootstrap samples can be used as critical value. The resulting test is called a bootstrap test. Alternatively, a testing approach can be based on the duality of testing procedures and confidence regions. Each confidence region defines a testing procedure by the following rule: a hypothesis is rejected if no hypothesis parameter lies in the confidence region. We briefly describe this method for bootstrap confidence intervals based on an asymptotic pivot statistic, say $ \sqrt {n} (\widehat{\theta}_n -\theta)$, and the hypothesis $ \Theta_0 = (-\infty,\theta_0] \subset {\mathbb{R}}$. Bootstrap resamples are generated (in the unrestricted model) and are used for estimating the $ 1-\alpha$ quantile of $ \sqrt {n} (\widehat{\theta}_n -\theta)$ by $ \widehat{k}_{1-\alpha}$, say. The bootstrap test rejects the hypothesis if $ \sqrt {n} (\widehat{\theta}_n -\theta_0)$ is larger than $ \widehat{k}_{1-\alpha}$. Higher order performance of bootstrap tests has been discussed in Hall (1992) [32]. For a discussion of bootstrap tests we also refer to Beran (1988) and Beran and Ducharme (1991) [2,4].
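A minimal sketch of this duality-based one-sided test, again with the i.i.d. nonparametric bootstrap and the sample mean as a stand-in for $ \widehat{\theta}_n$ (function names are illustrative):

    import numpy as np

    def bootstrap_test_one_sided(x, theta0, estimator, alpha=0.10, n_boot=999, seed=0):
        """Reject H0: theta <= theta0 if sqrt(n)*(theta_hat - theta0) exceeds the
        bootstrap estimate of the (1 - alpha) quantile of sqrt(n)*(theta_hat - theta)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        n = x.size
        theta_hat = estimator(x)
        q = np.array([
            np.sqrt(n) * (estimator(rng.choice(x, size=n, replace=True)) - theta_hat)
            for _ in range(n_boot)
        ])
        k_hat = np.quantile(q, 1.0 - alpha)
        return np.sqrt(n) * (theta_hat - theta0) > k_hat

    # example: test H0: mean <= 0 for a sample with true mean 0.2
    rng = np.random.default_rng(2)
    x = rng.normal(loc=0.2, scale=1.0, size=300)
    print(bootstrap_test_one_sided(x, theta0=0.0, estimator=np.mean))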

We now compare bootstrap testing with a more classical resampling approach for testing ("conditional tests"). There exist some (important) examples where, for all test statistics, resampling can be used to achieve a correct level on the whole hypothesis for finite samples. Such tests are called similar. For some testing problems resampling tests turn out to be the only way to get similar tests. This situation arises when a statistic is available that is sufficient on the hypothesis $ \{P_{\theta}:\theta\in\Theta_0\}$. Then, by the definition of sufficiency, the conditional distribution of the data set given this statistic is fixed on the hypothesis and does not depend on the parameter of the underlying distribution, as long as the parameter lies on the hypothesis. Furthermore, because this conditional distribution is unique and thus known, resamples can be drawn from it. The resampling test then has correct level on the whole hypothesis. We will now give a more formal description.

A test $ \phi(X)$ for a vector $ X$ of observations is called similar if $ E_{\theta} ~\phi(X) = \alpha$ for all $ \theta \in \Theta_0$, where $ \Theta_0$ is the set of parameters on the null hypothesis. We suppose that a statistic $ S$ is available that is sufficient on the hypothesis. Let $ \mathcal{P}_0 = \{P_{\theta}: \theta \in \Theta_0\}$ be the family of distributions of $ X$ on the hypothesis. Then the conditional distribution of $ X$ given $ S$ does not depend on the underlying parameter $ \theta \in \Theta_0$ because $ S$ is sufficient. In particular, $ E(\phi(X)\vert S=s)$ does not depend on $ \theta$. Then any test satisfying

$\displaystyle E\left[\phi(X)\vert S=s\right] =\alpha$ (2.1)

is similar on $ \mathcal{P}_0$. This immediately follows from

$\displaystyle E[\phi(X)] = E E[\phi(X)\vert S] =\alpha{}.$    

A test satisfying (2.1) is said to have Neyman structure with respect to $ S$.

For a given test statistic $ T$ similar tests can be constructed by choosing $ k_{\alpha}(S)$ such that

$\displaystyle P[T> k_{\alpha} (S)\vert S=s] =\alpha {}.$ (2.2)

Here the conditional probability on the left-hand side does not depend on $ \theta$ because $ S$ is assumed to be sufficient. We now argue that this is the only way to construct similar tests if the family of distributions of $ S$ (for $ \theta \in \Theta_0$) is "rich enough" (complete): for such families, $ E_{\theta}\, u(S) = 0$ for all $ \theta \in \Theta_0$ for a function $ u$ implies $ u(s) \equiv 0$. In particular, with $ u(s) = P[T> k_{\alpha} (S)\vert S=s]- \alpha$, the test $ T> k_{\alpha} (S)$ is similar if and only if $ E_{\theta}\, u(S) = 0$ for all $ \theta \in \Theta_0$, and this implies $ u(s) \equiv 0$, i.e. (2.2). Thus, given a test statistic $ T$, the only way to construct similar tests is to choose $ k_{\alpha}(S)$ according to (2.2). The relation between similar tests and tests with Neyman structure belongs to the classical material of mathematical statistics and can be found in textbooks, e.g. [50].

We will consider two examples of conditional tests. The first example is permutation tests. For a sample of observations $ X=(X_1,\ldots,X_n)$ the order statistic $ S=(X_{(1)},\ldots,X_{(n)})$ containing the ordered sample values $ X_{(1)}\leq \ldots\leq X_{(n)}$ is sufficient on the hypothesis of i.i.d. observations. Given $ S$, the conditional distribution of $ X$ is that of a random permutation of $ X_1, \ldots, X_n$. The resampling scheme is very similar to the nonparametric bootstrap: $ n$ pseudo observations are drawn from the original data sample, but now without replacement, whereas in the bootstrap scheme this is done with replacement. For a comparison of bootstrap and permutation tests see also [41]. Also for subsampling (i.e. resampling with a resample size that is smaller than the sample size) both schemes (with and without replacement) have been considered. For a detailed discussion of subsampling without replacement see [72].
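The following is a minimal sketch of a permutation test of the i.i.d. hypothesis against serial dependence. The choice of test statistic (lag-1 autocorrelation) and all function names are illustrative assumptions, not taken from the text:

    import numpy as np

    def permutation_test(x, statistic, n_resamples=999, seed=0):
        """Permutation test of the i.i.d. hypothesis: under the null the order
        statistic is sufficient, so resamples are random permutations of the data
        (draws without replacement)."""
        rng = np.random.default_rng(seed)
        t_obs = statistic(x)
        t_perm = np.array([statistic(rng.permutation(x)) for _ in range(n_resamples)])
        # one-sided Monte Carlo p-value (the observed sample counts as one resample)
        return (1 + np.sum(t_perm >= t_obs)) / (n_resamples + 1)

    def lag1_autocorr(x):
        x = np.asarray(x, dtype=float)
        xc = x - x.mean()
        return np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2)

    # example: the test detects the serial dependence of an AR(1) series
    rng = np.random.default_rng(1)
    eps = rng.standard_normal(200)
    y = np.empty(200)
    y[0] = eps[0]
    for t in range(1, 200):
        y[t] = 0.5 * y[t - 1] + eps[t]
    print(permutation_test(y, lag1_autocorr))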

The second example is a popular approach in the physics literature on nonlinear time series analysis. For odd sample size $ n$ a series $ X_1, \ldots, X_n$ can be written as

$\displaystyle X_t= \overline{X} + \sqrt {\frac{2 \pi}{n}} \sum_{j=1}^{(n-1)/2} 2 \sqrt{ I_X\left (\frac{2 \pi j}{n}\right )} \cos \left ( \frac{2\pi j}{n} t + \theta_j\right)$    

with sample mean $ \overline{X}$, periodogram $ I_X(\omega) = \frac{1}{2 \pi n} \left \vert
\sum_{t=1}^n X_t \exp (-i \omega t) \right \vert^2$ and phases $ \theta_j$. On the hypothesis that $ X$ is a circular stationary Gaussian process the statistic $ S= ( \overline{X},I_X (2
\pi j /n): j=1,\ldots, (n-1)/2)$ is sufficient. Conditional on $ S$, the phases $ \theta_j$ are i.i.d. with a uniform distribution on $ [0,2 \pi]$. Resamples with observations

$\displaystyle X_t^{\ast}= \overline{X} + \sqrt {\frac{2 \pi}{n}} \sum_{j=1}^{(n-1)/2} 2 \sqrt{ I_X\left (\frac{2 \pi j}{n}\right )} \cos \left ( \frac{2\pi j}{n} t + \theta_j^{\ast}\right ){},$    

where the $ \theta_j^{\ast}$ are i.i.d. with uniform distribution on $ [0,2 \pi]$, are called "surrogate data". They can be used to construct similar tests for the hypothesis of circular stationary Gaussian processes. In the physics literature these tests are applied for testing the hypothesis of stationary Gaussian processes. It is argued that for tests that do not heavily depend on boundary observations the difference between stationarity and circular stationarity becomes negligible for large data sets. Surrogate data tests are used as a first check of whether deterministic nonlinear time series models are appropriate for a data set. The relation of surrogate data tests to conditional tests was first observed in [14]. The method of surrogate data was first proposed in [77].
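A minimal sketch of generating one surrogate series via phase randomization with the fast Fourier transform; the use of numpy.fft, the handling of the zero frequency, and the restriction to odd sample size are our assumptions:

    import numpy as np

    def surrogate(x, rng):
        """One surrogate series: the periodogram (Fourier amplitudes) is kept,
        the phases theta_j are replaced by i.i.d. uniform draws on [0, 2*pi)."""
        x = np.asarray(x, dtype=float)
        n = x.size                                # assumed odd, so there is no Nyquist frequency
        d = np.fft.rfft(x)                        # frequencies 2*pi*j/n, j = 0, ..., (n-1)/2
        phases = rng.uniform(0.0, 2.0 * np.pi, size=d.size)
        d_star = np.abs(d) * np.exp(1j * phases)
        d_star[0] = d[0]                          # keep the sample mean (j = 0) unchanged
        return np.fft.irfft(d_star, n=n)

    # example: a surrogate resample of a Gaussian series
    rng = np.random.default_rng(0)
    x = rng.standard_normal(201)
    x_star = surrogate(x, rng)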

We would like to highlight a major difference between bootstrap tests and conditional tests. Bootstrap tests work if they are based on resampling an asymptotic pivot statistic: the bootstrap critical values then stabilize asymptotically and converge to the quantile of the limiting distribution of the test statistic. For conditional tests the situation is quite different. They work for all test statistics; however, it is not guaranteed for every test statistic that the critical value $ k_{\alpha}(S)$ converges to a deterministic limit. In [60] this is discussed for surrogate data tests. It is shown that even for very large data sets the surrogate data quantile $ k_{\alpha}(S)$ may have a variance of the same order as the test statistic $ T$. Thus the randomness of $ k_{\alpha}(S)$ may change the nature of a test. This is illustrated by a test statistic for the kurtosis of the observations that is transformed into a test for circular stationarity.

