In a data analysis the statistician wants to gain a basic understanding
of the stochastic nature of the data. For this purpose
several data analytic tools are applied and their results interpreted.
A basic problem of data analysis is over-interpretation of the
results after a battery of statistical methods has been applied.
A similar situation occurs in multiple testing but there exist
approaches to capture the joint stochastics of several test
procedures. The situation becomes more involved in modern graphical
data analysis. The outcomes of a data analytic tool are plots and the
interpretation of the data analysis relies on the interpretation of
these (random) plots. There is no easy way to have an understanding
of the joint distribution of the inspected graphs. The situation is
already complicated if only one graph is checked. Typically it is not
clearly specified for which characteristics the plot is checked. We
will illustrate this by a simple example. We will argue that the
bootstrap and other
resampling methods offer a simple way to get a basic understanding of
the stochastic nature of plots that depend on random data. In the
next section we will discuss how this more intuitive approach can be
translated into the formal setting of mathematical decision
theoretical statistics. Our example is based on the study of
a financial
time series. Figure 2.1 shows the daily values of the German
DAX index from end of 1993 until November 2003. In Fig. 2.2
mean-corrected log returns are shown. Logreturns for a series $S_t$
are defined as the differences $\log S_t - \log S_{t-1}$. Mean-corrected logreturns
are defined as this difference minus its sample average. Under
the Black-Scholes model the logreturns
are i.i.d. It belongs to
folklore in finance that this does not hold. We now illustrate how
this could be seen by application of
resampling methods.
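The quantities involved are easy to compute. The sketch below uses a synthetic price series as a stand-in for the DAX values (a hypothetical example, not the data of Fig. 2.1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a daily index series such as the DAX
prices = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal(2500)))

# Logreturns: log S_t - log S_{t-1}
log_returns = np.diff(np.log(prices))

# Mean-corrected logreturns: subtract the sample average
x = log_returns - log_returns.mean()
```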
Figure 2.3 shows a plot of the same logreturns as in
Fig. 2.2 but with changed order. The logreturns are plotted
against
a random permutation of the days. The clusters appearing in
Fig. 2.2 disappear. Figure 2.3 shows that these
clusters could not be explained by stochastic fluctuations. The same
story is told in Fig. 2.4. Here
a bootstrap sample of the logreturns is shown. Logreturns are drawn
with replacement from the set of all logreturns
(i.i.d.
resampling) and they are plotted in the order in which they were drawn.
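Both i.i.d. resampling schemes, the random permutation of the days and the nonparametric bootstrap, are one-liners. In this sketch `x` is a synthetic stand-in for the mean-corrected logreturns:

```python
import numpy as np

rng = np.random.default_rng(1)
x = 0.01 * rng.standard_normal(2500)   # stand-in for mean-corrected logreturns

# Permutation resampling: same values, random order of days
x_perm = rng.permutation(x)

# Nonparametric i.i.d. bootstrap: draw with replacement,
# plot in the order in which the values were drawn
x_boot = rng.choice(x, size=x.size, replace=True)
```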
Again the clusters disappear and the same happens for typical
repetitions of random permutation or
bootstrap plots. The clusters in Fig. 2.2 can be interpreted
as volatility clusters. The volatility of a logreturn for a day is
defined as the conditional variance of the logreturn given the
logreturns of all past days. The volatilities of neighboring days are
positively correlated. This results in volatility clusters.
A popular approach to model the clusters is the class of
GARCH (Generalized Autoregressive Conditionally Heteroscedastic)
models. In the
GARCH(1,1) specification one assumes that $X_t = \sigma_t \varepsilon_t$,
where the $\varepsilon_t$ are i.i.d. errors with mean zero and variance 1
and where $\sigma_t^2$ is a random conditional variance process
fulfilling $\sigma_t^2 = \omega + \alpha X_{t-1}^2 + \beta \sigma_{t-1}^2$.
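A path from this recursion can be simulated directly. The parameter values in the sketch below are illustrative and not fitted to the DAX series:

```python
import numpy as np

def simulate_garch11(n, omega=1e-6, alpha=0.1, beta=0.85, seed=0):
    """Simulate X_t = sigma_t * eps_t with
    sigma_t^2 = omega + alpha * X_{t-1}^2 + beta * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)             # i.i.d. errors, mean 0, variance 1
    x = np.empty(n)
    sigma2 = np.empty(n)
    sigma2[0] = omega / (1 - alpha - beta)   # stationary variance as start value
    x[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, n):
        sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
        x[t] = np.sqrt(sigma2[t]) * eps[t]
    return x, sigma2

x, sigma2 = simulate_garch11(2500)
```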
Here $\omega$, $\alpha$ and $\beta$ are unknown parameters. Figure 2.5
shows a bootstrap realization of a fitted
GARCH(1,1) model. Fitted parameters $\hat\omega$, $\hat\alpha$
and $\hat\beta$ are calculated by a quasi-likelihood method
(i.e. the likelihood method for normal $\varepsilon_t$). In the
bootstrap resampling the errors $\varepsilon_t^*$ are generated by
i.i.d. resampling from the residuals
$\hat\varepsilon_t = X_t / \hat\sigma_t$,
where $\hat\sigma_t$ is the fitted volatility process with
$\hat\sigma_t^2 = \hat\omega + \hat\alpha X_{t-1}^2 + \hat\beta \hat\sigma_{t-1}^2$. An
alternative resampling would be to generate normal i.i.d. errors in the
resampling. This type of resampling is also called parametric
bootstrap. At first sight the volatility clusters in the parametric
bootstrap have a similar shape as in the plot of the observed
logreturns. Figure 2.6 shows local averages $\hat v_t$ over
squared logreturns. We have chosen $\hat v_t$ as the
Nadaraya-Watson estimate
$\hat v_t = \sum_u K\big((t-u)/h\big) X_u^2 \big/ \sum_u K\big((t-u)/h\big)$.
We used a Gaussian kernel $K$ and a bandwidth $h$ measured in
days. Figures 2.7-2.9
show the corresponding plots for the three
resampling methods. Again the plots for random permutation
resampling and nonparametric i.i.d.
bootstrap qualitatively differ from the plot for the observed
time series (Figs. 2.7 and 2.8). In Fig. 2.9 the
GARCH bootstrap shows a qualitatively similar picture as the original
logreturns, again not ruling out the
GARCH(1,1) model.
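The GARCH bootstrap just discussed can be sketched as follows. The fitted parameters are taken as given here (the values in the usage line are placeholders, not the quasi-likelihood estimates for the DAX), since the resampling step itself only needs the fitted volatility process and the residuals:

```python
import numpy as np

def garch_bootstrap(x, omega_hat, alpha_hat, beta_hat, seed=0):
    """One GARCH(1,1) bootstrap resample of a return series x.

    Recomputes the fitted volatility process sigma_hat_t^2, forms residuals
    eps_hat_t = x_t / sigma_hat_t, and feeds i.i.d. draws from these
    residuals back through the fitted recursion."""
    rng = np.random.default_rng(seed)
    n = x.size
    sigma2_hat = np.empty(n)
    sigma2_hat[0] = x.var()                  # simple start value
    for t in range(1, n):
        sigma2_hat[t] = (omega_hat + alpha_hat * x[t - 1] ** 2
                         + beta_hat * sigma2_hat[t - 1])
    resid = x / np.sqrt(sigma2_hat)          # estimated i.i.d. errors
    eps_star = rng.choice(resid, size=n, replace=True)

    # Generate the bootstrap series from the fitted recursion
    x_star = np.empty(n)
    sigma2_star = np.empty(n)
    sigma2_star[0] = sigma2_hat[0]
    x_star[0] = np.sqrt(sigma2_star[0]) * eps_star[0]
    for t in range(1, n):
        sigma2_star[t] = (omega_hat + alpha_hat * x_star[t - 1] ** 2
                          + beta_hat * sigma2_star[t - 1])
        x_star[t] = np.sqrt(sigma2_star[t]) * eps_star[t]
    return x_star

# Usage with synthetic returns and placeholder parameter values
rng = np.random.default_rng(2)
x_obs = 0.01 * rng.standard_normal(500)
x_star = garch_bootstrap(x_obs, omega_hat=1e-6, alpha_hat=0.1, beta_hat=0.85)
```

Replacing the residual draws by standard normal draws gives the parametric bootstrap mentioned above.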
As a last example we consider plots that measure local and global
shape characteristics of the
time series. We consider the number of local maxima of the kernel
smoother of the squared logreturns, viewed as a function of the
bandwidth $h$. We compare
this function with the number of local maxima for resamples.
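Counting the local maxima of a Nadaraya-Watson smoother of the squared series, as a function of the bandwidth, might be implemented as follows (a direct O(n²) sketch with an illustrative bandwidth grid, not the exact computation behind Figs. 2.10-2.12):

```python
import numpy as np

def nw_smooth(y, h):
    """Nadaraya-Watson estimate with Gaussian kernel over time points 0..n-1."""
    t = np.arange(y.size)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    return (w * y[None, :]).sum(axis=1) / w.sum(axis=1)

def count_local_maxima(m):
    """Number of interior points strictly larger than both neighbours."""
    return int(np.sum((m[1:-1] > m[:-2]) & (m[1:-1] > m[2:])))

def maxima_curve(x, bandwidths):
    """Number of local maxima of the smoother of x^2, per bandwidth."""
    return [count_local_maxima(nw_smooth(x ** 2, h)) for h in bandwidths]

# Usage with a synthetic return series and an illustrative bandwidth grid
rng = np.random.default_rng(3)
x = 0.01 * rng.standard_normal(300)
curve = maxima_curve(x, [2.0, 5.0, 10.0])
```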
Figures 2.10-2.12 show the corresponding plots for the
permutation
resampling, the nonparametric
bootstrap and the
GARCH(1,1)
bootstrap. The plot of the original data set is always compared with
the plots for
resamples. Again i.i.d. structures are not
supported by the
resampling methods.
The GARCH(1,1)
bootstrap produces plots that are comparable to the original plot.
The last approach could be formalized into a test procedure. This could, e.g., be done by constructing uniform resampling confidence bands for the expected number of local maxima. We will discuss resampling tests in the next section. For our last example we would like to mention that there seems to be no simple alternative to resampling. An asymptotic theory for the number of maxima that could be used for asymptotic confidence bands is not available (to our knowledge) and would presumably be rather complicated. Thus resampling offers an attractive way out. It can be used in a more data analytic fashion, as we have done here, but it can also be used to obtain a formal test procedure.
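Such a uniform band can be sketched generically: given resampled curves of maxima counts over a bandwidth grid, one can widen a pointwise center by a quantile of the sup-deviation. This is a standard construction given here only as an illustration, not the authors' specific procedure:

```python
import numpy as np

def uniform_resampling_band(curves, level=0.95):
    """Uniform confidence band from resampled curves.

    curves: (B, k) array, one resampled statistic-curve per row (e.g. the
    number of local maxima at k bandwidths for each of B resamples).
    Widens the pointwise mean by the level-quantile of the maximal
    absolute deviation, so the band covers whole curves, not single points."""
    curves = np.asarray(curves, dtype=float)
    center = curves.mean(axis=0)
    sup_dev = np.abs(curves - center).max(axis=1)  # one sup-deviation per resample
    half_width = np.quantile(sup_dev, level)
    return center - half_width, center + half_width

# Tiny deterministic example
lo, hi = uniform_resampling_band([[1.0, 1.0], [3.0, 3.0]])
```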
The first two problems, discussed in Figs. 2.1-2.9, are too complex to be formalized as a testing approach. It is impossible to describe what differences the human eye is looking for in the plots and to summarize these differences in one simple quantity that can be used as a test statistic. The eye applies a battery of "tests", and it applies the same or similar checks to the resamples. Thus, resampling is a good way to judge statistical findings based on the original plots.