4.4 Confidence Regions and Tests

As in the case of density estimation, confidence intervals and bands can be based on the asymptotic normal distribution of the regression estimator. We will restrict ourselves to the Nadaraya-Watson case in order to show the essential concepts. In the latter part of this section we address the related topic of specification tests, which test a parametric hypothesis for the regression function against the nonparametric alternative.

4.4.1 Pointwise Confidence Intervals

Now that you have become familiar with nonparametric regression, you may want to know: How close is the smoothed curve to the true curve? Recall that we asked the same question when we introduced the method of kernel density estimation. There, we made use of (pointwise) confidence intervals and (global) confidence bands. But to construct this measure, we first had to derive the (asymptotic) sampling distribution.

The following theorem establishes the asymptotic distribution of the Nadaraya-Watson kernel estimator for one-dimensional predictor variables.

THEOREM 4.5  
Suppose that $m$ and $f_X$ are twice differentiable, $\int \vert K(u) \vert^{2+\kappa}\, du<\infty$ for some $\kappa>0$, $x$ is a continuity point of $\sigma^2(x)$ and of $E( \vert Y\vert^{2+\kappa} \,\vert\, X=x )$, and $f_X(x)>0$. Take $h=cn^{-1/5}$. Then

$\displaystyle n^{2/5} \left\{\widehat{m}_{h}(x)-m(x)\right\} \;\stackrel{L}{\longrightarrow}\; N\left( b_x,v_x^2\right)$

with

$\displaystyle b_x= c^{2} \mu_2(K) \left\{ \frac{m''(x)}{2} +\frac{m'(x) f_X'(x)}{f_X(x)} \right\}, \qquad v^2_x=\frac{\sigma^2(x)\Vert K \Vert^{2}_{2}}{c\, f_X(x)}\,.$

The asymptotic bias $b_x$ is proportional to the second moment of the kernel and to a measure of the local curvature of $m$. This curvature measure depends not on $m$ alone but also on the marginal density: at local extrema of $m$ (where $m'(x)=0$) the bias is a multiple of $m''(x)$ alone, whereas at inflection points (where $m''(x)=0$) it is a multiple of $\{m'(x)f_X'(x)\}/f_X(x)$ only.
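For concreteness, take the Epanechnikov kernel $K(u)=\frac{3}{4}(1-u^2)\,\boldsymbol{I}(\vert u\vert \leq 1)$ (our choice for illustration; any second order kernel works). Its constants are

$\displaystyle \mu_2(K)=\int u^2 K(u)\,du=\frac{1}{5}, \qquad \Vert K\Vert_2^2=\int K^2(u)\,du=\frac{3}{5},$

so that for this kernel $b_x=\frac{c^2}{5}\left\{\frac{m''(x)}{2}+\frac{m'(x)f_X'(x)}{f_X(x)}\right\}$ and $v_x^2=\frac{3\,\sigma^2(x)}{5\,c\,f_X(x)}$.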

Figure 4.15: Nadaraya-Watson kernel regression and 95% confidence intervals, $h=0.2$, U.K. Family Expenditure Survey 1973. (SPMengelconf)

We now use this result to define confidence intervals. Suppose that the bias is of negligible magnitude compared to the variance, e.g. if the bandwidth $ h$ is sufficiently small. Then we can compute approximate confidence intervals with the following formula:

$\displaystyle \left[\,\widehat{m}_h (x) - z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \Vert K \Vert^{2}_{2}\, \widehat{\sigma}^2 (x)}{nh\,\widehat{f}_h(x)}}\ ,\ \ \widehat{m}_h (x) + z_{1-\frac{\alpha}{2}} \sqrt{\frac{ \Vert K \Vert^{2}_{2}\, \widehat{\sigma}^2 (x)}{nh\,\widehat{f}_h(x)}}\, \right]$ (4.55)

where $ z_{1-\frac{\alpha}{2}}$ is the $ (1-\frac{\alpha}{2})$-quantile of the standard normal distribution and the estimate of the variance $ \sigma^2(x)$ is given by

$\displaystyle \widehat{\sigma}^2(x)=\frac{1}{n}\sum_{i=1}^n W_{hi}(x)\{Y_i -
\widehat{m}_h(x)\}^2,$

with $W_{hi}$ denoting the weights of the Nadaraya-Watson estimator.

EXAMPLE 4.14  
Figure 4.15 shows the Engel curve from the 1973 U.K. net-income versus food example with confidence intervals. As we can see, the bump in the right part of the regression curve is not significant at the 5% level. $\Box$
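To see how (4.55) is evaluated in practice, consider the following minimal sketch in Python with NumPy and SciPy (it is not the SPMengelconf quantlet; the function names, the Epanechnikov kernel and the simulated data are our own illustrative choices):

import numpy as np
from scipy.stats import norm

def nw_confidence_intervals(x_grid, X, Y, h, alpha=0.05):
    """Nadaraya-Watson estimate with pointwise confidence intervals as in (4.55)."""
    K = lambda u: 3/4 * (1 - u**2) * (np.abs(u) <= 1)       # Epanechnikov kernel
    K_norm2 = 3/5                                           # ||K||_2^2 for this kernel
    n = len(X)
    m_hat = np.empty(len(x_grid))
    se = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        w = K((x - X) / h)
        f_hat = np.mean(w) / h                              # density estimate f_h(x)
        m_hat[j] = np.sum(w * Y) / np.sum(w)                # Nadaraya-Watson estimate
        sigma2 = np.sum(w * (Y - m_hat[j])**2) / np.sum(w)  # variance estimate sigma^2(x)
        se[j] = np.sqrt(K_norm2 * sigma2 / (n * h * f_hat)) # asymptotic standard error
    z = norm.ppf(1 - alpha / 2)
    return m_hat, m_hat - z * se, m_hat + z * se

# illustration with simulated data
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, 200)
Y = np.sin(X) + 0.3 * rng.standard_normal(200)
grid = np.linspace(0.1, 2.9, 50)
m_hat, lower, upper = nw_confidence_intervals(grid, X, Y, h=0.2)

Remember that these intervals are pointwise, i.e. valid at each $x$ separately; uniform statements require the confidence bands of the next subsection.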

4.4.2 Confidence Bands

As we have seen in the density case, uniform confidence bands for $ m(\bullet)$ need rather restrictive assumptions. The derivation of uniform confidence bands is again based on Bickel & Rosenblatt (1973).

THEOREM 4.6  
Suppose that the support of $X$ is $[0,1]$, $f_X(x)>0$ on $[0,1]$, and that $m(\bullet)$, $f_X(\bullet)$ and $\sigma(\bullet)$ are twice differentiable. Moreover, assume that $K$ is differentiable with support $[-1,1]$, that $K(-1)=K(1)=0$, and that $E( \vert Y\vert^{k} \,\vert\, X=x )$ is bounded for all $k$. Then for $h_{n}=n^{-\kappa}$, $\kappa \in (\frac{1}{5},\frac{1}{2})$
$\displaystyle P\Bigg(\textrm{for all }x\in[0,1]:\ \widehat{m}_{h}(x)- z_{n,\alpha} \sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\,\widehat{f}_{h}(x)}} \;\leq\; m(x) \;\leq\; \widehat{m}_{h}(x)+ z_{n,\alpha} \sqrt{\frac{\widehat{\sigma}^2_{h}(x)\Vert K \Vert _{2}^{2}}{nh\,\widehat{f}_{h}(x)}}\Bigg) \longrightarrow 1-\alpha,$

where

$\displaystyle z_{n,\alpha}=\left\{\frac{-\log\{-\frac12 \log(1-\alpha)\}}{(2\kappa
\log{n})^{1/2}}+d_{n}\right\}^{1/2},$

$\displaystyle d_{n}=(2\kappa\log{n})^{1/2}+(2\kappa\log{n})^{-1/2}
\log{\left(\frac{1}{2\pi}\frac{\Vert K'\Vert _{2}}{\Vert K
\Vert _{2}}\right)}^{1/2}.$

In practice, the data $ X_1, \ldots,X_n$ are transformed to the interval $ [0,1]$, then the confidence bands are computed and rescaled to the original scale of $ X_1, \ldots,X_n$.

The following comprehensive example covers local polynomial kernel regression as well as optimal smoothing parameter selection and confidence bands.

EXAMPLE 4.15  
The behavior of foreign exchange (FX) rates has been the subject of many recent investigations. A correct understanding of the foreign exchange rate dynamics has important implications for international asset pricing theories, the pricing of contingent claims and policy-oriented questions.

In the past, one of the most important exchange rates was that of the Deutsche Mark (DM) to the US Dollar (USD). The data that we consider here are from Olsen & Associates, Zürich, and contain the quotes recorded between October 1, 1992 and September 30, 1993. The data have been transformed as described in Bossaerts et al. (1996).

Figure 4.16: The estimated mean function for DM/USD with uniform confidence bands; only the truncated range $(-0.003, 0.003)$ is shown. (SPMfxmean)

We now present the regression smoothing approach with local linear estimation of the conditional mean (mean function) and the conditional variance (variance function) of the FX returns

$\displaystyle Y_t=\log(S_t/S_{t-1}),$

with $ S_t$ being the FX rates. An extension of the autoregressive conditional heteroscedasticity model (ARCH model) is the conditional heteroscedastic autoregressive nonlinear model (CHARN model)

$\displaystyle Y_t = m(Y_{t-1}) + \sigma(Y_{t-1})\xi_t.$ (4.56)

The task is to estimate the mean function $m(x) = E(Y_t\vert Y_{t-1}=x)$ and the variance function $\sigma^2(x)= E(Y_{t}^2\vert Y_{t-1}=x) - E^2(Y_{t}\vert Y_{t-1}=x)$. As already mentioned, we use local linear estimation here. For details on the assumptions and asymptotics of the local polynomial procedure in time series see Härdle & Tsybakov (1997). Here, local linear estimation means solving the following weighted least squares problems:
$\displaystyle \widehat\beta(x) = \arg \min_{\beta} \, \sum_{t=1}^n \{Y_{t} - \beta_0 - \beta_1(Y_{t-1}-x)\}^2 K_{h} (Y_{t-1}-x),$

$\displaystyle \widehat\gamma(x) = \arg \min_{\gamma} \, \sum_{t=1}^n \{Y^2_{t} - \gamma_0 -\gamma_1(Y_{t-1}-x)\}^2 K_{h} (Y_{t-1}-x).$

Denoting the true regression function of $Y_t^2$ on $Y_{t-1}$ by

$\displaystyle s(x)=E(Y_t^2\vert Y_{t-1}=x),$

the estimators of $m(x)$ and $s(x)$ are the first elements of the vectors $\widehat\beta(x)$ and $\widehat\gamma(x)$, respectively. Consequently, a possible variance estimate is

$\displaystyle \widehat{\sigma}^2(x) = \widehat{s}_{1,h}(x)
- \widehat{m}_{1,h}^2(x),$

with

$\displaystyle \widehat{m}_{1,h}(x) = {\boldsymbol{e}}_0^\top\widehat\beta(x), \quad
\widehat{s}_{1,h}(x) = {\boldsymbol{e}}_0^\top\widehat\gamma(x)$

and $ {\boldsymbol{e}}_0=(1,0)^\top$ the first unit vector in $ \mathbb{R}^2$.
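The following minimal sketch (Python with NumPy; it is not the SPMfxmean or SPMfxvolatility quantlet, and simulated returns stand in for the FX data) solves the two weighted least squares problems on a grid and forms the plug-in variance estimate:

import numpy as np

def local_linear(x, X, Y, h):
    """First element of the local linear (weighted least squares) fit at x."""
    K = lambda u: 3/4 * (1 - u**2) * (np.abs(u) <= 1)   # Epanechnikov kernel
    sw = np.sqrt(K((X - x) / h))                        # square roots of the weights
    D = np.column_stack([np.ones_like(X), X - x])       # design (1, X_i - x)
    beta, *_ = np.linalg.lstsq(D * sw[:, None], Y * sw, rcond=None)
    return beta[0]                                      # e_0^T beta_hat(x)

def charn_fit(Y, h_mean, h_var, grid):
    """Local linear estimates of m(x) and sigma^2(x) in the CHARN model (4.56)."""
    X_lag, Y_cur = Y[:-1], Y[1:]                        # pairs (Y_{t-1}, Y_t)
    m_hat = np.array([local_linear(x, X_lag, Y_cur, h_mean) for x in grid])
    s_hat = np.array([local_linear(x, X_lag, Y_cur**2, h_var) for x in grid])
    return m_hat, np.maximum(s_hat - m_hat**2, 0)       # sigma^2 = s - m^2, clipped at 0

# illustration with simulated returns; the bandwidths mimic the CV-optimal ones
rng = np.random.default_rng(1)
Y = 0.001 * rng.standard_normal(5000)
grid = np.linspace(-0.003, 0.003, 60)
m_hat, var_hat = charn_fit(Y, h_mean=0.00914, h_var=0.00756, grid=grid)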

Figure 4.17: The estimated variance function for DM/USD with uniform confidence bands; only the truncated range $(-0.003, 0.003)$ is shown. (SPMfxvolatility)

The estimated functions are plotted together with approximate 95% confidence bands, which can be obtained from the asymptotic normal distribution of the local polynomial estimator. The cross-validation optimal bandwidth $ h=0.00914$ is used for the local linear estimation of the mean function in Figure 4.16. As indicated by the 95% confidence bands, the estimation is not very robust at the boundaries. Therefore, Figure 4.16 covers a truncated range. Analogously, the variance estimate is shown in Figure 4.17, using the cross-validation optimal bandwidth $ h=0.00756$.

The basic results are the mean reversion and the ``smiling'' shape of the conditional variance. Conditional heteroscedasticity appears to be very distinct. For DM/USD a ``reverted leverage effect'' can be observed, meaning that the conditional variance is higher for positive lagged returns than for negative ones of the same size. But note that the difference is still within the 95% confidence bands. $ \Box$

4.4.3 Hypothesis Testing

In this book we do not treat testing as a topic in its own right, being aware that this would be an enormous task. Instead, we concentrate on cases where regression estimators have a direct application in specification testing, and we restrict attention to methodology, skipping any discussion of efficiency.

As this is the first section in which we deal with testing, let us start with some brief but general considerations about non- and semiparametric testing. Firstly, you should free your mind of what you know about testing in the parametric world. No parameter has been estimated so far, so significance tests or tests of linear restrictions on parameters cannot be the target of interest. Looking at our nonparametric estimates, typical questions are instead whether the regressor $X$ has any impact on $Y$ at all, or whether the true regression function follows a specific parametric form, e.g. a linear or quadratic one.

Secondly, in contrast to parametric regression, with non- and semiparametrics the problems of estimation and testing are not equivalent anymore. We speak here of equivalence in the sense that, in the parametric world, interval estimation corresponds to parameter testing. It turns out that the optimal rates of convergence are different for nonparametric estimation and nonparametric testing. As a consequence, the choice of smoothing parameter is an issue to be discussed separately in both cases. Moreover, the optimality discussion for nonparametric testing is in general quite a controversial one and far from being obvious. This unfortunately concerns all aspects of nonparametric testing. For instance, the construction of confidence bands around a nonparametric function estimate to decide whether it is significantly different from being linear, can lead to a much too conservative and thus inefficient testing procedure.

Let us now turn to the fundamentals of nonparametric testing. Indeed, the appropriateness of a parametric model may be judged by comparing the parametric fit with a nonparametric estimator. This can be done in various ways, e.g. you may use a (weighted) squared deviation between the two models. A simple (but in many situations inefficient) approach would be to use critical values from the asymptotic distribution of this statistic. Better results are usually obtained by approximating the distribution of the test statistic using a resampling method.

Before introducing a specific test statistic we have to specify the hypothesis $H_0$ and the alternative $H_1$. To make it easy let us start with a nonparametric regression $m(x)=E(Y\vert X=x)$. Our first null hypothesis is that $X$ has no impact on $Y$. If we assume $EY=0$ (otherwise take $\widetilde Y_i =Y_i-\overline{Y}$), then we may be interested in testing

$\displaystyle H_0 : m(x) \equiv 0 \quad\textrm{vs.}\quad H_1 : m(x) \neq 0.$

As throughout this chapter we do not want to make any assumptions about the function $ m(\bullet)$ other than smoothness conditions. Having an estimate of $ m(x)$ at hand, e.g. the Nadaraya-Watson estimate $ \widehat m(\bullet)=\widehat m_h(\bullet)$ from (4.6), a natural measure for the deviation from zero is

$\displaystyle T_1 = n\sqrt{h} \int \left\{ \widehat m (x)-0 \right\}^2 \widetilde{w}(x)\, dx,$ (4.57)

where $ \widetilde{w}(x)$ denotes a weight function (typically chosen by the empirical researcher). This weight function often serves to trim the boundaries or regions of sparse data. If the weight function is equal to $ f_X(x) w(x)$, i.e.

$\displaystyle \widetilde{w}(x)=f_X(x) w(x)$

with $ f_X(x)$ being the density of $ X$ and $ w(x)$ another weight function, one could take the empirical version of (4.57)

$\displaystyle T_2 = \sqrt{h} \sum_{i=1}^n \left\{ \widehat m (X_i)-0 \right\}^2 w(X_i)$ (4.58)

as a test statistic.

It is clear that under $H_0$ both test statistics $T_1$ and $T_2$ converge to zero, whereas under $H_1$ they tend to infinity as $n\to\infty$. Note that under $H_0$ our estimate $\widehat m (\bullet )$ does not have any bias (cf. Theorem 4.3) that could matter in the squared deviation $\{\widehat m (x)-0 \}^2$. Actually, under the same assumptions that we needed for the kernel estimator $\widehat m (\bullet )$, we find that under the null hypothesis $T_1$ and $T_2$ converge to a $N(0,V)$ distribution with

$\displaystyle V = 2 \int \frac{\sigma^4(x) \widetilde{w}^2 (x)}{f_X^2(x)}\,dx \int (K \star K)^2(x) \,dx.$ (4.59)

As used previously, $ \sigma^2(x)$ denotes the conditional variance $ \mathop{\mathit{Var}}(Y\vert X=x)$.
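For illustration, here is a minimal sketch of $T_2$ (Python with NumPy; the trimming weight function and the simulated data are our own choices):

import numpy as np

def nw(x, X, Y, h):
    """Nadaraya-Watson estimate at x with the Epanechnikov kernel."""
    u = (x - X) / h
    w = 3/4 * (1 - u**2) * (np.abs(u) <= 1)
    return np.sum(w * Y) / np.sum(w)

def T2(X, Y, h, w):
    """Empirical test statistic (4.58) for H0: m(x) = 0 (Y assumed centred)."""
    m_hat = np.array([nw(x, X, Y, h) for x in X])
    return np.sqrt(h) * np.sum(m_hat**2 * w(X))

# under H0 the statistic stays small, under the alternative it grows with n
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 300)
w = lambda x: (np.abs(x) <= 0.9).astype(float)          # trims the boundary region
Y0 = rng.standard_normal(300)                           # H0: X has no impact on Y
Y1 = np.sin(3 * X) + rng.standard_normal(300)           # alternative m(x) = sin(3x)
print(T2(X, Y0 - Y0.mean(), h=0.2, w=w), T2(X, Y1 - Y1.mean(), h=0.2, w=w))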

Let us now consider a more general null hypothesis. Suppose we are interested in a specific parametric model given by $E(Y\vert X=x)=m_\theta (x)$, where $m_\theta (\bullet)$ is a (parametric) function, known up to the parameter $\theta$. This means we test

$\displaystyle H_0 : m(x) \equiv m_\theta (x) \quad\textrm{vs.}\quad H_1 : m(x) \neq m_\theta (x).$

A consistent estimator $ \widehat\theta$ for $ \theta$ is usually easy to obtain (by least squares, maximum likelihood, or as a moment estimator, for example). The analog to statistic (4.58) is then obtained by using the deviation from $ m_{\widehat \theta}$, i.e.

$\displaystyle \sqrt{h} \sum_{i=1}^n \left\{ \widehat m (X_i)- m_{\widehat \theta} (X_i) \right\}^2 w(X_i).$ (4.60)

However, this test statistic involves the following problem: Whereas $ m_{\widehat \theta} (\bullet )$ is (asymptotically) unbiased and converging at rate $ \sqrt{n}$, our nonparametric estimate $ \widehat m (x) $ has a ``kernel smoothing'' bias and converges at rate $ \sqrt{nh}$. For that reason, Härdle & Mammen (1993) propose to introduce an artificial bias by replacing $ m_{\widehat \theta} (\bullet )$ with
$\displaystyle \widehat m_{\widehat\theta} (x) = \frac{ \sum_{i=1}^n K_h \left( x-X_i \right)
m_{\widehat \theta}(X_i) }{ \sum_{i=1}^n K_h \left( x-X_i \right) }$     (4.61)

in statistic (4.60). More specifically, we use

$\displaystyle T =\sqrt{h} \sum_{i=1}^n \left\{ \widehat m (X_i)- \widehat m_{\widehat \theta} (X_i) \right\}^2 w(X_i).$ (4.62)

As a result, under $H_0$ the bias of $\widehat m (\bullet )$ cancels with that of $\widehat m_{\widehat\theta}(\bullet)$, and the convergence rates are the same.

EXAMPLE 4.16  
Consider the expected wage ($ Y$) as a function of years of professional experience ($ X$). The common parameterization for this relationship is

$\displaystyle H_0: E(Y\vert X=x) = m(x)= \beta_0 + \beta_1 x+ \beta_2 x^2,$

and we are interested in verifying this quadratic form. So, we firstly estimate $ \theta = (\beta_0 , \beta_1 , \beta_2 )^\top$ by least squares and set $ m_{\widehat\theta} (x) =
\widehat\beta_0 + \widehat\beta_1 x +
\widehat\beta_2 x^2 $. Secondly, we calculate the kernel estimates $\widehat m (\bullet )$ and $\widehat m_{\widehat\theta} (\bullet )$, as in (4.6) and (4.61). Finally, we apply test statistic (4.62) to these two smoothers. If the statistic is ``large'', we reject $H_0$. $\Box$
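A sketch of these three steps (Python with NumPy; the quadratic wage relationship is simulated here, so all names and numbers are purely illustrative):

import numpy as np

def kernel(u):                                          # Epanechnikov kernel
    return 3/4 * (1 - u**2) * (np.abs(u) <= 1)

def smooth(x, X, Z, h):                                 # kernel smoother of the values Z
    w = kernel((x - X) / h)
    return np.sum(w * Z) / np.sum(w)

def T_stat(X, Y, h, w):
    """Statistic (4.62) for H0: m quadratic, with the artificial bias (4.61)."""
    D = np.column_stack([np.ones_like(X), X, X**2])
    theta = np.linalg.lstsq(D, Y, rcond=None)[0]        # step 1: least squares fit
    m_par = D @ theta                                   # parametric fit m_theta(X_i)
    m_hat = np.array([smooth(x, X, Y, h) for x in X])   # step 2: kernel estimate of m
    m_par_sm = np.array([smooth(x, X, m_par, h) for x in X])   # (4.61): smoothed fit
    return np.sqrt(h) * np.sum((m_hat - m_par_sm)**2 * w(X))   # step 3: statistic (4.62)

rng = np.random.default_rng(3)
X = rng.uniform(0, 40, 400)                             # years of experience (simulated)
Y = 8 + 0.4 * X - 0.008 * X**2 + rng.standard_normal(400)   # H0 holds in this simulation
w = lambda x: ((x >= 2) & (x <= 38)).astype(float)      # trims the boundary
T_observed = T_stat(X, Y, h=3.0, w=w)

By itself the value of $T$ is not interpretable; it has to be compared with a critical value, which is the topic of the remainder of this section.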

The remaining question of the example is: How to find the critical value for ``large''? The typical approach in parametric statistics is to obtain the critical value from the asymptotic distribution. This is in principle possible in our nonparametric problem as well:

THEOREM 4.7  
Assume the conditions of Theorem 4.3 and, further, that $\widehat\theta$ is a $\sqrt{n}$-consistent estimator for $\theta$ and $h = O(n^{-1/5})$. Then
$\displaystyle T - \frac 1{\sqrt{h}} \Vert K \Vert^2_2 \int \sigma^2 (x) w(x) \,dx \;\stackrel{L}{\longrightarrow}\; N\left( 0\,,\, 2\int \sigma^4 (x) w^2(x) \,dx \ \int (K\star K)^2 (x) \,dx \right).$

As in the parametric case, we have to estimate the variance expression in the normal distribution. However, with an appropriate estimate for $ \sigma^2(x)$ this is no obstacle. The main practical problem here is the very slow convergence of $ T$ towards the normal distribution.

For that reason, approximations of the critical values corresponding to the finite sample distribution are used. The most popular way to approximate this finite sample distribution is via a resampling scheme: simulate the distribution of your test statistic under the hypothesis (i.e. ``resample'') and determine the critical values based on that simulated distribution. This method is called Monte Carlo method or bootstrap, depending on how the distribution of the test statistic can be simulated. Depending on the context, different resampling procedures have to be applied. Later on, for each particular case we will introduce not only the test statistic but also an appropriate resampling method.

For our current testing problem the possibly most popular resampling method is the so-called wild bootstrap introduced by Wu (1986). One of its advantages is that it allows for a heterogeneous variance in the residuals. Härdle & Mammen (1993) introduced wild bootstrap into the context of nonparametric hypothesis testing as considered here. The principal idea is to resample from the residuals $ \widehat\varepsilon_i$, $ i=1,\ldots,n$, that we got under the null hypothesis. Each bootstrap residual $ \varepsilon^*_i$ is drawn from a distribution that coincides with the distribution of $ \widehat\varepsilon_i$ up to the first three moments. The testing procedure then consists of the following steps:

(a)
Estimate the regression function $ m_{\widehat \theta} (\bullet )$ under the null hypothesis and construct the residuals $ \widehat\varepsilon_{i}=Y_{i}-
m_{\widehat\theta} (X_{i})$.
(b)
For each $ X_{i}$, draw a bootstrap residual $ \varepsilon_{i}^{*}$ so that
$\displaystyle E ( \varepsilon^*_i ) =0\,,\quad E ( { \varepsilon^*_i }^2 ) = \widehat\varepsilon_{i}^{2} \quad\textrm{ and }\quad E ( {\varepsilon^*_i}^3 ) = \widehat\varepsilon_{i}^{3}\,.$

(c)
Generate a bootstrap sample $ \{(Y_{i}^{*},X_{i})\}_{i=1,\ldots,n}$ by setting

$\displaystyle Y_{i}^{*}=
m_{\widehat\theta}(X_i) +\varepsilon_{i}^{*}.$

(d)
From this sample, calculate the bootstrap test statistic $ T^{*}$ in the same way as the original $ T$ is calculated.
(e)
Repeat steps (b) to (d) $n_{boot}$ times ($n_{boot}$ being several hundred or thousand) and use the $n_{boot}$ generated test statistics $T^{*}$ to determine the quantiles of the test statistic under the null hypothesis. This gives approximate critical values for your test statistic $T$.

One famous method which fulfills the conditions in step (b) is the so-called golden cut method. Here we draw $\varepsilon^*_i$ from the two-point distribution with probability mass at

$\displaystyle a=\frac{1-\sqrt{ 5}}{2}\,\widehat\varepsilon_i,\quad
b=\frac{1+\sqrt{5}}{2}\,\widehat\varepsilon_i,$

occurring with probabilities $ q=(5+\sqrt{5})/10$ and $ 1-q$, respectively. In the second part of this book you will see more examples of Monte Carlo and bootstrap methods.
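The complete test, steps (a) to (e) with golden cut resampling, might then look as follows (Python with NumPy; the sketch reuses the hypothetical T_stat function and the quadratic null hypothesis of the sketch following Example 4.16):

import numpy as np

def wild_bootstrap_test(X, Y, h, w, T_stat, n_boot=500, alpha=0.05, seed=0):
    """Wild bootstrap critical value for the specification test statistic T."""
    rng = np.random.default_rng(seed)
    D = np.column_stack([np.ones_like(X), X, X**2])     # quadratic null hypothesis
    theta = np.linalg.lstsq(D, Y, rcond=None)[0]
    m_par = D @ theta
    resid = Y - m_par                                   # (a) residuals under H0
    T_obs = T_stat(X, Y, h, w)
    # golden cut: two-point distribution matching the first three moments of resid
    a, b = (1 - np.sqrt(5)) / 2, (1 + np.sqrt(5)) / 2
    q = (5 + np.sqrt(5)) / 10
    T_boot = np.empty(n_boot)
    for k in range(n_boot):                             # (e) repeat (b)-(d) n_boot times
        mult = np.where(rng.uniform(size=len(Y)) < q, a, b)
        eps_star = mult * resid                         # (b) bootstrap residuals
        Y_star = m_par + eps_star                       # (c) bootstrap sample
        T_boot[k] = T_stat(X, Y_star, h, w)             # (d) bootstrap test statistic T*
    crit = np.quantile(T_boot, 1 - alpha)
    return T_obs, crit, T_obs > crit                    # reject H0 if T exceeds crit

# usage, continuing the simulated wage example
T_obs, crit, reject = wild_bootstrap_test(X, Y, h=3.0, w=w, T_stat=T_stat)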

Let us mention that besides the type of test statistics that we introduced here, other distance measures are plausible. However, all test statistics can be considered as estimates of one of the following expressions:

$\displaystyle E\left[ w(X) \{m(X)-m_\theta(X)\}^2 \right],$     (4.63)
$\displaystyle E\left[ w(X) \{m(X)-m_\theta(X)\} \varepsilon_X \right],$     (4.64)
$\displaystyle E\left[ w(X) \varepsilon_X \ E[\varepsilon_X\vert X] \right],$     (4.65)
$\displaystyle E\left[ w(X) \{ \sigma^2(X) -\sigma_\theta^2(X) \} \right],$     (4.66)

$\varepsilon_X$ being the residual under $H_0$ at point $x$, and $w(x)$ the weight function as above. Furthermore, $\sigma^2_\theta (X)$ is the error variance under the hypothesis, and $\sigma^2(X)$ the one under the alternative. Obviously, our test statistics (4.58) and (4.62) are estimates of expression (4.63).

The question ``Which is the best test statistic?'' has no simple answer. An optimal test should keep the nominal significance level under the hypothesis and provide the highest power under the alternative. However, in practice it turns out that the behavior of a specific test may depend on the model, the error distribution, the design density and the weight function. This leads to an increasing number of proposals for the considered testing problem. We refer to the bibliographic notes for additional references to (4.63)-(4.66) and for further test approaches.