13.1 ARCH and GARCH Models

After the introduction of ARCH models, the eighties saw enormous theoretical and practical developments in financial econometrics. It became clear that ARCH models could efficiently and quite easily represent typical empirical findings in financial time series, in particular conditional heteroscedasticity. Especially after the collapse of the Bretton Woods system and the implementation of flexible exchange rates in the seventies, ARCH models have been used increasingly by researchers and practitioners.

Fig.: Normally distributed white noise. SFEtimewr.xpl
\includegraphics[width=1\defpicwidth]{wr.ps}

Fig.: A GARCH(1,1) process ( $ \alpha=0.15$, $ \beta=0.8$). SFEtimegarc.xpl
\includegraphics[width=1\defpicwidth]{garch.ps}

Fig.: DAFOX returns from 1993 to 1996. SFEtimedax.xpl
\includegraphics[width=1\defpicwidth]{dafox.ps}

In addition, a broad consensus has formed that returns cannot be regarded as i.i.d., but at most as uncorrelated. This holds at least for financial time series of relatively high frequency, for example for daily data. In Figure 13.1 we show a normally distributed white noise, in Figure 13.2 a GARCH(1,1) process, and in Figure 13.3 the DAFOX index (1993-96), see http://finance.wiwi.uni-karlsruhe.de/Forschung/dafox.html . It can be seen from the figures that the GARCH process is clearly more appropriate for modelling stock returns than white noise.
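A qualitative impression of Figures 13.1 and 13.2 can be reproduced with a few lines of code. The following Python sketch (not the original SFEtimewr.xpl/SFEtimegarc.xpl quantlets) simulates a Gaussian white noise and a strong GARCH(1,1) process with $ \alpha=0.15$ and $ \beta=0.8$; the value $ \omega=0.05$, which sets the unconditional variance to one, is our own choice and is not stated in the figure caption.

```python
import numpy as np

def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate a strong GARCH(1,1) process eps_t = sigma_t * Z_t with Gaussian Z_t."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    eps = np.empty(n)
    sigma2 = np.empty(n)
    sigma2[0] = omega / (1.0 - alpha - beta)      # start at the unconditional variance
    eps[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
        eps[t] = np.sqrt(sigma2[t]) * z[t]
    return eps

n = 1000
white_noise = np.random.default_rng(1).standard_normal(n)            # analogue of Figure 13.1
garch_path = simulate_garch11(n, omega=0.05, alpha=0.15, beta=0.8)   # analogue of Figure 13.2
```

Plotting both series side by side already shows the volatility clusters that distinguish the GARCH path from the white noise.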

However the ARCH model is only the starting point of the empirical study and relies on a wide range of specification tests. Some practically relevant disadvantages of the ARCH model have been discovered recently, for example, the definition and modelling of the persistence of shocks and the problem of modelling asymmetries. Thus a large number of extensions of the standard ARCH model have been suggested. We will discuss them in detail later.

Let $ X_t$ be a discrete stochastic process and, following Definition 10.15, let $ r_t = \log X_t/X_{t-1}$ be the relative increase, or return, of the process $ X_t$. If the returns are independent and identically distributed, then $ X_t$ follows a geometric random walk. In ARCH models it is assumed that the returns depend on past information in a specific form.

As mentioned before $ {\cal F}_t$ denotes the information set at time $ t$, which encompasses $ X_t$ and all the past realizations of the process $ X_t$. This means in a general model

$\displaystyle r_t = \mu_t + \varepsilon_t$ (13.1)

with $ \mathop{\text{\rm\sf E}}[\varepsilon_t \mid {\cal F}_{t-1}]=0$. Here $ \mu_t$ can represent the risk premium resulting from econometric models, which is time dependent. The stochastic error term $ \varepsilon_t$ is no longer independent but merely centered and uncorrelated. In ARCH models the conditional variance of $ \varepsilon_t$ is a linear function of the lagged squared error terms.

13.1.1 ARCH(1): Definition and Properties

The ARCH model of order 1, ARCH(1), is defined as follows:

Definition 13.1 (ARCH(1))  
The process $ \varepsilon_t$, $ t \in \mathbb{Z}$, is ARCH(1), if $ \mathop{\text{\rm\sf E}}[\varepsilon_t \mid {\cal F}_{t-1}]=0$,

$\displaystyle \sigma_t^2=\omega+\alpha \varepsilon_{t-1}^2$ (13.2)

with $ \omega>0, \: \alpha \geq 0$ and

  1. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ and $ Z_t=\varepsilon_t/\sigma_t$ is i.i.d. (strong ARCH)
  2. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ (semi-strong ARCH), or
  3. $ {\cal P}(\varepsilon_t^2 \mid 1, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots) = \sigma_t^2$ (weak ARCH),

where $ {\cal P}$ is the best linear projection described in Section 11.4. Obviously a strong ARCH(1) process is also semi-strong, and a semi-strong ARCH(1) process is also weak. On the other hand the conditional variance of a weak ARCH(1) process can be non-linear (unequal to $ \sigma_t^2$), in which case it cannot be a semi-strong ARCH process.

Setting $ Z_t=\varepsilon_t/\sigma_t$, it holds for the semi-strong and the strong ARCH models that $ \mathop{\text{\rm\sf E}}[Z_t]=0$ and $ \mathop{\text{\rm Var}}(Z_t)=1$. In strong ARCH models $ Z_t$ is i.i.d. so that no dependence can be modelled in higher moments than the second moment. It is frequently assumed that $ Z_t$ is normally distributed, which means $ \varepsilon_t$ is conditionally normally distributed:

$\displaystyle \varepsilon_t\vert{\cal F}_{t-1} \sim {\text{\rm N}}(0,\sigma_t^2).$ (13.3)

Under (13.3) the difference between the strong and the semi-strong ARCH models disappears.

Originally only strong and semi-strong ARCH models were discussed in the literature. Weak ARCH models are important because they are closed under temporal aggregation. If, for example, daily returns follow a weak ARCH process, then the weekly and monthly returns are also weak ARCH with corresponding parameter adjustments. This closedness property does not hold in general for strong and semi-strong models.

According to Definition 13.1 the process $ \varepsilon_t$ is a martingale difference and therefore white noise.

Theorem 13.1  
Assume that the process $ \varepsilon_t$ is a weak ARCH(1) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$. Then it follows that $ \varepsilon_t$ is white noise.

Proof:
From $ \mathop{\text{\rm\sf E}}[\varepsilon_t \mid {\cal F}_{t-1}]=0$ it follows that $ \mathop{\text{\rm\sf E}}[\varepsilon_t]=0$ and $ \mathop{\text{\rm Cov}}(\varepsilon_t,\varepsilon_{t-k}) = \mathop{\text{\rm\sf E}}[\varepsilon_t\varepsilon_{t-k}] = \mathop{\text{\rm\sf E}}[\varepsilon_{t-k}\,\mathop{\text{\rm\sf E}}(\varepsilon_t \mid {\cal F}_{t-1})] = 0$ for $ k\ge 1$. $ {\Box}$
Note that $ \varepsilon_t$ is not an independent white noise.

Theorem 13.2 (Unconditional variance of the ARCH(1))  
Assume the process $ \varepsilon_t$ is a semi-strong ARCH(1) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$. Then it holds that

$\displaystyle \sigma^2=\frac{\omega}{1-\alpha}.$

Proof:
$ \sigma^2 = \mathop{\text{\rm\sf E}}[\varepsilon_t^2] = \mathop{\text{\rm\sf E}}[\mathop{\text{\rm\sf E}}(\varepsilon_t^2 \mid {\cal F}_{t-1})] = \mathop{\text{\rm\sf E}}[\sigma_t^2] = \omega + \alpha \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^2] = \omega + \alpha \sigma^2.$ It then holds that $ \sigma^2 = \omega/(1-\alpha)$ when $ \alpha<1$. $ {\Box}$
$ \alpha<1$ is the necessary and sufficient condition for weak stationarity of a semi-strong ARCH(1) process.

If the innovation $ Z_t=\varepsilon_t/\sigma_t$ is symmetrically distributed around zero, then all odd moments of $ \varepsilon_t$ are equal to zero. Under the assumption of normal distribution (13.3) the conditions for the existence of higher even moments can be derived.

Theorem 13.3 (Fourth Moment)  
Let $ \varepsilon_t$ be a strong ARCH(1) process, $ Z_t \sim
{\text{\rm N}}(0,1)$ and $ \mathop{\text{\rm\sf E}}[\varepsilon_t^4]=c<\infty$. Then
  1. $\displaystyle \mathop{\text{\rm\sf E}}[\varepsilon_t^4]=\frac{3\omega^2}{(1-\alpha)^2}\frac{1-\alpha^2}{1-3\alpha^2}
$

    with $ 3\alpha^2 <1$.
  2. the unconditional distribution of $ \varepsilon_t$ is leptokurtic.

Proof:
  1. $ c=\mathop{\text{\rm\sf E}}[\varepsilon_t^4] = \mathop{\text{\rm\sf E}}[\mathop{\text{\rm\sf E}}(\varepsilon_t^4 \mid {\cal F}_{t-1})] = 3\mathop{\text{\rm\sf E}}[\sigma_t^4] = 3\mathop{\text{\rm\sf E}}[(\omega+\alpha\varepsilon_{t-1}^2)^2] = 3(\omega^2 + 2\omega\alpha \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^2] + \alpha^2 \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^4]).$ Since $ \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^2] = \omega/(1-\alpha)$ and $ \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^4]=c$, the claim follows after rearranging.
  2. $\displaystyle \mathop{\text{\rm Kurt}}(\varepsilon_t) = \frac{\mathop{\text{\rm\sf E}}[\varepsilon_t^4]}{\mathop{\text{\rm\sf E}}[\varepsilon_t^2]^2} = 3\,\frac{1-\alpha^2}{1-3\alpha^2} \ge 3.$

$ {\Box}$

In the boundary case $ \alpha = 0$ with normally distributed innovations $ \mathop{\text{\rm Kurt}}(\varepsilon_t)=3$, while for $ \alpha>0$ it holds that $ \mathop{\text{\rm Kurt}}(\varepsilon_t)>3$. Thus the unconditional distribution is leptokurtic under conditional heteroscedasticity, i.e., it is strongly peaked in the middle and has fatter tails than the normal distribution, a pattern frequently observed in financial markets.
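As a plausibility check of Theorem 13.3, one can simulate a strong ARCH(1) process with Gaussian innovations and compare the sample kurtosis with $ 3(1-\alpha^2)/(1-3\alpha^2)$. The following sketch uses illustrative parameter values satisfying $ 3\alpha^2<1$ and is only meant as a numerical illustration.

```python
import numpy as np

def simulate_arch1(n, omega, alpha, seed=0):
    """Simulate a strong ARCH(1) process with Gaussian innovations."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)
    eps = np.zeros(n)
    for t in range(1, n):
        sigma2 = omega + alpha * eps[t - 1] ** 2       # conditional variance (13.2)
        eps[t] = np.sqrt(sigma2) * z[t]
    return eps

alpha, omega = 0.3, 0.7                                # 3 * alpha**2 < 1, so the 4th moment exists
eps = simulate_arch1(100_000, omega, alpha, seed=42)
sample_kurt = np.mean(eps**4) / np.mean(eps**2) ** 2
theory_kurt = 3 * (1 - alpha**2) / (1 - 3 * alpha**2)
print(sample_kurt, theory_kurt)                        # both roughly 3.74 for large n
```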

The thickness of the tails and thus the existence of moments depend on the parameters of the ARCH model. The variance of the ARCH(1) process is finite when $ \alpha<1$ (Theorem 13.2), while the fourth moment exists in the case of normally distributed error terms when $ 3\alpha^2 <1$ (Theorem 13.3). Already in the sixties Mandelbrot had questioned the existence of the variance of several financial time series. Frequently empirical distributions have such heavy tails that one cannot conclude that the variance is finite. In order to draw empirical conclusions about the degree of tail thickness of the unconditional distribution, one can assume, for example, that the distribution is of Pareto type, i.e., for large $ x$:

$\displaystyle {\P}(x)={\P}(X_t>x) \sim k x^{-a}$

for $ a>0$. When $ a >c$, it holds that $ \mathop{\text{\rm\sf E}}[\vert X_t\vert^c] < \infty$. The question is how the tail index $ a$ can be estimated. A simple method follows from the observation that for large $ x$ the function $ \log {\P}(x)$ is linear in $ \log x$, i.e.,

$\displaystyle \log {\P}(x) \approx \log k -a \log x.$ (13.4)

Therefore we can build the order statistics $ X_{(1)} > X_{(2)} > \ldots > X_{(n)}$ and estimate the probability $ {\P}(x)$ for $ x=X_{(i)}$ using the relative frequency

$\displaystyle \frac{\#\{t;\; X_t \ge X_{(i)}\}}{n} = \frac{i}{n}.$

In (13.4) $ {\P}(X_{(i)})$ is replaced by the estimator $ i/n$:

$\displaystyle \log \frac{i}{n} \approx \log k -a \log X_{(i)},$ (13.5)

from which $ a$ can be estimated by a least squares regression of $ \log(i/n)$ on $ \log X_{(i)}, i=1,\ldots,n$. In general only a small part of the data is used in the regression, since the linear approximation of $ \log {\P}(x)$ is only appropriate in the tail. Thus only the largest order statistics enter the regression (13.5). Figure 13.4 shows the regression (13.5) for the DAFOX from 1974 to 1996 with $ m=20$, i.e., we use the 20 largest observations. The slope of the least squares (LS) line is -3.25. It indicates that the variance and the third moment of this time series are finite, whereas the fourth moment and hence the kurtosis are not.

Fig.: The right side of the logged empirical distribution of the DAFOX returns from 1974 to 1996. SFEtaildax.xpl
\includegraphics[width=1\defpicwidth]{taildax.ps}
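A minimal implementation of the regression (13.5) could look as follows; the input is assumed to be an array of positive observations (here the right tail of the return distribution), and only the $ m$ largest values enter the fit.

```python
import numpy as np

def tail_index_ls(x, m=20):
    """Estimate the tail index a by the least squares regression (13.5)."""
    n = len(x)
    largest = np.sort(x)[::-1][:m]            # X_(1) > X_(2) > ... > X_(m)
    i = np.arange(1, m + 1)
    # regress log(i/n) on log X_(i); the slope estimates -a
    slope, _ = np.polyfit(np.log(largest), np.log(i / n), 1)
    return -slope

# a_hat = tail_index_ls(returns, m=20)        # 'returns' is a hypothetical array of returns
```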

Hill (1975) has suggested an estimator using the maximum likelihood method:

$\displaystyle \hat{a} = \left(\frac{1}{m-1} \sum_{i=1}^m \log X_{(i)} - \log X_{(m)} \right)^{-1},$ (13.6)

where $ m$ is the number of observations taken from the tail and used in the estimation. How to choose $ m$ is obviously a problem. When $ m$ is too large, the approximation of the distribution is no longer good; when $ m$ is too small, the bias and the variance of the estimator could increase. A simple rule of thumb says that $ m/n$ should be around 0.5% or 1%. Clearly one requires a large amount of data in order to estimate $ a$ well. As an example we consider again the daily returns of German stocks from 1974 to 1996, a total of 5747 observations per stock. The results of the least squares estimator and the Hill estimator with $ m=20$ and $ m=50$ are given in Table 13.1. In every case the estimates are larger than 2, which indicates that the variances are finite. The third moment may not exist in some cases, for example, for Allianz and Daimler.
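The Hill estimator (13.6) is equally short to implement; as before, the input array is assumed to contain the positive right-tail observations.

```python
import numpy as np

def hill_estimator(x, m=20):
    """Hill estimator (13.6) of the tail index from the m largest observations of x."""
    largest = np.sort(x)[::-1][:m]            # X_(1) >= X_(2) >= ... >= X_(m)
    return 1.0 / (np.sum(np.log(largest)) / (m - 1) - np.log(largest[m - 1]))
```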

Theorem 13.4 (Representation of an ARCH(1) process)  
Let $ \varepsilon_t$ be a strong ARCH(1) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$. It holds that

$\displaystyle \varepsilon_t^2 = \omega \sum_{k=0}^\infty \alpha^k \prod_{j=0}^k Z_{t-j}^2
$

and the sum converges in $ L_1$.

Proof:
The representation follows from the recursive substitution of $ \varepsilon_s^2=\sigma_s^2 Z_s^2$ and $ \sigma_s^2 = \omega + \alpha \varepsilon_{s-1}^2$. The convergence follows from
$\displaystyle \mathop{\text{\rm\sf E}}[\varepsilon_t^2 - \omega \sum_{k=0}^m \alpha^k \prod_{j=0}^k Z_{t-j}^2]$ $\displaystyle =$ $\displaystyle \alpha^{m+1} \mathop{\text{\rm\sf E}}[\varepsilon_{t-m-1}^2 \prod_{j=0}^{m} Z_{t-j}^2]$  
  $\displaystyle =$ $\displaystyle \alpha^{m+1} \mathop{\text{\rm\sf E}}[\varepsilon_{t-m-1}^2] \longrightarrow 0$  

for $ m \longrightarrow \infty$, since $ Z_t$ is independent with $ \mathop{\text{\rm\sf E}}(Z_t^2)=1$. $ {\Box}$

Theorem 13.5  
Let $ \varepsilon_t$ be a stationary strong ARCH(1) process with $ \mathop{\text{\rm\sf E}}(\varepsilon_t^4)=c<\infty$ and $ Z_t \sim
{\text{\rm N}}(0,1)$. It holds that


Table 13.1: Least squares (LS) and Hill estimators of the tail index $ a$ with $ m$ observations used for the estimation.
LS Hill
$ m$ 20 50 20 50
DAFOX 3.25 2.94 3.17 2.88
ALLIANZ 2.29 2.44 2.28 2.26
BASF 4.19 4.29 4.58 4.01
BAYER 3.32 3.20 3.90 3.23
BMW 3.42 3.05 3.05 2.89
COMMERZBANK 6.58 4.67 7.14 5.19
DAIMLER 2.85 2.85 2.43 2.56
DEUTSCHE BANK 3.40 3.26 3.41 3.29
DEGUSSA 3.03 4.16 2.93 3.30
DRESDNER 5.76 4.08 4.20 4.32
HOECHST 4.77 3.68 5.66 4.05
KARSTADT 3.56 3.42 3.11 3.16
LINDE 3.30 3.35 3.87 3.37
MAN 3.83 3.66 3.17 3.45
MANNESMANN 3.19 3.85 2.84 3.22
PREUSSAG 3.52 4.11 3.57 3.68
RWE 3.87 3.78 3.51 3.54
SCHERING 3.34 4.82 3.22 3.64
SIEMENS 6.06 4.50 5.96 5.23
THYSSEN 5.31 5.36 4.67 4.97
VOLKSWAGEN 4.59 3.31 4.86 4.00


  1. $\displaystyle \varepsilon_t^2 = \omega \sum_{k=0}^\infty \alpha^k \prod_{j=0}^k Z_{t-j}^2
$

    and the sum converges in $ L_2$.
  2. $ \eta_t = \sigma_t^2(Z_t^2-1)$ is white noise.
  3. $ \varepsilon_t^2$ is an AR(1) process with $ \varepsilon_t^2 =
\omega + \alpha \varepsilon_{t-1}^2 + \eta_t$.

Proof:
  1. As in Theorem 13.4. The convergence in $ L_2$ follows from
    $\displaystyle \mathop{\text{\rm\sf E}}[(\varepsilon_t^2 - \omega \sum_{k=0}^m \alpha^k \prod_{j=0}^k Z_{t-j}^2)^2]$ $\displaystyle =$ $\displaystyle \mathop{\text{\rm\sf E}}[(\alpha^{m+1} \varepsilon_{t-m-1}^2 \prod_{j=0}^{m} Z_{t-j}^2)^2]$  
      $\displaystyle =$ $\displaystyle \alpha^{2(m+1)} 3^{m+1}\mathop{\text{\rm\sf E}}[\varepsilon_{t-m-1}^4]$  
      $\displaystyle =$ $\displaystyle \alpha^{2(m+1)} 3^{m+1} c \longrightarrow 0$  

    for $ m \longrightarrow \infty$, since $ 3\alpha^2 <1$ due to the assumption that $ \mathop{\text{\rm\sf E}}(\varepsilon_t^4)$ is finite and since $ Z_t$ is independent with $ \mathop{\text{\rm Kurt}}(Z_t)=3$.

    1. $ \mathop{\text{\rm\sf E}}[\eta_t] = \mathop{\text{\rm\sf E}}[\sigma_t^2]\mathop{\text{\rm\sf E}}[Z_t^2-1] = 0$

    2. $\displaystyle \mathop{\text{\rm Var}}(\eta_t) = \mathop{\text{\rm\sf E}}[\sigma_t^4]\mathop{\text{\rm\sf E}}[(Z_t^2-1)^2] = 2 \mathop{\text{\rm\sf E}}[(\omega+\alpha \varepsilon_{t-1}^2)^2] = 2(\omega^2 + 2\alpha \omega \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^2] + \alpha^2 \mathop{\text{\rm\sf E}}[\varepsilon_{t-1}^4]),$ which is constant and independent of $ t$.

    3. $\displaystyle \mathop{\text{\rm Cov}}(\eta_t,\eta_{t+s})$ $\displaystyle =$ $\displaystyle \mathop{\text{\rm\sf E}}[\sigma_t^2(Z_t^2-1)\sigma_{t+s}^2(Z_{t+s}^2-1)]$  
        $\displaystyle =$ $\displaystyle \mathop{\text{\rm\sf E}}[\sigma_t^2(Z_t^2-1)\sigma_{t+s}^2]\mathop{\text{\rm\sf E}}[(Z_{t+s}^2-1)]$  
        $\displaystyle =$ 0   for $\displaystyle s\ne 0.$  

  2. It follows from rearranging: $ \varepsilon_t^2 = \sigma_t^2 Z_t^2 = \sigma_t^2 +
\sigma_t^2(Z_t^2-1) = \omega + \alpha \varepsilon_{t-1}^2 + \eta_t$.
$ {\Box}$

Remark 13.1  
Nelson (1990$ a$) shows that the strong ARCH(1) process $ \varepsilon_t$ is strictly stationary when $ \mathop{\text{\rm\sf E}}[\log(\alpha Z_t^2)]<0$. If, for example, $ Z_t \sim {\text{\rm N}}(0,1)$, then the condition for strict stationarity is $ \alpha < 3.5622$, which is weaker than the condition for covariance-stationarity, $ \alpha<1$, since the latter requires a finite variance.

The dynamics of the volatility process in the case of ARCH(1) are essentially determined by the parameter $ \alpha$. In Theorem 13.5 it was shown that the square of an ARCH(1) process follows an AR(1) process. The correlation structure of empirical squared returns is frequently more complicated than that of a simple AR(1) process. In Section 13.1.3 we consider an ARCH model of order $ q$ with $ q>1$, which allows a more flexible modelling of the correlation structure.

In ARCH models in the narrow sense the volatility is a function of the past squared observations. In the more general GARCH models (Section 13.1.5) it may in addition depend on past squared volatilities. These models belong to the large group of unpredictable time series with stochastic volatility. In the strong form they satisfy $ \varepsilon_t = \sigma_t Z_t$, where $ \sigma_t$ is $ {\cal F}_{t-1}$-measurable, i.e. the volatility $ \sigma_t$ depends only on the information up to time $ t-1$, and the i.i.d. innovations $ Z_t$ satisfy $ \mathop{\text{\rm\sf E}}[Z_t]=0, \mathop{\text{\rm Var}}(Z_t)=1$. For such a time series it holds that $ \mathop{\text{\rm\sf E}}[\varepsilon_t \vert {\cal F}_{t-1}]=0$ and $ \mathop{\text{\rm Var}}(\varepsilon_t \vert {\cal F}_{t-1})=\sigma_t^2$, i.e. $ \varepsilon_t$ is unpredictable and, except in the special case $ \sigma_t \stackrel{\mathrm{def}}{=} \text{const}$, conditionally heteroscedastic. The stylized facts 2-4 are only fulfilled under certain qualitative assumptions. For example, in order to produce volatility clusters, $ \sigma_t$ must tend to be large when the squared observations or volatilities of the recent past are large. The generalizations of the ARCH models considered in this section fulfill the corresponding conditions.

Remark 13.2  
At first glance, stochastic volatility models in discrete time seem to offer a different approach to modelling financial data than the diffusion processes on which the Black-Scholes model and its generalizations are based (Section 5.4). Nelson (1990$ b$) has however shown that ARCH and also the more general GARCH processes converge in the limit to a continuous-time diffusion process when the time between successive observations tends to zero.

This result is often used in reverse in order to estimate the parameters of continuous-time financial models: one approximates the corresponding diffusion process by a discrete GARCH time series and estimates its parameters. Nelson (1990$ b$) shows only the convergence of GARCH processes to diffusion processes in a weak sense (convergence in distribution). A more recent work of Wang (2002) shows however that the approximation does not hold in a stronger sense; in particular, the likelihood processes are not asymptotically equivalent. In this sense the maximum likelihood estimators for the discrete time series do not converge to the parameters of the diffusion limit process.

13.1.2 Estimation of ARCH(1) Models

Theorem 13.5 says that an ARCH(1) process can be represented as an AR(1) process in $ \varepsilon_t^2$. A simple Yule-Walker estimator uses this property:

$\displaystyle \hat{\alpha}^{(0)} = \frac{\sum_{t=2}^n (\varepsilon_t^2 - \hat{\omega}^{(0)})(\varepsilon_{t-1}^2 - \hat{\omega}^{(0)})}{\sum_{t=2}^n (\varepsilon_t^2 - \hat{\omega}^{(0)})^2}$

with $ \hat{\omega}^{(0)} = n^{-1}\sum_{t=1}^n \varepsilon_t^2$. Since the distribution of $ \varepsilon_t^2$ is naturally not normal, the Yule-Walker estimator is inefficient. However it can be used as an initial value for iterative estimation methods.
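As a sketch, these starting values can be computed directly from the squared series; the variable name eps for the observed series is hypothetical.

```python
import numpy as np

def arch1_start_values(eps):
    """Yule-Walker type starting values (omega_0, alpha_0) for an ARCH(1) model."""
    e2 = eps ** 2
    omega0 = e2.mean()
    alpha0 = (np.sum((e2[1:] - omega0) * (e2[:-1] - omega0))
              / np.sum((e2[1:] - omega0) ** 2))
    return omega0, alpha0
```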

The estimation of ARCH models is normally done using the maximum likelihood (ML) method. Assuming that the returns $ \varepsilon_t$ have a conditionally normal distribution, we have:

$\displaystyle p(\varepsilon_t \mid {\cal F}_{t-1})=\frac{1}{\sqrt{2\pi}\sigma_t} \exp\left\{-\frac{1}{2}\frac{\varepsilon_t^2}{\sigma_t^2}\right\},$ (13.7)

The log-likelihood function $ l(\omega,\alpha)$ can be written as a function of the parameters $ \omega$ and $ \alpha$:
$\displaystyle l(\omega,\alpha)$ $\displaystyle =$ $\displaystyle \sum_{t=2}^n l_t(\omega,\alpha) + \log p_{\varepsilon}(\varepsilon_1)$ (13.8)
  $\displaystyle =$ $\displaystyle \sum_{t=2}^n \log p(\varepsilon_t \mid {\cal F}_{t-1}) + \log p_{\varepsilon}(\varepsilon_1)$  
  $\displaystyle =$ $\displaystyle -\frac{n-1}{2} \log(2\pi) - \frac{1}{2} \sum_{t=2}^n \log (\omega+\alpha \varepsilon_{t-1}^2)$  
    $\displaystyle - \frac{1}{2} \sum_{t=2}^n \frac{\varepsilon_t^2}{\omega+\alpha \varepsilon_{t-1}^2} + \log p_{\varepsilon}(\varepsilon_1),$  

where $ p_{\varepsilon}$ is the stationary marginal density of $ \varepsilon_t$. A problem is that the analytical expression for $ p_{\varepsilon}$ is unknown in ARCH models, so (13.8) cannot be computed. In the conditional likelihood function $ l^b=\log p(\varepsilon_n,\ldots,\varepsilon_2 \mid \varepsilon_1)$ the term $ \log p_{\varepsilon}(\varepsilon_1)$ disappears:
$\displaystyle l^b(\omega,\alpha)$ $\displaystyle =$ $\displaystyle \sum_{t=2}^n l_t(\omega,\alpha)$ (13.9)
  $\displaystyle =$ $\displaystyle \sum_{t=2}^n \log p(\varepsilon_t \mid {\cal F}_{t-1})$  
  $\displaystyle =$ $\displaystyle -\frac{n-1}{2} \log(2\pi) - \frac{1}{2} \sum_{t=2}^n \log (\omega+\alpha \varepsilon_{t-1}^2) - \frac{1}{2} \sum_{t=2}^n \frac{\varepsilon_t^2}{\omega+\alpha \varepsilon_{t-1}^2}.$  

For large $ n$ the difference $ l - l^b$ is negligible.
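The conditional log-likelihood (13.9) is straightforward to code and to maximize numerically. The following sketch uses scipy.optimize.minimize with a derivative-free method and is only meant to illustrate the idea, not to replace the quantlets used in this section.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_arch1(params, eps):
    """Negative conditional log-likelihood (13.9) of an ARCH(1) model."""
    omega, alpha = params
    if omega <= 0 or alpha < 0:
        return np.inf
    sigma2 = omega + alpha * eps[:-1] ** 2             # sigma_t^2 for t = 2, ..., n
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2) + eps[1:] ** 2 / sigma2)

def fit_arch1(eps, start=(0.1, 0.3)):
    """Maximize the conditional likelihood numerically (QML under normality)."""
    res = minimize(neg_loglik_arch1, x0=np.asarray(start), args=(eps,),
                   method="Nelder-Mead")
    return res.x                                       # (omega_hat, alpha_hat)
```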

Figure 13.5 shows the conditional likelihood of a generated ARCH(1) process with $ n=100$. The parameter $ \omega$ is chosen so that the unconditional variance is everywhere constant, i.e., for a variance of $ \sigma^2$, $ \omega=(1-\alpha)\sigma^2$. For the ARCH(1) model the maximum of the likelihood can be located simply by inspecting the graph. Most often we would also like to know the precision of the estimator; by the asymptotic properties of the ML estimator (see Section 13.1.6) it is essentially determined by the second derivative of the likelihood at the maximum. Furthermore, for models of larger order one has to use numerical methods such as the score algorithm introduced in Section 11.8 to estimate the parameters. In this case the first and second partial derivatives of the likelihood must be calculated.

Fig.: Conditional likelihood function of a generated ARCH(1) process with $ n=100$. The true parameter is $ \alpha=0.5$. SFElikarch1.xpl
\includegraphics[width=1\defpicwidth]{likarch1.ps}

With the ARCH(1) model these are

$\displaystyle \frac{\partial l_t^b}{\partial \omega}$ $\displaystyle =$ $\displaystyle \frac{1}{2\sigma_t^2} \left(\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.10)
$\displaystyle \frac{\partial l_t^b}{\partial \alpha}$ $\displaystyle =$ $\displaystyle \frac{1}{2\sigma_t^2} \varepsilon_{t-1}^2 \left(\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.11)
$\displaystyle \frac{\partial^2 l_t^b}{\partial \omega^2}$ $\displaystyle =$ $\displaystyle -\frac{1}{2\sigma_t^4} \left(2\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.12)
$\displaystyle \frac{\partial^2 l_t^b}{\partial \alpha^2}$ $\displaystyle =$ $\displaystyle -\frac{1}{2\sigma_t^4} \varepsilon_{t-1}^4 \left(2\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.13)
$\displaystyle \frac{\partial^2 l_t^b}{\partial \omega \partial \alpha}$ $\displaystyle =$ $\displaystyle -\frac{1}{2\sigma_t^4} \varepsilon_{t-1}^2 \left(2\frac{\varepsilon_t^2}{\sigma_t^2}-1\right).$ (13.14)

The first order conditions are $ \sum_{t=2}^n \partial l_t^b / \partial \omega =0$ and $ \sum_{t=2}^n \partial l_t^b / \partial \alpha =0$. For the score algorithm the expected value of the second derivative has to be calculated. Since $ \mathop{\text{\rm\sf E}}[Z_t^2]=\mathop{\text{\rm\sf E}}[(\varepsilon_t/\sigma_t)^2]=1$, the expression in parentheses, $ (2\varepsilon_t^2/\sigma_t^2-1)$, has expected value one. From this it follows that

$\displaystyle \mathop{\text{\rm\sf E}}\left[ \frac{\partial^2 l_t^b}{\partial \omega^2}\right] = -\frac{1}{2}\mathop{\text{\rm\sf E}}\left[\frac{1}{\sigma_t^4}\right].$

The expectation of $ \sigma_t^{-4}$ is consistently estimated by $ (n-1)^{-1}\sum_{t=2}^n (\omega+\alpha \varepsilon_{t-1}^2)^{-2}$, so that for the estimator of the expected value of the second derivative we have:

$\displaystyle \hat{\mathop{\text{\rm\sf E}}} \frac{\partial^2 l_t^b}{\partial \omega^2} = -\frac{1}{2(n-1)}\sum_{t=2}^{n} \frac{1}{\sigma_t^4}.
$

Similarly the expected value of the second derivative with respect to $ \alpha$ follows with

$\displaystyle \mathop{\text{\rm\sf E}}\left[ \frac{\partial^2 l_t^b}{\partial \alpha^2}\right] = -\frac{1}{2}\mathop{\text{\rm\sf E}}\left[\frac{\varepsilon_{t-1}^4}{\sigma_t^4}\right]$

and the estimator is

$\displaystyle \hat{\mathop{\text{\rm\sf E}}} \frac{\partial^2 l_t^b}{\partial \alpha^2} = -\frac{1}{2(n-1)}\sum_{t=2}^{n} \frac{\varepsilon_{t-1}^4}{\sigma_t^4}.$

Theorem 13.6  
Given $ Z_t \sim
{\text{\rm N}}(0,1)$, it holds that

$\displaystyle \mathop{\text{\rm\sf E}}\left[\left(\frac{\partial l_t^b}{\partial \omega}\right)^2\right] = -\mathop{\text{\rm\sf E}}\left[ \frac{\partial^2 l_t^b}{\partial \omega^2}\right]$

Proof:
This follows immediately from $ \mathop{\text{\rm\sf E}}\left[\left(\frac{\partial l_t^b}{\partial \omega}\right)^2\right] = \mathop{\text{\rm\sf E}}\left[\frac{1}{4\sigma_t^4} (Z_t^4 - 2Z_t^2+1)\right] = \mathop{\text{\rm\sf E}}\left[\frac{1}{4\sigma_t^4}\right] (3 - 2 + 1).$ $ {\Box}$

Obviously Theorem 13.6 also holds for the parameter $ \alpha$ in place of $ \omega$. In addition it holds essentially for more general models, for example for the estimation of GARCH models in Section 13.1.6. In more complicated models one can replace the second derivative with the square of the first derivative, which is easier to calculate. It is assumed, however, that the likelihood function is correctly specified, i.e., the true distribution of the error terms is normal.

Under the two conditions

  1. $ \mathop{\text{\rm\sf E}}[Z_t \mid {\cal F}_{t-1}]= 0$ and $ \mathop{\text{\rm\sf E}}[Z_t^2 \mid {\cal
F}_{t-1}]= 1$
  2. $ \mathop{\text{\rm\sf E}}[\log(\alpha Z_t^2) \mid {\cal F}_{t-1}] <
0$ (strict stationarity)
and under certain technical conditions, the ML estimators are consistent. If $ \mathop{\text{\rm\sf E}}[Z_t^4 \mid {\cal F}_{t-1}] < \infty$ and $ \omega>0$, $ \alpha>0$ hold in addition, then $ \hat{\theta}=(\hat{\omega}, \hat{\alpha})^\top $ is asymptotically normally distributed:

$\displaystyle \sqrt{n}(\hat{\theta}-\theta) \stackrel{{\cal L}}{\longrightarrow} {\text{\rm N}} (0, J^{-1} I J^{-1})$ (13.15)

with

$\displaystyle I = \mathop{\text{\rm\sf E}}\left(\frac{\partial l_t(\theta)}{\partial \theta}
\frac{\partial l_t(\theta)}{\partial \theta^\top } \right)
$

and

$\displaystyle J=- \mathop{\text{\rm\sf E}}\left(\frac{\partial^2 l_t(\theta)}{\partial \theta
\partial \theta^\top } \right).
$

If the true distribution of $ Z_t$ is normal, then $ I=J$ and the asymptotic covariance matrix simplifies to $ J^{-1}$, i.e., the inverse of the Fisher information matrix. If the true distribution is instead leptokurtic, then the maximizer of (13.9) is still consistent, but no longer efficient. In this case the ML method is interpreted as the `Quasi Maximum Likelihood' (QML) method.

In a Monte Carlo simulation study in Shephard (1996), 1000 ARCH(1) processes with $ \omega=0.2$ and $ \alpha =0.9$ were generated and the parameters were estimated using QML. The results are given in Table 13.2; a small code sketch reproducing such an experiment follows the table. Obviously even with moderate sample sizes ($ n=500$) the bias is negligible. The variance, however, is still so large that a relatively large proportion (10%) of the estimates are larger than one, which would imply covariance nonstationarity. This, in turn, has a considerable influence on the volatility prediction.


Table 13.2: Monte Carlo simulation results for QML estimates of the parameter $ \alpha =0.9$ from an ARCH(1) model with $ k=1000$ replications. The last column gives the proportion of estimates that are larger than 1 (according to Shephard (1996)).
$ n$ $ k^{-1}\sum_{j=1}^k \hat{\alpha}_j$ $ \sqrt{k^{-1}\sum_{j=1}^k(\hat{\alpha}_j-\alpha)^2}$ #( $ \hat{\alpha}_j\ge 1$)
100 0.852 0.257 27%
250 0.884 0.164 24%
500 0.893 0.107 15%
1000 0.898 0.081 10%
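A reduced version of such a simulation study can be sketched with the hypothetical helpers simulate_arch1 and fit_arch1 introduced earlier; $ k$ is kept small here for speed, so the numbers will only roughly match Table 13.2.

```python
import numpy as np

def monte_carlo_arch1(k=200, n=500, omega=0.2, alpha=0.9):
    """Small QML Monte Carlo in the spirit of Table 13.2."""
    alpha_hats = np.empty(k)
    for j in range(k):
        eps = simulate_arch1(n, omega, alpha, seed=j)   # sketched after Theorem 13.3
        alpha_hats[j] = fit_arch1(eps)[1]               # sketched after (13.9)
    mean = alpha_hats.mean()
    rmse = np.sqrt(np.mean((alpha_hats - alpha) ** 2))
    frac_ge_one = np.mean(alpha_hats >= 1.0)
    return mean, rmse, frac_ge_one
```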


13.1.3 ARCH($ q$): Definition and Properties

The definition of the ARCH(1) model is now extended to the case in which the conditional variance depends on $ q>1$ lags.

Definition 13.2 (ARCH($ q$))  
The process $ (\varepsilon_t)$, $ t \in \mathbb{Z}$, is ARCH($ q$), when $ \mathop{\text{\rm\sf E}}[\varepsilon_t \mid {\cal F}_{t-1}]=0$,

$\displaystyle \sigma_t^2=\omega+\alpha_1 \varepsilon_{t-1}^2 + \ldots + \alpha_q \varepsilon_{t-q}^2$ (13.16)

with $ \omega>0, \: \alpha_1\geq 0, \ldots, \alpha_q\geq 0$ and

  1. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ and $ Z_t=\varepsilon_t/\sigma_t$ is i.i.d. (strong ARCH)
  2. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ (semi-strong ARCH), or
  3. $ {\cal P}(\varepsilon_t^2 \mid 1, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots) = \sigma_t^2$ (weak ARCH).

The conditional variance $ \sigma_t^2$ in an ARCH($ q$) model is also a linear function of the $ q$ squared lags.

Theorem 13.7  
Let $ \varepsilon_t$ be a semi-strong ARCH($ q$) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$. Then

$\displaystyle \sigma^2 = \frac{\omega}{1-\alpha_1-\ldots-\alpha_q}
$

with $ \alpha_1+\cdots+\alpha_q < 1$.

Proof:
as in Theorem 13.2. $ {\Box}$

If instead $ \alpha_1+\cdots+\alpha_q \ge 1$, then the unconditional variance does not exist and the process is not covariance-stationary.

Theorem 13.8 (Representation of an ARCH($ q$) Process)  
Let $ \varepsilon_t$ be a (semi-)strong ARCH($ q$) process with $ \mathop{\text{\rm\sf E}}[\varepsilon_t^4]=c<\infty$. Then
  1. $ \eta_t = \sigma_t^2(Z_t^2-1)$ is white noise.
  2. $ \varepsilon_t^2$ is an AR($ q$) process with $ \varepsilon_t^2 =
\omega + \sum_{i=1}^q \alpha_i \varepsilon_{t-i}^2 + \eta_t$.

Proof:
as in Theorem 13.5. $ {\Box}$

A problem with the ARCH($ q$) model is that for some applications a large order $ q$ is needed, since the influence of large lags on the volatility decays only slowly. An empirical rule of thumb suggests using a minimum order of $ q=14$. The disadvantage of a large order is that many parameters have to be estimated under restrictions. The restrictions can be categorized as conditions for stationarity and for strictly positive parameters. If efficient estimation methods such as maximum likelihood are to be used, the estimation over a high-dimensional parameter space can be numerically quite demanding.

One possibility of reducing the number of parameters while including a long history is to assume linearly decreasing weights on the lags, i.e.,

$\displaystyle \sigma_t^2=\omega+\alpha \sum_{i=1}^q w_i \varepsilon_{t-i}^2,
$

with

$\displaystyle w_i = \frac{2(q+1-i)}{q(q+1)},
$

so that only two parameters have to be estimated. In Section 13.1.5 we describe a generalized ARCH model which, on the one hand, has a parsimonious parameterization and, on the other hand, a flexible lag structure.
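The weights are easy to evaluate and indeed sum to one, as the following small sketch shows.

```python
import numpy as np

def linear_arch_weights(q):
    """Linearly decreasing weights w_i = 2(q+1-i)/(q(q+1)), i = 1,...,q."""
    i = np.arange(1, q + 1)
    return 2.0 * (q + 1 - i) / (q * (q + 1))

w = linear_arch_weights(14)
print(w[:3], w.sum())     # decreasing weights, total weight 1.0
# sigma_t^2 = omega + alpha * np.dot(w, eps_lags**2) with eps_lags = (eps_{t-1}, ..., eps_{t-q})
```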

13.1.4 Estimation of an ARCH($ q$) Model

For the general ARCH($ q$) model from (13.16) the conditional likelihood is

$\displaystyle l^b(\theta)$ $\displaystyle =$ $\displaystyle \sum_{t=q+1}^n l_t(\theta)$  
  $\displaystyle =$ $\displaystyle -\frac{n-q}{2} \log(2\pi) - \frac{1}{2} \sum_{t=q+1}^n \log \sigma_t^2 - \frac{1}{2} \sum_{t=q+1}^n \frac{\varepsilon_t^2}{\sigma_t^2}$ (13.17)

with the parameter vector $ \theta=(\omega,\alpha_1,\ldots,\alpha_q)^\top $. Although one can find the optimum of ARCH(1) models by analyzing the graph, as in Figure 13.5, this is complicated and impractical for a high-dimensional parameter space. The maximization of (13.17) with respect to $ \theta$ is a non-linear optimization problem which can be solved numerically. The score algorithm is used empirically not only for ARMA models (see Section 11.8) but also for ARCH models. In order to implement this approach the first and second derivatives of the (conditional) likelihood with respect to the parameters need to be computed. For the ARCH($ q$) model the first derivative is

$\displaystyle \frac{\partial l_t^b}{\partial \theta} = \frac{1}{2\sigma_t^2} \frac{\partial \sigma_t^2}{\partial \theta} \left(\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.18)

with

$\displaystyle \frac{\partial \sigma_t^2}{\partial \theta} = (1, \varepsilon_{t-1}^2,
\ldots,\varepsilon_{t-q}^2)^\top .
$

The first order condition is $ \sum_{t=q+1}^n \partial l_t / \partial \theta =0$. For the second derivative and the asymptotic properties of the QML estimator see Section 13.1.6.

13.1.5 Generalized ARCH (GARCH)

The ARCH($ q$) model can be generalized by adding autoregressive terms of the volatility to the variance equation.

Definition 13.3 (GARCH($ p,q$))   The process $ (\varepsilon_t)$, $ t \in \mathbb{Z}$, is GARCH($ p,q$), if $ \mathop{\text{\rm\sf E}}[\varepsilon_t \mid {\cal F}_{t-1}]=0$,

$\displaystyle \sigma_t^2 = \omega + \sum_{i=1}^q\alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^p \beta_j \sigma_{t-j}^2,$ (13.19)

and

  1. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ and $ Z_t=\varepsilon_t/\sigma_t$ is i.i.d. (strong GARCH)
  2. $ \mathop{\text{\rm Var}}(\varepsilon_t \mid {\cal F}_{t-1}) = \sigma_t^2$ (semi-strong GARCH), or
  3. $ {\cal P}(\varepsilon_t^2 \mid 1, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots) = \sigma_t^2$ (weak GARCH).

The sufficient but not necessary conditions for

$\displaystyle \sigma_t^2 > 0 \quad a.s.,$   ( $\displaystyle {\P}[\sigma_t^2 > 0] = 1$   ) (13.20)

are $ \omega>0, \: \alpha_i\geq 0, \: i=1,\ldots,q$ and $ \beta_j
\geq 0, \:j=1,\ldots,p$. In the case of the GARCH(1,2) model
$\displaystyle \sigma_t^2$ $\displaystyle =$ $\displaystyle \omega + \alpha_1 \varepsilon_{t-1}^2 + \alpha_2 \varepsilon_{t-2}^2 + \beta_1 \sigma_{t-1}^2$  
  $\displaystyle =$ $\displaystyle \frac{\omega}{1-\beta_1} + \alpha_1 \sum_{j=0}^\infty \beta_1^j \varepsilon_{t-j-1}^2 + \alpha_2 \sum_{j=0}^\infty \beta_1^j \varepsilon_{t-j-2}^2$  
  $\displaystyle =$ $\displaystyle \frac{\omega}{1-\beta_1} + \alpha_1 \varepsilon_{t-1}^2 + (\alpha_1\beta_1+\alpha_2) \sum_{j=0}^\infty \beta_1^j \varepsilon_{t-j-2}^2$  

with $ 0 \le \beta_1 <1$. $ \omega>0$, $ \alpha_1 \ge 0$ and $ \alpha_1 \beta_1 + \alpha_2 \ge 0$ are necessary and sufficient conditions for (13.20), assuming that the sum $ \sum_{j=0}^\infty \beta_1^j \varepsilon_{t-j-2}^2$ converges.

Theorem 13.9 (Representation of a GARCH($ p,q$) process)  
Let $ \varepsilon_t$ be a (semi-)strong GARCH($ p,q$) process with $ \mathop{\text{\rm\sf E}}[\varepsilon_t^4]=c<\infty$. Then
  1. $ \eta_t = \sigma_t^2(Z_t^2-1)$ is white noise.
  2. $ \varepsilon_t^2$ is an ARMA($ m,p$) process with

    $\displaystyle \varepsilon_t^2 = \omega + \sum_{i=1}^m \gamma_i \varepsilon_{t-i}^2 - \sum_{j=1}^p \beta_j \eta_{t-j} + \eta_t,$ (13.21)

    with $ m=\max(p,q)$, $ \gamma_i = \alpha_i + \beta_i$. $ \alpha_i=0$ when $ i>q$, and $ \beta_i=0$ when $ i>p$.

Proof:
as in Theorem 13.5. $ {\Box}$

If $ \varepsilon_t$ follows a GARCH process, then from Theorem 13.9 we can see that $ \varepsilon_t^2$ follows an ARMA model with conditionally heteroscedastic error terms $ \eta_t$. As we know, if all the roots of the polynomial $ (1-\beta_1 z -\ldots- \beta_p z^p)$ lie outside the unit circle, then the ARMA process (13.21) is invertible and can be written as an AR($ \infty$) process. Moreover it follows from Theorem 13.8 that the GARCH($ p,q$) model can be represented as an ARCH($ \infty$) model. Thus one can draw conclusions analogous to those for ARMA models when determining the order $ (p,q)$ of the model. There are, however, essential differences in the definition of the persistence of shocks.

Theorem 13.10 (Unconditional variance of a GARCH($ p,q$) process)  
Let $ \varepsilon_t$ be a semi-strong GARCH($ p,q$) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$. Then

$\displaystyle \sigma^2 = \frac{\omega}{1-\sum_{i=1}^q \alpha_i - \sum_{j=1}^p \beta_j},
$

with $ \sum_{i=1}^q \alpha_i + \sum_{j=1}^p \beta_j < 1$.

Proof:
as in Theorem 13.2. $ {\Box}$

General conditions for the existence of higher moments of GARCH($ p,q$) models are given in He and Teräsvirta (1999). For models of lower order and under distributional assumptions we can derive:

Theorem 13.11 (Fourth moment of a GARCH(1,1) process)  
Let $ \varepsilon_t$ be a (semi-)strong GARCH(1,1) process with $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2<\infty$ and $ Z_t \sim {\text{\rm N}}(0,1).$ Then $ \mathop{\text{\rm\sf E}}[\varepsilon_t^4]<\infty$ holds if and only if $ 3\alpha_1^2+2\alpha_1\beta_1+\beta_1^2<1$. The Kurtosis $ \mathop{\text{\rm Kurt}}(\varepsilon_t)$ is given as

$\displaystyle \mathop{\text{\rm Kurt}}[\varepsilon_t]= \frac{\mathop{\text{\rm\sf E}}[\varepsilon_t^4]}{\left(\mathop{\text{\rm\sf E}}[\varepsilon_t^2]\right)^2} = 3 + \frac{6\alpha_1^2}{1-\beta_1^2-2\alpha_1\beta_1-3\alpha_1^2}.$ (13.22)

Proof:
The claim follows from $ \mathop{\text{\rm\sf E}}[\varepsilon_t^4]=3\mathop{\text{\rm\sf E}}[(\omega+\alpha_1 \varepsilon_{t-1}^2+\beta_1 \sigma_{t-1}^2)^2]$ and the stationarity of $ \varepsilon_t$. $ {\Box}$

Function (13.22) is illustrated in Figure 13.6. For all $ \alpha_1>0$ we have $ \mathop{\text{\rm Kurt}}[\varepsilon_t]>3$, i.e., the distribution of $ \varepsilon_t$ is leptokurtic. The kurtosis equals 3 only in the boundary case $ \alpha_1=0$, where the conditional heteroscedasticity disappears and a Gaussian white noise remains. In addition it can be seen in the figure that for a given $ \alpha_1$ the kurtosis increases only slowly in $ \beta_1$, whereas for a given $ \beta_1$ it increases much faster in $ \alpha_1$.

Fig.: Kurtosis of a GARCH(1,1) process according to (13.22). The left axis shows the parameter $ \beta_1$, the right $ \alpha_1$. SFEkurgarch.xpl
\includegraphics[width=1.2\defpicwidth]{kurgarch.ps}
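Formula (13.22) can also be evaluated directly. The short sketch below illustrates, for two example parameter pairs, the slow increase of the kurtosis in $ \beta_1$ and the much faster increase in $ \alpha_1$.

```python
def garch11_kurtosis(alpha1, beta1):
    """Kurtosis (13.22) of a Gaussian GARCH(1,1); needs 3*alpha1**2 + 2*alpha1*beta1 + beta1**2 < 1."""
    denom = 1 - beta1**2 - 2 * alpha1 * beta1 - 3 * alpha1**2
    if denom <= 0:
        raise ValueError("the fourth moment does not exist for these parameters")
    return 3 + 6 * alpha1**2 / denom

print(garch11_kurtosis(0.05, 0.90))   # about 3.2 despite the large beta_1
print(garch11_kurtosis(0.15, 0.80))   # about 5.6, driven by the larger alpha_1
```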

Remark 13.3  
Nelson (1990$ a$) shows that the strong GARCH(1,1) process $ \varepsilon_t$ is strictly stationary when $ \mathop{\text{\rm\sf E}}[\log(\alpha_1 Z_t^2+\beta_1)]<0$. If $ Z_t \sim {\text{\rm N}}(0,1)$, then this condition for strict stationarity is weaker than the condition for covariance-stationarity, $ \alpha_1 +\beta_1 <1$.

In practical applications it is frequently shown that models with smaller order sufficiently describe the data. In most cases GARCH(1,1) is sufficient.

A substantial disadvantage of the standard ARCH and GARCH models is that they cannot model asymmetries of the volatility with respect to the sign of past shocks. This results from the squared form of the lagged shocks in (13.16) and (13.19): only the magnitude of a past shock affects the volatility, not its sign. In other words, bad news (identified by a negative sign) has the same influence on the volatility as good news (positive sign) if the absolute values are the same. Empirically it is observed, however, that bad news has a larger effect on the volatility than good news. In Sections 13.2 and 14.1 we will take a closer look at extensions of the standard models which can account for these observations.

13.1.6 Estimation of GARCH($ p,q$) Models

Based on the ARMA representation of GARCH processes (see Theorem 13.9), Yule-Walker estimators $ \tilde{\theta}$ can again be considered. These estimators are, as can be shown, consistent and asymptotically normally distributed, $ \sqrt{n}(\tilde{\theta}-\theta) \stackrel{{\cal L}}{\longrightarrow} {\text{\rm N}}(0,\tilde{\Sigma})$. However in the case of GARCH models they are not efficient in the sense that the matrix $ \tilde{\Sigma} - J^{-1} I J^{-1}$ is positive definite, where $ J^{-1} I J^{-1}$ is the asymptotic covariance matrix of the QML estimator, see (13.25). In the literature there are several studies of the efficiency of the Yule-Walker and QML estimators in finite samples, see Section 13.4. In most cases maximum likelihood methods are chosen in order to attain efficiency.

The likelihood function of the general GARCH($ p,q$) model (13.19) is identical to (13.17) with the extended parameter vector $ \theta=(\omega,\alpha_1,\ldots,\alpha_q, \beta_1,\ldots,\beta_p)^\top $. Figure 13.7 displays the likelihood function of a generated GARCH(1,1) process with $ \omega=0.1$, $ \alpha=0.1$, $ \beta=0.8$ and $ n=500$. The parameter $ \omega$ was chosen so that the unconditional variance is everywhere constant, i.e., for a variance of $ \sigma^2$, $ \omega=(1-\alpha-\beta)\sigma^2$. As one can see, the function is flat on the right, close to the optimum, thus the estimation will be relatively imprecise, i.e., it will have a large variance. In addition, Figure 13.8 displays the contour plot of the likelihood function.

Fig.: Likelihood function of a generated GARCH(1,1) process with $ n=500$. The left axis shows the parameter $ \beta$, the right $ \alpha$. The true parameters are $ \omega=0.1$, $ \alpha=0.1$ and $ \beta=0.8$. SFElikgarch.xpl
\includegraphics[width=1.0\defpicwidth]{likgar3d.ps}

Fig.: Contour plot of the likelihood function of a generated GARCH(1,1) process with $ n=500$. The vertical axis displays the parameter $ \beta$, the horizontal $ \alpha$. The true parameters are $ \omega=0.1$, $ \alpha=0.1$ and $ \beta=0.8$. SFElikgarch.xpl
\includegraphics[width=1\defpicwidth]{likgarco.ps}
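Analogous to the ARCH(1) sketch above, the conditional log-likelihood of a GARCH(1,1) model can be coded and maximized numerically; this is an illustrative sketch with a simple initialization of $ \sigma_1^2$, not the SFElikgarch.xpl quantlet.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik_garch11(params, eps):
    """Negative conditional log-likelihood of a GARCH(1,1) model (Gaussian QML)."""
    omega, alpha, beta = params
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        return np.inf
    n = len(eps)
    sigma2 = np.empty(n)
    sigma2[0] = eps.var()                              # simple initialization of sigma_1^2
    for t in range(1, n):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sigma2[1:]) + eps[1:] ** 2 / sigma2[1:])

def fit_garch11(eps, start=(0.1, 0.1, 0.8)):
    res = minimize(neg_loglik_garch11, x0=np.asarray(start), args=(eps,),
                   method="Nelder-Mead")
    return res.x                                       # (omega_hat, alpha_hat, beta_hat)
```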

The first partial derivatives of (13.17) are

$\displaystyle \frac{\partial l_t}{\partial \theta} = \frac{1}{2\sigma_t^2} \frac{\partial \sigma_t^2}{\partial \theta} \left(\frac{\varepsilon_t^2}{\sigma_t^2}-1\right)$ (13.23)

with

$\displaystyle \frac{\partial \sigma_t^2}{\partial \theta} = \vartheta_t + \sum_{j=1}^p \beta_j \frac{\partial \sigma_{t-j}^2}{\partial \theta}.$

and $ \vartheta_t=(1, \varepsilon_{t-1}^2,
\ldots,\varepsilon_{t-q}^2,\sigma_{t-1}^2,\ldots,\sigma_{t-p}^2)^\top $. The first order conditions are
$ \sum_{t=q+1}^n
\partial l_t / \partial \theta =0$. The matrix of the second derivatives takes the following form:
$\displaystyle \frac{\partial^2 l_t(\theta)}{\partial \theta \partial \theta^\top }$ $\displaystyle =$ $\displaystyle \frac{1}{2\sigma_t^4} \frac{\partial \sigma_t^2}{\partial \theta} \frac{\partial \sigma_t^2}{\partial \theta^\top } - \frac{1}{2\sigma_t^2} \frac{\partial^2 \sigma_t^2(\theta)}{\partial \theta \partial \theta^\top }$  
  $\displaystyle -$ $\displaystyle \frac{\varepsilon_t^2}{\sigma_t^6}\frac{\partial \sigma_t^2}{\partial \theta} \frac{\partial \sigma_t^2}{\partial \theta^\top } - \frac{\varepsilon_t^2}{2\sigma_t^4} \frac{\partial^2 \sigma_t^2(\theta)}{\partial \theta \partial \theta^\top }$ (13.24)

Under the conditions

  1. $ \mathop{\text{\rm\sf E}}[Z_t \mid {\cal F}_{t-1}]= 0$ and $ \mathop{\text{\rm\sf E}}[Z_t^2 \mid {\cal
F}_{t-1}]= 1$,
  2. strict stationarity of $ \varepsilon_t$
and under some technical conditions the ML estimator is consistent. If in addition it holds that $ \mathop{\text{\rm\sf E}}[Z_t^4 \mid {\cal F}_{t-1}] < \infty$, then $ \hat{\theta}$ is asymptotically normally distributed:

$\displaystyle \sqrt{n}(\hat{\theta}-\theta) \stackrel{{\cal L}}{\longrightarrow} {\text{\rm N}}_{p+q+1} (0, J^{-1} I J^{-1})$ (13.25)

with

$\displaystyle I = \mathop{\text{\rm\sf E}}\left(\frac{\partial l_t(\theta)}{\partial \theta}
\frac{\partial l_t(\theta)}{\partial \theta^\top } \right)
$

and

$\displaystyle J=- \mathop{\text{\rm\sf E}}\left(\frac{\partial^2 l_t(\theta)}{\partial \theta
\partial \theta^{T}} \right).
$

Theorem 13.12 (Equivalence of $ I$ and $ J$)  
If $ Z_t \sim
{\text{\rm N}}(0,1)$, then it holds that $ I=J$.

Proof:
Taking expectations in (13.24) one obtains

$\displaystyle \mathop{\text{\rm\sf E}}\left[\frac{\partial^2 l_t(\theta)}{\partial \theta \partial \theta^\top }\right] = -\mathop{\text{\rm\sf E}}\left[\frac{1}{2\sigma_t^4} \frac{\partial \sigma_t^2}{\partial \theta} \frac{\partial \sigma_t^2}{\partial \theta^\top }\right].$

For $ I$ we have
$\displaystyle \mathop{\text{\rm\sf E}}\left[\frac{\partial l_t(\theta)}{\partial \theta}\frac{\partial l_t(\theta)}{\partial \theta^\top }\right]$ $\displaystyle =$ $\displaystyle \mathop{\text{\rm\sf E}}\left[ \frac{1}{4\sigma_t^4} \frac{\partial \sigma_t^2}{\partial \theta} \frac{\partial \sigma_t^2}{\partial \theta^\top } (Z_t^4-2Z_t^2+1)\right]$ (13.26)
  $\displaystyle =$ $\displaystyle \mathop{\text{\rm\sf E}}\left[ \frac{1}{4\sigma_t^4} \frac{\partial \sigma_t^2}{\partial \theta} \frac{\partial \sigma_t^2}{\partial \theta^\top }\right] \{\mathop{\text{\rm Kurt}}(Z_t \mid {\cal F}_{t-1})-1\}$  

From the assumption $ Z_t \sim
{\text{\rm N}}(0,1)$ it follows that $ \mathop{\text{\rm Kurt}}(Z_t \mid {\cal F}_{t-1})=3$ and thus the claim. $ {\Box}$

If the distribution of $ Z_t$ is specified correctly, then $ I=J$ and the asymptotic variance simplifies to $ J^{-1}$, i.e., the inverse of the Fisher information matrix. If this is not the case and the true distribution is instead leptokurtic, for example, the maximizer of (13.9) is still consistent but no longer efficient. In this case the ML method is interpreted as the `Quasi Maximum Likelihood' (QML) method.

Consistent estimators for the matrices $ I$ and $ J$ can be obtained by replacing the expectation with the simple average.
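For the GARCH(1,1) case this can be sketched as follows: the score from (13.23) is accumulated into an estimate of $ I$, the term $ \frac{1}{2\sigma_t^4}\frac{\partial \sigma_t^2}{\partial \theta}\frac{\partial \sigma_t^2}{\partial \theta^\top }$ from the proof of Theorem 13.12 into an estimate of $ J$, and the sandwich $ J^{-1} I J^{-1}/n$ is returned. Function and variable names are hypothetical and the initialization of $ \sigma_1^2$ is deliberately simple.

```python
import numpy as np

def garch11_sandwich_cov(eps, omega, alpha, beta):
    """QML sandwich covariance J^{-1} I J^{-1} / n for a fitted GARCH(1,1)."""
    n = len(eps)
    sigma2 = eps.var()                                 # crude initialization of sigma_1^2
    dsigma2 = np.zeros(3)                              # d sigma_t^2 / d(omega, alpha, beta)
    I = np.zeros((3, 3))
    J = np.zeros((3, 3))
    for t in range(1, n):
        sigma2_prev = sigma2
        sigma2 = omega + alpha * eps[t - 1] ** 2 + beta * sigma2_prev
        # gradient recursion: d sigma_t^2/d theta = (1, eps_{t-1}^2, sigma_{t-1}^2) + beta * d sigma_{t-1}^2/d theta
        dsigma2 = np.array([1.0, eps[t - 1] ** 2, sigma2_prev]) + beta * dsigma2
        score = dsigma2 / (2 * sigma2) * (eps[t] ** 2 / sigma2 - 1.0)   # score from (13.23)
        I += np.outer(score, score)
        J += np.outer(dsigma2, dsigma2) / (2 * sigma2 ** 2)
    I /= n - 1
    J /= n - 1
    J_inv = np.linalg.inv(J)
    return J_inv @ I @ J_inv / n
```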