12.8 Estimation of MA($ q$) and ARMA($ p,q$) Models

As soon as moving average coefficients are included in the estimated model, estimation becomes more difficult. Consider as an example the simple MA(1) model

$\displaystyle X_t = \beta \varepsilon_{t-1} + \varepsilon_t$ (12.25)

with $ \vert\beta\vert<1$ and $ \mathop{\text{\rm Var}}(\varepsilon_t)=\sigma^2$. A simple estimator for the parameter $ \beta$ is obtained from the Yule-Walker equations $ \gamma_0=\sigma^2(1+\beta^2)$ and $ \gamma_1 = \beta \sigma^2$. Dividing the second equation by the first gives $ \rho_1 = \gamma_1 / \gamma_0 = \beta/(1+\beta^2)$, i.e., the quadratic equation $ \rho_1\beta^2 - \beta + \rho_1 = 0$, whose solutions are

$\displaystyle \beta = \frac{1}{2\rho_1} \pm \sqrt{\frac{1}{4\rho_1^2} -1 }.$ (12.26)

The Yule-Walker estimator replaces the theoretical first order autocorrelation $ \rho_1$ in (12.26) with its estimator $ \hat{\rho}_1$. The estimator is quite simple, but it has the disadvantage of being asymptotically inefficient.
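To make this concrete, the following small sketch (in Python with numpy; the function name yule_walker_ma1 is ours and not part of any standard package) computes $ \hat{\beta}$ from $ \hat{\rho}_1$ via (12.26), keeping the invertible root $ \vert\hat{\beta}\vert<1$. Note that a real solution exists only if $ \vert\hat{\rho}_1\vert<1/2$.

    import numpy as np

    def yule_walker_ma1(x):
        """Yule-Walker type estimator of beta in an MA(1) model, cf. (12.26)."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        n = len(x)
        gamma0 = np.sum(x * x) / n               # empirical autocovariance, lag 0
        gamma1 = np.sum(x[1:] * x[:-1]) / n      # empirical autocovariance, lag 1
        rho1 = gamma1 / gamma0
        if abs(rho1) >= 0.5:
            raise ValueError("no real solution: |rho1_hat| must be < 1/2")
        # roots of rho1 * beta^2 - beta + rho1 = 0; keep the invertible one (|beta| < 1)
        roots = 1.0 / (2.0 * rho1) + np.array([1.0, -1.0]) * np.sqrt(1.0 / (4.0 * rho1 ** 2) - 1.0)
        return roots[np.abs(roots) < 1.0][0]

For a simulated series, e.g. x = eps[1:] + 0.5 * eps[:-1] with standard normal eps, yule_walker_ma1(x) should lie close to 0.5 for moderate sample sizes.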

The least squares estimator leads to non-linear systems of equations that can only be solved with iterative numerical algorithms. We illustrate this with the MA(1) process (12.25): the LS estimator is defined by

$\displaystyle \hat{\beta} = \arg \min_{\beta} \sum_{t=2}^n \varepsilon_t^2 = \arg \min_{\beta} \sum_{t=2}^n (X_t -\beta \varepsilon_{t-1})^2$ (12.27)

Given that $ \varepsilon_t$ is not observed, one must turn to the AR($ \infty$) representation of the MA(1) process in order to find the solution, i.e.,

$\displaystyle \varepsilon_t = X_t + \sum_{k=1}^\infty (-\beta)^k X_{t-k}.$ (12.28)

Given $ X_1,\ldots,X_n$, (12.28) can be approximated by

$\displaystyle \varepsilon_t = X_t + \sum_{k=1}^{t-1} (-\beta)^k X_{t-k}.
$

Solving the first order conditions

$\displaystyle \frac{\partial}{\partial \beta} \sum_{t=2}^n \varepsilon_t^2 = 0,
$

we obtain a non-linear equation for $ \beta$ which cannot be solved explicitly. For the minimization problem (12.27) one therefore resorts to numerical optimization methods. The least squares estimator is asymptotically efficient and asymptotically has the same properties as the maximum likelihood (ML) estimator.
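As an illustration of such a numerical solution, the following sketch (Python with numpy and scipy; the function names are ours) minimizes the sum of squared residuals in (12.27). The residuals are computed recursively via $ \varepsilon_t = X_t - \beta\varepsilon_{t-1}$ with $ \varepsilon_1 = X_1$, which is equivalent to the truncated AR($ \infty$) sum above.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def ma1_residuals(beta, x):
        """Residuals eps_t = X_t - beta * eps_{t-1} with eps_1 = X_1;
        equivalent to the truncated AR(infinity) inversion of the MA(1) model."""
        x = np.asarray(x, dtype=float)
        eps = np.empty(len(x))
        eps[0] = x[0]
        for t in range(1, len(x)):
            eps[t] = x[t] - beta * eps[t - 1]
        return eps

    def ls_ma1(x):
        """Conditional least squares estimator of beta, cf. (12.27)."""
        obj = lambda b: np.sum(ma1_residuals(b, x)[1:] ** 2)
        return minimize_scalar(obj, bounds=(-0.99, 0.99), method="bounded").x

The bounds restrict the search to the invertibility region, which keeps the truncated AR($ \infty$) approximation meaningful.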

In the following we assume a stationary and invertible ARMA($ p,q$) process with the AR($ \infty$) representation

$\displaystyle X_t = \sum_{j=1}^\infty \pi_j X_{t-j} + \varepsilon_t.
$

Maximum likelihood estimation relies on the distributional assumption

$\displaystyle \varepsilon_t \sim \text{N}(0,\sigma^2),
$

under which $ X=(X_1,\ldots,X_n)^\top $ has a multivariate normal distribution with density

$\displaystyle p(x \mid \theta) = (2\pi \sigma^2)^{-n/2} \vert\Gamma\vert^{-1/2} \exp
\left(-\frac{1}{2\sigma^2}x^\top \Gamma^{-1}x \right)
$

with covariance matrix $ \Gamma$, which is given in (12.24), and the parameter vector

$\displaystyle \theta=(\alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q; \sigma^2)^\top .$

The likelihood function $ L$ is this density interpreted as a function of the parameter vector $ \theta$ for given observations, i.e., $ L(\theta\mid x) = p(x \mid \theta)$. One chooses the parameter vector that maximizes the likelihood function for the given observations, i.e., the ML estimator is defined by

$\displaystyle \hat{\theta} = \arg \max_{\theta \in \Theta} L(\theta \mid x).
$

Under the assumption of the normal distribution the logarithm of the likelihood function

$\displaystyle \log L(\theta\mid x)= -\frac{n}{2}\log(2\pi \sigma^2) -\frac{1}{2}\log \vert\Gamma\vert - \frac{1}{2\sigma^2}x^\top \Gamma^{-1}x$ (12.29)

takes on a simple form; since the logarithm is strictly increasing, the maximizer $ \hat{\theta}$ is unchanged. The log-likelihood function (12.29) is also called the exact log-likelihood function. Note that, in particular, the calculation of the inverse and the determinant of the ($ n \times n$) matrix $ \Gamma$ is quite involved for long time series. Therefore one often works with an approximation to the exact likelihood that is good for long time series. One possibility is to factorize the likelihood into the conditional densities $ p(X_t \mid X_{t-1},\ldots,X_1; \theta)$:

$\displaystyle L(\theta\mid x) = \prod_{t=1}^n p(X_t \mid X_{t-1},\ldots,X_1;
\theta)
$

Under the assumption of normality the conditional distributions are normal with expected value $ \mathop{\text{\rm\sf E}}[X_t \mid X_{t-1},\ldots,X_1]$ and variance $ \mathop{\text{\rm Var}}(X_t \mid X_{t-1},\ldots,X_1)$.

The larger $ t$ is, the better $ \sum_{j=1}^{t-1} \pi_j X_{t-j}$ approximates the conditional expectation given the infinite past,

$\displaystyle \mathop{\text{\rm\sf E}}[X_t \mid X_{t-1},X_{t-2},\ldots]=\sum_{j=1}^\infty \pi_j X_{t-j}.
$

The conditional log-likelihood function

$\displaystyle \log L^b(\theta\mid x)= -\frac{n}{2}\log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{t=1}^n \Big(X_t-\sum_{j=1}^{t-1} \pi_j X_{t-j}\Big)^2$ (12.30)

can be calculated from the data $ X_1,\ldots,X_n$ and maximized with respect to the parameter $ \theta$. The Yule-Walker estimators, for example, can serve as initial values for the numerical optimization algorithm, in spite of their asymptotic inefficiency.
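For the MA(1) example, a minimal sketch of this procedure (Python with numpy; the function names are ours, and ma1_residuals from the least squares sketch above is reused, since $ X_t-\sum_{j=1}^{t-1}\pi_j X_{t-j}$ is exactly the truncated residual $ \varepsilon_t$) could look as follows:

    import numpy as np
    # ma1_residuals(beta, x) as defined in the least squares sketch above

    def cond_loglik_ma1(beta, sigma2, x):
        """Conditional log-likelihood (12.30) for an MA(1) process."""
        e = ma1_residuals(beta, x)       # one-step prediction errors
        n = len(e)
        return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(e ** 2) / sigma2

    def cond_ml_ma1(x, grid=np.linspace(-0.99, 0.99, 199)):
        """Maximize (12.30) over beta on a grid, profiling out sigma^2 for each beta."""
        def value(b):
            s2 = np.mean(ma1_residuals(b, x) ** 2)   # ML value of sigma^2 given beta
            return cond_loglik_ma1(b, s2, x)
        return max(grid, key=value)

In practice a gradient-based optimizer would replace the grid search; the grid version is kept here because it matches the one-parameter search discussed next.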

To compare the exact and the conditional likelihood estimators, consider again the MA(1) process (12.25) with $ \beta=0.5$ and $ \varepsilon_t \sim \text{N}(0,1)$. The matrix $ \Gamma$ is band diagonal (tridiagonal) with elements $ 1+\beta^2$ on the main diagonal and $ \beta$ on the diagonals directly above and below it. Two realizations of the process with $ n=10$ and $ n=20$ are shown in Figure 12.7. Since the process has only one parameter, one can simply search over the interval $ (-1,1)$. This is shown for both estimators in Figure 12.8 ($ n=10$) and Figure 12.9 ($ n=20$). For $ n=10$ one still sees a clear discrepancy between the two likelihood functions, which is negligible for $ n=20$. In this case both estimators are quite close to the true parameter 0.5.
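The exact log-likelihood (12.29) can be evaluated directly in this example by building the tridiagonal matrix $ \Gamma$ described above. The following sketch (Python with numpy; the names are ours, and $ \sigma^2=1$ is fixed as in the figures) performs the grid search over $ (-1,1)$ that underlies Figures 12.8 and 12.9.

    import numpy as np

    def exact_loglik_ma1(beta, x, sigma2=1.0):
        """Exact Gaussian log-likelihood (12.29) of an MA(1) process; Gamma is
        tridiagonal with 1 + beta^2 on the main diagonal and beta next to it."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        gamma = (1.0 + beta ** 2) * np.eye(n) + beta * (np.eye(n, k=1) + np.eye(n, k=-1))
        _, logdet = np.linalg.slogdet(gamma)
        quad = x @ np.linalg.solve(gamma, x)
        return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * logdet - 0.5 * quad / sigma2

    # grid search over (-1, 1), cf. Figures 12.8 and 12.9
    rng = np.random.default_rng(0)
    eps = rng.standard_normal(21)
    x = eps[1:] + 0.5 * eps[:-1]                 # simulated MA(1), beta = 0.5, n = 20
    grid = np.linspace(-0.99, 0.99, 199)
    beta_exact = grid[np.argmax([exact_loglik_ma1(b, x) for b in grid])]

Comparing beta_exact with the conditional estimate from the previous sketch should show only a small discrepancy for $ n=20$, in line with the behaviour described above.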

Fig. 12.7: Two realizations of an MA(1) process with $ \beta=0.5$, $ \varepsilon_t \sim \text{N}(0,1)$, $ n=10$ (above) and $ n=20$ (below). SFEplotma1.xpl

Fig. 12.8: Exact (solid) and conditional (dashed) likelihood functions for the MA(1) process from Figure 12.7 with $ n=10$. The true parameter is $ \beta=0.5$. SFElikma1.xpl

Fig. 12.9: Exact (solid) and conditional (dashed) likelihood functions for the MA(1) process from Figure 12.7 with $ n=20$. The true parameter is $ \beta=0.5$. SFElikma1.xpl

Under some technical assumptions the ML estimators are consistent, asymptotically efficient and have an asymptotic normal distribution:

$\displaystyle \sqrt{n}(\hat{\theta}-\theta) \stackrel{{\cal L}}{\longrightarrow} \text{N}(0,J^{-1})
$

with the Fisher information matrix

$\displaystyle J=\mathop{\text{\rm\sf E}}\left[ -\frac{\partial^2 \log L(\theta\mid x)} {\partial \theta \partial \theta^\top } \right].$ (12.31)
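For the MA(1) model (12.25), for example, a standard calculation (which we do not carry out here) gives $ J=1/(1-\beta^2)$ for the moving average coefficient, so that $ \sqrt{n}(\hat{\beta}-\beta) \stackrel{{\cal L}}{\longrightarrow} \text{N}(0,1-\beta^2)$.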

For the optimization of the likelihood function one frequently uses numerical methods. The necessary condition for a maximum is

$\displaystyle \text{grad}\; l^b(\theta)=0
$

with $ l^b(\theta) = \log L^b(\theta \mid x)$. Choosing an initial value $ \theta_0$ (for example, the Yule-Walker estimator) and using the Taylor approximation

$\displaystyle \text{grad}\; l^b(\theta) \approx \text{grad}\; l^b (\theta_0) + \text{Hess}\; l^b(\theta_0)\,(\theta - \theta_0),
$

one obtains the following relation:

$\displaystyle \theta = \theta_0 - \text{Hess}\; l^b(\theta_0)^{-1}\, \text{grad}\; l^b (\theta_0).
$

Since in general this does not immediately yield the maximizing parameter, one forms the iteration

$\displaystyle \theta_{j+1}=\theta_j - \text{Hess}\; l^b(\theta_j)^{-1}\, \text{grad}\; l^b (\theta_j)
$

with $ j=1,2,\ldots$ until convergence is reached, i.e., $ \theta_{j+1}\approx\theta_j$. Often it is easier to replace the Hessian matrix by its expectation, i.e., by the negative of the information matrix from (12.31):

$\displaystyle \theta_{j+1}=\theta_j + J(\theta_j)^{-1}\, \text{grad}\; l^b (\theta_j).$ (12.32)

The notation $ J(\theta_j)$ here means that (12.31) is evaluated at $ \theta_j$. The iteration (12.32) is called the score algorithm or Fisher scoring.
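A minimal sketch of such an iteration for a scalar parameter (Python with numpy; the function name is ours) is given below. For simplicity it approximates the derivatives of $ l^b$ by central differences and uses the observed curvature, i.e., the Newton iteration from above; replacing the numerical Hessian by $ -J(\theta_j)$ would give the scoring step (12.32).

    import numpy as np

    def newton_scalar(loglik, theta0, h=1e-4, tol=1e-8, max_iter=50):
        """Newton-type iteration theta_{j+1} = theta_j - Hess^{-1} grad for a scalar
        parameter; grad and Hess of the log-likelihood are approximated by central
        differences."""
        theta = float(theta0)
        for _ in range(max_iter):
            grad = (loglik(theta + h) - loglik(theta - h)) / (2.0 * h)
            hess = (loglik(theta + h) - 2.0 * loglik(theta) + loglik(theta - h)) / h ** 2
            theta_new = theta - grad / hess
            if abs(theta_new - theta) < tol:       # theta_{j+1} approximately theta_j
                return theta_new
            theta = theta_new
        return theta

Applied to the conditional log-likelihood of the MA(1) example, e.g. newton_scalar(lambda b: cond_loglik_ma1(b, 1.0, x), yule_walker_ma1(x)), this should give essentially the same maximizer as the grid search illustrated in Figure 12.9.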