2.3 Estimation Procedures

Having specified the MLRM given in (2.1) or (2.2), the next econometric stage is estimation, which consists of quantifying the parameters of the model using the observations of $ y$ and $ X$ collected in the sample of size $ n$. There are $ k+1$ parameters to estimate: the $ k$ coefficients of the vector $ \beta $ and the dispersion parameter $ \sigma ^{2}$, about which we have no a priori information.

Following the same scheme as in the previous chapter, we describe the two common estimation procedures: the Least Squares (LS) and the Maximum Likelihood (ML) Methods.


2.3.1 The Least Squares Estimation

The LS procedure selects those values of $ \beta $ that minimize the sum of squared distances between the actual values of $ y$ and the adjusted (or estimated) values of the endogenous variable. Let $ \hat{\hat{\beta}}$ be a possible estimate (some function of the sample observations) of $ \beta $. Then, the adjusted value of the endogenous variable is given by:

\begin{displaymath}\begin{array}{cc} \hat{\hat{y}}_{i}=x_{i}^{\top }\hat{\hat{\beta}} & \forall i \end{array}\end{displaymath} (2.17)

where $ x_{i}^{\top }=(1,x_{2i},x_{3i},\ldots,x_{ki})$ is the row vector of the values of the regressors for the $ i^{th}$ observation. From (2.17), the distance defined earlier, or residual, is given by:

\begin{displaymath}\begin{array}{cc} \hat{\hat{u_{i}}}=y_{i}-\hat{\hat{y_{i}}}=y_{i}-x_{i}^{\top }\hat{\hat{\beta}}& \forall i \end{array}\end{displaymath} (2.18)

Consequently, the function to minimize is:

$\displaystyle S(\hat{\hat{\beta}})=\sum_{i=1}^{n}\hat{\hat{u_{i}^{2}}}=\sum_{i=1}^{n}(y_{i}-x_{i}^{\top }\hat{\hat{\beta}})^{2}$ (2.19)

and then, what we call the Ordinary Least Squares (OLS) estimator of $ \beta $, denoted by $ \hat{\beta}$ is the value of $ \hat{\hat{\beta}}$ which satisfies:

$\displaystyle \hat{\beta}=\arg\min_{\hat{\hat{\beta}}}S(\hat{\hat{\beta}})$ (2.20)

To solve this optimization problem, the first-order conditions set the first derivatives of $ S(\hat{\hat{\beta}})$ with respect to $ \hat{\hat{\beta_{1}}}$, $ \hat{\hat{\beta_{2}}}$, ..., $ \hat{\hat{\beta_{k}}}$ equal to zero. In order to obtain these conditions in matrix form, we express (2.19) as follows:

$\displaystyle S(\hat{\hat{\beta}})=(y-X\hat{\hat{\beta}})^{\top }(y-X\hat{\hat{\beta}})$ (2.21)

Given that $ \hat{\hat{\beta}}^{\top }X^{\top }y=(y^{\top }X\hat{\hat{\beta}})^{\top }$, and that both terms are scalars ($ 1\times1$), we can group them, so $ S(\hat{\hat{\beta}})$ can be written as follows:

$\displaystyle S(\hat{\hat{\beta}})=y^{\top }y-2y^{\top }X\hat{\hat{\beta}}+\hat{\hat{\beta}}^{\top }X^{\top }X\hat{\hat{\beta}}$ (2.22)

The vector which contains the $ k$ first partial derivatives (gradient vector) is expressed as:

$\displaystyle \frac{\partial S(\hat{\hat{\beta}})}{\partial\hat{\hat{\beta}}}=-2X^{\top }y+2X^{\top }X\hat{\hat{\beta}}$ (2.23)

Setting (2.23) equal to zero yields:

$\displaystyle X^{\top }X\hat{\beta}=X^{\top }y$ (2.24)

The system of $ k$ linear equations (2.24) is called the $ \textsl{system of normal equations}$.

From assumption 2 of the last section, we know that $ X$ has full rank, so the inverse of $ X^{\top }X$ exists, and we can obtain $ \hat{\beta}$ by premultiplying (2.24) by $ (X^{\top }X)^{-1}$:

$\displaystyle \hat{\beta}=(X^{\top }X)^{-1}X^{\top }y$ (2.25)

According to (2.25), the OLS residuals vector is given by:

$\displaystyle \hat{u}=y-X\hat{\beta}$ (2.26)

with a typical element $ \hat{u}_{i}=y_{i}-x_{i}^{\top }\hat{\beta}$. From (2.2), the residual vector can be understood as the sample counterpart of the disturbance vector $ u$.

The second-order condition for a minimum requires the matrix of second partial derivatives (Hessian matrix) to be positive definite. In our case, this matrix is given by:

$\displaystyle \frac{\partial^{2}S(\hat{\hat{\beta}}) }{\partial\hat{\hat{\beta}}\partial\hat{\hat{\beta}}^{\top }}=2X^{\top }X$ (2.27)

and given that $ X$ has full rank, $ X^{\top }X$ is positive definite.
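To make (2.25) concrete, the following sketch computes $ \hat{\beta}$ by solving the normal equations (2.24). It is written in Python with NumPy rather than in the XploRe quantlet language used in Section 2.3.3, and the data are simulated, so it is purely illustrative; the names ols_beta, X and y are ours, not part of the original text.

    import numpy as np

    def ols_beta(X, y):
        # OLS estimator (2.25): solve the normal equations X'X b = X'y,
        # which is numerically preferable to forming (X'X)^{-1} explicitly.
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Simulated data (purely illustrative): n = 72, a constant and two regressors
    rng = np.random.default_rng(0)
    n = 72
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=n)

    beta_hat = ols_beta(X, y)   # close to the true values (2.0, 0.5, -1.0)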

From (2.25), and given that the regressors are fixed, it follows that $ \hat{\beta}$ is a linear function of the vector $ y$, that is to say:

$\displaystyle \hat{\beta}=A^{\top }y$ (2.28)

where $ A=(X^{\top }X)^{-1}X^{\top }$ is a $ k \times n$ matrix of constant elements. The set of $ k$ normal equations written in (2.24), can be expressed in the following terms:

$\displaystyle \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{k1} & x_{k2} & x_{k3} & \cdots & x_{kn}
\end{pmatrix}\begin{pmatrix}
1 & x_{21} & x_{31} & \cdots & x_{k1} \\
1 & x_{22} & x_{32} & \cdots & x_{k2} \\
1 & x_{23} & x_{33} & \cdots & x_{k3} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{2n} & x_{3n} & \cdots & x_{kn}
\end{pmatrix}\begin{pmatrix}
\hat{\beta}_{1} \\
\hat{\beta}_{2} \\
\hat{\beta}_{3} \\
\vdots \\
\hat{\beta}_{k}
\end{pmatrix}=
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{k1} & x_{k2} & x_{k3} & \cdots & x_{kn}
\end{pmatrix}\begin{pmatrix}
y_{1} \\
y_{2} \\
y_{3} \\
\vdots \\
y_{n}
\end{pmatrix}$

resulting in:

$\displaystyle \begin{pmatrix}
n & \sum x_{2i} & \sum x_{3i} & \cdots & \sum x_{ki} \\
\sum x_{2i} & \sum x_{2i}^{2} & \sum x_{2i}x_{3i} & \cdots & \sum x_{2i}x_{ki} \\
\sum x_{3i} & \sum x_{3i}x_{2i} & \sum x_{3i}^{2} & \cdots & \sum x_{3i}x_{ki} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum x_{ki} & \sum x_{ki}x_{2i} & \sum x_{ki}x_{3i} & \cdots & \sum x_{ki}^{2}
\end{pmatrix}\begin{pmatrix}
\hat{\beta}_{1} \\
\hat{\beta}_{2} \\
\hat{\beta}_{3} \\
\vdots \\
\hat{\beta}_{k}
\end{pmatrix}=\begin{pmatrix}
\sum y_{i} \\
\sum x_{2i}y_{i} \\
\sum x_{3i}y_{i} \\
\vdots \\
\sum x_{ki}y_{i}
\end{pmatrix}$ (2.29)

where all the sums are calculated from 1 to $ n$.

Thus, the $ k$ equations which allow us to obtain the unknown coefficients are the following:

\begin{displaymath}\begin{array}{c}
\sum y_{i}=n\hat{\beta_{1}}+\hat{\beta_{2}}\sum x_{2i}+\hat{\beta_{3}}\sum x_{3i}+\ldots+\hat{\beta_{k}}\sum x_{ki} \\
\sum x_{2i}y_{i}=\hat{\beta_{1}}\sum x_{2i}+\hat{\beta_{2}}\sum x_{2i}^{2}+\hat{\beta_{3}}\sum x_{2i}x_{3i}+\ldots+\hat{\beta_{k}}\sum x_{2i}x_{ki} \\
\vdots \\
\sum x_{ki}y_{i}=\hat{\beta_{1}}\sum x_{ki}+\hat{\beta_{2}}\sum x_{ki}x_{2i}+\hat{\beta_{3}}\sum x_{ki}x_{3i}+\ldots+\hat{\beta_{k}}\sum x_{ki}^{2}
\end{array}\end{displaymath} (2.30)

From (2.30) we derive some algebraic properties of the OLS method:

a.
The sum of the residuals is zero. To show this, we evaluate the general expression (2.17) at the OLS estimate $ \hat{\beta}$ and calculate $ \sum \hat{y_{i}}$, obtaining:

$\displaystyle \sum_{i=1}^{n}{\hat{y_{i}}}=n\hat{\beta_{1}}+\hat{\beta_{2}}\sum_{i=1}^{n}x_{2i}+
\ldots+\hat{\beta_{k}}\sum_{i=1}^{n}x_{ki}
$

The right-hand side of the last expression is equal to the right-hand side of the first equation of the system (2.30), so we can write:

$\displaystyle \sum_{i=1}^{n}y_{i}=\sum_{i=1}^{n}\hat{y_{i}}$ (2.31)

Using (2.31) and (2.18), it follows that the residuals satisfy:

$\displaystyle \sum \hat{u_{i}}=0$ (2.32)

b.
The regression hyperplane passes through the point of means of the data. From (2.17), the expression of this hyperplane is:

\begin{displaymath}\begin{array}{cc} \hat{y_{i}}=\hat{\beta_{1}}+\hat{\beta_{2}}x_{2i}+\ldots+\hat{\beta_{k}}x_{ki}& \forall i \end{array}\end{displaymath} (2.33)

Adding up the terms of (2.33) and dividing by $ n$, we obtain:

$\displaystyle \bar{\hat{y}}=\hat{\beta_{1}}+\hat{\beta_{2}}\bar{x_{2}}+\hat{\beta_{3}}\bar{x_{3}}+\ldots+\hat{\beta_{k}}\bar{x_{k}}
$

and given (2.31), it follows that $ \bar{y}=\bar{\hat{y}}$; the stated property then holds, since

$\displaystyle \bar{y}=\hat{\beta_{1}}+\hat{\beta_{2}}\bar{x_{2}}+\hat{\beta_{3}}\bar{x_{3}}+\ldots+\hat{\beta_{k}}\bar{x_{k}}
$

c.
The residuals and the regressors are uncorrelated; this mimics the population property of independence between every $ x_{j}$ and $ u$. To show this property, we calculate the sample covariance between the residuals and the regressors:

$\displaystyle cov(x_{j},\hat{u_{i}})=\frac{1}{n}\sum_{i=1}^{n}[(x_{ji}-\bar{x_{j}})\hat{u_{i}}]=
\frac{1}{n}\sum_{i=1}^{n}x_{ji}\hat{u_{i}}-\frac{1}{n}\bar{x_{j}}\sum_{i=1}^{n}\hat{u_{i}}=
\frac{1}{n}\sum_{i=1}^{n}x_{ji}\hat{u_{i}}
$

with $ j=1,\ldots,k$. The last expression can be written in matrix form as:

$\displaystyle \sum_{i=1}^{n}x_{ji}\hat{u_{i}}=X^{\top }\hat{u}=X^{\top }(y-X\hat{\beta})=
X^{\top }y-X^{\top }X\hat{\beta}=X^{\top }y-X^{\top }y=0
$

where the last equality uses result (2.24).

Note that the algebraic property $ c)$ is always satisfied, while properties $ a)$ and $ b)$ might not hold if the model has no intercept. This exception is easily seen, because the first equation in (2.30) disappears when there is no constant term.

With respect to the OLS estimation of $ \sigma ^{2}$, we must note that it is not obtained as a result of the minimization problem, but is derived to satisfy two requirements: $ i)$ to use the OLS residuals ($ \hat{u}$), and $ ii)$ to be unbiased. Generalizing the result of the previous chapter, we have:

$\displaystyle \hat{\sigma}^{2}=\frac{\sum_{i=1}^{n}\hat{u_{i}}^{2}}{n-k}=\frac{\hat{u}^{\top }\hat{u}}{n-k}$ (2.34)
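Properties a) and c), together with the unbiased variance estimator (2.34), are easy to verify numerically; the following Python/NumPy sketch uses simulated data and is purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 72, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=n)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    u_hat = y - X @ beta_hat                 # residual vector (2.26)

    assert np.isclose(u_hat.sum(), 0.0)      # property a): residuals sum to zero
    assert np.allclose(X.T @ u_hat, 0.0)     # property c): X'u_hat = 0
    sigma2_hat = (u_hat @ u_hat) / (n - k)   # unbiased variance estimator (2.34)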

An alternative way of obtaining the OLS estimates of the coefficients consists of expressing the variables in deviations with respect to their means; in this case, it can be shown that the values of the estimators and the residuals are the same as those obtained above. Suppose we have estimated the model, so that we can write it as:

$\displaystyle y_{i}=\hat{\beta_{1}}+\hat{\beta_{2}}x_{2i}+\ldots+\hat{\beta_{k}}x_{ki}+\hat{u_{i}}$ (2.35)

with $ i=1,\ldots,n$. Adding up both sides of (2.35) and dividing by $ n$, we have:

$\displaystyle \bar{y}=\hat{\beta_{1}}+\hat{\beta_{2}}\bar{x_{2}}+\ldots+\hat{\beta_{k}}\bar{x_{k}}$ (2.36)

To obtain the last result, we have employed result (2.32). Subtracting (2.36) from (2.35) leads to:

$\displaystyle (y_{i}-\bar{y})=\hat{\beta_{2}}(x_{2i}-\bar{x_{2}})+\hat{\beta_{3}}(x_{3i}-\bar{x_{3}})+ \ldots+\hat{\beta_{k}}(x_{ki}-\bar{x_{k}})+\hat{u_{i}}$ (2.37)

This model, called the model $ \textsl{in deviations}$, differs from (2.35) in two respects: $ a)$ the intercept does not explicitly appear in the equation, and $ b)$ all variables are expressed in deviations from their means.

Researchers are usually interested in evaluating the effect of the explanatory variables on the endogenous variable rather than in the intercept itself, so specification (2.37) contains the relevant elements. If required, the intercept can be recovered afterwards from (2.36) as:

$\displaystyle \hat{\beta_{1}}= \bar{y}-\hat{\beta_{2}}\bar{x_{2}}-\ldots-\hat{\beta_{k}}\bar{x_{k}}$ (2.38)

This approach can be formalized in matrix form, writing (2.35) as:

$\displaystyle y=X\hat{\beta}+\hat{u}$ (2.39)

Consider the partitioned matrix $ X=[\iota_{n}, X_{2}]$ and the partitioned vector $ \hat{\beta}=[\hat{\beta_{1}},\hat{\beta}_{(2)}]$, where $ X_{2}$ denotes the $ n\times(k-1)$ matrix whose columns are the observations of each non-constant regressor, $ \iota_{n}$ is an $ n\times1$ vector of ones, and $ \hat{\beta}_{(2)}$ is the $ (k-1)\times1$ vector of all the estimated coefficients except the intercept.

Let $ G$ be an $ n\times n$ matrix of the form:

$\displaystyle G=I_{n}-\frac{1}{n}\iota_{n}\iota_{n}^{\top }$ (2.40)

with $ I_{n}$ the $ n \times n$ identity matrix. If we premultiply a given matrix (or vector) by $ G$, its elements are transformed into deviations with respect to their means. Moreover, we have $ G\iota_{n}=0_{n}$. If we premultiply the model (2.39) by $ G$, and since $ G\hat{u}=\hat{u}$ (from result (2.32)), we have:

$\displaystyle Gy=GX\hat{\beta}+G\hat{u}=G\iota_{n}\hat{\beta_{1}}+GX_{2}\hat{\beta}_{(2)}+G\hat{u}= GX_{2}\hat{\beta}_{(2)}+\hat{u}$ (2.41)

This last expression is the matrix form of (2.37). Now, we premultiply (2.41) by $ X_{2}^{\top }$, obtaining:

$\displaystyle X_{2}^{\top }Gy=X_{2}^{\top }GX_{2}\hat{\beta}_{(2)}$ (2.42)

Given that $ G$ is an idempotent matrix (i.e., $ GG=G$), such a property allows us to write (2.42) as:

$\displaystyle X_{2}^{\top }GGy=X_{2}^{\top }GGX_{2}\hat{\beta}_{(2)}
$

and taking advantage of the fact that $ G$ is also a symmetric matrix (i.e., $ G=G^{\top }$), we can rewrite the last expression as follows:

$\displaystyle (GX_{2})^{\top }(Gy)=(GX_{2})^{\top }(GX_{2})\hat{\beta}_{(2)}$ (2.43)

or equivalently,

$\displaystyle (X_{2}^{D})^{\top }y^{D}=((X_{2}^{D})^{\top }X_{2}^{D})\hat{\beta}_{(2)}$ (2.44)

with $ X_{2}^{D}=GX_{2}$, that is to say, $ X_{2}^{D}$ is the $ n\times(k-1)$ matrix whose columns are the observations of each regressor, evaluated in deviations. In a similar way, $ y^{D}=Gy$, that is to say, the observed endogenous variable in deviations with respect to its mean.

The system of $ k-1$ equations given by (2.44) leads to the same value of $ \hat{\beta}_{(2)}$ as that obtained from (2.24). The only difference between the two systems is due to the intercept, which is estimated from (2.24), but not from (2.44). Nevertheless, as we have mentioned earlier, once we have the values of $ \hat{\beta}_{(2)}$ from (2.44), we can calculate $ \hat{\beta_{1}}$ through (2.38). Furthermore, according to (2.41), the residuals vector is the same as that obtained from (2.24), so the estimate of $ \sigma ^{2}$ is that established in (2.34).
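The role of the centering matrix $ G$ of (2.40), the deviations system (2.44) and the recovery of the intercept through (2.38) can all be illustrated with a short Python/NumPy sketch on simulated data (illustrative only, not part of the original text):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 72
    X2 = rng.normal(size=(n, 2))                   # non-constant regressors
    X = np.column_stack([np.ones(n), X2])
    y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=n)

    # Centering matrix G = I_n - (1/n) iota iota' (2.40): symmetric, idempotent,
    # and premultiplication by G expresses a vector in deviations from its mean.
    G = np.eye(n) - np.ones((n, n)) / n
    assert np.allclose(G @ y, y - y.mean()) and np.allclose(G @ G, G)

    # Deviations system (2.44): (GX2)'(Gy) = (GX2)'(GX2) beta_(2)
    X2_D, y_D = G @ X2, G @ y
    beta_2 = np.linalg.solve(X2_D.T @ X2_D, X2_D.T @ y_D)
    beta_1 = y.mean() - X2.mean(axis=0) @ beta_2   # intercept from (2.38)

    # Same coefficients as the full normal equations (2.24)/(2.25)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    assert np.allclose(beta_hat, np.concatenate(([beta_1], beta_2)))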


2.3.2 The Maximum Likelihood Estimation

Assumption 6 about the normality of the disturbances allows us to apply the maximum likelihood (ML) criterion to obtain the values of the unknown parameters of the MLRM. This method consists of maximizing the likelihood function; the values of $ \beta $ and $ \sigma ^{2}$ which maximize this function are the ML estimates. To obtain the likelihood function, we start from the joint density function of the sample, which establishes the probability of a sample being realized when the parameters are known. Firstly, we consider a general framework. Let $ x$ be a random vector distributed as an $ n$-multivariate normal, with expectations vector $ \mu$ and variance-covariance matrix $ \Sigma$. The probability density function of $ x$ is given by:

$\displaystyle f(x\vert\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}\vert\Sigma\vert^{1/2}}exp\{{-\frac{1}{2}(x-\mu)^{\top }\Sigma^{-1}(x-\mu)}\}$ (2.45)

Usually, we observe only one sample, so if we substitute $ x$ by an observed value $ x_{0}$, the function $ f(x_{0}\vert\mu,\Sigma)$ gives, for every value of $ (\mu,\Sigma)$, the probability (density) of obtaining such a sample value ($ x_{0}$). Therefore, if the roles of $ x$ and $ (\mu,\Sigma)$ are interchanged, in such a way that $ x_{0}$ is fixed and the parameters $ (\mu,\Sigma)$ vary, we obtain the so-called $ \textsl{likelihood function}$, which can be written as:

$\displaystyle L(\mu,\Sigma\vert x)=L(\mu,\Sigma)=f(x_{0}\vert\mu,\Sigma)$ (2.46)

In the framework of the MLRM, the set of classical assumptions stated for the random component allows us to conclude that the $ y$ vector is distributed as an $ n$-multivariate normal, with $ X\beta$ the vector of means and $ \sigma^{2}I_{n}$ the variance-covariance matrix (results (2.15) and (2.16)). From (2.45) and (2.46), the likelihood function is:

$\displaystyle L(\beta,\sigma^{2}\vert y)=L(\beta,\sigma^{2})=f(y\vert\beta,\sigma^{2})= (2\pi\sigma^{2})^{-n/2}exp\{-\frac{(y-X\beta)^{\top }(y-X\beta)}{2\sigma^{2}}\}$ (2.47)

The ML method maximizes (2.47) in order to obtain the ML estimators of $ \beta $ and $ \sigma ^{2}$.

In general, the way of deriving the likelihood function (2.47) is based on the relationship between the probability distribution of $ y$ and that of $ u$. Suppose $ z$ and $ w$ are two random vectors, where $ z=h(w)$ with $ h$ being a monotonic function. If we know the probability density function of $ w$, denoted by $ g(w)$, we can obtain the probability density function of $ z$ as follows:

$\displaystyle f(z)=g(h^{-1}(z))J$ (2.48)

with $ J$ being the Jacobian, which is defined as the absolute value of the determinant of the matrix of partial derivatives:

$\displaystyle J=abs\vert\frac{\partial w}{\partial z}\vert
$

In our case, we identify $ z$ with $ y$ and $ w$ with $ u$; it is easy to show that the Jacobian equals one, so expression (2.48) leads to the same result as (2.47).
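In more detail (a short intermediate step, added here for completeness): since $ u=y-X\beta$ with $ X$ and $ \beta $ fixed,

\begin{displaymath}
\frac{\partial u}{\partial y}=\frac{\partial (y-X\beta)}{\partial y}=I_{n},
\qquad
J=abs\left\vert I_{n}\right\vert=1
\end{displaymath}

so that $ f(y)=g(y-X\beta)$, and since $ g$ is the density of $ u\sim N(0,\sigma^{2}I_{n})$, this is exactly the likelihood written in (2.47).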

Although the ML method maximizes the likelihood function, it is usually simpler to work with the log of this function. Since the logarithm is a monotonic function, the parameter values that maximize L are the same as those that maximize the log-likelihood ($ \ln L$). In our case, $ \ln L$ has the following form:

$\displaystyle \ell=\ln L(\beta,\sigma^{2})= -\frac{n}{2}\ln(2\pi)-\frac{n}{2}\ln\sigma^{2}-\frac{(y-X\beta)^{\top }(y-X\beta)}{2\sigma^{2}}$ (2.49)

The ML estimators are the solution to the first-order conditions:

$\displaystyle \frac{\partial\ell}{\partial\beta}=-\frac{1}{2\sigma^{2}}(-2X^{\top }y+2X^{\top }X\beta)=0$ (2.50)

$\displaystyle \frac{\partial\ell}{\partial\sigma^{2}}=-\frac{n}{2\sigma^{2}}+\frac{(y-X\beta)^{\top }(y-X\beta)}{2\sigma^{4}}=0$ (2.51)

Thus, the ML estimators, denoted by $ \tilde{\beta}$ and $ \tilde{\sigma}^{2}$, are:

$\displaystyle \tilde{\beta}=(X^{\top }X)^{-1}X^{\top }y$ (2.52)

$\displaystyle \tilde{\sigma}^{2}=\frac{(y-X\tilde{\beta})^{\top }(y-X\tilde{\beta})}{n}=\frac{\tilde{u}^{\top }\tilde{u}}{n}$ (2.53)

As we can see, similarly to the results for the univariate linear regression model in the previous chapter, under the assumption of normality of the disturbances both the ML and LS methods give the same estimated value for the coefficients $ \beta $ ( $ \tilde{\beta}=\hat{\beta}$), and thus the numerator of the expression for $ \tilde{\sigma}^{2}$ is the same as that of $ \hat{\sigma}^{2}$.
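As a numerical illustration of (2.49)-(2.53), the following Python/NumPy sketch (simulated data, illustrative only) confirms that the ML coefficients coincide with the OLS ones and that the ML variance estimator divides by $ n$ rather than $ n-k$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 72, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=n)

    beta_tilde = np.linalg.solve(X.T @ X, X.T @ y)   # ML coefficients (2.52) = OLS (2.25)
    u_tilde = y - X @ beta_tilde
    sigma2_ml = (u_tilde @ u_tilde) / n              # ML variance estimator (2.53)
    sigma2_ols = (u_tilde @ u_tilde) / (n - k)       # unbiased estimator (2.34)

    def loglik(beta, sigma2, X, y):
        # log-likelihood (2.49)
        e = y - X @ beta
        return -0.5 * len(y) * np.log(2 * np.pi * sigma2) - (e @ e) / (2 * sigma2)

    # (beta_tilde, sigma2_ml) maximizes (2.49); in particular it attains a log-likelihood
    # at least as high as the same coefficients combined with the unbiased variance estimate.
    assert loglik(beta_tilde, sigma2_ml, X, y) >= loglik(beta_tilde, sigma2_ols, X, y)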


2.3.3 Example

All estimation quantlets in the stats quantlib have as input parameters:

x
An $ n \times k$ matrix containing the observations of the explanatory variables,
y
An $ n\times1$ vector containing the observed responses.
Neither the matrix x nor the vector y should contain missing values (NaN) or infinite values (Inf, -Inf).
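For readers working outside XploRe, an analogous sanity check could be written, for instance, in Python/NumPy as follows (the function name check_inputs is ours and purely illustrative):

    import numpy as np

    def check_inputs(x, y):
        # Reject missing (NaN) or infinite (Inf, -Inf) values, as required above,
        # and make sure x and y refer to the same number of observations.
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        if not (np.isfinite(x).all() and np.isfinite(y).all()):
            raise ValueError("x and y must not contain NaN or infinite values")
        if x.shape[0] != y.shape[0]:
            raise ValueError("x and y must have the same number of rows")
        return x, y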

In the following example, we will use Spanish economic data to illustrate the MLRM estimation. The file data.dat contains quarterly data from 1980 to 1997 (sample size $ n=72$) for the variables consumption, exports and M1 (money supply). All variables are expressed in constant prices of 1995.

Descriptive statistics of the three variables included in the consumption function can be found in Table 2.1.


Table 2.1: Descriptive statistics for consumption data.
                        Min        Max        Mean       S.D.
$ y$    consumption    7558200    12103000   9524600    1328800
$ x_2$  exports        1439000    5590700    2778500    1017700
$ x_3$  M1             9203.9     18811      13512      3140.8


On the basis of the information in the data file, we estimate the consumption function; the endogenous variable we want to explain is consumption, while exports and M1 are the explanatory variables, or regressors.

The quantlet XEGmlrm01.xpl produces some summary statistics.

XEGmlrm01.xpl

2.3.3.0.1 Computing MLRM Estimates

The quantlet in the stats quantlib which can be employed to obtain only the OLS (or ML) estimation of the coefficients $ \beta $ and $ \sigma ^{2}$ is gls.


b = gls(x, y)
estimates the parameters of an MLRM

In XEGmlrm02.xpl, we have used the quantlet gls to compute the OLS estimates of $ \beta $ (b), and both the OLS and ML estimates of $ \sigma ^2$ (sigls and sigml).

XEGmlrm02.xpl
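For readers without access to XploRe, the quantities b, sigls and sigml can be reproduced along the following lines in Python/NumPy; the assumed column layout of data.dat (consumption, exports, M1) is ours and only serves as an illustration:

    import numpy as np

    # Assumed layout (an assumption for illustration): three columns
    # consumption, exports, M1 with 72 quarterly rows.
    data = np.loadtxt("data.dat")
    y = data[:, 0]                                         # consumption
    X = np.column_stack([np.ones(len(y)), data[:, 1:3]])   # constant, exports, M1

    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)                  # coefficient estimates (OLS = ML)
    u = y - X @ b
    sigls = (u @ u) / (n - k)                              # OLS variance estimate (2.34)
    sigml = (u @ u) / n                                    # ML variance estimate (2.53)
    print(b, sigls, sigml)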