10.2 Estimation of the Factor Model

In practice, we have to find estimates $\widehat{\data{Q}}$ of the loadings ${\data{Q}}$ and estimates $\widehat \Psi$ of the specific variances $\Psi$ such that analogously to (10.7)

\begin{displaymath}
{\data{S}} = \widehat{\data{Q}} \widehat{\data{Q}}^{\top} + \widehat{\Psi}, \end{displaymath}

where ${\data {S}}$ denotes the empirical covariance of ${\data{X}}
$. Given an estimate $\widehat{\data{Q}}$ of ${\data{Q}}$, it is natural to set

\begin{displaymath}
\widehat\psi _{jj}=s_{X_{j}X_{j}}-\sum ^k_{\ell=1}\widehat q^2_{j\ell}.
\end{displaymath}

We have that $\widehat{h}_{j}^2=\sum ^k_{\ell=1}\widehat{q}^2_{j\ell}$ is an estimate for the communality $h_{j}^2$.
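In code this step is immediate. The following Python sketch (the function name is ours; it merely assumes a loadings estimate $\widehat{\data{Q}}$ and the empirical covariance matrix ${\data{S}}$ as NumPy arrays) returns the estimated communalities and specific variances:

\begin{verbatim}
import numpy as np

def communalities_and_specific_variances(Q_hat, S):
    """Given estimated loadings Q_hat (p x k) and the empirical covariance
    matrix S (p x p), return (h2_hat, psi_hat) with
    h2_hat[j] = sum_l q_hat[j, l]**2 and psi_hat[j] = s_jj - h2_hat[j]."""
    h2_hat = np.sum(Q_hat ** 2, axis=1)   # estimated communalities
    psi_hat = np.diag(S) - h2_hat         # estimated specific variances
    return h2_hat, psi_hat
\end{verbatim}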

In the ideal case $d=0$, there is an exact solution. However, $d$ is usually greater than zero, so we have to find $\widehat{\data{Q}}$ and $\widehat \Psi$ such that ${\data{S}}$ is approximated by $\widehat{\data{Q}} \widehat{\data{Q}}^{\top} + \widehat{\Psi}$. As mentioned above, it is often easier to compute the loadings and the specific variances of the standardized model.

Define ${\data{Y}}={\data{HXD}}^{-1/2}$, the standardization of the data matrix ${\data{X}}
$, where, as usual, ${\data{D}}=\mathop{\hbox{diag}}( s_{X_{1}X_{1}}, \ldots, s_{X_{p}X_{p}} )$ and the centering matrix ${\data{H}}= {\data{I}}-n^{-1}1_{n}1_{n}^{\top}$ (recall from Chapter 2 that ${\data{S}}= \frac{1}{n}{\data{X}}^{\top}{\data{HX}}$). The estimated factor loading matrix $\widehat{\data{Q}}_Y$ and the estimated specific variance $\widehat \Psi_Y$ of ${\data{Y}}$ are

\begin{displaymath}
\widehat{\data{Q}}_Y={\data{D}}^{-1/2}\widehat{\data{Q}}_X \quad \textrm{and} \quad \widehat \Psi_Y={\data{D}}^{-1}\widehat \Psi_X.
\end{displaymath}

For the correlation matrix ${\data{R}}$ of ${\data{X}}
$, we have that

\begin{displaymath}
{\data{R}}=\widehat{\data{Q}}_Y\widehat{\data{Q}}_Y^{\top}+\widehat{\Psi}_Y.
\end{displaymath}

The interpretations of the factors are formulated from the analysis of the loadings $\widehat{\data{Q}}_Y$.
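The rescaling between the two parametrizations is easily done in code. Here is a minimal Python sketch (names are ours) assuming the unstandardized estimates $\widehat{\data{Q}}_X$ and $\widehat\Psi_X$ are available:

\begin{verbatim}
import numpy as np

def standardize_solution(Q_X, Psi_X, S):
    """Rescale loadings and specific variances of X to those of the
    standardized variables Y = H X D^{-1/2}:
    Q_Y = D^{-1/2} Q_X  and  Psi_Y = D^{-1} Psi_X,
    where D = diag(s_{X1X1}, ..., s_{XpXp})."""
    d = np.diag(S)                    # variances s_{XjXj}
    Q_Y = Q_X / np.sqrt(d)[:, None]   # D^{-1/2} Q_X
    Psi_Y = Psi_X / d                 # D^{-1} Psi_X (Psi stored as a vector)
    return Q_Y, Psi_Y
\end{verbatim}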

EXAMPLE 10.3   Let us calculate the matrices just defined for the car data given in Table B.7. This data set consists of the averaged marks (from $1=$ low to $6=$ high) for $24$ car types. Considering the three variables price, security and easy handling, we get the following correlation matrix:

\begin{displaymath}{\data{R}} = \left (
\begin{array}{ccc}
1 & 0.975 & 0.613 \\
0.975& 1 & 0.620 \\
0.613& 0.620& 1
\end{array} \right ).\end{displaymath}

We will first look for one factor, i.e., $k=1$. Note that (number of parameters of $\Sigma$ unconstrained $-$ number of parameters of $\Sigma$ constrained) equals $\frac{1 }{2 } (p-k)^2-\frac{1 }{2 }(p+ k)
= \frac{1 }{2 }(3-1)^2-\frac{1 }{2 }(3+1)=0$. This implies that there is an exact solution! The equation

\begin{displaymath}
\left (\begin{array}{ccc}
1 & r_{X_{1}X_{2}} & r_{X_{1}X_{3}} \\
r_{X_{1}X_{2}} & 1 & r_{X_{2}X_{3}} \\
r_{X_{1}X_{3}} & r_{X_{2}X_{3}} & 1
\end{array} \right )
= \left (\begin{array}{ccc}
\widehat q^2_1 + \widehat \psi_{11} & \widehat q_1 \widehat q_2 & \widehat q_1 \widehat q_3 \\
\widehat q_1 \widehat q_2 & \widehat q^2_2 + \widehat \psi_{22} & \widehat q_2 \widehat q_3 \\
\widehat q_1 \widehat q_3 & \widehat q_2 \widehat q_3 & \widehat q^2_3 + \widehat \psi_{33}
\end{array} \right )
\end{displaymath}

yields the communalities $\widehat h^2_{i}=\widehat q^2_{i}$, where

\begin{displaymath}
\widehat q ^2_1 = \frac{r_{X_{1}X_{2}}r_{X_{1}X_{3}}}{r_{X_{2}X_{3}}}, \quad
\widehat q ^2_2 = \frac{r_{X_{1}X_{2}}r_{X_{2}X_{3}}}{r_{X_{1}X_{3}}}, \quad
\widehat q ^2_3 = \frac{r_{X_{1}X_{3}}r_{X_{2}X_{3}}}{r_{X_{1}X_{2}}}.
\end{displaymath}

Combining this with the specific variances $\widehat \psi_{11} = 1- \widehat q ^2_1$ , $\widehat \psi_{22}
= 1- \widehat q ^2_2$ and $\widehat\psi_{33} = 1- \widehat q^2_3$, we obtain the following solution

\begin{displaymath}
\begin{array}{rclrclrcl}
\widehat q_1 &=& 0.982 &\quad\quad\quad
\widehat q_2 &=& 0.993 &\quad\quad\quad
\widehat q_3 &=& 0.624 \\
\widehat \psi_{11} &=& 0.036 &\quad\quad\quad
\widehat \psi_{22} &=& 0.014 &\quad\quad\quad
\widehat \psi_{33} &=& 0.610.
\end{array}
\end{displaymath}

Since the first two communalities ( $\widehat h^2_{i}=\widehat q^2_{i}$) are close to one, we can conclude that the first two variables, namely price and security, are explained by the single factor quite well. This factor can be interpreted as a ``price+security'' factor.
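This exact solution is easily verified numerically. The following Python sketch (our own check, not part of the original XploRe quantlets) evaluates the closed-form expressions above for the car data correlations:

\begin{verbatim}
import numpy as np

# correlations of price, security, easy handling (Table B.7)
r12, r13, r23 = 0.975, 0.613, 0.620

# exact one-factor solution (p = 3, k = 1): closed-form communalities
q1_sq = r12 * r13 / r23
q2_sq = r12 * r23 / r13
q3_sq = r13 * r23 / r12

q_hat = np.sqrt([q1_sq, q2_sq, q3_sq])         # loadings (positive sign chosen)
psi_hat = 1 - np.array([q1_sq, q2_sq, q3_sq])  # specific variances

print(np.round(q_hat, 3))    # q_1 = 0.982, q_2 = 0.993, q_3 = 0.624
print(np.round(psi_hat, 3))  # psi_11 = 0.036, psi_22 = 0.014, psi_33 = 0.610
\end{verbatim}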

The Maximum Likelihood Method

Recall from Chapter 6 the log-likelihood function $\ell$ for a data matrix ${\data{X}}
$ of observations of $ X \sim N_p (\mu, \Sigma) $:

\begin{eqnarray*}
\ell({\data{X}};\mu,\Sigma) &=& -\frac{n}{2}\log\mid2\pi\Sigma\mid
-\frac{n}{2}\mathop{\hbox{tr}}(\Sigma^{-1}{\data{S}})
-\frac{n}{2}(\overline x -\mu)\Sigma^{-1}(\overline x -\mu)^{\top}.
\end{eqnarray*}



This can be rewritten as

\begin{displaymath}
\ell({\data{X}};\widehat \mu,\Sigma)
= -\frac{n}{2} \left\{ \log\mid2\pi\Sigma\mid +
\mathop{\hbox{tr}}(\Sigma^{-1}{\data{S}}) \right\}.
\end{displaymath}

Replacing $\mu$ by $ \widehat \mu = \overline x$ and substituting $\Sigma = {\data{QQ}}^{\top} + \Psi$ this becomes
\begin{displaymath}
\ell({\data{X}};\widehat \mu,{\data{Q}},\Psi)
= -\frac{n}{2}\left[ \log\{\mid2\pi({\data{QQ}}^{\top} + \Psi)\mid\}
+ \mathop{\hbox{tr}}\{({\data{QQ}}^{\top} + \Psi)^{-1}{\data{S}}\} \right].
\end{displaymath} (10.13)

Even in the case of a single factor ($k=1$), these equations are rather complicated and iterative numerical algorithms have to be used (for more details, see Mardia et al., 1979, p. 263ff). A practical computation scheme is also given in Supplement 9A of Johnson and Wichern (1998).
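To give an impression of what such an algorithm does, the following Python sketch maximizes (10.13) directly with a general-purpose optimizer. It is our own illustration (not the scheme of Mardia et al. or Johnson and Wichern); the specific variances are kept positive through a log parametrization, and the loadings are only determined up to rotation:

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def ml_factor_estimates(S, k, n):
    """Crude maximum likelihood factor analysis: maximize (10.13) over
    Q (p x k) and Psi = diag(psi_11, ..., psi_pp) numerically."""
    p = S.shape[0]

    def neg_loglik(theta):
        Q = theta[:p * k].reshape(p, k)
        Psi = np.diag(np.exp(theta[p * k:]))       # ensures psi_jj > 0
        Sigma = Q @ Q.T + Psi
        sign, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
        return (n / 2) * (logdet + np.trace(np.linalg.solve(Sigma, S)))

    # start from the principal component solution
    lam, gamma = np.linalg.eigh(S)
    Q0 = gamma[:, ::-1][:, :k] * np.sqrt(lam[::-1][:k])
    psi0 = np.log(np.maximum(np.diag(S) - np.sum(Q0 ** 2, axis=1), 1e-3))
    res = minimize(neg_loglik, np.concatenate([Q0.ravel(), psi0]),
                   method="L-BFGS-B")
    Q_hat = res.x[:p * k].reshape(p, k)
    Psi_hat = np.exp(res.x[p * k:])
    return Q_hat, Psi_hat
\end{verbatim}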

Likelihood Ratio Test for the Number of Common Factors

Using the methodology of Chapter 7, it is easy to test the adequacy of the factor analysis model by comparing the likelihood under the null (factor analysis) and alternative (no constraints on covariance matrix) hypotheses.

Assuming that $\widehat{\data{Q}}$ and $\widehat{\Psi}$ are the maximum likelihood estimates corresponding to (10.13), we obtain the following LR test statistic:

\begin{displaymath}
- 2 \log \left( \frac{\textrm{maximized likelihood under }H_0}
{\textrm{maximized likelihood}} \right)
= n \log \left( \frac{\vert\widehat{\data{Q}}\widehat{\data{Q}}^{\top}
+ \widehat{\Psi}\vert}{\vert{\data{S}}\vert}
\right),
\end{displaymath} (10.14)

which asymptotically has the $\chi^2_{\frac{1}{2}\{(p-k)^2-p-k\}}$ distribution.

The $\chi^2$ approximation can be improved if we replace $n$ by $n-1-(2p+4k+5)/6$ in (10.14) (Bartlett, 1954). Using Bartlett's correction, we reject the factor analysis model at the $\alpha$ level if

\begin{displaymath}
\{n-1-(2p+4k+5)/6\}
\log
\left(
\frac{\vert\widehat{\data{Q}}\widehat{\data{Q}}^{\top}+\widehat{\Psi}\vert}{\vert{\data{S}}\vert}
\right)
>
\chi^2_{1-\alpha; \{(p-k)^2-p-k\}/2},
\end{displaymath} (10.15)

provided that the number of observations $n$ is large and the number of common factors $k$ is such that the $\chi^2$ statistic has a positive number of degrees of freedom.
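A Python sketch of the corrected test (10.15) might look as follows; it assumes that ML estimates $\widehat{\data{Q}}$ and $\widehat\Psi$ are already available (the function name is ours):

\begin{verbatim}
import numpy as np
from scipy.stats import chi2

def lr_test_k_factors(S, Q_hat, Psi_hat, n, k):
    """Likelihood ratio test (10.15) for the adequacy of a k-factor model,
    using Bartlett's correction. Psi_hat is the vector of specific variances."""
    p = S.shape[0]
    df = ((p - k) ** 2 - p - k) / 2
    if df <= 0:
        raise ValueError("no positive degrees of freedom for this k")
    Sigma_hat = Q_hat @ Q_hat.T + np.diag(Psi_hat)
    correction = n - 1 - (2 * p + 4 * k + 5) / 6
    stat = correction * np.log(np.linalg.det(Sigma_hat) / np.linalg.det(S))
    p_value = 1 - chi2.cdf(stat, df)
    return stat, df, p_value
\end{verbatim}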


The Method of Principal Factors

The method of principal factors concentrates on the decomposition of the correlation matrix ${\data{R}}$ or the covariance matrix ${\data {S}}$. For simplicity, only the method for the correlation matrix ${\data{R}}$ will be discussed. As pointed out in Chapter 9, the spectral decompositions of ${\data{R}}$ and ${\data {S}}$ yield different results and therefore the method of principal factors may result in different estimators. The method can be motivated as follows: suppose we knew $\Psi$ exactly. Then the constraint (10.12) implies (since ${\data{D}}={\data{I}}$) that the columns of ${\data{Q}}$ are orthogonal and that they are eigenvectors of ${\data{Q}}{\data{Q}}^{\top} = {\data{R}}-\Psi$. Furthermore, assume that the first $k$ eigenvalues are positive. In this case we could calculate ${\data{Q}}$ by means of a spectral decomposition of ${\data{Q}}{\data{Q}}^{\top}$ and $k$ would be the number of factors.

The principal factors algorithm is based on good preliminary estimators $\widetilde h^2_j$ of the communalities $h^2_j$, for $j=1,\ldots , p$. There are two traditional proposals: the squared multiple correlation coefficient of $X_j$ with the remaining $p-1$ variables, or the maximum of the absolute correlations of $X_j$ with the other variables, $\widetilde h^2_j=\max_{\ell\neq j}\vert r_{X_jX_{\ell}}\vert$.

Given $\tilde \psi_{jj} = 1-\tilde h_j^2$ we can construct the reduced correlation matrix, $\data{R}-\widetilde \Psi $. The Spectral Decomposition Theorem says that

\begin{displaymath}\data{R}- \widetilde \Psi
= \sum^p_{\ell=1}\lambda_\ell\gamma_{\ell}\gamma_{\ell}^{\top},\end{displaymath}

with eigenvalues $\lambda_1\ge \dots \ge \lambda_p$. Assume that the first $k$ eigenvalues $\lambda_1, \ldots,
\lambda_k $ are positive and large compared to the others. Then we can set

\begin{displaymath}\widehat q_{\ell} = \sqrt{\lambda_\ell}\ \gamma_{\ell}\ ,
\quad \ell=1, \ldots, k \end{displaymath}

or

\begin{displaymath}\widehat{\data{Q}} = \Gamma_1\Lambda_1^{1/2} \end{displaymath}

with

\begin{displaymath}\Gamma_1 = (\gamma_{1}, \ldots, \gamma_{k}) \quad \textrm{and} \quad
\Lambda_1 = \mathop{\hbox{diag}}(\lambda_1, \ldots, \lambda_k).\end{displaymath}

In the next step set

\begin{displaymath}\widehat{\psi}_{jj} = 1- \sum^k_{\ell=1} \widehat q^2_{j\ell}\ ,
\quad j=1, \ldots, p.\end{displaymath}

Note that the procedure can be iterated: from $\widehat \psi_{jj}$ we can compute a new reduced correlation matrix $\data{R}-\widehat{\Psi}$ following the same procedure. The iteration usually stops when the $\widehat \psi_{jj}$ have converged to a stable value.
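The iteration can be written down compactly. The Python sketch below (names are ours) uses the maximum absolute correlation as the preliminary communality estimate:

\begin{verbatim}
import numpy as np

def principal_factors(R, k, n_iter=50, tol=1e-6):
    """Principal factor method on a correlation matrix R with k factors.
    Preliminary communalities: h~_j^2 = max_{l != j} |r_{X_j X_l}|."""
    p = R.shape[0]
    h2 = np.max(np.abs(R - np.eye(p)), axis=1)   # preliminary communalities
    for _ in range(n_iter):
        R_red = R - np.diag(1 - h2)              # reduced correlation matrix
        lam, gamma = np.linalg.eigh(R_red)
        lam, gamma = lam[::-1], gamma[:, ::-1]   # descending order
        Q_hat = gamma[:, :k] * np.sqrt(np.maximum(lam[:k], 0))
        h2_new = np.sum(Q_hat ** 2, axis=1)
        if np.max(np.abs(h2_new - h2)) < tol:    # communalities stabilized
            h2 = h2_new
            break
        h2 = h2_new
    return Q_hat, 1 - h2                         # loadings, specific variances
\end{verbatim}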

EXAMPLE 10.4   Consider once again the car data given in Table B.7. From Exercise 9.4 we know that the first PC is mainly influenced by $X_{2}$-$X_{7}$. Moreover, we know that most of the variance is already captured by the first PC. Thus we can conclude that the data are mainly determined by one factor ($k=1$).

Figure 10.1: Loadings of the evaluated car qualities, factor analysis with $k=2$.   MVAfactcarm.xpl
\includegraphics[width=1\defpicwidth]{factcar.ps}

The eigenvalues of $\data{R}- \widehat \Psi $ for $\widehat\Psi =
(\max\limits_{j\neq i}\vert r_{X_{i}X_{j}}\vert)$ are

\begin{displaymath}(5.448,0.003,-0.246,-0.646,-0.901,-0.911,-0.948,-0.964)^{\top}\ .\end{displaymath}

It would suffice to choose only one factor. Nevertheless, we have computed two factors. The result (the factor loadings for two factors) is shown in Figure 10.1.

We can clearly see a cluster of points to the right, which contains the factor loadings for the variables $X_{2}$-$X_{7}$. This shows, as did the PCA, that these variables are highly dependent and are thus more or less equivalent. The factor loadings for $X_{1}$ (economy) and $X_{8}$ (easy handling) are separate, but note the different scales on the horizontal and vertical axes! Although there are two or three sets of variables in the plot, most of the variance is already explained by the first factor, the ``price+security'' factor.


The Principal Component Method

The principal factor method involves finding an approximation $\tilde \Psi$ of $\Psi$, the matrix of specific variances, and then correcting $\data{R}$, the correlation matrix of $X$, by $\tilde \Psi$. The principal component method starts with an approximation $\hat{\data{Q}}$ of $\data{Q}$, the factor loadings matrix. The sample covariance matrix is diagonalized, $\data{S}=\Gamma \Lambda \Gamma^{\top}$. Then the first $k$ eigenvectors are retained to build

\begin{displaymath}
\hat{\data{Q}}= [\sqrt{\lambda_{1}} \gamma_{1}, \ldots, \sqrt{\lambda_{k}} \gamma_{k}].
\end{displaymath} (10.16)

The estimated specific variances are provided by the diagonal elements of the matrix $\data{S}-\hat{\data{Q}}\hat{\data{Q}}^\top$,
\begin{displaymath}
\hat \Psi = \left(\begin{array}{cccc}
\hat \psi_{11} & & & 0 \\
 & \hat \psi_{22} & & \\
 & & \ddots & \\
0 & & & \hat \psi_{pp}
\end{array}\right), \qquad
\hat \psi_{jj}=s_{X_{j}X_{j}}-\sum^{k}_{\ell=1} \hat q^{2}_{j\ell}.
\end{displaymath} (10.17)

By definition, the diagonal elements of $\data{S}$ are equal to the diagonal elements of $\hat{\data{Q}} \hat{\data{Q}}^\top+ \hat \Psi$. The off-diagonal elements, however, are not necessarily reproduced exactly. How good then is this approximation? Consider the residual matrix

\begin{displaymath}
\data{S}-(\hat{\data{Q}}\hat{\data{Q}}^\top+\hat \Psi)
\end{displaymath}

resulting from the principal component solution. Analytically we have that

\begin{displaymath}\sum_{i,j}(\data{S}-\hat{\data{Q}}\hat{\data{Q}}^\top-\hat \Psi)^{2}_{ij}\leq
\lambda^{2}_{k+1}+\ldots+\lambda^{2}_{p}.\end{displaymath}

This implies that a small value of the neglected eigenvalues can result in a small approximation error. A heuristic device for selecting the number of factors is to consider the proportion of the total sample variance due to the $j$-th factor. This quantity is in general equal to

(A) $\lambda_{j}/ \sum^{p}_{j=1}s_{jj}$ for a factor analysis of $\data{S}$,

(B) $\lambda_{j}/p$ for a factor analysis of $\data{R}$.
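A Python sketch of the principal component method (names are ours), returning the loadings (10.16), the specific variances (10.17) and the proportions of explained variance according to (A):

\begin{verbatim}
import numpy as np

def pc_factor_method(S, k):
    """Principal component method for the factor model on a covariance
    matrix S: Q_hat from (10.16), Psi_hat from (10.17), and the proportion
    of total sample variance attributed to each of the k factors."""
    lam, gamma = np.linalg.eigh(S)
    lam, gamma = lam[::-1], gamma[:, ::-1]             # descending eigenvalues
    Q_hat = gamma[:, :k] * np.sqrt(lam[:k])            # (10.16)
    Psi_hat = np.diag(S) - np.sum(Q_hat ** 2, axis=1)  # (10.17), diagonal only
    proportions = lam[:k] / np.sum(np.diag(S))         # criterion (A); use lam/p for R
    return Q_hat, Psi_hat, proportions
\end{verbatim}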

EXAMPLE 10.5   This example uses a consumer-preference study from Johnson and Wichern (1998). Customers were asked to rate several attributes of a new product. The responses were tabulated and the following correlation matrix ${\data{R}}$ was constructed:
The attributes (variables) are: taste (1), good buy for money (2), flavor (3), suitable for snack (4), and provides lots of energy (5).

\begin{displaymath}
{\data{R}} = \left(
\begin{array}{ccccc}
1.00 & 0.02 & {\bf0.96} & 0.42 & 0.01 \\
0.02 & 1.00 & 0.13 & 0.71 & {\bf0.85} \\
0.96 & 0.13 & 1.00 & 0.50 & 0.11 \\
0.42 & 0.71 & 0.50 & 1.00 & {\bf0.79} \\
0.01 & 0.85 & 0.11 & 0.79 & 1.00 \\
\end{array}\right).
\end{displaymath}

The bold entries of ${\data{R}}$ show that variables 1 and 3 and variables 2 and 5 are highly correlated. Variable 4 is more correlated with variables 2 and 5 than with variables 1 and 3. Hence, a model with 2 (or 3) factors seems to be reasonable.

The first two eigenvalues $\lambda_1 = 2.85$ and $\lambda_2 = 1.81$ of ${\data{R}}$ are the only eigenvalues greater than one. Moreover, $k=2$ common factors account for a cumulative proportion

\begin{displaymath}
\frac{\lambda_1 + \lambda_2}{p} = \frac{2.85+1.81}{5} =
0.93
\end{displaymath}

of the total (standardized) sample variance. Using the principal component method, the estimated factor loadings, communalities, and specific variances are calculated from formulas (10.16) and (10.17); the results are given in Table 10.1.


Table 10.1: Estimated factor loadings, communalities, and specific variances

\begin{tabular}{lrrcc}
 & \multicolumn{2}{c}{Estimated factor loadings} & Communalities & Specific variances \\
Variable & $\hat q_1$ & $\hat q_2$ & $\hat{h}_j^2$ & $\hat{\psi}_{jj}=1-\hat{h}_{j}^2$ \\
\hline
1. Taste & 0.56 & 0.82 & 0.98 & 0.02 \\
2. Good buy for money & 0.78 & $-0.53$ & 0.88 & 0.12 \\
3. Flavor & 0.65 & 0.75 & 0.98 & 0.02 \\
4. Suitable for snack & 0.94 & $-0.11$ & 0.89 & 0.11 \\
5. Provides lots of energy & 0.80 & $-0.54$ & 0.93 & 0.07 \\
\hline
Eigenvalues & 2.85 & 1.81 & & \\
Cumulative proportion of total (standardized) sample variance & 0.571 & 0.932 & & \\
\end{tabular}


Take a look at:

\begin{displaymath}
\hat Q \hat Q^{\top} + \hat \Psi =
\left(
\begin{array}{rr}
0.56 & 0.82 \\
0.78 &-0.53 \\
0.65 & 0.75 \\
0.94 &-0.11 \\
0.80 &-0.54
\end{array}\right)
\left(
\begin{array}{rrrrr}
0.56 & 0.78 & 0.65 & 0.94 & 0.80 \\
0.82 &-0.53 & 0.75 &-0.11 &-0.54
\end{array} \right)
+
\end{displaymath}


\begin{displaymath}
+\left(
\begin{array}{lllll}
0.02 & 0 & 0 & 0 & 0 \\
0 & 0.12 & 0 & 0 & 0 \\
0 & 0 & 0.02 & 0 & 0 \\
0 & 0 & 0 & 0.11 & 0 \\
0 & 0 & 0 & 0 & 0.07
\end{array}\right)
=
\left(
\begin{array}{lllll}
1.00 & 0.00 & 0.98 & 0.44 & 0.00 \\
0.00 & 1.00 & 0.11 & 0.79 & 0.91 \\
0.98 & 0.11 & 1.00 & 0.53 & 0.11 \\
0.44 & 0.79 & 0.53 & 1.00 & 0.81 \\
0.00 & 0.91 & 0.11 & 0.81 & 1.00
\end{array}\right).
\end{displaymath}

This nearly reproduces the correlation matrix ${\data{R}}$. We conclude that the two-factor model provides a good fit of the data. The communalities $(0.98,0.88,0.98,0.89,0.93)$ indicate that the two factors account for a large percentage of the sample variance of each variable. Due to the nonuniqueness of factor loadings, the interpretation might be enhanced by rotation. This is the topic of the next subsection.
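The numbers of this example can be checked with a few lines of Python (our own verification; the signs of the estimated loadings are not identified and may come out reversed):

\begin{verbatim}
import numpy as np

R = np.array([[1.00, 0.02, 0.96, 0.42, 0.01],
              [0.02, 1.00, 0.13, 0.71, 0.85],
              [0.96, 0.13, 1.00, 0.50, 0.11],
              [0.42, 0.71, 0.50, 1.00, 0.79],
              [0.01, 0.85, 0.11, 0.79, 1.00]])

lam, gamma = np.linalg.eigh(R)
lam, gamma = lam[::-1], gamma[:, ::-1]        # descending eigenvalues
Q_hat = gamma[:, :2] * np.sqrt(lam[:2])       # loadings of the k = 2 factors
Psi_hat = 1 - np.sum(Q_hat ** 2, axis=1)      # specific variances (10.17)

print(np.round(lam[:2], 2))                   # first two eigenvalues: 2.85 and 1.81
print(np.round(Q_hat, 2))                     # columns q_hat_1, q_hat_2 of Table 10.1
print(np.round(Q_hat @ Q_hat.T + np.diag(Psi_hat), 2))  # nearly reproduces R
\end{verbatim}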


Rotation

The constraints (10.11) and (10.12) are given as a matter of mathematical convenience (to create unique solutions) and can therefore complicate the problem of interpretation. The interpretation of the loadings would be very simple if the variables could be split into disjoint sets, each being associated with one factor. A well known analytical algorithm to rotate the loadings is the varimax rotation method proposed by Kaiser (1958). In the simplest case of $k=2$ factors, a rotation matrix $\data{G}$ is given by

\begin{displaymath}\data{G}(\theta)=\left( \begin{array}{rr} \cos{\theta}&\sin{\theta}\\
-\sin{\theta}&\cos{\theta} \end{array} \right),\end{displaymath}

representing a clockwise rotation of the coordinate axes by the angle $\theta$. The corresponding rotation of loadings is calculated via $\hat{\data{Q}}^* = \hat{\data{Q}}\data{G}(\theta)$. The idea of the varimax method is to find the angle $\theta$ that maximizes the sum of the variances of the squared loadings $\hat q^*_{j\ell}$ within each column of $\hat{\data{Q}}^*$. More precisely, defining $\tilde q^{*}_{j\ell}=\hat q^{*}_{j\ell}/ \hat h^{*}_{j}$, the varimax criterion chooses $\theta$ so that

\begin{displaymath}
\data{V}=\frac{1}{p}\sum^{k}_{\ell=1}
\left[ \sum^{p}_{j=1}\left(\tilde q^{*}_{j\ell}\right)^{4}
- \left\{ \frac{1}{p} \sum^{p}_{j=1}\left(\tilde q^{*}_{j\ell}\right)^{2} \right\}^2
\right]
\end{displaymath}

is maximized.
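For $k=2$ the varimax angle can be found by a simple one-dimensional search over $\theta$. The following Python sketch (our own illustration; statistical software typically uses an iterative pairwise algorithm instead) evaluates $\data{V}$ on a grid of angles:

\begin{verbatim}
import numpy as np

def varimax_k2(Q_hat, n_grid=3600):
    """Varimax rotation for k = 2 factors: rotate the loadings by G(theta)
    and keep the angle that maximizes the criterion V from the text."""
    p = Q_hat.shape[0]
    h = np.sqrt(np.sum(Q_hat ** 2, axis=1))              # square roots of communalities
    best_theta, best_V = 0.0, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, n_grid):
        G = np.array([[np.cos(theta),  np.sin(theta)],
                      [-np.sin(theta), np.cos(theta)]])  # clockwise rotation
        q_tilde = (Q_hat @ G) / h[:, None]               # normalized rotated loadings
        V = np.sum(np.sum(q_tilde ** 4, axis=0)
                   - (np.sum(q_tilde ** 2, axis=0) / p) ** 2) / p
        if V > best_V:
            best_theta, best_V = theta, V
    G = np.array([[np.cos(best_theta),  np.sin(best_theta)],
                  [-np.sin(best_theta), np.cos(best_theta)]])
    return Q_hat @ G
\end{verbatim}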

EXAMPLE 10.6   Let us return to the marketing example of Johnson and Wichern (1998) (Example 10.5). In the unrotated solution of Table 10.1, all variables load substantially on the first factor and most of them also on the second, which makes it difficult to interpret the factors. Applying the varimax rotation we obtain the loadings $
\tilde q_1=(0.02, \mathbf{0.94}, 0.13, \mathbf{0.84}, \mathbf{0.97})^\top
$ and $
\tilde q_2=(\mathbf{0.99}, -0.01, \mathbf{0.98}, 0.43, -0.02)^\top
$. The high loadings, indicated as bold entries, show that variables 2, 4 and 5 define factor 1, a nutritional factor. Variables 1 and 3 define factor 2, which might be referred to as a taste factor.

Summary
$\ast$
In practice, $\data{Q}$ and $\Psi$ have to be estimated from $\data{S}=\widehat{\data{Q}}\widehat{\data{Q}}^{\top}+\widehat{\Psi}$. The number of parameters is $d=\frac{1}{2}(p-k)^2-\frac{1}{2}(p+k)$.
$\ast$
If $d=0$, then there exists an exact solution. In practice, $d$ is usually greater than $0$, thus approximations must be considered.
$\ast$
The maximum-likelihood method assumes a normal distribution for the data. A solution can be found using numerical algorithms.
$\ast$
The method of principal factors is a two-stage method which calculates $\widehat{\data{Q}}$ from the reduced correlation matrix $\data{R}-\widetilde \Psi $, where $\widetilde\Psi$ is a pre-estimate of $\Psi$. The final estimate of $\Psi$ is found by $\widehat\psi_{ii}
=1-\sum_{j=1}^k \widehat q_{ij}^2$.
$\ast$
The principal component method is based on an approximation, $\widehat{\data{Q}}$, of $\data{Q}$.
$\ast$
Often a more informative interpretation of the factors can be found by rotating the factors.
$\ast$
The varimax rotation chooses a rotation $\theta$ that maximizes $\data{V}=\frac{1}{p}\sum^{k}_{\ell=1}
\left[ \sum^{p}_{j=1}\left(\tilde q^{*}_{j\ell}\right)^{4}
- \left\{\frac{1}{p} \sum^{p}_{j=1}\left(\tilde q^{*}_{j\ell}\right)^{2} \right\}^2
\right]$.