10.1 The Orthogonal Factor Model

The aim of factor analysis is to explain the outcome of $p$ variables in the data matrix $\data{X}$ using fewer variables, the so-called factors. Ideally all the information in $\data{X}$ can be reproduced by a smaller number of factors. These factors are interpreted as latent (unobserved) common characteristics of the observed $x\in \mathbb{R}^p$. The case just described occurs when every observed $x=(x_{1},\ldots,x_{p})^{\top}$ can be written as

\begin{displaymath}
x_{j}=\sum_{\ell=1}^{k} q_{j\ell}f_{\ell} + \mu_{j},\, j=1,...,p.
\end{displaymath} (10.1)

Here $f_{\ell}$, for $\ell=1,\ldots,k$, denotes the factors. The number of factors, $k$, should always be much smaller than $p$. For instance, in psychology $x$ may represent $p$ results of a test measuring intelligence scores. One common latent factor explaining $x\in \mathbb{R}^p$ could be the overall level of ``intelligence''. In marketing studies, $x$ may consist of $p$ answers to a survey on the level of customer satisfaction. These $p$ measures could be explained by common latent factors such as the attraction level of the product or the image of the brand. Indeed it is possible to create a representation of the observations that is similar to the one in (10.1) by means of principal components, but only if the last $p-k$ eigenvalues of the covariance matrix are equal to zero. Consider a $p$-dimensional random vector $X$ with mean $\mu$ and covariance matrix $\hbox{Var}(X)=\Sigma$. A model similar to (10.1) can be written for $X$ in matrix notation, namely
\begin{displaymath}
X = \data{Q} F + \mu,
\end{displaymath} (10.2)

where $F$ is the $k$-dimensional vector of the $k$ factors. When using the factor model (10.2) it is often assumed that the factors $F$ are centered, uncorrelated and standardized: $E(F)=0$ and $\hbox{Var}(F)={\data{I}}_k$. We will now show that if the last $p-k$ eigenvalues of $\Sigma$ are equal to zero, we can easily express $X$ by the factor model (10.2).

The spectral decomposition of $\Sigma$ is given by $\Gamma\Lambda \Gamma^{\top}$. Suppose that only the first $k$ eigenvalues are positive, i.e., $\lambda_{k+1} = \ldots = \lambda_{p} = 0$. Then the (singular) covariance matrix can be written as

\begin{displaymath}
\Sigma = \sum_{\ell=1}^k \lambda_\ell \gamma_{\ell} \gamma_{\ell}^{\top}
= \Gamma_1 \Lambda_1 \Gamma_1^{\top}
= \left(\Gamma_1 \; \Gamma_2\right)
\left(\begin{array}{cc}
\Lambda_1 & 0\\
0 & 0
\end{array}\right)
\left(\Gamma_1^{\top} \atop \Gamma_2^{\top}\right),
\end{displaymath}

where $\Lambda_1=\mathop{\hbox{diag}}(\lambda_1,\ldots,\lambda_k)$ and the columns of $\Gamma_1\ (p\times k)$ and $\Gamma_2\ (p\times(p-k))$ contain the eigenvectors corresponding to the first $k$ and the last $p-k$ eigenvalues, respectively.

In order to show the connection to the factor model (10.2), recall that the PCs are given by $Y =
\Gamma^{\top}(X-\mu )$. Rearranging we have $X-\mu=\Gamma Y = \Gamma_1 Y_1 + \Gamma_2 Y_2$, where the components of $Y$ are partitioned according to the partition of $\Gamma$ above, namely

\begin{eqnarray*}
Y=\left(\begin{array}{c}
Y_1\\
Y_2
\end{array}\right)
=
\left(\begin{array}{c}
\Gamma_1^{\top}(X-\mu)\\
\Gamma_2^{\top}(X-\mu)
\end{array}\right)
\sim
\left(\left(\begin{array}{c}
0\\
0
\end{array}\right),
\left(\begin{array}{cc}
\Lambda_1& 0\\
0&0\end{array}\right)\right).
\end{eqnarray*}



In other words, $Y_2$ has a singular distribution with mean and covariance matrix equal to zero. Therefore, $X-\mu=\Gamma_1 Y_1+\Gamma_2 Y_2$ implies that $X-\mu$ is equivalent to $\Gamma_1 Y_1$, which can be written as

\begin{displaymath}
X=\Gamma_1 \Lambda_1^{1/2}\Lambda_1^{-1/2} Y_1 + \mu.
\end{displaymath}

Defining ${\data{Q}}= \Gamma_1 \Lambda_1^{1/2}$ and $F=\Lambda_1^{-1/2} Y_1$, we obtain the factor model (10.2).
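As a quick numerical illustration of this construction, the following minimal sketch (Python with NumPy; the rank-$k$ covariance matrix and the choice $k=2$ are arbitrary assumptions made only for the example) extracts $\data{Q}=\Gamma_1\Lambda_1^{1/2}$ from the spectral decomposition and checks that $\data{Q}\data{Q}^{\top}$ reproduces $\Sigma$.

\begin{verbatim}
import numpy as np

p, k = 4, 2
rng = np.random.default_rng(0)

# construct an arbitrary rank-k covariance matrix Sigma = B B^T
B = rng.normal(size=(p, k))
Sigma = B @ B.T

# spectral decomposition; keep the k positive eigenvalues
lam, gamma = np.linalg.eigh(Sigma)
idx = np.argsort(lam)[::-1][:k]          # indices of the k largest eigenvalues
Lambda1, Gamma1 = np.diag(lam[idx]), gamma[:, idx]

Q = Gamma1 @ np.sqrt(Lambda1)            # loadings Q = Gamma_1 Lambda_1^{1/2}
print(np.allclose(Q @ Q.T, Sigma))       # True: Sigma = Q Q^T, cf. (10.3)
\end{verbatim}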

Note that the covariance matrix of model (10.2) can be written as

\begin{displaymath}
\Sigma = E(X-\mu)(X-\mu)^{\top} = \data{Q}\,E(FF^{\top})\,\data{Q}^{\top}
= \data{Q}\data{Q}^{\top}
= \Gamma_1 \Lambda_1 \Gamma_1^{\top}
= \sum_{j=1}^k \lambda_j \gamma_{j} \gamma_{j}^{\top}.
\end{displaymath} (10.3)

We have just shown how the variable $X$ can be completely determined by a weighted sum of $k$ (where $k<p$) uncorrelated factors. The situation used in the derivation, however, is too idealistic. In practice the covariance matrix is rarely singular.

It is common practice in factor analysis to split the influence of the factors into common and specific parts. There are, for example, highly informative factors that are common to all of the components of $X$ and factors that are specific to certain components. The factor analysis model used in practice is a generalization of (10.2):

\begin{displaymath}
X = \data{Q} F + U +\mu,
\end{displaymath} (10.4)

where $\data{Q}$ is a $(p\times k)$ matrix of the (non-random) loadings of the common factors $F\ (k\times 1)$ and $U$ is a $(p \times 1)$ vector of the (random) specific factors. It is assumed that the components of the factor vector $F$ are uncorrelated, and that the specific factors are uncorrelated with each other and have zero covariance with the common factors. More precisely, it is assumed that:
\begin{eqnarray*}
EF &=& 0,\\
\Var(F) &=& \data{I}_k,\\
EU &=& 0,\\
\mathop{\mathit{Cov}}(U_i,U_j) &=& 0,\quad i\neq j,\\
\Cov(F,U) &=& 0.
\end{eqnarray*} (10.5)

Define

\begin{displaymath}\Var(U)=\Psi =\mathop{\hbox{diag}}(\psi _{11},\ldots ,\psi _{pp}). \end{displaymath}

The generalized factor model (10.4) together with the assumptions given in (10.5) constitute the orthogonal factor model.


Orthogonal Factor Model

\begin{displaymath}
\underbrace{X}_{(p \times 1)} \;=\;
\underbrace{\data{Q}}_{(p \times k)}\;
\underbrace{F}_{(k \times 1)} \;+\;
\underbrace{U}_{(p \times 1)} \;+\;
\underbrace{\mu}_{(p \times 1)}
\end{displaymath}

$\mu_{j}$ = mean of variable $j$
$U_{j}$ = $j$-th specific factor
$F_{\ell}$ = $\ell$-th common factor
$q_{j \ell}$ = loading of the $j$-th variable on the $\ell$-th factor
The random vectors $F$ and $U$ are unobservable and uncorrelated.
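To see the model (10.4) and the assumptions (10.5) at work, one can simulate data from it and compare the sample covariance with $\data{Q}\data{Q}^{\top}+\Psi$. The sketch below uses hypothetical values for $\data{Q}$ and $\Psi$ chosen for illustration; it should typically print True (a loose tolerance absorbs the sampling error).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
p, k, n = 5, 2, 100_000

Q = rng.normal(size=(p, k))                  # hypothetical loadings
Psi = np.diag(rng.uniform(0.2, 1.0, p))      # hypothetical specific variances
mu = np.zeros(p)

F = rng.normal(size=(n, k))                  # common factors, F ~ (0, I_k)
U = rng.normal(size=(n, p)) @ np.sqrt(Psi)   # specific factors, U ~ (0, Psi)
X = F @ Q.T + U + mu                         # model (10.4), one row per observation

print(np.allclose(np.cov(X, rowvar=False), Q @ Q.T + Psi, atol=0.1))
\end{verbatim}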

Note that (10.4) implies for the components of $X=(X_{1},\ldots,X_{p})^{\top}$ that

\begin{displaymath}
X_j=\sum ^k_{\ell=1}q_{j\ell}F_\ell+U_j+\mu_j,\quad j=1,\ldots ,p.
\end{displaymath} (10.6)

Using (10.5) we obtain $\sigma_{X_{j}X_{j}} = \mathop{\mathit{Var}}(X_j) = \sum^k_{\ell=1} q^2_{j\ell} +
\psi_{jj}$. The quantity $h^2_j = \sum^k_{\ell=1} q^2_{j\ell} $ is called the communality and $ \psi_{jj} $ the specific variance. Thus the covariance of $X$ can be rewritten as
\begin{eqnarray*}
\Sigma &=& E(X-\mu)(X-\mu)^{\top} = E(\data{Q}F+U)(\data{Q}F+U)^{\top}\\
&=& \data{Q}\,E(FF^{\top})\,\data{Q}^{\top} + E(UU^{\top})
= \data{Q} \Var(F) \data{Q}^{\top} + \Var(U)\\
&=& \data{Q}\data{Q}^{\top} + \Psi.
\end{eqnarray*} (10.7)

In a sense, the factor model explains most of the variation of $X$ by a small number of latent factors $F$ common to its $p$ components, and it explains the entire correlation structure between the components, while the ``noise'' $U$ allows for specific variation in each component. The specific factors capture the individual variance of each component. Factor analysis relies on the assumptions presented above; if they are not met, the analysis could be spurious. Although principal component analysis and factor analysis are related (this was hinted at in the derivation of the factor model), they are quite different in nature. PCs are linear transformations of $X$ arranged in decreasing order of variance and used to reduce the dimension of the data set, whereas in factor analysis we try to model the variation of $X$ using a linear transformation of a fixed, limited number of latent factors. The objective of factor analysis is to find the loadings $\data{Q}$ and the specific variance $\Psi$. Estimates of $\data{Q}$ and $\Psi$ are deduced from the covariance structure (10.7).
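A small numerical check of the decomposition (10.7): for hypothetical loadings $\data{Q}$ and specific variances $\Psi$ (chosen here so that the variables have unit variance), the variance of each component splits into the communality $h_j^2$ and the specific variance $\psi_{jj}$.

\begin{verbatim}
import numpy as np

# hypothetical loadings (p = 3, k = 2) and specific variances
Q = np.array([[0.9, 0.1],
              [0.7, 0.5],
              [0.2, 0.8]])
Psi = np.diag([0.18, 0.26, 0.32])

Sigma = Q @ Q.T + Psi                        # covariance structure (10.7)
h2 = (Q**2).sum(axis=1)                      # communalities h_j^2
print(np.allclose(np.diag(Sigma), h2 + np.diag(Psi)))   # True
\end{verbatim}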


Interpretation of the Factors

Assume that a factor model with $k$ factors was found to be reasonable, i.e., most of the (co)variation of the $p$ measures in $X$ is explained by the $k$ fixed latent factors. The next natural step is to try to understand what these factors represent. To interpret $F_\ell$, it makes sense to first compute its correlations with the original variables $X_j$. This is done for $\ell=1,\ldots,k$ and for $j=1,\ldots,p$ to obtain the matrix $P_{XF}$. The sequence of calculations used here is in fact the same as the one used to interpret the PCs in principal component analysis.

The following covariance between $X$ and $F$ is obtained via (10.5),

\begin{displaymath}\Sigma_{XF} = E \{ (\data{Q}F+U)F^{\top} \} =\data{Q}. \end{displaymath}

The correlation is
\begin{displaymath}
P_{XF} = D^{-1/2} \data{Q},
\end{displaymath} (10.8)

where $ D = \mathop{\hbox{diag}}(\sigma_{X_1X_1}, \ldots, \sigma_{X_pX_p}) $. Using (10.8) it is possible to construct a figure analogous to Figure 9.6 and thus to consider which of the original variables $X_1,\ldots,X_p$ play a role in the unobserved common factors $F_1, \ldots, F_k$.
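Continuing with the hypothetical $\data{Q}$ and $\Psi$ from the sketch above, the correlations (10.8) between the original variables and the factors are obtained by rescaling the loadings with $D^{-1/2}$:

\begin{verbatim}
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.7, 0.5],
              [0.2, 0.8]])
Psi = np.diag([0.18, 0.26, 0.32])

Sigma = Q @ Q.T + Psi
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
P_XF = D_inv_sqrt @ Q            # correlations between X_j and F_l, cf. (10.8)
print(P_XF)                      # equal to Q here, since Var(X_j) = 1
\end{verbatim}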

Returning to the psychology example where $X$ contains the observed scores on $p$ different intelligence tests (the WAIS data set in Table B.12 provides an example), we would expect a one-factor model to produce a factor that is positively correlated with all of the components of $X$. For this example the factor represents the overall level of intelligence of an individual. A model with two factors could produce a refinement in explaining the variation of the $p$ scores. For example, the first factor could be the same as before (overall level of intelligence), whereas the second factor could be positively correlated with some of the tests, $X_j$, that are related to the individual's ability to think abstractly and negatively correlated with other tests, $X_i$, that are related to the individual's practical ability. The second factor would then concern a particular dimension of intelligence, stressing the distinction between the ``theoretical'' and ``practical'' abilities of the individual. If the model is true, most of the information coming from the $p$ scores can be summarized by these two latent factors. Other practical examples are given below.


Invariance of Scale

What happens if we change the scale of $X$ to $Y = \data{C}X$ with $\data{C} = \mathop{\hbox{diag}}(c_{1},\ldots,c_{p})$? If the $k$-factor model (10.6) is true for $X$ with $\data{Q} =\data{Q}_X$, $\Psi =\Psi_X$, then, since

\begin{displaymath}\Var(Y) = \data{C}\Sigma \data{C}^{\top}
= \data{C}\data{Q}_X\data{Q}_X^{\top}\data{C}^{\top}
+ \data{C} \Psi_X\data{C}^{\top}, \end{displaymath}

the same $k$-factor model is also true for $Y$ with $\data{Q}_Y = \data{C}\data{Q}_X$ and $ \Psi_Y = \data{C}\Psi_X\data{C}^{\top}$. In many applications, the search for the loadings $\data{Q}$ and for the specific variance $\Psi$ will be done by the decomposition of the correlation matrix of $X$ rather than the covariance matrix $\Sigma$. This corresponds to a factor analysis of a linear transformation of $X$ (i.e., $Y=D^{-1/2} (X-\mu))$. The goal is to try to find the loadings $\data{Q}_Y$ and the specific variance $\Psi_Y$ such that
\begin{displaymath}
P = \data{Q}_Y \; \data{Q}_Y^{\top} + \Psi_Y.
\end{displaymath} (10.9)

In this case the interpretation of the factors $F$ immediately follows from (10.8) given the following correlation matrix:
\begin{displaymath}
P_{XF} = P_{YF} = \data{Q}_{Y}.
\end{displaymath} (10.10)

Because of the scale invariance of the factors, the loadings and the specific variance of the model, where $X$ is expressed in its original units of measure, are given by

\begin{eqnarray*}
\data{Q}_X & = & D^{1/2} \data{Q}_Y \\
\Psi_X & = & D^{1/2} \Psi_Y D^{1/2}.
\end{eqnarray*}
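The back-transformation can be checked numerically: if $\data{Q}_Y$ and $\Psi_Y$ reproduce the correlation matrix $P$, then $D^{1/2}\data{Q}_Y$ and $D^{1/2}\Psi_Y D^{1/2}$ reproduce $\Sigma$. All matrices below are hypothetical values used only for the check.

\begin{verbatim}
import numpy as np

Q_Y = np.array([[0.9, 0.1],
                [0.7, 0.5],
                [0.2, 0.8]])
Psi_Y = np.diag([0.18, 0.26, 0.32])
P = Q_Y @ Q_Y.T + Psi_Y                   # correlation structure (10.9)

D = np.diag([4.0, 1.0, 9.0])              # hypothetical variances of X
D_sqrt = np.sqrt(D)

Q_X, Psi_X = D_sqrt @ Q_Y, D_sqrt @ Psi_Y @ D_sqrt
Sigma = D_sqrt @ P @ D_sqrt               # Sigma = D^{1/2} P D^{1/2}
print(np.allclose(Q_X @ Q_X.T + Psi_X, Sigma))   # True
\end{verbatim}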



It should be noted that although the factor analysis model (10.4) enjoys the scale invariance property, the actual estimated factors could be scale dependent. We will come back to this point later when we discuss the method of principal factors.


Non-Uniqueness of Factor Loadings

The factor loadings are not unique! Suppose that $\data{G}$ is an orthogonal matrix. Then $X$ in (10.4) can also be written as

\begin{displaymath}X = (\data{Q} \data{G})(\data{G}^{\top}F) + U + \mu .\end{displaymath}

This implies that, if a $k$-factor model of $X$ with factors $F$ and loadings ${\data{Q}}$ is true, then the $k$-factor model with factors ${\data{G}}^{\top}F$ and loadings ${\data{Q}}{\data{G}}$ is also true. In practice, we will take advantage of this non-uniqueness. Indeed, referring back to Section 2.6 we can conclude that premultiplying a vector $F$ by an orthogonal matrix corresponds to a rotation of the system of axes, the direction of the first new axis being given by the first row of the orthogonal matrix. It will be shown that choosing an appropriate rotation results in a matrix of loadings ${\data{Q}}{\data{G}}$ that is easier to interpret. We have seen that the loadings provide the correlations between the factors and the original variables; therefore, it makes sense to search for rotations that give factors that are maximally correlated with various groups of variables.
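The non-uniqueness is easy to verify numerically: any orthogonal $\data{G}$ leaves $\data{Q}\data{Q}^{\top}$, and hence the implied covariance, unchanged. The loadings and the rotation angle below are arbitrary choices for illustration.

\begin{verbatim}
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.7, 0.5],
              [0.2, 0.8]])
theta = 0.7                              # arbitrary rotation angle
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal matrix

QG = Q @ G                               # rotated loadings
print(np.allclose(QG @ QG.T, Q @ Q.T))   # True: same covariance structure
\end{verbatim}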

From a numerical point of view, the non-uniqueness is a drawback. We have to find loadings ${\data{Q}}$ and specific variances $\Psi$ satisfying the decomposition $\Sigma= {\data{Q}}{\data{Q}}^{\top} + \Psi$, but no straightforward numerical algorithm can solve this problem due to the multiplicity of solutions. An acceptable technique is to impose some chosen constraints in order to get, in the best case, a unique solution to the decomposition. Then, as suggested above, once we have a solution we take advantage of the rotations in order to obtain a solution that is easier to interpret.

An obvious question is: what kind of constraints should we impose in order to eliminate the non-uniqueness problem? Usually, we impose additional constraints where

\begin{displaymath}
\data{Q}^{\top} \Psi^{-1} \data{Q} \quad \quad \textrm{is diagonal}
\end{displaymath} (10.11)

or
\begin{displaymath}
\data{Q} ^{\top}\data{D}^{-1}\data{Q} \quad \quad \ \textrm{is diagonal.}
\end{displaymath} (10.12)
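Checking constraint (10.11) for a candidate pair $(\data{Q},\Psi)$ amounts to testing whether a $k\times k$ matrix is diagonal. For the arbitrary values used in the earlier sketches the constraint does not hold, so this prints False:

\begin{verbatim}
import numpy as np

Q = np.array([[0.9, 0.1],
              [0.7, 0.5],
              [0.2, 0.8]])
Psi_inv = np.diag(1.0 / np.array([0.18, 0.26, 0.32]))

M = Q.T @ Psi_inv @ Q
print(np.allclose(M, np.diag(np.diag(M))))   # True only if (10.11) holds
\end{verbatim}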

How many parameters does the model (10.7) have without constraints?

\begin{eqnarray*}
\data{Q}\ (p \times k) \quad &\textrm{has} & \quad p \cdot k \quad \textrm{parameters,}\\
\Psi\ (p \times p) \quad &\textrm{has} & \quad p \quad \textrm{parameters.}
\end{eqnarray*}



Hence we have to determine $pk+p$ parameters! Condition (10.11), respectively (10.12), introduces $\frac{1}{2} \{ k(k-1) \}$ constraints, since we require the matrix to be diagonal. Therefore, the degrees of freedom of a model with $k$ factors is:

\begin{eqnarray*}
d &=& (\textrm{\# parameters for } \Sigma \textrm{ unconstrained})
   - (\textrm{\# parameters for } \Sigma \textrm{ constrained})\\
  &=& {\tst \frac{1}{2}}p(p+1) - \left(pk + p - {\tst \frac{1}{2}}k(k-1)\right)\\
  &=& {\tst \frac{1}{2}}(p-k)^2-{\tst \frac{1}{2}}(p+k).
\end{eqnarray*}



If $d<0$, then the model is undetermined: there are infinitely many solutions to (10.7). This means that the number of parameters of the factor model is larger than the number of parameters of the original model, i.e., the number of factors $k$ is ``too large'' relative to $p$. If $d=0$, there is a unique solution to the problem (except for rotation). In practice we usually have $d>0$: there are more equations than parameters, so an exact solution does not exist. In this case approximate solutions are used; $\data{QQ}^{\top} + \Psi$ is then an approximation of $\Sigma$. This last case is the most interesting, since the factor model has fewer parameters than the original one. Estimation methods are introduced in the next section.

Evaluating the degrees of freedom, $d$, is particularly important, because it already gives an upper bound on the number of factors we can hope to identify in a factor model. For instance, if $p=4$, we cannot identify a factor model with 2 factors (this results in $d=-1$, which implies infinitely many solutions). With $p=4$, only a one-factor model gives an approximate solution ($d=2$). When $p=6$, models with 1 and 2 factors provide approximate solutions, and a model with 3 factors results in a unique solution (up to rotations) since $d=0$. A model with 4 or more factors would not be allowed, but of course the aim of factor analysis is to find suitable models with a small number of factors, i.e., much smaller than $p$. The next two examples give more insight into the notion of degrees of freedom.
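Before turning to the examples, the degrees-of-freedom formula is easy to tabulate; this short sketch reproduces the values quoted above for $p=4$ and $p=6$.

\begin{verbatim}
def degrees_of_freedom(p, k):
    """d = 1/2 (p - k)^2 - 1/2 (p + k), cf. the derivation above."""
    return 0.5 * (p - k) ** 2 - 0.5 * (p + k)

for p in (4, 6):
    for k in (1, 2, 3):
        print(p, k, degrees_of_freedom(p, k))
# p = 4: k = 1 gives d = 2,  k = 2 gives d = -1
# p = 6: k = 1 gives d = 9,  k = 2 gives d = 4,  k = 3 gives d = 0
\end{verbatim}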

EXAMPLE 10.1   Let $p=3$ and $k=1$, then $d=0$ and

\begin{displaymath}\Sigma = \left(
\begin{array}{lll}
\sigma_{11} &\sigma_{12}& \sigma_{13}\\
\sigma_{12} &\sigma_{22}& \sigma_{23}\\
\sigma_{13} &\sigma_{23}& \sigma_{33}
\end{array} \right)
= \left(
\begin{array}{lll}
q_1^2 + \psi_{11} & q_1q_2 & q_1q_3 \\
q_1q_2 & q_2^2 + \psi_{22} & q_2q_3 \\
q_1q_3 & q_2q_3 & q_3^2 + \psi_{33}
\end{array} \right) \end{displaymath}

with $ \data{Q} =
\left(
\begin{array}{c}
q_1 \\ q_2 \\ q_3
\end{array}
\right)$ and $ \Psi = \left(
\begin{array}{ccc}
\psi_{11} &0& 0\\
0 & \psi_{22} &0 \\
0 & 0 & \psi_{33}
\end{array} \right) $. Note that here the constraint (10.11) is automatically verified since $k=1$. We have

\begin{displaymath}q_1^2 = \frac{\sigma_{12}\sigma_{13}}{\sigma_{23}} ; \;
q_2^2 = \frac{\sigma_{12}\sigma_{23}}{\sigma_{13}}; \;
q_3^2 = \frac{\sigma_{13}\sigma_{23}}{\sigma_{12}} \end{displaymath}

and

\begin{displaymath}\psi_{11} = \sigma_{11} - q_1^2; \; \psi_{22} = \sigma_{22} - q_2^2; \;
\psi_{33} = \sigma_{33} - q_3^2. \end{displaymath}

In this particular case ($k=1$), the only rotation is defined by $\data{G}=-1$, so the other solution for the loadings is provided by $-\data{Q}$.
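The closed-form expressions of Example 10.1 can be checked numerically: starting from chosen $q_j$ and $\psi_{jj}$ (arbitrary values giving a proper solution), build $\Sigma$ and then recover the parameters from its entries.

\begin{verbatim}
import numpy as np

q = np.array([0.9, 0.7, 0.5])            # arbitrary loadings (k = 1)
psi = np.array([0.19, 0.51, 0.75])       # arbitrary specific variances
Sigma = np.outer(q, q) + np.diag(psi)
s = lambda i, j: Sigma[i - 1, j - 1]     # 1-based access to sigma_{ij}

q1 = np.sqrt(s(1, 2) * s(1, 3) / s(2, 3))
q2 = np.sqrt(s(1, 2) * s(2, 3) / s(1, 3))
q3 = np.sqrt(s(1, 3) * s(2, 3) / s(1, 2))
psi_hat = np.diag(Sigma) - np.array([q1, q2, q3]) ** 2

print(np.allclose([q1, q2, q3], q), np.allclose(psi_hat, psi))   # True True
\end{verbatim}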

EXAMPLE 10.2   Suppose now $p=2$ and $k=1$, then $d<0$ and

\begin{displaymath}\Sigma = \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right)
= \left( \begin{array}{cc} q_1^2+\psi_{11} & q_1q_2 \\
q_1q_2 & q_2^2+\psi_{22} \end{array} \right). \end{displaymath}

We have infinitely many solutions: for any $\alpha$ $(\rho < \alpha < 1)$, a solution is provided by

\begin{displaymath}
q_1 = \alpha; \; q_2 = \rho/\alpha; \; \psi_{11} = 1-\alpha^2; \; \psi_{22} =
1-(\rho/\alpha)^2.
\end{displaymath}
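The indeterminacy in Example 10.2 can be seen by plugging in two different values of $\alpha$; both reproduce the same $\Sigma$. Here $\rho=0.5$ is an arbitrary choice.

\begin{verbatim}
import numpy as np

rho = 0.5                                  # arbitrary correlation
Sigma = np.array([[1.0, rho], [rho, 1.0]])

for alpha in (0.7, 0.9):                   # any alpha with rho < alpha < 1 works
    q1, q2 = alpha, rho / alpha
    psi11, psi22 = 1 - q1**2, 1 - q2**2
    Q, Psi = np.array([[q1], [q2]]), np.diag([psi11, psi22])
    print(np.allclose(Q @ Q.T + Psi, Sigma))   # True for every such alpha
\end{verbatim}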

The solution in Example 10.1 is unique (up to a rotation, here only a sign change), but it need not be proper in the sense of being statistically interpretable: Exercise 10.5 gives an example where the specific variance $\psi_{11}$ is negative.


Even in the case of a unique solution $(d=0)$, the solution may be inconsistent with statistical interpretations.

Summary
$\ast$
The factor analysis model aims to describe how the original $p$ variables in a data set depend on a small number of latent factors $k<p$, i.e., it assumes that $X=\data{Q}F+ U +\mu$. The ($k$-dimensional) random vector $F$ contains the common factors, the ($p$-dimensional) $U$ contains the specific factors and $\data{Q}(p\times k)$ contains the factor loadings.
$\ast$
It is assumed that $F$ and $U$ are uncorrelated and have zero means, i.e., $F\sim(0,\data{I})$, $U\sim(0,\Psi)$ where $\Psi$ is a diagonal matrix, and $\Cov(F,U)=0$.
This leads to the covariance structure $\Sigma=\data{QQ}^{\top}+\Psi$.
$\ast$
The interpretation of the factor $F$ is obtained through the correlation $P_{XF} =
D^{-1/2} \data{Q}$.
$\ast$
A normalized analysis is obtained from the model $P = \data{QQ}^{\top} + \Psi$. The interpretation of the factors is then given directly by the loadings $\data{Q}: \;
P_{XF} = \data{Q}$.
$\ast$
The factor analysis model is scale invariant. The loadings, however, are not unique; they are determined only up to multiplication by an orthogonal matrix.
$\ast$
Whether a model has a unique solution or not is determined by the degrees of freedom $ d= 1/2\, (p-k)^2 - 1/2\, (p+k)$.