8.1 Linear Regression Modeling

Let us first study the linear regression model $Y = \boldsymbol{X}^{\top} \boldsymbol{\beta_0} + \varepsilon$ assuming $E(\varepsilon\vert \boldsymbol{X}) = 0$ . Unless said otherwise, we consider here only one dependent variable

. The unknown vector $\boldsymbol{\beta}^0 = (\beta_1^0,\ldots,\beta_p^0)$ is to be estimated given observations $\boldsymbol{y} = ( y_{1},\ldots, y_{n} ) \in \mathbb{R}^n$ and $\{ \boldsymbol{x}_{i} \}_{i=1}^n = \{ (x_{1i},\ldots, x_{pi}) \}_{i=1}^n$ of random variables

and

; let us denote $\boldsymbol{X} = (\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n})^{\top} \in \mathbb{R}^{n \times p}$ and let $\boldsymbol{X}_{\cdot k}$ be the

th column of $\boldsymbol {X}$ . Thus, the linear regression model can be written in terms of observations as

Section 8.1.1 summarizes how to estimate the model (8.1) by the method of least squares. Later, we specify what ill-conditioning and multicollinearity are in Sect. 8.1.2 and discuss methods dealing with it in Sects. 8.1.3-8.1.9.

8.1.1 Fitting of Linear Regression

Let us first review the least squares estimation and its main properties to facilitate easier understanding of the fitting procedures discussed further. For a detailed overview of linear regression modeling see [75].

The least squares (LS) approach to the estimation of (8.1) searches an estimate $\widehat{\boldsymbol{\beta}}$ of unknown parameters $\boldsymbol{\beta}_{0}$ by minimizing the sum of squared differences between the observed values

and the predicted ones $\widehat{y}_i(\widehat{\boldsymbol{\beta}}) = \boldsymbol{x}_{i}^{\top} \widehat{\boldsymbol{\beta}}$ .

Definition 1 The least squares estimate of linear regression model (8.1) is defined by

$\displaystyle \widehat{\boldsymbol{\beta}}^{\text{LS}} = \mathop{\text{argmin}}... ...um_{i=1}^n \left(y_i - \boldsymbol{x}_{i}^{\top} \boldsymbol{\beta}\right)^2\;.$

(8.2)

Theorem 1 (Gauss-Markov)

Assumethat $E(\boldsymbol{\varepsilon} \vert \boldsymbol{X}) = 0$ , $E(\boldsymbol{\varepsilon}^2\vert\boldsymbol{X}) = \sigma^2 \boldsymbol{I}_{n}$ , and $\boldsymbol{X}^{\top} \boldsymbol{X}$ is non-singular. Let $\widehat{\boldsymbol{\beta}}=\boldsymbol{C}^{\top}\boldsymbol{y}$ , where $\boldsymbol{C}$ is a $t \times p$ matrix orthogonal to $\boldsymbol {X}$ , $\boldsymbol{C}^{\top}\boldsymbol{X}=\boldsymbol{I}$ . Then ${\text{Var}}(\widehat{\boldsymbol{\beta}}) - {\text{Var}}(\widehat{\boldsymbol{\beta}}^{\text{LS}}) > 0$ is a positive definite matrix for any $\widehat{\boldsymbol{\beta}} \not= \widehat{\boldsymbol{\beta}}^{\text{LS}}$ .

Finally, the LS estimate actually coincides with the maximum likelihood estimates provided that random errors $\varepsilon$ are normally distributed (in addition to the assumptions of Theorem 1) and shares then the asymptotic properties of the maximum likelihood estimation (see [4]).

8.1.1.1 Computing LS Estimates

The LS estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ can be and often is found by directly solving the system of linear equations (8.3) or evaluating formula (8.4), which involves a matrix inversion. Both direct and iterative methods for solving systems of linear equations are presented in Chap. II.4. Although this straightforward computation may work well for many regression problems, it often leads to an unnecessary loss of precision, see [68]. Additionally, it is not very suitable if the matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ is ill-conditioned (a regression problem is called ill-conditioned if a small change in data causes large changes in estimates) or nearly singular (multicollinearity) because it is not numerically stable. Being concerned mainly about statistical consequences of multicollinearity, the numerical issues regarding the identification and treatment of ill-conditioned regression models are beyond the scope of this contribution. Let us refer an interested reader [6,13,68,92] and [99].

Let us now briefly review a class of numerically more stable algorithms for the LS minimization. They are based on orthogonal transformations. Assuming a matrix $\boldsymbol{Q} \in \mathbb{R}^{n\times n}$ is an orthonormal matrix, $\boldsymbol{Q}^{\top} \boldsymbol{Q} = \boldsymbol{Q} \boldsymbol{Q}^{\top} = \boldsymbol{I}_{n}$ ,

8.1.1.2 LS Inference

Linear regression modeling does not naturally consist only of obtaining a point estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ . One needs to measure the variance of the estimates in order to construct confidence intervals or test hypotheses. Additionally, one should assess the quality of the regression fit. Most such measures are based on regression residuals $\boldsymbol{e} = \boldsymbol{y} - \boldsymbol{X} \widehat{\boldsymbol{\beta}}$ . We briefly review the most important regression statistics, and next, indicate how it is possible to compute them if the LS regression is estimated by means of some orthogonalization procedure described in the previous paragraph.

The most important measures used in statistics to assess model fit and inference are the total sum of squares $TSS = (\boldsymbol{y}-\bar{y})^{\top}(\boldsymbol{y}-\bar{y}) = \sum_{i=1}^n (y_i - \bar{y})^2$ , where $\bar{y} = \sum_{i=1}^n y_i/n$ , the residual sum of squares $\mathrm{RSS} = \boldsymbol{e}^{\top}\boldsymbol{e} = \sum_{i=1}^n e_i^2$ , and the complementary regression sum of squares $RegSS = (\boldsymbol{y}-\widehat{\boldsymbol{y}})^{\top}(\boldsymbol{y}-\widehat{\boldsymbol{y}}) = \sum_{i=1}^n (y_i - \widehat{y}_i)^2 = TSS - \mathrm{RSS}$ . Using these quantities, the regression fit can be evaluated; for example, the coefficient of determination $R^2 = 1 - \mathrm{RSS}/TSS$ as well as many information criteria (modified $\bar{R}^2$ , Mallows and Akaike criteria, etc.; see Sect. 8.1.3). Additionally, they can be used to compute the variance of the estimates in simple cases. The variance of the estimates can be estimated by

Let us now describe how one computes these quantities if a numerically stable procedure based on the orthonormalization of normal equations is used. Let us assume we already constructed a QR decomposition of $\boldsymbol{X} = \boldsymbol{Q_x} \boldsymbol{R_x}$ . Thus, $\boldsymbol{Q_x}\boldsymbol{Q}_{\boldsymbol{x}}^{\top} = \boldsymbol{I}$ and $\boldsymbol{Q}_{\boldsymbol{x}}^{\top} \boldsymbol{X} = \boldsymbol{R_x}$ . $\mathrm{RSS}$ can be computed as

$\displaystyle \mathrm{RSS}$	$\displaystyle = \boldsymbol{e}^{\top}\boldsymbol{e} = \left(\boldsymbol{y} - \b... ...\top} \left(\boldsymbol{y} - \boldsymbol{X} \widehat{\boldsymbol{\beta}}\right)$
	$\displaystyle = \left(\boldsymbol{Q}_{\boldsymbol{x}}^{\top} \boldsymbol{y} - \... ...bol{y} - \boldsymbol{R_x} \boldsymbol{X} \widehat{\boldsymbol{\beta}}\right)\;.$

8.1.2 Multicollinearity

Let us assume that the design matrix $\boldsymbol {X}$ fixed. We talk about multicollinearity when there is a linear dependence among the variables in regression, that is, the columns of $\boldsymbol {X}$ .

The exact multicollinearity (also referred to as reduced-rank data) is relatively rare in linear regression models unless the number of explanatory variables is very large or even larger than the number of observations, $p\geq n$ . This happens often in agriculture, chemometrics, sociology, and so on. For example, [68] uses data on the absorbances of infra-red rays at many different wavelength by chopped meat, whereby the aim is to determine the moisture, fat, and protein content of the meat as a function of these absorbances. The study employs measurements at

wavelengths from $850\,$ nm to $1050\,$ nm, which gives rise to many possibly correlated variables.

When the number

of variables is small compared to the sample size

, near multicollinearity is more likely to occur: there are some real constants $a_1,\ldots,a_p$ such that $\sum_{k=1}^p \vert a_k\vert > 0$ and $\sum_{k=1}^p a_k \boldsymbol{x}_{\cdot k} \approx 0$ , where $\approx$ denotes approximate equality. The multicollinearity in data does not have to arise only as a result of highly correlated variables (e.g., more measurements of the same characteristic by different sensors or methods), which by definition occurs in all applications where there are more variables than observations, but it could also result from the lack of information and variability in data.

Whereas the exact multicollinearity implies that $\boldsymbol{X}^{\top} \boldsymbol{X}$ is singular and the LS estimator is not identified, the near multicollinearity permits non-singular matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ . The eigenvalues $\lambda_1 \leq \ldots \leq \lambda_p$ of matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ can give some indication concerning multicollinearity: if the smallest eigenvalue $\lambda_1$ equals zero, the matrix is singular and data are exactly multicollinear; if $\lambda_1$ is close to zero, near multicollinearity is present in data. Since measures based on eigenvalues depend on the parametrization of the model, they are not necessarily optimal and it is often easier to detect multicollinearity by looking at LS estimates and their behavior as discussed in next paragraph. See [13] and [59] for more details on detection and treatment of ill-conditioned problems. (Nearly singular matrices are dealt with also in numerical mathematics. To measure near singularity, numerical mathematics uses conditioning numbers $d_k = \sqrt{\lambda_k / \lambda_1}$ , which converge to infinity for singular matrices, that is, as $\lambda_1 \to 0$ . Matrices with very large conditioning numbers are called ill-conditioned.)

The multicollinearity has important implications for LS. In the case of exact multicollinearity, matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ does not have a full rank, hence the solution of the normal equations is not unique and the LS estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ is not identified. One has to introduce additional restrictions to identify the LS estimate. On the other hand, even though near multicollinearity does not prevent the identification of LS, it negatively influences estimation results. Since both the estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ and its variance are proportional to the inverse of $\boldsymbol{X}^{\top} \boldsymbol{X}$ , which is nearly singular under multicollinearity, near multicollinearity inflates $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ , which may become unrealistically large, and variance ${\text{Var}}(\widehat{\boldsymbol{\beta}}^{\text{LS}})$ . Consequently, the corresponding

-statistics are typically very low. Moreover, due to the large values of $(\boldsymbol{X}^{\top} \boldsymbol{X})^{-1}$ , the least squares estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}$ reacts very sensitively to small changes in data. See [42] and [69] for a more detailed treatment and real-data examples of the effects of multicollinearity.

There are several strategies to limit adverse consequences of multicollinearity provided that one cannot improve the design of a model or experiment or get better data. First, one can impose an additional structure on the model. This strategy cannot be discussed in details since it is model specific, and in principle, it requires only to test a hypothesis concerning additional restrictions. Second, it is possible to reduce the dimension of the space spanned by $\boldsymbol {X}$ , for example, by excluding some variables from the regression (Sects. 8.1.3 and 8.1.4). Third, one can also leave the class of unbiased estimators and try to find a biased estimator with a smaller variance and mean squared error. Assuming we want to judge the performance of an estimator $\widehat{\boldsymbol{\beta}}$ by its mean squared error ( $\mathrm{MSE}$ ), the motivation follows from

$\displaystyle \mathrm{MSE}\left(\widehat{\boldsymbol{\beta}}\right)$	$\displaystyle = E\left[ \left(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}... ...eft(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{0}\right)^{\top} \right]$
	$\displaystyle = E\left[ \left\{\widehat{\boldsymbol{\beta}} - E\left(\widehat{\... ...hat{\boldsymbol{\beta}}\right) - \boldsymbol{\beta}^{0} \right\} \right]^{\top}$
	$\displaystyle = {\text{Var}}\left(\widehat{\boldsymbol{\beta}}\right) + \text{B... ...l{\beta}}\right) \text{Bias}\left(\widehat{\boldsymbol{\beta}}\right)^{\top}\;.$

8.1.3 Variable Selection

The presence of multicollinearity may indicate that some explanatory variables are linear combinations of the other ones (note that this is more often a ''feature'' of data rather than of the model). Consequently, they do not improve explanatory power of a model and could be dropped from the model provided there is some justification for dropping them also on the model level rather than just dropping them to fix data problems. As a result of removing some variables, the matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ would not be (nearly) singular anymore.

Eliminating variables from a model is a special case of model selection procedures, which are discussed in details in Chap. III.1. Here we first discuss methods specific for variable selection within a single regression model, mainly variants of stepwise regression. Later, we deal with more general model selection methods, such as cross validation, that are useful both in the context of variable selection and of biased estimation discussed in Sects. 8.1.5-8.1.9. An overview and comparison of many classical variable selection is given, for example, in [67,68] and [69]. For discussion of computational issues related to model selection, see [54] and [68].

8.1.3.1 Backward Elimination

A simple and often used method to eliminate non-significant variables from regression is backward elimination, a special case of stepwise regression. Backward elimination starts from the full model $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} +\boldsymbol{\varepsilon}$ and identifies a variable $\boldsymbol{x}_{\cdot k}$ such that

Before discussing properties of backward elimination, let us make several notes on information criteria used for the elimination and their optimality. There is a wide range of selection criteria, including classical $\mathrm{AIC}$ , $\mathrm{SIC}$ , $\mathrm{FPE}_{\lambda}$ by [86], cross validation by [89], and so on. Despite one can consider the same measure of the optimality of variable selection, such as the sum of squared prediction errors [85], one can often see contradictory results concerning the selection criteria (see [60] and [81]; or [85] and [76]). This is caused by different underlying assumptions about the true model [82]. Some criteria, such as $\mathrm{AIC}$ and cross validation, are optimal if one assumes that there is no finite-dimensional true model (i.e., the number of variables increases with the sample size); see [85] and [60]. On the other hand, some criteria, such as SIC, are consistent if one assumes that there is a true model with a finite number of variables; see [76] and [82]. Finally, note that even though some criteria, being optimal in the same sense, are asymptotically equivalent, their finite sample properties can differ substantially. See Chap. III.1 for more details.

Let us now return back to backward elimination, which can be also viewed as a pre-test estimator [52]. Although it is often used in practice, it involves largely arbitrary choice of the significance level. In addition, it has rather poor statistical properties caused primarily by discontinuity of the selection decision, see [62]. Moreover, even if a stepwise procedure is employed, one should take care of reporting correct variances and confidence intervals valid for the whole decision sequence. Inference for the finally selected model as if it were the only model considered leads to significant biases, see [22], [102] and [109]. Backward elimination also does not perform well in the presence of multicollinearity and it cannot be used if

. Finally, let us note that a nearly optimal and admissible alternative is proposed in [63].

8.1.3.2 Forward Selection

Backward elimination cannot be applied if there are more variables than observations, and additionally, it may be very computationally expensive if there are many variables. A classical alternative is forward selection, where one starts from an intercept-only model and adds one after another variables that provide the largest decrease of $\mathrm{RSS}$ . Adding stops when the

-statistics

Note that most disadvantages of backward elimination apply to forward selection as well. In particular, correct variances and confidence intervals should be reported, see [68] for their approximations. Moreover, forward selection can be overly aggressive in selection in the respect that if a variable $\boldsymbol {x}$ is already included in a model, forward selection primarily adds variables orthogonal to $\boldsymbol {x}$ , thus ignoring possibly useful variables that are correlated with $\boldsymbol {x}$ . To improve upon this, [27] proposed least angle regression, considering correlations of to-be-added variables jointly with respect to all variables already included in the model.

8.1.3.3 All-Subsets Regression

Neither forward selection, nor backward elimination guarantee the optimality of the selected submodel, even when both methods lead to the same results. This can happen especially when a pair of variables has jointly high predictive power; for example, if the dependent variable $\boldsymbol {y}$ depends on the difference of two variables $\boldsymbol{x}_{1}-\boldsymbol{x}_{2}$ . An alternative approach, which is aiming at optimality of the selected subset of variables - all-subsets regression - is based on forming a model for each subset of explanatory variables. Each model is estimated and a selected prediction or information criterion, which quantifies the unexplained variation of the dependent variable and the parsimony of the model, is evaluated. Finally, the model attaining the best value of a criterion is selected and variables missing in this model are omitted.

This approach deserves several comments. First, one can use many other criteria instead of $\mathrm{AIC}$ or $\mathrm{SIC}$ . These could be based on the test statistics of a joint hypothesis that a group of variables has zero coefficients, extensions or modifications of $\mathrm{AIC}$ or $\mathrm{SIC}$ , general Bayesian predictive criteria, criteria using non-sample information, model selection based on estimated parameter values at each subsample and so on. See the next subsection, [9], [45], [48], [47], [82], [83], [110], for instance, and Chap. III.1 for a more detailed overview.

Second, the evaluation and estimation of all submodels of a given regression model can be very computationally intensive, especially if the number of variables is large. This motivated tree-like algorithms searching through all submodels, but once they reject a submodel, they automatically reject all models containing only a subset of variables of the rejected submodel, see [26]. These so-called branch-and-bound techniques are discussed in [68], for instance.

An alternative computational approach, which is increasingly used in applications where the number of explanatory variables is very large, is based on the genetic programming (genetic algorithm, GA) approach, see [100]. Similarly to branch-and-bound methods, GAs perform an non-exhaustive search through the space of all submodels. The procedure works as follows. First, each submodel which is represented by a ''chromosome'' - a $p \times 1$ vector $\boldsymbol{m}_j = \{I_1^j,\ldots,I_p^j\} \in \{ 0,1 \}^p$ of indicators, where

indicates whether the

th variable is included in the submodel defined by $\boldsymbol{m}_j$ . Next, to find the best submodel, one starts with an (initially randomly selected) population $\mathcal{P} = \{\boldsymbol{m}_j\}_{j=1}^J$ of submodels that are compared with each other in terms of information or prediction criteria. Further, this population $\mathcal{P}$ is iteratively modified: in each step, pairs of submodels $\boldsymbol{m_j}, \boldsymbol{m}_{j'} \in \mathcal{P}$ combine their characteristics (chromosomes) to create their offsprings $\boldsymbol{m}_j^{\ast}$ . This process can have many different forms such as $\boldsymbol{m_j^{\ast}} = (\boldsymbol{m}_j + \boldsymbol{m}_{j'} + \boldsymbol{r}_m) \mathrm{~mod~} 1$ or $\boldsymbol{m}_j^{\ast} = (1,\ldots,1,0,\ldots,0)^{\top} \boldsymbol{m}_j + (0,\ldots,0,1,\ldots,1)^{\top} \boldsymbol{m}_{j'} + \boldsymbol{r}_m$ , where $\boldsymbol{r}_m$ is a possibly non-zero random mutation. Whenever an offspring $\boldsymbol{m}_j^{\ast}$ performs better than its ''parent'' models $\boldsymbol{m}_j$ , $\boldsymbol{m}_j^{\ast}$ replaces $\boldsymbol{m}_j$ in population $\mathcal{P}$ . Performing this action for all $j=1,\ldots,J$ creates a new population. By repeating this population-renewal, GAs search through the space of all available submodels and keep only the best ones in the population $\mathcal{P}$ . Thus, GAs provide a rather effective way of obtaining the best submodel, especially when the number of explanatory variables is very high, since the search is not exhaustive. See Chap. II.6 and [18] for a more detailed introduction to genetic programming.

8.1.3.4 Cross Validation

Cross validation ( $\mathrm{CV}$ ) is a general model-selection principle, proposed already in [89], which chooses a specific model in a similar way as the prediction criteria. $\mathrm{CV}$ compares models, which can include all variables or exclude some, based on their out-of-sample performance, which is measured typically by $\mathrm{MSE}$ . To achieve this, a sample is split to two disjunct parts: one part is used for estimation and the other part serves for checking the fit of the estimated model on ''new'' data (i.e., data which were not used for estimation) by comparing the observed and predicted values.

Probably the most popular variant is the leave-one-out cross-validation (LOU $\mathrm{CV}$ ), which can be used not only for model selection, but also for choosing nuisance parameters (e.g., in nonparametric regression; see [37]). Assume we have a set of models $\boldsymbol{y} = h_k(\boldsymbol{X}, \boldsymbol{\beta}) + \boldsymbol{\varepsilon}$ defined by regression functions $h_k, k = 1,\ldots,M$ , that determine variables included or excluded from regression. For model given by

, LOU $\mathrm{CV}$ evaluates

Unfortunately, LOU $\mathrm{CV}$ is not consistent as far as the linear model selection is concerned. To make $\mathrm{CV}$ a consistent model selection method, it is necessary to omit

observations from the sample used for estimation, where $\lim_{n\to\infty} n_v/n = 1$ . This fundamental result derived in [81] places a heavy computational burden on the $\mathrm{CV}$ model selection. Since our main use of $\mathrm{CV}$ in this chapter concerns nuisance parameter selection, we do not discuss this type of $\mathrm{CV}$ any further. See [68] and Chap. III.1 for further details.

Example 1 We compare several mentioned variable selection methods using a classical data set on air pollution used originally by [66], who modeled mortality depending on explanatory variables ranging from climate and air pollution to socioeconomic characteristics and who additionally demonstrated instabilities of LS estimates using this data set. We refer to the explanatory variables of data Pollution simply by numbers to .

We applied the forward, backward, and all-subset selection procedures to this data set. The results reported in Table 8.1 demonstrate that although all three methods could lead to the same subset of variables (e.g., if we search a model consisting of two or three variables), this is not the case in general. For example, searching for a subset of four variables, the variables selected by backward and forward selection differ, and in both cases, the selected model is suboptimal (compared to all-subsets regression) in the sense of the unexplained variance measured by $\mathrm{RSS}$ .

8.1.4 Principle Components Regression

In some situations, it is not feasible to use variable selection to reduce the number of explanatory variables or it is not desirable to do so. The first case can occur if the number of explanatory variables is large compared to the number of observations. The latter case is typical in situations when we observe many characteristics of the same type, for example, temperature or electro-impulse measurements from different sensors on a human body. They could be possibly correlated with each other and there is no a priori reason why measurements at some points of a skull, for instance, should be significant while other ones would not be important at all. Since such data typically exhibit (exact) multicollinearity and we do not want to exclude some or even majority of variables, we have to reduce the dimension of the data in another way.

A general method that can be used both under near and exact multicollinearity is based on the principle components analysis (PCA), see Chap. III.6. Its aim is to reduce the dimension of explanatory variables by finding a small number of linear combinations of explanatory variables $\boldsymbol {X}$ that capture most of the variation in $\boldsymbol {X}$ and to use these linear combinations as new explanatory variables instead the original one. Suppose that $\boldsymbol{G}$ is an orthonormal matrix that diagonalizes matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ : $\boldsymbol{G}^{\top}\boldsymbol{G} = \boldsymbol{I}$ , $\boldsymbol{X}^{\top}\boldsymbol{X} = \boldsymbol{G} \boldsymbol{\mathit{\Lambda}} \boldsymbol{G}^{\top}$ , and $\boldsymbol{G}^{\top} \boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{G} = \boldsymbol{\mathit{\Lambda}}$ , where $\boldsymbol{\mathit{\Lambda}} =$ diag $(\lambda_1,\ldots,\lambda_p)$ is a diagonal matrix of eigenvalues of $\boldsymbol{X}^{\top} \boldsymbol{X}$ .

Definition 3 Assume without loss of generality that $\lambda_1 \geq \ldots \geq \lambda_p$ and $\boldsymbol{g}_1,\ldots,\boldsymbol{g}_p$ are the corresponding eigenvectors (columns of matrix $\boldsymbol{G}$ ). Vector $\boldsymbol{z_i} = \boldsymbol{X}\boldsymbol{g_i}$ for $i = 1,\ldots,p$ such that $\lambda_i > 0$ is called the th principle component (PC) of $\boldsymbol {X}$ and $\boldsymbol{g}_i$ represents the corresponding loadings.

PCA tries to approximate the original matrix $\boldsymbol {X}$ by projecting it into the lower-dimensional space spanned by the first

eigenvectors $\boldsymbol{g}_1,\ldots, \boldsymbol{g}_k$ . It can be shown that these projections capture most of the variability in $\boldsymbol {X}$ among all linear combinations of columns of $\boldsymbol {X}$ , see [38].

Theorem 2 There is no standardized linear combination $\boldsymbol{X}\boldsymbol{a}$ , where $\Vert a\Vert=1$ , that has strictly larger variance than $\boldsymbol{z}_1 = \boldsymbol{X}\boldsymbol{g}_1$ : ${\text{Var}}(\boldsymbol{X}\boldsymbol{a}) \leq {\text{Var}}(\boldsymbol{z}_1) = \lambda_1$ . Additionally, the variance of the linear combination $\boldsymbol{z} = \boldsymbol{X}\boldsymbol{a}, \Vert a\Vert=1$ , that is uncorrelated with the first principle components $\boldsymbol{z}_1,\ldots,\boldsymbol{z}_k$ is maximized by the -st principle component $\boldsymbol{z} = \boldsymbol{z}_{k+1}$ and $\boldsymbol{a} = \boldsymbol{g}_{k+1}$ , $k = 1,\ldots,p-1$ .

Consequently, one chooses a number

of PCs that capture a sufficient amount of data variability. This can be done by looking at the ratio $L_k = \sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i$ , which quantifies the fraction of the variance captured by the first

PCs compared to the total variance of $\boldsymbol {X}$ .

In the regression context, the chosen PCs are used as new explanatory variables, and consequently, PCs with small eigenvalues can be important too. Therefore, one can alternatively choose the PCs that exhibit highest correlation with the dependent variable $\boldsymbol {y}$ because the aim is to use the selected PCs for regressing the dependent variable $\boldsymbol {y}$ on them, see [49]. Moreover, for selecting ''explanatory'' PCs, it is also possible to use any variable selection method discussed in Sect. 8.1.3. Recently, [46] proposed a new data-driven PC selection for $\mathrm{PCR}$ obtained by minimizing $\mathrm{MSE}$ .

Next, let us assume we selected a small number

of PCs $\boldsymbol{Z}_k = (\boldsymbol{z}_1,\ldots,\boldsymbol{z}_k)^{\top}$ by some rule such that matrix $\boldsymbol{Z}_k^{\top}\boldsymbol{Z}_k$ has a full rank, $k \leq p$ . Then the principle components regression ( $\mathrm{PCR}$ ) is performed by regressing the dependent variable $\boldsymbol {y}$ on the selected PCs $\boldsymbol{Z}_k$ , which have a (much) smaller dimension than original data $\boldsymbol {X}$ , and consequently, multicollinearity is diminished or eliminated, see [36]. We estimate this new model by LS,

Finally, concerning different PC selection criteria, [7] demonstrates the superiority of the correlation-based $\mathrm{PCR}$ ( $\mathrm{CPCR}$ ) and convergence of many model-selection procedures toward the $\mathrm{CPCR}$ results. See also [24] for a similar comparison of $\mathrm{CPRC}$ and $\mathrm{PCR}$ based on GA variable selection.

**Figure 8.1:** Fraction of the explained variance of $\boldsymbol {X}$ (*dashed line*) and $\boldsymbol {y}$ (*solid line*) by the first PCs
$\includegraphics[width=7.3cm]{text/3-8/pcrplot.eps}$

Example 2 Let us use data Pollution to demonstrate several important issues concerning $\mathrm{PCR}$ . First, we identify PCs of the data. The fraction of variance explained by the first PCs as a function of is depicted on Fig. 8.1 (dashed line). On the one side, almost all of the $\boldsymbol {X}$ variance is captured by the first PC. On the other side, the percentage of the $\boldsymbol {y}$ variance explained by the first PCs (solid line) grows and reaches its maximum relatively slowly. Thus, the inclusion of about PCs seems to be necessary when using this strategy.

On the other hand, using some variable selection method or checking the correlation of PCs with the dependent variable $\boldsymbol {y}$ reveals that PCs , , , , exhibit highest correlations with $\boldsymbol {y}$ (higher than ), and naturally, a model using these PCs has more explanatory power ( $\bar{R}^2 = {0.70}$ ) than for example the first PCs together ( $\bar{R}^2={0.65}$ ). Thus, considering not only PCs that capture most of the $\boldsymbol {X}$ variability, but also those having large correlations with the dependent variable enables building more parsimonious models.

8.1.5 Shrinkage Estimators

We argued in Sect. 8.1.2 that an alternative way of dealing with unpleasant consequences of multicollinearity lies in biased estimation: we can sacrifice a small bias for a significant reduction in variance of an estimator so that its $\mathrm{MSE}$ decreases. Since it holds for an estimator

and a real constant $c \in \mathbb{R}$ that ${\mathrm{Var}}(c\widehat{\boldsymbol{\beta}}) = c^2 {\mathrm{Var}}(\widehat{\boldsymbol{\beta}})$ , a bias of the estimator $\widehat{\boldsymbol{\beta}}$ towards zero, $\vert c\vert<1$ , naturally leads to a reduction in variance. This observation motivates a whole class of biased estimators - shrinkage estimators - that are biased towards zero in all or just some of their components. In other words, they ''shrink'' the Euclidean norm of estimates compared to that of the corresponding unbiased estimate. This is perhaps easiest to observe on the example of the Stein-rule estimator, which can be expressed in linear regression model (8.1) as

In the following subsections, we discuss various shrinkage estimators that perform well under multicollinearity and that can possibly act as variable selection tools as well: the ridge regression estimator and its modifications (Sect. 8.1.6), continuum regression (Sect. 8.1.7), the Lasso estimator and its variants (Sect. 8.1.8), and partial least squares (Sect. 8.1.9). Let us note that there are also other shrinkage estimators, which either do not perform well under various forms of multicollinearity (e.g., Stein-rule estimator) or are discussed in other parts of this chapter (e.g., pre-test and $\mathrm{PCR}$ estimators in Sects. 8.1.3 and 8.1.4, respectively).

8.1.6 Ridge Regression

Probably the best known shrinkage estimator is the ridge estimator proposed and studied by [43]. Having a non-orthogonal or even nearly singular matrix $\boldsymbol{X}^{\top} \boldsymbol{X}$ , one can add a positive constant

to its diagonal to improve conditioning.

Definition 4 Ridge regression ( $\mathrm{RR}$ ) estimator is defined for model (8.1) by

$\displaystyle \widehat{\boldsymbol{\beta}}^{\mathrm{RR}} = \left(\boldsymbol{X}... ...oldsymbol{X} + k \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\top}\boldsymbol{y}$

(8.9)

for some ridge parameter .

''Increasing'' the diagonal of $\boldsymbol{X}^{\top} \boldsymbol{X}$ before inversion shrinks $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}$ compared to $\widehat{\boldsymbol{\beta}}^{\mathrm{LS}}$ and introduces a bias. Additionally, [43] also showed that the derivative of $\mathrm{MSE}(\widehat{\boldsymbol{\beta}}^{\mathrm{RR}})$ with respect to

is negative at

. This indicates that the bias

In applications, an important question remains: how to choose the ridge parameter

? In the original paper by [43], the use of the ridge trace, a plot of the components of the estimated $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}$ against

, was advocated. If data exhibit multicollinearity, one usually observes a region of instability for

close to zero and then stable estimates for large values of ridge parameter

. One should choose the smallest

lying in the region of stable estimates. Alternatively, one could search for

minimizing $\mathrm{MSE}(\widehat{\boldsymbol{\beta}}^{\mathrm{RR}})$ ; see the subsection on generalized $\mathrm{RR}$ for more details. Furthermore, many other methods for model selection could be employed too; for example, LOU $\mathrm{CV}$ (Sect. 8.1.3) performed on a grid of

values is often used in this context.

Statistics important for inference based on $\mathrm{RR}$ estimates are discussed in [43] and [96] both for the case of a fixed

as well as in the case of some data-driven choices. Moreover, the latter work also describes algorithms for a fast and efficient $\mathrm{RR}$ computation.

To conclude, let us note that the $\mathrm{RR}$ estimator $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}$ in model (8.1) can be also defined as a solution of a restricted minimization problem

8.1.6.1 Generalized Ridge Regression

The $\mathrm{RR}$ estimator can be generalized in the sense that each diagonal element of $\boldsymbol{X}^{\top} \boldsymbol{X}$ is modified separately. To achieve that let us recall that this matrix can be diagonalized: $\boldsymbol{X}^{\top}\boldsymbol{X} = \boldsymbol{G}^{\top} \boldsymbol{\Lambda} \boldsymbol{G}$ , where $\boldsymbol{G}$ is an orthonormal matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix containing eigenvalues $\lambda_1,\ldots,\lambda_p$ .

Definition 5 Generalized ridge regression ( $\mathrm{GRR}$ ) estimator is defined for model (8.1) by

$\displaystyle \widehat{\boldsymbol{\beta}}^{\mathrm{GRR}} = \left(\boldsymbol{X... ...ol{X} + \boldsymbol{GKG}^{\top}\right)^{-1} \boldsymbol{X}^{\top}\boldsymbol{y}$

(8.12)

for a diagonal matrix $\boldsymbol{K}=\mathrm{diag}(k_1,\ldots,k_p)$ of ridge parameters.

The main advantage of this generalization being ridge coefficients specific to each variable, it is important to know how to choose the matrix $\boldsymbol{K}$ . In [43] the following result is derived.

Theorem 3 Assume that $\boldsymbol {X}$ in model (8.1) has a full rank, $\boldsymbol{\varepsilon} \sim N(0,\sigma^2 \boldsymbol{I}_{n})$ , and . Further, let $\boldsymbol{X} = \boldsymbol{H\Lambda}^{1/2}\boldsymbol{G}^{\top}$ be the singular value decomposition of $\boldsymbol {X}$ and $\boldsymbol{\gamma} = \boldsymbol{G}^{\top} \boldsymbol{\beta}_{0}$ . The MSE-minimizing choice of $\boldsymbol{K}$ in (8.12) is $\boldsymbol{K} = \sigma^2 \mathop{\text{diag}}(\gamma_1^{-2},\ldots,\gamma_p^{-2})$ .

An operational version (feasible $\mathrm{GRR}$ ) is based on an unbiased estimate $\widehat{\gamma_i} = \boldsymbol{G}^{\top} \widehat{\boldsymbol{\beta}}^{\text{LS}}$ and $s^2 = (\boldsymbol{y} - \boldsymbol{H}\widehat{\gamma})^{\top} (\boldsymbol{y} - \boldsymbol{H}\widehat{\gamma})$ . See [43] and [96], where you also find the bias and $\mathrm{MSE}$ of this operational $\mathrm{GRR}$ estimator, and [98] for further extensions of this approach. Let us note that the feasible $\mathrm{GRR}$ ( $\mathrm{FGRR}$ ) estimator does not have to possess the $\mathrm{MSE}$ -optimality property of $\mathrm{GRR}$ because the optimal choice of $\boldsymbol{K}$ is replaced by an estimate. Nevertheless, the optimality property of $\mathrm{FGRR}$ is preserved if $\lambda_i \gamma_i^2 \leq 2\sigma^2$ , where $\lambda_i$ is the

th-element of $\boldsymbol{\Lambda}$ ([30]).

Additionally, given an estimate of $\mathrm{MSE}$ -minimizing $\boldsymbol{\widehat{K}} = \mathop{\text{diag}}(\widehat{k}_1,\ldots,\widehat{k}_p)$ , many authors proposed to choose the ridge parameter

in ordinary $\mathrm{RR}$ as a harmonic mean of $\widehat{k}_i, i =1,\ldots,p$ ; see [44], for instance.

8.1.6.2 Almost Unbiased Ridge Regression

Motivated by results on $\mathrm{GRR}$ , [53] proposed to correct $\mathrm{GRR}$ for its bias using the first-order bias approximation. This yields almost unbiased $\mathrm{GRR}$ ( $\mathrm{AUGRR}$ ) estimator

8.1.6.3 Further Extensions

$\mathrm{RR}$ can be applied also under exact multicollinearity, which arises for example in data with more variables than observations. Although the theory and application of $\mathrm{RR}$ is the same as in the case of full-rank data, the computational burden of

operations becomes too high for

. A faster algorithm with computational complexity only

was found by [39].

Moreover, there are many further extensions of the $\mathrm{RR}$ principle that go beyond the extent of this chapter. To mention at least some of them, let us refer a reader to works comparing or combining various ridge and shrinkage approaches [55]; [84]; [87] and to monograph by [35].

Example 3 Using data Pollution once again, we estimated $\mathrm{RR}$ for ridge parameter $k \in (0,10)$ and plotted the estimated coefficients $\widehat{\boldsymbol{\beta}}^{\mathrm{RR}}$ as functions of (ridge trace plot), see Fig. 8.2. For the sake of simplicity, we restricted ourselves only to variables that were selected by some variable selection procedure in Table 8.1 ( ). The plot shows the effect of ridge parameter on slope estimates ( corresponds to LS). Apparently, slopes of some variables are affected very little (e.g., variable ), some significantly (e.g., the magnitude of variable increases more than twice), and some variables shrink extremely (e.g., variables and ). In all cases, the biggest change occurs between and and estimates gradually stabilize for . The vertical dashed line in Fig. 8.2 represents the $\mathrm{CV}$ estimate of ( $k_{\mathrm{CV}} = {6.87}$ ).

**Figure 8.2:** Ridge trace plot for variables , , , , , , of data Pollution. The *vertical line* represents the $\textrm {CV}$ -choice of
$\includegraphics[width=7.5cm]{text/3-8/ridge.eps}$

8.1.7 Continuum Regression

$\mathrm{RR}$ discussed in Sect. 8.1.6 is very closely connected with the continuum regression proposed by [16] as a unifying approach to the $\mathrm{LS}$ , $\mathrm{PCR}$ , and partial least squares (see Sect. 8.1.9) estimation.

Definition 6 A continuum regression ( $\mathrm{CR}$ ) estimator $\widehat{\boldsymbol{\beta}}^{\mathrm{CR}}(\alpha)$ of model (8.1) is a coefficient vector maximizing function

$\displaystyle T_{\alpha}(\boldsymbol{c}) = \left(\boldsymbol{c}^{\top}\boldsymb... ...{\top}\boldsymbol{X}^{\top}\boldsymbol{X} \boldsymbol{c}\right)^{\alpha - 1}\;,$

(8.13)

for a given value of parameter $\alpha \geq 0$ and a given length $\Vert c\Vert$ , where $\boldsymbol{S} = \boldsymbol{X}^{\top}\boldsymbol{X}$ and $\boldsymbol{s} = \boldsymbol{X}^{\top} \boldsymbol{y}$ .

This definition yields estimates proportional to LS for $\alpha = 0$ , to $\mathrm{PCR}$ for $\alpha \to \infty$ , and to yet-to-be-discussed partial least squares for $\alpha=1$ . Apart from this, the advantage of $\mathrm{CR}$ is that one can adaptively select among the methods by searching an optimal $\alpha$ . To determine $\alpha$ , [16] used $\mathrm{CV}$ .

The relationship between $\mathrm{RR}$ and $\mathrm{CR}$ was indicated already in [90], but the most important result came after uncovering possible discontinuities of $\mathrm{CR}$ estimates as a function of data and $\alpha$ by [14]. In an attempt to remedy the discontinuity of the original $\mathrm{CR}$ , [15] not only proposed to maximize

Theorem 4 If a regressor $\boldsymbol{b_f}$ is defined according to

$\displaystyle \boldsymbol{b}_f = \mathop{\text{argmax}}\limits_{\Vert\boldsymbol{c}\Vert=1} f\left\{K^2(\boldsymbol{c}), V(\boldsymbol{c})\right\}\;,$

where $K(\boldsymbol{c}) = \boldsymbol{y}^{\top}\boldsymbol{X}\boldsymbol{c}$ , $V(\boldsymbol{c}) = \Vert\boldsymbol{X}\boldsymbol{c}\Vert^2$ , is increasing in for constant , and increasing in for constant , and finally, if $\boldsymbol{X}^{\top}\boldsymbol{y}$ is not orthogonal to all eigenvectors corresponding to the largest eigenvalue $\lambda_{\max}$ of $\boldsymbol{X}^{\top} \boldsymbol{X}$ , then there exists a number $k \in (-\infty,\lambda_{\max}) \cup [0, +\infty]$ such that $\boldsymbol{b}_f$ is proportional to $(\boldsymbol{X}^{\top}\boldsymbol{X} + k \boldsymbol{I})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}$ , including the limiting cases $k \to 0, k \to \pm\infty$ , and $k \to -\lambda_{\max}$ .

Thus, the $\mathrm{RR}$ estimator fundamentally underlies many methods dealing with multicollinear and reduced rank data such as mentioned $\mathrm{PCR}$ and partial least squares. Notice however that negative values of the ridge coefficient

have to be admitted here.

Finally, let us note that $\mathrm{CR}$ can be extended to multiple-response-variables models ([17]).

8.1.8 Lasso

The ridge regression discussed in Sect. 8.1.6 motivates another shrinkage method: Lasso (least absolute shrinkage and selection operator) by [93]. Formulation (8.10) states that $\mathrm{RR}$ can be viewed as a minimization with respect to an upper bound on the

norm of estimate $\Vert\widehat{\boldsymbol{\beta}}\Vert _2$ . A natural extension is to consider constraints on the

norm $\Vert\widehat{\boldsymbol{\beta}}\Vert _q$ ,

. Specifically, [93] studied case of

, that is

norm.

Definition 7 The Lasso estimator for the regression model (8.1) is defined by

$\displaystyle \widehat{\boldsymbol{\beta}}^{L} = \mathop{\text{argmin}}\limits_... ...ight)^{\top} \left(\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta}\right)\;,$

(8.14)

where $r \geq 0$ is a tuning parameter.

Lasso is a shrinkage estimator that has one specific feature compared to ordinary $\mathrm{RR}$ . Because of the geometry of

-norm restriction, Lasso shrinks the effect of some variables and eliminates influence of the others, that is, sets their coefficients to zero. Thus, it combines regression shrinkage with variable selection, and as [93] demonstrated also by means of simulation, it compares favorably to all-subsets regression. In this context, it is interesting that Lasso could be formulated as a special case of least angle regression by [27]. Finally, let us note that to achieve the same kind of shrinking and variable-selection effects for all variables, they should be standardized before used in Lasso; see [68] for details.

As far as the inference for the Lasso estimator is concerned, [57] recently studied the asymptotic distribution of Lasso-type estimators using

-norm condition $\Vert\boldsymbol{\beta}\Vert _q \leq r$ with $q \leq 1$ , including behavior under nearly-singular designs.

Now, it remains to find out how Lasso estimates can be computed. Equation (8.14) indicates that one has to solve a restricted quadratic optimization problem. Setting $\beta_j^+ = \max\{\beta_j,0\}$ and $\beta_j^- = -\min\{\beta_j,0\}$ , the restriction $\Vert\boldsymbol{\beta}\Vert \leq r$ can be written as

constraints: $\beta_j^+ \geq 0, \beta_j^- \geq 0$ , and $\sum_{j=1}^p (\beta_j^+ - \beta_j^-) \leq r$ . Thus, convergence is assured in

steps. Additionally, the unknown tuning parameter

is to be selected by means of $\mathrm{CV}$ . Further, although solving (8.14) is straightforward in usual regression problems, it can become very demanding for reduced-rank data,

. [72] treated lasso as a convex programming problem, and by formulating its dual problem, developed an efficient algorithm usable even for

Example 4 Let us use data Pollution once more to exemplify the use of Lasso. To summarize the Lasso results, we use the same plot as [93] and [27] used, see Fig. 8.3. It contains standardized slope estimates as a function of the constraint $\Vert b\Vert\leq r$ , which is represented by an index $r / \max \Vert\widehat{\boldsymbol{\beta}}\Vert = \Vert\widehat{\boldsymbol{\beta}}^{L}\Vert / \Vert\widehat{\boldsymbol{\beta}}^{\text{LS}}\Vert$ (the $\mathrm{LS}$ estimate $\widehat{\boldsymbol{\beta}}^{\text{LS}}$ corresponds to $\widehat{\boldsymbol{\beta}}^{L}$ under $r=\infty$ , and thus, renders the maximum of $\Vert\widehat{\boldsymbol{\beta}}^{L}\Vert$ ). Moreover, to keep the graph simple, we plotted again only variables that were selected by variable selection procedures in Table 8.1 ( ).

In Fig. 8.3, we can observe which variables are included in the regression (have a nonzero coefficient) as tuning parameter increases. Clearly, the order in which the first of these variables become significant - - closely resembles the results of variable selection procedures in Table 8.1. Thus, Lasso combines shrinkage estimation and variable selection: at a given constraint level , it shrinks coefficients of some variables and removes the others by setting their coefficients equal to zero.

**Figure 8.3:** Slope coefficients for variables of data Pollution estimated by Lasso at different constraint levels, $r / \max \Vert\widehat{\boldsymbol{\beta}}\Vert$ . The *right axis* assigns to each line the number of variable it represents and the *top axis* indicates the number of variables included in the regression
$\includegraphics[width=8.6cm]{text/3-8/lasso.eps}$

8.1.9 Partial Least Squares

A general modeling approach to most of the methods covered so far was $\mathrm{CR}$ in Sect. 8.1.7, whereby it has two ''extremes'': LS for $\alpha = 0$ and $\mathrm{PCR}$ for $\alpha \to \infty$ . The partial least squares (PLS) regression lies in between - it is a special case of (8.13) for $\alpha=1$ , see [16]. Originally proposed by [104], it was presented as an algorithm that searches for linear combinations of explanatory variables best explaining the dependent variable. Similarly to $\mathrm{PCR}$ , PLS also aims especially at situations when the number of explanatory variables is large compared to the number of observations. Here we present the PLS idea and algorithm themselves as well as the latest results on variable selection and inference in PLS.

Having many explanatory variables $\boldsymbol {X}$ , the aim of the PLS method is to find a small number of linear combinations $\boldsymbol{T}_1 = \boldsymbol{X} \boldsymbol{c}_{1},\ldots,\boldsymbol{T}_q=\boldsymbol{X} \boldsymbol{c}_{q}$ of these variables, thought about as latent variables, explaining observed responses

Let us now present the PLS algorithm itself, which defines yet another shrinkage estimator as shown by [20] and [51]. (See [75] for more details and [33] for an alternative formulation.) The indices $\boldsymbol{T}_1,\ldots,\boldsymbol{T}_q$ are constructed one after another. Estimating the intercept by $b_0 = \bar{\boldsymbol{y}}$ , let us start with centered variables $\boldsymbol{z}_{0} = \boldsymbol{y} - \bar{y}$ and $\boldsymbol{U}_{0} = \boldsymbol{X} - \bar{\boldsymbol{X}}$ and set

This algorithm provides us with indices $\boldsymbol{T}_{k}$ , which define the analogs of principle components in $\mathrm{PCR}$ , and the corresponding regression coefficients

in (8.15). The main open question is how to choose the number of components

. The original method proposed by [105] is based on cross validation. Provided that $\mathrm{CV}_k$ from (8.7) represents the $\mathrm{CV}$ index of $\mathrm{PLS}$ estimate with

factors, an additional index $\boldsymbol{T}_{k+1}$ is added if Wold's

criterion $R = \mathrm{CV}_{k+1}/\mathrm{CV}_k$ is smaller than

. This selects the first local minimum of the $\mathrm{CV}$ index, which is superior to finding the global minimum of $\mathrm{CV}_k$ as shown in [73]. Alternatively, one can stop already when Wold's

exceeds

bound (modified Wold's

criteria) or to use other variable selection criteria such as $\mathrm{AIC}$ . In a recent simulation study, [61] showed that modified Wold's

is preferable to Wold's

and $\mathrm{AIC}$ . Furthermore, similarly to $\mathrm{PCR}$ , there are attempts to use $\mathrm{GA}$ for the component selection, see [58] for instance.

Next, the first results on the asymptotic behavior of PLS appeared only during last decade. The asymptotic behavior of prediction errors was examined by [41]. The covariance matrix, confidence and prediction intervals based on PLS estimates were first studied by [23], but a more compact expression was presented in [74]. It is omitted here due to many technicalities required for its presentation. There are also attempts to find a sample-specific prediction error of $\mathrm{PLS}$ , which were compared by [29].

Finally, note that there are many extensions of the presented algorithm, which is usually denoted $\mathrm{PLS1}$ . First of all, there are extensions ( $\mathrm{PLS2}$ , $\mathrm{SIMPLS}$ , etc.) of $\mathrm{PLS1}$ to models with multiple dependent variables, see [50] and [32] for instance, which choose linear combinations (latent variables) not only within explanatory variables, but does the same also in the space spanned by dependent variables. A recent survey of these and other so-called two-block methods is given in [101]. $\mathrm{PLS}$ was also adapted for on-line process modeling, see [78] for a recursive $\mathrm{PLS}$ algorithm. Additionally, in an attempt to simplify the interpretation of $\mathrm{PLS}$ results, [94] proposed orthogonalized $\mathrm{PLS}$ . See [108] for further details on recent developments.

Example 5 Let us use again data Pollution, although it is not a typical application of $\mathrm{PLS}$ . As explained in Sects. 8.1.7 and 8.1.9, $\mathrm{PLS}$ and $\mathrm{PCR}$ are both based on the same principle (searching for linear combinations of original variables), but use different objective functions. To demonstrate, we estimated $\mathrm{PLS}$ for to latent variables and plotted the fraction of the $\boldsymbol {X}$ and $\boldsymbol {y}$ variance explained by the $\mathrm{PLS}$ latent variables in the same way as in Fig. 8.1. Both curves are in Fig. 8.4. Almost all of the variability in $\boldsymbol {X}$ is captured by the first latent variable, although this percentage is smaller than in the case of $\mathrm{PCR}$ . On the other hand, the percentage of the variance of $\boldsymbol {y}$ explained by the first latent variables increases faster than in the case of $\mathrm{PCR}$ , see Fig. 8.4 (solid versus dotted line).

**Figure 8.4:** Fraction of the explained variance of $\boldsymbol {X}$ (*dashed line*) and $\boldsymbol {y}$ (*solid line*) by the first latent variables in $\textrm {PLS}$ regression and by first PCs (*dotted lines*)
$\includegraphics[width=8.6cm]{text/3-8/pls2.eps}$

8.1.10 Comparison of the Methods

Methods discussed in Sects. 8.1.3-8.1.9 are aiming at the estimation of (nearly) singular problems and they are often very closely related, see Sect. 8.1.7. Here we provide several references to studies comparing the discussed methods.

First, an extensive simulation study comparing variable selection, $\mathrm{PCR}$ , $\mathrm{RR}$ , and $\mathrm{PLS}$ regression methods is presented in [32]. Although the results are conditional on the simulation design used in the study, they indicate that $\mathrm{PCR}$ , $\mathrm{RR}$ , and $\mathrm{PLS}$ are, in the case of ill-conditioned problems, highly preferable to variable selection. The differences between the best methods, $\mathrm{RR}$ and $\mathrm{PLS}$ , are rather small and the same holds for comparison of $\mathrm{PLS}$ and $\mathrm{PCR}$ , which seems to be slightly worse than $\mathrm{RR}$ . An empirical comparison of $\mathrm{PCR}$ and $\mathrm{PLS}$ was also done by [103] with the same result. Next, the fact that neither $\mathrm{PCR}$ , nor $\mathrm{PLS}$ asymptotically dominates the other method was proved in [41] and further discussed in [40]. A similar asymptotic result was also given by [88]. Finally, the fact that $\mathrm{RR}$ should not perform worse than $\mathrm{PCR}$ and $\mathrm{PLS}$ is supported by Theorem 4 in Sect. 8.1.7.

Number of	Forward	Backward	All-subset
variables	selection	elimination	selection


	,	,	,

	, ,	, ,	, ,

	, , ,	, , ,	, , ,

	, , , ,	, , , ,	, , , ,