

# 8.1 Linear Regression Modeling

Let us first study the linear regression model $Y = X^\top\beta + \varepsilon$ assuming $E(\varepsilon|X) = 0$. Unless said otherwise, we consider here only one dependent variable $Y$. The unknown vector $\beta = (\beta_1,\ldots,\beta_p)^\top$ is to be estimated given observations $y_i$ and $x_i = (x_{i1},\ldots,x_{ip})^\top$, $i = 1,\ldots,n$, of the random variables $Y$ and $X$; let us denote $y = (y_1,\ldots,y_n)^\top$, $X = (x_1,\ldots,x_n)^\top$, and let $X_k$ be the $k$th column of $X$. Thus, the linear regression model can be written in terms of observations as

$$ y = X\beta + \varepsilon \qquad (8.1) $$

where $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)^\top$.

Section 8.1.1 summarizes how to estimate the model (8.1) by the method of least squares. Later, we specify what ill-conditioning and multicollinearity are in Sect. 8.1.2 and discuss methods dealing with them in Sects. 8.1.3-8.1.9.

## 8.1.1 Fitting of Linear Regression

Let us first review least squares estimation and its main properties to facilitate the understanding of the fitting procedures discussed further. For a detailed overview of linear regression modeling, see [75].

The least squares (LS) approach to the estimation of (8.1) searches for an estimate $\hat{\beta}$ of the unknown parameters $\beta$ by minimizing the sum of squared differences between the observed values $y_i$ and the predicted ones $\hat{y}_i(\beta) = x_i^\top\beta$.

Definition 1   The least squares estimate of linear regression model (8.1) is defined by

$$ \hat{\beta} = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \{y_i - \hat{y}_i(\beta)\}^2 = \mathop{\mathrm{argmin}}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i^\top\beta)^2. \qquad (8.2) $$

This differentiable problem can be expressed as minimization of

$$ (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X\beta $$

with respect to $\beta$ and the corresponding first-order conditions are

$$ -2X^\top y + 2X^\top X\beta = 0 \quad\Longleftrightarrow\quad X^\top X\beta = X^\top y. \qquad (8.3) $$

They are commonly referred to as the normal equations and identify the global minimum of (8.2) as long as the second-order conditions hold; that is, the matrix $X^\top X$ is supposed to be positive definite, or equivalently, non-singular. (This mirrors an often used assumption specified in terms of the underlying random variable $X$: $E(XX^\top)$ is positive definite.) Provided that $X^\top X$ is non-singular and $E(\varepsilon|X) = 0$, the LS estimator is unbiased and can be found as the solution of (8.3),

$$ \hat{\beta} = (X^\top X)^{-1} X^\top y. \qquad (8.4) $$

Additionally, it is the best linear unbiased estimator of (8.1), see [4].

Theorem 1 (Gauss-Markov)

Assume that $E(\varepsilon|X) = 0$, $\mathrm{Var}(\varepsilon|X) = \sigma^2 I_n$, and $X^\top X$ is non-singular. Let $\tilde{\beta} = \hat{\beta} + Ay$, where $A$ is a matrix orthogonal to $X$, $AX = 0$. Then $\mathrm{Var}(\tilde{\beta}) - \mathrm{Var}(\hat{\beta})$ is a positive definite matrix for any $A \neq 0$.

Finally, the LS estimate actually coincides with the maximum likelihood estimate provided that the random errors $\varepsilon$ are normally distributed (in addition to the assumptions of Theorem 1), and it then shares the asymptotic properties of maximum likelihood estimation (see [4]).

### 8.1.1.1 Computing LS Estimates

The LS estimate can be and often is found by directly solving the system of linear equations (8.3) or by evaluating formula (8.4), which involves a matrix inversion. Both direct and iterative methods for solving systems of linear equations are presented in Chap. II.4. Although this straightforward computation may work well for many regression problems, it often leads to an unnecessary loss of precision, see [68]. Additionally, it is not very suitable if the matrix $X^\top X$ is ill-conditioned (a regression problem is called ill-conditioned if a small change in data causes large changes in estimates) or nearly singular (multicollinearity) because it is not numerically stable. Since we are concerned mainly with the statistical consequences of multicollinearity, the numerical issues regarding the identification and treatment of ill-conditioned regression models are beyond the scope of this contribution. Let us refer the interested reader to [6,13,68,92] and [99].

Let us now briefly review a class of numerically more stable algorithms for the LS minimization, based on orthogonal transformations. Assuming $Q$ is an orthonormal matrix, $Q^\top Q = I_n$,

$$ \|Qy - QX\beta\|^2 = (y - X\beta)^\top Q^\top Q (y - X\beta) = \|y - X\beta\|^2. $$

Thus, multiplying a regression model by an orthonormal matrix does not change it from the LS point of view. Since every matrix $X$ can be decomposed into the product $X = QR$ (the QR decomposition), where $Q$ is an orthonormal matrix and $R$ is an upper triangular matrix, pre-multiplying (8.1) by $Q^\top$ produces

$$ Q^\top y = R\beta + Q^\top\varepsilon, \qquad (8.5) $$

where $R = (R_1^\top, 0^\top)^\top$, $R_1$ is a $p \times p$ upper triangular matrix, and $0$ is an $(n-p) \times p$ zero matrix. Hence, the sum of squares to minimize can be written as

$$ \|Q^\top y - R\beta\|^2 = \|z_1 - R_1\beta\|^2 + \|z_2\|^2, $$

where $z_1$ and $z_2$ form $Q^\top y = (z_1^\top, z_2^\top)^\top$. The LS estimate is then obtained from the upper triangular system $R_1\beta = z_1$, which is trivial to solve by backward substitution. There are many algorithms for constructing a suitable QR decomposition for finding LS estimates, such as the Householder or Givens transformations; see Chap. II.4 for more details.
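The QR route can be sketched in a few lines; this is a minimal illustration on simulated data (the function name and data are our own, not the chapter's reference implementation):

```python
import numpy as np

def ls_via_qr(X, y):
    # Reduced QR decomposition: X = Q1 @ R1 with R1 (p x p) upper triangular
    Q1, R1 = np.linalg.qr(X)
    z1 = Q1.T @ y
    # Solve the triangular system R1 @ beta = z1; np.linalg.solve stands in
    # here for a dedicated backward-substitution routine
    return np.linalg.solve(R1, z1)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true              # noiseless data, so the fit is exact
beta_hat = ls_via_qr(X, y)
```

On noiseless data the QR solution recovers the coefficients exactly, and on general data it agrees with the normal-equation solution while avoiding the explicit inversion of $X^\top X$.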

### 8.1.1.2 LS Inference

Linear regression modeling does not naturally consist only of obtaining a point estimate $\hat{\beta}$. One needs to measure the variance of the estimates in order to construct confidence intervals or test hypotheses. Additionally, one should assess the quality of the regression fit. Most such measures are based on the regression residuals $e = y - X\hat{\beta}$. We briefly review the most important regression statistics and then indicate how they can be computed if the LS regression is estimated by means of an orthogonalization procedure described in the previous paragraph.

The most important measures used in statistics to assess model fit and inference are the total sum of squares $TSS = \sum_{i=1}^n (y_i - \bar{y})^2$, where $\bar{y} = \sum_{i=1}^n y_i/n$, the residual sum of squares $RSS = \sum_{i=1}^n e_i^2$, and the complementary regression sum of squares $RegSS = TSS - RSS$. Using these quantities, the regression fit can be evaluated; for example, by the coefficient of determination $R^2 = 1 - RSS/TSS$ as well as by many information criteria (modified $\bar{R}^2$, Mallows and Akaike criteria, etc.; see Sect. 8.1.3). Additionally, they can be used to compute the variance of the estimates in simple cases. The variance of the estimates can be estimated by

$$ \widehat{\mathrm{Var}}(\hat{\beta}) = s^2 (X^\top X)^{-1}, \qquad (8.6) $$

where $\widehat{\mathrm{Var}}(\hat{\beta})$ represents an estimate of the covariance matrix $\mathrm{Var}(\hat{\beta})$. Provided that the model is homoscedastic, $\mathrm{Var}(\varepsilon|X) = \sigma^2 I_n$, the residual variance $\sigma^2$ can be estimated as an average of squared residuals, $s^2 = \sum_{i=1}^n e_i^2/n$. Apart from the residual variance, one also needs the inverse of $X^\top X$, which will often be a by-product of solving the normal equations.

Let us now describe how one computes these quantities if a numerically stable procedure based on the orthonormalization of the normal equations is used. Let us assume we have already constructed a QR decomposition $X = QR$. Thus, $X^\top X = R_1^\top R_1$ and $(X^\top X)^{-1} = R_1^{-1}(R_1^\top)^{-1}$. The residual sum of squares can be computed as

$$ RSS = \|y - X\hat{\beta}\|^2 = \|Q^\top y - R\hat{\beta}\|^2 = \|z_2\|^2. $$

Consequently, $RSS$ is invariant with respect to the orthonormal transformation (8.5) of the regression model (8.1). The same conclusion applies also to $TSS$ and $RegSS$, and consequently, to the variance estimation. Thus, it is possible to use the data in (8.5), transformed to achieve better numerical stability, for computing the regression statistics of the original model (8.1).
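These statistics can be illustrated with a short simulation (illustrative data and names of our own; we divide $RSS$ by $n$ as in the text, although an unbiased variant divides by $n - p$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([2.0, 1.0]) + rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                     # regression residuals

TSS = np.sum((y - y.mean()) ** 2)        # total sum of squares
RSS = np.sum(e ** 2)                     # residual sum of squares
RegSS = TSS - RSS                        # regression sum of squares
R2 = 1.0 - RSS / TSS                     # coefficient of determination

s2 = RSS / n                             # average of squared residuals
var_beta = s2 * np.linalg.inv(X.T @ X)   # estimated covariance, cf. (8.6)

# RSS from the QR-transformed model (8.5) is identical: ||z2||^2
Q, R = np.linalg.qr(X, mode="complete")
z = Q.T @ y
RSS_qr = np.sum(z[p:] ** 2)
```

The last three lines verify numerically that $RSS$ computed in the transformed coordinates coincides with the one from the original model.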

## 8.1.2 Multicollinearity

Let us assume that the design matrix $X$ is fixed. We talk about multicollinearity when there is a linear dependence among the variables in the regression, that is, among the columns of $X$.

Definition 2   In model (8.1), exact multicollinearity exists if there are real constants $a_1,\ldots,a_p$ such that $\sum_{k=1}^p |a_k| > 0$ and $\sum_{k=1}^p a_k X_k = 0$.

The exact multicollinearity (also referred to as reduced-rank data) is relatively rare in linear regression models unless the number of explanatory variables is very large or even larger than the number of observations, $p > n$. This happens often in agriculture, chemometrics, sociology, and so on. For example, [68] uses data on the absorbances of infra-red rays at many different wavelengths by chopped meat, whereby the aim is to determine the moisture, fat, and protein content of the meat as a function of these absorbances. The study employs measurements at a large number of wavelengths, which gives rise to many possibly correlated variables.

When the number of variables $p$ is small compared to the sample size $n$, near multicollinearity is more likely to occur: there are some real constants $a_1,\ldots,a_p$ such that $\sum_{k=1}^p |a_k| > 0$ and $\sum_{k=1}^p a_k X_k \approx 0$, where $\approx$ denotes approximate equality. Multicollinearity in data does not have to arise only as a result of highly correlated variables (e.g., more measurements of the same characteristic by different sensors or methods) or of having more variables than observations, where it occurs by definition; it could also result from a lack of information and variability in the data.

Whereas exact multicollinearity implies that $X^\top X$ is singular and the LS estimator is not identified, near multicollinearity permits a non-singular matrix $X^\top X$. The eigenvalues $\lambda_1 \geq \ldots \geq \lambda_p$ of the matrix $X^\top X$ can give some indication concerning multicollinearity: if the smallest eigenvalue $\lambda_p$ equals zero, the matrix is singular and the data are exactly multicollinear; if $\lambda_p$ is close to zero, near multicollinearity is present in the data. Since measures based on eigenvalues depend on the parametrization of the model, they are not necessarily optimal, and it is often easier to detect multicollinearity by looking at the LS estimates and their behavior, as discussed in the next paragraph. See [13] and [59] for more details on the detection and treatment of ill-conditioned problems. (Nearly singular matrices are dealt with also in numerical mathematics. To measure near singularity, numerical mathematics uses conditioning numbers $CN = \lambda_{\max}/\lambda_{\min}$, which converge to infinity for singular matrices, that is, as $\lambda_{\min} \to 0$. Matrices with very large conditioning numbers are called ill-conditioned.)
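As a small illustration (simulated data of our own), the smallest eigenvalue and the conditioning number of $X^\top X$ expose near multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal(100)
x2 = x1 + 1e-4 * rng.standard_normal(100)   # nearly a copy of x1
X = np.column_stack([x1, x2])

eigvals = np.linalg.eigvalsh(X.T @ X)       # eigenvalues in ascending order
lambda_min, lambda_max = eigvals[0], eigvals[-1]
cond_number = lambda_max / lambda_min       # conditioning number of X'X
```

Here `lambda_min` is many orders of magnitude below `lambda_max`, so the conditioning number is huge and the LS estimates would be extremely sensitive to small data changes.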

Multicollinearity has important implications for LS. In the case of exact multicollinearity, the matrix $X^\top X$ does not have full rank, hence the solution of the normal equations is not unique and the LS estimate is not identified. One has to introduce additional restrictions to identify the LS estimate. On the other hand, even though near multicollinearity does not prevent the identification of LS, it negatively influences estimation results. Since both the estimate $\hat{\beta}$ and its variance are proportional to the inverse of $X^\top X$, which is nearly singular under multicollinearity, near multicollinearity inflates $\hat{\beta}$, which may become unrealistically large, and its variance $\mathrm{Var}(\hat{\beta})$. Consequently, the corresponding $t$-statistics are typically very low. Moreover, due to the large values of $(X^\top X)^{-1}$, the least squares estimate reacts very sensitively to small changes in data. See [42] and [69] for a more detailed treatment and real-data examples of the effects of multicollinearity.

There are several strategies to limit the adverse consequences of multicollinearity, provided that one cannot improve the design of the model or experiment or get better data. First, one can impose an additional structure on the model. This strategy cannot be discussed in detail since it is model specific, and in principle, it requires only testing a hypothesis concerning additional restrictions. Second, it is possible to reduce the dimension of the space spanned by $X$, for example, by excluding some variables from the regression (Sects. 8.1.3 and 8.1.4). Third, one can also leave the class of unbiased estimators and try to find a biased estimator with a smaller variance and mean squared error. Assuming we want to judge the performance of an estimator $\tilde{\beta}$ by its mean squared error $MSE(\tilde{\beta})$, the motivation follows from

$$ MSE(\tilde{\beta}) = E\left[(\tilde{\beta} - \beta)(\tilde{\beta} - \beta)^\top\right] = \mathrm{Var}(\tilde{\beta}) + \mathrm{Bias}(\tilde{\beta})\,\mathrm{Bias}(\tilde{\beta})^\top. $$

Thus, it is possible that introducing a bias into the estimation in such a way that the variance of the estimates is significantly reduced can improve the estimator's $MSE$. There are many biased alternatives to the LS estimation, as discussed in Sects. 8.1.5-8.1.9, and some of them even combine biased estimation with variable selection. In all cases, we present methods usable both in the case of near and exact multicollinearity.

## 8.1.3 Variable Selection

The presence of multicollinearity may indicate that some explanatory variables are linear combinations of the other ones (note that this is more often a ''feature'' of the data rather than of the model). Consequently, they do not improve the explanatory power of the model and could be dropped from it, provided there is some justification for dropping them also on the model level rather than just dropping them to fix data problems. As a result of removing some variables, the matrix $X^\top X$ would not be (nearly) singular anymore.

Eliminating variables from a model is a special case of model selection procedures, which are discussed in detail in Chap. III.1. Here we first discuss methods specific to variable selection within a single regression model, mainly variants of stepwise regression. Later, we deal with more general model selection methods, such as cross validation, that are useful both in the context of variable selection and of the biased estimation discussed in Sects. 8.1.5-8.1.9. An overview and comparison of many classical variable selection methods is given, for example, in [67,68] and [69]. For a discussion of computational issues related to model selection, see [54] and [68].

### 8.1.3.1 Backward Elimination

A simple and often used method to eliminate non-significant variables from a regression is backward elimination, a special case of stepwise regression. Backward elimination starts from the full model and identifies a variable $X_k$ such that

1. its omission results in the smallest increase of $RSS$; or
2. it has the smallest $t$-statistic $t_k = \hat{\beta}_k/\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_k)}$, where $\widehat{\mathrm{Var}}(\hat{\beta}_k)$ is the estimated variance of $\hat{\beta}_k$, or any other test statistic of $H_0: \beta_k = 0$; or
3. its removal causes the smallest change of a prediction or information criterion characterizing the fit or prediction power of the model. Well-known examples of information criteria are the modified coefficient of determination $\bar{R}^2$, the Akaike information criterion ([1]), $AIC = n\ln(RSS/n) + 2p$, and the Schwarz information criterion ([79]), $SIC = n\ln(RSS/n) + p\ln n$, where $n$ and $p$ represent the sample size and the number of regressors, respectively.

Next, the variable $X_k$ is excluded from the regression by setting $\beta_k = 0$ if (1) one has not yet reached a pre-specified number of variables or (2) the test statistic or the change of the information criterion lies below some selected significance level.
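Assuming the Gaussian-likelihood forms $AIC = n\ln(RSS/n) + 2p$ and $SIC = n\ln(RSS/n) + p\ln n$ mentioned above, a nested-model comparison can be sketched as follows (simulated data, illustrative names):

```python
import numpy as np

def aic(rss, n, p):
    # Akaike information criterion for a Gaussian linear model
    return n * np.log(rss / n) + 2 * p

def sic(rss, n, p):
    # Schwarz (Bayesian) information criterion
    return n * np.log(rss / n) + p * np.log(n)

rng = np.random.default_rng(3)
n = 100
x = rng.standard_normal(n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)

# Compare the intercept-only model against the model including x
X1 = np.ones((n, 1))
X2 = np.column_stack([np.ones(n), x])
rss1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
rss2 = np.sum((y - X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]) ** 2)
```

With a strong signal, both criteria favor the larger model despite its extra parameter; with a negligible signal, the penalty term would tip the balance the other way.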

Before discussing the properties of backward elimination, let us make several notes on the information criteria used for the elimination and their optimality. There is a wide range of selection criteria, including the classical criteria above, the criterion of [86], cross validation by [89], and so on. Even though one can consider the same measure of the optimality of variable selection, such as the sum of squared prediction errors [85], one can often see contradictory results concerning the selection criteria (see [60] and [81]; or [85] and [76]). This is caused by different underlying assumptions about the true model [82]. Some criteria, such as AIC and cross validation, are optimal if one assumes that there is no finite-dimensional true model (i.e., the number of variables increases with the sample size); see [85] and [60]. On the other hand, some criteria, such as SIC, are consistent if one assumes that there is a true model with a finite number of variables; see [76] and [82]. Finally, note that even though some criteria, being optimal in the same sense, are asymptotically equivalent, their finite sample properties can differ substantially. See Chap. III.1 for more details.

Let us now return to backward elimination, which can also be viewed as a pre-test estimator [52]. Although it is often used in practice, it involves a largely arbitrary choice of the significance level. In addition, it has rather poor statistical properties caused primarily by the discontinuity of the selection decision, see [62]. Moreover, even if a stepwise procedure is employed, one should take care to report correct variances and confidence intervals valid for the whole decision sequence. Inference for the finally selected model as if it were the only model considered leads to significant biases, see [22], [102] and [109]. Backward elimination also does not perform well in the presence of multicollinearity, and it cannot be used if there are more variables than observations. Finally, let us note that a nearly optimal and admissible alternative is proposed in [63].

### 8.1.3.2 Forward Selection

Backward elimination cannot be applied if there are more variables than observations, and additionally, it may be very computationally expensive if there are many variables. A classical alternative is forward selection, where one starts from an intercept-only model and adds, one after another, the variables that provide the largest decrease of $RSS$. Adding stops when the $F$-statistic comparing the models with and without the candidate variable lies below a pre-specified critical `F-to-enter' value. Forward selection can be combined with backward elimination (e.g., after adding a variable, one performs one step of backward elimination), which is known as stepwise regression [28]. Its computational complexity is discussed in [68].
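The greedy core of forward selection can be sketched as follows; this minimal version (illustrative names, simulated data) adds variables by largest $RSS$ decrease and stops after a fixed number of steps rather than using an F-to-enter threshold:

```python
import numpy as np

def forward_selection(X, y, k):
    # Greedily add the variable whose inclusion most reduces RSS
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

rng = np.random.default_rng(4)
X = rng.standard_normal((80, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.standard_normal(80)
chosen = forward_selection(X, y, 2)
```

With independent regressors and a strong signal, the two truly relevant columns are recovered; under multicollinearity the greedy path can easily miss the best subset, which is exactly the weakness discussed in the text.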

Note that most disadvantages of backward elimination apply to forward selection as well. In particular, correct variances and confidence intervals should be reported, see [68] for their approximations. Moreover, forward selection can be overly aggressive in the sense that, once a variable $X_k$ is included in a model, forward selection primarily adds variables orthogonal to $X_k$, thus ignoring possibly useful variables that are correlated with $X_k$. To improve upon this, [27] proposed least angle regression, which considers the correlations of to-be-added variables jointly with respect to all variables already included in the model.

### 8.1.3.3 All-Subsets Regression

Neither forward selection nor backward elimination guarantees the optimality of the selected submodel, even when both methods lead to the same result. This can happen especially when a pair of variables has jointly high predictive power; for example, if the dependent variable depends on the difference of two variables. An alternative approach, which aims at the optimality of the selected subset of variables - all-subsets regression - is based on forming a model for each subset of explanatory variables. Each model is estimated and a selected prediction or information criterion, which quantifies the unexplained variation of the dependent variable and the parsimony of the model, is evaluated. Finally, the model attaining the best value of the criterion is selected and the variables missing from this model are omitted.

This approach deserves several comments. First, one can use many other criteria instead of $AIC$ or $SIC$. These could be based on the test statistics of a joint hypothesis that a group of variables has zero coefficients, extensions or modifications of $AIC$ or $SIC$, general Bayesian predictive criteria, criteria using non-sample information, model selection based on estimated parameter values at each subsample, and so on. See the next subsection, [9], [45], [48], [47], [82], [83], [110], for instance, and Chap. III.1 for a more detailed overview.

Second, the evaluation and estimation of all submodels of a given regression model can be very computationally intensive, especially if the number of variables is large. This motivated tree-like algorithms searching through all submodels, but once they reject a submodel, they automatically reject all models containing only a subset of variables of the rejected submodel, see [26]. These so-called branch-and-bound techniques are discussed in [68], for instance.

An alternative computational approach, which is increasingly used in applications where the number of explanatory variables is very large, is based on the genetic programming (genetic algorithm, GA) approach, see [100]. Similarly to branch-and-bound methods, GAs perform a non-exhaustive search through the space of all submodels. The procedure works as follows. First, each submodel is represented by a ''chromosome'' - a vector of indicators, where the $k$th indicator signals whether the $k$th variable is included in the submodel. Next, to find the best submodel, one starts with an (initially randomly selected) population of submodels that are compared with each other in terms of information or prediction criteria. Further, this population is iteratively modified: in each step, pairs of submodels combine their characteristics (chromosomes) to create their offspring. This process can have many different forms, such as crossing over parts of the parents' chromosomes or applying a possibly non-zero random mutation. Whenever an offspring performs better than its ''parent'' models, it replaces them in the population. Performing this action for all pairs creates a new population. By repeating this population renewal, GAs search through the space of all available submodels and keep only the best ones in the population. Thus, GAs provide a rather effective way of obtaining the best submodel when the number of explanatory variables is very high, since the search is not exhaustive. See Chap. II.6 and [18] for a more detailed introduction to genetic programming.

### 8.1.3.4 Cross Validation

Cross validation (CV) is a general model-selection principle, proposed already in [89], which chooses a specific model in a similar way as the prediction criteria. CV compares models, which can include all variables or exclude some, based on their out-of-sample performance, typically measured by the sum of squared prediction errors. To achieve this, the sample is split into two disjoint parts: one part is used for estimation and the other part serves for checking the fit of the estimated model on ''new'' data (i.e., data which were not used for estimation) by comparing the observed and predicted values.

Probably the most popular variant is leave-one-out cross-validation (LOU CV), which can be used not only for model selection, but also for choosing nuisance parameters (e.g., in nonparametric regression; see [37]). Assume we have a set of competing models, each defined by the subset of variables included in the regression. For a given model, LOU CV evaluates

$$ CV = \sum_{i=1}^n \left(y_i - \hat{y}_{(i)}\right)^2, \qquad (8.7) $$

where $\hat{y}_{(i)}$ is the prediction at $x_i$ based on the model estimated from $y_{(i)}$ and $X_{(i)}$, the vector and matrix without their $i$th element and row, respectively. Thus, all but the $i$th observation are used for estimation and the $i$th observation is used to check the out-of-sample prediction. Having evaluated $CV$ for each model, we select the model attaining the minimum $CV$.

Unfortunately, LOU CV is not consistent as far as linear model selection is concerned. To obtain a consistent model selection method, it is necessary to omit $n_v$ observations from the sample used for estimation, where $n_v/n \to 1$ as $n \to \infty$. This fundamental result derived in [81] places a heavy computational burden on the model selection. Since our main use of CV in this chapter concerns nuisance parameter selection, we do not discuss this type of CV any further. See [68] and Chap. III.1 for further details.
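A direct implementation of (8.7) refits the model $n$ times; the sketch below does exactly that on simulated data (illustrative names, not an optimized routine - for LS there is also a closed-form shortcut via the hat matrix):

```python
import numpy as np

def lou_cv(X, y):
    # Leave-one-out CV (8.7): refit without observation i, predict y_i
    n = X.shape[0]
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs[i] = y[i] - X[i] @ beta
    return np.sum(errs ** 2)

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + 0.5 * rng.standard_normal(n)

cv_full = lou_cv(X, y)         # model with all three columns
cv_sub = lou_cv(X[:, :2], y)   # model dropping the irrelevant variable
```

For LS, the loop agrees with the well-known identity that the leave-one-out residual equals $e_i/(1 - h_{ii})$, where $h_{ii}$ are the diagonal elements of the hat matrix.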

Table 8.1. Variables selected from the Pollution data by forward selection, backward elimination, and all-subsets regression for different numbers of variables

Example 1   We compare several of the mentioned variable selection methods using a classical data set on air pollution used originally by [66], who modeled mortality depending on explanatory variables ranging from climate and air pollution to socioeconomic characteristics and who additionally demonstrated instabilities of LS estimates using this data set. We refer to the explanatory variables of the data Pollution simply by their numbers.

We applied the forward, backward, and all-subsets selection procedures to this data set. The results reported in Table 8.1 demonstrate that although all three methods can lead to the same subset of variables (e.g., if we search for a model consisting of two or three variables), this is not the case in general. For example, searching for a subset of four variables, the variables selected by backward and forward selection differ, and in both cases, the selected model is suboptimal (compared to all-subsets regression) in the sense of the unexplained variance measured by $RSS$.

## 8.1.4 Principal Components Regression

In some situations, it is not feasible to use variable selection to reduce the number of explanatory variables, or it is not desirable to do so. The first case can occur if the number of explanatory variables is large compared to the number of observations. The latter case is typical in situations when we observe many characteristics of the same type, for example, temperature or electro-impulse measurements from different sensors on a human body. They could possibly be correlated with each other, and there is no a priori reason why measurements at some points of a skull, for instance, should be significant while other ones would not be important at all. Since such data typically exhibit (exact) multicollinearity and we do not want to exclude some or even a majority of the variables, we have to reduce the dimension of the data in another way.

A general method that can be used both under near and exact multicollinearity is based on principal component analysis (PCA), see Chap. III.6. Its aim is to reduce the dimension of the explanatory variables by finding a small number of linear combinations of the explanatory variables $X$ that capture most of the variation in $X$ and to use these linear combinations as new explanatory variables instead of the original ones. Suppose that $G$ is an orthonormal matrix that diagonalizes the matrix $X^\top X$: $G^\top G = I$, $X^\top X = G\Lambda G^\top$, and $G^\top X^\top X G = \Lambda$, where $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)$ is a diagonal matrix of the eigenvalues of $X^\top X$.

Definition 3   Assume without loss of generality that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$ and that $g_1,\ldots,g_p$ are the corresponding eigenvectors (columns of the matrix $G$). The vector $Z_j = Xg_j$ for $j$ such that $\lambda_j > 0$ is called the $j$th principal component (PC) of $X$, and $g_j$ represents the corresponding loadings.

PCA tries to approximate the original matrix $X$ by projecting it into the lower-dimensional space spanned by the first $k$ eigenvectors $g_1,\ldots,g_k$. It can be shown that these projections capture most of the variability in $X$ among all linear combinations of columns of $X$, see [38].

Theorem 2   There is no standardized linear combination $Xa$, where $\|a\| = 1$, that has strictly larger variance than $Z_1 = Xg_1$: $\mathrm{Var}(Xa) \leq \mathrm{Var}(Z_1) = \lambda_1$. Additionally, the variance of the linear combination $Z = Xa$, $\|a\| = 1$, that is uncorrelated with the first $k$ principal components is maximized by the $(k+1)$st principal component $Z = Z_{k+1}$, $a = g_{k+1}$.

Consequently, one chooses a number $k$ of PCs that capture a sufficient amount of the data variability. This can be done by looking at the ratio $\sum_{j=1}^{k}\lambda_j / \sum_{j=1}^{p}\lambda_j$, which quantifies the fraction of the variance captured by the first $k$ PCs compared to the total variance of $X$.

In the regression context, the chosen PCs are used as new explanatory variables, and consequently, PCs with small eigenvalues can be important too. Therefore, one can alternatively choose the PCs that exhibit the highest correlation with the dependent variable $y$, because the aim is to use the selected PCs for regressing the dependent variable $y$ on them, see [49]. Moreover, for selecting ''explanatory'' PCs, it is also possible to use any variable selection method discussed in Sect. 8.1.3. Recently, [46] proposed a new data-driven PC selection for PCR.

Next, let us assume that we have selected a small number $k$ of PCs, $Z = (Z_1,\ldots,Z_k)$, by some rule such that the matrix $Z^\top Z$ has full rank. The principal components regression (PCR) is then performed by regressing the dependent variable $y$ on the selected PCs $Z$, which have a (much) smaller dimension than the original data $X$, and consequently, multicollinearity is diminished or eliminated, see [36]. We estimate this new model by LS,

$$ y = Z\gamma + \eta, $$

where $\gamma = (\gamma_1,\ldots,\gamma_k)^\top$. Comparing it with the original model (8.1) shows that the implied estimate of $\beta$ is $\hat{\beta}^{PCR} = \sum_{j=1}^{k}\hat{\gamma}_j g_j$. It is important to realize that in PCR we first fix $Z$ by means of PCA and then estimate $\gamma$.
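The two-step nature of PCR (first fix the PCs, then regress on them) can be sketched as follows; the names and simulated data are illustrative, and the columns of $X$ are centered so that the PCs describe variance:

```python
import numpy as np

def pcr(X, y, k):
    # Principal components regression with the first k PCs (a sketch)
    Xc = X - X.mean(axis=0)                   # center the explanatory variables
    eigvals, G = np.linalg.eigh(Xc.T @ Xc)    # eigenvalues in ascending order
    Gk = G[:, np.argsort(eigvals)[::-1][:k]]  # loadings of the k largest PCs
    Z = Xc @ Gk                               # principal components (scores)
    gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return Gk @ gamma                         # implied coefficients on X

rng = np.random.default_rng(6)
base = rng.standard_normal((100, 2))
X = np.column_stack([base, base[:, 0] + base[:, 1]])  # exactly collinear
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(100)
beta_pcr = pcr(X, y, 2)   # X'X is singular, yet PCR is well defined
```

Note that the LS estimator does not exist here (exact multicollinearity), while PCR with two components still produces a usable fit.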

Finally, concerning the different PC selection criteria, [7] demonstrates the superiority of the correlation-based PC selection and the convergence of many model-selection procedures toward its results. See also [24] for a similar comparison based on GA variable selection.

Example 2   Let us use the data Pollution to demonstrate several important issues concerning PCR. First, we identify the PCs of the data. The fraction of variance explained by the PCs as a function of their number is depicted in Fig. 8.1 (dashed line). On the one hand, almost all of the variance is captured by the first PC. On the other hand, the percentage of the variance explained by the first $k$ PCs (solid line) grows and reaches its maximum relatively slowly. Thus, the inclusion of a rather large number of PCs seems necessary when using this strategy.

On the other hand, using some variable selection method or checking the correlations of the PCs with the dependent variable reveals that several PCs beyond the leading ones exhibit the highest correlations with $y$, and naturally, a model using these PCs has more explanatory power ($R^2$) than the same number of leading PCs together. Thus, considering not only the PCs that capture most of the variability, but also those having large correlations with the dependent variable enables building more parsimonious models.

## 8.1.5 Shrinkage Estimators

We argued in Sect. 8.1.2 that an alternative way of dealing with the unpleasant consequences of multicollinearity lies in biased estimation: we can sacrifice a small bias for a significant reduction in the variance of an estimator so that its $MSE$ decreases. Since it holds for an estimator $\tilde{\beta}$ and a real constant $c \in (0,1)$ that $\mathrm{Var}(c\tilde{\beta}) = c^2\mathrm{Var}(\tilde{\beta}) < \mathrm{Var}(\tilde{\beta})$, biasing the estimator $\tilde{\beta}$ towards zero naturally leads to a reduction in variance. This observation motivates a whole class of biased estimators - shrinkage estimators - that are biased towards zero in all or just some of their components. In other words, they ''shrink'' the Euclidean norm of the estimates compared to that of the corresponding unbiased estimate. This is perhaps easiest to observe on the example of the Stein-rule estimator, which can be expressed in the linear regression model (8.1) as

$$ \hat{\beta}^{SR} = \left(1 - \frac{c\,s^2}{\hat{\beta}^\top X^\top X\hat{\beta}}\right)\hat{\beta}, \qquad (8.8) $$

where $c$ is an arbitrary scalar constant and $s^2$ represents an estimate of the residual variance [35]. Apparently, the Stein-rule estimator just multiplies the LS estimator by a constant smaller than one. See [35] and [52] for an overview of this and many other biased estimators.

In the following subsections, we discuss various shrinkage estimators that perform well under multicollinearity and that can possibly act as variable selection tools as well: the ridge regression estimator and its modifications (Sect. 8.1.6), continuum regression (Sect. 8.1.7), the Lasso estimator and its variants (Sect. 8.1.8), and partial least squares (Sect. 8.1.9). Let us note that there are also other shrinkage estimators, which either do not perform well under various forms of multicollinearity (e.g., the Stein-rule estimator) or are discussed in other parts of this chapter (e.g., the pre-test and PCR estimators in Sects. 8.1.3 and 8.1.4, respectively).

## 8.1.6 Ridge Regression

Probably the best known shrinkage estimator is the ridge estimator proposed and studied by [43]. Having a non-orthogonal or even nearly singular matrix $X^\top X$, one can add a positive constant $k$ to its diagonal to improve its conditioning.

Definition 4   The ridge regression (RR) estimator is defined for model (8.1) by

$$ \hat{\beta}^{RR} = (X^\top X + kI)^{-1} X^\top y \qquad (8.9) $$

for some ridge parameter $k > 0$.

''Increasing'' the diagonal of $X^\top X$ before inversion shrinks $\hat{\beta}^{RR}$ compared to $\hat{\beta}$ and introduces a bias. Additionally, [43] also showed that the derivative of $MSE(\hat{\beta}^{RR})$ with respect to $k$ is negative at $k = 0$. This indicates that the bias

$$ \mathrm{Bias}(\hat{\beta}^{RR}) = E(\hat{\beta}^{RR}) - \beta = -k(X^\top X + kI)^{-1}\beta $$

can be smaller than the decrease in variance (here for a homoscedastic linear model with error variance $\sigma^2$)

$$ \mathrm{Var}(\hat{\beta}^{RR}) = \sigma^2 (X^\top X + kI)^{-1} X^\top X (X^\top X + kI)^{-1} $$

caused by shrinking, at least for some values of $k$. The intervals for $k$ where RR dominates LS are derived, for example, in [19], [35] and [75]. Moreover, the improvement in $MSE$ with respect to LS is significant under multicollinearity while being negligible for nearly orthogonal systems. A classical result for model (8.1) under normal errors states that $MSE(\hat{\beta}^{RR}) - MSE(\hat{\beta})$ is negative definite if $k < 2\sigma^2/\beta^\top\beta$, see [96], where an operational estimate of $k$ is discussed too. Notice however that the conditions for the dominance of RR and some other shrinkage estimators over LS can look quite different in the case of non-normal errors [95].

In applications, an important question remains: how to choose the ridge parameter $k$? In the original paper by [43], the use of the ridge trace, a plot of the components of the estimated $\hat{\beta}^{RR}$ against $k$, was advocated. If the data exhibit multicollinearity, one usually observes a region of instability for $k$ close to zero and then stable estimates for large values of the ridge parameter $k$. One should choose the smallest $k$ lying in the region of stable estimates. Alternatively, one could search for the $k$ minimizing $MSE(\hat{\beta}^{RR})$; see the subsection on generalized RR for more details. Furthermore, many other methods for model selection can be employed too; for example, LOU CV (Sect. 8.1.3) performed on a grid of $k$ values is often used in this context.
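A minimal sketch of the estimator (8.9) and a crude ridge trace on simulated near-collinear data (illustrative names of our own):

```python
import numpy as np

def ridge(X, y, k):
    # RR estimator (8.9): (X'X + k I)^{-1} X'y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
x1 = rng.standard_normal(50)
x2 = x1 + 0.01 * rng.standard_normal(50)   # near multicollinearity
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.standard_normal(50)

# The norm of the estimate shrinks monotonically as k grows
ks = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge(X, y, k)) for k in ks]
```

At $k = 0$ the estimator coincides with LS; increasing $k$ shrinks the coefficient vector, which is the behavior one inspects on a ridge trace.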

Statistics important for inference based on RR estimates are discussed in [43] and [96], both for the case of a fixed $k$ and for some data-driven choices. Moreover, the latter work also describes algorithms for a fast and efficient RR computation.

To conclude, let us note that the $b^{RR}(k)$ estimator in model (8.1) can also be defined as a solution of the restricted minimization problem

$$b^{RR}(k) = \operatorname{argmin}_{b:\,\|b\|_2 \le r} \; (y - Xb)^\top (y - Xb), \qquad (8.10)$$

or equivalently as

$$b^{RR}(k) = \operatorname{argmin}_{b} \; (y - Xb)^\top (y - Xb) + k\,\|b\|_2^2, \qquad (8.11)$$

where $r$ represents a tuning parameter corresponding to $k$ ([91]). This formulation was used by [70], for instance. Moreover, (8.10) reveals one controversial issue in ridge regression: rescaling of the original data to make $X^\top X$ a correlation matrix. Although no requirements of this kind are necessary for the theoretical results, standardization is often recommended to make the influence of the constraint the same for all variables. There are also studies showing adverse effects of this standardization on estimation, see [96] for a discussion. A possible solution is generalized ridge regression, which assigns to each variable its own ridge parameter (see the next paragraph).
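The equivalence of the constrained and penalized formulations can be verified numerically: for a fixed $k$, the ridge solution attains the smallest residual sum of squares among all coefficient vectors of no larger norm. A sketch under illustrative assumptions (synthetic data, arbitrary $k$):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 30, 3, 2.0
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Penalized form: closed-form ridge solution for a fixed k.
b_k = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

def rss(b):
    e = y - X @ b
    return e @ e

# Constrained form: no b with ||b|| <= ||b_k|| fits better.
r = np.linalg.norm(b_k)
for _ in range(200):
    d = rng.normal(size=p)
    b = r * rng.uniform() * d / np.linalg.norm(d)   # random point in the ball
    assert rss(b) >= rss(b_k)
```

By convexity, the constraint with radius $r = \|b^{RR}(k)\|$ is active at the optimum, which is exactly the correspondence between the two tuning parameters.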

### 8.1.6.1 Generalized Ridge Regression

The $b^{RR}(k)$ estimator can be generalized in the sense that each diagonal element of $X^\top X$ is modified separately. To achieve that, let us recall that this matrix can be diagonalized: $X^\top X = G \Lambda G^\top$, where $G$ is an orthonormal matrix and $\Lambda$ is a diagonal matrix containing the eigenvalues $\lambda_1, \ldots, \lambda_p$.

Definition 5   The generalized ridge regression (GRR) estimator is defined for model (8.1) by

$$b^{GRR}(K) = (X^\top X + G K G^\top)^{-1} X^\top y \qquad (8.12)$$

for a diagonal matrix $K = \operatorname{diag}(k_1, \ldots, k_p)$ of ridge parameters.

The main advantage of this generalization being ridge coefficients specific to each variable, it is important to know how to choose the matrix $K$. In [43] the following result is derived.

Theorem 3   Assume that $X$ in model (8.1) has full rank, $E\varepsilon = 0$, and $\operatorname{Var}(\varepsilon) = \sigma^2 I_n$. Further, let $X = H \Lambda^{1/2} G^\top$ be the singular value decomposition of $X$ and $\gamma = G^\top \beta$. The MSE-minimizing choice of $K$ in (8.12) is $K = \operatorname{diag}(\sigma^2/\gamma_1^2, \ldots, \sigma^2/\gamma_p^2)$.

An operational version (feasible GRR) is based on an unbiased estimate $s^2$ of $\sigma^2$ and the LS-based estimate $\hat\gamma = G^\top b^{LS}$. See [43] and [96], where you also find the bias and MSE of this operational estimator, and [98] for further extensions of this approach. Let us note that the feasible GRR (FGRR) estimator does not have to possess the MSE-optimality property of GRR because the optimal choice of $K$ is replaced by an estimate. Nevertheless, the optimality property of FGRR is preserved under a condition on $\lambda_i$, $\sigma^2$, and the $i$th element $\gamma_i$ of $\gamma$ ([30]).

Additionally, given an estimate of the MSE-minimizing matrix $K$, many authors proposed to choose the ridge parameter $k$ in ordinary ridge regression as the harmonic mean of the elements $k_1, \ldots, k_p$ of $K$; see [44], for instance.
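A short sketch of generalized ridge regression with the MSE-minimizing choice of $K$ (here $\beta$ and $\sigma^2$ are assumed known purely for illustration); it also checks the algebraic identity that (8.12) amounts to component-wise shrinkage of the LS estimate in the eigenbasis of $X^\top X$:

```python
import numpy as np

# GRR: each eigendirection of X'X gets its own ridge parameter
# k_i = sigma^2 / gamma_i^2 (the choice of Theorem 3; beta and
# sigma^2 are illustrative assumptions).
rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.normal(size=(n, p))
beta = np.array([1.5, -1.0, 0.5])
sigma2 = 1.0
y = X @ beta + rng.normal(size=n)

lam, G = np.linalg.eigh(X.T @ X)        # X'X = G diag(lam) G'
gamma = G.T @ beta                      # orthogonalized parameters
K = np.diag(sigma2 / gamma**2)          # MSE-minimizing ridge matrix

b_grr = np.linalg.solve(X.T @ X + G @ K @ G.T, X.T @ y)

# Equivalent computation in the eigenbasis: shrink each component of
# the LS estimate by lam_i / (lam_i + k_i).
g_ls = np.linalg.solve(np.diag(lam), G.T @ (X.T @ y))
b_alt = G @ (lam / (lam + np.diag(K)) * g_ls)
assert np.allclose(b_grr, b_alt)
```

The feasible version would simply replace $\gamma$ and $\sigma^2$ by $G^\top b^{LS}$ and an unbiased residual-based estimate $s^2$.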

### 8.1.6.2 Almost Unbiased Ridge Regression

Motivated by results on FGRR, [53] proposed to correct the ridge estimator for its bias using the first-order bias approximation. This yields the almost unbiased ridge regression (AURR) estimator

$$b^{AURR}(k) = b^{RR}(k) + k\,(X^\top X + kI)^{-1}\beta.$$

The true parameter value $\beta$ being unknown, [71] defined a feasible AURR estimator by replacing the unknown $\beta$ by $b^{RR}(k)$ and $kI$ by the employed ridge matrix. Additionally, a comparison of the GRR and feasible AURR estimators with respect to MSE proved that the latter has a smaller MSE in a wide range of the parameter space. A similar observation was also made under a more general loss function in [97]. Furthermore, [2] derived exact formulas for the moments of the feasible AURR estimator.
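Assuming the standard ridge form $b^{RR}(k) = (X^\top X + kI)^{-1} X^\top y$, the feasible bias-corrected estimator obtained by plugging $b^{RR}(k)$ into the first-order bias term coincides with $[I - k^2 (X^\top X + kI)^{-2}]\, b^{LS}$, a known closed form of the almost unbiased ridge estimator. A numerical sketch of this identity (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 40, 3, 1.5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.5, -1.0]) + rng.normal(size=n)

S = X.T @ X
W = np.linalg.inv(S + k * np.eye(p))
b_ls = np.linalg.solve(S, X.T @ y)
b_rr = W @ X.T @ y

# Feasible AURR: first-order bias correction with beta replaced by b_rr.
b_aurr = b_rr + k * W @ b_rr
# Equivalent closed form [I - k^2 (S + kI)^{-2}] b_ls.
b_alt = (np.eye(p) - k**2 * W @ W) @ b_ls
assert np.allclose(b_aurr, b_alt)
```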

### 8.1.6.3 Further Extensions

Ridge regression can be applied also under exact multicollinearity, which arises, for example, in data with more variables than observations. Although the theory and application of RR are the same as in the full-rank case, the computational burden of the required matrix operations becomes too high for $p \gg n$. A faster algorithm with a much lower computational complexity was found by [39].

Moreover, there are many further extensions of the ridge regression principle that go beyond the extent of this chapter. To mention at least some of them, let us refer the reader to works comparing or combining various ridge and shrinkage approaches [55]; [84]; [87] and to the monograph by [35].

Example 3   Using the data Pollution once again, we estimated $b^{RR}(k)$ over a grid of ridge parameters $k$ and plotted the estimated coefficients as functions of $k$ (ridge trace plot), see Fig. 8.2. For the sake of simplicity, we restricted ourselves only to variables that were selected by some variable selection procedure in Table 8.1. The plot shows the effect of the ridge parameter $k$ on the slope estimates ($k = 0$ corresponds to LS). Apparently, the slopes of some variables are affected very little, some significantly (e.g., one magnitude increases more than twice), and some variables shrink extremely. In all cases, the biggest change occurs close to $k = 0$ and the estimates gradually stabilize for larger values of $k$. The vertical dashed line in Fig. 8.2 represents the estimate of the optimal ridge parameter $k$.

## 8.1.7 Continuum Regression

The ridge regression discussed in Sect. 8.1.6 is very closely connected with the continuum regression proposed by [16] as a unifying approach to the least squares, principal components regression, and partial least squares (see Sect. 8.1.9) estimation.

Definition 6   A continuum regression (CR) estimator of model (8.1) is a coefficient vector $c$ maximizing the function

$$T_\alpha(c) = (c^\top s)^2\,(c^\top S c)^{\alpha/(1-\alpha)-1} \qquad (8.13)$$

for a given value of the parameter $\alpha \in [0, 1)$ and a given length $\|c\| = 1$, where $s = X^\top y$ and $S = X^\top X$.

This definition yields estimates proportional to LS for $\alpha = 0$, to PCR for $\alpha \to 1$, and to the yet-to-be-discussed partial least squares for $\alpha = 1/2$. Apart from this, the advantage of CR is that one can adaptively select among the methods by searching for an optimal $\alpha$. To determine it, [16] used cross validation.
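The limiting case $\alpha = 0$ can be illustrated numerically. Assuming the criterion $T_\alpha(c) = (c^\top s)^2 (c^\top S c)^{\alpha/(1-\alpha)-1}$ with $s = X^\top y$ and $S = X^\top X$, setting $\alpha = 0$ gives $(c^\top s)^2/(c^\top S c)$, which by the Cauchy-Schwarz inequality is maximized by the LS direction $c \propto S^{-1} s$ (synthetic data below, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)
s, S = X.T @ y, X.T @ X

def T0(c):
    """CR criterion at alpha = 0: (c's)^2 / (c'Sc), scale-invariant."""
    return (c @ s) ** 2 / (c @ S @ c)

c_ls = np.linalg.solve(S, s)            # LS direction
# T0(c) <= s' S^{-1} s for every c, with equality at c proportional to c_ls.
for _ in range(100):
    c = rng.normal(size=p)
    assert T0(c) <= T0(c_ls) + 1e-9
```

Analogous numerical checks show the PCR direction (first eigenvector of $S$) emerging as $\alpha \to 1$.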

The relationship between CR and RR was indicated already in [90], but the most important result came after [14] uncovered possible discontinuities of CR estimates as a function of the data and $\alpha$. In an attempt to remedy the discontinuity of the original CR, [15] not only proposed to maximize a modified criterion with an additional parameter $\delta > 0$ instead of $T_\alpha$ from Definition 6 ($\delta$ can be chosen by cross validation), but also proved the following proposition.

Theorem 4   If a regressor $c$ is defined according to

$$c = \operatorname{argmax}_{\|c\|=1} f\{(c^\top s)^2,\, c^\top S c\},$$

where $s = X^\top y$, $S = X^\top X$, $f(K, V)$ is increasing in $K$ for constant $V$ and decreasing in $V$ for constant $K$, and finally, if $y$ is not orthogonal to all eigenvectors corresponding to the largest eigenvalue of $S$, then there exists a number $k$ such that $c$ is proportional to the ridge solution $(S + kI)^{-1} s$, including the limiting cases $k \to 0$ and $k \to \pm\infty$.

Thus, the ridge regression estimator fundamentally underlies many methods dealing with multicollinear and reduced-rank data, such as the mentioned CR and partial least squares. Notice however that negative values of the ridge coefficient $k$ have to be admitted here.

Finally, let us note that CR can be extended to models with multiple response variables ([17]).

## 8.1.8 Lasso

The ridge regression discussed in Sect. 8.1.6 motivates another shrinkage method: Lasso (least absolute shrinkage and selection operator) by [93]. Formulation (8.10) states that RR can be viewed as minimization with respect to an upper bound on the $L_2$ norm of the estimate $b$. A natural extension is to consider constraints on the $L_q$ norm, $q > 0$. Specifically, [93] studied the case of $q = 1$, that is, the $L_1$ norm.

Definition 7   The Lasso estimator for the regression model (8.1) is defined by

$$b^{L} = \operatorname{argmin}_{b:\,\sum_{j=1}^{p} |b_j| \le r} \; (y - Xb)^\top (y - Xb), \qquad (8.14)$$

where $r \ge 0$ is a tuning parameter.

Lasso is a shrinkage estimator that has one specific feature compared to ordinary RR. Because of the geometry of the $L_1$-norm restriction, Lasso shrinks the effect of some variables and eliminates the influence of the others, that is, it sets their coefficients to zero. Thus, it combines regression shrinkage with variable selection, and as [93] demonstrated also by means of simulations, it compares favorably to all-subsets regression. In this context, it is interesting that Lasso could be formulated as a special case of the least angle regression by [27]. Finally, let us note that to achieve the same kind of shrinking and variable-selection effects for all variables, they should be standardized before being used in Lasso; see [68] for details.

As far as the inference for the Lasso estimator is concerned, [57] recently studied the asymptotic distribution of Lasso-type estimators using the $L_q$-norm condition with $q \le 1$, including the behavior under nearly singular designs.

Now, it remains to find out how Lasso estimates can be computed. Equation (8.14) indicates that one has to solve a restricted quadratic optimization problem. Writing each coefficient as $b_j = b_j^+ - b_j^-$ with $b_j^+ = \max(b_j, 0)$ and $b_j^- = \max(-b_j, 0)$, the restriction can be written as the linear constraints $b_j^+ \ge 0$, $b_j^- \ge 0$, and $\sum_{j=1}^{p} (b_j^+ + b_j^-) \le r$. Thus, convergence is assured in a finite number of steps. Additionally, the unknown tuning parameter $r$ is to be selected by means of cross validation. Further, although solving (8.14) is straightforward in usual regression problems, it can become very demanding for reduced-rank data, $p > n$. [72] treated Lasso as a convex programming problem, and by formulating its dual problem, developed an efficient algorithm usable even for $p \gg n$.
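For the usual case $n > p$, a simple way to compute Lasso-type estimates is coordinate descent with soft-thresholding applied to the penalized (Lagrangian) counterpart $\min_b \tfrac{1}{2}\|y - Xb\|^2 + t \sum_j |b_j|$ of (8.14). The sketch below is not the algorithm of [93] or [72]; it is a minimal alternative on synthetic data with an illustrative penalty level:

```python
import numpy as np

def soft_threshold(z, t):
    """Solution of min_b (1/2)(b - z)^2 + t|b| up to scaling."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, t, n_iter=200):
    """Coordinate descent for (1/2)||y - Xb||^2 + t * sum_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual
            b[j] = soft_threshold(X[:, j] @ r, t) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(size=n)

b = lasso_cd(X, y, t=50.0)
# Lasso sets some coefficients exactly to zero (variable selection) ...
assert np.sum(b == 0.0) > 0
# ... while keeping the strong variable in the model.
assert abs(b[0]) > 1.0
```

Tracing the solution over a grid of penalty levels reproduces the kind of coefficient paths shown in Fig. 8.3.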

Example 4   Let us use the data Pollution once more to exemplify the use of Lasso. To summarize the Lasso results, we use the same kind of plot as [93] and [27], see Fig. 8.3. It contains the standardized slope estimates as a function of the constraint $r$, which is represented by a relative index whose maximal value corresponds to the unconstrained LS estimate. Moreover, to keep the graph simple, we plotted again only the variables that were selected by the variable selection procedures in Table 8.1.

In Fig. 8.3, we can observe which variables are included in the regression (have a nonzero coefficient) as the tuning parameter $r$ increases. Clearly, the order in which the first of these variables enter the model closely resembles the results of the variable selection procedures in Table 8.1. Thus, Lasso combines shrinkage estimation and variable selection: at a given constraint level $r$, it shrinks the coefficients of some variables and removes the others by setting their coefficients equal to zero.

## 8.1.9 Partial Least Squares

A general modeling approach to most of the methods covered so far is the continuum regression of Sect. 8.1.7, which has two ''extremes'': LS for $\alpha = 0$ and PCR for $\alpha \to 1$. The partial least squares (PLS) regression lies in between: it is a special case of (8.13) for $\alpha = 1/2$, see [16]. Originally proposed by [104], it was presented as an algorithm that searches for linear combinations of explanatory variables best explaining the dependent variable. Similarly to PCR, PLS also aims especially at situations where the number of explanatory variables is large compared to the number of observations. Here we present the PLS idea and algorithm themselves as well as the latest results on variable selection and inference in PLS.

Having many explanatory variables $x_1, \ldots, x_p$, the aim of the PLS method is to find a small number of their linear combinations $t_1, \ldots, t_c$, thought of as latent variables, explaining the observed responses

$$y_i = \sum_{j=1}^{c} q_j t_{ij} + \varepsilon_i \qquad (8.15)$$

(see [33] and [40]). Thus, similarly to PCR, PLS reduces the dimension of the data, but the criterion for searching the linear combinations is different. Most importantly, it does not depend only on the values of $X$, but on $y$ too.

Let us now present the PLS algorithm itself, which defines yet another shrinkage estimator as shown by [20] and [51]. (See [75] for more details and [33] for an alternative formulation.) The indices $t_j$ are constructed one after another. Estimating the intercept by $\bar{y}$, let us start with the centered variables $y^{(0)} = y - \bar{y}$ and $X^{(0)} = X - \bar{X}$ and set $j = 1$.

1. Define the index $t_j = X^{(j-1)} X^{(j-1)\top} y^{(j-1)}$. This linear combination is given by the covariances of the unexplained part $y^{(j-1)}$ of the response variable and the unused parts $X^{(j-1)}$ of the explanatory variables.

2. Regress the current explanatory matrix $X^{(j-1)}$ on the index $t_j$,

$$p_j = (t_j^\top t_j)^{-1} X^{(j-1)\top} t_j,$$

and the yet-unexplained part of the response on the index $t_j$,

$$q_j = (t_j^\top t_j)^{-1} y^{(j-1)\top} t_j,$$

thus obtaining the $j$th regression coefficient.

3. Compute the residuals, that is, the remaining parts of the explanatory and response variables: $X^{(j)} = X^{(j-1)} - t_j p_j^\top$ and $y^{(j)} = y^{(j-1)} - q_j t_j$. This implies that the indices $t_j$ and $t_k$ are not correlated for $j \ne k$.

4. Iterate by setting $j = j + 1$ or stop if $j$ is large enough.
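The four steps above can be sketched directly in code (univariate response; the data are illustrative assumptions). The assertion checks the property stated in step 3, namely that the constructed indices are mutually uncorrelated:

```python
import numpy as np

def pls1(X, y, c):
    """PLS for a univariate response: return the n x c matrix of indices t_j."""
    Xj, yj = X - X.mean(axis=0), y - y.mean()    # centered variables
    T = []
    for _ in range(c):
        w = Xj.T @ yj                 # step 1: covariance weights
        t = Xj @ w                    #         index t_j
        p_j = Xj.T @ t / (t @ t)      # step 2: regress X on t_j
        q_j = yj @ t / (t @ t)        #         regress y on t_j
        Xj = Xj - np.outer(t, p_j)    # step 3: residual parts
        yj = yj - q_j * t
        T.append(t)                   # step 4: iterate
    return np.column_stack(T)

rng = np.random.default_rng(4)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

T = pls1(X, y, c=3)
# The indices are mutually uncorrelated, as claimed in step 3.
assert np.allclose(np.corrcoef(T.T), np.eye(3), atol=1e-6)
```

Regressing $y$ on the columns of the returned matrix recovers the coefficients $q_j$ of model (8.15).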

This algorithm provides us with the indices $t_j$, which define the analogs of the principal components in PCR, and the corresponding regression coefficients $q_j$ in (8.15). The main open question is how to choose the number of components $c$. The original method proposed by [105] is based on cross validation. Provided that $CV(c)$ from (8.7) represents the CV index of the PLS estimate with $c$ factors, an additional index is added if Wold's criterion, the ratio of the CV indices of the models with $c + 1$ and $c$ factors, is smaller than $1$. This selects the first local minimum of the CV index, which is superior to finding the global minimum of $CV(c)$ as shown in [73]. Alternatively, one can stop already when Wold's criterion exceeds a prescribed bound smaller than one (modified Wold's criteria) or use other variable selection criteria. In a recent simulation study, [61] showed that the modified Wold's criterion is preferable to the original one. Furthermore, similarly to the other methods discussed, there are further proposals for the component selection, see [58] for instance.

Next, the first results on the asymptotic behavior of PLS appeared only during the last decade. The asymptotic behavior of the prediction errors was examined by [41]. The covariance matrix of PLS estimates and the confidence and prediction intervals based on them were first studied by [23], but a more compact expression was presented in [74]. It is omitted here due to the many technicalities required for its presentation. There are also attempts to find a sample-specific prediction error of PLS, which were compared by [29].

Finally, note that there are many extensions of the presented algorithm, which is usually denoted PLS1. First of all, there are extensions of PLS1 to models with multiple dependent variables, see [50] and [32] for instance, which choose linear combinations (latent variables) not only within the explanatory variables, but do the same also in the space spanned by the dependent variables. A recent survey of these and other so-called two-block methods is given in [101]. PLS was also adapted for on-line process modeling, see [78] for a recursive algorithm. Additionally, in an attempt to simplify the interpretation of PLS results, [94] proposed orthogonalized PLS. See [108] for further details on recent developments.

Example 5   Let us use again the data Pollution, although it is not a typical application of PLS. As explained in Sects. 8.1.7 and 8.1.9, PCR and PLS are both based on the same principle (searching for linear combinations of the original variables), but use different objective functions. To demonstrate the difference, we estimated PLS for an increasing number of latent variables and plotted the fractions of the $X$ and $y$ variance explained by the latent variables in the same way as in Fig. 8.1. Both curves are in Fig. 8.4. Almost all of the variability in $X$ is captured by the first latent variable, although this percentage is smaller than in the case of PCR. On the other hand, the percentage of the variance of $y$ explained by the first latent variables increases faster than in the case of PCR, see Fig. 8.4 (solid versus dotted line).

## 8.1.10 Comparison of the Methods

The methods discussed in Sects. 8.1.3-8.1.9 aim at the estimation of (nearly) singular problems and are often very closely related, see Sect. 8.1.7. Here we provide several references to studies comparing the discussed methods.

First, an extensive simulation study comparing variable selection, RR, PCR, and PLS regression methods is presented in [32]. Although the results are conditional on the simulation design used in the study, they indicate that RR, PCR, and PLS are, in the case of ill-conditioned problems, highly preferable to variable selection. The differences between the best methods, RR and PLS, are rather small, and the same holds for the comparison of PLS and PCR, which seems to be slightly worse than PLS. An empirical comparison of PCR and PLS was also done by [103] with the same result. Next, the fact that neither PCR nor PLS asymptotically dominates the other method was proved in [41] and further discussed in [40]. A similar asymptotic result was also given by [88]. Finally, the fact that RR should not perform worse than PCR and PLS is supported by Theorem 4 in Sect. 8.1.7.
