This section discusses the least trimmed squares (LTS) estimator, its robustness and asymptotic properties, and its computational aspects.
First of all, let us make the verbal description of the estimator given in the previous section precise. Consider a linear regression model for a sample $(y_i, x_i)$, $i = 1, \dots, n$, with a response variable $y_i$ and a vector of $p$ explanatory variables $x_i$:
$$ y_i = x_i^{\top} \beta + \varepsilon_i, \qquad i = 1, \dots, n. $$
Denoting the ordered squared residuals by $r^2_{(1)}(\beta) \le \dots \le r^2_{(n)}(\beta)$, where $r_i(\beta) = y_i - x_i^{\top} \beta$, the least trimmed squares estimator is defined as
$$ \hat{\beta}^{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r^2_{(i)}(\beta), \qquad (2.2) $$
where the trimming constant $h$ satisfies $n/2 \le h \le n$.
Before proceeding to the description of how such an estimate can be evaluated in XploRe, several issues have to be discussed, namely, the existence of this estimator and its statistical properties (a discussion of its computational aspects is postponed to Subsection 2.2.2).
First, the existence of the optimum in (2.2) can be justified under some reasonable assumptions in the following way: the minimization of the objective function in (2.2) can be viewed as a process in which we repeatedly choose a subsample of $h$ observations and find a vector $\beta$ minimizing the sum of squared residuals for the selected subsample. Doing this for every subsample (there are $\binom{n}{h}$ of them), we obtain $\binom{n}{h}$ candidates for the LTS estimate, and the one that yields the smallest value of the objective function is the final estimate. Therefore, the existence of the LTS estimator is essentially equivalent to the existence of the least squares estimator for subsamples of size $h$.
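The enumeration argument above translates directly into code. The following is a minimal Python sketch (not part of XploRe; the function names are my own) of the exact LTS computation by full search over all $\binom{n}{h}$ subsamples:

```python
from itertools import combinations

import numpy as np


def lts_objective(X, y, beta, h):
    """Sum of the h smallest squared residuals over the whole sample."""
    r2 = np.sort((y - X @ beta) ** 2)
    return r2[:h].sum()


def lts_exact(X, y, h):
    """Exact LTS estimate via full search: fit ordinary least squares on
    every subsample of size h and keep the candidate with the smallest
    LTS objective.  Feasible only for very small samples, since there
    are C(n, h) subsamples to examine."""
    n = len(y)
    best_beta, best_val = None, np.inf
    for subset in combinations(range(n), h):
        idx = list(subset)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        val = lts_objective(X, y, beta, h)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val
```

The combinatorial number of subsamples makes this full search impractical beyond very small samples, which is what motivates the approximation algorithm discussed later.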
Let us now briefly discuss various statistical properties of LTS. First, the least trimmed squares estimator is regression, scale, and affine equivariant (see, for example, Rousseeuw and Leroy; 1987, Lemma 3, Chapter 3). We have also already remarked that the breakdown point of LTS reaches the upper bound for regression equivariant estimators if the trimming constant $h$ equals $\lfloor n/2 \rfloor + \lfloor (p+1)/2 \rfloor$. Furthermore, the $\sqrt{n}$-consistency and asymptotic normality of LTS can be proved for a general linear regression model with continuously distributed disturbances (Víšek; 1999b).
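For concreteness, the maximal-breakdown choice of the trimming constant and the corresponding breakdown bound (Rousseeuw and Leroy; 1987) can be computed as follows; this Python sketch is my own illustration, not part of the quantlib:

```python
def lts_default_h(n, p):
    """Trimming constant h = floor(n/2) + floor((p+1)/2), the choice that
    maximizes the breakdown point of LTS (Rousseeuw and Leroy, 1987)."""
    return n // 2 + (p + 1) // 2


def max_breakdown_point(n, p):
    """Upper bound (floor((n - p) / 2) + 1) / n on the breakdown point of
    regression equivariant estimators, attained by LTS with the above h."""
    return ((n - p) // 2 + 1) / n
```

For example, with $n = 24$ observations and $p = 2$ regressors, the default trimming constant is 13 and the attainable breakdown point is 0.5.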
Besides these important statistical properties, there are also some less practical aspects. The main one follows directly from the discontinuity of the LTS objective function: because of it, the sensitivity of the least trimmed squares estimator to a change of one or several observations can sometimes be rather high (Víšek; 1999a). This property, often referred to as high subsample sensitivity, is closely connected with the possibility that a change or omission of some observations may considerably alter the subset of the sample that is treated as the set of ``correct'' data points. This does not necessarily have to be seen as a disadvantage; the verdict depends on the purpose for which LTS is used. See Víšek (1999b) and Section 2.3 for further information.
The quantlet of the quantlib metrics that serves for least trimmed squares estimation is lts. To understand the function of its parameters, the algorithm used for the evaluation of LTS has to be described first; the description of the quantlet itself follows afterwards.
There are two possible strategies for determining the least trimmed squares estimate. The first one relies on a full search through all subsamples of size $h$ and the consecutive LS estimation as described in the previous section, and thus yields the exact solution (neglecting ubiquitous numerical errors). Unfortunately, it is hardly possible to examine the total of $\binom{n}{h}$ subsamples unless a very small sample is analyzed. Therefore, in most cases (when the sample is larger) only an approximation can be computed (note, please, that in the examples presented here we compute the exact LTS estimates as described above, and thus the computation is relatively slow). The present algorithm computes the approximation in the following way: having randomly selected an $h$-tuple of observations, we apply the least squares method to it, and for the estimated regression coefficients we evaluate the residuals of all $n$ observations. Then the $h$-tuple of data points with the smallest squared residuals is selected and the LS estimation takes place again. This step is repeated as long as the sum of the $h$ smallest squared residuals decreases. When no further improvement can be found this way, a new subsample of $h$ observations is randomly generated and the whole process is repeated. The search is stopped either when the same estimate of the model has been found a prescribed number of times or when an a priori given number of randomly generated subsamples has been examined. A more refined version of this algorithm, suitable also for large data sets, was proposed and described by Rousseeuw and Van Driessen (1999).
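The random-start iteration just described can be sketched in Python as follows (a simplified illustration with my own function names, not the actual lts implementation; the stopping rule based on repeatedly finding the same estimate is omitted for brevity):

```python
import numpy as np


def lts_approx(X, y, h, n_starts=50, seed=0):
    """Random-start approximation of LTS.  From a random h-tuple, alternate
    (i) LS fitting on the current h-tuple and (ii) reselecting the h
    observations with the smallest squared residuals, until the sum of the
    h smallest squared residuals stops decreasing; restart n_starts times
    and keep the overall best fit."""
    rng = np.random.default_rng(seed)
    n = len(y)
    best_beta, best_val = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=h, replace=False)
        beta, val = None, np.inf
        while True:
            b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            r2 = (y - X @ b) ** 2
            v = np.sort(r2)[:h].sum()
            if v >= val:
                break  # no further decrease: local optimum reached
            beta, val = b, v
            idx = np.argsort(r2)[:h]  # h points with smallest residuals
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val
```

Each pass of the inner loop can only decrease the objective, so the iteration terminates; the quality of the approximation depends on the number of random starts.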
From now on, the noninteractive quantlet lts is going to be described. The quantlet expects at least two input parameters: an $n \times p$ matrix x that contains the $n$ observations of each of the $p$ explanatory variables and an $n \times 1$ vector y of the $n$ observed responses. If an intercept is to be included in the regression model, a column vector of ones can be concatenated to the matrix x in the following way:
x = matrix(rows(x))~x

Neither the matrix x nor the vector y may contain missing (NaN) or infinite (Inf, -Inf) values; such values have to be identified and removed before lts is called.
The basic call

b = lts(x,y)

results in an approximation of the LTS estimate for the most robust choice of the trimming constant $h$.
The call

b = lts(x,y,h)

computes the estimate for an explicitly given trimming constant $h$. The last two parameters of the quantlet, namely all and mult, provide a way to influence how the estimate is actually computed. The parameter all allows one to switch from the approximation algorithm, which corresponds to all equal to 0 and is used by default, to the exact computation of LTS, which takes place if all is nonzero. As the exact calculation can take quite a long time unless the given sample is really small, a warning, together with the possibility to cancel the evaluation, is issued whenever the total number of iterations is too high. Finally, the last parameter mult, which equals 1 by default, offers the possibility to adjust the maximum number of randomly generated subsamples used by the approximation algorithm; this maximum is calculated from the size of the given sample and the trimming constant, and is subsequently multiplied by mult.
As a real example, let us show how the time trend in the phonecal data set was estimated in Section 2.1. The data set is two-dimensional, having only one explanatory variable x, year, in the first column and the response variable y, the number of international phone calls, in the second column. In order to obtain the LTS estimate for the linear regression of y on a constant term and x, you have to type at the command line or in the editor window
z = read("phonecal")
x = matrix(rows(z)) ~ z[,2]
y = z[,3]
b = lts(x,y)
b
Contents of coefs
[1,]  -5.6522
[2,]   0.11649