Neural networks of type MLP describe a mapping of the input variables $x = (x_1, \ldots, x_d)^\top$ onto the output variable $y$. We will restrict ourselves in this section to the case where the network has only one hidden layer and the output variable is univariate. Then $y$ as a function of $x$ has the form
$$ y = \nu_H(x;\vartheta) = v_0 + \sum_{h=1}^{H} v_h\, \psi\Big( w_{0h} + \sum_{j=1}^{d} w_{jh}\, x_j \Big), \tag{19.1} $$
where $H$ is the number of neurons in the hidden layer and $\psi$ is the given transformation function. The parameter vector $\vartheta = (w_{01}, \ldots, w_{dH}, v_0, \ldots, v_H)^\top$ contains all the weights of the network. This network with one hidden layer already has a universal approximation property: every measurable function from $\mathbb{R}^d$ to $\mathbb{R}$ can be approximated as accurately as one wishes by a function of the form $\nu_H(x;\vartheta)$ when $\psi$ is a monotonically increasing function with a bounded range. A precise statement of this denseness result is given by Hornik et al. (1989). The range of $\psi$ can be set to any bounded interval, not only $[0,1]$, without changing the validity of the approximation properties.
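To fix ideas, the following minimal sketch evaluates a network function of the form (19.1); the logistic choice of $\psi$ and all function names are illustrative, not prescribed by the text.

```python
import numpy as np

def psi(u):
    """Logistic transformation function with bounded range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def nu(x, W, w0, v, v0):
    """Single-hidden-layer MLP of the form (19.1).

    x  : input vector of dimension d
    W  : (H, d) matrix of input weights w_{jh}
    w0 : (H,) vector of hidden intercepts w_{0h}
    v  : (H,) vector of output weights v_h
    v0 : scalar output weight v_0
    """
    hidden = psi(w0 + W @ x)      # activations of the H hidden neurons
    return v0 + v @ hidden        # affine combination of the activations

# Example with d = 3 inputs, H = 5 hidden neurons and random weights
rng = np.random.default_rng(0)
d, H = 3, 5
W, w0 = rng.normal(size=(H, d)), rng.normal(size=H)
v, v0 = rng.normal(size=H), rng.normal()
print(nu(rng.normal(size=d), W, w0, v, v0))
```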
The weight vector $\vartheta$ is not uniquely determined by the network function $\nu_H(\cdot\,;\vartheta)$. If, for example, the transformation function is antisymmetric around 0, i.e., $\psi(-u) = -\psi(u)$, then $\nu_H(x;\vartheta)$ does not change when
a) the neurons of the hidden layer are interchanged, which corresponds to a permutation of the coordinates of $\vartheta$, or when
b) all input weights $w_{0h}, \ldots, w_{dh}$ and the output weight $v_h$ of a neuron $h$ are multiplied by $-1$.
In order to avoid this ambiguity we restrict the parameter set to a fundamental set in the sense of Rüeger and Ossen (1997), which for every network function $\nu_H(\cdot\,;\vartheta)$ contains exactly one corresponding parameter vector $\vartheta$. In the case of antisymmetric transformation functions we restrict ourselves, for example, to weight vectors with $v_1 \ge v_2 \ge \cdots \ge v_H \ge 0$. In order to simplify the following considerations we also assume that $\vartheta$ is contained in a sufficiently large compact subset $\Theta_H$ of a fundamental range.
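The two invariances a) and b) are easily checked numerically; the following sketch uses the odd transformation function $\psi = \tanh$ purely as an illustration.

```python
import numpy as np

def nu(x, W, w0, v, v0, psi=np.tanh):
    """Network function (19.1) with an antisymmetric transformation function."""
    return v0 + v @ psi(w0 + W @ x)

rng = np.random.default_rng(1)
d, H = 3, 4
W, w0 = rng.normal(size=(H, d)), rng.normal(size=H)
v, v0 = rng.normal(size=H), rng.normal()
x = rng.normal(size=d)

# a) permuting the hidden neurons leaves the network function unchanged
perm = rng.permutation(H)
print(np.allclose(nu(x, W, w0, v, v0), nu(x, W[perm], w0[perm], v[perm], v0)))

# b) flipping the signs of one neuron's input and output weights does as well
W2, w02, v2 = W.copy(), w0.copy(), v.copy()
W2[0], w02[0], v2[0] = -W2[0], -w02[0], -v2[0]
print(np.allclose(nu(x, W, w0, v, v0), nu(x, W2, w02, v2, v0)))
```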
Due to their universal approximation properties neural networks are a suitable tool for constructing non-parametric estimators of regression functions. For this we consider the following heteroscedastic regression model:
$$ Y_t = f(X_t) + \varepsilon_t, \qquad t = 1, \ldots, n, $$
where $X_1, \ldots, X_n$ are independent, identically distributed $d$-variate random variables with density $p(x)$. The residuals $\varepsilon_1, \ldots, \varepsilon_n$ are independent, real-valued random variables with
$$ E(\varepsilon_t \mid X_t = x) = 0, \qquad E(\varepsilon_t^2 \mid X_t = x) = s^2(x) < \infty . $$
We assume that the conditional mean $f(x) = E(Y_t \mid X_t = x)$ and the conditional variance $s^2(x) = \operatorname{Var}(Y_t \mid X_t = x)$ of $Y_t$ given $X_t = x$ are continuous, bounded functions on $\mathbb{R}^d$. In order to estimate the regression function $f$, we fit a neural network with one hidden layer and a sufficiently large number $H$ of neurons to the input variables $X_t$ and the values $Y_t$, i.e., for given $H$ we determine the non-linear least squares estimator
$$ \hat\vartheta_n = \mathop{\mathrm{arg\,min}}_{\vartheta \in \Theta_H} \hat D_n(\vartheta) \qquad\text{with}\qquad \hat D_n(\vartheta) = \frac{1}{n} \sum_{t=1}^{n} \{ Y_t - \nu_H(X_t;\vartheta) \}^2 . $$
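As an illustration (not the book's algorithm), the least squares estimator can be computed on simulated data with a general-purpose optimizer; the data-generating function and all names below are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def unpack(theta, d, H):
    """Split the parameter vector into input weights, intercepts and output weights."""
    W  = theta[:H * d].reshape(H, d)
    w0 = theta[H * d:H * (d + 1)]
    v  = theta[H * (d + 1):H * (d + 2)]
    v0 = theta[-1]
    return W, w0, v, v0

def nu(theta, X, d, H):
    """Network function (19.1), evaluated row-wise on the sample X of shape (n, d)."""
    W, w0, v, v0 = unpack(theta, d, H)
    return v0 + np.tanh(X @ W.T + w0) @ v

# Simulated data from a heteroscedastic regression model
rng = np.random.default_rng(2)
n, d, H = 500, 2, 5
X = rng.normal(size=(n, d))
f_true = lambda x: np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2
Y = f_true(X) + 0.1 * (1 + np.abs(X[:, 0])) * rng.normal(size=n)

# Non-linear least squares: minimise the residual sum of squares over theta
theta0 = 0.1 * rng.normal(size=H * (d + 2) + 1)
fit = least_squares(lambda th: Y - nu(th, X, d, H), theta0)
theta_hat = fit.x
```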
Under appropriate conditions $\hat\vartheta_n$ converges in probability, for $n \to \infty$ and constant $H$, to the parameter vector $\vartheta_0 \in \Theta_H$ which corresponds to the best approximation of $f(x)$ by a function of type $\nu_H(x;\vartheta)$:
$$ \vartheta_0 = \mathop{\mathrm{arg\,min}}_{\vartheta \in \Theta_H} D(\vartheta) \qquad\text{with}\qquad D(\vartheta) = E\,\{ f(X_t) - \nu_H(X_t;\vartheta) \}^2 . $$
Under somewhat stronger assumptions the asymptotic normality of $\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)$, and thus of the estimator $\nu_H(x;\hat\vartheta_n)$ of the regression function $f(x)$, also follows. The estimation error $\hat\vartheta_n - \vartheta_0$ can be divided into two asymptotically independent subcomponents,
$$ \hat\vartheta_n - \vartheta_0 = (\hat\vartheta_n - \vartheta_n) + (\vartheta_n - \vartheta_0), $$
where the value $\vartheta_n$ minimizes the sample version of $D(\vartheta)$,
$$ D_n(\vartheta) = \frac{1}{n} \sum_{t=1}^{n} \{ f(X_t) - \nu_H(X_t;\vartheta) \}^2 , $$
see Franke and Neumann (2000):
Theorem 19.2
Let $\psi$ be bounded and twice differentiable with a bounded derivative. Suppose that $D(\vartheta)$ has a unique global minimum $\vartheta_0$ in the interior of $\Theta_H$ and that the Hesse matrix $A = \nabla^2 D(\vartheta_0)$ of $D$ at $\vartheta_0$ is positive definite. In addition to the above mentioned conditions for the regression model it holds that
$$ s^2(x) \ge c_1 > 0 \qquad\text{and}\qquad E\,(\varepsilon_t^4 \mid X_t = x) \le c_2 < \infty \quad\text{for all } x, $$
with suitable constants $c_1, c_2$. Then it holds for $n \to \infty$
$$ \sqrt{n}\,(\hat\vartheta_n - \vartheta_n) \xrightarrow{\ \mathcal L\ } \mathrm N(0, \Sigma_1), \qquad \sqrt{n}\,(\vartheta_n - \vartheta_0) \xrightarrow{\ \mathcal L\ } \mathrm N(0, \Sigma_2), $$
with covariance matrices
$$ \Sigma_1 = 4\, A^{-1}\, E\big[\, s^2(X_t)\, \nabla\nu_H(X_t;\vartheta_0)\, \nabla\nu_H(X_t;\vartheta_0)^\top \big]\, A^{-1}, $$
$$ \Sigma_2 = 4\, A^{-1}\, \operatorname{Cov}\big[\, \{ f(X_t) - \nu_H(X_t;\vartheta_0) \}\, \nabla\nu_H(X_t;\vartheta_0) \big]\, A^{-1}, $$
where $\nabla\nu_H(x;\vartheta)$ represents the gradient of the network function with respect to the parameter $\vartheta$.
From the theorem it immediately follows that $\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)$ is asymptotically $\mathrm N(0, \Sigma_1 + \Sigma_2)$ distributed. $\Sigma_1$ here stands for the variability of the estimator $\hat\vartheta_n$ caused by the observational errors $\varepsilon_t$; $\Sigma_2$ represents the proportion of asymptotic variability that is caused by the mis-specification of the regression function, i.e., by the fact that $f(x)$ is of the form $\nu_H(x;\vartheta)$ for the given $H$ and no $\vartheta$. In the correctly specified case, where $f(x) = \nu_H(x;\vartheta_0)$, this covariance component disappears, since then $\vartheta_n = \vartheta_0$. $\Sigma_1$ and $\Sigma_2$ can be estimated as usual with the sample covariance matrices. In order to construct tests and confidence intervals for $f(x)$, a couple of alternatives to the asymptotic distribution are available: the bootstrap or, in the case of heteroscedasticity, the wild bootstrap method (Franke and Neumann, 2000).
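A compact sketch of the wild bootstrap idea in this setting; the helper `fit_network` and the choice of Rademacher multipliers are illustrative assumptions, not the exact procedure of Franke and Neumann (2000).

```python
import numpy as np

def wild_bootstrap(X, Y, fit_network, B=500, rng=None):
    """Wild bootstrap for a heteroscedastic regression fitted by a neural network.

    fit_network(X, Y) is assumed to return a callable x -> estimated regression
    function. Residuals are multiplied by independent weights with mean 0 and
    variance 1 (here: Rademacher signs), which preserves the conditional
    heteroscedasticity of the errors.
    """
    rng = rng or np.random.default_rng()
    f_hat = fit_network(X, Y)
    resid = Y - f_hat(X)
    boot_fits = []
    for _ in range(B):
        eta = rng.choice([-1.0, 1.0], size=len(Y))   # Rademacher multipliers
        Y_star = f_hat(X) + resid * eta               # wild bootstrap sample
        boot_fits.append(fit_network(X, Y_star))
    return boot_fits   # e.g. used for pointwise confidence intervals for f(x)
```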
Theorem 19.2 is based on the theoretical value of the least squares estimator $\hat\vartheta_n$, which in practice must be determined numerically. Let $\tilde\vartheta_n$ be such a numerical approximation of $\hat\vartheta_n$. The quality of the resulting estimator $\nu_H(x;\tilde\vartheta_n)$ can depend on the numerical method used. White (1989b) showed in particular that the back propagation algorithm leads under certain assumptions to an asymptotically inefficient estimator $\tilde\vartheta_n$, i.e., the asymptotic covariance matrix of $\sqrt{n}\,(\tilde\vartheta_n - \vartheta_0)$ is larger than that of $\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)$ in the sense that the difference of the two matrices is positive definite. Nevertheless, White also showed that by appending a single global minimization step, the estimator calculated from back propagation can be modified so that for $n \to \infty$ it is as efficient as the theoretical least squares estimator $\hat\vartheta_n$.
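The idea can be sketched as follows; this is an illustrative one-step refinement (a single Gauss-Newton step on the full least squares criterion after back propagation training), not White's original construction, and the helper functions are assumptions.

```python
import numpy as np

# Assumed helpers (hypothetical): nu(theta, X) evaluates the network (19.1),
# grad_nu(theta, X) returns the (n, p) matrix of gradients d nu / d theta.

def one_step_refinement(theta_backprop, X, Y, nu, grad_nu, ridge=1e-8):
    """One Gauss-Newton step on the least squares criterion, starting from the
    back propagation estimate; illustrates the idea of a single additional
    global minimisation step."""
    r = Y - nu(theta_backprop, X)          # residuals at the back propagation estimate
    J = grad_nu(theta_backprop, X)         # Jacobian of the network function
    # Gauss-Newton update: theta_new = theta + (J'J)^{-1} J'r (ridge for stability)
    step = np.linalg.solve(J.T @ J + ridge * np.eye(J.shape[1]), J.T @ r)
    return theta_backprop + step
```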
Until now we have held the number of neurons $H$ in the hidden layer of the network, and thus the dimension of the parameter vector $\vartheta$, constant. The estimator based on the network, $\nu_H(x;\hat\vartheta_n)$, converges to $\nu_H(x;\vartheta_0)$, so that in general the bias does not disappear for $n \to \infty$, but rather converges to $\nu_H(x;\vartheta_0) - f(x)$. With standard arguments it directly follows from Theorem 19.2 that
$$ \sqrt{n}\,\{ \nu_H(x;\hat\vartheta_n) - \nu_H(x;\vartheta_0) \} \xrightarrow{\ \mathcal L\ } \mathrm N\big( 0,\ \nabla\nu_H(x;\vartheta_0)^\top (\Sigma_1 + \Sigma_2)\, \nabla\nu_H(x;\vartheta_0) \big). $$
In order to obtain a consistent estimator for $f(x)$, the number of neurons $H$, which for the non-parametric estimator $\nu_H(x;\hat\vartheta_n)$ plays the role of a smoothing parameter, must increase with $n$. Due to the universal approximation property of neural networks, $\nu_H(x;\vartheta_0)$ then converges to $f(x)$, so that the bias disappears asymptotically. Since with increasing $H$ the dimension of the parameter vector $\vartheta$ increases, $H$ should not approach infinity too quickly, in order to ensure that the variance of $\nu_H(x;\hat\vartheta_n)$ continues to converge to 0. In choosing $H$ in practice one faces a dilemma typical of non-parametric statistics, the bias-variance dilemma: a small $H$ results in a smooth estimation function with smaller variance and larger bias, whereas a large $H$ leads to a smaller bias but a larger variability of a then less smooth estimator $\nu_H(x;\hat\vartheta_n)$.
White (1990) showed in a corresponding framework that the regression estimator based on the neural network converges in probability to $f(x)$, and thus is consistent, when the number of neurons $H = H_n \to \infty$ for $n \to \infty$ at a suitably slow rate.
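A small simulation can make the bias-variance dilemma visible; sklearn's MLPRegressor is used here merely as a convenient stand-in for the least squares fit, and the true function and settings are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)                      # true regression function
n = 200
X = rng.uniform(-2, 2, size=(n, 1))
Y = f(X[:, 0]) + 0.3 * rng.normal(size=n)
grid = np.linspace(-2, 2, 400).reshape(-1, 1)    # evaluation grid

for H in (1, 5, 50):
    fit = MLPRegressor(hidden_layer_sizes=(H,), activation="tanh",
                       solver="lbfgs", max_iter=5000).fit(X, Y)
    err = np.mean((fit.predict(grid) - f(grid[:, 0])) ** 2)
    print(f"H = {H:2d}: mean squared distance to f = {err:.4f}")
```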
From this it follows that neural networks with a freely chosen number of neurons in the hidden layer provide useful non-parametric function estimators in regression and, as we will discuss in the next section, in time series analysis. They have the advantage that the approximating function $\nu_H(x;\vartheta)$ of the form (19.1) is a combination of neurons, each of which is composed only of a given non-linear transformation of an affine-linear combination of the variables $x_1, \ldots, x_d$. This makes the numerical calculation of the least squares estimator for $\vartheta$ possible even when the dimension $d$ of the input variables and the number $H$ of neurons are large, and thus the dimension of the parameter vector is very large. In contrast to the local smoothing techniques introduced in Chapter 13, neural networks can therefore also be applied as estimators of functions in high-dimensional spaces. One reason for this is the non-locality of the function estimator $\nu_H(x;\hat\vartheta_n)$: it does not depend only on the observations $X_t$ with a small norm $\lVert X_t - x \rVert$, and thus in practice it is not as strongly afflicted by the curse of dimensionality, i.e., by the fact that even for large $n$ the local density of the observations $X_t$ in high-dimensional spaces is small.
Theoretically it is sufficient to consider neural networks of type MLP with one hidden layer. In practice, however, one can sometimes achieve a comparably good fit to the data with a more parsimonious parameterization by using multiple hidden layers. A network function with two hidden layers made up of $H_1$ and $H_2$ neurons, respectively, has, for example, the form
$$ \nu(x;\vartheta) = v_0 + \sum_{k=1}^{H_2} v_k\, \psi\Big( w'_{0k} + \sum_{h=1}^{H_1} w'_{hk}\, \psi\big( w_{0h} + \sum_{j=1}^{d} w_{jh}\, x_j \big) \Big), $$
where $\vartheta$ represents the vector of all the weights $w_{jh},\ w'_{hk},\ v_k$. Such a function with small $H_1, H_2$ can produce a more parsimoniously parameterized approximation of the regression function $f$ than a network function with only one hidden layer made up of a large number of neurons. In a case study on the development of trading strategies for currency portfolios, Franke and Klein (1999) discovered that with two hidden layers a significantly better result can be achieved than with only one layer.
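The following sketch evaluates such a two-hidden-layer network function and compares parameter counts, based on the displayed form above with intercepts in every layer; the numbers are purely illustrative.

```python
import numpy as np

def nu_two_layers(x, W1, w10, W2, w20, v, v0, psi=np.tanh):
    """Network function with two hidden layers of H1 and H2 neurons.

    W1 : (H1, d)  input weights w_{jh}          w10 : (H1,) intercepts
    W2 : (H2, H1) second-layer weights w'_{hk}  w20 : (H2,) intercepts
    v  : (H2,)    output weights v_k            v0  : output weight v_0
    """
    first = psi(w10 + W1 @ x)          # H1 activations of the first hidden layer
    second = psi(w20 + W2 @ first)     # H2 activations of the second hidden layer
    return v0 + v @ second

# Parameter counts: two layers with (H1, H2) neurons versus one layer with H neurons
d, H1, H2, H = 10, 5, 3, 25
p_two = H1 * (d + 1) + H2 * (H1 + 1) + (H2 + 1)
p_one = H * (d + 1) + (H + 1)
print(p_two, p_one)   # 77 versus 301 parameters in this illustrative setting
```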
In addition, the number of parameters to be estimated can be further reduced when several connections in the neural network are cut, i.e., when the corresponding weights are set to zero from the very beginning. The large flexibility that the neural network offers in approximating regression functions, however, creates problems in building the model, since one has to decide on a network structure and thus has to ask:
- How many hidden layers should the network have?
- How many neurons should each hidden layer have?
- Which nodes (inputs, hidden neurons, outputs) of the network should be connected, i.e., which weights should be set to zero from the very beginning?
Through this process one is looking for a network structure which provides a network function $\nu(x;\vartheta)$ that is parsimoniously parameterized and at the same time, for a suitable $\vartheta$, is a sufficiently good approximation of the regression function $f(x)$.
Similarly to classical linear regression analysis, a comprehensive set of instruments is available for specifying a network structure consistent with the data. For simplicity we will concentrate on feed forward networks with only one hidden layer made up of $H$ neurons.
a) Repeated Significance Tests: As with the stepwise construction of a linear regression model, we start with a simple network and assume that one additional neuron, with the number $H+1$ and output weight $v_{H+1}$, has been added. Whether this significantly improves the quality of the fit of the network is determined by testing the hypothesis $H_0: v_{H+1} = 0$ against the alternative $H_1: v_{H+1} \neq 0$. Since under $H_0$ the input weights $w_{0,H+1}, \ldots, w_{d,H+1}$ of the neuron in question are not identifiable, i.e., they have no influence on the value of the network function, this is not a standard testing problem. White (1989a) and Teräsvirta et al. (1993) have developed Lagrange multiplier tests that are suitable for testing the significance of an additional neuron. Going in the other direction, it is also possible to start with a complex network with a large assumed number of neurons and to remove neurons successively until the related test rejects the hypothesis $H_0$. To reduce the number of parameters it also makes sense to cut individual input connections, i.e., to set the corresponding weight to zero. For the test of the hypothesis $H_0: w_{jh} = 0$ against the alternative $H_1: w_{jh} \neq 0$, classical Wald tests can be applied on the basis of asymptotic results such as Theorem 19.2 (see, for example, Anders (1997) for applications in financial statistics).
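A generic sketch of such a Wald test for cutting a single connection, assuming that an estimate theta_hat of the weights and an estimate Sigma_hat of the asymptotic covariance matrix of $\sqrt{n}\,(\hat\vartheta_n - \vartheta_0)$ are available; all names here are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def wald_test_weight(theta_hat, Sigma_hat, j, n, alpha=0.05):
    """Wald test of H0: theta_j = 0 against H1: theta_j != 0.

    theta_hat : estimated parameter vector
    Sigma_hat : estimated asymptotic covariance matrix of sqrt(n)*(theta_hat - theta_0)
    j         : index of the weight to be tested (e.g. an input weight w_{jh})
    n         : sample size
    """
    W = n * theta_hat[j] ** 2 / Sigma_hat[j, j]   # Wald statistic, asympt. chi^2_1 under H0
    p_value = 1.0 - chi2.cdf(W, df=1)
    return W, p_value, p_value < alpha            # rejection suggests keeping the connection
```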
b) Cross Validation and Validation: Cross validation is usually ruled out by the extreme computational effort required to determine the order of the model, i.e., first of all the number $H$ of neurons in the hidden layer. In order to calculate the leave-one-out estimator for the model parameters, one must fit the neural network to the corresponding sample reduced by one observation a total of $n$ times, and this must be done for every network structure under consideration. A related and better known procedure from the application of neural networks in regression and time series analysis is to set aside a portion of the data from the sample in order to measure the quality of the model on this so-called validation set. In addition to the data $(X_t, Y_t),\ t = 1, \ldots, n,$ used to calculate the least squares estimator $\hat\vartheta_n$, a second independent subsample $(X_t, Y_t),\ t = n+1, \ldots, n+M,$ is available. By minimizing a measure of fit such as
$$ V(H) = \frac{1}{M} \sum_{t=n+1}^{n+M} \{ Y_t - \nu_H(X_t;\hat\vartheta_n) \}^2 , $$
the order $H$ of the model and the quality of an incomplete network structure, in which individual input weights $w_{jh}$ have been set to zero, can be determined.
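A minimal sketch of this validation-set rule; sklearn's MLPRegressor again serves only as a stand-in for the least squares fit, and the candidate grid is illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def select_H(X_train, Y_train, X_val, Y_val, candidates=(1, 2, 3, 5, 8, 12)):
    """Choose the number of hidden neurons H by minimising the validation criterion
    V(H) = mean squared prediction error on an independent validation set."""
    best_H, best_V = None, np.inf
    for H in candidates:
        fit = MLPRegressor(hidden_layer_sizes=(H,), activation="tanh",
                           solver="lbfgs", max_iter=5000).fit(X_train, Y_train)
        V = np.mean((Y_val - fit.predict(X_val)) ** 2)
        if V < best_V:
            best_H, best_V = H, V
    return best_H, best_V
```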
c) Network Information Criteria: To compare network structures, well-known criteria for determining the order of a model, such as the Akaike Information Criterion (AIC), can be used. The Network Information Criterion (NIC) proposed by Murata et al. (1994) is a version specialized to the case of neural networks. Here it is implicitly assumed that the residuals $\varepsilon_t$ are normally distributed with a common variance.
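As a point of reference, the classical AIC under this normality assumption can be computed as follows; this is a sketch of the standard Gaussian AIC, not of the NIC of Murata et al. (1994).

```python
import numpy as np

def gaussian_aic(Y, Y_fit, n_params):
    """Classical AIC for a regression fit under the assumption of i.i.d. Gaussian
    residuals with common variance: n*log(RSS/n) + 2*(number of parameters)."""
    n = len(Y)
    rss = np.sum((Y - Y_fit) ** 2)
    return n * np.log(rss / n) + 2 * n_params

# For a single-hidden-layer network of the form (19.1) with H neurons and d inputs,
# the parameter count is H*(d+1) input weights (with intercepts) plus H+1 output weights:
# n_params = H * (d + 1) + H + 1
```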