The best-known method by which a feedforward network learns its weights from the training set is backpropagation. The basic idea is nothing other than a numerical method for solving the (nonlinear) least squares problem, one that saves memory at the cost, however, of potentially slower convergence and numerical instabilities.
To illustrate, consider a neural network with a single output variable and a hidden layer with only one neuron, so that the network output is

\[
\hat y = \nu(x; w) = \psi\Big(w_0 + \sum_{j=1}^p w_j x_j\Big), \qquad w = (w_0, w_1, \dots, w_p)^\top .
\]

Given the training pairs $(x_k, y_k)$, $k = 1, \dots, T$, the least squares criterion and the contribution of the $k$-th observation are

\[
Q(w) = \sum_{k=1}^T \{y_k - \nu(x_k; w)\}^2, \qquad Q_k(w) = \{y_k - \nu(x_k; w)\}^2,
\]

with

\[
\operatorname{grad} Q_k(w) = -2\,\{y_k - \nu(x_k; w)\}\,\psi'\Big(w_0 + \sum_{j=1}^p w_j x_{kj}\Big)\,(1, x_{k1}, \dots, x_{kp})^\top .
\]

After presentation of the $k$-th training pair, the backpropagation rule corrects the weights by a small step against this gradient:

\[
w^{(k+1)} = w^{(k)} - \eta\, \operatorname{grad} Q_k\big(w^{(k)}\big).
\]
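The following short Python sketch implements this one-neuron network and a single backpropagation correction; the logistic function is used as one possible choice for $\psi$, and the function names (`sigmoid`, `predict`, `backprop_step`) are illustrative rather than taken from the text.

```python
import numpy as np

def sigmoid(t):
    """Logistic activation, one common choice for the transfer function psi."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(x, w0, w):
    """Network output nu(x; w) = psi(w0 + w'x) of the one-neuron illustration."""
    return sigmoid(w0 + w @ x)

def backprop_step(x_k, y_k, w0, w, eta):
    """One backpropagation correction: a small step against grad Q_k(w)."""
    z = w0 + w @ x_k
    y_hat = sigmoid(z)
    # dQ_k/dz = -2 (y_k - y_hat) psi'(z), with psi'(z) = psi(z) (1 - psi(z))
    delta = -2.0 * (y_k - y_hat) * y_hat * (1.0 - y_hat)
    return w0 - eta * delta, w - eta * delta * x_k   # gradient is delta * (1, x_k)
```

One pass of the backpropagation rule then simply applies `backprop_step` to every training pair $(x_k, y_k)$ in turn.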
To accelerate convergence, the small constant $\eta$ can also be made to decrease toward 0 during the iteration. Figure 18.8 shows the path of the optimization of $Q(w)$, evaluated in $T$ steps at $w^{(1)}, \dots, w^{(T)}$, where each $w^{(k)}$ is corrected according to the backpropagation rule.
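A minimal sketch of such a shrinking step size, reusing `backprop_step` (and NumPy) from the sketch above on purely illustrative toy data, with $\eta_k = \eta_0/(1+k)$ as one of many possible schedules:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # toy inputs: T = 100 observations, p = 3
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # toy 0/1 targets
w0, w = 0.0, np.zeros(3)                            # starting weights

eta0 = 0.5                                          # initial step size (illustrative value)
for k, (x_k, y_k) in enumerate(zip(X, y)):
    eta_k = eta0 / (1.0 + k)                        # the constant shrinks toward 0 during the iteration
    w0, w = backprop_step(x_k, y_k, w0, w, eta_k)
```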
With the gradient descent method, the quality of a weight vector $w$, that is, of the corresponding network, is evaluated simultaneously using all the data in the training set: the network is applied to all $(x_k, y_k)$, $k = 1, \dots, T$, and only after this is the weight vector changed.
Backpropagation is also a form of gradient descent, with the difference that the network is repeatedly applied to the single observations $(x_k, y_k)$, and already after every single step the weights are changed in the direction of the steepest descent of the function $Q_k(w) = \{y_k - \nu(x_k; w)\}^2$.
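To make the contrast concrete, here is a sketch of both variants for the toy network above, again with illustrative names and reusing `sigmoid` and `backprop_step`: gradient descent accumulates the gradient over all $T$ training pairs before changing the weights once, whereas backpropagation changes them after every single pair.

```python
def gradient_descent_step(X, y, w0, w, eta):
    """One weight change of plain gradient descent, using the gradient of the full sum Q(w)."""
    g0, g = 0.0, np.zeros_like(w)
    for x_k, y_k in zip(X, y):                      # first apply the network to ALL training data
        y_hat = sigmoid(w0 + w @ x_k)
        delta = -2.0 * (y_k - y_hat) * y_hat * (1.0 - y_hat)
        g0, g = g0 + delta, g + delta * x_k
    return w0 - eta * g0, w - eta * g               # only now is the weight vector changed

def backprop_pass(X, y, w0, w, eta):
    """One pass of backpropagation: the weights are updated after every single observation."""
    for x_k, y_k in zip(X, y):
        w0, w = backprop_step(x_k, y_k, w0, w, eta)
    return w0, w
```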
The Widrow-Hoff learning rule is in principle a backpropagation algorithm. The threshold function $\psi(t) = \mathbf{1}(t \ge 0)$ is non-differentiable, but after the presentation of $x_k$ the weights are nevertheless changed in the direction of the steepest descent of $\{y_k - \hat y_k\}^2$, i.e., in the direction of $x_k$ when $y_k = 1$ and $\hat y_k = 0$, and in the direction of $-x_k$ when $y_k = 0$ and $\hat y_k = 1$. For correct classifications the weights remain unaltered.
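A minimal sketch of this rule for a single threshold neuron; the step size `eta` and the function name are illustrative additions, while the update itself is exactly $\pm x_k$ on a misclassification and no change otherwise.

```python
import numpy as np

def widrow_hoff_step(x_k, y_k, w0, w, eta=1.0):
    """Change the weights toward x_k or -x_k only when x_k is misclassified."""
    y_hat = 1.0 if (w0 + w @ x_k) >= 0 else 0.0     # threshold activation psi(t) = 1(t >= 0)
    if y_hat == y_k:
        return w0, w                                # correct classification: weights unaltered
    direction = 1.0 if y_k == 1.0 else -1.0         # +x_k if y_k = 1, y_hat = 0; -x_k if y_k = 0, y_hat = 1
    return w0 + eta * direction, w + eta * direction * x_k
```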
Naturally, one can apply any numerical algorithm that can compute the minimum of a nonlinear function to determine the weights of a neural network. In some applications, for example, the conjugate gradient method has proven to be the fastest and most reliable. All of these algorithms run the risk of ending up in a local minimum of $Q(w)$. In the literature on neural networks it is occasionally claimed that, in the training of networks, such local minima do not occur. Based on the experience of statistics with maximum-likelihood estimators of high-dimensional parameters, such local minima are, however, to be expected, since the training of neural networks, for example in regression applications, can be interpreted under appropriate normality assumptions as a maximum-likelihood estimation.
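As an illustration of this point, the following sketch minimizes $Q(w)$ for the toy one-neuron network with SciPy's conjugate gradient routine; the data are the same illustrative toy data as above, and a single run of such an optimizer can of course still end up in a local minimum.

```python
import numpy as np
from scipy.optimize import minimize

def Q(theta, X, y):
    """Sum of squared errors Q(w) of the one-neuron network, theta = (w0, w1, ..., wp)."""
    w0, w = theta[0], theta[1:]
    y_hat = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # toy inputs
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)     # toy 0/1 targets

# Conjugate gradient minimization of Q(w); the gradient is approximated numerically here.
result = minimize(Q, x0=np.zeros(4), args=(X, y), method="CG")
print(result.x, result.fun)
```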