19.2 Back Propagation

The best known method by which a feed forward network learns its weights from the training set is back propagation. The basic idea is none other than a numerical method for solving the (nonlinear) least squares problem, one which saves memory at the cost, however, of possibly slower convergence and numerical instabilities.

To illustrate, consider a neural network with one output variable $ y$ (i.e. $ q = 1$) and a hidden layer with only one neuron:

$\displaystyle y = \psi(w_0 + w^\top x). $

$ \psi$ can be a logistic function or some other transformation function. The training set is $ (x^{(1)}, z^{(1)}), \ldots, (x^{(T)}, z^{(T)})$. The weight $ w_0$ is held constant in order to avoid the arbitrary scale factor. The function to be minimized

$\displaystyle Q(w) = \sum_{k=1}^T \left(z^{(k)} - y^{(k)}\right)^2 $

thus depends only on the weights $ w_1, \ldots, w_p$ of the input variables.
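For concreteness, the single-neuron network and the criterion $ Q(w)$ can be written down in a few lines of Python; this is only a minimal sketch, in which the logistic choice of $ \psi$, the fixed value $ w_0 = 0$ and the use of NumPy are assumptions made for the illustration, not the implementation used in the text.

\begin{verbatim}
import numpy as np

def psi(t):
    # logistic transformation function (one possible choice for psi)
    return 1.0 / (1.0 + np.exp(-t))

w0 = 0.0   # weight w_0 held constant

def Q(w, X, z):
    # least squares criterion Q(w) = sum_k (z^(k) - y^(k))^2,
    # where y^(k) = psi(w0 + w^T x^(k)) and the rows of X are the x^(k)
    y = psi(w0 + X @ w)
    return np.sum((z - y) ** 2)
\end{verbatim}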

An elementary numerical method for minimizing $ Q$ is the gradient descent method. Given a weight vector $ w(N)$, the next approximation is calculated by moving a small step in the direction of the steepest descent of $ Q$:
$\displaystyle w(N+1) = w(N) - \eta \, \textrm{grad}\, Q(w(N)),$

$\displaystyle \textrm{grad}\, Q(w) = - \sum_{k=1}^T 2 \left(z^{(k)} - y^{(k)}\right) \psi'(w^\top x^{(k)})\, x^{(k)}.$
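Continuing the sketch, one possible form of the gradient descent iteration is the following; the gradient uses $ \psi'(u) = \psi(u)(1-\psi(u))$ for the logistic choice, and the step size, the number of iterations and the small input matrix paired with the target vector $ z=(0,1,0,1)^\top$ of the figure below are purely illustrative assumptions.

\begin{verbatim}
def grad_Q(w, X, z):
    # grad Q(w) = - sum_k 2 (z^(k) - y^(k)) psi'(w^T x^(k)) x^(k);
    # for the logistic function, psi'(u) = psi(u) (1 - psi(u))
    y = psi(w0 + X @ w)
    return -2.0 * (X.T @ ((z - y) * y * (1.0 - y)))

def gradient_descent(X, z, eta=0.1, n_steps=100):
    # w(N+1) = w(N) - eta * grad Q(w(N))
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - eta * grad_Q(w, X, z)
    return w

# toy training set with target vector z = (0,1,0,1)^T as in the figure;
# the inputs X are invented for the illustration
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
z = np.array([0., 1., 0., 1.])
w_hat = gradient_descent(X, z)
\end{verbatim}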

Fig.: Gradient descent path with target vector $ z=(0,1,0,1)^\top$. SFEdescgrad.xpl

\includegraphics[width=1.2\defpicwidth]{nndescgrad.ps}

To improve the convergence, the small constant $ \eta > 0$ can also be allowed to converge slowly to 0 during the iteration process. Figure 18.8 shows the path of the optimization of $ Q(w)$ over $ i$ steps, evaluated at $ w_{1},\ldots,w_{i}$, where each $ w$ is corrected according to the back propagation rule.
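One simple way of letting $ \eta$ shrink during the iterations, continuing the sketch above, is a schedule of the form $ \eta_N = \eta_0/(1+N/\tau)$; the particular form and the constants are again only illustrative assumptions.

\begin{verbatim}
def gradient_descent_decaying(X, z, eta0=0.5, tau=50.0, n_steps=500):
    # as before, but with a step size eta_N that converges slowly to 0
    w = np.zeros(X.shape[1])
    for N in range(n_steps):
        eta_N = eta0 / (1.0 + N / tau)   # illustrative decay schedule
        w = w - eta_N * grad_Q(w, X, z)
    return w
\end{verbatim}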

With the gradient descent method the quality of the weight vector $ w(N)$, i.e. of the current network, is evaluated using all the data in the training set simultaneously: the network is applied to all of $ x^{(1)}, \ldots, x^{(T)}$, and only then is the weight vector changed.

Back propagation is also a form of gradient descent, with the difference that the network is applied to the individual $ x^{(k)}$ one at a time, and already after every single step the weights are changed in the direction of the steepest descent of the function $ Q_k (w) = (z^{(k)} - y^{(k)})^2:$

$\displaystyle w(N+1) = w(N) - \eta \, \textrm{grad}\, Q_k(w(N)),$

$\displaystyle \textrm{grad}\, Q_k(w) = - 2 \left(z^{(k)} - y^{(k)}\right) \psi'(w^\top x^{(k)})\, x^{(k)}.$
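Continuing the sketch, the online character of back propagation can be made explicit as follows; the logistic $ \psi$, the constant step size and the fixed number of passes over the training set are assumptions for the example.

\begin{verbatim}
def back_propagation(X, z, eta=0.1, n_epochs=50):
    # after each single x^(k) the weights are moved in the direction of
    # steepest descent of Q_k(w) = (z^(k) - y^(k))^2
    T, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):        # repeated passes over the training set
        for k in range(T):
            y_k = psi(w0 + w @ X[k])
            grad_Qk = -2.0 * (z[k] - y_k) * y_k * (1.0 - y_k) * X[k]
            w = w - eta * grad_Qk
    return w
\end{verbatim}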

Once the training set has been run through completely, the iteration starts again from the beginning; $ T$ steps of back propagation thus correspond roughly to one step of the gradient descent method. For the back propagation algorithm, too, it may be necessary to let $ \eta$ converge slowly to 0.

The Widrow-Hoff learning rule is in principle a back propagation algorithm. The threshold function $ \psi(t) = \boldsymbol{1}(w_0 + t>0)$ is not differentiable, but after the presentation of $ x^{(k)}$ the weights are nevertheless changed in the direction of the steepest descent of $ Q_k (w)$, i.e. in the direction of $ x^{(k)}$ for $ z^{(k)} = 1,\ y^{(k)} = 0$ and in the direction of $ -x^{(k)}$ for $ z^{(k)} = 0,\ y^{(k)} = 1$. For correct classifications the weights remain unaltered.
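Read in this way, the Widrow-Hoff rule can be sketched as follows; the update constant $ \eta$, the number of passes and the fixed threshold weight $ w_0$ are assumptions for the example.

\begin{verbatim}
def widrow_hoff(X, z, eta=0.1, n_epochs=50, w0=0.0):
    # threshold output y^(k) = 1(w0 + w^T x^(k) > 0); after a misclassification
    # the weights move by +eta*x^(k) (z=1, y=0) or -eta*x^(k) (z=0, y=1),
    # otherwise they remain unaltered
    T, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):
        for k in range(T):
            y_k = 1.0 if w0 + w @ X[k] > 0 else 0.0
            w = w + eta * (z[k] - y_k) * X[k]
    return w
\end{verbatim}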

Naturally, any numerical algorithm that can compute the minimum of a nonlinear function $ Q(w)$ can be applied to determine the weights of a neural network. In some applications, for example, the conjugate gradient method has proven to be the fastest and most reliable. All of these algorithms run the risk of ending up in a local minimum of $ Q(w)$. In the literature on neural networks it is occasionally claimed that in the training of such networks local minima do not occur. Based on the experience of statistics with maximum likelihood estimators of high-dimensional parameters, local minima are, however, to be expected, since the training of neural networks, for example in applications to regression analysis, can under the appropriate normality assumptions be interpreted as a maximum likelihood estimation technique.
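As one possibility, the criterion and gradient from the sketch above can be handed to a general-purpose optimizer; the use of SciPy's conjugate gradient routine here is an assumption about the available software, not the method referred to in the text, and like the other algorithms it may still end up in a local minimum, so several random starting values are advisable.

\begin{verbatim}
from scipy.optimize import minimize

def train_conjugate_gradient(X, z):
    # minimize Q(w) with the conjugate gradient method, starting from a
    # random initial weight vector
    w_start = np.random.normal(size=X.shape[1])
    result = minimize(lambda w: Q(w, X, z), w_start,
                      jac=lambda w: grad_Q(w, X, z), method="CG")
    return result.x
\end{verbatim}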