19.1 From Perceptron to Non-linear Neuron

The perceptron is a simple mathematical model of how a nerve cell functions: it receives signals from sensory cells and other nerve cells (the input variables) and from these either sends a signal to the next nerve cell or remains inactive. In spite of all its shortcomings, the perceptron has strongly influenced the way of thinking about neural networks, so it is a good starting point for the discussion of the components from which neural networks are constructed. The perceptron works in two steps:

- the input variables $ x_1, \ldots, x_p$ are multiplied by the weights
$ w_1, \ldots, w_p$ and summed,

- a threshold operation is applied to the result.

$ x = (x_1, \ldots, x_p)^\top$ and $ w = (w_1, \ldots, w_p)^\top $ denote the input vector and the weight vector respectively, and for a given threshold $ b$ let $ \psi (u) = \boldsymbol{1}(u > b)$ be the corresponding threshold function. The output variable $ y = \psi (w^\top x)$ of the perceptron is 1 (the nerve cell ``fires''), when the sum of the weighted input signals lies above the threshold, and is 0 otherwise (the nerve cell remains inactive).

Fig. 18.3: The perceptron
\includegraphics[width=1.2\defpicwidth]{neu3.ps}

The effect of the perceptron depends on the weights $ w_1, \ldots, w_p$ and the threshold value $ b$. An equivalent representation is obtained by including the constant $ x_0 \stackrel{\mathrm{def}}{=}1$ as an additional input variable with weight $ w_0 = - b$ and choosing a threshold value of 0, since then

$\displaystyle \boldsymbol{1}\left( \sum^ p_{i=1} w_i x_i > b \right) =
\boldsymbol{1}\left( \sum^ p_{i=0} w_i x_i >0\right). $

This representation is often more convenient, since among the freely chosen system parameters one no longer has to distinguish between weights and threshold values.
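The two-step computation with the threshold folded in as $ w_0 = -b$ can be sketched as follows (a minimal illustration; the function and variable names are our own):

```python
import numpy as np

def perceptron(x, w):
    """Perceptron output 1(w'x > 0), where x includes the constant
    coordinate x_0 = 1 and w_0 = -b plays the role of the threshold."""
    return int(np.dot(w, x) > 0)

# Example with threshold b = 0.5, i.e. w_0 = -0.5:
w = np.array([-0.5, 1.0, 1.0])
print(perceptron(np.array([1.0, 1.0, 0.0]), w))  # 1*1 + 1*0 = 1 > 0.5 -> 1
print(perceptron(np.array([1.0, 0.0, 0.0]), w))  # 0 > 0.5 is false    -> 0
```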

A perceptron can be trained to solve classification problems of the following type: given are objects which belong to one of two classes, $ C_0$ or $ C_1$. Based on the observed features $ x_1, \ldots, x_p$ of an object it is decided whether the object belongs to $ C_0$ or to $ C_1$.

The perceptron characterized by the weights $ w_0, \ldots, w_p$ classifies an object as belonging to $ C_0$ or to $ C_1$ according to whether the output variable $ y = y(x_1, \ldots, x_p)$ is 0 or 1. So that the classification problem may be solved, the weights $ w_0, \ldots, w_p$ must be ``learned''. To do this there is a training set

$\displaystyle (x ^ {(1)} , z^ {(1)}),\ldots, (x^ {(T)}, z^{(T)})$

of $ T$ input vectors

$\displaystyle x^ {(t)} = (x_1^ {(t)} , \ldots,x_p^ {(t)})^\top $

available whose correct classification

$\displaystyle z^{(1)} , \ldots, z^ {(T)} \in \{ 0, 1\} $

is known. With the help of a learning rule, suitable weights $ \hat{w}_0, \ldots, \hat{w}_p$ are determined from the training set.

In statistical terms the problem is to estimate the parameters of the perceptron from the data $ (x^ {(t)}, z^ {(t)})$, $ t= 1,
\ldots, T$. A learning rule is an estimation method which produces estimates $ \hat{w}_0, \ldots, \hat{w}_p$.

A learning rule is, for example, the Delta or Widrow-Hoff learning rule: The input vectors $ x^ {(t)} , t = 1, \ldots, T,$ are used consecutively as input variables of the perceptron and the output variables $ y^ {(t)},\ t = 1, \ldots, T,$ are compared to the correct classification $ z^ {(t)},\ t = 1, \ldots, T.$ If in one step $ y^ {(t)} = z^ {(t)},$ then the weights remain unchanged. If on the other hand $ y^ {(t)} \not= z^ {(t)}$, then the weight vector $ w = (w_0, \ldots, w_p)^\top $ is adjusted in the following manner:

$\displaystyle w_{new} = w + \eta (z^ {(t)} - y^ {(t)}) \, x^ {(t)} $

$ \eta$ is a small relaxation factor which eventually must approach zero slowly in order to ensure convergence of the learning algorithm. The initial value of $ w$ is given arbitrarily or chosen at random, for example, uniformly distributed over $ [0,1]^ {p + 1}.$

The learning does not end once all of the input vectors have been presented to the network; rather, after $ x^ {(T)}$ has been entered, $ x^ {(1)}$ is used again as the next input variable. The training set is run through multiple times until the network has classified all objects in the training set correctly or until a given quality criterion measuring the classification error is sufficiently small.
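The cyclical learning procedure described above can be sketched as follows (a hedged illustration; the function name and defaults are our own choices):

```python
import numpy as np

def widrow_hoff(X, z, eta=1.0, max_epochs=100):
    """Delta (Widrow-Hoff) rule: present the training vectors cyclically
    and update w by eta * (z - y) * x whenever x is misclassified.
    Each row of X is an input vector including the constant x_0 = 1."""
    w = np.zeros(X.shape[1])  # could also be drawn uniformly from [0,1]^(p+1)
    for _ in range(max_epochs):
        mistakes = 0
        for x_t, z_t in zip(X, z):
            y_t = int(np.dot(w, x_t) > 0)
            if y_t != z_t:
                w = w + eta * (z_t - y_t) * x_t
                mistakes += 1
        if mistakes == 0:  # whole training set classified correctly
            break
    return w
```

For linearly separable training sets the loop terminates with a separating weight vector; otherwise it stops after `max_epochs` passes through the data.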

The weights $ w_0, \ldots, w_p$ are only identified up to a positive scale factor, i.e., for $ \alpha > 0$ the weights $ \alpha \, w_0, \ldots, \alpha \, w_p$ lead to the same classification. When applying a learning rule such as that of Widrow-Hoff it can happen that $ \vert\vert w \vert\vert $ increases continuously, which can lead to numerical problems. To prevent this, one uses the so-called weight decay technique, i.e., a modified learning rule under which $ \vert\vert w \vert\vert $ remains stable.

Example 19.1 (Learning the OR-Function)  
Let $ p = 2$ and $ x_1, x_2 \in \{ 0,1\}.$ The classification that is to be learned is the logical OR:

$\displaystyle \begin{array}{l} z = 1,\ \text{\rm if\ } x_1 = 1\ \text{\rm or\ } x_2 = 1, \\ z = 0,\ \text{\rm if\ } x_1 = 0\ \text{\rm and\ } x_2 = 0.\end{array}$

The following input vectors, which include the constant first coordinate $ x_0 = 1$,

$\displaystyle x ^ {(1)} = \left( \begin{array}{c} 1\\ 1\\ 0\end{array}\right),\ x^ {(2)} = \left( \begin{array}{c} 1\\ 0\\ 1\end{array}\right),\ x^ {(3)} = \left( \begin{array}{c} 1\\ 1\\ 1\end{array}\right),\ x^ {(4)} = \left( \begin{array}{c} 1\\ 0\\ 0\end{array}\right) $

are used as the training set with the correct classification $ z^
{(1)} = z^ {(2)} = z^ {(3)} = 1,\ z^ {(4)} = 0.$ The perceptron with the weights $ w_0, w_1, w_2$ classifies an object as 1 if and only if

$\displaystyle w_0 x_0 + w_1 x_1 + w_2 x_2 > 0\ , $

and as 0 otherwise. For the starting vector we use $ w = (0,0,0)^\top $, and we set $ \eta = 1.$ The individual steps of the Widrow-Hoff learning rule take the following form:
  1. $ x^ {(1)} $ gives $ y^ {(1)} = 0 \not= z^ {(1)}.$ The weights are changed:
    $ w_{new} = (0,0,0)^\top + (1 - 0) (1,1,0)^\top =
(1,1,0)^\top $
  2. $ x^ {(2)}$ is correctly classified with the weight vector.
  3. $ x^ {(3)}$ is correctly classified with the weight vector.
  4. For $ x^ {(4)}$ we have $ w^\top x^ {(4)} = 1 > 0, $ so that the weights are again changed:
    $ w_{new} = (1, 1,0)^\top + (0-1) (1,0,0)^\top = (0,1,0)^\top
$
  5. $ x^ {(1)}$ is now used as input and is correctly classified.
  6. Since $ w^\top x^ {(2)} = 0:$
    $ w_{new} = (0,1,0)^\top + (1-0) (1,0,1)^\top = (1,1,1)^\top $
  7. Since $ w^\top x^ {(3)} = 3 > 0,$ $ x^ {(3)}$ is correctly classified.
  8. $ x^ {(4)}$ is incorrectly classified so that
    $ w_{new} = (1,1,1)^\top + (0-1)(1,0,0)^\top = (0,1,1)^\top $
Thus the procedure ends, since with these weights the perceptron correctly classifies all of the input vectors in the training set. The perceptron has learned the OR function over the set $ \{0,1\}^ 2$.
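The eight steps above can be reproduced in a few lines (a sketch: the training set is simply cycled through until no weight changes occur):

```python
import numpy as np

# Training set for the OR function, first coordinate x_0 = 1
X = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0]], dtype=float)
z = np.array([1, 1, 1, 0])

w, eta = np.zeros(3), 1.0
for _ in range(10):  # cycle through the training set repeatedly
    changed = False
    for t in range(4):
        y = int(np.dot(w, X[t]) > 0)
        if y != z[t]:
            w = w + eta * (z[t] - y) * X[t]
            changed = True
    if not changed:
        break

print(w)  # -> [0. 1. 1.], the weights derived in steps 1-8
```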

One distinguishes different types of learning for neural networks:

Supervised Learning: The network output $ y = y(x_1, \ldots, x_p)$ is compared with the correct value $ z = z(x_1, \ldots, x_p).$ When $ y \not= z,$ the weights are changed according to the learning rule.

Reinforcement Learning: For every network output $ y = y(x_1, \ldots, x_p)$ one learns only whether it is ``correct'' or ``incorrect''; in the latter case, however, one does not know the correct value. When $ y$ is ``incorrect'', the weights are changed according to the learning rule.

Unsupervised Learning: There is no feedback while learning. Similar to cluster analysis, random errors are filtered out of the data with the help of redundant information.

For $ y \in \{ 0,1\}$ supervised and reinforcement learning coincide. The Widrow-Hoff learning rule for the perceptron belongs to this type.

The perceptron cannot learn every desired classification. The classical counterexample is the logical operation XOR (``exclusive or''):

$\displaystyle \begin{array}{l} z = 1,\ \text{\rm if either\ } x_1 = 1\ \text{\rm or\ } x_2 = 1\ \text{\rm but not both}, \\ z = 0,\ \text{\rm if\ } x_1 = x_2 = 0\ \text{\rm or\ } x_1 = x_2 = 1 .\end{array} $

A perceptron with weights $ w_0, w_1, w_2$ corresponds to a hyperplane $ w_0 + w_1 x_1 + w_2 x_2 = 0$ in the space $ \mathbb{R}^ 2 $ of the inputs $ (x_1, x_2)^\top $, which separates the objects classified as 0 by the perceptron from those classified as 1. It is not hard to see that for the ``exclusive or'' no hyperplane exists which separates the inputs to be classified as 1, $ {1 \choose 0}, {0 \choose 1} $, from those to be classified as 0, $ {0 \choose 0}, {1 \choose 1} $.
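This impossibility can be illustrated numerically (a rough sketch: a random search over weight vectors, with seed and sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)  # x_0 = 1
z_xor = np.array([0, 1, 1, 0])

fewest_errors = 4
for _ in range(100_000):
    w = rng.uniform(-1, 1, size=3)
    y = (X @ w > 0).astype(int)
    fewest_errors = min(fewest_errors, int(np.sum(y != z_xor)))

# fewest_errors never reaches 0: no weight vector classifies XOR perfectly
print(fewest_errors)
```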

Definition 19.1 (linearly separable)  
For $ p \ge 1$, two subsets $ {\cal X}_0, {\cal X}_1 \subseteq \mathbb{R}^ p$ are called linearly separable if there exist $ w \in \mathbb{R}^ p$ and $ w_0 \in \mathbb{R}$ with

$\displaystyle \begin{array}{ll} w_0 + w^\top x > 0 & \text{\rm for\ } x \in {\cal
X}_1, \\ w_0 + w^\top x \le 0 & \text{\rm for\ } x \in {\cal X}_0. \end{array} $

The perceptron with $ p$ input variables $ x_1, \ldots, x_p$ (plus the constant $ x_0 \stackrel{\mathrm{def}}{=}1$) can learn exactly those classifications that correspond to linearly separable sets of inputs.

If no perfect classification by a perceptron is possible, then one can at least try to find a ``good'' classification, that is, to determine the weights $ w_0, \ldots, w_p$ so that a measure of the number of misclassifications is minimized. An example of such an approach is the least squares (LS) classification:

Assume that the training set $ (x^ {(1)}, z^ {(1)}), \ldots, (x^ {(T)}, z^ {(T)}) $ is given. For a given weight $ w_0 $, determine the weights $ w_1, \ldots, w_p$ so that

$\displaystyle Q(w) = Q(w_1, \ldots, w_p) = \sum^ T_{i=1} \left(z^ {(i)} - y^ {(i)} \right)^ 2 = \min \, !$

with $\displaystyle \ y^ {(i)} = \boldsymbol{1}(w_0 + w^\top x^ {(i)}>0),\ w = (w_1, \ldots, w_p)^\top .$

$ w_0 $ can be chosen arbitrarily, since the weights $ w_0, \ldots, w_p$ described above are only determined up to a scale factor. In the case of the perceptron, which carries out a binary classification, $ Q(w)$ is simply the number of misclassifications. The approach given above can, however, also be applied directly to other problems. The attainable minimum of $ Q(w)$ is exactly 0 (perfect classification of the training set) when the two sets

$\displaystyle {\cal X}_0^ {(T)} = \{ x ^ {(i)},\ i \le T;\ z^ {(i)} = 0\}, \,
{\cal X}_1^ {(T)} = \{ x^ {(i)},\ i \le T;\ z^ {(i)} = 1\} $

are linearly separable.
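For binary targets the LS criterion indeed counts misclassifications, which can be checked directly (a small sketch using the OR data of Example 19.1; all names are our own):

```python
import numpy as np

def Q(w0, w, X, z):
    """LS criterion for the binary perceptron: since z and y are 0/1,
    the sum of squared deviations equals the number of misclassifications."""
    y = (w0 + X @ w > 0).astype(int)
    return int(np.sum((z - y) ** 2))

# OR training set (without the constant coordinate) is linearly separable:
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
z = np.array([1, 1, 1, 0])
print(Q(-0.5, np.array([1.0, 1.0]), X, z))    # -> 0, perfect classification
print(Q(-0.5, np.array([-1.0, -1.0]), X, z))  # -> 3, three points misclassified
```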

Fig. 18.4: Error surface of $ Q(w)$ as a function of the weight $ w=(w_1,w_2)^\top $ with transformation function: threshold function (left) and sigmoid function (right) SFEerrorsurf.xpl
\includegraphics[width=0.6\defpicwidth]{nnerrorsurf1.ps}\includegraphics[width=0.6\defpicwidth]{nnerrorsurf2.ps}

The Widrow-Hoff learning rule solves the LS classification problem; there is, however, a series of other learning rules or estimation methods which can also solve the problem. The perceptron has proven to be too inflexible for many applications. Therefore, one considers more general forms of neurons as the components from which a neural network is built:

Let $ x = (x_1, \ldots, x_p)^\top ,\ w = (w_1, \ldots, w_p)^\top $ be input and weight vectors respectively. For $ \beta,
\beta_0 \in \mathbb{R}$

$\displaystyle \psi_{\beta} (t) = \frac{1}{1 + \exp (- \frac{t + \beta}{\beta_0}) } $

is the logistic function, which due to its form is often referred to as ``the'' sigmoid function. One can also use other functions of sigmoid shape, for example, the distribution function of a normal random variable. The output variable of the neuron is $ y = \psi_{\beta} (w ^\top x).$
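The output of such a neuron can be written down directly from the formula (illustrative values; all names are our own):

```python
import numpy as np

def psi_beta(t, beta, beta0):
    """Logistic function with location parameter beta and scale beta0."""
    return 1.0 / (1.0 + np.exp(-(t + beta) / beta0))

def neuron(x, w, beta, beta0):
    """Single sigmoid neuron: y = psi_beta(w'x)."""
    return psi_beta(np.dot(w, x), beta, beta0)

x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
print(neuron(x, w, beta=0.0, beta0=1.0))  # w'x = 0, so the output is 0.5
```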

For $ \beta_0 \rightarrow 0+$, $ \psi _{\beta} (t)$ approaches a threshold function:

$\displaystyle \psi _{\beta}(t) \longrightarrow \boldsymbol{1}(t+\beta >0) \quad \text{\rm for\ } \beta_0 \longrightarrow 0+\ , $

so that the perceptron is a limit case of the neuron with logistic activation function. An example of $ Q(w)$ for neurons with the threshold function and with the sigmoid function as activation function is shown in Figure 18.4. The corresponding neuron is presented in Figure 18.5.
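This convergence toward the threshold function can be checked numerically (a small sketch with arbitrarily chosen values of $ t$ and $ \beta$):

```python
import numpy as np

def psi_beta(t, beta, beta0):
    """Logistic function with location beta and scale beta0."""
    return 1.0 / (1.0 + np.exp(-(t + beta) / beta0))

t, beta = 0.3, -0.1  # here t + beta = 0.2 > 0, so the limit value is 1
for beta0 in (1.0, 0.1, 0.01):
    print(beta0, psi_beta(t, beta, beta0))
# the outputs increase toward 1; for t + beta < 0 they would decrease toward 0
```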

Fig. 18.5: Neuron with a sigmoid transformation function
\includegraphics[width=1\defpicwidth]{neu4.ps}

$ \beta_0$ is often not explicitly chosen, since it can be absorbed as a scale factor into the other parameters $ w_1, \ldots, w_p,\ \beta$ of the neuron. If one additionally sets $ w_0 = \beta$ and $ x_0 \stackrel{\mathrm{def}}{=}1,$ then the output variable can also be written in the form:

$\displaystyle y = \psi (w_0 + w^\top x) = \psi (\sum^ p_{k=0} w_k x_k)\ \,$   with$\displaystyle \, \psi(t) = \frac{1}{1 + e^ {-t}}. $

By combining multiple neurons with sigmoid or, in the limit case, threshold activation functions into a feedforward network one obtains a so-called multiple layer perceptron (MLP) neural network. Figure 18.6 shows such a neural network with two input variables (plus the constant $ x_0 \stackrel{\mathrm{def}}{=}1$) and two sigmoid neurons in the hidden layer, which are connected by another sigmoid neuron to the output variable, where $ \psi(t) = \{ 1 + e^ {-t}\}^ {-1}$ as above.

Fig. 18.6: Multiple layer perceptron with a hidden layer
\includegraphics[width=1.2\defpicwidth]{neu5.ps}

Neural networks can also be constructed with multiple hidden layers and with multiple output variables. The connections do not have to be complete, i.e., edges between the nodes of consecutive layers may be missing, or equivalently several weights may be set to 0. Instead of the logistic function or similar sigmoid functions, threshold functions may also appear in some neurons. Another possibility are the so-called radial basis functions (RBF), among which are the density of the standard normal distribution and similar symmetric kernel functions. In this case one no longer speaks of an MLP, but of an RBF network.

Fig. 18.7: Multiple layer perceptron with two hidden layers
\includegraphics[width=1.2\defpicwidth]{nnfig6.ps}

Figure 18.7 shows an incomplete neural network with two output variables. The weights $ w_{13}, w_{22}, w_{31}, v_{12} $ and $ v_{31}$ are set to 0, and the corresponding edges are not displayed in the network graph. The output variable $ y_1$ is, for example,

$\displaystyle y_1 = v_{11} \psi(w_{01} + w_{11} x_1 + w_{21} x_2) + v_{21} \psi(w_{02}
+ w_{12} x_1 + w_{32} x_3), $

a linear combination of the results of the two upper neurons of the hidden layers.
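The computation of $ y_1$ can be written out directly; the weight values below are hypothetical, chosen only to make the sketch runnable:

```python
import numpy as np

def psi(t):
    """Logistic activation."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weights; w13, w22, w31, v12, v31 are fixed at 0 and
# therefore do not appear in the expression for y1.
w01, w11, w21 = 0.1, 0.4, -0.3
w02, w12, w32 = -0.2, 0.5, 0.7
v11, v21 = 1.0, 0.5

x1, x2, x3 = 1.0, 0.0, 1.0
y1 = v11 * psi(w01 + w11 * x1 + w21 * x2) \
   + v21 * psi(w02 + w12 * x1 + w32 * x3)
print(y1)
```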

Until now we have only discussed the case that is most often treated in the literature, where a neuron acts on a linear combination of the variables from the previous layer. Occasionally one also considers neurons whose output has the form $ \psi \left( \prod^ p_{i=1} w_i x_i \right) $ or $ \psi \left( \max_{i=1, \ldots, p} x_i \right) $.

Neural networks of the MLP type can be used for classification problems as well as for regression and forecasting problems. In order to find an adequate network for each problem, the weights must be learned from a training set, i.e., the network parameters are estimated from the data. Since we restrict ourselves to the case of supervised learning, this means that a training set $ (x^ {(1)}, z^ {(1)}), \ldots, (x^ {(T)}, z^ {(T)}) $ is given. The $ x^ {(i)} \in \mathbb{R}^ p$ are input vectors, and the $ z^ {(i)} \in \mathbb{R}^ q$ are the corresponding desired output values of the network. The vectors $ z^ {(i)}$ are compared with the actual output vectors $ y ^ {(i)} \in \mathbb{R}^ q$ of the network, and the weights are determined so that the deviations between $ z^ {(i)}$ and $ y^ {(i)}$ are small. An example is the least squares (LS) approach already mentioned in the discussion of the perceptron:

Assume that the training set $ (x^ {(1)}, z^ {(1)}), \ldots, (x^ {(T)}, z^ {(T)}) $ is given. The weights $ w_{0l},\ l = 1, \ldots, r,$ of the constant input $ x_0 \stackrel{\mathrm{def}}{=}1$ are fixed in advance, where $ r$ is the number of neurons in the first hidden layer. The weights of all the other edges in the network (between the input layer, the hidden layers and the output layer) are determined so that

$\displaystyle \sum^ T_{k=1} \vert\vert z ^ {(k)} - y^ {(k)} \vert\vert^ 2 = \min \, ! $

In the network given in Figure 18.7 the minimization is carried out with respect to the weights $ w_{11}, w_{12},$ $ w_{21}, w_{23}, w_{32}, w_{33}, v_{11}, v_{21}, v_{22}, v_{32}$. As for the perceptron, the weights $ w_{01},$ $ w_{02},$ $ w_{03}$ can be fixed in order to avoid the arbitrariness of the scale factors.

Instead of the LS criterion other loss functions can also be minimized, for example, weighted quadratic distances or, above all in classification problems, the Kullback-Leibler distance:

$\displaystyle \sum^ T_{k=1} \sum_i \left\{ z_i^ {(k)} \log \frac{z_i^ {(k)}}{y_i^ {(k)}} + (1 - z_i ^ {(k)}) \log \frac{1-z_i^ {(k)}}{1-y_i^ {(k)}} \right\} = \min \, ! $

Since only the $ y_i^ {(k)} $ depend on the weights, it is equivalent to minimize the cross-entropy between $ z_i$ and $ y_i,$ which are both contained in $ (0,1)$:

$\displaystyle - \sum^ T_{k=1} \sum_i \left\{ z_i^ {(k)} \log y_i^ {(k)} + (1 - z_i^
{(k)}) \log (1-y_i^ {(k)})\right\} = \min \, ! $
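The equivalence of the two criteria can be verified numerically: the Kullback-Leibler distance differs from the cross-entropy only by the entropy of the $ z_i^{(k)}$, which does not depend on the weights (the values below are arbitrary):

```python
import numpy as np

z = np.array([0.9, 0.1, 0.8])  # desired outputs in (0,1)
y = np.array([0.7, 0.2, 0.6])  # network outputs in (0,1)

kl = np.sum(z * np.log(z / y) + (1 - z) * np.log((1 - z) / (1 - y)))
ce = -np.sum(z * np.log(y) + (1 - z) * np.log(1 - y))
entropy = -np.sum(z * np.log(z) + (1 - z) * np.log(1 - z))

# KL distance = cross-entropy minus the entropy of z, so both criteria
# have the same minimizer in the weights
print(np.isclose(kl, ce - entropy))  # -> True
```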