19.3 Neural Networks in Non-parametric Regression Analysis

Neural networks of type MLP describe a mapping of the input variables $ x \in \mathbb{R}^p$ onto the output variables $ y \in
\mathbb{R}^q.$ We will restrict ourselves in this section to the case where the network has only one hidden layer and the output variable is univariate $ (q=1).$ Then $ y \in \mathbb{R}$ as a function of $ x$ has the form

$\displaystyle y = v _0 + \sum^H_{h=1} v _h \psi (w_{0h} + \sum^p_{j=1} w_{jh} x_j) \stackrel{\mathrm{def}}{=}\nu_H (x ; \vartheta)$ (19.1)

where $ H$ is the number of neurons in the hidden layer and $ \psi$ is the given transformation function. The parameter vector

$\displaystyle \vartheta = (w_{01},\ldots, w_{p1}, w_{02}, \ldots, w_{pH}, v_0, \ldots, v_H)^\top \in \mathbb{R}^{(p+1) H + H+1}$

contains all the weights of the network. This network with one hidden layer already has a universal approximation property: every measurable function $ m: \mathbb{R}^p \rightarrow \mathbb{R} $ can be approximated as accurately as one wishes by a function $ \nu_H (x ; \vartheta)$ when $ \psi$ is a monotone increasing function with a bounded range. More precisely, the following result of Hornik et al. (1989) holds:

Theorem 19.1  
Let $ \psi : \mathbb{R} \rightarrow [0,1]$ be monotone increasing with $ \lim_{u\rightarrow - \infty}\psi (u) = 0,$
$ \lim_{u\rightarrow \infty} \psi (u) = 1,$ and let $ J = \{ \nu_H (x ; \vartheta );\ H \ge 1, \vartheta \in \mathbb{R} ^{ (p+1) H+H+1} \} $ be the set of all functions from $ \mathbb{R}^p$ to $ \mathbb{R}$ that can be represented by an MLP with one hidden layer.

a)
For every Borel measurable function $ f: \mathbb{R}^p \rightarrow \mathbb{R}$ there exists a sequence $ \nu_n \in J,\ n\ge 1,$ with $ \mu ( \{ x ;\ \vert f(x)-\nu_n (x)\vert> \varepsilon \} ) \longrightarrow 0$ for $ n\rightarrow \infty$ and every $ \varepsilon > 0,$ where $ \mu$ is an arbitrary probability measure on the Borel $ \sigma$-algebra of $ \mathbb{R}^p$.
b)
For every continuous function $ f: \mathbb{R}^p \rightarrow \mathbb{R}$ there exists a sequence $ \nu_n \in J,\ n\ge 1,$ with $ \sup_{x \in C} \vert f(x ) - \nu_n (x) \vert \longrightarrow 0$ for $ n \rightarrow \infty$, where $ C$ is an arbitrary compact subset of $ \mathbb{R}^p$.

The range of $ \psi$ can be set to any bounded interval, not only $ [0,1]$, without changing the validity of the approximation properties.
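
For concreteness, the following minimal sketch (in Python with NumPy; the function names and the flat layout of the parameter vector are our own choices, not part of the text) evaluates the network function $ \nu_H (x; \vartheta)$ of (19.1) with a logistic transformation function, which satisfies the assumptions of Theorem 19.1.

```python
import numpy as np

def psi(u):
    # logistic transformation function: monotone increasing with range (0, 1)
    return 1.0 / (1.0 + np.exp(-u))

def nu_H(x, theta, H):
    """Evaluate the one-hidden-layer MLP nu_H(x; theta) from (19.1).

    x     : array of shape (n, p), the input variables
    theta : flat parameter vector of length (p + 1) * H + H + 1
            (first the input weights w_{jh}, then the output weights v_0, ..., v_H)
    """
    n, p = x.shape
    W = theta[:(p + 1) * H].reshape(p + 1, H)    # input weights, column h = (w_{0h}, ..., w_{ph})
    v = theta[(p + 1) * H:]                      # output weights v_0, ..., v_H
    x_aug = np.column_stack([np.ones(n), x])     # prepend a 1 for the bias weight w_{0h}
    hidden = psi(x_aug @ W)                      # activations of the H hidden neurons, shape (n, H)
    return v[0] + hidden @ v[1:]                 # shape (n,)
```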

The weight vector $ \vartheta$ is not uniquely determined by the network function $ \nu_H$. If, for example, the transformation function is antisymmetric around 0, i.e., $ \psi (-u) = - \psi (u),$ then $ \nu_H (x; \vartheta)$ does not change when
a) the neurons of the hidden layer are interchanged, which corresponds to a permutation of the coordinates of $ \vartheta$, or when
b) all input weights $ w_{0h}, \ldots, w_{ph}$ and the output weight $ v_h$ of one of the neurons are multiplied by $ -1$.

In order to avoid this ambiguity we restrict the parameter set to a fundamental set in the sense of Rüeger and Ossen (1997), which for every network function $ \nu_H (x; \vartheta)$ contains exactly one corresponding parameter vector $ \vartheta$. In the case of antisymmetric transformation functions we restrict ourselves, for example, to weight vectors with $ v_1 \ge v_2 \ge \ldots \ge v_H \ge 0.$ In order to simplify the following considerations we also assume that $ \vartheta$ is contained in a sufficiently large compact subset $ \Theta _H \subset \mathbb{R}^{ (p+1) H + H+1}$ of a fundamental range.
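
The restriction to such a fundamental range is easy to carry out numerically. The following sketch, which assumes an antisymmetric transformation function such as $ \tanh$ and the weight layout of the previous sketch, maps a weight configuration to the equivalent representative with $ v_1 \ge \ldots \ge v_H \ge 0$; the helper name is hypothetical.

```python
import numpy as np

def to_fundamental_range(W, v):
    """Map (W, v) to the equivalent representative with v_1 >= ... >= v_H >= 0.

    W : array (p + 1, H), input weights; column h holds (w_{0h}, ..., w_{ph})
    v : array (H + 1,),   output weights (v[0] = v_0)
    Assumes an antisymmetric transformation function, psi(-u) = -psi(u).
    """
    W, v = W.copy(), v.copy()
    flip = v[1:] < 0                    # flipping the signs of one neuron's weights leaves nu_H unchanged
    W[:, flip] *= -1.0
    v[1:][flip] *= -1.0
    order = np.argsort(-v[1:])          # permuting the hidden neurons leaves nu_H unchanged
    return W[:, order], np.concatenate(([v[0]], v[1:][order]))
```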

Due to their universal approximation properties neural networks are a suitable tool in constructing non-parametric estimators for regression functions. For this we consider the following heteroscedastic regression model:

$\displaystyle Z_t = f(X_t) + \varepsilon _t\ , \quad t = 1, \ldots, n, $

where $ X_1,\ldots,X_n$ are independent, identically distributed $ d$-variate random variables with density $ p(x), x \in \mathbb{R}^d$. The residuals $ \varepsilon _1, \ldots, \varepsilon _n$ are independent, real-valued random variables with

$\displaystyle {\mathop{\text{\rm\sf E}}} ( \varepsilon _t \vert X_t = x ) = 0, \quad {\mathop{\text{\rm\sf E}}} ( \varepsilon _t^2 \vert X_t = x ) = s_\varepsilon ^2 (x ) < \infty. $

We assume that the conditional mean $ f(x)$ and the conditional variance $ s_\varepsilon ^2 (x)$ of $ Z_t$ given $ X_t=x$ are bounded continuous functions on $ \mathbb{R} ^d$. In order to estimate the regression function $ f$, we fit a neural network with one hidden layer and a sufficiently large number $ H$ of neurons to the input variables $ X_1,\ldots,X_n$ and the responses $ Z_1, \ldots, Z_n$, i.e., for given $ H$ we determine the non-linear least squares estimator $ \hat{\vartheta}_n = \mathop{\rm arg min}_{\vartheta \in \Theta _H} \hat{D}_n (\vartheta ) $ with

$\displaystyle \hat{D}_n (\vartheta ) = \frac{1}{n} \sum^n_{t=1} \left\{Z_t - \nu_H (X_t; \vartheta ) \right\}^2 . $
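
One simple way to compute $ \hat{\vartheta}_n$ numerically is to hand $ \hat{D}_n$ to a general-purpose optimizer. The following sketch uses scipy.optimize.minimize together with the hypothetical nu_H helper from the first sketch; since $ \hat{D}_n$ is in general not convex, in practice one would restart the optimization from several starting values.

```python
import numpy as np
from scipy.optimize import minimize

def fit_mlp_ls(X, Z, H, theta_init):
    """Non-linear least squares fit of the one-hidden-layer network (a sketch).

    Minimizes D_n_hat(theta) = (1/n) sum_t {Z_t - nu_H(X_t; theta)}^2
    starting from theta_init; nu_H is the helper defined above.
    """
    def D_n_hat(theta):
        resid = Z - nu_H(X, theta, H)
        return np.mean(resid ** 2)

    result = minimize(D_n_hat, theta_init, method="BFGS")
    return result.x

# Example usage with simulated data (dimensions chosen arbitrarily):
# rng = np.random.default_rng(0)
# X = rng.uniform(-2.0, 2.0, size=(500, 2))
# Z = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
# H = 3
# theta0 = 0.1 * rng.standard_normal((X.shape[1] + 1) * H + H + 1)
# theta_hat = fit_mlp_ls(X, Z, H, theta0)
```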

Under appropriate conditions $ \hat{\vartheta}_n$ converges in probability, for $ n \rightarrow \infty$ and fixed $ H$, to the parameter vector $ \vartheta _0 \in \Theta _H$ which corresponds to the best approximation of $ f(x)$ by a function of the type $ \nu_H (x ; \vartheta), \vartheta \in \Theta _H$:

$\displaystyle \vartheta _0 = \mathop{\rm arg min}_{\vartheta \in \Theta _H} D(\vartheta) \quad \text{with} \quad D(\vartheta ) = {\mathop{\text{\rm\sf E}}} \{f(X_t) - \nu_H (X_t; \vartheta )\}^2.$

Under somewhat stronger assumptions the asymptotic normality of $ \hat{\vartheta}_n$, and thus of the estimator $ \hat{f}_H (x ) = \nu_H (x ; \hat{\vartheta} _n)$ of the regression function $ f(x)$, also follows.

The estimation error $ \hat{\vartheta}_n - \vartheta _0$ can be divided into two asymptotically independent subcomponents: $ \hat{\vartheta}_n - \vartheta_0 = (\hat{\vartheta}_n - \vartheta
_n) + (\vartheta _n - \vartheta _0),$ where the value

$\displaystyle \vartheta _n = \mathop{\rm arg min}_{\vartheta \in \Theta _H} \frac{1}{n} \sum^n_{t=1} \left\{f(X_t) - \nu_H (X_t; \vartheta )\right\}^2 $

minimizes the sample version of $ D(\vartheta) $, Franke and Neumann (2000):

Theorem 19.2  
Let $ \psi$ be bounded and twice differentiable with bounded derivatives. Suppose that $ D(\vartheta) $ has a unique global minimum $ \vartheta_0$ in the interior of $ \Theta_H,$ and that the Hessian matrix $ \nabla ^2 D(\vartheta _0)$ of $ D$ at $ \vartheta_0$ is positive definite. In addition to the conditions on the regression model stated above, assume that

$\displaystyle \begin{array}{rcll} 0 < \delta \le s_\varepsilon ^2 (x ) & \le & \Delta < \infty & \text{\rm for all\ } \, x, \\ {\mathop{\text{\rm\sf E}}} ( \vert \varepsilon _t \vert ^\gamma \, \vert \, X_t = x ) & \le & C_\gamma < \infty & \text{\rm for all\ } \, x, \, \gamma \ge 1 \end{array} $

with suitable constants $ \delta, \Delta, C_\gamma,\ \gamma\ge 1.$ Then it holds for $ n \rightarrow \infty:$

$\displaystyle \sqrt{n} \left( \begin{array}{l} \hat{\vartheta} _n - \vartheta _n\\ \vartheta _n - \vartheta _0 \end{array} \right) \stackrel{{\cal L}}{\longrightarrow} {\text{\rm N}} \left( 0,\ \left( \begin{array}{ll} \Sigma _1 & 0\\ 0 & \Sigma _2 \end{array} \right) \right) $

with covariance matrices
$\displaystyle \Sigma _i = \{\nabla^2 D(\vartheta _0)\}^{-1} B_i (\vartheta _0) \{\nabla ^2 D(\vartheta _0)\}^{-1}, \quad i = 1,2,$
$\displaystyle B_1 (\vartheta ) = 4 \int s_\varepsilon ^2 (x )\ \nabla \nu_H (x ; \vartheta) \nabla \nu_H (x ; \vartheta )^\top \ p(x )\, dx,$
$\displaystyle B_2 (\vartheta ) = 4 \int \{f (x) - \nu_H (x ; \vartheta)\}^2\ \nabla \nu_H (x ; \vartheta ) \nabla \nu_H (x ; \vartheta )^\top \ p(x )\, dx,$

where $ \nabla \nu_H$ represents the gradient of the network function with respect to the parameter $ \vartheta$.

From the theorem it immediately follows that $ \sqrt{n} (\hat{\vartheta}_n - \vartheta _0)$ is asymptotically N$ (0,\ \Sigma_1 + \Sigma_2)$ distributed. $ \Sigma_1$ stands for the variability of the estimator $ \hat{\vartheta}_n$ caused by the observational errors $ \varepsilon _t$. $ \Sigma_2$ represents the part of the asymptotic variability that is caused by the misspecification of the regression function, i.e., by the fact that for the given $ H$ there is no $ \vartheta$ with $ f(x) = \nu_H (x; \vartheta)$. In the correctly specified case, where $ f(x) = \nu_H (x ; \vartheta _0)$, this covariance component vanishes, since then $ B_2 (\vartheta _0) = 0 $ and $ \Sigma_2 = 0.$

$ \Sigma_1, \Sigma_2$ can be estimated as usual by sample covariance matrices. For constructing tests and confidence intervals for $ f(x)$, alternatives to the asymptotic distribution are available: the bootstrap, or in the case of heteroscedasticity, the wild bootstrap, Franke and Neumann (2000).
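
As one possible version of this plug-in estimation, the following sketch (our own construction, not taken from Franke and Neumann (2000)) estimates $ \Sigma_1 + \Sigma_2$ by a sandwich formula: since $ {\mathop{\text{\rm\sf E}}} [ \{ Z_t - \nu_H (X_t; \vartheta_0)\}^2 \vert X_t = x ] = s_\varepsilon^2(x) + \{f(x) - \nu_H(x;\vartheta_0)\}^2$, the sum $ B_1 + B_2$ can be estimated from the squared residuals. Gradients and the Hessian are approximated by finite differences, and nu_H is the hypothetical helper from the first sketch.

```python
import numpy as np

def num_grad(f, theta, eps=1e-5):
    """Central finite-difference derivative of f (scalar- or vector-valued) at theta."""
    cols = []
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = eps
        cols.append((f(theta + e) - f(theta - e)) / (2.0 * eps))
    return np.stack(cols, axis=-1)          # last axis indexes the parameters

def sandwich_cov(X, Z, theta_hat, H):
    """Plug-in sandwich estimator of Sigma_1 + Sigma_2 (a slow but simple sketch)."""
    n = len(Z)
    resid = Z - nu_H(X, theta_hat, H)                              # residuals, shape (n,)
    G = num_grad(lambda th: nu_H(X, th, H), theta_hat)             # (n, dim) gradients of nu_H
    B = 4.0 / n * (G * (resid ** 2)[:, None]).T @ G                # estimates B_1 + B_2

    def grad_D(th):                                                # gradient of D_n_hat
        return num_grad(lambda t: np.mean((Z - nu_H(X, t, H)) ** 2), th)

    A = num_grad(grad_D, theta_hat)                                # Hessian of D_n_hat
    A = 0.5 * (A + A.T)                                            # symmetrize
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv        # the covariance of theta_hat itself is this matrix / n
```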

Theorem 19.2 refers to the theoretical value of the least squares estimator $ \hat{\vartheta}_n$, which in practice must be determined numerically. Let $ \tilde{\vartheta}_n$ be such a numerical approximation of $ \hat{\vartheta}_n.$ The quality of the resulting estimator $ \tilde{\vartheta}_n$ can depend on the numerical method used. White (1989b) showed in particular that the backpropagation algorithm leads, under certain assumptions, to an asymptotically inefficient estimator $ \tilde{\vartheta}_n$, i.e., the asymptotic covariance matrix of $ \sqrt{n} (\tilde{\vartheta} _n - \vartheta _0)$ is larger than that of $ \sqrt{n} (\hat{\vartheta}_n - \vartheta _0)$ in the sense that the difference of the two matrices is positive definite. White also showed, however, that by appending a single global minimization step, the estimator calculated by backpropagation can be modified so that for $ n \rightarrow \infty$ it is as efficient as the theoretical least squares estimator $ \hat{\vartheta}_n.$

Until now we have held the number of neurons $ H$ in the hidden layer of the network, and thus the dimension of the parameter vector $ \vartheta$, constant. The estimator based on the network, $ \hat{f}_H (x ) = \nu_H (x ; \hat{\vartheta} _n)$, converges to $ \nu_H (x ; \vartheta _0)$, so that in general the bias $ {\mathop{\text{\rm\sf E}}} \{ \hat{f}_H (x)\} - f(x)$ does not vanish for $ n \rightarrow \infty$, but rather converges to $ \nu_H (x ; \vartheta _0) - f(x )$. With standard arguments it follows directly from Theorem 19.2 that:

Corollary 19.1   Under the assumptions of Theorem 19.2 it holds for $ n \rightarrow \infty$ that

$\displaystyle \sqrt{n} \left\{\nu_H (x ; \hat{\vartheta} _n) - \nu_H (x; \vartheta _0)\right\} \stackrel{{\cal L}}{\longrightarrow} {\text{\rm N}} (0, \sigma_\infty^2) $

with $ \sigma_\infty^2 = \nabla \nu_H (x ; \vartheta _0)^\top (\Sigma_1 + \Sigma_2) \, \nabla \nu_H (x ; \vartheta _0). $
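
Corollary 19.1 suggests an approximate pointwise confidence interval, not for $ f(x)$ itself (because of the remaining bias), but for the best approximation $ \nu_H (x; \vartheta_0)$. A minimal sketch, reusing the hypothetical helpers nu_H, num_grad and sandwich_cov from the sketches above:

```python
import numpy as np
from scipy.stats import norm

def pointwise_ci(x0, X, Z, theta_hat, H, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for nu_H(x0; theta_0) (a sketch)."""
    n = len(Z)
    Sigma_hat = sandwich_cov(X, Z, theta_hat, H)                        # estimates Sigma_1 + Sigma_2
    grad = num_grad(lambda th: nu_H(x0[None, :], th, H)[0], theta_hat)  # gradient of nu_H at x0
    sigma2_inf = grad @ Sigma_hat @ grad                                # sigma_infinity^2
    fit = nu_H(x0[None, :], theta_hat, H)[0]
    half = norm.ppf(1.0 - alpha / 2.0) * np.sqrt(sigma2_inf / n)
    return fit - half, fit + half
```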

In order to obtain a consistent estimator of $ f(x)$, the number of neurons $ H$, which plays the role of a smoothing parameter for the non-parametric estimator $ \hat{f}_H(x)$, must increase with $ n$. Due to the universal approximation property of neural networks, $ \nu_H (x ; \vartheta _0)$ then converges to $ f(x),$ so that the bias vanishes asymptotically. Since the dimension of the parameter vector $ \vartheta$ grows with $ H$, $ H$ should not approach $ \infty$ too quickly, in order to ensure that the variance of $ \hat{f}_H(x)$ still converges to 0. In choosing $ H$ in practice one faces a dilemma typical of non-parametric statistics, the bias-variance dilemma: a small $ H$ results in a smooth estimator $ \hat{f}_H$ with small variance and large bias, whereas a large $ H$ leads to a small bias but a large variability of a then less smooth estimator $ \hat{f}_H$.

White (1990) showed in a corresponding framework that the regression estimator $ \hat{f}_H(x)$ based on the neural network converges in probability to $ f(x)$, and thus is consistent, when $ n \rightarrow \infty$ and $ H\rightarrow \infty$ sufficiently slowly.

From this it follows that neural networks with a suitably chosen number $ H$ of neurons in the hidden layer provide useful non-parametric function estimators in regression and, as we will discuss in the next section, in time series analysis. They have the advantage that the approximating function $ \nu_H (x; \vartheta)$ of the form (19.1) is a combination of neurons, each of which is only a given non-linear transformation of an affine-linear combination of the variables $ x = (x_1, \ldots, x_d)^\top $. This makes the numerical calculation of the least squares estimator for $ \vartheta$ feasible even when the dimension $ d$ of the input variables and the number $ H$ of neurons are large, and thus the dimension $ (d+1) H+ H+1$ of the parameter vector is very large. In contrast to the local smoothing techniques introduced in Chapter 13, neural networks can therefore also be applied as function estimators in high-dimensional spaces. One reason for this is the non-locality of the function estimator $ \hat{f}_H(x)$: it does not depend only on the observations $ (X_t, Z_t)$ with a small norm $ \vert\vert X_t - x\vert\vert$, and thus in practice it is not as strongly afflicted by the curse of dimensionality, i.e., by the fact that even for large $ n$ the local density of the observations $ X_t$ is small in high-dimensional spaces.

Theoretically it is sufficient to consider neural networks of type MLP with one hidden layer. In practice, however, one can sometimes achieve a comparably good fit to the data with a relatively more parsimonious parameterization by creating multiple hidden layers. A network function with two hidden layers made up of $ H$ and $ G$ neurons respectively has, for example, the following form

$\displaystyle \nu (x; \vartheta) = v_0 + \sum^ G_{g=1} v_g\, \psi \left( w_{0g}' + \sum^ H_{h=1} w_{hg}'\, \psi \Big(w_{0h} + \sum^ d_{j=1} w_{jh}\ x_j\Big)\right), $

where $ \vartheta$ represents the vector of all the weights $ v_g, w_{hg}',\ w_{jh}$. Such a function with small $ H,G$ can yield a more parsimoniously parameterized approximation of the regression function $ f(x)$ than a network function with only one hidden layer made up of a large number of neurons. In a case study on the development of trading strategies for currency portfolios, Franke and Klein (1999) found that with two hidden layers a significantly better result can be achieved than with only one.
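
A sketch of how such a two-hidden-layer network function is evaluated (again with $ \tanh$ as transformation function and a weight layout of our own choosing):

```python
import numpy as np

def nu_two_layers(x, v0, v, W2, W1):
    """Evaluate the network function with two hidden layers of H and G neurons (a sketch).

    W1 : (d + 1, H)  first-layer weights w_{jh}   (row 0: the biases w_{0h})
    W2 : (H + 1, G)  second-layer weights w'_{hg} (row 0: the biases w'_{0g})
    v  : (G,)        output weights v_1, ..., v_G; v0 is the output bias
    """
    psi = np.tanh                                         # bounded, monotone increasing
    n = x.shape[0]
    h1 = psi(np.column_stack([np.ones(n), x]) @ W1)       # H neurons of the first hidden layer
    h2 = psi(np.column_stack([np.ones(n), h1]) @ W2)      # G neurons of the second hidden layer
    return v0 + h2 @ v
```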

In addition, the number of parameters to be estimated can be reduced further when several connections in the neural network are cut, i.e., when the corresponding weights are set to zero from the very beginning. The large flexibility that neural networks offer when approximating regression functions creates problems in model building, since one has to decide on a network structure and thus ask:
  1. How many hidden layers does the network have?
  2. How many neurons does each hidden layer have?
  3. Which nodes (inputs, hidden neurons, outputs) of the network should be connected, i.e., which weights should be set to zero from the very beginning?
In this process one looks for a network whose function $ \nu (x ; \vartheta)$ is parsimoniously parameterized and at the same time, for a suitable $ \vartheta$, provides a sufficiently good approximation of the regression function $ f(x)$.

As in classical linear regression analysis, a number of tools are available for specifying a network structure that is consistent with the data. For simplicity we concentrate on the feedforward network with only one hidden layer made up of $ H$ neurons.

a) Repeated Significance Tests: As with the stepwise construction of a linear regression model, we start with a simple network and assume that one additional neuron, with index $ H$ and output weight $ v_H$, has been added. Whether this significantly improves the quality of the fit of the network is determined by testing the hypothesis $ H_0: v_H=0$ against the alternative $ H_1: v_H \ne 0$. Since under $ H_0$ the input weights $ w_{0H}, ..., w_{pH}$ of the neuron in question are not identifiable, i.e., they have no influence on the value of the network function $ \nu_H$, this is not a standard testing problem. White (1989a) and Teräsvirta et al. (1993) have developed Lagrange multiplier tests that are suitable for testing the significance of an additional neuron. Going in the other direction, it is also possible to start with a complex network with a large number $ H$ of neurons and to remove neurons successively until the corresponding test rejects the hypothesis $ H_0: v_H=0$. To reduce the number of parameters further, it makes sense to cut individual input connections, i.e., to set the corresponding weights to zero. For the test of the hypothesis $ H_0: w_{jh}=0$ against the alternative $ H_1: w_{jh} \ne 0$, classical Wald tests can be applied due to asymptotic results such as Theorem 19.2 (see, for example, Anders (1997) for applications in financial statistics).
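
A minimal sketch of such a Wald test for an individual network weight, based on an estimate of the asymptotic covariance matrix of $ \sqrt{n} (\hat{\vartheta}_n - \vartheta_0)$, for instance the sandwich_cov estimate sketched above; the function name and interface are hypothetical.

```python
from scipy.stats import chi2

def wald_test_single_weight(theta_hat, Sigma_hat, n, j):
    """Wald test of H0: theta_j = 0 for a single network weight (a sketch).

    Sigma_hat estimates the asymptotic covariance of sqrt(n) (theta_hat - theta_0),
    so theta_hat[j] has approximate variance Sigma_hat[j, j] / n.
    """
    W = n * theta_hat[j] ** 2 / Sigma_hat[j, j]     # Wald statistic, approx. chi^2(1) under H0
    p_value = 1.0 - chi2.cdf(W, df=1)
    return W, p_value
```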

b) Cross Validation and Validation: Cross validation is usually ruled out for determining the order of the model, i.e., above all the number $ H$ of neurons in the hidden layer, because of the extreme computational effort: in order to calculate the leave-one-out estimator for the model parameters, the neural network must be fitted $ n$ times to the sample reduced by one observation, and this for every network structure under consideration. A related and better known procedure in applications of neural networks to regression and time series analysis is to set aside a portion of the data from the sample in order to measure the quality of the model on this so-called validation set. In addition to the data $ (X_t, Z_t) ,\ t = 1, \ldots, n,$ used to calculate the least squares estimator $ \hat{\vartheta}_n$, a second independent subsample $ (X_t, Z_t),\ t = n+1, \ldots, n+M,$ is available. By minimizing measures of fit such as

$\displaystyle V(H) = \frac{1}{M} \, \sum^ {n+M}_{t=n+1} \left\{Z_t - \nu _H (X_t;\hat{\vartheta}_n)\right\}^ 2 $

the order of the model $ H$ can be determined, as well as the quality of incomplete network structures in which individual input weights have been set to zero.
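
A sketch of this selection rule, reusing the hypothetical fit_mlp_ls and nu_H helpers from the earlier sketches; the candidate set of values for $ H$ is of course problem dependent.

```python
import numpy as np

def select_H_by_validation(X, Z, X_val, Z_val, candidates, rng):
    """Choose H by minimizing the validation criterion V(H) (a sketch)."""
    p = X.shape[1]
    best_H, best_V = None, np.inf
    for H in candidates:
        theta0 = 0.1 * rng.standard_normal((p + 1) * H + H + 1)   # random starting value
        theta_hat = fit_mlp_ls(X, Z, H, theta0)                   # fit on the estimation sample
        V = np.mean((Z_val - nu_H(X_val, theta_hat, H)) ** 2)     # V(H) on the validation set
        if V < best_V:
            best_H, best_V = H, V
    return best_H, best_V

# e.g.: best_H, best_V = select_H_by_validation(X, Z, X_val, Z_val, [1, 2, 3, 5, 8],
#                                               np.random.default_rng(0))
```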

c) Network Information Criteria: To compare network structures, some well known criteria for determining the order of a model, such as the Akaike Information Criterion (AIC), can be used. The Network Information Criterion (NIC) proposed by Murata et al. (1994) is a version specialized to the case of neural networks. Here it is implicitly assumed that the residuals $ \varepsilon_t$ are normally distributed with a common variance $ \sigma^2_\varepsilon$.
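
Under this assumption the classical AIC for a fitted network takes the following simple form (a sketch; the NIC modifies the penalty term to account for possible misspecification of the network, which we do not reproduce here).

```python
import numpy as np

def aic_gaussian(Z, fitted, n_params):
    """Akaike information criterion under i.i.d. Gaussian residuals (a sketch).

    Z        : observed responses
    fitted   : fitted values nu_H(X_t; theta_hat)
    n_params : number of estimated network weights, (p + 1) * H + H + 1
    """
    n = len(Z)
    rss = np.sum((Z - fitted) ** 2)
    return n * np.log(rss / n) + 2 * n_params   # up to an additive constant
```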