6.3 Stochastic Approximation


6.3.1 Introduction

Stochastic approximation (SA) is a cornerstone of stochastic optimization. Robbins and Monro (1951) introduced SA as a general root-finding method when only noisy measurements of the underlying function are available. Let us now discuss some aspects of SA as applied to the more specific problem of root-finding in the context of optimization. With a differentiable loss function $ L(\boldsymbol{\theta})$, recall the familiar set of $ p$ equations and $ p$ unknowns for use in finding a minimum  $ \boldsymbol{\theta}^{\ast}$:

$\displaystyle \boldsymbol{g}(\boldsymbol{\theta}) = \frac{\partial L}{\partial \boldsymbol{\theta}} = \boldsymbol{0}\;.$ (6.3)

(Of course, side conditions are required to guarantee that a root of (6.3) is a minimum, not a maximum or saddlepoint.) Note that (6.3) is nominally only directed at local optimization problems, although some extensions to global optimization are possible, as briefly discussed in Sect. 6.3.3. There are a number of approaches for solving the problem represented by (6.3) when direct (usually noisy) measurements of the gradient $ \boldsymbol{g}$ are available. These typically go by the name of stochastic gradient methods (e.g., Spall, 2003, Chap. 5). In contrast to the stochastic gradient approach - but consistent with the emphasis in the random search and genetic algorithms (Sects. 6.2 and 6.4 here) - let us focus on SA when only measurements of $ L$ are available. However, unlike the emphasis in random search and genetic algorithms, we consider noisy measurements of $ L$.

To motivate the general SA approach, first recall the familiar form for the unconstrained deterministic steepest descent algorithm for solving (6.3):

$\displaystyle \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_{k} - a_{k} \boldsymbol{g}(\hat{\boldsymbol{\theta}}_{k})\;,$    

where the gain (or step size) satisfies $ a_{k} > 0$ (see, e.g., Bazaraa et al., 1993, pp. 300-308, or any other book on mathematical programming; Spall, 2003, Sect. 1.4). This algorithm requires exact knowledge of $ \boldsymbol{g}$. Steepest descent will converge to $ \boldsymbol{\theta}^{\ast}$ under certain fairly general conditions. (A notable variation of steepest descent is the Newton-Raphson algorithm (sometimes called Newton's method; e.g., Bazaraa et al., 1993, pp. 308-312), which has the form $ \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_{k} - a_{k} \boldsymbol{H}(\hat{\boldsymbol{\theta}}_{k})^{-1}\boldsymbol{g}(\hat{\boldsymbol{\theta}}_{k})$, where $ \boldsymbol{H}(\cdot)$ is the Hessian (second derivative) matrix of $ L$. Under more restrictive conditions, the Newton-Raphson algorithm has a much faster rate of convergence to $ \boldsymbol{\theta}^{\ast}$ than steepest descent. However, with its requirement for a Hessian matrix, it is generally more challenging to implement. An SA version of Newton-Raphson is discussed briefly at the end of Sect. 6.3.3.)

Unlike with steepest descent, it is assumed here that we have no direct knowledge of  $ \boldsymbol{g}$. The recursive procedure of interest is in the general SA form

$\displaystyle \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_{k} - a_{k} \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})\;,$ (6.4)

where $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$ is the estimate of $ \boldsymbol{g}$ at the iterate $ \hat{\boldsymbol{\theta}}_{k}$ based on measurements of the loss function. Hence, (6.4) is analogous to the steepest descent algorithm, with the gradient estimate $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$ replacing the direct gradient $ \boldsymbol{g}$ at $ \boldsymbol{\theta} = \hat{\boldsymbol{\theta}}_{k}$. The gain $ a_{k} > 0$ here also acts in a way similar to its role in the steepest descent form. Under appropriate conditions, the iteration in (6.4) converges to $ \boldsymbol{\theta}^{\ast}$ in some stochastic sense (usually almost surely, a.s.). (There are constrained forms of SA, but we do not discuss those here; see, e.g., Spall, 2003, Chaps. 4-7.)
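
To make the recursion concrete, the following is a minimal Python sketch of the generic form (6.4); the names sa_recursion, grad_estimate, and gain are illustrative assumptions (not from the original text), with the gradient estimator left as a placeholder to be supplied by the FD or SP approximations of Sects. 6.3.2 and 6.3.3.

import numpy as np

def sa_recursion(grad_estimate, theta0, gain, n_iterations):
    """Generic stochastic approximation recursion (6.4).

    grad_estimate(theta, k) returns an estimate of g(theta) built from
    noisy loss measurements; gain(k) returns the positive step size a_k.
    """
    theta = np.asarray(theta0, dtype=float)
    for k in range(n_iterations):
        theta = theta - gain(k) * grad_estimate(theta, k)
    return theta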

Sections 6.3.2 and 6.3.3 discuss two SA methods for carrying out the optimization task using noisy measurements of the loss function. Section 6.3.2 discusses the traditional finite-difference SA method and Sect. 6.3.3 discusses the more recent simultaneous perturbation method.


6.3.2 Finite-Difference SA

The essential part of (6.4) is the gradient approximation $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$. The traditional means of forming the approximation is the finite-difference method. Expression (6.4) with this approximation represents the finite-difference SA (FDSA) algorithm. One-sided gradient approximations involve measurements $ y(\hat{\boldsymbol{\theta}}_{k})$ and $ y(\hat{\boldsymbol{\theta}}_{k} +$ perturbation$)$, while two-sided approximations involve measurements of the form $ y(\hat{\boldsymbol{\theta}}_{k} \pm$ perturbation$)$. The two-sided FD approximation for use with (6.4) is

$\displaystyle \hat{\boldsymbol{g}}_{k}(\hat{\boldsymbol{\theta}}_{k}) = \left[\begin{matrix} \dfrac{y(\hat{\boldsymbol{\theta}}_{k} + c_{k} \boldsymbol{\xi}_{1}) - y(\hat{\boldsymbol{\theta}}_{k} - c_{k} \boldsymbol{\xi}_{1})}{2c_{k}} \\ \vdots \\ \dfrac{y(\hat{\boldsymbol{\theta}}_{k} + c_{k} \boldsymbol{\xi}_{p}) - y(\hat{\boldsymbol{\theta}}_{k} - c_{k} \boldsymbol{\xi}_{p})}{2c_{k}} \end{matrix}\right]\;,$ (6.5)

where $ \boldsymbol{\xi}_{i}$ denotes a vector with a $ 1$ in the $ i$th place and 0's elsewhere and $ c_{k} > 0$ defines the difference magnitude. The pair $ \{a_{k} , c_{k}\}$ are the gains (or gain sequences) for the FDSA algorithm. The two-sided form in (6.5) is the obvious multivariate extension of the scalar two-sided form in Kiefer and Wolfowitz (1952). The initial multivariate method in Blum (1954) used a one-sided approximation.
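
For concreteness, here is a minimal Python sketch of the two-sided FD approximation (6.5); the names fd_gradient and y (a user-supplied noisy loss measurement function) are illustrative assumptions rather than part of the original text.

import numpy as np

def fd_gradient(y, theta, c_k):
    """Two-sided finite-difference gradient estimate, expression (6.5).

    y     : function returning a noisy loss measurement y(theta)
    theta : current parameter vector of length p
    c_k   : positive difference magnitude
    """
    p = theta.size
    g_hat = np.zeros(p)
    for i in range(p):
        xi = np.zeros(p)
        xi[i] = 1.0        # i-th standard basis vector
        g_hat[i] = (y(theta + c_k * xi) - y(theta - c_k * xi)) / (2.0 * c_k)
    return g_hat           # each call uses 2p loss measurements

The loop over the $ p$ coordinates makes the $ 2p$ measurement cost per gradient approximation explicit; this is the cost that the simultaneous perturbation method of Sect. 6.3.3 avoids.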

It is of fundamental importance to determine conditions such that $ \hat{\boldsymbol{\theta}}_{k}$ as shown in (6.4) and (6.5) converges to $ \boldsymbol{\theta}^{\ast}$ in some appropriate stochastic sense. The convergence theory for the FDSA algorithm is similar to "standard" convergence theory for the root-finding SA algorithm of Robbins and Monro (1951). Additional difficulties, however, arise due to a bias in $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$ as an estimator of $ \boldsymbol{g}(\hat{\boldsymbol{\theta}}_{k})$. That is, standard conditions for convergence of SA require unbiased estimates of $ \boldsymbol{g}(\cdot)$ at all $ k$. On the other hand, $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$, as shown in (6.5), is a biased estimator, with the bias having a magnitude of order $ c_{k}^{2}$. We will not present the details of the convergence theory here, as it is available in many other references (e.g., Fabian, 1971; Kushner and Yin, 1997, Sects. 5.3, 8.3, and 10.3; Ruppert, 1991; Spall, 2003, Chap. 6). However, let us note that the standard conditions on the gain sequences are: $ a_{k} > 0$, $ c_{k} > 0$, $ a_{k} \to 0$, $ c_{k} \to 0$, $ \sum\nolimits_{k=0}^{\infty} {a_{k} } = \infty$, and $ \sum\nolimits_{k=0}^{\infty} {a_{k}^{2}} / {c_{k}^{2}} < \infty$. The choice of these gain sequences is critical to the performance of the method. Common forms for the sequences are:

$\displaystyle a_{k} = \frac{a}{(k + 1 + A)^{\alpha}}$   and$\displaystyle \quad c_{k} = \frac{c}{(k + 1)^{\gamma }}\;,$    

where the coefficients $ a$, $ c$, $ \alpha $, and $ \gamma$ are strictly positive and $ A \ge 0$. The user must choose these coefficients, a process usually based on a combination of the theoretical restrictions above, trial-and-error numerical experimentation, and basic problem knowledge. In some cases, it is possible to partially automate the selection of the gains (see, e.g., Spall, 2003, Sect. 6.6).
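
As a small illustration, the gain forms above can be evaluated as follows; the coefficient values shown are the ones used in the numerical example below and would normally require problem-specific tuning.

def gains(k, a=0.5, A=5.0, alpha=0.602, c=1.0, gamma=0.101):
    """Common decaying gain sequences a_k and c_k for k = 0, 1, 2, ...

    These exponents satisfy the theoretical restrictions above:
    alpha <= 1 makes sum(a_k) diverge, while 2*alpha - 2*gamma > 1
    makes sum(a_k**2 / c_k**2) converge.
    """
    a_k = a / (k + 1 + A) ** alpha
    c_k = c / (k + 1) ** gamma
    return a_k, c_k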

Let us summarize a numerical example based on the following $ p = 10$ loss function:

$\displaystyle L(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\boldsymbol{B}^{T}\boldsymbol{B}\boldsymbol{\theta} + {0.1}\sum\limits_{i=1}^{10} {(\boldsymbol{B}\boldsymbol{\theta})_{i}^{3}} + {0.01}\sum\limits_{i=1}^{10} {(\boldsymbol{B}\boldsymbol{\theta})_{i}^{4}}\;,$

where $ (\cdot)_{i}$ represents the $ i$th component of the argument vector $ \boldsymbol{B}\boldsymbol{\theta}$, and $ \boldsymbol{B}$ is such that $ 10\,\boldsymbol{B}$ is an upper triangular matrix of $ 1$'s. The minimum occurs at $ \boldsymbol{\theta}^{\ast} = \boldsymbol{0}$ with $ L(\boldsymbol{\theta}^{\ast}) = 0$; all runs are initialized at $ \hat{\boldsymbol{\theta}}_{0} = [1, 1, \ldots, 1]^{T}$ (so $ L(\hat{\boldsymbol{\theta}}_{0}) = {4.178}$). Suppose that the measurement noise $ \varepsilon$ is independent, identically distributed (i.i.d.) $ N(0,1)$. All iterates $ \hat{\boldsymbol{\theta}}_{k}$ are constrained to be in $ \Theta = [-5, 5]^{10}$. If an iterate falls outside of $ \Theta$, each individual component of the candidate $ \boldsymbol{\theta}$ that violates the interval $ [-5,5]$ is mapped to its nearest endpoint $ \pm 5$. The subsequent gradient estimate is formed at the modified (valid) $ \boldsymbol{\theta}$ value. (The perturbed values $ \hat{\boldsymbol{\theta}}_{k} \pm c_{k}\boldsymbol{\xi}_{i}$ are allowed to go outside of $ \Theta$.)
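
The setup of this example can be expressed in a short Python sketch; the specific function and variable names (loss, y, project, theta0) are illustrative assumptions, not part of the original text.

import numpy as np

p = 10
B = np.triu(np.ones((p, p))) / 10.0      # 10*B is an upper triangular matrix of 1's
rng = np.random.default_rng()

def loss(theta):
    """True loss: theta'B'B theta + 0.1*sum((B theta)_i^3) + 0.01*sum((B theta)_i^4)."""
    b = B @ theta
    return float(b @ b + 0.1 * np.sum(b ** 3) + 0.01 * np.sum(b ** 4))

def y(theta):
    """Noisy measurement: L(theta) plus i.i.d. N(0, 1) noise."""
    return loss(theta) + rng.standard_normal()

def project(theta):
    """Map each component violating [-5, 5] to its nearest endpoint."""
    return np.clip(theta, -5.0, 5.0)

theta0 = np.ones(p)
print(loss(theta0))                      # approximately 4.178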

Using $ n = 1000$ loss measurements per run, we compare FDSA with the localized random search method of Sect. 6.2. Based on principles for gain selection in Spall (2003, Sect. 6.6) together with some limited trial-and-error experimentation, we chose $ a = {0.5}$, $ c = 1$, $ A = 5$, $ \alpha = {0.602}$, and $ \gamma = {0.101}$ for FDSA, and, for the random search method, an average of $ 20$ loss measurements per iteration with normally distributed perturbations having distribution $ N(\boldsymbol{0}, {0.5}^{2}\boldsymbol{I}_{10})$.
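
For illustration, one FDSA replication under these settings might look like the following hedged sketch, which reuses fd_gradient from the earlier sketch and loss, y, project, and theta0 from the sketch above; the localized random search comparison is not reproduced here.

def fdsa(y, theta0, n_measurements=1000, a=0.5, c=1.0, A=5.0,
         alpha=0.602, gamma=0.101):
    """One FDSA replication: recursion (6.4) with the FD estimate (6.5)."""
    theta = theta0.copy()
    p = theta.size
    n_iterations = n_measurements // (2 * p)   # each iteration costs 2p measurements
    for k in range(n_iterations):
        a_k = a / (k + 1 + A) ** alpha
        c_k = c / (k + 1) ** gamma
        theta = project(theta - a_k * fd_gradient(y, theta, c_k))
    return theta

print(loss(fdsa(y, theta0)))               # typically well below L(theta_0) = 4.178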

Figure 6.2 summarizes the results. Each curve represents the sample mean of $ 50$ independent replications. An individual replication of one of the two algorithms has much more variation than the corresponding smoothed curve in the figure.

Figure 6.2: Comparison of FDSA and localized random search. Each curve represents sample mean of $ 50$ independent replications

Figure 6.2 shows that both algorithms produce an overall reduction in the (true) loss function as the number of measurements approaches $ 1000$. The curves illustrate that FDSA outperforms random search in this case. To make the comparison fair, attempts were made to tune each algorithm to provide approximately the best performance possible. Of course, one must be careful about using this example to infer that such a result will hold in other problems as well.


6.3.3 Simultaneous Perturbation SA

The FDSA algorithm of Sect. 6.3.2 is a standard SA method for carrying out optimization with noisy measurements of the loss function. However, as the dimension $ p$ grows large, the number of loss measurements required may become prohibitive. That is, each two-sided gradient approximation requires $ 2p$ loss measurements. More recently, the simultaneous perturbation SA (SPSA) method was introduced, requiring only two measurements per iteration to form a gradient approximation, independent of the dimension $ p$. This provides the potential for a large savings in the overall cost of optimization.

Beginning with the generic SA form in (6.4), we now present the SP form of the gradient approximation. In this form, all elements of $ \hat{\boldsymbol{\theta}}_{k}$ are randomly perturbed together to obtain two loss measurements $ y(\cdot)$. For the two-sided SP gradient approximation, this leads to

$\displaystyle \hat{\boldsymbol{g}}_{k}(\hat{\boldsymbol{\theta}}_{k}) = \left[\begin{matrix} \dfrac{y(\hat{\boldsymbol{\theta}}_{k} + c_{k} \boldsymbol{\Delta}_{k}) - y(\hat{\boldsymbol{\theta}}_{k} - c_{k} \boldsymbol{\Delta}_{k})}{2c_{k} \Delta_{k1}} \\ \vdots \\ \dfrac{y(\hat{\boldsymbol{\theta}}_{k} + c_{k} \boldsymbol{\Delta}_{k}) - y(\hat{\boldsymbol{\theta}}_{k} - c_{k} \boldsymbol{\Delta}_{k})}{2c_{k} \Delta_{kp}} \end{matrix}\right] = \dfrac{y(\hat{\boldsymbol{\theta}}_{k} + c_{k} \boldsymbol{\Delta}_{k}) - y(\hat{\boldsymbol{\theta}}_{k} - c_{k} \boldsymbol{\Delta}_{k})}{2c_{k}} \left[\Delta_{k1}^{-1}, \Delta_{k2}^{-1}, \ldots, \Delta_{kp}^{-1}\right]^{T}\;,$ (6.6)

where the mean-zero $ p$-dimensional random perturbation vector, $ \boldsymbol{\Delta}_{k} = [\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kp}]^{T}$, has a user-specified distribution satisfying certain conditions and $ c_{k}$ is a positive scalar (as with FDSA). Because the numerator is the same in all $ p$ components of $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$, the number of loss measurements needed to estimate the gradient in SPSA is two, regardless of the dimension $ p$.
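
A minimal Python sketch of the SP gradient approximation (6.6) follows, using the symmetric Bernoulli $ \pm 1$ perturbation distribution discussed later in this section; the name sp_gradient and the noisy measurement function y are illustrative assumptions.

import numpy as np

def sp_gradient(y, theta, c_k, rng=None):
    """Simultaneous perturbation gradient estimate, expression (6.6).

    Uses only two loss measurements regardless of the dimension p.
    """
    if rng is None:
        rng = np.random.default_rng()
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)           # symmetric Bernoulli +/-1 perturbations
    y_plus = y(theta + c_k * delta)
    y_minus = y(theta - c_k * delta)
    return (y_plus - y_minus) / (2.0 * c_k * delta)   # elementwise division by Delta_ki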

Relative to FDSA, the $ p$-fold measurement savings per iteration, of course, provides only the potential for SPSA to achieve large savings in the total number of measurements required to estimate $ \boldsymbol{\theta}$ when $ p$ is large. This potential is realized if the number of iterations required for effective convergence to an optimum $ \boldsymbol{\theta}^{\ast}$ does not increase in a way that cancels the measurement savings per gradient approximation. One can use asymptotic distribution theory to address this issue. In particular, both FDSA and SPSA are known to be asymptotically normally distributed under very similar conditions. One can use this asymptotic distribution result to characterize the mean-squared error $ E\left({\left\Vert {\hat{\boldsymbol{\theta}}_{k} - \boldsymbol{\theta}^{\ast}} \right\Vert^{2}} \right)$ for the two algorithms for large $ k$. Fortunately, under fairly broad conditions, the $ p$-fold savings at each iteration is preserved across iterations. In particular, based on asymptotic considerations:

Under reasonably general conditions, SPSA and FDSA achieve the same level of statistical accuracy for a given number of iterations, even though SPSA uses only $ 1/p$ times the number of function evaluations of FDSA (since each gradient approximation uses only $ 1/p$ the number of function evaluations).

The SPSA Web site (www.jhuapl.edu/SPSA) includes many references on the theory and application of SPSA. On this Web site, one can find many accounts of numerical studies that are consistent with the efficiency statement above. (Of course, given that the statement is based on asymptotic arguments and associated regularity conditions, one should not assume that the result always holds.) In addition, there are references describing many applications. These include queuing systems, pattern recognition, industrial quality improvement, aircraft design, simulation-based optimization, bioprocess control, neural network training, chemical process control, fault detection, human-machine interaction, sensor placement and configuration, and vehicle traffic management.

We will not present the formal conditions for convergence and asymptotic normality of SPSA, as such conditions are available in many references (e.g., Dippon and Renz, 1997; Gerencsér, 1999; Spall, 1992, 2003, Chap. 7). These conditions are essentially identical to the standard conditions for convergence of SA algorithms, with the exception of the additional conditions on the user-generated perturbation vector  $ \boldsymbol{\Delta}_{k}$.

The choice of the distribution for generating the $ \boldsymbol{\Delta}_{k}$ is important to the performance of the algorithm. The standard conditions for the elements $ \Delta_{ki}$ of $ \boldsymbol{\Delta}_{k}$ are that the $ \{\Delta_{ki}\}$ are independent for all $ k$, $ i$, identically distributed for all $ i$ at each $ k$, symmetrically distributed about zero, and uniformly bounded in magnitude for all $ k$. In addition, there is an important inverse moments condition:

$\displaystyle E\left( {\left\vert {\frac{1}{\Delta _{ki} }} \right\vert^{2+2\tau }} \right) \le C$    

for some $ \tau > 0$ and $ C > 0$. The role of this condition is to control the variation of the elements of $ \hat{\boldsymbol{g}}_{k} (\hat{\boldsymbol{\theta}}_{k})$ (which have $ \Delta_{ki}$ in the denominator). One simple and popular distribution that satisfies the inverse moments condition is the symmetric Bernoulli $ \pm 1$ distribution. (In fact, as discussed in Spall, 2003, Sect. 7.7, this distribution can be shown to be optimal under general conditions when using asymptotic considerations.) Two common mean-zero distributions that do not satisfy the inverse moments condition are the symmetric uniform and the normal with mean zero. The failure of both of these distributions is a consequence of the amount of probability mass near zero. Exercise 7.3 in Spall (2003) illustrates the dramatic performance degradation that can occur when using distributions that violate the inverse moments condition.
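
As a hedged illustration of these conditions, the sketch below contrasts the valid Bernoulli $ \pm 1$ choice with one other valid possibility (a segmented uniform distribution bounded away from zero, shown here purely as an example); a plain uniform or normal perturbation, by contrast, places too much mass near zero and should not be used.

import numpy as np

rng = np.random.default_rng()

def bernoulli_perturbation(p):
    """Valid: symmetric Bernoulli +/-1 (bounded away from zero)."""
    return rng.choice([-1.0, 1.0], size=p)

def segmented_uniform_perturbation(p, lo=0.5, hi=1.5):
    """Also valid (illustrative): uniform on +/-[lo, hi], hence bounded away
    from zero, so the inverse moments condition holds trivially. In contrast,
    U(-1, 1) or N(0, 1) perturbations violate the condition."""
    magnitude = rng.uniform(lo, hi, size=p)
    sign = rng.choice([-1.0, 1.0], size=p)
    return sign * magnitude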

As with any real-world implementation of stochastic optimization, there are important practical considerations when using SPSA. One is to attempt to define $ \boldsymbol{\theta}$ so that the magnitudes of the $ \boldsymbol{\theta}$ elements are similar to one another. This desire is apparent by noting that the magnitudes of all components in the perturbations $ c_{k}\boldsymbol{\Delta}_{k}$ are identical in the case where identical Bernoulli distributions are used. Although it is not always possible to choose the definition of the elements in $ \boldsymbol{\theta}$, in most cases an analyst will have the flexibility to specify the units for $ \boldsymbol{\theta}$ to ensure similar magnitudes. Another important consideration is the choice of the gains $ a_{k}$ and $ c_{k}$. The principles described for FDSA above apply to SPSA as well. Section 7.5 of Spall (2003) provides additional practical guidance.
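
Combining these pieces, a basic SPSA loop might be sketched as follows (the gain values are placeholders chosen according to the FDSA principles above; sp_gradient is the sketch given earlier, and the optional project function handles simple box constraints as in the FDSA example).

def spsa(y, theta0, n_iterations=500, a=0.5, c=1.0, A=5.0,
         alpha=0.602, gamma=0.101, project=lambda t: t):
    """Basic SPSA: recursion (6.4) with the SP gradient estimate (6.6).

    Each iteration uses two loss measurements, regardless of the dimension p.
    """
    theta = theta0.copy()
    for k in range(n_iterations):
        a_k = a / (k + 1 + A) ** alpha
        c_k = c / (k + 1) ** gamma
        theta = project(theta - a_k * sp_gradient(y, theta, c_k))
    return theta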

There have been a number of important extensions of the basic SPSA method represented by the combination of (6.4) and (6.6). Three such extensions are to the problem of global (versus local) optimization, to discrete (versus continuous) problems, and to include second-order-type information (Hessian matrix) with the aim of creating a stochastic analogue to the deterministic Newton-Raphson method.

The use of SPSA for global minimization among multiple local minima is discussed in Maryak and Chin (2001). One of their approaches relies on injecting Monte Carlo noise in the right-hand side of the basic SPSA updating step in (6.4). This approach is a common way of converting SA algorithms to global optimizers through the additional "bounce" introduced into the algorithm (Yin, 1999). Maryak and Chin (2001) also show that basic SPSA without injected noise (i.e., (6.4) and (6.6) without modification) may, under certain conditions, be a global optimizer. Formal justification for this result follows because the random error in the SP gradient approximation acts in a way that is statistically equivalent to the injected noise mentioned above.
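
As a hedged sketch of the injected-noise idea, a mean-zero Monte Carlo term with decaying magnitude can be added to the right-hand side of the update, reusing sp_gradient from the earlier sketch; the particular decay schedule b_k and coefficient b used below are illustrative only and are not the schedule analyzed by Maryak and Chin (2001).

import numpy as np

def spsa_injected_noise(y, theta0, n_iterations=500, a=0.5, c=1.0, A=5.0,
                        alpha=0.602, gamma=0.101, b=0.1, rng=None):
    """SPSA with extra Monte Carlo noise to encourage escape from local minima."""
    if rng is None:
        rng = np.random.default_rng()
    theta = theta0.copy()
    for k in range(n_iterations):
        a_k = a / (k + 1 + A) ** alpha
        c_k = c / (k + 1) ** gamma
        b_k = b / (k + 1)                      # illustrative decay of the injected noise
        w_k = rng.standard_normal(theta.size)  # mean-zero Monte Carlo "bounce"
        theta = theta - a_k * sp_gradient(y, theta, c_k, rng) + b_k * w_k
    return theta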

Discrete optimization problems (where $ \boldsymbol {\theta }$ may take on discrete or combined discrete/continuous values) are discussed in Gerencsér et al. (1999). Discrete SPSA relies on a fixed-gain (constant $ a_{k}$ and $ c_{k}$) version of the standard SPSA method. The parameter estimates produced are constrained to lie on a discrete-valued grid. Although gradients do not exist in this setting, the approximation in (6.6) (appropriately modified) is still useful as an efficient measure of slope information.

Finally, using the simultaneous perturbation idea, it is possible to construct a simple method for estimating the Hessian (or Jacobian) matrix of $ L$ while, concurrently, estimating the primary parameters of interest ( $ \boldsymbol{\theta}$). This adaptive SPSA (ASP) approach produces a stochastic analogue to the deterministic Newton-Raphson algorithm (e.g., Bazaraa et al., 1993, pp. 308-312), leading to a recursion that is optimal or near-optimal in its rate of convergence and asymptotic error. The approach applies in both the gradient-free setting emphasized in this section and in the root-finding/stochastic gradient-based (Robbins-Monro) setting reviewed in Spall (2003, Chaps. 4 and 5). Like the standard SPSA algorithm, the ASP algorithm requires only a small number of loss function (or gradient, if relevant) measurements per iteration - independent of the problem dimension - to adaptively estimate the Hessian and parameters of primary interest. Further information is available in Spall (2000) or Spall (2003, Sect. 7.8).
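
To indicate the flavor of the ASP construction, the per-iteration Hessian estimate in Spall (2000) is formed from a second, independently generated simultaneous perturbation $ \tilde{\boldsymbol{\Delta}}_{k}$ applied to the gradient approximation itself; in rough form (notation simplified here, and the reader should consult Spall, 2000, for the precise statement),

$\displaystyle \hat{\boldsymbol{H}}_{k} = \frac{1}{2}\left[ \frac{\delta \hat{\boldsymbol{g}}_{k}}{2\tilde{c}_{k}} \left[\tilde{\Delta}_{k1}^{-1}, \ldots, \tilde{\Delta}_{kp}^{-1}\right] + \left( \frac{\delta \hat{\boldsymbol{g}}_{k}}{2\tilde{c}_{k}} \left[\tilde{\Delta}_{k1}^{-1}, \ldots, \tilde{\Delta}_{kp}^{-1}\right] \right)^{T} \right]\;,$

where $ \delta \hat{\boldsymbol{g}}_{k} = \hat{\boldsymbol{g}}_{k}(\hat{\boldsymbol{\theta}}_{k} + \tilde{c}_{k}\tilde{\boldsymbol{\Delta}}_{k}) - \hat{\boldsymbol{g}}_{k}(\hat{\boldsymbol{\theta}}_{k} - \tilde{c}_{k}\tilde{\boldsymbol{\Delta}}_{k})$ and $ \tilde{c}_{k}$ is a second difference magnitude. The $ \hat{\boldsymbol{H}}_{k}$ are averaged across iterations and mapped to a positive-definite matrix before being used in a Newton-Raphson-type update of $ \hat{\boldsymbol{\theta}}_{k}$.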

