6.2 Random Search
6.2 Random Search

This section describes some simple methods based on the notion of randomly searching over the domain of interest. Section 6.2.1 gives a short discussion of general issues in direct random search methods. The algorithms discussed in Sect. 6.2.2 represent two versions of random search.


6.2.1 Some General Properties of Direct Random Search

Consider the problem of trying to find the optimal $ \boldsymbol{\theta} \in \Theta$ based on noise-free measurements of $ L = L(\boldsymbol{\theta})$. Random search methods are perhaps the simplest methods of stochastic optimization in such a setting and can be quite effective in many problems. Their relative simplicity is an appealing feature to both practitioners and theoreticians. These direct random search methods have a number of advantages relative to most other search methods. The advantages include relative ease of coding in software, the need to only obtain $ L$ measurements (versus gradients or other ancillary information), reasonable computational efficiency (especially for those direct search algorithms that make use of some local information in their search), broad applicability to non-trivial loss functions and/or to $ \boldsymbol {\theta }$ that may be continuous, discrete, or some hybrid form, and a strong theoretical foundation. Some of these attributes were mentioned in the forward-looking paper of Karnopp (1963). A good recent survey of random search and related methods is Kolda et al. (2003).


6.2.2 Two Algorithms for Random Search

This section describes two direct random search techniques. These two algorithms represent only a tiny fraction of available methods. Solis and Wets (1981) and Zhigljavsky (1991) are among many references discussing these and other random search methods. The two algorithms here are intended to convey the essential flavor of most available direct random search algorithms. With the exception of some discussion at the end of the subsection, the methods here assume perfect (noise-free) values of $ L$.

The first method we discuss is ''blind random search.'' This is the simplest random search method, where the current sampling for $ \boldsymbol {\theta }$ does not take into account the previous samples. That is, this blind search approach does not adapt the current sampling strategy to information that has been garnered in the search process. The approach can be implemented in batch (non-recursive) form simply by laying down a number of points in $ \Theta$ and taking the value of  $ \boldsymbol {\theta }$ yielding the lowest $ L$ value as our estimate of the optimum. The approach can be conveniently implemented in recursive form as we illustrate below.
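The batch form can be sketched in a few lines of Python. The uniform sampling on a hypercube and the function names below are illustrative assumptions, not code from this chapter; the user supplies the loss L and the hypercube bounds.

import numpy as np

def blind_search_batch(L, lower, upper, n, seed=None):
    # Lay down n uniform points in the hypercube [lower, upper] and
    # return the one with the lowest loss value.
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(lower, upper, size=(n, len(lower)))
    losses = np.array([L(theta) for theta in candidates])
    return candidates[np.argmin(losses)]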

The simplest setting for conducting the random sampling of new (candidate) values of $ \boldsymbol {\theta }$ is when $ \Theta$ is a hypercube and we are using uniformly generated values of  $ \boldsymbol {\theta }$. The uniform distribution is continuous or discrete for the elements of $ \boldsymbol {\theta }$ depending on the definitions for these elements. In fact, the blind search form of the algorithm is unique among general stochastic optimization algorithms in that it has no adjustable algorithm coefficients that need to be ''tuned'' to the problem at hand. (Of course, a de facto tuning decision has been made by choosing the uniform distribution for sampling.)

For a domain $ \Theta$ that is not a hypercube or for other sampling distributions, one may use transformations, rejection methods, or Markov chain Monte Carlo to generate the sample $ \boldsymbol {\theta }$ values (see, e.g., Gentle, 2003). For example, if $ \Theta$ is an irregular shape, one can generate a sample on a hypercube superset containing $ \Theta$ and then reject the sample point if it lies outside of $ \Theta$.
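As a concrete illustration of the rejection approach, the following sketch (with a hypothetical user-supplied membership test in_Theta) draws uniform candidates from a bounding hypercube and discards those falling outside $ \Theta$; the names and the bounding-box representation are assumptions for illustration only.

import numpy as np

def sample_uniform_on_Theta(lower, upper, in_Theta, rng):
    # Draw uniform candidates from the bounding hypercube [lower, upper]
    # and return the first one that lies inside Theta (rejection method).
    while True:
        theta = rng.uniform(lower, upper)
        if in_Theta(theta):
            return theta

The expected number of draws per accepted point equals the ratio of the hypercube volume to the volume of $ \Theta$, so a tight bounding box is preferable.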

The steps for a recursive implementation of blind random search are given below. This method applies when $ \boldsymbol {\theta }$ has continuous, discrete, or hybrid elements.


6.2.2.1 Blind Random Search

Step 0
(Initialization) Choose an initial value of  $ \boldsymbol {\theta }$, say $ \hat{\boldsymbol{\theta}}_{0} \in \Theta$, either randomly or deterministically. (If random, usually a uniform distribution on $ \Theta$ is used.) Calculate $ L(\hat{\boldsymbol{\theta}}_{0})$. Set $ k = 0$.

Step 1
Generate a new independent value $ \boldsymbol{\theta}_{\text{new}}(k + 1) \in \Theta$, according to the chosen probability distribution. If $ L(\boldsymbol{\theta}_{\text{new}}(k + 1)) < L(\hat{\boldsymbol{\theta}}_{k})$, set $ \hat{\boldsymbol{\theta}}_{k+1} = \boldsymbol{\theta}_{\text{new}}(k + 1)$. Else, take $ \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_{k}$.

Step 2
Stop if the maximum number of $ L$ evaluations has been reached or the user is otherwise satisfied with the current estimate for $ \boldsymbol {\theta }$ via appropriate stopping criteria; else, return to Step 1 with the new $ k$ set to the former $ k+1$.
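For concreteness, a minimal Python sketch of Steps 0-2 follows, assuming $ \Theta$ is a hypercube sampled uniformly; the function names and arguments are illustrative, and the loss L is supplied by the user.

import numpy as np

def blind_random_search(L, lower, upper, max_evals, seed=None):
    # Step 0: initialize with a uniform draw from the hypercube Theta
    # and record its loss value.
    rng = np.random.default_rng(seed)
    theta_hat = rng.uniform(lower, upper)
    L_best = L(theta_hat)
    # Steps 1-2: each candidate is sampled independently of all previous
    # points; keep it only if it strictly lowers the loss.
    for _ in range(max_evals - 1):
        theta_new = rng.uniform(lower, upper)
        L_new = L(theta_new)
        if L_new < L_best:
            theta_hat, L_best = theta_new, L_new
    return theta_hat, L_best

For example, blind_random_search(lambda t: float(np.sum(t**2)), np.full(2, -1.0), np.full(2, 1.0), 1000) searches $ [-1, 1]^{2}$ for the minimum of a simple quadratic loss.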

The above algorithm converges almost surely (a.s.) to $ \boldsymbol{\theta}^{\ast}$ under very general conditions (see, e.g., Spall, 2003, pp. 40-41). Of course, convergence alone is an incomplete indication of the performance of the algorithm. It is also of interest to examine the rate of convergence. The rate is intended to tell the analyst how close $ \hat{\boldsymbol{\theta}}_{k}$ is likely to be to $ \boldsymbol{\theta}^{\ast}$ for a given cost of search. While blind random search is a reasonable algorithm when $ \boldsymbol {\theta }$ is low dimensional, it can be shown that the method is generally a very slow algorithm for even moderately dimensioned $ \boldsymbol {\theta }$ (see, e.g., Spall, 2003, pp. 42-43). This is a direct consequence of the exponential increase in the size of the search space as $ p$ increases. As an illustration, Spall (2003, Example 2.2) considers a case where $ \Theta = [0, 1]^{p}$ (the $ p$-dimensional hypercube with minimum and maximum values of $ 0$ and $ 1$ for each component of $ \boldsymbol {\theta }$) and where one wishes to guarantee with probability $ 0.90$ that each element of $ \boldsymbol {\theta }$ is within $ 0.04$ units of the optimal value. As $ p$ increases from one to ten, there is an approximate $ 10^{10}$-fold increase in the number of loss function evaluations required.
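One way to reproduce this figure (assuming $ \boldsymbol{\theta}^{\ast}$ is interior to $ \Theta$, so each coordinate has an acceptance window of width $ 0.08$): a single uniform draw lands within $ 0.04$ of every component of $ \boldsymbol{\theta}^{\ast}$ with probability $ (0.08)^{p}$, so $ n$ independent draws succeed at least once with probability $ 1 - (1 - (0.08)^{p})^{n}$. Requiring this to be at least $ 0.90$ gives $ n \geq \log(0.10)/\log(1 - (0.08)^{p})$, which yields $ n \approx 28$ for $ p = 1$ and $ n \approx 2 \times 10^{11}$ for $ p = 10$, an increase by a factor of roughly $ 10^{10}$.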

Blind search is the simplest random search in that the sampling generating the new  $ \boldsymbol {\theta }$ value does not take account of where the previous estimates of  $ \boldsymbol {\theta }$ have been. The random search algorithm below is slightly more sophisticated in that the random sampling is a function of the position of the current best estimate for  $ \boldsymbol {\theta }$. In this way, the search is more localized in the neighborhood of that estimate, allowing for a better exploitation of information that has previously been obtained about the shape of the loss function.

The localized algorithm is presented below. This algorithm was described in Matyas (1965). Note that the use of the term ''localized'' here pertains to the sampling strategy and does not imply that the algorithm is only useful for local (versus global) optimization in the sense described in Sect. 6.1. In fact, the algorithm has global convergence properties as described below. As with blind search, the algorithm may be used for continuous or discrete problems.


6.2.2.2 Localized Random Search

Step 0
(Initialization) Pick an initial guess $ \hat{\boldsymbol{\theta}}_{0} \in \Theta$, either randomly or with prior information. Set $ k = 0$.

Step 1
Generate an independent random vector $ \boldsymbol{d}_{k} \in \mathbb{R}^{p}$ and add it to the current $ \boldsymbol {\theta }$ value, $ \hat{\boldsymbol{\theta}}_{k}$. Check if $ \hat{\boldsymbol{\theta}}_{k} + \boldsymbol{d}_{k} \in \Theta$. If $ \hat{\boldsymbol{\theta}}_{k} + \boldsymbol{d}_{k} \notin \Theta$, generate a new $ \boldsymbol{d}_{k}$ and repeat or, alternatively, move $ \hat{\boldsymbol{\theta}}_{k} + \boldsymbol{d}_{k}$ to the nearest valid point within $ \Theta$. Let $ \boldsymbol{\theta}_{\text{new}}(k + 1)$ equal $ \hat{\boldsymbol{\theta}}_{k} + \boldsymbol{d}_{k} \in \Theta$ or the aforementioned nearest valid point in $ \Theta$.

Step 2
If $ L(\boldsymbol{\theta}_{\text{new}}(k+1)) < L(\hat{\boldsymbol{\theta}}_{k})$, set $ \hat{\boldsymbol{\theta}}_{k+1} = \boldsymbol{\theta}_{\text{new}}(k + 1)$; else, set $ \hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_{k}$.

Step 3
Stop if the maximum number of $ L$ evaluations has been reached or the user is otherwise satisfied with the current estimate for  $ \boldsymbol {\theta }$ via appropriate stopping criteria; else, return to Step 1 with the new $ k$ set to the former $ k+1$.
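A minimal Python sketch of Steps 0-3 is given below, again under illustrative assumptions: $ \Theta$ is taken to be a hypercube, $ \boldsymbol{d}_{k}$ is Gaussian with a user-chosen scale sigma, and infeasible candidates are moved to the nearest point of $ \Theta$ by componentwise clipping.

import numpy as np

def localized_random_search(L, theta0, lower, upper, sigma, max_evals, seed=None):
    # Step 0: start from the user's initial guess in Theta.
    rng = np.random.default_rng(seed)
    theta_hat = np.asarray(theta0, dtype=float)
    L_best = L(theta_hat)
    for _ in range(max_evals - 1):
        # Step 1: perturb the current best estimate; clipping implements
        # the "nearest valid point" option for candidates outside Theta.
        d = rng.normal(0.0, sigma, size=theta_hat.shape)
        theta_new = np.clip(theta_hat + d, lower, upper)
        # Steps 2-3: accept the candidate only if it lowers the loss.
        L_new = L(theta_new)
        if L_new < L_best:
            theta_hat, L_best = theta_new, L_new
    return theta_hat, L_best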

For continuous problems, Matyas (1965) and others have used the (multivariate) normal distribution for generating $ \boldsymbol{d}_{k}$. However, the user is free to set the distribution of the deviation vector $ \boldsymbol{d}_{k}$. The distribution should have mean zero, and each component should have a variability (e.g., standard deviation) consistent with the magnitudes of the corresponding $ \boldsymbol {\theta }$ elements. This allows the algorithm to assign roughly equal weight to each of the components of $ \boldsymbol {\theta }$ as it moves through the search space. Although not formally allowed in the convergence theory, it is often advantageous in practice if the variability in  $ \boldsymbol{d}_{k}$ is reduced as $ k$ increases. This allows one to focus the search more tightly as evidence is accrued on the location of the solution (as expressed by the location of our current estimate $ \hat{\boldsymbol{\theta}}_{k}$).
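For instance, a simple heuristic decay schedule for the scale of $ \boldsymbol{d}_{k}$ might look as follows; the initial scale sigma0 and decay exponent gamma are illustrative choices, not values prescribed by the theory.

def sigma_k(k, sigma0=0.5, gamma=0.5):
    # Shrink the perturbation scale polynomially in k so the search
    # tightens around the current estimate as evidence accumulates.
    return sigma0 / (k + 1)**gamma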

The convergence theory for the localized algorithms tends to be more restrictive than the theory for blind search. Solis and Wets (1981) provide a theorem for global convergence of localized algorithms, but the theorem conditions may not be verifiable in practice. An earlier theorem from Matyas (1965) (with proof corrected in Baba et al., 1977) provides for global convergence of the localized search above if $ L$ is a continuous function. The convergence is in the ''in probability'' sense. The theorem allows for more than one global minimum to exist in $ \Theta$. Therefore, in general, the result provides no guarantee of $ \hat{\boldsymbol{\theta}}_{k}$ ever settling near any one value $ \boldsymbol{\theta}^{\ast}$. We present the theorem statement below.

Convergence Theorem for Localized Search.

Let $ \Theta^{\ast}$ represent the set of global minima for $ L$ (see Sect. 6.1). Suppose that $ L$ is continuous on a bounded domain $ \Theta$ and that if $ \hat{\boldsymbol{\theta}}_{k} + \boldsymbol{d}_{k} \notin \Theta$ at a given iteration, a new $ \boldsymbol{d}_{k}$ is randomly generated. For any $ \eta > 0$, let $ R_{\eta} = \bigcup\nolimits_{\boldsymbol{\theta}^{\ast} \in \Theta^{\ast}} \left\{ \boldsymbol{\theta} : \vert L(\boldsymbol{\theta}) - L(\boldsymbol{\theta}^{\ast}) \vert < \eta \right\}$. Then, for $ \boldsymbol{d}_{k}$ having an i.i.d. $ N(\boldsymbol{0}, \boldsymbol{I}_{p})$ distribution, $ \lim_{k\to \infty} P(\hat{\boldsymbol{\theta}}_{k} \in R_{\eta}) = 1$.

The above algorithm might be considered the most naïve of the localized random search algorithms. More sophisticated approaches are also easy to implement. For instance, if a search in one direction increases $ L$, then it is likely to be beneficial to move in the opposite direction. Further, successive iterations in a direction that tend to consistently reduce $ L$ should encourage further iterations in the same direction. Many algorithms exploiting these simple properties exist (e.g., Solis and Wets, 1981, and Zhigljavsky, 1991).
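As a sketch of the first of these ideas (trying the opposite direction when the forward step fails), the acceptance step of the localized search above could be replaced by something like the following; this is an illustrative fragment under the same clipping assumption as before, not a method from the cited references.

import numpy as np

def step_with_reversal(L, theta_hat, L_best, d, lower, upper):
    # Try the forward perturbation first; if it fails to improve the
    # loss, try the reversed perturbation before keeping theta_hat.
    for candidate in (np.clip(theta_hat + d, lower, upper),
                      np.clip(theta_hat - d, lower, upper)):
        L_new = L(candidate)
        if L_new < L_best:
            return candidate, L_new
    return theta_hat, L_best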

In spite of its simplicity, the localized search algorithm is surprisingly effective in a wide range of problems. Several demonstrations are given in Sects. 2.2.3 and 2.3 in Spall (2003).

The random search algorithms above are usually based on perfect (noise-free) measurements of the loss function. This is generally considered a critical part of such algorithms (Pflug, 1996, p. 25). In contrast to the noise-free case, random search methods with noisy loss evaluations of the form $ y(\boldsymbol{\theta}) = L(\boldsymbol{\theta}) + \varepsilon(\boldsymbol{\theta})$ generally do not formally converge. However, there are means by which the random search techniques can be modified to accommodate noisy measurements, at least on a heuristic basis.

Some of the limited formal convergence theory for random search as applied to the noisy measurement case includes Yakowitz and Fisher (1973, Sect. 4) and Zhigljavsky (1991, Chap. 3). Spall (2003, Sect. 2.3) discusses some practical methods for coping with noise, including simple averaging of the noisy loss function evaluations $ y(\boldsymbol{\theta})$ at each value of $ \boldsymbol {\theta }$ generated in the search process and a modification of the algorithm's key decision criterion (step 1 of blind random search and step 2 of localized random search) to build in some robustness to the noise. However, the averaging method can be costly since the error decreases only at the rate of $ 1/{\sqrt{N}}$ when averaging $ N$ function evaluations with independent noise. Likewise, the altered threshold may be costly by rejecting too many changes in $ \boldsymbol {\theta }$ due to the conservative nature of the modified criterion. The presence of noise in the loss evaluations makes the optimization problem so much more challenging that there is little choice but to accept these penalties if one wants to use a simple random search. We will see in the next section that stochastic approximation tends to be more adept at coping with noise at the price of a more restrictive problem setting than the noise-free convergence theorem above.
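As a sketch of the averaging idea, each call to $ L$ in the algorithms above would be replaced by an average of $ N$ noisy measurements, as below; the measurement function y and the choice of $ N$ are assumptions the user must supply, and the standard error of the average shrinks only at the $ 1/\sqrt{N}$ rate noted above.

import numpy as np

def averaged_loss(y, theta, N):
    # Average N independent noisy measurements y(theta) = L(theta) + noise;
    # the noise standard error decays only as 1/sqrt(N).
    return float(np.mean([y(theta) for _ in range(N)]))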

