This section describes some simple methods based on the notion of randomly searching over the domain of interest. Section 6.2.1 gives a short discussion of general issues in direct random search methods. The algorithms discussed in Sect. 6.2.2 represent two versions of random search.
Consider the problem of trying to find the optimal $\theta^* \in \Theta$ based on noise-free measurements of $L = L(\theta)$. Random search methods are perhaps the simplest
methods of stochastic optimization in such a setting and can be quite
effective in many problems. Their relative simplicity is an appealing
feature to both practitioners and theoreticians. These direct random
search methods have a number of advantages relative to most other
search methods. The advantages include relative ease of coding in
software, the need to only obtain $L$ measurements (versus gradients
or other ancillary information), reasonable computational efficiency
(especially for those direct search algorithms that make use of some
local information in their search), broad applicability to non-trivial
loss functions and/or to $\theta$ that may be continuous,
discrete, or some hybrid form, and a strong theoretical
foundation. Some of these attributes were mentioned in the
forward-looking paper of Karnopp (1963). A good recent survey of
random search and related methods is Kolda et al. (2003).
This section describes two direct random search techniques. These two
algorithms represent only a tiny fraction of available methods. Solis
and Wets (1981) and Zhigljavsky (1991) are among many references
discussing these and other random search methods. The two algorithms
here are intended to convey the essential flavor of most available
direct random search algorithms. With the exception of some discussion
at the end of the subsection, the methods here assume perfect
(noise-free) values of $L$.
The first method we discuss is ''blind random search.'' This is the
simplest random search method, where the current sampling for $\theta$ does not take into account the previous
samples. That is, this blind search approach does not adapt the
current sampling strategy to information that has been garnered in the
search process. The approach can be implemented in batch
(non-recursive) form simply by laying down a number of points in $\Theta$ and taking the value of $\theta$ yielding the lowest $L$ value as our estimate of the optimum. The approach can be
conveniently implemented in recursive form as we illustrate below.
The simplest setting for conducting the random sampling of new
(candidate) values of $\theta$ is when $\Theta$ is a hypercube and we are using uniformly generated values of $\theta$. The uniform distribution is continuous or discrete for the elements of $\theta$ depending on the
definitions for these elements. In fact, the blind search form
of the algorithm is unique among all general stochastic optimization
algorithms in that it is the only one without any adjustable algorithm
coefficients that need to be ''tuned'' to the problem at hand. (Of
course, a de facto tuning decision has been made by choosing the
uniform distribution for sampling.)
For a domain that is not a hypercube or for other sampling
distributions, one may use transformations, rejection methods, or
Markov chain Monte Carlo to generate the sample $\theta$ values (see, e.g., Gentle, 2003). For example, if $\Theta$ is an irregular shape, one can generate a sample on a hypercube superset containing $\Theta$ and then reject the sample point if it lies outside of $\Theta$.
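As a minimal sketch of this rejection approach, consider the following Python fragment; the irregular domain, its bounding hypercube, and the membership test are hypothetical stand-ins, not part of the original discussion:

```python
import numpy as np

rng = np.random.default_rng(1)

def in_theta(theta):
    # Hypothetical irregular domain: the unit disk in two dimensions.
    return np.sum(theta**2) <= 1.0

def sample_from_theta(lower, upper):
    # Sample uniformly on the bounding hypercube and reject points
    # that fall outside the irregular domain.
    while True:
        candidate = rng.uniform(lower, upper)
        if in_theta(candidate):
            return candidate

theta = sample_from_theta(np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
```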
The steps for a recursive implementation of blind random search are
given below. This method applies when $\theta$ has continuous, discrete, or hybrid elements.

Step 0 (initialization): Choose an initial value of $\theta$, say $\hat{\theta}_0 \in \Theta$, either randomly or deterministically. (If random, usually a uniform distribution on $\Theta$ is used.) Calculate $L(\hat{\theta}_0)$. Set $k = 0$.

Step 1: Generate a new independent value $\theta_{\rm new}(k+1) \in \Theta$, according to the chosen probability distribution. If $L(\theta_{\rm new}(k+1)) < L(\hat{\theta}_k)$, set $\hat{\theta}_{k+1} = \theta_{\rm new}(k+1)$. Else, take $\hat{\theta}_{k+1} = \hat{\theta}_k$.

Step 2: Stop if the maximum number of $L$ evaluations has been reached or the user is otherwise satisfied with the current estimate for $\theta$ via the value $\hat{\theta}_{k+1}$; else, return to step 1 with the new $k$ set to the former $k+1$.
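A minimal Python rendering of these steps might look as follows, assuming uniform sampling on a hypercube domain; the toy loss function, domain, and budget are purely illustrative:

```python
import numpy as np

def blind_random_search(loss, sample_theta, max_evals):
    # sample_theta() draws an independent candidate from the chosen
    # distribution on the domain (e.g., uniform on a hypercube).
    theta_hat = sample_theta()          # step 0: initialization
    best_loss = loss(theta_hat)
    for _ in range(max_evals - 1):      # step 2: stopping rule
        candidate = sample_theta()      # step 1: independent candidate
        cand_loss = loss(candidate)
        if cand_loss < best_loss:       # keep candidate only if it improves
            theta_hat, best_loss = candidate, cand_loss
    return theta_hat, best_loss

# Toy use with Theta = [0, 1]^2 and a quadratic loss (illustrative only).
rng = np.random.default_rng(0)
loss = lambda t: np.sum((t - 0.3)**2)
theta, val = blind_random_search(loss, lambda: rng.uniform(0.0, 1.0, size=2),
                                 max_evals=1000)
```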
The above algorithm converges almost surely (a.s.) to $\theta^*$
under very general conditions (see, e.g., Spall, 2003, pp. 40-41). Of
course, convergence alone is an incomplete indication of the
performance of the algorithm. It is also of interest to examine the
rate of convergence. The rate is intended to tell the analyst
how close $\hat{\theta}_k$ is likely to be to $\theta^*$ for
a given cost of search. While blind random search is a reasonable
algorithm when $\theta$ is low dimensional, it can be shown that the method is generally a very slow algorithm for even moderately dimensioned $\theta$ (see, e.g., Spall, 2003, pp. 42-43). This is a direct consequence of the exponential increase in the size of the search space as $p$ increases. As an illustration, Spall (2003, Example 2.2) considers a case where $\Theta = [0, 1]^p$ (the $p$-dimensional hypercube with minimum and maximum values of 0 and 1 for each component of $\theta$) and where one wishes to guarantee with probability 0.90 that each element of $\theta$ is within 0.04 units of the optimal value. As $p$ increases from one to ten, there is an approximate $10^{10}$-fold increase in the number of loss function evaluations required.
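A back-of-the-envelope calculation reproduces the flavor of this example, assuming an interior optimum so that a single uniform draw lands within 0.04 of $\theta^*$ in one component with probability 0.08:

```python
import math

def evals_needed(p, prob=0.90, halfwidth=0.04):
    # Probability that one uniform draw on [0, 1]^p lands within
    # `halfwidth` of an interior optimum in every component:
    v = (2 * halfwidth) ** p
    # Smallest n with 1 - (1 - v)**n >= prob:
    return math.ceil(math.log(1 - prob) / math.log(1 - v))

print(evals_needed(1))    # roughly 28 evaluations
print(evals_needed(10))   # roughly 2e11 evaluations: about 10^10-fold more
```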
Blind search is the simplest random search in that the sampling
generating the new $\theta$ value does not take account of where the previous estimates of $\theta$ have been. The
random search algorithm below is slightly more sophisticated in that
the random sampling is a function of the position of the current best
estimate for $\theta$. In this way, the search is more
localized in the neighborhood of that estimate, allowing for a better
exploitation of information that has previously been obtained about
the shape of the loss function.
The localized algorithm is presented below. This algorithm was described in Matyas (1965). Note that the use of the term ''localized'' here pertains to the sampling strategy and does not imply that the algorithm is only useful for local (versus global) optimization in the sense described in Sect. 6.1. In fact, the algorithm has global convergence properties as described below. As with blind search, the algorithm may be used for continuous or discrete problems.

Step 0 (initialization): Pick an initial guess $\hat{\theta}_0 \in \Theta$, either randomly or with prior information. Set $k = 0$.

Step 1: Generate an independent random vector $d_k \in \mathbb{R}^p$ and add it to the current $\theta$ value, $\hat{\theta}_k$. Check if $\hat{\theta}_k + d_k \in \Theta$. If $\hat{\theta}_k + d_k \notin \Theta$, generate a new $d_k$ and repeat or, alternatively, move $\hat{\theta}_k + d_k$ to the nearest valid point within $\Theta$. Let $\theta_{\rm new}(k+1)$ equal $\hat{\theta}_k + d_k \in \Theta$ or the aforementioned nearest valid point in $\Theta$.

Step 2: If $L(\theta_{\rm new}(k+1)) < L(\hat{\theta}_k)$, set $\hat{\theta}_{k+1} = \theta_{\rm new}(k+1)$; else, set $\hat{\theta}_{k+1} = \hat{\theta}_k$.

Step 3: Stop if the maximum number of $L$ evaluations has been reached or the user is otherwise satisfied with the current estimate for $\theta$ via the value $\hat{\theta}_{k+1}$; else, return to step 1 with the new $k$ set to the former $k+1$.
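A minimal sketch of these steps in Python, assuming a hypercube domain handled by moving an infeasible candidate to the nearest valid point (via clipping); the normal sampling, domain, and toy loss below are illustrative choices:

```python
import numpy as np

def localized_random_search(loss, theta0, sigma, lower, upper, max_evals):
    rng = np.random.default_rng(2)
    theta_hat = np.asarray(theta0, dtype=float)             # step 0
    best_loss = loss(theta_hat)
    for k in range(max_evals - 1):                          # step 3
        d_k = rng.normal(0.0, sigma, size=theta_hat.shape)  # step 1
        candidate = np.clip(theta_hat + d_k, lower, upper)  # nearest valid point
        cand_loss = loss(candidate)
        if cand_loss < best_loss:                           # step 2
            theta_hat, best_loss = candidate, cand_loss
    return theta_hat, best_loss

# Toy use on the same quadratic as before (illustrative only).
loss = lambda t: np.sum((t - 0.3)**2)
theta, val = localized_random_search(loss, theta0=[0.9, 0.9], sigma=0.1,
                                     lower=0.0, upper=1.0, max_evals=1000)
```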
For continuous problems, Matyas (1965) and others have used the
(multivariate) normal distribution for generating $d_k$. However, the user is free to set the distribution of the deviation vector $d_k$. The distribution should have mean zero and each component should have a variation (e.g., standard deviation) consistent with the magnitudes of the corresponding $\theta$ elements. This allows the algorithm to assign roughly equal weight to each of the components of $\theta$ as
it moves through the search space. Although not formally allowed in
the convergence theory, it is often advantageous in practice if the
variability in $d_k$ is reduced as $k$ increases. This allows one to focus the search more tightly as evidence is accrued on the location of the solution (as expressed by the location of our current estimate $\hat{\theta}_k$).
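One hedged way to express such a schedule is sketched below; the geometric decay rate is an arbitrary illustrative choice:

```python
import numpy as np

_rng = np.random.default_rng(3)

def sample_deviation(sigma0, k, decay=0.995):
    # sigma0: per-component standard deviations matched to the magnitudes
    # of the corresponding theta elements; variability shrinks as k grows.
    return _rng.normal(0.0, np.asarray(sigma0) * decay**k)
```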
The convergence theory for the localized algorithms tends to be more
restrictive than the theory for blind search. Solis and Wets (1981)
provide a theorem for global convergence of localized algorithms, but
the theorem conditions may not be verifiable in practice. An earlier
theorem from Matyas (1965) (with proof corrected in Baba et al., 1977)
provides for global convergence of the localized search above if $L$ is a continuous function. The convergence is in the ''in probability'' sense. The theorem allows for more than one global minimum to exist in $\Theta$. Therefore, in general, the result provides no guarantee of $\hat{\theta}_k$ ever settling near any one value $\theta^*$. We present the theorem statement below.
Let $\Theta^*$ represent the set of global minima for $L$ (see Sect. 6.1). Suppose that $L$ is continuous on a bounded domain $\Theta$ and that if $\hat{\theta}_k + d_k \notin \Theta$ at a given iteration, a new $d_k$ is randomly generated. For any $\eta > 0$, let $R_\eta = \bigcup_{\theta^* \in \Theta^*} \{\theta : |L(\theta) - L(\theta^*)| < \eta\}$. Then, for $d_k$ having an i.i.d. $N(0, I_p)$ distribution, $\lim_{k \to \infty} P(\hat{\theta}_k \in R_\eta) = 1$.
The above algorithm might be considered the most naïve of the
localized random search algorithms. More sophisticated approaches are
also easy to implement. For instance, if a search in one direction
increases $L$, then it is likely to be beneficial to move in the opposite direction. Further, successive iterations in a direction that tend to consistently reduce $L$ should encourage further iterations in
the same direction. Many algorithms exploiting these simple properties
exist (e.g., Solis and Wets, 1981, and Zhigljavsky, 1991).
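A hypothetical sketch of one such refinement follows, reversing a failed step and reusing a successful direction; it is loosely in the spirit of the reversal/continuation heuristics just mentioned, not a specific published algorithm:

```python
import numpy as np

def reflective_random_search(loss, theta0, sigma, max_evals):
    rng = np.random.default_rng(4)
    theta_hat = np.asarray(theta0, dtype=float)
    best_loss = loss(theta_hat)
    d = rng.normal(0.0, sigma, size=theta_hat.shape)
    evals = 1
    while evals < max_evals:
        for step in (d, -d):            # try the direction, then its reverse
            candidate = theta_hat + step
            cand_loss = loss(candidate)
            evals += 1
            if cand_loss < best_loss:
                theta_hat, best_loss = candidate, cand_loss
                d = step                # reuse the successful direction
                break
        else:
            # Both directions failed to improve: draw a fresh deviation.
            d = rng.normal(0.0, sigma, size=theta_hat.shape)
    return theta_hat, best_loss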
In spite of its simplicity, the localized search algorithm is surprisingly effective in a wide range of problems. Several demonstrations are given in Sects. 2.2.3 and 2.3 in Spall (2003).
The random search algorithms above are usually based on perfect
(noise-free) measurements of the loss function. This is generally
considered a critical part of such algorithms (Pflug, 1996, p. 25). In
contrast to the noise-free case, random search methods with noisy loss
evaluations of the form $y(\theta) = L(\theta) + \varepsilon(\theta)$ generally do not formally
converge. However, there are means by which the random search
techniques can be modified to accommodate noisy measurements, at least
on a heuristic basis.
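Two such heuristics, averaging repeated noisy evaluations and requiring improvement by a margin before accepting a candidate, are discussed below; as a rough sketch (the averaging count and threshold are arbitrary illustrative choices):

```python
import numpy as np

def averaged_loss(noisy_loss, theta, n_avg):
    # Average n_avg independent noisy evaluations; the averaging error
    # shrinks only at rate 1/sqrt(n_avg) for independent noise.
    return np.mean([noisy_loss(theta) for _ in range(n_avg)])

def noisy_acceptance_step(noisy_loss, theta_hat, candidate, n_avg=10, tau=0.05):
    # Accept the candidate only if its averaged loss beats the current
    # estimate by the margin tau, building in robustness to the noise.
    if (averaged_loss(noisy_loss, candidate, n_avg)
            < averaged_loss(noisy_loss, theta_hat, n_avg) - tau):
        return candidate
    return theta_hat
```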
Some of the limited formal convergence theory for random search as
applied to the noisy measurement case includes Yakowitz and Fisher
(1973, Sect. 4) and Zhigljavsky (1991, Chap. 3). Spall (2003,
Sect. 2.3) discusses some practical methods for coping with noise,
including simple averaging of the noisy loss function evaluations
at each value of $\theta$ generated in
the search process and a modification of the algorithm's key decision
criterion (step 1 of blind random search and step 2 of localized
random search) to build in some robustness to the noise. However, the
averaging method can be costly since the error decreases only at the
rate of $1/\sqrt{N}$ when averaging $N$ function evaluations with
independent noise. Likewise, the altered threshold may be costly by
rejecting too many changes in $\theta$ due to the
conservative nature of the modified criterion. The presence of noise
in the loss evaluations makes the optimization problem so much more
challenging that there is little choice but to accept these penalties
if one wants to use a simple random search. We will see in the next
section that stochastic approximation tends to be more adept at coping
with noise at the price of a more restrictive problem setting than the
noise-free convergence theorem above.