Stochastic optimization plays a significant role in the analysis, design, and operation of modern systems. Methods for stochastic optimization provide a means of coping with inherent system noise and with models or systems that are highly nonlinear, high dimensional, or otherwise inappropriate for classical deterministic methods of optimization. Stochastic optimization algorithms have broad application to problems in statistics (e.g., design of experiments and response surface modeling), science, engineering, and business. Algorithms that employ some form of stochastic optimization have become widely available. For example, many modern data mining packages include methods such as simulated annealing and genetic algorithms as tools for extracting patterns in data.
Specific applications include business (making short- and long-term investment decisions in order to increase profit), aerospace engineering (running computer simulations to refine the design of a missile or aircraft), medicine (designing laboratory experiments to extract the maximum information about the efficacy of a new drug), and traffic engineering (setting the timing for the signals in a traffic network). There are, of course, many other applications.
Let us introduce some concepts and notation. Suppose $\Theta$ is the domain of allowable values for a vector $\theta$. The fundamental problem of interest is to find the value(s) of a vector $\theta \in \Theta$ that minimize a scalar-valued loss function $L(\theta)$. Other common names for $L$ are performance measure, objective function, measure-of-effectiveness (MOE), fitness function (or negative fitness function), or criterion. While this problem refers to minimizing a loss function, a maximization problem (e.g., maximizing profit) can be trivially converted to a minimization problem by changing the sign of the criterion. This paper focuses on the problem of minimization. In some cases (i.e., differentiable $L$), the minimization problem can be converted to a root-finding problem of finding $\theta$ such that $g(\theta) \equiv \partial L(\theta)/\partial \theta = 0$. Of course, this conversion must be done with care because such a root may not correspond to a global minimum of $L$.
The three remaining subsections in this section define some basic quantities, discuss some contrasts between (classical) deterministic optimization and stochastic optimization, and discuss some basic properties and fundamental limits. This section provides the foundation for interpreting the algorithm presentations in Sects. 6.2 to 6.4. There are many other references that give general reviews of various aspects of stochastic optimization. Among these are Arsham (1998), Fouskakis and Draper (2002), Fu (2002), Gosavi (2003), Michalewicz and Fogel (2000), and Spall (2003).
The problem of minimizing a loss function $L = L(\theta)$ can be formally represented as finding the set:

$$\Theta^* \equiv \arg\min_{\theta \in \Theta} L(\theta) = \{\theta^* \in \Theta : L(\theta^*) \le L(\theta) \text{ for all } \theta \in \Theta\}.$$
For ease of exposition, this paper generally focuses on continuous
optimization problems, although some of the methods may also be used
in discrete problems. In the continuous case, it is often assumed
that $L$ is a ''smooth'' (perhaps several times differentiable) function of $\theta$. Continuous problems arise frequently in
applications such as model fitting (parameter estimation), adaptive
control, neural network training, signal processing, and experimental
design. Discrete optimization
(or combinatorial optimization)
is a large subject unto itself (resource allocation, network routing,
policy planning, etc.).
A major issue in optimization is distinguishing between global and local optima. All other factors being equal, one would always want a globally optimal solution to the optimization problem (i.e., at least one $\theta^*$ in the set of values $\Theta^*$). In practice, however, it may not be feasible to find a global solution, and one must be satisfied with obtaining a local solution. For example, $L$ may be shaped such that there is a clearly defined minimum point over a broad region of the domain $\Theta$, while there is a very narrow spike at a distant point. If the trough of this spike is lower than any point in the broad region, the locally optimal solution in the broad region is better than any nearby $\theta$, but it is not the best possible $\theta$.
With a finite amount of resources devoted to the optimization process, it is usually only possible to ensure that an algorithm will approach a local minimum. That is, it is easy to construct functions that will ''fool'' any known algorithm, unless the algorithm is given explicit prior information about the location of the global solution - certainly not a case of practical interest! However, since the local minimum may still yield a significantly improved solution (relative to no formal optimization process at all), the local minimum may be a fully acceptable solution for the resources available (human time, money, computer time, etc.) to be spent on the optimization. Nevertheless, we discuss several algorithms (random search, stochastic approximation, and genetic algorithms) that are sometimes able to find global solutions from among multiple local solutions.
As a paper on stochastic optimization, the algorithms considered here apply where:

I. there is random noise in the measurements of $L(\theta)$

- and/or -

II. there is injected (Monte Carlo) randomness in the search direction as the algorithm iterates toward a solution.
Let $\hat{\theta}_k$ be the generic notation for the estimate for $\theta$ at the $k$th iteration of whatever algorithm is being considered, $k = 0, 1, 2, \ldots$. Throughout this paper, the specific mathematical form of $\hat{\theta}_k$ will change as the algorithm being considered changes. The following notation will be used to represent noisy measurements of $L$ at a specific $\theta$:

$$y(\theta) \equiv L(\theta) + \varepsilon(\theta),$$

where $\varepsilon$ represents the noise term.
Relative to property I, noise fundamentally alters the search and optimization process because the algorithm is getting potentially misleading information throughout the search process. For example, as described in Example 1.4 of Spall (2003), consider the following loss function with a scalar $\theta$: $L(\theta) = e^{-0.1\theta}\sin(2\theta)$. If the domain for optimization is $\Theta = [0, 7]$, the (unique) minimum occurs at $\theta^* \approx 2.33$, as shown in Fig. 6.1. Suppose that the analyst carrying out the optimization is not able to calculate $L(\theta)$, obtaining instead only noisy measurements $y(\theta) = L(\theta) + \varepsilon(\theta)$, where the noises $\varepsilon(\theta)$ are i.i.d. with distribution $N(0, \sigma^2)$ (a normal distribution with mean zero and variance $\sigma^2$). The analyst uses the $y(\theta)$ measurements in conjunction with an algorithm to attempt to find $\theta^*$.
Consider the experiment depicted in Fig. 6.1 (with data generated via MATLAB). Based on the simple method of collecting one measurement at each of a set of evenly spaced increments of $\theta$ over the interval defined by $\Theta$ (including the endpoints 0 and 7), the analyst would falsely conclude that the minimum is at the $\theta$ producing the lowest of the noisy measurements. As shown, this false minimum is far from the actual $\theta^*$.
[Fig. 6.1: the loss function $L(\theta)$ and noisy measurements $y(\theta)$ over the domain $\Theta$, showing the true minimum $\theta^*$ and the false minimum suggested by the noisy data]
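To make the effect of property I concrete, the following minimal sketch reproduces the spirit of this experiment in Python (rather than MATLAB); the grid spacing (0.1) and noise level ($\sigma = 0.35$) are illustrative assumptions, not values taken from Spall (2003).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    """True (unobservable) loss: L(theta) = exp(-0.1*theta) * sin(2*theta)."""
    return np.exp(-0.1 * theta) * np.sin(2 * theta)

def noisy_measurement(theta, sigma=0.35):
    """Noisy measurement y(theta) = L(theta) + eps, with eps ~ N(0, sigma^2)."""
    return loss(theta) + sigma * rng.normal()

# One noisy measurement at each grid point over Theta = [0, 7].
grid = np.arange(0.0, 7.0 + 1e-9, 0.1)
y = np.array([noisy_measurement(t) for t in grid])

theta_false = grid[np.argmin(y)]           # apparent minimizer from noisy data
theta_true = grid[np.argmin(loss(grid))]   # grid approximation of theta*
print(f"apparent minimum at theta = {theta_false:.2f}, "
      f"true minimum near theta = {theta_true:.2f}")
```

Rerunning with different seeds shows how erratically the apparent minimizer wanders; the lowest noisy measurement is frequently far from the true $\theta^*$.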
Noise in the loss function measurements arises in almost any case where physical system measurements or computer simulations are used to approximate a steady-state criterion. Some specific areas of relevance include real-time estimation and control problems where data are collected ''on the fly'' as a system is operating and problems where large-scale simulations are run as estimates of actual system behavior.
Let us summarize two distinct problems involving noise in the loss function measurements: target tracking and simulation-based optimization. In the tracking problem there is a mean-squared error (MSE) criterion of the form
$$L(\theta) = E\left(\left\|\text{actual output} - \text{desired output}\right\|^2\right).$$

In the simulation-based optimization problem, analogously, each run of the stochastic simulation provides only a noisy estimate of the (steady-state) loss value at the chosen $\theta$.
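As a toy illustration of how such a criterion yields only noisy measurements, the sketch below estimates an MSE loss by Monte Carlo for a hypothetical scalar tracking model; the model, noise level, and number of runs are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def mse_estimate(theta, n_runs=1000):
    """Monte Carlo estimate of L(theta) = E(||actual - desired||^2) for a
    hypothetical tracking model: actual = theta * x + sensor noise, desired = x."""
    x = rng.normal(size=n_runs)                    # random inputs to track
    actual = theta * x + 0.1 * rng.normal(size=n_runs)
    desired = x
    return np.mean((actual - desired) ** 2)        # noisy estimate of the MSE

print(mse_estimate(0.9), mse_estimate(1.0))  # the loss is smallest near theta = 1
```

Each call returns a slightly different value, which is exactly the property I situation: only noisy measurements $y(\theta)$ of the underlying loss are available.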
Relative to the other defining property of stochastic optimization,
property II (i.e., randomness in the search direction), it is
sometimes beneficial to deliberately introduce randomness into the
search process as a means of speeding convergence and making the
algorithm less sensitive to modeling errors. This injected (Monte
Carlo) randomness is usually created via computer-based pseudorandom
number generators. One of the roles of injected randomness in
stochastic optimization is to allow for ''surprise'' movements to
unexplored areas of the search space that may contain an unexpectedly good $L(\theta)$ value. This is especially relevant in seeking
out a global optimum among multiple local solutions. Some algorithms
that use injected randomness are random search (Sect. 6.2),
simultaneous perturbation stochastic approximation
(Sect. 6.3), and genetic algorithms (Sect. 6.4).
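To illustrate property II, here is a minimal sketch of a blind random search of the general type discussed in Sect. 6.2, applied to the scalar example of Fig. 6.1; the uniform sampling distribution, iteration budget, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def loss(theta):
    return np.exp(-0.1 * theta) * np.sin(2 * theta)

def blind_random_search(n_iters=200, sigma=0.35):
    """Keep the candidate theta with the lowest noisy measurement seen so far.
    The uniform draws over Theta = [0, 7] are the injected (Monte Carlo)
    randomness; with noisy y values one would typically average several
    measurements per candidate (see Sect. 6.2)."""
    best_theta, best_y = None, np.inf
    for _ in range(n_iters):
        theta = rng.uniform(0.0, 7.0)             # random candidate point
        y = loss(theta) + sigma * rng.normal()    # noisy measurement y(theta)
        if y < best_y:
            best_theta, best_y = theta, y
    return best_theta

print(blind_random_search())
```

The random draws occasionally produce ''surprise'' candidates in unexplored regions of $\Theta$, which is precisely what helps such methods escape purely local solutions.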
The discussion above is intended to motivate some of the issues and challenges in stochastic optimization. Let us now summarize some important issues for the implementation and interpretation of results in stochastic optimization.
The first issue we mention is the fundamental limits in optimization
with only noisy information about the function. Foremost, perhaps,
is that the statistical error of the information fed into the
algorithm - and the resulting error of the output of the
algorithm - can only be reduced by incurring a significant cost in the number of function evaluations. For the simple case of independent noise, the error decreases at the rate $O(1/\sqrt{n})$, where $n$ represents the number of measurements fed into the algorithm. This is a classical result in statistics, indicating that a 25-fold increase in function evaluations reduces the error by a factor of five.
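The following quick numerical check illustrates this rate; the noise distribution, sample sizes, and replication count are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Estimate a fixed loss value (here L = 0) by averaging n noisy measurements
# y = L + eps with eps ~ N(0, 1); the error of the average shrinks as 1/sqrt(n).
for n in (100, 2500):  # a 25-fold increase in the number of measurements
    errors = [abs(np.mean(rng.normal(size=n))) for _ in range(2000)]
    print(n, np.mean(errors))  # the mean absolute error drops roughly 5-fold
```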
A further limit for multivariate ($p \ge 2$, where $p$ is the dimension of $\theta$) optimization is that the volume of the search region generally grows rapidly with dimension; for example, a search grid with only 10 points per coordinate contains $10^p$ points in $p$ dimensions. This implies that one must usually exploit problem structure to have a hope of getting a reasonable solution in a high-dimensional problem.
All practical problems involve at least some restrictions on $\theta$, although in some applications it may be possible to effectively ignore the constraints. Constraints can be encountered in many different ways, as motivated by the specific application. Note that the constraint set $\Theta$ does not necessarily correspond to the set of allowable values for $\theta$ in the search since some problems allow for the ''trial'' values of the search to be outside the set of allowable final estimates. Constraints
are usually handled in practice on an ad hoc basis, especially tuned
to the problem at hand. There are few general, practical methods that
apply broadly in stochastic optimization. Michalewicz and Fogel (2000,
Chap. 9), for example, discuss some of the practical methods by which
constraints are handled in evolutionary computation. Similar methods
apply in other stochastic algorithms.
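As one concrete instance of such an ad hoc device (a common choice in practice, not a method specifically prescribed by the references above), trial values that leave a box-shaped $\Theta$ can simply be projected back onto it:

```python
import numpy as np

def project_onto_box(theta, lower, upper):
    """Ad hoc constraint handling: clip each component of a trial point back
    into the box Theta = [lower, upper] in every coordinate."""
    return np.clip(theta, lower, upper)

# A trial point from an unconstrained search step, projected into [0, 7]^2:
print(project_onto_box(np.array([7.8, -0.3]), 0.0, 7.0))  # -> [7. 0.]
```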
In general search and optimization, it is very difficult (perhaps
impossible) to develop automated methods for indicating when the
algorithm is close enough to the solution that it can be
stopped. Without prior knowledge, there is always the possibility that $\theta^*$ could lie in some unexplored region of the search space. This applies even when the functions involved are relatively benign; see Solis and Wets (1981) for mention of this in the context of twice-differentiable convex $L$. Difficulties are compounded when
the function measurements include noise.
It is quite normal for the environment to change over time. Hence, the solution to a problem now may not be the best (or even a good) solution to the corresponding problem in the future. In some search and optimization problems, the algorithm will be explicitly designed to adapt to a changing environment and automatically provide a new estimate of the optimal value (e.g., a control system). In other cases, one needs to restart the process and find a new solution. In either case, the problem solving may never stop!
In reading or contributing to the literature on stochastic optimization, it is important to recognize the limits of numerical comparisons by Monte Carlo. Monte Carlo studies can be a sound scientific method of gaining insight and can be a useful supplement to theory, much of which is based on asymptotic (infinite sample) analysis. In fact, it is especially popular in certain branches of optimization to create ''test suites'' of problems, where various algorithms compete against each other. A danger arises, however, in making broad claims about the performance of individual algorithms based on the results of numerical studies. Performance can vary tremendously under even small changes in the form of the functions involved or the coefficient settings within the algorithms themselves. One must be careful about drawing conclusions beyond those directly supported by the specific numerical studies performed. For purposes of drawing objective conclusions about the relative performance of algorithms, it is preferable to use both theory and numerical studies.
Some real systems have one (unique) globally ''best'' operating point ($\theta^*$) in the domain $\Theta$ while others have multiple global solutions (in either case, of course, there could be many locally optimal solutions). To avoid excessively cumbersome discussion of algorithms and supporting implementation issues and theory, we will often refer to ''the'' solution $\theta^*$ (versus ''a'' solution $\theta^*$). In practice, an analyst may be quite satisfied to reach a solution at or close to any one $\theta^* \in \Theta^*$.
The so-called no free lunch (NFL) theorems provide a formal
basis for the intuitively appealing idea that there is a fundamental
tradeoff between algorithm efficiency and algorithm robustness
(reliability and stability in a broad range of problems). In essence,
algorithms that are very efficient on one type of problem are not
automatically efficient on problems of a different type. Hence, there
can never be a universally best search algorithm just as there is
rarely (never?) a universally best solution to any general problem of
society. Wolpert and Macready (1997) provided a general formal
structure for the NFL theorems, although the general ideas had been
around for a long time prior to their paper (Wolpert and Macready were
the ones to coin the expression ''no free lunch'' in this search and
optimization context). The NFL theorems are established for discrete
optimization with a finite (but arbitrarily large) number of
options. However, their applicability includes most practical
continuous problems because virtually all optimization is carried out
on 32- or 64-bit digital computers. The theorems apply to the
cases of both noise-free and noisy loss measurements. NFL states, in
essence, that an algorithm that is effective on one class of problems
is guaranteed to be ineffective on another class. Spall
(2003, Sects. 1.2.2 and 10.6) provides more-detailed discussion on the
basis and implications of NFL.
We are now in a position to discuss several popular stochastic
optimization methods. The summaries here are just
that - summaries. Much more complete discussions are available in the
indicated references or in Spall (2003). We let $\hat{\theta}_k$ represent the estimate for $\theta$ at the $k$th iteration of an algorithm under consideration. Section 6.2 discusses random
search methods, which are simple and surprisingly powerful in many
applications. Section 6.3 discusses stochastic approximation
and Sect. 6.4 discusses the popular genetic
algorithms. Because of the relative brevity of this review, there are
many methods of stochastic optimization not covered here, including
simulated annealing, stochastic programming, evolutionary computation
other than genetic algorithms, temporal difference methods, and so
on. Readers with an interest in one of those may consult the
references listed at the end of Sect. 6.1.1.