Random numbers can be generated via physical mechanisms such as the
timing between successive events in atomic decay, thermal noise in
semiconductors, and the like. A key issue when constructing a RNG
based on a physical device is that a ''random'' or ''chaotic''
output does not suffice; the numbers produced must be, at least to
a good approximation, realizations of independent and
uniformly distributed random variables. If the device generates
a stream of bits, which is typical, then each bit should be 0 or 1
with equal probability, and be independent of all the other bits. In
general, this cannot be proved, so one must rely on the results of
empirical statistical testing to become convinced that the output values
have the desired statistical behavior. Not all these devices are
reliable, but some apparently are. I recently tested two of them,
and they passed all the statistical tests that I tried.
For computational statistics, physical devices have several disadvantages compared to a good algorithmic RNG that fits in a few lines of code. For example, (a) they are much more cumbersome to install and run; (b) they are more costly; (c) they are slower; (d) they cannot reproduce exactly the same sequence twice. Item (d) is important in several contexts, including program verification and debugging as well as comparison of similar systems by simulation with common random numbers to reduce the variance ([3,21,34]). Nevertheless, these physical RNGs can be useful for selecting the seed of an algorithmic RNG, more particularly for applications in cryptology and for gaming machines, where frequent reseeding of the RNG with an external source of entropy (or randomness) is important. A good algorithmic RNG whose seed is selected at random can be viewed as an extensor of randomness, stretching a short random seed into a long sequence of pseudorandom numbers.
RNGs used for simulation and other statistical applications are almost
always based on deterministic algorithms that fit the following
framework, taken from [36]:
a RNG is a structure $(\mathcal{S}, \mu, f, \mathcal{U}, g)$, where $\mathcal{S}$ is a finite
set of states (the state space), $\mu$ is a probability distribution on $\mathcal{S}$
used to select the initial state (or seed) $s_0$,
$f : \mathcal{S} \to \mathcal{S}$ is the transition function, $\mathcal{U}$ is the
output space, and $g : \mathcal{S} \to \mathcal{U}$ is the output
function. Usually, $\mathcal{U} = (0, 1)$, and we shall assume henceforth
that this is the case. The state of the RNG evolves according to the
recurrence $s_i = f(s_{i-1})$, for $i \ge 1$, and the output
at step $i$ is $u_i = g(s_i) \in \mathcal{U}$. The output values
$u_0, u_1, u_2, \ldots$ are the so-called random numbers produced
by the RNG.
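This framework is easy to render in code. Below is a minimal Python sketch in which a small linear congruential recurrence stands in for $f$; the constants are illustrative only, not a recommended generator.

```python
# A minimal sketch of the RNG framework (S, mu, f, U, g):
# the state space S is {0, 1, ..., M-1}, f is the transition function,
# and g maps each state into the output space U = (0, 1).
# The constants below are illustrative, not a recommended generator.

M = 101  # size of the state space S (a small prime, for illustration)

def f(s):
    """Transition function: s_i = f(s_{i-1})."""
    return (12 * s + 1) % M

def g(s):
    """Output function: u_i = g(s_i), strictly inside (0, 1)."""
    return (s + 0.5) / M  # shifted so the output is never exactly 0 or 1

def rng_stream(seed, n):
    """Generate the first n random numbers u_1, ..., u_n from the seed s_0."""
    s, out = seed, []
    for _ in range(n):
        s = f(s)
        out.append(g(s))
    return out

print(rng_stream(seed=1, n=3))
```

Because $f$ and $g$ are deterministic, rerunning `rng_stream` with the same seed reproduces the same sequence, which is exactly the repeatability property discussed later.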
Because $\mathcal{S}$ is finite, there must be some finite $l \ge 0$
and $j > 0$ such that $s_{l+j} = s_l$. Then, for all
$i \ge l$, one has $s_{i+j} = s_i$ and $u_{i+j} = u_i$, because both
$f$ and $g$ are deterministic. That is, the state and output sequences are eventually
periodic. The smallest positive $j$ for which this happens is called
the period length of the RNG, and is denoted by $\rho$. When
$l = 0$, the sequence is said to be purely periodic.
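For a generator small enough to enumerate, $l$ and $\rho$ can be found by brute force, simply by recording when a state first reappears. The transition functions below are hypothetical, chosen only to exercise both the purely periodic and the eventually periodic cases.

```python
def find_period(f, s0):
    """Return (l, rho): l is the index where the cycle starts and rho is
    the period length, found by iterating f from s0 and recording the
    first index at which each state appears."""
    seen = {}          # state -> first index i at which it occurred
    s, i = s0, 0
    while s not in seen:
        seen[s] = i
        s = f(s)
        i += 1
    l = seen[s]        # the smallest l with s_{l+j} = s_l
    rho = i - l        # the smallest positive j
    return l, rho

# Hypothetical tiny transition function, purely for illustration.
l, rho = find_period(lambda s: (4 * s + 1) % 9, s0=0)
print(l, rho)  # (0, 9): purely periodic, with full period 9
```

Real generators have periods far too long for this kind of exhaustive search, which is why period lengths must instead be established by mathematical proof, as emphasized below.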
Obviously, $\rho \le |\mathcal{S}|$, the cardinality of $\mathcal{S}$. If the state has
a $b$-bit representation on the computer, then $\rho \le 2^b$. Good
RNGs are designed so that their period length $\rho$ is not far from
that upper bound. In general, the value of $\rho$ may depend on the
seed $s_0$, but good RNGs are normally designed so that the period
length is the same for all admissible seeds.
In practical implementations, it is important that the output be
strictly between 0 and 1, because $F^{-1}(U)$ is often
infinite when $U$ is 0 or 1 (here $F$ is the distribution function
being inverted to generate nonuniform variates). All good implementations take care
of that. However, for the mathematical analysis of RNGs, we often
assume that the output space is $[0, 1)$ (i.e., 0 is admissible),
because this simplifies the analysis considerably without making much
difference in the mathematical structure of the generator.
What important quality criteria should we consider when designing
RNGs? An extremely long period is obviously essential, to
make sure that no wrap-around over the cycle can occur in practice.
The length of the period must be guaranteed by a mathematical proof.
The RNG must also be efficient (run fast and use only a small
amount of memory), repeatable (able to reproduce exactly the
same sequence as many times as we want), and portable (work
the same way in different software/hardware environments).
The availability of efficient jump-ahead methods that can
quickly compute $s_{i+\nu}$ given $s_i$, for any large $\nu$ and
any $i$, is also very useful, because it permits one to partition the
RNG sequence into long disjoint streams and substreams of
random numbers, in order to create an arbitrary number of
virtual generators from a single RNG ([34,59]).
These virtual generators can be used on parallel processors or to
support different sources of randomness in a large simulation model,
for example.
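For many linear generators, jump-ahead reduces to modular linear algebra. As a sketch: for a hypothetical LCG $s_{i+1} = (a s_i + c) \bmod m$, the map from $s_i$ to $s_{i+\nu}$ is again affine, so it can be built in $O(\log \nu)$ steps by repeated squaring of the affine map (all constants below are illustrative):

```python
def lcg_jump(s, a, c, m, nu):
    """Compute s_{i+nu} from s_i for the LCG s -> (a*s + c) mod m,
    in O(log nu) steps, by repeated squaring of the affine map."""
    A, C = 1, 0                      # identity map: s -> 1*s + 0
    base_a, base_c = a % m, c % m    # the map applied 2^k times
    while nu > 0:
        if nu & 1:                   # compose accumulated map with base map
            A, C = (A * base_a) % m, (C * base_a + base_c) % m
        # square the base map: applying (a*s + c) twice gives a^2*s + (a*c + c)
        base_a, base_c = (base_a * base_a) % m, (base_a * base_c + base_c) % m
        nu >>= 1
    return (A * s + C) % m

# Example: jump a million steps ahead in ~20 modular multiplications.
print(lcg_jump(12345, 48271, 11, 2**31 - 1, 10**6))
```

The same doubling idea applies to any linear recurrence: jumping ahead by $\nu$ amounts to raising the transition matrix to the power $\nu$ modulo $m$.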
Consider a RNG with state space
$\mathcal{S} = \{0, 1, \ldots, 2^{1000} - 1\}$,
transition function $s_{i+1} = f(s_i) = (s_i + 1) \bmod 2^{1000}$, and
output function $u_i = g(s_i) = s_i / 2^{1000}$. This RNG has period length
$2^{1000}$ and enjoys all the nice properties described in the preceding
paragraph, but is far from imitating ''randomness''. In other
words, these properties are not sufficient.
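The counterexample is easy to make concrete. The sketch below implements the counting generator with a much smaller modulus (a stand-in for the huge one above): it attains the maximal period, yet its output is the opposite of random-looking.

```python
M = 2**16  # small stand-in for the huge modulus; the structure is identical

def f(s):
    """Transition: just add 1 modulo M."""
    return (s + 1) % M

def g(s):
    """Output: the state scaled into [0, 1)."""
    return s / M

# Maximal period: starting from any seed, all M states are visited ...
s, seen = 0, set()
for _ in range(M):
    seen.add(s)
    s = f(s)
print(len(seen))  # M distinct states: the period length equals |S|

# ... yet successive outputs just climb by a constant step of 1/M:
u = [g(s) for s in range(5)]
print(u)
```

Any test sensitive to the correlation between successive values rejects this generator immediately, despite its enormous period.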
A sequence of real-valued random variables $u_0, u_1, u_2, \ldots$ are
i.i.d. $U(0, 1)$
if and only if for all integers $i \ge 0$ and $t \ge 1$,
the vector $\mathbf{u}_{i,t} = (u_i, \ldots, u_{i+t-1})$ is uniformly
distributed over the $t$-dimensional unit hypercube $(0, 1)^t$. Of
course, this cannot hold for algorithmic RNGs because any vector of $t$
successive values produced by the generator must belong to the
finite set

$$\Psi_t = \{(u_0, \ldots, u_{t-1}) : s_0 \in \mathcal{S}\}.$$
Suppose we select the seed $s_0$ at random, uniformly over $\mathcal{S}$.
This can be approximated by using some physical device, for
example. Then, the vector $\mathbf{u}_{0,t} = (u_0, \ldots, u_{t-1})$ has the uniform distribution
over the finite set $\Psi_t$. And if the sequence is purely periodic
for all $s_0$, $\mathbf{u}_{i,t} = (u_i, \ldots, u_{i+t-1})$ is also uniformly
distributed over $\Psi_t$ for all $i \ge 0$. Since the goal is to
approximate the uniform distribution over $(0, 1)^t$, it immediately
becomes apparent that $\Psi_t$ should be evenly spread over this unit
hypercube. In other words, $\Psi_t$ approximates $(0, 1)^t$
as the sample space from which the vectors of successive
output values are drawn randomly, so it must be a good approximation
of $(0, 1)^t$ in some sense. The design of good-quality RNGs must
therefore involve practical ways of measuring the uniformity of the
corresponding sets $\Psi_t$ even when they have huge cardinalities.
In fact, a large state space $\mathcal{S}$ is necessary to obtain a long
period, but an even more important reason for having a huge number of
states is to make sure that $\Psi_t$ can be large enough to provide
a good uniform coverage of the unit hypercube, at least for moderate
values of $t$.
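For a toy generator, $\Psi_t$ can be enumerated exhaustively, making this structure visible. The sketch below builds $\Psi_2$ for a hypothetical tiny LCG; for a good generator these points cover the unit square evenly, whereas a bad one concentrates them on a few lines.

```python
def psi_t(f, g, states, t):
    """Enumerate Psi_t = {(u_0, ..., u_{t-1}) : s_0 in S} for a small RNG."""
    points = set()
    for s0 in states:
        s, vec = s0, []
        for _ in range(t):
            s = f(s)
            vec.append(g(s))
        points.add(tuple(vec))
    return points

# Hypothetical small LCG, with outputs in [0, 1) as in the analysis above.
M = 64
f = lambda s: (13 * s + 1) % M
g = lambda s: s / M

pts = psi_t(f, g, range(M), t=2)
print(len(pts))  # at most |S| = 64 points in the unit square
```

Since $|\Psi_t| \le |\mathcal{S}|$, a tiny state space can never cover the hypercube well, which is the point made in the preceding paragraph.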
More generally, we may also want to measure the uniformity of sets of the form

$$\Psi_I = \{(u_{i_1}, \ldots, u_{i_t}) : s_0 \in \mathcal{S}\},$$

where $I = \{i_1, \ldots, i_t\}$ is a fixed set of non-negative integers
such that $0 \le i_1 < \cdots < i_t$.
The uniformity of a set $\Psi_I$ is typically assessed by measuring
the discrepancy between the empirical distribution of its points
and the uniform distribution over $[0, 1)^t$
([80,25,53]). Discrepancy measures are equivalent
to goodness-of-fit test statistics for the multivariate uniform
distribution. They can be defined in many different ways. In fact,
the choice of a specific definition typically depends on the
mathematical structure of the RNG to be studied, and the reason for
this is very pragmatic: we must be able to compute these measures
quickly even when $\mathcal{S}$ has very large cardinality. This obviously
excludes any method that requires explicit generation of the sequence
over its entire period. The selected discrepancy measure is usually
computed for each set $I$ in some predefined class $\mathcal{I}$, these values
are weighted or normalized by factors that depend on $I$, and the
worst-case (or average) over $\mathcal{I}$ is adopted as a figure of
merit used to rank RNGs. The choice of $\mathcal{I}$ and of the weights are
arbitrary. Typically, $\mathcal{I}$ would contain sets $I$ such that $t$
and $i_t$ are rather small. Examples of such figures of merit will be
given when we discuss specific classes of RNGs.
Good RNGs are designed based on mathematical analysis of their
properties, then implemented and submitted to batteries of
empirical statistical tests. These tests try to detect
empirical evidence against the null hypothesis $\mathcal{H}_0$: ''the $u_i$
are realizations of i.i.d. $U(0, 1)$
random variables''. A test can be
defined by any function $T$
that maps a sequence $u_0, u_1, \ldots$
in $(0, 1)$ to a real number $X$, and for which a good approximation is
available for the distribution of the random variable $X$
under $\mathcal{H}_0$. For the test to be implementable, $T$ must depend on
only a finite (but perhaps random) number of
$u_i$'s. Passing many
tests may improve one's confidence in the RNG, but never guarantees
that the RNG is foolproof for all kinds of simulations.
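A simple instance of such a test $T$: partition $(0,1)$ into $d$ equal cells, count how many of $n$ output values fall in each, and compute the chi-square statistic, whose distribution under $\mathcal{H}_0$ is approximately chi-square with $d-1$ degrees of freedom. The sketch below applies it to Python's own default generator, purely as an illustration.

```python
import random

def chi2_uniformity(us, d):
    """Chi-square goodness-of-fit statistic for uniformity over (0, 1),
    with d equal cells; approximately chi2(d - 1) under H0."""
    n = len(us)
    counts = [0] * d
    for u in us:
        counts[min(int(u * d), d - 1)] += 1   # cell index of u
    expected = n / d
    return sum((c - expected) ** 2 / expected for c in counts)

random.seed(12345)                            # fixed seed, for repeatability
us = [random.random() for _ in range(10000)]
x = chi2_uniformity(us, d=10)
print(x)  # should be near d - 1 = 9 if the output looks uniform
```

This is only one test among many; a generator must face batteries of far more searching tests (serial, gap, collision, spectral-based, etc.) before any confidence is warranted.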
Building a RNG that passes all statistical tests is an
impossible dream. Consider, for example, the class of all tests that
examine the first (most significant) $b$ bits of $n$ successive output
values, $u_0, \ldots, u_{n-1}$, and return a binary value $X \in \{0, 1\}$. Select
$\alpha \in (0, 1)$ so that $\alpha 2^{nb}$ is an integer
and let $\mathcal{T}_{n,b,\alpha}$ be the tests in this class that return $X = 1$
for exactly $\alpha 2^{nb}$ of the $2^{nb}$ possible output
sequences. We may say that the sequence fails the test when
$X = 1$. The number of tests in $\mathcal{T}_{n,b,\alpha}$ is equal to the
number of ways of choosing $\alpha 2^{nb}$ distinct objects among $2^{nb}$.
The chosen objects are the sequences that fail the test. Now, for any
given output sequence, the number of such tests that return $X = 1$ for this
particular sequence is equal to the number of ways of choosing the
other $\alpha 2^{nb} - 1$ sequences that also fail the test. This is the
number of ways of choosing $\alpha 2^{nb} - 1$ distinct objects
among $2^{nb} - 1$. In other words, as pointed out by [64],
every output sequence fails exactly the same number of tests! This
result should not be surprising. Viewed from a different angle, it is
essentially a restatement of the well-known fact that under $\mathcal{H}_0$,
each of the $2^{nb}$ possible sequences has the same probability of
occurring, so one could argue that none should be considered more
random than any other ([31]).
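This counting argument can be checked by brute force for tiny parameters. With $nb = 3$ bits in total and $\alpha = 1/4$, there are $2^{nb} = 8$ possible sequences and each test rejects exactly $\alpha 2^{nb} = 2$ of them; the sketch below enumerates all $\binom{8}{2} = 28$ such tests and confirms that every sequence fails the same number of them, namely $\binom{7}{1} = 7$.

```python
from itertools import combinations
from math import comb

nb = 3                       # total number of bits examined (n*b in the text)
n_seq = 2 ** nb              # the 8 possible bit sequences
k = n_seq // 4               # alpha = 1/4: each test rejects exactly k = 2

# Identify each test with the set of k sequences it rejects.
tests = list(combinations(range(n_seq), k))
print(len(tests))            # C(8, 2) = 28 tests in the class

# For each sequence, count how many tests reject it.
fail_counts = [sum(seq in t for t in tests) for seq in range(n_seq)]
print(fail_counts)           # every sequence fails the same number of tests
```

Each of the 8 sequences appears in exactly $\binom{7}{1} = 7$ of the 28 two-element rejection sets, matching the combinatorial argument above.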
This viewpoint seems to lead into a dead end. For statistical testing
to be meaningful, all tests should not be considered on equal
footing. So which ones are more important? Any answer is certainly
tainted with its share of arbitrariness. However, for large values of
$n$, the number of tests is huge and all but a tiny fraction are too
complicated even to be implemented. So we may say that bad
RNGs are those that fail simple tests, whereas good RNGs fail
only complicated tests that are hard to find and run. This
common-sense compromise has been generally adopted in one way or
another.
Experience shows that RNGs with very long periods, good structure of
their set $\Psi_t$, and based on recurrences that are not too
simplistic, pass most reasonable tests, whereas RNGs with short
periods or bad structures are usually easy to crack by standard
statistical tests. For sensitive applications, it is a good idea,
when this is possible, to apply additional statistical tests designed
in close relation with the random variable of interest (e.g., based on
a simplification of the stochastic model being simulated, and
for which the theoretical distribution can be computed).
Our discussion of statistical tests continues in Sect. 2.6.
One way of defining an ideal RNG would be that no statistical test can
distinguish its output sequence from an i.i.d. sequence. If
an unlimited computing time is available, no finite-state RNG can
satisfy this requirement, because by running it long enough one can
eventually figure out its periodicity. But what if we impose a limit
on the computing time? This can be analyzed formally in the framework
of asymptotic computational complexity theory, under the
familiar ''rough-cut'' assumption that polynomial-time algorithms
are practical and others are not.
Consider a family of RNGs
$\{\mathcal{G}_k = (\mathcal{S}_k, \mu_k, f_k, \mathcal{U}, g_k),\ k \ge 1\}$
where $\mathcal{S}_k$ is of cardinality $2^k$ (i.e., $\mathcal{G}_k$
has a $k$-bit state). Suppose that the transition and output
functions $f_k$ and $g_k$ can be computed in time bounded by a polynomial
in $k$. Let $\mathcal{T}$ be the class of statistical tests that run in time
bounded by a polynomial in $k$ and try to differentiate between the
output sequence of the RNG and an i.i.d. $U(0, 1)$ sequence. The RNG
family is called polynomial-time perfect if there is
a constant $\epsilon > 0$ such that for all $k$, no test in $\mathcal{T}$ can
differentiate correctly with probability larger than
$1/2 + e^{-k\epsilon}$. This is equivalent to asking that no polynomial-time
algorithm can predict any given bit of $u_i$ with probability of
success larger than $1/2 + e^{-k\epsilon}$, after observing
$u_0, \ldots, u_{i-1}$.
This links unpredictability with statistical uniformity and
independence. For the proofs and additional details, see,
e.g. [2,55,33], and [69]. This
theoretical framework has been used to define a notion of reliable RNG
in the context of cryptography. But the guarantee is only asymptotic;
it does not necessarily tell what value of $k$ is large enough for the
RNG to be secure in practice. Moreover, specific RNG families have
RNG to be secure in practice. Moreover, specific RNG families have
been proved to be polynomial-time perfect only under yet unproven
conjectures. So far, no one has even been able to prove that such
families exist. Most RNGs discussed in the remainder of this chapter are
known not to be polynomial-time perfect. However, they are
fast, convenient, and have good enough statistical properties when
their parameters are chosen carefully.
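As an illustration of why simple linear generators are not polynomial-time perfect: a hypothetical LCG whose states are output directly, with known modulus, can be predicted exactly after observing only three values, by solving for the multiplier and increment modulo $m$. (Real attacks also handle unknown moduli and truncated output; this sketch shows only the simplest case.)

```python
def predict_next(x0, x1, x2, m):
    """Given three successive states of an LCG s -> (a*s + c) mod m with
    known modulus m, recover (a, c) and predict the fourth state.
    Requires x1 - x0 to be invertible modulo m."""
    # From x2 - x1 = a*(x1 - x0) mod m, solve for a, then for c.
    a = ((x2 - x1) * pow((x1 - x0) % m, -1, m)) % m
    c = (x1 - a * x0) % m
    return (a * x2 + c) % m

# Hidden generator (the "attacker" sees only the outputs xs):
m, a, c, s = 2**31 - 1, 48271, 11, 12345   # illustrative constants
xs = []
for _ in range(4):
    s = (a * s + c) % m
    xs.append(s)

guess = predict_next(xs[0], xs[1], xs[2], m)
print(guess == xs[3])  # the fourth output is predicted exactly
```

Such predictability is harmless for most simulations but disqualifies this kind of generator for cryptographic use, which is precisely the distinction the polynomial-time-perfect framework formalizes.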