From: Topher Cooper (topher_at_[hidden])
Date: 2004-07-12 10:53:59
On Monday, July 12, 2004, at 03:37 AM, Anders Edin wrote:
>
>> One of the important characteristics of a pseudo-random number
>> generator is, oddly enough, determinacy.  Given the same seed, you get
>> the same sequence and can therefore reproduce the exact same results.
>> This allows results to be checked and bugs to be tracked down reliably.
>
> Well, this is of course fine for debugging. However, when running a
> simulation shouldn't you rather think of statistics than of finding bugs?
>
First off, are you proposing that you carefully use one PRNG that does 
not have hidden, global state to do your debugging and validation, then 
rip it out of your code to replace it with a different PRNG with a 
different interface to produce your actual runs with un-debugged, 
un-validated code?  I don't think you thought that one through.
Secondly, in the three areas I mentioned -- statistical, scientific and
simulation applications -- you should never consider that you have stopped
debugging.  You make what appears to be a successful run and record your
results.  Sometime later you do a different one that does not seem fully
consistent with the earlier run.  What do you do?  Pretend that there is no
problem?  Pick one of the runs at random and throw it out?  Include a
footnote saying that there may be something wrong with your results, but
you have no idea what?  Or take the seeds you recorded from each run,
recreate the runs and try to resolve the issue?
This, folks, is the simulation equivalent of cleaning your test-tubes in a
chemistry lab -- a fundamental, elementary lab procedure that is necessary
for reliable results.
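To put that in concrete terms, recording the seed is all it takes.  Here is
a minimal sketch using Boost.Random types (the particular engine,
distribution and seed value are illustrative choices, not a
recommendation):

    #include <boost/random/mersenne_twister.hpp>
    #include <boost/random/uniform_real.hpp>
    #include <boost/random/variate_generator.hpp>
    #include <iostream>

    int main()
    {
        // Log this value alongside the results of the run.
        const unsigned int seed = 20040712u;

        boost::mt19937 engine(seed);
        boost::uniform_real<> unit(0.0, 1.0);
        boost::variate_generator<boost::mt19937&, boost::uniform_real<> >
            draw(engine, unit);

        // Re-running with the same recorded seed reproduces exactly the
        // same sequence of draws.
        for (int i = 0; i < 5; ++i)
            std::cout << draw() << '\n';
    }

Keep the seed in the run's log and the code under source control, and any
run can be recreated bit for bit.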
>> If another thread, or just a library used by your application (for
>> example, something that uses pseudo-random numbers for dithering a
>> graphic display), was using the same singleton engine as your
>> application this characteristic would be lost.  The number of values
>> drawn by the other parts of your system might change or reseeding of
>> the engine might happen without your knowledge.
>
> If you use one generator of the same type in each thread, but within the
> same simulation application, how do you know that the different threads
> are not correlated in the statistics sense? Do you set the seeds far
> apart? If one is not careful the random distribution produced by the
> application is not the one you thought it would be.
>
As von Neumann's famous quote has it, when you are using pseudo-random
numbers you are "in a state of sin."  You are taking a sequence of numbers
that are anything but random, and simply pretending that they are, in fact,
perfectly random.  You get away with it by carefully considering the
characteristics of that non-randomness and, through careful analysis and
testing, making sure that the non-randomness doesn't matter to what you are
doing with it.  If your application uses random numbers in multiple places
you must be sure that there are no meaningful correlations between the
streams in those different places -- either by using a single pseudo-random
number stream for all of them or by using a set of PRNGs that have been
shown by statistical tests to be independent.
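The first option -- one explicit stream shared by every part of the
application that needs randomness -- might look something like this sketch
(again, the particular Boost.Random types are only illustrative):

    #include <boost/random/mersenne_twister.hpp>
    #include <boost/random/normal_distribution.hpp>
    #include <boost/random/uniform_real.hpp>
    #include <boost/random/variate_generator.hpp>
    #include <iostream>

    int main()
    {
        const unsigned int seed = 42u;   // recorded with the run

        // One explicit engine; every consumer of randomness draws from
        // this single, well-understood stream.
        boost::mt19937 engine(seed);

        boost::variate_generator<boost::mt19937&, boost::uniform_real<> >
            uniform_draw(engine, boost::uniform_real<>(0.0, 1.0));
        boost::variate_generator<boost::mt19937&,
                                 boost::normal_distribution<> >
            normal_draw(engine, boost::normal_distribution<>(0.0, 1.0));

        std::cout << uniform_draw() << ' ' << normal_draw() << '\n';
    }

Within a single thread this keeps all the analysis on one well-studied
stream; across threads you would instead want separate generators whose
independence you have actually checked.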
It is a fact of life in modern programming that our systems use many
packages and libraries whose precise contents we have no control over and
of which we are frequently ignorant.  The requirement that this puts on us
in using pseudo-random generators is that any correlation (*or side
effect*) that might be caused by such libraries' use of pseudo-random
generators (whether or not we *know* that they use them) should not have
any effect on our results.  The example I gave
-- a PRNG used for controlling shading in a graphics package for 
displaying the results -- is likely to have that characteristic.  Other 
possibilities involve interthread, interprocess or interprocessor 
communication protocols, "random" keys assigned to data-structure nodes 
for hashing and various algorithms that introduce randomness to make 
worst-case performance situations unlikely.  Correlations of a 
simulation's PRNG with the PRNGs used in such packages are unlikely to 
invalidate your results (but you should always, of course, consider the 
possibility that they might -- one reason that you should always repeat 
runs with at least two different PRNGs in any serious simulation).
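The flip side is to keep the simulation's engine as an explicit, local
object, so that nothing a library draws for its own purposes can silently
change how many values *your* code consumes.  A sketch (illustrative types
once more, and the base_seed + 1 offset is a naive scheme whose
independence you would still want to validate):

    #include <boost/random/mersenne_twister.hpp>
    #include <boost/random/uniform_real.hpp>
    #include <boost/random/variate_generator.hpp>
    #include <iostream>

    int main()
    {
        const unsigned int base_seed = 12345u;        // logged with the run

        // Separate engine state per logical stream -- no hidden, shared
        // singleton for another component to reseed or draw from.
        boost::mt19937 sim_engine(base_seed);          // the simulation itself
        boost::mt19937 display_engine(base_seed + 1);  // e.g. dithering/shading

        boost::uniform_real<> unit(0.0, 1.0);
        boost::variate_generator<boost::mt19937&, boost::uniform_real<> >
            sim_draw(sim_engine, unit);
        boost::variate_generator<boost::mt19937&, boost::uniform_real<> >
            display_draw(display_engine, unit);

        // Draws from one stream cannot perturb the sequence of the other.
        std::cout << sim_draw() << ' ' << display_draw() << '\n';
    }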
> As before the application I have in mind is physics simulations. If you
> use random numbers for something else perhaps the needs are different.
On the contrary, that is a prime example of an area where this is an 
absolute requirement.  If you do actual physics experiments in a 
physics lab you keep a careful record of *every* aspect of the 
experiment that you can in order to be able to re-examine, re-analyze 
and replicate the experiment as closely as possible.  Why would the 
fact that your experiment is run in a "virtual lab" where you have the 
capability to easily record and replicate every aspect of the 
experiment so much more precisely lead you to discard one of the basic 
principles of scientific rigor?
Of course, you *could* save the entire stream of pseudo-random numbers used
with every run instead, but it is so much more compact and convenient,
don't you think, to just record the seed and make sure that all the
packages and code necessary are kept in a well-maintained source control
system?
Just to remind you of a rather famous example of this: Lorenz was doing
some simulations of weather systems.  He discovered that when he attempted
to rerun his simulations using the precise starting points he had recorded,
he got entirely different results.  The starting conditions had been
recorded as decimal printouts, introducing a tiny rounding difference -- a
"half-bit" difference in the least significant figure, which nevertheless
resulted in entirely different outcomes.  His investigation of this led to
the (re)discovery of the "butterfly effect", and is generally considered
the beginning of modern chaos theory.
>
> -- 
> Anders Edin, Sidec Technologies AB