7.0 Statistical Graphics and RNG
• Answer Questions
• Statistical Graphics
• Random Number Generators
7.1 Statistical Graphics
John Snow helped to end the 1854 cholera outbreak through use of a
statistical graphic based on a city map of London. The map shows
the pattern of the disease outbreak, and illustrates the importance of
exception analysis.
Snow attended Queen Victoria as an anesthetist, administering chloroform
during two of her deliveries, and was a contemporary of Florence
Nightingale.
He also found a smart way to estimate the literacy rate. Guess how he
did it?
The second graphic shows the age-adjusted incidence of stomach cancer
for white males, for cases from 1970 to 1994. We can compare that with
a similar map for 1950-1969.
• Is there a gender difference?
• What is going on in Nevada?
• What is going on in New Mexico?
• What is going on in Wisconsin, Minnesota, and North Dakota?
• What about Pittsburgh?
• What about Maine?
How do we interpret single-county hotspots?
The third graphic shows the pedestrian fatality rates by state. Florida is
the worst, and has the top five cities in the country for pedestrian
fatalities. What might explain this? (Consider also New Mexico and Arizona.)
The fourth graphic is by Charles-Joseph Minard; Edward Tufte hails
it as the best statistical graphic ever. It shows the size of Napoleon’s
army in 1812-1813, as he attacks Czar Alexander I in Moscow and then
retreats.
The graphic includes information on:
• location (two dimensions)
• time
• temperature
• size of the army
7.2 Random Numbers
In order to generate “random” numbers, it is sufficient to generate
random binary strings.
Toss a fair coin an infinite number of times, with heads being 0 and tails
being 1, to get a sequence X_1, X_2, .... This can be converted into a
random number U that is uniformly distributed on [0, 1] by
U = \sum_{i=1}^{\infty} X_i 2^{-i}.
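The sum above can be tried out on a finite number of tosses. The following is a minimal sketch, assuming that 53 tosses are enough (that is the precision of a double) and that a seeded Python generator stands in for the fair coin; the function name uniform_from_bits is made up for this illustration.

```python
import random

def uniform_from_bits(bits):
    """Return sum_i bits[i-1] * 2**(-i), a truncated version of U."""
    u = 0.0
    for i, x in enumerate(bits, start=1):
        u += x * 2.0 ** (-i)
    return u

# Assumption: a seeded software generator plays the role of the fair coin.
rng = random.Random(2024)
tosses = [rng.getrandbits(1) for _ in range(53)]  # heads = 0, tails = 1

u = uniform_from_bits(tosses)
print(u)  # some value in [0, 1)
```

Truncating at n tosses changes U by at most 2^{-n}, so a few dozen tosses already give more precision than most simulations need.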
If you have a random number U that is uniform on [0, 1], then the random
number X = F^{-1}(U) is a random draw from the distribution F(x). So all
you need for any kind of random number is a set of random coin tosses.
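As a worked instance of X = F^{-1}(U), take the exponential distribution with rate r, for which F(x) = 1 - exp(-r x) and so F^{-1}(u) = -ln(1 - u) / r. The sketch below is illustrative only: the choice of distribution, the rate of 2, and the function name are assumptions, not part of the notes.

```python
import math
import random

def exponential_inverse_cdf(u, rate=1.0):
    """Inverse of F(x) = 1 - exp(-rate * x), valid for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate

rng = random.Random(7)  # seeded so the experiment can be repeated
rate = 2.0
draws = [exponential_inverse_cdf(rng.random(), rate) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
print(sample_mean)  # should be close to the true mean 1 / rate = 0.5
```

The same recipe works for any distribution whose inverse CDF can be computed, exactly or numerically.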
Real coins are neither random enough nor practical for the two main
applications:
• computer simulations
• cryptography.
Good Random Number Generators (RNGs) are fast, repeatable (i.e.,
have a seed), have a very long cycle time, have sensitive dependence on
the seed, and pass statistical tests for randomness.
In practice, there are three strategies for building random number
generators (RNGs):
• Amplify physical (quantum) noise.
• Use problems believed to be computationally hard (trapdoor codes),
such as factoring large numbers that are products of two primes.
• Use linear congruential generators.
Raw output from the first method has historically had trouble passing
statistical tests for randomness. The sequences tend to show patterns
introduced by the amplification mechanism.
The second method is widely used in cryptography, but there are issues.
It is not repeatable, in the sense needed for replicating a computer
experiment. It cannot produce an infinite string of binary digits:
eventually, you factor the number. And the big fear is that some clever
mathematician will discover a new way of factoring large numbers.
Nonetheless, trapdoor codes are very popular in cryptography, and
quite reliable. RSA encryption is one famous example; it underlies much
of the security of online credit card transactions.
For simulation, computer games, and other applications, linear
congruential generators are often used:
X_{n+1} ≡ (a X_n + c) (mod m)
Here X_{n+1} is the remainder when a X_n + c is divided by m, and
• X_n is the current random integer,
• X_{n+1} is the next random integer in the sequence,
• m is the modulus (a very large integer),
• a and c are carefully chosen constants.
The initial value, X_0, is called the seed of the linear congruential
generator. The X_i are written in binary.
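The recurrence can be written in a few lines. This is a minimal sketch, not a recommended generator: the default constants a = 1664525, c = 1013904223, m = 2^32 are one published choice (from Numerical Recipes) used here only for illustration, and the function name lcg is an assumption.

```python
from itertools import islice

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield X_1, X_2, ... from X_{n+1} = (a * X_n + c) mod m, with X_0 = seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# Same seed, same sequence: this is what makes a simulation replicable.
run_1 = list(islice(lcg(seed=42), 5))
run_2 = list(islice(lcg(seed=42), 5))
print(run_1 == run_2)  # True

# A toy generator with m = 16 visits every value once before repeating.
toy = list(islice(lcg(seed=7, a=5, c=3, m=16), 16))
print(sorted(toy) == list(range(16)))  # True
```

Dividing each X_n by m gives a number in [0, 1) that can be used as an approximately uniform draw.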
Linear congruential generators are not perfect. There is some correlation
in the sequence: if one uses successive values to plot points in a
k-dimensional space, the points lie on a limited number of parallel
hyperplanes, on the order of m^{1/k}.
On the other hand, these generators are fast, use little memory, can have
a cycle time as long as m, and are replicable if one archives the seed.
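The correlation is easy to exhibit with RANDU, a historical generator that is not mentioned in these notes and is used here only as an illustration. It has a = 65539 = 2^16 + 3, c = 0, and m = 2^31. Because a^2 ≡ 6a - 9 (mod 2^31), every three consecutive outputs satisfy X_{n+2} - 6 X_{n+1} + 9 X_n ≡ 0 (mod 2^31), so triples plotted in three dimensions fall on a small number of planes. The sketch below checks that relation directly.

```python
M = 2**31

def randu(seed, n):
    """Return n outputs of X_{k+1} = 65539 * X_k mod 2**31 (seed should be odd)."""
    x = seed
    out = []
    for _ in range(n):
        x = (65539 * x) % M
        out.append(x)
    return out

xs = randu(seed=1, n=10_000)

# Count consecutive triples that break the linear relation.
violations = sum(
    (xs[i + 2] - 6 * xs[i + 1] + 9 * xs[i]) % M != 0
    for i in range(len(xs) - 2)
)
print(violations)  # 0: every triple satisfies the relation
```

A one-dimensional histogram of these outputs looks fine, which is why the defect only shows up when the values are used in groups.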
When one has a long sequence of binary random digits, one can try to
test whether the sequence is random.
One strategy is to do a series of hypothesis tests:
1. The null is that the proportion of 1s is 1/2; the alternative is that it
is not.
2. The null is that the proportion of sequential pairs (0, 0) [and (0, 1),
(1, 0), (1, 1)] is 1/4; the alternative is that it is not.
3. The null is that the proportion of sequential triples (0, 0, 0) is 1/8;
the alternative is that it is not; etc.
You will soon learn how to make such tests. You could even adjust for
the multiple testing problem, an important issue that we cover later.
But the sequence obtained by letting X_i be 0 or 1 according to whether
the ith digit of π is odd or even is completely deterministic, and yet it
appears to pass all of these tests.
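The first two tests in the list can be sketched with a normal approximation to the binomial. Everything here is an assumption made for illustration: the z-statistic as the test, the use of non-overlapping pairs (so that the pair counts are independent under the null), and Python's seeded generator as the source of bits.

```python
import math
import random

def proportion_z(count, n, p0):
    """z-statistic for the null that the true proportion equals p0."""
    return (count / n - p0) / math.sqrt(p0 * (1.0 - p0) / n)

rng = random.Random(12345)
bits = [rng.getrandbits(1) for _ in range(100_000)]

# Test 1: the proportion of 1s should be 1/2.
z_ones = proportion_z(sum(bits), len(bits), 0.5)

# Test 2: each pattern should make up 1/4 of the non-overlapping pairs.
pairs = list(zip(bits[0::2], bits[1::2]))
z_pairs = {
    pattern: proportion_z(pairs.count(pattern), len(pairs), 0.25)
    for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]
}

print(z_ones)
print(z_pairs)
```

Under the null each statistic is approximately standard normal, so values far outside roughly ±2 or ±3 would be evidence of a pattern, subject to the multiple testing caveat above.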
It is provable that one cannot design a test that will eventually detect
all possible patterned sequences. But one can design a sequence of tests
that will discover many different kinds of patterns.
Information theory has shown that a truly random sequence cannot be
compressed. A string is compressible if it can be encoded in such a way
that the coded version requires fewer bits than the original string.
So one way to test a random number generator is to feed its output into
compressors such as gzip, JPEG2000, and the Lempel-Ziv algorithms, and
see if the result is substantially shorter. If it is, the output contains
a pattern.
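A compression check of this kind can be sketched with Python's zlib module, which implements the DEFLATE method also used by gzip. The helper names, the seeded generator, and the patterned comparison sequence are assumptions chosen for illustration. The bits are packed eight to a byte first, so that the compressor sees the raw binary string and not one byte per digit.

```python
import random
import zlib

def pack_bits(bits):
    """Pack 0/1 values into bytes, 8 bits per byte (leftover bits are dropped)."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

def compression_ratio(bits):
    """Compressed size divided by original size; near 1 means incompressible."""
    raw = pack_bits(bits)
    return len(zlib.compress(raw, 9)) / len(raw)

n = 80_000
rng = random.Random(99)
random_bits = [rng.getrandbits(1) for _ in range(n)]
patterned_bits = [(i // 3) % 2 for i in range(n)]  # 000111000111...

ratio_random = compression_ratio(random_bits)
ratio_patterned = compression_ratio(patterned_bits)
print(ratio_random, ratio_patterned)
```

A ratio near or slightly above 1 means the compressor found nothing to exploit. That is necessary but not sufficient for randomness, since a general-purpose compressor only looks for certain kinds of patterns.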
Another theorem: if a sequence X_1, X_2, ... is added to an independent
sequence Y_1, Y_2, ... to produce Z_1, Z_2, ..., where Z_i = X_i + Y_i (mod 2),
then the Z sequence is at least as random as the more random of the X and
Y sequences.
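A small experiment shows the effect of mod-2 addition. This is a sketch under stated assumptions: X is a heavily biased sequence (90% ones), Y is a fair sequence drawn from a separately seeded generator and treated as independent of X, and the proportion of ones is used as a crude measure of randomness.

```python
import random

n = 100_000
rng_x = random.Random(1)
rng_y = random.Random(2)

x = [1 if rng_x.random() < 0.9 else 0 for _ in range(n)]  # biased: about 90% ones
y = [rng_y.getrandbits(1) for _ in range(n)]              # fair coin tosses
z = [(xi + yi) % 2 for xi, yi in zip(x, y)]               # Z_i = X_i + Y_i (mod 2)

prop_x = sum(x) / n
prop_y = sum(y) / n
prop_z = sum(z) / n
print(prop_x, prop_y, prop_z)  # Z should be close to 1/2 despite the bias in X
```

Combining the outputs of two generators in this way is a cheap safeguard: the result is no worse than the better of the two, as long as they do not share structure.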