
Discrete Markov Chain Monte Carlo
George W. Cobb, Mount Holyoke College, NSF #0089004

Chapter 7: Graph Walks via Random Swaps

For complicated data like the Galapagos finches, the method of random swaps offers a way to generate random data sets. Unfortunately, the very complexity of the data sets -- the feature that makes the swap algorithm valuable -- makes the finch data set a poor vehicle for learning about the swap algorithm when you are new to the subject. For that purpose, other, simpler data sets work better. The core of this chapter considers three classes of graph walks. The first consists of walks on the subsets of a finite set, the second, walks on binary (that is 0,1) matrices like the finch data with fixed row and column sums, and the third, walks on the permutations (reorderings) of a finite set. For all three kinds of walks, you get from one vertex of the graph to another by random swaps. For all three classes of walks, we will be interested, ultimately, in the usual three questions: (1) What are the transition probabilities? (2) Is there a limiting distribution, and if so, what is it? (3) When a walk does converge to a limit, how fast does convergence occur? The chapter closes with two additional classes of swap walks, one for 2x2 tables of counts, of the sort you use for Fisher's exact test, and another for genetic data.

7.1 Walks on Subsets: Fisher's Exact Test via Random Swaps

Now that you've seen how to carry out Fisher's exact test, you can use it as a setting for thinking about the random swap algorithm. So far, you've seen swaps used mainly in just one applied context: to generate random matrices of 0s and 1s for testing patterns in co-occurrence matrices. The idea of random swaps, however, is much more general.

Looking ahead, and looking back

Before we get into the details, here's the central idea. Suppose you want a random sample of size 3 from 1, 2, ..., 10. Start with the subset 1, 2, 3. This leaves 4, 5, ..., 10 "unused". Randomly pick one of the elements of your subset; to be concrete, suppose you get the 2. Now randomly pick one of the unused integers, say 7. Swap the 2 and the 7, so that your sample is now 1, 7, 3. Carry out a large number of swaps of the same kind; then declare the final result to be your random sample.

For situations as simple as Fisher's exact test, random swapping is unnecessarily complicated and not at all efficient, but that's not the point here. The point is that, precisely because the situation is a simple one, you can use it to develop your understanding of how swapping works. To set the stage for a more detailed look at swap walks, review the general algorithm for estimating p-values:


Algorithm for estimating p-values

1. Generate a large number (NRep) of random data sets.
2. Compare each random data set with the actual data: Assign a Yes if more extreme than the actual, otherwise assign No.
3. Estimate p using the observed fraction of Yes answers in the sample:

    p̂  =  (# Yes data sets in the sample) / NRep

For the finch data, we use swaps to execute Step 1. As a way to understand better how that works, we’ll now use swap walks to execute Step 1 for the far simpler Fisher’s exact test. To use Fisher's test, your situation must be one that allows you to model the generation of random data sets by drawing at random from a bucket of red and blue marbles.

(1) The goal must be to compare two groups ("chosen" and "not chosen").
(2) The population of interest must be dichotomous, essentially a collection of 1s (red) and 0s (blue).

Condition (1) lets you generate data by choosing a random subset of the population, while condition (2) lets you compare data sets using the number x of 1s (red marbles) in the chosen subset:

                        1                 0
                 (# red = # 1s)    (# blue = # 0s)     Total
   Chosen               x                               n      (sample size)
   Not chosen                                           N - n
   Total                R              B = N - R        N      (population size)

Display 7.1. Summary table for Fisher's exact test.

The column totals are fixed once you know the population; the row totals are fixed once you know how big a subset to choose. You draw a random sample of size n from your population of R 1s and N-R 0s, then count the number x of 1s in the sample. (If you know x and the totals, all the other values in the table are determined.) You can then use x to compare data sets in Step 2 of the p-value algorithm.
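To make the connection concrete, here is a minimal sketch, in S-Plus/R-style code, of drawing one random subset directly and counting x. The example population of five 1s and five 0s and the function name random.x are made up for this illustration.

    # Sketch: Step 1 (generate) and the comparison number x for Fisher's test,
    # drawing the random subset directly rather than by swapping.
    random.x <- function(population, n){
      chosen <- sample(length(population), n)   # a random subset of size n
      sum(population[chosen])                   # x = number of 1s in the subset
    }
    population <- c(0,0,0,0,0,1,1,1,1,1)        # five 0s (blue) and five 1s (red)
    random.x(population, 3)                     # one random value of x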


Generating random subsets by swapping

Now take a closer look at Step 1, which tells you to generate a large number of random data sets. A good statistical software package lets you do this in a single command,1 but suppose, instead, that you had to create the subsets "from scratch", using random digits.

Swap Algorithm for Random Subsets

Start with the first n elements of your population 1, 2, …, n as your subset, with the remaining elements n+1, n+2, …, N "unused".
1a. Swap: Pick one element of the subset at random, and one unused element at random, and swap.
1b. Repeat Step (1a) a large number of times. The subset you end up with is the random subset.
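Here is a minimal sketch of the swap algorithm in S-Plus/R-style code; the function name swap.subset is made up for this illustration, and the bookkeeping (storing the subset and the unused elements as two vectors) is one choice among many.

    # Sketch of the Swap Algorithm for Random Subsets
    swap.subset <- function(N, n, NSwaps){
      subset <- 1:n                      # start with the first n elements
      unused <- (n+1):N                  # the remaining elements are unused
      for (i in 1:NSwaps){
        a <- sample(n, 1)                # position in subset to give up
        b <- sample(N - n, 1)            # position in unused to bring in
        temp      <- subset[a]
        subset[a] <- unused[b]
        unused[b] <- temp                # swap the two elements
      }
      return(subset)
    }
    swap.subset(10, 3, 1000)             # one random subset of size 3 from 1..10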

To use this method in practice, we need answers to two by-now-familiar questions.

Question 1: Does the method of random swaps generate all subsets with equal probabilities?

Question 2: How many swaps does it take to generate one random subset?

Activity 7.1. Swap walk on subsets of the Martin data

Here are the ages of the 10 hourly workers considered for layoff in Round 2, with the ages and IDs of those chosen (IDs 6, 7, and 10) underlined.

Ages 25 33 35 38 48 55 55 55 56 64

ID # 1 2 3 4 5 6 7 8 9 10

Step 1. Carry out 10 steps of the swap walk and put your results in a table like the one below.

I’ve done the first swap: I chose a number at random from Subset = {6,7,10}; as the underlining in the table shows, I got 10. Then I chose a number at random from Unused = {1,2,3,4,5,8,9}, and got 5. I swapped the 5 and 10, and wrote the updated Subset and Unused in the next row. Finally, I filled in the ages corresponding to the new Subset, and computed the average.

1 In S-Plus, for example, to choose a random subset of size 3 from 1, 2, ..., 10, use the command "sample(10,3)".


   Step   Subset     Unused            Ages        Average
   0      6 7 10     1 2 3 4 5 8 9     55 55 64    58.00
   1      6 7 5      1 2 3 4 10 8 9    55 55 48    52.67
   ...
   10

Step 2. Plot average age (last column) versus step number (first column), and join consecutive points by line segments.

Discussion questions.
1. How many swaps do you think it will take to get to a random data set? (This question merely asks for whatever your intuition tells you.) Choose from 10, 50, 100, 500, 1000, 10000.
2. Now look at Display 7.2, which shows a time series plot of the average age of those in the random Subset at each step. How many steps does it appear to take for the swap walk to mix? Explain.


Display 7.2. Time series plot of average age versus n for a swap walk on subsets of the Martin data. The horizontal line shows the population mean.

3. Are all subsets equally likely in the limit? Explain.
4. Think of the swap walk as a random walk on a graph. What are the vertices? Give examples of two vertices that are neighbors, and two vertices that are not neighbors. How many vertices are there in all? Is the graph connected?
5. What is the maximum vertex degree? The minimum degree? The average degree?


6. What is the diameter of the graph? (Find two vertices that are as far apart as possible: how many swaps does it take to go from one to the other?) 7. What is the girth of the graph? (What is the length of the shortest cycle?)

Example 7.1. Subsets of size 2 from {1, 2, 3, 4}

There are six possible subsets of size 2 from a set with 4 elements: 12, 13, 14, 23, 24, 34. Make these the vertices of a graph. Swaps will define the edges. Suppose your current subset is 12. By swapping the 1 with one of 3 or 4, you can go from 12 to 23 or 24. If, instead, you swap the 2 for a 3 or 4, you can go from 12 to 13 or 14. However, from 12 there is no single swap that can take you to 34. We can represent the moves from 12 as in Display 7.3:

   [Figure: vertex 12 with edges to its four neighbors 13, 23, 14, and 24; 34 is not a neighbor of 12.]

Display 7.3. Part of a graph for random swaps from the subset 12

Notice that for any and every finite population and subset size, the swap algorithm defines a random walk on a graph whose vertices are the subsets. The starting subset of the algorithm is the starting vertex for the walk. Step 1a of the algorithm tells how to move to an adjacent vertex, and, in the process, tells how to figure out if two vertices are adjacent. Step 1b of the algorithm says to "run the walk" for a large number of steps.2

Now revisit the two questions from before, this time in the familiar context of graph walks. Question 1 becomes "Is there a uniform limiting distribution?" Question 2 asks about the convergence rate: "How long (how many steps) until convergence?" To answer these questions, we need to analyze the structure of the graph.

Example 7.1 (continued). The "swap walk" on subsets of size 2 from {1, 2, 3, 4}.

The graph of the walk has six vertices: 12, 13, 14, 23, 24, 34. By symmetry, all vertices behave as 12 does: There are four adjacent vertices, and one additional vertex that you can't get to with just one swap. Display 7.4 shows the graph. Because the graph is regular (all vertices have the same degree), there is a uniform equilibrium distribution.


2 Our language already has “talk the talk” and “walk the walk.” Now we can add “run the walk,” an oxymoronic gift to the language from probability theory.


   [Graph: the six vertices 12, 13, 14, 23, 24, 34; each subset is joined to the four subsets that share an element with it, so 12 and 34 (for example) are not adjacent.]

Display 7.4. Graph of the swap walk on subsets of size 2 from {1, 2, 3, 4}

Here is the transition matrix P of the Markov chain:

         12     13     23     34     24     14
   12     0    0.25   0.25    0     0.25   0.25
   13    0.25    0    0.25   0.25    0     0.25
   23    0.25  0.25    0     0.25   0.25    0
   34     0    0.25   0.25    0     0.25   0.25
   24    0.25    0    0.25   0.25    0     0.25
   14    0.25  0.25    0     0.25   0.25    0

Display 7.5. Transition probabilities for the swap walk

If the walk starts at 12, the initial distribution at time 0 is p(0) = (1, 0, 0, 0, 0, 0). To study convergence, we compute the distribution after n steps as p(n) = p(0) P^n. These distribution vectors are shown in Display 7.6.

    n      12        13        23        34        24        14       Var'n Dist
    0   1         0         0         0         0         0          0.83333
    1   0         0.25      0.25      0         0.25      0.25       0.33333
    2   0.25      0.125     0.125     0.25      0.125     0.125      0.16667
    3   0.125     0.1875    0.1875    0.125     0.1875    0.1875     0.08333
    4   0.1875    0.15625   0.15625   0.1875    0.15625   0.15625    0.04167
    5   0.15625   0.171875  0.171875  0.15625   0.171875  0.171875   0.02083
    6   0.171875  0.164063  0.164063  0.171875  0.164063  0.164063   0.01042
    7   0.164063  0.167969  0.167969  0.164063  0.167969  0.167969   0.00521
    8   0.167969  0.166016  0.166016  0.167969  0.166016  0.166016   0.00260
    9   0.166016  0.166992  0.166992  0.166016  0.166992  0.166992   0.00130
   10   0.166992  0.166504  0.166504  0.166992  0.166504  0.166504   0.00065

Display 7.6. Distribution after n steps, for a swap walk that starts at 12. The right-most column, headed "Var'n Dist", measures the variation distance between the distribution after n steps and the uniform distribution. Convergence is quite fast.
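If you want to reproduce the numbers in Display 7.6, the following sketch (S-Plus/R style) iterates p(n) = p(n-1) P and prints the variation distance from uniform at each step; the variable names are mine, not part of the text.

    # Sketch: the computation behind Display 7.6
    P <- matrix(c(0,1,1,0,1,1,  1,0,1,1,0,1,  1,1,0,1,1,0,
                  0,1,1,0,1,1,  1,0,1,1,0,1,  1,1,0,1,1,0)/4,
                nrow=6, byrow=T)              # rows/columns in the order 12 13 23 34 24 14
    p <- c(1,0,0,0,0,0)                       # start at vertex 12
    uniform <- rep(1/6, 6)
    for (n in 1:10){
      p <- p %*% P                            # p(n) = p(n-1) P
      cat(n, round(p, 6), sum(abs(p - uniform))/2, "\n")   # last value: variation distance
    }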

Drill exercises For each of (8) – (12) below:

(a) Draw and label a graph that corresponds to the swap walk for generating random subsets.

(b) Tell whether the walk has a uniform equilibrium distribution.


(c) Tell whether the walk converges, and explain how you can tell. (d) Write down the transition matrix.

   Subsets of size 1:        Subsets of size 2:         Subsets of size 3:
   8. Pop = {1, 2}           9. Pop = {1, 2, 3}         11. Pop = {1, 2, 3}
                             10. Pop = {1, 2, 3, 4}     12. Pop = {1, 2, 3, 4}

For the following problem, answer without actually drawing the graph:

13. For the swap walk on subsets of size 2 from 1, 2, 3, 4, 5:

a. How many vertices are there?
b. How many neighbors does each vertex have? Explain how you can tell that the vertex degrees are equal.
c. List the vertices (i) adjacent to 12, and (ii) not adjacent to 12.
d. What are the circumference and girth of this graph?

14. Use the distances in the right-most column of Display 7.6. Is convergence geometric, as it was in previous examples? If not, is it one of the other forms described in Chapter 3: linear, logarithmic, or power law?

(15) – (21). For subsets of size r chosen from {1, 2, …, n}, regard each subset as the vertex of a graph. Define a random walk by the following moves (edges):
   • pick an element of the current subset (uniformly at random)
   • pick an element of its complement (also uniformly at random)
   • swap the two elements.

15. How many vertices does the resulting graph have?
16. How many neighbors does each vertex have?

17. Find the limiting distribution π.

18. Suppose you start at the vertex {1, 2, …, r}. Find p(0) and || p(0) − π ||, the variation distance between p(0) and π.

19. Find p(1) and || p(1) − π ||.

20. Find p(2) and || p(2) − π ||.

21. Is convergence geometric? How can you tell?

Investigation:
22. Consider using swaps to generate random subsets of size r from {1, 2, …, n}. As before, regard the subsets as vertices of a graph; two subsets are neighbors if you can get from one to the other by swapping two elements. (For example, if n = 5 and r = 3, then 124 and 245 are neighbors.) What is the relationship between n, r and


(a) the girth of the graph?
(b) the diameter of the graph?
(c) the chromatic number of the graph?
(d) whether the n-step transition probabilities converge to a limiting uniform distribution?
(e) the rate of convergence to a limiting distribution, when the limiting distribution exists?

(Definitions of girth, diameter, and chromatic number are given in Chapter 5, page 17.)

7.2 Swap Walks on Binary Matrices

For binary matrices (that is, matrices of 0s and 1s) like the finch data, our goal is to use random swaps to execute Step 1 of the p-value algorithm. Remind yourself how those random swaps work: We need swaps that don't change row or column totals. There are only two such "swappable" 2x2 matrices, each with 1s on one diagonal, 0s on the other.

    1 0                 0 1
    0 1    <--swap-->   1 0

The algorithm says to choose 2x2 sub-matrices at random from your data matrix, and swap them if they are swappable.

Random swap algorithm for binary matrices

1a. Sub-matrix: Choose two rows uniformly at random; then pick two columns uniformly at random. (This gives a 2x2 sub-matrix, and all possible sub-matrices are equally likely.)
1b. Swap: If the chosen sub-matrix is swappable, swap the 0s and 1s. If not, do nothing and return to 1a; do not count this as a step of the walk.
1c. Repeat Steps 1a and 1b for a very large number of random swaps.
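As a sketch of what Steps 1a and 1b look like in code (S-Plus/R style), the fragment below carries out a single candidate swap. The swappability test used here -- entries summing to 2 with equal diagonal entries -- is just one of several equivalent ways to recognize a checkerboard, and the function name is made up for this illustration.

    # Sketch: one candidate 2x2 swap on a binary matrix M
    one.candidate.swap <- function(M){
      rows <- sample(dim(M)[1], 2)                   # two rows at random
      cols <- sample(dim(M)[2], 2)                   # two columns at random
      sub  <- M[rows, cols]
      if (sum(sub) == 2 && sub[1,1] == sub[2,2]){    # is it a checkerboard?
        M[rows, cols] <- 1 - sub                     # swap the 0s and 1s
      }
      return(M)                                      # unchanged if not swappable
    }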

The following example shows how the algorithm determines a random walk on a graph.

Example 7.2. Consider 2x3 matrices with the same row and column sums as

    1 1 0
    0 0 1

Represent the swap walk on these matrices by a graph, and find the transition matrix.


Solution. There are 3 matrices with the same margins.3 Each one has two swappable 2x2 sub-matrices, which ecologists call "checkerboard units" or CUs. From any one of the three matrices, it is possible to go to either of the other two:

   (a)  1 1 0     (b)  1 0 1     (c)  0 1 1              a    b    c
        0 0 1          0 1 0          1 0 0        a     0    ½    ½
                                                   P =  b    ½    0    ½
                                                   c     ½    ½    0

Display 7.7. Swap walk on 2x3 matrices

Drill exercises. For (23) – (25), analyze the swap walk on matrices with the same row and column totals as the matrix given.

(23) (24) (25) 1 0 1 1 1 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1

If your matrix is not as simple as the ones in Exercises (23) – (25), it can help to break the analysis into parts:

Example 7.3. Find the graph and transition matrix for the swap walk on all 3x3 binary matrices with the same margins as:

    1 1 0
    1 0 1
    0 1 0

Solution. First, find all the matrices and label them:

3 Because row and column totals are usually shown in the right-most column and bottom row of a table, the totals are often collectively called the “margins” of the table.


   (a)  1 1 0     (b)  1 1 0     (c)  1 1 0     (d)  1 0 1     (e)  0 1 1
        1 0 1          1 1 0          0 1 1          1 1 0          1 1 0
        0 1 0          0 0 1          1 0 0          0 1 0          1 0 0

Next, for each matrix, find all swappable 2x2 sub-matrices (checkerboard units), and figure out where each swap takes you. One possible notation to describe a swappable sub-matrix is to write the coordinates of its two 1s. Starting from matrix (a):

   (1,2)&(2,3):  this swap takes you to   1 0 1
                                          1 1 0   =  (d)
                                          0 1 0

   (2,3)&(3,2):  this takes you to        1 1 0
                                          1 1 0   =  (b)
                                          0 0 1

   (2,1)&(3,2):  this takes you to        1 1 0
                                          0 1 1   =  (c)
                                          1 0 0

Display 7.8. Finding the matrices adjacent to a given matrix

Doing this for each of (b) – (e) gives a list of neighbors for each vertex, which in turn gives the transition matrix P:

   a: b c d              a    b    c    d    e
   b: a c d e       a    0   1/3  1/3  1/3   0
   c: a b e         b   1/4   0   1/4  1/4  1/4
   d: a b e    P =  c   1/3  1/3   0    0   1/3
   e: b c d         d   1/3  1/3   0    0   1/3
                    e    0   1/3  1/3  1/3   0

Finally, draw the graph. (It may take a couple of tries to get the picture that looks best.)

   [Graph: vertex b is adjacent to a, c, d, and e; a is also adjacent to c and d; e is also adjacent to c and d; a and e are not adjacent, and c and d are not adjacent.]

Display 7.9. Graph for a swap walk on 3x3 matrices.


If the graph in Display 7.9 looks familiar, that's because it is. This graph, apart from its labels, is one you've seen before, as the graph for the swap walk on the matrices with the same row and column totals as

    1 0 1
    0 1 0
    1 0 1

That walk is isomorphic to the present one: Just interchange Rows 2 and 3, and interchange Columns 2 and 3:

    1 1 0 | 2          1 0 1 | 2
    1 0 1 | 2     ~    0 1 0 | 1
    0 1 0 | 1          1 0 1 | 2
    ---------          ---------
    2 2 1 | 5          2 1 2 | 5

Display 7.10. Isomorphic incidence matrices. To get from one matrix to the other, interchange Rows 2 and 3, and Columns 2 and 3.
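If you want to check by machine that Example 7.3 really does have five matrices with the given margins, here is a brute-force sketch (S-Plus/R style) that runs through all 512 binary 3x3 matrices and keeps the ones whose row and column sums match; the variable names are mine.

    # Sketch: count the 3x3 binary matrices with the margins of Example 7.3
    target <- matrix(c(1,1,0, 1,0,1, 0,1,0), nrow=3, byrow=T)
    rsums  <- apply(target, 1, sum)                        # row margins: 2 2 1
    csums  <- apply(target, 2, sum)                        # column margins: 2 2 1
    codes  <- as.matrix(expand.grid(rep(list(0:1), 9)))    # all 512 fillings
    count  <- 0
    for (i in 1:dim(codes)[1]){
      m <- matrix(codes[i, ], nrow=3, byrow=T)
      if (all(apply(m,1,sum) == rsums) && all(apply(m,2,sum) == csums))
        count <- count + 1
    }
    count                                                  # 5: the matrices (a)-(e)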

Drill exercises

Exercises (26) – (28): Following the example as a guide, for each of the matrices listed below:

(a) Find and label all the binary matrices with the same margins.
(b) For each of the matrices in (a), find all the swappable 2x2 sub-matrices, and for each one, find which of the matrices in (a) the swap takes you to.
(c) Use your answers in (b) to construct the transition matrix for the walk.
(d) Draw the graph of the walk.

   (26)  1 1 0     (27)  1 0 0     (28)  1 0 1
         0 0 1           0 1 0           1 1 0
         1 0 0           0 0 1           0 1 1

Investigation:

The next four questions are "big picture" questions, in that I've phrased them in terms of what we'd like to know, rather than in terms of what a good next step would be, based on what you know now. Put differently, a big part of the challenge in these questions is to reformulate the question in a way that makes it clearer what you might do in order to provide a partial answer. If you are able to design and carry out an investigation that advances your understanding of the issues involved in the question, you have done well. Don't expect to give a complete answer.

29. How long does it take "to get to random" via 2x2 swaps on the finch data?


30. What can we say about the diameter of the corresponding graph? If a graph has n vertices and each vertex degree is at least d, what can we say about the diameter?
31. How much bias is introduced into the estimated p-value for the finch data if you don't Metropolize the swap walk?
32. What are ways to speed up the convergence to uniform of the swap walk on 0,1 matrices?


An S-Plus programming exercise

The final result of this exercise will be an S-Plus function ConSim that carries out the Conner/Simberloff random swap algorithm on a matrix of 0s and 1s. If you feel ambitious, you can ignore what follows and write your own function from scratch. However, if that seems too much of a stretch, here's a more modest assignment. I've written the main function, ConSim, which appears below. It calls five other functions: RandRows, RandCols, CheckSwap, determ, and swap, each of which requires only a couple of lines to code. This exercise will ask you to write those functions. Here is the main function, which takes a 0,1 matrix (DataMatrix) and number of swaps (NSwaps) as arguments, and returns a new, randomly rearranged matrix (RandMat):

    ConSim <- function(DataMatrix,NSwaps){        # Performs random 2x2 swaps
      RandMat <- DataMatrix                       #
      nSwaps  <- 0                                # nSwaps counts swaps
      while (nSwaps < NSwaps) {                   # the main loop
        Rows <- RandRows(RandMat)                 # choose 2 rows at random
        Cols <- RandCols(RandMat)                 # choose 2 columns
        if (CheckSwap(RandMat[Rows,Cols])){       # Is the 2x2 swappable?
          nSwaps <- nSwaps + 1                    # Yes: increment nSwaps
                                                  # and do the swap
          RandMat[Rows,Cols] <- swap(RandMat[Rows,Cols])
        }
      }
      return(RandMat)
    }

    ConSim(DataMatrix,1000)

Exercises (33) – (37). Write S-Plus functions to do the following:

   Function        Argument                     Return
   33. RandRows    Matrix: any matrix           A vector of 2 row numbers, chosen at
                                                random from 1:dim(Matrix)[1]
   34. RandCols    Matrix: any matrix           A vector of 2 column numbers, chosen at
                                                random from 1:dim(Matrix)[2]
   35. determ      A: a 2x2 matrix              a11 a22 - a12 a21
   36. CheckSwap   Matrix: a 2x2 matrix         T if Matrix is swappable, F if not
                   of 0s and 1s
   37. swap        SubMatrix: a 2x2 matrix      a 2x2 matrix with the same rows as
                                                SubMatrix, but in reverse order

Check each function separately, then enter the main function and use it to produce a randomized version of the 3x3 example with rows 1 0 1, 0 1 0, and 1 0 1.


7.3 Walks on Permutations: Fisher's Test Revisited

So far, you have two ways to generate random data sets for Fisher's test. You can create random subsets directly (either by computer, or by physical simulation, using a bucket of colored marbles), or you can create subsets using random swaps. This section introduces a third method, using random swaps on permutations. Consider a new way to choose a random subset of size 3 from 1, 2, …, 10: Scramble the 10 numbers so that all possible orderings are equally likely; then take the first three numbers as your subset.

Discussion: (38) Explain why this method is able to generate all possible subsets of size 3, and why all possible subsets are equally likely.

Notice that if you redefine the number you use to compare data sets, you can think of your random data set as the entire permutation instead of just the subset.4

Subset version
Step 1. Generate random subsets of size 3.
Step 2. Compare subsets using the number of 1s in the subset: Does the random subset have at least as many 1s as the actual subset?

Permutation version
Step 1. Generate random permutations of the actual observations.
Step 2. Compare permutations using the number of 1s among the first 3 elements of the permuted observations: Does the random permutation have at least as many 1s among its first 3 elements as the actual set of observations?

Tests that compare the observed data set with random permutations of the same set of values are called permutation tests. What you’ve just seen demonstrates that Fisher’s exact test is an instance of a permutation test. Random permutations are easy to generate by physical simulation. For permutations of 10 objects, put 10 chips numbered 1 through 10 in a bucket, mix thoroughly, and draw them out, one at a time, without replacement. Line them up in the order drawn, and you have your permutation. Simulating this with a computer is even easier. In S-plus, for example, the command “sample (10,10)” gets you a random ordering of 1, 2, …, 10. As before with random subsets, we can use random permutations as a context for studying the behavior of swap walks. Instead of generating a random permutation all at once, we can generate it as the stopping vertex of a graph walk whose vertices are

4 For this particular example, working with permutations instead of subsets carries no real advantage. However, tests based on permutations are easier to extend to other situations, and that is why it is worth thinking through a new way to do something you already know how to do.


permutations and whose edges come from swaps. This gives an expanded version of Step 1:

Swap Algorithm for Random Permutations

Start with (1, 2, 3, …, N).
1a. Swap: Pick two positions at random from 1 to N, and swap the numbers in those two positions.
1b. Repeat: Do Step 1a a large number of times. The ordering you end up with after your last swap is your random permutation.

For this method to work,
(1) it must be able to generate all possible permutations, and
(2) the permutations must be equally likely.
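Here is a minimal sketch of the algorithm in S-Plus/R-style code; the function name swap.permutation is made up for this illustration.

    # Sketch of the Swap Algorithm for Random Permutations
    swap.permutation <- function(N, NSwaps){
      perm <- 1:N                          # start with (1, 2, ..., N)
      for (i in 1:NSwaps){
        pos <- sample(N, 2)                # two distinct positions
        perm[pos] <- perm[rev(pos)]        # swap the numbers in those positions
      }
      return(perm)
    }
    swap.permutation(10, 1000)             # one random permutation of 1..10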

By viewing the algorithm as a graph walk, we will be able to recast these two requirements in terms of limiting distributions and convergence rates.

Activity 7.2. Swap walk on permutations of the Martin data

Here are the ages of the 10 hourly workers considered for layoff in Round 2, rearranged so that those chosen for layoff come first (the first three positions):

Ages 55 55 64 25 33 35 38 48 55 56

ID # 1 2 3 4 5 6 7 8 9 10

Step 1. Carry out 10 steps of the swap walk and put your results in a table like the one below.

I've done the first swap: I chose two numbers at random from {1, 2, …, 10} and got 5 and 9, as the underlining in the table shows. I swapped the 5 and 9, wrote the updated permutation in the next row, filled in the ages corresponding to the new subset, and computed the average. Notice that because both the 5 and the 9 belong to the same subset of the population, the permutation changes but the subset and average do not.

   Step   Permutation                Subset (ages)   Average
   0      1 2 3 4 5 6 7 8 9 10      55 55 64        58.00
   1      1 2 3 4 9 6 7 8 5 10      55 55 64        58.00
   ...
   10

Step 2. Plot average age (last column) versus step number (first column), and join consecutive points by line segments.

Discussion questions. 39. What does your intuition tell you: Do you expect this walk to mix more rapidly, less rapidly, or at the same rate as the subset walk in Activity 7.1?


40. Display 7.11 shows a time series plot of average age versus n for 150 steps of the walk. Compare this plot with the one in Display 7.2 at the beginning of the chapter. What similarities do you notice? What differences? Explain how you can tell, from looking at the two plots, which walk is "stickier."


Display 7.11. Time series plot of average age versus n for a swap walk on permutations of the Martin data. The horizontal line shows the population mean.

41. Are all subsets equally likely in the limit? (Are all vertex degrees the same?)
42. Think of the swap walk as a random walk on a graph. What are the vertices? Give examples of two vertices that are neighbors, and two vertices that are not neighbors. How many vertices are there in all? Is the graph connected? How do your answers compare with the ones in Activity 7.1?

Example 7.4. Analyze the random swap method for permutations of the set {a, b, c}.

Solution. There are six permutations: abc, acb, bac, bca, cab, cba. Take these as the six vertices of a graph. Consider vertex abc. Swapping the first two elements takes you to bac; swapping the last two takes you to acb; swapping the first and third takes you to cba. There are two other vertices, bca and cab, that can't be reached from abc in one step.

          2<->3: acb
   abc    1<->2: bac
          1<->3: cba

Display 7.12. Vertices adjacent to abc


We can do a similar analysis for each of the other vertices. With practice you can learn to construct the whole graph without writing everything out, but initially, you may find it helps to be systematic about it:

                    To:
   From:      1 ↔ 2     2 ↔ 3     1 ↔ 3
   abc        bac       acb       cba
   acb        cab       abc       bca
   bac        abc       bca       cab
   bca        cba       bac       ___
   cab        acb       ___       ___
   cba        ___       ___       ___

Display 7.13. Table of vertices and their neighbors.

Drill: (43) Complete the table.

Converting the information in the table to a picture of the graph may take some trial and error before you get a representation that looks nice, but it's not so much difficult as merely tedious.

   [Graph on the six vertices abc, acb, bac, bca, cab, cba, with an edge joining two permutations whenever a single swap takes you from one to the other.]

Display 7.14. Graph for the swap walk on permutations of three letters

Drill Exercises

44. Consider paths of length two. How many vertices can you reach from abc in exactly

two steps? Which ones? Answer the same questions for paths of length three. Does this suggest anything to you about whether the walk converges?

45. Write the transition matrix of the swap walk.
46. Reorder the vertices so that the transition matrix consists of four 3x3 sub-matrices, with the nine elements of each sub-matrix constant. Compute P^2 and P^3. What does this tell you about the behavior of the walk?

47. Verify that x = (1, 1, 1, 1, 1, 1)/6 is a solution to xP = x.
48. Check that, with the matrix in block-diagonal form, (1, 1, 1, -1, -1, -1) is an eigenvector of P. Find the corresponding eigenvalue.


49. Find four linearly independent eigenvectors with eigenvalue 0.
50. Suppose you use the random permutations to generate random subsets of size 1, by taking the first element in a permutation as your subset. Explain why all subsets of size 1 can be generated this way, and why they are equally likely.

Investigation

51. (Warm up #1.) Check that the graph in Display 7.14 has girth 4, circumference 6, diameter 2, and chromatic number 2. (Definitions of these quantities are given in Chapter 5, on page 17.)
52. (Warm up #2.) Consider the graph whose vertices are the permutations of {a, b, c, d}, and whose edges correspond to swaps of two elements. Find the girth, circumference, diameter, and chromatic number.
53. Consider the graph whose vertices are the permutations of n objects, and whose edges correspond to pairwise swaps. What can you say about the girth, circumference, diameter and chromatic number? Which of these measures are related to the convergence behavior of the random walk on the graph?
54. The random walks of this section move by choosing two positions from 1 to n and swapping them. Notice that the two positions are chosen without replacement: you aren't allowed to choose the same position twice. Now consider the walk you get if you choose positions with replacement; if your two choices are the same, the corresponding move is a self loop. Investigate the convergence behavior of the walk: does it converge? If so, what is the limiting distribution, and what can you say about how fast convergence occurs? What is the relationship between n and the girth, circumference, diameter, and chromatic number of the graph? Are any of these related to the convergence rate?

7.4 Fisher's test: A third swap walk

We now have two ways to carry out Fisher's exact test using random swaps: a walk on subsets and a walk on permutations. Unfortunately, these algorithms aren't very efficient, because with many vertices to visit and few neighbors per vertex, you have to wander a long time (many steps) before the vertices become equally likely. To use the Martin example for illustration, the swap walk on subsets of size 3 chosen from a population of size 10 has C(10,3) = 120 vertices, and 3 x 7 = 21 edges per vertex. The swap walk on permutations of 10 objects has 10! = 3,628,800 vertices, and only 45 edges per vertex. For swapping finch data sets, there are more than 10^17 vertices, and only a few hundred edges per vertex.

Drill: (55) Explain the 120 and 21 in the preceding paragraph.

Can we invent a more efficient swap walk? One possibility would be to try a walk directly on the summary tables, in the spirit of the walk on co-occurrence matrices like


the finch data. Once again, we turn to the Martin example for a simple, concrete illustration.

                      Chosen? (i.e., in one of the first 3 positions?)
   55 or older?          Yes          No         Total
   Yes = 1                x          5 - x         5
   No = 0               3 - x        2 + x         5
   Total                  3            7          10

Display 7.15. Summary table for the Martin example. The x in the upper left cell is the number of older workers (1s) among a randomly chosen subset of size 3.

For a swap walk on these tables, our set of vertices would be the set of all possible summary tables, that is, tables of non-negative integers with the same margins as above:

   (a)        (b)        (c)        (d)
   0  5       1  4       2  3       3  2
   3  2       2  3       1  4       0  5

Display 7.16. The four possible summary tables. The upper left number in each table is x, the number of 1s in the chosen subset of size 3.

This graph walk has only four vertices, far fewer than the 120 vertices for the swap walk on subsets, and a stunning improvement over the 3.6 million for the swap walk on permutations. What about edges? Two vertices should be connected by an edge if and only if you can get back and forth between the corresponding summary tables by a single swap. Since we need to keep row and column totals fixed,5 a natural set of moves would be the two given by

    + -            - +
    - +    and     + -

where a "+" means "add 1 to the corresponding cell entry" and a "-" means "subtract 1". Display 7.17 shows the results of applying these two moves to the table with the 1 in its upper left corner:

                  + -          (c)
                  - +          2  3
   (b)                         1  4
   1  4
   2  3           - +          (a)
                  + -          0  5
                               3  2

Display 7.17. Two swaps, from vertex (b) to (c) and from (b) to (a).


5 Drill: (44) Explain why the row and column totals should stay fixed. Don’t give an abstract explanation – use the language of workers, ages, and layoffs.


From the two extreme vertices, (a) and (d), only one move is possible. From each of the middle vertices (b) and (c), there are two equally likely moves. Putting all this together gives a random walk on a linear graph:

   (a)  ---  (b)  ---  (c)  ---  (d)
   0  5      1  4      2  3      3  2
   3  2      2  3      1  4      0  5

Display 7.18. The swap walk for the Martin example

Drill Exercises

56. Write the transition matrix of the walk.
57. Solve xP = x to find the equilibrium distribution(s). How can you tell that there is only one? Is it uniform?
58. Does p(n) converge to a limiting distribution? If so, does the limit depend on the starting distribution p(0)?
59. Our null model said that all subsets of size 3 should be equally likely. Explain why this is an appropriate null model. Use the language of workers, layoffs, and age discrimination.
60. If all subsets of size 3 are equally likely, what probability gets assigned to each of the four summary tables? For example, of the 120 possible subsets, how many give Table (a), which has a 0 in the upper left cell?
61. If all possible permutations are equally likely, what probability gets assigned to each of the tables?
62. Using the swap walk to generate random summary tables, what probability gets assigned to each of the tables?
63. What do you make of the fact that the sets of equilibrium probabilities don't agree?

The fact that our newly defined swap walk assigns the wrong probabilities to tables is a major disappointment. This graph walk is so much simpler than the other two, with a mere four vertices instead of 120, not to mention the 3.6 million. We appear to have come to a sad dead end, having invented a very efficient method for getting the wrong answer! Take heart, dear reader: It will turn out that this seeming end only appears to be dead, and can be resuscitated by applying the Metropolis algorithm.

Fisher's exact test via the Metropolis algorithm

Consider a scaled-down version of the Martin data, with only 5 employees:


                           Chosen?
   55 or older?        Yes        No       Total
   Yes                  x        2 - x       2
   No                 2 - x      1 + x       3
   Total                2          3         5

Display 7.19. Summary table for the Martin example. The x in the upper left cell is the number of older workers (1s) among a randomly chosen subset of size 2.

According to the null model, the number x of older workers chosen for layoff behaves like the number of red marbles in a sample of size 2 chosen at random from a bucket with 2 red (older) and 3 blue (younger) marbles.

Discussion

64. Denote the population by {R1, R2, B1, B2, B3}. List all samples of size 2, verify that there are 10 of them, and compute P{x = 0}, P{x = 1}, and P{x = 2}, assuming that all 10 samples are equally likely.

65. Write the 2x2 tables that correspond to x = 0, x = 1, x = 2. Regard these tables as the vertices of a graph, with an edge joining two vertices whenever you can get from one of the corresponding tables to the other by a move of the form

    + -           - +
    - +    or     + -

Draw the graph and write the transition matrix P.

66. What is the stationary distribution π? Does the walk converge?

67. Let the target limiting distribution τ be given by the probabilities in (64). Compare π and τ: which vertices are under-represented by the un-Metropolized walk? Which are over-represented? If you Metropolize, which acceptance probabilities αij will be less than 1?

68. Let α12 be the acceptance probability for a proposed move from the table with x = 1 to the table with x = 2. (Here the subscripts refer to the values of x, so the first vertex is numbered 0, and the third vertex is numbered 2.) It turns out that all other proposed moves are accepted with probability 1.
(a) Write the transition matrix P~ for the Metropolized chain in terms of the known probabilities and the unknown α12.
(b) Use the fact that τP~ = τ to write an equation for α12.
(c) Solve for α12, and verify that τP~ = τ.
(d) Verify that α12 = min{1, (π1/π2)/(τ1/τ2)}.
(e) Finally, verify that the formula αij = min{1, (πi/πj)/(τi/τj)} gives a value of 1 for all other proposed transitions.


69. According to the formula αij = min{1, (πi/πj)/(τi/τj)}, proposed moves from i to j are accepted less often when πi < πj and/or when τi > τj. Explain why this is reasonable.

Exercise

70. Consider the Martin data, as in Display 7.15.
(a) Write the limiting distribution π, from (57), for the (un-Metropolized) swap walk on the 2x2 tables.
(b) Write the target distribution τ, from (60), which comes from the null model that all subsets of size 3 are equally likely.
(c) Use the formula αij = min{1, (πi/πj)/(τi/τj)} to Metropolize the graph walk. Write P~ and verify that τP~ = τ.

Investigation

71. Generalize (70). Consider an arbitrary instance of Fisher's exact test. Assume that x is the number of red marbles in a sample of size n drawn at random from a bucket containing R red and B blue marbles.
a. Graph walk. Represent the set of 2x2 tables as a graph with edges corresponding to the moves

    + -           - +
    - +    or     + -

Let P = {pij} be the transition matrix for the walk, and find formulas for the pij and the stationary probabilities πi.
b. Target distribution. Let τ be the vector of target probabilities, so that τi is the probability of getting exactly i red marbles in the sample, assuming that all samples of size n are equally likely. Write a formula for τi in terms of i, n, R, and B.
c. Acceptance probabilities. Write a formula for the αij in terms of i, n, R, and B.
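For readers who want to experiment, here is a sketch (S-Plus/R style) of the Metropolized walk for general n, R, and B. It uses the acceptance formula above together with the fact that the un-Metropolized walk's stationary probabilities are proportional to vertex degree. The function name metropolis.fisher and the bookkeeping are mine, not part of the text; the long-run visit frequencies should approximate the target τ.

    # Sketch: Metropolized swap walk on the 2x2 summary tables.
    # State: x, the upper-left cell count; target: hypergeometric probabilities.
    metropolis.fisher <- function(n, R, B, NSteps){
      N  <- R + B
      lo <- max(0, n - B); hi <- min(n, R)            # possible values of x
      tau <- choose(R, lo:hi) * choose(B, n - (lo:hi)) / choose(N, n)
      x <- lo
      visits <- rep(0, hi - lo + 1)
      for (step in 1:NSteps){
        nbrs <- c(x - 1, x + 1)
        nbrs <- nbrs[nbrs >= lo & nbrs <= hi]         # neighboring tables
        y <- nbrs[sample(length(nbrs), 1)]            # propose with prob 1/deg(x)
        dx <- (x > lo) + (x < hi)                     # degree of current vertex
        dy <- (y > lo) + (y < hi)                     # degree of proposed vertex
        # accept with probability min{1, (pi_x/pi_y)/(tau_x/tau_y)}, pi prop. to degree
        alpha <- min(1, (dx/dy) / (tau[x - lo + 1]/tau[y - lo + 1]))
        if (runif(1) < alpha) x <- y
        visits[x - lo + 1] <- visits[x - lo + 1] + 1
      }
      rbind(target = tau, observed = visits/NSteps)
    }
    metropolis.fisher(3, 5, 5, 10000)                 # the Martin example: n=3, R=5, B=5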


7.5 Graph walks in genetics: A test for Hardy-Weinberg equilibrium

DNA "fingerprinting" can lead to scientifically defensible statements like "Only 1 person in 60 billion has this particular combination of genes."6 It comes as a surprise to many people to find out that in principle such conclusions can be based on data gathered from only a few hundred individuals. On the surface, this seems almost impossible. What makes such conclusions logically justifiable is a fact about the genes of human populations -- they are often distributed according to a pattern (the law of Hardy-Weinberg equilibrium7) that lets you compute probabilities for various gene combinations (genotypes) knowing only the frequencies of the individual genes in the population. For many populations, both human and animal, it is reasonable to assume Hardy-Weinberg equilibrium. Nevertheless, the assumption is a strong one in the sense that it gives you the immense theoretical leverage you need to deduce extreme odds from data sets of quite modest size.8 Fortunately, it is possible to test statistically whether a population is in equilibrium. One way to do this uses random swaps to define a graph walk.

Example 7.5: ABO blood groups.

Each person has a collection of genes that together determine his or her blood type. There are several different blood group systems, each with its own set of genes. One such system, called ABO, is determined by a single locus (the site on a chromosome where the genes are found) with three alleles (different possible genes) A, B, and O. Because chromosomes are paired, a person has two ABO genes, and a person's genotype is determined by their genes at the locus in question. There are six possible genotypes at the ABO locus:

    AA   AB   AO   BB   BO   OO

According to genetic theory, for a population at equilibrium, the frequencies of the six genotypes are completely determined by the frequencies of the three alleles. Here are frequencies based on an actual data set:9

6 National Research Council (1996). The Evaluation of Forensic DNA Evidence. Washington: National Academies Press.
7 The next page will show the basic minimum you need for this section. For more detail, see any introduction to genetics, or see, for example, Lange, Kenneth (1997). Mathematical and Statistical Methods for Genetic Analysis, New York: Springer-Verlag.
8 I've oversimplified more than just a little. There are two kinds of equilibrium, Hardy-Weinberg and linkage. Forensic DNA calculations require you to assume that a population has reached both kinds of equilibrium. Moreover, in practice what appears to be a DNA match can in fact be due to extraneous causes such as errors in handling of samples or recording results. Errors of this sort, though infrequent, nevertheless have probabilities substantially higher than 1 in 60 billion.
9 The data are from Clarke, CA, DA Price-Evans, RB McConnell, and PM Shepard (1959). "Secretion of blood group antigens and peptic ulcers," Brit. Med. J. 1:603-607, via Lange, op. cit., p. 21. (Although the data are genuine, I have exercised poetic license for the sake of simplicity. Because the O allele is recessive, one cannot distinguish between AA and AO genotypes, for example. I first computed fitted values consistent with the actual data, then rounded to simplify the arithmetic.)


    nAA = 20    nAB = 5    nAO = 115    nBB = 0    nBO = 35    nOO = 225

From these numbers you can compute the gene counts. Consider the A allele. Each type AA person has two A genes, each AB has one A (and one B), each AO has one A (and one O), and the other genotypes, BB, BO, and OO, have no A genes. This gives a total number of A genes equal to nA = 2nAA + nAB + nAO = 40 + 5 + 115 = 160.

Drill: (72) Verify that the total number nB of B genes is 40, and the total number nO of O genes is 600. Then check that the total number of genes nA + nB + nO is twice the number of people.

The gene counts lead easily to estimated gene frequencies:

    p̂A = nA/2N = 160/800 = 1/5  = 0.20
    p̂B = nB/2N =  40/800 = 1/20 = 0.05
    p̂O = nO/2N = 600/800 = 3/4  = 0.75

According to genetic theory, for a population in Hardy-Weinberg equilibrium, the expected frequencies are completely determined by the gene frequencies:

    P(AA) = pA^2        P(AB) = 2 pA pB      P(AO) = 2 pA pO
    P(BB) = pB^2        P(BO) = 2 pB pO
    P(OO) = pO^2

For a population of N = 400 individuals, with gene counts nA = 160, nB = 40, nO = 600, these relationships specify expected values for the genotype counts if the assumption of Hardy-Weinberg equilibrium is in force:

    n̂AA = 16    n̂AB = 8    n̂AO = 120
    n̂BB = 1     n̂BO = 30
    n̂OO = 225

Display 7.20 shows the steps in computing the chi-square distance10 between the tables of observed and expected counts.


10 See Chapter 2, Section 5.


Step 2a.
    Observed           Expected           Obs - Exp
    20   5  115        16   8  120         4  -3  -5
         0   35             1   30            -1   5
            225                225                 0

Step 2b.
    (Obs - Exp)^2      Expected           (O-E)^2/E
    16   9   25        16   8  120         1  1.13  0.21
         1   25             1   30            1     0.83
             0                 225                  0

Step 2c. Chi-square = sum of (O-E)^2/E = 4.17

Display 7.20. Computing chi-square for the ABO blood data

We are now in a position to test whether the assumption is consistent with the observed data. We'll use a 3-step randomization test, with expected values computed as above.

Randomization chi-square test for Hardy-Weinberg equilibrium

Step 1. Generate. Our null model, which we'll use to generate random data sets in Step 1 of the algorithm, says that all sets of genotype counts with the same gene counts are equally likely. Thus for Step 1 we'll need to generate data sets -- random tables of genotype counts -- with fixed gene counts.

Step 2. Compare. We'll use the chi-square distance from the Hardy-Weinberg expected value as the basis for comparison, asking "Is the random data set at least as far from the expected values as the actual data?"

Step 3. Estimate. Our estimated p-value will be the proportion of random data sets at least as far (as the actual data set is) from the equilibrium values. If it is easy to get random data this far from equilibrium, we'll conclude that the observed values are consistent with an assumption of equilibrium.
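The comparison in Step 2 is just the chi-square computation of Display 7.20. Here is a minimal sketch of it in S-Plus/R-style code, with the observed and expected counts taken from the ABO example; the variable names are mine.

    # Sketch: the Step 2 comparison statistic for the ABO data
    observed <- c(AA=20, AB=5, AO=115, BB=0, BO=35, OO=225)
    expected <- c(AA=16, AB=8, AO=120, BB=1, BO=30, OO=225)   # Hardy-Weinberg fits
    chisq <- sum((observed - expected)^2 / expected)
    chisq                                                     # about 4.17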

To execute Step 1, we’ll use a random walk on a graph whose vertices are the tables of genotype counts with fixed gene totals nA = 160, nB = 40, nO = 600.


73. Investigation: How many such tables are there? Consider first a simple example with small values for the frequencies. Can you develop a formula for the number of tables with given gene totals?

If the vertices of our graph are the tables of genotype counts, what are the edges? We need a set of “moves” that go from one table to another while preserving gene totals. Here is one:

          A    B    O
    A     +    -    -         This move raises the counts in the cells marked + (AA, BO),
    B          0    +         lowers the counts in the cells marked - (AB, AO), and
    O               0         makes no change to the cells marked 0 (BB, OO).

Display 7.21. A "swap" between tables of genotype counts that leaves gene counts fixed

Drill:
74. Check that this move leaves the gene totals unchanged.
75. Check that you get a second move, one that decreases nAA by one, if you interchange

the +s and -s in the move just defined.
76. Write two tables of +s, 0s, and -s, one for a move that increases nBB by one, and one for a move that increases nOO by one.
77. Try to find a legal move that has 0s in all three diagonal cells. Then explain why no such move exists.

We now have three pairs of basic moves that we can use to define a graph walk. The vertices of the graph are the tables of genotype counts, and two vertices are joined by an edge if you can get from one to the other by one of the six basic moves. Now the standard questions arise:

Investigation:
78. Is there a limiting distribution? (Is the graph connected?) Is the equilibrium distribution uniform? How fast does the walk converge, if in fact it does converge? (Start with simple examples.)
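As a concrete illustration of the first basic move (the one in Display 7.21), here is a sketch in S-Plus/R-style code that applies it to the ABO table of Example 7.5. The function name move.AA.up is made up, and the check that no count goes negative is my addition, not something stated in the text.

    # Sketch: one basic move on a table of genotype counts, stored as a named vector
    move.AA.up <- function(counts){
      if (counts["AB"] > 0 && counts["AO"] > 0){                # move must stay legal
        counts[c("AA","BO")] <- counts[c("AA","BO")] + 1        # cells marked +
        counts[c("AB","AO")] <- counts[c("AB","AO")] - 1        # cells marked -
      }
      return(counts)
    }
    counts <- c(AA=20, AB=5, AO=115, BB=0, BO=35, OO=225)
    move.AA.up(counts)          # the gene counts nA, nB, nO are unchanged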

Exercise 79. A second blood group system is the MN system, with two alleles, M and N, at a single locus. Here are genotype frequencies for 208 Bedouins of the Syrian desert:


nMM = 119, nMN = 76, nNN = 13.¹¹ Devise a swap walk on the genotype frequencies (What are the vertices? What are the edges?) and use it to test for Hardy-Weinberg equilibrium.

Investigation 80. Consider the graph whose vertices are genotype counts for the ABO blood system, with fixed totals for the numbers of A, B and O genes, as in Example 7.5. For such tables, you can use chi-square as a kind of distance, to measure how far apart two tables are: regard one of the tables as "observed" and the other as "expected", and compute the sum of (Obs - Exp)^2/Exp. A true measure of distance should have three properties:

(a) Positive definiteness: D(a, b) ≥ 0 for all a, b, with D(a, b) = 0 if and only if a = b.
(b) Symmetry: D(a, b) = D(b, a) for all a, b.
(c) Triangle inequality: D(a, b) ≤ D(a, c) + D(c, b) for all a, b, c.

Which of these properties does the chi-square "distance" have? (In the notation above, a, b, and c are genotype tables, and D(a, b) stands for the chi-square "distance" between tables a and b.)

81. Another way to measure the distance between two genotype tables is to use the definition of distance that comes from graph theory: the distance D(a, b) from a to b is the number of edges in the shortest path from a to b, that is, the smallest number of swaps needed to get from a to b. Which of the three properties does this distance have?

82. Investigate the relationship between the chi-square distance χ²(a, b) and the graph distance D(a, b) defined in (81). If D(a, b) < D(a, c), will it always be the case that χ²(a, b) < χ²(a, c)? Or can you find tables a, b, and c for which D(a, b) > D(a, c) and χ²(a, b) < χ²(a, c)? Can you find a constant k1 such that χ²(a, b) ≤ k1 D(a, b) for any pair of tables a and b? Or is it not possible? Can you find another constant k2 such that D(a, b) ≤ k2 χ²(a, b) for any pair of tables?
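As a starting point for Investigations 80-82, here is a small sketch (ours, not from the appendix) of the chi-square "distance", treating the first table as "observed" and the second as "expected".

# A sketch (ours) of the chi-square "distance" between two tables,
# each stored as a vector of genotype counts in the same order.
chisq.dist <- function(a, b){
    return(sum((a - b)^2/b))        # b plays the role of "expected"
}

a <- c(20, 5, 115, 0, 35, 225)      # the observed ABO table
b <- c(16, 8, 120, 1, 30, 225)      # its Hardy-Weinberg expected table
chisq.dist(a, b)                    # about 4.17, as in Display 7.20
chisq.dist(b, a)                    # infinite, because a has a zero cell:
                                    # already a hint about property (b)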

11 From Crow, JF (1986). Basic Concepts in Population, Quantitative, and Ecological Genetics. San Francisco: Freeman, via Lange, op. cit., p. 20.


Appendix: S-plus Functions

############################################################
#                                                          #
#         ESTIMATING p-VALUES USING RANDOMIZATION          #
#                                                          #
############################################################
# The main function estimates a p-value using the three-step
# algorithm "Generate, compare, estimate":
#
#   Step 1. Generate random data sets (NReps of them in all)
#   Step 2. Compare the random with observed
#           (Record a Yes if the random data set is more
#           extreme than the observed data.)
#   Step 3. Estimate the p-value as #Yes/NReps
#
# For this particular class of hypothesis tests, each data set
# is a vector, and our null hypothesis is that all permuta-
# tions are equally likely.
#
# Also for this particular class of hypothesis tests, we compare
# data sets by adding the values of the chosen elements of
# the data vector.  (So, for example, we could carry out Fisher's
# exact test on the Martin data by taking our data vector to be
# (0,0,0,0,0,1,1,1,1,1) and the chosen elements to be those in
# positions 7, 8, and 10.  Here 1 means 50 or older, 0 means < 50.)

# The function "generate.random.0" generates a random data set by
# creating a random permutation of the data vector.

generate.random.0 <- function(data.vector){
    random.data <- sample(data.vector)
    return(random.data)
}

# The function "compare.0" compares two data vectors, observed.data
# and random.data, and returns either
#   1, if the sum of the chosen elements of random.data is >=
#      the sum of the corresponding elements of observed.data,
# or else
#   0, if not.

compare.0 <- function(observed.data, random.data, chosen){
    extreme <- (sum(random.data[chosen]) >= sum(observed.data[chosen]))
    return(extreme)
}

# The following function computes a p-value as described above.

p.value.0 <- function(NReps, observed.data, chosen){
    NYes <- 0
    for (i in 1:NReps){
        random.data <- generate.random.0(observed.data)
        NYes <- NYes + compare.0(observed.data, random.data, chosen)
    }
    estimate <- NYes/NReps
    return(estimate)
}


martin.ages <- c(25,33,35,38,48,55,55,55,56,64)
martin.ranks <- rank(martin.ages)
martin.older <- (martin.ages >= 50)
martin.ages
martin.ranks
martin.older
chosen <- c(7,8,10)
p.value.0(1000, martin.older, chosen)
p.value.0(1000, martin.ranks, chosen)
p.value.0(1000, martin.ages, chosen)


############################################################
#                                                          #
#    GENERATING RANDOM DATA VECTORS VIA PAIRWISE SWAPS     #
#                                                          #
############################################################
# Building a function to make random pairwise swaps on a vector.
# The next several lines show you, a line at a time, the code used
# in a function that takes a data vector as input, chooses a pair
# of subscripts at random, and swaps the corresponding elements
# of the vector.

data.vector <- 1:5
data.vector
N <- length(data.vector)
N
reverse <- c(2,1)
reverse
random.pair <- sample(1:N, 2, replace=F)
random.pair
random.pair[reverse]
data.vector[random.pair] <- data.vector[random.pair[reverse]]
data.vector

# The next several lines combine the previous lines into a function
# "swap" that does the same thing.

data.vector <- 1:8

swap <- function(data.vector){
    N <- length(data.vector)
    reverse <- c(2,1)
    random.pair <- sample(1:N, 2, replace=F)
    data.vector[random.pair] <- data.vector[random.pair[reverse]]
    return(data.vector)
}

data.vector <- swap(data.vector)
data.vector

# A question:  Suppose you start with the integers 1, 2, ..., N
# in a data vector, say observed.data <- 1:8.
# How many random swaps does it take to turn the ordered vector
# into one that is completely random?
#
# The lines below show you the results of 10 consecutive random
# swaps, starting from 1, 2, ..., 8.  (For each line in the
# result, identify which two elements were swapped.)  At what
# point would you say the vectors have become random?

data.vector <- 1:8
for (i in 1:10){
    data.vector <- swap(data.vector)
    print(data.vector)
}

# Notice that you can't decide whether a vector is random just
# by looking at it.  "Random" refers not to individual vectors
# but to the process that generates them.  A vector is random


# if it was generated by a process that gives equal probability
# to all possible permutations.  If you start with 1, 2, ..., 8
# and make only one random swap, your process is not random in
# that not all permutations are equally likely.  In fact, only
# 8x7/2 = 28 permutations can be reached in a single swap, so the
# vast majority have probability 0 if you make only one swap.
# How many swaps does it take to ensure that all possible vectors
# have the same chance?
#
# The following function takes a starting vector and a number NSwaps
# and carries out that many random swaps on the vector:

generate.random.1 <- function(data.vector, NSwaps){
    random.data <- data.vector
    for (i in 1:NSwaps){
        random.data <- swap(random.data)
    }
    return(random.data)
}


############################################################
#                                                          #
#      ESTIMATING p-VALUES USING RANDOMIZATION, II         #
#              RANDOM PAIRWISE SWAPS                       #
#                                                          #
############################################################
# The following function swaps two random elements of a vector.
# (Note that the pair is sampled with replace=T, so the two
# positions can coincide, in which case the vector is unchanged.)

swap.1 <- function(data.vector){
    N <- length(data.vector)
    reverse <- c(2,1)
    random.pair <- sample(1:N, 2, replace=T)
    data.vector[random.pair] <- data.vector[random.pair[reverse]]
    return(data.vector)
}

# The next function generates one random permutation by repeated
# pairwise swaps.  (NSwaps is the number of swaps.)

generate.random.1 <- function(data.vector, NSwaps){
    random.data <- data.vector
    for (i in 1:NSwaps){
        random.data <- swap.1(random.data)
    }
    return(random.data)
}

# The next function compares two data vectors, observed.data and
# random.data, and returns either
#   1, if the sum of the chosen elements of random.data is >=
#      the sum of the corresponding elements of observed.data,
# or else
#   0, if not.

compare.1 <- function(observed.data, random.data, chosen){
    extreme <- (sum(random.data[chosen]) >= sum(observed.data[chosen]))
    return(extreme)
}

# The following function computes a p-value.

p.value.1 <- function(NReps, NSwaps, observed.data, chosen){
    random.data <- observed.data
    NYes <- 0
    for (i in 1:NReps){
        random.data <- generate.random.1(random.data, NSwaps)
        NYes <- NYes + compare.1(observed.data, random.data, chosen)
    }
    estimate <- NYes/NReps
    return(estimate)
}

martin.ages <- c(25,33,35,38,48,55,55,55,56,64)
martin.ranks <- rank(martin.ages)
martin.older <- (martin.ages >= 50)
martin.ages
martin.ranks
martin.older
chosen <- c(7,8,10)


p.value.1(1000, 7, martin.older, chosen)
p.value.1(1000, 7, martin.ranks, chosen)
p.value.1(1000, 7, martin.ages, chosen)


############################################################
#                                                          #
#     ESTIMATING p-VALUES USING RANDOMIZATION, III         #
#                    SUBSET SWAPS                          #
#                                                          #
############################################################
# This time the randomization in Step 1 works with subsets:
# the subset "chosen" trades one of its elements for one of
# the elements it does not contain.

chosen <- c(7, 8, 10)
data.vector <- c(25,33,35,38,48,55,55,55,56,64)
not.chosen <- (1:length(data.vector))[-chosen]
not.chosen
i <- sample(chosen, 1)
j <- sample(not.chosen, 1)
i
j
data.vector[c(i,j)] <- data.vector[c(j,i)]
data.vector

swap.2 <- function(data.vector, chosen){
    not.chosen <- (1:length(data.vector))[-chosen]
    i <- sample(chosen, 1)
    j <- sample(not.chosen, 1)
    data.vector[c(i,j)] <- data.vector[c(j,i)]
    return(data.vector)
}

generate.random.2 <- function(data.vector, chosen, NSwaps){
    random.data <- data.vector
    for (i in 1:NSwaps){
        random.data <- swap.2(random.data, chosen)
    }
    return(random.data)
}

compare.1 <- function(observed.data, random.data, chosen){
    extreme <- (sum(random.data[chosen]) >= sum(observed.data[chosen]))
    return(extreme)
}

p.value.2 <- function(NReps, NSwaps, observed.data, chosen){
    random.data <- observed.data
    NYes <- 0
    for (i in 1:NReps){
        random.data <- generate.random.2(random.data, chosen, NSwaps)
        NYes <- NYes + compare.1(observed.data, random.data, chosen)
    }
    estimate <- NYes/NReps
    return(estimate)
}

martin.ages <- c(25,33,35,38,48,55,55,55,56,64)
martin.ranks <- rank(martin.ages)
martin.older <- (martin.ages >= 50)
chosen <- c(7,8,10)
p.value.2(1000, 7, martin.older, chosen)
p.value.2(1000, 7, martin.ranks, chosen)
p.value.2(1000, 7, martin.ages, chosen)


Solutions to Exercises (33) – (37).

DataMatrix <- matrix(c(1,0,1,0,1,0,1,0,1), 3, 3, byrow=T)   # a 3x3 example matrix
DataMatrix

RandRows <- function(Matrix){
    return(sample(1:dim(Matrix)[1], 2, replace=F))
}
RandRows(DataMatrix)

RandCols <- function(Matrix){
    return(sample(1:dim(Matrix)[2], 2, replace=F))
}
RandCols(DataMatrix)

SubMat <- matrix(1:4, 2, 2, byrow=T)
SubMat
NewMat <- matrix(0, 2, 2)
NewMat
NewMat[1:2,] <- SubMat[2:1,]
NewMat

determ <- function(A){
    return(A[1,1]*A[2,2] - A[1,2]*A[2,1])
}
determ(NewMat)

CheckSwap <- function(Matrix){
    # Returns T if Matrix is swappable, F if not.
    if (abs(determ(Matrix)) == 1 && sum(Matrix[1,]) == sum(Matrix[2,]))
        return(T)
    else return(F)
}
CheckSwap(NewMat)

swap <- function(SubMat){
    # Return the 2x2 submatrix with its rows reversed.
    NewMat <- SubMat[2:1,]
    return(NewMat)
}
NewMat

ConSim <- function(DataMatrix, NSwaps){
    RandMat <- DataMatrix
    nSwaps <- 0
    while (nSwaps < NSwaps) {
        Rows <- RandRows(RandMat)
        Cols <- RandCols(RandMat)
        if (CheckSwap(RandMat[Rows,Cols])){
            nSwaps <- nSwaps + 1
            RandMat[Rows,Cols] <- swap(RandMat[Rows,Cols])
        }
    }
    return(RandMat)
}
ConSim(DataMatrix, 1000)