THE UNIVERSITY OF CALIFORNIA

SANTA CRUZ

BIASED CLUSTERED SUBSTITUTIONS IN THE HUMAN GENOME:

SEX, GAMBLING AND NON-DARWINIAN EVOLUTION

A thesis submitted in partial satisfaction

of the requirements for the degree of

MASTER OF SCIENCE

in

BIOINFORMATICS

by

Timothy R. Dreszer

December 2006

The Thesis of Timothy R. Dreszer is approved:

Professor David Haussler, Chair

Professor Harry Noller

Professor Joshua Stuart

Lisa C. Sloan, Vice Provost and Dean of Graduate Studies

Copyright © by Timothy R. Dreszer

2006

Table of Contents

Table of Figures

ABSTRACT

Acknowledgements

1.0 Introduction

1.1 Surprising Bias

1.2 Large Scale Bias in G+C?

1.3 Mutation vs. Natural Selection

1.4 Non-Darwinian Selection?

1.5 Placing Bets

1.6 Charting the Goals of this Thesis

2.0 Methods

2.1 Datasets

2.1.1 Preparation of Two Sets of Single Base Pair “Differences”

2.2 Three Lenses to View the Secrets of Bias

2.2.1 The Window Method

2.2.2 “Biased Clustered Substitutions” or How Filtering by Nearest Neighbors Reveals UBCS

2.2.3 Finding Regions of High Density of Bias

2.3 Searching for a Relationship

2.3.1 Clustering

2.3.2 G+C Content

2.3.3 Conservation Score

2.3.4 Telomeric Regions

2.3.5 Recombination Hot Spot Location

2.3.6 Recombination Rates

2.3.7 Transcription Density and Transcription Evidence

2.4 Statistical Tools and Visual Aids

2.4.1 Window Based Statistics

2.4.2 Analyzing UBCS with Zippers and Maps

2.5 Dating the Fusion of Chromosome 2

3.0 Results

3.1 Bias as a Social Disease

3.1.1 Documenting Gang Behavior

3.1.2 Biased Groups are Recruited from Unbiased Individuals

3.2 Focusing on Bias through the Window Lens

3.2.1 Conservative Bias?

3.2.2 Do the Strong Convert the Weak?

3.2.3 Bias at the Hot Spots and on the Edge of Town

3.3 Geographic Distribution of Biased Groups

3.3.1 Near Universal Pattern of Bias Leaves Evidence of a Fusion

3.3.2 Predictable Males and Enigmatic Females

3.3.3 Are Humans More Biased than Chimpanzees?

3.3.4 Biased Without a Cause or Are Boys Troublemakers?

3.3.5 Following the Footprints of Past Recombinations

3.3.6 Seeing Ghosts?

3.4 Humans have been Molded by Bias

3.4.1 Fastest Evolving Region of the Human Genome

3.4.2 Serotonin Receptor Knocked Out in Humans and Chimps

3.4.3 Mistakes Were Made

3.4.4 Biased Clusters Are Transcribed

3.4.5 Currently Bias May be Leading to Thrill Seeking and Disease

4.0 Discussion

4.1 Unexplained Selection

4.2 Just like Us

4.3 Two Chromosomes Come Together on a Date

4.4 The X-Exception: Are Men Really to Blame?

4.5 The Gamble of Male Meiosis

4.6 A Thumb on the Scales of Evolution

5.0 Conclusion

Bibliography

Table of Figures

Figure 1. Biased Gene Conversion

Table 1. Substitution Totals for Human Genome

Figure 2. The Empirical Probability of Bias Due to Substitution Count

Figure 3. “Zipper Plots”: Bias for Clusters of N Substitutions

Figure 4. Bias for Substitutions within N bases

Figure 5. Cluster Bias Heat Map

Figure 6. Weak to Strong Bias in Single Nucleotide Polymorphisms

Figure 7. Bias as a Function of Conservation Score

Figure 8. Empirical Bias as a function of G+C Content

Figure 9. G+C Content Affects Clusters of Substitutions

Figure 10. Conditional Empirical Probability of Bias by Substitution Count

Figure 11. Hot Spots are Slightly More Biased

Figure 12. Bias at Sub-telomeric Regions

Figure 13. Mapping Chromosome 18

Figure 14. “Winged Maps”: Unexpected Bias is Predictable

Figure 15. Exceptions to the Pattern of Unexpected Biased Substitutions

Figure 16. Zipper Plots of Four Chromosomes

Figure 17. Heat Maps of Bias for Four Chromosomes

Figure 18. Biased Clustered Substitutions in the Chimpanzee Genome

Figure 19. UBCS Profile is Similar between Humans and Chimps

Figure 20. The “X Exception” in Chimpanzees

Figure 21. Correlations of Biased Clustering on Chromosome 18

Figure 22. Correlations of UBCS Genome Wide

Figure 23. Mapping UBCS, G+C Content and Recombination Rates on Chr2

Figure 24. Effects of other Factors beyond Male Recombination Rates

Table 2. The Top Regions of Biased Clustered Substitutions in Humans

Table 3. Predicted Changes to HTR5B Due to Point Substitutions

Table 4. Evidence of Transcription Among Biased Regions

Table 5. Most Biased Clusters of SNPs found in the Human Genome

Biased Clustered Substitutions in the Human Genome:

Sex, Gambling and Non-Darwinian Evolution

Timothy R. Dreszer

ABSTRACT

After the discovery that the fastest evolving regions of the human genome show a

“bias” in point substitutions from weak to strong base pairs, this study was

undertaken to characterize patterns of bias, genome wide. Using windows of 100bp,

elevated bias was found for clusters of 5-11 substitutions. Conservation and location

near recombination hotspots were poor predictors of bias, while local G+C content

and sub-telomere location were mildly predictive. Using a nearest neighbor analysis,

bias was shown to occur in clusters of 5 or more substitutions and peak when they are

within 80bp of each other. No biased clustering was found in SNPs, suggesting that

biased substitutions were selected from mutations. Unexpected biased clustered

substitutions (UBCS) were mapped across the human and chimp genomes. This

revealed a universal pattern of elevated bias near the telomeres of all autosomes but

not the sex chromosomes. Human and chimp cousin chromosomes show a

remarkable similarity in the shape and magnitude of their respective UBCS maps,

suggesting a relatively stable force leads to clustered bias. The strongly telomeric

signal may offer an explanation for the evolution of isochores. Additionally,

chromosome 2 shows a UBCS peak mid-chromosome, which maps to the fusion site

of two ancestral chromosomes. This may provide evidence that the fusion occurred as

recently as 0.93 MYA. UBCS is most closely correlated with male recombination

rates, which explains the lack of UBCS signal on chromosome X. Female

recombination rates are unrelated to the residual UBCS signal unexplained by male

recombination. Conservation score and transcription density are also unrelated to

residual UBCS, but local G+C content is. Finally, the most highly biased regions in

the human genome are more likely to be transcribed than chance predicts, and show

specific evidence of UBCS affecting the evolution of humans. Taken together, this

genome wide analysis provides evidence that Biased Gene Conversion is the most

likely cause of the biased clustered substitution pattern found in humans. It is possible

that BGC is a male reproductive strategy that behaves like a neutral selection

pressure, increasing rates of genetic drift and accelerating evolution overall.

Acknowledgements

Katie Pollard has inspired me with her work on fastest evolving regions of the human

genome and has patiently explained to me many things that I was too dense to

comprehend. Daryl Thomas provided the “Chimp Fixed Differences” dataset of

substitutions and the HapMap dataset of SNPs, both aligned to out-group species, and

has also offered patient guidance. Jim Kent has written a large collection of source

code that I was able to call upon for this project, and prepared the “Chimp Simple

Differences” dataset that started it all. I am forever grateful to David Haussler who

has offered me a chance to work in his lab and inspired me to seek the answers to his

never ending stream of questions. Finally, my wife, Lena, supported me in every

way; my son, Taras, was my sounding board from the start; and my daughter,

Natalya, kept me from finishing before the best discoveries were made.

1.0 Introduction

With the publishing of the Chimpanzee Genome[1], detailed analysis of the genetic

differences between humans and chimps can begin. This effort will no doubt lead to

the discovery of many of the determinants that make us uniquely human. However,

the availability of a pair of very closely related genomes offers

an opportunity to pursue even more fundamental goals than this. While our species

may legitimately lay claim to uniqueness among earth’s life forms, all available

evidence points to the overwhelming similarity of our genetics. Therefore, a study of

the differences between the human and chimp genomes allows us to glimpse the

forces that shape genomes over time. Thus, the possibility arises to illuminate some

of the mechanisms of evolution and characterize the forces of life. While this thesis

cannot answer such grand questions, it attempts to characterize a little understood

evolutionary pressure, which is distinct from natural selection.

1.1 Surprising Bias

Previous work in this lab was carried out in pursuit of the fastest evolving regions of

the human genome.[2, 3] In the process, a surprising characteristic of the top scoring

regions was uncovered. Single base substitutions were dramatically biased from

weak to strong pairing bases. For instance, in the top four fastest evolving regions,

there were 33 cases of an AT pair being replaced by a GC pair, but only one case of a

GC being replaced by an AT. Thus bases which pair with two hydrogen bonds

(“weak”) were replaced by bases that pair with three (“strong”). This is even more

surprising given that an overall strong to weak mutation bias has been well documented.[4, 5,

6] Since the process of natural selection should “fix” randomly occurring mutations

into the genome based upon the relative fitness of the mutation, this result suggests

that in these particular cases, the stronger base pairs provide greater fitness.

However, since this “bias” from weak to strong is found in all of the fastest evolving

regions, it suggests that there may be more to this story than natural selection acting

upon individual mutations.

1.2 Large Scale Bias in G+C?

For many years it has been clear that the proportion of G+C in a mammalian genome

can vary widely.[7, 8] Certain areas of the warm-blooded vertebrate genome greater

than 300 kilobases have strikingly greater or lesser proportions of G+C than

surrounding areas.[9] These areas have been dubbed “isochores” and have been

discussed widely, though the reason they exist is still under debate. While it is true

that evolutionarily conserved regions tend to have higher G+C content, it is notable

that isochores stretch across conserved and non-conserved regions of the genome. It

is also apparent that the G+C content of genes is correlated with the G+C content of

the isochores within which they are found.[10] Many of the genes in high G+C

isochores have homologs in organisms with no isochores, such as zebrafish; yet the

homologs show significantly less G+C content. This suggests that the motive force

that has generated isochores has applied pressure to the genes that are found within.

On the other hand, one study shows that the G+C content of genes in isochores is

greater than surrounding regions[11], suggesting that isochores have been pulled along

by the G+C requirements of the genes within them. The force that acts to create or

maintain G+C isochores may be the same force that has biased substitutions in recent

human evolution. But is this force natural selection or something distinct?

1.3 Mutation vs. Natural Selection

While the motive force behind biased substitutions in recent human evolution may

not be the same as that for the evolution of isochores, the theories of isochore

formation do parallel possible explanations for recent bias. Three main theories[12]

have arisen to explain the existence of isochores. The first involves variation in

mutation rates[13, 14] in different areas of a genome. This theory suggests that the

initial mutations are not random, at least in the proportion of G+C, and that the bias in

substitution rates really reflects the proportion of mutations available for fixing.

While this theory does not preclude selection occurring, it suggests that where there is

no selection, bias will still arise. If a mutation bias existed, it might easily explain

recent bias in humans. The most obvious contrasting model is that natural selection[9]

has driven the formation of isochores. While hard to disprove, there is not much

evidence to suggest why natural selection would be acting at the level of hundreds

and even thousands of kilobases, and the existence of isochores in relatively

unconserved regions stands in contrast. The most frequently mentioned advantage to

regions of higher or lower G+C content is the relative thermal stability of G+C DNA

and this seems to correlate with isochores being found in warm-blooded vertebrates.

However, there is no correlation between G+C content and optimal growth

temperature in bacteria, which shows that selection based upon thermal stability does

not appear to be happening in bacteria.[15] Nevertheless, while natural selection based

upon relative fitness may not ultimately explain isochore formation, it may indeed

explain the bias found in recent human evolution. The search for the fastest evolving

regions of the human genome, which contained surprising bias, was specifically a

search for highly conserved, and therefore evolutionarily significant regions. Thus,

natural selection might very well be the cause of bias in these cases. Indeed,

ubiquitously expressed “housekeeping” genes have been correlated with higher G+C

content[16], though the strength of this correlation has been disputed.[17] More to the

point, however, is the finding that increased G+C content alone leads directly to

increased gene expression.[18] This increase was not due to increased translation

efficiency, or mRNA stability, but transcription or pre-mRNA processing. Thus,

selection for increased expression may certainly be the cause of the substitution bias

in rapidly evolving regions.

1.4 Non-Darwinian Selection?

A third theory involves Biased Gene Conversion (BGC).[19] BGC is the result of a

DNA repair mechanism which fixes base mismatches and is biased in favor of G and

C.[20, 21] The model of BGC producing biased substitution is that the biased repair

acts upon single nucleotide polymorphisms (SNPs). In recombination, the strands

from two homologous sister chromosomes will form a heteroduplex. Any

mismatched bases in the heteroduplex (i.e. resulting from two alleles of the same

gene) may be “repaired” and bias can be introduced as in Figure 1. BGC occurs in

one individual during a recombination event, but recombination events are known to

occur at hotspots[22] which are shared by individuals within a species. While

recombination hotspots change over time[23, 24], it is entirely plausible that a

significant number of individuals will have recombination events at the same location

or in close proximity within the genome. If these recombination events result in

biased “selection” of SNPs, then Biased Gene Conversion will act as a selection

pressure, distinct from natural selection. As recombination hotspots move over time,

BGC acting upon SNPs may create and maintain G+C rich isochores. Evidence in

favor of BGC being the force that gives rise to isochores can be found in the

correlation of recombination rates and G+C content.[19, 25] Evidence of the correlation

of BGC with recombination has been found in organisms as diverse as humans,

rodents, birds, worms, insects, plants and fungi.[26]

[Figure 1 image: “Biased Gene Conversion: Mismatched SNP Repair During Recombination”. A heteroduplex sequence diagram with two mismatched positions; in-figure note: both mismatches are converted to “strong” G-C pairs, replacing “weak” SNPs.]

Figure 1. Biased Gene Conversion occurs when recombination brings a section of two sister chromosomes together in a heteroduplex. If the region contains SNP alleles, the mismatched bases will be subject to a “repair”. Since both alleles are valid, there is no “correct” base to be preferred. The repair process favors G-C pairs over A-T pairs. The result is “weak to strong” bias acting as a selection pressure on SNPs. When a cluster of SNPs is affected, the resulting cluster may differ from either parent, creating a novel genotype.

Again, even if BGC has nothing to do with the creation or maintenance of isochores,

it may be the cause of the biased substitutions found in the fastest evolving regions of

the human genome, and in the bias found in this work. Perhaps the most significant

effect of BGC will occur when clusters of SNPs are converted into biased clusters of

SNPs. Not only will “strong” base pairs be preferred over “weak” pairs, but the

resulting biased cluster of mutations may be a novel genotype not found in either

parent chromosome. Natural selection can be expected to overpower BGC pressure if

a BGC created allele results in significantly less fitness. However, in the absence of

strong selective advantage or disadvantage, BGC would theoretically lead to the

fixation of higher G+C. Stepping back a moment from the issue of isochores or

recent human bias, the implication of BGC, if it does create a force resembling

selection pressure, is that the evolution of species is not merely shaped by natural

selection as Darwin described it. In the absence of positive or negative selection,

directional evolution will still occur. In particular, genetic drift of isolated

populations should be sped along by biased gene conversion.[27] And when BGC

selection is combined with natural selection, the result should be faster evolution, and

in some cases, evolution without selective advantage.

1.5 Placing Bets

The three models of bias give rise to different predictions of SNP and fixed

substitutions. If bias is the result of an underlying mutation rate, then a bias should be

seen in SNPs, but if higher G+C is the result of a selection pressure, there should be a

more pronounced bias in substitutions than in SNPs. Distinguishing between natural

selection and “BGC selection” is a bit trickier. It can be predicted that if isochores

are due to natural selection then substitution bias should be strikingly different

between areas of low and high G+C content. However, the BGC model makes a

similar prediction: that areas of high recombination should show greater substitution

bias, and recombination and G+C content have been shown to be correlated.[19]

Another possible distinction is in the size of an area which undergoes bias. Since

isochores span hundreds of kilobases, the natural selection model would predict that

substitution bias should occur relatively uniformly within an isochore. BGC, on the

other hand, would predict much more localized substitution bias, since the bias should

be due to a repair process occurring during the transient state of the heteroduplex

formed during recombination.[28] That the local bias of the BGC model would give

rise to large isochores is due to the size of recombination hotspots and to the tendency

of the hotspots to shift over time. Thus, the two models of isochore selection should

result in predictably different patterns of bias seen in the recently fixed substitutions

in a species.

It is also possible that recent human bias is not the result of the same process that

has given rise to isochores. Even so, the two models of selection are still competing

explanations of this recent phenomenon. If natural selection were the cause of recent

human bias, then that bias should be more strongly correlated to conserved areas of

the human genome. However, if BGC is the cause, then bias should be more strongly

correlated to recombination rates, and show much less correlation with conserved

areas.

1.6 Charting the Goals of this Thesis

The availability of the high quality sequence of the human genome and the more

recently published chimpanzee genome provides an opportunity to distinguish

between the two models of selection as explanations for recent human bias.

Additionally, the availability of a large set of SNP data for the human genome allows

the evaluation of the mutation bias model as well. Therefore, this effort characterizes

weak to strong biased substitutions across the entire human genome since it diverged

from the chimpanzee line. An explanation for the cause of any bias found has been

sought in underlying mutation rate, natural selection based upon fitness and selection

due to biased gene conversion. An additional goal was to map the location within

the genome of biased substitutions which have occurred in the last six million years

of human evolution. While such a map of bias is interesting in what it reveals about

humans, it has proven to illuminate a fundamental force shaping the evolution of not

only ourselves, but no doubt a vast number of other species as well.

2.0 Methods

All research was undertaken using the genome assemblies of several species, freely

available from the UCSC Genome Browser.[29, 30, 31] Organization, analysis,

calculations, and plotting were performed using either the programming language C

or the statistical package R.[32]

2.1 Datasets

The following datasets were used for this research:

1. Unless otherwise stated, all research was based upon the sequence and locations

found in the May 2004 assembly of the human genome[29, 30, 31] (referred to as

hg17). As will be noted in the text, earlier work used the July 2003 release

(hg16), and more recent work used the March 2006 assembly (hg18).

2. Unless noted otherwise, all research was based upon the alignment to hg17 of the

November 2003 assembly of the chimpanzee (Pan troglodytes) genome[1]

(referred to as pt1). Earlier work made use of a prerelease assembly (pt0), and the

most recent work used the January 2006 assembly (pt2).

3. In order to determine whether a given substitution occurred on the human or

chimp line, an aligned out-group genome was needed. Unless otherwise noted,

the January 2005 pre-release assembly of the Rhesus macaque (Macaca mulatta)

genome[33] (rh0) was used. Earlier work involved both the March 2005 assembly

of the mouse (Mus musculus) genome[34] (mm6) and the June 2003 assembly of

the rat (Rattus norvegicus) genome[35] (rn3) as out-groups, while the most recent

work made use of the January 2006 assembly of macaque (rh1).

4. Analysis of bias in single nucleotide polymorphisms (SNPs) was undertaken

using the International HapMap Project’s October 2005 release of the haplotype map

for humans.[36]

5. Recombination hot spots were located using the September 2005 release of

HapMap Phase I data from the International HapMap Project.[36]

6. Recombination rates were provided by the deCODE genetic map[37] based upon

1,257 meiotic events.

7. Designation of genes was taken from the “Known Genes” track of the UCSC

browser and was compiled using protein data from UniProt[38] and mRNA data

from NCBI.[39, 40] Designation of mRNAs was taken from the UCSC browser

“Human mRNA” track and expressed sequence tags from the UCSC

“Human EST” track; sources for both were international public sequence

databases.[41]

2.1.1 Preparation of Two Sets of Single Base Pair “Differences”

The above genome sequences were prepared into two distinct datasets: fixed

substitutions and SNPs. Since many of the exact same analyses were performed

separately upon the two types of single base pair changes in the human genome, the

term “differences” is used to refer generically to both types. Preparation of the fixed

substitutions dataset involved the creation of a set of single nucleotide differences

between human and chimp in regions of high quality chimp sequence (prepared by

Jim Kent) and the inclusion of high quality macaque bases and the lifting of genome

locations to a common assembly (prepared by Daryl Thomas). The SNP dataset was

also combined with corresponding chimp and macaque bases and lifted to hg17

locations (by Daryl Thomas). The two resulting bed files (substitutions and SNPs)

were then converted into pairs of arrays containing location and base change.

Twenty-four pairs of arrays (one for each chromosome) allowed rapid location of

base changes using a binary search algorithm. A base change was reduced to a single

8 bit value which distinguishes the following attributes:

1. “Direction of Change”, or the “from” and “to” bases among the four base possibilities,

resulting in 12 combinations. By choosing the proper values for the direction, it

was possible to resolve weak to strong and strong to weak with a simple bit mask

(e.g., AtoC: 0001, AtoG: 0011, TtoC: 0101, TtoG: 0111; WtoS_MASK = binary 0001).

2. “Ancestry” or the concept of which line the base difference arose in. Obviously

for SNPs, this is always the human line. However, the concept is still needed to

establish the direction of mutation between two alleles.
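
The storage scheme described above can be illustrated with a minimal sketch. It is written in Python rather than the C used for this work, and it is not the original implementation. Only the four weak to strong codes and WTOS_MASK are taken from the text; the strong to weak codes, the remaining codes, STOW_MASK, the two-bit ancestry field and all function names are hypothetical choices made for illustration.

```python
from bisect import bisect_left

# Only the four weak-to-strong codes and WTOS_MASK come from the text.
# Every other constant in this sketch is a hypothetical choice.
WTOS_MASK = 0b0001
STOW_MASK = 0b1000  # hypothetical

DIRECTION = {
    ("A", "C"): 0b0001, ("A", "G"): 0b0011,  # weak to strong (from the text)
    ("T", "C"): 0b0101, ("T", "G"): 0b0111,
    ("C", "A"): 0b1000, ("C", "T"): 0b1010,  # strong to weak (hypothetical)
    ("G", "A"): 0b1100, ("G", "T"): 0b1110,
    ("A", "T"): 0x10, ("T", "A"): 0x12,      # neither direction (hypothetical)
    ("C", "G"): 0x14, ("G", "C"): 0x16,
}

# Hypothetical ancestry field kept in the top two bits of the 8 bit value.
ANCESTRY_SHIFT = 6
INDETERMINATE, HUMAN_DERIVED, CHIMP_DERIVED = 0, 1, 2


def encode(from_base, to_base, ancestry):
    """Pack direction of change and ancestry into one value below 256."""
    return DIRECTION[(from_base, to_base)] | (ancestry << ANCESTRY_SHIFT)


def is_weak_to_strong(code):
    return bool(code & WTOS_MASK)


def is_strong_to_weak(code):
    return bool(code & STOW_MASK)


def ancestry_of(code):
    return code >> ANCESTRY_SHIFT


def changes_in_range(positions, codes, start, end):
    """Codes of base changes with start <= position < end.

    positions and codes are the parallel arrays for one chromosome;
    positions must be sorted ascending so binary search applies.
    """
    lo = bisect_left(positions, start)
    hi = bisect_left(positions, end)
    return codes[lo:hi]
```

Because the two arrays are parallel and sorted by location, a window or neighborhood query costs two binary searches followed by a slice, and each bias test is a single bitwise AND.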

In order to determine the likely ancestor base for the fixed substitution arrays, the

aligned out-group base was used. For the fixed substitutions dataset, the line (chimp

or human) that matched the macaque base was designated “ancestral”, and the other

“derived”. Otherwise, ancestry was “indeterminate”. Direction for the fixed

substitution dataset for hg17 was always stored as chimp to human. However, since

only substitutions derived in the human line were used in the hg17 analysis, this has

the result of direction always being from ancestor to descendant. More recent work

analyzing bias in the chimp genome used identical methods to create the arrays, but

used chimp locations and reversed the direction of the dataset masks to show

direction from human (hg18) to chimp (pt2).
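
The ancestry rule for fixed substitutions can be sketched as follows. This is an illustrative Python sketch, not the code used for this work; the function name, the string labels and the treatment of a missing macaque base as “indeterminate” are assumptions.

```python
def classify_fixed_difference(human, chimp, macaque):
    """Return the line in which a human/chimp difference is derived.

    human, chimp: aligned bases that differ from each other.
    macaque: the aligned out-group base, or None when unavailable
             (treating None as indeterminate is an assumption).
    """
    if human == chimp:
        raise ValueError("not a difference: human and chimp bases match")
    if macaque == chimp:
        # Chimp matches the out-group, so chimp is ancestral
        # and the change arose in the human line.
        return "human"
    if macaque == human:
        # Human matches the out-group, so the change arose in the chimp line.
        return "chimp"
    # The out-group matches neither line (or is missing).
    return "indeterminate"
```

For a substitution classified as derived in the human line, the stored chimp to human direction coincides with ancestor to descendant, which is why no reversal was needed for the hg17 analysis.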

Since this collection of “substitutions” between humans and chimps can be expected

to contain some number of human SNPs which have not been fixed, final processing

of the substitutions dataset involved subtracting any locations found in both the

substitution and SNP datasets. Of 28,937,901 high quality simple differences

between humans and chimps found in hg17, 24,795,278 remained after aligning with

Rhesus macaque and 23,916,284 remained after subtracting SNPs. Of these,

22,784,742 showed unambiguous ancestry, with 10,871,714 derived in humans and

another 11,913,028 derived in the chimp line (see Table 1 of results).
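
The SNP subtraction step and the resulting tallies can be checked with a short sketch. The counts are those reported above; the function and variable names are hypothetical, and the set-based subtraction is one simple way to perform the step, not necessarily the one used here.

```python
def subtract_snps(substitution_positions, snp_positions):
    """Drop any substitution whose location also appears in the SNP set.

    Both arguments hold locations on a single chromosome.
    """
    snps = set(snp_positions)
    return [p for p in substitution_positions if p not in snps]


# Counts reported in the text for hg17.
SIMPLE_DIFFERENCES = 28_937_901
ALIGNED_TO_MACAQUE = 24_795_278
AFTER_SNP_SUBTRACTION = 23_916_284
UNAMBIGUOUS_ANCESTRY = 22_784_742
DERIVED_IN_HUMAN = 10_871_714
DERIVED_IN_CHIMP = 11_913_028

# Differences lost at each filtering step.
lost_without_macaque = SIMPLE_DIFFERENCES - ALIGNED_TO_MACAQUE
removed_as_snps = ALIGNED_TO_MACAQUE - AFTER_SNP_SUBTRACTION
indeterminate = AFTER_SNP_SUBTRACTION - UNAMBIGUOUS_ANCESTRY

# The two derived totals account for every unambiguous substitution.
assert DERIVED_IN_HUMAN + DERIVED_IN_CHIMP == UNAMBIGUOUS_ANCESTRY
```

The intermediate differences show how many candidates each filter removed: those lacking a macaque alignment, those coinciding with known SNPs, and those with indeterminate ancestry.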

For the SNP dataset, an out-group was needed to determine the ancestral allele and

therefore the direction of change. Both Pan troglodytes and Rhesus macaque

alignments were used as out-groups. If only one out-group (ape or monkey) was

available, direction was determined if it matched one of the two human alleles, in

which case, that allele was declared ancestral. In the case where two out-groups were

available, they would both have to match the same human allele for direction to be

established. SNPs with indeterminate ancestry (and therefore without established

direction) were eliminated from the dataset. Of 3,874,080 SNPs in the hg17 dataset,

3,424,895 had a direction that could be determined.
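
The out-group rule for SNPs can be sketched in the same way. This is an illustrative Python sketch under the rule stated above; the function names and the use of None for a missing out-group base are assumptions.

```python
def ancestral_allele(allele_a, allele_b, chimp=None, macaque=None):
    """Return the ancestral human allele, or None if indeterminate.

    chimp, macaque: aligned out-group bases, or None when unavailable.
    With one out-group, it must match one of the two human alleles.
    With two out-groups, both must match the same human allele.
    """
    outgroups = [base for base in (chimp, macaque) if base is not None]
    if not outgroups:
        return None
    candidate = outgroups[0]
    if candidate not in (allele_a, allele_b):
        return None
    if any(base != candidate for base in outgroups):
        return None
    return candidate


def snp_direction(allele_a, allele_b, chimp=None, macaque=None):
    """Return (ancestral, derived), or None when ancestry is indeterminate."""
    ancestral = ancestral_allele(allele_a, allele_b, chimp, macaque)
    if ancestral is None:
        return None
    derived = allele_b if ancestral == allele_a else allele_a
    return (ancestral, derived)
```

A SNP for which snp_direction returns None corresponds to one eliminated from the dataset for lack of an established direction.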

It should be noted that the simple methods of determining ancestry can be expected to

result in false positives in some percentage of cases. First, we rely upon the accuracy

of the sequencing of each species. Next, we rely upon the accuracy of alignment

between species. And finally, the simple methods ignore the possibility that two

mutations at the same site among the three aligned species will result in erroneous

classification of a human derived substitution. However, the majority of this work

attempts to characterize a genome wide phenomenon and relies upon thousands and

even millions of differences in order to reveal a pattern, so the handful of false

positives should be overwhelmed by true positives. Additionally, there is no reason

to expect that inaccurate data would bias the results in a particular direction, though it could

be expected to dilute any pattern to be revealed. While this conclusion seems

reasonable when examining patterns in humans, genome wide, caution should be

taken in drawing conclusions about two types of analysis. First, when specific

regions are examined, fixed substitutions and SNPs should be recharacterized in order

to establish confidence. Second, when examining the fixed substitutions data from

the perspective of chimp evolution, the lower confidence in the chimpanzee sequence

should be considered. If a base pair is AT in human and macaque, but GC in chimp,

then the difference might be due to either a fixed substitution or a sequencing error in

chimps. For this reason, the majority of work concentrates upon characterizing

patterns found in the human genome.

2.2 Three Lenses to View the Secrets of Bias

Three separate methods were used to attempt to view the characteristics of bias across

the human genome. The first two methods attempt to characterize bias in terms of its

effects upon roughly similar objects across the whole genome, while the third method

attempts to locate the regions most affected by bias. Though methods and results are

presented as if the three analyses were performed sequentially, in reality each method

was altered somewhat based upon the results found in the other two.

While it should be relatively easy to count the biased changes in a set of 10 million

fixed substitutions, such an analysis would miss any patterns that involve multiple

substitutions located in close proximity. It is also easy to generate a set of clusters of

substitutions, but those clusters can be expected to have a range of sizes and densities

(12 substitutions within 86bp vs. 4 substitutions within 293bp). In order to

characterize bias in clusters in terms of a dataset of statistically similar objects, two

methods were used: windowing and filtering by nearest neighbors.

2.2.1 The Window Method

A simple windowing method was used in order to characterize bias systematically

across the whole genome. This method, which is capable of illuminating the

clustering of differences, does not assume that bias is related to clustering at all. The

entire genome was broken into windows of fixed length and fixed sliding or stepping

increment. The advantage of overlapping windows is that clustered differences are

less likely to go unrecognized due to splitting. Since this analysis used windows that

overlapped by half, a distinct disadvantage is that the vast majority of substitutions or

SNPs are counted twice. The result is that low density window counts are

approximately doubled while high density window counts may be somewhat less than

doubled. Nevertheless, clusters of differences should rarely escape detection. All

windows without a single base change were dropped from the analysis. For all results

discussed in this document (unless otherwise noted), a window of 100 stepping 50

was used to cover each of the chromosomes. An original analysis of windows of

300bp stepping 150 was performed, based upon the approximate mean size of a gap

subject to BGC due to a recombination event.[28] However, this original analysis

revealed the strong relationship between bias and density of substitutions. Further

analysis, discussed below, led to choosing windows of 100bp stepping 50 as a more

appropriate lens. Using 100/50 windowing should result in 61,535,590 possible

windows of the human genome (hg17). For the fixed substitutions dataset,

16,633,481 windows were discovered with at least one substitution in the human line

(Table 1 of results). Analysis of SNPs used windows of 300bp, stepping 150 for

20,511,865 possible windows, 1,900,453 of which contained at least one SNP of clear

ancestry. The actual stored window data consists of a location, a size, and the raw

counts of the 12 possible base changes. Additional fields for current G+C count,

conservation score, and whether the window is in a recombination hot spot or is

telomeric are included as described below.
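A minimal sketch of the half-overlapping windowing scheme follows. It assumes 0-based positions and windows beginning at coordinate 0; the function names and the tuple layout of a base change are illustrative choices, not the stored format used in this work.

```python
from collections import Counter, defaultdict


def window_starts(pos, size=100, step=50):
    """Yield the start of every window [start, start + size) that covers pos.

    Windows begin at 0, step, 2 * step, ... (0-based coordinates assumed).
    """
    start = (pos // step) * step  # last window starting at or before pos
    while start >= 0 and start > pos - size:
        yield start
        start -= step


def count_changes_by_window(changes, size=100, step=50):
    """Tally base changes per window.

    changes: iterable of (position, ancestral_base, derived_base).
    Returns {window_start: Counter({(ancestral, derived): count})}.
    Windows without a single base change never appear in the result.
    """
    windows = defaultdict(Counter)
    for pos, anc, der in changes:
        for start in window_starts(pos, size, step):
            windows[start][(anc, der)] += 1
    return dict(windows)
```

Because each position away from the chromosome start falls in two windows, every such change is counted twice, which is the double counting noted above.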

2.2.2 “Biased Clustered Substitutions”

or How Filtering by Nearest Neighbors Reveals UBCS

While the windowing method allows examining clustered and non-clustered

substitutions in a statistically neutral manner, it fails to adequately capture all

substitutions that might belong to a single cluster. Additionally, overlapping

windows, as explained above, overestimate low density windows as compared with

windows containing clusters of differences. Therefore, a second method of viewing

bias across the genome was developed. Having demonstrated with the windowing

method that bias is associated with clustering, this method targets clustering as the

most recognizable dimension of bias. Each individual substitution (or SNP) was

considered as to whether it belongs to a cluster based upon its nearest neighbors. This

allowed for the systematic description of bias for clusters of from 2 to 10 differences

within 20 to 600 base pairs. It should be clear, however, that this characterization is

fundamentally of individual differences and not clusters. For example, each

substitution was considered as belonging to a cluster of 7 substitutions by looking at

its absolute nearest 6 neighboring substitutions. If a substitution qualifies as

belonging to a cluster of 7 within 120 bases, its six neighbors which make up that

cluster may not qualify as part of the same cluster! For instance, a substitution at one

edge of that “7 in 120” cluster may actually qualify as belonging to a cluster of 7

within 80 bases, while a substitution near the other edge may be seen as belonging to an

entirely different cluster of 7. Thus, one substitution may find itself in a highly

biased cluster, while its very nearest neighbor is in a less dense cluster that is not

biased at all! However, on the whole, this method has proven beneficial in

characterizing the magnitude of bias as a function of both the number in a cluster and

width of a cluster. It was this analysis that led to the readjustment of the window size

from 300 to 100 bases in the window method described above.
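The nearest-neighbor view can be illustrated with a short sketch. It assumes a sorted list of substitution positions on one chromosome, breaks ties toward the left neighbor, and measures span inclusively; all three are assumptions of this example rather than details stated in the text.

```python
def nearest_neighbor_cluster(positions, i, n):
    """Index bounds (lo, hi) of the substitution at index i plus its n - 1 nearest neighbors.

    positions must be sorted. The cluster grows one neighbor at a time,
    always taking whichever adjacent substitution is closer to positions[i].
    """
    lo = hi = i
    for _ in range(n - 1):
        left_gap = positions[i] - positions[lo - 1] if lo > 0 else None
        right_gap = positions[hi + 1] - positions[i] if hi + 1 < len(positions) else None
        if left_gap is None and right_gap is None:
            raise ValueError("fewer than n substitutions available")
        if right_gap is None or (left_gap is not None and left_gap <= right_gap):
            lo -= 1
        else:
            hi += 1
    return lo, hi


def cluster_span(positions, i, n):
    """Width in bases (inclusive) of the n-substitution cluster around index i."""
    lo, hi = nearest_neighbor_cluster(positions, i, n)
    return positions[hi] - positions[lo] + 1
```

For positions [0, 10, 20, 30, 200, 210, 215] and clusters of 4, the substitution at 30 spans 31 bases while its immediate neighbor at 200 spans 186 bases, showing how adjacent substitutions can be assigned to quite different clusters.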

A second benefit of this view of substitutions and SNPs is that mapping of the

locations of bias across the genome could be undertaken as a simple histogram. By

defining some minimum threshold required to be considered a member of a cluster

and to be considered a member of a biased cluster, the dataset could be “filtered” into

subsets: “clustered substitutions” and “biased clustered substitutions.” In this

analysis, a cluster was taken to be “at least 5 differences within 300 bases” while a

biased cluster was considered to be “a cluster with at least 80% Weak to Strong”. To

be clear, a “biased clustered substitution” (BCS) must be a part of a cluster, and the

cluster itself must be biased, though the substitution itself need not be. Therefore, if a

substitution’s nearest 6 neighbors are within 260 bases and that set of substitutions

contains 6 weak to strong, then it is clearly a “biased clustered substitution”, while

another substitution may not qualify as belonging to a cluster if its nearest 4

neighbors span 306 bases or it may not qualify as belonging to a biased cluster if its

nearest 6 neighbors are within 280 bases but only 5 of that seven are weak to strong

changes. Using these definitions, it was possible to filter the substitutions dataset and

generate a set of substitutions that are clustered and another that are members of a

biased cluster.
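The two thresholds can be expressed as simple predicates. This is a sketch under the definitions given above (at least 5 differences within 300 bases, at least 80% weak to strong); the function and parameter names are hypothetical.

```python
def is_clustered(n_subs, span_bp, min_subs=5, max_span=300):
    """True if a substitution and its nearest neighbors qualify as a cluster."""
    return n_subs >= min_subs and span_bp <= max_span


def is_biased_cluster(n_subs, span_bp, n_weak_to_strong, min_percent=80):
    """True if the cluster qualifies and at least min_percent of it is weak to strong.

    Integer arithmetic avoids floating point rounding at the 80% boundary.
    """
    if not is_clustered(n_subs, span_bp):
        return False
    return n_weak_to_strong * 100 >= min_percent * n_subs


# The three cases described in the text:
print(is_biased_cluster(7, 260, 6))  # 6 of 7 weak to strong within 260 bases
print(is_clustered(5, 306))          # nearest 4 neighbors span 306 bases
print(is_biased_cluster(7, 280, 5))  # clustered, but only 5 of 7 weak to strong
```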

While the definition of a biased cluster of differences proves useful in illustrating the

location of weak to strong bias, it raises the question of how many biased clustered

substitutions can be expected by the null model. Expected biased clustered

substitutions (or SNPs) can be considered to be a function of both the probability of

clustering and the probability of weak to strong substitutions. An estimate of the null

model frequency of clustering is generated by considering clustering to be a Poisson

process which starts at each difference and uses the rate of substitutions for lambda to

calculate the probability of at least 5 substitutions within 300 bases. Likewise, an

estimate for the null model frequency of biased clustered substitutions could be

generated by starting with the estimated frequency of clustering and applying a

binomial probability that the differences will be at least 80% weak to strong.

However, the resulting estimate of the expected biased clustered substitution count

would mix both the phenomenon of clustering and the phenomenon of bias into the

equation, whereas this analysis attempts to characterize forces associated with weak

to strong bias independent of forces that lead to clustering of substitutions. Indeed, a

theoretical cause of biased clusters, BGC, should act upon existing clusters of SNPs.

For this reason, the estimate of expected biased clustered substitutions is here

generated using actual, not expected clusters. For example, given that a certain

region of the human genome has 200 substitutions in clusters, and given that 43% of

the substitutions in that region are weak to strong, the expected frequency of biased

clustered substitutions in that region is 200 times the cumulative binomial probability

that at least 4 of 5 substitutions will be biased. We can expect 22.4 biased clustered

substitutions in this region, according to the null model. Using this estimate, we can

calculate the amount of “unexpected biased clustered substitutions” (UBCS) as actual

BCS, minus expected BCS. While BCS is an actual count of substitutions, UBCS is a

calculated number which may be either positive or negative and would be zero in the

null model. A large scale analysis of the genome was undertaken by mapping the

distribution of substitutions (or SNPs), biased substitutions, clustered substitutions,

biased clustered substitutions and unexpected biased clustered substitutions across

each of the 24 chromosomes. Analysis of UBCS makes up the majority of this

research, and the resulting discoveries are most revealing.
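The worked example above (200 clustered substitutions, 43% weak to strong) can be reproduced with a few lines. The function names are illustrative; the cluster size of 5 and threshold of 4 follow the definition of a biased cluster used here.

```python
from math import comb


def binomial_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def expected_bcs(clustered_count, p_weak_to_strong, cluster_size=5, min_biased=4):
    """Null-model count of biased clustered substitutions among actual clusters."""
    return clustered_count * binomial_tail(min_biased, cluster_size, p_weak_to_strong)


def ubcs(actual_bcs, clustered_count, p_weak_to_strong):
    """Unexpected BCS: actual minus expected; may be positive or negative."""
    return actual_bcs - expected_bcs(clustered_count, p_weak_to_strong)


print(round(expected_bcs(200, 0.43), 1))  # 22.4
```

If the same hypothetical region held 30 actual BCS, the UBCS would be about 7.6; a region with fewer than 22.4 would yield a negative value.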

2.2.3 Finding Regions of High Density of Bias

The third method of characterizing bias attempts to find the regions in the human

genome with the most significant changes due to clustering of biased substitutions. It

involves generating a list of the longest clusters of substitutions within the human

genome which contain a minimum density of differences, then ranking the list to find

the most biased clusters. This analysis, originally conducted on hg16 but updated to

hg18, looks for clusters which contain at least 6 differences derived in humans with a

density of no less than one difference in 32 bases. These initial high density clusters

were extended out as long as the region maintained a density of 1 difference per 32

bases with no barren stretch longer than 96 bases. The patches were then carved

down to maximize a score of bias for each cluster in the list. The score or “P value”

used to rank these clusters was the cumulative binomial probability of the biased

substitutions to all substitutions in a cluster. By using a binomial instead of a Poisson

score, larger clusters are favored over shorter but denser clusters.
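A sketch of a cumulative binomial score of this kind is given below. The background probability of a weak to strong change is not specified in the text, so p_background is an assumed parameter, and the function names are hypothetical.

```python
from math import comb


def bias_score(n_biased, n_total, p_background):
    """P(X >= n_biased) for X ~ Binomial(n_total, p_background).

    Smaller values indicate a cluster less likely to arise without bias.
    """
    p = p_background
    return sum(
        comb(n_total, k) * p**k * (1 - p) ** (n_total - k)
        for k in range(n_biased, n_total + 1)
    )


def rank_clusters(clusters, p_background):
    """Sort (name, n_biased, n_total) tuples from most to least biased."""
    return sorted(clusters, key=lambda c: bias_score(c[1], c[2], p_background))
```

Under this score, a cluster of 16 substitutions that are all weak to strong ranks ahead of a cluster of 8 that are all weak to strong, regardless of how tightly packed either is, which illustrates why larger clusters are favored.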

While the methods described here for finding high density biased regions are perhaps

overly convoluted and non-intuitive, the purpose of this exercise was to locate some

of the regions of the human genome which have been most altered by the force that

has created biased clusters genome wide. In this, it has succeeded with interesting

results, as will be shown. It should be noted that this list was generated agnostic to

any other factor beyond density of biased differences. No measurement of

conservation was used to generate, filter or score the list of most biased regions in the

human genome.

However, after examining the top scoring regions, the list was filtered to remove self

alignments and repeats. While there can be some confidence in the quality of point

substitutions (or SNPs) themselves, the methods used to identify them rely upon

sequence alignments of three species. Any alignment errors should be overwhelmed

by successful alignments in genome-wide analysis, but may be more pernicious when

analyzing individual cases. Therefore, any region of bias with more than 10

references in the UCSC “self-alignment” track[42] or greater than 90% self alignment

score was eliminated. Self-alignments represent duplications in the human genome

and can result in cross-species misalignments. That said, a smaller number of self-

alignments might actually be expected in a family of closely related genes, and

therefore self alignments should not be eliminated entirely from the list. Clusters

which contained more than 50% repeat coverage in the RepeatMasker[43] track of the

UCSC Human Genome Browser were also eliminated. While the unfiltered list of top

scoring regions does identify interesting aspects of the most biased regions, the

filtered list does sharpen the focus further.

2.3 Searching for a Relationship

In an attempt to characterize weak to strong substitution bias, it is desirable to

determine if there is any relationship between patterns of bias and other key factors.

2.3.1 Clustering

As already mentioned, the clearest relationship between weak to strong bias and

another factor is the clustering of substitutions. All three methods of viewing the data

described above confirmed the importance of clustering. The search for regions of

high density of bias was predicated upon this relationship and filtering by nearest

neighbor was designed to most fully characterize this relationship.

2.3.2 G+C Content

Because the biased substitutions analyzed here are changing the G+C content of the

local sequence, it is only natural to ask if the G+C content of the local sequence is

influencing the accumulation of bias. Several factors might influence such a

relationship. First, the existence of isochores and the mystery of their origin raises the

question of whether they are currently increasing in bias. Second, any other force for

selection of G+C may be acting over time, and may be revealed by a relationship

between high G+C and new biased substitutions. However, even if the cause of

biased substitutions in humans has nothing to do with isochore or other selection

forces, it should still be expected that the tendency of weak to strong changes will be

affected by background G+C content. That is, if a region of DNA is already highly

G+C enriched, and several “random” substitution events occur, then the null model

would predict that there should be more strong to weak events than the opposite,

simply because there are more strong base pairs available to be changed to weak

ones. Therefore, background G+C content was considered in both the windowed

analysis and in mapping of clustered substitutions.

Each window was updated with the amount of G+C found in the human sequence.

When windows of 300bp were used, then the G+C content was simply a measure of

that 300 bases. However, when windows of 100bp were used, the G+C content was a

measure of the amount of G+C found in a window of 1000bp with the 100bp window

at its center. Windows in a bed file format were fed to Daryl Thomas’ hgGcPercent

program to generate a raw G+C count of bases which was then used to update the

original windows dataset. However, analysis was done with ancestral G+C count,

rather than the current count. The calculation of ancestral G+C count is simply the

current G+C count plus Strong to Weak changes and minus Weak to Strong changes.

The advantage of this calculation of ancestral G+C is simplicity. The disadvantage is

that the base changes for which no out-group species was available or direction could

not be determined are counted at their current G+C. While some distortion can be

expected from this method, the amount of distortion should be of little significance.

Average distortion can be expected to be less than 0.35% for substitutions and much

smaller still for SNPs. Additionally, there is no reason to believe the distortion would

be systematically biased to either G+C or A+T.
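The ancestral G+C calculation amounts to undoing the directed changes. A minimal sketch, with hypothetical function names and with measured_bases standing for the number of bases over which G+C was counted:

```python
def ancestral_gc_count(current_gc, strong_to_weak, weak_to_strong):
    """Current G+C count plus Strong to Weak changes, minus Weak to Strong changes."""
    return current_gc + strong_to_weak - weak_to_strong


def ancestral_gc_percent(current_gc, strong_to_weak, weak_to_strong, measured_bases):
    """Ancestral G+C as a percentage of the bases over which G+C was counted."""
    count = ancestral_gc_count(current_gc, strong_to_weak, weak_to_strong)
    return 100.0 * count / measured_bases
```

For instance, a span with 40 current G+C bases, 1 strong to weak change and 3 weak to strong changes had 38 G+C bases in the ancestor. Changes of undetermined direction are simply left at their current state, as described above.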

For calculating empirical probabilities of weak to strong using the window method,

ancestral G+C content allows a more sophisticated analysis of the relationship

between bias and G+C. The simple empirical probability of bias or P(bias) can be

seen as the number of weak to strong substitutions in a category divided by the total

number of substitutions in that category. This value can be adjusted for the expected

changes due simply to background G+C. Thus P(bias | ancestral G+C) is the count of

weak to strong differences divided by all ancestral weak bases.

For analysis based upon Biased Clustered Substitutions, a simple percentage of G+C

in a given bin size was used. Analysis by bins from 10,000 to 1 million base pairs

was used.

2.3.3 Conservation Score

Bias found in fixed substitutions may be the result of Darwinian selection. If this

were the case, biased substitutions might be more likely in regions of the genome

which are more highly conserved. In order to determine this, some measure of

evolutionary conservation is required. Conservation scoring was done using the

phastCons methods developed by Adam Siepel.[44] In particular, the original

“conservation” track of hg17 was used which was generated from the alignment of 8

species: human (hg17), chimp (pt1), mouse (mm5), rat (rn3), dog (cf1), chicken

(gg2), fugu (fr1), and zebrafish (dr1). This method generates a score for each aligned

base between 0 and 1. For biased clustered substitution analysis, the average scores

of all bases in bins from 10,000 to 1 million base pairs across each chromosome were

used as a measure of conservation.

In order for a window to receive a conservation score, at least 80% of the bases

covered by that window must have had a score. Bases without a phastCons score can

be considered as having very low conservation, since alignment was not possible for

these sites. However, rather than scoring these bases at zero conservation, they were

excluded from the analysis. The window score was the average of the individual base

scores. In particular, it was the sum of all scores divided by the count of bases with a

score, not the count of bases in the window. Since unscored bases are likely to be

unconserved, the exclusion of these bases can be expected to raise the conservation

score of windows with less than 100% coverage. It should be recognized that since

the window conservation score is an average of the scored bases it contains, the larger

the window size the more muddied this score becomes. All windows failing to meet

the 80% threshold received a zero score and were excluded from the conservation

analysis. Of 16,633,481 fixed substitution windows, 15,245,997 received a

conservation score, while only 1,387,484 (8.34%) failed to reach the 80% threshold

of bases with phastCons scores.
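The window scoring rule can be sketched as below. Representing an unscored base as None and returning None for a window that fails the threshold are conventions of this example (the analysis itself recorded a zero score and excluded such windows).

```python
def window_conservation(scores, min_coverage=0.80):
    """Average per-base conservation score for a window, or None if under-covered.

    scores: one entry per base in the window; None marks a base with no score.
    The average is taken over scored bases only, not over the window length.
    """
    scored = [s for s in scores if s is not None]
    if not scores or len(scored) < min_coverage * len(scores):
        return None  # fails the coverage threshold; excluded from the analysis
    return sum(scored) / len(scored)
```

A 10-base window with 8 bases scored at 0.5 receives 0.5 rather than 0.4, which is how excluding unscored bases raises the score of partially covered windows.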

In the course of this analysis, several different sets of species were tried for the

purposes of assigning conservation scores. The original plan was simply to use

human, chimp, mouse and rat. However, while this set of species effectively

recognized far more of the human genome as “conserved”, it also resulted in far more

false positives, and an inability to easily distinguish the most dramatically conserved

areas. If you imagine using only human and chimp, it is clear that well over 90% of

the genome would appear conserved. By using the 8 vertebrate species listed above,

some depth to the conservation data can be obtained. While it is not possible to

ensure no evolutionary pressure is selecting changes in the “unconserved” windows,

the score assigned to each window is a direct measure of the probability that the bases

within that window are conserved.

2.3.4 Telomeric Regions

A window or a high density patch was considered telomeric (or sub-telomeric) if it

overlapped the first or last chromosomal band by even a single base. Of the

16,633,481 windows of fixed substitutions, 1,030,290 (6.19%) were found to be in or

overlapping a chromosomal telomere.

2.3.5 Recombination Hot Spot Location

A “bed” file covering recombination hot spots[36] was used to determine whether

windows fell in hot spots. If 50% or more of a window overlapped a hotspot, this

window was considered as belonging to a hot spot. For fixed substitutions, 1,535,478

of 16,633,481 windows, or 9.23% were hot. For SNPs, 182,992 of 1,900,453 or

9.63% were hot. Additionally, the distance of a window to its nearest hot spot was

analyzed. For the Biased Clustered Substitution analysis, the number of bases

belonging to a hot spot within a particular bin was used to measure the association

between hot spots and biased clustering.
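The telomeric and hot spot tests both reduce to interval overlap. The sketch below assumes BED-style 0-based, half-open intervals on a single chromosome and non-overlapping hot spot intervals; the function names are illustrative.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by half-open intervals [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def is_hot(window, hotspots, min_fraction=0.5):
    """True if at least min_fraction of the window lies within hot spots.

    hotspots: iterable of (start, end) intervals assumed not to overlap each other.
    """
    start, end = window
    covered = sum(overlap_bp(start, end, hs, he) for hs, he in hotspots)
    return covered >= min_fraction * (end - start)


def is_telomeric(window, first_band, last_band):
    """True if the window overlaps the first or last chromosomal band by even one base."""
    start, end = window
    return (
        overlap_bp(start, end, *first_band) > 0
        or overlap_bp(start, end, *last_band) > 0
    )
```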

2.3.6 Recombination Rates

For the biased clustered substitutions view, recombination rate data from deCODE[37]

was used. Recombination rates were available for males and females separately as

well as the sex averaged rate. The data came in the form of a rate averaged across 1

million bp segments of the genome. While this prevented fine detail correlations

across the whole genome, the data proved revealing even in 1mbp granularity.

2.3.7 Transcription Density and Transcription Evidence

Transcription Density was a measure of the number of bases in a region which are

found in one of the following UCSC browser tracks: Known Gene[38, 39, 40], Human

mRNAs[41] or Human EST.[41] It should be clear that this analysis does not cover the

rate of transcription, but only whether some evidence exists in humans that

transcription occurs. Transcription Density was used in the Biased Clustered

Substitution analysis as a base count in bins of from 10,000 to 1mbp. Additionally,

the top scoring regions of bias were examined as to whether they showed evidence of

transcription. For this, a descending hierarchy of transcription evidence was sought

as follows: known exons, known genes, human mRNAs, human ESTs, non-human

mRNAs and non-human ESTs.

2.4 Statistical Tools and Visual Aids

The windowing method was analyzed by basic empirical probabilities, which were

plotted across a range of factors. The Biased Clustered Substitutions method

involved a large number of plots to show location of features within the genome, as

well as the correlations of certain factors across a number of bin sizes. Close to 1500

graphics were generated in order to characterize bias in the human genome. The

graphical package R was used to generate all plots.

2.4.1 Window Based Statistics

All windows with at least one base change were used for gathering statistics. The

following empirical probabilities were of interest:

1. P(W to S | change): count of Weak in ancestor and Strong in descendent divided

by all base changes in a window. Likewise P(S to W | change), P(S to S | change)

and P(W to W | change) were calculated. Also referred to as P(Bias).

2. P(S | aW and change): count of Weak in ancestor and Strong in descendent

divided by all base changes which were weak in ancestor. Likewise P(W | aW

and change), P(W | aS and change) and P(S | aS and change) were calculated.

Also referred to as P(Bias | anc., change).

3. Normalized P(S | aW): count of Weak in ancestor and Strong in descendent

divided by all weak in ancestor, whether mutated or not. While P(W | aS) was also

calculated, the magnitudes of P(W | aW) and P(S | aS), which do not intrinsically

involve a base change, were not. Given that the calculated probability for the

window is dependent upon the number of base changes in that window, this

statistic is normalized by dividing a window’s probability by the number of

changes in that window. Though probabilities are calculated by window, the

resulting empirical probability is that a given base change in that window will be

biased, rather than the probability that a biased change will occur in that window.

Also referred to as P(Bias | anc.).
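The three empirical probabilities listed above can be sketched for a single window as follows. The argument names are hypothetical: ws, sw, ww and ss are the window's counts of weak to strong, strong to weak, weak to weak and strong to strong changes, and ancestral_weak_bases counts all weak bases in the ancestral sequence of the window, whether mutated or not.

```python
def p_bias(ws, sw, ww, ss):
    """P(W to S | change): weak to strong changes over all base changes."""
    total = ws + sw + ww + ss
    return ws / total if total else None


def p_bias_given_anc_change(ws, ww):
    """P(S | aW and change): weak to strong over all changes from a weak ancestor."""
    from_weak = ws + ww
    return ws / from_weak if from_weak else None


def normalized_p_bias_given_anc(ws, ancestral_weak_bases, n_changes):
    """Normalized P(S | aW): weak to strong over all ancestral weak bases,
    divided by the number of base changes in the window."""
    if not ancestral_weak_bases or not n_changes:
        return None
    return (ws / ancestral_weak_bases) / n_changes
```

For a window with 3 weak to strong, 1 strong to weak and 1 weak to weak change among 60 ancestral weak bases, these give 0.6, 0.75 and 0.01 respectively.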

The probabilities were calculated for a number of different “categories”:

1. Entire dataset: For all windows in the human genome with at least 1 base change,

the 12 different base change types were summed and the empirical probabilities

were calculated. The empirical probabilities were determined for individual

windows, then the mean and standard deviation were generated for all windows.

That is, the resulting statistics were for sums of probabilities, not a single

probability of sums. The calculated standard deviation allowed for representing

standard error bars in plots.

2. Windowed substitution count: All windows with a single base change were

grouped separately from all windows with 2 changes all the way up to the

maximum number of changes per window. For 300/150 windows, the maximum

number of fixed substitutions in a single window was 26, while only 10 SNPs

were found in a single window. For windows of 100bp, 12 was the maximum

number of fixed substitutions found.

3. Ancestral G+C percent: All windows were divided into 10 bins for the percentage

of ancestral G+C found. That is, those windows with less than or equal to 10%

were grouped separately from windows with greater than 10% but less than or

equal to 20%, continuing through those windows with greater than 90% ancestral

G+C content.

4. Conservation Score: All windows for which a conservation score could be

calculated were divided into one of 5 bins according to conservation score.

Windows were considered to be low conservation if their average conservation

score was less than 0.2.

5. Hot Spots: All windows which overlap hot spots were summed separately from all

windows not overlapping hot spots.

6. Telomeric: All windows which overlap telomeres were summed separately from

all windows not overlapping chromosomal telomeres.
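Grouping windows into categories and summarizing per-window probabilities can be sketched as below, using the ancestral G+C bins as the example category. The names are illustrative, and returning the sample standard deviation alongside the count is a choice of this sketch.

```python
from collections import defaultdict
from math import ceil
from statistics import mean, stdev


def gc_bin(gc_percent):
    """Bin index 0..9: bin 0 holds <= 10%, bin 1 holds > 10% and <= 20%, ..., bin 9 holds > 90%."""
    return min(9, max(0, ceil(gc_percent / 10) - 1))


def summarize_by_bin(windows):
    """windows: iterable of (ancestral_gc_percent, per_window_probability).

    Returns {bin: (mean, standard deviation, window count)}. Probabilities are
    computed per window first and then averaged, as described above.
    """
    groups = defaultdict(list)
    for pct, p in windows:
        groups[gc_bin(pct)].append(p)
    return {
        b: (mean(v), stdev(v) if len(v) > 1 else 0.0, len(v))
        for b, v in sorted(groups.items())
    }
```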

Additional categories were analyzed as combinations of the primary categories. For

example, the G+C categories were further broken into base change count categories.

Window based plots were generated genome wide for fixed substitutions derived in

the human line for windows of 300 and windows of 100. Additionally, most of the

same plots were generated for human SNPs for windows of 300 base pairs. Plots of

empirical P(Bias) and Normalized P(Bias | Ancestral) with standard error bars were

generated for the number of differences per Window, G+C Content, Conservation

Score, Hot Spots and Telomeric Location. In addition, the Average Distance to a Hot

Spot, Average Proximity to a Telomere, Average G+C Content and Average

Conservation Score were all plotted vs. Window Substitution Count. A total of 82

plots were made for this analysis.

2.4.2 Analyzing UBCS with Zippers and Maps

Biased Clustered Substitution Analysis involved plots for each of the 24 human

chromosomes as well as the Whole Genome. All plots were made for fixed

substitutions derived on the human line but many were repeated for human SNPs and

for Chimp fixed substitutions.

1. Plots with error bars for empirical P(bias) measured for clusters of 2 through 10

substitutions within 20 through 600 bases were made. The set of nine plots, taken

together were dubbed “zipper plots” for reasons which will become obvious when

the results are examined. While these nine plots showed the effect of substitution

count on bias, an additional 15 plots for the entire genome showed the effects of

cluster span in base pairs. Heat Maps were made of the same information in order

to condense the three dimensional information of 9 plots into one graphic. It was

also useful to generate “normalized” heat maps, which centered the coloring on

the average chromosome bias. These 855 plots served to fully characterize the

dimensions of bias associated with clustering of substitutions and SNPs in

humans (hg17-pt1) and for substitutions alone in chimps (hg18-pt2).

2. Once a dataset of biased clustered substitutions was generated based upon the

definition of 5 differences within 300bp with at least 80% weak to strong changes,

mapping of locations was done by histogram for each of the 24 chromosomes.

Maps of Substitutions, Weak to Strong Substitutions, Clustered Substitutions,

Biased Clustered Substitutions and Unexpected Biased Clustered Substitutions

were generated using 1 million base pair (mbp) bins. In order to simplify the

relatively noisy histogram maps of unexpected bias, a smoothing function was

applied. The loess function of R was used to smooth by “least squares” across a

span of 25 bins, or 25mbp. The smoothed curve was used to generate a 95%

confidence interval using +/- 1.96 standard deviations. While this method

assumes a normal distribution of data, unexpected biased clustered substitutions

are not normally distributed. However, the null model would predict a normal

distribution of actual minus expected bias. Unexpected Biased Clustered

Substitutions were mapped together with G+C Content, male and female

recombination rates, and transcription density. Recombination rates and

unexpected bias were additionally plotted using the smoothing function for both

(this time with a span of 15mbp). Maps of individual chromosomes were joined

sequentially into genome wide maps of the various relationships. In all, 500

chromosome maps were generated for human and chimp substitutions, and human

SNPs.

3. Correlations were generated between BCS, UBCS and Smoothed UBCS; and Hot

Spots; G+C Content; Male, Female and Sex-Averaged Recombination Rates;

conservation score and transcription density. Pearson’s Correlation Coefficient

was generated for bins of size 10,000 through 1 million base pairs, with the many

results plotted in a single graph, resulting in 75 correlation plots for human fixed

substitutions.

4. Finally, an effort was made to determine whether a second factor might explain

the UBCS signal in combination with a first. After a linear relationship was

established between points on a scatter plot of UBCS and male recombination

rate, the differences between the actual points and the linear approximation can be

understood as the residual signal left unexplained by male recombination rate’s

relationship with UBCS. Those “residuals” can then be plotted with a second

factor in order to determine if an additional relationship is involved in producing

the full UBCS signal. The residual signal of UBCS, unexplained by male

recombination rate, was plotted with female recombination rate, G+C content,

conservation score and transcription density.
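The correlation and residual steps in items 3 and 4 can be sketched in a few lines. The plots in this work were generated in R; the Python below is only an illustration of the arithmetic, with per-bin values passed as equal-length lists and all names hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def linear_residuals(x, y):
    """Residuals of y after a least-squares linear fit on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [b - (slope * a + intercept) for a, b in zip(x, y)]
```

With per-bin lists male_rate, female_rate and ubcs_signal (placeholder names), the residual analysis would be pearson_r(female_rate, linear_residuals(male_rate, ubcs_signal)).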

2.5 Dating the Fusion of Chromosome 2

Clearly the UBCS signal near telomeres dominates the rest of the chromosome, as

seen in Figure 15. A reasonable assumption is that the internal peak of chromosome

2 built up while the region was sub-telomeric in the unfused chromosomes, and

stopped accumulating soon after fusion. The chimp and human maps of cousin

chromosomes proved remarkably similar in the shape and relative amplitude of

telomere peaks (Figure 19). This allows using the ratio of the height of the UBCS

signal on the telomeres of a chimp chromosome, and the height of UBCS on one of

the telomeres of a corresponding human chromosome, to predict the height of UBCS at

the other human telomere. Thus, the expected height of the missing telomeres if there

had been no fusion in the human line could be predicted. Using the set of

chromosomes with substitution data for their entire length for both humans and

chimps (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 17, 20), a standard deviation could be

calculated. Thus, the calculated fusion date is the ratio of the actual UBCS height of

the current human chromosome 2 fusion peak over the predicted UBCS heights of the

hypothetical human telomeres 2a arm q and 2b arm p, times the estimated date of

6MYA that the human and chimp lines diverged (Eq. 1). A 95% confidence interval

is achieved by applying +/- 1.96 times the standard deviation of the predictive ability

of the relative telomere heights, to the predictions of the hypothetical human

telomeres. One important part of this calculation involves the size of a region used to

measure the UBCS signal at a telomere. A size of 17mbp was used as the smallest

window where the next successive window’s UBCS signal falls below average for the

genome. Thus, 17mbp at the telomeres of each autosome represents the average

region of heightened UBCS at telomeres in humans. This is roughly 25% of the

human genome.

hg17.chr2.p = UBCS signal of hg17.chr2:1-17000000
hg17.chr2.fus = UBCS signal of hg17.chr2:97000001-131000000
hg17.chr2a.q.hyp = hg17.chr2.p * pt2.chr2a.q / pt2.chr2a.p
hg17.chr2b.p.hyp = hg17.chr2.q * pt2.chr2b.p / pt2.chr2b.q
FusionRatio = hg17.chr2.fus / (hg17.chr2a.q.hyp + hg17.chr2b.p.hyp)
FusionDate = 6MYA + (FusionRatio * 6MY)          (Eq. 1)
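
Eq. 1 can be transcribed almost directly into code. The sketch below is illustrative only: the function and argument names are hypothetical, the input values are invented rather than measured, and the final step assumes that "6MYA + (FusionRatio * 6MY)" means starting 6 million years ago and moving forward in time by FusionRatio * 6MY.

```python
def fusion_estimate(hs_p, hs_q, hs_fus,
                    pt_2a_p, pt_2a_q, pt_2b_p, pt_2b_q,
                    divergence_my=6.0):
    """Return (FusionRatio, fusion date in millions of years ago).

    hs_p, hs_q : UBCS signal at the p and q telomeres of human chr2
    hs_fus     : UBCS signal of the internal fusion peak of human chr2
    pt_*       : UBCS signal at the telomeres of chimp chr2a and chr2b
    """
    # Predicted heights of the two human telomeres lost in the fusion,
    # scaled from the surviving human telomere by the chimp ratio.
    chr2a_q_hyp = hs_p * pt_2a_q / pt_2a_p
    chr2b_p_hyp = hs_q * pt_2b_p / pt_2b_q
    fusion_ratio = hs_fus / (chr2a_q_hyp + chr2b_p_hyp)
    # Assumed reading of Eq. 1: begin at the divergence date and move
    # toward the present by the fraction of time spent sub-telomeric.
    fusion_date_mya = divergence_my - fusion_ratio * divergence_my
    return fusion_ratio, fusion_date_mya


# Invented signal heights, for illustration only.
ratio, date_mya = fusion_estimate(hs_p=120.0, hs_q=90.0, hs_fus=150.0,
                                  pt_2a_p=100.0, pt_2a_q=110.0,
                                  pt_2b_p=95.0, pt_2b_q=80.0)
print(ratio, date_mya)
```

Under this reading, a fusion peak as tall as the two predicted telomeres combined (ratio 1) would imply a very recent fusion, while a negligible peak (ratio near 0) would place the fusion close to the divergence date.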

3.0 Results

From the examination of almost 11 million substitutions in the human genome, it can

be seen from Table 1 that nearly as many weak bases underwent substitution as strong

bases. Further, it is clear that weak to strong substitutions merely balance out strong

to weak ones (43.1% vs. 42.78%). However, in examination of substitutions as a

function of other factors, evidence of bias emerges.

Substitution Totals for Human Genome (hg17)

Point Differences between Humans and Chimps      28,896,677
Substitutions with Rhesus macaque outlier        24,817,827  (85.88%)
Substitutions with unambiguous outlier           21,405,843  (86.25%)
Substitutions found in Human Line                10,871,681  (50.79%)
Ancestral Weak Bases Substituted                  5,351,332  (49.22%)
Ancestral Strong Bases Substituted                5,520,349  (50.78%)
Weak to Strong Substitutions in Human Line        4,685,494  (43.10%)
Strong to Weak Substitutions in Human Line        4,650,554  (42.78%)
Weak to Weak Substitutions in Human Line            665,838  ( 6.12%)
Strong to Strong Substitutions in Human Line        869,795  ( 8.00%)

Table 1. While about 11 million substitutions derived in humans were identifiable overall, there is no evidence of bias in this aggregate view.

3.1 Bias as a Social Disease

In a quest to find hidden relationships between weak to strong substitutions and other

factors, the entire human genome was analyzed by windows. Initial analysis used

windows of 300bp stepping 150, based upon the approximate mean size of a gap

subject to BGC due to a recombination event[28]; while more recent analysis tightened

the windows to 100bp stepping 50, based upon results to be described later. The most

obvious evidence of bias appears when fixed substitutions are clustered together. In

Figure 2 it can be seen that the empirical probability of weak to strong substitution

rises once a certain substitution density is reached.

Figure 2. The Empirical Probability of Bias Due to Substitution Count. In windows of 300bp, little bias is seen for the first 5 substitutions, but pronounced bias is found between 7 and 16 localized substitutions. In windows of 100bp, the relationship is even more pronounced and is obvious in clusters of 5 to 11 substitutions. In windows of 300bp, the strongest bias was at 13 substitutions per window, with the proportion of weak to strong being about 46.4% while strong to weak was 37.2%. However, using 8 substitutions as a comparable data point for windows of 100bp, weak to strong substitutions were 49.2% while strong to weak were reduced to 35.9%. Here bias is measured as a simple proportion of weak to strong (red) relative to the three other possibilities (error bars +/- 1 SE). [Available at http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]
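
The window analysis behind Figure 2 can be illustrated with a short sketch. It is a simplified stand-in for the actual pipeline: the function names, the tuple layout of a substitution and the toy data are all hypothetical, and only the window size and step (300bp stepping 150, or 100bp stepping 50) come from the text.

```python
from bisect import bisect_left
from collections import defaultdict

WEAK = {"A", "T"}  # weak bases; G and C are strong


def is_weak_to_strong(ancestral, derived):
    """True when a weak ancestral base was replaced by a strong one."""
    return ancestral in WEAK and derived not in WEAK


def ws_proportion_by_count(subs, chrom_len, window=300, step=150):
    """Proportion of weak to strong substitutions, grouped by the number
    of substitutions found in each window.

    subs: list of (position, ancestral_base, derived_base) tuples.
    """
    subs = sorted(subs)
    positions = [pos for pos, _, _ in subs]
    total = defaultdict(int)
    ws = defaultdict(int)
    for start in range(0, max(chrom_len - window, 0) + 1, step):
        lo = bisect_left(positions, start)
        hi = bisect_left(positions, start + window)
        n = hi - lo
        if n == 0:
            continue
        total[n] += n
        ws[n] += sum(is_weak_to_strong(a, d) for _, a, d in subs[lo:hi])
    return {n: ws[n] / total[n] for n in sorted(total)}


# Toy data, for illustration only.
toy = [(10, "A", "G"), (20, "T", "C"), (30, "G", "A")]
print(ws_proportion_by_count(toy, chrom_len=300))
```

Because the windows overlap, a substitution is counted in more than one window; the sketch keeps that behavior rather than trying to correct for it.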

3.1.1 Documenting Gang Behavior

In order to illuminate the full dimensions of clustering associated with biased

substitutions, “nearest neighbor” methods were used. This allows examining the

proportion of weak to strong substitutions in clusters of 2 through 10 substitutions. In

Figure 3, the relationship between weak to strong substitutions and clustering can be

clearly seen when clusters of 5 substitutions are within 100 base pairs. These plots

taken together as a series were dubbed “zipper plots” for obvious reasons.

Figure 3. “Zipper Plots”: Bias for Clusters of N Substitutions. No discernible bias is revealed when 2 substitutions are as close as 20 base pairs. The first hint of a relationship between weak to strong bias and clustering isn’t seen until 4 substitutions are tightly clustered. When a cluster of 5 substitutions falls within 100 base pairs, bias is clearly seen. By clusters of 10, the proportion of weak to strong substitutions exceeds strong to weak ones for most of the range examined. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#Zip]
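
A zipper plot relates the span of a cluster of N substitutions to the share of its members that are weak to strong. The sketch below shows one simple way to compute such a relationship, assuming that a cluster of N is any run of N consecutive substitutions along the chromosome. The function names, the (position, is_weak_to_strong) layout and the toy data are hypothetical and are not taken from the code used in this work.

```python
def runs_of_n(subs, n):
    """Yield (span_bp, ws_count) for every run of n consecutive substitutions.

    subs: list of (position, is_weak_to_strong) tuples.
    """
    subs = sorted(subs)
    for i in range(len(subs) - n + 1):
        run = subs[i:i + n]
        span = run[-1][0] - run[0][0] + 1
        yield span, sum(1 for _, is_ws in run if is_ws)


def ws_fraction_within(subs, n, max_span):
    """Fraction of weak to strong substitutions among all runs of n
    substitutions that fit within max_span base pairs."""
    total = 0
    ws = 0
    for span, ws_count in runs_of_n(subs, n):
        if span <= max_span:
            total += n
            ws += ws_count
    return ws / total if total else None


# Toy data, for illustration only.
toy = [(1, True), (5, True), (50, False), (60, True), (400, False)]
for max_span in (10, 100, 400):
    print(max_span, ws_fraction_within(toy, n=3, max_span=max_span))
```

Plotting the returned fraction against max_span for each cluster size N would give one curve per panel of a zipper-style plot.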

The same method was turned on its head to reveal, not the number in a cluster that

results in bias, but the width of a cluster required to see bias. In Figure 4, bias is

barely visible when clusters of 9 substitutions are spread across 300 base pairs, but is

striking when clusters of 5 or more are within 100 base pairs. Figure 5 condenses the

genome wide analysis of the dimensions of clustering which shows evidence of bias.

It is clear from this image that the space in which the majority of substitutions occur,

shows no special bias for weak or strong substitutions. But as this sea of substitutions

reaches the rocky edges where high density substitutions occur, bias is noticeably

tipped in favor of weak to strong substitutions. Curiously, at some of the most

extreme densities of substitutions, bias is actually strong to weak. However, this

work does not attempt to characterize that phenomenon.

Figure 4. Bias for Substitutions within N bases. When clusters of substitutions are within 300 bases, bias is not seen until 9 substitutions are clustered together. When cluster spread is restricted to within 200 bases, bias occurs in clusters of 7, and when within 100 bases, clusters of 5 or more are clearly biased. The strongest bias observed in the human line, genome wide, is for clusters of 10 substitutions within 80 base pairs. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#WinZip]

Figure 5. Cluster Bias Heat Map. The results of the three dimensional analysis of bias as a function of length and number of substitutions in a cluster can be summarized by this heat map. Bias is characterized in the range of 2-10 substitutions (X axis) by 20-600bp spans (Y axis). While the bulk of the range in which substitutions fall shows no tendency for weak to strong substitutions to dominate, the rocky shore is strongly biased to weak to strong substitutions. This heat map has been normalized to put yellow at the genome mean and red and blue at the extremes. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#WinZip]

3.1.2 Biased Groups are Recruited from Unbiased Individuals

While the clustering bias seen here is consistent with the Biased Gene Conversion

model, it is conceivable that it is due to other factors. Certainly mutations might

occur in clusters and if the process that gives rise to clustered mutations also biases

them, then this pattern might be expected. Such a process should reveal itself in

biased clustering of Single Nucleotide Polymorphisms. The results, shown in Figure

6, reveal little evidence of bias in SNPs. If biased clusters were due to an underlying

mutation bias, one would expect the levels of bias to be stronger in SNPs than

substitutions, where purifying selection would reduce their numbers. The evidence

instead points to selection of biased clusters of SNPs. Assuming that the forces

resulting in mutations in humans have not altered significantly over the past 6 million

years, the process that results in biased clusters of substitutions is not due to an

underlying mutation bias.

The selection of biased SNPs can even be seen in the coarsest view of the data.

While weak to strong SNPs make up 39.97% of total SNPs, weak to strong

substitutions are 43.1% of all substitutions. There are fully 10% more strong to weak

SNPs than there are weak to strong. But by the time SNPs have been fixed in the

genome as substitutions, the two totals roughly balance. Whatever is selecting weak

to strong SNPs is having the satisfying effect of counterbalancing the underlying

strong to weak mutation bias, at least in recent human evolution. While it is hard to

imagine that the motive force acting upon individual SNPs is selection to maintain the

nucleotide balance genome wide, the result is symmetry nonetheless.

Figure 6. Weak to Strong Bias in Single Nucleotide Polymorphisms. Unlike the bias seen in clusters of fixed substitutions, clusters of SNPs (which are not yet fixed) show little tendency to be biased due to clustering in windows of 300bp or 100bp. Additionally, the genome wide zipper plot shows little evidence of bias. However, examination of the heat map reveals small pockets of clustered bias. This could be expected in a BGC model, as clusters move towards fixation at recombination hotspots. [http://www.cse.ucsc.edu/research/compbio/ubcs/snp17_w100.html]

3.2 Focusing on Bias through the Window Lens

While the association of weak to strong biased substitutions with clustering is clear, other

relationships need investigation. It is entirely conceivable that the same process

which selects biased clusters also selects individual weak to strong SNPs; and these

individual biased substitutions have so far gone undetected as having a unique origin,

in the background of unbiased substitutions. For this reason, additional analysis using

the window method may be revealing.

3.2.1 Conservative Bias?

Natural selection might select a biased mutation, or a cluster of biased mutations,

especially if they represent significant changes in a gene. This might be revealed if

bias is associated with conservation. However, no such evidence is found in Figure 7,

which shows weak to strong bias mildly retreats in the face of rising conservation

score. While it might be expected that individual clusters of substitutions might be

strongly selected (for or against), this analysis reveals little tendency for clusters of

biased substitutions to be more favored in conserved regions as compared to

unconserved regions of the genome. If anything, clusters and biased clusters are less

often found in conserved regions.

Figure 7. Bias as a Function of Conservation Score. While most windows had conservation scores of less than 0.2 (little conservation), there were still 173,677 windows of 100 bases with at least one substitution which had a conservation score of 0.8 or greater. However the most weak to strong bias is seen in windows with a conservation score of 0.2 to 0.4. The average conservation score falls (slightly) as the substitution count rises, and the P Score (binomial probability which reflects greater bias at lower numbers) shows a very weak correlation with conservation (R 0.016), which would translate to a negative correlation between bias and conservation. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]

3.2.2 Do the Strong Convert the Weak?

Since this genome wide effort finds little evidence for a mutation bias or for

conservation leading the selection of biased clusters, the question of isochores again

bubbles to the surface. Is it possible that areas of greater frequency of strong base

pairs propagate or at least maintain themselves by favoring the conversion of the

weak to the strong? This should be seen by an analysis of bias relative to the G+C

content of the surrounding region. If such a relationship existed, it might provide insight

into the forces shaping the much larger isochores.

Figure 8. Empirical Bias as a function of G+C Content. Windows of 100bp are analyzed, but the G+C content is taken from a window of 1000bp with the window of 100 at its center. At left, it is not surprising to find that as G+C rises, the proportion of weak to strong substitutions falls. This can be understood when it is recognized that there are fewer and fewer weak bases available to be substituted. The graph on the right compensates for this trend by using a conditional probability. It shows the empirical probability of bias, given the ancestral G+C content of the region. In this view, it is clear that the G+C content of a region does affect the proportion of substitutions which are biased. However, the effect tends to weaken bias, rather than strengthen it. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]

But no clear relationship is seen in Figure 8, where bias as a function of surrounding

G+C content in windows of 100bp is examined. Since any effect of nearby G+C

might extend well beyond the windows of 100bp examined, the G+C content reflects

a window of 1000bp with the target substitution window at its center. If G+C content

is large, it is not surprising that weak to strong bias diminishes, since fewer weak

bases are available to be converted to strong. That is why the second plot of Figure 8

is included. Here the empirical probability of bias given the ancestral G+C content is

measured. The conditional probability would result in a horizontal line for the null

model of random substitutions. The proportions of weak to strong and strong to weak

are clearly affected by the G+C content of the surrounding region, but for the most

part the decay of high (or low) G+C areas would be expected. In fact, this graph

suggests that G+C extremes are not only moving back to the middle ground, but are

doing so faster than expected by the null model. Clearly there is no evidence for

isochores growing or even maintaining their current strength here.
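
The reason a correction for G+C content is needed can be shown with simple arithmetic. The sketch below is not the conditional probability used for Figure 8, whose exact form is given in Methods; it only assumes that each count is divided by the number of ancestral bases that could have produced it. The function name and the example counts are hypothetical.

```python
def per_base_rates(ws_count, sw_count, region_len, gc_fraction):
    """Weak to strong and strong to weak substitutions per available base.

    region_len  : number of ancestral bases in the region
    gc_fraction : ancestral G+C content of the region (strictly between 0 and 1)
    """
    if not 0.0 < gc_fraction < 1.0:
        raise ValueError("gc_fraction must be strictly between 0 and 1")
    weak_bases = region_len * (1.0 - gc_fraction)   # A+T available
    strong_bases = region_len * gc_fraction         # G+C available
    return ws_count / weak_bases, sw_count / strong_bases


# Equal raw counts in a G+C rich region (illustrative numbers only).
ws_rate, sw_rate = per_base_rates(ws_count=2, sw_count=2,
                                  region_len=100, gc_fraction=0.6)
print(ws_rate, sw_rate)
```

With 60% G+C, only 40 weak bases are available against 60 strong ones, so equal raw counts correspond to a higher weak to strong rate per available base. A raw proportion would hide this difference.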

But the picture painted by G+C content is not so simple. In Figure 9, it is seen that

windows with more substitutions are found in slightly higher G+C regions. Still,

though clusters may be more likely, biased clusters of substitutions (as measured by P

score) are not more likely to be found in G+C rich areas. The negative correlation of

weak to strong substitutions to G+C (R -0.209) is comparable to the positive

correlation between G+C and the P score of a window (R 0.225). Both are dominated

by the inherent bias in substitutions which arises from the percentage of A+T

available to be substituted.

Figure 9. G+C Content Affects Clusters of Substitutions. In Figure 8 there was no evidence that G+C content alone biases substitutions. However, clusters of substitutions are clearly but not dramatically more prevalent in regions of higher G+C content, as seen at left. There is a positive correlation (R 0.225) between bias as measured by P score (lower means more biased) and G+C content, which means that lower G+C is associated with lower P scores. Again, any relationship is obscured by the inherent substitution bias due to G+C content. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]

Having established that higher G+C content reduces the bias of substitutions and is

slightly more favorable to clusters of substitutions, it seems appropriate to revisit the

plot found in Figure 2, which shows the empirical probability of bias due to

substitution count. Using the conditional probability that eliminates the inherent

substitution bias due to G+C, it can be seen in Figure 10 that bias still occurs in

clusters, and not simply due to low (or high) G+C. However, the two additional plots

covering low and high G+C make it clear that high G+C areas are more dramatically

biasing of clusters than are low G+C regions. More specifically, regions of higher

G+C are not more likely to result in bias and are only slightly more likely to result in

clusters, but are clearly more likely to result in clusters that are biased. Thus G+C

content is related to biasing of clusters, but cannot be the only force involved, since

biased clusters occur in both high and low G+C regions. Again, the message is that

the forces that give rise to clusters of substitutions are acting in a different

environment than those that result in sparse substitutions.

Figure 10. The Conditional Empirical Probability of Bias by Substitution Count. Re-plotting Figure 2 using the conditional probability eliminates the bias that is due purely to the G+C content of a region (upper left). The result is that clusters are still more biased than sparse substitutions. The fact that the blue “strong to weak” curve dominates most of the graph is not unexpected, since it is well established that strong to weak mutation rates dominate weak to strong. However, it can be seen that the majority of windows have less than 50% G+C content (mean 0.410), which results in weak to strong and strong to weak substitutions evening out genome wide (43.10% vs. 42.78%). Further evidence of the effects of G+C on clusters found in Figure 9 can be seen in the two lower plots. At low G+C (G+C 20-30%; N=884,244 windows) clusters are only marginally more biased than individual substitutions. However, in regions enriched for G+C (G+C 50-60%; N=1,482,460 windows), clusters are clearly more biased than lone substitutions. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100.html]

3.2.3 Bias at the Hot Spots and on the Edge of Town

There is no underlying mutation bias that explains the patterns of biased clusters of

substitutions seen. Also there is no evidence that biased clusters are the result of the

selective pressures that have shaped highly conserved regions. Finally, the G+C

content of regions may be associated with biasing of clustered substitutions, but it

clearly does not tell the whole story. The most likely explanation for biased clusters

remains Biased Gene Conversion. If this were the cause, then the patterns of bias

should be the effect of recombination events, and therefore ought to be associated

with recombination hot spots. While recombination hot spots have been mapped in

humans, it should be remembered that hot spots are known to move.[23, 24] The

recombination hot spots that might have given rise to biased clusters of substitutions

may be long gone by now.

As can be seen in Figure 11, clusters are more distant from current hot spots than the

average for all substitutions (substitutions: 71131.6 bp; clusters: 72399.9 bp).

However, though biased clusters tend to be closer than average (61391.8 bp), there is

no correlation between windowed P score and distance to the nearest hot spot.

Looking at SNPs, it is clear that they are closer, on average, to a hot spot than are

substitutions (60203.0 bp). But clusters of SNPs and biased clusters of SNPs are

closer still (clusters of SNPs: 47806.9 bp; biased clusters: 49394.3 bp). Either

recombination hot spots are triggered by clusters of mutations, or recombination can

create mutations[45, 46] and hot spots can create clusters of mutations. As recombination events

occur, clusters become biased, and hot spots move on or disappear. While this might

explain the hot spot evidence seen here, there is not enough evidence to paint a clear

picture.

Figure 11. Hot Spots are Slightly More Biased. Using the conditional probability, it is clear substitutions found in current recombination hot spots are very slightly more biased than substitutions in colder regions. This result is significant, with more than 1.5 million windows falling in hot spots (included error bars are too small to see). The plot at right shows that clusters of substitutions are actually more distant from current hot spots than the average substitution. But that doesn’t mean that biased clusters are more distant. In fact, there is no correlation between the P score of windows and the distance to the nearest hot spot. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]

In contrast to locations near a hot spot, there is clear evidence that location near a

telomere results both in more biased substitutions and in more biased clusters of

substitutions, at least for the less extreme P scores. Figure 12 shows the evidence

that led to the next round of investigation. Where are biased clusters of substitutions

located on chromosomes?

Figure 12. Bias at Sub-telomeric Regions. Using the conditional probability at left, it is clear that sub-telomeric regions are more biased, and this effect is not due to G+C content. Using P scores, it is clear that biased clusters are closer, on average, to telomeres than are sparse or unbiased windows. In this plot, the proximity to a telomere is represented as a scale from 0 to 1, with 1 being at the very tip of a chromosome. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_w100_hist.html]

3.3 Geographic Distribution of Biased Groups

The window method allows for detecting clusters of substitutions, and therefore

proved useful in illuminating the relationship of bias to clusters. However, the

method inherently under-counts clusters. Therefore another method is used here to

try to quantify biased clusters and map them. As explained in Methods, by defining a

minimum requirement for a biased cluster, it is possible to determine, for every

substitution, whether it exists in a biased cluster. For this work, a biased cluster is

defined as at least 5 substitutions within 300bp, and at least 80% weak to strong. By

examining the four nearest neighbors of each substitution, a set of substitutions that

falls within this definition was developed. In Figure 13, this set of “biased clustered

substitutions” was mapped across one chromosome. Chromosome 18 is displayed

here because, while not atypical of most of the chromosomes, its coverage is good

and it is small enough that the details of the plots should be visible. Additionally, in

the zipper plots, chromosome 18 revealed some of the strongest cluster bias seen in

this research. In the figure, substitutions, weak to strong substitutions and biased

clustered substitutions are fairly uniformly distributed across the chromosome, and

appear to vary proportionally. However, clustering is both higher than expected by

chance and stronger near telomeres. Indeed, closer examination of biased clustered

substitutions reveals that they, too, are more frequent at the telomeres. But it is when

the frequency of “unexpected biased clustered substitutions” (UBCS) is examined

that two things emerge. First, the strength of this bias is mild through most of the

chromosome. Second, the accumulation of biased clustered substitutions sharply

increases near telomeres.
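
The biased cluster definition used in this section (at least 5 substitutions within 300bp, and at least 80% weak to strong) can be expressed compactly. The following is a minimal sketch rather than the implementation used here: it assumes that examining the four nearest neighbors can be approximated by testing every run of 5 consecutive substitutions, and the function name, data layout and toy values are hypothetical.

```python
def biased_cluster_members(subs, min_size=5, max_span=300, min_ws=0.8):
    """Return positions of substitutions that fall in a biased cluster.

    subs: list of (position, is_weak_to_strong) tuples.
    A run of min_size consecutive substitutions qualifies when it spans
    at most max_span bases and at least min_ws of it is weak to strong.
    """
    subs = sorted(subs)
    flagged = [False] * len(subs)
    for i in range(len(subs) - min_size + 1):
        run = subs[i:i + min_size]
        span = run[-1][0] - run[0][0] + 1
        ws_fraction = sum(1 for _, is_ws in run if is_ws) / min_size
        if span <= max_span and ws_fraction >= min_ws:
            # Every member of a qualifying run counts as a biased
            # clustered substitution, including a lone non-biased one.
            for j in range(i, i + min_size):
                flagged[j] = True
    return [pos for (pos, _), hit in zip(subs, flagged) if hit]


# Toy data, for illustration only.
toy = [(100, True), (150, True), (200, False), (250, True),
       (300, True), (5000, True)]
print(biased_cluster_members(toy))
```

Counting the flagged positions per 1mbp bin would give the kind of per-bin totals that are mapped along a chromosome.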

Figure 13. Mapping Chromosome 18. The histogram at the upper left shows that substitutions (green) have been mapped throughout all of chromosome 18 (except in the highly repetitive region around the centromere). While different regions may vary greatly (each bar is 1mbp), most regions contain between 3500 and 5000 substitutions. Likewise, weak to strong substitutions (gold) and “biased clustered substitutions” (red) are fairly uniformly distributed and follow in rough proportion with the overall substitution frequency. The distribution of clusters, on the other hand (top right, gray), begins to diverge from this pattern, and is more pronounced near the telomeres. The black line in this graph represents the expected frequency of clusters (Poisson, given the frequency of substitutions in a bin). Clusters are much more frequent than chance would predict. In the lower left, biased clusters are also more telomeric (at least for one of the telomeres). The black line in this plot is the expected number of biased clustered substitutions (binomial, based upon both the number of clustered substitutions and the frequency of weak to strong substitutions in each bin). Finally, in the lower right plot, we see unexpected (actual minus expected) biased clustered substitutions (UBCS). The null model would predict a line at zero, yet the biased clustering is occurring beyond what can be expected by chance, across most of the chromosome, and is especially heightened at the telomeres. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr18.html]
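
The "actual minus expected" calculation for a single bin can be sketched as follows. This is one plausible reading of the binomial expectation described in the caption, not a transcription of the code used: it assumes that a cluster of 5 counts as biased when at least 4 of its members are weak to strong, and that the chance of this is computed from the weak to strong frequency of the bin. The function names and the numbers are hypothetical.

```python
from math import comb


def prob_cluster_biased(p_ws, size=5, min_ws=4):
    """Binomial probability that at least min_ws of size substitutions
    are weak to strong, given a per-substitution probability p_ws."""
    return sum(comb(size, k) * p_ws ** k * (1 - p_ws) ** (size - k)
               for k in range(min_ws, size + 1))


def ubcs(actual_biased, clustered, p_ws):
    """Unexpected biased clustered substitutions for one bin:
    the observed count minus the count expected by chance."""
    expected = clustered * prob_cluster_biased(p_ws)
    return actual_biased - expected


# Illustrative numbers for one 1mbp bin (not measured values).
print(ubcs(actual_biased=300, clustered=1000, p_ws=0.43))
```

Under the null model the result would scatter around zero; sustained positive values across neighboring bins correspond to the elevated signal described in the text.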

Figure 14. “Winged Maps”: Unexpected Bias is Predictable. In chromosome after chromosome, the mapping of UBCS reveals the same pattern, mild through the length of the chromosome and elevated at the telomeres. Here, chromosomes 1, 10 and 20 are shown as examples. While the chromosome sizes vary, the pattern does not vary significantly. The top three images show the raw unexpected biased clustered substitutions (all bars represent 1mbp). The lower graphs apply a smoothing function to the raw signal (least squares using a local window of 25mbp). The smoothed maps use the same coordinate dimensions, which helps illustrate that the unexpected clustered bias rises at the telomeres, not at distance from the centromere. The yellow region about the smoothed curve is the 95% confidence interval based upon the 25mbp window. When it is above the zero line, the null hypothesis of random variation is rejected. For these three chromosomes, the null hypothesis is rejected through 61.8%, 79.4% and 68.3% of their lengths respectively, while 50.9% of the whole genome shows elevated bias with 95% confidence. The box around each smoothed curve is the confidence interval for the whole chromosome. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]

3.3.1 Near Universal Pattern of Bias Leaves Evidence of a Fusion

In fact, the maps of unexpected bias for almost all chromosomes follow this pattern,

and create a visual of “winged” maps. In Figure 14, three chromosomes, side by side,

show the same pattern, even though the lengths of the chromosomes vary. The

granularity of the maps (each bar represents 1 million bases) is kept the same in order

to compare the magnitudes. Figure 14 also includes the smoothed curves for the

same three chromosomes, this time plotted on identical coordinate dimensions.

Clearly the elevated clustered bias is not a function of distance from the centromere,

but a property of the telomeres. Also, clearly, the null hypothesis of random

fluctuation cannot be accepted for half of the genome.

That the pattern of elevated bias at the telomeres is so regular leads to a closer

examination of the exceptions, which are seen in Figure 15. Chromosome 2 has an

internal peak, which is easily understood as the remnants of the fusion of two

ancestral chromosomes in the human line.[47] It is certainly possible to imagine that

there is some force associated with telomeres and causing biased clusters of

substitutions, and that that force is still acting in the middle of chromosome 2 even

though it is no longer telomeric. However, a simpler explanation would be that the

peak we see in the middle of chromosome 2 represents substitutions that occurred

before the fusion of the two ancestral chromosomes. If this were so, then the pattern

here would be compatible with the fusion occurring relatively recently in human

evolution. While some have suggested that this fusion might have been the speciation

event that separated the human and chimp line[48], one interpretation of the UBCS

results suggests that this can’t be so.

Figure 15. Exceptions to the Pattern of Unexpected Biased Substitutions. The predictability of the pattern of where UBCS is located on chromosomes suggests that the exceptions reveal part of the story. In the upper left, all chromosomes are mapped end to end, and the smoothed signal peaks almost always fall upon the borders of chromosomes. However, chr2 (upper right) has a peak in the middle which clearly violates this pattern. It reveals the merger of two ancestral chromosomes which happened on the human line. Most positions on human chromosome 2 to the left of the peak align with chimpanzee chromosome 2a, while positions to the right align with 2b. The human sex chromosomes also stand out as exceptions. Chromosome Y (lower right), which suffers from a lack of data through much of its length, shows no signal beyond the null model expectation. Chromosome X (lower left) also shows a greatly reduced signal, with only one telomere appearing to deviate from random. These plots on uniform dimensions show the smoothed curve of the raw unexpected biased clustered substitutions (red: least-squares over 25mbp windows), the 95% confidence interval for a sliding window of 25mbp (yellow) and the 95% confidence interval for the entire region (rectangle). The Pseudo-Autosomal Regions (PAR: blue) have been marked for chromosome X. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]

3.3.2 Predictable Males and Enigmatic Females

The human sex chromosomes also break the mold. In Figure 15, the patterns of

unexpected biased clustered substitutions for chromosome Y show little evidence

beyond random. So far, all the evidence uncovered has been consistent with Biased

Gene Conversion as the cause of the clustered bias documented here. This model

involves recombination events which create biased clusters out of unbiased

polymorphisms. But recombination is not expected to happen on the Y chromosome,

since it is at all times haploid. Therefore, the lack of a signal on chromosome Y is

consistent with the BGC model. However, the X chromosome also shows little

UBCS signal, which is not predicted by the BGC model. One would expect half the

signal on X, given that it is only diploid in females, and would therefore have the

opportunity to recombine only half as often as do autosomes. But the actual UBCS

signal on X is much less than half.

There is a loophole in the haploid-diploid contract of the sex chromosomes: the

pseudo-autosomal regions (PAR) of X and Y. These two short regions (2.6 and

0.9mbp), at the tips of both chromosomes, align between X and Y and are known to

recombine. It is only in this region of Y that the UBCS signal is above zero.

Similarly the signal on X is strongest near the PAR regions of the X chromosome, as

seen in Figure 15. However, the strength of biased clustering is so diminished on X

that the question must remain open: is there an “X exception” that is not explained by

the BGC model?

Figure 16. Zipper Plots of Four Chromosomes. Recalling the zipper plots for the whole genome seen in Figure 3, the examination of 4 individual chromosomes is revealing. While most chromosomes show a degree of bias for clusters closer to the genome wide bias, some chromosomes, such as 18, show dramatic bias (upper left). Chromosome 2, not shown here, also shows dramatic bias. Early analysis of chromosome 19 showed little bias, which proved to be due to high G+C content with its inherent trend of strong to weak substitutions. However, this zipper plot (upper right) clearly shows evidence of the same weak to strong cluster bias seen in all other autosomes. The lack of a signal in the zipper plots for chromosome Y (lower right) is not surprising and is consistent with the BGC model. However, the “X exception” of little or no signal seen in the maps of UBCS is confirmed here. Clearly chromosome X has little or no detectable signal (lower left), and is an exception to the consistent picture of bias seen throughout the rest of the human genome. [All plots available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]

By returning to the so-called zipper plots, the picture of chromosome X is brought

further into focus. In Figure 16, the tendency for clusters of ten substitutions to be

biased is examined for four individual chromosomes. The plots of all autosomes

show evidence of biased clusters, to varying degrees. Both chromosome 2 (not

shown) and 18 show dramatic bias, whereas the amount of bias detected on

chromosome 19 is smaller than the genome wide average. However, chromosome 19

is noted for having the highest G+C content of any chromosome (47.3% compared to

40.1% genome wide), which made early attempts at finding evidence of bias

unsuccessful. The zipper plots, on the other hand, show that chromosome 19 has been

under the same biasing pressures as the other autosomes. No evidence of biased

clusters of substitutions is detectable on chromosome Y in either Figure 16 or the

corresponding heat maps of Figure 17. The heat map for chromosome Y actually

appears to show a reverse bias, compared to the autosomes. This trend is provocative

in its own right but is not addressed in this work. The picture of Y revealed from

these two graphics fits the predictions of the BGC model. However, once again,

chromosome X is an enigma. The zipper plots (for clusters of 10 shown in Figure

16), which are sensitive enough to see bias on chr19, show no such bias on X.

Examination of the heat map for X in Figure 17 confirms the lack of a biased cluster

signal and similarly seems to show signs of reverse bias. Possibly chromosome X is

being molded by competing pressures. It is clear from the lack of wings in Figure 15,

the tangled zipper in Figure 16 and the reverse bias seen in the heat map of Figure 17

that chromosome X departs from the autosomal pattern. While BGC may yet prove to be the cause of the

biased clusters seen in this research, any thorough explanation will have to explain

the X exception.

Figure 17. Heat Maps of Bias for Four Chromosomes. As seen in Figure 16, both autosomes 18 and 19 show evidence of biased clusters. For chromosome 18 (upper left), where other evidence is dramatic, the bright red in the lower corner of the graph indicates the most biased substitutions are found when six or more substitutions fall within 100bp. The same pattern is found in chromosome 19 (upper right), where the strength of bias is reduced, but the association with clustering is clear. The pattern is reversed for chromosome Y (lower right), a phenomenon not addressed by this paper. Once again, chromosome X (lower left) shows little evidence for the clustered bias seen on autosomes, and is more in line with Y. These heat maps have been “normalized” to place yellow at the average bias for the chromosome, while red and blue are maximized for the chromosome extremes. [All plots available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_all.html]

3.3.3 Are Humans More Biased than Chimpanzees?

Having used the alignment of the human, chimp and macaque genomes as a tool to

examine substitutions in humans, it is only natural to ask if the same analysis can be

done for the other two species. However, the quality of the human genome assembly

(on its eighteenth release) far exceeds the assemblies of chimpanzee (second release)

and Rhesus macaque (first release). While there can be some confidence that a single

nucleotide difference found in humans but not in chimps or macaques is indeed a

substitution in humans, the reverse case may actually be a sequencing error in either

chimp or macaque. Also, while there is no reason to expect that the false positives of

sequencing errors will be biased towards weak or strong differences, it is believable

that false positives will show up more frequently in clusters. In addition, the known

SNPs in the human genome could be subtracted out of the human substitution

database, purifying the data further. However, the set of possible substitutions for

chimps or macaques cannot currently be cleaned up in the same way. Nevertheless,

the ratio of substitutions to false positives in a full genome analysis may allow

detection of biased clusters. Using the most recent release of the three genome

assemblies, substitutions were examined across the chimpanzee genome. As

expected, the substitution count (including false positives) was higher (115% of the

human substitution count) and the clustered substitution count, higher still (138% of

human). While biased clustered substitutions exceeded those found in humans

(123%), unexpected biased clustered substitutions, which should represent those not

produced by chance, were only 59% of what was found in humans. This does not mean

that humans are more biased than chimps, of course, but it clearly illustrates the

noisiness of the chimpanzee data.

Figure 18. Biased Clustered Substitutions in the Chimpanzee Genome. Unlike the signal seen for the human genome in the zipper plots (upper left vs. Figure 3) or heat map (upper right vs. Figure 5), there is much less detected bias for clusters of substitutions in the chimpanzee genome. However, there is still a striking telomeric rise in unexpected bias on all autosomes for which data was available. This pattern is entirely in parallel to the UBCS signal seen in humans (lower left and right vs. Figure 14 and Figure 15). While the human plots were generated using May 2004 human (hg17), Nov. 2003 chimp (pt1) and Jan. 2005 macaque (rh0 prerelease) assemblies, the chimpanzee plots were generated using Mar. 2006 human (hg18), Mar. 2006 chimp (pt2) and Jan. 2006 macaque (rh1) assemblies. Sufficient high quality scores for chimp substitutions were unavailable for chr21 and chrY. [http://www.cse.ucsc.edu/research/compbio/ubcs/pt2_MapZip_0.html]

In Figure 18, we can see the evidence of the reduced signal of clustered bias. While

weak to strong bias only shows up in the extreme clusters, it is nevertheless evident

when unexpected bias is mapped across the genome. The overall signal is reduced

for unexpected bias (mean chimp: 11.8; human: 19.9), but what is striking is the

degree to which the shape and amplitude of the curves agree between chimp and

human chromosomes. The phenomenon can be seen in Figure 19, which compares

chimp chromosomes 2a and 2b (artificially fused), to human chromosome 2. The

Pearson’s correlation coefficient for the two sets of smoothed curves that cover most

of the two genomes is R=0.877! Remember that the signals being examined are made

up entirely of substitutions that have occurred since the human and chimpanzee lines

have split, and the datasets do not share a single aligned location between the two

species. While it is easy to suggest from this comparison that the internal peak of

human chr2 is due to the region having been telomeric for much of the last six million

years, it is also clear from the similarity of minor peaks that proximity to a telomere

alone cannot explain the accumulation of biased clusters. Another implication that is

hard to ignore is that the force that has given rise to the bias seen here must have run

a parallel course in our two species, despite the unique population histories we have

had. Finally, using the chimpanzee genome, we can also examine the “X exception”.

In Figure 20, we can see that there is virtually no detectable unexpected cluster bias in

the chimpanzee X chromosome. Thus, in two species, the X chromosome is clearly

much less biased than autosomes.

Figure 19. UBCS Profile is Similar between Humans and Chimps. While the fact that the chimpanzee chromosomes 2a and 2b both show elevated bias at their telomeres is not surprising, the overall profile of unexpected bias on these two chromosomes is remarkably similar to the fused human chromosome 2. The artificially fused maps of chimp 2a and 2b at left, and the human chr2 at right have a similar set of local minima and maxima, as well as a remarkably similar height to their peaks. There is surprising agreement among many of the autosomes across the two species. [http://www.cse.ucsc.edu/research/compbio/ubcs/cmp_MapUS_All.html]

Figure 20. The “X Exception” in Chimpanzees. While unexpected biased clustered substitutions were detectable in all of the chimpanzee autosomes for which data was available (Figure 18), the X chromosome stands out by showing virtually no unexpected bias (unsmoothed mean: -1.87, genome wide: 11.2; null hypothesis: 0). This pattern is even less ambiguous than what is seen in the human X chromosome (Figure 15), and should stand as confirmation that the force creating biased clustered substitutions is not happening in X. Smoothing was by least squares for 25mbp with 95% confidence in yellow and chimp pseudo-autosomal region in blue. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/pt2_MapZip_chrX.html]

3.3.4 Biased Without a Cause or Are Boys Troublemakers?

At this point the evidence shows a fairly clear picture of where bias is found and how

unexpected that bias is. We are left with a clear pattern of bias in two species, a

possible cause in BGC, and a single complication to that picture: the X exception. In

order to elucidate any hidden relationships with recombination, selection or G+C rich

isochores, several factors were mapped across each chromosome and the correlations

between the changes in signal strengths and unexpected clustered bias were

measured. In Figure 21, the relationships are examined for chromosome 18, which

has been shown to have a particularly strong signal. While unexpected biased

clustered substitutions do seem to magnify the highs and lows of G+C content, as

might be expected if it were the result of an isochore selection pressure, this

relationship does not hold for the telomere of the short arm at all. The relationship

between biased clusters and recombination hot spots does follow through the

telomeric rise, as the BGC model predicts. However, as can be expected, given the

historical nature of the substitution signal vs. the current events that the

recombination hot spots represent, the relationship is far from lock step. This makes

the result of the sex-differentiated recombination rates more startling. For

chromosome 18, the current male recombination rate correlates to the unexpected

biased clustered substitutions signal at R=0.694, while the female recombination rate

only correlates at R=0.160. Of all the correlations measured in this research, the

current male recombination rate shows the clearest relationship with clustering of

biased substitutions!

As can be seen in Figure 22, this relationship holds genome wide. Biased clusters of

substitutions do not correlate with conservation scores at all, and are only mildly

correlated with G+C content. The strongest relationship is with recombination rates

and most especially with male recombination rates. While the recombination rates

were measured as current phenomena[37], an additional word of caution is needed.

The rates were averaged across one million base pair sections of each chromosome.

This granularity agrees with the bins used throughout the chromosome mapping

shown here. However, Figure 22 plots the changes in correlations as the bin sizes

range from ten thousand to one million base pairs (1mbp). While the correlations of

mildly related factors might rise as the bin sizes grow and the number of bins drops

towards the singular; for tightly related factors, a correlation would be expected to

remain constant, or even weaken as the data becomes less granular. Thus, since G+C

was measured in one thousand base pair blocks, if there was a strong relationship

between G+C and biased clusters, that relationship might be expected to fall as the

bin sizes rose from 10kbp to 1mbp. Yet this is not the case, suggesting that any

relationship between G+C and UBCS acts over regions of hundreds of thousands of

bases. However, since the recombination rate data is limited to a granularity of

1mbp, the best correlation possible should be found at 1mbp, as is seen.
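
The effect of bin size on a correlation can be illustrated with a minimal sketch. The function names (pearson_r, rebin, correlation_by_bin_size) and the toy signals are hypothetical; this is not the code or the data behind Figure 22. The example only shows how two signals that share a broad regional trend, but differ at fine scale, become more strongly correlated as fine bins are averaged into coarser ones.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient R of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rebin(values, factor):
    """Average each run of `factor` fine bins; a trailing partial run is dropped."""
    usable = len(values) - len(values) % factor
    return [sum(values[i:i + factor]) / factor for i in range(0, usable, factor)]


def correlation_by_bin_size(signal_a, signal_b, factors):
    """Map each re-binning factor to the R obtained at that bin size."""
    return {f: pearson_r(rebin(signal_a, f), rebin(signal_b, f)) for f in factors}


# Toy data (made up): a shared regional trend plus opposite fine-scale noise.
trend = [i // 2 for i in range(16)]
noise = [2 if i % 2 == 0 else -2 for i in range(16)]
signal_a = [t + e for t, e in zip(trend, noise)]
signal_b = [t - e for t, e in zip(trend, noise)]

r_by_factor = correlation_by_bin_size(signal_a, signal_b, [1, 2, 4])
```

With these toy values R is about 0.14 at the finest bins and rises to 1.0 once neighboring bins are averaged, which is the behavior described above for factors that are related only at a regional scale.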

Figure 21. Correlations of Biased Clustering on Chromosome 18. In order to elucidate the relationships of biased clusters of substitutions, each chromosome was mapped and correlated with different factors. For chromosome 18, which shows strong UBCS, the trend of bias to magnify and exceed G+C extremes can be seen to some degree (orange, upper left). However, the Pearson’s correlation coefficient between G+C content and unexpected biased clustered substitutions is only R=0.257. The relationship with current recombination hot spots (gray) proves closer, with R=0.465. However, it isn’t until mapping of sex differentiated recombination rates that a strong correlation is seen (lower left: smoothed by least squares in 15mbp windows). While the correlation for the smoothed female recombination rate (pink) is R=0.194, the male recombination rate (blue) correlates at R=0.961! This relationship holds up in the unsmoothed data as well (female R=0.160; male R=0.694). These correlations are especially striking when it is realized that only current recombination rates can be measured, while the signal seen in substitutions has accumulated over six million years! [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr18.html]

Figure 22. Correlations of Unexpected Biased Clustered Substitutions Genome Wide. This graph plots the Pearson correlation coefficient R between Unexpected Biased Clustered Substitutions and a number of factors for bins of sizes ranging from 10,000bp to 1,000,000bp. There is no correlation between mean conservation score and unexpected bias (green at bottom), or for the number of bases transcribed in a region (purple). Correlation with G+C content is mild (gold in the middle), rising to R=0.306. The sex-averaged recombination rate (black) rises to R=0.410. However, when the recombination rate is broken down between male and female, a striking difference emerges. While the female recombination rate (pink) peaks at R=0.177, the male recombination rate (blue) shows the greatest correlation to unexpected biased clustered substitutions of any factor. The highest correlation is found at R=0.524 for bins of 1 million base pairs. [http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#CorrU]

The strongest correlation, between unexpected clustered bias and male

recombination rates, holds genome wide and is seen on all but one of the autosomes

for which there is complete data. (Chromosome 2 is the exception, which is discussed

below.) However, it is clear that this relationship is largely due to the events that

occur in the telomeres. Indeed, the correlation advantage of male over female

recombination rates is completely wiped out if 18mbp of telomeric ends are removed

from all of the chromosomes. However, the dependence of the correlation on

telomeric location is not enough to discount the importance of this relationship. That

male recombination rates are elevated at the telomeres is well established[49] and a

correlation between BGC and male recombination has also been documented in

human Alu repeats.[50] Most tellingly, a relationship between male recombination

rates and bias would explain the X exception very nicely. Male recombination rates

are a measure of recombination in the creation of male gametes. If biased clusters of

substitutions are the result of biased gene conversion, and this only occurs during

male recombination, then the X chromosome should be largely unaffected by

clustered bias.
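
The telomere-removal test described above can be sketched as follows. Everything in this block is an illustrative assumption: the helper names, the 60mbp chromosome, and the made-up signals are not the thesis data. The sketch only shows the mechanics of discarding bins within 18mbp of either chromosome end and recomputing the correlation on what remains.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient R of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def interior_bins(n_bins, bin_size, chrom_length, margin):
    """Indices of bins lying entirely at least `margin` bp from both ends."""
    return [i for i in range(n_bins)
            if i * bin_size >= margin
            and (i + 1) * bin_size <= chrom_length - margin]


BIN = 1_000_000           # 1mbp bins, as used for the recombination rates
N_BINS = 60               # hypothetical 60mbp chromosome
MARGIN = 18 * BIN         # telomeric ends to discard


def telomeric_boost(i):
    """Made-up signal that rises linearly over the last 18 bins at each end."""
    return max(0, 18 - min(i, N_BINS - 1 - i))


# Both toy signals share the telomeric rise but are unrelated in the interior.
ubcs = [10 + telomeric_boost(i) + (i % 3) for i in range(N_BINS)]
male_rate = [1 + 0.1 * telomeric_boost(i) + 0.2 * (i % 2) for i in range(N_BINS)]

keep = interior_bins(N_BINS, BIN, N_BINS * BIN, MARGIN)
r_all = pearson_r(ubcs, male_rate)
r_interior = pearson_r([ubcs[i] for i in keep], [male_rate[i] for i in keep])
```

In this constructed example the whole-chromosome correlation is strong, while the interior correlation is essentially zero, because the only shared structure was placed at the ends.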

3.3.5 Following the Footprints of Past Recombinations

By now the evidence has mounted that biased clusters seen in this research are most

easily explained by the BGC hypothesis. It appears that biased clusters of

substitutions may be thought of as footprints left behind in the genetic record;

footprints which record the recombination events of the past six million years. If this

were true, then it may be possible to use these footprints to date certain events. Just

as fossilized footprints may be dated by what is found above or beneath them, it may

be possible to date genetic events by the BGC footprints that fall above or below

them. This trail leads us back to the fusion of chromosome 2. In Figure 15 and

Figure 19, we saw a peak of bias signal in chromosome 2 which divides it neatly

between ancestral chromosomes 2a and 2b. The simplest explanation is that this

strong signal formed before the chromosomes fused, and that male recombination

enabled BGC is strongly affected by positions near telomeres. However, an

alternative explanation is that some feature other than telomeric location causes

heightened BGC and the typical “winged” autosomes. Thus, after the fusion, male-

recombination mediated BGC may have still occurred in the region of the fusion. If

this were so, then we might still find heightened male recombination rates in the

neighborhood of the chromosome 2 fusion. In Figure 23, we can see that male

recombination rates are higher at the telomeres of chromosome 2, but are not

currently higher near the fusion point. This is reflected in the lowest correlation (raw

UBCS R=0.285; genome wide: R=0.524) between male recombination and biased

clusters, of all autosomes for which complete data is available. This does not rule out

the possibility that male recombination remained elevated after the fusion but has

cooled off by now. The simpler explanation, though, is that most of the heightened

bias accumulated prior to the fusion when the regions were telomeric. Thus, Occam’s

Razor applied to the UBCS signal and current male recombination rates for

chromosome 2 suggests the fusion is likely to have occurred relatively recently.

Figure 23. Mapping UBCS, G+C Content and Recombination Rates on Chromosome 2. Comparison of the smoothed curves of unexpected clustered bias and both male and female recombination rates is revealing. It is clear that the female recombination rate (pink) shows a mild elevation near telomeres and a steep decline in the last 10mbp from the chromosome tips. The male recombination rate, however, is strongly elevated at both telomeres on this and every other autosome. However, while unexpected biased clustered substitutions are most frequent in the region of fusion of chromosomes 2a and 2b, male recombination rates are not elevated here. While it is possible that this central peak is the result of recombination events that occurred after the fusion, a simpler explanation is that recombination rates changed immediately after the fusion and thus the fusion occurred relatively recently. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr2.html#MapUR]

3.3.6 Seeing Ghosts?

It is possible that other factors besides male recombination are also involved in

producing the UBCS signal. One method to reveal a secondary relationship is to

measure the amount of the UBCS signal that is unexplained by a correlation to male

recombination, and compare that secondary or residual signal to other factors. In

Figure 24, the residual signal is plotted against four other factors. We can see that

female recombination explains virtually none of the UBCS that isn’t already

explained by male recombination. The original weak correlation of female

recombination rate to UBCS appears to have been a ghost of the correlation between

the female and male recombination rates. Thus, it seems abundantly clear that the

X exception is due to UBCS accumulation in male but not female gamete production.

Also it is clear that unexpected biased clusters are not a product of positive selection,

as conservation is unrelated to the original UBCS signal (Figure 22) or the residual

signal (Figure 24) after male recombination.
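
The residual method can be illustrated with a short sketch. The helper names and all numbers below are made up for illustration, and the thesis plots use normalized coordinates, which this sketch does not attempt to reproduce. The toy UBCS signal is built from a male rate plus a smaller G+C effect, and the female rate is constructed to track the male rate and nothing else.

```python
def fit_line(xs, ys):
    """Least-squares fit ys ~ intercept + slope * xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Part of ys left unexplained by the best-fit line on xs."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up bins: UBCS driven mostly by the male rate, with a smaller G+C effect.
male = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gc = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]
female = [0.3 * m + 1.0 for m in male]   # tracks the male rate, nothing more
ubcs = [2.0 * m + 0.5 * g for m, g in zip(male, gc)]

leftover = residuals(male, ubcs)
raw_slope_female, _ = fit_line(female, ubcs)       # looks related on its own
slope_vs_female, _ = fit_line(female, leftover)    # vanishes in the residual
slope_vs_gc, _ = fit_line(gc, leftover)            # survives in the residual
```

Here the female rate appears related to the raw signal only because it follows the male rate; once the male fit is removed, its slope against the residual is zero while the G+C slope remains.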

However, one factor that does play a role beyond male recombination is the G+C

content of a region (Figure 24, upper right). How this influence is effected is hard to

say. It should be understood that male recombination could be the event that causes

UBCS, while the G+C content of the region influences the process of that event, and

the probability that UBCS occurs in a particular instance. Another possible

explanation for this secondary relationship may be that G+C has accumulated over

hundreds of millions of years, while recombination rates represent a snapshot of the

current state of recombinations. The currently measured rates might represent an

incomplete picture of the force that has led to accumulation of biased clustered

substitutions over the last 6 million years. A more complete picture of that force may

involve a slightly different distribution of recombination rates. It is well known that

recombination hot spots change over time[23, 24], and that G+C and recombination

rates are correlated.[19] Additionally, while chimp and human UBCS signals are

essentially parallel, there are differences which may illustrate the degree of

fluctuation of recombination rates over time. It seems likely that the fluctuations in

the last 6 million years occurred in regions of higher G+C. It is also possible that the

reason these regions have higher G+C is due to the rising and falling of male

recombination rates in these regions for hundreds of millions of years. This model

suggests that the secondary relationship of G+C content to UBCS is not a cause, but

an effect! And the correlation between G+C content and residual UBCS may simply

be the ghost of BGC past.

Figure 24. Effects of other Factors beyond Male Recombination Rates. One way to determine if some additional factor beyond male recombination is related to UBCS is to plot the second variable against the residual signal left unexplained by the first. In this series of plots, the blue line (common to all four) is the best fit between male recombination rate and UBCS genome wide (slope=0.55). The distance between the blue line and the actual data points represents the residual, or unexplained portion of each data point. Residuals are plotted here against four other factors. At the upper left, the female recombination rate does not explain any portion of the remaining UBCS signal (pink, slope=0.01). The G+C content (upper right), on the other hand, does appear to have some relationship to UBCS beyond male recombination (orange, slope=0.17). One possible explanation may be the historical accumulation of G+C. Selection, as represented by conservation score (lower left), is also not evident (green, slope=0.03). One might expect selection to be negatively correlated; however, selection may affect too few alleles to be noticeable. Finally, transcription density, as measured by the number of bases that fall into a known gene, human mRNA or EST, does not relate to the proportion of UBCS unexplained by male recombination (lower right; purple, slope=0.01). All plots use normalized coordinates. [Available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub17_MapZip_chr0.html#ResUMT]

3.4 Humans have been Molded by Bias

If, as the evidence suggests, biased clusters of substitutions are created by

recombination events acting upon clusters of Single Nucleotide Polymorphisms, then

the result should be novel collections of single point mutations which have not

existed in either parent chromosome. That is to say, the two clusters of SNPs found

in the parent chromosomes will have already been tested by evolution, but the newly

recombined cluster of primarily biased SNPs will very likely represent a new and

untested allele. While we may be able to assume that the majority of the individual

SNPs will not be overly harmful by themselves, it is not fair to assume that of the

newly minted collection. If the likelihood that a single mutation is harmful far

outweighs the likelihood that it is beneficial, then an even larger threshold of potential

harm should confront any new cluster of biased mutations. While many biologically

important sequences can tolerate minor deviations, it seems implausible to assume

that a novel collection of five or more changes will be equally benign. However, as

we can see, many examples of biased clusters have been fixed into the human

genome. It is likely that the vast majority of these clusters are of little or no

consequence, falling into the 95% of the genome which shows little sign of being

conserved and whose sequences may serve no critical function in the evolution of

humans. However, it should be expected that when BGC results in a novel cluster of

biased SNPs that fall in an important gene, the most likely result will be that negative

selection removes the cluster from the gene pool. At the same time, it is conceivable

that a few newly minted biased clusters might prove beneficial and might sweep the

genome at an even faster pace than BGC alone would account for.

An attempt to find the regions of greatest clustering of biased substitutions was

undertaken. This was done in a method agnostic to the conservation scores of the

regions. Thus, with only about 5% of the genome showing conservation, roughly one out of the top twenty

regions should be expected to fall into a conserved stretch of DNA. Given the further expectation that a biased

cluster newly arising in a gene would most likely be subject to negative selection, few

genes were expected in the ranking of most biased regions. Instead, the list was rich

with genes and some of the very top scoring regions fell in genes affecting the human

brain! In Table 2, the top 10 scoring regions of biased clustering can be seen. All

have a high density of substitutions which are overwhelmingly from weak to strong

base pairs. At least four and perhaps six of ten fall into genes. Four of these affect

brain development or function. Several of these regions deserve special treatment

here.

Rank  Location in hg18             Len   Subs  Weak to Strong  P Value    Significance of Region
1     chr2:113977236-113978604     1369  74    61              1.997E-06  pred:NT_022135.55[51]
2     chr9:2612396-2613708         1313  48    43              9.852E-06  VLDLR (Intron 1)[52]
3     chr20:61203595-61204231      637   37    34              2.543E-04  HAR1 (RNA gene)[3]
4     chr2:117529420-117530267     848   43    38              3.684E-04  pred:NT_022135.102[51]
5     chr2:115136243-115136977     735   39    35              6.744E-04  DPP10 (Intron 1)[53]
6     chr3:213300-214258           959   42    37              7.539E-04  CHL1 (UTR Exon 1)[54]
7     chr2:118333477-118334224     748   32    30              8.011E-04  HTR5B (Exon 1)[55]
8     chr8:1752398-1753394         997   38    34              0.001401   mRNA AF123758[41]
9     chr2:113630962-113631930     969   43    37              0.003132   EST BM926122[41]
10    chr13:112033076-112033732    657   26    25              0.004770   EST BG722997[41]

Table 2. The Top Regions of Biased Clustered Substitutions in Humans. In this list of the regions showing the most substitution bias, a surprising number of genes appear. The second region falls at the beginning of intron 1 of a very low density lipoprotein receptor that may interact with Reelin (implicated in autism, bipolar disorder and schizophrenia). Third is HAR1 discovered by Katie Pollard in this lab, which codes an RNA molecule implicated in fetal brain development. The sixth lies in the first 5’ UTR exon of CHL1, a member of the L1 gene family of neural cell adhesion molecules and is highly active in fetal brains. The seventh on this list falls within exon 1 of human HTR5B, a serotonin receptor which is a defunct gene in humans and chimps, but may very well have been active in our common ancestor and inactivated independently on each lineage (see text). The biased changes to all four of these regions affect human brain development or function. Perhaps less significant is number five which falls into the first intron of DPP10, mutations of which have been associated with asthma; and number eight which lies in the second intron of a putative transmembrane protein that may be implicated in a neurodegenerative disorder. The top region on this list, though not part of a known gene, tells a cautionary tale of its own. [Full list available at: http://www.cse.ucsc.edu/research/compbio/ubcs/sub18_d32.html]

3.4.1 Fastest Evolving Region of the Human Genome

The third most biased region found in humans, hg18.chr20:61203595-61204231, was

also discovered in a separate search for the fastest evolving, most conserved regions

of the human genome, by K. Pollard from this lab.[2, 3] In order to be considered as

part of that study, sequences had to be highly conserved between humans, chimps,

mice and rats, yet have multiple substitutions which occurred on the human line.

That search was agnostic to bias, yet resulted in a number of biased regions

occupying the top spots on the list. The very top scorer was dubbed HAR1 (human

accelerated region 1). Unknown as a gene at the time it was first identified, it has

subsequently been shown to code for an RNA which is expressed in fetal brains.

3.4.2 Serotonin Receptor Knocked Out in Humans and Chimps

The seventh most biased region falls into the pseudo-gene HTR5B, which is an active

gene in mice, but non-functional in both humans and chimps and may have been

destroyed by BGC. In mice, HTR5B translates into a functional 370 amino acid

protein which acts as a serotonin receptor.[55] It is characterized as a G-protein

coupled receptor with a single conserved domain ("7tm_1") containing 7

transmembrane helices. In addition, it has a single disulfide bridge and an N-linked

glycosylation site. The mouse DNA sequence (mm8.chr1:123337934-123355639 –

strand, February 2006 assembly) has 2 exons separated by an intron of approximately

17,000bp. This gene is likely to be non-functional in both humans[56]

(hg18.chr2:118333816-118377624 March 2006 assembly) and chimps

(pt2.chr2b:118470164-118515327 Mar. 2006 assembly), but for different reasons. In

chimps, a single deletion results in a frame shift in exon 1 at approximately 358bp

into the translatable sequence. In humans, a 35bp near tandem duplication and single

deletion in exon 1 at roughly 375bp into the translatable sequence results in a frame

shift. It is clear that these two frame shift events are not shared by humans and

chimps, so the question becomes whether the most recent common ancestor had a

functional HTR5B gene.
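
The frame shift reasoning can be checked with simple arithmetic: insertions and deletions preserve the downstream reading frame only if the net change in length is a multiple of three. The sketch below is illustrative only. The function name is hypothetical, and it assumes that each "single deletion" mentioned above removes exactly one base, which the text does not state explicitly.

```python
def shifts_frame(indel_lengths):
    """Return True if the net length change is not a multiple of three.

    Lengths are signed: insertions or duplications positive, deletions negative.
    Only the reading frame downstream of all listed events is considered.
    """
    return sum(indel_lengths) % 3 != 0


# Assumption: each "single deletion" removes exactly 1 bp.
chimp_exon1 = [-1]        # single deletion
human_exon1 = [+35, -1]   # 35bp near tandem duplication plus single deletion

chimp_shifted = shifts_frame(chimp_exon1)
human_shifted = shifts_frame(human_exon1)
```

Under that assumption the net change is -1bp in chimp and +34bp in human; neither is a multiple of three, so both lineages end up out of frame, by different routes.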

There is a homologous sequence in Rhesus macaque (rh2.chr13:124036266-

124076728 Jan. 2006 assembly), which is not addressed in the literature. It contains

no frame shifts and GeneWise[57] predicts a protein sequence given the mouse

HTR5B protein. The predicted protein sequence for macaque does include the

conserved domain 7tm_1 and all transmembrane helices appear to be intact.

The predicted disulfide bridge is broken, however, while the N-linked glycosylation site

remains.

A sequence was generated using simple parsimony based upon the human and chimp

sequences using macaque and mouse as out-groups. It covered the entire HTR5B

gene including intron and approximately 150bp upstream and 500bp downstream and

represents the most recent common ancestor of humans and chimps. Unlike the case

for human and chimp DNA sequences, GeneWise can find a homologous protein

sequence in the human-chimp ancestral DNA, using the mouse HTR5B as a model.

Using CDD[58], the predicted human-chimp ancestral protein sequence scores slightly

better than the predicted macaque protein in finding the 7tm_1 conserved domain. In

addition, the predicted human-chimp ancestral protein retains the N-linked

glycosylation site. But unlike the predicted macaque protein, it retains the ability to

form a disulfide bridge between extra-cellular regions 1 and 3. Thus, from this analysis, the HTR5B

gene of the most recent common ancestor of humans and chimps may be considered at least as intact as

that of Rhesus macaque.
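
A per-site parsimony reconstruction of this kind might be sketched as below. The exact rules used for the thesis are not given in the text, so the tie-breaking order here (macaque consulted before mouse, "N" where no out-group agrees) is an assumption, as are the function names and the toy alignment. Gap characters are treated as ordinary symbols for simplicity.

```python
def ancestral_base(human, chimp, macaque, mouse):
    """Infer the human-chimp ancestral base for one aligned column.

    If human and chimp agree, that base is taken. Otherwise the nearer
    out-group (macaque) decides, then mouse; "N" marks an unresolved site.
    """
    if human == chimp:
        return human
    for outgroup in (macaque, mouse):
        if outgroup in (human, chimp):
            return outgroup
    return "N"


def reconstruct_ancestor(human, chimp, macaque, mouse):
    """Apply ancestral_base column by column to four aligned sequences."""
    if not len(human) == len(chimp) == len(macaque) == len(mouse):
        raise ValueError("aligned sequences must have equal length")
    return "".join(ancestral_base(*column)
                   for column in zip(human, chimp, macaque, mouse))


# Toy 5bp alignment (made up), not HTR5B sequence.
ancestor = reconstruct_ancestor(
    human="ACGTA",
    chimp="ACCTG",
    macaque="ACCTT",
    mouse="ATCTG",
)
```

In the toy alignment the third column is resolved by macaque and the fifth by mouse, giving the ancestral string ACCTG.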

There are 18 substitutions in the human-chimp ancestral sequence since divergence

from macaque, only 4 of which are weak to strong. Ten amino acid changes would

have resulted from these substitutions, 5 in the first extra-cellular region, one in extra-

cellular region 3 and 4 more in transmembranous regions. None of the

transmembranous changes appear to be disruptive. Only two changes seem

significant: serine to arginine and glutamic acid to glycine, both in extra-cellular

region 1. They have the effect of removing a negative charge and adding a positive

charge for a net gain of +2. It is not clear that this would disrupt function of a

putative human-chimp ancestral serotonin receptor.
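
The charge bookkeeping used here can be made explicit with a small sketch. The charge table is a common simplification (lysine and arginine +1, aspartic and glutamic acid -1, histidine treated as neutral at physiological pH), and the function name is hypothetical.

```python
# Approximate side-chain charges at physiological pH (simplifying assumption:
# histidine and all other residues are treated as neutral).
SIDE_CHAIN_CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge_change(substitutions):
    """Sum the charge change over (old_residue, new_residue) one-letter pairs."""
    return sum(
        SIDE_CHAIN_CHARGE.get(new, 0) - SIDE_CHAIN_CHARGE.get(old, 0)
        for old, new in substitutions
    )


# Serine to arginine and glutamic acid to glycine, as described in the text.
ancestral_changes = [("S", "R"), ("E", "G")]
print(net_charge_change(ancestral_changes))
```

The same function returns zero for a lysine to arginine change, which is why such a substitution is regarded as chemically conservative.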

Further evidence may come from a closer look at the chimp and human pseudo-genes.

There are 51 substitutions in the human sequence, but no way to determine when each

substitution occurred relative to the tandem repeat and deletion after which the human

HTR5B gene was definitely out of commission. The bulk of the substitutions (38)

occur in the first exon and 35 of these are weak to strong, which accounts for this

sequence being one of the top scorers in the list of high density biased regions (Table

2). The human gene lies in the region of chromosome 2 which was telomeric before

the fusion and sustained a substantial amount of unexpected biased clustered

substitutions. The other 13 substitutions in human HTR5B are in exon 2, and only 7

of those are weak to strong. Clearly some process was acting on exon 1 which did

not affect the more distant exon 2. Of this biased group of substitutions in exon 1, 30

would have changed amino acids if this were still a functioning gene at the time of the

mutation. Many of these changes would have been sterically or chemically

significant including adding 7 (+) charges and 3 (-) charges. Especially hard hit is

transmembranous helix 5 which would have 3 (+) charges added. In stark contrast,

only 1 of the 13 substitutions for exon 2 would change an amino acid and that a

Lysine to Arginine (which are chemically and sterically similar). This alone suggests

that exon 2 was under selective pressure to remain intact during the time of these

substitutions. This would lend weight to the hypothesis that HTR5B was functioning

in the human line more recently than the human-chimp split, though it is no longer a

functional serotonin receptor.
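
The contrast between the two exons can be checked with a one-sided Fisher's exact test on the counts quoted above (35 of 38 substitutions weak to strong in exon 1 versus 7 of 13 in exon 2). The test is not part of the original analysis; it is offered only as a minimal sketch, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the hypergeometric probability of seeing at least `a` in the
    top-left cell when all row and column totals are held fixed.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


# Rows: exon 1, exon 2. Columns: weak to strong, all other substitutions.
p_value = fisher_one_sided(35, 38 - 35, 7, 13 - 7)
print(f"one-sided p = {p_value:.4f}")
```

Under these assumptions the probability comes out well below 0.05, in line with the claim that exon 1 experienced a process that exon 2 did not.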

As for the fate of the chimp gene, only 12 substitutions occurred and all of them are

in the first exon, 8 being weak to strong. The process that led to many biased

mutations in the human line seems not to have acted as strongly here. These

substitutions would have changed 9 amino acids if this were a functioning gene.

None of the changes stand out as necessarily disruptive. However, an addition of 2

(+) charges to the extra-cellular end of the protein seems significant. The

substitutions and resulting protein sequence changes are summarized in Table 3. The

fact that the human-chimp ancestral protein already sustained a net gain of +2

suggests that either a higher positive charge was beneficial, or this protein was

already broken by this time. There is some evidence that electron donation from

ligand is a part of binding in the serotonin receptor.[59] In addition, positive charges

might repel Na+ ions, keeping the receptor surrounds clear for serotonin molecules.

On the other hand, the receptor may be less receptive immediately following an

action potential.

                        Human-Chimp Ancestor   Human                   Chimp
Point Substitutions     18                     51                      12

Exon 1
  Substitutions         18                     38                      12
  Weak to Strong        4                      35                      8
  Changed Amino Acids   10                     30                      9
  Significance          +2 extra-cellular      many including TM5:+3   +2 extra-cellular

Exon 2
  Substitutions         0                      13                      0
  Weak to Strong        -                      7                       -
  Changed Amino Acids   -                      1                       -
  Significance          -                      K→R little effect       -

Table 3. Predicted Changes to a Possible HTR5B Protein Due to Point Substitutions Occurring in 3 Lines. It is unclear whether there was a functioning gene in the most recent common ancestor to humans and chimps. However, the lack of changes in exon 2, and especially the synonymous changes in human exon 2, suggest that purifying selective pressure occurred after humans split from chimps. The human line shows strong evidence of BGC, which may have been responsible for destroying the functionality of this gene, even before the insertions and deletion created an early stop codon. The chimp gene may also have suffered BGC, though the evidence is less convincing. Again, it is unclear whether BGC may have destroyed the putative chimp gene before a deletion introduced a stop codon in a different place than is seen in the human pseudo-gene.

It is noteworthy that in the most recent common ancestor and in chimps, exon 2 was

not changed, while in humans the one change wasn’t significant, as summarized in

Table 3. It is also dramatic how many changes have occurred in the first exon since

our ancestor split from macaques and then again since humans and chimps split. It is

possible that there were destabilizing changes and selective pressure acting on this

area of a functioning protein. However, this is entirely speculative. HTR5B may

have been knocked out in the human-chimp ancestor by a change not examined here,

such as disruption of a promoter. One thing that seems certain is that the exon 1

region of the human (pseudo)gene experienced an unexpectedly large number of

biased clustered substitutions. Given its location in chromosome 2 and the large

number of debilitating biased substitutions it sustained, it is quite plausible that

functionality of the human HTR5B gene was destroyed by BGC.

3.4.3 Mistakes Were Made

Unfortunately, the top scoring region of biased clustering turns out to be a complex

story that is most likely a false positive. This region is one of a set of four sequences

within the human genome that align to each other and have also been identified by

FISH analysis.[60] This “syntenic block” involves sequences which range in size from

160,000bp to 200,000bp and achieve high similarity scores representing more than

90% identity. However, only one of these four regions shows evidence of significant

biased change, and this in a much narrower range of about 1400 bases. Moreover,

while all four regions are quite clearly identified in the March 2006 assembly of the

human genome, only three are found in the January 2006 assembly of the chimp

genome. The result is an alignment error between humans and chimps, and further

error when aligning with the macaque assembly. The large number of weak to strong

changes clustered together in the top scoring region may be due to the same

telomeric process characterized in this study. The region is substantially altered in a

biased way, when compared to at least two of the other regions in the human genome

from which it was originally duplicated. However, the biased changes may have

accumulated over a much longer time than the six million years since the human and

chimp lines split. An alignment error may result in a false positive, not necessarily in

revealing a biased process but in locating that process temporally. The pitfalls of

misalignment, which should not be widespread when comparing two genomes with

greater than 98% identity[1], do suggest that any list of the most biased regions should

be filtered for self-aligning sequences or repetitive regions. To that end, a final list of

top scorers was generated by filtering out regions with higher probability of

alignment errors (available at

http://www.cse.ucsc.edu/research/compbio/ubcs/sub18_d32nr.html).

3.4.4 Biased Clusters Are Transcribed

It has already been shown that there are a surprising number of genes among the top

scoring biased cluster regions. If we look for evidence of transcription, the

relationship is hard to ignore. In the top 200 regions of clustered bias, 108 or 54%

occur in known genes.[38] If the net is widened to include evidence of transcription in

humans in the form of mRNAs and ESTs[39, 40], 80.5% of the top 200 regions are

caught. By comparison, random regions fell into known genes 32.5% of the time and

showed evidence of transcription 57.5% of the time as seen in Table 4. While

“known genes” includes introns, the biased regions are also disproportionately found

in exons. While only 3% of randomized regions overlap exons, 15.5% of the top 200

biased regions do. Further, while less than 10% of randomized regions that are in

known genes actually overlap exons, more than 28% of the most biased regions do.

While this remarkable relationship is evident in the top regions, there is no correlation

seen between the density of transcribed bases and UBCS (Figure 22). Thus, it would

appear that these top regions have been selected for their influence on transcribed

regions. But there is no corresponding evidence of this selection seen in the residual

signal after male recombination has been accounted for (Figure 24). This could be

due to two things. Either the number of selected regions is too small to influence the

genome wide relationship seen in Figure 24, or the effect of transcribed bases is

already accounted for in male recombination.

Transcription Evidence             Top 200 Regions    Top 200 Regions    Bottom 200         200 Regions with
                                   of WtoS from       of WtoS from       Regions StoW       Randomized
                                   Filtered List      Unfiltered List    Unfiltered         Locations

Known Genes[38]                    123 (61.5%)        108 (54%)          69 (34.5%)         65 (32.5%)
Exon                               36 (18%)           31 (15.5%)         12 (6%)            6 (3%)
Coding Exon                        18 (9%)            18 (9%)            7 (3.5%)           5 (2.5%)
Human mRNA[41]                     24                 23                 30                 23
Human EST[41]                      21                 30                 32                 27
non-human mRNA                     6                  9                  14                 10
non-human EST                      25                 29                 54                 58
No Transcription Evidence          1                  1                  1                  17
Transcription Evidence in Humans   168 (84%)          161 (80.5%)        131 (65.5%)        115 (57.5%)

Table 4. Evidence of Transcription Among Biased Regions. When the top 200 most biased regions of substitutions in the human genome are examined, it is clear that they are unusually likely to be found in known genes or transcribed in humans. A surprising 54% of the top regions are found in known genes[38], while filtering out regions found with repeats or self-aligning sequences raises the known gene count to 61.5%. By contrast, only 34.5% of the regions at the bottom of the list and 32.5% of random regions are found in known genes. The bottom regions are biased strong to weak, while random regions were generated by taking the top 200 clusters and randomizing their locations within the genome.
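
The percentages quoted from Table 4 can be re-derived from the raw counts with a short sketch. It assumes that the six evidence categories (known genes, human mRNA, human EST, non-human mRNA, non-human EST, no evidence) are mutually exclusive, which is inferred here from the fact that each column sums to 200, and that exon overlaps are a subset of the known gene count. The dictionary keys and the function name are hypothetical.

```python
# Counts per column of Table 4 in the order: known genes, human mRNA,
# human EST, non-human mRNA, non-human EST, no transcription evidence.
TABLE4 = {
    "top200_filtered": {"categories": (123, 24, 21, 6, 25, 1), "exon": 36},
    "top200_unfiltered": {"categories": (108, 23, 30, 9, 29, 1), "exon": 31},
    "bottom200_stow": {"categories": (69, 30, 32, 14, 54, 1), "exon": 12},
    "random200": {"categories": (65, 23, 27, 10, 58, 17), "exon": 6},
}


def summarize(column):
    """Recompute the summary percentages for one column of the table."""
    genes, human_mrna, human_est, *_ = column["categories"]
    total = sum(column["categories"])
    return {
        "total": total,
        "known_gene_pct": 100 * genes / total,
        "human_transcribed_pct": 100 * (genes + human_mrna + human_est) / total,
        "exon_pct_of_gene_regions": 100 * column["exon"] / genes,
    }


for name, column in TABLE4.items():
    print(name, summarize(column))
```

The last field is the share of gene-resident regions that overlap an exon, which is the quantity compared in the text for the most biased and the randomized regions.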

3.4.5 Currently Bias May be Leading to Thrill Seeking and Disease

In Figure 6, the evidence for biased clusters among single nucleotide polymorphisms

was examined. While the evidence for biased clusters of substitutions is dramatic,

there seems to be little corresponding pattern for SNPs. Nevertheless, the genome

wide heat map (Figure 6, lower right) does suggest the existence of some biased SNP

clusters. Therefore, a list of the most cluster biased regions of SNPs was generated

for hg17 as seen in Table 5. Three of the top 5 regions are found in genes expressed

in the brain, and two of these are implicated in cancers. The top scoring region falls

in the intron of a gene required for chronic pain perception.[61] While there is no

evidence that the biased changes in these three regions are currently being selected

based upon relative fitness, these are just the sorts of complications that might be

expected when novel alleles occur in genes. It is also possible that the positive

“selection” caused by BGC at a pervasive recombination hot spot, might compete

directly with negative fitness selection on a dangerous new allele. The result could be

a slowing of the purifying fitness selection that should ultimately exclude the gene

from the genome. It is even possible that a weakly negative biased cluster allele will

overcome the fitness hurdle and be fixed in the genome. This possibility illustrates

the innate danger of BGC based selection. In any case, it is clear that biased clusters

of SNPs are occurring in important sequences in the human genome today.

     Location in hg17            Len   SNPs   Weak to Strong   P Value     Significance of Region

1    chr11:84993092-84993239     148   8      8                5.107E-04   CR749820 (Intron 2)[62]
2    chr6:122814786-122814895    110   6      6                0.003397    TDE2 (Exon 5)[63]
3    chr3:79931893-79932009      117   6      6                0.003397    pr:NT_022459.266[51]
4    chr8:3474808-3474931        124   6      6                0.003397    pr:CSMD1 (Intron 7)[64]
5    chr9:29893639-29893780      142   6      6                0.003397

Table 5. Most Biased Clusters of SNPs found in the May 2004 release of the Human Genome. The top two regions are both expressed in the human brain, while the homolog of the fourth region is expressed in the mouse brain. More ominously, region 2 is “Tumor Differentially Expressed Protein 2”, associated with lung and liver cancers[65], while region 4 may be a cause of oral and oropharyngeal squamous cell carcinomas.[66] [Full list available at: http://www.cse.ucsc.edu/research/compbio/ubcs/snp17_d32.html]
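
The P values in Table 5 appear consistent with a simple model in which each SNP in a cluster is independently weak to strong with one fixed probability. That model is an assumption inferred from the tabulated numbers, not a restatement of the method used to produce them. The sketch below backs out the per-SNP probability implied by the 6-of-6 rows and uses it to predict the 8-of-8 row; the function name is hypothetical.

```python
def all_biased_probability(p_weak_to_strong, n_snps):
    """Probability that all n independent SNPs are weak to strong."""
    return p_weak_to_strong ** n_snps


# Per-SNP probability implied by the rows with 6 of 6 SNPs (P = 0.003397).
p_implied = 0.003397 ** (1 / 6)

# Predicted P value for the top row (8 of 8 SNPs), tabulated as 5.107E-04.
p_eight = all_biased_probability(p_implied, 8)

print(f"implied per-SNP probability: {p_implied:.4f}")
print(f"predicted P for 8 of 8:      {p_eight:.3e}")
```

If the prediction agrees with the tabulated value for the top row, the two kinds of row are mutually consistent under this model.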

4.0 Discussion

Upon realization that the fastest evolving regions of the human genome have a

surprising bias of weak to strong substitutions, this effort was undertaken to

characterize the dimensions of that bias genome wide. Using a windowing approach,

a clear association of bias with clustering of substitutions was established, and is far

and away the most obvious hallmark of biased substitution seen in this study. In

sharp contrast, G+C content of the surrounding window, conservation score and

location in or near a hot spot are all poor predictors of biased substitution. The

dimensions of biased clustering were documented with “zipper plots” and it was

found genome wide and on all individual human autosomes. By using one definition

of biased clusters, a measure of bias which is unexpected from a purely stochastic

process was mapped genome wide. These “Unexpected Biased Clustered

Substitution” (UBCS) maps illustrate a clear tendency of bias to be strongly elevated

near the telomeres of autosomes, with much smaller, non-telomeric peaks also

evident.
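
For readers new to the terminology, the classification underlying every bias measure in this study can be sketched as follows. Weak bases are A and T and strong bases are G and C; the fraction computed here is a simplified stand-in for illustration, not the UBCS statistic itself, and the function names are hypothetical.

```python
WEAK = {"A", "T"}
STRONG = {"G", "C"}


def classify_substitution(ancestral, derived):
    """Return 'WtoS', 'StoW', or 'other' for a single-base substitution."""
    a, d = ancestral.upper(), derived.upper()
    if a in WEAK and d in STRONG:
        return "WtoS"
    if a in STRONG and d in WEAK:
        return "StoW"
    return "other"  # A<->T or G<->C leaves G+C content unchanged


def weak_to_strong_fraction(substitutions):
    """Fraction of (ancestral, derived) pairs that are weak to strong."""
    if not substitutions:
        raise ValueError("need at least one substitution")
    hits = sum(classify_substitution(a, d) == "WtoS" for a, d in substitutions)
    return hits / len(substitutions)
```

Applied to a window of clustered substitutions, a fraction near 1 corresponds to the strongly biased clusters discussed throughout this thesis.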

4.1 Unexplained Selection

The phenomenon of biased clusters is not evident in Single Nucleotide

Polymorphisms, which strongly argues that selection is acting upon unbiased SNPs to

create biased clusters of substitutions. Despite a less reliable dataset, biased clustered

substitutions are detectable in the UBCS maps of the chimpanzee genome, which

shows that the selection pressure is not limited to humans. However, given that

neither conservation scores nor transcription density is correlated with elevated weak

to strong bias, the selection pressure is not due to the selection of genes or other

regions that are highly conserved. At the same time, local G+C content, and the

larger isochores are only moderately correlated to bias and do not provide an

explanation for the biased clusters found in the genome. This lack of causation

appears to apply in the other direction as well. The selection of biased substitutions is

not currently growing or even maintaining G+C rich isochores. However, the highly

geographical profile of the UBCS maps suggests an explanation for the existence of

isochores. G+C rich isochores may have been created by the accumulation of weak

to strong substitutions in regions near telomeres over hundreds of millions of years.

Current non-telomeric isochores may have been moved from telomeric regions by

chromosomal rearrangements, and this would explain the finding that they are no

longer growing or being maintained.[67] This theory would predict that telomeric

isochores are still strengthening.

The model that most closely matches the patterns of weak to strong bias seen in the

human genome is Biased Gene Conversion. Under that model, recombination events

will lead to heteroduplexed DNA and the pairing of short stretches of ssDNA from

sister chromosomes, with mismatched bases being “repaired” in a biased manner.

The model directly predicts selection of clusters of biased substitutions which is what

is found here. This “selection pressure” is dependent upon recombination events and

recombination hotspots being widespread among members of a population. However,

there is little correlation between the bias found and distance to current recombination

hotspots. This result is consistent with the model that hotspots do not travel but are

eliminated by recombination events.[24] Nevertheless, biased clustered substitutions

accumulated over the last 6 million years are strongly correlated with recombination

rates measured today. If BGC is not the cause of the selection of biased clusters, any

alternative explanation would have to account for this correlation.

4.2 Just like Us

The fact that the UBCS curves for human and chimp cousin chromosomes are

remarkably similar, despite 12 million years of evolutionary distance, is startling.

BGC fixation pressure requires a stable recombination hotspot resident in a

significant number of individuals in a population, in order to provide enough pressure

to push a biased cluster over the fixation hurdle. But individual BGC events may not

act upon the same complement of SNPs, so the widespread penetration of a given

hotspot within a population may not be enough to fix a biased cluster. Clearly, the

size of the population should affect the probability that a given biased cluster will be

fixed. What this suggests is that a BGC selection pressure cannot be expected to be

constant over time as population size varies. Instead, the clustered bias signal found

in humans and chimps may have been fixed in the two genomes long ago, during

population bottlenecks.

However, the fact that our two species, with different population histories, have

similar clustered bias signals across most of the aligned chromosomes seems to

conflict with this view. If the UBCS signal is created in recombination events at hot

spots, and if the locations of hot spots change over time, the two sets of curves should

not be similar. Either the vast majority of the signal was created very soon after the

two lines split, or the two lines must have had very similar population bottlenecks and

hot spot locations through the last six million years. A more likely explanation is that

though hot spot locations are not constant, hot regions in the million base range are.

This interpretation is supported by evidence that regions of high recombination are

persistent even if hot spots are not.[22] There is another implication of the human and

chimp UBCS curve similarity, especially as seen in the similar heights of the telomere

peaks. While population bottlenecks may affect the accumulation of biased clusters,

the effect does not appear to have been large enough to tip the outcome dramatically

in humans and chimps. While it would be irrational to abandon the model of a variable rate

of UBCS accumulation over six million years, the implications of the parallel

human-chimp cousins lead to the useful assumption that the rate of UBCS accumulation has

been “relatively” constant over the last 6 million years of human and chimp

evolution.

4.3 Two Chromosomes Come Together on a Date

The distinct winged pattern of UBCS is universal across all autosomes for which

there is data for both humans and chimpanzees. However, human chromosome 2

stands out with its internal peak, which so clearly delineates the fusion point of

ancestral chromosomes 2a and 2b. The similarity of the UBCS signals between

humans and chimps leads to the suggestion that the fusion of ancestral chromosomes

2a and 2b occurred relatively recently. Using the assumption of a relatively constant

rate of UBCS accumulation during human evolution, a date of fusion may be inferred.

To be clear, the exercise of dating the fusion relies upon a second big assumption.

The strong telomeric signal suggests that location is an overwhelming determinant of

UBCS. While there are non-telomeric peaks to the unexpected biased clustered

substitution signal, and while these peaks can be parallel in human and chimp

autosomes, they are dwarfed by the peaks at the telomeres. From this, it can be

deduced that, though biased clusters may continue to be fixed in the middle of human

chromosome 2, the rate is not likely to be as large as in the telomeric regions. Indeed,

male recombination rate, which most strongly correlates to the UBCS signal across

the genome, and is elevated at the telomeres, is not elevated in the central region of

chromosome 2. Exactly how long it took for the male recombination rate to

extinguish once fusion occurred is hard to predict. But the theory that the drop in

male recombination rate occurred due to the fusion is at least as good as alternatives.

Using the assumptions of a relatively constant rate of UBCS accumulation before

fusion and an immediate drop in UBCS accumulation after fusion, it is possible to

estimate the date of the fusion itself. By comparing the UBCS heights of the

telomeres on chimp chromosomes 2a and 2b, it is possible to predict the heights that

the missing human 2a and 2b telomeres would have reached if the fusion had never

taken place. Additionally, by comparing the UBCS heights of the other human and

chimp cousin telomeres, it is possible to generate a confidence interval. It is useful to

define a telomere region as 17mbp, which is just large enough to account for the

above average UBCS signal found at telomeres. Using the relative telomere heights,

17mbp of smoothed telomere UBCS, and 6 MYA as the point of speciation, the

calculated fusion date is 0.93 million years ago, but with a 95% confidence interval of

anytime in the last 2.86 million years. While this prediction is dependent upon a

17mbp telomeric region, the prediction never exceeds 1.45mya (CI95: 3.24my) as the

telomeric region size grows to more than 60mbp. While dating the fusion with

confidence is not very precise, it is further blurred by imprecision in dating the age of

the human lineage, which has been calculated to a confidence interval of 4.98–7.02

MYA.[68] However, though the predicted fusion date is imprecise, the above

assumptions lead to the conclusion that the fusion almost certainly occurred in the

more recent half of hominid evolution and very possibly since the arrival of the Homo

genus. Clearly, the fusion that created chromosome 2 was not the speciation event

between the human and chimp lines[48], but may have been involved in a more recent

speciation event on the way to making us human.
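
One way to read the two assumptions behind this estimate (a constant rate of UBCS accumulation before fusion and none afterwards) is as a simple proportion, sketched below. The function name and the example heights are hypothetical; the ratio of 0.845 is chosen only because it reproduces the reported 0.93 million years under a 6 MYA split, and it is not a value taken from the thesis data. The actual calculation also involved smoothing and a confidence interval that this sketch does not attempt.

```python
def fusion_date_mya(observed_height, expected_height, speciation_mya=6.0):
    """Estimate when UBCS accumulation stopped at a former telomere.

    Assumes a constant accumulation rate from speciation until fusion and
    no accumulation afterwards, so the observed signal is the fraction
    (speciation - fusion) / speciation of the expected signal.
    """
    if expected_height <= 0:
        raise ValueError("expected_height must be positive")
    ratio = observed_height / expected_height
    return speciation_mya * (1.0 - ratio)


# Hypothetical heights with ratio 0.845, evaluated at the central speciation
# date and at both ends of the 4.98-7.02 MYA interval cited in the text.
for split in (4.98, 6.0, 7.02):
    estimate = fusion_date_mya(0.845, 1.0, speciation_mya=split)
    print(f"split {split} MYA -> fusion about {estimate:.2f} MYA")
```

Because the estimate scales linearly with the assumed speciation date, uncertainty in that date carries over proportionally to the fusion date.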

4.4 The X-Exception: Are Men Really to Blame?

While the autosomes follow a consistent path to cluster bias, the sex chromosomes

stand apart. The clear lack of signal on chromosome Y might seem surprising since

this chromosome has experienced a larger rate of substitution than other human

chromosomes.[73, 74] However, the BGC model, which relies upon recombination,

predicts that Y would show no biased clustering beyond stochastic expectations, and

that is what is seen. Since recombination only happens in the pseudo-autosomal

regions between X and Y, BGC predicts an elevated signal in this region, and indeed,

this is the region of Y with the highest elevation in UBCS, though it is not statistically

significant.

The greatly reduced clustered bias on human chromosome X is an enigma, however.

Under almost all measures, X stands out from the autosomes as an exception when it

comes to biased substitutions. The bias signal is so reduced in human X, it is hard to

be convinced that it isn’t just statistical noise. If recombination events are the driving

force fixing biased clusters, then the signal should be reduced by half, since the vast

majority of the X chromosome is only subject to recombination in mothers. It can be

expected to be further reduced, since it is well documented that the X chromosome

experiences a lower mutation rate than the genome as a whole.[69, 70] Indeed, in the

May 2004 assembly, the non-pseudo-autosomal regions (non-PAR) of the X

chromosome have only 52.3% as many substitutions (per mega-base) as do

autosomes. Together, reduced substitutions and reduced recombination should lead

directly to reduced UBCS for X. Given that two of these three values are known, it is

possible to calculate a historical recombination rate for non-PAR X of only 30.1% of

the rate for autosomes. However, recombination rates are known to be as much as

1.65 times higher in females than males[37], which would translate to non-PAR X

experiencing 62% as many recombinations as the autosomes. And directly measuring

the current recombination rates in the deCODE[37] dataset shows X recombining

50.2% as often as autosomes. Clearly, the reduced UBCS signal seen on the human

X chromosome is not explained by a simple combination of reduced mutation and

reduced recombination on X.
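
The 62% figure above is what results if a chromosome that recombines only in females is compared with autosomes that recombine in both sexes, given a female rate 1.65 times the male rate. That reading is an assumption about how the figure was derived; the sketch below reproduces it and lines up the three estimates for non-PAR X quoted in this paragraph. The function name is hypothetical.

```python
def female_only_fraction(female_to_male_ratio):
    """Share of autosomal recombination expected on a chromosome that
    recombines only in females, given the female/male rate ratio."""
    return female_to_male_ratio / (1.0 + female_to_male_ratio)


expected = female_only_fraction(1.65)

# The three estimates of non-PAR X recombination relative to autosomes.
estimates = {
    "implied by UBCS and substitution counts": 0.301,
    "expected from the female/male ratio": expected,
    "measured in the deCODE dataset": 0.502,
}
for label, value in estimates.items():
    print(f"{label}: {value:.1%}")
```

The gap between the first estimate and the other two is the discrepancy that a simple combination of reduced mutation and reduced recombination fails to explain.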

Unfortunately, substitutions in the human genome alone may not provide a

convincing answer to the question of whether X is experiencing the selective force

that leads to biased clusters of substitutions. One place to seek further evidence is in

other species. While the quality of data is reduced for chimpanzees, there is still clear

UBCS found on all the chimp autosomes, for which there is data. However, there is

no statistically significant UBCS found on chimp chromosome X.

The most satisfying explanation of the reduced signal in X is the strong correlation

between UBCS and the male recombination rate. Indeed, it is the strongest

correlation of UBCS with any other factor examined in this research. If only male

recombination leads to BGC and selection of biased clusters of substitutions, then

there should be no signal on the non-PAR regions of chromosome X. This should be

the nail in the coffin. However, this strong correlation is dependent upon near-

telomere regions. The advantage of male recombination rates diminishes as the

telomere regions are trimmed from the dataset, and is completely wiped out in

smoothed UBCS genome wide, when 18mbp of telomere are removed from each

chromosome. Nevertheless, female recombination is not correlated with the

remaining UBCS signal left unexplained by male recombination (Figure 24).

Therefore, it isn’t easy to explain all the evidence without resorting to male

recombination as the cause of UBCS. Otherwise, it becomes necessary to postulate some force

other than recombination that is driving the fixation of biased clusters of

substitutions. That force would have to be strongly telomeric in autosomes; non-

existent in Y; and greatly reduced, or possibly absent in the X chromosome. While

the small signal of UBCS found in human X is an enigma, a preponderance of the

evidence supports recombination in male gamete production as the driving force

causing biased clusters of substitutions to be fixed in the genome. Any remaining

question of whether BGC occurs in male, but not female recombination, will no doubt

be resolved soon.

Nevertheless, the enigmatic X chromosome shows some symptoms. In Figure 15, X

shows some elevation of unexpected bias at the left (short ‘p’ arm) telomere.

Additionally, the heat map in Figure 17 suggests that there may be some weak signal

in some extreme clusters of X, especially in comparison to chromosome Y. An

explanation may come in the Pseudo-Autosomal Regions (PAR) between X and Y.

While X-X recombinations do not occur in males, X-Y recombinations can occur in

the two PAR regions at both telomeres. These regions show alignment between X

and Y. However, the elevated wing of chromosome X seen in Figure 15 extends

beyond the PAR region of arm p (bias signal ~11mbp; PAR ~2.6mbp). One possible

explanation might be the degradation of the PAR region since the human-chimp split.

The PAR is believed to have extended to Xp22[71] which would be equivalent to a

25mbp PAR on the p arm of the current human X. The loss of genetic material from

Y is well established and the chromosome may dwindle away within 10 million

years.[72] While no unique profile of Y genes or pseudo-genes has been lost in either

humans or chimps, there is evidence of degradation and extensive reorganization[73, 74]

which may have also disrupted PAR alignment and recombination between X and Y

since the human-chimp split. Therefore, it is conceivable that the elevated bias in the

p arm of the X chromosome built up early in hominid evolution, but is no longer

building through much of the ~11mbp length. However, any bias on the p arm of X

is very weak when compared to that on autosomes and is barely elevated above the

majority of the X chromosome, which theoretically has not been subjected to the

influences of male recombination. There is no corresponding evidence of elevated

bias in the PAR region of chimp chromosome X (Figure 20). This could be due to

BGC not occurring in the PAR region or to an early rearrangement of chromosome Y

in chimp evolution. While the question of whether male recombination is the cause

of accumulated bias may be resolved soon, the prospect of using the ambiguous

UBCS signal seen in the X chromosome to chart the evolution of the PAR is not so

promising.

4.5 The Gamble of Male Meiosis

The picture of biased clustered substitutions is coming into focus. The model is male

recombination events leading to biased gene conversion which creates a pressure

towards fixing clusters of biased substitutions into the genome. However, the reason

why male but not female recombination might result in BGC is not clear. A

component of the recombination repair mechanism may be Y linked, or BGC could

be the result of an X linked recessive mutation, or protection from BGC could require

a haplo-insufficient X linked enzyme. Perhaps the expense of rapid male gamete

production precludes the luxury of careful replication. For BGC is not careful

replication, and that suggests the biggest mystery of the ubiquitous bias found in this

study. In a system that results in as few replication errors as one base pair in a

billion[75], how can such sloppiness be tolerated? The creation of novel biased

clusters should be more harmful than beneficial on average. If evolution has created

the mechanism to avoid BGC in females, why isn’t it used in males?

It is well known that mutation rates are higher in males[70, 76, 77], even while

recombination rates are higher in females.[37] That oogenesis should be more

conservative while spermatogenesis less careful appears consistent with the

predictions of sexual selection and parental investment theory.[78] The probability that

an individual ovum will be successful is far larger than the probability for a male gamete.

But the reason for the higher rate of mutations in males, sometimes referred to as “male

driven evolution”[76], is most frequently said to be due to the larger number of cell divisions

occurring in the germ cell lines of males. However, other evidence suggests that the

difference in germ cell generation alone does not explain the male bias in mutation

rates[79, 80, 81], and part of this discrepancy may be due to sloppy recombination. It has

even been argued that sexual selection encourages higher male mutation rates which,

in turn, increase the consequences of sexual selection.[82] But something doesn’t add

up when explaining error prone recombination by male parental investment. Even if

the vast majority of male gametes do not result in offspring, the lucky gamete that

does may still be harmed by a single mutation. It would seem that accurate

replication should be as important in males as in females!

However, while male replication may result in some larger percentage of mutant

gametes, those gametes that arrive at the egg may not be so disadvantaged.[83] It is

possible that sperm competition or intra-mating pair “sperm selection” is the crucible

that tests the newly replicated genomes, and weeds out the more disastrous biased

clusters. Thus, the expense of fidelity in replication is not paid, but the cost is fewer

viable sperm. This trade off still does not make obvious sense. However, if energy is

wasted creating some larger percentage of defective male gametes, it may also be the

case that some minute but still increased percentage of gametes may hold beneficial

mutations. Thus, the sloppiness of BGC may be tolerated in males because gamete

competition and selection allows gambling with the genome. Perhaps the majority of

biased clusters created in male replication will be harmless, while occasionally, a

biased cluster will be harmful and the gamete will not survive. However, in a tiny

fraction of the rolls of the dice, a biased cluster will create some more effective

enzyme which will result in a more efficient sperm, which will result in a big payoff.

Under this model, sexual reproduction runs an evolutionary experiment during every

fertilization! The model of male-based biased gene conversion as a gamble predicts

that there should be more UBCS evidence in species where males create vast numbers

of gametes which are tested by sperm selection or competition. Biased clusters of

substitutions and high male substitution rates might be especially prevalent in species

where a few males produce a majority of offspring, since the winnings of the

reproductive gamble might be even larger, while the losses less significant. Evidence

supporting this can be found in the mutation rates of different species of birds.[82]

Higher mutation rates and even biased clusters might also be found as a female

reproductive strategy in species which produce vast numbers of eggs. On the other

hand, species with a long history of monogamous mating and with few offspring

might be expected to show less evidence of biased clusters of substitutions.

There is a real problem lurking in this theory of why BGC might be tolerated in

males. Unless novel biased clusters are tested before fertilization, there is still an

inherent danger in sloppy replication. If a new biased cluster destroys a gene that is

not even used until fetal brain development, the gamble of male recombination would

result in a very costly loss. However, BGC creates novel biased clusters at

recombination points. There is some evidence that recombination occurs where

transcription is in progress.[84, 85, 86] If this were so, then biased clusters of

substitutions should predominantly arise within transcribed regions of the human

genome, as is seen in the top 200 most biased regions examined in this study.

Supporting this model is the finding that genes appear to lead the accumulation of

G+C in G+C rich isochores.[11] However, such a model would predict that biased

clusters of substitutions should be found predominantly in areas transcribed in male

gamete production. While this has not been shown, there is evidence that widely

expressed housekeeping genes are more often found in G+C rich isochores[87] than are

tissue specific genes. Taken together, these bits of evidence provide a consistent set

of hints as to where to look for the process that results in biased clusters of

substitutions and how that process is tolerated.

4.6 A Thumb on the Scales of Evolution

Perhaps the most intriguing implication of the force causing biased clusters of

mutations to be fixed in the genome is that it acts like non-Darwinian selection.

Darwin’s theory predicts that relative fitness among phenotypes leads to the selection

that drives evolution, and there is abundant evidence that fitness selection is the

overwhelmingly dominant force in functional evolution. Additionally, Darwin

understood that mate preference can also lead to evolution, in the process known as

sexual selection. However, biased clusters of mutations are being “selected”[27, 12]

and fixed in the genome in a widespread pattern that is not easily explained by either

the relative fitness or sexual selection of the individual clusters. Darwinian selection

may tolerate sloppy replication; and relative fitness and sexual selection may

ultimately explain the male reproductive gamble. But most biased cluster alleles are

probably not individually selected by fitness or preference.

The theory of neutral evolution[88] is usually stated as evolution without selective

pressure. It is fueled by the many random mutations and the stochastic process that

fixes some of them into the genome. Neutral evolution can touch phenotype, but as

soon as the change is non-neutral, Darwinian selection supersedes it. But the pressure

of biased gene conversion behaves just like neutral evolution by selection. Many

clusters are being “selected” even though they have no phenotypic effect.

Occasionally biased clusters will have a phenotypic effect; and while many of those

changes will be neutral, clearly some will not be. Therefore, the BGC “selection”

pressure may speed positive selection and compete with negative selection. It is even

possible that a mildly negative phenotype will be fixed into the genome by BGC,

more quickly than purifying selection can remove it.
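
The competition between BGC “selection” and purifying selection described above can be illustrated numerically. The following is only a minimal sketch, not a method used in this study. It assumes Kimura's diffusion approximation for the fixation probability of a new mutation in a diploid population, and it assumes that a conversion bias b simply adds to the selection coefficient s, in the spirit of treating gene conversion as a selection-like force.[27] The function name and all parameter values are hypothetical and chosen purely for illustration.

```python
import math


def fixation_probability(n_e, coeff):
    """Kimura's diffusion approximation for a single new mutation
    (initial frequency 1/(2*n_e)) in a diploid population of size n_e.

    coeff is the effective selection-like coefficient; in this sketch it is
    assumed to be the sum of a fitness term s and a conversion bias b.
    """
    if abs(coeff) < 1e-12:
        # Neutral limit: fixation probability equals the initial frequency.
        return 1.0 / (2.0 * n_e)
    exponent = -4.0 * n_e * coeff
    if exponent > 700.0:
        # Strongly deleterious: probability is effectively zero (avoids overflow).
        return 0.0
    # (1 - exp(-2*coeff)) / (1 - exp(-4*n_e*coeff)), written with expm1 for accuracy.
    return math.expm1(-2.0 * coeff) / math.expm1(exponent)


# Hypothetical illustration values, not estimates from this thesis.
N_E = 10_000   # assumed effective population size
S = -1e-4      # assumed mildly deleterious fitness effect
B = 3e-4       # assumed BGC bias favoring the weak-to-strong allele

p_neutral = fixation_probability(N_E, 0.0)
p_deleterious = fixation_probability(N_E, S)       # purifying selection alone
p_with_bgc = fixation_probability(N_E, S + B)      # same allele, favored by BGC

print(f"neutral:            {p_neutral:.3e}")
print(f"deleterious alone:  {p_deleterious:.3e}")
print(f"deleterious + BGC:  {p_with_bgc:.3e}")
```

Under these assumed values, the mildly deleterious allele is less likely to fix than a neutral one when selection acts alone, but more likely to fix than a neutral one once the assumed bias outweighs its fitness cost. The sketch compares fixation probabilities only; it does not model how quickly fixation occurs.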

The implication is that not all phenotypic changes are meaningful, even when they are

non-neutral. Perhaps the protruding nose of humans is neither functional nor more

attractive in mate selection. Indeed, it is quite possible that many of the traits that

make us most human were made possible through “neutral evolution by selection”.

For instance, a greater vocal range may have appeared by accident and been fixed

into the evolving species for no other reason than that it was the result of a biased

cluster forged in a recombination hot spot. But that greater vocal range was a

necessary precondition for the later evolution of spoken language. In another

example, perhaps a single biased cluster originally triggered enlargement of our

brains; and this change was fixed through BGC pressure, even though it resulted in

more difficult childbirth. The advantages of increased symbolic and temporal thought

may have only arrived after additional changes built upon the foundation of a larger

brain. If the model of “BGC selection” is correct, then studying biased clustered

substitutions may prove to be a very useful tool in distinguishing favorable vs.

tolerated changes to a genome. While both types of changes have made us human,

recognizing the positively selected changes may tell us more about why we became

human. And in one more way, the footsteps of past recombination may allow us to

better understand the sequence of events that have led us to where we are today.

5.0 Conclusion

The model revealed by studying biased clusters of substitutions is captivating. Even

at some distance it can be seen that the force that results in weak to strong

substitutions is counterbalancing the general background bias of strong to weak

mutations. Looking more closely, the force acts predominantly near telomeres and

this may explain isochore creation and why many isochores appear to be losing

ground today. Circling to a different perspective, it may be possible to discern and

date chromosomal rearrangements by measuring the accumulation of weak to strong

bias. Take another step closer, and the motive force creating biased clusters of

substitutions appears to be found in the reproductive strategy of males, and may

ultimately provide insight into the mystery of sexual reproduction. Finally, when we

lean close and look at the very chisel marks of biased clustered substitutions, the

force that chips them into the genome may be recognized as a non-Darwinian

“selection pressure” fueling neutral evolution which has surely helped sculpt humans,

chimpanzees and no doubt a vast number of other species.

Bibliography

1 The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 437:69-87. September 1, 2005. http://www.nature.com/nature/journal/v437/n7055/full/nature04072.html.

2 Pollard K.S., Salama S.R., King B., Kern A., Dreszer T., Katzman S., Siepel A., Pedersen J., Bejerano G., Baertsch R., Rosenbloom K.R., Kent J. and Haussler D. Forces Shaping the Fastest Evolving Regions in the Human Genome. PLoS Genetics. 2(10):e168 Oct. 13, 2006. http://genetics.plosjournals.org/perlserv/?request=get-document&doi=10.1371%2Fjournal.pgen.0020168

3 Pollard K.S., Salama S.R., Lambert N., Lambot M.A., Coppens S., Pedersen J.S., Katzman S., King B., Onodera C., Siepel A., Kern A.D., Dehay C., Igel H., Ares, Jr M., Vanderhaeghen P. and Haussler D. An RNA gene expressed during cortical development evolved rapidly in humans. Nature advance online publication 16 August 2006. http://www.nature.com/nature/journal/vaop/ncurrent/abs/nature05113.html.

4 Hisaji Maki. ORIGINS OF SPONTANEOUS MUTATIONS: Specificity and Directionality of Base-Substitution, Frameshift, and Sequence-Substitution Mutageneses. Annu. Rev. Genet. 36:279–303. 2002. http://arjournals.annualreviews.org/doi/pdf/10.1146/annurev.genet.36.042602.094806?cookieSet=1

5 Eyre-Walker A. Evidence of Selection on Silent Site Base Composition in Mammals: Potential Implications for the Evolution of Isochores and Junk DNA. Genetics, 152:675-683, June 1999. http://www.genetics.org/cgi/content/abstract/152/2/675.

6 Lipatov M., Arndt P.F., Hwa T., Petrov D.A. A Novel Method Distinguishes Between Mutation Rates and Fixation Biases in Patterns of Single-Nucleotide Substitution. J Mol Evol. 62:168–175. 2006. http://matisse.ucsd.edu/~hwa/pub/lipatov06.pdf.

7 Filipski, J., Thiery, J. P. & Bernardi, G. An analysis of the bovine genome by Cs2SO4-Ag+ density centrifugation. J. Mol. Biol. 80:177–197 1973. http://content.febsjournal.org/cgi/content/abstract/84/1/179.

8 Bernardi G. The human genome: organization and evolutionary history. Annu Rev Genet. 29:445-76. 1995. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=8825483&dopt=Abstract

9 Bernardi G Isochores and the evolutionary genomics of vertebrates. Gene 241(1):3-17. January 4, 2000. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10607893&dopt=Citation.

10 Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F. The mosaic genome of warm-blooded vertebrates. Science. 228(4702):953-8. May 24, 1985. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=4001930&dopt=Abstract.

11 Press W.H. and Robins H. Isochores as Extreme Cases of Genes Cooperatively Reshaping the Large-scale Genomic Environment. preprint submitted (submitted November 2005). http://www.lanl.gov/DLDSTP/biopreprint/draft2.pdf.

12 Eyre-Walker A. and Hurst L.D. The evolution of isochores. Nat Rev Genet. 2(7):549-55. July 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11433361&dopt=Abstract.

13 Wolfe K.H., Sharp P.M. & Li W. Mutation rates differ among regions of the mammalian genome. Nature 337:283-285 January 19, 1989. http://www.nature.com/nature/journal/v337/n6204/abs/337283a0.html.

14 Fryxell K.J. and Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol. Biol. and Evol. 17(9):1371-1383. Sept. 2000. http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WVB-45DHT6H-26W&_user=4428&_coverDate=01%2F01%2F1900&_fmt=summary&_orig=search&_qd=1&_cdi=7098&view=c&_acct=C000059601&_version=1&_urlVersion=0&_userid=4428&md5=04e42a1dd31fcfbff7a21c0a98e.

15 Hurst LD, Merchant AR. High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc Biol Sci. 268(1466):493-7. Mar 7 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11296861&dopt=Abstract

16 Lercher M.J., Urrutia A.O., Pavlícek A. and Hurst L.D. A unification of mosaic structures in the human genome. Hum. Mol. Genetics, 12(19):2411-2415. 2003. http://hmg.oxfordjournals.org/cgi/content/abstract/12/19/2411.

17 Sémon M., Mouchiroud D. and Duret L. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genetics 14(3):421-427 2005. http://hmg.oxfordjournals.org/cgi/content/abstract/14/3/421.

18 Kudla G., Lipinski L., Caffin F., Helwak A. and Zylicz M. High Guanine and Cytosine Content Increases mRNA Levels in Mammalian Cells. PLoS Biol. 4(6):e180. May 23, 2006. http://biology.plosjournals.org/perlserv?request=get-document&doi=10.1371/journal.pbio.0040180.

19 Eyre-Walker A. Recombination and mammalian genome evolution. Proc Biol Sci. June 22 1993. 252(1335):237-43. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8394585&dopt=Abstract.

20 Brown, T. C., and J. Jiricny. Different base/base mispairs are corrected with different efficiencies and specificities in monkey kidney cells. Cell 54:705–711. 1988. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=pubmed.

21 Webster MT, Smith NG. Fixation biases affecting human SNPs. Trends Genet. 20(3):122-6. Mar. 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15049304&query_hl=20.

22 Myers S, Bottolo L, Freeman C, McVean G, Donnelly P. A fine-scale map of recombination rates and hotspots across the human genome. Science. 310(5746):321-4. Oct. 14 2005. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?itool=abstractplus&db=pubmed&cmd=Retrieve&dopt=abstractplus&list_uids=16224025.

23 Winckler W., Myers S.R., Richter D.J., Onofrio R.C., McDonald G.J., Bontrop R.E., McVean G.A.T., Gabriel S.B., Reich D., Donnelly P., Altshuler D. Comparison of Fine-Scale Recombination Rates in Humans and Chimpanzees. Science 308(5718):107-111. April 1, 2005. http://www.sciencemag.org/cgi/content/full/308/5718/107.

24 Pineda-Krch M. and Redfield R.J. Persistence and Loss of Meiotic Recombination Hotspots. Genetics, 169:2319-2333, April 2005. http://www.genetics.org/cgi/content/full/169/4/2319.

25 Meunier, J., and L. Duret. 2004. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21:984–990. 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=14963104&query_hl=6&itool=pubmed_docsum.

26 Birdsell J.A. Integrating Genomics, Bioinformatics, and Classical Genetics to Study the Effects of Recombination on Genome Evolution. Molecular Biology and Evolution 19:1181-1197. 2002. http://mbe.oxfordjournals.org/cgi/content/full/19/7/1181.

27 Nagylaki T. Evolution of a finite population under gene conversion. Proc. Natl. Acad. Sci. USA. 80(20):6278–6281. October 1983. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=394279&tools=bot.

28 Jeffreys A.J. and Neumann R. Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nature Genetics 31:267–271. 2002. http://www.nature.com/ng/journal/v31/n3/full/ng910.html.

29 Kent, W.J., Sugnet, C. W., Furey, T. S., Roskin, K.M., Pringle, T. H., Zahler, A. M., and Haussler, D. The human genome browser at UCSC. Genome Res. 12(6), 996-1006. 2002. http://www.genome.org/cgi/content/abstract/12/6/996.

30 Karolchik, D. and Kent, W.J. The UCSC Genome Browser. Current Protocols in Bioinformatics (ed. Baxevanis, A.D.) (John Wiley & Sons, Inc., 2002).

31 Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., Weber, R.J., Haussler, D. and Kent, W.J. The UCSC Genome Browser Database. Nucl. Acids Res. 31(1):51-54. 2003. http://nar.oxfordjournals.org/cgi/content/abstract/31/1/51.

32 The R Foundation for Statistical Computing. 2006. http://www.r-project.org/index.html.

33 Macaque Genome Sequencing Consortium led by the Baylor College of Medicine Human Genome Sequencing Center, in collaboration with the J. Craig Venter Institute Joint Technology Center, and the Genome Sequencing Center at Washington University School of Medicine, St. Louis. January 2005. http://www.hgsc.bcm.tmc.edu/projects/rmacaque/.

34 Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520-562. 2002. http://www.nature.com/nature/mousegenome/.

35 Rat Genome Sequencing Project Consortium. Genome Sequence of the Brown Norway Rat Yields Insights into Mammalian Evolution. Nature, 428:493–521, 2004. http://www.nature.com/nature/journal/v428/n6982/abs/nature02426_fs.html.

36 The International HapMap Consortium. A haplotype map of the human genome. Nature 437:1299-1320. 2005. http://www.hapmap.org/downloads/presentations/Nature_HapMap_phaseI.pdf.

37 Kong, A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., Shlien, A., Palsson, S.T., Frigge, M.L., Thorgeirsson, T.E., Gulcher, J.R., and Stefansson, K. A high-resolution recombination map of the human genome, Nature Genetics, 31(3):241-247. 2002. http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v31/n3/abs/ng917.html.

38 Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res. 32:D115-D119. 2004. http://www.expasy.uniprot.org/about/publications.shtml.

39 The NCBI handbook [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; Oct. 2002. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books.

40 Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J. and Wheeler D.L. GenBank. Nucleic Acids Research, Vol. 33, Database issue Oxford University Press 2005. http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D34.

41 Benson D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. GenBank: update. Nucleic Acids Res. 32, D23-6. 2004. http://nar.oxfordjournals.org/cgi/content/full/32/suppl_1/D23.

42 Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA 100(20):11484-11489. 2003. http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=77835828&c=chr2&g=chainSelf.

43 Smit A.F.A., Hubley R. and Green P. RepeatMasker Open-3.0. 1996-2004 http://www.repeatmasker.org/.

44 Siepel, A., Bejerano, G., Pedersen, J.S., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L.W., Richards, S., Weinstock, G.M., Wilson, R. K., Gibbs, R.A., Kent, W.J., Miller, W., and Haussler, D. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15:1034-1050. 2005. http://www.genome.org/cgi/content/full/15/8/1034.

45 Bussell J.J., Pearson N.M., Kanda R., Filatov D.A. and Lahn B.T. Human polymorphism and human–chimpanzee divergence in pseudoautosomal region correlate with local recombination rate. Gene. 368:94-100. March 1, 2006. http://www.sciencedirect.com/science/article/B6T39-4HSY4TN-2/2/d4fa698542e4dbc8268fe00b7c71d882.

46 Strathern J.N.; Shafer B.K.; McGill C.B. DNA synthesis errors associated with double-strand-break repair. Genetics. 140(3):965-972. 1995. http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6T1G-3P16SDR-2J5&_user=4428&_coverDate=01%2F01%2F1900&_fmt=summary&_orig=search&_qd=1&_cdi=4890&view=c&_acct=C000059601&_version=1&_urlVersion=0&_userid=4428&md5=08c335a83b53abae471c11f2c19.

47 Ijdo, J., Baldini, A., Ward, D.C., Reeders, S.T., and Wells, R.A. Origin of human chromosome 2: An ancestral telomere–telomere fusion. Proc. Natl. Acad. Sci. 88: 9051–9055. October 1991. http://www.pnas.org/cgi/content/abstract/88/20/9051.

48 Navarro A. and Barton N.H.. Chromosomal Speciation and Molecular Divergence--Accelerated Evolution in Rearranged Chromosomes. Science 300(5617):321-324. April 11, 2003. http://www.sciencemag.org/cgi/content/full/300/5617/321?ijkey=96440a7ada6aaf7f0e287d198953db7c1a2575a0.

49 Yu A., Zhao C., Fan Y., Jang W, Mungall A.J., Deloukas P., Olsen A., Doggett A., Ghebranious N., Broman K.W. and Weber J.L. Comparison of human genetic and sequence-based physical maps. Nature 409:951-953 February 15, 2001. http://www.nature.com/nature/journal/v409/n6822/full/409951a0.html.

50 Webster M.T., Smith N.G.C., Hultin-Rosenberg L., Arndt P.F. and Ellegren H. Male-Driven Biased Gene Conversion Governs the Evolution of Base Composition in Human Alu Repeats. Mol. Biol. and Evol. 22(6):1468-1474. 2005. http://mbe.oxfordjournals.org/cgi/content/abstract/22/6/1468.

51 Burge, C. Modeling Dependencies in Pre-mRNA Splicing Signals. In Salzberg, S., Searls, D., and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, 127-163. 1998. GenScan tool: http://genes.mit.edu/GENSCAN.html.

52 Webb J.C., Patel D.D., Jones M.D., Knight B.L., Soutar A.K. Characterization and tissue-specific expression of the human 'very low density lipoprotein (VLDL) receptor' mRNA. Hum. Mol. Genet. 3:531-537. 1994. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8069294&dopt=Abstract.

53 Chen T., Ajami K., McCaughan G.W., Gorrell M.D., Abbott C.A. Dipeptidyl peptidase IV gene family. The DPIV family. Adv. Exp. Med. Biol. 524:79-86(2003). http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12675227&dopt=Abstract.

54 Wei M.-H., Karavanova I., Ivanov S.V., Popescu N.C., Keck C.L., Pack S., Eisen J.A., Lerman M.I. In silico-initiated cloning and molecular characterization of a novel human member of the L1 gene family of neural cell adhesion molecules. Hum. Genet. 103:355-364. 1998. http://www.springerlink.com/content/10h0ucut05mm3acp/.

55 Matthes H, Boschert U, Amlaiky N, Grailhe R, Plassat JL, Muscatelli F, Mattei MG, Hen R. Mouse 5-hydroxytryptamine5A and 5-hydroxytryptamine5B receptors define a new family of serotonin receptors: cloning, functional expression, and chromosomal localization. Mol Pharmacol. 43(3):313-9. March 1993. http://molpharm.aspetjournals.org/cgi/reprint/43/3/313

56 Grailhe R., Grabtree G.W. and Hen R. Human 5-HT5 receptors: the 5-HT5A receptor is functional but the 5-HT5B receptor was lost during mammalian evolution. Eur J Pharmacol. 418(3):157-67. Apr. 27, 2001. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=Abstract&list_uids=11343685&query_hl=2&itool=pubmed_docsum

57 Birney E., Clamp M. and Durbin R. GeneWise and Genomewise. Genome Research 14:988-995. 2004. Tool: http://www.ebi.ac.uk/Wise2/index.html http://www.genome.org/cgi/content/abstract/14/5/988.

58 Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Bryant SH, CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33(Database Issue):D192-6. 2005. Tool:http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D192.

59 Gomez-Jeria JS, Morales-Lagos DR. Quantum chemical approach to the relationship between molecular structure and serotonin receptor binding affinity. J. Pharm Sci. 73(12):1725-8. Dec. 1984. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=6527244&dopt=Abstract

60 Fan Y., Linardopoulou E., Friedman C., Williams E. and Trask B.J. Genomic Structure and Evolution of the Ancestral Chromosome Fusion Site in 2q13–2q14.1 and Paralogous Regions on Other Human Chromosomes. Genome Res. 12(11):1651-1662. Nov. 2002. http://www.genome.org/cgi/content/abstract/12/11/1651

61 Kim E, Cho KO, Rothschild A, Sheng M. Heteromultimerization and NMDA receptor-clustering activity of Chapsyn-110, a member of the PSD-95 family of proteins. Neuron. 17(1):103-13. July 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8755482&dopt=Abstract.

62 Kim E., Cho K.O., Rothschild A. and Sheng M. Heteromultimerization and NMDA receptor-clustering activity of Chapsyn-110, a member of the PSD-95 family of proteins. Neuron. 17(1):103-13. July 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8755482&dopt=Abstract.

63 Player A., Gillespie J., Fujii T., Fukuoka J., Dracheva T., Meerzaman D., Hong K.M., Curran J., Attoh G., Travis W. and Jen J. Identification of TDE2 gene and its expression in non-small cell lung cancer. Int. J. Cancer. 107(2):238-43. Nov. 1, 2003.

64 Pruitt K.D., Tatusova, T., Maglott D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33(1):D501-D504 2005. http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D501?ijkey=06NkMzN3kUcez&keytype=ref.

65 Zhang M., Yu L., Wu Q., Zheng L.H., Wei Y.H., Wan B., Zhao S.Y. Identification and characterization of TDE2, a plasma-membrane protein with 11 transmembrane helices, and its variable expression in human lung cancer and liver cancer tissues. Submitted (JUL-2003) to the EMBL/GenBank/DDBJ databases.

66 Scholnick S.B., Richter T.M. The role of CSMD1 in head and neck carcinogenesis. Genes Chromosomes Cancer 38(3):281-283. 2003.

67 Belle E.M.S., Duret L., Galtier N. and Eyre-Walker A. The Decline of Isochores in Mammals: An Assessment of the GC Content Variation Along the Mammalian Phylogeny. J Mol Evol. 58(6):653-60. June 2004. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=15461422&dopt=Citation

68 Kumar S., Filipski A., Swarna V., Walker A., and Hedges S.B. Placing confidence limits on the molecular age of the human–chimpanzee divergence. Proc. Natl. Acad. Sci. USA. 102(52):18842-7. Dec 27, 2005. http://www.pnas.org/cgi/content/full/102/52/18842.

69 McVean G.T. & Hurst L.D. Evidence for a selectively favourable reduction in the mutation rate of the X chromosome. Nature 386:388-392 March 17, 1997. http://www.nature.com/nature/journal/v386/n6623/abs/386388a0.html;jsessionid=D8FE9B08A8DD0129F0257A7AB3C59055.

70 Goetting-Minesky MP, Makova KD. Mammalian Male Mutation Bias: Impacts of Generation Time and Regional Variation in Substitution Rates. J Mol Evol. [Epub ahead of print] 2006 Sep 4. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16955237.

71 Graves J.A.M, Wakefield M.J. and Toder R. The Origin and Evolution of the Pseudoautosomal Regions of Human Sex Chromosomes. Hum. Mol. Genet. 7(13):1991-6. Dec. 1998. http://hmg.oxfordjournals.org/cgi/content/full/7/13/1991.

72 Aitken, R.J. and Graves, J.A.M. 2002. Human spermatozoa: The future of sex. Nature 415:963. February 28, 2002. http://www.nature.com/nature/journal/v415/n6875/full/415963a.html.

73 Hughes J.F., Skaletsky H., Pyntikova T., Minx P.J., Graves T., Rozen S., Wilson R.K. and Page D. C. Conservation of Y-linked genes during human evolution revealed by comparative sequencing in chimpanzee. Nature 437:100-103 September 1, 2005. http://www.nature.com/nature/journal/v437/n7055/full/nature04101.html.

74 Kuroki Y., Toyoda A., Noguchi H., Taylor T.D., Itoh T., Kim D., Kim D., Choi S., Kim I., Choi H.H., Kim Y.S., Satta Y., Saitou N., Yamada T., Morishita S., Hattori M., Sakaki Y., Park H. and Fujiyama A. Comparative analysis of chimpanzee and human Y chromosomes unveils complex evolutionary pathway. Nature Genetics. 38:158-167 2006. http://www.nature.com/ng/journal/v38/n2/full/ng1729.html.

75 Alberts B., Bray D., Lewis J., Raff M., Roberts K. and Watson J.D. Molecular Biology of the Cell. Garland Publishing, inc. New York and London. 1983. p106.

76 Li WH, Yi S, Makova K. Male-driven evolution. Curr. Opin. Genet. Dev. 12(6):650-6. Dec. 2002.

77 Crow JF. How much do we know about spontaneous human mutation rates? Environ. Mol. Mutagen. 21(4):389. 1993. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=8444142&dopt=Abstract.

78 Trivers R. Social Evolution. Benjamin Cummings Publishing Company, Inc. Menlo Park, California. 1985.

79 Lercher M.J., Williams E.J., Hurst L.D. Local similarity in evolutionary rates extends over whole chromosomes in human-rodent and mouse-rat comparisons: implications for understanding the mechanistic basis of the male mutation bias. Mol Biol Evol. 18(11):2032-9. Nov. 2001. http://mbe.oxfordjournals.org/cgi/content/full/18/11/2032.

80 Filatov D.A., Charlesworth D. Substitution rates in the X- and Y-linked genes of the plants, Silene latifolia and S. dioica. Mol. Biol. Evol. 19(6):898-907. June 2002. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=12032246.

81 Gaffney D.J., Keightley P.D. The scale of mutational variation in the murid genome. Genome Res. 15(8):1086-94. Epub 2005 Jul 15. Aug. 2005. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16024822.

82 Møller A.P. and Cuervo J.J. Sexual selection, germline mutation rate and sperm competition. BMC Evol. Biol. 3:6. April 2003. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=156621.

83 Holt W.V., Van Look K.J. Concepts in sperm heterogeneity, sperm selection and sperm competition as biological foundations for laboratory tests of semen quality. Reproduction. 127(5):527-35. May 2004. http://www.reproduction-online.org/cgi/content/full/127/5/527.

84 Prado F., Piruat J.I. and Aguilera A. Recombination between DNA repeats in yeast hpr1Delta cells is linked to transcription elongation. The EMBO Journal 16:2826–2835. 1997. http://www.nature.com/emboj/journal/v16/n10/abs/7590265a.html.

85 Aguilera A., The connection between transcription and genomic instability. EMBO J. 21:195–201 (2002). http://www.nature.com/emboj/journal/v21/n3/full/7594240a.html.

86 Mieczkowski P.A., Dominska M., Buck M.J., Gerton J.L., Lieb J.D. and Petes T.D. Global analysis of the relationship between the binding of the Bas1p transcription factor and meiosis-specific double-strand DNA breaks in Saccharomyces cerevisiae. Mol. Cell Biol. 26(3):1014-27. Feb. 2006. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=16428454&query_hl=26&itool=pubmed_docsum.

87 Vinogradov A.E. Isochores and tissue-specificity. Nucleic Acids Res. 31(17):5212–5220. Sept. 1, 2003. http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=12930973.

88 Takahata N. Neutral theory of molecular evolution. Curr. Opin. Genet. Dev. 6(6):767-72. Dec. 1996. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=8994850&query_hl=30&itool=pubmed_DocSum.
